Use Integer, Character, and Long valueOf methods when
passing parameters to MessageFormat and other places
that expect objects instead of primitives
Change-Id: I5942fbdbca6a378136c00d951ce61167f2366ca4
Parsing the size from a packed object header was incorrectly computing
the total inflated length when the length exceeded the range of a Java
int. The next 7 bits of size information was shifted left as an int
using a shift of 25 bits, placing the higher bits of the size into the
sign position. When this size was extended to a long to be added to
the current size accumulator the size went negative, resulting in
NegativeArraySizeException being thrown.
Fix all places where this particular pattern of code is used to read a
pack size field, or a binary delta header, as they both use the same
variable length encoding scheme.
Change-Id: I04008728ed828f18202652c3d5401cf95a441d0a
ReceivePack (and PackParser) can be configured with the
maxObjectSizeLimit in order to prevent users from pushing too large
objects to Git. The limit check is applied to all object types
although it is most likely that a BLOB will exceed the limit. In all
cases the size of the object header is excluded from the object size
which is checked against the limit as this is the size of which a BLOB
object would take in the working tree when checked out as a file.
When an object exceeds the maxObjectSizeLimit the receive-pack will
abort immediately.
Delta objects (both offset and ref delta) are also checked against the
limit. However, for delta objects we will first check the size of the
inflated delta block against the maxObjectSizeLimit and abort
immediately if it exceeds the limit. In this case we even do not know
the exact size of the resolved delta object but we assume it will be
larger than the given maxObjectSizeLimit as delta is generally only
chosen if the delta can copy more data from the base object than the
delta needs to insert or needs to represent the copy ranges. Aborting
early, in this case, avoids unnecessary inflating of the (huge) delta
block.
Unfortunately, it is too expensive (especially for a large delta) to
compute SHA-1 of an object that causes the receive-pack to abort.
This would decrease the value of this feature whose main purpose is to
protect server resources from users pushing huge objects. Therefore
we don't report the SHA-1 in the error message.
Change-Id: I177ef24553faacda444ed5895e40ac8925ca0d1e
Signed-off-by: Sasa Zivkov <sasa.zivkov@sap.com>
Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
IndexPack: Defer the "Resolving deltas" progress meter
If delta resolution completes in < 1000 milliseconds, don't bother
showing the progress meter. This is actually very common for a Gerrit
Code Review server, where the client is probably sending 1 commit and
only a few trees/blobs modified... and the base objects are hot in the
process buffer cache.
The 1000 millisecond delay is just a guess at a reasonable time to wait.
Change-Id: I440baa64ab0dfa21be61deae8dcd3ca061bed8ce
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
This progress meter never reached 100% as it did not update while
resolving the external bases in thin packs.
Instead of updating in batches at the top level, update once per delta
that is resolved. The batching progress meter type should smooth out
the frequent updates to an update rate that is more reasonable to send
to the UI, while also ensuring a successful pack parse always reaches
100% deltas resolved.
Change-Id: Ic77dcac542cfa97213a6b0194708f9d3c256d223
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
The cached object databases should not require a close to release
their cached resources. Most object databases just return their
own reference for newCachedDatabase(), so a close() here kills
the real database's internal caches, and possibly underlying files,
resulting in poor performance for the callers of PackParser like
ReceivePack or FetchProcess trying to then go look up objects that
were just parsed, or that current references point to.
Change-Id: Ia4a239093866e5b9faf82744f729fb73f4373f1a
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Some servlet containers require the servlet to read the EOF marker
from the input stream before a response can be output if the stream
is using "Transfer-Encoding: chunked"... which is typical for any
sort of large push to a repository over smart HTTP.
Ensure the EOF is always read by the PackParser when it is handling
the stream, and fail fast if there is more data present than expected
since this does indicate a protocol error.
Also ensure the EOF is read by UploadPack before it starts to output
a partial response using packing progress meters.
Change-Id: I131db9dea20b2324cb7c3272a814f21296bc64bd
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
jgit.storage.dht is a storage provider implementation for JGit that
permits storing the Git repository in a distributed hashtable, NoSQL
system, or other database. The actual underlying storage system is
undefined, and can be plugged in by implementing 7 small interfaces:
* Database
* RepositoryIndexTable
* RepositoryTable
* RefTable
* ChunkTable
* ObjectIndexTable
* WriteBuffer
The storage provider interface tries to assume very little about the
underlying storage system, and requires only three key features:
* key -> value lookup (a hashtable is suitable)
* atomic updates on single rows
* asynchronous operations (Java's ExecutorService is easy to use)
Most NoSQL database products offer all 3 of these features in their
clients, and so does any decent network based cache system like the
open source memcache product. Relying only on key equality for data
retrevial makes it simple for the storage engine to distribute across
multiple machines. Traditional SQL systems could also be used with a
JDBC based spi implementation.
Before submitting this change I have implemented six storage systems
for the spi layer:
* Apache HBase[1]
* Apache Cassandra[2]
* Google Bigtable[3]
* an in-memory implementation for unit testing
* a JDBC implementation for SQL
* a generic cache provider that can ride on top of memcache
All six systems came in with an spi layer around 1000 lines of code to
implement the above 7 interfaces. This is a huge reduction in size
compared to prior attempts to implement a new JGit storage layer. As
this package shows, a complete JGit storage implementation is more
than 17,000 lines of fairly complex code.
A simple cache is provided in storage.dht.spi.cache. Implementers can
use CacheDatabase to wrap any other type of Database and perform fast
reads against a network based cache service, such as the open source
memcached[4]. An implementation of CacheService must be provided to
glue this spi onto the network cache.
[1] https://github.com/spearce/jgit_hbase
[2] https://github.com/spearce/jgit_cassandra
[3] http://labs.google.com/papers/bigtable.html
[4] http://memcached.org/
Change-Id: I0aa4072781f5ccc019ca421c036adff2c40c4295
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
ObjectIdOwnerMap: More lightweight map for ObjectIds
OwnerMap is about 200 ms faster than SubclassMap, more friendly to the
GC, and uses less storage: testing the "Counting objects" part of
PackWriter on 1886362 objects:
ObjectIdSubclassMap:
load factor 50%
table: 4194304 (wasted 2307942)
ms spent 36998 36009 34795 34703 34941 35070 34284 34511 34638 34256
ms avg 34800 (last 9 runs)
ObjectIdOwnerMap:
load factor 100%
table: 2097152 (wasted 210790)
directory: 1024
ms spent 36842 35112 34922 34703 34580 34782 34165 34662 34314 34140
ms avg 34597 (last 9 runs)
The major difference with OwnerMap is entries must extend from
ObjectIdOwnerMap.Entry, where the OwnerMap has injected its own
private "next" field into each object. This allows the OwnerMap to use
a singly linked list for chaining collisions within a bucket. By
putting collisions in a linked list, we gain the entire table back for
the SHA-1 bits to index their own "private" slot.
Unfortunately this means that each object can appear in at most ONE
OwnerMap, as there is only one "next" field within the object instance
to thread into the map. For types that are very object map heavy like
RevWalk (entity RevObject) and PackWriter (entity ObjectToPack) this
is sufficient, these entity types are only put into one map by their
container. By introducing a new map type, we don't break existing
applications that might be trying to use ObjectIdSubclassMap to track
RevCommits they obtained from a RevWalk.
The OwnerMap uses less memory. Each object uses 1 reference more (so
we're up 1,886,362 references), but the table is 1/2 the size (2^20
rather than 2^21). The table itself wastes only 210,790 slots, rather
than 2,307,942. So OwnerMap is wasting 200k fewer references.
OwnerMap is more friendly to the GC, because it hardly ever generates
garbage. As the map reaches its 100% load factor target, it doubles in
size by allocating additional segment arrays of 2048 entries. (So the
first grow allocates 1 segment, second 2 segments, third 4 segments,
etc.) These segments are hooked into the pre-allocated directory of
1024 spaces. This permits the map to grow to 2 million objects before
the directory itself has to grow. By using segments of 2048 entries,
we are asking the GC to acquire 8,204 bytes in a 32 bit JVM. This is
easier to satisfy then 2,307,942 bytes (for the 512k table that is
just an intermediate step in the SubclassMap). By reusing the
previously allocated segments (they are re-hashed in-place) we don't
release any memory during a table grow.
When the directory grows, it does so by discarding the old one and
using one that is 4x larger (so the directory goes to 4096 entries on
its first grow). A directory of size 4096 can handle up to 8 millon
objects. The second directory grow (16384) goes to 33 million objects.
At that point we're starting to really push the limits of the JVM
heap, but at least its many small arrays. Previously SubclassMap would
need a table of 67108864 entries to handle that object count, which
needs a single contiguous allocation of 256 MiB. That's hard to come
by in a 32 bit JVM. Instead OwnerMap uses 8192 arrays of about 8 KiB
each. This is much easier to fit into a fragmented heap.
Change-Id: Ia4acf5cfbf7e9b71bc7faa0db9060f6a969c0c50
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Instead of resizing an ArrayList until all objects have been added,
append objects into a specialized List type that uses small arrays
of 1024 entries for each 1024 objects added.
For a large repository like linux-2.6, PackWriter will now allocate
1,758 smaller arrays to hold the object list, without creating any
garbage from the intermediate states due to list expansion.
1024 was chosen as the block size (and initial directory size) as this
is a reasonable balance for the PackWriter code. Each block uses
approximately 4096 bytes in a 32 bit JVM, as does the default top
level block directory. The top level directory doesn't expand until 1
million items have been added to the list, which for linux-2.6 won't
yet occur as the lists are per-object-type and are thus bounded to
about 1/3 of 1.8 million.
Change-Id: If9e4092eb502394c5d3d044b58cf49952772f6d6
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
CGit push clients 1.6.6 and later support progress messages on the
side-band-64k channel during push, as this was introduced to handle
server side hook errors reported over smart HTTP.
Since JGit's delta resolution isn't always as fast as CGit's is,
a user may think the server has crashed and failed to report
status if the user pushed a lot of content and sees no feedback.
Exposing the progress monitor during the resolving deltas phase
will let the user know the server is still making forward progress.
This also helps BasePackPushConnection, which has a bounded timeout
on how long it will wait before assuming the remote server is dead.
Progress messages pushed down the side-band channel will reset the
read timer, helping the connection to stay alive and avoid timing
out before the remote side's work is complete.
Change-Id: I429c825e5a724d2f21c66f95526d9c49edcc6ca9
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Permit disabling birthday attack checks in PackParser
Reading a repository for millions of missing objects might be very
expensive to perform, especially if the repository is on a network
filesystem or some other costly RPC backend. A repository owner
might choose to accept some risk in return for better performance,
so allow disabling collision checking when receiving a pack.
Currently there is no way for an end-user to disable this feature.
This is intentional, because it is generally *NOT* a good idea to
skip this check. Instead this feature is supplied for storage
implementations to bypass the default checking logic, should they
have their own custom routines that is just as effective but can
be handled more efficiently.
Change-Id: I90c801bb40e86412209de0c43e294a28f6a767a5
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
If a pack uses OFS_DELTA only (e.g. its an initial push to a
repository) and PackParser's implementation is broken such that the
delta chain that hangs below a particular object offset is empty, the
entryCount won't match the expected objectCount. Fail fast rather
than claiming the stream was parsed correctly.
The current implementation is not broken as described above. I broke
the code when I implemented my own new subclass of PackParser (which
incorrectly mucked with the object offset information), leading me to
discover this consistency check was missing.
Change-Id: I07540f0ae1144ef6f3bda48774dbdefb8876e1d3
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>
Refactor IndexPack to not require local filesystem
By moving the logic that parses a pack stream from the network (or
a bundle) into a type that can be constructed by an ObjectInserter,
repository implementations have a chance to inject their own logic
for storing object data received into the destination repository.
The API isn't completely generic yet, there are still quite a few
assumptions that the PackParser subclass is storing the data onto
the local filesystem as a single file. But its about the simplest
split of IndexPack I can come up with without completely ripping
the code apart.
Change-Id: I5b167c9cc6d7a7c56d0197c62c0fd0036a83ec6c
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>