mirrors/jgit - jgit - source @ dussan.org

Commit Graph

Author	SHA1	Message	Date
Kevin Sawicki	17fb542e9e	Remove 86 boxing warnings Use Integer, Character, and Long valueOf methods when passing parameters to MessageFormat and other places that expect objects instead of primitives Change-Id: I5942fbdbca6a378136c00d951ce61167f2366ca4	12 years ago
Shawn O. Pearce	6c0d300a54	Fix loading packed objects >2G Parsing the size from a packed object header was incorrectly computing the total inflated length when the length exceeded the range of a Java int. The next 7 bits of size information was shifted left as an int using a shift of 25 bits, placing the higher bits of the size into the sign position. When this size was extended to a long to be added to the current size accumulator the size went negative, resulting in NegativeArraySizeException being thrown. Fix all places where this particular pattern of code is used to read a pack size field, or a binary delta header, as they both use the same variable length encoding scheme. Change-Id: I04008728ed828f18202652c3d5401cf95a441d0a	12 years ago
Robin Rosenberg	95d311f888	Move JGitText to an internal package Change-Id: I763590a45d75f00a09097ab6f89581a3bbd3c797	12 years ago
Sasa Zivkov	1fbe688f51	maxObjectSizeLimit for receive-pack. ReceivePack (and PackParser) can be configured with the maxObjectSizeLimit in order to prevent users from pushing too large objects to Git. The limit check is applied to all object types although it is most likely that a BLOB will exceed the limit. In all cases the size of the object header is excluded from the object size which is checked against the limit as this is the size of which a BLOB object would take in the working tree when checked out as a file. When an object exceeds the maxObjectSizeLimit the receive-pack will abort immediately. Delta objects (both offset and ref delta) are also checked against the limit. However, for delta objects we will first check the size of the inflated delta block against the maxObjectSizeLimit and abort immediately if it exceeds the limit. In this case we even do not know the exact size of the resolved delta object but we assume it will be larger than the given maxObjectSizeLimit as delta is generally only chosen if the delta can copy more data from the base object than the delta needs to insert or needs to represent the copy ranges. Aborting early, in this case, avoids unnecessary inflating of the (huge) delta block. Unfortunately, it is too expensive (especially for a large delta) to compute SHA-1 of an object that causes the receive-pack to abort. This would decrease the value of this feature whose main purpose is to protect server resources from users pushing huge objects. Therefore we don't report the SHA-1 in the error message. Change-Id: I177ef24553faacda444ed5895e40ac8925ca0d1e Signed-off-by: Sasa Zivkov <sasa.zivkov@sap.com> Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>	12 years ago
Shawn O. Pearce	c81f6ab3ab	IndexPack: Defer the "Resolving deltas" progress meter If delta resolution completes in < 1000 milliseconds, don't bother showing the progress meter. This is actually very common for a Gerrit Code Review server, where the client is probably sending 1 commit and only a few trees/blobs modified... and the base objects are hot in the process buffer cache. The 1000 millisecond delay is just a guess at a reasonable time to wait. Change-Id: I440baa64ab0dfa21be61deae8dcd3ca061bed8ce Signed-off-by: Shawn O. Pearce <spearce@spearce.org>	13 years ago
Shawn O. Pearce	e0111b18c8	IndexPack: Fix "Resolving deltas" progress meter This progress meter never reached 100% as it did not update while resolving the external bases in thin packs. Instead of updating in batches at the top level, update once per delta that is resolved. The batching progress meter type should smooth out the frequent updates to an update rate that is more reasonable to send to the UI, while also ensuring a successful pack parse always reaches 100% deltas resolved. Change-Id: Ic77dcac542cfa97213a6b0194708f9d3c256d223 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>	13 years ago
Shawn O. Pearce	bfa62d88d4	Don't close ObjectDatabase after parsing pack The cached object databases should not require a close to release their cached resources. Most object databases just return their own reference for newCachedDatabase(), so a close() here kills the real database's internal caches, and possibly underlying files, resulting in poor performance for the callers of PackParser like ReceivePack or FetchProcess trying to then go look up objects that were just parsed, or that current references point to. Change-Id: Ia4a239093866e5b9faf82744f729fb73f4373f1a Signed-off-by: Shawn O. Pearce <spearce@spearce.org>	13 years ago
Shawn O. Pearce	67a1a0993f	Ensure the HTTP request is fully consumed Some servlet containers require the servlet to read the EOF marker from the input stream before a response can be output if the stream is using "Transfer-Encoding: chunked"... which is typical for any sort of large push to a repository over smart HTTP. Ensure the EOF is always read by the PackParser when it is handling the stream, and fail fast if there is more data present than expected since this does indicate a protocol error. Also ensure the EOF is read by UploadPack before it starts to output a partial response using packing progress meters. Change-Id: I131db9dea20b2324cb7c3272a814f21296bc64bd Signed-off-by: Shawn O. Pearce <spearce@spearce.org>	13 years ago
Shawn O. Pearce	de8946c0c2	Store Git on any DHT jgit.storage.dht is a storage provider implementation for JGit that permits storing the Git repository in a distributed hashtable, NoSQL system, or other database. The actual underlying storage system is undefined, and can be plugged in by implementing 7 small interfaces: * Database * RepositoryIndexTable * RepositoryTable * RefTable * ChunkTable * ObjectIndexTable * WriteBuffer The storage provider interface tries to assume very little about the underlying storage system, and requires only three key features: * key -> value lookup (a hashtable is suitable) * atomic updates on single rows * asynchronous operations (Java's ExecutorService is easy to use) Most NoSQL database products offer all 3 of these features in their clients, and so does any decent network based cache system like the open source memcache product. Relying only on key equality for data retrevial makes it simple for the storage engine to distribute across multiple machines. Traditional SQL systems could also be used with a JDBC based spi implementation. Before submitting this change I have implemented six storage systems for the spi layer: * Apache HBase[1] * Apache Cassandra[2] * Google Bigtable[3] * an in-memory implementation for unit testing * a JDBC implementation for SQL * a generic cache provider that can ride on top of memcache All six systems came in with an spi layer around 1000 lines of code to implement the above 7 interfaces. This is a huge reduction in size compared to prior attempts to implement a new JGit storage layer. As this package shows, a complete JGit storage implementation is more than 17,000 lines of fairly complex code. A simple cache is provided in storage.dht.spi.cache. Implementers can use CacheDatabase to wrap any other type of Database and perform fast reads against a network based cache service, such as the open source memcached[4]. An implementation of CacheService must be provided to glue this spi onto the network cache. [1] https://github.com/spearce/jgit_hbase [2] https://github.com/spearce/jgit_cassandra [3] http://labs.google.com/papers/bigtable.html [4] http://memcached.org/ Change-Id: I0aa4072781f5ccc019ca421c036adff2c40c4295 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>	13 years ago
Shawn O. Pearce	bd970007be	ObjectIdOwnerMap: More lightweight map for ObjectIds OwnerMap is about 200 ms faster than SubclassMap, more friendly to the GC, and uses less storage: testing the "Counting objects" part of PackWriter on `1886362` objects: ObjectIdSubclassMap: load factor 50% table: `4194304` (wasted `2307942`) ms spent 36998 36009 34795 34703 34941 35070 34284 34511 34638 34256 ms avg 34800 (last 9 runs) ObjectIdOwnerMap: load factor 100% table: `2097152` (wasted 210790) directory: 1024 ms spent 36842 35112 34922 34703 34580 34782 34165 34662 34314 34140 ms avg 34597 (last 9 runs) The major difference with OwnerMap is entries must extend from ObjectIdOwnerMap.Entry, where the OwnerMap has injected its own private "next" field into each object. This allows the OwnerMap to use a singly linked list for chaining collisions within a bucket. By putting collisions in a linked list, we gain the entire table back for the SHA-1 bits to index their own "private" slot. Unfortunately this means that each object can appear in at most ONE OwnerMap, as there is only one "next" field within the object instance to thread into the map. For types that are very object map heavy like RevWalk (entity RevObject) and PackWriter (entity ObjectToPack) this is sufficient, these entity types are only put into one map by their container. By introducing a new map type, we don't break existing applications that might be trying to use ObjectIdSubclassMap to track RevCommits they obtained from a RevWalk. The OwnerMap uses less memory. Each object uses 1 reference more (so we're up 1,886,362 references), but the table is 1/2 the size (2^20 rather than 2^21). The table itself wastes only 210,790 slots, rather than 2,307,942. So OwnerMap is wasting 200k fewer references. OwnerMap is more friendly to the GC, because it hardly ever generates garbage. As the map reaches its 100% load factor target, it doubles in size by allocating additional segment arrays of 2048 entries. (So the first grow allocates 1 segment, second 2 segments, third 4 segments, etc.) These segments are hooked into the pre-allocated directory of 1024 spaces. This permits the map to grow to 2 million objects before the directory itself has to grow. By using segments of 2048 entries, we are asking the GC to acquire 8,204 bytes in a 32 bit JVM. This is easier to satisfy then 2,307,942 bytes (for the 512k table that is just an intermediate step in the SubclassMap). By reusing the previously allocated segments (they are re-hashed in-place) we don't release any memory during a table grow. When the directory grows, it does so by discarding the old one and using one that is 4x larger (so the directory goes to 4096 entries on its first grow). A directory of size 4096 can handle up to 8 millon objects. The second directory grow (16384) goes to 33 million objects. At that point we're starting to really push the limits of the JVM heap, but at least its many small arrays. Previously SubclassMap would need a table of `67108864` entries to handle that object count, which needs a single contiguous allocation of 256 MiB. That's hard to come by in a 32 bit JVM. Instead OwnerMap uses 8192 arrays of about 8 KiB each. This is much easier to fit into a fragmented heap. Change-Id: Ia4acf5cfbf7e9b71bc7faa0db9060f6a969c0c50 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>	13 years ago
Shawn O. Pearce	f67e5602af	PackWriter: Reduce GC during enumeration Instead of resizing an ArrayList until all objects have been added, append objects into a specialized List type that uses small arrays of 1024 entries for each 1024 objects added. For a large repository like linux-2.6, PackWriter will now allocate 1,758 smaller arrays to hold the object list, without creating any garbage from the intermediate states due to list expansion. 1024 was chosen as the block size (and initial directory size) as this is a reasonable balance for the PackWriter code. Each block uses approximately 4096 bytes in a 32 bit JVM, as does the default top level block directory. The top level directory doesn't expand until 1 million items have been added to the list, which for linux-2.6 won't yet occur as the lists are per-object-type and are thus bounded to about 1/3 of 1.8 million. Change-Id: If9e4092eb502394c5d3d044b58cf49952772f6d6 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>	13 years ago
Shawn O. Pearce	168114fd39	Show resolving deltas progress to push clients CGit push clients 1.6.6 and later support progress messages on the side-band-64k channel during push, as this was introduced to handle server side hook errors reported over smart HTTP. Since JGit's delta resolution isn't always as fast as CGit's is, a user may think the server has crashed and failed to report status if the user pushed a lot of content and sees no feedback. Exposing the progress monitor during the resolving deltas phase will let the user know the server is still making forward progress. This also helps BasePackPushConnection, which has a bounded timeout on how long it will wait before assuming the remote server is dead. Progress messages pushed down the side-band channel will reset the read timer, helping the connection to stay alive and avoid timing out before the remote side's work is complete. Change-Id: I429c825e5a724d2f21c66f95526d9c49edcc6ca9 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>	13 years ago
Shawn O. Pearce	36e396f8b9	Permit disabling birthday attack checks in PackParser Reading a repository for millions of missing objects might be very expensive to perform, especially if the repository is on a network filesystem or some other costly RPC backend. A repository owner might choose to accept some risk in return for better performance, so allow disabling collision checking when receiving a pack. Currently there is no way for an end-user to disable this feature. This is intentional, because it is generally NOT a good idea to skip this check. Instead this feature is supplied for storage implementations to bypass the default checking logic, should they have their own custom routines that is just as effective but can be handled more efficiently. Change-Id: I90c801bb40e86412209de0c43e294a28f6a767a5 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>	13 years ago
Shawn O. Pearce	d62350d907	Ensure all deltas were resolved in a pack If a pack uses OFS_DELTA only (e.g. its an initial push to a repository) and PackParser's implementation is broken such that the delta chain that hangs below a particular object offset is empty, the entryCount won't match the expected objectCount. Fail fast rather than claiming the stream was parsed correctly. The current implementation is not broken as described above. I broke the code when I implemented my own new subclass of PackParser (which incorrectly mucked with the object offset information), leading me to discover this consistency check was missing. Change-Id: I07540f0ae1144ef6f3bda48774dbdefb8876e1d3 Signed-off-by: Shawn O. Pearce <spearce@spearce.org> Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>	13 years ago
Shawn O. Pearce	1bf0c3cdb1	Refactor IndexPack to not require local filesystem By moving the logic that parses a pack stream from the network (or a bundle) into a type that can be constructed by an ObjectInserter, repository implementations have a chance to inject their own logic for storing object data received into the destination repository. The API isn't completely generic yet, there are still quite a few assumptions that the PackParser subclass is storing the data onto the local filesystem as a single file. But its about the simplest split of IndexPack I can come up with without completely ripping the code apart. Change-Id: I5b167c9cc6d7a7c56d0197c62c0fd0036a83ec6c Signed-off-by: Shawn O. Pearce <spearce@spearce.org> Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>	13 years ago

15 Commits (4e5944fbf8579fddc8b81ae11484b392cc6fc527)