mirrors/jgit - jgit - source @ dussan.org

Commit Graph

Author	SHA1	Message	Date
Robin Rosenberg	32ff57a2b2	Cleanup javadocs so they pass the java8 doclint checks Bug: 431552 Change-Id: I469316f5645205016e1fa6b0fbd2ff3b509b14bc Signed-off-by: Robin Stocker <robin@nibor.org>	10 years ago
Shawn Pearce	5d8a9f6f3f	Rescale "Compressing objects" progress meter by size Instead of counting objects processed, count number of bytes added into the window. This should rescale the progress meter so that 30% complete means 30% of the total uncompressed content size has been inflated and fed into the window. In theory the progress meter should be more accurate about its percentage complete/remaining fraction than with objects. When counting objects small objects move the progress meter more rapidly than large objects, but demand a smaller amount of work than large objects being compressed. Change-Id: Id2848c16a2148b5ca51e0ca1e29c5be97eefeb48	11 years ago
Shawn Pearce	21e4aa2b9e	Split delta search buckets by byte weight Instead of assuming all objects cost the same amount of time to delta compress, aggregate the byte size of objects in the list and partition threads with roughly equal total bytes. Before splitting the list select the N largest paths and assign each one to its own thread. This allows threads to get through the worst cases in parallel before attempting smaller paths that are more likely to be splittable. By running the largest path buckets first on each thread the likely slowest part of compression is done early, while progress is still reporting a low percentage. This gives users a better impression of how fast the phase will run. On very complex inputs the slow part is more likely to happen first, making a user realize it's time to go grab lunch, or even run it overnight. If the worst sections are earlier, memory overruns may show up earlier, giving the user a chance to correct the configuration and try again before wasting large amounts of time. It also makes it less likely the delta compression phase reaches 92% in 30 minutes and then crawls for 10 hours through the remaining 8%. Change-Id: I7621c4349b99e40098825c4966b8411079992e5f	11 years ago
Shawn Pearce	c9707e6353	Always attempt delta compression when reuseDeltas is false If reuseObjects=true but reuseDeltas=false the caller wants to attempt a delta for every object in the input list. Test for reuseDeltas to ensure every object passes through the searchInWindow() method. If no delta is possible for an object and it will be stored whole (non-delta format), PackWriter may still reuse its content from any source pack. This avoids an inflate()-deflate() cycle to recompress the object contents. Change-Id: I845caeded419ef4551ef1c85787dd5ffd73235d9	11 years ago
Shawn Pearce	eb17495ca4	Disable CRC32 computation when no PackIndex will be created If a server is streaming 3GiB worth of pack data to a client there is no reason to compute the CRC32 checksum on the objects. The CRC32 code computed by PackWriter is used only in the new index created by writeIndex(), which is never invoked for the native Git network protocols. Object reuse may still compute its own CRC32 to verify the data being copied from an existing pack has not been corrupted. This check is done by the ObjectReader that implements ObjectReuseAsIs and has no relationship to the CRC32 being skipped during output. Change-Id: I05626f2e0d6ce19119b57d8a27193922636d60a7	11 years ago
Shawn Pearce	d0a5337625	Steal work from delta threads to rebalance CPU load If the configuration wants to run 4 threads the delta search work is initially split somewhat evenly across the 4 threads. During execution some threads will finish early due to the work not being split fairly, as the initial partitions were based on object count and not cost to inflate or size of DeltaIndex. When a thread finishes early it now tries to take 50% of the work remaining on a sibling thread, and executes that before exiting. This repeats as each thread completes until a thread has only 1 object remaining. Repacking Blink, Chromium's new fork of WebKit (2.2M objects 3.9G): [pack] reuseDeltas = false reuseObjects = false depth = 50 threads = 8 window = 250 windowMemory = 800m before: ~105% CPU after 80% after: >780% CPU to 100% Change-Id: I65e45422edd96778aba4b6e5a0fd489ea48e8ca3	11 years ago
Shawn Pearce	5d446f410d	Support cutting existing delta chains longer than the max depth Some packs built by JGit have incredibly long delta chains due to a long standing bug in PackWriter. Google has packs created by JGit's DfsGarbageCollector with chains of 6000 objects long, or more. Inflating objects at the end of this 6000 long chain is impossible to complete within a reasonable time bound. It could take a beefy system hours to perform even using the heavily optimized native C implementation of Git, let alone with JGit. Enable pack.cutDeltaChains to be set in a configuration file to permit the PackWriter to determine the length of each delta chain and clip the chain at arbitrary points to fit within pack.depth. Delta chain cycles are still possible, but no attempt is made to detect them. A trivial chain of A->B->A will iterate for the full pack.depth configured limit (e.g. 50) and then pick an object to store as non-delta. When cutting chains the object list is walked in reverse to try and take advantage of existing chain computations. The assumption here is most deltas are near the end of the list, and their bases are near the front of the list. Going up from the tail attempts to reuse chainLength computations by relying on the memoized value in the delta base. The chainLength field in ObjectToPack is overloaded into the depth field normally used by DeltaWindow. This is acceptable because the chain cut happens before delta search, and the chainLength is reset to 0 if delta search will follow. Change-Id: Ida4fde9558f3abbbb77ade398d2af3941de9c812	11 years ago
Shawn Pearce	01a0699acc	Micro-optimize reuseDeltaFor in PackWriter This switch is called mostly for OBJ_TREE and OBJ_BLOB types, which typically make up 66% of the objects in a repository. Simplify the test for these common types by testing for the one bit they have in common and returning early. Object type 5 is currently undefined. In the old code it would hit the default and return true. In the new code it will match the early case and also return true. In either implementation 5 should never show up as it is not a valid type known to Git. Object type 6 OFS_DELTA is not permitted to be supplied here. Object type 7 REF_DELTA is not permitted to be supplied here. Change-Id: I0ede8acee928bb3e73c744450863942064864e9c	11 years ago
Shawn Pearce	8e83c36e27	Static import OBJ_* constants into PackWriter Shortens most of the code that touches the objectLists. Change-Id: Ib14d366dd311e544e7ba50e9ce07a6f3ce0cf254	11 years ago
Shawn Pearce	93a27ce728	Simplify size test in PackWriter Clip the configured limit to Integer.MAX_VALUE at the top of the loop, saving a compare branch per object considered. This can cut 2M branches out of a repacking of the Linux kernel. Rewrite the logic so the primary path is to match the conditional; most objects are larger than BLKSZ (16 bytes) and less than limit. This may help branch prediction on CPUs if the CPU tries to assume execution takes the side of the branch and not the second. Change-Id: I5133d1651640939afe9fbcfd8cfdb59965c57d5a	11 years ago
Shawn Pearce	594d4ceb12	Simplify setDoNotDelta() to always set the flag This method is only invoked with true as the argument. Remove the unnecessary parameter and branch, making the code easier for the JIT to optimize. Change-Id: I68a9cd82f197b7d00a524ea3354260a0828083c6	11 years ago
Shawn Pearce	f32b861243	JGit 3.0: move internal classes into an internal subpackage This breaks all existing callers once. Applications are not supposed to build against the internal storage API unless they can accept API churn and make necessary updates as versions change. Change-Id: I2ab1327c202ef2003565e1b0770a583970e432e9	11 years ago
Shawn Pearce	3760e4319b	Remove cached_packs support in favor of bitmaps The bitmap code in PackWriter knows exactly when to use a pack as a "cached pack". It enables cached pack usage only when the pack has a bitmap and its entire closure of objects needs to be sent. This is a much simpler code path to maintain, and JGit actually has a way to write the necessary index. Change-Id: I2645d482f8733fdf0c4120cc59ba9aa4d4ba6881	11 years ago
Colby Ranger	f82821728b	Enable writing pack indexes with bitmaps in the GC. Update the dfs and file GC implementations to prepare and write bitmaps on the packs that contain the full closure of the object graph. Update the DfsPackDescription to include the index version. Change-Id: I3f1421e9cd90fe93e7e2ef2b8179ae2f1ba819ed	11 years ago
Colby Ranger	dafcb8f6db	Support creating pack bitmap indexes in PackWriter. Update the PackWriter to support writing out pack bitmap indexes, a parallel ".bitmap" file to the ".pack" file. Bitmaps are selected at commits every 1 to 5,000 commits for each unique path from the start. The most recent 100 commits are all bitmapped. The next 19,000 commits have a bitmap every 100 commits. The remaining commits have a bitmap every 5,000 commits. Commits with more than 1 parent are preferred over ones with 1 or less. Furthermore, previously computed bitmaps are reused, if the previous entry had the reuse flag set, which is set when the bitmap was placed at the max allowed distance. Bitmaps are used to speed up the counting phase when packing, for requests that are not shallow. The PackWriterBitmapWalker uses a RevFilter to proactively mark commits with RevFlag.SEEN, when they appear in a bitmap. The walker produces the full closure of reachable ObjectIds, given the collection of starting ObjectIds. For fetch requests, two ObjectWalks are executed to compute the ObjectIds reachable from the haves and from the wants. The ObjectIds needed to be written are determined by taking all the resulting wants AND NOT the haves. For clone requests, we get cached pack support for "free" since it is possible to determine if all of the ObjectIds in a pack file are included in the resulting list of ObjectIds to write.
On my machine, the best times for clones and fetches of the linux kernel repository (with about 2.6M objects and 300K commits) are tabulated below:

Operation                     Index V2                Index VE003
Clone                         37530ms (524.06 MiB)       82ms (524.06 MiB)
Fetch (1 commit back)            75ms                   107ms
Fetch (10 commits back)         456ms (269.51 KiB)      341ms (265.19 KiB)
Fetch (100 commits back)        449ms (269.91 KiB)      337ms (267.28 KiB)
Fetch (1000 commits back)      2229ms ( 14.75 MiB)      189ms ( 14.42 MiB)
Fetch (10000 commits back)     2177ms ( 16.30 MiB)      254ms ( 15.88 MiB)
Fetch (100000 commits back)   14340ms (185.83 MiB)     1655ms (189.39 MiB)

Change-Id: Icdb0cdd66ff168917fb9ef17b96093990cc6a98d	11 years ago
Colby Ranger	be7a135e94	Break the dependency on RevObject when creating a newObjectToPack(). Update the ObjectReuseAsIs API to support creating new ObjectToPack with only the AnyObjectId and Git object type. This is needed to support the future pack index bitmaps, which only contain this information and do not want the overhead of creating a temporary object for every ObjectId. Change-Id: I906360b471412688bf429ecef74fd988f47875dc	11 years ago
Colby Ranger	7fbd6588be	Reduce memory held and speed up DfsGarbageCollector. getObjectList() returns a list of ObjectToPack. These can hold on to a lot of memory. Furthermore, binary searching for objects in a sorted array can be slow. Improve the speed and reduce the memory by creating a copy of the ObjectId and inserting it into an ObjectIdOwnerMap. Change-Id: Ib5aa5b7447e05938b47fa55812a87b9872c20ea7	11 years ago
Colby Ranger	7c58f6282a	Update DfsGarbageCollector to not read back a pack index. Previously, the Dfs GC excluded objects from packs by passing a previously written index to the PackWriter. Reading back a file on Dfs is slow. Instead, allow the PackWriter to expose the objects included in a pack and forward that to invocations of excludeObjects() . Change-Id: I377cb4ab07f62cf790505e1eeb0b2efe81897c79	11 years ago
Robin Rosenberg	c310fa0c80	Mark non-externalizable strings as such A few classes such as Constants are marked with @SuppressWarnings, as are toString() methods with many literals, but otherwise $NLS-n$ is used for strings containing text that should not be translated. A few literals may fall into the gray zone, but mostly I've tried to only tag the obvious ones. Change-Id: I22e50a77e2bf9e0b842a66bdf674e8fa1692f590	11 years ago
Colby Ranger	b77ba04976	Do not delta compress objects that have already tried to compress. If an object is in a pack file already, delta compression will not attempt to re-compress it. This assumes that the previous packing already performed the optimal compression attempt, however, the subclasses of StoredObjectRepresentation may use other heuristics to determine if the stored format is optimal. Change-Id: I403de522f4b0dd2667d54f6faed621f392c07786	11 years ago
Christian Halstrick	0f84b86e01	fix PackWriter excluded objects handling PackWriter supports excluding objects from being written to the pack. You may specify a PackIndex which lists all those objects which should not go into the new pack. This feature was broken because not all commits have been checked whether they should be excluded or not. For other object types the exclude algorithm worked. This commit adds the missing check. Change-Id: Id0047098393641ccba784c58b8325175c22fcece Signed-off-by: Christian Halstrick <christian.halstrick@sap.com> Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>	12 years ago
Kevin Sawicki	17fb542e9e	Remove 86 boxing warnings Use Integer, Character, and Long valueOf methods when passing parameters to MessageFormat and other places that expect objects instead of primitives Change-Id: I5942fbdbca6a378136c00d951ce61167f2366ca4	12 years ago
Robin Rosenberg	95d311f888	Move JGitText to an internal package Change-Id: I763590a45d75f00a09097ab6f89581a3bbd3c797	12 years ago
Shawn O. Pearce	60e51251db	Do not write edge objects to the pack stream Consider two objects A->B where A uses B as a delta base, and these are in the same source pack file ordered as "A B". If cached packs is enabled and B is also in the cached pack that will be appended onto the end of the thin pack, and both A, B are supposed to be in the thin pack, PackWriter must consider the fact that A's base B is an edge object that claims to be part of the new pack, but is actually "external" and cannot be written first. If the object reuse system considered B candidates first this bug does not arise, as B will be marked as edge due to it existing in the cached pack. When the A candidates are later examined, A sees a valid delta base is available as an edge, and will not later try to "write base first" during the writing phase. However, when the reuse system considers A candidates first they see that B will be in the outgoing pack, as it is still part of the thin pack, and arrange for A to be written first. Later when A switches from being in-pack to being an edge object (as it is part of the cached pack) the pointer in B does not get its type changed from ObjectToPack to ObjectId, so B thinks A is non-edge. We work around this case by also checking that the delta base B is non-edge before writing the object to the pack. Later when A writes its object header, delta base B's ObjectToPack will have an offset == 0, which makes isWritten() = false, and the OBJ_REF delta format will be used for A's header. This will be resolved by the client to the copy of B that appears in the later cached pack. Change-Id: Ifab6bfdf3c0aa93649468f49bcf91d67f90362ca	12 years ago
Shawn O. Pearce	1421106d76	Use long for more object counts in PackWriter Packs can contain up to 2^32-1 objects, which exceeds the range of a Java int. Try harder to accept higher object counts in some cases by using long more often when we are working with the object count value. This is a trivial refactoring, we may have to make even more changes to the object handling code to support more than 2^31-1 objects. Change-Id: I8cd8146e97cd1c738ad5b48fa9e33804982167e7	12 years ago
Shawn O. Pearce	41a18d57bc	Search for annotated tag reuse first Annotated tags are relatively rare and currently are scheduled in a pack file near the commits, decreasing the time it takes to resolve client requests reading tags as part of a history traversal. Putting them first before the commits allows the storage system to page in the tag area, and have it relatively hot in the LRU when the nearby commit area gets examined too. Later looking at the tree and blob data will pollute the cache, making it more likely the tags are not loaded and would require file IO. Change-Id: I425f1f63ef937b8447c396939222ea20fdda290f	12 years ago
Shawn O. Pearce	29997ab084	Correct progress monitor on "Getting sizes:" phase This counter always was running 1 higher, because it incremented after the queue was exhausted (and every object was processed). Move increments to be after the queue has provided a result, to ensure we do not show a higher in-progress count than total count. Change-Id: I97f815a0492c0957300475af409b6c6260008463	12 years ago
Dave Borowitz	2b584b9216	Keep track of a static collection of all PackWriter instances Stored in a weak concurrent hash map, which we clean up while iterating. Usually the weak reference behavior should not be necessary because PackWriters should be released with release(), but we still want to avoid leaks when dealing with broken client code. Change-Id: I337abb952ac6524f7f920fedf04065edf84d01d2	12 years ago
Dave Borowitz	f26b79d044	Estimate the amount of memory used by a PackWriter Memory usage is dominated by three terms: - The maximum memory allocated to each delta window. - The maximum size of a single file held in memory during delta search. - ObjectToPack instances owned by the writer. For the first two terms, rather than doing complex instrumentation of the DeltaWindows, we just overestimate based on the config parameters (though we may underestimate if the maximum size is not set). For the ObjectToPack instances, we do some rough byte accounting of the underlying Java object representation. Change-Id: I23fe3cf9d260a91f1aeb6ea22d75af8ddb9b1939	12 years ago
Dave Borowitz	16b8ebf2d1	Add an object encapsulating the state of a PackWriter Exposes essentially the same state machine to the programmer as is exposed to the client via a ProgressMonitor, using a wrapper around beginTask()/endTask(). Change-Id: Ic3622b4acea65d2b9b3551c668806981fa7293e3	12 years ago
Shawn O. Pearce	1b6a549ff3	PackWriter: Export more statistics Export the shallow pack information, and also a handy function to sum up the total times. Include the time writing out the index file, if it was created. Change-Id: I7f60ae6848455a357b25feedb23743bbf6c153cf Signed-off-by: Shawn O. Pearce <spearce@spearce.org>	12 years ago
Matt Fischer	9952223e06	Implement server support for shallow clones This implements the server side of shallow clones only (i.e. git-upload-pack), not the client side. CQ: 5517 Bug: 301627 Change-Id: Ied5f501f9c8d1fe90ab2ba44fac5fa67ed0035a4 Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>	14 years ago
Shawn O. Pearce	a1a8c6d77e	PackWriter: support excluding objects already in other packs This can be useful when implementing garbage collection and there are packs that should not be copied, such as huge packs that have a sibling ".keep" file alongside of them. Callers driving PackWriter need to initialize the list of packs not to include objects from by passing each index to excludeObjects(). Change-Id: Id7f34df69df97be406bcae184308e92b0e8690fd Signed-off-by: Shawn O. Pearce <spearce@spearce.org> Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>	12 years ago
Shawn O. Pearce	74333e63b6	PackWriter: Make want/have actual sets During parsing these are used with contains(). If they are a List type, the contains operation is not efficient. Some callers such as UploadPack often pass a List here, so convert to Set when the type isn't efficient for contains(). Change-Id: If948ae3bf1f46e756bd2d5db14795e12ba7a6207 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>	12 years ago
Shawn O. Pearce	2610eaf386	Revert "PackWriter: Do not delta compress already packed objects" This reverts commit `67b064fc9f`. The "tiny optimization" introduced by 67b0 turns out to have a big savings on wall-clock time when the object store is very slow (e.g. the DHT support in JGit), but comes with a much bigger penalty in space used by the output stream. CGit packed with 67b0 enabled is 7 MiB larger than it should be (36 MiB rather than 28/29 MiB). The much bigger Linux kernel repository gained over 200 MiB, though some of this may have been caused by a smaller window setting. Revert this patch as PackWriter should be optimizing for space used rather than time spent, since its primary use is network transfer, and that isn't free. Change-Id: I7413a9ef89762208159b4a1adc5a22a4c9245611 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>	12 years ago
Shawn O. Pearce	99e6cfb131	PackWriter: Only search for base objects on thin packs A non-thin pack does not need to worry about preferred bases, the pack will be self-contained and all required delta base objects will appear within the pack itself. Obtaining the path buffer and length from the ObjectWalk to build the preferred base table is "expensive", so avoid the cost unless a thin pack is being constructed. Change-Id: I16e30cd864f4189d4304e7957a7cd5bdb9e84528 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>	12 years ago
Shawn O. Pearce	68cc21b60d	PackWriter: Skip progress messages on fast operations If the "Finding sources" phase will complete in <1 second with no delta compression enabled, don't bother showing the progress meter for this phase. Small repositories on the local filesystem tend to rip through this phase always subsecond and the ProgressMonitor display can actually slow the operation down. If delta compression is enabled, there are two phases that may run very quickly. Set the timer to 500 milliseconds instead, reducing the risk that the user has to wait longer than 1 second before any sort of output from the packer occurs. Change-Id: I58110f17e2a5ffa0134f9768b94804d16bbb8399 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>	13 years ago
Shawn O. Pearce	8ac65d33ed	PackWriter: Fix the way delta chain cycles are prevented Take a very simple approach to avoiding delta chains during object reuse: objects are now always selected from the oldest pack that contains them. This prevents cycles because a pack must not have a cycle in the delta chain. If both objects A and B are chosen out of the same source pack then there cannot be an A->B->A cycle. The oldest pack is also the most likely to have the smallest deltas. It's the biggest pack in the system and probably came from the clone (or last GC) of this repository, where all objects were previously considered and packed tightly together. If an object appears again (for example due to a revert and a push into this repository) the newer copy won't be nearly as small as the older delta version of it, even if the newer one is also itself a delta. ObjectDirectory already enumerates objects during selection in this newest->oldest order, so it already is supplying these assumptions to PackWriter. Taking advantage of this can speed up selection by a tiny amount by avoiding some tests, but can also help to prevent a cycle needing to be broken on the fly during writing. The previous cycle breaking logic wasn't fully correct either. If a different delta base was chosen, the new delta base might not have been written into the output pack before the current object, forcing the use of REF_DELTA when OFS_DELTA is always smaller. This logic has now been reworked to always re-check the delta base and ensure it gets written before the current object. If a cycle occurs, it gets broken the same way as before, by disabling delta reuse and finding an alternative form of the object, which may require inflating/deflating in whole format. Change-Id: I9953ab8be54ceb8b588e1280d6f7edd688887747 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>	13 years ago
Shawn O. Pearce	36a38adf71	PackWriter: Combine small reuse batches together If the total number of objects to look for reuse on is under 4096 this is really close to a reasonable batch size for the DHT storage system to lookup at once. Combine all of the objects into a single temporary list, perform reuse, and then prune the main lists if any duplicate objects were detected from a selected CachedPack. The intention here is to try and avoid 4 tiny sequential lookups on the storage system when the time to wait for each of those to finish is higher than the CPU time required to build (and later GC) this temporary list. Change-Id: I528daf9d2f7744dc4a6281750c2d61d8f9da9f3a Signed-off-by: Shawn O. Pearce <spearce@spearce.org>	13 years ago
Shawn O. Pearce	0be24ebf33	PackWriter: Remove dummy list 0 Instead of looping over the objectsLists array, always set slot 0 to null and explicitly work on the 4 indexes that matter. This kills some loops and increases the length of the code slightly, but I've always really disliked that dummy 0 slot. Change-Id: I5ad938501c1c61f637ffdaff0d0d88e3962d8942 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>	13 years ago
Shawn O. Pearce	9f5bbb5dd4	PackWriter: Speed up pruning of objects from cached packs During object enumeration for the thin pack, very few objects come out that are duplicated with the cached pack. Typically these are only cases where a blob or tree was cherry-picked forward, got a copy or rename, or was reverted... all relatively infrequent events. Speed up pruning of the thin pack object list by combining the phase with the object representation selection. Implementers should already be offering to reuse the object from the cached pack if it is stored there, at which point the implementation can perform a very fast type of containment test using the cached pack's identity rather than yet another index lookup. For the local disk case this is probably not a big improvement, but it does help on the DHT implementation where the two passes combined into one reduces latency. Change-Id: I6a07fc75d9075bf6233e967360b6546f9e9a2b33 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>	13 years ago
Shawn O. Pearce	bb1956e647	PackWriter: Collect stats by object type Frequently enough I'm wondering how much of a pack is commits vs. trees, and the total line doesn't really tell us this because it's a gross total from the pack. Computing the counts per object type is simple during packing, as PackWriter already has everything in memory broken up by object type. It's virtually free to get these values and track them. Change-Id: Id5e6b1902ea909c72f103a0fbca5d8bc316f9ab3 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>	13 years ago
Shawn O. Pearce	7a9bf1e2e0	PackWriter: Rename getObjectsNumber to getObjectCount This better matches with PackFile and CachedPack's methods that return the same value. Change-Id: Idb9b7c71d2048dd2344a62c2cde20b4e34529ab7 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>	13 years ago
Shawn O. Pearce	65f9a6e58b	Fix dumb transport push PackWriter incorrectly returned 0 from getObjectsNumber() when the pack has not been written yet. This caused dumb transports like amazon-s3:// and sftp:// to abort early and never write out a pack, under the assumption that the pack had no objects. Until the pack header is written to the output stream, compute the current object count each time it is requested. Once the header is started, use the object count from the stats object. Change-Id: I041a2368ae0cfe6f649ec28658d41a6355933900 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>	13 years ago
Shawn O. Pearce	bd970007be	ObjectIdOwnerMap: More lightweight map for ObjectIds OwnerMap is about 200 ms faster than SubclassMap, more friendly to the GC, and uses less storage: testing the "Counting objects" part of PackWriter on `1886362` objects: ObjectIdSubclassMap: load factor 50% table: `4194304` (wasted `2307942`) ms spent 36998 36009 34795 34703 34941 35070 34284 34511 34638 34256 ms avg 34800 (last 9 runs) ObjectIdOwnerMap: load factor 100% table: `2097152` (wasted 210790) directory: 1024 ms spent 36842 35112 34922 34703 34580 34782 34165 34662 34314 34140 ms avg 34597 (last 9 runs) The major difference with OwnerMap is entries must extend from ObjectIdOwnerMap.Entry, where the OwnerMap has injected its own private "next" field into each object. This allows the OwnerMap to use a singly linked list for chaining collisions within a bucket. By putting collisions in a linked list, we gain the entire table back for the SHA-1 bits to index their own "private" slot. Unfortunately this means that each object can appear in at most ONE OwnerMap, as there is only one "next" field within the object instance to thread into the map. For types that are very object map heavy like RevWalk (entity RevObject) and PackWriter (entity ObjectToPack) this is sufficient, these entity types are only put into one map by their container. By introducing a new map type, we don't break existing applications that might be trying to use ObjectIdSubclassMap to track RevCommits they obtained from a RevWalk. The OwnerMap uses less memory. Each object uses 1 reference more (so we're up 1,886,362 references), but the table is 1/2 the size (2^20 rather than 2^21). The table itself wastes only 210,790 slots, rather than 2,307,942. So OwnerMap is wasting 200k fewer references. OwnerMap is more friendly to the GC, because it hardly ever generates garbage. As the map reaches its 100% load factor target, it doubles in size by allocating additional segment arrays of 2048 entries. 
(So the first grow allocates 1 segment, second 2 segments, third 4 segments, etc.) These segments are hooked into the pre-allocated directory of 1024 spaces. This permits the map to grow to 2 million objects before the directory itself has to grow. By using segments of 2048 entries, we are asking the GC to acquire 8,204 bytes in a 32 bit JVM. This is easier to satisfy than 2,307,942 bytes (for the 512k table that is just an intermediate step in the SubclassMap). By reusing the previously allocated segments (they are re-hashed in-place) we don't release any memory during a table grow. When the directory grows, it does so by discarding the old one and using one that is 4x larger (so the directory goes to 4096 entries on its first grow). A directory of size 4096 can handle up to 8 million objects. The second directory grow (16384) goes to 33 million objects. At that point we're starting to really push the limits of the JVM heap, but at least it's many small arrays. Previously SubclassMap would need a table of `67108864` entries to handle that object count, which needs a single contiguous allocation of 256 MiB. That's hard to come by in a 32 bit JVM. Instead OwnerMap uses 8192 arrays of about 8 KiB each. This is much easier to fit into a fragmented heap. Change-Id: Ia4acf5cfbf7e9b71bc7faa0db9060f6a969c0c50 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>	13 years ago
Shawn O. Pearce	f67e5602af	PackWriter: Reduce GC during enumeration Instead of resizing an ArrayList until all objects have been added, append objects into a specialized List type that uses small arrays of 1024 entries for each 1024 objects added. For a large repository like linux-2.6, PackWriter will now allocate 1,758 smaller arrays to hold the object list, without creating any garbage from the intermediate states due to list expansion. 1024 was chosen as the block size (and initial directory size) as this is a reasonable balance for the PackWriter code. Each block uses approximately 4096 bytes in a 32 bit JVM, as does the default top level block directory. The top level directory doesn't expand until 1 million items have been added to the list, which for linux-2.6 won't yet occur as the lists are per-object-type and are thus bounded to about 1/3 of 1.8 million. Change-Id: If9e4092eb502394c5d3d044b58cf49952772f6d6 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>	13 years ago
Shawn O. Pearce	a468cb57c2	PackWriter: Validate reused cached packs If object reuse validation is enabled, the output pack is going to probably be stored locally. When reusing an existing cached pack to save object enumeration costs, ensure the cached pack has not been corrupted by checking its SHA-1 trailer. If it has, writing will abort and the output pack won't be complete. This prevents anyone from trying to use the output pack, and catches corruption before it can be carried any further. Change-Id: If89d0d4e429d9f4c86f14de6c0020902705153e6 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>	13 years ago
Shawn O. Pearce	1b2062fe37	PackWriter: Avoid CRC-32 validation when feeding IndexPack There is no need to validate the object contents during copyObjectAsIs if the result is going to be parsed by unpack-objects or index-pack. Both programs will compute the SHA-1 of the object, and also validate most of the pack structure. For git daemon like servers, this work is already done on the client end of the connection, so the server doesn't need to repeat that work itself. Disable object validation for the 3 transport cases where we know the remote side will handle object validation for us (push, bundle creation, and upload pack). This improves performance on the server side by reducing the work that must be done. Change-Id: Iabb78eec45898e4a17f7aab3fb94c004d8d69af6 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>	13 years ago
Shawn O. Pearce	bd6853e90a	PackWriter: Position tags after commits Annotated tags need to be parsed by many viewing tools, but putting them at the end of the pack hurts because kernel prefetching might not have loaded them, since they are so far from the commits they reference. Position tags right behind the commits, but before the trees. Typically the annotated tag set for a repository is very small, so the extra prefetch burden it puts on tools that don't need annotated tags (but do need commits and trees) is fairly low. Change-Id: Ibbabdd94e7d563901c0309c79a496ee049cdec50 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>	13 years ago
Shawn O. Pearce	26dffbe04d	PackWriter: Refactor object writing loop This simple refactoring makes it easier to pre-process each of the object lists before it's handed into the actual write routine. Change-Id: Iea95e5ecbc7374f6bcbb43d1c75285f4f564d09d Signed-off-by: Shawn O. Pearce <spearce@spearce.org>	13 years ago
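
Note on commit dafcb8f6db: the message says the objects to write for a fetch are "all the resulting wants AND NOT the haves". The sketch below only illustrates that set operation. It uses java.util.BitSet as a stand-in and is not JGit's own bitmap code; the class name BitmapFetchSketch, the method objectsToSend, and the bit positions are hypothetical, chosen for illustration, with each bit position assumed to identify one object in the pack.

```java
import java.util.BitSet;

// Hypothetical illustration only; not part of JGit.
final class BitmapFetchSketch {

    /**
     * Computes reachable(wants) AND NOT reachable(haves).
     * Inputs are left unmodified; the result is a fresh BitSet.
     */
    static BitSet objectsToSend(BitSet reachableFromWants, BitSet reachableFromHaves) {
        BitSet result = (BitSet) reachableFromWants.clone();
        result.andNot(reachableFromHaves); // clear every bit the client already has
        return result;
    }

    public static void main(String[] args) {
        BitSet wants = new BitSet();
        wants.set(0, 6); // objects 0..5 are reachable from the wanted tips

        BitSet haves = new BitSet();
        haves.set(0, 3); // objects 0..2 are already reachable on the client

        // Only objects 3, 4 and 5 still need to be written.
        System.out.println(objectsToSend(wants, haves)); // prints {3, 4, 5}
    }
}
```

For a clone the haves set is empty, so the result equals the wants set. That matches the message's remark that a pack can be reused whole when all of its objects are in the list to write.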

12 Commits (32ff57a2b2b9480f4d374a2592fada7f720b124f)