During parsing these are used with contains(). If they are a List,
each contains() call is a linear scan. Some callers such as
UploadPack often pass a List here, so convert to a Set when the
type isn't efficient for contains().
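A minimal sketch of the conversion, in Java (the helper name is
hypothetical, not the actual patch):

  import java.util.Collection;
  import java.util.HashSet;
  import java.util.List;

  class FastContains {
    // contains() against a List is a linear scan; copying into a
    // HashSet once makes every later membership test O(1).
    static <T> Collection<T> forContains(Collection<T> ids) {
      if (ids instanceof List)
        return new HashSet<>(ids);
      return ids;
    }
  }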
Change-Id: If948ae3bf1f46e756bd2d5db14795e12ba7a6207
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Revert "PackWriter: Do not delta compress already packed objects"
This reverts commit 67b064fc9f.
The "tiny optimization" introduced by 67b0 turns out to have a big
savings on wall-clock time when the object store is very slow (e.g.
the DHT support in JGit), but comes with a much bigger penalty in
space used by the output stream. CGit packed with 67b0 enabled is
7 MiB larger than it should be (36 MiB rather than 28/29 MiB). The
much bigger Linux kernel repository gained over 200 MiB, though some
of this may have been caused by a smaller window setting.
Revert this patch as PackWriter should be optimizing for space used
rather than time spent, since its primary use is network transfer, and
that isn't free.
Change-Id: I7413a9ef89762208159b4a1adc5a22a4c9245611
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
PackWriter: Only search for base objects on thin packs
A non-thin pack does not need to worry about preferred bases, the pack
will be self-contained and all required delta base objects will appear
within the pack itself. Obtaining the path buffer and length from the
ObjectWalk to build the preferred base table is "expensive", so avoid
the cost unless a thin pack is being constructed.
Change-Id: I16e30cd864f4189d4304e7957a7cd5bdb9e84528
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
PackWriter: Skip progress messages on fast operations
If the "Finding sources" phase will complete in <1 second with no
delta compression enabled, don't bother showing the progress meter for
this phase. Small repositories on the local filesystem tend to rip
through this phase always subsecond and the ProgressMonitor display
can actually slow the operation down.
If delta compression is enabled, there are two phases that may run
very quickly. Set the timer to 500 milliseconds instead, reducing the
risk that the user has to wait longer than 1 second before any sort of
output from the packer occurs.
Change-Id: I58110f17e2a5ffa0134f9768b94804d16bbb8399
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
PackWriter: Fix the way delta chain cycles are prevented
Take a very simple approach to avoiding delta chain cycles during
object reuse: objects are now always selected from the oldest pack
that contains them. This prevents cycles because a valid pack cannot
have a cycle in its delta chains. If both objects A and B are chosen
out of the same source pack then there cannot be an A->B->A cycle.
The oldest pack is also the most likely to have the smallest deltas.
It's the biggest pack in the system and probably came from the clone
(or last GC) of this repository, where all objects were previously
considered and packed tightly together. If an object appears again
(for example due to a revert and a push into this repository) the
newer copy won't be nearly as small as the older delta version
of it, even if the newer one is also itself a delta.
ObjectDirectory already enumerates objects during selection in this
newest->oldest order, so it already is supplying these assumptions
to PackWriter. Taking advantage of this can speed up selection by
a tiny amount by avoiding some tests, but can also help to prevent
a cycle needing to be broken on the fly during writing.
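A toy illustration of the invariant (types and the newest-to-oldest
list are stand-ins for the real selection callback):

  import java.util.List;

  class OldestPackSelection {
    // Offers arrive newest-to-oldest, so keeping the last offer always
    // selects the oldest pack. Reusing every delta from a single pack
    // cannot introduce an A->B->A cycle, because a valid pack contains
    // no cycles in its own delta chains.
    static <R> R select(List<R> offersNewestToOldest) {
      R chosen = null;
      for (R offer : offersNewestToOldest)
        chosen = offer;
      return chosen;
    }
  }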
The previous cycle breaking logic wasn't fully correct either.
If a different delta base was chosen, the new delta base might not
have been written into the output pack before the current object,
forcing the use of REF_DELTA when OFS_DELTA is always smaller.
This logic has now been reworked to always re-check the delta base
and ensure it gets written before the current object.
If a cycle occurs, it gets broken the same way as before, by
disabling delta reuse and finding an alternative form of the
object, which may require inflating/deflating in whole format.
Change-Id: I9953ab8be54ceb8b588e1280d6f7edd688887747
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
If the total number of objects to look up for reuse is under 4096,
it is quite close to a reasonable batch size for the DHT storage
system to look up at once. Combine all of the objects into a single
temporary list, perform reuse, and then prune the main lists if any
duplicate objects were detected from a selected CachedPack.
The intention here is to try and avoid 4 tiny sequential lookups
on the storage system when the time to wait for each of those to
finish is higher than the CPU time required to build (and later
GC) this temporary list.
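Roughly the shape of the batching (a sketch; the list types are
hypothetical, the 4096 threshold is from this change):

  import java.util.ArrayList;
  import java.util.List;

  class ReuseBatcher {
    static final int BATCH = 4096;

    // Returns one combined list when a single storage round-trip is
    // cheaper than four sequential per-type lookups, else null.
    static <T> List<T> combine(List<List<T>> byType) {
      int total = 0;
      for (List<T> list : byType)
        total += list.size();
      if (total > BATCH)
        return null;
      List<T> all = new ArrayList<>(total);
      for (List<T> list : byType)
        all.addAll(list);
      return all;
    }
  }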
Change-Id: I528daf9d2f7744dc4a6281750c2d61d8f9da9f3a
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Instead of looping over the objectsLists array, always set slot 0 to
null and explicitly work on the 4 indexes that matter. This kills
some loops and increases the length of the code slightly, but I've
always really disliked that dummy 0 slot.
Change-Id: I5ad938501c1c61f637ffdaff0d0d88e3962d8942
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
PackWriter: Speed up pruning of objects from cached packs
During object enumeration for the thin pack, very few objects come
out that are duplicated with the cached pack. Typically these are
only cases where a blob or tree was cherry-picked forward, got a
copy or rename, or was reverted... all relatively infrequent events.
Speed up pruning of the thin pack object list by combining the phase
with the object representation selection. Implementers should already
be offering to reuse the object from the cached pack if it is stored
there, at which point the implementation can perform a very fast type
of containment test using the cached pack's identity rather than yet
another index lookup. For the local disk case this is probably not a
big improvement, but it does help on the DHT implementation where the
two passes combined into one reduces latency.
Change-Id: I6a07fc75d9075bf6233e967360b6546f9e9a2b33
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Frequently enough I'm wondering how much of a pack is commits vs.
trees, and the total line doesn't really tell us this because it's
a gross total for the pack. Computing the counts per object type
is simple during packing, as PackWriter already has everything in
memory broken up by object type. It's virtually free to get these
values and track them.
Change-Id: Id5e6b1902ea909c72f103a0fbca5d8bc316f9ab3
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
PackWriter: Rename getObjectsNumber to getObjectCount
This better matches with PackFile and CachedPack's methods
that return the same value.
Change-Id: Idb9b7c71d2048dd2344a62c2cde20b4e34529ab7
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
PackWriter incorrectly returned 0 from getObjectsNumber() when the
pack has not been written yet. This caused dumb transports like
amazon-s3:// and sftp:// to abort early and never write out a pack,
under the assumption that the pack had no objects.
Until the pack header is written to the output stream, compute the
current object count each time it is requested. Once the header is
started, use the object count from the stats object.
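In sketch form (field names hypothetical):

  import java.util.ArrayList;
  import java.util.List;

  class ObjectCountSketch {
    boolean headerStarted;
    long statsObjectCount;
    final List<List<?>> objectsLists = new ArrayList<>();

    // Before the pack header is written, count the live per-type
    // lists; afterwards the stats object is authoritative.
    long getObjectCount() {
      if (headerStarted)
        return statsObjectCount;
      long n = 0;
      for (List<?> list : objectsLists)
        n += list.size();
      return n;
    }
  }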
Change-Id: I041a2368ae0cfe6f649ec28658d41a6355933900
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
ObjectIdOwnerMap: More lightweight map for ObjectIds
OwnerMap is about 200 ms faster than SubclassMap, more friendly to the
GC, and uses less storage: testing the "Counting objects" part of
PackWriter on 1886362 objects:
  ObjectIdSubclassMap:
    load factor 50%
    table: 4194304 (wasted 2307942)
    ms spent 36998 36009 34795 34703 34941 35070 34284 34511 34638 34256
    ms avg 34800 (last 9 runs)

  ObjectIdOwnerMap:
    load factor 100%
    table: 2097152 (wasted 210790)
    directory: 1024
    ms spent 36842 35112 34922 34703 34580 34782 34165 34662 34314 34140
    ms avg 34597 (last 9 runs)
The major difference with OwnerMap is entries must extend from
ObjectIdOwnerMap.Entry, where the OwnerMap has injected its own
private "next" field into each object. This allows the OwnerMap to use
a singly linked list for chaining collisions within a bucket. By
putting collisions in a linked list, we gain the entire table back for
the SHA-1 bits to index their own "private" slot.
Unfortunately this means that each object can appear in at most ONE
OwnerMap, as there is only one "next" field within the object instance
to thread into the map. For types that are very object map heavy like
RevWalk (entity RevObject) and PackWriter (entity ObjectToPack) this
is sufficient, these entity types are only put into one map by their
container. By introducing a new map type, we don't break existing
applications that might be trying to use ObjectIdSubclassMap to track
RevCommits they obtained from a RevWalk.
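The caller-visible contract is small; a minimal entity looks
something like:

  import org.eclipse.jgit.lib.AnyObjectId;
  import org.eclipse.jgit.lib.ObjectIdOwnerMap;

  // Extending Entry embeds the map's private "next" pointer in the
  // object itself, so each instance may live in at most one OwnerMap.
  class PackedEntity extends ObjectIdOwnerMap.Entry {
    PackedEntity(AnyObjectId id) {
      super(id);
    }
  }

An ObjectIdOwnerMap<PackedEntity> then stores these with add() and
finds them with get(), without any separate node allocations.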
The OwnerMap uses less memory. Each object uses 1 reference more (so
we're up 1,886,362 references), but the table is 1/2 the size (2^20
rather than 2^21). The table itself wastes only 210,790 slots, rather
than 2,307,942. So OwnerMap is wasting 200k fewer references.
OwnerMap is more friendly to the GC, because it hardly ever generates
garbage. As the map reaches its 100% load factor target, it doubles in
size by allocating additional segment arrays of 2048 entries. (So the
first grow allocates 1 segment, second 2 segments, third 4 segments,
etc.) These segments are hooked into the pre-allocated directory of
1024 spaces. This permits the map to grow to 2 million objects before
the directory itself has to grow. By using segments of 2048 entries,
we are asking the GC to acquire 8,204 bytes in a 32 bit JVM. This is
easier to satisfy than 2,307,942 bytes (for the 512k table that is
just an intermediate step in the SubclassMap). By reusing the
previously allocated segments (they are re-hashed in-place) we don't
release any memory during a table grow.
When the directory grows, it does so by discarding the old one and
using one that is 4x larger (so the directory goes to 4096 entries on
its first grow). A directory of size 4096 can handle up to 8 million
objects. The second directory grow (16384) goes to 33 million objects.
At that point we're starting to really push the limits of the JVM
heap, but at least it's many small arrays. Previously SubclassMap would
need a table of 67108864 entries to handle that object count, which
needs a single contiguous allocation of 256 MiB. That's hard to come
by in a 32 bit JVM. Instead OwnerMap uses 8192 arrays of about 8 KiB
each. This is much easier to fit into a fragmented heap.
Change-Id: Ia4acf5cfbf7e9b71bc7faa0db9060f6a969c0c50
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Instead of resizing an ArrayList until all objects have been added,
append objects into a specialized List type that uses small arrays
of 1024 entries for each 1024 objects added.
For a large repository like linux-2.6, PackWriter will now allocate
1,758 smaller arrays to hold the object list, without creating any
garbage from the intermediate states due to list expansion.
1024 was chosen as the block size (and initial directory size) as this
is a reasonable balance for the PackWriter code. Each block uses
approximately 4096 bytes in a 32 bit JVM, as does the default top
level block directory. The top level directory doesn't expand until 1
million items have been added to the list, which for linux-2.6 won't
yet occur as the lists are per-object-type and are thus bounded to
about 1/3 of 1.8 million.
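A simplified sketch of the structure (JGit's real class has more
operations):

  import java.util.AbstractList;
  import java.util.ArrayList;

  class SegmentedList<T> extends AbstractList<T> {
    private static final int BLOCK = 1024;
    private final ArrayList<Object[]> directory = new ArrayList<>();
    private int size;

    @Override
    public boolean add(T element) {
      if (size % BLOCK == 0)
        directory.add(new Object[BLOCK]); // one small allocation, no copy
      directory.get(size / BLOCK)[size % BLOCK] = element;
      size++;
      return true;
    }

    @SuppressWarnings("unchecked")
    @Override
    public T get(int index) {
      return (T) directory.get(index / BLOCK)[index % BLOCK];
    }

    @Override
    public int size() {
      return size;
    }
  }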
Change-Id: If9e4092eb502394c5d3d044b58cf49952772f6d6
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
If object reuse validation is enabled, the output pack is probably
going to be stored locally. When reusing an existing cached pack
to save object enumeration costs, ensure the cached pack has not
been corrupted by checking its SHA-1 trailer. If it has, writing
will abort and the output pack won't be complete. This prevents
anyone from trying to use the output pack, and catches corruption
before it can be carried any further.
Change-Id: If89d0d4e429d9f4c86f14de6c0020902705153e6
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
PackWriter: Avoid CRC-32 validation when feeding IndexPack
There is no need to validate the object contents during
copyObjectAsIs if the result is going to be parsed by unpack-objects
or index-pack. Both programs will compute the SHA-1 of the object,
and also validate most of the pack structure. For git daemon
like servers, this work is already done on the client end of the
connection, so the server doesn't need to repeat that work itself.
Disable object validation for the 3 transport cases where we know
the remote side will handle object validation for us (push, bundle
creation, and upload pack). This improves performance on the server
side by reducing the work that must be done.
Change-Id: Iabb78eec45898e4a17f7aab3fb94c004d8d69af6
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Annotated tags need to be parsed by many viewing tools, but putting
them at the end of the pack hurts because kernel prefetching might
not have loaded them, since they are so far from the commits they
reference.
Position tags right behind the commits, but before the trees.
Typically the annotated tag set for a repository is very small,
so the extra prefetch burden it puts on tools that don't need
annotated tags (but do need commits and trees) is fairly low.
Change-Id: Ibbabdd94e7d563901c0309c79a496ee049cdec50
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
This simple refactoring makes it easier to pre-process each of the
object lists before it's handed into the actual write routine.
Change-Id: Iea95e5ecbc7374f6bcbb43d1c75285f4f564d09d
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
JGit doesn't generate deltas for commit or tag objects when it packs
a repository from scratch. This is an explicit design decision that
is (mostly) justified by the fact that these objects do not delta
compress well.
Annotated tags are made once on stable points of the project history,
it is unlikely they will ever appear again with sufficient common
text to justify using a delta over just deflating the raw content.
JGit never tries to delta compress annotated tags and I take the
stance that these are best stored as non-deltas given how frequently
they might be accessed by repository viewers.
Commits only have sufficient common text when they are cherry-picked
to forward-port or back-port a change from one branch to another.
Even in these cases the distance between the commits as returned
by the log traversal has to be small enough that they would both
appear in the delta search window at the same time in order to
delta compress one of the messages against the other. JGit never
tries to delta compress commits, as it requires a lot of CPU time
but typically does not produce a smaller pack file.
Avoid reusing deltas for either of these types when constructing a
new pack. To avoid killing performance during serving of network
clients, UploadPack disables this code change by allowing PackWriter
to reuse delta commits. Repositories that were already repacked by
C Git will not have their delta commits decompressed and recompressed
on the fly during object writing, saving server-side CPU resources.
Change-Id: I749407e7c5c677e05e4d054b40db7656cfa7fca8
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
PackWriter: Do not delta compress already packed objects
This is a tiny optimization to how delta search works. Checking for
isReuseAsIs() avoids doing delta compression search on non-delta
objects already stored in packs within the repository. Such objects
are not likely to be delta compressible, as they were already delta
searched when their containing pack was generated and they were
not delta compressed at that time. Doing delta compression now is
unlikely to produce a different result, but would waste a lot of CPU.
The isReuseAsIs() flag is checked before isDoNotDelta() because it
is very common to reuse objects in the output pack. Most objects
get reused, and only a handful have the isDoNotDelta() bit set.
Moving the check earlier allows the loop to more quickly skip
through objects that will never need to be considered.
Change-Id: Ied757363f775058177fc1befb8ace20fe9759bac
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
We did not record the time spent on the object reuse search or the
object size lookup, both of which occur between the counting phase and
the compressing phase. If there are enough objects involved, these
times can be significant, so it's worth timing them and recording it.
Change-Id: I89084acfc598bb6533d75d90cb8de459f0ed93be
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
The total delta count is supposed to include reused deltas, not
just newly created deltas.
Change-Id: I98cbdcef80d59714a4f62ff322e7b709b08b6d26
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Many source browsers and network related tools like UploadPack need
to find and parse the target of all branches and annotated tags
within the repository during their startup phase. Clustering these
together into the same part of the pack file will improve locality,
reducing thrashing when an application starts and needs to load
all of these into memory at once.
To prevent bottlenecking basic log viewing tools that are scanning
backwards from the tip of a current branch (and don't need tags)
we place this cluster of older targets after 4096 newer commits
have already been placed into the pack stream. 4096 was chosen as
a rough guess, but was based on a few factors:
  - log viewers typically show 5-200 commits per page
  - users only view the first page or two
  - DHT can cram 2200-4000 commits per 1 MiB chunk,
    thus these will fall into the second commit chunk (roughly)
Unfortunately this placement hurts history tools that scan backwards
through the commit graph and ignored tags or branch heads when they
started.
An ancient tagged commit is no longer positioned behind its first
child (it's now much earlier), resulting in a page fault for the
parser to reload this cluster of objects on demand. This may be
an acceptable loss. If a user is walking backwards and has already
scanned through more than 4096 commits of history, waiting for the
region to reload isn't really that bad compared to the amount of
time already spent.
If the repository is so small that there are fewer than 4096 commits,
this change has no impact on the placement of objects.
Change-Id: If3052e430d305e17878d94145c93754f56b74c61
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
If the underlying storage has a high latency per SHA-1 lookup
(e.g. the DHT support we are working on), parsing each wanted
annotated tag object back to its underlying commit is too slow:
it's a sequential lookup for each tag. With hundreds of tags in
a repository this takes far too long.
Instead queue up a list of the tags whose objects need to be found,
and then locate all of those in one parseAny batch. This works
for the common case of annotated tag to single tree or commit.
For the less often used tag->tag->commit, it at least gets us
one level parsed in the larger batch before we have to go back to
sequential lookups.
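The batched form follows RevWalk's asynchronous queue API; roughly:

  import java.io.IOException;
  import java.util.List;

  import org.eclipse.jgit.lib.ObjectId;
  import org.eclipse.jgit.revwalk.AsyncRevObjectQueue;
  import org.eclipse.jgit.revwalk.RevObject;
  import org.eclipse.jgit.revwalk.RevWalk;

  class TagBatcher {
    // One parseAny batch replaces a storage round-trip per tag.
    static void parseTargets(RevWalk walk, List<ObjectId> targets)
        throws IOException {
      AsyncRevObjectQueue queue = walk.parseAny(targets, false);
      try {
        RevObject obj;
        while ((obj = queue.next()) != null) {
          // obj is now parsed; a tag->tag target needs one more pass
        }
      } finally {
        queue.release();
      }
    }
  }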
Change-Id: I94beef3f14281406f15c8cf9fa02d83faf102a19
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
PackWriter: Correct total delta count when reusing pack
If the CachedPack knows its delta count, we need to increment both
the totalDeltas and reusedDeltas fields of the stats object.
Change-Id: I70113609c22476ce7f1e4d9a92f486e9b0f59e44
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
PackWriter: Short-circuit counting on full cached pack reuse
If one or more cached packs fully covers the request, don't bother
with looking up the objects and trying to walk the graph. Just use
the cached packs and return immediately.
This helps clones of quiet repositories that have not been modified
since their last repack: it's likely the cached packs are accurate
and no graph walking is required.
Change-Id: I9062a5ac2f71b525322590209664a84051fd5f8a
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
PackWriter: Sort commits by parse order to improve locality
RevWalk in JGit and the revision code in C Git both parse commits out
of the pack file in an order that differs from strict timestamp and
topological sorting. Both implementations pop a commit from the head
of a date queue, and then immediately parse all of its parents in
order to insert those into the date queue at the proper positions as
determined by their committer timestamp field. This implies that the
parents are parsed when their most recent child is popped from the
queue, and not when they themselves are popped during traversal.
Hoisting a parent commit to be immediately behind its child improves
locality by making sure all parents of a merge are clustered together,
and thus can be paged into the parser by the pack file buffering
system (aka WindowCache in JGit) together.
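A sketch of the reordering (generic types; the parents() accessor is
a stand-in for RevCommit.getParents()):

  import java.util.ArrayList;
  import java.util.LinkedHashSet;
  import java.util.List;
  import java.util.function.Function;

  class ParseOrder {
    // Hoist each parent directly behind its most recent child, the
    // same order rev-walk code parses commits out of the pack.
    static <C> List<C> of(List<C> newestFirst, Function<C, List<C>> parents) {
      LinkedHashSet<C> out = new LinkedHashSet<>();
      for (C commit : newestFirst) {
        out.add(commit); // keeps an earlier slot if already hoisted
        for (C parent : parents.apply(commit))
          if (newestFirst.contains(parent))
            out.add(parent); // cluster merge parents behind the child
      }
      return new ArrayList<>(out);
    }
  }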
Change-Id: I80f9e64cafa2e8f082776b43845edf23065386a2
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
PackWriter: Try for accurate delta reuse on cached pack
If a cached pack is used, it might know how many deltas are contained
within it. Record that count as part of our reusedDeltas field
for the stats line we show clients.
Change-Id: I1c61fb817305a95eeac654cccf132cba20b2339c
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
UploadPack: Expose PackWriter activity to a logger
The UploadPackLogger interface allows applications that embed
GitServlet or otherwise use UploadPack to service clients to
track and log how PackWriter was used, and what it sent. This
provides more granularity into the request activity than might
be available from the HTTP server logs, helping administrators
to better understand utilization and Git server performance.
Change-Id: I1d36b060eb3385339d5f986e68192789ef70fc4e
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
When UploadPack has computed the merge base between the client's have
set and the want set, it has already loaded and parsed all of the
interesting commits that PackWriter needs to transmit to the client.
Switching the RevWalk and its object pool over to be an ObjectWalk
saves PackWriter from needing to re-parse these same commits from the
ObjectDatabase, reducing the startup latency for the enumeration
phase of packing.
UploadPack doesn't want to use an ObjectWalk for the okToGiveUp()
tests because it's slower: during each commit popped it needs to cache
the tree into the pendingObjects list, and during each reset() it
discards a bunch of ObjectWalk specific state and reallocates some
internal collections. ObjectWalk was never meant to be rapidly
reset() like UploadPack does, so it's perhaps somewhat cleaner to allow
"upgrading" a RevWalk to an ObjectWalk.
Bug: 301639
Change-Id: I97ef52a0b79d78229c272880aedb7f74d0f7532f
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
The most expensive part of packing a repository for transport to
another system is enumerating all of the objects in the repository.
Once this gets to the size of the linux-2.6 repository (1.8 million
objects), enumeration can take several CPU minutes and costs a lot
of temporary working set memory.
Teach PackWriter to efficiently reuse an existing "cached pack"
by answering a clone request with a thin pack followed by a larger
cached pack appended to the end. This requires the repository
owner to first construct the cached pack by hand, and record the
tip commits inside of $GIT_DIR/objects/info/cached-packs:
  cd $GIT_DIR
  root=$(git rev-parse master)
  tmp=objects/.tmp-$$
  names=$(echo $root | git pack-objects --keep-true-parents --revs $tmp)
  for n in $names; do
    chmod a-w $tmp-$n.pack $tmp-$n.idx
    touch objects/pack/pack-$n.keep
    mv $tmp-$n.pack objects/pack/pack-$n.pack
    mv $tmp-$n.idx objects/pack/pack-$n.idx
  done
  (echo "+ $root";
   for n in $names; do echo "P $n"; done;
   echo) >>objects/info/cached-packs
  git repack -a -d
When a clone request needs to include $root, the corresponding
cached pack will be copied as-is, rather than enumerating all of
the objects that are reachable from $root.
For a linux-2.6 kernel repository that should be about 376 MiB,
the above process creates two packs of 368 MiB and 38 MiB[1].
This is a local disk usage increase of ~26 MiB, due to reduced
delta compression between the large cached pack and the smaller
recent activity pack. The overhead is similar to 1 full copy of
the compressed project sources.
With this cached pack in hand, JGit daemon completes a clone request
in 1m17s less time, but a slightly larger data transfer (+2.39 MiB):
  Before:
    remote: Counting objects: 1861830, done
    remote: Finding sources: 100% (1861830/1861830)
    remote: Getting sizes: 100% (88243/88243)
    remote: Compressing objects: 100% (88184/88184)
    Receiving objects: 100% (1861830/1861830), 376.01 MiB | 19.01 MiB/s, done.
    remote: Total 1861830 (delta 4706), reused 1851053 (delta 1553844)
    Resolving deltas: 100% (1564621/1564621), done.

    real 3m19.005s

  After:
    remote: Counting objects: 1601, done
    remote: Counting objects: 1828460, done
    remote: Finding sources: 100% (50475/50475)
    remote: Getting sizes: 100% (18843/18843)
    remote: Compressing objects: 100% (7585/7585)
    remote: Total 1861830 (delta 2407), reused 1856197 (delta 37510)
    Receiving objects: 100% (1861830/1861830), 378.40 MiB | 31.31 MiB/s, done.
    Resolving deltas: 100% (1559477/1559477), done.

    real 2m2.938s
Repository owners can periodically refresh their cached packs by
repacking their repository, folding all newer objects into a larger
cached pack. Since repacking is already considered to be a normal
Git maintenance activity, this isn't a very big burden.
[1] In this test $root was set back about two weeks.
Change-Id: Ib87131d5c4b5e8c5cacb0f4fe16ff4ece554734b
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
CGit pack-objects displays a totals line after the pack data
has been fully written. This can be useful to understand some of
the decisions made by the packer, and has been a great tool
for helping to debug some of that code.
Track some of the basic values, and send it to the client when
packing is done:
  remote: Counting objects: 1826776, done
  remote: Finding sources: 100% (55121/55121)
  remote: Getting sizes: 100% (25654/25654)
  remote: Compressing objects: 100% (11434/11434)
  remote: Total 1861830 (delta 3926), reused 1854705 (delta 38306)
  Receiving objects: 100% (1861830/1861830), 386.03 MiB | 30.32 MiB/s, done.
Change-Id: If3b039017a984ed5d5ae80940ce32bda93652df5
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
There is no point in pushing all of the files within the edge
commits into the delta search when making a thin pack. This floods
the delta search window with objects that are unlikely to be useful
bases for the objects that will be written out, resulting in lower
data compression and higher transfer sizes.
Instead observe the path of a tree or blob that is being pushed
into the outgoing set, and use that path to locate up to WINDOW
ancestor versions from the edge commits. Push only those objects
into the edgeObjects set, reducing the number of objects seen by the
search window. This allows PackWriter to only look at ancestors
for the modified files, rather than all files in the project.
Limiting the search to WINDOW size makes sense, because more than
WINDOW edge objects will just skip through the window search as
none of them need to be delta compressed.
To further improve compression, sort edge objects into the front
of the window list, rather than randomly throughout. This puts
non-edges later in the window and gives them a better chance at
finding their base, since they search backwards through the window.
These changes make a significant difference in the thin-pack:
  Before:
    remote: Counting objects: 144190, done
    remote: Finding sources: 100% (50275/50275)
    remote: Getting sizes: 100% (101405/101405)
    remote: Compressing objects: 100% (7587/7587)
    Receiving objects: 100% (50275/50275), 24.67 MiB | 9.90 MiB/s, done.
    Resolving deltas: 100% (40339/40339), completed with 2218 local objects.

    real 0m30.267s

  After:
    remote: Counting objects: 61549, done
    remote: Finding sources: 100% (50275/50275)
    remote: Getting sizes: 100% (18862/18862)
    remote: Compressing objects: 100% (7588/7588)
    Receiving objects: 100% (50275/50275), 11.04 MiB | 3.51 MiB/s, done.
    Resolving deltas: 100% (43160/43160), completed with 5014 local objects.

    real 0m22.170s
The resulting pack is 13.63 MiB smaller, even though it contains the
same exact objects. 82,543 fewer objects had to have their sizes
looked up, which saved about 8s of server CPU time. 2,796 more
objects from the client were used as part of the base object set,
which contributed to the smaller transfer size.
Change-Id: Id01271950432c6960897495b09deab70e33993a9
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>
Some of this code predates making ObjectId.equals() final
and fixing RevObject.equals() to match ObjectId.equals().
It was therefore more complex than it needs to be, because
it tried to work around RevObject's broken equals() rules
by converting to ObjectId in a different collection.
Also combine setUpWalker() and findObjectsToPack() methods,
these can be one method and the code is actually cleaner.
Change-Id: I0f4cf9997cd66d8b6e7f80873979ef1439e507fe
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>
The first 'Compressing objects' progress message is wrong; it's
actually PackWriter looking up the sizes of each object in the
ObjectDatabase, so objects can be sorted correctly in the later
type-size sort that tries to take advantage of "Linus' Law" to
improve delta compression.
Rename the progress to say 'Getting sizes', which is an accurate
description of what it is doing.
Change-Id: Ida0a052ad2f6e994996189ca12959caab9e556a3
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>
When compressing objects, don't include the edges in the progress
meter. These cost almost no CPU time as they are simply pushed into
and popped out of the delta search window.
Change-Id: I7ea19f0263e463c65da34a7e92718c6db1d4a131
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>
Revert "Teach PackWriter how to reuse an existing object list"
This reverts commit f5fe2dca3c.
I regret adding this feature to the public API. Caches aren't always
the best idea, as they require work to maintain. Here the cache is
redundant information that must be computed, and when it grows stale
must be removed. The redundant information takes up more disk space,
about the same size as the pack-*.idx files are. For the linux-2.6
repository, that's more than 40 MB for a 400 MB repository. So the
cache is a 10% increase in disk usage.
The entire point of this cache is to improve PackWriter performance,
and only PackWriter performance, and only when sending an initial
clone to a new client. There may be better ways to optimize this, and
until we have a solid solution, we shouldn't be using a separate cache
in JGit.
Teach PackWriter how to reuse an existing object list
Counting the objects needed for packing is the most expensive part of
an UploadPack request that has no uninteresting objects (otherwise
known as an initial clone). During this phase the PackWriter is
enumerating the entire set of objects in this repository, so they can
be sent to the client for their new clone.
Allow the ObjectReader (and therefore the underlying storage system)
to keep a cached list of all reachable objects from a small number of
points in the project's history. If one of those points is reached
during enumeration of the commit graph, most objects are obtained from
the cached list instead of direct traversal.
PackWriter uses the list by discarding the current object lists and
restarting a traversal from all refs but marking the object list name
as uninteresting. This allows PackWriter to enumerate all objects
that are more recent than the list creation, or that were on side
branches that the list does not include.
However, ObjectWalk tags all of the trees and commits within the list
commit as UNINTERESTING, which would normally cause PackWriter to
construct a thin pack that excludes these objects. To avoid that,
addObject() was refactored to allow this list-based enumeration to
always include an object, even if it has been tagged UNINTERESTING by
the ObjectWalk. This implies the list-based enumeration may only be
used for initial clones, where all objects are being sent.
The UNINTERESTING labeling occurs because StartGenerator always
enables the BoundaryGenerator if the walker is an ObjectWalk and a
commit was marked UNINTERESTING, even if RevSort.BOUNDARY was not
enabled. This is the default reasonable behavior for an ObjectWalk,
but isn't desired here in PackWriter with the list-based enumeration.
Rather than trying to change all of this behavior, PackWriter works
around it.
Because the list name commit's immediate files and trees were all
enumerated before the list enumeration itself starts (and are also
within the list itself), PackWriter runs the risk of adding the same
objects to its ObjectIdSubclassMap twice. Since this breaks the
internal map data structure (and also may cause the object to transmit
twice), PackWriter needs to use a new "added" RevFlag to track whether
or not an object has been put into the outgoing list yet.
Change-Id: Ie99ed4d969a6bb20cc2528ac6b8fb91043cee071
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
PackWriter: Use TOPO order only for incremental packs
When performing an initial clone of a repository there are no
uninteresting commits, and the resulting pack will be completely
self-contained. Therefore PackWriter does not need to honor C
Git standard TOPO ordering as described in JGit commit ba984ba2e0
("Fix checkReferencedIsReachable to use correct base list").
Switching to COMMIT_TIME_DESC when there are no uninteresting commits
allows the "Counting objects" phase to emit progress earlier, as the
RevWalk will not buffer the commit list. When TOPO is set the RevWalk
enumerates all commits first, before outputting any for PackWriter to
mark progress updates from.
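The choice reduces to one sort flag on the walker; a sketch:

  import org.eclipse.jgit.revwalk.ObjectWalk;
  import org.eclipse.jgit.revwalk.RevSort;

  class OrderChoice {
    // TOPO buffers every commit before emitting the first, so pay
    // that cost only when an incremental pack needs the C Git
    // compatible base list.
    static void apply(ObjectWalk walker, boolean haveUninteresting) {
      if (haveUninteresting)
        walker.sort(RevSort.TOPO);
      else
        walker.sort(RevSort.COMMIT_TIME_DESC);
    }
  }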
Change-Id: If2b6a9903b536c7fb3c45f85d0a67ff6c6e66f22
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Don't permit transient worker threads to access the underlying output
stream of a ProgressMonitor, as they might get marked as the stream's
writer thread. Instead proxy update events from the workers back onto
the application's real work thread. This ensures that the stream only
sees a single thread, and it's the thread that will remain alive for
the entire life cycle of the operation.
This fixes IOException("Write end dead") during local repository fetch
when threaded delta search is enabled. One of the transient delta
search threads became the designated writer for the pipe, and when it
terminated the reader end thought the writer was dead, even though the
main writer thread was still executing in PackWriter.
Bug: 326557
Change-Id: I01d1b20a3d7be1c0b480c7fb5c9773c161fe5c15
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Fix checkReferencedIsReachable to use correct base list
When checkReferencedIsReachable is set in ReceivePack we are trying
to prove that the push client is permitted to access an object that
it did not send to us, but that the received objects link to either
via a link inside of an object (e.g. commit parent pointer or tree
member) or by a delta base reference.
To do this check we are making a list of every potential delta base,
and then ensuring that every delta base used appears on this list.
If a delta base does not appear on this list, we abort with an error,
letting the client know we are missing a particular object.
Preventing spurious errors about missing delta base objects requires
us to use the exact same list of potential delta bases as the remote
push client used. This means we must use TOPO ordering, and we
need to enable BOUNDARY sorting so that ObjectWalk will correctly
include any trees found during the enumeration back to the common
merge base between the interesting and uninteresting heads.
To ensure JGit's own push client matches this same potential delta
base list, we need to undo 60aae90d4d ("Disable topological
sorting in PackWriter") and switch back to using the conventional
TOPO ordering for commits in a pack file. This ensures that our
own push client will use the same potential base object list as
checkReferencedIsReachable uses on the receiving side.
Change-Id: I14d0a326deb62a43f987b375cfe519711031e172
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Use limited getCachedBytes code to reduce duplication
Rather than duplicating this block everywhere, reuse the limited size
form of getCachedBytes to acquire the content of an object.
Change-Id: I2e26a823e6fd0964d8f8dbfaa0fc2e8834c179c1
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>
Allow ObjectReuseAsIs to have more control over write ordering
The reuse system used by an object database may be able to benefit
from knowing what objects are coming next, and even improve data
throughput by delaying (or moving up) objects that are stored near
each other in the source database.
Pushing the iteration down into the reuse code makes it possible
for a smarter implementation to aggregate reuse. But for the
standard pack file format on disk we don't bother; it's quite
efficient already.
Change-Id: I64f0048ca7071a8b44950d6c2a5dfbca3be6bba6
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
An ObjectReader implementation may be very slow for a single object,
yet support bulk queries efficiently by batching multiple small
requests into a single larger request. This easily happens when the
reader is built on top of a database that is stored on another host,
as the network round-trip time starts to dominate the operation cost.
RevWalk, ObjectWalk, UploadPack and PackWriter are the first major
users of this new bulk interface, with the goal being to support an
efficient way to pack a repository for a fetch/clone client when the
source repository is stored in a high-latency storage system.
Processing the want/have lists is now done in bulk, to remove
the high costs associated with common ancestor negotiation.
PackWriter already performs object reuse selection in bulk, but it
now can also do the object size lookup and object counting phases
with higher efficiency. Actual object reuse, deltification, and
final output are still doing sequential lookups, making them a bit
more expensive to perform.
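For example, the size lookup phase can drain one batched queue instead
of issuing a call per object (sketch against the new bulk interface):

  import java.io.IOException;

  import org.eclipse.jgit.lib.AsyncObjectSizeQueue;
  import org.eclipse.jgit.lib.ObjectId;
  import org.eclipse.jgit.lib.ObjectReader;

  class BulkSizes {
    static long total(ObjectReader reader, Iterable<ObjectId> ids)
        throws IOException {
      long sum = 0;
      AsyncObjectSizeQueue<ObjectId> queue =
          reader.getObjectSize(ids, false);
      try {
        while (queue.next())
          sum += queue.getSize(); // no per-object round-trip
      } finally {
        queue.release();
      }
      return sum;
    }
  }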
Change-Id: I4c966f84917482598012074c370b9831451404ee
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
ObjectReader implementations may wish to use multiple threads in
order to evaluate object reuse faster. Let the reader make that
decision by passing the iteration down into the reader.
Because the work is pushed into the reader, it may need to locate a
given ObjectToPack given its ObjectId. This can easily occur if the
reader has sent a list of ObjectIds to the object database and gets
back information keyed only by ObjectId, without the ObjectToPack
handle. Expose lookup using the PackWriter's own internal map,
so the reader doesn't need to build a redundant copy to track the
association of ObjectId back to ObjectToPack.
Change-Id: I0c536405a55034881fb5db92a2d2a99534faed34
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
When the output stream is deeply buffered (e.g. 1 MiB or more in
an HTTP servlet on some containers) trying to kick out the header
earlier will prevent the client from stalling hard while the first
1 MiB is received and it can process the pack header. Forcing a
flush here lets the client see the header and start its progress
monitor for "Receiving objects: (1/N)" so the user knows there
is still activity occurring, even though the buffering may cause
there to be some lag as the buffer fills up on the sending side.
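The fix is essentially a flush immediately after the header bytes; in
sketch form:

  import java.io.IOException;
  import java.io.OutputStream;

  class HeaderFlush {
    // Push the 12-byte pack header through any deep buffering so the
    // client can start its "Receiving objects" meter right away.
    static void writeHeader(OutputStream out, byte[] header)
        throws IOException {
      out.write(header);
      out.flush();
    }
  }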
Change-Id: I3edf39e8f703fe87a738dc236d426b194db85e3a
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
This refactoring permits applications to configure global per-process
settings for all packing and easily pass it through to per-request
PackWriters, ensuring that the process configuration overrides the
repository specific settings.
For example this might help in a daemon environment where the server
wants to cap the resources used to serve a dynamic upload pack
request, even though the repository's own pack.* settings might be
configured to be more aggressive. This allows fast but less bandwidth
efficient serving of clients, while still retaining good compression
through a cron managed `git gc`.
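For instance, a daemon might build one process-wide config and hand it
to every request (values illustrative):

  import org.eclipse.jgit.storage.pack.PackConfig;

  class ServerPackDefaults {
    // Trade compression for CPU on a busy server, regardless of each
    // repository's own pack.* settings.
    static PackConfig create() {
      PackConfig config = new PackConfig();
      config.setDeltaSearchWindowSize(10);
      config.setCompressionLevel(1);
      return config;
    }
  }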
Change-Id: I58cc5e01b48924b1a99f79aa96c8150cdfc50846
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
These need to be dynamic based on the current thread's environment
at time of execution in order to be properly localized for the end
user that will be seeing these messages.
Change-Id: I4976f462cfe606edd2761c0e36b2f6b20f63d53c
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Allow PackWriter callers to manage the thread pool
By permitting the caller of PackWriter to select the Executor it
uses for task execution, we give the caller the ability to manage
the lifecycle of the thread pool, including reusing it across
concurrent pack generators.
This is the first step to supporting application thread pools
within Daemon or another managed service like Gerrit Code Review.
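Sharing a pool across packers then looks roughly like this (pool size
illustrative):

  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;

  import org.eclipse.jgit.storage.pack.PackConfig;

  class SharedPool {
    static PackConfig configure() {
      ExecutorService pool = Executors.newFixedThreadPool(4);
      PackConfig config = new PackConfig();
      config.setExecutor(pool); // reused across concurrent packers
      return config; // the caller shuts the pool down at exit
    }
  }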
Change-Id: I96bee7b9c30ff9885f2bd261d0b6daaac713b5a4
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
There were some broken links, incorrect uses of @value, an invalid
tag and an outdated comment.
Change-Id: I22886bcc869a4b62bd606ebed40669f7b4723664
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>