
PackWriter.java 79KB

Implement delta generation during packing

PackWriter now produces new deltas if there is not a suitable delta
available for reuse from an existing pack file. This permits JGit to
send less data on the wire by sending a delta relative to an object
the other side already has, instead of sending the whole object.

The delta searching algorithm is similar in style to what C Git uses,
but apparently has some differences (see below for more on this).
Briefly, objects that should be considered for delta compression are
pushed onto a list. This list is then sorted by a rough similarity
score, which is derived from the path name the object was discovered
at in the repository during object counting. The list is then walked
in order.

At each position in the list, up to $WINDOW objects prior to it are
attempted as delta bases. Each object in the window is tried, and the
shortest delta instruction sequence selects the base object. Some
rough rules are used to prevent pathological behavior during this
matching phase, like skipping pairings of objects that are not similar
enough in size.

PackWriter intentionally excludes commits and annotated tags from this
new delta search phase. In the JGit repository only 28 out of 2600+
commits can be delta compressed by C Git. As the commit count tends to
be a fair percentage of the total number of objects in the repository,
and they generally do not delta compress well, skipping over them can
improve performance with little increase in the output pack size.

Because this implementation was rebuilt from scratch based on my own
memory of how the packing algorithm has evolved over the years in
C Git, PackWriter, DeltaWindow, and DeltaEncoder don't use exactly the
same rules everywhere, and that leads JGit to produce different (but
logically equivalent) pack files.

  Repository | Pack Size (bytes)                | Packing Time
             | JGit     - CGit     = Difference | JGit    / CGit
  -----------+----------------------------------+------------------
  git        | 25094348 - 24322890 = +771458    | 59.434s / 59.133s
  jgit       |  5669515 -  5709046 =  -39531    |  6.654s /  6.806s
  linux-2.6  |     389M -     386M =     +3M    | 20m02s  / 18m01s

For the above tests pack.threads was set to 1, window size=10, delta
depth=50, and delta and object reuse were disabled for both
implementations. Both implementations were reading from an already
fully packed repository on local disk. The running time reported is
after 1 warm-up run of the tested implementation.

PackWriter is writing 771 KiB more data on git.git, 3M more on
linux-2.6, but is actually 39.5 KiB smaller on jgit.git. Being larger
by less than 0.7% on linux-2.6 isn't bad, nor is taking an extra
2 minutes to pack. On the running time side, JGit is at a major
disadvantage because linux-2.6 doesn't fit into the default
WindowCache of 20M, while C Git is able to mmap the entire pack and
have it available instantly in physical memory (assuming hot cache).

CGit also has a feature where it caches deltas that were created
during the compression phase, and uses those cached deltas during the
writing phase. PackWriter does not implement this (yet), and therefore
must create every delta twice. This could easily account for the
increased running time we are seeing.

Change-Id: I6292edc66c2e95fbe45b519b65fdb3918068889c
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 years ago
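The window search described in the delta generation message can be sketched roughly as below. This is only an illustration under stated assumptions, not JGit's DeltaWindow or DeltaEncoder: the class name `DeltaWindowSketch`, the `toyDeltaCost` stand-in (bytes left over after a shared prefix and suffix), and the "no more than 2x apart in size" rule are all hypothetical choices made to keep the example short.

```java
import java.util.Arrays;

/** Toy sliding-window delta base search; names and rules here are hypothetical. */
final class DeltaWindowSketch {
    /** Stand-in cost: bytes of target not covered by the prefix/suffix shared with base. */
    static int toyDeltaCost(byte[] base, byte[] target) {
        int max = Math.min(base.length, target.length);
        int prefix = 0;
        while (prefix < max && base[prefix] == target[prefix]) {
            prefix++;
        }
        int suffix = 0;
        while (suffix < max - prefix
                && base[base.length - 1 - suffix] == target[target.length - 1 - suffix]) {
            suffix++;
        }
        return target.length - prefix - suffix;
    }

    /**
     * Walks the (already sorted) list in order. For each object, tries up to
     * `window` earlier objects as bases and keeps the cheapest one.
     * Returns the chosen base index per object, or -1 to store it whole.
     */
    static int[] chooseBases(byte[][] objs, int window) {
        int[] base = new int[objs.length];
        Arrays.fill(base, -1);
        for (int i = 0; i < objs.length; i++) {
            int best = objs[i].length; // a delta must beat storing the whole object
            for (int j = Math.max(0, i - window); j < i; j++) {
                // Assumed rough rule: skip pairs whose sizes differ by more than 2x.
                if (objs[j].length * 2 < objs[i].length
                        || objs[i].length * 2 < objs[j].length) {
                    continue;
                }
                int cost = toyDeltaCost(objs[j], objs[i]);
                if (cost < best) {
                    best = cost;
                    base[i] = j;
                }
            }
        }
        return base;
    }
}
```

Shrinking `window` makes the search cheaper but can hide a good base that sits further back in the list, which is why the sort order of the list matters so much.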
PackWriter: Make thin packs more efficient

There is no point in pushing all of the files within the edge commits
into the delta search when making a thin pack. This floods the delta
search window with objects that are unlikely to be useful bases for
the objects that will be written out, resulting in lower data
compression and higher transfer sizes.

Instead observe the path of a tree or blob that is being pushed into
the outgoing set, and use that path to locate up to WINDOW ancestor
versions from the edge commits. Push only those objects into the
edgeObjects set, reducing the number of objects seen by the search
window. This allows PackWriter to only look at ancestors for the
modified files, rather than all files in the project. Limiting the
search to WINDOW size makes sense, because more than WINDOW edge
objects will just skip through the window search as none of them need
to be delta compressed.

To further improve compression, sort edge objects into the front of
the window list, rather than randomly throughout. This puts non-edges
later in the window and gives them a better chance at finding their
base, since they search backwards through the window.

These changes make a significant difference in the thin-pack:

Before:
  remote: Counting objects: 144190, done
  remote: Finding sources: 100% (50275/50275)
  remote: Getting sizes: 100% (101405/101405)
  remote: Compressing objects: 100% (7587/7587)
  Receiving objects: 100% (50275/50275), 24.67 MiB | 9.90 MiB/s, done.
  Resolving deltas: 100% (40339/40339), completed with 2218 local objects.

  real    0m30.267s

After:
  remote: Counting objects: 61549, done
  remote: Finding sources: 100% (50275/50275)
  remote: Getting sizes: 100% (18862/18862)
  remote: Compressing objects: 100% (7588/7588)
  Receiving objects: 100% (50275/50275), 11.04 MiB | 3.51 MiB/s, done.
  Resolving deltas: 100% (43160/43160), completed with 5014 local objects.

  real    0m22.170s

The resulting pack is 13.63 MiB smaller, even though it contains the
same exact objects. 82,543 fewer objects had to have their sizes
looked up, which saved about 8s of server CPU time. 2,796 more objects
from the client were used as part of the base object set, which
contributed to the smaller transfer size.

Change-Id: Id01271950432c6960897495b09deab70e33993a9
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>
13 years ago
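A minimal sketch of the edge selection described in the thin pack message, assuming object ids and paths are plain strings. The class `ThinPackEdgeSketch`, the method `searchList`, and the map shapes are hypothetical and are not PackWriter's actual data structures; the point is only that edges are looked up by the paths of outgoing objects, capped at WINDOW per path, and placed at the front of the list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration of path-driven edge selection for a thin pack. */
final class ThinPackEdgeSketch {
    /**
     * @param outgoingIdToPath   objects being sent, mapped to the path they were found at
     * @param edgeIdsByPath      ancestor versions available in the edge commits, per path
     * @param window             maximum number of edge versions to take per path
     * @return delta search list with the selected edge objects sorted to the front
     */
    static List<String> searchList(Map<String, String> outgoingIdToPath,
            Map<String, List<String>> edgeIdsByPath, int window) {
        Set<String> edges = new LinkedHashSet<>();
        // Only paths that are actually modified contribute edge objects.
        for (String path : new LinkedHashSet<>(outgoingIdToPath.values())) {
            List<String> versions =
                    edgeIdsByPath.getOrDefault(path, Collections.<String>emptyList());
            edges.addAll(versions.subList(0, Math.min(window, versions.size())));
        }
        // Edges first, so outgoing objects searching backwards can reach them.
        List<String> list = new ArrayList<>(edges);
        list.addAll(outgoingIdToPath.keySet());
        return list;
    }
}
```

Edge objects at unmodified paths never enter the list, which is what keeps the window from being flooded with unhelpful bases.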
PackWriter: Support reuse of entire packs

The most expensive part of packing a repository for transport to
another system is enumerating all of the objects in the repository.
Once this gets to the size of the linux-2.6 repository (1.8 million
objects), enumeration can take several CPU minutes and costs a lot of
temporary working set memory.

Teach PackWriter to efficiently reuse an existing "cached pack" by
answering a clone request with a thin pack followed by a larger cached
pack appended to the end. This requires the repository owner to first
construct the cached pack by hand, and record the tip commits inside
of $GIT_DIR/objects/info/cached-packs:

  cd $GIT_DIR
  root=$(git rev-parse master)
  tmp=objects/.tmp-$$
  names=$(echo $root | git pack-objects --keep-true-parents --revs $tmp)
  for n in $names; do
    chmod a-w $tmp-$n.pack $tmp-$n.idx
    touch objects/pack/pack-$n.keep
    mv $tmp-$n.pack objects/pack/pack-$n.pack
    mv $tmp-$n.idx objects/pack/pack-$n.idx
  done

  (echo "+ $root";
   for n in $names; do echo "P $n"; done;
   echo) >>objects/info/cached-packs

  git repack -a -d

When a clone request needs to include $root, the corresponding cached
pack will be copied as-is, rather than enumerating all of the objects
that are reachable from $root.

For a linux-2.6 kernel repository that should be about 376 MiB, the
above process creates two packs of 368 MiB and 38 MiB[1]. This is a
local disk usage increase of ~26 MiB, due to reduced delta compression
between the large cached pack and the smaller recent activity pack.
The overhead is similar to 1 full copy of the compressed project
sources.

With this cached pack in hand, JGit daemon completes a clone request
in 1m17s less time, but with a slightly larger data transfer
(+2.39 MiB):

Before:
  remote: Counting objects: 1861830, done
  remote: Finding sources: 100% (1861830/1861830)
  remote: Getting sizes: 100% (88243/88243)
  remote: Compressing objects: 100% (88184/88184)
  Receiving objects: 100% (1861830/1861830), 376.01 MiB | 19.01 MiB/s, done.
  remote: Total 1861830 (delta 4706), reused 1851053 (delta 1553844)
  Resolving deltas: 100% (1564621/1564621), done.

  real    3m19.005s

After:
  remote: Counting objects: 1601, done
  remote: Counting objects: 1828460, done
  remote: Finding sources: 100% (50475/50475)
  remote: Getting sizes: 100% (18843/18843)
  remote: Compressing objects: 100% (7585/7585)
  remote: Total 1861830 (delta 2407), reused 1856197 (delta 37510)
  Receiving objects: 100% (1861830/1861830), 378.40 MiB | 31.31 MiB/s, done.
  Resolving deltas: 100% (1559477/1559477), done.

  real    2m2.938s

Repository owners can periodically refresh their cached packs by
repacking their repository, folding all newer objects into a larger
cached pack. Since repacking is already considered to be a normal Git
maintenance activity, this isn't a very big burden.

[1] In this test $root was set back about two weeks.

Change-Id: Ib87131d5c4b5e8c5cacb0f4fe16ff4ece554734b
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
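The shell snippet in the cached pack message appends records of the form `+ <tip>`, one or more `P <pack name>` lines, and a terminating blank line. A reader for that layout might look like the sketch below. The record format is inferred only from that snippet, and `CachedPackListSketch`, `Entry`, and `parse` are hypothetical names; JGit's real reader is not shown in the text and may behave differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parser for the "+ tip / P pack / blank line" records shown above. */
final class CachedPackListSketch {
    static final class Entry {
        final List<String> tips = new ArrayList<>();
        final List<String> packNames = new ArrayList<>();
    }

    static List<Entry> parse(String text) {
        List<Entry> out = new ArrayList<>();
        Entry cur = null;
        for (String line : text.split("\n", -1)) {
            if (line.isEmpty()) {
                // A blank line closes the record in progress, if any.
                if (cur != null) {
                    out.add(cur);
                    cur = null;
                }
                continue;
            }
            if (cur == null) {
                cur = new Entry();
            }
            if (line.startsWith("+ ")) {
                cur.tips.add(line.substring(2));
            } else if (line.startsWith("P ")) {
                cur.packNames.add(line.substring(2));
            }
            // Any other line type is ignored in this sketch.
        }
        if (cur != null) {
            out.add(cur); // tolerate a missing final blank line
        }
        return out;
    }
}
```

A server could then check whether a clone request's wants include a recorded tip and, if so, append the listed packs as-is instead of enumerating their objects.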
Support creating pack bitmap indexes in PackWriter.

Update the PackWriter to support writing out pack bitmap indexes, a
parallel ".bitmap" file to the ".pack" file.

Bitmaps are selected at commits every 1 to 5,000 commits for each
unique path from the start. The most recent 100 commits are all
bitmapped. The next 19,000 commits have a bitmap every 100 commits.
The remaining commits have a bitmap every 5,000 commits. Commits with
more than 1 parent are preferred over ones with 1 or fewer.
Furthermore, previously computed bitmaps are reused, if the previous
entry had the reuse flag set, which is set when the bitmap was placed
at the max allowed distance.

Bitmaps are used to speed up the counting phase when packing, for
requests that are not shallow. The PackWriterBitmapWalker uses a
RevFilter to proactively mark commits with RevFlag.SEEN, when they
appear in a bitmap. The walker produces the full closure of reachable
ObjectIds, given the collection of starting ObjectIds.

For fetch requests, two ObjectWalks are executed to compute the
ObjectIds reachable from the haves and from the wants. The ObjectIds
needed to be written are determined by taking all the resulting wants
AND NOT the haves.

For clone requests, we get cached pack support for "free" since it is
possible to determine if all of the ObjectIds in a pack file are
included in the resulting list of ObjectIds to write.

On my machine, the best times for clones and fetches of the linux
kernel repository (with about 2.6M objects and 300K commits) are
tabulated below:

  Operation                    Index V2                Index VE003
  Clone                        37530ms (524.06 MiB)       82ms (524.06 MiB)
  Fetch (1 commit back)           75ms                    107ms
  Fetch (10 commits back)        456ms (269.51 KiB)      341ms (265.19 KiB)
  Fetch (100 commits back)       449ms (269.91 KiB)      337ms (267.28 KiB)
  Fetch (1000 commits back)     2229ms ( 14.75 MiB)      189ms ( 14.42 MiB)
  Fetch (10000 commits back)    2177ms ( 16.30 MiB)      254ms ( 15.88 MiB)
  Fetch (100000 commits back)  14340ms (185.83 MiB)     1655ms (189.39 MiB)

Change-Id: Icdb0cdd66ff168917fb9ef17b96093990cc6a98d
11 years ago
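Two ideas from the bitmap message are easy to illustrate: the spacing between selected commits, and the "wants AND NOT haves" set computation. The sketch below uses `java.util.BitSet` purely as a stand-in, since the bitmap classes PackWriter really uses are not shown in the text. `BitmapSketch`, `spacingAt`, and `toSend` are hypothetical names, each bit position is assumed to identify one object, and the spacing function ignores the merge-commit preference and the reuse flag.

```java
import java.util.BitSet;

/** Hypothetical illustration of bitmap spacing and the wants/haves set difference. */
final class BitmapSketch {
    /** Approximate gap between bitmapped commits, by distance from the newest commit. */
    static int spacingAt(int commitsFromTip) {
        if (commitsFromTip < 100) {
            return 1; // the most recent 100 commits are all bitmapped
        }
        if (commitsFromTip < 100 + 19_000) {
            return 100; // the next 19,000 commits: one bitmap every 100
        }
        return 5_000; // everything older: one bitmap every 5,000
    }

    /** Objects to write: reachable from the wants, but not reachable from the haves. */
    static BitSet toSend(BitSet wants, BitSet haves) {
        BitSet result = (BitSet) wants.clone(); // leave the caller's bitmap untouched
        result.andNot(haves);
        return result;
    }
}
```

Once both closures exist as bitmaps, the set difference is a single pass over machine words rather than a graph walk, which is consistent with the large drop in clone time in the table above.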
ObjectIdOwnerMap: More lightweight map for ObjectIds

OwnerMap is about 200 ms faster than SubclassMap, more friendly to the
GC, and uses less storage: testing the "Counting objects" part of
PackWriter on 1886362 objects:

  ObjectIdSubclassMap:
    load factor 50%
    table: 4194304 (wasted 2307942)
    ms spent 36998 36009 34795 34703 34941 35070 34284 34511 34638 34256
    ms avg 34800 (last 9 runs)

  ObjectIdOwnerMap:
    load factor 100%
    table: 2097152 (wasted 210790)
    directory: 1024
    ms spent 36842 35112 34922 34703 34580 34782 34165 34662 34314 34140
    ms avg 34597 (last 9 runs)

The major difference with OwnerMap is entries must extend from
ObjectIdOwnerMap.Entry, where the OwnerMap has injected its own
private "next" field into each object. This allows the OwnerMap to use
a singly linked list for chaining collisions within a bucket. By
putting collisions in a linked list, we gain the entire table back for
the SHA-1 bits to index their own "private" slot.

Unfortunately this means that each object can appear in at most ONE
OwnerMap, as there is only one "next" field within the object instance
to thread into the map. For types that are very object map heavy like
RevWalk (entity RevObject) and PackWriter (entity ObjectToPack) this
is sufficient, as these entity types are only put into one map by
their container. By introducing a new map type, we don't break
existing applications that might be trying to use ObjectIdSubclassMap
to track RevCommits they obtained from a RevWalk.

The OwnerMap uses less memory. Each object uses 1 reference more (so
we're up 1,886,362 references), but the table is 1/2 the size (2^21
rather than 2^22). The table itself wastes only 210,790 slots, rather
than 2,307,942. So OwnerMap is wasting 200k fewer references.

OwnerMap is more friendly to the GC, because it hardly ever generates
garbage. As the map reaches its 100% load factor target, it doubles in
size by allocating additional segment arrays of 2048 entries. (So the
first grow allocates 1 segment, second 2 segments, third 4 segments,
etc.) These segments are hooked into the pre-allocated directory of
1024 spaces. This permits the map to grow to 2 million objects before
the directory itself has to grow. By using segments of 2048 entries,
we are asking the GC to acquire 8,204 bytes in a 32 bit JVM. This is
easier to satisfy than 2,307,942 bytes (for the 512k table that is
just an intermediate step in the SubclassMap). By reusing the
previously allocated segments (they are re-hashed in-place) we don't
release any memory during a table grow.

When the directory grows, it does so by discarding the old one and
using one that is 4x larger (so the directory goes to 4096 entries on
its first grow). A directory of size 4096 can handle up to 8 million
objects. The second directory grow (16384) goes to 33 million objects.
At that point we're starting to really push the limits of the JVM
heap, but at least it's many small arrays. Previously SubclassMap
would need a table of 67108864 entries to handle that object count,
which needs a single contiguous allocation of 256 MiB. That's hard to
come by in a 32 bit JVM. Instead OwnerMap uses 8192 arrays of about
8 KiB each. This is much easier to fit into a fragmented heap.

Change-Id: Ia4acf5cfbf7e9b71bc7faa0db9060f6a969c0c50
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
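The intrusive chaining idea from the OwnerMap message (entries carry their own "next" link, so collisions chain through the objects themselves) can be sketched as follows. This is a simplified illustration, not ObjectIdOwnerMap: it keys on a `String` id instead of SHA-1 bits, uses one flat table instead of the directory of 2048-entry segments, and omits duplicate handling. `OwnerMapSketch` and its members are hypothetical names.

```java
/** Hypothetical map whose entries supply the collision chain link themselves. */
final class OwnerMapSketch<V extends OwnerMapSketch.Entry> {
    /** Base type for stored objects. One link field means one owning map per object. */
    static class Entry {
        final String id;
        private Entry next; // threaded into the owning map's bucket chain

        Entry(String id) {
            this.id = id;
        }
    }

    private Entry[] table = new Entry[4];
    private int size;

    @SuppressWarnings("unchecked")
    V get(String id) {
        for (Entry e = table[index(id, table.length)]; e != null; e = e.next) {
            if (e.id.equals(id)) {
                return (V) e;
            }
        }
        return null;
    }

    /** Adds an entry; the caller must ensure the id is not already present. */
    void add(V entry) {
        if (size == table.length) {
            grow(); // 100% load factor: grow only when every slot could be used
        }
        int i = index(entry.id, table.length);
        ((Entry) entry).next = table[i];
        table[i] = entry;
        size++;
    }

    int size() {
        return size;
    }

    /** Doubles the table and relinks existing entries; no wrapper nodes are allocated. */
    private void grow() {
        Entry[] old = table;
        table = new Entry[old.length * 2];
        for (Entry head : old) {
            for (Entry e = head; e != null;) {
                Entry following = e.next;
                int i = index(e.id, table.length);
                e.next = table[i];
                table[i] = e;
                e = following;
            }
        }
    }

    private static int index(String id, int length) {
        return (id.hashCode() & 0x7fffffff) % length;
    }
}
```

Because the link lives inside the entry, inserting allocates nothing beyond the table itself, which is the GC-friendliness the message describes; the cost is that an entry added to a second map would corrupt the first map's chains.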
PackWriter: Make thin packs more efficient There is no point in pushing all of the files within the edge commits into the delta search when making a thin pack. This floods the delta search window with objects that are unlikely to be useful bases for the objects that will be written out, resulting in lower data compression and higher transfer sizes. Instead observe the path of a tree or blob that is being pushed into the outgoing set, and use that path to locate up to WINDOW ancestor versions from the edge commits. Push only those objects into the edgeObjects set, reducing the number of objects seen by the search window. This allows PackWriter to only look at ancestors for the modified files, rather than all files in the project. Limiting the search to WINDOW size makes sense, because more than WINDOW edge objects will just skip through the window search as none of them need to be delta compressed. To further improve compression, sort edge objects into the front of the window list, rather than randomly throughout. This puts non-edges later in the window and gives them a better chance at finding their base, since they search backwards through the window. These changes make a significant difference in the thin-pack: Before: remote: Counting objects: 144190, done remote: Finding sources: 100% (50275/50275) remote: Getting sizes: 100% (101405/101405) remote: Compressing objects: 100% (7587/7587) Receiving objects: 100% (50275/50275), 24.67 MiB | 9.90 MiB/s, done. Resolving deltas: 100% (40339/40339), completed with 2218 local objects. real 0m30.267s After: remote: Counting objects: 61549, done remote: Finding sources: 100% (50275/50275) remote: Getting sizes: 100% (18862/18862) remote: Compressing objects: 100% (7588/7588) Receiving objects: 100% (50275/50275), 11.04 MiB | 3.51 MiB/s, done. Resolving deltas: 100% (43160/43160), completed with 5014 local objects. real 0m22.170s The resulting pack is 13.63 MiB smaller, even though it contains the same exact objects. 
82,543 fewer objects had to have their sizes looked up, which saved about 8s of server CPU time. 2,796 more objects from the client were used as part of the base object set, which contributed to the smaller transfer size. Change-Id: Id01271950432c6960897495b09deab70e33993a9 Signed-off-by: Shawn O. Pearce <spearce@spearce.org> Sigend-off-by: Chris Aniszczyk <caniszczyk@gmail.com>
13 years ago
PackWriter: Make thin packs more efficient There is no point in pushing all of the files within the edge commits into the delta search when making a thin pack. This floods the delta search window with objects that are unlikely to be useful bases for the objects that will be written out, resulting in lower data compression and higher transfer sizes. Instead observe the path of a tree or blob that is being pushed into the outgoing set, and use that path to locate up to WINDOW ancestor versions from the edge commits. Push only those objects into the edgeObjects set, reducing the number of objects seen by the search window. This allows PackWriter to only look at ancestors for the modified files, rather than all files in the project. Limiting the search to WINDOW size makes sense, because more than WINDOW edge objects will just skip through the window search as none of them need to be delta compressed. To further improve compression, sort edge objects into the front of the window list, rather than randomly throughout. This puts non-edges later in the window and gives them a better chance at finding their base, since they search backwards through the window. These changes make a significant difference in the thin-pack: Before: remote: Counting objects: 144190, done remote: Finding sources: 100% (50275/50275) remote: Getting sizes: 100% (101405/101405) remote: Compressing objects: 100% (7587/7587) Receiving objects: 100% (50275/50275), 24.67 MiB | 9.90 MiB/s, done. Resolving deltas: 100% (40339/40339), completed with 2218 local objects. real 0m30.267s After: remote: Counting objects: 61549, done remote: Finding sources: 100% (50275/50275) remote: Getting sizes: 100% (18862/18862) remote: Compressing objects: 100% (7588/7588) Receiving objects: 100% (50275/50275), 11.04 MiB | 3.51 MiB/s, done. Resolving deltas: 100% (43160/43160), completed with 5014 local objects. real 0m22.170s The resulting pack is 13.63 MiB smaller, even though it contains the same exact objects. 
82,543 fewer objects had to have their sizes looked up, which saved about 8s of server CPU time. 2,796 more objects from the client were used as part of the base object set, which contributed to the smaller transfer size. Change-Id: Id01271950432c6960897495b09deab70e33993a9 Signed-off-by: Shawn O. Pearce <spearce@spearce.org> Sigend-off-by: Chris Aniszczyk <caniszczyk@gmail.com>
13 years ago
Implement delta generation during packing PackWriter now produces new deltas if there is not a suitable delta available for reuse from an existing pack file. This permits JGit to send less data on the wire by sending a delta relative to an object the other side already has, instead of sending the whole object. The delta searching algorithm is similar in style to what C Git uses, but apparently has some differences (see below for more on). Briefly, objects that should be considered for delta compression are pushed onto a list. This list is then sorted by a rough similarity score, which is derived from the path name the object was discovered at in the repository during object counting. The list is then walked in order. At each position in the list, up to $WINDOW objects prior to it are attempted as delta bases. Each object in the window is tried, and the shortest delta instruction sequence selects the base object. Some rough rules are used to prevent pathological behavior during this matching phase, like skipping pairings of objects that are not similar enough in size. PackWriter intentionally excludes commits and annotated tags from this new delta search phase. In the JGit repository only 28 out of 2600+ commits can be delta compressed by C Git. As the commit count tends to be a fair percentage of the total number of objects in the repository, and they generally do not delta compress well, skipping over them can improve performance with little increase in the output pack size. Because this implementation was rebuilt from scratch based on my own memory of how the packing algorithm has evolved over the years in C Git, PackWriter, DeltaWindow, and DeltaEncoder don't use exactly the same rules everywhere, and that leads JGit to produce different (but logically equivalent) pack files. 
Repository | Pack Size (bytes) | Packing Time | JGit - CGit = Difference | JGit / CGit -----------+----------------------------------+----------------- git | 25094348 - 24322890 = +771458 | 59.434s / 59.133s jgit | 5669515 - 5709046 = - 39531 | 6.654s / 6.806s linux-2.6 | 389M - 386M = +3M | 20m02s / 18m01s For the above tests pack.threads was set to 1, window size=10, delta depth=50, and delta and object reuse was disabled for both implementations. Both implementations were reading from an already fully packed repository on local disk. The running time reported is after 1 warm-up run of the tested implementation. PackWriter is writing 771 KiB more data on git.git, 3M more on linux-2.6, but is actually 39.5 KiB smaller on jgit.git. Being larger by less than 0.7% on linux-2.6 isn't bad, nor is taking an extra 2 minutes to pack. On the running time side, JGit is at a major disadvantage because linux-2.6 doesn't fit into the default WindowCache of 20M, while C Git is able to mmap the entire pack and have it available instantly in physical memory (assuming hot cache). CGit also has a feature where it caches deltas that were created during the compression phase, and uses those cached deltas during the writing phase. PackWriter does not implement this (yet), and therefore must create every delta twice. This could easily account for the increased running time we are seeing. Change-Id: I6292edc66c2e95fbe45b519b65fdb3918068889c Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 years ago
PackWriter: Support reuse of entire packs The most expensive part of packing a repository for transport to another system is enumerating all of the objects in the repository. Once this gets to the size of the linux-2.6 repository (1.8 million objects), enumeration can take several CPU minutes and costs a lot of temporary working set memory. Teach PackWriter to efficiently reuse an existing "cached pack" by answering a clone request with a thin pack followed by a larger cached pack appended to the end. This requires the repository owner to first construct the cached pack by hand, and record the tip commits inside of $GIT_DIR/objects/info/cached-packs: cd $GIT_DIR root=$(git rev-parse master) tmp=objects/.tmp-$$ names=$(echo $root | git pack-objects --keep-true-parents --revs $tmp) for n in $names; do chmod a-w $tmp-$n.pack $tmp-$n.idx touch objects/pack/pack-$n.keep mv $tmp-$n.pack objects/pack/pack-$n.pack mv $tmp-$n.idx objects/pack/pack-$n.idx done (echo "+ $root"; for n in $names; do echo "P $n"; done; echo) >>objects/info/cached-packs git repack -a -d When a clone request needs to include $root, the corresponding cached pack will be copied as-is, rather than enumerating all of the objects that are reachable from $root. For a linux-2.6 kernel repository that should be about 376 MiB, the above process creates two packs of 368 MiB and 38 MiB[1]. This is a local disk usage increase of ~26 MiB, due to reduced delta compression between the large cached pack and the smaller recent activity pack. The overhead is similar to 1 full copy of the compressed project sources. With this cached pack in hand, JGit daemon completes a clone request in 1m17s less time, but a slightly larger data transfer (+2.39 MiB): Before: remote: Counting objects: 1861830, done remote: Finding sources: 100% (1861830/1861830) remote: Getting sizes: 100% (88243/88243) remote: Compressing objects: 100% (88184/88184) Receiving objects: 100% (1861830/1861830), 376.01 MiB | 19.01 MiB/s, done. 
remote: Total 1861830 (delta 4706), reused 1851053 (delta 1553844) Resolving deltas: 100% (1564621/1564621), done. real 3m19.005s After: remote: Counting objects: 1601, done remote: Counting objects: 1828460, done remote: Finding sources: 100% (50475/50475) remote: Getting sizes: 100% (18843/18843) remote: Compressing objects: 100% (7585/7585) remote: Total 1861830 (delta 2407), reused 1856197 (delta 37510) Receiving objects: 100% (1861830/1861830), 378.40 MiB | 31.31 MiB/s, done. Resolving deltas: 100% (1559477/1559477), done. real 2m2.938s Repository owners can periodically refresh their cached packs by repacking their repository, folding all newer objects into a larger cached pack. Since repacking is already considered to be a normal Git maintenance activity, this isn't a very big burden. [1] In this test $root was set back about two weeks. Change-Id: Ib87131d5c4b5e8c5cacb0f4fe16ff4ece554734b Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
Support creating pack bitmap indexes in PackWriter. Update the PackWriter to support writing out pack bitmap indexes, a parallel ".bitmap" file to the ".pack" file. Bitmaps are selected at commits every 1 to 5,000 commits for each unique path from the start. The most recent 100 commits are all bitmapped. The next 19,000 commits have a bitmaps every 100 commits. The remaining commits have a bitmap every 5,000 commits. Commits with more than 1 parent are prefered over ones with 1 or less. Furthermore, previously computed bitmaps are reused, if the previous entry had the reuse flag set, which is set when the bitmap was placed at the max allowed distance. Bitmaps are used to speed up the counting phase when packing, for requests that are not shallow. The PackWriterBitmapWalker uses a RevFilter to proactively mark commits with RevFlag.SEEN, when they appear in a bitmap. The walker produces the full closure of reachable ObjectIds, given the collection of starting ObjectIds. For fetch request, two ObjectWalks are executed to compute the ObjectIds reachable from the haves and from the wants. The ObjectIds needed to be written are determined by taking all the resulting wants AND NOT the haves. For clone requests, we get cached pack support for "free" since it is possible to determine if all of the ObjectIds in a pack file are included in the resulting list of ObjectIds to write. 
On my machine, the best times for clones and fetches of the linux kernel repository (with about 2.6M objects and 300K commits) are tabulated below:

Operation                     Index V2                 Index VE003
Clone                         37530ms (524.06 MiB)        82ms (524.06 MiB)
Fetch (1 commit back)            75ms                    107ms
Fetch (10 commits back)         456ms (269.51 KiB)       341ms (265.19 KiB)
Fetch (100 commits back)        449ms (269.91 KiB)       337ms (267.28 KiB)
Fetch (1000 commits back)      2229ms ( 14.75 MiB)       189ms ( 14.42 MiB)
Fetch (10000 commits back)     2177ms ( 16.30 MiB)       254ms ( 15.88 MiB)
Fetch (100000 commits back)   14340ms (185.83 MiB)      1655ms (189.39 MiB)

Change-Id: Icdb0cdd66ff168917fb9ef17b96093990cc6a98d
11 years ago
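The bitmap spacing rule in the message above (every commit for the newest 100, every 100 for the next 19,000, every 5,000 after that) can be sketched as a simple step function. This is a simplification under stated assumptions: it walks a single linear history, ignores the preference for merge commits and the reuse flag, and the class and method names are hypothetical.

```java
// Simplified sketch of the spacing rule; not the actual selection code.
final class BitmapSpacing {
    static final int RECENT = 100;    // newest commits, all bitmapped
    static final int MIDDLE = 19_000; // next range, one bitmap per 100 commits

    /** Distance to the next bitmapped commit, given how far we are from the tip. */
    static int spacingAt(int commitsFromTip) {
        if (commitsFromTip < RECENT)
            return 1;
        if (commitsFromTip < RECENT + MIDDLE)
            return 100;
        return 5_000;
    }

    /** Counts how many commits get a bitmap along a linear history. */
    static int countSelected(int historyLength) {
        int selected = 0;
        for (int pos = 0; pos < historyLength; pos += spacingAt(pos))
            selected++;
        return selected;
    }

    public static void main(String[] args) {
        System.out.println(countSelected(100));
        System.out.println(countSelected(19_100));
    }
}
```

Under this simplified rule a linear history of 19,100 commits gets 290 bitmaps: 100 for the newest commits plus 190 in the middle range.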
PackWriter: Hoist and cluster reference targets

Many source browsers and network-related tools like UploadPack need to find and parse the target of all branches and annotated tags within the repository during their startup phase. Clustering these together into the same part of the pack file will improve locality, reducing thrashing when an application starts and needs to load all of these into memory at once.

To prevent bottlenecking basic log viewing tools that are scanning backwards from the tip of a current branch (and don't need tags) we place this cluster of older targets after 4096 newer commits have already been placed into the pack stream. 4096 was chosen as a rough guess, but was based on a few factors:

- log viewers typically show 5-200 commits per page
- users only view the first page or two
- DHT can cram 2200-4000 commits per 1 MiB chunk, thus these will fall into the second commit chunk (roughly)

Unfortunately this placement hurts history tools that are scanning backwards through the commit graph and completely ignored tags or branch heads when they started. An ancient tagged commit is no longer positioned behind its first child (it's now much earlier), resulting in a page fault for the parser to reload this cluster of objects on demand. This may be an acceptable loss. If a user is walking backwards and has already scanned through more than 4096 commits of history, waiting for the region to reload isn't really that bad compared to the amount of time already spent.

If the repository is so small that there are fewer than 4096 commits, this change has no impact on the placement of objects.

Change-Id: If3052e430d305e17878d94145c93754f56b74c61
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
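The placement described above can be sketched as a reordering of a newest-first commit list. The names below are hypothetical and the cutoff is a parameter so the behaviour is easy to try on a tiny list. This assumes commits are already sorted newest first and says nothing about how PackWriter really tracks reference targets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: keep the newest `cutoff` commits in place, then emit the older
// reference targets as one cluster, then everything else in original order.
final class HoistRefTargets {
    static <T> List<T> order(List<T> newestFirst, Set<T> refTargets, int cutoff) {
        if (newestFirst.size() <= cutoff)
            return new ArrayList<>(newestFirst); // small repository: no change
        List<T> out = new ArrayList<>(newestFirst.size());
        out.addAll(newestFirst.subList(0, cutoff));
        List<T> rest = new ArrayList<>();
        for (T c : newestFirst.subList(cutoff, newestFirst.size())) {
            if (refTargets.contains(c))
                out.add(c);  // hoisted into the cluster
            else
                rest.add(c); // placed after the cluster
        }
        out.addAll(rest);
        return out;
    }

    public static void main(String[] args) {
        List<String> commits = Arrays.asList("a", "b", "c", "d", "e", "f");
        Set<String> targets = new HashSet<>(Arrays.asList("b", "e"));
        // With a cutoff of 3, "e" is hoisted ahead of "d"; "b" is already near the tip.
        System.out.println(order(commits, targets, 3)); // prints [a, b, c, e, d, f]
    }
}
```

With the real cutoff of 4096, the first pages of a log viewer never touch the cluster, which is the trade-off the message argues for.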
PackWriter: Don't reuse commit or tag deltas

JGit doesn't generate deltas for commit or tag objects when it packs a repository from scratch. This is an explicit design decision that is (mostly) justified by the fact that these objects do not delta compress well.

Annotated tags are made once on stable points of the project history; it is unlikely they will ever appear again with sufficient common text to justify using a delta over just deflating the raw content. JGit never tries to delta compress annotated tags, and I take the stance that these are best stored as non-deltas given how frequently they might be accessed by repository viewers.

Commits only have sufficient common text when they are cherry-picked to forward-port or back-port a change from one branch to another. Even in these cases the distance between the commits as returned by the log traversal has to be small enough that they would both appear in the delta search window at the same time in order to delta compress one of the messages against the other. JGit never tries to delta compress commits, as it requires a lot of CPU time but typically does not produce a smaller pack file.

Avoid reusing deltas for either of these types when constructing a new pack. To avoid killing performance during serving of network clients, UploadPack disables this code change by allowing PackWriter to reuse delta commits. Repositories that were already repacked by C Git will not have their delta commits decompressed and recompressed on the fly during object writing, saving server-side CPU resources.

Change-Id: I749407e7c5c677e05e4d054b40db7656cfa7fca8
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
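The policy in this message boils down to a per-type decision. A minimal sketch follows, with a hypothetical enum and a hypothetical reuseDeltaCommits switch standing in for whatever UploadPack sets. The message only says commits may be reused when serving clients, so treating tag deltas as never reused is an assumption here.

```java
// Sketch of the reuse decision; types and flag name are illustrative only.
final class DeltaReusePolicy {
    enum ObjectType { COMMIT, TREE, BLOB, TAG }

    /** Whether an existing delta for an object of this type may be copied as-is. */
    static boolean mayReuseDelta(ObjectType type, boolean reuseDeltaCommits) {
        switch (type) {
        case COMMIT:
            return reuseDeltaCommits; // only when the caller opted in, e.g. serving a fetch
        case TAG:
            return false;             // assumption: tag deltas are never reused
        default:
            return true;              // trees and blobs are unaffected by this change
        }
    }

    public static void main(String[] args) {
        System.out.println(mayReuseDelta(ObjectType.COMMIT, false));
        System.out.println(mayReuseDelta(ObjectType.COMMIT, true));
    }
}
```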
PackWriter: Support reuse of entire packs

The most expensive part of packing a repository for transport to another system is enumerating all of the objects in the repository. Once this gets to the size of the linux-2.6 repository (1.8 million objects), enumeration can take several CPU minutes and costs a lot of temporary working set memory.

Teach PackWriter to efficiently reuse an existing "cached pack" by answering a clone request with a thin pack followed by a larger cached pack appended to the end. This requires the repository owner to first construct the cached pack by hand, and record the tip commits inside of $GIT_DIR/objects/info/cached-packs:

  cd $GIT_DIR
  root=$(git rev-parse master)
  tmp=objects/.tmp-$$
  names=$(echo $root | git pack-objects --keep-true-parents --revs $tmp)
  for n in $names; do
    chmod a-w $tmp-$n.pack $tmp-$n.idx
    touch objects/pack/pack-$n.keep
    mv $tmp-$n.pack objects/pack/pack-$n.pack
    mv $tmp-$n.idx objects/pack/pack-$n.idx
  done

  (echo "+ $root";
   for n in $names; do echo "P $n"; done;
   echo) >>objects/info/cached-packs

  git repack -a -d

When a clone request needs to include $root, the corresponding cached pack will be copied as-is, rather than enumerating all of the objects that are reachable from $root.

For a linux-2.6 kernel repository that should be about 376 MiB, the above process creates two packs of 368 MiB and 38 MiB[1]. This is a local disk usage increase of ~26 MiB, due to reduced delta compression between the large cached pack and the smaller recent activity pack. The overhead is similar to 1 full copy of the compressed project sources.

With this cached pack in hand, JGit daemon completes a clone request in 1m17s less time, but with a slightly larger data transfer (+2.39 MiB):

Before:
  remote: Counting objects: 1861830, done
  remote: Finding sources: 100% (1861830/1861830)
  remote: Getting sizes: 100% (88243/88243)
  remote: Compressing objects: 100% (88184/88184)
  Receiving objects: 100% (1861830/1861830), 376.01 MiB | 19.01 MiB/s, done.
  remote: Total 1861830 (delta 4706), reused 1851053 (delta 1553844)
  Resolving deltas: 100% (1564621/1564621), done.

  real    3m19.005s

After:
  remote: Counting objects: 1601, done
  remote: Counting objects: 1828460, done
  remote: Finding sources: 100% (50475/50475)
  remote: Getting sizes: 100% (18843/18843)
  remote: Compressing objects: 100% (7585/7585)
  remote: Total 1861830 (delta 2407), reused 1856197 (delta 37510)
  Receiving objects: 100% (1861830/1861830), 378.40 MiB | 31.31 MiB/s, done.
  Resolving deltas: 100% (1559477/1559477), done.

  real    2m2.938s

Repository owners can periodically refresh their cached packs by repacking their repository, folding all newer objects into a larger cached pack. Since repacking is already considered to be a normal Git maintenance activity, this isn't a very big burden.

[1] In this test $root was set back about two weeks.

Change-Id: Ib87131d5c4b5e8c5cacb0f4fe16ff4ece554734b
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
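The shell script above writes records of the form "+ <tip>", one or more "P <pack name>" lines, and a blank line into objects/info/cached-packs. A reader for that layout might look like the sketch below. The record format is inferred only from the script, the class and field names are hypothetical, and unrecognised lines are skipped because the message does not say what else the file may contain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of a reader for the record layout produced by the script above.
final class CachedPackList {
    static final class Entry {
        final List<String> tips = new ArrayList<>();      // from "+ <commit>" lines
        final List<String> packNames = new ArrayList<>(); // from "P <name>" lines
    }

    static List<Entry> parse(List<String> lines) {
        List<Entry> entries = new ArrayList<>();
        Entry current = null;
        for (String line : lines) {
            if (line.isEmpty()) { // a blank line closes the current record
                if (current != null)
                    entries.add(current);
                current = null;
                continue;
            }
            if (current == null)
                current = new Entry();
            if (line.startsWith("+ "))
                current.tips.add(line.substring(2));
            else if (line.startsWith("P "))
                current.packNames.add(line.substring(2));
            // anything else is ignored in this sketch
        }
        if (current != null) // tolerate a missing trailing blank line
            entries.add(current);
        return entries;
    }

    public static void main(String[] args) {
        List<Entry> entries = parse(Arrays.asList("+ tip1", "P packA", "P packB", ""));
        System.out.println(entries.get(0).tips + " " + entries.get(0).packNames);
    }
}
```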
PackWriter: Support reuse of entire packs The most expensive part of packing a repository for transport to another system is enumerating all of the objects in the repository. Once this gets to the size of the linux-2.6 repository (1.8 million objects), enumeration can take several CPU minutes and costs a lot of temporary working set memory. Teach PackWriter to efficiently reuse an existing "cached pack" by answering a clone request with a thin pack followed by a larger cached pack appended to the end. This requires the repository owner to first construct the cached pack by hand, and record the tip commits inside of $GIT_DIR/objects/info/cached-packs: cd $GIT_DIR root=$(git rev-parse master) tmp=objects/.tmp-$$ names=$(echo $root | git pack-objects --keep-true-parents --revs $tmp) for n in $names; do chmod a-w $tmp-$n.pack $tmp-$n.idx touch objects/pack/pack-$n.keep mv $tmp-$n.pack objects/pack/pack-$n.pack mv $tmp-$n.idx objects/pack/pack-$n.idx done (echo "+ $root"; for n in $names; do echo "P $n"; done; echo) >>objects/info/cached-packs git repack -a -d When a clone request needs to include $root, the corresponding cached pack will be copied as-is, rather than enumerating all of the objects that are reachable from $root. For a linux-2.6 kernel repository that should be about 376 MiB, the above process creates two packs of 368 MiB and 38 MiB[1]. This is a local disk usage increase of ~26 MiB, due to reduced delta compression between the large cached pack and the smaller recent activity pack. The overhead is similar to 1 full copy of the compressed project sources. With this cached pack in hand, JGit daemon completes a clone request in 1m17s less time, but a slightly larger data transfer (+2.39 MiB): Before: remote: Counting objects: 1861830, done remote: Finding sources: 100% (1861830/1861830) remote: Getting sizes: 100% (88243/88243) remote: Compressing objects: 100% (88184/88184) Receiving objects: 100% (1861830/1861830), 376.01 MiB | 19.01 MiB/s, done. 
remote: Total 1861830 (delta 4706), reused 1851053 (delta 1553844) Resolving deltas: 100% (1564621/1564621), done. real 3m19.005s After: remote: Counting objects: 1601, done remote: Counting objects: 1828460, done remote: Finding sources: 100% (50475/50475) remote: Getting sizes: 100% (18843/18843) remote: Compressing objects: 100% (7585/7585) remote: Total 1861830 (delta 2407), reused 1856197 (delta 37510) Receiving objects: 100% (1861830/1861830), 378.40 MiB | 31.31 MiB/s, done. Resolving deltas: 100% (1559477/1559477), done. real 2m2.938s Repository owners can periodically refresh their cached packs by repacking their repository, folding all newer objects into a larger cached pack. Since repacking is already considered to be a normal Git maintenance activity, this isn't a very big burden. [1] In this test $root was set back about two weeks. Change-Id: Ib87131d5c4b5e8c5cacb0f4fe16ff4ece554734b Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
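
The shell script in the cached-pack message writes one record per cached pack set: a "+ <commit>" line for the tip, one "P <name>" line per pack, then a blank line. As a rough illustration only, the sketch below parses text in that shape. The layout is inferred solely from that script, and the class and method names (CachedPackEntry, parse) are hypothetical; they are not JGit API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of one record in objects/info/cached-packs (not JGit API). */
final class CachedPackEntry {
    final List<String> tips = new ArrayList<>();      // values of "+ <commit>" lines
    final List<String> packNames = new ArrayList<>(); // values of "P <name>" lines

    boolean isEmpty() {
        return tips.isEmpty() && packNames.isEmpty();
    }

    /**
     * Parses records separated by blank lines. Assumption: lines that are
     * neither "+ ..." nor "P ..." nor blank are simply ignored.
     */
    static List<CachedPackEntry> parse(String text) {
        List<CachedPackEntry> out = new ArrayList<>();
        CachedPackEntry cur = new CachedPackEntry();
        for (String line : text.split("\n")) {
            if (line.startsWith("+ ")) {
                cur.tips.add(line.substring(2));
            } else if (line.startsWith("P ")) {
                cur.packNames.add(line.substring(2));
            } else if (line.isEmpty() && !cur.isEmpty()) {
                // A blank line closes the current record.
                out.add(cur);
                cur = new CachedPackEntry();
            }
        }
        if (!cur.isEmpty()) {
            out.add(cur); // tolerate a missing trailing blank line
        }
        return out;
    }
}
```

A server-side check in this spirit would look up whether a requested tip appears in some record's tips and, if so, append the listed packs as-is instead of enumerating their objects.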
Shallow fetch: Respect "shallow" lines

When fetching from a shallow clone, the client sends "have" lines
to tell the server about objects it already has and "shallow" lines
to tell where its local history terminates. In some circumstances,
the server fails to honor the shallow lines and fails to return
objects that the client needs.

UploadPack passes the "have" lines to PackWriter so PackWriter can
omit them from the generated pack. UploadPack processes "shallow"
lines by calling RevWalk.assumeShallow() with the set of shallow
commits. RevWalk creates and caches RevCommits for these shallow
commits, clearing out their parents. That way, walks correctly
terminate at the shallow commits instead of assuming the client has
history going back behind them. UploadPack converts its RevWalk to an
ObjectWalk, maintaining the cached RevCommits, and passes it to
PackWriter.

Unfortunately, to support shallow fetches the PackWriter does the
following:

    if (shallowPack && !(walk instanceof DepthWalk.ObjectWalk))
      walk = new DepthWalk.ObjectWalk(reader, depth);

That is, when the client sends a "deepen" line (fetch --depth=<n>)
and the caller has not passed in a DepthWalk.ObjectWalk, PackWriter
throws away the RevWalk that was passed in and makes a new one. The
cleared parent lists prepared by RevWalk.assumeShallow() are lost.
Fortunately UploadPack intends to pass in a DepthWalk.ObjectWalk.
It tries to create it by calling toObjectWalkWithSameObjects() on
a DepthWalk.RevWalk. But it doesn't work: because DepthWalk.RevWalk
does not override the standard RevWalk#toObjectWalkWithSameObjects
implementation, the result is a plain ObjectWalk instead of an
instance of DepthWalk.ObjectWalk.

The result is that the "shallow" information is thrown away and
objects reachable from the shallow commits can be omitted from the
pack sent when fetching with --depth from a shallow clone.

Multiple factors collude to limit the circumstances under which this
bug can be observed:

1. Commits with depth != 0 don't enter DepthGenerator's pending queue.
   That means a "have" cannot have any effect on DepthGenerator unless
   it is also a "want".

2. DepthGenerator#next() doesn't call carryFlagsImpl(), so the
   uninteresting flag is not propagated to ancestors there even
   if a "have" is also a "want".

3. JGit treats a depth of 1 as "1 past the wants".

Because of (2), the only place the UNINTERESTING flag can leak to a
shallow commit's parents is in the carryFlags() call from
markUninteresting(). carryFlags() only traverses commits that have
already been parsed: commits yet to be parsed are supposed to inherit
correct flags from their parent in PendingGenerator#next (which
doesn't happen here --- that is (2)). So the list of commits that have
already been parsed becomes relevant.

When we hit the markUninteresting() call, all "want"s, "have"s, and
commits to be unshallowed have been parsed. carryFlags() only
affects the parsed commits. If the "want" is a direct parent of a
"have", then carryFlags() marks it as uninteresting. If the "have"
was also a "shallow", then its parent pointer should have been null
and the "want" shouldn't have been marked, so we see the bug. If the
"want" is a more distant ancestor then (2) keeps the uninteresting
state from propagating to the "want" and we don't see the bug. If the
"shallow" is not also a "have" then the shallow commit isn't parsed,
so (2) keeps the uninteresting state from propagating to the "want"
and we don't see the bug.

Here is a reproduction case (time flowing left to right, arrows
pointing to parents). "C" must be a commit that the client
reports as a "have" during negotiation. That can only happen if the
server reports it as an existing branch or tag in the first round of
negotiation:

    A <-- B <-- C <-- D

First do

    git clone --depth 1 <repo>

which yields D as a "have" and C as a "shallow" commit. Then try

    git fetch --depth 1 <repo> B:refs/heads/B

Negotiation sets up: have D, shallow C, have C, want B.
But due to this bug B is marked as uninteresting and is not sent.

Change-Id: I6e14b57b2f85e52d28cdcf356df647870f475440
Signed-off-by: Terry Parker <tparker@google.com>
7 years ago
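
The type problem described in the shallow-fetch message (a subclass that inherits a conversion method and therefore hands back the base type) can be reproduced in isolation. The sketch below uses toy classes only. ToyRevWalk, ToyDepthRevWalk and the rest are hypothetical stand-ins that are not the JGit classes and carry none of their state. Overriding the conversion method, as FixedDepthRevWalk does, is shown as one plausible remedy and is not claimed to be the change that was actually made.

```java
/** Toy base walk; its conversion method always returns the plain object walk. */
class ToyRevWalk {
    ToyObjectWalk toObjectWalk() {
        return new ToyObjectWalk();
    }
}

/** Toy plain object walk. */
class ToyObjectWalk extends ToyRevWalk {
}

/** Toy depth-aware object walk, the type the instanceof check looks for. */
class ToyDepthObjectWalk extends ToyObjectWalk {
}

/** Reproduces the problem: no override, so the inherited conversion runs. */
class ToyDepthRevWalk extends ToyRevWalk {
}

/** Possible remedy (an assumption): return the depth-aware type. */
class FixedDepthRevWalk extends ToyRevWalk {
    @Override
    ToyObjectWalk toObjectWalk() {
        return new ToyDepthObjectWalk();
    }
}

final class ShallowWalkDemo {
    /**
     * Mirrors the shape of the quoted check: true means the caller's walk,
     * and any state cached in it, would be discarded and replaced.
     */
    static boolean wouldReplace(ToyRevWalk walk) {
        return !(walk instanceof ToyDepthObjectWalk);
    }
}
```

With these toys, converting a ToyDepthRevWalk yields a walk that fails the instanceof test, so it would be replaced, while converting a FixedDepthRevWalk yields one that passes and is kept.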
Support creating pack bitmap indexes in PackWriter.

Update the PackWriter to support writing out pack bitmap indexes,
a parallel ".bitmap" file to the ".pack" file.

Bitmaps are selected at commits every 1 to 5,000 commits for
each unique path from the start. The most recent 100 commits are
all bitmapped. The next 19,000 commits have a bitmap every 100
commits. The remaining commits have a bitmap every 5,000 commits.
Commits with more than 1 parent are preferred over ones
with 1 or less. Furthermore, previously computed bitmaps are reused,
if the previous entry had the reuse flag set, which is set when the
bitmap was placed at the max allowed distance.

Bitmaps are used to speed up the counting phase when packing, for
requests that are not shallow. The PackWriterBitmapWalker uses
a RevFilter to proactively mark commits with RevFlag.SEEN, when
they appear in a bitmap. The walker produces the full closure
of reachable ObjectIds, given the collection of starting ObjectIds.

For fetch requests, two ObjectWalks are executed to compute the
ObjectIds reachable from the haves and from the wants. The
ObjectIds needed to be written are determined by taking all the
resulting wants AND NOT the haves.

For clone requests, we get cached pack support for "free" since
it is possible to determine if all of the ObjectIds in a pack file
are included in the resulting list of ObjectIds to write.
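
The "wants AND NOT haves" step and the selection spacing can be sketched with java.util.BitSet standing in for a pack bitmap. This is only an illustration: the class BitmapSketch and both method names are hypothetical, bit positions stand for objects in some fixed order, and the exact boundaries used in spacingAt (0 to 99, then 100 to 19,099) are an assumption read off the prose above.

```java
import java.util.BitSet;

/** Illustrative only; java.util.BitSet stands in for a pack bitmap. */
final class BitmapSketch {
    /**
     * Returns the objects to write: every bit reachable from the wants
     * that is not also reachable from the haves. Inputs are not modified.
     */
    static BitSet objectsToWrite(BitSet wants, BitSet haves) {
        BitSet result = (BitSet) wants.clone();
        result.andNot(haves); // wants AND NOT haves
        return result;
    }

    /**
     * Distance between selected commits, given how many commits back from
     * the tip we are. Boundary handling here is an assumption.
     */
    static int spacingAt(int commitsFromTip) {
        if (commitsFromTip < 100) {
            return 1;     // most recent 100 commits: all bitmapped
        }
        if (commitsFromTip < 100 + 19_000) {
            return 100;   // next 19,000 commits: one bitmap per 100
        }
        return 5_000;     // remaining history: one bitmap per 5,000
    }
}
```

Because the difference is a single pass over machine words, the cost of this step depends on the number of objects in the pack rather than on walking the commit graph.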
On my machine, the best times for clones and fetches of the linux
kernel repository (with about 2.6M objects and 300K commits) are
tabulated below:

Operation                    Index V2                Index VE003
Clone                        37530ms (524.06 MiB)      82ms (524.06 MiB)
Fetch (1 commit back)           75ms                  107ms
Fetch (10 commits back)        456ms (269.51 KiB)     341ms (265.19 KiB)
Fetch (100 commits back)       449ms (269.91 KiB)     337ms (267.28 KiB)
Fetch (1000 commits back)     2229ms ( 14.75 MiB)     189ms ( 14.42 MiB)
Fetch (10000 commits back)    2177ms ( 16.30 MiB)     254ms ( 15.88 MiB)
Fetch (100000 commits back)  14340ms (185.83 MiB)    1655ms (189.39 MiB)

Change-Id: Icdb0cdd66ff168917fb9ef17b96093990cc6a98d
11 years ago
Implement delta generation during packing

PackWriter now produces new deltas if there is not a suitable delta available for reuse from an existing pack file. This permits JGit to send less data on the wire by sending a delta relative to an object the other side already has, instead of sending the whole object.

The delta searching algorithm is similar in style to what C Git uses, but apparently has some differences (see below for more on this). Briefly, objects that should be considered for delta compression are pushed onto a list. This list is then sorted by a rough similarity score, which is derived from the path name the object was discovered at in the repository during object counting. The list is then walked in order.

At each position in the list, up to $WINDOW objects prior to it are attempted as delta bases. Each object in the window is tried, and the shortest delta instruction sequence selects the base object. Some rough rules are used to prevent pathological behavior during this matching phase, like skipping pairings of objects that are not similar enough in size.

PackWriter intentionally excludes commits and annotated tags from this new delta search phase. In the JGit repository only 28 out of 2600+ commits can be delta compressed by C Git. As the commit count tends to be a fair percentage of the total number of objects in the repository, and they generally do not delta compress well, skipping over them can improve performance with little increase in the output pack size.

Because this implementation was rebuilt from scratch based on my own memory of how the packing algorithm has evolved over the years in C Git, PackWriter, DeltaWindow, and DeltaEncoder don't use exactly the same rules everywhere, and that leads JGit to produce different (but logically equivalent) pack files.

Repository | Pack Size (bytes)                | Packing Time
           | JGit     - CGit     = Difference | JGit    / CGit
-----------+----------------------------------+-----------------
git        | 25094348 - 24322890 = +771458    | 59.434s / 59.133s
jgit       |  5669515 -  5709046 = - 39531    | 6.654s  / 6.806s
linux-2.6  |     389M -     386M = +3M        | 20m02s  / 18m01s

For the above tests pack.threads was set to 1, window size=10, delta depth=50, and delta and object reuse were disabled for both implementations. Both implementations were reading from an already fully packed repository on local disk. The running time reported is after 1 warm-up run of the tested implementation.

PackWriter is writing about 771 KB more data on git.git, 3M more on linux-2.6, but is actually 39.5 KB smaller on jgit.git. Being larger by less than 1% on linux-2.6 isn't bad, nor is taking an extra 2 minutes to pack. On the running time side, JGit is at a major disadvantage because linux-2.6 doesn't fit into the default WindowCache of 20M, while C Git is able to mmap the entire pack and have it available instantly in physical memory (assuming hot cache).

CGit also has a feature where it caches deltas that were created during the compression phase, and uses those cached deltas during the writing phase. PackWriter does not implement this (yet), and therefore must create every delta twice. This could easily account for the increased running time we are seeing.

Change-Id: I6292edc66c2e95fbe45b519b65fdb3918068889c
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 years ago
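The window search described in the "Implement delta generation during packing" message can be sketched roughly as follows. This is a minimal illustration, not the code in PackWriter, DeltaWindow, or DeltaEncoder: the `Candidate` class, the file-name sort key, the factor-of-two size rule, and the prefix/suffix cost estimate are all assumptions made up for this example, and the delta depth limit is left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class DeltaWindowSketch {
    /** Hypothetical stand-in for an object queued for packing. */
    static final class Candidate {
        final String path;
        final byte[] data;
        Candidate base; // chosen delta base, or null to store the object whole

        Candidate(String path, String content) {
            this.path = path;
            this.data = content.getBytes(StandardCharsets.UTF_8);
        }
    }

    /** Assumed similarity key: the last path segment, so same-named files sort together. */
    static String fileName(String path) {
        return path.substring(path.lastIndexOf('/') + 1);
    }

    /** Assumed rule of thumb: skip pairs whose sizes differ by more than a factor of two. */
    static boolean tooDifferentInSize(Candidate a, Candidate b) {
        int small = Math.min(a.data.length, b.data.length);
        int large = Math.max(a.data.length, b.data.length);
        return small * 2 < large;
    }

    /** Crude stand-in for a real delta encoder: bytes outside the shared prefix and suffix. */
    static int estimateDeltaSize(byte[] base, byte[] target) {
        int max = Math.min(base.length, target.length);
        int prefix = 0;
        while (prefix < max && base[prefix] == target[prefix])
            prefix++;
        int suffix = 0;
        while (suffix < max - prefix
                && base[base.length - 1 - suffix] == target[target.length - 1 - suffix])
            suffix++;
        return target.length - prefix - suffix + 8; // 8: assumed fixed instruction overhead
    }

    /** Sorts by the similarity key, then tries up to {@code window} earlier objects as bases. */
    static void search(List<Candidate> objects, int window) {
        Comparator<Candidate> bySimilarity = Comparator
                .comparing((Candidate c) -> fileName(c.path))
                .thenComparing(Comparator.comparingInt((Candidate c) -> c.data.length).reversed());
        objects.sort(bySimilarity);

        for (int i = 0; i < objects.size(); i++) {
            Candidate target = objects.get(i);
            int best = target.data.length; // a delta must beat storing the object whole
            for (int j = Math.max(0, i - window); j < i; j++) {
                Candidate base = objects.get(j);
                if (tooDifferentInSize(base, target))
                    continue;
                int cost = estimateDeltaSize(base.data, target.data);
                if (cost < best) { // the shortest delta selects the base
                    best = cost;
                    target.base = base;
                }
            }
        }
    }

    public static void main(String[] args) {
        List<Candidate> objects = new ArrayList<>();
        objects.add(new Candidate("a/Foo.java", "hello world version one"));
        objects.add(new Candidate("b/Bar.txt", "completely different text!!"));
        objects.add(new Candidate("c/Foo.java", "hello world version two"));
        search(objects, 10);
        for (Candidate c : objects)
            System.out.println(c.path + " -> "
                    + (c.base == null ? "whole" : "delta on " + c.base.path));
    }
}
```

Sorting first is what keeps the window small: likely bases end up adjacent to their targets, so only a handful of pairings per object need to be scored.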
Support cutting existing delta chains longer than the max depth

Some packs built by JGit have incredibly long delta chains due to a long-standing bug in PackWriter. Google has packs created by JGit's DfsGarbageCollector with chains 6000 objects long, or more.

Inflating objects at the end of this 6000 long chain is impossible to complete within a reasonable time bound. It could take a beefy system hours to perform even using the heavily optimized native C implementation of Git, let alone with JGit.

Enable pack.cutDeltaChains to be set in a configuration file to permit the PackWriter to determine the length of each delta chain and clip the chain at arbitrary points to fit within pack.depth.

Delta chain cycles are still possible, but no attempt is made to detect them. A trivial chain of A->B->A will iterate for the full pack.depth configured limit (e.g. 50) and then pick an object to store as non-delta.

When cutting chains the object list is walked in reverse to try to take advantage of existing chain computations. The assumption here is most deltas are near the end of the list, and their bases are near the front of the list. Going up from the tail attempts to reuse chainLength computations by relying on the memoized value in the delta base.

The chainLength field in ObjectToPack is overloaded into the depth field normally used by DeltaWindow. This is acceptable because the chain cut happens before delta search, and the chainLength is reset to 0 if delta search will follow.

Change-Id: Ida4fde9558f3abbbb77ade398d2af3941de9c812
11 years ago
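The reverse walk with a memoized chain length, as described in the "Support cutting existing delta chains" message, might look something like the sketch below. It is written only from that description: the `Obj` class and `cutChains` method are hypothetical names, and here `chainLength` is taken to mean the longest run of deltas known to sit on top of an object.

```java
import java.util.ArrayList;
import java.util.List;

final class ChainCutSketch {
    /** Hypothetical stand-in for an object being packed. */
    static final class Obj {
        final String name;
        Obj base;        // delta base, or null when stored whole
        int chainLength; // longest run of deltas seen on top of this object so far

        Obj(String name, Obj base) {
            this.name = name;
            this.base = base;
        }
    }

    /** Walks the list tail-first and clips any chain that would exceed maxDepth. */
    static void cutChains(List<Obj> list, int maxDepth) {
        for (int i = list.size() - 1; i >= 0; i--) {
            int d = 0;
            Obj b = list.get(i).base;
            while (b != null) {
                if (d < b.chainLength)
                    break; // an earlier, longer walk already covered b and its bases
                b.chainLength = ++d;
                if (d >= maxDepth && b.base != null) {
                    b.base = null; // store b whole, which ends the chain here
                    break;
                }
                b = b.base;
            }
        }
    }

    /** Number of delta hops from o down to a whole object (assumes no cycle remains). */
    static int depth(Obj o) {
        int d = 0;
        while (o.base != null) {
            o = o.base;
            d++;
        }
        return d;
    }

    public static void main(String[] args) {
        // Build o0 <- o1 <- ... <- o5, where each object is a delta on the previous one.
        List<Obj> list = new ArrayList<>();
        Obj prev = null;
        for (int i = 0; i < 6; i++) {
            prev = new Obj("o" + i, prev);
            list.add(prev);
        }
        cutChains(list, 3);
        for (Obj o : list)
            System.out.println(o.name + " depth=" + depth(o));
    }
}
```

Because `d` only grows during a walk, a cycle such as A->B->A is not detected but still terminates: the walk keeps counting until it reaches `maxDepth` and then stores one member of the cycle whole, which matches the behavior the message describes.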
PackWriter: Support reuse of entire packs

The most expensive part of packing a repository for transport to another system is enumerating all of the objects in the repository. Once this gets to the size of the linux-2.6 repository (1.8 million objects), enumeration can take several CPU minutes and costs a lot of temporary working set memory.

Teach PackWriter to efficiently reuse an existing "cached pack" by answering a clone request with a thin pack followed by a larger cached pack appended to the end. This requires the repository owner to first construct the cached pack by hand, and record the tip commits inside of $GIT_DIR/objects/info/cached-packs:

  cd $GIT_DIR
  root=$(git rev-parse master)
  tmp=objects/.tmp-$$
  names=$(echo $root | git pack-objects --keep-true-parents --revs $tmp)
  for n in $names; do
    chmod a-w $tmp-$n.pack $tmp-$n.idx
    touch objects/pack/pack-$n.keep
    mv $tmp-$n.pack objects/pack/pack-$n.pack
    mv $tmp-$n.idx objects/pack/pack-$n.idx
  done

  (echo "+ $root";
   for n in $names; do echo "P $n"; done;
   echo) >>objects/info/cached-packs

  git repack -a -d

When a clone request needs to include $root, the corresponding cached pack will be copied as-is, rather than enumerating all of the objects that are reachable from $root.

For a linux-2.6 kernel repository that should be about 376 MiB, the above process creates two packs of 368 MiB and 38 MiB[1]. This is a local disk usage increase of ~26 MiB, due to reduced delta compression between the large cached pack and the smaller recent activity pack. The overhead is similar to 1 full copy of the compressed project sources.

With this cached pack in hand, JGit daemon completes a clone request in about 1m16s less time, but with a slightly larger data transfer (+2.39 MiB):

Before:
  remote: Counting objects: 1861830, done
  remote: Finding sources: 100% (1861830/1861830)
  remote: Getting sizes: 100% (88243/88243)
  remote: Compressing objects: 100% (88184/88184)
  Receiving objects: 100% (1861830/1861830), 376.01 MiB | 19.01 MiB/s, done.
  remote: Total 1861830 (delta 4706), reused 1851053 (delta 1553844)
  Resolving deltas: 100% (1564621/1564621), done.

  real 3m19.005s

After:
  remote: Counting objects: 1601, done
  remote: Counting objects: 1828460, done
  remote: Finding sources: 100% (50475/50475)
  remote: Getting sizes: 100% (18843/18843)
  remote: Compressing objects: 100% (7585/7585)
  remote: Total 1861830 (delta 2407), reused 1856197 (delta 37510)
  Receiving objects: 100% (1861830/1861830), 378.40 MiB | 31.31 MiB/s, done.
  Resolving deltas: 100% (1559477/1559477), done.

  real 2m2.938s

Repository owners can periodically refresh their cached packs by repacking their repository, folding all newer objects into a larger cached pack. Since repacking is already considered to be a normal Git maintenance activity, this isn't a very big burden.

[1] In this test $root was set back about two weeks.

Change-Id: Ib87131d5c4b5e8c5cacb0f4fe16ff4ece554734b
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
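The shell script in the "PackWriter: Support reuse of entire packs" message appends one record per cached pack set to objects/info/cached-packs: a "+ " line with the tip commit, one "P " line per pack name, then a blank line. A reader for that layout could be sketched as below. The layout is inferred only from the `echo` commands in the script, and `CachedPackListSketch`, `Entry`, and `parse` are hypothetical names, not JGit API.

```java
import java.util.ArrayList;
import java.util.List;

final class CachedPackListSketch {
    /** One record: the tip commits it covers and the pack names that hold them. */
    static final class Entry {
        final List<String> tips = new ArrayList<>();
        final List<String> packNames = new ArrayList<>();
    }

    /** Parses text in the assumed "+ tip" / "P name" / blank-line record layout. */
    static List<Entry> parse(String text) {
        List<Entry> entries = new ArrayList<>();
        Entry current = null;
        for (String line : text.split("\n", -1)) {
            if (line.isEmpty()) {
                // A blank line closes the record in progress, if there is one.
                if (current != null) {
                    entries.add(current);
                    current = null;
                }
                continue;
            }
            if (current == null)
                current = new Entry();
            if (line.startsWith("+ "))
                current.tips.add(line.substring(2));
            else if (line.startsWith("P "))
                current.packNames.add(line.substring(2));
            // Any other line is ignored in this sketch.
        }
        if (current != null)
            entries.add(current); // tolerate a missing trailing blank line
        return entries;
    }

    public static void main(String[] args) {
        // Placeholder tip and pack names, for illustration only.
        String text = "+ tipA\nP pack1\nP pack2\n\n+ tipB\nP pack3\n\n";
        for (Entry e : parse(text))
            System.out.println(e.tips + " -> " + e.packNames);
    }
}
```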
PackWriter: Support reuse of entire packs The most expensive part of packing a repository for transport to another system is enumerating all of the objects in the repository. Once this gets to the size of the linux-2.6 repository (1.8 million objects), enumeration can take several CPU minutes and costs a lot of temporary working set memory. Teach PackWriter to efficiently reuse an existing "cached pack" by answering a clone request with a thin pack followed by a larger cached pack appended to the end. This requires the repository owner to first construct the cached pack by hand, and record the tip commits inside of $GIT_DIR/objects/info/cached-packs: cd $GIT_DIR root=$(git rev-parse master) tmp=objects/.tmp-$$ names=$(echo $root | git pack-objects --keep-true-parents --revs $tmp) for n in $names; do chmod a-w $tmp-$n.pack $tmp-$n.idx touch objects/pack/pack-$n.keep mv $tmp-$n.pack objects/pack/pack-$n.pack mv $tmp-$n.idx objects/pack/pack-$n.idx done (echo "+ $root"; for n in $names; do echo "P $n"; done; echo) >>objects/info/cached-packs git repack -a -d When a clone request needs to include $root, the corresponding cached pack will be copied as-is, rather than enumerating all of the objects that are reachable from $root. For a linux-2.6 kernel repository that should be about 376 MiB, the above process creates two packs of 368 MiB and 38 MiB[1]. This is a local disk usage increase of ~26 MiB, due to reduced delta compression between the large cached pack and the smaller recent activity pack. The overhead is similar to 1 full copy of the compressed project sources. With this cached pack in hand, JGit daemon completes a clone request in 1m17s less time, but a slightly larger data transfer (+2.39 MiB): Before: remote: Counting objects: 1861830, done remote: Finding sources: 100% (1861830/1861830) remote: Getting sizes: 100% (88243/88243) remote: Compressing objects: 100% (88184/88184) Receiving objects: 100% (1861830/1861830), 376.01 MiB | 19.01 MiB/s, done. 
remote: Total 1861830 (delta 4706), reused 1851053 (delta 1553844) Resolving deltas: 100% (1564621/1564621), done. real 3m19.005s After: remote: Counting objects: 1601, done remote: Counting objects: 1828460, done remote: Finding sources: 100% (50475/50475) remote: Getting sizes: 100% (18843/18843) remote: Compressing objects: 100% (7585/7585) remote: Total 1861830 (delta 2407), reused 1856197 (delta 37510) Receiving objects: 100% (1861830/1861830), 378.40 MiB | 31.31 MiB/s, done. Resolving deltas: 100% (1559477/1559477), done. real 2m2.938s Repository owners can periodically refresh their cached packs by repacking their repository, folding all newer objects into a larger cached pack. Since repacking is already considered to be a normal Git maintenance activity, this isn't a very big burden. [1] In this test $root was set back about two weeks. Change-Id: Ib87131d5c4b5e8c5cacb0f4fe16ff4ece554734b Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
Implement delta generation during packing PackWriter now produces new deltas if there is not a suitable delta available for reuse from an existing pack file. This permits JGit to send less data on the wire by sending a delta relative to an object the other side already has, instead of sending the whole object. The delta searching algorithm is similar in style to what C Git uses, but apparently has some differences (see below for more on). Briefly, objects that should be considered for delta compression are pushed onto a list. This list is then sorted by a rough similarity score, which is derived from the path name the object was discovered at in the repository during object counting. The list is then walked in order. At each position in the list, up to $WINDOW objects prior to it are attempted as delta bases. Each object in the window is tried, and the shortest delta instruction sequence selects the base object. Some rough rules are used to prevent pathological behavior during this matching phase, like skipping pairings of objects that are not similar enough in size. PackWriter intentionally excludes commits and annotated tags from this new delta search phase. In the JGit repository only 28 out of 2600+ commits can be delta compressed by C Git. As the commit count tends to be a fair percentage of the total number of objects in the repository, and they generally do not delta compress well, skipping over them can improve performance with little increase in the output pack size. Because this implementation was rebuilt from scratch based on my own memory of how the packing algorithm has evolved over the years in C Git, PackWriter, DeltaWindow, and DeltaEncoder don't use exactly the same rules everywhere, and that leads JGit to produce different (but logically equivalent) pack files. 
Repository | Pack Size (bytes) | Packing Time | JGit - CGit = Difference | JGit / CGit -----------+----------------------------------+----------------- git | 25094348 - 24322890 = +771458 | 59.434s / 59.133s jgit | 5669515 - 5709046 = - 39531 | 6.654s / 6.806s linux-2.6 | 389M - 386M = +3M | 20m02s / 18m01s For the above tests pack.threads was set to 1, window size=10, delta depth=50, and delta and object reuse was disabled for both implementations. Both implementations were reading from an already fully packed repository on local disk. The running time reported is after 1 warm-up run of the tested implementation. PackWriter is writing 771 KiB more data on git.git, 3M more on linux-2.6, but is actually 39.5 KiB smaller on jgit.git. Being larger by less than 0.7% on linux-2.6 isn't bad, nor is taking an extra 2 minutes to pack. On the running time side, JGit is at a major disadvantage because linux-2.6 doesn't fit into the default WindowCache of 20M, while C Git is able to mmap the entire pack and have it available instantly in physical memory (assuming hot cache). CGit also has a feature where it caches deltas that were created during the compression phase, and uses those cached deltas during the writing phase. PackWriter does not implement this (yet), and therefore must create every delta twice. This could easily account for the increased running time we are seeing. Change-Id: I6292edc66c2e95fbe45b519b65fdb3918068889c Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 years ago
Support cutting existing delta chains longer than the max depth

Some packs built by JGit have incredibly long delta chains due to a long-standing bug in PackWriter. Google has packs created by JGit's DfsGarbageCollector with chains 6000 objects long, or more. Inflating objects at the end of such a chain is impossible to complete within a reasonable time bound. It could take a beefy system hours to perform even using the heavily optimized native C implementation of Git, let alone with JGit.

Enable pack.cutDeltaChains to be set in a configuration file to permit the PackWriter to determine the length of each delta chain and clip the chain at arbitrary points to fit within pack.depth.

Delta chain cycles are still possible, but no attempt is made to detect them. A trivial chain of A->B->A will iterate for the full pack.depth configured limit (e.g. 50) and then pick an object to store as non-delta.

When cutting chains the object list is walked in reverse to try to take advantage of existing chain computations. The assumption here is that most deltas are near the end of the list, and their bases are near the front of the list. Going up from the tail attempts to reuse chainLength computations by relying on the memoized value in the delta base.

The chainLength field in ObjectToPack is overloaded into the depth field normally used by DeltaWindow. This is acceptable because the chain cut happens before delta search, and the chainLength is reset to 0 if delta search will follow.

Change-Id: Ida4fde9558f3abbbb77ade398d2af3941de9c812
11 years ago
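The reverse walk with a memoized chain length might look something like the sketch below. It is an illustration under assumptions, not the PackWriter code: ChainCutSketch, Node, and cutChains are hypothetical names, and chainLength here records the longest run of deltas seen so far hanging off a base.

```java
import java.util.List;

public class ChainCutSketch {
    /** Hypothetical stand-in for ObjectToPack. */
    public static final class Node {
        public final String name;
        public Node base;        // delta base, or null when stored whole
        public int chainLength;  // memo: longest delta run seen below this node

        public Node(String name, Node base) {
            this.name = name;
            this.base = base;
        }
    }

    /** Walks the list tail-first and clips chains at maxDepth; returns the number of cuts. */
    public static int cutChains(List<Node> objects, int maxDepth) {
        int cuts = 0;
        for (int i = objects.size() - 1; i >= 0; i--) {
            int depth = 0;
            Node base = objects.get(i).base;
            while (base != null) {
                // A longer chain through this base was already walked; reuse that result.
                if (depth < base.chainLength)
                    break;
                base.chainLength = ++depth;
                if (depth >= maxDepth && base.base != null) {
                    base.base = null; // store this object whole, which clips the chain
                    cuts++;
                    break;
                }
                base = base.base;
            }
        }
        return cuts;
    }

    /** Number of bases that must be followed to reach a whole object (acyclic input only). */
    public static int depthOf(Node n) {
        int d = 0;
        for (Node b = n.base; b != null; b = b.base)
            d++;
        return d;
    }
}
```

With a chain o5->o4->o3->o2->o1->o0 and maxDepth of 2, the walk from the tail clips at o3, leaving two chains of depth 2. A trivial A->B->A cycle is not detected as such; the loop simply runs until depth reaches maxDepth and then stores one of the two objects whole, as the commit message describes.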
PackWriter: Fix the way delta chain cycles are prevented

Take a very simple approach to avoiding delta chain cycles during object reuse: objects are now always selected from the oldest pack that contains them. This prevents cycles because a pack must not have a cycle in the delta chain. If both objects A and B are chosen out of the same source pack then there cannot be an A->B->A cycle.

The oldest pack is also the most likely to have the smallest deltas. It's the biggest pack in the system and probably came from the clone (or last GC) of this repository, where all objects were previously considered and packed tightly together. If an object appears again (for example due to a revert and a push into this repository) the newer copy of it won't be nearly as small as the older delta version of it, even if the newer one is also itself a delta.

ObjectDirectory already enumerates objects during selection in this newest->oldest order, so it already is supplying these assumptions to PackWriter. Taking advantage of this can speed up selection by a tiny amount by avoiding some tests, but can also help to prevent a cycle needing to be broken on the fly during writing.

The previous cycle breaking logic wasn't fully correct either. If a different delta base was chosen, the new delta base might not have been written into the output pack before the current object, forcing the use of REF_DELTA when OFS_DELTA is always smaller. This logic has now been reworked to always re-check the delta base and ensure it gets written before the current object.

If a cycle occurs, it gets broken the same way as before, by disabling delta reuse and finding an alternative form of the object, which may require inflating/deflating in whole format.

Change-Id: I9953ab8be54ceb8b588e1280d6f7edd688887747
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
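The effect of always taking an object from the oldest pack that contains it can be illustrated with a small model. This is only a sketch under assumptions: each pack is modeled as a hypothetical map from object name to delta base name (null meaning stored whole), and OldestPackSelection, select, and hasCycle are names invented for the example rather than JGit API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OldestPackSelection {
    /**
     * Picks each object's representation from the oldest pack containing it.
     * Packs are supplied newest first, so a later (older) pack overrides
     * any choice made from an earlier (newer) one.
     */
    public static Map<String, String> select(List<Map<String, String>> packsNewestFirst) {
        Map<String, String> chosen = new HashMap<>();
        for (Map<String, String> pack : packsNewestFirst)
            chosen.putAll(pack);
        return chosen;
    }

    /** Returns true if following delta bases from some object never reaches a whole object. */
    public static boolean hasCycle(Map<String, String> baseOf) {
        for (String start : baseOf.keySet()) {
            String cur = start;
            int steps = 0;
            while (cur != null) {
                cur = baseOf.get(cur);
                // An acyclic chain cannot take more hops than there are objects.
                if (++steps > baseOf.size())
                    return true;
            }
        }
        return false;
    }
}
```

Suppose a newer pack stores A as a delta of B, while the older pack stores B as a delta of A. Mixing the two sources (A from the newer pack, B from the older one) yields A->B->A, whereas select takes both from the older pack, so A is whole and B is a delta of A.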
PackWriter: Fix the way delta chain cycles are prevented Take a very simple approach to avoiding delta chains during object reuse: objects are now always selected from the oldest pack that contains them. This prevents cycles because a pack must not have a cycle in the delta chain. If both objects A and B are chosen out of the same source pack then there cannot be an A->B->A cycle. The oldest pack is also the most likely to have the smallest deltas. Its the biggest pack in the system and probably came from the clone (or last GC) of this repository, where all objects were previously considered and packed tightly together. If an object appears again (for example due to a revert and a push into this repository) the newer copy of won't be nearly as small as the older delta version of it, even if the newer one is also itself a delta. ObjectDirectory already enumerates objects during selection in this newest->oldest order, so it already is supplying these assumptions to PackWriter. Taking advantage of this can speed up selection by a tiny amount by avoiding some tests, but can also help to prevent a cycle needing to be broken on the fly during writing. The previous cycle breaking logic wasn't fully correct either. If a different delta base was chosen, the new delta base might not have been written into the output pack before the current object, forcing the use of REF_DELTA when OFS_DELTA is always smaller. This logic has now been reworked to always re-check the delta base and ensure it gets written before the current object. If a cycle occurs, it gets broken the same way as before, by disabling delta reuse and finding an alternative form of the object, which may require inflating/deflating in whole format. Change-Id: I9953ab8be54ceb8b588e1280d6f7edd688887747 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
Support cutting existing delta chains longer than the max depth Some packs built by JGit have incredibly long delta chains due to a long standing bug in PackWriter. Google has packs created by JGit's DfsGarbageCollector with chains of 6000 objects long, or more. Inflating objects at the end of this 6000 long chain is impossible to complete within a reasonable time bound. It could take a beefy system hours to perform even using the heavily optimized native C implementation of Git, let alone with JGit. Enable pack.cutDeltaChains to be set in a configuration file to permit the PackWriter to determine the length of each delta chain and clip the chain at arbitrary points to fit within pack.depth. Delta chain cycles are still possible, but no attempt is made to detect them. A trivial chain of A->B->A will iterate for the full pack.depth configured limit (e.g. 50) and then pick an object to store as non-delta. When cutting chains the object list is walked in reverse to try and take advantage of existing chain computations. The assumption here is most deltas are near the end of the list, and their bases are near the front of the list. Going up from the tail attempts to reuse chainLength computations by relying on the memoized value in the delta base. The chainLength field in ObjectToPack is overloaded into the depth field normally used by DeltaWindow. This is acceptable because the chain cut happens before delta search, and the chainLength is reset to 0 if delta search will follow. Change-Id: Ida4fde9558f3abbbb77ade398d2af3941de9c812
11 years ago
PackWriter: Fix the way delta chain cycles are prevented Take a very simple approach to avoiding delta chains during object reuse: objects are now always selected from the oldest pack that contains them. This prevents cycles because a pack must not have a cycle in the delta chain. If both objects A and B are chosen out of the same source pack then there cannot be an A->B->A cycle. The oldest pack is also the most likely to have the smallest deltas. Its the biggest pack in the system and probably came from the clone (or last GC) of this repository, where all objects were previously considered and packed tightly together. If an object appears again (for example due to a revert and a push into this repository) the newer copy of won't be nearly as small as the older delta version of it, even if the newer one is also itself a delta. ObjectDirectory already enumerates objects during selection in this newest->oldest order, so it already is supplying these assumptions to PackWriter. Taking advantage of this can speed up selection by a tiny amount by avoiding some tests, but can also help to prevent a cycle needing to be broken on the fly during writing. The previous cycle breaking logic wasn't fully correct either. If a different delta base was chosen, the new delta base might not have been written into the output pack before the current object, forcing the use of REF_DELTA when OFS_DELTA is always smaller. This logic has now been reworked to always re-check the delta base and ensure it gets written before the current object. If a cycle occurs, it gets broken the same way as before, by disabling delta reuse and finding an alternative form of the object, which may require inflating/deflating in whole format. Change-Id: I9953ab8be54ceb8b588e1280d6f7edd688887747 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
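The effect of always taking an object's representation from the oldest pack that contains it can be shown with a small model. This is only a sketch: packs are modelled as plain maps from object name to delta base name (null meaning "stored whole"), and `selectFromOldest`, `selectFirstSeen`, and `hasCycle` are hypothetical helpers, not JGit methods.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch only; packs are modelled as maps, listed newest first. */
final class OldestPackSelectionSketch {
    /** The representation from the oldest pack containing an object wins. */
    static Map<String, String> selectFromOldest(List<Map<String, String>> packsNewestFirst) {
        Map<String, String> chosen = new HashMap<>();
        for (Map<String, String> pack : packsNewestFirst)
            chosen.putAll(pack); // older packs come later and overwrite newer choices
        return chosen;
    }

    /** Contrast case: the first (newest) representation seen wins. */
    static Map<String, String> selectFirstSeen(List<Map<String, String>> packsNewestFirst) {
        Map<String, String> chosen = new HashMap<>();
        for (Map<String, String> pack : packsNewestFirst)
            for (Map.Entry<String, String> e : pack.entrySet())
                if (!chosen.containsKey(e.getKey()))
                    chosen.put(e.getKey(), e.getValue());
        return chosen;
    }

    /** Follows base pointers from every object and reports whether any walk revisits a node. */
    static boolean hasCycle(Map<String, String> chosen) {
        for (String start : chosen.keySet()) {
            Set<String> seen = new HashSet<>();
            String cur = start;
            while (cur != null) {
                if (!seen.add(cur))
                    return true;
                cur = chosen.get(cur);
            }
        }
        return false;
    }

    /** Newer pack re-stores A as a delta of B; the older pack holds A whole and B as a delta of A. */
    static List<Map<String, String>> examplePacks() {
        Map<String, String> newer = new HashMap<>();
        newer.put("A", "B");
        Map<String, String> older = new HashMap<>();
        older.put("A", null);
        older.put("B", "A");
        List<Map<String, String>> packs = new ArrayList<>();
        packs.add(newer);
        packs.add(older);
        return packs;
    }

    public static void main(String[] args) {
        System.out.println("first seen has cycle: " + hasCycle(selectFirstSeen(examplePacks())));
        System.out.println("oldest pack has cycle: " + hasCycle(selectFromOldest(examplePacks())));
    }
}
```

Mixing A from the newer pack with B from the older pack produces the A->B->A cycle, while taking both from the older pack cannot. Objects that exist only in different packs can still form a cycle, which is why the message keeps the fallback cycle breaking during writing.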
Implement delta generation during packing

PackWriter now produces new deltas if there is not a suitable delta available for reuse from an existing pack file. This permits JGit to send less data on the wire by sending a delta relative to an object the other side already has, instead of sending the whole object.

The delta searching algorithm is similar in style to what C Git uses, but apparently has some differences (see below for more). Briefly, objects that should be considered for delta compression are pushed onto a list. This list is then sorted by a rough similarity score, which is derived from the path name the object was discovered at in the repository during object counting. The list is then walked in order.

At each position in the list, up to $WINDOW objects prior to it are attempted as delta bases. Each object in the window is tried, and the shortest delta instruction sequence selects the base object. Some rough rules are used to prevent pathological behavior during this matching phase, like skipping pairings of objects that are not similar enough in size.

PackWriter intentionally excludes commits and annotated tags from this new delta search phase. In the JGit repository only 28 out of 2600+ commits can be delta compressed by C Git. As the commit count tends to be a fair percentage of the total number of objects in the repository, and they generally do not delta compress well, skipping over them can improve performance with little increase in the output pack size.

Because this implementation was rebuilt from scratch based on my own memory of how the packing algorithm has evolved over the years in C Git, PackWriter, DeltaWindow, and DeltaEncoder don't use exactly the same rules everywhere, and that leads JGit to produce different (but logically equivalent) pack files.

Repository | Pack Size (bytes)                | Packing Time
           | JGit     - CGit     = Difference | JGit    / CGit
-----------+----------------------------------+------------------
git        | 25094348 - 24322890 = +771458    | 59.434s / 59.133s
jgit       |  5669515 -  5709046 = -39531     |  6.654s /  6.806s
linux-2.6  |     389M -     386M = +3M        | 20m02s  / 18m01s

For the above tests pack.threads was set to 1, window size=10, delta depth=50, and delta and object reuse were disabled for both implementations. Both implementations were reading from an already fully packed repository on local disk. The running time reported is after 1 warm-up run of the tested implementation.

PackWriter is writing 771 KB more data on git.git, 3M more on linux-2.6, but is actually 39.5 KB smaller on jgit.git. Being larger by less than 0.7% on linux-2.6 isn't bad, nor is taking an extra 2 minutes to pack. On the running time side, JGit is at a major disadvantage because linux-2.6 doesn't fit into the default WindowCache of 20M, while C Git is able to mmap the entire pack and have it available instantly in physical memory (assuming hot cache).

CGit also has a feature where it caches deltas that were created during the compression phase, and uses those cached deltas during the writing phase. PackWriter does not implement this (yet), and therefore must create every delta twice. This could easily account for the increased running time we are seeing.

Change-Id: I6292edc66c2e95fbe45b519b65fdb3918068889c
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 years ago
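The sort-then-window search described in this commit message can be sketched as below. Several parts are assumptions made for illustration: the similarity key is just the file name followed by size, the "not similar enough in size" rule is a factor-of-two check, and `deltaCost` is a crude prefix/suffix stand-in for a real delta encoder. Delta depth limits are left out entirely.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative sketch only; not the DeltaWindow or DeltaEncoder implementation. */
final class DeltaWindowSketch {
    static final class Candidate {
        final String path;
        final byte[] data;
        Candidate base; // chosen delta base, or null to store whole

        Candidate(String path, String content) {
            this.path = path;
            this.data = content.getBytes(StandardCharsets.UTF_8);
        }
    }

    static String fileName(String path) {
        return path.substring(path.lastIndexOf('/') + 1);
    }

    /** Stand-in cost: target bytes not covered by a shared prefix or suffix. */
    static int deltaCost(byte[] base, byte[] target) {
        int max = Math.min(base.length, target.length);
        int prefix = 0;
        while (prefix < max && base[prefix] == target[prefix])
            prefix++;
        int suffix = 0;
        while (suffix < max - prefix
                && base[base.length - 1 - suffix] == target[target.length - 1 - suffix])
            suffix++;
        return target.length - prefix - suffix;
    }

    /** Sorts candidates by a rough similarity key, then tries up to `window` earlier objects as bases. */
    static void search(List<Candidate> list, int window) {
        // Assumed similarity key: same file name groups together, larger objects first.
        list.sort(Comparator.comparing((Candidate c) -> fileName(c.path))
                .thenComparingInt(c -> -c.data.length));
        for (int i = 0; i < list.size(); i++) {
            Candidate target = list.get(i);
            int best = target.data.length; // a delta must beat storing the object whole
            for (int j = Math.max(0, i - window); j < i; j++) {
                Candidate base = list.get(j);
                // Assumed rule of thumb: skip pairs whose sizes differ by more than 2x.
                if (base.data.length > 2 * target.data.length
                        || target.data.length > 2 * base.data.length)
                    continue;
                int cost = deltaCost(base.data, target.data);
                if (cost < best) {
                    best = cost;
                    target.base = base;
                }
            }
        }
    }

    public static void main(String[] args) {
        List<Candidate> list = new ArrayList<>();
        list.add(new Candidate("docs/README", "hello"));
        list.add(new Candidate("old/Foo.java", "class Foo { int a; }"));
        list.add(new Candidate("src/Foo.java", "class Foo { int a; int b; }"));
        search(list, 10);
        for (Candidate c : list)
            System.out.println(c.path + " -> " + (c.base == null ? "whole" : c.base.path));
    }
}
```

Because a base is only ever taken from an earlier position in the sorted list, this sketch cannot create a cycle among newly generated deltas.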
Do not write edge objects to the pack stream

Consider two objects A->B where A uses B as a delta base, and these are in the same source pack file ordered as "A B".

If cached packs are enabled and B is also in the cached pack that will be appended onto the end of the thin pack, and both A and B are supposed to be in the thin pack, PackWriter must consider the fact that A's base B is an edge object that claims to be part of the new pack, but is actually "external" and cannot be written first.

If the object reuse system considered B candidates first this bug does not arise, as B will be marked as edge due to it existing in the cached pack. When the A candidates are later examined, A sees a valid delta base is available as an edge, and will not later try to "write base first" during the writing phase.

However, when the reuse system considers A candidates first they see that B will be in the outgoing pack, as it is still part of the thin pack, and arrange for B to be written first. Later when B switches from being in-pack to being an edge object (as it is part of the cached pack) the pointer in A does not get its type changed from ObjectToPack to ObjectId, so A thinks B is non-edge.

We work around this case by also checking that the delta base B is non-edge before writing the object to the pack. Later when A writes its object header, delta base B's ObjectToPack will have an offset == 0, which makes isWritten() = false, and the OBJ_REF delta format will be used for A's header. This will be resolved by the client to the copy of B that appears in the later cached pack.

Change-Id: Ifab6bfdf3c0aa93649468f49bcf91d67f90362ca
12 years ago
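A minimal model of the "write base first unless it is an edge" check might look like the following. It is a sketch under assumptions: `Obj`, `writeAll`, and the log strings are hypothetical, offsets advance by 1 per object instead of by real byte counts, and delta chains are assumed to be acyclic. The only detail borrowed from the pack format is that the header occupies the first 12 bytes, so offset 0 can mean "not written".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; assumes delta chains contain no cycles. */
final class EdgeBaseSketch {
    static final class Obj {
        final String id;
        final Obj base; // delta base, or null when stored whole
        boolean edge;   // true when the object is supplied elsewhere (e.g. by the cached pack)
        long offset;    // 0 means "not written to this stream"

        Obj(String id, Obj base) {
            this.id = id;
            this.base = base;
        }

        boolean isWritten() {
            return offset != 0;
        }
    }

    private final List<String> log = new ArrayList<>();
    private long next = 12; // first position after the 12-byte pack header

    List<String> writeAll(List<Obj> objs) {
        for (Obj o : objs)
            write(o);
        return log;
    }

    private void write(Obj o) {
        if (o.edge || o.isWritten())
            return; // edge objects never enter this stream
        Obj b = o.base;
        // Write the base first only if it really belongs to this stream.
        if (b != null && !b.edge && !b.isWritten())
            write(b);
        o.offset = next++; // stand-in for real byte accounting
        String form = b == null ? "whole"
                : b.isWritten() ? "OFS_DELTA " + b.id   // base is earlier in this stream
                : "REF_DELTA " + b.id;                  // base must be resolved by its id
        log.add(o.id + " " + form);
    }

    public static void main(String[] args) {
        Obj b = new Obj("B", null);
        Obj a = new Obj("A", b);
        b.edge = true; // B will arrive via the cached pack
        System.out.println(new EdgeBaseSketch().writeAll(Arrays.asList(a, b)));
    }
}
```

With B marked as an edge, A is emitted with a reference to B by id and B itself is skipped. Without the edge flag the same call writes B first and lets A point back at it by offset.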
Implement delta generation during packing PackWriter now produces new deltas if there is not a suitable delta available for reuse from an existing pack file. This permits JGit to send less data on the wire by sending a delta relative to an object the other side already has, instead of sending the whole object. The delta searching algorithm is similar in style to what C Git uses, but apparently has some differences (see below for more on). Briefly, objects that should be considered for delta compression are pushed onto a list. This list is then sorted by a rough similarity score, which is derived from the path name the object was discovered at in the repository during object counting. The list is then walked in order. At each position in the list, up to $WINDOW objects prior to it are attempted as delta bases. Each object in the window is tried, and the shortest delta instruction sequence selects the base object. Some rough rules are used to prevent pathological behavior during this matching phase, like skipping pairings of objects that are not similar enough in size. PackWriter intentionally excludes commits and annotated tags from this new delta search phase. In the JGit repository only 28 out of 2600+ commits can be delta compressed by C Git. As the commit count tends to be a fair percentage of the total number of objects in the repository, and they generally do not delta compress well, skipping over them can improve performance with little increase in the output pack size. Because this implementation was rebuilt from scratch based on my own memory of how the packing algorithm has evolved over the years in C Git, PackWriter, DeltaWindow, and DeltaEncoder don't use exactly the same rules everywhere, and that leads JGit to produce different (but logically equivalent) pack files. 
Repository | Pack Size (bytes) | Packing Time | JGit - CGit = Difference | JGit / CGit -----------+----------------------------------+----------------- git | 25094348 - 24322890 = +771458 | 59.434s / 59.133s jgit | 5669515 - 5709046 = - 39531 | 6.654s / 6.806s linux-2.6 | 389M - 386M = +3M | 20m02s / 18m01s For the above tests pack.threads was set to 1, window size=10, delta depth=50, and delta and object reuse was disabled for both implementations. Both implementations were reading from an already fully packed repository on local disk. The running time reported is after 1 warm-up run of the tested implementation. PackWriter is writing 771 KiB more data on git.git, 3M more on linux-2.6, but is actually 39.5 KiB smaller on jgit.git. Being larger by less than 0.7% on linux-2.6 isn't bad, nor is taking an extra 2 minutes to pack. On the running time side, JGit is at a major disadvantage because linux-2.6 doesn't fit into the default WindowCache of 20M, while C Git is able to mmap the entire pack and have it available instantly in physical memory (assuming hot cache). CGit also has a feature where it caches deltas that were created during the compression phase, and uses those cached deltas during the writing phase. PackWriter does not implement this (yet), and therefore must create every delta twice. This could easily account for the increased running time we are seeing. Change-Id: I6292edc66c2e95fbe45b519b65fdb3918068889c Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 years ago
Implement delta generation during packing PackWriter now produces new deltas if there is not a suitable delta available for reuse from an existing pack file. This permits JGit to send less data on the wire by sending a delta relative to an object the other side already has, instead of sending the whole object. The delta searching algorithm is similar in style to what C Git uses, but apparently has some differences (see below for more on). Briefly, objects that should be considered for delta compression are pushed onto a list. This list is then sorted by a rough similarity score, which is derived from the path name the object was discovered at in the repository during object counting. The list is then walked in order. At each position in the list, up to $WINDOW objects prior to it are attempted as delta bases. Each object in the window is tried, and the shortest delta instruction sequence selects the base object. Some rough rules are used to prevent pathological behavior during this matching phase, like skipping pairings of objects that are not similar enough in size. PackWriter intentionally excludes commits and annotated tags from this new delta search phase. In the JGit repository only 28 out of 2600+ commits can be delta compressed by C Git. As the commit count tends to be a fair percentage of the total number of objects in the repository, and they generally do not delta compress well, skipping over them can improve performance with little increase in the output pack size. Because this implementation was rebuilt from scratch based on my own memory of how the packing algorithm has evolved over the years in C Git, PackWriter, DeltaWindow, and DeltaEncoder don't use exactly the same rules everywhere, and that leads JGit to produce different (but logically equivalent) pack files. 
Repository | Pack Size (bytes) | Packing Time | JGit - CGit = Difference | JGit / CGit -----------+----------------------------------+----------------- git | 25094348 - 24322890 = +771458 | 59.434s / 59.133s jgit | 5669515 - 5709046 = - 39531 | 6.654s / 6.806s linux-2.6 | 389M - 386M = +3M | 20m02s / 18m01s For the above tests pack.threads was set to 1, window size=10, delta depth=50, and delta and object reuse was disabled for both implementations. Both implementations were reading from an already fully packed repository on local disk. The running time reported is after 1 warm-up run of the tested implementation. PackWriter is writing 771 KiB more data on git.git, 3M more on linux-2.6, but is actually 39.5 KiB smaller on jgit.git. Being larger by less than 0.7% on linux-2.6 isn't bad, nor is taking an extra 2 minutes to pack. On the running time side, JGit is at a major disadvantage because linux-2.6 doesn't fit into the default WindowCache of 20M, while C Git is able to mmap the entire pack and have it available instantly in physical memory (assuming hot cache). CGit also has a feature where it caches deltas that were created during the compression phase, and uses those cached deltas during the writing phase. PackWriter does not implement this (yet), and therefore must create every delta twice. This could easily account for the increased running time we are seeing. Change-Id: I6292edc66c2e95fbe45b519b65fdb3918068889c Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 years ago
Support creating pack bitmap indexes in PackWriter.

Update the PackWriter to support writing out pack bitmap indexes, a parallel ".bitmap" file to the ".pack" file. Bitmaps are selected at commits every 1 to 5,000 commits for each unique path from the start. The most recent 100 commits are all bitmapped. The next 19,000 commits have a bitmap every 100 commits. The remaining commits have a bitmap every 5,000 commits. Commits with more than 1 parent are preferred over ones with 1 or fewer. Furthermore, previously computed bitmaps are reused if the previous entry had the reuse flag set, which is set when the bitmap was placed at the max allowed distance.

Bitmaps are used to speed up the counting phase when packing, for requests that are not shallow. The PackWriterBitmapWalker uses a RevFilter to proactively mark commits with RevFlag.SEEN when they appear in a bitmap. The walker produces the full closure of reachable ObjectIds, given the collection of starting ObjectIds.

For fetch requests, two ObjectWalks are executed to compute the ObjectIds reachable from the haves and from the wants. The ObjectIds needed to be written are determined by taking all the resulting wants AND NOT the haves.

For clone requests, we get cached pack support for "free" since it is possible to determine if all of the ObjectIds in a pack file are included in the resulting list of ObjectIds to write.

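As a rough illustration of the two rules above, the sketch below models the bitmap spacing by distance from the tip and the "wants AND NOT haves" set operation. It is a sketch under assumptions: `java.util.BitSet` stands in for whatever bitmap type JGit actually uses, the class and method names are hypothetical, and `spacing` is one simplified reading of the selection rules that ignores the preference for merge commits and the reuse flag.

```java
import java.util.BitSet;

final class BitmapSketch {
    /**
     * Assumed reading of the selection rules: the spacing between bitmapped
     * commits, given how many commits back from the tip a commit sits.
     */
    static int spacing(int distanceFromTip) {
        if (distanceFromTip < 100) {
            return 1; // most recent 100 commits: every commit
        }
        if (distanceFromTip < 100 + 19_000) {
            return 100; // next 19,000 commits: every 100th
        }
        return 5_000; // everything older: every 5,000th
    }

    /**
     * Objects to write for a fetch: everything reachable from the wants
     * that is not also reachable from the haves. Inputs are left unmodified.
     */
    static BitSet toSend(BitSet reachableFromWants, BitSet reachableFromHaves) {
        BitSet result = (BitSet) reachableFromWants.clone();
        result.andNot(reachableFromHaves);
        return result;
    }
}
```

Each bit position here represents one object; with bitmaps, the difference is a single bitwise pass instead of a second graph traversal, which is where the counting-phase savings come from.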
On my machine, the best times for clones and fetches of the Linux kernel repository (with about 2.6M objects and 300K commits) are tabulated below:

Operation                    Index V2                Index VE003
Clone                        37530ms (524.06 MiB)      82ms (524.06 MiB)
Fetch (1 commit back)           75ms                  107ms
Fetch (10 commits back)        456ms (269.51 KiB)     341ms (265.19 KiB)
Fetch (100 commits back)       449ms (269.91 KiB)     337ms (267.28 KiB)
Fetch (1000 commits back)     2229ms ( 14.75 MiB)     189ms ( 14.42 MiB)
Fetch (10000 commits back)    2177ms ( 16.30 MiB)     254ms ( 15.88 MiB)
Fetch (100000 commits back)  14340ms (185.83 MiB)    1655ms (189.39 MiB)

Change-Id: Icdb0cdd66ff168917fb9ef17b96093990cc6a98d
11 years ago
PackWriter: Support reuse of entire packs

The most expensive part of packing a repository for transport to another system is enumerating all of the objects in the repository. Once this gets to the size of the linux-2.6 repository (1.8 million objects), enumeration can take several CPU minutes and costs a lot of temporary working set memory.

Teach PackWriter to efficiently reuse an existing "cached pack" by answering a clone request with a thin pack followed by a larger cached pack appended to the end. This requires the repository owner to first construct the cached pack by hand, and record the tip commits inside of $GIT_DIR/objects/info/cached-packs:

  cd $GIT_DIR
  root=$(git rev-parse master)
  tmp=objects/.tmp-$$
  names=$(echo $root | git pack-objects --keep-true-parents --revs $tmp)
  for n in $names; do
    chmod a-w $tmp-$n.pack $tmp-$n.idx
    touch objects/pack/pack-$n.keep
    mv $tmp-$n.pack objects/pack/pack-$n.pack
    mv $tmp-$n.idx objects/pack/pack-$n.idx
  done

  (echo "+ $root";
   for n in $names; do echo "P $n"; done;
   echo) >>objects/info/cached-packs

  git repack -a -d

When a clone request needs to include $root, the corresponding cached pack will be copied as-is, rather than enumerating all of the objects that are reachable from $root.

For a linux-2.6 kernel repository that should be about 376 MiB, the above process creates two packs of 368 MiB and 38 MiB[1]. This is a local disk usage increase of ~26 MiB, due to reduced delta compression between the large cached pack and the smaller recent activity pack. The overhead is similar to 1 full copy of the compressed project sources.

With this cached pack in hand, JGit daemon completes a clone request in about 1m16s less time, at the cost of a slightly larger data transfer (+2.39 MiB):

Before:
  remote: Counting objects: 1861830, done
  remote: Finding sources: 100% (1861830/1861830)
  remote: Getting sizes: 100% (88243/88243)
  remote: Compressing objects: 100% (88184/88184)
  Receiving objects: 100% (1861830/1861830), 376.01 MiB | 19.01 MiB/s, done.
  remote: Total 1861830 (delta 4706), reused 1851053 (delta 1553844)
  Resolving deltas: 100% (1564621/1564621), done.

  real  3m19.005s

After:
  remote: Counting objects: 1601, done
  remote: Counting objects: 1828460, done
  remote: Finding sources: 100% (50475/50475)
  remote: Getting sizes: 100% (18843/18843)
  remote: Compressing objects: 100% (7585/7585)
  remote: Total 1861830 (delta 2407), reused 1856197 (delta 37510)
  Receiving objects: 100% (1861830/1861830), 378.40 MiB | 31.31 MiB/s, done.
  Resolving deltas: 100% (1559477/1559477), done.

  real  2m2.938s

Repository owners can periodically refresh their cached packs by repacking their repository, folding all newer objects into a larger cached pack. Since repacking is already considered to be a normal Git maintenance activity, this isn't a very big burden.

[1] In this test $root was set back about two weeks.

Change-Id: Ib87131d5c4b5e8c5cacb0f4fe16ff4ece554734b
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
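The shell script above appends records made of a "+ <tip commit>" line, one "P <pack name>" line per pack, and a terminating blank line. A reader for that layout might look like the sketch below. The format is inferred only from the script in this commit message, so treat the grammar as an assumption; the class and field names are hypothetical and this is not JGit's own parser.

```java
import java.util.ArrayList;
import java.util.List;

final class CachedPackListSketch {
    /** One record: the tip commits it covers and the packs that hold them. */
    static final class Entry {
        final List<String> tips = new ArrayList<>();
        final List<String> packs = new ArrayList<>();
    }

    /** Parses the text of a cached-packs file into records separated by blank lines. */
    static List<Entry> parse(String text) {
        List<Entry> entries = new ArrayList<>();
        Entry current = null;
        for (String line : text.split("\n")) {
            if (line.isEmpty()) {
                current = null; // a blank line closes the current record
                continue;
            }
            boolean tip = line.startsWith("+ ");
            boolean pack = line.startsWith("P ");
            if (!tip && !pack) {
                continue; // unknown line types are ignored in this sketch
            }
            if (current == null) {
                current = new Entry();
                entries.add(current);
            }
            (tip ? current.tips : current.packs).add(line.substring(2));
        }
        return entries;
    }
}
```

With records in hand, a server could check whether every tip of a record is wanted by the client and, if so, append the listed packs as-is instead of enumerating their objects.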
13 years ago
Shallow fetch: Respect "shallow" lines

When fetching from a shallow clone, the client sends "have" lines to tell the server about objects it already has and "shallow" lines to tell where its local history terminates. In some circumstances, the server fails to honor the shallow lines and fails to return objects that the client needs.

UploadPack passes the "have" lines to PackWriter so PackWriter can omit them from the generated pack. UploadPack processes "shallow" lines by calling RevWalk.assumeShallow() with the set of shallow commits. RevWalk creates and caches RevCommits for these shallow commits, clearing out their parents. That way, walks correctly terminate at the shallow commits instead of assuming the client has history going back behind them. UploadPack converts its RevWalk to an ObjectWalk, maintaining the cached RevCommits, and passes it to PackWriter.

Unfortunately, to support shallow fetches the PackWriter does the following:

  if (shallowPack && !(walk instanceof DepthWalk.ObjectWalk))
    walk = new DepthWalk.ObjectWalk(reader, depth);

That is, when the client sends a "deepen" line (fetch --depth=<n>) and the caller has not passed in a DepthWalk.ObjectWalk, PackWriter throws away the RevWalk that was passed in and makes a new one. The cleared parent lists prepared by RevWalk.assumeShallow() are lost.

Fortunately UploadPack intends to pass in a DepthWalk.ObjectWalk. It tries to create it by calling toObjectWalkWithSameObjects() on a DepthWalk.RevWalk. But it doesn't work: because DepthWalk.RevWalk does not override the standard RevWalk#toObjectWalkWithSameObjects implementation, the result is a plain ObjectWalk instead of an instance of DepthWalk.ObjectWalk.

The result is that the "shallow" information is thrown away and objects reachable from the shallow commits can be omitted from the pack sent when fetching with --depth from a shallow clone.

Multiple factors collude to limit the circumstances under which this bug can be observed:

1. Commits with depth != 0 don't enter DepthGenerator's pending queue. That means a "have" cannot have any effect on DepthGenerator unless it is also a "want".

2. DepthGenerator#next() doesn't call carryFlagsImpl(), so the uninteresting flag is not propagated to ancestors there even if a "have" is also a "want".

3. JGit treats a depth of 1 as "1 past the wants".

Because of (2), the only place the UNINTERESTING flag can leak to a shallow commit's parents is in the carryFlags() call from markUninteresting(). carryFlags() only traverses commits that have already been parsed: commits yet to be parsed are supposed to inherit correct flags from their parent in PendingGenerator#next (which doesn't happen here --- that is (2)). So the list of commits that have already been parsed becomes relevant.

When we hit the markUninteresting() call, all "want"s, "have"s, and commits to be unshallowed have been parsed. carryFlags() only affects the parsed commits. If the "want" is a direct parent of a "have", then carryFlags() marks it as uninteresting. If the "have" was also a "shallow", then its parent pointer should have been null and the "want" shouldn't have been marked, so we see the bug. If the "want" is a more distant ancestor, then (2) keeps the uninteresting state from propagating to the "want" and we don't see the bug. If the "shallow" is not also a "have", then the shallow commit isn't parsed, so (2) keeps the uninteresting state from propagating to the "want" and we don't see the bug.

Here is a reproduction case (time flowing left to right, arrows pointing to parents). "C" must be a commit that the client reports as a "have" during negotiation. That can only happen if the server reports it as an existing branch or tag in the first round of negotiation:

  A <-- B <-- C <-- D

First do

  git clone --depth 1 <repo>

which yields D as a "have" and C as a "shallow" commit. Then try

  git fetch --depth 1 <repo> B:refs/heads/B

Negotiation sets up: have D, shallow C, have C, want B. But due to this bug B is marked as uninteresting and is not sent.

Change-Id: I6e14b57b2f85e52d28cdcf356df647870f475440
Signed-off-by: Terry Parker <tparker@google.com>
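The root cause described above, a subclass inheriting a conversion method that returns the base type, can be reproduced with a toy class hierarchy. The sketch below does not use JGit classes: every type in it is a hypothetical stand-in named after its JGit counterpart, and the overriding variant is shown only as one plausible remedy, since this commit message does not spell out the exact fix.

```java
final class ShallowWalkSketch {
    /** Stand-in for a plain object walk. */
    static class ObjectWalk {
    }

    /** Stand-in for the depth-aware object walk that PackWriter checks for. */
    static class DepthObjectWalk extends ObjectWalk {
        final int depth;

        DepthObjectWalk(int depth) {
            this.depth = depth;
        }
    }

    /** Stand-in for the base commit walk with the standard conversion. */
    static class RevWalk {
        ObjectWalk toObjectWalkWithSameObjects() {
            return new ObjectWalk();
        }
    }

    /** Mirrors the bug: the conversion is inherited, so the depth type is lost. */
    static class BuggyDepthRevWalk extends RevWalk {
        final int depth;

        BuggyDepthRevWalk(int depth) {
            this.depth = depth;
        }
    }

    /** Assumed remedy: override the conversion so the depth-aware type survives. */
    static class FixedDepthRevWalk extends BuggyDepthRevWalk {
        FixedDepthRevWalk(int depth) {
            super(depth);
        }

        @Override
        ObjectWalk toObjectWalkWithSameObjects() {
            return new DepthObjectWalk(depth);
        }
    }

    /** The condition quoted above: would the caller's walk be thrown away? */
    static boolean wouldBeReplaced(boolean shallowPack, ObjectWalk walk) {
        return shallowPack && !(walk instanceof DepthObjectWalk);
    }
}
```

With the inherited conversion, `wouldBeReplaced(true, walk)` is true and any state prepared on the original walk (the cleared parents of shallow commits, in the real code) would be discarded; with the override it is false and the walk is kept.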
7 years ago
PackWriter: Support reuse of entire packs The most expensive part of packing a repository for transport to another system is enumerating all of the objects in the repository. Once this gets to the size of the linux-2.6 repository (1.8 million objects), enumeration can take several CPU minutes and costs a lot of temporary working set memory. Teach PackWriter to efficiently reuse an existing "cached pack" by answering a clone request with a thin pack followed by a larger cached pack appended to the end. This requires the repository owner to first construct the cached pack by hand, and record the tip commits inside of $GIT_DIR/objects/info/cached-packs: cd $GIT_DIR root=$(git rev-parse master) tmp=objects/.tmp-$$ names=$(echo $root | git pack-objects --keep-true-parents --revs $tmp) for n in $names; do chmod a-w $tmp-$n.pack $tmp-$n.idx touch objects/pack/pack-$n.keep mv $tmp-$n.pack objects/pack/pack-$n.pack mv $tmp-$n.idx objects/pack/pack-$n.idx done (echo "+ $root"; for n in $names; do echo "P $n"; done; echo) >>objects/info/cached-packs git repack -a -d When a clone request needs to include $root, the corresponding cached pack will be copied as-is, rather than enumerating all of the objects that are reachable from $root. For a linux-2.6 kernel repository that should be about 376 MiB, the above process creates two packs of 368 MiB and 38 MiB[1]. This is a local disk usage increase of ~26 MiB, due to reduced delta compression between the large cached pack and the smaller recent activity pack. The overhead is similar to 1 full copy of the compressed project sources. With this cached pack in hand, JGit daemon completes a clone request in 1m17s less time, but a slightly larger data transfer (+2.39 MiB): Before: remote: Counting objects: 1861830, done remote: Finding sources: 100% (1861830/1861830) remote: Getting sizes: 100% (88243/88243) remote: Compressing objects: 100% (88184/88184) Receiving objects: 100% (1861830/1861830), 376.01 MiB | 19.01 MiB/s, done. 
remote: Total 1861830 (delta 4706), reused 1851053 (delta 1553844) Resolving deltas: 100% (1564621/1564621), done. real 3m19.005s After: remote: Counting objects: 1601, done remote: Counting objects: 1828460, done remote: Finding sources: 100% (50475/50475) remote: Getting sizes: 100% (18843/18843) remote: Compressing objects: 100% (7585/7585) remote: Total 1861830 (delta 2407), reused 1856197 (delta 37510) Receiving objects: 100% (1861830/1861830), 378.40 MiB | 31.31 MiB/s, done. Resolving deltas: 100% (1559477/1559477), done. real 2m2.938s Repository owners can periodically refresh their cached packs by repacking their repository, folding all newer objects into a larger cached pack. Since repacking is already considered to be a normal Git maintenance activity, this isn't a very big burden. [1] In this test $root was set back about two weeks. Change-Id: Ib87131d5c4b5e8c5cacb0f4fe16ff4ece554734b Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
PackWriter: Support reuse of entire packs

The most expensive part of packing a repository for transport to another system is enumerating all of the objects in the repository. Once this gets to the size of the linux-2.6 repository (1.8 million objects), enumeration can take several CPU minutes and costs a lot of temporary working set memory.

Teach PackWriter to efficiently reuse an existing "cached pack" by answering a clone request with a thin pack followed by a larger cached pack appended to the end. This requires the repository owner to first construct the cached pack by hand, and record the tip commits inside of $GIT_DIR/objects/info/cached-packs:

  cd $GIT_DIR
  root=$(git rev-parse master)
  tmp=objects/.tmp-$$
  names=$(echo $root | git pack-objects --keep-true-parents --revs $tmp)
  for n in $names; do
    chmod a-w $tmp-$n.pack $tmp-$n.idx
    touch objects/pack/pack-$n.keep
    mv $tmp-$n.pack objects/pack/pack-$n.pack
    mv $tmp-$n.idx objects/pack/pack-$n.idx
  done

  (echo "+ $root";
   for n in $names; do echo "P $n"; done;
   echo) >>objects/info/cached-packs

  git repack -a -d

When a clone request needs to include $root, the corresponding cached pack will be copied as-is, rather than enumerating all of the objects that are reachable from $root.

For a linux-2.6 kernel repository that should be about 376 MiB, the above process creates two packs of 368 MiB and 38 MiB[1]. This is a local disk usage increase of ~26 MiB, due to reduced delta compression between the large cached pack and the smaller recent activity pack. The overhead is similar to 1 full copy of the compressed project sources.

With this cached pack in hand, JGit daemon completes a clone request in 1m17s less time, but a slightly larger data transfer (+2.39 MiB):

  Before:
    remote: Counting objects: 1861830, done
    remote: Finding sources: 100% (1861830/1861830)
    remote: Getting sizes: 100% (88243/88243)
    remote: Compressing objects: 100% (88184/88184)
    Receiving objects: 100% (1861830/1861830), 376.01 MiB | 19.01 MiB/s, done.
    remote: Total 1861830 (delta 4706), reused 1851053 (delta 1553844)
    Resolving deltas: 100% (1564621/1564621), done.

    real  3m19.005s

  After:
    remote: Counting objects: 1601, done
    remote: Counting objects: 1828460, done
    remote: Finding sources: 100% (50475/50475)
    remote: Getting sizes: 100% (18843/18843)
    remote: Compressing objects: 100% (7585/7585)
    remote: Total 1861830 (delta 2407), reused 1856197 (delta 37510)
    Receiving objects: 100% (1861830/1861830), 378.40 MiB | 31.31 MiB/s, done.
    Resolving deltas: 100% (1559477/1559477), done.

    real  2m2.938s

Repository owners can periodically refresh their cached packs by repacking their repository, folding all newer objects into a larger cached pack. Since repacking is already considered to be a normal Git maintenance activity, this isn't a very big burden.

[1] In this test $root was set back about two weeks.

Change-Id: Ib87131d5c4b5e8c5cacb0f4fe16ff4ece554734b
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
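The shell snippet in the cached-pack commit above appends records to objects/info/cached-packs: a "+ <tip>" line, one "P <pack name>" line per pack, then a blank line. A minimal sketch of reading that layout back follows. The class and field names (CachedPackList, Entry, tips, packNames) are hypothetical, and the sketch assumes only the format the snippet writes; it is not JGit's actual parser.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reader for the record layout written by the shell snippet. */
final class CachedPackList {
    /** One record: the tip commits and the packs that cover them. */
    static final class Entry {
        final List<String> tips = new ArrayList<>();
        final List<String> packNames = new ArrayList<>();

        boolean isEmpty() {
            return tips.isEmpty() && packNames.isEmpty();
        }
    }

    /** Splits the file content into records terminated by a blank line. */
    static List<Entry> parse(String text) {
        List<Entry> entries = new ArrayList<>();
        Entry cur = new Entry();
        for (String line : text.split("\n", -1)) {
            if (line.startsWith("+ ")) {
                cur.tips.add(line.substring(2).trim());
            } else if (line.startsWith("P ")) {
                cur.packNames.add(line.substring(2).trim());
            } else if (line.isEmpty() && !cur.isEmpty()) {
                // A blank line closes the current record.
                entries.add(cur);
                cur = new Entry();
            }
            // Any other line is ignored in this sketch.
        }
        if (!cur.isEmpty()) {
            entries.add(cur); // tolerate a missing trailing blank line
        }
        return entries;
    }
}
```

A server could then check whether a clone request's wanted commits include an entry's tips before deciding to append the listed packs as-is.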
PackWriter: Make thin packs more efficient

There is no point in pushing all of the files within the edge commits into the delta search when making a thin pack. This floods the delta search window with objects that are unlikely to be useful bases for the objects that will be written out, resulting in lower data compression and higher transfer sizes.

Instead observe the path of a tree or blob that is being pushed into the outgoing set, and use that path to locate up to WINDOW ancestor versions from the edge commits. Push only those objects into the edgeObjects set, reducing the number of objects seen by the search window. This allows PackWriter to only look at ancestors for the modified files, rather than all files in the project. Limiting the search to WINDOW size makes sense, because more than WINDOW edge objects will just skip through the window search as none of them need to be delta compressed.

To further improve compression, sort edge objects into the front of the window list, rather than randomly throughout. This puts non-edges later in the window and gives them a better chance at finding their base, since they search backwards through the window.

These changes make a significant difference in the thin-pack:

  Before:
    remote: Counting objects: 144190, done
    remote: Finding sources: 100% (50275/50275)
    remote: Getting sizes: 100% (101405/101405)
    remote: Compressing objects: 100% (7587/7587)
    Receiving objects: 100% (50275/50275), 24.67 MiB | 9.90 MiB/s, done.
    Resolving deltas: 100% (40339/40339), completed with 2218 local objects.

    real  0m30.267s

  After:
    remote: Counting objects: 61549, done
    remote: Finding sources: 100% (50275/50275)
    remote: Getting sizes: 100% (18862/18862)
    remote: Compressing objects: 100% (7588/7588)
    Receiving objects: 100% (50275/50275), 11.04 MiB | 3.51 MiB/s, done.
    Resolving deltas: 100% (43160/43160), completed with 5014 local objects.

    real  0m22.170s

The resulting pack is 13.63 MiB smaller, even though it contains the same exact objects. 82,543 fewer objects had to have their sizes looked up, which saved about 8s of server CPU time. 2,796 more objects from the client were used as part of the base object set, which contributed to the smaller transfer size.

Change-Id: Id01271950432c6960897495b09deab70e33993a9
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>
13 years ago
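The thin-pack commit above sorts edge objects ahead of the objects being written so that a backwards search through the window reaches them as candidate bases. One way to read that rule is sketched below: group by a path-derived key, put edge objects first within each group, then larger objects first. The Candidate type, its fields, and this exact tie-break order are assumptions made for illustration; the commit message does not spell out the comparator PackWriter uses.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class EdgeFirstOrder {
    /** Hypothetical stand-in for an object queued for delta search. */
    static final class Candidate {
        final String name;
        final int pathHash;  // rough similarity key derived from the path
        final boolean edge;  // true if the receiving side already has it
        final long size;

        Candidate(String name, int pathHash, boolean edge, long size) {
            this.name = name;
            this.pathHash = pathHash;
            this.edge = edge;
            this.size = size;
        }
    }

    /** Assumed order: similar paths together, edges first, then larger first. */
    static final Comparator<Candidate> WINDOW_ORDER = (a, b) -> {
        int cmp = Integer.compare(a.pathHash, b.pathHash);
        if (cmp != 0) {
            return cmp;
        }
        cmp = Boolean.compare(b.edge, a.edge); // true (edge) sorts first
        if (cmp != 0) {
            return cmp;
        }
        return Long.compare(b.size, a.size);
    };

    /** Returns the candidate names in the order the window would visit them. */
    static List<String> sortedNames(List<Candidate> in) {
        List<Candidate> copy = new ArrayList<>(in);
        copy.sort(WINDOW_ORDER);
        List<String> names = new ArrayList<>();
        for (Candidate c : copy) {
            names.add(c.name);
        }
        return names;
    }
}
```

With this order, a new version of a file is visited after the older edge version of the same path, so the edge version is already in the window when the new version looks backwards for a base.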
PackWriter: Hoist and cluster reference targets

Many source browsers and network related tools like UploadPack need to find and parse the target of all branches and annotated tags within the repository during their startup phase. Clustering these together into the same part of the pack file will improve locality, reducing thrashing when an application starts and needs to load all of these into memory at once.

To prevent bottlenecking basic log viewing tools that are scanning backwards from the tip of a current branch (and don't need tags) we place this cluster of older targets after 4096 newer commits have already been placed into the pack stream. 4096 was chosen as a rough guess, but was based on a few factors:

  - log viewers typically show 5-200 commits per page
  - users only view the first page or two
  - DHT can cram 2200-4000 commits per 1 MiB chunk, thus these will fall into the second commit chunk (roughly)

Unfortunately this placement hurts history tools that are scanning backwards through the commit graph and completely ignored tags or branch heads when they started. An ancient tagged commit is no longer positioned behind its first child (it's now much earlier), resulting in a page fault for the parser to reload this cluster of objects on demand. This may be an acceptable loss. If a user is walking backwards and has already scanned through more than 4096 commits of history, waiting for the region to reload isn't really that bad compared to the amount of time already spent.

If the repository is so small that there are fewer than 4096 commits, this change has no impact on the placement of objects.

Change-Id: If3052e430d305e17878d94145c93754f56b74c61
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
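The placement rule in the hoisting commit above can be sketched as a reordering of the commit walk: keep the first 4096 commits where they are, emit the remaining reference targets as one cluster, then emit everything else. The class name, method name, and the generic list-based representation are hypothetical; this is an illustration of the described layout, not the code PackWriter runs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class RefTargetHoist {
    /** Threshold named in the commit message. */
    static final int THRESHOLD = 4096;

    /**
     * Reorders commits (newest first) so that reference targets found
     * beyond the first {@code threshold} commits are clustered right
     * after them. Short histories are returned unchanged.
     */
    static <T> List<T> order(List<T> walkOrder, Set<T> refTargets, int threshold) {
        if (walkOrder.size() <= threshold) {
            return new ArrayList<>(walkOrder);
        }
        List<T> out = new ArrayList<>(walkOrder.size());
        out.addAll(walkOrder.subList(0, threshold));

        List<T> rest = walkOrder.subList(threshold, walkOrder.size());
        for (T c : rest) {
            if (refTargets.contains(c)) {
                out.add(c); // hoisted cluster of older branch and tag targets
            }
        }
        for (T c : rest) {
            if (!refTargets.contains(c)) {
                out.add(c); // remaining history keeps its walk order
            }
        }
        return out;
    }
}
```

Passing a small threshold makes the effect easy to see on a toy history; calling it with THRESHOLD reproduces the 4096-commit cutoff described in the message.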
Teach PackWriter how to reuse an existing object list

Counting the objects needed for packing is the most expensive part of an UploadPack request that has no uninteresting objects (otherwise known as an initial clone). During this phase the PackWriter is enumerating the entire set of objects in this repository, so they can be sent to the client for their new clone.

Allow the ObjectReader (and therefore the underlying storage system) to keep a cached list of all reachable objects from a small number of points in the project's history. If one of those points is reached during enumeration of the commit graph, most objects are obtained from the cached list instead of direct traversal.

PackWriter uses the list by discarding the current object lists and restarting a traversal from all refs but marking the object list name as uninteresting. This allows PackWriter to enumerate all objects that are more recent than the list creation, or that were on side branches that the list does not include.

However, ObjectWalk tags all of the trees and commits within the list commit as UNINTERESTING, which would normally cause PackWriter to construct a thin pack that excludes these objects. To avoid that, addObject() was refactored to allow this list-based enumeration to always include an object, even if it has been tagged UNINTERESTING by the ObjectWalk. This implies the list-based enumeration may only be used for initial clones, where all objects are being sent.

The UNINTERESTING labeling occurs because StartGenerator always enables the BoundaryGenerator if the walker is an ObjectWalk and a commit was marked UNINTERESTING, even if RevSort.BOUNDARY was not enabled. This is the default reasonable behavior for an ObjectWalk, but isn't desired here in PackWriter with the list-based enumeration. Rather than trying to change all of this behavior, PackWriter works around it.

Because the list name commit's immediate files and trees were all enumerated before the list enumeration itself starts (and are also within the list itself) PackWriter runs the risk of adding the same objects to its ObjectIdSubclassMap twice. Since this breaks the internal map data structure (and also may cause the object to transmit twice), PackWriter needs to use a new "added" RevFlag to track whether or not an object has been put into the outgoing list yet.

Change-Id: Ie99ed4d969a6bb20cc2528ac6b8fb91043cee071
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
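The two rules from the object-list commit above (list-based enumeration always includes an object, and an "added" marker prevents queuing the same object twice) can be sketched with plain Java types. OutgoingList, Obj, and the boolean parameters are hypothetical simplifications; the real code uses RevFlag and ObjectIdSubclassMap, which are not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;

final class OutgoingList {
    /** Hypothetical object handle carrying the "added" marker. */
    static final class Obj {
        final String id;
        boolean added; // plays the role of the "added" RevFlag

        Obj(String id) {
            this.id = id;
        }
    }

    private final List<Obj> objects = new ArrayList<>();

    /**
     * Queues an object for output at most once.
     *
     * @param uninteresting whether the walk tagged the object UNINTERESTING
     * @param fromCachedList whether the object came from the cached list,
     *        in which case it is included regardless of the tag
     * @return true if the object was newly queued
     */
    boolean addObject(Obj o, boolean uninteresting, boolean fromCachedList) {
        if (o.added) {
            return false; // already queued, do not insert a second time
        }
        if (uninteresting && !fromCachedList) {
            return false; // ordinary thin-pack behavior: leave it out
        }
        o.added = true;
        objects.add(o);
        return true;
    }

    int size() {
        return objects.size();
    }
}
```

The marker lives on the object rather than in a separate set, which mirrors the idea of a per-object flag and keeps the duplicate check constant-time.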
Support creating pack bitmap indexes in PackWriter.

Update the PackWriter to support writing out pack bitmap indexes, a parallel ".bitmap" file to the ".pack" file. Bitmaps are selected at commits every 1 to 5,000 commits for each unique path from the start. The most recent 100 commits are all bitmapped. The next 19,000 commits have a bitmap every 100 commits. The remaining commits have a bitmap every 5,000 commits. Commits with more than 1 parent are preferred over ones with 1 or fewer. Furthermore, previously computed bitmaps are reused, if the previous entry had the reuse flag set, which is set when the bitmap was placed at the max allowed distance.

Bitmaps are used to speed up the counting phase when packing, for requests that are not shallow. The PackWriterBitmapWalker uses a RevFilter to proactively mark commits with RevFlag.SEEN, when they appear in a bitmap. The walker produces the full closure of reachable ObjectIds, given the collection of starting ObjectIds.

For fetch requests, two ObjectWalks are executed to compute the ObjectIds reachable from the haves and from the wants. The ObjectIds needed to be written are determined by taking all the resulting wants AND NOT the haves.

For clone requests, we get cached pack support for "free" since it is possible to determine if all of the ObjectIds in a pack file are included in the resulting list of ObjectIds to write.

On my machine, the best times for clones and fetches of the linux kernel repository (with about 2.6M objects and 300K commits) are tabulated below:

  Operation                    Index V2               Index VE003
  Clone                        37530ms (524.06 MiB)     82ms (524.06 MiB)
  Fetch (1 commit back)           75ms                 107ms
  Fetch (10 commits back)        456ms (269.51 KiB)    341ms (265.19 KiB)
  Fetch (100 commits back)       449ms (269.91 KiB)    337ms (267.28 KiB)
  Fetch (1000 commits back)     2229ms ( 14.75 MiB)    189ms ( 14.42 MiB)
  Fetch (10000 commits back)    2177ms ( 16.30 MiB)    254ms ( 15.88 MiB)
  Fetch (100000 commits back)  14340ms (185.83 MiB)   1655ms (189.39 MiB)

Change-Id: Icdb0cdd66ff168917fb9ef17b96093990cc6a98d
11 years ago
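The "wants AND NOT haves" step in the bitmap commit message can be illustrated with plain java.util.BitSet. This is only a sketch: it assumes each object has been assigned a bit position, and the class and method names (WantsAndNotHaves, objectsToSend) are hypothetical and not part of JGit, which uses its own bitmap types.

```java
import java.util.BitSet;

/** Illustrative only: models reachability bitmaps with java.util.BitSet. */
final class WantsAndNotHaves {
    /** Returns the bit positions that are set in wants but not in haves. */
    static BitSet objectsToSend(BitSet wants, BitSet haves) {
        BitSet result = (BitSet) wants.clone(); // leave the inputs untouched
        result.andNot(haves);                   // wants AND NOT haves
        return result;
    }

    public static void main(String[] args) {
        BitSet wants = new BitSet();
        wants.set(0, 6); // objects 0..5 are reachable from the wants
        BitSet haves = new BitSet();
        haves.set(0, 3); // objects 0..2 are reachable from the haves
        System.out.println(objectsToSend(wants, haves)); // prints {3, 4, 5}
    }
}
```

Because the result is a single bitwise operation over two precomputed sets, no per-object graph walk is needed once both bitmaps are available.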
PackWriter: Support reuse of entire packs

The most expensive part of packing a repository for transport to another system is enumerating all of the objects in the repository. Once this gets to the size of the linux-2.6 repository (1.8 million objects), enumeration can take several CPU minutes and costs a lot of temporary working set memory.

Teach PackWriter to efficiently reuse an existing "cached pack" by answering a clone request with a thin pack followed by a larger cached pack appended to the end. This requires the repository owner to first construct the cached pack by hand, and record the tip commits inside of $GIT_DIR/objects/info/cached-packs:

    cd $GIT_DIR
    root=$(git rev-parse master)
    tmp=objects/.tmp-$$
    names=$(echo $root | git pack-objects --keep-true-parents --revs $tmp)
    for n in $names; do
      chmod a-w $tmp-$n.pack $tmp-$n.idx
      touch objects/pack/pack-$n.keep
      mv $tmp-$n.pack objects/pack/pack-$n.pack
      mv $tmp-$n.idx objects/pack/pack-$n.idx
    done

    (echo "+ $root";
     for n in $names; do echo "P $n"; done;
     echo) >>objects/info/cached-packs

    git repack -a -d

When a clone request needs to include $root, the corresponding cached pack will be copied as-is, rather than enumerating all of the objects that are reachable from $root.

For a linux-2.6 kernel repository that should be about 376 MiB, the above process creates two packs of 368 MiB and 38 MiB[1]. This is a local disk usage increase of ~26 MiB, due to reduced delta compression between the large cached pack and the smaller recent activity pack. The overhead is similar to 1 full copy of the compressed project sources.

With this cached pack in hand, JGit daemon completes a clone request in 1m17s less time, but a slightly larger data transfer (+2.39 MiB):

Before:

    remote: Counting objects: 1861830, done
    remote: Finding sources: 100% (1861830/1861830)
    remote: Getting sizes: 100% (88243/88243)
    remote: Compressing objects: 100% (88184/88184)
    Receiving objects: 100% (1861830/1861830), 376.01 MiB | 19.01 MiB/s, done.
    remote: Total 1861830 (delta 4706), reused 1851053 (delta 1553844)
    Resolving deltas: 100% (1564621/1564621), done.

    real 3m19.005s

After:

    remote: Counting objects: 1601, done
    remote: Counting objects: 1828460, done
    remote: Finding sources: 100% (50475/50475)
    remote: Getting sizes: 100% (18843/18843)
    remote: Compressing objects: 100% (7585/7585)
    remote: Total 1861830 (delta 2407), reused 1856197 (delta 37510)
    Receiving objects: 100% (1861830/1861830), 378.40 MiB | 31.31 MiB/s, done.
    Resolving deltas: 100% (1559477/1559477), done.

    real 2m2.938s

Repository owners can periodically refresh their cached packs by repacking their repository, folding all newer objects into a larger cached pack. Since repacking is already considered to be a normal Git maintenance activity, this isn't a very big burden.

[1] In this test $root was set back about two weeks.

Change-Id: Ib87131d5c4b5e8c5cacb0f4fe16ff4ece554734b
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
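The shell snippet in the cached-pack commit message appends records to objects/info/cached-packs as a "+ <tip>" line, one "P <pack name>" line per pack, and a terminating blank line. A reader for that layout might look roughly like the sketch below. The record format is inferred only from the echo commands in the message, and CachedPackList and Entry are hypothetical names, not JGit classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative parser for the record layout written by the shell snippet. */
final class CachedPackList {
    /** One record: the tip commits and the pack names that cover them. */
    static final class Entry {
        final List<String> tips = new ArrayList<>();
        final List<String> packNames = new ArrayList<>();
    }

    static List<Entry> parse(String text) {
        List<Entry> entries = new ArrayList<>();
        Entry cur = null;
        for (String line : text.split("\n", -1)) {
            if (line.isEmpty()) {
                // A blank line closes the record in progress, if any.
                if (cur != null) {
                    entries.add(cur);
                    cur = null;
                }
            } else if (line.startsWith("+ ")) {
                if (cur == null) cur = new Entry();
                cur.tips.add(line.substring(2));
            } else if (line.startsWith("P ")) {
                if (cur == null) cur = new Entry();
                cur.packNames.add(line.substring(2));
            }
            // Any other line is ignored in this sketch.
        }
        if (cur != null) entries.add(cur); // tolerate a missing final blank line
        return entries;
    }
}
```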
PackWriter: Make thin packs more efficient

There is no point in pushing all of the files within the edge commits into the delta search when making a thin pack. This floods the delta search window with objects that are unlikely to be useful bases for the objects that will be written out, resulting in lower data compression and higher transfer sizes.

Instead observe the path of a tree or blob that is being pushed into the outgoing set, and use that path to locate up to WINDOW ancestor versions from the edge commits. Push only those objects into the edgeObjects set, reducing the number of objects seen by the search window. This allows PackWriter to only look at ancestors for the modified files, rather than all files in the project. Limiting the search to WINDOW size makes sense, because more than WINDOW edge objects will just skip through the window search as none of them need to be delta compressed.

To further improve compression, sort edge objects into the front of the window list, rather than randomly throughout. This puts non-edges later in the window and gives them a better chance at finding their base, since they search backwards through the window.

These changes make a significant difference in the thin-pack:

Before:

    remote: Counting objects: 144190, done
    remote: Finding sources: 100% (50275/50275)
    remote: Getting sizes: 100% (101405/101405)
    remote: Compressing objects: 100% (7587/7587)
    Receiving objects: 100% (50275/50275), 24.67 MiB | 9.90 MiB/s, done.
    Resolving deltas: 100% (40339/40339), completed with 2218 local objects.

    real 0m30.267s

After:

    remote: Counting objects: 61549, done
    remote: Finding sources: 100% (50275/50275)
    remote: Getting sizes: 100% (18862/18862)
    remote: Compressing objects: 100% (7588/7588)
    Receiving objects: 100% (50275/50275), 11.04 MiB | 3.51 MiB/s, done.
    Resolving deltas: 100% (43160/43160), completed with 5014 local objects.

    real 0m22.170s

The resulting pack is 13.63 MiB smaller, even though it contains exactly the same objects. 82,543 fewer objects had to have their sizes looked up, which saved about 8s of server CPU time. 2,796 more objects from the client were used as part of the base object set, which contributed to the smaller transfer size.

Change-Id: Id01271950432c6960897495b09deab70e33993a9
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>
13 years ago
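The "sort edge objects into the front of the window list" step from the thin-pack commit message can be sketched as a stable sort on an edge flag. This is a simplified illustration rather than PackWriter's actual ordering, which also takes other criteria into account; EdgeFirstOrder and Candidate are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: moves edge objects ahead of the objects to be written. */
final class EdgeFirstOrder {
    static final class Candidate {
        final String name;
        final boolean edge; // true if the receiver already has this object

        Candidate(String name, boolean edge) {
            this.name = name;
            this.edge = edge;
        }
    }

    /** Returns a copy with edges first; relative order is otherwise kept. */
    static List<Candidate> edgesFirst(List<Candidate> in) {
        List<Candidate> out = new ArrayList<>(in);
        // List.sort is stable, so ties keep their original relative order.
        out.sort(Comparator.comparingInt((Candidate c) -> c.edge ? 0 : 1));
        return out;
    }
}
```

With the edges at the front, every non-edge object that looks backwards through the window has the edge objects behind it as potential bases.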
PackWriter: Don't reuse commit or tag deltas

JGit doesn't generate deltas for commit or tag objects when it packs a repository from scratch. This is an explicit design decision that is (mostly) justified by the fact that these objects do not delta compress well.

Annotated tags are made once on stable points of the project history, so it is unlikely they will ever appear again with sufficient common text to justify using a delta over just deflating the raw content. JGit never tries to delta compress annotated tags, and I take the stance that these are best stored as non-deltas given how frequently they might be accessed by repository viewers.

Commits only have sufficient common text when they are cherry-picked to forward-port or back-port a change from one branch to another. Even in these cases the distance between the commits as returned by the log traversal has to be small enough that they would both appear in the delta search window at the same time in order to delta compress one of the messages against the other. JGit never tries to delta compress commits, as it requires a lot of CPU time but typically does not produce a smaller pack file.

Avoid reusing deltas for either of these types when constructing a new pack. To avoid killing performance during serving of network clients, UploadPack disables this code change by allowing PackWriter to reuse delta commits. Repositories that were already repacked by C Git will not have their delta commits decompressed and recompressed on the fly during object writing, saving server-side CPU resources.

Change-Id: I749407e7c5c677e05e4d054b40db7656cfa7fca8
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
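The reuse rule described in this commit message can be summarized as a small decision function. The sketch below is an assumption-laden illustration: DeltaReusePolicy, Type, and mayReuseDelta are hypothetical names, and the treatment of tags when commit reuse is enabled is a guess, since the message only says that UploadPack allows reuse of delta commits.

```java
/** Illustrative only: decides whether an existing delta may be reused. */
final class DeltaReusePolicy {
    enum Type { COMMIT, TREE, BLOB, TAG }

    /**
     * @param type object type of the candidate
     * @param reuseDeltaCommits true when the caller (for example a server
     *        answering a fetch) opts back in to reusing commit deltas
     */
    static boolean mayReuseDelta(Type type, boolean reuseDeltaCommits) {
        switch (type) {
        case COMMIT:
            return reuseDeltaCommits; // off by default, on when serving clients
        case TAG:
            return false;             // assumption: tags are never reused as deltas
        default:
            return true;              // trees and blobs reuse deltas freely
        }
    }
}
```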
PackWriter: Fix the way delta chain cycles are prevented

Take a very simple approach to avoiding delta chain cycles during object reuse: objects are now always selected from the oldest pack that contains them. This prevents cycles because a pack must not have a cycle in the delta chain. If both objects A and B are chosen out of the same source pack then there cannot be an A->B->A cycle.

The oldest pack is also the most likely to have the smallest deltas. It's the biggest pack in the system and probably came from the clone (or last GC) of this repository, where all objects were previously considered and packed tightly together. If an object appears again (for example due to a revert and a push into this repository), the newer copy of it won't be nearly as small as the older delta version of it, even if the newer one is also itself a delta.

ObjectDirectory already enumerates objects during selection in this newest->oldest order, so it already is supplying these assumptions to PackWriter. Taking advantage of this can speed up selection by a tiny amount by avoiding some tests, but can also help to prevent a cycle needing to be broken on the fly during writing.

The previous cycle breaking logic wasn't fully correct either. If a different delta base was chosen, the new delta base might not have been written into the output pack before the current object, forcing the use of REF_DELTA when OFS_DELTA is always smaller. This logic has now been reworked to always re-check the delta base and ensure it gets written before the current object.

If a cycle occurs, it gets broken the same way as before, by disabling delta reuse and finding an alternative form of the object, which may require inflating/deflating in whole format.

Change-Id: I9953ab8be54ceb8b588e1280d6f7edd688887747
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
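The "always select from the oldest pack" rule can be sketched as follows, assuming packs are visited newest first as the message describes for ObjectDirectory. The names (OldestPackSelection, select) and the use of strings for object ids are hypothetical simplifications, not JGit code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only: picks, per object, the oldest pack that contains it. */
final class OldestPackSelection {
    /**
     * @param packsNewestFirst object ids held by each pack, newest pack first
     * @return for each object id, the index of the oldest pack containing it
     */
    static Map<String, Integer> select(List<Set<String>> packsNewestFirst) {
        Map<String, Integer> chosen = new HashMap<>();
        for (int i = 0; i < packsNewestFirst.size(); i++) {
            for (String id : packsNewestFirst.get(i)) {
                // A later (older) pack always overwrites an earlier choice.
                chosen.put(id, i);
            }
        }
        return chosen;
    }
}
```

When an object and its delta base both resolve to the same pack, that pack's own guarantee of acyclic delta chains carries over to the reused representation.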
Support creating pack bitmap indexes in PackWriter. Update the PackWriter to support writing out pack bitmap indexes, a parallel ".bitmap" file to the ".pack" file. Bitmaps are selected at commits every 1 to 5,000 commits for each unique path from the start. The most recent 100 commits are all bitmapped. The next 19,000 commits have a bitmaps every 100 commits. The remaining commits have a bitmap every 5,000 commits. Commits with more than 1 parent are prefered over ones with 1 or less. Furthermore, previously computed bitmaps are reused, if the previous entry had the reuse flag set, which is set when the bitmap was placed at the max allowed distance. Bitmaps are used to speed up the counting phase when packing, for requests that are not shallow. The PackWriterBitmapWalker uses a RevFilter to proactively mark commits with RevFlag.SEEN, when they appear in a bitmap. The walker produces the full closure of reachable ObjectIds, given the collection of starting ObjectIds. For fetch request, two ObjectWalks are executed to compute the ObjectIds reachable from the haves and from the wants. The ObjectIds needed to be written are determined by taking all the resulting wants AND NOT the haves. For clone requests, we get cached pack support for "free" since it is possible to determine if all of the ObjectIds in a pack file are included in the resulting list of ObjectIds to write. 
On my machine, the best times for clones and fetches of the linux kernel repository (with about 2.6M objects and 300K commits) are tabulated below: Operation Index V2 Index VE003 Clone 37530ms (524.06 MiB) 82ms (524.06 MiB) Fetch (1 commit back) 75ms 107ms Fetch (10 commits back) 456ms (269.51 KiB) 341ms (265.19 KiB) Fetch (100 commits back) 449ms (269.91 KiB) 337ms (267.28 KiB) Fetch (1000 commits back) 2229ms ( 14.75 MiB) 189ms ( 14.42 MiB) Fetch (10000 commits back) 2177ms ( 16.30 MiB) 254ms ( 15.88 MiB) Fetch (100000 commits back) 14340ms (185.83 MiB) 1655ms (189.39 MiB) Change-Id: Icdb0cdd66ff168917fb9ef17b96093990cc6a98d
11 years ago
Support creating pack bitmap indexes in PackWriter. Update the PackWriter to support writing out pack bitmap indexes, a parallel ".bitmap" file to the ".pack" file. Bitmaps are selected at commits every 1 to 5,000 commits for each unique path from the start. The most recent 100 commits are all bitmapped. The next 19,000 commits have a bitmaps every 100 commits. The remaining commits have a bitmap every 5,000 commits. Commits with more than 1 parent are prefered over ones with 1 or less. Furthermore, previously computed bitmaps are reused, if the previous entry had the reuse flag set, which is set when the bitmap was placed at the max allowed distance. Bitmaps are used to speed up the counting phase when packing, for requests that are not shallow. The PackWriterBitmapWalker uses a RevFilter to proactively mark commits with RevFlag.SEEN, when they appear in a bitmap. The walker produces the full closure of reachable ObjectIds, given the collection of starting ObjectIds. For fetch request, two ObjectWalks are executed to compute the ObjectIds reachable from the haves and from the wants. The ObjectIds needed to be written are determined by taking all the resulting wants AND NOT the haves. For clone requests, we get cached pack support for "free" since it is possible to determine if all of the ObjectIds in a pack file are included in the resulting list of ObjectIds to write. 
On my machine, the best times for clones and fetches of the linux kernel repository (with about 2.6M objects and 300K commits) are tabulated below: Operation Index V2 Index VE003 Clone 37530ms (524.06 MiB) 82ms (524.06 MiB) Fetch (1 commit back) 75ms 107ms Fetch (10 commits back) 456ms (269.51 KiB) 341ms (265.19 KiB) Fetch (100 commits back) 449ms (269.91 KiB) 337ms (267.28 KiB) Fetch (1000 commits back) 2229ms ( 14.75 MiB) 189ms ( 14.42 MiB) Fetch (10000 commits back) 2177ms ( 16.30 MiB) 254ms ( 15.88 MiB) Fetch (100000 commits back) 14340ms (185.83 MiB) 1655ms (189.39 MiB) Change-Id: Icdb0cdd66ff168917fb9ef17b96093990cc6a98d
11 years ago
Support creating pack bitmap indexes in PackWriter. Update the PackWriter to support writing out pack bitmap indexes, a parallel ".bitmap" file to the ".pack" file. Bitmaps are selected at commits every 1 to 5,000 commits for each unique path from the start. The most recent 100 commits are all bitmapped. The next 19,000 commits have a bitmaps every 100 commits. The remaining commits have a bitmap every 5,000 commits. Commits with more than 1 parent are prefered over ones with 1 or less. Furthermore, previously computed bitmaps are reused, if the previous entry had the reuse flag set, which is set when the bitmap was placed at the max allowed distance. Bitmaps are used to speed up the counting phase when packing, for requests that are not shallow. The PackWriterBitmapWalker uses a RevFilter to proactively mark commits with RevFlag.SEEN, when they appear in a bitmap. The walker produces the full closure of reachable ObjectIds, given the collection of starting ObjectIds. For fetch request, two ObjectWalks are executed to compute the ObjectIds reachable from the haves and from the wants. The ObjectIds needed to be written are determined by taking all the resulting wants AND NOT the haves. For clone requests, we get cached pack support for "free" since it is possible to determine if all of the ObjectIds in a pack file are included in the resulting list of ObjectIds to write. 
On my machine, the best times for clones and fetches of the linux kernel repository (with about 2.6M objects and 300K commits) are tabulated below: Operation Index V2 Index VE003 Clone 37530ms (524.06 MiB) 82ms (524.06 MiB) Fetch (1 commit back) 75ms 107ms Fetch (10 commits back) 456ms (269.51 KiB) 341ms (265.19 KiB) Fetch (100 commits back) 449ms (269.91 KiB) 337ms (267.28 KiB) Fetch (1000 commits back) 2229ms ( 14.75 MiB) 189ms ( 14.42 MiB) Fetch (10000 commits back) 2177ms ( 16.30 MiB) 254ms ( 15.88 MiB) Fetch (100000 commits back) 14340ms (185.83 MiB) 1655ms (189.39 MiB) Change-Id: Icdb0cdd66ff168917fb9ef17b96093990cc6a98d
11 years ago
Support creating pack bitmap indexes in PackWriter.

Update the PackWriter to support writing out pack bitmap indexes, a parallel ".bitmap" file to the ".pack" file.

Bitmaps are selected at commits every 1 to 5,000 commits for each unique path from the start. The most recent 100 commits are all bitmapped. The next 19,000 commits have a bitmap every 100 commits. The remaining commits have a bitmap every 5,000 commits. Commits with more than 1 parent are preferred over ones with 1 or fewer. Furthermore, previously computed bitmaps are reused if the previous entry had the reuse flag set, which is set when the bitmap was placed at the max allowed distance.

Bitmaps are used to speed up the counting phase when packing, for requests that are not shallow. The PackWriterBitmapWalker uses a RevFilter to proactively mark commits with RevFlag.SEEN when they appear in a bitmap. The walker produces the full closure of reachable ObjectIds, given the collection of starting ObjectIds.

For fetch requests, two ObjectWalks are executed to compute the ObjectIds reachable from the haves and from the wants. The ObjectIds that need to be written are determined by taking all the resulting wants AND NOT the haves.

For clone requests, we get cached pack support for "free" since it is possible to determine if all of the ObjectIds in a pack file are included in the resulting list of ObjectIds to write.

On my machine, the best times for clones and fetches of the linux kernel repository (with about 2.6M objects and 300K commits) are tabulated below:

Operation                     Index V2                Index VE003
Clone                         37530ms (524.06 MiB)      82ms (524.06 MiB)
Fetch (1 commit back)            75ms                   107ms
Fetch (10 commits back)         456ms (269.51 KiB)      341ms (265.19 KiB)
Fetch (100 commits back)        449ms (269.91 KiB)      337ms (267.28 KiB)
Fetch (1000 commits back)      2229ms ( 14.75 MiB)      189ms ( 14.42 MiB)
Fetch (10000 commits back)     2177ms ( 16.30 MiB)      254ms ( 15.88 MiB)
Fetch (100000 commits back)   14340ms (185.83 MiB)     1655ms (189.39 MiB)

Change-Id: Icdb0cdd66ff168917fb9ef17b96093990cc6a98d
11 years ago
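The two mechanical pieces of the bitmap commit message, the tiered spacing and the "wants AND NOT haves" step, can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class and method names (BitmapSketch, spacingFor, objectsToSend) are hypothetical and not taken from JGit, java.util.BitSet stands in for the real compressed bitmap type, and the spacing function is a simplified reading of the three tiers that ignores the per-path selection and the merge-commit preference.

```java
import java.util.BitSet;

/** Illustrative sketch only; names are hypothetical, not JGit API. */
final class BitmapSketch {

    /**
     * Simplified reading of the three tiers in the commit message:
     * the newest 100 commits are all bitmapped, the next 19,000 get a
     * bitmap every 100 commits, and everything older every 5,000.
     *
     * @param commitsFromTip 0 for the newest commit, growing with age
     */
    static int spacingFor(int commitsFromTip) {
        if (commitsFromTip < 0) {
            throw new IllegalArgumentException("negative distance");
        }
        if (commitsFromTip < 100) {
            return 1;
        }
        if (commitsFromTip < 100 + 19_000) {
            return 100;
        }
        return 5_000;
    }

    /**
     * Computes "wants AND NOT haves". Each bit position stands for one
     * object; neither input is modified.
     */
    static BitSet objectsToSend(BitSet wants, BitSet haves) {
        BitSet result = (BitSet) wants.clone();
        result.andNot(haves);
        return result;
    }

    public static void main(String[] args) {
        BitSet wants = new BitSet();
        wants.set(0, 6); // objects 0..5 are reachable from the wants
        BitSet haves = new BitSet();
        haves.set(0, 3); // objects 0..2 are already on the other side

        System.out.println(objectsToSend(wants, haves)); // prints {3, 4, 5}
        System.out.println(spacingFor(50));              // prints 1
        System.out.println(spacingFor(5_000));           // prints 100
        System.out.println(spacingFor(50_000));          // prints 5000
    }
}
```

Cloning the wants bitmap before calling andNot keeps both inputs reusable, which matters if the same reachability bitmap is consulted for more than one request.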
PackWriter: Don't reuse commit or tag deltas

JGit doesn't generate deltas for commit or tag objects when it packs a repository from scratch. This is an explicit design decision that is (mostly) justified by the fact that these objects do not delta compress well.

Annotated tags are made once on stable points of the project history; it is unlikely they will ever appear again with sufficient common text to justify using a delta over just deflating the raw content. JGit never tries to delta compress annotated tags, and I take the stance that these are best stored as non-deltas given how frequently they might be accessed by repository viewers.

Commits only have sufficient common text when they are cherry-picked to forward-port or back-port a change from one branch to another. Even in these cases, the distance between the commits as returned by the log traversal has to be small enough that they would both appear in the delta search window at the same time in order to delta compress one of the messages against the other. JGit never tries to delta compress commits, as it requires a lot of CPU time but typically does not produce a smaller pack file.

Avoid reusing deltas for either of these types when constructing a new pack. To avoid killing performance while serving network clients, UploadPack disables this code change by allowing PackWriter to reuse delta commits. Repositories that were already repacked by C Git will not have their delta commits decompressed and recompressed on the fly during object writing, saving server-side CPU resources.

Change-Id: I749407e7c5c677e05e4d054b40db7656cfa7fca8
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
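The reuse policy described in this commit message comes down to a small decision per object type. The following is a minimal sketch of that decision as I read the message, not the code in PackWriter: the class DeltaReuseSketch, the enum ObjectKind, and the method mayReuseDelta are hypothetical names, and the reuseDeltaCommits parameter is an assumed stand-in for the switch that UploadPack turns on.

```java
/** Illustrative sketch only; names are hypothetical, not JGit API. */
final class DeltaReuseSketch {

    /** The four Git object types, modelled as an enum for clarity. */
    enum ObjectKind { COMMIT, TREE, BLOB, TAG }

    /**
     * Decides whether an existing delta may be copied into the new pack.
     *
     * @param kind              type of the object being written
     * @param reuseDeltaCommits assumed flag; true when serving network
     *                          clients, so commit deltas are not
     *                          recompressed on the fly
     */
    static boolean mayReuseDelta(ObjectKind kind, boolean reuseDeltaCommits) {
        switch (kind) {
        case COMMIT:
            // Commit deltas are reused only when the caller opts in.
            return reuseDeltaCommits;
        case TAG:
            // Annotated tags are stored whole in this reading.
            return false;
        default:
            // Trees and blobs are the objects that delta compress well.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(mayReuseDelta(ObjectKind.COMMIT, false)); // prints false
        System.out.println(mayReuseDelta(ObjectKind.COMMIT, true));  // prints true
        System.out.println(mayReuseDelta(ObjectKind.TAG, true));     // prints false
        System.out.println(mayReuseDelta(ObjectKind.BLOB, false));   // prints true
    }
}
```

Keeping the opt-in as an explicit parameter mirrors the trade-off in the message: a local repack can afford to store commits whole, while a server answering a fetch avoids the CPU cost of inflating and deflating them again.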
  1. /*
  2. * Copyright (C) 2008-2010, Google Inc.
  3. * Copyright (C) 2008, Marek Zawirski <marek.zawirski@gmail.com> and others
  4. *
  5. * This program and the accompanying materials are made available under the
  6. * terms of the Eclipse Distribution License v. 1.0 which is available at
  7. * https://www.eclipse.org/org/documents/edl-v10.php.
  8. *
  9. * SPDX-License-Identifier: BSD-3-Clause
  10. */
  11. package org.eclipse.jgit.internal.storage.pack;
  12. import static java.util.Objects.requireNonNull;
  13. import static org.eclipse.jgit.internal.storage.pack.StoredObjectRepresentation.PACK_DELTA;
  14. import static org.eclipse.jgit.internal.storage.pack.StoredObjectRepresentation.PACK_WHOLE;
  15. import static org.eclipse.jgit.lib.Constants.OBJECT_ID_LENGTH;
  16. import static org.eclipse.jgit.lib.Constants.OBJ_BLOB;
  17. import static org.eclipse.jgit.lib.Constants.OBJ_COMMIT;
  18. import static org.eclipse.jgit.lib.Constants.OBJ_TAG;
  19. import static org.eclipse.jgit.lib.Constants.OBJ_TREE;
  20. import java.io.IOException;
  21. import java.io.OutputStream;
  22. import java.lang.ref.WeakReference;
  23. import java.security.MessageDigest;
  24. import java.text.MessageFormat;
  25. import java.time.Duration;
  26. import java.util.ArrayList;
  27. import java.util.Arrays;
  28. import java.util.Collection;
  29. import java.util.Collections;
  30. import java.util.HashMap;
  31. import java.util.HashSet;
  32. import java.util.Iterator;
  33. import java.util.List;
  34. import java.util.Map;
  35. import java.util.NoSuchElementException;
  36. import java.util.Set;
  37. import java.util.concurrent.ConcurrentHashMap;
  38. import java.util.concurrent.ExecutionException;
  39. import java.util.concurrent.Executor;
  40. import java.util.concurrent.ExecutorService;
  41. import java.util.concurrent.Executors;
  42. import java.util.concurrent.Future;
  43. import java.util.concurrent.TimeUnit;
  44. import java.util.zip.CRC32;
  45. import java.util.zip.CheckedOutputStream;
  46. import java.util.zip.Deflater;
  47. import java.util.zip.DeflaterOutputStream;
  48. import org.eclipse.jgit.annotations.NonNull;
  49. import org.eclipse.jgit.annotations.Nullable;
  50. import org.eclipse.jgit.errors.CorruptObjectException;
  51. import org.eclipse.jgit.errors.IncorrectObjectTypeException;
  52. import org.eclipse.jgit.errors.LargeObjectException;
  53. import org.eclipse.jgit.errors.MissingObjectException;
  54. import org.eclipse.jgit.errors.SearchForReuseTimeout;
  55. import org.eclipse.jgit.errors.StoredObjectRepresentationNotAvailableException;
  56. import org.eclipse.jgit.internal.JGitText;
  57. import org.eclipse.jgit.internal.storage.file.PackBitmapIndexBuilder;
  58. import org.eclipse.jgit.internal.storage.file.PackBitmapIndexWriterV1;
  59. import org.eclipse.jgit.internal.storage.file.PackIndexWriter;
  60. import org.eclipse.jgit.lib.AnyObjectId;
  61. import org.eclipse.jgit.lib.AsyncObjectSizeQueue;
  62. import org.eclipse.jgit.lib.BatchingProgressMonitor;
  63. import org.eclipse.jgit.lib.BitmapIndex;
  64. import org.eclipse.jgit.lib.BitmapIndex.BitmapBuilder;
  65. import org.eclipse.jgit.lib.BitmapObject;
  66. import org.eclipse.jgit.lib.Constants;
  67. import org.eclipse.jgit.lib.NullProgressMonitor;
  68. import org.eclipse.jgit.lib.ObjectId;
  69. import org.eclipse.jgit.lib.ObjectIdOwnerMap;
  70. import org.eclipse.jgit.lib.ObjectIdSet;
  71. import org.eclipse.jgit.lib.ObjectLoader;
  72. import org.eclipse.jgit.lib.ObjectReader;
  73. import org.eclipse.jgit.lib.ProgressMonitor;
  74. import org.eclipse.jgit.lib.Repository;
  75. import org.eclipse.jgit.lib.ThreadSafeProgressMonitor;
  76. import org.eclipse.jgit.revwalk.AsyncRevObjectQueue;
  77. import org.eclipse.jgit.revwalk.BitmapWalker;
  78. import org.eclipse.jgit.revwalk.DepthWalk;
  79. import org.eclipse.jgit.revwalk.ObjectWalk;
  80. import org.eclipse.jgit.revwalk.RevCommit;
  81. import org.eclipse.jgit.revwalk.RevFlag;
  82. import org.eclipse.jgit.revwalk.RevObject;
  83. import org.eclipse.jgit.revwalk.RevSort;
  84. import org.eclipse.jgit.revwalk.RevTag;
  85. import org.eclipse.jgit.revwalk.RevTree;
  86. import org.eclipse.jgit.storage.pack.PackConfig;
  87. import org.eclipse.jgit.storage.pack.PackStatistics;
  88. import org.eclipse.jgit.transport.FilterSpec;
  89. import org.eclipse.jgit.transport.ObjectCountCallback;
  90. import org.eclipse.jgit.transport.PacketLineOut;
  91. import org.eclipse.jgit.transport.WriteAbortedException;
  92. import org.eclipse.jgit.util.BlockList;
  93. import org.eclipse.jgit.util.TemporaryBuffer;
  94. /**
  95. * <p>
  96. * The PackWriter class is responsible for generating pack files from a
  97. * specified set of objects from a repository. This implementation produces
  98. * pack files in format version 2.
  99. * </p>
  100. * <p>
  101. * The source of objects may be specified in two ways:
  102. * <ul>
  103. * <li>(usually) by providing sets of interesting and uninteresting objects in
  104. * the repository - all interesting objects and their ancestors, except
  105. * uninteresting objects and their ancestors, will be included in the pack, or</li>
  106. * <li>by providing an iterator of {@link org.eclipse.jgit.revwalk.RevObject}
  107. * specifying the exact list and order of objects in the pack</li>
  108. * </ul>
  108. * </ul>
  109. * <p>
  110. * Typical usage consists of creating an instance, configuring options,
  111. * preparing the list of objects by calling {@link #preparePack(Iterator)} or
  112. * {@link #preparePack(ProgressMonitor, Set, Set)}, and streaming with
  113. * {@link #writePack(ProgressMonitor, ProgressMonitor, OutputStream)}. If the
  114. * pack is being stored as a file the matching index can be written out after
  115. * writing the pack by {@link #writeIndex(OutputStream)}. An optional bitmap
  116. * index can be made by calling {@link #prepareBitmapIndex(ProgressMonitor)}
  117. * followed by {@link #writeBitmapIndex(OutputStream)}.
  118. * </p>
  119. * <p>
  120. * This class provides a set of configurable options and
  121. * {@link org.eclipse.jgit.lib.ProgressMonitor} support, as operations may take
  122. * a long time for big repositories. Existing deltas and whole objects are
  123. * reused where possible; when no suitable delta is available for reuse, a new
  124. * delta is generated.
  125. * </p>
  126. * <p>
  127. * This class is not thread safe. It is intended to be used in one thread as a
  128. * single pass to produce one pack. Invoking methods multiple times or out of
  129. * order is not supported as internal data structures are destroyed during
  130. * certain phases to save memory when packing large repositories.
  131. * </p>
  132. */
  133. public class PackWriter implements AutoCloseable {
  134. private static final int PACK_VERSION_GENERATED = 2;
  135. /** Empty set of objects for {@code preparePack()}. */
  136. public static final Set<ObjectId> NONE = Collections.emptySet();
  137. private static final Map<WeakReference<PackWriter>, Boolean> instances =
  138. new ConcurrentHashMap<>();
  139. private static final Iterable<PackWriter> instancesIterable = () -> new Iterator<PackWriter>() {
  140. private final Iterator<WeakReference<PackWriter>> it = instances
  141. .keySet().iterator();
  142. private PackWriter next;
  143. @Override
  144. public boolean hasNext() {
  145. if (next != null) {
  146. return true;
  147. }
  148. while (it.hasNext()) {
  149. WeakReference<PackWriter> ref = it.next();
  150. next = ref.get();
  151. if (next != null) {
  152. return true;
  153. }
  154. it.remove();
  155. }
  156. return false;
  157. }
  158. @Override
  159. public PackWriter next() {
  160. if (hasNext()) {
  161. PackWriter result = next;
  162. next = null;
  163. return result;
  164. }
  165. throw new NoSuchElementException();
  166. }
  167. @Override
  168. public void remove() {
  169. throw new UnsupportedOperationException();
  170. }
  171. };
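The registry above tracks live writers through weak references and prunes cleared entries lazily during iteration. A minimal stand-alone sketch of that pattern follows. `WeakRegistry`, `register`, `unregister`, and `size` are hypothetical names chosen for illustration and are not part of JGit.

```java
import java.lang.ref.WeakReference;
import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-alone registry; an illustration only, not JGit API. */
final class WeakRegistry<T> implements Iterable<T> {
    // WeakReference uses identity equals/hashCode, so each handle is a distinct key.
    private final Map<WeakReference<T>, Boolean> refs = new ConcurrentHashMap<>();

    /** Registers a value and returns the handle needed to unregister it later. */
    WeakReference<T> register(T value) {
        WeakReference<T> ref = new WeakReference<>(value);
        refs.put(ref, Boolean.TRUE);
        return ref;
    }

    void unregister(WeakReference<T> ref) {
        refs.remove(ref);
    }

    int size() {
        return refs.size();
    }

    @Override
    public Iterator<T> iterator() {
        final Iterator<WeakReference<T>> it = refs.keySet().iterator();
        return new Iterator<T>() {
            // Strong reference: keeps the element alive until next() hands it out.
            private T next;

            @Override
            public boolean hasNext() {
                while (next == null && it.hasNext()) {
                    next = it.next().get();
                    if (next == null) {
                        it.remove(); // referent was cleared, prune the stale key
                    }
                }
                return next != null;
            }

            @Override
            public T next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                T result = next;
                next = null;
                return result;
            }
        };
    }
}
```

Callers iterate with a plain for-each loop. Each call to `iterator()` starts a fresh pass, and entries whose referents are gone are removed as a side effect of `hasNext()`, so no separate cleanup thread is needed.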
  172. /**
  173. * Get all allocated, non-released PackWriter instances.
  174. *
  175. * @return all allocated, non-released PackWriter instances.
  176. */
  177. public static Iterable<PackWriter> getInstances() {
  178. return instancesIterable;
  179. }
  180. @SuppressWarnings("unchecked")
  181. BlockList<ObjectToPack>[] objectsLists = new BlockList[OBJ_TAG + 1];
  182. {
  183. objectsLists[OBJ_COMMIT] = new BlockList<>();
  184. objectsLists[OBJ_TREE] = new BlockList<>();
  185. objectsLists[OBJ_BLOB] = new BlockList<>();
  186. objectsLists[OBJ_TAG] = new BlockList<>();
  187. }
  188. private ObjectIdOwnerMap<ObjectToPack> objectsMap = new ObjectIdOwnerMap<>();
  189. // edge objects for thin packs
  190. private List<ObjectToPack> edgeObjects = new BlockList<>();
  191. // Objects the client is known to have already.
  192. private BitmapBuilder haveObjects;
  193. private List<CachedPack> cachedPacks = new ArrayList<>(2);
  194. private Set<ObjectId> tagTargets = NONE;
  195. private Set<? extends ObjectId> excludeFromBitmapSelection = NONE;
  196. private ObjectIdSet[] excludeInPacks;
  197. private ObjectIdSet excludeInPackLast;
  198. private Deflater myDeflater;
  199. private final ObjectReader reader;
  200. /** {@link #reader} recast to the reuse interface, if it supports it. */
  201. private final ObjectReuseAsIs reuseSupport;
  202. final PackConfig config;
  203. private final PackStatistics.Accumulator stats;
  204. private final MutableState state;
  205. private final WeakReference<PackWriter> selfRef;
  206. private PackStatistics.ObjectType.Accumulator typeStats;
  207. private List<ObjectToPack> sortedByName;
  208. private byte[] packcsum;
  209. private boolean deltaBaseAsOffset;
  210. private boolean reuseDeltas;
  211. private boolean reuseDeltaCommits;
  212. private boolean reuseValidate;
  213. private boolean thin;
  214. private boolean useCachedPacks;
  215. private boolean useBitmaps;
  216. private boolean ignoreMissingUninteresting = true;
  217. private boolean pruneCurrentObjectList;
  218. private boolean shallowPack;
  219. private boolean canBuildBitmaps;
  220. private boolean indexDisabled;
  221. private boolean checkSearchForReuseTimeout = false;
  222. private final Duration searchForReuseTimeout;
  223. private long searchForReuseStartTimeEpoc;
  224. private int depth;
  225. private Collection<? extends ObjectId> unshallowObjects;
  226. private PackBitmapIndexBuilder writeBitmaps;
  227. private CRC32 crc32;
  228. private ObjectCountCallback callback;
  229. private FilterSpec filterSpec = FilterSpec.NO_FILTER;
  230. private PackfileUriConfig packfileUriConfig;
  231. /**
  232. * Create writer for specified repository.
  233. * <p>
  234. * Objects for packing are specified in {@link #preparePack(Iterator)} or
  235. * {@link #preparePack(ProgressMonitor, Set, Set)}.
  236. *
  237. * @param repo
  238. * repository where objects are stored.
  239. */
  240. public PackWriter(Repository repo) {
  241. this(repo, repo.newObjectReader());
  242. }
  243. /**
  244. * Create a writer to load objects from the specified reader.
  245. * <p>
  246. * Objects for packing are specified in {@link #preparePack(Iterator)} or
  247. * {@link #preparePack(ProgressMonitor, Set, Set)}.
  248. *
  249. * @param reader
  250. * reader to read from the repository with.
  251. */
  252. public PackWriter(ObjectReader reader) {
  253. this(new PackConfig(), reader);
  254. }
  255. /**
  256. * Create writer for specified repository.
  257. * <p>
  258. * Objects for packing are specified in {@link #preparePack(Iterator)} or
  259. * {@link #preparePack(ProgressMonitor, Set, Set)}.
  260. *
  261. * @param repo
  262. * repository where objects are stored.
  263. * @param reader
  264. * reader to read from the repository with.
  265. */
  266. public PackWriter(Repository repo, ObjectReader reader) {
  267. this(new PackConfig(repo), reader);
  268. }
  269. /**
  270. * Create writer with a specified configuration.
  271. * <p>
  272. * Objects for packing are specified in {@link #preparePack(Iterator)} or
  273. * {@link #preparePack(ProgressMonitor, Set, Set)}.
  274. *
  275. * @param config
  276. * configuration for the pack writer.
  277. * @param reader
  278. * reader to read from the repository with.
  279. */
  280. public PackWriter(PackConfig config, ObjectReader reader) {
  281. this(config, reader, null);
  282. }
  283. /**
  284. * Create writer with a specified configuration.
  285. * <p>
  286. * Objects for packing are specified in {@link #preparePack(Iterator)} or
  287. * {@link #preparePack(ProgressMonitor, Set, Set)}.
  288. *
  289. * @param config
  290. * configuration for the pack writer.
  291. * @param reader
  292. * reader to read from the repository with.
  293. * @param statsAccumulator
  294. * accumulator for statistics
  295. */
  296. public PackWriter(PackConfig config, final ObjectReader reader,
  297. @Nullable PackStatistics.Accumulator statsAccumulator) {
  298. this.config = config;
  299. this.reader = reader;
  300. if (reader instanceof ObjectReuseAsIs)
  301. reuseSupport = ((ObjectReuseAsIs) reader);
  302. else
  303. reuseSupport = null;
  304. deltaBaseAsOffset = config.isDeltaBaseAsOffset();
  305. reuseDeltas = config.isReuseDeltas();
  306. searchForReuseTimeout = config.getSearchForReuseTimeout();
  307. reuseValidate = true; // be paranoid by default
  308. stats = statsAccumulator != null ? statsAccumulator
  309. : new PackStatistics.Accumulator();
  310. state = new MutableState();
  311. selfRef = new WeakReference<>(this);
  312. instances.put(selfRef, Boolean.TRUE);
  313. }
  314. /**
  315. * Set the {@code ObjectCountCallback}.
  316. * <p>
  317. * It should be set before calling
  318. * {@link #writePack(ProgressMonitor, ProgressMonitor, OutputStream)}.
  319. *
  320. * @param callback
  321. * the callback to set
  322. * @return this object for chaining.
  323. */
  324. public PackWriter setObjectCountCallback(ObjectCountCallback callback) {
  325. this.callback = callback;
  326. return this;
  327. }
  328. /**
  329. * Records the set of shallow commits in the client.
  330. *
  331. * @param clientShallowCommits
  332. * the shallow commits in the client
  333. */
  334. public void setClientShallowCommits(Set<ObjectId> clientShallowCommits) {
  335. stats.clientShallowCommits = Collections
  336. .unmodifiableSet(new HashSet<>(clientShallowCommits));
  337. }
  338. /**
  339. * Check whether writer can store delta base as an offset (new style
  340. * reducing pack size) or should store it as an object id (legacy style,
  341. * compatible with old readers).
  342. *
  343. * Default setting: {@value PackConfig#DEFAULT_DELTA_BASE_AS_OFFSET}
  344. *
  345. * @return true if delta base is stored as an offset; false if it is stored
  346. * as an object id.
  347. */
  348. public boolean isDeltaBaseAsOffset() {
  349. return deltaBaseAsOffset;
  350. }
  351. /**
  352. * Check whether the search for reuse phase is taking too long. This could
  353. * be the case when the number of objects and pack files is high and the
  354. * system is under pressure. If that's the case and
  355. * checkSearchForReuseTimeout is true abort the search.
  356. *
  357. * @throws SearchForReuseTimeout
  358. * if the search for reuse is taking too long.
  359. */
  360. public void checkSearchForReuseTimeout() throws SearchForReuseTimeout {
  361. if (checkSearchForReuseTimeout
  362. && Duration.ofMillis(System.currentTimeMillis()
  363. - searchForReuseStartTimeEpoc)
  364. .compareTo(searchForReuseTimeout) > 0) {
  365. throw new SearchForReuseTimeout(searchForReuseTimeout);
  366. }
  367. }
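The check above compares the elapsed wall-clock time against a configured `Duration` and only fires when the check has been switched on. The same logic can be sketched in isolation with an injectable clock value, which makes the boundary behavior easy to see. `SearchDeadline` and its methods are hypothetical names, and `IllegalStateException` stands in for JGit's `SearchForReuseTimeout`.

```java
import java.time.Duration;

/** Hypothetical helper mirroring the timeout check; not part of JGit. */
final class SearchDeadline {
    private final Duration timeout;
    private final long startMillis;
    private boolean enabled; // disabled by default, like the writer's flag

    SearchDeadline(Duration timeout, long startMillis) {
        this.timeout = timeout;
        this.startMillis = startMillis;
    }

    void enable() {
        enabled = true;
    }

    /** True only if enabled and the elapsed time strictly exceeds the timeout. */
    boolean isExpired(long nowMillis) {
        return enabled
                && Duration.ofMillis(nowMillis - startMillis)
                        .compareTo(timeout) > 0;
    }

    /** Throws if the deadline has passed; a stand-in for SearchForReuseTimeout. */
    void check(long nowMillis) {
        if (isExpired(nowMillis)) {
            throw new IllegalStateException("search exceeded " + timeout);
        }
    }
}
```

Because the comparison is strict, a search that has run for exactly the timeout is still allowed to continue; only the next tick past it aborts.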
  368. /**
  369. * Set writer delta base format. Delta base can be written as an offset in a
  370. * pack file (new approach reducing file size) or as an object id (legacy
  371. * approach, compatible with old readers).
  372. *
  373. * Default setting: {@value PackConfig#DEFAULT_DELTA_BASE_AS_OFFSET}
  374. *
  375. * @param deltaBaseAsOffset
  376. * boolean indicating whether delta base can be stored as an
  377. * offset.
  378. */
  379. public void setDeltaBaseAsOffset(boolean deltaBaseAsOffset) {
  380. this.deltaBaseAsOffset = deltaBaseAsOffset;
  381. }
  382. /**
  383. * Enable the check that aborts a search for reuse which exceeds the timeout.
  384. * Selecting an object representation can be an expensive operation. It is
  385. * possible to set a maximum search for reuse time (see
  386. * PackConfig#CONFIG_KEY_SEARCH_FOR_REUSE_TIMEOUT for more details).
  387. *
  388. * However, some operations, e.g. GC, need to find the best candidate
  389. * regardless of how much time the operation will need to finish.
  390. *
  391. * This method enables the search for reuse timeout check, which is otherwise
  392. * disabled.
  393. */
  394. public void enableSearchForReuseTimeout() {
  395. this.checkSearchForReuseTimeout = true;
  396. }
  397. /**
  398. * Check if the writer will reuse commits that are already stored as deltas.
  399. *
  400. * @return true if the writer would reuse commits stored as deltas, assuming
  401. * delta reuse is already enabled.
  402. */
  403. public boolean isReuseDeltaCommits() {
  404. return reuseDeltaCommits;
  405. }
  406. /**
  407. * Set the writer to reuse existing delta versions of commits.
  408. *
  409. * @param reuse
  410. * if true, the writer will reuse any commits stored as deltas.
  411. * By default the writer does not reuse delta commits.
  412. */
  413. public void setReuseDeltaCommits(boolean reuse) {
  414. reuseDeltaCommits = reuse;
  415. }
  416. /**
  417. * Check if the writer validates objects before copying them.
  418. *
  419. * @return true if validation is enabled; false if the reader will handle
  420. * object validation as a side-effect of it consuming the output.
  421. */
  422. public boolean isReuseValidatingObjects() {
  423. return reuseValidate;
  424. }
  425. /**
  426. * Enable (or disable) object validation during packing.
  427. *
  428. * @param validate
  429. * if true the pack writer will validate an object before it is
  430. * put into the output. This additional validation work may be
  431. * necessary to avoid propagating corruption from one local pack
  432. * file to another local pack file.
  433. */
  434. public void setReuseValidatingObjects(boolean validate) {
  435. reuseValidate = validate;
  436. }
  437. /**
  438. * Whether this writer is producing a thin pack.
  439. *
  440. * @return true if this writer is producing a thin pack.
  441. */
  442. public boolean isThin() {
  443. return thin;
  444. }
  445. /**
  446. * Set whether the writer may pack objects whose delta base is not within the
  447. * set of objects to pack.
  448. *
  449. * @param packthin
  450. * a boolean indicating whether the writer may pack objects whose
  451. * delta base is not within the set of objects to pack, but
  452. * belongs to the other party's repository (uninteresting/boundary)
  453. * as determined by the set; this kind of pack is used only for
  454. * transport; true to produce a thin pack, false otherwise.
  455. */
  456. public void setThin(boolean packthin) {
  457. thin = packthin;
  458. }
  459. /**
  460. * Whether to reuse cached packs.
  461. *
  462. * @return {@code true} to reuse cached packs. If true index creation isn't
  463. * available.
  464. */
  465. public boolean isUseCachedPacks() {
  466. return useCachedPacks;
  467. }
  468. /**
  469. * Whether to use cached packs
  470. *
  471. * @param useCached
  472. * if set to {@code true} and a cached pack is present, it will
  473. * be appended onto the end of a thin-pack, reducing the amount
  474. * of working set space and CPU used by PackWriter. Enabling this
  475. * feature prevents PackWriter from creating an index for the
  476. * newly created pack, so it's only suitable for writing to a
  477. * network client, where the client will make the index.
  478. */
  479. public void setUseCachedPacks(boolean useCached) {
  480. useCachedPacks = useCached;
  481. }
  482. /**
  483. * Whether to use bitmaps
  484. *
  485. * @return {@code true} to use bitmaps for ObjectWalks, if available.
  486. */
  487. public boolean isUseBitmaps() {
  488. return useBitmaps;
  489. }
  490. /**
  491. * Whether to use bitmaps
  492. *
  493. * @param useBitmaps
  494. * if set to true, bitmaps will be used when preparing a pack.
  495. */
  496. public void setUseBitmaps(boolean useBitmaps) {
  497. this.useBitmaps = useBitmaps;
  498. }
  499. /**
  500. * Whether the index file cannot be created by this PackWriter.
  501. *
  502. * @return {@code true} if the index file cannot be created by this
  503. * PackWriter.
  504. */
  505. public boolean isIndexDisabled() {
  506. return indexDisabled || !cachedPacks.isEmpty();
  507. }
  508. /**
  509. * Whether to disable creation of the index file.
  510. *
  511. * @param noIndex
  512. * {@code true} to disable creation of the index file.
  513. */
  514. public void setIndexDisabled(boolean noIndex) {
  515. this.indexDisabled = noIndex;
  516. }
  517. /**
  518. * Whether to ignore missing uninteresting objects
  519. *
  520. * @return {@code true} to ignore objects that are uninteresting and also
  521. * not found on local disk; false to throw a
  522. * {@link org.eclipse.jgit.errors.MissingObjectException} out of
  523. * {@link #preparePack(ProgressMonitor, Set, Set)} if an
  524. * uninteresting object is not in the source repository. By default,
  525. * true, permitting missing uninteresting objects to be ignored gracefully.
  526. */
  527. public boolean isIgnoreMissingUninteresting() {
  528. return ignoreMissingUninteresting;
  529. }
  530. /**
  531. * Set whether the writer should ignore nonexistent uninteresting objects.
  532. *
  533. * @param ignore
  534. * {@code true} if the writer should ignore nonexistent
  535. * uninteresting objects while constructing the set of objects to
  536. * pack; false otherwise - nonexistent uninteresting objects may
  537. * cause {@link org.eclipse.jgit.errors.MissingObjectException}
  538. */
  539. public void setIgnoreMissingUninteresting(boolean ignore) {
  540. ignoreMissingUninteresting = ignore;
  541. }
  542. /**
  543. * Set the tag targets that should be hoisted earlier during packing.
  544. * <p>
  545. * Callers may put objects into this set before invoking any of the
  546. * preparePack methods to influence where an annotated tag's target is
  547. * stored within the resulting pack. Typically these will be clustered
  548. * together, and hoisted earlier in the file even if they are ancient
  549. * revisions, allowing readers to find tag targets with better locality.
  550. *
  551. * @param objects
  552. * objects that annotated tags point at.
  553. */
  554. public void setTagTargets(Set<ObjectId> objects) {
  555. tagTargets = objects;
  556. }
  557. /**
  558. * Configure this pack for a shallow clone.
  559. *
  560. * @param depth
  561. * maximum depth of history to return. 1 means return only the
  562. * "wants".
  563. * @param unshallow
  564. * objects which used to be shallow on the client, but are being
  565. * extended as part of this fetch
  566. */
  567. public void setShallowPack(int depth,
  568. Collection<? extends ObjectId> unshallow) {
  569. this.shallowPack = true;
  570. this.depth = depth;
  571. this.unshallowObjects = unshallow;
  572. }
  573. /**
  574. * @param filter the filter that determines which objects this writer
  575. * should and should not include
  576. */
  577. public void setFilterSpec(@NonNull FilterSpec filter) {
  578. filterSpec = requireNonNull(filter);
  579. }
  580. /**
  581. * @param config configuration related to packfile URIs
  582. * @since 5.5
  583. */
  584. public void setPackfileUriConfig(PackfileUriConfig config) {
  585. packfileUriConfig = config;
  586. }
  587. /**
  588. * Returns the number of objects in the pack file created by this writer.
  589. *
  590. * @return number of objects in pack.
  591. * @throws java.io.IOException
  592. * a cached pack cannot supply its object count.
  593. */
  594. public long getObjectCount() throws IOException {
  595. if (stats.totalObjects == 0) {
  596. long objCnt = 0;
  597. objCnt += objectsLists[OBJ_COMMIT].size();
  598. objCnt += objectsLists[OBJ_TREE].size();
  599. objCnt += objectsLists[OBJ_BLOB].size();
  600. objCnt += objectsLists[OBJ_TAG].size();
  601. for (CachedPack pack : cachedPacks)
  602. objCnt += pack.getObjectCount();
  603. return objCnt;
  604. }
  605. return stats.totalObjects;
  606. }
  607. private long getUnoffloadedObjectCount() throws IOException {
  608. long objCnt = 0;
  609. objCnt += objectsLists[OBJ_COMMIT].size();
  610. objCnt += objectsLists[OBJ_TREE].size();
  611. objCnt += objectsLists[OBJ_BLOB].size();
  612. objCnt += objectsLists[OBJ_TAG].size();
  613. for (CachedPack pack : cachedPacks) {
  614. CachedPackUriProvider.PackInfo packInfo =
  615. packfileUriConfig.cachedPackUriProvider.getInfo(
  616. pack, packfileUriConfig.protocolsSupported);
  617. if (packInfo == null) {
  618. objCnt += pack.getObjectCount();
  619. }
  620. }
  621. return objCnt;
  622. }
  623. /**
  624. * Returns the object ids in the pack file that was created by this writer.
  625. * <p>
  626. * This method can only be invoked after
  627. * {@link #writePack(ProgressMonitor, ProgressMonitor, OutputStream)} has
  628. * been invoked and completed successfully.
  629. *
  630. * @return set of objects in pack.
  631. * @throws java.io.IOException
  632. * a cached pack cannot supply its object ids.
  633. */
  634. public ObjectIdOwnerMap<ObjectIdOwnerMap.Entry> getObjectSet()
  635. throws IOException {
  636. if (!cachedPacks.isEmpty())
  637. throw new IOException(
  638. JGitText.get().cachedPacksPreventsListingObjects);
  639. if (writeBitmaps != null) {
  640. return writeBitmaps.getObjectSet();
  641. }
  642. ObjectIdOwnerMap<ObjectIdOwnerMap.Entry> r = new ObjectIdOwnerMap<>();
  643. for (BlockList<ObjectToPack> objList : objectsLists) {
  644. if (objList != null) {
  645. for (ObjectToPack otp : objList)
  646. r.add(new ObjectIdOwnerMap.Entry(otp) {
  647. // A new entry that copies the ObjectId
  648. });
  649. }
  650. }
  651. return r;
  652. }
  653. /**
  654. * Add a pack index whose contents should be excluded from the result.
  655. *
  656. * @param idx
  657. * objects in this index will not be in the output pack.
  658. */
  659. public void excludeObjects(ObjectIdSet idx) {
  660. if (excludeInPacks == null) {
  661. excludeInPacks = new ObjectIdSet[] { idx };
  662. excludeInPackLast = idx;
  663. } else {
  664. int cnt = excludeInPacks.length;
  665. ObjectIdSet[] newList = new ObjectIdSet[cnt + 1];
  666. System.arraycopy(excludeInPacks, 0, newList, 0, cnt);
  667. newList[cnt] = idx;
  668. excludeInPacks = newList;
  669. }
  670. }
  671. /**
  672. * Prepare the list of objects to be written to the pack stream.
  673. * <p>
  674. * The iterator <b>exactly</b> determines which objects are included in the
  675. * pack and the order in which they appear (except that the input does not
  676. * need to be ordered by type). This order should conform to the general
  677. * rules for ordering objects in git - by recency and path (type and
  678. * delta-base first is ensured internally) - and the caller is responsible
  679. * for guaranteeing it. The iterator must return the id of each object to
  680. * write exactly once.
  681. * </p>
  682. *
  683. * @param objectsSource
  684. * iterator of objects to store in a pack; the order of objects within
  685. * each type is important, ordering by type is not needed;
  686. * allowed types for objects are
  687. * {@link org.eclipse.jgit.lib.Constants#OBJ_COMMIT},
  688. * {@link org.eclipse.jgit.lib.Constants#OBJ_TREE},
  689. * {@link org.eclipse.jgit.lib.Constants#OBJ_BLOB} and
  690. * {@link org.eclipse.jgit.lib.Constants#OBJ_TAG}; objects
  691. * returned by iterator may be later reused by caller as object
  692. * id and type are internally copied in each iteration.
  693. * @throws java.io.IOException
  694. * when an I/O problem occurs while reading objects.
  695. */
  696. public void preparePack(@NonNull Iterator<RevObject> objectsSource)
  697. throws IOException {
  698. while (objectsSource.hasNext()) {
  699. addObject(objectsSource.next());
  700. }
  701. }
  702. /**
  703. * Prepare the list of objects to be written to the pack stream.
  704. *
  705. * <p>
  706. * PackWriter will concatenate and write out the specified packs as-is.
  707. *
  708. * @param c
  709. * cached packs to be written.
  710. */
  711. public void preparePack(Collection<? extends CachedPack> c) {
  712. cachedPacks.addAll(c);
  713. }
  714. /**
  715. * Prepare the list of objects to be written to the pack stream.
  716. * <p>
  717. * Based on these two sets, another set of objects to put in the pack file is
  718. * created: this set consists of all objects reachable from (ancestors of)
  719. * interesting objects, except uninteresting objects and their ancestors.
  720. * This method uses class {@link org.eclipse.jgit.revwalk.ObjectWalk}
  721. * extensively to find the appropriate set of output objects and their
  722. * optimal order in the output pack. The order is consistent with general git
  723. * in-pack rules: sort by object type, recency, path and delta-base first.
  724. * </p>
  725. *
  726. * @param countingMonitor
  727. * progress during object enumeration.
  728. * @param want
  729. * collection of objects to be marked as interesting (start
  730. * points of graph traversal). Must not be {@code null}.
  731. * @param have
  732. * collection of objects to be marked as uninteresting (end
  733. * points of graph traversal). Pass {@link #NONE} if all objects
  734. * reachable from {@code want} are desired, such as when serving
  735. * a clone.
  736. * @throws java.io.IOException
  737. * when an I/O problem occurs while reading objects.
  738. */
  739. public void preparePack(ProgressMonitor countingMonitor,
  740. @NonNull Set<? extends ObjectId> want,
  741. @NonNull Set<? extends ObjectId> have) throws IOException {
  742. preparePack(countingMonitor, want, have, NONE, NONE);
  743. }
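The want/have selection described above amounts to a set difference between two reachability closures. The toy model below sketches that idea over a plain parent map keyed by string ids. It is an assumption-laden illustration: `WantHaveSketch`, `reachable`, and `select` are hypothetical names, and the real `ObjectWalk` also handles trees, blobs, tags, and output ordering, none of which is modeled here.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of want/have selection over a parent map; not JGit code. */
final class WantHaveSketch {
    /** Collects every id reachable from the start set, start ids included. */
    static Set<String> reachable(Map<String, List<String>> parents,
            Set<String> start) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> todo = new ArrayDeque<>(start);
        while (!todo.isEmpty()) {
            String id = todo.pop();
            if (seen.add(id)) {
                // Ids without an entry are treated as roots with no parents.
                todo.addAll(parents.getOrDefault(id, List.of()));
            }
        }
        return seen;
    }

    /** Everything reachable from want, minus everything reachable from have. */
    static Set<String> select(Map<String, List<String>> parents,
            Set<String> want, Set<String> have) {
        Set<String> result = reachable(parents, want);
        result.removeAll(reachable(parents, have));
        return result;
    }
}
```

With a linear history `c3 -> c2 -> c1`, wanting `c3` while having `c1` selects `c3` and `c2`. Passing an empty have set selects the full history, which corresponds to the clone case where `NONE` is passed.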
  744. /**
  745. * Prepare the list of objects to be written to the pack stream.
  746. * <p>
  747. * Like {@link #preparePack(ProgressMonitor, Set, Set)} but also allows
  748. * specifying commits that should not be walked past ("shallow" commits).
  749. * The caller is responsible for filtering out commits that should not be
  750. * shallow any more ("unshallow" commits as in {@link #setShallowPack}) from
  751. * the shallow set.
  752. *
  753. * @param countingMonitor
  754. * progress during object enumeration.
  755. * @param want
  756. * objects of interest, ancestors of which will be included in
  757. * the pack. Must not be {@code null}.
  758. * @param have
  759. * objects whose ancestors (up to and including {@code shallow}
  760. * commits) do not need to be included in the pack because they
  761. * are already available from elsewhere. Must not be
  762. * {@code null}.
  763. * @param shallow
  764. * commits indicating the boundary of the history marked with
  765. * {@code have}. Shallow commits have parents but those parents
  766. * are considered not to be already available. Parents of
  767. * {@code shallow} commits and earlier generations will be
  768. * included in the pack if requested by {@code want}. Must not be
  769. * {@code null}.
  770. * @throws java.io.IOException
  771. * an I/O problem occurred while reading objects.
  772. */
  773. public void preparePack(ProgressMonitor countingMonitor,
  774. @NonNull Set<? extends ObjectId> want,
  775. @NonNull Set<? extends ObjectId> have,
  776. @NonNull Set<? extends ObjectId> shallow) throws IOException {
  777. preparePack(countingMonitor, want, have, shallow, NONE);
  778. }
  779. /**
  780. * Prepare the list of objects to be written to the pack stream.
  781. * <p>
  782. * Like {@link #preparePack(ProgressMonitor, Set, Set)} but also allows
  783. * specifying commits that should not be walked past ("shallow" commits).
  784. * The caller is responsible for filtering out commits that should not be
  785. * shallow any more ("unshallow" commits as in {@link #setShallowPack}) from
  786. * the shallow set.
  787. *
  788. * @param countingMonitor
  789. * progress during object enumeration.
  790. * @param want
  791. * objects of interest, ancestors of which will be included in
  792. * the pack. Must not be {@code null}.
  793. * @param have
  794. * objects whose ancestors (up to and including {@code shallow}
  795. * commits) do not need to be included in the pack because they
  796. * are already available from elsewhere. Must not be
  797. * {@code null}.
  798. * @param shallow
  799. * commits indicating the boundary of the history marked with
  800. * {@code have}. Shallow commits have parents but those parents
  801. * are considered not to be already available. Parents of
  802. * {@code shallow} commits and earlier generations will be
  803. * included in the pack if requested by {@code want}. Must not be
  804. * {@code null}.
  805. * @param noBitmaps
  806. * collection of objects to be excluded from bitmap commit
  807. * selection.
  808. * @throws java.io.IOException
  809. * an I/O problem occurred while reading objects.
  810. */
  811. public void preparePack(ProgressMonitor countingMonitor,
  812. @NonNull Set<? extends ObjectId> want,
  813. @NonNull Set<? extends ObjectId> have,
  814. @NonNull Set<? extends ObjectId> shallow,
  815. @NonNull Set<? extends ObjectId> noBitmaps) throws IOException {
  816. try (ObjectWalk ow = getObjectWalk()) {
  817. ow.assumeShallow(shallow);
  818. preparePack(countingMonitor, ow, want, have, noBitmaps);
  819. }
  820. }
  821. private ObjectWalk getObjectWalk() {
  822. return shallowPack ? new DepthWalk.ObjectWalk(reader, depth - 1)
  823. : new ObjectWalk(reader);
  824. }
  825. /**
  826. * A visitation policy which uses the depth at which the object is seen to
  827. * decide if re-traversal is necessary. In particular, if the object has
  828. * already been visited at this depth or shallower, it is not necessary to
  829. * re-visit at this depth.
  830. */
  831. private static class DepthAwareVisitationPolicy
  832. implements ObjectWalk.VisitationPolicy {
  833. private final Map<ObjectId, Integer> lowestDepthVisited = new HashMap<>();
  834. private final ObjectWalk walk;
  835. DepthAwareVisitationPolicy(ObjectWalk walk) {
  836. this.walk = requireNonNull(walk);
  837. }
  838. @Override
  839. public boolean shouldVisit(RevObject o) {
  840. Integer lastDepth = lowestDepthVisited.get(o);
  841. if (lastDepth == null) {
  842. return true;
  843. }
  844. return walk.getTreeDepth() < lastDepth.intValue();
  845. }
  846. @Override
  847. public void visited(RevObject o) {
  848. lowestDepthVisited.put(o, Integer.valueOf(walk.getTreeDepth()));
  849. }
  850. }
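The revisit rule used by DepthAwareVisitationPolicy can be tried in isolation. The following is a minimal sketch, not JGit code: the class name `DepthVisitSketch`, the use of `String` ids, and passing the depth as a parameter (instead of reading it from an `ObjectWalk`) are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the depth-aware revisit check; not a JGit class. */
class DepthVisitSketch {
    // Depth recorded at the most recent visit of each object id.
    private final Map<String, Integer> lowestDepthVisited = new HashMap<>();

    /** True if the object is new, or is now seen at a strictly shallower depth. */
    boolean shouldVisit(String id, int currentDepth) {
        Integer last = lowestDepthVisited.get(id);
        return last == null || currentDepth < last;
    }

    /** Records the depth at which the object was just visited. */
    void visited(String id, int currentDepth) {
        lowestDepthVisited.put(id, currentDepth);
    }

    public static void main(String[] args) {
        DepthVisitSketch policy = new DepthVisitSketch();
        policy.visited("tree-a", 3);
        System.out.println(policy.shouldVisit("tree-a", 3)); // false: same depth
        System.out.println(policy.shouldVisit("tree-a", 2)); // true: shallower
        System.out.println(policy.shouldVisit("tree-b", 5)); // true: never seen
    }
}
```

A plausible reading of the design is that, under a tree depth limit, an object first reached deep in the tree may have had its children cut off, so reaching it again at a shallower depth can expose children that now fall within the limit.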
  851. /**
  852. * Prepare the list of objects to be written to the pack stream.
  853. * <p>
  854. * Based on these two sets, the set of objects to put in the pack file is
  855. * computed: it consists of all objects reachable (ancestors) from
  856. * interesting objects, except uninteresting objects and their ancestors.
  857. * This method uses class {@link org.eclipse.jgit.revwalk.ObjectWalk}
  858. * extensively to determine the appropriate set of output objects and their
  859. * optimal order in the output pack. The order is consistent with general git
  860. * in-pack rules: sort by object type, recency, path and delta-base first.
  861. * </p>
  862. *
  863. * @param countingMonitor
  864. * progress during object enumeration.
  865. * @param walk
  866. * ObjectWalk to perform enumeration.
  867. * @param interestingObjects
  868. * collection of objects to be marked as interesting (start
  869. * points of graph traversal). Must not be {@code null}.
  870. * @param uninterestingObjects
  871. * collection of objects to be marked as uninteresting (end
  872. * points of graph traversal). Pass {@link #NONE} if all objects
  873. * reachable from {@code interestingObjects} are desired, such as
  874. * when serving a clone.
  875. * @param noBitmaps
  876. * collection of objects to be excluded from bitmap commit
  877. * selection.
  878. * @throws java.io.IOException
  879. * when some I/O problem occurs while reading objects.
  880. */
  881. public void preparePack(ProgressMonitor countingMonitor,
  882. @NonNull ObjectWalk walk,
  883. @NonNull Set<? extends ObjectId> interestingObjects,
  884. @NonNull Set<? extends ObjectId> uninterestingObjects,
  885. @NonNull Set<? extends ObjectId> noBitmaps)
  886. throws IOException {
  887. if (countingMonitor == null)
  888. countingMonitor = NullProgressMonitor.INSTANCE;
  889. if (shallowPack && !(walk instanceof DepthWalk.ObjectWalk))
  890. throw new IllegalArgumentException(
  891. JGitText.get().shallowPacksRequireDepthWalk);
  892. if (filterSpec.getTreeDepthLimit() >= 0) {
  893. walk.setVisitationPolicy(new DepthAwareVisitationPolicy(walk));
  894. }
  895. findObjectsToPack(countingMonitor, walk, interestingObjects,
  896. uninterestingObjects, noBitmaps);
  897. }
  898. /**
  899. * Determine if the pack file will contain the requested object.
  900. *
  901. * @param id
  902. * the object to test the existence of.
  903. * @return true if the object will appear in the output pack file.
  904. * @throws java.io.IOException
  905. * a cached pack cannot be examined.
  906. */
  907. public boolean willInclude(AnyObjectId id) throws IOException {
  908. ObjectToPack obj = objectsMap.get(id);
  909. return obj != null && !obj.isEdge();
  910. }
  911. /**
  912. * Lookup the ObjectToPack object for a given ObjectId.
  913. *
  914. * @param id
  915. * the object to find in the pack.
  916. * @return the object we are packing, or null.
  917. */
  918. public ObjectToPack get(AnyObjectId id) {
  919. ObjectToPack obj = objectsMap.get(id);
  920. return obj != null && !obj.isEdge() ? obj : null;
  921. }
  922. /**
  923. * Computes the SHA-1 of the lexicographically sorted object ids written in
  924. * this pack, as used to name a pack file in the repository.
  925. *
  926. * @return ObjectId representing SHA-1 name of a pack that was created.
  927. */
  928. public ObjectId computeName() {
  929. final byte[] buf = new byte[OBJECT_ID_LENGTH];
  930. final MessageDigest md = Constants.newMessageDigest();
  931. for (ObjectToPack otp : sortByName()) {
  932. otp.copyRawTo(buf, 0);
  933. md.update(buf, 0, OBJECT_ID_LENGTH);
  934. }
  935. return ObjectId.fromRaw(md.digest());
  936. }
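The naming scheme in computeName() hashes the object ids in sorted order, so the name depends only on which objects are in the pack. Below is a minimal sketch using only the JDK. `PackNameSketch`, `nameOf`, and the hex helpers are hypothetical names, ids are passed as lowercase hex strings instead of `ObjectToPack` instances, and the two-byte ids in `main` are shortened for readability.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical illustration of order-independent pack naming; not a JGit class. */
class PackNameSketch {
    /** Hashes the raw bytes of the ids in sorted order and returns the digest as hex. */
    static String nameOf(List<String> hexIds) {
        List<String> sorted = new ArrayList<>(hexIds);
        // For lowercase hex strings of equal length, string order matches raw byte order.
        Collections.sort(sorted);
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("SHA-1");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
        for (String id : sorted) {
            md.update(hexToBytes(id));
        }
        return toHex(md.digest());
    }

    private static byte[] hexToBytes(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    private static String toHex(byte[] raw) {
        StringBuilder sb = new StringBuilder(raw.length * 2);
        for (byte b : raw) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same ids in a different order yield the same name.
        System.out.println(nameOf(Arrays.asList("ff01", "0a02")));
        System.out.println(nameOf(Arrays.asList("0a02", "ff01")));
    }
}
```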
  937. /**
  938. * Returns the index format version that will be written.
  939. * <p>
  940. * This method can only be invoked after
  941. * {@link #writePack(ProgressMonitor, ProgressMonitor, OutputStream)} has
  942. * been invoked and completed successfully.
  943. *
  944. * @return the index format version.
  945. */
  946. public int getIndexVersion() {
  947. int indexVersion = config.getIndexVersion();
  948. if (indexVersion <= 0) {
  949. for (BlockList<ObjectToPack> objs : objectsLists)
  950. indexVersion = Math.max(indexVersion,
  951. PackIndexWriter.oldestPossibleFormat(objs));
  952. }
  953. return indexVersion;
  954. }
  955. /**
  956. * Create an index file to match the pack file just written.
  957. * <p>
  958. * Called after
  959. * {@link #writePack(ProgressMonitor, ProgressMonitor, OutputStream)}.
  960. * <p>
  961. * Writing an index is only required for local pack storage. Packs sent on
  962. * the network do not need to create an index.
  963. *
  964. * @param indexStream
  965. * output for the index data. Caller is responsible for closing
  966. * this stream.
  967. * @throws java.io.IOException
  968. * the index data could not be written to the supplied stream.
  969. */
  970. public void writeIndex(OutputStream indexStream) throws IOException {
  971. if (isIndexDisabled())
  972. throw new IOException(JGitText.get().cachedPacksPreventsIndexCreation);
  973. long writeStart = System.currentTimeMillis();
  974. final PackIndexWriter iw = PackIndexWriter.createVersion(
  975. indexStream, getIndexVersion());
  976. iw.write(sortByName(), packcsum);
  977. stats.timeWriting += System.currentTimeMillis() - writeStart;
  978. }
  979. /**
  980. * Create a bitmap index file to match the pack file just written.
  981. * <p>
  982. * Called after {@link #prepareBitmapIndex(ProgressMonitor)}.
  983. *
  984. * @param bitmapIndexStream
  985. * output for the bitmap index data. Caller is responsible for
  986. * closing this stream.
  987. * @throws java.io.IOException
  988. * the index data could not be written to the supplied stream.
  989. */
  990. public void writeBitmapIndex(OutputStream bitmapIndexStream)
  991. throws IOException {
  992. if (writeBitmaps == null)
  993. throw new IOException(JGitText.get().bitmapsMustBePrepared);
  994. long writeStart = System.currentTimeMillis();
  995. final PackBitmapIndexWriterV1 iw = new PackBitmapIndexWriterV1(bitmapIndexStream);
  996. iw.write(writeBitmaps, packcsum);
  997. stats.timeWriting += System.currentTimeMillis() - writeStart;
  998. }
  999. private List<ObjectToPack> sortByName() {
  1000. if (sortedByName == null) {
  1001. int cnt = 0;
  1002. cnt += objectsLists[OBJ_COMMIT].size();
  1003. cnt += objectsLists[OBJ_TREE].size();
  1004. cnt += objectsLists[OBJ_BLOB].size();
  1005. cnt += objectsLists[OBJ_TAG].size();
  1006. sortedByName = new BlockList<>(cnt);
  1007. sortedByName.addAll(objectsLists[OBJ_COMMIT]);
  1008. sortedByName.addAll(objectsLists[OBJ_TREE]);
  1009. sortedByName.addAll(objectsLists[OBJ_BLOB]);
  1010. sortedByName.addAll(objectsLists[OBJ_TAG]);
  1011. Collections.sort(sortedByName);
  1012. }
  1013. return sortedByName;
  1014. }
  1015. private void beginPhase(PackingPhase phase, ProgressMonitor monitor,
  1016. long cnt) {
  1017. state.phase = phase;
  1018. String task;
  1019. switch (phase) {
  1020. case COUNTING:
  1021. task = JGitText.get().countingObjects;
  1022. break;
  1023. case GETTING_SIZES:
  1024. task = JGitText.get().searchForSizes;
  1025. break;
  1026. case FINDING_SOURCES:
  1027. task = JGitText.get().searchForReuse;
  1028. break;
  1029. case COMPRESSING:
  1030. task = JGitText.get().compressingObjects;
  1031. break;
  1032. case WRITING:
  1033. task = JGitText.get().writingObjects;
  1034. break;
  1035. case BUILDING_BITMAPS:
  1036. task = JGitText.get().buildingBitmaps;
  1037. break;
  1038. default:
  1039. throw new IllegalArgumentException(
  1040. MessageFormat.format(JGitText.get().illegalPackingPhase, phase));
  1041. }
  1042. monitor.beginTask(task, (int) cnt);
  1043. }
  1044. private void endPhase(ProgressMonitor monitor) {
  1045. monitor.endTask();
  1046. }
  1047. /**
  1048. * Write the prepared pack to the supplied stream.
  1049. * <p>
  1050. * Called after
  1051. * {@link #preparePack(ProgressMonitor, ObjectWalk, Set, Set, Set)} or
  1052. * {@link #preparePack(ProgressMonitor, Set, Set)}.
  1053. * <p>
  1054. * Performs delta search if enabled and writes the pack stream.
  1055. * <p>
  1056. * The checksum (Adler32/CRC32) of all reused object data is computed and
  1057. * validated against the existing checksum.
  1058. *
  1059. * @param compressMonitor
  1060. * progress monitor to report object compression work.
  1061. * @param writeMonitor
  1062. * progress monitor to report the number of objects written.
  1063. * @param packStream
  1064. * output stream of pack data. The stream should be buffered by
  1065. * the caller. The caller is responsible for closing the stream.
  1066. * @throws java.io.IOException
  1067. * an error occurred reading a local object's data to include in
  1068. * the pack, or writing compressed object data to the output
  1069. * stream.
  1070. * @throws WriteAbortedException
  1071. * the write operation is aborted by
  1072. * {@link org.eclipse.jgit.transport.ObjectCountCallback}.
  1073. */
  1074. public void writePack(ProgressMonitor compressMonitor,
  1075. ProgressMonitor writeMonitor, OutputStream packStream)
  1076. throws IOException {
  1077. if (compressMonitor == null)
  1078. compressMonitor = NullProgressMonitor.INSTANCE;
  1079. if (writeMonitor == null)
  1080. writeMonitor = NullProgressMonitor.INSTANCE;
  1081. excludeInPacks = null;
  1082. excludeInPackLast = null;
  1083. boolean needSearchForReuse = reuseSupport != null && (
  1084. reuseDeltas
  1085. || config.isReuseObjects()
  1086. || !cachedPacks.isEmpty());
  1087. if (compressMonitor instanceof BatchingProgressMonitor) {
  1088. long delay = 1000;
  1089. if (needSearchForReuse && config.isDeltaCompress())
  1090. delay = 500;
  1091. ((BatchingProgressMonitor) compressMonitor).setDelayStart(
  1092. delay,
  1093. TimeUnit.MILLISECONDS);
  1094. }
  1095. if (needSearchForReuse)
  1096. searchForReuse(compressMonitor);
  1097. if (config.isDeltaCompress())
  1098. searchForDeltas(compressMonitor);
  1099. crc32 = new CRC32();
  1100. final PackOutputStream out = new PackOutputStream(
  1101. writeMonitor,
  1102. isIndexDisabled()
  1103. ? packStream
  1104. : new CheckedOutputStream(packStream, crc32),
  1105. this);
  1106. long objCnt = packfileUriConfig == null ? getObjectCount() :
  1107. getUnoffloadedObjectCount();
  1108. stats.totalObjects = objCnt;
  1109. if (callback != null)
  1110. callback.setObjectCount(objCnt);
  1111. beginPhase(PackingPhase.WRITING, writeMonitor, objCnt);
  1112. long writeStart = System.currentTimeMillis();
  1113. try {
  1114. List<CachedPack> unwrittenCachedPacks;
  1115. if (packfileUriConfig != null) {
  1116. unwrittenCachedPacks = new ArrayList<>();
  1117. CachedPackUriProvider p = packfileUriConfig.cachedPackUriProvider;
  1118. PacketLineOut o = packfileUriConfig.pckOut;
  1119. o.writeString("packfile-uris\n"); //$NON-NLS-1$
  1120. for (CachedPack pack : cachedPacks) {
  1121. CachedPackUriProvider.PackInfo packInfo = p.getInfo(
  1122. pack, packfileUriConfig.protocolsSupported);
  1123. if (packInfo != null) {
  1124. o.writeString(packInfo.getHash() + ' ' +
  1125. packInfo.getUri() + '\n');
  1126. stats.offloadedPackfiles += 1;
  1127. stats.offloadedPackfileSize += packInfo.getSize();
  1128. } else {
  1129. unwrittenCachedPacks.add(pack);
  1130. }
  1131. }
  1132. packfileUriConfig.pckOut.writeDelim();
  1133. packfileUriConfig.pckOut.writeString("packfile\n"); //$NON-NLS-1$
  1134. } else {
  1135. unwrittenCachedPacks = cachedPacks;
  1136. }
  1137. out.writeFileHeader(PACK_VERSION_GENERATED, objCnt);
  1138. out.flush();
  1139. writeObjects(out);
  1140. if (!edgeObjects.isEmpty() || !cachedPacks.isEmpty()) {
  1141. for (PackStatistics.ObjectType.Accumulator typeStat : stats.objectTypes) {
  1142. if (typeStat == null)
  1143. continue;
  1144. stats.thinPackBytes += typeStat.bytes;
  1145. }
  1146. }
  1147. stats.reusedPacks = Collections.unmodifiableList(cachedPacks);
  1148. for (CachedPack pack : unwrittenCachedPacks) {
  1149. long deltaCnt = pack.getDeltaCount();
  1150. stats.reusedObjects += pack.getObjectCount();
  1151. stats.reusedDeltas += deltaCnt;
  1152. stats.totalDeltas += deltaCnt;
  1153. reuseSupport.copyPackAsIs(out, pack);
  1154. }
  1155. writeChecksum(out);
  1156. out.flush();
  1157. } finally {
  1158. stats.timeWriting = System.currentTimeMillis() - writeStart;
  1159. stats.depth = depth;
  1160. for (PackStatistics.ObjectType.Accumulator typeStat : stats.objectTypes) {
  1161. if (typeStat == null)
  1162. continue;
  1163. typeStat.cntDeltas += typeStat.reusedDeltas;
  1164. stats.reusedObjects += typeStat.reusedObjects;
  1165. stats.reusedDeltas += typeStat.reusedDeltas;
  1166. stats.totalDeltas += typeStat.cntDeltas;
  1167. }
  1168. }
  1169. stats.totalBytes = out.length();
  1170. reader.close();
  1171. endPhase(writeMonitor);
  1172. }
  1173. /**
  1174. * Get statistics of what this PackWriter did in order to create the final
  1175. * pack stream.
  1176. *
  1177. * @return description of what this PackWriter did in order to create the
  1178. * final pack stream. This should only be invoked after the calls to
  1179. * create the pack/index/bitmap have completed.
  1180. */
  1181. public PackStatistics getStatistics() {
  1182. return new PackStatistics(stats);
  1183. }
  1184. /**
  1185. * Get snapshot of the current state of this PackWriter.
  1186. *
  1187. * @return snapshot of the current state of this PackWriter.
  1188. */
  1189. public State getState() {
  1190. return state.snapshot();
  1191. }
  1192. /**
  1193. * {@inheritDoc}
  1194. * <p>
  1195. * Release all resources used by this writer.
  1196. */
  1197. @Override
  1198. public void close() {
  1199. reader.close();
  1200. if (myDeflater != null) {
  1201. myDeflater.end();
  1202. myDeflater = null;
  1203. }
  1204. instances.remove(selfRef);
  1205. }
  1206. private void searchForReuse(ProgressMonitor monitor) throws IOException {
  1207. long cnt = 0;
  1208. cnt += objectsLists[OBJ_COMMIT].size();
  1209. cnt += objectsLists[OBJ_TREE].size();
  1210. cnt += objectsLists[OBJ_BLOB].size();
  1211. cnt += objectsLists[OBJ_TAG].size();
  1212. long start = System.currentTimeMillis();
  1213. searchForReuseStartTimeEpoc = start;
  1214. beginPhase(PackingPhase.FINDING_SOURCES, monitor, cnt);
  1215. if (cnt <= 4096) {
  1216. // For small object counts, do everything as one list.
  1217. BlockList<ObjectToPack> tmp = new BlockList<>((int) cnt);
  1218. tmp.addAll(objectsLists[OBJ_TAG]);
  1219. tmp.addAll(objectsLists[OBJ_COMMIT]);
  1220. tmp.addAll(objectsLists[OBJ_TREE]);
  1221. tmp.addAll(objectsLists[OBJ_BLOB]);
  1222. searchForReuse(monitor, tmp);
  1223. if (pruneCurrentObjectList) {
  1224. // If the list was pruned, we need to re-prune the main lists.
  1225. pruneEdgesFromObjectList(objectsLists[OBJ_COMMIT]);
  1226. pruneEdgesFromObjectList(objectsLists[OBJ_TREE]);
  1227. pruneEdgesFromObjectList(objectsLists[OBJ_BLOB]);
  1228. pruneEdgesFromObjectList(objectsLists[OBJ_TAG]);
  1229. }
  1230. } else {
  1231. searchForReuse(monitor, objectsLists[OBJ_TAG]);
  1232. searchForReuse(monitor, objectsLists[OBJ_COMMIT]);
  1233. searchForReuse(monitor, objectsLists[OBJ_TREE]);
  1234. searchForReuse(monitor, objectsLists[OBJ_BLOB]);
  1235. }
  1236. endPhase(monitor);
  1237. stats.timeSearchingForReuse = System.currentTimeMillis() - start;
  1238. if (config.isReuseDeltas() && config.getCutDeltaChains()) {
  1239. cutDeltaChains(objectsLists[OBJ_TREE]);
  1240. cutDeltaChains(objectsLists[OBJ_BLOB]);
  1241. }
  1242. }
  1243. private void searchForReuse(ProgressMonitor monitor, List<ObjectToPack> list)
  1244. throws IOException, MissingObjectException {
  1245. pruneCurrentObjectList = false;
  1246. reuseSupport.selectObjectRepresentation(this, monitor, list);
  1247. if (pruneCurrentObjectList)
  1248. pruneEdgesFromObjectList(list);
  1249. }
  1250. private void cutDeltaChains(BlockList<ObjectToPack> list)
  1251. throws IOException {
  1252. int max = config.getMaxDeltaDepth();
  1253. for (int idx = list.size() - 1; idx >= 0; idx--) {
  1254. int d = 0;
  1255. ObjectToPack b = list.get(idx).getDeltaBase();
  1256. while (b != null) {
  1257. if (d < b.getChainLength())
  1258. break;
  1259. b.setChainLength(++d);
  1260. if (d >= max && b.isDeltaRepresentation()) {
  1261. reselectNonDelta(b);
  1262. break;
  1263. }
  1264. b = b.getDeltaBase();
  1265. }
  1266. }
  1267. if (config.isDeltaCompress()) {
  1268. for (ObjectToPack otp : list)
  1269. otp.clearChainLength();
  1270. }
  1271. }
  1272. private void searchForDeltas(ProgressMonitor monitor)
  1273. throws MissingObjectException, IncorrectObjectTypeException,
  1274. IOException {
  1275. // Commits and annotated tags tend to have too many differences to
  1276. // really benefit from delta compression. Consequently just don't
  1277. // bother examining those types here.
  1278. //
  1279. ObjectToPack[] list = new ObjectToPack[
  1280. objectsLists[OBJ_TREE].size()
  1281. + objectsLists[OBJ_BLOB].size()
  1282. + edgeObjects.size()];
  1283. int cnt = 0;
  1284. cnt = findObjectsNeedingDelta(list, cnt, OBJ_TREE);
  1285. cnt = findObjectsNeedingDelta(list, cnt, OBJ_BLOB);
  1286. if (cnt == 0)
  1287. return;
  1288. int nonEdgeCnt = cnt;
  1289. // Queue up any edge objects that we might delta against. We won't
  1290. // be sending these as we assume the other side has them, but we need
  1291. // them in the search phase below.
  1292. //
  1293. for (ObjectToPack eo : edgeObjects) {
  1294. eo.setWeight(0);
  1295. list[cnt++] = eo;
  1296. }
  1297. // Compute the sizes of the objects so we can do a proper sort.
  1298. // We let the reader skip missing objects if it chooses. For
  1299. // some readers this can be a huge win. We detect missing objects
  1300. // by having set the weights above to 0 and allowing the delta
  1301. // search code to discover the missing object and skip over it, or
  1302. // abort with an exception if we actually had to have it.
  1303. //
  1304. final long sizingStart = System.currentTimeMillis();
  1305. beginPhase(PackingPhase.GETTING_SIZES, monitor, cnt);
  1306. AsyncObjectSizeQueue<ObjectToPack> sizeQueue = reader.getObjectSize(
  1307. Arrays.<ObjectToPack> asList(list).subList(0, cnt), false);
  1308. try {
  1309. final long limit = Math.min(
  1310. config.getBigFileThreshold(),
  1311. Integer.MAX_VALUE);
  1312. for (;;) {
  1313. try {
  1314. if (!sizeQueue.next())
  1315. break;
  1316. } catch (MissingObjectException notFound) {
  1317. monitor.update(1);
  1318. if (ignoreMissingUninteresting) {
  1319. ObjectToPack otp = sizeQueue.getCurrent();
  1320. if (otp != null && otp.isEdge()) {
  1321. otp.setDoNotDelta();
  1322. continue;
  1323. }
  1324. otp = objectsMap.get(notFound.getObjectId());
  1325. if (otp != null && otp.isEdge()) {
  1326. otp.setDoNotDelta();
  1327. continue;
  1328. }
  1329. }
  1330. throw notFound;
  1331. }
  1332. ObjectToPack otp = sizeQueue.getCurrent();
  1333. if (otp == null)
  1334. otp = objectsMap.get(sizeQueue.getObjectId());
  1335. long sz = sizeQueue.getSize();
  1336. if (DeltaIndex.BLKSZ < sz && sz < limit)
  1337. otp.setWeight((int) sz);
  1338. else
  1339. otp.setDoNotDelta(); // too small, or too big
  1340. monitor.update(1);
  1341. }
  1342. } finally {
  1343. sizeQueue.release();
  1344. }
  1345. endPhase(monitor);
  1346. stats.timeSearchingForSizes = System.currentTimeMillis() - sizingStart;
  1347. // Sort the objects by path hash so like files are near each other,
  1348. // and then by size descending so that bigger files are first. This
  1349. // applies "Linus' Law" which states that newer files tend to be the
  1350. // bigger ones, because source files grow and hardly ever shrink.
  1351. //
  1352. Arrays.sort(list, 0, cnt, (ObjectToPack a, ObjectToPack b) -> {
  1353. int cmp = (a.isDoNotDelta() ? 1 : 0) - (b.isDoNotDelta() ? 1 : 0);
  1354. if (cmp != 0) {
  1355. return cmp;
  1356. }
  1357. cmp = a.getType() - b.getType();
  1358. if (cmp != 0) {
  1359. return cmp;
  1360. }
  1361. cmp = (a.getPathHash() >>> 1) - (b.getPathHash() >>> 1);
  1362. if (cmp != 0) {
  1363. return cmp;
  1364. }
  1365. cmp = (a.getPathHash() & 1) - (b.getPathHash() & 1);
  1366. if (cmp != 0) {
  1367. return cmp;
  1368. }
  1369. cmp = (a.isEdge() ? 0 : 1) - (b.isEdge() ? 0 : 1);
  1370. if (cmp != 0) {
  1371. return cmp;
  1372. }
  1373. return b.getWeight() - a.getWeight();
  1374. });
  1375. // Above we stored the objects we cannot delta onto the end.
  1376. // Remove them from the list so we don't waste time on them.
  1377. while (0 < cnt && list[cnt - 1].isDoNotDelta()) {
  1378. if (!list[cnt - 1].isEdge())
  1379. nonEdgeCnt--;
  1380. cnt--;
  1381. }
  1382. if (cnt == 0)
  1383. return;
  1384. final long searchStart = System.currentTimeMillis();
  1385. searchForDeltas(monitor, list, cnt);
  1386. stats.deltaSearchNonEdgeObjects = nonEdgeCnt;
  1387. stats.timeCompressing = System.currentTimeMillis() - searchStart;
  1388. for (int i = 0; i < cnt; i++)
  1389. if (!list[i].isEdge() && list[i].isDeltaRepresentation())
  1390. stats.deltasFound++;
  1391. }
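The ordering applied before the delta search can be reproduced on toy data. The sketch below mirrors the comparator's tie-break sequence: do-not-delta objects last, then type, then path hash, then edges first, then larger weight first. `DeltaOrderSketch` and `Candidate` are hypothetical stand-ins for `ObjectToPack`, and the sample values are invented.

```java
import java.util.Arrays;

/** Hypothetical illustration of the delta-candidate ordering; not a JGit class. */
class DeltaOrderSketch {
    static final class Candidate {
        final String name;
        final boolean doNotDelta;
        final int type;
        final int pathHash;
        final boolean edge;
        final int weight;

        Candidate(String name, boolean doNotDelta, int type, int pathHash,
                boolean edge, int weight) {
            this.name = name;
            this.doNotDelta = doNotDelta;
            this.type = type;
            this.pathHash = pathHash;
            this.edge = edge;
            this.weight = weight;
        }
    }

    static int compare(Candidate a, Candidate b) {
        // Objects that cannot be delta compressed sink to the end.
        int cmp = Boolean.compare(a.doNotDelta, b.doNotDelta);
        if (cmp != 0) return cmp;
        cmp = Integer.compare(a.type, b.type);
        if (cmp != 0) return cmp;
        // Compare the upper 31 bits first, then the low bit, avoiding overflow.
        cmp = Integer.compare(a.pathHash >>> 1, b.pathHash >>> 1);
        if (cmp != 0) return cmp;
        cmp = Integer.compare(a.pathHash & 1, b.pathHash & 1);
        if (cmp != 0) return cmp;
        // Edge objects (already on the other side) come before packed objects.
        cmp = Boolean.compare(!a.edge, !b.edge);
        if (cmp != 0) return cmp;
        // Heavier (larger) objects first.
        return Integer.compare(b.weight, a.weight);
    }

    static String[] sortedNames(Candidate... in) {
        Candidate[] copy = in.clone();
        Arrays.sort(copy, DeltaOrderSketch::compare);
        String[] names = new String[copy.length];
        for (int i = 0; i < copy.length; i++) names[i] = copy[i].name;
        return names;
    }

    public static void main(String[] args) {
        String[] order = sortedNames(
                new Candidate("small", false, 3, 10, false, 100),
                new Candidate("skip", true, 3, 2, false, 0),
                new Candidate("big", false, 3, 10, false, 900),
                new Candidate("edge", false, 3, 10, true, 500));
        System.out.println(String.join(",", order)); // edge,big,small,skip
    }
}
```

Because do-not-delta objects end up at the tail, the caller can trim them by walking back from the end of the array, which is what the loop after the sort does.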
  1392. private int findObjectsNeedingDelta(ObjectToPack[] list, int cnt, int type) {
  1393. for (ObjectToPack otp : objectsLists[type]) {
  1394. if (otp.isDoNotDelta()) // delta is disabled for this path
  1395. continue;
  1396. if (otp.isDeltaRepresentation()) // already reusing a delta
  1397. continue;
  1398. otp.setWeight(0);
  1399. list[cnt++] = otp;
  1400. }
  1401. return cnt;
  1402. }
  1403. private void reselectNonDelta(ObjectToPack otp) throws IOException {
  1404. otp.clearDeltaBase();
  1405. otp.clearReuseAsIs();
  1406. boolean old = reuseDeltas;
  1407. reuseDeltas = false;
  1408. reuseSupport.selectObjectRepresentation(this,
  1409. NullProgressMonitor.INSTANCE,
  1410. Collections.singleton(otp));
  1411. reuseDeltas = old;
  1412. }
  1413. private void searchForDeltas(final ProgressMonitor monitor,
  1414. final ObjectToPack[] list, final int cnt)
  1415. throws MissingObjectException, IncorrectObjectTypeException,
  1416. LargeObjectException, IOException {
  1417. int threads = config.getThreads();
  1418. if (threads == 0)
  1419. threads = Runtime.getRuntime().availableProcessors();
  1420. if (threads <= 1 || cnt <= config.getDeltaSearchWindowSize())
  1421. singleThreadDeltaSearch(monitor, list, cnt);
  1422. else
  1423. parallelDeltaSearch(monitor, list, cnt, threads);
  1424. }
  1425. private void singleThreadDeltaSearch(ProgressMonitor monitor,
  1426. ObjectToPack[] list, int cnt) throws IOException {
  1427. long totalWeight = 0;
  1428. for (int i = 0; i < cnt; i++) {
  1429. ObjectToPack o = list[i];
  1430. totalWeight += DeltaTask.getAdjustedWeight(o);
  1431. }
  1432. long bytesPerUnit = 1;
  1433. while (DeltaTask.MAX_METER <= (totalWeight / bytesPerUnit))
  1434. bytesPerUnit <<= 10;
  1435. int cost = (int) (totalWeight / bytesPerUnit);
  1436. if (totalWeight % bytesPerUnit != 0)
  1437. cost++;
  1438. beginPhase(PackingPhase.COMPRESSING, monitor, cost);
  1439. new DeltaWindow(config, new DeltaCache(config), reader,
  1440. monitor, bytesPerUnit,
  1441. list, 0, cnt).search();
  1442. endPhase(monitor);
  1443. }
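The scaling loop in singleThreadDeltaSearch() keeps the progress total small enough to pass to `beginTask` as an `int`. Here is a worked sketch of the arithmetic; the class `MeterScaleSketch` is hypothetical, and the value used for `MAX_METER` is an assumption, since the real constant is defined in `DeltaTask` and is not shown in this file.

```java
/** Hypothetical illustration of progress-unit scaling; not a JGit class. */
class MeterScaleSketch {
    // Assumed cap on progress units; the real value lives in DeltaTask.
    static final long MAX_METER = 9 << 20;

    /** Grows the unit by factors of 1024 until the unit count drops below the cap. */
    static long bytesPerUnit(long totalWeight) {
        long bytesPerUnit = 1;
        while (MAX_METER <= totalWeight / bytesPerUnit) {
            bytesPerUnit <<= 10;
        }
        return bytesPerUnit;
    }

    /** Number of progress units, rounded up so a partial unit still counts. */
    static int cost(long totalWeight) {
        long unit = bytesPerUnit(totalWeight);
        int cost = (int) (totalWeight / unit);
        if (totalWeight % unit != 0) {
            cost++;
        }
        return cost;
    }

    public static void main(String[] args) {
        long tenGiB = 10L << 30;
        System.out.println(bytesPerUnit(1000));   // 1
        System.out.println(cost(1000));           // 1000
        System.out.println(bytesPerUnit(tenGiB)); // 1048576
        System.out.println(cost(tenGiB));         // 10240
        System.out.println(cost(tenGiB + 1));     // 10241
    }
}
```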
  1444. @SuppressWarnings("Finally")
  1445. private void parallelDeltaSearch(ProgressMonitor monitor,
  1446. ObjectToPack[] list, int cnt, int threads) throws IOException {
  1447. DeltaCache dc = new ThreadSafeDeltaCache(config);
  1448. ThreadSafeProgressMonitor pm = new ThreadSafeProgressMonitor(monitor);
  1449. DeltaTask.Block taskBlock = new DeltaTask.Block(threads, config,
  1450. reader, dc, pm,
  1451. list, 0, cnt);
  1452. taskBlock.partitionTasks();
  1453. beginPhase(PackingPhase.COMPRESSING, monitor, taskBlock.cost());
  1454. pm.startWorkers(taskBlock.tasks.size());
  1455. Executor executor = config.getExecutor();
  1456. final List<Throwable> errors =
  1457. Collections.synchronizedList(new ArrayList<>(threads));
  1458. if (executor instanceof ExecutorService) {
  1459. // Caller supplied us a service, use it directly.
  1460. runTasks((ExecutorService) executor, pm, taskBlock, errors);
  1461. } else if (executor == null) {
  1462. // Caller didn't give us a way to run the tasks, spawn up a
  1463. // temporary thread pool and make sure it tears down cleanly.
  1464. ExecutorService pool = Executors.newFixedThreadPool(threads);
  1465. Throwable e1 = null;
  1466. try {
  1467. runTasks(pool, pm, taskBlock, errors);
  1468. } catch (Exception e) {
  1469. e1 = e;
  1470. } finally {
  1471. pool.shutdown();
  1472. for (;;) {
  1473. try {
  1474. if (pool.awaitTermination(60, TimeUnit.SECONDS)) {
  1475. break;
  1476. }
  1477. } catch (InterruptedException e) {
  1478. if (e1 != null) {
  1479. e.addSuppressed(e1);
  1480. }
  1481. throw new IOException(JGitText
  1482. .get().packingCancelledDuringObjectsWriting, e);
  1483. }
  1484. }
  1485. }
  1486. } else {
  1487. // The caller gave us an executor, but it might not do
  1488. // asynchronous execution. Wrap everything and hope it
  1489. // can schedule these for us.
  1490. for (DeltaTask task : taskBlock.tasks) {
  1491. executor.execute(() -> {
  1492. try {
  1493. task.call();
  1494. } catch (Throwable failure) {
  1495. errors.add(failure);
  1496. }
  1497. });
  1498. }
  1499. try {
  1500. pm.waitForCompletion();
  1501. } catch (InterruptedException ie) {
  1502. // We can't abort the other tasks as we have no handle.
  1503. // Cross our fingers and just break out anyway.
  1504. //
  1505. throw new IOException(
  1506. JGitText.get().packingCancelledDuringObjectsWriting,
  1507. ie);
  1508. }
  1509. }
  1510. // If any task threw an error, try to report it back as
  1511. // though we weren't using a threaded search algorithm.
  1512. //
  1513. if (!errors.isEmpty()) {
  1514. Throwable err = errors.get(0);
  1515. if (err instanceof Error)
  1516. throw (Error) err;
  1517. if (err instanceof RuntimeException)
  1518. throw (RuntimeException) err;
  1519. if (err instanceof IOException)
  1520. throw (IOException) err;
  1521. throw new IOException(err.getMessage(), err);
  1522. }
  1523. endPhase(monitor);
  1524. }
  1525. private static void runTasks(ExecutorService pool,
  1526. ThreadSafeProgressMonitor pm,
  1527. DeltaTask.Block tb, List<Throwable> errors) throws IOException {
  1528. List<Future<?>> futures = new ArrayList<>(tb.tasks.size());
  1529. for (DeltaTask task : tb.tasks)
  1530. futures.add(pool.submit(task));
  1531. try {
  1532. pm.waitForCompletion();
  1533. for (Future<?> f : futures) {
  1534. try {
  1535. f.get();
  1536. } catch (ExecutionException failed) {
  1537. errors.add(failed.getCause());
  1538. }
  1539. }
  1540. } catch (InterruptedException ie) {
  1541. for (Future<?> f : futures)
  1542. f.cancel(true);
  1543. throw new IOException(
  1544. JGitText.get().packingCancelledDuringObjectsWriting, ie);
  1545. }
  1546. }
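The submit-then-collect pattern in runTasks() can be sketched with plain JDK types. This is an illustration under assumptions, not the JGit implementation: `RunTasksSketch` and `runAll` are hypothetical names, the tasks are generic `Callable<Void>` instances instead of `DeltaTask`, there is no progress monitor, and interruption is reported as an unchecked exception instead of an `IOException`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical illustration of collecting task failures; not a JGit class. */
class RunTasksSketch {
    /** Runs all tasks on a temporary pool and returns the causes of any failures. */
    static List<Throwable> runAll(List<Callable<Void>> tasks, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Void>> futures = new ArrayList<>(tasks.size());
        List<Throwable> errors = new ArrayList<>();
        try {
            for (Callable<Void> task : tasks) {
                futures.add(pool.submit(task));
            }
            for (Future<Void> f : futures) {
                try {
                    f.get(); // blocks until this task has finished
                } catch (ExecutionException failed) {
                    // Unwrap so callers see the task's own exception.
                    errors.add(failed.getCause());
                }
            }
        } catch (InterruptedException ie) {
            // Stop outstanding work before giving up.
            for (Future<Void> f : futures) {
                f.cancel(true);
            }
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for tasks", ie);
        } finally {
            pool.shutdown();
        }
        return errors;
    }

    public static void main(String[] args) {
        List<Callable<Void>> tasks = new ArrayList<>();
        tasks.add(() -> null);
        tasks.add(() -> {
            throw new IllegalArgumentException("boom");
        });
        List<Throwable> errors = runAll(tasks, 2);
        System.out.println(errors.size());              // 1
        System.out.println(errors.get(0).getMessage()); // boom
    }
}
```

Collecting causes instead of throwing on the first failure lets the caller decide how to surface them, much as parallelDeltaSearch() rethrows the first collected error according to its type.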
  1547. private void writeObjects(PackOutputStream out) throws IOException {
  1548. writeObjects(out, objectsLists[OBJ_COMMIT]);
  1549. writeObjects(out, objectsLists[OBJ_TAG]);
  1550. writeObjects(out, objectsLists[OBJ_TREE]);
  1551. writeObjects(out, objectsLists[OBJ_BLOB]);
  1552. }
  1553. private void writeObjects(PackOutputStream out, List<ObjectToPack> list)
  1554. throws IOException {
  1555. if (list.isEmpty())
  1556. return;
  1557. typeStats = stats.objectTypes[list.get(0).getType()];
  1558. long beginOffset = out.length();
  1559. if (reuseSupport != null) {
  1560. reuseSupport.writeObjects(out, list);
  1561. } else {
  1562. for (ObjectToPack otp : list)
  1563. out.writeObject(otp);
  1564. }
  1565. typeStats.bytes += out.length() - beginOffset;
  1566. typeStats.cntObjects = list.size();
  1567. }
  1568. void writeObject(PackOutputStream out, ObjectToPack otp) throws IOException {
  1569. if (!otp.isWritten())
  1570. writeObjectImpl(out, otp);
  1571. }
	private void writeObjectImpl(PackOutputStream out, ObjectToPack otp)
			throws IOException {
		if (otp.wantWrite()) {
			// A cycle exists in this delta chain. This should only occur if a
			// selected object representation disappeared during writing
			// (for example due to a concurrent repack) and a different base
			// was chosen, forcing a cycle. Select something other than a
			// delta, and write this object.
			reselectNonDelta(otp);
		}
		otp.markWantWrite();

		while (otp.isReuseAsIs()) {
			writeBase(out, otp.getDeltaBase());
			if (otp.isWritten())
				return; // Delta chain cycle caused this to write already.

			crc32.reset();
			otp.setOffset(out.length());
			try {
				reuseSupport.copyObjectAsIs(out, otp, reuseValidate);
				out.endObject();
				otp.setCRC((int) crc32.getValue());
				typeStats.reusedObjects++;
				if (otp.isDeltaRepresentation()) {
					typeStats.reusedDeltas++;
					typeStats.deltaBytes += out.length() - otp.getOffset();
				}
				return;
			} catch (StoredObjectRepresentationNotAvailableException gone) {
				if (otp.getOffset() == out.length()) {
					otp.setOffset(0);
					otp.clearDeltaBase();
					otp.clearReuseAsIs();
					reuseSupport.selectObjectRepresentation(this,
							NullProgressMonitor.INSTANCE,
							Collections.singleton(otp));
					continue;
				}
				// Object writing already started, we cannot recover.
				//
				CorruptObjectException coe;
				coe = new CorruptObjectException(otp, ""); //$NON-NLS-1$
				coe.initCause(gone);
				throw coe;
			}
		}

		// If we reached here, reuse wasn't possible.
		//
		if (otp.isDeltaRepresentation()) {
			writeDeltaObjectDeflate(out, otp);
		} else {
			writeWholeObjectDeflate(out, otp);
		}
		out.endObject();
		otp.setCRC((int) crc32.getValue());
	}

	private void writeBase(PackOutputStream out, ObjectToPack base)
			throws IOException {
		if (base != null && !base.isWritten() && !base.isEdge())
			writeObjectImpl(out, base);
	}

	private void writeWholeObjectDeflate(PackOutputStream out,
			final ObjectToPack otp) throws IOException {
		final Deflater deflater = deflater();
		final ObjectLoader ldr = reader.open(otp, otp.getType());

		crc32.reset();
		otp.setOffset(out.length());
		out.writeHeader(otp, ldr.getSize());

		deflater.reset();
		DeflaterOutputStream dst = new DeflaterOutputStream(out, deflater);
		ldr.copyTo(dst);
		dst.finish();
	}

	private void writeDeltaObjectDeflate(PackOutputStream out,
			final ObjectToPack otp) throws IOException {
		writeBase(out, otp.getDeltaBase());

		crc32.reset();
		otp.setOffset(out.length());

		DeltaCache.Ref ref = otp.popCachedDelta();
		if (ref != null) {
			byte[] zbuf = ref.get();
			if (zbuf != null) {
				out.writeHeader(otp, otp.getCachedSize());
				out.write(zbuf);
				typeStats.cntDeltas++;
				typeStats.deltaBytes += out.length() - otp.getOffset();
				return;
			}
		}

		try (TemporaryBuffer.Heap delta = delta(otp)) {
			out.writeHeader(otp, delta.length());

			Deflater deflater = deflater();
			deflater.reset();
			DeflaterOutputStream dst = new DeflaterOutputStream(out, deflater);
			delta.writeTo(dst, null);
			dst.finish();
		}
		typeStats.cntDeltas++;
		typeStats.deltaBytes += out.length() - otp.getOffset();
	}

	private TemporaryBuffer.Heap delta(ObjectToPack otp)
			throws IOException {
		DeltaIndex index = new DeltaIndex(buffer(otp.getDeltaBaseId()));
		byte[] res = buffer(otp);

		// We never would have proposed this pair if the delta would be
		// larger than the unpacked version of the object. So using it
		// as our buffer limit is valid: we will never reach it.
		//
		TemporaryBuffer.Heap delta = new TemporaryBuffer.Heap(res.length);
		index.encode(delta, res);
		return delta;
	}

	private byte[] buffer(AnyObjectId objId) throws IOException {
		return buffer(config, reader, objId);
	}

	static byte[] buffer(PackConfig config, ObjectReader or, AnyObjectId objId)
			throws IOException {
		// PackWriter should have already pruned objects that
		// are above the big file threshold, so our chances of
		// the object being below it are very good. We really
		// shouldn't be here, unless the implementation is odd.

		return or.open(objId).getCachedBytes(config.getBigFileThreshold());
	}

	private Deflater deflater() {
		if (myDeflater == null)
			myDeflater = new Deflater(config.getCompressionLevel());
		return myDeflater;
	}

	private void writeChecksum(PackOutputStream out) throws IOException {
		packcsum = out.getDigest();
		out.write(packcsum);
	}

	private void findObjectsToPack(@NonNull ProgressMonitor countingMonitor,
			@NonNull ObjectWalk walker, @NonNull Set<? extends ObjectId> want,
			@NonNull Set<? extends ObjectId> have,
			@NonNull Set<? extends ObjectId> noBitmaps) throws IOException {
		final long countingStart = System.currentTimeMillis();
		beginPhase(PackingPhase.COUNTING, countingMonitor, ProgressMonitor.UNKNOWN);

		stats.interestingObjects = Collections.unmodifiableSet(new HashSet<ObjectId>(want));
		stats.uninterestingObjects = Collections.unmodifiableSet(new HashSet<ObjectId>(have));
		excludeFromBitmapSelection = noBitmaps;

		canBuildBitmaps = config.isBuildBitmaps()
				&& !shallowPack
				&& have.isEmpty()
				&& (excludeInPacks == null || excludeInPacks.length == 0);
		if (!shallowPack && useBitmaps) {
			BitmapIndex bitmapIndex = reader.getBitmapIndex();
			if (bitmapIndex != null) {
				BitmapWalker bitmapWalker = new BitmapWalker(
						walker, bitmapIndex, countingMonitor);
				findObjectsToPackUsingBitmaps(bitmapWalker, want, have);
				endPhase(countingMonitor);
				stats.timeCounting = System.currentTimeMillis() - countingStart;
				stats.bitmapIndexMisses = bitmapWalker.getCountOfBitmapIndexMisses();
				return;
			}
		}

		List<ObjectId> all = new ArrayList<>(want.size() + have.size());
		all.addAll(want);
		all.addAll(have);

		final RevFlag include = walker.newFlag("include"); //$NON-NLS-1$
		final RevFlag added = walker.newFlag("added"); //$NON-NLS-1$

		walker.carry(include);

		int haveEst = have.size();
		if (have.isEmpty()) {
			walker.sort(RevSort.COMMIT_TIME_DESC);
		} else {
			walker.sort(RevSort.TOPO);
			if (thin)
				walker.sort(RevSort.BOUNDARY, true);
		}

		List<RevObject> wantObjs = new ArrayList<>(want.size());
		List<RevObject> haveObjs = new ArrayList<>(haveEst);
		List<RevTag> wantTags = new ArrayList<>(want.size());

		// Retrieve the RevWalk's versions of "want" and "have" objects to
		// maintain any state previously set in the RevWalk.
		AsyncRevObjectQueue q = walker.parseAny(all, true);
		try {
			for (;;) {
				try {
					RevObject o = q.next();
					if (o == null)
						break;
					if (have.contains(o))
						haveObjs.add(o);
					if (want.contains(o)) {
						o.add(include);
						wantObjs.add(o);
						if (o instanceof RevTag)
							wantTags.add((RevTag) o);
					}
				} catch (MissingObjectException e) {
					if (ignoreMissingUninteresting
							&& have.contains(e.getObjectId()))
						continue;
					throw e;
				}
			}
		} finally {
			q.release();
		}

		if (!wantTags.isEmpty()) {
			all = new ArrayList<>(wantTags.size());
			for (RevTag tag : wantTags)
				all.add(tag.getObject());
			q = walker.parseAny(all, true);
			try {
				while (q.next() != null) {
					// Just need to pop the queue item to parse the object.
				}
			} finally {
				q.release();
			}
		}

		if (walker instanceof DepthWalk.ObjectWalk) {
			DepthWalk.ObjectWalk depthWalk = (DepthWalk.ObjectWalk) walker;
			for (RevObject obj : wantObjs) {
				depthWalk.markRoot(obj);
			}
			// Mark the tree objects associated with "have" commits as
			// uninteresting to avoid writing redundant blobs. A normal RevWalk
			// lazily propagates the "uninteresting" state from a commit to its
			// tree during the walk, but DepthWalks can terminate early so
			// preemptively propagate that state here.
			for (RevObject obj : haveObjs) {
				if (obj instanceof RevCommit) {
					RevTree t = ((RevCommit) obj).getTree();
					depthWalk.markUninteresting(t);
				}
			}

			if (unshallowObjects != null) {
				for (ObjectId id : unshallowObjects) {
					depthWalk.markUnshallow(walker.parseAny(id));
				}
			}
		} else {
			for (RevObject obj : wantObjs)
				walker.markStart(obj);
		}
		for (RevObject obj : haveObjs)
			walker.markUninteresting(obj);

		final int maxBases = config.getDeltaSearchWindowSize();
		Set<RevTree> baseTrees = new HashSet<>();
		BlockList<RevCommit> commits = new BlockList<>();
		Set<ObjectId> roots = new HashSet<>();
		RevCommit c;
		while ((c = walker.next()) != null) {
			if (exclude(c))
				continue;
			if (c.has(RevFlag.UNINTERESTING)) {
				if (baseTrees.size() <= maxBases)
					baseTrees.add(c.getTree());
				continue;
			}

			commits.add(c);
			if (c.getParentCount() == 0) {
				roots.add(c.copy());
			}
			countingMonitor.update(1);
		}
		stats.rootCommits = Collections.unmodifiableSet(roots);

		if (shallowPack) {
			for (RevCommit cmit : commits) {
				addObject(cmit, 0);
			}
		} else {
			int commitCnt = 0;
			boolean putTagTargets = false;
			for (RevCommit cmit : commits) {
				if (!cmit.has(added)) {
					cmit.add(added);
					addObject(cmit, 0);
					commitCnt++;
				}

				for (int i = 0; i < cmit.getParentCount(); i++) {
					RevCommit p = cmit.getParent(i);
					if (!p.has(added) && !p.has(RevFlag.UNINTERESTING)
							&& !exclude(p)) {
						p.add(added);
						addObject(p, 0);
						commitCnt++;
					}
				}

				if (!putTagTargets && 4096 < commitCnt) {
					for (ObjectId id : tagTargets) {
						RevObject obj = walker.lookupOrNull(id);
						if (obj instanceof RevCommit
								&& obj.has(include)
								&& !obj.has(RevFlag.UNINTERESTING)
								&& !obj.has(added)) {
							obj.add(added);
							addObject(obj, 0);
						}
					}
					putTagTargets = true;
				}
			}
		}
		commits = null;

		if (thin && !baseTrees.isEmpty()) {
			BaseSearch bases = new BaseSearch(countingMonitor, baseTrees, //
					objectsMap, edgeObjects, reader);
			RevObject o;
			while ((o = walker.nextObject()) != null) {
				if (o.has(RevFlag.UNINTERESTING))
					continue;
				if (exclude(o))
					continue;

				int pathHash = walker.getPathHashCode();
				byte[] pathBuf = walker.getPathBuffer();
				int pathLen = walker.getPathLength();
				bases.addBase(o.getType(), pathBuf, pathLen, pathHash);
				if (!depthSkip(o, walker)) {
					filterAndAddObject(o, o.getType(), pathHash, want);
				}
				countingMonitor.update(1);
			}
		} else {
			RevObject o;
			while ((o = walker.nextObject()) != null) {
				if (o.has(RevFlag.UNINTERESTING))
					continue;
				if (exclude(o))
					continue;
				if (!depthSkip(o, walker)) {
					filterAndAddObject(o, o.getType(), walker.getPathHashCode(),
							want);
				}
				countingMonitor.update(1);
			}
		}

		for (CachedPack pack : cachedPacks)
			countingMonitor.update((int) pack.getObjectCount());
		endPhase(countingMonitor);
		stats.timeCounting = System.currentTimeMillis() - countingStart;
		stats.bitmapIndexMisses = -1;
	}

	private void findObjectsToPackUsingBitmaps(
			BitmapWalker bitmapWalker, Set<? extends ObjectId> want,
			Set<? extends ObjectId> have)
			throws MissingObjectException, IncorrectObjectTypeException,
			IOException {
		BitmapBuilder haveBitmap = bitmapWalker.findObjects(have, null, true);
		BitmapBuilder wantBitmap = bitmapWalker.findObjects(want, haveBitmap,
				false);
		BitmapBuilder needBitmap = wantBitmap.andNot(haveBitmap);

		if (useCachedPacks && reuseSupport != null && !reuseValidate
				&& (excludeInPacks == null || excludeInPacks.length == 0))
			cachedPacks.addAll(
					reuseSupport.getCachedPacksAndUpdate(needBitmap));

		for (BitmapObject obj : needBitmap) {
			ObjectId objectId = obj.getObjectId();
			if (exclude(objectId)) {
				needBitmap.remove(objectId);
				continue;
			}
			filterAndAddObject(objectId, obj.getType(), 0, want);
		}

		if (thin)
			haveObjects = haveBitmap;
	}

	private static void pruneEdgesFromObjectList(List<ObjectToPack> list) {
		final int size = list.size();
		int src = 0;
		int dst = 0;

		for (; src < size; src++) {
			ObjectToPack obj = list.get(src);
			if (obj.isEdge())
				continue;
			if (dst != src)
				list.set(dst, obj);
			dst++;
		}

		while (dst < list.size())
			list.remove(list.size() - 1);
	}

	/**
	 * Include one object in the output file.
	 * <p>
	 * Objects are written in the order they are added. If the same object is
	 * added twice, it may be written twice, creating a larger than necessary
	 * file.
	 *
	 * @param object
	 *            the object to add.
	 * @throws org.eclipse.jgit.errors.IncorrectObjectTypeException
	 *             the object is an unsupported type.
	 */
	public void addObject(RevObject object)
			throws IncorrectObjectTypeException {
		if (!exclude(object))
			addObject(object, 0);
	}

	private void addObject(RevObject object, int pathHashCode) {
		addObject(object, object.getType(), pathHashCode);
	}

	private void addObject(
			final AnyObjectId src, final int type, final int pathHashCode) {
		final ObjectToPack otp;
		if (reuseSupport != null)
			otp = reuseSupport.newObjectToPack(src, type);
		else
			otp = new ObjectToPack(src, type);
		otp.setPathHash(pathHashCode);
		objectsLists[type].add(otp);
		objectsMap.add(otp);
	}

	/**
	 * Determines if the object should be omitted from the pack as a result of
	 * its depth (probably because of the {@code tree:<depth>} filter).
	 * <p>
	 * Causes {@code walker} to skip traversing the current tree, which ought to
	 * have just started traversal, assuming this method is called as soon as a
	 * new depth is reached.
	 * <p>
	 * This method increments the {@code treesTraversed} statistic.
	 *
	 * @param obj
	 *            the object to check whether it should be omitted.
	 * @param walker
	 *            the walker being used for traversal.
	 * @return whether the given object should be skipped.
	 */
	private boolean depthSkip(@NonNull RevObject obj, ObjectWalk walker) {
		long treeDepth = walker.getTreeDepth();

		// Check if this object needs to be rejected because it is a tree or
		// blob that is too deep from the root tree.

		// A blob is considered one level deeper than the tree that contains it.
		if (obj.getType() == OBJ_BLOB) {
			treeDepth++;
		} else {
			stats.treesTraversed++;
		}

		if (filterSpec.getTreeDepthLimit() < 0 ||
			treeDepth <= filterSpec.getTreeDepthLimit()) {
			return false;
		}

		walker.skipTree();
		return true;
	}

	// Adds the given object as an object to be packed, first performing
	// filtering on blobs at or exceeding a given size.
	private void filterAndAddObject(@NonNull AnyObjectId src, int type,
			int pathHashCode, @NonNull Set<? extends AnyObjectId> want)
			throws IOException {

		// Check if this object needs to be rejected, doing the cheaper
		// checks first.
		boolean reject =
			(!filterSpec.allowsType(type) && !want.contains(src)) ||
			(filterSpec.getBlobLimit() >= 0 &&
				type == OBJ_BLOB &&
				!want.contains(src) &&
				reader.getObjectSize(src, OBJ_BLOB) > filterSpec.getBlobLimit());
		if (!reject) {
			addObject(src, type, pathHashCode);
		}
	}

	private boolean exclude(AnyObjectId objectId) {
		if (excludeInPacks == null)
			return false;
		if (excludeInPackLast.contains(objectId))
			return true;
		for (ObjectIdSet idx : excludeInPacks) {
			if (idx.contains(objectId)) {
				excludeInPackLast = idx;
				return true;
			}
		}
		return false;
	}

	/**
	 * Select an object representation for this writer.
	 * <p>
	 * An {@link org.eclipse.jgit.lib.ObjectReader} implementation should invoke
	 * this method once for each representation available for an object, to
	 * allow the writer to find the most suitable one for the output.
	 *
	 * @param otp
	 *            the object being packed.
	 * @param next
	 *            the next available representation from the repository.
	 */
	public void select(ObjectToPack otp, StoredObjectRepresentation next) {
		int nFmt = next.getFormat();

		if (!cachedPacks.isEmpty()) {
			if (otp.isEdge())
				return;
			if (nFmt == PACK_WHOLE || nFmt == PACK_DELTA) {
				for (CachedPack pack : cachedPacks) {
					if (pack.hasObject(otp, next)) {
						otp.setEdge();
						otp.clearDeltaBase();
						otp.clearReuseAsIs();
						pruneCurrentObjectList = true;
						return;
					}
				}
			}
		}

		if (nFmt == PACK_DELTA && reuseDeltas && reuseDeltaFor(otp)) {
			ObjectId baseId = next.getDeltaBase();
			ObjectToPack ptr = objectsMap.get(baseId);
			if (ptr != null && !ptr.isEdge()) {
				otp.setDeltaBase(ptr);
				otp.setReuseAsIs();
			} else if (thin && have(ptr, baseId)) {
				otp.setDeltaBase(baseId);
				otp.setReuseAsIs();
			} else {
				otp.clearDeltaBase();
				otp.clearReuseAsIs();
			}
		} else if (nFmt == PACK_WHOLE && config.isReuseObjects()) {
			int nWeight = next.getWeight();
			if (otp.isReuseAsIs() && !otp.isDeltaRepresentation()) {
				// We've chosen another PACK_WHOLE format for this object,
				// choose the one that has the smaller compressed size.
				//
				if (otp.getWeight() <= nWeight)
					return;
			}
			otp.clearDeltaBase();
			otp.setReuseAsIs();
			otp.setWeight(nWeight);
		} else {
			otp.clearDeltaBase();
			otp.clearReuseAsIs();
		}

		otp.setDeltaAttempted(reuseDeltas && next.wasDeltaAttempted());
		otp.select(next);
	}

	private final boolean have(ObjectToPack ptr, AnyObjectId objectId) {
		return (ptr != null && ptr.isEdge())
				|| (haveObjects != null && haveObjects.contains(objectId));
	}

	/**
	 * Prepares the bitmaps to be written to the bitmap index file.
	 * <p>
	 * Bitmaps can be used to speed up fetches and clones by storing the entire
	 * object graph at selected commits. Writing a bitmap index is an optional
	 * feature that not all pack users may require.
	 * <p>
	 * Called after {@link #writeIndex(OutputStream)}.
	 * <p>
	 * To reduce memory, internal state is cleared during this method, rendering
	 * the PackWriter instance useless for anything further than a call to write
	 * out the new bitmaps with {@link #writeBitmapIndex(OutputStream)}.
	 *
	 * @param pm
	 *            progress monitor to report bitmap building work.
	 * @return whether a bitmap index may be written.
	 * @throws java.io.IOException
	 *             when an I/O problem occurs while reading objects.
	 */
	public boolean prepareBitmapIndex(ProgressMonitor pm) throws IOException {
		if (!canBuildBitmaps || getObjectCount() > Integer.MAX_VALUE
				|| !cachedPacks.isEmpty())
			return false;

		if (pm == null)
			pm = NullProgressMonitor.INSTANCE;

		int numCommits = objectsLists[OBJ_COMMIT].size();
		List<ObjectToPack> byName = sortByName();
		sortedByName = null;
		objectsLists = null;
		objectsMap = null;
		writeBitmaps = new PackBitmapIndexBuilder(byName);
		byName = null;

		PackWriterBitmapPreparer bitmapPreparer = new PackWriterBitmapPreparer(
				reader, writeBitmaps, pm, stats.interestingObjects, config);

		Collection<BitmapCommit> selectedCommits = bitmapPreparer
				.selectCommits(numCommits, excludeFromBitmapSelection);

		beginPhase(PackingPhase.BUILDING_BITMAPS, pm, selectedCommits.size());

		BitmapWalker walker = bitmapPreparer.newBitmapWalker();
		AnyObjectId last = null;
		for (BitmapCommit cmit : selectedCommits) {
			if (!cmit.isReuseWalker()) {
				walker = bitmapPreparer.newBitmapWalker();
			}
			BitmapBuilder bitmap = walker.findObjects(
					Collections.singleton(cmit), null, false);

			if (last != null && cmit.isReuseWalker() && !bitmap.contains(last))
				throw new IllegalStateException(MessageFormat.format(
						JGitText.get().bitmapMissingObject, cmit.name(),
						last.name()));
			last = BitmapCommit.copyFrom(cmit).build();
			writeBitmaps.processBitmapForWrite(cmit, bitmap.build(),
					cmit.getFlags());

			// The bitmap walker should stop when the walk hits the previous
			// commit, which saves time.
			walker.setPrevCommit(last);
			walker.setPrevBitmap(bitmap);

			pm.update(1);
		}

		endPhase(pm);
		return true;
	}

	private boolean reuseDeltaFor(ObjectToPack otp) {
		int type = otp.getType();
		if ((type & 2) != 0) // OBJ_TREE(2) or OBJ_BLOB(3)
			return true;
		if (type == OBJ_COMMIT)
			return reuseDeltaCommits;
		if (type == OBJ_TAG)
			return false;
		return true;
	}

	private class MutableState {
		/** Estimated size of a single ObjectToPack instance. */
		// Assume 64-bit pointers, since this is just an estimate.
		private static final long OBJECT_TO_PACK_SIZE =
				(2 * 8)               // Object header
				+ (2 * 8) + (2 * 8)   // ObjectToPack fields
				+ (8 + 8)             // PackedObjectInfo fields
				+ 8                   // ObjectIdOwnerMap fields
				+ 40                  // AnyObjectId fields
				+ 8;                  // Reference in BlockList

		private final long totalDeltaSearchBytes;

		private volatile PackingPhase phase;

		MutableState() {
			phase = PackingPhase.COUNTING;
			if (config.isDeltaCompress()) {
				int threads = config.getThreads();
				if (threads <= 0)
					threads = Runtime.getRuntime().availableProcessors();
				totalDeltaSearchBytes = (threads * config.getDeltaSearchMemoryLimit())
						+ config.getBigFileThreshold();
			} else
				totalDeltaSearchBytes = 0;
		}

		State snapshot() {
			long objCnt = 0;
			BlockList<ObjectToPack>[] lists = objectsLists;
			if (lists != null) {
				objCnt += lists[OBJ_COMMIT].size();
				objCnt += lists[OBJ_TREE].size();
				objCnt += lists[OBJ_BLOB].size();
				objCnt += lists[OBJ_TAG].size();
				// Exclude CachedPacks.
			}

			long bytesUsed = OBJECT_TO_PACK_SIZE * objCnt;
			PackingPhase curr = phase;
			if (curr == PackingPhase.COMPRESSING)
				bytesUsed += totalDeltaSearchBytes;
			return new State(curr, bytesUsed);
		}
	}

	/** Possible states that a PackWriter can be in. */
	public enum PackingPhase {
		/** Counting objects phase. */
		COUNTING,

		/** Getting sizes phase. */
		GETTING_SIZES,

		/** Finding sources phase. */
		FINDING_SOURCES,

		/** Compressing objects phase. */
		COMPRESSING,

		/** Writing objects phase. */
		WRITING,

		/** Building bitmaps phase. */
		BUILDING_BITMAPS;
	}

	/** Summary of the current state of a PackWriter. */
	public class State {
		private final PackingPhase phase;

		private final long bytesUsed;

		State(PackingPhase phase, long bytesUsed) {
			this.phase = phase;
			this.bytesUsed = bytesUsed;
		}

		/** @return the PackConfig used to build the writer. */
		public PackConfig getConfig() {
			return config;
		}

		/** @return the current phase of the writer. */
		public PackingPhase getPhase() {
			return phase;
		}

		/** @return an estimate of the total memory used by the writer. */
		public long estimateBytesUsed() {
			return bytesUsed;
		}

		@SuppressWarnings("nls")
		@Override
		public String toString() {
			return "PackWriter.State[" + phase + ", memory=" + bytesUsed + "]";
		}
	}

	/**
	 * Configuration related to the packfile URI feature.
	 *
	 * @since 5.5
	 */
	public static class PackfileUriConfig {
		@NonNull
		private final PacketLineOut pckOut;

		@NonNull
		private final Collection<String> protocolsSupported;

		@NonNull
		private final CachedPackUriProvider cachedPackUriProvider;

		/**
		 * @param pckOut where to write "packfile-uri" lines to (should
		 *     output to the same stream as the one passed to
		 *     PackWriter#writePack)
		 * @param protocolsSupported list of protocols supported (e.g. "https")
		 * @param cachedPackUriProvider provider of URIs corresponding
		 *     to cached packs
		 * @since 5.5
		 */
		public PackfileUriConfig(@NonNull PacketLineOut pckOut,
				@NonNull Collection<String> protocolsSupported,
				@NonNull CachedPackUriProvider cachedPackUriProvider) {
			this.pckOut = pckOut;
			this.protocolsSupported = protocolsSupported;
			this.cachedPackUriProvider = cachedPackUriProvider;
		}
	}
}
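
The in-place pruning done by `pruneEdgesFromObjectList` above can be shown in isolation. The following is a minimal standalone sketch of the same two-index compaction, not JGit code: the class `CompactInPlaceSketch`, the method `compact`, and the use of a `Predicate` in place of `ObjectToPack.isEdge()` are all hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical helper for illustration only; not part of JGit. */
final class CompactInPlaceSketch {
	private CompactInPlaceSketch() {
	}

	/**
	 * Removes every element matching {@code drop}, keeping the relative
	 * order of the survivors and allocating no second list.
	 */
	static <T> void compact(List<T> list, Predicate<? super T> drop) {
		final int size = list.size();
		int dst = 0;
		for (int src = 0; src < size; src++) {
			T item = list.get(src);
			if (drop.test(item))
				continue; // dropped slots are overwritten or trimmed later
			if (dst != src)
				list.set(dst, item); // slide the survivor forward
			dst++;
		}
		// Trim the stale tail from the end, one element at a time.
		while (dst < list.size())
			list.remove(list.size() - 1);
	}

	public static void main(String[] args) {
		List<Integer> xs = new ArrayList<>(Arrays.asList(1, 2, 3, 4, 5, 6));
		compact(xs, x -> x % 2 == 0);
		System.out.println(xs); // prints [1, 3, 5]
	}
}
```

Trimming from the tail rather than removing matches where they are found avoids shifting the remaining elements on every removal, which is presumably why the original method is written this way for the potentially very large object lists.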