
PackWriter.java 75KB

Implement delta generation during packing

PackWriter now produces new deltas if there is not a suitable delta
available for reuse from an existing pack file. This permits JGit to
send less data on the wire by sending a delta relative to an object
the other side already has, instead of sending the whole object.

The delta searching algorithm is similar in style to what C Git uses,
but apparently has some differences (see below for more on this).

Briefly, objects that should be considered for delta compression are
pushed onto a list. This list is then sorted by a rough similarity
score, which is derived from the path name at which the object was
discovered in the repository during object counting. The list is then
walked in order. At each position in the list, up to $WINDOW objects
prior to it are attempted as delta bases. Each object in the window is
tried, and the shortest delta instruction sequence selects the base
object. Some rough rules are used to prevent pathological behavior
during this matching phase, like skipping pairings of objects that are
not similar enough in size.

PackWriter intentionally excludes commits and annotated tags from this
new delta search phase. In the JGit repository only 28 out of 2600+
commits can be delta compressed by C Git. As the commit count tends to
be a fair percentage of the total number of objects in the repository,
and they generally do not delta compress well, skipping over them can
improve performance with little increase in the output pack size.

Because this implementation was rebuilt from scratch based on my own
memory of how the packing algorithm has evolved over the years in
C Git, PackWriter, DeltaWindow, and DeltaEncoder don't use exactly the
same rules everywhere, and that leads JGit to produce different (but
logically equivalent) pack files.

  Repository | Pack Size (bytes)               | Packing Time
             | JGit - CGit = Difference        | JGit / CGit
  -----------+---------------------------------+-------------------
  git        | 25094348 - 24322890 = +771458   | 59.434s / 59.133s
  jgit       |  5669515 -  5709046 = - 39531   |  6.654s /  6.806s
  linux-2.6  |     389M -     386M =     +3M   | 20m02s  / 18m01s

For the above tests pack.threads was set to 1, window size=10, delta
depth=50, and delta and object reuse was disabled for both
implementations. Both implementations were reading from an already
fully packed repository on local disk. The running time reported is
after 1 warm-up run of the tested implementation.

PackWriter is writing 771 KiB more data on git.git, 3M more on
linux-2.6, but is actually 39.5 KiB smaller on jgit.git. Being larger
by less than 0.7% on linux-2.6 isn't bad, nor is taking an extra
2 minutes to pack.

On the running time side, JGit is at a major disadvantage because
linux-2.6 doesn't fit into the default WindowCache of 20M, while C Git
is able to mmap the entire pack and have it available instantly in
physical memory (assuming hot cache). C Git also has a feature where it
caches deltas that were created during the compression phase, and uses
those cached deltas during the writing phase. PackWriter does not
implement this (yet), and therefore must create every delta twice. This
could easily account for the increased running time we are seeing.

Change-Id: I6292edc66c2e95fbe45b519b65fdb3918068889c
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 years ago
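To make the window mechanics above concrete, here is a minimal Java
sketch of the search loop, assuming a precomputed per-object similarity
score. The names (Candidate, encode) are illustrative stand-ins rather
than JGit's actual PackWriter/DeltaWindow API, and the size filter is
just one example of the "rough rules" mentioned above.

    import java.util.Comparator;
    import java.util.List;

    class DeltaSearchSketch {
        static final int WINDOW = 10; // counterpart of pack.window

        static class Candidate {
            byte[] data;       // inflated object contents
            int pathScore;     // rough similarity score from the path name
            byte[] bestDelta;  // shortest delta instruction sequence so far
            Candidate base;    // base object that produced bestDelta
        }

        static void search(List<Candidate> objects) {
            // Sort by the rough similarity score so likely-similar
            // objects end up adjacent in the list.
            objects.sort(Comparator.comparingInt(c -> c.pathScore));

            for (int i = 0; i < objects.size(); i++) {
                Candidate target = objects.get(i);
                // Try up to WINDOW objects prior to this position as bases.
                for (int j = Math.max(0, i - WINDOW); j < i; j++) {
                    Candidate base = objects.get(j);
                    // Rough rule: skip pairings not similar enough in size.
                    if (base.data.length * 2 < target.data.length)
                        continue;
                    byte[] delta = encode(base.data, target.data);
                    if (target.bestDelta == null
                            || delta.length < target.bestDelta.length) {
                        target.bestDelta = delta; // shortest sequence wins
                        target.base = base;
                    }
                }
            }
        }

        // Stand-in for a real encoder such as JGit's DeltaEncoder.
        static byte[] encode(byte[] base, byte[] target) {
            return target; // placeholder; imagine real delta instructions
        }
    }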
PackWriter: Make thin packs more efficient

There is no point in pushing all of the files within the edge commits
into the delta search when making a thin pack. This floods the delta
search window with objects that are unlikely to be useful bases for
the objects that will be written out, resulting in lower data
compression and higher transfer sizes.

Instead, observe the path of a tree or blob that is being pushed into
the outgoing set, and use that path to locate up to WINDOW ancestor
versions from the edge commits. Push only those objects into the
edgeObjects set, reducing the number of objects seen by the search
window. This allows PackWriter to only look at ancestors for the
modified files, rather than all files in the project. Limiting the
search to WINDOW size makes sense, because more than WINDOW edge
objects will just skip through the window search, as none of them need
to be delta compressed.

To further improve compression, sort edge objects into the front of
the window list, rather than randomly throughout. This puts non-edges
later in the window and gives them a better chance at finding their
base, since they search backwards through the window.

These changes make a significant difference in the thin pack:

Before:
  remote: Counting objects: 144190, done
  remote: Finding sources: 100% (50275/50275)
  remote: Getting sizes: 100% (101405/101405)
  remote: Compressing objects: 100% (7587/7587)
  Receiving objects: 100% (50275/50275), 24.67 MiB | 9.90 MiB/s, done.
  Resolving deltas: 100% (40339/40339), completed with 2218 local objects.

  real 0m30.267s

After:
  remote: Counting objects: 61549, done
  remote: Finding sources: 100% (50275/50275)
  remote: Getting sizes: 100% (18862/18862)
  remote: Compressing objects: 100% (7588/7588)
  Receiving objects: 100% (50275/50275), 11.04 MiB | 3.51 MiB/s, done.
  Resolving deltas: 100% (43160/43160), completed with 5014 local objects.

  real 0m22.170s

The resulting pack is 13.63 MiB smaller, even though it contains the
exact same objects. 82,543 fewer objects had to have their sizes looked
up, which saved about 8s of server CPU time. 2,796 more objects from
the client were used as part of the base object set, which contributed
to the smaller transfer size.

Change-Id: Id01271950432c6960897495b09deab70e33993a9
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>
13 years ago
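A hedged sketch of the selection rule described above: for each path
entering the outgoing set, take at most WINDOW prior versions of that
path from the edge commits. The map of path to edge versions is assumed
to have been collected during traversal; all names here are
illustrative, not JGit's actual edgeObjects plumbing.

    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    class EdgeSelectionSketch {
        static final int WINDOW = 10;

        // pathToEdgeVersions: object ids of each path's versions
        // reachable from the edge commits, newest first (assumed
        // precomputed while walking the edge trees).
        static Set<String> selectEdgeObjects(
                Set<String> outgoingPaths,
                Map<String, List<String>> pathToEdgeVersions) {
            Set<String> edgeObjects = new LinkedHashSet<>();
            for (String path : outgoingPaths) {
                List<String> versions =
                        pathToEdgeVersions.getOrDefault(path, List.of());
                // More than WINDOW edge versions would only slide through
                // the delta window unused, so cap the selection.
                int n = Math.min(WINDOW, versions.size());
                edgeObjects.addAll(versions.subList(0, n));
            }
            return edgeObjects;
        }
    }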
PackWriter: Support reuse of entire packs

The most expensive part of packing a repository for transport to
another system is enumerating all of the objects in the repository.
Once this gets to the size of the linux-2.6 repository (1.8 million
objects), enumeration can take several CPU minutes and costs a lot of
temporary working set memory.

Teach PackWriter to efficiently reuse an existing "cached pack" by
answering a clone request with a thin pack followed by a larger cached
pack appended to the end. This requires the repository owner to first
construct the cached pack by hand, and record the tip commits inside
$GIT_DIR/objects/info/cached-packs:

  cd $GIT_DIR
  root=$(git rev-parse master)
  tmp=objects/.tmp-$$
  names=$(echo $root | git pack-objects --keep-true-parents --revs $tmp)
  for n in $names; do
    chmod a-w $tmp-$n.pack $tmp-$n.idx
    touch objects/pack/pack-$n.keep
    mv $tmp-$n.pack objects/pack/pack-$n.pack
    mv $tmp-$n.idx objects/pack/pack-$n.idx
  done
  (echo "+ $root";
   for n in $names; do echo "P $n"; done;
   echo) >>objects/info/cached-packs
  git repack -a -d

When a clone request needs to include $root, the corresponding cached
pack will be copied as-is, rather than enumerating all of the objects
that are reachable from $root.

For a linux-2.6 kernel repository that should be about 376 MiB, the
above process creates two packs of 368 MiB and 38 MiB [1]. This is a
local disk usage increase of ~26 MiB, due to reduced delta compression
between the large cached pack and the smaller recent activity pack.
The overhead is similar to 1 full copy of the compressed project
sources.

With this cached pack in hand, JGit daemon completes a clone request
in 1m17s less time, but with a slightly larger data transfer
(+2.39 MiB):

Before:
  remote: Counting objects: 1861830, done
  remote: Finding sources: 100% (1861830/1861830)
  remote: Getting sizes: 100% (88243/88243)
  remote: Compressing objects: 100% (88184/88184)
  Receiving objects: 100% (1861830/1861830), 376.01 MiB | 19.01 MiB/s, done.
  remote: Total 1861830 (delta 4706), reused 1851053 (delta 1553844)
  Resolving deltas: 100% (1564621/1564621), done.

  real 3m19.005s

After:
  remote: Counting objects: 1601, done
  remote: Counting objects: 1828460, done
  remote: Finding sources: 100% (50475/50475)
  remote: Getting sizes: 100% (18843/18843)
  remote: Compressing objects: 100% (7585/7585)
  remote: Total 1861830 (delta 2407), reused 1856197 (delta 37510)
  Receiving objects: 100% (1861830/1861830), 378.40 MiB | 31.31 MiB/s, done.
  Resolving deltas: 100% (1559477/1559477), done.

  real 2m2.938s

Repository owners can periodically refresh their cached packs by
repacking their repository, folding all newer objects into a larger
cached pack. Since repacking is already considered to be a normal Git
maintenance activity, this isn't a very big burden.

[1] In this test $root was set back about two weeks.

Change-Id: Ib87131d5c4b5e8c5cacb0f4fe16ff4ece554734b
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
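For illustration, here is a small parser for the cached-packs records
written by the script above: "+ <id>" lines name tip commits,
"P <name>" lines name pack files, and a blank line ends each record.
This sketches the on-disk format only; it is not JGit's actual reader
class.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.Reader;
    import java.util.ArrayList;
    import java.util.List;

    class CachedPacksSketch {
        static class CachedPack {
            final List<String> tips = new ArrayList<>();      // "+ <id>"
            final List<String> packNames = new ArrayList<>(); // "P <name>"
        }

        static List<CachedPack> parse(Reader in) throws IOException {
            List<CachedPack> result = new ArrayList<>();
            CachedPack cur = new CachedPack();
            BufferedReader br = new BufferedReader(in);
            String line;
            while ((line = br.readLine()) != null) {
                if (line.isEmpty()) {          // blank line ends a record
                    if (!cur.tips.isEmpty())
                        result.add(cur);
                    cur = new CachedPack();
                } else if (line.startsWith("+ ")) {
                    cur.tips.add(line.substring(2));      // tip commit id
                } else if (line.startsWith("P ")) {
                    cur.packNames.add(line.substring(2)); // pack-<name>
                }
            }
            if (!cur.tips.isEmpty())
                result.add(cur);
            return result;
        }
    }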
Support creating pack bitmap indexes in PackWriter.

Update the PackWriter to support writing out pack bitmap indexes, a
parallel ".bitmap" file to the ".pack" file.

Bitmaps are selected at commits every 1 to 5,000 commits for each
unique path from the start. The most recent 100 commits are all
bitmapped. The next 19,000 commits have a bitmap every 100 commits.
The remaining commits have a bitmap every 5,000 commits. Commits with
more than 1 parent are preferred over ones with 1 or fewer.
Furthermore, previously computed bitmaps are reused if the previous
entry had the reuse flag set, which is set when the bitmap was placed
at the max allowed distance.

Bitmaps are used to speed up the counting phase when packing, for
requests that are not shallow. The PackWriterBitmapWalker uses a
RevFilter to proactively mark commits with RevFlag.SEEN when they
appear in a bitmap. The walker produces the full closure of reachable
ObjectIds, given the collection of starting ObjectIds.

For fetch requests, two ObjectWalks are executed to compute the
ObjectIds reachable from the haves and from the wants. The ObjectIds
that need to be written are determined by taking all the resulting
wants AND NOT the haves.

For clone requests, we get cached pack support for "free", since it is
possible to determine if all of the ObjectIds in a pack file are
included in the resulting list of ObjectIds to write.

On my machine, the best times for clones and fetches of the linux
kernel repository (with about 2.6M objects and 300K commits) are
tabulated below:

  Operation                    Index V2              Index VE003
  Clone                        37530ms (524.06 MiB)     82ms (524.06 MiB)
  Fetch (1 commit back)           75ms                 107ms
  Fetch (10 commits back)        456ms (269.51 KiB)    341ms (265.19 KiB)
  Fetch (100 commits back)       449ms (269.91 KiB)    337ms (267.28 KiB)
  Fetch (1000 commits back)     2229ms ( 14.75 MiB)    189ms ( 14.42 MiB)
  Fetch (10000 commits back)    2177ms ( 16.30 MiB)    254ms ( 15.88 MiB)
  Fetch (100000 commits back)  14340ms (185.83 MiB)   1655ms (189.39 MiB)

Change-Id: Icdb0cdd66ff168917fb9ef17b96093990cc6a98d
11 years ago
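The set algebra above is simple enough to show directly. In this
sketch, java.util.BitSet stands in for the compressed bitmaps the
feature actually stores, with each bit position indexing one object in
the pack:

    import java.util.BitSet;

    class BitmapSetSketch {
        // Objects to send = (closure of wants) AND NOT (closure of haves).
        static BitSet objectsToWrite(BitSet wantsClosure, BitSet havesClosure) {
            BitSet toSend = (BitSet) wantsClosure.clone();
            toSend.andNot(havesClosure);
            return toSend;
        }

        // A cached pack can be reused "for free" when every object it
        // contains is already part of the set being written.
        static boolean canReuseCachedPack(BitSet toSend, BitSet cachedPack) {
            BitSet missing = (BitSet) cachedPack.clone();
            missing.andNot(toSend);
            return missing.isEmpty();
        }
    }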
ObjectIdOwnerMap: More lightweight map for ObjectIds

OwnerMap is about 200 ms faster than SubclassMap, more friendly to the
GC, and uses less storage. Testing the "Counting objects" part of
PackWriter on 1886362 objects:

  ObjectIdSubclassMap:
    load factor 50%
    table: 4194304 (wasted 2307942)
    ms spent 36998 36009 34795 34703 34941 35070 34284 34511 34638 34256
    ms avg 34800 (last 9 runs)

  ObjectIdOwnerMap:
    load factor 100%
    table: 2097152 (wasted 210790)
    directory: 1024
    ms spent 36842 35112 34922 34703 34580 34782 34165 34662 34314 34140
    ms avg 34597 (last 9 runs)

The major difference with OwnerMap is that entries must extend from
ObjectIdOwnerMap.Entry, where the OwnerMap has injected its own private
"next" field into each object. This allows the OwnerMap to use a singly
linked list for chaining collisions within a bucket. By putting
collisions in a linked list, we gain the entire table back for the
SHA-1 bits to index their own "private" slot.

Unfortunately this means that each object can appear in at most ONE
OwnerMap, as there is only one "next" field within the object instance
to thread into the map. For types that are very object-map heavy, like
RevWalk (entity RevObject) and PackWriter (entity ObjectToPack), this
is sufficient: these entity types are only put into one map by their
container. By introducing a new map type, we don't break existing
applications that might be trying to use ObjectIdSubclassMap to track
RevCommits they obtained from a RevWalk.

The OwnerMap uses less memory. Each object uses 1 reference more (so
we're up 1,886,362 references), but the table is 1/2 the size (2^20
rather than 2^21). The table itself wastes only 210,790 slots, rather
than 2,307,942. So OwnerMap is wasting 200k fewer references.

OwnerMap is more friendly to the GC, because it hardly ever generates
garbage. As the map reaches its 100% load factor target, it doubles in
size by allocating additional segment arrays of 2048 entries. (So the
first grow allocates 1 segment, the second 2 segments, the third
4 segments, etc.) These segments are hooked into the pre-allocated
directory of 1024 spaces. This permits the map to grow to 2 million
objects before the directory itself has to grow.

By using segments of 2048 entries, we are asking the GC to acquire
8,204 bytes in a 32 bit JVM. This is easier to satisfy than 2,307,942
bytes (for the 512k table that is just an intermediate step in the
SubclassMap). By reusing the previously allocated segments (they are
re-hashed in-place), we don't release any memory during a table grow.

When the directory grows, it does so by discarding the old one and
using one that is 4x larger (so the directory goes to 4096 entries on
its first grow). A directory of size 4096 can handle up to 8 million
objects. The second directory grow (16384) goes to 33 million objects.
At that point we're starting to really push the limits of the JVM
heap, but at least it is many small arrays. Previously SubclassMap
would need a table of 67108864 entries to handle that object count,
which needs a single contiguous allocation of 256 MiB. That's hard to
come by in a 32 bit JVM. Instead OwnerMap uses 8192 arrays of about
8 KiB each. This is much easier to fit into a fragmented heap.

Change-Id: Ia4acf5cfbf7e9b71bc7faa0db9060f6a969c0c50
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
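The core trick, chaining collisions through a field injected into the
entry itself, can be sketched in a few lines. This is a deliberate
simplification: it uses a plain doubling rehash instead of the
segment-directory scheme described above, and matches entries by hash
alone where the real map compares the full ObjectId.

    class OwnerMapSketch<V extends OwnerMapSketch.Entry> {
        static abstract class Entry {
            Entry next;      // injected link: an object fits in ONE map
            final int hash;  // e.g. the first 32 bits of the SHA-1
            Entry(int hash) { this.hash = hash; }
        }

        private Entry[] table = new Entry[2048];
        private int size;

        @SuppressWarnings("unchecked")
        V get(int hash) {
            // Walk the bucket's singly linked list; a real map would
            // also compare the full ObjectId, not just the hash bits.
            for (Entry e = table[hash & (table.length - 1)];
                    e != null; e = e.next)
                if (e.hash == hash)
                    return (V) e;
            return null;
        }

        void add(V obj) {
            if (++size == table.length) // 100% load factor target
                grow();
            int b = obj.hash & (table.length - 1);
            obj.next = table[b];        // push onto the bucket chain
            table[b] = obj;
        }

        private void grow() {
            // Rehash in place; the real map instead adds 2048-entry
            // segments hooked into a directory, so it never asks the GC
            // for one large contiguous array.
            Entry[] old = table;
            table = new Entry[old.length * 2];
            for (Entry e : old) {
                while (e != null) {
                    Entry n = e.next;
                    int b = e.hash & (table.length - 1);
                    e.next = table[b];
                    table[b] = e;
                    e = n;
                }
            }
        }
    }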
PackWriter: Hoist and cluster reference targets

Many source browsers and network related tools like UploadPack need to
find and parse the target of all branches and annotated tags within
the repository during their startup phase. Clustering these together
into the same part of the pack file will improve locality, reducing
thrashing when an application starts and needs to load all of these
into memory at once.

To prevent bottlenecking basic log viewing tools that are scanning
backwards from the tip of a current branch (and don't need tags), we
place this cluster of older targets after 4096 newer commits have
already been placed into the pack stream. 4096 was chosen as a rough
guess, but was based on a few factors:

  - log viewers typically show 5-200 commits per page
  - users only view the first page or two
  - DHT can cram 2200-4000 commits per 1 MiB chunk,
    thus these will fall into the second commit chunk (roughly)

Unfortunately this placement hurts history tools that are scanning
backwards through the commit graph and completely ignored tags or
branch heads when they started. An ancient tagged commit is no longer
positioned behind its first child (it is now much earlier), resulting
in a page fault for the parser to reload this cluster of objects on
demand. This may be an acceptable loss. If a user is walking backwards
and has already scanned through more than 4096 commits of history,
waiting for the region to reload isn't really that bad compared to the
amount of time already spent.

If the repository is so small that there are fewer than 4096 commits,
this change has no impact on the placement of objects.

Change-Id: If3052e430d305e17878d94145c93754f56b74c61
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
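The placement rule reduces to a splice in the write order. A rough
sketch, assuming the commit list is already newest-first and the
ref-target cluster has been collected separately (both names here are
illustrative):

    import java.util.ArrayList;
    import java.util.List;

    class RefClusterSketch {
        static final int HOIST_AFTER = 4096;

        static <T> List<T> writeOrder(List<T> commits, List<T> refTargets) {
            int cut = Math.min(HOIST_AFTER, commits.size());
            // First 4096 newest commits keep their normal position...
            List<T> order = new ArrayList<>(commits.subList(0, cut));
            // ...then the clustered branch/tag targets...
            order.addAll(refTargets);
            // ...then the rest of the commit stream.
            order.addAll(commits.subList(cut, commits.size()));
            return order;
        }
    }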
PackWriter: Don't reuse commit or tag deltas

JGit doesn't generate deltas for commit or tag objects when it packs a
repository from scratch. This is an explicit design decision that is
(mostly) justified by the fact that these objects do not delta
compress well.

Annotated tags are made once, on stable points of the project history;
it is unlikely they will ever appear again with sufficient common text
to justify using a delta over just deflating the raw content. JGit
never tries to delta compress annotated tags, and I take the stance
that these are best stored as non-deltas given how frequently they
might be accessed by repository viewers.

Commits only have sufficient common text when they are cherry-picked
to forward-port or back-port a change from one branch to another. Even
in these cases the distance between the commits, as returned by the
log traversal, has to be small enough that they would both appear in
the delta search window at the same time in order to delta compress
one of the messages against the other. JGit never tries to delta
compress commits, as it requires a lot of CPU time but typically does
not produce a smaller pack file.

Avoid reusing deltas for either of these types when constructing a new
pack. To avoid killing performance when serving network clients,
UploadPack disables this code change by allowing PackWriter to reuse
delta commits. Repositories that were already repacked by C Git will
not have their delta commits decompressed and recompressed on the fly
during object writing, saving server-side CPU resources.

Change-Id: I749407e7c5c677e05e4d054b40db7656cfa7fca8
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
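The resulting policy is effectively a per-type predicate with an
override for network serving. A sketch, using git's numeric object
type codes; the class and flag names are illustrative, though JGit
exposes a similar setting to callers such as UploadPack:

    class DeltaReusePolicySketch {
        static final int OBJ_COMMIT = 1, OBJ_TREE = 2,
                         OBJ_BLOB = 3, OBJ_TAG = 4;

        // UploadPack flips this to true so C Git-made commit deltas are
        // copied as-is instead of being recompressed on the fly.
        boolean reuseDeltaCommits = false;

        boolean canReuseDelta(int objectType) {
            if (objectType == OBJ_COMMIT || objectType == OBJ_TAG)
                return reuseDeltaCommits;
            return true; // trees and blobs: reuse existing deltas freely
        }
    }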
PackWriter: Support reuse of entire packs The most expensive part of packing a repository for transport to another system is enumerating all of the objects in the repository. Once this gets to the size of the linux-2.6 repository (1.8 million objects), enumeration can take several CPU minutes and costs a lot of temporary working set memory. Teach PackWriter to efficiently reuse an existing "cached pack" by answering a clone request with a thin pack followed by a larger cached pack appended to the end. This requires the repository owner to first construct the cached pack by hand, and record the tip commits inside of $GIT_DIR/objects/info/cached-packs: cd $GIT_DIR root=$(git rev-parse master) tmp=objects/.tmp-$$ names=$(echo $root | git pack-objects --keep-true-parents --revs $tmp) for n in $names; do chmod a-w $tmp-$n.pack $tmp-$n.idx touch objects/pack/pack-$n.keep mv $tmp-$n.pack objects/pack/pack-$n.pack mv $tmp-$n.idx objects/pack/pack-$n.idx done (echo "+ $root"; for n in $names; do echo "P $n"; done; echo) >>objects/info/cached-packs git repack -a -d When a clone request needs to include $root, the corresponding cached pack will be copied as-is, rather than enumerating all of the objects that are reachable from $root. For a linux-2.6 kernel repository that should be about 376 MiB, the above process creates two packs of 368 MiB and 38 MiB[1]. This is a local disk usage increase of ~26 MiB, due to reduced delta compression between the large cached pack and the smaller recent activity pack. The overhead is similar to 1 full copy of the compressed project sources. With this cached pack in hand, JGit daemon completes a clone request in 1m17s less time, but a slightly larger data transfer (+2.39 MiB): Before: remote: Counting objects: 1861830, done remote: Finding sources: 100% (1861830/1861830) remote: Getting sizes: 100% (88243/88243) remote: Compressing objects: 100% (88184/88184) Receiving objects: 100% (1861830/1861830), 376.01 MiB | 19.01 MiB/s, done. remote: Total 1861830 (delta 4706), reused 1851053 (delta 1553844) Resolving deltas: 100% (1564621/1564621), done. real 3m19.005s After: remote: Counting objects: 1601, done remote: Counting objects: 1828460, done remote: Finding sources: 100% (50475/50475) remote: Getting sizes: 100% (18843/18843) remote: Compressing objects: 100% (7585/7585) remote: Total 1861830 (delta 2407), reused 1856197 (delta 37510) Receiving objects: 100% (1861830/1861830), 378.40 MiB | 31.31 MiB/s, done. Resolving deltas: 100% (1559477/1559477), done. real 2m2.938s Repository owners can periodically refresh their cached packs by repacking their repository, folding all newer objects into a larger cached pack. Since repacking is already considered to be a normal Git maintenance activity, this isn't a very big burden. [1] In this test $root was set back about two weeks. Change-Id: Ib87131d5c4b5e8c5cacb0f4fe16ff4ece554734b Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
Support creating pack bitmap indexes in PackWriter. Update the PackWriter to support writing out pack bitmap indexes, a parallel ".bitmap" file to the ".pack" file. Bitmaps are selected at commits every 1 to 5,000 commits for each unique path from the start. The most recent 100 commits are all bitmapped. The next 19,000 commits have a bitmaps every 100 commits. The remaining commits have a bitmap every 5,000 commits. Commits with more than 1 parent are prefered over ones with 1 or less. Furthermore, previously computed bitmaps are reused, if the previous entry had the reuse flag set, which is set when the bitmap was placed at the max allowed distance. Bitmaps are used to speed up the counting phase when packing, for requests that are not shallow. The PackWriterBitmapWalker uses a RevFilter to proactively mark commits with RevFlag.SEEN, when they appear in a bitmap. The walker produces the full closure of reachable ObjectIds, given the collection of starting ObjectIds. For fetch request, two ObjectWalks are executed to compute the ObjectIds reachable from the haves and from the wants. The ObjectIds needed to be written are determined by taking all the resulting wants AND NOT the haves. For clone requests, we get cached pack support for "free" since it is possible to determine if all of the ObjectIds in a pack file are included in the resulting list of ObjectIds to write. On my machine, the best times for clones and fetches of the linux kernel repository (with about 2.6M objects and 300K commits) are tabulated below: Operation Index V2 Index VE003 Clone 37530ms (524.06 MiB) 82ms (524.06 MiB) Fetch (1 commit back) 75ms 107ms Fetch (10 commits back) 456ms (269.51 KiB) 341ms (265.19 KiB) Fetch (100 commits back) 449ms (269.91 KiB) 337ms (267.28 KiB) Fetch (1000 commits back) 2229ms ( 14.75 MiB) 189ms ( 14.42 MiB) Fetch (10000 commits back) 2177ms ( 16.30 MiB) 254ms ( 15.88 MiB) Fetch (100000 commits back) 14340ms (185.83 MiB) 1655ms (189.39 MiB) Change-Id: Icdb0cdd66ff168917fb9ef17b96093990cc6a98d
11 years ago
Support creating pack bitmap indexes in PackWriter. Update the PackWriter to support writing out pack bitmap indexes, a parallel ".bitmap" file to the ".pack" file. Bitmaps are selected at commits every 1 to 5,000 commits for each unique path from the start. The most recent 100 commits are all bitmapped. The next 19,000 commits have a bitmaps every 100 commits. The remaining commits have a bitmap every 5,000 commits. Commits with more than 1 parent are prefered over ones with 1 or less. Furthermore, previously computed bitmaps are reused, if the previous entry had the reuse flag set, which is set when the bitmap was placed at the max allowed distance. Bitmaps are used to speed up the counting phase when packing, for requests that are not shallow. The PackWriterBitmapWalker uses a RevFilter to proactively mark commits with RevFlag.SEEN, when they appear in a bitmap. The walker produces the full closure of reachable ObjectIds, given the collection of starting ObjectIds. For fetch request, two ObjectWalks are executed to compute the ObjectIds reachable from the haves and from the wants. The ObjectIds needed to be written are determined by taking all the resulting wants AND NOT the haves. For clone requests, we get cached pack support for "free" since it is possible to determine if all of the ObjectIds in a pack file are included in the resulting list of ObjectIds to write. On my machine, the best times for clones and fetches of the linux kernel repository (with about 2.6M objects and 300K commits) are tabulated below: Operation Index V2 Index VE003 Clone 37530ms (524.06 MiB) 82ms (524.06 MiB) Fetch (1 commit back) 75ms 107ms Fetch (10 commits back) 456ms (269.51 KiB) 341ms (265.19 KiB) Fetch (100 commits back) 449ms (269.91 KiB) 337ms (267.28 KiB) Fetch (1000 commits back) 2229ms ( 14.75 MiB) 189ms ( 14.42 MiB) Fetch (10000 commits back) 2177ms ( 16.30 MiB) 254ms ( 15.88 MiB) Fetch (100000 commits back) 14340ms (185.83 MiB) 1655ms (189.39 MiB) Change-Id: Icdb0cdd66ff168917fb9ef17b96093990cc6a98d
11 years ago
Support creating pack bitmap indexes in PackWriter. Update the PackWriter to support writing out pack bitmap indexes, a parallel ".bitmap" file to the ".pack" file. Bitmaps are selected at commits every 1 to 5,000 commits for each unique path from the start. The most recent 100 commits are all bitmapped. The next 19,000 commits have a bitmaps every 100 commits. The remaining commits have a bitmap every 5,000 commits. Commits with more than 1 parent are prefered over ones with 1 or less. Furthermore, previously computed bitmaps are reused, if the previous entry had the reuse flag set, which is set when the bitmap was placed at the max allowed distance. Bitmaps are used to speed up the counting phase when packing, for requests that are not shallow. The PackWriterBitmapWalker uses a RevFilter to proactively mark commits with RevFlag.SEEN, when they appear in a bitmap. The walker produces the full closure of reachable ObjectIds, given the collection of starting ObjectIds. For fetch request, two ObjectWalks are executed to compute the ObjectIds reachable from the haves and from the wants. The ObjectIds needed to be written are determined by taking all the resulting wants AND NOT the haves. For clone requests, we get cached pack support for "free" since it is possible to determine if all of the ObjectIds in a pack file are included in the resulting list of ObjectIds to write. On my machine, the best times for clones and fetches of the linux kernel repository (with about 2.6M objects and 300K commits) are tabulated below: Operation Index V2 Index VE003 Clone 37530ms (524.06 MiB) 82ms (524.06 MiB) Fetch (1 commit back) 75ms 107ms Fetch (10 commits back) 456ms (269.51 KiB) 341ms (265.19 KiB) Fetch (100 commits back) 449ms (269.91 KiB) 337ms (267.28 KiB) Fetch (1000 commits back) 2229ms ( 14.75 MiB) 189ms ( 14.42 MiB) Fetch (10000 commits back) 2177ms ( 16.30 MiB) 254ms ( 15.88 MiB) Fetch (100000 commits back) 14340ms (185.83 MiB) 1655ms (189.39 MiB) Change-Id: Icdb0cdd66ff168917fb9ef17b96093990cc6a98d
11 years ago
PackWriter: Don't reuse commit or tag deltas JGit doesn't generate deltas for commit or tag objects when it packs a repository from scratch. This is an explicit design decision that is (mostly) justified by the fact that these objects do not delta compress well. Annotated tags are made once on stable points of the project history, it is unlikely they will ever appear again with sufficient common text to justify using a delta over just deflating the raw content. JGit never tries to delta compress annotated tags and I take the stance that these are best stored as non-deltas given how frequently they might be accessed by repository viewers. Commits only have sufficient common text when they are cherry-picked to forward-port or back-port a change from one branch to another. Even in these cases the distance between the commits as returned by the log traversal has to be small enough that they would both appear in the delta search window at the same time in order to delta compress one of the messages against the other. JGit never tries to delta compress commits, as it requires a lot of CPU time but typically does not produce a smaller pack file. Avoid reusing deltas for either of these types when constructing a new pack. To avoid killing performance during serving of network clients, UploadPack disables this code change by allowing PackWriter to reuse delta commits. Repositories that were already repacked by C Git will not have their delta commits decompressed and recompressed on the fly during object writing, saving server-side CPU resources. Change-Id: I749407e7c5c677e05e4d054b40db7656cfa7fca8 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
PackWriter: Support reuse of entire packs The most expensive part of packing a repository for transport to another system is enumerating all of the objects in the repository. Once this gets to the size of the linux-2.6 repository (1.8 million objects), enumeration can take several CPU minutes and costs a lot of temporary working set memory. Teach PackWriter to efficiently reuse an existing "cached pack" by answering a clone request with a thin pack followed by a larger cached pack appended to the end. This requires the repository owner to first construct the cached pack by hand, and record the tip commits inside of $GIT_DIR/objects/info/cached-packs: cd $GIT_DIR root=$(git rev-parse master) tmp=objects/.tmp-$$ names=$(echo $root | git pack-objects --keep-true-parents --revs $tmp) for n in $names; do chmod a-w $tmp-$n.pack $tmp-$n.idx touch objects/pack/pack-$n.keep mv $tmp-$n.pack objects/pack/pack-$n.pack mv $tmp-$n.idx objects/pack/pack-$n.idx done (echo "+ $root"; for n in $names; do echo "P $n"; done; echo) >>objects/info/cached-packs git repack -a -d When a clone request needs to include $root, the corresponding cached pack will be copied as-is, rather than enumerating all of the objects that are reachable from $root. For a linux-2.6 kernel repository that should be about 376 MiB, the above process creates two packs of 368 MiB and 38 MiB[1]. This is a local disk usage increase of ~26 MiB, due to reduced delta compression between the large cached pack and the smaller recent activity pack. The overhead is similar to 1 full copy of the compressed project sources. With this cached pack in hand, JGit daemon completes a clone request in 1m17s less time, but a slightly larger data transfer (+2.39 MiB): Before: remote: Counting objects: 1861830, done remote: Finding sources: 100% (1861830/1861830) remote: Getting sizes: 100% (88243/88243) remote: Compressing objects: 100% (88184/88184) Receiving objects: 100% (1861830/1861830), 376.01 MiB | 19.01 MiB/s, done. remote: Total 1861830 (delta 4706), reused 1851053 (delta 1553844) Resolving deltas: 100% (1564621/1564621), done. real 3m19.005s After: remote: Counting objects: 1601, done remote: Counting objects: 1828460, done remote: Finding sources: 100% (50475/50475) remote: Getting sizes: 100% (18843/18843) remote: Compressing objects: 100% (7585/7585) remote: Total 1861830 (delta 2407), reused 1856197 (delta 37510) Receiving objects: 100% (1861830/1861830), 378.40 MiB | 31.31 MiB/s, done. Resolving deltas: 100% (1559477/1559477), done. real 2m2.938s Repository owners can periodically refresh their cached packs by repacking their repository, folding all newer objects into a larger cached pack. Since repacking is already considered to be a normal Git maintenance activity, this isn't a very big burden. [1] In this test $root was set back about two weeks. Change-Id: Ib87131d5c4b5e8c5cacb0f4fe16ff4ece554734b Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
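The cached-packs file produced by the shell snippet above is a simple line-oriented format: "+ <tip-sha1>" names a commit the cached pack fully satisfies, "P <pack-name>" names a pack under objects/pack, and a blank line closes the record. As an illustration of that format only (JGit's actual reader is internal and not shown here), a parser might look like:

    // Illustrative parser for objects/info/cached-packs; not JGit's code.
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.ArrayList;
    import java.util.List;

    class CachedPacksFile {
      static void read(String path) throws Exception {
        List<String> tips = new ArrayList<>();
        List<String> packs = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new FileReader(path))) {
          String line;
          while ((line = br.readLine()) != null) {
            if (line.isEmpty()) {
              // Blank line terminates one cached-pack record.
              System.out.println("record: tips=" + tips + " packs=" + packs);
              tips.clear();
              packs.clear();
            } else if (line.startsWith("+ ")) {
              tips.add(line.substring(2));   // tip commit the pack satisfies
            } else if (line.startsWith("P ")) {
              packs.add(line.substring(2));  // pack-<name>.pack under objects/pack
            }
          }
        }
      }
    }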
Support creating pack bitmap indexes in PackWriter.

Update the PackWriter to support writing out pack bitmap indexes, a parallel ".bitmap" file to the ".pack" file.

Bitmaps are selected at commits, spaced every 1 to 5,000 commits for each unique path from the start. The most recent 100 commits are all bitmapped. The next 19,000 commits have a bitmap every 100 commits. The remaining commits have a bitmap every 5,000 commits. Commits with more than 1 parent are preferred over ones with 1 or fewer. Furthermore, previously computed bitmaps are reused if the previous entry had the reuse flag set, which is set when the bitmap was placed at the max allowed distance.

Bitmaps are used to speed up the counting phase when packing, for requests that are not shallow. The PackWriterBitmapWalker uses a RevFilter to proactively mark commits with RevFlag.SEEN when they appear in a bitmap. The walker produces the full closure of reachable ObjectIds, given the collection of starting ObjectIds.

For fetch requests, two ObjectWalks are executed to compute the ObjectIds reachable from the haves and from the wants. The ObjectIds that need to be written are determined by taking all the resulting wants AND NOT the haves.

For clone requests, we get cached pack support for "free", since it is possible to determine if all of the ObjectIds in a pack file are included in the resulting list of ObjectIds to write.

On my machine, the best times for clones and fetches of the linux kernel repository (with about 2.6M objects and 300K commits) are tabulated below:

    Operation                   Index V2               Index VE003
    Clone                       37530ms (524.06 MiB)     82ms (524.06 MiB)
    Fetch (1 commit back)          75ms                 107ms
    Fetch (10 commits back)       456ms (269.51 KiB)    341ms (265.19 KiB)
    Fetch (100 commits back)      449ms (269.91 KiB)    337ms (267.28 KiB)
    Fetch (1000 commits back)    2229ms ( 14.75 MiB)    189ms ( 14.42 MiB)
    Fetch (10000 commits back)   2177ms ( 16.30 MiB)    254ms ( 15.88 MiB)
    Fetch (100000 commits back) 14340ms (185.83 MiB)   1655ms (189.39 MiB)

Change-Id: Icdb0cdd66ff168917fb9ef17b96093990cc6a98d
11 years ago
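The selection spacing above reduces to a step function of a commit's distance from the branch tip. The sketch below restates only that spacing rule; it is not JGit's actual selector, which additionally prefers merge commits and honors the reuse flag described above.

    // Sketch of the spacing rule only; merge preference and reuse are ignored.
    class BitmapSpacingSketch {
      // commitsFromTip: 0 = branch tip, increasing into history.
      static int spacing(int commitsFromTip) {
        if (commitsFromTip < 100)
          return 1;     // most recent 100 commits: bitmap every commit
        if (commitsFromTip < 100 + 19000)
          return 100;   // next 19,000 commits: bitmap every 100th commit
        return 5000;    // remaining history: bitmap every 5,000th commit
      }

      static boolean wantBitmap(int commitsFromTip) {
        return commitsFromTip % spacing(commitsFromTip) == 0;
      }
    }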
PackWriter: Hoist and cluster reference targets

Many source browsers and network related tools like UploadPack need to find and parse the target of all branches and annotated tags within the repository during their startup phase. Clustering these together into the same part of the pack file will improve locality, reducing thrashing when an application starts and needs to load all of these into memory at once.

To prevent bottlenecking basic log viewing tools that are scanning backwards from the tip of a current branch (and don't need tags), we place this cluster of older targets after 4096 newer commits have already been placed into the pack stream. 4096 was chosen as a rough guess, but was based on a few factors:

- log viewers typically show 5-200 commits per page
- users only view the first page or two
- DHT can cram 2200-4000 commits per 1 MiB chunk, thus these will fall into the second commit chunk (roughly)

Unfortunately this placement hurts history tools that scan backwards through the commit graph and completely ignored tags or branch heads when they started. An ancient tagged commit is no longer positioned behind its first child (it is now much earlier), resulting in a page fault for the parser to reload this cluster of objects on demand. This may be an acceptable loss. If a user is walking backwards and has already scanned through more than 4096 commits of history, waiting for the region to reload isn't really that bad compared to the amount of time already spent. If the repository is so small that there are fewer than 4096 commits, this change has no impact on the placement of objects.

Change-Id: If3052e430d305e17878d94145c93754f56b74c61
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
Support cutting existing delta chains longer than the max depth

Some packs built by JGit have incredibly long delta chains due to a long-standing bug in PackWriter. Google has packs created by JGit's DfsGarbageCollector with chains of 6,000 objects or more. Inflating objects at the end of such a 6,000-deep chain cannot complete within a reasonable time bound. It could take a beefy system hours, even using the heavily optimized native C implementation of Git, let alone JGit.

Enable pack.cutDeltaChains to be set in a configuration file to permit the PackWriter to determine the length of each delta chain and clip the chain at arbitrary points to fit within pack.depth.

Delta chain cycles are still possible, but no attempt is made to detect them. A trivial chain of A->B->A will iterate for the full pack.depth configured limit (e.g. 50) and then pick an object to store as a non-delta.

When cutting chains, the object list is walked in reverse to try to take advantage of existing chain computations. The assumption here is that most deltas are near the end of the list, and their bases are near the front of the list. Going up from the tail attempts to reuse chainLength computations by relying on the memoized value in the delta base.

The chainLength field in ObjectToPack is overloaded into the depth field normally used by DeltaWindow. This is acceptable because the chain cut happens before delta search, and the chainLength is reset to 0 if delta search will follow.

Change-Id: Ida4fde9558f3abbbb77ade398d2af3941de9c812
11 years ago
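The reverse walk with memoized chain lengths can be sketched as follows. This is an illustration of the technique described above, not JGit's code: the Obj type stands in for ObjectToPack, chainLength stands in for the overloaded depth field, and clipping simply drops the delta base once the measured chain would exceed the limit. As the message notes, cycles are not detected; the bounded walk just gives up at the depth limit.

    // Illustrative only; field and class names are stand-ins, not JGit's.
    class ChainCutSketch {
      static class Obj {
        Obj deltaBase;         // null when the object is stored whole
        int chainLength = -1;  // memoized chain length; -1 = not yet computed
      }

      // Walk tail-first: deltas tend to sit near the end of the list and
      // their bases near the front, so memoized base values get reused.
      static void cutChains(Obj[] list, int maxDepth) {
        for (int i = list.length - 1; i >= 0; i--) {
          Obj o = list[i];
          int depth = 0;
          Obj c = o;
          // Bounded walk: an undetected cycle simply exhausts the limit.
          while (c.deltaBase != null && depth <= maxDepth) {
            if (c.deltaBase.chainLength >= 0) {
              depth += 1 + c.deltaBase.chainLength; // reuse memoized value
              break;
            }
            depth++;
            c = c.deltaBase;
          }
          if (depth > maxDepth) {
            o.deltaBase = null;  // clip: store this object as a non-delta
            o.chainLength = 0;
          } else {
            o.chainLength = depth;
          }
        }
      }
    }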
14 years ago
PackWriter: Support reuse of entire packs

The most expensive part of packing a repository for transport to
another system is enumerating all of the objects in the repository.
Once this gets to the size of the linux-2.6 repository (1.8 million
objects), enumeration can take several CPU minutes and costs a lot
of temporary working set memory.

Teach PackWriter to efficiently reuse an existing "cached pack"
by answering a clone request with a thin pack followed by a larger
cached pack appended to the end. This requires the repository owner
to first construct the cached pack by hand, and record the tip
commits inside of $GIT_DIR/objects/info/cached-packs:

  cd $GIT_DIR
  root=$(git rev-parse master)
  tmp=objects/.tmp-$$
  names=$(echo $root | git pack-objects --keep-true-parents --revs $tmp)
  for n in $names; do
    chmod a-w $tmp-$n.pack $tmp-$n.idx
    touch objects/pack/pack-$n.keep
    mv $tmp-$n.pack objects/pack/pack-$n.pack
    mv $tmp-$n.idx objects/pack/pack-$n.idx
  done
  (echo "+ $root";
   for n in $names; do echo "P $n"; done;
   echo) >>objects/info/cached-packs
  git repack -a -d

When a clone request needs to include $root, the corresponding
cached pack will be copied as-is, rather than enumerating all of
the objects that are reachable from $root.

For a linux-2.6 kernel repository that should be about 376 MiB,
the above process creates two packs of 368 MiB and 38 MiB[1].
This is a local disk usage increase of ~26 MiB, due to reduced
delta compression between the large cached pack and the smaller
recent activity pack. The overhead is similar to 1 full copy of
the compressed project sources.

With this cached pack in hand, JGit daemon completes a clone request
in 1m17s less time, but a slightly larger data transfer (+2.39 MiB):

  Before:
    remote: Counting objects: 1861830, done
    remote: Finding sources: 100% (1861830/1861830)
    remote: Getting sizes: 100% (88243/88243)
    remote: Compressing objects: 100% (88184/88184)
    Receiving objects: 100% (1861830/1861830), 376.01 MiB | 19.01 MiB/s, done.
    remote: Total 1861830 (delta 4706), reused 1851053 (delta 1553844)
    Resolving deltas: 100% (1564621/1564621), done.

    real 3m19.005s

  After:
    remote: Counting objects: 1601, done
    remote: Counting objects: 1828460, done
    remote: Finding sources: 100% (50475/50475)
    remote: Getting sizes: 100% (18843/18843)
    remote: Compressing objects: 100% (7585/7585)
    remote: Total 1861830 (delta 2407), reused 1856197 (delta 37510)
    Receiving objects: 100% (1861830/1861830), 378.40 MiB | 31.31 MiB/s, done.
    Resolving deltas: 100% (1559477/1559477), done.

    real 2m2.938s

Repository owners can periodically refresh their cached packs by
repacking their repository, folding all newer objects into a larger
cached pack. Since repacking is already considered to be a normal
Git maintenance activity, this isn't a very big burden.

[1] In this test $root was set back about two weeks.

Change-Id: Ib87131d5c4b5e8c5cacb0f4fe16ff4ece554734b
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
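The cached-packs file built by the loop above is a simple
record-oriented text format: one "+ " line per tip commit, one "P "
line per pack name, and a blank line closing each record. As a rough
sketch of how such a file could be consumed (illustrative Java only,
not JGit's actual parser; the class and method names are invented):

  import java.io.BufferedReader;
  import java.io.IOException;
  import java.io.Reader;
  import java.util.ArrayList;
  import java.util.List;

  /** Hypothetical reader for $GIT_DIR/objects/info/cached-packs. */
  class CachedPackRecord {
      final List<String> tips = new ArrayList<>();  // "+ " tip commits
      final List<String> packs = new ArrayList<>(); // "P " pack names

      static List<CachedPackRecord> parse(Reader src) throws IOException {
          List<CachedPackRecord> records = new ArrayList<>();
          BufferedReader br = new BufferedReader(src);
          CachedPackRecord cur = new CachedPackRecord();
          String line;
          while ((line = br.readLine()) != null) {
              if (line.isEmpty()) {
                  // A blank line terminates the current record.
                  if (!cur.tips.isEmpty() || !cur.packs.isEmpty())
                      records.add(cur);
                  cur = new CachedPackRecord();
              } else if (line.startsWith("+ ")) {
                  cur.tips.add(line.substring(2));
              } else if (line.startsWith("P ")) {
                  cur.packs.add(line.substring(2));
              }
          }
          return records;
      }
  }

A server could then match a requested tip against the recorded "+ "
entries and, on a hit, stream the named packs as-is instead of
enumerating objects.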
PackWriter: Make thin packs more efficient

There is no point in pushing all of the files within the edge commits
into the delta search when making a thin pack. This floods the delta
search window with objects that are unlikely to be useful bases for
the objects that will be written out, resulting in lower data
compression and higher transfer sizes.

Instead, observe the path of a tree or blob that is being pushed into
the outgoing set, and use that path to locate up to WINDOW ancestor
versions from the edge commits. Push only those objects into the
edgeObjects set, reducing the number of objects seen by the search
window.

This allows PackWriter to only look at ancestors for the modified
files, rather than all files in the project. Limiting the search to
WINDOW size makes sense, because more than WINDOW edge objects will
just skip through the window search as none of them need to be delta
compressed.

To further improve compression, sort edge objects into the front of
the window list, rather than randomly throughout. This puts non-edges
later in the window and gives them a better chance at finding their
base, since they search backwards through the window.

These changes make a significant difference in the thin-pack:

  Before:
    remote: Counting objects: 144190, done
    remote: Finding sources: 100% (50275/50275)
    remote: Getting sizes: 100% (101405/101405)
    remote: Compressing objects: 100% (7587/7587)
    Receiving objects: 100% (50275/50275), 24.67 MiB | 9.90 MiB/s, done.
    Resolving deltas: 100% (40339/40339), completed with 2218 local objects.

    real 0m30.267s

  After:
    remote: Counting objects: 61549, done
    remote: Finding sources: 100% (50275/50275)
    remote: Getting sizes: 100% (18862/18862)
    remote: Compressing objects: 100% (7588/7588)
    Receiving objects: 100% (50275/50275), 11.04 MiB | 3.51 MiB/s, done.
    Resolving deltas: 100% (43160/43160), completed with 5014 local objects.

    real 0m22.170s

The resulting pack is 13.63 MiB smaller, even though it contains the
exact same objects. 82,543 fewer objects had to have their sizes
looked up, which saved about 8s of server CPU time. 2,796 more
objects from the client were used as part of the base object set,
which contributed to the smaller transfer size.

Change-Id: Id01271950432c6960897495b09deab70e33993a9
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>
13 years ago
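The edge-first ordering of the window list can be sketched in a few
lines. This is only an illustration of the idea; WindowObject and its
fields are invented for the example and are not PackWriter's real
data structures:

  import java.util.Comparator;
  import java.util.List;

  class WindowObject {
      final String path;  // path the object was found at during counting
      final boolean edge; // true if the other side already has this object

      WindowObject(String path, boolean edge) {
          this.path = path;
          this.edge = edge;
      }

      /** Move edge objects to the front of the window list. */
      static void sortEdgesFirst(List<WindowObject> list) {
          list.sort(Comparator.comparingInt(o -> o.edge ? 0 : 1));
      }
  }

Because List.sort is stable, objects keep their similarity-score
order within each group; only the edge/non-edge partition changes.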
Support cutting existing delta chains longer than the max depth

Some packs built by JGit have incredibly long delta chains due to a
long-standing bug in PackWriter. Google has packs created by JGit's
DfsGarbageCollector with chains 6000 objects long, or more.

Inflating objects at the end of such a 6000-object chain is
impossible to complete within a reasonable time bound. It could take
a beefy system hours to perform, even using the heavily optimized
native C implementation of Git, let alone JGit.

Enable pack.cutDeltaChains to be set in a configuration file to
permit the PackWriter to determine the length of each delta chain and
clip the chain at arbitrary points to fit within pack.depth.

Delta chain cycles are still possible, but no attempt is made to
detect them. A trivial chain of A->B->A will iterate for the full
pack.depth configured limit (e.g. 50) and then pick an object to
store as non-delta.

When cutting chains the object list is walked in reverse to try to
take advantage of existing chain computations. The assumption here is
that most deltas are near the end of the list, and their bases are
near the front of the list. Going up from the tail attempts to reuse
chainLength computations by relying on the memoized value in the
delta base.

The chainLength field in ObjectToPack is overloaded into the depth
field normally used by DeltaWindow. This is acceptable because the
chain cut happens before delta search, and the chainLength is reset
to 0 if delta search will follow.

Change-Id: Ida4fde9558f3abbbb77ade398d2af3941de9c812
11 years ago
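A much-simplified sketch of the tail-first walk with memoized chain
lengths follows. It is illustrative only and differs from JGit: here,
for brevity, an object whose chain would exceed the limit simply has
its own delta dropped, rather than clipping at an arbitrary point
inside the chain. All names are invented:

  import java.util.List;

  class ChainCutter {
      static class Obj {
          Obj deltaBase;        // null when the object is stored whole
          int chainLength = -1; // memoized chain length; -1 = unknown
      }

      /** Walk tail-first so lengths memoized for objects near the end
       *  of the list can be reused by entries visited later. */
      static void cutChains(List<Obj> objects, int maxDepth) {
          for (int i = objects.size() - 1; i >= 0; i--)
              chainLength(objects.get(i), maxDepth, 0);
      }

      private static int chainLength(Obj o, int maxDepth, int hops) {
          if (o.chainLength >= 0)
              return o.chainLength; // reuse a memoized value
          int len = 0;
          if (o.deltaBase != null) {
              if (hops >= maxDepth) {
                  // Walked maxDepth hops without reaching a whole
                  // object (possibly an A->B->A cycle): store o whole.
                  o.deltaBase = null;
              } else {
                  len = 1 + chainLength(o.deltaBase, maxDepth, hops + 1);
                  if (o.chainLength >= 0)
                      return o.chainLength; // a cycle resolved o already
                  if (len > maxDepth) {
                      o.deltaBase = null; // remaining chain too deep: cut
                      len = 0;
                  }
              }
          }
          o.chainLength = len;
          return len;
      }
  }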
PackWriter: Fix the way delta chain cycles are prevented

Take a very simple approach to avoiding delta chain cycles during
object reuse: objects are now always selected from the oldest pack
that contains them. This prevents cycles because a pack must not have
a cycle in its own delta chains. If both objects A and B are chosen
out of the same source pack then there cannot be an A->B->A cycle.

The oldest pack is also the most likely to have the smallest deltas.
It's the biggest pack in the system and probably came from the clone
(or last GC) of this repository, where all objects were previously
considered and packed tightly together. If an object appears again
(for example due to a revert and a push into this repository) the
newer copy won't be nearly as small as the older delta version of it,
even if the newer one is also itself a delta.

ObjectDirectory already enumerates objects during selection in this
newest->oldest order, so it is already supplying these assumptions to
PackWriter. Taking advantage of this can speed up selection by a tiny
amount by avoiding some tests, but can also help to prevent a cycle
needing to be broken on the fly during writing.

The previous cycle breaking logic wasn't fully correct either. If a
different delta base was chosen, the new delta base might not have
been written into the output pack before the current object, forcing
the use of REF_DELTA when OFS_DELTA is always smaller. This logic has
now been reworked to always re-check the delta base and ensure it
gets written before the current object. If a cycle occurs, it gets
broken the same way as before: by disabling delta reuse and finding
an alternative form of the object, which may require
inflating/deflating in whole format.

Change-Id: I9953ab8be54ceb8b588e1280d6f7edd688887747
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
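Because ObjectDirectory offers packs newest-first, "always select
from the oldest pack" reduces to keeping the last candidate seen for
each object. A minimal sketch under that assumption, with invented
names:

  /** Keeps the reuse candidate from the oldest pack containing the
   *  object; candidates must be offered newest pack first. */
  class ReuseSelector {
      static class Candidate {
          final String packName;
          final long offset;

          Candidate(String packName, long offset) {
              this.packName = packName;
              this.offset = offset;
          }
      }

      private Candidate selected;

      /** Called once per pack that contains the object. */
      void offer(Candidate c) {
          // Unconditionally replace: the final offer comes from the
          // oldest pack, whose deltas are typically the smallest and,
          // all coming from one pack, cannot form an A->B->A cycle.
          selected = c;
      }

      Candidate best() {
          return selected;
      }
  }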
Do not write edge objects to the pack stream

Consider two objects A->B, where A uses B as a delta base, and both are in the same source pack file, ordered as "A B". If cached packs are enabled, B is also in the cached pack that will be appended onto the end of the thin pack, and both A and B are supposed to be in the thin pack, then PackWriter must consider the fact that A's base B is an edge object: it claims to be part of the new pack, but is actually "external" and cannot be written first.

If the object reuse system considers the B candidates first, this bug does not arise, as B will be marked as an edge due to it existing in the cached pack. When the A candidates are later examined, A sees that a valid delta base is available as an edge, and will not later try to "write base first" during the writing phase.

However, when the reuse system considers the A candidates first, they see that B will be in the outgoing pack, as it is still part of the thin pack, and arrange for B to be written first. Later, when B switches from being in-pack to being an edge object (as it is part of the cached pack), the pointer in A does not get its type changed from ObjectToPack to ObjectId, so A still thinks B is non-edge.

We work around this case by also checking that the delta base B is non-edge before writing it to the pack. Later, when A writes its object header, delta base B's ObjectToPack will have offset == 0, which makes isWritten() return false, and the OBJ_REF delta format will be used for A's header. This will be resolved by the client to the copy of B that appears in the later cached pack.

Change-Id: Ifab6bfdf3c0aa93649468f49bcf91d67f90362ca
12 years ago
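A rough, self-contained sketch of the workaround described above (EdgeAwareWriter, Obj, edge, and offset are stand-ins, not JGit's real types): write a delta base ahead of its dependent object only when the base is non-edge, and fall back to the REF_DELTA form when the base was never written into this stream:

  class EdgeAwareWriter {
      static class Obj {
          Obj base;        // delta base, or null for a whole object
          boolean edge;    // true if the object lives in the appended cached pack
          long offset;     // 0 until written into the output stream
          boolean isWritten() { return offset != 0; }
      }

      long pos = 12;       // pretend the pack header is already written

      void write(Obj o) {
          if (o.isWritten())
              return;
          // Only a non-edge base may be written into this pack first.
          if (o.base != null && !o.base.edge)
              write(o.base);
          o.offset = pos;
          if (o.base != null && o.base.isWritten())
              System.out.println("OFS_DELTA at " + o.offset);  // smaller encoding
          else if (o.base != null)
              System.out.println("REF_DELTA at " + o.offset);  // base is external
          else
              System.out.println("whole object at " + o.offset);
          pos += 1;        // stand-in for the real encoded size
      }
  }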
Support creating pack bitmap indexes in PackWriter.

Update the PackWriter to support writing out pack bitmap indexes, a parallel ".bitmap" file to the ".pack" file.

Bitmaps are selected at commits, spaced every 1 to 5,000 commits for each unique path from the start. The most recent 100 commits are all bitmapped. The next 19,000 commits have a bitmap every 100 commits. The remaining commits have a bitmap every 5,000 commits. Commits with more than 1 parent are preferred over ones with 1 or fewer. Furthermore, previously computed bitmaps are reused if the previous entry had the reuse flag set, which is set when the bitmap was placed at the max allowed distance.

Bitmaps are used to speed up the counting phase when packing, for requests that are not shallow. The PackWriterBitmapWalker uses a RevFilter to proactively mark commits with RevFlag.SEEN when they appear in a bitmap. The walker produces the full closure of reachable ObjectIds, given the collection of starting ObjectIds.

For fetch requests, two ObjectWalks are executed to compute the ObjectIds reachable from the haves and from the wants. The ObjectIds that need to be written are determined by taking all the resulting wants AND NOT the haves.

For clone requests, we get cached pack support for "free", since it is possible to determine if all of the ObjectIds in a pack file are included in the resulting list of ObjectIds to write.

On my machine, the best times for clones and fetches of the linux kernel repository (with about 2.6M objects and 300K commits) are tabulated below:

  Operation                    Index V2               Index VE003
  Clone                        37530ms (524.06 MiB)     82ms (524.06 MiB)
  Fetch (1 commit back)           75ms                 107ms
  Fetch (10 commits back)        456ms (269.51 KiB)    341ms (265.19 KiB)
  Fetch (100 commits back)       449ms (269.91 KiB)    337ms (267.28 KiB)
  Fetch (1000 commits back)     2229ms ( 14.75 MiB)    189ms ( 14.42 MiB)
  Fetch (10000 commits back)    2177ms ( 16.30 MiB)    254ms ( 15.88 MiB)
  Fetch (100000 commits back)  14340ms (185.83 MiB)   1655ms (189.39 MiB)

Change-Id: Icdb0cdd66ff168917fb9ef17b96093990cc6a98d
11 years ago
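The spacing rule described above reduces to a small piece of arithmetic. This sketch (BitmapSpacing is an illustrative name, not JGit's selector) returns how far apart selected bitmap commits are, given a commit's distance from the branch tip:

  // Illustrative sketch of the bitmap selection spacing described above.
  class BitmapSpacing {
      static int spacing(int commitsFromTip) {
          if (commitsFromTip < 100)
              return 1;      // the most recent 100 commits are all bitmapped
          if (commitsFromTip < 100 + 19000)
              return 100;    // the next 19,000 commits: a bitmap every 100
          return 5000;       // older history: a bitmap every 5,000 commits
      }
  }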
PackWriter: Support reuse of entire packs

The most expensive part of packing a repository for transport to another system is enumerating all of the objects in the repository. Once this gets to the size of the linux-2.6 repository (1.8 million objects), enumeration can take several CPU minutes and costs a lot of temporary working set memory.

Teach PackWriter to efficiently reuse an existing "cached pack" by answering a clone request with a thin pack followed by a larger cached pack appended to the end. This requires the repository owner to first construct the cached pack by hand, and record the tip commits inside of $GIT_DIR/objects/info/cached-packs:

  cd $GIT_DIR
  root=$(git rev-parse master)
  tmp=objects/.tmp-$$
  names=$(echo $root | git pack-objects --keep-true-parents --revs $tmp)
  for n in $names; do
    chmod a-w $tmp-$n.pack $tmp-$n.idx
    touch objects/pack/pack-$n.keep
    mv $tmp-$n.pack objects/pack/pack-$n.pack
    mv $tmp-$n.idx objects/pack/pack-$n.idx
  done
  (echo "+ $root";
   for n in $names; do echo "P $n"; done;
   echo) >>objects/info/cached-packs
  git repack -a -d

When a clone request needs to include $root, the corresponding cached pack will be copied as-is, rather than enumerating all of the objects that are reachable from $root.

For a linux-2.6 kernel repository that should be about 376 MiB, the above process creates two packs of 368 MiB and 38 MiB[1]. This is a local disk usage increase of ~26 MiB, due to reduced delta compression between the large cached pack and the smaller recent activity pack. The overhead is similar to 1 full copy of the compressed project sources.

With this cached pack in hand, JGit daemon completes a clone request in 1m17s less time, but with a slightly larger data transfer (+2.39 MiB):

Before:
  remote: Counting objects: 1861830, done
  remote: Finding sources: 100% (1861830/1861830)
  remote: Getting sizes: 100% (88243/88243)
  remote: Compressing objects: 100% (88184/88184)
  Receiving objects: 100% (1861830/1861830), 376.01 MiB | 19.01 MiB/s, done.
  remote: Total 1861830 (delta 4706), reused 1851053 (delta 1553844)
  Resolving deltas: 100% (1564621/1564621), done.

  real 3m19.005s

After:
  remote: Counting objects: 1601, done
  remote: Counting objects: 1828460, done
  remote: Finding sources: 100% (50475/50475)
  remote: Getting sizes: 100% (18843/18843)
  remote: Compressing objects: 100% (7585/7585)
  remote: Total 1861830 (delta 2407), reused 1856197 (delta 37510)
  Receiving objects: 100% (1861830/1861830), 378.40 MiB | 31.31 MiB/s, done.
  Resolving deltas: 100% (1559477/1559477), done.

  real 2m2.938s

Repository owners can periodically refresh their cached packs by repacking their repository, folding all newer objects into a larger cached pack. Since repacking is already considered to be a normal Git maintenance activity, this isn't a very big burden.

[1] In this test $root was set back about two weeks.

Change-Id: Ib87131d5c4b5e8c5cacb0f4fe16ff4ece554734b
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
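The script above implies a simple record format for objects/info/cached-packs: a "+ <tip-sha1>" line, one or more "P <pack-name>" lines, and a terminating blank line. Below is a hedged, self-contained parser sketch under that assumption (CachedPackRecord is an illustrative name, not JGit's type; whether a record may carry multiple "+" tips is also an assumption):

  import java.io.BufferedReader;
  import java.io.IOException;
  import java.io.StringReader;
  import java.util.ArrayList;
  import java.util.List;

  class CachedPackRecord {
      final List<String> tips = new ArrayList<>();      // "+ <tip-sha1>" lines
      final List<String> packNames = new ArrayList<>(); // "P <pack-name>" lines

      static List<CachedPackRecord> parse(String text) throws IOException {
          List<CachedPackRecord> records = new ArrayList<>();
          CachedPackRecord cur = new CachedPackRecord();
          BufferedReader br = new BufferedReader(new StringReader(text));
          String line;
          while ((line = br.readLine()) != null) {
              if (line.startsWith("+ "))
                  cur.tips.add(line.substring(2));
              else if (line.startsWith("P "))
                  cur.packNames.add(line.substring(2));
              else if (line.isEmpty() && !cur.tips.isEmpty()) {
                  records.add(cur);                     // blank line ends a record
                  cur = new CachedPackRecord();
              }
          }
          return records;
      }
  }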
PackWriter: Make thin packs more efficient

There is no point in pushing all of the files within the edge commits into the delta search when making a thin pack. This floods the delta search window with objects that are unlikely to be useful bases for the objects that will be written out, resulting in lower data compression and higher transfer sizes.

Instead, observe the path of a tree or blob that is being pushed into the outgoing set, and use that path to locate up to WINDOW ancestor versions from the edge commits. Push only those objects into the edgeObjects set, reducing the number of objects seen by the search window. This allows PackWriter to only look at ancestors for the modified files, rather than all files in the project. Limiting the search to WINDOW size makes sense, because more than WINDOW edge objects would just skip through the window search, as none of them need to be delta compressed.

To further improve compression, sort edge objects into the front of the window list, rather than randomly throughout. This puts non-edges later in the window and gives them a better chance at finding their base, since they search backwards through the window.

These changes make a significant difference in the thin pack:

Before:
  remote: Counting objects: 144190, done
  remote: Finding sources: 100% (50275/50275)
  remote: Getting sizes: 100% (101405/101405)
  remote: Compressing objects: 100% (7587/7587)
  Receiving objects: 100% (50275/50275), 24.67 MiB | 9.90 MiB/s, done.
  Resolving deltas: 100% (40339/40339), completed with 2218 local objects.

  real 0m30.267s

After:
  remote: Counting objects: 61549, done
  remote: Finding sources: 100% (50275/50275)
  remote: Getting sizes: 100% (18862/18862)
  remote: Compressing objects: 100% (7588/7588)
  Receiving objects: 100% (50275/50275), 11.04 MiB | 3.51 MiB/s, done.
  Resolving deltas: 100% (43160/43160), completed with 5014 local objects.

  real 0m22.170s

The resulting pack is 13.63 MiB smaller, even though it contains exactly the same objects. 82,543 fewer objects had to have their sizes looked up, which saved about 8s of server CPU time. 2,796 more objects from the client were used as part of the base object set, which contributed to the smaller transfer size.

Change-Id: Id01271950432c6960897495b09deab70e33993a9
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>
13 years ago
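The edges-first window ordering can be sketched with a stable sort (WindowOrder and Candidate are illustrative names, not JGit's types): edge objects move to the front of the window, so non-edge objects, which search backwards through the window, still encounter the edges as potential bases.

  import java.util.Comparator;
  import java.util.List;

  class WindowOrder {
      static class Candidate {
          final String name;
          final boolean edge; // true if only usable as a delta base
          Candidate(String name, boolean edge) {
              this.name = name;
              this.edge = edge;
          }
      }

      static void sortEdgesFirst(List<Candidate> window) {
          // Stable sort: edges first; relative order otherwise preserved.
          window.sort(Comparator.comparing((Candidate c) -> !c.edge));
      }
  }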
PackWriter: Make thin packs more efficient There is no point in pushing all of the files within the edge commits into the delta search when making a thin pack. This floods the delta search window with objects that are unlikely to be useful bases for the objects that will be written out, resulting in lower data compression and higher transfer sizes. Instead observe the path of a tree or blob that is being pushed into the outgoing set, and use that path to locate up to WINDOW ancestor versions from the edge commits. Push only those objects into the edgeObjects set, reducing the number of objects seen by the search window. This allows PackWriter to only look at ancestors for the modified files, rather than all files in the project. Limiting the search to WINDOW size makes sense, because more than WINDOW edge objects will just skip through the window search as none of them need to be delta compressed. To further improve compression, sort edge objects into the front of the window list, rather than randomly throughout. This puts non-edges later in the window and gives them a better chance at finding their base, since they search backwards through the window. These changes make a significant difference in the thin-pack: Before: remote: Counting objects: 144190, done remote: Finding sources: 100% (50275/50275) remote: Getting sizes: 100% (101405/101405) remote: Compressing objects: 100% (7587/7587) Receiving objects: 100% (50275/50275), 24.67 MiB | 9.90 MiB/s, done. Resolving deltas: 100% (40339/40339), completed with 2218 local objects. real 0m30.267s After: remote: Counting objects: 61549, done remote: Finding sources: 100% (50275/50275) remote: Getting sizes: 100% (18862/18862) remote: Compressing objects: 100% (7588/7588) Receiving objects: 100% (50275/50275), 11.04 MiB | 3.51 MiB/s, done. Resolving deltas: 100% (43160/43160), completed with 5014 local objects. real 0m22.170s The resulting pack is 13.63 MiB smaller, even though it contains the same exact objects. 82,543 fewer objects had to have their sizes looked up, which saved about 8s of server CPU time. 2,796 more objects from the client were used as part of the base object set, which contributed to the smaller transfer size. Change-Id: Id01271950432c6960897495b09deab70e33993a9 Signed-off-by: Shawn O. Pearce <spearce@spearce.org> Sigend-off-by: Chris Aniszczyk <caniszczyk@gmail.com>
13 years ago
PackWriter: Support reuse of entire packs

The most expensive part of packing a repository for transport to another system is enumerating all of the objects in the repository. Once this gets to the size of the linux-2.6 repository (1.8 million objects), enumeration can take several CPU minutes and costs a lot of temporary working set memory.

Teach PackWriter to efficiently reuse an existing "cached pack" by answering a clone request with a thin pack followed by a larger cached pack appended to the end. This requires the repository owner to first construct the cached pack by hand, and record the tip commits inside of $GIT_DIR/objects/info/cached-packs:

  cd $GIT_DIR
  root=$(git rev-parse master)
  tmp=objects/.tmp-$$
  names=$(echo $root | git pack-objects --keep-true-parents --revs $tmp)
  for n in $names; do
    chmod a-w $tmp-$n.pack $tmp-$n.idx
    touch objects/pack/pack-$n.keep
    mv $tmp-$n.pack objects/pack/pack-$n.pack
    mv $tmp-$n.idx objects/pack/pack-$n.idx
  done
  (echo "+ $root"; for n in $names; do echo "P $n"; done; echo) >>objects/info/cached-packs
  git repack -a -d

When a clone request needs to include $root, the corresponding cached pack is copied as-is, rather than enumerating all of the objects that are reachable from $root.

For a linux-2.6 kernel repository that should be about 376 MiB, the above process creates two packs of 368 MiB and 38 MiB [1]. This is a local disk usage increase of ~26 MiB, due to reduced delta compression between the large cached pack and the smaller recent activity pack. The overhead is similar to 1 full copy of the compressed project sources.

With this cached pack in hand, JGit daemon completes a clone request in 1m17s less time, but with a slightly larger data transfer (+2.39 MiB):

Before:
  remote: Counting objects: 1861830, done
  remote: Finding sources: 100% (1861830/1861830)
  remote: Getting sizes: 100% (88243/88243)
  remote: Compressing objects: 100% (88184/88184)
  Receiving objects: 100% (1861830/1861830), 376.01 MiB | 19.01 MiB/s, done.
  remote: Total 1861830 (delta 4706), reused 1851053 (delta 1553844)
  Resolving deltas: 100% (1564621/1564621), done.
  real 3m19.005s

After:
  remote: Counting objects: 1601, done
  remote: Counting objects: 1828460, done
  remote: Finding sources: 100% (50475/50475)
  remote: Getting sizes: 100% (18843/18843)
  remote: Compressing objects: 100% (7585/7585)
  remote: Total 1861830 (delta 2407), reused 1856197 (delta 37510)
  Receiving objects: 100% (1861830/1861830), 378.40 MiB | 31.31 MiB/s, done.
  Resolving deltas: 100% (1559477/1559477), done.
  real 2m2.938s

Repository owners can periodically refresh their cached packs by repacking their repository, folding all newer objects into a larger cached pack. Since repacking is already considered to be a normal Git maintenance activity, this isn't a very big burden.

[1] In this test $root was set back about two weeks.

Change-Id: Ib87131d5c4b5e8c5cacb0f4fe16ff4ece554734b
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
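The cached-packs file written by the script above has a simple line format: "+ <tip-id>" lines name the tip commits, "P <name>" lines name the pack files, and a blank line closes each record. A minimal, hedged Java parser for that format follows; the Record type is an illustrative stand-in, not JGit's actual representation.

  import java.io.BufferedReader;
  import java.io.IOException;
  import java.io.Reader;
  import java.util.ArrayList;
  import java.util.List;

  final class CachedPacks {
      static final class Record {
          final List<String> tips = new ArrayList<>();   // "+ <id>" lines
          final List<String> packs = new ArrayList<>();  // "P <name>" lines
      }

      static List<Record> parse(Reader src) throws IOException {
          List<Record> out = new ArrayList<>();
          Record cur = new Record();
          BufferedReader br = new BufferedReader(src);
          String line;
          while ((line = br.readLine()) != null) {
              if (line.isEmpty()) {  // blank line terminates a record
                  if (!cur.tips.isEmpty() || !cur.packs.isEmpty())
                      out.add(cur);
                  cur = new Record();
              } else if (line.startsWith("+ "))
                  cur.tips.add(line.substring(2));
              else if (line.startsWith("P "))
                  cur.packs.add(line.substring(2));
          }
          if (!cur.tips.isEmpty() || !cur.packs.isEmpty())
              out.add(cur);
          return out;
      }
  }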
PackWriter: Hoist and cluster reference targets

Many source browsers and network related tools like UploadPack need to find and parse the target of all branches and annotated tags within the repository during their startup phase. Clustering these together into the same part of the pack file will improve locality, reducing thrashing when an application starts and needs to load all of these into memory at once.

To prevent bottlenecking basic log viewing tools that are scanning backwards from the tip of a current branch (and don't need tags), we place this cluster of older targets after 4096 newer commits have already been placed into the pack stream. 4096 was chosen as a rough guess, but was based on a few factors:

  - log viewers typically show 5-200 commits per page
  - users only view the first page or two
  - DHT can cram 2200-4000 commits per 1 MiB chunk, thus these will fall into the second commit chunk (roughly)

Unfortunately this placement hurts history tools that are scanning backwards through the commit graph and completely ignored tags or branch heads when they started. An ancient tagged commit is no longer positioned behind its first child (it's now much earlier), resulting in a page fault for the parser to reload this cluster of objects on demand. This may be an acceptable loss: if a user is walking backwards and has already scanned through more than 4096 commits of history, waiting for the region to reload isn't really that bad compared to the amount of time already spent.

If the repository is so small that there are fewer than 4096 commits, this change has no impact on the placement of objects.

Change-Id: If3052e430d305e17878d94145c93754f56b74c61
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
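The placement rule can be sketched as follows. This is a hedged illustration under simplifying assumptions (the two input lists are disjoint, and a new list is built rather than sorting in place as the real writer would); the names are illustrative, not PackWriter's API.

  import java.util.ArrayList;
  import java.util.List;

  final class RefTargetClustering {
      static <T> List<T> order(List<T> commitsInWalkOrder, List<T> refTargets) {
          final int HOIST_AFTER = 4096;  // rough guess explained above
          List<T> out = new ArrayList<>();
          int placed = 0;
          for (T c : commitsInWalkOrder) {
              out.add(c);
              if (++placed == HOIST_AFTER)
                  out.addAll(refTargets);  // cluster lands after 4096 commits
          }
          if (placed < HOIST_AFTER)
              out.addAll(refTargets);      // small repo: cluster at the end
          return out;
      }
  }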
Teach PackWriter how to reuse an existing object list

Counting the objects needed for packing is the most expensive part of an UploadPack request that has no uninteresting objects (otherwise known as an initial clone). During this phase the PackWriter is enumerating the entire set of objects in the repository, so they can be sent to the client for its new clone.

Allow the ObjectReader (and therefore the underlying storage system) to keep a cached list of all objects reachable from a small number of points in the project's history. If one of those points is reached during enumeration of the commit graph, most objects are obtained from the cached list instead of direct traversal.

PackWriter uses the list by discarding the current object lists and restarting a traversal from all refs, but marking the object list name as uninteresting. This allows PackWriter to enumerate all objects that are more recent than the list creation, or that were on side branches that the list does not include.

However, ObjectWalk tags all of the trees and commits within the list commit as UNINTERESTING, which would normally cause PackWriter to construct a thin pack that excludes these objects. To avoid that, addObject() was refactored to allow this list-based enumeration to always include an object, even if it has been tagged UNINTERESTING by the ObjectWalk. This implies the list-based enumeration may only be used for initial clones, where all objects are being sent.

The UNINTERESTING labeling occurs because StartGenerator always enables the BoundaryGenerator if the walker is an ObjectWalk and a commit was marked UNINTERESTING, even if RevSort.BOUNDARY was not enabled. This is the default, reasonable behavior for an ObjectWalk, but isn't desired here in PackWriter with the list-based enumeration. Rather than trying to change all of this behavior, PackWriter works around it.

Because the list name commit's immediate files and trees were all enumerated before the list enumeration itself starts (and are also within the list itself), PackWriter runs the risk of adding the same objects to its ObjectIdSubclassMap twice. Since this breaks the internal map data structure (and also may cause the object to transmit twice), PackWriter needs to use a new "added" RevFlag to track whether or not an object has been put into the outgoing list yet.

Change-Id: Ie99ed4d969a6bb20cc2528ac6b8fb91043cee071
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
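The "added" flag guard described in the last paragraph can be sketched with JGit's public RevWalk flag API. The wrapper class is a hedged illustration; PackWriter's real addObject() performs this check inline rather than through a helper like this.

  import org.eclipse.jgit.revwalk.ObjectWalk;
  import org.eclipse.jgit.revwalk.RevFlag;
  import org.eclipse.jgit.revwalk.RevObject;

  final class AddedGuard {
      private final RevFlag added;

      AddedGuard(ObjectWalk walk) {
          added = walk.newFlag("added");
      }

      // Returns true only the first time an object is offered, so the
      // caller inserts it into its ObjectIdSubclassMap exactly once and
      // never transmits the same object twice.
      boolean markAdded(RevObject obj) {
          if (obj.has(added))
              return false;  // already queued for the outgoing pack
          obj.add(added);
          return true;
      }
  }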
Support creating pack bitmap indexes in PackWriter.

Update the PackWriter to support writing out pack bitmap indexes, a parallel ".bitmap" file to the ".pack" file.

Bitmaps are selected at commits spaced every 1 to 5,000 commits apart for each unique path from the start. The most recent 100 commits are all bitmapped. The next 19,000 commits have a bitmap every 100 commits. The remaining commits have a bitmap every 5,000 commits. Commits with more than 1 parent are preferred over ones with 1 or fewer. Furthermore, previously computed bitmaps are reused if the previous entry had the reuse flag set, which is set when the bitmap was placed at the max allowed distance.

Bitmaps are used to speed up the counting phase when packing, for requests that are not shallow. The PackWriterBitmapWalker uses a RevFilter to proactively mark commits with RevFlag.SEEN when they appear in a bitmap. The walker produces the full closure of reachable ObjectIds, given the collection of starting ObjectIds.

For fetch requests, two ObjectWalks are executed to compute the ObjectIds reachable from the haves and from the wants. The ObjectIds that need to be written are determined by taking all the resulting wants AND NOT the haves.

For clone requests, we get cached pack support for "free", since it is possible to determine if all of the ObjectIds in a pack file are included in the resulting list of ObjectIds to write.

On my machine, the best times for clones and fetches of the linux kernel repository (with about 2.6M objects and 300K commits) are tabulated below:

  Operation                     Index V2                Index VE003
  Clone                         37530ms (524.06 MiB)    82ms (524.06 MiB)
  Fetch (1 commit back)         75ms                    107ms
  Fetch (10 commits back)       456ms (269.51 KiB)      341ms (265.19 KiB)
  Fetch (100 commits back)      449ms (269.91 KiB)      337ms (267.28 KiB)
  Fetch (1000 commits back)     2229ms ( 14.75 MiB)     189ms ( 14.42 MiB)
  Fetch (10000 commits back)    2177ms ( 16.30 MiB)     254ms ( 15.88 MiB)
  Fetch (100000 commits back)   14340ms (185.83 MiB)    1655ms (189.39 MiB)

Change-Id: Icdb0cdd66ff168917fb9ef17b96093990cc6a98d
11 years ago
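The "wants AND NOT haves" arithmetic above is ordinary compressed-bitmap set math. A minimal sketch using the javaewah library (which JGit's bitmap support is built on) is below; the small bit positions standing in for object ids are made up for illustration.

  import com.googlecode.javaewah.EWAHCompressedBitmap;

  final class BitmapSetMath {
      public static void main(String[] args) {
          // Bit positions stand in for object ids in the pack index.
          EWAHCompressedBitmap wants = EWAHCompressedBitmap.bitmapOf(1, 2, 3, 4, 5);
          EWAHCompressedBitmap haves = EWAHCompressedBitmap.bitmapOf(2, 3);

          // Everything the client wants that it does not already have:
          // bits 1, 4, and 5 remain set.
          EWAHCompressedBitmap toSend = wants.andNot(haves);
          System.out.println(toSend.cardinality());  // 3
      }
  }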
PackWriter: Don't reuse commit or tag deltas

JGit doesn't generate deltas for commit or tag objects when it packs a repository from scratch. This is an explicit design decision that is (mostly) justified by the fact that these objects do not delta compress well.

Annotated tags are made once, on stable points of the project history; it is unlikely they will ever appear again with sufficient common text to justify using a delta over just deflating the raw content. JGit never tries to delta compress annotated tags, and I take the stance that these are best stored as non-deltas given how frequently they might be accessed by repository viewers.

Commits only have sufficient common text when they are cherry-picked to forward-port or back-port a change from one branch to another. Even in these cases, the distance between the commits as returned by the log traversal has to be small enough that they would both appear in the delta search window at the same time in order to delta compress one of the messages against the other. JGit never tries to delta compress commits, as doing so requires a lot of CPU time but typically does not produce a smaller pack file.

Avoid reusing deltas for either of these types when constructing a new pack. To avoid killing performance when serving network clients, UploadPack disables this code change by allowing PackWriter to reuse delta commits. Repositories that were already repacked by C Git will not have their delta commits decompressed and recompressed on the fly during object writing, saving server-side CPU resources.

Change-Id: I749407e7c5c677e05e4d054b40db7656cfa7fca8
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
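The reuse policy described above reduces to a type switch. The sketch below uses JGit's real object type constants, but the method shape and the reuseDeltaCommits field are hedged illustrations, not PackWriter's actual signature.

  import static org.eclipse.jgit.lib.Constants.OBJ_COMMIT;
  import static org.eclipse.jgit.lib.Constants.OBJ_TAG;

  final class DeltaReusePolicy {
      private final boolean reuseDeltaCommits;

      DeltaReusePolicy(boolean reuseDeltaCommits) {
          this.reuseDeltaCommits = reuseDeltaCommits;
      }

      boolean mayReuseDelta(int objectType) {
          switch (objectType) {
          case OBJ_COMMIT:
              return reuseDeltaCommits;  // left enabled by UploadPack
          case OBJ_TAG:
              return false;              // annotated tags: never as deltas
          default:
              return true;               // trees and blobs reuse freely
          }
      }
  }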
PackWriter: Fix the way delta chain cycles are prevented

Take a very simple approach to avoiding delta chains during object reuse: objects are now always selected from the oldest pack that contains them. This prevents cycles because a pack must not have a cycle in its own delta chain; if both objects A and B are chosen out of the same source pack, then there cannot be an A->B->A cycle.

The oldest pack is also the most likely to have the smallest deltas. It's the biggest pack in the system and probably came from the clone (or last GC) of this repository, where all objects were previously considered and packed tightly together. If an object appears again (for example due to a revert and a push into this repository), the newer copy of it won't be nearly as small as the older delta version, even if the newer one is also itself a delta.

ObjectDirectory already enumerates objects during selection in this newest->oldest order, so it is already supplying these assumptions to PackWriter. Taking advantage of this can speed up selection by a tiny amount by avoiding some tests, but can also help to prevent a cycle needing to be broken on the fly during writing.

The previous cycle breaking logic wasn't fully correct either. If a different delta base was chosen, the new delta base might not have been written into the output pack before the current object, forcing the use of REF_DELTA when OFS_DELTA is always smaller. This logic has now been reworked to always re-check the delta base and ensure it gets written before the current object. If a cycle occurs, it gets broken the same way as before: by disabling delta reuse and finding an alternative form of the object, which may require inflating/deflating in whole format.

Change-Id: I9953ab8be54ceb8b588e1280d6f7edd688887747
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
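A hedged sketch of the "oldest pack wins" rule follows, with a hypothetical Pack type standing in for JGit's pack abstraction. Because any delta pair is then drawn from a single pack, and a single pack cannot contain a delta cycle, reuse cannot introduce one.

  import java.util.Comparator;
  import java.util.List;
  import java.util.Optional;

  final class OldestPackSelection {
      // Hypothetical stand-in for a pack file with a creation timestamp.
      record Pack(String name, long createdMillis) {}

      // Choose the oldest pack containing the object. The oldest pack is
      // also the biggest and most tightly delta-compressed, so its copy
      // of the object is likely the smallest representation available.
      static Optional<Pack> oldest(List<Pack> packsContainingObject) {
          return packsContainingObject.stream()
                  .min(Comparator.comparingLong(Pack::createdMillis));
      }
  }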
05520562057205820592060206120622063206420652066206720682069207020712072207320742075207620772078207920802081208220832084208520862087208820892090209120922093209420952096209720982099210021012102210321042105210621072108210921102111211221132114211521162117211821192120212121222123212421252126212721282129213021312132213321342135213621372138213921402141214221432144214521462147214821492150215121522153215421552156215721582159216021612162216321642165216621672168216921702171217221732174217521762177217821792180218121822183218421852186218721882189219021912192219321942195219621972198219922002201220222032204220522062207220822092210221122122213221422152216221722182219222022212222222322242225222622272228222922302231223222332234223522362237223822392240224122422243224422452246224722482249225022512252225322542255225622572258225922602261226222632264226522662267226822692270227122722273227422752276227722782279228022812282228322842285228622872288228922902291229222932294229522962297229822992300230123022303230423052306230723082309231023112312231323142315231623172318231923202321232223232324232523262327232823292330233123322333233423352336233723382339234023412342234323442345234623472348234923502351235223532354235523562357235823592360236123622363236423652366236723682369237023712372237323742375237623772378237923802381238223832384238523862387238823892390239123922393239423952396239723982399240024012402240324042405240624072408240924102411241224132414241524162417241824192420242124222423242424252426242724282429243024312432243324342435243624372438243924402441244224432444244524462447244824492450245124522453245424552456245724582459246024612462246324642465246624672468246924702471247224732474247524762477
/*
 * Copyright (C) 2008-2010, Google Inc.
 * Copyright (C) 2008, Marek Zawirski <marek.zawirski@gmail.com>
 * and other copyright owners as documented in the project's IP log.
 *
 * This program and the accompanying materials are made available
 * under the terms of the Eclipse Distribution License v1.0 which
 * accompanies this distribution, is reproduced below, and is
 * available at http://www.eclipse.org/org/documents/edl-v10.php
 *
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or
 * without modification, are permitted provided that the following
 * conditions are met:
 *
 * - Redistributions of source code must retain the above copyright
 *   notice, this list of conditions and the following disclaimer.
 *
 * - Redistributions in binary form must reproduce the above
 *   copyright notice, this list of conditions and the following
 *   disclaimer in the documentation and/or other materials provided
 *   with the distribution.
 *
 * - Neither the name of the Eclipse Foundation, Inc. nor the
 *   names of its contributors may be used to endorse or promote
 *   products derived from this software without specific prior
 *   written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
 * CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES,
 * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
 * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
 * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
 * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
 * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
 * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
 * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
 * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
 * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */
package org.eclipse.jgit.internal.storage.pack;

import static org.eclipse.jgit.internal.storage.pack.StoredObjectRepresentation.PACK_DELTA;
import static org.eclipse.jgit.internal.storage.pack.StoredObjectRepresentation.PACK_WHOLE;
import static org.eclipse.jgit.lib.Constants.OBJECT_ID_LENGTH;
import static org.eclipse.jgit.lib.Constants.OBJ_BLOB;
import static org.eclipse.jgit.lib.Constants.OBJ_COMMIT;
import static org.eclipse.jgit.lib.Constants.OBJ_TAG;
import static org.eclipse.jgit.lib.Constants.OBJ_TREE;

import java.io.IOException;
import java.io.OutputStream;
import java.lang.ref.WeakReference;
import java.security.MessageDigest;
import java.text.MessageFormat;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.zip.CRC32;
import java.util.zip.CheckedOutputStream;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

import org.eclipse.jgit.errors.CorruptObjectException;
import org.eclipse.jgit.errors.IncorrectObjectTypeException;
import org.eclipse.jgit.errors.LargeObjectException;
import org.eclipse.jgit.errors.MissingObjectException;
import org.eclipse.jgit.errors.StoredObjectRepresentationNotAvailableException;
import org.eclipse.jgit.internal.JGitText;
import org.eclipse.jgit.internal.storage.file.PackBitmapIndexBuilder;
import org.eclipse.jgit.internal.storage.file.PackBitmapIndexWriterV1;
import org.eclipse.jgit.internal.storage.file.PackIndexWriter;
import org.eclipse.jgit.lib.AnyObjectId;
import org.eclipse.jgit.lib.AsyncObjectSizeQueue;
import org.eclipse.jgit.lib.BatchingProgressMonitor;
import org.eclipse.jgit.lib.BitmapIndex;
import org.eclipse.jgit.lib.BitmapIndex.BitmapBuilder;
import org.eclipse.jgit.lib.BitmapObject;
import org.eclipse.jgit.lib.Constants;
import org.eclipse.jgit.lib.NullProgressMonitor;
import org.eclipse.jgit.lib.ObjectId;
import org.eclipse.jgit.lib.ObjectIdOwnerMap;
import org.eclipse.jgit.lib.ObjectLoader;
import org.eclipse.jgit.lib.ObjectReader;
import org.eclipse.jgit.lib.ProgressMonitor;
import org.eclipse.jgit.lib.Repository;
import org.eclipse.jgit.lib.ThreadSafeProgressMonitor;
import org.eclipse.jgit.revwalk.AsyncRevObjectQueue;
import org.eclipse.jgit.revwalk.DepthWalk;
import org.eclipse.jgit.revwalk.ObjectWalk;
import org.eclipse.jgit.revwalk.RevCommit;
import org.eclipse.jgit.revwalk.RevFlag;
import org.eclipse.jgit.revwalk.RevObject;
import org.eclipse.jgit.revwalk.RevSort;
import org.eclipse.jgit.revwalk.RevTag;
import org.eclipse.jgit.revwalk.RevTree;
import org.eclipse.jgit.storage.pack.PackConfig;
import org.eclipse.jgit.util.BlockList;
import org.eclipse.jgit.util.TemporaryBuffer;
/**
 * <p>
 * The PackWriter class is responsible for generating pack files from a
 * specified set of objects in a repository. This implementation produces pack
 * files in format version 2.
 * </p>
 * <p>
 * The source of objects may be specified in two ways:
 * <ul>
 * <li>(usually) by providing sets of interesting and uninteresting objects in
 * the repository - all interesting objects and their ancestors, except
 * uninteresting objects and their ancestors, will be included in the pack,
 * or</li>
 * <li>by providing an iterator of {@link RevObject} specifying the exact list
 * and order of objects in the pack</li>
 * </ul>
 * <p>
 * Typical usage consists of creating an instance intended for some pack,
 * configuring options, preparing the list of objects by calling
 * {@link #preparePack(Iterator)} or
 * {@link #preparePack(ProgressMonitor, Collection, Collection)}, and finally
 * producing the stream with
 * {@link #writePack(ProgressMonitor, ProgressMonitor, OutputStream)}.
 * </p>
 * <p>
 * The class provides a set of configurable options and {@link ProgressMonitor}
 * support, as operations may take a long time for big repositories. When delta
 * compression is enabled, a delta search is performed for objects that have no
 * suitable delta available for reuse; otherwise this implementation relies on
 * delta and object reuse alone.
 * </p>
 * <p>
 * This class is not thread safe. It is intended to be used in one thread, with
 * one instance per created pack. Subsequent calls to writePack result in
 * undefined behavior.
 * </p>
 */
public class PackWriter {
    private static final int PACK_VERSION_GENERATED = 2;

    /** A collection of object ids. */
    public interface ObjectIdSet {
        /**
         * Returns true if the objectId is contained within the collection.
         *
         * @param objectId
         *            the objectId to find
         * @return whether the collection contains the objectId.
         */
        boolean contains(AnyObjectId objectId);
    }

    private static final Map<WeakReference<PackWriter>, Boolean> instances =
            new ConcurrentHashMap<WeakReference<PackWriter>, Boolean>();

    private static final Iterable<PackWriter> instancesIterable = new Iterable<PackWriter>() {
        public Iterator<PackWriter> iterator() {
            return new Iterator<PackWriter>() {
                private final Iterator<WeakReference<PackWriter>> it =
                        instances.keySet().iterator();

                private PackWriter next;

                public boolean hasNext() {
                    if (next != null)
                        return true;
                    while (it.hasNext()) {
                        WeakReference<PackWriter> ref = it.next();
                        next = ref.get();
                        if (next != null)
                            return true;
                        it.remove();
                    }
                    return false;
                }

                public PackWriter next() {
                    if (hasNext()) {
                        PackWriter result = next;
                        next = null;
                        return result;
                    }
                    throw new NoSuchElementException();
                }

                public void remove() {
                    throw new UnsupportedOperationException();
                }
            };
        }
    };

    /** @return all allocated, non-released PackWriter instances. */
    public static Iterable<PackWriter> getInstances() {
        return instancesIterable;
    }
    @SuppressWarnings("unchecked")
    private final BlockList<ObjectToPack> objectsLists[] = new BlockList[OBJ_TAG + 1];
    {
        objectsLists[OBJ_COMMIT] = new BlockList<ObjectToPack>();
        objectsLists[OBJ_TREE] = new BlockList<ObjectToPack>();
        objectsLists[OBJ_BLOB] = new BlockList<ObjectToPack>();
        objectsLists[OBJ_TAG] = new BlockList<ObjectToPack>();
    }

    private final ObjectIdOwnerMap<ObjectToPack> objectsMap = new ObjectIdOwnerMap<ObjectToPack>();

    // Edge objects for thin packs.
    private List<ObjectToPack> edgeObjects = new BlockList<ObjectToPack>();

    // Objects the client is known to have already.
    private BitmapBuilder haveObjects;

    private List<CachedPack> cachedPacks = new ArrayList<CachedPack>(2);

    private Set<ObjectId> tagTargets = Collections.emptySet();

    private ObjectIdSet[] excludeInPacks;

    private ObjectIdSet excludeInPackLast;

    private Deflater myDeflater;

    private final ObjectReader reader;

    /** {@link #reader} recast to the reuse interface, if it supports it. */
    private final ObjectReuseAsIs reuseSupport;

    private final PackConfig config;

    private final Statistics stats;

    private final MutableState state;

    private final WeakReference<PackWriter> selfRef;

    private Statistics.ObjectType typeStats;

    private List<ObjectToPack> sortedByName;

    private byte packcsum[];

    private boolean deltaBaseAsOffset;

    private boolean reuseDeltas;

    private boolean reuseDeltaCommits;

    private boolean reuseValidate;

    private boolean thin;

    private boolean useCachedPacks;

    private boolean useBitmaps;

    private boolean ignoreMissingUninteresting = true;

    private boolean pruneCurrentObjectList;

    private boolean shallowPack;

    private boolean canBuildBitmaps;

    private boolean indexDisabled;

    private int depth;

    private Collection<? extends ObjectId> unshallowObjects;

    private PackBitmapIndexBuilder writeBitmaps;

    private CRC32 crc32;
    /**
     * Create a writer for the specified repository.
     * <p>
     * Objects for packing are specified in {@link #preparePack(Iterator)} or
     * {@link #preparePack(ProgressMonitor, Collection, Collection)}.
     *
     * @param repo
     *            repository where objects are stored.
     */
    public PackWriter(final Repository repo) {
        this(repo, repo.newObjectReader());
    }

    /**
     * Create a writer to load objects from the specified reader.
     * <p>
     * Objects for packing are specified in {@link #preparePack(Iterator)} or
     * {@link #preparePack(ProgressMonitor, Collection, Collection)}.
     *
     * @param reader
     *            reader to read from the repository with.
     */
    public PackWriter(final ObjectReader reader) {
        this(new PackConfig(), reader);
    }

    /**
     * Create a writer for the specified repository.
     * <p>
     * Objects for packing are specified in {@link #preparePack(Iterator)} or
     * {@link #preparePack(ProgressMonitor, Collection, Collection)}.
     *
     * @param repo
     *            repository where objects are stored.
     * @param reader
     *            reader to read from the repository with.
     */
    public PackWriter(final Repository repo, final ObjectReader reader) {
        this(new PackConfig(repo), reader);
    }

    /**
     * Create a writer with a specified configuration.
     * <p>
     * Objects for packing are specified in {@link #preparePack(Iterator)} or
     * {@link #preparePack(ProgressMonitor, Collection, Collection)}.
     *
     * @param config
     *            configuration for the pack writer.
     * @param reader
     *            reader to read from the repository with.
     */
    public PackWriter(final PackConfig config, final ObjectReader reader) {
        this.config = config;
        this.reader = reader;
        if (reader instanceof ObjectReuseAsIs)
            reuseSupport = ((ObjectReuseAsIs) reader);
        else
            reuseSupport = null;

        deltaBaseAsOffset = config.isDeltaBaseAsOffset();
        reuseDeltas = config.isReuseDeltas();
        reuseValidate = true; // be paranoid by default
        stats = new Statistics();
        state = new MutableState();
        selfRef = new WeakReference<PackWriter>(this);
        instances.put(selfRef, Boolean.TRUE);
    }
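
    // Usage sketch (annotation added for illustration; not part of the
    // original file). One writer per pack, released when done; "repo",
    // the monitors, the want/have sets and "packOut" are placeholders.
    //
    //   PackWriter pw = new PackWriter(repo);
    //   try {
    //       pw.preparePack(countingMonitor, wants, haves);
    //       pw.writePack(compressMonitor, writeMonitor, packOut);
    //   } finally {
    //       pw.release();
    //   }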
    /**
     * Check whether the writer can store the delta base as an offset (new
     * style, reducing pack size) or should store it as an object id (legacy
     * style, compatible with old readers).
     *
     * Default setting: {@value PackConfig#DEFAULT_DELTA_BASE_AS_OFFSET}
     *
     * @return true if the delta base is stored as an offset; false if it is
     *         stored as an object id.
     */
    public boolean isDeltaBaseAsOffset() {
        return deltaBaseAsOffset;
    }

    /**
     * Set the writer delta base format. The delta base can be written as an
     * offset in a pack file (new approach reducing file size) or as an object
     * id (legacy approach, compatible with old readers).
     *
     * Default setting: {@value PackConfig#DEFAULT_DELTA_BASE_AS_OFFSET}
     *
     * @param deltaBaseAsOffset
     *            boolean indicating whether the delta base can be stored as
     *            an offset.
     */
    public void setDeltaBaseAsOffset(boolean deltaBaseAsOffset) {
        this.deltaBaseAsOffset = deltaBaseAsOffset;
    }

    /**
     * Check if the writer will reuse commits that are already stored as
     * deltas.
     *
     * @return true if the writer would reuse commits stored as deltas,
     *         assuming delta reuse is already enabled.
     */
    public boolean isReuseDeltaCommits() {
        return reuseDeltaCommits;
    }

    /**
     * Set the writer to reuse existing delta versions of commits.
     *
     * @param reuse
     *            if true, the writer will reuse any commits stored as deltas.
     *            By default the writer does not reuse delta commits.
     */
    public void setReuseDeltaCommits(boolean reuse) {
        reuseDeltaCommits = reuse;
    }

    /**
     * Check if the writer validates objects before copying them.
     *
     * @return true if validation is enabled; false if the reader will handle
     *         object validation as a side-effect of it consuming the output.
     */
    public boolean isReuseValidatingObjects() {
        return reuseValidate;
    }

    /**
     * Enable (or disable) object validation during packing.
     *
     * @param validate
     *            if true the pack writer will validate an object before it is
     *            put into the output. This additional validation work may be
     *            necessary to avoid propagating corruption from one local
     *            pack file to another local pack file.
     */
    public void setReuseValidatingObjects(boolean validate) {
        reuseValidate = validate;
    }

    /** @return true if this writer is producing a thin pack. */
    public boolean isThin() {
        return thin;
    }

    /**
     * @param packthin
     *            a boolean indicating whether the writer may pack objects
     *            whose delta base is not within the set of objects to pack,
     *            but belongs to the other party's repository
     *            (uninteresting/boundary objects) as determined by the set;
     *            this kind of pack is used only for transport; true to
     *            produce a thin pack, false otherwise.
     */
    public void setThin(final boolean packthin) {
        thin = packthin;
    }
    /**
     * @return true to reuse cached packs. If true, index creation isn't
     *         available.
     */
    public boolean isUseCachedPacks() {
        return useCachedPacks;
    }

    /**
     * @param useCached
     *            if set to true and a cached pack is present, it will be
     *            appended onto the end of a thin-pack, reducing the amount of
     *            working set space and CPU used by PackWriter. Enabling this
     *            feature prevents PackWriter from creating an index for the
     *            newly created pack, so it's only suitable for writing to a
     *            network client, where the client will make the index.
     */
    public void setUseCachedPacks(boolean useCached) {
        useCachedPacks = useCached;
    }

    /** @return true to use bitmaps for ObjectWalks, if available. */
    public boolean isUseBitmaps() {
        return useBitmaps;
    }

    /**
     * @param useBitmaps
     *            if set to true, bitmaps will be used when preparing a pack.
     */
    public void setUseBitmaps(boolean useBitmaps) {
        this.useBitmaps = useBitmaps;
    }

    /** @return true if the index file cannot be created by this PackWriter. */
    public boolean isIndexDisabled() {
        return indexDisabled || !cachedPacks.isEmpty();
    }

    /**
     * @param noIndex
     *            true to disable creation of the index file.
     */
    public void setIndexDisabled(boolean noIndex) {
        this.indexDisabled = noIndex;
    }

    /**
     * @return true to ignore objects that are uninteresting and also not
     *         found on local disk; false to throw a
     *         {@link MissingObjectException} out of
     *         {@link #preparePack(ProgressMonitor, Collection, Collection)}
     *         if an uninteresting object is not in the source repository.
     *         True by default, permitting graceful ignoring of uninteresting
     *         objects.
     */
    public boolean isIgnoreMissingUninteresting() {
        return ignoreMissingUninteresting;
    }

    /**
     * @param ignore
     *            true if the writer should ignore non-existing uninteresting
     *            objects during construction of the set of objects to pack;
     *            false otherwise - non-existing uninteresting objects may
     *            cause a {@link MissingObjectException}
     */
    public void setIgnoreMissingUninteresting(final boolean ignore) {
        ignoreMissingUninteresting = ignore;
    }
    /**
     * Set the tag targets that should be hoisted earlier during packing.
     * <p>
     * Callers may put objects into this set before invoking any of the
     * preparePack methods to influence where an annotated tag's target is
     * stored within the resulting pack. Typically these will be clustered
     * together, and hoisted earlier in the file even if they are ancient
     * revisions, allowing readers to find tag targets with better locality.
     *
     * @param objects
     *            objects that annotated tags point at.
     */
    public void setTagTargets(Set<ObjectId> objects) {
        tagTargets = objects;
    }

    /**
     * Configure this pack for a shallow clone.
     *
     * @param depth
     *            maximum depth to traverse the commit graph
     * @param unshallow
     *            objects which used to be shallow on the client, but are
     *            being extended as part of this fetch
     */
    public void setShallowPack(int depth,
            Collection<? extends ObjectId> unshallow) {
        this.shallowPack = true;
        this.depth = depth;
        this.unshallowObjects = unshallow;
    }
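
    // Illustrative sketch (assumption, not part of the original file):
    // deepening a shallow clone by at most 50 commits, where
    // "clientUnshallowed" stands for the ids the client reported as
    // becoming unshallow in this fetch.
    //
    //   pw.setShallowPack(50, clientUnshallowed);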
    /**
     * Returns the number of objects in the pack file that was created by
     * this writer.
     *
     * @return number of objects in pack.
     * @throws IOException
     *             a cached pack cannot supply its object count.
     */
    public long getObjectCount() throws IOException {
        if (stats.totalObjects == 0) {
            long objCnt = 0;
            objCnt += objectsLists[OBJ_COMMIT].size();
            objCnt += objectsLists[OBJ_TREE].size();
            objCnt += objectsLists[OBJ_BLOB].size();
            objCnt += objectsLists[OBJ_TAG].size();
            for (CachedPack pack : cachedPacks)
                objCnt += pack.getObjectCount();
            return objCnt;
        }
        return stats.totalObjects;
    }
    /**
     * Returns the object ids in the pack file that was created by this
     * writer.
     * <p>
     * This method can only be invoked after
     * {@link #writePack(ProgressMonitor, ProgressMonitor, OutputStream)} has
     * been invoked and completed successfully.
     *
     * @return the object ids in the pack.
     * @throws IOException
     *             a cached pack cannot supply its object ids.
     */
    public ObjectIdOwnerMap<ObjectIdOwnerMap.Entry> getObjectSet()
            throws IOException {
        if (!cachedPacks.isEmpty())
            throw new IOException(
                    JGitText.get().cachedPacksPreventsListingObjects);

        ObjectIdOwnerMap<ObjectIdOwnerMap.Entry> objs = new ObjectIdOwnerMap<
                ObjectIdOwnerMap.Entry>();
        for (BlockList<ObjectToPack> objList : objectsLists) {
            if (objList != null) {
                for (ObjectToPack otp : objList)
                    objs.add(new ObjectIdOwnerMap.Entry(otp) {
                        // A new entry that copies the ObjectId
                    });
            }
        }
        return objs;
    }

    /**
     * Add a pack index whose contents should be excluded from the result.
     *
     * @param idx
     *            objects in this index will not be in the output pack.
     */
    public void excludeObjects(ObjectIdSet idx) {
        if (excludeInPacks == null) {
            excludeInPacks = new ObjectIdSet[] { idx };
            excludeInPackLast = idx;
        } else {
            int cnt = excludeInPacks.length;
            ObjectIdSet[] newList = new ObjectIdSet[cnt + 1];
            System.arraycopy(excludeInPacks, 0, newList, 0, cnt);
            newList[cnt] = idx;
            excludeInPacks = newList;
        }
    }
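
    // Illustrative sketch (assumption, not part of the original file):
    // during a repack, objects already stored in pack files being kept can
    // be excluded; "keptIndexes" stands for any ObjectIdSet views of those
    // packs' indexes.
    //
    //   for (ObjectIdSet keptIndex : keptIndexes)
    //       pw.excludeObjects(keptIndex);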
    /**
     * Prepare the list of objects to be written to the pack stream.
     * <p>
     * The iterator <b>exactly</b> determines which objects are included in
     * the pack and the order they appear in the pack (except that ordering
     * objects by type is not required at input). This order should conform to
     * the general rules of ordering objects in git - by recency and path
     * (type and delta-base first ordering is handled internally) - and the
     * responsibility for guaranteeing this order is on the caller's side. The
     * iterator must return each object id to write exactly once.
     * </p>
     *
     * @param objectsSource
     *            iterator of objects to store in the pack; order of objects
     *            within each type is important, ordering by type is not
     *            needed; allowed types for objects are
     *            {@link Constants#OBJ_COMMIT}, {@link Constants#OBJ_TREE},
     *            {@link Constants#OBJ_BLOB} and {@link Constants#OBJ_TAG};
     *            objects returned by the iterator may be later reused by the
     *            caller, as the object id and type are internally copied in
     *            each iteration.
     * @throws IOException
     *             when some I/O problem occurs during reading objects.
     */
    public void preparePack(final Iterator<RevObject> objectsSource)
            throws IOException {
        while (objectsSource.hasNext()) {
            addObject(objectsSource.next());
        }
    }
    /**
     * Prepare the list of objects to be written to the pack stream.
     * <p>
     * Based on these two sets, another set of objects to put in the pack file
     * is created: this set consists of all objects reachable (ancestors) from
     * interesting objects, except uninteresting objects and their ancestors.
     * This method uses the {@link ObjectWalk} class extensively to find the
     * appropriate set of output objects and their optimal order in the output
     * pack. Order is consistent with general git in-pack rules: sort by
     * object type, recency, path and delta-base first.
     * </p>
     *
     * @param countingMonitor
     *            progress during object enumeration.
     * @param want
     *            collection of objects to be marked as interesting (start
     *            points of graph traversal).
     * @param have
     *            collection of objects to be marked as uninteresting (end
     *            points of graph traversal).
     * @throws IOException
     *             when some I/O problem occurs during reading objects.
     * @deprecated to be removed in 2.0; use the Set version of this method.
     */
    @Deprecated
    public void preparePack(ProgressMonitor countingMonitor,
            final Collection<? extends ObjectId> want,
            final Collection<? extends ObjectId> have) throws IOException {
        preparePack(countingMonitor, ensureSet(want), ensureSet(have));
    }

    /**
     * Prepare the list of objects to be written to the pack stream.
     * <p>
     * Based on these two sets, another set of objects to put in the pack file
     * is created: this set consists of all objects reachable (ancestors) from
     * interesting objects, except uninteresting objects and their ancestors.
     * This method uses the {@link ObjectWalk} class extensively to find the
     * appropriate set of output objects and their optimal order in the output
     * pack. Order is consistent with general git in-pack rules: sort by
     * object type, recency, path and delta-base first.
     * </p>
     *
     * @param countingMonitor
     *            progress during object enumeration.
     * @param walk
     *            ObjectWalk to perform enumeration.
     * @param interestingObjects
     *            collection of objects to be marked as interesting (start
     *            points of graph traversal).
     * @param uninterestingObjects
     *            collection of objects to be marked as uninteresting (end
     *            points of graph traversal).
     * @throws IOException
     *             when some I/O problem occurs during reading objects.
     * @deprecated to be removed in 2.0; use the Set version of this method.
     */
    @Deprecated
    public void preparePack(ProgressMonitor countingMonitor,
            ObjectWalk walk,
            final Collection<? extends ObjectId> interestingObjects,
            final Collection<? extends ObjectId> uninterestingObjects)
            throws IOException {
        preparePack(countingMonitor, walk,
                ensureSet(interestingObjects),
                ensureSet(uninterestingObjects));
    }

    @SuppressWarnings("unchecked")
    private static Set<ObjectId> ensureSet(Collection<? extends ObjectId> objs) {
        Set<ObjectId> set;
        if (objs instanceof Set<?>)
            set = (Set<ObjectId>) objs;
        else if (objs == null)
            set = Collections.emptySet();
        else
            set = new HashSet<ObjectId>(objs);
        return set;
    }
    /**
     * Prepare the list of objects to be written to the pack stream.
     * <p>
     * Based on these two sets, another set of objects to put in the pack file
     * is created: this set consists of all objects reachable (ancestors) from
     * interesting objects, except uninteresting objects and their ancestors.
     * This method uses the {@link ObjectWalk} class extensively to find the
     * appropriate set of output objects and their optimal order in the output
     * pack. Order is consistent with general git in-pack rules: sort by
     * object type, recency, path and delta-base first.
     * </p>
     *
     * @param countingMonitor
     *            progress during object enumeration.
     * @param want
     *            collection of objects to be marked as interesting (start
     *            points of graph traversal).
     * @param have
     *            collection of objects to be marked as uninteresting (end
     *            points of graph traversal).
     * @throws IOException
     *             when some I/O problem occurs during reading objects.
     */
    public void preparePack(ProgressMonitor countingMonitor,
            Set<? extends ObjectId> want,
            Set<? extends ObjectId> have) throws IOException {
        ObjectWalk ow;
        if (shallowPack)
            ow = new DepthWalk.ObjectWalk(reader, depth);
        else
            ow = new ObjectWalk(reader);
        preparePack(countingMonitor, ow, want, have);
    }
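
    // Illustrative sketch (assumption, not part of the original file): a
    // fetch-style preparation, where "wants" are the objects the client
    // asked for and "haves" are objects it already reported having; a thin
    // pack lets deltas reference the "haves" the client holds.
    //
    //   pw.setThin(true);
    //   pw.preparePack(countingMonitor, wants, haves);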
    /**
     * Prepare the list of objects to be written to the pack stream.
     * <p>
     * Based on these two sets, another set of objects to put in the pack file
     * is created: this set consists of all objects reachable (ancestors) from
     * interesting objects, except uninteresting objects and their ancestors.
     * This method uses the {@link ObjectWalk} class extensively to find the
     * appropriate set of output objects and their optimal order in the output
     * pack. Order is consistent with general git in-pack rules: sort by
     * object type, recency, path and delta-base first.
     * </p>
     *
     * @param countingMonitor
     *            progress during object enumeration.
     * @param walk
     *            ObjectWalk to perform enumeration.
     * @param interestingObjects
     *            collection of objects to be marked as interesting (start
     *            points of graph traversal).
     * @param uninterestingObjects
     *            collection of objects to be marked as uninteresting (end
     *            points of graph traversal).
     * @throws IOException
     *             when some I/O problem occurs during reading objects.
     */
    public void preparePack(ProgressMonitor countingMonitor,
            ObjectWalk walk,
            final Set<? extends ObjectId> interestingObjects,
            final Set<? extends ObjectId> uninterestingObjects)
            throws IOException {
        if (countingMonitor == null)
            countingMonitor = NullProgressMonitor.INSTANCE;
        if (shallowPack && !(walk instanceof DepthWalk.ObjectWalk))
            walk = new DepthWalk.ObjectWalk(reader, depth);
        findObjectsToPack(countingMonitor, walk, interestingObjects,
                uninterestingObjects);
    }
    /**
     * Determine if the pack file will contain the requested object.
     *
     * @param id
     *            the object to test the existence of.
     * @return true if the object will appear in the output pack file.
     * @throws IOException
     *             a cached pack cannot be examined.
     */
    public boolean willInclude(final AnyObjectId id) throws IOException {
        ObjectToPack obj = objectsMap.get(id);
        return obj != null && !obj.isEdge();
    }

    /**
     * Lookup the ObjectToPack object for a given ObjectId.
     *
     * @param id
     *            the object to find in the pack.
     * @return the object we are packing, or null.
     */
    public ObjectToPack get(AnyObjectId id) {
        ObjectToPack obj = objectsMap.get(id);
        return obj != null && !obj.isEdge() ? obj : null;
    }

    /**
     * Computes the SHA-1 of the lexicographically sorted object ids written
     * in this pack, as used to name a pack file in the repository.
     *
     * @return ObjectId representing the SHA-1 name of the pack that was
     *         created.
     */
    public ObjectId computeName() {
        final byte[] buf = new byte[OBJECT_ID_LENGTH];
        final MessageDigest md = Constants.newMessageDigest();
        for (ObjectToPack otp : sortByName()) {
            otp.copyRawTo(buf, 0);
            md.update(buf, 0, OBJECT_ID_LENGTH);
        }
        return ObjectId.fromRaw(md.digest());
    }
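
    // Illustrative sketch (assumption, not part of the original file): the
    // computed id is conventionally used to name the pack and index files
    // on disk, following git's pack-<sha1> naming scheme.
    //
    //   String base = "pack-" + pw.computeName().name();
    //   // yields base + ".pack" and base + ".idx"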
    /**
     * Returns the index format version that will be written.
     * <p>
     * This method can only be invoked after
     * {@link #writePack(ProgressMonitor, ProgressMonitor, OutputStream)} has
     * been invoked and completed successfully.
     *
     * @return the index format version.
     */
    public int getIndexVersion() {
        int indexVersion = config.getIndexVersion();
        if (indexVersion <= 0) {
            for (BlockList<ObjectToPack> objs : objectsLists)
                indexVersion = Math.max(indexVersion,
                        PackIndexWriter.oldestPossibleFormat(objs));
        }
        return indexVersion;
    }

    /**
     * Create an index file to match the pack file just written.
     * <p>
     * This method can only be invoked after
     * {@link #writePack(ProgressMonitor, ProgressMonitor, OutputStream)} has
     * been invoked and completed successfully. Writing a corresponding index
     * is an optional feature that not all pack users may require.
     *
     * @param indexStream
     *            output for the index data. Caller is responsible for closing
     *            this stream.
     * @throws IOException
     *             the index data could not be written to the supplied stream.
     */
    public void writeIndex(final OutputStream indexStream) throws IOException {
        if (isIndexDisabled())
            throw new IOException(JGitText.get().cachedPacksPreventsIndexCreation);

        long writeStart = System.currentTimeMillis();
        final PackIndexWriter iw = PackIndexWriter.createVersion(
                indexStream, getIndexVersion());
        iw.write(sortByName(), packcsum);
        stats.timeWriting += System.currentTimeMillis() - writeStart;
    }

    /**
     * Create a bitmap index file to match the pack file just written.
     * <p>
     * This method can only be invoked after
     * {@link #prepareBitmapIndex(ProgressMonitor)} has been invoked and
     * completed successfully. Writing a corresponding bitmap index is an
     * optional feature that not all pack users may require.
     *
     * @param bitmapIndexStream
     *            output for the bitmap index data. Caller is responsible for
     *            closing this stream.
     * @throws IOException
     *             the index data could not be written to the supplied stream.
     */
    public void writeBitmapIndex(final OutputStream bitmapIndexStream)
            throws IOException {
        if (writeBitmaps == null)
            throw new IOException(JGitText.get().bitmapsMustBePrepared);

        long writeStart = System.currentTimeMillis();
        final PackBitmapIndexWriterV1 iw = new PackBitmapIndexWriterV1(bitmapIndexStream);
        iw.write(writeBitmaps, packcsum);
        stats.timeWriting += System.currentTimeMillis() - writeStart;
    }
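
    // Illustrative sketch (assumption, not part of the original file): the
    // pack stream is written first, then the matching index to its own
    // caller-managed stream; the bitmap index additionally requires
    // prepareBitmapIndex(ProgressMonitor) to have completed, per the
    // javadoc above.
    //
    //   pw.writePack(compressMonitor, writeMonitor, packOut);
    //   if (!pw.isIndexDisabled())
    //       pw.writeIndex(indexOut);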
    private List<ObjectToPack> sortByName() {
        if (sortedByName == null) {
            int cnt = 0;
            cnt += objectsLists[OBJ_COMMIT].size();
            cnt += objectsLists[OBJ_TREE].size();
            cnt += objectsLists[OBJ_BLOB].size();
            cnt += objectsLists[OBJ_TAG].size();

            sortedByName = new BlockList<ObjectToPack>(cnt);
            sortedByName.addAll(objectsLists[OBJ_COMMIT]);
            sortedByName.addAll(objectsLists[OBJ_TREE]);
            sortedByName.addAll(objectsLists[OBJ_BLOB]);
            sortedByName.addAll(objectsLists[OBJ_TAG]);
            Collections.sort(sortedByName);
        }
        return sortedByName;
    }

    private void beginPhase(PackingPhase phase, ProgressMonitor monitor,
            long cnt) {
        state.phase = phase;
        String task;
        switch (phase) {
        case COUNTING:
            task = JGitText.get().countingObjects;
            break;
        case GETTING_SIZES:
            task = JGitText.get().searchForSizes;
            break;
        case FINDING_SOURCES:
            task = JGitText.get().searchForReuse;
            break;
        case COMPRESSING:
            task = JGitText.get().compressingObjects;
            break;
        case WRITING:
            task = JGitText.get().writingObjects;
            break;
        case BUILDING_BITMAPS:
            task = JGitText.get().buildingBitmaps;
            break;
        default:
            throw new IllegalArgumentException(
                    MessageFormat.format(JGitText.get().illegalPackingPhase, phase));
        }
        monitor.beginTask(task, (int) cnt);
    }

    private void endPhase(ProgressMonitor monitor) {
        monitor.endTask();
    }
    /**
     * Write the prepared pack to the supplied stream.
     * <p>
     * First, this method collects and sorts the objects to pack; then a delta
     * search is performed if configured; finally, the pack stream is written.
     * </p>
     * <p>
     * For all reused objects, the data checksum (Adler32/CRC32) is computed
     * and validated against the existing checksum.
     * </p>
     *
     * @param compressMonitor
     *            progress monitor to report object compression work.
     * @param writeMonitor
     *            progress monitor to report the number of objects written.
     * @param packStream
     *            output stream of pack data. The stream should be buffered by
     *            the caller. The caller is responsible for closing the
     *            stream.
     * @throws IOException
     *             an error occurred reading a local object's data to include
     *             in the pack, or writing compressed object data to the
     *             output stream.
     */
    public void writePack(ProgressMonitor compressMonitor,
            ProgressMonitor writeMonitor, OutputStream packStream)
            throws IOException {
        if (compressMonitor == null)
            compressMonitor = NullProgressMonitor.INSTANCE;
        if (writeMonitor == null)
            writeMonitor = NullProgressMonitor.INSTANCE;

        excludeInPacks = null;
        excludeInPackLast = null;

        boolean needSearchForReuse = reuseSupport != null && (
                reuseDeltas
                || config.isReuseObjects()
                || !cachedPacks.isEmpty());

        if (compressMonitor instanceof BatchingProgressMonitor) {
            long delay = 1000;
            if (needSearchForReuse && config.isDeltaCompress())
                delay = 500;
            ((BatchingProgressMonitor) compressMonitor).setDelayStart(
                    delay,
                    TimeUnit.MILLISECONDS);
        }

        if (needSearchForReuse)
            searchForReuse(compressMonitor);
        if (config.isDeltaCompress())
            searchForDeltas(compressMonitor);

        crc32 = new CRC32();
        final PackOutputStream out = new PackOutputStream(
                writeMonitor,
                isIndexDisabled()
                        ? packStream
                        : new CheckedOutputStream(packStream, crc32),
                this);

        long objCnt = getObjectCount();
        stats.totalObjects = objCnt;
        beginPhase(PackingPhase.WRITING, writeMonitor, objCnt);
        long writeStart = System.currentTimeMillis();

        out.writeFileHeader(PACK_VERSION_GENERATED, objCnt);
        out.flush();

        writeObjects(out);
        if (!edgeObjects.isEmpty() || !cachedPacks.isEmpty()) {
            for (Statistics.ObjectType typeStat : stats.objectTypes) {
                if (typeStat == null)
                    continue;
                stats.thinPackBytes += typeStat.bytes;
            }
        }

        for (CachedPack pack : cachedPacks) {
            long deltaCnt = pack.getDeltaCount();
            stats.reusedObjects += pack.getObjectCount();
            stats.reusedDeltas += deltaCnt;
            stats.totalDeltas += deltaCnt;
            reuseSupport.copyPackAsIs(out, pack, reuseValidate);
        }
        writeChecksum(out);
        out.flush();
        stats.timeWriting = System.currentTimeMillis() - writeStart;
        stats.totalBytes = out.length();
        stats.reusedPacks = Collections.unmodifiableList(cachedPacks);
        stats.depth = depth;

        for (Statistics.ObjectType typeStat : stats.objectTypes) {
            if (typeStat == null)
                continue;
            typeStat.cntDeltas += typeStat.reusedDeltas;
            stats.reusedObjects += typeStat.reusedObjects;
            stats.reusedDeltas += typeStat.reusedDeltas;
            stats.totalDeltas += typeStat.cntDeltas;
        }

        reader.release();
        endPhase(writeMonitor);
    }
    /**
     * @return description of what this PackWriter did in order to create the
     *         final pack stream. The object is only available to callers
     *         after
     *         {@link #writePack(ProgressMonitor, ProgressMonitor, OutputStream)}
     *         has completed.
     */
    public Statistics getStatistics() {
        return stats;
    }
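
    // Illustrative sketch (assumption, not part of the original file):
    // statistics are only meaningful once writePack has completed. The
    // getTotalObjects() accessor is assumed; totalObjects is the field
    // updated by writePack above.
    //
    //   pw.writePack(compressMonitor, writeMonitor, packOut);
    //   Statistics s = pw.getStatistics();
    //   long total = s.getTotalObjects();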
    /** @return snapshot of the current state of this PackWriter. */
    public State getState() {
        return state.snapshot();
    }

    /** Release all resources used by this writer. */
    public void release() {
        reader.release();
        if (myDeflater != null) {
            myDeflater.end();
            myDeflater = null;
        }
        instances.remove(selfRef);
    }
    private void searchForReuse(ProgressMonitor monitor) throws IOException {
        long cnt = 0;
        cnt += objectsLists[OBJ_COMMIT].size();
        cnt += objectsLists[OBJ_TREE].size();
        cnt += objectsLists[OBJ_BLOB].size();
        cnt += objectsLists[OBJ_TAG].size();

        long start = System.currentTimeMillis();
        beginPhase(PackingPhase.FINDING_SOURCES, monitor, cnt);
        if (cnt <= 4096) {
            // For small object counts, do everything as one list.
            BlockList<ObjectToPack> tmp = new BlockList<ObjectToPack>((int) cnt);
            tmp.addAll(objectsLists[OBJ_TAG]);
            tmp.addAll(objectsLists[OBJ_COMMIT]);
            tmp.addAll(objectsLists[OBJ_TREE]);
            tmp.addAll(objectsLists[OBJ_BLOB]);
            searchForReuse(monitor, tmp);
            if (pruneCurrentObjectList) {
                // If the list was pruned, we need to re-prune the main lists.
                pruneEdgesFromObjectList(objectsLists[OBJ_COMMIT]);
                pruneEdgesFromObjectList(objectsLists[OBJ_TREE]);
                pruneEdgesFromObjectList(objectsLists[OBJ_BLOB]);
                pruneEdgesFromObjectList(objectsLists[OBJ_TAG]);
            }
        } else {
            searchForReuse(monitor, objectsLists[OBJ_TAG]);
            searchForReuse(monitor, objectsLists[OBJ_COMMIT]);
            searchForReuse(monitor, objectsLists[OBJ_TREE]);
            searchForReuse(monitor, objectsLists[OBJ_BLOB]);
        }
        endPhase(monitor);
        stats.timeSearchingForReuse = System.currentTimeMillis() - start;

        if (config.isReuseDeltas() && config.getCutDeltaChains()) {
            cutDeltaChains(objectsLists[OBJ_TREE]);
            cutDeltaChains(objectsLists[OBJ_BLOB]);
        }
    }

    private void searchForReuse(ProgressMonitor monitor, List<ObjectToPack> list)
            throws IOException, MissingObjectException {
        pruneCurrentObjectList = false;
        reuseSupport.selectObjectRepresentation(this, monitor, list);
        if (pruneCurrentObjectList)
            pruneEdgesFromObjectList(list);
    }

    private void cutDeltaChains(BlockList<ObjectToPack> list)
            throws IOException {
        int max = config.getMaxDeltaDepth();
        for (int idx = list.size() - 1; idx >= 0; idx--) {
            int d = 0;
            ObjectToPack b = list.get(idx).getDeltaBase();
            while (b != null) {
                if (d < b.getChainLength())
                    break;
                b.setChainLength(++d);
                if (d >= max && b.isDeltaRepresentation()) {
                    reselectNonDelta(b);
                    break;
                }
                b = b.getDeltaBase();
            }
        }
        if (config.isDeltaCompress()) {
            for (ObjectToPack otp : list)
                otp.clearChainLength();
        }
    }
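
    // Worked example (annotation added for clarity, not in the original
    // file): with max depth 2 and a reused chain D -> C -> B -> A (each
    // object deltas against the next), walking from D assigns C chain
    // length 1 and B chain length 2; B has reached the limit while still
    // being a delta representation, so it is reselected as a whole object,
    // guaranteeing no chain in the output exceeds the configured depth.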
    private void searchForDeltas(ProgressMonitor monitor)
            throws MissingObjectException, IncorrectObjectTypeException,
            IOException {
        // Commits and annotated tags tend to have too many differences to
        // really benefit from delta compression. Consequently just don't
        // bother examining those types here.
        //
        ObjectToPack[] list = new ObjectToPack[
                  objectsLists[OBJ_TREE].size()
                + objectsLists[OBJ_BLOB].size()
                + edgeObjects.size()];
        int cnt = 0;
        cnt = findObjectsNeedingDelta(list, cnt, OBJ_TREE);
        cnt = findObjectsNeedingDelta(list, cnt, OBJ_BLOB);
        if (cnt == 0)
            return;
        int nonEdgeCnt = cnt;

        // Queue up any edge objects that we might delta against. We won't
        // be sending these as we assume the other side has them, but we need
        // them in the search phase below.
        //
        for (ObjectToPack eo : edgeObjects) {
            eo.setWeight(0);
            list[cnt++] = eo;
        }

        // Compute the sizes of the objects so we can do a proper sort.
        // We let the reader skip missing objects if it chooses. For
        // some readers this can be a huge win. We detect missing objects
        // by having set the weights above to 0 and allowing the delta
        // search code to discover the missing object and skip over it, or
        // abort with an exception if we actually had to have it.
        //
        final long sizingStart = System.currentTimeMillis();
        beginPhase(PackingPhase.GETTING_SIZES, monitor, cnt);
        AsyncObjectSizeQueue<ObjectToPack> sizeQueue = reader.getObjectSize(
                Arrays.<ObjectToPack> asList(list).subList(0, cnt), false);
        try {
            final long limit = Math.min(
                    config.getBigFileThreshold(),
                    Integer.MAX_VALUE);
            for (;;) {
                try {
                    if (!sizeQueue.next())
                        break;
                } catch (MissingObjectException notFound) {
                    monitor.update(1);
                    if (ignoreMissingUninteresting) {
                        ObjectToPack otp = sizeQueue.getCurrent();
                        if (otp != null && otp.isEdge()) {
                            otp.setDoNotDelta();
                            continue;
                        }

                        otp = objectsMap.get(notFound.getObjectId());
                        if (otp != null && otp.isEdge()) {
                            otp.setDoNotDelta();
                            continue;
                        }
                    }
                    throw notFound;
                }

                ObjectToPack otp = sizeQueue.getCurrent();
                if (otp == null)
                    otp = objectsMap.get(sizeQueue.getObjectId());

                long sz = sizeQueue.getSize();
                if (DeltaIndex.BLKSZ < sz && sz < limit)
                    otp.setWeight((int) sz);
                else
                    otp.setDoNotDelta(); // too small, or too big
                monitor.update(1);
            }
        } finally {
            sizeQueue.release();
        }
        endPhase(monitor);
        stats.timeSearchingForSizes = System.currentTimeMillis() - sizingStart;

        // Sort the objects by path hash so like files are near each other,
        // and then by size descending so that bigger files are first. This
        // applies "Linus' Law" which states that newer files tend to be the
        // bigger ones, because source files grow and hardly ever shrink.
        //
        Arrays.sort(list, 0, cnt, new Comparator<ObjectToPack>() {
            public int compare(ObjectToPack a, ObjectToPack b) {
                int cmp = (a.isDoNotDelta() ? 1 : 0)
                        - (b.isDoNotDelta() ? 1 : 0);
                if (cmp != 0)
                    return cmp;

                cmp = a.getType() - b.getType();
                if (cmp != 0)
                    return cmp;

                cmp = (a.getPathHash() >>> 1) - (b.getPathHash() >>> 1);
                if (cmp != 0)
                    return cmp;

                cmp = (a.getPathHash() & 1) - (b.getPathHash() & 1);
                if (cmp != 0)
                    return cmp;

                cmp = (a.isEdge() ? 0 : 1) - (b.isEdge() ? 0 : 1);
                if (cmp != 0)
                    return cmp;

                return b.getWeight() - a.getWeight();
            }
        });

        // Above we stored the objects we cannot delta onto the end.
        // Remove them from the list so we don't waste time on them.
        while (0 < cnt && list[cnt - 1].isDoNotDelta()) {
            if (!list[cnt - 1].isEdge())
                nonEdgeCnt--;
            cnt--;
        }
        if (cnt == 0)
            return;

        final long searchStart = System.currentTimeMillis();
        searchForDeltas(monitor, list, cnt);
        stats.deltaSearchNonEdgeObjects = nonEdgeCnt;
        stats.timeCompressing = System.currentTimeMillis() - searchStart;

        for (int i = 0; i < cnt; i++)
            if (!list[i].isEdge() && list[i].isDeltaRepresentation())
                stats.deltasFound++;
    }
    private int findObjectsNeedingDelta(ObjectToPack[] list, int cnt, int type) {
        for (ObjectToPack otp : objectsLists[type]) {
            if (otp.isDoNotDelta()) // delta is disabled for this path
                continue;
            if (otp.isDeltaRepresentation()) // already reusing a delta
                continue;
            otp.setWeight(0);
            list[cnt++] = otp;
        }
        return cnt;
    }

    private void reselectNonDelta(ObjectToPack otp) throws IOException {
        otp.clearDeltaBase();
        otp.clearReuseAsIs();
        boolean old = reuseDeltas;
        reuseDeltas = false;
        reuseSupport.selectObjectRepresentation(this,
                NullProgressMonitor.INSTANCE,
                Collections.singleton(otp));
        reuseDeltas = old;
    }

    private void searchForDeltas(final ProgressMonitor monitor,
            final ObjectToPack[] list, final int cnt)
            throws MissingObjectException, IncorrectObjectTypeException,
            LargeObjectException, IOException {
        int threads = config.getThreads();
        if (threads == 0)
            threads = Runtime.getRuntime().availableProcessors();
        if (threads <= 1 || cnt <= config.getDeltaSearchWindowSize())
            singleThreadDeltaSearch(monitor, list, cnt);
        else
            parallelDeltaSearch(monitor, list, cnt, threads);
    }

    private void singleThreadDeltaSearch(ProgressMonitor monitor,
            ObjectToPack[] list, int cnt) throws IOException {
        long totalWeight = 0;
        for (int i = 0; i < cnt; i++) {
            ObjectToPack o = list[i];
            if (!o.isEdge() && !o.doNotAttemptDelta())
                totalWeight += o.getWeight();
        }

        long bytesPerUnit = 1;
        while (DeltaTask.MAX_METER <= (totalWeight / bytesPerUnit))
            bytesPerUnit <<= 10;
        int cost = (int) (totalWeight / bytesPerUnit);
        if (totalWeight % bytesPerUnit != 0)
            cost++;

        beginPhase(PackingPhase.COMPRESSING, monitor, cost);
        new DeltaWindow(config, new DeltaCache(config), reader,
                monitor, bytesPerUnit,
                list, 0, cnt).search();
        endPhase(monitor);
    }
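
    // Worked example (annotation added for clarity, not in the original
    // file): if totalWeight were 5 GiB and DeltaTask.MAX_METER were 2^20,
    // bytesPerUnit grows by factors of 1024 (1 -> 1 KiB -> 1 MiB) until
    // 5 GiB / bytesPerUnit drops below the meter, so the monitor is fed
    // roughly 5120 units of 1 MiB each instead of billions of single bytes.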
	private void parallelDeltaSearch(ProgressMonitor monitor,
			ObjectToPack[] list, int cnt, int threads) throws IOException {
		DeltaCache dc = new ThreadSafeDeltaCache(config);
		ThreadSafeProgressMonitor pm = new ThreadSafeProgressMonitor(monitor);
		DeltaTask.Block taskBlock = new DeltaTask.Block(threads, config,
				reader, dc, pm,
				list, 0, cnt);
		taskBlock.partitionTasks();
		beginPhase(PackingPhase.COMPRESSING, monitor, taskBlock.cost());
		pm.startWorkers(taskBlock.tasks.size());

		Executor executor = config.getExecutor();
		final List<Throwable> errors =
				Collections.synchronizedList(new ArrayList<Throwable>(threads));
		if (executor instanceof ExecutorService) {
			// Caller supplied us a service, use it directly.
			runTasks((ExecutorService) executor, pm, taskBlock, errors);
		} else if (executor == null) {
			// Caller didn't give us a way to run the tasks, spawn up a
			// temporary thread pool and make sure it tears down cleanly.
			ExecutorService pool = Executors.newFixedThreadPool(threads);
			try {
				runTasks(pool, pm, taskBlock, errors);
			} finally {
				pool.shutdown();
				for (;;) {
					try {
						if (pool.awaitTermination(60, TimeUnit.SECONDS))
							break;
					} catch (InterruptedException e) {
						throw new IOException(
								JGitText.get().packingCancelledDuringObjectsWriting);
					}
				}
			}
		} else {
			// The caller gave us an executor, but it might not do
			// asynchronous execution. Wrap everything and hope it
			// can schedule these for us.
			for (final DeltaTask task : taskBlock.tasks) {
				executor.execute(new Runnable() {
					public void run() {
						try {
							task.call();
						} catch (Throwable failure) {
							errors.add(failure);
						}
					}
				});
			}
			try {
				pm.waitForCompletion();
			} catch (InterruptedException ie) {
				// We can't abort the other tasks as we have no handle.
				// Cross our fingers and just break out anyway.
				//
				throw new IOException(
						JGitText.get().packingCancelledDuringObjectsWriting);
			}
		}

		// If any task threw an error, try to report it back as
		// though we weren't using a threaded search algorithm.
		//
		if (!errors.isEmpty()) {
			Throwable err = errors.get(0);
			if (err instanceof Error)
				throw (Error) err;
			if (err instanceof RuntimeException)
				throw (RuntimeException) err;
			if (err instanceof IOException)
				throw (IOException) err;

			IOException fail = new IOException(err.getMessage());
			fail.initCause(err);
			throw fail;
		}
		endPhase(monitor);
	}
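	// Illustrative usage sketch (assumes only the PackConfig setters
	// named here): a caller who wants delta search to share a JVM-wide
	// pool can supply an ExecutorService, which the first branch above
	// uses directly instead of spawning a temporary fixed pool.
	//
	//   ExecutorService pool = Executors.newFixedThreadPool(4);
	//   PackConfig cfg = new PackConfig(repo);
	//   cfg.setExecutor(pool);
	//   cfg.setThreads(4);
	//   // ... construct the PackWriter with cfg and pack as usual ...
	//   pool.shutdown(); // caller keeps ownership of the pool lifecycle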
	private static void runTasks(ExecutorService pool,
			ThreadSafeProgressMonitor pm,
			DeltaTask.Block tb, List<Throwable> errors) throws IOException {
		List<Future<?>> futures = new ArrayList<Future<?>>(tb.tasks.size());
		for (DeltaTask task : tb.tasks)
			futures.add(pool.submit(task));

		try {
			pm.waitForCompletion();
			for (Future<?> f : futures) {
				try {
					f.get();
				} catch (ExecutionException failed) {
					errors.add(failed.getCause());
				}
			}
		} catch (InterruptedException ie) {
			for (Future<?> f : futures)
				f.cancel(true);
			throw new IOException(
					JGitText.get().packingCancelledDuringObjectsWriting);
		}
	}
	private void writeObjects(PackOutputStream out) throws IOException {
		writeObjects(out, objectsLists[OBJ_COMMIT]);
		writeObjects(out, objectsLists[OBJ_TAG]);
		writeObjects(out, objectsLists[OBJ_TREE]);
		writeObjects(out, objectsLists[OBJ_BLOB]);
	}

	private void writeObjects(PackOutputStream out, List<ObjectToPack> list)
			throws IOException {
		if (list.isEmpty())
			return;

		typeStats = stats.objectTypes[list.get(0).getType()];
		long beginOffset = out.length();

		if (reuseSupport != null) {
			reuseSupport.writeObjects(out, list);
		} else {
			for (ObjectToPack otp : list)
				out.writeObject(otp);
		}

		typeStats.bytes += out.length() - beginOffset;
		typeStats.cntObjects = list.size();
	}
	void writeObject(PackOutputStream out, ObjectToPack otp) throws IOException {
		if (!otp.isWritten())
			writeObjectImpl(out, otp);
	}

	private void writeObjectImpl(PackOutputStream out, ObjectToPack otp)
			throws IOException {
		if (otp.wantWrite()) {
			// A cycle exists in this delta chain. This should only occur if a
			// selected object representation disappeared during writing
			// (for example due to a concurrent repack) and a different base
			// was chosen, forcing a cycle. Select something other than a
			// delta, and write this object.
			reselectNonDelta(otp);
		}
		otp.markWantWrite();

		while (otp.isReuseAsIs()) {
			writeBase(out, otp.getDeltaBase());
			if (otp.isWritten())
				return; // Delta chain cycle caused this to write already.

			crc32.reset();
			otp.setOffset(out.length());
			try {
				reuseSupport.copyObjectAsIs(out, otp, reuseValidate);
				out.endObject();
				otp.setCRC((int) crc32.getValue());
				typeStats.reusedObjects++;
				if (otp.isDeltaRepresentation()) {
					typeStats.reusedDeltas++;
					typeStats.deltaBytes += out.length() - otp.getOffset();
				}
				return;
			} catch (StoredObjectRepresentationNotAvailableException gone) {
				if (otp.getOffset() == out.length()) {
					otp.setOffset(0);
					otp.clearDeltaBase();
					otp.clearReuseAsIs();
					reuseSupport.selectObjectRepresentation(this,
							NullProgressMonitor.INSTANCE,
							Collections.singleton(otp));
					continue;
				} else {
					// Object writing already started, we cannot recover.
					//
					CorruptObjectException coe;
					coe = new CorruptObjectException(otp, ""); //$NON-NLS-1$
					coe.initCause(gone);
					throw coe;
				}
			}
		}

		// If we reached here, reuse wasn't possible.
		//
		if (otp.isDeltaRepresentation())
			writeDeltaObjectDeflate(out, otp);
		else
			writeWholeObjectDeflate(out, otp);
		out.endObject();
		otp.setCRC((int) crc32.getValue());
	}
	private void writeBase(PackOutputStream out, ObjectToPack base)
			throws IOException {
		if (base != null && !base.isWritten() && !base.isEdge())
			writeObjectImpl(out, base);
	}

	private void writeWholeObjectDeflate(PackOutputStream out,
			final ObjectToPack otp) throws IOException {
		final Deflater deflater = deflater();
		final ObjectLoader ldr = reader.open(otp, otp.getType());

		crc32.reset();
		otp.setOffset(out.length());
		out.writeHeader(otp, ldr.getSize());

		deflater.reset();
		DeflaterOutputStream dst = new DeflaterOutputStream(out, deflater);
		ldr.copyTo(dst);
		dst.finish();
	}
	private void writeDeltaObjectDeflate(PackOutputStream out,
			final ObjectToPack otp) throws IOException {
		writeBase(out, otp.getDeltaBase());

		crc32.reset();
		otp.setOffset(out.length());

		DeltaCache.Ref ref = otp.popCachedDelta();
		if (ref != null) {
			byte[] zbuf = ref.get();
			if (zbuf != null) {
				out.writeHeader(otp, otp.getCachedSize());
				out.write(zbuf);
				return;
			}
		}

		TemporaryBuffer.Heap delta = delta(otp);
		out.writeHeader(otp, delta.length());

		Deflater deflater = deflater();
		deflater.reset();
		DeflaterOutputStream dst = new DeflaterOutputStream(out, deflater);
		delta.writeTo(dst, null);
		dst.finish();
		typeStats.cntDeltas++;
		typeStats.deltaBytes += out.length() - otp.getOffset();
	}
	private TemporaryBuffer.Heap delta(final ObjectToPack otp)
			throws IOException {
		DeltaIndex index = new DeltaIndex(buffer(otp.getDeltaBaseId()));
		byte[] res = buffer(otp);

		// We never would have proposed this pair if the delta were
		// larger than the unpacked version of the object. So using it
		// as our buffer limit is valid: we will never reach it.
		//
		TemporaryBuffer.Heap delta = new TemporaryBuffer.Heap(res.length);
		index.encode(delta, res);
		return delta;
	}
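	// Illustrative note on the size cap above: if the raw object is
	// 10 KiB, the TemporaryBuffer is capped at 10 KiB. The delta search
	// only selected this base because the delta instruction stream came
	// out smaller than the whole object, so encode() can never hit the
	// cap; a delta that large would have been rejected during search.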
	private byte[] buffer(AnyObjectId objId) throws IOException {
		return buffer(config, reader, objId);
	}

	static byte[] buffer(PackConfig config, ObjectReader or, AnyObjectId objId)
			throws IOException {
		// PackWriter should have already pruned objects that
		// are above the big file threshold, so our chances of
		// the object being below it are very good. We really
		// shouldn't be here, unless the implementation is odd.
		return or.open(objId).getCachedBytes(config.getBigFileThreshold());
	}

	private Deflater deflater() {
		if (myDeflater == null)
			myDeflater = new Deflater(config.getCompressionLevel());
		return myDeflater;
	}

	private void writeChecksum(PackOutputStream out) throws IOException {
		packcsum = out.getDigest();
		out.write(packcsum);
	}
	private void findObjectsToPack(final ProgressMonitor countingMonitor,
			final ObjectWalk walker, final Set<? extends ObjectId> want,
			Set<? extends ObjectId> have)
			throws MissingObjectException, IOException,
			IncorrectObjectTypeException {
		final long countingStart = System.currentTimeMillis();
		beginPhase(PackingPhase.COUNTING, countingMonitor, ProgressMonitor.UNKNOWN);

		if (have == null)
			have = Collections.emptySet();

		stats.interestingObjects = Collections.unmodifiableSet(new HashSet<ObjectId>(want));
		stats.uninterestingObjects = Collections.unmodifiableSet(new HashSet<ObjectId>(have));

		walker.setRetainBody(false);

		canBuildBitmaps = config.isBuildBitmaps()
				&& !shallowPack
				&& have.isEmpty()
				&& (excludeInPacks == null || excludeInPacks.length == 0);
		if (!shallowPack && useBitmaps) {
			BitmapIndex bitmapIndex = reader.getBitmapIndex();
			if (bitmapIndex != null) {
				PackWriterBitmapWalker bitmapWalker = new PackWriterBitmapWalker(
						walker, bitmapIndex, countingMonitor);
				findObjectsToPackUsingBitmaps(bitmapWalker, want, have);
				endPhase(countingMonitor);
				stats.timeCounting = System.currentTimeMillis() - countingStart;
				return;
			}
		}

		List<ObjectId> all = new ArrayList<ObjectId>(want.size() + have.size());
		all.addAll(want);
		all.addAll(have);

		final RevFlag include = walker.newFlag("include"); //$NON-NLS-1$
		final RevFlag added = walker.newFlag("added"); //$NON-NLS-1$

		walker.carry(include);

		int haveEst = have.size();
		if (have.isEmpty()) {
			walker.sort(RevSort.COMMIT_TIME_DESC);
		} else {
			walker.sort(RevSort.TOPO);
			if (thin)
				walker.sort(RevSort.BOUNDARY, true);
		}

		List<RevObject> wantObjs = new ArrayList<RevObject>(want.size());
		List<RevObject> haveObjs = new ArrayList<RevObject>(haveEst);
		List<RevTag> wantTags = new ArrayList<RevTag>(want.size());

		AsyncRevObjectQueue q = walker.parseAny(all, true);
		try {
			for (;;) {
				try {
					RevObject o = q.next();
					if (o == null)
						break;
					if (have.contains(o))
						haveObjs.add(o);
					if (want.contains(o)) {
						o.add(include);
						wantObjs.add(o);
						if (o instanceof RevTag)
							wantTags.add((RevTag) o);
					}
				} catch (MissingObjectException e) {
					if (ignoreMissingUninteresting
							&& have.contains(e.getObjectId()))
						continue;
					throw e;
				}
			}
		} finally {
			q.release();
		}

		if (!wantTags.isEmpty()) {
			all = new ArrayList<ObjectId>(wantTags.size());
			for (RevTag tag : wantTags)
				all.add(tag.getObject());
			q = walker.parseAny(all, true);
			try {
				while (q.next() != null) {
					// Just need to pop the queue item to parse the object.
				}
			} finally {
				q.release();
			}
		}

		if (walker instanceof DepthWalk.ObjectWalk) {
			DepthWalk.ObjectWalk depthWalk = (DepthWalk.ObjectWalk) walker;
			for (RevObject obj : wantObjs)
				depthWalk.markRoot(obj);
			if (unshallowObjects != null) {
				for (ObjectId id : unshallowObjects)
					depthWalk.markUnshallow(walker.parseAny(id));
			}
		} else {
			for (RevObject obj : wantObjs)
				walker.markStart(obj);
		}
		for (RevObject obj : haveObjs)
			walker.markUninteresting(obj);

		final int maxBases = config.getDeltaSearchWindowSize();
		Set<RevTree> baseTrees = new HashSet<RevTree>();
		BlockList<RevCommit> commits = new BlockList<RevCommit>();
		RevCommit c;
		while ((c = walker.next()) != null) {
			if (exclude(c))
				continue;
			if (c.has(RevFlag.UNINTERESTING)) {
				if (baseTrees.size() <= maxBases)
					baseTrees.add(c.getTree());
				continue;
			}

			commits.add(c);
			countingMonitor.update(1);
		}

		if (shallowPack) {
			for (RevCommit cmit : commits) {
				addObject(cmit, 0);
			}
		} else {
			int commitCnt = 0;
			boolean putTagTargets = false;
			for (RevCommit cmit : commits) {
				if (!cmit.has(added)) {
					cmit.add(added);
					addObject(cmit, 0);
					commitCnt++;
				}
				for (int i = 0; i < cmit.getParentCount(); i++) {
					RevCommit p = cmit.getParent(i);
					if (!p.has(added) && !p.has(RevFlag.UNINTERESTING)
							&& !exclude(p)) {
						p.add(added);
						addObject(p, 0);
						commitCnt++;
					}
				}
				if (!putTagTargets && 4096 < commitCnt) {
					for (ObjectId id : tagTargets) {
						RevObject obj = walker.lookupOrNull(id);
						if (obj instanceof RevCommit
								&& obj.has(include)
								&& !obj.has(RevFlag.UNINTERESTING)
								&& !obj.has(added)) {
							obj.add(added);
							addObject(obj, 0);
						}
					}
					putTagTargets = true;
				}
			}
		}
		commits = null;

		if (thin && !baseTrees.isEmpty()) {
			BaseSearch bases = new BaseSearch(countingMonitor, baseTrees, //
					objectsMap, edgeObjects, reader);
			RevObject o;
			while ((o = walker.nextObject()) != null) {
				if (o.has(RevFlag.UNINTERESTING))
					continue;
				if (exclude(o))
					continue;

				int pathHash = walker.getPathHashCode();
				byte[] pathBuf = walker.getPathBuffer();
				int pathLen = walker.getPathLength();
				bases.addBase(o.getType(), pathBuf, pathLen, pathHash);
				addObject(o, pathHash);
				countingMonitor.update(1);
			}
		} else {
			RevObject o;
			while ((o = walker.nextObject()) != null) {
				if (o.has(RevFlag.UNINTERESTING))
					continue;
				if (exclude(o))
					continue;
				addObject(o, walker.getPathHashCode());
				countingMonitor.update(1);
			}
		}

		for (CachedPack pack : cachedPacks)
			countingMonitor.update((int) pack.getObjectCount());
		endPhase(countingMonitor);
		stats.timeCounting = System.currentTimeMillis() - countingStart;
	}
	private void findObjectsToPackUsingBitmaps(
			PackWriterBitmapWalker bitmapWalker, Set<? extends ObjectId> want,
			Set<? extends ObjectId> have)
			throws MissingObjectException, IncorrectObjectTypeException,
			IOException {
		BitmapBuilder haveBitmap = bitmapWalker.findObjects(have, null);
		bitmapWalker.reset();
		BitmapBuilder wantBitmap = bitmapWalker.findObjects(want, haveBitmap);
		BitmapBuilder needBitmap = wantBitmap.andNot(haveBitmap);

		if (useCachedPacks && reuseSupport != null
				&& (excludeInPacks == null || excludeInPacks.length == 0))
			cachedPacks.addAll(
					reuseSupport.getCachedPacksAndUpdate(needBitmap));

		for (BitmapObject obj : needBitmap) {
			ObjectId objectId = obj.getObjectId();
			if (exclude(objectId)) {
				needBitmap.remove(objectId);
				continue;
			}
			addObject(objectId, obj.getType(), 0);
		}

		if (thin)
			haveObjects = haveBitmap;
	}
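	// Worked example of the bitmap set algebra above (illustrative):
	// with want = {A, B, C, D} and have = {A, B},
	//   need = want AND NOT have = {C, D}
	// and only C and D are added to the pack. For a thin pack the have
	// bitmap is retained so later deltas may reference those objects.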
	private static void pruneEdgesFromObjectList(List<ObjectToPack> list) {
		final int size = list.size();
		int src = 0;
		int dst = 0;

		for (; src < size; src++) {
			ObjectToPack obj = list.get(src);
			if (obj.isEdge())
				continue;
			if (dst != src)
				list.set(dst, obj);
			dst++;
		}

		while (dst < list.size())
			list.remove(list.size() - 1);
	}
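	// Illustrative trace of the in-place compaction above: given
	// [o1, edge, o2, edge, o3], the src cursor visits every slot while
	// dst advances only past non-edges, copying survivors forward to
	// [o1, o2, o3, ...]; the two stale tail slots are then removed,
	// leaving [o1, o2, o3] without allocating a second list.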
	/**
	 * Include one object in the output file.
	 * <p>
	 * Objects are written in the order they are added. If the same object is
	 * added twice, it may be written twice, creating a larger than necessary
	 * file.
	 *
	 * @param object
	 *            the object to add.
	 * @throws IncorrectObjectTypeException
	 *             the object is an unsupported type.
	 */
	public void addObject(final RevObject object)
			throws IncorrectObjectTypeException {
		if (!exclude(object))
			addObject(object, 0);
	}
	private void addObject(final RevObject object, final int pathHashCode) {
		addObject(object, object.getType(), pathHashCode);
	}

	private void addObject(
			final AnyObjectId src, final int type, final int pathHashCode) {
		final ObjectToPack otp;
		if (reuseSupport != null)
			otp = reuseSupport.newObjectToPack(src, type);
		else
			otp = new ObjectToPack(src, type);
		otp.setPathHash(pathHashCode);
		objectsLists[type].add(otp);
		objectsMap.add(otp);
	}
	private boolean exclude(AnyObjectId objectId) {
		if (excludeInPacks == null)
			return false;
		if (excludeInPackLast.contains(objectId))
			return true;
		for (ObjectIdSet idx : excludeInPacks) {
			if (idx.contains(objectId)) {
				excludeInPackLast = idx;
				return true;
			}
		}
		return false;
	}
	/**
	 * Select an object representation for this writer.
	 * <p>
	 * An {@link ObjectReader} implementation should invoke this method once for
	 * each representation available for an object, to allow the writer to find
	 * the most suitable one for the output.
	 *
	 * @param otp
	 *            the object being packed.
	 * @param next
	 *            the next available representation from the repository.
	 */
	public void select(ObjectToPack otp, StoredObjectRepresentation next) {
		int nFmt = next.getFormat();

		if (!cachedPacks.isEmpty()) {
			if (otp.isEdge())
				return;
			if ((nFmt == PACK_WHOLE) | (nFmt == PACK_DELTA)) {
				for (CachedPack pack : cachedPacks) {
					if (pack.hasObject(otp, next)) {
						otp.setEdge();
						otp.clearDeltaBase();
						otp.clearReuseAsIs();
						pruneCurrentObjectList = true;
						return;
					}
				}
			}
		}

		if (nFmt == PACK_DELTA && reuseDeltas && reuseDeltaFor(otp)) {
			ObjectId baseId = next.getDeltaBase();
			ObjectToPack ptr = objectsMap.get(baseId);
			if (ptr != null && !ptr.isEdge()) {
				otp.setDeltaBase(ptr);
				otp.setReuseAsIs();
			} else if (thin && have(ptr, baseId)) {
				otp.setDeltaBase(baseId);
				otp.setReuseAsIs();
			} else {
				otp.clearDeltaBase();
				otp.clearReuseAsIs();
			}
		} else if (nFmt == PACK_WHOLE && config.isReuseObjects()) {
			int nWeight = next.getWeight();
			if (otp.isReuseAsIs() && !otp.isDeltaRepresentation()) {
				// We've chosen another PACK_WHOLE format for this object,
				// choose the one that has the smaller compressed size.
				//
				if (otp.getWeight() <= nWeight)
					return;
			}
			otp.clearDeltaBase();
			otp.setReuseAsIs();
			otp.setWeight(nWeight);
		} else {
			otp.clearDeltaBase();
			otp.clearReuseAsIs();
		}

		otp.setDeltaAttempted(reuseDeltas & next.wasDeltaAttempted());
		otp.select(next);
	}
	private final boolean have(ObjectToPack ptr, AnyObjectId objectId) {
		return (ptr != null && ptr.isEdge())
				|| (haveObjects != null && haveObjects.contains(objectId));
	}
	/**
	 * Prepares the bitmaps to be written to the pack index. Bitmaps can be
	 * used to speed up fetches and clones by storing the entire object graph
	 * at selected commits.
	 *
	 * This method can only be invoked after
	 * {@link #writePack(ProgressMonitor, ProgressMonitor, OutputStream)} has
	 * been invoked and completed successfully. Writing a corresponding bitmap
	 * index is an optional feature that not all pack users may require.
	 *
	 * @param pm
	 *            progress monitor to report bitmap building work.
	 * @return whether a bitmap index may be written.
	 * @throws IOException
	 *             if an I/O problem occurs while reading objects.
	 */
	public boolean prepareBitmapIndex(ProgressMonitor pm) throws IOException {
		if (!canBuildBitmaps || getObjectCount() > Integer.MAX_VALUE
				|| !cachedPacks.isEmpty())
			return false;

		if (pm == null)
			pm = NullProgressMonitor.INSTANCE;

		writeBitmaps = new PackBitmapIndexBuilder(sortByName());
		PackWriterBitmapPreparer bitmapPreparer = new PackWriterBitmapPreparer(
				reader, writeBitmaps, pm, stats.interestingObjects);

		int numCommits = objectsLists[OBJ_COMMIT].size();
		Collection<PackWriterBitmapPreparer.BitmapCommit> selectedCommits =
				bitmapPreparer.doCommitSelection(numCommits);

		beginPhase(PackingPhase.BUILDING_BITMAPS, pm, selectedCommits.size());

		PackWriterBitmapWalker walker = bitmapPreparer.newBitmapWalker();
		AnyObjectId last = null;
		for (PackWriterBitmapPreparer.BitmapCommit cmit : selectedCommits) {
			if (cmit.isReuseWalker())
				walker.reset();
			else
				walker = bitmapPreparer.newBitmapWalker();

			BitmapBuilder bitmap = walker.findObjects(
					Collections.singleton(cmit), null);

			if (last != null && cmit.isReuseWalker() && !bitmap.contains(last))
				throw new IllegalStateException(MessageFormat.format(
						JGitText.get().bitmapMissingObject, cmit.name(),
						last.name()));
			last = cmit;
			writeBitmaps.addBitmap(cmit, bitmap.build(), cmit.getFlags());

			pm.update(1);
		}

		endPhase(pm);
		return true;
	}
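	// Illustrative call sequence; writeBitmapIndex is assumed here to be
	// the companion method that serializes the prepared bitmaps:
	//
	//   writer.writePack(countingMonitor, writeMonitor, packOut);
	//   if (writer.prepareBitmapIndex(monitor))
	//       writer.writeBitmapIndex(bitmapIndexOut);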
	private boolean reuseDeltaFor(ObjectToPack otp) {
		int type = otp.getType();
		if ((type & 2) != 0) // OBJ_TREE(2) or OBJ_BLOB(3)
			return true;
		if (type == OBJ_COMMIT)
			return reuseDeltaCommits;
		if (type == OBJ_TAG)
			return false;
		return true;
	}
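	// The bit test above leans on the pack type codes: OBJ_COMMIT = 1
	// (0b001), OBJ_TREE = 2 (0b010), OBJ_BLOB = 3 (0b011), OBJ_TAG = 4
	// (0b100). Bit 1 (value 2) is set only for trees and blobs, so
	// (type & 2) != 0 accepts exactly those two types on the fast path.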
	/** Summary of how PackWriter created the pack. */
	public static class Statistics {
		/** Statistics about a single class of object. */
		public static class ObjectType {
			long cntObjects;

			long cntDeltas;

			long reusedObjects;

			long reusedDeltas;

			long bytes;

			long deltaBytes;

			/**
			 * @return total number of objects output. This total includes the
			 *         value of {@link #getDeltas()}.
			 */
			public long getObjects() {
				return cntObjects;
			}

			/**
			 * @return total number of deltas output. This may be lower than the
			 *         actual number of deltas if a cached pack was reused.
			 */
			public long getDeltas() {
				return cntDeltas;
			}

			/**
			 * @return number of objects whose existing representation was
			 *         reused in the output. This count includes
			 *         {@link #getReusedDeltas()}.
			 */
			public long getReusedObjects() {
				return reusedObjects;
			}

			/**
			 * @return number of deltas whose existing representation was reused
			 *         in the output, as their base object was also output or
			 *         was assumed present for a thin pack. This may be lower
			 *         than the actual number of reused deltas if a cached pack
			 *         was reused.
			 */
			public long getReusedDeltas() {
				return reusedDeltas;
			}

			/**
			 * @return total number of bytes written. This size includes the
			 *         object headers as well as the compressed data. This size
			 *         also includes all of {@link #getDeltaBytes()}.
			 */
			public long getBytes() {
				return bytes;
			}

			/**
			 * @return number of delta bytes written. This size includes the
			 *         object headers for the delta objects.
			 */
			public long getDeltaBytes() {
				return deltaBytes;
			}
		}

		Set<ObjectId> interestingObjects;

		Set<ObjectId> uninterestingObjects;

		Collection<CachedPack> reusedPacks;

		int depth;

		int deltaSearchNonEdgeObjects;

		int deltasFound;

		long totalObjects;

		long totalDeltas;

		long reusedObjects;

		long reusedDeltas;

		long totalBytes;

		long thinPackBytes;

		long timeCounting;

		long timeSearchingForReuse;

		long timeSearchingForSizes;

		long timeCompressing;

		long timeWriting;

		ObjectType[] objectTypes;

		{
			objectTypes = new ObjectType[5];
			objectTypes[OBJ_COMMIT] = new ObjectType();
			objectTypes[OBJ_TREE] = new ObjectType();
			objectTypes[OBJ_BLOB] = new ObjectType();
			objectTypes[OBJ_TAG] = new ObjectType();
		}

		/**
		 * @return unmodifiable collection of objects to be included in the
		 *         pack. May be null if the pack was hand-crafted in a unit
		 *         test.
		 */
		public Set<ObjectId> getInterestingObjects() {
			return interestingObjects;
		}

		/**
		 * @return unmodifiable collection of objects that should be excluded
		 *         from the pack, as the peer that will receive the pack already
		 *         has these objects.
		 */
		public Set<ObjectId> getUninterestingObjects() {
			return uninterestingObjects;
		}

		/**
		 * @return unmodifiable collection of the cached packs that were reused
		 *         in the output, if any were selected for reuse.
		 */
		public Collection<CachedPack> getReusedPacks() {
			return reusedPacks;
		}

		/**
		 * @return number of objects in the output pack that went through the
		 *         delta search process in order to find a potential delta base.
		 */
		public int getDeltaSearchNonEdgeObjects() {
			return deltaSearchNonEdgeObjects;
		}

		/**
		 * @return number of objects in the output pack that went through delta
		 *         base search and found a suitable base. This is a subset of
		 *         {@link #getDeltaSearchNonEdgeObjects()}.
		 */
		public int getDeltasFound() {
			return deltasFound;
		}

		/**
		 * @return total number of objects output. This total includes the value
		 *         of {@link #getTotalDeltas()}.
		 */
		public long getTotalObjects() {
			return totalObjects;
		}

		/**
		 * @return total number of deltas output. This may be lower than the
		 *         actual number of deltas if a cached pack was reused.
		 */
		public long getTotalDeltas() {
			return totalDeltas;
		}

		/**
		 * @return number of objects whose existing representation was reused in
		 *         the output. This count includes {@link #getReusedDeltas()}.
		 */
		public long getReusedObjects() {
			return reusedObjects;
		}

		/**
		 * @return number of deltas whose existing representation was reused in
		 *         the output, as their base object was also output or was
		 *         assumed present for a thin pack. This may be lower than the
		 *         actual number of reused deltas if a cached pack was reused.
		 */
		public long getReusedDeltas() {
			return reusedDeltas;
		}

		/**
		 * @return total number of bytes written. This size includes the pack
		 *         header, trailer, thin pack, and reused cached pack(s).
		 */
		public long getTotalBytes() {
			return totalBytes;
		}

		/**
		 * @return size of the thin pack in bytes, if a thin pack was generated.
		 *         A thin pack is created when the client already has objects
		 *         and some deltas are created against those objects, or if a
		 *         cached pack is being used and some deltas will reference
		 *         objects in the cached pack. This size does not include the
		 *         pack header or trailer.
		 */
		public long getThinPackBytes() {
			return thinPackBytes;
		}

		/**
		 * @param typeCode
		 *            object type code, e.g. OBJ_COMMIT or OBJ_TREE.
		 * @return information about this type of object in the pack.
		 */
		public ObjectType byObjectType(int typeCode) {
			return objectTypes[typeCode];
		}

		/** @return true if the resulting pack file was a shallow pack. */
		public boolean isShallow() {
			return depth > 0;
		}

		/** @return depth (in commits) the pack includes if shallow. */
		public int getDepth() {
			return depth;
		}

		/**
		 * @return time in milliseconds spent enumerating the objects that need
		 *         to be included in the output. This time includes any restarts
		 *         that occur when a cached pack is selected for reuse.
		 */
		public long getTimeCounting() {
			return timeCounting;
		}

		/**
		 * @return time in milliseconds spent matching existing representations
		 *         against objects that will be transmitted, or that the client
		 *         can be assumed to already have.
		 */
		public long getTimeSearchingForReuse() {
			return timeSearchingForReuse;
		}

		/**
		 * @return time in milliseconds spent finding the sizes of all objects
		 *         that will enter the delta compression search window. The
		 *         sizes need to be known to better match similar objects
		 *         together and improve delta compression ratios.
		 */
		public long getTimeSearchingForSizes() {
			return timeSearchingForSizes;
		}

		/**
		 * @return time in milliseconds spent on delta compression. This is
		 *         observed wall-clock time and does not accurately track CPU
		 *         time used when multiple threads were used to perform the
		 *         delta compression.
		 */
		public long getTimeCompressing() {
			return timeCompressing;
		}

		/**
		 * @return time in milliseconds spent writing the pack output, from
		 *         start of header until end of trailer. The transfer speed can
		 *         be approximated by dividing {@link #getTotalBytes()} by this
		 *         value.
		 */
		public long getTimeWriting() {
			return timeWriting;
		}

		/** @return total time spent processing this pack. */
		public long getTimeTotal() {
			return timeCounting
					+ timeSearchingForReuse
					+ timeSearchingForSizes
					+ timeCompressing
					+ timeWriting;
		}
		/**
		 * @return the average output speed in terms of bytes-per-second:
		 *         {@code getTotalBytes() / (getTimeWriting() / 1000.0)}.
		 */
		public double getTransferRate() {
			return getTotalBytes() / (getTimeWriting() / 1000.0);
		}
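		// Worked example with illustrative numbers: totalBytes = 10485760
		// (10 MiB) written in timeWriting = 2000 ms gives
		// 10485760 / (2000 / 1000.0) = 5242880 bytes/sec, i.e. 5 MiB/s.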
		/** @return formatted message string for display to clients. */
		public String getMessage() {
			return MessageFormat.format(JGitText.get().packWriterStatistics, //
					Long.valueOf(totalObjects), Long.valueOf(totalDeltas), //
					Long.valueOf(reusedObjects), Long.valueOf(reusedDeltas));
		}
	}
	private class MutableState {
		/** Estimated size of a single ObjectToPack instance. */
		// Assume 64-bit pointers, since this is just an estimate.
		private static final long OBJECT_TO_PACK_SIZE =
				(2 * 8)             // Object header
				+ (2 * 8) + (2 * 8) // ObjectToPack fields
				+ (8 + 8)           // PackedObjectInfo fields
				+ 8                 // ObjectIdOwnerMap fields
				+ 40                // AnyObjectId fields
				+ 8;                // Reference in BlockList

		private final long totalDeltaSearchBytes;

		private volatile PackingPhase phase;

		MutableState() {
			phase = PackingPhase.COUNTING;
			if (config.isDeltaCompress()) {
				int threads = config.getThreads();
				if (threads <= 0)
					threads = Runtime.getRuntime().availableProcessors();
				totalDeltaSearchBytes = (threads * config.getDeltaSearchMemoryLimit())
						+ config.getBigFileThreshold();
			} else
				totalDeltaSearchBytes = 0;
		}

		State snapshot() {
			long objCnt = 0;
			objCnt += objectsLists[OBJ_COMMIT].size();
			objCnt += objectsLists[OBJ_TREE].size();
			objCnt += objectsLists[OBJ_BLOB].size();
			objCnt += objectsLists[OBJ_TAG].size();
			// Exclude CachedPacks.

			long bytesUsed = OBJECT_TO_PACK_SIZE * objCnt;
			PackingPhase curr = phase;
			if (curr == PackingPhase.COMPRESSING)
				bytesUsed += totalDeltaSearchBytes;
			return new State(curr, bytesUsed);
		}
	}
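	// Worked example for the estimate above: OBJECT_TO_PACK_SIZE sums to
	// 16 + 32 + 16 + 8 + 40 + 8 = 120 bytes per queued object, so a
	// snapshot() with 1,000,000 objects queued reports roughly 120 MB,
	// plus the delta search budget while in the COMPRESSING phase.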
	/** Possible states that a PackWriter can be in. */
	public static enum PackingPhase {
		/** Counting objects phase. */
		COUNTING,

		/** Getting sizes phase. */
		GETTING_SIZES,

		/** Finding sources phase. */
		FINDING_SOURCES,

		/** Compressing objects phase. */
		COMPRESSING,

		/** Writing objects phase. */
		WRITING,

		/** Building bitmaps phase. */
		BUILDING_BITMAPS;
	}

	/** Summary of the current state of a PackWriter. */
	public class State {
		private final PackingPhase phase;

		private final long bytesUsed;

		State(PackingPhase phase, long bytesUsed) {
			this.phase = phase;
			this.bytesUsed = bytesUsed;
		}

		/** @return the PackConfig used to build the writer. */
		public PackConfig getConfig() {
			return config;
		}

		/** @return the current phase of the writer. */
		public PackingPhase getPhase() {
			return phase;
		}

		/** @return an estimate of the total memory used by the writer. */
		public long estimateBytesUsed() {
			return bytesUsed;
		}

		@SuppressWarnings("nls")
		@Override
		public String toString() {
			return "PackWriter.State[" + phase + ", memory=" + bytesUsed + "]";
		}
	}
}