Implement delta generation during packing
PackWriter now produces new deltas if there is not a suitable delta
available for reuse from an existing pack file. This permits JGit to
send less data on the wire by sending a delta relative to an object
the other side already has, instead of sending the whole object.
The delta searching algorithm is similar in style to what C Git
uses, but has some differences (see below for more on this).
Briefly, objects that should be considered for delta compression are
pushed onto a list. This list is then sorted by a rough similarity
score, which is derived from the path name the object was discovered
at in the repository during object counting. The list is then
walked in order.
At each position in the list, up to $WINDOW objects prior to it
are attempted as delta bases. Each object in the window is tried,
and the shortest delta instruction sequence selects the base object.
Some rough rules are used to prevent pathological behavior during
this matching phase, like skipping pairings of objects that are
not similar enough in size.
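The window walk described above can be sketched roughly as follows. This is an illustration only, not JGit's DeltaWindow: `delta_cost` is a hypothetical cost model built on difflib, and the `max_size_ratio` rule is an assumed stand-in for the "not similar enough in size" check.

```python
from difflib import SequenceMatcher

WINDOW = 10  # assumed window size, matching the benchmark settings


def delta_cost(base: bytes, target: bytes) -> int:
    """Hypothetical cost model: bytes of target that cannot be copied from base."""
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    copied = sum(block.size for block in matcher.get_matching_blocks())
    return len(target) - copied


def choose_bases(objects, window=WINDOW, max_size_ratio=4):
    """objects: list of (name, data), already sorted by path similarity.

    Returns {name: base_name or None}; None means "store the whole object".
    """
    chosen = {}
    for i, (name, data) in enumerate(objects):
        best_name, best_cost = None, len(data)  # whole object is the fallback
        # Only the previous `window` entries are candidate bases.
        for base_name, base in objects[max(0, i - window):i]:
            small, large = sorted((len(base), len(data)))
            # Assumed rule: skip pairs whose sizes differ too much.
            if small == 0 or large > small * max_size_ratio:
                continue
            cost = delta_cost(base, data)
            if cost < best_cost:
                best_name, best_cost = base_name, cost
        chosen[name] = best_name
    return chosen
```

Shrinking `window` reduces the number of candidate pairs per object, which is the trade-off between packing time and pack size.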
PackWriter intentionally excludes commits and annotated tags from
this new delta search phase. In the JGit repository only 28 out
of 2600+ commits can be delta compressed by C Git. As the commit
count tends to be a fair percentage of the total number of objects
in the repository, and they generally do not delta compress well,
skipping over them can improve performance with little increase in
the output pack size.
Because this implementation was rebuilt from scratch based on my own
memory of how the packing algorithm has evolved over the years in
C Git, PackWriter, DeltaWindow, and DeltaEncoder don't use exactly
the same rules everywhere, and that leads JGit to produce different
(but logically equivalent) pack files.
Repository | Pack Size (bytes)                | Packing Time
           | JGit     - CGit     = Difference | JGit    / CGit
-----------+----------------------------------+------------------
git        | 25094348 - 24322890 = +771458    | 59.434s / 59.133s
jgit       |  5669515 -  5709046 = -39531     |  6.654s /  6.806s
linux-2.6  |     389M -     386M = +3M        |  20m02s /  18m01s
For the above tests pack.threads was set to 1, window size=10,
delta depth=50, and delta and object reuse was disabled for both
implementations. Both implementations were reading from an already
fully packed repository on local disk. The running time reported
is after 1 warm-up run of the tested implementation.
PackWriter is writing 771 KB more data on git.git, 3M more on
linux-2.6, but is actually 39.5 KB smaller on jgit.git. Being
larger by less than 0.7% on linux-2.6 isn't bad, nor is taking an
extra 2 minutes to pack. On the running time side, JGit is at a
major disadvantage because linux-2.6 doesn't fit into the default
WindowCache of 20M, while C Git is able to mmap the entire pack and
have it available instantly in physical memory (assuming hot cache).
CGit also has a feature where it caches deltas that were created
during the compression phase, and uses those cached deltas during
the writing phase. PackWriter does not implement this (yet),
and therefore must create every delta twice. This could easily
account for the increased running time we are seeing.
Change-Id: I6292edc66c2e95fbe45b519b65fdb3918068889c
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 years ago
PackWriter: Make thin packs more efficient
There is no point in pushing all of the files within the edge
commits into the delta search when making a thin pack. This floods
the delta search window with objects that are unlikely to be useful
bases for the objects that will be written out, resulting in lower
data compression and higher transfer sizes.
Instead observe the path of a tree or blob that is being pushed
into the outgoing set, and use that path to locate up to WINDOW
ancestor versions from the edge commits. Push only those objects
into the edgeObjects set, reducing the number of objects seen by the
search window. This allows PackWriter to only look at ancestors
for the modified files, rather than all files in the project.
Limiting the search to WINDOW size makes sense, because more than
WINDOW edge objects will just skip through the window search as
none of them need to be delta compressed.
To further improve compression, sort edge objects into the front
of the window list, rather than randomly throughout. This puts
non-edges later in the window and gives them a better chance at
finding their base, since they search backwards through the window.
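A rough sketch of the edge selection and ordering described above. The names `build_search_list` and `edge_versions` are hypothetical, and grouping the list by path before placing edges first is an assumption made for this illustration, not a statement of how PackWriter sorts.

```python
WINDOW = 10  # assumed window size


def build_search_list(outgoing, edge_versions, window=WINDOW):
    """Build the delta search list for a thin pack.

    outgoing: list of (object_id, path) that will be written.
    edge_versions: dict mapping path -> object ids found at that path
                   in the edge commits (objects the client already has).
    Returns a list of (object_id, is_edge).
    """
    entries = []
    seen = set()
    for obj_id, path in outgoing:
        entries.append((path, False, obj_id))
        # Only paths that are actually being sent pull in edge objects,
        # and at most `window` of them per path.
        for edge_id in edge_versions.get(path, [])[:window]:
            if edge_id not in seen:
                seen.add(edge_id)
                entries.append((path, True, edge_id))
    # Group by path; within a path, edges come before non-edges so the
    # non-edges find them when searching backwards through the window.
    entries.sort(key=lambda e: (e[0], not e[1]))
    return [(obj_id, is_edge) for _path, is_edge, obj_id in entries]
```

Edge objects at unmodified paths never enter the list, which is what keeps the window from being flooded.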
These changes make a significant difference in the thin-pack:
Before:
remote: Counting objects: 144190, done
remote: Finding sources: 100% (50275/50275)
remote: Getting sizes: 100% (101405/101405)
remote: Compressing objects: 100% (7587/7587)
Receiving objects: 100% (50275/50275), 24.67 MiB | 9.90 MiB/s, done.
Resolving deltas: 100% (40339/40339), completed with 2218 local objects.
real 0m30.267s
After:
remote: Counting objects: 61549, done
remote: Finding sources: 100% (50275/50275)
remote: Getting sizes: 100% (18862/18862)
remote: Compressing objects: 100% (7588/7588)
Receiving objects: 100% (50275/50275), 11.04 MiB | 3.51 MiB/s, done.
Resolving deltas: 100% (43160/43160), completed with 5014 local objects.
real 0m22.170s
The resulting pack is 13.63 MiB smaller, even though it contains the
same exact objects. 82,543 fewer objects had to have their sizes
looked up, which saved about 8s of server CPU time. 2,796 more
objects from the client were used as part of the base object set,
which contributed to the smaller transfer size.
Change-Id: Id01271950432c6960897495b09deab70e33993a9
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>
13 years ago
PackWriter: Support reuse of entire packs
The most expensive part of packing a repository for transport to
another system is enumerating all of the objects in the repository.
Once this gets to the size of the linux-2.6 repository (1.8 million
objects), enumeration can take several CPU minutes and costs a lot
of temporary working set memory.
Teach PackWriter to efficiently reuse an existing "cached pack"
by answering a clone request with a thin pack followed by a larger
cached pack appended to the end. This requires the repository
owner to first construct the cached pack by hand, and record the
tip commits inside of $GIT_DIR/objects/info/cached-packs:
  cd $GIT_DIR
  root=$(git rev-parse master)
  tmp=objects/.tmp-$$
  names=$(echo $root | git pack-objects --keep-true-parents --revs $tmp)
  for n in $names; do
    chmod a-w $tmp-$n.pack $tmp-$n.idx
    touch objects/pack/pack-$n.keep
    mv $tmp-$n.pack objects/pack/pack-$n.pack
    mv $tmp-$n.idx objects/pack/pack-$n.idx
  done
  (echo "+ $root";
   for n in $names; do echo "P $n"; done;
   echo) >>objects/info/cached-packs
  git repack -a -d
When a clone request needs to include $root, the corresponding
cached pack will be copied as-is, rather than enumerating all of
the objects that are reachable from $root.
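For illustration, the records that the script appends could be read back with something like the sketch below. The format is inferred only from the script above ("+ " lines for tips, "P " lines for pack names, a blank line ending each record); `parse_cached_packs` is a hypothetical helper, and JGit's own reader may accept more than this.

```python
def parse_cached_packs(text):
    """Parse cached-packs content into a list of records.

    Each record is {"tips": [...], "packs": [...]}. Lines of any other
    shape are ignored in this sketch.
    """
    records = []
    tips, packs = [], []

    def flush():
        nonlocal tips, packs
        if tips or packs:
            records.append({"tips": tips, "packs": packs})
            tips, packs = [], []

    for line in text.splitlines():
        line = line.strip()
        if not line:
            flush()  # a blank line terminates the current record
        elif line.startswith("+ "):
            tips.append(line[2:])
        elif line.startswith("P "):
            packs.append(line[2:])
    flush()  # tolerate a missing trailing blank line
    return records
```

A clone that needs one of the recorded tips could then be matched to the pack names listed in the same record.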
For a linux-2.6 kernel repository that should be about 376 MiB,
the above process creates two packs of 368 MiB and 38 MiB[1].
This is a local disk usage increase of ~26 MiB, due to reduced
delta compression between the large cached pack and the smaller
recent activity pack. The overhead is similar to 1 full copy of
the compressed project sources.
With this cached pack in hand, JGit daemon completes a clone request
about 1m16s faster, with a slightly larger data transfer (+2.39 MiB):
Before:
remote: Counting objects: 1861830, done
remote: Finding sources: 100% (1861830/1861830)
remote: Getting sizes: 100% (88243/88243)
remote: Compressing objects: 100% (88184/88184)
Receiving objects: 100% (1861830/1861830), 376.01 MiB | 19.01 MiB/s, done.
remote: Total 1861830 (delta 4706), reused 1851053 (delta 1553844)
Resolving deltas: 100% (1564621/1564621), done.
real 3m19.005s
After:
remote: Counting objects: 1601, done
remote: Counting objects: 1828460, done
remote: Finding sources: 100% (50475/50475)
remote: Getting sizes: 100% (18843/18843)
remote: Compressing objects: 100% (7585/7585)
remote: Total 1861830 (delta 2407), reused 1856197 (delta 37510)
Receiving objects: 100% (1861830/1861830), 378.40 MiB | 31.31 MiB/s, done.
Resolving deltas: 100% (1559477/1559477), done.
real 2m2.938s
Repository owners can periodically refresh their cached packs by
repacking their repository, folding all newer objects into a larger
cached pack. Since repacking is already considered to be a normal
Git maintenance activity, this isn't a very big burden.
[1] In this test $root was set back about two weeks.
Change-Id: Ib87131d5c4b5e8c5cacb0f4fe16ff4ece554734b
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
Implement async/batch lookup of object data
An ObjectReader implementation may be very slow for a single object,
but yet support bulk queries efficiently by batching multiple small
requests into a single larger request. This easily happens when the
reader is built on top of a database that is stored on another host,
as the network round-trip time starts to dominate the operation cost.
RevWalk, ObjectWalk, UploadPack and PackWriter are the first major
users of this new bulk interface, with the goal being to support an
efficient way to pack a repository for a fetch/clone client when the
source repository is stored in a high-latency storage system.
Processing the want/have lists is now done in bulk, to remove
the high costs associated with common ancestor negotiation.
PackWriter already performs object reuse selection in bulk, but it
now can also do the object size lookup and object counting phases
with higher efficiency. Actual object reuse, deltification, and
final output are still doing sequential lookups, making them a bit
more expensive to perform.
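The benefit of batching can be shown with a toy model. `RemoteStore` is a hypothetical stand-in for a high-latency backend that counts round trips; it is not the ObjectReader API, and the batch size of 100 is an arbitrary assumption.

```python
class RemoteStore:
    """Hypothetical high-latency object store; each call is one round trip."""

    def __init__(self, objects):
        self._objects = dict(objects)
        self.round_trips = 0

    def get_many(self, ids):
        self.round_trips += 1
        return {i: self._objects[i] for i in ids if i in self._objects}


def load_sizes_one_by_one(store, ids):
    """Sequential lookups: one round trip per object."""
    return {i: len(store.get_many([i])[i]) for i in ids}


def load_sizes_batched(store, ids, batch_size=100):
    """Bulk lookups: one round trip per batch of ids."""
    ids = list(ids)
    sizes = {}
    for start in range(0, len(ids), batch_size):
        chunk = ids[start:start + batch_size]
        for obj_id, data in store.get_many(chunk).items():
            sizes[obj_id] = len(data)
    return sizes
```

With 250 objects, the sequential version pays for 250 round trips while the batched version pays for 3, which is the effect the bulk interface is after.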
Change-Id: I4c966f84917482598012074c370b9831451404ee
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 years ago
Support creating pack bitmap indexes in PackWriter.
Update the PackWriter to support writing out pack bitmap indexes,
a parallel ".bitmap" file to the ".pack" file.
Bitmaps are selected at commits every 1 to 5,000 commits for
each unique path from the start. The most recent 100 commits are
all bitmapped. The next 19,000 commits have a bitmap every 100
commits. The remaining commits have a bitmap every 5,000 commits.
Commits with more than 1 parent are preferred over ones
with 1 or fewer. Furthermore, previously computed bitmaps are reused,
if the previous entry had the reuse flag set, which is set when the
bitmap was placed at the max allowed distance.
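The spacing rule can be sketched for the simplest case. This assumes a single linear history with index 0 as the most recent commit, and it ignores the preference for merge commits and the reuse flag; `bitmap_positions` is a hypothetical helper, not the selection code itself.

```python
def bitmap_positions(total_commits):
    """Return indexes (0 = most recent) of commits that get a bitmap."""
    selected = []
    last = None
    for index in range(total_commits):
        if index < 100:
            spacing = 1        # most recent 100 commits: all bitmapped
        elif index < 19_100:
            spacing = 100      # next 19,000 commits: every 100
        else:
            spacing = 5_000    # remaining commits: every 5,000
        if last is None or index - last >= spacing:
            selected.append(index)
            last = index
    return selected
```

Under these assumptions a 30,000 commit history gets 292 bitmaps: 100 dense ones, 190 in the middle band, and 2 in the sparse tail.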
Bitmaps are used to speed up the counting phase when packing, for
requests that are not shallow. The PackWriterBitmapWalker uses
a RevFilter to proactively mark commits with RevFlag.SEEN, when
they appear in a bitmap. The walker produces the full closure
of reachable ObjectIds, given the collection of starting ObjectIds.
For fetch requests, two ObjectWalks are executed to compute the
ObjectIds reachable from the haves and from the wants. The
ObjectIds needed to be written are determined by taking all the
resulting wants AND NOT the haves.
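The "wants AND NOT haves" step is a plain bitwise operation once each reachable set is a bitmap. The sketch below uses Python integers as bitsets, with bit N standing for the object at position N in the pack; the helper names are hypothetical.

```python
def to_bitmap(positions):
    """Build an integer bitset from object positions."""
    bits = 0
    for p in positions:
        bits |= 1 << p
    return bits


def from_bitmap(bits):
    """List the positions whose bits are set, in ascending order."""
    out = []
    pos = 0
    while bits:
        if bits & 1:
            out.append(pos)
        bits >>= 1
        pos += 1
    return out


def objects_to_send(wants_reachable, haves_reachable):
    """Everything reachable from the wants that the client lacks."""
    return wants_reachable & ~haves_reachable
```

For example, if the wants reach positions 0, 1, 2, 3, 5 and the haves reach 0, 1, 4, only positions 2, 3, and 5 need to be written.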
For clone requests, we get cached pack support for "free" since
it is possible to determine if all of the ObjectIds in a pack file
are included in the resulting list of ObjectIds to write.
On my machine, the best times for clones and fetches of the linux
kernel repository (with about 2.6M objects and 300K commits) are
tabulated below:
Operation                    Index V2               Index VE003
Clone                        37530ms (524.06 MiB)     82ms (524.06 MiB)
Fetch (1 commit back)           75ms                 107ms
Fetch (10 commits back)        456ms (269.51 KiB)    341ms (265.19 KiB)
Fetch (100 commits back)       449ms (269.91 KiB)    337ms (267.28 KiB)
Fetch (1000 commits back)     2229ms ( 14.75 MiB)    189ms ( 14.42 MiB)
Fetch (10000 commits back)    2177ms ( 16.30 MiB)    254ms ( 15.88 MiB)
Fetch (100000 commits back)  14340ms (185.83 MiB)   1655ms (189.39 MiB)
Change-Id: Icdb0cdd66ff168917fb9ef17b96093990cc6a98d
11 years ago
ObjectIdOwnerMap: More lightweight map for ObjectIds
OwnerMap is about 200 ms faster than SubclassMap, more friendly to the
GC, and uses less storage: testing the "Counting objects" part of
PackWriter on 1886362 objects:
ObjectIdSubclassMap:
  load factor 50%
  table: 4194304 (wasted 2307942)
  ms spent 36998 36009 34795 34703 34941 35070 34284 34511 34638 34256
  ms avg 34800 (last 9 runs)
ObjectIdOwnerMap:
  load factor 100%
  table: 2097152 (wasted 210790)
  directory: 1024
  ms spent 36842 35112 34922 34703 34580 34782 34165 34662 34314 34140
  ms avg 34597 (last 9 runs)
The major difference with OwnerMap is entries must extend from
ObjectIdOwnerMap.Entry, where the OwnerMap has injected its own
private "next" field into each object. This allows the OwnerMap to use
a singly linked list for chaining collisions within a bucket. By
putting collisions in a linked list, we gain the entire table back for
the SHA-1 bits to index their own "private" slot.
Unfortunately this means that each object can appear in at most ONE
OwnerMap, as there is only one "next" field within the object instance
to thread into the map. For types that are very object map heavy like
RevWalk (entity RevObject) and PackWriter (entity ObjectToPack) this
is sufficient, since these entity types are only put into one map by their
container. By introducing a new map type, we don't break existing
applications that might be trying to use ObjectIdSubclassMap to track
RevCommits they obtained from a RevWalk.
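The intrusive "next" field can be sketched as below. `Entry` and `OwnerMapSketch` are hypothetical names for this illustration; the table is a fixed size and does not grow, unlike the map described here.

```python
class Entry:
    """Base class for stored objects; carries the map's private link."""

    def __init__(self, object_id: bytes):
        self.object_id = object_id
        self._next = None  # owned by the one map this entry lives in


class OwnerMapSketch:
    def __init__(self, bits=4):
        self._table = [None] * (1 << bits)
        self._mask = (1 << bits) - 1
        self.size = 0

    def _slot(self, object_id):
        # Index by leading id bits; collisions chain through Entry._next.
        return int.from_bytes(object_id[:4], "big") & self._mask

    def add(self, entry):
        # Precondition: entry is not already in a map, because it has
        # only one link field to thread into a chain.
        slot = self._slot(entry.object_id)
        entry._next = self._table[slot]
        self._table[slot] = entry
        self.size += 1

    def get(self, object_id):
        node = self._table[self._slot(object_id)]
        while node is not None:
            if node.object_id == object_id:
                return node
            node = node._next
        return None
```

Because the chain lives inside the entries, the table needs no extra slots or wrapper objects for collisions, at the price of the one-map-per-object restriction.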
The OwnerMap uses less memory. Each object uses 1 reference more (so
we're up 1,886,362 references), but the table is 1/2 the size (2^21
rather than 2^22). The table itself wastes only 210,790 slots, rather
than 2,307,942. So OwnerMap is wasting 200k fewer references.
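The reference accounting can be checked from the table sizes and object count reported in the benchmark above. The function names are hypothetical; the numbers are the ones quoted in this message.

```python
OBJECTS = 1_886_362
SUBCLASS_TABLE_SLOTS = 4_194_304
OWNER_TABLE_SLOTS = 2_097_152


def subclass_map_refs(table_slots):
    """SubclassMap holds one reference per table slot."""
    return table_slots


def owner_map_refs(table_slots, objects):
    """OwnerMap holds one reference per slot plus one 'next' per object."""
    return table_slots + objects


# Net references saved by OwnerMap for this object count.
saved = subclass_map_refs(SUBCLASS_TABLE_SLOTS) - owner_map_refs(
    OWNER_TABLE_SLOTS, OBJECTS
)
```

Here `saved` comes out to 210,790, in line with the "200k fewer references" figure.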
OwnerMap is more friendly to the GC, because it hardly ever generates
garbage. As the map reaches its 100% load factor target, it doubles in
size by allocating additional segment arrays of 2048 entries. (So the
first grow allocates 1 segment, second 2 segments, third 4 segments,
etc.) These segments are hooked into the pre-allocated directory of
1024 spaces. This permits the map to grow to 2 million objects before
the directory itself has to grow. By using segments of 2048 entries,
we are asking the GC to acquire 8,204 bytes in a 32 bit JVM. This is
easier to satisfy than 2,307,942 bytes (for the 512k table that is
just an intermediate step in the SubclassMap). By reusing the
previously allocated segments (they are re-hashed in-place) we don't
release any memory during a table grow.
When the directory grows, it does so by discarding the old one and
using one that is 4x larger (so the directory goes to 4096 entries on
its first grow). A directory of size 4096 can handle up to 8 million
objects. The second directory grow (16384) goes to 33 million objects.
At that point we're starting to really push the limits of the JVM
heap, but at least it's many small arrays. Previously SubclassMap would
need a table of 67108864 entries to handle that object count, which
needs a single contiguous allocation of 256 MiB. That's hard to come
by in a 32 bit JVM. Instead OwnerMap uses 8192 arrays of about 8 KiB
each. This is much easier to fit into a fragmented heap.
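The capacity figures follow from the segment and directory sizes. In the sketch below the 4 byte reference size is an assumption for a 32 bit JVM, and the helper names are hypothetical.

```python
SEGMENT_ENTRIES = 2048
REFERENCE_BYTES = 4  # assumed reference size on a 32 bit JVM


def directory_capacity(directory_slots):
    """Objects held at 100% load when every directory slot has a segment."""
    return directory_slots * SEGMENT_ENTRIES


def grow_directory(directory_slots):
    """The directory is replaced by one that is 4x larger."""
    return directory_slots * 4


def flat_table_bytes(table_slots):
    """Size of one contiguous table of references."""
    return table_slots * REFERENCE_BYTES


first = 1024
second = grow_directory(first)    # 4096
third = grow_directory(second)    # 16384
capacities = [directory_capacity(d) for d in (first, second, third)]
```

The three capacities are 2,097,152, 8,388,608, and 33,554,432 objects, while a flat table of 67,108,864 references needs 256 MiB in one piece.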
Change-Id: Ia4acf5cfbf7e9b71bc7faa0db9060f6a969c0c50
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
13 years ago Support creating pack bitmap indexes in PackWriter.
Update the PackWriter to support writing out pack bitmap indexes,
a parallel ".bitmap" file to the ".pack" file.
Bitmaps are selected at commits every 1 to 5,000 commits for
each unique path from the start. The most recent 100 commits are
all bitmapped. The next 19,000 commits have a bitmap every 100
commits. The remaining commits have a bitmap every 5,000 commits.
Commits with more than 1 parent are preferred over ones
with 1 or fewer. Furthermore, previously computed bitmaps are reused,
if the previous entry had the reuse flag set, which is set when the
bitmap was placed at the max allowed distance.
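The spacing rule above can be sketched as follows. This is only an illustration along a single linear history, counting "age" as the number of commits back from the newest one. It ignores the preference for merge commits, the per-path selection, and the reuse flag, and all names are hypothetical.

```java
final class BitmapSpacing {
    static final int RECENT = 100;      // newest commits, all bitmapped
    static final int MIDDLE = 19_000;   // next range, one bitmap per 100 commits
    static final int MIDDLE_SPAN = 100;
    static final int OLD_SPAN = 5_000;  // everything older

    /** Distance to the next bitmapped commit for a commit of the given age. */
    static int spanAt(int age) {
        if (age < RECENT)
            return 1;
        if (age < RECENT + MIDDLE)
            return MIDDLE_SPAN;
        return OLD_SPAN;
    }

    /** Number of commits this rule would select in a linear history. */
    static int countSelected(int totalCommits) {
        int count = 0;
        for (int age = 0; age < totalCommits; age += spanAt(age))
            count++;
        return count;
    }
}
```

Under these assumptions only a few hundred of the roughly 300K commits in a kernel-sized history would carry a bitmap, which keeps the ".bitmap" file small relative to the pack.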
Bitmaps are used to speed up the counting phase when packing, for
requests that are not shallow. The PackWriterBitmapWalker uses
a RevFilter to proactively mark commits with RevFlag.SEEN, when
they appear in a bitmap. The walker produces the full closure
of reachable ObjectIds, given the collection of starting ObjectIds.
For fetch requests, two ObjectWalks are executed to compute the
ObjectIds reachable from the haves and from the wants. The
ObjectIds needed to be written are determined by taking all the
resulting wants AND NOT the haves.
For clone requests, we get cached pack support for "free" since
it is possible to determine if all of the ObjectIds in a pack file
are included in the resulting list of ObjectIds to write.
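The two set operations described above can be sketched with java.util.BitSet, assuming each bit position stands for one object. BitSet is used here only as a stand-in for whatever bitmap representation the real index uses, and the class and method names are hypothetical.

```java
import java.util.BitSet;

final class BitmapMath {
    /** Objects to write: reachable from the wants AND NOT from the haves. */
    static BitSet toSend(BitSet wants, BitSet haves) {
        BitSet result = (BitSet) wants.clone(); // leave the inputs untouched
        result.andNot(haves);
        return result;
    }

    /** True when every object of a pack is part of the set to be written. */
    static boolean packFullyIncluded(BitSet packObjects, BitSet toWrite) {
        BitSet missing = (BitSet) packObjects.clone();
        missing.andNot(toWrite); // what the pack holds that we do not need
        return missing.isEmpty();
    }
}
```

When packFullyIncluded() holds for a pack, that pack could be streamed as-is, which is how the cached pack behaviour falls out of the bitmap data.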
On my machine, the best times for clones and fetches of the linux
kernel repository (with about 2.6M objects and 300K commits) are
tabulated below:
Operation Index V2 Index VE003
Clone 37530ms (524.06 MiB) 82ms (524.06 MiB)
Fetch (1 commit back) 75ms 107ms
Fetch (10 commits back) 456ms (269.51 KiB) 341ms (265.19 KiB)
Fetch (100 commits back) 449ms (269.91 KiB) 337ms (267.28 KiB)
Fetch (1000 commits back) 2229ms ( 14.75 MiB) 189ms ( 14.42 MiB)
Fetch (10000 commits back) 2177ms ( 16.30 MiB) 254ms ( 15.88 MiB)
Fetch (100000 commits back) 14340ms (185.83 MiB) 1655ms (189.39 MiB)
Change-Id: Icdb0cdd66ff168917fb9ef17b96093990cc6a98d
11 years ago Shallow fetch: Respect "shallow" lines
When fetching from a shallow clone, the client sends "have" lines
to tell the server about objects it already has and "shallow" lines
to tell where its local history terminates. In some circumstances,
the server fails to honor the shallow lines and fails to return
objects that the client needs.
UploadPack passes the "have" lines to PackWriter so PackWriter can
omit them from the generated pack. UploadPack processes "shallow"
lines by calling RevWalk.assumeShallow() with the set of shallow
commits. RevWalk creates and caches RevCommits for these shallow
commits, clearing out their parents. That way, walks correctly
terminate at the shallow commits instead of assuming the client has
history going back behind them. UploadPack converts its RevWalk to an
ObjectWalk, maintaining the cached RevCommits, and passes it to
PackWriter.
Unfortunately, to support shallow fetches the PackWriter does the
following:
  if (shallowPack && !(walk instanceof DepthWalk.ObjectWalk))
    walk = new DepthWalk.ObjectWalk(reader, depth);
That is, when the client sends a "deepen" line (fetch --depth=<n>)
and the caller has not passed in a DepthWalk.ObjectWalk, PackWriter
throws away the RevWalk that was passed in and makes a new one. The
cleared parent lists prepared by RevWalk.assumeShallow() are lost.
Fortunately UploadPack intends to pass in a DepthWalk.ObjectWalk.
It tries to create it by calling toObjectWalkWithSameObjects() on
a DepthWalk.RevWalk. But it doesn't work: because DepthWalk.RevWalk
does not override the standard RevWalk#toObjectWalkWithSameObjects
implementation, the result is a plain ObjectWalk instead of an
instance of DepthWalk.ObjectWalk.
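The missing override can be illustrated with a few stand-in classes. These are not the JGit types: every class name below is hypothetical, and the "fixed" variant only shows the general shape of an override with a covariant return type, not necessarily how the bug was actually fixed.

```java
/** Stand-in for RevWalk: the factory always builds the plain object walk. */
class BaseRevWalk {
    BaseObjectWalk toObjectWalkWithSameObjects() {
        return new BaseObjectWalk();
    }
}

/** Stand-in for ObjectWalk. */
class BaseObjectWalk extends BaseRevWalk {
}

/** Stand-in for DepthWalk.ObjectWalk. */
class DepthObjectWalk extends BaseObjectWalk {
}

/** Stand-in for DepthWalk.RevWalk as described above: no override. */
class DepthRevWalk extends BaseRevWalk {
    // Inherits the base factory, so callers get a BaseObjectWalk back.
}

/** Variant that overrides the factory to return the depth-aware type. */
class FixedDepthRevWalk extends BaseRevWalk {
    @Override
    DepthObjectWalk toObjectWalkWithSameObjects() {
        return new DepthObjectWalk();
    }
}
```

With DepthRevWalk, an "instanceof DepthObjectWalk" check on the result fails, which corresponds to the condition that makes PackWriter build a fresh walk and drop the shallow state.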
The result is that the "shallow" information is thrown away and
objects reachable from the shallow commits can be omitted from the
pack sent when fetching with --depth from a shallow clone.
Multiple factors collude to limit the circumstances under which this
bug can be observed:
1. Commits with depth != 0 don't enter DepthGenerator's pending queue.
That means a "have" cannot have any effect on DepthGenerator unless
it is also a "want".
2. DepthGenerator#next() doesn't call carryFlagsImpl(), so the
uninteresting flag is not propagated to ancestors there even if a
"have" is also a "want".
3. JGit treats a depth of 1 as "1 past the wants".
Because of (2), the only place the UNINTERESTING flag can leak to a
shallow commit's parents is in the carryFlags() call from
markUninteresting(). carryFlags() only traverses commits that have
already been parsed: commits yet to be parsed are supposed to inherit
correct flags from their parent in PendingGenerator#next (which
doesn't happen here --- that is (2)). So the list of commits that have
already been parsed becomes relevant.
When we hit the markUninteresting() call, all "want"s, "have"s, and
commits to be unshallowed have been parsed. carryFlags() only
affects the parsed commits. If the "want" is a direct parent of a
"have", then carryFlags() marks it as uninteresting. If the "have"
was also a "shallow", then its parent pointer should have been null
and the "want" shouldn't have been marked, so we see the bug. If the
"want" is a more distant ancestor then (2) keeps the uninteresting
state from propagating to the "want" and we don't see the bug. If the
"shallow" is not also a "have" then the shallow commit isn't parsed
so (2) keeps the uninteresting state from propagating to the "want"
so we don't see the bug.
Here is a reproduction case (time flowing left to right, arrows
pointing to parents). "C" must be a commit that the client
reports as a "have" during negotiation. That can only happen if the
server reports it as an existing branch or tag in the first round of
negotiation:
  A <-- B <-- C <-- D
First do
  git clone --depth 1 <repo>
which yields D as a "have" and C as a "shallow" commit. Then try
  git fetch --depth 1 <repo> B:refs/heads/B
Negotiation sets up: have D, shallow C, have C, want B.
But due to this bug B is marked as uninteresting and is not sent.
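The reproduction case can be modeled with a toy sketch. This is not JGit code: the `Commit` class and the `carryUninteresting` and `wantIsDropped` helpers are hypothetical names, and the model captures only one rule from the description above, namely that a flag carried from a "have" reaches already-parsed parents unless the shallow commit's parent link has been cut.

```java
/** Toy model of the flag leak; every name here is hypothetical, not JGit API. */
final class ShallowFlagDemo {
    static final class Commit {
        final String name;
        Commit parent;          // null for a root or for a shallow boundary
        boolean parsed;
        boolean uninteresting;

        Commit(String name, Commit parent) {
            this.name = name;
            this.parent = parent;
        }
    }

    /** Carry the flag from a "have" to its ancestors, visiting parsed commits only. */
    static void carryUninteresting(Commit have) {
        have.uninteresting = true;
        for (Commit c = have.parent; c != null && c.parsed; c = c.parent) {
            c.uninteresting = true;
        }
    }

    /** Returns true if the "want" B ends up uninteresting and would not be sent. */
    static boolean wantIsDropped(boolean honorShallow) {
        Commit a = new Commit("A", null);
        Commit b = new Commit("B", a);
        Commit c = new Commit("C", b);
        Commit d = new Commit("D", c);
        // Wants, haves, and shallow commits are parsed before flags are carried.
        b.parsed = true;
        c.parsed = true;
        d.parsed = true;
        if (honorShallow) {
            c.parent = null;    // C is shallow: its parent must stay hidden
        }
        carryUninteresting(d);  // have D
        carryUninteresting(c);  // have C
        return b.uninteresting; // want B
    }

    public static void main(String[] args) {
        System.out.println("shallow info lost: B dropped = " + wantIsDropped(false));
        System.out.println("shallow info kept: B dropped = " + wantIsDropped(true));
    }
}
```

In this model B is dropped only when the shallow boundary at C is ignored, which is the behavior the commit message describes.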
Change-Id: I6e14b57b2f85e52d28cdcf356df647870f475440
Signed-off-by: Terry Parker <tparker@google.com>
7 years ago
PackWriter: Support reuse of entire packs
The most expensive part of packing a repository for transport to
another system is enumerating all of the objects in the repository.
Once this gets to the size of the linux-2.6 repository (1.8 million
objects), enumeration can take several CPU minutes and costs a lot
of temporary working set memory.
Teach PackWriter to efficiently reuse an existing "cached pack"
by answering a clone request with a thin pack followed by a larger
cached pack appended to the end. This requires the repository
owner to first construct the cached pack by hand, and record the
tip commits inside of $GIT_DIR/objects/info/cached-packs:
  cd $GIT_DIR
  root=$(git rev-parse master)
  tmp=objects/.tmp-$$
  names=$(echo $root | git pack-objects --keep-true-parents --revs $tmp)
  for n in $names; do
    chmod a-w $tmp-$n.pack $tmp-$n.idx
    touch objects/pack/pack-$n.keep
    mv $tmp-$n.pack objects/pack/pack-$n.pack
    mv $tmp-$n.idx objects/pack/pack-$n.idx
  done
  (echo "+ $root";
   for n in $names; do echo "P $n"; done;
   echo) >>objects/info/cached-packs
  git repack -a -d
When a clone request needs to include $root, the corresponding
cached pack will be copied as-is, rather than enumerating all of
the objects that are reachable from $root.
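As an illustration of the file the script above appends to, here is a minimal reader for that layout. The format is inferred only from the shell snippet ("+ " lines for tips, "P " lines for pack names, a blank line closing each record). The `CachedPackList` and `Entry` names are hypothetical, and JGit's actual parsing may differ.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reader for the cached-packs layout written by the script above. */
final class CachedPackList {
    static final class Entry {
        final List<String> tips = new ArrayList<>();
        final List<String> packNames = new ArrayList<>();
    }

    static List<Entry> parse(String text) {
        List<Entry> out = new ArrayList<>();
        Entry cur = null;
        for (String line : text.split("\n", -1)) {
            if (line.isEmpty()) {
                cur = null;                         // blank line closes the record
                continue;
            }
            boolean isTip = line.startsWith("+ ");
            boolean isPack = line.startsWith("P ");
            if (!isTip && !isPack) {
                continue;                           // ignore anything unrecognized
            }
            if (cur == null) {
                cur = new Entry();
                out.add(cur);
            }
            String value = line.substring(2);
            if (isTip) {
                cur.tips.add(value);                // "+ <tip commit>"
            } else {
                cur.packNames.add(value);           // "P <pack name>"
            }
        }
        return out;
    }
}
```

A server could then compare the tips of each entry against the commits a clone request must include, and copy the listed packs as-is when they match.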
For a linux-2.6 kernel repository that should be about 376 MiB,
the above process creates two packs of 368 MiB and 38 MiB[1].
This is a local disk usage increase of ~26 MiB, due to reduced
delta compression between the large cached pack and the smaller
recent activity pack. The overhead is similar to 1 full copy of
the compressed project sources.
With this cached pack in hand, JGit daemon completes a clone request
in about 1m16s less time, but with a slightly larger data transfer
(+2.39 MiB):
Before:
remote: Counting objects: 1861830, done
remote: Finding sources: 100% (1861830/1861830)
remote: Getting sizes: 100% (88243/88243)
remote: Compressing objects: 100% (88184/88184)
Receiving objects: 100% (1861830/1861830), 376.01 MiB | 19.01 MiB/s, done.
remote: Total 1861830 (delta 4706), reused 1851053 (delta 1553844)
Resolving deltas: 100% (1564621/1564621), done.
real 3m19.005s
After:
remote: Counting objects: 1601, done
remote: Counting objects: 1828460, done
remote: Finding sources: 100% (50475/50475)
remote: Getting sizes: 100% (18843/18843)
remote: Compressing objects: 100% (7585/7585)
remote: Total 1861830 (delta 2407), reused 1856197 (delta 37510)
Receiving objects: 100% (1861830/1861830), 378.40 MiB | 31.31 MiB/s, done.
Resolving deltas: 100% (1559477/1559477), done.
real 2m2.938s
Repository owners can periodically refresh their cached packs by
repacking their repository, folding all newer objects into a larger
cached pack. Since repacking is already considered to be a normal
Git maintenance activity, this isn't a very big burden.
[1] In this test $root was set back about two weeks.
Change-Id: Ib87131d5c4b5e8c5cacb0f4fe16ff4ece554734b
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
Support creating pack bitmap indexes in PackWriter.
Update the PackWriter to support writing out pack bitmap indexes,
a parallel ".bitmap" file to the ".pack" file.
Bitmaps are selected at commits every 1 to 5,000 commits for
each unique path from the start. The most recent 100 commits are
all bitmapped. The next 19,000 commits have a bitmap every 100
commits. The remaining commits have a bitmap every 5,000 commits.
Commits with more than 1 parent are preferred over ones
with 1 or fewer. Furthermore, previously computed bitmaps are reused,
if the previous entry had the reuse flag set, which is set when the
bitmap was placed at the max allowed distance.
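The spacing rule can be sketched for a purely linear history. This is a simplification under stated assumptions: it ignores the per-path selection, the preference for merge commits, and bitmap reuse, and the `BitmapSpacing` class with its methods is hypothetical rather than part of PackWriter.

```java
/** Hypothetical sketch of the bitmap spacing rule on a linear history. */
final class BitmapSpacing {
    /** Distance to the next bitmapped commit, given how far we are from the tip. */
    static int spacingAt(int distanceFromTip) {
        if (distanceFromTip < 100) {
            return 1;                   // most recent 100 commits: all bitmapped
        }
        if (distanceFromTip < 100 + 19_000) {
            return 100;                 // next 19,000 commits: every 100
        }
        return 5_000;                   // remaining commits: every 5,000
    }

    /** Number of commits that receive a bitmap in a linear history of the given length. */
    static int countSelected(int historyLength) {
        int selected = 0;
        int pos = 0;
        while (pos < historyLength) {
            selected++;
            pos += spacingAt(pos);
        }
        return selected;
    }
}
```

The point of the tiered spacing is that recent history, which fetches touch most often, gets dense coverage while old history costs only a handful of bitmaps.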
Bitmaps are used to speed up the counting phase when packing, for
requests that are not shallow. The PackWriterBitmapWalker uses
a RevFilter to proactively mark commits with RevFlag.SEEN, when
they appear in a bitmap. The walker produces the full closure
of reachable ObjectIds, given the collection of starting ObjectIds.
For fetch requests, two ObjectWalks are executed to compute the
ObjectIds reachable from the haves and from the wants. The
ObjectIds needed to be written are determined by taking all the
resulting wants AND NOT the haves.
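The "wants AND NOT haves" step is a plain set difference on bitmaps. The sketch below uses `java.util.BitSet` as a stand-in, assuming each bit position identifies one object. The bitmap type JGit actually uses is not shown in this message, and `BitmapDiff` is a hypothetical name.

```java
import java.util.BitSet;

/** Hypothetical illustration of computing the objects to write for a fetch. */
final class BitmapDiff {
    /** Objects reachable from the wants that are not reachable from the haves. */
    static BitSet objectsToSend(BitSet reachableFromWants, BitSet reachableFromHaves) {
        BitSet need = (BitSet) reachableFromWants.clone(); // leave the inputs untouched
        need.andNot(reachableFromHaves);
        return need;
    }
}
```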
For clone requests, we get cached pack support for "free" since
it is possible to determine if all of the ObjectIds in a pack file
are included in the resulting list of ObjectIds to write.
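The containment test that makes whole-pack reuse possible can be sketched the same way. Again `java.util.BitSet` stands in for the real bitmap type, and `CachedPackCheck` is a hypothetical name used only for illustration.

```java
import java.util.BitSet;

/** Hypothetical check for whether a whole pack can be reused as-is. */
final class CachedPackCheck {
    /** True if every object stored in the pack is also in the set to be written. */
    static boolean packFullyIncluded(BitSet packObjects, BitSet toWrite) {
        BitSet missing = (BitSet) packObjects.clone();
        missing.andNot(toWrite);        // objects in the pack but not requested
        return missing.isEmpty();
    }
}
```

If the check passes, the pack holds nothing the client should not receive, so its bytes can be copied without enumerating the objects inside it.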
On my machine, the best times for clones and fetches of the linux
kernel repository (with about 2.6M objects and 300K commits) are
tabulated below:
Operation                     Index V2               Index VE003
Clone                         37530ms (524.06 MiB)      82ms (524.06 MiB)
Fetch (1 commit back)            75ms                  107ms
Fetch (10 commits back)         456ms (269.51 KiB)     341ms (265.19 KiB)
Fetch (100 commits back)        449ms (269.91 KiB)     337ms (267.28 KiB)
Fetch (1000 commits back)      2229ms ( 14.75 MiB)     189ms ( 14.42 MiB)
Fetch (10000 commits back)     2177ms ( 16.30 MiB)     254ms ( 15.88 MiB)
Fetch (100000 commits back)   14340ms (185.83 MiB)    1655ms (189.39 MiB)
Change-Id: Icdb0cdd66ff168917fb9ef17b96093990cc6a98d
11 years ago
Implement delta generation during packing
PackWriter now produces new deltas if there is not a suitable delta
available for reuse from an existing pack file. This permits JGit to
send less data on the wire by sending a delta relative to an object
the other side already has, instead of sending the whole object.
The delta searching algorithm is similar in style to what C Git
uses, but apparently has some differences (see below for more on this).
Briefly, objects that should be considered for delta compression are
pushed onto a list. This list is then sorted by a rough similarity
score, which is derived from the path name the object was discovered
at in the repository during object counting. The list is then
walked in order.
At each position in the list, up to $WINDOW objects prior to it
are attempted as delta bases. Each object in the window is tried,
and the shortest delta instruction sequence selects the base object.
Some rough rules are used to prevent pathological behavior during
this matching phase, like skipping pairings of objects that are
not similar enough in size.
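The window walk described above can be sketched as follows. This is a simplified illustration, not the DeltaWindow implementation. The input list is assumed to be already sorted by the path-derived similarity score, `deltaCost` is a crude stand-in for a real delta encoder, the 2x size ratio is an arbitrary threshold, and all names are hypothetical.

```java
import java.util.List;

/** Hypothetical sketch of a sliding-window delta base search. */
final class DeltaWindowSketch {
    /** Stand-in cost: bytes of target outside the common prefix and suffix with base. */
    static int deltaCost(byte[] base, byte[] target) {
        int max = Math.min(base.length, target.length);
        int prefix = 0;
        while (prefix < max && base[prefix] == target[prefix]) {
            prefix++;
        }
        int suffix = 0;
        while (suffix < max - prefix
                && base[base.length - 1 - suffix] == target[target.length - 1 - suffix]) {
            suffix++;
        }
        return target.length - prefix - suffix;
    }

    /** For each object, the index of its chosen base in the list, or -1 for none. */
    static int[] chooseBases(List<byte[]> sorted, int window) {
        int[] base = new int[sorted.size()];
        for (int i = 0; i < sorted.size(); i++) {
            byte[] target = sorted.get(i);
            int best = -1;
            int bestCost = target.length;   // a delta must beat storing the whole object
            for (int j = Math.max(0, i - window); j < i; j++) {
                byte[] cand = sorted.get(j);
                // Rough rule: skip pairs that are not similar enough in size.
                if (cand.length > 2 * target.length || target.length > 2 * cand.length) {
                    continue;
                }
                int cost = deltaCost(cand, target);
                if (cost < bestCost) {      // shortest delta selects the base
                    best = j;
                    bestCost = cost;
                }
            }
            base[i] = best;
        }
        return base;
    }
}
```

Calling `chooseBases(objects, 10)` corresponds to a window size of 10. A real implementation would also cap delta chain depth, which this sketch leaves out.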
PackWriter intentionally excludes commits and annotated tags from
this new delta search phase. In the JGit repository only 28 out
of 2600+ commits can be delta compressed by C Git. As the commit
count tends to be a fair percentage of the total number of objects
in the repository, and they generally do not delta compress well,
skipping over them can improve performance with little increase in
the output pack size.
Because this implementation was rebuilt from scratch based on my own
memory of how the packing algorithm has evolved over the years in
C Git, PackWriter, DeltaWindow, and DeltaEncoder don't use exactly
the same rules everywhere, and that leads JGit to produce different
(but logically equivalent) pack files.
Repository | Pack Size (bytes)                | Packing Time
           |     JGit -     CGit = Difference | JGit    / CGit
-----------+----------------------------------+------------------
git        | 25094348 - 24322890 = +771458    | 59.434s / 59.133s
jgit       |  5669515 -  5709046 =  -39531    |  6.654s /  6.806s
linux-2.6  |     389M -     386M =     +3M    |  20m02s /  18m01s
For the above tests pack.threads was set to 1, window size=10,
delta depth=50, and delta and object reuse were disabled for both
implementations. Both implementations were reading from an already
fully packed repository on local disk. The running time reported
is after 1 warm-up run of the tested implementation.
PackWriter is writing about 753 KiB (771,458 bytes) more data on
git.git and 3M more on linux-2.6, but is actually about 38.6 KiB
(39,531 bytes) smaller on jgit.git. Being
larger by less than 0.7% on linux-2.6 isn't bad, nor is taking an
extra 2 minutes to pack. On the running time side, JGit is at a
major disadvantage because linux-2.6 doesn't fit into the default
WindowCache of 20M, while C Git is able to mmap the entire pack and
have it available instantly in physical memory (assuming hot cache).
C Git also has a feature where it caches deltas that were created
during the compression phase, and uses those cached deltas during
the writing phase. PackWriter does not implement this (yet),
and therefore must create every delta twice. This could easily
account for the increased running time we are seeing.
Change-Id: I6292edc66c2e95fbe45b519b65fdb3918068889c
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 years ago Implement async/batch lookup of object data
An ObjectReader implementation may be very slow for a single object,
yet support bulk queries efficiently by batching multiple small
requests into a single larger request. This easily happens when the
reader is built on top of a database that is stored on another host,
as the network round-trip time starts to dominate the operation cost.
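
To make the round-trip argument concrete, here is a small sketch. It does not use JGit's actual bulk interface: RemoteStore, sizeOf, sizesOf, and the batch size are hypothetical names and values chosen for illustration, and the store counts round trips instead of performing network calls.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: compares per-object requests with batched requests. */
final class BatchLookupSketch {
    /** Hypothetical high-latency backend; each request costs one round trip. */
    static final class RemoteStore {
        private final Map<String, Long> sizes = new HashMap<>();
        int roundTrips;

        void put(String id, long size) {
            sizes.put(id, size);
        }

        /** One object per request. */
        long sizeOf(String id) {
            roundTrips++;
            return sizes.get(id);
        }

        /** Many objects per request; results come back in request order. */
        List<Long> sizesOf(List<String> ids) {
            roundTrips++;
            List<Long> out = new ArrayList<>();
            for (String id : ids) {
                out.add(sizes.get(id));
            }
            return out;
        }
    }

    /** Sums object sizes using one request per object. */
    static long totalSequential(RemoteStore store, List<String> ids) {
        long total = 0;
        for (String id : ids) {
            total += store.sizeOf(id);
        }
        return total;
    }

    /** Sums object sizes using one request per group of batchSize objects. */
    static long totalBatched(RemoteStore store, List<String> ids, int batchSize) {
        long total = 0;
        for (int i = 0; i < ids.size(); i += batchSize) {
            List<String> batch = ids.subList(i, Math.min(ids.size(), i + batchSize));
            for (long size : store.sizesOf(batch)) {
                total += size;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        RemoteStore store = new RemoteStore();
        List<String> ids = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            ids.add("obj" + i);
            store.put("obj" + i, i);
        }
        totalSequential(store, ids);
        System.out.println("sequential round trips: " + store.roundTrips);
        store.roundTrips = 0;
        totalBatched(store, ids, 4);
        System.out.println("batched round trips: " + store.roundTrips);
    }
}
```

When latency dominates, total cost tracks the number of round trips rather than the number of objects, which is why batching helps even though the same data is fetched.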
RevWalk, ObjectWalk, UploadPack and PackWriter are the first major
users of this new bulk interface, with the goal being to support an
efficient way to pack a repository for a fetch/clone client when the
source repository is stored in a high-latency storage system.
Processing the want/have lists is now done in bulk, to remove
the high costs associated with common ancestor negotiation.
PackWriter already performs object reuse selection in bulk, but it
now can also do the object size lookup and object counting phases
with higher efficiency. Actual object reuse, deltification, and
final output are still doing sequential lookups, making them a bit
more expensive to perform.
Change-Id: I4c966f84917482598012074c370b9831451404ee
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 years ago PackWriter: Support reuse of entire packs
The most expensive part of packing a repository for transport to
another system is enumerating all of the objects in the repository.
Once this gets to the size of the linux-2.6 repository (1.8 million
objects), enumeration can take several CPU minutes and costs a lot
of temporary working set memory.
Teach PackWriter to efficiently reuse an existing "cached pack"
by answering a clone request with a thin pack followed by a larger
cached pack appended to the end. This requires the repository
owner to first construct the cached pack by hand, and record the
tip commits inside of $GIT_DIR/objects/info/cached-packs:
  cd $GIT_DIR
  root=$(git rev-parse master)
  tmp=objects/.tmp-$$
  names=$(echo $root | git pack-objects --keep-true-parents --revs $tmp)
  for n in $names; do
    chmod a-w $tmp-$n.pack $tmp-$n.idx
    touch objects/pack/pack-$n.keep
    mv $tmp-$n.pack objects/pack/pack-$n.pack
    mv $tmp-$n.idx objects/pack/pack-$n.idx
  done

  (echo "+ $root";
   for n in $names; do echo "P $n"; done;
   echo) >>objects/info/cached-packs

  git repack -a -d
When a clone request needs to include $root, the corresponding
cached pack will be copied as-is, rather than enumerating all of
the objects that are reachable from $root.
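
The script above appends records to objects/info/cached-packs made of a "+ <tip commit>" line, one "P <pack name>" line per pack, and a terminating blank line. A reader for that layout might look like the sketch below. The record structure is inferred only from the script; the class and method names are hypothetical, and JGit's actual parser may treat the file differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative parser for the cached-packs layout written by the script above. */
final class CachedPacksSketch {
    /** One record: the tip commits it covers and the packs that hold them. */
    static final class Entry {
        final List<String> tips = new ArrayList<>();
        final List<String> packNames = new ArrayList<>();
    }

    /** Splits the file text into blank-line-terminated records. */
    static List<Entry> parse(String text) {
        List<Entry> entries = new ArrayList<>();
        Entry current = null;
        for (String line : text.split("\n", -1)) {
            if (line.isEmpty()) {
                // A blank line closes the record in progress, if any.
                if (current != null) {
                    entries.add(current);
                    current = null;
                }
            } else if (line.startsWith("+ ")) {
                if (current == null) {
                    current = new Entry();
                }
                current.tips.add(line.substring(2));
            } else if (line.startsWith("P ")) {
                if (current == null) {
                    current = new Entry();
                }
                current.packNames.add(line.substring(2));
            }
            // Lines with any other prefix are ignored in this sketch.
        }
        if (current != null) {
            entries.add(current); // tolerate a missing final blank line
        }
        return entries;
    }

    public static void main(String[] args) {
        String sample = "+ 1111\nP aaaa\nP bbbb\n\n";
        for (Entry e : parse(sample)) {
            System.out.println(e.tips + " " + e.packNames);
        }
    }
}
```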
For a linux-2.6 kernel repository that should be about 376 MiB,
the above process creates two packs of 368 MiB and 38 MiB[1].
This is a local disk usage increase of ~26 MiB, due to reduced
delta compression between the large cached pack and the smaller
recent activity pack. The overhead is similar to 1 full copy of
the compressed project sources.
With this cached pack in hand, JGit daemon completes a clone request
about 1m16s faster, at the cost of a slightly larger data transfer
(+2.39 MiB):
Before:
  remote: Counting objects: 1861830, done
  remote: Finding sources: 100% (1861830/1861830)
  remote: Getting sizes: 100% (88243/88243)
  remote: Compressing objects: 100% (88184/88184)
  Receiving objects: 100% (1861830/1861830), 376.01 MiB | 19.01 MiB/s, done.
  remote: Total 1861830 (delta 4706), reused 1851053 (delta 1553844)
  Resolving deltas: 100% (1564621/1564621), done.

  real 3m19.005s

After:
  remote: Counting objects: 1601, done
  remote: Counting objects: 1828460, done
  remote: Finding sources: 100% (50475/50475)
  remote: Getting sizes: 100% (18843/18843)
  remote: Compressing objects: 100% (7585/7585)
  remote: Total 1861830 (delta 2407), reused 1856197 (delta 37510)
  Receiving objects: 100% (1861830/1861830), 378.40 MiB | 31.31 MiB/s, done.
  Resolving deltas: 100% (1559477/1559477), done.

  real 2m2.938s
Repository owners can periodically refresh their cached packs by
repacking their repository, folding all newer objects into a larger
cached pack. Since repacking is already considered to be a normal
Git maintenance activity, this isn't a very big burden.
[1] In this test $root was set back about two weeks.
Change-Id: Ib87131d5c4b5e8c5cacb0f4fe16ff4ece554734b
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 years ago Implement delta generation during packing
PackWriter now produces new deltas if there is not a suitable delta
available for reuse from an existing pack file. This permits JGit to
send less data on the wire by sending a delta relative to an object
the other side already has, instead of sending the whole object.
The delta searching algorithm is similar in style to what C Git
uses, but apparently has some differences (see below for more on this).
Briefly, objects that should be considered for delta compression are
pushed onto a list. This list is then sorted by a rough similarity
score, which is derived from the path name the object was discovered
at in the repository during object counting. The list is then
walked in order.
At each position in the list, up to $WINDOW objects prior to it
are attempted as delta bases. Each object in the window is tried,
and the shortest delta instruction sequence selects the base object.
Some rough rules are used to prevent pathological behavior during
this matching phase, like skipping pairings of objects that are
not similar enough in size.
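The sort-then-window search described above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the PackWriter or DeltaWindow code: the sort key (file name, then directory), the cost model in `delta_cost` (4 bytes per copy instruction, literal bytes for inserts), and the factor-of-two size rule are all invented for the example.

```python
import difflib
from typing import Dict, List, Optional, Tuple

WINDOW = 10  # assumed window size for this sketch


def path_sort_key(path: str) -> Tuple[str, str]:
    """Rough similarity key (assumption): file name first, then directory."""
    directory, _, name = path.rpartition("/")
    return (name, directory)


def delta_cost(base: bytes, target: bytes) -> int:
    """Approximate delta size: a fixed cost per copy, literal bytes for inserts."""
    cost = 0
    matcher = difflib.SequenceMatcher(None, base, target, autojunk=False)
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            cost += 4  # assumed size of one copy instruction
        elif tag in ("replace", "insert"):
            cost += 1 + (j2 - j1)  # literal bytes stored in the delta
        # "delete" adds nothing: those base bytes are simply not copied
    return cost


def choose_bases(
    objects: List[Tuple[str, bytes]], window: int = WINDOW
) -> Dict[str, Optional[str]]:
    """Map each path to the path of its chosen delta base, or None."""
    ordered = sorted(objects, key=lambda item: path_sort_key(item[0]))
    chosen: Dict[str, Optional[str]] = {}
    for pos, (path, data) in enumerate(ordered):
        best_base, best_cost = None, len(data)  # a delta must beat the whole object
        for base_path, base_data in ordered[max(0, pos - window):pos]:
            small, large = sorted((len(base_data), len(data)))
            if small * 2 < large:
                continue  # assumed rule: sizes too different, skip the pairing
            cost = delta_cost(base_data, data)
            if cost < best_cost:
                best_base, best_cost = base_path, cost
        chosen[path] = best_base
    return chosen


objects = [
    ("src/a.txt", b"hello world, this is version one of the file\n"),
    ("docs/a.txt", b"hello world, this is version two of the file\n"),
    ("src/big.bin", bytes(range(256)) * 4),
]
bases = choose_bases(objects)
```

Here the two `a.txt` files sort next to each other, so `src/a.txt` is stored as a delta against `docs/a.txt`, while `src/big.bin` is skipped by the size rule and stays whole.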
PackWriter intentionally excludes commits and annotated tags from
this new delta search phase. In the JGit repository only 28 out
of 2600+ commits can be delta compressed by C Git. As the commit
count tends to be a fair percentage of the total number of objects
in the repository, and they generally do not delta compress well,
skipping over them can improve performance with little increase in
the output pack size.
Because this implementation was rebuilt from scratch based on my own
memory of how the packing algorithm has evolved over the years in
C Git, PackWriter, DeltaWindow, and DeltaEncoder don't use exactly
the same rules everywhere, and that leads JGit to produce different
(but logically equivalent) pack files.
Repository | Pack Size (bytes)                | Packing Time
           |     JGit -     CGit = Difference |    JGit / CGit
-----------+----------------------------------+------------------
git        | 25094348 - 24322890 = +771458    | 59.434s / 59.133s
jgit       |  5669515 -  5709046 =  -39531    |  6.654s /  6.806s
linux-2.6  |     389M -     386M =     +3M    |  20m02s /  18m01s
For the above tests pack.threads was set to 1, window size=10,
delta depth=50, and delta and object reuse were disabled for both
implementations. Both implementations were reading from an already
fully packed repository on local disk. The running time reported
is after 1 warm-up run of the tested implementation.
PackWriter is writing 771 KB more data on git.git, 3M more on
linux-2.6, but is actually 39.5 KB smaller on jgit.git. Being
larger by less than 0.7% on linux-2.6 isn't bad, nor is taking an
extra 2 minutes to pack. On the running time side, JGit is at a
major disadvantage because linux-2.6 doesn't fit into the default
WindowCache of 20M, while C Git is able to mmap the entire pack and
have it available instantly in physical memory (assuming hot cache).
C Git also has a feature where it caches deltas that were created
during the compression phase, and uses those cached deltas during
the writing phase. PackWriter does not implement this (yet),
and therefore must create every delta twice. This could easily
account for the increased running time we are seeing.
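To make the cost of creating every delta twice concrete, here is a minimal Python sketch of caching deltas between the compression and writing phases. It is not how C Git or PackWriter implements this: `DeltaCache`, `fake_delta`, and the object ids are hypothetical, and a real cache would presumably bound its memory use, which this sketch does not.

```python
from typing import Callable, Dict, List, Tuple


class DeltaCache:
    """Remember deltas computed while searching so writing can reuse them."""

    def __init__(self, make_delta: Callable[[str, str], bytes]) -> None:
        self._make_delta = make_delta
        self._cache: Dict[Tuple[str, str], bytes] = {}
        self.computed = 0  # how many times a delta was actually generated

    def get(self, base_id: str, target_id: str) -> bytes:
        key = (base_id, target_id)
        if key not in self._cache:
            self._cache[key] = self._make_delta(base_id, target_id)
            self.computed += 1
        return self._cache[key]


def fake_delta(base_id: str, target_id: str) -> bytes:
    """Hypothetical stand-in for the expensive delta encoder."""
    return f"{base_id}->{target_id}".encode()


cache = DeltaCache(fake_delta)
pairs: List[Tuple[str, str]] = [("a", "b"), ("b", "c")]

for base_id, target_id in pairs:  # compression phase: deltas are created
    cache.get(base_id, target_id)

written = [cache.get(b, t) for b, t in pairs]  # writing phase: cache hits only
```

Without the cache, the two phases would run the encoder four times for these two pairs; with it, the encoder runs twice.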
Change-Id: I6292edc66c2e95fbe45b519b65fdb3918068889c
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 роки тому Implement delta generation during packing
PackWriter now produces new deltas if there is not a suitable delta
available for reuse from an existing pack file. This permits JGit to
send less data on the wire by sending a delta relative to an object
the other side already has, instead of sending the whole object.
The delta searching algorithm is similar in style to what C Git
uses, but apparently has some differences (see below for more on).
Briefly, objects that should be considered for delta compression are
pushed onto a list. This list is then sorted by a rough similarity
score, which is derived from the path name the object was discovered
at in the repository during object counting. The list is then
walked in order.
At each position in the list, up to $WINDOW objects prior to it
are attempted as delta bases. Each object in the window is tried,
and the shortest delta instruction sequence selects the base object.
Some rough rules are used to prevent pathological behavior during
this matching phase, like skipping pairings of objects that are
not similar enough in size.
PackWriter intentionally excludes commits and annotated tags from
this new delta search phase. In the JGit repository only 28 out
of 2600+ commits can be delta compressed by C Git. As the commit
count tends to be a fair percentage of the total number of objects
in the repository, and they generally do not delta compress well,
skipping over them can improve performance with little increase in
the output pack size.
Because this implementation was rebuilt from scratch based on my own
memory of how the packing algorithm has evolved over the years in
C Git, PackWriter, DeltaWindow, and DeltaEncoder don't use exactly
the same rules everywhere, and that leads JGit to produce different
(but logically equivalent) pack files.
Repository | Pack Size (bytes) | Packing Time
| JGit - CGit = Difference | JGit / CGit
-----------+----------------------------------+-----------------
git | 25094348 - 24322890 = +771458 | 59.434s / 59.133s
jgit | 5669515 - 5709046 = - 39531 | 6.654s / 6.806s
linux-2.6 | 389M - 386M = +3M | 20m02s / 18m01s
For the above tests pack.threads was set to 1, window size=10,
delta depth=50, and delta and object reuse was disabled for both
implementations. Both implementations were reading from an already
fully packed repository on local disk. The running time reported
is after 1 warm-up run of the tested implementation.
PackWriter is writing 771 KiB more data on git.git, 3M more on
linux-2.6, but is actually 39.5 KiB smaller on jgit.git. Being
larger by less than 0.7% on linux-2.6 isn't bad, nor is taking an
extra 2 minutes to pack. On the running time side, JGit is at a
major disadvantage because linux-2.6 doesn't fit into the default
WindowCache of 20M, while C Git is able to mmap the entire pack and
have it available instantly in physical memory (assuming hot cache).
CGit also has a feature where it caches deltas that were created
during the compression phase, and uses those cached deltas during
the writing phase. PackWriter does not implement this (yet),
and therefore must create every delta twice. This could easily
account for the increased running time we are seeing.
Change-Id: I6292edc66c2e95fbe45b519b65fdb3918068889c
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 роки тому Implement delta generation during packing
PackWriter now produces new deltas if there is not a suitable delta
available for reuse from an existing pack file. This permits JGit to
send less data on the wire by sending a delta relative to an object
the other side already has, instead of sending the whole object.
The delta searching algorithm is similar in style to what C Git
uses, but apparently has some differences (see below for more on).
Briefly, objects that should be considered for delta compression are
pushed onto a list. This list is then sorted by a rough similarity
score, which is derived from the path name the object was discovered
at in the repository during object counting. The list is then
walked in order.
At each position in the list, up to $WINDOW objects prior to it
are attempted as delta bases. Each object in the window is tried,
and the shortest delta instruction sequence selects the base object.
Some rough rules are used to prevent pathological behavior during
this matching phase, like skipping pairings of objects that are
not similar enough in size.
PackWriter intentionally excludes commits and annotated tags from
this new delta search phase. In the JGit repository only 28 out
of 2600+ commits can be delta compressed by C Git. As the commit
count tends to be a fair percentage of the total number of objects
in the repository, and they generally do not delta compress well,
skipping over them can improve performance with little increase in
the output pack size.
Because this implementation was rebuilt from scratch based on my own
memory of how the packing algorithm has evolved over the years in
C Git, PackWriter, DeltaWindow, and DeltaEncoder don't use exactly
the same rules everywhere, and that leads JGit to produce different
(but logically equivalent) pack files.
Repository | Pack Size (bytes) | Packing Time
| JGit - CGit = Difference | JGit / CGit
-----------+----------------------------------+-----------------
git | 25094348 - 24322890 = +771458 | 59.434s / 59.133s
jgit | 5669515 - 5709046 = - 39531 | 6.654s / 6.806s
linux-2.6 | 389M - 386M = +3M | 20m02s / 18m01s
For the above tests pack.threads was set to 1, window size=10,
delta depth=50, and delta and object reuse was disabled for both
implementations. Both implementations were reading from an already
fully packed repository on local disk. The running time reported
is after 1 warm-up run of the tested implementation.
PackWriter is writing 771 KiB more data on git.git, 3M more on
linux-2.6, but is actually 39.5 KiB smaller on jgit.git. Being
larger by less than 0.7% on linux-2.6 isn't bad, nor is taking an
extra 2 minutes to pack. On the running time side, JGit is at a
major disadvantage because linux-2.6 doesn't fit into the default
WindowCache of 20M, while C Git is able to mmap the entire pack and
have it available instantly in physical memory (assuming hot cache).
CGit also has a feature where it caches deltas that were created
during the compression phase, and uses those cached deltas during
the writing phase. PackWriter does not implement this (yet),
and therefore must create every delta twice. This could easily
account for the increased running time we are seeing.
Change-Id: I6292edc66c2e95fbe45b519b65fdb3918068889c
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 роки тому Implement async/batch lookup of object data
An ObjectReader implementation may be very slow for a single object,
but yet support bulk queries efficiently by batching multiple small
requests into a single larger request. This easily happens when the
reader is built on top of a database that is stored on another host,
as the network round-trip time starts to dominate the operation cost.
RevWalk, ObjectWalk, UploadPack and PackWriter are the first major
users of this new bulk interface, with the goal being to support an
efficient way to pack a repository for a fetch/clone client when the
source repository is stored in a high-latency storage system.
Processing the want/have lists is now done in bulk, to remove
the high costs associated with common ancestor negotiation.
PackWriter already performs object reuse selection in bulk, but it
now can also do the object size lookup and object counting phases
with higher efficiency. Actual object reuse, deltification, and
final output are still doing sequential lookups, making them a bit
more expensive to perform.
Change-Id: I4c966f84917482598012074c370b9831451404ee
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 роки тому Implement async/batch lookup of object data
An ObjectReader implementation may be very slow for a single object,
but yet support bulk queries efficiently by batching multiple small
requests into a single larger request. This easily happens when the
reader is built on top of a database that is stored on another host,
as the network round-trip time starts to dominate the operation cost.
RevWalk, ObjectWalk, UploadPack and PackWriter are the first major
users of this new bulk interface, with the goal being to support an
efficient way to pack a repository for a fetch/clone client when the
source repository is stored in a high-latency storage system.
Processing the want/have lists is now done in bulk, to remove
the high costs associated with common ancestor negotiation.
PackWriter already performs object reuse selection in bulk, but it
now can also do the object size lookup and object counting phases
with higher efficiency. Actual object reuse, deltification, and
final output are still doing sequential lookups, making them a bit
more expensive to perform.
Change-Id: I4c966f84917482598012074c370b9831451404ee
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 роки тому Implement async/batch lookup of object data
An ObjectReader implementation may be very slow for a single object,
but yet support bulk queries efficiently by batching multiple small
requests into a single larger request. This easily happens when the
reader is built on top of a database that is stored on another host,
as the network round-trip time starts to dominate the operation cost.
RevWalk, ObjectWalk, UploadPack and PackWriter are the first major
users of this new bulk interface, with the goal being to support an
efficient way to pack a repository for a fetch/clone client when the
source repository is stored in a high-latency storage system.
Processing the want/have lists is now done in bulk, to remove
the high costs associated with common ancestor negotiation.
PackWriter already performs object reuse selection in bulk, but it
now can also do the object size lookup and object counting phases
with higher efficiency. Actual object reuse, deltification, and
final output are still doing sequential lookups, making them a bit
more expensive to perform.
Change-Id: I4c966f84917482598012074c370b9831451404ee
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 роки тому Implement delta generation during packing
PackWriter now produces new deltas if there is not a suitable delta
available for reuse from an existing pack file. This permits JGit to
send less data on the wire by sending a delta relative to an object
the other side already has, instead of sending the whole object.
The delta searching algorithm is similar in style to what C Git
uses, but apparently has some differences (see below for more on).
Briefly, objects that should be considered for delta compression are
pushed onto a list. This list is then sorted by a rough similarity
score, which is derived from the path name the object was discovered
at in the repository during object counting. The list is then
walked in order.
At each position in the list, up to $WINDOW objects prior to it
are attempted as delta bases. Each object in the window is tried,
and the shortest delta instruction sequence selects the base object.
Some rough rules are used to prevent pathological behavior during
this matching phase, like skipping pairings of objects that are
not similar enough in size.
PackWriter intentionally excludes commits and annotated tags from
this new delta search phase. In the JGit repository only 28 out
of 2600+ commits can be delta compressed by C Git. As the commit
count tends to be a fair percentage of the total number of objects
in the repository, and they generally do not delta compress well,
skipping over them can improve performance with little increase in
the output pack size.
Because this implementation was rebuilt from scratch based on my own
memory of how the packing algorithm has evolved over the years in
C Git, PackWriter, DeltaWindow, and DeltaEncoder don't use exactly
the same rules everywhere, and that leads JGit to produce different
(but logically equivalent) pack files.
Repository | Pack Size (bytes) | Packing Time
| JGit - CGit = Difference | JGit / CGit
-----------+----------------------------------+-----------------
git | 25094348 - 24322890 = +771458 | 59.434s / 59.133s
jgit | 5669515 - 5709046 = - 39531 | 6.654s / 6.806s
linux-2.6 | 389M - 386M = +3M | 20m02s / 18m01s
For the above tests pack.threads was set to 1, window size=10,
delta depth=50, and delta and object reuse was disabled for both
implementations. Both implementations were reading from an already
fully packed repository on local disk. The running time reported
is after 1 warm-up run of the tested implementation.
PackWriter is writing 771 KiB more data on git.git, 3M more on
linux-2.6, but is actually 39.5 KiB smaller on jgit.git. Being
larger by less than 0.7% on linux-2.6 isn't bad, nor is taking an
extra 2 minutes to pack. On the running time side, JGit is at a
major disadvantage because linux-2.6 doesn't fit into the default
WindowCache of 20M, while C Git is able to mmap the entire pack and
have it available instantly in physical memory (assuming hot cache).
CGit also has a feature where it caches deltas that were created
during the compression phase, and uses those cached deltas during
the writing phase. PackWriter does not implement this (yet),
and therefore must create every delta twice. This could easily
account for the increased running time we are seeing.
Change-Id: I6292edc66c2e95fbe45b519b65fdb3918068889c
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 роки тому Implement delta generation during packing
PackWriter now produces new deltas if there is not a suitable delta
available for reuse from an existing pack file. This permits JGit to
send less data on the wire by sending a delta relative to an object
the other side already has, instead of sending the whole object.
The delta searching algorithm is similar in style to what C Git
uses, but apparently has some differences (see below for more on).
Briefly, objects that should be considered for delta compression are
pushed onto a list. This list is then sorted by a rough similarity
score, which is derived from the path name the object was discovered
at in the repository during object counting. The list is then
walked in order.
At each position in the list, up to $WINDOW objects prior to it
are attempted as delta bases. Each object in the window is tried,
and the shortest delta instruction sequence selects the base object.
Some rough rules are used to prevent pathological behavior during
this matching phase, like skipping pairings of objects that are
not similar enough in size.
PackWriter intentionally excludes commits and annotated tags from
this new delta search phase. In the JGit repository only 28 out
of 2600+ commits can be delta compressed by C Git. As the commit
count tends to be a fair percentage of the total number of objects
in the repository, and they generally do not delta compress well,
skipping over them can improve performance with little increase in
the output pack size.
Because this implementation was rebuilt from scratch based on my own
memory of how the packing algorithm has evolved over the years in
C Git, PackWriter, DeltaWindow, and DeltaEncoder don't use exactly
the same rules everywhere, and that leads JGit to produce different
(but logically equivalent) pack files.
Repository | Pack Size (bytes) | Packing Time
| JGit - CGit = Difference | JGit / CGit
-----------+----------------------------------+-----------------
git | 25094348 - 24322890 = +771458 | 59.434s / 59.133s
jgit | 5669515 - 5709046 = - 39531 | 6.654s / 6.806s
linux-2.6 | 389M - 386M = +3M | 20m02s / 18m01s
For the above tests pack.threads was set to 1, window size=10,
delta depth=50, and delta and object reuse was disabled for both
implementations. Both implementations were reading from an already
fully packed repository on local disk. The running time reported
is after 1 warm-up run of the tested implementation.
PackWriter is writing 771 KiB more data on git.git, 3M more on
linux-2.6, but is actually 39.5 KiB smaller on jgit.git. Being
larger by less than 0.7% on linux-2.6 isn't bad, nor is taking an
extra 2 minutes to pack. On the running time side, JGit is at a
major disadvantage because linux-2.6 doesn't fit into the default
WindowCache of 20M, while C Git is able to mmap the entire pack and
have it available instantly in physical memory (assuming hot cache).
CGit also has a feature where it caches deltas that were created
during the compression phase, and uses those cached deltas during
the writing phase. PackWriter does not implement this (yet),
and therefore must create every delta twice. This could easily
account for the increased running time we are seeing.
Change-Id: I6292edc66c2e95fbe45b519b65fdb3918068889c
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 роки тому Implement async/batch lookup of object data
An ObjectReader implementation may be very slow for a single object,
but yet support bulk queries efficiently by batching multiple small
requests into a single larger request. This easily happens when the
reader is built on top of a database that is stored on another host,
as the network round-trip time starts to dominate the operation cost.
RevWalk, ObjectWalk, UploadPack and PackWriter are the first major
users of this new bulk interface, with the goal being to support an
efficient way to pack a repository for a fetch/clone client when the
source repository is stored in a high-latency storage system.
Processing the want/have lists is now done in bulk, to remove
the high costs associated with common ancestor negotiation.
PackWriter already performs object reuse selection in bulk, but it
now can also do the object size lookup and object counting phases
with higher efficiency. Actual object reuse, deltification, and
final output are still doing sequential lookups, making them a bit
more expensive to perform.
Change-Id: I4c966f84917482598012074c370b9831451404ee
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 роки тому Implement delta generation during packing
PackWriter now produces new deltas if there is not a suitable delta
available for reuse from an existing pack file. This permits JGit to
send less data on the wire by sending a delta relative to an object
the other side already has, instead of sending the whole object.
The delta searching algorithm is similar in style to what C Git
uses, but apparently has some differences (see below for more on).
Briefly, objects that should be considered for delta compression are
pushed onto a list. This list is then sorted by a rough similarity
score, which is derived from the path name the object was discovered
at in the repository during object counting. The list is then
walked in order.
At each position in the list, up to $WINDOW objects prior to it
are attempted as delta bases. Each object in the window is tried,
and the shortest delta instruction sequence selects the base object.
Some rough rules are used to prevent pathological behavior during
this matching phase, like skipping pairings of objects that are
not similar enough in size.
PackWriter intentionally excludes commits and annotated tags from
this new delta search phase. In the JGit repository only 28 out
of 2600+ commits can be delta compressed by C Git. As the commit
count tends to be a fair percentage of the total number of objects
in the repository, and they generally do not delta compress well,
skipping over them can improve performance with little increase in
the output pack size.
Because this implementation was rebuilt from scratch based on my own
memory of how the packing algorithm has evolved over the years in
C Git, PackWriter, DeltaWindow, and DeltaEncoder don't use exactly
the same rules everywhere, and that leads JGit to produce different
(but logically equivalent) pack files.
Repository | Pack Size (bytes) | Packing Time
| JGit - CGit = Difference | JGit / CGit
-----------+----------------------------------+-----------------
git | 25094348 - 24322890 = +771458 | 59.434s / 59.133s
jgit | 5669515 - 5709046 = - 39531 | 6.654s / 6.806s
linux-2.6 | 389M - 386M = +3M | 20m02s / 18m01s
For the above tests pack.threads was set to 1, window size=10,
delta depth=50, and delta and object reuse was disabled for both
implementations. Both implementations were reading from an already
fully packed repository on local disk. The running time reported
is after 1 warm-up run of the tested implementation.
PackWriter is writing 771 KiB more data on git.git, 3M more on
linux-2.6, but is actually 39.5 KiB smaller on jgit.git. Being
larger by less than 0.7% on linux-2.6 isn't bad, nor is taking an
extra 2 minutes to pack. On the running time side, JGit is at a
major disadvantage because linux-2.6 doesn't fit into the default
WindowCache of 20M, while C Git is able to mmap the entire pack and
have it available instantly in physical memory (assuming hot cache).
CGit also has a feature where it caches deltas that were created
during the compression phase, and uses those cached deltas during
the writing phase. PackWriter does not implement this (yet),
and therefore must create every delta twice. This could easily
account for the increased running time we are seeing.
Change-Id: I6292edc66c2e95fbe45b519b65fdb3918068889c
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 роки тому Implement delta generation during packing
PackWriter now produces new deltas if there is not a suitable delta
available for reuse from an existing pack file. This permits JGit to
send less data on the wire by sending a delta relative to an object
the other side already has, instead of sending the whole object.
The delta searching algorithm is similar in style to what C Git
uses, but apparently has some differences (see below for more on).
Briefly, objects that should be considered for delta compression are
pushed onto a list. This list is then sorted by a rough similarity
score, which is derived from the path name the object was discovered
at in the repository during object counting. The list is then
walked in order.
At each position in the list, up to $WINDOW objects prior to it
are attempted as delta bases. Each object in the window is tried,
and the shortest delta instruction sequence selects the base object.
Some rough rules are used to prevent pathological behavior during
this matching phase, like skipping pairings of objects that are
not similar enough in size.
PackWriter intentionally excludes commits and annotated tags from
this new delta search phase. In the JGit repository only 28 out
of 2600+ commits can be delta compressed by C Git. As the commit
count tends to be a fair percentage of the total number of objects
in the repository, and they generally do not delta compress well,
skipping over them can improve performance with little increase in
the output pack size.
Because this implementation was rebuilt from scratch based on my own
memory of how the packing algorithm has evolved over the years in
C Git, PackWriter, DeltaWindow, and DeltaEncoder don't use exactly
the same rules everywhere, and that leads JGit to produce different
(but logically equivalent) pack files.
Repository | Pack Size (bytes) | Packing Time
| JGit - CGit = Difference | JGit / CGit
-----------+----------------------------------+-----------------
git | 25094348 - 24322890 = +771458 | 59.434s / 59.133s
jgit | 5669515 - 5709046 = - 39531 | 6.654s / 6.806s
linux-2.6 | 389M - 386M = +3M | 20m02s / 18m01s
For the above tests pack.threads was set to 1, window size=10,
delta depth=50, and delta and object reuse was disabled for both
implementations. Both implementations were reading from an already
fully packed repository on local disk. The running time reported
is after 1 warm-up run of the tested implementation.
PackWriter is writing 771 KiB more data on git.git, 3M more on
linux-2.6, but is actually 39.5 KiB smaller on jgit.git. Being
larger by less than 0.7% on linux-2.6 isn't bad, nor is taking an
extra 2 minutes to pack. On the running time side, JGit is at a
major disadvantage because linux-2.6 doesn't fit into the default
WindowCache of 20M, while C Git is able to mmap the entire pack and
have it available instantly in physical memory (assuming hot cache).
CGit also has a feature where it caches deltas that were created
during the compression phase, and uses those cached deltas during
the writing phase. PackWriter does not implement this (yet),
and therefore must create every delta twice. This could easily
account for the increased running time we are seeing.
Change-Id: I6292edc66c2e95fbe45b519b65fdb3918068889c
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 роки тому Implement async/batch lookup of object data
An ObjectReader implementation may be very slow for a single object,
but yet support bulk queries efficiently by batching multiple small
requests into a single larger request. This easily happens when the
reader is built on top of a database that is stored on another host,
as the network round-trip time starts to dominate the operation cost.
RevWalk, ObjectWalk, UploadPack and PackWriter are the first major
users of this new bulk interface, with the goal being to support an
efficient way to pack a repository for a fetch/clone client when the
source repository is stored in a high-latency storage system.
Processing the want/have lists is now done in bulk, to remove
the high costs associated with common ancestor negotiation.
PackWriter already performs object reuse selection in bulk, but it
now can also do the object size lookup and object counting phases
with higher efficiency. Actual object reuse, deltification, and
final output are still doing sequential lookups, making them a bit
more expensive to perform.
Change-Id: I4c966f84917482598012074c370b9831451404ee
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 роки тому Implement delta generation during packing
PackWriter now produces new deltas if there is not a suitable delta
available for reuse from an existing pack file. This permits JGit to
send less data on the wire by sending a delta relative to an object
the other side already has, instead of sending the whole object.
The delta searching algorithm is similar in style to what C Git
uses, but apparently has some differences (see below for more on).
Briefly, objects that should be considered for delta compression are
pushed onto a list. This list is then sorted by a rough similarity
score, which is derived from the path name the object was discovered
at in the repository during object counting. The list is then
walked in order.
At each position in the list, up to $WINDOW objects prior to it
are attempted as delta bases. Each object in the window is tried,
and the shortest delta instruction sequence selects the base object.
Some rough rules are used to prevent pathological behavior during
this matching phase, like skipping pairings of objects that are
not similar enough in size.
PackWriter intentionally excludes commits and annotated tags from
this new delta search phase. In the JGit repository only 28 out
of 2600+ commits can be delta compressed by C Git. As the commit
count tends to be a fair percentage of the total number of objects
in the repository, and they generally do not delta compress well,
skipping over them can improve performance with little increase in
the output pack size.
Because this implementation was rebuilt from scratch based on my own
memory of how the packing algorithm has evolved over the years in
C Git, PackWriter, DeltaWindow, and DeltaEncoder don't use exactly
the same rules everywhere, and that leads JGit to produce different
(but logically equivalent) pack files.
Repository | Pack Size (bytes)                | Packing Time
           | JGit     - CGit     = Difference | JGit    / CGit
-----------+----------------------------------+-------------------
git        | 25094348 - 24322890 = +771458    | 59.434s / 59.133s
jgit       |  5669515 -  5709046 = -39531     |  6.654s /  6.806s
linux-2.6  |     389M -     386M = +3M        |  20m02s /  18m01s
For the above tests pack.threads was set to 1, window size=10,
delta depth=50, and delta and object reuse were disabled for both
implementations. Both implementations were reading from an already
fully packed repository on local disk. The running time reported
is after 1 warm-up run of the tested implementation.
PackWriter is writing about 753 KiB (771458 bytes) more data on git.git
and 3M more on linux-2.6, but is actually about 38.6 KiB (39531 bytes)
smaller on jgit.git. Being
larger by less than 0.7% on linux-2.6 isn't bad, nor is taking an
extra 2 minutes to pack. On the running time side, JGit is at a
major disadvantage because linux-2.6 doesn't fit into the default
WindowCache of 20M, while C Git is able to mmap the entire pack and
have it available instantly in physical memory (assuming hot cache).
C Git also has a feature where it caches deltas that were created
during the compression phase, and uses those cached deltas during
the writing phase. PackWriter does not implement this (yet),
and therefore must create every delta twice. This could easily
account for the increased running time we are seeing.
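
A delta cache of the kind described above might look roughly like the
following. This is a hypothetical sketch, not C Git's or JGit's code:
the class name, the string object ids, and the byte limit are
assumptions for illustration. Deltas found during the compression
phase are remembered while they fit under the limit, and the writing
phase recomputes a delta only on a cache miss.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Illustrative sketch only; not an existing JGit class. */
final class DeltaCacheSketch {
    private final Map<String, byte[]> cache = new HashMap<>();
    private final long limitBytes;
    private long usedBytes;
    int recomputeCount; // how often the writing phase had to rebuild a delta

    DeltaCacheSketch(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    /** Compression phase: keep the delta if it still fits under the limit. */
    void remember(String targetId, byte[] delta) {
        if (usedBytes + delta.length <= limitBytes
                && cache.putIfAbsent(targetId, delta) == null) {
            usedBytes += delta.length;
        }
    }

    /** Writing phase: reuse the cached delta, or recompute it on a miss. */
    byte[] deltaFor(String targetId, Supplier<byte[]> recompute) {
        byte[] cached = cache.get(targetId);
        if (cached != null) {
            return cached;
        }
        recomputeCount++;
        return recompute.get();
    }
}
```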
Change-Id: I6292edc66c2e95fbe45b519b65fdb3918068889c
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 years ago
Do not write edge objects to the pack stream
Consider two objects A->B where A uses B as a delta base, and these
are in the same source pack file ordered as "A B".
If cached packs is enabled and B is also in the cached pack that
will be appended onto the end of the thin pack, and both A, B are
supposed to be in the thin pack, PackWriter must consider the fact
that A's base B is an edge object that claims to be part of the
new pack, but is actually "external" and cannot be written first.
If the object reuse system considered B candidates first this bug
does not arise, as B will be marked as edge due to it existing in
the cached pack. When the A candidates are later examined, A sees a
valid delta base is available as an edge, and will not later try to
"write base first" during the writing phase.
However, when the reuse system considers A candidates first they
see that B will be in the outgoing pack, as it is still part of
the thin pack, and arrange for B to be written first. Later when B
switches from being in-pack to being an edge object (as it is part
of the cached pack) the pointer in A does not get its type changed
from ObjectToPack to ObjectId, so A thinks B is non-edge.
We work around this case by also checking that the delta base B
is non-edge before writing the object to the pack. Later when A
writes its object header, delta base B's ObjectToPack will have
an offset == 0, which makes isWritten() = false, and the OBJ_REF
delta format will be used for A's header. This will be resolved by
the client to the copy of B that appears in the later cached pack.
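
The workaround can be sketched roughly as follows. This is an
illustration under assumptions, not PackWriter itself: the Obj class,
its edge flag, the string header labels, and the one-byte placeholder
object size are hypothetical. An offset of 0 can stand for "not
written" because every real object offset lies after the 12-byte pack
header. Delta base cycles are not handled here.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; not JGit's PackWriter or ObjectToPack. */
final class EdgeBaseSketch {
    static final class Obj {
        final String id;
        Obj deltaBase; // null when the object is stored whole
        boolean edge;  // true when the object lives outside this pack stream
        long offset;   // 0 means not yet written

        Obj(String id) {
            this.id = id;
        }

        boolean isWritten() {
            return offset != 0;
        }
    }

    private long position = 12; // objects start after the 12-byte pack header
    final List<String> written = new ArrayList<>();

    void write(Obj obj) {
        if (obj.isWritten() || obj.edge) {
            return; // edge objects never enter the pack stream
        }
        Obj base = obj.deltaBase;
        // "Write base first" only when the base is really part of this pack.
        if (base != null && !base.edge && !base.isWritten()) {
            write(base);
        }
        obj.offset = position;
        position += 1; // placeholder for the real encoded size
        String header = base == null ? "WHOLE"
                : base.isWritten() ? "OFS_DELTA" : "REF_DELTA";
        written.add(obj.id + ":" + header);
    }
}
```

With an edge base B, writing A yields a by-id (REF_DELTA) header and B
itself is never emitted, matching the behavior described above.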
Change-Id: Ifab6bfdf3c0aa93649468f49bcf91d67f90362ca
12 years ago
Implement delta generation during packing
PackWriter now produces new deltas if there is not a suitable delta
available for reuse from an existing pack file. This permits JGit to
send less data on the wire by sending a delta relative to an object
the other side already has, instead of sending the whole object.
The delta searching algorithm is similar in style to what C Git
uses, but apparently has some differences (see below for more on this).
Briefly, objects that should be considered for delta compression are
pushed onto a list. This list is then sorted by a rough similarity
score, which is derived from the path name the object was discovered
at in the repository during object counting. The list is then
walked in order.
At each position in the list, up to $WINDOW objects prior to it
are attempted as delta bases. Each object in the window is tried,
and the shortest delta instruction sequence selects the base object.
Some rough rules are used to prevent pathological behavior during
this matching phase, like skipping pairings of objects that are
not similar enough in size.
PackWriter intentionally excludes commits and annotated tags from
this new delta search phase. In the JGit repository only 28 out
of 2600+ commits can be delta compressed by C Git. As the commit
count tends to be a fair percentage of the total number of objects
in the repository, and they generally do not delta compress well,
skipping over them can improve performance with little increase in
the output pack size.
Because this implementation was rebuilt from scratch based on my own
memory of how the packing algorithm has evolved over the years in
C Git, PackWriter, DeltaWindow, and DeltaEncoder don't use exactly
the same rules everywhere, and that leads JGit to produce different
(but logically equivalent) pack files.
Repository | Pack Size (bytes) | Packing Time
| JGit - CGit = Difference | JGit / CGit
-----------+----------------------------------+-----------------
git | 25094348 - 24322890 = +771458 | 59.434s / 59.133s
jgit | 5669515 - 5709046 = - 39531 | 6.654s / 6.806s
linux-2.6 | 389M - 386M = +3M | 20m02s / 18m01s
For the above tests pack.threads was set to 1, window size=10,
delta depth=50, and delta and object reuse was disabled for both
implementations. Both implementations were reading from an already
fully packed repository on local disk. The running time reported
is after 1 warm-up run of the tested implementation.
PackWriter is writing 771 KiB more data on git.git, 3M more on
linux-2.6, but is actually 39.5 KiB smaller on jgit.git. Being
larger by less than 0.7% on linux-2.6 isn't bad, nor is taking an
extra 2 minutes to pack. On the running time side, JGit is at a
major disadvantage because linux-2.6 doesn't fit into the default
WindowCache of 20M, while C Git is able to mmap the entire pack and
have it available instantly in physical memory (assuming hot cache).
CGit also has a feature where it caches deltas that were created
during the compression phase, and uses those cached deltas during
the writing phase. PackWriter does not implement this (yet),
and therefore must create every delta twice. This could easily
account for the increased running time we are seeing.
Change-Id: I6292edc66c2e95fbe45b519b65fdb3918068889c
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 years ago
Support creating pack bitmap indexes in PackWriter.
Update the PackWriter to support writing out pack bitmap indexes,
a parallel ".bitmap" file to the ".pack" file.
Bitmaps are selected at commits every 1 to 5,000 commits for
each unique path from the start. The most recent 100 commits are
all bitmapped. The next 19,000 commits have a bitmap every 100
commits. The remaining commits have a bitmap every 5,000 commits.
Commits with more than 1 parent are preferred over ones
with 1 or fewer. Furthermore, previously computed bitmaps are reused
if the previous entry had the reuse flag set, which is set when the
bitmap was placed at the max allowed distance.
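The spacing rule above can be sketched as a small function of a
commit's distance from the tip of its path. This is only an
illustration of the thresholds stated in this message: the class and
method names are hypothetical, and the exact boundary handling
(whether commit number 100 counts as "recent") is an assumption.

```java
/** Hypothetical sketch of the bitmap spacing policy described above. */
class BitmapSpacingSketch {
    static final int RECENT_COMMITS = 100;     // all bitmapped
    static final int MID_RANGE_COMMITS = 19_000; // one bitmap per 100
    static final int MID_SPACING = 100;
    static final int FAR_SPACING = 5_000;      // everything older

    /**
     * Returns how many commits apart bitmaps are placed at the given
     * distance from the tip (0 is the newest commit).
     */
    static int spacingFor(int distanceFromTip) {
        if (distanceFromTip < RECENT_COMMITS) {
            return 1;
        }
        if (distanceFromTip < RECENT_COMMITS + MID_RANGE_COMMITS) {
            return MID_SPACING;
        }
        return FAR_SPACING;
    }

    /** Merge commits (more than 1 parent) are preferred as bitmap positions. */
    static boolean isPreferred(int parentCount) {
        return parentCount > 1;
    }
}
```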
Bitmaps are used to speed up the counting phase when packing, for
requests that are not shallow. The PackWriterBitmapWalker uses
a RevFilter to proactively mark commits with RevFlag.SEEN, when
they appear in a bitmap. The walker produces the full closure
of reachable ObjectIds, given the collection of starting ObjectIds.
For fetch requests, two ObjectWalks are executed to compute the
ObjectIds reachable from the haves and from the wants. The
ObjectIds needed to be written are determined by taking all the
resulting wants AND NOT the haves.
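The "wants AND NOT haves" step maps directly onto a bitset
operation. The sketch below uses java.util.BitSet as a stand-in for
the bitmap type; the assumption that each object is identified by its
bit position, and the class name itself, are illustrative only.

```java
import java.util.BitSet;

/** Hypothetical sketch: objects to write = reachable(wants) AND NOT reachable(haves). */
class WantsAndNotHavesSketch {
    static BitSet objectsToSend(BitSet wants, BitSet haves) {
        // Clone first so the caller's "wants" bitmap is left unmodified.
        BitSet result = (BitSet) wants.clone();
        result.andNot(haves);
        return result;
    }
}
```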
For clone requests, we get cached pack support for "free" since
it is possible to determine if all of the ObjectIds in a pack file
are included in the resulting list of ObjectIds to write.
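The cached pack test for clones amounts to a subset check: a pack can
be sent as-is when none of its objects fall outside the set to write.
A minimal sketch, again with java.util.BitSet standing in for the
bitmap type and a hypothetical class name:

```java
import java.util.BitSet;

/** Hypothetical sketch: can a whole pack be reused for this request? */
class CachedPackCheckSketch {
    static boolean packFullyIncluded(BitSet packObjects, BitSet toWrite) {
        // Anything left after removing the requested objects is an
        // object in the pack that the client did not ask for.
        BitSet extra = (BitSet) packObjects.clone();
        extra.andNot(toWrite);
        return extra.isEmpty();
    }
}
```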
On my machine, the best times for clones and fetches of the linux
kernel repository (with about 2.6M objects and 300K commits) are
tabulated below:
Operation                    Index V2               Index VE003
Clone                        37530ms (524.06 MiB)     82ms (524.06 MiB)
Fetch (1 commit back)           75ms                 107ms
Fetch (10 commits back)        456ms (269.51 KiB)    341ms (265.19 KiB)
Fetch (100 commits back)       449ms (269.91 KiB)    337ms (267.28 KiB)
Fetch (1000 commits back)     2229ms ( 14.75 MiB)    189ms ( 14.42 MiB)
Fetch (10000 commits back)    2177ms ( 16.30 MiB)    254ms ( 15.88 MiB)
Fetch (100000 commits back)  14340ms (185.83 MiB)   1655ms (189.39 MiB)
Change-Id: Icdb0cdd66ff168917fb9ef17b96093990cc6a98d
11 years ago
Implement async/batch lookup of object data
An ObjectReader implementation may be very slow for a single object,
yet support bulk queries efficiently by batching multiple small
requests into a single larger request. This easily happens when the
reader is built on top of a database that is stored on another host,
as the network round-trip time starts to dominate the operation cost.
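A rough sketch of why batching helps: the caller hands over many ids
at once and the reader issues one request per batch rather than one
per object. This is not the ObjectReader API; BatchLookupSketch,
lookupAll, and the bulkFetch callback are hypothetical names, and the
round-trip counter exists only to make the effect visible.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of grouping single lookups into bulk requests. */
class BatchLookupSketch {
    /** Counts simulated network round trips. */
    static int roundTrips = 0;

    static <K, V> Map<K, V> lookupAll(List<K> ids, int batchSize,
            Function<List<K>, Map<K, V>> bulkFetch) {
        Map<K, V> out = new HashMap<>();
        for (int i = 0; i < ids.size(); i += batchSize) {
            // One request covers up to batchSize ids.
            List<K> batch = ids.subList(i, Math.min(i + batchSize, ids.size()));
            out.putAll(bulkFetch.apply(batch));
            roundTrips++;
        }
        return out;
    }
}
```

Looking up 5 ids with a batch size of 2 costs 3 round trips in this
sketch, where one-at-a-time lookups would cost 5.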
RevWalk, ObjectWalk, UploadPack and PackWriter are the first major
users of this new bulk interface, with the goal being to support an
efficient way to pack a repository for a fetch/clone client when the
source repository is stored in a high-latency storage system.
Processing the want/have lists is now done in bulk, to remove
the high costs associated with common ancestor negotiation.
PackWriter already performs object reuse selection in bulk, but it
now can also do the object size lookup and object counting phases
with higher efficiency. Actual object reuse, deltification, and
final output are still doing sequential lookups, making them a bit
more expensive to perform.
Change-Id: I4c966f84917482598012074c370b9831451404ee
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 years ago
PackWriter: Support reuse of entire packs
The most expensive part of packing a repository for transport to
another system is enumerating all of the objects in the repository.
Once this gets to the size of the linux-2.6 repository (1.8 million
objects), enumeration can take several CPU minutes and costs a lot
of temporary working set memory.
Teach PackWriter to efficiently reuse an existing "cached pack"
by answering a clone request with a thin pack followed by a larger
cached pack appended to the end. This requires the repository
owner to first construct the cached pack by hand, and record the
tip commits inside of $GIT_DIR/objects/info/cached-packs:
  cd $GIT_DIR
  root=$(git rev-parse master)
  tmp=objects/.tmp-$$
  names=$(echo $root | git pack-objects --keep-true-parents --revs $tmp)
  for n in $names; do
    chmod a-w $tmp-$n.pack $tmp-$n.idx
    touch objects/pack/pack-$n.keep
    mv $tmp-$n.pack objects/pack/pack-$n.pack
    mv $tmp-$n.idx objects/pack/pack-$n.idx
  done

  (echo "+ $root";
   for n in $names; do echo "P $n"; done;
   echo) >>objects/info/cached-packs

  git repack -a -d
When a clone request needs to include $root, the corresponding
cached pack will be copied as-is, rather than enumerating all of
the objects that are reachable from $root.
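Judging only from the script above, each record in cached-packs is a
run of "+ <tip>" and "P <pack name>" lines ended by a blank line. The
following sketch parses that layout; it is an assumption drawn from
the script, not a description of how JGit itself reads the file, and
the class and field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parser for the cached-packs layout written by the script above. */
class CachedPackListSketch {
    static final class Entry {
        final List<String> tips = new ArrayList<>();
        final List<String> packNames = new ArrayList<>();
    }

    static List<Entry> parse(String text) {
        List<Entry> entries = new ArrayList<>();
        Entry cur = null;
        for (String line : text.split("\n", -1)) {
            if (line.isEmpty()) {
                // A blank line closes the current record, if any.
                if (cur != null) {
                    entries.add(cur);
                    cur = null;
                }
            } else if (line.startsWith("+ ")) {
                if (cur == null) cur = new Entry();
                cur.tips.add(line.substring(2));
            } else if (line.startsWith("P ")) {
                if (cur == null) cur = new Entry();
                cur.packNames.add(line.substring(2));
            }
            // Lines with any other prefix are ignored in this sketch.
        }
        if (cur != null) entries.add(cur);
        return entries;
    }
}
```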
For a linux-2.6 kernel repository that should be about 376 MiB,
the above process creates two packs of 368 MiB and 38 MiB[1].
This is a local disk usage increase of ~26 MiB, due to reduced
delta compression between the large cached pack and the smaller
recent activity pack. The overhead is similar to 1 full copy of
the compressed project sources.
With this cached pack in hand, JGit daemon completes a clone request
in about 1m16s less time, at the cost of a slightly larger data
transfer (+2.39 MiB):
Before:
  remote: Counting objects: 1861830, done
  remote: Finding sources: 100% (1861830/1861830)
  remote: Getting sizes: 100% (88243/88243)
  remote: Compressing objects: 100% (88184/88184)
  Receiving objects: 100% (1861830/1861830), 376.01 MiB | 19.01 MiB/s, done.
  remote: Total 1861830 (delta 4706), reused 1851053 (delta 1553844)
  Resolving deltas: 100% (1564621/1564621), done.
  real 3m19.005s

After:
  remote: Counting objects: 1601, done
  remote: Counting objects: 1828460, done
  remote: Finding sources: 100% (50475/50475)
  remote: Getting sizes: 100% (18843/18843)
  remote: Compressing objects: 100% (7585/7585)
  remote: Total 1861830 (delta 2407), reused 1856197 (delta 37510)
  Receiving objects: 100% (1861830/1861830), 378.40 MiB | 31.31 MiB/s, done.
  Resolving deltas: 100% (1559477/1559477), done.
  real 2m2.938s
Repository owners can periodically refresh their cached packs by
repacking their repository, folding all newer objects into a larger
cached pack. Since repacking is already considered to be a normal
Git maintenance activity, this isn't a very big burden.
[1] In this test $root was set back about two weeks.
Change-Id: Ib87131d5c4b5e8c5cacb0f4fe16ff4ece554734b
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago
Shallow fetch: Respect "shallow" lines
When fetching from a shallow clone, the client sends "have" lines
to tell the server about objects it already has and "shallow" lines
to tell where its local history terminates. In some circumstances,
the server fails to honor the shallow lines and fails to return
objects that the client needs.
UploadPack passes the "have" lines to PackWriter so PackWriter can
omit them from the generated pack. UploadPack processes "shallow"
lines by calling RevWalk.assumeShallow() with the set of shallow
commits. RevWalk creates and caches RevCommits for these shallow
commits, clearing out their parents. That way, walks correctly
terminate at the shallow commits instead of assuming the client has
history going back behind them. UploadPack converts its RevWalk to an
ObjectWalk, maintaining the cached RevCommits, and passes it to
PackWriter.
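The effect of clearing the parents of shallow commits can be shown
with a toy history walk. This is not RevWalk: commits are plain
strings, the parent map is supplied by the caller, and every name in
the sketch is hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: a walk that stops at commits marked shallow. */
class ShallowWalkSketch {
    static Set<String> reachable(String start,
            Map<String, List<String>> parents, Set<String> shallow) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> todo = new ArrayDeque<>();
        todo.push(start);
        while (!todo.isEmpty()) {
            String commit = todo.pop();
            if (!seen.add(commit)) {
                continue; // already visited
            }
            if (shallow.contains(commit)) {
                continue; // parents are treated as cleared
            }
            for (String parent : parents.getOrDefault(commit, List.of())) {
                todo.push(parent);
            }
        }
        return seen;
    }
}
```

For the history A <-- B <-- C <-- D with C marked shallow, a walk
from D visits only D and C. Without the shallow set it would continue
to B and A, which is the wrong assumption about what the client has.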
Unfortunately, to support shallow fetches the PackWriter does the
following:
  if (shallowPack && !(walk instanceof DepthWalk.ObjectWalk))
    walk = new DepthWalk.ObjectWalk(reader, depth);
That is, when the client sends a "deepen" line (fetch --depth=<n>)
and the caller has not passed in a DepthWalk.ObjectWalk, PackWriter
throws away the RevWalk that was passed in and makes a new one. The
cleared parent lists prepared by RevWalk.assumeShallow() are lost.
Fortunately UploadPack intends to pass in a DepthWalk.ObjectWalk.
It tries to create it by calling toObjectWalkWithSameObjects() on
a DepthWalk.RevWalk. But it doesn't work: because DepthWalk.RevWalk
does not override the standard RevWalk#toObjectWalkWithSameObjects
implementation, the result is a plain ObjectWalk instead of an
instance of DepthWalk.ObjectWalk.
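The pitfall described here is general: a conversion method inherited
from a base class returns the base result type unless the subclass
overrides it. The sketch below reproduces that shape with stand-in
classes. None of these are JGit types, and the "fixed" variant shows
only one plausible remedy under that assumption.

```java
/** Hypothetical stand-ins illustrating the missing-override pitfall. */
class WalkConversionSketch {
    static class Walk {
        /** Base conversion: always yields a plain ObjWalk. */
        ObjWalk toObjWalk() {
            return new ObjWalk();
        }
    }

    static class ObjWalk extends Walk {
    }

    static class DepthObjWalk extends ObjWalk {
        final int depth;

        DepthObjWalk(int depth) {
            this.depth = depth;
        }
    }

    /** Inherits toObjWalk(), so the depth information is dropped. */
    static class DepthWalkBroken extends Walk {
        final int depth;

        DepthWalkBroken(int depth) {
            this.depth = depth;
        }
    }

    /** Overrides the conversion so the depth-aware type is preserved. */
    static class DepthWalkFixed extends Walk {
        final int depth;

        DepthWalkFixed(int depth) {
            this.depth = depth;
        }

        @Override
        ObjWalk toObjWalk() {
            return new DepthObjWalk(depth);
        }
    }

    /** Mirrors the instanceof test a consumer might apply to the result. */
    static boolean keepsDepth(Walk walk) {
        return walk.toObjWalk() instanceof DepthObjWalk;
    }
}
```

Here keepsDepth() is false for DepthWalkBroken and true for
DepthWalkFixed, so a consumer that replaces any non-depth-aware walk
would discard the state carried by the broken variant.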
The result is that the "shallow" information is thrown away and
objects reachable from the shallow commits can be omitted from the
pack sent when fetching with --depth from a shallow clone.
Multiple factors collude to limit the circumstances under which this
bug can be observed:
1. Commits with depth != 0 don't enter DepthGenerator's pending queue.
   That means a "have" cannot have any effect on DepthGenerator unless
   it is also a "want".
2. DepthGenerator#next() doesn't call carryFlagsImpl(), so the
   uninteresting flag is not propagated to ancestors there even if a
   "have" is also a "want".
3. JGit treats a depth of 1 as "1 past the wants".
Because of (2), the only place the UNINTERESTING flag can leak to a
shallow commit's parents is in the carryFlags() call from
markUninteresting(). carryFlags() only traverses commits that have
already been parsed: commits yet to be parsed are supposed to inherit
correct flags from their parent in PendingGenerator#next (which
doesn't happen here --- that is (2)). So the list of commits that have
already been parsed becomes relevant.
When we hit the markUninteresting() call, all "want"s, "have"s, and
commits to be unshallowed have been parsed. carryFlags() only
affects the parsed commits. If the "want" is a direct parent of a
"have", then carryFlags() marks it as uninteresting. If the "have"
was also a "shallow", then its parent pointer should have been null
and the "want" shouldn't have been marked, so we see the bug. If the
"want" is a more distant ancestor then (2) keeps the uninteresting
state from propagating to the "want" and we don't see the bug. If the
"shallow" is not also a "have" then the shallow commit isn't parsed
so (2) keeps the uninteresting state from propagating to the "want",
so we don't see the bug.
Here is a reproduction case (time flowing left to right, arrows
pointing to parents). "C" must be a commit that the client
reports as a "have" during negotiation. That can only happen if the
server reports it as an existing branch or tag in the first round of
negotiation:
  A <-- B <-- C <-- D

First do

  git clone --depth 1 <repo>

which yields D as a "have" and C as a "shallow" commit. Then try

  git fetch --depth 1 <repo> B:refs/heads/B

Negotiation sets up: have D, shallow C, have C, want B.
But due to this bug B is marked as uninteresting and is not sent.
Change-Id: I6e14b57b2f85e52d28cdcf356df647870f475440
Signed-off-by: Terry Parker <tparker@google.com>
7 years ago
PackWriter: Support reuse of entire packs
The most expensive part of packing a repository for transport to
another system is enumerating all of the objects in the repository.
Once this gets to the size of the linux-2.6 repository (1.8 million
objects), enumeration can take several CPU minutes and costs a lot
of temporary working set memory.
Teach PackWriter to efficiently reuse an existing "cached pack"
by answering a clone request with a thin pack followed by a larger
cached pack appended to the end. This requires the repository
owner to first construct the cached pack by hand, and record the
tip commits inside of $GIT_DIR/objects/info/cached-packs:
  cd $GIT_DIR
  root=$(git rev-parse master)
  tmp=objects/.tmp-$$
  names=$(echo $root | git pack-objects --keep-true-parents --revs $tmp)
  for n in $names; do
    chmod a-w $tmp-$n.pack $tmp-$n.idx
    touch objects/pack/pack-$n.keep
    mv $tmp-$n.pack objects/pack/pack-$n.pack
    mv $tmp-$n.idx objects/pack/pack-$n.idx
  done
  (echo "+ $root";
   for n in $names; do echo "P $n"; done;
   echo) >>objects/info/cached-packs
  git repack -a -d
When a clone request needs to include $root, the corresponding
cached pack will be copied as-is, rather than enumerating all of
the objects that are reachable from $root.
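The records that the script appends to objects/info/cached-packs can be read back with very little code. The following is a minimal sketch, assuming only what the script itself shows: a record is one or more "+ <tip>" lines, one or more "P <pack name>" lines, and a terminating blank line. The class and method names are hypothetical and this is not JGit's own parser.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative reader for the cached-packs list written by the script above. */
final class CachedPackListSketch {
    /** One record: tip commits ("+" lines) and pack names ("P" lines). */
    static final class Entry {
        final List<String> tips = new ArrayList<>();
        final List<String> packNames = new ArrayList<>();
    }

    /** Splits the file text into records; a blank line ends a record. */
    static List<Entry> parse(String text) {
        List<Entry> entries = new ArrayList<>();
        Entry current = null;
        for (String line : text.split("\n", -1)) {
            if (line.isEmpty()) {
                if (current != null) {
                    entries.add(current);
                    current = null;
                }
                continue;
            }
            if (current == null) {
                current = new Entry();
            }
            if (line.startsWith("+ ")) {
                current.tips.add(line.substring(2));
            } else if (line.startsWith("P ")) {
                current.packNames.add(line.substring(2));
            }
            // Any other line type is ignored in this sketch.
        }
        if (current != null) {
            entries.add(current); // file did not end with a blank line
        }
        return entries;
    }
}
```

A server could then check whether a requested tip appears in some entry's tips and, if so, stream the named packs as-is instead of enumerating the objects.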
For a linux-2.6 kernel repository that should be about 376 MiB,
the above process creates two packs of 368 MiB and 38 MiB[1].
This is a local disk usage increase of ~26 MiB, due to reduced
delta compression between the large cached pack and the smaller
recent activity pack. The overhead is similar to 1 full copy of
the compressed project sources.
With this cached pack in hand, JGit daemon completes a clone request
about 1m16s faster, at the cost of a slightly larger data transfer
(+2.39 MiB):
Before:
remote: Counting objects: 1861830, done
remote: Finding sources: 100% (1861830/1861830)
remote: Getting sizes: 100% (88243/88243)
remote: Compressing objects: 100% (88184/88184)
Receiving objects: 100% (1861830/1861830), 376.01 MiB | 19.01 MiB/s, done.
remote: Total 1861830 (delta 4706), reused 1851053 (delta 1553844)
Resolving deltas: 100% (1564621/1564621), done.
real 3m19.005s
After:
remote: Counting objects: 1601, done
remote: Counting objects: 1828460, done
remote: Finding sources: 100% (50475/50475)
remote: Getting sizes: 100% (18843/18843)
remote: Compressing objects: 100% (7585/7585)
remote: Total 1861830 (delta 2407), reused 1856197 (delta 37510)
Receiving objects: 100% (1861830/1861830), 378.40 MiB | 31.31 MiB/s, done.
Resolving deltas: 100% (1559477/1559477), done.
real 2m2.938s
Repository owners can periodically refresh their cached packs by
repacking their repository, folding all newer objects into a larger
cached pack. Since repacking is already considered to be a normal
Git maintenance activity, this isn't a very big burden.
[1] In this test $root was set back about two weeks.
Change-Id: Ib87131d5c4b5e8c5cacb0f4fe16ff4ece554734b
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago Implement async/batch lookup of object data
An ObjectReader implementation may be very slow for a single object,
yet support bulk queries efficiently by batching multiple small
requests into a single larger request. This easily happens when the
reader is built on top of a database that is stored on another host,
as the network round-trip time starts to dominate the operation cost.
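The effect of batching on round trips can be illustrated with a small model. This is a sketch under stated assumptions, not JGit's ObjectReader API: RemoteStore, fetchSizes, and lookupAll are hypothetical names, and the "network" is simulated by a counter that increments once per request.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class BatchLookupSketch {
    /** Hypothetical high-latency store; each call costs one round trip. */
    static final class RemoteStore {
        private final Map<String, Long> sizes;
        int roundTrips = 0;

        RemoteStore(Map<String, Long> sizes) {
            this.sizes = sizes;
        }

        /** Answers any number of ids in a single request. */
        Map<String, Long> fetchSizes(List<String> ids) {
            roundTrips++;
            Map<String, Long> out = new LinkedHashMap<>();
            for (String id : ids) {
                Long size = sizes.get(id);
                if (size != null) {
                    out.put(id, size);
                }
            }
            return out;
        }
    }

    /** Looks up all ids using requests of at most batchSize ids each. */
    static Map<String, Long> lookupAll(RemoteStore store, List<String> ids, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        Map<String, Long> result = new LinkedHashMap<>();
        for (int i = 0; i < ids.size(); i += batchSize) {
            int end = Math.min(i + batchSize, ids.size());
            result.putAll(store.fetchSizes(ids.subList(i, end)));
        }
        return result;
    }
}
```

With a batch size of 1 the number of round trips equals the number of objects; larger batches divide that count, which is where the savings come from when latency dominates.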
RevWalk, ObjectWalk, UploadPack and PackWriter are the first major
users of this new bulk interface, with the goal being to support an
efficient way to pack a repository for a fetch/clone client when the
source repository is stored in a high-latency storage system.
Processing the want/have lists is now done in bulk, to remove
the high costs associated with common ancestor negotiation.
PackWriter already performs object reuse selection in bulk, but it
now can also do the object size lookup and object counting phases
with higher efficiency. Actual object reuse, deltification, and
final output are still doing sequential lookups, making them a bit
more expensive to perform.
Change-Id: I4c966f84917482598012074c370b9831451404ee
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
14 years ago PackWriter: Make thin packs more efficient
There is no point in pushing all of the files within the edge
commits into the delta search when making a thin pack. This floods
the delta search window with objects that are unlikely to be useful
bases for the objects that will be written out, resulting in lower
data compression and higher transfer sizes.
Instead observe the path of a tree or blob that is being pushed
into the outgoing set, and use that path to locate up to WINDOW
ancestor versions from the edge commits. Push only those objects
into the edgeObjects set, reducing the number of objects seen by the
search window. This allows PackWriter to only look at ancestors
for the modified files, rather than all files in the project.
Limiting the search to WINDOW size makes sense, because more than
WINDOW edge objects will just skip through the window search as
none of them need to be delta compressed.
To further improve compression, sort edge objects into the front
of the window list, rather than randomly throughout. This puts
non-edges later in the window and gives them a better chance at
finding their base, since they search backwards through the window.
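The two ideas above (limit edge objects to at most WINDOW ancestor versions per modified path, and place edges ahead of the objects being written) can be sketched as follows. This is a simplified illustration, not PackWriter's code: the Obj type, the edgeVersionsByPath map, and buildSearchList are hypothetical, and the real sort uses more keys than "edges first".

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class ThinPackEdgeSketch {
    /** Simplified stand-in for an object queued for the delta search. */
    static final class Obj {
        final String id;
        final String path;
        final boolean edge;

        Obj(String id, String path, boolean edge) {
            this.id = id;
            this.path = path;
            this.edge = edge;
        }
    }

    /**
     * Builds the delta search list: for every path touched by an outgoing
     * object, take at most `window` versions of that path from the edge
     * commits, then put all edges in front of the outgoing objects.
     */
    static List<Obj> buildSearchList(List<Obj> outgoing,
            Map<String, List<String>> edgeVersionsByPath, int window) {
        List<Obj> edges = new ArrayList<>();
        Set<String> seenPaths = new HashSet<>();
        for (Obj o : outgoing) {
            if (!seenPaths.add(o.path)) {
                continue; // edge versions for this path were already added
            }
            List<String> versions = edgeVersionsByPath.getOrDefault(o.path, List.of());
            int n = Math.min(window, versions.size());
            for (int i = 0; i < n; i++) {
                edges.add(new Obj(versions.get(i), o.path, true));
            }
        }
        List<Obj> result = new ArrayList<>(edges);
        result.addAll(outgoing); // non-edges come later and search backwards
        return result;
    }
}
```

Paths that appear only in the edge commits never enter the list, which is what keeps unmodified files out of the search window.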
These changes make a significant difference in the thin-pack:
Before:
remote: Counting objects: 144190, done
remote: Finding sources: 100% (50275/50275)
remote: Getting sizes: 100% (101405/101405)
remote: Compressing objects: 100% (7587/7587)
Receiving objects: 100% (50275/50275), 24.67 MiB | 9.90 MiB/s, done.
Resolving deltas: 100% (40339/40339), completed with 2218 local objects.
real 0m30.267s
After:
remote: Counting objects: 61549, done
remote: Finding sources: 100% (50275/50275)
remote: Getting sizes: 100% (18862/18862)
remote: Compressing objects: 100% (7588/7588)
Receiving objects: 100% (50275/50275), 11.04 MiB | 3.51 MiB/s, done.
Resolving deltas: 100% (43160/43160), completed with 5014 local objects.
real 0m22.170s
The resulting pack is 13.63 MiB smaller, even though it contains
exactly the same objects. 82,543 fewer objects had to have their sizes
looked up, which saved about 8s of server CPU time. 2,796 more
objects from the client were used as part of the base object set,
which contributed to the smaller transfer size.
Change-Id: Id01271950432c6960897495b09deab70e33993a9
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>
13 years ago Teach PackWriter how to reuse an existing object list
Counting the objects needed for packing is the most expensive part of
an UploadPack request that has no uninteresting objects (otherwise
known as an initial clone). During this phase the PackWriter is
enumerating the entire set of objects in this repository, so they can
be sent to the client for their new clone.
Allow the ObjectReader (and therefore the underlying storage system)
to keep a cached list of all reachable objects from a small number of
points in the project's history. If one of those points is reached
during enumeration of the commit graph, most objects are obtained from
the cached list instead of direct traversal.
PackWriter uses the list by discarding the current object lists and
restarting a traversal from all refs but marking the object list name
as uninteresting. This allows PackWriter to enumerate all objects
that are more recent than the list creation, or that were on side
branches that the list does not include.
However, ObjectWalk tags all of the trees and commits within the list
commit as UNINTERESTING, which would normally cause PackWriter to
construct a thin pack that excludes these objects. To avoid that,
addObject() was refactored to allow this list-based enumeration to
always include an object, even if it has been tagged UNINTERESTING by
the ObjectWalk. This implies the list-based enumeration may only be
used for initial clones, where all objects are being sent.
The UNINTERESTING labeling occurs because StartGenerator always
enables the BoundaryGenerator if the walker is an ObjectWalk and a
commit was marked UNINTERESTING, even if RevSort.BOUNDARY was not
enabled. This is a reasonable default behavior for an ObjectWalk,
but isn't desired here in PackWriter with the list-based enumeration.
Rather than trying to change all of this behavior, PackWriter works
around it.
Because the list name commit's immediate files and trees were all
enumerated before the list enumeration itself starts (and are also
within the list itself) PackWriter runs the risk of adding the same
objects to its ObjectIdSubclassMap twice. Since this breaks the
internal map data structure (and also may cause the object to transmit
twice), PackWriter needs to use a new "added" RevFlag to track whether
or not an object has been put into the outgoing list yet.
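The role of the "added" flag can be shown with a small stand-in. This is a minimal sketch, assuming object ids are plain strings and using a HashSet where JGit attaches a RevFlag to the object itself; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class AddedFlagSketch {
    // Stands in for the "added" flag: ids that are already queued.
    private final Set<String> added = new HashSet<>();
    private final List<String> outgoing = new ArrayList<>();

    /** Queues the object once; returns false if it was already queued. */
    boolean addObject(String id) {
        if (!added.add(id)) {
            return false; // seen before, e.g. during the direct traversal
        }
        outgoing.add(id);
        return true;
    }

    /** Objects in the order they were first queued. */
    List<String> outgoing() {
        return outgoing;
    }
}
```

An object reached first through the direct traversal and again through the cached list is therefore queued, and transmitted, only once.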
Change-Id: Ie99ed4d969a6bb20cc2528ac6b8fb91043cee071
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago Support creating pack bitmap indexes in PackWriter.
Update the PackWriter to support writing out pack bitmap indexes,
a parallel ".bitmap" file to the ".pack" file.
Bitmaps are selected at commits every 1 to 5,000 commits for
each unique path from the start. The most recent 100 commits are
all bitmapped. The next 19,000 commits have a bitmap every 100
commits. The remaining commits have a bitmap every 5,000 commits.
Commits with more than 1 parent are preferred over ones
with 1 or fewer. Furthermore, previously computed bitmaps are reused,
if the previous entry had the reuse flag set, which is set when the
bitmap was placed at the max allowed distance.
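The spacing rule can be approximated as a function of a commit's distance from the tip. This is a rough sketch only, assuming a single linear history: it ignores the per-path handling, the preference for merge commits, and bitmap reuse described above, and the class and method names are hypothetical.

```java
final class BitmapSpacingSketch {
    /**
     * Returns true if a commit at the given distance from the tip
     * (0 = tip) would get a bitmap under the simplified spacing rule.
     */
    static boolean isCandidate(int distanceFromTip) {
        if (distanceFromTip < 0) {
            throw new IllegalArgumentException("distance must be >= 0");
        }
        if (distanceFromTip < 100) {
            return true; // the most recent 100 commits are all bitmapped
        }
        if (distanceFromTip < 100 + 19_000) {
            return distanceFromTip % 100 == 0; // next 19,000: every 100
        }
        return distanceFromTip % 5_000 == 0; // remainder: every 5,000
    }

    /** Counts bitmapped commits in a linear history of the given length. */
    static int countCandidates(int totalCommits) {
        int n = 0;
        for (int d = 0; d < totalCommits; d++) {
            if (isCandidate(d)) {
                n++;
            }
        }
        return n;
    }
}
```

The point of the spacing is that the number of bitmaps grows far more slowly than the number of commits, while recent history, which fetches touch most often, stays densely covered.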
Bitmaps are used to speed up the counting phase when packing, for
requests that are not shallow. The PackWriterBitmapWalker uses
a RevFilter to proactively mark commits with RevFlag.SEEN, when
they appear in a bitmap. The walker produces the full closure
of reachable ObjectIds, given the collection of starting ObjectIds.
For fetch requests, two ObjectWalks are executed to compute the
ObjectIds reachable from the haves and from the wants. The
ObjectIds needed to be written are determined by taking all the
resulting wants AND NOT the haves.
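The "wants AND NOT haves" step is a plain set difference on bitmaps. The sketch below uses java.util.BitSet for illustration, assuming each bit position stands for one object in the pack; it does not use JGit's own bitmap classes, and the class and method names are hypothetical.

```java
import java.util.BitSet;

final class WantsAndNotHavesSketch {
    /**
     * Returns the objects reachable from the wants that are not
     * reachable from the haves. Neither input is modified.
     */
    static BitSet objectsToWrite(BitSet wants, BitSet haves) {
        BitSet result = (BitSet) wants.clone();
        result.andNot(haves); // clear every bit the client already has
        return result;
    }
}
```

For example, if the wants cover objects 0 through 5 and the haves cover objects 0 through 3, only objects 4 and 5 remain to be written.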
For clone requests, we get cached pack support for "free" since
it is possible to determine if all of the ObjectIds in a pack file
are included in the resulting list of ObjectIds to write.
On my machine, the best times for clones and fetches of the linux
kernel repository (with about 2.6M objects and 300K commits) are
tabulated below:
Operation                     Index V2                 Index VE003
Clone                         37530ms (524.06 MiB)        82ms (524.06 MiB)
Fetch (1 commit back)            75ms                     107ms
Fetch (10 commits back)         456ms (269.51 KiB)       341ms (265.19 KiB)
Fetch (100 commits back)        449ms (269.91 KiB)       337ms (267.28 KiB)
Fetch (1000 commits back)      2229ms ( 14.75 MiB)       189ms ( 14.42 MiB)
Fetch (10000 commits back)     2177ms ( 16.30 MiB)       254ms ( 15.88 MiB)
Fetch (100000 commits back)   14340ms (185.83 MiB)      1655ms (189.39 MiB)
Change-Id: Icdb0cdd66ff168917fb9ef17b96093990cc6a98d
11 роки тому Support creating pack bitmap indexes in PackWriter.
Update the PackWriter to support writing out pack bitmap indexes,
a parallel ".bitmap" file to the ".pack" file.
Bitmaps are selected at commits every 1 to 5,000 commits for
each unique path from the start. The most recent 100 commits are
all bitmapped. The next 19,000 commits have a bitmaps every 100
commits. The remaining commits have a bitmap every 5,000 commits.
Commits with more than 1 parent are prefered over ones
with 1 or less. Furthermore, previously computed bitmaps are reused,
if the previous entry had the reuse flag set, which is set when the
bitmap was placed at the max allowed distance.
Bitmaps are used to speed up the counting phase when packing, for
requests that are not shallow. The PackWriterBitmapWalker uses
a RevFilter to proactively mark commits with RevFlag.SEEN, when
they appear in a bitmap. The walker produces the full closure
of reachable ObjectIds, given the collection of starting ObjectIds.
For fetch request, two ObjectWalks are executed to compute the
ObjectIds reachable from the haves and from the wants. The
ObjectIds needed to be written are determined by taking all the
resulting wants AND NOT the haves.
For clone requests, we get cached pack support for "free" since
it is possible to determine if all of the ObjectIds in a pack file
are included in the resulting list of ObjectIds to write.
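The containment test for clones might look like the following sketch, again assuming one bit per object and using hypothetical names: a pack can be copied whole only if none of its objects fall outside the set to be written.

```java
import java.util.BitSet;

final class CachedPackCheckSketch {
    /** True if every object stored in the pack is also in the set of objects to write. */
    static boolean canReuseWholePack(BitSet objectsInPack, BitSet objectsToWrite) {
        BitSet missing = (BitSet) objectsInPack.clone();
        missing.andNot(objectsToWrite); // what the pack holds that the request does not need
        return missing.isEmpty();
    }
}
```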
On my machine, the best times for clones and fetches of the linux
kernel repository (with about 2.6M objects and 300K commits) are
tabulated below:
Operation                    Index V2               Index VE003
Clone                        37530ms (524.06 MiB)     82ms (524.06 MiB)
Fetch (1 commit back)           75ms                 107ms
Fetch (10 commits back)        456ms (269.51 KiB)    341ms (265.19 KiB)
Fetch (100 commits back)       449ms (269.91 KiB)    337ms (267.28 KiB)
Fetch (1000 commits back)     2229ms ( 14.75 MiB)    189ms ( 14.42 MiB)
Fetch (10000 commits back)    2177ms ( 16.30 MiB)    254ms ( 15.88 MiB)
Fetch (100000 commits back)  14340ms (185.83 MiB)   1655ms (189.39 MiB)
Change-Id: Icdb0cdd66ff168917fb9ef17b96093990cc6a98d
11 years ago PackWriter: Support reuse of entire packs
The most expensive part of packing a repository for transport to
another system is enumerating all of the objects in the repository.
Once this gets to the size of the linux-2.6 repository (1.8 million
objects), enumeration can take several CPU minutes and costs a lot
of temporary working set memory.
Teach PackWriter to efficiently reuse an existing "cached pack"
by answering a clone request with a thin pack followed by a larger
cached pack appended to the end. This requires the repository
owner to first construct the cached pack by hand, and record the
tip commits inside of $GIT_DIR/objects/info/cached-packs:
  cd $GIT_DIR
  root=$(git rev-parse master)
  tmp=objects/.tmp-$$
  names=$(echo $root | git pack-objects --keep-true-parents --revs $tmp)
  for n in $names; do
    chmod a-w $tmp-$n.pack $tmp-$n.idx
    touch objects/pack/pack-$n.keep
    mv $tmp-$n.pack objects/pack/pack-$n.pack
    mv $tmp-$n.idx objects/pack/pack-$n.idx
  done
  (echo "+ $root";
   for n in $names; do echo "P $n"; done;
   echo) >>objects/info/cached-packs
  git repack -a -d
When a clone request needs to include $root, the corresponding
cached pack will be copied as-is, rather than enumerating all of
the objects that are reachable from $root.
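Reading the records appended by the script above could be sketched as follows. The format is inferred only from that script ("+ " lines name tip commits, "P " lines name packs, and a blank line ends a record); the class and field names are hypothetical and this is not JGit's actual parser.

```java
import java.util.ArrayList;
import java.util.List;

final class CachedPacksFileSketch {
    /** One record: the tip commits it covers and the names of the packs that hold them. */
    static final class Entry {
        final List<String> tips = new ArrayList<>();
        final List<String> packNames = new ArrayList<>();
    }

    static List<Entry> parse(String content) {
        List<Entry> entries = new ArrayList<>();
        Entry current = null;
        for (String line : content.split("\n", -1)) {
            if (line.isEmpty()) {
                // A blank line closes the record that is currently open, if any.
                if (current != null) {
                    entries.add(current);
                    current = null;
                }
            } else if (line.startsWith("+ ")) {
                if (current == null)
                    current = new Entry();
                current.tips.add(line.substring(2));
            } else if (line.startsWith("P ")) {
                if (current == null)
                    current = new Entry();
                current.packNames.add(line.substring(2));
            }
            // Any other line is ignored in this sketch.
        }
        if (current != null)
            entries.add(current); // file ended without a trailing blank line
        return entries;
    }
}
```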
For a linux-2.6 kernel repository that should be about 376 MiB,
the above process creates two packs of 368 MiB and 38 MiB[1].
This is a local disk usage increase of ~26 MiB, due to reduced
delta compression between the large cached pack and the smaller
recent activity pack. The overhead is similar to 1 full copy of
the compressed project sources.
With this cached pack in hand, the JGit daemon completes a clone request
about 1m16s faster, at the cost of a slightly larger data transfer (+2.39 MiB):
Before:
remote: Counting objects: 1861830, done
remote: Finding sources: 100% (1861830/1861830)
remote: Getting sizes: 100% (88243/88243)
remote: Compressing objects: 100% (88184/88184)
Receiving objects: 100% (1861830/1861830), 376.01 MiB | 19.01 MiB/s, done.
remote: Total 1861830 (delta 4706), reused 1851053 (delta 1553844)
Resolving deltas: 100% (1564621/1564621), done.
real 3m19.005s
After:
remote: Counting objects: 1601, done
remote: Counting objects: 1828460, done
remote: Finding sources: 100% (50475/50475)
remote: Getting sizes: 100% (18843/18843)
remote: Compressing objects: 100% (7585/7585)
remote: Total 1861830 (delta 2407), reused 1856197 (delta 37510)
Receiving objects: 100% (1861830/1861830), 378.40 MiB | 31.31 MiB/s, done.
Resolving deltas: 100% (1559477/1559477), done.
real 2m2.938s
Repository owners can periodically refresh their cached packs by
repacking their repository, folding all newer objects into a larger
cached pack. Since repacking is already considered to be a normal
Git maintenance activity, this isn't a very big burden.
[1] In this test $root was set back about two weeks.
Change-Id: Ib87131d5c4b5e8c5cacb0f4fe16ff4ece554734b
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
13 years ago PackWriter: Make thin packs more efficient
There is no point in pushing all of the files within the edge
commits into the delta search when making a thin pack. This floods
the delta search window with objects that are unlikely to be useful
bases for the objects that will be written out, resulting in lower
data compression and higher transfer sizes.
Instead observe the path of a tree or blob that is being pushed
into the outgoing set, and use that path to locate up to WINDOW
ancestor versions from the edge commits. Push only those objects
into the edgeObjects set, reducing the number of objects seen by the
search window. This allows PackWriter to only look at ancestors
for the modified files, rather than all files in the project.
Limiting the search to WINDOW size makes sense, because more than
WINDOW edge objects will just skip through the window search as
none of them need to be delta compressed.
To further improve compression, sort edge objects into the front
of the window list, rather than randomly throughout. This puts
non-edges later in the window and gives them a better chance at
finding their base, since they search backwards through the window.
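The two ideas above, capping the edge objects per path at WINDOW and sorting edges to the front of the list, can be sketched like this. The Candidate type and method names are hypothetical; the real ordering in PackWriter also takes other factors into account that this sketch leaves out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class EdgeFirstOrderSketch {
    static final class Candidate {
        final String name;
        final boolean edge; // true if the client already has it: usable as a base, never written

        Candidate(String name, boolean edge) {
            this.name = name;
            this.edge = edge;
        }
    }

    /** Keeps at most {@code window} ancestor versions found for one path. */
    static <T> List<T> limit(List<T> ancestors, int window) {
        return new ArrayList<>(ancestors.subList(0, Math.min(window, ancestors.size())));
    }

    /** Moves edge objects to the front; the relative order within each group is preserved. */
    static List<Candidate> order(List<Candidate> in) {
        List<Candidate> out = new ArrayList<>(in);
        // false sorts before true, so compare on "is not an edge"; List.sort is stable.
        out.sort(Comparator.comparing((Candidate c) -> !c.edge));
        return out;
    }
}
```

Because non-edges search backwards through the window, placing the edges first means they are already in the window when the objects to be written arrive.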
These changes make a significant difference in the thin-pack:
Before:
remote: Counting objects: 144190, done
remote: Finding sources: 100% (50275/50275)
remote: Getting sizes: 100% (101405/101405)
remote: Compressing objects: 100% (7587/7587)
Receiving objects: 100% (50275/50275), 24.67 MiB | 9.90 MiB/s, done.
Resolving deltas: 100% (40339/40339), completed with 2218 local objects.
real 0m30.267s
After:
remote: Counting objects: 61549, done
remote: Finding sources: 100% (50275/50275)
remote: Getting sizes: 100% (18862/18862)
remote: Compressing objects: 100% (7588/7588)
Receiving objects: 100% (50275/50275), 11.04 MiB | 3.51 MiB/s, done.
Resolving deltas: 100% (43160/43160), completed with 5014 local objects.
real 0m22.170s
The resulting pack is 13.63 MiB smaller, even though it contains
exactly the same objects. 82,543 fewer objects had to have their sizes
looked up, which saved about 8s of server CPU time. 2,796 more
objects from the client were used as part of the base object set,
which contributed to the smaller transfer size.
Change-Id: Id01271950432c6960897495b09deab70e33993a9
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>
11 years ago
PackBitmapIndex: Reduce memory usage in GC
Currently, garbage collection consistently fails for some large
repositories in the bitmap-building phase, e.g. the Linux-MSM project:
https://source.codeaurora.org/quic/la/kernel/msm-3.18
Historically, bitmap index creation happened in 3 phases:
1. Select the commits to which bitmaps should be attached.
2. Create all bitmaps for these commits, stored in uncompressed format
in the PackBitmapIndexBuilder.
3. Deltify the bitmaps and write them to disk.
We investigated the process. For phase 2 it's most efficient to create
bitmaps starting with the oldest commit and moving to the newest commit,
because the newer commits are able to reuse the work for the old ones.
But for bitmap deltification in phase 3, it's better when a newer
commit's bitmap is the base, and the current disk format writes bitmaps
out for the newest commits first.
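To illustrate why the choice of base matters in phase 3, here is a minimal sketch that assumes "deltify" means storing a bitmap as the XOR against a base bitmap; the message does not spell this out. java.util.BitSet stands in for the real bitmap type, and the names are hypothetical:

```java
import java.util.BitSet;

/** Hypothetical sketch of XOR deltification between two commit bitmaps. */
final class XorBitmapDeltaSketch {
    /** Encodes target as the bits in which it differs from base. */
    static BitSet delta(BitSet target, BitSet base) {
        BitSet d = (BitSet) target.clone();
        d.xor(base);
        return d;
    }

    /** Recovers the target from its delta and the same base. */
    static BitSet restore(BitSet delta, BitSet base) {
        BitSet t = (BitSet) delta.clone();
        t.xor(base); // XOR is its own inverse
        return t;
    }
}
```

When an older commit's bitmap is a subset of a newer one, the delta holds only the objects added in between, so it is sparse and should compress well.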
This change introduces a new collection to hold the deltified and
compressed representations of the bitmaps, keeping a smaller subset of
commits in the PackBitmapIndexBuilder to help make the bitmap index
creation more memory efficient.
In this commit, DISTANCE_THRESHOLD is set to 0 in the
PackWriterBitmapPreparer, which means garbage collection will not
change behavior much and will still use as much memory as before.
Change-Id: I6ec2c3e8dde11805af47874d67d33cf1ef83660e
Signed-off-by: Yunjie Li <yunjieli@google.com>
4 years ago
- /*
- * Copyright (C) 2008-2010, Google Inc.
- * Copyright (C) 2008, Marek Zawirski <marek.zawirski@gmail.com> and others
- *
- * This program and the accompanying materials are made available under the
- * terms of the Eclipse Distribution License v. 1.0 which is available at
- * https://www.eclipse.org/org/documents/edl-v10.php.
- *
- * SPDX-License-Identifier: BSD-3-Clause
- */
-
- package org.eclipse.jgit.internal.storage.pack;
-
- import static java.util.Objects.requireNonNull;
- import static org.eclipse.jgit.internal.storage.pack.StoredObjectRepresentation.PACK_DELTA;
- import static org.eclipse.jgit.internal.storage.pack.StoredObjectRepresentation.PACK_WHOLE;
- import static org.eclipse.jgit.lib.Constants.OBJECT_ID_LENGTH;
- import static org.eclipse.jgit.lib.Constants.OBJ_BLOB;
- import static org.eclipse.jgit.lib.Constants.OBJ_COMMIT;
- import static org.eclipse.jgit.lib.Constants.OBJ_TAG;
- import static org.eclipse.jgit.lib.Constants.OBJ_TREE;
-
- import java.io.IOException;
- import java.io.OutputStream;
- import java.lang.ref.WeakReference;
- import java.security.MessageDigest;
- import java.text.MessageFormat;
- import java.time.Duration;
- import java.util.ArrayList;
- import java.util.Arrays;
- import java.util.Collection;
- import java.util.Collections;
- import java.util.HashMap;
- import java.util.HashSet;
- import java.util.Iterator;
- import java.util.List;
- import java.util.Map;
- import java.util.NoSuchElementException;
- import java.util.Set;
- import java.util.concurrent.ConcurrentHashMap;
- import java.util.concurrent.ExecutionException;
- import java.util.concurrent.Executor;
- import java.util.concurrent.ExecutorService;
- import java.util.concurrent.Executors;
- import java.util.concurrent.Future;
- import java.util.concurrent.TimeUnit;
- import java.util.zip.CRC32;
- import java.util.zip.CheckedOutputStream;
- import java.util.zip.Deflater;
- import java.util.zip.DeflaterOutputStream;
-
- import org.eclipse.jgit.annotations.NonNull;
- import org.eclipse.jgit.annotations.Nullable;
- import org.eclipse.jgit.errors.CorruptObjectException;
- import org.eclipse.jgit.errors.IncorrectObjectTypeException;
- import org.eclipse.jgit.errors.LargeObjectException;
- import org.eclipse.jgit.errors.MissingObjectException;
- import org.eclipse.jgit.errors.SearchForReuseTimeout;
- import org.eclipse.jgit.errors.StoredObjectRepresentationNotAvailableException;
- import org.eclipse.jgit.internal.JGitText;
- import org.eclipse.jgit.internal.storage.file.PackBitmapIndexBuilder;
- import org.eclipse.jgit.internal.storage.file.PackBitmapIndexWriterV1;
- import org.eclipse.jgit.internal.storage.file.PackIndexWriter;
- import org.eclipse.jgit.lib.AnyObjectId;
- import org.eclipse.jgit.lib.AsyncObjectSizeQueue;
- import org.eclipse.jgit.lib.BatchingProgressMonitor;
- import org.eclipse.jgit.lib.BitmapIndex;
- import org.eclipse.jgit.lib.BitmapIndex.BitmapBuilder;
- import org.eclipse.jgit.lib.BitmapObject;
- import org.eclipse.jgit.lib.Constants;
- import org.eclipse.jgit.lib.NullProgressMonitor;
- import org.eclipse.jgit.lib.ObjectId;
- import org.eclipse.jgit.lib.ObjectIdOwnerMap;
- import org.eclipse.jgit.lib.ObjectIdSet;
- import org.eclipse.jgit.lib.ObjectLoader;
- import org.eclipse.jgit.lib.ObjectReader;
- import org.eclipse.jgit.lib.ProgressMonitor;
- import org.eclipse.jgit.lib.Repository;
- import org.eclipse.jgit.lib.ThreadSafeProgressMonitor;
- import org.eclipse.jgit.revwalk.AsyncRevObjectQueue;
- import org.eclipse.jgit.revwalk.BitmapWalker;
- import org.eclipse.jgit.revwalk.DepthWalk;
- import org.eclipse.jgit.revwalk.ObjectWalk;
- import org.eclipse.jgit.revwalk.RevCommit;
- import org.eclipse.jgit.revwalk.RevFlag;
- import org.eclipse.jgit.revwalk.RevObject;
- import org.eclipse.jgit.revwalk.RevSort;
- import org.eclipse.jgit.revwalk.RevTag;
- import org.eclipse.jgit.revwalk.RevTree;
- import org.eclipse.jgit.storage.pack.PackConfig;
- import org.eclipse.jgit.storage.pack.PackStatistics;
- import org.eclipse.jgit.transport.FilterSpec;
- import org.eclipse.jgit.transport.ObjectCountCallback;
- import org.eclipse.jgit.transport.PacketLineOut;
- import org.eclipse.jgit.transport.WriteAbortedException;
- import org.eclipse.jgit.util.BlockList;
- import org.eclipse.jgit.util.TemporaryBuffer;
-
- /**
- * <p>
- * The PackWriter class is responsible for generating pack files from a specified
- * set of objects in a repository. This implementation produces pack files in
- * format version 2.
- * </p>
- * <p>
- * Source of objects may be specified in two ways:
- * <ul>
- * <li>(usually) by providing sets of interesting and uninteresting objects in
- * repository - all interesting objects and their ancestors except uninteresting
- * objects and their ancestors will be included in pack, or</li>
- * <li>by providing iterator of {@link org.eclipse.jgit.revwalk.RevObject}
- * specifying exact list and order of objects in pack</li>
- * </ul>
- * <p>
- * Typical usage consists of creating an instance, configuring options,
- * preparing the list of objects by calling {@link #preparePack(Iterator)} or
- * {@link #preparePack(ProgressMonitor, Set, Set)}, and streaming with
- * {@link #writePack(ProgressMonitor, ProgressMonitor, OutputStream)}. If the
- * pack is being stored as a file the matching index can be written out after
- * writing the pack by {@link #writeIndex(OutputStream)}. An optional bitmap
- * index can be made by calling {@link #prepareBitmapIndex(ProgressMonitor)}
- * followed by {@link #writeBitmapIndex(OutputStream)}.
- * </p>
- * <p>
- * The class provides a set of configurable options and
- * {@link org.eclipse.jgit.lib.ProgressMonitor} support, as operations may take
- * a long time for big repositories. Deltas searching algorithm is <b>NOT
- * IMPLEMENTED</b> yet - this implementation relies only on deltas and objects
- * reuse.
- * </p>
- * <p>
- * This class is not thread safe. It is intended to be used in one thread as a
- * single pass to produce one pack. Invoking methods multiple times or out of
- * order is not supported as internal data structures are destroyed during
- * certain phases to save memory when packing large repositories.
- * </p>
- */
- public class PackWriter implements AutoCloseable {
- private static final int PACK_VERSION_GENERATED = 2;
-
- /** Empty set of objects for {@code preparePack()}. */
- public static final Set<ObjectId> NONE = Collections.emptySet();
-
- private static final Map<WeakReference<PackWriter>, Boolean> instances =
- new ConcurrentHashMap<>();
-
- private static final Iterable<PackWriter> instancesIterable = () -> new Iterator<PackWriter>() {
-
- private final Iterator<WeakReference<PackWriter>> it = instances
- .keySet().iterator();
-
- private PackWriter next;
-
- @Override
- public boolean hasNext() {
- if (next != null) {
- return true;
- }
- while (it.hasNext()) {
- WeakReference<PackWriter> ref = it.next();
- next = ref.get();
- if (next != null) {
- return true;
- }
- it.remove();
- }
- return false;
- }
-
- @Override
- public PackWriter next() {
- if (hasNext()) {
- PackWriter result = next;
- next = null;
- return result;
- }
- throw new NoSuchElementException();
- }
-
- @Override
- public void remove() {
- throw new UnsupportedOperationException();
- }
- };
-
- /**
- * Get all allocated, non-released PackWriter instances.
- *
- * @return all allocated, non-released PackWriter instances.
- */
- public static Iterable<PackWriter> getInstances() {
- return instancesIterable;
- }
-
- @SuppressWarnings("unchecked")
- BlockList<ObjectToPack>[] objectsLists = new BlockList[OBJ_TAG + 1];
- {
- objectsLists[OBJ_COMMIT] = new BlockList<>();
- objectsLists[OBJ_TREE] = new BlockList<>();
- objectsLists[OBJ_BLOB] = new BlockList<>();
- objectsLists[OBJ_TAG] = new BlockList<>();
- }
-
- private ObjectIdOwnerMap<ObjectToPack> objectsMap = new ObjectIdOwnerMap<>();
-
- // edge objects for thin packs
- private List<ObjectToPack> edgeObjects = new BlockList<>();
-
- // Objects the client is known to have already.
- private BitmapBuilder haveObjects;
-
- private List<CachedPack> cachedPacks = new ArrayList<>(2);
-
- private Set<ObjectId> tagTargets = NONE;
-
- private Set<? extends ObjectId> excludeFromBitmapSelection = NONE;
-
- private ObjectIdSet[] excludeInPacks;
-
- private ObjectIdSet excludeInPackLast;
-
- private Deflater myDeflater;
-
- private final ObjectReader reader;
-
- /** {@link #reader} recast to the reuse interface, if it supports it. */
- private final ObjectReuseAsIs reuseSupport;
-
- final PackConfig config;
-
- private final PackStatistics.Accumulator stats;
-
- private final MutableState state;
-
- private final WeakReference<PackWriter> selfRef;
-
- private PackStatistics.ObjectType.Accumulator typeStats;
-
- private List<ObjectToPack> sortedByName;
-
- private byte[] packcsum;
-
- private boolean deltaBaseAsOffset;
-
- private boolean reuseDeltas;
-
- private boolean reuseDeltaCommits;
-
- private boolean reuseValidate;
-
- private boolean thin;
-
- private boolean useCachedPacks;
-
- private boolean useBitmaps;
-
- private boolean ignoreMissingUninteresting = true;
-
- private boolean pruneCurrentObjectList;
-
- private boolean shallowPack;
-
- private boolean canBuildBitmaps;
-
- private boolean indexDisabled;
-
- private boolean checkSearchForReuseTimeout = false;
-
- private final Duration searchForReuseTimeout;
-
- private long searchForReuseStartTimeEpoc;
-
- private int depth;
-
- private Collection<? extends ObjectId> unshallowObjects;
-
- private PackBitmapIndexBuilder writeBitmaps;
-
- private CRC32 crc32;
-
- private ObjectCountCallback callback;
-
- private FilterSpec filterSpec = FilterSpec.NO_FILTER;
-
- private PackfileUriConfig packfileUriConfig;
-
- /**
- * Create writer for specified repository.
- * <p>
- * Objects for packing are specified in {@link #preparePack(Iterator)} or
- * {@link #preparePack(ProgressMonitor, Set, Set)}.
- *
- * @param repo
- * repository where objects are stored.
- */
- public PackWriter(Repository repo) {
- this(repo, repo.newObjectReader());
- }
-
- /**
- * Create a writer to load objects from the specified reader.
- * <p>
- * Objects for packing are specified in {@link #preparePack(Iterator)} or
- * {@link #preparePack(ProgressMonitor, Set, Set)}.
- *
- * @param reader
- * reader to read from the repository with.
- */
- public PackWriter(ObjectReader reader) {
- this(new PackConfig(), reader);
- }
-
- /**
- * Create writer for specified repository.
- * <p>
- * Objects for packing are specified in {@link #preparePack(Iterator)} or
- * {@link #preparePack(ProgressMonitor, Set, Set)}.
- *
- * @param repo
- * repository where objects are stored.
- * @param reader
- * reader to read from the repository with.
- */
- public PackWriter(Repository repo, ObjectReader reader) {
- this(new PackConfig(repo), reader);
- }
-
- /**
- * Create writer with a specified configuration.
- * <p>
- * Objects for packing are specified in {@link #preparePack(Iterator)} or
- * {@link #preparePack(ProgressMonitor, Set, Set)}.
- *
- * @param config
- * configuration for the pack writer.
- * @param reader
- * reader to read from the repository with.
- */
- public PackWriter(PackConfig config, ObjectReader reader) {
- this(config, reader, null);
- }
-
- /**
- * Create writer with a specified configuration.
- * <p>
- * Objects for packing are specified in {@link #preparePack(Iterator)} or
- * {@link #preparePack(ProgressMonitor, Set, Set)}.
- *
- * @param config
- * configuration for the pack writer.
- * @param reader
- * reader to read from the repository with.
- * @param statsAccumulator
- * accumulator for statistics
- */
- public PackWriter(PackConfig config, final ObjectReader reader,
- @Nullable PackStatistics.Accumulator statsAccumulator) {
- this.config = config;
- this.reader = reader;
- if (reader instanceof ObjectReuseAsIs)
- reuseSupport = ((ObjectReuseAsIs) reader);
- else
- reuseSupport = null;
-
- deltaBaseAsOffset = config.isDeltaBaseAsOffset();
- reuseDeltas = config.isReuseDeltas();
- searchForReuseTimeout = config.getSearchForReuseTimeout();
- reuseValidate = true; // be paranoid by default
- stats = statsAccumulator != null ? statsAccumulator
- : new PackStatistics.Accumulator();
- state = new MutableState();
- selfRef = new WeakReference<>(this);
- instances.put(selfRef, Boolean.TRUE);
- }
-
- /**
- * Set the {@code ObjectCountCallback}.
- * <p>
- * It should be set before calling
- * {@link #writePack(ProgressMonitor, ProgressMonitor, OutputStream)}.
- *
- * @param callback
- * the callback to set
- * @return this object for chaining.
- */
- public PackWriter setObjectCountCallback(ObjectCountCallback callback) {
- this.callback = callback;
- return this;
- }
-
- /**
- * Records the set of shallow commits in the client.
- *
- * @param clientShallowCommits
- * the shallow commits in the client
- */
- public void setClientShallowCommits(Set<ObjectId> clientShallowCommits) {
- stats.clientShallowCommits = Collections
- .unmodifiableSet(new HashSet<>(clientShallowCommits));
- }
-
- /**
- * Check whether writer can store delta base as an offset (new style
- * reducing pack size) or should store it as an object id (legacy style,
- * compatible with old readers).
- *
- * Default setting: {@value PackConfig#DEFAULT_DELTA_BASE_AS_OFFSET}
- *
- * @return true if delta base is stored as an offset; false if it is stored
- * as an object id.
- */
- public boolean isDeltaBaseAsOffset() {
- return deltaBaseAsOffset;
- }
-
- /**
- * Check whether the search for reuse phase is taking too long. This could
- * be the case when the number of objects and pack files is high and the
- * system is under pressure. If that is the case and
- * {@code checkSearchForReuseTimeout} is true, the search is aborted.
- *
- * @throws SearchForReuseTimeout
- * if the search for reuse is taking too long.
- */
- public void checkSearchForReuseTimeout() throws SearchForReuseTimeout {
- if (checkSearchForReuseTimeout
- && Duration.ofMillis(System.currentTimeMillis()
- - searchForReuseStartTimeEpoc)
- .compareTo(searchForReuseTimeout) > 0) {
- throw new SearchForReuseTimeout(searchForReuseTimeout);
- }
- }
-
- /**
- * Set writer delta base format. Delta base can be written as an offset in a
- * pack file (new approach reducing file size) or as an object id (legacy
- * approach, compatible with old readers).
- *
- * Default setting: {@value PackConfig#DEFAULT_DELTA_BASE_AS_OFFSET}
- *
- * @param deltaBaseAsOffset
- * boolean indicating whether delta base can be stored as an
- * offset.
- */
- public void setDeltaBaseAsOffset(boolean deltaBaseAsOffset) {
- this.deltaBaseAsOffset = deltaBaseAsOffset;
- }
-
- /**
- * Set the writer to check for long search for reuse, exceeding the timeout.
- * Selecting an object representation can be an expensive operation. It is
- * possible to set a max search for reuse time (see
- * PackConfig#CONFIG_KEY_SEARCH_FOR_REUSE_TIMEOUT for more details).
- *
- * However, some operations, e.g. GC, need to find the best candidate
- * regardless of how much time the operation will need to finish.
- *
- * This method enables the search for reuse timeout check, which is
- * otherwise disabled.
- */
- public void enableSearchForReuseTimeout() {
- this.checkSearchForReuseTimeout = true;
- }
-
- /**
- * Check if the writer will reuse commits that are already stored as deltas.
- *
- * @return true if the writer would reuse commits stored as deltas, assuming
- * delta reuse is already enabled.
- */
- public boolean isReuseDeltaCommits() {
- return reuseDeltaCommits;
- }
-
- /**
- * Set the writer to reuse existing delta versions of commits.
- *
- * @param reuse
- * if true, the writer will reuse any commits stored as deltas.
- * By default the writer does not reuse delta commits.
- */
- public void setReuseDeltaCommits(boolean reuse) {
- reuseDeltaCommits = reuse;
- }
-
- /**
- * Check if the writer validates objects before copying them.
- *
- * @return true if validation is enabled; false if the reader will handle
- * object validation as a side-effect of it consuming the output.
- */
- public boolean isReuseValidatingObjects() {
- return reuseValidate;
- }
-
- /**
- * Enable (or disable) object validation during packing.
- *
- * @param validate
- * if true the pack writer will validate an object before it is
- * put into the output. This additional validation work may be
- * necessary to avoid propagating corruption from one local pack
- * file to another local pack file.
- */
- public void setReuseValidatingObjects(boolean validate) {
- reuseValidate = validate;
- }
-
- /**
- * Whether this writer is producing a thin pack.
- *
- * @return true if this writer is producing a thin pack.
- */
- public boolean isThin() {
- return thin;
- }
-
- /**
- * Set whether the writer may pack objects whose delta base object is not
- * within the set of objects to pack.
- *
- * @param packthin
- * a boolean indicating whether the writer may pack objects whose
- * delta base object is not within the set of objects to pack, but
- * belongs to the other party's repository
- * (uninteresting/boundary) as determined by the set; this kind
- * of pack is used only for transport; true to produce a thin
- * pack, false otherwise.
- */
- public void setThin(boolean packthin) {
- thin = packthin;
- }
-
- /**
- * Whether to reuse cached packs.
- *
- * @return {@code true} to reuse cached packs. If true, index creation
- * isn't available.
- */
- public boolean isUseCachedPacks() {
- return useCachedPacks;
- }
-
- /**
- * Whether to use cached packs
- *
- * @param useCached
- * if set to {@code true} and a cached pack is present, it will
- * be appended onto the end of a thin-pack, reducing the amount
- * of working set space and CPU used by PackWriter. Enabling this
- * feature prevents PackWriter from creating an index for the
- * newly created pack, so it's only suitable for writing to a
- * network client, where the client will make the index.
- */
- public void setUseCachedPacks(boolean useCached) {
- useCachedPacks = useCached;
- }
-
- /**
- * Whether to use bitmaps
- *
- * @return {@code true} to use bitmaps for ObjectWalks, if available.
- */
- public boolean isUseBitmaps() {
- return useBitmaps;
- }
-
- /**
- * Whether to use bitmaps
- *
- * @param useBitmaps
- * if set to true, bitmaps will be used when preparing a pack.
- */
- public void setUseBitmaps(boolean useBitmaps) {
- this.useBitmaps = useBitmaps;
- }
-
- /**
- * Whether the index file cannot be created by this PackWriter.
- *
- * @return {@code true} if the index file cannot be created by this
- * PackWriter.
- */
- public boolean isIndexDisabled() {
- return indexDisabled || !cachedPacks.isEmpty();
- }
-
- /**
- * Whether to disable creation of the index file.
- *
- * @param noIndex
- * {@code true} to disable creation of the index file.
- */
- public void setIndexDisabled(boolean noIndex) {
- this.indexDisabled = noIndex;
- }
-
- /**
- * Whether to ignore missing uninteresting objects
- *
- * @return {@code true} to ignore objects that are uninteresting and also
- * not found on local disk; false to throw a
- * {@link org.eclipse.jgit.errors.MissingObjectException} out of
- * {@link #preparePack(ProgressMonitor, Set, Set)} if an
- * uninteresting object is not in the source repository. By default,
- * true, permitting uninteresting objects to be gracefully ignored.
- */
- public boolean isIgnoreMissingUninteresting() {
- return ignoreMissingUninteresting;
- }
-
- /**
- * Whether the writer should ignore non-existing uninteresting objects
- *
- * @param ignore
- * {@code true} if the writer should ignore non-existing
- * uninteresting objects while constructing the set of objects to
- * pack; false otherwise, in which case non-existing uninteresting
- * objects may cause
- * {@link org.eclipse.jgit.errors.MissingObjectException}
- */
- public void setIgnoreMissingUninteresting(boolean ignore) {
- ignoreMissingUninteresting = ignore;
- }
-
- /**
- * Set the tag targets that should be hoisted earlier during packing.
- * <p>
- * Callers may put objects into this set before invoking any of the
- * preparePack methods to influence where an annotated tag's target is
- * stored within the resulting pack. Typically these will be clustered
- * together, and hoisted earlier in the file even if they are ancient
- * revisions, allowing readers to find tag targets with better locality.
- *
- * @param objects
- * objects that annotated tags point at.
- */
- public void setTagTargets(Set<ObjectId> objects) {
- tagTargets = objects;
- }
-
- /**
- * Configure this pack for a shallow clone.
- *
- * @param depth
- * maximum depth of history to return. 1 means return only the
- * "wants".
- * @param unshallow
- * objects which used to be shallow on the client, but are being
- * extended as part of this fetch
- */
- public void setShallowPack(int depth,
- Collection<? extends ObjectId> unshallow) {
- this.shallowPack = true;
- this.depth = depth;
- this.unshallowObjects = unshallow;
- }
-
- /**
- * @param filter the filter which indicates what this writer should and
- * should not include
- */
- public void setFilterSpec(@NonNull FilterSpec filter) {
- filterSpec = requireNonNull(filter);
- }
-
- /**
- * @param config configuration related to packfile URIs
- * @since 5.5
- */
- public void setPackfileUriConfig(PackfileUriConfig config) {
- packfileUriConfig = config;
- }
-
- /**
- * Returns the number of objects in the pack file created by this writer.
- *
- * @return number of objects in pack.
- * @throws java.io.IOException
- * a cached pack cannot supply its object count.
- */
- public long getObjectCount() throws IOException {
- if (stats.totalObjects == 0) {
- long objCnt = 0;
-
- objCnt += objectsLists[OBJ_COMMIT].size();
- objCnt += objectsLists[OBJ_TREE].size();
- objCnt += objectsLists[OBJ_BLOB].size();
- objCnt += objectsLists[OBJ_TAG].size();
-
- for (CachedPack pack : cachedPacks)
- objCnt += pack.getObjectCount();
- return objCnt;
- }
- return stats.totalObjects;
- }
-
- private long getUnoffloadedObjectCount() throws IOException {
- long objCnt = 0;
-
- objCnt += objectsLists[OBJ_COMMIT].size();
- objCnt += objectsLists[OBJ_TREE].size();
- objCnt += objectsLists[OBJ_BLOB].size();
- objCnt += objectsLists[OBJ_TAG].size();
-
- for (CachedPack pack : cachedPacks) {
- CachedPackUriProvider.PackInfo packInfo =
- packfileUriConfig.cachedPackUriProvider.getInfo(
- pack, packfileUriConfig.protocolsSupported);
- if (packInfo == null) {
- objCnt += pack.getObjectCount();
- }
- }
-
- return objCnt;
- }
-
- /**
- * Returns the object ids in the pack file that was created by this writer.
- * <p>
- * This method can only be invoked after
- * {@link #writePack(ProgressMonitor, ProgressMonitor, OutputStream)} has
- * been invoked and completed successfully.
- *
- * @return set of objects in pack.
- * @throws java.io.IOException
- * a cached pack cannot supply its object ids.
- */
- public ObjectIdOwnerMap<ObjectIdOwnerMap.Entry> getObjectSet()
- throws IOException {
- if (!cachedPacks.isEmpty())
- throw new IOException(
- JGitText.get().cachedPacksPreventsListingObjects);
-
- if (writeBitmaps != null) {
- return writeBitmaps.getObjectSet();
- }
-
- ObjectIdOwnerMap<ObjectIdOwnerMap.Entry> r = new ObjectIdOwnerMap<>();
- for (BlockList<ObjectToPack> objList : objectsLists) {
- if (objList != null) {
- for (ObjectToPack otp : objList)
- r.add(new ObjectIdOwnerMap.Entry(otp) {
- // A new entry that copies the ObjectId
- });
- }
- }
- return r;
- }
-
- /**
- * Add a pack index whose contents should be excluded from the result.
- *
- * @param idx
- * objects in this index will not be in the output pack.
- */
- public void excludeObjects(ObjectIdSet idx) {
- if (excludeInPacks == null) {
- excludeInPacks = new ObjectIdSet[] { idx };
- excludeInPackLast = idx;
- } else {
- int cnt = excludeInPacks.length;
- ObjectIdSet[] newList = new ObjectIdSet[cnt + 1];
- System.arraycopy(excludeInPacks, 0, newList, 0, cnt);
- newList[cnt] = idx;
- excludeInPacks = newList;
- }
- }
-
- /**
- * Prepare the list of objects to be written to the pack stream.
- * <p>
- * The iterator <b>exactly</b> determines which objects are included in the
- * pack and the order in which they appear (except that ordering by type is
- * not needed at input). This order should conform to the general rules for
- * ordering objects in git: by recency and path (type and delta-base first
- * ordering is secured internally). The caller is responsible for
- * guaranteeing this order. The iterator must return the id of each object
- * to write exactly once.
- * </p>
- *
- * @param objectsSource
- * iterator of objects to store in a pack; order of objects within
- * each type is important, ordering by type is not needed;
- * allowed types for objects are
- * {@link org.eclipse.jgit.lib.Constants#OBJ_COMMIT},
- * {@link org.eclipse.jgit.lib.Constants#OBJ_TREE},
- * {@link org.eclipse.jgit.lib.Constants#OBJ_BLOB} and
- * {@link org.eclipse.jgit.lib.Constants#OBJ_TAG}; objects
- * returned by iterator may be later reused by caller as object
- * id and type are internally copied in each iteration.
- * @throws java.io.IOException
- * when some I/O problem occurs during reading objects.
- */
- public void preparePack(@NonNull Iterator<RevObject> objectsSource)
- throws IOException {
- while (objectsSource.hasNext()) {
- addObject(objectsSource.next());
- }
- }
-
- /**
- * Prepare the list of objects to be written to the pack stream.
- *
- * <p>
- * PackWriter will concatenate and write out the specified packs as-is.
- *
- * @param c
- * cached packs to be written.
- */
- public void preparePack(Collection<? extends CachedPack> c) {
- cachedPacks.addAll(c);
- }
-
- /**
- * Prepare the list of objects to be written to the pack stream.
- * <p>
- * Based on the {@code want} and {@code have} sets, another set of objects to
- * put in the pack file is created: it consists of all objects reachable
- * (ancestors) from interesting objects, except uninteresting objects and
- * their ancestors. This method uses
- * {@link org.eclipse.jgit.revwalk.ObjectWalk} extensively to find the
- * appropriate set of output objects and their optimal order in the output
- * pack. The order is consistent with general git in-pack rules: sort by
- * object type, recency, path and delta-base first.
- * </p>
- *
- * @param countingMonitor
- * progress during object enumeration.
- * @param want
- * collection of objects to be marked as interesting (start
- * points of graph traversal). Must not be {@code null}.
- * @param have
- * collection of objects to be marked as uninteresting (end
- * points of graph traversal). Pass {@link #NONE} if all objects
- * reachable from {@code want} are desired, such as when serving
- * a clone.
- * @throws java.io.IOException
- * when some I/O problem occurs during reading objects.
- */
- public void preparePack(ProgressMonitor countingMonitor,
- @NonNull Set<? extends ObjectId> want,
- @NonNull Set<? extends ObjectId> have) throws IOException {
- preparePack(countingMonitor, want, have, NONE, NONE);
- }
-
- /**
- * Prepare the list of objects to be written to the pack stream.
- * <p>
- * Like {@link #preparePack(ProgressMonitor, Set, Set)} but also allows
- * specifying commits that should not be walked past ("shallow" commits).
- * The caller is responsible for filtering out commits that should not be
- * shallow any more ("unshallow" commits as in {@link #setShallowPack}) from
- * the shallow set.
- *
- * @param countingMonitor
- * progress during object enumeration.
- * @param want
- * objects of interest, ancestors of which will be included in
- * the pack. Must not be {@code null}.
- * @param have
- * objects whose ancestors (up to and including {@code shallow}
- * commits) do not need to be included in the pack because they
- * are already available from elsewhere. Must not be
- * {@code null}.
- * @param shallow
- * commits indicating the boundary of the history marked with
- * {@code have}. Shallow commits have parents but those parents
- * are considered not to be already available. Parents of
- * {@code shallow} commits and earlier generations will be
- * included in the pack if requested by {@code want}. Must not be
- * {@code null}.
- * @throws java.io.IOException
- * an I/O problem occurred while reading objects.
- */
- public void preparePack(ProgressMonitor countingMonitor,
- @NonNull Set<? extends ObjectId> want,
- @NonNull Set<? extends ObjectId> have,
- @NonNull Set<? extends ObjectId> shallow) throws IOException {
- preparePack(countingMonitor, want, have, shallow, NONE);
- }
-
- /**
- * Prepare the list of objects to be written to the pack stream.
- * <p>
- * Like {@link #preparePack(ProgressMonitor, Set, Set)} but also allows
- * specifying commits that should not be walked past ("shallow" commits).
- * The caller is responsible for filtering out commits that should not be
- * shallow any more ("unshallow" commits as in {@link #setShallowPack}) from
- * the shallow set.
- *
- * @param countingMonitor
- * progress during object enumeration.
- * @param want
- * objects of interest, ancestors of which will be included in
- * the pack. Must not be {@code null}.
- * @param have
- * objects whose ancestors (up to and including {@code shallow}
- * commits) do not need to be included in the pack because they
- * are already available from elsewhere. Must not be
- * {@code null}.
- * @param shallow
- * commits indicating the boundary of the history marked with
- * {@code have}. Shallow commits have parents but those parents
- * are considered not to be already available. Parents of
- * {@code shallow} commits and earlier generations will be
- * included in the pack if requested by {@code want}. Must not be
- * {@code null}.
- * @param noBitmaps
- * collection of objects to be excluded from bitmap commit
- * selection.
- * @throws java.io.IOException
- * an I/O problem occurred while reading objects.
- */
- public void preparePack(ProgressMonitor countingMonitor,
- @NonNull Set<? extends ObjectId> want,
- @NonNull Set<? extends ObjectId> have,
- @NonNull Set<? extends ObjectId> shallow,
- @NonNull Set<? extends ObjectId> noBitmaps) throws IOException {
- try (ObjectWalk ow = getObjectWalk()) {
- ow.assumeShallow(shallow);
- preparePack(countingMonitor, ow, want, have, noBitmaps);
- }
- }
-
- private ObjectWalk getObjectWalk() {
- return shallowPack ? new DepthWalk.ObjectWalk(reader, depth - 1)
- : new ObjectWalk(reader);
- }
-
- /**
- * A visitation policy which uses the depth at which the object is seen to
- * decide if re-traversal is necessary. In particular, if the object has
- * already been visited at this depth or shallower, it is not necessary to
- * re-visit at this depth.
- */
- private static class DepthAwareVisitationPolicy
- implements ObjectWalk.VisitationPolicy {
- private final Map<ObjectId, Integer> lowestDepthVisited = new HashMap<>();
-
- private final ObjectWalk walk;
-
- DepthAwareVisitationPolicy(ObjectWalk walk) {
- this.walk = requireNonNull(walk);
- }
-
- @Override
- public boolean shouldVisit(RevObject o) {
- Integer lastDepth = lowestDepthVisited.get(o);
- if (lastDepth == null) {
- return true;
- }
- return walk.getTreeDepth() < lastDepth.intValue();
- }
-
- @Override
- public void visited(RevObject o) {
- lowestDepthVisited.put(o, Integer.valueOf(walk.getTreeDepth()));
- }
- }
-
- /**
- * Prepare the list of objects to be written to the pack stream.
- * <p>
- * Based on the interesting and uninteresting sets, another set of objects to
- * put in the pack file is created: it consists of all objects reachable
- * (ancestors) from interesting objects, except uninteresting objects and
- * their ancestors. This method uses
- * {@link org.eclipse.jgit.revwalk.ObjectWalk} extensively to find the
- * appropriate set of output objects and their optimal order in the output
- * pack. The order is consistent with general git in-pack rules: sort by
- * object type, recency, path and delta-base first.
- * </p>
- *
- * @param countingMonitor
- * progress during object enumeration.
- * @param walk
- * ObjectWalk to perform enumeration.
- * @param interestingObjects
- * collection of objects to be marked as interesting (start
- * points of graph traversal). Must not be {@code null}.
- * @param uninterestingObjects
- * collection of objects to be marked as uninteresting (end
- * points of graph traversal). Pass {@link #NONE} if all objects
- * reachable from {@code interestingObjects} are desired, such as
- * when serving
- * a clone.
- * @param noBitmaps
- * collection of objects to be excluded from bitmap commit
- * selection.
- * @throws java.io.IOException
- * when some I/O problem occurs during reading objects.
- */
- public void preparePack(ProgressMonitor countingMonitor,
- @NonNull ObjectWalk walk,
- @NonNull Set<? extends ObjectId> interestingObjects,
- @NonNull Set<? extends ObjectId> uninterestingObjects,
- @NonNull Set<? extends ObjectId> noBitmaps)
- throws IOException {
- if (countingMonitor == null)
- countingMonitor = NullProgressMonitor.INSTANCE;
- if (shallowPack && !(walk instanceof DepthWalk.ObjectWalk))
- throw new IllegalArgumentException(
- JGitText.get().shallowPacksRequireDepthWalk);
- if (filterSpec.getTreeDepthLimit() >= 0) {
- walk.setVisitationPolicy(new DepthAwareVisitationPolicy(walk));
- }
- findObjectsToPack(countingMonitor, walk, interestingObjects,
- uninterestingObjects, noBitmaps);
- }
-
- /**
- * Determine if the pack file will contain the requested object.
- *
- * @param id
- * the object to test the existence of.
- * @return true if the object will appear in the output pack file.
- * @throws java.io.IOException
- * a cached pack cannot be examined.
- */
- public boolean willInclude(AnyObjectId id) throws IOException {
- ObjectToPack obj = objectsMap.get(id);
- return obj != null && !obj.isEdge();
- }
-
- /**
- * Lookup the ObjectToPack object for a given ObjectId.
- *
- * @param id
- * the object to find in the pack.
- * @return the object we are packing, or null.
- */
- public ObjectToPack get(AnyObjectId id) {
- ObjectToPack obj = objectsMap.get(id);
- return obj != null && !obj.isEdge() ? obj : null;
- }
-
- /**
- * Computes the SHA-1 of the lexicographically sorted object ids written in
- * this pack, as used to name a pack file in the repository.
- *
- * @return ObjectId representing SHA-1 name of a pack that was created.
- */
- public ObjectId computeName() {
- final byte[] buf = new byte[OBJECT_ID_LENGTH];
- final MessageDigest md = Constants.newMessageDigest();
- for (ObjectToPack otp : sortByName()) {
- otp.copyRawTo(buf, 0);
- md.update(buf, 0, OBJECT_ID_LENGTH);
- }
- return ObjectId.fromRaw(md.digest());
- }
-
- /**
- * Returns the index format version that will be written.
- * <p>
- * This method can only be invoked after
- * {@link #writePack(ProgressMonitor, ProgressMonitor, OutputStream)} has
- * been invoked and completed successfully.
- *
- * @return the index format version.
- */
- public int getIndexVersion() {
- int indexVersion = config.getIndexVersion();
- if (indexVersion <= 0) {
- for (BlockList<ObjectToPack> objs : objectsLists)
- indexVersion = Math.max(indexVersion,
- PackIndexWriter.oldestPossibleFormat(objs));
- }
- return indexVersion;
- }
-
- /**
- * Create an index file to match the pack file just written.
- * <p>
- * Called after
- * {@link #writePack(ProgressMonitor, ProgressMonitor, OutputStream)}.
- * <p>
- * Writing an index is only required for local pack storage. Packs sent on
- * the network do not need to create an index.
- *
- * @param indexStream
- * output for the index data. Caller is responsible for closing
- * this stream.
- * @throws java.io.IOException
- * the index data could not be written to the supplied stream.
- */
- public void writeIndex(OutputStream indexStream) throws IOException {
- if (isIndexDisabled())
- throw new IOException(JGitText.get().cachedPacksPreventsIndexCreation);
-
- long writeStart = System.currentTimeMillis();
- final PackIndexWriter iw = PackIndexWriter.createVersion(
- indexStream, getIndexVersion());
- iw.write(sortByName(), packcsum);
- stats.timeWriting += System.currentTimeMillis() - writeStart;
- }
-
- /**
- * Create a bitmap index file to match the pack file just written.
- * <p>
- * Called after {@link #prepareBitmapIndex(ProgressMonitor)}.
- *
- * @param bitmapIndexStream
- * output for the bitmap index data. Caller is responsible for
- * closing this stream.
- * @throws java.io.IOException
- * the index data could not be written to the supplied stream.
- */
- public void writeBitmapIndex(OutputStream bitmapIndexStream)
- throws IOException {
- if (writeBitmaps == null)
- throw new IOException(JGitText.get().bitmapsMustBePrepared);
-
- long writeStart = System.currentTimeMillis();
- final PackBitmapIndexWriterV1 iw = new PackBitmapIndexWriterV1(bitmapIndexStream);
- iw.write(writeBitmaps, packcsum);
- stats.timeWriting += System.currentTimeMillis() - writeStart;
- }
-
- private List<ObjectToPack> sortByName() {
- if (sortedByName == null) {
- int cnt = 0;
- cnt += objectsLists[OBJ_COMMIT].size();
- cnt += objectsLists[OBJ_TREE].size();
- cnt += objectsLists[OBJ_BLOB].size();
- cnt += objectsLists[OBJ_TAG].size();
-
- sortedByName = new BlockList<>(cnt);
- sortedByName.addAll(objectsLists[OBJ_COMMIT]);
- sortedByName.addAll(objectsLists[OBJ_TREE]);
- sortedByName.addAll(objectsLists[OBJ_BLOB]);
- sortedByName.addAll(objectsLists[OBJ_TAG]);
- Collections.sort(sortedByName);
- }
- return sortedByName;
- }
-
- private void beginPhase(PackingPhase phase, ProgressMonitor monitor,
- long cnt) {
- state.phase = phase;
- String task;
- switch (phase) {
- case COUNTING:
- task = JGitText.get().countingObjects;
- break;
- case GETTING_SIZES:
- task = JGitText.get().searchForSizes;
- break;
- case FINDING_SOURCES:
- task = JGitText.get().searchForReuse;
- break;
- case COMPRESSING:
- task = JGitText.get().compressingObjects;
- break;
- case WRITING:
- task = JGitText.get().writingObjects;
- break;
- case BUILDING_BITMAPS:
- task = JGitText.get().buildingBitmaps;
- break;
- default:
- throw new IllegalArgumentException(
- MessageFormat.format(JGitText.get().illegalPackingPhase, phase));
- }
- monitor.beginTask(task, (int) cnt);
- }
-
- private void endPhase(ProgressMonitor monitor) {
- monitor.endTask();
- }
-
- /**
- * Write the prepared pack to the supplied stream.
- * <p>
- * Called after
- * {@link #preparePack(ProgressMonitor, ObjectWalk, Set, Set, Set)} or
- * {@link #preparePack(ProgressMonitor, Set, Set)}.
- * <p>
- * Performs delta search if enabled and writes the pack stream.
- * <p>
- * The data checksum (Adler32/CRC32) of every reused object is computed and
- * validated against the existing checksum.
- *
- * @param compressMonitor
- * progress monitor to report object compression work.
- * @param writeMonitor
- * progress monitor to report the number of objects written.
- * @param packStream
- * output stream of pack data. The stream should be buffered by
- * the caller. The caller is responsible for closing the stream.
- * @throws java.io.IOException
- * an error occurred reading a local object's data to include in
- * the pack, or writing compressed object data to the output
- * stream.
- * @throws WriteAbortedException
- * the write operation is aborted by
- * {@link org.eclipse.jgit.transport.ObjectCountCallback}.
- */
- public void writePack(ProgressMonitor compressMonitor,
- ProgressMonitor writeMonitor, OutputStream packStream)
- throws IOException {
- if (compressMonitor == null)
- compressMonitor = NullProgressMonitor.INSTANCE;
- if (writeMonitor == null)
- writeMonitor = NullProgressMonitor.INSTANCE;
-
- excludeInPacks = null;
- excludeInPackLast = null;
-
- boolean needSearchForReuse = reuseSupport != null && (
- reuseDeltas
- || config.isReuseObjects()
- || !cachedPacks.isEmpty());
-
- if (compressMonitor instanceof BatchingProgressMonitor) {
- long delay = 1000;
- if (needSearchForReuse && config.isDeltaCompress())
- delay = 500;
- ((BatchingProgressMonitor) compressMonitor).setDelayStart(
- delay,
- TimeUnit.MILLISECONDS);
- }
-
- if (needSearchForReuse)
- searchForReuse(compressMonitor);
- if (config.isDeltaCompress())
- searchForDeltas(compressMonitor);
-
- crc32 = new CRC32();
- final PackOutputStream out = new PackOutputStream(
- writeMonitor,
- isIndexDisabled()
- ? packStream
- : new CheckedOutputStream(packStream, crc32),
- this);
-
- long objCnt = packfileUriConfig == null ? getObjectCount() :
- getUnoffloadedObjectCount();
- stats.totalObjects = objCnt;
- if (callback != null)
- callback.setObjectCount(objCnt);
- beginPhase(PackingPhase.WRITING, writeMonitor, objCnt);
- long writeStart = System.currentTimeMillis();
- try {
- List<CachedPack> unwrittenCachedPacks;
-
- if (packfileUriConfig != null) {
- unwrittenCachedPacks = new ArrayList<>();
- CachedPackUriProvider p = packfileUriConfig.cachedPackUriProvider;
- PacketLineOut o = packfileUriConfig.pckOut;
-
- o.writeString("packfile-uris\n"); //$NON-NLS-1$
- for (CachedPack pack : cachedPacks) {
- CachedPackUriProvider.PackInfo packInfo = p.getInfo(
- pack, packfileUriConfig.protocolsSupported);
- if (packInfo != null) {
- o.writeString(packInfo.getHash() + ' ' +
- packInfo.getUri() + '\n');
- stats.offloadedPackfiles += 1;
- stats.offloadedPackfileSize += packInfo.getSize();
- } else {
- unwrittenCachedPacks.add(pack);
- }
- }
- packfileUriConfig.pckOut.writeDelim();
- packfileUriConfig.pckOut.writeString("packfile\n"); //$NON-NLS-1$
- } else {
- unwrittenCachedPacks = cachedPacks;
- }
-
- out.writeFileHeader(PACK_VERSION_GENERATED, objCnt);
- out.flush();
-
- writeObjects(out);
- if (!edgeObjects.isEmpty() || !cachedPacks.isEmpty()) {
- for (PackStatistics.ObjectType.Accumulator typeStat : stats.objectTypes) {
- if (typeStat == null)
- continue;
- stats.thinPackBytes += typeStat.bytes;
- }
- }
-
- stats.reusedPacks = Collections.unmodifiableList(cachedPacks);
- for (CachedPack pack : unwrittenCachedPacks) {
- long deltaCnt = pack.getDeltaCount();
- stats.reusedObjects += pack.getObjectCount();
- stats.reusedDeltas += deltaCnt;
- stats.totalDeltas += deltaCnt;
- reuseSupport.copyPackAsIs(out, pack);
- }
- writeChecksum(out);
- out.flush();
- } finally {
- stats.timeWriting = System.currentTimeMillis() - writeStart;
- stats.depth = depth;
-
- for (PackStatistics.ObjectType.Accumulator typeStat : stats.objectTypes) {
- if (typeStat == null)
- continue;
- typeStat.cntDeltas += typeStat.reusedDeltas;
- stats.reusedObjects += typeStat.reusedObjects;
- stats.reusedDeltas += typeStat.reusedDeltas;
- stats.totalDeltas += typeStat.cntDeltas;
- }
- }
-
- stats.totalBytes = out.length();
- reader.close();
- endPhase(writeMonitor);
- }
-
- /**
- * Get statistics of what this PackWriter did in order to create the final
- * pack stream.
- *
- * @return description of what this PackWriter did in order to create the
- * final pack stream. This should only be invoked after the calls to
- * create the pack/index/bitmap have completed.
- */
- public PackStatistics getStatistics() {
- return new PackStatistics(stats);
- }
-
- /**
- * Get snapshot of the current state of this PackWriter.
- *
- * @return snapshot of the current state of this PackWriter.
- */
- public State getState() {
- return state.snapshot();
- }
-
- /**
- * {@inheritDoc}
- * <p>
- * Release all resources used by this writer.
- */
- @Override
- public void close() {
- reader.close();
- if (myDeflater != null) {
- myDeflater.end();
- myDeflater = null;
- }
- instances.remove(selfRef);
- }
-
- private void searchForReuse(ProgressMonitor monitor) throws IOException {
- long cnt = 0;
- cnt += objectsLists[OBJ_COMMIT].size();
- cnt += objectsLists[OBJ_TREE].size();
- cnt += objectsLists[OBJ_BLOB].size();
- cnt += objectsLists[OBJ_TAG].size();
-
- long start = System.currentTimeMillis();
- searchForReuseStartTimeEpoc = start;
- beginPhase(PackingPhase.FINDING_SOURCES, monitor, cnt);
- if (cnt <= 4096) {
- // For small object counts, do everything as one list.
- BlockList<ObjectToPack> tmp = new BlockList<>((int) cnt);
- tmp.addAll(objectsLists[OBJ_TAG]);
- tmp.addAll(objectsLists[OBJ_COMMIT]);
- tmp.addAll(objectsLists[OBJ_TREE]);
- tmp.addAll(objectsLists[OBJ_BLOB]);
- searchForReuse(monitor, tmp);
- if (pruneCurrentObjectList) {
- // If the list was pruned, we need to re-prune the main lists.
- pruneEdgesFromObjectList(objectsLists[OBJ_COMMIT]);
- pruneEdgesFromObjectList(objectsLists[OBJ_TREE]);
- pruneEdgesFromObjectList(objectsLists[OBJ_BLOB]);
- pruneEdgesFromObjectList(objectsLists[OBJ_TAG]);
- }
- } else {
- searchForReuse(monitor, objectsLists[OBJ_TAG]);
- searchForReuse(monitor, objectsLists[OBJ_COMMIT]);
- searchForReuse(monitor, objectsLists[OBJ_TREE]);
- searchForReuse(monitor, objectsLists[OBJ_BLOB]);
- }
- endPhase(monitor);
- stats.timeSearchingForReuse = System.currentTimeMillis() - start;
-
- if (config.isReuseDeltas() && config.getCutDeltaChains()) {
- cutDeltaChains(objectsLists[OBJ_TREE]);
- cutDeltaChains(objectsLists[OBJ_BLOB]);
- }
- }
-
- private void searchForReuse(ProgressMonitor monitor, List<ObjectToPack> list)
- throws IOException, MissingObjectException {
- pruneCurrentObjectList = false;
- reuseSupport.selectObjectRepresentation(this, monitor, list);
- if (pruneCurrentObjectList)
- pruneEdgesFromObjectList(list);
- }
-
- private void cutDeltaChains(BlockList<ObjectToPack> list)
- throws IOException {
- int max = config.getMaxDeltaDepth();
- for (int idx = list.size() - 1; idx >= 0; idx--) {
- int d = 0;
- ObjectToPack b = list.get(idx).getDeltaBase();
- while (b != null) {
- if (d < b.getChainLength())
- break;
- b.setChainLength(++d);
- if (d >= max && b.isDeltaRepresentation()) {
- reselectNonDelta(b);
- break;
- }
- b = b.getDeltaBase();
- }
- }
- if (config.isDeltaCompress()) {
- for (ObjectToPack otp : list)
- otp.clearChainLength();
- }
- }
-
- private void searchForDeltas(ProgressMonitor monitor)
- throws MissingObjectException, IncorrectObjectTypeException,
- IOException {
- // Commits and annotated tags tend to have too many differences to
- // really benefit from delta compression. Consequently just don't
- // bother examining those types here.
- //
- ObjectToPack[] list = new ObjectToPack[
- objectsLists[OBJ_TREE].size()
- + objectsLists[OBJ_BLOB].size()
- + edgeObjects.size()];
- int cnt = 0;
- cnt = findObjectsNeedingDelta(list, cnt, OBJ_TREE);
- cnt = findObjectsNeedingDelta(list, cnt, OBJ_BLOB);
- if (cnt == 0)
- return;
- int nonEdgeCnt = cnt;
-
- // Queue up any edge objects that we might delta against. We won't
- // be sending these as we assume the other side has them, but we need
- // them in the search phase below.
- //
- for (ObjectToPack eo : edgeObjects) {
- eo.setWeight(0);
- list[cnt++] = eo;
- }
-
- // Compute the sizes of the objects so we can do a proper sort.
- // We let the reader skip missing objects if it chooses. For
- // some readers this can be a huge win. We detect missing objects
- // by having set the weights above to 0 and allowing the delta
- // search code to discover the missing object and skip over it, or
- // abort with an exception if we actually had to have it.
- //
- final long sizingStart = System.currentTimeMillis();
- beginPhase(PackingPhase.GETTING_SIZES, monitor, cnt);
- AsyncObjectSizeQueue<ObjectToPack> sizeQueue = reader.getObjectSize(
- Arrays.<ObjectToPack> asList(list).subList(0, cnt), false);
- try {
- final long limit = Math.min(
- config.getBigFileThreshold(),
- Integer.MAX_VALUE);
- for (;;) {
- try {
- if (!sizeQueue.next())
- break;
- } catch (MissingObjectException notFound) {
- monitor.update(1);
- if (ignoreMissingUninteresting) {
- ObjectToPack otp = sizeQueue.getCurrent();
- if (otp != null && otp.isEdge()) {
- otp.setDoNotDelta();
- continue;
- }
-
- otp = objectsMap.get(notFound.getObjectId());
- if (otp != null && otp.isEdge()) {
- otp.setDoNotDelta();
- continue;
- }
- }
- throw notFound;
- }
-
- ObjectToPack otp = sizeQueue.getCurrent();
- if (otp == null)
- otp = objectsMap.get(sizeQueue.getObjectId());
-
- long sz = sizeQueue.getSize();
- if (DeltaIndex.BLKSZ < sz && sz < limit)
- otp.setWeight((int) sz);
- else
- otp.setDoNotDelta(); // too small, or too big
- monitor.update(1);
- }
- } finally {
- sizeQueue.release();
- }
- endPhase(monitor);
- stats.timeSearchingForSizes = System.currentTimeMillis() - sizingStart;
-
- // Sort the objects by path hash so like files are near each other,
- // and then by size descending so that bigger files are first. This
- // applies "Linus' Law" which states that newer files tend to be the
- // bigger ones, because source files grow and hardly ever shrink.
- //
- Arrays.sort(list, 0, cnt, (ObjectToPack a, ObjectToPack b) -> {
- int cmp = (a.isDoNotDelta() ? 1 : 0) - (b.isDoNotDelta() ? 1 : 0);
- if (cmp != 0) {
- return cmp;
- }
-
- cmp = a.getType() - b.getType();
- if (cmp != 0) {
- return cmp;
- }
-
- cmp = (a.getPathHash() >>> 1) - (b.getPathHash() >>> 1);
- if (cmp != 0) {
- return cmp;
- }
-
- cmp = (a.getPathHash() & 1) - (b.getPathHash() & 1);
- if (cmp != 0) {
- return cmp;
- }
-
- cmp = (a.isEdge() ? 0 : 1) - (b.isEdge() ? 0 : 1);
- if (cmp != 0) {
- return cmp;
- }
-
- return b.getWeight() - a.getWeight();
- });
-
- // The sort above moved the objects we cannot delta to the end.
- // Remove them from the list so we don't waste time on them.
- while (0 < cnt && list[cnt - 1].isDoNotDelta()) {
- if (!list[cnt - 1].isEdge())
- nonEdgeCnt--;
- cnt--;
- }
- if (cnt == 0)
- return;
-
- final long searchStart = System.currentTimeMillis();
- searchForDeltas(monitor, list, cnt);
- stats.deltaSearchNonEdgeObjects = nonEdgeCnt;
- stats.timeCompressing = System.currentTimeMillis() - searchStart;
-
- for (int i = 0; i < cnt; i++)
- if (!list[i].isEdge() && list[i].isDeltaRepresentation())
- stats.deltasFound++;
- }
-
- private int findObjectsNeedingDelta(ObjectToPack[] list, int cnt, int type) {
- for (ObjectToPack otp : objectsLists[type]) {
- if (otp.isDoNotDelta()) // delta is disabled for this path
- continue;
- if (otp.isDeltaRepresentation()) // already reusing a delta
- continue;
- otp.setWeight(0);
- list[cnt++] = otp;
- }
- return cnt;
- }
-
- private void reselectNonDelta(ObjectToPack otp) throws IOException {
- otp.clearDeltaBase();
- otp.clearReuseAsIs();
- boolean old = reuseDeltas;
- reuseDeltas = false;
- reuseSupport.selectObjectRepresentation(this,
- NullProgressMonitor.INSTANCE,
- Collections.singleton(otp));
- reuseDeltas = old;
- }
-
- private void searchForDeltas(final ProgressMonitor monitor,
- final ObjectToPack[] list, final int cnt)
- throws MissingObjectException, IncorrectObjectTypeException,
- LargeObjectException, IOException {
- int threads = config.getThreads();
- if (threads == 0)
- threads = Runtime.getRuntime().availableProcessors();
- if (threads <= 1 || cnt <= config.getDeltaSearchWindowSize())
- singleThreadDeltaSearch(monitor, list, cnt);
- else
- parallelDeltaSearch(monitor, list, cnt, threads);
- }
-
- private void singleThreadDeltaSearch(ProgressMonitor monitor,
- ObjectToPack[] list, int cnt) throws IOException {
- long totalWeight = 0;
- for (int i = 0; i < cnt; i++) {
- ObjectToPack o = list[i];
- totalWeight += DeltaTask.getAdjustedWeight(o);
- }
-
- long bytesPerUnit = 1;
- while (DeltaTask.MAX_METER <= (totalWeight / bytesPerUnit))
- bytesPerUnit <<= 10;
- int cost = (int) (totalWeight / bytesPerUnit);
- if (totalWeight % bytesPerUnit != 0)
- cost++;
-
- beginPhase(PackingPhase.COMPRESSING, monitor, cost);
- new DeltaWindow(config, new DeltaCache(config), reader,
- monitor, bytesPerUnit,
- list, 0, cnt).search();
- endPhase(monitor);
- }
-
- @SuppressWarnings("Finally")
- private void parallelDeltaSearch(ProgressMonitor monitor,
- ObjectToPack[] list, int cnt, int threads) throws IOException {
- DeltaCache dc = new ThreadSafeDeltaCache(config);
- ThreadSafeProgressMonitor pm = new ThreadSafeProgressMonitor(monitor);
- DeltaTask.Block taskBlock = new DeltaTask.Block(threads, config,
- reader, dc, pm,
- list, 0, cnt);
- taskBlock.partitionTasks();
- beginPhase(PackingPhase.COMPRESSING, monitor, taskBlock.cost());
- pm.startWorkers(taskBlock.tasks.size());
-
- Executor executor = config.getExecutor();
- final List<Throwable> errors =
- Collections.synchronizedList(new ArrayList<>(threads));
- if (executor instanceof ExecutorService) {
- // Caller supplied us a service, use it directly.
- runTasks((ExecutorService) executor, pm, taskBlock, errors);
- } else if (executor == null) {
- // Caller didn't give us a way to run the tasks, spawn up a
- // temporary thread pool and make sure it tears down cleanly.
- ExecutorService pool = Executors.newFixedThreadPool(threads);
- Throwable e1 = null;
- try {
- runTasks(pool, pm, taskBlock, errors);
- } catch (Exception e) {
- e1 = e;
- } finally {
- pool.shutdown();
- for (;;) {
- try {
- if (pool.awaitTermination(60, TimeUnit.SECONDS)) {
- break;
- }
- } catch (InterruptedException e) {
- if (e1 != null) {
- e.addSuppressed(e1);
- }
- throw new IOException(JGitText
- .get().packingCancelledDuringObjectsWriting, e);
- }
- }
- }
- } else {
- // The caller gave us an executor, but it might not do
- // asynchronous execution. Wrap everything and hope it
- // can schedule these for us.
- for (DeltaTask task : taskBlock.tasks) {
- executor.execute(() -> {
- try {
- task.call();
- } catch (Throwable failure) {
- errors.add(failure);
- }
- });
- }
- try {
- pm.waitForCompletion();
- } catch (InterruptedException ie) {
- // We can't abort the other tasks as we have no handle.
- // Cross our fingers and just break out anyway.
- //
- throw new IOException(
- JGitText.get().packingCancelledDuringObjectsWriting,
- ie);
- }
- }
-
- // If any task threw an error, try to report it back as
- // though we weren't using a threaded search algorithm.
- //
- if (!errors.isEmpty()) {
- Throwable err = errors.get(0);
- if (err instanceof Error)
- throw (Error) err;
- if (err instanceof RuntimeException)
- throw (RuntimeException) err;
- if (err instanceof IOException)
- throw (IOException) err;
-
- throw new IOException(err.getMessage(), err);
- }
- endPhase(monitor);
- }
-
- private static void runTasks(ExecutorService pool,
- ThreadSafeProgressMonitor pm,
- DeltaTask.Block tb, List<Throwable> errors) throws IOException {
- List<Future<?>> futures = new ArrayList<>(tb.tasks.size());
- for (DeltaTask task : tb.tasks)
- futures.add(pool.submit(task));
-
- try {
- pm.waitForCompletion();
- for (Future<?> f : futures) {
- try {
- f.get();
- } catch (ExecutionException failed) {
- errors.add(failed.getCause());
- }
- }
- } catch (InterruptedException ie) {
- for (Future<?> f : futures)
- f.cancel(true);
- throw new IOException(
- JGitText.get().packingCancelledDuringObjectsWriting, ie);
- }
- }
-
- private void writeObjects(PackOutputStream out) throws IOException {
- writeObjects(out, objectsLists[OBJ_COMMIT]);
- writeObjects(out, objectsLists[OBJ_TAG]);
- writeObjects(out, objectsLists[OBJ_TREE]);
- writeObjects(out, objectsLists[OBJ_BLOB]);
- }
-
- private void writeObjects(PackOutputStream out, List<ObjectToPack> list)
- throws IOException {
- if (list.isEmpty())
- return;
-
- typeStats = stats.objectTypes[list.get(0).getType()];
- long beginOffset = out.length();
-
- if (reuseSupport != null) {
- reuseSupport.writeObjects(out, list);
- } else {
- for (ObjectToPack otp : list)
- out.writeObject(otp);
- }
-
- typeStats.bytes += out.length() - beginOffset;
- typeStats.cntObjects = list.size();
- }
-
- void writeObject(PackOutputStream out, ObjectToPack otp) throws IOException {
- if (!otp.isWritten())
- writeObjectImpl(out, otp);
- }
-
- private void writeObjectImpl(PackOutputStream out, ObjectToPack otp)
- throws IOException {
- if (otp.wantWrite()) {
- // A cycle exists in this delta chain. This should only occur if a
- // selected object representation disappeared during writing
- // (for example due to a concurrent repack) and a different base
- // was chosen, forcing a cycle. Select something other than a
- // delta, and write this object.
- reselectNonDelta(otp);
- }
- otp.markWantWrite();
-
- while (otp.isReuseAsIs()) {
- writeBase(out, otp.getDeltaBase());
- if (otp.isWritten())
- return; // Delta chain cycle caused this to write already.
-
- crc32.reset();
- otp.setOffset(out.length());
- try {
- reuseSupport.copyObjectAsIs(out, otp, reuseValidate);
- out.endObject();
- otp.setCRC((int) crc32.getValue());
- typeStats.reusedObjects++;
- if (otp.isDeltaRepresentation()) {
- typeStats.reusedDeltas++;
- typeStats.deltaBytes += out.length() - otp.getOffset();
- }
- return;
- } catch (StoredObjectRepresentationNotAvailableException gone) {
- if (otp.getOffset() == out.length()) {
- otp.setOffset(0);
- otp.clearDeltaBase();
- otp.clearReuseAsIs();
- reuseSupport.selectObjectRepresentation(this,
- NullProgressMonitor.INSTANCE,
- Collections.singleton(otp));
- continue;
- }
- // Object writing already started, we cannot recover.
- //
- CorruptObjectException coe;
- coe = new CorruptObjectException(otp, ""); //$NON-NLS-1$
- coe.initCause(gone);
- throw coe;
- }
- }
-
- // If we reached here, reuse wasn't possible.
- //
- if (otp.isDeltaRepresentation()) {
- writeDeltaObjectDeflate(out, otp);
- } else {
- writeWholeObjectDeflate(out, otp);
- }
- out.endObject();
- otp.setCRC((int) crc32.getValue());
- }
-
- private void writeBase(PackOutputStream out, ObjectToPack base)
- throws IOException {
- if (base != null && !base.isWritten() && !base.isEdge())
- writeObjectImpl(out, base);
- }
-
- private void writeWholeObjectDeflate(PackOutputStream out,
- final ObjectToPack otp) throws IOException {
- final Deflater deflater = deflater();
- final ObjectLoader ldr = reader.open(otp, otp.getType());
-
- crc32.reset();
- otp.setOffset(out.length());
- out.writeHeader(otp, ldr.getSize());
-
- deflater.reset();
- DeflaterOutputStream dst = new DeflaterOutputStream(out, deflater);
- ldr.copyTo(dst);
- dst.finish();
- }
-
- private void writeDeltaObjectDeflate(PackOutputStream out,
- final ObjectToPack otp) throws IOException {
- writeBase(out, otp.getDeltaBase());
-
- crc32.reset();
- otp.setOffset(out.length());
-
- DeltaCache.Ref ref = otp.popCachedDelta();
- if (ref != null) {
- byte[] zbuf = ref.get();
- if (zbuf != null) {
- out.writeHeader(otp, otp.getCachedSize());
- out.write(zbuf);
- typeStats.cntDeltas++;
- typeStats.deltaBytes += out.length() - otp.getOffset();
- return;
- }
- }
-
- try (TemporaryBuffer.Heap delta = delta(otp)) {
- out.writeHeader(otp, delta.length());
-
- Deflater deflater = deflater();
- deflater.reset();
- DeflaterOutputStream dst = new DeflaterOutputStream(out, deflater);
- delta.writeTo(dst, null);
- dst.finish();
- }
- typeStats.cntDeltas++;
- typeStats.deltaBytes += out.length() - otp.getOffset();
- }
-
- private TemporaryBuffer.Heap delta(ObjectToPack otp)
- throws IOException {
- DeltaIndex index = new DeltaIndex(buffer(otp.getDeltaBaseId()));
- byte[] res = buffer(otp);
-
- // We never would have proposed this pair if the delta would be
- // larger than the unpacked version of the object. So using it
- // as our buffer limit is valid: we will never reach it.
- //
- TemporaryBuffer.Heap delta = new TemporaryBuffer.Heap(res.length);
- index.encode(delta, res);
- return delta;
- }
-
- private byte[] buffer(AnyObjectId objId) throws IOException {
- return buffer(config, reader, objId);
- }
-
- static byte[] buffer(PackConfig config, ObjectReader or, AnyObjectId objId)
- throws IOException {
- // PackWriter should have already pruned objects that
- // are above the big file threshold, so our chances of
- // the object being below it are very good. We really
- // shouldn't be here, unless the implementation is odd.
-
- return or.open(objId).getCachedBytes(config.getBigFileThreshold());
- }
-
- private Deflater deflater() {
- if (myDeflater == null)
- myDeflater = new Deflater(config.getCompressionLevel());
- return myDeflater;
- }
-
- private void writeChecksum(PackOutputStream out) throws IOException {
- packcsum = out.getDigest();
- out.write(packcsum);
- }
-
- private void findObjectsToPack(@NonNull ProgressMonitor countingMonitor,
- @NonNull ObjectWalk walker, @NonNull Set<? extends ObjectId> want,
- @NonNull Set<? extends ObjectId> have,
- @NonNull Set<? extends ObjectId> noBitmaps) throws IOException {
- final long countingStart = System.currentTimeMillis();
- beginPhase(PackingPhase.COUNTING, countingMonitor, ProgressMonitor.UNKNOWN);
-
- stats.interestingObjects = Collections.unmodifiableSet(new HashSet<ObjectId>(want));
- stats.uninterestingObjects = Collections.unmodifiableSet(new HashSet<ObjectId>(have));
- excludeFromBitmapSelection = noBitmaps;
-
- canBuildBitmaps = config.isBuildBitmaps()
- && !shallowPack
- && have.isEmpty()
- && (excludeInPacks == null || excludeInPacks.length == 0);
- if (!shallowPack && useBitmaps) {
- BitmapIndex bitmapIndex = reader.getBitmapIndex();
- if (bitmapIndex != null) {
- BitmapWalker bitmapWalker = new BitmapWalker(
- walker, bitmapIndex, countingMonitor);
- findObjectsToPackUsingBitmaps(bitmapWalker, want, have);
- endPhase(countingMonitor);
- stats.timeCounting = System.currentTimeMillis() - countingStart;
- stats.bitmapIndexMisses = bitmapWalker.getCountOfBitmapIndexMisses();
- return;
- }
- }
-
- List<ObjectId> all = new ArrayList<>(want.size() + have.size());
- all.addAll(want);
- all.addAll(have);
-
- final RevFlag include = walker.newFlag("include"); //$NON-NLS-1$
- final RevFlag added = walker.newFlag("added"); //$NON-NLS-1$
-
- walker.carry(include);
-
- int haveEst = have.size();
- if (have.isEmpty()) {
- walker.sort(RevSort.COMMIT_TIME_DESC);
- } else {
- walker.sort(RevSort.TOPO);
- if (thin)
- walker.sort(RevSort.BOUNDARY, true);
- }
-
- List<RevObject> wantObjs = new ArrayList<>(want.size());
- List<RevObject> haveObjs = new ArrayList<>(haveEst);
- List<RevTag> wantTags = new ArrayList<>(want.size());
-
- // Retrieve the RevWalk's versions of "want" and "have" objects to
- // maintain any state previously set in the RevWalk.
- AsyncRevObjectQueue q = walker.parseAny(all, true);
- try {
- for (;;) {
- try {
- RevObject o = q.next();
- if (o == null)
- break;
- if (have.contains(o))
- haveObjs.add(o);
- if (want.contains(o)) {
- o.add(include);
- wantObjs.add(o);
- if (o instanceof RevTag)
- wantTags.add((RevTag) o);
- }
- } catch (MissingObjectException e) {
- if (ignoreMissingUninteresting
- && have.contains(e.getObjectId()))
- continue;
- throw e;
- }
- }
- } finally {
- q.release();
- }
-
- if (!wantTags.isEmpty()) {
- all = new ArrayList<>(wantTags.size());
- for (RevTag tag : wantTags)
- all.add(tag.getObject());
- q = walker.parseAny(all, true);
- try {
- while (q.next() != null) {
- // Just need to pop the queue item to parse the object.
- }
- } finally {
- q.release();
- }
- }
-
- if (walker instanceof DepthWalk.ObjectWalk) {
- DepthWalk.ObjectWalk depthWalk = (DepthWalk.ObjectWalk) walker;
- for (RevObject obj : wantObjs) {
- depthWalk.markRoot(obj);
- }
- // Mark the tree objects associated with "have" commits as
- // uninteresting to avoid writing redundant blobs. A normal RevWalk
- // lazily propagates the "uninteresting" state from a commit to its
- // tree during the walk, but DepthWalks can terminate early so
- // preemptively propagate that state here.
- for (RevObject obj : haveObjs) {
- if (obj instanceof RevCommit) {
- RevTree t = ((RevCommit) obj).getTree();
- depthWalk.markUninteresting(t);
- }
- }
-
- if (unshallowObjects != null) {
- for (ObjectId id : unshallowObjects) {
- depthWalk.markUnshallow(walker.parseAny(id));
- }
- }
- } else {
- for (RevObject obj : wantObjs)
- walker.markStart(obj);
- }
- for (RevObject obj : haveObjs)
- walker.markUninteresting(obj);
-
- final int maxBases = config.getDeltaSearchWindowSize();
- Set<RevTree> baseTrees = new HashSet<>();
- BlockList<RevCommit> commits = new BlockList<>();
- Set<ObjectId> roots = new HashSet<>();
- RevCommit c;
- while ((c = walker.next()) != null) {
- if (exclude(c))
- continue;
- if (c.has(RevFlag.UNINTERESTING)) {
- if (baseTrees.size() <= maxBases)
- baseTrees.add(c.getTree());
- continue;
- }
-
- commits.add(c);
- if (c.getParentCount() == 0) {
- roots.add(c.copy());
- }
- countingMonitor.update(1);
- }
- stats.rootCommits = Collections.unmodifiableSet(roots);
-
- if (shallowPack) {
- for (RevCommit cmit : commits) {
- addObject(cmit, 0);
- }
- } else {
- int commitCnt = 0;
- boolean putTagTargets = false;
- for (RevCommit cmit : commits) {
- if (!cmit.has(added)) {
- cmit.add(added);
- addObject(cmit, 0);
- commitCnt++;
- }
-
- for (int i = 0; i < cmit.getParentCount(); i++) {
- RevCommit p = cmit.getParent(i);
- if (!p.has(added) && !p.has(RevFlag.UNINTERESTING)
- && !exclude(p)) {
- p.add(added);
- addObject(p, 0);
- commitCnt++;
- }
- }
-
- if (!putTagTargets && 4096 < commitCnt) {
- for (ObjectId id : tagTargets) {
- RevObject obj = walker.lookupOrNull(id);
- if (obj instanceof RevCommit
- && obj.has(include)
- && !obj.has(RevFlag.UNINTERESTING)
- && !obj.has(added)) {
- obj.add(added);
- addObject(obj, 0);
- }
- }
- putTagTargets = true;
- }
- }
- }
- commits = null;
-
- if (thin && !baseTrees.isEmpty()) {
- BaseSearch bases = new BaseSearch(countingMonitor, baseTrees, //
- objectsMap, edgeObjects, reader);
- RevObject o;
- while ((o = walker.nextObject()) != null) {
- if (o.has(RevFlag.UNINTERESTING))
- continue;
- if (exclude(o))
- continue;
-
- int pathHash = walker.getPathHashCode();
- byte[] pathBuf = walker.getPathBuffer();
- int pathLen = walker.getPathLength();
- bases.addBase(o.getType(), pathBuf, pathLen, pathHash);
- if (!depthSkip(o, walker)) {
- filterAndAddObject(o, o.getType(), pathHash, want);
- }
- countingMonitor.update(1);
- }
- } else {
- RevObject o;
- while ((o = walker.nextObject()) != null) {
- if (o.has(RevFlag.UNINTERESTING))
- continue;
- if (exclude(o))
- continue;
- if (!depthSkip(o, walker)) {
- filterAndAddObject(o, o.getType(), walker.getPathHashCode(),
- want);
- }
- countingMonitor.update(1);
- }
- }
-
- for (CachedPack pack : cachedPacks)
- countingMonitor.update((int) pack.getObjectCount());
- endPhase(countingMonitor);
- stats.timeCounting = System.currentTimeMillis() - countingStart;
- stats.bitmapIndexMisses = -1;
- }
-
- private void findObjectsToPackUsingBitmaps(
- BitmapWalker bitmapWalker, Set<? extends ObjectId> want,
- Set<? extends ObjectId> have)
- throws MissingObjectException, IncorrectObjectTypeException,
- IOException {
- BitmapBuilder haveBitmap = bitmapWalker.findObjects(have, null, true);
- BitmapBuilder wantBitmap = bitmapWalker.findObjects(want, haveBitmap,
- false);
- BitmapBuilder needBitmap = wantBitmap.andNot(haveBitmap);
-
- if (useCachedPacks && reuseSupport != null && !reuseValidate
- && (excludeInPacks == null || excludeInPacks.length == 0))
- cachedPacks.addAll(
- reuseSupport.getCachedPacksAndUpdate(needBitmap));
-
- for (BitmapObject obj : needBitmap) {
- ObjectId objectId = obj.getObjectId();
- if (exclude(objectId)) {
- needBitmap.remove(objectId);
- continue;
- }
- filterAndAddObject(objectId, obj.getType(), 0, want);
- }
-
- if (thin)
- haveObjects = haveBitmap;
- }
-
- private static void pruneEdgesFromObjectList(List<ObjectToPack> list) {
- final int size = list.size();
- int src = 0;
- int dst = 0;
-
- for (; src < size; src++) {
- ObjectToPack obj = list.get(src);
- if (obj.isEdge())
- continue;
- if (dst != src)
- list.set(dst, obj);
- dst++;
- }
-
- while (dst < list.size())
- list.remove(list.size() - 1);
- }
-
- /**
- * Include one object in the output file.
- * <p>
- * Objects are written in the order they are added. If the same object is
- * added twice, it may be written twice, creating a larger than necessary
- * file.
- *
- * @param object
- * the object to add.
- * @throws org.eclipse.jgit.errors.IncorrectObjectTypeException
- * the object is an unsupported type.
- */
- public void addObject(RevObject object)
- throws IncorrectObjectTypeException {
- if (!exclude(object))
- addObject(object, 0);
- }
-
- private void addObject(RevObject object, int pathHashCode) {
- addObject(object, object.getType(), pathHashCode);
- }
-
- private void addObject(
- final AnyObjectId src, final int type, final int pathHashCode) {
- final ObjectToPack otp;
- if (reuseSupport != null)
- otp = reuseSupport.newObjectToPack(src, type);
- else
- otp = new ObjectToPack(src, type);
- otp.setPathHash(pathHashCode);
- objectsLists[type].add(otp);
- objectsMap.add(otp);
- }
-
- /**
- * Determines if the object should be omitted from the pack as a result of
- * its depth (probably because of the {@code tree:<depth>} filter).
- * <p>
- * Causes {@code walker} to skip traversing the current tree, which ought to
- * have just started traversal, assuming this method is called as soon as a
- * new depth is reached.
- * <p>
- * This method increments the {@code treesTraversed} statistic.
- *
- * @param obj
- * the object to check whether it should be omitted.
- * @param walker
- * the walker being used for traversal.
- * @return whether the given object should be skipped.
- */
- private boolean depthSkip(@NonNull RevObject obj, ObjectWalk walker) {
- long treeDepth = walker.getTreeDepth();
-
- // Check if this object needs to be rejected because it is a tree or
- // blob that is too deep from the root tree.
-
- // A blob is considered one level deeper than the tree that contains it.
- if (obj.getType() == OBJ_BLOB) {
- treeDepth++;
- } else {
- stats.treesTraversed++;
- }
-
- if (filterSpec.getTreeDepthLimit() < 0 ||
- treeDepth <= filterSpec.getTreeDepthLimit()) {
- return false;
- }
-
- walker.skipTree();
- return true;
- }
-
- // Adds the given object as an object to be packed, unless the filter
- // rejects it: an unwanted object of a disallowed type, or an unwanted
- // blob larger than the blob size limit.
- private void filterAndAddObject(@NonNull AnyObjectId src, int type,
- int pathHashCode, @NonNull Set<? extends AnyObjectId> want)
- throws IOException {
-
- // Check if this object needs to be rejected, doing the cheaper
- // checks first.
- boolean reject =
- (!filterSpec.allowsType(type) && !want.contains(src)) ||
- (filterSpec.getBlobLimit() >= 0 &&
- type == OBJ_BLOB &&
- !want.contains(src) &&
- reader.getObjectSize(src, OBJ_BLOB) > filterSpec.getBlobLimit());
- if (!reject) {
- addObject(src, type, pathHashCode);
- }
- }
-
- private boolean exclude(AnyObjectId objectId) {
- if (excludeInPacks == null)
- return false;
- if (excludeInPackLast.contains(objectId))
- return true;
- for (ObjectIdSet idx : excludeInPacks) {
- if (idx.contains(objectId)) {
- excludeInPackLast = idx;
- return true;
- }
- }
- return false;
- }
-
- /**
- * Select an object representation for this writer.
- * <p>
- * An {@link org.eclipse.jgit.lib.ObjectReader} implementation should invoke
- * this method once for each representation available for an object, to
- * allow the writer to find the most suitable one for the output.
- *
- * @param otp
- * the object being packed.
- * @param next
- * the next available representation from the repository.
- */
- public void select(ObjectToPack otp, StoredObjectRepresentation next) {
- int nFmt = next.getFormat();
-
- if (!cachedPacks.isEmpty()) {
- if (otp.isEdge())
- return;
- if (nFmt == PACK_WHOLE || nFmt == PACK_DELTA) {
- for (CachedPack pack : cachedPacks) {
- if (pack.hasObject(otp, next)) {
- otp.setEdge();
- otp.clearDeltaBase();
- otp.clearReuseAsIs();
- pruneCurrentObjectList = true;
- return;
- }
- }
- }
- }
-
- if (nFmt == PACK_DELTA && reuseDeltas && reuseDeltaFor(otp)) {
- ObjectId baseId = next.getDeltaBase();
- ObjectToPack ptr = objectsMap.get(baseId);
- if (ptr != null && !ptr.isEdge()) {
- otp.setDeltaBase(ptr);
- otp.setReuseAsIs();
- } else if (thin && have(ptr, baseId)) {
- otp.setDeltaBase(baseId);
- otp.setReuseAsIs();
- } else {
- otp.clearDeltaBase();
- otp.clearReuseAsIs();
- }
- } else if (nFmt == PACK_WHOLE && config.isReuseObjects()) {
- int nWeight = next.getWeight();
- if (otp.isReuseAsIs() && !otp.isDeltaRepresentation()) {
- // We've already chosen another PACK_WHOLE format for this object;
- // keep the one that has the smaller compressed size.
- //
- if (otp.getWeight() <= nWeight)
- return;
- }
- otp.clearDeltaBase();
- otp.setReuseAsIs();
- otp.setWeight(nWeight);
- } else {
- otp.clearDeltaBase();
- otp.clearReuseAsIs();
- }
-
- otp.setDeltaAttempted(reuseDeltas && next.wasDeltaAttempted());
- otp.select(next);
- }
-
- private final boolean have(ObjectToPack ptr, AnyObjectId objectId) {
- return (ptr != null && ptr.isEdge())
- || (haveObjects != null && haveObjects.contains(objectId));
- }
-
- /**
- * Prepares the bitmaps to be written to the bitmap index file.
- * <p>
- * Bitmaps can be used to speed up fetches and clones by storing the entire
- * object graph at selected commits. Writing a bitmap index is an optional
- * feature that not all pack users may require.
- * <p>
- * Called after {@link #writeIndex(OutputStream)}.
- * <p>
- * To reduce memory usage, internal state is cleared during this method,
- * rendering the PackWriter instance useless for anything other than a call
- * to write out the new bitmaps with {@link #writeBitmapIndex(OutputStream)}.
- *
- * @param pm
- * progress monitor to report bitmap building work.
- * @return whether a bitmap index may be written.
- * @throws java.io.IOException
- * when an I/O problem occurs while reading objects.
- */
- public boolean prepareBitmapIndex(ProgressMonitor pm) throws IOException {
- if (!canBuildBitmaps || getObjectCount() > Integer.MAX_VALUE
- || !cachedPacks.isEmpty())
- return false;
-
- if (pm == null)
- pm = NullProgressMonitor.INSTANCE;
-
- int numCommits = objectsLists[OBJ_COMMIT].size();
- List<ObjectToPack> byName = sortByName();
- sortedByName = null;
- objectsLists = null;
- objectsMap = null;
- writeBitmaps = new PackBitmapIndexBuilder(byName);
- byName = null;
-
- PackWriterBitmapPreparer bitmapPreparer = new PackWriterBitmapPreparer(
- reader, writeBitmaps, pm, stats.interestingObjects, config);
-
- Collection<BitmapCommit> selectedCommits = bitmapPreparer
- .selectCommits(numCommits, excludeFromBitmapSelection);
-
- beginPhase(PackingPhase.BUILDING_BITMAPS, pm, selectedCommits.size());
-
- BitmapWalker walker = bitmapPreparer.newBitmapWalker();
- AnyObjectId last = null;
- for (BitmapCommit cmit : selectedCommits) {
- if (!cmit.isReuseWalker()) {
- walker = bitmapPreparer.newBitmapWalker();
- }
- BitmapBuilder bitmap = walker.findObjects(
- Collections.singleton(cmit), null, false);
-
- if (last != null && cmit.isReuseWalker() && !bitmap.contains(last))
- throw new IllegalStateException(MessageFormat.format(
- JGitText.get().bitmapMissingObject, cmit.name(),
- last.name()));
- last = BitmapCommit.copyFrom(cmit).build();
- writeBitmaps.processBitmapForWrite(cmit, bitmap.build(),
- cmit.getFlags());
-
- // The bitmap walker should stop when the walk hits the previous
- // commit, which saves time.
- walker.setPrevCommit(last);
- walker.setPrevBitmap(bitmap);
-
- pm.update(1);
- }
-
- endPhase(pm);
- return true;
- }
-
- private boolean reuseDeltaFor(ObjectToPack otp) {
- int type = otp.getType();
- if ((type & 2) != 0) // OBJ_TREE(2) or OBJ_BLOB(3)
- return true;
- if (type == OBJ_COMMIT)
- return reuseDeltaCommits;
- if (type == OBJ_TAG)
- return false;
- return true;
- }
-
- private class MutableState {
- /** Estimated size of a single ObjectToPack instance. */
- // Assume 64-bit pointers, since this is just an estimate.
- private static final long OBJECT_TO_PACK_SIZE =
- (2 * 8) // Object header
- + (2 * 8) + (2 * 8) // ObjectToPack fields
- + (8 + 8) // PackedObjectInfo fields
- + 8 // ObjectIdOwnerMap fields
- + 40 // AnyObjectId fields
- + 8; // Reference in BlockList
-
- private final long totalDeltaSearchBytes;
-
- private volatile PackingPhase phase;
-
- MutableState() {
- phase = PackingPhase.COUNTING;
- if (config.isDeltaCompress()) {
- int threads = config.getThreads();
- if (threads <= 0)
- threads = Runtime.getRuntime().availableProcessors();
- totalDeltaSearchBytes = (threads * config.getDeltaSearchMemoryLimit())
- + config.getBigFileThreshold();
- } else
- totalDeltaSearchBytes = 0;
- }
-
- State snapshot() {
- long objCnt = 0;
- BlockList<ObjectToPack>[] lists = objectsLists;
- if (lists != null) {
- objCnt += lists[OBJ_COMMIT].size();
- objCnt += lists[OBJ_TREE].size();
- objCnt += lists[OBJ_BLOB].size();
- objCnt += lists[OBJ_TAG].size();
- // Exclude CachedPacks.
- }
-
- long bytesUsed = OBJECT_TO_PACK_SIZE * objCnt;
- PackingPhase curr = phase;
- if (curr == PackingPhase.COMPRESSING)
- bytesUsed += totalDeltaSearchBytes;
- return new State(curr, bytesUsed);
- }
- }
-
- /** Possible states that a PackWriter can be in. */
- public enum PackingPhase {
- /** Counting objects phase. */
- COUNTING,
-
- /** Getting sizes phase. */
- GETTING_SIZES,
-
- /** Finding sources phase. */
- FINDING_SOURCES,
-
- /** Compressing objects phase. */
- COMPRESSING,
-
- /** Writing objects phase. */
- WRITING,
-
- /** Building bitmaps phase. */
- BUILDING_BITMAPS;
- }
-
- /** Summary of the current state of a PackWriter. */
- public class State {
- private final PackingPhase phase;
-
- private final long bytesUsed;
-
- State(PackingPhase phase, long bytesUsed) {
- this.phase = phase;
- this.bytesUsed = bytesUsed;
- }
-
- /** @return the PackConfig used to build the writer. */
- public PackConfig getConfig() {
- return config;
- }
-
- /** @return the current phase of the writer. */
- public PackingPhase getPhase() {
- return phase;
- }
-
- /** @return an estimate of the total memory used by the writer. */
- public long estimateBytesUsed() {
- return bytesUsed;
- }
-
- @SuppressWarnings("nls")
- @Override
- public String toString() {
- return "PackWriter.State[" + phase + ", memory=" + bytesUsed + "]";
- }
- }
-
- /**
- * Configuration related to the packfile URI feature.
- *
- * @since 5.5
- */
- public static class PackfileUriConfig {
- @NonNull
- private final PacketLineOut pckOut;
-
- @NonNull
- private final Collection<String> protocolsSupported;
-
- @NonNull
- private final CachedPackUriProvider cachedPackUriProvider;
-
- /**
- * @param pckOut where to write "packfile-uri" lines to (should
- * output to the same stream as the one passed to
- * PackWriter#writePack)
- * @param protocolsSupported list of protocols supported (e.g. "https")
- * @param cachedPackUriProvider provider of URIs corresponding
- * to cached packs
- * @since 5.5
- */
- public PackfileUriConfig(@NonNull PacketLineOut pckOut,
- @NonNull Collection<String> protocolsSupported,
- @NonNull CachedPackUriProvider cachedPackUriProvider) {
- this.pckOut = pckOut;
- this.protocolsSupported = protocolsSupported;
- this.cachedPackUriProvider = cachedPackUriProvider;
- }
- }
- }
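
The memory estimate computed by MutableState.snapshot() above is plain
arithmetic, and it can be reproduced outside of PackWriter. The following is
a minimal standalone sketch, not JGit code: the class name PackMemoryEstimate
and its method names are hypothetical, and the thread count and byte limits
in main() are illustrative values rather than PackConfig defaults. Only the
per-object constant and the two formulas are taken from the code above.

```java
final class PackMemoryEstimate {
    // Same sum as MutableState.OBJECT_TO_PACK_SIZE, assuming 64-bit pointers.
    static final long OBJECT_TO_PACK_SIZE =
            (2 * 8)                 // Object header
            + (2 * 8) + (2 * 8)     // ObjectToPack fields
            + (8 + 8)               // PackedObjectInfo fields
            + 8                     // ObjectIdOwnerMap fields
            + 40                    // AnyObjectId fields
            + 8;                    // Reference in BlockList

    // Mirrors the MutableState constructor: one search window per thread,
    // plus room for a single big file.
    static long deltaSearchBytes(int threads, long deltaSearchMemoryLimit,
            long bigFileThreshold) {
        if (threads <= 0)
            threads = Runtime.getRuntime().availableProcessors();
        return (threads * deltaSearchMemoryLimit) + bigFileThreshold;
    }

    // Mirrors snapshot(): the delta search memory is only counted while
    // the writer is in the COMPRESSING phase.
    static long estimateBytesUsed(long objectCount, boolean compressing,
            long deltaSearchBytes) {
        long bytesUsed = OBJECT_TO_PACK_SIZE * objectCount;
        if (compressing)
            bytesUsed += deltaSearchBytes;
        return bytesUsed;
    }

    public static void main(String[] args) {
        // Illustrative inputs: 4 threads, 10 MiB per window, 50 MiB big file.
        long search = deltaSearchBytes(4, 10L << 20, 50L << 20);
        System.out.println(OBJECT_TO_PACK_SIZE);
        System.out.println(estimateBytesUsed(1_000_000L, false, search));
        System.out.println(estimateBytesUsed(1_000_000L, true, search));
    }
}
```

With these inputs the per-object cost is 120 bytes, so one million objects
are estimated at 120000000 bytes outside the compressing phase and at
214371840 bytes during it.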