During parsing these are used with contains(). If they are a List,
each contains() call is a linear scan. Some callers such as
UploadPack often pass a List here, so convert to a Set when the
type isn't efficient for contains().
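A minimal sketch of the conversion, in Java (the helper name is
hypothetical, not the actual patch):

  import java.util.Collection;
  import java.util.HashSet;
  import java.util.List;

  class FastContains {
    // contains() against a List is a linear scan; copying into a
    // HashSet once makes every later membership test O(1).
    static <T> Collection<T> forContains(Collection<T> ids) {
      if (ids instanceof List)
        return new HashSet<>(ids);
      return ids;
    }
  }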
Change-Id: If948ae3bf1f46e756bd2d5db14795e12ba7a6207
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Revert "PackWriter: Do not delta compress already packed objects"
This reverts commit 67b064fc9f.
The "tiny optimization" introduced by 67b0 turns out to have a big
savings on wall-clock time when the object store is very slow (e.g.
the DHT support in JGit), but comes with a much bigger penalty in
space used by the output stream. CGit packed with 67b0 enabled is
7 MiB larger than it should be (36 MiB rather than 28/29 MiB). The
much bigger Linux kernel repository gained over 200 MiB, though some
of this may have been caused by a smaller window setting.
Revert this patch as PackWriter should be optimizing for space used
rather than time spent, since its primary use is network transfer, and
that isn't free.
Change-Id: I7413a9ef89762208159b4a1adc5a22a4c9245611
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
PackWriter: Only search for base objects on thin packs
A non-thin pack does not need to worry about preferred bases, the pack
will be self-contained and all required delta base objects will appear
within the pack itself. Obtaining the path buffer and length from the
ObjectWalk to build the preferred base table is "expensive", so avoid
the cost unless a thin pack is being constructed.
Change-Id: I16e30cd864f4189d4304e7957a7cd5bdb9e84528
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
PackWriter: Skip progress messages on fast operations
If the "Finding sources" phase will complete in <1 second with no
delta compression enabled, don't bother showing the progress meter for
this phase. Small repositories on the local filesystem tend to rip
through this phase always subsecond and the ProgressMonitor display
can actually slow the operation down.
If delta compression is enabled, there are two phases that may run
very quickly. Set the timer to 500 milliseconds instead, reducing the
risk that the user has to wait longer than 1 second before any sort of
output from the packer occurs.
Change-Id: I58110f17e2a5ffa0134f9768b94804d16bbb8399
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
PackWriter: Fix the way delta chain cycles are prevented
Take a very simple approach to avoiding delta chain cycles during
object reuse: objects are now always selected from the oldest pack
that contains them. This prevents cycles because a valid pack cannot
have a cycle in its delta chains. If both objects A and B are chosen
out of the same source pack then there cannot be an A->B->A cycle.
The oldest pack is also the most likely to have the smallest deltas.
It's the biggest pack in the system and probably came from the clone
(or last GC) of this repository, where all objects were previously
considered and packed tightly together. If an object appears again
(for example due to a revert and a push into this repository) the
newer copy won't be nearly as small as the older delta version
of it, even if the newer one is also itself a delta.
ObjectDirectory already enumerates objects during selection in this
newest->oldest order, so it already is supplying these assumptions
to PackWriter. Taking advantage of this can speed up selection by
a tiny amount by avoiding some tests, but can also help to prevent
a cycle needing to be broken on the fly during writing.
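A toy illustration of the invariant (types and the newest-to-oldest
list are stand-ins for the real selection callback):

  import java.util.List;

  class OldestPackSelection {
    // Offers arrive newest-to-oldest, so keeping the last offer always
    // selects the oldest pack. Reusing every delta from a single pack
    // cannot introduce an A->B->A cycle, because a valid pack contains
    // no cycles in its own delta chains.
    static <R> R select(List<R> offersNewestToOldest) {
      R chosen = null;
      for (R offer : offersNewestToOldest)
        chosen = offer;
      return chosen;
    }
  }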
The previous cycle breaking logic wasn't fully correct either.
If a different delta base was chosen, the new delta base might not
have been written into the output pack before the current object,
forcing the use of REF_DELTA when OFS_DELTA is always smaller.
This logic has now been reworked to always re-check the delta base
and ensure it gets written before the current object.
If a cycle occurs, it gets broken the same way as before, by
disabling delta reuse and finding an alternative form of the
object, which may require inflating/deflating in whole format.
Change-Id: I9953ab8be54ceb8b588e1280d6f7edd688887747
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
If the total number of objects to look up for reuse is under 4096,
it is quite close to a reasonable batch size for the DHT storage
system to look up at once. Combine all of the objects into a single
temporary list, perform reuse, and then prune the main lists if any
duplicate objects were detected from a selected CachedPack.
The intention here is to try and avoid 4 tiny sequential lookups
on the storage system when the time to wait for each of those to
finish is higher than the CPU time required to build (and later
GC) this temporary list.
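Roughly the shape of the batching (a sketch; the list types are
hypothetical, the 4096 threshold is from this change):

  import java.util.ArrayList;
  import java.util.List;

  class ReuseBatcher {
    static final int BATCH = 4096;

    // Returns one combined list when a single storage round-trip is
    // cheaper than four sequential per-type lookups, else null.
    static <T> List<T> combine(List<List<T>> byType) {
      int total = 0;
      for (List<T> list : byType)
        total += list.size();
      if (total > BATCH)
        return null;
      List<T> all = new ArrayList<>(total);
      for (List<T> list : byType)
        all.addAll(list);
      return all;
    }
  }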
Change-Id: I528daf9d2f7744dc4a6281750c2d61d8f9da9f3a
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Instead of looping over the objectsLists array, always set slot 0 to
null and explicitly work on the 4 indexes that matter. This kills
some loops and increases the length of the code slightly, but I've
always really disliked that dummy 0 slot.
Change-Id: I5ad938501c1c61f637ffdaff0d0d88e3962d8942
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
PackWriter: Speed up pruning of objects from cached packs
During object enumeration for the thin pack, very few objects come
out that are duplicated with the cached pack. Typically these are
only cases where a blob or tree was cherry-picked forward, got a
copy or rename, or was reverted... all relatively infrequent events.
Speed up pruning of the thin pack object list by combining the phase
with the object representation selection. Implementers should already
be offering to reuse the object from the cached pack if it is stored
there, at which point the implementation can perform a very fast type
of containment test using the cached pack's identity rather than yet
another index lookup. For the local disk case this is probably not a
big improvement, but it does help on the DHT implementation where the
two passes combined into one reduces latency.
Change-Id: I6a07fc75d9075bf6233e967360b6546f9e9a2b33
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Frequently enough I'm wondering how much of a pack is commits vs.
trees, and the total line doesn't really tell us this because it's
a gross total for the pack. Computing the counts per object type
is simple during packing, as PackWriter already has everything in
memory broken up by object type. It's virtually free to get these
values and track them.
Change-Id: Id5e6b1902ea909c72f103a0fbca5d8bc316f9ab3
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
PackWriter: Rename getObjectsNumber to getObjectCount
This better matches with PackFile and CachedPack's methods
that return the same value.
Change-Id: Idb9b7c71d2048dd2344a62c2cde20b4e34529ab7
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
PackWriter incorrectly returned 0 from getObjectsNumber() when the
pack has not been written yet. This caused dumb transports like
amazon-s3:// and sftp:// to abort early and never write out a pack,
under the assumption that the pack had no objects.
Until the pack header is written to the output stream, compute the
current object count each time it is requested. Once the header is
started, use the object count from the stats object.
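In sketch form (field names hypothetical):

  import java.util.ArrayList;
  import java.util.List;

  class ObjectCountSketch {
    boolean headerStarted;
    long statsObjectCount;
    final List<List<?>> objectsLists = new ArrayList<>();

    // Before the pack header is written, count the live per-type
    // lists; afterwards the stats object is authoritative.
    long getObjectCount() {
      if (headerStarted)
        return statsObjectCount;
      long n = 0;
      for (List<?> list : objectsLists)
        n += list.size();
      return n;
    }
  }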
Change-Id: I041a2368ae0cfe6f649ec28658d41a6355933900
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
ObjectIdOwnerMap: More lightweight map for ObjectIds
OwnerMap is about 200 ms faster than SubclassMap, more friendly to the
GC, and uses less storage: testing the "Counting objects" part of
PackWriter on 1886362 objects:
  ObjectIdSubclassMap:
    load factor 50%
    table: 4194304 (wasted 2307942)
    ms spent 36998 36009 34795 34703 34941 35070 34284 34511 34638 34256
    ms avg 34800 (last 9 runs)

  ObjectIdOwnerMap:
    load factor 100%
    table: 2097152 (wasted 210790)
    directory: 1024
    ms spent 36842 35112 34922 34703 34580 34782 34165 34662 34314 34140
    ms avg 34597 (last 9 runs)
The major difference with OwnerMap is entries must extend from
ObjectIdOwnerMap.Entry, where the OwnerMap has injected its own
private "next" field into each object. This allows the OwnerMap to use
a singly linked list for chaining collisions within a bucket. By
putting collisions in a linked list, we gain the entire table back for
the SHA-1 bits to index their own "private" slot.
Unfortunately this means that each object can appear in at most ONE
OwnerMap, as there is only one "next" field within the object instance
to thread into the map. For types that are very object map heavy like
RevWalk (entity RevObject) and PackWriter (entity ObjectToPack) this
is sufficient, these entity types are only put into one map by their
container. By introducing a new map type, we don't break existing
applications that might be trying to use ObjectIdSubclassMap to track
RevCommits they obtained from a RevWalk.
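The caller-visible contract is small; a minimal entity looks
something like:

  import org.eclipse.jgit.lib.AnyObjectId;
  import org.eclipse.jgit.lib.ObjectIdOwnerMap;

  // Extending Entry embeds the map's private "next" pointer in the
  // object itself, so each instance may live in at most one OwnerMap.
  class PackedEntity extends ObjectIdOwnerMap.Entry {
    PackedEntity(AnyObjectId id) {
      super(id);
    }
  }

An ObjectIdOwnerMap<PackedEntity> then stores these with add() and
finds them with get(), without any separate node allocations.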
The OwnerMap uses less memory. Each object uses 1 reference more (so
we're up 1,886,362 references), but the table is 1/2 the size (2^20
rather than 2^21). The table itself wastes only 210,790 slots, rather
than 2,307,942. So OwnerMap is wasting 200k fewer references.
OwnerMap is more friendly to the GC, because it hardly ever generates
garbage. As the map reaches its 100% load factor target, it doubles in
size by allocating additional segment arrays of 2048 entries. (So the
first grow allocates 1 segment, second 2 segments, third 4 segments,
etc.) These segments are hooked into the pre-allocated directory of
1024 spaces. This permits the map to grow to 2 million objects before
the directory itself has to grow. By using segments of 2048 entries,
we are asking the GC to acquire 8,204 bytes in a 32 bit JVM. This is
easier to satisfy than 2,307,942 bytes (for the 512k table that is
just an intermediate step in the SubclassMap). By reusing the
previously allocated segments (they are re-hashed in-place) we don't
release any memory during a table grow.
When the directory grows, it does so by discarding the old one and
using one that is 4x larger (so the directory goes to 4096 entries on
its first grow). A directory of size 4096 can handle up to 8 million
objects. The second directory grow (16384) goes to 33 million objects.
At that point we're starting to really push the limits of the JVM
heap, but at least it's many small arrays. Previously SubclassMap would
need a table of 67108864 entries to handle that object count, which
needs a single contiguous allocation of 256 MiB. That's hard to come
by in a 32 bit JVM. Instead OwnerMap uses 8192 arrays of about 8 KiB
each. This is much easier to fit into a fragmented heap.
Change-Id: Ia4acf5cfbf7e9b71bc7faa0db9060f6a969c0c50
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Instead of resizing an ArrayList until all objects have been added,
append objects into a specialized List type that uses small arrays
of 1024 entries for each 1024 objects added.
For a large repository like linux-2.6, PackWriter will now allocate
1,758 smaller arrays to hold the object list, without creating any
garbage from the intermediate states due to list expansion.
1024 was chosen as the block size (and initial directory size) as this
is a reasonable balance for the PackWriter code. Each block uses
approximately 4096 bytes in a 32 bit JVM, as does the default top
level block directory. The top level directory doesn't expand until 1
million items have been added to the list, which for linux-2.6 won't
yet occur as the lists are per-object-type and are thus bounded to
about 1/3 of 1.8 million.
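A simplified sketch of the structure (JGit's real class has more
operations):

  import java.util.AbstractList;
  import java.util.ArrayList;

  class SegmentedList<T> extends AbstractList<T> {
    private static final int BLOCK = 1024;
    private final ArrayList<Object[]> directory = new ArrayList<>();
    private int size;

    @Override
    public boolean add(T element) {
      if (size % BLOCK == 0)
        directory.add(new Object[BLOCK]); // one small allocation, no copy
      directory.get(size / BLOCK)[size % BLOCK] = element;
      size++;
      return true;
    }

    @SuppressWarnings("unchecked")
    @Override
    public T get(int index) {
      return (T) directory.get(index / BLOCK)[index % BLOCK];
    }

    @Override
    public int size() {
      return size;
    }
  }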
Change-Id: If9e4092eb502394c5d3d044b58cf49952772f6d6
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
If object reuse validation is enabled, the output pack is probably
going to be stored locally. When reusing an existing cached pack
to save object enumeration costs, ensure the cached pack has not
been corrupted by checking its SHA-1 trailer. If it has, writing
will abort and the output pack won't be complete. This prevents
anyone from trying to use the output pack, and catches corruption
before it can be carried any further.
Change-Id: If89d0d4e429d9f4c86f14de6c0020902705153e6
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
PackWriter: Avoid CRC-32 validation when feeding IndexPack
There is no need to validate the object contents during
copyObjectAsIs if the result is going to be parsed by unpack-objects
or index-pack. Both programs will compute the SHA-1 of the object,
and also validate most of the pack structure. For git daemon
like servers, this work is already done on the client end of the
connection, so the server doesn't need to repeat that work itself.
Disable object validation for the 3 transport cases where we know
the remote side will handle object validation for us (push, bundle
creation, and upload pack). This improves performance on the server
side by reducing the work that must be done.
Change-Id: Iabb78eec45898e4a17f7aab3fb94c004d8d69af6
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Annotated tags need to be parsed by many viewing tools, but putting
them at the end of the pack hurts because kernel prefetching might
not have loaded them, since they are so far from the commits they
reference.
Position tags right behind the commits, but before the trees.
Typically the annotated tag set for a repository is very small,
so the extra prefetch burden it puts on tools that don't need
annotated tags (but do need commits and trees) is fairly low.
Change-Id: Ibbabdd94e7d563901c0309c79a496ee049cdec50
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
This simple refactoring makes it easier to pre-process each of the
object lists before it's handed into the actual write routine.
Change-Id: Iea95e5ecbc7374f6bcbb43d1c75285f4f564d09d
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
JGit doesn't generate deltas for commit or tag objects when it packs
a repository from scratch. This is an explicit design decision that
is (mostly) justified by the fact that these objects do not delta
compress well.
Annotated tags are made once on stable points of the project history,
it is unlikely they will ever appear again with sufficient common
text to justify using a delta over just deflating the raw content.
JGit never tries to delta compress annotated tags and I take the
stance that these are best stored as non-deltas given how frequently
they might be accessed by repository viewers.
Commits only have sufficient common text when they are cherry-picked
to forward-port or back-port a change from one branch to another.
Even in these cases the distance between the commits as returned
by the log traversal has to be small enough that they would both
appear in the delta search window at the same time in order to
delta compress one of the messages against the other. JGit never
tries to delta compress commits, as it requires a lot of CPU time
but typically does not produce a smaller pack file.
Avoid reusing deltas for either of these types when constructing a
new pack. To avoid killing performance during serving of network
clients, UploadPack disables this code change by allowing PackWriter
to reuse delta commits. Repositories that were already repacked by
C Git will not have their delta commits decompressed and recompressed
on the fly during object writing, saving server-side CPU resources.
Change-Id: I749407e7c5c677e05e4d054b40db7656cfa7fca8
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
PackWriter: Do not delta compress already packed objects
This is a tiny optimization to how delta search works. Checking for
isReuseAsIs() avoids doing delta compression search on non-delta
objects already stored in packs within the repository. Such objects
are not likely to be delta compressible, as they were already delta
searched when their containing pack was generated and they were
not delta compressed at that time. Doing delta compression now is
unlikely to produce a different result, but would waste a lot of CPU.
The isReuseAsIs() flag is checked before isDoNotDelta() because it
is very common to reuse objects in the output pack. Most objects
get reused, and only a handful have the isDoNotDelta() bit set.
Moving the check earlier allows the loop to more quickly skip
through objects that will never need to be considered.
Change-Id: Ied757363f775058177fc1befb8ace20fe9759bac
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
We did not record the time spent on the object reuse search or the
object size lookup, both of which occur between the counting phase and
the compressing phase. If there are enough objects involved, these
times can be significant, so it's worth timing them and recording it.
Change-Id: I89084acfc598bb6533d75d90cb8de459f0ed93be
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
The total delta count is supposed to include reused deltas, not
just newly created deltas.
Change-Id: I98cbdcef80d59714a4f62ff322e7b709b08b6d26
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Many source browsers and network related tools like UploadPack need
to find and parse the target of all branches and annotated tags
within the repository during their startup phase. Clustering these
together into the same part of the pack file will improve locality,
reducing thrashing when an application starts and needs to load
all of these into memory at once.
To prevent bottlenecking basic log viewing tools that are scanning
backwards from the tip of a current branch (and don't need tags)
we place this cluster of older targets after 4096 newer commits
have already been placed into the pack stream. 4096 was chosen as
a rough guess, but was based on a few factors:
  - log viewers typically show 5-200 commits per page
  - users only view the first page or two
  - DHT can cram 2200-4000 commits per 1 MiB chunk,
    thus these will fall into the second commit chunk (roughly)
Unfortunately this placement hurts history tools that scan backwards
through the commit graph and ignored tags or branch heads when they
started.
An ancient tagged commit is no longer positioned behind its first
child (it's now much earlier), resulting in a page fault for the
parser to reload this cluster of objects on demand. This may be
an acceptable loss. If a user is walking backwards and has already
scanned through more than 4096 commits of history, waiting for the
region to reload isn't really that bad compared to the amount of
time already spent.
If the repository is so small that there are fewer than 4096 commits,
this change has no impact on the placement of objects.
Change-Id: If3052e430d305e17878d94145c93754f56b74c61
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
If the underlying storage has a high latency per SHA-1 lookup
(e.g. the DHT support we are working on), parsing each wanted
annotated tag object back to its underlying commit is too slow:
it's a sequential lookup for each tag. With hundreds of tags in
a repository this takes far too long.
Instead queue up a list of the tags whose objects need to be found,
and then locate all of those in one parseAny batch. This works
for the common case of annotated tag to single tree or commit.
For the less often used tag->tag->commit, it at least gets us
one level parsed in the larger batch before we have to go back to
sequential lookups.
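The batched form follows RevWalk's asynchronous queue API; roughly:

  import java.io.IOException;
  import java.util.List;

  import org.eclipse.jgit.lib.ObjectId;
  import org.eclipse.jgit.revwalk.AsyncRevObjectQueue;
  import org.eclipse.jgit.revwalk.RevObject;
  import org.eclipse.jgit.revwalk.RevWalk;

  class TagBatcher {
    // One parseAny batch replaces a storage round-trip per tag.
    static void parseTargets(RevWalk walk, List<ObjectId> targets)
        throws IOException {
      AsyncRevObjectQueue queue = walk.parseAny(targets, false);
      try {
        RevObject obj;
        while ((obj = queue.next()) != null) {
          // obj is now parsed; a tag->tag target needs one more pass
        }
      } finally {
        queue.release();
      }
    }
  }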
Change-Id: I94beef3f14281406f15c8cf9fa02d83faf102a19
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
PackWriter: Correct total delta count when reusing pack
If the CachedPack knows its delta count, we need to increment both
the totalDeltas and reusedDeltas fields of the stats object.
Change-Id: I70113609c22476ce7f1e4d9a92f486e9b0f59e44
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
PackWriter: Short-circuit counting on full cached pack reuse
If one or more cached packs fully covers the request, don't bother
with looking up the objects and trying to walk the graph. Just use
the cached packs and return immediately.
This helps clones of quiet repositories that have not been modified
since their last repack: it's likely the cached packs are accurate
and no graph walking is required.
Change-Id: I9062a5ac2f71b525322590209664a84051fd5f8a
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
PackWriter: Sort commits by parse order to improve locality
RevWalk in JGit and the revision code in C Git both parse commits out
of the pack file in an order that differs from strict timestamp and
topological sorting. Both implementations pop a commit from the head
of a date queue, and then immediately parse all of its parents in
order to insert those into the date queue at the proper positions as
determined by their committer timestamp field. This implies that the
parents are parsed when their most recent child is popped from the
queue, and not when they themselves are popped during traversal.
Hoisting a parent commit to be immediately behind its child improves
locality by making sure all parents of a merge are clustered together,
and thus can be paged into the parser by the pack file buffering
system (aka WindowCache in JGit) together.
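A sketch of the reordering (generic types; the parents() accessor is
a stand-in for RevCommit.getParents()):

  import java.util.ArrayList;
  import java.util.LinkedHashSet;
  import java.util.List;
  import java.util.function.Function;

  class ParseOrder {
    // Hoist each parent directly behind its most recent child, the
    // same order rev-walk code parses commits out of the pack.
    static <C> List<C> of(List<C> newestFirst, Function<C, List<C>> parents) {
      LinkedHashSet<C> out = new LinkedHashSet<>();
      for (C commit : newestFirst) {
        out.add(commit); // keeps an earlier slot if already hoisted
        for (C parent : parents.apply(commit))
          if (newestFirst.contains(parent))
            out.add(parent); // cluster merge parents behind the child
      }
      return new ArrayList<>(out);
    }
  }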
Change-Id: I80f9e64cafa2e8f082776b43845edf23065386a2
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
PackWriter: Try for accurate delta reuse on cached pack
If a cached pack is used, it might know how many deltas are contained
within it. Record that count as part of our reusedDeltas field
for the stats line we show clients.
Change-Id: I1c61fb817305a95eeac654cccf132cba20b2339c
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
UploadPack: Expose PackWriter activity to a logger
The UploadPackLogger interface allows applications that embed
GitServlet or otherwise use UploadPack to service clients to
track and log how PackWriter was used, and what it sent. This
provides more granularity into the request activity than might
be available from the HTTP server logs, helping administrators
to better understand utilization and Git server performance.
Change-Id: I1d36b060eb3385339d5f986e68192789ef70fc4e
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
When UploadPack has computed the merge base between the client's have
set and the want set, it has already loaded and parsed all of the
interesting commits that PackWriter needs to transmit to the client.
Switching the RevWalk and its object pool over to be an ObjectWalk
saves PackWriter from needing to re-parse these same commits from the
ObjectDatabase, reducing the startup latency for the enumeration
phase of packing.
UploadPack doesn't want to use an ObjectWalk for the okToGiveUp()
tests because it's slower: during each commit popped it needs to cache
the tree into the pendingObjects list, and during each reset() it
discards a bunch of ObjectWalk specific state and reallocates some
internal collections. ObjectWalk was never meant to be rapidly
reset() like UploadPack does, so it's perhaps somewhat cleaner to allow
"upgrading" a RevWalk to an ObjectWalk.
Bug: 301639
Change-Id: I97ef52a0b79d78229c272880aedb7f74d0f7532f
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
The most expensive part of packing a repository for transport to
another system is enumerating all of the objects in the repository.
Once this gets to the size of the linux-2.6 repository (1.8 million
objects), enumeration can take several CPU minutes and costs a lot
of temporary working set memory.
Teach PackWriter to efficiently reuse an existing "cached pack"
by answering a clone request with a thin pack followed by a larger
cached pack appended to the end. This requires the repository
owner to first construct the cached pack by hand, and record the
tip commits inside of $GIT_DIR/objects/info/cached-packs:
  cd $GIT_DIR
  root=$(git rev-parse master)
  tmp=objects/.tmp-$$
  names=$(echo $root | git pack-objects --keep-true-parents --revs $tmp)
  for n in $names; do
    chmod a-w $tmp-$n.pack $tmp-$n.idx
    touch objects/pack/pack-$n.keep
    mv $tmp-$n.pack objects/pack/pack-$n.pack
    mv $tmp-$n.idx objects/pack/pack-$n.idx
  done
  (echo "+ $root";
   for n in $names; do echo "P $n"; done;
   echo) >>objects/info/cached-packs
  git repack -a -d
When a clone request needs to include $root, the corresponding
cached pack will be copied as-is, rather than enumerating all of
the objects that are reachable from $root.
For a linux-2.6 kernel repository that should be about 376 MiB,
the above process creates two packs of 368 MiB and 38 MiB[1].
This is a local disk usage increase of ~26 MiB, due to reduced
delta compression between the large cached pack and the smaller
recent activity pack. The overhead is similar to 1 full copy of
the compressed project sources.
With this cached pack in hand, JGit daemon completes a clone request
in 1m17s less time, but a slightly larger data transfer (+2.39 MiB):
  Before:
    remote: Counting objects: 1861830, done
    remote: Finding sources: 100% (1861830/1861830)
    remote: Getting sizes: 100% (88243/88243)
    remote: Compressing objects: 100% (88184/88184)
    Receiving objects: 100% (1861830/1861830), 376.01 MiB | 19.01 MiB/s, done.
    remote: Total 1861830 (delta 4706), reused 1851053 (delta 1553844)
    Resolving deltas: 100% (1564621/1564621), done.

    real 3m19.005s

  After:
    remote: Counting objects: 1601, done
    remote: Counting objects: 1828460, done
    remote: Finding sources: 100% (50475/50475)
    remote: Getting sizes: 100% (18843/18843)
    remote: Compressing objects: 100% (7585/7585)
    remote: Total 1861830 (delta 2407), reused 1856197 (delta 37510)
    Receiving objects: 100% (1861830/1861830), 378.40 MiB | 31.31 MiB/s, done.
    Resolving deltas: 100% (1559477/1559477), done.

    real 2m2.938s
Repository owners can periodically refresh their cached packs by
repacking their repository, folding all newer objects into a larger
cached pack. Since repacking is already considered to be a normal
Git maintenance activity, this isn't a very big burden.
[1] In this test $root was set back about two weeks.
Change-Id: Ib87131d5c4b5e8c5cacb0f4fe16ff4ece554734b
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
CGit pack-objects displays a totals line after the pack data
has been fully written. This can be useful to understand some of
the decisions made by the packer, and has been a great tool
for helping to debug some of that code.
Track some of the basic values, and send it to the client when
packing is done:
  remote: Counting objects: 1826776, done
  remote: Finding sources: 100% (55121/55121)
  remote: Getting sizes: 100% (25654/25654)
  remote: Compressing objects: 100% (11434/11434)
  remote: Total 1861830 (delta 3926), reused 1854705 (delta 38306)
  Receiving objects: 100% (1861830/1861830), 386.03 MiB | 30.32 MiB/s, done.
Change-Id: If3b039017a984ed5d5ae80940ce32bda93652df5
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
There is no point in pushing all of the files within the edge
commits into the delta search when making a thin pack. This floods
the delta search window with objects that are unlikely to be useful
bases for the objects that will be written out, resulting in lower
data compression and higher transfer sizes.
Instead observe the path of a tree or blob that is being pushed
into the outgoing set, and use that path to locate up to WINDOW
ancestor versions from the edge commits. Push only those objects
into the edgeObjects set, reducing the number of objects seen by the
search window. This allows PackWriter to only look at ancestors
for the modified files, rather than all files in the project.
Limiting the search to WINDOW size makes sense, because more than
WINDOW edge objects will just skip through the window search as
none of them need to be delta compressed.
To further improve compression, sort edge objects into the front
of the window list, rather than randomly throughout. This puts
non-edges later in the window and gives them a better chance at
finding their base, since they search backwards through the window.
These changes make a significant difference in the thin-pack:
  Before:
    remote: Counting objects: 144190, done
    remote: Finding sources: 100% (50275/50275)
    remote: Getting sizes: 100% (101405/101405)
    remote: Compressing objects: 100% (7587/7587)
    Receiving objects: 100% (50275/50275), 24.67 MiB | 9.90 MiB/s, done.
    Resolving deltas: 100% (40339/40339), completed with 2218 local objects.

    real 0m30.267s

  After:
    remote: Counting objects: 61549, done
    remote: Finding sources: 100% (50275/50275)
    remote: Getting sizes: 100% (18862/18862)
    remote: Compressing objects: 100% (7588/7588)
    Receiving objects: 100% (50275/50275), 11.04 MiB | 3.51 MiB/s, done.
    Resolving deltas: 100% (43160/43160), completed with 5014 local objects.

    real 0m22.170s
The resulting pack is 13.63 MiB smaller, even though it contains the
same exact objects. 82,543 fewer objects had to have their sizes
looked up, which saved about 8s of server CPU time. 2,796 more
objects from the client were used as part of the base object set,
which contributed to the smaller transfer size.
Change-Id: Id01271950432c6960897495b09deab70e33993a9
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>
Some of this code predates making ObjectId.equals() final
and fixing RevObject.equals() to match ObjectId.equals().
It was therefore more complex than it needs to be, because
it tried to work around RevObject's broken equals() rules
by converting to ObjectId in a different collection.
Also combine setUpWalker() and findObjectsToPack() methods,
these can be one method and the code is actually cleaner.
Change-Id: I0f4cf9997cd66d8b6e7f80873979ef1439e507fe
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>
The first 'Compressing objects' progress message is wrong; it's
actually PackWriter looking up the sizes of each object in the
ObjectDatabase, so objects can be sorted correctly in the later
type-size sort that tries to take advantage of "Linus' Law" to
improve delta compression.
Rename the progress to say 'Getting sizes', which is an accurate
description of what it is doing.
Change-Id: Ida0a052ad2f6e994996189ca12959caab9e556a3
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>
When compressing objects, don't include the edges in the progress
meter. These cost almost no CPU time as they are simply pushed into
and popped out of the delta search window.
Change-Id: I7ea19f0263e463c65da34a7e92718c6db1d4a131
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>
Revert "Teach PackWriter how to reuse an existing object list"
This reverts commit f5fe2dca3c.
I regret adding this feature to the public API. Caches aren't always
the best idea, as they require work to maintain. Here the cache is
redundant information that must be computed, and when it grows stale
must be removed. The redundant information takes up more disk space,
about the same size as the pack-*.idx files are. For the linux-2.6
repository, that's more than 40 MB for a 400 MB repository. So the
cache is a 10% increase in disk usage.
The entire point of this cache is to improve PackWriter performance,
and only PackWriter performance, and only when sending an initial
clone to a new client. There may be better ways to optimize this, and
until we have a solid solution, we shouldn't be using a separate cache
in JGit.
Teach PackWriter how to reuse an existing object list
Counting the objects needed for packing is the most expensive part of
an UploadPack request that has no uninteresting objects (otherwise
known as an initial clone). During this phase the PackWriter is
enumerating the entire set of objects in this repository, so they can
be sent to the client for their new clone.
Allow the ObjectReader (and therefore the underlying storage system)
to keep a cached list of all reachable objects from a small number of
points in the project's history. If one of those points is reached
during enumeration of the commit graph, most objects are obtained from
the cached list instead of direct traversal.
PackWriter uses the list by discarding the current object lists and
restarting a traversal from all refs but marking the object list name
as uninteresting. This allows PackWriter to enumerate all objects
that are more recent than the list creation, or that were on side
branches that the list does not include.
However, ObjectWalk tags all of the trees and commits within the list
commit as UNINTERESTING, which would normally cause PackWriter to
construct a thin pack that excludes these objects. To avoid that,
addObject() was refactored to allow this list-based enumeration to
always include an object, even if it has been tagged UNINTERESTING by
the ObjectWalk. This implies the list-based enumeration may only be
used for initial clones, where all objects are being sent.
The UNINTERESTING labeling occurs because StartGenerator always
enables the BoundaryGenerator if the walker is an ObjectWalk and a
commit was marked UNINTERESTING, even if RevSort.BOUNDARY was not
enabled. This is the default reasonable behavior for an ObjectWalk,
but isn't desired here in PackWriter with the list-based enumeration.
Rather than trying to change all of this behavior, PackWriter works
around it.
Because the list name commit's immediate files and trees were all
enumerated before the list enumeration itself starts (and are also
within the list itself), PackWriter runs the risk of adding the same
objects to its ObjectIdSubclassMap twice. Since this breaks the
internal map data structure (and also may cause the object to transmit
twice), PackWriter needs to use a new "added" RevFlag to track whether
or not an object has been put into the outgoing list yet.
Change-Id: Ie99ed4d969a6bb20cc2528ac6b8fb91043cee071
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
PackWriter: Use TOPO order only for incremental packs
When performing an initial clone of a repository there are no
uninteresting commits, and the resulting pack will be completely
self-contained. Therefore PackWriter does not need to honor C
Git standard TOPO ordering as described in JGit commit ba984ba2e0
("Fix checkReferencedIsReachable to use correct base list").
Switching to COMMIT_TIME_DESC when there are no uninteresting commits
allows the "Counting objects" phase to emit progress earlier, as the
RevWalk will not buffer the commit list. When TOPO is set the RevWalk
enumerates all commits first, before outputting any for PackWriter to
mark progress updates from.
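The choice reduces to one sort flag on the walker; a sketch:

  import org.eclipse.jgit.revwalk.ObjectWalk;
  import org.eclipse.jgit.revwalk.RevSort;

  class OrderChoice {
    // TOPO buffers every commit before emitting the first, so pay
    // that cost only when an incremental pack needs the C Git
    // compatible base list.
    static void apply(ObjectWalk walker, boolean haveUninteresting) {
      if (haveUninteresting)
        walker.sort(RevSort.TOPO);
      else
        walker.sort(RevSort.COMMIT_TIME_DESC);
    }
  }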
Change-Id: If2b6a9903b536c7fb3c45f85d0a67ff6c6e66f22
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Don't permit transient worker threads to access the underlying output
stream of a ProgressMonitor, as they might get marked as the stream's
writer thread. Instead proxy update events from the workers back onto
the application's real work thread. This ensures that the stream only
sees a single thread, and it's the thread that will remain alive for
the entire life cycle of the operation.
This fixes IOException("Write end dead") during local repository fetch
when threaded delta search is enabled. One of the transient delta
search threads became the designated writer for the pipe, and when it
terminated the reader end thought the writer was dead, even though the
main writer thread was still executing in PackWriter.
Bug: 326557
Change-Id: I01d1b20a3d7be1c0b480c7fb5c9773c161fe5c15
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Fix checkReferencedIsReachable to use correct base list
When checkReferencedIsReachable is set in ReceivePack we are trying
to prove that the push client is permitted to access an object that
it did not send to us, but that the received objects link to either
via a link inside of an object (e.g. commit parent pointer or tree
member) or by a delta base reference.
To do this check we are making a list of every potential delta base,
and then ensuring that every delta base used appears on this list.
If a delta base does not appear on this list, we abort with an error,
letting the client know we are missing a particular object.
Preventing spurious errors about missing delta base objects requires
us to use the exact same list of potential delta bases as the remote
push client used. This means we must use TOPO ordering, and we
need to enable BOUNDARY sorting so that ObjectWalk will correctly
include any trees found during the enumeration back to the common
merge base between the interesting and uninteresting heads.
To ensure JGit's own push client matches this same potential delta
base list, we need to undo 60aae90d4d ("Disable topological
sorting in PackWriter") and switch back to using the conventional
TOPO ordering for commits in a pack file. This ensures that our
own push client will use the same potential base object list as
checkReferencedIsReachable uses on the receiving side.
Change-Id: I14d0a326deb62a43f987b375cfe519711031e172
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Use limited getCachedBytes code to reduce duplication
Rather than duplicating this block everywhere, reuse the limited size
form of getCachedBytes to acquire the content of an object.
Change-Id: I2e26a823e6fd0964d8f8dbfaa0fc2e8834c179c1
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>
Allow ObjectReuseAsIs to have more control over write ordering
The reuse system used by an object database may be able to benefit
from knowing what objects are coming next, and even improve data
throughput by delaying (or moving up) objects that are stored near
each other in the source database.
Pushing the iteration down into the reuse code makes it possible
for a smarter implementation to aggregate reuse. But for the
standard pack file format on disk we don't bother; it's quite
efficient already.
Change-Id: I64f0048ca7071a8b44950d6c2a5dfbca3be6bba6
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
An ObjectReader implementation may be very slow for a single object,
yet support bulk queries efficiently by batching multiple small
requests into a single larger request. This easily happens when the
reader is built on top of a database that is stored on another host,
as the network round-trip time starts to dominate the operation cost.
RevWalk, ObjectWalk, UploadPack and PackWriter are the first major
users of this new bulk interface, with the goal being to support an
efficient way to pack a repository for a fetch/clone client when the
source repository is stored in a high-latency storage system.
Processing the want/have lists is now done in bulk, to remove
the high costs associated with common ancestor negotiation.
PackWriter already performs object reuse selection in bulk, but it
now can also do the object size lookup and object counting phases
with higher efficiency. Actual object reuse, deltification, and
final output are still doing sequential lookups, making them a bit
more expensive to perform.
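For example, the size lookup phase can drain one batched queue instead
of issuing a call per object (sketch against the new bulk interface):

  import java.io.IOException;

  import org.eclipse.jgit.lib.AsyncObjectSizeQueue;
  import org.eclipse.jgit.lib.ObjectId;
  import org.eclipse.jgit.lib.ObjectReader;

  class BulkSizes {
    static long total(ObjectReader reader, Iterable<ObjectId> ids)
        throws IOException {
      long sum = 0;
      AsyncObjectSizeQueue<ObjectId> queue =
          reader.getObjectSize(ids, false);
      try {
        while (queue.next())
          sum += queue.getSize(); // no per-object round-trip
      } finally {
        queue.release();
      }
      return sum;
    }
  }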
Change-Id: I4c966f84917482598012074c370b9831451404ee
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
ObjectReader implementations may wish to use multiple threads in
order to evaluate object reuse faster. Let the reader make that
decision by passing the iteration down into the reader.
Because the work is pushed into the reader, it may need to locate a
given ObjectToPack given its ObjectId. This can easily occur if the
reader has sent a list of ObjectIds to the object database and gets
back information keyed only by ObjectId, without the ObjectToPack
handle. Expose lookup using the PackWriter's own internal map,
so the reader doesn't need to build a redundant copy to track the
association of ObjectId back to ObjectToPack.
Change-Id: I0c536405a55034881fb5db92a2d2a99534faed34
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
When the output stream is deeply buffered (e.g. 1 MiB or more in
an HTTP servlet on some containers) trying to kick out the header
earlier will prevent the client from stalling hard while the first
1 MiB is received and it can process the pack header. Forcing a
flush here lets the client see the header and start its progress
monitor for "Receiving objects: (1/N)" so the user knows there
is still activity occurring, even though the buffering may cause
there to be some lag as the buffer fills up on the sending side.
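The fix is essentially a flush immediately after the header bytes; in
sketch form:

  import java.io.IOException;
  import java.io.OutputStream;

  class HeaderFlush {
    // Push the 12-byte pack header through any deep buffering so the
    // client can start its "Receiving objects" meter right away.
    static void writeHeader(OutputStream out, byte[] header)
        throws IOException {
      out.write(header);
      out.flush();
    }
  }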
Change-Id: I3edf39e8f703fe87a738dc236d426b194db85e3a
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
This refactoring permits applications to configure global per-process
settings for all packing and easily pass it through to per-request
PackWriters, ensuring that the process configuration overrides the
repository specific settings.
For example this might help in a daemon environment where the server
wants to cap the resources used to serve a dynamic upload pack
request, even though the repository's own pack.* settings might be
configured to be more aggressive. This allows fast but less bandwidth
efficient serving of clients, while still retaining good compression
through a cron managed `git gc`.
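For instance, a daemon might build one process-wide config and hand it
to every request (values illustrative):

  import org.eclipse.jgit.storage.pack.PackConfig;

  class ServerPackDefaults {
    // Trade compression for CPU on a busy server, regardless of each
    // repository's own pack.* settings.
    static PackConfig create() {
      PackConfig config = new PackConfig();
      config.setDeltaSearchWindowSize(10);
      config.setCompressionLevel(1);
      return config;
    }
  }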
Change-Id: I58cc5e01b48924b1a99f79aa96c8150cdfc50846
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
These need to be dynamic based on the current thread's environment
at time of execution in order to be properly localized for the end
user that will be seeing these messages.
Change-Id: I4976f462cfe606edd2761c0e36b2f6b20f63d53c
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Allow PackWriter callers to manage the thread pool
By permitting the caller of PackWriter to select the Executor it
uses for task execution, we give the caller the ability to manage
the lifecycle of the thread pool, including reusing it across
concurrent pack generators.
This is the first step to supporting application thread pools
within Daemon or another managed service like Gerrit Code Review.
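Sharing a pool across packers then looks roughly like this (pool size
illustrative):

  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;

  import org.eclipse.jgit.storage.pack.PackConfig;

  class SharedPool {
    static PackConfig configure() {
      ExecutorService pool = Executors.newFixedThreadPool(4);
      PackConfig config = new PackConfig();
      config.setExecutor(pool); // reused across concurrent packers
      return config; // the caller shuts the pool down at exit
    }
  }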
Change-Id: I96bee7b9c30ff9885f2bd261d0b6daaac713b5a4
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
There were some broken links, incorrect uses of @value, an invalid
tag and an outdated comment.
Change-Id: I22886bcc869a4b62bd606ebed40669f7b4723664
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>