Delta search was discarding discovered deltas if an object appeared
near a type boundary in the delta search window. This has caused JGit
to produce larger pack files than other implementations of the packing
algorithm.
Delta search works by pushing prior objects into a search window, an
ordered list of objects to attempt to delta compress the next object
against. (The window size is bounded, avoiding O(N^2) behavior.)
For implementation reasons multiple object types can appear in the
input list, and the window. PackWriter commonly passes both trees and
blobs in the input list handed to the DeltaWindow algorithm. The pack
file format requires an object to only delta compress against the same
type, so the DeltaWindow algorithm must stop doing comparisions if a
blob would be compared to a tree.
Because the input list is sorted by object type and the window is
recently considered prior objects, once a wrong type is discovered in
the window the search algorithm stops and uses the current result.
Unfortunately the termination condition was discarding any found
delta by setting deltaBase and deltaBuf to null when it was trying
to break the window search.
When this bug occurs, the state of the DeltaWindow looks like this:
current
|
\ /
input list: tree0 tree1 blob1 blob2
window: blob1 tree1 tree0
/ \
|
res.prev
As the loop iterates to the right across the window, it first finds
that blob1 is a suitable delta base for blob2, and temporarily holds
this in the bestDelta/deltaBuf fields. It then considers tree1, but
tree1 has the wrong type (blob != tree), so the window loop must give
up and fall through the remaining code.
Moving the condition up and discarding the window contents allows
the bestDelta/deltaBuf to be kept, letting the final file delta
compress blob1 against blob0.
The impact of this bug (and its fix) on real world repositories is
likely minimal. The boundary from blob to tree happens approximately
once in the search, as the input list is sorted by type. Only the
first window size worth of blobs (e.g. 10 or 250) were failing to
produce a delta in the final file.
This bug fix does produce significantly different results for small
test repositories created in the unit test suite, such as when a pack
may contains 6 objects (2 commits, 2 trees, 2 blobs). Packing test
cases can now better sample different output pack file sizes depending
on delta compression and object reuse flags in PackConfig.
Change-Id: Ibec09398d0305d4dbc0c66fce1daaf38eb71148f
Rescale "Compressing objects" progress meter by size
Instead of counting objects processed, count number of bytes added
into the window. This should rescale the progress meter so that 30%
complete means 30% of the total uncompressed content size has been
inflated and fed into the window.
In theory the progress meter should be more accurate about its
percentage complete/remaining fraction than with objects. When
counting objects small objects move the progress meter more rapidly
than large objects, but demand a smaller amount of work than large
objects being compressed.
Change-Id: Id2848c16a2148b5ca51e0ca1e29c5be97eefeb48
Instead of assuming all objects cost the same amount of time to
delta compress, aggregate the byte size of objects in the list
and partition threads with roughly equal total bytes.
Before splitting the list select the N largest paths and assign
each one to its own thread. This allows threads to get through the
worst cases in parallel before attempting smaller paths that are
more likely to be splittable.
By running the largest path buckets first on each thread the likely
slowest part of compression is done early, while progress is still
reporting a low percentage. This gives users a better impression of
how fast the phase will run. On very complex inputs the slow part
is more likely to happen first, making a user realize its time to
go grab lunch, or even run it overnight.
If the worst sections are earlier, memory overruns may show up
earlier, giving the user a chance to correct the configuration and
try again before wasting large amounts of time. It also makes it
less likely the delta compression phase reaches 92% in 30 minutes
and then crawls for 10 hours through the remaining 8%.
Change-Id: I7621c4349b99e40098825c4966b8411079992e5f
TemporaryBuffer is great when the output size is not known, but must
be bound by a relatively large upper limit that fits in memory, e.g.
64 KiB or 20 MiB. The buffer gracefully supports growing storage by
allocating 8 KiB blocks and storing them in an ArrayList.
In a Git repository many deltas are less than 8 KiB. Typical tree
objects are well below this threshold, and their deltas must be
encoded even smaller.
For these much smaller cases avoid the 8 KiB minimum allocation used
by TemporaryBuffer. Instead allocate a very small OutputStream
writing to an array that is sized at the limit.
Change-Id: Ie25c6d3a8cf4604e0f8cd9a3b5b701a592d6ffca
Correct distribution of allowed delta size along chain length
Nicolas Pitre discovered a very simple rule for selecting between two
different delta base candidates:
- if based whole object, must be <= 50% of target
- if at end of a chain, must be <= 1/depth * 50% of target
The rule penalizes deltas near the end of the chain, requiring them to
be very small in order to be kept by the packer. This favors deltas
that are based on a shorter chain, where the read-time unpack cost is
much lower. Fewer bytes need to be consulted from the source pack
file, and less copying is required in memory to rebuild the object.
Junio Hamano explained Nico's rule to me today, and this commit fixes
DeltaWindow to implement it as described.
When no base has been chosen the computation is simply the statements
denoted above. However once a base with depth of 9 has been chosen
(e.g. when pack.depth is limited to 10), a non-delta source may
create a new delta that is up to 10x larger than the already selected
base. This reflects the intent of Nico's size distribution rule no
matter what order objects are visited in the DeltaWindow.
With this patch and my other patches applied, repacking JGit with:
[pack]
reuseObjects = false
reuseDeltas = false
depth = 50
window = 250
threads = 4
compression = 9
CGit (all) 5,711,735 bytes; real 0m13.942s user 0m47.722s [1]
JGit heads 5,718,295 bytes; real 0m11.880s user 0m38.177s [2]
rest 9,809 bytes
The improved JGit result for the head pack is only 6.4 KiB larger than
CGit's resulting pack. This patch allowed JGit to find an additional
39.7 KiB worth of space savings. JGit now also often runs 2s faster
than CGit, despite also creating bitmaps and pruning objects after the
head pack creation.
[1] time git repack -a -d -F --window=250 --depth=50
[2] time java -Xmx128m -jar jgit debug-gc
Change-Id: I5caec31359bf7248cabdd2a3254c84d4ee3cd96b
When an idle thread tries to steal work from a sibling's remaining
toSearch queue, always try to split along a path boundary. This
avoids missing delta opportunities in the current window of the
thread whose work is being taken.
The search order is reversed to walk further down the chain from
current position, avoiding the risk of splitting the list within
the path the thread is currently processing.
When selecting which thread to split from use an accurate estimate
of the size to be taken. This avoids selecting a thread that has
only one path remaining but may contain more pending entries than
another thread with several paths remaining.
As there is now a race condition where the straggling thread can
start the next path before the split can finish, the stealWork()
loop spins until it is able to acquire a split or there is only
one path remaining in the siblings.
Change-Id: Ib11ff99f90a4d9efab24bf4a85342cc63203dba5
Replace DeltaWindow array with circularly linked list
Typical window sizes are 10 and 250 (although others are accepted).
In either case the pointer overhead of 1 pointer in an array or
2 pointers for a double linked list is trivial. A doubly linked
list as used here for window=250 is only another 1024 bytes on a
32 bit machine, or 2048 bytes on a 64 bit machine.
The critical search loops scan through the array in either the
previous direction or the next direction until the cycle is finished,
or some other scan abort condition is reached. Loading the next
object's pointer from a field in the current object avoids the
branch required to test for wrapping around the edge of the array.
It also saves the array bounds check on each access.
When a delta is chosen the window is shuffled to hoist the currently
selected base as an earlier candidate for the next object. Moving
the window entry is easier in a double-linked list than sliding a
group of array entries.
Change-Id: I9ccf20c3362a78678aede0f0f2cda165e509adff
javac and the JIT are more likely to understand a boolean being
used as a branch conditional than comparing int against 0 and 1.
Rewrite NEXT_RES and NEXT_SRC constants to be booleans so the
code is clarified for the JIT.
Change-Id: I1bdd8b587a69572975a84609c779b9ebf877b85d
Micro-optimize DeltaWindow maxMemory test to be != 0
Instead of using a compare-with-0 use a does not equal 0.
javac bytecode has a special instruction for this, as it
is very common in software. We can assume the JIT knows
how to efficiently translate the opcode to machine code,
and processors can do != 0 very quickly.
Change-Id: Idb84c1d744d2874517fd4bfa1db390e2dbf64eac
Steal work from delta threads to rebalance CPU load
If the configuration wants to run 4 threads the delta search work
is initially split somewhat evenly across the 4 threads. During
execution some threads will finish early due to the work not being
split fairly, as the initial partitions were based on object count
and not cost to inflate or size of DeltaIndex.
When a thread finishes early it now tries to take 50% of the work
remaining on a sibling thread, and executes that before exiting.
This repeats as each thread completes until a thread has only 1
object remaining.
Repacking Blink, Chromium's new fork of WebKit (2.2M objects 3.9G):
[pack]
reuseDeltas = false
reuseObjects = false
depth = 50
threads = 8
window = 250
windowMemory = 800m
before: ~105% CPU after 80%
after: >780% CPU to 100%
Change-Id: I65e45422edd96778aba4b6e5a0fd489ea48e8ca3
JGit 3.0: move internal classes into an internal subpackage
This breaks all existing callers once. Applications are not supposed
to build against the internal storage API unless they can accept API
churn and make necessary updates as versions change.
Change-Id: I2ab1327c202ef2003565e1b0770a583970e432e9
The maxMemory for a DeltaWindow can be optionally disabled when it is
less than or equal to zero. Respect this configuration when enforcing
the limits on object load.
Change-Id: Ic0f4ffcabf82105f8e690bd0eb5e6be485a313b3
Previously, memory limits were enforced at the start of each iteration
of the delta search, based on objects that were currently loaded in
memory. However, new objects added to the window may be expanded in a
future iteration of the search and thus were not accounted for correctly
at the start of the search. To fix this, memory limits are now enforced
before each object is loaded.
Change-Id: I898ab43e7bf5ee7189831f3a68bb9385ae694b8f
Fix DeltaWindow.clear() to release loaded buffer bytes.
It is possible for the buffer to be set but not the index. It
ocurrs when an exception occurs during creating an index, but
after the buffer is loaded. Furthermore, the cleared DeltaWindowEntry
should have been ent and not res.
Change-Id: I2e0d79540316635bf7aa43efd225e4eb38230844
Do not delta compress objects that have already tried to compress.
If an object is in a pack file already, delta compression will not
attempt to re-compress it. This assumes that the previous
packing already performed the optimal compression attempt, however,
the subclasses of StoredObjectRepresentation may use other heuristics
to determine if the stored format is optimal.
Change-Id: I403de522f4b0dd2667d54f6faed621f392c07786
When compressing objects, don't include the edges in the progress
meter. These cost almost no CPU time as they are simply pushed into
and popped out of the delta search window.
Change-Id: I7ea19f0263e463c65da34a7e92718c6db1d4a131
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>
Use 3 different types of LargeObjectException for the 3 major ways
that we can fail to load an object. For each of these use a unique
string translation which describes the root cause better than just
the ObjectId.name() does.
Change-Id: I810c98d5691b74af9fc6cbd46fc9879e35a7bdca
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
An ObjectReader implementation may be very slow for a single object,
but yet support bulk queries efficiently by batching multiple small
requests into a single larger request. This easily happens when the
reader is built on top of a database that is stored on another host,
as the network round-trip time starts to dominate the operation cost.
RevWalk, ObjectWalk, UploadPack and PackWriter are the first major
users of this new bulk interface, with the goal being to support an
efficient way to pack a repository for a fetch/clone client when the
source repository is stored in a high-latency storage system.
Processing the want/have lists is now done in bulk, to remove
the high costs associated with common ancestor negotiation.
PackWriter already performs object reuse selection in bulk, but it
now can also do the object size lookup and object counting phases
with higher efficiency. Actual object reuse, deltification, and
final output are still doing sequential lookups, making them a bit
more expensive to perform.
Change-Id: I4c966f84917482598012074c370b9831451404ee
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
This refactoring permits applications to configure global per-process
settings for all packing and easily pass it through to per-request
PackWriters, ensuring that the process configuration overrides the
repository specific settings.
For example this might help in a daemon environment where the server
wants to cap the resources used to serve a dynamic upload pack
request, even though the repository's own pack.* settings might be
configured to be more aggressive. This allows fast but less bandwidth
efficient serving of clients, while still retaining good compression
through a cron managed `git gc`.
Change-Id: I58cc5e01b48924b1a99f79aa96c8150cdfc50846
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Discard the uncompressed delta as soon as its compressed
The DeltaCache will most likely need to copy the compressed delta
into a new buffer in order to compact away the wasted space at the
end caused by over allocation. Since we don't need the uncompressed
format anymore, null out our only reference to it so the GC can
reclaim this memory if it needs to perform a collection in order
to satisfy the cache's allocation attempt.
Change-Id: I50403cfd2e3001b093f93a503cccf7adab43cc9d
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Honor pack.windowlimit to cap memory usage during packing
The pack.windowlimit configuration parameter places an upper bound
on the number of bytes used by the DeltaWindow class as it scans
through the object list. If memory usage would exceed the limit
the window is temporarily decreased in size to keep memory used
within that bound.
Change-Id: I09521b8f335475d8aee6125826da8ba2e545060d
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
PackWriter now caches small deltas, or deltas that are very tiny
compared to their source inputs, so that the writing phase goes
faster by reusing those cached deltas.
The cached data is stored compressed, which usually translates to
a bigger footprint due to deltas being very hard to compress, but
saves time during writing by avoiding the deflate step. They are
held under SoftReferences so that the JVM GC can clear out deltas
if memory gets very tight. We would rather continue working and
spend a bit more CPU time during writing than crash due to OOME.
To avoid OutOfMemoryErrors during the caching phase we also trap
OOME and just abort out of the caching.
Because deflateBound() always produces something larger than what
we need to actually store the deflated data, we copy it over into
a new buffer if the actual length doesn't match the buffer length.
When packing jgit.git this saves over 111 KiB in the cache, and is
thus a worthwhile hit on CPU time.
To further save memory we store the inflated size of the delta
(which we need for the object header) in the same field as the
pathHash, as the pathHash is no longer necessary by this phase
of the packing algorithm.
Change-Id: I0da0c600d845e8ec962289751f24e65b5afa56d7
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
PackWriter now produces new deltas if there is not a suitable delta
available for reuse from an existing pack file. This permits JGit to
send less data on the wire by sending a delta relative to an object
the other side already has, instead of sending the whole object.
The delta searching algorithm is similar in style to what C Git
uses, but apparently has some differences (see below for more on).
Briefly, objects that should be considered for delta compression are
pushed onto a list. This list is then sorted by a rough similarity
score, which is derived from the path name the object was discovered
at in the repository during object counting. The list is then
walked in order.
At each position in the list, up to $WINDOW objects prior to it
are attempted as delta bases. Each object in the window is tried,
and the shortest delta instruction sequence selects the base object.
Some rough rules are used to prevent pathological behavior during
this matching phase, like skipping pairings of objects that are
not similar enough in size.
PackWriter intentionally excludes commits and annotated tags from
this new delta search phase. In the JGit repository only 28 out
of 2600+ commits can be delta compressed by C Git. As the commit
count tends to be a fair percentage of the total number of objects
in the repository, and they generally do not delta compress well,
skipping over them can improve performance with little increase in
the output pack size.
Because this implementation was rebuilt from scratch based on my own
memory of how the packing algorithm has evolved over the years in
C Git, PackWriter, DeltaWindow, and DeltaEncoder don't use exactly
the same rules everywhere, and that leads JGit to produce different
(but logically equivalent) pack files.
Repository | Pack Size (bytes) | Packing Time
| JGit - CGit = Difference | JGit / CGit
-----------+----------------------------------+-----------------
git | 25094348 - 24322890 = +771458 | 59.434s / 59.133s
jgit | 5669515 - 5709046 = - 39531 | 6.654s / 6.806s
linux-2.6 | 389M - 386M = +3M | 20m02s / 18m01s
For the above tests pack.threads was set to 1, window size=10,
delta depth=50, and delta and object reuse was disabled for both
implementations. Both implementations were reading from an already
fully packed repository on local disk. The running time reported
is after 1 warm-up run of the tested implementation.
PackWriter is writing 771 KiB more data on git.git, 3M more on
linux-2.6, but is actually 39.5 KiB smaller on jgit.git. Being
larger by less than 0.7% on linux-2.6 isn't bad, nor is taking an
extra 2 minutes to pack. On the running time side, JGit is at a
major disadvantage because linux-2.6 doesn't fit into the default
WindowCache of 20M, while C Git is able to mmap the entire pack and
have it available instantly in physical memory (assuming hot cache).
CGit also has a feature where it caches deltas that were created
during the compression phase, and uses those cached deltas during
the writing phase. PackWriter does not implement this (yet),
and therefore must create every delta twice. This could easily
account for the increased running time we are seeing.
Change-Id: I6292edc66c2e95fbe45b519b65fdb3918068889c
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>