Do not delta compress objects that have already tried to compress.
If an object is in a pack file already, delta compression will not
attempt to re-compress it. This assumes that the previous
packing already performed the optimal compression attempt, however,
the subclasses of StoredObjectRepresentation may use other heuristics
to determine if the stored format is optimal.
Change-Id: I403de522f4b0dd2667d54f6faed621f392c07786
When compressing objects, don't include the edges in the progress
meter. These cost almost no CPU time as they are simply pushed into
and popped out of the delta search window.
Change-Id: I7ea19f0263e463c65da34a7e92718c6db1d4a131
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>
Use 3 different types of LargeObjectException for the 3 major ways
that we can fail to load an object. For each of these use a unique
string translation which describes the root cause better than just
the ObjectId.name() does.
Change-Id: I810c98d5691b74af9fc6cbd46fc9879e35a7bdca
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
An ObjectReader implementation may be very slow for a single object,
but yet support bulk queries efficiently by batching multiple small
requests into a single larger request. This easily happens when the
reader is built on top of a database that is stored on another host,
as the network round-trip time starts to dominate the operation cost.
RevWalk, ObjectWalk, UploadPack and PackWriter are the first major
users of this new bulk interface, with the goal being to support an
efficient way to pack a repository for a fetch/clone client when the
source repository is stored in a high-latency storage system.
Processing the want/have lists is now done in bulk, to remove
the high costs associated with common ancestor negotiation.
PackWriter already performs object reuse selection in bulk, but it
now can also do the object size lookup and object counting phases
with higher efficiency. Actual object reuse, deltification, and
final output are still doing sequential lookups, making them a bit
more expensive to perform.
Change-Id: I4c966f84917482598012074c370b9831451404ee
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
This refactoring permits applications to configure global per-process
settings for all packing and easily pass it through to per-request
PackWriters, ensuring that the process configuration overrides the
repository specific settings.
For example this might help in a daemon environment where the server
wants to cap the resources used to serve a dynamic upload pack
request, even though the repository's own pack.* settings might be
configured to be more aggressive. This allows fast but less bandwidth
efficient serving of clients, while still retaining good compression
through a cron managed `git gc`.
Change-Id: I58cc5e01b48924b1a99f79aa96c8150cdfc50846
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Discard the uncompressed delta as soon as its compressed
The DeltaCache will most likely need to copy the compressed delta
into a new buffer in order to compact away the wasted space at the
end caused by over allocation. Since we don't need the uncompressed
format anymore, null out our only reference to it so the GC can
reclaim this memory if it needs to perform a collection in order
to satisfy the cache's allocation attempt.
Change-Id: I50403cfd2e3001b093f93a503cccf7adab43cc9d
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Honor pack.windowlimit to cap memory usage during packing
The pack.windowlimit configuration parameter places an upper bound
on the number of bytes used by the DeltaWindow class as it scans
through the object list. If memory usage would exceed the limit
the window is temporarily decreased in size to keep memory used
within that bound.
Change-Id: I09521b8f335475d8aee6125826da8ba2e545060d
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
PackWriter now caches small deltas, or deltas that are very tiny
compared to their source inputs, so that the writing phase goes
faster by reusing those cached deltas.
The cached data is stored compressed, which usually translates to
a bigger footprint due to deltas being very hard to compress, but
saves time during writing by avoiding the deflate step. They are
held under SoftReferences so that the JVM GC can clear out deltas
if memory gets very tight. We would rather continue working and
spend a bit more CPU time during writing than crash due to OOME.
To avoid OutOfMemoryErrors during the caching phase we also trap
OOME and just abort out of the caching.
Because deflateBound() always produces something larger than what
we need to actually store the deflated data, we copy it over into
a new buffer if the actual length doesn't match the buffer length.
When packing jgit.git this saves over 111 KiB in the cache, and is
thus a worthwhile hit on CPU time.
To further save memory we store the inflated size of the delta
(which we need for the object header) in the same field as the
pathHash, as the pathHash is no longer necessary by this phase
of the packing algorithm.
Change-Id: I0da0c600d845e8ec962289751f24e65b5afa56d7
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
PackWriter now produces new deltas if there is not a suitable delta
available for reuse from an existing pack file. This permits JGit to
send less data on the wire by sending a delta relative to an object
the other side already has, instead of sending the whole object.
The delta searching algorithm is similar in style to what C Git
uses, but apparently has some differences (see below for more on).
Briefly, objects that should be considered for delta compression are
pushed onto a list. This list is then sorted by a rough similarity
score, which is derived from the path name the object was discovered
at in the repository during object counting. The list is then
walked in order.
At each position in the list, up to $WINDOW objects prior to it
are attempted as delta bases. Each object in the window is tried,
and the shortest delta instruction sequence selects the base object.
Some rough rules are used to prevent pathological behavior during
this matching phase, like skipping pairings of objects that are
not similar enough in size.
PackWriter intentionally excludes commits and annotated tags from
this new delta search phase. In the JGit repository only 28 out
of 2600+ commits can be delta compressed by C Git. As the commit
count tends to be a fair percentage of the total number of objects
in the repository, and they generally do not delta compress well,
skipping over them can improve performance with little increase in
the output pack size.
Because this implementation was rebuilt from scratch based on my own
memory of how the packing algorithm has evolved over the years in
C Git, PackWriter, DeltaWindow, and DeltaEncoder don't use exactly
the same rules everywhere, and that leads JGit to produce different
(but logically equivalent) pack files.
Repository | Pack Size (bytes) | Packing Time
| JGit - CGit = Difference | JGit / CGit
-----------+----------------------------------+-----------------
git | 25094348 - 24322890 = +771458 | 59.434s / 59.133s
jgit | 5669515 - 5709046 = - 39531 | 6.654s / 6.806s
linux-2.6 | 389M - 386M = +3M | 20m02s / 18m01s
For the above tests pack.threads was set to 1, window size=10,
delta depth=50, and delta and object reuse was disabled for both
implementations. Both implementations were reading from an already
fully packed repository on local disk. The running time reported
is after 1 warm-up run of the tested implementation.
PackWriter is writing 771 KiB more data on git.git, 3M more on
linux-2.6, but is actually 39.5 KiB smaller on jgit.git. Being
larger by less than 0.7% on linux-2.6 isn't bad, nor is taking an
extra 2 minutes to pack. On the running time side, JGit is at a
major disadvantage because linux-2.6 doesn't fit into the default
WindowCache of 20M, while C Git is able to mmap the entire pack and
have it available instantly in physical memory (assuming hot cache).
CGit also has a feature where it caches deltas that were created
during the compression phase, and uses those cached deltas during
the writing phase. PackWriter does not implement this (yet),
and therefore must create every delta twice. This could easily
account for the increased running time we are seeing.
Change-Id: I6292edc66c2e95fbe45b519b65fdb3918068889c
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>