Do not delta compress objects that have already tried to compress.
If an object is in a pack file already, delta compression will not
attempt to re-compress it. This assumes that the previous
packing already performed the optimal compression attempt, however,
the subclasses of StoredObjectRepresentation may use other heuristics
to determine if the stored format is optimal.
Change-Id: I403de522f4b0dd2667d54f6faed621f392c07786
Allow ObjectReuseAsIs to have more control over write ordering
The reuse system used by an object database may be able to benefit
from knowing what objects are coming next, and even improve data
throughput by delaying (or moving up) objects that are stored near
each other in the source database.
Pushing the iteration down into the reuse code makes it possible
for a smarter implementation to aggregate reuse. But for the
standard pack file format on disk we don't bother, its quite
efficient already.
Change-Id: I64f0048ca7071a8b44950d6c2a5dfbca3be6bba6
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Allow ObjectToPack subclasses to use up to 4 bits of flags
Some instances may benefit from having access to memory efficient
storage for some small values, like single flag bits. Give up a
portion of our delta depth field to make 4 bits available to any
subclass that wants it.
This still gives us room for delta chains of 1,048,576 objects,
and that is just insane. Unpacking 1 million objects to get to
something is longer than most users are willing to wait for data
from Git.
Change-Id: If17ea598dc0ddbde63d69a6fcec0668106569125
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
An ObjectReader implementation may be very slow for a single object,
but yet support bulk queries efficiently by batching multiple small
requests into a single larger request. This easily happens when the
reader is built on top of a database that is stored on another host,
as the network round-trip time starts to dominate the operation cost.
RevWalk, ObjectWalk, UploadPack and PackWriter are the first major
users of this new bulk interface, with the goal being to support an
efficient way to pack a repository for a fetch/clone client when the
source repository is stored in a high-latency storage system.
Processing the want/have lists is now done in bulk, to remove
the high costs associated with common ancestor negotiation.
PackWriter already performs object reuse selection in bulk, but it
now can also do the object size lookup and object counting phases
with higher efficiency. Actual object reuse, deltification, and
final output are still doing sequential lookups, making them a bit
more expensive to perform.
Change-Id: I4c966f84917482598012074c370b9831451404ee
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Storage implementations may find this useful when implementing the
ObjectReuseAsIs interface on their ObjectReader. Expose it so we
don't force them to create a redundant copy of the information.
Change-Id: I802ec8113c00884fccde5d0e92b9849716316f62
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
PackWriter now caches small deltas, or deltas that are very tiny
compared to their source inputs, so that the writing phase goes
faster by reusing those cached deltas.
The cached data is stored compressed, which usually translates to
a bigger footprint due to deltas being very hard to compress, but
saves time during writing by avoiding the deflate step. They are
held under SoftReferences so that the JVM GC can clear out deltas
if memory gets very tight. We would rather continue working and
spend a bit more CPU time during writing than crash due to OOME.
To avoid OutOfMemoryErrors during the caching phase we also trap
OOME and just abort out of the caching.
Because deflateBound() always produces something larger than what
we need to actually store the deflated data, we copy it over into
a new buffer if the actual length doesn't match the buffer length.
When packing jgit.git this saves over 111 KiB in the cache, and is
thus a worthwhile hit on CPU time.
To further save memory we store the inflated size of the delta
(which we need for the object header) in the same field as the
pathHash, as the pathHash is no longer necessary by this phase
of the packing algorithm.
Change-Id: I0da0c600d845e8ec962289751f24e65b5afa56d7
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Its useful to know what the flags are or what the base that was
selected is. Dump these out as part of the object's toString.
Change-Id: I8810067fb8337b08b4fcafd5f9ea3e1e31ca6726
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Make ObjectToPack clearReuseAsIs signal available to subclasses
A subclass may want to use this method to release handles that are
caching reuse information. Make it protected so they can override
it and update themselves.
Change-Id: I2277a56ad28560d2d2d97961cbc74bc7405a70d4
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Long ago when PackWriter is first written we thought that the delta
depth could be updated automatically. But its never used. Instead
make this a simple standard setter so the caller can more directly
set the delta depth of this object. This permits us to configure a
depth that takes into account more than just the depth of another
object in this same pack.
Change-Id: I1d71b74f2edd7029b8743a2c13b591098ce8cc8f
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
This flag will later control whether or not PackWriter search for a
delta base for this object. Edge objects will never get searched,
as the writer won't be outputting them, so they should always have
this flag set on. Sometime in the future this flag should also be
set for file blobs on file paths that have the "-delta" gitattribute
set in the repository's attributes file.
Change-Id: I6e518e1a6996c8ce00b523727f1b605e400e82c6
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
We need to remember these so we can later cluster objects that
have similar file paths near each other as we search for deltas
between them.
Change-Id: I52cb1e4ca15c9c267a2dbf51dd0d795f885f4cf8
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Similar to what we did with the file code, move the pack writer
into its own package so the related classes and their package
private methods are hidden from the rest of the library.
Change-Id: Ic1b5c7c8c8d266e90c910d8d68dfc8e93586854f
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
The new selection implementation uses a public API on the
ObjectReader, allowing the storage library to enumerate its
candidates and select the best one for this packer without
needing to build a temporary list of the candidates first.
Change-Id: Ie01496434f7d3581d6d3bbb9e33c8f9fa649b6cd
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Make the lower bits available for flags that PackWriter can use to
keep track of facts about the object. We shouldn't need more than
2^24 delta depths, unpacking that chain is unfathomable anyway.
This change gets us 4 bits that are unused in the lower end of the
word, which are typically easier to load from Java and most machine
instruction sets. We can use these in later changes.
Change-Id: Ib9e11221b5bca17c8a531e4ed130ba14c0e3744f
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Extract PackFile specific code to ObjectToPack subclass
The ObjectReader class is dual-purposed into being a factory for the
ObjectToPack, permitting specific ObjectDatabase implementations
to override the method and offer their own custom subclass of the
generic ObjectToPack class. By allowing them to directly extend the
type, each implementation can add custom fields to support tracking
where an object is stored, without incurring any additional penalties
like a parallel Map<ObjectId,Object> would cost.
The reader was chosen to act as a factory rather than the database,
as the reader will eventually be tied more tightly with the
ObjectWalk and TreeWalk. During object enumeration the reader
would have had to load the object for the RevWalk, and may chose
to cache object position data internally so it can later be reused
and fed into the ObjectToPack instance supplied to the PackWriter.
Since a reader is not thread-safe, and is scoped to this PackWriter
and its internal ObjectWalk, its a great place for the database to
perform caching, if any.
Right now this change goes a bit backwards by changing what should
be generic ObjectToPack references inside of PackWriter to the very
PackFile specific LocalObjectToPack subclass. We will correct these
in a later commit as we start to refine what the ObjectToPack API
will eventually look like in order to better support the PackWriter.
Change-Id: I9f047d26b97e46dee3bc0ccb4060bbebedbe8ea9
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
This shortens the implementation within PackWriter, and starts
to open the door for some other refactorings based on changing
the ObjectToPack to be a public part of the API.
Change-Id: Id849cbffc4de20b903e844a2de7737eeb8b7a3ff
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>