Move ObjectDirectory streaming limit to WindowCacheConfig
IDEs like Eclipse offer up the settings in WindowCacheConfig to the
user as a global set of options that are configured for the entire
JVM process, not per-repository, as the cache is shared across the
entire JVM. The limit on how much we are willing to allocate for
an object buffer is similar to the limit on how much we can use for
data caches, allocating that much space impacts the entire JVM and
not just a single repository, so it should be a global limit.
Change-Id: I22eafb3e223bf8dea57ece82cd5df8bfe5badebc
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
ObjectReader implementations are now responsible for creating the
unique abbreviation of an ObjectId, or for resolving an abbreviation
back to its full form. In this latter case the reader can offer up
multiple candidates to the caller, who may be able to disambiguate
them based on context.
Repository.resolve() doesn't take multiple candidates into account
right now, but it could in the future by looking for a remaining
^0 or ^{commit} suffix and take an expansion if there is only one
commit that matches the input abbreviation. It could also use
the distance from an annotated tag to resolve "tag-NNN-gcommit"
style strings that are often output by `git describe`.
Change-Id: Icd3250adc8177ae05278b858933afdca0cbbdb56
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Allow ObjectReuseAsIs to have more control over write ordering
The reuse system used by an object database may be able to benefit
from knowing what objects are coming next, and even improve data
throughput by delaying (or moving up) objects that are stored near
each other in the source database.
Pushing the iteration down into the reuse code makes it possible
for a smarter implementation to aggregate reuse. But for the
standard pack file format on disk we don't bother, its quite
efficient already.
Change-Id: I64f0048ca7071a8b44950d6c2a5dfbca3be6bba6
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
ObjectReader implementations may wish to use multiple threads in
order to evaluate object reuse faster. Let the reader make that
decision by passing the iteration down into the reader.
Because the work is pushed into the reader, it may need to locate a
given ObjectToPack given its ObjectId. This can easily occur if the
reader has sent a list of ObjectIds to the object database and gets
back information keyed only by ObjectId, without the ObjectToPack
handle. Expose lookup using the PackWriter's own internal map,
so the reader doesn't need to build a redundant copy to track the
assocation of ObjectId back to ObjectToPack.
Change-Id: I0c536405a55034881fb5db92a2d2a99534faed34
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Honor pack.threads and perform delta search in parallel
If we have multiple CPUs available, packing usually goes faster
when each CPU is assigned a slice of the available search space.
The number of threads to use is guessed from the runtime if it
wasn't set by the caller, or wasn't set in the configuration.
Change-Id: If554fd8973db77632a52a0f45377dd6ec13fc220
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
This is an informational function used by PackWriter to help it
better organize objects for delta compression. Storage systems
can implement it to provide up more detailed size information,
or they can simply rely on the default behavior that uses the
ObjectLoader obtained from open.
For local file storage, we can obtain this information faster
through specialized routines that parse a pack object header.
Change-Id: I13a09b4effb71ea5151b51547f7d091564531e58
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
amend commit: Support large delta packed objects as streams
Rename the ByteWindow's inflate() method to setInput. We have
completely refactored the purpose of this method to be feeding part
(or all) of the window as input to the Inflater, and the actual
inflate activity happens in the caller.
Change-Id: Ie93a5bae0e9e637b5e822d56993ce6b562c6ad15
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Use core.streamFileThreshold to set our streaming limit
We default this to 1 MiB for now, but we allow users to modify
it through the Repository's configuration file to be a different
value. A new repository listener is used to identify when the
setting has been updated and trigger a reconfiguration of any
active ObjectReaders.
To prevent a horrible explosion we cap core.streamFileThreshold
at no more than 1/4 of the maximum JVM heap size. We do this
because we need at least 2 byte arrays equal in size to the
stream threshold for the worst case delta inflation scenario,
and our host application probably also needs some amount of the
heap for their working set size.
Change-Id: I103b3a541dc970bbf1a6d92917a12c5a1ee34d6c
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Very large delta instruction streams, or deltas which use very large
base objects, are now streamed through as large objects rather than
being inflated into a byte array.
This isn't the most efficient way to access delta encoded content, as
we may need to rewind and reprocess the base object when there was a
block moved within the file, but it will at least prevent the JVM from
having its heap explode.
When streaming a delta we have an inflater open for each level in the
delta chain, to inflate the instruction set of the delta, as well as
an inflater for the base level object. The base object is buffered,
as is the top level delta requested by the application, but we do not
buffer the intermediate delta streams. This keeps memory usage lower,
so its closer to 1024 bytes per level in the chain, without having an
adverse impact on raw throughput as the top-level buffer gets pushed
down to the lowest stream that has the next region.
Delta instructions transparently collapse here, if the top level does
not copy a region from its base, the base won't materialize that part
from its own base, etc. This allows us to avoid copying around a lot
of segments which have been deleted from the final version.
Change-Id: I724d45245cebb4bad2deeae7b896fc55b2dd49b3
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Similar to the loose object support, whole packed objects can
now be streamed back to the caller. The streaming is less
efficient as we copy the data from the cached window array
into the InflaterInputStream's internal buffer, then inflate
it there before returning to the application.
Like with unpacked objects, there is plenty of room for some
optimization, especially for the copyTo method, where we don't
necessarily need so much buffering to exist.
Change-Id: Ie23be81289e37e24b91d17b0891e47b9da988008
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Big loose objects can now be streamed if they are over the large
object size threshold. This prevents the JVM heap from exploding
with a very large byte array to hold the slurped file, and then
again with its uncompressed copy.
We may have slightly slowed down the simple case for small
loose objects, as the loader no longer slurps the entire thing
and decompresses in memory. To try and keep good performance
for the very common small objects that are below 8 KiB in size,
buffers are set to 8 KiB, causing the reader to slurp most of the
file anyway. However the data has to be copied at least once,
from the BufferedInputStream into the InflaterInputStream.
New unit tests are supplied to get nearly 100% code coverage on the
unpacked code paths, for both standard and pack style loose objects.
We tested a fair chunk of the code elsewhere, but these new tests
are better isolated to the specific branches in the code path.
Change-Id: I87b764ab1b84225e9b5619a2a55fd8eaa640e1fe
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Similar to what we did on Repository, the openObject method
already implied we wanted to open an object, given its main
argument was of type AnyObjectId. Simplify the method name
to just the action, has or open.
Change-Id: If055e5e0d8de0e2424c18a773f6d2bc2f66054f4
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Throw IncorrectObjectTypeException on bad type hints
If the type hint isn't OBJ_ANY and it doesn't match the actual type
observed from the object store, define the reader to throw back an
IncorrectObjectTypeException. This way the caller doesn't have to
perform this check itself before it evaluates the object data, and
we can simplify quite a few call sites.
Change-Id: I9f0dfa033857f439c94245361fcae515bc0a6533
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Similar to what we did with the file code, move the pack writer
into its own package so the related classes and their package
private methods are hidden from the rest of the library.
Change-Id: Ic1b5c7c8c8d266e90c910d8d68dfc8e93586854f
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Tighten up local packed object representation during packing
Rather than making a loader, and then using that to fill the object
representation, parse the header and set up our data directly.
This saves some time, as we don't waste cycles on information we
won't use right now.
The weight computed for a representation is now its actual stored
size in the pack file, rather than its inflated size. This accounts
for changes made when the compression level is modified on the
repository. It is however more costly to determine the weight of
the object, since we have to find its length in the pack. To try and
recover that cost we now cache the length as part of our ObjectToPack
record, so it doesn't have to be found during the output phase.
A LocalObjectToPack now costs us (assuming 32 bit pointers):
(32 bit) (64 bit)
vm header: 8 bytes 8 bytes
ObjectId: 20 bytes 20 bytes
PackedObjectInfo: 12 bytes 12 bytes
ObjectToPack: 8 bytes 12 bytes
LocalOTP: 20 bytes 24 bytes
----------- ---------
68 bytes 74 bytes
Change-Id: I923d2736186eb2ac8ab498d3eb137e17930fcb50
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Move FileRepository to storage.file.FileRepository
This move isolates all of the local file specific implementation code
into a single package, where their package-private methods and support
classes are properly hidden away from the rest of the core library.
Because of the sheer number of files impacted, I have limited this
change to only the renames and the updated imports.
Change-Id: Icca4884e1a418f83f8b617d0c4c78b73d8a4bd17
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Objects that fall completely within a single window can be worked
with in a zero-copy fashion, provided that the window is backed by
a normal byte[] and not by a ByteBuffer.
This works for a surprising number of objects. The default window
size is 8 KiB, but most deltas are quite a bit smaller than that.
Objects smaller than 1/2 of the window size have a very good chance
of falling completely within a window's array, which means we can
work with them without copying their data around.
Larger objects, or objects which are unlucky enough to span over a
window boundary, get copied through the temporary buffer. We pay
a tiny penalty to realize we can't use the zero-copy code path,
but its easier than trying to keep track of two adjacent windows.
With this change (as well as everything preceeding it), packing
is actually a bit faster. Some crude benchmarks based on cloning
linux-2.6.git (~324 MiB, 1,624,785 objects) over localhost using
C git client and JGit daemon shows we get better throughput, and
slightly better times:
Total Time | Throughput
(old) (now) | (old) (now)
--------------+---------------------------
2m45s 2m37s | 12.49 MiB/s 21.17 MiB/s
2m42s 2m36s | 16.29 MiB/s 22.63 MiB/s
2m37s 2m31s | 16.07 MiB/s 21.92 MiB/s
Change-Id: I48b2c8d37f08d7bf5e76c5a8020cde4a16ae3396
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Output of selected reuses is refactored to use a new ObjectReuseAsIs
interface that extends the ObjectReader. This interface allows the
reader to control how it performs the reuse into the output stream,
but also allows it to throw an exception to request the writer to
find a different candidate representation.
The PackFile reuse code was overhauled, cleaning up the APIs so they
aren't exposed in the object loader, but instead are now a single
method on the PackFile itself. The reuse algorithm was changed to do
a data verification pass, followed by the copy pass to the output.
This permits us to work around a corrupt object in a pack file by
seeking another copy of that object when this one is bad.
The reuse code was also optimized for the common case, where the
in-pack representation is under 16 KiB. In these smaller cases
data is sent to the pack writer more directly, avoiding some copying.
Change-Id: I6350c2b444118305e8446ce1dfd049259832bcca
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
The new selection implementation uses a public API on the
ObjectReader, allowing the storage library to enumerate its
candidates and select the best one for this packer without
needing to build a temporary list of the candidates first.
Change-Id: Ie01496434f7d3581d6d3bbb9e33c8f9fa649b6cd
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Extract PackFile specific code to ObjectToPack subclass
The ObjectReader class is dual-purposed into being a factory for the
ObjectToPack, permitting specific ObjectDatabase implementations
to override the method and offer their own custom subclass of the
generic ObjectToPack class. By allowing them to directly extend the
type, each implementation can add custom fields to support tracking
where an object is stored, without incurring any additional penalties
like a parallel Map<ObjectId,Object> would cost.
The reader was chosen to act as a factory rather than the database,
as the reader will eventually be tied more tightly with the
ObjectWalk and TreeWalk. During object enumeration the reader
would have had to load the object for the RevWalk, and may chose
to cache object position data internally so it can later be reused
and fed into the ObjectToPack instance supplied to the PackWriter.
Since a reader is not thread-safe, and is scoped to this PackWriter
and its internal ObjectWalk, its a great place for the database to
perform caching, if any.
Right now this change goes a bit backwards by changing what should
be generic ObjectToPack references inside of PackWriter to the very
PackFile specific LocalObjectToPack subclass. We will correct these
in a later commit as we start to refine what the ObjectToPack API
will eventually look like in order to better support the PackWriter.
Change-Id: I9f047d26b97e46dee3bc0ccb4060bbebedbe8ea9
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
The WindowCache is an implementation detail of PackFile and how its
used by ObjectDirectory. Lets start to hide it and replace the public
API with a more generic concept, ObjectReader.
Because PackedObjectLoader is also considered a private detail of
PackFile, we have to make PackWriter temporarily dependent upon the
WindowCursor and thus FileRepository and ObjectDirectory in order to
just start the refactoring. In later changes we will clean up the
APIs more, exposing sufficient support to PackWriter without needing
the file specific implementation details.
Change-Id: I676be12b57f3534f1285854ee5de1aa483895398
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Factor out duplicate Inflater setup in WindowCursor
Since we use this code twice, pull it into a private method. Let
the compiler/JIT worry about whether or not this logic should be
inlined into the call sites.
Change-Id: Ia44fb01e0328485bcdfd7af96835d62b227a0fb1
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
As discussed on the egit-dev mailing list, we prefer not to have
trailing whitespace in our source code. Correct all currently
offending lines by trimming them.
Change-Id: I002b1d1980071084c0bc53242c8f5900970e6845
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Per CQ 3448 this is the initial contribution of the JGit project
to eclipse.org. It is derived from the historical JGit repository
at commit 3a2dd9921c.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>