Support cutting existing delta chains longer than the max depth
Some packs built by JGit have incredibly long delta chains due to a
long standing bug in PackWriter. Google has packs created by JGit's
DfsGarbageCollector with chains of 6000 objects long, or more.
Inflating objects at the end of this 6000 long chain is impossible
to complete within a reasonable time bound. It could take a beefy
system hours to perform even using the heavily optimized native C
implementation of Git, let alone with JGit.
Enable pack.cutDeltaChains to be set in a configuration file to
permit the PackWriter to determine the length of each delta chain
and clip the chain at arbitrary points to fit within pack.depth.
Delta chain cycles are still possible, but no attempt is made to
detect them. A trivial chain of A->B->A will iterate for the full
pack.depth configured limit (e.g. 50) and then pick an object to
store as non-delta.
When cutting chains the object list is walked in reverse to try
and take advantage of existing chain computations. The assumption
here is most deltas are near the end of the list, and their bases
are near the front of the list. Going up from the tail attempts to
reuse chainLength computations by relying on the memoized value in
the delta base.
The chainLength field in ObjectToPack is overloaded into the depth
field normally used by DeltaWindow. This is acceptable because the
chain cut happens before delta search, and the chainLength is reset
to 0 if delta search will follow.
Change-Id: Ida4fde9558f3abbbb77ade398d2af3941de9c812
JGit 3.0: move internal classes into an internal subpackage
This breaks all existing callers once. Applications are not supposed
to build against the internal storage API unless they can accept API
churn and make necessary updates as versions change.
Change-Id: I2ab1327c202ef2003565e1b0770a583970e432e9
This is helpful for writing the pack configuration into a log file.
Change-Id: I5e7f5ff7e01c9538ca12a1860844ba9b467bdf05
Signed-off-by: Edwin Kempin <edwin.kempin@sap.com>
The bitmap code in PackWriter knows exactly when to use a pack as
a "cached pack". It enables cached pack usage only when the pack
has a bitmap and its entire closure of objects needs to be sent.
This is a much simpler code path to maintain, and JGit actually
has a way to write the necessary index.
Change-Id: I2645d482f8733fdf0c4120cc59ba9aa4d4ba6881
Bitmaps provide a huge performance boost for counting objects and they
play nice with the cgit implementation.
Change-Id: I33b05a6c8f1ee2df7770f0b9fdc50d0b4bbf1029
Enable writing pack indexes with bitmaps in the GC.
Update the dfs and file GC implementations to prepare and write
bitmaps on the packs that contain the full closure of the object
graph. Update the DfsPackDescription to include the index version.
Change-Id: I3f1421e9cd90fe93e7e2ef2b8179ae2f1ba819ed
Support creating pack bitmap indexes in PackWriter.
Update the PackWriter to support writing out pack bitmap indexes,
a parallel ".bitmap" file to the ".pack" file.
Bitmaps are selected at commits every 1 to 5,000 commits for
each unique path from the start. The most recent 100 commits are
all bitmapped. The next 19,000 commits have a bitmaps every 100
commits. The remaining commits have a bitmap every 5,000 commits.
Commits with more than 1 parent are prefered over ones
with 1 or less. Furthermore, previously computed bitmaps are reused,
if the previous entry had the reuse flag set, which is set when the
bitmap was placed at the max allowed distance.
Bitmaps are used to speed up the counting phase when packing, for
requests that are not shallow. The PackWriterBitmapWalker uses
a RevFilter to proactively mark commits with RevFlag.SEEN, when
they appear in a bitmap. The walker produces the full closure
of reachable ObjectIds, given the collection of starting ObjectIds.
For fetch request, two ObjectWalks are executed to compute the
ObjectIds reachable from the haves and from the wants. The
ObjectIds needed to be written are determined by taking all the
resulting wants AND NOT the haves.
For clone requests, we get cached pack support for "free" since
it is possible to determine if all of the ObjectIds in a pack file
are included in the resulting list of ObjectIds to write.
On my machine, the best times for clones and fetches of the linux
kernel repository (with about 2.6M objects and 300K commits) are
tabulated below:
Operation Index V2 Index VE003
Clone 37530ms (524.06 MiB) 82ms (524.06 MiB)
Fetch (1 commit back) 75ms 107ms
Fetch (10 commits back) 456ms (269.51 KiB) 341ms (265.19 KiB)
Fetch (100 commits back) 449ms (269.91 KiB) 337ms (267.28 KiB)
Fetch (1000 commits back) 2229ms ( 14.75 MiB) 189ms ( 14.42 MiB)
Fetch (10000 commits back) 2177ms ( 16.30 MiB) 254ms ( 15.88 MiB)
Fetch (100000 commits back) 14340ms (185.83 MiB) 1655ms (189.39 MiB)
Change-Id: Icdb0cdd66ff168917fb9ef17b96093990cc6a98d
A pack bitmap index is an additional index of compressed
bitmaps of the object graph. Furthermore, a logical API of the index
functionality is included, as it is expected to be used by the
PackWriter.
Compressed bitmaps are created using the javaewah library, which is a
word-aligned compressed variant of the Java bitset class based on
run-length encoding. The library only works with positive integer
values. Thus, the maximum number of ObjectIds in a pack file that
this index can currently support is limited to Integer.MAX_VALUE.
Every ObjectId is given an integer mapping. The integer is the
position of the ObjectId in the complete ObjectId list, sorted
by offset, for the pack file. That integer is what the bitmaps
use to reference the ObjectId. Currently, the new index format can
only be used with pack files that contain a complete closure of the
object graph e.g. the result of a garbage collection.
The index file includes four bitmaps for the Git object types i.e.
commits, trees, blobs, and tags. In addition, a collection of
bitmaps keyed by an ObjectId is also included. The bitmap for each entry
in the collection represents the full closure of ObjectIds reachable
from the keyed ObjectId (including the keyed ObjectId itself). The
bitmaps are further compressed by XORing the current bitmaps against
prior bitmaps in the index, and selecting the smallest representation.
The XOR'd bitmap and offset from the current entry to the position
of the bitmap to XOR against is the actual representation of the entry
in the index file. Each entry contains one byte, which is currently
used to note whether the bitmap should be blindly reused.
Change-Id: Id328724bf6b4c8366a088233098c18643edcf40f
Break the dependency on RevObject when creating a newObjectToPack().
Update the ObjectReuseAsIs API to support creating new
ObjectToPack with only the AnyObjectId and Git object type. This is
needed to support the future pack index bitmaps, which only contain
this information and do not want the overhead of creating a temporary
object for every ObjectId.
Change-Id: I906360b471412688bf429ecef74fd988f47875dc
Include supported extensions in PackFile constructor.
Previously a PackFile class was assumed to only support a .pack and .idx
file. Update the constructor to enumerate the supported extensions for
the pack file. This will allow the bitmap code to only be executed if
the bitmap extension file is known to exist.
Change-Id: Ie59041dffec5f60d7ea2771026ffd945106bd4bf
Reduce memory held and speed up DfsGarbageCollector.
getObjectList() returns a list of ObjectToPack. These can hold on to a
lot of memory. Furthermore, binary searching for objects in a sorted
array can be slow. Improve the speed and reduce the memory by creating a
copy of the ObjectId and inserting it into an ObjectIdOwnerMap.
Change-Id: Ib5aa5b7447e05938b47fa55812a87b9872c20ea7
Update DfsGarbageCollector to not read back a pack index.
Previously, the Dfs GC excluded objects from packs by passing a
previously written index to the PackWriter. Reading back a file on
Dfs is slow. Instead, allow the PackWriter to expose the objects
included in a pack and forward that to invocations of excludeObjects() .
Change-Id: I377cb4ab07f62cf790505e1eeb0b2efe81897c79
Rename PackConstants to PackExt, a typed pack file extension.
PackConstants previously contained string values for the pack and pack
index extension. Change PackConstant to be PackExt, a typed wrapper
around the string pack file extension.
Change-Id: I86ac4db6da8f33aa42d6f37cfcc119e819444318
Update DfsObjDatabase API to open/write by pack extension.
Previously, the DfsObjDatabase had a hardcoded getPackFile() and
getPackIndex() methods which opens a .pack and .idx file, respectively.
A future change to add a bitmap index will need to be stored in a
parallel .bitmap file. Update the DfsObjDatabase to support opening and
writing of files for any pack extension.
Change-Id: I7c403b501e242096a2d435f6865d6025a9f86108
The maxMemory for a DeltaWindow can be optionally disabled when it is
less than or equal to zero. Respect this configuration when enforcing
the limits on object load.
Change-Id: Ic0f4ffcabf82105f8e690bd0eb5e6be485a313b3
Previously, memory limits were enforced at the start of each iteration
of the delta search, based on objects that were currently loaded in
memory. However, new objects added to the window may be expanded in a
future iteration of the search and thus were not accounted for correctly
at the start of the search. To fix this, memory limits are now enforced
before each object is loaded.
Change-Id: I898ab43e7bf5ee7189831f3a68bb9385ae694b8f
A few classes such as Constanrs are marked with @SuppressWarnings, as are
toString() methods with many liternal, but otherwise $NLS-n$ is used for
string containing text that should not be translated. A few literals may
fall into the gray zone, but mostly I've tried to only tag the obvious
ones.
Change-Id: I22e50a77e2bf9e0b842a66bdf674e8fa1692f590
Fix DeltaWindow.clear() to release loaded buffer bytes.
It is possible for the buffer to be set but not the index. It
ocurrs when an exception occurs during creating an index, but
after the buffer is loaded. Furthermore, the cleared DeltaWindowEntry
should have been ent and not res.
Change-Id: I2e0d79540316635bf7aa43efd225e4eb38230844
These appear as descriptions in the index, see here (currently empty):
http://download.eclipse.org/jgit/docs/latest/apidocs/
Change-Id: If7996deef30ae688bade8b3ad6b19547ca3d8b50
Signed-off-by: Chris Aniszczyk <zx@twitter.com>
Do not delta compress objects that have already tried to compress.
If an object is in a pack file already, delta compression will not
attempt to re-compress it. This assumes that the previous
packing already performed the optimal compression attempt, however,
the subclasses of StoredObjectRepresentation may use other heuristics
to determine if the stored format is optimal.
Change-Id: I403de522f4b0dd2667d54f6faed621f392c07786
PackWriter supports excluding objects from being written to the pack.
You may specify a PackIndex which lists all those objects which should
not go into the new pack. This feature was broken because not all
commits have been checked whether they should be excluded or not. For
other object types the exclude algorithm worked. This commit adds the
missing check.
Change-Id: Id0047098393641ccba784c58b8325175c22fcece
Signed-off-by: Christian Halstrick <christian.halstrick@sap.com>
Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
Use Integer, Character, and Long valueOf methods when
passing parameters to MessageFormat and other places
that expect objects instead of primitives
Change-Id: I5942fbdbca6a378136c00d951ce61167f2366ca4
Parsing the size from a packed object header was incorrectly computing
the total inflated length when the length exceeded the range of a Java
int. The next 7 bits of size information was shifted left as an int
using a shift of 25 bits, placing the higher bits of the size into the
sign position. When this size was extended to a long to be added to
the current size accumulator the size went negative, resulting in
NegativeArraySizeException being thrown.
Fix all places where this particular pattern of code is used to read a
pack size field, or a binary delta header, as they both use the same
variable length encoding scheme.
Change-Id: I04008728ed828f18202652c3d5401cf95a441d0a
Consider two objects A->B where A uses B as a delta base, and these
are in the same source pack file ordered as "A B".
If cached packs is enabled and B is also in the cached pack that
will be appended onto the end of the thin pack, and both A, B are
supposed to be in the thin pack, PackWriter must consider the fact
that A's base B is an edge object that claims to be part of the
new pack, but is actually "external" and cannot be written first.
If the object reuse system considered B candidates fist this bug
does not arise, as B will be marked as edge due to it existing in
the cached pack. When the A candidates are later examined, A sees a
valid delta base is available as an edge, and will not later try to
"write base first" during the writing phase.
However, when the reuse system considers A candidates first they
see that B will be in the outgoing pack, as it is still part of
the thin pack, and arrange for A to be written first. Later when A
switches from being in-pack to being an edge object (as it is part
of the cached pack) the pointer in B does not get its type changed
from ObjectToPack to ObjectId, so B thinks A is non-edge.
We work around this case by also checking that the delta base B
is non-edge before writing the object to the pack. Later when A
writes its object header, delta base B's ObjectToPack will have
an offset == 0, which makes isWritten() = false, and the OBJ_REF
delta format will be used for A's header. This will be resolved by
the client to the copy of B that appears in the later cached pack.
Change-Id: Ifab6bfdf3c0aa93649468f49bcf91d67f90362ca
Packs can contain up to 2^32-1 objects, which exceeds the range of a
Java int. Try harder to accept higher object counts in some cases by
using long more often when we are working with the object count value.
This is a trivial refactoring, we may have to make even more changes
to the object handling code to support more than 2^31-1 objects.
Change-Id: I8cd8146e97cd1c738ad5b48fa9e33804982167e7
Annotated tags are relatively rare and currently are scheduled in a
pack file near the commits, decreasing the time it takes to resolve
client requests reading tags as part of a history traversal.
Putting them first before the commits allows the storage system to
page in the tag area, and have it relatively hot in the LRU when
the nearby commit area gets examined too. Later looking at the
tree and blob data will pollute the cache, making it more likely
the tags are not loaded and would require file IO.
Change-Id: I425f1f63ef937b8447c396939222ea20fdda290f
Correct progress monitor on "Getting sizes:" phase
This counter always was running 1 higher, because it incremented
after the queue was exhausted (and every object was processed). Move
increments to be after the queue has provided a result, to ensure
we do not show a higher in-progress count than total count.
Change-Id: I97f815a0492c0957300475af409b6c6260008463
Keep track of a static collection of all PackWriter instances
Stored in a weak concurrent hash map, which we clean up while iterating.
Usually the weak reference behavior should not be necessary because
PackWriters should be released with release(), but we still want to
avoid leaks when dealing with broken client code.
Change-Id: I337abb952ac6524f7f920fedf04065edf84d01d2
Estimate the amount of memory used by a PackWriter
Memory usage is dominated by three terms:
- The maximum memory allocated to each delta window.
- The maximum size of a single file held in memory during delta search.
- ObjectToPack instances owned by the writer.
For the first two terms, rather than doing complex instrumentation of
the DeltaWindows, we just overestimate based on the config parameters
(though we may underestimate if the maximum size is not set).
For the ObjectToPack instances, we do some rough byte accounting of the
underlying Java object representation.
Change-Id: I23fe3cf9d260a91f1aeb6ea22d75af8ddb9b1939
Add an object encapsulating the state of a PackWriter
Exposes essentially the same state machine to the programmer as is
exposed to the client via a ProgressMonitor, using a wrapper around
beginTask()/endTask().
Change-Id: Ic3622b4acea65d2b9b3551c668806981fa7293e3
Export the shallow pack information, and also a handy function to
sum up the total times. Include the time writing out the index file,
if it was created.
Change-Id: I7f60ae6848455a357b25feedb23743bbf6c153cf
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
This implements the server side of shallow clones only (i.e.
git-upload-pack), not the client side.
CQ: 5517
Bug: 301627
Change-Id: Ied5f501f9c8d1fe90ab2ba44fac5fa67ed0035a4
Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>
PackWriter: support excluding objects already in other packs
This can be useful when implementing garbage collection and there
are packs that should not be copied, such as huge packs that have
a sibling ".keep" file alongside of them.
Callers driving PackWriter need to initialize the list of packs not
to include objects from by passing each index to excludeObjects().
Change-Id: Id7f34df69df97be406bcae184308e92b0e8690fd
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>
During parsing these are used with contains(). If they are a List
type, the contains operation is not efficient. Some callers such
as UploadPack often pass a List here, so convert to Set when the
type isn't efficient for contains().
Change-Id: If948ae3bf1f46e756bd2d5db14795e12ba7a6207
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Revert "PackWriter: Do not delta compress already packed objects"
This reverts commit 67b064fc9f.
The "tiny optimization" introduced by 67b0 turns out to have a big
savings on wall-clock time when the object store is very slow (e.g.
the DHT support in JGit), but comes with a much bigger penalty in
space used by the output stream. CGit packed with 67b0 enabled is
7 MiB larger than it should be (36 MiB rather than 28/29 MiB). The
much bigger Linux kernel repository gained over 200 MiB, though some
of this may have been caused by a smaller window setting.
Revert this patch as PackWriter should be optimizing for space used
rather than time spent, since its primary use is network transfer, and
that isn't free.
Change-Id: I7413a9ef89762208159b4a1adc5a22a4c9245611
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
PackWriter: Only search for base objects on thin packs
A non-thin pack does not need to worry about preferred bases, the pack
will be self-contained and all required delta base objects will appear
within the pack itself. Obtaining the path buffer and length from the
ObjectWalk to build the preferred base table is "expensive", so avoid
the cost unless a thin pack is being constructed.
Change-Id: I16e30cd864f4189d4304e7957a7cd5bdb9e84528
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
PackWriter: Skip progress messages on fast operations
If the "Finding sources" phase will complete in <1 second with no
delta compression enabled, don't bother showing the progress meter for
this phase. Small repositories on the local filesystem tend to rip
through this phase always subsecond and the ProgressMonitor display
can actually slow the operation down.
If delta compression is enabled, there are two phases that may run
very quickly. Set the timer to 500 milliseconds instead, reducing the
risk that the user has to wait longer than 1 second before any sort of
output from the packer occurs.
Change-Id: I58110f17e2a5ffa0134f9768b94804d16bbb8399
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Some embeddings of UploadPack (e.g. Gerrit Code Review) set their own
PackConfig from a server-wide configuration, overriding any JGit
defaults or settings that may exist at the local repository level.
Make a copy constructor form of PackConfig so this server-wide
configuration object can be copied and then merged with repository
specific configuration data.
Change-Id: I4463c95aeaf7d6536c3ab132dec9c50ee528d9e0
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
jgit.storage.dht is a storage provider implementation for JGit that
permits storing the Git repository in a distributed hashtable, NoSQL
system, or other database. The actual underlying storage system is
undefined, and can be plugged in by implementing 7 small interfaces:
* Database
* RepositoryIndexTable
* RepositoryTable
* RefTable
* ChunkTable
* ObjectIndexTable
* WriteBuffer
The storage provider interface tries to assume very little about the
underlying storage system, and requires only three key features:
* key -> value lookup (a hashtable is suitable)
* atomic updates on single rows
* asynchronous operations (Java's ExecutorService is easy to use)
Most NoSQL database products offer all 3 of these features in their
clients, and so does any decent network based cache system like the
open source memcache product. Relying only on key equality for data
retrevial makes it simple for the storage engine to distribute across
multiple machines. Traditional SQL systems could also be used with a
JDBC based spi implementation.
Before submitting this change I have implemented six storage systems
for the spi layer:
* Apache HBase[1]
* Apache Cassandra[2]
* Google Bigtable[3]
* an in-memory implementation for unit testing
* a JDBC implementation for SQL
* a generic cache provider that can ride on top of memcache
All six systems came in with an spi layer around 1000 lines of code to
implement the above 7 interfaces. This is a huge reduction in size
compared to prior attempts to implement a new JGit storage layer. As
this package shows, a complete JGit storage implementation is more
than 17,000 lines of fairly complex code.
A simple cache is provided in storage.dht.spi.cache. Implementers can
use CacheDatabase to wrap any other type of Database and perform fast
reads against a network based cache service, such as the open source
memcached[4]. An implementation of CacheService must be provided to
glue this spi onto the network cache.
[1] https://github.com/spearce/jgit_hbase
[2] https://github.com/spearce/jgit_cassandra
[3] http://labs.google.com/papers/bigtable.html
[4] http://memcached.org/
Change-Id: I0aa4072781f5ccc019ca421c036adff2c40c4295
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
PackWriter: Fix the way delta chain cycles are prevented
Take a very simple approach to avoiding delta chains during object
reuse: objects are now always selected from the oldest pack that
contains them. This prevents cycles because a pack must not have
a cycle in the delta chain. If both objects A and B are chosen
out of the same source pack then there cannot be an A->B->A cycle.
The oldest pack is also the most likely to have the smallest deltas.
Its the biggest pack in the system and probably came from the clone
(or last GC) of this repository, where all objects were previously
considered and packed tightly together. If an object appears again
(for example due to a revert and a push into this repository) the
newer copy of won't be nearly as small as the older delta version
of it, even if the newer one is also itself a delta.
ObjectDirectory already enumerates objects during selection in this
newest->oldest order, so it already is supplying these assumptions
to PackWriter. Taking advantage of this can speed up selection by
a tiny amount by avoiding some tests, but can also help to prevent
a cycle needing to be broken on the fly during writing.
The previous cycle breaking logic wasn't fully correct either.
If a different delta base was chosen, the new delta base might not
have been written into the output pack before the current object,
forcing the use of REF_DELTA when OFS_DELTA is always smaller.
This logic has now been reworked to always re-check the delta base
and ensure it gets written before the current object.
If a cycle occurs, it gets broken the same way as before, by
disabling delta reuse and finding an alternative form of the
object, which may require inflating/deflating in whole format.
Change-Id: I9953ab8be54ceb8b588e1280d6f7edd688887747
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
If the total number of objects to look for reuse on is under 4096
this is really close to a reasonable batch size for the DHT storage
system to lookup at once. Combine all of the objects into a single
temporary list, perform reuse, and then prune the main lists if any
duplicate objects were detected from a selected CachedPack.
The intention here is to try and avoid 4 tiny sequential lookups
on the storage system when the time to wait for each of those to
finish is higher than the CPU time required to build (and later
GC) this temporary list.
Change-Id: I528daf9d2f7744dc4a6281750c2d61d8f9da9f3a
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Instead of looping over the objectsLists array, always set slot 0 to
null and explicitly work on the 4 indexes that matter. This kills
some loops and increases the length of the code slightly, but I've
always really disliked that dummy 0 slot.
Change-Id: I5ad938501c1c61f637ffdaff0d0d88e3962d8942
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
PackWriter: Speed up pruning of objects from cached packs
During object enumeration for the thin pack, very few objects come
out that are duplicated with the cached pack. Typically these are
only cases where a blob or tree was cherry-picked forward, got a
copy or rename, or was reverted... all relatively infrequent events.
Speed up pruning of the thin pack object list by combining the phase
with the object representation selection. Implementers should already
be offering to reuse the object from the cached pack if it is stored
there, at which point the implementation can perform a very fast type
of containment test using the cached pack's identity rather than yet
another index lookup. For the local disk case this is probably not a
big improvement, but it does help on the DHT implementation where the
two passes combined into one reduces latency.
Change-Id: I6a07fc75d9075bf6233e967360b6546f9e9a2b33
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Frequently enough I'm wondering how much of a pack is commits vs.
trees, and the total line doesn't really tell us this because its
a gross total from the pack. Computing the counts per object type
is simple during packing, as PackWriter already has everything in
memory broken up by object type. Its virtually free to get these
values and track them.
Change-Id: Id5e6b1902ea909c72f103a0fbca5d8bc316f9ab3
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
PackWriter: Rename getObjectsNumber to getObjectCount
This better matches with PackFile and CachedPack's methods
that return the same value.
Change-Id: Idb9b7c71d2048dd2344a62c2cde20b4e34529ab7
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>