Delete expired garbage even when there is no GC pack present.
Remove the check that the garbage pack creation time is older than
the last GC operation, because the last GC operation time cannot be
determined when there is no GC pack.
Add additional tests to make sure the contents of the expired garbage
packs are considered during the GC operation and any actively
referenced objects from the garbage packs are copied successfully
into the GC pack before deleting the garbage pack.
Change-Id: I09e8b2656de8ba7f9b996724ad1961d908e937b6
Signed-off-by: Thirumala Reddy Mutchukota <thirumala@google.com>
Enable and fix warnings about redundant specification of type arguments
Since the introduction of generic type parameter inference in Java 7,
it's not necessary to explicitly specify the type of generic parameters.
Enable the warning in Eclipse, and fix all occurrences.
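As a small illustration of the kind of change this warning asks for, the sketch below shows the same declaration written with and without explicit type arguments. The class and method names are hypothetical and are not taken from the JGit sources.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical example class; not part of JGit.
final class DiamondSketch {
    // Before: the type arguments are repeated on the right-hand side.
    static Map<String, List<Integer>> redundant() {
        Map<String, List<Integer>> m = new HashMap<String, List<Integer>>();
        List<Integer> values = new ArrayList<Integer>();
        values.add(Integer.valueOf(1));
        m.put("a", values);
        return m;
    }

    // After: the diamond operator lets the compiler infer them.
    static Map<String, List<Integer>> inferred() {
        Map<String, List<Integer>> m = new HashMap<>();
        List<Integer> values = new ArrayList<>();
        values.add(Integer.valueOf(1));
        m.put("a", values);
        return m;
    }
}
```

Both methods build the same map; only the amount of repeated type information differs.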
Change-Id: I9158caf1beca5e4980b6240ac401f3868520aad0
Signed-off-by: David Pursehouse <david.pursehouse@gmail.com>
Prefer smaller GC files during DFS garbage collection
In 8ac65d33ed PackWriter changed its
behavior to always prefer the last object representation presented
to it by the ObjectReuseAsIs implementation. This was a fix to avoid
delta chain cycles.
Unfortunately it can lead to suboptimal compression when concurrent
GCs are run on the same repository. One case is automatic GC running
(with default settings) in parallel to a manual GC that has disabled
delta reuse in order to generate new smaller deltas for the entire
history of the repository.
Running GC with no-reuse generally requires more CPU time, which
also translates to a longer running time. This can lead to a race
where the automatic GC completes before the no-reuse GC, leaving
the repository in a state such as:
no-reuse GC: size 1 GiB, mtime = 18:45
auto GC: size 8 GiB, mtime = 17:30
With the default sort ordering, the smaller no-reuse GC pack is
sorted earlier in the pack list, due to its more recent mtime.
During object reuse in a future GC, these smaller representations
are considered first by PackWriter, but are all discarded when the
auto GC file from 17:30 is examined second (due to its older mtime).
Work around this in two ways.
Well formed DFS repositories should have at most 1 GC pack. If
2 or more GC packs exist, break the sorting tie by selecting the
smaller file earlier in the pack list. This allows all normal read
code paths to favor the smaller file, which places less pressure
on the DfsBlockCache. If any GC race happens, readers serving clone
requests will prefer the file that is smaller.
During object reuse, flip this ordering so that the smaller file is
last. This allows PackWriter to see smaller deltas last, replacing
larger representations that were previously considered from other
pack files.
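The two orderings described above can be sketched as a pair of comparators. This is a minimal illustration that assumes the list already contains only GC packs; GcPack, READ_ORDER and REUSE_ORDER are hypothetical names and not the actual JGit types or fields.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; the real DFS pack descriptions carry more state.
final class GcPackOrderSketch {
    /** Stand-in for the description of a GC pack. */
    static final class GcPack {
        final String name;
        final long sizeBytes;

        GcPack(String name, long sizeBytes) {
            this.name = name;
            this.sizeBytes = sizeBytes;
        }
    }

    /** Normal read paths: the smaller GC pack sorts first. */
    static final Comparator<GcPack> READ_ORDER =
            Comparator.comparingLong(p -> p.sizeBytes);

    /** Object reuse: flipped, so the smaller GC pack is seen last. */
    static final Comparator<GcPack> REUSE_ORDER = READ_ORDER.reversed();

    /** Returns the pack names in the given order without modifying the input. */
    static List<String> names(List<GcPack> packs, Comparator<GcPack> order) {
        List<GcPack> sorted = new ArrayList<>(packs);
        sorted.sort(order);
        List<String> out = new ArrayList<>();
        for (GcPack p : sorted) {
            out.add(p.name);
        }
        return out;
    }
}
```

With the 1 GiB no-reuse pack and the 8 GiB auto pack from the example above, READ_ORDER lists the no-reuse pack first, while REUSE_ORDER lists it last so that its representations are the ones a last-wins writer keeps.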
Change-Id: I0b7dc8bb9711c82abd6bd16643f518cfccc6d31a
Disabling garbage pack coalescing when garbageTtl > 0 can result in
a lot of garbage packs if they are created within the garbageTtl time.
To avoid a large number of garbage packs, re-introduce garbage pack
coalescing for packs that are created within a single calendar day
when the garbageTtl is more than one day, or within one third of the
garbageTtl otherwise.
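One possible reading of this rule is sketched below. It assumes that "calendar day" means a UTC day and that, for a garbageTtl of one day or less, the day is divided into slots of one third of the garbageTtl. Both are assumptions made for illustration, and the class and method names are hypothetical.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the coalescing window; not the JGit implementation.
final class GarbageCoalesceSketch {
    private static final long DAY_MS = TimeUnit.DAYS.toMillis(1);

    /** Start of the UTC day containing the given timestamp. */
    static long dayStart(long timeMs) {
        return Math.floorDiv(timeMs, DAY_MS) * DAY_MS;
    }

    /** Whether a garbage pack last modified at packMs may be coalesced at nowMs. */
    static boolean coalesceable(long packMs, long nowMs, long garbageTtlMs) {
        if (garbageTtlMs <= 0) {
            return true; // no TTL: always coalesce
        }
        if (dayStart(packMs) != dayStart(nowMs)) {
            return false; // not created on the same day
        }
        if (garbageTtlMs > DAY_MS) {
            return true; // long TTL: same day is enough
        }
        long slotMs = garbageTtlMs / 3;
        if (slotMs == 0) {
            return false; // TTL too small to form a slot
        }
        long packSlot = (packMs - dayStart(packMs)) / slotMs;
        long nowSlot = (nowMs - dayStart(nowMs)) / slotMs;
        return packSlot == nowSlot;
    }
}
```

Bucketing by slot rather than by elapsed time keeps the number of garbage packs per day bounded, at the cost of occasionally not coalescing two packs that straddle a slot boundary.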
Change-Id: If969716aeb55fb4fd0ff71d75f41a07638cd5a69
Signed-off-by: Thirumala Reddy Mutchukota <thirumala@google.com>
The Compacter and Garbage Collector will record the estimated size of
the compact, gc or garbage packs they are about to create. Clients can
use this information to decide how to store a pack based on its
approximate expected size.
Added a new protected method DfsObjDatabase.newPack(PackSource
packSource, long estimatedPackSize), so that clients can override
this method to make use of the estimatedPackSize while creating a new
DfsPackDescription object. The default implementation of this method is
equivalent to
newPack(packSource).setEstimatedPackSize(estimatedPackSize). I didn't
make it abstract because that would force all the existing subclasses
of DfsObjDatabase to implement this method. Due to this default
implementation, the estimatedPackSize is added to DfsPackDescription
using a setter instead of a constructor parameter (even though a
constructor parameter would be a better choice as this value is set only
during object creation).
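The shape of this change can be sketched with simplified stand-in classes. ObjDatabase, PackDescription and InMemoryDatabase below are hypothetical reductions of the real types and keep only what is needed to show the non-abstract overload and the fluent setter.

```java
// Hypothetical, simplified stand-ins; not the JGit classes themselves.
final class NewPackSketch {
    enum PackSource { INSERT, COMPACT, GC, UNREACHABLE_GARBAGE }

    static class PackDescription {
        final PackSource source;
        private long estimatedPackSize;

        PackDescription(PackSource source) {
            this.source = source;
        }

        // A setter rather than a constructor parameter, so the default
        // overload below can be layered on the existing factory method.
        PackDescription setEstimatedPackSize(long estimatedPackSize) {
            this.estimatedPackSize = estimatedPackSize;
            return this;
        }

        long getEstimatedPackSize() {
            return estimatedPackSize;
        }
    }

    abstract static class ObjDatabase {
        protected abstract PackDescription newPack(PackSource source);

        // Not abstract: existing subclasses keep compiling, while new ones
        // may override this to pick storage based on the estimate.
        protected PackDescription newPack(PackSource source,
                long estimatedPackSize) {
            return newPack(source).setEstimatedPackSize(estimatedPackSize);
        }
    }

    /** A subclass written before the overload existed still works unchanged. */
    static class InMemoryDatabase extends ObjDatabase {
        @Override
        protected PackDescription newPack(PackSource source) {
            return new PackDescription(source);
        }
    }
}
```

The trade-off matches the one noted above: the setter keeps the change source-compatible for existing subclasses, even though the value is only meaningful at creation time.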
Change-Id: Iade1122633ea774c2e842178a6a6cbb4a57b598b
Signed-off-by: Thirumala Reddy Mutchukota <thirumala@google.com>
Check that DfsBlockCache#blockSize is a power of 2
If a value that isn’t a power of 2 is used, there is a high chance of
java.lang.ArrayIndexOutOfBoundsException and
org.eclipse.jgit.errors.CorruptObjectException due to a mismatched
assumption for the DfsBlockCache#blockSizeShift parameter.
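A check of this kind can be written with a single bit trick. The sketch below is illustrative only: the class and method names are hypothetical, and it assumes the shift is derived from the number of trailing zero bits of the block size.

```java
// Hypothetical sketch; not the actual DfsBlockCache code.
final class BlockSizeSketch {
    /** True if value is a positive power of 2 (exactly one bit set). */
    static boolean isPowerOf2(int value) {
        return value > 0 && (value & (value - 1)) == 0;
    }

    /** Validates blockSize and returns the matching shift. */
    static int checkedShift(int blockSize) {
        if (!isPowerOf2(blockSize)) {
            throw new IllegalArgumentException(
                    "blockSize must be a power of 2: " + blockSize);
        }
        // Only for a power of 2 does (1 << shift) equal blockSize again.
        return Integer.numberOfTrailingZeros(blockSize);
    }
}
```

For any other value, shifting by the derived amount no longer reproduces the block size, so offsets computed with the shift and offsets computed with the size disagree.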
Change-Id: Ib348b3704edf10b5f93a3ffab4fa6f09cbbae231
Signed-off-by: Philipp Marx <smigfu@googlemail.com>
DfsGarbageCollector will now enforce a maximum time to live (TTL) for
UNREACHABLE_GARBAGE packs. The default TTL is 1 day, which should be
enough time to avoid races with other processes that are inserting
data into the repository.
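The expiry test itself is a simple age comparison. The sketch below assumes millisecond timestamps and treats a pack whose age equals the TTL as expired; both choices, and the names used, are assumptions for illustration.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a TTL check; not the DfsGarbageCollector code.
final class GarbageTtlSketch {
    /** Default time to live of 1 day, as described above. */
    static final long DEFAULT_TTL_MS = TimeUnit.DAYS.toMillis(1);

    /** Whether a garbage pack last modified at packMs has outlived ttlMs at nowMs. */
    static boolean isExpired(long packMs, long nowMs, long ttlMs) {
        return nowMs - packMs >= ttlMs;
    }
}
```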
Change-Id: Id719e6e2a03cfc9a0c0aef8ed71d261dda14bd0c
Signed-off-by: Mike Williams <miwilliams@google.com>
When using a DfsInserter for high-throughput insertion of many
objects (analogous to git-fast-import), we don't necessarily want to
do a random object lookup for each. It'll be faster from the
inserter's perspective to insert the duplicate objects and let a later
GC handle the deduplication.
Change-Id: Ic97f5f01657b4525f157e6df66023f1f07fc1851
Expose the ObjectInserter that created an ObjectReader
We've found in Gerrit Code Review that it is common to pass around
both an ObjectReader (or more commonly a RevWalk wrapping one) and an
ObjectInserter. These code paths often assume that the ObjectReader
can read back any objects created by the ObjectInserter without
flushing. However, we previously had no way to enforce that constraint
programmatically, leading to hard-to-spot problems.
Provide a solution by exposing the ObjectInserter that created an
ObjectReader, when known. Callers can either continue passing both
objects and check:
reader.getCreatedFromInserter() == inserter
or they can just pass around ObjectReader and extract the inserter
when it's needed (checking that it's not null at usage time).
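The first option, passing both objects and checking the link, can be sketched as follows. Inserter and Reader are hypothetical stand-ins used so the example runs without JGit on the classpath; only the getCreatedFromInserter() name comes from the text above.

```java
// Hypothetical stand-ins for ObjectInserter and ObjectReader.
final class CreatedFromInserterSketch {
    static class Inserter {
        /** A reader created here remembers its inserter. */
        Reader newReader() {
            return new Reader(this);
        }
    }

    static class Reader {
        private final Inserter createdFrom;

        Reader(Inserter createdFrom) {
            this.createdFrom = createdFrom;
        }

        /** Returns the creating inserter, or null when not known. */
        Inserter getCreatedFromInserter() {
            return createdFrom;
        }
    }

    /** Fails fast when the reader cannot see the inserter's unflushed objects. */
    static void requireSameInserter(Reader reader, Inserter inserter) {
        if (reader.getCreatedFromInserter() != inserter) {
            throw new IllegalArgumentException(
                    "reader was not created from this inserter");
        }
    }
}
```

A guard like requireSameInserter turns the previously implicit assumption into an immediate failure at the call site.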
Change-Id: Ibbf5d1968b506f6b47030ab1b046ffccb47352ea
Insert duplicate objects to prevent race during garbage collection.
Prior to this change, DfsInserter would not insert an object into a pack
if it already existed in another pack in the repository, even if that
pack was unreachable. Consider this sequence of events:
- Object FOO is pushed to a repository.
- Subsequent ref changes make FOO UNREACHABLE_GARBAGE.
- FOO is subsequently re-inserted using a DfsInserter, but skipped
due to existing in UNREACHABLE_GARBAGE.
- The repository is repacked; FOO will not be written into a new pack
because it is not yet reachable from a reference. If the
UNREACHABLE_GARBAGE packs are deleted, FOO disappears.
- A reference is updated to reference FOO. This reference is now broken
as FOO was removed when the repacking process deleted the
UNREACHABLE_GARBAGE pack that stored the only copy of FOO.
The garbage collector can't safely delete the UNREACHABLE_GARBAGE
pack because FOO might be in the middle of being re-inserted/re-packed.
This change writes a duplicate copy of an object if it only exists in
UNREACHABLE_GARBAGE. This "freshens" the object to give it a chance to
survive long enough to be made reachable through a reference.
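The decision described above can be sketched as a lookup that ignores garbage packs. The Pack and Source types and the mustWrite name are hypothetical, and object ids are modeled as plain strings for brevity.

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the "freshen" decision; not the DfsInserter code.
final class FreshenSketch {
    enum Source { INSERT, GC, UNREACHABLE_GARBAGE }

    static final class Pack {
        final Source source;
        final Set<String> objectIds;

        Pack(Source source, Set<String> objectIds) {
            this.source = source;
            this.objectIds = objectIds;
        }
    }

    /**
     * An object is skipped only if some non-garbage pack already holds it.
     * A copy that exists only in UNREACHABLE_GARBAGE does not count, so a
     * duplicate is written and survives deletion of the garbage pack.
     */
    static boolean mustWrite(String id, List<Pack> packs) {
        for (Pack p : packs) {
            if (p.source != Source.UNREACHABLE_GARBAGE
                    && p.objectIds.contains(id)) {
                return false;
            }
        }
        return true;
    }
}
```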
Change-Id: I20f2062230f3af3bccd6f21d3b7342f1152a5532
Signed-off-by: Mike Williams <miwilliams@google.com>
The LRU chain management code was broken, leading to situations where
the chain was incomplete. This prevented the cache from removing
items when it exceeded its memory target, causing a leak.
One case was a repeated hit on the head of the chain. moveToHead(e)
was invoked, linking the head back to itself in a cycle and orphaning
the rest of the table.
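The failure mode is easy to reproduce in a doubly linked chain: re-inserting the current head in front of itself makes it point at itself. The sketch below shows one way to guard against that. It is a generic illustration with hypothetical names, not the code that was fixed.

```java
// Hypothetical sketch of an LRU chain; only moveToHead mirrors the text above.
final class LruChainSketch {
    static final class Entry {
        final String key;
        Entry prev;
        Entry next;

        Entry(String key) {
            this.key = key;
        }
    }

    private Entry head;
    private Entry tail;

    /** Links a detached entry in front of the current head. */
    void addToHead(Entry e) {
        e.prev = null;
        e.next = head;
        if (head != null) {
            head.prev = e;
        } else {
            tail = e;
        }
        head = e;
    }

    /** Moves an entry that is already in the chain to the head. */
    void moveToHead(Entry e) {
        if (e == head) {
            return; // without this guard the head would link to itself
        }
        // e is not the head, so e.prev is non-null here.
        e.prev.next = e.next;
        if (e.next != null) {
            e.next.prev = e.prev;
        } else {
            tail = e.prev;
        }
        addToHead(e);
    }

    /** Keys from head to tail, comma separated. */
    String order() {
        StringBuilder sb = new StringBuilder();
        for (Entry e = head; e != null; e = e.next) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(e.key);
        }
        return sb.toString();
    }

    String tailKey() {
        return tail == null ? null : tail.key;
    }
}
```

Repeated hits on the head leave the chain untouched, so eviction from the tail can still reach every entry.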
Add some unit tests to cover this and a few other paths.
Change-Id: Ib27486eaa1b1d2bf1c745a56d0a5832bfb029322
Revert "Add a method to DfsOutputStream to read as an InputStream"
This reverts commit b646578d89.
openInputStream() is never used in JGit, nor is it used by any
known working DFS implementation. The method was added as a
utility for reading back from a DfsInserter, but the final
implementation of that feature does not require this method.
Change-Id: I075ad95e40af49c92b554480f8993ef5658f7684
Add a method to ObjectInserter to read back inserted objects
In the DFS implementation, flushing an inserter writes a new pack to
the storage system and is potentially very slow, but was the only way
to ensure previously-inserted objects were available. For some tasks,
like performing a series of three-way merges, the total size of all
inserted objects may be small enough to avoid flushing the in-memory
buffered data.
DfsOutputStream already provides a read method to read back from the
not-yet-flushed data, so use this to provide an ObjectReader in the
DFS case.
In the file-backed case, objects are written out loosely on the fly,
so the implementation can just return the existing WindowCursor.
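The idea of reading back buffered objects without a flush can be sketched with an in-memory stand-in. Inserter and Reader below are hypothetical types; ids are supplied by the caller for brevity, whereas a real inserter would compute them from the content.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not the DfsInserter implementation.
final class ReadBackSketch {
    static final class Inserter {
        private final Map<String, byte[]> buffered = new HashMap<>();
        private final Map<String, byte[]> storage;
        int flushCount;

        Inserter(Map<String, byte[]> storage) {
            this.storage = storage;
        }

        /** Buffers the object; nothing reaches storage until flush(). */
        void insert(String id, byte[] data) {
            buffered.put(id, data.clone());
        }

        /** A reader that can also see not-yet-flushed objects. */
        Reader newReader() {
            return new Reader(this);
        }

        /** The potentially slow step: write everything to storage. */
        void flush() {
            storage.putAll(buffered);
            buffered.clear();
            flushCount++;
        }
    }

    static final class Reader {
        private final Inserter inserter;

        Reader(Inserter inserter) {
            this.inserter = inserter;
        }

        /** Looks in the inserter's buffer first, then in storage; null if absent. */
        byte[] open(String id) {
            byte[] data = inserter.buffered.get(id);
            return data != null ? data : inserter.storage.get(id);
        }
    }
}
```

A caller performing a series of merges could insert intermediate results, read them back through the same reader, and flush once at the end.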
Change-Id: I454fdfb88f4d215e31b7da2b2a069853b197b3dd