Reorder modifiers to follow Java Language Specification
The Java Language Specification recommends listing modifiers in
the following order:
1. Annotations
2. public
3. protected
4. private
5. abstract
6. static
7. final
8. transient
9. volatile
10. synchronized
11. native
12. strictfp
Not following this convention has no technical impact, but will reduce
the code's readability because most developers are used to the standard
order.
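As a small illustration of the order above, here is a sketch with a hypothetical class (the class and member names are invented for this example and are not taken from the changed code). The helper at the end relies on java.lang.reflect.Modifier.toString, which prints modifiers in the same canonical order:

```java
import java.lang.reflect.Modifier;

// Hypothetical class, used only to illustrate the modifier order.
abstract class ModifierOrderExample {
	// Visibility, then static, then final.
	private static final int FIRST_ID = 1;

	private static int nextId = FIRST_ID;

	// Visibility, then transient, then volatile.
	protected transient volatile int cachedValue;

	// Visibility, then abstract.
	protected abstract String describe();

	// Visibility, then static, then final, then synchronized.
	public static final synchronized int allocateId() {
		return nextId++;
	}

	// Modifier.toString emits the names in the canonical order,
	// whatever order the flags were combined in.
	static String canonical(int modifiers) {
		return Modifier.toString(modifiers);
	}
}
```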
This was detected using SonarLint.
Change-Id: I9cddecb4f4234dae1021b677e915be23d349a380
Signed-off-by: David Pursehouse <david.pursehouse@gmail.com>
Allow overriding DfsPackDescription comparator for scanning packs
Provide a factory for comparators that use the default heuristics except
with a different ordering of PackSources.
Change-Id: I0809b64deb3d0486040076946fdbdad650d69240
Move DfsPackDescription comparators to common location
There are several ways of comparing DfsPackDescriptions for different
purposes, such as object lookup search order and reftable ordering. Some
of these are later compounded into comparators on other objects, so they
appear in the code as Comparator<DfsReftable>, for example.
Put all the DfsPackDescription comparators in static methods on
DfsPackDescription itself. Stop implementing Comparable, to avoid giving
the impression that there is always one true and correct way of sorting
packs.
Change-Id: Ia5ca65249c13373f7ef5b8a5d1ad50a26577706c
Rather than requiring callers to do their own computations based on the
package-private "category" number, provide an actual
Comparator<PackSource> instance, and explicitly discourage usage of
default Enum comparison.
Construct the default comparator using a builder pattern based on
defining equivalence classes. This gives us the same behavior as the old
category field in PackSource, with an abstraction that does not leak the
implementation detail of comparing rank numbers.
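The idea of a builder based on equivalence classes can be sketched roughly as follows. This is a minimal standalone illustration, not the JGit code: RankedSource and SourceComparatorBuilder are hypothetical names, and the enum only borrows a few constant names from PackSource.

```java
import java.util.Comparator;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical stand-in for PackSource.
enum RankedSource { INSERT, RECEIVE, COMPACT, GC, UNREACHABLE_GARBAGE }

// Hypothetical builder: each add() call defines one equivalence class.
final class SourceComparatorBuilder {
	private final Map<RankedSource, Integer> ranks = new EnumMap<>(RankedSource.class);
	private int nextRank;

	// Sources passed in one call compare as equal to each other and
	// sort after every source added by an earlier call.
	SourceComparatorBuilder add(RankedSource... sources) {
		for (RankedSource s : sources) {
			if (ranks.putIfAbsent(s, nextRank) != null) {
				throw new IllegalArgumentException(s + " already ranked");
			}
		}
		nextRank++;
		return this;
	}

	// The rank numbers stay private to the returned comparator.
	Comparator<RankedSource> build() {
		if (ranks.size() != RankedSource.values().length) {
			throw new IllegalStateException("every source needs a rank");
		}
		Map<RankedSource, Integer> copy = new EnumMap<>(ranks);
		return (a, b) -> Integer.compare(copy.get(a), copy.get(b));
	}
}
```

A caller would write something like new SourceComparatorBuilder().add(INSERT, RECEIVE).add(COMPACT).add(GC).add(UNREACHABLE_GARBAGE).build(), and never sees the numeric ranks.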
Change-Id: I6757211397ab1bc181d61298e073f88b69dbefc3
In normal operation, the source of a pack should never be null; the DFS
implementation should always know where a pack came from. Existing
implementations in InMemoryRepository and at Google always have the
source available at construction time.
The problem with null PackSources in the previous implementation was that it
made the DfsPackDescription#compareTo method intransitive. Specifically,
it skips comparing the sources at all if *either* operand is null.
Suppose we have three descriptions A, B, and C, where all fields are
equal except the PackSource, and:
* A's source is INSERT
* B's source is null
* C's source is RECEIVE
In this case, A.compareTo(B) == 0, and B.compareTo(C) == 0, since all
fields are equal except the source, which is skipped. But
A.compareTo(C) != 0, since A and C have different sources.
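The intransitivity can be reproduced with a simplified model of the old comparison. This is a sketch, not the JGit implementation: IntransitiveDemo, DemoSource and the single lastModified field are assumptions made to keep the example short.

```java
// Hypothetical, simplified model of the old compareTo behavior.
enum DemoSource { INSERT, RECEIVE }

final class IntransitiveDemo {
	final DemoSource source; // may be null in this sketch
	final long lastModified;

	IntransitiveDemo(DemoSource source, long lastModified) {
		this.source = source;
		this.lastModified = lastModified;
	}

	// Mimics the flaw: the source is skipped when either side is null.
	static int compare(IntransitiveDemo a, IntransitiveDemo b) {
		if (a.source != null && b.source != null) {
			int c = a.source.compareTo(b.source);
			if (c != 0) {
				return c;
			}
		}
		return Long.compare(a.lastModified, b.lastModified);
	}

	public static void main(String[] args) {
		IntransitiveDemo a = new IntransitiveDemo(DemoSource.INSERT, 1);
		IntransitiveDemo b = new IntransitiveDemo(null, 1);
		IntransitiveDemo c = new IntransitiveDemo(DemoSource.RECEIVE, 1);
		System.out.println(compare(a, b)); // 0: source skipped
		System.out.println(compare(b, c)); // 0: source skipped
		System.out.println(compare(a, c)); // negative: sources differ
	}
}
```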
Avoid this problem in compareTo by enforcing that the source is never
null. We could of course assign an arbitrary category number to a null
source in order to make comparison transitive[1], but it's simpler to
implement and reason about if the field is non-nullable, and there is no
real-world use case to make it null.
Although a non-null source is required at construction time, the field
is currently still mutable: DfsPackDescription#setPackSource is used by
DfsInserterTest to mark packs as garbage. This could probably be
avoided as well, allowing us to convert packSource to a final field, but
doing so is beyond the scope of this change.
[1] The astute reader will notice this is already done by
DfsObjDatabase#reftableComparator(). In fact, the reason that
different comparator implementations non-obviously have different
semantics for this nullable field is another reason why it's clearer
to avoid null entirely.
Change-Id: I85a2aaf3fd6d4868f241f7972a0349f087830ffa
Remove 'final' from
* package private functions.
* try blocks
* for loops
This was done with the following Python script:

  $ cat f.py
  import sys
  import re
  import os

  def replaceFinal(m):
      return m.group(1) + "(" + m.group(2).replace('final ', '') + ")"

  methodDecl = re.compile(r"^([\t ]*[a-zA-Z_ ]+)\(([^)]*)\)")

  def subst(fn):
      input = open(fn)
      os.rename(fn, fn + "~")
      dest = open(fn, 'w')
      for l in input:
          l = methodDecl.sub(replaceFinal, l)
          dest.write(l)
      dest.close()

  for root, dirs, files in os.walk(".", topdown=False):
      for f in files:
          if not f.endswith('.java'):
              continue
          full = os.path.join(root, f)
          print(full)
          subst(full)
Change-Id: If533a75a417594fc893e7c669d2c1f0f6caeb7ca
Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
An empty repository may have a dangling symref HEAD pointing to
refs/heads/master. In this case, there will be a reftable even though
there are no packs yet.
Change-Id: Ib759ffbbfc490953481853e74263dd46d2592888
Signed-off-by: Minh Thai <mthai@google.com>
dfs: Switch InMemoryRepository to DfsReftableDatabase
This ensures DfsReftableDatabase is tested by the same test suites that
use/test InMemoryRepository. It also simplifies the logic of
InMemoryRepository and brings its compatibility story closer to any
other DFS repository that uses reftables for its reference storage.
Change-Id: I881469fd77ed11a9239b477633510b8c482a19ca
Signed-off-by: Minh Thai <mthai@google.com>
Signed-off-by: Terry Parker <tparker@google.com>
DfsReftableDatabase is a new alternative for DfsRefDatabase that
handles more operations for the implementor by delegating through
reftables. All reftable files are stored in sibling DfsObjDatabase
using PackExt.REFTABLE and PackSource.INSERT.
It's assumed the DfsObjDatabase periodically runs compactions and GCs
using DfsPackCompactor and DfsGarbageCollector. Those passes are
essential to collapsing the stack of reftables.
Change-Id: Ia03196ff6fd9ae2d0623c3747cfa84357c6d0c79
Signed-off-by: Minh Thai <mthai@google.com>
Signed-off-by: Terry Parker <tparker@google.com>
Reftable storage in DFS is related to pack storage. Reftables are
stored in the same namespace, but with PackExt.REFTABLE. Include
the set of DfsReftable instances in the PackList and export some
helpers to access the tables.
Change-Id: I6a4f5f953ed6b0ff80a7780f4c6cbcc5eda0da3e
dfs: only create DfsPackFile if description has PACK
In the future, with reftable, a DFS implementation may choose to create
a PackDescription that contains only a REFTABLE extension. Filter
these out by only creating a DfsPackFile if the PackDescription has the
expected PackExt.PACK.
Change-Id: I4c831622378156ae6b68f82c1ee1db5e150893be
By making this a deterministic function, DfsBlockCache can stop
retaining a map of every DfsPackDescription it has ever seen. This
fixes a long standing memory leak in DfsBlockCache.
This refactoring also simplifies the idea of setting up more
lightweight objects around streams.
Change-Id: I051e7b96f5454c6b0a0e652d8f4a69c0bed7f6f4
Enable and fix warnings about redundant specification of type arguments
Since the introduction of generic type parameter inference in Java 7,
it's not necessary to explicitly specify the type of generic parameters.
Enable the warning in Eclipse, and fix all occurrences.
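For illustration, the kind of change this warning leads to looks like the following. The class and variable names are hypothetical and not taken from the changed files.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical example of replacing explicit type arguments with "<>".
final class DiamondExample {
	static Map<String, List<String>> packsBySource() {
		// Before: new HashMap<String, List<String>>()
		Map<String, List<String>> packs = new HashMap<>();
		// Before: new ArrayList<String>()
		List<String> inserted = new ArrayList<>();
		inserted.add("pack-1");
		packs.put("INSERT", inserted);
		return packs;
	}
}
```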
Change-Id: I9158caf1beca5e4980b6240ac401f3868520aad0
Signed-off-by: David Pursehouse <david.pursehouse@gmail.com>
The Compactor and Garbage Collector will record the estimated size of
the compact, gc or garbage packs they are about to create. Clients can
use this information to decide how to actually store the pack, based on
its approximate expected size.
Added a new protected method DfsObjDatabase.newPack(PackSource
packSource, long estimatedPackSize), so that the clients can override
this method to make use of the estimatedPackSize while creating a new
PackDescription object. The default implementation of this method is
equivalent to
newPack(packSource).setEstimatedPackSize(estimatedPackSize). I didn't
make it abstract because that would force all the existing subclasses
of DfsObjDatabase to implement this method. Due to this default
implementation, the estimatedPackSize is added to DfsPackDescription
using a setter instead of a constructor parameter (even though a
constructor parameter would be a better choice, as this value is set only
during object creation).
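The shape of the default implementation and an override can be sketched as below. The classes are trimmed-down hypothetical stand-ins (SketchPackDescription, SketchObjDatabase, SketchDb), and the pack source is modelled as a plain String to keep the example self-contained.

```java
// Hypothetical stand-in for DfsPackDescription.
final class SketchPackDescription {
	private final String source;
	private long estimatedPackSize;

	SketchPackDescription(String source) {
		this.source = source;
	}

	// Returns this so the call can be chained, as in the description above.
	SketchPackDescription setEstimatedPackSize(long size) {
		this.estimatedPackSize = size;
		return this;
	}

	long getEstimatedPackSize() {
		return estimatedPackSize;
	}

	String getSource() {
		return source;
	}
}

// Hypothetical stand-in for DfsObjDatabase.
abstract class SketchObjDatabase {
	protected abstract SketchPackDescription newPack(String source);

	// Not abstract, so existing subclasses keep compiling unchanged.
	protected SketchPackDescription newPack(String source, long estimatedPackSize) {
		return newPack(source).setEstimatedPackSize(estimatedPackSize);
	}
}

// An existing subclass that only implements the one-argument method.
final class SketchDb extends SketchObjDatabase {
	@Override
	protected SketchPackDescription newPack(String source) {
		return new SketchPackDescription(source);
	}
}
```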
Change-Id: Iade1122633ea774c2e842178a6a6cbb4a57b598b
Signed-off-by: Thirumala Reddy Mutchukota <thirumala@google.com>
DfsObjDatabase: clear PackList dirty bit if no new packs
If a reference was updated more recently than a pack was written
(typical) the PackList was perpetually dirty until the next GC
was completed for the repository.
Detect this condition by observing no changes to the PackList
membership and resetting the dirty bit.
Change-Id: Ie2133aca1f8083307c73b6a26358175864f100ef
DfsObjectDatabase: Expose PackList and move markDirty there
What's invalidated when an object database is "dirty" is not the whole
database, but rather a specific list of packs. If there is a race
between getting the pack list and setting the volatile dirty flag
where the packs are rescanned, we don't need to mark the new pack list
as dirty.
This is a fine point that only really applies if the decision of
whether or not to mark dirty actually requires introspecting the pack
list (say, its timestamps). The general operation of "take whatever
is the current pack list and mark it dirty" may still be inherently
racy, but the cost is not so high.
Change-Id: I159e9154bd8b2d348b4e383627a503e85462dcc6
Invalidate DfsObjDatabase pack list when refs are updated
Currently, there is a race where a user of a DfsRepository in a single
thread may get unexpected MissingObjectExceptions trying to look up an
object that appears as the current value of a ref:
1. Thread A scans packs before scanning refs, for example by reading
an object by SHA-1.
2. Thread B flushes an object and updates a ref to point to that
object.
3. Thread A looks up the ref updated in (2). Since it is scanning refs
for the first time, it sees the new object SHA-1.
4. Thread A tries to read the object it found in (3), using the cached
pack list it got from (1). The object appears missing.
Allow implementations to work around this by marking the object
database's current pack list as "dirty." A dirty pack list means that
DfsReader will rescan packs and try again if a requested object is
missing. Implementations should mark the pack list as dirty any time the ref
database reads or scans refs that might be newer than a previously
cached pack list.
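The rescan-and-retry behavior can be modelled roughly as follows. This is a single-threaded sketch under simplifying assumptions: DirtyPackListSketch and its methods are hypothetical, objects are plain strings, and the "pack list" is just a snapshot of the stored object names.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of a cached pack list with a dirty flag.
final class DirtyPackListSketch {
	private final Set<String> storage = new HashSet<>(); // what is really stored
	private Set<String> cachedPackList = new HashSet<>(); // snapshot from last scan
	private boolean dirty;

	// Another writer flushes an object; the cached snapshot is not updated.
	void flush(String objectId) {
		storage.add(objectId);
	}

	// Called when refs newer than the cached snapshot may have been read.
	void markDirty() {
		dirty = true;
	}

	private void rescan() {
		cachedPackList = new HashSet<>(storage);
		dirty = false;
	}

	// Looks in the cached snapshot first; rescans once if it is dirty.
	boolean has(String objectId) {
		if (cachedPackList.contains(objectId)) {
			return true;
		}
		if (dirty) {
			rescan();
			return cachedPackList.contains(objectId);
		}
		return false;
	}
}
```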
Change-Id: I06c722b20c859ed1475628ec6a2f6d3d6d580700
Force reads to use a search ordering of:
INSERT / RECEIVE
COMPACT
GC (heads)
GC_REST (non-heads)
GC_TXN (refs/txn)
UNREACHABLE_GARBAGE
This has provided decent performance for object lookups. Starting
from an arbitrary reference may find the content in a newer pack
created by DfsObjectInserter or a ReceivePack server. Compaction of
recent packs also contains newer content, and then most interesting
data is in the "main" GC pack. As the GC pack is self-contained (has
no edges leading outside) readers typically do not need to go further.
Adding a new GC_REST PackSource allows the DfsGarbageCollector to
identify to the pack ordering code which pack is which, so the
non-heads are scanned second during reads. This removes a hack that
was unique to Google's implementation that enforced this behavior by
fixing up the lastModified timestamp.
Renumber the PackSource's categories to reflect this search ordering.
Change-Id: I19fdaab8a8d6687cbe8c88488e6daa0630bf189a
The RefTree graph needs to be quickly accessed to read references.
It is also a distinct graph, disconnected from the rest of the
repository. Store the commit and tree objects in their own pack.
Change-Id: Icbb735be8fa91ccbf0708ca3a219b364e11a6b83
Insert duplicate objects to prevent race during garbage collection.
Prior to this change, DfsInserter would not insert an object into a pack
if it already existed in another pack in the repository, even if that
pack was unreachable. Consider this sequence of events:
- Object FOO is pushed to a repository.
- Subsequent ref changes make FOO UNREACHABLE_GARBAGE.
- FOO is subsequently re-inserted using a DfsInserter, but skipped
due to existing in UNREACHABLE_GARBAGE.
- The repository is repacked; FOO will not be written into a new pack
because it is not yet reachable from a reference. If the
UNREACHABLE_GARBAGE packs are deleted, FOO disappears.
- A reference is updated to reference FOO. This reference is now broken
as FOO was removed when the repacking process deleted the
UNREACHABLE_GARBAGE pack that stored the only copy of FOO.
The garbage collector can't safely delete the UNREACHABLE_GARBAGE
pack because FOO might be in the middle of being re-inserted/re-packed.
This change writes a duplicate copy of an object if it only exists in
UNREACHABLE_GARBAGE. This "freshens" the object to give it a chance to
survive long enough to be made reachable through a reference.
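The decision change can be sketched like this. FreshenSketch and its methods are hypothetical, and object identity is reduced to a string.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the "should this object be skipped?" decision.
final class FreshenSketch {
	private final Set<String> inLivePacks = new HashSet<>();
	private final Set<String> inGarbagePacks = new HashSet<>();

	void addLive(String objectId) {
		inLivePacks.add(objectId);
	}

	void addGarbage(String objectId) {
		inGarbagePacks.add(objectId);
	}

	// Old behavior: any existing copy, even in garbage, suppresses the write.
	boolean skippedBefore(String objectId) {
		return inLivePacks.contains(objectId) || inGarbagePacks.contains(objectId);
	}

	// New behavior: only a copy outside UNREACHABLE_GARBAGE suppresses it.
	boolean skippedNow(String objectId) {
		return inLivePacks.contains(objectId);
	}
}
```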
Change-Id: I20f2062230f3af3bccd6f21d3b7342f1152a5532
Signed-off-by: Mike Williams <miwilliams@google.com>
JGit 3.0: move internal classes into an internal subpackage
This breaks all existing callers once. Applications are not supposed
to build against the internal storage API unless they can accept API
churn and make necessary updates as versions change.
Change-Id: I2ab1327c202ef2003565e1b0770a583970e432e9
Cluster UNREACHABLE_GARBAGE packs at the end of the search list
Garbage is unlikely to be used by a reader. Ensure they always
cluster at the end of the search list, no matter what timestamp
was used on the pack files.
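One way to express such an ordering is a comparator that checks the garbage flag before the timestamp. The sketch below is hypothetical (GarbageLastSketch and Pack are invented names), and sorting newer packs first among non-garbage packs is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: garbage packs sort last regardless of timestamp.
final class GarbageLastSketch {
	static final class Pack {
		final String name;
		final boolean garbage;
		final long lastModified;

		Pack(String name, boolean garbage, long lastModified) {
			this.name = name;
			this.garbage = garbage;
			this.lastModified = lastModified;
		}
	}

	static final Comparator<Pack> SEARCH_ORDER = (a, b) -> {
		if (a.garbage != b.garbage) {
			return a.garbage ? 1 : -1; // garbage always after non-garbage
		}
		// Assumed tie-breaker: newer packs first.
		return Long.compare(b.lastModified, a.lastModified);
	};

	static List<String> sortedNames(List<Pack> packs) {
		List<Pack> sorted = new ArrayList<>(packs);
		sorted.sort(SEARCH_ORDER);
		List<String> names = new ArrayList<>();
		for (Pack p : sorted) {
			names.add(p.name);
		}
		return names;
	}
}
```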
Change-Id: I3bed89e9569ee3363c36bb3f73fcd34057a3883f
Rename PackConstants to PackExt, a typed pack file extension.
PackConstants previously contained string values for the pack and pack
index extension. Change PackConstants to be PackExt, a typed wrapper
around the string pack file extension.
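A typed wrapper of this kind might look roughly like the sketch below. TypedExt is a hypothetical name and the fileName helper is an assumption for illustration; it is not the PackExt API.

```java
// Hypothetical typed wrapper around a pack file extension string.
final class TypedExt {
	static final TypedExt PACK = new TypedExt("pack");
	static final TypedExt INDEX = new TypedExt("idx");

	private final String ext;

	private TypedExt(String ext) {
		this.ext = ext;
	}

	String getExtension() {
		return ext;
	}

	// Assumed helper: builds "<base>.<ext>" for a given base name.
	String fileName(String base) {
		return base + "." + ext;
	}

	@Override
	public String toString() {
		return "TypedExt[" + ext + "]";
	}
}
```

Callers then pass TypedExt.PACK or TypedExt.INDEX instead of a bare string, so a typo in an extension becomes a compile error rather than a missing file.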
Change-Id: I86ac4db6da8f33aa42d6f37cfcc119e819444318
Update DfsObjDatabase API to open/write by pack extension.
Previously, DfsObjDatabase had hardcoded getPackFile() and
getPackIndex() methods, which open a .pack and .idx file, respectively.
A future change will add a bitmap index, which will need to be stored in
a parallel .bitmap file. Update DfsObjDatabase to support opening and
writing files for any pack extension.
Change-Id: I7c403b501e242096a2d435f6865d6025a9f86108
Expose class DfsReader and method DfsPackFile.hasObject() as public.
Applications may want to be able to inquire about some details of
the storage of a repository. Make this possible by exposing some
simple accessor methods.
Expose method DfsObjDatabase.clearCache() as protected, allowing
implementing subclasses to dump the cache if necessary, and force
it to reload on a future request.
Change-Id: Ic592c82d45ace9f2fa5f8d7e4bacfdce96dea969
Once a pack has been committed with commitPack(), we know that the pack
list has changed but we don't re-scan the underlying storage.
Change-Id: Ia7b35df4442a5f5dfe7e817edcc77b44b5410d08
Add a listener for changes to a DfsObjDatabase's pack files
Intended for cross-request use, so only refers to
DfsRepositoryDescriptions rather than DfsRepositorys.
Change-Id: I2633e472c9264d91d632069f608d53d4bdd0fc09
Add a DFS repository description and reference it in each pack
Just as DfsPackDescription describes a pack but does not imply it is
open in memory, a DfsRepositoryDescription describes a repository at a
basic level without it necessarily being open.
Change-Id: I890b5fccdda12c1090cfabf4083b5c0e98d717f6
In practice the DHT storage layer has not been performing as well as
large scale server environments want to see from a Git server.
The performance of the DHT schema degrades rapidly as small changes
are pushed into the repository due to the chunk size being less than
1/3 of the pushed pack size. Small chunks cause poor prefetch
performance during reading, and require significantly longer prefetch
lists inside of the chunk meta field to work around the small size.
The DHT code is very complex (>17,000 lines of code) and is very
sensitive to the underlying database round-trip time, as well as the
way objects were written into the pack stream that was chunked and
stored on the database. A poor pack layout (from any version of C Git
prior to Junio reworking it) can cause the DHT code to be unable to
enumerate the objects of the linux-2.6 repository in a completable
time scale.
Performing a clone from a DHT stored repository of 2 million objects
takes 2 million row lookups in the DHT to locate the OBJECT_INDEX row
for each object being cloned. This is very difficult for some DHTs to
scale; even at 5000 rows/second the lookup stage alone takes more than
6 minutes (on a local filesystem, this is almost too fast to bother
measuring).
Some servers like Apache Cassandra just fall over and cannot complete
the 2 million lookups in rapid fire.
On a ~400 MiB repository, the DHT schema has an extra 25 MiB of
redundant data that gets downloaded to the JGit process, and that is
before you consider the cost of the OBJECT_INDEX table also being
fully loaded, which is at least 223 MiB of data for the linux kernel
repository. In the DHT schema answering a `git clone` of the ~400 MiB
linux kernel needs to load 248 MiB of "index" data from the DHT, in
addition to the ~400 MiB of pack data that gets sent to the client.
This is 193 MiB more data to be accessed than the native filesystem
format, but it needs to come over a much smaller pipe (local Ethernet
typically) than the local SATA disk drive.
I also never got around to writing the "repack" support for the DHT
schema, as it turns out to be fairly complex to safely repack data in
the repository while also trying to minimize the amount of changes
made to the database, due to very common limitations on database
mutation rates.
This new DFS storage layer fixes a lot of those issues by taking the
simple approach for storing relatively standard Git pack and index
files on an abstract filesystem. Packs are accessed by an in-process
buffer cache, similar to the WindowCache used by the local filesystem
storage layer. Unlike the local file IO, there are some assumptions
that the storage system has relatively high latency and no concept of
"file handles". Instead it looks at the file more like HTTP byte range
requests, where a read channel is simply a thunk to trigger a read
request over the network.
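The "byte range request" view of a file can be sketched as an interface where every read names its own position. RangeReadSketch and its in-memory implementation are hypothetical and only illustrate the idea; they are not the channel API added by this change.

```java
import java.nio.ByteBuffer;

// Hypothetical interface: each call is an independent positioned request,
// so no file handle or cursor is kept between calls.
interface RangeReadSketch {
	// Reads up to dst.remaining() bytes starting at position.
	// Returns the number of bytes read, or -1 at end of data.
	int read(long position, ByteBuffer dst);

	// In-memory stand-in for a remote file, used only for this example.
	static RangeReadSketch over(byte[] data) {
		return (position, dst) -> {
			if (position < 0) {
				throw new IllegalArgumentException("negative position");
			}
			if (position >= data.length) {
				return -1;
			}
			int n = (int) Math.min(dst.remaining(), data.length - position);
			dst.put(data, (int) position, n);
			return n;
		};
	}
}
```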
The DFS code in this change is still abstract; it does not store on
any particular filesystem, but is fairly well suited to Amazon S3
or Apache Hadoop HDFS. Storing packs directly on HDFS rather than
HBase removes a layer of abstraction, as most HBase row reads turn
into an HDFS read.
Most of the DFS code in this change was blatantly copied from the
local filesystem code. Most parts should be refactored to be shared
between the two storage systems, but right now I am hesitant to do
this due to how well tuned the local filesystem code currently is.
Change-Id: Iec524abdf172e9ec5485d6c88ca6512cd8a6eafb