Matthias Sohn [Mon, 10 May 2021 22:56:57 +0000 (00:56 +0200)]
Merge branch 'stable-5.9' into stable-5.10
* stable-5.9:
LockFile: create OutputStream only when needed
Remove ReftableNumbersNotIncreasingException
Fix stamping to produce stable file timestamps
Thomas Wolf [Tue, 4 May 2021 21:48:56 +0000 (23:48 +0200)]
LockFile: create OutputStream only when needed
Don't create the stream eagerly in lock(); that may cause JGit to
exceed OS or JVM limits on open file descriptors if many locks need
to be created, for instance when creating many refs. Instead create
the output stream only when one really needs to write something.
Bug: 573328
Change-Id: If9441ed40494d46f594a896d34a5c4f56f91ebf4 Signed-off-by: Thomas Wolf <thomas.wolf@paranor.ch>
Thomas Wolf [Fri, 19 Mar 2021 08:35:34 +0000 (09:35 +0100)]
sshd: try all configured signature algorithms for a key
For RSA keys, there may be several configured signature algorithms:
rsa-sha2-512, rsa-sha2-256, and ssh-rsa. Upstream sshd has bug
SSHD-1105 [1] and always and unconditionally uses only the first
configured algorithm. With the default order, this means that it cannot
connect to a server that knows only ssh-rsa, like for instance Apache
MINA sshd servers older than 2.6.0.
This affects for instance bitbucket.org or also AWS Code Commit.
Re-introduce our own pubkey authenticator that fixes this.
Note that a server may impose a penalty (back-off delay) for subsequent
authentication attempts with signature algorithms unknown to the server.
In such cases, users can re-order the signature algorithm list via the
PubkeyAcceptedAlgorithms (formerly PubkeyAcceptedKeyTypes) ssh config.
Apache MINA sshd 2.6.0 appears to use only the first appropriate
public key signature algorithm for a particular key. See [1]. For
RSA keys, that is rsa-sha2-512. This breaks authentication at servers
that only know the older (and deprecated) ssh-rsa algorithm.
With PubkeyAcceptedAlgorithms, users can re-order algorithms in
the ssh config file per host, if needed. Setting
PubkeyAcceptedAlgorithms ^ssh-rsa
will put "ssh-rsa" at the front of the list of algorithms, and then
authentication at such servers with RSA keys works again.
Matthias Sohn [Tue, 9 Mar 2021 17:00:55 +0000 (18:00 +0100)]
Merge branch 'master' into stable-5.11
* master:
Manually set status of jmh dependencies
Update DEPENDENCIES report for 5.11.0
Add dependency to dash-licenses
PackFile: Add id + ext based constructors
GC: deleteOrphans: Use PackFile
PackExt: Convert to Enum
Restore preserved packs during missing object seeks
Pack: Replace extensions bitset with bitmapIdx PackFile
PackDirectory: Use PackFile to ensure we find preserved packs
GC: Use PackFile to de-dup logic
Create a PackFile class for Pack filenames
Matthias Sohn [Sun, 7 Mar 2021 17:41:05 +0000 (18:41 +0100)]
Manually set status of jmh dependencies
The following jmh dependencies were approved as works-with:
- jmh-core/1.21 has GPL-2.0 license and was approved in CQ20517
- jmh-generator-annprocess/1.21 has GPL-2.0 license and was approved in
CQ20518
Nasser Grainawi [Thu, 4 Mar 2021 21:14:43 +0000 (14:14 -0700)]
PackFile: Add id + ext based constructors
Add new constructors to PackFile to improve a common use case where
callers know the directory, id, and extension, but previously needed to
construct a valid file name (with prefix, '.', etc) to create a
PackFile. Most callers can use the variant that has id as an ObjectId,
but provide an id as String variant too.
Martin Fick [Tue, 15 Dec 2020 21:20:44 +0000 (14:20 -0700)]
Restore preserved packs during missing object seeks
Provide a recovery path for objects being referenced during the pack
pruning race. Due to the pack pruning race, it is possible for objects
to become referenced after a pack has been deemed safe to prune, but
before it actually gets pruned. If this happened previously, the newly
referenced objects would be missing and potentially result in a
corrupted ref.
Add the ability to recover from this situation when an object is missing
but happens to still be available in a pack in the "preserved"
directory. This is likely only useful when used in conjunction with the
--preserve-old-packs GC option, which prunes packs by hard-linking to
the preserved directory. If an object is missing and found in a pack in
the preserved directory, immediately recover that pack and its
associated files (idx, bitmaps...) by moving them back to the original
pack directory, and then retry the operation that would have failed due
to the missing object. This retry can now succeed and the repository
may avoid corruption. This approach should drastically reduce the
chance of a corrupt repository during pack pruning at very little extra
cost. This extra cost should only be incurred when objects are missing
and a failure would normally occur.
Change-Id: I2a704e3276b88cc892159d9bfe2455c6eec64252 Signed-off-by: Martin Fick <quic_mfick@quicinc.com> Signed-off-by: Nasser Grainawi <quic_nasserg@quicinc.com>
Nasser Grainawi [Thu, 11 Feb 2021 06:26:17 +0000 (23:26 -0700)]
Pack: Replace extensions bitset with bitmapIdx PackFile
The only extension that was ever consulted from the bitmap was the
bitmap index. We can simplify the Pack code as well as the code of
all the callers if we focus on just that usage.
Nasser Grainawi [Thu, 11 Feb 2021 06:33:43 +0000 (23:33 -0700)]
PackDirectory: Use PackFile to ensure we find preserved packs
Update scanPacksImpl and listPackDirectory (renamed to
getPackFilesByExtById) to use the new PackFile functionality to
validate file names and complete pack file sets (.pack, .idx, etc).
Most importantly, this allows a later change to rely on scanPacks() to
complete a packList that contains packs with the 'old-' prefix in their
extension.
This also eliminates duplication of logic for how to identify and
construct pack files.
Nasser Grainawi [Thu, 11 Feb 2021 05:51:05 +0000 (22:51 -0700)]
Create a PackFile class for Pack filenames
The PackFile class is intended to be a central place to do all
common pack filename manipulation and parsing to help reduce repeated
code and bugs. Use the PackFile class in the Pack class and in many
tests to ensure it works well in a variety of situations. Later changes
will expand use of PackFiles to even more areas.
Change-Id: I921b30f865759162bae46ddd2c6d669de06add4a Signed-off-by: Nasser Grainawi <quic_nasserg@quicinc.com> Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
Thomas Wolf [Mon, 1 Mar 2021 07:30:09 +0000 (08:30 +0100)]
HTTP: cookie file stores expiration in seconds
A cookie file stores the expiration in seconds since the Linux Epoch,
not in milliseconds. Correct reading and writing cookie files; with
a backwards-compatibility hack to read files that contain a millisecond
timestamp.
Add a test, and fix tests not to rely on the actual current time so
that they will also run successfully after 2030-01-01 noon.
Bug: 571574
Change-Id: If3ba68391e574520701cdee119544eedc42a1ff2 Signed-off-by: Thomas Wolf <thomas.wolf@paranor.ch>
Thomas Wolf [Fri, 29 Jan 2021 22:03:44 +0000 (23:03 +0100)]
LFS: handle invalid pointers better
Make sure that SmudgeFilter calls LfsPointer.parseLfsPointer() with
a stream that supports mark/reset, and make sure that parseLfsPointer()
resets the stream properly if it decides that the stream content is not
a LFS pointer.
Add a test.
Bug: 570758
Change-Id: I2593d67cff31b2dfdfaaa48e437331f0ed877915 Signed-off-by: Thomas Wolf <thomas.wolf@paranor.ch>
In a distributed setting, one can have multiple datacenters use
reftables for serving, while the ground truth for the Ref database is
administered centrally. In this setting, replication delays combined
with compaction can cause update-index ranges to overlap.
Such a setting is used at Google, and the JGit code already handles
this correctly (modulo a bugfix that applied in change I8f8215b99a).
Remove the restriction that was applied at FileReftableDatabase.
Matthias Sohn [Sun, 27 Dec 2020 01:11:47 +0000 (02:11 +0100)]
Fix errorprone configuration for maven-compiler-plugin with javac
See https://errorprone.info/docs/installation.
Add new profile jdk8 to enable running errorprone with javac on java 8
and java 11. Remove errorprone configuration from benchmark module,
didn't find a way to make it work and this module does not contain any
productive code.
Change-Id: I6a84195af05e6cea9e7c04ad5cd4c79742e80cb3 Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
Matthias Sohn [Wed, 24 Feb 2021 13:54:52 +0000 (14:54 +0100)]
Merge branch 'master' into stable-5.11
* master: (35 commits)
[releng] japicmp: update last release version
IgnoreNode: include path to file for invalid .gitignore patterns
FastIgnoreRule: include bad pattern in log message
init: add config option to set default for the initial branch name
init: allow specifying the initial branch name for the new repository
Fail clone if initial branch doesn't exist in remote repository
GPG: fix reading unprotected old-format secret keys
Update Orbit to S20210216215844
Add missing bazel dependency for o.e.j.gpg.bc.test
GPG: handle extended private key format
dfs: handle short copies
[GPG] Provide a factory for the BouncyCastleGpgSigner
Fix boxing warnings
GPG: compute the keygrip to find a secret key
GPG signature verification via BouncyCastle
Post commit hook failure should not cause commit failure
Allow to define additional Hook classes outside JGit
GitHook: use default charset for output and error streams
GitHook: use generic OutputStream instead of PrintStream
Update jetty to 9.4.36.v20210114
...
Thomas Wolf [Tue, 23 Feb 2021 17:10:08 +0000 (18:10 +0100)]
IgnoreNode: include path to file for invalid .gitignore patterns
Include the full file path of the .gitignore file and the line number
of the invalid pattern. Also include the pattern itself.
.gitignore files inside the repository are reported with their
repository-relative path; files outside (from git config
core.excludesFile or .git/info/exclude) are reported with their
full absolute path.
Bug: 571143
Change-Id: Ibe5969679bc22cff923c62e3ab9801d90d6d06d1 Signed-off-by: Thomas Wolf <thomas.wolf@paranor.ch>
Thomas Wolf [Tue, 23 Feb 2021 12:11:56 +0000 (13:11 +0100)]
FastIgnoreRule: include bad pattern in log message
When a .gitignore pattern cannot be parsed include the pattern in the
log message. Just reporting "not closed bracket" isn't helpful if the
user doesn't know in which pattern the problem occurred.
Even better would be to include the full path of the .gitignore file
that contained the offending pattern. This is not implemented in this
change; it may need new API and needs more thought.
Bug: 571143
Change-Id: Id5b16d9cf550544ba3ad409a02041946fa8516ab Signed-off-by: Thomas Wolf <thomas.wolf@paranor.ch>
Matthias Sohn [Mon, 25 Jan 2021 01:43:18 +0000 (02:43 +0100)]
init: add config option to set default for the initial branch name
We introduced the option --initial-branch=<branch-name> to allow
initializing a new repository with a different initial branch.
To allow users to override the initial branch name more permanently
(i.e. without having to specify the name manually for each 'git init'),
introduce the 'init.defaultBranch' option.
This option was added to git in 2.28.0.
See https://git-scm.com/docs/git-config#Documentation/git-config.txt-initdefaultBranch
Bug: 564794
Change-Id: I679b14057a54cd3d19e44460c4a5bd3a368ec848 Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
Matthias Sohn [Mon, 25 Jan 2021 00:54:03 +0000 (01:54 +0100)]
init: allow specifying the initial branch name for the new repository
Add option --initial-branch/-b to InitCommand and the CLI init command.
This is the first step to implement support for the new option
init.defaultBranch. Both were added to git in release 2.28.
See https://git-scm.com/docs/git-init#Documentation/git-init.txt--bltbranch-namegt
Bug: 564794
Change-Id: Ia383b3f90b5549db80f99b2310450a7faf6bce4c Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
Thomas Wolf [Sun, 24 Jan 2021 01:13:43 +0000 (02:13 +0100)]
GPG: handle extended private key format
Add detection for the key-value pair format that was available in
gpg-agent for some time already and that has become the default since
gpg-agent 2.2.20. If a secret key in the .gnupg/private-keys-v1.d
directory is found to have this format, extract the human-readable key
from it, convert it to the binary serialized form and hand that to
BouncyCastle.
Encrypted keys in the new format may use AES/OCB. OCB is a patent-
encumbered algorithm; although there is a license for open-source
software, that may not be good enough and OCB may not be available in
Java. It is not available in the default security provider in Java,
and it is also not available in the BouncyCastle version included in
Eclipse.
Implement AES/OCB decryption, throwing a PGPException with a nice
message if the algorithm is not available. Include a copy of the normal
s-expression parser of BouncyCastle and fix it to properly handle data
from such keys: such keys do not contain an internal hash since the
AES/OCB cipher includes and checks a MAC already.
Bug: 570501
Change-Id: Ifa6391a809a84cfc6ae7c6610af6a79204b4143b Signed-off-by: Thomas Wolf <thomas.wolf@paranor.ch>
wh [Thu, 17 Dec 2020 18:14:32 +0000 (18:14 +0000)]
dfs: handle short copies
`copy` is documented as possibly returning a smaller number of bytes
than requested. In practice, this can occur if a block is cached and the
reader never pulls in the file to check its size.
Thomas Wolf [Sun, 17 Jan 2021 15:21:28 +0000 (16:21 +0100)]
GPG: compute the keygrip to find a secret key
The gpg-agent stores secret keys in individual files in the secret
key directory private-keys-v1.d. The files have the key's keygrip
(in upper case) as name and extension ".key".
A keygrip is a SHA1 hash over the parameters of the public key. By
computing this keygrip, we can pre-compute the expected file name and
then check only that one file instead of having to iterate over all
keys stored in that directory.
This file naming scheme is actually an implementation detail of
gpg-agent. It is unlikely to change, though. The keygrip itself is
computed via libgcrypt and will remain stable according to the GPG
main author.[1]
Add an implementation for calculating the keygrip and include tests.
Do not iterate over files in BouncyCastleGpgKeyLocator but only check
the single file identified by the keygrip.
Ideally upstream BouncyCastle would provide such a getKeyGrip() method.
But as it re-builds GPG and libgcrypt internals, it's doubtful it would
be included there, and since BouncyCastle even lacks a number of curve
OIDs for ed25519/curve25519 and uses the short-Weierstrass parameters
instead of the more common Montgomery parameters, including it there
might be quite a bit of work.
Thomas Wolf [Thu, 7 Jan 2021 16:11:57 +0000 (17:11 +0100)]
GPG signature verification via BouncyCastle
Add a GpgSignatureVerifier interface, plus a factory to create
instances thereof that is provided via the ServiceLoader mechanism.
Implement the new interface for BouncyCastle. A verifier maintains
an internal LRU cache of previously found public keys to speed up
verifying multiple objects (tag or commits). Mergetags are not handled.
Provide a new VerifySignatureCommand in org.eclipse.jgit.api together
with a factory method Git.verifySignature(). The command can verify
signatures on tags or commits, and can be limited to accept only tags
or commits. Provide a new public WrongObjectTypeException thrown when
the command is limited to either tags or commits and a name resolves
to some other object kind.
In jgit.pgm, implement "git tag -v", "git log --show-signature", and
"git show --show-signature". The output is similar to command-line
gpg invoked via git, but not identical. In particular, lines are not
prefixed by "gpg:" but by "bc:".
Trust levels for public keys are read from the keys' trust packets,
not from GPG's internal trust database. A trust packet may or may
not be set. Command-line GPG produces more warning lines depending
on the trust level, warning about keys with a trust level below
"full".
There are no unit tests because JGit still doesn't have any setup to
do signing unit tests; this would require at least a faked .gpg
directory with pre-created key rings and keys, and a way to make the
BouncyCastle classes use that directory instead of the default. See
bug 547538 and also bug 544847.
Tested manually with a small test repository containing signed and
unsigned commits and tags, with signatures made with different keys
and made by command-line git using GPG 2.2.25 and by JGit using
BouncyCastle 1.65.
Bug: 547751
Change-Id: If7e34aeed6ca6636a92bf774d893d98f6d459181 Signed-off-by: Thomas Wolf <thomas.wolf@paranor.ch>
Matthias Sohn [Sun, 7 Feb 2021 22:48:57 +0000 (23:48 +0100)]
Allow to define additional Hook classes outside JGit
EGit wants to add gitflow specific hooks in org.eclipse.egit.gitflow.
Make GitHook public to allow sub-classing outside of the
org.eclipse.jgit.hooks package.
Change-Id: I439575ec901e3610b5cf9d66f7641c8324faa865 Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
Marija Savtchouk [Thu, 21 Jan 2021 15:08:57 +0000 (15:08 +0000)]
Allow dir/file conflicts in virtual base commit on recursive merge.
If RecursiveMerger finds multiple base commits, it tries to compute
the virtual ancestor to use as a base for the three way merge.
Currently, the content conflicts between ancestors are ignored (file
staged with the conflict markers). If the path is a file in one ancestor
and a dir in the other, it results in NoMergeBaseException
(CONFLICTS_DURING_MERGE_BASE_CALCULATION).
Allow these conflicts by ignoring this unmerged path in the virtual
base. The merger will compute diff in the children instead and it
can be further fixed manually if needed.
Change-Id: Id59648ae1d6bdf300b26fff513c3204317b755ab Signed-off-by: Marija Savtchouk <mariasavtchouk@google.com>
Thomas Wolf [Sun, 24 Jan 2021 00:57:09 +0000 (01:57 +0100)]
GPG: support git config gpg.program
Add it to the GpgConfig. Change GpgConfig to load the values once only.
Add a parameter to the GpgObjectSigner interface's operations to pass
in a GpgConfig. Update CommitCommand and TagCommand to pass the value
to the signer. Let the signer decide whether it can actually produce
the wanted signature type (openpgp or x509).
No behavior change. But this makes it possible to implement different
signers that might support x509 signatures, or use gpg.program and
shell out to an external GPG executable for signing.
Change-Id: I427f83eb1ece81c310e1cddd85315f6f88cc99ea Signed-off-by: Thomas Wolf <thomas.wolf@paranor.ch>
Adithya Chakilam [Sat, 30 Jan 2021 01:35:04 +0000 (19:35 -0600)]
Fix DateRevQueue tie breaks with more than 2 elements
DateRevQueue is expected to give out the commits that have higher
commit time. But in case of tie(same commit time), it should give
the commit that is inserted first. This is inferred from the
testInsertTie test case written for DateRevQueue. Also that test
case, right now uses just two commits which caused it not to fail
with the current implementation, so added another commit to make
the test more robust.
By fixing the DateRevQueue, we would also match the behaviour of
LogCommand.addRange(c1,c2) with git log c1..c2. A test case for
the same is added to show that current behaviour is not the
expected one.
By fixing addRange(), the order in which commits are applied during
a rebase is altered. Rebase logic should have never depended upon
LogCommand.addRange() since the intended order of addRange() is not
the order a rebase should use. So, modify the RebaseCommand to use
RevWalk directly with TopoNonIntermixSortGenerator.
Add a new LogCommandTest.addRangeWithMerge() test case which creates
commits in the following order:
A - B - C - M
\ /
-D-
Using git 2.30.0, git log B..M outputs: M C D
LogCommand.addRange(B, M) without this fix outputs: M D C
LogCommand.addRange(B, M) with this fix outputs: M C D
Matthias Sohn [Sun, 3 Jan 2021 01:10:58 +0000 (02:10 +0100)]
Fix FileRepository#convertToReftable which failed if no reflog existed
Deleting non-existing files when converting to reftable without backup
caused convertToReftable to fail. Observed this on a mirrored repository
which had no reflogs. Fix this by skipping missing files during
deletion.
Change-Id: I3bb913d5bfddccc6813677b873006efb849a6ebc Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
David Ostrovsky [Sat, 25 Jul 2020 08:00:11 +0000 (10:00 +0200)]
Migrate to Apache MINA sshd 2.6.0 and Orbit I20210203173513
Re-enable DSA, DSA_CERT, and RSA_CERT public key authentication.
DSA is discouraged for a long time already, but it might still be
way too disruptive to completely drop it. RSA is discouraged for
far less long, and dropping that would be really disruptive.
Adapt to the changed property handling. Remove work-arounds for
shortcomings of earlier sshd versions.
Use Orbit I20210203173513, which includes sshd 2.6.0. This also bumps
apache.httpclient to 4.5.13 and apache.httpcore to 4.4.14.
Change-Id: I2d24a1ce4cc9f616a94bb5c4bdaedbf20dc6638e Signed-off-by: David Ostrovsky <david@ostrovsky.org> Signed-off-by: Thomas Wolf <thomas.wolf@paranor.ch> Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
Thomas Wolf [Fri, 29 Jan 2021 11:12:28 +0000 (12:12 +0100)]
LFS: make pointer parsing more robust
Parsing an LFS pointer must check the input more to not run into
exceptions. LfsPoint.parseLfsPointer() is used in various places to
determine whether a blob is a LFS pointer; it is not only called with
valid LFS pointers. Tighten the validations and return null if they
fail. All callers already do check for a null return value.
Also, LfsPointer implemented Comparable but did not override equals().
This is rather unusual and actually warned against in the javadoc of
Comparable. Implement equals() and hashCode().
Add more tests.
Bug: 570744
Change-Id: I90ca264d0a250275cf1907e9dcfcee5eab80df0f Signed-off-by: Thomas Wolf <thomas.wolf@paranor.ch>
Terry Parker [Mon, 25 Jan 2021 06:18:21 +0000 (22:18 -0800)]
Move reachability checker generation into the ObjectReader object
Reachability checkers are retrieved from RevWalk and ObjectWalk objects:
* RevWalk.createReachabilityChecker()
* ObjectWalk.createObjectReachabilityChecker()
Since RevWalks and ObjectWalks are themselves directly instantiated
in hundreds of places (e.g. UploadPack...) overriding them in a
consistent way requires overloading 100s of methods, which isn't
feasible. Moving reachability checker generation to a more central
place solves that problem.
The ObjectReader object seems a good place from which to get
reachability checkers, because reachability checkers return
information about relationships between objects. ObjectDatabases
delegate many operations to ObjectReaders, and reachability bitmaps
are attached to ObjectReaders.
The Bitmapped and Pedestrian reachability checker objects were
package private in the org.eclipse.jgit.revwalk package. This change
makes them public and moves them to the
org.eclipse.jgit.internal.revwalk package. Corresponding tests are
also moved.
Motivation:
1) Reachability checking algorithms need to scale. One of the
internal Android repositories has ~2.4 million refs/changes/*
references, causing bad long tail performance in reachability
checks.
2) Reachability check performance is impacted by repository
topography: number of refs, number of objects, amounts of
related vs. unrelated history.
3) Reachability check performance is also affected by per-branch
access (Gerrit branch permissions) since different users can
see different branches.
4) Reachability check performance isn't affected by any state in a
RevWalk or ObjectWalk.
I don't yet know if a single algorithm will work for all cases in #2
and #3. We may need to evolve the ReachabilityChecker interfaces
over time to solve the Gerrit branch permissions case, or use
Gerrit-specific identity information to solve that in an efficient
way.
This change takes the existing public API and moves it to the
ObjectReader/whole repository level, which is where we can do
consistent customizations for #2 and #3. We intend to upstream the
best of whatever works, but anticipate the need for multiple rounds
of experimentation.
Change-Id: I9185feff43551fb387957c436112d5250486833d Signed-off-by: Terry Parker <tparker@google.com>
Add the "compression-level" option to all ArchiveCommand formats
Different archive formats support a compression level in the range
[0-9]. The value 0 is for lowest compressions and 9 for highest. Highest
levels produce output files of smaller sizes but require more memory to
do the compression.
This change allows passing a "compression-level" option to the git
archive command and implements using it for different file formats.
Change-Id: I5758f691c37ba630dbac24db67bb7da827bbc8e1 Signed-off-by: Youssef Elghareeb <ghareeb@google.com> Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
Jonathan Tan [Wed, 27 Jan 2021 18:36:43 +0000 (13:36 -0500)]
Merge changes I36d9b63e,I8c5db581,I2c02e89c
* changes:
Compare getting all refs except specific refs with seek and with filter
Add getsRefsByPrefixWithSkips (excluding prefixes) to ReftableDatabase
Add seekPastPrefix method to RefCursor
Gal Paikin [Mon, 7 Dec 2020 14:18:34 +0000 (15:18 +0100)]
Compare getting all refs except specific refs with seek and with filter
There are currently two ways to get all refs except a specific ref, we
add two methods that perform both and compare the two different approaches.
This change adds two methods that compares the two different approaches
of such query:
1. Get all the refs, and then filter by refs that don't start with the
prefix (current approach).
2. Get all refs until encountering a ref that is part of the prefix we
should exclude, skip using seekPastPrefix, and continue (new approach).
This works since the refs are sorted.
Specifically in Gerrit, we often have thousands of refs that are not
refs/changes, and millions of refs/changes, hence the second approach
should be much faster. In Jgit in general it's still expected to provide
a better result even if we're skipping a smaller chunk of the refs
since the complexity here is O(logn) with a binary search, rather than
O(number of skipped refs).
We ran this benchmark on a big chunk of chromium/src's reftable. To run
it, we first create the reftable:
Result:
total time the action took using seek: 36925 usec
total time the action took using filter: 874382 usec
number of refs that start with prefix: 4266.
number of refs that don't start with prefix: 1962695.
Similarly for Android's biggest repository, platform/frameworks/base
(still only partial result):
total time the action took using seek: 9020 usec
total time the action took using filter: 143166 usec
number of refs that start with prefix: 296.
number of refs that don't start with prefix: 60400.
In conclusion, it's easy to see an improvement of a factor of 15-20x for
large Gerrit repositories!
Signed-off-by: Gal Paikin <paiking@google.com>
Change-Id: I36d9b63eb259804c774864429cf2c761cd099cc3
Gal Paikin [Mon, 30 Nov 2020 14:57:06 +0000 (15:57 +0100)]
Add getsRefsByPrefixWithSkips (excluding prefixes) to ReftableDatabase
We sometimes want to get all the refs except specific prefixes,
similarly to getRefsByPrefix that gets all the refs of a specific
prefix.
We now create a new method that gets all refs matching a prefix except a
set of specific prefixes.
One use-case is for Gerrit to be able to get all the refs except
refs/changes; in Gerrit we often have lots of refs/changes, but very
little other refs. Currently, to get all the refs except refs/changes we
need to get all the refs and then filter the refs/changes, which is very
inefficient. With this method, we can simply skip the unneeded prefix so
that we don't have to go over all the elements.
RefDirectory still uses the inefficient implementation, since there
isn't a simple way to use Refcursor to achieve the efficient
implementation (as done in ReftableDatabase).
Signed-off-by: Gal Paikin <paiking@google.com>
Change-Id: I8c5db581acdeb6698e3d3a2abde8da32f70c854c