Luca Milanesio [Wed, 18 May 2022 12:31:30 +0000 (13:31 +0100)]
Shortcut during git fetch for avoiding looping through all local refs
The FetchProcess needs to verify that all the refs received point
to objects that are reachable from the local refs, which can be
very expensive but is needed to avoid missing-object exceptions
caused by broken chains.
When the local repository has a lot of refs (e.g. millions) and the
client is fetching a non-commit object (e.g. refs/sequences/changes in
Gerrit) the reachability check on all local refs can be very expensive
compared to the time to fetch the remote ref.
Example for a 2M refs repository:
- fetching a single non-commit object: 50ms
- checking the reachability of local refs: 30s
A ref pointing to a non-commit object doesn't have any parent or
successor objects and hence never needs a reachability check.
Skipping askForIsComplete() altogether saves the 30s spent in an
unnecessary phase.
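A minimal sketch of the shortcut, using hypothetical helper names
rather than the actual FetchProcess code:

  // Only walk the local refs if at least one fetched object can have
  // parents; blobs (e.g. refs/sequences/changes) never reference other
  // objects, so no broken chain is possible for them.
  boolean needsCheck = fetchedRefs.stream()
      .map(Ref::getObjectId)
      .anyMatch(id -> isCommitOrAnnotatedTag(id)); // hypothetical check
  if (needsCheck) {
      askForIsComplete(); // the expensive walk over all local refs
  }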
Matthias Sohn [Tue, 4 Oct 2022 13:42:25 +0000 (15:42 +0200)]
FetchCommand: fix fetchSubmodules to work on a Ref to a blob
FetchCommand#fetchSubmodules assumed that FETCH_HEAD can always be
parsed as a tree. This isn't true if it refers to a Ref pointing to a
blob. This is the case e.g. in Gerrit for refs like
refs/sequences/changes, which are used to implement sequences stored
in git.
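A hedged sketch of the kind of guard this implies (illustrative, not
the exact FetchCommand code):

  // Only walk FETCH_HEAD as a tree when it actually resolves to a commit
  // or tree; a FETCH_HEAD pointing at a blob cannot carry .gitmodules.
  try (RevWalk walk = new RevWalk(repo)) {
      RevObject obj = walk.parseAny(fetchHeadId);
      if (obj.getType() == Constants.OBJ_COMMIT
          || obj.getType() == Constants.OBJ_TREE) {
          // ... parse .gitmodules and fetch the configured submodules
      }
      // otherwise (e.g. OBJ_BLOB) skip submodule handling
  }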
Luca Milanesio [Tue, 20 Dec 2022 21:50:19 +0000 (21:50 +0000)]
Allow the exclusions of refs prefixes from bitmap
When running a GC.repack() against a repository with over one
thousand refs/heads and tens of millions of ObjectIds,
calculating bitmaps for all the refs would result in an
unreasonably big file that could take up to several hours to
compute.
Test scenario: repo with 2500 heads / 10M objects, Intel Xeon E5-2680 2.5GHz
Before this change: 20 mins
After this change and 2300 heads excluded: 10 mins (90s for bitmap)
Having such a large bitmap file is also slow at runtime and has
negligible or even negative benefits, because the time lost reading
and decompressing the bitmap in memory is not compensated by the
time saved by using it.
It is key to preserve the bitmaps for those refs that are mostly
used in clone/fetch and to give the ability to exclude some ref
prefixes that are known to be less frequently accessed, even
though they may actually be actively written.
Example: Gerrit sandbox branches may even be actively
used and selected automatically because their commits are very
recent; however, they may bloat the bitmap, making it ineffective.
A mono-repo with tens of thousands of developers may have
a relatively small number of active branches from which
CI/CD jobs are continuously fetching/cloning the code. However,
because Gerrit allows the use of sandbox branches, the
total number of refs/heads may reach tens or even hundreds of
thousands.
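For illustration, a sketch of how the exclusion might be configured
programmatically; the PackConfig setter name used here is an
assumption based on this description:

  PackConfig cfg = new PackConfig(repo);
  // Assumed setter: keep bitmaps for frequently-fetched refs while
  // excluding rarely-cloned prefixes such as Gerrit sandbox branches.
  cfg.setBitmapExcludedRefsPrefixes(new String[] { "refs/heads/sandbox" });
  GC gc = new GC(repo);
  gc.setPackConfig(cfg);
  gc.repack();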
Luca Milanesio [Wed, 28 Dec 2022 01:09:52 +0000 (01:09 +0000)]
PackWriterBitmapPreparer: do not include annotated tags in bitmap
Annotated tags should be excluded from the bitmap associated
with the heads-only packfile. However, this was not happening
because the exclusion check looked up the peeled object instead
of the objectId of the tag itself.
Consider a commit graph where the heads reach only commit0 and
commit1 while annotated-tag1 points to commit2. Before this change
all the commits were included (3 bitmaps), which is incorrect,
because commits reachable only from annotated tags should not be
included.
The heads-only bitmap should include only commit0 and commit1,
but because PackWriterBitmapPreparer checked whether the peeled
pointer of tag1 (commit2) was in the list of tags to exclude,
which contained annotated-tag1 instead, commit2 was included even
though it was not reachable from the heads.
Add an additional check on the original objectId so that annotated
tags and the commits they point to can be excluded. Also add a
specific test with an annotated tag to make sure this use-case is
covered.
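A simplified sketch of the corrected exclusion check (names are
illustrative, not the exact PackWriterBitmapPreparer code):

  // Exclude the commit if either the tag's own objectId or its peeled
  // objectId is in the set of objects excluded from the heads-only bitmap.
  ObjectId tagId = tagRef.getObjectId();
  ObjectId peeledId = tagRef.getPeeledObjectId();
  boolean excluded = excludedObjects.contains(tagId)
      || (peeledId != null && excludedObjects.contains(peeledId));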
Example repository benchmark for measuring the improvement:
# refs: 400k (2k heads, 88k tags, 310k changes)
# objects: 11M (88k of them are annotated tags)
# packfiles: 2.7G
Before this change:
GC time: 5h
clone --bare time: 7 mins
After this change:
GC time: 20 mins
clone --bare time: 3 mins
Matthias Sohn [Wed, 18 Jan 2023 16:39:19 +0000 (17:39 +0100)]
Speedup GC listing objects referenced from reflogs
GC needs to get a ReflogReader for all existing refs to list all objects
referenced from reflogs. The existing Repository#getReflogReader method
accepts the ref name and then resolves the Ref to create a ReflogReader.
GC already gets all Refs in bulk and then calls getReflogReader for
each of them, so re-resolving every Ref by name for a huge number of
Refs is very slow.
Fix this by adding another getReflogReader method to Repository which
accepts a Ref directly.
This speeds up running JGit gc on a mirror clone of the Gerrit
repository from 15:36 min to 1:08 min. The repository used in this test
had 45k refs, 275k commits and 1.2m git objects.
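In GC this roughly enables the following pattern (a sketch assuming
the new Repository#getReflogReader(Ref) overload):

  // Resolve all refs once, then read each reflog without re-resolving
  // the ref by name.
  for (Ref ref : repo.getRefDatabase().getRefs()) {
      ReflogReader reflog = repo.getReflogReader(ref); // new overload
      if (reflog == null) {
          continue;
      }
      for (ReflogEntry entry : reflog.getReverseEntries()) {
          markObjectReferenced(entry.getOldId()); // hypothetical helper
          markObjectReferenced(entry.getNewId());
      }
  }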
Matthias Sohn [Tue, 4 Oct 2022 13:45:01 +0000 (15:45 +0200)]
Fix running JMH benchmarks
Without this I sometimes hit the error:
$ java -jar target/benchmarks.jar
Exception in thread "main" java.lang.RuntimeException: ERROR: Unable to
find the resource: /META-INF/BenchmarkList
at org.openjdk.jmh.runner.AbstractResourceReader.getReaders(AbstractResourceReader.java:98)
at org.openjdk.jmh.runner.BenchmarkList.find(BenchmarkList.java:124)
at org.openjdk.jmh.runner.Runner.internalRun(Runner.java:253)
at org.openjdk.jmh.runner.Runner.run(Runner.java:209)
at org.openjdk.jmh.Main.main(Main.java:71)
Matthias Sohn [Fri, 11 Nov 2022 16:54:06 +0000 (17:54 +0100)]
Add option to allow using JDK's SHA1 implementation
The change If6da9833 moved the computation of SHA1 from the JVM's
JCE to a pure Java implementation with collision detection.
The extra security for public sites comes at the cost of slower
SHA1 processing compared to the native implementation in the JDK.
When JGit is used internally and not exposed to any traffic from
external or untrusted users, the extra cost of the pure Java SHA1
implementation can be avoided, falling back to the previous
native MessageDigest implementation.
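Conceptually the fallback amounts to something like the following;
the selection flag and wrapper shown here are illustrative, not the
actual option wiring:

  // Use the JDK's MessageDigest when collision detection is not needed,
  // otherwise use the pure-Java SHA1 with collision detection.
  static MessageDigest newSha1(boolean useJdkImplementation)
      throws NoSuchAlgorithmException {
      if (useJdkImplementation) {
          return MessageDigest.getInstance("SHA-1"); // JDK/JCE implementation
      }
      return new CollisionDetectingSha1Digest(); // hypothetical wrapper around JGit's SHA1
  }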
Matthias Sohn [Thu, 27 Oct 2022 18:31:31 +0000 (20:31 +0200)]
Ignore IllegalStateException if JVM is already shutting down
Trying to register/unregister a shutdown hook when the JVM is already in
shutdown throws an IllegalStateException. Ignore this exception since we
can't do anything about it.
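The guard boils down to something like:

  try {
      Runtime.getRuntime().addShutdownHook(hook);
  } catch (IllegalStateException e) {
      // JVM is already shutting down; nothing useful can be done here.
  }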
Matthias Sohn [Wed, 29 Jun 2022 12:58:17 +0000 (14:58 +0200)]
UploadPack: don't prematurely terminate timer in case of error
In uploadWithExceptionPropagation don't prematurely terminate timer in
case of error to enable reporting it to the client. Expose a close
method so that callers can terminate it at the appropriate time.
If the timer was already terminated when trying to report the error
to the client, this failed with java.lang.IllegalStateException:
"Timer already terminated".
Do not create reflog for remote tracking branches during clone
When using JGit on a non-bare repository, the CloneCommand
previously created local reflogs for all branches, including remote
tracking ones, generating a potentially large number of files on the
local filesystem.
The creation of the remote-tracking branches (refs/remotes/*) during
clone is not an issue for the local filesystem because all of them are
stored in a single packed-refs file. However, the creation of a large
number of reflogs on a local filesystem IS an issue because the
filesystem may not be tuned or initialised, in terms of inodes, to
contain a very large number of files.
When a user (or a CI system) performs the CloneCommand against
a potentially large repository (e.g., millions of branches), it is
interested in working on or validating a single branch or tag and is
unlikely to work with all the remote-tracking branches.
The eager creation of reflogs for all the remote-tracking branches is
not just a performance issue but may also compromise the ability to
use JGit for cloning a large repository.
The behaviour implemented in this change is also consistent with the
optimisation done in the C code-base [1].
We differentiate between the clone and fetch commands using the
--branch <initialBranch> option, which is only available for the
clone command and defaults to HEAD.
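A hedged sketch of the effect when writing remote-tracking refs
during clone; RefUpdate#disableRefLog exists in JGit, but its use
here is illustrative rather than the exact CloneCommand code:

  RefUpdate u = repo.updateRef("refs/remotes/origin/some-branch");
  u.setNewObjectId(remoteTip);
  u.disableRefLog(); // don't create a per-branch reflog file during clone
  u.update();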
Luca Milanesio [Thu, 19 May 2022 10:12:28 +0000 (11:12 +0100)]
UploadPack: do not check reachability of visible SHA1s
When JGit serves a Git client requesting SHA1s
during the want phase, it needs to make a full reachability
check from the advertised refs to the requested ones, to
keep all objects within the scope of confidentiality
allowed by the advertised refs.
The check is also performed when the SHA1 corresponds to
one of the tips of the advertised refs, which is a waste of
resources.
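A simplified sketch of the shortcut (illustrative names, not the
exact UploadPack code):

  // Wants that are tips of advertised refs are visible by definition;
  // only the remaining wants need the expensive reachability walk.
  Set<ObjectId> advertisedTips = new HashSet<>();
  for (Ref r : advertisedRefs) {
      advertisedTips.add(r.getObjectId());
  }
  List<ObjectId> toCheck = new ArrayList<>();
  for (ObjectId want : wantIds) {
      if (!advertisedTips.contains(want)) {
          toCheck.add(want);
      }
  }
  if (!toCheck.isEmpty()) {
      checkReachabilityOfWants(toCheck); // hypothetical expensive check
  }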
eric.steele [Wed, 1 Jun 2022 01:03:17 +0000 (18:03 -0700)]
AmazonS3: Add support for AWS API signature version 4
Updating the AmazonS3 class to support AWS Signature version 4 because
version 2 is no longer supported in all AWS regions. The version can be
selected with the new 'aws.api.signature.version' property (defaults to
2 for backwards compatibility). When set to '4', the user must also
specify the AWS region via the 'region' property. The 'region' property
must match the region that the 'domain' property resolves to.
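For example, connection properties along these lines could be used;
keys other than the ones named above ('aws.api.signature.version',
'region', 'domain') are assumptions about the existing AmazonS3
properties format:

  Properties props = new Properties();
  props.setProperty("accesskey", "AKIA...");            // assumed existing key
  props.setProperty("secretkey", "...");                // assumed existing key
  props.setProperty("domain", "s3.eu-central-1.amazonaws.com");
  props.setProperty("aws.api.signature.version", "4");
  props.setProperty("region", "eu-central-1");          // must match the domain's region
  AmazonS3 s3 = new AmazonS3(props);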
Saša Živkov [Fri, 3 Jun 2022 14:36:43 +0000 (16:36 +0200)]
Fix connection leak for smart http connections
SmartHttpPushConnection: close InputStream and OutputStream after
processing. Wrap IOExceptions which aren't TransportExceptions already
as a TransportException.
Also-By: Matthias Sohn <matthias.sohn@sap.com>
Change-Id: I8e11d899672fc470c390a455dc86367e92ef9076
Remove stray files (probes or lock files) created by background threads
NOTE: port back from master branch.
On process exit, it was possible that the filesystem timestamp
resolution measurement left behind .probe files or even a lock file
for the jgit.config.
Ensure the SAVE_RUNNER is shut down when the process exits (via
System.exit() or otherwise). Move lf.lock() into the try-finally
block when saving the config file.
Delete .probe files on JVM shutdown -- they are created in daemon
threads that may terminate abruptly, not executing the "finally"
clause that normally removes these files.
BasePackConnection::readAdvertisedRefsImpl was creating an exception by
calling `noRepository`, and then blindly calling `initCause` on it. As
`noRepository` can be overridden, it's not guaranteed to be missing a
cause.
BasePackPushConnection overrides `noRepository` and initiates a fetch,
which may throw a `NoRemoteRepositoryException` with a cause.
In this case calling `initCause` threw an `IllegalStateException`.
In order to throw the correct exception, we now return the
BasePackPushConnection exception and suppress the one thrown by
BasePackConnection.
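The safer pattern is roughly (illustrative names):

  // noRepository() may be overridden (e.g. by BasePackPushConnection) and
  // return an exception that already has a cause, so initCause() could
  // throw IllegalStateException. Keep the overriding exception and attach
  // the base connection's failure as a suppressed exception instead.
  TransportException err = noRepository();
  err.addSuppressed(baseFailure); // hypothetical variable holding the other exception
  throw err;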
Marcin Czech [Wed, 22 Dec 2021 16:42:36 +0000 (17:42 +0100)]
UploadPack v2 protocol: Stop negotiation for orphan refs
The fetch of a single orphan ref (for example the Gerrit meta ref
refs/changes/21/21/meta) did not stop the negotiation, so the client
had to advertise all refs. This impacts fetch performance
on repositories with a large number of refs (for example, on a
Gerrit repository it takes 20 seconds to fetch a meta ref
compared to 1.2 seconds to fetch a ref with a parent).
To avoid this issue UploadPack, used on the server side,
now checks whether the `want` refs have parents; if they
don't, the client doesn't need any extra objects, so the
server responds with `ready` and finishes the
negotiation phase.
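A simplified server-side sketch of this check (illustrative, not the
exact UploadPack code):

  // If none of the wanted objects has a parent (e.g. a single orphan
  // meta ref), negotiation cannot find useful common objects: answer
  // `ready` and end the negotiation phase.
  boolean anyWantHasParents = false;
  try (RevWalk walk = new RevWalk(repo)) {
      for (ObjectId want : wantIds) {
          RevObject o = walk.parseAny(want);
          if (o instanceof RevCommit && ((RevCommit) o).getParentCount() > 0) {
              anyWantHasParents = true;
              break;
          }
      }
  }
  if (!anyWantHasParents) {
      // send "ready" to the client and stop negotiating
  }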
Matthias Sohn [Thu, 16 Dec 2021 17:42:45 +0000 (18:42 +0100)]
Use slf4j-simple instead of log4j for logging
JGit uses slf4j-api as logging API.
The libraries
- org.eclipse.jgit.http.test
- org.eclipse.jgit.pgm
- org.eclipse.jgit.ssh.apache.test
- org.eclipse.jgit.test
used the outdated log4j 1.2.15, which has been EOL for years.
Since neither the jgit command line nor the tests need sophisticated
logging features, replace log4j with the much simpler slf4j-simple
log implementation. The org.slf4j.binding.simple 1.7.30 archive is
only 25kB instead of 429kB for log4j 1.2.15.
Applications using jgit are free to choose any other log implementation
supporting slf4j API.
Matthias Sohn [Mon, 13 Dec 2021 23:07:10 +0000 (00:07 +0100)]
Update orbit to R20211213173813
and update
- com.google.gson to 2.8.8.v20211029-0838
- javaewah to 1.1.13.v20211029-0839
- net.i2p.crypto.eddsa to 0.3.0.v20210923-1401
- org.apache.ant to 1.10.12.v20211102-1452
- org.apache.commons.compress to 1.21.0.v20211103-2100
- org.bouncycastle.bcprov to 1.69.0.v20210923-1401
- org.junit to 4.13.2.v20211018-1956
Matthias Sohn [Fri, 3 Dec 2021 21:45:05 +0000 (22:45 +0100)]
Update maven plugins
- build-helper-maven-plugin to 3.2.0
- jacoco-maven-plugin to 0.8.7
- maven-antrun-plugin to 3.0.0
- maven-dependency-plugin to 3.2.0
- maven-enforcer-plugin to 3.0.0
- maven-jar-plugin to 3.2.0
- maven-javadoc-plugin to 3.3.1
- maven-jxr-plugin to 3.1.1
- maven-pmd-plugin to 3.15.0
- maven-project-info-reports-plugin to 3.1.2
- maven-resources-plugin to 3.2.0
- maven-shade-plugin to 3.2.4
- maven-site-plugin to 3.9.1
- maven-source-plugin to 3.2.1
- maven-surefire-plugin to 3.0.0-M5
- spotbugs-maven-plugin to 4.3.0
- tycho and tycho-extras to 1.7.0
RefDirectory.scanRef: Re-use file existence check done in snapshot creation
Return immediately in scanRef if the loose ref was identified as
missing when a snapshot was attempted for the ref. This will help
performance of scanRef when the ref is packed but has a corresponding
empty dir in 'refs/'.
For example, consider the case where we create 50k sharded refs in
a new namespace called 'new-refs' using an atomic 'BatchRefUpdate'.
The refs are named like 'refs/new-refs/01/1/1', 'refs/new-refs/01/1/2',
'refs/new-refs/01/1/3' and so on. After the refs are created, the
'new-refs' namespace looks like below:
$ find refs/new-refs -type f | wc -l
0
$ find refs/new-refs -type d | wc -l
5101
At this point, an 'exactRef' call on each of the 50k refs without
this change takes ~2.5s, whereas with this change it takes ~1.5s.
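Conceptually the early return looks like this (a sketch; the actual
RefDirectory code differs):

  // The snapshot taken while scanning already knows the loose ref file is
  // absent; skip re-reading it so the caller falls back to packed-refs.
  if (snapshot == FileSnapshot.MISSING_FILE) {
      return null; // not a loose ref
  }
  // ... otherwise read and parse the loose ref file as before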
FileSnapshot: Lazy load file store attributes cache
Doing a getFileStoreAttributes call even when the file doesn't
exist is unnecessary. This call is particularly slow on some
filesystems. Instead, do it only when the file exists and load
the appropriate cache.
This update can help speed up RefDirectory.exactRef when the ref
is packed, but has a corresponding empty dir for it under 'refs/'.
This scenario can happen when an atomic 'BatchRefUpdate' creates
new sharded refs.
For example, consider the case where we create 50k sharded refs in
a new namespace called 'new-refs' using an atomic 'BatchRefUpdate'.
The refs are named like 'refs/new-refs/01/1/1', 'refs/new-refs/01/1/2',
'refs/new-refs/01/1/3' and so on. After the refs are created, the
'new-refs' namespace looks like below:
$ find refs/new-refs -type f | wc -l
0
$ find refs/new-refs -type d | wc -l
5101
At this point, an 'exactRef' call on each of the 50k refs without
this change takes ~30s, whereas with this change it takes ~2.5s.
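The change amounts to something like the following sketch; the exact
accessor for the file store attributes cache is an assumption:

  // Only query (and cache) file store attributes when the file exists;
  // the lookup is slow on some filesystems and pointless for a missing
  // loose ref file.
  if (Files.exists(path)) {
      FS.FileStoreAttributes attrs = FS.getFileStoreAttributes(path); // assumed accessor
      // ... use attrs for the snapshot's timestamp resolution
  }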
Thomas Wolf [Sun, 28 Nov 2021 10:42:20 +0000 (11:42 +0100)]
FS: debug logging only if system config file cannot be found
The command 'git config --system --show-origin --list -z' fails if
the system config doesn't exist. Use debug logging instead of a
warning for failures of that command. Typically the user cannot do
anything about it anyway, and JGit will just work without system
config.
Bug: 577492
Change-Id: If628ab376182183aea57a385c169e144d371bbb2
Signed-off-by: Thomas Wolf <thomas.wolf@paranor.ch>
Thomas Wolf [Tue, 19 Oct 2021 07:07:14 +0000 (09:07 +0200)]
Set JSch global config values only if not set already
Bug: 576604
Change-Id: I04415f07bf2023bc8435c097d1d3c65221d563f1
Signed-off-by: Thomas Wolf <thomas.wolf@paranor.ch>
(cherry picked from commit f8b0c00e6a8f5628babff6dd37254a21589b6e44)
Thomas Wolf [Fri, 19 Nov 2021 18:15:46 +0000 (19:15 +0100)]
Better git system config finding
We've used "GIT_EDITOR=edit git config --system --edit" to determine
the location of the git system config for a long time. But git 2.34.0
always expects this command to have a tty, but there isn't one when
called from Java. If there isn't one, the Java process may get a
SIGTTOU from the child process and hangs.
Arguably it's a bug in C git 2.34.0 to unconditionally assume there
was a tty. But JGit needs a fix *now*, otherwise any application using
JGit will lock up if git 2.34.0 is installed on the machine.
Therefore, use a different approach if the C git found is 2.8.0 or
newer: parse the output of
git config --system --show-origin --list -z
"--show-origin" exists since git 2.8.0; it prefixes the values with
the file name of the config file they come from, which is the system
config file for this command. (This works even if the first item in
the system config is an include.)
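A hedged sketch of parsing that output (the exact delimiter handling
is simplified and assumed, and the helper running git is
hypothetical):

  // Each entry of the -z output is prefixed with its origin, e.g.
  // "file:/etc/gitconfig"; the first such origin names the system config.
  String out = readGitOutput("config", "--system", "--show-origin",
      "--list", "-z"); // hypothetical helper running the C git binary
  int start = out.indexOf("file:");
  if (start >= 0) {
      int end = out.indexOf('\t', start);          // assumed separator
      if (end < 0) {
          end = out.indexOf('\0', start);
      }
      if (end > start) {
          return new File(out.substring(start + "file:".length(), end));
      }
  }
  return null; // no system config found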
Bug: 577358
Change-Id: I3ef170ed3f488f63c3501468303119319b03575d
Signed-off-by: Thomas Wolf <thomas.wolf@paranor.ch>
(cherry picked from commit 9446e62733da5005be1d5182f0dce759a3052d4a)