searchForReuse might impact performance in large repositories
The search for reuse phase for *all* the objects scans *all*
the packfiles, looking for the best candidate to serve back to the
client.
This can lead to an expensive operation when the number of
packfiles and objects is high.
Add parameter "pack.searchForReuseTimeout" to limit the time spent
on this search.
Change-Id: I54f5cddb6796fdc93ad9585c2ab4b44854fa6c48
Retry loose object read upon "Stale file handle" exception
When reading loose objects over NFS it is possible that the OS syscall
would fail with ESTALE errors: This happens when the open file
descriptor no longer refers to a valid file.
Notoriously it is possible to hit this scenario when git data is shared
among multiple clients, for example by multiple gerrit instances in HA.
If one of the two clients performs a GC operation that would cause the
packing and then the pruning of loose objects, the other client might
still hold a reference to those objects, which would cause an exception
to bubble up the stack.
The Linux NFS FAQ[1] (at point A.10), suggests that the proper way to
handle such ESTALE scenarios is to:
"[...] close the file or directory where the error occurred, and reopen
it so the NFS client can resolve the pathname again and retrieve the new
file handle."
In case of a stale file handle exception, we now attempt to read the
loose object again (up to 5 times), until we either succeed or encounter
a FileNotFoundException, in which case the search can continue to
Packfiles and alternates.
The limit of 5 provides an arbitrary upper bounds that is consistent to
the one chosen when handling stale file handles for packed-refs
files (see [2] for context).
[1] http://nfs.sourceforge.net/
[2] https://git.eclipse.org/r/c/jgit/jgit/+/54350
Bug: 573791
Change-Id: I9950002f772bbd8afeb9c6108391923be9d0ef51
Fix PathSuffixFilter: can decide only on full paths
On a subtree, a PathSuffixFilter must return -1 ("indeterminate"),
not 0 ("include"), otherwise negation goes wrong: an indeterminate
result (-1) is passed on, but a decision (0/1) is inverted.
As a result a negated PathSuffixFilter would skip all folders.
Bug: 574253
Change-Id: I27fe785c0d772392a5b5efe0a7b1c9cafcb6e566
Signed-off-by: Thomas Wolf <thomas.wolf@paranor.ch>
Teach independent negotiation (no pack file) using an option "wait-for-done"
From Git commit 9c1e657a8f:
Currently, the packfile negotiation step within a Git fetch cannot be
done independent of sending the packfile, even though there is at
least one application wherein this is useful - push negotiation.
Therefore, make it possible for this negotiation step to be done
independently.
This feature is for protocol v2 only.
In the protocol, the main hindrance towards independent negotiation is
that the server can unilaterally decide to send the packfile. This is
solved by a "wait-for-done" argument: the server will then wait for
the client to say "done". In practice, the client will never say it;
instead it will cease requests once it is satisfied.
Advertising the server capability option "wait-for-done" is behind the
transport config: uploadpack.advertisewaitfordone, which by default is
false.
Change-Id: I5ebd3e99ad76b8943597216e23ced2ed38eb5224
This is similar to change Idbc2c29bd that skipped detecting content
renames for large files. With this change, we added a new option in
RenameDetector called "skipContentRenamesForBinaryFiles", that when set,
causes binary files with any slight modification to be identified as
added/deleted. The default for this boolean is false, so preserving
current behaviour.
Change-Id: I4770b1f69c60b1037025ddd0940ba86df6047299
RepoCommand: Do not set 'branch' if the revision is a tag
The "branch" field in the .gitmodules is the signal for gerrit to keep
the superproject autoupdated. Tags are immutable and there is no need to
track them, plus the cgit client requires the field to be a "remote
branch name" but not a tag.
Do not set the "branch" field if the revision is a tag. Keep those tags
in another field ("ref") as they help other tools to find the commit in
the destination repository.
We can still have false negatives when a refname is not fully qualified,
but this check covers e.g. the most common case in android.
Note that the javadoc of #setRecordRemoteBranch already mentions that
"submodules that request a tag will not have branch name recorded".
Change-Id: Ib1c321a4d3b7f8d51ca2ea204f72dc0cfed50c37
Signed-off-by: Ivan Frade <ifrade@google.com>
Check the last line of the last hunk of a file, not the last line of
the whole patch.
Note that C git only checks that this line starts with "\ " and is at
least 12 characters long because of possible different texts when non-
English messages are used.
Change-Id: I0db81699eb3e99ed7b536a3e2b8dc97df1f58a89
Signed-off-by: Thomas Wolf <thomas.wolf@paranor.ch>
ApplyCommand: handle completely empty context lines in text patches
C git treats completely empty lines as empty context lines (which
traditionally have a single blank). Apparently newer GNU diff may
produce such lines; see [1]. ("Newer" meaning "since 2006"...)
[1] https://github.com/git/git/commit/b507b465f7831
Change-Id: I80c1f030edb17a46289b1dabf11a2648d2660d38
Signed-off-by: Thomas Wolf <thomas.wolf@paranor.ch>
ApplyCommand: use byte arrays for text patches, not strings
Instead of converting the patch bytes to strings apply the patch on
byte level, like C git does. Converting the input lines and the hunk
lines from bytes to strings and then applying the patch based on
strings may give surprising results if a patch converts a text file
from one encoding to another. Moreover, in the end we don't know which
encoding to use to write the result.
Previous code just wrote the result as UTF-8, which forcibly changed
the encoding if the original input had some other encoding (even if the
patch had the same non-UTF-8 encoding). It was also wrong if the input
was UTF-8, and the patch should have changed the encoding to something
else.
So use ByteBuffers instead of Strings. This has the additional advantage
that all these ByteBuffers can share the underlying byte arrays of the
input and of the patch, so it also reduces memory consumption.
Change-Id: I450975f2ba0e7d0bec8973e3113cc2e7aea187ee
Signed-off-by: Thomas Wolf <thomas.wolf@paranor.ch>
Implement applying binary patches. Handles both literal and delta
patches. Note that C git also runs binary files through the clean
and smudge filters. Implement the same safeguards against corrupted
patches as in C git: require the full OIDs to be present in the patch
file, and apply a binary patch only if both pre- and post-image hashes
match.
Add tests for applying literal and delta patches.
Bug: 371725
Change-Id: I71dc214fe4145d7cc8e4769384fb78c7d0d6c220
Signed-off-by: Thomas Wolf <thomas.wolf@paranor.ch>
Add a new BinaryDeltaInputStream that applies a delta provided by
another InputStream to a given base. Because delta application needs
random access to the base, the base itself cannot be yet another
InputStream. But at least this enables streaming of the result.
Add a simple test using delta hunks generated by C git.
Bug: 371725
Change-Id: Ibd26fa2f49860737ad5c5387f7f4870d3e85e628
Signed-off-by: Thomas Wolf <thomas.wolf@paranor.ch>
Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
ApplyCommand: add streams to read/write binary patch hunks
Add streams that can encode or decode git binary patch data on the fly.
Git writes binary patches base-85 encoded, at most 52 un-encoded bytes,
with the unencoded data length prefixed in a one-character encoding, and
suffixed with a newline character.
Add a test for both the new input and the output stream. The test
roundtrips binary data of different lengths in different ways.
Bug: 371725
Change-Id: Ic3faebaa4637520f5448b3d1acd78d5aaab3907a
Signed-off-by: Thomas Wolf <thomas.wolf@paranor.ch>
Add an implementation for base-85 encoding and decoding [1]. Git binary
patches use this format.
Base-85 encoding assembles bytes as 32-bit MSB values, then converts
these values to base-85 numbers (always 5 bytes) encoded as printable
ASCII characters. Decoding base-85 is the reverse operation. Note
that decoding may overflow on invalid input as 85^5 > 2^32. Encodings
always have a length that is a multiple of 5. If input length is not
divisible by 4, padding bytes are (logically) added, which are ignored
when decoding. The encoding for n bytes has thus always exactly length
(n + 3) / 4 * 5 in integer arithmetic (truncating division).
Includes tests.
[1] https://datatracker.ietf.org/doc/html/rfc1924
Bug: 371725
Change-Id: Ib5b9a503cd62cf70e080a4fb38c8cd1eeeaebcfe
Signed-off-by: Thomas Wolf <thomas.wolf@paranor.ch>
Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
Update tests to record the number of events fired post-setup and only
assert for events fired during BatchRefUpdate.execute. For tests which
use writeLooseRef to setup refs, create new tests which assert the
number of RefsChangedEvent(s) rather than updating the existing ones
to call RefDirectory.exactRef as it changes the code path.
Change-Id: I0187811628d179d9c7e874c9bb8a7ddb44dd9df4
Signed-off-by: Kaushik Lingarkar <quic_kaushikl@quicinc.com>
ApplyCommand: convert to git internal format before applying patch
Applying a patch on Windows failed if the patch had the (normal)
single-LF line endings, but the file on disk had the usual Windows
CR-LF line endings.
Git (and JGit) compute diffs on the git-internal blob, i.e., after
CR-LF transformation and clean filtering. Applying patches to files
directly is thus incorrect and may fail if CR-LF settings don't
match, or if clean/smudge filtering is involved.
Change ApplyCommand to run the file content through the check-in
filters before applying the patch, and run the result through the
check-out filters. This makes patch application succeed even if the
patch has single-LFs, but the file has CR-LF and core.autocrlf is
true.
Add tests for various combinations of line endings in the file and in
the patch, and a test to verify the clean/smudge handling.
See also [1].
Running the file though clean/smudge may give strange results with
LFS-managed files. JGit's DiffFormatter has some extra code and
applies the smudge filter again after having run the file through
the check-in filters (CR-LF and clean). So JGit can actually produce
a diff on LFS-managed files using the normal diff machinery. (If it
doesn't run out of memory, that is. After all, LFS is intended for
_large_ files.) How such a diff would be applied with either C git
or JGit is entirely unclear; neither has any code for this special
case. Compare also [2].
Note that C git just doesn't know about LFS and always diffs after
the check-in filter chain, so for LFS files, it'll produce a diff
of the LFS pointers.
[1] https://github.com/git/git/commit/c24f3abac
[2] https://github.com/git-lfs/git-lfs/issues/440
Bug: 571585
Change-Id: I8f71ff26313b5773ff1da612b0938ad2f18751f5
Signed-off-by: Thomas Wolf <thomas.wolf@paranor.ch>
Don't create the stream eagerly in lock(); that may cause JGit to
exceed OS or JVM limits on open file descriptors if many locks need
to be created, for instance when creating many refs. Instead create
the output stream only when one really needs to write something.
Bug: 573328
Change-Id: If9441ed40494d46f594a896d34a5c4f56f91ebf4
Signed-off-by: Thomas Wolf <thomas.wolf@paranor.ch>
Don't create the stream eagerly in lock(); that may cause JGit to
exceed OS or JVM limits on open file descriptors if many locks need
to be created, for instance when creating many refs. Instead create
the output stream only when one really needs to write something.
Bug: 573328
Change-Id: If9441ed40494d46f594a896d34a5c4f56f91ebf4
Signed-off-by: Thomas Wolf <thomas.wolf@paranor.ch>
Git has different conflict resolution strategies:
* There is a tree merge strategy "ours" which just ignores any changes
from theirs ("-s ours"). JGit also has the mirror strategy "theirs"
ignoring any changes from "ours". (This doesn't exist in C git.)
Adapt StashApplyCommand and CherrypickCommand to be able to use those
tree merge strategies.
* For the resolve/recursive tree merge strategies, there are content
conflict resolution strategies "ours" and "theirs", which resolve
any conflict hunks by taking the "ours" or "theirs" hunk. In C git
those correspond to "-Xours" or -Xtheirs". Implement that in
MergeAlgorithm, and add API to set and pass through such a strategy
for resolving content conflicts.
* The "ours/theirs" content conflict resolution strategies also apply
for binary files. Handle these cases in ResolveMerger.
Note that the content conflict resolution strategies ("-X ours/theirs")
do _not_ apply to modify/delete or delete/modify conflicts. Such
conflicts are always reported as conflicts by C git. They do apply,
however, if one side completely clears a file's content.
Bug: 501111
Change-Id: I2c9c170c61c440a2ab9c387991e7a0c3ab960e07
Signed-off-by: Thomas Wolf <thomas.wolf@paranor.ch>
Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
Allow file mode conflicts in virtual base commit on recursive merge.
Similar to https://git.eclipse.org/r/c/jgit/jgit/+/175166, ignore
path that have conflicts on attributes, so that the virtual base could
be used by RecursiveMerger.
Change-Id: I99c95445a305558d55bbb9c9e97446caaf61c154
Signed-off-by: Marija Savtchouk <mariasavtchouk@google.com>
In cases where we need to determine if a given commit is merged
into many refs, using isMergedInto(base, tip) for each ref would
cause multiple unwanted walks.
getMergedInto() marks the unreachable commits as uninteresting
which would then avoid walking that same path again.
Using the same api, also introduce isMergedIntoAny() and
isMergedIntoAll()
Change-Id: I65de9873dce67af9c415d1d236bf52d31b67e8fe
Signed-off-by: Adithya Chakilam <quic_achakila@quicinc.com>
Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
There are two code paths for detecting renames: one on tree diffs
(using DiffFormatter#scan) and the other on single file diffs (using
DiffFormatter#format). The latter skips binary and large files
for rename detection - check [1], but the former doesn't.
This change skips content rename detection for the tree diffs case for
large files. This is essential to avoid expensive computations while
reading the file, especially for callers who don't want to pay that
cost. Content renames are those which involve files with slightly
modified content. Exact renames will still be identified.
The default threshold for file sizes is reused from
PackConfig.DEFAULT_BIG_FILE_THRESHOLD: 50 MB.
[1] 232876421d/org.eclipse.jgit/src/org/eclipse/jgit/diff/RawText.java (386)
Change-Id: Idbc2c29bd381c6e387185204638f76fda47df41e
Signed-off-by: Youssef Elghareeb <ghareeb@google.com>
Add new constructors to PackFile to improve a common use case where
callers know the directory, id, and extension, but previously needed to
construct a valid file name (with prefix, '.', etc) to create a
PackFile. Most callers can use the variant that has id as an ObjectId,
but provide an id as String variant too.
Change-Id: I39e4466abe8c9509f5916d5bfe675066570b8585
Signed-off-by: Nasser Grainawi <quic_nasserg@quicinc.com>
Restore preserved packs during missing object seeks
Provide a recovery path for objects being referenced during the pack
pruning race. Due to the pack pruning race, it is possible for objects
to become referenced after a pack has been deemed safe to prune, but
before it actually gets pruned. If this happened previously, the newly
referenced objects would be missing and potentially result in a
corrupted ref.
Add the ability to recover from this situation when an object is missing
but happens to still be available in a pack in the "preserved"
directory. This is likely only useful when used in conjunction with the
--preserve-old-packs GC option, which prunes packs by hard-linking to
the preserved directory. If an object is missing and found in a pack in
the preserved directory, immediately recover that pack and its
associated files (idx, bitmaps...) by moving them back to the original
pack directory, and then retry the operation that would have failed due
to the missing object. This retry can now succeed and the repository
may avoid corruption. This approach should drastically reduce the
chance of a corrupt repository during pack pruning at very little extra
cost. This extra cost should only be incurred when objects are missing
and a failure would normally occur.
Change-Id: I2a704e3276b88cc892159d9bfe2455c6eec64252
Signed-off-by: Martin Fick <quic_mfick@quicinc.com>
Signed-off-by: Nasser Grainawi <quic_nasserg@quicinc.com>
Pack: Replace extensions bitset with bitmapIdx PackFile
The only extension that was ever consulted from the bitmap was the
bitmap index. We can simplify the Pack code as well as the code of
all the callers if we focus on just that usage.
Change-Id: I799ddfdee93142af67ce5081d14a430d36aa4c15
Signed-off-by: Nasser Grainawi <quic_nasserg@quicinc.com>
The PackFile class is intended to be a central place to do all
common pack filename manipulation and parsing to help reduce repeated
code and bugs. Use the PackFile class in the Pack class and in many
tests to ensure it works well in a variety of situations. Later changes
will expand use of PackFiles to even more areas.
Change-Id: I921b30f865759162bae46ddd2c6d669de06add4a
Signed-off-by: Nasser Grainawi <quic_nasserg@quicinc.com>
Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
A cookie file stores the expiration in seconds since the Linux Epoch,
not in milliseconds. Correct reading and writing cookie files; with
a backwards-compatibility hack to read files that contain a millisecond
timestamp.
Add a test, and fix tests not to rely on the actual current time so
that they will also run successfully after 2030-01-01 noon.
Bug: 571574
Change-Id: If3ba68391e574520701cdee119544eedc42a1ff2
Signed-off-by: Thomas Wolf <thomas.wolf@paranor.ch>