This is similar to change Idbc2c29bd that skipped detecting content
renames for large files. With this change, we added a new option in
RenameDetector called "skipContentRenamesForBinaryFiles", that when set,
causes binary files with any slight modification to be identified as
added/deleted. The default for this boolean is false, so preserving
current behaviour.
Change-Id: I4770b1f69c60b1037025ddd0940ba86df6047299
There are two code paths for detecting renames: one on tree diffs
(using DiffFormatter#scan) and the other on single file diffs (using
DiffFormatter#format). The latter skips binary and large files
for rename detection - check [1], but the former doesn't.
This change skips content rename detection for the tree diffs case for
large files. This is essential to avoid expensive computations while
reading the file, especially for callers who don't want to pay that
cost. Content renames are those which involve files with slightly
modified content. Exact renames will still be identified.
The default threshold for file sizes is reused from
PackConfig.DEFAULT_BIG_FILE_THRESHOLD: 50 MB.
[1] 232876421d/org.eclipse.jgit/src/org/eclipse/jgit/diff/RawText.java (386)
Change-Id: Idbc2c29bd381c6e387185204638f76fda47df41e
Signed-off-by: Youssef Elghareeb <ghareeb@google.com>
Use org.eclipse.jgit.errors.CancelledException which is a subclass of
IOException instead of org.eclipse.jgit.api.errors.CanceledException in
order to avoid breaking API. We can reconsider this with the next major
version 6.0.
Bug: 536324
Change-Id: Ia6f84f59aa6b7d78b8fccaba24ade320a54f7458
Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
Signed-off-by: Thomas Wolf <thomas.wolf@paranor.ch>
Abort rename detection in a timely manner if cancelled
If progress monitor is cancelled break loops in rename detection by
throwing a CanceledException.
Bug: 536324
Change-Id: Ia3511fb749d2a5d45005e72c156b874ab7a0da26
Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
Enable and fix warnings about redundant specification of type arguments
Since the introduction of generic type parameter inference in Java 7,
it's not necessary to explicitly specify the type of generic parameters.
Enable the warning in Eclipse, and fix all occurrences.
Change-Id: I9158caf1beca5e4980b6240ac401f3868520aad0
Signed-off-by: David Pursehouse <david.pursehouse@gmail.com>
Enable and fix 'Should be tagged with @Override' warning
Set missingOverrideAnnotation=warning in Eclipse compiler preferences
which enables the warning:
The method <method> of type <type> should be tagged with @Override
since it actually overrides a superclass method
Justification for this warning is described in:
http://stackoverflow.com/a/94411/381622
Enabling this causes in excess of 1000 warnings across the entire
code-base. They are very easy to fix automatically with Eclipse's
"Quick Fix" tool.
Fix all of them except 2 which cause compilation failure when the
project is built with mvn; add TODO comments on those for further
investigation.
Change-Id: I5772061041fd361fe93137fd8b0ad356e748a29c
Signed-off-by: David Pursehouse <david.pursehouse@gmail.com>
Use AutoClosable to close resources in bundle org.eclipse.jgit
- use try-with-resource where possible
- replace use of deprecated release() by close()
Change-Id: I0f139c3535679087b7fa09649166bca514750b81
Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
The various rename detection options are an inherent part of the
filter, similar to the path being followed.
This fixes a potential NPE when a RevWalk with a FollowFilter is
created without a Repository, since the old code path tried to get
the DiffConfig from the RevWalk's possibly-missing repository.
Change-Id: Idb273d5a92849b42935ac14eed73b796b80aad50
Allow detecting which files were renamed during a revwalk
The egit history view shows the files associated with a commit by using
a PathFilter. When following renames with a FollowFilter, the PathFilter
cannot be configured anymore because the affected files are simply not
known.
Thus, it should be possible to get to know which files are renamed.
Bug: 302549
Change-Id: I4761e9f5cfb4f0ef0b0e1e38991401a1d5003bea
If there are only deletes, don't need perform rename or copy
detection. There are no adds (aka destinations) for the deletes
to match against.
Change-Id: I00fb90c509fa26a053de561dd8506cc1e0f5799a
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Files bigger than 8 MB (2^23 bytes) tended to overflow the internal
hashtable, as the table was capped in size to 2^17 records. If a
file contained 2^17 unique data blocks/lines, the table insertion
got stuck in an infinite loop as the able couldn't grow, and there
was no open slot for the new item.
Remove the artifical 2^17 table limit and instead allow the table
to grow to be as big as 2^30. With a 64 byte block size, this
permits hashing inputs as large as 64 GB.
If the table reaches 2^30 (or cannot be allocated) hashing is
aborted. RenameDetector no longer tries to break a modify file pair,
and it does not try to match the file for rename or copy detection.
Change-Id: Ibb4d756844f4667e181e24a34a468dc3655863ac
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
If the iterators passed into a diff formatter are working tree
iterators, we should enable ignoring files that are ignored, as
well as actually pull up the current content from the working tree
rather than getting it from the repository.
Because we abstract away the working directory access logic,
we can now actually support rename detection between the working
directory and the local repository when using a DiffFormatter.
This means its possible for an application to show an unstaged
delete-add pair as a rename if the add path is not ignored.
(Because the ignored file wouldn't show up in our difference output.)
Even more interesting is we can now do rename detection between any
two working trees, if both input iterators are WorkingTreeIterators.
Unfortunately we don't (yet) optimize for comparing the working
tree with the index involved so we can take advantage of cached
stat data to rule out non-dirty paths.
Change-Id: I4c0598afe48d8f99257266bf447a0ecd23ca7f5e
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Move rename detection, path following into DiffFormatter
Applications just want a quick way to configure our diff
implementation, and then just want to use it without a lot of fuss.
Move all of the rename detection logic and path following logic
out of our pgm package and into DiffFormatter itself, making it
much easier for a GUI to take advantage of the features without
duplicating a lot of code.
Change-Id: I4b54e987bb6dc804fb270cbc495fe4cae26c7b0e
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Fix ArrayIndexOutOfBounds on non-square exact rename matrix
If the exact rename matrix for a particular ObjectId isn't square we
crashed with an ArrayIndexOutOfBoundsException because the matrix
entries were encoded backwards. The encode function accepts the
source (aka deleted) index first, not second. Add a unit test to
cover this non-square case to ensure we don't have this regression
in the future.
Change-Id: I5b005e5093e1f00de2e3ec104e27ab6820203566
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Rename getOldName,getNewName to getOldPath,getNewPath
TreeWalk calls this value "path", while "name" is the stuff after the
last slash. FileHeader should do the same thing to be consistent.
Rename getOldName to getOldPath and getNewName to getNewPath.
Bug: 318526
Change-Id: Ib2e372ad4426402d37939b48d8f233154cc637da
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
File pairs that are very dissimilar during a diff were not being
broken apart into their constituent ADD/DELETE pairs. The leads to
sub-optimal rename detection. Take, for example, this situation:
A file exists at src/a.txt containing "foo". A user renames src/a.txt
to src/b.txt, then adds a new src/a.txt containing "bar".
Even though the old a.txt and the new b.txt are identical, the
rename detection algorithm would not detect it as a rename since
it was already paired in a MODIFY. I added code to split all
MODIFYs below a certain score into their constituent ADD/DELETE
pairs. This allows situations like the one I described above to be
more correctly handled.
Change-Id: I22c04b70581f206bbc68c4cd1ee87a1f663b418e
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
If we have two adds of the same object but no deletes the detector
threw an NPE because the entry that came back from the deleted map
was null (no matching objects). In this case we need to put the
adds all back onto the list of left over additions since they did
not match a delete.
Change-Id: Ie68fbe7426b4dc0cb571a08911c7adbffff755d5
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
CC: Jeffrey Schumacher" <jeffschu@google.com>
Implemented file path based tie breaking to exact rename detection
During the exact rename detection phase in RenameDetector, ties were
resolved on a first-found basis. I added support for file path based
tie breaking during that phase. Basically, there are four situations
that have to be handled:
One add matching one delete:
In this simple case, we pair them as a rename.
One add matching many deletes:
Find the delete whos path matches the add the closest, and
pair them as a rename.
Many adds matching one delete:
Similar to the above case, we find the add that matches the
delete the closest, and pair them as a rename. The other adds
are marked as copies of the delete.
Many adds matching many deletes:
Build a scoring matrix similar to the one used for content-
based matching, scoring instead by file path. Some of the
utility functions in SimilarityRenameDetector are used in
this case, as we use the same encoding scheme. Once the
matrix is built, scan it for the best matches, marking them
as renames. The rest are marked as copies.
I don't particularly like the idea of using utility functions right
out of SimilarityRenameDetector, but it works for the moment. A later
commit will likely refactor this into a common utility class, as well
as bringing exact rename detection out of RenameDetector and into a
separate class, much like SimilarityRenameDetector.
Change-Id: I1fb08390aebdcbf20d049aecf402a36506e55611
Added very small optimization to exact rename detection
Optimized a small loop in findExactRenames. The loop would go through
all the items in a list of DiffEntries even after it already found
what it was looking for. I made it break out of the loop as soon as
a good match was found.
Change-Id: I28741e0c49ce52d8008930a87cd1db7037700a61
The javadoc for the setRenameLimit method in RenameDetector said
that you could only have limits in the range (0,100), implying
that 0 and 100 were illegal inputs. The code, however, allowed 0 and
100. I changed the javadoc to say that the range [0,100] was legal.
I also documented the IllegalArgumentException that is thrown if the
limit is outside that range.
Change-Id: I916838f254859f6f0e1516bb55b8e7dc87e57dc2
This way we don't have to reparse for the rename limit every time
we create a new rename detector for a repository.
Change-Id: I669d031690b85ef4da5e39189be7173fb773fc56
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Content similarity based rename detection is performed only after
a linear time detection is performed using exact content match on
the ObjectIds. Any names which were paired up during that exact
match phase are excluded from the inexact similarity based rename,
which reduces the space that must be considered.
During rename detection two entries cannot be marked as a rename
if they are different types of files. This prevents a symlink from
being renamed to a regular file, even if their blob content appears
to be similar, or is identical.
Efficiently comparing two files is performed by building up two
hash indexes and hashing lines or short blocks from each file,
counting the number of bytes that each line or block represents.
Instead of using a standard java.util.HashMap, we use a custom
open hashing scheme similiar to what we use in ObjecIdSubclassMap.
This permits us to have a very light-weight hash, with very little
memory overhead per cell stored.
As we only need two ints per record in the map (line/block key and
number of bytes), we collapse them into a single long inside of
a long array, making very efficient use of available memory when
we create the index table. We only need object headers for the
index structure itself, and the index table, but not per-cell.
This offers a massive space savings over using java.util.HashMap.
The score calculation is done by approximating how many bytes are
the same between the two inputs (which for a delta would be how much
is copied from the base into the result). The score is derived by
dividing the approximate number of bytes in common into the length
of the larger of the two input files.
Right now the SimilarityIndex table should average about 1/2 full,
which means we waste about 50% of our memory on empty entries
after we are done indexing a file and sort the table's contents.
If memory becomes an issue we could discard the table and copy all
records over to a new array that is properly sized.
Building the index requires O(M + N log N) time, where M is the
size of the input file in bytes, and N is the number of unique
lines/blocks in the file. The N log N time constraint comes
from the sort of the index table that is necessary to perform
linear time matching against another SimilarityIndex created for
a different file.
To actually perform the rename detection, a SxD matrix is created,
placing the sources (aka deletions) along one dimension and the
destinations (aka additions) along the other. A simple O(S x D)
loop examines every cell in this matrix.
A SimilarityIndex is built along the row and reused for each
column compare along that row, avoiding the costly index rebuild
at the row level. A future improvement would be to load a smaller
square matrix into SimilarityIndexes and process everything in that
sub-matrix before discarding the column dimension and moving down
to the next sub-matrix block along that same grid of rows.
An optional ProgressMonitor is permitted to be passed in, allowing
applications to see the progress of the detector as it works through
the matrix cells. This provides some indication of current status
for very long running renames.
The default line/block hash function used by the SimilarityIndex
may not be optimal, and may produce too many collisions. It is
borrowed from RawText's hash, which is used to quickly skip out of
a longer equality test if two lines have different hash functions.
We may need to refine this hash in the future, in order to minimize
the number of collisions we get on common source files.
Based on a handful of test commits in JGit (especially my own
recent rename repository refactoring series), this rename detector
produces output that is very close to C Git. The content similarity
scores are sometimes off by 1%, which is most probably caused by
our SimilarityIndex type using a different hash function than C
Git uses when it computes the delta size between any two objects
in the rename matrix.
Bug: 318504
Change-Id: I11dff969e8a2e4cf252636d857d2113053bdd9dc
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
JGit does not currently do rename detection during diffs. I added
a class that, given a TreeWalk to iterate over, can output a list
of DiffEntry's for that TreeWalk, taking into account renames. This
class only detects renames by SHA1's. More complex rename detection,
along the lines of what C Git does will be added later.
Change-Id: I93606ce15da70df6660651ec322ea50718dd7c04