Content length is computed and cached (short term) in the working
tree iterator when core.autocrlf is set.
Hopefully this is a cleaner fix than my previous attempt to make
autocrlf work.
Change-Id: I1b6bbb643101a00db94e5514b5e2b069f338907a
RawText#getEOL() does the same thing as RawText#getLineDelimiter()
The duplication has been introduced when merging
I08e1369e142bb19f42a8d7bbb5a7d062cc8533fc and
I18adc63596f4657516ccc6d704a561924c79d445. The former should have been
manually rebased. It also missed a copyright update in ApplyCommandTest.
Change-Id: I18fe6108220f964524fb16b719604222aa7abee6
Report diff entries for files that only change mode
This also updates DiffFormatter to not write path lines
for entries that have the same object id
Bug: 361570
Change-Id: I830a78e2babf472503630a7aa020ebfd5c7e69c6
Allow detecting which files were renamed during a revwalk
The egit history view shows the files associated with a commit by using
a PathFilter. When following renames with a FollowFilter, the PathFilter
cannot be configured anymore because the affected files are simply not
known.
Thus, it should be possible to get to know which files are renamed.
Bug: 302549
Change-Id: I4761e9f5cfb4f0ef0b0e1e38991401a1d5003bea
Adds method into DiffEntry class that allows to specify whether changed
trees are included in scanning result list. By default changed trees
aren't added, but in some cases having changed tree would be useful.
Also adds check for tree count in TreeWalk and when it is different from
two it will thrown an IllegalArgumentException.
This change is required by egit
I7ddb21e7ff54333dd6d7ace3209bbcf83da2b219
Change-Id: I5a680a73e1cffa18ade3402cc86008f46c1da1f1
Signed-off-by: Dariusz Luksza <dariusz@luksza.org>
Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
There is no reason for this type to contain an ArrayList and try to
hide the implementation. It only slows down execution by adding an
extra layer of method dispatch to each invocation.
Instead subclass from ArrayList.
Change-Id: Ifbb9c7060c2fe3d5a7397c1aa85fbade14088637
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Adds a class which can be used to calculates a SHA1 of the diff
associated with a patch, similar to git patch-id.
In this version whitespace is not ignored.
Change-Id: I421d15ea905e23df543082786786841cbe3ef10d
Signed-off-by: Stefan Lay <stefan.lay@sap.com>
Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>
For the following patch on the linux 2.6.32 tag:
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -685,6 +685,7 @@ static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sc
static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
+#if 0
#ifdef CONFIG_SCHED_DEBUG
s64 d = se->vruntime - cfs_rq->min_vruntime;
@@ -694,6 +695,7 @@ static void check_spread(struct cfs_rq *cfs_rq, struct
sched
if (d > 3*sysctl_sched_latency)
schedstat_inc(cfs_rq, nr_spread_over);
#endif
+#endif
}
static void
JGit produced an incorrect diff, attempting to add a new "}" instead
of the new "#endif" at the end of the hunk. This was caused by a prior
fix for bug 328895 where we wanted to "slide" a diff down in the file
when adding a new method/function and want to show the closing curly
brace as being added after the new method, rather than added onto the
end of the prior function or method just before the insertion point.
Bug: 345956
Change-Id: I32b9e24f1e2980258b1b39dd1807919ab1c5f9b2
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Fix diff when first text is the start of the other
The problem occurred when the first text ends in the middle
of the last line of the other text and the first text has no
end of line.
Bug: 344975
Change-Id: I1f0dd9f8062f2148a7c1341c9122202e082ad19d
Signed-off-by: Robin Rosenberg <robin.rosenberg@dewire.com>
DiffFormatter: Use IndexDiffFilter to speed up working tree
If DiffFormatter is asked to compare the index to the working tree,
it can go faster by using the cached stat information to compare
the two entries rather than relying on SHA-1 computation alone.
Change-Id: Icb21c15b8279ee8cee382e5e179e0cf8903aee4d
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Its confusing that a new TreeWalk() needs to have reset() invoked
on it before addTree(). This is a historical accident caused by
how TreeWalk was abused within ObjectWalk.
Drop the initial empty tree from the TreeWalk and thus remove a
number of pointless reset() operations from unit tests and some of
the internal JGit code.
Existing application code which is still calling reset() will simply
be incurring a few unnecessary field assignments, but they should
consider cleaning up their code in the future.
Change-Id: I434e94ffa43491019e7dff52ca420a4d2245f48b
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
When in OURS and THEIRS a new file is created we want a conflict
when the two contents differ. If on two branches the same file
with the same content is created this should not be a conflict.
But: the current merge algorithm is throwing NPEs in this case.
Fix this by choosing an empty RawText as common base if the
base is empty.
Change-Id: I21cb23f852965b82fb82ccd66ec961c7edb3ac3d
Signed-off-by: Christian Halstrick <christian.halstrick@sap.com>
Fix DiffConfig to understand "copy" resp. "copies" for diff.renames property.
Rename detection should be considered enabled if
diff.renames config property is set to "copy" or "copies", instead of
throwing IllegalArgumentException.
Change-Id: If55d955e37235d4d00f5b0febd6aa10c0e27814e
The diff algorithm which is used by Merge, Cherry-Pick, Rebase
should be configurable. A new configuration parameter "diff.algorithm"
is introduced which currently accepts the values "myers" or
"histogram". Based on this parameter for example the ResolveMerger
will choose a diff algorithm. The reason for this is bug 331078.
This bug shows that JGit is more compatible with C Git when
histogram diff is in place. But since histogram diff is quite new we
need an easy way to fall back to Myers diff.
Bug: 331078
Change-Id: I2549c992e478d991c61c9508ad826d1a9e539ae3
Signed-off-by: Christian Halstrick <christian.halstrick@sap.com>
Signed-off-by: Philipp Thun <philipp.thun@sap.com>
If there are only deletes, don't need perform rename or copy
detection. There are no adds (aka destinations) for the deletes
to match against.
Change-Id: I00fb90c509fa26a053de561dd8506cc1e0f5799a
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Setting the array elements to -1 is more expensive than relying on
the allocator to zero the array for us first. Shifting the code to
always add 1 to the size (so an empty file is actually 1 byte long)
allows us to detect an unloaded size by comparing to 0, thus saving
the array fill calls.
Change-Id: Iad859e910655675b53ba70de8e6fceaef7cfcdd1
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
SimilarityRenameDetector: Avoid allocating source index
If the only file added is really small, and all of the deleted
files are really big, none of the permutations will match up due
to the sizes being too far apart to fit the current rename score.
Avoid allocating the really big deleted SimilarityIndex by deferring
its construction until at least one add along that row has a
reasonable chance of matching it.
This avoids expending a lot of CPU time looking at big deleted
binary files when a small modified text file was broken due to a
high percentage of changed lines.
Change-Id: I11ae37edb80a7be1eef8cc01d79412017c2fc075
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
SimilarityRenameDetector: Only attempt to index large files once
If a file fails to index the first time the loop encounters it, the
file is likely to fail to index again on the next row. Rather than
wasting a huge amount of CPU to index it again and fail, remember
which destination files failed to index and skip over them on each
subsequent row.
Because this condition is very unlikely, avoid allocating the BitSet
until its actually needed. This keeps the memory usage unaffected
for the common case.
Change-Id: I93509b28b61a9bba8f681a7b4df4c6127bca2a09
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
The counter portion of each pair is only 32 bits wide, but is part
of a larger 64 bit integer. If the file size was larger than 4 GB
the counter could overflow and impact the key, changing the hash,
and later resulting in an incorrect similarity score.
Guard against this overflow condition by capping the count for each
record at 2^32-1. If any record contains more than that many bytes
the table aborts hashing and throws TableFullException.
This permits the index to scan and work on files that exceed 4 GB
in size, but only if the file contains more than one unique block.
The index throws TableFullException on a 4 GB file containing all
zeros, but should succeed on a 6 GB file containing unique lines.
The index now uses a 64 bit accumulator during the common scoring
algorithm, possibly resulting in slower summations. However this
index is already heavily dependent upon 64 bit integer operations
being efficient, so increasing from 32 bits to 64 bits allows us
to correctly handle 6 GB files.
Change-Id: I14e6dbc88d54ead19336a4c0c25eae18e73e6ec2
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Files bigger than 8 MB (2^23 bytes) tended to overflow the internal
hashtable, as the table was capped in size to 2^17 records. If a
file contained 2^17 unique data blocks/lines, the table insertion
got stuck in an infinite loop as the able couldn't grow, and there
was no open slot for the new item.
Remove the artifical 2^17 table limit and instead allow the table
to grow to be as big as 2^30. With a 64 byte block size, this
permits hashing inputs as large as 64 GB.
If the table reaches 2^30 (or cannot be allocated) hashing is
aborted. RenameDetector no longer tries to break a modify file pair,
and it does not try to match the file for rename or copy detection.
Change-Id: Ibb4d756844f4667e181e24a34a468dc3655863ac
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
SimilarityIndex: Correct comment explaining the logic
This comment was wrong, due to a copy-and-paste error. Here the
code is looking at records of dst that do not exist in src, and
are skipping past them to find another match.
Change-Id: I07c1fba7dee093a1eeffcf7e0c7ec85446777ffb
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
When adding a new method near the end of the sequence we want to
show the full method inserted, and not tear the prior method due
to the common trailing curly brace being consumed as part of the
common end region of the sequences.
Bug: 328895
Change-Id: I233bc40445fb5452863f5fb082bc3097433a8da6
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
HistogramDiff failed on cases where the initial element for the LCS
was actually very common (e.g. has 20 occurrences), and the first
element of the inserted region after the LCS was also common but
had fewer occurrences (e.g. 10), while the LCS also contained a
unique element (1 occurrence).
This happens often in Java source code. The initial element for
the LCS might be the empty line ("\n"), and the inserted but common
element might be "\t/**\n", with the LCS being a large span of
lines that contains unique method declarations. Even though "/**"
occurs less often than the empty line its not a better LCS if the
LCS we already have contains a unique element.
The logic in HistogramDiff would normally have worked fine, except I
tried to optimize scanning of B by making tryLongestCommonSequence
return the end of the region when there are matching elements
found in A. This allows us to skip over the current LCS region,
as it has already been examined, but caused us to fail to identify
an element that had a lower occurrence count within the region.
The solution used here is to trade space-for-time by keeping a
table of A positions to their occurrence counts. This allows the
matching logic to always use the smallest count for this region,
even if the smallest count doesn't appear on the initial element.
The new unit test testEdit_LcsContainsUnique() verifies this new
behavior works as expected.
Bug: 328895
Change-Id: Id170783b891f645b6a8cf6f133c6682b8de40aaf
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Fix oddness check in MyersDiff for negative numbers
It's probably not possible that these numbers are negative in the
algorithm, but it's cleaner this way and gets rid of three more
FindBugs warnings.
Change-Id: Ifbce4e2c787fb9a7cd309c605e8d86211ef8a352
These routines can be useful when debugging, because we can add an
expression to the Eclipse "Expressions" panel to show the text that
appears on a line. Gerrit Code Review also uses these in its own
subclass of RawText in order to format patch files, so pulling it up
to be part of core JGit may help other applications too.
Change-Id: I20a6b112e3403ecfc1c2715ae75dcecc1a85b167
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Remove dead RawText(RawTextComparator) constructor
Since the introduction of HashedSequence we no longer need to supply
the RawTextComparator at the time of constructing a RawText. Drop the
definition from the constructor, because it doesn't make sense as part
of our public API.
Change-Id: Iaab34611d60eee4a2036830142b089b2dae81842
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Fix RawTextComparator reduceCommonStartEnd at empty lines
When an empty line was inserted at the beginning of the common end
part of a RawText the comparator incorrectly considered it to be
common, which meant the DiffAlgorithm would later not even have it be
part of the region it examines. This would cause JGit to skip a line
of insertion, which later confused Gerrit Code Review when it tried to
match up the pre and post RawText files for a difference that had this
type of insertion.
Define two new unit tests to check for this insertion of a blank line
condition and correct for it by removing the LF from the common region
when the condition is detected.
Change-Id: I2108570eb2929803b9a56f9fb9c400c758e7156b
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
HistogramDiff outperforms it for any case where PatienceDiff needs to
fallback to another algorithm. Consequently it's not worth keeping
around, because we would always want a fallback enabled.
Change-Id: I39b99cb1db4b3be74a764dd3d68cd4c9ecd91481
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Its behavior is similar to PatienceDiff, and runs nearly as fast,
often beating the performance of MyersDiff.
Change-Id: I43c3faefa8109f1a68ef57522bec9cf27b5df252
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
When passing to a fallback algorithm, we can avoid creating a new copy
of the hash codes for each sequence by passing in the hashed sequences
directly. This makes it cheaper to switch from HistogramDiff down to
MyersDiff in a single pass.
Change-Id: Ibf2e81be57c083862eeb134279aed676653bf9b5
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
There is a corner case where we get an EMPTY region during recursion,
but we didn't expect to receive that. Its harmless to ignore the
region since the region is empty and has no content, so do so rather
than throwing an exception
Change-Id: I50dcec81ecba763072bb739adfab5879fb48b23a
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Certain inputs caused an infinite loop because the prior match data
couldn't be used as expected. Rather than incrementing the match
pointer before looking at an element, do it after, so the loop breaks
when we wrap around to the starting point.
Change-Id: Ieab28bb3485a914eeddc68aa38c256f255dd778c
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
HistogramDiff is an alternative implementation of patience diff,
performing a search over all matching locations and picking the
longest common subsequence that has the lowest occurrence count.
If there are unique common elements, its behavior is identical to
that of patience diff.
Actual performance on real-world source files usually beats
MyersDiff, sometimes by a factor of 3, especially for complex
comparators that ignore whitespace.
Change-Id: I1806cd708087e36d144fb824a0e5ab7cdd579d73
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Pass through the addAll request to our underlying ArrayList.
This way the underlying ArrayList grows no more than once during the
call, which may be important if the list was originally allocated
at the default size of 16, but 64 Edits are being added.
Change-Id: I31c3261e895766f82c3c832b251a09f6e37e8860
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
The hash code returned by RawTextComparator (or that is used
by the SimilarityIndex) play an important role in the speed of
any algorithm that is based upon them. The lower the number of
collisions produced by the hash function, the shorter the hash
chains within hash tables will be, and the less likely we are to
fall into O(N^2) runtime behaviors for algorithms like PatienceDiff.
Our prior hash function was absolutely horrid, so replace it with
the proper definition of the DJB hash that was originally published
by Professor Daniel J. Bernstein.
To support this assertion, below is a table listing the maximum
number of collisions that result when hashing the unique lines in
each source code file of 3 randomly chosen projects:
test_jgit: 931 files; 122 avg. unique lines/file
Algorithm | Collisions
-------------+-----------
prior_hash 418
djb 5
sha1 6
string_hash31 11
test_linux26: 30198 files; 258 avg. unique lines/file
Algorithm | Collisions
-------------+-----------
prior_hash 8675
djb 32
sha1 8
string_hash31 32
test_frameworks_base: 8381 files; 184 avg. unique lines/file
Algorithm | Collisions
-------------+-----------
prior_hash 4615
djb 10
sha1 6
string_hash31 13
We can clearly see that prior_hash performed very poorly, resulting
in 8,675 collisions (elements in the same hash bucket) for at least
one file in the Linux kernel repository. This leads to some very
bad O(N) style insertion and lookup performance, even though the
hash table was sized to be the next power-of-2 larger than the
total number of unique lines in the file.
The djb hash we are replacing prior_hash with performs closer to
SHA-1 in terms of having very few collisions. This indicates it
provides a reasonably distributed output for this type of input,
despite being a much simpler algorithm (and therefore will be much
faster to execute).
The string_hash31 function is provided just to compare results with,
it is the algorithm commonly used by java.lang.String hashCode().
However, life isn't quite this simple.
djb produces a 32 bit hash code, but our hash tables are always
smaller than 2^32 buckets. Mashing the 32 bit code into an array
index used to be done by simply taking the lower bits of the hash
code by a bitwise and operator. This unfortuntely still produces
many collisions, e.g. 32 on the linux-2.6 repository files.
From [1] we can apply a final "cleanup" step to the hash code to
mix the bits together a little better, and give priority to the
higher order bits as they include data from more bytes of input:
test_jgit: 931 files; 122 avg. unique lines/file
Algorithm | Collisions
-------------+-----------
prior_hash 418
djb 5
djb + cleanup 6
test_linux26: 30198 files; 258 avg. unique lines/file
Algorithm | Collisions
-------------+-----------
prior_hash 8675
djb 32
djb + cleanup 7
test_frameworks_base: 8381 files; 184 avg. unique lines/file
Algorithm | Collisions
-------------+-----------
prior_hash 4615
djb 10
djb + cleanup 7
This is a massive improvement, as the number of collisions for
common inputs drops to acceptable levels, and we haven't really
made the hash functions any more complex than they were before.
[1] http://lkml.org/lkml/2009/10/27/404
Change-Id: Ia753b695de9526a157ddba265824240bd05dead1
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Perform common start/end elimination by default for DiffAlgorithm
As it turns out, every single diff algorithm we might try to
implement can benfit from using the SequenceComparator's native
concept of the simple reduceCommonStartEnd() step. For most inputs,
there can be a significant number of elements that can be removed
from the space the DiffAlgorithm needs to consider, which will
reduce the overall running time for the final solution.
Pool this logic inside of DiffAlgorithm itself as a default, but
permit a specific algorithm to override it when necessary.
Convert MyersDiff to use this reduction to reduce the space it
needs to search, making it perform slightly better on common inputs.
Change-Id: I14004d771117e4a4ab2a02cace8deaeda9814bc1
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Remove unnecessary hash cache from PatienceDiffIndex
PatienceDiff always uses a HashedSequence, which promises to provide
constant time access for hash codes during the equals method and
aborts fast if the hash codes don't match. Therefore we don't need
to cache the hash codes inside of the index, saving us memory.
Change-Id: I80bf1e95094b7670e6c0acc26546364a1012d60e
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Most diff implementations really want to use cached hash codes for
elements, rather than element equality, as they need to perform many
compares and unique hash codes for elements can really speed that
process up.
To make it easier to define element hash functions, move the caching
of hash codes into a wrapper sequence type, so that individual
sequence types like RawText don't need to do this themselves. This
has a nice property of also allowing the sequence to no longer care
about the specific SequenceComparator that is going to be used, and
permits the caching to only examine the middle region that isn't
common to the two inputs.
Change-Id: If8623556da9419117b07c5073e8bce39de02570e
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
This is a faster exact match based form that tries to improve
performance for the common case of the header and trailer of
a text file not changing at all. After this fast path we use
the slower path based on the super class' using equals() to
allow for whitespace ignore modes to still work.
Some simple performance testing showed a major improvement over the
older implementation for a common edit we see in JGit. The test
compared blob 29a89bc and 372a978, which is the ObjectDirectory.java
file difference in commit 41dd9ed1c0.
The two text files are approximately 22 KiB in size.
DEFAULT old 203900 ns
DEFAULT new 100400 ns
This new version is 2x faster for the DEFAULT comparator, which does
not treat space specially. This is because we can now examine a
larger swath of text with fewer instructions per byte compared. The
older algorithm had to stop at each line break and recompute how to
examine the next line, while the new algorithm only stops when the
first difference is found.
WS_IGNORE_ALL old 298500 ns
WS_IGNORE_ALL new 63300 ns
Its 4.7x faster for the whitespace ignore comparator, as the common
header and footer do not have a whitespace difference. Avoiding the
special case handling for whitespace on each byte considered saves a
lot of time.
Since most edits to source code (and other text like files) appears in
the interior of the file, fast elimination of common header/footer
means faster diff throughput. In the less common case of an actual
header or footer edit, the common header/footer elimination is stopped
rather quickly either way, so there is very little downside to the
optimiation applied here.
Change-Id: I1d501b4c3ff80ed086b20bf12faf51ae62167db7
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
DiffAlgorithm implementations may find it useful to construct an Edit
and use that to later subsequence the two base sequences, so define
two new utility methods a() and b() to construct the A and B ranges.
Once a subsequence has had Edits created for it the indexes are
within the space of the subsequence. These must be shifted back to
the original base sequence's indexes. Define toBase() as a utility
method to perform that shifting work in-place, so DiffAlgorithm
implementations have an efficient way to convert back to the caller's
original space.
Change-Id: I8d788e4d158b9f466fa9cb4a40865fb806376aee
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
A diff algorithm may find this type useful if it wants to delegate a
particular range of elements to another algorithm, without changing
the underlying sequence types.
Change-Id: I4544467781233e21ac8b35081304b2bad7db00f6
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
This makes it easier to parametrize DiffFormatter with a different
implementation, as we later plan to add PatienceDiff to JGit.
Change-Id: Id35ef478d5fa20fe10a1ba297f9436fd7adde9ce
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>