mirrors/jgit - jgit - source @ dussan.org

Commit Graph

Author	SHA1	Message	Date
Youssef Elghareeb	4a78d911c5	Skip detecting content renames for large files There are two code paths for detecting renames: one on tree diffs (using DiffFormatter#scan) and the other on single file diffs (using DiffFormatter#format). The latter skips binary and large files for rename detection - check [1], but the former doesn't. This change skips content rename detection for the tree diffs case for large files. This is essential to avoid expensive computations while reading the file, especially for callers who don't want to pay that cost. Content renames are those which involve files with slightly modified content. Exact renames will still be identified. The default threshold for file sizes is reused from PackConfig.DEFAULT_BIG_FILE_THRESHOLD: 50 MB. [1] `232876421d/org.eclipse.jgit/src/org/eclipse/jgit/diff/RawText.java (386)` Change-Id: Idbc2c29bd381c6e387185204638f76fda47df41e Signed-off-by: Youssef Elghareeb <ghareeb@google.com>	3 years ago
David Pursehouse	fdabbe67e2	SimilarityRenameDetector: Fix inconsistent indentation Replace space indentation with tab indentation Change-Id: Ic130d3bde5d3a73d8f5c6225974153573722d05b Signed-off-by: David Pursehouse <david.pursehouse@gmail.com>	4 years ago
David Pursehouse	6a72f2943d	Use indexOf(char) and lastIndexOf(char) rather than String versions An indexOf or lastIndexOf call with a single letter String can be made more performant by switching to a call with a char argument. Found with SonarLint. As a side-effect of this change, we no longer need to suppress the NON-NLS warnings. Change-Id: Id44cb996bb74ed30edd560aa91fd8525aafdc8dd Signed-off-by: David Pursehouse <david.pursehouse@gmail.com>	4 years ago
Matthias Sohn	5c5f7c6b14	Update EDL 1.0 license headers to new short SPDX compliant format This is the format given by the Eclipse legal doc generator [1]. [1] https://www.eclipse.org/projects/tools/documentation.php?id=technology.jgit Bug: 548298 Change-Id: I8d8cabc998ba1b083e3f0906a8d558d391ffb6c4 Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>	4 years ago
Matthias Sohn	cf6463bddc	Fix API breakage introduced by da254106 Use org.eclipse.jgit.errors.CancelledException which is a subclass of IOException instead of org.eclipse.jgit.api.errors.CanceledException in order to avoid breaking API. We can reconsider this with the next major version 6.0. Bug: 536324 Change-Id: Ia6f84f59aa6b7d78b8fccaba24ade320a54f7458 Signed-off-by: Matthias Sohn <matthias.sohn@sap.com> Signed-off-by: Thomas Wolf <thomas.wolf@paranor.ch>	5 years ago
Matthias Sohn	da254106a7	Abort rename detection in a timely manner if cancelled If the progress monitor is cancelled, break loops in rename detection by throwing a CanceledException. Bug: 536324 Change-Id: Ia3511fb749d2a5d45005e72c156b874ab7a0da26 Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>	5 years ago
David Pursehouse	3b4448637f	Enable and fix warnings about redundant specification of type arguments Since the introduction of generic type parameter inference in Java 7, it's not necessary to explicitly specify the type of generic parameters. Enable the warning in Eclipse, and fix all occurrences. Change-Id: I9158caf1beca5e4980b6240ac401f3868520aad0 Signed-off-by: David Pursehouse <david.pursehouse@gmail.com>	7 years ago
Robin Rosenberg	c310fa0c80	Mark non-externalizable strings as such A few classes such as Constants are marked with @SuppressWarnings, as are toString() methods with many literals, but otherwise $NLS-n$ is used for strings containing text that should not be translated. A few literals may fall into the gray zone, but mostly I've tried to only tag the obvious ones. Change-Id: I22e50a77e2bf9e0b842a66bdf674e8fa1692f590	11 years ago
Robin Rosenberg	95d311f888	Move JGitText to an internal package Change-Id: I763590a45d75f00a09097ab6f89581a3bbd3c797	12 years ago
Shawn O. Pearce	05653bda04	SimilarityRenameDetector: Initialize sizes to 0 Setting the array elements to -1 is more expensive than relying on the allocator to zero the array for us first. Shifting the code to always add 1 to the size (so an empty file is actually 1 byte long) allows us to detect an unloaded size by comparing to 0, thus saving the array fill calls. Change-Id: Iad859e910655675b53ba70de8e6fceaef7cfcdd1 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>	13 years ago
Shawn O. Pearce	68baa3097e	SimilarityRenameDetector: Avoid allocating source index If the only file added is really small, and all of the deleted files are really big, none of the permutations will match up due to the sizes being too far apart to fit the current rename score. Avoid allocating the really big deleted SimilarityIndex by deferring its construction until at least one add along that row has a reasonable chance of matching it. This avoids expending a lot of CPU time looking at big deleted binary files when a small modified text file was broken due to a high percentage of changed lines. Change-Id: I11ae37edb80a7be1eef8cc01d79412017c2fc075 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>	13 years ago
Shawn O. Pearce	918e6e20f0	SimilarityRenameDetector: Only attempt to index large files once If a file fails to index the first time the loop encounters it, the file is likely to fail to index again on the next row. Rather than wasting a huge amount of CPU to index it again and fail, remember which destination files failed to index and skip over them on each subsequent row. Because this condition is very unlikely, avoid allocating the BitSet until it's actually needed. This keeps the memory usage unaffected for the common case. Change-Id: I93509b28b61a9bba8f681a7b4df4c6127bca2a09 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>	13 years ago
Shawn O. Pearce	d63887127e	SimilarityIndex: Accept files larger than 8 MB Files bigger than 8 MB (2^23 bytes) tended to overflow the internal hashtable, as the table was capped in size to 2^17 records. If a file contained 2^17 unique data blocks/lines, the table insertion got stuck in an infinite loop as the table couldn't grow, and there was no open slot for the new item. Remove the artificial 2^17 table limit and instead allow the table to grow to be as big as 2^30. With a 64 byte block size, this permits hashing inputs as large as 64 GB. If the table reaches 2^30 (or cannot be allocated) hashing is aborted. RenameDetector no longer tries to break a modify file pair, and it does not try to match the file for rename or copy detection. Change-Id: Ibb4d756844f4667e181e24a34a468dc3655863ac Signed-off-by: Shawn O. Pearce <spearce@spearce.org>	13 years ago
Shawn O. Pearce	59a262d5d2	Support creating the working directory difference If the iterators passed into a diff formatter are working tree iterators, we should enable ignoring files that are ignored, as well as actually pull up the current content from the working tree rather than getting it from the repository. Because we abstract away the working directory access logic, we can now actually support rename detection between the working directory and the local repository when using a DiffFormatter. This means it's possible for an application to show an unstaged delete-add pair as a rename if the add path is not ignored. (Because the ignored file wouldn't show up in our difference output.) Even more interesting is we can now do rename detection between any two working trees, if both input iterators are WorkingTreeIterators. Unfortunately we don't (yet) optimize for comparing the working tree with the index involved so we can take advantage of cached stat data to rule out non-dirty paths. Change-Id: I4c0598afe48d8f99257266bf447a0ecd23ca7f5e Signed-off-by: Shawn O. Pearce <spearce@spearce.org>	13 years ago
Shawn O. Pearce	60c5939b23	Rename getOldName,getNewName to getOldPath,getNewPath TreeWalk calls this value "path", while "name" is the stuff after the last slash. FileHeader should do the same thing to be consistent. Rename getOldName to getOldPath and getNewName to getNewPath. Bug: 318526 Change-Id: Ib2e372ad4426402d37939b48d8f233154cc637da Signed-off-by: Shawn O. Pearce <spearce@spearce.org>	14 years ago
Jeff Schumacher	31311cacfd	Implemented file path based tie breaking to exact rename detection During the exact rename detection phase in RenameDetector, ties were resolved on a first-found basis. I added support for file path based tie breaking during that phase. Basically, there are four situations that have to be handled: One add matching one delete: In this simple case, we pair them as a rename. One add matching many deletes: Find the delete whose path matches the add the closest, and pair them as a rename. Many adds matching one delete: Similar to the above case, we find the add that matches the delete the closest, and pair them as a rename. The other adds are marked as copies of the delete. Many adds matching many deletes: Build a scoring matrix similar to the one used for content-based matching, scoring instead by file path. Some of the utility functions in SimilarityRenameDetector are used in this case, as we use the same encoding scheme. Once the matrix is built, scan it for the best matches, marking them as renames. The rest are marked as copies. I don't particularly like the idea of using utility functions right out of SimilarityRenameDetector, but it works for the moment. A later commit will likely refactor this into a common utility class, as well as bringing exact rename detection out of RenameDetector and into a separate class, much like SimilarityRenameDetector. Change-Id: I1fb08390aebdcbf20d049aecf402a36506e55611	14 years ago
Jeff Schumacher	9a48de86d8	Added file path similarity to scoring metric in rename detection The scoring method was not taking into account the similarity of the file paths and file names. I changed the metric so that it is 99% based on content (which used to be 100% of the old metric), and 1% based on path similarity. Of that 1%, half (.5% of the total final score) is based on the actual file names (e.g. "foo.java"), and half on the directory (e.g. "src/com/foo/bar/"). Change-Id: I94f0c23bf6413c491b10d5625f6ad7d2ecfb4def	14 years ago
Jeff Schumacher	64b9458640	Added file size based rename detection optimization Prior to this change, files that were very different in size (enough so that they could not have enough in common to be detected as renames) were still having their scores calculated. I added an optimization to skip such files. For example, if the rename detection threshold is 60%, the larger file is 200kb, and the smaller file is 50kb, the pair cannot be counted as a rename since they cannot possibly share 60% of their content in common. (200*.6=120, 120>50) Change-Id: Icd8315412d5de6292839778e7cea7fe6f061b0fc	14 years ago
Shawn O. Pearce	978535b090	Implement similarity based rename detection Content similarity based rename detection is performed only after a linear time detection is performed using exact content match on the ObjectIds. Any names which were paired up during that exact match phase are excluded from the inexact similarity based rename, which reduces the space that must be considered. During rename detection two entries cannot be marked as a rename if they are different types of files. This prevents a symlink from being renamed to a regular file, even if their blob content appears to be similar, or is identical. Efficiently comparing two files is performed by building up two hash indexes and hashing lines or short blocks from each file, counting the number of bytes that each line or block represents. Instead of using a standard java.util.HashMap, we use a custom open hashing scheme similar to what we use in ObjectIdSubclassMap. This permits us to have a very light-weight hash, with very little memory overhead per cell stored. As we only need two ints per record in the map (line/block key and number of bytes), we collapse them into a single long inside of a long array, making very efficient use of available memory when we create the index table. We only need object headers for the index structure itself, and the index table, but not per-cell. This offers a massive space savings over using java.util.HashMap. The score calculation is done by approximating how many bytes are the same between the two inputs (which for a delta would be how much is copied from the base into the result). The score is derived by dividing the approximate number of bytes in common by the length of the larger of the two input files. Right now the SimilarityIndex table should average about 1/2 full, which means we waste about 50% of our memory on empty entries after we are done indexing a file and sort the table's contents. If memory becomes an issue we could discard the table and copy all records over to a new array that is properly sized. Building the index requires O(M + N log N) time, where M is the size of the input file in bytes, and N is the number of unique lines/blocks in the file. The N log N time constraint comes from the sort of the index table that is necessary to perform linear time matching against another SimilarityIndex created for a different file. To actually perform the rename detection, a SxD matrix is created, placing the sources (aka deletions) along one dimension and the destinations (aka additions) along the other. A simple O(S x D) loop examines every cell in this matrix. A SimilarityIndex is built along the row and reused for each column compare along that row, avoiding the costly index rebuild at the row level. A future improvement would be to load a smaller square matrix into SimilarityIndexes and process everything in that sub-matrix before discarding the column dimension and moving down to the next sub-matrix block along that same grid of rows. An optional ProgressMonitor is permitted to be passed in, allowing applications to see the progress of the detector as it works through the matrix cells. This provides some indication of current status for very long running renames. The default line/block hash function used by the SimilarityIndex may not be optimal, and may produce too many collisions. It is borrowed from RawText's hash, which is used to quickly skip out of a longer equality test if two lines have different hash values. We may need to refine this hash in the future, in order to minimize the number of collisions we get on common source files. Based on a handful of test commits in JGit (especially my own recent rename repository refactoring series), this rename detector produces output that is very close to C Git. The content similarity scores are sometimes off by 1%, which is most probably caused by our SimilarityIndex type using a different hash function than C Git uses when it computes the delta size between any two objects in the rename matrix. Bug: 318504 Change-Id: I11dff969e8a2e4cf252636d857d2113053bdd9dc Signed-off-by: Shawn O. Pearce <spearce@spearce.org>	14 years ago
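
The index layout and score described in commit 978535b090 (two ints collapsed into one long, a sorted table, and a score equal to the bytes in common over the length of the larger file) can be sketched roughly as follows. This is a minimal illustration, not JGit's SimilarityIndex: the class PackedIndexSketch, its method names, the key-in-high-bits layout, and the 0 to 100 score scale are assumptions made for the example, and each key is assumed to occur at most once per table.

```java
import java.util.Arrays;

/** Illustrative sketch only; names and bit layout are assumptions, not JGit's code. */
final class PackedIndexSketch {
	/** Pack a line/block hash key and its byte count into one long (key in the high 32 bits). */
	static long pack(int key, int count) {
		return ((long) key << 32) | (count & 0xFFFFFFFFL);
	}

	static int keyOf(long record) {
		return (int) (record >>> 32);
	}

	static int countOf(long record) {
		return (int) record;
	}

	/** Build a sorted table from alternating key, count values. */
	static long[] index(int... keyCountPairs) {
		long[] table = new long[keyCountPairs.length / 2];
		for (int i = 0; i < table.length; i++) {
			table[i] = pack(keyCountPairs[2 * i], keyCountPairs[2 * i + 1]);
		}
		// Signed long order equals signed key order because the key sits in the high bits.
		Arrays.sort(table);
		return table;
	}

	/** Linear merge of two sorted tables: sum the smaller byte count for every shared key. */
	static long common(long[] a, long[] b) {
		long total = 0;
		int i = 0;
		int j = 0;
		while (i < a.length && j < b.length) {
			int ka = keyOf(a[i]);
			int kb = keyOf(b[j]);
			if (ka == kb) {
				total += Math.min(countOf(a[i]), countOf(b[j]));
				i++;
				j++;
			} else if (ka < kb) {
				i++;
			} else {
				j++;
			}
		}
		return total;
	}

	/** Score in percent: bytes in common divided by the size of the larger input. */
	static int score(long[] a, long sizeA, long[] b, long sizeB) {
		long max = Math.max(sizeA, sizeB);
		if (max == 0) {
			return 100; // two empty inputs are treated as identical in this sketch
		}
		return (int) (common(a, b) * 100 / max);
	}
}
```

For example, tables built from the pairs (1, 10), (2, 20), (-5, 7) and (2, 15), (3, 5), (-5, 7) share 22 bytes; with file sizes 37 and 27 the score is 22 * 100 / 37, which truncates to 59.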

21 Commits (4a78d911c578a6f9028d6e74b5668dfc384ef80f)
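
Two of the scoring rules described in the log above lend themselves to a short sketch: the size-based shortcut from commit 64b9458640 and the 99% content / 1% path weighting from commit 9a48de86d8. The class RenameScoreSketch and its method names are hypothetical, scores are taken as percentages from 0 to 100, and how JGit itself measures name and directory similarity is not shown here.

```java
/** Illustrative sketch only; class and method names are hypothetical. */
final class RenameScoreSketch {
	/**
	 * True if the smaller file cannot supply enough bytes for the pair to
	 * reach the threshold, so the content comparison can be skipped.
	 */
	static boolean sizesTooFarApart(long sizeA, long sizeB, int thresholdPercent) {
		long max = Math.max(sizeA, sizeB);
		long min = Math.min(sizeA, sizeB);
		// Equivalent to min < max * threshold / 100, without integer division.
		return min * 100 < max * thresholdPercent;
	}

	/** Weight content at 99%, file name at 0.5% and directory at 0.5%. */
	static double combinedScore(double content, double name, double directory) {
		return 0.99 * content + 0.005 * name + 0.005 * directory;
	}
}
```

With the numbers from the commit message, sizesTooFarApart(200, 50, 60) is true because 200 * 0.6 = 120 exceeds 50, while a 200 and 150 pair would still be compared.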