mirrors/jgit - jgit - source @ dussan.org

Commit Graph

Author	SHA1	Message	Date
Han-Wen Nienhuys	f3ec7cf3f0	Remove further unnecessary 'final' keywords Remove it from * package private functions. * try blocks * for loops this was done with the following python script: $ cat f.py import sys import re import os def replaceFinal(m): return m.group(1) + "(" + m.group(2).replace('final ', '') + ")" methodDecl = re.compile(r"^([\t ][a-zA-Z_ ]+)$([^)])$") def subst(fn): input = open(fn) os.rename(fn, fn + "~") dest = open(fn, 'w') for l in input: l = methodDecl.sub(replaceFinal, l) dest.write(l) dest.close() for root, dirs, files in os.walk(".", topdown=False): for f in files: if not f.endswith('.java'): continue full = os.path.join(root, f) print full subst(full) Change-Id: If533a75a417594fc893e7c669d2c1f0f6caeb7ca Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>	6 years ago
David Pursehouse	5c70be0085	Open auto-closeable resources in try-with-resource When an auto-closeable resources is not opened in try-with-resource, the warning "should be managed by try-with-resource" is emitted by Eclipse. Fix the ones that can be silenced simply by moving the declaration of the variable into a try-with-resource. In cases where we explicitly call the close() method, for example in tests where we are testing specific behavior caused by the close(), suppress the warning. Leave the ones that will require more significant refcactoring to fix. They can be done in separate commits that can be reviewed and tested in isolation. Change-Id: I9682cd20fb15167d3c7f9027cecdc82bc50b83c4 Signed-off-by: David Pursehouse <david.pursehouse@gmail.com>	6 years ago
Matthias Sohn	4d8233f237	Fix javadoc org.eclipse.jgit diff package Change-Id: I7162d72916abc8533ad37e8b17335ff4a70d6519 Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>	6 years ago
David Pletcher	5e57cc9585	Enable public access to SimilarityIndex scoring function The SimilarityIndex class implements the useful capability of scoring the similarity between two files. That capability is required for a feature that's being developed in another package, to detect files derived from a set of potential sources. This CL adds a public factory method to create a SimilarityIndex from an ObjectLoader. It grants public access to the SimilarityIndex class, the score method, an inner exception class and a special marker instance of that exception class. Change-Id: I3f72670da643be3bb8e261c5af5e9664bcd0401b Signed-off-by: David Pletcher <dpletcher@google.com>	9 years ago
Marc Strapetz	1cb5668441	Rename detection should canonicalize line endings Native Git canonicalizes line endings when detecting renames, more specifically it replaces CRLF by LF. See: hash_chars in diffcore-delta.c Bug: 449545 Change-Id: Iec2aab12ae9e67074cccb7fbd4d9defe176a0130 Signed-off-by: Marc Strapetz <marc.strapetz@syntevo.com> Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>	9 years ago
Shawn O. Pearce	0e307a6afd	SimilarityIndex: Don't overflow internal counter fields The counter portion of each pair is only 32 bits wide, but is part of a larger 64 bit integer. If the file size was larger than 4 GB the counter could overflow and impact the key, changing the hash, and later resulting in an incorrect similarity score. Guard against this overflow condition by capping the count for each record at 2^32-1. If any record contains more than that many bytes the table aborts hashing and throws TableFullException. This permits the index to scan and work on files that exceed 4 GB in size, but only if the file contains more than one unique block. The index throws TableFullException on a 4 GB file containing all zeros, but should succeed on a 6 GB file containing unique lines. The index now uses a 64 bit accumulator during the common scoring algorithm, possibly resulting in slower summations. However this index is already heavily dependent upon 64 bit integer operations being efficient, so increasing from 32 bits to 64 bits allows us to correctly handle 6 GB files. Change-Id: I14e6dbc88d54ead19336a4c0c25eae18e73e6ec2 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>	13 years ago
Shawn O. Pearce	d63887127e	SimilarityIndex: Accept files larger than 8 MB Files bigger than 8 MB (2^23 bytes) tended to overflow the internal hashtable, as the table was capped in size to 2^17 records. If a file contained 2^17 unique data blocks/lines, the table insertion got stuck in an infinite loop as the able couldn't grow, and there was no open slot for the new item. Remove the artifical 2^17 table limit and instead allow the table to grow to be as big as 2^30. With a 64 byte block size, this permits hashing inputs as large as 64 GB. If the table reaches 2^30 (or cannot be allocated) hashing is aborted. RenameDetector no longer tries to break a modify file pair, and it does not try to match the file for rename or copy detection. Change-Id: Ibb4d756844f4667e181e24a34a468dc3655863ac Signed-off-by: Shawn O. Pearce <spearce@spearce.org>	13 years ago
Shawn O. Pearce	f3b511568b	SimilarityIndex: Correct comment explaining the logic This comment was wrong, due to a copy-and-paste error. Here the code is looking at records of dst that do not exist in src, and are skipping past them to find another match. Change-Id: I07c1fba7dee093a1eeffcf7e0c7ec85446777ffb Signed-off-by: Shawn O. Pearce <spearce@spearce.org>	13 years ago
Shawn O. Pearce	11f99fecfd	Reduce content hash function collisions The hash code returned by RawTextComparator (or that is used by the SimilarityIndex) play an important role in the speed of any algorithm that is based upon them. The lower the number of collisions produced by the hash function, the shorter the hash chains within hash tables will be, and the less likely we are to fall into O(N^2) runtime behaviors for algorithms like PatienceDiff. Our prior hash function was absolutely horrid, so replace it with the proper definition of the DJB hash that was originally published by Professor Daniel J. Bernstein. To support this assertion, below is a table listing the maximum number of collisions that result when hashing the unique lines in each source code file of 3 randomly chosen projects: test_jgit: 931 files; 122 avg. unique lines/file Algorithm \| Collisions -------------+----------- prior_hash 418 djb 5 sha1 6 string_hash31 11 test_linux26: 30198 files; 258 avg. unique lines/file Algorithm \| Collisions -------------+----------- prior_hash 8675 djb 32 sha1 8 string_hash31 32 test_frameworks_base: 8381 files; 184 avg. unique lines/file Algorithm \| Collisions -------------+----------- prior_hash 4615 djb 10 sha1 6 string_hash31 13 We can clearly see that prior_hash performed very poorly, resulting in 8,675 collisions (elements in the same hash bucket) for at least one file in the Linux kernel repository. This leads to some very bad O(N) style insertion and lookup performance, even though the hash table was sized to be the next power-of-2 larger than the total number of unique lines in the file. The djb hash we are replacing prior_hash with performs closer to SHA-1 in terms of having very few collisions. This indicates it provides a reasonably distributed output for this type of input, despite being a much simpler algorithm (and therefore will be much faster to execute). The string_hash31 function is provided just to compare results with, it is the algorithm commonly used by java.lang.String hashCode(). However, life isn't quite this simple. djb produces a 32 bit hash code, but our hash tables are always smaller than 2^32 buckets. Mashing the 32 bit code into an array index used to be done by simply taking the lower bits of the hash code by a bitwise and operator. This unfortuntely still produces many collisions, e.g. 32 on the linux-2.6 repository files. From [1] we can apply a final "cleanup" step to the hash code to mix the bits together a little better, and give priority to the higher order bits as they include data from more bytes of input: test_jgit: 931 files; 122 avg. unique lines/file Algorithm \| Collisions -------------+----------- prior_hash 418 djb 5 djb + cleanup 6 test_linux26: 30198 files; 258 avg. unique lines/file Algorithm \| Collisions -------------+----------- prior_hash 8675 djb 32 djb + cleanup 7 test_frameworks_base: 8381 files; 184 avg. unique lines/file Algorithm \| Collisions -------------+----------- prior_hash 4615 djb 10 djb + cleanup 7 This is a massive improvement, as the number of collisions for common inputs drops to acceptable levels, and we haven't really made the hash functions any more complex than they were before. [1] http://lkml.org/lkml/2009/10/27/404 Change-Id: Ia753b695de9526a157ddba265824240bd05dead1 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>	13 years ago
Jeff Schumacher	e64cb03065	Fixed bug in scoring mechanism for rename detection A bug in rename detection would cause file scores to be wrong. The bug was due to the way rename detection would judge the similarity between files. If file A has three lines containing 'foo', and file B has 5 lines containing 'foo', the rename detection phase should record that A and B have three lines in common (the minimum of the number of times that line appears in both files). Instead, it would choose the the number of times the line appeared in the destination file, in this case file B. I fixed the bug by having the SimilarityIndex instead choose the minimum number, as it should. I also added a test case to verify that the bug had been fixed. Change-Id: Ic75272a2d6e512a361f88eec91e1b8a7c2298d6b	14 years ago
Jeff Schumacher	9a48de86d8	Added file path similarity to scoring metric in rename detection The scoring method was not taking into account the similarity of the file paths and file names. I changed the metric so that it is 99% based on content (which used to be 100% of the old metric), and 1% based on path similarity. Of that 1%, half (.5% of the total final score) is based on the actual file names (e.g. "foo.java"), and half on the directory (e.g. "src/com/foo/bar/"). Change-Id: I94f0c23bf6413c491b10d5625f6ad7d2ecfb4def	14 years ago
Jeff Schumacher	4c14b7869d	Fixed potential div by zero bug The scoring logic in SimilarityIndex was dividing by the max file size. If both files are empty, this would cause a div by zero error. This case cannot currently happen, since two empty files would have the same SHA1, and would therefore be caught in the earlier SHA1 based detection pass. Still, if this logic eventually gets separated from that pass, a div by zero error would occur. I changed the logic to instead consider two empty files to have a similarity score of 100. Change-Id: Ic08e18a066b8fef25bb5e7c62418106a8cee762a	14 years ago
Shawn O. Pearce	978535b090	Implement similarity based rename detection Content similarity based rename detection is performed only after a linear time detection is performed using exact content match on the ObjectIds. Any names which were paired up during that exact match phase are excluded from the inexact similarity based rename, which reduces the space that must be considered. During rename detection two entries cannot be marked as a rename if they are different types of files. This prevents a symlink from being renamed to a regular file, even if their blob content appears to be similar, or is identical. Efficiently comparing two files is performed by building up two hash indexes and hashing lines or short blocks from each file, counting the number of bytes that each line or block represents. Instead of using a standard java.util.HashMap, we use a custom open hashing scheme similiar to what we use in ObjecIdSubclassMap. This permits us to have a very light-weight hash, with very little memory overhead per cell stored. As we only need two ints per record in the map (line/block key and number of bytes), we collapse them into a single long inside of a long array, making very efficient use of available memory when we create the index table. We only need object headers for the index structure itself, and the index table, but not per-cell. This offers a massive space savings over using java.util.HashMap. The score calculation is done by approximating how many bytes are the same between the two inputs (which for a delta would be how much is copied from the base into the result). The score is derived by dividing the approximate number of bytes in common into the length of the larger of the two input files. Right now the SimilarityIndex table should average about 1/2 full, which means we waste about 50% of our memory on empty entries after we are done indexing a file and sort the table's contents. If memory becomes an issue we could discard the table and copy all records over to a new array that is properly sized. Building the index requires O(M + N log N) time, where M is the size of the input file in bytes, and N is the number of unique lines/blocks in the file. The N log N time constraint comes from the sort of the index table that is necessary to perform linear time matching against another SimilarityIndex created for a different file. To actually perform the rename detection, a SxD matrix is created, placing the sources (aka deletions) along one dimension and the destinations (aka additions) along the other. A simple O(S x D) loop examines every cell in this matrix. A SimilarityIndex is built along the row and reused for each column compare along that row, avoiding the costly index rebuild at the row level. A future improvement would be to load a smaller square matrix into SimilarityIndexes and process everything in that sub-matrix before discarding the column dimension and moving down to the next sub-matrix block along that same grid of rows. An optional ProgressMonitor is permitted to be passed in, allowing applications to see the progress of the detector as it works through the matrix cells. This provides some indication of current status for very long running renames. The default line/block hash function used by the SimilarityIndex may not be optimal, and may produce too many collisions. It is borrowed from RawText's hash, which is used to quickly skip out of a longer equality test if two lines have different hash functions. We may need to refine this hash in the future, in order to minimize the number of collisions we get on common source files. Based on a handful of test commits in JGit (especially my own recent rename repository refactoring series), this rename detector produces output that is very close to C Git. The content similarity scores are sometimes off by 1%, which is most probably caused by our SimilarityIndex type using a different hash function than C Git uses when it computes the delta size between any two objects in the rename matrix. Bug: 318504 Change-Id: I11dff969e8a2e4cf252636d857d2113053bdd9dc Signed-off-by: Shawn O. Pearce <spearce@spearce.org>	14 years ago

15 Commits (f3ec7cf3f0436a79e252251a31dbc62694555897)