author	Shawn O. Pearce <spearce@spearce.org>
	Fri, 12 Nov 2010 19:56:40 +0000 (11:56 -0800)
committer	Shawn O. Pearce <spearce@spearce.org>
	Fri, 12 Nov 2010 19:57:02 +0000 (11:57 -0800)
commit	0e307a6afddbb564ea6c34b3766d749f80e4442a
tree	48470013d89b0eabc42360e192ea6ad06ba34a70
parent	d63887127e20c0a70c53c48a9aa5ffbdb1cf8873

SimilarityIndex: Don't overflow internal counter fields

The counter portion of each pair is only 32 bits wide, but is part
of a larger 64 bit integer.  If a file is larger than 4 GB, the
counter can overflow into the key, changing the hash and later
producing an incorrect similarity score.

Guard against this overflow condition by capping the count for each
record at 2^32-1.  If any record contains more than that many bytes
the table aborts hashing and throws TableFullException.
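
The capping rule above can be sketched roughly as follows.  This is
a minimal illustration, not the code from this change: the class
name, the helper methods, and the placement of the key in the upper
32 bits are assumptions, and the exception is unchecked here only to
keep the sketch short.

```java
// Hypothetical sketch of a (key, count) pair packed into one long.
// All names are illustrative; this is not the actual JGit source.
public class PackedCounter {
    // Assumed layout: key in the upper 32 bits, count in the lower 32.
    static final int KEY_SHIFT = 32;

    // Largest count that fits in the counter portion: 2^32 - 1.
    static final long MAX_COUNT = (1L << KEY_SHIFT) - 1;

    // Stand-in for the exception named in the text above.
    static class TableFullException extends RuntimeException {
        private static final long serialVersionUID = 1L;
    }

    // Pack a key and a count, refusing counts that would not fit.
    static long pair(int key, long count) {
        if (count < 0 || count > MAX_COUNT)
            throw new TableFullException();
        return (((long) key) << KEY_SHIFT) | count;
    }

    static int keyOf(long pair) {
        return (int) (pair >>> KEY_SHIFT);
    }

    static long countOf(long pair) {
        return pair & MAX_COUNT;
    }

    // Add bytes to a record; the check in pair() aborts instead of
    // letting the count carry into the key bits.
    static long add(long pair, long bytes) {
        return pair(keyOf(pair), countOf(pair) + bytes);
    }
}
```

Without the check, adding 1 to a pair whose count is already
MAX_COUNT would carry into the key bits and silently turn key 7 into
key 8 with a count of 0.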

This permits the index to scan and work on files that exceed 4 GB
in size, but only if the file contains more than one unique block.
The index throws TableFullException on a 4 GB file containing all
zeros, but should succeed on a 6 GB file containing unique lines.

The index now uses a 64 bit accumulator during the common scoring
algorithm, possibly resulting in slower summations.  However, this
index already depends heavily on 64 bit integer operations being
efficient, so the extra cost should be small, and widening the
accumulator from 32 bits to 64 bits allows us to correctly handle
6 GB files.
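
To show why the accumulator width matters, here is a simplified
sketch.  It assumes the per-record counts of two files are already
aligned by key in two arrays, which is a simplification and not
necessarily how the index stores or walks its records; the class and
method names are hypothetical.

```java
// Hypothetical sketch: summing shared byte counts with a 64 bit
// accumulator.  Names and data layout are assumptions made for
// illustration only.
public class CommonSum {
    // Each element is a count of at most 2^32 - 1, held in a long.
    // The shared amount for a record is the smaller of the two counts.
    static long common(long[] a, long[] b) {
        long sum = 0; // 64 bit accumulator; an int could wrap here
        for (int i = 0; i < a.length && i < b.length; i++)
            sum += Math.min(a[i], b[i]);
        return sum;
    }

    public static void main(String[] args) {
        // Two records of 3,000,000,000 bytes each: every count fits
        // in 32 unsigned bits, but the 6,000,000,000 total does not.
        long[] src = { 3_000_000_000L, 3_000_000_000L };
        long[] dst = { 3_000_000_000L, 3_000_000_000L };
        long total = common(src, dst);
        System.out.println("64 bit sum: " + total);
        System.out.println("narrowed to int: " + (int) total);
    }
}
```

The narrowed value differs from the true total, which is the kind of
error a 32 bit accumulator would introduce on a file of that size.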

Change-Id: I14e6dbc88d54ead19336a4c0c25eae18e73e6ec2
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>