Jeff Schumacher [Fri, 9 Jul 2010 22:11:54 +0000 (15:11 -0700)]
Added file path similarity to scoring metric in rename detection
The scoring method did not take the similarity of the file paths and
file names into account. The metric is now 99% based on content (which
used to be the entire score) and 1% based on path similarity. Of that
1%, half (0.5% of the total final score) is based on the actual file
name (e.g. "foo.java"), and half on the directory (e.g.
"src/com/foo/bar/").
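As a rough illustration (the helper below is hypothetical, not the actual JGit code), the blended weighting works out to:

  // Hypothetical sketch: combine content and path similarity, each on a 0-100 scale.
  final class BlendedRenameScore {
    static int score(int contentScore, int fileNameScore, int directoryScore) {
      // 99% content, 0.5% file name, 0.5% directory
      return (int) Math.round(contentScore * 0.99
          + fileNameScore * 0.005
          + directoryScore * 0.005);
    }
  }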
Jeff Schumacher [Fri, 9 Jul 2010 19:53:57 +0000 (12:53 -0700)]
Fixed potential div by zero bug
The scoring logic in SimilarityIndex was dividing by the max file
size. If both files are empty, this would cause a division by zero.
This case cannot currently happen, since two empty files would have
the same SHA1 and would therefore be caught by the earlier SHA1-based
detection pass. Still, if this logic is ever separated from that pass,
a division by zero would occur. The logic now considers two empty
files to have a similarity score of 100.
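A minimal sketch of the guard (names here are illustrative assumptions, not the SimilarityIndex internals):

  // Hypothetical: two empty files are treated as identical instead of dividing by zero.
  final class EmptyFileGuard {
    static int contentScore(long commonBytes, long maxFileSize) {
      if (maxFileSize == 0)
        return 100; // both inputs empty
      return (int) (commonBytes * 100 / maxFileSize);
    }
  }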
Jeff Schumacher [Fri, 9 Jul 2010 18:18:50 +0000 (11:18 -0700)]
Added file size based rename detection optimization
Prior to this change, files that were very different in size (so
different that they could not have enough in common to be detected as
renames) still had their scores calculated. I added an optimization to
skip such pairs. For example, if the rename detection threshold is
60%, the larger file is 200kb, and the smaller file is 50kb, the pair
cannot be counted as a rename, since the two cannot possibly share 60%
of their content in common (200 * 0.6 = 120, and 120 > 50).
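The check can be sketched like this (illustrative only; threshold expressed as a percentage):

  // Hypothetical: the pair cannot reach the threshold if the smaller file
  // holds less than threshold% of the larger file's bytes.
  final class SizeFilter {
    static boolean couldBeRename(int thresholdPercent, long sizeA, long sizeB) {
      long larger = Math.max(sizeA, sizeB);
      long smaller = Math.min(sizeA, sizeB);
      return smaller * 100 >= larger * thresholdPercent;
    }
  }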
Jeff Schumacher [Thu, 8 Jul 2010 16:59:36 +0000 (09:59 -0700)]
Create FileHeader from DiffEntry
Added support for converting DiffEntrys to FileHeaders. A FileHeader
is a DiffEntry with a buffer containing the diff output as well as a
list of HunkHeaders, and the HunkHeaders contain EditLists. The
createFileHeader(DiffEntry) method in DiffFormatter performs a Myers
diff on the files referred to by the DiffEntry, then puts the returned
EditList into a single HunkHeader, which is in turn put into the
FileHeader to be returned. It also generates the appropriate diff
header and puts it into the FileHeader's buffer. The rest of the diff
output, which would normally be parsed to generate the HunkHeaders, is
not generated. In fact, the purpose of this method is to avoid the
costly diff output generation and parsing normally required to
create a FileHeader.
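A hedged usage sketch (the method name follows the description above; the surrounding setup is assumed):

  import java.io.IOException;
  import org.eclipse.jgit.diff.DiffEntry;
  import org.eclipse.jgit.diff.DiffFormatter;
  import org.eclipse.jgit.patch.FileHeader;

  // Illustrative only: assumes a DiffFormatter already wired to a repository.
  class FileHeaderExample {
    static FileHeader headerFor(DiffFormatter fmt, DiffEntry entry) throws IOException {
      // Runs a Myers diff on the entry's blobs and wraps the resulting
      // EditList in a single HunkHeader inside the returned FileHeader.
      return fmt.createFileHeader(entry);
    }
  }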
The FollowFilter can be installed on a RevWalk to cause the path
to be updated through rename detection when the affected file is
found to be added to the project.
The filter works reasonably well; for example, we can follow the
history of the fsck command in git-core:
Similar to what we did with diff, implement whitespace ignore options
for log too. This requires us to define some means of creating any
RawText object type at will inside of DiffFormatter, so we define a
new factory interface to construct RawText instances on demand.
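Such a factory could look roughly like this (hypothetical sketch; the real interface name and signature in JGit may differ):

  import org.eclipse.jgit.diff.RawText;

  // Hypothetical factory: lets DiffFormatter construct the desired RawText
  // flavor (plain or whitespace-ignoring) on demand.
  interface RawTextFactory {
    RawText create(byte[] content);
  }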
Unfortunately we have to copy the entire block of common options.
args4j only processes the options/arguments on the one command class
and Java doesn't support multiple inheritance.
Change-Id: Ia16cd3a11b850fffae9fbe7b721d7e43f1d0e8a5 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Passing around the OutputStream and the Repository is crazy. Instead
put the stream in the constructor, since this formatter exists only to
output to the stream, and put the repository as a member variable that
can be optionally set.
Change-Id: I2bad012fee7f40dc1346700ebd19f1e048982878 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Implement rename detection in the command line diff and log commands.
Also support --name-status, -p and -U flags, as these can be quite
useful to view more detail.
All of the Git patch file formatting code is now moved over to the
DiffFormatter class. This permits us to reuse it in any context,
including inside of IDEs.
Change-Id: I687ccba34e18105a07e0a439d2181c323209d96c Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Content similarity based rename detection is performed only after a
linear time pass that matches exact content by ObjectId. Any names
which were paired up during that exact match phase are excluded from
the inexact similarity based rename detection, which reduces the space
that must be considered.
During rename detection two entries cannot be marked as a rename
if they are different types of files. This prevents a symlink from
being renamed to a regular file, even if their blob content appears
to be similar, or is identical.
Efficiently comparing two files is done by building up two hash
indexes, hashing lines or short blocks from each file and counting the
number of bytes that each line or block represents. Instead of using
a standard java.util.HashMap, we use a custom open hashing scheme
similar to what we use in ObjectIdSubclassMap. This permits us to
have a very light-weight hash, with very little memory overhead per
cell stored.
As we only need two ints per record in the map (line/block key and
number of bytes), we collapse them into a single long inside of
a long array, making very efficient use of available memory when
we create the index table. We only need object headers for the
index structure itself, and the index table, but not per-cell.
This offers a massive space savings over using java.util.HashMap.
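The packing trick can be sketched as follows (illustrative helpers, not the actual SimilarityIndex fields):

  // Hypothetical: pack the hash key and the byte count for one line/block
  // into a single long, so the table is just a long[] with no per-cell objects.
  final class PackedEntry {
    static long make(int key, int byteCount) {
      return (((long) key) << 32) | (byteCount & 0xffffffffL);
    }
    static int key(long entry) {
      return (int) (entry >>> 32);
    }
    static int count(long entry) {
      return (int) entry;
    }
  }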
The score calculation is done by approximating how many bytes are the
same between the two inputs (which for a delta would be how much is
copied from the base into the result). The score is the approximate
number of bytes in common divided by the length of the larger of the
two input files.
Right now the SimilarityIndex table should average about 1/2 full,
which means we waste about 50% of our memory on empty entries
after we are done indexing a file and sort the table's contents.
If memory becomes an issue we could discard the table and copy all
records over to a new array that is properly sized.
Building the index requires O(M + N log N) time, where M is the
size of the input file in bytes, and N is the number of unique
lines/blocks in the file. The N log N time constraint comes
from the sort of the index table that is necessary to perform
linear time matching against another SimilarityIndex created for
a different file.
To actually perform the rename detection, an S x D matrix is created,
placing the sources (aka deletions) along one dimension and the
destinations (aka additions) along the other. A simple O(S x D)
loop examines every cell in this matrix.
A SimilarityIndex is built along the row and reused for each
column compare along that row, avoiding the costly index rebuild
at the row level. A future improvement would be to load a smaller
square matrix into SimilarityIndexes and process everything in that
sub-matrix before discarding the column dimension and moving down
to the next sub-matrix block along that same grid of rows.
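A hedged sketch of that traversal (the index types and methods below are stand-ins, not the real SimilarityIndex API):

  // Hypothetical stand-ins for the similarity index described above.
  interface ContentIndex {
    int score(ContentIndex other); // similarity in [0, 100]
  }
  interface ContentIndexer {
    ContentIndex index(byte[] fileContents);
  }

  final class RenameMatrix {
    // Illustrative: the row (source) index is built once and reused across
    // every column in that row; the column index is rebuilt per cell.
    static int[][] score(ContentIndexer indexer, byte[][] sources, byte[][] destinations) {
      int[][] scores = new int[sources.length][destinations.length];
      for (int s = 0; s < sources.length; s++) {
        ContentIndex srcIndex = indexer.index(sources[s]); // once per row
        for (int d = 0; d < destinations.length; d++) {
          ContentIndex dstIndex = indexer.index(destinations[d]);
          scores[s][d] = srcIndex.score(dstIndex);
        }
      }
      return scores;
    }
  }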
An optional ProgressMonitor is permitted to be passed in, allowing
applications to see the progress of the detector as it works through
the matrix cells. This provides some indication of current status
for very long running renames.
The default line/block hash function used by the SimilarityIndex
may not be optimal, and may produce too many collisions. It is
borrowed from RawText's hash, which is used to quickly skip out of
a longer equality test if two lines have different hash codes.
We may need to refine this hash in the future, in order to minimize
the number of collisions we get on common source files.
Based on a handful of test commits in JGit (especially my own
recent rename repository refactoring series), this rename detector
produces output that is very close to C Git. The content similarity
scores are sometimes off by 1%, which is most probably caused by
our SimilarityIndex type using a different hash function than C
Git uses when it computes the delta size between any two objects
in the rename matrix.
Bug: 318504
Change-Id: I11dff969e8a2e4cf252636d857d2113053bdd9dc Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Jeff Schumacher [Thu, 1 Jul 2010 22:30:46 +0000 (15:30 -0700)]
Added a preliminary version of rename detection
JGit does not currently do rename detection during diffs. I added a
class that, given a TreeWalk to iterate over, can output a list of
DiffEntrys for that TreeWalk, taking into account renames. This class
only detects renames by SHA1. More complex rename detection, along
the lines of what C Git does, will be added later.
Jeff Schumacher [Wed, 30 Jun 2010 23:21:49 +0000 (16:21 -0700)]
Refactored code out of FileHeader to facilitate rename detection
Refactored a superclass out of FileHeader called DiffEntry that holds
the more general data from FileHeader that is useful in rename
detection (old/new Ids, modes, names, as well as changeType and
score). FileHeader is now a DiffEntry that adds Hunks, parsing
abilities, etc.
Dmitry Neverov [Wed, 30 Jun 2010 17:46:53 +0000 (10:46 -0700)]
Fix missing flush in StreamCopyThread
It is possible that StreamCopyThread will not flush everything from
its src to its dst. In most cases StreamCopyThread works like this:

  in a loop:
    n = src.read(buf);
    dst.write(buf, 0, n);

and when we want to flush, we interrupt() StreamCopyThread and it
flushes everything it wrote to dst. The problem is that our
interrupt() could interrupt reading. In that case we will flush
everything we already wrote to dst, but not everything that was
written into src, since some of it has not been read yet.
Change-Id: Ifaf4d8be87535c7364dd59b217dfc631460018ff Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Jeff Schumacher [Tue, 29 Jun 2010 23:04:08 +0000 (16:04 -0700)]
Added check for binary files while diffing
Added a check in Diff to ensure that files that are most likely
not text are not line-by-line diffed. Files are determined to be
binary by checking the first 8000 bytes for a null character. This
is a similar heuristic to what C Git uses.
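A minimal sketch of that heuristic (illustrative, not the exact JGit code):

  // Treat content as binary if a NUL byte appears in the first 8000 bytes.
  final class BinaryCheck {
    static boolean looksBinary(byte[] raw) {
      int limit = Math.min(raw.length, 8000);
      for (int i = 0; i < limit; i++) {
        if (raw[i] == 0)
          return true;
      }
      return false;
    }
  }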
Jeff Schumacher [Mon, 28 Jun 2010 20:03:44 +0000 (13:03 -0700)]
Added further support for whitespace ignoring during diff
Added code to support ignoring leading, trailing, and changed
whitespace when performing a diff operation. I also added command
line options to Diff to enable the various whitespace ignoring
methods. These match the flags for git diff.
Jeff Schumacher [Thu, 24 Jun 2010 20:39:50 +0000 (13:39 -0700)]
Added support for whitespace ignoring
JGit did not have support for skipping whitespace when comparing
lines in RawText objects. I added a subclass of RawText that skips
whitespace in its equals and hashCode methods. I used a subclass
rather than adding functionality into RawText so that performance
would not be impacted by extra logic.
This class only supports ignoring all whitespace. Others will follow
that allow other forms of whitespace ignoring.
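The line comparison can be sketched roughly like this (illustrative helper, not the actual RawText subclass):

  // Hypothetical: compare two lines while skipping every whitespace character.
  final class IgnoreAllWhitespace {
    static boolean sameIgnoringWhitespace(String a, String b) {
      int i = 0, j = 0;
      while (i < a.length() && j < b.length()) {
        while (i < a.length() && Character.isWhitespace(a.charAt(i))) i++;
        while (j < b.length() && Character.isWhitespace(b.charAt(j))) j++;
        if (i == a.length() || j == b.length())
          break;
        if (a.charAt(i) != b.charAt(j))
          return false;
        i++;
        j++;
      }
      while (i < a.length() && Character.isWhitespace(a.charAt(i))) i++;
      while (j < b.length() && Character.isWhitespace(b.charAt(j))) j++;
      return i == a.length() && j == b.length();
    }
  }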
Shawn O. Pearce [Mon, 14 Jun 2010 19:37:17 +0000 (12:37 -0700)]
UploadPack: Avoid unnecessary flush in smart HTTP
Under smart HTTP the biDirectionalPipe flag is false, and we return
back immediately at this point in the negotiation process. There is
no need to flush the stream to the client; the request is over and
it will be automatically flushed out by the higher level servlet
that invoked us. Avoiding the flush here allows us to only use flush
after a progress message is sent during pack generation.
Change-Id: Id0c8b7e95e3be6ca4c1b479e096bed6b0283b828 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Shawn O. Pearce [Tue, 22 Jun 2010 23:42:29 +0000 (16:42 -0700)]
Add MutableObjectId.copyFrom(AnyObjectId)
This simplifies the PackIndex code, which is trying to quickly copy
an existing ObjectId into a MutableObjectId. Rather than having
the PackIndex violate the ObjectId's internals, expose a copyFrom
method similar to the existing ones for copying from raw byte arrays
or hex formatted strings.
Change-Id: I142635cbece54af2ab83c58477961ce925dc8255 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Shawn O. Pearce [Sat, 19 Jun 2010 00:38:22 +0000 (17:38 -0700)]
Expose AnyObjectId compareTo(byte[]) and compareTo(int[])
Storage systems can use these implementations to compare a passed
AnyObjectId with a stored representation of an ObjectId in the
canonical network byte order format. This can be useful to do a
binary search, or just linear scan, over an encoded storage file.
Change-Id: I8c72993c4f4c6e98d599ac2c9867453752f25fd2 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Shawn O. Pearce [Wed, 23 Jun 2010 00:07:09 +0000 (17:07 -0700)]
Expose RefWriter constructor taking RefList
An implementation might prefer to use the RefList type here, and
RefList is part of our public API. Expose the constructor so callers
who have a RefList can take advantage of the existing sorting.
Change-Id: I545867f85aa2c479d2d610024ebbe318144709c8 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Shawn O. Pearce [Tue, 22 Jun 2010 23:48:37 +0000 (16:48 -0700)]
Expose RefUpdate constructor to any subclass
When we finally move RefDirectory to the new storage.file package,
its associated RefDirectoryUpdate will need visibility to this
constructor in order to initialize itself. This is true of any
other repository implementation, so make it protected rather than
package level visible.
Change-Id: If838aec9baeb80ee2f12dcbca717657c725a9242 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Shawn O. Pearce [Tue, 22 Jun 2010 23:37:38 +0000 (16:37 -0700)]
isValidRefName: Inline the forbidden ref suffix of ".lock"
A Git reference name must never end with ".lock", as it would
confuse any existing C client that tries to obtain a clone of the
repository over the network. Even if the repository isn't on a
local filesystem, it still should ban that suffix.
Because I plan to move LockFile to storage.file and make it a private
implementation detail of the local file system storage model,
we can't rely on its package level SUFFIX field here. Making it
public probably won't work long-term either, as I also plan to
pull storage.file into its own separate project that depends on
the core library.
So, just inline the constant here. It's as forbidden as ":" is.
Change-Id: If85076861baeacc183b82696375a13e935ba8836 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Shawn O. Pearce [Tue, 22 Jun 2010 23:54:23 +0000 (16:54 -0700)]
Remove pack stream from PackWriterTest
This stream was used only to determine how many bytes had been
written thus far. Except we're always dumping it into a simple
ByteArrayOutputStream, which also knows that. Drop the dependency
on the pack stream and use ByteArrayOutputStream directly.
This lets us later move this test into the new storage.file
package without dragging along the pack stream that is an internal
implementation detail of PackWriter, which is more general than
just the file storage layer.
Change-Id: I291689c0b1ed799270c213ee73b710b2637fb238 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Shawn O. Pearce [Tue, 22 Jun 2010 23:18:22 +0000 (16:18 -0700)]
Remove pointless setOldObjectId in test
Setting this value is pointless, because it's automatically set
by the refs.newUpdate call that created the update operation.
The API is protected by default, because application level code,
including this test, should not be calling it.
Change-Id: I8867a4e8007892e2bd44a05d7dec619081081943 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Shawn O. Pearce [Sat, 19 Jun 2010 01:10:32 +0000 (18:10 -0700)]
Remove speed tests based on mapCommit
The mapCommit API is being deprecated because it doesn't run very
fast. Leaving tests around to test how fast it is relative to C Git
isn't instructive. Remove them, which should help aid the transition
away from the mapCommit API.
Change-Id: I27e1c844610d7da5b2c44b33a00602706973c9cc Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Mathias Kinzler [Mon, 14 Jun 2010 16:03:30 +0000 (18:03 +0200)]
Allow to read configured keys
Currently, there is no way to read the content of the Git
configuration in a way that would allow listing all configured values
generically. This change extends the Config class so that it can
return the list of sections and the list of names for any given
section or subsection.
This is required in order to implement proper configuration handling
in EGit (showing all the content of a given configuration, similar to
"git config -l").
* changes:
git-servlet: Fix comparing uploadFactory with the wrong DISABLED instance
Prefer static inner classes
Override equals for SwingLane since super class PlotLane defines it
Make sure a Stream is closed upon errors in IpLogGenerator
Make constant static in RebuildCommitGraph
Make inner classes static in http code
Cache filemode in GitIndex
Remove unused parent field in PlotLane
Removed unused repo field in WorkDirCheckout
Extend DiffFormatter API to simplify styling
Shawn O. Pearce [Mon, 14 Jun 2010 15:18:47 +0000 (08:18 -0700)]
tools/version.sh: Use backup files on Win32
Windows doesn't permit us to edit a file in-place with Perl.
So create backup files when we perform the edit, and remove them
when we are done. This is a tad slower on POSIX systems, but is
much more portable.
Change-Id: I429c7d698924cb32e709363f5da82f7232bbdab2 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Shawn O. Pearce [Mon, 14 Jun 2010 15:12:48 +0000 (08:12 -0700)]
Merge branch 'stable-0.8'
* stable-0.8:
Qualify post-0.8.4 builds
JGit 0.8.4
JGit 0.8.3
Include about.html in org.eclipse.jgit artifact
Fix build.properties of the JGit feature
Added the standard SULA for JGit
Add "resources/" as a source folder
Chris Aniszczyk [Mon, 7 Jun 2010 21:26:57 +0000 (16:26 -0500)]
Added the standard SULA for JGit
The Eclipse Foundation requires the standard SULA be present
in every feature. We had the license present via edl-v10.html
but we were missing the SULA via the license.html file. The
fix is to simply add the SULA.
Change-Id: I75b43ce098f544b95181755b5cc81a9b1dee6391 Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>
Marc Strapetz [Tue, 20 Apr 2010 19:01:19 +0000 (21:01 +0200)]
Repository can be configured with FS
On Windows, FS_Win32_Cygwin has been used if a Cygwin Git installation
is present in the PATH. Assuming that the user works with the Cygwin
Git installation may result in unnecessary overhead if they actually
do not. Applications built on top of JGit may have more knowledge
about the Git client actually in use (Cygwin or not) and hence should
be able to configure which FS to use accordingly.
Robin Rosenberg [Mon, 24 May 2010 19:19:59 +0000 (21:19 +0200)]
Add support for computing a Change-Id à la Gerrit
A Change-Id helps tools like Gerrit Code Review to keep different
versions of a patch together. The Change-Id is computed as a SHA-1
hash over some of the same basic information as a commit id. It is
computed on the first commit intended to solve a particular problem
and then reused for updated solutions.
Change-Id: I04334f84e76e83a4185283cb72ea0308b1cb4182 Signed-off-by: Robin Rosenberg <robin.rosenberg@dewire.com>
Refactor ReadTreeTest to allow testing other checkout classes
ReadTreeTest contains a lot of useful tests for "checkout"
implementations. But ReadTreeTest was hardcoded to test only
WorkDirCheckout. This change doesn't add/modify any tests semantically
but refactors ReadTreeTest so that different implementations of
checkout can be tested. This was done to allow DirCacheCheckout to be
tested without rewriting all these tests.
Change-Id: I36e34264482b855ed22c9dde98824f573cf8ae22 Signed-off-by: Christian Halstrick <christian.halstrick@sap.com> Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
Shawn O. Pearce [Fri, 28 May 2010 22:06:29 +0000 (15:06 -0700)]
eclipse-iplog: Use contribution rather than bug element
Wayne changed the schema to no longer be dependent upon the Bugzilla
notion of a contribution, but instead be more generic and better
support systems like Gerrit Code Review. Update our output to
use the <contribution> element and include a link to the change
in Gerrit.
Change-Id: Ibc8a436918bd8e7597dc17743824201a74bce09b Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Shawn O. Pearce [Fri, 28 May 2010 21:30:27 +0000 (14:30 -0700)]
eclipse-ipzilla: Correctly parse result with empty last field
If the last field of our IPzilla query comes back empty, we were
skipping over and not including it in the result List, causing an
IndexOutOfBoundsException when it was read into our data model.
If the last field is empty, actually add the empty string.
Change-Id: Ib18b335990c73e036b185199d0004f4ffc395867 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Shawn O. Pearce [Thu, 27 May 2010 02:00:06 +0000 (19:00 -0700)]
Don't use interruptable pread() to access pack files
The J2SE NIO APIs require that FileChannel close the underlying file
descriptor if a thread is interrupted while it is inside of a read or
write operation on that channel. This is insane, because it means we
cannot share the file descriptor between threads. If a thread is in
the middle of the FileChannel variant of IO.readFully() and it
receives an interrupt, the pack will be automatically closed on us.
This causes the other threads trying to use that same FileChannel to
receive IOExceptions, which leads to the pack getting marked as
invalid. Once the pack is marked invalid, JGit loses access to its
entire contents and starts to report MissingObjectExceptions.
Because PackWriter must ensure that the chosen pack file stays
available until the current object's data is fully copied to the
output, JGit cannot simply reopen the pack when its automatically
closed due to an interrupt being sent at the wrong time. The pack may
have been deleted by a concurrent `git gc` process, and that open file
descriptor might be the last reference to the inode on disk. Once it's
closed, the PackWriter loses access to that object representation, and
it cannot complete sending the object to the client.
Fortunately, RandomAccessFile's readFully method does not have this
problem. Interrupts during readFully() are ignored. However, it
requires us to first seek to the offset we need to read, then issue
the read call. This requires locking around the file descriptor to
prevent concurrent threads from moving the pointer before the read.
This reduces the concurrency level, as now only one window can be
paged in at a time from each pack. However, the WindowCache should
already be holding most of the pages required to handle the working
set for a process, and its own internal locking was already limiting
us on the number of concurrent loads possible. Provided that most
concurrent accesses are getting hits in the WindowCache, or are for
different repositories on the same server, we shouldn't see a major
performance hit due to the more serialized loading.
I would have preferred to use a pool of RandomAccessFiles for each
pack, with threads borrowing an instance dedicated to that thread
whenever they needed to page in a window. This would permit much
higher levels of concurrency by using multiple file descriptors (and
file pointers) for each pack. However the code became too complex to
develop in any reasonable period of time, so I've chosen to retrofit
the existing code with more serialization instead.
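The serialized read can be sketched like this (illustrative, not the actual WindowCache/PackFile code):

  import java.io.IOException;
  import java.io.RandomAccessFile;

  // Hypothetical: one shared file pointer per pack, so seek + read must be atomic.
  final class PackReader {
    private final RandomAccessFile file;

    PackReader(RandomAccessFile file) {
      this.file = file;
    }

    void read(long position, byte[] dst, int off, int len) throws IOException {
      synchronized (file) {
        // Unlike FileChannel, an interrupt does not close a RandomAccessFile mid-read.
        file.seek(position);
        file.readFully(dst, off, len);
      }
    }
  }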
Bug: 308945
Change-Id: I2e6e11c6e5a105e5aef68871b66200fd725134c9 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Stefan Lay [Thu, 20 May 2010 13:09:39 +0000 (15:09 +0200)]
Add a merge command to the jgit API
Merges the current head with one other commit.
In this first iteration the merge command supports
only fast forward and already up-to-date.
Change-Id: I0db480f061e01b343570cf7da02cac13a0cbdf8f Signed-off-by: Stefan Lay <stefan.lay@sap.com> Signed-off-by: Christian Halstrick <christian.halstrick@sap.com> Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>
The CommitCommand should take care to create a merge commit if the file
$GIT_DIR/MERGE_HEAD exists. It should then read the parents for the merge
commit out of this file. It should also take care, when committing a
merge and no commit message was specified, to read the message from
$GIT_DIR/MERGE_MSG.
Finally the CommitCommand should remove these files if the commit
succeeded.
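A hedged sketch of reading that state (illustrative only; the real CommitCommand would go through JGit's own repository helpers rather than raw file reads):

  import java.io.File;
  import java.io.IOException;
  import java.nio.charset.StandardCharsets;
  import java.nio.file.Files;
  import java.util.ArrayList;
  import java.util.List;

  final class MergeState {
    // One parent commit id per line of $GIT_DIR/MERGE_HEAD, if it exists.
    static List<String> mergeParents(File gitDir) throws IOException {
      File mergeHead = new File(gitDir, "MERGE_HEAD");
      List<String> parents = new ArrayList<String>();
      if (mergeHead.isFile())
        for (String line : Files.readAllLines(mergeHead.toPath(), StandardCharsets.UTF_8))
          if (line.trim().length() > 0)
            parents.add(line.trim());
      return parents;
    }

    // Default commit message for the merge, if $GIT_DIR/MERGE_MSG exists.
    static String mergeMessage(File gitDir) throws IOException {
      File mergeMsg = new File(gitDir, "MERGE_MSG");
      if (!mergeMsg.isFile())
        return null;
      return new String(Files.readAllBytes(mergeMsg.toPath()), StandardCharsets.UTF_8);
    }
  }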
Change-Id: I4e292115085099d5b86546d2021680cb1454266c Signed-off-by: Christian Halstrick <christian.halstrick@sap.com>
Sasa Zivkov [Wed, 19 May 2010 14:59:28 +0000 (16:59 +0200)]
Externalize strings from JGit
The strings are externalized into the root resource bundles.
The resource bundles are stored under the new "resources" source
folder to get a proper Maven build.
Strings from tests are, in general, not externalized; they were
externalized only where necessary to make a test pass. This was
typically the case where e.getMessage() was used in an assert and the
exception message had changed slightly due to reuse of the
externalized strings.
Change-Id: Ic0f29c80b9a54fcec8320d8539a3e112852a1f7b Signed-off-by: Sasa Zivkov <sasa.zivkov@sap.com>
Shawn O. Pearce [Sun, 16 May 2010 00:01:49 +0000 (17:01 -0700)]
Fix SSH deadlock during OutOfMemoryError
In the close() method of SshFetchConnection and SshPushConnection,
errorThread.join() can wait forever if JSch will not close the
channel's error stream. Join with a timeout, and interrupt the copy
thread if it's blocked on data that will never arrive.
Dmitry Neverov [Wed, 19 May 2010 18:39:17 +0000 (11:39 -0700)]
Fix race condition in StreamCopyThread
If we get an interrupt during an IO operation (src.read or dst.write)
caused by the flush() method incrementing the flush counter, ensure
we restart the proper section of code. Just ignore the interrupt
and continue running.
Bug: 313082
Change-Id: Ib2b37901af8141289bbac9807cacf42b4e2461bd Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Shawn O. Pearce [Sun, 16 May 2010 02:10:47 +0000 (19:10 -0700)]
Remove unnecessary truncation of in-pack size during copy
The number of bytes to copy was truncated to an int, but the
pack's copyToStream() method expected to be passed a long here.
Pass through the long so we don't truncate a giant object.
Change-Id: I0786ad60a3a33f84d8746efe51f68d64e127c332 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Shawn O. Pearce [Sun, 16 May 2010 00:51:03 +0000 (17:51 -0700)]
Reduce the size of PackWriter's ObjectToPack instances
Rather than holding onto the PackedObjectLoader, only hold the
PackFile and the object offset. During a reuse copy that is all
we should need to complete a reuse, and the other parts of the
PackedObjectLoader just waste memory.
This change reduces the per-object memory usage of a PackWriter by
32 bytes on a 32 bit JVM using only OFS_DELTA formatted objects.
The savings is even larger (by another 20 bytes) for REF_DELTAs.
This is close to a 50% reduction in the size of ObjectToPack,
making it rather worthwhile to do.
Beyond the memory reduction, this change will help to make future
refactoring work easier. We need to redo the API used to support
copying data, and disconnecting it from the PackedObjectLoader is
a good first step.
Change-Id: I24ba4e621e101f14e79a16463aec5379f447aa9b Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Shawn O. Pearce [Sun, 16 May 2010 00:37:14 +0000 (17:37 -0700)]
Reduce size of PackedObjectLoader by dropping long to int
Rather than keep track of both the position of the object, and the
position of its data, just keep track of the number of bytes used
by the object's header in the pack. This shaves 4 bytes out of the
size of the PackedObjectLoader instances.
We also can defer the addition instruction to the materialize()
operation, avoiding it entirely if the caller never actually uses
the loader. This may be relevant for PackWriter invocations,
where only 1 loader gets chosen for a given object, even though
the object may appear on disk in more than one pack file.
Error reporting is now simplified, as we can rely on the object
offset rather than its data offset. This is the value displayed
by pack debugging tools like `git verify-pack -v`, so it's better
to use that in our own errors.
Because nobody needs getDataOffset() now, we can drop that from
the public API.
Change-Id: Ic639c0d5a722315f4f5c8ffda6e26643d90e5f42 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Shawn O. Pearce [Sat, 15 May 2010 23:18:44 +0000 (16:18 -0700)]
Factor out duplicate Inflater setup in WindowCursor
Since we use this code twice, pull it into a private method. Let
the compiler/JIT worry about whether or not this logic should be
inlined into the call sites.
Change-Id: Ia44fb01e0328485bcdfd7af96835d62b227a0fb1 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Shawn O. Pearce [Fri, 14 May 2010 22:02:31 +0000 (15:02 -0700)]
Squash OffsetCache into WindowCache
Originally when I wrote this code I had hoped to use OffsetCache
to also implement the UnpackedObjectCache. But it turns out they
need rather different code, and it just wasn't worth trying to
reuse the OffsetCache base class.
Before doing any major refactoring or code cleanups here, squash the
two classes together and delete OffsetCache. As WindowCache is our
only subclass, this is pretty simple to do. We also get a minor
code reduction due to less duplication between the two classes,
and the JIT should be able to do a better job of optimization here
as we can define types up front rather than relying on generics
that erase back to java.lang.Object.
Change-Id: Icac8bda01260e405899efabfdd274928e98f3521 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Shawn O. Pearce [Fri, 14 May 2010 21:03:32 +0000 (14:03 -0700)]
Avoid unnecessary second read on OBJ_OFS_DELTA headers
When we read the object header we copy 20 bytes from the pack data,
then start parsing out the type and the inflated size. For most
objects, this is only going to require 3 bytes, which is sufficient
to represent objects with inflated sizes of up to 2^16. The local
buffer however still has 17 bytes remaining in it, and that can be
used to satisfy the OBJ_OFS_DELTA header.
We shouldn't need to worry about walking off the end of the buffer
here, because delta offsets cannot be larger than 64 bits, and that
requires only 9 bytes in the OFS_DELTA encoding.
Assuming worst-case scenarios of 9 bytes for the OFS_DELTA encoding,
the pack file itself must be approaching 2^64 bytes, an infeasible
size to store on any current technology. However, even if this
were the case we still have 11 bytes for the type/size header.
In that encoding we can represent an object as large as 2^74 bytes,
which is also an infeasible size to process in JGit.
So drop the second read here.
The data offsets we pass into the ObjectLoaders being constructed
need to be computed individually now. This saves a local variable,
but pushes the addition operation into each branch of the switch.
Change-Id: I6cf64697a9878db87bbf31c7636c03392b47a062 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Shawn O. Pearce [Thu, 13 May 2010 17:23:33 +0000 (10:23 -0700)]
Fix hang when fetching over SSH
JSch may hang or abort with the timeout if JGit connects before it
has obtained the streams. Instead defer the connect() call until
after the streams have been configured.
Shawn O. Pearce [Thu, 13 May 2010 16:56:15 +0000 (09:56 -0700)]
Fix interrupted write in StreamCopyThread
If a flush() gets delivered at the same time that we are blocking
while writing to an interruptible stream, the copy thread will
abort, assuming it's a stream error. Instead ignore the interrupt,
and retry the write.
Change-Id: Icbf62d1b8abe0fabbb532dbee088020eecf4c6c2 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Dmitry Neverov [Thu, 13 May 2010 16:50:44 +0000 (09:50 -0700)]
Fix missing flush in StreamCopyThread
It is possible to miss flush() invocation in StreamCopyThread.
In this case some data will not be sent to remote host and we will
wait forever (or until timeout) in src.read().
Use a counter to keep track of the flush requests.
Change-Id: Ia818be9b109a1674d9e2a9c78e125ab248cfb75b Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Jonathan Gossage [Tue, 11 May 2010 23:06:28 +0000 (18:06 -0500)]
Fix Maven Javadoc generation problem
There is a serious problem with the Maven Javadoc plugin. Please see
http://jira.codehaus.org/browse/MJAVADOC-275
for details. This problem is fixed by using maven-javadoc-plugin v2.7
instead of maven-javadoc-plugin v2.6.1.
Matthias Sohn [Tue, 11 May 2010 12:35:26 +0000 (14:35 +0200)]
Expose org.eclipse.jgit.junit via jgit p2 repository
EGit Tycho builds on build.eclipse.org frequently hit corrupted artifacts
which leads to broken builds. Cleaning up these corrupted files is tedious
since it requires file system access on the build server. Hence we want to
switch to use job-local m2 repositories. This requires that build artifacts
are shared between the jgit and egit build jobs via p2. Therefore the
bundle org.eclipse.jgit.junit needs to be exposed via p2 repository.
Change-Id: I0ccd7763eede117cb68240fdd25f13d6e6f6a2c1 Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
Add builder-style API to jgit and Commit & Log cmd
Added a new package org.eclipse.jgit.api and a builder-style API for
jgit. Also added the first implementation for two git commands: Commit
and Log.
This API is intended to be used by external components when
functionalities of the standard git commands are required. It will also
help to ease writing JGit tests.
For internal usages this API may often not be optimal because the git
commands are doing much more than required or they expect parameters
of an inappropriate type.
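A hedged usage sketch of the builder style (exact command and setter names may differ from this illustration):

  import org.eclipse.jgit.api.Git;
  import org.eclipse.jgit.lib.Repository;
  import org.eclipse.jgit.revwalk.RevCommit;

  final class ApiExample {
    static void commitAndLog(Repository repo) throws Exception {
      Git git = new Git(repo);
      git.commit().setMessage("describe the change here").call();
      for (RevCommit c : git.log().call())
        System.out.println(c.getShortMessage());
    }
  }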
Change-Id: I71ac4839ab9d2f848307eba9252090c586b4146b Signed-off-by: Christian Halstrick <christian.halstrick@sap.com>