Parsing the size from a packed object header was incorrectly computing
the total inflated length when the length exceeded the range of a Java
int. The next 7 bits of size information were shifted left as an int
using a shift of 25 bits, placing the higher bits of the size into the
sign position. When this size was extended to a long to be added to
the current size accumulator the size went negative, resulting in
NegativeArraySizeException being thrown.
Fix all places where this particular pattern of code is used to read a
pack size field, or a binary delta header, as they both use the same
variable length encoding scheme.
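The failure mode can be reproduced outside of JGit. The following is a minimal sketch, not the actual JGit code; the class and method names are made up for illustration. It decodes the variable length size field twice, once shifting the 7-bit group as an int and once after widening to long.

```java
final class PackHeaderSize {
    /** Old pattern: the 7-bit group is shifted as an int before widening. */
    static long decodeWithIntShift(byte[] hdr) {
        int p = 0;
        int c = hdr[p++] & 0xff;
        long size = c & 0x0f; // the first byte carries the low 4 bits
        int shift = 4;
        while ((c & 0x80) != 0) { // high bit set: another byte follows
            c = hdr[p++] & 0xff;
            size += (c & 0x7f) << shift; // int arithmetic, bit 31 is the sign
            shift += 7;
        }
        return size;
    }

    /** Fixed pattern: widen the 7-bit group to long first, then shift. */
    static long decodeWithLongShift(byte[] hdr) {
        int p = 0;
        int c = hdr[p++] & 0xff;
        long size = c & 0x0f;
        int shift = 4;
        while ((c & 0x80) != 0) {
            c = hdr[p++] & 0xff;
            size += ((long) (c & 0x7f)) << shift;
            shift += 7;
        }
        return size;
    }

    public static void main(String[] args) {
        // Blob header encoding an inflated size of 2^31 bytes.
        byte[] hdr = { (byte) 0xB0, (byte) 0x80, (byte) 0x80, (byte) 0x80, 0x40 };
        System.out.println(decodeWithIntShift(hdr));
        System.out.println(decodeWithLongShift(hdr));
    }
}
```

For the sample header the int variant yields -2147483648 and the long variant yields 2147483648, since the group at shift 25 has its top bit land on bit 31.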
Change-Id: I04008728ed828f18202652c3d5401cf95a441d0a
In practice the DHT storage layer has not been performing as well as
large scale server environments want to see from a Git server.
The performance of the DHT schema degrades rapidly as small changes
are pushed into the repository due to the chunk size being less than
1/3 of the pushed pack size. Small chunks cause poor prefetch
performance during reading, and require significantly longer prefetch
lists inside of the chunk meta field to work around the small size.
The DHT code is very complex (>17,000 lines of code) and is very
sensitive to the underlying database round-trip time, as well as the
way objects were written into the pack stream that was chunked and
stored on the database. A poor pack layout (from any version of C Git
prior to Junio reworking it) can cause the DHT code to be unable to
enumerate the objects of the linux-2.6 repository in a completable
time scale.
Performing a clone from a DHT stored repository of 2 million objects
takes 2 million row lookups in the DHT to locate the OBJECT_INDEX row
for each object being cloned. This is very difficult for some DHTs to
scale: even at 5000 rows/second the lookup stage alone takes 6 minutes
(on local filesystem, this is almost too fast to bother measuring).
Some servers like Apache Cassandra just fall over and cannot complete
the 2 million lookups in rapid fire.
On a ~400 MiB repository, the DHT schema has an extra 25 MiB of
redundant data that gets downloaded to the JGit process, and that is
before you consider the cost of the OBJECT_INDEX table also being
fully loaded, which is at least 223 MiB of data for the linux kernel
repository. In the DHT schema answering a `git clone` of the ~400 MiB
linux kernel needs to load 248 MiB of "index" data from the DHT, in
addition to the ~400 MiB of pack data that gets sent to the client.
This is 193 MiB more data to be accessed than the native filesystem
format, but it needs to come over a much smaller pipe (local Ethernet
typically) than the local SATA disk drive.
I also never got around to writing the "repack" support for the DHT
schema, as it turns out to be fairly complex to safely repack data in
the repository while also trying to minimize the amount of changes
made to the database, due to very common limitations on database
mutation rates.
This new DFS storage layer fixes a lot of those issues by taking the
simple approach for storing relatively standard Git pack and index
files on an abstract filesystem. Packs are accessed by an in-process
buffer cache, similar to the WindowCache used by the local filesystem
storage layer. Unlike the local file IO, there are some assumptions
that the storage system has relatively high latency and no concept of
"file handles". Instead it looks at the file more like HTTP byte range
requests, where a read channel is simply a thunk to trigger a read
request over the network.
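As a rough illustration of the access model, here is a sketch of a block cache sitting in front of a range-request style reader. None of these names come from this change; RangeReader, BlockCacheSketch and the fixed block size are assumptions made up for the example, and a real cache would also need eviction and thread safety.

```java
import java.util.HashMap;
import java.util.Map;

final class BlockCacheSketch {
    /** Hypothetical stand-in for remote storage: one call models one range request. */
    interface RangeReader {
        byte[] read(long position, int length);
    }

    private final RangeReader reader;
    private final long fileLength;
    private final int blockSize;
    private final Map<Long, byte[]> blocks = new HashMap<>();
    int remoteReads; // number of range requests actually issued

    BlockCacheSketch(RangeReader reader, long fileLength, int blockSize) {
        this.reader = reader;
        this.fileLength = fileLength;
        this.blockSize = blockSize;
    }

    /** Copy length bytes starting at position, loading aligned blocks on demand. */
    byte[] get(long position, int length) {
        if (position < 0 || length < 0 || position + length > fileLength)
            throw new IllegalArgumentException("range outside of file");
        byte[] out = new byte[length];
        int done = 0;
        while (done < length) {
            long pos = position + done;
            long blockStart = pos - (pos % blockSize);
            byte[] block = blocks.get(blockStart);
            if (block == null) {
                int n = (int) Math.min(blockSize, fileLength - blockStart);
                block = reader.read(blockStart, n); // the only place the "network" is touched
                blocks.put(blockStart, block);
                remoteReads++;
            }
            int off = (int) (pos - blockStart);
            int n = Math.min(length - done, block.length - off);
            System.arraycopy(block, off, out, done, n);
            done += n;
        }
        return out;
    }
}
```

There is no open file handle held between calls here; every miss is an independent positional read, which is the property the high latency assumption relies on.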
The DFS code in this change is still abstract; it does not store on
any particular filesystem, but is fairly well suited to Amazon S3
or Apache Hadoop HDFS. Storing packs directly on HDFS rather than
HBase removes a layer of abstraction, as most HBase row reads turn
into an HDFS read.
Most of the DFS code in this change was blatantly copied from the
local filesystem code. Most parts should be refactored to be shared
between the two storage systems, but right now I am hesitant to do
this due to how well tuned the local filesystem code currently is.
Change-Id: Iec524abdf172e9ec5485d6c88ca6512cd8a6eafb
The 'Counting objects' phase of PackWriter requires good hit rates
from the DeltaBaseCache while walking trees; the deltas need to find
their bases in the cache in order to inflate in a reasonable time.
If JGit is running in a multi-threaded server, such as Gerrit Code
Review, each thread needs its own DeltaBaseCache to prevent one thread
from evicting the other thread's relevant bases. Move the cache to be
per-ObjectReader, lazily allocated when required by a PackFile.
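The ownership change amounts to something like the sketch below. ReaderCacheSketch and BaseCache are placeholder names for illustration only, not the real ObjectReader or DeltaBaseCache classes.

```java
import java.util.HashMap;
import java.util.Map;

final class ReaderCacheSketch {
    /** Toy cache of inflated delta bases, keyed by pack offset. */
    static final class BaseCache {
        final Map<Long, byte[]> byOffset = new HashMap<>();
    }

    private BaseCache baseCache; // stays null until a pack read needs it

    /** Each reader owns its cache, so one thread cannot evict another's bases. */
    BaseCache getBaseCache() {
        if (baseCache == null)
            baseCache = new BaseCache();
        return baseCache;
    }
}
```

Because a reader is assumed to be used by one thread at a time, the lazy allocation needs no locking in this sketch.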
Change-Id: If9d5ed06728e813632ae96dcfb811f4860b276e8
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Instead of computing this on every request, compute it once and
hold onto the result. This improves performance for LocalCachedPack
which does a lot of tests against the pack name string.
Change-Id: I3803745e3a5dda7b5f0faf39aae9423e2c777e7f
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
When I disabled validation I broke the code that handled copying small
objects whose contents were below 8192 bytes in size but spanned over
the end of one window and into the next window. These objects did not
ever populate the temporary write buffer, resulting in garbage writing
into the output stream instead of valid object contents.
Change-Id: Ie26a2aaa885d0eee4888a9b12c222040ee4a8562
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
If object reuse validation is enabled, the output pack is probably
going to be stored locally. When reusing an existing cached pack
to save object enumeration costs, ensure the cached pack has not
been corrupted by checking its SHA-1 trailer. If it has, writing
will abort and the output pack won't be complete. This prevents
anyone from trying to use the output pack, and catches corruption
before it can be carried any further.
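A pack ends with a 20 byte SHA-1 of everything that precedes it, so the check itself is simple. The sketch below works on an in-memory byte array for brevity, which is an assumption made for the example; a real implementation would stream the pack. The class and method names are illustrative.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

final class PackTrailerCheck {
    /** True if the last 20 bytes equal the SHA-1 of all bytes before them. */
    static boolean trailerMatches(byte[] pack) {
        if (pack.length < 20)
            return false;
        byte[] stored = Arrays.copyOfRange(pack, pack.length - 20, pack.length);
        return MessageDigest.isEqual(sha1(pack, pack.length - 20), stored);
    }

    /** Test helper: returns body followed by its SHA-1 trailer. */
    static byte[] appendTrailer(byte[] body) {
        byte[] out = Arrays.copyOf(body, body.length + 20);
        System.arraycopy(sha1(body, body.length), 0, out, body.length, 20);
        return out;
    }

    private static byte[] sha1(byte[] buf, int len) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            md.update(buf, 0, len);
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }
}
```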
Change-Id: If89d0d4e429d9f4c86f14de6c0020902705153e6
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
PackWriter: Avoid CRC-32 validation when feeding IndexPack
There is no need to validate the object contents during
copyObjectAsIs if the result is going to be parsed by unpack-objects
or index-pack. Both programs will compute the SHA-1 of the object,
and also validate most of the pack structure. For git-daemon-like
servers, this work is already done on the client end of the
connection, so the server doesn't need to repeat that work itself.
Disable object validation for the 3 transport cases where we know
the remote side will handle object validation for us (push, bundle
creation, and upload pack). This improves performance on the server
side by reducing the work that must be done.
Change-Id: Iabb78eec45898e4a17f7aab3fb94c004d8d69af6
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
The most expensive part of packing a repository for transport to
another system is enumerating all of the objects in the repository.
Once this gets to the size of the linux-2.6 repository (1.8 million
objects), enumeration can take several CPU minutes and costs a lot
of temporary working set memory.
Teach PackWriter to efficiently reuse an existing "cached pack"
by answering a clone request with a thin pack followed by a larger
cached pack appended to the end. This requires the repository
owner to first construct the cached pack by hand, and record the
tip commits inside of $GIT_DIR/objects/info/cached-packs:
  cd $GIT_DIR
  root=$(git rev-parse master)
  tmp=objects/.tmp-$$
  names=$(echo $root | git pack-objects --keep-true-parents --revs $tmp)
  for n in $names; do
    chmod a-w $tmp-$n.pack $tmp-$n.idx
    touch objects/pack/pack-$n.keep
    mv $tmp-$n.pack objects/pack/pack-$n.pack
    mv $tmp-$n.idx objects/pack/pack-$n.idx
  done
  (echo "+ $root";
   for n in $names; do echo "P $n"; done;
   echo) >>objects/info/cached-packs
  git repack -a -d
When a clone request needs to include $root, the corresponding
cached pack will be copied as-is, rather than enumerating all of
the objects that are reachable from $root.
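Reading the list back is straightforward. The sketch below assumes only what the script above writes: "+ " lines naming tips, "P " lines naming packs, and a blank line ending each group. It is not the parser used by JGit, and the class names are made up.

```java
import java.util.ArrayList;
import java.util.List;

final class CachedPacksList {
    static final class Entry {
        final List<String> tips = new ArrayList<>();
        final List<String> packNames = new ArrayList<>();
    }

    /** Parse the text of a cached-packs style file into its groups. */
    static List<Entry> parse(String text) {
        List<Entry> out = new ArrayList<>();
        Entry cur = null;
        for (String line : text.split("\n")) {
            if (line.isEmpty()) { // blank line closes the current group
                if (cur != null) {
                    out.add(cur);
                    cur = null;
                }
                continue;
            }
            if (cur == null)
                cur = new Entry();
            if (line.startsWith("+ "))
                cur.tips.add(line.substring(2));
            else if (line.startsWith("P "))
                cur.packNames.add(line.substring(2));
            // Any other line type is ignored by this sketch.
        }
        if (cur != null)
            out.add(cur);
        return out;
    }
}
```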
For a linux-2.6 kernel repository that should be about 376 MiB,
the above process creates two packs of 368 MiB and 38 MiB[1].
This is a local disk usage increase of ~26 MiB, due to reduced
delta compression between the large cached pack and the smaller
recent activity pack. The overhead is similar to 1 full copy of
the compressed project sources.
With this cached pack in hand, JGit daemon completes a clone request
in about 1m16s less time, but with a slightly larger data transfer (+2.39 MiB):
Before:
remote: Counting objects: 1861830, done
remote: Finding sources: 100% (1861830/1861830)
remote: Getting sizes: 100% (88243/88243)
remote: Compressing objects: 100% (88184/88184)
Receiving objects: 100% (1861830/1861830), 376.01 MiB | 19.01 MiB/s, done.
remote: Total 1861830 (delta 4706), reused 1851053 (delta 1553844)
Resolving deltas: 100% (1564621/1564621), done.
real 3m19.005s
After:
remote: Counting objects: 1601, done
remote: Counting objects: 1828460, done
remote: Finding sources: 100% (50475/50475)
remote: Getting sizes: 100% (18843/18843)
remote: Compressing objects: 100% (7585/7585)
remote: Total 1861830 (delta 2407), reused 1856197 (delta 37510)
Receiving objects: 100% (1861830/1861830), 378.40 MiB | 31.31 MiB/s, done.
Resolving deltas: 100% (1559477/1559477), done.
real 2m2.938s
Repository owners can periodically refresh their cached packs by
repacking their repository, folding all newer objects into a larger
cached pack. Since repacking is already considered to be a normal
Git maintenance activity, this isn't a very big burden.
[1] In this test $root was set back about two weeks.
Change-Id: Ib87131d5c4b5e8c5cacb0f4fe16ff4ece554734b
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Instead of using the current thread's stack to recurse through the
delta chain, use a linked list that is stored in the heap. This
permits any thread to load a deep delta chain without running out
of thread stack space.
Despite needing to allocate a stack entry object for each delta
visited along the chain being loaded, the object allocation count is
kept the same as in the prior version by removing the transient
ObjectLoaders from the intermediate objects accessed in the chain.
Instead the byte[] for the raw data is passed, and null is used as a
magic value to signal isLarge() and enter the large object code path.
Like the old version, this implementation minimizes the amount of
memory that must be live at once. The current delta instruction
sequence, the base it applies onto, and the result are the only live
data arrays. As each level is processed, the prior base is discarded
and replaced with the new result.
Each Delta frame on the stack is slightly larger than the standard
ObjectLoader.SmallObject type that was used before, however the Delta
instances should be smaller than the old method stack frames, so total
memory usage should actually be lower with this new implementation.
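The shape of the loop is roughly as follows. This is a sketch under heavy simplification: the "delta" here just appends bytes to its base, standing in for real delta instructions, and objects are looked up by array index rather than pack offset. Stored, Frame and load are illustrative names.

```java
import java.util.Arrays;

final class DeltaChainSketch {
    static final class Stored {
        final int baseId;  // -1 for a whole object
        final byte[] data; // whole content, or the bytes the toy delta appends
        Stored(int baseId, byte[] data) {
            this.baseId = baseId;
            this.data = data;
        }
    }

    /** One heap allocated stack entry per delta visited along the chain. */
    private static final class Frame {
        final Stored delta;
        final Frame next;
        Frame(Stored delta, Frame next) {
            this.delta = delta;
            this.next = next;
        }
    }

    static byte[] load(Stored[] store, int id) {
        // Walk toward the base, pushing each delta onto a linked list.
        Frame top = null;
        Stored cur = store[id];
        while (cur.baseId >= 0) {
            top = new Frame(cur, top);
            cur = store[cur.baseId];
        }
        // Unwind: only the current base, one delta and the result are live.
        byte[] result = cur.data;
        for (Frame f = top; f != null; f = f.next)
            result = applyToyDelta(result, f.delta.data);
        return result;
    }

    private static byte[] applyToyDelta(byte[] base, byte[] suffix) {
        byte[] out = Arrays.copyOf(base, base.length + suffix.length);
        System.arraycopy(suffix, 0, out, base.length, suffix.length);
        return out;
    }
}
```

The chain depth is bounded by heap, not by thread stack size, because neither loop recurses.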
Change-Id: I6faca2a440020309658ca23fbec4c95aa637051c
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
If the object type is a whole object and all we want is the type,
there is no need to skip the length header. The type is already known
and can be returned as-is. Instead skip the length header only for
the two delta formats, where the delta base must itself be scanned.
Change-Id: I87029258e88924b3e5850bdd6c9006a366191d10
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
This variable was not used for anything, but Eclipse's JDT failed to
notice because of the "shift += " operation within the body of the
while loop. Here we don't need the shift because we do not decode the
length, but we do have to skip over the bytes that store the length to
locate the delta base.
Bug: 331319
Change-Id: I200a874fd7e39e3adf2640b8cd0f53dcf91ef4c9
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
CC: Remy Suen <remysuen@ca.ibm.com>
Fixes the "Method ignores results of InputStream.read()" warning.
This is the only place where read() was used instead of readFully()
and the return value was not checked. So it was either an oversight
or should be documented. This change assumes it was an oversight.
Change-Id: I859404a7d80449c538a552427787f3e57d7c92b4
Increase core.streamFileThreshold default to 50 MiB
Projects like org.eclipse.mdt contain large XML files about 6 MiB
in size. So does the Android project platform/frameworks/base.
Doing a clone of either project with JGit takes forever to checkout
the files into the working directory, because delta decompression
tends to be very expensive as we need to constantly reposition the
base stream for each copy instruction. This can be made worse by
a very bad ordering of offsets, possibly due to an XML editor that
doesn't preserve the order of elements in the file very well.
Increasing the threshold to the same limit PackWriter uses when
doing delta compression (50 MiB) permits a default configured
JGit to decompress these XML file objects using the faster
random-access arrays, rather than re-seeking through an inflate
stream, significantly reducing checkout time after a clone.
Since this new limit may be dangerously close to the JVM maximum
heap size, every allocation attempt is now wrapped in a try/catch
so that JGit can degrade by switching to the large object stream
mode when the allocation is refused. It will run slower, but the
operation will still complete.
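The degrade path looks roughly like the sketch below. The helper name and the use of null to mean "fall back to streaming" are assumptions for illustration, not the exact JGit code.

```java
final class AllocationFallback {
    /**
     * Try to allocate a buffer for an object of the given size.
     * Returns null when the caller should use the large object stream path.
     */
    static byte[] tryAllocate(long size) {
        if (size < 0 || size > Integer.MAX_VALUE)
            return null; // cannot be represented as a Java array at all
        try {
            return new byte[(int) size];
        } catch (OutOfMemoryError tooBig) {
            return null; // the JVM refused; stream instead of failing
        }
    }
}
```

A caller would check for null and switch to the slower streaming mode, so the operation completes either way.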
The large stream mode will run very well for big objects that aren't
delta compressed, and is acceptable for delta compressed objects that
are using only forward referencing copy instructions. Copies using
prior offsets are still going to be horrible, and there is nothing
we can do about it except increase core.streamFileThreshold.
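To tell which case a given delta falls into, one can scan its copy instructions. The sketch below reads "forward referencing" as "each copy starts at or after the end of the previous copy", which is my interpretation rather than a definition from the code. It follows the standard Git delta layout (two size varints, then copy and insert commands). The class name is made up.

```java
final class DeltaCopyOrder {
    /** True if no copy instruction reaches back before the end of the prior copy. */
    static boolean copiesOnlyMoveForward(byte[] delta) {
        int p = 0;
        // Skip the base size and result size varints in the delta header.
        for (int i = 0; i < 2; i++) {
            while ((delta[p++] & 0x80) != 0) {
                // keep consuming continuation bytes
            }
        }
        long lastEnd = 0;
        while (p < delta.length) {
            int cmd = delta[p++] & 0xff;
            if ((cmd & 0x80) != 0) { // copy from base
                long off = 0;
                int len = 0;
                if ((cmd & 0x01) != 0) off |= (delta[p++] & 0xffL);
                if ((cmd & 0x02) != 0) off |= (delta[p++] & 0xffL) << 8;
                if ((cmd & 0x04) != 0) off |= (delta[p++] & 0xffL) << 16;
                if ((cmd & 0x08) != 0) off |= (delta[p++] & 0xffL) << 24;
                if ((cmd & 0x10) != 0) len |= (delta[p++] & 0xff);
                if ((cmd & 0x20) != 0) len |= (delta[p++] & 0xff) << 8;
                if ((cmd & 0x40) != 0) len |= (delta[p++] & 0xff) << 16;
                if (len == 0)
                    len = 0x10000;
                if (off < lastEnd)
                    return false; // would force the base stream to rewind
                lastEnd = off + len;
            } else if (cmd != 0) { // insert cmd literal bytes
                p += cmd;
            } else {
                throw new IllegalArgumentException("unsupported delta command 0");
            }
        }
        return true;
    }
}
```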
We might in the future want to consider changing the way the delta
generators work in JGit and native C Git to avoid prior offsets once
an object reaches a certain size, even if that causes the delta
instruction stream to be slightly larger. Unfortunately native
C Git won't want to do that until it's also able to stream objects
rather than malloc them as contiguous blocks.
Change-Id: Ief7a3896afce15073e80d3691bed90c6a3897307
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>
This class is used only to cache the unpacked form of an object that
was used as a base for another object. The theory goes that if an
object is used as a delta base for A, it will probably also be a
delta base for B, C, D, E, etc. and therefore having an unpacked copy
of it on hand will make delta resolution for the others very fast.
However since objects are usually only accessed once, we don't want
to cache everything we unpack, just things that we are likely to
need again. The only things we need again are the delta bases.
Hence, it's a delta base cache.
This gets us the class name UnpackedObjectCache back, so we can
use it to actually create a cache of unpacked object information.
Change-Id: I121f356cf4eca7b80126497264eac22bd5825a1d
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
We miscomputed the CRC32 checksum for a REF_DELTA type of object, by
not including the full 20 byte ObjectId of the delta base in the CRC
code we use when the delta is too large to go through our two faster
small reuse code paths. This resulted in a corruption error during
packing, where the PackFile erroneously suspected the data was wrong
on the local filesystem and aborted writing, because the CRC didn't
match what we had read from the index.
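The CRC stored in the index covers the whole packed entry, so for a REF_DELTA that means header, base ObjectId and compressed data. A minimal sketch with illustrative names, showing both the correct computation and the faulty one that skips the base id:

```java
import java.util.zip.CRC32;

final class RefDeltaCrc {
    /** CRC over the full entry: type/size header, 20 byte base id, deflated delta. */
    static long ofEntry(byte[] header, byte[] baseId, byte[] deflated) {
        CRC32 crc = new CRC32();
        crc.update(header, 0, header.length);
        crc.update(baseId, 0, baseId.length);
        crc.update(deflated, 0, deflated.length);
        return crc.getValue();
    }

    /** The faulty variant: the base id is left out of the checksum. */
    static long withoutBase(byte[] header, byte[] deflated) {
        CRC32 crc = new CRC32();
        crc.update(header, 0, header.length);
        crc.update(deflated, 0, deflated.length);
        return crc.getValue();
    }
}
```

Comparing the second value against the index makes a perfectly good object look corrupt.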
Change-Id: I7d12cdaeaf2c83ddc11223ce0108d9bd6886e025
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
ObjectReader implementations are now responsible for creating the
unique abbreviation of an ObjectId, or for resolving an abbreviation
back to its full form. In this latter case the reader can offer up
multiple candidates to the caller, who may be able to disambiguate
them based on context.
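In the simplest possible form the two operations look like the sketch below, which works on hex strings in a list. That representation, and the names resolve and abbreviate, are assumptions for the example; a real reader would search its pack indexes instead of scanning a list.

```java
import java.util.ArrayList;
import java.util.List;

final class AbbreviationSketch {
    /** Return every known id that starts with the abbreviation. */
    static List<String> resolve(List<String> knownIds, String abbrev) {
        List<String> matches = new ArrayList<>();
        for (String id : knownIds) {
            if (id.startsWith(abbrev))
                matches.add(id);
        }
        return matches;
    }

    /** Shortest prefix of id, at least minLength long, that matches only id. */
    static String abbreviate(List<String> knownIds, String id, int minLength) {
        for (int len = minLength; len < id.length(); len++) {
            String prefix = id.substring(0, len);
            if (resolve(knownIds, prefix).size() == 1)
                return prefix;
        }
        return id; // no shorter unique form exists
    }
}
```

When resolve returns more than one candidate, it is up to the caller to pick one or report the ambiguity.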
Repository.resolve() doesn't take multiple candidates into account
right now, but it could in the future by looking for a remaining
^0 or ^{commit} suffix and take an expansion if there is only one
commit that matches the input abbreviation. It could also use
the distance from an annotated tag to resolve "tag-NNN-gcommit"
style strings that are often output by `git describe`.
Change-Id: Icd3250adc8177ae05278b858933afdca0cbbdb56
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
This is an informational function used by PackWriter to help it
better organize objects for delta compression. Storage systems
can implement it to provide more detailed size information,
or they can simply rely on the default behavior that uses the
ObjectLoader obtained from open.
For local file storage, we can obtain this information faster
through specialized routines that parse a pack object header.
Change-Id: I13a09b4effb71ea5151b51547f7d091564531e58
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Now that any large object is forced through a streaming loader
when it's bigger than getStreamFileThreshold(), and that threshold
is pegged at Integer.MAX_VALUE as its largest size, we will never
be able to reach this code path where we threw OutOfMemoryError.
Robin pointed out that we probably should include a message here,
but the code is effectively unreachable, so there isn't any value
in adding a message at this point.
So remove it.
Change-Id: Ie611d005622e38a75537f1350246df0ab89dd500
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Callers don't necessarily need the getSize() result from a large
delta. They should instead always be using openStream() or copyTo()
for blobs going to local files, or they should be checking the
result of the constant-time isLarge() method to determine the type
of access they can use on the ObjectLoader. Avoid inflating the
delta instruction stream twice by delaying the decoding of the size
until after we have created the DeltaStream and decoded the header.
Likewise with the type, callers don't necessarily always need it
to be present in an ObjectLoader. Delay looking at it as late as
we can, thereby avoiding an ugly O(N^2) loop looking up the type
for every single object in the entire delta chain.
Change-Id: I6487b75b52a5d201d811a8baed2fb4fcd6431320
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Use core.streamFileThreshold to set our streaming limit
We default this to 1 MiB for now, but we allow users to modify
it through the Repository's configuration file to be a different
value. A new repository listener is used to identify when the
setting has been updated and trigger a reconfiguration of any
active ObjectReaders.
To prevent a horrible explosion we cap core.streamFileThreshold
at no more than 1/4 of the maximum JVM heap size. We do this
because we need at least 2 byte arrays equal in size to the
stream threshold for the worst case delta inflation scenario,
and our host application probably also needs some amount of the
heap for their working set size.
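The cap itself is a one-liner. A sketch, with made-up names, that also keeps the result within the range of an int since the threshold is used to size byte arrays:

```java
final class StreamThresholdCap {
    /** Limit the configured threshold to 1/4 of the heap and to Integer.MAX_VALUE. */
    static int cap(long configured, long maxHeapBytes) {
        long limit = Math.min(maxHeapBytes / 4, Integer.MAX_VALUE);
        return (int) Math.max(0, Math.min(configured, limit));
    }

    /** Same, using the running JVM's maximum heap size. */
    static int capForThisJvm(long configured) {
        return cap(configured, Runtime.getRuntime().maxMemory());
    }
}
```

With a 64 MiB heap, for example, a configured value of 50 MiB would be reduced to 16 MiB.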
Change-Id: I103b3a541dc970bbf1a6d92917a12c5a1ee34d6c
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Very large delta instruction streams, or deltas which use very large
base objects, are now streamed through as large objects rather than
being inflated into a byte array.
This isn't the most efficient way to access delta encoded content, as
we may need to rewind and reprocess the base object when there was a
block moved within the file, but it will at least prevent the JVM from
having its heap explode.
When streaming a delta we have an inflater open for each level in the
delta chain, to inflate the instruction set of the delta, as well as
an inflater for the base level object. The base object is buffered,
as is the top level delta requested by the application, but we do not
buffer the intermediate delta streams. This keeps memory usage lower,
so it's closer to 1024 bytes per level in the chain, without having an
adverse impact on raw throughput as the top-level buffer gets pushed
down to the lowest stream that has the next region.
Delta instructions transparently collapse here: if the top level does
not copy a region from its base, the base won't materialize that part
from its own base, etc. This allows us to avoid copying around a lot
of segments which have been deleted from the final version.
Change-Id: I724d45245cebb4bad2deeae7b896fc55b2dd49b3
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Similar to the loose object support, whole packed objects can
now be streamed back to the caller. The streaming is less
efficient as we copy the data from the cached window array
into the InflaterInputStream's internal buffer, then inflate
it there before returning to the application.
Like with unpacked objects, there is plenty of room for some
optimization, especially for the copyTo method, where we don't
necessarily need so much buffering to exist.
Change-Id: Ie23be81289e37e24b91d17b0891e47b9da988008
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Replace PackedObjectLoader with ObjectLoader.SmallObject
The class is identical, but ObjectLoader.SmallObject is part of our
public API for storage implementations to build on top of.
Change-Id: I381a3953b14870b6d3d74a9c295769ace78869dc
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Similar to what we did with the file code, move the pack writer
into its own package so the related classes and their package
private methods are hidden from the rest of the library.
Change-Id: Ic1b5c7c8c8d266e90c910d8d68dfc8e93586854f
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
We no longer need an ObjectLoader to be lazy and try to delay
the materialization of the object content. That was done only
to support PackWriter searching for a good reuse candidate.
Instead, simplify the code base by doing the materialization
immediately when the loader asks for it, because any caller
asking for the loader is going to need the content.
Change-Id: Id867b1004529744f234ab8f9cfab3d2c52ca3bd0
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
These were only used by PackWriter to help it filter object
representations. Their only user disappeared when we rewrote the
object selection code path to use the new representation type.
Change-Id: I9ed676bfe4f87fcf94aa21e53bda43115912e145
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Tighten up local packed object representation during packing
Rather than making a loader, and then using that to fill the object
representation, parse the header and set up our data directly.
This saves some time, as we don't waste cycles on information we
won't use right now.
The weight computed for a representation is now its actual stored
size in the pack file, rather than its inflated size. This accounts
for changes made when the compression level is modified on the
repository. It is however more costly to determine the weight of
the object, since we have to find its length in the pack. To try and
recover that cost we now cache the length as part of our ObjectToPack
record, so it doesn't have to be found during the output phase.
A LocalObjectToPack now costs us (assuming 32 bit pointers):
                     (32 bit)   (64 bit)
  vm header:          8 bytes    8 bytes
  ObjectId:          20 bytes   20 bytes
  PackedObjectInfo:  12 bytes   12 bytes
  ObjectToPack:       8 bytes   12 bytes
  LocalOTP:          20 bytes   24 bytes
                    ---------  ---------
                     68 bytes   74 bytes
Change-Id: I923d2736186eb2ac8ab498d3eb137e17930fcb50
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Move FileRepository to storage.file.FileRepository
This move isolates all of the local file specific implementation code
into a single package, where their package-private methods and support
classes are properly hidden away from the rest of the core library.
Because of the sheer number of files impacted, I have limited this
change to only the renames and the updated imports.
Change-Id: Icca4884e1a418f83f8b617d0c4c78b73d8a4bd17
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Objects that fall completely within a single window can be worked
with in a zero-copy fashion, provided that the window is backed by
a normal byte[] and not by a ByteBuffer.
This works for a surprising number of objects. The default window
size is 8 KiB, but most deltas are quite a bit smaller than that.
Objects smaller than 1/2 of the window size have a very good chance
of falling completely within a window's array, which means we can
work with them without copying their data around.
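The containment test that selects the zero-copy path is cheap. A sketch, using a made-up WindowSketch class in place of the real byte[] backed window type:

```java
final class WindowSketch {
    final long start;   // pack offset of array[0]
    final byte[] array; // the window's backing array

    WindowSketch(long start, byte[] array) {
        this.start = start;
        this.array = array;
    }

    /**
     * If [pos, pos + len) lies wholly inside this window, return the index
     * into array where it begins; otherwise return -1 so the caller copies
     * through a temporary buffer instead.
     */
    int indexIfContained(long pos, int len) {
        long end = start + array.length;
        if (pos < start || pos + len > end)
            return -1;
        return (int) (pos - start);
    }
}
```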
Larger objects, or objects which are unlucky enough to span over a
window boundary, get copied through the temporary buffer. We pay
a tiny penalty to realize we can't use the zero-copy code path,
but it's easier than trying to keep track of two adjacent windows.
With this change (as well as everything preceding it), packing
is actually a bit faster. Some crude benchmarks based on cloning
linux-2.6.git (~324 MiB, 1,624,785 objects) over localhost using
C git client and JGit daemon show we get better throughput, and
slightly better times:
   Total Time   |        Throughput
 (old)   (now)  |    (old)         (now)
----------------+---------------------------
 2m45s   2m37s  | 12.49 MiB/s   21.17 MiB/s
 2m42s   2m36s  | 16.29 MiB/s   22.63 MiB/s
 2m37s   2m31s  | 16.07 MiB/s   21.92 MiB/s
Change-Id: I48b2c8d37f08d7bf5e76c5a8020cde4a16ae3396
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Output of selected reuses is refactored to use a new ObjectReuseAsIs
interface that extends the ObjectReader. This interface allows the
reader to control how it performs the reuse into the output stream,
but also allows it to throw an exception to request the writer to
find a different candidate representation.
The PackFile reuse code was overhauled, cleaning up the APIs so they
aren't exposed in the object loader, but instead are now a single
method on the PackFile itself. The reuse algorithm was changed to do
a data verification pass, followed by the copy pass to the output.
This permits us to work around a corrupt object in a pack file by
seeking another copy of that object when this one is bad.
The reuse code was also optimized for the common case, where the
in-pack representation is under 16 KiB. In these smaller cases
data is sent to the pack writer more directly, avoiding some copying.
Change-Id: I6350c2b444118305e8446ce1dfd049259832bcca
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Don't use interruptible pread() to access pack files
The J2SE NIO APIs require that FileChannel close the underlying file
descriptor if a thread is interrupted while it is inside of a read or
write operation on that channel. This is insane, because it means we
cannot share the file descriptor between threads. If a thread is in
the middle of the FileChannel variant of IO.readFully() and it
receives an interrupt, the pack will be automatically closed on us.
This causes the other threads trying to use that same FileChannel to
receive IOExceptions, which leads to the pack getting marked as
invalid. Once the pack is marked invalid, JGit loses access to its
entire contents and starts to report MissingObjectExceptions.
Because PackWriter must ensure that the chosen pack file stays
available until the current object's data is fully copied to the
output, JGit cannot simply reopen the pack when it's automatically
closed due to an interrupt being sent at the wrong time. The pack may
have been deleted by a concurrent `git gc` process, and that open file
descriptor might be the last reference to the inode on disk. Once it's
closed, the PackWriter loses access to that object representation, and
it cannot complete sending the object to the client.
Fortunately, RandomAccessFile's readFully method does not have this
problem. Interrupts during readFully() are ignored. However, it
requires us to first seek to the offset we need to read, then issue
the read call. This requires locking around the file descriptor to
prevent concurrent threads from moving the pointer before the read.
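The resulting read path is essentially the pattern below. This is a sketch with an illustrative class name, not the PackFile code itself.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

final class LockedPackReader {
    private final RandomAccessFile fd;
    private final Object readLock = new Object();

    LockedPackReader(RandomAccessFile fd) {
        this.fd = fd;
    }

    /** Read len bytes at pos; seek and read are one atomic step per file. */
    byte[] read(long pos, int len) throws IOException {
        byte[] buf = new byte[len];
        synchronized (readLock) {
            // No other thread may move the shared file pointer between these calls.
            fd.seek(pos);
            fd.readFully(buf, 0, len);
        }
        return buf;
    }
}
```

Only one window per pack can be paged in at a time with this design, which is the concurrency cost discussed next.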
This reduces the concurrency level, as now only one window can be
paged in at a time from each pack. However, the WindowCache should
already be holding most of the pages required to handle the working
set for a process, and its own internal locking was already limiting
us on the number of concurrent loads possible. Provided that most
concurrent accesses are getting hits in the WindowCache, or are for
different repositories on the same server, we shouldn't see a major
performance hit due to the more serialized loading.
I would have preferred to use a pool of RandomAccessFiles for each
pack, with threads borrowing an instance dedicated to that thread
whenever they needed to page in a window. This would permit much
higher levels of concurrency by using multiple file descriptors (and
file pointers) for each pack. However the code became too complex to
develop in any reasonable period of time, so I've chosen to retrofit
the existing code with more serialization instead.
Bug: 308945
Change-Id: I2e6e11c6e5a105e5aef68871b66200fd725134c9
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
The strings are externalized into the root resource bundles.
The resource bundles are stored under the new "resources" source
folder to get a proper Maven build.
Strings from tests are, in general, not externalized. Strings were
externalized only in cases where it was necessary to make the test
pass. This was typically necessary in cases where
e.getMessage() was used in assert and the exception message was
slightly changed due to reuse of the externalized strings.
Change-Id: Ic0f29c80b9a54fcec8320d8539a3e112852a1f7b
Signed-off-by: Sasa Zivkov <sasa.zivkov@sap.com>
Remove unnecessary truncation of in-pack size during copy
The number of bytes to copy was truncated to an int, but the
pack's copyToStream() method expected to be passed a long here.
Pass through the long so we don't truncate a giant object.
Change-Id: I0786ad60a3a33f84d8746efe51f68d64e127c332
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Reduce size of PackedObjectLoader by dropping long to int
Rather than keep track of both the position of the object, and the
position of its data, just keep track of the number of bytes used
by the object's header in the pack. This shaves 4 bytes out of the
size of the PackedObjectLoader instances.
We also can defer the addition instruction to the materialize()
operation, avoiding it entirely if the caller never actually uses
the loader. This may be relevant for PackWriter invocations,
where only 1 loader gets chosen for a given object, even though
the object may appear on disk in more than one pack file.
Error reporting is now simplified, as we can rely on the object
offset rather than its data offset. This is the value displayed
by pack debugging tools like `git verify-pack -v`, so it's better
to use that in our own errors.
Because nobody needs getDataOffset() now, we can drop that from
the public API.
Change-Id: Ic639c0d5a722315f4f5c8ffda6e26643d90e5f42
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Avoid unnecessary second read on OBJ_OFS_DELTA headers
When we read the object header we copy 20 bytes from the pack data,
then start parsing out the type and the inflated size. For most
objects, this is only going to require 3 bytes, which is sufficient
to represent objects with inflated sizes of up to 2^16. The local
buffer however still has 17 bytes remaining in it, and that can be
used to satisfy the OBJ_OFS_DELTA header.
We shouldn't need to worry about walking off the end of the buffer
here, because delta offsets cannot be larger than 64 bits, and that
requires only 9 bytes in the OFS_DELTA encoding.
Assuming worst-case scenarios of 9 bytes for the OFS_DELTA encoding,
the pack file itself must be approaching 2^64 bytes, an infeasible
size to store on any current technology. However, even if this
were the case we still have 11 bytes for the type/size header.
In that encoding we can represent an object as large as 2^74 bytes,
which is also an infeasible size to process in JGit.
So drop the second read here.
The data offsets we pass into the ObjectLoaders being constructed
need to be computed individually now. This saves a local variable,
but pushes the addition operation into each branch of the switch.
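Parsing both headers from the one buffer looks roughly like this. The sketch uses illustrative names and returns its results in an array for brevity. The bounds follow from the encoding: 11 bytes of type/size header carry 4 + 7 * 10 = 74 bits of size.

```java
final class OfsDeltaHeader {
    static final int OBJ_OFS_DELTA = 6;

    /**
     * Parse from a single 20 byte read.
     * Returns { type, inflatedSize, baseDistance, headerLength }, where
     * baseDistance is 0 unless the type is OBJ_OFS_DELTA.
     */
    static long[] parse(byte[] buf) {
        int p = 0;
        int c = buf[p++] & 0xff;
        int type = (c >> 4) & 7;
        long size = c & 0x0f;
        int shift = 4;
        while ((c & 0x80) != 0) {
            c = buf[p++] & 0xff;
            size += ((long) (c & 0x7f)) << shift;
            shift += 7;
        }

        long ofs = 0;
        if (type == OBJ_OFS_DELTA) {
            // The offset bytes are already in buf; no second read is needed.
            c = buf[p++] & 0xff;
            ofs = c & 0x7f;
            while ((c & 0x80) != 0) {
                ofs += 1;
                c = buf[p++] & 0xff;
                ofs <<= 7;
                ofs += c & 0x7f;
            }
        }
        return new long[] { type, size, ofs, p };
    }
}
```

The base object then sits at the delta's own position minus baseDistance, and the data begins headerLength bytes after the delta's position.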
Change-Id: I6cf64697a9878db87bbf31c7636c03392b47a062
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Move pure IO utility functions to a utility class of its own.
According to the javadoc, and implied by the name of the class, NB
is about network byte order. The purpose of moving the IO-only,
non-byte-order-related functions to another class is to
make it easier for new contributors to understand that they
can use these functions in general. It also makes it easier
to understand where to put new IO related utility functions.
Change-Id: I4a9f6b39d5564bc8a694b366e7ff3cc758c5181b
Signed-off-by: Robin Rosenberg <robin.rosenberg@dewire.com>
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
As discussed on the egit-dev mailing list, we prefer not to have
trailing whitespace in our source code. Correct all currently
offending lines by trimming them.
Change-Id: I002b1d1980071084c0bc53242c8f5900970e6845
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Per CQ 3448 this is the initial contribution of the JGit project
to eclipse.org. It is derived from the historical JGit repository
at commit 3a2dd9921c.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>