Don't use java.nio channel for file size determination
Java NIO has some problems (like files closing unexpectedly because the
thread was interrupted). To avoid those problems, don't use a NIO
channel to determine the size of a file, but rather ask the File itself.
We have to be prepared to handle wrong/outdated information in this case
too, as the inode of the File may change between opening and determining
file size.
Change-Id: Ic7aa6c3337480879efcce4a3058b548cd0e2cef0
This reverts commit bf845c126d since this
change needs to go through a formal IP review and Chris missed to file a
CQ for that.
Change-Id: I303515d78116f0591a2911dbfb9f857738f086a9
Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
This patch introduces CRLF handling to the DirCacheCheckout and
WorkingTreeIterator supporting the AutoCRLF for add, checkout
reset and status and hopefully some other places that depende
on the underlying logic of the affected API's.
The patch includes test cases for the Status command provided by
Tomasz Zarna for bug 353867.
The core.eol and core.safecrlf options are not yet supported.
Bug: 301775
Bug: 353867
Change-Id: I2280a2dc0698829475de6a662a6c6e80b1df7663
In practice the DHT storage layer has not been performing as well as
large scale server environments want to see from a Git server.
The performance of the DHT schema degrades rapidly as small changes
are pushed into the repository due to the chunk size being less than
1/3 of the pushed pack size. Small chunks cause poor prefetch
performance during reading, and require significantly longer prefetch
lists inside of the chunk meta field to work around the small size.
The DHT code is very complex (>17,000 lines of code) and is very
sensitive to the underlying database round-trip time, as well as the
way objects were written into the pack stream that was chunked and
stored on the database. A poor pack layout (from any version of C Git
prior to Junio reworking it) can cause the DHT code to be unable to
enumerate the objects of the linux-2.6 repository in a completable
time scale.
Performing a clone from a DHT stored repository of 2 million objects
takes 2 million row lookups in the DHT to locate the OBJECT_INDEX row
for each object being cloned. This is very difficult for some DHTs to
scale, even at 5000 rows/second the lookup stage alone takes 6 minutes
(on local filesystem, this is almost too fast to bother measuring).
Some servers like Apache Cassandra just fall over and cannot complete
the 2 million lookups in rapid fire.
On a ~400 MiB repository, the DHT schema has an extra 25 MiB of
redundant data that gets downloaded to the JGit process, and that is
before you consider the cost of the OBJECT_INDEX table also being
fully loaded, which is at least 223 MiB of data for the linux kernel
repository. In the DHT schema answering a `git clone` of the ~400 MiB
linux kernel needs to load 248 MiB of "index" data from the DHT, in
addition to the ~400 MiB of pack data that gets sent to the client.
This is 193 MiB more data to be accessed than the native filesystem
format, but it needs to come over a much smaller pipe (local Ethernet
typically) than the local SATA disk drive.
I also never got around to writing the "repack" support for the DHT
schema, as it turns out to be fairly complex to safely repack data in
the repository while also trying to minimize the amount of changes
made to the database, due to very common limitations on database
mutation rates..
This new DFS storage layer fixes a lot of those issues by taking the
simple approach for storing relatively standard Git pack and index
files on an abstract filesystem. Packs are accessed by an in-process
buffer cache, similar to the WindowCache used by the local filesystem
storage layer. Unlike the local file IO, there are some assumptions
that the storage system has relatively high latency and no concept of
"file handles". Instead it looks at the file more like HTTP byte range
requests, where a read channel is a simply a thunk to trigger a read
request over the network.
The DFS code in this change is still abstract, it does not store on
any particular filesystem, but is fairly well suited to the Amazon S3
or Apache Hadoop HDFS. Storing packs directly on HDFS rather than
HBase removes a layer of abstraction, as most HBase row reads turn
into an HDFS read.
Most of the DFS code in this change was blatently copied from the
local filesystem code. Most parts should be refactored to be shared
between the two storage systems, but right now I am hesistent to do
this due to how well tuned the local filesystem code currently is.
Change-Id: Iec524abdf172e9ec5485d6c88ca6512cd8a6eafb
Support reading first SHA-1 from large FETCH_HEAD files
When reading refs, avoid reading huge files that were put there
accidentally, but still read the top of e.g. FETCH_HEAD, which
may be longer than our limit. We're only interested in the first line
anyway.
Bug: 340880
Change-Id: I11029b9b443f22019bf80bd3dd942b48b531bc11
Signed-off-by: Carsten Pfieffer <carsten.pfeiffer@gebit.de>
Signed-off-by: Chris Aniszczyk <caniszczyk@gmail.com>
Perform automatic CRLF to LF conversion during WorkingTreeIterator
WorkingTreeIterator now optionally performs CRLF to LF conversion for
text files. A basic framework is left in place to support enabling
(or disabling) this feature based on gitattributes, and also to
support the more generic smudge/clean filter system. As there is
no gitattribute support yet in JGit this is left unimplemented,
but the mightNeedCleaning(), isBinary() and filterClean() methods
will provide reasonable places to plug that into in the future.
[sp: All bugs inside of WorkingTreeIterator are my fault, I wrote
most of it while cherry-picking this patch and building it on
top of Marc's original work.]
CQ: 4419
Bug: 301775
Change-Id: I0ca35cfbfe3f503729cbfc1d5034ad4abcd1097e
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Don't use interruptable pread() to access pack files
The J2SE NIO APIs require that FileChannel close the underlying file
descriptor if a thread is interrupted while it is inside of a read or
write operation on that channel. This is insane, because it means we
cannot share the file descriptor between threads. If a thread is in
the middle of the FileChannel variant of IO.readFully() and it
receives an interrupt, the pack will be automatically closed on us.
This causes the other threads trying to use that same FileChannel to
receive IOExceptions, which leads to the pack getting marked as
invalid. Once the pack is marked invalid, JGit loses access to its
entire contents and starts to report MissingObjectExceptions.
Because PackWriter must ensure that the chosen pack file stays
available until the current object's data is fully copied to the
output, JGit cannot simply reopen the pack when its automatically
closed due to an interrupt being sent at the wrong time. The pack may
have been deleted by a concurrent `git gc` process, and that open file
descriptor might be the last reference to the inode on disk. Once its
closed, the PackWriter loses access to that object representation, and
it cannot complete sending the object the client.
Fortunately, RandomAccessFile's readFully method does not have this
problem. Interrupts during readFully() are ignored. However, it
requires us to first seek to the offset we need to read, then issue
the read call. This requires locking around the file descriptor to
prevent concurrent threads from moving the pointer before the read.
This reduces the concurrency level, as now only one window can be
paged in at a time from each pack. However, the WindowCache should
already be holding most of the pages required to handle the working
set for a process, and its own internal locking was already limiting
us on the number of concurrent loads possible. Provided that most
concurrent accesses are getting hits in the WindowCache, or are for
different repositories on the same server, we shouldn't see a major
performance hit due to the more serialized loading.
I would have preferred to use a pool of RandomAccessFiles for each
pack, with threads borrowing an instance dedicated to that thread
whenever they needed to page in a window. This would permit much
higher levels of concurrency by using multiple file descriptors (and
file pointers) for each pack. However the code became too complex to
develop in any reasonable period of time, so I've chosen to retrofit
the existing code with more serialization instead.
Bug: 308945
Change-Id: I2e6e11c6e5a105e5aef68871b66200fd725134c9
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
The strings are externalized into the root resource bundles.
The resource bundles are stored under the new "resources" source
folder to get proper maven build.
Strings from tests are, in general, not externalized. Only in
cases where it was necessary to make the test pass the strings
were externalized. This was typically necessary in cases where
e.getMessage() was used in assert and the exception message was
slightly changed due to reuse of the externalized strings.
Change-Id: Ic0f29c80b9a54fcec8320d8539a3e112852a1f7b
Signed-off-by: Sasa Zivkov <sasa.zivkov@sap.com>
Move pure IO utility functions to a utility class of its own.
According the javadoc, and implied by the name of the class, NB
is about network byte order. The purpose of moving the IO only,
and non-byte order related functions to another class is to
make it easier for new contributors to understand that they
can use these functions in general and it's also makes it easier
to understand where to put new IO related utility functions
Change-Id: I4a9f6b39d5564bc8a694b366e7ff3cc758c5181b
Signed-off-by: Robin Rosenberg <robin.rosenberg@dewire.com>
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
As discussed on the egit-dev mailing list, we prefer not to have
trailing whitespace in our source code. Correct all currently
offending lines by trimming them.
Change-Id: I002b1d1980071084c0bc53242c8f5900970e6845
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Per CQ 3448 this is the initial contribution of the JGit project
to eclipse.org. It is derived from the historical JGit repository
at commit 3a2dd9921c.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>