Robin Rosenberg [Tue, 29 Nov 2011 23:16:40 +0000 (00:16 +0100)]
Add percent-encoding of reserved characters in URIish
We do this for the the names that have an explicit scheme and
do it both ways. The URIish is parsed before decoding. Only
a few special characters are encoded for the path part of the
URI, i.e. space, non-ASCII and control characters. The percent
encoding is assumed to be a stream encoding so we interpret it
as UTF-8.
There was the possibility that during history rendering we draw a lane
"trough" a passed commit. Vadim Dmitriev found that out in bug 335818.
I added the needed check to that block of code where it was missing.
Bug: 335818
Change-Id: Ic944193b2aca55ff3eb0235d46afa60b7896aa0f Signed-off-by: Christian Halstrick <christian.halstrick@sap.com>
Shawn O. Pearce [Thu, 1 Dec 2011 01:36:32 +0000 (17:36 -0800)]
Discard request HTTP bodies for status code <400
The HTTP RFCs require a server to fully consume the request body before
it can return a non-error status code, which is any code below 400.
JGit returns most Git level errors inside of an HTTP 200 OK response,
and sometimes this happens before the entire request was consumed from
the servlet container. In such cases the body must be skipped or read
until EOF is reached, ensuring the HTTP keep-alive semantics will work
for the next request on the same TCP connection.
HTTP status codes >= 400 may be returned without consuming the body,
and a servlet container must set "Connection: close" in the response
headers when this happens, since the state of the request body is not
well defined with an early abort.
With the introduction of sendError() in GitSmartHttpTools there are
only a handful of locations that need to worry about the request body
being consumed, so sprinkle the call in as necessary.
Shawn O. Pearce [Thu, 1 Dec 2011 01:28:54 +0000 (17:28 -0800)]
Ensure all smart HTTP errors are sent to clients
Error messages are typically short, below the 32 KiB in-memory buffer
size of the SmartOutputStream. When an error is queued up for sending
to a client and an exception is thrown up into the servlet handler we
discarded the message and sent nothing to the client, as the messages
were stuck inside of the SmartOutputStream buffer.
Hoist the creation of the output stream above the invocation of try
block of the service, and use close() in the few catch blocks that
assume there are buffered messages ready for transmission. This will
ensure errors from unpacking a stream in ReceivePack are sent off to
a client correctly, as previously these were causing no status report
to arrive at the client side as the data was stuck in the buffer.
Shawn O. Pearce [Thu, 1 Dec 2011 01:17:59 +0000 (17:17 -0800)]
Always close the GZIPOutputStream to release Deflater
The stream must be closed to ensure the native resources associated
with its internal Deflater instance are cleaned up early, instead of
waiting for GC to identify the dead object and finialize it.
Shawn O. Pearce [Thu, 1 Dec 2011 01:40:10 +0000 (17:40 -0800)]
Fix HTTP unit tests
I modified the way errors are returned, and this particular test is
now getting a different access denied response. The new text happens
to be what I intended to have here, so update the test.
Matthias Sohn [Sun, 27 Nov 2011 00:16:49 +0000 (01:16 +0100)]
Reset SSH connection and credentials on "Auth fail"
When SSH user/password authentication failed this may have been caused
by changed credentials on the server side. When the SSH credentials of a
user change the SSH connection needs to be re-established and
credentials which may have been stored by the credentials provider
need to be reset in order to enable prompting for the new credentials.
Bug: 356233
Change-Id: I7d64c5f39b68a9687c858bb68a961616eabbc751 Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
Shawn O. Pearce [Tue, 22 Nov 2011 23:51:11 +0000 (15:51 -0800)]
Add utilities for smart HTTP error handling
The GitSmartHttpTools class started as utility functions to help report
useful error messages to users of the android.googlesource.com service.
Now that the GitServlet and GitFilter classes support filters before a
git-upload-pack or git-receive-pack request, server implementors may
these routines helpful to report custom messages to clients. Using the
sendError() method to return an HTTP 200 OK with error text embedded in
the payload prevents native Git clients from retrying the action with a
dumb Git or WebDAV HTTP request.
Refactor some of the existing code to use these new error functions and
protocol constants. The new sendError() function is very close to being
identical to the old error handling code in RepositoryFilter, however we
now use the POST Content-Type rather than the Accept HTTP header to check
if the client will accept the error data in the response body rather than
using the HTTP status code. This is a more reliable way of checking for
native Git clients, as the Accept header was not always populated with the
correct string in older versions of Git smart HTTP.
Shawn O. Pearce [Tue, 22 Nov 2011 23:18:32 +0000 (15:18 -0800)]
Strip leading slashes in RepositoryFilter
If removing the leading slash results in an empty string, return
with an HTTP 404 error before trying to use the RepositoryResolver.
Moving this into a loop ahead of the length check ensures there is
no empty string passed into the resolver.
Sasa Zivkov [Tue, 22 Nov 2011 16:43:00 +0000 (17:43 +0100)]
maxObjectSizeLimit for receive-pack.
ReceivePack (and PackParser) can be configured with the
maxObjectSizeLimit in order to prevent users from pushing too large
objects to Git. The limit check is applied to all object types
although it is most likely that a BLOB will exceed the limit. In all
cases the size of the object header is excluded from the object size
which is checked against the limit as this is the size of which a BLOB
object would take in the working tree when checked out as a file.
When an object exceeds the maxObjectSizeLimit the receive-pack will
abort immediately.
Delta objects (both offset and ref delta) are also checked against the
limit. However, for delta objects we will first check the size of the
inflated delta block against the maxObjectSizeLimit and abort
immediately if it exceeds the limit. In this case we even do not know
the exact size of the resolved delta object but we assume it will be
larger than the given maxObjectSizeLimit as delta is generally only
chosen if the delta can copy more data from the base object than the
delta needs to insert or needs to represent the copy ranges. Aborting
early, in this case, avoids unnecessary inflating of the (huge) delta
block.
Unfortunately, it is too expensive (especially for a large delta) to
compute SHA-1 of an object that causes the receive-pack to abort.
This would decrease the value of this feature whose main purpose is to
protect server resources from users pushing huge objects. Therefore
we don't report the SHA-1 in the error message.
Change-Id: I177ef24553faacda444ed5895e40ac8925ca0d1e Signed-off-by: Sasa Zivkov <sasa.zivkov@sap.com> Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
Shawn O. Pearce [Wed, 16 Nov 2011 23:04:44 +0000 (15:04 -0800)]
Fix duplicate objects in "thin+cached" packs from DFS
The DfsReader must offer every representation of an object that
exists on the local repository when PackWriter asks for them. This
is necessary to identify objects in the thin pack part that are also
in the cached pack that will be appended onto the end of the stream.
Without looking at all alternatives, PackWriter may pack the same
object twice (once in the thin section, again in the cached base
pack). This may cause the command line C version to go into an
infinite loop when repacking the resulting repository, as it may see
a delta chain cycle with one of those duplicate copies of the object.
Previously the DfsReader tried to avoid looking at packs that it
might not care about, but this is insufficient, as all versions
must be considered during pack generation.
Shawn O. Pearce [Thu, 17 Nov 2011 19:23:18 +0000 (11:23 -0800)]
Do not write edge objects to the pack stream
Consider two objects A->B where A uses B as a delta base, and these
are in the same source pack file ordered as "A B".
If cached packs is enabled and B is also in the cached pack that
will be appended onto the end of the thin pack, and both A, B are
supposed to be in the thin pack, PackWriter must consider the fact
that A's base B is an edge object that claims to be part of the
new pack, but is actually "external" and cannot be written first.
If the object reuse system considered B candidates fist this bug
does not arise, as B will be marked as edge due to it existing in
the cached pack. When the A candidates are later examined, A sees a
valid delta base is available as an edge, and will not later try to
"write base first" during the writing phase.
However, when the reuse system considers A candidates first they
see that B will be in the outgoing pack, as it is still part of
the thin pack, and arrange for A to be written first. Later when A
switches from being in-pack to being an edge object (as it is part
of the cached pack) the pointer in B does not get its type changed
from ObjectToPack to ObjectId, so B thinks A is non-edge.
We work around this case by also checking that the delta base B
is non-edge before writing the object to the pack. Later when A
writes its object header, delta base B's ObjectToPack will have
an offset == 0, which makes isWritten() = false, and the OBJ_REF
delta format will be used for A's header. This will be resolved by
the client to the copy of B that appears in the later cached pack.
Shawn O. Pearce [Thu, 17 Nov 2011 19:39:53 +0000 (11:39 -0800)]
Use long for more object counts in PackWriter
Packs can contain up to 2^32-1 objects, which exceeds the range of a
Java int. Try harder to accept higher object counts in some cases by
using long more often when we are working with the object count value.
This is a trivial refactoring, we may have to make even more changes
to the object handling code to support more than 2^31-1 objects.
Shawn O. Pearce [Thu, 17 Nov 2011 19:44:43 +0000 (11:44 -0800)]
Search for annotated tag reuse first
Annotated tags are relatively rare and currently are scheduled in a
pack file near the commits, decreasing the time it takes to resolve
client requests reading tags as part of a history traversal.
Putting them first before the commits allows the storage system to
page in the tag area, and have it relatively hot in the LRU when
the nearby commit area gets examined too. Later looking at the
tree and blob data will pollute the cache, making it more likely
the tags are not loaded and would require file IO.
Shawn O. Pearce [Thu, 17 Nov 2011 15:21:22 +0000 (07:21 -0800)]
Correct progress monitor on "Getting sizes:" phase
This counter always was running 1 higher, because it incremented
after the queue was exhausted (and every object was processed). Move
increments to be after the queue has provided a result, to ensure
we do not show a higher in-progress count than total count.
Shawn O. Pearce [Wed, 16 Nov 2011 22:52:31 +0000 (14:52 -0800)]
Refactor DfsReader selection of cached packs
Make the code more clear with a simple refactoring of the boolean
logic into a method that describes the condition we are looking
for on each pack file. A cached pack is possible if there exists
a tips collection, and the collection is non-empty.
Dave Borowitz [Thu, 10 Nov 2011 20:58:13 +0000 (12:58 -0800)]
Keep track of a static collection of all PackWriter instances
Stored in a weak concurrent hash map, which we clean up while iterating.
Usually the weak reference behavior should not be necessary because
PackWriters should be released with release(), but we still want to
avoid leaks when dealing with broken client code.
Dave Borowitz [Thu, 10 Nov 2011 20:53:50 +0000 (12:53 -0800)]
Estimate the amount of memory used by a PackWriter
Memory usage is dominated by three terms:
- The maximum memory allocated to each delta window.
- The maximum size of a single file held in memory during delta search.
- ObjectToPack instances owned by the writer.
For the first two terms, rather than doing complex instrumentation of
the DeltaWindows, we just overestimate based on the config parameters
(though we may underestimate if the maximum size is not set).
For the ObjectToPack instances, we do some rough byte accounting of the
underlying Java object representation.
Dave Borowitz [Thu, 10 Nov 2011 20:49:15 +0000 (12:49 -0800)]
Add an object encapsulating the state of a PackWriter
Exposes essentially the same state machine to the programmer as is
exposed to the client via a ProgressMonitor, using a wrapper around
beginTask()/endTask().
Jens Baumgart [Wed, 26 Oct 2011 12:00:48 +0000 (14:00 +0200)]
Add detection of untracked folders to IndexDiffFilter
Decorators need to know whether folders in the working tree contain only
untracked files. This change enhances IndexDiffFilter to report such
folders. This works only together with treewalks which operate in
default traversal mode. For treewalks which process entries in
postorder mode (files are walked before their parent folder is walked)
this detection doesn't work.
Bug: 359264
Change-Id: I9298d1e3ccac0aec8bbd4e8ac867bc06a5c89c9f Signed-off-by: Christian Halstrick <christian.halstrick@sap.com> Signed-off-by: Jens Baumgart <jens.baumgart@sap.com> Signed-off-by: Chris Aniszczyk <zx@twitter.com>
Kevin Sawicki [Tue, 8 Nov 2011 20:49:16 +0000 (12:49 -0800)]
Support a configured credentials provider in LsRemoteCommand
Refactored the three common transport configuration options:
credentials provider, timeout, and transport config callback
into a new TransportCommand base class which is now extended
by all commands that use a Transport object during execution.
Bug: 349188
Change-Id: I90c2c14fb4e3cc4712905158f9047153a0c235c2 Signed-off-by: Kevin Sawicki <kevin@github.com> Signed-off-by: Chris Aniszczyk <zx@twitter.com>
Robin Rosenberg [Sun, 16 Oct 2011 05:01:21 +0000 (07:01 +0200)]
Kill GitIndex
A few places were still using GitIndex. Replacing it was fairly
simple, but there is a difference in test outcome in
ReadTreeTest.testUntrackedConflicts. I believe the new behavior
is good, since we do not update neither the index, not the worktree.
Change-Id: I4be5357b7b3139dded17f77e07a140addb213ea7 Signed-off-by: Robin Rosenberg <robin.rosenberg@dewire.com>
Robin Rosenberg [Wed, 19 Oct 2011 21:14:41 +0000 (23:14 +0200)]
Deprecate GitIndex more by using only DirCache internally.
This includes merging ReadTreeTest into DirCacheCheckoutTest and
converting IndexDiffTest to use DirCache only. The GitIndex specific
T0007GitIndex test remains.
GitIndex is deprecated. Let us speed up its demise by focusing the
DirCacheCheckout tests to using DirCache instead.
This also add explicit deprecation comments to methods that depend
on GitIndex in Repository and TreeEntry. The latter is deprecated in
itself.
Change-Id: Id89262f7fbfee07871f444378f196ded444f2783 Signed-off-by: Robin Rosenberg <robin.rosenberg@dewire.com>
Shawn O. Pearce [Mon, 7 Nov 2011 20:06:56 +0000 (12:06 -0800)]
Always use try/finally around DfsBlockCache.clockLock
Any RuntimeException or Error in this block will leave the lock
held by the caller thread, which can later result in deadlock or
just cache requests hanging forever because they cannot get to
the lock object.
Wrap everything in try/finally to prevent the lock from hanging,
even though a RuntimeException or Error should never happen in
any of these code paths.
Shawn O. Pearce [Tue, 8 Nov 2011 20:22:37 +0000 (12:22 -0800)]
DfsBlockCache: Fix NPE when evicting empty cell
The cache starts with a single empty Ref that has no data, as the
clock list does not support being empty. When this Ref is removed,
the size has to be decremented from the associated DfsPackKey,
which was previously null. Make it always be non-null.
Robin Rosenberg [Sat, 5 Nov 2011 15:52:24 +0000 (16:52 +0100)]
Don't throw away the stack trace when tests fail
Most unexpected exceptions are completely useless yielding message
like "null" or "3" or in the best cases something reasonable, but
still out of context.
Just declare the test as throwing an exception. That will retain
the full stack trace leading to the point of failure without using
a debugger or changing the code.
* changes:
DfsBlockCache: Update hits to not include contains()
Add a listener for changes to a DfsObjDatabase's pack files
Expose the reverse index size in the DfsPackDescription
Add a DfsPackFile method to get the number of cached bytes
Expose the list of pack files in the DfsBlockCache
Add a DFS repository description and reference it in each pack
Clarify the docstring of DfsBlockCache.reconfigure()
DFS: A storage layer for JGit
Dave Borowitz [Fri, 7 Oct 2011 22:31:19 +0000 (15:31 -0700)]
Add a DFS repository description and reference it in each pack
Just as DfsPackDescription describes a pack but does not imply it is
open in memory, a DfsRepositoryDescription describes a repository at a
basic level without it necessarily being open.
Dave Borowitz [Thu, 3 Nov 2011 19:43:03 +0000 (12:43 -0700)]
Clarify the docstring of DfsBlockCache.reconfigure()
The docstring was copied from the local filesystem cache code, which
actually attempted to reconfigure the cache on the fly. The DFS cache is
designed to be "reconfigured" exactly once.
In practice the DHT storage layer has not been performing as well as
large scale server environments want to see from a Git server.
The performance of the DHT schema degrades rapidly as small changes
are pushed into the repository due to the chunk size being less than
1/3 of the pushed pack size. Small chunks cause poor prefetch
performance during reading, and require significantly longer prefetch
lists inside of the chunk meta field to work around the small size.
The DHT code is very complex (>17,000 lines of code) and is very
sensitive to the underlying database round-trip time, as well as the
way objects were written into the pack stream that was chunked and
stored on the database. A poor pack layout (from any version of C Git
prior to Junio reworking it) can cause the DHT code to be unable to
enumerate the objects of the linux-2.6 repository in a completable
time scale.
Performing a clone from a DHT stored repository of 2 million objects
takes 2 million row lookups in the DHT to locate the OBJECT_INDEX row
for each object being cloned. This is very difficult for some DHTs to
scale, even at 5000 rows/second the lookup stage alone takes 6 minutes
(on local filesystem, this is almost too fast to bother measuring).
Some servers like Apache Cassandra just fall over and cannot complete
the 2 million lookups in rapid fire.
On a ~400 MiB repository, the DHT schema has an extra 25 MiB of
redundant data that gets downloaded to the JGit process, and that is
before you consider the cost of the OBJECT_INDEX table also being
fully loaded, which is at least 223 MiB of data for the linux kernel
repository. In the DHT schema answering a `git clone` of the ~400 MiB
linux kernel needs to load 248 MiB of "index" data from the DHT, in
addition to the ~400 MiB of pack data that gets sent to the client.
This is 193 MiB more data to be accessed than the native filesystem
format, but it needs to come over a much smaller pipe (local Ethernet
typically) than the local SATA disk drive.
I also never got around to writing the "repack" support for the DHT
schema, as it turns out to be fairly complex to safely repack data in
the repository while also trying to minimize the amount of changes
made to the database, due to very common limitations on database
mutation rates..
This new DFS storage layer fixes a lot of those issues by taking the
simple approach for storing relatively standard Git pack and index
files on an abstract filesystem. Packs are accessed by an in-process
buffer cache, similar to the WindowCache used by the local filesystem
storage layer. Unlike the local file IO, there are some assumptions
that the storage system has relatively high latency and no concept of
"file handles". Instead it looks at the file more like HTTP byte range
requests, where a read channel is a simply a thunk to trigger a read
request over the network.
The DFS code in this change is still abstract, it does not store on
any particular filesystem, but is fairly well suited to the Amazon S3
or Apache Hadoop HDFS. Storing packs directly on HDFS rather than
HBase removes a layer of abstraction, as most HBase row reads turn
into an HDFS read.
Most of the DFS code in this change was blatently copied from the
local filesystem code. Most parts should be refactored to be shared
between the two storage systems, but right now I am hesistent to do
this due to how well tuned the local filesystem code currently is.
Robin Rosenberg [Fri, 4 Nov 2011 16:53:44 +0000 (17:53 +0100)]
Allow '\' in user names in URI-ish
Actually this is not ok according to the RFC, but this implementation is
ment to be Git compatible. A '\' is needed when the authentication
requires or allows authentication to a Windows domain where the
user name can be specified as DOMAIN\user.
Robin Rosenberg [Fri, 28 Oct 2011 12:58:32 +0000 (14:58 +0200)]
Do not resolve path using cygwin unless told to
The system property jgit.cygpath must be set to true in order
for cygwin's cygpath to be used to translate path from cygwin
namespace to Windows namespace.
The cygwin path translation should be considered deprecated.
Bug: 353389
Change-Id: I2b5234c0ab936dac67d1e232f4cd28331bf3226d Signed-off-by: Robin Rosenberg <robin.rosenberg@dewire.com>
Matthias Sohn [Wed, 26 Oct 2011 21:29:23 +0000 (17:29 -0400)]
Merge changes I488e9c97,I30f1049f,I1c088dce
* changes:
Cosmetic adjustment of relative date format, do not display "0 months"
Make use of the many date formatting options in the log command
Define a utility class for handling Git date formats