The most expensive part of packing a repository for transport to
another system is enumerating all of the objects in the repository.
Once this gets to the size of the linux-2.6 repository (1.8 million
objects), enumeration can take several CPU minutes and costs a lot
of temporary working set memory.
Teach PackWriter to efficiently reuse an existing "cached pack"
by answering a clone request with a thin pack followed by a larger
cached pack appended to the end. This requires the repository
owner to first construct the cached pack by hand, and record the
tip commits inside of $GIT_DIR/objects/info/cached-packs:
cd $GIT_DIR
root=$(git rev-parse master)
tmp=objects/.tmp-$$
names=$(echo $root | git pack-objects --keep-true-parents --revs $tmp)
for n in $names; do
chmod a-w $tmp-$n.pack $tmp-$n.idx
touch objects/pack/pack-$n.keep
mv $tmp-$n.pack objects/pack/pack-$n.pack
mv $tmp-$n.idx objects/pack/pack-$n.idx
done
(echo "+ $root";
for n in $names; do echo "P $n"; done;
echo) >>objects/info/cached-packs
git repack -a -d
When a clone request needs to include $root, the corresponding
cached pack will be copied as-is, rather than enumerating all of
the objects that are reachable from $root.
For a linux-2.6 kernel repository that should be about 376 MiB,
the above process creates two packs of 368 MiB and 38 MiB[1].
This is a local disk usage increase of ~26 MiB, due to reduced
delta compression between the large cached pack and the smaller
recent activity pack. The overhead is similar to 1 full copy of
the compressed project sources.
With this cached pack in hand, JGit daemon completes a clone request
in 1m17s less time, but a slightly larger data transfer (+2.39 MiB):
Before:
remote: Counting objects: 1861830, done
remote: Finding sources: 100% (1861830/1861830)
remote: Getting sizes: 100% (88243/88243)
remote: Compressing objects: 100% (88184/88184)
Receiving objects: 100% (1861830/1861830), 376.01 MiB | 19.01 MiB/s, done.
remote: Total 1861830 (delta 4706), reused 1851053 (delta 1553844)
Resolving deltas: 100% (1564621/1564621), done.
real 3m19.005s
After:
remote: Counting objects: 1601, done
remote: Counting objects: 1828460, done
remote: Finding sources: 100% (50475/50475)
remote: Getting sizes: 100% (18843/18843)
remote: Compressing objects: 100% (7585/7585)
remote: Total 1861830 (delta 2407), reused 1856197 (delta 37510)
Receiving objects: 100% (1861830/1861830), 378.40 MiB | 31.31 MiB/s, done.
Resolving deltas: 100% (1559477/1559477), done.
real 2m2.938s
Repository owners can periodically refresh their cached packs by
repacking their repository, folding all newer objects into a larger
cached pack. Since repacking is already considered to be a normal
Git maintenance activity, this isn't a very big burden.
[1] In this test $root was set back about two weeks.
Change-Id: Ib87131d5c4b5e8c5cacb0f4fe16ff4ece554734b
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Expose pack fetch/push connections for subclassing
These classes need to be visible if an application wants to define
its own native pack based protocol embedded within another layer,
much like we already support for smart HTTP.
Change-Id: I7e2ac3ad01d15b94d340128a395fe0b2f560ff35
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
When we are creating a pack the higher level application should be able
to override the PackConfig used, allowing it to control the number of
threads used or how much memory is allocated per writer.
Change-Id: I47795987bb0d161d3642082acc2f617d7cb28d8c
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Move PackWriter progress monitors onto the operations
Rather than taking the ProgressMonitor objects in our constructor and
carrying them around as instance fields, take them as arguments to the
actual time consuming operations we need to run.
Change-Id: I2b230d07e277de029b1061c807e67de5428cc1c4
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Ensure ObjectReader used by PackWriter is released
The ObjectReader API demands that we release the reader when we are
done with it. PackWriter contains a reader, which it uses for the
entire packing session. Expose the release of the reader through
a release method on the writer.
This still doesn't address the RevWalk and TreeWalk users, who
don't correctly release their reader. But its a small step in the
right direction.
Change-Id: I5cb0b5c1b432434a799fceb21b86479e09b84a0a
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Similar to what we did with the file code, move the pack writer
into its own package so the related classes and their package
private methods are hidden from the rest of the library.
Change-Id: Ic1b5c7c8c8d266e90c910d8d68dfc8e93586854f
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
The strings are externalized into the root resource bundles.
The resource bundles are stored under the new "resources" source
folder to get proper maven build.
Strings from tests are, in general, not externalized. Only in
cases where it was necessary to make the test pass the strings
were externalized. This was typically necessary in cases where
e.getMessage() was used in assert and the exception message was
slightly changed due to reuse of the externalized strings.
Change-Id: Ic0f29c80b9a54fcec8320d8539a3e112852a1f7b
Signed-off-by: Sasa Zivkov <sasa.zivkov@sap.com>
Reduce multi-level buffered streams in transport code
Some transports actually provide stream buffering on their own,
without needing to be wrapped up inside of a BufferedInputStream in
order to smooth out system calls to read or write. A great example
of this is the JSch SSH client, or the Apache MINA SSHD server.
Both use custom buffering to packetize the streams into the encrypted
SSH channel, and wrapping them up inside of a BufferedInputStream
or BufferedOutputStream is relatively pointless.
Our SideBandOutputStream implementation also provides some fairly
large buffering, equal to one complete side-band packet on the main
data channel. Wrapping that inside of a BufferedOutputStream just to
smooth out small writes from PackWriter causes extra data copies, and
provides no advantage. We can save some memory and some CPU cycles
by letting PackWriter dump directly into the SideBandOutputStream's
internal buffer array.
Instead we push the buffering streams down to be as close to the
network socket (or operating system pipe) as possible. This allows
us to smooth out the smaller reads/writes from pkt-line messages
during advertisement and negotation, but avoid copying altogether
when the stream switches to larger writes over a side band channel.
Change-Id: I2f6f16caee64783c77d3dd1b2a41b3cc0c64c159
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Capture non-progress side band #2 messages and put in result
Any messages received on side band #2 that aren't scraped as a
progress message into our ProgressMonitor are now forwarded to a
buffer which is later included into the OperationResult object.
Application callers can use this buffer to present the additional
messages from the remote peer after the push or fetch operation
has concluded.
The smart push connections using the native send-pack/receive-pack
protocol now request side-band-64k capability if it is available
and forward any messages received through that channel onto this
message buffer. This makes hook messages available over smart HTTP,
or even over SSH.
The SSH transport was modified to redirect the remote command's
stderr stream into the message buffer, interleaved with any data
received over side band #2. Due to buffering between these two
different channels in the SSH channel mux itself the order of any
writes between the two cannot be ensured, but it tries to stay close.
The local fork transport was also modified to redirect the local
receive-pack's stderr into the message buffer, rather than going to
the invoking JVM's System.err. This gives applications a chance
to log the local error messages, rather than needing to redirect
their JVM's stderr before startup.
To keep things simple, the application has to wait for the entire
operation to complete before it can see the messages. This may
be a downside if the user is trying to debug a remote hook that is
blocking indefinitely, the user would need to abort the connection
before they can inspect the message buffer in any sort of UI built
on top of JGit.
Change-Id: Ibc215f4569e63071da5b7e5c6674ce924ae39e11
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
ReceivePack: Enable side-band-64k capability for status reports
We now advertise the side-band-64k capability inside of ReceivePack,
allowing hooks to echo status messages down the side band channel
instead of over the optional stderr stream.
This change permits hooks running inside of an http:// based push
invocation to still message the end-user with more detailed errors
than the small per-command string in the status report.
Change-Id: I64f251ef2d13ab3fd0e1a319a4683725455e5244
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
The boolean field sentCommand is always true at this point, as it
was assigned just 5 lines above. So we always set the status of
the update command object to AWAITING_REPORT.
Simplify the logic by dropping the ?: operator. I assume this is
older code from an attempt to manage dry-run push support within
the native connection, but in fact dry-run support is done higher
up inside of PushProcess.
Change-Id: I450d491bbbb5afecdbf5444ab7169222e856a3bb
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Per CQ 3448 this is the initial contribution of the JGit project
to eclipse.org. It is derived from the historical JGit repository
at commit 3a2dd9921c.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>