
PackOutputStream.java 7.9KB

PackWriter: Support reuse of entire packs

The most expensive part of packing a repository for transport to
another system is enumerating all of the objects in the repository.
Once this gets to the size of the linux-2.6 repository (1.8 million
objects), enumeration can take several CPU minutes and costs a lot
of temporary working set memory.

Teach PackWriter to efficiently reuse an existing "cached pack"
by answering a clone request with a thin pack followed by a larger
cached pack appended to the end. This requires the repository owner
to first construct the cached pack by hand, and record the tip
commits inside of $GIT_DIR/objects/info/cached-packs:

  cd $GIT_DIR
  root=$(git rev-parse master)
  tmp=objects/.tmp-$$
  names=$(echo $root | git pack-objects --keep-true-parents --revs $tmp)
  for n in $names; do
    chmod a-w $tmp-$n.pack $tmp-$n.idx
    touch objects/pack/pack-$n.keep
    mv $tmp-$n.pack objects/pack/pack-$n.pack
    mv $tmp-$n.idx objects/pack/pack-$n.idx
  done
  (echo "+ $root"
   for n in $names; do echo "P $n"; done
   echo) >>objects/info/cached-packs
  git repack -a -d

When a clone request needs to include $root, the corresponding
cached pack will be copied as-is, rather than enumerating all of
the objects that are reachable from $root.

For a linux-2.6 kernel repository that should be about 376 MiB,
the above process creates two packs of 368 MiB and 38 MiB[1].
This is a local disk usage increase of ~26 MiB, due to reduced
delta compression between the large cached pack and the smaller
recent activity pack. The overhead is similar to 1 full copy of
the compressed project sources.

With this cached pack in hand, JGit daemon completes a clone request
in 1m17s less time, but with a slightly larger data transfer
(+2.39 MiB):

  Before:
    remote: Counting objects: 1861830, done
    remote: Finding sources: 100% (1861830/1861830)
    remote: Getting sizes: 100% (88243/88243)
    remote: Compressing objects: 100% (88184/88184)
    Receiving objects: 100% (1861830/1861830), 376.01 MiB | 19.01 MiB/s, done.
    remote: Total 1861830 (delta 4706), reused 1851053 (delta 1553844)
    Resolving deltas: 100% (1564621/1564621), done.

    real  3m19.005s

  After:
    remote: Counting objects: 1601, done
    remote: Counting objects: 1828460, done
    remote: Finding sources: 100% (50475/50475)
    remote: Getting sizes: 100% (18843/18843)
    remote: Compressing objects: 100% (7585/7585)
    remote: Total 1861830 (delta 2407), reused 1856197 (delta 37510)
    Receiving objects: 100% (1861830/1861830), 378.40 MiB | 31.31 MiB/s, done.
    Resolving deltas: 100% (1559477/1559477), done.

    real  2m2.938s

Repository owners can periodically refresh their cached packs by
repacking their repository, folding all newer objects into a larger
cached pack. Since repacking is already considered to be a normal
Git maintenance activity, this isn't a very big burden.

[1] In this test $root was set back about two weeks.

Change-Id: Ib87131d5c4b5e8c5cacb0f4fe16ff4ece554734b
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
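The cached-packs file written by the script above is a series of records, each listing tip commit lines ("+ <id>") and pack name lines ("P <name>"), terminated by a blank line. As a minimal sketch of how such records could be read, here is a hypothetical parser; the class and names (CachedPacksParser, Record) are illustrative only and not part of JGit:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: parse objects/info/cached-packs records. */
public class CachedPacksParser {
    /** One record: tip commits plus the pack names that cover them. */
    public static final class Record {
        public final List<String> tips = new ArrayList<>();
        public final List<String> packNames = new ArrayList<>();
    }

    public static List<Record> parse(BufferedReader in) throws IOException {
        List<Record> records = new ArrayList<>();
        Record cur = null;
        String line;
        while ((line = in.readLine()) != null) {
            if (line.isEmpty()) {
                // A blank line terminates the current record.
                cur = null;
                continue;
            }
            if (cur == null) {
                cur = new Record();
                records.add(cur);
            }
            if (line.startsWith("+ "))
                cur.tips.add(line.substring(2));       // "+ <tip commit id>"
            else if (line.startsWith("P "))
                cur.packNames.add(line.substring(2));  // "P <pack name hash>"
        }
        return records;
    }
}
```

A real reader would also validate the object ids and resolve the pack names against objects/pack/; this sketch only recovers the record structure the script emits.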
13 years ago
/*
 * Copyright (C) 2008-2009, Google Inc.
 * Copyright (C) 2008, Marek Zawirski <marek.zawirski@gmail.com>
 * and other copyright owners as documented in the project's IP log.
 *
 * This program and the accompanying materials are made available
 * under the terms of the Eclipse Distribution License v1.0 which
 * accompanies this distribution, is reproduced below, and is
 * available at http://www.eclipse.org/org/documents/edl-v10.php
 *
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or
 * without modification, are permitted provided that the following
 * conditions are met:
 *
 * - Redistributions of source code must retain the above copyright
 *   notice, this list of conditions and the following disclaimer.
 *
 * - Redistributions in binary form must reproduce the above
 *   copyright notice, this list of conditions and the following
 *   disclaimer in the documentation and/or other materials provided
 *   with the distribution.
 *
 * - Neither the name of the Eclipse Foundation, Inc. nor the
 *   names of its contributors may be used to endorse or promote
 *   products derived from this software without specific prior
 *   written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
 * CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES,
 * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
 * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
 * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
 * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
 * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
 * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
 * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
 * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
 * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */
package org.eclipse.jgit.storage.pack;

import java.io.IOException;
import java.io.OutputStream;
import java.security.MessageDigest;
import java.util.zip.CRC32;

import org.eclipse.jgit.JGitText;
import org.eclipse.jgit.lib.Constants;
import org.eclipse.jgit.lib.ProgressMonitor;
import org.eclipse.jgit.util.NB;
/** Custom output stream to support {@link PackWriter}. */
public final class PackOutputStream extends OutputStream {
    private static final int BYTES_TO_WRITE_BEFORE_CANCEL_CHECK = 128 * 1024;

    private final ProgressMonitor writeMonitor;

    private final OutputStream out;

    private final PackWriter packWriter;

    private final CRC32 crc = new CRC32();

    private final MessageDigest md = Constants.newMessageDigest();

    private long count;

    private byte[] headerBuffer = new byte[32];

    private byte[] copyBuffer;

    private long checkCancelAt;

    /**
     * Initialize a pack output stream.
     * <p>
     * This constructor is exposed to support debugging the JGit library only.
     * Application or storage level code should not create a PackOutputStream;
     * instead use {@link PackWriter}, and let the writer create the stream.
     *
     * @param writeMonitor
     *            monitor to update on object output progress.
     * @param out
     *            target stream to receive all object contents.
     * @param pw
     *            packer that is going to perform the output.
     */
    public PackOutputStream(final ProgressMonitor writeMonitor,
            final OutputStream out, final PackWriter pw) {
        this.writeMonitor = writeMonitor;
        this.out = out;
        this.packWriter = pw;
        this.checkCancelAt = BYTES_TO_WRITE_BEFORE_CANCEL_CHECK;
    }

    @Override
    public void write(final int b) throws IOException {
        count++;
        out.write(b);
        crc.update(b);
        md.update((byte) b);
    }

    @Override
    public void write(final byte[] b, int off, int len) throws IOException {
        while (0 < len) {
            final int n = Math.min(len, BYTES_TO_WRITE_BEFORE_CANCEL_CHECK);
            count += n;

            if (checkCancelAt <= count) {
                if (writeMonitor.isCancelled()) {
                    throw new IOException(
                            JGitText.get().packingCancelledDuringObjectsWriting);
                }
                checkCancelAt = count + BYTES_TO_WRITE_BEFORE_CANCEL_CHECK;
            }

            out.write(b, off, n);
            crc.update(b, off, n);
            md.update(b, off, n);

            off += n;
            len -= n;
        }
    }

    @Override
    public void flush() throws IOException {
        out.flush();
    }

    void writeFileHeader(int version, long objectCount) throws IOException {
        System.arraycopy(Constants.PACK_SIGNATURE, 0, headerBuffer, 0, 4);
        NB.encodeInt32(headerBuffer, 4, version);
        NB.encodeInt32(headerBuffer, 8, (int) objectCount);
        write(headerBuffer, 0, 12);
    }

    /**
     * Write one object.
     *
     * If the object was already written, this method does nothing and returns
     * quickly. This case occurs whenever an object was written out of order
     * to ensure the delta base occurred before the object that needs it.
     *
     * @param otp
     *            the object to write.
     * @throws IOException
     *             the object cannot be read from the object reader, or the
     *             output stream is no longer accepting output. Caller must
     *             examine the type of exception and possibly its message to
     *             distinguish between these cases.
     */
    public void writeObject(ObjectToPack otp) throws IOException {
        packWriter.writeObject(this, otp);
    }

    /**
     * Commits the object header onto the stream.
     * <p>
     * Once the header has been written, the object representation must be
     * fully output, or packing must abort abnormally.
     *
     * @param otp
     *            the object to pack. Header information is obtained.
     * @param rawLength
     *            number of bytes of the inflated content. For an object that
     *            is in whole object format, this is the same as the object
     *            size. For an object that is in a delta format, this is the
     *            size of the inflated delta instruction stream.
     * @throws IOException
     *             the underlying stream refused to accept the header.
     */
    public void writeHeader(ObjectToPack otp, long rawLength)
            throws IOException {
        if (otp.isDeltaRepresentation()) {
            if (packWriter.isDeltaBaseAsOffset()) {
                ObjectToPack baseInPack = otp.getDeltaBase();
                if (baseInPack != null && baseInPack.isWritten()) {
                    final long start = count;
                    int n = encodeTypeSize(Constants.OBJ_OFS_DELTA, rawLength);
                    write(headerBuffer, 0, n);

                    long offsetDiff = start - baseInPack.getOffset();
                    n = headerBuffer.length - 1;
                    headerBuffer[n] = (byte) (offsetDiff & 0x7F);
                    while ((offsetDiff >>= 7) > 0)
                        headerBuffer[--n] = (byte) (0x80 | (--offsetDiff & 0x7F));
                    write(headerBuffer, n, headerBuffer.length - n);
                    return;
                }
            }

            int n = encodeTypeSize(Constants.OBJ_REF_DELTA, rawLength);
            otp.getDeltaBaseId().copyRawTo(headerBuffer, n);
            write(headerBuffer, 0, n + Constants.OBJECT_ID_LENGTH);
        } else {
            int n = encodeTypeSize(otp.getType(), rawLength);
            write(headerBuffer, 0, n);
        }
    }

    private int encodeTypeSize(int type, long rawLength) {
        long nextLength = rawLength >>> 4;
        headerBuffer[0] = (byte) ((nextLength > 0 ? 0x80 : 0x00)
                | (type << 4) | (rawLength & 0x0F));
        rawLength = nextLength;
        int n = 1;
        while (rawLength > 0) {
            nextLength >>>= 7;
            headerBuffer[n++] = (byte) ((nextLength > 0 ? 0x80 : 0x00)
                    | (rawLength & 0x7F));
            rawLength = nextLength;
        }
        return n;
    }

    /** @return a temporary buffer writers can use to copy data with. */
    public byte[] getCopyBuffer() {
        if (copyBuffer == null)
            copyBuffer = new byte[16 * 1024];
        return copyBuffer;
    }

    void endObject() {
        writeMonitor.update(1);
    }

    /** @return total number of bytes written since stream start. */
    long length() {
        return count;
    }

    /** @return the current CRC32 register. */
    int getCRC32() {
        return (int) crc.getValue();
    }

    /** Reinitialize the CRC32 register for a new region. */
    void resetCRC32() {
        crc.reset();
    }

    /** @return the current SHA-1 digest. */
    byte[] getDigest() {
        return md.digest();
    }
}
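The stream above uses three compact encodings: the 12-byte pack file header ("PACK", version, object count), the object entry header in which the low 4 bits of the size share the first byte with the 3-bit type, and the big-endian base-128 backward offset written for OBJ_OFS_DELTA entries (note the `--offsetDiff` bias on continuation bytes). As a self-contained sketch, the hypothetical demo class below (PackHeaderDemo is illustrative, not a JGit class) mirrors those encodings and adds a decoder so the offset encoding can be round-trip checked:

```java
import java.util.Arrays;

/** Illustrative re-implementation of the PackOutputStream encodings. */
public class PackHeaderDemo {
    /** 12-byte pack stream header: "PACK", big-endian version, object count. */
    public static byte[] encodeFileHeader(int version, int objectCount) {
        byte[] buf = new byte[12];
        buf[0] = (byte) 'P'; buf[1] = (byte) 'A';
        buf[2] = (byte) 'C'; buf[3] = (byte) 'K';
        for (int i = 0; i < 4; i++) {
            // Big-endian 32-bit writes, as NB.encodeInt32 does.
            buf[4 + i] = (byte) (version >>> (8 * (3 - i)));
            buf[8 + i] = (byte) (objectCount >>> (8 * (3 - i)));
        }
        return buf;
    }

    /** Entry header: 3-bit type plus size, 4 bits then 7 bits per byte. */
    public static byte[] encodeTypeSize(int type, long rawLength) {
        byte[] buf = new byte[10];
        long nextLength = rawLength >>> 4;
        buf[0] = (byte) ((nextLength > 0 ? 0x80 : 0x00)
                | (type << 4) | (rawLength & 0x0F));
        rawLength = nextLength;
        int n = 1;
        while (rawLength > 0) {
            nextLength >>>= 7;
            buf[n++] = (byte) ((nextLength > 0 ? 0x80 : 0x00)
                    | (rawLength & 0x7F));
            rawLength = nextLength;
        }
        return Arrays.copyOf(buf, n);
    }

    /** Backward distance to a delta base: big-endian base-128, most
     *  significant bytes first, with a +1 bias per continuation byte. */
    public static byte[] encodeOffset(long offsetDiff) {
        byte[] buf = new byte[10];
        int n = buf.length - 1;
        buf[n] = (byte) (offsetDiff & 0x7F);
        while ((offsetDiff >>= 7) > 0)
            buf[--n] = (byte) (0x80 | (--offsetDiff & 0x7F));
        return Arrays.copyOfRange(buf, n, buf.length);
    }

    /** Inverse of encodeOffset; the bias reappears as the (ofs + 1) term. */
    public static long decodeOffset(byte[] b) {
        long ofs = b[0] & 0x7F;
        for (int i = 1; i < b.length; i++)
            ofs = ((ofs + 1) << 7) | (b[i] & 0x7F);
        return ofs;
    }
}
```

For example, a blob (type 3) of 100 inflated bytes encodes as two header bytes, and an offset of 1000 back to its base as two more; the +1 bias is what lets git distinguish, say, the two-byte encodings of adjacent offset ranges without ambiguity.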