dht-schema.html 40KB


  1. <html>
  2. <head>
  3. <title>Git on DHT Schema</title>
  4. <style type='text/css'>
  5. body { font-size: 10pt; }
  6. h1 { font-size: 16pt; }
  7. h2 { font-size: 12pt; }
  8. h3 { font-size: 10pt; }
  9. body {
  10. margin-left: 8em;
  11. margin-right: 8em;
  12. }
  13. h1 { margin-left: -3em; }
  14. h2 { margin-left: -2em; }
  15. h3 { margin-left: -1em; }
  16. hr { margin-left: -4em; margin-right: -4em; }
  17. .coltoc {
  18. font-size: 8pt;
  19. font-family: monospace;
  20. }
  21. .rowkey {
  22. margin-left: 1em;
  23. padding-top: 0.2em;
  24. padding-left: 1em;
  25. padding-right: 1em;
  26. width: 54em;
  27. border: 1px dotted red;
  28. background-color: #efefef;
  29. white-space: nowrap;
  30. }
  31. .rowkey .header {
  32. font-weight: bold;
  33. padding-right: 1em;
  34. }
  35. .rowkey .var {
  36. font-style: italic;
  37. font-family: monospace;
  38. }
  39. .rowkey .lit {
  40. font-weight: bold;
  41. font-family: monospace;
  42. }
  43. .rowkey .example {
  44. font-family: monospace;
  45. }
  46. .rowkey p {
  47. white-space: normal;
  48. }
  49. .colfamily {
  50. margin-top: 0.5em;
  51. padding-top: 0.2em;
  52. padding-left: 1em;
  53. padding-right: 1em;
  54. width: 55em;
  55. border: 1px dotted blue;
  56. background-color: #efefef;
  57. white-space: nowrap;
  58. }
  59. .colfamily .header {
  60. font-weight: bold;
  61. padding-right: 1em;
  62. }
  63. .colfamily .var {
  64. font-style: italic;
  65. font-family: monospace;
  66. }
  67. .colfamily .lit {
  68. font-family: monospace;
  69. }
  70. .colfamily .example {
  71. font-family: monospace;
  72. }
  73. .colfamily p {
  74. white-space: normal;
  75. }
  76. .summary_table {
  77. border-collapse: collapse;
  78. border-spacing: 0;
  79. }
  80. .summary_table .desc {
  81. font-size: 8pt;
  82. white-space: nowrap;
  83. text-align: right;
  84. width: 20em;
  85. }
  86. .summary_table td {
  87. border: 1px dotted lightgray;
  88. padding-top: 2px;
  89. padding-bottom: 2px;
  90. padding-left: 5px;
  91. padding-right: 5px;
  92. vertical-align: top;
  93. }
  94. .summary_table tr.no_border td {
  95. border: none;
  96. }
  97. </style>
  98. </head>
  99. <body>
  100. <h1>Git on DHT Schema</h1>
  101. <p>Storing Git repositories on a Distributed Hash Table (DHT) may
  102. improve scaling under heavy traffic, and also simplifies management when
  103. there are many small repositories.</p>
  104. <h2>Table of Contents</h2>
  105. <ul>
  106. <li><a href="#concepts">Concepts</a></li>
  107. <li><a href="#summary">Summary</a></li>
  108. <li><a href="#security">Data Security</a></li>
  109. <li>Tables:
  110. <ul>
  111. <li><a href="#REPOSITORY_INDEX">Table REPOSITORY_INDEX</a>
  112. (
  113. <a href="#REPOSITORY_INDEX.id" class="toccol">id</a>
  114. )</li>
  115. <li><a href="#REPOSITORY">Table REPOSITORY</a>
  116. (
  117. <a href="#REPOSITORY.chunk-info" class="toccol">chunk-info</a>,
  118. <a href="#REPOSITORY.cached-pack" class="toccol">cached-pack</a>
  119. )</li>
  120. <li><a href="#REF">Table REF</a>
  121. (
  122. <a href="#REF.target" class="toccol">target</a>
  123. )</li>
  124. <li><a href="#OBJECT_INDEX">Table OBJECT_INDEX</a>
  125. (
  126. <a href="#OBJECT_INDEX.info" class="toccol">info</a>
  127. )</li>
  128. <li><a href="#CHUNK">Table CHUNK</a>
  129. (
  130. <a href="#CHUNK.chunk" class="toccol">chunk</a>,
  131. <a href="#CHUNK.index" class="toccol">index</a>,
  132. <a href="#CHUNK.meta" class="toccol">meta</a>
  133. )</li>
  134. </ul>
  135. </li>
  136. <li>Protocol Messages:
  137. <ul>
  138. <li><a href="#message_RefData">RefData</a></li>
  139. <li><a href="#message_ObjectInfo">ObjectInfo</a></li>
  140. <li><a href="#message_ChunkInfo">ChunkInfo</a></li>
  141. <li><a href="#message_ChunkMeta">ChunkMeta</a></li>
  142. <li><a href="#message_CachedPackInfo">CachedPackInfo</a></li>
  143. </ul>
  144. </li>
  145. </ul>
  146. <a name="concepts"><h2>Concepts</h2></a>
  147. <p><i>Git Repository</i>: Stores the version control history for a
  148. single project. Each repository is a directed acyclic graph (DAG)
  149. composed of objects. Revision history for a project is described by a
  150. commit object pointing to the complete set of files that make up that
  151. version of the project, and a pointer to the commit that came
  152. immediately before it. Repositories also contain references,
  153. associating a human readable branch or tag name to a specific commit
  154. object. Tommi Virtanen has a
  155. <a href="http://eagain.net/articles/git-for-computer-scientists/">more
  156. detailed description of the Git DAG</a>.</p>
  157. <p><i>Object</i>: Git objects are named by applying the SHA-1 hash
  158. algorithm to their contents. There are 4 object types: commit, tree,
  159. blob, tag. Objects are typically stored deflated using libz deflate,
  160. but may also be delta compressed against another similar object,
  161. further reducing the storage required. The big factor for Git
  162. repository size is usually object count, e.g. the linux-2.6 repository
  163. contains 1.8 million objects.</p>
  164. <p><i>Reference</i>: Associates a human readable symbolic name, such
  165. as <code>refs/heads/master</code> to a specific Git object, usually a
  166. commit or tag. References are updated to point to the most recent
  167. object whenever changes are committed to the repository.</p>
  168. <p><i>Git Pack File</i>: A container stream holding many objects in a
  169. highly compressed format. On the local filesystem, Git uses pack files
  170. to reduce both inode and space usage by combining millions of objects
  171. into a single data stream. On the network, Git uses pack files as the
  172. basic network protocol to transport objects from one system's
  173. repository to another.</p>
  174. <p><i>Garbage Collection</i>: Scanning the Git object graph to locate
  175. objects that are reachable, and others that are unreachable. Git also
  176. generally performs data recompression during this task to produce more
  177. optimal deltas between objects, reducing overall disk usage and data
  178. transfer sizes. This is independent of any GC that may be performed by
  179. the DHT to clean up old cells.</p>
  180. <p>The basic storage strategy employed by this schema is to break a
  181. standard Git pack file into chunks, approximately 1 MiB in size. Each
  182. chunk is stored as one row in the <a href="#CHUNK">CHUNK</a> table.
  183. During reading, chunks are paged into the application on demand, but
  184. may also be prefetched using prefetch hints. Rules are used to break
  185. the standard pack into chunks; these rules help to improve reference
  186. locality and reduce the number of chunk loads required to service
  187. common operations. In a nutshell, the DHT is used as a virtual memory
  188. system for pages about 1 MiB in size.</p>
  189. <a name="summary"><h2>Summary</h2></a>
  190. <p>The schema consists of a handful of tables. Size estimates are
  191. given for one copy of the linux-2.6 Git repository, a relative torture
  192. test case that contains 1.8 million objects and is 425 MiB when stored
  193. on the local filesystem. All sizes are before any replication made by
  194. the DHT, or its underlying storage system.</p>
  195. <table style='margin-left: 2em' class='summary_table'>
  196. <tr>
  197. <th>Table</th>
  198. <th>Rows</th>
  199. <th>Cells/Row</th>
  200. <th>Bytes</th>
  201. <th>Bytes/Row</th>
  202. </tr>
  203. <tr>
  204. <td><a href="#REPOSITORY_INDEX">REPOSITORY_INDEX</a>
  205. <div class='desc'>Map host+path to surrogate key.</div></td>
  206. <td align='right'>1</td>
  207. <td align='right'>1</td>
  208. <td align='right'>&lt; 100 bytes</td>
  209. <td align='right'>&lt; 100 bytes</td>
  210. </tr>
  211. <tr>
  212. <td><a href="#REPOSITORY">REPOSITORY</a>
  213. <div class='desc'>Accounting and replica management.</div></td>
  214. <td align='right'>1</td>
  215. <td align='right'>403</td>
  216. <td align='right'>65 KiB</td>
  217. <td align='right'>65 KiB</td>
  218. </tr>
  219. <tr>
  220. <td><a href="#REF">REF</a>
  221. <div class='desc'>Bind branch/tag name to Git object.</div></td>
  222. <td align='right'>211</td>
  223. <td align='right'>1</td>
  224. <td align='right'>14 KiB</td>
  225. <td align='right'>67 bytes</td>
  226. </tr>
  227. <tr>
  228. <td><a href="#OBJECT_INDEX">OBJECT_INDEX</a>
  229. <div class='desc'>Locate Git object by SHA-1 name.</div></td>
  230. <td align='right'>1,861,833</td>
  231. <td align='right'>1</td>
  232. <td align='right'>154 MiB</td>
  233. <td align='right'>87 bytes</td>
  234. </tr>
  235. <tr>
  236. <td><a href="#CHUNK">CHUNK</a>
  237. <div class='desc'>Complete Git object storage.</div></td>
  238. <td align='right'>402</td>
  239. <td align='right'>3</td>
  240. <td align='right'>417 MiB</td>
  241. <td align='right'>~ 1 MiB</td>
  242. </tr>
  243. <tr class='no_border'>
  244. <td align='right'><i>Total</i></td>
  245. <td align='right'>1,862,448</td>
  246. <td align='right'></td>
  247. <td align='right'>571 MiB</td>
  248. <td align='right'></td>
  249. </tr>
  250. </table>
  251. <a name="security"><h2>Data Security</h2></a>
  252. <p>If data encryption is necessary to protect file contents, the <a
  253. href="#CHUNK.chunk">CHUNK.chunk</a> column can be encrypted with a
  254. block cipher such as AES. This column contains the revision commit
  255. messages, file paths, and file contents. By encrypting one column, the
  256. majority of the repository data is secured. As each cell value is
  257. about 1 MiB and contains a trailing 4 bytes of random data, an ECB
  258. mode of operation may be sufficient. Because the cells are already
  259. very highly compressed using the Git data compression algorithms,
  260. there is no increase in disk usage due to encryption.</p>
  261. <p>Branch and tag names (<a href="#REF">REF</a> row keys) are not
  262. encrypted. If these need to be secured the portion after the ':' would
  263. need to be encrypted with a block cipher. However these strings are
  264. very short and very common (HEAD, refs/heads/master, refs/tags/v1.0),
  265. making encryption difficult. A variation on the schema might move all
  266. rows for a repository into a single protocol message, then encrypt
  267. the protobuf into a single cell. Unfortunately this strategy has a
  268. high update cost, and references change frequently.</p>
  269. <p>Object SHA-1 names (<a href="#OBJECT_INDEX">OBJECT_INDEX</a> row
  270. keys and <a href="#CHUNK.index">CHUNK.index</a> values) are not
  271. encrypted. This allows a reader to determine if a repository contains
  272. a specific revision, but does not allow them to inspect the contents
  273. of the revision. The CHUNK.index column could also be encrypted with a
  274. block cipher when CHUNK.chunk is encrypted (see above), however the
  275. OBJECT_INDEX table row keys cannot be encrypted if abbreviation
  276. expansion is to be supported for end-users of the repository. The row
  277. keys must be unencrypted as abbreviation resolution is performed by a
  278. prefix range scan over the keys.</p>
  279. <p>The remaining tables and columns contain only statistics (e.g.
  280. object counts or cell sizes), or internal surrogate keys
  281. (repository_id, chunk_key) and do not require encryption.</p>
  282. <hr />
  283. <a name="REPOSITORY_INDEX"><h2>Table REPOSITORY_INDEX</h2></a>
  284. <p>Maps a repository name, as presented in the URL by an end-user or
  285. client application, into its internal repository_id value. This
  286. mapping allows the repository name to be quickly modified (e.g.
  287. renamed) without needing to update the larger data rows of the
  288. repository.</p>
  289. <p>The exact semantics of the repository_name format is left as a
  290. deployment decision, but DNS hostname, '/', repository name would be
  291. one common usage.</p>
  292. <h3>Row Key</h3>
  293. <div class='rowkey'>
  294. <div>
  295. <span class='header'>Row Key:</span>
  296. <span class='var'>repository_name</span>
  297. </div>
  298. <p>Human readable name of the repository, typically derived from the
  299. HTTP <code>Host</code> header and path in the URL.</p>
  300. <p>Examples:</p>
  301. <ul>
  302. <li><span class='example'>com.example.git/pub/git/foo.git</span></li>
  303. <li><span class='example'>com.example.git/~user/mystuff.git</span></li>
  304. </ul>
  305. </div>
  306. <h3>Columns</h3>
  307. <div class='colfamily'>
  308. <div>
  309. <span class='header'>Column:</span>
  310. <a name="REPOSITORY_INDEX.id"><span class='lit'>id:</span></a>
  311. </div>
  312. <p>The repository_id, as an 8-digit hex ASCII string.</p>
  313. </div>
  314. <h3>Size Estimate</h3>
  315. <p>Less than 100,000 rows. A more likely estimate is 1,000 rows.
  316. Total disk usage under 512 KiB, assuming 1,000 names and 256
  317. characters per name.</p>
  318. <h3>Updates</h3>
  319. <p>Only on repository creation or rename, which is infrequent (&lt;10
  320. rows/month). Updates are performed in a row-level transaction, to
  321. ensure a name is either assigned uniquely, or fails.</p>
  322. <h3>Reads</h3>
  323. <p>Reads are tried first against memcache, then against the DHT if the
  324. entry did not exist in memcache. Successful reads against the DHT are
  325. put back into memcache in the background.</p>
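<p>The read path above can be sketched as a read-through cache. This is a
minimal illustration, not the actual implementation: plain dictionaries stand
in for memcache and the DHT, the function name
<code>read_repository_id</code> is hypothetical, and the cache is populated
inline rather than in the background.</p>

```python
def read_repository_id(repository_name, memcache, dht):
    """Look up a repository_id, trying the cache before the DHT.

    `memcache` and `dht` are plain dicts standing in for the real services
    (an assumption made for illustration only).
    """
    cached = memcache.get(repository_name)
    if cached is not None:
        return cached

    value = dht.get(repository_name)
    if value is not None:
        # The schema describes this write-back as a background task;
        # it is done inline here to keep the sketch short.
        memcache[repository_name] = value
    return value
```

<p>Misses are deliberately not cached in this sketch, so a repository created
later becomes visible on the next lookup.</p>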
  326. <a name="REPOSITORY"><h2>Table REPOSITORY</h2></a>
  327. <p>Tracks top-level information about each repository.</p>
  328. <h3>Row Key</h3>
  329. <div class='rowkey'>
  330. <div>
  331. <span class='header'>Row Key:</span>
  332. <span class='var'>repository_id</span>
  333. </div>
  334. <p>The repository_id, as an 8-digit hex ASCII string.</p>
  335. </div>
  336. <p>Typically this is assigned sequentially, then has the bits reversed
  337. to evenly spread repositories throughout the DHT. For example the
  338. first repository is <code>80000000</code>, and the second is
  339. <code>40000000</code>.</p>
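<p>The bit reversal can be sketched as follows. The function name
<code>repository_id_from_sequence</code> is hypothetical, and the sketch
assumes a 32-bit counter that starts at 1, which is what the two example
values above imply.</p>

```python
def repository_id_from_sequence(seq):
    """Reverse the 32 bits of a sequential counter and format it as
    an 8-digit hex ASCII string."""
    if not 0 <= seq < 2**32:
        raise ValueError("sequence number must fit in 32 bits")
    # Render as 32 binary digits, reverse the string, parse it back.
    reversed_bits = int(f"{seq:032b}"[::-1], 2)
    return f"{reversed_bits:08x}"


print(repository_id_from_sequence(1))  # 80000000
print(repository_id_from_sequence(2))  # 40000000
```

<p>Because the low-order bits of the counter change fastest, reversing them
places consecutive repositories far apart in the key space.</p>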
  340. <h3>Columns</h3>
  341. <div class='colfamily'>
  342. <div>
  343. <span class='header'>Column:</span>
  344. <a name="REPOSITORY.chunk-info"><span class='lit'>chunk-info:</span></a>
  345. <span class='var'>chunk_key[short]</span>
  346. </div>
  347. <p>Cell value is the protocol message <a
  348. href="#message_ChunkInfo">ChunkInfo</a> describing the chunk's
  349. contents. Most of the message's fields are only useful for quota
  350. accounting and reporting.</p>
  351. </div>
  352. <p>This column exists to track all of the chunks that make up a
  353. repository's object set. Garbage collection and quota accounting tasks
  354. can primarily drive off this column, rather than scanning the much
  355. larger <a href="#CHUNK">CHUNK</a> table with a regular expression on
  356. the chunk row key.</p>
  357. <p>As each chunk averages 1 MiB in size, the linux-2.6 repository
  358. (at 373 MiB) has about 400 chunks and thus about 400 chunk-info
  359. cells. The chromium repository (at 1 GiB) has about 1000 chunk-info
  360. cells. It would not be uncommon to have 2000 chunk-info cells.</p>
  361. <div class='colfamily'>
  362. <div>
  363. <span class='header'>Column:</span>
  364. <a name="REPOSITORY.cached-pack"><span class='lit'>cached-pack:</span></a>
  365. <span class='var'>NNNNx38</span>
  366. <span class='lit'>.</span>
  367. <span class='var'>VVVVx38</span>
  368. </div>
  369. <p>Variables:</p>
  370. <ul>
  371. <li><span class='var'>NNNNx38</span> = 40 hex digit name of the cached pack</li>
  372. <li><span class='var'>VVVVx38</span> = 40 hex digit version of the cached pack</li>
  373. </ul>
  374. <p>Examples:</p>
  375. <ul>
  376. <li><span class='example'>4e32fb97103981e7dd53dcc786640fa4fdb444b8.8975104a03d22e54f7060502e687599d1a2c2516</span></li>
  377. </ul>
  378. <p>Cell value is the protocol message <a
  379. href="#message_CachedPackInfo">CachedPackInfo</a> describing the
  380. chunks that make up a cached pack.</p>
  381. </div>
  382. <p>The <code>cached-pack</code> column family describes large lists of
  383. chunks that when combined together in a specific order create a valid
  384. Git pack file directly streamable to a client. This avoids needing to
  385. enumerate and pack the entire repository on each request.</p>
  386. <p>The cached-pack name (NNNNx38 above) is the SHA-1 of the sorted
  387. binary names of the objects contained within the pack. This is the standard
  388. naming convention for pack files on the local filesystem. The version
  389. (VVVVx38 above) is the SHA-1 of the chunk keys, sorted. The version
  390. makes the cached-pack cell unique: if any single bit in the compressed
  391. data is modified a different version will be generated, and a
  392. different cell will be used to describe the alternate version of the
  393. same data. The version is necessary to prevent repacks of the same
  394. object set (but with different compression settings or results) from
  395. stepping on active readers.</p>
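<p>One way to derive the <code>cached-pack</code> column name is sketched
below. The function name <code>cached_pack_column</code> is hypothetical, and
the byte encoding of the chunk keys fed into the version hash (UTF-8,
concatenated without separators) is an assumption; the schema only states
that the version is the SHA-1 of the sorted chunk keys.</p>

```python
import hashlib


def cached_pack_column(object_sha1s_hex, chunk_keys):
    """Build the `name.version` column qualifier for a cached pack."""
    # Name: SHA-1 over the binary object names, in sorted order.
    name = hashlib.sha1()
    for raw in sorted(bytes.fromhex(h) for h in object_sha1s_hex):
        name.update(raw)

    # Version: SHA-1 over the sorted chunk keys (encoding is assumed).
    version = hashlib.sha1()
    for key in sorted(chunk_keys):
        version.update(key.encode("utf-8"))

    return f"{name.hexdigest()}.{version.hexdigest()}"
```

<p>Repacking the same objects into different chunks keeps the name but changes
the version, so the new cell does not overwrite one that readers are still
using.</p>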
  396. <h3>Size Estimate</h3>
  397. <p>1 row per repository (~1,000 repositories), however the majority of
  398. the storage cost is in the <code>chunk-info</code> column family,
  399. which can have more than 2000 cells per repository.</p>
  400. <p>Each <code>chunk-info</code> cell is on average 147 bytes. For a
  401. large repository like chromium.git (over 1000 chunks) this is only 147
  402. KiB for the entire row.</p>
  403. <p>Each <code>cached-pack</code> cell is on average 5350 bytes. Most
  404. repositories have 1 of these cells, 2 while the repository is being
  405. repacked on the server side to update the cached-pack data.</p>
  406. <h3>Updates</h3>
  407. <p>Information about each ~1 MiB chunk of pack data received over the
  408. network is stored as a unique column in the <code>chunk-info</code>
  409. column family.</p>
  410. <p>Most pushes are at least 2 chunks (commit, tree), with 50 pushes
  411. per repository per day being possible (50,000 new cells/day).</p>
  412. <p><b>TODO:</b> Average push rates?</p>
  413. <h3>Reads</h3>
  414. <p><i>Serving clients:</i> Read all cells in the
  415. <code>cached-pack</code> column family, typically only 1-5 cells. The
  416. cells are cached in memcache and read from there first.</p>
  417. <p><i>Garbage collection:</i> Read all cells in the
  418. <code>chunk-info</code> column family to determine which chunks are
  419. owned by this repository, without scanning the <a href="#CHUNK">CHUNK</a> table.
  420. Delete <code>chunk-info</code> after the corresponding <a href="#CHUNK">CHUNK</a>
  421. row has been deleted. Unchanged chunks have their info left alone.</p>
  422. <a name="REF"><h2>Table REF</h2></a>
  423. <p>Associates a human readable branch (e.g.
  424. <code>refs/heads/master</code>) or tag (e.g.
  425. <code>refs/tags/v1.0</code>) name to the Git
  426. object that represents the current state of
  427. the repository.</p>
  428. <h3>Row Key</h3>
  429. <div class='rowkey'>
  430. <div>
  431. <span class='header'>Row Key:</span>
  432. <span class='var'>repository_id</span>
  433. <span class='lit'>:</span>
  434. <span class='var'>ref_name</span>
  435. </div>
  436. <p>Variables:</p>
  437. <ul>
  438. <li><span class='var'>repository_id</span> = Repository owning the reference (see above)</li>
  439. <li><span class='var'>ref_name</span> = Name of the reference, UTF-8 string</li>
  440. </ul>
  441. <p>Examples:</p>
  442. <ul>
  443. <li><span class='example'>80000000:HEAD</span></li>
  444. <li><span class='example'>80000000:refs/heads/master</span></li>
  445. <br />
  446. <li><span class='example'>40000000:HEAD</span></li>
  447. <li><span class='example'>40000000:refs/heads/master</span></li>
  448. </ul>
  449. </div>
  450. <p>The separator <code>:</code> used in the row key was chosen because
  451. this character is not permitted in a Git reference name.</p>
  452. <h3>Columns</h3>
  453. <div class='colfamily'>
  454. <div>
  455. <span class='header'>Column:</span>
  456. <a name="REF.target"><span class='lit'>target:</span></a>
  457. </div>
  458. <p>Cell value is the protocol message
  459. <a href="#message_RefData">RefData</a> describing the
  460. current SHA-1 the reference points to, and the chunk
  461. it was last observed in. The chunk hint allows skipping
  462. a read of <a href="#OBJECT_INDEX">OBJECT_INDEX</a>.</p>
  463. <p>Several versions (5) are stored for emergency rollbacks.
  464. Additional versions beyond 5 are cleaned up during table garbage
  465. collection as managed by the DHT's cell GC.</p>
  466. </div>
  467. <h3>Size Estimate</h3>
  468. <p><i>Normal Git usage:</i> ~10 branches per repository, ~200 tags.
  469. For 1,000 repositories, about 200,000 rows total. Average row size is
  470. about 240 bytes/row before compression (67 after), or 48 MB total.</p>
  471. <p><i>Gerrit Code Review usage:</i> More than 300 new rows per day.
  472. Each snapshot of each change under review is one reference.</p>
  473. <h3>Updates</h3>
  474. <p>Writes are performed by doing an atomic compare-and-swap (through a
  475. transaction), changing the RefData protocol buffer.</p>
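<p>The compare-and-swap update can be illustrated with an in-memory stand-in
for the table. The <code>RefTable</code> class is hypothetical, a lock plays
the role of the DHT's row-level transaction, and a plain SHA-1 hex string
stands in for the RefData message.</p>

```python
import threading


class RefTable:
    """In-memory stand-in for the REF table (illustration only)."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def compare_and_swap(self, repository_id, ref_name, expected, new):
        """Set the ref to `new` only if it currently equals `expected`.

        Pass expected=None to create a ref that must not exist yet.
        Returns True when the update was applied.
        """
        if ":" in ref_name:
            raise ValueError("':' is not permitted in a Git reference name")
        row_key = f"{repository_id}:{ref_name}"
        with self._lock:  # stands in for the row-level transaction
            if self._rows.get(row_key) != expected:
                return False
            self._rows[row_key] = new
            return True
```

<p>A writer that loses the race gets <code>False</code> back and can re-read
the reference before deciding whether to retry.</p>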
  476. <h3>Reads</h3>
  477. <p>Reads perform a prefix scan for all rows starting with
  478. <code>repository_id:</code>. Plans exist to cache these reads within a
  479. custom service, avoiding most DHT queries.</p>
  480. <a name="OBJECT_INDEX"><h2>Table OBJECT_INDEX</h2></a>
  481. <p>The Git network protocol has clients sending object SHA-1s to the
  482. server, with no additional context or information. End-users may also
  483. type a SHA-1 into a web search box. This table provides a mapping of
  484. the object SHA-1 to which chunk(s) store the object's data. The table
  485. is sometimes also called the 'global index', since it names where
  486. every single object is stored.</p>
  487. <h3>Row Key</h3>
  488. <div class='rowkey'>
  489. <div>
  490. <span class='header'>Row Key:</span>
  491. <span class='var'>NN</span>
  492. <span class='lit'>.</span>
  493. <span class='var'>repository_id</span>
  494. <span class='lit'>.</span>
  495. <span class='var'>NNx40</span>
  496. </div>
  497. <p>Variables:</p>
  498. <ul>
  499. <li><span class='var'>NN</span> = First 2 hex digits of object SHA-1</li>
  500. <li><span class='var'>repository_id</span> = Repository owning the object (see above)</li>
  501. <li><span class='var'>NNx40</span> = Complete object SHA-1 name, in hex</li>
  502. </ul>
  503. <p>Examples:</p>
  504. <ul>
  505. <li><span class='example'>2b.80000000.2b5c9037c81c38b3b9abc29a3a87a4abcd665ed4</span></li>
  506. <li><span class='example'>8f.40000000.8f270a441569b127cc4af8a6ef601d94d9490efb</span></li>
  507. </ul>
  508. </div>
  509. <p>The first 2 hex digits (<code>NN</code>) distribute object keys
  510. within the same repository around the DHT keyspace, preventing a busy
  511. repository from creating too much of a hot-spot within the DHT. To
  512. simplify key generation, these 2 digits are repeated after the
  513. repository_id, as part of the 40 hex digit object name.</p>
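<p>Key construction is simple enough to show directly. The helper name
<code>object_index_key</code> is hypothetical; the format follows the row key
definition above.</p>

```python
def object_index_key(repository_id, object_sha1_hex):
    """Build an OBJECT_INDEX row key: NN.repository_id.NNx40."""
    if len(object_sha1_hex) != 40:
        raise ValueError("expected a 40 hex digit SHA-1")
    # The leading two digits are repeated as the key-space spreading prefix.
    return f"{object_sha1_hex[:2]}.{repository_id}.{object_sha1_hex}"


print(object_index_key("80000000", "2b5c9037c81c38b3b9abc29a3a87a4abcd665ed4"))
# 2b.80000000.2b5c9037c81c38b3b9abc29a3a87a4abcd665ed4
```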
  514. <p>Keys must be clustered by repository_id to support extending
  515. abbreviations. End-users may supply an abbreviated SHA-1 of 4 or more
  516. digits (up to 39) and ask the server to complete them to a full 40
  517. digit SHA-1 if the server has the relevant object within the
  518. repository's object set.</p>
  519. <p>A schema variant that did not include the repository_id as part of
  520. the row key was considered, but discarded because completing a short
  521. 4-6 digit abbreviated SHA-1 would be impractical once there were
  522. billions of objects stored in the DHT. Git end-users expect to be able
  523. to use 4 or 6 digit abbreviations on very small repositories, as the
  524. number of objects is low and thus the number of bits required to
  525. uniquely name an object within that object set is small.</p>
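<p>Abbreviation expansion as a prefix range scan can be sketched like this. A
sorted Python list stands in for the ordered DHT key space, and the function
name <code>expand_abbreviation</code> is hypothetical.</p>

```python
import bisect


def expand_abbreviation(sorted_row_keys, repository_id, abbrev):
    """Return every full SHA-1 in the repository that starts with `abbrev`.

    `sorted_row_keys` is a sorted list of OBJECT_INDEX row keys, standing in
    for a range scan against the DHT (an assumption for illustration).
    """
    if not 4 <= len(abbrev) <= 39:
        raise ValueError("abbreviation must be 4 to 39 hex digits")
    abbrev = abbrev.lower()
    prefix = f"{abbrev[:2]}.{repository_id}.{abbrev}"

    matches = []
    start = bisect.bisect_left(sorted_row_keys, prefix)
    for key in sorted_row_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so the matching range has ended
        matches.append(key.rsplit(".", 1)[1])
    return matches
```

<p>Exactly one match means the abbreviation is unambiguous within that
repository; more than one means the user must supply more digits. Because the
repository_id sits inside the prefix, the scan never touches other
repositories' objects.</p>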
  526. <h3>Columns</h3>
  527. <div class='colfamily'>
  528. <div>
  529. <span class='header'>Column:</span>
  530. <a name="OBJECT_INDEX.info"><span class='lit'>info:</span></a>
  531. <span class='var'>chunk_key[short]</span>
  532. </div>
  533. <p>Cell value is the protocol message
  534. <a href="#message_ObjectInfo">ObjectInfo</a> describing how the object
  535. named by the row key is stored in the chunk named by the column name.</p>
  536. <p>Cell timestamp matters. The <b>oldest cell</b> within the
  537. entire column family is favored during reads. As chunk_key is
  538. unique, versions within a single column aren't relevant.</p>
  539. </div>
  540. <h3>Size Estimate</h3>
  541. <p>Average row size per object/chunk pair is 144 bytes uncompressed
  542. (87 compressed), based on the linux-2.6 repository. The linux-2.6
  543. repository has 1.8 million objects, and is growing at a rate of about
  544. 300,000 objects/year. Total usage for linux-2.6 is just over 154 MiB.</p>
  545. <p>Most rows contain only 1 cell, as the object appears in only 1
  546. chunk within that repository.</p>
  547. <p><i>Worst case:</i> 1.8 million rows/repository * 1,000 repositories
  548. is around 1.8 billion rows and 182G.</p>
  549. <h3>Updates</h3>
  550. <p>One write per object received over the network; typically performed
  551. as part of an asynchronous batch. Each batch is sized around 512 KiB
  552. (about 3000 rows). Because of SHA-1's uniform distribution, row keys
  553. are first sorted and then batched into buckets of about 3000 rows. To
  554. prevent too much activity from going to one table segment at a time
  555. the complete object list is segmented into up to 32 groups which are
  556. written in round-robin order.</p>
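<p>The write scheduling described above leaves some details open. The sketch
below shows one possible reading: sort the keys, cut the sorted list into up
to 32 contiguous groups, split each group into batches of about 3000 rows, and
then take one batch from each group in turn. The function name
<code>plan_batches</code> and this exact interleaving are assumptions.</p>

```python
from itertools import zip_longest


def plan_batches(row_keys, batch_rows=3000, max_groups=32):
    """Return write batches in round-robin order across key-space groups."""
    keys = sorted(row_keys)
    if not keys:
        return []

    # Never create more groups than there are batches to fill them.
    total_batches = -(-len(keys) // batch_rows)  # ceiling division
    groups = min(max_groups, total_batches)
    per_group = -(-len(keys) // groups)

    segments = [keys[i:i + per_group] for i in range(0, len(keys), per_group)]
    batched = [
        [seg[i:i + batch_rows] for i in range(0, len(seg), batch_rows)]
        for seg in segments
    ]

    # Interleave: first batch of every group, then second batch, and so on.
    order = []
    for round_ in zip_longest(*batched):
        order.extend(batch for batch in round_ if batch is not None)
    return order
```

<p>Consecutive batches then land in different regions of the sorted key space,
which is the stated goal of keeping any one table segment from absorbing all
of the write activity at once.</p>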
  557. <p>A full push of the linux-2.6 repository writes 1.8 million
  558. rows as there are 1.8 million objects in the pack stream.</p>
  559. <p>During normal insert or receive operations, each received object is
  560. a blind write to add one new <code>info:chunk_key[short]</code> cell
  561. to the row. During repack, all cells in the <code>info</code> column
  562. family are replaced with a single cell.</p>
  563. <h3>Reads</h3>
  564. <p>During common ancestor negotiation reads occur in batches of 64-128
  565. full row keys, uniformly distributed throughout the key space. Most of
  566. these reads are misses: the OBJECT_INDEX table does not contain the
  567. key offered by the client. A successful negotiation for most developers
  568. requires at least two rounds of 64 objects back-to-back over HTTP. Due
  569. to the high miss rate on this table, an in-memory bloom filter may be
  570. important for performance.</p>
  571. <p>To support the high read-rate (and high miss-rate) during common
  572. ancestor negotiation, an alternative to an in-memory bloom filter
  573. within the DHT is to download the entire set of keys into an alternate
  574. service job for recently accessed repositories. This service can only
  575. be used if <i>all</i> of the keys for the same repository_id are
  576. hosted within the service. Given this is under 36 GiB for the worst
  577. case 1.8 billion rows mentioned above, this may be feasible. Loading
  578. the table can be performed by fetching <a
  579. href="#REPOSITORY.chunk-info">REPOSITORY.chunk-info</a> and then
  580. performing parallel gets for the <a
  581. href="#CHUNK.index">CHUNK.index</a> column, and scanning the local
  582. indexes to construct the list of known objects.</p>
  583. <p>During repacking with no delta reuse, worst case scenario requires
  584. reading all records with the same repository_id (for linux-2.6 this
  585. is 1.8 million rows). Reads are made in a configurable batch size;
  586. right now this is set at 2048 keys/batch, with 4 concurrent batches in
  587. flight at a time.</p>
  588. <p>Reads are tried first against memcache, then against the DHT if the
  589. entry did not exist in memcache. Successful reads against the DHT are
  590. put back into memcache in the background.</p>
  591. <a name="CHUNK"><h2>Table CHUNK</h2></a>
  592. <p>Stores the object data for a repository, containing commit history,
  593. directory structure, and file revisions. Each chunk is typically 1 MiB
  594. in size, excluding the index and meta columns.</p>
  595. <h3>Row Key</h3>
  596. <div class='rowkey'>
  597. <div>
  598. <span class='header'>Row Key:</span>
  599. <span class='var'>HH</span>
  600. <span class='lit'>.</span>
  601. <span class='var'>repository_id</span>
  602. <span class='lit'>.</span>
  603. <span class='var'>HHx40</span>
  604. </div>
  605. <p>Variables:</p>
  606. <ul>
  607. <li><span class='var'>HH</span> = First 2 hex digits of chunk SHA-1</li>
  608. <li><span class='var'>repository_id</span> = Repository owning the chunk (see above)</li>
  609. <li><span class='var'>HHx40</span> = Complete chunk SHA-1, in hex</li>
  610. </ul>
  611. <p>Examples:</p>
  612. <ul>
  613. <li><span class='example'>09.80000000.09e0eb57543be633b004b672cbebdf335aa4d53f</span> <i>(full key)</i></li>
  614. </ul>
  615. </div>
  616. <p>Chunk keys are computed by first computing the SHA-1 of the
  617. <code>chunk:</code> column, which is the compressed object contents
  618. stored within the chunk. As the chunk data includes a 32 bit salt in
  619. the trailing 4 bytes, this value is random even for the exact same
  620. object input.</p>
  621. <p>The leading 2 hex digit <code>HH</code> component distributes
  622. chunks for the same repository (and over the same time period) evenly
  623. around the DHT keyspace, preventing any portion from becoming too
  624. hot.</p>
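<p>Chunk key generation can be sketched as below. The function name
<code>make_chunk_row</code> is hypothetical, and <code>os.urandom</code> is
only an assumed source for the 32-bit salt; the schema does not say how the
salt is produced.</p>

```python
import hashlib
import os


def make_chunk_row(repository_id, packed_objects):
    """Append a 4-byte salt to the chunk data and derive the CHUNK row key.

    Returns (row_key, chunk_data), where chunk_data is the value that would
    be stored in the `chunk:` column.
    """
    salt = os.urandom(4)  # assumed salt source, see the note above
    chunk_data = packed_objects + salt
    digest = hashlib.sha1(chunk_data).hexdigest()
    row_key = f"{digest[:2]}.{repository_id}.{digest}"
    return row_key, chunk_data
```

<p>Calling this twice with identical object data yields different row keys
with overwhelming probability, since the salt changes the SHA-1 input.</p>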
  625. <h3>Columns</h3>
  626. <div class='colfamily'>
  627. <div>
  628. <span class='header'>Column:</span>
  629. <a name="CHUNK.chunk"><span class='lit'>chunk:</span></a>
  630. </div>
  631. <p>Multiple objects in Git pack-file format, about 1 MiB in size.
  632. The data is already very highly compressed by Git and is not further
  633. compressible by the DHT.</p>
  634. </div>
  635. <p>This column is essentially the standard Git pack-file format,
  636. without the standard header or trailer. Objects can be stored in
  637. either whole format (object content is simply deflated inline)
  638. or in delta format (reference to a delta base is followed by
  639. deflated sequence of copy and/or insert instructions to recreate
  640. the object content). The OBJ_OFS_DELTA format is preferred
  641. for deltas, since it tends to use a shorter encoding than the
  642. OBJ_REF_DELTA format. Offsets beyond the start of the chunk are
  643. actually offsets to other chunks, and must be resolved using the
  644. <code>meta.base_chunk.relative_start</code> field.</p>
  645. <p>Because the row key is derived from the SHA-1 of this column, the
  646. trailing 4 bytes are randomly generated at insertion time, to make it
  647. impractical for remote clients to predict the name of the chunk row.
  648. This allows the stream parser to blindly insert rows without first
  649. checking for row existence, or worrying about replacing an existing
  650. row and causing data corruption.</p>
  651. <p>This column value is essentially opaque to the DHT.</p>
  652. <div class='colfamily'>
  653. <div>
  654. <span class='header'>Column:</span>
  655. <a name="CHUNK.index"><span class='lit'>index:</span></a>
  656. </div>
  657. <p>Binary searchable table listing object SHA-1 and starting offset
  658. of that object within the <code>chunk:</code> data stream. The data
  659. in this index is essentially random (due to the SHA-1s stored in
  660. binary) and thus is not compressible.</p>
  661. </div>
  662. <p>Sorted list of SHA-1s of each object that is stored in this chunk,
  663. along with the offset. This column allows efficient random access to
  664. any object within the chunk, without needing to perform a remote read
  665. against the <a href="#OBJECT_INDEX">OBJECT_INDEX</a> table. The column is
  666. very useful at read time, where pointers within Git objects will
  667. frequently reference other objects stored in the same chunk.</p>
  668. <p>This column is sometimes called the local index, since it is local
  669. only to the chunk and thus differs from the global index stored in the
  670. <a href="#OBJECT_INDEX">OBJECT_INDEX</a> table.</p>
  671. <p>The column size is 24 bytes per object stored in the chunk. Commit
  672. chunks store on average 2200 commits/chunk, so a commit chunk index is
  673. about 52,800 bytes.</p>
  674. <p>This column value is essentially opaque to the DHT.</p>
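<p>The lookup this column enables can be sketched as a binary search over fixed-width records. The exact encoding is not specified here, so the layout below (a 20-byte SHA-1 followed by a 4-byte big-endian offset, giving the stated 24 bytes per object) is an assumption, and <code>build_index</code> and <code>find_offset</code> are hypothetical names.</p>

```python
import struct
from typing import Iterable, Optional, Tuple

# Assumed record layout: 20-byte SHA-1 + 4-byte big-endian offset = 24 bytes.
RECORD = struct.Struct(">20sI")


def build_index(entries: Iterable[Tuple[bytes, int]]) -> bytes:
    """Pack (sha1, offset) pairs into a table sorted by SHA-1."""
    return b"".join(RECORD.pack(sha, off) for sha, off in sorted(entries))


def find_offset(index: bytes, sha: bytes) -> Optional[int]:
    """Binary search the table; return the object's offset or None."""
    lo, hi = 0, len(index) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        key, off = RECORD.unpack_from(index, mid * RECORD.size)
        if key == sha:
            return off
        if key < sha:
            lo = mid + 1
        else:
            hi = mid
    return None
```

<p>With this layout, 2200 commits occupy 2200 &times; 24 = 52,800 bytes, matching the estimate above.</p>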
  675. <div class='colfamily'>
  676. <div>
  677. <span class='header'>Column:</span>
  678. <a name="CHUNK.meta"><span class='lit'>meta:</span></a>
  679. </div>
  680. <p>Cell value is the protocol message
  681. <a href="#message_ChunkMeta">ChunkMeta</a> describing prefetch
  682. hints, object fragmentation, and delta base locations. Unlike
  683. <code>chunk:</code> and <code>index:</code>, this column is
  684. somewhat compressible.</p>
  685. </div>
  686. <p>The meta column provides information critical for reading the
  687. chunk's data, unlike <a href="#message_ChunkInfo">ChunkInfo</a> in
  688. the <a href="#REPOSITORY">REPOSITORY</a> table, which is used only for
  689. accounting.</p>
  690. <p>The most important element is the BaseChunk nested message,
  691. describing a chunk that contains a base object required to inflate
  692. an object that is stored in this chunk as a delta.</p>
  693. <h3>Chunk Contents</h3>
  694. <p>Chunks try to store only a single object type; however, mixed object
  695. type chunks are supported. The rule to store only one object type per
  696. chunk improves data locality, reducing the number of chunks that need
  697. to be accessed from the DHT in order to perform a particular Git
  698. operation. Clustering commits together into a 'commit chunk' improves
  699. data locality during log/history walking operations, while clustering
  700. trees together into a 'tree chunk' improves data locality during the
  701. early stages of packing or difference generation.</p>
  702. <p>Chunks reuse the standard Git pack data format to support direct
  703. streaming of a chunk's <code>chunk:</code> column to clients, without
  704. needing to perform any data manipulation on the server. This enables
  705. high speed data transfer from the DHT to the client.</p>
  706. <h3>Large Object Fragmentation</h3>
  707. <p>If a chunk contains more than one object, all objects within the
  708. chunk must store their complete compressed form within the chunk. This
  709. limits an object to less than 1 MiB of compressed data.</p>
  710. <p>Larger objects whose compressed size is bigger than 1 MiB are
  711. fragmented into multiple chunks. The first chunk contains the object's
  712. pack header, and the first 1 MiB of compressed data. Subsequent data
  713. is stored in additional chunks. The additional chunk keys are stored
  714. in the <code>meta.fragment</code> field. Each chunk that is part of
  715. the same large object redundantly stores the exact same meta
  716. value.</p>
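<p>The fragmentation rule above can be sketched as follows. This is an illustration only: <code>fragment_object</code> is a hypothetical name, key generation is delegated to a caller-supplied <code>make_key</code> function, and the meta value is modelled as a plain dictionary holding just the <code>fragment</code> list.</p>

```python
from typing import Callable, Dict, List, Tuple

CHUNK_SIZE = 1024 * 1024  # target of about 1 MiB of compressed data per chunk


def fragment_object(pack_header: bytes, compressed: bytes,
                    make_key: Callable[[bytes], str],
                    chunk_size: int = CHUNK_SIZE
                    ) -> List[Tuple[str, bytes, Dict[str, List[str]]]]:
    """Split one large object into (key, chunk data, meta) rows."""
    pieces = [compressed[i:i + chunk_size]
              for i in range(0, len(compressed), chunk_size)] or [b""]
    # Only the first fragment carries the object's pack header.
    pieces[0] = pack_header + pieces[0]
    keys = [make_key(piece) for piece in pieces]
    # Every fragment stores the same meta value, listing all fragment keys.
    meta = {"fragment": keys}
    return [(key, piece, meta) for key, piece in zip(keys, pieces)]
```

<p>A reader that lands on any fragment can recover the full key list from that fragment's meta value and fetch the remaining chunks in order.</p>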
  717. <h3>Size Estimate</h3>
  718. <p>Approximately the same size as if the repository were stored on the
  719. local filesystem. For the linux-2.6 repository (373M / 1.8 million
  720. objects), about 417M (373.75M in <code>chunk:</code>, 42.64M in
  721. <code>index:</code>, 656K in <code>meta:</code>).</p>
  722. <p>Row count is close to size / 1M (373M / 1M = 373 rows), but may be
  723. slightly higher (e.g. 402) due to fractional chunks on the end of
  724. large fragmented objects, or where the single object type rule caused a
  725. chunk to close before it was full.</p>
  726. <p>For the complete Android repository set, disk usage is ~13G.</p>
  727. <h3>Updates</h3>
  728. <p>This table is (mostly) append-only. Write operations blast in ~1
  729. MiB chunks, as the key format assures writers the new row does not
  730. already exist. Chunks are randomly scattered by the hashing function,
  731. and are not buffered very deep by writers.</p>
  732. <p><i>Interactive writes:</i> Small operations impacting only 1-5
  733. chunks will write all columns in a single operation. Most chunks of
  734. this variety will be very small, 1-10 objects per chunk and about 1-10
  735. KiB worth of compressed data inside the <code>chunk:</code> column.
  736. This class of write represents a single change made by one developer
  737. that must be shared back out immediately.</p>
  738. <p><i>Large pushes:</i> Large operations impacting tens to hundreds of
  739. chunks will first write the <code>chunk:</code> column, then come back
  740. later and populate the <code>index:</code> and <code>meta:</code>
  741. columns once all chunks have been written. The delayed writing of
  742. index and meta during large operations is required because the
  743. information for these columns is not available until the entire data
  744. stream from the Git client has been received and scanned. As the Git
  745. server may not have sufficient memory to store all chunk data (373M or
  746. 1G!), it is written out first to free up memory.</p>
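<p>The two-phase write used by large pushes can be sketched with an in-memory stand-in for the table. The names <code>ChunkTable</code>, <code>large_push</code> and <code>finish</code> are hypothetical, and the per-chunk <code>summary</code> stands for whatever small amount of bookkeeping a writer keeps in memory after the chunk data itself has been flushed.</p>

```python
class ChunkTable:
    """In-memory stand-in for the CHUNK table: row key -> {column: value}."""

    def __init__(self):
        self.rows = {}

    def put(self, key, **columns):
        self.rows.setdefault(key, {}).update(columns)


def large_push(table, chunk_stream, finish):
    """Write chunk data first, then index and meta once the stream has ended.

    chunk_stream yields (key, chunk_data, summary) tuples; finish(key, summaries)
    returns the (index, meta) values for one chunk.
    """
    pending = []
    for key, data, summary in chunk_stream:
        table.put(key, chunk=data)  # phase 1: the bulky data leaves memory now
        pending.append((key, summary))

    summaries = dict(pending)
    for key, _summary in pending:
        index, meta = finish(key, summaries)  # phase 2: needs the whole stream
        table.put(key, index=index, meta=meta)
    return [key for key, _summary in pending]
```

<p>An interactive write would instead call <code>put</code> once per chunk with all three columns.</p>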
  747. <p><i>Garbage collection:</i> Chunks that are not optimally sized
  748. (less than the target ~1 MiB), optimally localized (too many graph
  749. pointers outside of the chunk), or compressed (Git found a smaller way
  750. to store the same content) will be replaced by first writing new
  751. chunks, and then deleting the old chunks.</p>
  752. <p>Worst case, this could churn as many as 402 rows and 373M worth of
  753. data for the linux-2.6 repository. Special consideration will be made
  754. to try to avoid replacing chunks whose <code>WWWW</code> key
  755. component is 'sufficiently old' and whose content is already
  756. sufficiently sized and compressed. This will help to limit churn to
  757. only more recently dated chunks, which are smaller in size.</p>
  758. <h3>Reads</h3>
  759. <p>All columns are read together as a unit. Memcache is checked first,
  760. with reads falling back to the DHT if the cache does not have the
  761. chunk.</p>
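<p>The read path above amounts to a read-through cache. A minimal sketch, with plain dictionaries standing in for Memcache and the DHT; the name <code>read_chunk</code> is hypothetical, and populating the cache after a miss is an assumption rather than something this document states.</p>

```python
from typing import Dict, Optional


def read_chunk(key: str, cache: Dict[str, dict], dht: Dict[str, dict]) -> Optional[dict]:
    """Return the whole row (chunk, index and meta columns) for a chunk key."""
    row = cache.get(key)
    if row is None:
        row = dht.get(key)    # fall back to the DHT on a cache miss
        if row is not None:
            cache[key] = row  # assumption: populate the cache for later readers
    return row
```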
  762. <p>Reasonably accurate prefetching is supported through background
  763. threads and prefetching metadata embedded in the <a
  764. href="#message_CachedPackInfo">CachedPackInfo</a> and <a
  765. href="#message_ChunkMeta">ChunkMeta</a> protocol messages used by
  766. readers.</p>
  767. <hr />
  768. <h2>Protocol Messages</h2>
  769. <pre>
  770. package git_store;
  771. option java_package = "org.eclipse.jgit.storage.dht.proto";
  772. // Entry in RefTable describing the target of the reference.
  773. // Either symref *OR* target must be populated, but never both.
  774. //
  775. <a name="message_RefData">message RefData</a> {
  776. // An ObjectId with an optional hint about where it can be found.
  777. //
  778. message Id {
  779. required string object_name = 1;
  780. optional string chunk_key = 2;
  781. }
  782. // Name of another reference this reference inherits its target
  783. // from. The target is inherited on-the-fly at runtime by reading
  784. // the other reference. Typically only "HEAD" uses symref.
  785. //
  786. optional string symref = 1;
  787. // ObjectId this reference currently points at.
  788. //
  789. optional Id target = 2;
  790. // True if the correct value for peeled is stored.
  791. //
  792. optional bool is_peeled = 3;
  793. // If is_peeled is true, this field is accurate. It is present
  794. // only if target points to an annotated tag object, in which case
  795. // it stores the "object" field of that tag.
  796. //
  797. optional Id peeled = 4;
  798. }
  799. // Entry in ObjectIndexTable, describes how an object appears in a chunk.
  800. //
  801. <a name="message_ObjectInfo">message ObjectInfo</a> {
  802. // Type of Git object.
  803. //
  804. enum ObjectType {
  805. COMMIT = 1;
  806. TREE = 2;
  807. BLOB = 3;
  808. TAG = 4;
  809. }
  810. optional ObjectType object_type = 1;
  811. // Position of the object's header within its chunk.
  812. //
  813. required int32 offset = 2;
  814. // Total number of compressed data bytes, not including the pack
  815. // header. For fragmented objects this is the sum of all chunks.
  816. //
  817. required int64 packed_size = 3;
  818. // Total number of bytes of the uncompressed object. For a
  819. // delta this is the size after applying the delta onto its base.
  820. //
  821. required int64 inflated_size = 4;
  822. // ObjectId of the delta base, if this object is stored as a delta.
  823. // The base is stored in raw binary.
  824. //
  825. optional bytes delta_base = 5;
  826. }
  827. // Describes at a high-level the information about a chunk.
  828. // A repository can use this summary to determine how much
  829. // data is stored, or when garbage collection should occur.
  830. //
  831. <a name="message_ChunkInfo">message ChunkInfo</a> {
  832. // Source of the chunk (what code path created it).
  833. //
  834. enum Source {
  835. RECEIVE = 1; // Came in over the network from external source.
  836. INSERT = 2; // Created in this repository (e.g. a merge).
  837. REPACK = 3; // Generated during a repack of this repository.
  838. }
  839. optional Source source = 1;
  840. // Type of Git object stored in this chunk.
  841. //
  842. enum ObjectType {
  843. MIXED = 0;
  844. COMMIT = 1;
  845. TREE = 2;
  846. BLOB = 3;
  847. TAG = 4;
  848. }
  849. optional ObjectType object_type = 2;
  850. // True if this chunk is a member of a fragmented object.
  851. //
  852. optional bool is_fragment = 3;
  853. // If present, key of the CachedPackInfo object
  854. // that this chunk is a member of.
  855. //
  856. optional string cached_pack_key = 4;
  857. // Summary description of the objects stored here.
  858. //
  859. message ObjectCounts {
  860. // Number of objects stored in this chunk.
  861. //
  862. optional int32 total = 1;
  863. // Number of objects stored in whole (non-delta) form.
  864. //
  865. optional int32 whole = 2;
  866. // Number of objects stored in OFS_DELTA format.
  867. // The delta base appears in the same chunk, or
  868. // may appear in an earlier chunk through the
  869. // ChunkMeta.base_chunk link.
  870. //
  871. optional int32 ofs_delta = 3;
  872. // Number of objects stored in REF_DELTA format.
  873. // The delta base is at an unknown location.
  874. //
  875. optional int32 ref_delta = 4;
  876. }
  877. optional ObjectCounts object_counts = 5;
  878. // Size in bytes of the chunk's compressed data column.
  879. //
  880. optional int32 chunk_size = 6;
  881. // Size in bytes of the chunk's index.
  882. //
  883. optional int32 index_size = 7;
  884. // Size in bytes of the meta information.
  885. //
  886. optional int32 meta_size = 8;
  887. }
  888. // Describes meta information about a chunk, stored inline with it.
  889. //
  890. <a name="message_ChunkMeta">message ChunkMeta</a> {
  891. // Enumerates the other chunks this chunk depends upon by OFS_DELTA.
  892. // Entries are sorted by relative_start ascending, enabling search. Thus
  893. // the earliest chunk is at the end of the list.
  894. //
  895. message BaseChunk {
  896. // Bytes between start of the base chunk and start of this chunk.
  897. // Although the value is positive, it is a negative offset.
  898. //
  899. required int64 relative_start = 1;
  900. required string chunk_key = 2;
  901. }
  902. repeated BaseChunk base_chunk = 1;
  903. // If this chunk is part of a fragment, key of every chunk that
  904. // makes up the fragment, including this chunk.
  905. //
  906. repeated string fragment = 2;
  907. // Chunks that should be prefetched if reading the current chunk.
  908. //
  909. message PrefetchHint {
  910. repeated string edge = 1;
  911. repeated string sequential = 2;
  912. }
  913. optional PrefetchHint commit_prefetch = 51;
  914. optional PrefetchHint tree_prefetch = 52;
  915. }
  916. // Describes a CachedPack, for efficient bulk clones.
  917. //
  918. <a name="message_CachedPackInfo">message CachedPackInfo</a> {
  919. // Unique name of the cached pack. This is the SHA-1 hash of
  920. // all of the objects that make up the cached pack, sorted and
  921. // in binary form. (Same rules as Git on the filesystem.)
  922. //
  923. required string name = 1;
  924. // SHA-1 of all chunk keys, which are themselves SHA-1s of the
  925. // raw chunk data. If any bit differs in compression (due to
  926. // repacking) the version will differ.
  927. //
  928. required string version = 2;
  929. // Total number of objects in the cached pack. This must be known
  930. // in order to set the final resulting pack header correctly before it
  931. // is sent to clients.
  932. //
  933. required int64 objects_total = 3;
  934. // Number of objects stored as deltas, rather than deflated whole.
  935. //
  936. optional int64 objects_delta = 4;
  937. // Total size of the chunks, in bytes, not including the chunk footer.
  938. //
  939. optional int64 bytes_total = 5;
  940. // Objects this pack starts from.
  941. //
  942. message TipObjectList {
  943. repeated string object_name = 1;
  944. }
  945. required TipObjectList tip_list = 6;
  946. // Chunks, in order of occurrence in the stream.
  947. //
  948. message ChunkList {
  949. repeated string chunk_key = 1;
  950. }
  951. required ChunkList chunk_list = 7;
  952. }
  953. </pre>
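<p>The RefData rule stated in the messages above (either <code>symref</code> or <code>target</code> must be populated, but never both) is not something the field declarations alone can express, since both fields are optional. A minimal sketch of the check, with the message modelled as a plain dictionary and the hypothetical name <code>validate_ref_data</code>:</p>

```python
from typing import List


def validate_ref_data(ref: dict) -> List[str]:
    """Return a list of problems found in a RefData-like dictionary."""
    problems = []
    has_symref = bool(ref.get("symref"))
    has_target = ref.get("target") is not None
    if has_symref == has_target:
        problems.append("exactly one of symref or target must be set")
    # Id.object_name is declared as a required field.
    if has_target and not ref["target"].get("object_name"):
        problems.append("target.object_name is required")
    return problems
```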
  954. </body>
  955. </html>