Git on DHT Schema

Storing Git repositories on a Distributed Hash Table (DHT) may improve scaling for large traffic, but also simplifies management when there are many small repositories.

Table of Contents

Concepts

Git Repository: Stores the version control history for a single project. Each repository is a directed acyclic graph (DAG) composed of objects. Revision history for a project is described by a commit object pointing to the complete set of files that make up that version of the project, and a pointer to the commit that came immediately before it. Repositories also contain references, associating a human readable branch or tag name to a specific commit object. Tommi Virtanen has a more detailed description of the Git DAG.

Object: Git objects are named by applying the SHA-1 hash algorithm to their contents. There are 4 object types: commit, tree, blob, tag. Objects are typically stored deflated using libz deflate, but may also be delta compressed against another similar object, further reducing the storage required. The big factor for Git repository size is usually object count, e.g. the linux-2.6 repository contains 1.8 million objects.

Reference: Associates a human readable symbolic name, such as refs/heads/master to a specific Git object, usually a commit or tag. References are updated to point to the most recent object whenever changes are committed to the repository.

Git Pack File: A container stream holding many objects in a highly compressed format. On the local filesystem, Git uses pack files to reduce both inode and space usage by combining millions of objects into a single data stream. On the network, Git uses pack files as the basic network protocol to transport objects from one system's repository to another.

Garbage Collection: Scanning the Git object graph to locate objects that are reachable, and others that are unreachable. Git also generally performs data recompression during this task to produce more optimal deltas between objects, reducing overall disk usage and data transfer sizes. This is independent of any GC that may be performed by the DHT to clean up old cells.

The basic storage strategy employed by this schema is to break a standard Git pack file into chunks, approximately 1 MiB in size. Each chunk is stored as one row in the CHUNK table. During reading, chunks are paged into the application on demand, but may also be prefetched using prefetch hints. Rules are used to break the standard pack into chunks, these rules help to improve reference locality and reduce the number of chunk loads required to service common operations. In a nutshell, the DHT is used as a virtual memory system for pages about 1 MiB in size.

Summary

The schema consists of a handful of tables. Size estimates are given for one copy of the linux-2.6 Git repository, a relative tortue test case that contains 1.8 million objects and is 425 MiB when stored on the local filesystem. All sizes are before any replication made by the DHT, or its underlying storage system.

Table Rows Cells/Row Bytes Bytes/Row
REPOSITORY_INDEX
Map host+path to surrogate key.
1 1 < 100 bytes < 100 bytes
REPOSITORY
Accounting and replica management.
1 403 65 KiB 65 KiB
REF
Bind branch/tag name to Git object.
211 1 14 KiB 67 bytes
OBJECT_INDEX
Locate Git object by SHA-1 name.
1,861,833 1 154 MiB 87 bytes
CHUNK
Complete Git object storage.
402 3 417 MiB ~ 1 MiB
Total 1,862,448 571 MiB

Data Security

If data encryption is necessary to protect file contents, the CHUNK.chunk column can be encrypted with a block cipher such as AES. This column contains the revision commit messages, file paths, and file contents. By encrypting one column, the majority of the repository data is secured. As each cell value is about 1 MiB and contains a trailing 4 bytes of random data, an ECB mode of operation may be sufficient. Because the cells are already very highly compressed using the Git data compression algorithms, there is no increase in disk usage due to encryption.

Branch and tag names (REF row keys) are not encrypted. If these need to be secured the portion after the ':' would need to be encrypted with a block cipher. However these strings are very short and very common (HEAD, refs/heads/master, refs/tags/v1.0), making encryption difficult. A variation on the schema might move all rows for a repository into a single protocol messsage, then encrypt the protobuf into a single cell. Unfortunately this strategy has a high update cost, and references change frequently.

Object SHA-1 names (OBJECT_INDEX row keys and CHUNK.index values) are not encrypted. This allows a reader to determine if a repository contains a specific revision, but does not allow them to inspect the contents of the revision. The CHUNK.index column could also be encrypted with a block cipher when CHUNK.chunk is encrypted (see above), however the OBJECT_INDEX table row keys cannot be encrypted if abbrevation expansion is to be supported for end-users of the repository. The row keys must be unencrypted as abbreviation resolution is performed by a prefix range scan over the keys.

The remaining tables and columns contain only statistics (e.g. object counts or cell sizes), or internal surrogate keys (repository_id, chunk_key) and do not require encryption.


Table REPOSITORY_INDEX

Maps a repository name, as presented in the URL by an end-user or client application, into its internal repository_id value. This mapping allows the repository name to be quickly modified (e.g. renamed) without needing to update the larger data rows of the repository.

The exact semantics of the repository_name format is left as a deployment decision, but DNS hostname, '/', repository name would be one common usage.

Row Key

Row Key: repository_name

Human readable name of the repository, typically derived from the HTTP Host header and path in the URL.

Examples:

Columns

Column: id:

The repository_id, as an 8-digit hex ASCII string.

Size Estimate

Less than 100,000 rows. More likely estimate is 1,000 rows. Total disk usage under 512 KiB, assuming 1,000 names and 256 characters per name.

Updates

Only on repository creation or rename, which is infrequent (<10 rows/month). Updates are performed in a row-level transaction, to ensure a name is either assigned uniquely, or fails.

Reads

Reads are tried first against memcache, then against the DHT if the entry did not exist in memcache. Successful reads against the DHT are put back into memcache in the background.

Table REPOSITORY

Tracks top-level information about each repository.

Row Key

Row Key: repository_id

The repository_id, as an 8-digit hex ASCII string.

Typically this is assigned sequentially, then has the bits reversed to evenly spread repositories throughout the DHT. For example the first repository is 80000000, and the second is 40000000.

Columns

Column: chunk-info: chunk_key[short]

Cell value is the protocol message ChunkInfo describing the chunk's contents. Most of the message's fields are only useful for quota accounting and reporting.

This column exists to track all of the chunks that make up a repository's object set. Garbage collection and quota accounting tasks can primarily drive off this column, rather than scanning the much larger CHUNK table with a regular expression on the chunk row key.

As each chunk averages 1 MiB in size, the linux-2.6 repository (at 373 MiB) has about 400 chunks and thus about 400 chunk-info cells. The chromium repository (at 1 GiB) has about 1000 chunk-info cells. It would not be uncommon to have 2000 chunk-info cells.

Column: cached-pack: NNNNx38 . VVVVx38

Variables:

Examples:

Cell value is the protocol message CachedPackInfo describing the chunks that make up a cached pack.

The cached-pack column family describes large lists of chunks that when combined together in a specific order create a valid Git pack file directly streamable to a client. This avoids needing to enumerate and pack the entire repository on each request.

The cached-pack name (NNNNx38 above) is the SHA-1 of the objects contained within the pack, in binary, sorted. This is the standard naming convention for pack files on the local filesystem. The version (VVVVx38 above) is the SHA-1 of the chunk keys, sorted. The version makes the cached-pack cell unique, if any single bit in the compressed data is modified a different version will be generated, and a different cell will be used to describe the alternate version of the same data. The version is necessary to prevent repacks of the same object set (but with different compression settings or results) from stepping on active readers.

Size Estimate

1 row per repository (~1,000 repositories), however the majority of the storage cost is in the chunk-info column family, which can have more than 2000 cells per repository.

Each chunk-info cell is on average 147 bytes. For a large repository like chromium.git (over 1000 chunks) this is only 147 KiB for the entire row.

Each cached-pack cell is on average 5350 bytes. Most repositories have 1 of these cells, 2 while the repository is being repacked on the server side to update the cached-pack data.

Updates

Information about each ~1 MiB chunk of pack data received over the network is stored as a unique column in the chunk-info column family.

Most pushes are at least 2 chunks (commit, tree), with 50 pushes per repository per day being possible (50,000 new cells/day).

TODO: Average push rates?

Reads

Serving clients: Read all cells in the cached-pack column family, typically only 1-5 cells. The cells are cached in memcache and read from there first.

Garbage collection: Read all cells in the chunk-info column family to determine which chunks are owned by this repository, without scanning the CHUNK table. Delete chunk-info after the corresponding CHUNK row has been deleted. Unchanged chunks have their info left alone.

Table REF

Associates a human readable branch (e.g. refs/heads/master) or tag (e.g. refs/tags/v1.0) name to the Git object that represents that current state of the repository.

Row Key

Row Key: repository_id : ref_name

Variables:

Examples:

The separator : used in the row key was chosen because this character is not permitted in a Git reference name.

Columns

Column: target:

Cell value is the protocol message RefData describing the current SHA-1 the reference points to, and the chunk it was last observed in. The chunk hint allows skipping a read of OBJECT_INDEX.

Several versions (5) are stored for emergency rollbacks. Additional versions beyond 5 are cleaned up during table garbage collection as managed by the DHT's cell GC.

Size Estimate

Normal Git usage: ~10 branches per repository, ~200 tags. For 1,000 repositories, about 200,000 rows total. Average row size is about 240 bytes/row before compression (67 after), or 48M total.

Gerrit Code Review usage: More than 300 new rows per day. Each snapshot of each change under review is one reference.

Updates

Writes are performed by doing an atomic compare-and-swap (through a transaction), changing the RefData protocol buffer.

Reads

Reads perform prefix scan for all rows starting with repository_id:. Plans exist to cache these reads within a custom service, avoiding most DHT queries.

Table OBJECT_INDEX

The Git network protocol has clients sending object SHA-1s to the server, with no additional context or information. End-users may also type a SHA-1 into a web search box. This table provides a mapping of the object SHA-1 to which chunk(s) store the object's data. The table is sometimes also called the 'global index', since it names where every single object is stored.

Row Key

Row Key: NN . repository_id . NNx40

Variables:

Examples:

The first 2 hex digits (NN) distribute object keys within the same repository around the DHT keyspace, preventing a busy repository from creating too much of a hot-spot within the DHT. To simplify key generation, these 2 digits are repeated after the repository_id, as part of the 40 hex digit object name.

Keys must be clustered by repository_id to support extending abbreviations. End-users may supply an abbreviated SHA-1 of 4 or more digits (up to 39) and ask the server to complete them to a full 40 digit SHA-1 if the server has the relevant object within the repository's object set.

A schema variant that did not include the repository_id as part of the row key was considered, but discarded because completing a short 4-6 digit abbreviated SHA-1 would be impractical once there were billions of objects stored in the DHT. Git end-users expect to be able to use 4 or 6 digit abbreviations on very small repositories, as the number of objects is low and thus the number of bits required to uniquely name an object within that object set is small.

Columns

Column: info: chunk_key[short]

Cell value is the protocol message ObjectInfo describing how the object named by the row key is stored in the chunk named by the column name.

Cell timestamp matters. The oldest cell within the entire column family is favored during reads. As chunk_key is unique, versions within a single column aren't relevant.

Size Estimate

Average row size per object/chunk pair is 144 bytes uncompressed (87 compressed), based on the linux-2.6 repository. The linux-2.6 repository has 1.8 million objects, and is growing at a rate of about 300,000 objects/year. Total usage for linux-2.6 is above 154M.

Most rows contain only 1 cell, as the object appears in only 1 chunk within that repository.

Worst case: 1.8 million rows/repository * 1,000 repositories is around 1.8 billion rows and 182G.

Updates

One write per object received over the network; typically performed as part of an asynchronous batch. Each batch is sized around 512 KiB (about 3000 rows). Because of SHA-1's uniform distribution, row keys are first sorted and then batched into buckets of about 3000 rows. To prevent too much activity from going to one table segment at a time the complete object list is segmented into up to 32 groups which are written in round-robin order.

A full push of the linux-2.6 repository writes 1.8 million rows as there are 1.8 million objects in the pack stream.

During normal insert or receive operations, each received object is a blind write to add one new info:chunk_key[short] cell to the row. During repack, all cells in the info column family are replaced with a single cell.

Reads

During common ancestor negotiation reads occur in batches of 64-128 full row keys, uniformly distributed throughout the key space. Most of these reads are misses, the OBJECT_INDEX table does not contain the key offered by the client. A successful negotation for most developers requires at least two rounds of 64 objects back-to-back over HTTP. Due to the high miss rate on this table, an in-memory bloom filter may be important for performance.

To support the high read-rate (and high miss-rate) during common ancestor negotation, an alternative to an in-memory bloom filter within the DHT is to downoad the entire set of keys into an alternate service job for recently accessed repositories. This service can only be used if all of the keys for the same repository_id are hosted within the service. Given this is under 36 GiB for the worst case 1.8 billion rows mentioned above, this may be feasible. Loading the table can be performed by fetching REPOSITORY.chunk-info and then performing parallel gets for the CHUNK.index column, and scanning the local indexes to construct the list of known objects.

During repacking with no delta reuse, worst case scenario requires reading all records with the same repository_id (for linux-2.6 this is 1.8 million rows). Reads are made in a configurable batch size, right now this is set at 2048 keys/batch, with 4 concurrent batches in flight at a time.

Reads are tried first against memcache, then against the DHT if the entry did not exist in memcache. Successful reads against the DHT are put back into memcache in the background.

Table CHUNK

Stores the object data for a repository, containing commit history, directory structure, and file revisions. Each chunk is typically 1 MiB in size, excluding the index and meta columns.

Row Key

Row Key: HH . repository_id . HHx40

Variables:

Examples:

Chunk keys are computed by first computing the SHA-1 of the chunk: column, which is the compressed object contents stored within the chunk. As the chunk data includes a 32 bit salt in the trailing 4 bytes, this value is random even for the exact same object input.

The leading 2 hex digit HH component distributes chunks for the same repository (and over the same time period) evenly around the DHT keyspace, preventing any portion from becoming too hot.

Columns

Column: chunk:

Multiple objects in Git pack-file format, about 1 MiB in size. The data is already very highly compressed by Git and is not further compressable by the DHT.

This column is essentially the standard Git pack-file format, without the standard header or trailer. Objects can be stored in either whole format (object content is simply deflated inline) or in delta format (reference to a delta base is followed by deflated sequence of copy and/or insert instructions to recreate the object content). The OBJ_OFS_DELTA format is preferred for deltas, since it tends to use a shorter encoding than the OBJ_REF_DELTA format. Offsets beyond the start of the chunk are actually offsets to other chunks, and must be resolved using the meta.base_chunk.relative_start field.

Because the row key is derived from the SHA-1 of this column, the trailing 4 bytes is randomly generated at insertion time, to make it impractical for remote clients to predict the name of the chunk row. This allows the stream parser to bindly insert rows without first checking for row existance, or worry about replacing an existing row and causing data corruption.

This column value is essentially opaque to the DHT.

Column: index:

Binary searchable table listing object SHA-1 and starting offset of that object within the chunk: data stream. The data in this index is essentially random (due to the SHA-1s stored in binary) and thus is not compressable.

Sorted list of SHA-1s of each object that is stored in this chunk, along with the offset. This column allows efficient random access to any object within the chunk, without needing to perform a remote read against OBJECT_INDEX table. The column is very useful at read time, where pointers within Git objects will frequently reference other objects stored in the same chunk.

This column is sometimes called the local index, since it is local only to the chunk and thus differs from the global index stored in the OBJECT_INDEX table.

The column size is 24 bytes per object stored in the chunk. Commit chunks store on average 2200 commits/chunk, so a commit chunk index is about 52,800 bytes.

This column value is essentially opaque to the DHT.

Column: meta:

Cell value is the protocol message ChunkMeta describing prefetch hints, object fragmentation, and delta base locations. Unlike chunk: and index:, this column is somewhat compressable.

The meta column provides information critical for reading the chunk's data. (Unlike ChunkInfo in the REPOSITORY table, which is used only for accounting.)

The most important element is the BaseChunk nested message, describing a chunk that contains a base object required to inflate an object that is stored in this chunk as a delta.

Chunk Contents

Chunks try to store only a single object type, however mixed object type chunks are supported. The rule to store only one object type per chunk improves data locality, reducing the number of chunks that need to be accessed from the DHT in order to perform a particular Git operation. Clustering commits together into a 'commit chunk' improves data locality during log/history walking operations, while clustering trees together into a 'tree chunk' improves data locality during the early stages of packing or difference generation.

Chunks reuse the standard Git pack data format to support direct streaming of a chunk's chunk: column to clients, without needing to perform any data manipulation on the server. This enables high speed data transfer from the DHT to the client.

Large Object Fragmentation

If a chunk contains more than one object, all objects within the chunk must store their complete compressed form within the chunk. This limits an object to less than 1 MiB of compressed data.

Larger objects whose compressed size is bigger than 1 MiB are fragmented into multiple chunks. The first chunk contains the object's pack header, and the first 1 MiB of compressed data. Subsequent data is stored in additional chunks. The additional chunk keys are stored in the meta.fragment field. Each chunk that is part of the same large object redundantly stores the exact same meta value.

Size Estimate

Approximately the same size if the repository was stored on the local filesystem. For the linux-2.6 repository (373M / 1.8 million objects), about 417M (373.75M in chunk:, 42.64M in index:, 656K in meta:).

Row count is close to size / 1M (373M / 1M = 373 rows), but may be slightly higher (e.g. 402) due to fractional chunks on the end of large fragmented objects, or where the single object type rule caused a chunk to close before it was full.

For the complete Android repository set, disk usage is ~13G.

Updates

This table is (mostly) append-only. Write operations blast in ~1 MiB chunks, as the key format assures writers the new row does not already exist. Chunks are randomly scattered by the hashing function, and are not buffered very deep by writers.

Interactive writes: Small operations impacting only 1-5 chunks will write all columns in a single operation. Most chunks of this varity will be very small, 1-10 objects per chunk and about 1-10 KiB worth of compressed data inside of the chunk: column. This class of write represents a single change made by one developer that must be shared back out immediately.

Large pushes: Large operations impacting tens to hundreds of chunks will first write the chunk: column, then come back later and populate the index: and meta: columns once all chunks have been written. The delayed writing of index and meta during large operations is required because the information for these columns is not available until the entire data stream from the Git client has been received and scanned. As the Git server may not have sufficient memory to store all chunk data (373M or 1G!), its written out first to free up memory.

Garbage collection: Chunks that are not optimally sized (less than the target ~1 MiB), optimally localized (too many graph pointers outside of the chunk), or compressed (Git found a smaller way to store the same content) will be replaced by first writing new chunks, and then deleting the old chunks.

Worst case, this could churn as many as 402 rows and 373M worth of data for the linux-2.6 repository. Special consideration will be made to try and avoid replacing chunks whose WWWW key component is 'sufficiently old' and whose content is already sufficiently sized and compressed. This will help to limit churn to only more recently dated chunks, which are smaller in size.

Reads

All columns are read together as a unit. Memcache is checked first, with reads falling back to the DHT if the cache does not have the chunk.

Reasonably accurate prefetching is supported through background threads and prefetching metadata embedded in the CachedPackInfo and ChunkMeta protocol messages used by readers.


Protocol Messages

package git_store;
option java_package = "org.eclipse.jgit.storage.dht.proto";


    // Entry in RefTable describing the target of the reference.
    // Either symref *OR* target must be populated, but never both.
    //
message RefData {
    // An ObjectId with an optional hint about where it can be found.
    //
  message Id {
    required string object_name = 1;
    optional string chunk_key = 2;
  }

    // Name of another reference this reference inherits its target
    // from.  The target is inherited on-the-fly at runtime by reading
    // the other reference.  Typically only "HEAD" uses symref.
    //
  optional string symref = 1;

    // ObjectId this reference currently points at.
    //
  optional Id target = 2;

    // True if the correct value for peeled is stored.
    //
  optional bool is_peeled = 3;

    // If is_peeled is true, this field is accurate.  This field
    // exists only if target points to annotated tag object, then
    // this field stores the "object" field for that tag.
    //
  optional Id peeled = 4;
}


    // Entry in ObjectIndexTable, describes how an object appears in a chunk.
    //
message ObjectInfo {
    // Type of Git object.
    //
  enum ObjectType {
    COMMIT = 1;
    TREE = 2;
    BLOB = 3;
    TAG = 4;
  }
  optional ObjectType object_type = 1;

    // Position of the object's header within its chunk.
    //
  required int32 offset = 2;

    // Total number of compressed data bytes, not including the pack
    // header. For fragmented objects this is the sum of all chunks.
    //
  required int64 packed_size = 3;

    // Total number of bytes of the uncompressed object. For a
    // delta this is the size after applying the delta onto its base.
    //
  required int64 inflated_size = 4;

    // ObjectId of the delta base, if this object is stored as a delta.
    // The base is stored in raw binary.
    //
  optional bytes delta_base = 5;
}


    // Describes at a high-level the information about a chunk.
    // A repository can use this summary to determine how much
    // data is stored, or when garbage collection should occur.
    //
message ChunkInfo {
    // Source of the chunk (what code path created it).
    //
  enum Source {
    RECEIVE = 1;    // Came in over the network from external source.
    INSERT = 2;     // Created in this repository (e.g. a merge).
    REPACK = 3;     // Generated during a repack of this repository.
  }
  optional Source source = 1;

    // Type of Git object stored in this chunk.
    //
  enum ObjectType {
    MIXED = 0;
    COMMIT = 1;
    TREE = 2;
    BLOB = 3;
    TAG = 4;
  }
  optional ObjectType object_type = 2;

    // True if this chunk is a member of a fragmented object.
    //
  optional bool is_fragment = 3;

    // If present, key of the CachedPackInfo object
    // that this chunk is a member of.
    //
  optional string cached_pack_key = 4;

    // Summary description of the objects stored here.
    //
  message ObjectCounts {
      // Number of objects stored in this chunk.
      //
    optional int32 total = 1;

      // Number of objects stored in whole (non-delta) form.
      //
    optional int32 whole = 2;

      // Number of objects stored in OFS_DELTA format.
      // The delta base appears in the same chunk, or
      // may appear in an earlier chunk through the
      // ChunkMeta.base_chunk link.
      //
    optional int32 ofs_delta = 3;

      // Number of objects stored in REF_DELTA format.
      // The delta base is at an unknown location.
      //
    optional int32 ref_delta = 4;
  }
  optional ObjectCounts object_counts = 5;

    // Size in bytes of the chunk's compressed data column.
    //
  optional int32 chunk_size = 6;

    // Size in bytes of the chunk's index.
    //
  optional int32 index_size = 7;

    // Size in bytes of the meta information.
    //
  optional int32 meta_size  = 8;
}


    // Describes meta information about a chunk, stored inline with it.
    //
message ChunkMeta {
    // Enumerates the other chunks this chunk depends upon by OFS_DELTA.
    // Entries are sorted by relative_start ascending, enabling search.  Thus
    // the earliest chunk is at the end of the list.
    //
  message BaseChunk {
      // Bytes between start of the base chunk and start of this chunk.
      // Although the value is positive, its a negative offset.
      //
    required int64 relative_start = 1;
    required string chunk_key = 2;
  }
  repeated BaseChunk base_chunk = 1;

    // If this chunk is part of a fragment, key of every chunk that
    // makes up the fragment, including this chunk.
    //
  repeated string fragment = 2;

    // Chunks that should be prefetched if reading the current chunk.
    //
  message PrefetchHint {
    repeated string edge = 1;
    repeated string sequential = 2;
  }
  optional PrefetchHint commit_prefetch = 51;
  optional PrefetchHint tree_prefetch = 52;
}


   // Describes a CachedPack, for efficient bulk clones.
    //
message CachedPackInfo {
    // Unique name of the cached pack.  This is the SHA-1 hash of
    // all of the objects that make up the cached pack, sorted and
    // in binary form.  (Same rules as Git on the filesystem.)
    //
  required string name = 1;

    // SHA-1 of all chunk keys, which are themselves SHA-1s of the
    // raw chunk data. If any bit differs in compression (due to
    // repacking) the version will differ.
    //
  required string version = 2;

    // Total number of objects in the cached pack. This must be known
    // in order to set the final resulting pack header correctly before it
    // is sent to clients.
    //
  required int64 objects_total = 3;

    // Number of objects stored as deltas, rather than deflated whole.
    //
  optional int64 objects_delta = 4;

    // Total size of the chunks, in bytes, not including the chunk footer.
    //
  optional int64 bytes_total = 5;

    // Objects this pack starts from.
    //
  message TipObjectList {
    repeated string object_name = 1;
  }
  required TipObjectList tip_list = 6;

    // Chunks, in order of occurrence in the stream.
    //
  message ChunkList {
    repeated string chunk_key = 1;
  }
  required ChunkList chunk_list = 7;
}