author | Shawn O. Pearce <spearce@spearce.org> | 2012-08-18 15:27:12 -0700
committer | Matthias Sohn <matthias.sohn@sap.com> | 2012-09-05 17:19:51 +0200
commit | 130ad4ea4407316b1fd115db456c4aa950907196 (patch)
tree | 4943c050912ff259ea4f13c5fcf521ad8ff7c5b0 /org.eclipse.jgit.storage.dht/README
parent | e44c3e713902faaf3c831827915312666cd6ecd6 (diff)
download | jgit-130ad4ea4407316b1fd115db456c4aa950907196.tar.gz, jgit-130ad4ea4407316b1fd115db456c4aa950907196.zip
Delete storage.dht package

This experiment proved not to be very useful. I had originally
planned to use this on top of Google Bigtable, Apache HBase or
Apache Cassandra. Unfortunately the schema is very complex and
does not perform well. The storage.dfs package has much better
performance and has been in production at Google for many months
now, proving it is a viable storage backend for Git.

As there are no users of the storage.dht schema, either at Google or
any other company, nor any valid open source implementations of the
storage system, drop the entire package and API from the JGit project.
There is no point in trying to maintain code that is simply not used.

Change-Id: Ia8d32f27426d2bcc12e7dc9cc4524c59f4fe4df9
Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
Diffstat (limited to 'org.eclipse.jgit.storage.dht/README')
-rw-r--r-- | org.eclipse.jgit.storage.dht/README | 89
1 file changed, 0 insertions, 89 deletions
diff --git a/org.eclipse.jgit.storage.dht/README b/org.eclipse.jgit.storage.dht/README
deleted file mode 100644
index 1e07d377e7..0000000000
--- a/org.eclipse.jgit.storage.dht/README
+++ /dev/null
@@ -1,89 +0,0 @@
-JGit Storage on DHT
--------------------
-
-This implementation still has some pending issues:
-
-* DhtInserter must skip existing objects
-
-  DirCache writes all trees to the ObjectInserter, letting the
-  inserter figure out which trees we already have, and which are new.
-  DhtInserter should buffer trees into a chunk, then before writing
-  the chunk to the DHT do a batch lookup to find the existing
-  ObjectInfo (if any). If any exist, the chunk should be compacted to
-  eliminate these objects, and if there is room in the chunk for more
-  objects, it should go back to the DhtInserter to be filled further
-  before flushing.
-
-  This implies the DhtInserter needs to work on multiple chunks at
-  once, and may need to combine chunks together when there is more
-  than one partial chunk.
-
-* DhtPackParser must check for collisions
-
-  Because ChunkCache blindly assumes any copy of an object is an OK
-  copy of an object, DhtPackParser needs to validate all new objects
-  at the end of its importing phase, before it links the objects into
-  the ObjectIndexTable. Most objects won't already exist, but some
-  may, and those that do must either be removed from their chunk, or
-  have their content byte-for-byte validated.
-
-  Removal from a chunk just means deleting it from the chunk's local
-  index, and not writing it to the global ObjectIndexTable. This
-  creates a hole in the chunk which is wasted space, and that isn't
-  very useful. Fortunately objects that fit fully within one chunk
-  may be easy to inflate and double check, as they are small. Objects
-  that are big span multiple chunks, and the new chunks can simply be
-  deleted from the ChunkTable, leaving the original chunks.
-
-  Deltas can be checked quickly by inflating the delta and checking
-  only the insertion point text, comparing that to the existing data
-  in the repository. Unfortunately the repository is likely to use a
-  different delta representation, which means at least one of them
-  will need to be fully inflated to check the delta against.
-
-* DhtPackParser should handle small-huge-small-huge
-
-  Multiple chunks need to be open at once, in case we get a bad
-  pattern of small-object, huge-object, small-object, huge-object. In
-  this case the small-objects should be put together into the same
-  chunk, to prevent having too many tiny chunks. This is tricky to do
-  with OFS_DELTA. A long OFS_DELTA requires all prior chunks to be
-  closed out so we know their lengths.
-
-* RepresentationSelector performance bad on Cassandra
-
-  The 1.8 million batch lookups done for linux-2.6 kills Cassandra, it
-  cannot handle this read load.
-
-* READ_REPAIR isn't fully accurate
-
-  There are a lot of places where the generic DHT code should be
-  helping to validate the local replica is consistent, and where it is
-  not, help the underlying storage system to heal the local replica by
-  reading from a remote replica and putting it back to the local one.
-  Most of this should be handled in the DHT SPI layer, but the generic
-  DHT code should be giving better hints during get() method calls.
-
-* LOCAL / WORLD writes
-
-  Many writes should be done locally first, before they replicate to
-  the other replicas, as they might be backed out on an abort.
-
-  Likewise some writes must take place across sufficient replicas to
-  ensure the write is not lost... and this may include ensuring that
-  earlier local-only writes have actually been committed to all
-  replicas. This committing to replicas might be happening in the
-  background automatically after the local write (e.g. Cassandra will
-  start to send writes made by one node to other nodes, but doesn't
-  promise they finish). But parts of the code may need to force this
-  replication to complete before the higher level git operation ends.
-
-* Forks/alternates
-
-  Forking is common, but we should avoid duplicating content into the
-  fork if the base repository has it. This requires some sort of
-  change to the key structure so that chunks are owned by an object
-  pool, and the object pool owns the repositories that use it. GC
-  proceeds at the object pool level, rather than the repository level,
-  but might want to take some of the reference namespace into account
-  to avoid placing forked less-common content near primary content.
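
For illustration only, a minimal Java sketch of the batch-lookup-then-compact
idea the deleted README describes under "DhtInserter must skip existing
objects". None of these types come from the removed storage.dht package:
ChunkCompactor, PendingObject and ObjectIndex are invented names, and real
JGit code would use ObjectId/ObjectInfo rather than plain hex strings.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class ChunkCompactor {
  // Minimal stand-in for a buffered object awaiting insertion into a chunk.
  static final class PendingObject {
    final String id;    // object name (SHA-1 hex)
    final byte[] data;  // deflated representation destined for the chunk
    PendingObject(String id, byte[] data) { this.id = id; this.data = data; }
  }

  // One batch round trip that reports which of the given ids already exist
  // in the global object index.
  interface ObjectIndex {
    Set<String> findExisting(List<String> ids);
  }

  // Drop objects the index already knows about and return the survivors.
  // If the result leaves room in the chunk, the caller can keep filling it
  // before flushing the chunk to the DHT.
  static List<PendingObject> compact(List<PendingObject> buffered, ObjectIndex index) {
    List<String> ids = new ArrayList<>();
    for (PendingObject o : buffered)
      ids.add(o.id);
    Set<String> existing = index.findExisting(ids); // single batch lookup
    List<PendingObject> kept = new ArrayList<>();
    for (PendingObject o : buffered)
      if (!existing.contains(o.id))
        kept.add(o);
    return kept;
  }
}

The single findExisting() call is the point of the README item: one batch
query per chunk instead of one lookup per object, so duplicates can be
compacted out before the chunk is ever written.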
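
The "LOCAL / WORLD writes" item is essentially about when a git operation is
allowed to return. A rough, hypothetical sketch of that idea follows; the
ReplicaClient interface and ReplicatedWriter class are invented here and were
not part of the removed package or its SPI. It only shows starting a write on
every replica and blocking the caller until enough replicas acknowledge it.

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class ReplicatedWriter {
  // Invented for illustration; the removed dht SPI exposed no such interface.
  interface ReplicaClient {
    CompletableFuture<Void> put(String key, byte[] value);
  }

  private final List<ReplicaClient> replicas;
  private final int requiredAcks; // e.g. a majority of the replica set

  ReplicatedWriter(List<ReplicaClient> replicas, int requiredAcks) {
    this.replicas = replicas;
    this.requiredAcks = requiredAcks;
  }

  // WORLD write: start the write on every replica, but block the caller only
  // until enough replicas have acknowledged it. Replicas that have not
  // answered yet keep catching up in the background.
  void writeWorld(String key, byte[] value) throws InterruptedException {
    CountDownLatch acks = new CountDownLatch(requiredAcks);
    for (ReplicaClient r : replicas)
      r.put(key, value).thenRun(acks::countDown); // failed puts never count down
    if (!acks.await(30, TimeUnit.SECONDS))
      throw new IllegalStateException(
          "write of " + key + " not confirmed by " + requiredAcks + " replicas");
  }
}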