path: root/org.eclipse.jgit.storage.dht.test/tst/org
Commit message | Author | Age | Files | Lines
* Delete storage.dht package | Shawn O. Pearce | 2012-09-05 | 8 | -1172/+0
This experiment proved to be not very useful. I had originally planned to use this on top of Google Bigtable, Apache HBase or Apache Cassandra. Unfortunately the schema is very complex and does not perform well.

The storage.dfs package has much better performance and has been in production at Google for many months now, proving it is a viable storage backend for Git.

As there are no users of the storage.dht schema, either at Google or any other company, nor any valid open source implementations of the storage system, drop the entire package and API from the JGit project. There is no point in trying to maintain code that is simply not used.

Change-Id: Ia8d32f27426d2bcc12e7dc9cc4524c59f4fe4df9
Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
* DHT: Drop leading hash digits from row keys | Shawn O. Pearce | 2011-06-09 | 2 | -7/+8
Originally I put the first two digits of the object SHA-1 into the start of a row key to try and spread the load of objects around a DHT service. Unfortunately this tends to not work as well as I had hoped.

Servers reading a repository need to contact every node in a DHT cluster if the cluster tries to evenly distribute the object rows. This is a lot of connections, especially if the cluster has many backend storage servers. If the library has an open connection limit (possibly due to JVM file descriptor limitations) it may need to open and close a lot of connections to access a repository, rather than being able to reuse the same connection to a handful of backend servers. This results in a lot of connection thrashing for some DHT type databases, and is inefficient.

Some DHTs are able to operate even if part of the database space is currently unavailable. For example, a DHT service might assign some section of the key space to a node, and then fail that section over to another node when the primary is noticed as being offline. During that failover period that section of the key space is not available, but other sections hosted by other backends are still ready for service. Spreading keys all over the cluster makes it likely that any single backend being temporarily down means the entire cluster is down, rather than only some.

This is a massive schema change, but it should improve reliability and performance for any DHT system.

Change-Id: I6b65bfb4c14b6f7bd323c2bd0638b49d429245be
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
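The locality argument in the commit above can be illustrated with a small sketch. It is not the JGit schema: the key layouts (`<2 hex digits>.<repo>.<id>` versus `<repo>.<id>`), the repository names, and the model of a cluster that splits the sorted key space into equal ranges are all assumptions made for illustration. The sketch counts how many nodes hold rows for a single repository under each layout.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class RowKeySketch {
    /** Hex SHA-1 of a string, standing in for a Git object id. */
    static String sha1Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(text.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Hypothetical old layout: two leading hash digits in front of the key. */
    static String prefixedKey(String repo, String id) {
        return id.substring(0, 2) + "." + repo + "." + id;
    }

    /** Hypothetical new layout: the repository name leads the key. */
    static String plainKey(String repo, String id) {
        return repo + "." + id;
    }

    /**
     * Models a cluster that splits the sorted key space into equal ranges,
     * one per node, and counts the nodes that hold any of the wanted keys.
     * Every wanted key must also be present in allKeys.
     */
    static int nodesTouched(List<String> allKeys, List<String> wanted, int nodes) {
        List<String> sorted = new ArrayList<>(allKeys);
        Collections.sort(sorted);
        Set<Integer> touched = new TreeSet<>();
        for (String key : wanted) {
            int index = Collections.binarySearch(sorted, key);
            touched.add(index * nodes / sorted.size());
        }
        return touched.size();
    }

    /** Returns {nodes touched with plain keys, nodes touched with prefixed keys}. */
    static int[] demo() {
        String[] repos = {"repoA", "repoB", "repoC", "repoD"};
        List<String> plainAll = new ArrayList<>();
        List<String> prefixedAll = new ArrayList<>();
        List<String> plainA = new ArrayList<>();
        List<String> prefixedA = new ArrayList<>();
        for (String repo : repos) {
            for (int i = 0; i < 64; i++) {
                String id = sha1Hex(repo + "/object-" + i);
                plainAll.add(plainKey(repo, id));
                prefixedAll.add(prefixedKey(repo, id));
                if (repo.equals("repoA")) {
                    plainA.add(plainKey(repo, id));
                    prefixedA.add(prefixedKey(repo, id));
                }
            }
        }
        return new int[] {
            nodesTouched(plainAll, plainA, 4),
            nodesTouched(prefixedAll, prefixedA, 4)
        };
    }

    public static void main(String[] args) {
        int[] result = demo();
        System.out.println("nodes touched without prefix: " + result[0]);
        System.out.println("nodes touched with prefix:    " + result[1]);
    }
}
```

In this model the rows of `repoA` sit in one contiguous range when the repository name leads the key, so a reader talks to a single node. With the hash prefix the same rows are scattered across the sorted key space, which is the connection and availability cost the commit message describes.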
* Store Git on any DHT | Shawn O. Pearce | 2011-05-05 | 8 | -0/+1171
jgit.storage.dht is a storage provider implementation for JGit that permits storing the Git repository in a distributed hashtable, NoSQL system, or other database. The actual underlying storage system is undefined, and can be plugged in by implementing 7 small interfaces:

* Database
* RepositoryIndexTable
* RepositoryTable
* RefTable
* ChunkTable
* ObjectIndexTable
* WriteBuffer

The storage provider interface tries to assume very little about the underlying storage system, and requires only three key features:

* key -> value lookup (a hashtable is suitable)
* atomic updates on single rows
* asynchronous operations (Java's ExecutorService is easy to use)

Most NoSQL database products offer all 3 of these features in their clients, and so does any decent network based cache system like the open source memcache product. Relying only on key equality for data retrieval makes it simple for the storage engine to distribute across multiple machines. Traditional SQL systems could also be used with a JDBC based spi implementation.

Before submitting this change I have implemented six storage systems for the spi layer:

* Apache HBase[1]
* Apache Cassandra[2]
* Google Bigtable[3]
* an in-memory implementation for unit testing
* a JDBC implementation for SQL
* a generic cache provider that can ride on top of memcache

All six systems came in with an spi layer around 1000 lines of code to implement the above 7 interfaces. This is a huge reduction in size compared to prior attempts to implement a new JGit storage layer. As this package shows, a complete JGit storage implementation is more than 17,000 lines of fairly complex code.

A simple cache is provided in storage.dht.spi.cache. Implementers can use CacheDatabase to wrap any other type of Database and perform fast reads against a network based cache service, such as the open source memcached[4]. An implementation of CacheService must be provided to glue this spi onto the network cache.
[1] https://github.com/spearce/jgit_hbase
[2] https://github.com/spearce/jgit_cassandra
[3] http://labs.google.com/papers/bigtable.html
[4] http://memcached.org/

Change-Id: I0aa4072781f5ccc019ca421c036adff2c40c4295
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
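The three features the commit message asks of a backend (key -> value lookup, atomic updates on single rows, asynchronous operations) can be sketched with a small in-memory table. This is only an illustration: the class `InMemoryTableSketch` and its methods `get` and `compareAndPut` are hypothetical names and are not the interfaces of the storage.dht spi, whose signatures are not shown in this log.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class InMemoryTableSketch {
    // Key -> value lookup: a concurrent hashtable is sufficient.
    private final ConcurrentHashMap<String, String> rows = new ConcurrentHashMap<>();
    private final ExecutorService executor;

    InMemoryTableSketch(ExecutorService executor) {
        this.executor = executor;
    }

    /** Asynchronous read; the future completes with null if the row is absent. */
    CompletableFuture<String> get(String key) {
        return CompletableFuture.supplyAsync(() -> rows.get(key), executor);
    }

    /**
     * Atomic update of a single row. The write succeeds only if the row still
     * holds the expected value; expected == null means "row must not exist".
     * The update value must not be null.
     */
    boolean compareAndPut(String key, String expected, String update) {
        if (expected == null) {
            return rows.putIfAbsent(key, update) == null;
        }
        return rows.replace(key, expected, update);
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            InMemoryTableSketch refs = new InMemoryTableSketch(pool);
            // Creating the row succeeds because it does not exist yet.
            System.out.println(refs.compareAndPut("refs/heads/master", null, "commit-1"));
            // A writer holding a stale value loses the race and must retry.
            System.out.println(refs.compareAndPut("refs/heads/master", "stale", "commit-2"));
            // The row still holds the first value.
            System.out.println(refs.get("refs/heads/master").join());
        } finally {
            pool.shutdown();
        }
    }
}
```

Because every operation names exactly one key, a table like this could be spread across machines by key alone, which is the property the commit message relies on. A networked backend would replace the map with client calls but could keep the same shape.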