# Benchmarks

This file outlines the roadmap (and commensurate benchmarks) of optimizations and refactorings over time.

## Baseline

The baseline implementation is very slow.
We converted the Twitter decision tree library (used internally) from Java to Lua.
The objective was to replicate the GBDT and Random Forest implementations as is (more or less).
The Java library is very good and reasonably fast. The same code in Lua is slow.
The point of this Lua baseline was not to obtain the same computational performance as the Java library.
Instead, we wanted the training and inference algorithms of the Lua lib to match those of the Java lib.
As such, the training/validation error of the baseline Lua lib should match that of the Java lib.
The unit tests seem to validate this claim, as both training and validation set performance are unit tested.
We also used the conversion exercise as a way to learn about decision tree implementation (our background is deep learning).
That being said, the baseline performance is terrible:

```
th -e "dt = require 'decisiontree'; dt.benchmark()"
CartTrainer: sparse dataset create: 2963.192386 samples/sec; 0.337479 sec
CartTrainer: train single-thread : 14.165438 samples/sec; 70.594361 sec
CartTrainer: setup feature-parallel : 5.129034 samples/sec; 194.968478 sec
CartTrainer: train feature-parallel : 9.736592 samples/sec; 102.705344 sec
```

The original Java lib had approximately 43 classes.
The baseline has about 24.
This reduction is due to the obvious merging of classes, but also to the conversion of classes into functions.
The next patches continue this process of reducing the number of classes.

## Patch 1 (complete):

This patch further reduces the number of classes, but adds the DataSet class.
The code is much simpler to read. Examples are batched.

 * [x] examples are batched in dt.DataSet: {input, target, score}
 * [x] deprecate dt.LabeledExample
 * [x] lists of examples are replaced with torch.LongTensors of exampleIds
 * [x] merge TreeBrancher into TreeState
 * [x] merge BestSplitFinder and SplitStateUpdater into TreeState
 * [x] TreeState subclasses: GradientBoostState and GiniState

```
th -e "dt = require 'decisiontree'; dt.benchmark()"
CartTrainer: sparse dataset create: 3597.392294 samples/sec; 0.277984 sec
CartTrainer: train single-thread : 35.763255 samples/sec; 27.961663 sec
CartTrainer: setup feature-parallel : 36759.250495 samples/sec; 0.027220 sec
CartTrainer: train feature-parallel : 72.523658 samples/sec; 13.788606 sec
```

The setup time for feature-parallelization is the most improved.
The run-time for feature-parallel training is also about half that of single-thread.
Since it is using 2 threads, the parallelization is working quite well.
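To make the single-thread vs feature-parallel comparison concrete, here is a rough sketch of what such a `CartTrainer` run looks like. The constructor and method signatures (`dt.getSparseDummyData`, `dt.CartTrainer(dataset, minLeafSize, maxLeafNodes)`, `dt.GiniState`, `trainer:featureParallel(nThread)`) are assumed from the library's README and tests; this is an illustration, not the exact code that `dt.benchmark()` runs:

```lua
-- Sketch only: approximates what the "CartTrainer: train ..." lines above measure.
-- All signatures below are assumed from the library's README/tests.
local dt = require 'decisiontree'

local trainSet = dt.getSparseDummyData()            -- assumed helper returning a small sparse dt.DataSet
local minLeafSize, maxLeafNodes = 10, 100

-- single-thread: train one CART tree with the Gini impurity criterion
local trainer = dt.CartTrainer(trainSet, minLeafSize, maxLeafNodes)
local timer = torch.Timer()
local tree = trainer:train(dt.GiniState(trainSet:getExampleIds()), trainSet.featureIds)
print(string.format("single-thread    : %f sec", timer:time().real))

-- feature-parallel: worker threads each evaluate a subset of candidate features
trainer:featureParallel(2)                          -- 2 threads, as in the benchmark above
timer:reset()
tree = trainer:train(dt.GiniState(trainSet:getExampleIds()), trainSet.featureIds)
print(string.format("feature-parallel : %f sec", timer:time().real))
```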
We also added benchmarks for the `RandomForestTrainer` and `GradientBoostTrainer`:

```
GradientBoostTrainer: train single-thread : 599.895105 samples/sec; 0.083348 sec/tree, 1.666958 sec
GradientBoostTrainer: train feature-parallel : 974.235273 samples/sec; 0.051322 sec/tree, 1.026446 sec
RandomForestTrainer: train single-thread : 134.781044 samples/sec; 0.370972 sec/tree, 7.419441 sec
RandomForestTrainer: setup tree-parallel : 73341.097064 samples/sec; 0.013649 sec
RandomForestTrainer: train tree-parallel : 262.975891 samples/sec; 0.190131 sec/tree, 3.802630 sec
```

Looks good.

## Patch 2 (complete):

 * [x] dt.LossFunction -> nn.Criterion (LogitBoost is done, missing MSE)
 * [x] use SparseTensor:buildIndex() to accelerate TreeState:findBestSplit()
 * [x] benchmarks use 10000 instead of 1000 examples

The benchmarks indicate good improvements. Most improvements were made possible by the use of `buildIndex`:

```
th -e "dt = require 'decisiontree'; dt.benchmark()"
GradientBoostState: findBestSplit (first) : 11.415645 sec
GradientBoostState: findBestSplit (second) : 11.246336 sec
CartTrainer: sparse dataset create: 3284.803629 samples/sec; 3.044327 sec
CartTrainer: train single-thread : 239.544758 samples/sec; 41.745858 sec
CartTrainer: setup feature-parallel : 10996.443063 samples/sec; 0.909390 sec
CartTrainer: train feature-parallel : 473.888592 samples/sec; 21.102011 sec
RandomForestTrainer: train single-thread : 892.985186 samples/sec; 0.559920 sec/tree, 11.198394 sec
RandomForestTrainer: setup tree-parallel : 176806.252266 samples/sec; 0.056569 sec
RandomForestTrainer: train tree-parallel : 1377.849291 samples/sec; 0.362884 sec/tree, 7.257688 sec
GradientBoostTrainer: train single-thread : 2685.485128 samples/sec; 0.186186 sec/tree, 3.723722 sec
GradientBoostTrainer: train feature-parallel : 3712.313215 samples/sec; 0.134687 sec/tree, 2.693738 sec
```

The main bottleneck now is in serializing the SparseTensor hash maps. We temporarily overcame this bottleneck by
deleting the indexes when calling `CartTrainer:featureParallel()` and `RandomForestTrainer:treeParallel()`.
In this way, the indexes are recreated for each thread. Ideally, we would use a C hash map such that a pointer
could be serialized instead, but `tds.Hash` does not serialize well. For now, we use Lua tables instead
(a sketch of the indexing idea is given at the end of this section).

This is the benchmark for `GradientBoostTrainer` on a large dataset of dense inputs:

```
th -e "dt = require 'decisiontree'; dt.benchmark({'GradientBoostTrainer'}, {nExample=100000, sparse=false, nFeature=836, nTree=5, downsampleRatio=1, minLeafSize=1000, maxLeafNodes=8})"
GradientBoostTrainer: train single-thread : 152.463989 samples/sec; 131.178517 sec/tree, 655.892584 sec
GradientBoostTrainer: train feature-parallel : 224.288488 samples/sec; 89.170872 sec/tree, 445.854358 sec
[tw-mbp-nleonard decisiontree]$ th -e "dt = require 'decisiontree'; dt.benchmark({'GradientBoostTrainer'}, {nExample=100000, sparse=false, nFeature=836, nTree=5, downsampleRatio=1, minLeafSize=1000, maxLeafNodes=8,nThread=4})"
GradientBoostTrainer: train single-thread : 163.836896 samples/sec; 122.072625 sec/tree, 610.363126 sec
GradientBoostTrainer: train feature-parallel : 407.981442 samples/sec; 49.021838 sec/tree, 245.109188 sec
```
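The `buildIndex` gain comes from precomputing, once per dataset, an inverted index from featureId to the examples (and values) that contain that feature, so `findBestSplit` does not have to rescan every sparse example for every candidate feature. Below is a minimal sketch of that idea using plain Lua tables (which, as noted above, serialize more easily than `tds.Hash`); the data layout is hypothetical and this is not the library's actual `SparseTensor:buildIndex()` implementation:

```lua
-- Hypothetical sparse example layout: {keys = torch.LongTensor of featureIds,
--                                      values = torch.Tensor of feature values}.
-- Builds featureId -> {exampleIds, values}, so a split search over one feature
-- only touches the examples that actually contain it.
local function buildIndex(examples)
   local index = {}                       -- plain Lua table: serializes cleanly to worker threads
   for exampleId, example in ipairs(examples) do
      for i = 1, example.keys:size(1) do
         local featureId = example.keys[i]
         local entry = index[featureId]
         if not entry then
            entry = {exampleIds = {}, values = {}}
            index[featureId] = entry
         end
         table.insert(entry.exampleIds, exampleId)
         table.insert(entry.values, example.values[i])
      end
   end
   return index
end
```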
## Patch 3:

Optimize GBDT for large datasets consisting of dense inputs.
The benchmarks:

```
th -e "dt = require 'decisiontree'; dt.benchmark({'GradientBoostTrainer'}, {nExample=100000, sparse=false, nFeature=836, nTree=5, downsampleRatio=1, minLeafSize=1000, maxLeafNodes=8})"
GradientBoostTrainer: train single-thread : 547.553407 samples/sec; 36.526117 sec/tree, 182.630587 sec
GradientBoostTrainer: train feature-parallel : 792.964678 samples/sec; 25.221804 sec/tree, 126.109022 sec
[tw-mbp-nleonard decisiontree]$ th -e "dt = require 'decisiontree'; dt.benchmark({'GradientBoostTrainer'}, {nExample=100000, sparse=false, nFeature=836, nTree=5, downsampleRatio=1, minLeafSize=1000, maxLeafNodes=8,nThread=4})"
GradientBoostTrainer: train single-thread : 555.793759 samples/sec; 35.984571 sec/tree, 179.922855 sec
GradientBoostTrainer: train feature-parallel : 1289.977846 samples/sec; 15.504142 sec/tree, 77.520711 sec
```

For 1, 2, and 4 threads, the speedups of patch 3 over patch 2 are respectively 3.39, 3.53, and 3.18.
For this patch, the multi-threading speedups of 2 and 4 threads over a single thread are respectively 1.42 and 2.33.
Improvements over the previous patch were obtained by optimizing two aspects:

 1. Optimizing `TreeState.findBestFeatureSplit` for dense datasets (for example: `if dense, then ...`);
 2. Removing `assert` clauses in `GradientBoostState.update`. The `update` method is called for every (example, feature) pair, making it a major bottleneck.

Converting the `update` method to C could lead to further optimizations.

This patch also improves the benchmark on sparse datasets:

```
$ th -e "dt = require 'decisiontree'; dt.benchmark()"
RandomForestTrainer: train single-thread : 1121.311196 samples/sec; 0.445907 sec/tree, 8.918131 sec
RandomForestTrainer: setup tree-parallel : 168773.323354 samples/sec; 0.059256 sec
RandomForestTrainer: train tree-parallel : 1701.280938 samples/sec; 0.293896 sec/tree, 5.877924 sec
GradientBoostState: findBestSplit (first) : 8.250646 sec
GradientBoostState: findBestSplit (second) : 7.952077 sec
GradientBoostTrainer: train single-thread : 3355.248596 samples/sec; 0.149020 sec/tree, 2.980405 sec
GradientBoostTrainer: train feature-parallel : 4399.133369 samples/sec; 0.113659 sec/tree, 2.273175 sec
CartTrainer: sparse dataset create: 3428.105601 samples/sec; 2.917069 sec
CartTrainer: train single-thread : 282.172416 samples/sec; 35.439331 sec
CartTrainer: setup feature-parallel : 9455.440801 samples/sec; 1.057598 sec
CartTrainer: train feature-parallel : 594.054049 samples/sec; 16.833491 sec
DFD: train random forest in parallel : 346.831378 samples/sec; 0.288325 sec/tree, 5.766491 sec
DFD: updateOutput : 831.105546 samples/sec; 0.038509 sec
```

## Patch 4:

This patch improves `nn.DFD` from

```
th -e "dt = require 'decisiontree'; dt.benchmark({'DFD'}, {nTree=500,maxLeafNodes=8,minLeafSize=1})"
DFD: train random forest in parallel : 10.527251 samples/sec; 0.037997 sec/tree, 18.998313 sec
DFD: updateOutput : 32.442950 samples/sec; 9.863472 sec
```

to

```
th -e "dt = require 'decisiontree'; dt.benchmark({'DFD'}, {nTree=500,maxLeafNodes=8,minLeafSize=1})"
DFD: train random forest in parallel : 10.839547 samples/sec; 0.036902 sec/tree, 18.450956 sec
DFD: updateOutput : 359.158353 samples/sec; 0.890975 sec
Sparse2Dense: updateOutput : 15395.648952 samples/sec; 0.020791 sec
```

That is a 10x speedup for `nn.DFD`.
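For context, the `DFD: updateOutput` line times the forward pass of `nn.DFD`, which maps each input row to the leaf nodes it reaches in a previously trained forest. A rough sketch of that kind of measurement follows; the `dt.DataSet`, `dt.RandomForestTrainer` and `nn.DFD(forest)` signatures are assumptions based on the library's README, the option names simply mirror the benchmark arguments, and any options not shown are assumed to have usable defaults:

```lua
-- Sketch: train a small random forest, then time nn.DFD's forward pass.
-- Signatures and option names are assumed, not authoritative.
local dt = require 'decisiontree'
require 'nn'

local nExample, nFeature = 1000, 100
local input = torch.randn(nExample, nFeature)                 -- dense inputs
local target = torch.LongTensor(nExample):random(0, 1)
local trainSet = dt.DataSet(input, target)

local trainer = dt.RandomForestTrainer{
   nTree = 500, maxLeafNodes = 8, minLeafSize = 1             -- same hyper-parameters as above
}
local forest = trainer:train(trainSet, trainSet.featureIds)   -- assumed train(dataset, featureIds)

local dfd = nn.DFD(forest)                                    -- module that outputs the reached leaf nodes
local timer = torch.Timer()
local output = dfd:forward(input)                             -- the part reported as "DFD: updateOutput"
print(string.format("DFD: updateOutput : %f sec", timer:time().real))
```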
The patch also adds a benchmark for `nn.Sparse2Dense`:

```
th -e "dt = require 'decisiontree'; dt.benchmark({'Sparse2Dense'}, {nTree=500,maxLeafNodes=8,minLeafSize=1})"
Sparse2Dense: updateOutput : 17158.126406 samples/sec; 0.018653 sec
```

Indeed, `nn.Sparse2Dense` is not the bottleneck; `nn.DFD` is.

## Patch 5:

This patch improves `nn.DFD` inference from

```
for i in `seq 3`; do th -e "dt = require 'decisiontree'; dt.benchmark({'DFD'}, {nTree=500,maxLeafNodes=8,minLeafSize=1,batchsize=16,nActive=1200,nFeature=1300,nloop=100})"; done
DFD: train random forest in parallel : 8.452295 samples/sec; 0.047324 sec/tree, 23.662212 sec
DFD: updateOutput : 176.617872 samples/sec; 9.059109 sec
DFD: train random forest in parallel : 8.350019 samples/sec; 0.047904 sec/tree, 23.952042 sec
DFD: updateOutput : 183.508204 samples/sec; 8.718962 sec
DFD: train random forest in parallel : 8.525779 samples/sec; 0.046917 sec/tree, 23.458266 sec
DFD: updateOutput : 178.877077 samples/sec; 8.944692 sec
```

to

```
for i in `seq 3`; do th -e "dt = require 'decisiontree'; dt.benchmark({'DFD'}, {nTree=500,maxLeafNodes=8,minLeafSize=1,batchsize=16,nActive=1200,nFeature=1300,nloop=100})"; done
DFD: train random forest in parallel : 8.434502 samples/sec; 0.047424 sec/tree, 23.712129 sec
DFD: updateOutput : 6479.597179 samples/sec; 0.246933 sec
DFD: train random forest in parallel : 8.334543 samples/sec; 0.047993 sec/tree, 23.996518 sec
DFD: updateOutput : 6663.641184 samples/sec; 0.240114 sec
DFD: train random forest in parallel : 8.353265 samples/sec; 0.047885 sec/tree, 23.942735 sec
DFD: updateOutput : 6882.607456 samples/sec; 0.232475 sec
```

That is a 37x speedup for `nn.DFD`.

## Patch 6:

This patch improves `nn.DFD` from the previous result to

```
for i in `seq 5`; do th -e "dt = require 'decisiontree'; dt.benchmark({'DFD'}, {nTree=500,maxLeafNodes=8,minLeafSize=1,batchsize=16,nActive=1200,nFeature=1300,nloop=10000})"; done
DFD: train random forest in parallel : 8.353504 samples/sec; 0.047884 sec/tree, 23.942050 sec
DFD: updateOutput : 91967.342339 samples/sec; 1.739753 sec
DFD: train random forest in parallel : 8.528141 samples/sec; 0.046904 sec/tree, 23.451770 sec
DFD: updateOutput : 91405.321702 samples/sec; 1.750451 sec
DFD: train random forest in parallel : 8.184562 samples/sec; 0.048872 sec/tree, 24.436250 sec
DFD: updateOutput : 91623.388867 samples/sec; 1.746284 sec
DFD: train random forest in parallel : 8.779561 samples/sec; 0.045560 sec/tree, 22.780182 sec
DFD: updateOutput : 93914.242852 samples/sec; 1.703686 sec
DFD: train random forest in parallel : 8.636201 samples/sec; 0.046317 sec/tree, 23.158330 sec
DFD: updateOutput : 94092.241963 samples/sec; 1.700465 sec
```

That is another 13.8x speedup.
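A note on reading these numbers: the reported rate is consistent with `samples/sec = batchsize * nloop / elapsed` (for instance 16 * 10000 / 1.74 ≈ 92k in the last run). A minimal sketch of such a timing loop is shown below; it assumes a `dfd` module built as in the Patch 4 sketch, and the loop structure is an assumption rather than the actual `dt.benchmark` code:

```lua
-- Presumed shape of the timing loop behind the "DFD: updateOutput" lines.
-- `dfd` is an nn.DFD module (see the earlier sketch); batchsize/nFeature/nloop
-- mirror the benchmark arguments (batchsize=16, nFeature=1300, nloop=10000).
local batchsize, nFeature, nloop = 16, 1300, 10000
local input = torch.randn(batchsize, nFeature)

local timer = torch.Timer()
for i = 1, nloop do
   dfd:updateOutput(input)                 -- forward pass only; no gradients involved
end
local elapsed = timer:time().real
print(string.format("DFD: updateOutput : %f samples/sec; %f sec",
   batchsize * nloop / elapsed, elapsed))
```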
## Patch 7:

This patch improves `nn.Sparse2Dense` computation from

```
for i in `seq 3`; do th -e "dt = require 'decisiontree'; torch.setdefaulttensortype('torch.FloatTensor'); dt.benchmark({'Sparse2Dense'}, {nTree=500,maxLeafNodes=8,minLeafSize=1,nFeature=1500,nActive=1300,nloop=1000})"; done
Sparse2Dense: updateOutput : 1103.570777 samples/sec; 28.996786 sec
Sparse2Dense: updateOutput : 1092.064331 samples/sec; 29.302309 sec
Sparse2Dense: updateOutput : 1036.963572 samples/sec; 30.859334 sec
```

to

```
for i in `seq 3`; do th -e "dt = require 'decisiontree'; torch.setdefaulttensortype('torch.FloatTensor'); dt.benchmark({'Sparse2Dense'}, {nTree=500,maxLeafNodes=8,minLeafSize=1,nFeature=1500,nActive=1300,nloop=1000})"; done
Sparse2Dense: updateOutput : 62995.834470 samples/sec; 0.507978 sec
Sparse2Dense: updateOutput : 62471.568253 samples/sec; 0.512242 sec
Sparse2Dense: updateOutput : 62965.099331 samples/sec; 0.508226 sec
```

This represents a speedup of about 57x.

## Patch 8:

This patch improves `nn.Sparse2Dense` from the previous result to

```
for i in `seq 3`; do th -e "dt = require 'decisiontree'; torch.setdefaulttensortype('torch.FloatTensor'); dt.benchmark({'Sparse2Dense'}, {nTree=500,maxLeafNodes=8,minLeafSize=1,nFeature=1500,nActive=1300,nloop=1000})"; done
Sparse2Dense: updateOutput : 124268.079914 samples/sec; 0.257515 sec
Sparse2Dense: updateOutput : 114750.039542 samples/sec; 0.278873 sec
Sparse2Dense: updateOutput : 122863.314766 samples/sec; 0.260458 sec
```

which corresponds to another 1.95x speedup.

## Patch 9:

This patch moves the core of GBDT training, which used to be a big bottleneck, to C. It also
performs small optimizations across the board (faster scoring, faster branching, ...) that provide a
little more performance.

The original commit had this performance:

```
th -e "dt = require 'decisiontree'; torch.setdefaulttensortype('torch.FloatTensor'); dt.benchmark({'GradientBoostTrainer'}, {nExample=100000, sparse=false, nFeature=836, nTree=5, downsampleRatio=1, minLeafSize=1000, maxLeafNodes=8})"
GradientBoostTrainer: train single-thread : 500.414666 samples/sec; 39.966854 sec/tree, 199.834271 sec
GradientBoostTrainer: train feature-parallel : 1227.228044 samples/sec; 16.296890 sec/tree, 81.484448 sec (4 threads)
GradientBoostTrainer: train feature-parallel : 1385.926280 samples/sec; 14.430782 sec/tree, 72.153910 sec (8 threads)
```

and the new version has

```
GradientBoostTrainer: train single-thread : 15285.644631 samples/sec; 1.308417 sec/tree, 6.542086 sec
GradientBoostTrainer: train feature-parallel : 43170.435932 samples/sec; 0.463280 sec/tree, 2.316400 sec (4 threads)
GradientBoostTrainer: train feature-parallel : 50062.681239 samples/sec; 0.399499 sec/tree, 1.997496 sec (8 threads)
```

That represents a speedup of about 30.5x over the original commit for 1 thread and 36.1x for 8 threads.
Note that the performance doesn't increase much as we increase the number of threads, since we use
feature parallelism and the number of features evaluated is small (29 in this case) due to bagging.
If we disable bagging, then we have the following result with 8 threads and the new code:

```
GradientBoostTrainer: train single-thread : 590.823965 samples/sec; 33.851030 sec/tree, 169.255152 sec
GradientBoostTrainer: train feature-parallel : 3232.188576 samples/sec; 6.187758 sec/tree, 30.938789 sec
```

So processing 836 features now is much faster than processing 29 was before.
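The bagging effect can be made concrete with a little arithmetic. Assuming the common default of sampling roughly sqrt(nFeature) features per tree (an assumption about the configuration, not something stated in the logs), sqrt(836) ≈ 29, which matches the 29 features mentioned above. Split over 8 threads that leaves only a handful of features per thread, so the parallel section is too short to amortize the threading overhead; without bagging each thread gets roughly a hundred features and the scaling improves:

```lua
-- Back-of-the-envelope view of why feature-parallelism saturates under bagging.
local nFeature, nThread = 836, 8

-- assumes a sqrt(nFeature) feature-bagging default, which matches the 29 features observed
local bagged = math.floor(math.sqrt(nFeature) + 0.5)
print(string.format("with bagging   : %d features -> ~%d per thread", bagged, math.ceil(bagged / nThread)))
print(string.format("without bagging: %d features -> ~%d per thread", nFeature, math.ceil(nFeature / nThread)))

-- observed 8-thread speedups over single-thread, from the numbers above
print(string.format("bagged   speedup: %.1fx of an ideal 8x", 50062.7 / 15285.6))  -- ~3.3x
print(string.format("unbagged speedup: %.1fx of an ideal 8x", 3232.2 / 590.8))     -- ~5.5x
```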