# Torch decision tree library

```lua
local dt = require 'decisiontree'
```

This project implements random forests and gradient boosted decision trees (GBDT), the latter being trained via gradient tree boosting.
Both use ensemble learning to produce ensembles of decision trees (that is, forests).

## `nn.DFD`

One practical application for decision forests is to *discretize* an input feature space into a richer output feature space.
The `nn.DFD` Module can be used as a decision forest discretizer (DFD):

```lua
local dfd = nn.DFD(df, onlyLastNode)
```

where `df` is a `dt.DecisionForest` instance or the table returned by the method `getReconstructionInfo()` on another `nn.DFD` module, and `onlyLastNode` is a boolean indicating that the module should return only the ID of the last node visited in each tree (by default, it outputs all traversed nodes except the roots).

The `nn.DFD` module requires dense `input` tensors.
Sparse `input` tensors (tables of tensors) are not supported.
The `output` returned by a call to `updateOutput` is a batch of sparse tensors, where `output[1]` is a list of key tensors and `output[2]` is a list of value tensors:

```lua
{
  { [torch.LongTensor], ..., [torch.LongTensor] },
  { [torch.Tensor], ..., [torch.Tensor] }
}
```

This module doesn't support CUDA.

### Example

As a concrete example, let us first train a random forest on a dummy dense dataset:

```lua
local nExample = 100
local batchsize = 2
local inputsize = 10

-- train a random forest
local trainSet = dt.getDenseDummyData(nExample, nil, inputsize)
local opt = {
   activeRatio=0.5,
   featureBaggingSize=5,
   nTree=4,
   maxLeafNodes=nExample/2,
   minLeafSize=nExample/10,
}
local trainer = dt.RandomForestTrainer(opt)
local df = trainer:train(trainSet, trainSet.featureIds)
assert(#df.trees == opt.nTree)
```

Now that we have `df`, a `dt.DecisionForest` instance, we can use it to initialize `nn.DFD`:

```lua
local dfd = nn.DFD(df)
```

The `dfd` instance holds no reference to `df`; instead, it extracts the relevant attributes from `df`.
These attributes are stored in tensors for batching and efficiency.
We can discretize a hypothetical `input` by calling `forward`:

```lua
local input = trainSet.input:sub(1,batchsize)
local output = dfd:forward(input)
```

The resulting `output` is a table consisting of two tables: keys and values.
The keys and values tables each contain `batchsize` tensors:

```lua
print(output)
{
  1 :
    {
      1 : LongTensor - size: 14
      2 : LongTensor - size: 16
      3 : LongTensor - size: 15
      4 : LongTensor - size: 13
    }
  2 :
    {
      1 : DoubleTensor - size: 14
      2 : DoubleTensor - size: 16
      3 : DoubleTensor - size: 15
      4 : DoubleTensor - size: 13
    }
}
```

An example's feature keys (`LongTensor`) and corresponding values (`DoubleTensor`) have the same number of elements.
Each example has a variable number of key-value pairs, representing the nodes traversed across the forest's trees.
The output feature space has as many dimensions (that is, possible feature keys) as there are nodes in the forest.
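
If a downstream module expects dense input, the sparse `output` can be expanded into a dense matrix.
The following is a minimal sketch, not part of the library API; `nTotalNodes`, the dimensionality of the output feature space, is a hypothetical parameter:

```lua
-- Sketch only (not library code): densify the sparse nn.DFD output.
-- `nTotalNodes` is a hypothetical count of nodes in the forest.
local function densify(output, nTotalNodes)
   local keys, values = output[1], output[2]
   local dense = torch.zeros(#keys, nTotalNodes)
   for i=1,#keys do
      -- scatter example i's (key, value) pairs into row i
      dense[i]:indexCopy(1, keys[i], values[i])
   end
   return dense
end
```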

## `torch.SparseTensor`

Suppose you have a set of `keys` mapped to `values`:

```lua
local keys = torch.LongTensor{1,3,4,7,2}
local values = torch.Tensor{0.1,0.3,0.4,0.7,0.2}
```

You can use a `SparseTensor` to encapsulate these into a read-only tensor:

```lua
local st = torch.SparseTensor(keys, values)
```

The _decisiontree_ library uses `SparseTensor`s to simulate the `__index` method of `torch.Tensor`.
For example, one can obtain the value associated with key 3 of the above `st` instance:

```lua
local value = st[3]
assert(value == 0.3)
```

When a key-value pair is missing, `nil` is returned instead:

```lua
local value = st[5]
assert(value == nil)
```

The default implementation of this kind of indexing is slow, as it uses a sequential scan of the `keys`.
To speed up indexing, one can call the `buildIndex()` method beforehand:

```lua
st:buildIndex()
```

The `buildIndex()` method creates a hash map (a Lua table) mapping each key to its corresponding index in the `values` tensor.
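
Conceptually, the index is just a plain Lua table.
The following sketch assumes the keys and values are stored as `st.keys` and `st.values` (an assumption about internal attribute names, not documented API):

```lua
-- Sketch only: what buildIndex amounts to conceptually.
-- Assumes internal attributes st.keys (LongTensor) and st.values (Tensor).
local index = {}
for i=1,st.keys:size(1) do
   index[st.keys[i]] = i
end
-- st[key] then reduces to an O(1) lookup akin to st.values[index[key]]
```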

## `dt.DataSet`

The `CartTrainer`, `RandomForestTrainer` and `GradientBoostTrainer` require that data sets be encapsulated into a `DataSet`.
Suppose you have a dataset of dense inputs and targets:

```lua
local nExample = 10
local nFeature = 5
local input = torch.randn(nExample, nFeature)
local target = torch.Tensor(nExample):random(0,1)
```

These can be encapsulated into a `DataSet` object:

```lua
local dataset = dt.DataSet(input, target)
```

Now suppose you have a dataset where the `input` is a table of `SparseTensor` instances:

```lua
local input = {}
for i=1,nExample do
   local nKeyVal = math.random(2,nFeature)
   local keys = torch.LongTensor(nKeyVal):random(1,nFeature)
   local values = torch.randn(nKeyVal)
   input[i] = torch.SparseTensor(keys, values)
end
```

You can still use a `DataSet` to encapsulate the sparse dataset:

```lua
local dataset = dt.DataSet(input, target)
```

The main purpose of the `DataSet` class is to sort each feature by value.
This is captured by the `sortFeatureValues(input)` method, which is called in the constructor:

```lua
local sortedFeatureValues, featureIds = self:sortFeatureValues(input)
```

The `featureIds` is a `torch.LongTensor` of all available feature IDs.
For a dense `input` tensor, this is just `torch.LongTensor():range(1,input:size(2))`.
But for a sparse `input` tensor, the `featureIds` tensor only contains the feature IDs present in the dataset.
The resulting `sortedFeatureValues` is a table mapping each `featureId` to a `LongTensor` of example IDs, sorted by feature value in ascending order.
For instance, if feature 2 has values `0.2, 0.3, 0.1` for examples `1, 2, 3`, then `sortedFeatureValues[2]` is `torch.LongTensor{3,1,2}`.
The `CartTrainer` accesses `sortedFeatureValues` by calling `getSortedFeature(featureId)`:

```lua
local exampleIdsWithFeature = dataset:getSortedFeature(featureId)
```

The ability to access example IDs sorted by feature value, given a feature ID, is the main purpose of the `DataSet`.
The `CartTrainer` relies on these sorted lists to find the best way to split a set of examples between two tree nodes.
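
To see why sorted access matters, consider that a single left-to-right sweep over the sorted example IDs visits every candidate split threshold for a feature.
This is only an illustrative sketch of the idea, not the library's implementation:

```lua
-- Illustrative sketch: one pass over a sorted feature covers all splits.
local exampleIds = dataset:getSortedFeature(featureId)
for i=1,exampleIds:size(1)-1 do
   -- examples exampleIds[1..i] would go to the left node and
   -- exampleIds[i+1..n] to the right node; a purity measure
   -- (e.g. Gini impurity) can be updated incrementally at each step.
end
```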

## `dt.CartTrainer`

```lua
local trainer = dt.CartTrainer(dataset, minLeafSize, maxLeafNodes)
```

The `CartTrainer` is used by the `RandomForestTrainer` and `GradientBoostTrainer` to train individual trees.
CART stands for classification and regression trees.
However, only binary classifiers are unit tested.
The constructor takes the following arguments:

* `dataset` is a `dt.DataSet` instance representing the training set.
* `minLeafSize` is the minimum number of examples per leaf node in a tree. The larger the value, the more regularization.
* `maxLeafNodes` is the maximum number of leaf nodes in the tree. The lower the value, the more regularization.

Training is initiated by calling the `train()` method:

```lua
local trainSet = dt.DataSet(input, target)
local rootTreeState = dt.GiniState(trainSet:getExampleIds())
local activeFeatures = trainSet.featureIds
local tree = trainer:train(rootTreeState, activeFeatures)
```

The resulting `tree` is a `CartTree` instance.
The `rootTreeState` is a `TreeState` instance like `GiniState` (used by `RandomForestTrainer`) or `GradientBoostState` (used by `GradientBoostTrainer`).
The `activeFeatures` is a `LongTensor` of the feature IDs used to build the tree.
Every other feature ID is ignored during training. This is useful for feature bagging, as sketched below.
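
For instance, feature bagging can be as simple as sampling a random subset of `featureIds` to pass as `activeFeatures`.
A minimal sketch, where the bag size of 5 is an arbitrary choice:

```lua
-- Sketch only: sample 5 random feature IDs for feature bagging.
local nFeature = trainSet.featureIds:size(1)
local perm = torch.randperm(nFeature):long()
local activeFeatures = trainSet.featureIds:index(1, perm:sub(1, 5))
```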

By default, the `CartTrainer` runs in a single thread.
The `featureParallel(nThread)` method can be called before `train()` to parallelize training over `nThread` workers:

```lua
local nThread = 3
trainer:featureParallel(nThread)
trainer:train(rootTreeState, activeFeatures)
```

Feature parallelization assigns a subset of the feature IDs to each thread.
The `CartTrainer` can be used as a stand-alone tree trainer.
But it is recommended to use it within the context of a `RandomForestTrainer` or `GradientBoostTrainer` instead, as the resulting forests typically generalize better than single trees.

## RandomForestTrainer

The `RandomForestTrainer` is used to train a random forest:

```lua
local nExample = trainSet:size()
local opt = {
   activeRatio=0.5,
   featureBaggingSize=5,
   nTree=14,
   maxLeafNodes=nExample/2,
   minLeafSize=nExample/10,
}
local trainer = dt.RandomForestTrainer(opt)
local forest = trainer:train(trainSet, trainSet.featureIds)
```

The returned `forest` is a `DecisionForest` instance.
A `DecisionForest` has a similar interface to the `CartTree`.
Indeed, they both sub-class the `DecisionTree` abstract class.
The constructor takes a single `opt` table argument, which contains the actual arguments:

* `activeRatio` is the ratio of active examples per tree. This is used for bootstrap sampling.
* `featureBaggingSize` is the number of features per tree. This is also used for feature bagging.
* `nTree` is the number of trees to be trained.
* `maxLeafNodes` and `minLeafSize` are passed to the underlying `CartTrainer` constructor (these control regularization).

Internally, the `RandomForestTrainer` passes a `GiniState` to the `CartTrainer:train()` method.

Training can be parallelized by calling `treeParallel(nThread)`:

```lua
local nThread = 3
trainer:treeParallel(nThread)
local forest = trainer:train(trainSet, trainSet.featureIds)
```

Each tree is then trained in its own worker thread.

## GradientBoostTrainer

References:

* A. [Boosted Tree presentation](https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf)

Gradient boosted decision trees (GBDT) can be trained as follows:

```lua
local nExample = trainSet:size()
local maxLeafNode, minLeafSize = nExample/2, nExample/10
local cartTrainer = dt.CartTrainer(trainSet, minLeafSize, maxLeafNode)
local opt = {
   lossFunction=nn.LogitBoostCriterion(false),
   treeTrainer=cartTrainer,
   shrinkage=0.1,
   downsampleRatio=0.8,
   featureBaggingSize=-1,
   nTree=14,
   evalFreq=8,
   earlyStop=0
}
local trainer = dt.GradientBoostTrainer(opt)
-- validSet is a separate dt.DataSet used for validation (see evalFreq)
local forest = trainer:train(trainSet, trainSet.featureIds, validSet)
```

The above code snippet uses the `LogitBoostCriterion` outlined in reference A.
It is used for training binary classification trees.
The returned `forest` is a `DecisionForest` instance.
A `DecisionForest` has a similar interface to the `CartTree`.
Indeed, they both sub-class the `DecisionTree` abstract class.
The constructor takes a single `opt` table argument, which contains the actual arguments:

* `lossFunction` is an `nn.Criterion` instance extended to include the `updateHessInput(input, target)` and `backward2(input, target)` methods, which return the hessian with respect to the `input`.
* `treeTrainer` is a `CartTrainer` instance. Its `featureParallel()` method can be called to implement feature parallelization.
* `shrinkage` is the weight of each additional tree (see the sketch below).
* `downsampleRatio` is the ratio of examples to be sampled for each tree. This is used for bootstrap sampling.
* `featureBaggingSize` is the number of features to sample per tree. This is used for feature bagging. A value of `-1` defaults to `torch.round(math.sqrt(featureIds:size(1)))`.
* `nTree` is the maximum number of trees.
* `evalFreq` is the number of epochs between calls to `validate()` for cross-validation and early-stopping.
* `earlyStop` is the maximum number of epochs to wait for early-stopping.

Internally, the `GradientBoostTrainer` passes a `GradientBoostState` to the `CartTrainer:train()` method.
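
To make the role of `shrinkage` concrete, here is an illustrative sketch of the additive model a boosted forest computes.
This reflects the standard GBDT formulation, not necessarily the library's exact code:

```lua
-- Illustrative sketch: the boosted score is the shrinkage-weighted
-- sum of per-tree scores.
local function boostedScore(trees, shrinkage, input)
   local score = 0
   for _, tree in ipairs(trees) do
      score = score + shrinkage * tree:score(input)
   end
   return score
end
```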

## TreeState

An abstract class that holds the state of a subtree during decision tree training.
It also manages the state of candidate splits.

```lua
local treeState = dt.TreeState(exampleIds)
```

The `exampleIds` argument is a `LongTensor` containing the example IDs that make up the sub-tree.

## GiniState

A `TreeState` subclass used internally by the `RandomForestTrainer`.
Uses Gini impurity to determine how to split trees.

```lua
local treeState = dt.GiniState(exampleIds)
```

The `exampleIds` argument is a `LongTensor` containing the example IDs that make up the sub-tree.
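
For reference, the Gini impurity of a node is `1 - sum_k(p_k^2)`, where `p_k` is the proportion of class `k` in the node; for binary classification this reduces to `2*p*(1-p)`.
A small illustrative helper (not part of the library):

```lua
-- Illustrative helper: Gini impurity of a binary node, where `p` is
-- the fraction of positive examples in the node.
local function giniImpurity(p)
   return 1 - (p*p + (1-p)*(1-p)) -- equals 2*p*(1-p)
end
```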

## GradientBoostState

A `TreeState` subclass used internally by the `GradientBoostTrainer`.
It implements the GBDT splitting algorithm, which uses a loss function.

```lua
local treeState = dt.GradientBoostState(exampleIds, lossFunction)
```

The `exampleIds` argument is a `LongTensor` containing the example IDs that make up the sub-tree.
The `lossFunction` is an `nn.Criterion` instance (see `GradientBoostTrainer`).

## WorkPool

Utility class that simplifies construction of a pool of daemon threads with which to execute tasks in parallel.

```lua
local workpool = dt.WorkPool(nThread)
```

## CartTree

Implements a trained CART decision tree:

```lua
local tree = nn.CartTree(rootNode)
```

The `rootNode` is a `CartNode` instance.
Each `CartNode` contains pointers to left and right branches, which are themselves `CartNode` instances.
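
To illustrate what inference over these nodes looks like, here is a conceptual sketch of a recursive walk down the tree.
The field names (`leftChild`, `rightChild`, `splitFeatureId`, `splitFeatureValue`, `score`) are assumptions about `CartNode` internals, not documented API:

```lua
-- Conceptual sketch: walk from the root to a leaf.
-- Field names are assumed, not documented API.
local function walk(node, input)
   if not node.leftChild and not node.rightChild then
      return node.score -- reached a leaf
   end
   if input[node.splitFeatureId] < node.splitFeatureValue then
      return walk(node.leftChild, input)
   else
      return walk(node.rightChild, input)
   end
end
```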

For inference, use the `score(input)` method:

```lua
local score = tree:score(input)
```