Diffstat (limited to 'vendor/github.com/blevesearch/zap/v12/zap.md')
-rw-r--r-- | vendor/github.com/blevesearch/zap/v12/zap.md | 177 |
1 files changed, 177 insertions, 0 deletions
diff --git a/vendor/github.com/blevesearch/zap/v12/zap.md b/vendor/github.com/blevesearch/zap/v12/zap.md
new file mode 100644
index 0000000000..d74dc548b8
--- /dev/null
+++ b/vendor/github.com/blevesearch/zap/v12/zap.md
@@ -0,0 +1,177 @@
+# ZAP File Format
+
+## Legend
+
+### Sections
+
+    |========|
+    |        | section
+    |========|
+
+### Fixed-size fields
+
+    |--------|        |----|        |--|        |-|
+    |        | uint64 |    | uint32 |  | uint16 | | uint8
+    |--------|        |----|        |--|        |-|
+
+### Varints
+
+    |~~~~~~~~|
+    |        | varint(up to uint64)
+    |~~~~~~~~|
+
+### Arbitrary-length fields
+
+    |--------...---|
+    |              | arbitrary-length field (string, vellum, roaring bitmap)
+    |--------...---|
+
+### Chunked data
+
+    [--------]
+    [        ]
+    [--------]
+
+## Overview
+
+Footer section describes the configuration of particular ZAP file. The format of footer is version-dependent, so it is necessary to check `V` field before the parsing.
+
+            |==================================================|
+            | Stored Fields                                    |
+            |==================================================|
+    |-----> | Stored Fields Index                              |
+    |       |==================================================|
+    |       | Dictionaries + Postings + DocValues              |
+    |       |==================================================|
+    | |---> | DocValues Index                                  |
+    | |     |==================================================|
+    | |     | Fields                                           |
+    | |     |==================================================|
+    | | |-> | Fields Index                                     |
+    | | |   |========|========|========|========|====|====|====|
+    | | |   |     D# |     SF |      F |    FDV | CF |  V | CC | (Footer)
+    | | |   |========|====|===|====|===|====|===|====|====|====|
+    | | |                 |        |        |
+    |-+-+-----------------|        |        |
+      | |--------------------------|        |
+      |-------------------------------------|
+
+     D#. Number of Docs.
+     SF. Stored Fields Index Offset.
+      F. Field Index Offset.
+    FDV. Field DocValue Offset.
+     CF. Chunk Factor.
+      V. Version.
+     CC. CRC32.
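Reader's note on the footer layout above: the following is a minimal Go sketch of how the seven footer fields could be decoded from the tail of a file. It assumes big-endian byte order, which the format description does not state, and the names `Footer`, `ParseFooter`, and `sampleFooter` are hypothetical and not taken from the zap package. The sizes follow the diagram: four `uint64` fields followed by three `uint32` fields, 44 bytes in total.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// footerSize follows the footer diagram: four uint64 fields (D#, SF, F, FDV)
// followed by three uint32 fields (CF, V, CC).
const footerSize = 4*8 + 3*4

// Footer mirrors the footer fields; the Go names are illustrative only.
type Footer struct {
	NumDocs           uint64 // D#
	StoredIndexOffset uint64 // SF
	FieldsIndexOffset uint64 // F
	DocValueOffset    uint64 // FDV
	ChunkFactor       uint32 // CF
	Version           uint32 // V
	CRC               uint32 // CC
}

// ParseFooter decodes the last footerSize bytes of file.
// Assumption: integers are big-endian; the format text does not say.
func ParseFooter(file []byte) (Footer, error) {
	if len(file) < footerSize {
		return Footer{}, errors.New("zap: file is shorter than the footer")
	}
	b := file[len(file)-footerSize:]
	be := binary.BigEndian
	return Footer{
		NumDocs:           be.Uint64(b[0:8]),
		StoredIndexOffset: be.Uint64(b[8:16]),
		FieldsIndexOffset: be.Uint64(b[16:24]),
		DocValueOffset:    be.Uint64(b[24:32]),
		ChunkFactor:       be.Uint32(b[32:36]),
		Version:           be.Uint32(b[36:40]),
		CRC:               be.Uint32(b[40:44]),
	}, nil
}

// sampleFooter builds a synthetic footer; the values are made up.
func sampleFooter() []byte {
	b := make([]byte, footerSize)
	be := binary.BigEndian
	be.PutUint64(b[0:8], 3)      // D#
	be.PutUint64(b[8:16], 100)   // SF
	be.PutUint64(b[16:24], 400)  // F
	be.PutUint64(b[24:32], 300)  // FDV
	be.PutUint32(b[32:36], 1024) // CF
	be.PutUint32(b[36:40], 12)   // V
	be.PutUint32(b[40:44], 0)    // CC (not verified in this sketch)
	return b
}

func main() {
	f, err := ParseFooter(sampleFooter())
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", f)
}
```

Because the footer is version-dependent, a caller would inspect `Version` before trusting the other fields. The sketch does not verify the CRC32.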
+
+## Stored Fields
+
+Stored Fields Index is `D#` consecutive 64-bit unsigned integers - offsets, where relevant Stored Fields Data records are located.
+
+    0                                [SF]                   [SF + D# * 8]
+    | Stored Fields                  | Stored Fields Index              |
+    |================================|==================================|
+    |                                |                                  |
+    |       |--------------------|   ||--------|--------|. . .|--------||
+    |   |-> | Stored Fields Data |   ||      0 |      1 |     | D# - 1 ||
+    |   |   |--------------------|   ||--------|----|---|. . .|--------||
+    |   |                            |              |                   |
+    |===|============================|==============|===================|
+        |                                           |
+        |-------------------------------------------|
+
+Stored Fields Data is an arbitrary size record, which consists of metadata and [Snappy](https://github.com/golang/snappy)-compressed data.
+
+    Stored Fields Data
+    |~~~~~~~~|~~~~~~~~|~~~~~~~~...~~~~~~~~|~~~~~~~~...~~~~~~~~|
+    |    MDS |    CDS |         MD        |         CD        |
+    |~~~~~~~~|~~~~~~~~|~~~~~~~~...~~~~~~~~|~~~~~~~~...~~~~~~~~|
+
+    MDS. Metadata size.
+    CDS. Compressed data size.
+    MD.  Metadata.
+    CD.  Snappy-compressed data.
+
+## Fields
+
+Fields Index section located between addresses `F` and `len(file) - len(footer)` and consist of `uint64` values (`F1`, `F2`, ...) which are offsets to records in Fields section. We have `F# = (len(file) - len(footer) - F) / sizeof(uint64)` fields.
+
+
+    (...)                            [F]                       [F + F#]
+    | Fields                         | Fields Index.                  |
+    |================================|================================|
+    |                                |                                |
+    |   |~~~~~~~~|~~~~~~~~|---...---|||--------|--------|...|--------||
+    ||->|   Dict | Length |   Name  |||      0 |      1 |   | F# - 1 ||
+    ||  |~~~~~~~~|~~~~~~~~|---...---|||--------|----|---|...|--------||
+    ||                               |              |                 |
+    ||===============================|==============|=================|
+     |                                              |
+     |----------------------------------------------|
+
+
+## Dictionaries + Postings
+
+Each of fields has its own dictionary, encoded in [Vellum](https://github.com/couchbase/vellum) format. Dictionary consists of pairs `(term, offset)`, where `offset` indicates the position of postings (list of documents) for this particular term.
+
+    |================================================================|- Dictionaries +
+    |                                                                |   Postings +
+    |                                                                |   DocValues
+    |    Freq/Norm (chunked)                                         |
+    |    [~~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~]                      |
+    | |->[ Freq | Norm (float32 under varint) ]                      |
+    | |  [~~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~]                      |
+    | |                                                              |
+    | |------------------------------------------------------------| |
+    |    Location Details (chunked)                                | |
+    |    [~~~~~~|~~~~~|~~~~~~~|~~~~~|~~~~~~|~~~~~~~~|~~~~~]        | |
+    | |->[ Size | Pos | Start | End | Arr# | ArrPos | ... ]        | |
+    | |  [~~~~~~|~~~~~|~~~~~~~|~~~~~|~~~~~~|~~~~~~~~|~~~~~]        | |
+    | |                                                            | |
+    | |----------------------|                                     | |
+    |         Postings List  |                                     | |
+    |         |~~~~~~~~|~~~~~|~~|~~~~~~~~|-----------...--|        | |
+    |      |->|    F/N |     LD | Length | ROARING BITMAP |        | |
+    |      |  |~~~~~|~~|~~~~~~~~|~~~~~~~~|-----------...--|        | |
+    |      |        |----------------------------------------------| |
+    |      |--------------------------------------|                  |
+    |         Dictionary                          |                  |
+    |         |~~~~~~~~|--------------------------|-...-|            |
+    |      |->| Length | VELLUM DATA : (TERM -> OFFSET) |            |
+    |      |  |~~~~~~~~|----------------------------...-|            |
+    |      |                                                         |
+    |======|=========================================================|- DocValues Index
+    |      |                                                         |
+    |======|=========================================================|- Fields
+    |      |                                                         |
+    | |~~~~|~~~|~~~~~~~~|---...---|                                  |
+    | |   Dict | Length |   Name  |                                  |
+    | |~~~~~~~~|~~~~~~~~|---...---|                                  |
+    |                                                                |
+    |================================================================|
+
+## DocValues
+
+DocValues Index is `F#` pairs of varints, one pair per field. Each pair of varints indicates start and end point of DocValues slice.
+
+    |================================================================|
+    |     |------...--|                                              |
+    |  |->| DocValues |<-|                                           |
+    |  |  |------...--|  |                                           |
+    |==|=================|===========================================|- DocValues Index
+    ||~|~~~~~~~~~|~~~~~~~|~~|           |~~~~~~~~~~~~~~|~~~~~~~~~~~~||
+    || DV1 START | DV1 STOP | . . . . . | DV(F#) START | DV(F#) END ||
+    ||~~~~~~~~~~~|~~~~~~~~~~|           |~~~~~~~~~~~~~~|~~~~~~~~~~~~||
+    |================================================================|
+
+DocValues is chunked Snappy-compressed values for each document and field.
+
+    [~~~~~~~~~~~~~~~|~~~~~~|~~~~~~~~~|-...-|~~~~~~|~~~~~~~~~|--------------------...-]
+    [ Doc# in Chunk | Doc1 | Offset1 | ... | DocN | OffsetN | SNAPPY COMPRESSED DATA ]
+    [~~~~~~~~~~~~~~~|~~~~~~|~~~~~~~~~|-...-|~~~~~~|~~~~~~~~~|--------------------...-]
+
+Last 16 bytes are description of chunks.
+
+    |~~~~~~~~~~~~...~|----------------|----------------|
+    |   Chunk Sizes  | Chunk Size Arr |         Chunk# |
+    |~~~~~~~~~~~~...~|----------------|----------------|
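
Reader's note on the DocValues layout: the sketch below shows one way the DocValues Index pairs and the 16-byte chunk description could be decoded in Go. It rests on several assumptions that the format text does not confirm: the two trailing fixed-size fields are big-endian `uint64` values, `Chunk Size Arr` is the byte length of the varint array labelled `Chunk Sizes`, and that array holds `Chunk#` varints. The names `Range`, `ReadDocValueRanges`, `ReadChunkTrailer`, `putUvarint`, and `sampleDocValues` are hypothetical.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// Range holds the start and end point of one field's DocValues slice.
type Range struct{ Start, End uint64 }

// ReadDocValueRanges decodes numFields (F#) pairs of varints from the
// DocValues Index bytes.
func ReadDocValueRanges(index []byte, numFields int) ([]Range, error) {
	ranges := make([]Range, 0, numFields)
	for i := 0; i < numFields; i++ {
		start, n := binary.Uvarint(index)
		if n <= 0 {
			return nil, errors.New("zap: bad start varint in DocValues Index")
		}
		end, m := binary.Uvarint(index[n:])
		if m <= 0 {
			return nil, errors.New("zap: bad end varint in DocValues Index")
		}
		ranges = append(ranges, Range{Start: start, End: end})
		index = index[n+m:]
	}
	return ranges, nil
}

// ReadChunkTrailer decodes the chunk description at the end of one DocValues
// slice. Assumption: both trailing uint64 values are big-endian.
func ReadChunkTrailer(dv []byte) ([]uint64, error) {
	if len(dv) < 16 {
		return nil, errors.New("zap: DocValues slice shorter than 16 bytes")
	}
	tail := len(dv) - 16
	arrLen := binary.BigEndian.Uint64(dv[tail : tail+8]) // Chunk Size Arr
	numChunks := binary.BigEndian.Uint64(dv[tail+8:])    // Chunk#
	if arrLen > uint64(tail) {
		return nil, errors.New("zap: chunk size array exceeds the slice")
	}
	arr := dv[tail-int(arrLen) : tail] // Chunk Sizes (varints)
	var sizes []uint64
	for i := uint64(0); i < numChunks; i++ {
		v, n := binary.Uvarint(arr)
		if n <= 0 {
			return nil, errors.New("zap: bad varint in chunk size array")
		}
		sizes = append(sizes, v)
		arr = arr[n:]
	}
	return sizes, nil
}

// putUvarint appends v to b as a varint; used only to build sample input.
func putUvarint(b []byte, v uint64) []byte {
	var tmp [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(tmp[:], v)
	return append(b, tmp[:n]...)
}

// sampleDocValues builds a synthetic DocValues slice with two chunk entries.
func sampleDocValues() []byte {
	dv := []byte("chunk-bytes") // stand-in for the chunked data
	arr := putUvarint(putUvarint(nil, 5), 300)
	dv = append(dv, arr...)
	var tail [16]byte
	binary.BigEndian.PutUint64(tail[0:8], uint64(len(arr)))
	binary.BigEndian.PutUint64(tail[8:16], 2)
	return append(dv, tail[:]...)
}

func main() {
	sizes, err := ReadChunkTrailer(sampleDocValues())
	if err != nil {
		panic(err)
	}
	fmt.Println(sizes) // prints "[5 300]"
}
```

The sketch only recovers the raw varint entries. Decompressing a chunk with Snappy and walking its `Doc`/`Offset` header is left out.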