**Modern LZ Compression Part 2: FSE and Arithmetic Coding** Gary Linscott - glinscott@gmail.com - https://glinscott.github.io This article is Part 2 of a series on Modern LZ Compression. In [Part 1](https://glinscott.github.io/lz/index.html), we built a fully functional LZ + Huffman compressor. In this part, we will start from that baseline and move the design closer to state of the art. You can find the source for the article at https://github.com/glinscott/linzip2. Expanding the Benchmark Set =========================== In Part 1, we primarily used the [enwik8](http://mattmahoney.net/dc/textdata.html) dataset for benchmarking. While enwik8 is great, it's important to evaluate our compressor on a wider range of data types. For this article, we'll add the [Silesia compression corpus](http://mattmahoney.net/dc/silesia.html) to our benchmark set. !!! Note Silesia Compression Corpus The Silesia compression corpus is a collection of text files, executable files, source code, images, and more. This variety makes it a good test for evaluating the general-purpose performance of compression algorithms. Overview ======== We will make four changes to the Part 1 compressor: | Stage | enwik8 | Silesia | What changed | | ----- | ------ | ------- | ------------ | | Huffman baseline | 34,566,563 (34.57%) | 65,599,514 (30.95%) | Huffman-coded literals and sequences | | + FSE sequence streams | 34,349,694 (34.35%) | 65,226,148 (30.77%) | Better entropy coding for sequence streams | | + repeat offsets | 34,316,448 (34.32%) | 64,388,765 (30.38%) | Cheaper symbols for repeated match distances | | + stateful parser | 34,314,243 (34.31%) | 64,003,456 (30.19%) | Parser understands repeat offsets and literal runs | | + literal table cleanup | 34,263,694 (34.26%) | 63,911,530 (30.15%) | Delta-coded symbols and RLE for literal Huffman tables | Arithmetic Coding ================= Before diving into FSE, it helps to look at the line of ideas that leads to it. 
Arithmetic coding, ANS, and FSE are all trying to do the same job: representing symbols in proportion to their probabilities, without the rigid whole-bit code-length constraints of Huffman. A quick example illustrates this well. If a symbol shows up 90% of the time, the ideal cost is much less than one bit, but a Huffman code still has to spend at least one bit to represent it. Arithmetic coding can use `-log2(0.9) ≈ 0.15` bits on average.

Arithmetic coding does this by maintaining an interval. We start with `[0, 1)`, split it according to the symbol probabilities, and keep the subrange for the symbol we are encoding. We then split that subrange by the symbol probabilities again, and so on until the whole message is consumed. In the ideal mathematical picture, that interval can shrink indefinitely. The simplest possible encoder does not emit bits after every symbol - it just needs to write out any number in the final interval. Note that this requires infinite precision; we'll discuss practical implementations next.

!!! Sidenote History of Arithmetic Compression
    Arithmetic coding has its roots in the work of Claude Shannon, the father of
    information theory. The first practical implementations were developed in the
    1970s and 1980s by researchers such as Jorma Rissanen, Glen Langdon, and Ian
    Witten. However, arithmetic coding was initially hampered by patents, which
    limited its widespread adoption. It wasn't until the patents expired in the
    late 1990s and early 2000s that arithmetic coding became more widely used.

As a toy example, suppose the symbol probabilities are `A = 0.45`, `B = 0.35`, and `C = 0.20`, and we want to encode `BAC`.

~~~diagram
.-------------------------------.
| Arithmetic Coding: encode BAC | '-------------------------------' Initial split on [0, 1): 0.00 0.45 0.80 1.00 |-----------------------|--------------|-------------| A B C | | keep B v Split [0.45, 0.80): 0.45 0.6075 0.7300 0.80 |-----------------------|--------------|-------------| A B C | | keep A v Split [0.45, 0.6075): 0.45 0.520875 0.576 0.6075 |-----------------------|--------------|-------------| A B C | | keep C v write 0.59 (any value in the range works) ~~~ The decoder gets the same probability table, the final number, and the number of symbols to decode. It starts again from `[0, 1)`, asks which symbol subrange contains `0.59`, emits that symbol, restricts the interval to that subrange, and repeats. First, `0.59` lies in `B`'s subrange of `[0, 1)`, so we decode `B`. Inside `[0.45, 0.80)`, it lies in `A`'s subrange, so we decode `A`. Finally, inside `[0.45, 0.6075)`, it lies in `C`'s subrange, so we decode `C`. Note that the final number carries the message itself. A real stream still needs a way to know when to stop, either from a symbol count or an end-of-stream symbol. !!! SideBar: Range Coding Range coding is a variant of arithmetic coding that uses integer ranges instead of floating-point ranges. This simplifies the implementation and improves performance. In practice, most "arithmetic coders" in real compressors are range coders of one form or another. !!! SideBar: Reciprocal Arithmetic Coding [Charles Bloom](http://cbloomrants.blogspot.com/) developed a division-free arithmetic coder called `recip_arith`. The key idea is to replace expensive divisions with multiplications and bit shifts. That keeps the coder in the arithmetic-coding family, but makes it much more practical on modern hardware. The source is [here](https://github.com/thecbloom/recip_arith/blob/master/recip_arith.h), with accompanying [test code](https://github.com/thecbloom/recip_arith/blob/master/test_recip_arith.cpp). 
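The interval-splitting procedure is small enough to sketch in code. Below is a toy floating-point version of the `BAC` example, for illustration only: `double` precision only survives short messages, the symbol table is hard-coded, and a real coder would use integer ranges and renormalize, as in the range-coding sidebar.

~~~ c++
#include <cassert>
#include <string>
#include <vector>

// Toy arithmetic coder for the fixed model A = 0.45, B = 0.35, C = 0.20.
// Floating point stands in for the infinite-precision interval.
struct Sym { char name; double lo, hi; };
static const std::vector<Sym> kModel = {
    {'A', 0.00, 0.45}, {'B', 0.45, 0.80}, {'C', 0.80, 1.00}};

double arithEncode(const std::string& msg) {
  double lo = 0.0, hi = 1.0;
  for (char c : msg)
    for (const Sym& s : kModel)
      if (s.name == c) {
        double width = hi - lo;   // current interval width
        hi = lo + width * s.hi;   // keep only the symbol's subrange
        lo = lo + width * s.lo;
        break;
      }
  return (lo + hi) / 2;  // any value in [lo, hi) identifies the message
}

std::string arithDecode(double x, int num_symbols) {
  std::string out;
  double lo = 0.0, hi = 1.0;
  for (int i = 0; i < num_symbols; ++i) {
    double width = hi - lo;
    for (const Sym& s : kModel)
      if (x < lo + width * s.hi) {  // which subrange contains x?
        out += s.name;
        hi = lo + width * s.hi;
        lo = lo + width * s.lo;
        break;
      }
  }
  return out;
}
~~~

Encoding `BAC` produces a value inside `[0.576, 0.6075)`, exactly the final interval from the diagram, and decoding three symbols from that value reproduces `BAC`.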
Asymmetric Numeral Systems (ANS) =============================== In the ideal arithmetic-coding picture, the interval can become arbitrarily small, so a real implementation has to keep renormalizing and carefully managing precision. This adds a lot of complexity and bookkeeping. ANS, developed by [Jarek Duda](https://en.wikipedia.org/wiki/Jarek_Duda) in 2009, reframes the same bookkeeping in a much more machine-friendly way. Instead of tracking an interval, ANS tracks a single integer state. The magic is that it still provides near-optimal compression. First we quantize the symbol probabilities into integer counts. ~~~ none 0 1 2 3 | 4 5 | 6 7 A | B | C ~~~ To encode one symbol, we split the current state into two pieces. `low_part` chooses which slot inside the symbol's range we land in, while `high_part` carries the rest of the old state forward. Here `sym_freq` is the number of states owned by the symbol, and `sym_base` is where that symbol's range starts. For `A`, `sym_freq = 4` and `sym_base = 0`; for `B`, `2` and `4`; for `C`, `2` and `6`. ~~~ none low_part = state % sym_freq high_part = state / sym_freq new_state = high_part * 8 + sym_base + low_part ~~~ To avoid requiring infinite precision, we keep the state inside a fixed working range. Before encoding a symbol, we spill low bits until the state is small enough for that symbol's transition. Because common symbols own larger ranges, they can accept larger states, spill less often, and therefore cost fewer bits on average. For this toy example, we'll use a simple renormalization rule: before encoding a symbol with frequency `sym_freq`, require `state < 8 * sym_freq`. If the next symbol is `B`, then `sym_freq = 2`, so we need `state < 16`. Suppose `state = 29`. That is too large, so we spill one low bit: ~~~ none write 29 & 1 = 1 state = 29 >> 1 = 14 ~~~ Now the input state fits. 
We can encode `B` using `sym_base = 4`: ~~~ none low_part = 14 % 2 = 0 high_part = 14 / 2 = 7 new_state = 7 * 8 + 4 + 0 = 60 ~~~ The encoded state is larger again, which is fine: the next symbol will renormalize before it runs its own transition. So `B` forced us to spill first because it only owns 2 states. If the symbol were `A`, we would allow states up to `8 * 4 = 32` before renormalizing. Finite State Entropy (FSE) ========================= FSE, developed by [Yann Collet](https://fastcompression.blogspot.com/), is a high-performance implementation of ANS. It's the entropy coder used in [zstd](https://github.com/facebook/zstd). FSE takes the ANS state transition and bakes most of the work into tables. That is the trade we want for LZ sequence streams. The alphabets are small, the probabilities are fixed for a block, and the decoder will run the same compact loop over and over. In FSE, decode becomes a fast table lookup. The table tells us which symbol this state emits, how many bits we need to read, and how to reconstruct the next state. !!! Note FSE and Instruction-Level Parallelism A single FSE state machine has a tight dependency chain: each decoded symbol produces the next state. Real implementations often keep two or four independent FSE states and interleave their bitstreams in the same loop. This gives a superscalar CPU independent work to issue each cycle. In practice, this is a 2-4x speedup. ### FSE Tables The implementation uses two compact tables: ~~~ c++ struct EncodedSymbol { int delta_bits; int delta_state; }; struct DecodeEntry { int state; int num_bits; int symbol; }; EncodedSymbol encoded_[256]; DecodeEntry decode_[4096]; ~~~ `encoded_[symbol]` tells the encoder how many low bits to spill from the current state, and how to map the remaining subrange back to a new state. `decode_[state]` does the inverse: emit a symbol, read a few bits, then recover the next state. !!! SideBar: Why `decode_[4096]`? 
In the general implementation we often use `state_bits_ = 12`, so the decoder table has `1 << 12 = 4096` entries. That is large enough to model probabilities reasonably well, but still small enough to sit comfortably in cache. In the LZ sequence coder below we actually use 10 state bits, since the alphabet is much smaller. Suppose the current state is `x = 23`, and the decoder table says: ~~~ none decode_[23] = { state = 40, num_bits = 3, symbol = B } ~~~ Then decoding one symbol means outputting `B`, reading 3 bits from the bitstream, say `101` = 5, and computing the next state as `40 + 5 = 45`. ### Step 1: Normalize frequencies The first job is to convert raw counts into a table that sums to exactly `state_len_ = 1 << state_bits_`. That gives us a fixed number of states to assign to symbols. This should look familiar from Part 1. There we normalized frequencies into code lengths; here we normalize them into state counts. ~~~ c++ int total = 0; for (int i = 0; i < max_symbols_; ++i) total += freq_[i]; int raw_freq[256]; memcpy(raw_freq, freq_, sizeof(raw_freq)); int new_total = 0; for (int symbol = 0; symbol < max_symbols_; ++symbol) { uint64_t c = raw_freq[symbol]; if (c == 0) continue; freq_[symbol] = std::max(1, int((c * state_len_ + (total / 2)) / total)); new_total += freq_[symbol]; } ~~~ There is one annoyance: rounding usually leaves us a little high or low. To rebalance, we repeatedly adjust the symbol whose coding cost changes least. ~~~ c++ int correction = state_len_ - new_total; while (correction != 0) { int best_symbol = ...; freq_[best_symbol] += correction > 0 ? 1 : -1; correction += correction > 0 ? -1 : 1; } ~~~ This is still a heuristic, but it keeps the normalized table closer to the real distribution without making the code much more complicated. At the end of normalization, every symbol that appeared at least once still has at least one state, the total number of states is exactly `state_len_`, and more likely symbols own more states. 
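As a self-contained sketch, the whole of Step 1 looks like the function below. One simplification relative to the description above: the correction loop here just bumps the largest count, rather than picking the symbol whose coding cost changes least.

~~~ c++
#include <algorithm>
#include <vector>

// Sketch of Step 1: scale raw counts so they sum to exactly state_len,
// while every used symbol keeps at least one state. The correction loop
// adjusts the largest count (simplified from the cost-based heuristic).
std::vector<int> normalizeFreqs(const std::vector<int>& raw, int state_len) {
  long long total = 0;
  for (int c : raw) total += c;
  std::vector<int> norm(raw.size(), 0);
  int new_total = 0;
  for (size_t i = 0; i < raw.size(); ++i) {
    if (raw[i] == 0) continue;
    long long scaled = (1LL * raw[i] * state_len + total / 2) / total;
    norm[i] = (int)std::max(1LL, scaled);  // used symbols keep >= 1 state
    new_total += norm[i];
  }
  int correction = state_len - new_total;
  while (correction != 0) {
    size_t best = 0;
    for (size_t i = 1; i < norm.size(); ++i)
      if (norm[i] > norm[best]) best = i;
    norm[best] += correction > 0 ? 1 : -1;
    correction += correction > 0 ? -1 : 1;
  }
  return norm;
}
~~~

With raw counts `{10, 5, 3, 2}` and 16 states this yields `{8, 4, 2, 2}`, matching the worked example that follows.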
~~~ none Normalization example (force counts to sum to 16 states) Symbol: A B C D Raw counts: 10 5 3 2 total = 20 Norm counts: 8 4 2 2 total = 16 Raw: A ########## B ##### C ### D ## Normalized: A ######## B #### C ## D ## ~~~ ### Step 2: Spread symbols across the state space After normalization, each symbol owns some number of states. Now we need to decide where those states live. Spreading the states across the table is key. Imagine if all of a symbol's states point back into one small region. Then repeated symbols get stuck in that region, and the state machine spends bits fighting the table layout instead of representing the input distribution. !!! Note FSE Table Spreading This is one of the trickier parts of FSE. Yann Collet's [FSE: distributing symbol values](http://fastcompression.blogspot.com/2014/02/fse-distributing-symbol-values.html) and [FSE: a closer look](http://fastcompression.blogspot.com/2014/02/fse-closer-look.html) go deeper into why the table spread matters. The short version here is: normalized counts decide how many states each symbol owns; the spread decides where those states live. The implementation uses the standard FSE-style spread. Walk through the state table with a fixed odd step, and drop each symbol into the next slot it owns. ~~~ c++ int mask = state_len - 1; int step = (state_len >> 1) + (state_len >> 3) + 3; int pos = 0; for (int symbol = 0; symbol < num_symbols; ++symbol) { for (int i = 0; i < freq[symbol]; ++i) { sorted_symbols[pos] = symbol; pos = (pos + step) & mask; } } ~~~ After this step, `sorted_symbols_[state]` is the decoder view. For each state, it tells us which symbol that state emits. The encoder wants the inverse view. It starts with a symbol and a rank inside that symbol's state range, and needs to find the actual FSE state to jump to. That is what `packed_table_` stores, a mapping from `(symbol, rank-within-symbol) -> state`. 
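Before building that inverse view, it helps to watch the spread loop run on a tiny table. The sketch below assumes 16 states with counts `A = 8`, `B = 4`, `C = 4` - the same setup as the state-spreading example shown shortly.

~~~ c++
#include <string>

// Demo of the symbol spread: 16 states, counts A = 8, B = 4, C = 4.
// step = (16 >> 1) + (16 >> 3) + 3 = 13; an odd step relative to a
// power-of-two table visits every slot exactly once before wrapping.
std::string spreadDemo() {
  const int state_len = 16;
  const int freq[3] = {8, 4, 4};  // A, B, C
  char sorted_symbols[16];
  int mask = state_len - 1;
  int step = (state_len >> 1) + (state_len >> 3) + 3;
  int pos = 0;
  for (int symbol = 0; symbol < 3; ++symbol)
    for (int i = 0; i < freq[symbol]; ++i) {
      sorted_symbols[pos] = "ABC"[symbol];
      pos = (pos + step) & mask;
    }
  return std::string(sorted_symbols, state_len);
}
~~~

The result is `AABCABCABCAACAAB`: each symbol's states end up scattered across the table instead of bunched into one region.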
With the spread order in hand, we can fill that packed transition table: ~~~ c++ for (int i = 0; i < state_len_; ++i) { int symbol = sorted_symbols_[i]; int from_state = next_state[symbol]; ++next_state[symbol]; int to_state = state_len_ + i; packed_table_[cum_prob[symbol] + from_state - freq_[symbol]] = to_state; } ~~~ Each entry in `packed_table_` points back to one of the spread decoder states. So the decoder gets a well-spread table, while the encoder gets compact per-symbol ranges. ~~~ none Tiny state-spreading example (16 states) Counts: A -> 8 states B -> 4 states C -> 4 states With the 16-state step, `step = 13`, the decoder table is: index: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 symbol: A A B C A B C A B C A A C A A B enc st: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 The packed table groups those spread states by symbol: A entries: [16, 17, 20, 23, 26, 27, 29, 30] B entries: [18, 21, 24, 31] C entries: [19, 22, 25, 28] The encoder can index into `A`'s compact range, but the value it gets sends the state machine back to one of `A`'s spread decoder states. ~~~ !!! Note Encoder States Why not store encoder states as `0..15` directly? In ANS, the state has to stay in a normalized range with enough high bits to carry information forward. If the state gets too small, there is nowhere for the next symbol to store its high bits, and the encode/decode mapping stops being reversible. ### Step 3: Build the per-symbol encoder parameters The encoder needs a fast way to decide how many bits to flush for a symbol and which section of `packed_table_` to use. That is what `encoded_[symbol]` contains. 
~~~ c++
int total = 0;
for (int symbol = 0; symbol < max_symbols_; ++symbol) {
  if (freq_[symbol] == 0) continue;
  uint32_t max_bits_out = state_bits_ - log2(freq_[symbol] - 1);
  uint32_t min_state_plus = freq_[symbol] << max_bits_out;
  encoded_[symbol].delta_bits = (max_bits_out << 16) - min_state_plus;
  encoded_[symbol].delta_state = total - freq_[symbol];
  total += freq_[symbol];
}
~~~

This code decides how many low bits must be written to keep the state in range, and computes the base offset into that symbol's subrange of `packed_table_`. The `delta_bits` formula is a branchless way of computing whether this particular state spills `k` or `k - 1` bits.

!!! Note FSE Subrange Coding
    The `delta_bits` and `delta_state` formulas are the compact version of FSE's
    subrange coding step. Yann Collet's [FSE Tricks: memory efficient subrange coding](http://fastcompression.blogspot.com/2014/02/fse-tricks-memory-efficient-subrange.html)
    walks through the derivation in more detail.

### Step 4: Build the decoder table

The decoder wants the inverse mapping laid out directly by state:

~~~ c++
for (int i = 0; i < state_len_; ++i) {
  int symbol = sorted_symbols[i];
  decode_[i].symbol = symbol;
  int from_state = next_state[symbol];
  ++next_state[symbol];
  decode_[i].num_bits = state_bits_ - log2(from_state);
  decode_[i].state = (from_state << decode_[i].num_bits) - state_len_;
}
~~~

For each state we precompute which symbol to output, how many bits to pull from the bitstream, and the base used to reconstruct the next state. This makes FSE decoding incredibly fast.

### Encoding one symbol

FSE encodes in reverse order. The decoder must start from the state the encoder finished in, and each decode step undoes the most recent encode step; encoding the symbols from last to first therefore lets the decoder emit them from first to last.
~~~ c++ int num_bits = (state_ + encoded_[symbol].delta_bits) >> 16; writer_.writeBits(state_, num_bits); int subrange_id = state_ >> num_bits; state_ = packed_table_[subrange_id + encoded_[symbol].delta_state]; ~~~ Each symbol spills a few low bits from `state_`, uses the remaining high bits to identify a subrange, then jumps to the next state for that symbol. After all symbols are done, we write the final state itself. ~~~ none Example encode step current state = 93 state bits 1011101 ^^^ spill these low bits to the bitstream remaining high bits -> subrange id subrange id + delta_state -> index into packed_table_ packed_table_[...] -> next state ~~~ ### Decoding one symbol Decoding runs forward, and inverts the encoder: ~~~ c++ int symbol = decode_[state_].symbol; int len = decode_[state_].num_bits; int n = br_.readBits(len); state_ = decode_[state_].state + n; return symbol; ~~~ Given the current state, the table tells us the symbol and how many bits are needed to recover the next state. ~~~ none Encode / decode symmetry Encode (backward) Decode (forward) ----------------- ---------------- state x state x | | | spill low bits | table lookup v v write bits emit symbol | | | use high bits as subrange id | read num_bits v v jump to next state reconstruct next state ~~~ Integrating FSE into our LZ Compressor ===================================== Zstd uses Huffman for literals, and FSE for the sequence streams of literal lengths, match lengths, and offset codes. Literal bytes have a 256-symbol alphabet, and Huffman is simple, fast, and good enough for that stream. Sequence streams (literal lengths, match lengths, and offset codes) have small alphabets with very skewed distributions. Exactly the case where fractional-bit coding helps, and the FSE tables stay small. So that is the direction we will take here too: Huffman for the literal byte stream, FSE for the sequence symbol streams. 
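The sequence streams stay small-alphabet because each value is first bucketed into a code plus raw extra bits. As a hedged sketch of what a `lengthCode`/`lengthExtra` pair might look like - an assumed scheme for illustration, not the article's actual tables - lengths below 16 get their own symbol, and larger values become a bucket symbol plus `floor(log2(v))` extra bits:

~~~ c++
#include <cassert>

// Hypothetical length bucketing in the spirit of the sequence coder:
// values below 16 are their own symbol; larger values become a bucket
// symbol plus floor(log2(v)) raw extra bits. The real tables may differ.
int log2floor(unsigned v) {
  int n = 0;
  while (v >>= 1) ++n;
  return n;
}
int lengthCode(int v)  { return v < 16 ? v : 16 + log2floor(v); }
int lengthExtra(int v) { return v - (1 << log2floor(v)); }
int lengthNumExtraBits(int code) { return code < 16 ? 0 : code - 16; }
int lengthDecode(int code, int extra) {
  return code < 16 ? code : (1 << (code - 16)) + extra;
}
~~~

For example, a match length of 100 becomes code 22 plus 6 extra bits carrying the remainder 36. The entropy coder only ever sees the small code alphabet; the extra bits go out raw.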
### Encoding sequence values The main loop scans the symbol classes, then encodes them in reverse order, writing any extra bits beside the FSE code: ~~~ c++ FSEEncoder encoder(end, 32, 10); for (int i = 0; i < num_seq_; ++i) encoder.scan(is_offset ? offsetCode(values[i]) : lengthCode(values[i])); encoder.buildTable(); for (int64_t i = num_seq_ - 1; i >= 0; --i) { int v = values[i]; if (v >= (is_offset ? 2 : 16)) encoder.writer().writeBits(is_offset ? offsetExtra(v) : lengthExtra(v), log2(v)); encoder.encode(is_offset ? offsetCode(v) : lengthCode(v)); } ~~~ The only subtle point is the order. The extra bits must be written next to the symbol they belong to, so they also go out in reverse. ### FSE-only results Compared to the clean Part 1 Huffman baseline, this change gets enwik8 from 34,566,563 (34.57%) bytes down to 34,349,694 (34.35%), and Silesia from 65,599,514 (30.95%) down to 65,226,148 (30.77%). It is a solid win on both corpora, but there is still more work to do. Repeat Offsets ============== Many matches reuse one of the last few offsets. Instead of sending the full distance again, we can send a tiny symbol saying "use recent offset 1", "use recent offset 2", or "use recent offset 3". We keep a small move-to-front list of recent offsets: ~~~ c++ struct RepOffsets { int rep[3] = {1, 4, 8}; }; ~~~ Then, when encoding offsets, we first try to map them to one of these recent values. When an older repeat is used, it moves to the front. When a new offset is used, it becomes the new most recent offset. ~~~ c++ if (offset == 0) return 0; if (offset == reps.rep[0]) return 1; if (offset == reps.rep[1]) { std::swap(reps.rep[0], reps.rep[1]); return 2; } if (offset == reps.rep[2]) { int dist = reps.rep[2]; reps.rep[2] = reps.rep[1]; reps.rep[1] = reps.rep[0]; reps.rep[0] = dist; return 3; } reps.rep[2] = reps.rep[1]; reps.rep[1] = reps.rep[0]; reps.rep[0] = offset; return 3 + offsetCode(offset); ~~~ Symbol 0 is the final literal-only sequence. 
The next three values are repeat-offset symbols. Anything larger is a normal "absolute" offset code plus extra bits. The offset stream is extremely skewed in real data. Repeated nearby offsets show up constantly, especially in structured data. Once those common cases collapse down to tiny symbols, FSE has a much easier job. !!! Note Why Repeat Offsets Work LZ matches are not distributed uniformly. Once a compressor finds a useful offset, the next few matches often reuse that same distance. Zstd exploits this heavily. ### Decoding repeat offsets Decoding mirrors the encoder: first decode the offset symbol, then either look it up in the recent-offset list or reconstruct the full distance from its extra bits. ~~~ c++ int symbol = decoder.decodeOne(); int value = decodeOffsetSymbol(symbol, reps); if (symbol >= 4) { int extra_bits = symbol - 4; if (extra_bits > 0) value |= decoder.br().readBits(extra_bits); reps.rep[0] = value; } ~~~ By the time we finish decoding sequences, we are back to a plain match offset and match length: ~~~ c++ uint8_t* match_base = out - match_offsets[i]; for (int j = 0; j < match_lengths[i]; ++j) { *out++ = match_base[j]; } ~~~ ### Repeat-offset results Enwik8 goes from 34,349,694 (34.35%) bytes to 34,316,448 (34.32%). Silesia goes from 65,226,148 (30.77%) to 64,388,765 (30.38%). The gain is modest on enwik8, but large on Silesia. ### Improving the Parser to Understand Repeats Once repeat offsets are in place, the parser needs to understand them too. Otherwise it will still choose matches as if all offsets cost about the same, and it will miss many of the cheap "reuse the last offset" opportunities. The parser now has to solve a different problem. At each byte position we can emit either another literal or a match, but the cost of a match now depends on more than just its length and distance. In particular, it depends on the current repeat-offset history and on how many literals are still pending before the next match. 
The parser has to reason about **state**, not just local match quality. The brute-force version would keep multiple parser states per position. The simpler version used here keeps only **one winning state per position**, including the current repeat-offset history and the number of pending literals since the last match.

This leads naturally to dynamic programming. For each position we store the best known way to arrive there, together with the parser state that future costs depend on:

~~~ c++
std::vector<int> cost(length + 1, 999999999);
std::vector<int> pending_literals(length + 1, 0);
std::vector<RepOffsets> reps(length + 1);
std::vector<ParseStep> steps(length + 1);  // ParseStep: the winning edge, for backtracking
cost[0] = 0;
~~~

At each position we consider two kinds of edges: either emitting one more literal byte, or closing the current literal run and emitting a match. The literal edge is straightforward:

~~~ c++
int lit_cost = cost[i] + literalBytePrice(buffer[p + i]);
~~~

The match edge has to price three things: the pending literal run it closes, the offset, and the match length. Once we've chosen a match, we also update the parser state that future prices depend on.

~~~ c++
int match_cost = cost[i] + literalLengthPrice(pending_literals[i]) +
                 offsetPrice(match_dist[j], next_reps) +
                 matchLengthPrice(match_len[j]);
~~~

We keep only the single winning state per position, but that state has enough information to price repeat offsets and literal-run boundaries.

Even with dynamic programming this is a heuristic parser. Two different histories can arrive at the same byte position with slightly different future prospects, and this parser keeps only the current winner. We give up some optimality to keep the parse fast and simple.

### Final repeat-offset results

Using that parser on top of the repeat-offset format gets enwik8 from 34,316,448 (34.32%) bytes down to 34,314,243 (34.31%), and Silesia from 64,388,765 (30.38%) down to 64,003,456 (30.19%).

!!! Note Tuning hyperparameters
    A compressor has many different hyperparameters to tune.
    For example, `max_matches` - which limits how many hash-chain candidates
    the parser examines at each byte - has a huge impact. The final numbers in
    this article use `max_matches = 16`.

    * `enwik8`: `34,263,694` (34.26%) with `16`, versus `35,481,328` (35.48%) with `8`
    * `Silesia`: `63,911,530` (30.15%) with `16`, versus `65,649,784` (30.97%) with `8`

Literal Table Cleanup
=====================

One remaining source of overhead is the literal Huffman table. We are still writing one `(symbol, code length)` pair per used symbol in every block. Two small fixes help.

First, symbols are sorted, so symbol ids can be delta-coded. We can use a simple trick of storing small values in fewer bits, escaping out to bigger values.

~~~ c++
void writeSmallValue(BitWriter& writer, int v) {
  if (v < 16) {
    writer.writeBit(0);
    writer.writeBits(v, 4);
  } else if (v < 48) {
    writer.writeBits(0b10, 2);
    writer.writeBits(v - 16, 5);
  } else {
    writer.writeBits(0b11, 2);
    writer.writeBits(v, 8);
  }
}
~~~

Many adjacent symbols share the same code length, so code lengths can also be run-length encoded. Again, a minimal implementation.

~~~ c++
writer_.writeBits(codelen - 1, 4);
writer_.writeBits(run - 1, 4);
for (int j = 0; j < run; ++j) {
  int delta = nodes_[i + j].symbol - prev_symbol - 1;
  writeSmallValue(writer_, delta);
  prev_symbol = nodes_[i + j].symbol;
}
~~~

### Literal-table results

This trims enwik8 from 34,314,243 (34.31%) bytes to 34,263,694 (34.26%), and Silesia from 64,003,456 (30.19%) to 63,911,530 (30.15%).

Benchmark Notes
===============

Here is what zstd looks like on the same inputs.

| Compressor | enwik8 | Silesia |
| ---------- | ------ | ------- |
| Ours | 34,263,694 (34.26%) | 63,911,530 (30.15%) |
| zstd 1.5.7 `-3` reference | 35,445,335 (35.45%) | 66,217,545 (31.24%) |
| zstd 1.5.7 `-9` reference | 31,129,515 (31.13%) | 59,190,849 (27.92%) |

However, encoding is much slower. Nearly all the time is in our "optimal" parse. Production parsers use many more tricks to keep this search fast.
| Compressor | enwik8 | Silesia |
| ---------- | ------ | ------- |
| Ours | 8.1 MB/s | 4.5 MB/s |
| zstd 1.5.7 `-3` | 62.7 MB/s | 62.6 MB/s |

Conclusion
==========

In this article, we explored arithmetic coding, ANS, and FSE. We then pushed the compressor a little closer to modern practice: Huffman literals, FSE-coded sequence streams, repeat offsets, optimized table storage, and finally a parser that is aware of that sequence structure.

A key takeaway is that many compression gains come from searching for a better symbol stream structure. Once the entropy coder is good enough, the next job is finding the right way to break up the data so that stream compresses well.

The [code](https://github.com/glinscott/linzip2/blob/master/main.cc) implements the ideas covered in this article.

Huge thanks to folks like Jarek Duda, Charles Bloom, Yann Collet and Fabien Giesen for posting their work online. Without their contributions, the field of compression would not be nearly as rich as it is. Special thanks to Mark Nelson for his article on [arithmetic coding](https://marknelson.us/posts/2014/10/19/data-compression-with-arithmetic-coding.html). Thanks also to Morgan McGuire for the wonderful [Markdeep](https://casual-effects.com/markdeep/features.md.html#basicformatting/lists) library.

References
==========

1. Jarek Duda, [Asymmetric numeral systems: entropy coding combining speed of Huffman coding with compression rate of arithmetic coding](https://arxiv.org/pdf/1311.2540.pdf)
2. Yann Collet, [Finite State Entropy: a new breed of entropy coder](http://fastcompression.blogspot.com/2013/12/finite-state-entropy-new-breed-of.html)
3. Yann Collet, [FSE: a closer look](http://fastcompression.blogspot.com/2014/02/fse-closer-look.html)
4. Yann Collet, [FSE: distributing symbol values](http://fastcompression.blogspot.com/2014/02/fse-distributing-symbol-values.html)
5. Charles Bloom, [Understanding ANS (8)](http://cbloomrants.blogspot.com/2014/02/02-06-14-understanding-ans-8.html)
6.
Charles Bloom, [Understanding ANS (12)](http://cbloomrants.blogspot.com/2014/02/02-18-14-understanding-ans-12.html) 7. Yann Collet, [FSE Tricks: memory efficient subrange coding](http://fastcompression.blogspot.com/2014/02/fse-tricks-memory-efficient-subrange.html) 8. Charles Bloom, [Understanding ANS (10)](http://cbloomrants.blogspot.com/2014/02/02-11-14-understanding-ans-10.html) 9. Yann Collet, [Realtime Data Compression](http://fastcompression.blogspot.com) 10. Fabien Giesen, [Interleaved Entropy Coders](https://fgiesen.wordpress.com/2014/02/02/rans-notes/) 11. Yann Collet, [Smaller and faster data compression with Zstandard](https://code.fb.com/core-data/smaller-and-faster-data-compression-with-zstandard/) 12. Charles Bloom, [About Arithmetic Coders and recip\_arith](http://cbloomrants.blogspot.com/2018/10/about-arithmetic-coders-and-reciparith.html) 13. Charles Bloom, [Multi-symbol division-free arithmetic coding using reciprocals](http://cbloomrants.blogspot.com/2018/10/10-16-18-multi-symbol-division-free.html) 14. "Range ANS (rANS) faster direct replacement for range coding" [https://encode.ru/threads/1870-Range-ANS-(rANS)-faster-direct-replacement-for-range-coding](https://encode.ru/threads/1870-Range-ANS-(rANS)-faster-direct-replacement-for-range-coding) 15. Mark Nelson, [Data Compression with Arithmetic Coding](https://marknelson.us/posts/2014/10/19/data-compression-with-arithmetic-coding.html) 16. Yann Collet, [Better normalization for ANS-based coders](http://fastcompression.blogspot.com/2014/03/better-normalization-for-better.html) 17. Charles Bloom, [Understanding ANS - normalization and probability](http://cbloomrants.blogspot.com/2014/01/1-30-14-understanding-ans-1.html)