Iterated Log Coding (ILC) offers a novel approach to data compression by representing integers as a series of logarithmic operations. Unlike traditional methods such as Huffman coding or arithmetic coding, ILC leverages the repeated application of the logarithm to achieve potentially superior compression for certain data distributions. It encodes an integer by counting how many times the logarithm base b must be applied before the result falls below a threshold. This iteration count becomes the core of the compressed representation, supplemented by a fractional value representing the remainder after the final logarithm application. Decoding reverses the process, repeatedly exponentiating starting from the fractional remainder. While the blog post acknowledges that ILC's practical usefulness requires further investigation, it highlights the theoretical potential and presents a basic implementation in Python.
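The count-then-exponentiate idea described above can be sketched in a few lines of Python. This is a minimal illustration under assumptions (base 2, threshold 1.0), not the author's implementation; the function names `ilc_encode` and `ilc_decode` are hypothetical:

```python
import math

def ilc_encode(n, base=2, threshold=1.0):
    """Count how many times log_base must be applied before the value
    drops below the threshold; return (iteration count, fractional
    remainder after the final log)."""
    x = float(n)
    count = 0
    while x >= threshold:
        x = math.log(x, base)
        count += 1
    return count, x

def ilc_decode(count, frac, base=2):
    """Reverse the process: exponentiate the fractional remainder
    `count` times."""
    x = frac
    for _ in range(count):
        x = base ** x
    return x
```

Note that the round trip is only exact up to floating-point error, which is amplified by each exponentiation; a practical codec would need to quantize the fractional part carefully.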
This blog post introduces a novel compression technique called "Iterated Log Coding," or "iterlog," which cleverly exploits the predictable nature of sorted integer sequences. The core idea revolves around representing each integer in a sorted sequence not by its absolute value, but by the difference between it and the preceding integer, a concept similar to delta encoding. However, iterlog takes this further by recursively applying this differencing process. Instead of just storing these differences directly, it stores the base-2 logarithm of each difference, rounded up to the nearest integer. This process of taking the log of the differences is iterated until all remaining values become zero.
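The forward pass described above (delta, then ceil of log2, iterated until a level is all zeros) can be sketched as follows. This is a hedged reading of the summary, not the author's code; `iterlog_levels` is a hypothetical helper, and note that storing only the rounded-up logarithms is lossy unless every delta is an exact power of two, so a complete codec would need additional residual information:

```python
import math

def iterlog_levels(seq):
    """Sketch of the iterated log-of-deltas pass: keep the first
    element and length as a header, then repeatedly replace the
    sequence with ceil(log2(delta)) of consecutive differences,
    stopping when a level contains only zeros."""
    header = (seq[0], len(seq))
    levels = []
    values = list(seq)
    while any(values):
        deltas = [b - a for a, b in zip(values, values[1:])]
        # ceil(log2(d)) for positive deltas; repeated values give 0
        values = [math.ceil(math.log2(d)) if d > 1 else 0 for d in deltas]
        levels.append(values)
    return header, levels
```

For example, `[10, 12, 15, 23]` has deltas `[2, 3, 8]`, whose rounded-up logs are `[1, 2, 3]`; one more round yields `[0, 0]`, terminating the process.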
The author argues that this approach is particularly well-suited for compressing sorted sequences containing clusters of similar values, a common characteristic of many real-world datasets. When integers are close together, their differences, and consequently the logarithms of these differences, will be small. Representing these small values requires fewer bits, leading to significant compression. The iterated nature of the algorithm further amplifies this effect. As the differences shrink through successive iterations, the logarithms also decrease, leading to progressively smaller representations.
The blog post details the encoding and decoding processes with illustrative examples. Encoding involves repeatedly calculating differences between consecutive numbers and then their logarithms, storing the rounded-up logarithm values at each level. A key point is that the first number in the sequence and the number of elements must be stored separately, since they are not part of the differencing process. The number of iterations is encoded implicitly: the process stops when a level consists entirely of zeros. Decoding reverses this by repeatedly exponentiating and summing the logged differences, reconstructing the original sequence.
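For the base case of a single level (plain delta coding, before any logarithm is applied), the "summing" half of the reconstruction is easy to see in isolation. A minimal sketch with hypothetical helper names, not taken from the post:

```python
def delta_encode(seq):
    """Store the first element followed by consecutive differences."""
    return [seq[0]] + [b - a for a, b in zip(seq, seq[1:])]

def delta_decode(deltas):
    """Reconstruct the original sequence by running prefix sums."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out
```

The full iterlog decoder would interleave this summing step with exponentiation at each level, working from the innermost (all-zero) level outward.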
The author acknowledges that iterlog coding is not a general-purpose compression algorithm and might not be suitable for all types of data. It's particularly effective for sorted sequences with clustered values, where the differences between successive elements are small. In situations where the differences are large and vary significantly, iterlog may not offer significant compression advantages or might even increase the data size compared to uncompressed representation. The post concludes with a Python implementation of the iterlog encoding and decoding algorithms, allowing readers to experiment with the technique and evaluate its performance on different datasets. The author invites further exploration and optimization of the idea, suggesting that varying the logarithm base or utilizing alternative rounding strategies could potentially improve compression ratios in specific scenarios.
Summary of Comments (23)
https://news.ycombinator.com/item?id=43181610
Hacker News users generally praised the clarity and novelty of the Iterated Log Coding approach. Several commenters appreciated the author's clear explanation of a complex topic and the potential benefits of the technique for compression, especially in specialized domains like bioinformatics. Some discussed its similarities to Huffman coding and Elias gamma coding, suggesting it falls within a family of variable-length codes optimized for certain data distributions. A few pointed out limitations or offered alternative implementations, including using a lookup table for smaller values of 'n' for performance improvements. The practicality of the method for general-purpose compression was questioned, with some suggesting it might be too niche, while others found it theoretically interesting and a valuable addition to existing compression methods.
The Hacker News thread on "Iterated Log Coding" has generated a moderate amount of discussion, with several commenters engaging with the core ideas of the blog post.
One of the most compelling threads revolves around the practicality and novelty of the "iterated log" approach. A user points out that the method is reminiscent of Elias gamma coding, a well-established variable-length coding scheme. This sparked further discussion comparing the two methods, with some suggesting that iterated log coding might offer advantages in certain scenarios, particularly when dealing with very large numbers, while others remain skeptical, highlighting the efficiency and existing implementations of Elias gamma coding.
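For context on the comparison, Elias gamma coding represents a positive integer n as N zeros followed by the (N+1)-bit binary form of n, where N = floor(log2 n). A short illustration (not from the post or the thread):

```python
def elias_gamma_encode(n):
    """Elias gamma code: N zeros, then the binary form of n,
    where N = floor(log2 n). Requires n >= 1."""
    assert n >= 1
    b = bin(n)[2:]                    # binary digits without '0b'
    return "0" * (len(b) - 1) + b

def elias_gamma_decode(bits):
    """Count leading zeros N, then read the next N+1 bits as binary."""
    n_zeros = len(bits) - len(bits.lstrip("0"))
    return int(bits[n_zeros:2 * n_zeros + 1], 2)
```

The code length grows as roughly 2*log2(n) + 1 bits, which is the baseline the iterated log approach would need to beat for very large numbers.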
Another commenter questions the choice of using base-10 logarithms in the examples, suggesting that base-2 logarithms would be more natural in a computational context. This comment prompts a brief discussion about the rationale behind the base choice, with the possibility that base-10 was selected for easier human readability in the illustrative examples.
Several commenters express interest in seeing benchmarks and comparisons against existing compression algorithms. They emphasize the importance of real-world performance data to evaluate the effectiveness of the proposed method. One user specifically asks about the performance on typical integer sequences found in practice.
There's also a short exchange regarding the potential applications of this compression method. One commenter suggests it could be useful for compressing indexes in databases.
Finally, a few commenters delve into the theoretical underpinnings of the algorithm, discussing its relationship to other coding schemes and exploring its potential limitations. One user mentions the connection to prefix codes and how the unique decodability property is ensured.
Overall, the comments section reveals a mixture of curiosity, skepticism, and cautious optimism towards the iterated log coding approach. While some see it as a potentially interesting idea with specific niche applications, others remain unconvinced of its practical value compared to established techniques. The prevailing sentiment appears to be a desire for more empirical evidence and comparisons to solidify the claims made in the original blog post.