The goal of this notebook is to introduce the reader to some probabilistic algorithms that are commonly employed in relational databases and used to analyse streams of data. The algorithms are: Approximate Counting, used to estimate the size of a multiset; Flajolet-Martin, LogLog Counting and HyperLogLog Counting, which are used to estimate the cardinality of a multiset (amount of unique elements); and Bloom Filters, which can be used for membership queries (some variants of it are also introduced). Besides its applications, these algorithms are fun to learn and, to some degree, intuitive.
In the 70s Morris had to count the occurences of some events, but he only had 8-bits to count each of them (255 being the maximum value). Due to this limitation many common events shared the same value of 255, even though some of them were much more frequent. He needed to discern them and a precise count was not necessary, it was enough to know only the order of the magnitude of the events occurence. This resulted in the development of what is called nowadays the Approximate Counting algorithm (Morris 1978).
Lets assume that our base is @@0@@, and thus we can count @@1@@, otherwise leave as it is.
To better understand lets illustrate it with an example. Let @@2@@ and @@3@@. 16 events should happen before we increment @@4@@ to 5, as @@5@@.
Lets run it 10 times on a scenario of 10,000 events (@@0@@) to see its accuracy.
This algorithm has one parameter, the base @@0@@. Flajolet analysis shows that for @@1@@ its peformance remains remarkably stable over many increments. However, as he notes, one has to keep in mind the application's domain in order to assess if errors at this magnitude are acceptible (Flajolet 1985).
One possible application for Approximate Counting is to keep track of the amount of times a block of flash memory has been written, as these blocks can only endure a limited amount of write operations. Approximate Counting not only is good for this task because it only takes @@2@@ bits, but its counter itself won't be rewritten every time. Thus, when the amount of space required to store a counter is worrisome, or when the action of writing the counter is expensive, this algorithm is a good candidate (Lumbroso).
The next algorithms/data structures make use of hashing functions, so before diving into them lets define the hashing functions that will be used. I note that some algorithms/data structures make assumptions about these hash functions, but none of them were tested. Futhermore, I constrain every hashing to have 24 bits.
Besides the hashing functions, lets also define the dataset. For the experiments I will use the following multiset of @@0@@ words:
Cardinality of a multiset is its number of unique elements. The naive and deterministic approach is to either sort it (@@0@@) and to then in one pass count the unique elements, or to trade-off space for speed by using a hashtable. Either way it is an expensive process, and probabilistic algorithms trade-off on accuracy to gain on speed and space.
One common use of algorithms to estimate cardinality are on relational databases, as there is a concern about query optimization (Lumbroso). Take for example the task of selecting the intersection of two multisets: 1. One approach would be to sort the multiset A (or B) and then for each member of the multiset B perform a search on A; 2. Another approach is to sort both multisets and perform a process similar to merge-sort; 3. Or maybe it would be better to remove the duplicates before executing 1 or 2.
One of the statistics used by relational databases to decide which path to take when doing query optimization is the cardinality (Flajolet and Martin 1985)
This section starts by explaining the Flajolet-Marting algorithm (Flajolet and Martin 1985), followed by its refinement called LogLog Counting (Durand and Flajolet 2003)
For this algorithm one needs to have a hash function that generates integers that are distributed in an sufficiently uniformly manner on the interval @@0@@ is a parameter set by the user that defines the amount of bits that will be used. We also need to define one auxiliary function that returns the first 1-bit of the binary representation of the hash:
Thus, if the hashes are uniformly distributed, we expected half of them to have the first bit set to 1, a quarter of them to have the first bit set to 0 and the second to 1, and @@0@@. The algorithm mainly consists of keeping track of these first-one bits patterns in a bitmap:
From the bitmap one can extract the cardinality by this observation: if @@0@@ is the amount of unique elements in @@1@@, then we expect @@2@@ to be 1 if @@3@@.
Besides the correction factor, the authors also demonstrated that one should expect the value to be off by one order of magnitude. In order to remedy this situation one could run multiple hashes/bitmaps and average the result.
From the previous algorithm it was noted that one possible opitmization would be to use multiple hash functions and keep multiple bitmaps. The LogLog Counting algorithm goes on a similar path as it keeps multiple buckets (they act similarly to bitmaps) but uses only one hash function, thus avoiding the additional cost of executing mutliple hash functions.
Lets define @@0@@ as the amount of buckets, with @@1@@. We can combine all of these estimations with the following equation:
Where @@3@@ is a correction factor that can be set as @@4@@ for large values of @@5@@.
As one can see there is a tremendous improvement on the accuracy. In fact, one can expect the accuracy to be in the order of @@0@@.
HyperLogLog improves the LogLog algorithm by using harmonic means to average the buckets. Thus the cardinality is estimated by the following equation:
Where @@1@@ for large values of @@2@@.
Interestingly enough LogLog is more accurate than HyperLogLog. It may be an implementation mistake or for this dataset/hash function LogLog outperforms HyperLogLog. In any case, note how none of the algorithms makes any assumption about the data, their insertion order or how frequent the elements are repeated.
Bloom Filters is a data structure that allows one to know if an element is not in the set or if it is probably in the set. Thus its use is appropriate for enviroments where the majority of queries should return negative.
Lets assume that we have a bitmap of @@0@@ addresses (@@1@@, all initially set to zero) and @@2@@ hash functions that have as output one of these @@3@@ addresses. In order to insert an element one has to compute the @@4@@ hash functions and mark @@5@@ at every @@6@@ address. Then to check if an element is not in the set, one computes the @@7@@ addresses and checks if at least one of them has the value of zero; if all of them have the value of 1, then probably the element is in the set.
From the way this data structure works it is important to note that using many hash functions is not necessearily good, as @@8@@ grows bigger so does the ratio of the elements set to 1 on the bitmap and thus leading to more false positives (stating that an element is in the set when in fact it is not).
On the experiment below I measure the amount of false positives holding fixed the amount of elements on the set (~400) and the number of hash functions (@@9@@), with the independent variable being the size of the bitmap, which directly influences the ratio of elements set to 1.
There are many variants and applications for Bloom Filters, for this work I will only introduce two of them. For a length overview of its many uses I recommend the article titled "Theory and Practice of Bloom Filters for Distributed Systems" (Tarkoma, Rothenberg, and Lagerspetz 2012).
By default Bloom Filters do not support removal of elements, as many elements may have at least one position in common, and thus marking 0 where one element is hashed to may invalidate many others. One can get around this problem by going from bits to counters, although overflow might be a problem. This variant is called Counting Bloom Filters (Fan et al. 2000).
Besides the ability of removing elements, one can use its counters to report the frequency of the elements. As a single position may be incremented by many elements, the sensible thing to do is to compute the minimum of the @@0@@ addresses (note that this implementation will never underestimate).
Being able to selectively remove one element is a good addition, but sometimes one needs to continuosly remove old/stale data. For example, imagine a case where there is a stream of data and one needs to check if the newly arrived element is unique or not. If there are infinitely many unique elements, then having false negatives is unavoidable, as it is impossible to keep all of them. Moreover, we can't keep too many of them as the bitmap would be filled with 1s and thus increase the false positive ratio. One possible solution to acheive a balance between false positivies and false negatives is to continuosly remove the old/stale data. This variant is called Stable Bloom Filters (SBF) (Deng and Rafiei 2006)and uses the counter to also store time related information. At SBF, whenever doing an insertion, firstly one selects @@0@@ random positions of the bitmap and decrements them by 1, then at each of the @@1@@ addresses you set the counter to a maximum value instead of 1. This way data that is not accessed often is going to be forgotten, and data that is often accessed retains a large counter value.
Bloom, Burton H. 1970. “Space/Time Trade-Offs in Hash Coding with Allowable Errors.” Commun. ACM 13 (7): 422–26. doi:10.1145/362686.362692.
Deng, Fan, and Davood Rafiei. 2006. “Approximately Detecting Duplicates for Streaming Data Using Stable Bloom Filters.” In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, 25–36. SIGMOD ’06. ACM. http://doi.acm.org/10.1145/1142473.1142477.
Durand, Marianne, and Philippe Flajolet. 2003. “Loglog Counting of Large Cardinalities.” In Algorithms - ESA 2003: 11th Annual European Symposium, Budapest, Hungary, September 16-19, 2003. Proceedings, 605–17. Springer Berlin Heidelberg. http://dx.doi.org/10.1007/978-3-540-39658-1_55.
Fan, Li, Pei Cao, Jussara Almeida, and Andrei Z. Broder. 2000. “Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol.” IEEE/ACM Trans. Netw. 8 (3): 281–93. doi:10.1109/90.851975.
Flajolet, Philippe. 1985. “Approximate Counting: A Detailed Analysis.” BIT Computer Science and Numerical Mathematics 25 (1): 113–34. doi:10.1007/BF01934993.
Flajolet, Philippe, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. 2007. “HyperLogLog: The Analysis of a near-Optimal Cardinality Estimation Algorithm.” In AofA: Analysis of Algorithms, AH:137–56. DMTCS Proceedings. Discrete Mathematics and Theoretical Computer Science. https://hal.inria.fr/hal-00406166.
Flajolet, Philippe, and G. Nigel Martin. 1985. “Probabilistic Counting Algorithms for Data Base Applications.” Journal of Computer and System Sciences 31: 182–209. doi:http://dx.doi.org/10.1016/0022-0000(85)90041-8.
Lumbroso, Jérémie. “How Flajolet Processed Streams with Coin Flips.” Probabilistic Streaming Algorithms. http://www.stat.purdue.edu/~mdw/ChapterIntroductions/ApproxCountingLumbroso.pdf.
Morris, Robert. 1978. “Counting Large Numbers of Events in Small Registers.” Commun. ACM 21 (10): 840–42. doi:10.1145/359619.359627.
Tarkoma, Sasu, Christian Esteve Rothenberg, and Eemil Lagerspetz. 2012. “Theory and Practice of Bloom Filters for Distributed Systems.” IEEE Communications Surveys and Tutorials 14 (1): 131–55.