2934: Bloom Filter

Title text: Sometimes, you can tell Bloom filters are the wrong tool for the job, but when they're the right one you can never be sure.

Explanation


The comic refers to a Bloom filter, a data structure used for approximate membership queries and cardinality estimation in a bounded amount of memory. After a series of objects has been added to a Bloom filter, it can be queried to see whether another given object has already been added, with a chance of a false positive that depends on the size of the filter and the number of objects added. The filter can also be queried for an approximate count of the objects that have been added so far.

A Bloom filter uses a large bit array and a number of hash functions that produce indexes into this array. When a value is added to the set, it is hashed with each function, and the corresponding bits in the array are set to 1. To test whether a value is in the set, you hash it with all the functions and check whether all the corresponding bits are 1. If they are, the value may be in the set, but it may also be a false positive: each of those bits could have been set by some other value in the set (with reasonable hash functions, typically a different value for each hash). But if any of the bits is 0, you know for sure the value is not in the set. The higher the ratio between the size of the bit array and the number of elements in the set, the smaller the false positive rate (10 bits per element gives about 1% false positives, given a suitable number of hash functions).
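The add and query steps described above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the class name `BloomFilter`, its method names, and the choice of salted SHA-256 as the family of hash functions are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: a bit array plus k salted hashes (illustrative only)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    def _indexes(self, item):
        # Assumption: simulate k different hash functions by salting SHA-256.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        # Set every bit that the item hashes to.
        for i in self._indexes(item):
            self.bits[i] = 1

    def might_contain(self, item):
        # False means "definitely not added"; True only means "possibly added".
        return all(self.bits[i] for i in self._indexes(item))


bf = BloomFilter(num_bits=1000, num_hashes=7)
for name in ["cueball", "ponytail", "megan"]:
    bf.add(name)

print(bf.might_contain("cueball"))    # True: an added item is never missed
print(bf.might_contain("beret guy"))  # usually False, but a false positive is possible
```

The only answer that can be trusted completely is `False`, which is the asymmetry the rest of the explanation relies on.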

The joke in the comic is that Cueball has a 1-bit Bloom filter. While the set is empty, it accurately reports that no value is in the set. But as soon as anything is added, the single bit is set, and because every value hashes to that one index, the filter answers "possibly in the set" for every query, so any value that was not added is a false positive. Similarly, the cardinality estimate is (correctly) 0 initially, but after the first addition the estimate is "somewhere between 1 and infinity", which is not a terribly useful estimate.
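A rough sketch of Cueball's filter shows both behaviours. The class `OneBitBloomFilter` and its method names are hypothetical, and the count estimate uses one commonly quoted estimator, n ≈ -(m/k) · ln(1 - X/m), where m is the number of bits, k the number of hash functions, and X the number of bits set; treating a saturated filter as "infinity" is this sketch's reading of that formula, not something the comic states.

```python
import math


class OneBitBloomFilter:
    """Hypothetical one-bit Bloom filter, as in the comic's caption."""

    def __init__(self):
        self.bits = [0]  # the entire filter is a single bit

    def add(self, item):
        # Any hash value modulo 1 is 0, so every item sets the same bit.
        self.bits[hash(item) % 1] = 1

    def might_contain(self, item):
        return self.bits[hash(item) % 1] == 1

    def estimated_count(self, num_hashes=1):
        # Estimator: -(m / k) * ln(1 - X / m), with m = 1 bit and X bits set.
        m, x = len(self.bits), sum(self.bits)
        if x == 0:
            return 0.0
        if x == m:
            return math.inf  # ln(0) diverges: the filter is saturated
        return -(m / num_hashes) * math.log(1 - x / m)


cueball = OneBitBloomFilter()
print(cueball.might_contain("anything"), cueball.estimated_count())  # False 0.0

cueball.add("one thing")
print(cueball.might_contain("something else"))  # True ("Yeah, probably.")
print(cueball.estimated_count())                # inf
```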

There's also no point in having multiple hash functions for a 1-bit filter, since there's only one possible hash value.
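This can be checked numerically with the usual textbook approximation for the false positive rate, (1 - (1 - 1/m)^(kn))^k, which assumes independent, uniformly distributed hash functions. The function name and parameter names below are chosen for this sketch only.

```python
def false_positive_rate(num_bits, num_items, num_hashes):
    """Approximate false positive rate, assuming independent uniform hashes."""
    # Probability that one particular bit is still 0 after all insertions.
    p_zero = (1 - 1 / num_bits) ** (num_hashes * num_items)
    # A false positive needs all k probed bits to be 1.
    return (1 - p_zero) ** num_hashes


# One-bit filter holding a single item: extra hash functions change nothing.
for k in (1, 2, 3):
    print(k, false_positive_rate(num_bits=1, num_items=1, num_hashes=k))  # 1.0 each time

# 10 bits per element with 7 hash functions: a bit under 1%.
print(false_positive_rate(num_bits=10_000, num_items=1_000, num_hashes=7))
```

With one bit, the rate is 0 while the filter is empty and 1 as soon as it holds anything, regardless of how many hash functions are used.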

The title text applies the Bloom filter's own asymmetry to the choice of tool. A Bloom filter is always accurate when it says an element is not in the set (corresponding to Bloom filters being the wrong tool, which you can sometimes tell), but you can never be sure an element is actually in the set when the filter says it is (corresponding to Bloom filters being the right tool), because of false positives.

Transcript

[Ponytail holds out her hand to Cueball, who is holding a paper with a 1 on it.]
Ponytail: Does your set contai-
Cueball: Yeah, probably.
[Caption below the panel:]
One-Bit Bloom Filter



Discussion

It certainly does contain a thing. 172.68.23.74 00:10, 18 May 2024 (UTC)

The title text deals with inaccuracies in determining whether you have chosen the right programming tool for your membership query (or some different task), not just inaccuracies in the Bloom filter as one of these tools. This analogy remains unexplained. Transgalactic (talk) 11:24, 18 May 2024 (UTC)