Programming

19774 readers

54 users here now

Welcome to the main community in programming.dev! Feel free to post anything relating to programming here!

Cross posting is strongly encouraged in the instance. If you feel your post or another person's post makes sense in another community cross post into it.

Hope you enjoy the instance!

Rules

Follow the programming.dev instance rules
Keep content related to programming in some way
If you're posting long videos try to add in some form of tldr for those who don't want to watch videos

Wormhole

Follow the wormhole through a path of communities [email protected]

founded 2 years ago

MODERATORS

snowe

Ategon

[email protected]

Bloom filters: real-world applications (llimllib.github.io)

submitted 2 years ago by Reader9 to c/programming

10 comments fedilink hide all child comments

What are your real-world applications of this versatile data structure?

They are useful for optimization in databases like sqlite and query engines like apache spark. Application developers can use them as concise representations of user data for filtering previously seen items.

The linked site gives a short introduction to bloom filters along with some links to further reading:

A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set. The price paid for this efficiency is that a Bloom filter is a probabilistic data structure: it tells us that the element either definitely is not in the set or may be in the set.

you are viewing a single comment's thread
view the rest of the comments

[–] noli 5 points 2 years ago* (last edited 2 years ago) (1 children)

I know they are used in google's BigTable. All data there is stored in seperate SSTables and you can specify that a locality group should have bloom filters generated for its SSTables. Apparently cassandra has them too.

Both are the same general application though and you already mentioned databases.

I did think about using them at some point for authentication purposes in a webservice. The idea being to check for double uses of a refresh token. This way the user database would need to store only a small amount of extra storage to check for the reuse of a refresh token and if you set the parameters accordingly, the false positives are kind of a benefit in that users cannot infinitely refresh and they actually have to reauthenticate sometimes.

Edit to add: I also read a paper recently that uses a datastructure called a collage that is closely related to bloom filters to perform in-network calculations in a sensor network. If I understand correctly, the basic idea there is that every node in the network adds a bit to the datastructure while it is in transit, so data from the entire network is aggregated. The result can then be fed to a classifier ML model. (Source: Oostvogels, J., Michiels, S., & Hughes, D. (2022). One-Take: Gathering Distributed Sensor Data Through Dominant Symbols for Fast Classification. )

[–] Reader9 2 points 2 years ago (1 children)

Collage sounds really interesting , will check it out. Another variation on bloom filter I recently learned about is count-min-sketch. It allows for storing/incrementing a count along with each key, and can answer “probably in set with count greater than _”, “definitely not in set”.

Thanks for adding more detail on the DB use-cases!

[–] noli 2 points 2 years ago (1 children)

Interesting. Do I understand it correctly if I say it's a bloom filter where instead of setting a bit to 1 for each of the hashes, you increment a counter for that hash?

How do you infer the count then, take the minimum of all matching hashes? Because intuitively it seems to me like you would need a lot more space to avoid counts being too high

[–] Reader9 2 points 2 years ago* (last edited 2 years ago)

This data structure uses a 2-dimensional array to store data, documented in this scala implementation: https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala. I’m still trying to understand it as well.

Similar to your idea, I had thought that by using k bloom filters, each with their own hash function and bit array, one could store an approximate count up to k for each key, which also might be wasteful or a naïve solution.

PDF link: http://www.eecs.harvard.edu/~michaelm/CS222/countmin.pdf