Data Engineering

420 readers

1 users here now

A community for discussion about data engineering

Icon base by Delapouite under CC BY 3.0 with modifications to add a gradient

founded 2 years ago

MODERATORS

ericjmorey

Bloom filters: real-world applications (llimllib.github.io)

submitted 2 years ago by Reader9 to c/data_engineering

4 comments fedilink hide all child comments

cross-posted from: https://programming.dev/post/2656516

What are your real-world applications of this versatile data structure?

They are useful for optimization in databases like sqlite and query engines like apache spark. Application developers can use them as concise representations of user data for filtering previously seen items.

The linked site gives a short introduction to bloom filters along with some links to further reading:

A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set. The price paid for this efficiency is that a Bloom filter is a probabilistic data structure: it tells us that the element either definitely is not in the set or may be in the set.

you are viewing a single comment's thread
view the rest of the comments

[–] [email protected] 2 points 2 years ago

All records created by customers, yes. The alternative used was our default structure, but that structure is also in ram. The problem faced is that a resolver can be targeted in several ways, one of which is querying for zone data that doesn't exist. When latency is the primary performance concern, and you need a low performance cost method to look up whether a request can be (maybe) served or (absolutely) not, bloom filters look like the right filter to use.

Performance wise (latency), there was no improvement in lookup performance, nor a reduction on CPU consumed. So for us in that application, it didn't make sense.

Big table was the example I was thinking of. A massive k:v store that can start at 4, 8, 32TB you can have in RAM. The problem is, you can only answer negative membership in the key space, not positive. So if you're looking up a key you either get a absolute no or an opportunity to scan the key space anyway.

What I've tended to find is that indexes make more sense in most scenarios. Not all, because there's always exceptions. But indexes tend to be faster, still consume vastly less space that the data being indexed, and they tend to offer more powerful features rather than just membership in a set.