this post was submitted on 07 Sep 2023
4 points (100.0% liked)
Data Engineering
393 readers
1 users here now
A community for discussion about data engineering
Icon base by Delapouite under CC BY 3.0 with modifications to add a gradient
founded 2 years ago
MODERATORS
you are viewing a single comment's thread
view the rest of the comments
view the rest of the comments
A bloom filter tells you that an element is not in a set, or maybe is in a set. The only guaranteed accurate result is not in set. You also have to reproduce the filter whenever you remove items from the set being tested, which isn't a free operation.
I've seen bloom filters abused in a lot of ways that make no sense, but very few ways that do. The classical example of scanning a k:v space in a memory efficient manner is the best application I've seen, but it still doesn't really work if you need to know a key is definitely in that space. You still have to perform the lookup.
An application for bloom filter I was working on was mitigating the effects of abusive lookups for non-existent names on a DNS platform, and it turned out that even with absurdly high lookup rates, adding an initial check for negative presence in a record set, we didn't benefit at all.
In the google BigTable paper they use it like this. The reason it makes sense in that context is that there is that bloom filters are tiny and can be easily held in memory for a large dataset, whereas performing the actual lookup requires a slow disk access/network request. This way you can reduce the amount of expensive lookups massively at the cost of a small amount of memory.
Also since the underlying data format is immutable the cost of recalculating upon removal doesn't matter since it doesn't need to happen
Compared to which alternative? I also don't understand exactly what the filter's purpose was in this case, did you have all DNS records in a bloom filter to quickly check for misses? Or was it some kind of allowlist/blocklist of clients?
Finally, what metrics were you using to decide it did not benefit at all?
All records created by customers, yes. The alternative used was our default structure, but that structure is also in ram. The problem faced is that a resolver can be targeted in several ways, one of which is querying for zone data that doesn't exist. When latency is the primary performance concern, and you need a low performance cost method to look up whether a request can be (maybe) served or (absolutely) not, bloom filters look like the right filter to use.
Performance wise (latency), there was no improvement in lookup performance, nor a reduction on CPU consumed. So for us in that application, it didn't make sense.
Big table was the example I was thinking of. A massive k:v store that can start at 4, 8, 32TB you can have in RAM. The problem is, you can only answer negative membership in the key space, not positive. So if you're looking up a key you either get a absolute no or an opportunity to scan the key space anyway.
What I've tended to find is that indexes make more sense in most scenarios. Not all, because there's always exceptions. But indexes tend to be faster, still consume vastly less space that the data being indexed, and they tend to offer more powerful features rather than just membership in a set.
I can see why this data structure might be abused and/or chosen for an inappropriate use-case since it seems to offer a lot of value for the tiny amount of space required.
This is a good description. I think the name “filter” is appropriate for their best use cases, when you want to remove members of some other set if they are probably members of the bloom filter set, and can accept that you might remove some extras due to false positives.
Problems like that come up from time-to-time.