this post was submitted on 06 Mar 2025
25 points (90.3% liked)

Programming

top 10 comments
[–] FizzyOrange 3 points 12 hours ago (1 children)

Hmm I think just using SQLite or DuckDB with normalised data would probably get you 99% of the way there...

[–] [email protected] 2 points 8 hours ago* (last edited 6 hours ago)

It all depends on how it's represented on disk, though, and how the query is executed. SQLite only has a handful of storage types (essentially integers, reals, text, and blobs), and if you keep the data as VARCHAR, reading those rows means materialising a string in memory inside the SQLite library. DuckDB has more types, but if you're using varchars everywhere, something still has to read that string into memory unless you can push the logic down into a query that never touches the actual value, such as one that can be answered from an index.

The best approach is to change the representation on disk, e.g. converting a low-cardinality column like the station name into a numeric id. A four-byte int is a lot more efficient than an n-byte string plus a header, and it can be compared by value.
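
As a minimal sketch of that normalisation (hypothetical table and column names, using Python's built-in sqlite3 module, not anything from the original post):

```python
import sqlite3

con = sqlite3.connect("measurements.db")
con.executescript("""
    -- Station names are low-cardinality, so store each one exactly once...
    CREATE TABLE IF NOT EXISTS stations (
        id   INTEGER PRIMARY KEY,
        name TEXT UNIQUE NOT NULL
    );
    -- ...and the big table only stores the small integer id per row.
    CREATE TABLE IF NOT EXISTS readings (
        station_id INTEGER NOT NULL REFERENCES stations(id),
        temp       REAL NOT NULL
    );
    CREATE INDEX IF NOT EXISTS readings_by_station ON readings(station_id);
""")

def add_reading(station: str, temp: float) -> None:
    # Look up (or create) the station's id, then insert the compact row.
    con.execute("INSERT OR IGNORE INTO stations(name) VALUES (?)", (station,))
    (sid,) = con.execute("SELECT id FROM stations WHERE name = ?",
                         (station,)).fetchone()
    con.execute("INSERT INTO readings(station_id, temp) VALUES (?, ?)",
                (sid, temp))

add_reading("Hamburg", 12.3)
con.commit()
```

Queries that filter or group by station then compare integers and can be answered from the index instead of materialising every name string.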

This is where file formats like Parquet shine: they're oriented towards efficient machine reading, whereas JSON is geared towards human readability.
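
For example, a rough sketch of converting newline-delimited JSON into Parquet with pyarrow (the file names and the jsonl assumption are mine, not from the article):

```python
import pyarrow.json as pj
import pyarrow.parquet as pq

# Parse the JSON lines once, then write a columnar, compressed copy.
table = pj.read_json("measurements.jsonl")
pq.write_table(
    table,
    "measurements.parquet",
    compression="zstd",      # per-column compression
    use_dictionary=True,     # low-cardinality columns become integer codes
)
```

Parquet's dictionary encoding does roughly the same job as the integer-id normalisation above, just at the file-format level.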

[–] [email protected] 8 points 1 day ago (1 children)

Rust was the important factor in this result. That's why it's in the headline. It wasn't the hugely inefficient way of storing the data initially. Not at all.

FFS, you could just have run gzip on it probably.

[–] NostraDavid 2 points 14 hours ago* (last edited 12 hours ago) (1 children)

FFS, you could just have run gzip on it probably.

Gzip doesn't reduce your data's size by 2000x. Of course this could have been done in other languages as well, but running gzip on your data doesn't keep it accessible.

Even turning the data into a Parquet file would've been a massive improvement, while keeping it accessible, but it likely would not have been 2000x smaller. 10x, maybe.

edit: zip: about 10x; 7zip: about 166x (from ~10 GB to 60 MB) - still not 2000x

[–] [email protected] 1 points 12 hours ago (1 children)

It all depends on the data's entropy. Formats like JSON compress very well anyway, and if the data itself is also very repetitive then 2000x is very possible.
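
A quick, artificial way to see that (synthetic repetitive rows, not the post's actual data):

```python
import gzip
import lzma

# Highly repetitive synthetic lines -- an assumption, not the real dataset.
data = b"Hamburg;12.3\n" * 1_000_000   # ~13 MB of near-identical rows

for name, compress in (("gzip", gzip.compress), ("xz", lzma.compress)):
    packed = compress(data)
    print(f"{name}: {len(data) // len(packed)}x smaller")
```

gzip's DEFLATE format caps out at roughly 1000x, but LZMA-style compressors can go far beyond that on data this repetitive.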

[–] FizzyOrange 1 points 12 hours ago (1 children)

In my experience, taking an inefficient format and copping out by saying "we can just compress it" is always rubbish. Compression tends to be slow, rules out sparse reads, is awkward to deal with remotely, and you generally end up with the inefficient decompressed data in the end anyway, whether in temporarily decompressed files or in memory.

I worked at a company where they went against my recommendation not to use JSON for a memory profiler's output. We ended up with 10 GB JSON files; even compressed they were super annoying.

We switched to SQLite in the end which was far superior.

[–] [email protected] 1 points 12 hours ago (1 children)

Of course compressing isn't a good solution for this stuff. The point of the comment was to say how unremarkable the original claim was.

[–] FizzyOrange 1 points 9 hours ago

Yeah I agree.

[–] [email protected] 24 points 2 days ago (1 children)

If you're storing your data as JSON on disk, you probably never cared about efficient storage anyway.

[–] [email protected] 8 points 2 days ago

They thought "data lake" was a physical description.