this post was submitted on 08 Dec 2024
9 points (100.0% liked)

Data Engineering

I am creating a couple of larger database tables with at least hundreds of millions of observations each, and growing. Some tables are at minute resolution, some at millisecond resolution. Timestamps are not necessarily unique.

Should I create separate year, month, or date and time columns? Is a single datetime column enough? At what size would you partition the tables?
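
For illustration, a minimal sketch of the single-datetime-column layout with monthly range partitions in Postgres (the table and column names here are placeholders, not from the actual data):

```sql
-- Hypothetical layout: one timestamptz column instead of separate
-- year/month/day columns, range-partitioned by month.
CREATE TABLE observations (
    ts     timestamptz NOT NULL,   -- not necessarily unique
    sensor text        NOT NULL,
    value  double precision
) PARTITION BY RANGE (ts);

-- One partition per month; creating these can be scripted
-- (or delegated to an extension such as pg_partman).
CREATE TABLE observations_2024_12
    PARTITION OF observations
    FOR VALUES FROM ('2024-12-01') TO ('2025-01-01');
```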

The raw data is in CSV files.
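
A possible loading path, reusing the placeholder table from the sketch above (file paths are made up):

```sql
-- DuckDB: infer the schema and load the CSV files directly.
CREATE TABLE observations AS
SELECT * FROM read_csv_auto('raw/*.csv');

-- Postgres: bulk-load into an existing table with COPY
-- (server-side path; use \copy from psql for a client-side file).
COPY observations FROM '/data/raw/observations.csv' WITH (FORMAT csv, HEADER true);
```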

Currently I am aiming for Postgres and DuckDB. Does TimescaleDB make a significant difference?
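
If TimescaleDB is in the picture, the usual pattern is a hypertable rather than hand-managed partitions. A sketch, assuming a plain (non-partitioned) version of the placeholder table above; the exact call form varies somewhat between TimescaleDB versions:

```sql
-- TimescaleDB: convert the plain table into a hypertable chunked by time.
CREATE EXTENSION IF NOT EXISTS timescaledb;
SELECT create_hypertable('observations', 'ts',
                         chunk_time_interval => INTERVAL '1 month');
```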

top 3 comments
[–] [email protected] 10 points 1 week ago (1 children)

How are you planning on querying the data? I would plan your columns and indexes based on that answer. I'm going to assume arbitrary time ranges.

Just indexing a standard datetime column should be sufficient, but that's going to depend on a few things: that you're not querying date ranges 10,000 times per second (e.g. as a website data source), and that you're not inserting 100,000 records per second. Even then, you might be able to get away with it if you're willing to throw more hardware at it.
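
Roughly what that looks like (a sketch reusing the placeholder observations/ts names from the question, not a tuned setup):

```sql
-- Plain B-tree index on the timestamp column.
CREATE INDEX observations_ts_idx ON observations (ts);

-- An arbitrary time-range query that this index serves.
SELECT sensor, avg(value)
FROM observations
WHERE ts >= '2024-12-01' AND ts < '2024-12-08'
GROUP BY sensor;
```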

The answer to most database design questions at larger scales depends on the scale. If you need to query this data once for a report, you do it in a spreadsheet. If you are building the next Facebook API, then you shard the crap out of it into a time-series database and add layers of caching.

My suggestion is to build it simple first and test. Don't make assumptions about scale before you have data to back them up.
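
For example, one cheap way to test (again with the placeholder names) is to look at the plan and timing of a representative range query before and after adding indexes or partitions:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM observations
WHERE ts >= '2024-12-01' AND ts < '2024-12-02';
```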

[–] jupyter 4 points 1 week ago

Thanks! That helps a lot

[–] [email protected] 5 points 1 week ago

I thought InfluxDB was the choice for such use cases, but I'm not an expert…