this post was submitted on 15 Oct 2023

Tracking Sqlite binary file with git? (lemmy.world)

submitted 2 years ago by [email protected] to c/programming

I have a project in development that I'm working on and I frequently switch between two computers. I am including my sqlite file in git and so far it's been fine but I've heard in the past that git doesn't do well with binary? Has anyone actually had issues doing this?

I decided to perform a dump just in case, so I don't have to start from scratch if something does go wrong.

[–] [email protected] 6 points 2 years ago (1 children)

Don't worry: you won't lose or corrupt data no matter what kind of files you keep in git. Keeping binary files can be inefficient, though. Many binary formats don't respond well to external compression, so the repository grows by the full size of the updated file on each commit, even when the logical change is small. That should not be the case for sqlite files, which should compress well both as individual blobs and between versions of the same file, so each update should grow the repository by roughly the size of the changed data. Do be careful to close all database connections before committing, so the file is in a consistent state and contains all the recently written data.
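To make the "close all connections" advice concrete, here is a minimal sketch using Python's standard sqlite3 module. It assumes the database might be in WAL mode, where recent writes can sit in a separate "-wal" file next to the main one. The function name flush_for_commit is made up for illustration, and the checkpoint can be blocked if some other process still has the database open.

```python
import sqlite3


def flush_for_commit(db_path: str) -> None:
    """Fold any WAL contents back into the main database file, then close."""
    conn = sqlite3.connect(db_path)
    try:
        # In WAL mode, recent writes may live in a separate "-wal" file.
        # A TRUNCATE checkpoint copies them into the main file and empties
        # the WAL. In other journal modes this pragma does nothing useful
        # but is harmless.
        conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
    finally:
        conn.close()
```

Running something like this (with the app itself stopped) just before git add should leave a single self-contained file to commit.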

[–] [email protected] 1 points 2 years ago

Thank you!

[–] [email protected] 4 points 2 years ago* (last edited 2 years ago) (1 children)

While it will "work" I honestly wouldn't recommend it.

Your .git directory is potentially going to get stupidly large (possibly large enough that some git service providers will turn you away - individual file size limits as low as 100MB are common), and one day you're likely to face a merge conflict that's really difficult to fix.

Use something else to sync your database or even better just don't sync them at all, and use migrations to keep two databases up to date. The latter is what most people do.

If "something does go wrong" though, you should be able to just restore the sqlite database from a backup... you do have backups right? RIGHT? Git is not a backup.

[–] [email protected] 3 points 2 years ago* (last edited 2 years ago)

The sqlite file in question is just for initial development testing; its loss would be but a minor annoyance. Since I first posted this question, I've removed the binary file from git tracking anyway and just keep a plain text dump file. This is for convenience while working between two computers, not actual data backup.
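For anyone wanting to do the same, a plain text dump and restore can be sketched with Python's standard sqlite3 module (the sqlite3 command-line shell's .dump does a similar job). The function names and paths here are placeholders, not anything from the project above.

```python
import sqlite3


def dump_to_sql(db_path: str, dump_path: str) -> None:
    """Write the schema and data of a SQLite file as plain SQL statements."""
    conn = sqlite3.connect(db_path)
    try:
        with open(dump_path, "w", encoding="utf-8") as out:
            # iterdump() yields the SQL needed to recreate the database.
            for statement in conn.iterdump():
                out.write(statement + "\n")
    finally:
        conn.close()


def restore_from_sql(dump_path: str, db_path: str) -> None:
    """Rebuild a SQLite file from a dump; db_path should not exist yet."""
    with open(dump_path, encoding="utf-8") as src:
        script = src.read()
    conn = sqlite3.connect(db_path)
    try:
        conn.executescript(script)
    finally:
        conn.close()
```

The dump file is what gets committed; the second computer rebuilds its local database from it after pulling.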

[–] atheken 3 points 2 years ago* (last edited 2 years ago) (1 children)

It sounds like you might be developing an app with an evolving schema.

You should consider adopting a db migrations framework and having a task that can apply them to a dev database to bootstrap/upgrade the DB. If you take this route, you won’t need to even commit the db file, and you will be able to easily seed/replicate the DB schema when you deploy it.
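A real migrations framework does much more, but the core idea can be sketched in a few lines with Python's sqlite3 module. This assumes the schema version is tracked in SQLite's built-in PRAGMA user_version; the MIGRATIONS list and the users table are invented for the example.

```python
import sqlite3

# Hypothetical migrations, in order. Entry i upgrades the schema
# from version i to version i + 1.
MIGRATIONS = [
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)",
    "ALTER TABLE users ADD COLUMN email TEXT",
]


def migrate(conn: sqlite3.Connection) -> int:
    """Apply migrations not yet recorded in user_version; return the version."""
    current = conn.execute("PRAGMA user_version").fetchone()[0]
    for version in range(current, len(MIGRATIONS)):
        conn.execute(MIGRATIONS[version])
        # PRAGMA does not accept bound parameters; the value is an int
        # produced by range(), not user input.
        conn.execute(f"PRAGMA user_version = {version + 1}")
        conn.commit()
    return conn.execute("PRAGMA user_version").fetchone()[0]
```

Only the migration list lives in git. Each computer runs migrate() against its own local database file, and running it twice is safe because already-applied steps are skipped. This sketch does not make each step atomic, which a proper framework would handle.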

Additionally, SQLite is awesome, and if you are actually storing some data, you can create tables that are backed by structured text files (like CSV), so there are ways to store data in text while still getting the benefits of having a SQL interface.

Large binary files will start to expand the git repo, but if they’re relatively small, and the update frequency is somewhat limited, it won’t really be an issue. If you are concerned about it, you can look into git-lfs, but it might not matter much.

EDIT: Also, since the "git is bad with binary files" is such a pervasive myth, I decided to check into it a little bit. A couple things:

Git uses "delta compression" when packing/storing/transmitting files. This allows common chunks to be stored once and then reassembled when you check out a file. It does this for "normal" files regardless of whether they are text or binary until they are considered "big", at which point they are stored as a single unit in the pack file. What's "big"? By default, 512MB.

You can go pretty deep on the internals of the way that packfiles are constructed in git, but more than likely, a file that's a few MB is still going to work fine, and you will get some storage reduction when you commit it.

You should configure automatic gc to periodically repack stuff so that the actual .git repo doesn't balloon, but again, even if you're talking about a few GB, it's still not much on modern systems.

[–] o11c 1 points 2 years ago

git is, however, bad with files that don't have meaningful small binary diffs. And the page size for SQLite database files is small enough that that is in fact a problem (though this is not nearly as bad as already-compressed files).

If you disable VACUUM that can give a rough idea of what git actually has to deal with. But you really shouldn't.

[–] [email protected] 3 points 2 years ago* (last edited 2 years ago)

Export the data and structure in SQL. SQL is plain text and suitable for git.

If data can be seeded easily only export the structure and git control it.

In Rails framework, schema file and seed file are used for structure and data.

[–] lysdexic 1 points 2 years ago

I am including my sqlite file in git and so far it’s been fine but I’ve heard in the past that git doesn’t do well with binary? Has anyone actually had issues doing this?

Git by default interprets all files as source code, and supports some convenient features such as converting newline characters. It also enables text file diffs for all files by default, which might not be what you want to do with stuff like SQLite databases.

Nevertheless, Git also supports non-source files. You only need to tell it which ones they are, and what it should do with them.

For that, Git supports Git attributes. For your particular case, you will need to create a .gitattributes file in your repository, and add a glob pattern that matches your SQLite files.

Git also allows you to diff your SQLite databases with your own custom diff tool. For that you need to create your own diff script, configure your Git install to invoke that script for specific types of diff, and then go back to the Git repository where you added git attributes for the SQLite files, and update the git attributes to set the diff type to your custom SQLite diff script.
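As a rough idea of what such a diff script could do, here is a sketch in Python that compares two SQLite files by their SQL text rather than their bytes. It is standalone and does not wire itself into Git; the function names are made up. On the Git side, the usual hook is a diff driver named in .gitattributes (for example a line like *.sqlite diff=sqlite) whose textconv setting points at a command that turns one file into text, which is the job sql_lines does here.

```python
import difflib
import sqlite3


def sql_lines(db_path: str) -> list:
    """Return the database contents as a list of SQL statements."""
    conn = sqlite3.connect(db_path)
    try:
        return list(conn.iterdump())
    finally:
        conn.close()


def diff_databases(old_path: str, new_path: str) -> str:
    """Unified diff of the SQL dumps of two SQLite files ('' if identical)."""
    diff = difflib.unified_diff(
        sql_lines(old_path),
        sql_lines(new_path),
        fromfile=old_path,
        tofile=new_path,
        lineterm="",
    )
    return "\n".join(diff)
```

Printing diff_databases("old.sqlite", "new.sqlite") shows added and removed rows as INSERT statements, which is far more readable than a binary diff.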

[–] [email protected] 1 points 2 years ago* (last edited 2 years ago)

The problem is mainly that the size of your git repo blows up really quickly with regularly changing binary files. Also, merge conflicts exist, but I think git would just make you choose which binary to keep.

[–] coltorl 1 points 2 years ago

This is a common enough practice. I work for a large engineering org and we put a sqlite db in git for ease of spinning up an environment for troubleshooting.

[–] [email protected] 1 points 2 years ago

Git is mainly tracking and saving changes, which works great for text, but not that well for data (especially binary). You won't lose your data, but the Git repo will keep growing too fast.

The big question here is: How often does the data change? If you just use it as a convenient format and rarely change things, it should be fine. Though as mentioned: It might make sense to export to SQL before putting it in Git then. As long as the size is reasonable too (Not storing gigabytes of data).

Alternatives can be other sync services (Dropbox, Seafile, ..) to keep your Git repo lean or even better: Set up a SQL server so the data is always in the same spot. Of course that depends on if you have internet everywhere you work (but you probably do).