datahoarder

Who are we?

We are digital librarians. Among us are represented the various reasons to keep data -- legal requirements, competitive requirements, uncertainty of permanence of cloud services, distaste for transmitting your data externally (e.g. government or corporate espionage), cultural and familial archivists, internet collapse preppers, and people who do it themselves so they're sure it's done right. Everyone has their reasons for curating the data they have decided to keep (either forever or For A Damn Long Time). Along the way we have sought out like-minded individuals to exchange strategies, war stories, and cautionary tales of failures.

We are one. We are legion. And we're trying really hard not to forget.

-- 5-4-3-2-1-bang from this thread

What's an elegant way of automatically backing up the contents of a large drive to multiple smaller drives that add up to the capacity of the large drive? (on Linux) (lemmy.ml)

submitted 10 months ago* (last edited 10 months ago) by [email protected] to c/[email protected]

So I have a nearly full 4 TB hard drive in my server that I want to make an offline backup of. However, the only spare hard drives I have are a few 500 GB and 1 TB ones, so the entire contents will not fit all at once, but I do have enough total space for it. I also only have one USB hard drive dock so I can only plug in one hard drive at a time, and in any case I don't want to do any sort of RAID 0 or striping because the hard drives are old and I don't want a single one of them failing to make the entire backup unrecoverable.

I could play digital Tetris and manually copy individual directories to each smaller drive until it fills up, mentally keeping track of which directories still need to be copied when I change drives, but I'm hoping for a more automatic and less error-prone way. Ideally, I'd want something that can begin copying the entire contents of a given drive or directory to a drive that isn't big enough to fit everything, round down to the last file that will fit in its entirety (I don't want to split files between drives), and then wait for me to unplug the first drive, plug in another drive, and specify a new mount point before it continues copying the remaining files, using as many drives as necessary to copy everything.

Does anyone know of something that can accomplish all of this on a Linux system?
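
For illustration, the workflow described above (copy until the next file no longer fits, then wait for a new mount point) might look roughly like this in Python. This is a minimal sketch, not an existing tool: the function names and the `reserve` parameter are hypothetical, files are visited in sorted order, and a file larger than every drive would keep asking for another drive.

```python
import os
import shutil


def iter_files(source):
    """Yield (absolute path, relative path, size) for every regular file under source."""
    for root, dirs, files in os.walk(source):
        dirs.sort()  # deterministic order, so a rerun visits files the same way
        for name in sorted(files):
            path = os.path.join(root, name)
            if os.path.isfile(path) and not os.path.islink(path):
                yield path, os.path.relpath(path, source), os.path.getsize(path)


def backup_across_drives(source, next_mount, free_bytes=None, reserve=0):
    """Copy source onto successive drives without splitting any file.

    next_mount: callable returning the mount point of the next drive.
    free_bytes: callable(mount) -> free bytes; defaults to shutil.disk_usage.
    reserve:    bytes to leave unused on each drive (hypothetical safety margin).
    Returns a manifest mapping each relative path to the mount that holds it.
    """
    if free_bytes is None:
        free_bytes = lambda mount: shutil.disk_usage(mount).free
    manifest = {}
    mount = next_mount()
    for path, rel, size in iter_files(source):
        # Ask for another drive until this file fits in its entirety.
        while size + reserve > free_bytes(mount):
            mount = next_mount()
        dest = os.path.join(mount, rel)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        shutil.copy2(path, dest)  # copy2 also preserves timestamps
        manifest[rel] = mount
    return manifest
```

For interactive use, `next_mount` could be something like `lambda: input("Plug in the next drive and enter its mount point: ")`. Saving the returned manifest somewhere safe gives an index of which drive holds which file.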

[–] [email protected] 6 points 10 months ago

You could create a Python script to do this. There is a library called psutil that would help. Basically,

1. Iterate over mounted drives and see how much space each has available.
2. Based on these values, iterate over your backup files and separate them into chunks that will fit on each drive.
3. Copy the chunks to their respective drives.

Would be a fun little project even for a beginner I think.
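
A rough sketch of the first two steps, under some assumptions: it uses the standard library's `shutil.disk_usage` in place of psutil (psutil's `disk_partitions()` and `disk_usage()` would provide the same free-space numbers), and the function names and the largest-first, first-fit strategy are illustrative choices rather than anything prescribed above.

```python
import shutil


def free_space(mounts):
    """Step 1: map each mount point to its free bytes (stdlib stand-in for psutil)."""
    return {m: shutil.disk_usage(m).free for m in mounts}


def plan_chunks(files, free):
    """Step 2: assign whole files to drives, first-fit, largest files first.

    files: dict of path -> size in bytes
    free:  dict of mount point -> free bytes
    Returns (plan, leftovers): plan maps mount -> list of paths,
    leftovers lists the files that fit on no drive.
    """
    remaining = dict(free)
    plan = {m: [] for m in free}
    leftovers = []
    for path, size in sorted(files.items(), key=lambda kv: kv[1], reverse=True):
        for mount in remaining:
            if size <= remaining[mount]:
                plan[mount].append(path)
                remaining[mount] -= size
                break
        else:
            # No drive has room for this file.
            leftovers.append(path)
    return plan, leftovers
```

Step 3 would then be a loop over `plan` that copies each path to its mount, for example with `shutil.copy2`. With only one dock, the free-space figures could be recorded one drive at a time and passed in as a plain dict.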

[–] [email protected] 6 points 10 months ago

Use mergerfs to combine the smaller drives into a single pool.

[–] [email protected] 3 points 10 months ago

> I don’t want to do any sort of RAID 0 or striping because the hard drives are old and I don’t want a single one of them failing to make the entire backup unrecoverable.

This will happen in any case unless you have enough capacity for redundancy.

What is in this 4TB drive? A Linux installation? A bunch of user data? Both? What kind of data?

The first step is to separate your concerns. If you had, for example, a 20GiB Linux install, 10GiB of loose home files, 1TiB of movies, 500GiB of photos, 1TiB of games and 500GiB of music, you could back each of those up separately onto separate drives.

Now, it's likely that you'd still have more data of one category than what fits on your largest external drive (movies are a likely candidate).

For this purpose, I use git-annex (https://git-annex.branchable.com/). It's a beast to get into and set up properly, with plenty of footguns attached, but it was designed to solve issues like this elegantly.
One of the most important things it does is separate file content from file metadata; making metadata available in all locations ("repos") while data can be present in only a subset, thereby achieving distributed storage. I.e. you could have 4TiB of file contents distributed over a bunch of 500GiB drives but in each one of those repos you'd have the full file tree available (metadata of all files + content of present files) allowing you to manage your files in any place without having all the contents present (or even any). It's quite magical.

Once configured properly, you can simply attach a drive, clone the git repo onto it, and then run git annex sync --content. It will fill that drive with as much content as it can, or until each file's numcopies setting or other configured constraints are satisfied.
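
As a hedged illustration of that last step, the commands can be scripted. The sketch below only wraps `git clone`, `git annex init`, and `git annex sync --content`. The function names, paths, and the drive name are hypothetical, and it assumes the source repository has already been set up as an annex.

```python
import subprocess


def annex_fill_commands(origin_repo, drive_repo, drive_name):
    """Return (argv, cwd) pairs for cloning an annex onto a drive and filling it."""
    return [
        (["git", "clone", origin_repo, drive_repo], None),
        (["git", "annex", "init", drive_name], drive_repo),
        (["git", "annex", "sync", "--content"], drive_repo),
    ]


def fill_drive(origin_repo, drive_repo, drive_name, dry_run=True):
    """Print the commands (dry run) or execute them in order, stopping on failure."""
    for argv, cwd in annex_fill_commands(origin_repo, drive_repo, drive_name):
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, cwd=cwd, check=True)
```

Calling `fill_drive("/srv/annex", "/mnt/usb/annex", "backup-drive-1")` prints the three commands without running anything, which is a cheap way to check the paths before committing to a long copy.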

[–] [email protected] 2 points 10 months ago (1 children)

sounds like your main limitation is attaching the drives – if you can attach them all to a single system (ex. a separate computer or a NAS case) then at least it becomes somewhat easier to access them all at once

I was thinking JBOD, but Wikipedia points out the same issue you mention with RAID 0: failure of one drive can mess up the logical volume, which leads to a whole host of new issues to deal with during recovery

[–] [email protected] 2 points 10 months ago* (last edited 10 months ago)

Not that big of a deal when it is a backup. RAID is not a backup solution; it is a 24/7 uptime solution. If the main drive dies, you still have the JBOD backup. If a backup drive fails, you still have the main drive. The trick is to deal with any drive issues immediately and to make sure no backup runs if there is a SMART error or similar on any drive.

So having software that monitors drive health and emails or otherwise notifies you is necessary.
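
One way to gate a backup on drive health is to check `smartctl -H` (from smartmontools) before copying. The following is a minimal sketch under assumptions: the helper names are made up, the parser only recognizes the two summary lines smartctl commonly prints (ATA and SCSI wording), and anything else is treated as unhealthy.

```python
import subprocess


def health_passed(smartctl_output):
    """Return True if smartctl's health summary reports a healthy drive."""
    for line in smartctl_output.splitlines():
        if "overall-health self-assessment test result" in line:
            return line.strip().endswith("PASSED")  # ATA/SATA wording
        if "SMART Health Status" in line:
            return line.strip().endswith("OK")  # SCSI/SAS wording
    return False  # no verdict found: treat as unhealthy and skip the backup


def drive_is_healthy(device):
    """Run `smartctl -H` on a device such as /dev/sdb (usually needs root)."""
    result = subprocess.run(
        ["smartctl", "-H", device], capture_output=True, text=True
    )
    return health_passed(result.stdout)
```

A backup script could call `drive_is_healthy()` for the source and the currently attached backup drive and refuse to start if either returns False. For continuous monitoring with email alerts, smartmontools also ships the smartd daemon.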

Secondary benefit of JBOD is all drives in a pool are still readable separately.

[–] [email protected] 2 points 10 months ago

What I do is create new subdirectories on the originating drive and start categorizing items by content; for example, I'll put all the ebooks into one directory and all the television into another. It just makes it easier for me to find things later if I can head straight to the drive with all the television on it.

If there's a particular directory with a lot of content, I might create further divisions: maybe shows that are finished vs. those that are still getting new episodes, or sitcoms vs. drama, that kind of thing.

Then I make a list of how big each master directory is, and I start copying them over to the most appropriate-sized drive. I usually find that I can fit in one large directory, and a couple of smaller ones, and then the last drive gets all the leftovers. I also tape a post-it note to each drive saying something like "2022-23 television" or "science fiction audiobooks" or whatever.

I also create a new directory on the originating drive called something like ++COPIED and, once I've copied content to a new drive, I move the original directory to ++COPIED: I'll still have access if I need it, but I don't have to keep track of it any longer. Once everything is successfully copied over, I can just delete that one directory.

It's a manual process, yes, but it does make it easier for me to find stuff when I want to look at it again later.
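
The size list and the ++COPIED bookkeeping are easy to script while keeping the rest of the process manual. A small sketch, with hypothetical function names, assuming the categories are the top-level directories of the originating drive:

```python
import os
import shutil


def directory_sizes(root, skip=("++COPIED",)):
    """Return [(name, bytes)] for each top-level directory under root, largest first."""
    sizes = []
    for entry in os.scandir(root):
        if entry.name in skip or not entry.is_dir(follow_symlinks=False):
            continue
        total = 0
        for dirpath, _dirs, files in os.walk(entry.path):
            for name in files:
                path = os.path.join(dirpath, name)
                if not os.path.islink(path):
                    total += os.path.getsize(path)
        sizes.append((entry.name, total))
    return sorted(sizes, key=lambda item: item[1], reverse=True)


def mark_copied(root, name, holding="++COPIED"):
    """Move root/name into root/++COPIED once it has been copied to another drive."""
    target_dir = os.path.join(root, holding)
    os.makedirs(target_dir, exist_ok=True)
    return shutil.move(os.path.join(root, name), os.path.join(target_dir, name))
```

Printing `directory_sizes()` gives the list to match against drive capacities, and `mark_copied()` stays on the same filesystem, so the move is a quick rename rather than a second copy.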

[–] [email protected] 1 points 9 months ago

You could check out git-annex.

[–] [email protected] 0 points 10 months ago

This seems like a terrible idea.