this post was submitted on 16 Nov 2023
2 points (100.0% liked)

Data Hoarder


We are digital librarians. Among us are represented the various reasons to keep data -- legal requirements, competitive requirements, uncertainty of permanence of cloud services, distaste for transmitting your data externally (e.g. government or corporate espionage), cultural and familial archivists, internet collapse preppers, and people who do it themselves so they're sure it's done right. Everyone has their reasons for curating the data they have decided to keep (either forever or For A Damn Long Time (tm) ). Along the way we have sought out like-minded individuals to exchange strategies, war stories, and cautionary tales of failures.

founded 1 year ago

A few months ago I made a post detailing a cost comparison between AWS S3 and Backblaze B2. The conclusion I came to was that for disaster-recovery cloud storage, Glacier Deep Archive is the most cost-effective option that exists.

There are obviously many caveats, with many commenters rightfully arguing that shipping local storage devices elsewhere, or building a new NAS at an offsite location, would be significantly cheaper over a 10-year horizon. I am not here to dispute that. I am here to say that if someone is willing to stomach the cost of cloud storage for the convenience it provides, Glacier Deep Archive can be the cheapest possible option. The other major caveat is that it's only cheaper as a last resort that ideally should never actually be used. I look at it like an insurance policy with a deductible.

One problem, however, is that AWS is not a consumer tool; it's an enterprise tool. I work with AWS in my day-to-day work, so I had no problem quickly hacking together some scripts to upload my data, even if it wasn't the most elegant setup. But some commenters on my original post said they would be interested in S3 but didn't know where to start. Uploading in bulk to Glacier requires, at a bare minimum, familiarity with the CLI, and more realistically it requires coding knowledge and familiarity with the AWS SDK.
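For anyone curious what "hacking together some scripts" actually means, the core of a bulk upload is only a few lines of boto3 (the CLI equivalent is aws s3 cp with --storage-class DEEP_ARCHIVE). This is just a rough sketch with placeholder bucket and folder names, not my exact setup:

    # Rough sketch of a bulk upload straight into Glacier Deep Archive with boto3.
    # The bucket name and local folder below are placeholders, not my real setup.
    import os
    import boto3

    s3 = boto3.client("s3")
    bucket = "my-archive-bucket"   # placeholder
    root = "/mnt/media/archive"    # placeholder

    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            key = os.path.relpath(path, root)
            # StorageClass=DEEP_ARCHIVE writes the object directly into Glacier Deep Archive
            s3.upload_file(path, bucket, key, ExtraArgs={"StorageClass": "DEEP_ARCHIVE"})
            print(f"uploaded {key}")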

I already had the running idea of a personal web tool to help manage my Glacier archive, but with these comments in mind I shifted towards trying to make a more user-friendly, open-source app that could be used by others.

And so the S3 Media Archival Application was born. It currently supports what I consider the bare minimum for someone to bulk upload media archives to S3 and recover them if need be. I plan to keep improving it sporadically and to actively address bugs and feature requests if and when they pop up. For now it sits at version 0.1.1. I am a back-end developer by trade, and I know right now the UX is a little... rough. But I am confident it's functional.

At the end of the day, I mostly built this for my own use, but I would like to share it in the hope that others find it useful. If there are people who want to use S3 but don't have the prerequisite coding skills, I hope they can use this application. And for people who do have the coding skills, it could still make their lives easier.

Using S3 for this kind of media archive is not a backup. Crucially, there is no incremental restore, and full test restores are cost prohibitive. That's why I keep calling it an archive. It is made up entirely of data I never expect to change, like a finished season of a TV show or an artist's album. I believe it's well suited to this kind of task, and I can still run small-scale test restores on smaller media.
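A small-scale test restore is only a couple of API calls. Roughly something like this, with placeholder names:

    # Rough sketch of a small test restore: request retrieval of one archived object,
    # then check later whether the temporary copy is ready. Names are placeholders.
    import boto3

    s3 = boto3.client("s3")
    bucket = "my-archive-bucket"           # placeholder
    key = "music/some-album/track01.flac"  # placeholder

    # Deep Archive supports the Standard and Bulk retrieval tiers; Bulk is the cheapest
    s3.restore_object(
        Bucket=bucket,
        Key=key,
        RestoreRequest={"Days": 7, "GlacierJobParameters": {"Tier": "Bulk"}},
    )

    # Hours later: the Restore header flips to ongoing-request="false" when the
    # temporary copy is available for a normal download
    print(s3.head_object(Bucket=bucket, Key=key).get("Restore"))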

The other thing I like about archiving my media this way relates to a common sentiment I see:

I don't keep an offsite copy of my media data because it's too expensive and I can rebuild it again if I need to

My feeling about this is that rebuilding would be a daunting task. I have 1,400 individually downloaded albums, all painfully organized one by one. But I also have copies of TV shows, movies, and comics that might not be possible to find again. In this case, I could do a partial restore of my archive, say just download the music and the things I can't find elsewhere. This would avoid the bulk of the egress fees while still giving me the option of saving certain things from the original archive.
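A partial restore like that boils down to listing one prefix and requesting restores for just those keys. A rough sketch, with placeholder bucket and prefix names:

    # Rough sketch of a partial restore: request retrieval only for objects under one
    # prefix (e.g. just the music), leaving the rest of the archive untouched.
    # Bucket and prefix are placeholders; errors/retries aren't handled here.
    import boto3

    s3 = boto3.client("s3")
    bucket = "my-archive-bucket"  # placeholder
    prefix = "music/"             # placeholder: the slice of the archive to bring back

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            s3.restore_object(
                Bucket=bucket,
                Key=obj["Key"],
                RestoreRequest={"Days": 7, "GlacierJobParameters": {"Tier": "Bulk"}},
            )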

I don't think having an S3 Glacier archive is right for everybody and every use case, but I like it and I think it makes sense for my priorities. If you think it makes sense for you as well, hopefully this application might help :)

top 4 comments
[–] [email protected] 1 points 1 year ago (1 children)

Have you thought about using DynamoDB to store a catalog of the data?

[–] [email protected] 1 points 1 year ago

Currently the catalog is a simple H2 database, but using DynamoDB could be interesting.

I would personally be more inclined to update the application configuration to allow hosting your own relational database, so you could use the default H2, run a DB in the cloud, or self-host something.

Supporting DynamoDB would be difficult because the data layer is currently designed to be relational, although you could definitely argue that a relational DB is overkill for the simplicity of my schemas. Either way though, switching to NoSQL would be a significant refactor.
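To give a sense of what I mean by simple, the catalog doesn't need much more than a single table along these lines. This is just a hypothetical illustration (using sqlite3 as a stand-in), not the application's actual H2 schema:

    # Hypothetical illustration of how simple an archive catalog can be; this is not
    # the application's real H2 schema, just a sketch with sqlite3 standing in.
    import sqlite3

    conn = sqlite3.connect("catalog.db")
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS archived_object (
            id          INTEGER PRIMARY KEY,
            bucket      TEXT NOT NULL,
            s3_key      TEXT NOT NULL,
            size_bytes  INTEGER NOT NULL,
            uploaded_at TEXT NOT NULL,
            UNIQUE (bucket, s3_key)
        )
        """
    )
    conn.commit()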

[–] [email protected] 1 points 1 year ago

AWS is not a consumer tool; it's an enterprise tool. I work with AWS in my day-to-day work, so I had no problem quickly hacking together some scripts to upload my data, even if it wasn't the most elegant setup. But some commenters on my original post said they would be interested in S3 but didn't know where to start. Uploading in bulk to Glacier requires, at a bare minimum, familiarity with the CLI, and more realistically it requires coding knowledge and familiarity with the AWS SDK.

Someone chime in if I'm mistaken, but I know platforms like TrueNAS and QNAP QTS (and presumably the OSes from Synology, Asustor, and TerraMaster) have built-in functionality for automatically uploading folders or datasets to B2 / S3 / etc. At least on TrueNAS and QNAP, you just give it credentials, point to a bucket, and go.

It's great that you're doing this, but are you focusing more on people trying to DIY a solution? Or are you saying what you've put together is better than what TrueNAS et al. have built in?

[–] [email protected] 1 points 1 year ago

/u/Madman200 have you tried using the 1TB monthly free egress that CloudFront offers to handle downloading the exports?

I've been experimenting with using Deep Archive myself, and I suspect that if I:

  1. Restore a Deep Archive object
  2. Copy that object to a new S3 bucket
  3. Set up a CloudFront distribution with an S3 origin
  4. Download the object through the CloudFront distribution

the download would consume the "Always Free" 1TB bandwidth instead of being considered normal data egress.

I'm pretty sure 1TB out to a single IP address is an unintended use, but CloudFront + S3 looks like a normal CDN-type use to me.

I'm just waiting for a restore to complete before trying this and seeing what shows up on my AWS bill.
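For anyone following along, steps 1 and 2 are roughly this in boto3 (the bucket names and key are placeholders); steps 3 and 4 are just a CloudFront distribution pointed at the second bucket and a normal HTTP download:

    # Rough sketch of steps 1-2: restore the Deep Archive object, then (once the
    # restore completes) copy it into a regular bucket that CloudFront can serve.
    # Bucket names and the key are placeholders.
    import boto3

    s3 = boto3.client("s3")
    archive_bucket = "my-archive-bucket"  # placeholder: Deep Archive bucket
    serve_bucket = "my-restore-bucket"    # placeholder: bucket behind CloudFront
    key = "movies/some-movie.mkv"         # placeholder

    # Step 1: request the restore (Deep Archive only offers Standard and Bulk tiers)
    s3.restore_object(
        Bucket=archive_bucket,
        Key=key,
        RestoreRequest={"Days": 7, "GlacierJobParameters": {"Tier": "Bulk"}},
    )

    # Step 2, once head_object shows ongoing-request="false": copy the restored
    # object into the serving bucket as a normal STANDARD-class object
    s3.copy(
        {"Bucket": archive_bucket, "Key": key},
        serve_bucket,
        key,
        ExtraArgs={"StorageClass": "STANDARD"},
    )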