Data Hoarder


We are digital librarians. Among us are represented the various reasons to keep data -- legal requirements, competitive requirements, uncertainty of permanence of cloud services, distaste for transmitting your data externally (e.g. government or corporate espionage), cultural and familial archivists, internet collapse preppers, and people who do it themselves so they're sure it's done right. Everyone has their reasons for curating the data they have decided to keep (either forever or For A Damn Long Time (tm) ). Along the way we have sought out like-minded individuals to exchange strategies, war stories, and cautionary tales of failures.

founded 1 year ago


I built an open source self-hosted web application designed to make archiving to S3 Deep Archive simpler and more accessible. (alien.top)

submitted 1 year ago by [email protected] to c/[email protected]


A few months ago I made a post detailing the cost comparison between AWS S3 and Backblaze B2. The conclusion I came to was that, for disaster recovery cloud storage, Glacier Deep Archive is the most cost-effective option available.

There are obviously many caveats, with many commenters rightfully arguing that shipping local storage devices elsewhere, or building a new NAS at an offsite location, would be significantly cheaper over a 10-year horizon. I am not here to dispute that. I am here to say that if someone is willing to stomach the cost of cloud storage for the convenience it provides, Glacier Deep Archive can be the cheapest possible option. The other major caveat is that it's only cheaper as a last resort that ideally should never actually be used. I look at it like an insurance policy with a deductible.
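To make the insurance analogy concrete, here is a rough sketch of the arithmetic: the monthly storage bill is the premium, and the retrieval plus egress charge for a restore is the deductible. The function name and every rate below are placeholders I made up for illustration, not real AWS prices, so plug in the current numbers for your region before drawing any conclusions.

```python
def archive_cost(size_gb, months, storage_rate, retrieval_rate=0.0,
                 egress_rate=0.0, full_restores=0):
    """Estimate the total cost in dollars of a cloud archive.

    storage_rate   -- dollars per GB per month (the "premium")
    retrieval_rate -- dollars per GB retrieved from cold storage
    egress_rate    -- dollars per GB transferred out to the internet
    full_restores  -- how many times the whole archive is pulled back down
    """
    premium = size_gb * storage_rate * months
    deductible = full_restores * size_gb * (retrieval_rate + egress_rate)
    return premium + deductible


# Hypothetical rates, for illustration only.
SIZE_GB = 4000
TEN_YEARS = 120  # months

never_restored = archive_cost(SIZE_GB, TEN_YEARS, storage_rate=0.001)
restored_once = archive_cost(SIZE_GB, TEN_YEARS, storage_rate=0.001,
                             retrieval_rate=0.003, egress_rate=0.09,
                             full_restores=1)

print(f"never restored: ${never_restored:.2f}")
print(f"restored once:  ${restored_once:.2f}")
```

The gap between the two figures is the deductible, which is why this only makes sense as a copy of last resort.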

One problem, however, is that AWS is an enterprise tool, not a consumer one. I work with AWS day to day, so I had no problem quickly hacking together some scripts to upload my data, even if it wasn't the most elegant setup. But some commenters on my original post said they would be interested in S3 but didn't know where to start. Uploading in bulk to Glacier requires, at a bare minimum, familiarity with the CLI, and more likely coding knowledge and familiarity with the AWS SDK.
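For a sense of what a hand-rolled script involves, here is a minimal sketch using boto3, the AWS SDK for Python. It is not the application's code. It assumes boto3 is installed and AWS credentials are already configured; the function names and the bucket name in the usage note are hypothetical.

```python
from pathlib import Path


def plan_uploads(root, prefix=""):
    """Pair every file under `root` with the S3 key it would be stored at.

    Keys keep the path relative to `root`, so the folder layout survives.
    """
    root = Path(root)
    plan = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            key = prefix + path.relative_to(root).as_posix()
            plan.append((str(path), key))
    return plan


def upload_to_deep_archive(root, bucket, prefix=""):
    """Upload every planned file straight into the Deep Archive storage class."""
    import boto3  # third-party; imported here so planning works without it

    s3 = boto3.client("s3")
    for filename, key in plan_uploads(root, prefix):
        # StorageClass is passed through ExtraArgs on managed uploads.
        s3.upload_file(filename, bucket, key,
                       ExtraArgs={"StorageClass": "DEEP_ARCHIVE"})
```

Calling something like `upload_to_deep_archive("/mnt/media/music", "my-archive-bucket", prefix="music/")` would then push the whole folder. Even this small version leaves out retries, progress tracking, and any record of what was uploaded, which is roughly the gap the application is meant to fill.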

I already had the running idea of a personal web tool to help manage my Glacier archive, but with these comments in mind I shifted towards trying to make a more user-friendly open source app that could be used by others.

And so the S3 Media Archival Application was born. It currently supports what I consider to be the bare minimum for someone to bulk upload media archives to S3, and recover them if need be. I plan on sporadically continuing to improve it, and actively trying to address bugs and feature requests if and when they pop up. As such, for now it is sitting at version 0.1.1. I am a back-end developer by trade, and I know right now the UX is a little... rough. But I am confident it's functional.

At the end of the day, I did mostly build this for my own use, but I would like to share it and hope others may find it useful. If there are people who want to use S3 but don't have the prerequisite coding skills, I hope they can use this application. And for people who do have the prerequisite coding skills, this application could still make their life easier.

Using S3 for this kind of media archive is not a backup. Crucially, there is no incremental kind of restore, and full test restores are cost prohibitive. That is why I keep calling it an archive. It is made up entirely of data that I never expect to change, like a finished season of a TV show, or an artist's album. I believe it's well suited to this kind of task, and I can still run small-scale test restores on smaller media.

The other thing I like about archiving my media in this way is that it answers a common sentiment I see:

"I don't keep an offsite copy of my media data because it's too expensive and I can rebuild it again if I need to."

My feeling about this is that rebuilding would be a daunting task. I have 1400 individually downloaded albums, all painstakingly organized one by one. But I also have copies of TV shows, movies, and comics that might not be possible to find again. In this case, I could do a partial restore of my archive, say just download the music and the stuff I can't find elsewhere. This would avoid the bulk of the egress fees, but still give me at least the option of saving certain things from the original archive.
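A partial restore like that could be sketched as follows: filter a catalog of archived objects by key prefix, check how many bytes that selection covers before paying for anything, then ask S3 to thaw only those objects. This is an illustration under assumptions, not how the application does it. The catalog format, prefixes, and function names are hypothetical, and the boto3 call assumes credentials are configured.

```python
def select_for_restore(catalog, prefixes):
    """Pick the archived objects whose key starts with any wanted prefix.

    catalog  -- iterable of (key, size_in_bytes) pairs
    prefixes -- e.g. ["music/", "comics/rare/"]
    Returns the selected keys and their combined size in bytes.
    """
    wanted = tuple(prefixes)
    chosen = [(key, size) for key, size in catalog if key.startswith(wanted)]
    total_bytes = sum(size for _, size in chosen)
    return [key for key, _ in chosen], total_bytes


def request_restore(bucket, keys, days=7, tier="Bulk"):
    """Ask S3 to make temporary copies of the archived objects available.

    Deep Archive retrievals are asynchronous; the objects only become
    downloadable some hours after the request is accepted.
    """
    import boto3  # third-party; imported here so selection works without it

    s3 = boto3.client("s3")
    for key in keys:
        s3.restore_object(
            Bucket=bucket,
            Key=key,
            RestoreRequest={
                "Days": days,  # how long the restored copy stays available
                "GlacierJobParameters": {"Tier": tier},
            },
        )
```

The byte total from `select_for_restore` is the number to multiply by the retrieval and egress rates, so the cost of the restore is known before `request_restore` is ever called.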

I don't think having an S3 Glacier archive is right for everybody and every use case, but I like it and I think it makes sense for my priorities. If you think it makes sense for you as well, hopefully this application might help :)


[–] [email protected] 1 points 1 year ago (1 children)

Have you thought about using DynamoDB to store a catalog of the data?

[–] [email protected] 1 points 1 year ago

Currently the catalog is a simple H2 database, but using DynamoDB could be interesting.

I would personally be more inclined to update the application configuration to allow hosting your own relational database, so you could use the default H2, run a DB in the cloud, or self-host something.

Supporting DynamoDB would be difficult because the data layer is currently designed to be relational, although you could definitely argue that a relational DB is overkill for the simplicity of my schemas. Either way tho, switching to NoSQL would be a significant refactor.