this post was submitted on 29 May 2024
1674 points (99.8% liked)

Technology

[–] [email protected] 97 points 6 months ago (3 children)

A quick search indicates that they’ve archived ~100PB of data.

Now I’m trying to come up with a way to archive the internet archive in a peer-to-peer/federated fashion while maintaining fidelity as much as possible…

[–] [email protected] 32 points 5 months ago

That’s what IPFS is for. It’s ideal for that kind of stuff

[–] [email protected] 16 points 5 months ago (4 children)

Can DDoS attacks actually erase or corrupt stored data, though? There’s no way they’re running all of this on a single server with hundreds of PB of storage, right?

[–] [email protected] 42 points 5 months ago

No. It affects availability. Not integrity or confidentiality.

[–] [email protected] 37 points 5 months ago

DDoS attacks block connections to the servers; they don't actually harm the data itself. You could probably overload a server to the point of it shutting down, which might affect data in transit, but data at rest usually wouldn't be harmed in any way, unless through some freak accident a server crash rendered a drive unusable. Even then, servers are usually fully redundant and have RAID systems in place that mirror the data, so there's a kind of dual redundancy. Plus there are actual backups on top of that, though with that amount of data they might have a priority system in place, so not everything may be fully backed up.

[–] pythonoob 5 points 5 months ago

Not by itself, technically, as far as I know.

[–] [email protected] 3 points 5 months ago

From what I've learned, a system can have a vulnerability where a DDoS attack overloads it and causes a reset or fault. At that point, it may be possible to inject code and initiate a breach or takeover.

I can't find the documentation on it so... Take it with a grain of salt. I thought I learned about it in college. Unsure.

[–] [email protected] 7 points 5 months ago (3 children)
[–] [email protected] 26 points 5 months ago* (last edited 5 months ago) (2 children)

That wouldn't distribute the load of storing it though. Anyone on the torrent would need to set aside 100PB of storage for it, which is clearly never going to happen.

You'd want a federated (or otherwise distributed) storage scheme where thousands of people could each contribute a smaller portion of storage, while also being accessible to any federated client. 100,000 clients each contributing 1TB of storage would be enough to get you one copy of the full data set with no redundancy. Ideally you'd have more than that so that a single node going down doesn't mean permanent data loss.

[–] [email protected] 7 points 5 months ago (1 children)

Not sure you'd be able to find 100k people to host a 1TB server though. Plus, redundancy would be better anyway since it would provide more download avenues in case some node is slow or has gone down.

[–] [email protected] 7 points 5 months ago

Yes, it's a big ask, because it's a lot of data. Any distributed solution will require either a large number of people or a huge commitment of storage capacity. Both 100,000 people and 1TB per node are a lot to ask for, but that's basically the minimum viable level for that much data. Ten million people each committing 50GB would be great: that's 500PB, or five copies' worth of capacity, so in the best case you could lose 80% of the nodes before losing data. But that's not a realistic number to expect to participate.
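The capacity arithmetic in this exchange can be checked in a few lines. This is only a sketch under stated assumptions: decimal units (1PB = 10^15 bytes), a 100PB data set, and a hypothetical `tolerable_loss` helper that assumes ideal data placement, which a real system would not achieve without careful replication or erasure coding.

```python
GB = 10**9
TB = 10**12
PB = 10**15

DATASET = 100 * PB  # approximate archive size quoted in this thread


def copies(nodes: int, per_node_bytes: int, dataset_bytes: int = DATASET) -> float:
    """How many full copies of the data set the pooled storage could hold."""
    return nodes * per_node_bytes / dataset_bytes


def tolerable_loss(nodes: int, per_node_bytes: int, dataset_bytes: int = DATASET) -> float:
    """Best-case fraction of nodes that can disappear while the remaining raw
    capacity still covers one copy. Assumes ideal placement (an assumption)."""
    c = copies(nodes, per_node_bytes, dataset_bytes)
    if c <= 1:
        return 0.0  # one copy or less: any lost node can mean lost data
    return 1 - 1 / c


print(copies(100_000, 1 * TB))             # 1.0 -> a single copy, no redundancy
print(copies(10_000_000, 50 * GB))         # 5.0 -> five copies' worth of capacity
print(tolerable_loss(10_000_000, 50 * GB)) # roughly 0.8, i.e. 80% in the best case
```

With naive random placement the real tolerance would be lower, since some chunks would end up with every replica on nodes that happen to leave.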

[–] [email protected] 6 points 5 months ago* (last edited 5 months ago) (1 children)

> That wouldn't distribute the load of storing it though. Anyone on the torrent would need to set aside 100PB of storage for it, which is clearly never going to happen.

Torrents are designed to work with incomplete data. You can store and verify just a few chunks without any problem.

> You'd want a federated (or otherwise distributed) storage scheme where thousands of people could each contribute a smaller portion of storage, while also being accessible to any federated client.

Torrents. You may not have the entirety of the data, but you can request what you need from the swarm. The only limitation is that you need to know which chunk holds the data you need.

> Ideally you'd have more than that so that a single node going down doesn't mean permanent data loss.

True.
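As a rough illustration of "knowing which chunk holds the data you need": if you know a file's byte offset within the torrent payload, the piece indices follow from integer division. This sketch assumes a BitTorrent v1-style layout where files are concatenated into one byte stream (v2 aligns files to piece boundaries instead); the file names, sizes, and the tiny piece length are made up for the example.

```python
from itertools import accumulate


def pieces_for_range(offset: int, length: int, piece_length: int) -> range:
    """Piece indices covering bytes [offset, offset + length) of the payload."""
    if length <= 0:
        return range(0)
    first = offset // piece_length
    last = (offset + length - 1) // piece_length
    return range(first, last + 1)


def pieces_per_file(files: list[tuple[str, int]], piece_length: int) -> dict[str, range]:
    """Map each (name, size) entry to the pieces it touches, assuming the
    files are laid out back to back in list order (v1-style assumption)."""
    offsets = [0, *accumulate(size for _, size in files)]
    return {
        name: pieces_for_range(offset, size, piece_length)
        for (name, size), offset in zip(files, offsets)
    }


# Hypothetical file list and an unrealistically small piece length
files = [("a.warc.gz", 1000), ("b.warc.gz", 5000), ("c.txt", 300)]
for name, pieces in pieces_per_file(files, piece_length=2048).items():
    print(name, list(pieces))
# a.warc.gz [0]
# b.warc.gz [0, 1, 2]
# c.txt [2, 3]
```

Note that neighbouring files can share a piece in this layout, so fetching one file may mean downloading a little of the next one too.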

[–] [email protected] 4 points 5 months ago (1 children)

True. Until you responded I actually completely forgot that you can selectively download torrents. Would be nice to not have to manually manage that at the user level though.

Some kind of bespoke torrent client that managed it under the hood could probably work without having to invent your own peer-to-peer protocol for it. I wonder how long it would take to compute the torrent hash values for 100PB of data? :D

[–] [email protected] 1 points 5 months ago* (last edited 5 months ago)

~300MB/s of SHA-256 (used in BitTorrent v2) on one core of a 13-year-old i5. Newer cores can do about half a gigabyte per second each. At 0.5GB/s, 100PB works out to roughly 2×10^8 seconds, so a bit over 6 years on one core, or about 2 years on 3 cores.* (A few days on one core is what you'd get for 100TB, not 100PB.)

* assuming no additional performance penalty from increased power consumption and memory bandwidth usage

My guess is that storage bandwidth would be the biggest bottleneck.

Found a relatively old article (in Russian; just search for openssl and look at the graph that mentions SHA-512, which is SHA-2 too) that says the i7-2500's all-core throughput is slightly over 1GB/s.
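An estimate like this is easy to sanity-check. The sketch below computes the hashing time for both 100TB and 100PB at the per-core throughputs quoted above, assuming decimal units and perfect scaling across cores; `measure_sha256_throughput` is a hypothetical helper for getting a rough number on your own machine, and its result will vary with hardware.

```python
import hashlib
import time

MB = 10**6
TB = 10**12
PB = 10**15


def hashing_days(total_bytes: int, bytes_per_second: float, cores: int = 1) -> float:
    """Days needed to hash total_bytes, assuming throughput scales linearly with cores."""
    return total_bytes / (bytes_per_second * cores) / 86_400


def measure_sha256_throughput(size_mib: int = 16, rounds: int = 4) -> float:
    """Rough single-core SHA-256 throughput in bytes/s, measured in memory."""
    buf = bytes(size_mib * 1024 * 1024)
    start = time.perf_counter()
    for _ in range(rounds):
        hashlib.sha256(buf).digest()
    elapsed = time.perf_counter() - start
    return rounds * len(buf) / elapsed


print(f"{hashing_days(100 * TB, 500 * MB):.1f} days")       # about 2.3 days for 100TB
print(f"{hashing_days(100 * PB, 500 * MB):.0f} days")       # about 2315 days for 100PB
print(f"{hashing_days(100 * PB, 500 * MB) / 365:.1f} years") # about 6.3 years
print(f"{hashing_days(100 * PB, 300 * MB):.0f} days")       # about 3858 days at 300MB/s
print(f"local SHA-256: {measure_sha256_throughput() / MB:.0f} MB/s")
```

In practice the work would be spread over many machines, and reading 100PB off disk would likely dominate the time anyway.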

[–] [email protected] 6 points 5 months ago (2 children)

It’d be a lot more complicated than that, I think, if one wanted to effectively be able to address it like a file system, as well as holistically verify the integrity of the data and prevent unintentional and unwanted tampering.

[–] [email protected] 16 points 5 months ago (1 children)

> as well as holistically verify the integrity of the data and prevent unintentional and unwanted tampering

Torrents. A torrent's hash is derived from the hashes of its chunks, so you just verify the chunks.

> if one wanted to effectively be able to address it like a file system

https://github.com/johang/btfs

[–] [email protected] 2 points 5 months ago
[–] [email protected] 3 points 5 months ago (1 children)

IA already serves all of its uploads as torrents.

[–] [email protected] 1 points 5 months ago

There is this, yes.