this post was submitted on 28 May 2024
75 points (83.8% liked)
Technology
60113 readers
2560 users here now
This is a most excellent place for technology news and articles.
Our Rules
- Follow the lemmy.world rules.
- Only tech related content.
- Be excellent to each another!
- Mod approved content bots can post up to 10 articles per day.
- Threads asking for personal tech support may be deleted.
- Politics threads may be removed.
- No memes allowed as posts, OK to post as comments.
- Only approved bots from the list below, to ask if your bot can be added please contact us.
- Check for duplicates before posting, duplicates may be removed
Approved Bots
founded 2 years ago
MODERATORS
you are viewing a single comment's thread
view the rest of the comments
view the rest of the comments
It's important to distinguish between lossy and lossless algorithms. What was specifically requested in this case is a lossless algorithm which means that you must be able to perfectly reassemble the original input given only the compressed output. It must be an exact match, not a close match, but absolutely identical.
Lossless algorithms rely generally on two tricks. The first is removing common data. If for instance some format always includes some set of bytes in the same location you can remove them from the compressed data and rely on the decompression algorithm to know it needs to reinsert them. From a signal theory perspective those bytes represent noise as they don't convey meaningful data (they're not signal in other words).
The second trick is substituting shorter sequences for common longer ones. For instance if you can identify many long sequences of data that occur in multiple places you can create a lookup index and replace each of those long sequences with the shorter index key. The catch is that you obviously can't do this with every possible sequence of bytes unless the data is highly regular and you can use a standardized index that doesn't need to be included in the compressed data. Depending on how poorly you do in selecting the sequences to add to your index, or how unpredictable the data to be compressed is you can even end up taking up more space than the original once you account for the extra storage of the index.
From a theory perspective everything is classified as either signal or noise. Signal has meaning and is highly resistant to compression. Noise does not convey meaning and is typically easy to compress (because you can often just throw it away, either because you can recreate it from nothing as in the case of boilerplate byte sequences, or because it's redundant data that can be reconstructed from compressed signal).
Take for instance a worst case scenario for compression, a long sequence of random uniformly distributed bytes (perhaps as a one time pad). There's no boilerplate to remove, and no redundant data to remove, there is in effect no noise in the data only signal. Your only options for compression would be to construct a lookup index, but if the data is highly uniform it's likely there are no long sequences of repeated bytes. It's highly likely that you can create no index that would save any significant amount of space. This is in effect nearly impossible to compress.
Modern compression relies on the fact that most data formats are in fact highly predictable with lots of trimmable noise by way of redundant boilerplate, and common often repeated sequences, or in the case of lossy encodings even signal that can be discarded in favor of approximations that are largely indistinguishable from the original.