this post was submitted on 20 Sep 2024

12 points (100.0% liked)

Can any file and its data be converted to a String or plaintext? (lemmy.world)

submitted 7 months ago* (last edited 7 months ago) by [email protected] to c/[email protected]


I've noticed some files I opened in a text editor have all kinds of crazy unrenderable chars


[–] djehuti 20 points 7 months ago (1 children)

You're looking for https://en.m.wikipedia.org/wiki/Character_encoding, which explains the funny characters.

[–] [email protected] 7 points 7 months ago* (last edited 7 months ago) (2 children)

Spank you, much :D

[–] AsudoxDev 3 points 7 months ago (1 children)

Spank?

[–] [email protected] 2 points 7 months ago

Spank

[–] [email protected] 2 points 7 months ago

Ace Ventura on lemmy trying to understand file encodings

[–] [email protected] 12 points 7 months ago

Can it? Sure, almost any arrangement of bits can be converted into some kind of Unicode text. Can it be converted to something meaningful or readable? Not necessarily. Some formats are plain text (.txt, .ini, .json, .html for some random examples) and are meant to be read by humans; others are binary formats that are only meaningful when decoded by a computer into specific data structures inside a piece of software.
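A minimal Python sketch of that distinction, assuming UTF-8 as the "text" encoding. The byte values and the helper name `try_utf8` are made up for illustration.

```python
from typing import Optional

# Bytes from a plain-text format decode cleanly as UTF-8.
text_bytes = '{"name": "example"}'.encode("utf-8")

# Arbitrary binary bytes (illustrative values) are often not valid UTF-8.
binary_bytes = bytes([0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10])


def try_utf8(data: bytes) -> Optional[str]:
    """Return the decoded text, or None if the bytes are not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


print(try_utf8(text_bytes))    # the JSON text comes back unchanged
print(try_utf8(binary_bytes))  # None

# Latin-1 assigns a character to every possible byte value, so this decode
# never fails, but the resulting "text" carries no readable meaning.
as_latin1 = binary_bytes.decode("latin-1")
print(repr(as_latin1))
```

So "can it be converted" and "does it mean anything as text" really are separate questions: the Latin-1 decode always succeeds, yet tells a human nothing.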

[–] [email protected] 6 points 7 months ago

At the end of the day data is just binary, i.e. it's composed of 0s and 1s. What those 0s and 1s represent is mostly irrelevant to this discussion. The short version is that 01000001 can mean A, or it can mean that a given pixel is 65/255 red, or it can be part of the signal that tells a speaker how to vibrate, etc.
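As a rough illustration in Python, here is that same byte read three different ways. The 8-bit colour channel (maximum value 255) and the little-endian integer layout are assumptions picked for the example, not something any particular file format is claimed to use.

```python
import struct

byte = bytes([0b01000001])

as_letter = byte.decode("ascii")   # 'A'
as_number = byte[0]                # 65
as_intensity = byte[0] / 255       # fraction of full brightness in an 8-bit channel

# Four copies of the same byte, read as one little-endian 32-bit integer.
(as_uint32,) = struct.unpack("<I", byte * 4)

print(as_letter, as_number, round(as_intensity, 3), hex(as_uint32))
```

Nothing in the bits says which reading is "right"; that comes from the file format and the program doing the reading.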

So what happens when you open a file that's not text in a text editor? Well, some of the 0s and 1s decode to gibberish, or to characters that are not meant to be printed. Fun fact: you should be able to do this the other way around too, i.e. open a text file as an image. Again it will be gibberish, and most likely it will not load at all, because image files carry a lot of information about size, compression, etc., and if that is incorrect the program won't know what to do. A text editor, on the other hand, can show some character for almost any byte, so opening a file as text nearly always works, although your editor might show weird things in the places where there's a non-printable character.

[–] [email protected] 6 points 7 months ago* (last edited 7 months ago) (1 children)

Yes, see Binary-to-text encoding (e.g., Base64).

[–] [email protected] 1 points 7 months ago* (last edited 7 months ago) (3 children)

Can you comment on the specific makeup of a "rendered" audio file in plaintext, how is the computer representing every little noise bit of sound at any given point, the polyphony etc?

What are the conventions of such representation? How can a spectrogram tell pitches are where they are, how is the computer representing that?

Is it the same to view plaintext as analysing it with a hex-viewer?

[–] [email protected] 8 points 7 months ago

There are two things at play here.

MP3 (or WAV, OGG, FLAC etc.) provides a way to encode polyphony and stereo and such into a sequence of bytes.

And then separately, there's Unicode (or ASCII) for encoding letters into bytes. These are just big tables which say e.g.:

01000001 = uppercase 'A'
01000010 = uppercase 'B'
01100001 = lowercase 'a'

So, what your text editor does is look at the sequence of bytes that the MP3 encoder produced, look each one up in its table, and somewhat erroneously interpret them as individual letters.
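A small Python sketch of that table lookup, followed by what happens when the same lookup is applied to bytes that were never text. The `audio_like` bytes are made up for illustration, and `errors="replace"` only approximates what an editor does; real editors differ in their fallback behaviour.

```python
# The table lookup: an 8-bit pattern is a number, and the number picks a character.
for bits in ("01000001", "01000010", "01100001"):
    print(bits, "=", chr(int(bits, 2)))

# Made-up bytes standing in for the start of an audio file.
audio_like = bytes([0x49, 0x44, 0x33, 0x04, 0x00, 0xFF, 0xFB, 0x90])

# Decoding as UTF-8 with errors="replace" swaps every undecodable byte for
# U+FFFD, the replacement character many editors draw as a question-mark box.
shown = audio_like.decode("utf-8", errors="replace")
print(repr(shown))
```

The first three bytes happen to land on printable letters; the rest come out as control characters or replacement marks, which is the "crazy unrenderable chars" effect from the original question.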

[–] [email protected] 6 points 7 months ago

I think you are conflating a few different concepts here.

Can you comment on the specific makeup of a “rendered” audio file in plaintext, how is the computer representing every little noise bit of sound at any given point, the polyphony etc?
What are the conventions of such representation? How can a spectrogram tell pitches are where they are, how is the computer representing that?

This is a completely separate concern from how data can be represented as text, and will vary by audio format. The "simplest", PCM encoded audio like in a .wav file, doesn't really concern itself at all with polyphony and is just a quantised representation of the audio wave amplitude at any given instant in time. It samples that tens of thousands of times per second. Whether it's a single pure tone or a full symphony the density of what's stored is the same. Just an air-pressure-over-time graph, essentially.
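To make the "air-pressure-over-time graph" idea concrete, here is a minimal Python sketch that produces one second of PCM-style samples for a pure tone. The sample rate, the 440 Hz tone, and the signed 16-bit sample size are assumptions chosen for the example; a real .wav file would also wrap these bytes in a header.

```python
import math
import struct

SAMPLE_RATE = 44100   # samples per second (assumed)
FREQUENCY = 440.0     # Hz, an arbitrary test tone
AMPLITUDE = 32767     # largest value a signed 16-bit sample can hold


def pcm_samples(count):
    """Quantised amplitude of a sine wave at `count` equally spaced instants."""
    return [
        round(AMPLITUDE * math.sin(2 * math.pi * FREQUENCY * i / SAMPLE_RATE))
        for i in range(count)
    ]


samples = pcm_samples(SAMPLE_RATE)  # one second of audio

# Pack the numbers as little-endian signed 16-bit integers: two bytes per sample.
raw = struct.pack("<%dh" % len(samples), *samples)
print(len(samples), "samples ->", len(raw), "bytes")
```

A full symphony recorded for one second at the same settings would produce exactly as many bytes; only the numbers inside would differ.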

Is it the same to view plaintext as analysing it with a hex-viewer?

"Plaintext" doesn't really have a fixed definition in this context. It can be the same as looking at it in a hex viewer, if your "plaintext" representation is hexadecimal encoding. Binary data, like in audio files, isn't plaintext, and opening it directly in a text editor is not expected to give you a useful result, or even a consistent result. Different editors might show you different "text" depending on what encoding they fall back on, or how they represent unprintable characters.

There are several methods of representing binary data as text, such as hexadecimal, base64, or uuencode, but none of these representations, if saved as-is, is the original file, strictly speaking.
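A short Python sketch of the hexadecimal case, using four arbitrary bytes. It shows both halves of the point: the text form converts back to the original exactly, but the text itself is a different, longer sequence of bytes.

```python
data = bytes([0x00, 0xFF, 0x10, 0x41])

hex_text = data.hex()               # '00ff1041'
restored = bytes.fromhex(hex_text)  # back to the original four bytes

print(hex_text, restored == data)

# Saved as-is, the hex text is eight ASCII characters, not the original four bytes.
print(len(hex_text.encode("ascii")), "bytes of text for", len(data), "bytes of data")
```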

[–] [email protected] 3 points 7 months ago (1 children)

Most binary-to-text encodings don’t attempt to make the text human-readable—they’re just intended to transmit the data over a text-only medium to a recipient who will decode it back to the original binary format.

[–] [email protected] 1 points 7 months ago* (last edited 7 months ago) (2 children)

I do understand I'm not able to read it myself, I'm more curious about the architecture of how that data is represented and stored and conceptually how such representation is practically organized/reified...

[–] [email protected] 3 points 7 months ago* (last edited 7 months ago)

In Base64, the original binary data is split into six-bit chunks (e.g., 100101), each of which is a number from 0 to 63 in decimal. These are just mapped to characters in order:

000000 = A,
000001 = B,
000010 = C,
000011 = D,

etc.—it goes through the capital letters first, then lower-case letters, then digits, then “+” and “/”. It’s so simple you could do it by hand from the above description, if you were looking at the data in binary format.
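The by-hand procedure above can be sketched in Python and compared against the standard library's `base64` module. The function name `b64_by_hand` is hypothetical. The handling of input that does not divide evenly into six-bit chunks (zero bits appended, then `=` characters) is part of standard Base64 but is not covered in the description above.

```python
import base64

# Capital letters, lower-case letters, digits, then "+" and "/".
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def b64_by_hand(data: bytes) -> str:
    # Write every byte out as 8 binary digits.
    bits = "".join(format(b, "08b") for b in data)
    # Pad with zero bits so the total length is a multiple of 6.
    bits += "0" * (-len(bits) % 6)
    # Map each six-bit chunk to its character in the alphabet.
    chars = [ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6)]
    # Standard Base64 pads the output with "=" to a multiple of 4 characters.
    return "".join(chars) + "=" * (-len(chars) % 4)


print(b64_by_hand(b"Man"))                               # TWFu
print(base64.b64encode(b"Man").decode("ascii"))          # TWFu
```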

[–] [email protected] 2 points 7 months ago (1 children)

One representation of a sound wave is a sequence of amplitudes, expressed as binary values. Each sequential chunk of N bits is a number, and that number represents the amplitude of the sound signal at a moment in time. Those moments in time are spaced at equal intervals. One common sampling rate is 44.1 kHz.

That number is chosen because of the Nyquist-Shannon sampling theorem, in combination with the fact that humans tend to be able to hear sounds up to about 20 kHz.

The sampling theorem says that if you want to reproduce a signal containing information at up to X frequency, you need to sample it at more than 2X frequency.
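One way to see why the limit matters is to sample a tone that is too high for the sampling rate. In this Python sketch, a 30 kHz cosine sampled at 44.1 kHz produces the same numbers as a 14.1 kHz cosine, so after sampling the two cannot be told apart. The 30 kHz figure is an arbitrary choice for the example.

```python
import math

SAMPLE_RATE = 44100  # Hz


def sample_cosine(freq_hz, count):
    """Sample a cosine of the given frequency at SAMPLE_RATE."""
    return [math.cos(2 * math.pi * freq_hz * n / SAMPLE_RATE) for n in range(count)]


too_high = sample_cosine(30000, 50)             # above half the sampling rate
alias = sample_cosine(SAMPLE_RATE - 30000, 50)  # 14100 Hz, below half the rate

# The two lists agree up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(too_high, alias))
print(max_diff)
```

This effect is called aliasing. Keeping everything in the signal below half the sampling rate is what rules it out.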

To learn more about this topic, look for texts, classes, or videos on “signal processing”. It’s often taught in classes that also cover electronic circuits.

Here is an example of such a text

That’s pretty dense reading, but if you’re willing to stop and learn any math you encounter while reading it, it will probably blow your mind into a whole new level of understanding the world.

[–] [email protected] 2 points 7 months ago (1 children)

I honestly wish I had gotten into all the science and physics of signal processing, taken calculus etc, I feel like I'll pick up a lot of the more qualitative stuff over time particularly if I'm able to apply it in building certain apps that do some novel manipulations and obviously some of that will require me to get an operational understanding of how to put all these blocks together.

[–] [email protected] 2 points 7 months ago (1 children)

You still can. Worst case, you spend $80 now and then on a textbook. There’s no reason you can’t buy a physics or calculus textbook and just start reading it. Costs about the same as an expensive dinner for two.

Best case, you just learn it for free or for the cost of a Khan Academy membership.

You’re not limited to surface level understanding. You can develop as deep an understanding of any topic as anyone else. In fact, I would wager an adult who knows how to work can probably learn math and physics at a much deeper level than a college engineering student, if only because they can take their time and absorb everything fully.

Sounds like you might be a coder. Consider the level of code quality people achieve in hobby projects: often much better than in a professional setting because in the pro setting there’s always a time and budget constraint. In a hobby project, one can polish and polish and take all the time they want.

It’s never too late to give yourself a solid science education.

[–] [email protected] 1 points 7 months ago

It’s been 20 years since I bought textbooks and they’ve doubled in price (now about $150 for a physics textbook).

But this one’s on sale: https://a.co/d/8zxWC8B

[–] [email protected] 4 points 7 months ago (1 children)

technically, yes. any binary data, printable or not, can be encoded using just 64 printable characters. but the resulting string won't be english or any other human language.

[–] [email protected] 1 points 7 months ago* (last edited 7 months ago) (2 children)

But it still contains the actual data in a faithfully reproducible/usable way?

[–] [email protected] 6 points 7 months ago

Yes. Decoding a base64 encoded string will give you back the exact original data.

Importantly though, this isn't what you're seeing when you open files in a text editor as you describe in your original post, and if you copied the text of those files and saved a new copy it's very likely that it would not reproduce correctly.
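A Python sketch of the difference. The "editor" here is simulated by decoding as UTF-8 with replacement characters and saving the result again, which is an assumption about editor behaviour; actual editors vary, but the loss of information is typical.

```python
import base64

original = bytes(range(256))  # every possible byte value once

# Base64 round trip: encode to text, decode back to bytes.
via_base64 = base64.b64decode(base64.b64encode(original))

# Simulated copy-and-save through a text editor: undecodable bytes become
# U+FFFD on the way in, and that replacement character is what gets saved.
via_editor = original.decode("utf-8", errors="replace").encode("utf-8")

print(via_base64 == original)  # True
print(via_editor == original)  # False
```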

[–] [email protected] 4 points 7 months ago

yes, this method doesn't lose any bits. one of its primary uses was email, which used to be strictly text-only.

[–] [email protected] 3 points 7 months ago (1 children)

Are those binary files by any chance?

[–] [email protected] 2 points 7 months ago* (last edited 7 months ago) (2 children)

I just mean like any file (pdf, jpeg, mp4, mp3, exe),

mp4/mp3 most famously for me

I find it so damn cool and incredible I can record something/anything right now and open the audio in a text editor and it's all right there, albeit in an incomprehensible format, but there altogether.

It's like a thinking rock etching sound into stone

[–] [email protected] 8 points 7 months ago* (last edited 7 months ago) (1 children)

If you're on Linux, you can convert that to plain printable text by piping it to base64. It works with any file, but I'll use an image here:

cat image.webp | base64

Which yields:

UklGRroEAABXRUJQVlA4WAoAAAAgAAAAYwAAQgAASUNDUKACAAAAAAKgbGNtcwRAAABtbnRyUkdC
IFhZWiAH6AAIABoADgAJACBhY3NwQVBQTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA9tYAAQAA
AADTLWxjbXMAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA1k
ZXNjAAABIAAAAEBjcHJ0AAABYAAAADZ3dHB0AAABmAAAABRjaGFkAAABrAAAACxyWFlaAAAB2AAA
ABRiWFlaAAAB7AAAABRnWFlaAAACAAAAABRyVFJDAAACFAAAACBnVFJDAAACFAAAACBiVFJDAAAC
FAAAACBjaHJtAAACNAAAACRkbW5kAAACWAAAACRkbWRkAAACfAAAACRtbHVjAAAAAAAAAAEAAAAM
ZW5VUwAAACQAAAAcAEcASQBNAFAAIABiAHUAaQBsAHQALQBpAG4AIABzAFIARwBCbWx1YwAAAAAA
AAABAAAADGVuVVMAAAAaAAAAHABQAHUAYgBsAGkAYwAgAEQAbwBtAGEAaQBuAABYWVogAAAAAAAA
9tYAAQAAAADTLXNmMzIAAAAAAAEMQgAABd7///MlAAAHkwAA/ZD///uh///9ogAAA9wAAMBuWFla
IAAAAAAAAG+gAAA49QAAA5BYWVogAAAAAAAAJJ8AAA+EAAC2xFhZWiAAAAAAAABilwAAt4cAABjZ
cGFyYQAAAAAAAwAAAAJmZgAA8qcAAA1ZAAAT0AAACltjaHJtAAAAAAADAAAAAKPXAABUfAAATM0A
AJmaAAAmZwAAD1xtbHVjAAAAAAAAAAEAAAAMZW5VUwAAAAgAAAAcAEcASQBNAFBtbHVjAAAAAAAA
AAEAAAAMZW5VUwAAAAgAAAAcAHMAUgBHAEJWUDgg9AEAALAQAJ0BKmQAQwA+8WSmTqmlKCYvmWqp
MB4JZQDLnNaF2NMD2L3xQGb5nmLiGhGWxQuD8kwUSXF0u2UTgX0YrR3MY2SsRCNEQ8hZ6WkCUTih
LdmsElHZVzoMwO/fj4X/ZSNT2R9qgxwqgEed891j4KCNRLK/tUbG3hZ3Mw2kixguSFIEcAgBtv8w
eAu0PwAA/upMzBqq+dcN8viO7FpqpV6GvPcRILm+HsOQblnpHx03lASjGlSyGbkKUD3xA5KOqgq/
VEUJ4qF9VoAYFbFhQRAgkvmREk5umMj8sr9Np95+n/oP2Aq2VW5xU4F1xpD8Vd4Dp7Phwm9w/Dnf
94djRROFRYPZeg/1Q/qiROFRVRu2nBcgndbhc0x0h+kgvT/naeJOEqwNjYPlIiw/DGuxav7+x09R
mf2mJto3ineDqfyMWUN83PmKqzGHkYGhZrTU478qjlQucDzWkwobnUmzhE6I+mDYkfiUVPcHyXbf
xXRStyPiPZAkJZrE9OrjFNUeljRQdVTQqeBsy+O9VwDLU5GcKhBQHa4cj+/DGqUhi74WH0EuHsb3
EgZVNc1FbRm5QFOpjDSprGIRYxe6sFFDrDOg4DhWZRnOa7s68pGaDDpbqrORxzPHXPbs55/1HTas
DDGzKFmTG4hJ2GUZKqjPcQ+MAAAA

Copy that into a text file and pass it to base64 with the decode flag, and you'll get the original binary:

cat data.txt | base64 -d > data.bin

Inspect it to see what kind of file it is:

file data.bin

Which prints:

data.bin: RIFF (little-endian) data, Web/P image

Rename it so you can just double-click it to open it:

mv data.bin data.webp

Enjoy the surprise.
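For anyone without those command-line tools, the same inspection can be sketched in Python. This decodes only the first 16 characters of the listing above and reads the header fields directly. The field layout used here (a 4-byte tag, a 4-byte little-endian size, a 4-byte form type) is the general RIFF container layout.

```python
import base64
import struct

# The first 16 characters of the Base64 listing above decode to 12 bytes.
first_chars = "UklGRroEAABXRUJQ"
header = base64.b64decode(first_chars)

magic = header[0:4]                                # container tag
(chunk_size,) = struct.unpack("<I", header[4:8])   # little-endian 32-bit size
form = header[8:12]                                # what the container holds

print(magic, chunk_size, form)
```

The tag is `RIFF` and the form type is `WEBP`, which is the kind of evidence a tool like `file` can use to identify the data without relying on the file name.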

You can also print files like that, scan them using OCR, and then restore them. A very inefficient way to do backups, but it works.

[–] [email protected] 3 points 7 months ago (2 children)

How is it representing it tho? Like, does it have woven in there an array of hex colour codes for every microscopic pixel that makes up the picture?

Are images and audio files just arrays of frames which are arrays of pixels and sound units?

[–] [email protected] 4 points 7 months ago* (last edited 7 months ago)

It just converts the raw binary data into printable characters, so it doesn't matter what the source is (image, video, database file, etc). The source binary data is taken 6 bits at a time, then each group of 6 bits is mapped to one of 64 unique characters.

The decoding process is just the reverse of that: mapping the data back to binary form.

https://en.wikipedia.org/wiki/Base64
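The reverse direction can be sketched in a few lines of Python. The function name `b64_decode_by_hand` is hypothetical. The sketch assumes well-formed standard Base64 input with no line breaks, and it is checked against the standard library's `base64` module.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def b64_decode_by_hand(text: str) -> bytes:
    # "=" is only padding; it carries no data.
    stripped = text.rstrip("=")
    # Each character becomes its 6-bit position in the alphabet.
    bits = "".join(format(ALPHABET.index(c), "06b") for c in stripped)
    # Regroup into 8-bit bytes, dropping leftover bits that were encoder padding.
    usable = len(bits) - len(bits) % 8
    return bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))


encoded = base64.b64encode(b"any bytes at all").decode("ascii")
print(b64_decode_by_hand(encoded))
```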

[–] [email protected] 2 points 7 months ago

the answer to your "how" question is: as needed.

some image and audio formats (especially older ones) are like that, yes. others use compression or other techniques to suit their need. like a sound can be a raw recording sample. or a sound can be described with Attack/Decay/Sustain/Release, along with octave and note etc. so a MIDI file is an audio file format without samples.
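A loose Python sketch of the contrast between "describe the note" and "store the samples". The `note_event` dictionary is a made-up structure for illustration and is not the actual MIDI file layout. The pitch formula assumes the common convention that note number 69 is the A at 440 Hz, with twelve equal steps per octave.

```python
import math

SAMPLE_RATE = 44100  # samples per second (assumed)

# A description of a sound: which note, and for how long.
note_event = {"note": 69, "duration_s": 1.0}


def note_to_frequency(note):
    """Note number to Hz, assuming note 69 = 440 Hz and 12 steps per octave."""
    return 440.0 * 2 ** ((note - 69) / 12)


def render(event):
    """Turn the description into raw samples, as a synthesiser would."""
    freq = note_to_frequency(event["note"])
    count = int(event["duration_s"] * SAMPLE_RATE)
    return [math.sin(2 * math.pi * freq * i / SAMPLE_RATE) for i in range(count)]


samples = render(note_event)
print(len(note_event), "fields describe the note;", len(samples), "samples record it")
```

The description is tiny but needs something to play it; the sample list is large but already is the sound.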

i once created an image format to be used for spiraling out images, instead of pixel arrays they were concentric circles of pixels that i could easily offset.

[–] [email protected] 1 points 7 months ago

You can use a hex editor to view those files and even change them in some cases.

Something like this https://github.com/WerWolv/ImHex
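For a quick look without installing anything, a very basic hex view can be sketched in Python. The function name `hexdump` and the column layout are choices made for this example and are not taken from ImHex or any other tool.

```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Format bytes as offset, hex values, and printable ASCII, one row per `width` bytes."""
    pad = width * 3 - 1  # width of a full row of "xx xx xx ..." text
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(format(b, "02x") for b in chunk)
        # Show printable ASCII as-is and everything else as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(format(offset, "08x") + "  " + hex_part.ljust(pad) + "  " + text_part)
    return "\n".join(lines)


print(hexdump(b"RIFF\xba\x04\x00\x00WEBPVP8X and so on"))
```

To view a real file, read it in binary mode, for example `hexdump(open("data.bin", "rb").read()[:256])`, where `data.bin` is whatever file you want to inspect.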