this post was submitted on 15 Nov 2023

I've got ZorinOS/Ubuntu. I've tried httrack, but it gets slimjet launch terminal errors. I've tried getting ChatGPT to write Python scripts for me. I've tried WFDownloaderApp, but its GUI glitches horribly.
I've tried "DownloadThemAll!", but it's just a browser extension, it will only download a single webpage, and I see no way to enable crawling or filters.

Please help, thanks.

[–] [email protected] 1 points 1 year ago (1 children)
[–] [email protected] 1 points 1 year ago (1 children)

Thank you very much. I hadn't tried it until you suggested it, and it's literally the only program that appears to work so far. I started it and it's crawling right now as we speak.
1.) My first concern is that the instructions don't mention webp image files, and I wasn't sure if wget is built for that image type, so I just used the list from the instructions: "jpeg,jpg,bmp,gif,png".

But I definitely want to get webp and actually ALL image file formats, and I'm not sure if wget is built to recognize all image file formats.

2.) wget's "Recursive retrieval" follows links to a default maximum depth of five levels.
But is that enough? How do I set it deeper? How deep is too deep for a website, and can it be too deep? Logically speaking, once the domain or subdomain name changes completely, that seems to me the best indication to stop.

3.) If there are any errors or timeouts caused by the website's server, etc., will wget tell me at the end how many URLs and images it was blocked from downloading?

[–] [email protected] 1 points 1 year ago

Wget is awesome, I have scraped tons with it. There are so many options; you can even spoof all the request header info to get around sites that try to limit auto downloaders. Here is the manual: https://www.gnu.org/software/wget/manual/wget.html

  1. webp or any file extension will work. Note on webp: most sites actually still have JPGs, but convert and serve webp to save bandwidth if the browser says it accepts them. There is a Firefox add-on that stops the browser from advertising webp support in its request headers, so you get webp only when it is the only option:

https://addons.mozilla.org/en-CA/firefox/addon/dont-accept-webp/

Wget does not behave identically to a browser, so I'm unsure what this part of the request looks like or if it needs modification. If it isn't working, let me know.
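For the webp question, a minimal sketch of what the command could look like. The accept list is matched against file name suffixes, so adding webp (and any other formats you care about) to that list is the part that matters. The function name, the extension list, and the example URL are placeholders of mine, not something from the linked instructions. As far as I know wget sends `Accept: */*` by default, so no header change should be needed.

```shell
# Comma-separated file name suffixes to keep; extend as needed.
# (Illustrative list, not an exhaustive set of image formats.)
IMAGE_TYPES="jpeg,jpg,bmp,gif,png,webp,svg,avif,tif,tiff,ico"

# grab_images URL
# Recursively crawl below URL and keep only files whose names end in
# one of the suffixes above. HTML pages are still fetched so their
# links can be followed, then discarded.
grab_images() {
  wget --recursive --no-parent --accept "$IMAGE_TYPES" "$1"
}

# Usage (hypothetical URL):
#   grab_images "https://example.com/gallery/"
```

Suffix matching is case-sensitive by default, so a site that names its files `.JPG` would need those spellings in the list too.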

  2. 5 might be enough, but maybe not. Scroll down in the comments of my first link; they show how to set it to infinite: "-l inf".

For future scraping, look at the --mirror option. It sets recursion to infinite and will make a full copy of the site. You can also use the --convert-links option, which changes all the links to point to the locally downloaded files, so the local copy then behaves the same as the real website.
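A minimal sketch of the mirror approach, with a made-up function name and URL. Per the manual, `--mirror` is currently shorthand for `-r -N -l inf --no-remove-listing`, so it also turns on timestamp checking.

```shell
# mirror_site URL
# Make a full local copy of the site and rewrite links so the copy
# can be browsed offline.
mirror_site() {
  wget --mirror --convert-links "$1"
}

# Usage (hypothetical URL):
#   mirror_site "https://example.com/"
```

The link conversion happens after the download finishes, so an interrupted run can leave links pointing at the live site.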

You can't go too deep unless you use --span-hosts, because without it wget stays on the site you started from. With --span-hosts it can grab external files from different domains to make the mirrored site a true copy, but yeah, you often don't need that. You also want to be more careful with recursion depth here: it can go too deep and you end up with too much data.
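If images do live on a separate host (a CDN, say), one way to keep --span-hosts under control is to cap the depth and whitelist the hosts with `--domains`. This is a sketch under that assumption; the function name, the depth of 3, and the domain names are illustrative only.

```shell
# grab_with_cdn URL DOMAINS
# DOMAINS is a comma-separated list of domains wget may follow links
# into. Depth is capped because crossing hosts can balloon quickly.
grab_with_cdn() {
  wget --recursive --level=3 \
       --span-hosts --domains="$2" \
       --accept "jpeg,jpg,bmp,gif,png,webp" \
       "$1"
}

# Usage (hypothetical names):
#   grab_with_cdn "https://example.com/gallery/" "example.com,cdn.example.com"
```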

  3. I'm not sure about this. I think you can turn on logging, but I'm not sure what that gets you. I've used the --no-clobber option to run wget again without re-downloading existing files. This is handy for resuming or filling in gaps that were missed due to timeouts, etc.
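A rough sketch of how logging plus a second pass could work. `--output-file` writes wget's messages to a file instead of the terminal, and failed requests normally show up there as lines like `ERROR 404: Not Found.`, so counting those gives an approximate tally. The function names and log file name are mine, and the grep pattern is an assumption about the log wording that only catches HTTP status errors, not every kind of timeout.

```shell
# count_wget_errors LOGFILE
# Print how many log lines report an HTTP error status.
count_wget_errors() {
  grep -c 'ERROR [0-9][0-9][0-9]' "$1"
}

# resume_grab URL
# Second pass: skip files that already exist locally and log to a
# separate file so the two runs can be compared.
resume_grab() {
  wget --recursive --level=inf --no-parent --no-clobber \
       --accept "jpeg,jpg,bmp,gif,png,webp" \
       --output-file="wget-run2.log" \
       "$1"
}

# Usage (hypothetical URL):
#   resume_grab "https://example.com/gallery/"
#   count_wget_errors wget-run2.log
```

Note that --no-clobber should not be combined with --mirror, since --mirror turns on timestamping and wget refuses to use both together.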

Some sites also need the --wait or --random-wait options to avoid detection.
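A small sketch of a slower, less bot-like crawl. `--wait` sets the pause between requests in seconds, and `--random-wait` varies each pause between 0.5 and 1.5 times that value. The 2-second figure and the function name are arbitrary choices for illustration.

```shell
# polite_grab URL
# Same image crawl, but pausing a randomized 1 to 3 seconds between
# requests.
polite_grab() {
  wget --recursive --no-parent \
       --wait=2 --random-wait \
       --accept "jpeg,jpg,bmp,gif,png,webp" \
       "$1"
}

# Usage (hypothetical URL):
#   polite_grab "https://example.com/gallery/"
```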