this post was submitted on 21 Jul 2023

25 points (90.3% liked)

Website ripper? (lemmy.ml)

submitted 1 year ago by [email protected] to c/[email protected]

24 comments

I want to rip the contents of a pay website, but I have to log in through a form on their web page to get access.

Does anyone have any good tools for Windows for that?

I'm guessing that any such tools must have a built in browser, or be a browser plugin for it to work.

top 19 comments

[–] [email protected] 12 points 1 year ago (3 children)

Unless you have an account there's no easy way to get access to the content on the page. Once you have an account there's technically nothing stopping you from just saving the HTML file to your computer.

Something else you can try though, assuming you don't have an account, is to just turn off JavaScript. If the site lets you partially load the content and then asks you to create an account to read more, they usually just block the content by having JavaScript add an opaque overlay. With JavaScript disabled, obviously it's not there to add the overlay and you're able to keep reading.

[–] [email protected] 4 points 1 year ago (1 children)

I have an account, so that's not a problem. The problem is how to automate going into every little content page and downloading the content, including the hi-res files.
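The automation asked about here usually starts with pulling two lists out of each page: further pages to visit, and files to download. A minimal sketch using only the Python standard library follows. The names `LinkCollector`, `collect`, and `ASSET_EXTENSIONS` are made up for illustration, and it assumes the hi-res files are reachable as plain `<a href>` or `<img src>` URLs, which may not hold for this site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical list of extensions treated as downloadable files.
ASSET_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".pdf", ".zip")


class LinkCollector(HTMLParser):
    """Collect page links and file URLs from one HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pages = set()   # links to follow
        self.assets = set()  # files to download

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.assets.add(urljoin(self.base_url, attrs["src"]))
        elif tag == "a" and attrs.get("href"):
            url = urljoin(self.base_url, attrs["href"])
            # A link that points straight at a file is often the hi-res copy.
            if urlparse(url).path.lower().endswith(ASSET_EXTENSIONS):
                self.assets.add(url)
            else:
                self.pages.add(url)


def collect(html, base_url):
    """Return (same-host page URLs, file URLs) found in html."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    host = urlparse(base_url).netloc
    pages = {u for u in parser.pages if urlparse(u).netloc == host}
    return pages, parser.assets
```

A crawler would call `collect` on every fetched page, queue the returned pages it has not seen yet, and download the returned files.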

[–] [email protected] 3 points 1 year ago (1 children)

I'm on a Mac and use SiteSucker, so I know that's not super helpful, but for Windows you could try Wget or WebCopy: https://www.cyotek.com/cyotek-webcopy / https://gnuwin32.sourceforge.net/packages/wget.htm

[–] [email protected] 2 points 1 year ago* (last edited 1 year ago)

WebCopy looks promising if I can get the crawler part of it to work with this site's authentication...

edit: I couldn't get WebCopy's spider to authenticate correctly.

WebCopy uses the deprecated version of Internet Explorer in Windows 10 as a module, and I can log into the website using the Capture Forms browser dialog, but the cookies or whatever else don't carry over to the spider.

[–] [email protected] 2 points 1 year ago* (last edited 1 year ago)

Depending on the website, there might be tools tailored specifically to it that will extract the content you're looking for, but they're likely going to be command-line based, and you'll likely have to export your cookies so that the tools can work from outside your browser as if you were logged in to your account.
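To illustrate the cookie hand-off described above: many browser extensions can export cookies in the Netscape `cookies.txt` format, and Python's standard library can load that file so requests go out with the same cookies. This is only a sketch. The function name `opener_from_cookies_txt` is made up, and whether a site accepts a session replayed this way depends on how it authenticates.

```python
import http.cookiejar
import urllib.request


def opener_from_cookies_txt(path):
    """Build a urllib opener that sends the cookies stored in a cookies.txt file."""
    jar = http.cookiejar.MozillaCookieJar(path)
    # Also keep session cookies and entries whose expiry date has passed.
    jar.load(ignore_discard=True, ignore_expires=True)
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar
```

The returned opener is used like `opener.open(url).read()`. The file must start with the `# Netscape HTTP Cookie File` header line, or loading fails.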

Is it too much to ask which website?

[–] [email protected] 1 points 1 year ago

Disabling JavaScript might also block the page content from loading at all...

I would assume it's being fetched by JavaScript through an API.

That is fairly common.

[–] [email protected] 5 points 1 year ago (2 children)

HTTrack might do what you need.

[–] [email protected] 3 points 1 year ago (1 children)

HTTrack doesn't allow me to log into the website. The only login method it supports is HTTP authentication, and this particular website uses a plain web login form.

[–] [email protected] 2 points 1 year ago (1 children)

Depending on how they auth, this might give you a way to make HTTrack look like your existing logged-in session: https://superuser.com/questions/157331/mirroring-a-web-site-behind-a-login-form

[–] [email protected] 1 points 1 year ago

Interesting idea. Unfortunately the cookies weren't in cleartext in the page headers. I found the cookie values in the browser's network inspector and pasted them into HTTrack, but that didn't work.
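One common reason pasted cookie values fail is the file format: tools that read `cookies.txt` (Wget's `--load-cookies`, for example) expect tab-separated Netscape lines rather than the raw `Cookie:` header. The sketch below converts a copied header into that format. The function name, the default path `/`, and the far-future expiry are assumptions for illustration, and some sites check more than cookies, so this alone may not be enough.

```python
def cookie_header_to_netscape(header, domain, expires=4102444800):
    """Turn a copied 'Cookie:' request header into Netscape cookies.txt text."""
    if header.lower().startswith("cookie:"):
        header = header.split(":", 1)[1]
    # A leading dot means the cookie also applies to subdomains.
    subdomains = "TRUE" if domain.startswith(".") else "FALSE"
    lines = ["# Netscape HTTP Cookie File"]
    for part in header.split(";"):
        name, sep, value = part.strip().partition("=")
        if not (sep and name):
            continue  # skip empty or malformed fragments
        # Fields: domain, subdomain flag, path, secure flag, expiry, name, value.
        fields = [domain, subdomains, "/", "FALSE", str(expires), name, value]
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"
```

Writing the returned text to a file gives something cookie-aware tools can load. The default expiry is simply a timestamp far in the future so the entries are not discarded as expired.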

My html cookie-fu is weak.

[–] [email protected] 1 points 1 year ago

Came here to say this. Idk how it does with a password-protected site, though.

[–] [email protected] 5 points 1 year ago (1 children)

Speaking of likely-to-work browser plugins: https://addons.mozilla.org/en-US/firefox/addon/downthemall/

[–] [email protected] 1 points 1 year ago

I've used DownThemAll heaps in the past; it works well.

[–] [email protected] 3 points 1 year ago

What's the site? Often you can find specially designed tools on GitHub for this purpose that handle all the logins etc.

[–] [email protected] 3 points 1 year ago (1 children)

If you're open to Docker options, I've used and recommend ArchiveBox. It supports using a login to rip sites, and you can set it to rip once or on a schedule, etc.

I think they have a desktop app version in the works if you were looking for more of a one-time approach.

[–] [email protected] 1 points 1 year ago* (last edited 1 year ago)

I installed and played around with ArchiveBox after your suggestion.

The login/cookie copying function seems to be oriented to how Chromium is installed on Linux, which I don't have up and running in any meaningful way, and there doesn't seem to be any support place where I can ask questions.

[–] [email protected] 1 points 1 year ago* (last edited 1 year ago) (1 children)

Okay, I found SurfOffline that does the trick without too much hassle, but....

It's verrrrrrrry slooooooooow.

It uses Internet Explorer as a module, and calls each individual resource separately, instead of file copying from IE's cache, which is weird and slow, especially when hundreds of images are involved.

And SurfOffline doesn't appear to be supported anymore, i.e. the support email's inbox is full.

edit: Aaaaand SurfOffline doesn't save to .html files with a directory structure!!! It stores everything in some kind of SQL database, and it only exports to .mht and .chm files, which are dated Microsoft formats (a web archive and a compiled help file)!!!

What it does have is a built in web server that only works while the program is running.

So what I plan to do is keep the program open but idle, while I sic HTTrack on the 127.0.0.1 web address for my ripped website.

HTTrack will hopefully "extract" the website to .html format.

Whew, what a hassle!

[–] [email protected] 1 points 1 year ago

To continue my travails:

HTTrack didn't do a great job: it was slow, even copying from the same machine, and it flattened the directory structure of the website it was writing, making it almost unnavigable.

Here's where Cyotek WebCopy shines: It's copying the website from SurfOffline's database webserver quickly, so I should have the entire website re-extracted very soon!
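For context on the flattening problem mentioned above, a mirroring tool keeps a site navigable by mapping each URL path onto a matching folder path. The sketch below shows one way that mapping can work. It is not how HTTrack or WebCopy do it, and the function name, the `mirror` root folder, and the port in the usage example are assumptions.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def url_to_local_path(url, root="mirror"):
    """Map a URL to a relative file path that mirrors the site's directories."""
    parsed = urlparse(url)
    path = unquote(parsed.path)
    # Directory-style URLs ("" or ending in "/") are saved as index.html.
    if path == "" or path.endswith("/"):
        path += "index.html"
    # Drop empty, "." and ".." segments so nothing escapes the root folder.
    parts = [p for p in path.split("/") if p not in ("", ".", "..")]
    # Colons are not allowed in Windows file names, so replace the port separator.
    host = parsed.netloc.replace(":", "_")
    return str(PurePosixPath(root, host, *parts))
```

For example, `url_to_local_path("http://127.0.0.1:8080/gallery/set1/photo.jpg")` gives `mirror/127.0.0.1_8080/gallery/set1/photo.jpg`. Query strings are ignored here, so pages that differ only by query would need extra handling.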
