this post was submitted on 17 Jan 2024

Web scraper discussion


Discuss all kinds of web scraping.


I've been looking to create a local database of cooking recipes for personal use, but doing it manually is quite tedious, to say the least. It takes maybe 5-ish minutes per recipe to navigate the various websites, copy the text, create a file, re-format the inevitably flawed text into readable ASCII only, and look over the result for spelling, grammar, and readability errors (one guy who made the recipes was seemingly barely literate and could hardly pass 3rd grade English class).

Are there any utilities you are aware of that would make this easier? Obviously the more automated the better, but automating text-pulling from a website, line sizing, indents, list formatting, and easy headers would be the minimum to be "worth it". Useful features would be automated file creation and naming, savable config presets, and unified functionality (i.e. one utility that does everything, rather than a web API, a reformatter, and a file-writer, for example). The recipes tend to be in a certain format (rich text? Not sure) that prevents much of the readability from being retained when copied and pasted manually.

I'm looking for one single utility if at all possible, since I don't want excessive headaches on my end. I'm running Linux Mint. Thanks in advance.

top 1 comments
[–] sudo 1 points 2 weeks ago

No existing utility will do all of that for you, but there are libraries you can plug into a Python or Node.js script that will help. The bluntest solution would be something that converts HTML to Markdown. Even then, you'll have to massage some of the data to account for how individual sites format their pages.
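
To give an idea of what such a script might look like, here is a rough sketch using only the Python standard library, so nothing extra needs installing on Mint. It is a swapped-in, hand-rolled stand-in for a proper HTML-to-Markdown library, not a finished tool. Every name in it (`RecipeTextExtractor`, `to_ascii`, `html_to_text`, `slug_filename`, `save_recipe`, the `recipe-saver/0.1` user agent, the `recipes` folder) is made up for illustration, and it assumes the recipe text lives in ordinary `h1`-`h3`, `p`, and `li` tags.

```python
import pathlib
import re
import textwrap
import unicodedata
import urllib.request
from html.parser import HTMLParser

# Common typographic punctuation that has no ASCII decomposition.
PUNCT = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"',
                       "\u201d": '"', "\u2013": "-", "\u2014": "-"})


class RecipeTextExtractor(HTMLParser):
    """Collect the text of headings, paragraphs, and list items."""
    BLOCK_TAGS = {"h1", "h2", "h3", "p", "li"}
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []   # finished (tag, text) pairs, in page order
        self._tag = None   # block tag currently open, if any
        self._buf = []     # text fragments of the open block
        self._skip = 0     # depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip += 1
        elif tag in self.BLOCK_TAGS:
            self._flush()
            self._tag = tag

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == self._tag:
            self._flush()

    def handle_data(self, data):
        if self._tag and not self._skip:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()      # keep a block the page never closed

    def _flush(self):
        text = " ".join("".join(self._buf).split())  # collapse whitespace
        if self._tag and text:
            self.blocks.append((self._tag, text))
        self._tag, self._buf = None, []


def to_ascii(text):
    # Put a space between a digit and a vulgar fraction: "1½" -> "1 ½".
    text = re.sub(r"(\d)([\u00bc-\u00be\u2150-\u215e])", r"\1 \2", text)
    text = unicodedata.normalize("NFKD", text.translate(PUNCT))
    text = text.replace("\u2044", "/")  # fraction slash -> plain slash
    # Anything still non-ASCII (combining accents, degree signs) is dropped.
    return text.encode("ascii", "ignore").decode("ascii")


def html_to_text(html, width=72):
    parser = RecipeTextExtractor()
    parser.feed(html)
    parser.close()
    out, prev = [], None
    for tag, text in parser.blocks:
        text = to_ascii(text)
        if prev == "li" and tag != "li":
            out.append("")  # blank line after a finished list
        if tag in ("h1", "h2", "h3"):
            out += ["#" * int(tag[1]) + " " + text, ""]
        elif tag == "li":
            out.append(textwrap.fill(text, width, initial_indent="- ",
                                     subsequent_indent="  "))
        else:
            out += [textwrap.fill(text, width), ""]
        prev = tag
    return "\n".join(out).strip() + "\n"


def slug_filename(title):
    slug = re.sub(r"[^a-z0-9]+", "-", to_ascii(title).lower()).strip("-")
    return (slug or "recipe") + ".txt"


def save_recipe(url, title, folder="recipes"):
    req = urllib.request.Request(url, headers={"User-Agent": "recipe-saver/0.1"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        html = resp.read().decode(charset, errors="replace")
    target = pathlib.Path(folder)
    target.mkdir(exist_ok=True)
    target = target / slug_filename(title)
    target.write_text(html_to_text(html), encoding="ascii")
    return target
```

Calling `save_recipe(url, "Some Recipe Title")` would fetch the page, wrap the text at 72 columns, and write it to `recipes/some-recipe-title.txt`. Note that this grabs every heading, paragraph, and list item on the page, including navigation and comment sections, so the per-site massaging mentioned above still applies. It also does nothing for spelling or grammar.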