Feel like this belongs in [email protected]
Think I should cross-post?
This is a most excellent place for technology news and articles.
Doesn't make any sense. Why would you crawl wikipedia when you can just download a dump as a torrent ?
AI bros aren't that smart.
Apparently the dump doesn't include media, though there's ongoing discussion within wikimedia about changing that. It also seems likely to me that AI scrapers don't care about externalizing costs onto others if it might mean a competitive advantage (e.g. most recent data, not having to spend time and resources developing dedicated ingestion systems for specific sites).
I want to stress this: it's not that "tech bros" are just stupid (even though a lot of them are revoltingly unappreciative of the giants whose shoulders they stand on); it's that they don't care.
There's a chance this isn't being done by someone who only wants Wikipedia's data. As the number of websites you scrape increases, your desire to use the easy tools loses out to creating the most general tool that can look at most webpages.
To have the most recent data?
To just have the most recent data within a reasonable time frame is one thing. AI companies are like "I must have every single article within 5 minutes of it being updated, or I'll throw my pacifier out of the pram". No regard for the considerations of the source sites.
These fucking companies.. downloading a torrent of Anna's Archive but crawling Wikipedia. Scourge of mankind.
When I imagine a future with AI ruining the world, I always thought it was going to be some Skynet/CABAL/HAL9000 type of thing
Not this sad, boring, depressing type shit
wikipedia should install ai mazes on their servers
Not in this case, to be fair. The only concern is cost, since Wiki wouldn't be opposed to them getting their actual data, and AI mazes are designed to safeguard more sensitive data, not to reduce cost.
Nice analysis. Need more smart people like you in the world
I agree with that assessment!
Are there alternatives besides Cloudflare's solution?
We should stop using ai
I don't know about stopping entirely. I built a pretty cool RAG system for internal use in my company, it very much facilitates navigating very large amounts of text data.
I still struggle with a use case for artificial intelligence in my own life. I play around with it all and I'm just like, it doesn't do a good job. Also, I think humanity is missing the plot, you know? Like, we don't need government. If government isn't going to do government. Government serves the people, not corporations. Or at least it should. I don't know, I think we're entering in times. At some point, I think people will pray for nuclear war, because life will be so miserable. That it would be better than just to end it all.
AI has niches but they're exactly that: Niches. Small duct tape tasks for fudging over "hard problems" where manual code would result in a worse outcome and take far more time. Little esoteric problem spaces, which notably don't actually require you to use several states worth of electrical power training on a 50PB dataset of anime titties.
An example: I have a name generator in my game that strings together several consonant+vowel phoneme pairs into a name. This means that the names are always pronounceable, but often the spelling looks really unintuitive. E.g. Joosiffe, which the player would likely pronounce as Joseph. However, the leap we do in our head between those two spellings is a process of declassifying phonemes and then re-classifying phonemes, and is actually a "hard problem" from a coding perspective due to the unintuitive, multifarious complexities of written, spoken, and conceptualized human language. Adding this step to my name generator in code would be a project of its own, larger than the game itself, and wouldn't ever work nearly as well as it needed to. But relatively small (30MB) AI models that do this with something like 99.8% satisfaction already exist. They didn't require a data center's worth of resources to train, and since they're academic projects they have licenses that allow them to be used for free in a game.
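For anyone curious what the "consonant+vowel pairs" part might look like, here is a minimal sketch. The phoneme lists, the `make_name` function, and the three-syllable default are all assumptions made up for illustration; the actual game's generator isn't shown anywhere in this thread. It only covers the easy half (always-pronounceable output), not the respelling step that the small model handles.

```python
import random

# Hypothetical phoneme inventories; the original comment does not list its own.
CONSONANTS = ["b", "d", "f", "j", "k", "l", "m", "n", "r", "s", "t", "v", "z"]
VOWELS = ["a", "e", "i", "o", "u", "oo", "ee"]


def make_name(rng: random.Random, syllables: int = 3) -> str:
    """Join `syllables` consonant+vowel pairs and capitalize the result."""
    pairs = (rng.choice(CONSONANTS) + rng.choice(VOWELS) for _ in range(syllables))
    return "".join(pairs).capitalize()


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so runs are repeatable
    print([make_name(rng) for _ in range(5)])
```

Because every syllable is consonant followed by vowel, the output can always be sounded out, which is also exactly why spellings like "Joosiffe" show up.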
I’m dyslexic and basically a terrible writer. It has helped my professional communication develop. It really helps me speed up my issues with my disability and feel confident in my communications.
This is a cool use case. Just make sure you retain your own voice! If you read an AI-generated sentence out loud and think "I'd have said it this way instead", IMO you should absolutely then change it to be that way.
Understood and I do. I try to tweak it a little to my own style. But it helps write the hundreds of cover letters I’m submitting a day. Looking for work. This usually took me hours for just one submission. Now I can fly through.
Like, we don't need government.
Welcome to anarchism. Now you have to pick your flavor! Anarcho-syndicalism, anarcho-communism, anarcho-capitalism, Christian anarchism, and the list goes on!
I found LLMs helpful for developing some scripts and answering some simple, trivial questions (like how house property works in China). I could have looked for that in a regular search engine though. But that's it, I am still happy looking for things myself and investigating, since you can't really trust their answers.
At some point, I think people will pray for nuclear war, because life will be so miserable.
Reminds me of Roll out the Fallout by The Chalkeaters
what assholes .. just fucking download the full package and quit hitting the URL
Right‽ This is ridiculously stupid when you can download the entirety of Wikipedia in a single package and parse it to your heart's desire
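To give a rough idea of how little work the "download and parse" route is, here is a sketch that streams pages out of a MediaWiki XML export using only the Python standard library. The `iter_pages` and `local_name` helpers and the tiny `SAMPLE` document are made up for illustration, and the namespace URL in the sample is an assumption (the export schema version differs between dumps, which is why the code ignores the namespace instead of hard-coding it).

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator, Tuple


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream: BinaryIO) -> Iterator[Tuple[str, str]]:
    """Yield (title, wikitext) for each <page> element as it finishes parsing."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = "", ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text or ""
            elif name == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # free the page's children so memory use stays small


# Tiny stand-in for a real dump (illustrative only).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision><text>Hello ''world''</text></revision>
  </page>
</mediawiki>"""

if __name__ == "__main__":
    for title, text in iter_pages(io.BytesIO(SAMPLE)):
        print(title, len(text))
```

Real dumps are usually bz2-compressed, so in practice you would pass something like `bz2.open(path, "rb")` instead of the `BytesIO` sample. One download, zero requests against the live site.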
Not only that, but we make it goddamn trivial for not just Wikipedia but for other Wikimedia projects. Doing this is just stealing without attribution and share-alike like the CC BY-SA 4.0 license demands and then on top of that kicking down the ladder for people who actually want to use Wikimedia and not the hallucinatory slop they're trying to supplant it with. LLM companies have caused incalculable damage to critical thinking, the open web, the copyleft movement, and the climate.
Yay interrobang :D
The amount of stupid AI scraping behavior I see even on my small websites is ridiculous, they'll endlessly pound identical pages as fast as possible over an entire week, apparently not even checking if the contents changed. Probably some vibe coded shit that barely functions.
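For reference, "checking if the contents changed" is something HTTP has supported for decades through conditional requests: the client sends back the `ETag` it saw last time in an `If-None-Match` header, and the server answers `304 Not Modified` with no body if nothing changed. Below is a minimal sketch of the bookkeeping only. The `Cache` type and the two helper functions are hypothetical names, and the actual HTTP client is deliberately left out.

```python
from typing import Dict, Optional, Tuple

# url -> (etag, body) for pages we have already fetched
Cache = Dict[str, Tuple[str, bytes]]


def conditional_headers(cache: Cache, url: str) -> Dict[str, str]:
    """Ask the server to reply 304 if our cached copy is still current."""
    if url in cache:
        return {"If-None-Match": cache[url][0]}
    return {}


def apply_response(cache: Cache, url: str, status: int,
                   etag: Optional[str], body: bytes) -> bytes:
    """Update the cache from a response and return the current page body."""
    if status == 304:  # Not Modified: the server sent no body, reuse ours
        return cache[url][1]
    if status == 200 and etag:  # new or changed content: remember its validator
        cache[url] = (etag, body)
    return body
```

A crawler that did even this much would turn a week of identical full-page downloads into a week of tiny 304 responses.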
If I was running infra for them, I’d just start blacklisting abusive IPs without warning
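A rough sketch of what that could look like, assuming a simple sliding-window counter per client IP. The `AbuseTracker` class, its limits, and the permanent block are all assumptions made for illustration, not anything Wikimedia is known to run; real setups usually do this at the proxy or firewall layer.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Set


class AbuseTracker:
    """Flag clients that exceed `limit` requests within `window` seconds."""

    def __init__(self, limit: int, window: float) -> None:
        self.limit = limit
        self.window = window
        self.hits: Dict[str, Deque[float]] = defaultdict(deque)
        self.blocked: Set[str] = set()

    def allow(self, ip: str, now: float) -> bool:
        """Record one request at time `now`; return False if the IP is blocked."""
        if ip in self.blocked:
            return False
        recent = self.hits[ip]
        recent.append(now)
        # Drop timestamps that have aged out of the window.
        while recent and recent[0] <= now - self.window:
            recent.popleft()
        if len(recent) > self.limit:
            self.blocked.add(ip)  # no warning, as suggested above
            del self.hits[ip]
            return False
        return True
```

Usage would be something like `tracker.allow("203.0.113.7", time.time())` on each request. The catch is that scrapers spread across many addresses stay under any per-IP limit, so this only stops the laziest offenders.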
Laws should be passed in all countries requiring AI crawlers to request permission before crawling whatever target site. I have no pity for AI "thieves" that get their models poisoned. F...ing plague, as if the adware and spyware weren't enough...
i doubt the recent uptick in traffic is from “stealing data” for training but rather from agents scraping them for context, eg Edge Copilot, Google’s AI search, SearchGPT, etc.
poisoning the data will likely not help in this situation since there’s a human on the other side that will just do the same search again given unsatisfactory results. like how retries and timeouts can cause huge outages for web scale companies, poisoning search results will likely cause this type of traffic to increase and further increase the chances of DoS and higher bandwidth usage.
And the quality of the AI output sucks. I was recently looking for information about positive convention for yaw, pitch, and roll in aircraft. I was looking at az and yaw and got reasonable results from the AI, but when I looked at pitch and el all of the results were about elevator pitches. Even when I spelled out elevation it insisted on elevator pitches. I scroll past the AI results as a matter of principle, but I usually look at them so I have something specific to complain about when people ask why I am so virulently anti-AI.
The other day I tried to have it help me with a programming task on a personal project. I am an experienced programmer, but I only "get by" in Python (typically just by looking up the documentation for the standard library). I thought, "OK. This is it. I will ask Llama 3.3 and GPT4 for help."
That shit literally set me back a weekend. It gave me such bad approaches and answers, that I could tell were bad (aforementioned experience in programming, degree in comp sci, etc) that I got confused about writing Python. Had I just done what I usually do, which is to look up the documentation and use my brain, I would have gotten my weekend task done a whole weekend sooner.
It scares me to think what people are doing to themselves by relying on this, especially if they're novices.
It scares me to think what people are doing to themselves by relying on this, especially if they're novices.
Same here. There's a lot of denial going on, but LLMs are not good for anything that requires factual information. They likely never will be, on account of just being statistical models for language. Summarizing long text where correctness isn't an issue is really one of the only places where I still think that they are good.
Search? Not if you want anything factual with citations.
Code? Fuck no. They constantly produce code of poor quality that may depend on non-existent libraries or functionality. More time is spent debugging than writing code, and it leaves the dev with a poor understanding of what the code actually does and ways to optimize/extend/etc.
Generating literary smut? Well, it's not going to do as good of a job as a person who can create something completely novel but can be passable without likely harm to authors (I'd classify it as a tier below erotic fan fiction).
AI is useful for basic, mundane tasks and that's about it. Trying to force it to be some sort of Uber search engine is such a bad idea.
This is an example of corporate terrorism sponsored by our own government. Elon Musk loves to see himself as the villain in Ready Player One. And this is not a joke you can look it up. Big tech is waging war against American citizens, and no longer do we have any control of our government, and the Democrats will not save us. The electoral processes will not save us. This is just hard for some people to accept, that's why things have to fall apart before they get a clue. Unfortunately, those that are wiser are going to feel the flames first.
Support Wikipedia! They're awesome and are our backbone.
AI: The "pen that can write in zero gravity" when pencils exist.
Well I get the analogy, but also I think they didn't use pencils because of the graphite and complications with filtering air or something.