I've noticed that Google is getting more and more aggressive with VPNs. It won't let me load anything on VPN without logging in. This applies to third-party tools like yt-dlp too.
This probably depends on your VPN provider. Perhaps I can make a throwaway Google account just to get it to stfu? I don't know how hard it is to make a semi-anonymous Google account nowadays.
This is not strictly true in general. Generative AI is able to produce output that is not in the training data, by learning a broad range of concepts and applying them in novel ways. I can generate an image of a rollerskating astronaut even if there are no rollerskating astronauts in the training data.
It is true that some training sets have included CSAM, at least in the past. Back in 2023, researchers found a few thousand such images in the LAION-5B dataset (roughly one per million images). 404 Media has an excellent article with details: https://www.404media.co/laion-datasets-removed-stanford-csam-child-abuse/
On learning of this, LAION took down their database until it could be properly cleaned. Source: https://laion.ai/notes/laion-maintenance/
Those images were collected from the public web. LAION took steps to avoid linking to illicit content (details in the link above), but clearly it's an imperfect system. God only knows what closed companies (OpenAI, Google, etc.) are doing. With open data sets, at least any interested parties can review, verify, and report this stuff. With closed data sets, who knows?