It's all made from our data, anyway, so it should be ours to use as we want

[email protected] 4 points 2 days ago

Oh, that copyright is bollocks. If you followed its intent, you'd have to include academics, and that state of affairs would be abhorrent (we'd stagnate).

[email protected] 3 points 2 days ago

I see thought policing as the inevitable outcome of calling training copyright infringement, because there is no difference between a person who recalls published information and discusses it with others and the intended use of published information for training. If training an AI on the same knowledge a person learns, in a similar manner, is somehow wrong, then the inevitable long-term outcome is a Minority Report-like dystopia. It sets the precedent for prosecuting people for their thoughts or intentions rather than their actions. That kind of thought policing existed in the darkest depths of the medieval era, and even into more recent eras of witch hunts and McCarthyism. Perhaps we are on the brink of another such dark era.

As far as I am aware:

  • Copyright is intended to protect someone from another person copying their work for financial gain - or, to be more specific, copying work for direct gain through any form of complex social hierarchy, such as awards, reputation, or money.

  • What copyright does not protect is the dissemination of knowledge as it relates to publicly published works.

  • One has the choice to remain the sole proprietor of one's knowledge, but to publish publicly is to relinquish ownership of the information contained within.

  • Principally, copyright protects that you were the first to write it, and the way in which you wrote it, but it does nothing to protect the knowledge contained within. A person who recalls that knowledge is not required to cite a source when speaking aloud or otherwise making use of it.

  • Copyright also has a scope of intent: it primarily concerns competitive works from one's peers, and excludes general knowledge and usefulness to society at large.

I'm not trying to mock you, or to say you are right or wrong. Quite frankly, I don't think in these terms, or care about the kinds of people who do. I'm heavily abstracted and intuitively driven to understand. I believe everything that is not intuitive is simply not fully understood yet. However naïve that may be is irrelevant here. I'm of the bias that those with something to gain often lack objective thinking and show a measure of envy when unexpected changes occur in society. I'm not accusing you of that, only disclosing the most minor of my own biases while trying to say that I want to understand. I would like to know if anything in the framework I just laid out is overlooked, and to better understand why you find this issue upsetting. I'm one of the most flawed and openly human people on Lemmy. Look at my history if in doubt. I have no skin in this game, just curiosity.

[email protected] 3 points 2 days ago

My view is that of a scholar - one who does devote a large part of their life to freely creating and disseminating knowledge. I do indeed hold a strong bias here, one I'm happy to admit.

Much of the time when I've run across copyright, it is rarely (if ever, come to think of it) in the name of the author - a common requirement of journals being the giving up of ownership of one's work. It normally falls to a company, one usually driven by shareholder value with little (if any) concern for the author's rights. This tends to be the rule rather than the exception, and I'd argue that copyright in its current incarnation merely provides a legal avenue to steal the work of another, or to hold their works to ransom from future generations. This contradicts your first point, and also the second (paywalled papers); indeed, the lack of availability of academic works (created for free, or with public funding) is, I believe, a key driver of inequality in this world.

One can withhold or even selectively share knowledge, and history will never know what that has cost us.

In terms of AI training, I wouldn't say it is copyright infringement even in spirit, and I say this as one whose works are vomited out verbatim by LLMs when questioned about the field. The comparison with speaking is an interesting one, for we generally do try to attribute ideas if we hold the speaker in esteem, or feel their name will enhance our point. An AI, however, is not speaking of its own volition, but is instead acting in the interest of the company hosting it (and so would fall under the professional label rather than the personal). This might contradict your final point, if one assumes AI progresses as a subscription product (which looks likely).

I think your framework has merit, mostly because it is built on ideals (and we need more such thinking in the world); however, it does not quite match the observed data. Though, it does suggest the rules a better incarnation of copyright could adhere to.

More so, I think no-one has an issue with training openly available models - it's the models that are themselves under copyright that people are leery of.

[email protected] 2 points 2 days ago

I wholeheartedly agree about proprietary models. My perspective is that of someone who saw the initial momentum of AI and only runs models on local hardware. What you are seeing with your work is not possible from a base model in practice. Too many holes in the Swiss cheese have to align to make that possible, especially with the softmax settings used for general purposes. Even with deterministic softmax settings this doesn't happen. I've even tried overtraining with a fine-tune, and it won't reproduce text verbatim. What you are seeing is only possible with an agentic RAG architecture. RAG is retrieval-augmented generation: retrieval from a database is used to augment what the model generates. The common open source libraries are LangChain for the agent and ChromaDB for the database. The agent is just a group of models running at the same time, with a central model capable of function calling in the model loader code.

I can coax stuff out of a base model that is not supposed to be there, but it is so extreme and unreliable that it is not at all useful. If I give a model something like 10k tokens (words/word fragments) of lead-in, and then start the first sentence of the reply myself, the model might get a sentence or two correct before it goes off on some tangent. Those kinds of paths through the tensor layers are like walking on a knife edge. There is absolutely no way to get that kind of result at random or without extreme effort. The first few words of a model's reply are very important too, and with open source models I control all of this. Indeed, I run models from a text-editor interface where I see and control every aspect of generation.

I tried to create a RAG from Operating Systems: Principles and Practice, Computer Systems: A Programmer's Perspective, and Linux Kernel Development as the next step in learning CS on my own. I learned a lot about the limits of present AI systems. They have a lot of promise, but progress mostly comes from the peripheral model loader code rather than from the base model, IMO.

I don't know the answer to the stagnation and corruption of academia in so many areas. I figure there must be a group somewhere that has figured out civilization is going to collapse soon, so why bother.