If using copyrighted material to train an LLM is theft, so is reading a book.


So if I get access to the Perplexity AI source code (I borrow it from a friend), read all of it, and reproduce it at some level, then Perplexity will say: "Sure, that's fine, no harm, no IP theft, no copyright violation, because you read it, so we're good"?

No, they would sue me for everything I got, and then some. That's the weird thing about these companies, they are never afraid to use IP law to go after others, but those same laws don't apply to them... because?

Just pay the stupid license and if that makes your business unsustainable then it's not much of a business, is it?


Funny enough, their prompts leaked: https://www.reddit.com/r/perplexity_ai/s/kn6i20kMLH

And I’ve built a Perplexity clone in about a day - it’s not that hard: search -> scrape results -> parse results -> summarize results -> summarize aggregate results into a single summary.
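For anyone wondering what that chain looks like in code, here's a minimal Python sketch. It is not the actual clone: `answer`, `search`, `fetch`, `parse`, `summarize` and `max_results` are all hypothetical names, and the search API and model calls are passed in as plain functions so that nothing here pretends to know a real provider's interface.

```python
from typing import Callable, List


def answer(
    query: str,
    search: Callable[[str], List[str]],   # query -> list of result URLs (assumed)
    fetch: Callable[[str], str],          # URL -> raw page HTML (assumed)
    parse: Callable[[str], str],          # raw HTML -> plain text (assumed)
    summarize: Callable[[str], str],      # text -> shorter text, e.g. an LLM call (assumed)
    max_results: int = 5,
) -> str:
    """search -> scrape -> parse -> summarize each page -> summarize the summaries."""
    urls = search(query)[:max_results]

    page_summaries = []
    for url in urls:
        raw = fetch(url)                  # scrape the result
        text = parse(raw)                 # drop markup to save tokens
        if text.strip():                  # skip pages with no usable text
            page_summaries.append(summarize(text))

    if not page_summaries:
        return ""

    # Final pass: one summary over the concatenated per-page summaries.
    return summarize("\n\n".join(page_summaries))
```

Keeping the four steps injectable means you can swap the search backend or the summarization model without touching the control flow, and test the whole thing with stubs.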

I’m really not sure I even see their moat.


What have you used, if I may ask? It seems very simple indeed. What search API is best?

Also, there is a program called html2text that throws out the HTML formatting so as to use fewer tokens. Have you used this or something similar?
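To show the kind of stripping I mean, here is a rough stand-in using only Python's standard library. It is not html2text itself and does far less (no Markdown output, no link handling); `html_to_text` and `_TextExtractor` are made-up names for this sketch.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and ignores script/style/noscript bodies."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)  # decode entities like &amp;
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Join chunks with spaces, then collapse all runs of whitespace.
        # Caveat: a word split by inline tags ("wor<b>ld</b>") gets a space.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

On a typical article page most of the bytes are markup, scripts and styles, so even something this crude should cut the token count a lot before the text reaches the model.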


Brave API (Bing is good as well). Here's a little gist (Elixir). It's pretty rudimentary so far and needs refining, but works alright enough (result at bottom): https://gist.github.com/cpursley/b4af2ff3b56c912f659bd5300e4...

The most useful part is probably the prompt and usage of Phi 3 Mini 128K Instruct for web page summarization and Llama 3 for the final summary (of the summaries). I'm parsing out all but minimal content html but might even remove that to keep context length down.


Very nice, thank you!


If Perplexity’s source code is downloaded from a public web site or other repository, and you take the time to understand the code and produce your own novel implementation, then yes. Now, if you “get it from a friend”, illegally, _or_ you just redeploy the code, without creating a transformative work, then there’s a problem.

> Just pay the stupid license and if that makes your business unsustainable then it's not much of a business, is it?

In the persona of a business owner, why pay for something that you don’t legally need to pay for? The question of how copyright applies to LLMs and other AI is still open. They’d be fools to buy licenses before it’s been decided.

More importantly, we’re potentially talking about the entire knowledge of humanity being used in training. There’s no-one on earth with that kind of money. Sure, you can just say that the business model doesn’t work, but we’re discussing new technologies that have real benefit to humanity, and it’s not just businesses that are training models this way.

Any decision which hinders businesses from developing models with this data will hinder independent researchers tenfold, so it’s important that we’re careful about what precedent is set in the name of punishing greedy businessmen.


> They’d be fools to buy licenses before it’s been decided.

They are willingly ignoring licenses until someone sues them? That's still illegal and completely immoral. There is tons of data to train on. The entirety of Wikipedia, all of Stack Overflow (at least previously), all of the BSD- and MIT-licensed source code on GitHub, the entire Project Gutenberg. So much stuff, freely and legally available, yet they feel that they don't need to check licenses?


The legality of their behavior is not currently well defined, because it's unprecedented. Fair use permits transformative works. It has yet to be decided whether LLMs and their output qualify as transformative, or even if the training is capable of infringing copyright of an individual work in the first place if they're not reproducing it. In fact, there's a good amount of evidence which indicates that fair use _does_ apply, given how Google operates and what they've argued successfully (https://en.wikipedia.org/wiki/Perfect_10,_Inc._v._Amazon.com...).

Purchasing licenses when you are already entitled to your current use of the work is just bad business, especially when the legal precedent hasn't been set to know what rights might need to exist in said license.

You might not like the idea of your blog posts or other publicly posted materials being used to train LLMs, but that doesn't make it illegal (morality is subjective and I'm not about to argue one way or another). If it's really that much of a problem, you _do_ have the ability to remove your information from public accessibility, or otherwise protect it against LLM ingestion (IP restrictions, etc.).

edit: I am not a lawyer (this is likely obvious to any lawyers out there); this is my personal take.


Note that not all jurisdictions have the concept of "fair use" (use of copyrighted material, regardless of transformation applied, is permitted in certain contexts…ish). Canada, the UK, Australia, and other jurisdictions have "fair dealing" (use of copyrighted material depends on both reason and transformation applied…ish). Other jurisdictions have neither, and only licensed uses are permitted.

Because the companies behind large models (diffusion, LLM, etc.) have consumed content created under non-US copyright laws and have presented it to people outside of US copyright law jurisdiction, they are likely liable for misapplication of fair dealing, even if the US ultimately deems what they have done as "fair use" (IMO this is unlikely because of the perfect reproduction problems that plague them all in different ways; there are likely to be the equivalent of trap streets that will make this clearly copyright violation on a large scale).

It's worth noting that while models like GitHub Copilot "freely" use MIT, BSD (except 0BSD), and Apache licensed software, they are likely violating the licenses every time a reasonable facsimile pops up because of the requirement to include copies of the licensing terms for full or partial distribution or derivation.

It's almost as if wholesale copyright violations were the entire business model.


You're right. I'm definitely taking a very US-centric view here; it's the only copyright system I'm familiar with. I'm really curious how jurisdictions with no concept of fair use or fair dealing work. That seems like a legal nightmare. I expect you wouldn't even be able to critique a copyrighted work effectively, nor teach about it.

When you speak of the "perfect reproduction" problem, are you referring to cases where LLMs have spit out code which is recognizable from source training data? I agree that that's a problem, but I expect the solution is to have a wider range of training data to allow the LLM to better "learn" the structure of what it's being trained on. With more/broader training data, the resulting output should have less chance of reproducing exactly what it was trained on _and_ potentially introduce novel methods of solving a given problem. In the meantime, it would probably be smart for some kind of test for recognizable reproduction and for the answers to be thrown out, perhaps with a link to the source material in their place.
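As a rough illustration of the kind of test I have in mind, here is a naive word n-gram overlap check in Python. This is only a sketch under my own assumptions: the function names, the window size `n=8` and the `threshold=0.5` cutoff are arbitrary choices, and a real system would need an index over the training data rather than a list of strings.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """All runs of n consecutive lowercase, whitespace-separated tokens."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(output: str, source: str, n: int = 8) -> float:
    """Fraction of the output's n-grams that also appear in the source."""
    out = ngrams(output, n)
    if not out:                       # output shorter than n tokens
        return 0.0
    return len(out & ngrams(source, n)) / len(out)


def looks_reproduced(
    output: str,
    sources: Iterable[str],
    n: int = 8,
    threshold: float = 0.5,           # arbitrary cutoff, would need tuning
) -> bool:
    """True if any single source shares at least `threshold` of the output's n-grams."""
    return any(overlap_ratio(output, s, n) >= threshold for s in sources)
```

An answer that trips the check could be withheld and replaced with a pointer to the matching source. Exact n-gram matching misses lightly reworded copies, so treat it as a floor, not a guarantee.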

There's also a point, however, where the same code is likely to be reproduced regardless of training. Mathematical formulas and algorithms come to mind. If there's only one good solution to a problem, even humans are likely to come up with the same code without even seeing each others output. It seems like there's a grey area here which we need to find some way to account for. Granted this is probably the exception, rather than the rule.

> It's almost as if wholesale copyright violations were the entire business model.

If I had to guess, this is probably a case where businesses are pushing something out sooner than it should have been. I find it unlikely that any business is truly basing their model on something which is so obviously illegal. I'm fully willing to believe, however, that they're willing to ignore specific instances of unintentional copyright infringement until they're forced to do something about it. I'm no corporate apologist. I just don't want to see us throw this technology away because it has problems which still need solving.


I live in a fair dealing jurisdiction, and additional uses would need to be negotiated with the rights holders. (I believe that this is part of the justification behind the Canadian law on social media linking to news organizations.) It is worth noting that in addition to the presence or absence of fair dealing/fair use, there are also moral rights which must be considered (which is another place where LLM tech — especially the so-called summarization — likely falls afoul of the law: authors have the moral right to not be misrepresented and the LLM process of "summarization" may come to the opposite conclusion of what the author actually wrote).

Perfect reproductions apply not only to software, but to poetry, prose, and images. There is a reason why diffusion model providers are facing lawsuits over "in the style of <artist>", because some of the styles are very distinctive and include elements akin to trap streets on maps (this happens elsewhere — consider the lawsuit and eventual settlement over the tattoo image used in The Hangover 2).

With respect to "training it on more data", I do not believe you are correct — but I have no proof. The public statements made by the people who have done the training have suggested that they have done such training on extremely wide and deep sources that have been digitized, including a number of books and the wider Internet. The problem is that, on some subjects, there are very few source materials and some of those source materials have distinctive styles which would be reproduced when discussing those subjects.

I’m now more than thirty years into my career. Some algorithms will see similar code written by humans, but most code has some variability outside of those fairly narrow ranges. Twenty years ago, I derived the Diff::LCS library for Ruby from the same library for Perl, but I look back on the original code I ported from and I cannot recognize the algorithms (this is a problem for wanting to consider how to implement things differently). Someone else might have ported it differently and chosen different trade-offs than I did. Even simple things like the variable names chosen likely differ between two developers for similarly complex pieces of code implementing the same algorithm.

There is an art to programming — and if someone has a particular coding style (in Ruby, think Seattle style as distinct) which shows up in copilot output, then you have a possible source for the training.

Finally, I believe you are being naïve about businesses basing their model on "something which is so obviously illegal". Might I remind you of Uber (private car hires were illegal in most jurisdictions because it is something that requires licensing and insurance), AirBnB (private hotel-style rentals were illegal in most jurisdictions because it is something that requires licensing and insurance and specific tax filings), Napster (all your music are belong to no one, at least until the musicians and their labels got involved), etc. I firmly believe that every single commercial LLM available now (possibly with the exception of Apple's, because they have been chasing licensing) is based on wholesale intentional copyright violations. (Non-commercial LLMs may be legal under fair use and/or fair dealing provisions, which does not address issues for content created where neither fair use nor fair dealing apply.)

I am unwilling to give people like sama the benefit of the doubt; any copyright infringement was not only intentional, but brazen and challenging in nature.

I'm frankly looking forward to the upcoming AI winter, because none of these systems can deliver on their promises, and they can't even exist without misusing content created by other people.


> Purchasing licenses when you are already entitled to your current use of the work is just bad business, especially when the legal precedent hasn't been set to know what rights might need to exist in said license.

Your take on how all this works is probably more in line with reality than mine; it's just that my brain refuses to comprehend the willingness to take on that type of risk.

You're basically telling investors that your business may be violating all sorts of IP laws, you don't know and have taken no actions to determine that. It's just a gamble that this might work out, while taking billions in funding. There's apparently no risk assessment in VC funding.


> If Perplexity’s source code is downloaded from a public web site or other repository, and you take the time to understand the code and produce your own novel implementation, then yes.

Even that can be considered infringement and get you taken to court. It's one of the reasons reading leaked code is considered bad and you hear terms like cleanroom[0] when discussing reproductions of products.

[0]: https://en.wikipedia.org/wiki/Clean_room_design


It certainly can be, but it's not guaranteed. Clean room design is one way to avoid a legally ambiguous situation. It's not a hard requirement to avoid infringement. For example, the US Supreme Court ruled that Google's use of the Java APIs fell under fair use.

My point is: just because certain source material was used in the making of another work does not guarantee that it's infringing on the rights of that original IP.


Reading a book is not theft. Building a business on processing other people's copyrighted material to produce content is.


I think that's called a school


Main issues:

1) Schools use primarily public domain knowledge for education. It's rarely your private blog post being used to mostly learn writing blog posts.

2) There's no attribution, no credit. Public academia is heavily based (at least theoretically) on acknowledging every single paper you built your thesis on.

3) There's no payment. In school (whatever level) somebody's usually paying somebody for having worked to create a set of educational materials.

Note: Like above. All very theoretical. Huge amounts of corruption in academia and education. Of Vice/Virtue who wants to watch the Virtue Squad solve crimes? What's sold in America? Working hard and doing your honest 9 to 5? Nah.


1) If your blog posts are private, why are they on publicly accessible websites? Why not put it behind a paywall of some sort?

2) How many novels have bibliographies? How many musicians cite their influences? Citing sources is all well and good in academic papers, but there’s a point at which it just becomes infeasible. The more transformative the work, the harder it is to cite inspiration.

3) What about libraries? Should they be licensing every book they have in their collections? Should the people who check the books out have to pay royalties to learn from them?


> 1) If your blog posts are private, why are they on publicly accessible websites? Why not put it behind a paywall of some sort?

If I grow apple trees in front of my house and you come and take all the apples, then turn up at my doorstep trying to sell me apple juice made from the apples you nicked, the fact that I chose not to build a tall fence around my trees doesn't mean you had the right to do it. Public content is free for humans to read, not free for corporations to build paid content generation services on, using my public content taken without my knowledge or permission.

> 2) How many novels have bibliographies? How many musicians cite their influences? Citing sources is all well and good in academic papers, but there’s a point at which it just becomes infeasible. The more transformative the work, the harder it is to cite inspiration.

You are making this kind of argument: "How much is a drop of gas? Nothing. Right, could you fill my car drop by drop?"

If we have technology that can charge for producing bullshit on an industrial scale by recombining sampled works of others, we are perfectly capable of keeping track of the sources used for training and generative diarrhoea.

> 3) What about libraries? Should they be licensing every book they have in their collections? Should the people who check the books out have to pay royalties to learn from them?

Yes https://www.bl.uk/plr


All of these responses were so good that there's really nothing to add. I especially like the apple argument about a product in your front yard. You still have no basis to take them from my front yard.

If there was the equivalent of what a lot of other sites have (gems, gold, ribbons) I'd give you one. Got a lot of gems, I'll send you an admittedly teeny heliodore, tourmaline, or peridot at cost if you want one. Gemstone market's junk lately with the economy.


You're both just repeating the "you wouldn't download an apple" argument. In the context of the Internet, you're voluntarily sending the user an apple and expecting them to not do various things to it, which is unreasonable. Nothing is taken. If it were, your website would be completely empty.

Remember, Copying Is Not Theft. Copyright law is just a temporary monopoly meant to economically incentivize you. Nothing more.

BTW, pro-AI countries do differentiate between private and public posts. If it's public, it's legally fair game to train on it. If it's private, you need a license to access it. So it does matter. Also see: https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn


Schools use books that were paid for and library lending falls under PLR (in the UK), so authors of books used in schools do get compensated. Not a lot, but they are. AI companies are run by people who will loot your place when you're not looking and charge you for access to your own stuff. Fuck that lot.


> AI companies are run by people who will loot your place when you're not looking and charge you for access to your own stuff.

Funnily enough they do understand that having your own product used to build a competing product is uncool, they just don't care unless it's happening to them.

https://openai.com/policies/terms-of-use/

> What you cannot do. You may not use our Services for any illegal, harmful, or abusive activity. For example [...] using Output to develop models that compete with OpenAI.


Schools pay for books, or use public domain materials


If you think going to school to get an education is the same thing as training an LLM then you are just so misguided. Normal people read books to gain an understanding of a concept, but do not retain the text verbatim in memory in perpetuity. This is not what training an LLM does.


LLMs don’t memorize everything they’re trained on verbatim, either. It’s all vectors behind the scenes, which is relatable to how the human brain works. It’s all just strong or weak connections in the brain.

The output is what matters. If what the LLM creates isn’t transformative, or public domain, it’s infringement. The training doesn’t produce a work in itself.

Besides that, how much original creative work do you really believe is out there? Pretty much all art (and a lot of science) is based on prior work. There are true breakthroughs, of course, but they’re few and far between.


Some people memorize verbatim. Most LLM knowledge is not memorized. Easy proof: source material is in one language, and you can query LLMs in tens to a hundred plus. How can it be verbatim in a different language?


If you buy a copy of Harry Potter from the bookstore, does that come with the right to sell machine-translated versions of it for personal profit?

If so, how come even fanfiction authors who write every word themselves can't sell their work?


Doujinshi authors sell their work all the time.


These "some people" would not fall under the "normal people" that I specifically mentioned. But you go right ahead and keep thinking they are normal so you can make caveats on an internet forum.


> Normal people read books to gain an understanding of a concept, but do not retain the text verbatim in memory in perpetuity.

LLMs wouldn't hallucinate so much if they did that, either.


I think this is tricky because of course this is okay most of the time. If I produce a search index, it's okay. If I produce summary statistics of a work (how many words starting with an H are in John Grisham novels?) that's okay. Producing an unofficial guide to the Star Wars universe is okay. "Processing" and "produce content" I think are too vague.


You should be able to judge whether something is a copyright violation based on the resulting work. If a work was produced with or without computer assistance, why would that change whether it infringes?


It helps. If it's at stake whether there is infringement or not, and it comes out that you were looking at a photograph of the protected work while working on yours (or using any other type of "computer assistance"), do you think this would not make for a more clear-cut case?

That's why clean room reverse engineering and all of that even exists.


As a normative claim, this is interesting; perhaps this should be the rule.

As a descriptive claim, it isn't correct. Several lawsuits relating to sampling in hip-hop have hinged on whether the sounds in the recording were, in fact, sampled, or instead, recreated independently.


There were also cases that (very broadly speaking) claimed that songs were sufficiently similar to constitute a copyright infringement https://en.wikipedia.org/wiki/Pharrell_Williams_v._Bridgepor...

This is interesting from the legal point of view, because AI service providers like OpenAI give you "rights" to the output produced by their systems. E.g. see the "Content" section of https://openai.com/policies/eu-terms-of-use/

Given that output cannot be produced without input, and models have to be trained on something, one could claim the original IP owners could have a reasonable claim against people and entities who use their content without permission.


If the LLM is automatically equivalent to a human doing the same task, that means it's even worse: The companies are guilty of slavery. With children.

It also means reworking patent law, which holds that you can't just throw "with a computer" onto something otherwise un-patentable.

Clearly, there are other factors to consider, such as scope, intended purpose, outcome...


Computers are not people. Laws differ and consequences can be different based on the actor (like how minors are treated differently in courts). Just because a person can do it does not automatically mean those same rights transfer to arbitrary machines.


Corporations are people. Not saying that’s right. But is that not the law?


Corporations are legal persons, which are not the same as natural persons (AKA plain old human beings).

The law endows natural persons with many rights which cannot and do not apply to legal persons - corporations, governments, cooperatives and the like can enter into contracts (but not marriage contracts), own property (which will not be protected by things like homestead laws and the such), sue, and be sued. They cannot vote, claim disability exemptions, or have any rights to healthcare and the like, while natural persons do.

Legal persons are not treated and do not have to be treated like natural persons.


Is reading a book the same as photocopying it for sale?

Which of the scenarios above is more similar to using it to train an LLM?


If I was forced to pick, LLMs are closer to reading than to photocopying.

But, and these are important, 1) quantity has a quality all of its own, and 2) if a human was employed to answer questions on the web, then someone asked them to quote all of e.g. Harry Potter, and this person did so, that's still copyright infringement.


But you pay money to buy a book and read it.


Not if you check it out from the library


The library paid. Similarly, you can't go to a public library, photocopy entire books, then offer them for sale behind a subscription based chatbot.


>Not if you check it out from the library

...who paid money for the book on your behalf


Is it the same as a human reading a book?

We are not even giving the same rights to other mammals. So why should we give them to software?


How is a human reading a book in any way related or comparable to a machine ingesting millions of books per day with the goal of stealing their content and replacing them?


Directly.

What if while reading you make notes - are you stealing content? If yes - should then people be forbidden from taking notes? How does writing down a note onto a piece of paper differ from writing it into your memory?


The nice thing about law as opposed to programming is that legal scholars have long realized it's impossible to cover every possible edge case in writing, so judges exist to interpret the law.

So they could easily decide logically unsound things that make pedants go nuts: that taking notes, or even an AI system that automatically takes notes, is obvious fair use, while recording the exact same strings for training AI is not.


> The nice thing about law as opposed to programming

in programming that is called "Undefined behavior"


Because humans cannot reasonably memorize and recall thousands of articles and books in the same way, and because humans are entitled to certain rights and privileges that computer systems are not.

(If we are to argue the latter point then it would also raise interesting implications; are we denying freedom of expression to an LLM when we fine-tune it or stop its generation?)


it's comparable exactly in the way 0.001% can be compared to 10^100

humans learning is the old-school digital copying. computers simply do it much faster, but it's the same basic phenomenon

consider one teacher and one student. first there is one idea in one head but then the idea is in two heads.

now add book technology! the teacher writes the book once, a thousand students read it. the idea has gone from being in one head (book author) into most of the book readers!


> humans learning is the old-school digital copying. computers simply do it much faster, but it's the same basic phenomenon

This is dangerous framing because it papers over the significant material differences between AI training and human learning and the outcomes they lead to.

We all have a collective interest in the well-being of humanity, and human learning is the engine of our prosperity. Each individual has agency, and learning allows them to conceive of new possibilities and form new connections with other humans. While primarily motivated by self interest, there is natural collective benefit that emerges since our individual power is limited, and cooperation is necessary to achieve our greatest works.

AI, on the other hand, is not a human with interests; it's an enormously powerful slave that serves those with the deep pockets to train it. It can siphon up and generate massive profits from remixing the entire history of human creativity and knowledge creation without giving anything back to society. Its novelty and scale make it hard for our legal and societal structures to grapple with (hence all the half-baked analogies), but the impact that it is having will change the social fabric as we know it. Mechanistic arguments about very narrow logical equivalence between human and AI training do nothing but support the development of an AI oligarchy that will surely emerge if human value is not factored into how we think about AI regulation.


you're reading what I say in the worst possible light

if anything, the parallel I draw between AI learning and humans learning is all the opposite of narrow and logical... in my intent, the analogy is loose and poetic, not mechanistic and exact.

AI are tools, if AI are enslaving it is because there are human actors (I hope....) deciding to enslave other humans, not because of anything inherent to training (if AI; learning if humans)

but what I really think is that there are collections of rules (people "just doing their jobs") all collectively but disjointedly deciding that it makes the most sense to utilize AI technology to enslave other humans because the data models indicate greater profit that way.


Your response is fair and I hope you didn't take my message personally. I agree with you, AI is just a tool same as countless others that can be used for good or evil.


> humans learning is the old-school digital copying. computers simply do it much faster, but it's the same basic phenomenon

Train an LLM on the state of human knowledge 100,000 years ago - language had yet to be invented and bleeding edge technology was 'poke them with the pointy side.' It's not going to be able to do or output much of anything, and it's going to be stuck in that state in perpetuity until somebody gives it something new to parrot. Yet somehow humans went from that exact starting state to putting a man on the Moon. Human intelligence, and elaborate auto-complete systems, are not the same thing, or even remotely close to the same thing.


> bleeding edge technology was 'poke them with the pointy side.'

Relevant: https://www.smbc-comics.com/comic/rise-of-the-machines



