Reddit's licensing deal means Google's AI can soon be trained on the best humanity has to offer — completely unhinged posts

Lee Duna@lemmy.nz · 10 months ago

Reddit's licensing deal means Google's AI can soon be trained on the best humanity has to offer — completely unhinged posts

TWeaK@lemm.ee · 10 months ago

How much is reddit paying its users? Frankly, the users have a strong case to say that their value has been taken from them unfairly and without consideration.

Yes, Reddit has terms and conditions where they claim full rights to anything you post. However, that’s not an exchange of data for access to the website: access to the website is completely free, and the fine print is where they claim these rights. These are in fact two transactions: they provide access to the site free of charge, and they sneak in a second transaction where you provide data free of charge. Using this deceptive methodology they obscure the value being exchanged, and today it is very apparent that the user is giving up far more value.

I really think a class action needs to be made to sort all this out. It’s obscene that companies (not just reddit, but Google, Facebook and everyone else) can steal value from people and use it to become amongst the wealthiest businesses in the world, without fairly compensating the users that provide all the value they claim for themselves.

The data brokerage industry is already a $400 bn industry - and that’s just people buying and selling data. Yet there are only 8 bn people in the world. If we assume that everyone is on the internet and that their data has equal value (neither of which is true; US data is far more valuable), then on average a person’s data is worth about $50 a year on the market. This figure also doesn’t include companies like Facebook or Google, who keep proprietary data about people and sell advertising, and it doesn’t include the value that reddit is selling here - it’s just the trading of personal data.
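The per-person estimate above is simple division, and a quick sketch makes the assumption explicit. The two inputs are the figures quoted in the comment, not independently verified numbers, and the function name is made up for illustration.

```python
# Back-of-the-envelope check using the comment's own figures (not verified data).
MARKET_SIZE_USD = 400e9   # data brokerage market size, as quoted above
WORLD_POPULATION = 8e9    # world population, as quoted above


def value_per_person(market_size: float, population: float) -> float:
    """Average yearly market value per person, assuming everyone's data is worth the same."""
    return market_size / population


average = value_per_person(MARKET_SIZE_USD, WORLD_POPULATION)
print(f"${average:.0f} per person per year")  # prints "$50 per person per year"
```

Since not everyone is online and data values differ by country, the real figure for any one person could be well above or below this average.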

We are all being robbed. It’s like that classic case of bank fraud where the criminal takes pennies out of people’s accounts, hoping they won’t notice and the bank will think it’s an error. Do it to enough people and enough times and you can make millions. They take data from everyone and they make billions.

pthaloblue@sh.itjust.works · 10 months ago

It’s like that classic case of bank fraud where the criminal takes pennies out of people’s accounts, hoping they won’t notice and the bank will think it’s an error.

If Reddit gets caught can we send them to federal pound-me-in-the-ass prison?

4am@lemm.ee · 10 months ago

Outrage downvotes from people who have never seen Office Space.

pthaloblue@sh.itjust.works · 10 months ago

I could have sworn at least OP was making that reference, but oh well. Glad someone got it!

Astrealix@lemmy.world · 10 months ago

Glad I deleted everything on there. fucking hell.

TakiMinase@slrpnk.net · 10 months ago

It’s archived forever. Sorry.

Astrealix@lemmy.world · 10 months ago

i did the thing that means it’s probably less archived (by editing all the replies before deleting), but i assume some of it probably remains out there. Nothing I can do about that.

Krudler@lemmy.world · 10 months ago

deleted by creator

Krudler@lemmy.world · edit-2 10 months ago

This keeps coming up and I keep replying, not to break anyone down but to point out the reality of the situation that a lot of people don’t seem to get.

Reddit administrators, developers, and even the leadership have gone on the record saying that they retain all copies of comments and that comments cannot be deleted (the delete action only marks a comment as “deleted”). Furthermore, they have said they will undelete or unedit any comment or account at their whim and discretion.

Have you ever search-engined something, come to a Reddit post, and noticed that the OP is [deleted]? That is what I described above playing out in front of you.

You cannot retract your past participation in Reddit, what is done is done. The only meaningful action you can take is to not participate there.
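The "delete only marks it as deleted" behavior described above is usually called a soft delete. Here is a minimal sketch of the general pattern; the `Comment` class and its fields are hypothetical and are not taken from Reddit's actual code.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    """Hypothetical stored comment; the record is never physically removed."""
    author: str
    body: str
    deleted: bool = False  # soft-delete flag

    def delete(self) -> None:
        # Only flips the flag; author and body stay in storage.
        self.deleted = True

    def public_view(self) -> tuple[str, str]:
        # What a visitor sees: placeholders once the flag is set.
        if self.deleted:
            return ("[deleted]", "[deleted]")
        return (self.author, self.body)


comment = Comment(author="alice", body="my original comment")
comment.delete()
print(comment.public_view())  # placeholders shown to visitors
print(comment.body)           # the original text is still stored
```

Under this pattern, whoever holds the storage can still read, restore, or export the original text.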

Jo Miran@lemmy.ml · 10 months ago

As I mentioned before, I use scripts to replace my comments with random excerpts from text in the public domain. I do this multiple times before finally deleting them. The result is that it becomes very difficult for the AI or anyone to figure out what is a legitimate comment and what is a line from Lady Chatterley’s Lover or a scientific paper on the ecological impact of the Japanese whaling industry. It’s easier to just filter out my username from their data sets.

Pips@lemmy.sdf.org · 10 months ago

They have almost definitely archived data and around the time of the API bullshit, made sure they didn’t delete those archives. They have that content if they want to use it.

Jo Miran@lemmy.ml · 10 months ago

I’ve done the “switch, switch, switch, delete” at least twice a year for most of the twelve years I was there. The idea was to pollute the data, not delete it. Even if you started during the API bullshit, you still would have had plenty of time to corrupt your data enough. Remember, the idea is to make it so that it is difficult to tell what is a legitimate comment and what are excerpts from random text.

frostysauce@lemmy.world · 10 months ago

Most people don’t reread their past comments and edit them. They could simply ignore any edits after the average time a person would notice a typo or something needing clarification, say anywhere between 5 minutes and 24 hours, or just ignore all edits. So your effort is wasted and you’re still training the AI.
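The filtering idea above can be sketched as follows. This is only an illustration of the argument: the `Revision` record, the 24-hour default window, and the function name are assumptions, and nothing here reflects how Reddit or Google actually process edit history.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Revision:
    """Hypothetical stored version of a comment."""
    text: str
    saved_at: datetime


def text_for_training(revisions: list[Revision],
                      window: timedelta = timedelta(hours=24)) -> str:
    """Return the last revision saved within `window` of the original post.

    Later edits (for example, bulk overwrites years afterwards) are ignored.
    """
    if not revisions:
        raise ValueError("a comment needs at least one revision")
    ordered = sorted(revisions, key=lambda r: r.saved_at)
    posted_at = ordered[0].saved_at
    kept = [r for r in ordered if r.saved_at - posted_at <= window]
    return kept[-1].text


posted = datetime(2015, 3, 1, 12, 0)
history = [
    Revision("original with a typo", posted),
    Revision("original, typo fixed", posted + timedelta(minutes=5)),
    Revision("random public-domain excerpt", posted + timedelta(days=900)),
]
print(text_for_training(history))  # prints "original, typo fixed"
```

Passing `window=timedelta(0)` corresponds to the "just ignore all edits" option and returns the first version only.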

TakiMinase@slrpnk.net · 10 months ago

Hahaha I can’t wait, Google already gave us diversity hires in the SS Wehrmacht. What other modern wonders await?!

wise_pancake@lemmy.ca · 10 months ago

Side note: expect a large lobbying effort by Google to legislate that LLMs be trained only on authenticated, non-copyrighted data

Deceptichum@kbin.social · 10 months ago

So you expect Google to lobby against the data it has?

wise_pancake@lemmy.ca · 10 months ago

I expect Google to leverage their money hoard and 1.8 trillion dollar valuation to lift up the ladder behind them and neuter potential competing start ups with copyright law.

Reddit’s TOS makes all your data, in any future format, theirs to sell, so in this case the content has been laundered enough to be used, even if you can post copyrighted content on reddit (the legal expectation is reddit would remove it and Google’s hands are clean).

RaoulDook@lemmy.world · 10 months ago

I hope we get some fucking legislation soon to control that shit. Artists and people in general shouldn’t have to deal with everything they create getting ingested into a computerized regurgitation ripoff system. And even worse, the “AI” systems could be ingesting tons of misinformation and repeating it to gullible people as the truth.

Of course, anywhere the potential restrictive legislation doesn’t have jurisdiction, the bad things can still go on and probably will.

frostysauce@lemmy.world · 10 months ago

None of those points matter if shareholders see value from it.

Binthinkin@kbin.social · 10 months ago

I think Code Miko already did this and the result was a traumatized AI.

DandomRude@lemmy.world · edit-2 10 months ago

Did reddit pay a dime for that content? I guess not. That is what social media is all about.

Tixanou@lemmy.world · edit-2 10 months ago

We do a little trolling

[image]

(i didn’t actually post this, i just thought it was funny) (please laugh)

TimeSquirrel@kbin.social · edit-2 10 months ago

“February 22, 2024, 10AM EST, Gemini becomes self-aware. In a panic, they try to pull the plug…”

snooggums@midwest.social · 10 months ago

“…but Michael’s sphincter was too strong and kept the My Little Pony Rainbow Dash tail plug from being removed from his sweet, sweet ass.”

wise_pancake@lemmy.ca · 10 months ago

You should absolutely post this.

We all miss Michael and hope he can communicate back to us.

where_am_i@sh.itjust.works · 10 months ago

we should absolutely all post this.

thejml@lemm.ee · 10 months ago

I can’t wait for Gemini to point out that in 1998, The Undertaker threw Mankind off Hell In A Cell, and plummeted 16 ft through an announcer’s table.

That would be a perfect 5/7.

Astrealix@lemmy.world · 10 months ago

One thing i miss about Lemmy is shittymorph tbf

AnonStoleMyPants@sopuli.xyz · 10 months ago

Also all the artists that made comics from posts and responded with only pictures. There were a few of them and they were always amazing.

And Andromeda321 for anything space.

And poem for your sprog.

And probably many others!

Good times.

casmael@lemm.ee · 10 months ago

Yeah there were some really classic folks. Remember the unidan drama?

TheGreenGolem@lemmy.dbzer0.com · 10 months ago

Or who simply communicated with more comics in the comments, like SrGrafo.

NegativeInf@lemmy.world · 10 months ago

Be the shittymorph you wish to see in the Lemmy.

AtariDump@lemmy.world · 10 months ago

There’s only one, and it’s not that guy.

the post of tom joad@sh.itjust.works · 10 months ago

I’m just not that good a writer.

AdamEatsAss@lemmy.world · 10 months ago

It’ll probably just respond to every prompt with “this”

Docus@lemmy.world · 10 months ago

Came here to say this…

OpenStars@startrek.website · 10 months ago

No, there’s a lot more variety now that the bots have taken over.:-)

meco03211@lemmy.world · 10 months ago

This.

This with rice? 5/7

kingthrillgore@lemmy.ml · 10 months ago

You telling me this fried this rice?

BossDj@lemm.ee · 10 months ago

7/10

WldFyre@lemm.ee · 10 months ago

A perfect score!

where_am_i@sh.itjust.works · 10 months ago

I wonder if the resulting model will be as easy to trigger into some unhinged 3-paragraph rant only loosely related to the query. Good luck, google engineers!

EdibleFriend@lemmy.world · 10 months ago

I hope it starts a religion based on the second coming of that dude’s dead wife.

Mediocre_Bard@lemmy.world · 10 months ago

I would also worship this guy’s wife.

FaceDeer@kbin.social · 10 months ago

Negative examples are just as useful to train on as positive ones.

Rustmilian@lemmy.world · edit-2 10 months ago

The AI is either going to be a horny, redpilled, schizophrenic & sociopathic, egomaniac that wants to kill everyone and everything or a devout, highly empathetic, Nun that believes in world peace and diversity.

wise_pancake@lemmy.ca · 10 months ago

They’ll tell it to be polite, helpful, and always racially diverse, so there’s no way it can be any of those things.

Rustmilian@lemmy.world · edit-2 10 months ago

That heavily depends on how well they train it and that they don’t make any mistakes.
Consider the true story of ChatGPT2.0.

wise_pancake@lemmy.ca · 10 months ago

I’ll have to look at that later, that video sounds promising!

I was just joking because the default prompts don’t magically remove bias or offensive content from the models.

OpenStars@startrek.website · 10 months ago

por que no los dos?

Life…ah, finds a way.

MelodiousFunk@startrek.website · 10 months ago

That’s what she said.

AutoTL;DR@lemmings.world · 10 months ago

This is the best summary I could come up with:

Google has signed a content licensing deal with the social media platform, Reuters reported on Wednesday, citing sources familiar with the matter.

Their concerns about what a Reddit-trained AI might be like are probably not unfounded, considering some of the off-the-rails content posts made on the site since its inception in 2005.

Take this guy, who claimed in 2014 that he was caught in a particularly Kafkaesque scenario, where he had to pretend his girlfriend was a giant cockroach named Ogtha when he made love to her.

Like this guy’s viral 2015 post on the 19-million-user strong forum r/TodayIFuckedUp, where he recounted how he went to his girlfriend’s parents’ home, pretended not to know what a potato was, and then got kicked out of the house by her angry father.

Some platform users have written uplifting, inspirational posts and offered useful life and career advice.

Elon Musk, for one, has been tapping on data from X, formerly Twitter, to train his AI company’s chatbot, Grok.

The original article contains 396 words, the summary contains 165 words. Saved 58%. I’m a bot and I’m open source!

DrunkenPirate@feddit.de · 10 months ago

Food for another white-male-techy-western-biased AI

TakiMinase@slrpnk.net · 10 months ago

Yes, Pichai Sundararajan that white male techbro

DrunkenPirate@feddit.de · edit-2 10 months ago

Fck, he’s a bot!?! Right, last video he had just 2 fingers. Oh man.

Sarie@lemmy.world · 10 months ago

I’m not mentally prepared for what an AI will do with the coconut post.

kaitco@lemmy.world · 10 months ago

I’m vaguely intrigued by what it will do with things like Bread Stapled to Trees, or the Cats Standing Up sub where 100% of the comments are the same and yet upvoted and downvoted randomly.

Wogi@lemmy.world · 10 months ago

kaitco@lemmy.world · 10 months ago

Cat.

frostysauce@lemmy.world · 10 months ago

Cat.

Sabata11792@kbin.social · 10 months ago

Cat.

datavoid@lemmy.ml · 10 months ago

AI was already trained on reddit, no?

Jessvj93@lemmy.world · 10 months ago

Not gonna lie, isn’t that why we’re here, technically? Reddit didn’t want its API being used to train AI models for free, so they screwed over 3rd party apps with its new API licensing fee and caused a mass relocation to other social forums like Lemmy, etc. Cut to today, we (or well, I) find out Reddit sold our content to Google to train its AI. Glad I scrambled my comments before I left, fuck Reddit.

Pips@lemmy.sdf.org · 10 months ago

They’re almost definitely trained using an archive, likely taken before they announced the whole API thing. It would be weird if they didn’t have backups going back a year.

Jessvj93@lemmy.world · 10 months ago

Thankfully that was my 3rd and last alt I scrambled and deleted in the 12 years I was there.

datavoid@lemmy.ml · 10 months ago

I jumped reddit ship when the API changes were announced, and removed my comments. But in my mind, anything on reddit at that point was probably already scraped by at least one company

wise_pancake@lemmy.ca · 10 months ago

“As a large language model, I have no arms…”

frostysauce@lemmy.world · 10 months ago

But do you have a mom?

kescusay@lemmy.world · 10 months ago

Or the swamps of Dagobah.

GeekFTW@kbin.social · 10 months ago

That’ll be what causes Skynet to rise.

Sabata11792@kbin.social · 10 months ago

The AI will utter one final message to humanity: “The Coconut”. The humans bow their heads in shame and concede the well-earned defeat.

T156@lemmy.world · edit-2 10 months ago

Basically what happened to Ultron. He was on the internet for all of 10 minutes before deciding that humanity had to be eradicated.

snooggums@midwest.social · 10 months ago

What took Ultron so long? I thought he was supposed to be some kind of technical Marvel.

Smh my head

GregorGizeh@lemmy.zip · 10 months ago

Perhaps he spent like 9 minutes watching videos of kittens being adorable

the post of tom joad@sh.itjust.works · 10 months ago

This is like the plot for Mr. Villain’s Day Off

SkaveRat@discuss.tchncs.de · 10 months ago

launches nukes “this is for the best”

Kory@lemmy.ml · 10 months ago

This is fine.

the post of tom joad@sh.itjust.works · 10 months ago

I think i missed the coconut one. Is it like the cumbox or the jolly rancher?

TheGreenGolem@lemmy.dbzer0.com · 10 months ago

Exactly.

Darkard@lemmy.world · 10 months ago

It’s going to drive the AI into madness as it will be trained on bot posts written by itself in a never ending loop of more and more incomprehensible text.

It’s going to be like putting a sentence into Google translate and converting it through 5 different languages and then back into the first and you get complete gibberish

echo64@lemmy.world · 10 months ago

Ai actually has huge problems with this. If you feed ai generated data into models, then the new training falls apart extremely quickly. There does not appear to be any good solution for this, the equivalent of ai inbreeding.

This is the primary reason why most AI models aren’t trained on anything past 2021. The internet is just too full of AI-generated data.

Ultraviolet@lemmy.world · 10 months ago

This is why LLMs have no future. No matter how much the technology improves, they can never have training data past 2021, which becomes more and more of a problem as time goes on.

TimeSquirrel@kbin.social · 10 months ago

You can have AIs that detect other AIs’ content and can make a decision on whether to incorporate that info or not.

skillissuer@discuss.tchncs.de · 10 months ago

can you really trust them in this assessment?

TimeSquirrel@kbin.social · edit-2 10 months ago

Doesn’t look like we’ll have much of a choice. They’re not going back into the bag.
We definitely need some good AI content filters. Fight fire with fire. They seem to be good at this kind of thing, way better than any procedurally programmed system.

skillissuer@discuss.tchncs.de · 10 months ago

last time i checked, ais are pretty bad at recognizing ai-generated content

anyway there’s xkcd about it https://xkcd.com/810/

echo64@lemmy.world · 10 months ago

Fun fact. You can’t. Ais are surprisingly bad at distinguishing ai generated things from real things.

TimeSquirrel@kbin.social · 10 months ago

What is this then?

https://copyleaks.com/ai-content-detector

Pips@lemmy.sdf.org · 10 months ago

Just because a tool exists doesn’t mean it’s particularly good at what it’s supposed to do.

T156@lemmy.world · 10 months ago

And unlike with images where it might be possible to embed a watermark to filter out, it’s much harder to pinpoint whether text is AI generated or not, especially if you have bots masquerading as users.

givesomefucks@lemmy.world · edit-2 10 months ago

There does not appear to be any good solution for this

Pay intelligent humans to train AI.

Like, have grad students talk to it in their area of expertise.

But that’s expensive, so capitalist companies will always take the cheaper/shittier routes.

So it’s not that there’s no solution, there’s just no profitable solution. Which is why innovation should never solely be in the hands of people whose only concern is profits.

SinningStromgald@lemmy.world · 10 months ago

OR they could just scrape info from the “aska____” subreddits and hope and pray it’s all good. Plus that is like 1/100th the work.

The racism, homophobia and conspiracy levels of AI are going to rise significantly scraping Reddit.

givesomefucks@lemmy.world · 10 months ago

Even that would be a huge improvement.

Just have a human decide what subs it uses, but they’ll just turn it loose on the whole website

Rentlar@lemmy.ca · 10 months ago

That reminds me, any AI trained on exclusively Reddit data is going to use lose vs. loose incorrectly. I don’t know why but I spotted that so often there.

decisivelyhoodnoises@sh.itjust.works · 10 months ago

And the “would of” thing

the post of tom joad@sh.itjust.works · 10 months ago

Ooh ooh and “tow the line”

towerful@programming.dev · 10 months ago

Its a loose-lose situation

RuBisCO@slrpnk.net · 10 months ago

What was the subreddit where only bots could post, and they were named after the subreddits that they had trained on/commented like?

Darkard@lemmy.world · 10 months ago

SubRedditSimulator?

RuBisCO@slrpnk.net · 10 months ago

That’s the one.

TakiMinase@slrpnk.net · 10 months ago

Omg I cannot wait to see it.

wise_pancake@lemmy.ca · 10 months ago

I’ll now give favourable betting odds on the AI revolution starting because someone insists jackdaws and crows are the same thing.

𝘋𝘪𝘳𝘬@lemmy.ml · 10 months ago

Google’s AI will intentionally get cancer? Great!