AI and the human factor — Transcript

Like, reinforcement learning is amazing if there’s a true north. So, as DeepMind trained on a video game or trained on chess, there is a true north when you win and when you don’t. If you train on what is a good joke and you use an AI model to decide what a good joke is, then obviously it’s BS in, BS out. Hello, everyone.

Good morning, Lutz. Good morning. How are you? Amazing.

Welcome to another episode of Lutz and Jasper. Lutz, first question, how’s the coffee? My sidekick, if you don’t only want to listen to my podcast, I do have a Google identity and I do score coffee shops. That’s what I actually do.

So, if you want to figure out where good coffee is in the world, you have to follow me on Google. But I went through Japan and Japanese people love drip coffee. They don’t love espressos. So, you get amazing light roasts, fruity light roasts, drip.

But if you go to espressos, it’s all dark. It’s all bad. It’s not good. So, Japan is a good place for drip coffee.

Except if you go to Sapporo, there is the Mermaid espresso shop, which is amazing. But any other one wasn’t. And I can agree with that. Last time I was in London, a friend of mine who’s married to a Japanese person took me to a coffee shop in Soho, which is Japanese.

The coffee was amazing, the drip coffee. So, let’s kick it off. We both are reading a lot of stuff. And so, we read an article in The Verge, which I personally liked, on the growing demand for human labor in AI training.

And we were both like, oh, that’s interesting. But that’s a new topic. What’s so special about it? And we felt like, hey, let’s do a short episode on this.

But before we do it, let’s kick off with some chatter that’s going on in the AI scene. Lutz, you want to start? We heard about the Databricks deal. Yeah.

So, Databricks acquired MosaicML for a staggering amount of money. What is this all about? I think the interesting one is it’s $1.3 billion. And the company is pretty new, right?

So, it’s not that Mosaic is a very old company, although Databricks is a very old company. And we remember, we spoke about Databricks. They have this model that they call Dolly, because they essentially just copied other AI models and trained it for themselves. And they…

Databricks is using this to enable enterprises to use their platform, having more AI applications. But apparently, and I found that interesting, that’s not enough. So, they have to pay quite some money because their valuation is $3.6 billion, at least the last one I found. And now it’s $1.3 billion, which will probably be an all-shares deal.

But still, it’s a lot of money just for a company. Now, the question is, are they desperate? What’s the discussion about how models get trained and how human labor is? And how much human labor is used in those models?

It’s actually interesting. We have reinforcement learning from human feedback. And Databricks was actually one of the companies who pushed out a data set on a very reasonable license, which then led to way more development on large language models. So, something is ongoing.

And for me, it makes perfect sense that Databricks acquired them. They have a very clear notion of not only offering a data platform and data sets, but also carving out a stake in that space. And the platform approach looks good to me. If you look at Hugging Face, how many large language models we have, the number is insane.

I think about a thousand got launched this year alone already. Now, obviously, everybody talks about OpenAI, but for VR now, right? Or like, or not like, or Inflection. Inflection, not trivia.

Inflection, big OpenAI trivia questions. No, but everybody talks about those big models, but there are many, many other models. And it’s good to have a platform that actually has fine-tuned data, fine-tuned models. And to our discussion later on, controlled models.

Yes. And maybe one thing to add to think about it, they claim they deliver 2x to 7x faster model optimization. So it’s lower cost. And this is obviously an issue as we discussed it before.

So looking at the whole model, it’s a little bit more complicated. And all the models you’ve just described, Mosaic must have something great that will help Databricks. So good luck with that. It’s a lot of money.

So they definitely have to deliver on it. Before we jump into the main topic, let’s talk about the other things which have been ongoing. Because now we have all those models. Now we have the exciting times where everybody tries out new things.

And we see the backlash. We see Disney used AI images. I just this morning published a Forbes article. And I used, I used AI to generate an image, right?

Disney did the same thing. Why shouldn’t they? However, there was backlash. You’re talking about the Secret Invasion intro, which looks a bit awkward.

Yeah. Secret Invasion intro. Awkward or not, they use a tool. I suppose.

Yeah. They probably had spelling controls running over their scripts and nobody complained about it. But now the new tool they complain about. You see it every time when you watch it.

And it’s a pretty good thing. It’s a pretty good one to watch, by the way. So always to get a reminder. And people don’t like it.

Yes. Now, other companies are more on the other side. They’re not using those tools. So Nature actually banned AI generated images.

Wow. And Grammy, the Grammy awards and AI generated music. Yeah. I hope they can detect, I hope they can detect it.

But yes, they definitely made this official. Yes. Yes. So it’s, it’s funny.

There is, there has been actually a pretty need for AI. There’s a very interesting article out from Marc Andreessen about AI, that AI won’t kill us all. And yes, there will be bad actors using AI and there will be a lot of good actors using AI. And he pointed to one amazing website, which is called pessimistarchive.org.

And if you actually look at this, it shows all the inventions and it goes through those inventions and it shows articles from the time when the invention hit the road, where people were saying, this is bad. For example, in the 1880s, novels were seen as corrupting the youth, planting dangerous ideas into the heads of housewives and distracting everyone from more serious, important books. And we had this for the portable cassette player. There was a time when it was actually illegal to walk with headphones in New York.

It was probably also with the cars because that’s suddenly the horses, you know, issue. So, yeah. And I think that’s a good point. And I think this will be an ongoing debate.

So we watch this closely. By the way, I also learned over a dinner last week, coming back to the Secret Invasion topic and using AI, that agencies now actually try to mimic music, not so much that it’s copyright infringement, but so that it sounds a bit similar to certain artists. So when you play that, it’s not infringement. So these are still humans doing it.

It’s an interesting other application. Let’s jump into the topic. So what do we know? We know that there are humans out there that help AI with the output, that are correcting the output.

But we also know that the input, so the data that goes in, at least in the past needed a lot of help. So just putting in data like that wouldn’t work, right?

You have to clean the data, but you also have to tell the AI model what the data actually means. So that’s supervised learning. And I invested in a lot of companies, accounting automation, chat automation, invoicing, where this was just the norm, right? You would never just feed in data.

You would tell the AI model: this is good data, this is bad data. This is a good response, my dear model. This is a bad response.

And that means that especially for the more complex applications, like accounting automation, you would need very expensive people doing this annotation and labeling work. And then sometimes you could even use your customers or users because, as you might know, when you use ChatGPT, it says, is it good?

Is it bad? Is this result better or that result better? The same happens when you automate chat as we discussed with Retu of Ultimate AI. Hey, it’s unsupervised.

You just use it. It works. Whatever you do. It works.

I tried lawyer jokes today to annoy our legal counsel. And we tried a lot of jokes and I could really see the jokes were repeating and not really good. I should read one of them maybe out loud. But the question now is for me, when you compare the supervised learning that was the initial big curse I had at least as far as I can remember.

And now we come to the unsupervised learning. Why do we still have so many people out there that have to help the model understand what is good and bad? Let’s do a quick primer to what is supervised and unsupervised.

Supervised is there is a true north. There is a label. Unsupervised is there is no label. There is no understanding what does it mean.

So unsupervised learning, very traditional stuff, would be K-means algorithms, where I just define a formula and let the computer fit the formula. And whether the outcome makes sense or not is up to the user. Because the computer, when it’s not supervised, comes up with something which is not true north.

Whereas when I have a supervised label, I have a true north. So first of all, who defines true north? Right? So supervised learning can only be as good as the true north I put in.
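[Editor's sketch] To make the primer concrete, here is a minimal Python illustration of the two settings. It is not something the speakers ran: the price-like numbers, the labels "cheap" and "expensive", and the function names are all assumptions made up for this example. The unsupervised part is a plain one-dimensional K-means, the supervised part is a nearest-centroid classifier that relies on labels someone had to provide.

```python
import random

def kmeans_1d(points, k, iters=20, seed=0):
    """Unsupervised: group numbers into k clusters without any labels."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the data points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return sorted(centroids)

def fit_nearest_centroid(points, labels):
    """Supervised: the labels are the 'true north' a human supplied."""
    sums = {}
    for p, y in zip(points, labels):
        total, n = sums.get(y, (0.0, 0))
        sums[y] = (total + p, n + 1)
    return {y: total / n for y, (total, n) in sums.items()}

def predict(model, p):
    """Return the label whose centroid is closest to p."""
    return min(model, key=lambda y: abs(p - model[y]))

# Hypothetical data: the same numbers, once without and once with labels.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
labels = ["cheap", "cheap", "cheap", "expensive", "expensive", "expensive"]

centroids = kmeans_1d(points, k=2)            # two groups, but no meaning attached
model = fit_nearest_centroid(points, labels)  # groups that carry a human meaning
```

K-means finds two groups, but it is up to the user to decide what they mean. The supervised model can answer "cheap or expensive?", but only as well as the labels it was given.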

And that is the discussion on bias which we had last time, or like a couple of episodes back. Now for the current models, we actually have a model.

We actually use essentially statistical tooling, like the sentence which we always say in this podcast here: life is like a box of. Okay. Now there is a supervision, which is that many people say chocolate after the sentence. That is supervision.

That is a supervised approach based on the statistical evidence that many people say chocolate. That’s it. It’s just the following word, a supervised approach. However, how do you judge whether that is the right word to use?

If there is a true north, like Obama was the president of the United States, this is a true north fact. This is a fact. Otherwise, it’s perception.

Obama’s policies were... and now it’s depending on which political spectrum you’re on what kind of word you will use. And it’s even harder if it gets down to generation of new things, where you have no supervised true north. So that’s kind of the perfect setup.

I have all the data. I know what the data means. The model just trains based on it. And then I would even alter the little bits of the results.

Fine tune. Maybe you might call it. So make sure this is great. But I would even ingest more perfect data.

Also understanding what the model does with the data. Ingesting more data that is, right, labelled.

Okay. But if I understand this correctly, it doesn’t feel like we are on a fully unsupervised track. What are current LLMs actually doing? It sounds very clear.

Right. Because the typical way of how we had supervised learning is there was a very clear yes or no decision. Typical machine learning programs would predict whether somebody clicks on a movie or buys a certain thing. Or whether a transaction is fraud or not.

Here, the label data is very clear. 100% true. In the way of large language models, we are also using machine learning to define what is the most likely next word in a chain. Life is like a box of chocolates.
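[Editor's sketch] The "most likely next word" idea can be shown with a toy word-pair counter in Python. This is a deliberately simplified stand-in: real large language models use neural networks over tokens, not raw counts, and the three-sentence corpus and function names here are invented for illustration.

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count, for every word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def most_likely_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Hypothetical mini corpus; the "label" for each word is simply the word
# that people actually wrote next.
corpus = [
    "life is like a box of chocolates",
    "a box of chocolates is a gift",
    "a box of nails is not a gift",
]
counts = train_bigrams(corpus)
print(most_likely_next(counts, "of"))  # the follower seen most often after "of"
```

The counter only knows what people wrote most often. It has no notion of whether "chocolates" is true, funny, or appropriate, which is the gap the discussion turns to next.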

The question is, is that really the truth? And that makes large language models way more complicated. Because if it’s an accurate fact, we can easily say it’s true or not. Obama was the president of the United States.

There is no question about it. If it’s more something which has to do with perception, it becomes harder. Whether a joke is funny or not, you used the example of your lawyers. It is in the eyes of the lawyers, and different lawyers find different things funny.

Now, it gets even harder if you are thinking about generation of an image. Art is famous for generating images the human society didn’t expect. And then they kind of thought this is amazing. So at what point in time is something just random?

And at what point in time is it amazing? So for us, it is a supervised learning process. But you need humans in order to figure out whether that is true or not. And it’s also a matter of whether the outcome is actually accurate or whether the perception is the right one.

Maybe to add here, I also asked for a West Coast versus East Coast joke. And it gives me the feeling the model is a bit biased towards the West Coast. So why did the West Coast venture capital investor take up surfing? Because they wanted to catch the next big wave of startups while the East Coast investors were still stuck in a suit and tie, boardroom mentality, riding the subway to their next meeting.

Wow. Yes. So you know, the East Coast guys are in New York, so they have to take the subway, always wearing suits. It is amazing, right?

But it’s not funny. But anyways, it’s making up something. Well, you might think it’s not funny. But that comes down to, so what we see currently is we see a lot of people training and correcting models.

Maybe a plug to the… LegalOS episode, which we published last week. Pretty good episode. And Torben talked about the need for guardrails.

Goldman Sachs last week published, or last week, or two weeks ago, published that they believe 44% of all human lawyer tasks will be automated soon. Now, we should say the existing tasks will be automated and lawyers will find new tasks, which makes… Definitely. …more ability for everybody to access legal matters.

But we are trying to automate those tasks. But all of us know that legal matters are not 100% there is one true north. There is a way of interpreting the law. There is a way of finding context and then interpreting the law.

And this is all complicated. And because of that, we need guardrails and we need humans to actually do that. And we need humans to actually train those models on those efforts. Why don’t sharks attack lawyers?

Professional courtesy. The worst one was actually, why don’t lawyers go to the beach? Because even the sand can’t escape their meticulous cross-examination. Okay.

Wow. Okay. This is highly intellectual stuff, guys. So, okay.

So, now we understand we still need these people. And the question is, what happens in the future? Because I would be pretty… And I mean, Databricks is not doing this quality assurance that ScaleAI, Mechanical Turk actually learned this.

There’s even more companies. But that they all would do for me. So, I would have to not just pay for compute. I would also have to pay for all the people, and I would have to manage those people and the quality, same as in the old days of fully supervised learning.

So, now the future, what is the future doing? How can I get… I don’t think you can. At the moment, the…

How do you explain this best? So, at the moment, we are using the corpus of all human text to train machines on logic, human logic, on conclusion, human conclusion powers, on conclusion chains, and so on and so forth. And now we have this underlying structure from the machinery. But we are generating new text.

We are giving answers. And we have to have a way to control for those answers, whether those answers make sense, whether the intended outcome makes sense. And for that, we’re using something which is called RLHF, reinforcement learning from human feedback.
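[Editor's sketch] One common way to turn "is this result better or that result better?" feedback into a training signal is to fit a score per answer from pairwise human preferences (a Bradley-Terry style model). The Python below is a toy version of that idea only, not the pipeline of any particular vendor. The answers "A", "B", "C" and the preference list are hypothetical.

```python
import math

def fit_rewards(items, preferences, lr=0.1, epochs=500):
    """Fit one score per item from (winner, loser) pairs.

    Gradient ascent on log sigmoid(score[winner] - score[loser]),
    the pairwise objective of a Bradley-Terry preference model.
    """
    reward = {item: 0.0 for item in items}
    for _ in range(epochs):
        for winner, loser in preferences:
            diff = reward[winner] - reward[loser]
            p_win = 1.0 / (1.0 + math.exp(-diff))  # model's belief the winner wins
            step = lr * (1.0 - p_win)              # larger step when the model is surprised
            reward[winner] += step
            reward[loser] -= step
    return reward

# Hypothetical feedback: each pair is (answer the human preferred, other answer).
# Note that annotators disagree on A versus B.
preferences = [("A", "B"), ("A", "B"), ("B", "A"), ("B", "C"), ("A", "C")]
reward = fit_rewards(["A", "B", "C"], preferences)
ranking = sorted(reward, key=reward.get, reverse=True)
```

The fitted scores reflect the majority of the people who were asked, so whoever gives the feedback effectively defines the true north.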

And what we see from all the examples, it’s pretty clear that human feedback is not sufficient. And that despite all the efforts we are putting in, the machines are there. And that’s why the machine learning algorithm goes bad from time to time, in edge cases.

And we should discuss a little bit why that is. So, we are labeling data as an answer is correct or not. So, your lawyer jokes, you could now ask me as the labeling person, was this funny? And depending on what type of person I am, I will say, yes, it was funny or no, it wasn’t funny.

Now, if it’s not funny, then the machine learning algorithm will say, no, it wasn’t funny. And the machine learning algorithm will learn, okay, that was a bad joke. Let’s redo it. If I say it was funny, it will know that this is the right joke.

But what is true? Well, true is that certain people think certain jokes are funny and others don’t. It’s a very personal thing. So, now I need to ask many, many different people to figure out what is a funny joke or not.
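[Editor's sketch] Asking many different people can be as simple as collecting ratings and looking at them per audience instead of as one number. The Python below assumes a made-up list of (audience group, found it funny) ratings for a single joke. The group names and values are illustrative only.

```python
from collections import defaultdict

def funny_rate_by_group(ratings):
    """Share of 'funny' votes per audience group for one joke."""
    totals = defaultdict(lambda: [0, 0])  # group -> [funny votes, all votes]
    for group, is_funny in ratings:
        totals[group][0] += int(is_funny)
        totals[group][1] += 1
    return {group: yes / n for group, (yes, n) in totals.items()}

# Hypothetical ratings for one lawyer joke.
ratings = [
    ("lawyers", False), ("lawyers", False), ("lawyers", True),
    ("investors", True), ("investors", True), ("investors", False), ("investors", True),
]
rates = funny_rate_by_group(ratings)
overall = sum(is_funny for _, is_funny in ratings) / len(ratings)
```

The overall rate hides that the two groups disagree. A label such as "funny" only becomes usable once it is tied to who gave the feedback.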

Plus, I might want to know who’s actually in front of the computer who wants to have the joke. So, let’s say a user profile wouldn’t be too bad so I can adjust my output. Well, this is why you actually want to have lots of different feedback on the joke, so that the machine learning algorithm understands what are good jokes overall and has different classes of jokes for different people, ranging across all kinds of jokes.

We’re talking racist jokes. Yes. We’re talking dirty jokes. We’re talking intellectual jokes.

We have them all in here so that the machine learning algorithm can pick the right joke when prompted correctly. And that’s what you get when you train the large language model on Reddit, right? So, like, using data from Reddit. Now you have all those different jokes.

Now you need to figure out, from racist to other kinds of jokes, how do you create the guardrails for what is not funny. So, you need to have guardrails in a ChatGPT discussion. Remember, ChatGPT is the chat interface on top of the large language model from OpenAI.

So, in a ChatGPT discussion, racist jokes don’t come up. So, you need to have guardrails. And now a human needs to say, actually, you trained it on racist jokes. Yes, you know racist jokes, but actually, that joke is racist.

Don’t tell it. So, how do you train that effort? Yeah. And it’s interesting because when you…

I don’t have TikTok. But this is what TikTok, Instagram, Facebook, many others do. They show you content individualized to your usage, preference, whatever. ChatGPT doesn’t so far.

Midjourney doesn’t. It’s just prompting. But it’s not… This will happen eventually at some point in time.

But a bit missing. Maybe, dear startups out there or founders, this could be something. But just think about it. Cool.

We should actually talk about why this is so complicated. So, now we established what we are doing here, right? In terms of human feedback. But we should talk about why it’s complicated.

Because… And we know this from before we had large language models. Because it’s the same thing for images. For images, you have the same thing.

You have something which is a true north. This is from my time at Google. I actually worked on a team.

Google did some work on retina scanning of the back of the eye. And they built an AI algorithm to detect common forms of blindness.

Now, how do they get the true north? They get specialists to analyze retina images. And those specialists create a label. Also, those specialists would disagree on a given image.
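[Editor's sketch] When several specialists grade the same image and disagree, a simple way to build a label is a majority vote plus a record of how strong the agreement was. The Python below illustrates that idea only. It is not a description of how the Google team built its labels, and the image IDs and gradings are invented.

```python
from collections import Counter

def consensus(labels):
    """Return the majority label and the share of graders who chose it."""
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels)

# Hypothetical gradings: five specialists per retina image.
gradings = {
    "image_001": ["disease", "disease", "disease", "healthy", "disease"],
    "image_002": ["healthy", "disease", "healthy", "disease", "healthy"],
}
for image_id, labels in gradings.items():
    label, agreement = consensus(labels)
    print(image_id, label, agreement)
```

Keeping the agreement share next to the label makes it visible which parts of the "true north" are solid and which are close calls that may deserve another look.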

So, five specialists would come up with slightly different results on the same images, because humans disagree. However, in general, we have a true north here. Now, if you talk more about other topics, it becomes harder. And the true north might be changing.

So, for example, ImageNet. ImageNet is a database for classifying images. It initially used a hierarchy from the old Library of Congress. In that library, LGBTQ topics were classified under abnormal sexual relations, including sexual crimes.

So, now you have a system where you classify images based on a taxonomy. But the taxonomy is outdated and based on an old setup.

So, how do you train your model to say something which is funny if the underlying notion of funniness has changed? Or that something is correct, if the underlying notion has changed?

Therefore, since there is not a true label for certain things, we will constantly have the need for people involvement, to actually give feedback on the model and be guardrails.

So, essentially, we are replacing our... let’s talk lawyers. We are replacing our legal feedback from a lawyer by legal feedback from a large language model, plus a bunch of lawyers who correct the large language model. Yeah. So, then the question is, as a next step: how does that get easier eventually? Oh, that’s a good one. That’s a good one.

So, we read one piece which I found very interesting. So, coming back to the people. There is a study. Maybe we should link it. Let me look at it correctly. So, it’s the Swiss Federal Institute of Technology. And they hired 44 people, gig workers, on a platform. Let’s not mention which one. And they checked how they actually came up with the results.

And somewhere between 33 and 46% of the workers had used AI models to actually do the labeling. Which is worrisome. But it works. And nobody can really control it.

So, obviously, coming back to our bias discussion: they will use models that have certain biases to actually label data, to maybe get rid of biases in the future. Which work this way. This is not peer reviewed yet, but it’s already out there.

So, this is one way to get rid of human labeling. Also, synthetic data, I think, is one part of it, which is very important. Because it’s not just a single model. It’s a whole bunch of models, which could also be biased. Any thoughts there, Lutz? Or would you just say, let’s embrace it?

No. So, I actually think there are two things. First of all, using AI to label data is a little bit like the snake biting its own tail, right? Now we are training the AI models by themselves, without still knowing what is true.

Like, reinforcement learning is amazing if there’s a true north. So, as DeepMind trained on a video game or trained on chess, there is a true north when you win and when you don’t.

If you train on what is a good joke and you use an AI model to decide what a good joke is, then obviously it’s BS in, BS out. And that is an issue because since we have seen large language models, about when did they start? 2017 or so, more and more people are actually using those things to annotate, to write just by AI. People obviously use those tools to create web pages.

Like, Google deals with a whole range of spam pages, right? Pages that are generated. And now we use an AI to train on those. This is BS in, BS out.
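[Editor's sketch] "BS in, BS out" can be shown with a tiny Python example: a student model that learns only from labels produced by another model inherits that model's mistakes. Everything here is an assumption for illustration: the true rule (x >= 5), the biased AI labeler (x >= 7), and the very simple threshold learner.

```python
def learn_threshold(xs, labels):
    """Learn a cut-off halfway between the largest negative and smallest positive example."""
    positives = [x for x, y in zip(xs, labels) if y]
    negatives = [x for x, y in zip(xs, labels) if not y]
    return (max(negatives) + min(positives)) / 2

def accuracy(threshold, xs, truth):
    """Share of examples where the thresholded prediction matches the truth."""
    return sum((x >= threshold) == t for x, t in zip(xs, truth)) / len(xs)

xs = list(range(10))
truth = [x >= 5 for x in xs]      # what careful human labeling would say
ai_labels = [x >= 7 for x in xs]  # a hypothetical AI labeler with a systematic bias

student_from_humans = learn_threshold(xs, truth)
student_from_ai = learn_threshold(xs, ai_labels)

print(accuracy(student_from_humans, xs, truth))
print(accuracy(student_from_ai, xs, truth))
```

The student trained on AI labels reproduces the labeler's bias exactly. Without an outside true north, nothing in the training loop can reveal that the labels were off.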

The other problem which we have is, as I said, the notion of society changes and the notion of what is right, what is wrong changes. And that’s a big problem. So I do not think that at the current moment, we have a useful way to get rid of it. And there was a very interesting study from Stanford on the EU regulation and which of the large language models fit into the EU regulation.

And I think it’s a very, very good read. And we should link it in the show notes. It’s a very good read because essentially, in order to define whether it fits the EU regulation or not, we actually need to have a measurement for something. Now, how would you say that an AI output, like a joke about lawyers, is a good joke or not a good joke?

How would you say that the joke is offensive or not offensive? So you don’t have a measurement for it. And therefore, you cannot so easily regulate it. So I believe that EU regulation is a little bit premature here in this case, because we actually are missing the correct measurements to make sure that a large language model fits the purpose for which it was built.

But then it also means, looking into the future, whoever is using this, yes, there will probably be ways to train the models faster, and you can ingest your own data more easily, also in the large language models. As we discussed, yes, there are nice tools out there to do all the process management of Mechanical Turks, or whatever you want to call it. But you still need human beings, you still have to control the quality of the output, and you should make sure that what you’re doing with the model is what it was intended to do. And it shouldn’t go offside, and the machine can’t support you on that, you have to do it yourself.

Well, as you know, and the listeners probably as well, I’m working in healthcare, right? So I use data and AI in healthcare. And there’s a huge discussion on how much do you trust the model in healthcare. And it all comes down to that in healthcare, you say, well, you have a human in the loop to take the decision.

And it’s very similar to how we started with self-driving cars, that we build tools for humans to actually have safeguards, like guardrails or warning systems. And so we are using AI now as an output. But we should not assume that this output can be directly patched through.
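[Editor's sketch] The human-in-the-loop pattern can be written down in a few lines of Python: the model only produces a suggestion, low-confidence suggestions get flagged, and a human reviewer always returns the final decision. The names `Suggestion`, `decide`, `cautious_reviewer` and the 0.9 cut-off are hypothetical choices for this illustration, not a clinical standard.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Suggestion:
    label: str         # what the model proposes
    confidence: float  # the model's own confidence, between 0 and 1

def decide(suggestion: Suggestion,
           reviewer: Callable[[Suggestion, bool], str],
           flag_below: float = 0.9) -> str:
    """The model output is never passed straight through; the reviewer decides."""
    flagged = suggestion.confidence < flag_below  # warning signal for the human
    return reviewer(suggestion, flagged)

def cautious_reviewer(suggestion: Suggestion, flagged: bool) -> str:
    """Stand-in for a human: confirm clear cases, escalate flagged ones."""
    return "refer to specialist" if flagged else suggestion.label

print(decide(Suggestion("no disease", 0.97), cautious_reviewer))
print(decide(Suggestion("no disease", 0.62), cautious_reviewer))
```

In this setup the model acts as a warning system that helps the human prioritize, which matches the idea of supercharging people rather than replacing them.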

In healthcare, you always have the human in the loop somewhere. We are using AI to supercharge humans, and not to replace them. Yeah, and I think it’s in the interest of everyone, you want to see the results, you want to control them. So what we see so far, the tuning is getting better, and better, the models are getting better and better.

One part of the commoditization. So great for us, great for the startups. But whatever you do, this is not autopilot. Even if you have a chat: write a joke about lawyers, do it 10 times, and you will see that they end up very similar and not funny, at least according to our lawyer.

Cool. Thank you for listening. Thanks, everyone. And yeah, have a good day.

Well, thanks, guys.