AI as an interface — Transcript

So actually, the cool thing is that I can now suddenly talk to the AI and it gives me output, even if I have no clue what it does there, right? Welcome to another podcast, which is now a Saturday night, not live, recorded podcast, because it’s Saturday night for me at least, and it’s noon for you, I think. Noon, just had my first coffee. Life is good.

Great. So today, we would love to talk about something in this whole AI hype that we have seen that maybe is not so obvious, because we’ve seen a lot of things being created. Like the world goes crazy. Everybody uses large language models.

And then they think this is, I just saw an article which talked about the godlike language model. So everybody thinks this is amazing, but nobody really knows how to put this into a frame. So let’s put it into a frame. Yes.

And I think what is the amazing part is that you can actually create a lot of stuff. You sent me this video of this product where you can actually take the scribbles of your kids and then animate them. Oh, yeah. Meta.

Meta put it out. Amazing. Yeah, that’s amazing. So but the interesting part is actually not just the output, which is amazing, and we spoke about it, but actually how you get to that output.

If we talk about the scribbles, we should actually make a quick explanation about those foundation models. Foundation models split into two groups: those that create images and those that create text output. Creating images means, you know, like a painter, we go around and look at what looks similar, and then we create something like it. And creating text output means we are essentially predicting the next word. Yeah.

And at the end of the day, most people don’t know about this. I mean, you understand it. I understand it a little bit, but most people who are using it out there, they just type in stuff. Yeah. And that’s why the cool thing is that I can now suddenly talk to the AI and it gives me output, even if I have no clue what it does there, right?

So for foundation models, the underlying model is like a logistic regression. It’s a linear model with an activation function. Logistic regression, picture this, stacked on top of each other, many times over. Now, we all know this from when we go to Google search and put in a picture, and Google will tell you, this is a cow or this is a dog, right? So that is, we get the word dog based on a picture.
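
As a rough illustration of "logistic regression stacked on top of each other", here is a minimal sketch in Python. The weights, the three made-up "pixel" inputs and the names `logistic_unit`, `layer` and `tiny_network` are all assumptions for illustration. A real image model has millions of learned weights, and nothing here comes from an actual trained network.

```python
import math

def sigmoid(z):
    """Logistic activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_unit(inputs, weights, bias):
    """One logistic regression: a linear model followed by the activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)

def layer(inputs, weight_rows, biases):
    """Several logistic units side by side, all reading the same inputs."""
    return [logistic_unit(inputs, w, b) for w, b in zip(weight_rows, biases)]

def tiny_network(pixels):
    """Two layers stacked: the outputs of the first layer feed the second."""
    # Hypothetical hand-picked weights; a real model learns these from data.
    hidden = layer(pixels, [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]], [0.0, 0.1])
    dog_score = layer(hidden, [[1.2, -0.7]], [0.05])[0]
    return dog_score

score = tiny_network([0.9, 0.1, 0.4])
print(f"dog score: {score:.3f}")
```

The output is a single number between 0 and 1 that one could read as "how dog-like is this picture". Stacking more layers follows the same pattern.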

Now, we can turn the model around and say, please, machine, come up with what a dog is. And then the machine imagines what a dog looks like. So that’s the generative part of images. So yeah, true.

And it’s like, but now, I mean, before I was typing in stuff in Google and it was showing me a lot of pictures of dogs on websites, but now it’s actually creating those dogs out of nothing, kind of. Yes, because we have had a machine learning algorithm that identifies something in a picture for a while now. It’s called a CNN. It’s a neural network. We had this.

Now, all we do is we turn this around and we have a better language layer. So you type in a word and the turned-around network, the network contains all the information about what a dog should look like. And now the network imagines. And I don’t need to be a data scientist, which is somebody who can work with those AI models.

I don’t even have to be a computer scientist, which is somebody who creates code, right? I can be me just typing in stuff and then something happens. Now it comes to the next model.

So image generation is one model. This is the dancing crowd, which Meta placed. So just for the audience, Meta released a nice tool where you take the images of your kids and upload them. Then you define where the head is and where the limbs are.

The computer guesses it. You can redefine it. And then the computer imagines how the limb would look if it’s lifted. Or put out.

And therefore that thing can now dance. It’s very neat. Meta just released it. Now that is our image.

We have the second part, which is language. Here too, we train the model to guess the next word. If I say life is like a box of chocolates. We all know this.

Forrest Gump has trained us to say that. So life is like a box of chocolates. If we can get trained like this, the machine can get trained like this. So if I ask the machine something, then it starts with a sentence and continues the sentence in a logical human form.
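
To make "guess the next word" concrete, here is a minimal sketch in Python that simply counts which word follows which in a tiny made-up corpus. This bigram counting is a stand-in chosen for illustration, not how a large language model is actually built, and the corpus and function names are assumptions.

```python
from collections import Counter, defaultdict

def train(sentences):
    """Count, for every word, which words were seen directly after it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequently seen follower of `word`, or None."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

def continue_sentence(counts, start, max_words=10):
    """Keep appending the most likely next word until nothing is known."""
    words = start.lower().split()
    for _ in range(max_words):
        following = predict_next(counts, words[-1])
        if following is None:
            break
        words.append(following)
    return " ".join(words)

# Hypothetical training text, standing in for "everything we have heard".
corpus = [
    "life is like a box of chocolates",
    "life is like a box of chocolates",
    "a box of crayons",
]
model = train(corpus)
print(predict_next(model, "of"))
print(continue_sentence(model, "life is like a box of", max_words=1))
```

Because "chocolates" followed "of" more often than "crayons" in the training text, the sketch completes the sentence the way Forrest Gump trained us to.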

Like life is like a box of chocolates. And now we can combine them and do amazing things. But let’s start only on the text tool, because now we can actually also, that part we can turn around and we have understanding. We put something in, the computer now understands better.

We can see what we put in and that is an amazing interface. And I think the most common and most known use case is ChatGPT, which I think now hundreds of millions of people have definitely used. We have to check the latest numbers, but I think it’s the fastest growing consumer app that has ever existed, at least from public numbers. We saw this before, right?

Do you remember the time when we had AltaVista and it was huge? They had the search box. It was the search engine, right? But they had categories as well.

They had many wheels and you could do a lot of things. And then Google came around and said, actually all you can do is type something and we show you something. And everybody was hailing how amazing that interface is.

It’s so clean. Well, it turns out humans love to communicate with language. That’s what we do. So the best and easiest way is not moving your mouse on a big screen around and then like erratically clicking on areas or double clicking or even pinching or whatever.

Now, the easiest is to describe in a human language what you want. Yeah. And I think when you go through history, we shouldn’t go too much into detail, but we started, I think, with punch cards. So cards that you would feed into a computer that couldn’t do much.

Then came the graphical interface, the Lisa, for those who saw the movie or actually know the history of Apple, where you would use a mouse. And you could actually click on things, as you described. That was a revolution, because before that you would have to type in text, or rather commands that you would have to be trained on. I remember when I used MS-DOS, when I started with computer games, that was pretty cool already.

But I mean, it was not looking nice and I had to learn the commands. But the interesting thing is all these interfaces evolved, from very technical interfaces for trained users to the graphical user interface where people can look at symbols.

So it was more intuitive, but I would still have to search a lot. So to your point, what I have in my head, I want this. I could not just say it. I still would have to search for it.

Right. And why can’t I just say it? I mean, what’s the problem? That somehow the machine doesn’t get it.

Why are interfaces evolving so slowly here? I think interfaces have been a huge story. A machine is complex. A machine follows a certain structural approach.

And now we need to have a way to deal with a machine. And we all know this. I mean, we want to go from A to B, but in a car, we still have stick shift cars.

We have a wheel. We have to tell the world that we’re going right or left. We need to know the rules, or we get killed.

Over time, we’re trying to make things easier. Right. So for example, if you’re on a website and you put in your credit card number, sometimes you still have to pull down a menu saying, is it a Visa or a MasterCard? But it’s actually ingrained in the number.

So don’t do this. If you type it in, you already know which one it is. Right. So we’re becoming better there.
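
The point that the card type is "ingrained in the number" can be sketched in a few lines of Python. This is a minimal illustration covering only the two networks mentioned here: Visa numbers start with 4, and Mastercard numbers start with 51 to 55 or with 2221 to 2720. The function name `card_network` is an assumption, and the sketch does not validate the number itself.

```python
def card_network(number):
    """Guess the card network from the leading digits, or return None."""
    # Ignore spaces and dashes that people type between digit groups.
    digits = "".join(ch for ch in number if ch.isdigit())
    if not digits:
        return None
    if digits[0] == "4":
        return "Visa"
    if len(digits) >= 2 and 51 <= int(digits[:2]) <= 55:
        return "Mastercard"
    if len(digits) >= 4 and 2221 <= int(digits[:4]) <= 2720:
        return "Mastercard"
    return None

print(card_network("4111 1111 1111 1111"))
print(card_network("5555 5555 5555 4444"))
```

A form that does this while the user types never needs the drop-down menu at all.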

But initially the vehicle, the tool as such, demanded a certain user behavior. And it’s actually funny. If you look at old pictures of industrialization, we’re hailing the new freedom which we created. But then you see that people are working like in a machine.

Think about Charlie Chaplin in Modern Times. Right. He became kind of the machine.

And that’s the same thing. We are working on email all day long. In the morning we get up, we sit in front of the screen and we start hacking our code, our emails, whatever it is. And I think the reason why those interfaces are improving so much.

I remember the first time I learned Microsoft Excel and I was learning the rules and everything, but then I was kind of frustrated because I was constantly forgetting it. And then by using it on an actual use case, I was learning what it means to use Visual Basic, where to click. I would remember things, but my initial thing was obviously to solve a problem for myself.

And then I would learn the rules. Now people don’t want to wait, right? They know which kind of fashion they want to shop. They like it that on the web interface, they can now search for the shoes and they have a filter drop-down so they can configure everything.

And then they get the red shoes that are sneakers. But I already have in my head what I’m looking for. So if I could say it to the shop, that would be actually pretty cool.

And that’s, I think, what we’re now seeing with ChatGPT. It’s actually funny. Like we joked before we started this episode about guessing the age. You just disclosed your age so badly with Visual Basic.

I was a BCG consultant. And I learned Excel. What is Excel? Anyhow, what is this floppy disk symbol for?

Oh yeah. We joked, we joked about the stapler, right? The Microsoft one. Yes.

Yeah. I think that doesn’t exist anymore. Yeah. But actually, whenever you have a computer, it has a certain structure to follow.

So initially we had punch cards. Then we had assembler code, very complex to write. Then the next level of language came on top of this assembler code, and the people who wrote assembler code were pooh-poohing the people who wrote the higher-level language because it was easy. Now we are, in a way, enabling text to be read.

I don’t know coding. I tried to use it with Python and I realized that if I don’t know Python, it still doesn’t really help me, because I don’t know if it’s doing the right thing, and we maybe come to that. But still, for me it’s amazing. I can just tell it, hey, I want a poem about something, write it in Old English, and I don’t have to do any kind of coding. It just happens. But it’s kind of the same as what happens in Copilot, right?

Yeah. The funny thing with a poem written in Old English is, I mean, how many emails have you used it for, right? We need to have use cases, and using that interface for coding is a very good use case. You posted this the other day: Microsoft claims that it’s 46 percent more effective. I think it’s actually 67 or something, but it really depends on the use case, because it’s very, very simple stuff that they tested it on. It’s not actually creating new things. It’s more routine work. Yes.

Now, actually, the funny thing is if we take both models together, the language model and the image generation model. You have seen people creating images: they write something, and then the image is somewhat okay but not really what they wanted, and then they iterate around the image and it becomes more and more complex. I have one here, I’ll just read it out, because I checked one prompt: hyper realistic Jeff Goldblum while smiling on a photograph Maurice Sapriot a photo a gigapixel photo wearing a black t-shirt profile photo a man wearing a backpack Google slightly cast lighting stood in lab walking towards a camera by Israeli swanbaum satisfying powers aim design photographic imposing tall unmistakably Kenyan side profile the center profile rails Andrew Bernstein image and center, and it goes on.

So we now have something to create images, but again our interface became not ideal. Yes, it is better than using the mouse and drawing, especially if you can’t draw. Now you can say what you want, but you need to say more and more complicated stuff. By the way, if somebody wants to test it out, you can actually turn this around, right? Like everything in a computer, you can turn it around. That is, if you have a photo, you go to Hugging Face, you can actually retrieve it, and on Hugging Face you look for CLIP Interrogator, and I’m happy to post the link in the show notes. Then you actually see the potential language which you would need to use to generate such an image.

But at least that sounds helpful, because I think now we’re getting to the limits here. We’ve read so many people saying this is the iPhone moment. We actually started by saying, hey, this is a big revolution. But as you just pointed out with this example, and I think everybody who has used ChatGPT, DALL-E and other tools comes to the same point: the starting point, and I think that’s why the adoption is so amazing, is pretty easy. I just use my own language and something comes out. But as soon as I want to be more precise, better, maybe even want to use it for professional applications, it gets pretty tough to tell the AI what I want.

Maybe. I’m actually very, very bullish on this. For example, if you look at Midjourney, one of the image generating tools, then you see those super complicated prompts. So some guy said, okay, now I use ChatGPT to train what kind of prompt I want to have. So he wrote a prompt, he trained it on all the styles he likes, and now he has a very easy way where he writes something and ChatGPT generates, based on typical knowledge, what the next word is. So if you write Viking, then it knows what Viking style you want and writes a longer prompt, which then goes into Midjourney and renders the images he wants.

If you think about it, the AI should at some point, maybe Midjourney is not that sophisticated yet, but then ask me, because I’m doing all the prompting. Ask me, hey, by the way, what kind of style do you like? Maybe you upload one or two pictures, that would help me a lot to understand this. I mean, we’re very early here, right? Because we discussed before this podcast whether maybe graphic design is an area that would not get disrupted by this new kind of AI interface, because I would still use the web interface, I would still fine-tune things. But now, to your point, maybe not, because I could still tell the AI what I want.

I think it’s not an either-or, right? We now have an interface where we can use language, and that is what makes those foundation models powerful. It’s not that foundation models can do better search. Probably they can’t. It is not that they have a better logic. It is that they understand what we want. So for image generation, we can tell them better. Now, the foundation models have shown that they have an amazing ability to understand what a human needs and put this into the more complex world of technology. So the human interface to the machine became better. We saw this for image generation, we saw this for SQL code, we saw this for Python code, and we see this for many other search areas.

So that’s the reason why, if you look at LLMs, then for example we had Glean. Glean is a startup which essentially says: you have your work data, and on top of it, you have a searchable interface.

And the beauty is that all the data from work is put into a structure where you can query that data. Microsoft followed there after Glean. You.com is doing the same thing over the internet, right? That’s true.

Any day of browsing data. So take your text data and make it conversational knowledge so that you can actually, as a human, address it. That’s pretty impressive. And this is a guess.

At the moment, we see all the interfaces where we had to write become easier. If I need to write code, text comes out at the end. Now, I have a better interface where I put text in and that text gets converted. If I have to write a song or to write a blog post, then I can put text in and text comes out.

For all of those use cases, text goes in, text goes out. That has been working. It’s been working very well. The use cases such as steering a machine, steering more complex efforts will take longer, but I’m pretty sure they will adapt it very quickly.

Is it easier for the ones where there is public data available first, because that’s what all the models are currently trained on, and where there is some underlying code, like computer languages? And then we might expect law, which is kind of public and codified language. Yeah. through prison or not? Yes.

Well, we see companies like LegalOS experimenting with this and it’s pretty impressive. Yes, because law is codified. Computer language is more codified. Yeah.

However, there are two other extremes, right? If I need a very strict structure like computer code, it is helpful because I use text as an input and I get strict structure out. If I use something which doesn’t have a strict definition like an image, I still can use text and I get an image out. But if the image is not 100% what I expected, I didn’t really have a picture.

I mean, I just expected something and now the new image looks fine as well. Those two use cases work well. Now, if you operate a machine, I saw one article talking about the interface for Google Analytics or for Google Ads. And yes, it’s true.

Google Ads is terrible to steer and you had to learn how to steer it. Now you can actually go in and say what you want, and the machine will ask you the right questions. A little bit, I would compare this with your favorite coffee chain where you go in and you say, I want a coffee, and the person will ask, you want a Gibraltar, cappuccino, latte? What is it?

And you’re just like, oh, okay. Explain this to me. Oh, well, I’m, like, indifferent on coffee. And then what type of milk?

Oat milk? Soy milk? Real milk? How much fat?

And do you want a shot? Not shot? And it guides you through this. Now, they will look at you like, dude, have you never gotten coffee?

But the structure is ingrained in their system and they will guide you through. So we will see more and more machines actually asking you questions and guiding you through it, and they will remember. Now, the question for me is then, we spoke about how you need vertical models or, let’s say, models trained on proprietary data. For example, I’m an e-commerce shop.

I have conversations, chat conversations that I want to automate, like what Ultimate AI does. And I only want the chatbot or whatever to answer in my way and not what ChatGPT has found on the Internet. Right. So I really want to train the model on my conversations.

So that is not being revolutionized by what we’ve seen right now. Is it more the way the intent is detected, or is it the search through the conversations that is being revolutionized? I’m trying to get here a little bit to what, if startup founders are listening, people interested in building a company, how can they approach this? So now they understood, hey, anything that is codified can be more automated, can be approached easier, can be prompted easier.

But steering a machine, driving a car, maybe not, because the rules are not so broad. The rules are not so clear. I think it becomes like, how do you simplify your interface? If you are a startup founder and you are thinking about using LLMs, how do you simplify your interface?

Is the chat ability, I think you have two questions. Let me answer them separately. How do you simplify your interface? Like if you’re having a startup, how do you simplify your interface?

This is what we see everybody slamming an LLM on. And sometimes the interface might become easier because now suddenly you can write. And the best example for me is if we look at software which creates graphs, SQL is always the underlying language. And SQL needs a certain level of knowledge.

And when business people cannot write SQL code, then people came up and said, okay, well, I give you a graphical interface so that you can click and drag and drop. That works. But as soon as it becomes a little bit more complex, this clicking and dragging and dropping, becomes so slow that you’re better off just learning SQL and being more precise with the language. Because clicking and dragging is complicated.

Now suddenly we have a human interface where you can say what you want to see. And yes, it’s still a complicated language, but it will now ask you all the different things. And maybe later on, the rearrangement, where you want to place what, is what makes it possible, what makes it powerful.

So think as a startup founder, think about which interface will be changed. I think what people can also think of is what it actually enables, what you describe. When you look at consumer models, it’s always about taking away all the objections, blockers, everything that’s slowing you down in the transaction. Think about web, right?

I don’t have to drive to the city and buy the shoes. I can just browse through, and it’s very fast, so I can take a fast decision. So now, and maybe it’s my personal hope, I don’t have to go searching through the website.

I can just tell the shop what I want and it will give me some pictures. And then I will say, no, I don’t like this kind of style. Give me some other pictures. I think personally, I would probably buy more.

And then the second one, what I could imagine is when you talked about code, but I’d like to hear your view there. When you think about dating or recruiting, it’s based on a certain matching algorithm. It decides what kind of candidates you are shown, but I don’t know the algorithm, right?

I just have this interface where I put in my data and I don’t know how they would match it. But if I can talk in my language, it translates into their language and then translates back to the other candidate’s language. Maybe it helps matching people. Yes.

As soon as I have filters, by the way, any filtering, as you said early on, like you look for shoes and then you select red. Now you type red shoes. This is way better. It makes way more sense than typing shoes, clicking return, waiting for the results, going to the filter section, clicking on the filter, selecting red, and then waiting.
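
To show the difference between clicking filters and typing "red shoes", here is a minimal sketch in Python that turns a typed request into the same structured filter a drop-down would produce. The keyword matching is a deliberately simple stand-in for what a language model would do, and the catalog, the vocabulary sets and the function names are all made up for illustration.

```python
# Hypothetical vocabulary the shop understands.
KNOWN_COLORS = {"red", "blue", "black", "white"}
KNOWN_CATEGORIES = {"shoes", "sneakers", "boots"}

def parse_query(text):
    """Turn free text into the structured filter a drop-down would build."""
    filters = {"color": None, "category": None}
    for word in text.lower().replace(",", " ").split():
        if word in KNOWN_COLORS and filters["color"] is None:
            filters["color"] = word
        elif word in KNOWN_CATEGORIES and filters["category"] is None:
            filters["category"] = word
    return filters

def search(catalog, filters):
    """Keep the items that match every filter that was actually set."""
    return [
        item for item in catalog
        if all(value is None or item.get(key) == value
               for key, value in filters.items())
    ]

# Hypothetical catalog data.
CATALOG = [
    {"name": "Runner", "color": "red", "category": "shoes"},
    {"name": "Walker", "color": "blue", "category": "shoes"},
    {"name": "Hiker", "color": "red", "category": "boots"},
]

results = search(CATALOG, parse_query("give me red shoes"))
print([item["name"] for item in results])
```

One typed sentence replaces the search, the wait, the filter click and the second wait. The structured filter underneath stays the same, which is why the existing graphical interface can still be used to refine the result.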

That’s hard. Red shoes, give me red shoes. And then realizing your red is not their red, because your red is lighter than their red. By the way, that is something where you might actually end up in a graphical interface again.

The graphical interfaces are not dead. It is the combination of language and the existing interfaces. And we will see a lot of improvements in traditional businesses. Meaning the LLM capacity will actually supercharge many of the existing businesses.

Might not be a very cool notion for startup founders, but if you have an existing business, think about how you make it faster, better with an LLM. But that’s a good point. Maybe I can start using Notion now finally, because, you know, I always fail with the interface. Well, now coming to the second part.

If you do your own business, right? So I just posted something on LinkedIn where I showed how easy it is to get a large language model to hallucinate. So what I did is I went to ChatGPT and typed in a completely fake website. I’m a Cornell professor, so I typed cornell.edu slash Lutz’s amazing course saves Cornell or whatsoever.

And then I said, summarize that website. That website doesn’t exist. And obviously ChatGPT has no connection to the internet.

It does not know that the website doesn’t exist. So it uses the website address and the structure codified in it, and then it just makes it up. And out came the summary of the website: Lutz Wenger with his data and product course is saving the day for Cornell. It’s the most amazing course.

And it makes very good logical arguments why I’m the master guru. Very flattering, but completely fake and not true. So that is a huge problem for any large language models, because if they don’t know, they will make up the next step. So one important part in the industry is that we use data to actually train the model and that this data belongs to the business.

So in your example, you want to order something and you just want to tell the shop, I want to order. And let’s say the shop doesn’t have it. Then the model will just make it up. Hey, yes, you’re Jasper.

You get your red shoes with blinking laces. Totally fine. And you’re just like, yeah, finally, I never saw that. But neither did the model.

Just made it up. It sounded good. True. Good point.

You would make sure, how do you say that? You would ground the model on the local data or at least train it? Proprietary data. Proprietary data matters for anything which we do.
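
One way to picture grounding on proprietary data is to check every request against the shop’s own inventory before anything is answered. The sketch below is a minimal illustration in Python under that assumption. The inventory, the wording of the replies and the function name `answer_order` are invented for the example, and a real system would put a language model in front of this lookup rather than exact string matching.

```python
# Hypothetical proprietary data: what the shop actually has in stock.
INVENTORY = {
    "red sneakers": 3,
    "blue boots": 0,
}

def answer_order(request, inventory):
    """Answer only from the shop's own data instead of making things up."""
    item = request.lower().strip()
    if item not in inventory:
        return f"Sorry, we do not carry '{item}'."
    if inventory[item] == 0:
        return f"'{item}' is currently out of stock."
    return f"Yes, '{item}' is available ({inventory[item]} in stock)."

print(answer_order("Red sneakers", INVENTORY))
print(answer_order("blinking red shoes", INVENTORY))
```

The second request gets an honest "we do not carry" instead of a confident invention, which is the behavior the shop owner wants.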

Now data, like in any, like I’m a data scientist and I work with data. Data always mattered. It wasn’t the algorithm. It was always the data which mattered.

So here, like you come in and let’s take a company like Ultimate AI. Ultimate is using data for chatbots. They’re doing a very excellent chatbot job. And what they have is all the conversations from their clients, meaning the bots can rely on that content and start talking back.

So they answer as all the other humans have done, based on the typical data they have seen before. It is the proprietary data which matters, because you cannot just ask any bot to tell you, where’s my flight? I mean, try this with a chatbot.

I don’t know what they would say. Let’s try. Let’s try. In the large language model world, I think the challenges we have is if you train a large language model.

It is very expensive. So most companies will use it as an interface. Now for using it as an interface, you better make sure that you train it on your data so that it becomes proprietary, that it doesn’t hallucinate. Then it’s still an info job to maintain it, to make sure that the information is up to date and that will separate traditional companies from excellent companies who kind of like know how to do this.

And there’s a whole discussion on whether you should have this in an open market. I think it’s a very important thing. The question is whether you should have an open model, where you then use the open large language models which Meta, for example, offered, or whether you go into a closed model and you go to OpenAI. I think when we now come to an end and sum it up a little bit, what is really interesting is that it’s a new way you can actually interact with software.

The challenge behind it is that it doesn’t 100% work. But think of an existing interface that was tedious, maybe too slow, maybe not bringing the right results. Combining this with this new type of language interface could lead to quite some nice revolutions, if you understand that the thing is hallucinating.

I think what is also interesting, I get a lot of input now as a startup founder because I have this language prompting. And before I would create this interface and then I would see how the user is clicking on the buttons and moving around. And I didn’t really know their intentions. And now with this interface, this language interface, I actually see what they are searching for.

They’re asking for all the things that Google knows. So more feedback for me when I build my product. LLMs are an amazing interface. We need to separate, like we evolved between image generation and language generation, but language generation and the language understanding, which is the turned around language generation model, is amazing because it’s self-explanatory.

It opens up any interface for language input. Whenever I needed to use my mouse, I suddenly have the ability to do it via language. Thank you very much, Lutz. That was really cool again.

Thanks for making the time on a Saturday. Absolutely. Talk soon. Bye-bye.

Talk soon. Good night.