Tech Talk: "DALL·E 2: Image generation with visual AI" with David Estévez from Kaleido
Good evening. In this presentation I'm going to talk about DALL·E 2, a pretty new model for image generation. The overall topic of the talk is image generation with visual AI, but we are going to focus on this new model, explain a bit how it works, and see how a visual AI, a machine, is able to generate images from scratch. Before going into the presentation, a little bit about myself: I hold a Ph.D. in Robotics and Artificial Intelligence, I'm from Madrid, Spain, and I've been working at Kaleido for a few months as a deep learning engineer. Kaleido is a company whose mission is to make visual AI simple.

Right now we have three main products. There's "removebg", an application that automatically removes the background from an image, and there's "unscreen", which does pretty much the same thing but for videos. Moreover, there's "designify", which goes a little further: it lets us create compositions with objects and different designs, and automate this process for product images. On top of that, we are also part of the big company Canva, which makes us part of the Canva family.

So, for the introduction, we're going to talk about image generation, and I want to start with a little game. I'm going to present you with three different images, and I want you to look at them and tell me which of them have been generated by a machine and which have been painted or drawn by hand. Just take a little moment, look at the images, and try to figure it out. This is a little bit of a trick question, because all of them have been generated by a machine.

More precisely, I made the first two myself. The first one was done with a StyleGAN and the second one with a CLIP-guided generator, which I'm going to talk more about in this presentation. As for the one on the right, this image is pretty new. It came out just a week ago and was made by an AI called DALL·E 2, built by the company OpenAI, which has the peculiarity of generating images from a string of text. So someone wrote something like: "I want an otter in the style of the Girl with a Pearl Earring by this artist", and the machine just generated this image out of the blue, and it looks pretty nice.

So, in this presentation we're going to learn how this system works and how machines, or AI, can generate images that fool us into thinking they have been made by a person. We are going to go through it step by step.

The first step is generating images: we want to understand how a machine is able to generate images at all. The most common method until a few years ago was Generative Adversarial Networks (GANs), which is a fancy name for a pair of neural networks trained through a competition between them.

First we have a network on the left called the Generator, which takes a random number and outputs an image from it. This number is just used to produce variety: different random numbers give different output images. The Generator tries to make images that look like real images. We then mix these fake images with a bunch of real images gathered from, e.g., the internet, which represent the kind of images we actually want to generate, and shuffle them together. Next, we feed these images to a second network called the Discriminator. The Discriminator's job is to tell apart the fake images made by the Generator from the real ones, and we train both networks at the same time. This becomes a little competition between them: the better one of them gets, the more feedback the other receives about what it needs to do to fake real images. As soon as one of them improves, the other one also gets better. With this game we train both at the same time, and from the balance between them we end up with a Generator that is able to generate realistic images.
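The adversarial game above can be sketched in a few lines. This toy example is only an illustration of the training structure (all shapes, learning rates, and step counts are made up): the "images" are just numbers drawn from a Gaussian, the Generator is an affine map, and the Discriminator is a logistic regressor, each updated by gradient ascent on its own objective.

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(z, theta):
    """Toy Generator: maps a random latent z to a 1-D 'image'."""
    a, b = theta
    return a * z + b

def discriminator(x, phi):
    """Toy Discriminator: logistic regressor, outputs P(x is real)."""
    w, c = phi
    return 1.0 / (1.0 + np.exp(-(w * x + c)))

# "Real images": samples from N(3, 1) stand in for the real dataset.
real = rng.normal(3.0, 1.0, size=256)

theta = np.array([1.0, 0.0])   # Generator parameters
phi = np.array([0.1, 0.0])     # Discriminator parameters
lr = 0.05

for step in range(300):
    z = rng.normal(size=256)
    fake = generator(z, theta)

    # Discriminator ascends E[log D(real)] + E[log(1 - D(fake))].
    d_real = discriminator(real, phi)
    d_fake = discriminator(fake, phi)
    grad_w = np.mean((1 - d_real) * real) + np.mean(-d_fake * fake)
    grad_c = np.mean(1 - d_real) + np.mean(-d_fake)
    phi = phi + lr * np.array([grad_w, grad_c])

    # Generator ascends E[log D(fake)]: make its samples look real.
    d_fake = discriminator(generator(z, theta), phi)
    grad_a = np.mean((1 - d_fake) * phi[0] * z)
    grad_b = np.mean((1 - d_fake) * phi[0])
    theta = theta + lr * np.array([grad_a, grad_b])

print("mean of generated samples:", round(float(generator(rng.normal(size=1000), theta).mean()), 2))
```

In a real GAN both players are deep convolutional networks and the samples are images, but the alternating update scheme is exactly this.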

There are many different Generative Adversarial Networks, but one is especially common because of its nice properties: StyleGAN. There are different versions of StyleGAN, 1, 2 and 3, but they all work pretty much the same way. They have a Generator and a Discriminator as well, but in this case the Generator has two main parts. We still have a network that takes a random number, but instead of generating the output image directly, it generates a set of parameters for the different layers, and these parameters represent different features in the image.

The parameters of the first layers act on a small image that the network grows step by step, so these first layers determine the overall coarse features of the image. Let's imagine these numbers as knobs that we can twist. There are different parameters to adjust, and if we turn the knobs of the first, top layers, we can change the coarse details of the image: whether it's a woman or a man, whether the person has a beard, glasses, or something on their face. If we continue turning the knobs in the lower layers, we can adjust finer details such as the colors of the image, the color of the hair, or whether the eyes are closed or open.
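This per-layer structure can be sketched as follows. Everything here is a made-up stand-in (the shapes, the tiny mapping network, the layer names): the point is only that one latent vector is turned into per-layer style parameters, and "turning a knob" means editing the style fed to a specific group of layers.

```python
import numpy as np

rng = np.random.default_rng(42)

def mapping_network(z, weights):
    """Toy MLP standing in for StyleGAN's mapping network: latent z -> style w."""
    h = np.tanh(weights[0] @ z)
    return weights[1] @ h

z = rng.normal(size=32)                                # random input number(s)
weights = [rng.normal(size=(64, 32)) * 0.1,
           rng.normal(size=(16, 64)) * 0.1]
w = mapping_network(z, weights)                        # shared style vector

# Each group of synthesis layers gets its own copy of the style ("knobs").
layer_names = ["coarse (pose, face shape)",
               "middle (hair, expression)",
               "fine (colors, skin tone)"]
styles = {name: w.copy() for name in layer_names}

# Tweaking only the fine layers changes colors without touching the pose.
styles["fine (colors, skin tone)"] += 0.5

for name, s in styles.items():
    print(name, "-> style norm:", round(float(np.linalg.norm(s)), 3))
```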

That's why it's called a StyleGAN: these parameters allow us to control the style of the image we generate. In this case we train with lots of different faces, and the network is able to generate new faces as we turn the parameters. Nevertheless, it's not that easy to control. Say we generate the face of a person and want to change it a little bit: change the position of the head, make the person a bit older, change the hair color, create a bigger smile, whatever we want to do. There isn't really a knob for that. Each knob controls some mixture of features but doesn't correspond to concrete things like smiling, so you would need to turn all the knobs in a precise, coordinated way. It's possible, but difficult and not really intuitive for the user: you cannot control it with something as simple as a slider saying "Ok, I want someone older or younger". So we can do better; we can improve this method.

After generating the image, the next step is to guide the generator in such a way that we can control things easily. For this I'm going to introduce a different network called CLIP, which stands for Contrastive Language-Image Pretraining. CLIP is basically a network that takes an image and a piece of text, compares the two, and tells us how well they match. CLIP is composed of two networks. One network takes a picture, say of a dog, and outputs a number that acts like an ID for that image. For the text we have a second network that encodes it the same way: we can take the text "a picture of a dog" and encode it into another ID, another number. Then we can compare these two IDs and see whether they are similar or not. If the numbers are very different, there is no match, and we say: "Ok, this is not a picture of a dog." So how is this useful? We are talking about image generation, and this is not generation, just scoring. It is useful because with this network we can then do image optimization.
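The matching step can be sketched like this. The embedding vectors below are made up by hand (a real CLIP encoder would produce them from the actual pixels and tokens); the point is that both encoders land in the same vector space, so a simple cosine similarity scores the match.

```python
import numpy as np

def cosine_similarity(a, b):
    """Score in [-1, 1]: how well two CLIP-style embeddings ('IDs') match."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend these are the outputs of the image and text encoders.
dog_image   = np.array([0.9, 0.1, 0.2])   # encoder(photo of a dog)
dog_caption = np.array([0.8, 0.2, 0.1])   # encoder("a picture of a dog")
cat_caption = np.array([0.1, 0.9, 0.3])   # encoder("a picture of a cat")

print("dog image vs 'a picture of a dog':",
      round(cosine_similarity(dog_image, dog_caption), 3))
print("dog image vs 'a picture of a cat':",
      round(cosine_similarity(dog_image, cat_caption), 3))
```

The matching caption scores clearly higher than the mismatched one, which is exactly the signal the optimization below exploits.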

To sum it up, we can start from a random image that does not look like a dog and say: "Ok, I want a picture of a dog." We take this image, obtain its ID, take the ID of the text, and compare them. This comparison gives us feedback about how to change the image, i.e. how to modify its pixels, so that it looks more like a dog. We repeat this process enough times until we get something that looks a little more like a dog, and finally we end up with something that looks exactly like a picture of a dog. With this process we can use text to generate images.
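The loop above can be sketched as gradient ascent on the match score. This is a toy version: the "image" is a tiny raw vector and the text embedding is invented, whereas a real system would backpropagate through CLIP's image encoder, but the iterate-score-nudge structure is the same.

```python
import numpy as np

# Made-up embedding of the text "a picture of a dog", normalized to unit length.
text_embedding = np.array([0.2, 0.9, 0.4])
text_embedding /= np.linalg.norm(text_embedding)

rng = np.random.default_rng(0)
image = rng.normal(size=3)               # start from random noise "pixels"

def score(img):
    """CLIP-style match score: cosine similarity with the text embedding."""
    return float(img @ text_embedding / np.linalg.norm(img))

for step in range(100):
    # Analytic gradient of the cosine similarity w.r.t. the pixels.
    n = np.linalg.norm(image)
    grad = text_embedding / n - (image @ text_embedding) * image / n**3
    image += 0.1 * grad                  # modify the pixels a little, repeat

print("match score after optimization:", round(score(image), 3))
```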

This is still not very efficient, because in real life the pixels of an image are not independent. If you take a picture of a face, the pixels that compose the eye of the person cannot be random pixels; together they need to form the shape of an eye. There are different eye shapes and eye colors, but all eyes look like an eye. If we have some way of putting this knowledge about the world inside the network, instead of modifying each pixel independently, we can generate pictures much more efficiently. It turns out we actually do have this knowledge, because, as I told you before, there are generator networks that can generate faces. This means that instead of editing the pixels one by one, we can take, e.g., a picture of a face and say: "I want a smiling version of this face." But instead of using the feedback from the comparison to change individual pixels of the image, we use this feedback to change the knobs of the generator (the different parameters of the layers), which is something you wouldn't really know how to do by hand. Luckily, it's an automatic process and therefore much simpler. So if we combine a generator network with this knowledge about the relationship between text and images, we have a really nice method to obtain and edit images with natural language, without any real knowledge of what's going on underneath.
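The same toy optimization can be moved from pixel space into knob space. Here a fixed matrix `G` is a made-up stand-in for a pretrained generator: every candidate image is `G @ knobs`, so it always lies on the generator's manifold of plausible images, and the feedback flows through `G` to the knobs via the chain rule.

```python
import numpy as np

# Toy "generator": a fixed linear map from 2 knobs to a 3-pixel image.
G = np.array([[1.0, 0.0],
              [0.5, 1.0],
              [0.0, 1.0]])

# Made-up text embedding, normalized (stand-in for CLIP's text ID).
text_embedding = np.array([0.2, 0.9, 0.4])
text_embedding /= np.linalg.norm(text_embedding)

w = np.array([1.0, -1.0])                     # initial knob settings

def score(knobs):
    """Match score of the generated image against the text embedding."""
    img = G @ knobs
    return float(img @ text_embedding / np.linalg.norm(img))

for step in range(500):
    img = G @ w
    n = np.linalg.norm(img)
    # Gradient of the cosine score w.r.t. the image...
    grad_img = text_embedding / n - (img @ text_embedding) * img / n**3
    # ...pulled back through the generator to the knobs (chain rule).
    w += 0.1 * (G.T @ grad_img)

print("score after optimizing the knobs:", round(score(w), 3))
```

Because only the knobs move, the result is always an image the generator knows how to draw, which is why this is so much more efficient than pushing pixels around independently.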

Most of the methods that currently exist for generating AI art and images on the net use some variation of this. Maybe the generator network is not a StyleGAN but another model, such as a diffusion model, but the method is pretty much the same: some input image is scored with CLIP against a text, and this feedback is used to adjust values of the generator until we get the picture we want.

This is an iterative process: we have to go over it again and again, adjusting a little bit each time, until we get the final image. Instead of repeating this process over and over, it would be nice to have an instant method. And this is exactly what the new model, DALL·E 2, does. It came out just last week, from a company called OpenAI. This model introduces a new network that acts as an intermediary between the CLIP network and the generator. So in this case, for example, we can ask for an astronaut riding a horse in a photorealistic style. Using CLIP, we obtain the ID we talked about before, and this ID encodes the whole meaning of the sentence: what an astronaut is and what it means to ride a horse. Using this description, this ID, DALL·E 2 adds a pretty large neural network in between that learns how to translate from this ID to the knob values. So instead of optimizing through several iterations, it can directly predict the values of the generator's knobs from the CLIP embedding, and we get images such as this astronaut riding a horse in a photorealistic style.

Additionally, using CLIP has another advantage. As we saw before, it can encode not only text but also images. So if instead of encoding text we take a random picture, like this Salvador Dalí painting, encode it with CLIP, and feed this description of the image into the DALL·E 2 system, it will generate images that look like the picture. We are asking for similar things, basically the essence of the picture, but not the exact picture, so we get variations of the same picture in different styles. On top of that, we can also take an existing picture and say: "Ok, in this place that I marked in the image, I want you to add a flamingo." The system will take this information, run it through the generator, turn the knobs, and we get a flamingo in the image exactly where we wanted it. We can then say: "No, I don't want it here, I want it to be there", and the network will still put it there. And if you look closely, you can see that even the reflections in the water are there, so it has some knowledge of the world and also of its physics: it knows that if you put an object in the water, there needs to be a reflection, which is pretty awesome.
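The one-shot pipeline can be sketched end to end. Every component below is a made-up stand-in (a character-hashing "text encoder", random matrices for the prior and decoder); the point is the data flow: text embedding, then a prior mapping it to an image embedding, then a decoder predicting the knob values in a single forward pass, with no optimization loop.

```python
import numpy as np

rng = np.random.default_rng(1)

def text_encoder(prompt):
    """Toy stand-in for CLIP's text encoder: hash characters into a vector."""
    v = np.zeros(8)
    for i, ch in enumerate(prompt):
        v[i % 8] += ord(ch) / 100.0
    return v / np.linalg.norm(v)

# Made-up "trained" weights: in DALL·E 2 these would be large learned networks.
prior_weights   = rng.normal(size=(8, 8)) * 0.3   # text emb  -> image emb
decoder_weights = rng.normal(size=(4, 8)) * 0.3   # image emb -> knob values

def generate(prompt):
    text_emb  = text_encoder(prompt)           # 1. encode the sentence
    image_emb = prior_weights @ text_emb       # 2. prior: text ID -> image ID
    knobs     = decoder_weights @ image_emb    # 3. predict the knobs directly
    return knobs

knobs = generate("an astronaut riding a horse in a photorealistic style")
print("predicted knob values:", np.round(knobs, 3))
```

Image variations fall out of the same structure: encode an existing picture instead of a prompt, and feed its embedding through the same prior and decoder.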

To finish the talk, I'm going to show some other pictures made by DALL·E. You can see an avocado chair and some other surrealistic pictures. In the coming years we are going to see more and more of this technology, since it's a very powerful way of generating and editing images. It's a good idea to get used to this kind of technology, because it's going to be present in many, many applications in the future.

Thank you very much for your attention. And just to mention: we are currently hiring at Kaleido. So if you liked the talk, think you could work on this, and are interested in the topic, you can always send your CV and we will be glad to welcome newcomers. Thank you.



Learn more about the Kaleido dev team