Tech Talk: "Artificial Intelligence & Image Recognition" with Philipp Remplbauer

My name is Philipp and today I am going to talk about artificial intelligence and especially image recognition. Nowadays it's a huge topic, especially with self-driving cars from Tesla, with detecting diseases on X-ray images, or with object detection.

I want to point out the Tesla part again here, with their computer vision detecting children and objects in front of the car. That's more in the field of computer vision, but image recognition is a huge part of it, and that's what we will focus on now.

So we have to think about the human brain and the machine brain. The human brain works like this: we have neurons, and we have connections between these neurons, which we call synapses, and they can have a certain thickness, meaning how well they pass on information. So in the end it's a big cluster of these neurons, and it works perfectly for learning, as you can see with us humans. For example, if some object flies by us, we think about it, neurons fire, and because they are connected to other neurons, those fire as well, and in the end we notice that it's a bird.

And what we did here is what we always do – in physics, for example – we modelled the human brain, and we call the result a neural network. It would look like this. You can see we have our neurons there as well, and we have our connections between the neurons, which in this case are called weights and which determine how much influence one neuron has on whether another neuron fires. So we have our input layer, where we put in the information we know; we have our hidden layers, where some magic is happening – but in fact it's just maths; and then we have our output layer, where we get our result. We would call that kind of network a feed-forward neural network.
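To make this concrete, here is a minimal sketch of such a feed-forward pass in Python. The layer sizes and the sigmoid activation are illustrative assumptions, not something from the talk itself:

```python
import numpy as np

# Minimal sketch of a feed-forward network: one hidden layer.
# The weight matrices play the role of the "synapse thickness".
# Layer sizes (3 -> 4 -> 2) are illustrative assumptions.
rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

W1 = rng.normal(size=(3, 4))  # weights: input layer (3) to hidden layer (4)
W2 = rng.normal(size=(4, 2))  # weights: hidden layer (4) to output layer (2)

def forward(x):
    hidden = sigmoid(x @ W1)  # the "magic" in the hidden layer - just maths
    return sigmoid(hidden @ W2)

print(forward(np.array([0.5, -1.0, 2.0])))  # two output values
```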

We use, for example, supervised learning, which means we have our input data and the output data we want to predict, and we use a technique called backpropagation – I will not explain this in detail. But after some training, we can give the network similar inputs it has never seen before, and it predicts outputs for them.
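As a rough illustration of what such training looks like, here is a small sketch of supervised learning with backpropagation. The XOR toy dataset, the network size and the learning rate are all assumptions chosen for the example:

```python
import numpy as np

# Sketch: supervised learning with backpropagation on the XOR toy
# problem (an illustrative choice, not a dataset from the talk).
rng = np.random.default_rng(1)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # input data
y = np.array([[0], [1], [1], [0]], dtype=float)              # desired output

W1 = rng.normal(size=(2, 4))
W2 = rng.normal(size=(4, 1))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(5000):
    # Forward pass
    h = sigmoid(X @ W1)
    out = sigmoid(h @ W2)
    # Backward pass: propagate the error back through the layers
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Gradient descent step on the weights
    W2 -= 0.5 * h.T @ d_out
    W1 -= 0.5 * X.T @ d_h

print(out.round(2))  # should approach [0, 1, 1, 0] after training
```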

So, to understand image recognition we need to understand how a computer handles images. For us, you can see this dog on the left side and we see it's a dog, it's beautiful, it's daylight outside, the background is green. But for a computer it's different; it does not know all these things. The computer knows the width of the image and it knows the height, and the image is like a matrix where each point in this matrix is called a pixel. Each pixel has a colour, and the computer uses a colour space called RGB, which stands for "red, green, blue". So we have three channels, a red one, a green one and a blue one, and each one has an intensity saying how much of that colour is represented in the pixel. For example, if I take 0 of red, 0 of green and 0 of blue, it would be black. Or if I take 255 of red, 255 of green and 255 of blue, it would be white – you can imagine it like mixing colours.
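In code, this pixel matrix could look like the following sketch; the tiny 2x2 image is just an assumption for illustration:

```python
import numpy as np

# Sketch: how a computer "sees" an image - a height x width matrix of
# pixels, each with three channel intensities (red, green, blue), 0-255.
image = np.zeros((2, 2, 3), dtype=np.uint8)  # a tiny 2x2 image, all black

image[0, 0] = [255, 255, 255]  # full red + green + blue -> white
image[0, 1] = [255, 0, 0]      # only red
image[1, 0] = [0, 255, 0]      # only green

print(image.shape)  # (2, 2, 3): height, width, channels
print(image[1, 1])  # [0 0 0] -> black, as in the mixing example above
```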

So now we understand how the computer handles images: in the end it's just integers, and with integers we can do much more than just display the visual image.

So, what we have here is a convolutional neural network, called a CNN. It's just another kind of neural network, which also includes a part of the one I showed you before – it's a more advanced one, but I will give you a rough overview of what it's doing.

So on the left side here we have a 28x28 image; in this case it has just one channel, so you see just black and white. We give that image, with the integer values I showed you before, as input. Then we have a convolution layer, which you can imagine as filtering over the image and extracting information – we call those features – and we try to find certain patterns in the image which could help us distinguish between different cases. So we are training these filters, and this is what happens in the convolution: it's like extracting features with filters.
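Here is a small sketch of that filtering step. The 3x3 edge-detection kernel is a common illustrative choice, not taken from the talk; in a real CNN the filter values would be learned during training:

```python
import numpy as np

# Sketch of the convolution step: a small filter slides over the image
# and produces a feature map of pattern responses.
def convolve2d(image, kernel):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.default_rng(0).random((28, 28))  # one-channel 28x28 input
edge_kernel = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]])  # responds strongly to edges

feature_map = convolve2d(image, edge_kernel)
print(feature_map.shape)  # (26, 26) - one extracted feature map
```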

Next, we have the pooling layers – you can see here that the image is getting smaller, from 28x28 down to 14x14. What's happening here is that we lose some information about the neighbourhood of each pixel, but we reduce complexity – because the goal, as we go from layer to layer, is to point out more and more specific features. At the beginning we have the whole image and all the features in it, and we want to find and point out the features which help us distinguish between different cases.
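A common way to do this is 2x2 max pooling, sketched below; the choice of max pooling is an assumption, since the talk only mentions pooling in general:

```python
import numpy as np

# Sketch of 2x2 max pooling: keep only the strongest response in each
# 2x2 block, halving the spatial size (28x28 -> 14x14 as on the slide).
def max_pool_2x2(feature_map):
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

feature_map = np.random.default_rng(0).random((28, 28))
pooled = max_pool_2x2(feature_map)
print(pooled.shape)  # (14, 14) - less pixel detail, lower complexity
```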

As I said before – I am not sure anymore if I have mentioned it – most of the time we cannot really say what is happening in this fully connected layer. It's like our brain: we don't really know what is specifically happening in there. But in the case of image recognition, with the filters I will show you later, we do get a rough overview of what the network is doing.

So, as I said, we have the input, we have the convolution layers, and we have the pooling layers. We have a few of them one after another, and at the end comes the classification. The classification, as shown before, is a fully connected layer. What we do here is put in all the extracted features, and the network learns from these features how to detect and distinguish between different cases, and in the end we get a result, like: there is a dog in the image, or a human, or whatever.
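Putting the whole pipeline together, it could look like the following sketch in PyTorch. The channel counts, the two conv/pool stages and the 10 output classes are illustrative assumptions matching the 28x28 single-channel input from the slide:

```python
import torch
import torch.nn as nn

# Sketch of the full pipeline: convolution -> pooling, repeated,
# then a fully connected layer for the classification.
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # extract features
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 28x28 -> 14x14
    nn.Conv2d(8, 16, kernel_size=3, padding=1),  # more specific features
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(16 * 7 * 7, 10),                   # fully connected classifier
)

x = torch.randn(1, 1, 28, 28)  # one single-channel 28x28 image
print(model(x).shape)          # torch.Size([1, 10]) - one score per class
```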

So, regarding the features I mentioned before – we are going here from right to left, and at the beginning you can see we have the full face; the first convolution layers normally just do edge detection. So we still have the whole image in there, but the goal, as I said, is to reduce it down to certain features. As you see, if you go a little deeper, the network is already focusing on the eyes, or the nose, or the mouth – certain parts of the face which can help it decide between different faces. And if you go deeper again – as you see in the left image – I could not even tell you anymore what it is focusing on, but these are parts which help the network distinguish between the different faces.

So, if you come back here you see how the network is structured. In the end it's simply mathematics in there. I will not go into much detail about the mathematics – maybe that's a topic for another talk – but you see we have the input layer, we have the convolutions which extract features, we feed those into pooling, and in the end we do a simple classification.

In the end it's not magical at all – it's simple maths, even though it may have sounded a bit magical at first. I hope that now you have a rough overview of what's happening inside.