Tech Talk: How NETCONOMY built "Object Detection via Hololens" with Vid Jelen

Introduce yourself, your team and your role.

Hi, I’m Vid. I’m a machine learning engineer here at NETCONOMY and part of the machine learning team. My role is basically making sure the infrastructure is set up for all of our machine learning projects and then executing them. So, different machine learning projects – partly research, partly implementation, but more or less: fun.

Why did you/your team use/try/set up Object Detection via Hololens?

Getting started with object detection was actually quite a fun challenge. Our infrastructure wasn’t up to the task, and there also weren’t that many people involved. I worked on the project alone, because there were other projects running in parallel that the other team members took on.

Hololens is one of the newer technologies – it’s still developing and maturing – that shows a lot of promise for the future: enabling new workflows in your workplace, remote working, or even having somebody on the other side see directly through your Hololens what you are seeing, annotate it, and point out individual objects. We think all of that has a lot of potential, and we would also like to build up more know-how about how Hololens works. Since we’re the machine learning team and we want to work with computer vision, we found the perfect way of combining two different research fields: we wanted to know more about object detection, so we took that as one part of the concept we wanted to implement, and Hololens as the other part, where we gain more experience with augmented reality.

How did you build it? Which technologies did you use?

Well, since it’s a proof of concept, the team was very small. Actually, I was the only one working on this project; there were multiple projects being executed in parallel, so the other team members worked on those.

The initial concept took about two to three weeks to set up, including the device and setting everything up on my laptop. What I needed to do was set up the full frameworks: TensorFlow for the model inference part, and OpenCV so that we could capture images or a video stream and send them through our machine learning model. We used a pre-trained model, specifically one called You Only Look Once (YOLO), which already covered a fairly large number of categories – something like 50 or 60 common objects, like a table, a chair, a screen, a laptop, etc. – and that was the base building block.
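
To make that setup concrete, here is a minimal sketch of such a pipeline: OpenCV grabbing frames from the laptop camera and a pre-trained TensorFlow detection model running inference on each frame. This is not the project’s actual code; the SavedModel path and the output signature (detection_boxes, detection_scores, detection_classes, as used by the TensorFlow Object Detection API) are assumptions.

```python
import cv2
import numpy as np
import tensorflow as tf

# Hypothetical path to a pre-trained detection model exported as a SavedModel
detector = tf.saved_model.load("models/yolo_savedmodel")

cap = cv2.VideoCapture(0)          # 0 = built-in laptop camera
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # OpenCV delivers BGR frames; most detection models expect RGB input
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    batch = tf.convert_to_tensor(rgb[np.newaxis, ...], dtype=tf.uint8)
    outputs = detector(batch)      # assumed output signature, see above
    scores = outputs["detection_scores"][0].numpy()
    classes = outputs["detection_classes"][0].numpy()
    # keep only reasonably confident detections
    confident = scores > 0.5
    print(list(zip(classes[confident].astype(int), scores[confident])))
    cv2.imshow("camera", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```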

Then there’s the more architectural work, which we did with containers – we used Docker containers to make it all supportable. That was the part that took the majority of the timeline. The experimentation part – where we were actually sending video streams through the model, analyzing the results and improving things like the detection thresholds, which objects were being detected correctly, etc. – was maybe another week of work.
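
That threshold tuning usually comes down to two knobs: a confidence threshold that discards weak detections and a non-max-suppression IoU threshold that removes duplicate boxes for the same object. The sketch below only illustrates that filtering step; the array shapes and default values are assumptions, not the project’s actual settings.

```python
import numpy as np
import tensorflow as tf

def filter_detections(boxes, scores, classes,
                      score_threshold=0.5, iou_threshold=0.45, max_boxes=20):
    """boxes: [N, 4] array in [y1, x1, y2, x2]; scores, classes: [N] arrays."""
    # 1) drop low-confidence detections
    keep = scores >= score_threshold
    boxes, scores, classes = boxes[keep], scores[keep], classes[keep]
    # 2) non-max suppression removes overlapping boxes covering the same object
    selected = tf.image.non_max_suppression(
        boxes.astype(np.float32), scores.astype(np.float32),
        max_output_size=max_boxes, iou_threshold=iou_threshold).numpy()
    return boxes[selected], scores[selected], classes[selected]
```

Raising the confidence threshold trades missed objects for fewer false positives, which is essentially the trade-off being tuned in that experimentation phase.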

Which challenges did you have to solve?

When we started with this, we had two problems. One, we lacked the data: what type of objects do we want to detect? The other one was where this was going to run. We didn’t actually have a proper compute resource for it – a server, an IoT device, etc. That was still something to figure out.

To keep it a really minimal viable project, we started off running it on our laptop, using the built-in camera, to see what we could get out of that. With regard to the objects, one important step would be to train our own machine learning model. For that we would need to assemble a dataset, find the proper objects we want to detect, and then use that in the training process.

Once we had the model and knew what we were going to do – the dataset is basically the input from the laptop camera – we ran into a very trivial problem: video processing is very compute-intensive. Processing all of that on the not very performance-oriented laptop we were using, we could analyze video at a rate of up to 5 fps in the very best cases. That’s not really something one could use as a product, or something that would make a nice showcase, but we could use it as a proof of concept: we showed that the thing was working. From there the challenge was how to improve it. The answer was to use graphics cards, especially the ones from NVIDIA. That was actually one of the biggest challenges we needed to solve as the next step of this project.
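
A rough way to quantify that kind of throughput bottleneck, and to check whether TensorFlow can see an NVIDIA GPU at all, could look like the sketch below. It is only an illustration; `run_inference` is a hypothetical stand-in for whatever model call is being benchmarked.

```python
import time
import cv2
import tensorflow as tf

# An empty list here means inference will fall back to the CPU
print("GPUs visible to TensorFlow:", tf.config.list_physical_devices("GPU"))

def measure_fps(run_inference, num_frames=100):
    """Run `run_inference` on webcam frames and return frames per second."""
    cap = cv2.VideoCapture(0)
    start = time.time()
    processed = 0
    while processed < num_frames:
        ok, frame = cap.read()
        if not ok:
            break
        run_inference(frame)
        processed += 1
    cap.release()
    elapsed = time.time() - start
    return processed / elapsed if elapsed > 0 else 0.0

# Example usage (detector_fn is hypothetical):
# print("FPS:", measure_fps(lambda frame: detector_fn(frame)))
```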

If you could start over, what would you change?

That’s a hard question, since we never completely finished the project. But definitely, now that we have more know-how about using graphics cards for video analysis, I think we would set up that kind of architecture first. There is a lot of overhead, a lot of things you need to get right, lots of moving parts. But once you have that architecture in place, the repetitions – retraining, retrying, all of that – become much easier to iterate on. So setting up the architecture properly first and then iterating is what we would most gladly change if we started again.