Welcome to this TechTalk about logging in a polyglot environment – in an IoT environment at Tractive. My name is Dominik Hurnaus, I’m CTO at Tractive and today’s topic will be about logging.
What is Tractive? Tractive is a Linz/Austria based Startup, creating pet safety products. We create a GPS tracker for your dog that you can attach to its collar and track its location in realtime.
What are the challenges that we are facing? With GPS devices in the field, server software and mobile apps, there are quite a lot of challenges that we see today. One is security, for the communication between apps and servers and between servers and devices. Another is updatability: updating the firmware of the GPS trackers and updating the apps, with firmware updates done via Bluetooth, Wi-Fi, GSM or LTE. Logging is also one of the challenges. Many devices in the field, many requests and big server farms in the background create lots of logs from various software systems. Furthermore, there are challenges when it comes to the availability of systems that need to be in the field 24/7. That means we have to take care of zero-downtime deployment and make sure that our customers, who are all around the world, always see the latest location of their pets. And with lots of pets and lots of customers, there is lots of data, so scaling the database is another challenge.
So how did we solve this challenge? First of all, let me explain what runs in the background. Our Tractive ecosystem runs on a Docker Swarm environment in a larger cluster of several servers. There are various applications like REST APIs, there are endpoints for our GPS trackers that consume the IoT devices’ data, and there are supplementary services such as a notification service that sends push notifications, other geolocation-related applications and many more. In addition to that, obviously, there is quite some infrastructure running: load balancers, caches and our core database, MongoDB, for example. All of those systems generate logs. On the outside, there are also GPS trackers that produce logs over their runtime and send them to our device endpoints. The same holds true for our mobile apps. So how do we aggregate those logs? We opted for the ELK Stack, now called the Elastic Stack. This stack is built upon four main components from Elastic, and I want to quickly explain those four components. First, we start at the left with Beats. Beats is a set of tools that collect logs from various systems. We are using Filebeat, as you will see in a second, since we are reading the logs from the files that are generated by our Docker environment. Filebeat collects those files and forwards them to Logstash. Logstash aggregates those logs, may filter them a little or add a few fields, depending on the specific needs of the services, and then sends the log information on to Elasticsearch, which is used as the database for our logs. Finally, Kibana is the system that helps us query those logs, very similar to how you would use Google: just entering any term does a full-text search on your logs.
The key to solving the puzzle of bringing all those logs from various frameworks, services and programming languages together is using one central logging format: JSON. Docker provides a JSON logger that essentially puts all logs into JSON files, and Beats, in this case Filebeat, is able to read all the logs that the Docker environment writes into those JSON files. So whenever any service starts up, Filebeat automatically picks up its logs and forwards them on, so they end up in Elasticsearch. There is no more looking into individual log files; all the logs are in one centralized cluster.
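To make the idea of one central JSON format concrete, here is a minimal sketch of how a single service could emit one JSON object per log line on standard output, which is the stream that Docker's JSON logger captures. It is an illustration only, not Tractive's actual code: the `JsonFormatter` class, the `build_logger` helper, the `context` convention and the field names are all assumptions made for this example.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical convention: callers pass IDs via extra={"context": {...}}.
        # Context keys are merged at the top level and would overwrite the
        # fields above if they used the same names.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)


def build_logger(name: str, stream=sys.stdout) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]   # replace any handlers from earlier calls
    logger.setLevel(logging.INFO)
    logger.propagate = False      # avoid duplicate output via the root logger
    return logger


if __name__ == "__main__":
    log = build_logger("notification-service")  # hypothetical service name
    log.info("push notification sent", extra={"context": {"tracker_id": "T-123"}})
```

A service written in another language would do the equivalent with its own logging library. The point is only that every service writes the same kind of JSON line, so the collector does not need to know which framework produced it.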
What are a couple of the lessons that we have learned from setting up a logging system this way? What’s very important is to add lots of metadata and context to any log message that you’re writing. If a log message just says "user logged out", it’s not really worth anything unless you know which user it was. So, add the IDs of any affected entities. If it’s about the creation of a pet, in our use case, you would maybe want to add some information about the pet. If it’s about the deletion of objects, at least note the ID of that object. Also important is to use proper log levels. You might want to log more information in a staging or test environment than you typically log in a production environment. Also make sure that you never log any sensitive information like personally identifiable information, secrets, keys, API keys, or any other data that you don’t want anyone else to read. Another very important thing that we figured out is: whenever you call third-party systems, log all details about that call. Log what data you sent and what data you received, and include the headers, since they contain valuable information if at a certain point in time those systems fail or your connection to those systems fails. The last thing that we figured out is the question of how many logs you really need. Right now we are averaging 220 GB of logs per day from the various systems. This is also a big challenge for the environment where you want to store that data, which brings us to the final slide. What do you do now that you have all those logs?
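Two of these lessons pull in opposite directions: log everything about a third-party call, including headers, but never log secrets or personal data. One way to reconcile them is to mask sensitive fields before the entry is written. The following is a minimal sketch under stated assumptions, not the approach used at Tractive: the `SENSITIVE_KEYS` deny-list, the `redact` function and the `third_party_call_context` helper are hypothetical names invented for this example.

```python
from typing import Any

# Hypothetical deny-list; a real one would be tailored to the data a service handles.
SENSITIVE_KEYS = {"authorization", "cookie", "set-cookie", "password", "api_key", "email"}
MASK = "[REDACTED]"


def redact(value: Any) -> Any:
    """Return a copy of value with sensitive dictionary entries masked, at any depth."""
    if isinstance(value, dict):
        return {
            key: MASK if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


def third_party_call_context(url, request_headers, request_body,
                             status, response_headers, response_body):
    """Build the context fields for a log entry describing one external call."""
    return redact({
        "url": url,
        "request": {"headers": request_headers, "body": request_body},
        "response": {"status": status, "headers": response_headers, "body": response_body},
    })
```

The returned dictionary keeps the status code, the non-sensitive headers and the payload structure, which is usually what is needed to diagnose a failing integration, while values such as an `Authorization` header are replaced by a fixed marker. A deny-list only catches the keys it knows about, so it complements rather than replaces reviewing what each service logs.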
Let me go into detail on a sample of one of our application logs. You see, first of all, that a log has a timestamp associated with it, and it has a log level and a message, which is typical for almost any logging system that you have. But then there are certain IDs, like a payment ID, a tracker ID, a subscription ID and so on, associated with that log, because they make sense for this type of log. For other logs, completely different fields might be there.
Now that you’ve got a bunch of logs, and many gigabytes of logs over time, the question is: what do you do next with those logs? First and most important is to train your developers to understand how to use the logs. Use the logs on a daily basis to review any errors that you have in your system. For us, a daily check of the logs is the first thing that a person in the team does in the morning. We rotate this role, so every day there is a different person checking the logs in the morning, figuring out whether there are any errors, any new or unknown errors, and creating tickets in our ticketing system. At the same time, whenever we deploy one of our systems, we also do an error-log review afterwards to see if anything changed. In addition to application performance monitoring, which should be part of any system anyway, we do additional monitoring and alerting based on our logs. Our logs tell us about specific situations, like a purchase being made or a user being created in the system. Whenever we see certain thresholds per unit of time exceeded, or no payments within one hour, we trigger an alert from the logs. And with many logs, log lifecycle management becomes an issue. Data retention in the database is something that people typically care about, and the same holds true for logs. Think about how long you need that data. The ELK Stack, or Elastic Stack, comes with a feature called index lifecycle management that allows you to configure how long you want to keep the data. For example, if you configure your data to be kept for thirty days, it is automatically deleted after thirty days, and you can be sure that the size of the data doesn’t grow indefinitely.
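The log-based alerting mentioned above, such as "no payments within one hour" or a threshold per time window being exceeded, boils down to counting matching log events in a window and comparing the count against bounds. Here is a minimal sketch of that check. It assumes the timestamps of the matching events have already been fetched from the log store. The `check_event_rate` function and its parameters are hypothetical names for this example and do not describe how Tractive or the Elastic Stack implement alerting.

```python
from datetime import datetime, timedelta


def check_event_rate(event_times, now, window=timedelta(hours=1),
                     min_count=1, max_count=None):
    """Return an alert message if the event count in the window is out of bounds, else None."""
    start = now - window
    # Count only events inside the half-open window (start, now].
    count = sum(1 for t in event_times if start < t <= now)
    if count < min_count:
        return f"too few events: {count} in the last {window} (expected at least {min_count})"
    if max_count is not None and count > max_count:
        return f"too many events: {count} in the last {window} (expected at most {max_count})"
    return None
```

With the defaults, passing in the timestamps of payment log entries gives the "no payments within one hour" rule: the function returns a message when the last hour contains no payment and `None` otherwise. Setting `max_count` covers the opposite case, for example an unusual spike of error entries after a deployment.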
So, that was a quick introduction to how we do logging in such a diverse field as GPS tracking. Thank you very much! Obviously we are hiring. We are expanding our team and the company is growing, so we are looking for more people on our teams: bright minds like senior back-end developers in Kotlin or Ruby, cloud engineers, software testers, hardware engineers and firmware engineers. So if you think this might be a great opportunity for you, drop us a line at tractive.com/jobs or write a short email to me. Thank you very much!