Time Series and How to Detect Anomalies in Them — Part III
Eventually easier than it seemed
Hello there, my name is Artur.
You might be reading this intro for the third time — and if this is the case, I appreciate your sticking with this article series.
I am the head of the Machine Learning team at Akvelon-Kazan, and you are about to read the last part of this tutorial on anomaly detection in time series.
During our research, we gathered a lot of information from small, useful pieces scattered all over the internet. We don’t want this knowledge to be lost, so we are sharing it with you!
We already dove into the theory and data preparation in Part I and defined and trained three models in Part II:
- Part I — Intro to Anomaly Detection and Data Preparation
- Part II — Implementation of ARIMA, CNN, and LSTM
We reuse code from those parts, so if something seems unclear, consider revisiting them.
Fantastic, let’s complete this series!

As a brief reminder, here are the tools that we use:
- Jupyter Notebooks environment for the implementation of the models
- Scikit-learn for some data preprocessing
- statsmodels library for the ARIMA model
- PyTorch for neural networks
- Plotly for plots and graphs
And here are the models we trained and what each one does:
- ARIMA statistical model — predicts next value
- Convolutional Neural Network — predicts next value
- Long Short-Term Memory Neural Network — reconstructs current value
Anomaly Detection with Static and Dynamic Threshold
Amazing, we trained all three models! But every line of code so far was just preparation for the anomaly detection.
After a small amount of additional preparation, we will finally be able to detect anomalies.
What exactly lies behind these “additional preparations”? These things (remember the “Saying what we want from our models out loud” from Part I?):
- Calculation of the errors for each item in datasets
- Threshold calculation based on errors
After that, detecting anomalies is extremely fast: we literally just compare the errors with the threshold.
What are we waiting for? We’re ready for this!
The error calculation differs for each model due to different implementations of Datasets. The algorithm stays the same.
ARIMA’s errors calculation:
Just as a reminder: we use the absolute error for ARIMA because it gives better results. If you are wondering how we arrived at this, the answer is simple: we tried it and it worked.
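As a minimal sketch of the absolute error, assuming the actual values and the one-step-ahead predictions are already aligned arrays of the same length (the function name and the sample values below are hypothetical, not taken from the earlier parts):

```python
import numpy as np


def absolute_errors(actual, predicted):
    """Element-wise absolute error between actual values and predictions."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    if actual.shape != predicted.shape:
        raise ValueError("actual and predicted must have the same shape")
    return np.abs(actual - predicted)


# Hypothetical values: the third point is far from its prediction.
errors = absolute_errors([10.0, 11.0, 30.0, 12.0], [10.5, 10.8, 11.5, 12.1])
```

The large error at the third position is the kind of value that the threshold is later meant to catch.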
CNN’s errors calculation:
LSTM’s errors calculation:
Threshold calculation — common for all three models:
Static threshold
This threshold is calculated with the three-sigma rule: the mean of the errors plus three times their standard deviation.
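The three-sigma rule can be sketched as follows, assuming the threshold is the mean of the training errors plus three population standard deviations. The function name is hypothetical, and `std_coef=3.0` is exposed only to make the multiplier visible.

```python
import numpy as np


def static_threshold(train_errors, std_coef=3.0):
    """Single threshold for the whole series: mean + std_coef * std of the errors."""
    train_errors = np.asarray(train_errors, dtype=float)
    # np.std uses the population standard deviation (ddof=0) by default;
    # whether the original code used ddof=0 or ddof=1 is an assumption here.
    return train_errors.mean() + std_coef * train_errors.std()


# Hypothetical training errors.
threshold = static_threshold([1.0, 2.0, 3.0, 2.0, 2.0])
```

Every error above this single value would then be treated as an anomaly, regardless of where it occurs in the series.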
Dynamic threshold
For the dynamic threshold, we will need two more parameters: window, inside which we will calculate the threshold, and std_coef, which we will use instead of 3 from the static threshold formula.
- For ARIMA: window=40 and std_coef=5
- For CNN and LSTM: window=40 and std_coef=6
These two parameters are empirically chosen for each model using only the training data.
You may wonder — “Why does he always emphasize the usage of only training data? Why can’t I also use validation to choose better parameters?”.
The reason we use only training data to choose the parameters of our models is that this is the only way to be sure that they will work on real-world data outside the training dataset. The validation part of the dataset imitates such real-world data and gives a better understanding of the models’ capabilities, because we know it wasn’t used to train our models.
Let’s get down to business! Here is the code to calculate the dynamic threshold:
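A minimal sketch of one way to compute such a threshold, assuming a trailing window that ends at the current point and includes it. The exact windowing scheme is an assumption, and the function name is hypothetical; only `window` and `std_coef` come from the text above.

```python
import numpy as np


def dynamic_threshold(errors, window=40, std_coef=5.0):
    """Per-point threshold: mean + std_coef * std over the trailing window."""
    errors = np.asarray(errors, dtype=float)
    thresholds = np.empty_like(errors)
    for i in range(len(errors)):
        # The window covers at most `window` points ending at index i;
        # it is shorter at the very start of the series.
        start = max(0, i - window + 1)
        chunk = errors[start:i + 1]
        thresholds[i] = chunk.mean() + std_coef * chunk.std()
    return thresholds


# Tiny hypothetical example with a short window to keep the numbers readable.
example = dynamic_threshold([0.0, 2.0], window=2, std_coef=1.0)
```

Because the threshold is recomputed at every point, it rises in noisy stretches of the error series and falls in calm ones, unlike the single static value.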
The last piece that completes our puzzle is the metrics calculation. What kind of metrics? I am glad you asked. We calculate every basic metric to fully analyze the models’ performance:
- Confusion matrix to see how a model performs in detail
- Precision to see what share of the detected anomalies are true anomalies
- Recall to see what share of the true anomalies a model detects
- F2-score to see precision and recall combined. We use F2 instead of F1 because detecting true anomalies is more important than avoiding false ones (recall is more important than precision)
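The metrics listed above can be sketched in a few lines, assuming each point carries a binary label (1 for anomaly, 0 for normal). The function name and the dictionary layout are hypothetical; the F-beta formula with beta=2 is the standard one.

```python
import numpy as np


def detection_metrics(true_labels, predicted_labels, beta=2.0):
    """Confusion matrix, precision, recall and F-beta for binary anomaly labels."""
    y_true = np.asarray(true_labels, dtype=bool)
    y_pred = np.asarray(predicted_labels, dtype=bool)

    tp = int(np.sum(y_true & y_pred))    # true anomalies that were detected
    fp = int(np.sum(~y_true & y_pred))   # normal points flagged as anomalies
    fn = int(np.sum(y_true & ~y_pred))   # true anomalies that were missed
    tn = int(np.sum(~y_true & ~y_pred))  # normal points left alone

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = beta ** 2 * precision + recall
    f_beta = (1 + beta ** 2) * precision * recall / denominator if denominator else 0.0

    return {
        "confusion_matrix": [[tn, fp], [fn, tp]],
        "precision": precision,
        "recall": recall,
        "f_beta": f_beta,
    }


# Hypothetical labels: one true anomaly, caught, plus two false alarms.
metrics = detection_metrics([1, 0, 0, 0], [1, 1, 1, 0])
```

In this example precision is 1/3 and recall is 1, so F2 is 5/7 while F1 would be only 0.5, which illustrates how F2 rewards catching every true anomaly.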
Excellent! We can move on to the code for the actual anomaly detection.
ARIMA with static threshold:
For each model, we compare the errors with the given threshold and simply return the indexes of the values that exceed it. We consider these values detected anomalies!
And of course, we are going to visualize everything that we detected (still using the same code from Part I)!
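A minimal sketch of this comparison, assuming the errors are a one-dimensional array and the threshold is either a single number (static) or an array of the same length (dynamic). The function name and the sample values are hypothetical.

```python
import numpy as np


def detect_anomalies(errors, threshold):
    """Return the indexes of the errors that exceed the threshold."""
    errors = np.asarray(errors, dtype=float)
    # Broadcasting lets the same comparison handle a scalar or a per-point threshold.
    return np.flatnonzero(errors > threshold)


sample_errors = [0.1, 0.9, 0.2, 1.5]
static_hits = detect_anomalies(sample_errors, 0.5)
dynamic_hits = detect_anomalies(sample_errors, np.array([0.5, 1.0, 0.1, 2.0]))
```

The returned indexes can be mapped back to timestamps of the original series for plotting.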


We will leave the metrics until the results part. But here are the code and printed confusion matrices:


Yeah, this doesn’t look so good (because of the many falsely detected anomalies), but it still catches every anomaly.
ARIMA with dynamic threshold:
Let’s do the same for the dynamic threshold and see if it can change the situation.


The code for metrics is the same, so we can skip it and take a look at the confusion matrices.


Well, these look much better (no longer a huge number of incorrectly detected anomalies)! A tough baseline for our neural nets!
NN’s anomaly detection
For both neural nets, we will provide a unified generic function for anomaly detection.
And that’s it! We can effortlessly process the results of neural nets.
CNN with static threshold:




It seems that our CNN model overfitted, since it produces an enormous number of incorrect anomalies. But there is no need to make hasty decisions; it is better to look at the results with the dynamic threshold.
CNN with dynamic threshold:
And let’s do the same with the dynamic threshold:


The metrics calculation is still the same.


These results are better than ARIMA’s. We can already say that we didn’t waste our time on this!
And the last model (but certainly not the least) is LSTM.
LSTM with static threshold:


Once again, metrics calculations are identical to CNN’s.


Here we have the same situation as with CNN, but now we know that the dynamic threshold will reveal the truth!
LSTM with dynamic threshold:




And the dynamic evaluation certainly made a near-perfect detector out of our LSTM model.
Real-time evaluation with static/dynamic threshold
If it is hard to figure out from the code how to use these models on real-life data (and this is normal), here are some visualizations of the real-time evaluation:
The top chart shows the original data with true anomalies and detected anomalies. On the bottom chart, we can see the error of a model with the purple static threshold line.
And here is the visualization of the same process with the dynamic threshold.
As you can see, the dynamic threshold adapts to the dispersion of the error. That is why the threshold is low when the error deviates only slightly and high when it deviates strongly.
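For readers who want to try this on streaming data, here is a minimal sketch of a real-time check with the dynamic threshold. It assumes errors arrive one at a time and that the window includes the newest error; the class name and its interface are hypothetical and not part of the code from the earlier parts.

```python
from collections import deque

import numpy as np


class StreamingDetector:
    """Keeps the last `window` errors and flags each new error against them."""

    def __init__(self, window=40, std_coef=5.0):
        self.recent = deque(maxlen=window)  # old errors fall out automatically
        self.std_coef = std_coef

    def update(self, error):
        """Add a new error; return (is_anomaly, current_threshold)."""
        self.recent.append(float(error))
        values = np.fromiter(self.recent, dtype=float)
        threshold = values.mean() + self.std_coef * values.std()
        return bool(error > threshold), float(threshold)


# Hypothetical stream: three calm errors followed by a spike.
detector = StreamingDetector(window=3, std_coef=1.0)
flags = [detector.update(e)[0] for e in [1.0, 1.0, 1.0, 10.0]]
```

Each call does a constant amount of work for a fixed window, so the check can run alongside the model on every new data point.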
Results of the models
Finally, we can compare the metrics to make sure that we were right to put the LSTM in first place. We use the F2-score to decide which model is the best. Precision and recall are shown separately to reveal the strengths and weaknesses of our models.


ARIMA performs slightly better with the static threshold, while the neural networks outperform it with the dynamic threshold, especially LSTM.
Ultimate Conclusion
Lastly, I would like to emphasize that these models can already be taken to production without much effort.
Nevertheless, these models are far from their limits and can be enhanced via:
- Increasing the amount of training data
- Adding other metrics such as memory, network, etc.
- Combining the LSTM and CNN architectures
- Feature Engineering
Thank you very much for your attention. I hope that this tutorial gave you some understanding and a few hints on implementation.
And don’t stop looking for anomalies!
About us
We at Akvelon Inc love cutting edge technologies in mobile development, blockchain, big data, machine learning, artificial intelligence, computer vision, and many others. This article is the result of one of many projects developed in our Office Strategy Labs where we’re testing new technologies and approaches before delivering them to our clients.
If you would like to work with our strong Akvelon team — please see our open positions.
Designed and implemented with love in Akvelon:
Team Lead — Artur Khanin
Delivery Manager — Sergei Volynkin
Technical Account Manager — Max Kostin
ML Engineers — Irina Nikolaeva, Rustem Saitgareev