Time Series and How to Detect Anomalies in Them — Part II

Implementation of ARIMA, CNN, and LSTM


Hello, fellow reader (and hello again if you read the first part of this article series). My name is Artur, I am the head of the Machine Learning team in Akvelon’s Kazan office, and you are about to read the second part of our tutorial on anomaly detection in time series.

During our own research, we’ve managed to gather a lot of information from tiny useful pieces all over the internet and we don’t want this knowledge to be lost!

We already dove into theory and data preparation in Part I:

And here is the shortcut to the next chapter:

We reuse our code, so if something seems unclear, consider visiting the previous part once more.

Alright then, let’s move on!


Just to briefly remind you of the tools that we use:

Implemented Approaches

Amongst all possible approaches listed in Part I, we chose these suitable ones:

  1. ARIMA statistical model: predicts the next value
  2. Convolutional Neural Network (CNN): predicts the next value
  3. Long Short-Term Memory (LSTM) Neural Network: reconstructs the current value

Let’s start with the ARIMA model.

ARIMA Statistical Model

ARIMA is an autoregressive statistical model that optimizes its coefficients during training, while the model orders (how many autoregressive, differencing, and moving-average terms to use) are hyperparameters that we choose ourselves. The fitted model is then used for inference.

It is logical that CPU usage may vary between nighttime and daytime, which is totally normal behavior. To account for this behavior, we can think of night and day as separate “seasons”. That is why we will use SARIMAX, a modified ARIMA model built on the same idea that adds a seasonal component to ARIMA. SARIMAX will deal with the separation of seasons for us, so we don’t have to provide anything else except our dataset.

Here you can see the schema for the training process of SARIMAX:

Training process and architecture of the SARIMAX model

The model tries to predict the next value using the current one and then compares the predicted result with the actual value.

SARIMAX Implementation

Implementation of this model is not so interesting since all we can play with are its hyperparameters. To find the model that produces the best predictions, we will iterate over hyperparameter combinations and pick the best one.


All we need is the model itself from the statsmodels.api package and the product function from itertools for iterating over hyperparameters.

It would also be a good idea to write the predictions of the best model into a new column of the DataFrame with the initial data to ease further metrics calculation.

Here is the code to pick the best model and write its predictions into training and validation DataFrames:
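The original snippet is embedded in the Medium post and not reproduced here, so below is a minimal sketch of such a search. The DataFrame names train_df and valid_df, the column name cpu, the candidate orders, the 24-step seasonal period, and ranking candidates by AIC are all assumptions made for illustration.

```python
import itertools

import statsmodels.api as sm

# Hypothetical hyperparameter grid; the actual values may differ.
p = d = q = range(0, 2)
pdq = list(itertools.product(p, d, q))
seasonal_pdq = [(sp, sd, sq, 24) for sp, sd, sq in itertools.product(p, d, q)]

best_aic, best_model = float("inf"), None

for order in pdq:
    for seasonal_order in seasonal_pdq:
        try:
            result = sm.tsa.statespace.SARIMAX(
                train_df["cpu"],                 # "cpu" is a placeholder column name
                order=order,
                seasonal_order=seasonal_order,
                enforce_stationarity=False,
                enforce_invertibility=False,
            ).fit(disp=False)
        except Exception:
            continue                             # some combinations simply do not converge
        if result.aic < best_aic:                # keep the model with the lowest AIC
            best_aic, best_model = result.aic, result

# In-sample one-step-ahead predictions for the training data
train_df["arima"] = best_model.fittedvalues
# Out-of-sample forecast for the validation data
valid_df["arima"] = best_model.get_forecast(steps=len(valid_df)).predicted_mean.values
```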

At this very moment, we have the best set of parameters and predictions for both the training and validation data. To understand how good our model is, we should calculate different metrics such as precision, recall, and F-score. We will fully cover the metrics topic in Part III; however, we can already visualize the predictions and see how our model performs:

ARIMA’s predictions on training data
ARIMA’s predictions on validation data

Note: one important thing about ARIMA is that training it and optimizing the coefficients takes roughly 10 times longer than training both of the neural networks.

Convolutional Neural Network

Convolutional neural networks are usually applied to image-related tasks, such as image classification, segmentation, etc. But the purpose of convolutional layers is to find and recognize patterns, which is totally applicable to the analysis of the CPU utilization metric.

The architecture of the CNN model

We use a ResNet-ish architecture (which has already become a default choice for CNNs) that consists of Residual Blocks (ResBlocks). The idea behind the ResBlock is simple yet efficient: add the input of the block to its output. This allows the neural network to “remember” each intermediate result and take it into account in the final layers.

CNN tries to predict the next value using some number of previous values. In our case, this number equals 10, but of course, it may be configured.

CNN Implementation

Before coding the CNN itself, we should make a few additional preparations of the data. Since we use PyTorch, the data should be wrapped into something compatible with PyTorch’s Dataset. This may be a class inherited from the Dataset class, a generator, or even a simple iterator.
We will use the class option due to its readability and the implicit enhancements PyTorch provides for it.

Once again, there are some necessary imports for Dataset creation:
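Assuming the standard PyTorch utilities, the imports could look like this:

```python
import torch
from torch.utils.data import Dataset, DataLoader
```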

Let’s recall the task for CNN. We want the model to predict the next value using some amount of the previous values. This means that our Dataset class should contain each item in a specific format — divided into 2 parts:

  1. n values in sequential order that the model uses for prediction (to easily change the amount of these values, we make n a parameter)
  2. the 1 value that comes right after those n values

Coming back to the implementation — actually, it is very easy to wrap data with our custom class that just inherits from PyTorch’s Dataset. It should implement only __init__, __len__ and __getitem__ methods:
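Here is a sketch of such a class, under the assumption that each item is shaped for a 1-D convolution (one channel of n values, plus the single target value):

```python
class CPUDataset(Dataset):
    """Each item: n sequential values and the single value that follows them."""

    def __init__(self, values, n=10):
        # values: 1-D array of CPU utilization, n: length of the history window
        self.values = torch.tensor(values, dtype=torch.float32)
        self.n = n

    def __len__(self):
        return len(self.values) - self.n

    def __getitem__(self, idx):
        x = self.values[idx:idx + self.n]      # part 1: n previous values
        y = self.values[idx + self.n]          # part 2: the value that goes next
        return x.unsqueeze(0), y.unsqueeze(0)  # add a channel dimension for Conv1d
```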

Now it is the right time to define the number of previous values that are to be used for the prediction of the next one. And, of course, wrap the data with our CPUDataset class.
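For example (the variable names and the train_df/valid_df DataFrames are, again, assumptions):

```python
N_PREV = 10  # number of previous values used for each prediction

train_ds = CPUDataset(train_df["cpu"].values, n=N_PREV)
valid_ds = CPUDataset(valid_df["cpu"].values, n=N_PREV)
```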

Here comes the best part — the definition of the neural network.

Let’s recall what a Residual Block looks like in regular ResNets. It consists of:

Residual Block from the regular ResNet

The only difference between the regular ResBlock and ours is that we removed the last ReLU activation: it turned out that in our case a CNN without the last ReLU in the ResBlock generalizes better.

Each Residual Block brings two Convolution + Batch Norm (+ ReLU) combinations, and we will stack many Residual Blocks, so such a combination is a good building block to define first.
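A possible helper for that combination (1-D convolutions, since we work with a univariate series; the kernel size is an assumption):

```python
import torch.nn as nn

def conv_block(in_feat, out_feat, kernel_size=3, relu=True):
    """Convolution + Batch Norm (+ optional ReLU)."""
    layers = [
        nn.Conv1d(in_feat, out_feat, kernel_size, padding=kernel_size // 2),
        nn.BatchNorm1d(out_feat),
    ]
    if relu:
        layers.append(nn.ReLU(inplace=True))
    return nn.Sequential(*layers)
```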

In each Residual Block, we should remember the case when the number of output channels changes (when in_feat != out_feat). One possible way to synchronize the number of channels is to multiply or cut them. However, there is a better way: we can handle this with a 1x1 convolution without padding. This trick not only lets us fit the layer input to the layer output but also adds more meaningful computation to the neural network.
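Putting that together, a Residual Block without the final ReLU and with the 1x1 convolution on the skip connection might look like this sketch:

```python
class ResBlock(nn.Module):
    def __init__(self, in_feat, out_feat):
        super().__init__()
        self.convs = nn.Sequential(
            conv_block(in_feat, out_feat, relu=True),
            conv_block(out_feat, out_feat, relu=False),  # no ReLU at the end of the block
        )
        # 1x1 convolution without padding aligns the channels of the skip connection
        self.shortcut = (
            nn.Identity() if in_feat == out_feat
            else nn.Conv1d(in_feat, out_feat, kernel_size=1)
        )

    def forward(self, x):
        return self.convs(x) + self.shortcut(x)  # add the block's input to its output
```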

It is common to finish the base block of a convolutional net with Max Pooling or Average Pooling, depending on the task. Here comes another useful trick for Convolutional Neural Nets (thanks to Jeremy Howard and his fantastic fast.ai library): concatenate Average Pooling and Max Pooling. It allows our neural net to decide which approach is better for the current task and how to combine them to get better results:
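A small module in the spirit of fast.ai’s concatenated pooling could be sketched like this:

```python
class ConcatPool(nn.Module):
    """Concatenates adaptive Average Pooling and Max Pooling along the channel axis."""

    def __init__(self, output_size=1):
        super().__init__()
        self.avg = nn.AdaptiveAvgPool1d(output_size)
        self.max = nn.AdaptiveMaxPool1d(output_size)

    def forward(self, x):
        return torch.cat([self.avg(x), self.max(x)], dim=1)
```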

And here is our resulting CNN class that is made of the building blocks that we implemented above:
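The exact channel counts are not given in the text, so the numbers below are placeholders; the structure (stacked ResBlocks, concatenated pooling, and a linear head that outputs one value) follows the description above:

```python
class CNN(nn.Module):
    def __init__(self, in_channels=1, hidden=(16, 32, 64)):
        super().__init__()
        blocks, prev = [], in_channels
        for feat in hidden:
            blocks.append(ResBlock(prev, feat))
            prev = feat
        self.blocks = nn.Sequential(*blocks)
        self.pool = ConcatPool(1)
        self.head = nn.Linear(prev * 2, 1)  # *2: average and max pooling are concatenated

    def forward(self, x):
        x = self.blocks(x)            # (batch, channels, n)
        x = self.pool(x).flatten(1)   # (batch, channels * 2)
        return self.head(x)           # (batch, 1): the predicted next value
```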

Now we have the definition of CNN and can create a randomly initialized model, pass items from our Datasets, and check that data shapes are correct and that no exceptions are raised.
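A quick sanity check along these lines:

```python
model = CNN()
x, y = train_ds[0]
with torch.no_grad():
    out = model(x.unsqueeze(0))     # add a batch dimension
print(x.shape, y.shape, out.shape)  # expect (1, 10), (1,) and (1, 1)
```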

After the model is defined, we can move to the training loop. This loop is quite general and may be used with the majority of neural nets:

  • Loop over the epochs
  • Loop over the training part of the dataset making optimization steps
  • Loop over the validation part of the dataset
  • Calculate losses for each epoch
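Putting the steps above together, a minimal version of such a loop could look like this (the loss averaging and logging details are assumptions):

```python
def fit(model, epochs, train_dl, valid_dl, criterion, optimizer, scheduler=None):
    train_losses, valid_losses = [], []
    for epoch in range(epochs):
        # training part: optimization steps
        model.train()
        running = 0.0
        for x, y in train_dl:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
            if scheduler is not None:
                scheduler.step()
            running += loss.item() * len(x)
        train_losses.append(running / len(train_dl.dataset))

        # validation part: no gradient updates
        model.eval()
        running = 0.0
        with torch.no_grad():
            for x, y in valid_dl:
                running += criterion(model(x), y).item() * len(x)
        valid_losses.append(running / len(valid_dl.dataset))

        print(f"epoch {epoch + 1}: train {train_losses[-1]:.4f}, valid {valid_losses[-1]:.4f}")
    return train_losses, valid_losses
```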

Alright, we are almost ready to start fitting our CNN model. The only thing left is to initialize the model, dataloaders, and training parameters.

Here we use:

  • Adam optimizer — one of the best general optimizers.
    If you have no idea which type of optimizer to pick — use Adam. Other optimizers such as SGD, RMSProp, etc. may converge better in some specific situations, but it is not our case.
  • Learning rate scheduler with One Cycle policy to speed up the convergence of the neural net.
    Instead of keeping the learning rate at the same value across all iterations, we can change it. There are a lot of policies such as Cosine, Factor, Multi-Factor, Warmup, etc. We chose the One Cycle policy because it seems logical to us: warm up the learning rate for roughly the first 1/3 of the iterations and then gradually decrease it.
  • Mean Squared Error loss criterion as we described in Part I.
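Under those choices, the initialization could be sketched as follows (the epoch count, batch size, and maximum learning rate are assumptions, not the values used in the article):

```python
EPOCHS = 30
BATCH_SIZE = 64
MAX_LR = 1e-3

train_dl = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size=BATCH_SIZE)

cnn = CNN()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(cnn.parameters(), lr=MAX_LR)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=MAX_LR,
    steps_per_epoch=len(train_dl), epochs=EPOCHS,
    pct_start=0.3,  # warm up for roughly the first third of the iterations
)
```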

Eventually, we can run the most desired line of code — execution of the training process:
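With the pieces defined above, that line could be simply:

```python
train_losses, valid_losses = fit(
    cnn, EPOCHS, train_dl, valid_dl, criterion, optimizer, scheduler)
```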

And see the losses of our model.

Loss vs Epoch graph

We always want both training and validation losses to move down because this behavior means that the model has learned something useful about our data. If any loss eventually moves up, then the model can’t figure out how to solve the task, and you should change or modify it.

The losses in the picture above seem pretty nice because they eventually move down, but their niceness is an early assumption since we haven’t checked the predictions yet.

At this very moment, we can easily calculate them with our model:
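One possible way to collect the predictions and store them next to the original data (the column name cnn is a placeholder):

```python
def predict(model, dataset):
    model.eval()
    preds = []
    with torch.no_grad():
        for i in range(len(dataset)):
            x, _ = dataset[i]
            preds.append(model(x.unsqueeze(0)).item())
    return preds

# The first N_PREV points have no full history, so predictions start at index N_PREV
train_df.loc[train_df.index[N_PREV:], "cnn"] = predict(cnn, train_ds)
valid_df.loc[valid_df.index[N_PREV:], "cnn"] = predict(cnn, valid_ds)
```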

And take a look:

CNN’s predictions on training data
CNN’s predictions on validation data

Just to be clear, we don’t really want our models to perfectly predict values, as such perfection would destroy the whole idea of the anomaly detection process (described in “Saying what we want from our models out loud” in Part I). That’s why when we look at plots, we want to see our model catch the main trend, not the particular values.

Long Short-Term Memory Neural Network

LSTM neural networks have an internal memory state that allows them to remember their previous evaluations, which makes this architecture a perfect candidate for time series anomaly detection.

LSTM architecture

Unlike the previous two models, this neural network tries to reconstruct the current value using the value itself. It may seem trivial, but this approach has extremely good results in anomaly detection.

LSTM Implementation

We can simplify the data wrapping compared to the CNN Dataset because for reconstruction the target is the very same value as the input. That is why we get rid of the first part of each item and just keep one value. We should also consider that LSTM models usually can’t be fed the whole sequence at once (because of the memory consumption), so we have to split the data into partial sequences for training. Moreover, training on the whole sequence may reduce the model’s ability to generalize: it may simply get used to the data this way.
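A sketch of such a Dataset; the class name and the sequence length of 100 are assumptions:

```python
class SequenceDataset(Dataset):
    """Splits the series into partial sequences of a fixed length for the LSTM."""

    def __init__(self, values, seq_len=100):
        self.values = torch.tensor(values, dtype=torch.float32)
        self.seq_len = seq_len

    def __len__(self):
        return len(self.values) // self.seq_len

    def __getitem__(self, idx):
        seq = self.values[idx * self.seq_len:(idx + 1) * self.seq_len]
        return seq.unsqueeze(-1)  # (seq_len, 1): one feature per time step
```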

The definition of the LSTM neural net is much easier than the CNN one. PyTorch already has an implemented class of LSTM cells that we can use.
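A minimal reconstruction model built around nn.LSTM; the hidden size and number of layers are placeholders:

```python
class LSTMModel(nn.Module):
    """Reconstructs every value of the input sequence."""

    def __init__(self, n_features=1, hidden_size=64, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, num_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, n_features)

    def forward(self, x):
        out, _ = self.lstm(x)   # (batch, seq_len, hidden_size)
        return self.head(out)   # (batch, seq_len, 1): the reconstructed sequence
```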

Considering that we changed the Dataset class a bit, we should also change the training loop to fit the data:
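The loop itself barely changes; the key difference is that each sequence serves as both the input and the target:

```python
def fit_lstm(model, epochs, train_dl, valid_dl, criterion, optimizer, scheduler=None):
    train_losses, valid_losses = [], []
    for epoch in range(epochs):
        model.train()
        running = 0.0
        for seq in train_dl:
            optimizer.zero_grad()
            loss = criterion(model(seq), seq)  # reconstruct the sequence itself
            loss.backward()
            optimizer.step()
            if scheduler is not None:
                scheduler.step()
            running += loss.item() * len(seq)
        train_losses.append(running / len(train_dl.dataset))

        model.eval()
        running = 0.0
        with torch.no_grad():
            for seq in valid_dl:
                running += criterion(model(seq), seq).item() * len(seq)
        valid_losses.append(running / len(valid_dl.dataset))
    return train_losses, valid_losses
```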

In the LSTM case we also use:

  • Adam optimizer
  • Learning rate scheduler with One Cycle policy
  • Mean Squared Error loss criterion

One more time we face the most desired line of code, along with its result:
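As before, the exact hyperparameters are not reproduced here; a sketch of the setup and the call could be:

```python
lstm_train_dl = DataLoader(SequenceDataset(train_df["cpu"].values),
                           batch_size=8, shuffle=True)
lstm_valid_dl = DataLoader(SequenceDataset(valid_df["cpu"].values), batch_size=8)

lstm = LSTMModel()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(lstm.parameters(), lr=MAX_LR)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=MAX_LR, steps_per_epoch=len(lstm_train_dl), epochs=EPOCHS)

train_losses, valid_losses = fit_lstm(
    lstm, EPOCHS, lstm_train_dl, lstm_valid_dl, criterion, optimizer, scheduler)
```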

Loss vs Epoch graph

Because of the training with partial sequences, we can’t directly feed the Dataset instance to the model to get the results, but we certainly can extract the entire sequences from the DataFrames:
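For instance (a batch of one whole sequence with one feature per time step):

```python
train_seq = torch.tensor(train_df["cpu"].values, dtype=torch.float32).reshape(1, -1, 1)
valid_seq = torch.tensor(valid_df["cpu"].values, dtype=torch.float32).reshape(1, -1, 1)
```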

And then we calculate the predictions:
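And store the reconstructions next to the original values (the column name lstm is a placeholder):

```python
lstm.eval()
with torch.no_grad():
    train_df["lstm"] = lstm(train_seq).squeeze().numpy()
    valid_df["lstm"] = lstm(valid_seq).squeeze().numpy()
```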

LSTM’s predictions on training data
LSTM’s predictions on validation data

Another intermediate conclusion

Terrific, we have 3 trained models and their results on our dataset!

The toughest part is behind us and now we are prepared to make our final steps towards anomaly detection. In the last chapter, we will do some extra preparations and reveal the detection process along with its results.

If you want to refresh some theory or data preprocessing, don’t be shy and go to the first part:

Otherwise, the third part awaits:

About us

We at Akvelon Inc. love cutting-edge technologies in mobile development, blockchain, big data, machine learning, artificial intelligence, computer vision, and many others. This article is the result of one of many projects developed in our Office Strategy Labs, where we test new technologies and approaches before delivering them to our clients.

Akvelon company official logo

If you would like to work with our strong Akvelon team — please see our open positions.

Designed and implemented with love in Akvelon:
Team Lead — Artur Khanin
Delivery Manager — Sergei Volynkin
Technical Account Manager — Max Kostin
ML Engineers — Irina Nikolaeva, Rustem Saitgareev



