An End-to-End Unsupervised Anomaly Detection

Time-series outlier detection using Facebook’s Prophet for Batch Processing

Published in Teknasyon Engineering, Mar 3, 2021

An application focused on your security, **Getcontact**

Launched in 2016, Getcontact has established itself as one of the leading mobile identity and fraud prevention services globally, with more than 50 million monthly active users and a growing presence across four continents.

Such rapid growth of the Getcontact application means that many basic KPIs, such as registered users and verified users, must be kept under control on the technology side. Reviewing KPIs daily and supporting that review with machine learning algorithms helps us plan growth strategies more accurately. Closely monitoring changes in our users’ behavior helps us prevent and balance system overload in critical cases: if we detect the right anomaly points in KPI counts, the load can be distributed wisely.

Goal
Keeping important KPIs under control as the number of users and the traffic increase day by day.

Anomaly Detection

Anomaly detection, also known as outlier detection, is a technique for identifying unusual observations, such as sudden spikes or dips, in a given data set. Although there are many solutions under this technique, solving an anomaly problem correctly requires first examining the raw data structure in detail. Once that is done, the classic data science interview question can be asked to define the problem: is it a supervised learning problem or an unsupervised one?

The difference between Supervised Learning and Unsupervised Learning

Today, anomaly problems increasingly have to be solved on data that has no labels. When I started the anomaly detection project for Getcontact, I extracted sample data so that I could design the project end-to-end. Analyzing the raw data from the test database let me describe the problem in detail, and I wrote the Python code that prepares the model data using the insights from that sample. The raw model data was then created quickly by aggregating unlabelled KPI data from Getcontact’s production database per country. In this article, I will explain the approach to detecting anomalies in time-series-based raw data, and I will do a simple outlier detection using Facebook’s Prophet on unlabelled time-series data. The unlabelled daily data used in this article were randomly generated.

Sample Aggregated Data for Anomaly Detection Algorithm
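
A minimal sketch of how randomly generated daily KPI data of this shape could be produced. The long-format column names (`date`, `country`, `metric_type`, `value`), the country codes, and the Poisson noise around a random baseline are assumptions made for illustration; they are not the actual Getcontact schema.

```python
import numpy as np
import pandas as pd


def make_sample_kpi_data(days=90, seed=42):
    """Generate random daily KPI counts per country and metric (illustrative only)."""
    rng = np.random.default_rng(seed)
    dates = pd.date_range("2021-01-01", periods=days, freq="D")
    countries = ["TR", "DE", "IN"]                      # hypothetical country codes
    metrics = ["registered_users", "verified_users"]    # KPI names mentioned in the text
    frames = []
    for country in countries:
        for metric in metrics:
            # One random baseline level per series, with Poisson noise around it
            baseline = rng.integers(500, 5000)
            frames.append(pd.DataFrame({
                "date": dates,
                "country": country,
                "metric_type": metric,
                "value": rng.poisson(baseline, size=days),
            }))
    return pd.concat(frames, ignore_index=True)


sample_df = make_sample_kpi_data()
print(sample_df.head())
```

Keeping the data in long format (one row per date, country, and KPI) makes it straightforward to filter a single series for modelling or to pivot KPIs into columns later.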

Problem Definition
How can we understand and track sudden increases and decreases in KPI values per country?

Python Outlier Detection Toolbox (PyOD)

Before Facebook Prophet, I want to talk about the PyOD package. PyOD is a comprehensive and scalable Python toolkit for detecting outlying objects in multivariate data, and it includes more than 30 detection algorithms. Among these, I tried Isolation Forest, which can model multivariate data. PyOD’s IForest is a wrapper around scikit-learn’s Isolation Forest with additional functionality; the algorithm can also be used directly from scikit-learn. I selected a country and pivoted its KPI data so that each KPI became a column. This way, a single anomaly label is assigned to each date across all KPIs of the selected country, as you can see in the image below.

First, I detected anomalies by running a separate Isolation Forest model for each metric, because evaluating KPIs independently catches anomaly points better. However, the results for some countries with insufficient data were not good enough. This can be addressed by fine-tuning the model’s parameters; in particular, the contamination parameter needs a different value for each KPI type and country. A project designed this way would not be easy to maintain, so I decided to use the FBProphet package instead.
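
To make the per-metric approach concrete, here is a minimal sketch that runs scikit-learn’s `IsolationForest` on a single KPI series. The synthetic series, the helper name `flag_anomalies_iforest`, and the contamination value of 0.05 are assumptions for illustration, not the settings used in the project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest


def flag_anomalies_iforest(values, contamination=0.05, seed=0):
    """Return 1 for inliers and -1 for outliers, following scikit-learn's convention."""
    X = np.asarray(values, dtype=float).reshape(-1, 1)  # one KPI -> one feature column
    model = IsolationForest(contamination=contamination, random_state=seed)
    return model.fit_predict(X)


# Synthetic series: stable around 100 with one injected spike at position 50.
rng = np.random.default_rng(1)
series = rng.normal(loc=100, scale=5, size=100)
series[50] = 500

labels = flag_anomalies_iforest(series, contamination=0.05)
result = pd.DataFrame({"value": series, "label": labels})
print(result[result["label"] == -1])
```

Note that contamination sets the share of training points that get flagged, so roughly 5% of the days are marked here even though only one spike was injected. That is why the value has to be tuned separately for each KPI type and country.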

Solution
Facebook Prophet, a brilliant time-series analysis package.

Facebook Prophet

FB Prophet is open-source software released by Facebook. It is easy to install and use, and it is mainly used for time series forecasting. Because this powerful library handles much of the fine-tuning internally, it also makes finding outliers easier. FB Prophet is robust to outliers, missing data, and dramatic changes in a time series. Along with each prediction (yhat), it provides an uncertainty interval (yhat_lower, yhat_upper). We mark the points that fall outside this interval as anomalies. In this way, we can learn whether the previous day was an outlier by running a few lines of code.

Sample data for a specific KPI type in one country are as follows. Before applying Prophet to the data, the columns need to be renamed to ‘ds’ (date variable) and ‘y’ (target variable).

The first code snippet shows the fit_predict_model method, which fits the model to the data. The second shows detect_outliers, which determines the anomaly points. Because these are sample code pieces, the default parameter values are used. A more robust model can be built by changing the parameters of Prophet() (yearly_seasonality, daily_seasonality, seasonality_prior_scale, etc.).
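
The renaming step could look like the sketch below. The input column names (`date`, `country`, `metric_type`, `value`), the sample values, and the helper `to_prophet_frame` are hypothetical; only the target names ‘ds’ and ‘y’ are required by Prophet.

```python
import pandas as pd

# Hypothetical long-format KPI data; column names and values are assumptions.
kpi_df = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-02", "2021-01-01", "2021-01-02"],
    "country": ["TR", "TR", "DE", "DE"],
    "metric_type": ["registered_users"] * 4,
    "value": [1200, 1350, 800, 790],
})


def to_prophet_frame(df, country, metric_type):
    """Select one country/KPI series and rename it to Prophet's 'ds'/'y' columns."""
    mask = (df["country"] == country) & (df["metric_type"] == metric_type)
    out = df.loc[mask, ["date", "value"]].rename(columns={"date": "ds", "value": "y"})
    out["ds"] = pd.to_datetime(out["ds"])   # Prophet expects a date-like 'ds' column
    return out.sort_values("ds").reset_index(drop=True)


model_df = to_prophet_frame(kpi_df, "TR", "registered_users")
print(model_df)
```

Sorting by ‘ds’ and resetting the index keeps the rows aligned with the forecast that Prophet returns for the same dates.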

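import logging

from fbprophet import Prophet  # the package is published as `prophet` from release 1.0 onward

logger = logging.getLogger(__name__)


def fit_predict_model(df, metric_type, country):
    """
    Fit Prophet on one KPI series and predict over the same dates.

    :param df: dataframe with columns 'ds' (date) and 'y' (KPI value), sorted by 'ds'
    :param metric_type: string
    :param country: string
    :rtype: dataframe, or None if the model could not be fitted
    """
    try:
        # define the model
        m = Prophet()
        # fit the model
        m = m.fit(df)
        # predict on the training dates to get yhat and its uncertainty interval
        forecast = m.predict(df)
        forecast['actual'] = df['y'].reset_index(drop=True)
        # displaying Prophet plot
        fig1 = m.plot(forecast)
        return forecast
    except Exception as e:
        logger.error(e.args)
        mess = "Model could not be fitted for {0}_{1}! There can be improper model data.".format(metric_type, country)
        logger.error(mess)
        # send_slack_notif_for_fails is a project helper defined elsewhere
        send_slack_notif_for_fails(mess)
        return None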
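
As a hedged illustration of tuning the model instead of relying on the defaults, the sketch below collects a few Prophet constructor parameters in a dictionary. The values are illustrative assumptions, not the ones used in production, and `PROPHET_PARAMS` and `build_model` are hypothetical names. For anomaly detection, `interval_width` is the parameter that matters most, because it controls how wide the yhat_lower / yhat_upper band is.

```python
# Illustrative values only; they should be tuned per KPI and country.
PROPHET_PARAMS = {
    "interval_width": 0.95,          # width of the uncertainty band (Prophet's default is 0.80)
    "yearly_seasonality": False,
    "weekly_seasonality": True,
    "daily_seasonality": False,
    "seasonality_prior_scale": 10.0,
    "changepoint_prior_scale": 0.05,
}


def build_model(params=None):
    """Create a Prophet model from a parameter dictionary."""
    # Imported lazily so the configuration can be inspected without Prophet installed.
    try:
        from prophet import Prophet      # package name since release 1.0
    except ImportError:
        from fbprophet import Prophet    # package name used in this article
    return Prophet(**(params or PROPHET_PARAMS))
```

A wider interval flags fewer points as anomalies, and a narrower one flags more, so this single value plays a role similar to the contamination parameter of Isolation Forest while remaining easier to reason about.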

Detecting Anomalies

The light blue boundaries in the above graph are yhat_upper and yhat_lower.
If the y value is greater than yhat_upper or less than yhat_lower, it is an anomaly.

def detect_outliers(forecast):
    """
    :param forecast: dataframe
    :rtype: dataframe
    """
    forecasted = forecast[['ds', 'trend', 'yhat', 'yhat_lower', 'yhat_upper', 'actual']].copy()

    logger.info("Creating anomaly status...")
    forecasted['anomaly'] = 0
    forecasted.loc[forecasted['actual'] > forecasted['yhat_upper'], 'anomaly'] = 1
    forecasted.loc[forecasted['actual'] < forecasted['yhat_lower'], 'anomaly'] = -1

    # Anomaly score or importance can be created by using below formulas.
    # high = (actual - yhat_upper) / actual
    # low = (yhat_lower - actual) / actual

    return forecasted
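
The score formulas left as comments in detect_outliers can be turned into a column. The following is a minimal sketch: the helper name `add_anomaly_score` and the demo values are invented for illustration, and the frame is hand-made in the shape that detect_outliers returns.

```python
import numpy as np
import pandas as pd


def add_anomaly_score(forecasted):
    """Add a relative 'score' column: how far 'actual' lies outside the interval."""
    out = forecasted.copy()
    actual = out["actual"].astype(float)
    high = (actual - out["yhat_upper"]) / actual   # used when anomaly == 1
    low = (out["yhat_lower"] - actual) / actual    # used when anomaly == -1
    out["score"] = np.select(
        [out["anomaly"] == 1, out["anomaly"] == -1],
        [high, low],
        default=0.0,                               # points inside the interval
    )
    return out


# Hand-made frame in the shape returned by detect_outliers (values are invented).
demo = pd.DataFrame({
    "ds": pd.date_range("2021-01-01", periods=3, freq="D"),
    "yhat_lower": [90.0, 90.0, 90.0],
    "yhat_upper": [110.0, 110.0, 110.0],
    "actual": [100.0, 200.0, 60.0],
    "anomaly": [0, 1, -1],
})
scored = add_anomaly_score(demo)
print(scored[["ds", "actual", "anomaly", "score"]])
```

Because both formulas divide by the actual value, a day with an actual count of zero would need separate handling. The last row of the frame corresponds to the most recent day, which is the one to check in a daily batch run.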

Project Structure and Data Visualization

After detecting the anomaly points, one last important step remains in any data science project: visualization. We should always build a meaningful story from our project results with the help of an up-to-date BI tool. During development I used Plotly, a popular package in recent years, instead of an external tool; for production I chose Grafana. I also created an alert mechanism with Slack for anomalies detected in production. Those who prefer a different setup can create the alert mechanism within Grafana instead. The diagram below shows the structure of the project. For my fellow data enthusiasts, I wanted to illustrate, using the development part as an example, how an end-to-end data science project should be presented to the business.
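
A Slack alert for a detected anomaly could be sketched as follows, assuming a Slack incoming webhook is available. The function names, the message wording, and the webhook URL argument are all hypothetical; this is not the project’s actual alerting code.

```python
import json
import urllib.request


def build_alert_text(country, metric_type, ds, actual, yhat_lower, yhat_upper):
    """Format a one-line alert describing a single anomaly point."""
    direction = "above" if actual > yhat_upper else "below"
    return (
        f"Anomaly in {metric_type} for {country} on {ds}: "
        f"actual={actual:.0f} is {direction} the expected range "
        f"[{yhat_lower:.0f}, {yhat_upper:.0f}]"
    )


def send_slack_alert(webhook_url, text):
    """POST the alert to a Slack incoming webhook (webhook_url is a placeholder)."""
    body = json.dumps({"text": text}).encode("utf-8")
    request = urllib.request.Request(
        webhook_url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Only the message is built here; sending requires a real webhook URL.
message = build_alert_text("TR", "registered_users", "2021-03-02", 2000, 900, 1100)
print(message)
```

Separating message formatting from sending keeps the formatting testable without network access and makes it easy to swap Slack for Grafana alerts later.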

Deployment and Monitoring

The deployment stage is a key part of any machine learning project, and it should be automated as much as possible. There are many tools that can facilitate this phase. We used Bitbucket to store, build, test, and deploy our anomaly detection project. I want to thank Beytullah Gürpınar, who managed the deployment process. The details of our deployment stage will be the topic of another article; here I will only talk briefly about the Docker part. A sample project structure is below.

workspace
├── project
│   ├── anomaly-detection (repository)
│       ├── .gitignore
│       ├── anomaly_detector.py
│       ├── bitbucket-pipelines.yml
│       ├── build_docker_image.sh
│       ├── Dockerfile
│       ├── ml_anomaly_detector_docker_image_version
│       ├── push_docker_image_to_gcr.sh
│       ├── README.md
│       ├── requirements.txt
│       ├── settings.yaml
│       ├── settings_dev.yaml

We used Docker both for ML development and for deploying our ML application. At first, I used the Dockerfile below for the project.

FROM python:3.7

COPY . /app

RUN pip install Cython numpy pandas
RUN pip install pystan
RUN pip install -r /app/requirements.txt
RUN chmod 755 /app/anomaly_detector.py

ENTRYPOINT ["python"]
CMD ["/app/anomaly_detector.py"]

The ‘requirements.txt’ file contains the following packages.

requests
SQLAlchemy
mysql-connector-python
mysqlclient
fbprophet
pyyaml

However, during deployment we encountered some compilation problems when installing Prophet with pip in Docker. We overcame them by using conda.

FROM continuumio/miniconda3

COPY . /app

RUN pip3 install mysql-connector-python
RUN conda install -c conda-forge pystan
RUN conda install -c conda-forge fbprophet
RUN conda install -c anaconda cython
RUN conda install -c anaconda numpy
RUN conda install -c anaconda PyMySQL
RUN conda install -c anaconda pandas
RUN conda install -c anaconda requests
RUN conda install -c anaconda sqlalchemy
RUN conda install -c bioconda mysqlclient
RUN conda install -c anaconda pyyaml
RUN chmod 755 /app/anomaly_detector.py

ENTRYPOINT ["python3"]
CMD ["/app/anomaly_detector.py"]

Instead of listing the packages one by one, they can be installed with a single command or from a yml file.

Conclusion

Finding Outliers for Some Countries in KPI Data

An anomaly detection project was carried out to monitor essential KPIs in selected countries. With it, we can learn which KPIs raised alarms in which countries by making use of machine learning algorithms. Several useful Python libraries were applied, and an end-to-end machine learning project was described in detail, covering both the dev and prod environments.

Stay safe. Stay with Getcontact!