Problem
Our client, an SME (Small and Medium-sized Enterprise), was highly regarded for solutions that validate user-supplied data, ensuring its accuracy and reliability. However, as the scope and complexity of their projects increased, they became interested in the potential for anomaly detection. Their specific focus was to identify cases where data appeared to “drift” away from expected norms, which could suggest errors, unexpected changes, or even more serious concerns such as security threats.
Anomaly detection is a field within machine learning that aims to find anomalous data points in a dataset. These data points deviate significantly from the majority and may indicate errors, fraud, or other unexpected events. The challenge here was twofold: detecting anomalies in real time, and coping with the unbalanced nature of anomaly data. In such datasets, the “normal” data vastly outnumbers the anomalies, making them harder to detect.
The client wanted to integrate a more advanced, unsupervised machine learning approach. This method would allow them to identify anomalies without requiring large amounts of labelled data, a common limitation in anomaly detection systems. Instead, the system would need to rely on unsupervised anomaly detection techniques, which could detect patterns and variations that weren’t explicitly programmed. The goal was to help them enhance their existing toolkit to better detect and address these data anomalies.
Solution
TechnoLynx developed a tailored solution to meet the client’s needs. We approached this project by combining several cutting-edge anomaly detection methods. Our solution was an imaging-based anomaly detection system that leveraged modern neural networks and custom-built machine learning models. Specifically, our team extended its usual array of tools, including variational auto-encoders (VAEs) and adversarial auto-encoders (AAEs), by adding diffusion models into the mix for the first time.
Our approach focused on understanding the underlying distribution of the data, which is essential for detecting anomalies. By modelling the normal patterns in the data, we could then identify points that fell outside these norms, known as global outliers or point anomalies. These are the data points that stand out from the rest and are key indicators that something unusual is happening.
To make our solution effective in a real-time environment, we opted for an unsupervised approach. Unsupervised anomaly detection is particularly useful in cases where the data is vast and varied, and where labelled examples of anomalies are either scarce or non-existent. By using unsupervised machine learning algorithms, our model could independently learn what “normal” looked like and detect deviations from this norm.
Our system combined density-based techniques with neural networks. Density-based anomaly detection methods identify points in sparse regions of the dataset, where data is less frequent and hence more likely to be anomalous. This was essential for catching anomalous data points that would otherwise be lost in a sea of normal data. Trained on this data, our model continuously improved its accuracy in identifying anomalies over time.
To enhance the solution further, we used custom-built diffusion models. Diffusion models are a type of generative model used to learn complex distributions, and they allowed us to model the probability of each data point being part of the “normal” distribution. By integrating them into our anomaly detection system, we could achieve more precise identification of anomalies.
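The full machinery of diffusion models is beyond a short example, but the core idea, scoring each point by how probable it is under a model of the normal data, can be sketched in miniature. The sketch below fits a single Gaussian to synthetic normal data and uses its log-density as an anomaly score; the dataset, dimensions, and threshold-free scoring are illustrative, not the client's actual model.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic "normal" data: 1000 points drawn from a 2-D standard normal.
normal = rng.normal(0.0, 1.0, size=(1000, 2))

# Fit a single Gaussian to the normal data (mean and covariance).
mu = normal.mean(axis=0)
cov = np.cov(normal.T)
cov_inv = np.linalg.inv(cov)

def log_density(points):
    """Log-density under the fitted Gaussian, up to an additive constant.

    Lower values mean the point is less probable under the "normal"
    distribution, and hence more likely to be an anomaly.
    """
    diff = points - mu
    # Quadratic form diff @ cov_inv @ diff, computed per row.
    return -0.5 * np.einsum('ij,jk,ik->i', diff, cov_inv, diff)

scores = log_density(np.array([[0.0, 0.0], [6.0, 6.0]]))
print(scores)  # the far-away point gets a much lower log-density
```

A generative model such as a diffusion model plays the same role as the Gaussian here, but can represent far more complex, multi-modal distributions.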
The entire solution was developed in-house by our team, using 100% custom code built primarily in PyTorch. PyTorch is a popular deep learning framework, and it allowed us to build and train our models efficiently. The flexibility of PyTorch also meant we could tailor the solution to the client’s specific requirements, ensuring it was robust enough to handle their unique data processing challenges.
In the field of anomaly detection, a variety of techniques are available to identify irregularities in datasets, each with its strengths depending on the type of data and the presence of labelled examples. Supervised anomaly detection techniques are effective when a dataset contains sufficient labelled instances of both normal and anomalous data.
These models learn the characteristics of normal and abnormal patterns, identifying anomalies with high accuracy. However, this method is often limited by the availability of labelled data, which is rare in many real-world applications. Many datasets, especially those that track continuous processes, are inherently unbalanced: anomalies are far less frequent than normal data points. This makes it challenging for supervised models to capture the full spectrum of anomalies.
On the other hand, semi-supervised anomaly detection techniques offer a more flexible approach. These methods primarily learn from labelled normal data and use that knowledge to detect deviations. By focusing on what is considered normal, semi-supervised models can flag instances that do not conform, even without a large set of labelled anomalies.
This is particularly useful in scenarios where obtaining labelled anomalies is difficult or expensive. Semi-supervised techniques bridge the gap between fully supervised and unsupervised approaches, providing a practical solution when only partial data labelling is available.
Anomaly detection becomes even more complex when working with time series data anomalies. In time series data, the order and timing of events are crucial. Anomalies may occur as sudden spikes, dips, or long-term deviations from expected patterns.
Detecting these anomalies requires models that can understand both the temporal sequence and the patterns in the data. For example, in applications such as financial monitoring or industrial process control, time series anomalies may indicate fraud or equipment failure. Addressing these anomalies with advanced machine learning models, particularly those that can account for the temporal aspects of the data, is essential for maintaining system reliability.
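One simple way to make the temporal aspect concrete is to score each reading against a rolling window of its recent history. The signal, window size, and spike below are all synthetic and illustrative; production systems would use richer models, but the mechanics of windowed scoring are the same.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic sensor reading: a smooth signal with noise and a sudden
# spike injected at t = 150.
t = np.arange(300)
series = np.sin(t / 20) + 0.1 * rng.normal(size=300)
series[150] += 3.0

def rolling_zscore(series, window=30):
    """Score each point by its distance from the rolling mean,
    measured in units of the rolling standard deviation."""
    scores = np.zeros_like(series)
    for i in range(window, len(series)):
        past = series[i - window:i]
        scores[i] = abs(series[i] - past.mean()) / (past.std() + 1e-9)
    return scores

scores = rolling_zscore(series)
print(scores.argmax())  # → 150, the injected spike
```

Because each score only depends on a fixed window of past values, this style of detector also runs naturally over streaming data.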
By leveraging both supervised and semi-supervised anomaly detection methods, organisations can tackle the unique challenges posed by unbalanced datasets and time series data, ensuring that even rare or subtle anomalies are detected in real time. TechnoLynx helps its clients address these challenges by incorporating sophisticated AI-driven solutions tailored to specific needs.
Results
The client was highly satisfied with the outcome. Our solution provided them with a powerful tool that significantly expanded their existing capabilities for anomaly detection. The custom system we developed allowed them to detect data anomalies more effectively, even in situations where there were no labelled examples available.
The client integrated our Proof-of-Concept (PoC) solution into their internal toolkit, and their team has since taken over further development and productisation. The initial system serves as a strong foundation for future advancements, with the client’s in-house team continuing to build upon it. They now have a scalable, high-performance anomaly detection system that can identify anomalies across various datasets, regardless of their type or size.
Unsupervised Anomaly Detection Techniques
Anomaly detection can be broadly classified into supervised and unsupervised methods. In the case of supervised detection, the system requires a labelled dataset that includes examples of both normal and anomalous data points. However, labelled data is not always available, especially in real-world applications, which is where unsupervised methods shine.
In our client’s case, the data they were working with was inherently unbalanced. Most of it represented normal behaviour, with anomalies being rare and scattered. This meant that unsupervised machine learning methods were the ideal choice: they can detect anomalies when we only have examples of normal behaviour, without needing explicit examples of anomalies.
By using techniques like auto-encoders and diffusion models, we were able to build a system that could detect anomalies in real-time, flagging data points that deviated from the expected distribution. These models work by first learning the underlying structure of the data, which allows them to identify outliers that don’t fit this structure. This is particularly useful for applications like intrusion detection systems, where we need to detect unusual activity without prior knowledge of what that activity might look like.
Machine Learning Algorithms for Anomaly Detection
At the heart of our solution were several machine learning algorithms designed specifically for anomaly detection. We used a combination of auto-encoders and diffusion models, which are widely regarded as effective tools for detecting anomalies in large datasets.
Auto-encoders work by compressing the input data into a lower-dimensional representation and then reconstructing it. The reconstruction error—how closely the reconstructed data matches the original—can be used to identify anomalies. If a data point cannot be accurately reconstructed, it is likely to be an anomaly.
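This reconstruction-error idea can be sketched in a few lines of PyTorch. The example below is a deliberately tiny illustration, not the client's model: the synthetic data lies near the line y = x, the network sizes are arbitrary, and a one-dimensional bottleneck forces the model to learn that structure. Points off the line reconstruct poorly and therefore score high.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic "normal" 2-D data: points near the line y = x.
x = torch.rand(512, 1)
normal = torch.cat([x, x + 0.05 * torch.randn_like(x)], dim=1)

# Tiny auto-encoder: 2 -> 8 -> 1 -> 8 -> 2. The 1-D bottleneck
# forces a compressed representation of the normal data.
model = nn.Sequential(
    nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1),   # encoder
    nn.Linear(1, 8), nn.ReLU(), nn.Linear(8, 2),   # decoder
)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

# Train only on normal data, full batch for simplicity.
for _ in range(500):
    opt.zero_grad()
    loss = loss_fn(model(normal), normal)
    loss.backward()
    opt.step()

def anomaly_score(points):
    """Per-point reconstruction error; larger values suggest anomalies."""
    with torch.no_grad():
        return ((model(points) - points) ** 2).mean(dim=1)

inliers = torch.tensor([[0.5, 0.5], [0.2, 0.2]])    # on the line
outliers = torch.tensor([[0.1, 0.9], [0.9, 0.1]])   # far off the line
print(anomaly_score(inliers), anomaly_score(outliers))
```

The same principle scales up directly: variational and adversarial auto-encoders refine how the latent space is learned, but the anomaly score is still driven by how well a point can be reconstructed.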
In addition to auto-encoders, we incorporated diffusion models, which enabled us to model the probability distribution of the data more accurately. Diffusion models are particularly well-suited for handling the unbalanced nature of the data, as they can generate synthetic data points that represent the normal distribution. This helps the model learn the structure of the data more effectively and improves its ability to detect outliers.
By combining these different detection methods, we were able to build a robust system that could detect anomalies in a wide range of scenarios. Whether the anomalies were subtle, such as a slight drift in the data distribution, or more pronounced, like a complete outlier, our system was capable of identifying them.
Density-Based Anomaly Detection
One of the key techniques we used in this project was density-based anomaly detection. This method works by identifying regions of the dataset where the density of data points is lower than expected. In other words, if a data point is located in a sparse region of the dataset, it is more likely to be an anomaly.
Density-based methods are particularly useful for detecting point anomalies and global outliers. Point anomalies are individual data points that deviate significantly from the rest of the data, while global outliers are data points that are anomalous relative to the entire dataset.
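A minimal density-based score can be built from nearest-neighbour distances: points whose nearest neighbours are far away sit in sparse regions. The dataset, neighbour count, and brute-force distance computation below are illustrative; real systems would use an indexed neighbour search for large data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic dataset: a dense cluster of "normal" points plus one
# isolated point in a sparse region.
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
data = np.vstack([normal, [[8.0, 8.0]]])  # index 200 is the isolated point

def knn_density_score(data, k=5):
    """Mean distance to the k nearest neighbours.

    Large values mean the point sits in a sparse region of the
    dataset and is therefore more likely to be an anomaly.
    """
    # Pairwise Euclidean distances (brute force; fine for small data).
    d = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # exclude each point's self-distance
    nearest = np.sort(d, axis=1)[:, :k]  # k smallest distances per point
    return nearest.mean(axis=1)

scores = knn_density_score(data)
print(scores.argmax())  # → 200, the isolated point
```

More refined variants, such as local outlier factor methods, compare a point's density to that of its neighbours rather than using raw distances, but the intuition is the same.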
In our client’s case, the density-based approach allowed us to identify anomalies in real-time, even in large datasets. By continuously monitoring the density of the data, the system could flag any data points that fell outside the expected range.
Real-Time Anomaly Detection
Real-time anomaly detection is essential in many applications, such as intrusion detection systems and monitoring systems for critical infrastructure. In these cases, the ability to detect anomalies as they occur can prevent costly or dangerous events from happening.
Our system was designed to operate in real-time, allowing the client to detect anomalies as soon as they occurred. This was achieved by using efficient algorithms and optimising the system for performance. By leveraging the power of neural networks and density-based methods, we were able to build a system that could process large amounts of data quickly and accurately.
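The key constraint in real-time detection is that each reading must be scored as it arrives, using statistics that can be updated incrementally. As a hedged illustration of that pattern (not the client's system), the sketch below keeps a running mean and variance with Welford's algorithm and flags readings that deviate by more than a fixed threshold; the readings and threshold are made up.

```python
import math

class StreamingDetector:
    """Online z-score detector using Welford's running mean/variance.

    Processes one reading at a time with O(1) work per update, so it
    suits real-time pipelines where data cannot be revisited.
    """

    def __init__(self, threshold=4.0):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0          # sum of squared deviations from the mean
        self.threshold = threshold

    def update(self, x):
        """Return True if x looks anomalous, then fold it into the stats."""
        anomalous = False
        if self.n > 1:
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0 and abs(x - self.mean) / std > self.threshold:
                anomalous = True
        # Welford's incremental update of mean and variance.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return anomalous

det = StreamingDetector()
readings = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 9.0, 1.0]
flags = [det.update(r) for r in readings]
print(flags)  # only the 9.0 reading is flagged
```

The neural-network scorers described above slot into the same loop: the model is trained offline on normal data, and only the cheap scoring step runs per reading.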
Conclusion
In today’s world, detecting anomalies in data is crucial for businesses across industries. Whether it’s identifying fraud, detecting errors in time series data, or monitoring systems for signs of intrusion, having a reliable anomaly detection system is essential.
In this case, TechnoLynx provided a comprehensive, custom solution that combined advanced machine learning techniques with real-time processing capabilities. By building a system that could detect anomalies without the need for labelled data, we were able to help our client improve their existing toolkit and enhance their ability to identify anomalies.
Our solution was built using a combination of auto-encoders, diffusion models, and density-based methods, ensuring that it could detect anomalies in a wide range of scenarios. The client was highly satisfied with the outcome and has since integrated our system into their internal processes, where it continues to provide valuable insights and help them maintain the accuracy and integrity of their data.