COPOD: How to Use "Statistics" and "Machine Learning" to Detect Anomalies
1. Background knowledge
Outlier detection, from the simplest point of view, asks how far a value deviates from the mean. Taking one-dimensional data as an example, you can first calculate the mean (μ) and standard deviation (σ); any value more than 2σ or 3σ away from the mean can then be regarded as an anomaly. If the data is assumed to follow a normal distribution, these are roughly the positions marked in yellow in the figure. If we know the distribution of the data (such as its CDF), we can also calculate the probability that a sample lies at the far left or far right of the distribution.
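As a minimal sketch of the mean-and-standard-deviation rule described above (the function name `zscore_outliers`, the threshold `k`, and the synthetic data are illustrative assumptions, not part of the original article):

```python
import numpy as np

def zscore_outliers(x, k=3.0):
    """Flag values that lie more than k standard deviations from the mean."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    return np.abs(x - mu) > k * sigma

# Synthetic example: 1000 standard-normal samples plus one planted outlier.
rng = np.random.default_rng(0)
data = np.append(rng.normal(loc=0.0, scale=1.0, size=1000), 8.0)
flags = zscore_outliers(data, k=3.0)
```

The planted value 8.0 is flagged, along with a handful of ordinary samples that happen to fall beyond 3σ.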
But in reality, there are two problems that make this method ineffective.
First of all, real data is often not one-dimensional but has many dimensions. The simplest assumption is of course that the dimensions are independent of each other. Then we can measure the degree of abnormality in each dimension separately, and either average it over all dimensions or check whether several dimensions show relatively large abnormalities. But this approach has a core limitation: the dimensions are usually not independent and are often related to each other. Assuming independence ignores these relationships and oversimplifies the model.
Second, simply checking whether a point is far from the mean can be misleading, because there are many kinds of distributions and not every one is as well behaved as the normal distribution. A more reasonable approach is to estimate the tail probability of a point, that is, the probability of it lying at an extreme position of the distribution.
Combining these two points: if we can estimate the joint distribution of the multi-dimensional data well, then we can estimate the tail probability of each point and use it to judge how anomalous the point is.
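To make the idea of a tail probability concrete, here is a small sketch that estimates it from the data alone, without assuming any particular distribution. The helper name `tail_probabilities` and the toy data are assumptions made for illustration.

```python
import numpy as np

def tail_probabilities(x):
    """Return empirical left- and right-tail probabilities for each value in x.

    left[i]  is the fraction of samples <= x[i]
    right[i] is the fraction of samples >= x[i]
    """
    x = np.asarray(x, dtype=float)
    sorted_x = np.sort(x)
    n = len(x)
    left = np.searchsorted(sorted_x, x, side="right") / n
    right = (n - np.searchsorted(sorted_x, x, side="left")) / n
    return left, right

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
left, right = tail_probabilities(x)
```

Here the value 100.0 gets a right-tail probability of 0.2, the smallest value possible with five samples, regardless of the shape of the underlying distribution.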
2. COPOD: Copula-Based Outlier Detection
Under this premise, we proposed an anomaly detection method based on copulas. A copula is a statistical function used to model a multi-dimensional cumulative distribution, and it can effectively model the dependency between multiple random variables (RVs). The paper was published at the 2020 IEEE International Conference on Data Mining (ICDM'20); interested readers can read the short arXiv version.
COPOD uses a non-parametric method to obtain the empirical copula through the empirical CDF. After that, we can simply use the empirical copula to estimate the tail probability of the joint distribution across all dimensions.
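The first step, mapping each column through its empirical CDF, can be sketched as follows. This is an illustration under stated assumptions rather than the reference implementation: `column_ecdf` and the synthetic two-column data are hypothetical names and values chosen for the example.

```python
import numpy as np

def column_ecdf(X):
    """Column-wise empirical CDF.

    Entry (i, j) is the fraction of rows whose j-th value is <= X[i, j].
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    U = np.empty_like(X)
    for j in range(X.shape[1]):
        sorted_col = np.sort(X[:, j])
        U[:, j] = np.searchsorted(sorted_col, X[:, j], side="right") / n
    return U

# Two dependent columns with very different marginal shapes.
rng = np.random.default_rng(42)
z = rng.normal(size=500)
X = np.column_stack([np.exp(z), z + 0.1 * rng.normal(size=500)])
U = column_ecdf(X)
```

After the transform every column takes values in (0, 1] and is roughly uniform, while the dependence between the columns is still visible in `U`. These transformed values are what the empirical copula is built from.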
3. Tail Probability and Correction
Intuitively, we should care about both possibilities at once: that a sample falls in the left tail and that it falls in the right tail. The actual situation, however, can be more complicated.
Take the above figure as an example. Anomalies may appear on the left side of the distribution, on the right side, or on both sides. Depending on the case, using a different tail probability gives a different result. To handle this, we calculate the skewness of the distribution, in other words, whether the distribution leans to the left or to the right. If the bulk of the data leans to the left, we care more about the right tail, and vice versa.
Concretely, the COPOD algorithm we designed does the following:
- Lines 1-4: compute the empirical CDFs to get the left-tail and right-tail probabilities, and the skewness (to decide how to correct)
- Lines 6-15: use the copula to get the left and right tail probabilities, and output the most suitable tail probability for the specific situation
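The two steps above can be sketched in a few lines of NumPy. This is a simplified illustration, not the published code: the function names are hypothetical, and the sign convention (negative skewness selects the left tail, otherwise the right tail) and the final "take the largest of the three sums" rule are assumptions based on the description of the algorithm.

```python
import numpy as np

def ecdf(X):
    """Column-wise empirical CDF values in (0, 1]."""
    n = X.shape[0]
    ranks = np.empty_like(X)
    for j in range(X.shape[1]):
        ranks[:, j] = np.searchsorted(np.sort(X[:, j]), X[:, j], side="right")
    return ranks / n

def sample_skewness(X):
    """Column-wise sample skewness (third standardized moment)."""
    centered = X - X.mean(axis=0)
    return (centered ** 3).mean(axis=0) / (centered ** 2).mean(axis=0) ** 1.5

def copod_like_scores(X):
    """Outlier scores from empirical tail probabilities (higher = more anomalous)."""
    X = np.asarray(X, dtype=float)
    neg_log_left = -np.log(ecdf(X))    # large when a value sits in the left tail
    neg_log_right = -np.log(ecdf(-X))  # large when a value sits in the right tail
    skew = sample_skewness(X)
    # Assumed skewness correction: negative skew -> left tail, else right tail.
    neg_log_skew = np.where(skew < 0, neg_log_left, neg_log_right)
    # Summing -log over dimensions equals -log of the product of tail probabilities.
    candidates = np.stack([
        neg_log_left.sum(axis=1),
        neg_log_right.sum(axis=1),
        neg_log_skew.sum(axis=1),
    ])
    return candidates.max(axis=0)

# Synthetic check: 200 normal rows plus one row that is extreme in every dimension.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 3)), [8.0, -8.0, 8.0]])
scores = copod_like_scores(X)
```

The planted row is extreme on the right in two dimensions and on the left in one, so neither the pure left-tail nor the pure right-tail sum captures it fully; the skewness-corrected sum does, and the row receives the highest score.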
4. Advantages of COPOD
First of all, unlike most anomaly detection algorithms, COPOD does not need to calculate distances between samples, so it is cheap to run and fast. Second, there are no parameters to tune; you can just call it directly. Third, it is effective: compared with 9 other mainstream algorithms (such as Isolation Forest and LOF) on more than 30 datasets, it achieves the highest overall ranking.
In addition, COPOD offers some interpretability about which dimensions are responsible for an anomaly. For example, we can directly find the dimensions that contribute most to an anomaly and analyze them in depth.
5. Usage and Further Reading
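One simple way to picture this kind of per-dimension explanation is to keep the tail-probability contributions separate instead of summing them. The sketch below is an assumption-laden simplification: `dimension_contributions` is a hypothetical helper, and it uses the smaller of the two empirical tail probabilities in each dimension rather than the skewness-corrected choice.

```python
import numpy as np

def dimension_contributions(X):
    """Per-sample, per-dimension -log of the smaller empirical tail probability."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    contrib = np.empty((n, d))
    for j in range(d):
        sorted_col = np.sort(X[:, j])
        left = np.searchsorted(sorted_col, X[:, j], side="right") / n        # share of values <= x
        right = (n - np.searchsorted(sorted_col, X[:, j], side="left")) / n  # share of values >= x
        contrib[:, j] = -np.log(np.minimum(left, right))
    return contrib

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
X[0] = 0.0       # make sample 0 unremarkable in every dimension...
X[0, 2] = 10.0   # ...except for a planted anomaly in dimension 2
contrib = dimension_contributions(X)
worst_dim = int(np.argmax(contrib[0]))
```

For sample 0, `worst_dim` points at dimension 2, which is where the anomaly was planted.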
Like our previous work, we have also fully open-sourced the COPOD project and integrated it into PyOD, so that you can use a few lines of code for testing.
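# Train the COPOD detector.
# X_train and X_test are assumed to be numeric arrays of shape (n_samples, n_features).
from pyod.models.copod import COPOD
clf = COPOD()
clf.fit(X_train)
# Get outlier scores (higher means more anomalous)
y_train_scores = clf.decision_scores_  # raw outlier scores on the training data
y_test_scores = clf.decision_function(X_test)  # outlier scores on unseen data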
Original project code: winstonll/COPOD
PyOD integration: https://github.com/yzhao062/pyod/blob/master/examples/copod_example.py
arXiv short version: COPOD: Copula-Based Outlier Detection
Citation: Li, Z., Zhao, Y., Botta, N., Ionescu, C. and Hu, X. COPOD: Copula-Based Outlier Detection. IEEE International Conference on Data Mining (ICDM), 2020.
@inproceedings{li2020copod,
  title={{COPOD:} Copula-Based Outlier Detection},
  author={Li, Zheng and Zhao, Yue and Botta, Nicola and Ionescu, Cezar and Hu, Xiyang},
  booktitle={IEEE International Conference on Data Mining (ICDM)},
  year={2020},
  organization={IEEE}
}
It is worth mentioning that because we use the simplest empirical CDF to estimate the empirical copula, the dependency between dimensions is not fully modeled. This is a trade-off between speed and accuracy. Interested readers can try modeling with more complex copula functions.