Problem Statement of the Anomaly Detection

Dataset: $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$. Is $x_{test}$ anomalous?

B) Anomaly Detection Example

Fraud Detection
- $x^{(i)}$ is the feature vector of user $i$'s activities
- Model $p (x)$ from data: flag abnormal activity by checking for cases where $p (x) < ε$
Monitoring computers in a data center
- $x^{(i)}$ is the feature vector of machine $i$
- $x_{1}$ : memory usage, $x_{2}$ : number of disk accesses per second, …

C) Parameter Estimation

Given a dataset $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$ with $x^{(i)} \in \mathbb{R}$, estimate the parameters ( $μ$ , $σ^{2}$ ) of the [Gaussian distribution](Gaussian distribution) the data is assumed to follow.

$$μ = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$$

$$σ^{2} = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)} - μ)^{2}$$

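When the dataset is small, $m - 1$ is sometimes used in place of $m$ in the $\frac{1}{m}$ factor of the variance. For reasonably large $m$ the two estimates are almost identical.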
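The two estimators can be compared with a minimal sketch. The function name `estimate_gaussian`, the `unbiased` flag, and the sample values are illustrative assumptions, not part of the original notes.

```python
def estimate_gaussian(xs, unbiased=False):
    """Return (mu, sigma2) for a list of real-valued samples.

    unbiased=False divides by m (the formula above);
    unbiased=True divides by m - 1 instead.
    """
    m = len(xs)
    mu = sum(xs) / m
    denom = m - 1 if unbiased else m
    sigma2 = sum((x - mu) ** 2 for x in xs) / denom
    return mu, sigma2


# Hypothetical one-dimensional samples.
samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mu, sigma2 = estimate_gaussian(samples)                   # 1/m version
_, sigma2_unbiased = estimate_gaussian(samples, unbiased=True)  # 1/(m-1) version

print(mu, sigma2, sigma2_unbiased)
```

With only eight samples the two variance estimates differ noticeably; the gap shrinks as $m$ grows.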

D) Density Estimation

Given training data $\{x^{(1)}, \dots, x^{(m)}\}$ (each example is $x \in \mathbb{R}^{n}$ ), $p (x)$ is estimated as the product of one Gaussian density per feature:

$$p (x) = \prod_{j=1}^{n} p (x_{j}; μ_{j}, σ_{j}^{2})$$

$p (x) = p (x_{1}; μ_{1}, σ_{1}^{2}) p (x_{2}; μ_{2}, σ_{2}^{2}) p (x_{3}; μ_{3}, σ_{3}^{2}) \dots p (x_{n}; μ_{n}, σ_{n}^{2})$

E) Anomaly Detection Algorithm

1. Choose features $x_{j}$ that you expect to be indicative of anomalous examples.
2. Fit (compute) the parameters $μ_{1}, \dots, μ_{n}, σ_{1}^{2}, \dots, σ_{n}^{2}$ .
  - $μ_{j} = \frac{1}{m} \sum_{i=1}^{m} x_{j}^{(i)}$
  - $σ_{j}^{2} = \frac{1}{m} \sum_{i=1}^{m} (x_{j}^{(i)} - μ_{j})^{2}$
3. Given a new example $x$ , compute $p (x)$ .
  - $p (x) = \prod_{j=1}^{n} p (x_{j}; μ_{j}, σ_{j}^{2}) = \prod_{j=1}^{n} \frac{1}{\sqrt{2 π} σ_{j}} \exp \left(- \frac{(x_{j} - μ_{j})^{2}}{2 σ_{j}^{2}}\right)$
4. If $p (x) < ε$ , flag $x$ as an anomaly.
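The steps above can be sketched in plain Python. The function names, the toy training data, and the threshold `epsilon = 1e-3` are assumptions made for illustration only.

```python
import math


def fit(X):
    """Estimate per-feature mean and variance from rows of X (m examples, n features)."""
    m, n = len(X), len(X[0])
    mu = [sum(row[j] for row in X) / m for j in range(n)]
    sigma2 = [sum((row[j] - mu[j]) ** 2 for row in X) / m for j in range(n)]
    return mu, sigma2


def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density p(x; mu, sigma^2)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)


def density(x, mu, sigma2):
    """p(x) as the product of the per-feature densities."""
    p = 1.0
    for xj, mj, sj in zip(x, mu, sigma2):
        p *= gaussian_pdf(xj, mj, sj)
    return p


def is_anomaly(x, mu, sigma2, epsilon):
    """Flag x as anomalous when its density falls below the threshold."""
    return density(x, mu, sigma2) < epsilon


# Hypothetical normal training examples with two features.
X_train = [[1.0, 10.0], [2.0, 12.0], [3.0, 11.0], [2.0, 9.0], [2.0, 8.0]]
mu, sigma2 = fit(X_train)
epsilon = 1e-3  # assumed threshold; in practice it is tuned on labeled data

print(is_anomaly([2.0, 10.0], mu, sigma2, epsilon))  # False: close to the training data
print(is_anomaly([8.0, 30.0], mu, sigma2, epsilon))  # True: far from the training data
```

A feature with zero variance in the training set would cause a division by zero here; a real implementation would need to guard against that.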

F) Steps for Building an Anomaly Detection System

1. First, fit $p (x)$ using normal (not anomalous) training examples.
2. Evaluate the system on a labeled cross-validation set and keep only the models that perform well.
3. Finally, evaluate on the labeled test set:

$$(x_{test}^{(1)}, y_{test}^{(1)}), \dots, (x_{test}^{(m_{test})}, y_{test}^{(m_{test})})$$

G) Evaluating an Anomaly Detection System

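A metric such as accuracy is a poor choice. Anomalous examples make up a very small fraction of the cross-validation and test sets compared with normal ones, so accuracy does not reflect real performance: a model that always predicts "normal" would still score a high accuracy.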

Recommended evaluation metrics: precision, recall, F1 score
The threshold $ε$ can also be chosen through this evaluation: pick the $ε$ used by the system that performs best on the cross-validation set.
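As a minimal sketch of this evaluation, the snippet below scores a few candidate thresholds by F1 and keeps the best one. The density values `p_cv`, the labels `y_cv` (1 = anomalous, 0 = normal), and the candidate list are made-up illustration data, and `select_epsilon` is a hypothetical helper name.

```python
def f1_score(y_true, y_pred):
    """F1 score where label 1 (anomalous) is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def select_epsilon(p_cv, y_cv, candidates):
    """Return the candidate threshold with the highest F1 on the CV set."""
    best_eps, best_f1 = None, -1.0
    for eps in candidates:
        y_pred = [1 if p < eps else 0 for p in p_cv]
        score = f1_score(y_cv, y_pred)
        if score > best_f1:
            best_eps, best_f1 = eps, score
    return best_eps, best_f1


# Hypothetical densities p(x) on a labeled cross-validation set.
p_cv = [0.20, 0.15, 0.18, 0.002, 0.11, 0.0005, 0.03, 0.25]
y_cv = [0, 0, 0, 1, 0, 1, 0, 0]

# A detector that never flags anything still gets decent accuracy but zero F1.
always_normal = [0] * len(y_cv)
print(accuracy(y_cv, always_normal), f1_score(y_cv, always_normal))

candidates = [1e-4, 1e-3, 1e-2, 5e-2, 1e-1]
best_eps, best_f1 = select_epsilon(p_cv, y_cv, candidates)
print(best_eps, best_f1)
```

On real data the candidate thresholds would usually be spread over the range of observed $p (x)$ values rather than fixed by hand.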
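anomaly detection vs supervised learning
- When anomaly detection is a good fit
  - There are very few positive examples and a large number of negative examples
    - Here positive ( $y = 1$ ) means anomalous and negative ( $y = 0$ ) means normal
- When supervised learning is a good fit
  - Both positive and negative examples are plentiful
- There are many different types of anomalies.
  - The anomalies seen in the training data may look quite different from anomalies that appear later, so a model trained on the few known anomalies may not recognize new ones.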
Making non-Gaussian data look more Gaussian
- Apply a transformation such as log or sqrt to the feature.
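A small sketch of the effect: the right-skewed sample below becomes noticeably more symmetric after a log or square-root transform. The data values and the use of sample skewness as a rough symmetry check are assumptions chosen for illustration.

```python
import math


def skewness(xs):
    """Sample skewness: third central moment divided by variance^(3/2)."""
    m = len(xs)
    mu = sum(xs) / m
    var = sum((x - mu) ** 2 for x in xs) / m
    third = sum((x - mu) ** 3 for x in xs) / m
    return third / var ** 1.5


# Hypothetical right-skewed, strictly positive feature values.
raw = [1.0, 2.0, 2.0, 3.0, 4.0, 6.0, 10.0, 25.0, 80.0]

logged = [math.log(x) for x in raw]   # log transform
rooted = [math.sqrt(x) for x in raw]  # square-root transform

print(skewness(raw), skewness(rooted), skewness(logged))
```

Skewness closer to 0 indicates a more symmetric shape; plotting a histogram before and after the transform is a common way to check the same thing visually.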

Error analysis for anomaly detection
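- When normal examples are plentiful and anomalous examples are few, a model may assign a high $p (x)$ to an example that is actually anomalous and treat it as normal.
- In that case, add new features that separate the missed anomalies from normal examples to improve detection.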
Multivariate Gaussian distribution
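- Building a separate Gaussian model for each dimension can miss anomalies that only show up in the relationship between features.

In the figure above, the example marked with an x in the upper left is an anomaly in two dimensions, yet it is classified as a normal example when $x_{1}$ and $x_{2}$ are each considered on their own.
- The multivariate Gaussian (normal) distribution handles all dimensions jointly in a single model.

The values of the estimated $μ$ and $Σ$ determine the shape of the distribution, as shown in the figure below.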

anomaly detection using the Multivariate Gaussian distribution
1. Fit the model $p (x)$ .
  - $μ = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$
  - $Σ = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)} - μ) (x^{(i)} - μ)^{T}$
2. Given a new $x$ , compute:
  - $p (x) = \frac{1}{(2 π)^{\frac{n}{2}} \lvert Σ \rvert^{\frac{1}{2}}} \exp \left(- \frac{1}{2} (x - μ)^{T} Σ^{- 1} (x - μ)\right)$
3. If $p (x) < ε$ , flag $x$ as an anomaly.
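The sketch below works through these three steps for the two-feature case only ( $n = 2$ ), so that the inverse and determinant of $Σ$ can be written in closed form without a linear algebra library. The training data, the test points, the threshold, and all function names are illustrative assumptions. It also evaluates the per-feature (original) model on the same point to show the kind of case described earlier, where each feature looks normal on its own but the combination does not.

```python
import math


def fit_mvn_2d(X):
    """Estimate the mean vector and 2x2 covariance matrix (1/m convention)."""
    m = len(X)
    mu = [sum(r[0] for r in X) / m, sum(r[1] for r in X) / m]
    d = [(r[0] - mu[0], r[1] - mu[1]) for r in X]
    s11 = sum(a * a for a, _ in d) / m
    s22 = sum(b * b for _, b in d) / m
    s12 = sum(a * b for a, b in d) / m
    return mu, [[s11, s12], [s12, s22]]


def mvn_pdf_2d(x, mu, S):
    """Multivariate Gaussian density for n = 2."""
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    d0, d1 = x[0] - mu[0], x[1] - mu[1]
    # (x - mu)^T S^{-1} (x - mu), expanded for a symmetric 2x2 matrix
    q = (S[1][1] * d0 * d0 - 2 * S[0][1] * d0 * d1 + S[0][0] * d1 * d1) / det
    return math.exp(-0.5 * q) / (2 * math.pi * math.sqrt(det))


def independent_pdf_2d(x, mu, S):
    """Original model: product of per-feature Gaussians (ignores covariance)."""
    p = 1.0
    for j in range(2):
        var = S[j][j]
        p *= math.exp(-((x[j] - mu[j]) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)
    return p


# Hypothetical training data in which x1 and x2 rise together.
X_train = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.1], [6.0, 5.9]]
mu, S = fit_mvn_2d(X_train)
epsilon = 1e-3  # assumed threshold

x_off = [2.0, 5.0]  # each coordinate is in range, but the pair breaks the trend
x_on = [4.5, 4.4]   # follows the trend

print(mvn_pdf_2d(x_off, mu, S) < epsilon)          # multivariate model flags it
print(independent_pdf_2d(x_off, mu, S) < epsilon)  # per-feature model does not
print(mvn_pdf_2d(x_on, mu, S) < epsilon)           # on-trend point is not flagged
```

For general $n$ the same computation would rely on a numerical library for the inverse and determinant of $Σ$.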

Relationship between multivariate Gaussian distribution and univariate (original model)

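Compared with the original approach, the multivariate model is closely related to multiplying the probabilities of several univariate models:

$$p (x) = p (x_{1}; μ_{1}, σ_{1}^{2}) \times p (x_{2}; μ_{2}, σ_{2}^{2}) \times \dots \times p (x_{n}; μ_{n}, σ_{n}^{2})$$

The two models are exactly the same when every off-diagonal entry of the covariance matrix $Σ$ is 0, which is the case when the features are uncorrelated.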
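A short derivation makes the equivalence explicit. Assume $Σ = \operatorname{diag}(σ_{1}^{2}, \dots, σ_{n}^{2})$ , so that $\lvert Σ \rvert = \prod_{j=1}^{n} σ_{j}^{2}$ and $Σ^{-1} = \operatorname{diag}(1 / σ_{1}^{2}, \dots, 1 / σ_{n}^{2})$ . Then:

$$
\begin{aligned}
p (x) &= \frac{1}{(2 π)^{\frac{n}{2}} \lvert Σ \rvert^{\frac{1}{2}}} \exp \left(- \frac{1}{2} (x - μ)^{T} Σ^{-1} (x - μ)\right) \\
&= \frac{1}{(2 π)^{\frac{n}{2}} \prod_{j=1}^{n} σ_{j}} \exp \left(- \sum_{j=1}^{n} \frac{(x_{j} - μ_{j})^{2}}{2 σ_{j}^{2}}\right) \\
&= \prod_{j=1}^{n} \frac{1}{\sqrt{2 π} σ_{j}} \exp \left(- \frac{(x_{j} - μ_{j})^{2}}{2 σ_{j}^{2}}\right)
\end{aligned}
$$

The last line is the product of univariate Gaussian densities used by the original model.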

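H) Multivariate Gaussian

| Original Model | Multivariate Gaussian |
| --- | --- |
| $p (x_{1}; μ_{1}, σ_{1}^{2}) \times \dots \times p (x_{n}; μ_{n}, σ_{n}^{2})$ | $p (x) = \frac{1}{(2 π)^{\frac{n}{2}} \lvert Σ \rvert^{\frac{1}{2}}} \exp \left(- \frac{1}{2} (x - μ)^{T} Σ^{- 1} (x - μ)\right)$ |
| Features that capture unusual combinations must be created manually to catch anomalies effectively | Relationships between features are captured automatically through the covariance matrix |
| Low computational cost | High computational cost (covariance matrix computation) |
| Works even when the training set size $m$ is small | $Σ^{- 1}$ must be computed, so $m > n$ is required; $Σ^{- 1}$ also fails to exist when features are linearly dependent (redundant) |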

novelty detection

Zzong's Notes
