Generalized Linear Model

Given a feature vector $x \in \mathbb{R}^{d}$, the observation $Y$ follows an exponential-family distribution with mean $\mu(x^{\top} \theta)$.

A.1) Likelihood

Let $D = \{(x_{\ell}, y_{\ell})\}_{\ell = 1}^{n}$ be a set of $n$ observations.
- $x_{\ell} \in \mathbb{R}^{d}$ and $y_{\ell} \in \mathbb{R}$
The negative log-likelihood of $D$ under the model parameter $\theta$ is as follows.

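$$L(D; \theta) = \sum_{\ell = 1}^{|D|} \left[ b(x_{\ell}^{\top} \theta) - y_{\ell} x_{\ell}^{\top} \theta - c(y_{\ell}) \right]$$

$$\nabla L(D; \theta) = \sum_{\ell = 1}^{|D|} \left( \mu(x_{\ell}^{\top} \theta) - y_{\ell} \right) x_{\ell}$$

$$\nabla^{2} L(D; \theta) = \sum_{\ell = 1}^{|D|} \dot{\mu}(x_{\ell}^{\top} \theta) \, x_{\ell} x_{\ell}^{\top}$$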

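$\dot{\mu}$ is the derivative of $\mu$; since $\mu$ is increasing, $\dot{\mu}$ is always positive. The Hessian is therefore a positively weighted sum of the matrices $x_{\ell} x_{\ell}^{\top}$, so it is positive semidefinite and $L$ is convex in $\theta$.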

Maximum likelihood estimation of the model parameters amounts to finding a vector $\theta \in \mathbb{R}^{d}$ that satisfies $\nabla L(D; \theta) = 0$.
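
To make the three formulas concrete, here is a minimal sketch for one specific exponential family, assuming Bernoulli observations with a logistic mean function, so $b(z) = \log(1 + e^{z})$ and $c(y) = 0$. The notes keep $\mu$ generic, so the logistic choice, the function names, and the synthetic data are all illustrative assumptions. The gradient is checked against central finite differences of the negative log-likelihood.

```python
import numpy as np

def mu(z):
    # Mean function; the logistic sigmoid is an assumption for this sketch.
    return 1.0 / (1.0 + np.exp(-z))

def mu_dot(z):
    # Derivative of the sigmoid, always positive.
    m = mu(z)
    return m * (1.0 - m)

def nll(X, y, theta):
    # L(D; theta) = sum_l [ b(x_l^T theta) - y_l x_l^T theta - c(y_l) ]
    # with b(z) = log(1 + exp(z)) and c(y) = 0 (Bernoulli case).
    z = X @ theta
    return np.sum(np.logaddexp(0.0, z) - y * z)

def grad(X, y, theta):
    # sum_l (mu(x_l^T theta) - y_l) x_l
    return X.T @ (mu(X @ theta) - y)

def hess(X, theta):
    # sum_l mu_dot(x_l^T theta) x_l x_l^T
    w = mu_dot(X @ theta)
    return (X * w[:, None]).T @ X

# Synthetic data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
theta = np.array([0.5, -1.0, 0.25])
y = (rng.random(50) < mu(X @ theta)).astype(float)

# Central finite-difference approximation of the gradient.
eps = 1e-6
fd_grad = np.array([
    (nll(X, y, theta + eps * e) - nll(X, y, theta - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
print(np.max(np.abs(fd_grad - grad(X, y, theta))))  # should be close to zero
print(np.linalg.eigvalsh(hess(X, theta)))           # eigenvalues of the Hessian
```

The finite-difference check is only a sanity test of the analytic gradient; it plays no part in estimation.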

A variant of Thompson sampling in which the posterior of $\theta_{*}$ is approximated by its Laplace approximation.
A random parameter vector is sampled from the Laplace approximation.
- $\tilde{\theta}_{t} \sim \mathcal{N}(\bar{\theta}_{t}, a^{2} H_{t}^{-1})$
  - $a > 0$ is a tunable parameter
  - $\bar{\theta}_{t} = \arg\min_{\theta \in \mathbb{R}^{d}} L(\{(X_{\ell}, Y_{\ell})\}_{\ell = 1}^{t - 1}; \theta)$

$$H_{t} = \sum_{\ell = 1}^{t - 1} \dot{\mu}(X_{\ell}^{\top} \bar{\theta}_{t}) \, X_{\ell} X_{\ell}^{\top}$$
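
The sampling step above can be sketched roughly as follows. This is not the reference implementation: it assumes a logistic $\mu$, fits $\bar{\theta}_{t}$ with plain Newton-Raphson plus a small ridge term `lam` that is added here only for numerical stability, and uses synthetic data. The names `fit_mle`, `laplace_precision`, and `sample_tsl` are hypothetical.

```python
import numpy as np

def mu(z):
    # Logistic mean function (an assumption; the notes keep mu generic).
    return 1.0 / (1.0 + np.exp(-z))

def fit_mle(X, y, lam=1e-3, iters=50, tol=1e-10):
    # Newton-Raphson on the negative log-likelihood.
    # lam is a small ridge term, an assumption added for numerical stability.
    d = X.shape[1]
    theta = np.zeros(d)
    for _ in range(iters):
        p = mu(X @ theta)
        g = X.T @ (p - y) + lam * theta
        if np.linalg.norm(g) < tol:
            break
        H = (X * (p * (1.0 - p))[:, None]).T @ X + lam * np.eye(d)
        theta = theta - np.linalg.solve(H, g)
    return theta

def laplace_precision(X, theta_bar):
    # H_t = sum_l mu_dot(X_l^T theta_bar) X_l X_l^T
    p = mu(X @ theta_bar)
    return (X * (p * (1.0 - p))[:, None]).T @ X

def sample_tsl(theta_bar, H, a, rng):
    # Draw from N(theta_bar, a^2 H^{-1}) using the Cholesky factor H = C C^T:
    # if z ~ N(0, I), then C^{-T} z has covariance (C C^T)^{-1} = H^{-1}.
    C = np.linalg.cholesky(H)
    z = rng.standard_normal(theta_bar.shape[0])
    return theta_bar + a * np.linalg.solve(C.T, z)

# Synthetic history of t - 1 = 200 rounds (hypothetical data).
rng = np.random.default_rng(0)
theta_star = np.array([0.5, -1.0, 0.25])
X = rng.normal(size=(200, 3))
y = (rng.random(200) < mu(X @ theta_star)).astype(float)

theta_bar = fit_mle(X, y)
H = laplace_precision(X, theta_bar)
theta_tilde = sample_tsl(theta_bar, H, a=1.0, rng=rng)
print(theta_bar, theta_tilde)
```

Solving against the Cholesky factor avoids forming $H_{t}^{-1}$ explicitly; the approach requires $H_{t}$ to be positive definite, which holds here because the synthetic features span $\mathbb{R}^{3}$.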

GLM-FPL (follow-the-perturbed-leader)
- The random parameter vector is the MLE computed from the rewards up to round $t - 1$ after Gaussian noise has been added to them.
  - $\tilde{\theta}_{t} = \arg\min_{\theta \in \mathbb{R}^{d}} L(\{(X_{\ell}, Y_{\ell} + Z_{\ell})\}_{\ell = 1}^{t - 1}; \theta)$
    - $Z_{\ell} \sim \mathcal{N}(0, a^{2})$ are normal random variables (resampled in every round)
      - $a > 0$ is a tunable parameter
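
The perturbed-MLE step can be sketched in a few lines. As before, this is only an illustration under stated assumptions: a logistic $\mu$, a Newton-Raphson solver with a small ridge term `lam` (added here so that the minimizer exists even when the perturbed rewards fall outside $[0, 1]$), and synthetic data. `fit_mle` and `sample_fpl` are hypothetical names.

```python
import numpy as np

def mu(z):
    # Logistic mean function (an assumption; the notes keep mu generic).
    return 1.0 / (1.0 + np.exp(-z))

def fit_mle(X, y, lam=1e-3, iters=50, tol=1e-10):
    # Newton-Raphson on the negative log-likelihood.
    # lam is a small ridge term, an assumption added for numerical stability.
    d = X.shape[1]
    theta = np.zeros(d)
    for _ in range(iters):
        p = mu(X @ theta)
        g = X.T @ (p - y) + lam * theta
        if np.linalg.norm(g) < tol:
            break
        H = (X * (p * (1.0 - p))[:, None]).T @ X + lam * np.eye(d)
        theta = theta - np.linalg.solve(H, g)
    return theta

def sample_fpl(X, y, a, rng):
    # Fresh noise Z_l ~ N(0, a^2) on every call, matching the per-round
    # resampling, followed by an MLE fit on the perturbed rewards.
    z = rng.normal(0.0, a, size=y.shape[0])
    return fit_mle(X, y + z)

# Synthetic history (hypothetical data).
rng = np.random.default_rng(0)
theta_star = np.array([0.5, -1.0, 0.25])
X = rng.normal(size=(200, 3))
y = (rng.random(200) < mu(X @ theta_star)).astype(float)

theta_round_1 = sample_fpl(X, y, a=0.5, rng=rng)
theta_round_2 = sample_fpl(X, y, a=0.5, rng=rng)  # different noise, different estimate
print(theta_round_1, theta_round_2)
```

Unlike the Laplace-approximation sampler, no Hessian is needed at sampling time; all randomness comes from the perturbed rewards, at the cost of one full MLE fit per round.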
Computationally-Efficient Implementations
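The MLE in the expressions above can be computed with iteratively reweighted least squares (IRLS), which uses the Newton-Raphson method.
Roughly speaking, each step of IRLS multiplies the inverse of the Hessian $\nabla^{2} L(D; \theta)$ by the gradient $\nabla L(D; \theta)$.
That is, $\nabla L(D; \theta)$ can reportedly be computed as $\sum_{x \in X} (N_{x} \mu(x^{\top} \theta) - Y_{x}) x$, and $\nabla^{2} L(D; \theta)$ as $\sum_{x \in X} N_{x} \dot{\mu}(x^{\top} \theta) \, x x^{\top}$.
- $N_{x}$ is the number of times $x$ occurs in the history $D$, and $Y_{x}$ is the total reward observed for $x$ (the sum of $y_{\ell}$ over the rounds with $x_{\ell} = x$, which is what makes the expression agree with the per-observation gradient); both can be updated incrementally.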
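
The aggregated form can be checked numerically. The sketch below assumes a finite set of feature vectors (arms) and a logistic $\mu$; the class `ArmStats` and its methods are hypothetical names. It keeps only $N_{x}$ and $Y_{x}$ per arm, updates them incrementally, and compares the resulting gradient and Hessian with the sums over the full history.

```python
import numpy as np

def mu(z):
    # Logistic mean function (an assumption; the notes keep mu generic).
    return 1.0 / (1.0 + np.exp(-z))

class ArmStats:
    """Per-arm sufficient statistics: pull counts N_x and reward sums Y_x."""

    def __init__(self, arms):
        self.arms = np.asarray(arms, dtype=float)  # K x d matrix, one row per arm
        self.N = np.zeros(len(self.arms))          # N_x
        self.Y = np.zeros(len(self.arms))          # Y_x

    def update(self, k, reward):
        # O(1) incremental update after pulling arm k.
        self.N[k] += 1
        self.Y[k] += reward

    def grad(self, theta):
        # sum_x (N_x mu(x^T theta) - Y_x) x
        m = mu(self.arms @ theta)
        return self.arms.T @ (self.N * m - self.Y)

    def hess(self, theta):
        # sum_x N_x mu_dot(x^T theta) x x^T
        m = mu(self.arms @ theta)
        w = self.N * m * (1.0 - m)
        return (self.arms * w[:, None]).T @ self.arms

    def newton_step(self, theta):
        # One Newton-Raphson step: theta - (Hessian)^{-1} gradient.
        return theta - np.linalg.solve(self.hess(theta), self.grad(theta))

# Synthetic run (hypothetical data): 5 arms in 3 dimensions, 100 rounds.
rng = np.random.default_rng(0)
arms = rng.normal(size=(5, 3))
theta_star = np.array([0.5, -1.0, 0.25])
stats = ArmStats(arms)
pulled, rewards = [], []
for _ in range(100):
    k = int(rng.integers(5))
    r = float(rng.random() < mu(arms[k] @ theta_star))
    stats.update(k, r)
    pulled.append(k)
    rewards.append(r)

# Reference values computed from the full history.
theta = np.zeros(3)
Xh, yh = arms[pulled], np.array(rewards)
p = mu(Xh @ theta)
g_full = Xh.T @ (p - yh)
H_full = (Xh * (p * (1.0 - p))[:, None]).T @ Xh
print(np.allclose(stats.grad(theta), g_full), np.allclose(stats.hess(theta), H_full))
```

With this representation the cost of each Newton step depends on the number of distinct arms rather than on the number of rounds played.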
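Unclear points
- $\nabla L(D; \theta)$ is a $d$-dimensional vector, so how could its inverse be computed? Most likely only the $d \times d$ Hessian is inverted: the Newton-Raphson update is $\theta \leftarrow \theta - [\nabla^{2} L(D; \theta)]^{-1} \nabla L(D; \theta)$.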