i.i.d. assumption:
all labels are drawn independently from the same posterior $p(y \mid \mathbf{x}; \beta)$
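In symbols, with training set $TS = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ (this notation is an assumption, not from the notes):

```latex
% i.i.d. assumption: the posterior of all labels factorizes
% into one term per training example.
\[
  p(y_1, \dots, y_n \mid \mathbf{x}_1, \dots, \mathbf{x}_n; \beta)
  = \prod_{i=1}^{n} p(y_i \mid \mathbf{x}_i; \beta)
\]
```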
Maximum Likelihood Principle
choose parameters such that the posterior of the TS is maximized, which gives

$$\hat{\beta} = \arg\max_{\beta} \prod_{i=1}^{n} p(y_i \mid \mathbf{x}_i; \beta) = \arg\max_{\beta} \sum_{i=1}^{n} \log p(y_i \mid \mathbf{x}_i; \beta)$$
What is the logarithm of the sigmoid?
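With $\sigma(z) = \frac{1}{1 + e^{-z}}$, the standard identity:

```latex
% Logarithm of the sigmoid: the product of posteriors turns
% into a sum of these smooth terms.
\[
  \log \sigma(z)
  = \log \frac{1}{1 + e^{-z}}
  = -\log\left(1 + e^{-z}\right)
\]
```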
A common convention in the literature is to write the labels as $y_i \in \{-1, +1\}$. Then the Logistic Regression objective is

$$L(\beta) = \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i \beta^\top \mathbf{x}_i}\right)$$
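A short derivation sketch of that objective, assuming the standard model $p(y \mid \mathbf{x}; \beta) = \sigma(y\,\beta^\top \mathbf{x})$ for $y \in \{-1, +1\}$:

```latex
% Negative log-likelihood of the TS under p(y | x; beta) = sigma(y * beta^T x),
% using log sigma(z) = -log(1 + e^{-z}) from above:
\[
  -\sum_{i=1}^{n} \log \sigma\!\left(y_i\,\beta^\top \mathbf{x}_i\right)
  = \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i\,\beta^\top \mathbf{x}_i}\right)
\]
```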
This has no analytic solution, but it is a convex objective, which means that iterative algorithms converge to a global optimum. So we make a simplification: if the features are centered, we can set the intercept $\beta_0 = 0$ and drop it from the model.
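In code, centering is one line (a sketch; the toy matrix is an assumption):

```python
import numpy as np

X = np.array([[1.0, 10.0], [3.0, 14.0]])  # toy (n, d) feature matrix
X_centered = X - X.mean(axis=0)           # each column now has zero mean
```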
The derivatives of a single term are:

$$\frac{\partial}{\partial \beta} \log\!\left(1 + e^{-y_i \beta^\top \mathbf{x}_i}\right) = \frac{-y_i\, e^{-y_i \beta^\top \mathbf{x}_i}}{1 + e^{-y_i \beta^\top \mathbf{x}_i}}\, \mathbf{x}_i$$

From which we can derive the derivative of the entire training set:

$$\nabla_\beta L(\beta) = \sum_{i=1}^{n} \frac{-y_i\, e^{-y_i \beta^\top \mathbf{x}_i}}{1 + e^{-y_i \beta^\top \mathbf{x}_i}}\, \mathbf{x}_i$$

which gives, after simplifying with $\frac{e^{-z}}{1 + e^{-z}} = \sigma(-z)$:

$$\nabla_\beta L(\beta) = -\sum_{i=1}^{n} y_i\, \sigma\!\left(-y_i \beta^\top \mathbf{x}_i\right) \mathbf{x}_i$$
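A minimal NumPy sketch of this gradient (array names `X`, `y`, `beta` and their shapes are assumptions, not from the notes):

```python
import numpy as np
from scipy.special import expit  # the logistic sigmoid sigma(z)

def gradient(beta, X, y):
    """Gradient of L(beta) = sum_i log(1 + exp(-y_i * beta^T x_i)).

    X: (n, d) centered feature matrix, y: (n,) labels in {-1, +1},
    beta: (d,) weight vector.
    """
    margins = y * (X @ beta)        # y_i * beta^T x_i
    errors = -y * expit(-margins)   # per-example error term in [-1, 1]
    return X.T @ errors             # sum_i error_i * x_i
```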
This explains what is happening: the term $-y_i\,\sigma(-y_i \beta^\top \mathbf{x}_i)$ in the sum is basically the error (the sketch after this list shows the resulting updates):
- case 1: $y_i = +1$ and classifier correct ($\beta^\top \mathbf{x}_i \gg 0$) ⇒ error is close to 0, no correction
- case 2: $y_i = +1$ and classifier wrong ($\beta^\top \mathbf{x}_i \ll 0$) ⇒ error is close to $-1$ ⇒ because we do gradient descent, this means we move $\beta$ towards $\mathbf{x}_i$
- case 3: $y_i = -1$ and classifier correct ($\beta^\top \mathbf{x}_i \ll 0$) ⇒ error is close to 0, no correction
- case 4: $y_i = -1$ and classifier wrong ($\beta^\top \mathbf{x}_i \gg 0$) ⇒ error is close to $+1$ ⇒ correction (gradient descent) towards $-\mathbf{x}_i$
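To see these corrections in action, a minimal plain gradient-descent loop building on the `gradient` sketch above (the learning rate, step count, and synthetic data are arbitrary assumptions):

```python
import numpy as np

def fit_logistic(X, y, lr=0.01, n_steps=500):
    """Plain gradient descent on L(beta); uses gradient() from the sketch above.

    Misclassified positives (case 2) push beta towards x_i,
    misclassified negatives (case 4) push beta towards -x_i.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        beta -= lr * gradient(beta, X, y)
    return beta

# Usage on tiny synthetic, centered data:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))            # zero-mean features
y = np.sign(X @ np.array([2.0, -1.0]))   # labels in {-1, +1}
beta_hat = fit_logistic(X, y)            # points roughly along (2, -1)
```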