i.i.d. assumption:

All labels are drawn independently from the same posterior.
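As a sketch, writing the training set as $(x_i, y_i)_{i=1}^{N}$ (notation assumed here), this means the joint posterior over all labels factorizes:

```latex
p(y_1, \dots, y_N \mid x_1, \dots, x_N) \;=\; \prod_{i=1}^{N} p(y_i \mid x_i)
```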

Maximum Likelihood Principle

Choose the parameters such that the posterior of the training set (TS) is maximized.
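In symbols (a sketch, with $\beta$ denoting the parameters; the symbol is assumed here):

```latex
\hat{\beta} \;=\; \arg\max_{\beta}\; p(y_1, \dots, y_N \mid x_1, \dots, x_N;\, \beta)
\;=\; \arg\max_{\beta}\; \prod_{i=1}^{N} p(y_i \mid x_i;\, \beta)
```

The second equality uses the i.i.d. assumption from above.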

which gives
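Taking the negative logarithm turns the product into a sum; inserting the sigmoid posterior of logistic regression gives, as a sketch of the standard form:

```latex
\hat{\beta} \;=\; \arg\min_{\beta}\; -\sum_{i=1}^{N} \log p(y_i \mid x_i;\, \beta),
\qquad
p(y = 1 \mid x;\, \beta) \;=\; \sigma\!\left(\beta^\top x\right) \;=\; \frac{1}{1 + e^{-\beta^\top x}}
```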

What is the logarithm of the sigmoid?

⇒ Softplus Function
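The softplus function and the standard identity connecting it to the log-sigmoid:

```latex
\operatorname{softplus}(z) \;=\; \log\!\left(1 + e^{z}\right),
\qquad
\log \sigma(z) \;=\; -\log\!\left(1 + e^{-z}\right) \;=\; -\operatorname{softplus}(-z)
```

So each log-likelihood term is (minus) a softplus of the corresponding score.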

A common convention in the literature is to write the labels as $y_i \in \{-1, +1\}$. Then the logistic regression objective is
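A sketch of this objective under the $y_i \in \{-1, +1\}$ encoding assumed above:

```latex
L(\beta) \;=\; \sum_{i=1}^{N} \log\!\left(1 + \exp\!\left(-y_i\, \beta^\top x_i\right)\right)
\;=\; \sum_{i=1}^{N} \operatorname{softplus}\!\left(-y_i\, \beta^\top x_i\right)
```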

This has no analytic solution, but it is a convex objective, which means that any minimum found by an iterative algorithm is the global one. As a simplification, if the features are centered, we can set the intercept $\beta_0 = 0$.
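A sketch of that simplification, assuming the model carries an explicit intercept $\beta_0$: with centered features the decision boundary can be taken through the origin, so the intercept is dropped.

```latex
p(y = 1 \mid x) \;=\; \sigma\!\left(\beta_0 + \beta^\top x\right)
\quad\xrightarrow{\;\text{centered features}\;}\quad
\beta_0 = 0, \qquad p(y = 1 \mid x) \;=\; \sigma\!\left(\beta^\top x\right)
```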

The derivatives are:
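A sketch of the per-sample derivative with respect to $\beta$, using the chain rule and $\tfrac{e^{-z}}{1 + e^{-z}} = \sigma(-z)$:

```latex
\frac{\partial}{\partial \beta}\, \log\!\left(1 + e^{-y_i \beta^\top x_i}\right)
\;=\; \frac{-y_i\, x_i\, e^{-y_i \beta^\top x_i}}{1 + e^{-y_i \beta^\top x_i}}
\;=\; -y_i\, \sigma\!\left(-y_i\, \beta^\top x_i\right) x_i
```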

From this we can derive the derivative of the loss over the entire training set.
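Summing the per-sample derivatives (same notation as above):

```latex
\nabla_{\beta} L(\beta) \;=\; -\sum_{i=1}^{N} \frac{y_i\, x_i\, e^{-y_i \beta^\top x_i}}{1 + e^{-y_i \beta^\top x_i}}
```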

After simplifying, this gives
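A sketch of one such rewriting, using $\sigma(-z) = 1 - \sigma(z)$ and the $\{0, 1\}$-encoded label $\tilde{y}_i = \tfrac{1 + y_i}{2}$ (this encoding is an assumption); it makes the error interpretation below explicit:

```latex
\nabla_{\beta} L(\beta)
\;=\; -\sum_{i=1}^{N} y_i\, \sigma\!\left(-y_i\, \beta^\top x_i\right) x_i
\;=\; \sum_{i=1}^{N} \underbrace{\left(\sigma\!\left(\beta^\top x_i\right) - \tilde{y}_i\right)}_{\text{error on sample } i}\, x_i
```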

This explains what is happening: the term in the sum is basically the error the classifier makes on sample $i$ (see also the gradient-descent sketch after the list below):

  • case 1: $y_i = +1$ and classifier correct: → error is close to 0
  • case 2: $y_i = +1$ and classifier wrong: → error is close to −1 → because we do gradient descent, this means we move $\beta$ towards $+x_i$
  • case 3: $y_i = -1$ and classifier correct: → error is close to 0, no correction
  • case 4: $y_i = -1$ and classifier wrong: → error is close to +1 ⇒ correction (gradient descent) towards $-x_i$
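A minimal NumPy sketch of the resulting gradient-descent training loop (function names, the fixed step size, and the toy data are illustrative assumptions, not part of the notes):

```python
import numpy as np

def sigmoid(z):
    # logistic function; clip the score to avoid overflow in exp
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def logistic_regression_gd(X, y, lr=0.5, n_steps=2000):
    """Fit beta by gradient descent on the logistic regression objective.

    X: (N, D) centered features, y: (N,) labels in {-1, +1}, no intercept.
    """
    N, D = X.shape
    y01 = (1 + y) / 2                    # {0, 1} encoding of the labels
    beta = np.zeros(D)
    for _ in range(n_steps):
        error = sigmoid(X @ beta) - y01  # ~0 if correct, ~+/-1 if wrong (cases above)
        grad = X.T @ error               # sum_i error_i * x_i
        beta -= lr * grad / N            # gradient descent step (mean gradient)
    return beta

# usage: two Gaussian blobs with labels in {-1, +1}
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(+2, 1, (50, 2))])
y = np.concatenate([-np.ones(50), np.ones(50)])
X = X - X.mean(axis=0)                   # center the features, so beta_0 = 0 is fine
beta = logistic_regression_gd(X, y)
print("beta:", beta, "accuracy:", np.mean(np.sign(X @ beta) == y))
```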