i.i.d. assumption:
all labels are drawn independently from the same posterior $p(y \mid \mathbf{x}; \beta)$
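In symbols, with training set $TS = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ (this notation is an assumption, not from the notes):

```latex
% i.i.d. assumption: the posterior of all labels factorizes
% into one term per training example.
\[
  p(y_1, \dots, y_n \mid \mathbf{x}_1, \dots, \mathbf{x}_n; \beta)
  = \prod_{i=1}^{n} p(y_i \mid \mathbf{x}_i; \beta)
\]
```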
Maximum Likelihood Principle
choose parameters such that the posterior of the TS is maximized, which gives

$$\hat{\beta} = \arg\max_{\beta} \prod_{i=1}^{n} p(y_i \mid \mathbf{x}_i; \beta) = \arg\max_{\beta} \sum_{i=1}^{n} \log p(y_i \mid \mathbf{x}_i; \beta)$$
What is the logarithm of the sigmoid?
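With $\sigma(z) = \frac{1}{1 + e^{-z}}$, the standard identity:

```latex
% Logarithm of the sigmoid: the product of posteriors turns
% into a sum of these smooth terms.
\[
  \log \sigma(z)
  = \log \frac{1}{1 + e^{-z}}
  = -\log\left(1 + e^{-z}\right)
\]
```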
A common convention in the literature is to write the labels as $y_i \in \{-1, +1\}$. Then the Logistic Regression objective is

$$L(\beta) = \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i \beta^\top \mathbf{x}_i}\right)$$
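A short derivation sketch of that objective, assuming the standard model $p(y \mid \mathbf{x}; \beta) = \sigma(y\,\beta^\top \mathbf{x})$ for $y \in \{-1, +1\}$:

```latex
% Negative log-likelihood of the TS under p(y | x; beta) = sigma(y * beta^T x),
% using log sigma(z) = -log(1 + e^{-z}) from above:
\[
  -\sum_{i=1}^{n} \log \sigma\!\left(y_i\,\beta^\top \mathbf{x}_i\right)
  = \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i\,\beta^\top \mathbf{x}_i}\right)
\]
```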
This has no analytic solution, but it is a convex objective, which means that iterative algorithms converge to a global optimum. So we make a simplification: if the features are centered, we can set the intercept $\beta_0 = 0$ and drop it from the model.
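In code, centering is one line (a sketch; the toy matrix is an assumption):

```python
import numpy as np

X = np.array([[1.0, 10.0], [3.0, 14.0]])  # toy (n, d) feature matrix
X_centered = X - X.mean(axis=0)           # each column now has zero mean
```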
The derivatives of a single term are:

$$\frac{\partial}{\partial \beta} \log\!\left(1 + e^{-y_i \beta^\top \mathbf{x}_i}\right) = \frac{-y_i\, e^{-y_i \beta^\top \mathbf{x}_i}}{1 + e^{-y_i \beta^\top \mathbf{x}_i}}\, \mathbf{x}_i$$

From which we can derive the derivative of the entire training set:

$$\nabla_\beta L(\beta) = \sum_{i=1}^{n} \frac{-y_i\, e^{-y_i \beta^\top \mathbf{x}_i}}{1 + e^{-y_i \beta^\top \mathbf{x}_i}}\, \mathbf{x}_i$$

which gives, after simplifying with $\frac{e^{-z}}{1 + e^{-z}} = \sigma(-z)$:

$$\nabla_\beta L(\beta) = -\sum_{i=1}^{n} y_i\, \sigma\!\left(-y_i \beta^\top \mathbf{x}_i\right) \mathbf{x}_i$$
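A minimal NumPy sketch of this gradient (array names `X`, `y`, `beta` and their shapes are assumptions, not from the notes):

```python
import numpy as np
from scipy.special import expit  # the logistic sigmoid sigma(z)

def gradient(beta, X, y):
    """Gradient of L(beta) = sum_i log(1 + exp(-y_i * beta^T x_i)).

    X: (n, d) centered feature matrix, y: (n,) labels in {-1, +1},
    beta: (d,) weight vector.
    """
    margins = y * (X @ beta)        # y_i * beta^T x_i
    errors = -y * expit(-margins)   # per-example error term in [-1, 1]
    return X.T @ errors             # sum_i error_i * x_i
```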
This explains what is happening: the term $-y_i\,\sigma(-y_i \beta^\top \mathbf{x}_i)$ in the sum is basically the error (the sketch after this list shows the resulting updates):
- case 1: $y_i = +1$ and classifier correct ($\beta^\top \mathbf{x}_i \gg 0$) ⇒ error is close to 0, no correction
- case 2: $y_i = +1$ and classifier wrong ($\beta^\top \mathbf{x}_i \ll 0$) ⇒ error is close to $-1$ ⇒ because we do gradient descent, this means we move $\beta$ towards $\mathbf{x}_i$
- case 3: $y_i = -1$ and classifier correct ($\beta^\top \mathbf{x}_i \ll 0$) ⇒ error is close to 0, no correction
- case 4: $y_i = -1$ and classifier wrong ($\beta^\top \mathbf{x}_i \gg 0$) ⇒ error is close to $+1$ ⇒ correction (gradient descent) towards $-\mathbf{x}_i$
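To see these corrections in action, a minimal plain gradient-descent loop building on the `gradient` sketch above (the learning rate, step count, and synthetic data are arbitrary assumptions):

```python
import numpy as np

def fit_logistic(X, y, lr=0.01, n_steps=500):
    """Plain gradient descent on L(beta); uses gradient() from the sketch above.

    Misclassified positives (case 2) push beta towards x_i,
    misclassified negatives (case 4) push beta towards -x_i.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        beta -= lr * gradient(beta, X, y)
    return beta

# Usage on tiny synthetic, centered data:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))            # zero-mean features
y = np.sign(X @ np.array([2.0, -1.0]))   # labels in {-1, +1}
beta_hat = fit_logistic(X, y)            # points roughly along (2, -1)
```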