Introduction to MLE and MAP
Recently I came across a great article about MLE (Maximum Likelihood Estimation) and MAP (Maximum A Posteriori). It’s very well written and worth reading: 聊一聊机器学习的 MLE 和 MAP:最大似然估计和最大后验估计 (“A chat about MLE and MAP in machine learning: maximum likelihood estimation and maximum a posteriori estimation”). This article is reposted almost without modification (it will be removed upon request in case of infringement).
Overview
Sometimes when chatting with others, they claim to have a lot of machine-learning experience, but on deeper conversation I find they have only a partial understanding of MLE and MAP. In my view, such a person’s machine-learning foundation is not solid. Could it be that in this era of deep-learning popularity, many students focus only on tuning parameters?
The central problem of modern machine learning ultimately reduces to optimizing an objective function, and MLE and MAP are the basic ideas from which such objective functions are derived, so understanding both is crucial. Let’s discuss these two estimators, MLE and MAP, seriously.
Controversy Between Two Schools
Abstractly speaking, Frequentists and Bayesians have fundamentally different understandings of the world: Frequentists believe the world is deterministic, with an entity whose true value is unchanging, and our goal is to find this true value or the range where it exists; Bayesians believe the world is uncertain, people have a preconception about the world, and then adjust this preconception through observed data. Our goal is to find the optimal probability distribution describing this world.
When modeling things, we use \(\theta\) to represent model parameters, and solving the problem essentially means finding \(\theta\). The difference between Frequentists and Bayesians lies in:
Frequentists: There exists a unique true value \(\theta\). Take a simple, intuitive example: coin flipping. We use \(P(head)\) to denote the coin’s bias toward heads. Flip a coin 100 times and observe 20 heads; in the Frequentist view, the estimate is \(\theta\) = 20 / 100 = 0.2, which is very intuitive. As the amount of data approaches infinity, this method gives accurate estimates; with little data, however, it can be severely biased. For example, take a fair coin (\(\theta\) = 0.5) and flip it 5 times, getting 5 heads (this happens with probability \(1/2^5\) = 3.125%). A Frequentist would directly estimate \(\theta\) = 1, a serious error.
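As a quick sanity check, the Frequentist estimate is just the observed frequency. A minimal sketch (the function name is my own):

```python
# Frequentist (MLE) point estimate of a coin's bias: the observed frequency.

def mle_coin_bias(heads: int, flips: int) -> float:
    """MLE of P(head) for a Bernoulli coin: heads / flips."""
    return heads / flips

print(mle_coin_bias(20, 100))  # 0.2 -- reasonable with plenty of data
print(mle_coin_bias(5, 5))     # 1.0 -- over-commits after only 5 flips
```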
Bayesians: \(\theta\) is a random variable following a certain probability distribution. In Bayesian school, there are two inputs and one output. Inputs are prior and likelihood, output is posterior. Prior, i.e., \(P(\theta)\), refers to the preconception about \(\theta\) without observing any data. For example, given a coin, a feasible prior is believing this coin has a high probability of being fair, and a small probability of being unfair. Likelihood, i.e., \(P(X|\theta)\), is what the observed data should look like assuming \(\theta\) is known. Posterior, i.e., \(P(\theta|X)\), is the final parameter distribution. The foundation of Bayesian estimation is Bayes’ formula:
\[\begin{align} P(\theta|X)=\frac{P(X|\theta) \times P(\theta)}{P(X)} \end{align}\]
Using the same coin-flipping example: flip a fair coin 5 times and get 5 heads. If the prior believes the coin is most likely fair (e.g., a Beta distribution peaked at 0.5), then the posterior over \(P(head)\), i.e., \(P(\theta|X)\), is a distribution whose maximum lies between 0.5 and 1, not the extreme \(\theta\) = 1.
Two points worth noting:
- As data increases, the posterior leans more and more toward the data, and the prior’s influence decreases
- If the prior is a uniform distribution, the Bayesian method is equivalent to the Frequentist method. Intuitively, a uniform prior means having no preconception about the parameter at all
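Both points are easy to verify with a conjugate Beta prior, a sketch of my own (not from the original article): with a Beta(a, b) prior, observing some number of heads in n flips gives a Beta(a + heads, b + n - heads) posterior.

```python
# Conjugate Beta-Binomial update (illustrative sketch, not the post's code).
# Prior Beta(a, b); after `heads` heads in `flips` flips the posterior is
# Beta(a + heads, b + flips - heads).

def posterior_mode(a: float, b: float, heads: int, flips: int) -> float:
    """Mode of the Beta posterior (valid when both updated parameters > 1)."""
    a_post = a + heads
    b_post = b + flips - heads
    return (a_post - 1) / (a_post + b_post - 2)

# Prior peaked at 0.5 (Beta(5, 5)): 5 heads in 5 flips pulls the mode above
# 0.5 but nowhere near 1.
print(posterior_mode(5, 5, 5, 5))     # 9/13, roughly 0.692

# Uniform prior Beta(1, 1): the mode reduces to the Frequentist estimate k/n.
print(posterior_mode(1, 1, 20, 100))  # 0.2
```

With more flips, the data terms dominate a + heads and b + flips - heads, which is exactly the first bullet above.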
MLE - Maximum Likelihood Estimation
Maximum Likelihood Estimation (MLE) is the estimation method most commonly used by Frequentists!
Assume the data \(x_1, x_2, ..., x_n\) form an i.i.d. (independent and identically distributed) sample, and write \(X = (x_1, x_2, ..., x_n)\). Then the MLE estimate of \(\theta\) can be derived as follows:
\[\begin{align} \hat{\theta}_\text{MLE} &= \arg \max P(X; \theta) \\ &= \arg \max P(x_1; \theta) P(x_2; \theta) \cdots P(x_n;\theta) \\ &= \arg \max \log \prod_{i=1}^{n} P(x_i; \theta) \\ &= \arg \max \sum_{i=1}^{n} \log P(x_i; \theta) \\ &= \arg \min - \sum_{i=1}^{n} \log P(x_i; \theta) \end{align}\]
The function optimized in the last line is called the Negative Log Likelihood (NLL); note that taking the log is valid because \(\log\) is monotonically increasing, so it does not change the arg max. This concept and the derivation above are very important!
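To see the derivation in action, here is a small numerical sketch of my own: minimizing the Bernoulli NLL over a grid recovers the closed-form Frequentist estimate heads / flips.

```python
import math

def bernoulli_nll(theta: float, heads: int, flips: int) -> float:
    """Negative log likelihood of `heads` heads in `flips` Bernoulli trials."""
    return -(heads * math.log(theta) + (flips - heads) * math.log(1 - theta))

heads, flips = 20, 100
grid = [i / 1000 for i in range(1, 1000)]  # theta in (0, 1), step 0.001
theta_hat = min(grid, key=lambda t: bernoulli_nll(t, heads, flips))
print(theta_hat)  # 0.2, the same as heads / flips
```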
We often use MLE without realizing it, for example:
- The Frequentist coin-probability example above: the estimate heads / flips is exactly what optimizing the NLL yields.
- Given some data, when fitting a Gaussian distribution we often compute the mean and variance of the data points and substitute them into the Gaussian formula; the theoretical basis for this is optimizing the NLL.
- The cross-entropy loss used in deep-learning classification tasks is essentially MLE.
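The Gaussian case can be checked directly: the parameters that minimize the NLL are the sample mean and the biased (divide-by-n) variance. A minimal sketch with made-up numbers:

```python
# MLE for a Gaussian: the sample mean and the *biased* variance (divide by n,
# not n - 1) are the likelihood-maximizing parameters.

def gaussian_mle(xs):
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n  # note: n, not n - 1
    return mu, var

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up data
print(gaussian_mle(xs))  # (5.0, 4.0)
```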
MAP - Maximum A Posteriori Estimation
Maximum A Posteriori (MAP) estimation is the method most commonly used by Bayesians!
Similarly, assume data \(x_1, x_2, ..., x_n\) is an i.i.d. sample, \(X = (x_1, x_2, ..., x_n)\). Then MAP’s estimation method for \(\theta\) can be derived as follows:
\[\begin{align} \hat{\theta}_\text{MAP} &= \arg \max P(\theta | X) \\ &= \arg \min -\log P(\theta | X) \\ &= \arg \min -\log P(X|\theta) - \log P(\theta) + \log P(X) \\ &= \arg \min -\log P(X|\theta) - \log P(\theta) \end{align}\]
Here, Bayes’ theorem is used going from the second line to the third, and \(P(X)\) can be dropped going from the third to the fourth because it does not depend on \(\theta\). Note that \(-\log P(X|\theta)\) is exactly the NLL, so in terms of optimization the only difference between MLE and MAP is the extra prior term \(- \log P(\theta)\). Now let’s examine this prior term, assuming the prior is a (zero-mean) Gaussian distribution, i.e.:
\[\begin{align} P(\theta) = \text{constant} \times e^{-\frac{\theta^2}{2\sigma^2}} \end{align}\]
Then, \(-\log P(\theta) = \text{constant} + \frac{\theta^2}{2\sigma^2}\). At this point, something magical happens:
Using a Gaussian distribution prior in MAP is equivalent to using L2 regularization in MLE!
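A small sketch of this equivalence, using an illustrative setup of my own: estimate a Gaussian mean \(\theta\) with a zero-centered Gaussian prior. The MAP objective \(\sum_i (x_i - \theta)^2 / (2\sigma^2) + \theta^2 / (2\tau^2)\) has the closed-form minimizer \(\sum_i x_i / (n + \sigma^2/\tau^2)\), i.e., the L2-shrunk mean.

```python
# MAP estimate of a Gaussian mean with a zero-centered Gaussian prior
# (illustrative sketch). The prior term theta^2 / (2 * tau2) acts exactly
# like an L2 penalty with strength lam = sigma2 / tau2.

def map_mean(xs, sigma2: float, tau2: float) -> float:
    """argmin_theta of sum((x - theta)^2) / (2*sigma2) + theta^2 / (2*tau2)."""
    lam = sigma2 / tau2  # effective L2 regularization strength
    return sum(xs) / (len(xs) + lam)

xs = [1.0, 2.0, 3.0]
print(map_mean(xs, sigma2=1.0, tau2=1.0))  # 1.5: shrunk toward 0 (MLE is 2.0)
print(map_mean(xs, sigma2=1.0, tau2=1e9))  # ~2.0: very flat prior recovers MLE
```

As \(\tau^2\) grows the prior flattens out, the penalty vanishes, and MAP coincides with MLE, matching the uniform-prior remark earlier.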