Exercise Type 13: Log-Likelihood & MLE for Generative Classifiers
What the exam asks: Given a generative classifier model, write down the log-likelihood function and/or derive the Maximum Likelihood Estimate (MLE) for parameters like the mean and covariance of each class.
Part 0: What Do All These Symbols Mean?
The Key Notation
| Symbol | How to Read It | What It Means |
|---|---|---|
| $D$ | Capital D | The dataset — all our training data |
| $\theta$ | Greek letter theta | The collection of ALL model parameters (means, covariances, mixing coefficients) |
| $\log$ | Logarithm (natural log, base e) | The inverse of exponentiation — turns products into sums |
| $\log p(D|\theta)$ | "log probability of data given parameters" | The log-likelihood — how well do these parameters explain the data? |
| $\hat{\mu}_k$ | "mu hat sub k" | The MLE estimate for the mean of class k |
| $\hat{\Sigma}_k$ | "Sigma hat sub k" | The MLE estimate for the covariance of class k |
| $y_{nk}$ | "y sub n k" | Is data point n in class k? (1 if yes, 0 if no — one-hot encoding) |
| $\pi_k$ | "pi sub k" | The mixing coefficient — what fraction of data comes from class k |
| $(x-\mu)(x-\mu)^T$ | "outer product" | Matrix multiplication that gives a covariance-like matrix |
| $(x-\mu)^T(x-\mu)$ | "inner product" | Matrix multiplication that gives a scalar (single number) |
What Is a Log-Likelihood? (Plain English)
The likelihood $p(D|\theta)$ answers: "If these were the true parameters, how likely would our data be?"
The log-likelihood $\log p(D|\theta)$ is the same thing but with a logarithm applied. We do this because:
- Products become sums — much easier to work with
- Very small numbers become manageable — likelihoods can be tiny (like $10^{-100}$), but their logs are reasonable (like $-230$)
- Maximizing the log gives the same answer as maximizing the original — so we can use the log without changing the result
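Here is a tiny numeric illustration of the first two points, with made-up likelihood values (the numbers themselves mean nothing):

```python
import numpy as np

# Four made-up per-point likelihoods: each is tiny, so their product is near underflow.
likelihoods = np.array([1e-20, 3e-25, 5e-18, 2e-22])

product = np.prod(likelihoods)             # about 3e-84
sum_of_logs = np.sum(np.log(likelihoods))  # about -192, easy to work with

# log of a product equals the sum of the logs (up to floating-point error)
print(np.isclose(np.log(product), sum_of_logs))   # True
```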
What Is MLE? (Plain English)
Maximum Likelihood Estimation asks: "What parameter values make the data MOST likely?"
In plain English: "Find the parameter values that maximize the log-likelihood."
Part 1: The Generative Classifier Model
The Joint Distribution
For a two-class classifier with one-hot encoding $y_{nk}$, the joint distribution of the whole dataset is:

$$p(D|\theta) = \prod_{n=1}^{N} \prod_{k=1}^{2} \left[\pi_k\, \mathcal{N}(x_n|\mu_k,\Sigma_k)\right]^{y_{nk}}$$
The Log-Likelihood
Taking the log of the product over all N data points:

$$\log p(D|\theta) = \sum_{n=1}^{N} \sum_{k=1}^{2} y_{nk} \log \mathcal{N}(x_n|\mu_k,\Sigma_k) + \sum_{n=1}^{N} \sum_{k=1}^{2} y_{nk} \log \pi_k$$
How to read this:
- First term: For each data point n and each class k, if $y_{nk} = 1$ (point n is in class k), add the log probability of that point under class k's Gaussian
- Second term: For each data point n and each class k, if $y_{nk} = 1$, add the log of the mixing coefficient
The MLE for Gaussian Parameters
For class k:

$$\hat{\mu}_k = \frac{1}{N_k} \sum_n y_{nk}\, x_n, \qquad \hat{\Sigma}_k = \frac{1}{N_k} \sum_n y_{nk}\, (x_n - \hat{\mu}_k)(x_n - \hat{\mu}_k)^T$$

Where $N_k = \sum_n y_{nk}$ (the number of points in class k).

In plain English:
- The MLE mean is just the average of the data points assigned to that class
- The MLE covariance is the average outer product of deviations from that mean
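As a sanity check, here is a minimal numpy sketch of these two estimators. The array names (`X` for an N×D data matrix, `Y` for N×K one-hot labels) are my own, not the exam's notation:

```python
import numpy as np

def mle_class_params(X, Y):
    """MLE mean, covariance, and mixing coefficient per class.

    X: (N, D) data matrix, one row per point.
    Y: (N, K) one-hot labels, Y[n, k] = 1 if point n belongs to class k.
    """
    N, K = Y.shape
    means, covs, priors = [], [], []
    for k in range(K):
        y_k = Y[:, k]                                  # selector: 1 for points in class k
        N_k = y_k.sum()                                # N_k = sum_n y_nk
        mu_k = (y_k[:, None] * X).sum(axis=0) / N_k    # average of class-k points
        diffs = X - mu_k                               # (N, D) deviations from the class mean
        # average outer product (x_n - mu_k)(x_n - mu_k)^T over class-k points only
        Sigma_k = (y_k[:, None, None] * np.einsum('ni,nj->nij', diffs, diffs)).sum(axis=0) / N_k
        means.append(mu_k)
        covs.append(Sigma_k)
        priors.append(N_k / N)                         # pi_k = N_k / N
    return means, covs, priors
```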
Part 2: FULL Walkthrough of Real Exam Questions
EXAM QUESTION 1 (2021-Part-B, Question 2c)
The log-likelihood $\log p(D|\theta)$ is:
Options:
- (a) $\sum_k y_{nk} \log \mathcal{N}(x_n|\mu_k,\Sigma_k) + \sum_k y_{nk} \log \pi_k$
- (b) $\sum_n \sum_k y_{nk} \log \mathcal{N}(x_n|\mu_k,\Sigma_k) + \sum_n \sum_k \log \pi_k$
- (c) $\sum_n \sum_k y_{nk} \log \mathcal{N}(x_n|\mu_k,\Sigma_k) + \sum_n \sum_k y_{nk} \log \pi_k$
- (d) $\sum_k y_{nk} \log(\pi_k \mathcal{N}(x_n|\mu_k,\Sigma_k))$
STEP-BY-STEP SOLUTION
Step 1: Start with the joint distribution

$$p(D|\theta) = \prod_{n=1}^{N} \prod_{k=1}^{2} \left[\pi_k\, \mathcal{N}(x_n|\mu_k,\Sigma_k)\right]^{y_{nk}}$$

Step 2: Take the logarithm

Using the rule $\log(\prod a_i) = \sum \log(a_i)$:

$$\log p(D|\theta) = \sum_n \sum_k y_{nk} \log\left[\pi_k\, \mathcal{N}(x_n|\mu_k,\Sigma_k)\right]$$

Step 3: Use the rule $\log(ab) = \log(a) + \log(b)$

$$\log p(D|\theta) = \sum_n \sum_k y_{nk} \log \mathcal{N}(x_n|\mu_k,\Sigma_k) + \sum_n \sum_k y_{nk} \log \pi_k$$
Step 4: Match the answer
(c) matches exactly.
Key features of the correct answer:
- Sum over n (all data points) AND sum over k (all classes)
- $y_{nk}$ appears in BOTH terms (it selects the right class for each point)

Why the other options are wrong:
- (a) Missing the sum over n. Only sums over k. ELIMINATE.
- (b) Missing $y_{nk}$ in the $\log \pi_k$ term. Every data point should only contribute to its assigned class, not all classes. ELIMINATE.
- (d) Missing the sum over n. ELIMINATE.
Answer: (c) ✅
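If you want to convince yourself that (c) really is the log of the joint from Step 1, here is a small numeric check with made-up parameters and data (scipy's `multivariate_normal` is only used to evaluate the Gaussian densities):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Made-up 2-class, 2-D model and a tiny dataset (purely illustrative numbers).
mus    = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
Sigmas = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
pis    = [0.4, 0.6]
X = rng.normal(size=(5, 2))
Y = np.eye(2)[rng.integers(0, 2, size=5)]          # one-hot labels y_nk

# Answer (c): sum_n sum_k y_nk [ log N(x_n|mu_k,Sigma_k) + log pi_k ]
log_lik = sum(
    Y[n, k] * (multivariate_normal(mus[k], Sigmas[k]).logpdf(X[n]) + np.log(pis[k]))
    for n in range(5) for k in range(2)
)

# Same quantity as the log of the product of joint terms, prod_n prod_k [pi_k N(...)]^y_nk
joint = np.prod([
    (pis[k] * multivariate_normal(mus[k], Sigmas[k]).pdf(X[n])) ** Y[n, k]
    for n in range(5) for k in range(2)
])
print(np.isclose(log_lik, np.log(joint)))   # True
```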
EXAM QUESTION 2 (2021-Part-B, Question 2d)
Let $\hat{\mu}_2$ be the MLE for $\mu_2$. The MLE for $\Sigma_2$ is:
Options:
- (a) $\hat{\Sigma}_2 = \frac{1}{N} \sum_n (x_n - \hat{\mu}_2)(x_n - \hat{\mu}_2)^T$
- (b) $\hat{\Sigma}_2 = \frac{1}{N} \sum_n y_{n2} (x_n - \hat{\mu}_2)(x_n - \hat{\mu}_2)^T$
- (c) $\hat{\Sigma}_2 = \frac{1}{N} \sum_n y_{n2} (x_n - \hat{\mu}_2)^T (x_n - \hat{\mu}_2)$
- (d) $\hat{\Sigma}_2 = \frac{1}{N} \sum_n y_{n2} (x_n - \hat{\mu}_2)^2$
STEP-BY-STEP SOLUTION
Step 1: For a single Gaussian, the MLE covariance is:

$$\hat{\Sigma} = \frac{1}{N} \sum_n (x_n - \hat{\mu})(x_n - \hat{\mu})^T$$

Step 2: For a GMM, we only use points assigned to class 2:

$$\hat{\Sigma}_2 = \frac{1}{N_2} \sum_n y_{n2}\, (x_n - \hat{\mu}_2)(x_n - \hat{\mu}_2)^T$$

Where $y_{n2}$ acts as a selector: it's 1 if point n is in class 2, 0 otherwise. So only points from class 2 contribute.
Note: The exam uses $1/N$ instead of $1/N_2$, but the key distinguishing feature is the structure of the formula.
Step 3: Match the answer
(a) No $y_{n2}$ selector — uses ALL data points, not just class 2. ELIMINATE.
(b) Has the $y_{n2}$ selector AND the outer product $(x_n - \hat{\mu}_2)(x_n - \hat{\mu}_2)^T$. This gives a $2 \times 2$ matrix (in general $D \times D$, the right shape for a covariance). ✓
(c) Has the INNER product $(x_n - \hat{\mu}_2)^T (x_n - \hat{\mu}_2)$ — this gives a single number (scalar), not a matrix. Covariance must be a matrix. ELIMINATE.
(d) Has $(x_n - \hat{\mu}_2)^2$ — this is for 1D case only. The problem states $x_n \in \mathbb{R}^{2 \times 1}$ (2D vectors), so we need the outer product. ELIMINATE.
Answer: (b) ✅
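The shape argument behind eliminating (c) is easy to check directly. A quick numpy sketch with a single illustrative 2-D point and mean (values invented):

```python
import numpy as np

x   = np.array([[1.0], [2.0]])    # x_n as a 2x1 column vector, as in the problem
mu2 = np.array([[0.5], [1.0]])    # hat{mu}_2, also 2x1
d = x - mu2

outer = d @ d.T    # (x - mu)(x - mu)^T  -> shape (2, 2), a valid covariance shape
inner = d.T @ d    # (x - mu)^T(x - mu)  -> shape (1, 1), just a squared distance
print(outer.shape, inner.shape)   # (2, 2) (1, 1)
```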
EXAM QUESTION 3 (2021-Part-B, Question 2e)
Aside from degenerate cases, the discrimination boundary between two Gaussian classes will be:
Options:
- (a) straight line
- (b) parabola
- (c) square
- (d) triangle
STEP-BY-STEP SOLUTION
Step 1: Understand the decision boundary
The boundary is where $p(C_1|x) = p(C_2|x)$, or equivalently:

$$\log p(x|C_1) + \log p(C_1) = \log p(x|C_2) + \log p(C_2)$$
Step 2: What does $\log p(x|C_k)$ look like?
For a Gaussian: $\log \mathcal{N}(x|\mu_k, \Sigma_k) = -\frac{1}{2}(x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k) + \text{constants}$
This is a quadratic function of $x$: the term $(x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k)$ is quadratic in $x$.
Step 3: What happens when we set the two equal?
If $\Sigma_1 \neq \Sigma_2$ (different covariances), the quadratic terms don't cancel, and the boundary is quadratic — a parabola in 2D.
If $\Sigma_1 = \Sigma_2$ (shared covariance), the quadratic terms cancel, and the boundary is linear — a straight line.
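To see where these two cases come from, write the boundary condition with both log-densities expanded and collect terms (a sketch of the algebra, same notation as above):

$$-\tfrac{1}{2}\, x^T\!\left(\Sigma_1^{-1} - \Sigma_2^{-1}\right) x + \left(\mu_1^T\Sigma_1^{-1} - \mu_2^T\Sigma_2^{-1}\right) x + \text{const} = 0$$

The first term is the only part that is quadratic in $x$; it survives exactly when $\Sigma_1 \neq \Sigma_2$ and vanishes when $\Sigma_1 = \Sigma_2$, leaving an equation that is linear in $x$.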
Since the problem doesn't assume shared covariance:
Answer: (b) parabola ✅
EXAM QUESTION 4 (2021-Part-B, Question 2b)
Posterior class probability $p(y_{n1}=1|x_n)$:
Options:
- (a) $\frac{\mathcal{N}(x_n|\mu_1,\Sigma_1)}{\mathcal{N}(x_n|\mu_1,\Sigma_1) + \mathcal{N}(x_n|\mu_2,\Sigma_2)}$
- (b) $\frac{\pi_1}{\pi_1 + \pi_2}$
- (c) $\frac{\pi_2 \cdot \mathcal{N}(x_n|\mu_2,\Sigma_2)}{\pi_1 \mathcal{N}(x_n|\mu_1,\Sigma_1) + \pi_2 \mathcal{N}(x_n|\mu_2,\Sigma_2)}$
- (d) $\frac{\pi_1 \cdot \mathcal{N}(x_n|\mu_1,\Sigma_1)}{\pi_1 \mathcal{N}(x_n|\mu_1,\Sigma_1) + \pi_2 \cdot \mathcal{N}(x_n|\mu_2,\Sigma_2)}$
STEP-BY-STEP SOLUTION
By Bayes' rule:

$$p(y_{n1}=1|x_n) = \frac{p(x_n|y_{n1}=1)\, p(y_{n1}=1)}{p(x_n)}$$

Where:
- $p(x_n|y_{n1}=1) = \mathcal{N}(x_n|\mu_1, \Sigma_1)$
- $p(y_{n1}=1) = \pi_1$
- $p(x_n) = \sum_{k=1}^2 \mathcal{N}(x_n|\mu_k, \Sigma_k) \cdot \pi_k$

So:

$$p(y_{n1}=1|x_n) = \frac{\pi_1\, \mathcal{N}(x_n|\mu_1,\Sigma_1)}{\pi_1\, \mathcal{N}(x_n|\mu_1,\Sigma_1) + \pi_2\, \mathcal{N}(x_n|\mu_2,\Sigma_2)}$$
Answer: (d) ✅
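As a closing sketch, this is what formula (d) looks like in numpy/scipy with invented parameter values (nothing here comes from the actual exam numbers):

```python
import numpy as np
from scipy.stats import multivariate_normal

def posterior_class1(x, mus, Sigmas, pis):
    """Answer (d): pi_1 N(x|mu_1,Sigma_1) / sum_k pi_k N(x|mu_k,Sigma_k)."""
    likes = np.array([multivariate_normal(m, S).pdf(x) for m, S in zip(mus, Sigmas)])
    joint = np.array(pis) * likes     # numerators pi_k * N(x | mu_k, Sigma_k)
    return joint[0] / joint.sum()     # normalise over both classes

# Illustrative numbers only.
mus    = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigmas = [np.eye(2), np.eye(2)]
pis    = [0.3, 0.7]
print(posterior_class1(np.array([0.5, 0.2]), mus, Sigmas, pis))
```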
Part 3: Tricks & Shortcuts
TRICK 1: Log-Likelihood Pattern
$\sum_n \sum_k y_{nk} \log(\text{Gaussian}) + \sum_n \sum_k y_{nk} \log(\text{mixing coefficient})$
Both terms need $y_{nk}$ AND sums over both n and k.
TRICK 2: MLE Covariance for GMM
- Must have $y_{nk}$ selector (only points from that class)
- Must have OUTER product: $(x-\mu)(x-\mu)^T$ (gives a matrix)
- NOT inner product $(x-\mu)^T(x-\mu)$ (gives a scalar)
- NOT $(x-\mu)^2$ (only for 1D)
TRICK 3: Decision Boundary
- Different covariances → quadratic boundary (parabola)
- Same covariance → linear boundary (straight line)
TRICK 4: Posterior Class
- Numerator = class likelihood × class prior
- Denominator = sum over ALL classes
Part 4: Practice Exercises
Exercise 1
The log-likelihood $\log p(D|\theta)$ for a two-class GMM:
Options:
- (a) $\sum_k y_{nk} \log \mathcal{N}(x_n|\mu_k,\Sigma_k) + \sum_k y_{nk} \log \pi_k$
- (b) $\sum_n \sum_k y_{nk} \log \mathcal{N}(x_n|\mu_k,\Sigma_k) + \sum_n \sum_k \log \pi_k$
- (c) $\sum_n \sum_k y_{nk} \log \mathcal{N}(x_n|\mu_k,\Sigma_k) + \sum_n \sum_k y_{nk} \log \pi_k$
- (d) $\sum_k y_{nk} \log(\pi_k \mathcal{N}(x_n|\mu_k,\Sigma_k))$
Exercise 2
MLE for $\Sigma_2$:
Options:
- (a) $\frac{1}{N} \sum_n (x_n - \hat{\mu}_2)(x_n - \hat{\mu}_2)^T$
- (b) $\frac{1}{N} \sum_n y_{n2} (x_n - \hat{\mu}_2)(x_n - \hat{\mu}_2)^T$
- (c) $\frac{1}{N} \sum_n y_{n2} (x_n - \hat{\mu}_2)^T (x_n - \hat{\mu}_2)$
- (d) $\frac{1}{N} \sum_n y_{n2} (x_n - \hat{\mu}_2)^2$
Exercise 3
Discrimination boundary between two Gaussian classes with different covariances:
Options:
- (a) straight line
- (b) parabola
- (c) square
- (d) triangle
Exercise 4
Posterior $p(y_{n1}=1|x_n)$:
Options:
- (a) $\frac{\mathcal{N}(x_n|\mu_1,\Sigma_1)}{\mathcal{N}(x_n|\mu_1,\Sigma_1) + \mathcal{N}(x_n|\mu_2,\Sigma_2)}$
- (b) $\frac{\pi_1}{\pi_1 + \pi_2}$
- (c) $\frac{\pi_2 \cdot \mathcal{N}(x_n|\mu_2,\Sigma_2)}{\pi_1 \mathcal{N}(x_n|\mu_1,\Sigma_1) + \pi_2 \mathcal{N}(x_n|\mu_2,\Sigma_2)}$
- (d) $\frac{\pi_1 \cdot \mathcal{N}(x_n|\mu_1,\Sigma_1)}{\pi_1 \mathcal{N}(x_n|\mu_1,\Sigma_1) + \pi_2 \cdot \mathcal{N}(x_n|\mu_2,\Sigma_2)}$