
1. Sampling from a Probability Distribution

I think the best explanation is kccu@StackExchange: How does one formally define sampling from a probability distribution?:

I have just described how to go from a random variable to its distribution function, but we can go the other way. Namely, given a distribution ($F_{X}$), we can sample a random variable from it, by which we mean we choose a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ and a function $X:\Omega \to \mathbb{R}$ satisfying $\mathbb{P}(X \leq a)=F_{X}(a)$ for all $a \in \mathbb{R}$. It is not obvious that such a random variable must exist! But in fact one does provided you have a valid distribution function.

Now let's look back at the (not terribly useful) definition of sampling, from Wikipedia: Sampling (statistics):

… sampling is the selection of a subset (a statistical sample) of individuals from within a statistical population to estimate characteristics of the whole population. Statisticians attempt for the samples to represent the population in question. Two advantages of sampling are lower cost and faster data collection than measuring the entire population.

So:

  • Suppose the real problem domain has a probability space $(\Omega, \mathcal{F}, \mathbb{P})$
    • In statistics, this set of outcomes $\Omega$ is called the population
  • We somehow work backwards (via sampling) from a distribution (which may itself have been estimated from a sample) to a probability space $(\Omega', \mathcal{F}', \mathbb{P}')$
    • This $\Omega' \subseteq \Omega$ is called a statistical sample
    • There may even be a chicken-and-egg problem here: do you first take a sample to estimate a distribution, or first take a distribution and work backwards to a sample?
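
The construction in the quote is exactly what inverse-transform sampling does in practice: draw $U \sim \operatorname{Uniform}(0, 1)$ and set $X = F^{-1}(U)$. Below is a minimal numpy sketch of this idea (assuming an $\operatorname{Exp}(\lambda)$ target, for which $F^{-1}(u) = -\ln(1-u)/\lambda$; the function name is mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_exponential(lam, size):
    """Inverse-transform sampling: X = F^{-1}(U) with U ~ Uniform(0, 1).

    For Exp(lam), F(x) = 1 - exp(-lam * x), hence F^{-1}(u) = -ln(1 - u) / lam.
    """
    u = rng.uniform(0.0, 1.0, size)
    return -np.log1p(-u) / lam

x = sample_exponential(lam=2.0, size=100_000)
print(x.mean())  # should be close to 1 / lam = 0.5
```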

2. Sample

Although the definition of sampling is clear, a sample, unfortunately, can refer to:

  • a random variable $X$ (obtained by sampling)
  • a set $\Omega' \subseteq \Omega$ (obtained by sampling)
    • Suppose $\vert \Omega' \vert = m$; then these $m$ outcomes are also called $m$ sample values
    • So a sample can also be understood as $m$ sample values
      • If you insist on viewing these $m$ sample values as coming from $m$ random variables, and hence from $m$ samples, you are overcomplicating things. Don't do this.

3. Sample Space / Experiment

Although it may look natural, the $\Omega'$ in the sampled $(\Omega', \mathcal{F}', \mathbb{P}')$ is not called a sample space.

The real sample space is always tied to an experiment. Unfortunately, experiment has no formal definition: Wikipedia: Experiment (probability theory)

In probability theory, an experiment or trial is any procedure that can be infinitely repeated and has a well-defined set of possible outcomes, known as the sample space.

No matter; just remember one thing: an experiment always corresponds to a probability space $(\Omega'', \mathcal{F}'', \mathbb{P}'')$:

  • This $\Omega''$ is then called the sample space; as for why it is named that, I would love to ask whoever coined the term (:rage:)
    • Notice that? By definition, sample space and sampling are not directly related! Give me a break! (Of course, you can force a connection through the definition of sample, but does that actually help you understand anything?)
  • The sample space $\Omega''$ is not necessarily the population $\Omega$, but it can be
    • An example where the sample space is smaller than the population:
      • say the population is the heights of everyone in the country,
      • and the experiment is measuring the heights of 100 people in Beijing; if the tallest person in the country happens not to be in Beijing, no amount of sampling in Beijing will ever produce that maximum height value
    • An example where the sample space equals the population:
      • the experiment is "roll a die once"

4. Statistical Model

4.1 (Background) Mathematical Model

A mathematical model is an abstract concept: Wikipedia: Mathematical model:

A mathematical model is a description of a system using mathematical concepts and language.

Mathematical models are usually composed of relationships and variables.

  • Relationships can be described by operators, such as algebraic operators, functions, differential operators, etc.
  • Variables are abstractions of system parameters of interest, that can be quantified.

Depending on the criterion used, mathematical models fall into several broad categories:

  • Linear vs. nonlinear
    • The criterion: whether the operators are linear
  • Static vs. dynamic
    • A dynamic model accounts for time-dependent changes in the state of the system.
      • These typically involve differential equations or difference equations
        • Note: a difference equation is essentially a recurrence relation, something like $a_{n+1} = f(a_n)$
    • A static (or steady-state) model calculates the system in equilibrium, and thus is time-invariant.
  • Explicit vs. implicit
    • Explicit means "all of the input parameters of the overall model are known, and the output parameters can be calculated by a finite series of computations"
    • Implicit means "the output parameters are unknown, and the corresponding inputs must be solved for by an iterative procedure"
      • Newton's method is one such iterative procedure
  • Discrete vs. continuous
  • Deterministic vs. probabilistic (stochastic)
    • A deterministic model is one in which every set of variable states is uniquely determined by parameters in the model and by sets of previous states of these variables; therefore, a deterministic model always performs the same way for a given set of initial conditions.
    • In a stochastic model, randomness is present, and variable states are not described by unique values, but rather by probability distributions.
      • The probability models and statistical models discussed below both fall under the broad category of stochastic models
  • Deductive vs. inductive
    • A deductive model is a logical structure based on a theory.
    • An inductive model arises from empirical findings and generalization from them.

4.2 (Background) Probability Model

Long story short: A probability model is defined by a probability space $(\Omega, \mathcal{F}, \mathbb{P})$. Period.

  • Note: $\Omega$ is a sample space

4.3 Statistical Model

According to MIT 18.650 Statistics for Applications:

Let the observed outcome of a statistical experiment be a sample of $m$ i.i.d. random variables $X_1, \dots, X_m$ in some measurable space $E$ (usually $E \subseteq \mathbb{R}$) and denote by $\mathbb{P}^\ast$ their common distribution. A statistical model associated to that statistical experiment is a pair $(E, \mathcal{P})$, where

  • $E$ is the sample space,
  • $\mathcal{P} = \lbrace \mathbb{P}_{\theta} \mid \theta \in \Theta \rbrace$ is a family of probability measures (i.e. distributions) on $E$, and
  • $\Theta$ is the parameter set.

An example of $\mathcal{P}$: "Gaussian distributions with different parameters": $\mathcal{P} = \lbrace \mathbb{P}_{\mathcal{N}(\mu, \sigma^2)} \mid (\mu, \sigma^2) \in \mathbb{R} \times (0, \infty) \rbrace$.

4.3.1 Specification

A statistical model is said to be well-specified if $\exists \theta^\ast \in \Theta$ such that $\mathbb{P}_{\theta^\ast} = \mathbb{P}^\ast$.

A statistical model is said to be misspecified if $\not \exists \theta^\ast \in \Theta$ such that $\mathbb{P}_{\theta^\ast} = \mathbb{P}^\ast$.

What does it mean for a probability model to be “well-specified” or “misspecified”?:

Well-specified means that the class of distribution $\mathcal{C}$ you are assuming for your modeling actually contains the unknown probability distribution $p$ from where the sample is drawn.

Misspecified means, on the other hand, that $\mathcal{C}$ does not contain $p$. You made a modeling assumption, and it is not perfect: for instance, you assume your sample is Gaussian, but (maybe due to noise, or just inherently) it is not actually originating from any Gaussian distribution.

In general, we simply assume that our statistical model is well-specified.

This particular $\theta^\ast$ is called the true parameter, and is unknown: the aim of the statistical experiment is to estimate $\theta^\ast$, or to check its properties when they have a special meaning (e.g. what is special about $\mathbb{P}_{\theta^\ast} = \mathbb{P}^\ast$ when $\theta^\ast > 2$).

4.3.2 Identifiability

We say that statistical model $(E, \mathcal{P})$ is identifiable if the mapping $\theta \mapsto \mathbb{P}_{\theta}$ is injective, i.e.

\[\mathbb{P}_{\theta} = \mathbb{P}_{\theta'} \Rightarrow \theta = \theta'\]

In that case, any parameter $\theta$ can be called identified.

4.3.3 Dimension

If $\Theta \subseteq \mathbb{R}^k$, we call $k$ the dimension of the statistical model:

  • The model is said to be parametric if it has a finite dimension
  • The model is said to be nonparametric if it has an infinite dimension

Note: dimension is not specific to statistical models; any mathematical model involving parameters should have this notion. See Parametric vs. non-parametric models

5. Statistic / Estimator

Given an observed sample $X_1, \dots, X_m$ and a statistical model $(E, \mathcal{P})$, we define:

  • A statistic is any measurable function of the sample
  • An estimator of $\theta^\ast$, denoted by $\hat \Theta_m$, is any statistic whose expression does not depend on $\theta^\ast$
    • e.g. $\hat \Theta_m = f(X_1, \dots, X_m)$
  • An estimate of $\theta^\ast$ is denoted by $\hat \theta_m$
    • e.g. $\hat \theta_m = f(x_1, \dots, x_m)$
    • This again touches on questions like "what does it mean to write $X = 3$ for a random variable $X$?". When computing $\hat \theta_m$ from $\hat \Theta_m$ we do simply substitute $x_i$ for $X_i$, but this does not mean $X_i = x_i$, nor that we compute with the value $X_i(x_i)$; rather, we simulate a process in which $X_i(\overset{?}{\cdot}) = x_i$, and we do not care what $\overset{?}{\cdot}$ actually is

Note the logical relationship:

  • estimand $\theta^\ast$ = the parameter of interest (to be estimated)
  • estimator $\hat \Theta_m$ = a function (of estimating)
  • estimate $\hat \theta_m$ = a value (of estimating)

Essentially, a statistic/estimator is a function, and also a random variable, while an estimate is a value. Unfortunately:

  • Because there is no separate noun for the concrete value of a statistic, "statistic" sometimes means the function (or random variable) and sometimes a concrete value
  • An estimator is always a function (or random variable), but some people insist on calling it an estimate, and simply refuse to use an uppercase letter (:anger:), so when you see "an estimate $\hat \theta_m$" you have to judge from context whether it means "an estimator $\hat \Theta_m$" or "an estimate $\hat \theta_m$" (:expressionless:)

Estimating a parameter this way is called point estimation (viewing $X_1, \dots, X_m$ as $m$ points). This extends to estimating a function (also called function approximation): if the estimand is $g^\ast$ and the estimate is $\hat g_m$, we can likewise compute their error, bias, and so on. A function estimator can be seen as simply a point estimator in function space.
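
A tiny sketch to keep the estimator/estimate distinction straight (the sample mean as estimator of the population mean; names and values are mine):

```python
import numpy as np

# Estimator \hat{Theta}_m: a *function* of the sample (hence a random variable).
def theta_hat(sample):
    return np.mean(sample)

# Observed sample values x_1, ..., x_m (drawn here just for illustration):
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=1.0, size=100)

estimate = theta_hat(x)  # estimate \hat{theta}_m: a concrete *value*
print(estimate)          # close to the estimand theta* = 3.0
```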

5.1 Bias

The bias of an estimator $\hat \Theta_m$ is defined as:

\[\operatorname{bias}(\hat \Theta_m) = \mathbb{E}[\hat \Theta_m] - \theta^\ast\]

An estimator $\hat \Theta_m$ is said to be unbiased if $\operatorname{bias}(\hat \Theta_m) = 0$, which implies that $\mathbb{E}[\hat \Theta_m] = \theta^\ast$.

An estimator $\hat \Theta_m$ is said to be asymptotically unbiased if $\underset{m \to \infty}{\lim} \operatorname{bias}(\hat \Theta_m) = 0$, which implies that $\underset{m \to \infty}{\lim} \mathbb{E}[\hat \Theta_m] = \theta^\ast$.
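
These definitions are easy to check by Monte Carlo. A sketch using the classic example (assuming a $\mathcal{N}(0, 4)$ population, so the estimand is $\theta^\ast = \sigma^2 = 4$): the variance estimator that divides by $m$ is biased, while dividing by $m-1$ is unbiased:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star = 4.0                    # true population variance sigma^2
m, trials = 10, 200_000

samples = rng.normal(0.0, np.sqrt(theta_star), size=(trials, m))
var_biased   = samples.var(axis=1, ddof=0)  # divides by m
var_unbiased = samples.var(axis=1, ddof=1)  # divides by m - 1

# bias(\hat{Theta}_m) = E[\hat{Theta}_m] - theta*
print(var_biased.mean()   - theta_star)  # ~ -theta*/m = -0.4 (biased)
print(var_unbiased.mean() - theta_star)  # ~ 0 (unbiased)
```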

5.2 Variance / Standard Deviation / Standard Error (to its own Mean)

For any random variable $X$, its variance is:

\[\operatorname{Var}(X) = \mathbb{E} \big[(X - \mathbb{E}[X])^2\big]\]

Its standard deviation is:

\[\sigma(X) = \sqrt{\operatorname{Var}(X)}\]

Its standard error (to its own mean) is:

\[\operatorname{SEM}(X) = \sqrt{\frac{\operatorname{Var}(X)}{m}} = \frac{\sigma(X)}{\sqrt{m}}\]
  • $m$ is the size of the sample (number of observations).

Substituting $\hat \Theta_m$ for $X$ gives the estimator's variance, standard deviation, and standard error (to its mean).

One thing to note:

  • $\operatorname{bias}(\hat \Theta_m)$ involves $\theta^\ast$, so it characterizes "the relationship between $\hat \Theta_m$ and $\theta^\ast$"
  • $\operatorname{Var}(\hat \Theta_m)$, $\sigma(\hat \Theta_m)$, $\operatorname{SEM}(\hat \Theta_m)$ do not involve $\theta^\ast$, so they describe "properties of $\hat \Theta_m$ itself"
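
For instance, for the sample-mean estimator we can check numerically that $\sigma(\hat \Theta_m) = \sigma(X) / \sqrt{m}$, which is exactly the SEM formula (a sketch, assuming $\sigma(X) = 2$ and $m = 25$):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, m, trials = 2.0, 25, 100_000

# Distribution of the estimator \hat{Theta}_m = sample mean, over repeated samples:
means = rng.normal(0.0, sigma, size=(trials, m)).mean(axis=1)

print(means.std())         # sd of the estimator, ~ 0.4
print(sigma / np.sqrt(m))  # the SEM formula, = 0.4
```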

5.3 Error (to its Estimand) / Mean Square Error (to its Estimand) / (Quadratic) Risk

The error of an estimator $\hat \Theta_m$ to its estimand $\theta^\ast$ is defined as:

\[e(\hat \Theta_m) = \hat \Theta_m - \theta^\ast\]

The mean square error of an estimator $\hat \Theta_m$ to its estimand $\theta^\ast$ is defined as:

\[\operatorname{MSE}(\hat \Theta_m) = \mathbb{E} \big[ e(\hat \Theta_m)^2 \big] = \mathbb{E} \big[ (\hat \Theta_m - \theta^\ast)^2 \big]\]

The (quadratic) risk of an estimator $\hat \Theta_m$ is defined as:

\[\operatorname{risk}(\hat \Theta_m) = \mathbb{E} \big[\vert \hat \Theta_m - \theta^\ast \vert^2 \big]\]
  • Clearly, $\operatorname{risk}(\hat \Theta_m) = \operatorname{MSE}(\hat \Theta_m)$

If $\theta^\ast \in \mathbb{R}$ (a one-dimensional parameter), then:

\[\operatorname{risk} = \operatorname{MSE} = \operatorname{bias}^2 + \operatorname{Var}\]
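
The decomposition is also easy to verify by simulation; a sketch reusing the biased variance estimator from above as $\hat \Theta_m$ (population again assumed $\mathcal{N}(0, 4)$):

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star, m, trials = 4.0, 10, 200_000

# \hat{Theta}_m = variance estimator dividing by m (biased)
est = rng.normal(0.0, np.sqrt(theta_star), size=(trials, m)).var(axis=1, ddof=0)

mse  = np.mean((est - theta_star) ** 2)
bias = est.mean() - theta_star
var  = est.var()
print(mse, bias**2 + var)  # agree up to Monte Carlo error
```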

5.4 Consistency

Let $X_1, \dots, X_m$ be a sample sharing a common distribution $\mathbb{P}$ from a statistical model $(E, \mathcal{P})$ (which implies $\mathbb{P} \in \mathcal{P}$), and $\hat \Theta_m = f(X_1, \dots, X_m)$ be an estimator of $\theta^\ast$.

Estimator $\hat \Theta_m$ is said to be (weakly) consistent if it converges to $\theta^\ast$ in probability, i.e.:

\[\underset{m \to \infty}{p\lim} \hat \Theta_m = \theta^\ast\]

which means $\forall \epsilon > 0$:

\[\underset{m \to \infty}{\lim} \mathbb{P}(\lvert \hat \Theta_m - \theta^\ast \rvert > \epsilon) = 0\]
  • Convergence in probability can also be written $\hat \Theta_m \overset{\mathbb{P}}{\rightarrow} \theta^\ast$

Estimator $\hat \Theta_m$ is said to be mean square consistent if $\underset{m \to \infty}{\lim} \operatorname{MSE}(\hat \Theta_m) = 0$

Estimator $\hat \Theta_m$ is said to be $L_2$ consistent if $\lVert \hat \Theta_m - \theta^\ast \rVert_2 \overset{\mathbb{P}}{\rightarrow} 0$

  • Seen this way, weak consistency can also be regarded as $L_1$ consistency

Estimator $\hat \Theta_m$ is said to be strongly consistent if it converges to $\theta^\ast$ almost surely, i.e. $\forall \epsilon > 0$:

\[\mathbb{P} \big ( \underset{m \to \infty}{\lim} \lvert \hat \Theta_m - \theta^\ast \rvert \leq \epsilon \big ) = 1\]
  • Almost sure convergence can also be written $\hat \Theta_m \overset{\text{a.s.}}{\rightarrow} \theta^\ast$

Note:

  • $L_1$ consistency $\Leftrightarrow$ weak consistency
  • mean square consistency $\Rightarrow$ weak consistency
  • $L_2$ consistency $\Rightarrow$ weak consistency
  • strong consistency $\Rightarrow$ weak consistency
  • But I am not yet sure how mean square consistency, $L_2$ consistency, and strong consistency rank against each other
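
Weak consistency is easy to see empirically: for the sample mean of $\operatorname{Bern}(\theta^\ast)$ data, $\mathbb{P}(\lvert \hat \Theta_m - \theta^\ast \rvert > \epsilon)$ shrinks as $m$ grows. A sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star, eps, trials = 0.5, 0.05, 20_000

for m in [10, 100, 1000, 10_000]:
    est = rng.binomial(1, theta_star, size=(trials, m)).mean(axis=1)
    print(m, np.mean(np.abs(est - theta_star) > eps))  # -> 0 as m grows
```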

5.5 Efficiency (of Unbiased Estimators)

Assume $\hat \Theta_m$ is an unbiased estimator of $\theta^\ast$, and that the model's density function is $f(x; \theta^\ast)$.

Then $\operatorname{Var}(\hat \Theta_m)$ obeys the Cramér–Rao bound (CRB):

\[\operatorname{Var}(\hat \Theta_m) \geq \frac{1}{I(\theta^\ast)}\]

where the Fisher Information $I(\theta^\ast)$ is defined by:

\[I(\theta^\ast) = \mathbb{E} \left[ \left( \frac{\partial \log f(x;\theta^\ast)}{\partial \theta^\ast} \right)^2 \right] = - \mathbb{E} \left[ \frac {\partial^2 \log f(x;\theta^\ast)}{\partial {\theta^\ast}^2} \right]\]
  • Strictly speaking, Fisher information measures the model, so it should treat $\theta^\ast$ as a variable $\theta$; $I(\theta)$ is thus a function, and $I(\theta^\ast)$ is its concrete value at $\theta = \theta^\ast$
  • For more, see Chapter 8 - Fisher Information

The efficiency of $\hat \Theta_m$ is defined as:

\[e(\hat \Theta_m) = \frac{1}{\operatorname{Var}(\hat \Theta_m) I(\theta^\ast)} \leq 1\]

Given two unbiased estimators $\hat \Theta_m, \hat \Theta_m'$, if $\operatorname{Var}(\hat \Theta_m) < \operatorname{Var}(\hat \Theta_m') \iff e(\hat \Theta_m) > e(\hat \Theta_m')$, we say $\hat \Theta_m$ is more efficient than $\hat \Theta_m'$.

Since both are unbiased, $\operatorname{MSE} = \operatorname{Var}$. Intuitively, a more efficient estimator may obtain lower generalization error for a fixed number of samples, $m$, or equivalently may require fewer examples to obtain a fixed level of generalization error. (In practice, though, unbiased estimators are almost never available in machine learning, so this intuition rarely applies.)
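
A sketch of the CRB in action, assuming $m$ i.i.d. $\operatorname{Bern}(\theta^\ast)$ observations, for which the sample Fisher information is $m / \big( \theta^\ast (1 - \theta^\ast) \big)$; the sample mean is unbiased and attains the bound, i.e. its efficiency is $\approx 1$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star, m, trials = 0.3, 50, 200_000

# CRB for m i.i.d. Bernoulli(theta*) observations: theta* (1 - theta*) / m
crb = theta_star * (1 - theta_star) / m

est = rng.binomial(1, theta_star, size=(trials, m)).mean(axis=1)  # unbiased
print(est.var(), crb)  # nearly equal: the sample mean is efficient
```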

6. Empirical Distribution / Resampling

6.1 EDF $\hat{F}_{\mathbb{P}_{X}}$

Let $X_1, \dots, X_m$ be i.i.d. real random variables with the common distribution function $F_{\mathbb{P}_{X}}$. Then the empirical distribution function (EDF) is defined as:

\[\hat{F}_{m}(a) = \hat{F}_{\mathbb{P}_{X}}(a) = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}_{X_{i} \leq a}\]

where $\mathbf{1}_{A}$ is the indicator of event $A$.

  • Note that $F_{\mathbb{P}_{X}}(a) = \mathbb{P}(X \leq a)$

For a fixed $a$, the indicator $\mathbf{1}_{X_{i} \leq a}$ is a Bernoulli random variable with parameter $p = F_{\mathbb{P}_{X}}(a)$; hence $m\hat{F}_{m}(a)$ is a binomial random variable with mean $mF_{\mathbb{P}_{X}}(a)$ and variance $m F_{\mathbb{P}_{X}}(a) \big( 1 − F_{\mathbb{P}_{X}}(a) \big)$. This implies that $\hat{F}_{m}(a)$ is an unbiased estimator for $F_{\mathbb{P}_{X}}(a)$.

Moreover, $\forall a$: $\hat{F}_{m}(a) \overset{\text{a.s.}}{\rightarrow} F_{\mathbb{P}_{X}}(a)$ (by the strong law of large numbers).
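
A sketch of this pointwise convergence, assuming a standard normal population (with scipy's `norm.cdf` as the reference $F_{\mathbb{P}_{X}}$):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=5000)  # sample X_1, ..., X_m

def edf(sample, a):
    """F_hat_m(a) = (1/m) * #{ i : X_i <= a }"""
    return np.mean(sample <= a)

for a in [-1.0, 0.0, 1.0]:
    print(a, edf(x, a), norm.cdf(a))  # EDF values track the true CDF
```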

6.2 What does an Empirical Distribution represent?

We have discussed the EDF at length, but what exactly is the empirical distribution? I think joriki's answer, StackExchange: What does an empirical distribution represent? puts it most succinctly:

The empirical distribution is the distribution you’d get if you sampled from your sample instead of the whole population.

whuber's answer, StackExchange: What does an empirical distribution represent? explains why we use the empirical distribution in the first place:

The intuition is that if your observations are representative of the original population (that is, of the set of tickets in the original box), then you can study the EDF to learn how to make inferences about the contents of a box based on a sample of it.

In short: we use the sample's empirical distribution to simulate, approximate, and stand in for the population's true (but unknown) distribution.

6.3 Resampling / CV

In machine learning, we use the training dataset's empirical distribution to approximate and stand in for the population's true (but unknown) distribution, which naturally raises two questions:

  • Is it one-sided to use a single training dataset from start to finish?
  • What if this one training dataset's empirical distribution is far from the true distribution?

So we decide to inject some randomness into the training dataset, which gives us CV (cross-validation). And CV is essentially resampling:

  • resampling: obtaining different samples
  • CV: obtaining different training datasets

6.3.1 Jackknife Resampling / Leave-one-out CV

From Introduction to resampling methods, 453 Bootstrap by Rozenn Dahyot:

The Jackknife samples are computed by leaving out one observation $x_i$ from $\mathbf{x} = (x_1, x_2, \dots, x_n)$ at a time:

\[\mathbf{x}_{(i)} = (x_1, x_2, \dots, x_{i−1}, x_{i+1}, \dots, x_n)\]

Clearly, jackknife resampling and LOOCV are one and the same.
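
A sketch of generating the jackknife samples, one leave-one-out copy per observation:

```python
import numpy as np

def jackknife_samples(x):
    """All n leave-one-out samples x_(i), exactly the LOOCV training sets."""
    x = np.asarray(x)
    return [np.delete(x, i) for i in range(len(x))]

for xi in jackknife_samples([3.1, 2.7, 4.0, 3.6]):
    print(xi)
```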

6.3.2 $k$-fold CV

$k$-fold CV can be seen as a (very rigid) way of resampling the data without replacement.
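
A sketch of that "rigid resampling" view: shuffle the indices once, then cut them into $k$ disjoint folds, so every example lands in exactly one test fold:

```python
import numpy as np

def k_fold_indices(n, k, seed=0):
    """Partition {0, ..., n-1} into k disjoint folds (no replacement)."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

for fold, test_idx in enumerate(k_fold_indices(n=10, k=5)):
    print(fold, sorted(test_idx))
```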

6.3.3 Bootstrap Resampling (Bootstrapping)

From Bootstrap, Jackknife and other resampling methods, 453 Bootstrap by Rozenn Dahyot:

A bootstrap sample $\mathbf{x}^{\ast} = (x_{1}^{\ast}, x_{2}^{\ast}, \cdots, x_{n}^{\ast})$ is obtained by randomly sampling $n$ times, with replacement, from the original data points $\mathbf{x}=(x_1, x_2, \cdots, x_n)$.

Suppose all $x_i$ are different. The probability that a particular $x_i$ is left out of the resample $\mathbf{x}^{\ast} = (x_{1}^{\ast}, x_{2}^{\ast}, \cdots, x_{n}^{\ast})$ is:

\[\mathbb{P}(x_{j}^{\ast} \neq x_i, 1 \leq j \leq n) = (1 - \frac{1}{n})^n\]

since $\mathbb{P}(x_{j}^{\ast} = x_i) = \frac{1}{n}$.

When $n \to \infty$, $(1 - \frac{1}{n})^n$ converges to $e^{-1} \approx 0.37$.

In other words, when $n$ is large enough, a bootstrap sample contains roughly $63\% \times n$ unique data points (the remaining $37\%$ of draws are duplicates, since we sample with replacement).
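
A quick simulation of the $1 - e^{-1} \approx 0.632$ figure, resampling indices with replacement and counting the unique ones:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 2000

frac_unique = np.mean([
    len(np.unique(rng.integers(0, n, size=n))) / n  # one bootstrap resample
    for _ in range(trials)
])
print(frac_unique, 1 - np.exp(-1))  # both ~ 0.632
```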

6.3.4 Bootstrap Aggregating (Bagging)

Bootstrap resampling does not correspond directly to any form of CV, but as a resampling method it can appear directly in ensemble frameworks, e.g. bagging.

Bagging itself is quite simple:

  • Bootstrap resample $1^{\text{st}}$ time $\Rightarrow \mathbf{x}_{1}^{\ast} \Rightarrow$ classifier $C_1$
  • Bootstrap resample $2^{\text{nd}}$ time $\Rightarrow \mathbf{x}_{2}^{\ast} \Rightarrow$ classifier $C_2$
  • $\cdots$
  • Bootstrap resample $n^{\text{th}}$ time $\Rightarrow \mathbf{x}_{n}^{\ast} \Rightarrow$ classifier $C_n$
  • Aggregate $\Rightarrow$ final classifier $C = f(C_1, C_2, \dots, C_n)$

Bagging helps reduce variance.
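
A toy sketch of this recipe on 1-D data, with decision stumps as the base classifiers and a majority vote as the aggregation (data and names are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data: class 1 tends to have larger x
x = np.concatenate([rng.normal(-1, 1, 200), rng.normal(1, 1, 200)])
y = np.concatenate([np.zeros(200), np.ones(200)])

def fit_stump(x, y):
    """Pick the threshold t minimizing training error of 'predict 1 if x > t'."""
    ts = np.unique(x)
    errs = [np.mean((x > t).astype(float) != y) for t in ts]
    return ts[int(np.argmin(errs))]

# Bagging: one stump per bootstrap resample, then majority vote
thresholds = []
for _ in range(25):
    idx = rng.integers(0, len(x), size=len(x))  # bootstrap resample
    thresholds.append(fit_stump(x[idx], y[idx]))

votes = np.mean([(x > t).astype(float) for t in thresholds], axis=0)
y_hat = (votes > 0.5).astype(float)
print(np.mean(y_hat == y))  # accuracy of the aggregated classifier C
```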

7. The Likelihood Function / Sufficient Statistics

7.1 Sufficient Statistics

Lecture Notes 5, Stat 705, Larry Wasserman@CMU:

Suppose that $X_1, \dots, X_m \sim p(x; \theta^\ast)$. Statistic $T$ is sufficient for $\theta^\ast$ if the conditional distribution of $X_1, \dots, X_m \vert T$ does not depend on $\theta^\ast$. Thus, $p(x_1, \dots, x_m \vert t; \theta) = p(x_1, \dots, x_m \vert t)$.

$T$ is a Minimal Sufficient Statistic (MSS) if:

  1. $T$ is sufficient, and
  2. $\forall$ other sufficient statistic $U$, $\exists$ some function $g$ such that $T=g(U)$
    • i.e. any sufficient statistic $U$ can be transformed (or, if you like, reduced) into the MSS $T$

What does sufficiency mean?

If $T$ is sufficient, then $T$ contains all the information you need from the data to compute the likelihood function. It does not contain all the information in the data.

Note that "information" here is an abstract notion without a formal definition (so far I have not seen any information-theoretic concept tied to it). So what exactly does "all the information" mean?

  • It actually means "all the information needed to compute any estimate of the parameter"
  • In other words, if a statistic $T$ contained all the information, we could estimate using the form $g(T(X_1, \dots, X_m))$ instead of $f(X_1, \dots, X_m)$
  • So saying that a sufficient statistic $T$ does not contain all the information means that $T(X_1, \dots, X_m)$ cannot replace $X_1, \dots, X_m$ when computing estimates

Lecture Notes 6, Stat 705, Larry Wasserman@CMU gives an example:

Let $C = \lbrace c_1, \dots , c_N \rbrace$ be a finite set of constants. Let $\theta^\ast = \frac{1}{N} \sum_{i=1}^{N} c_i$. Suppose we want to estimate $\theta^\ast$. We proceed as follows. Let $S_1, \dots , S_n \sim \operatorname{Bern}(\pi)$ where $\pi$ is known. If $S_i = 1$ you can see $c_i$; otherwise, you cannot. (This is an example of survey sampling.) The likelihood function is:

\[\prod_i \pi^{S_i} (1 - \pi)^{1 - S_i}\]

The likelihood function contains no information at all. But we can still estimate $\theta^\ast$ (without this likelihood function):

\[\hat \Theta = \frac{1}{N \pi} \sum_{i=1}^{N} c_i S_i\]

and $\mathbb{E}[\hat \Theta] = \theta^\ast$.
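
A numerical check of this example: averaging $\hat \Theta = \frac{1}{N \pi} \sum_{i} c_i S_i$ over repeated draws of $S$ recovers $\theta^\ast$ (a sketch with made-up constants $c_i$ and $\pi = 0.3$):

```python
import numpy as np

rng = np.random.default_rng(0)
c = rng.uniform(0, 10, size=100)  # the fixed constants c_1, ..., c_N
N, pi = len(c), 0.3
theta_star = c.mean()

est = [(c * rng.binomial(1, pi, size=N)).sum() / (N * pi) for _ in range(20_000)]
print(np.mean(est), theta_star)  # E[\hat{Theta}] ~ theta*
```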

7.2 The Likelihood Function

Let $X^m = (X_1, \dots, X_m)$ have joint PDF $f(x^m; \theta) = f(x_1, \dots, x_m; \theta)$ where $\theta \in \Theta$. The likelihood function $L: \Theta \to [0, \infty)$ is defined by:

\[L(\theta) \equiv L(\theta; x^m) = f(x^m; \theta)\]

where $x^m$ is fixed and $\theta$ varies in $\Theta$.

Let $X^m = (X_1, \dots, X_m)$ have joint PMF $p(x^m; \theta) = p(x_1, \dots, x_m; \theta)$ where $\theta \in \Theta$. The likelihood function $L: \Theta \to [0, 1]$ is defined by:

\[L(\theta) \equiv L(\theta; x^m) = p(x^m; \theta)\]

where $x^m$ is fixed and $\theta$ varies in $\Theta$.
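
A concrete PMF example: for i.i.d. coin flips, $L(\theta) = \prod_i \theta^{x_i} (1-\theta)^{1-x_i}$ with the data $x^m$ held fixed and $\theta$ varying. A sketch:

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1])  # fixed observed data x^m (coin flips)

def L(theta):
    """Likelihood of i.i.d. Bernoulli data, viewed as a function of theta."""
    return np.prod(theta ** x * (1 - theta) ** (1 - x))

for theta in [0.2, 0.5, 4/6, 0.9]:
    print(theta, L(theta))  # largest at the sample mean 4/6
```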

Wikipedia: Likelihood function:

  • In informal contexts, likelihood is often used as a synonym for probability.
  • In statistics, the two terms have different meanings.
    • Probability is used to describe the plausibility of some data, given a value for the parameter.
    • Likelihood is used to describe the plausibility of a value for the parameter, given some data.
    • A very philosophical "two sides of the same coin" idea

7.3 The Likelihood Function is a Minimal Sufficient Statistic

The first question: why is the likelihood function a statistic at all? The confusion mainly comes from two places:

  1. the definition of statistic itself, since a statistic can refer to either a function or a value
  2. the likelihood function being a parametric statistic (varying with $\theta$)
  • If we define the likelihood as $L(\theta) \equiv L(\theta; x^m)$, it is a parametric statistic value
  • If we view the likelihood as $L(\theta; X^m)$, it is a parametric statistic function

That the likelihood function is sufficient is fairly intuitive; the next step is to show that it is an MSS. This requires the help of the Factorization Theorem:

Suppose the sample $X_1, \dots, X_m$ has joint PDF/PMF $f(x_1, \dots, x_m; \theta)$. Statistic $T=t(X_1, \dots, X_m)$ is sufficient for $\theta^\ast \iff f(x_1, \dots, x_m; \theta)$ can be factored into two components in the form of:

\[f(x_1, \dots , x_m;\theta) = \phi [ t(x_1, \dots , x_m);\theta ] h(x_1, \dots , x_m)\]

where:

  • function $\phi$ depends on the data $x_1, \dots, x_m$ ONLY through statistic $t(x_1, \dots , x_m)$, and
  • function $h$ does not depend on $\theta$
\[\begin{align} & \because L(\theta) \equiv f(x_1, \dots , x_m;\theta) \newline & \therefore \forall \text{ sufficient statistic } T, \exists \phi, h \text{ such that } L(\theta) = \phi [ t(x_1, \dots , x_m);\theta ] \, h(x_1, \dots , x_m) \newline & \therefore \text{up to the factor } h \text{, which is free of } \theta \text{, } L(\theta) \text{ depends on the data only through any sufficient } t(x_1, \dots , x_m) \newline & \therefore L(\theta) \text{ is minimal sufficient} \end{align}\]

Now look back at the meaning of sufficiency:

  • If $T$ is sufficient, then $T$ contains all the information you need from the data to compute the likelihood function.
    • This is now easy to understand, since any sufficient statistic $T$ can be reduced to the MSS $L(\theta)$
  • … It does not contain all the information in the data.
    • In other words, $L(\theta)$ does not contain all the information either
    • Some textbooks claim that $L(\theta)$ (roughly) contains all the information; this is a loose (I would say outright incorrect) statement
      • Even so, there are still plenty of settings where the parameter can be estimated via $L(\theta)$

8. Connections between Estimation and Machine Learning

8.1 Unsupervised / Supervised Learning

Suppose the population $X^\ast$ has true distribution $\mathbb{P}^\ast$, and we observe a random vector $X = (X_1, \dots, X_m)$. Then (roughly speaking) unsupervised learning can be understood as estimating $\mathbb{P}^\ast$ with $\hat{\mathbb{P}}_{X}$, i.e.:

\[\hat{\mathbb{P}}_{X} = \hat{\mathbb{P}}_m \overset{?}{\to} \mathbb{P}^\ast\]

Suppose the population $Y^\ast \vert X^\ast$ has true distribution $\mathbb{P}^\ast$, and we observe a random vector $X = (X_1, \dots, X_m)$ together with an associated value vector $Y = (Y_1, \dots, Y_m)$. Then (roughly speaking) supervised learning can be understood as estimating $\mathbb{P}^\ast$ with $\hat{\mathbb{P}}_{Y \mid X}$, i.e.:

\[\hat{\mathbb{P}}_{Y \mid X} = \hat{\mathbb{P}}_m \overset{?}{\to} \mathbb{P}^\ast\]

8.2 Training Error / Test Error / Underfitting / Overfitting

We describe a dataset with a design matrix, i.e. $m$ rows (examples; random variables) by $n$ columns (features). For a given problem, we say all datasets are generated by a data generating process according to a single data generating distribution $\mathbb{P}_\text{data}$ (i.e. $\mathbb{P}^\ast$). We then assume that however you split the training dataset $X^{\text{train}}$ and test dataset $X^{\text{test}}$:

  • $X_1^\text{train}, \dots, X_m^\text{train}$ are i.i.d. (with distribution $\mathbb{P}_\text{data}$)
  • $X_1^\text{test}, \dots, X_{m'}^\text{test}$ are i.i.d. (with distribution $\mathbb{P}_\text{data}$)

When evaluating a machine learning model, we usually pick an "error measure", say $\operatorname{MSE}$; on the training dataset this gives the training error, and on the test dataset the test error. If we first fixed some $\theta^\dagger$ (not necessarily giving $\mathbb{P}_\text{data}$) and then sampled both a training dataset and a test dataset from $\mathbb{P}_{\theta^\dagger}$, the expected training error should equal the expected test error, e.g.:

\[\operatorname{MSE}(\hat{\theta}_{X^\text{train}}) = \operatorname{MSE}(\hat{\theta}_{X^\text{test}}) = \operatorname{MSE}(\theta^\dagger)\]
  • Note that here we use the $\operatorname{MSE}$ of an estimate, not of an estimator

But that is not how learning works; instead, we first estimate $\hat{\Theta}_{X^\text{train}}$ and then plug the values of $X^\text{test}$ into it:

\[\begin{aligned} \text{training error} &= \operatorname{MSE} \big( \hat{\Theta}_{X^\text{train}}(x^\text{train}) \big) = \operatorname{MSE}(\hat{\theta}_{X^\text{train}}) \\ \text{test error} &= \operatorname{MSE} \big( \hat{\Theta}_{X^\text{train}}(x^\text{test}) \big) \neq \operatorname{MSE}(\hat{\theta}_{X^\text{test}}) \end{aligned}\]
  • This mismatch in $\hat{\Theta}_{X^\text{train}}(x^\text{test})$ is the root of generalization error
    • You may notice a dimension problem here: if $\hat{\Theta}_{X^\text{train}} = f(X_1^\text{train}, \dots, X_m^\text{train})$, how can $X^\text{test} = (X_1^\text{test}, \dots, X_{m'}^\text{test})$, which has $m'$ entries, be plugged into $f$? The trick is that we can define $f$ as a function of a single random vector rather than of a fixed number of random variables
  • And by construction, $\hat{\Theta}_{X^\text{train}}$ only minimizes the $\operatorname{MSE}$ on $x^\text{train}$, so in general $\text{training error} \leq \text{test error}$ (proof skeleton)
    • Underfitting: $\text{training error}$ high, $\text{test error}$ high
    • Overfitting: $\text{training error}$ low, $\text{test error}$ high
    • Good fit: $\text{training error}$ low, $\text{test error}$ slightly higher
    • Unknown fit: $\text{training error}$ high, $\text{test error}$ low
  • One reason overfitting occurs may be that the $\operatorname{MSE}$ we minimize is not the error between $\hat{\Theta}_{X^\text{train}}$ and $\theta^\ast$, since $\theta^\ast$ is unknown. I would say that in actual computation we use the parameter of the empirical distribution, $\theta^\ast_{X^\text{train}} := \theta^\mathbf{1}_{X^\text{train}}$, so the final relationship may be:
\[\hat{\Theta}_{X^\text{train}} \hat= \theta^{\mathbf{1}}_{X^\text{train}} \overset{?}{\rightarrow} \theta^\ast\]
  • So beyond a certain point, getting closer to $\theta^{\mathbf{1}}_{X^\text{train}}$ may take you further away from $\theta^\ast$

8.3 Bias-Variance Tradeoff

Here I just want to point out that this tradeoff really has little to do with this identity:

\[\operatorname{risk} = \operatorname{MSE} = \operatorname{bias}^2 + \operatorname{Var}\]

The bias-variance tradeoff actually weighs different types of models against each other (high-variance-low-bias vs low-variance-high-bias); once the model changes, the estimator changes, and so does the $\operatorname{MSE}$. It is not as if you can hold a specific $\operatorname{MSE}$ fixed and somehow make $\operatorname{bias}$ and $\operatorname{Var}$ vary in a negatively correlated way.

Why high-variance-low-bias vs low-variance-high-bias? Because those two are the norm: high-variance-high-bias means the model itself is broken, and low-variance-low-bias is already the optimum, so there is nothing left to trade off.

I have written a post, Hypothesis Space / Underfitting / Overfitting / Bias / Variance, for reference.

The general pattern:

  • conservative (less complex) models are generally low-variance-high-bias, and prone to underfitting
  • flexible (more complex) models are generally high-variance-low-bias, and prone to overfitting

8.4 Bayes Error

A common practice is to assume a linear model $Y = \mathbf w X + \epsilon$; this amounts to assuming that even under the true distribution $\mathbb{P}^\ast$ there may still be some noise. This error is called the Bayes error.

Sometimes, for the convenience of further analysis, we additionally assume $\epsilon \sim \mathcal{N}(0, 1)$.

8.5 Estimators used in Machine Learning

Everything above was about "using general estimation concepts to understand the general behavior of machine learning"; what follows is about "how estimation is concretely done in machine learning".

In general, performing XXX estimation gives you the corresponding XXX estimator. I have always found it easier to read MLE as Maximum Likelihood Estimator, but whatever.

There are many kinds of estimation, and you need to distinguish the abstract estimator frameworks from the concrete estimators, just as some estimations should be understood as interfaces and others as implementations.

The major estimation frameworks are:

  1. Frequentist Statistics
    • MLE, Maximum Likelihood Estimation
      • Equivalent to minimizing the KL divergence $D_{KL}(\hat{\mathbb{P}}_{\text{data}} \Vert \mathbb{P}_{\text{model}})$
      • Equivalent to minimizing the cross-entropy $H(\hat{\mathbb{P}}_{\text{data}}, \mathbb{P}_{\text{model}})$
        • When $\mathbb{P}_{\text{model}}$ is Gaussian, this is equivalent to minimizing $\operatorname{MSE}$
        • i.e. $\operatorname{MSE}$ is the cross-entropy between the empirical distribution and a Gaussian model.
  2. Bayesian Statistics
    • MAP, Maximum A Posteriori (Estimation)

8.5.1 MLE

In this subsection we use the following notation:

  • True but unknown data generating distribution $\mathbb{P}_\text{data}(\mathbf{x})$ (the estimand, i.e. $\mathbb{P}^\ast$, corresponding to $\theta^\ast$)
    • We never actually use this distribution during estimation; it is listed only as the counterpart of $\hat{\mathbb{P}}_\text{data}$
  • Distribution of the statistical model $\mathbb{P}_\text{model}(\mathbf{x}; \theta)$ (the estimator)
  • Training data $\mathbb{X} = \lbrace x^{(1)}, \dots, x^{(m)} \rbrace$ (assumed to be drawn independently from $\mathbb{P}_\text{data}$)
  • Empirical distribution of the training data $\hat{\mathbb{P}}_\text{data}(\mathbf{x})$ (approximation of the estimand)
\[\begin{aligned} \theta_{\text{MLE}} &= \underset{\theta}{\arg \max } \, \mathbb{P}_{\text{model}}(\mathbb{X}; \theta) \newline &= \underset{\theta}{\arg \max } \prod_{i=1}^{m} \mathbb{P}_{\text{model}} (x^{(i)}; \theta) \,\,\,\, \text{(i.i.d.)} \iff \underset{\theta}{\arg \max } \prod_{i=1}^{m} L_{\text{model}} (\theta; x^{(i)}) \newline &= \underset{\theta}{\arg \max } \sum_{i=1}^{m} \log \mathbb{P}_{\text{model}} (x^{(i)} ; \theta) \,\,\,\, \text{(which also avoids underflow)} \newline &= \underset{\theta}{\arg \max } \, \frac{1}{m} \sum_{i=1}^{m} \log \mathbb{P}_{\text{model}} (x^{(i)} ; \theta) \newline &= \underset{\theta}{\arg \max } \, \mathbb{E}_{\mathbf x \sim \hat{\mathbb{P}}_\text{data}} \left[ \log \mathbb{P}_{\text{model}} (\mathbf x; \theta) \right] \newline &= \underset{\theta}{\arg \min } \, - \mathbb{E}_{\mathbf x \sim \hat{\mathbb{P}}_\text{data}} \left[ \log \mathbb{P}_{\text{model}} (\mathbf x; \theta) \right] \newline &= \underset{\theta}{\arg \min } \, H(\hat{\mathbb{P}}_\text{data}, \mathbb{P}_{\text{model}}(\theta)) \,\,\,\, \text{(cross-entropy)} \newline &= \underset{\theta}{\arg \min } \, -H(\hat{\mathbb{P}}_\text{data}) + H(\hat{\mathbb{P}}_\text{data}, \mathbb{P}_{\text{model}}(\theta)) \,\,\,\, \text{(the entropy of } \hat{\mathbb{P}}_\text{data} \text{ does not depend on } \theta \text{)} \newline &= \underset{\theta}{\arg \min } \, D_{KL}(\hat{\mathbb{P}}_{\text{data}} \Vert \mathbb{P}_{\text{model}}(\theta)) \end{aligned}\]
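
A sketch of this chain for a Bernoulli model: the average log-likelihood is (up to sign) the cross-entropy $H(\hat{\mathbb{P}}_\text{data}, \mathbb{P}_\text{model}(\theta))$, and maximizing it over a grid lands on the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.7, size=500)  # training data from P_data = Bernoulli(0.7)

thetas = np.linspace(0.01, 0.99, 981)
# Average log-likelihood = -H(P_data_hat, P_model(theta)):
avg_ll = x.mean() * np.log(thetas) + (1 - x.mean()) * np.log(1 - thetas)

# theta_MLE coincides with the sample mean (up to grid resolution):
print(thetas[np.argmax(avg_ll)], x.mean())
```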

8.5.2 MAP

Suppose we have training data $\mathbf{D} = \lbrace D_1, \dots, D_m \rbrace$.

Bayesians represent their assumption of $\theta$ using the prior probability distribution, $\mathbb{P}(\theta)$ (sometimes referred to as simply “the prior”), thus:

  • Prior $\Rightarrow$ $\mathbb{P}(\theta)$
  • Likelihood $\Rightarrow$ $\mathbb{P}(\mathbf{D} \vert \theta)$
  • Posterior $\Rightarrow$ $\mathbb{P}(\theta \vert \mathbf{D})$
\[\begin{aligned} \because & \mathbb{P}(\theta \vert \mathbf{D}) = \frac{\mathbb{P}(\mathbf{D} \vert \theta) \, \mathbb{P}(\theta)}{\mathbb{P}(\mathbf{D})} \newline \therefore &\mathbb{P}(\theta \vert \mathbf{D}) \propto \mathbb{P}(\mathbf{D} \vert \theta) \, \mathbb{P}(\theta) \newline \therefore & \underset{\theta}{\arg \max } \, \mathbb{P}(\theta \vert \mathbf{D}) = \underset{\theta}{\arg \max } \, \mathbb{P}(\mathbf{D} \vert \theta) \, \mathbb{P}(\theta) \end{aligned}\]

Maximum A Posteriori (Estimation) is really just maximum posterior estimation; we just don't call it MPE but gave it the fancier name MAP:

\[\begin{aligned} \theta_{\text{MAP}} &= \underset{\theta}{\arg \max } \, \mathbb{P}(\theta \vert \mathbf{D}) \newline &= \underset{\theta}{\arg \max } \, \mathbb{P}(\mathbf{D} \vert \theta) \, \mathbb{P}(\theta) \newline &= \underset{\theta}{\arg \max } \, \log \mathbb{P}(\mathbf{D} \vert \theta) + \log \mathbb{P}(\theta) \end{aligned}\]
  • MAP with a Gaussian prior on the weights thus corresponds to weight decay.
  • Conversely, many regularized estimation strategies, such as maximum likelihood learning regularized with weight decay, can be interpreted as making the MAP approximation to Bayesian inference.
    • This view applies when the regularization consists of adding an extra term to the objective function that corresponds to $\log \mathbb{P}(\theta)$.
  • Compared with MLE, it increases bias and decreases variance
    • This is easy to see once you connect it with weight decay (see the sketch below)
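
A sketch of the connection in the simplest setting, $x_i \sim \mathcal{N}(\theta, \sigma^2)$ with prior $\theta \sim \mathcal{N}(0, \tau^2)$ (all values assumed for illustration); maximizing $\log \mathbb{P}(\mathbf{D} \vert \theta) + \log \mathbb{P}(\theta)$ has a closed form that shrinks the MLE toward the prior mean $0$, which is exactly the weight-decay effect:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, tau = 1.0, 0.5               # likelihood sd; prior theta ~ N(0, tau^2)
x = rng.normal(2.0, sigma, size=20)

theta_mle = x.mean()
# argmax_theta [log P(D|theta) + log P(theta)] in closed form:
theta_map = x.sum() / (len(x) + sigma**2 / tau**2)

print(theta_mle, theta_map)  # MAP is pulled toward 0 (weight decay / shrinkage)
```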

8.5.3 MLE vs MAP

\[\begin{aligned} \theta_{\text{MLE}} &= \underset{\theta}{\arg \max } \, \mathbb{P}(\mathbf{D}; \theta) \newline \theta_{\text{MAP}} &= \underset{\theta}{\arg \max } \, \mathbb{P}(\mathbf{D} \vert \theta) \, \mathbb{P}(\theta) \end{aligned}\]

From the standpoint of the expression itself, given some $f(\mathbf{D}, \theta)$, you can read it either as $\mathbb{P}(\mathbf{D}; \theta)$ or as $\mathbb{P}(\mathbf{D} \vert \theta)$.

  • To emphasize again: these two distributions have clearly different probabilistic meanings, but that does not prevent them from sharing the same expression

So under the premise $\mathbb{P}(\mathbf{D}; \theta) = \mathbb{P}(\mathbf{D} \vert \theta)$, if $\theta \sim \operatorname{Uniform}(a, b)$, i.e. $\mathbb{P}(\theta)$ is a constant, then we can regard $\theta_{\text{MLE}} = \theta_{\text{MAP}}$.
