The Pennsylvania State University, Spring 2021 Stat 415-001, Hyebin Song
Contents
- Point Estimation
  - Introduction to Point Estimation: Learning objectives; Recap; The bias and mean squared error of point estimators
  - Method of Moments Estimation: Learning objectives; Procedures to obtain Method of Moments (MoM) estimators of $\theta$
  - Maximum Likelihood Estimation: Learning objectives; Procedures to obtain Maximum Likelihood Estimators (MLE) of $\theta$
  - Properties of Point Estimators: Learning objectives; Finite sample properties; Large sample (asymptotic) properties; Properties of MoM estimators; Properties of MLEs
  - Sufficient Statistics and Rao-Blackwellization: Learning Objectives; Sufficient statistics and the factorization theorem; Rao-Blackwellization
Introduction to Point Estimation
Setting: $X_1, \dots, X_n \sim f(x; \theta)$, i.i.d., where $\theta$ is an unknown value in the parameter space $\Omega$.
Recall the definitions: a point estimator of $\theta$ is a statistic $\hat{\theta} = u(X_1, \dots, X_n)$, i.e., a function of the sample; the value $u(x_1, \dots, x_n)$ computed from the observed data is the corresponding point estimate.
Notation: we write $\hat{\theta}$ for both the estimator and the estimate when the meaning is clear from context.
Recall that for a continuous random variable $X$ with a pdf $f$, the probability of an event $\{X \in A\}$, the expectation of $X$, and the variance of $X$ are computed as
$$P(X \in A) = \int_A f(x)\,dx, \qquad E[X] = \int_{-\infty}^{\infty} x f(x)\,dx, \qquad \mathrm{Var}(X) = \int_{-\infty}^{\infty} (x - E[X])^2 f(x)\,dx.$$
Note: for a discrete random variable $X$, we can replace the pdf with a pmf and the integrals with summations.
We sometimes use subscripts and write $E_\theta[\cdot]$ and $\mathrm{Var}_\theta(\cdot)$ to emphasize that the expectation and variance are computed using the pdf $f(x; \theta)$ with a particular value of $\theta$.
Definition (Biased and unbiased estimators) An estimator $\hat{\theta}$ is an unbiased estimator of $\theta$ if $E_\theta[\hat{\theta}] = \theta$ for all $\theta \in \Omega$. Otherwise, $\hat{\theta}$ is a biased estimator, with bias $\mathrm{Bias}_\theta(\hat{\theta}) = E_\theta[\hat{\theta}] - \theta$.
Example Suppose we have $X_1, \dots, X_n \sim N(\theta, 1)$, i.i.d., where the parameter space $\Omega = \mathbb{R}$. We consider the following three estimators:
$$\hat{\theta}_1 = \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i, \qquad \hat{\theta}_2 = X_1, \qquad \hat{\theta}_3 = \frac{1}{n+1}\sum_{i=1}^n X_i.$$
Are the three estimators unbiased?
For any $\theta \in \Omega$,
- we have $E_\theta[\hat{\theta}_1] = \frac{1}{n}\sum_{i=1}^n E_\theta[X_i] = \theta$. Therefore $\hat{\theta}_1$ is an unbiased estimator of $\theta$.
- $E_\theta[\hat{\theta}_2] = E_\theta[X_1] = \theta$. Therefore $\hat{\theta}_2$ is an unbiased estimator of $\theta$.
- $E_\theta[\hat{\theta}_3] = \frac{n}{n+1}\theta \neq \theta$ (unless $\theta = 0$). Therefore $\hat{\theta}_3$ is not an unbiased estimator of $\theta$.
Mean Squared Error (MSE)
Definition: the mean squared error of an estimator $\hat{\theta}$ is $\mathrm{MSE}(\hat{\theta}) = E_\theta[(\hat{\theta} - \theta)^2]$.
MSE captures both bias and variance:
$$\mathrm{MSE}(\hat{\theta}) = \mathrm{Var}_\theta(\hat{\theta}) + \left[\mathrm{Bias}_\theta(\hat{\theta})\right]^2.$$
Proof. Writing $\hat{\theta} - \theta = (\hat{\theta} - E_\theta[\hat{\theta}]) + (E_\theta[\hat{\theta}] - \theta)$ and expanding the square,
$$E_\theta[(\hat{\theta} - \theta)^2] = E_\theta[(\hat{\theta} - E_\theta[\hat{\theta}])^2] + (E_\theta[\hat{\theta}] - \theta)^2 = \mathrm{Var}_\theta(\hat{\theta}) + \left[\mathrm{Bias}_\theta(\hat{\theta})\right]^2,$$
where the cross term vanishes because $E_\theta[\hat{\theta} - E_\theta[\hat{\theta}]] = 0$.
In particular, if an estimator $\hat{\theta}$ is unbiased, $\mathrm{MSE}(\hat{\theta}) = \mathrm{Var}_\theta(\hat{\theta})$.
Example (cont'd) Compute the MSE of $\hat{\theta}_1$, $\hat{\theta}_2$, $\hat{\theta}_3$. Since $\hat{\theta}_1$ and $\hat{\theta}_2$ are unbiased, $\mathrm{MSE}(\hat{\theta}_1) = \mathrm{Var}_\theta(\bar{X}) = 1/n$ and $\mathrm{MSE}(\hat{\theta}_2) = \mathrm{Var}_\theta(X_1) = 1$. For the biased estimator, $\mathrm{MSE}(\hat{\theta}_3) = \mathrm{Var}_\theta(\hat{\theta}_3) + \mathrm{Bias}_\theta(\hat{\theta}_3)^2 = \frac{n}{(n+1)^2} + \frac{\theta^2}{(n+1)^2}$.
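These bias/MSE calculations are easy to check by simulation. Below is a minimal Monte Carlo sketch in Python, assuming the three estimators written above ($\bar{X}$, $X_1$, and $\frac{1}{n+1}\sum_i X_i$ under $N(\theta, 1)$); the sample size, seed, and replication count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta, reps = 10, 2.0, 100_000

# Draw `reps` samples of size n from N(theta, 1) at once.
X = rng.normal(loc=theta, scale=1.0, size=(reps, n))

est1 = X.mean(axis=1)           # theta_hat_1 = sample mean
est2 = X[:, 0]                  # theta_hat_2 = first observation
est3 = X.sum(axis=1) / (n + 1)  # theta_hat_3 = sum / (n + 1)

for name, est in [("theta_hat_1", est1), ("theta_hat_2", est2), ("theta_hat_3", est3)]:
    bias = est.mean() - theta
    mse = np.mean((est - theta) ** 2)
    print(f"{name}: bias ~ {bias:+.4f}, MSE ~ {mse:.4f}")

# Expected (up to Monte Carlo error): biases ~ 0, 0, -theta/(n+1);
# MSEs ~ 1/n, 1, (n + theta^2)/(n+1)^2.
```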
Method of Moments Estimation
Setting: $X_1, \dots, X_n \sim f(x; \theta)$, i.i.d., where $\theta$ is an unknown value in the parameter space $\Omega$.
Idea: Substitution principle. The population moments $E_\theta[X^k]$ are functions of $\theta$; equate them to the sample moments $\frac{1}{n}\sum_{i=1}^n X_i^k$ and solve for $\theta$.
Example. $X_1, \dots, X_n \sim \mathrm{Poisson}(\lambda)$, i.i.d. We want to estimate $\lambda$.
The first population moment is $E_\lambda[X_1] = \lambda$. Substituting the sample moment for the population moment gives the MoM estimator $\hat{\lambda} = \bar{X}$.
Example $X_1, \dots, X_n \sim N(\mu, \sigma^2)$, i.i.d. We want to estimate $(\mu, \sigma^2)$.
1. Write the population moments as functions of $(\mu, \sigma^2)$:
$$E[X_1] = \mu, \qquad E[X_1^2] = \sigma^2 + \mu^2.$$
2. Solve (1) with respect to $(\mu, \sigma^2)$:
- $\mu = E[X_1]$, $\sigma^2 = E[X_1^2] - (E[X_1])^2$.
3. Substitute population moments with sample moments:
Therefore,
$$\hat{\mu} = \bar{X}, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n X_i^2 - \bar{X}^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2.$$
Remark: The Method of Moments estimator $\hat{\mu} = \bar{X}$ of $\mu$ is unbiased, but the Method of Moments estimator $\hat{\sigma}^2$ of $\sigma^2$ is biased, since $E[\hat{\sigma}^2] = \frac{n-1}{n}\sigma^2 \neq \sigma^2$.
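A small simulation, sketched below, illustrates the remark: across many replications, the MoM estimator of $\mu$ averages to $\mu$, while the MoM estimator of $\sigma^2$ averages to about $\frac{n-1}{n}\sigma^2$. The distribution parameters and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, mu, sigma2, reps = 5, 0.0, 4.0, 200_000

X = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
mu_mom = X.mean(axis=1)                                    # MoM estimator of mu
sigma2_mom = np.mean((X - mu_mom[:, None]) ** 2, axis=1)   # MoM estimator of sigma^2

print("E[mu_mom]     ~", mu_mom.mean())      # ~ mu = 0 (unbiased)
print("E[sigma2_mom] ~", sigma2_mom.mean())  # ~ (n-1)/n * sigma^2 = 3.2 (biased)
```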
Maximum Likelihood Estimation
Understand how to compute maximum likelihood estimators (MLE)
Understand invariance property of MLE
Setting: $X_1, \dots, X_n \sim f(x; \theta)$, i.i.d., where $\theta$ is an unknown value in the parameter space $\Omega$.
Idea: Choose the value of $\theta$ that is most likely to have given rise to the observed data.
Example: Suppose $X_1, X_2, X_3 \sim \mathrm{Bernoulli}(p)$, i.i.d. We have the observed sample $(x_1, x_2, x_3) = (1, 1, 0)$. Suppose the parameter space is $\Omega = \{1/4, 3/4\}$ (so the parameter space contains only 2 values).
Given each $p \in \Omega$, what is the probability of observing $(1, 1, 0)$?
- when $p = 1/4$, $P_p(X_1 = 1, X_2 = 1, X_3 = 0) = (1/4)^2 (3/4) = 3/64$,
- when $p = 3/4$, $P_p(X_1 = 1, X_2 = 1, X_3 = 0) = (3/4)^2 (1/4) = 9/64$.
Thus, since $9/64 > 3/64$, $\hat{p} = 3/4$ is the maximum likelihood estimate of $p$.
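The two-point maximization can be carried out mechanically: evaluate the probability of the observed data at each candidate and keep the larger. A minimal Python sketch, using the sample and two-point parameter space reconstructed above:

```python
import numpy as np

x = np.array([1, 1, 0])       # observed Bernoulli sample (as in the example above)
candidates = [0.25, 0.75]     # the two-point parameter space

def likelihood(p, x):
    """Probability of observing the sample x under Bernoulli(p)."""
    return np.prod(p ** x * (1 - p) ** (1 - x))

for p in candidates:
    print(f"p = {p}: P(data) = {likelihood(p, x):.5f}")

# The MLE is the candidate giving the larger probability of the observed data.
p_hat = max(candidates, key=lambda p: likelihood(p, x))
print("MLE:", p_hat)          # 0.75, matching 9/64 > 3/64
```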
Remark: in general, the parameter space $\Omega$ is not a finite set, and we choose $\hat{\theta}$ by maximizing the probability of the observed data over all $\theta \in \Omega$. This motivates the following definitions.
Definition (Likelihood function) Given the observed sample $x_1, \dots, x_n$, the likelihood function is
$$L(\theta) = \prod_{i=1}^n f(x_i; \theta), \quad \theta \in \Omega,$$
i.e., the joint pdf (or pmf) of the sample viewed as a function of $\theta$.
Definition (Maximum Likelihood Estimator)
The maximum likelihood estimate $\hat{\theta}$ is a value which maximizes the likelihood function $L(\theta)$. In other words, $\hat{\theta} \in \Omega$ is such that
$$L(\hat{\theta}) \geq L(\theta), \quad \text{for any } \theta \in \Omega.$$
The maximum likelihood estimator is $\hat{\theta}(X_1, \dots, X_n)$, the same function applied to the random sample.
Since $\log$ is a strictly increasing function, the maximizer of the likelihood $L(\theta)$ is the same as the maximizer of the log-likelihood $\ell(\theta) = \log L(\theta)$. Maximizing the log-likelihood is often easier.
Example. Likelihood and log-likelihood function of $p$ based on the observed sample $x_1, \dots, x_n$, when $X_1, \dots, X_n$ i.i.d., $\mathrm{Bernoulli}(p)$.
The pmf of each $X_i$ is $f(x; p) = p^x (1-p)^{1-x}$, $x \in \{0, 1\}$.
Therefore,
$$L(p) = \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i} = p^{\sum_i x_i}(1-p)^{n - \sum_i x_i}, \qquad \ell(p) = \Big(\sum_{i=1}^n x_i\Big)\log p + \Big(n - \sum_{i=1}^n x_i\Big)\log(1-p).$$
Procedure to obtain the MLE of $\theta$:
1. Compute the log-likelihood function $\ell(\theta)$.
2. Find a maximizer of the log-likelihood function:
   a. Compute stationary points of the log-likelihood.
   b. Find a maximizer among candidate points (stationary points and boundary points).
For a given sample $x_1, \dots, x_n$, the maximum likelihood estimate is $\hat{\theta}(x_1, \dots, x_n)$. The maximum likelihood estimator is $\hat{\theta}(X_1, \dots, X_n)$, which is a random variable.
Example (Bernoulli MLE) Compute the MLE of $p$ for a sample $x_1, \dots, x_n$, when $X_1, \dots, X_n$ i.i.d., $\mathrm{Bernoulli}(p)$.
- a. Find stationary points in $(0, 1)$. Setting
$$\ell'(p) = \frac{\sum_i x_i}{p} - \frac{n - \sum_i x_i}{1 - p} = 0$$
and solving for $p$, we get the solution $p = \bar{x}$.
When $0 < \bar{x} < 1$, $p = \bar{x}$ is the unique stationary point in $(0, 1)$.
When $\bar{x} = 0$ or $\bar{x} = 1$, there exist no stationary points in $(0, 1)$.
- b. When $0 < \bar{x} < 1$, $p = \bar{x}$ is the global maximizer, since
$$\ell''(p) = -\frac{\sum_i x_i}{p^2} - \frac{n - \sum_i x_i}{(1-p)^2} < 0, \quad \text{for } p \in (0, 1),$$
so $\ell$ is strictly concave on $(0, 1)$.
When $\bar{x} = 0$ or $\bar{x} = 1$, the log-likelihood reduces to $\ell(p) = n\log(1-p)$ or $\ell(p) = n\log p$, respectively,
and it is straightforward to verify that both functions are monotone and the maximum is achieved at $p = 0$ and $p = 1$, respectively. Then again, $p = \bar{x}$ is the global maximizer.
Therefore, the maximum likelihood estimate of $p$: $\hat{p} = \bar{x}$.
- The maximum likelihood estimator of $p$: $\hat{p} = \bar{X}$.
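As a sanity check on the closed form $\hat{p} = \bar{x}$, the Bernoulli log-likelihood can also be maximized numerically. A sketch (the data vector is arbitrary; scipy is assumed available):

```python
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([1, 0, 0, 1, 1, 0, 1, 1])   # an arbitrary observed 0/1 sample

def neg_loglik(p):
    """Negative Bernoulli log-likelihood, to be minimized."""
    s = x.sum()
    return -(s * np.log(p) + (len(x) - s) * np.log(1 - p))

# Numerically minimize the negative log-likelihood over (0, 1).
res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print("numerical MLE:", res.x)      # ~ 0.625
print("closed form:  ", x.mean())   # xbar = 5/8 = 0.625
```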
Example (Normal MLE, both $\mu$ and $\sigma^2$ unknown) Compute the MLE of $(\mu, \sigma^2)$ for a sample $x_1, \dots, x_n$, when $X_1, \dots, X_n$ i.i.d., $N(\mu, \sigma^2)$.
We have
a. pdf of $X_i$:
$$f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$
b. Log-likelihood:
$$\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2.$$
a. Find stationary points in $\mathbb{R} \times (0, \infty)$. Setting $\partial \ell / \partial \mu = 0$ and $\partial \ell / \partial \sigma^2 = 0$,
we get $\hat{\mu} = \bar{x}$, $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2$.
b. We can verify the solution in (a) is indeed the unique global maximizer by using a second-derivative condition (positive determinant and negative first entry of the Hessian matrix) and checking that there is no maximum at the boundary or at infinity.
- The maximum likelihood estimator: $(\hat{\mu}, \hat{\sigma}^2) = \left(\bar{X}, \ \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2\right)$.
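The closed-form normal MLEs are easy to compute directly; note that the MLE of $\sigma^2$ divides by $n$, not $n-1$. A minimal sketch with simulated data (true parameters arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=2.0, size=50)    # sample with mu=3, sigma=2

mu_hat = x.mean()                              # MLE of mu
sigma2_hat = np.mean((x - mu_hat) ** 2)        # MLE of sigma^2 (divides by n)

print(mu_hat, sigma2_hat)
# np.var divides by n by default (ddof=0), so it equals the MLE:
assert np.isclose(sigma2_hat, np.var(x))
```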
Example (Uniform MLE) Compute the MLE of $\theta$ for a sample $x_1, \dots, x_n$, when $X_1, \dots, X_n$ i.i.d., $\mathrm{Uniform}(0, \theta)$.
a. pdf of $X_i$:
$$f(x; \theta) = \frac{1}{\theta}\,1\{0 \leq x \leq \theta\}.$$
Joint pdf of $(X_1, \dots, X_n)$:
$$f(x_1, \dots, x_n; \theta) = \frac{1}{\theta^n}\,1\{0 \leq x_i \leq \theta, \ i = 1, \dots, n\}.$$
b. Likelihood:
$$L(\theta) = \frac{1}{\theta^n}\,1\{\theta \geq \max_i x_i\}.$$
Log-likelihood:
$$\ell(\theta) = -n\log\theta, \quad \text{for } \theta \geq \max_i x_i.$$
a. Find stationary points in $[\max_i x_i, \infty)$.
For $\theta \geq \max_i x_i$,
$$\ell'(\theta) = -\frac{n}{\theta} < 0.$$
No stationary points exist. The log-likelihood is a decreasing function of $\theta$.
Therefore, the log-likelihood function is maximized at the boundary point $\hat{\theta} = \max_i x_i$.
The maximum likelihood estimator $\hat{\theta} = \max_i X_i = X_{(n)}$.
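A short simulation, sketched below, shows a notable feature of this MLE: since $X_{(n)} \leq \theta$ always, the estimator is biased downward, with $E[X_{(n)}] = \frac{n}{n+1}\theta$ (the parameters and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, theta, reps = 10, 5.0, 100_000

X = rng.uniform(0.0, theta, size=(reps, n))
theta_hat = X.max(axis=1)                  # MLE: the sample maximum

# X_(n) <= theta always, so the MLE underestimates theta on average:
print("E[theta_hat] ~", theta_hat.mean())  # ~ n/(n+1) * theta = 4.545...
```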
Theorem (Invariance) If $\hat{\theta}$ is the MLE of $\theta$, then for any function $g$, $g(\hat{\theta})$ is the MLE of $g(\theta)$.
Remark Theorem 6.4-1 in HTZ requires that $g$ is one-to-one. When $g$ is not one-to-one, the discussion becomes more subtle, but we will not worry about this point in this class.
Proof (when $g$ is one-to-one)
Let $\eta = g(\theta)$, so that $\theta = g^{-1}(\eta)$, and define the likelihood in terms of $\eta$ as $L^*(\eta) = L(g^{-1}(\eta))$. We have that $\hat{\theta}$ is the maximizer of $L(\theta)$.
The function $L^*$ attains its largest value $L(\hat{\theta})$ at $\eta = g(\hat{\theta})$.
Therefore, $g(\hat{\theta})$ is the maximizer of the function $L^*(\eta)$, i.e., $g(\hat{\theta})$ is the MLE of $g(\theta)$.
Example (Bernoulli MLE) Compute the MLE of the odds $p/(1-p)$ for a sample $x_1, \dots, x_n$, when $X_1, \dots, X_n$ i.i.d., $\mathrm{Bernoulli}(p)$.
The parameter of interest is $\eta = g(p) = p/(1-p)$, where $g$ is one-to-one on $(0, 1)$.
Since the MLE of $p$ is $\bar{X}$, the MLE of $\eta$ is $\hat{\eta} = \bar{X}/(1 - \bar{X})$ by the invariance property.
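Invariance says that reparametrizing the likelihood and maximizing directly over $\eta = g(p)$ gives the same answer as plugging $\hat{p}$ into $g$. A numerical sketch (arbitrary data; scipy assumed; the odds parametrization follows the example above):

```python
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([1, 0, 1, 1, 0, 1])   # arbitrary 0/1 sample; xbar = 2/3

def neg_loglik_odds(eta):
    """Bernoulli negative log-likelihood reparametrized by the odds eta = p/(1-p)."""
    p = eta / (1 + eta)
    s = x.sum()
    return -(s * np.log(p) + (len(x) - s) * np.log(1 - p))

res = minimize_scalar(neg_loglik_odds, bounds=(1e-6, 1e6), method="bounded")
pbar = x.mean()
print("direct MLE of odds:     ", res.x)               # ~ 2.0
print("g(p_hat) = pbar/(1-pbar):", pbar / (1 - pbar))  # 2.0
```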
Properties of Point Estimators
Finite sample properties
Unbiasedness
Example: for a sample $X_1, \dots, X_n$, when $X_i \sim N(\mu, \sigma^2)$, i.i.d., we consider two estimators for $\mu$: $\hat{\mu}_1 = \bar{X}$ and $\hat{\mu}_2 = X_1$.
For $\hat{\mu}_1$, $E[\bar{X}] = \frac{1}{n}\sum_{i=1}^n E[X_i] = \mu$. Similarly, $E[X_1] = \mu$.
Both are unbiased.
Among the unbiased estimators $\hat{\mu}_1$ and $\hat{\mu}_2$, which estimator should we use?
Relative Efficiency
Given two unbiased estimators $\hat{\theta}_1$ and $\hat{\theta}_2$ of $\theta$, $\hat{\theta}_1$ is more efficient than $\hat{\theta}_2$ iff $\mathrm{Var}(\hat{\theta}_1) < \mathrm{Var}(\hat{\theta}_2)$. Obviously, given any two unbiased estimators, we would prefer the more efficient estimator.
Pushing this idea even further, we may consider the "best" unbiased estimator, which we define in the following way.
Example: for a sample $X_1, \dots, X_n$, when $X_i \sim N(\mu, \sigma^2)$, i.i.d., we consider the two estimators $\hat{\mu}_1 = \bar{X}$ and $\hat{\mu}_2 = X_1$ for $\mu$.
Since $\mathrm{Var}(\bar{X}) = \sigma^2/n < \sigma^2 = \mathrm{Var}(X_1)$ for $n > 1$, $\hat{\mu}_1 = \bar{X}$ is more efficient than $\hat{\mu}_2 = X_1$.
Definition (Minimum Variance Unbiased Estimator (MVUE)) An estimator $\hat{\theta}^*$ is a best unbiased estimator of $\theta$ if $\hat{\theta}^*$ is unbiased (i.e., $E_\theta[\hat{\theta}^*] = \theta$, for all $\theta \in \Omega$), and has minimum variance (i.e., for any other unbiased estimator $\hat{\theta}$, $\mathrm{Var}_\theta(\hat{\theta}^*) \leq \mathrm{Var}_\theta(\hat{\theta})$, for all $\theta \in \Omega$).
Finding a best unbiased estimator (if it exists!) is not an easy task for a variety of reasons. One obvious challenge is that we cannot check the variances of all unbiased estimators. The following theorem provides a lower bound on the variance of any unbiased estimator of $\theta$. That is, if we find an unbiased estimator whose variance matches the lower bound, then we know that we have found the MVUE.
Theorem (Cramér-Rao Lower Bound) Under regularity conditions on $f(x; \theta)$, any unbiased estimator $\hat{\theta}$ of $\theta$ based on the i.i.d. sample $X_1, \dots, X_n$ satisfies
$$\mathrm{Var}_\theta(\hat{\theta}) \geq \frac{1}{n I(\theta)}, \quad \text{where } I(\theta) = E_\theta\left[\left(\frac{\partial}{\partial\theta}\log f(X_1; \theta)\right)^2\right] = -E_\theta\left[\frac{\partial^2}{\partial\theta^2}\log f(X_1; \theta)\right]$$
is the Fisher information.
We say an unbiased estimator $\hat{\theta}$ is efficient if $\mathrm{Var}_\theta(\hat{\theta}) = \frac{1}{nI(\theta)}$, i.e., the variance of $\hat{\theta}$ equals the Cramér-Rao lower bound. The efficiency of an unbiased estimator is the ratio of the lower bound to its variance.
Example: Let $X_1, \dots, X_n$ be a random sample from a Poisson distribution with parameter $\lambda$. Determine the Cramér-Rao lower bound.
Since $\log f(x; \lambda) = -\lambda + x\log\lambda - \log x!$, we have $\frac{\partial^2}{\partial\lambda^2}\log f(x; \lambda) = -x/\lambda^2$, so $I(\lambda) = E_\lambda[X_1]/\lambda^2 = 1/\lambda$.
Therefore, the Cramér-Rao lower bound is $\frac{1}{nI(\lambda)} = \frac{\lambda}{n}$.
We note $\mathrm{Var}(\bar{X}) = \lambda/n$. Therefore, the efficiency of $\bar{X}$ is $1$, i.e., $\bar{X}$ is efficient.
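One can check by simulation, as sketched below, that the variance of $\bar{X}$ matches the lower bound $\lambda/n$ (sample size, $\lambda$, and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
n, lam, reps = 20, 3.0, 200_000

X = rng.poisson(lam, size=(reps, n))
xbar = X.mean(axis=1)

print("Var(xbar) ~", xbar.var())   # Monte Carlo variance of the estimator
print("CRLB      =", lam / n)      # lambda/n = 0.15; the two should match
```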
Sufficiency: a finite-sample property discussed in the last section of these notes (Sufficient Statistics and Rao-Blackwellization).
Large sample (asymptotic) properties
Consistency: If $\hat{\theta}_n \xrightarrow{P} \theta$ as $n \to \infty$, i.e., $P(|\hat{\theta}_n - \theta| > \epsilon) \to 0$ for every $\epsilon > 0$, then $\hat{\theta}_n$ is a consistent estimator of $\theta$.
Remark: consistency is often considered a "minimum requirement" an estimator should meet.
Theorem If $\mathrm{MSE}(\hat{\theta}_n) = E_\theta[(\hat{\theta}_n - \theta)^2] \to 0$ as $n \to \infty$, then $\hat{\theta}_n$ is consistent.
Proof. For any $\epsilon > 0$, by Markov's inequality,
$$P(|\hat{\theta}_n - \theta| > \epsilon) \leq \frac{E_\theta[(\hat{\theta}_n - \theta)^2]}{\epsilon^2} \to 0.$$
Example: for a random sample $X_1, \dots, X_n$, when $X_i \sim N(\mu, \sigma^2)$, is $\bar{X}$ a consistent estimator of $\mu$? How about $X_1$?
By the WLLN, $\bar{X} \xrightarrow{P} \mu$.
Or, $\mathrm{MSE}(\bar{X}) = \sigma^2/n \to 0$. Thus $\bar{X}$ is a consistent estimator of $\mu$.
We have $X_1 \sim N(\mu, \sigma^2)$ for every $n$. Let $\epsilon$ be any positive number. We have
$$P(|X_1 - \mu| > \epsilon) = 2\left(1 - \Phi(\epsilon/\sigma)\right) > 0,$$
which does not depend on $n$ and does not converge to $0$.
Therefore, $X_1$ is not a consistent estimator of $\mu$.
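These exceedance probabilities can be computed exactly in the normal model, since $\bar{X} \sim N(\mu, \sigma^2/n)$, which makes the contrast concrete. A sketch (with arbitrary $\mu$, $\sigma$, $\epsilon$; scipy assumed):

```python
import numpy as np
from scipy.stats import norm

mu, sigma, eps = 0.0, 1.0, 0.1

for n in [10, 100, 1000, 10000]:
    # Xbar ~ N(mu, sigma^2/n), so the exceedance probability is exact.
    p_xbar = 2 * (1 - norm.cdf(eps * np.sqrt(n) / sigma))
    p_x1 = 2 * (1 - norm.cdf(eps / sigma))   # does not involve n
    print(f"n={n:6d}: P(|Xbar-mu|>eps)={p_xbar:.4f}, P(|X1-mu|>eps)={p_x1:.4f}")
# P(|Xbar-mu|>eps) shrinks to 0 as n grows; P(|X1-mu|>eps) stays ~0.92.
```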
Asymptotic Normality: we say an estimator $\hat{\theta}_n$ is asymptotically normal if the distribution of $\sqrt{n}(\hat{\theta}_n - \theta)$ is approximately normal when $n$ is large.
The distribution of $\hat{\theta}_n$ when $n$ is large is called the asymptotic distribution of $\hat{\theta}_n$. An estimator is asymptotically normal if its asymptotic distribution is Normal, and we denote it as
$$\sqrt{n}(\hat{\theta}_n - \theta) \xrightarrow{d} N(0, \sigma^2(\theta)), \quad \text{i.e., } \hat{\theta}_n \approx N\left(\theta, \frac{\sigma^2(\theta)}{n}\right) \text{ for large } n.$$
Example: for a random sample $X_1, \dots, X_n$ with mean $\mu$ and variance $\sigma^2$, is $\bar{X}$ an asymptotically normal estimator of $\mu$?
$\sqrt{n}(\bar{X} - \mu) \xrightarrow{d} N(0, \sigma^2)$ by the CLT. Thus $\bar{X}$ is an asymptotically normal estimator of $\mu$.
Asymptotic Efficiency: we say an asymptotically normal estimator $\hat{\theta}_n \approx N(\theta, \sigma^2(\theta)/n)$ is asymptotically efficient if its asymptotic variance $\sigma^2(\theta)/n$ coincides with the Cramér-Rao lower bound $\frac{1}{nI(\theta)}$.
Example: for a random sample $X_1, \dots, X_n$, when $X_i \sim \mathrm{Bernoulli}(p)$, is $\bar{X}$ an asymptotically efficient estimator of $p$?
We have, by the CLT,
$$\sqrt{n}(\bar{X} - p) \xrightarrow{d} N(0, p(1-p)),$$
so the asymptotic variance of $\bar{X}$ is $p(1-p)/n$. Now we compute the C-R lower bound.
We have $\log f(x; p) = x\log p + (1-x)\log(1-p)$, and
$$\frac{\partial^2}{\partial p^2}\log f(x; p) = -\frac{x}{p^2} - \frac{1-x}{(1-p)^2}.$$
Therefore, $I(p) = \frac{1}{p} + \frac{1}{1-p} = \frac{1}{p(1-p)}$, and the C-R lower bound is $\frac{1}{nI(p)} = \frac{p(1-p)}{n}$.
Since the asymptotic variance of $\bar{X}$ equals the lower bound, $\bar{X}$ is an asymptotically efficient estimator.
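A simulation sketch, assuming the Bernoulli setting reconstructed above: the scaled variance $n \cdot \mathrm{Var}(\bar{X})$ should match the reciprocal Fisher information $p(1-p)$ (the value of $p$ and the seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
p, reps = 0.3, 200_000

for n in [50, 500]:
    X = rng.binomial(1, p, size=(reps, n))
    pbar = X.mean(axis=1)
    # n * Var(pbar) should match the reciprocal Fisher information p(1-p).
    print(f"n={n}: n*Var(pbar) ~ {n * pbar.var():.4f}, p(1-p) = {p*(1-p):.4f}")
```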
Theorem (asymptotic optimality of the MLE) Under some regularity conditions, an MLE is consistent, asymptotically normal, and asymptotically efficient. In other words, when $n$ is large,
$$\hat{\theta}^{\mathrm{MLE}} \approx N\left(\theta, \frac{1}{nI(\theta)}\right).$$
Sufficient Statistics and Rao-Blackwellization
Understand the concept of sufficient statistics
Understand how to find sufficient statistics
Understand how we can improve an unbiased estimator using the Rao-Blackwell theorem.
Definition (Sufficient Statistics) Let $X_1, \dots, X_n$ be a random sample from a probability distribution with unknown parameter $\theta$. Then the statistic $Y = u(X_1, \dots, X_n)$ is said to be sufficient for $\theta$ if the conditional distribution of $(X_1, \dots, X_n)$ given $Y = y$ does not depend on $\theta$.
Remark: In words, sufficiency of $Y$ for $\theta$ means that given the value of $Y$, the "extra variability" in the sample $(X_1, \dots, X_n)$, in addition to that already contained in $Y$, does not depend on the parameter $\theta$. Therefore, given the sufficient statistic $Y$, we do not need to look at other functions of the sample, since we cannot gain any further information about the parameter.
Example Suppose we have a random sample $X_1, \dots, X_n$, when $X_i \sim \mathrm{Bernoulli}(p)$. Show $Y = \sum_{i=1}^n X_i$ is sufficient for $p$.
Compute the conditional pmf: for any binary $(x_1, \dots, x_n)$ with $\sum_i x_i = y$,
$$P(X_1 = x_1, \dots, X_n = x_n \mid Y = y) = \frac{p^y(1-p)^{n-y}}{\binom{n}{y}p^y(1-p)^{n-y}} = \frac{1}{\binom{n}{y}},$$
which does not depend on $p$. Therefore $Y$ is sufficient for $p$.
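The conditional calculation can be verified by brute-force enumeration: for every binary outcome with $\sum_i x_i = y$, the conditional probability is the same and is free of $p$. A sketch for $n = 4$, $y = 2$ (the Bernoulli setting follows the example above):

```python
import itertools
from math import comb

n, y = 4, 2

for p in [0.2, 0.7]:   # two different parameter values
    outcomes = [x for x in itertools.product([0, 1], repeat=n) if sum(x) == y]
    # Joint pmf of each such sample is the same: p^y (1-p)^(n-y).
    joint = [p ** y * (1 - p) ** (n - y) for _ in outcomes]
    total = sum(joint)   # = P(Y = y)
    print(f"p={p}: conditional prob of each outcome = {joint[0]/total:.4f}, "
          f"1/C(n,y) = {1/comb(n, y):.4f}")
# Both lines print 0.1667: the conditional distribution is uniform, free of p.
```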
This definition tells us how to check whether a statistic is sufficient, but it does not tell us how to find a sufficient statistic. The following theorem is very useful in the search for a sufficient statistic.
Theorem (Factorization Theorem) Let $X_1, \dots, X_n$ denote random variables with joint pdf or pmf $f(x_1, \dots, x_n; \theta)$, which depends on the parameter $\theta$. The statistic $Y = u(X_1, \dots, X_n)$ is sufficient for $\theta$ if and only if
$$f(x_1, \dots, x_n; \theta) = \phi(u(x_1, \dots, x_n); \theta)\, h(x_1, \dots, x_n),$$
where $\phi$ depends on $x_1, \dots, x_n$ only through $u(x_1, \dots, x_n)$ and $h(x_1, \dots, x_n)$ does not depend on $\theta$.
Theorem (Factorization Theorem for two parameters) Let $X_1, \dots, X_n$ denote random variables with joint pdf or pmf $f(x_1, \dots, x_n; \theta_1, \theta_2)$, which depends on the parameters $(\theta_1, \theta_2)$. The statistics $Y_1 = u_1(X_1, \dots, X_n)$, $Y_2 = u_2(X_1, \dots, X_n)$ are (jointly) sufficient for $(\theta_1, \theta_2)$ if and only if
$$f(x_1, \dots, x_n; \theta_1, \theta_2) = \phi(u_1(x_1, \dots, x_n), u_2(x_1, \dots, x_n); \theta_1, \theta_2)\, h(x_1, \dots, x_n),$$
where $\phi$ depends on $x_1, \dots, x_n$ only through $u_1(x_1, \dots, x_n)$ and $u_2(x_1, \dots, x_n)$, and $h(x_1, \dots, x_n)$ does not depend on $(\theta_1, \theta_2)$.
Example (one parameter): Let $X_1, \dots, X_n$ be a random sample from a Poisson distribution with parameter $\lambda$. Show $Y = \sum_{i=1}^n X_i$ is sufficient for $\lambda$.
$$f(x_1, \dots, x_n; \lambda) = \prod_{i=1}^n \frac{e^{-\lambda}\lambda^{x_i}}{x_i!} = \underbrace{e^{-n\lambda}\lambda^{\sum_i x_i}}_{\phi(\sum_i x_i;\, \lambda)} \cdot \underbrace{\frac{1}{\prod_i x_i!}}_{h(x_1, \dots, x_n)}.$$
Therefore, $Y = \sum_{i=1}^n X_i$ is sufficient for $\lambda$.
Example (two parameters): Let $X_1, \dots, X_n$ be a random sample from a Normal distribution with parameters $(\mu, \sigma^2)$. Show $(\sum_{i=1}^n X_i, \sum_{i=1}^n X_i^2)$ are (jointly) sufficient for $(\mu, \sigma^2)$.
$$f(x_1, \dots, x_n; \mu, \sigma^2) = (2\pi\sigma^2)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n x_i^2 + \frac{\mu}{\sigma^2}\sum_{i=1}^n x_i - \frac{n\mu^2}{2\sigma^2}\right),$$
which depends on the data only through $(\sum_i x_i, \sum_i x_i^2)$ (take $h \equiv 1$). Therefore $(\sum_{i=1}^n X_i, \sum_{i=1}^n X_i^2)$ are (jointly) sufficient for $(\mu, \sigma^2)$.
Theorem (Sufficient Statistic when a pmf/pdf is of the exponential form) Let $X_1, \dots, X_n$ be a random sample from a distribution with a pdf or pmf of the form
$$f(x; \theta) = \exp\left[p(\theta)K(x) + S(x) + q(\theta)\right]$$
on a support free of $\theta$. Then the statistic $Y = \sum_{i=1}^n K(X_i)$ is sufficient for $\theta$.
Proof. By the factorization theorem:
$$f(x_1, \dots, x_n; \theta) = \exp\left[p(\theta)\sum_{i=1}^n K(x_i) + nq(\theta)\right]\cdot \exp\left[\sum_{i=1}^n S(x_i)\right] = \phi\Big(\sum_i K(x_i); \theta\Big)\, h(x_1, \dots, x_n).$$
Remark 1 Many familiar distributions (Binomial, Poisson, Normal, Exponential, etc.) have a pmf or pdf of the exponential form. A noteworthy exception is the uniform distribution, where the support depends on $\theta$: for $\mathrm{Uniform}(0, \theta)$, the support $[0, \theta]$ depends on $\theta$, so the theorem does not apply.
Remark 2 In fact, $\sum_{i=1}^n K(X_i)$ is minimal sufficient for $\theta$.
Example: Let $X_1, \dots, X_n$ be a random sample from a Poisson distribution with parameter $\lambda$. Show $\sum_{i=1}^n X_i$ is sufficient for $\lambda$.
We write the pmf in the exponential form:
$$f(x; \lambda) = \frac{e^{-\lambda}\lambda^x}{x!} = \exp\left[x\log\lambda - \log x! - \lambda\right].$$
Therefore $K(x) = x$, and $\sum_{i=1}^n X_i$ is sufficient for $\lambda$.
Previously, we mentioned that a "good" estimator should be based on a sufficient statistic because a sufficient statistic contains all the information about a parameter. In other words, given a sufficient statistic $Y$, adding another statistic (i.e., a function of $X_1, \dots, X_n$) only introduces noise, because there is no information about $\theta$ left in the conditional distribution of $(X_1, \dots, X_n)$ given $Y$.
In fact, if we have an unbiased estimator which is not based on a sufficient statistic, we can improve the estimator using the sufficient statistic.
Theorem (Rao-Blackwell Theorem) Let $X_1, \dots, X_n$ be a random sample from a distribution with pdf or pmf $f(x; \theta)$. Let $\hat{\theta}$ be an unbiased estimator of $\theta$. Let $Y = u(X_1, \dots, X_n)$ be a sufficient statistic for $\theta$, and define $\hat{\theta}^* = E[\hat{\theta} \mid Y]$. Then,
$$E_\theta[\hat{\theta}^*] = \theta \quad \text{and} \quad \mathrm{Var}_\theta(\hat{\theta}^*) \leq \mathrm{Var}_\theta(\hat{\theta}), \quad \text{for all } \theta \in \Omega.$$
Remark 1 The new estimator $\hat{\theta}^*$, which is based on the sufficient statistic $Y$, is a strictly better estimator than the original estimator $\hat{\theta}$ unless $\hat{\theta}$ was already a function of $Y$ alone. (Note that $\hat{\theta}^*$ is a valid estimator: by sufficiency, the conditional expectation does not depend on $\theta$.)
Remark 2 In many cases, by improving an estimator based on the Rao-Blackwell theorem (called Rao-Blackwellization), we not only get an estimator with a smaller variance but we actually obtain a minimum-variance unbiased estimator (MVUE). This is especially true when we use a sufficient statistic from the previous theorem (Sufficient Statistic when a pmf/pdf is of the exponential form).
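As a concrete instance of Rao-Blackwellization in the Bernoulli model: start from the crude unbiased estimator $\hat{\theta} = X_1$ of $p$ and condition on the sufficient statistic $Y = \sum_i X_i$; by symmetry, $E[X_1 \mid Y] = Y/n = \bar{X}$. The simulation below (arbitrary $n$, $p$, and seed) shows the variance reduction:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, reps = 10, 0.4, 200_000

X = rng.binomial(1, p, size=(reps, n))
crude = X[:, 0]       # unbiased but crude estimator of p: X_1
rb = X.mean(axis=1)   # E[X_1 | sum(X)] = sum(X)/n, the Rao-Blackwellized version

print("E[crude] ~", crude.mean(), " Var ~", crude.var())  # ~ p, p(1-p) = 0.24
print("E[rb]    ~", rb.mean(),    " Var ~", rb.var())     # ~ p, p(1-p)/n = 0.024
```

Both estimators are unbiased, but conditioning on the sufficient statistic cuts the variance by a factor of $n$ and, in this case, yields the MVUE $\bar{X}$.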