\[ \newcommand{\nRV}[2]{{#1}_1, {#1}_2, \ldots, {#1}_{#2}} \newcommand{\pnRV}[3]{{#1}_1^{#3}, {#1}_2^{#3}, \ldots, {#1}_{#2}^{#3}} \newcommand{\onRV}[2]{{#1}_{(1)} \le {#1}_{(2)} \le \ldots \le {#1}_{(#2)}} \newcommand{\RR}{\mathbb{R}} \newcommand{\Prob}[1]{\mathbb{P}\left({#1}\right)} \newcommand{\PP}{\mathcal{P}} \newcommand{\iidd}{\overset{\mathsf{iid}}{\sim}} \newcommand{\X}{\times} \newcommand{\EE}[1]{\mathbb{E}\left[{#1}\right]} \newcommand{\Var}[1]{\mathsf{Var}\left({#1}\right)} \newcommand{\Ber}[1]{\mathsf{Ber}\left({#1}\right)} \newcommand{\Geom}[1]{\mathsf{Geom}\left({#1}\right)} \newcommand{\Bin}[1]{\mathsf{Bin}\left({#1}\right)} \newcommand{\Poi}[1]{\mathsf{Pois}\left({#1}\right)} \newcommand{\Exp}[1]{\mathsf{Exp}\left({#1}\right)} \newcommand{\SD}[1]{\mathsf{SD}\left({#1}\right)} \newcommand{\sgn}[1]{\mathsf{sgn}} \newcommand{\dd}[1]{\operatorname{d}\!{#1}} \]
2.4 Sampling and Descriptive Statistics
The difference between probability and statistics lies in their perspectives on the known and unknown aspects of models for random experiments. Probability focuses on fully known models, whereas statistics comes into play when inferences are made about unknown aspects of the model based on observed results. This section considers scenarios where the outcomes of random variables are assumed to be known and the aim is to make inferences about their unknown distributions. For example, taking samples from a population and recording measurements, such as the heights of people or the arsenic content of water samples, is treated as a random experiment. Even though the distribution may be unknown, this section explores how to estimate properties such as expected value and variance from sampled data, with emphasis on situations where the sample size is small relative to the population size.
2.4.1 Empirical Distribution Function
A natural quantity for estimating the distribution of a random variable from observed data, regardless of the underlying distribution, is a discrete distribution that assigns equal probability to each observed point. This is called the empirical distribution. It is formally defined as follows.
Definition 2.1 (Empirical Distribution) Let \(X_1, X_2, \dots, X_n\) be i.i.d. random variables. The empirical distribution based on these is the discrete distribution with probability mass function given by
\[f(t) = \frac{1}{n} \left|\left\{ i : X_i = t \right\}\right|\]
The empirical distribution is formed without additional assumptions about the underlying distribution, and it is itself a random quantity that can be studied using the tools of probability. Such studies, known as “descriptive statistics”, make no further assumptions about the distribution. In the following sections, we’ll explore how making additional valid assumptions leads to “better” inference. The empirical distribution, being random, varies with each realisation of the sample, and it is expected to contain information about the underlying distribution, especially as the sample size increases. This intuition is refined in the remainder of the section, which details how the properties of the empirical distribution can be used to investigate the properties of the underlying distribution.
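Definition 2.1 can be made concrete with a short computation. The following is a minimal sketch in Python (chosen here only for illustration); the function name `empirical_pmf` and the sample values are hypothetical and not part of the text.

```python
from collections import Counter
from fractions import Fraction


def empirical_pmf(sample):
    """Return the empirical pmf as a dict: t -> #{i : X_i = t} / n."""
    n = len(sample)
    counts = Counter(sample)  # number of observations equal to each value
    return {t: Fraction(c, n) for t, c in counts.items()}


# Hypothetical observed values; any list of observations works.
sample = [2, 3, 3, 5, 3, 2]
pmf = empirical_pmf(sample)

for t in sorted(pmf):
    print(t, pmf[t])
```

Each distinct value receives mass equal to its relative frequency, so repeated observations accumulate weight and the masses sum to one.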
2.4.2 Sample Mean
The sample mean is the average of a given set of observations.
Definition 2.2 (Sample Mean) Let \(X_1, X_2, \dots, X_n\) be a random sample of size \(n\) from a distribution with mean \(\mu\) and variance \(\sigma^2\). The sample mean is defined as \[ \bar{X}_n = \frac{X_1 + X_2 + \dots + X_n}{n} \]
The random variable \(\bar{X}_n\) is the expected value of the empirical distribution based on \(X_1, X_2, \dots, X_n\). When the \(X_j\)’s have a finite mean \(\mu\), it’s important to note that the sample mean \(\bar{X}_n\) is not the same as \(\mu\): \(\bar{X}_n\) is a random variable, whereas \(\mu\) is a fixed constant. In statistical terms, \(\mu\) is typically considered to be unknown, whereas \(\bar{X}_n\) can be calculated from the sample data. The following theorem is concerned with assessing how well \(\bar{X}_n\) serves as an estimate of \(\mu\).
Theorem 2.5 (Sample Mean) Let \(X_1, X_2, \dots, X_n\) be independent and identically distributed random variables with mean \(\mu\) and variance \(\sigma^2\). Then, \[ \EE{\bar{X}_n} = \mu \quad \textsf{and} \quad \Var{\bar{X}_n} = \frac{\sigma^2}{n} \]
Proof. We have \[\begin{align*} \EE{\bar{X}_n} & = \EE{\frac{X_1 + X_2 + \dots + X_n}{n}} \\ & = \frac{\EE{X_1} + \EE{X_2} + \dots + \EE{X_n}}{n} \\ & = \frac{n \mu}{n} = \mu \end{align*}\] Thus \(\bar{X}_n\) is an unbiased estimator of \(\mu\). We also have \[\begin{align*} \Var{\bar{X}_n} & = \Var{\frac{X_1 + X_2 + \dots + X_n}{n}} \\ & = \frac{\Var{X_1} + \Var{X_2} + \dots + \Var{X_n}}{n^2} \tag{as $X_i$'s are independent} \\ & = \frac{n \sigma^2}{n^2} = \frac{\sigma^2}{n} \end{align*}\] Since \(\bar{X}_n\) is unbiased and \(\Var{\bar{X}_n} \to 0\) as \(n \to \infty\), Chebyshev’s inequality shows that \(\bar{X}_n\) is a consistent estimator of \(\mu\).
The equality \(\EE{\bar{X}_n} = \mu\) means that \(\bar{X}_n\) is, on average, equal to the unknown mean \(\mu\); that is, it is an unbiased estimator of \(\mu\). As the sample size \(n\) increases, the variance \(\Var{\bar{X}_n}\) approaches zero, indicating that larger samples improve the accuracy of \(\bar{X}_n\) as an estimate of \(\mu\). This suggests that, when sampling from an unknown distribution, a larger sample size tends to produce a sample mean closer to the expected value of the distribution. In statistical terms, the sample mean is a consistent estimator of the population mean \(\mu\).
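Theorem 2.5 can be checked exactly on a small example. The sketch below, written in Python purely for illustration, assumes the \(X_i\) are rolls of a fair six-sided die (so \(\mu = 7/2\) and \(\sigma^2 = 35/12\)) and enumerates all \(6^n\) equally likely outcomes; the helper name `sample_mean_moments` is hypothetical.

```python
from fractions import Fraction
from itertools import product

FACES = [1, 2, 3, 4, 5, 6]   # fair die: an illustrative assumption
MU = Fraction(7, 2)           # mean of a single roll
SIGMA2 = Fraction(35, 12)     # variance of a single roll


def sample_mean_moments(n):
    """Exact mean and variance of the sample mean of n fair-die rolls."""
    weight = Fraction(1, len(FACES) ** n)  # each outcome is equally likely
    first = Fraction(0)                    # accumulates E[X-bar]
    second = Fraction(0)                   # accumulates E[X-bar^2]
    for outcome in product(FACES, repeat=n):
        xbar = Fraction(sum(outcome), n)
        first += weight * xbar
        second += weight * xbar ** 2
    return first, second - first ** 2


for n in (1, 2, 3, 4):
    mean, var = sample_mean_moments(n)
    # The last two printed columns should agree: Var(X-bar_n) and sigma^2 / n.
    print(n, mean, var, SIGMA2 / n)
```

Exact rational arithmetic is used instead of simulation so that the agreement with \(\mu\) and \(\sigma^2/n\) is an identity, not an approximation.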
2.4.3 Sample Variance
Given a sample of observations, we define the sample variance below.
Definition 2.3 (Sample Variance) Let \(X_1, X_2, \dots, X_n\) be a random sample of size \(n\) from a distribution with mean \(\mu\) and variance \(\sigma^2\). The sample variance is defined as \[ S_n^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X}_n)^2 \]
It’s important to note that this definition of sample variance is not universal. An alternative definition uses \(n\) in the denominator instead of \(n - 1\); that version coincides with the variance of the empirical distribution of \(X_1, X_2, \dots, X_n\). The definition presented here, with \(n - 1\), gives an unbiased estimator of the underlying population variance, as confirmed by the following theorem.
Theorem 2.6 (Sample Variance) Let \(X_1, X_2, \dots, X_n\) be independent and identically distributed random variables with mean \(\mu\) and variance \(\sigma^2\). Then, \[ \EE{S_n^2} = \sigma^2 \]
Proof. We have \[\begin{align*} \EE{S_n^2} & = \EE{\frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X}_n)^2} \\ & = \frac{1}{n-1} \EE{\sum_{i=1}^n (X_i^2 - 2 X_i \bar{X}_n + \bar{X}_n^2)} \\ & = \frac{1}{n-1} \EE{\sum_{i=1}^n X_i^2 - 2 \bar{X}_n \sum_{i=1}^n X_i + n \bar{X}_n^2} \\ & = \frac{1}{n-1} \EE{\sum_{i=1}^n X_i^2 - 2 n \bar{X}_n^2 + n \bar{X}_n^2} \\ & = \frac{1}{n-1} \EE{\sum_{i=1}^n X_i^2 - n \bar{X}_n^2} \\ & = \frac{1}{n-1} \left( \sum_{i=1}^n \EE{X_i^2} - n \EE{\bar{X}_n^2} \right) \\ & = \frac{1}{n-1} \left( n \left(\sigma^2 + \mu^2\right) - n \left(\frac{\sigma^2}{n} + \mu^2\right) \right) \\ & = \frac{(n-1) \sigma^2}{n-1} = \sigma^2 \end{align*}\] Here we used the identity \(\EE{Y^2} = \Var{Y} + \left(\EE{Y}\right)^2\), which gives \(\EE{X_i^2} = \sigma^2 + \mu^2\) and, by Theorem 2.5, \(\EE{\bar{X}_n^2} = \frac{\sigma^2}{n} + \mu^2\). Hence, \(S_n^2\) is an unbiased estimator of \(\sigma^2\). For consistency of \(S_n^2\), see Exercise 2.5.
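The effect of the denominator can be seen exactly on a small example. The sketch below, in Python for illustration only, again assumes rolls of a fair six-sided die (\(\sigma^2 = 35/12\)) and computes the expected value of the variance estimate with denominators \(n - 1\) and \(n\) by enumerating all outcomes; the function name `expected_variance_estimate` is hypothetical.

```python
from fractions import Fraction
from itertools import product

FACES = [1, 2, 3, 4, 5, 6]   # fair die: an illustrative assumption
SIGMA2 = Fraction(35, 12)     # variance of a single roll


def expected_variance_estimate(n, denominator):
    """Exact E[ sum_i (X_i - X-bar)^2 / denominator ] over n fair-die rolls."""
    weight = Fraction(1, len(FACES) ** n)  # each outcome is equally likely
    total = Fraction(0)
    for outcome in product(FACES, repeat=n):
        xbar = Fraction(sum(outcome), n)
        ss = sum((x - xbar) ** 2 for x in outcome)  # sum of squared deviations
        total += weight * ss / denominator
    return total


n = 3
unbiased = expected_variance_estimate(n, n - 1)  # denominator n - 1
biased = expected_variance_estimate(n, n)        # denominator n

print(unbiased, SIGMA2)               # the two values should coincide
print(biased, SIGMA2 * (n - 1) / n)   # the n-denominator version is too small
```

With denominator \(n\) the expectation is \(\frac{n-1}{n}\sigma^2\), which underestimates \(\sigma^2\); dividing by \(n - 1\) removes this bias. In Python's standard library, `statistics.variance` uses \(n - 1\) and `statistics.pvariance` uses \(n\).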
2.4.4 Sample Proportion
While expectation and variance are commonly used to summarise a random variable, they do not fully characterise its distribution. Full knowledge of the distribution requires the ability to compute the probability \(\Prob{X \in A}\) of any event \(A\). In particular, it is sufficient to know probabilities such as \(\Prob{X \leq t}\), which is the cumulative distribution function of \(X\) at \(t\). When dealing with a sample of independent and identically distributed (i.i.d.) observations \(X_1, X_2, \dots, X_n\) from the common distribution of a random variable \(X\), the probability \(\Prob{X \in A}\) for any event \(A\) can be approximated by \(\Prob{Y \in A}\), where \(Y\) follows the empirical distribution based on \(X_1, X_2, \dots, X_n\). Since \(Y\) takes the values \(X_1, X_2, \dots, X_n\) with equal probability \(\frac 1n\), this approximation is \[\Prob{Y \in A} = \sum_{i \,:\, X_i \in A} \frac 1n = \frac{\#\{X_i \in A\}}{n},\] which is the proportion of sample observations for which the event \(A\) occurred. The following result shows that \(\Prob{Y \in A}\) is a good estimator of \(\Prob{X \in A}\).
Theorem 2.7 Let \(X_1, X_2, \dots, X_n\) be i.i.d. random variables with \(\Prob{X_i \in A} = p\) for each \(i\). Let \(\hat{p} = \frac{\#\{X_i \in A\}}{n}\). Then \(\hat{p}\) is an unbiased estimator of \(p\) and \(\Var{\hat{p}} = \frac{p(1-p)}{n} \to 0\) as \(n \to \infty\).
Proof. We have \[\begin{align*} \EE{\hat{p}} & = \EE{\frac{\#\{X_i \in A\}}{n}} \\ & = \frac{1}{n} \EE{\sum_{i=1}^n \mathbb{I}_{\{X_i \in A\}}} \\ & = \frac{1}{n} \sum_{i=1}^n \EE{\mathbb{I}_{\{X_i \in A\}}} \\ & = \frac{1}{n} \sum_{i=1}^n \Prob{X_i \in A} \\ & = \frac{1}{n} \sum_{i=1}^n p = p \end{align*}\] Thus, \(\hat{p}\) is an unbiased estimator of \(p\). We also have \[\begin{align*} \Var{\hat{p}} & = \Var{\frac{\#\{X_i \in A\}}{n}} \\ & = \frac{1}{n^2} \Var{\sum_{i=1}^n \mathbb{I}_{\{X_i \in A\}}} \\ & = \frac{1}{n^2} \sum_{i=1}^n \Var{\mathbb{I}_{\{X_i \in A\}}} \tag{as $X_i$'s are independent} \\ & = \frac{1}{n^2} \sum_{i=1}^n \EE{\mathbb{I}_{\{X_i \in A\}}^2} - \frac{1}{n^2} \sum_{i=1}^n \EE{\mathbb{I}_{\{X_i \in A\}}}^2 \\ & = \frac{1}{n^2} \sum_{i=1}^n \EE{\mathbb{I}_{\{X_i \in A\}}} - \frac{1}{n^2} \sum_{i=1}^n \EE{\mathbb{I}_{\{X_i \in A\}}}^2 \\ & = \frac{1}{n^2} \sum_{i=1}^n \Prob{X_i \in A} - \frac{1}{n^2} \sum_{i=1}^n \Prob{X_i \in A}^2 \\ & = \frac{1}{n^2} \sum_{i=1}^n p - \frac{1}{n^2} \sum_{i=1}^n p^2 \\ & = \frac{p}{n} - \frac{p^2}{n} = \frac{p(1-p)}{n} \end{align*}\] Thus, \(\Var{\hat{p}} \to 0\) as \(n \to \infty\).
The above result is a specific instance of the broader law of large numbers, which will be discussed further in Chapter 3. This law is significant because it formally supports our intuition that the probability of an event reflects the asymptotic relative frequency of that event across repeated trials of an experiment.
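The behaviour described by Theorem 2.7 can be observed in a small simulation. The sketch below, in Python for illustration only, assumes the \(X_i\) are fair die rolls and takes \(A = \{5, 6\}\), so that \(p = 1/3\); the function name `sample_proportion`, the seed, and the sample sizes are arbitrary choices made for this example.

```python
import random


def sample_proportion(sample, in_A):
    """p-hat = #{i : X_i in A} / n, where in_A tests membership in A."""
    return sum(1 for x in sample if in_A(x)) / len(sample)


rng = random.Random(0)  # fixed seed so that a run is reproducible
p = 2 / 6               # P(X >= 5) for a fair die (illustrative assumption)

for n in (10, 100, 1_000, 10_000, 100_000):
    rolls = [rng.randint(1, 6) for _ in range(n)]
    p_hat = sample_proportion(rolls, lambda x: x >= 5)
    sd = (p * (1 - p) / n) ** 0.5  # standard deviation of p-hat from Theorem 2.7
    print(f"n={n:>6}  p_hat={p_hat:.4f}  error={abs(p_hat - p):.4f}  sd={sd:.4f}")
```

The observed error should typically be of the same order as the standard deviation \(\sqrt{p(1-p)/n}\), which shrinks as \(n\) grows.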
Exercises
Exercise 2.5 Show that \(S_n^2\) in Theorem 2.6 is a consistent estimator of \(\sigma^2\).
Exercise 2.6 (Order Statistics) Let \(n \geq 1\) and let \(X_1, X_2, \dots, X_n\) be i.i.d. random variables. Let the \(X\)’s be arranged in increasing order of magnitude, denoted by \[X_{(1)} \leq X_{(2)} \leq \ldots \leq X_{(n)}\] These ordered values are called the order statistics of the sample \(X_1, X_2, \dots, X_n\). For \(1 \leq r \leq n\), \(X_{(r)}\) is called the \(r\)th order statistic. Find the probability density function of \(X_{(r)}\), the \(r\)th order statistic, for \(1 \leq r \leq n\) when
- \(X_1 \sim \mathsf{Unif}(0,1)\).
- \(X_1 \sim \Exp{\lambda}\) for some \(\lambda > 0\).
Exercise 2.7 Suppose Ruhi has a coin that she claims is “fair” (equally likely to come up heads or tails) and that her friend claims is weighted towards heads. Suppose Ruhi flips the coin \(20\) times and finds that it comes up heads on sixteen of those \(20\) flips. While this seems to favor her friend’s hypothesis, it is still possible that Ruhi is correct about the coin and that just by chance the coin happened to come up heads more often than tails on this series of flips. Let \(S\) be the sample space of all possible sequences of flips. The size of \(S\) is then \(2^{20}\), and if Ruhi is correct about the coin being “fair”, each of these outcomes is equally likely.
- Let \(E\) be the event that exactly sixteen of the flips come up heads. What is the size of \(E\)? What is the probability of \(E\)?
- Let \(F\) be the event that at least sixteen of the flips come up heads. What is the size of \(F\)? What is the probability of \(F\)?
Exercise 2.8 Two types of coin are produced at a factory: a fair coin and a biased one that comes up heads \(55\%\) of the time. We have one of these coins but do not know whether it is a fair or biased coin. In order to ascertain which type of coin we have, we shall perform the following statistical test. We shall toss the coin \(1000\) times. If the coin comes up heads \(525\) or more times we shall conclude that it is a biased coin. Otherwise, we shall conclude that it is fair. If the coin is actually fair, what is the probability that we shall reach a false conclusion? What would it be if the coin were biased?
Exercise 2.9 (Birthday Problem) We will revisit the birthday problem. That is, we have \(N\) people in a room. We further assume that all years have \(365\) days and that birthrates are constant throughout the year. Using ggplot(), plot \(\Prob{\textsf{at least two people in the room share a birthday}}\) as a function of \(N\), with \(N\) varying up to \(100\). From the plot deduce the following:
- What is the value of \(N\) above which there is a \(60\%\) chance that two of the \(N\) people will have a common birthday?
- If there are \(N = 20\) people in the room, infer from the plot the chance that two of the \(20\) people will have a common birthday.
In the same room of \(N\) people, what are the chances that \(3\) people share the same birthday? Can you write R code to compute this for any \(k\) people?