*building intuition by explaining with both rigorous math and
non-rigorous english*

Throughout this post, \[ X_n \] is a sequence of independent and identically distributed random variables with mean \[ \mu \] and standard deviation \[ \sigma \]. The mean of this sequence is given by:

\[ \bar{X} = \frac{X_n}{n}\]

You probably already have a pretty good intuition for the weak law of large numbers. It states that as sample size increases, the mean of that sample converges in probability towards the true mean.

\[ \bar{X} \overset{P}{\to} \mu \ \ \mathrm{as} \ n\to\infty \]

We’ll also be using Chebyshev’s Inequality, which acts as a sort of cap on what percentage of values can be far away from the mean. This is a little harder to get an intuition for (especially understanding why the numbers are what they are), but it’s really just relating the chance of deviating from the mean by more than \[ k \] standard deviations to be less than or equal to \[ \frac{1}{k^2} \]. Intuitively, there’s perhaps a rough analogy here to the inverse square rule for gravity, sound, etc. but for probability. As we get farther away from the mean, fewer and fewer data points can be that far away in any non-pathological (for example, uniform) distribution.

\[ P(|\bar{X} - \mu| \geq k \sigma) \leq \frac{1}{k^2} \]

This is where the fun begins. We first want to find variance so we can plug it in to Chebyshev. We use the convenient fact that the variance of this sum is the sum of the variances (thanks to independence!), and expand out the variance for each \[ X_i \]. Then we eliminate a factor of \[ n \] from the final denominator by noting that we’re summing over \[ n \] random variables. The final \[ \sigma^2 \] should feel familiar, and familiarity is good when we’re building intuition.

\[ \mathrm{var}(\bar{X}) = \mathrm{var}\left(\frac{X_n}{n}\right) = \sum_{i=1}^n \mathrm{var}\left({\frac{X_i}{n}}\right) = \sum_{i=1}^n \frac{\sigma^2}{n^2} = \frac{\sigma^2}{n} \]

We can put this back into Chebyshev’s Inequality now. First we define \[ \epsilon = k \sigma > 0 \]. We use variance over \[ \epsilon^2 \] to get out our last useful little bit of formula.

\[ P(|\bar{X} - \mu| \geq \epsilon) \leq \frac{\mathrm{var}(\bar{X})}{\epsilon^2} = \frac{\sigma^2}{n \epsilon^2} \]

Now, we just push \[ n \] to infinity and watch the magic happen! We’re dividing some positive constant by infinity times some other positive constant, so get out a zero.

\[ \lim_{n \to \infty} \frac{\sigma^2}{n \epsilon^2} = 0 \]

And to put the bow on top, we’ll state it in a pretty way.

\[ \lim_{n \to \infty} P(|\bar{X} - \mu| \geq \epsilon) = 0\]

Intuitively, once again: this means that as \[ n \] goes to infinity, the probability of the observed mean differing from the true mean by more than an arbitrarily small positive number \[ \epsilon \] is zero. We can restate this as our original formulation: as sample size increases, the mean of that sample converges in probability towards the true mean.