Probability Theory
A rigorous treatment of modern probability: from Kolmogorov axioms and sigma-algebras through distributions, the law of large numbers, the central limit theorem, Markov chains, and convergence theory.
1. Foundations: Sample Spaces and Sigma-Algebras
Sample Space
The sample space (denoted Omega) is the set of all possible outcomes of a random experiment. Every probability model begins here.
- Discrete sample space: Omega is finite or countably infinite. Example: rolling a die, Omega = {1, 2, 3, 4, 5, 6}.
- Continuous sample space: Omega is uncountably infinite. Example: measuring a voltage, Omega = the real line R.
Events
An event is a subset A of Omega to which we wish to assign a probability. Not every subset is necessarily an event — this is where sigma-algebras become essential.
Sigma-Algebras (Sigma-Fields)
A sigma-algebra (or sigma-field) F on Omega is a collection of subsets satisfying three properties:
- Contains the whole space: Omega is in F.
- Closed under complementation: If A is in F, then the complement of A (denoted A^c) is also in F.
- Closed under countable unions: If A_1, A_2, ... are in F, then the union of all A_i is also in F.
The trivial sigma-algebra is just {empty set, Omega}. The power set of Omega — all subsets — is always a sigma-algebra. For continuous sample spaces the most important sigma-algebra is the Borel sigma-algebra on R, generated by all open intervals.
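The three defining properties can be checked mechanically on a finite sample space. The sketch below is illustrative Python (not part of the formal development); on a finite Omega, countable unions reduce to finite ones, so checking pairwise unions suffices:

```python
from itertools import chain, combinations

def powerset(omega):
    """All subsets of omega, as frozensets."""
    items = list(omega)
    return {frozenset(c) for c in chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))}

def is_sigma_algebra(F, omega):
    """Check the three defining properties for a finite collection F."""
    omega = frozenset(omega)
    if omega not in F:                      # contains Omega
        return False
    if any(omega - A not in F for A in F):  # closed under complementation
        return False
    # On a finite space, closure under countable unions reduces to pairwise unions.
    return all(A | B in F for A in F for B in F)

omega = {1, 2, 3, 4, 5, 6}
F_power = powerset(omega)                                # power set: always a sigma-algebra
F_trivial = {frozenset(), frozenset(omega)}              # trivial sigma-algebra
F_bad = {frozenset(), frozenset(omega), frozenset({1})}  # missing the complement of {1}

print(is_sigma_algebra(F_power, omega))    # True
print(is_sigma_algebra(F_trivial, omega))  # True
print(is_sigma_algebra(F_bad, omega))      # False
```

The failing collection `F_bad` shows why closure matters: it contains {1} but not {2, 3, 4, 5, 6}, so we could not assign a probability to "the die does not show 1".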
Probability Space
Definition
A probability space is a triple (Omega, F, P) where Omega is the sample space, F is a sigma-algebra on Omega, and P is a probability measure on F satisfying Kolmogorov's axioms.
2. Probability Axioms and Measure Theory
Kolmogorov's Axioms (1933)
Andrey Kolmogorov placed probability on a rigorous axiomatic foundation in 1933. Given a probability space (Omega, F, P):
Axiom 1 — Non-negativity
P(A) >= 0 for all A in F
Axiom 2 — Normalization
P(Omega) = 1
Axiom 3 — Countable Additivity (sigma-additivity)
If A_1, A_2, ... are pairwise disjoint events in F, then P(union of A_i) = sum of P(A_i)
Derived Properties
All other standard probability rules follow from these three axioms:
| Rule | Formula |
|---|---|
| Complement Rule | P(A^c) = 1 - P(A) |
| Empty Set | P(empty) = 0 |
| Monotonicity | If A subset B then P(A) <= P(B) |
| Addition Rule | P(A union B) = P(A) + P(B) - P(A intersect B) |
| Union Bound (Boole's Inequality) | P(A union B) <= P(A) + P(B) |
| Total Probability | P(B) = sum over i of P(B \| A_i) * P(A_i) |
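All of these derived rules can be verified exactly on a small discrete model. A quick check with a fair die (illustrative Python, exact arithmetic via `fractions`):

```python
from fractions import Fraction

# Fair-die model: each outcome has probability 1/6; events are subsets of omega.
omega = {1, 2, 3, 4, 5, 6}
def P(A):
    return Fraction(len(A & omega), len(omega))

A = {2, 4, 6}   # even roll
B = {4, 5, 6}   # roll greater than 3

# Complement rule
assert P(omega - A) == 1 - P(A)
# Addition rule (inclusion-exclusion)
assert P(A | B) == P(A) + P(B) - P(A & B)
# Monotonicity: {4, 6} is a subset of A
assert P({4, 6}) <= P(A)
# Total probability over the partition {A, A^c}
def P_cond(E, C):
    return Fraction(len(E & C), len(C))
assert P(B) == P_cond(B, A) * P(A) + P_cond(B, omega - A) * P(omega - A)
print("all derived rules verified")
```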
Probability as a Measure
A probability measure is a special case of a general measure (from measure theory) with the added constraint that the total measure is 1. This connection to Lebesgue measure theory allows powerful analytic tools — Lebesgue integration, dominated convergence, monotone convergence — to be applied directly to probability problems.
3. Conditional Probability, Bayes, and Independence
Conditional Probability
The conditional probability of A given B is defined whenever P(B) > 0:
P(A | B) = P(A intersect B) / P(B)
Conditional probability satisfies all three Kolmogorov axioms — it is itself a valid probability measure on F, restricted to the "universe" B.
Chain Rule (Multiplication Rule)
P(A intersect B) = P(A | B) * P(B)
P(A_1 intersect ... intersect A_n) = P(A_1) * P(A_2|A_1) * P(A_3|A_1,A_2) * ...
Law of Total Probability
If (A_1, ..., A_n) is a partition of Omega (pairwise disjoint, union = Omega, all with positive probability), then for any event B:
P(B) = sum over i of P(B | A_i) * P(A_i)
Bayes' Theorem
P(A_i | B) = P(B | A_i) * P(A_i) / sum_j P(B | A_j) * P(A_j)
Where P(A_i) is the prior, P(B | A_i) is the likelihood, and P(A_i | B) is the posterior. Bayes' theorem is the mathematical basis for Bayesian inference.
Example — Medical Testing
A disease affects 1% of the population. A test is 99% sensitive (true positive rate) and 95% specific (true negative rate, so 5% false positive). If a patient tests positive, what is the probability they actually have the disease?
P(Disease | Positive) = (0.99 * 0.01) / (0.99 * 0.01 + 0.05 * 0.99) = 0.0099 / (0.0099 + 0.0495) = 0.0099 / 0.0594 approx 0.167
Only about 16.7% — despite a seemingly accurate test, low prevalence dominates.
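The same computation as a small script (illustrative Python; the `posterior` helper is ours, not a library function):

```python
def posterior(prior, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem."""
    p_pos_given_d = sensitivity
    p_pos_given_not_d = 1 - specificity          # false-positive rate
    numerator = p_pos_given_d * prior
    denominator = numerator + p_pos_given_not_d * (1 - prior)
    return numerator / denominator

p = posterior(prior=0.01, sensitivity=0.99, specificity=0.95)
print(round(p, 4))   # 0.1667: low prevalence dominates an accurate test
```

Raising the prior to 10% (e.g. a symptomatic population) pushes the posterior above 68%, which is why test results are interpreted differently in screening versus diagnostic settings.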
Independence
Events A and B are independent if and only if:
P(A intersect B) = P(A) * P(B)
Equivalently (when P(B) > 0): P(A | B) = P(A) — knowing B gives no information about A.
Caution: Mutual vs Pairwise Independence
A collection of events (A_1, ..., A_n) is mutually independent if every subset satisfies the product rule. Pairwise independence (every pair is independent) does NOT imply mutual independence.
For sigma-algebras: sub-sigma-algebras F_1, ..., F_n are independent if for every choice A_i in F_i the events A_1, ..., A_n are mutually independent.
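The standard counterexample uses two fair coin flips: "first is heads", "second is heads", and "both flips agree" are pairwise independent but not mutually independent. A quick exact check (illustrative Python):

```python
from fractions import Fraction

# Two fair coin flips; each of the four outcomes has probability 1/4.
omega = {"HH", "HT", "TH", "TT"}
def P(E):
    return Fraction(len(E), len(omega))

A = {w for w in omega if w[0] == "H"}   # first flip heads
B = {w for w in omega if w[1] == "H"}   # second flip heads
C = {w for w in omega if w[0] == w[1]}  # both flips agree

# Every pair satisfies the product rule...
assert P(A & B) == P(A) * P(B)
assert P(A & C) == P(A) * P(C)
assert P(B & C) == P(B) * P(C)
# ...but the triple does not: pairwise independence is not mutual independence.
assert P(A & B & C) != P(A) * P(B) * P(C)
print(P(A & B & C), "vs", P(A) * P(B) * P(C))  # 1/4 vs 1/8
```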
4. Random Variables, Distributions, and Expectations
Random Variables
A random variable X is a measurable function X: Omega to R — it maps outcomes to real numbers, and the preimage of every Borel set belongs to F. Random variables transform the abstract probability space into a numeric one we can compute with.
Probability Mass Function (PMF)
For a discrete random variable X (taking countably many values), the PMF is:
p(x) = P(X = x), with p(x) >= 0 and sum over x of p(x) = 1
Probability Density Function (PDF)
For a continuous random variable X, the PDF f(x) satisfies:
P(a <= X <= b) = integral from a to b of f(x) dx
f(x) >= 0 for all x, and integral over all reals of f(x) dx = 1
Note: P(X = x) = 0 for any single point when X is continuous. Probability lives in intervals, not points.
Cumulative Distribution Function (CDF)
F(x) = P(X <= x)
- F is non-decreasing: if a < b then F(a) <= F(b)
- Right-continuous: lim as t decreases to x of F(t) = F(x)
- Limits: F(-infinity) = 0 and F(+infinity) = 1
- For continuous X: f(x) = F prime (x) wherever F is differentiable
Expected Value
The expected value (mean) E[X] is the probability-weighted average:
Discrete: E[X] = sum over x of x * p(x)
Continuous: E[X] = integral over all reals of x * f(x) dx
Variance and Standard Deviation
Var(X) = E[(X - mu)^2] = E[X^2] - (E[X])^2
SD(X) = sqrt(Var(X))
Moment Generating Functions
The moment generating function (MGF) of X is:
M_X(t) = E[e^(tX)], defined for all t where the expectation is finite
The k-th moment is E[X^k] = M_X^(k)(0) — the k-th derivative of M_X evaluated at 0. If two distributions share the same MGF on an open interval around 0, they are identical. MGFs are powerful tools for proving the CLT and working with sums of independent random variables.
Key Inequalities
| Inequality | Statement |
|---|---|
| Markov | P(X >= a) <= E[X] / a for X >= 0, a > 0 |
| Chebyshev | P(|X - mu| >= k*sigma) <= 1 / k^2 |
| Jensen | phi(E[X]) <= E[phi(X)] for convex phi |
| Cauchy-Schwarz | |E[XY]|^2 <= E[X^2] * E[Y^2] |
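Markov's and Chebyshev's bounds are easy to check by simulation. The sketch below is illustrative Python (the seed and thresholds are arbitrary choices); Exponential(1) is convenient because its mean and standard deviation are both 1:

```python
import random

random.seed(0)
n = 100_000
xs = [random.expovariate(1.0) for _ in range(n)]   # Exponential(1): mu = sigma = 1

# Markov: P(X >= a) <= E[X] / a for nonnegative X.
a = 3.0
tail = sum(x >= a for x in xs) / n                 # exact tail is e^-3, about 0.05
assert tail <= 1.0 / a                             # bound: 1/3

# Chebyshev: P(|X - mu| >= k*sigma) <= 1 / k^2.
k = 2.0
dev = sum(abs(x - 1.0) >= k for x in xs) / n       # sigma = 1 here
assert dev <= 1.0 / k**2                           # bound: 1/4

print(f"Markov tail {tail:.4f} <= {1/a:.4f}; Chebyshev tail {dev:.4f} <= {1/k**2:.4f}")
```

Both bounds hold with plenty of slack here, which is typical: Markov and Chebyshev are universal but loose, trading tightness for generality.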
5. Common Probability Distributions
Discrete Distributions
| Distribution | PMF p(k) | Mean | Variance | Use Case |
|---|---|---|---|---|
| Bernoulli(p) | p^k * (1-p)^(1-k), k in {0,1} | p | p(1-p) | Single yes/no trial |
| Binomial(n,p) | C(n,k) * p^k * (1-p)^(n-k) | np | np(1-p) | Successes in n trials |
| Poisson(lambda) | e^(-lambda) * lambda^k / k! | lambda | lambda | Rare events per interval |
| Geometric(p) | (1-p)^(k-1) * p, k = 1,2,... | 1/p | (1-p)/p^2 | Trials until first success |
Poisson as Limit of Binomial
When n is large and p is small with lambda = np held constant, Binomial(n, p) converges to Poisson(lambda). This is the Poisson limit theorem — useful for modeling rare events like radioactive decay, server arrivals, and insurance claims.
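A quick numerical check of the Poisson limit (illustrative Python, standard library only): holding lambda = n*p fixed, the binomial PMF approaches the Poisson PMF as n grows.

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    """Binomial(n, p) probability of exactly k successes."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """Poisson(lam) probability of exactly k events."""
    return exp(-lam) * lam**k / factorial(k)

lam = 2.0
for n in (10, 100, 1000):
    p = lam / n                     # hold lambda = n*p fixed as n grows
    err = max(abs(binom_pmf(k, n, p) - poisson_pmf(k, lam))
              for k in range(10))
    print(f"n = {n:5d}: max PMF discrepancy = {err:.5f}")
```

The discrepancy shrinks roughly like 1/n, which is why the Poisson approximation is considered safe for large n and small p.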
Continuous Distributions
| Distribution | PDF f(x) | Mean | Variance |
|---|---|---|---|
| Uniform(a,b) | 1/(b-a) on [a,b] | (a+b)/2 | (b-a)^2 / 12 |
| Normal(mu, sigma^2) | (1 / (sigma*sqrt(2*pi))) * exp(-(x-mu)^2 / (2*sigma^2)) | mu | sigma^2 |
| Exponential(lambda) | lambda * e^(-lambda*x), x >= 0 | 1/lambda | 1/lambda^2 |
| Gamma(alpha, beta) | x^(alpha-1) * e^(-x/beta) / (Gamma(alpha) * beta^alpha) | alpha * beta | alpha * beta^2 |
| Beta(alpha, beta) | x^(alpha-1) * (1-x)^(beta-1) / B(alpha, beta), on [0,1] | alpha / (alpha+beta) | alpha*beta / ((alpha+beta)^2 * (alpha+beta+1)) |
The Normal Distribution in Depth
The normal (Gaussian) distribution N(mu, sigma^2) is central to probability and statistics. Key properties:
- Symmetric about the mean mu
- The 68-95-99.7 rule: 68% of data within 1 sigma, 95% within 2 sigma, 99.7% within 3 sigma
- Linear combinations of independent normals are normal: if X ~ N(mu_1, s_1^2) and Y ~ N(mu_2, s_2^2) independently, then X + Y ~ N(mu_1 + mu_2, s_1^2 + s_2^2)
- MGF: M_X(t) = exp(mu*t + sigma^2*t^2/2)
- Standard normal Z = (X - mu) / sigma has distribution N(0,1)
Memoryless Property of the Exponential
The exponential distribution is the unique continuous distribution with the memoryless property: P(X > s + t | X > s) = P(X > t). This makes it the natural model for waiting times in Poisson processes and for component lifetimes in reliability theory.
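The memoryless property is easy to check by simulation (illustrative Python; the rate and the thresholds s, t are arbitrary choices):

```python
import random

random.seed(1)
n = 200_000
xs = [random.expovariate(0.5) for _ in range(n)]   # Exponential with rate lambda = 0.5

s, t = 1.0, 2.0
survivors = [x for x in xs if x > s]               # condition on X > s
lhs = sum(x > s + t for x in survivors) / len(survivors)  # P(X > s+t | X > s)
rhs = sum(x > t for x in xs) / n                          # P(X > t)
print(f"P(X > s+t | X > s) = {lhs:.3f}, P(X > t) = {rhs:.3f}")
```

Both estimates land near e^(-lambda*t) = e^(-1), about 0.368: having already waited past s tells you nothing about the remaining wait.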
6. Limit Theorems: Law of Large Numbers and CLT
Weak Law of Large Numbers (WLLN)
Let X_1, X_2, ... be i.i.d. random variables with E[X_i] = mu and Var(X_i) = sigma^2 (finite). Define the sample mean X-bar_n = (X_1 + ... + X_n) / n. Then for every epsilon > 0:
P(|X-bar_n - mu| > epsilon) approaches 0 as n approaches infinity
This is convergence in probability. The WLLN follows immediately from Chebyshev's inequality: P(|X-bar_n - mu| > epsilon) <= sigma^2 / (n * epsilon^2), which goes to 0 as n grows.
Strong Law of Large Numbers (SLLN)
Under the same conditions (in fact only requiring E[|X_i|] finite):
P(lim as n approaches infinity of X-bar_n = mu) = 1
This is almost-sure convergence — a strictly stronger statement. The event "the sample mean converges to mu" has probability exactly 1. The SLLN is what justifies the frequency interpretation of probability over the long run.
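Both laws are easy to see numerically: sample means of Bernoulli(0.3) draws settle toward mu = 0.3 as n grows (illustrative Python; the seed and checkpoints are arbitrary):

```python
import random

random.seed(42)
mu = 0.3
total, n = 0, 0
for checkpoint in (10, 100, 10_000, 1_000_000):
    while n < checkpoint:
        total += random.random() < mu   # one Bernoulli(0.3) draw (True counts as 1)
        n += 1
    print(f"n = {n:>9}: sample mean = {total / n:.4f}")
```

The early checkpoints can wander noticeably, but by n = 1,000,000 the sample mean is typically within a fraction of a percent of mu, exactly the behavior the Chebyshev bound sigma^2 / (n * epsilon^2) predicts.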
Central Limit Theorem (CLT)
Central Limit Theorem
Let X_1, X_2, ... be i.i.d. with mean mu and variance sigma^2, both finite and sigma^2 > 0. Then:
sqrt(n) * (X-bar_n - mu) / sigma converges in distribution to N(0, 1)
Equivalently, for large n: X-bar_n is approximately N(mu, sigma^2/n), and S_n = X_1 + ... + X_n is approximately N(n*mu, n*sigma^2).
The CLT is one of the most remarkable results in mathematics — regardless of the underlying distribution (uniform, exponential, Poisson, etc.), the normalized sum converges to a Gaussian. It explains why normal distributions appear so frequently in nature and is the foundation for much of classical statistics.
CLT in Practice
A rule of thumb: the normal approximation is reliable for n >= 30 when the distribution is not severely skewed. For symmetric distributions, even n = 10 may suffice. For the Poisson, the normal approximation is reasonable when lambda >= 5.
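A minimal CLT simulation (illustrative Python): standardized sums of n = 30 Uniform(0,1) draws should already track the normal 68-95 percentages closely, since the uniform is symmetric.

```python
import math
import random

random.seed(7)
n, trials = 30, 20_000
mu, var = 0.5, 1.0 / 12.0                 # Uniform(0,1) mean and variance

zs = []
for _ in range(trials):
    s = sum(random.random() for _ in range(n))
    zs.append((s - n * mu) / math.sqrt(n * var))   # standardize the sum

within_1 = sum(abs(z) <= 1 for z in zs) / trials
within_2 = sum(abs(z) <= 2 for z in zs) / trials
print(f"P(|Z| <= 1) ~ {within_1:.3f} (normal: 0.683)")
print(f"P(|Z| <= 2) ~ {within_2:.3f} (normal: 0.954)")
```

Swapping in a heavily skewed distribution (say Exponential) at the same n shows visibly worse agreement in the tails, which is what the n >= 30 rule of thumb is hedging against.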
Berry-Esseen Theorem (Rate of Convergence)
The Berry-Esseen theorem quantifies how fast the CLT converges. If E[|X|^3] is finite with rho = E[|X - mu|^3], then the error in the normal approximation to the CDF is bounded by C * rho / (sigma^3 * sqrt(n)), where C is an absolute constant (best known value approximately 0.4748). Faster convergence occurs for distributions closer to normal.
7. Markov Chains
The Markov Property
A stochastic process (X_n) indexed by n = 0, 1, 2, ... and taking values in a (discrete) state space S is a Markov chain if it satisfies the Markov property — the future is independent of the past given the present:
P(X_(n+1) = j | X_n = i, X_(n-1) = i_(n-1), ..., X_0 = i_0) = P(X_(n+1) = j | X_n = i)
Transition Matrix
A (time-homogeneous) Markov chain on a finite state space S = {1, ..., N} is fully described by its transition probability matrix P, where:
P_ij = P(X_(n+1) = j | X_n = i)
Each row sums to 1: sum over j of P_ij = 1
The n-step transition probabilities P^(n)_ij = P(X_n = j | X_0 = i) are given by the (i,j) entry of the matrix P^n (P raised to the n-th power).
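Computing n-step probabilities is just matrix multiplication. A minimal sketch in pure Python for a hypothetical two-state weather chain (the transition probabilities below are made up for illustration):

```python
def mat_mul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def mat_pow(P, n):
    """P raised to the n-th power, starting from the identity."""
    result = [[float(i == j) for j in range(len(P))] for i in range(len(P))]
    for _ in range(n):
        result = mat_mul(result, P)
    return result

# Hypothetical chain: state 0 = sunny, state 1 = rainy.
P = [[0.9, 0.1],
     [0.5, 0.5]]

P3 = mat_pow(P, 3)
print(f"P(X_3 = rainy | X_0 = sunny) = {P3[0][1]:.4f}")  # entry (0,1) of P^3
```

Each row of P^n still sums to 1, since P^n is itself a valid transition matrix for the n-step chain.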
Classification of States
| Concept | Definition |
|---|---|
| Accessible | State j is accessible from i if P^(n)_ij > 0 for some n >= 0 |
| Communicating | i and j communicate if each is accessible from the other; written i <-> j |
| Irreducible | All states communicate — only one communicating class |
| Recurrent | P(return to i eventually \| start at i) = 1; sum over n of P^(n)_ii = infinity |
| Transient | P(return to i eventually \| start at i) < 1; chain leaves i forever after some point |
| Period | d(i) = gcd of n such that P^(n)_ii > 0; aperiodic if d(i) = 1 |
Stationary Distribution
A distribution pi on S is stationary (invariant) for the chain if:
pi = pi * P, i.e., pi_j = sum over i of pi_i * P_ij for all j
For an irreducible, positive recurrent Markov chain, a unique stationary distribution exists and pi_i = 1 / E[T_i | X_0 = i] (the reciprocal of the expected return time).
Ergodic Theorem for Markov Chains
For an irreducible, aperiodic, positive recurrent (ergodic) Markov chain, the long-run distribution converges to pi regardless of the starting state: P^(n)_ij approaches pi_j as n approaches infinity for all i and j. Time averages equal space averages (ensemble averages) — a probability analogue of ergodicity in dynamical systems.
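The ergodic theorem suggests a simple way to find pi numerically: iterate pi <- pi * P until the distribution stops changing (power iteration). A sketch for a hypothetical two-state chain (illustrative Python; the matrix is made up):

```python
# Hypothetical irreducible, aperiodic two-state chain.
P = [[0.9, 0.1],
     [0.5, 0.5]]

pi = [1.0, 0.0]                         # any starting distribution works
for _ in range(200):                    # iterate pi <- pi * P
    pi = [sum(pi[i] * P[i][j] for i in range(len(P)))
          for j in range(len(P))]

print([round(x, 4) for x in pi])        # [0.8333, 0.1667]
```

For this matrix the exact answer is pi = (5/6, 1/6), which you can confirm by solving pi_0 = 0.9*pi_0 + 0.5*pi_1 together with pi_0 + pi_1 = 1; starting from (0, 1) instead converges to the same limit, as the ergodic theorem promises.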
8. Convergence of Random Variables
Four Modes of Convergence
Let (X_n) be a sequence of random variables on (Omega, F, P) and let X be a random variable. There are four standard notions of convergence, from strongest to weakest:
1. Almost-Sure Convergence (a.s.) — Strongest
P(lim as n approaches infinity of X_n = X) = 1
The set of outcomes where X_n does NOT converge to X has probability zero. This is the mode used in the SLLN.
2. Convergence in Probability
For every epsilon > 0: P(|X_n - X| > epsilon) approaches 0 as n approaches infinity
Weaker than a.s. convergence. The WLLN gives this mode. Almost-sure convergence implies convergence in probability but not vice versa.
3. Convergence in L^p (Mean-Square for p = 2)
E[|X_n - X|^p] approaches 0 as n approaches infinity (for p >= 1)
L^p convergence implies convergence in probability (by Markov's inequality). L^2 convergence (mean-square) is especially important in signal processing and statistics.
4. Convergence in Distribution (Weak Convergence) — Weakest
F_(X_n)(x) approaches F_X(x) at every point x where F_X is continuous
Only the distribution functions converge, not the random variables themselves. This is the mode in the CLT. It implies nothing about the joint behavior of X_n and X.
Implications Between Modes
a.s. ==> in probability ==> in distribution
L^p ==> in probability ==> in distribution
None of the reverse implications hold in general. Convergence in distribution to a constant implies convergence in probability.
Skorokhod's Representation Theorem
If X_n converges in distribution to X, then there exist random variables Y_n and Y (defined on a common probability space) with Y_n having the same distribution as X_n, Y having the same distribution as X, and Y_n converging to Y almost surely. This theorem is a powerful tool for proving distributional convergence results.
Continuous Mapping Theorem
If X_n converges in distribution (or probability, or a.s.) to X, and g is a continuous function, then g(X_n) converges in the same mode to g(X). This is used extensively when applying transformations to converging sequences — for example, if X_n converges in distribution to N(0,1) then X_n^2 converges to chi-squared(1).
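A simulation of that last example (illustrative Python): standardized sums of uniforms behave like N(0,1) by the CLT, so by the continuous mapping theorem their squares behave like chi-squared(1), which has mean 1.

```python
import math
import random

random.seed(3)
n, trials = 50, 20_000

# Standardized sums of Uniform(0,1) draws: approximately N(0,1) by the CLT.
zs = [(sum(random.random() for _ in range(n)) - n * 0.5) / math.sqrt(n / 12.0)
      for _ in range(trials)]
sq = [z * z for z in zs]                         # g(x) = x^2 is continuous

mean_sq = sum(sq) / trials                       # chi-squared(1) has mean 1
below_1 = sum(s <= 1 for s in sq) / trials       # P(chi2(1) <= 1) = P(|Z| <= 1), about 0.683
print(f"E[Z^2] ~ {mean_sq:.3f}; P(Z^2 <= 1) ~ {below_1:.3f}")
```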
Frequently Asked Questions
What is a sigma-algebra and why does it matter in probability theory?
What are Kolmogorov's three axioms of probability?
How does Bayes' theorem work and when do you use it?
What is the difference between the weak and strong law of large numbers?
What does the Central Limit Theorem say?
What is a Markov chain and what is the Markov property?
What are the four modes of convergence for random variables?
Key Terms Glossary
- Sample space: The set of all possible outcomes of a random experiment
- Sigma-algebra: A collection of subsets of Omega containing Omega and closed under complementation and countable unions
- Probability measure: A function P: F to [0,1] satisfying Kolmogorov's three axioms
- Event: A measurable subset A of Omega belonging to F
- Random variable: A measurable function X: Omega to R
- PMF: Probability mass function — gives P(X=x) for discrete X
- PDF: Probability density function — integrates to give probabilities for continuous X
- CDF: Cumulative distribution function F(x) = P(X <= x)
- Expected value: Probability-weighted average of X; the mean mu = E[X]
- Variance: E[(X - mu)^2] — measures spread around the mean
- MGF: Moment generating function M(t) = E[e^(tX)]
- Independence: P(A intersect B) = P(A)*P(B); knowing one event gives no info about the other
- WLLN: Weak law of large numbers — sample mean converges in probability to mu
- SLLN: Strong law of large numbers — sample mean converges almost surely to mu
- CLT: Central limit theorem — normalized sum converges in distribution to N(0,1)
- Markov chain: Stochastic process satisfying the Markov (memoryless) property
- Stationary distribution: Distribution pi satisfying pi = pi*P; long-run distribution of an ergodic chain
- Almost-sure convergence: X_n to X with probability 1; the strongest mode of convergence