4 Probability: from error handling to inference

4.1 Composing uncertainty like error handling

Every engineer who has designed a distributed system has done probability without calling it that. Suppose you have three independent services, each with 99.5% uptime. What’s the probability that all three are up simultaneously?

# Three independent services, each 99.5% available
p_up = 0.995

# All three up: multiply independent probabilities (AND rule)
p_all_up = p_up ** 3
print(f"P(all three up)  = {p_all_up:.6f}  ({p_all_up * 100:.2f}%)")

# At least one down: complement rule
p_at_least_one_down = 1 - p_all_up
print(f"P(≥1 down)       = {p_at_least_one_down:.6f}  ({p_at_least_one_down * 100:.2f}%)")

P(all three up)  = 0.985075  (98.51%)
P(≥1 down)       = 0.014925  (1.49%)

You’ve just used two rules of probability — the multiplication rule for independent events and the complement rule. These aren’t abstract axioms; they’re the same arithmetic you use when calculating error budgets and composing SLOs. The only difference is that statistics names the rules explicitly and extends them to situations where “independence” doesn’t hold and events aren’t as clean as “up or down.”

Distributions (Section 2.1) and descriptive statistics (Section 3.1) gave us tools for summarising data we already had. Probability works in the other direction: it gives us the rules for reasoning about what data we should expect given what we know about the process that generates it.

The uptime example was simple because the events were binary, independent, and we knew the probabilities exactly. Real problems rarely stay this clean. What if deployments fail more often during peak hours? What if service failures are correlated? The rules of probability give us the general toolkit for handling these complications.

4.2 The rules of probability

All of probability rests on a small set of rules. If you know these four, you can derive almost everything else.

Complement rule. The probability of something not happening is one minus the probability of it happening.

\[ P(\text{not } A) = 1 - P(A) \]

You use this instinctively. “99.5% uptime” means “0.5% downtime.” The complement rule is why error budgets work — the budget is $1 - \text{SLO}$.

Addition rule (OR). The probability of either $A$ or $B$ occurring is the sum of their individual probabilities, minus the overlap (to avoid double-counting).

\[ P(A \cup B) = P(A) + P(B) - P(A \cap B) \]

The $\cup$ symbol means “union” (OR) and $\cap$ means “intersection” (AND) — the same set operations you’d find in any programming language. If the events are mutually exclusive — they can’t both happen — the intersection is empty and the formula simplifies to $P(A) + P(B)$.

Multiplication rule (AND). The probability of both $A$ and $B$ occurring depends on whether they’re independent.

\[ P(A \cap B) = P(A) \times P(B \mid A) \]

Here $P(B \mid A)$ means “the probability of $B$ given that $A$ has occurred” — we’ll define this precisely in a moment. If $A$ and $B$ are independent — knowing $A$ happened tells you nothing about $B$ — this simplifies to $P(A) \times P(B)$. That’s the calculation we did with the three services. But independence is an assumption, not a given, and the failure modes are familiar: just as “independent” microservices often turn out to share a database, a network link, or a cloud availability zone, “independent” events can be driven by a hidden common cause. When independence breaks down, you need the full conditional version of the rule.

Conditional probability. The probability of $A$ given that $B$ has already occurred:

\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} \]

In plain English: restrict your attention to the world where $B$ is true, then ask how often $A$ also happens.

Let’s make this concrete. Imagine you’ve collected deployment outcomes across different times of day.

import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Simulate deployment outcomes by time of day
n = 1000
time_of_day = rng.choice(['off-peak', 'peak'], size=n, p=[0.6, 0.4])

# Failure rate is higher during peak hours
peak_mask = time_of_day == 'peak'
outcome = np.empty(n, dtype='<U7')  # 7 chars fits both 'success' and 'failure'
outcome[peak_mask] = rng.choice(
    ['success', 'failure'], size=peak_mask.sum(), p=[0.90, 0.10]
)
outcome[~peak_mask] = rng.choice(
    ['success', 'failure'], size=(~peak_mask).sum(), p=[0.97, 0.03]
)

deployments = pd.DataFrame({'time': time_of_day, 'outcome': outcome})

# Joint probability table: crosstab counts co-occurrences of two categorical
# variables; normalize='all' converts counts to proportions (joint probabilities)
joint = pd.crosstab(deployments['time'], deployments['outcome'], normalize='all')
print("Joint probability table:")
print(joint.round(4))

# Conditional probability: P(failure | peak)
peak_rows = deployments[deployments['time'] == 'peak']
p_failure_given_peak = (peak_rows['outcome'] == 'failure').mean()
print(f"\nP(failure | peak) = {p_failure_given_peak:.4f}")

# Compare with P(failure) overall
p_failure = (deployments['outcome'] == 'failure').mean()
print(f"P(failure)        = {p_failure:.4f}")

Joint probability table:
outcome   failure  success
time                      
off-peak    0.023    0.580
peak        0.040    0.357

P(failure | peak) = 0.1008
P(failure)        = 0.0630

The table above is a joint probability table — each cell shows the probability of a particular combination of time and outcome. The conditional probability $P(\text{failure} \mid \text{peak})$ is notably higher than the overall failure rate. Time of day and deployment outcome aren’t independent — knowing when a deployment happened changes your estimate of its outcome.

Engineering Bridge

Conditional probability is the WHERE clause of probability. When you write SELECT COUNT(*) FROM deployments WHERE time = 'peak' AND outcome = 'failure' and divide by SELECT COUNT(*) FROM deployments WHERE time = 'peak', you’re computing $P(\text{failure} \mid \text{peak})$ exactly. The conditional restricts the denominator to a subset of the data, just like a WHERE clause restricts the rows. If you’ve ever computed a metric “sliced by” some dimension in a dashboard, you’ve computed a conditional probability.

4.3 Bayes’ theorem: updating beliefs

The conditional probability formula can be rearranged into one of the most important results in all of statistics. Starting from:

\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} \quad \text{and} \quad P(B \mid A) = \frac{P(A \cap B)}{P(A)} \]

we can substitute to get Bayes’ theorem:

\[ P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)} \]

Each term has a name — borrowed from Bayesian statistics, where $A$ is a hypothesis and $B$ is observed data. $P(A)$ is the prior — what you believed before seeing the data. $P(B \mid A)$ is the likelihood — how probable the data is if your hypothesis is correct. $P(A \mid B)$ is the posterior — your updated belief after seeing the data. And $P(B)$ is the normalising constant (also called the “evidence”) that ensures the posterior is a valid probability.

This isn’t just algebra. Bayes’ theorem describes a process: start with a belief, observe data, update the belief. It’s the formal machinery for learning from evidence.

Here’s a scenario every on-call engineer has lived. Your monitoring system fires an alert. Should you wake up the team?

# Monitoring alert scenario
p_incident = 0.02           # 2% of the time, there's a real incident
p_alert_given_incident = 0.95   # Alert fires 95% of the time when there IS an incident
p_alert_given_no_incident = 0.05  # Alert fires 5% of the time when there ISN'T one

# P(alert) via the law of total probability:
# enumerate every way an alert can fire, weight by probability of each path
p_alert = (
    p_alert_given_incident * p_incident +
    p_alert_given_no_incident * (1 - p_incident)
)

# Bayes' theorem: P(incident | alert)
p_incident_given_alert = (
    p_alert_given_incident * p_incident / p_alert
)

print(f"P(incident)          = {p_incident:.4f}")
print(f"P(alert)             = {p_alert:.4f}")
print(f"P(incident | alert)  = {p_incident_given_alert:.4f}")
print(f"\nWhen the alert fires, there's only a {p_incident_given_alert:.0%} chance "
      f"of a real incident.")

P(incident)          = 0.0200
P(alert)             = 0.0680
P(incident | alert)  = 0.2794

When the alert fires, there's only a 28% chance of a real incident.

A detector with 95% sensitivity (it catches 95% of real incidents) and only a 5% false positive rate (it fires on 5% of non-incidents) still produces alerts that are mostly noise — out of every 100 alerts that fire, roughly 72 are false alarms. The prior dominates: real incidents are so rare (a base rate of $P(\text{incident}) = 0.02$) that even a low false positive rate generates more noise than signal. This is why alert fatigue is a mathematical inevitability when base rates are low and thresholds aren’t adjusted.

Notice the denominator calculation in the code: we computed $P(\text{alert})$ by enumerating all the ways an alert can fire — via a real incident or via a false positive — and summing. This is the law of total probability, the probabilistic equivalent of handling every branch of an enum. You’ll see this pattern whenever you need the denominator of Bayes’ theorem.

Author’s Note

This calculation genuinely changed how I think about monitoring. I used to evaluate alerting rules purely by their detection rate and false positive rate, as though those two numbers told the whole story. Bayes’ theorem shows they don’t — the base rate of the event you’re detecting matters enormously. A 5% false positive rate sounds low, but when real incidents are much rarer than the false positive rate, most of your alerts are still noise. Now I always ask: “what’s the base rate?” before evaluating any classifier or detector.

We’ll return to Bayesian reasoning as a full inference framework in Bayesian inference: updating beliefs with evidence. For now, the key insight is that probability isn’t static — it updates as evidence arrives.

4.4 From one sample to many: the sampling distribution

Here’s a question that doesn’t come up in deterministic software but is central to statistics: if you collected a different sample, how different would your answer be?

In Section 3.2, we computed the sample mean of some response times — a single number summarising 2,000 observations. That mean was a point estimate of the true population mean. But if we’d measured a different batch of requests, we’d have got a different sample mean. And a different one again the next time. The distribution of those sample means — across all possible samples of the same size — is the sampling distribution.

This isn’t just a philosophical point. It’s the mechanism that lets us quantify how much we should trust any single estimate.

rng = np.random.default_rng(42)

population_mean = 50
n_samples = 2000
sample_size = 50

# Draw many samples from a skewed population
# stats.expon(scale=μ) parameterises by mean, not rate — scale = 1/λ
population = stats.expon(scale=population_mean)
sample_means = [
    population.rvs(size=sample_size, random_state=rng).mean()
    for _ in range(n_samples)
]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Left: population distribution
x = np.linspace(0, 250, 300)
ax1.plot(x, population.pdf(x), 'steelblue', linewidth=2)
ax1.fill_between(x, population.pdf(x), alpha=0.2, color='steelblue')
ax1.axvline(population_mean, color='coral', linestyle='--', linewidth=1.5,
            label=f'μ = {population_mean}')
ax1.set_xlabel("Value")
ax1.set_ylabel("Density")
ax1.set_title("Population: Exponential(mean=50)")
ax1.legend()

# Right: sampling distribution of the mean
ax2.hist(sample_means, bins=50, density=True, color='steelblue', alpha=0.7,
         edgecolor='white')
ax2.axvline(np.mean(sample_means), color='coral', linestyle='--', linewidth=1.5,
            label=f'Mean of means = {np.mean(sample_means):.1f}')
ax2.set_xlabel("Sample mean")
ax2.set_ylabel("Density")
ax2.set_title(f"Sampling distribution (n={sample_size}, {n_samples} samples)")
ax2.legend()

plt.tight_layout()
plt.show()

Two-panel figure. The left panel shows the exponential population distribution, which is heavily right-skewed with most density near zero and a long tail. The right panel shows a histogram of 2,000 sample means, which forms a symmetric bell shape centred on the true population mean of 50, despite the skewed population. — Figure 4.1: The population is heavily right-skewed, but the sampling distribution of the mean is approximately normal — a preview of the Central Limit Theorem.

Three observations from Figure 4.1:

Shape. The population is dramatically right-skewed, but the sampling distribution of the mean looks approximately normal.
Centre. The sampling distribution is centred on the true population mean $\mu = 50$.
Spread. The sampling distribution is much narrower than the population — averaging 50 observations smooths out the noise considerably.

None of these are coincidences. They all follow from a single theorem.

4.5 The Central Limit Theorem

The Central Limit Theorem (CLT) says: take the mean of $n$ independent observations, each drawn from the same distribution (statisticians call this “identically distributed” — together, “independent and identically distributed,” or i.i.d.), with finite mean $\mu$ and finite variance $\sigma^2$. As $n$ grows, the distribution of that sample mean approaches a Normal distribution:

\[ \bar{X}_n \;\approx\; \text{Normal}\!\left(\mu,\; \frac{\sigma^2}{n}\right) \quad \text{for large } n \]

Here $\bar{X}_n$ denotes the sample mean of $n$ observations — the capital letter signals that it’s a random variable (it varies from sample to sample), while the bar indicates averaging. The Normal distribution is parameterised by its mean and variance (not standard deviation).

In plain English: it doesn’t matter what the underlying data looks like — skewed, bimodal (two peaks), uniform, or anything else. Average enough of it and the result is approximately normal. This holds regardless of the shape of the original distribution, provided the observations are drawn from the same distribution and the variance is finite.

The CLT has three implications that underpin almost all of inferential statistics — the branch of statistics concerned with drawing conclusions about a population from a sample:

Shape → normal. The sampling distribution of the mean is approximately normal, even when the population isn’t.
Centre → $\mu$. The sampling distribution is centred on the true population mean. Sample means are unbiased — they don’t systematically overshoot or undershoot.
Spread → $\sigma / \sqrt{n}$. The variability of the sample mean decreases as the sample size increases, at a rate of $1/\sqrt{n}$.

Let’s see this in action across four very different populations.

rng = np.random.default_rng(42)

n_simulations = 5000
sample_sizes = [1, 5, 30]

# Four different population shapes
def draw_uniform(size):
    return rng.uniform(0, 1, size)

def draw_exponential(size):
    return rng.exponential(scale=1, size=size)

def draw_bimodal(size):
    # Mixture of two normals
    mix = rng.choice([0, 1], size=size, p=[0.5, 0.5])
    return np.where(mix == 0,
                    rng.normal(-2, 0.5, size),
                    rng.normal(2, 0.5, size))

def draw_binomial(size):
    return rng.binomial(n=1, p=0.2, size=size).astype(float)

# Theoretical mean and variance for each population
populations = [
    ("Uniform(0, 1)", draw_uniform, 0.5, 1/12),
    ("Exponential(1)", draw_exponential, 1.0, 1.0),
    ("Bimodal", draw_bimodal, 0.0, 4.25),  # E[X]=0, Var = 0.5*(0.25+4) + 0.5*(0.25+4) = 4.25
    ("Binomial(1, 0.2)", draw_binomial, 0.2, 0.16),
]

fig, axes = plt.subplots(4, 3, figsize=(12, 12))

for row, (name, draw_fn, mu_pop, var_pop) in enumerate(populations):
    for col, n in enumerate(sample_sizes):
        means = [draw_fn(n).mean() for _ in range(n_simulations)]
        ax = axes[row, col]
        ax.hist(means, bins=40, density=True, color='steelblue', alpha=0.7,
                edgecolor='white')

        if n > 1:
            # Overlay CLT-predicted normal (theoretical, not fitted)
            se_predicted = np.sqrt(var_pop / n)
            x_line = np.linspace(mu_pop - 4 * se_predicted,
                                 mu_pop + 4 * se_predicted, 200)
            ax.plot(x_line, stats.norm(mu_pop, se_predicted).pdf(x_line),
                    'coral', linewidth=1.5,
                    label='CLT prediction' if row == 0 and col == 1 else None)

        if row == 0:
            ax.set_title(f"n = {n}", fontsize=11)
        if col == 0:
            ax.set_ylabel(name, fontsize=10)
        if row == 3:
            ax.set_xlabel("Sample mean")

# Add a single legend for the coral curve
axes[0, 1].legend(loc='upper right', fontsize=9)

plt.suptitle("Central Limit Theorem: convergence to normal", y=1.01,
             fontsize=13)
plt.tight_layout()
plt.show()

A 4-by-3 grid of histograms. Each row starts from a different population shape (uniform, exponential, bimodal, binomial). The leftmost column (n=1) shows the raw population shape. The middle column (n=5) shows partial convergence. The rightmost column (n=30) shows all four converging to a bell shape matching the overlaid coral normal curve, regardless of starting distribution. — Figure 4.2: The Central Limit Theorem at work. Each row is a different population shape. Each column increases the sample size. The coral curves show the CLT-predicted normal distribution.

With $n = 1$, the “mean” of a single observation is just the observation itself, so the leftmost column shows the raw population shape — each looks completely different. The middle column ($n = 5$) already shows convergence beginning: the sharp edges and asymmetries soften. By $n = 30$, all four are well-approximated by a Normal distribution (the coral curve). Averaging has erased the starting shape.

A common rule of thumb says $n = 30$ is “enough” for the CLT to kick in, and for these four populations it’s visually convincing. But heavily skewed or heavy-tailed distributions — the kind you encounter with network latencies or file sizes — can require hundreds or thousands of observations before the normal approximation becomes reliable. Treat $n = 30$ as a starting point, not a guarantee.

This is why the normal distribution from Section 2.4.1 appears so often in practice. It’s not that the world is inherently normal — it’s that averages are approximately Normal, and most of the statistics we compute (means, proportions, and regression coefficients — weights that quantify the relationship between variables, covered in Linear regression: your first model) are some form of average.

4.6 Standard error: the precision of an estimate

The CLT tells us that the spread of the sampling distribution is $\sigma / \sqrt{n}$. This quantity has its own name: the standard error (SE).

\[ \text{SE} = \frac{\sigma}{\sqrt{n}} \]

In practice, we rarely know $\sigma$ (the true population standard deviation), so we estimate the standard error using the sample standard deviation $s$:

\[ \widehat{\text{SE}} = \frac{s}{\sqrt{n}} \]

The hat on $\widehat{\text{SE}}$ is statistical notation meaning “estimated from data” — we don’t know the true SE (which depends on $\sigma$), so we estimate it using $s$. In practice, most authors drop the hat and simply write SE, since the context makes it clear whether the true or estimated version is meant. We’ll follow that convention from here on. The standard error tells you how precisely you’ve estimated the mean. A small SE means your sample mean is likely close to the true mean; a large SE means it could be far off.

The $\sqrt{n}$ in the denominator has a critical implication: doubling your precision requires quadrupling your data. Going from $n = 100$ to $n = 400$ halves the standard error. Going from 400 to 1,600 halves it again. There are diminishing returns to collecting more data — each additional observation contributes less to precision than the last.

rng = np.random.default_rng(42)

# Population: Exponential with mean 50 (sigma = 50 for exponential)
population_sigma = 50
sample_sizes = [10, 25, 50, 100, 250, 500, 1000]

theoretical_se = [population_sigma / np.sqrt(n) for n in sample_sizes]
empirical_se = []

for n in sample_sizes:
    means = [rng.exponential(scale=50, size=n).mean()
             for _ in range(5000)]
    empirical_se.append(np.std(means))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Left: theoretical vs empirical
ax1.plot(sample_sizes, theoretical_se, 'coral', linewidth=2, marker='o',
         label='Theoretical: σ/√n')
ax1.plot(sample_sizes, empirical_se, 'steelblue', linewidth=2, marker='s',
         linestyle='--', label='Empirical (5,000 simulations)')
ax1.set_xlabel("Sample size (n)")
ax1.set_ylabel("Standard error")
ax1.set_title("Theoretical vs empirical SE")
ax1.legend()

# Right: SE as a function of n (continuous curve)
n_range = np.linspace(5, 1000, 200)
ax2.plot(n_range, population_sigma / np.sqrt(n_range), 'steelblue', linewidth=2)
ax2.axhline(y=population_sigma / np.sqrt(100), color='grey', linestyle='--',
            alpha=0.5)
ax2.annotate("n=100", xy=(100, population_sigma / np.sqrt(100)),
             xytext=(200, population_sigma / np.sqrt(100) + 2),
             arrowprops=dict(arrowstyle='->', color='grey'), fontsize=9,
             color='grey')
ax2.axhline(y=population_sigma / np.sqrt(400), color='grey', linestyle='--',
            alpha=0.5)
ax2.annotate("n=400 (half the SE)", xy=(400, population_sigma / np.sqrt(400)),
             xytext=(500, population_sigma / np.sqrt(400) + 2),
             arrowprops=dict(arrowstyle='->', color='grey'), fontsize=9,
             color='grey')
ax2.set_xlabel("Sample size (n)")
ax2.set_ylabel("Standard error")
ax2.set_title("Diminishing returns: SE vs sample size")

plt.tight_layout()
plt.show()

Two-panel figure. The left panel plots theoretical SE (solid coral line) and empirical SE (dashed steelblue line with square markers) against sample size; both curves decline together, confirming the theory. The right panel shows a smooth SE-vs-n curve with annotations highlighting that going from n=100 to n=400 halves the standard error. — Figure 4.3: Theoretical vs empirical standard error as sample size grows, and the diminishing returns curve — quadruple the data for double the precision.

Engineering Bridge

The $\sqrt{n}$ curve is the capacity planning problem of statistical precision. Every engineer knows the economics of diminishing returns — doubling your server count doesn’t halve your latency. The standard error follows a similar economic pattern: you can always buy more precision with more data, but the cost-per-unit-of-precision increases quadratically. Halving the SE requires 4× the data. Halving it again requires 16×. The SE does keep shrinking — there’s no hard ceiling — but the economics become unfavourable fast. This is why sample size calculations matter: they tell you the point at which collecting more data isn’t worth the cost.

Author’s Note

The $\sqrt{n}$ relationship changed how I think about data collection. My engineering instinct was always “more data is better” — just log everything, store it all, analyse later. But the standard error shows that data has diminishing marginal returns. Going from 100 to 10,000 observations is transformative; going from 10,000 to 1,000,000 barely moves the needle on precision. Once I internalised this, I stopped reflexively reaching for bigger datasets and started asking “how precise do I actually need to be?” first. That question — answered through sample size calculations — saves enormous amounts of time and compute.

4.7 Worked example: can we trust this A/B test result?

Let’s put these ideas together with a scenario you’ll encounter in practice. Your team has run an A/B test on a checkout flow. The control group had a 12.0% conversion rate ($n = 1\text{,}000$), and the variant had a 13.5% conversion rate ($n = 1\text{,}000$). That’s a 1.5 percentage point difference. Is this real, or could it be noise?

We’re not going to answer this question fully here — that’s what hypothesis testing and confidence intervals are for. But we can use the tools from this chapter to frame the question precisely. We know the standard error of each proportion, and therefore the SE of their difference — so we can ask: how many standard errors apart are these two estimates?

# Observed data
p_control = 0.120
p_variant = 0.135
n_ab = 1000

# Standard errors for proportions: SE = sqrt(p(1-p)/n)
# For binary outcomes (convert/don't), variance = p(1-p) — from the
# Bernoulli distribution (see Ch2). Dividing by n and taking the square
# root gives the SE of the estimated proportion.
se_control = np.sqrt(p_control * (1 - p_control) / n_ab)
se_variant = np.sqrt(p_variant * (1 - p_variant) / n_ab)

# We use the observed proportions as plug-in estimates (substituting
# sample values where the formula calls for unknown population values)
print(f"Control: p = {p_control:.3f}, SE = {se_control:.4f}")
print(f"Variant: p = {p_variant:.3f}, SE = {se_variant:.4f}")

# How far apart are they in standard-error units?
# SE of a difference: variances add for independent samples, then take sqrt.
# Note: a formal hypothesis test would pool the proportions under H₀,
# giving a slightly different SE. We'll see that approach in Chapter 5.
se_diff = np.sqrt(se_control**2 + se_variant**2)
z = (p_variant - p_control) / se_diff
print(f"\nSE of the difference: {se_diff:.4f}")
# z-score: how many standard errors the difference is from zero
print(f"Difference in SE units (z): {z:.2f}")

Control: p = 0.120, SE = 0.0103
Variant: p = 0.135, SE = 0.0108

SE of the difference: 0.0149
Difference in SE units (z): 1.01

# Sampling distributions (by CLT, approximately normal)
x = np.linspace(0.08, 0.18, 300)

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(x, stats.norm(p_control, se_control).pdf(x), 'steelblue', linewidth=2,
        label=f'Control: {p_control:.1%} ± {se_control:.4f}')
ax.fill_between(x, stats.norm(p_control, se_control).pdf(x), alpha=0.2,
                color='steelblue')
ax.plot(x, stats.norm(p_variant, se_variant).pdf(x), 'coral', linewidth=2,
        label=f'Variant: {p_variant:.1%} ± {se_variant:.4f}')
ax.fill_between(x, stats.norm(p_variant, se_variant).pdf(x), alpha=0.2,
                color='coral')

ax.set_xlabel("Conversion rate")
ax.set_ylabel("Density")
ax.set_title("Sampling distributions of two conversion rates")
ax.legend()
plt.tight_layout()
plt.show()

Two overlapping bell curves representing the sampling distributions of the control (12.0%, steelblue) and variant (13.5%, coral) conversion rates. The curves share a wide region of overlap, illustrating that the observed difference could plausibly be due to sampling noise rather than a genuine effect. — Figure 4.4: The sampling distributions of the two conversion rates overlap substantially. The observed difference might be real — or it might be sampling noise.

The two distributions share a wide band of overlap. The 1.5 percentage point difference could be a genuine improvement — or it could be the kind of variation you’d expect from two samples drawn from the same underlying rate. The z-score (a standardised measure of “how many standard errors away from zero”) of roughly 1.0 tells us the difference is about one standard error wide — suggestive, but far from conclusive. (We’ll learn what to do with that number in Hypothesis testing: the unit test of data science.)

To resolve this properly, we need formal tools: a way to quantify the probability that the observed difference arose by chance (hypothesis testing), a way to express the range of plausible true values (confidence intervals), and a structured approach to designing and interpreting these experiments (A/B testing methodology). Those are exactly the topics of the next three chapters: Hypothesis testing: the unit test of data science, Confidence intervals: error bounds for estimates, and A/B testing: deploying experiments.

4.8 Summary

Probability has a small set of composable rules — complement, addition, multiplication, and conditioning. These are the same rules you use for SLO composition and error budgets, now formalised.
Bayes’ theorem turns conditional probability into a learning process — it updates prior beliefs in light of evidence. The base rate of the event you’re detecting matters at least as much as the accuracy of your detector.
The sampling distribution describes the variability of an estimate — if you collected a different sample, you’d get a different answer. The sampling distribution tells you how much those answers would vary.
The Central Limit Theorem says that averages are approximately normal — regardless of the shape of the underlying population, for large enough $n$. This is the theoretical foundation of most inferential statistics.
The standard error quantifies the precision of an estimate — it shrinks as $\sigma / \sqrt{n}$, meaning you get diminishing returns from additional data. Knowing the SE lets you move from “here is the sample mean” to “here is how much we should trust the sample mean” — which is exactly where inference begins.

4.9 Exercises

A notification system delivers alerts through three independent channels: email (98% delivery rate), SMS (95%), and push notification (90%). What is the probability that a user receives an alert through at least one channel? Compute this analytically using the complement rule, then verify by simulating 100,000 alerts using numpy.
The Central Limit Theorem works for any distribution with finite variance. Use scipy.stats.poisson(mu=3) to explore this: draw 5,000 samples of sizes $n \in \{1, 5, 15, 50\}$, compute the mean of each, and plot the four sampling distributions. For each, overlay a normal distribution with the CLT-predicted mean and standard deviation. At what sample size does the normal approximation become convincing? A QQ plot (quantile-quantile plot) provides a more rigorous check — it plots your data’s quantiles against a theoretical distribution’s quantiles, where points falling on the diagonal indicate a good fit. Use scipy.stats.probplot for the $n = 50$ case.
Using an Exponential distribution with mean 100, compute the theoretical standard error for sample sizes $n \in \{25, 100, 400, 1600\}$. Verify each empirically by drawing 10,000 samples. Plot both the theoretical and empirical SE values against $n$ on the same axes. How does the $4\times$ rule (quadruple data, halve SE) hold up in practice?
A code-coverage tool claims to detect bugs with 80% accuracy (it flags 80% of real bugs). It also has a 10% false positive rate (it flags 10% of clean code). In a mature codebase, suppose 3% of functions actually contain a bug. Using Bayes’ theorem, calculate $P(\text{bug} \mid \text{flagged})$. Now repeat the calculation for base rates of 1%, 5%, and 20%. What does this tell you about the usefulness of the tool across different codebases?
Conceptual: This chapter opened by comparing probability rules to error-handling composition in distributed systems. Where does this analogy break down? Consider: what happens when events aren’t independent? When “failure” isn’t binary? When the probability of failure changes over time? Identify at least two ways the clean algebraic composition of probabilities in the opening example doesn’t map to real-world systems, and explain what statistical tools you’d reach for in each case.

--- title: "Probability: from error handling to inference" --- ## Composing uncertainty like error handling {#sec-probability} Every engineer who has designed a distributed system has done probability without calling it that. Suppose you have three independent services, each with 99.5% uptime. What's the probability that all three are up simultaneously? ```{python} #| label: uptime-composition #| echo: true # Three independent services, each 99.5% available p_up = 0.995 # All three up: multiply independent probabilities (AND rule) p_all_up = p_up ** 3 print(f"P(all three up) = {p_all_up:.6f} ({p_all_up * 100:.2f}%)") # At least one down: complement rule p_at_least_one_down = 1 - p_all_up print(f"P(≥1 down) = {p_at_least_one_down:.6f} ({p_at_least_one_down * 100:.2f}%)") ``` You've just used two rules of probability — the **multiplication rule** for independent events and the **complement rule**. These aren't abstract axioms; they're the same arithmetic you use when calculating error budgets and composing SLOs. The only difference is that statistics names the rules explicitly and extends them to situations where "independence" doesn't hold and events aren't as clean as "up or down." Distributions (@sec-distributions) and descriptive statistics (@sec-descriptive-stats) gave us tools for summarising data we already had. Probability works in the other direction: it gives us the rules for reasoning about *what data we should expect* given what we know about the process that generates it. The uptime example was simple because the events were binary, independent, and we knew the probabilities exactly. Real problems rarely stay this clean. What if deployments fail more often during peak hours? What if service failures are correlated? The rules of probability give us the general toolkit for handling these complications. ## The rules of probability {#sec-probability-rules} All of probability rests on a small set of rules. If you know these four, you can derive almost everything else. **Complement rule.** The probability of something *not* happening is one minus the probability of it happening. $$ P(\text{not } A) = 1 - P(A) $$ You use this instinctively. "99.5% uptime" means "0.5% downtime." The complement rule is why error budgets work — the budget *is* $1 - \text{SLO}$. **Addition rule (OR).** The probability of *either* $A$ or $B$ occurring is the sum of their individual probabilities, minus the overlap (to avoid double-counting). $$ P(A \cup B) = P(A) + P(B) - P(A \cap B) $$ The $\cup$ symbol means "union" (OR) and $\cap$ means "intersection" (AND) — the same set operations you'd find in any programming language. If the events are mutually exclusive — they can't both happen — the intersection is empty and the formula simplifies to $P(A) + P(B)$. **Multiplication rule (AND).** The probability of *both* $A$ and $B$ occurring depends on whether they're independent. $$ P(A \cap B) = P(A) \times P(B \mid A) $$ Here $P(B \mid A)$ means "the probability of $B$ given that $A$ has occurred" — we'll define this precisely in a moment. If $A$ and $B$ are independent — knowing $A$ happened tells you nothing about $B$ — this simplifies to $P(A) \times P(B)$. That's the calculation we did with the three services. But independence is an assumption, not a given, and the failure modes are familiar: just as "independent" microservices often turn out to share a database, a network link, or a cloud availability zone, "independent" events can be driven by a hidden common cause. When independence breaks down, you need the full conditional version of the rule. **Conditional probability.** The probability of $A$ given that $B$ has already occurred: $$ P(A \mid B) = \frac{P(A \cap B)}{P(B)} $$ In plain English: restrict your attention to the world where $B$ is true, then ask how often $A$ also happens. Let's make this concrete. Imagine you've collected deployment outcomes across different times of day. ```{python} #| label: joint-probability-table #| echo: true import numpy as np import pandas as pd from scipy import stats import matplotlib.pyplot as plt rng = np.random.default_rng(42) # Simulate deployment outcomes by time of day n = 1000 time_of_day = rng.choice(['off-peak', 'peak'], size=n, p=[0.6, 0.4]) # Failure rate is higher during peak hours peak_mask = time_of_day == 'peak' outcome = np.empty(n, dtype='<U7') # 7 chars fits both 'success' and 'failure' outcome[peak_mask] = rng.choice( ['success', 'failure'], size=peak_mask.sum(), p=[0.90, 0.10] ) outcome[~peak_mask] = rng.choice( ['success', 'failure'], size=(~peak_mask).sum(), p=[0.97, 0.03] ) deployments = pd.DataFrame({'time': time_of_day, 'outcome': outcome}) # Joint probability table: crosstab counts co-occurrences of two categorical # variables; normalize='all' converts counts to proportions (joint probabilities) joint = pd.crosstab(deployments['time'], deployments['outcome'], normalize='all') print("Joint probability table:") print(joint.round(4)) # Conditional probability: P(failure | peak) peak_rows = deployments[deployments['time'] == 'peak'] p_failure_given_peak = (peak_rows['outcome'] == 'failure').mean() print(f"\nP(failure | peak) = {p_failure_given_peak:.4f}") # Compare with P(failure) overall p_failure = (deployments['outcome'] == 'failure').mean() print(f"P(failure) = {p_failure:.4f}") ``` The table above is a **joint probability table** — each cell shows the probability of a particular combination of time and outcome. The conditional probability $P(\text{failure} \mid \text{peak})$ is notably higher than the overall failure rate. Time of day and deployment outcome aren't independent — knowing when a deployment happened changes your estimate of its outcome. ::: {.callout-note} ## Engineering Bridge Conditional probability is the `WHERE` clause of probability. When you write `SELECT COUNT(*) FROM deployments WHERE time = 'peak' AND outcome = 'failure'` and divide by `SELECT COUNT(*) FROM deployments WHERE time = 'peak'`, you're computing $P(\text{failure} \mid \text{peak})$ exactly. The conditional restricts the denominator to a subset of the data, just like a `WHERE` clause restricts the rows. If you've ever computed a metric "sliced by" some dimension in a dashboard, you've computed a conditional probability. ::: ## Bayes' theorem: updating beliefs {#sec-bayes} The conditional probability formula can be rearranged into one of the most important results in all of statistics. Starting from: $$ P(A \mid B) = \frac{P(A \cap B)}{P(B)} \quad \text{and} \quad P(B \mid A) = \frac{P(A \cap B)}{P(A)} $$ we can substitute to get **Bayes' theorem**: $$ P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)} $$ Each term has a name — borrowed from Bayesian statistics, where $A$ is a hypothesis and $B$ is observed data. $P(A)$ is the **prior** — what you believed before seeing the data. $P(B \mid A)$ is the **likelihood** — how probable the data is if your hypothesis is correct. $P(A \mid B)$ is the **posterior** — your updated belief after seeing the data. And $P(B)$ is the **normalising constant** (also called the "evidence") that ensures the posterior is a valid probability. This isn't just algebra. Bayes' theorem describes a *process*: start with a belief, observe data, update the belief. It's the formal machinery for learning from evidence. Here's a scenario every on-call engineer has lived. Your monitoring system fires an alert. Should you wake up the team? ```{python} #| label: bayes-alerting #| echo: true # Monitoring alert scenario p_incident = 0.02 # 2% of the time, there's a real incident p_alert_given_incident = 0.95 # Alert fires 95% of the time when there IS an incident p_alert_given_no_incident = 0.05 # Alert fires 5% of the time when there ISN'T one # P(alert) via the law of total probability: # enumerate every way an alert can fire, weight by probability of each path p_alert = ( p_alert_given_incident * p_incident + p_alert_given_no_incident * (1 - p_incident) ) # Bayes' theorem: P(incident | alert) p_incident_given_alert = ( p_alert_given_incident * p_incident / p_alert ) print(f"P(incident) = {p_incident:.4f}") print(f"P(alert) = {p_alert:.4f}") print(f"P(incident | alert) = {p_incident_given_alert:.4f}") print(f"\nWhen the alert fires, there's only a {p_incident_given_alert:.0%} chance " f"of a real incident.") ``` A detector with 95% **sensitivity** (it catches 95% of real incidents) and only a 5% false positive rate (it fires on 5% of *non-incidents*) still produces alerts that are mostly noise — out of every 100 alerts that fire, roughly 72 are false alarms. The prior dominates: real incidents are so rare (a **base rate** of $P(\text{incident}) = 0.02$) that even a low false positive rate generates more noise than signal. This is why alert fatigue is a mathematical inevitability when base rates are low and thresholds aren't adjusted. Notice the denominator calculation in the code: we computed $P(\text{alert})$ by enumerating all the ways an alert can fire — via a real incident *or* via a false positive — and summing. This is the **law of total probability**, the probabilistic equivalent of handling every branch of an enum. You'll see this pattern whenever you need the denominator of Bayes' theorem. ::: {.callout-tip} ## Author's Note This calculation genuinely changed how I think about monitoring. I used to evaluate alerting rules purely by their detection rate and false positive rate, as though those two numbers told the whole story. Bayes' theorem shows they don't — the base rate of the event you're detecting matters enormously. A 5% false positive rate sounds low, but when real incidents are much rarer than the false positive rate, most of your alerts are still noise. Now I always ask: "what's the base rate?" before evaluating any classifier or detector. ::: We'll return to Bayesian reasoning as a full inference framework in *Bayesian inference: updating beliefs with evidence*. For now, the key insight is that probability isn't static — it updates as evidence arrives. ## From one sample to many: the sampling distribution {#sec-sampling-distribution} Here's a question that doesn't come up in deterministic software but is central to statistics: *if you collected a different sample, how different would your answer be?* In @sec-location, we computed the sample mean of some response times — a single number summarising 2,000 observations. That mean was a *point estimate* of the true population mean. But if we'd measured a different batch of requests, we'd have got a different sample mean. And a different one again the next time. The distribution of those sample means — across all possible samples of the same size — is the **sampling distribution**. This isn't just a philosophical point. It's the mechanism that lets us quantify how much we should trust any single estimate. ```{python} #| label: fig-sampling-distribution #| echo: true #| fig-cap: "The population is heavily right-skewed, but the sampling distribution of the mean is approximately normal — a preview of the Central Limit Theorem." #| fig-alt: "Two-panel figure. The left panel shows the exponential population distribution, which is heavily right-skewed with most density near zero and a long tail. The right panel shows a histogram of 2,000 sample means, which forms a symmetric bell shape centred on the true population mean of 50, despite the skewed population." rng = np.random.default_rng(42) population_mean = 50 n_samples = 2000 sample_size = 50 # Draw many samples from a skewed population # stats.expon(scale=μ) parameterises by mean, not rate — scale = 1/λ population = stats.expon(scale=population_mean) sample_means = [ population.rvs(size=sample_size, random_state=rng).mean() for _ in range(n_samples) ] fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4)) # Left: population distribution x = np.linspace(0, 250, 300) ax1.plot(x, population.pdf(x), 'steelblue', linewidth=2) ax1.fill_between(x, population.pdf(x), alpha=0.2, color='steelblue') ax1.axvline(population_mean, color='coral', linestyle='--', linewidth=1.5, label=f'μ = {population_mean}') ax1.set_xlabel("Value") ax1.set_ylabel("Density") ax1.set_title("Population: Exponential(mean=50)") ax1.legend() # Right: sampling distribution of the mean ax2.hist(sample_means, bins=50, density=True, color='steelblue', alpha=0.7, edgecolor='white') ax2.axvline(np.mean(sample_means), color='coral', linestyle='--', linewidth=1.5, label=f'Mean of means = {np.mean(sample_means):.1f}') ax2.set_xlabel("Sample mean") ax2.set_ylabel("Density") ax2.set_title(f"Sampling distribution (n={sample_size}, {n_samples} samples)") ax2.legend() plt.tight_layout() plt.show() ``` Three observations from @fig-sampling-distribution: 1. **Shape.** The population is dramatically right-skewed, but the sampling distribution of the mean looks approximately normal. 2. **Centre.** The sampling distribution is centred on the true population mean $\mu = 50$. 3. **Spread.** The sampling distribution is much narrower than the population — averaging 50 observations smooths out the noise considerably. None of these are coincidences. They all follow from a single theorem. ## The Central Limit Theorem {#sec-clt} The **Central Limit Theorem** (CLT) says: take the mean of $n$ independent observations, each drawn from the *same* distribution (statisticians call this "identically distributed" — together, "independent and identically distributed," or **i.i.d.**), with finite mean $\mu$ and finite variance $\sigma^2$. As $n$ grows, the distribution of that sample mean approaches a Normal distribution: $$ \bar{X}_n \;\approx\; \text{Normal}\!\left(\mu,\; \frac{\sigma^2}{n}\right) \quad \text{for large } n $$ Here $\bar{X}_n$ denotes the sample mean of $n$ observations — the capital letter signals that it's a random variable (it varies from sample to sample), while the bar indicates averaging. The Normal distribution is parameterised by its mean and variance (not standard deviation). In plain English: it doesn't matter what the underlying data looks like — skewed, **bimodal** (two peaks), uniform, or anything else. Average enough of it and the result is approximately normal. This holds regardless of the shape of the original distribution, provided the observations are drawn from the same distribution and the variance is finite. The CLT has three implications that underpin almost all of **inferential statistics** — the branch of statistics concerned with drawing conclusions about a population from a sample: 1. **Shape → normal.** The sampling distribution of the mean is approximately normal, even when the population isn't. 2. **Centre → $\mu$.** The sampling distribution is centred on the true population mean. Sample means are *unbiased* — they don't systematically overshoot or undershoot. 3. **Spread → $\sigma / \sqrt{n}$.** The variability of the sample mean *decreases* as the sample size increases, at a rate of $1/\sqrt{n}$. Let's see this in action across four very different populations. ```{python} #| label: fig-clt-grid #| echo: true #| fig-cap: "The Central Limit Theorem at work. Each row is a different population shape. Each column increases the sample size. The coral curves show the CLT-predicted normal distribution." #| fig-alt: "A 4-by-3 grid of histograms. Each row starts from a different population shape (uniform, exponential, bimodal, binomial). The leftmost column (n=1) shows the raw population shape. The middle column (n=5) shows partial convergence. The rightmost column (n=30) shows all four converging to a bell shape matching the overlaid coral normal curve, regardless of starting distribution." rng = np.random.default_rng(42) n_simulations = 5000 sample_sizes = [1, 5, 30] # Four different population shapes def draw_uniform(size): return rng.uniform(0, 1, size) def draw_exponential(size): return rng.exponential(scale=1, size=size) def draw_bimodal(size): # Mixture of two normals mix = rng.choice([0, 1], size=size, p=[0.5, 0.5]) return np.where(mix == 0, rng.normal(-2, 0.5, size), rng.normal(2, 0.5, size)) def draw_binomial(size): return rng.binomial(n=1, p=0.2, size=size).astype(float) # Theoretical mean and variance for each population populations = [ ("Uniform(0, 1)", draw_uniform, 0.5, 1/12), ("Exponential(1)", draw_exponential, 1.0, 1.0), ("Bimodal", draw_bimodal, 0.0, 4.25), # E[X]=0, Var = 0.5*(0.25+4) + 0.5*(0.25+4) = 4.25 ("Binomial(1, 0.2)", draw_binomial, 0.2, 0.16), ] fig, axes = plt.subplots(4, 3, figsize=(12, 12)) for row, (name, draw_fn, mu_pop, var_pop) in enumerate(populations): for col, n in enumerate(sample_sizes): means = [draw_fn(n).mean() for _ in range(n_simulations)] ax = axes[row, col] ax.hist(means, bins=40, density=True, color='steelblue', alpha=0.7, edgecolor='white') if n > 1: # Overlay CLT-predicted normal (theoretical, not fitted) se_predicted = np.sqrt(var_pop / n) x_line = np.linspace(mu_pop - 4 * se_predicted, mu_pop + 4 * se_predicted, 200) ax.plot(x_line, stats.norm(mu_pop, se_predicted).pdf(x_line), 'coral', linewidth=1.5, label='CLT prediction' if row == 0 and col == 1 else None) if row == 0: ax.set_title(f"n = {n}", fontsize=11) if col == 0: ax.set_ylabel(name, fontsize=10) if row == 3: ax.set_xlabel("Sample mean") # Add a single legend for the coral curve axes[0, 1].legend(loc='upper right', fontsize=9) plt.suptitle("Central Limit Theorem: convergence to normal", y=1.01, fontsize=13) plt.tight_layout() plt.show() ``` With $n = 1$, the "mean" of a single observation is just the observation itself, so the leftmost column shows the raw population shape — each looks completely different. The middle column ($n = 5$) already shows convergence beginning: the sharp edges and asymmetries soften. By $n = 30$, all four are well-approximated by a Normal distribution (the coral curve). Averaging has erased the starting shape. A common rule of thumb says $n = 30$ is "enough" for the CLT to kick in, and for these four populations it's visually convincing. But heavily skewed or heavy-tailed distributions — the kind you encounter with network latencies or file sizes — can require hundreds or thousands of observations before the normal approximation becomes reliable. Treat $n = 30$ as a starting point, not a guarantee. This is why the normal distribution from @sec-normal appears so often in practice. It's not that the world is inherently normal — it's that *averages* are approximately Normal, and most of the statistics we compute (means, proportions, and regression coefficients — weights that quantify the relationship between variables, covered in *Linear regression: your first model*) are some form of average. ## Standard error: the precision of an estimate {#sec-standard-error} The CLT tells us that the spread of the sampling distribution is $\sigma / \sqrt{n}$. This quantity has its own name: the **standard error** (SE). $$ \text{SE} = \frac{\sigma}{\sqrt{n}} $$ In practice, we rarely know $\sigma$ (the true population standard deviation), so we estimate the standard error using the sample standard deviation $s$: $$ \widehat{\text{SE}} = \frac{s}{\sqrt{n}} $$ The hat on $\widehat{\text{SE}}$ is statistical notation meaning "estimated from data" — we don't know the true SE (which depends on $\sigma$), so we estimate it using $s$. In practice, most authors drop the hat and simply write SE, since the context makes it clear whether the true or estimated version is meant. We'll follow that convention from here on. The standard error tells you how precisely you've estimated the mean. A small SE means your sample mean is likely close to the true mean; a large SE means it could be far off. The $\sqrt{n}$ in the denominator has a critical implication: **doubling your precision requires quadrupling your data**. Going from $n = 100$ to $n = 400$ halves the standard error. Going from 400 to 1,600 halves it again. There are diminishing returns to collecting more data — each additional observation contributes less to precision than the last. ```{python} #| label: fig-standard-error #| echo: true #| fig-cap: "Theoretical vs empirical standard error as sample size grows, and the diminishing returns curve — quadruple the data for double the precision." #| fig-alt: "Two-panel figure. The left panel plots theoretical SE (solid coral line) and empirical SE (dashed steelblue line with square markers) against sample size; both curves decline together, confirming the theory. The right panel shows a smooth SE-vs-n curve with annotations highlighting that going from n=100 to n=400 halves the standard error." rng = np.random.default_rng(42) # Population: Exponential with mean 50 (sigma = 50 for exponential) population_sigma = 50 sample_sizes = [10, 25, 50, 100, 250, 500, 1000] theoretical_se = [population_sigma / np.sqrt(n) for n in sample_sizes] empirical_se = [] for n in sample_sizes: means = [rng.exponential(scale=50, size=n).mean() for _ in range(5000)] empirical_se.append(np.std(means)) fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4)) # Left: theoretical vs empirical ax1.plot(sample_sizes, theoretical_se, 'coral', linewidth=2, marker='o', label='Theoretical: σ/√n') ax1.plot(sample_sizes, empirical_se, 'steelblue', linewidth=2, marker='s', linestyle='--', label='Empirical (5,000 simulations)') ax1.set_xlabel("Sample size (n)") ax1.set_ylabel("Standard error") ax1.set_title("Theoretical vs empirical SE") ax1.legend() # Right: SE as a function of n (continuous curve) n_range = np.linspace(5, 1000, 200) ax2.plot(n_range, population_sigma / np.sqrt(n_range), 'steelblue', linewidth=2) ax2.axhline(y=population_sigma / np.sqrt(100), color='grey', linestyle='--', alpha=0.5) ax2.annotate("n=100", xy=(100, population_sigma / np.sqrt(100)), xytext=(200, population_sigma / np.sqrt(100) + 2), arrowprops=dict(arrowstyle='->', color='grey'), fontsize=9, color='grey') ax2.axhline(y=population_sigma / np.sqrt(400), color='grey', linestyle='--', alpha=0.5) ax2.annotate("n=400 (half the SE)", xy=(400, population_sigma / np.sqrt(400)), xytext=(500, population_sigma / np.sqrt(400) + 2), arrowprops=dict(arrowstyle='->', color='grey'), fontsize=9, color='grey') ax2.set_xlabel("Sample size (n)") ax2.set_ylabel("Standard error") ax2.set_title("Diminishing returns: SE vs sample size") plt.tight_layout() plt.show() ``` ::: {.callout-note} ## Engineering Bridge The $\sqrt{n}$ curve is the **capacity planning problem of statistical precision**. Every engineer knows the economics of diminishing returns — doubling your server count doesn't halve your latency. The standard error follows a similar economic pattern: you can always buy more precision with more data, but the cost-per-unit-of-precision increases quadratically. Halving the SE requires 4× the data. Halving it again requires 16×. The SE does keep shrinking — there's no hard ceiling — but the economics become unfavourable fast. This is why sample size calculations matter: they tell you the point at which collecting more data isn't worth the cost. ::: ::: {.callout-tip} ## Author's Note The $\sqrt{n}$ relationship changed how I think about data collection. My engineering instinct was always "more data is better" — just log everything, store it all, analyse later. But the standard error shows that data has *diminishing marginal returns*. Going from 100 to 10,000 observations is transformative; going from 10,000 to 1,000,000 barely moves the needle on precision. Once I internalised this, I stopped reflexively reaching for bigger datasets and started asking "how precise do I actually need to be?" first. That question — answered through sample size calculations — saves enormous amounts of time and compute. ::: ## Worked example: can we trust this A/B test result? {#sec-worked-example-probability} Let's put these ideas together with a scenario you'll encounter in practice. Your team has run an A/B test on a checkout flow. The control group had a 12.0% conversion rate ($n = 1\text{,}000$), and the variant had a 13.5% conversion rate ($n = 1\text{,}000$). That's a 1.5 percentage point difference. Is this real, or could it be noise? We're not going to answer this question fully here — that's what hypothesis testing and confidence intervals are for. But we *can* use the tools from this chapter to frame the question precisely. We know the standard error of each proportion, and therefore the SE of their difference — so we can ask: how many standard errors apart are these two estimates? ```{python} #| label: ab-test-computation #| echo: true # Observed data p_control = 0.120 p_variant = 0.135 n_ab = 1000 # Standard errors for proportions: SE = sqrt(p(1-p)/n) # For binary outcomes (convert/don't), variance = p(1-p) — from the # Bernoulli distribution (see Ch2). Dividing by n and taking the square # root gives the SE of the estimated proportion. se_control = np.sqrt(p_control * (1 - p_control) / n_ab) se_variant = np.sqrt(p_variant * (1 - p_variant) / n_ab) # We use the observed proportions as plug-in estimates (substituting # sample values where the formula calls for unknown population values) print(f"Control: p = {p_control:.3f}, SE = {se_control:.4f}") print(f"Variant: p = {p_variant:.3f}, SE = {se_variant:.4f}") # How far apart are they in standard-error units? # SE of a difference: variances add for independent samples, then take sqrt. # Note: a formal hypothesis test would pool the proportions under H₀, # giving a slightly different SE. We'll see that approach in Chapter 5. se_diff = np.sqrt(se_control**2 + se_variant**2) z = (p_variant - p_control) / se_diff print(f"\nSE of the difference: {se_diff:.4f}") # z-score: how many standard errors the difference is from zero print(f"Difference in SE units (z): {z:.2f}") ``` ```{python} #| label: fig-ab-test-framing #| echo: true #| fig-cap: "The sampling distributions of the two conversion rates overlap substantially. The observed difference might be real — or it might be sampling noise." #| fig-alt: "Two overlapping bell curves representing the sampling distributions of the control (12.0%, steelblue) and variant (13.5%, coral) conversion rates. The curves share a wide region of overlap, illustrating that the observed difference could plausibly be due to sampling noise rather than a genuine effect." # Sampling distributions (by CLT, approximately normal) x = np.linspace(0.08, 0.18, 300) fig, ax = plt.subplots(figsize=(10, 5)) ax.plot(x, stats.norm(p_control, se_control).pdf(x), 'steelblue', linewidth=2, label=f'Control: {p_control:.1%} ± {se_control:.4f}') ax.fill_between(x, stats.norm(p_control, se_control).pdf(x), alpha=0.2, color='steelblue') ax.plot(x, stats.norm(p_variant, se_variant).pdf(x), 'coral', linewidth=2, label=f'Variant: {p_variant:.1%} ± {se_variant:.4f}') ax.fill_between(x, stats.norm(p_variant, se_variant).pdf(x), alpha=0.2, color='coral') ax.set_xlabel("Conversion rate") ax.set_ylabel("Density") ax.set_title("Sampling distributions of two conversion rates") ax.legend() plt.tight_layout() plt.show() ``` The two distributions share a wide band of overlap. The 1.5 percentage point difference *could* be a genuine improvement — or it could be the kind of variation you'd expect from two samples drawn from the same underlying rate. The **z-score** (a standardised measure of "how many standard errors away from zero") of roughly 1.0 tells us the difference is about one standard error wide — suggestive, but far from conclusive. (We'll learn what to do with that number in *Hypothesis testing: the unit test of data science*.) To resolve this properly, we need formal tools: a way to quantify the probability that the observed difference arose by chance (hypothesis testing), a way to express the range of plausible true values (confidence intervals), and a structured approach to designing and interpreting these experiments (A/B testing methodology). Those are exactly the topics of the next three chapters: *Hypothesis testing: the unit test of data science*, *Confidence intervals: error bounds for estimates*, and *A/B testing: deploying experiments*. ## Summary {#sec-probability-summary} 1. **Probability has a small set of composable rules** — complement, addition, multiplication, and conditioning. These are the same rules you use for SLO composition and error budgets, now formalised. 2. **Bayes' theorem turns conditional probability into a learning process** — it updates prior beliefs in light of evidence. The base rate of the event you're detecting matters at least as much as the accuracy of your detector. 3. **The sampling distribution describes the variability of an estimate** — if you collected a different sample, you'd get a different answer. The sampling distribution tells you how much those answers would vary. 4. **The Central Limit Theorem says that averages are approximately normal** — regardless of the shape of the underlying population, for large enough $n$. This is the theoretical foundation of most inferential statistics. 5. **The standard error quantifies the precision of an estimate** — it shrinks as $\sigma / \sqrt{n}$, meaning you get diminishing returns from additional data. Knowing the SE lets you move from "here is the sample mean" to "here is how much we should trust the sample mean" — which is exactly where inference begins. ## Exercises {#sec-probability-exercises} 1. A notification system delivers alerts through three independent channels: email (98% delivery rate), SMS (95%), and push notification (90%). What is the probability that a user receives an alert through *at least one* channel? Compute this analytically using the complement rule, then verify by simulating 100,000 alerts using `numpy`. 2. The Central Limit Theorem works for any distribution with finite variance. Use `scipy.stats.poisson(mu=3)` to explore this: draw 5,000 samples of sizes $n \in \{1, 5, 15, 50\}$, compute the mean of each, and plot the four sampling distributions. For each, overlay a normal distribution with the CLT-predicted mean and standard deviation. At what sample size does the normal approximation become convincing? A **QQ plot** (quantile-quantile plot) provides a more rigorous check — it plots your data's quantiles against a theoretical distribution's quantiles, where points falling on the diagonal indicate a good fit. Use `scipy.stats.probplot` for the $n = 50$ case. 3. Using an Exponential distribution with mean 100, compute the theoretical standard error for sample sizes $n \in \{25, 100, 400, 1600\}$. Verify each empirically by drawing 10,000 samples. Plot both the theoretical and empirical SE values against $n$ on the same axes. How does the $4\times$ rule (quadruple data, halve SE) hold up in practice? 4. A code-coverage tool claims to detect bugs with 80% accuracy (it flags 80% of real bugs). It also has a 10% false positive rate (it flags 10% of clean code). In a mature codebase, suppose 3% of functions actually contain a bug. Using Bayes' theorem, calculate $P(\text{bug} \mid \text{flagged})$. Now repeat the calculation for base rates of 1%, 5%, and 20%. What does this tell you about the usefulness of the tool across different codebases? 5. **Conceptual:** This chapter opened by comparing probability rules to error-handling composition in distributed systems. Where does this analogy break down? Consider: what happens when events aren't independent? When "failure" isn't binary? When the probability of failure changes over time? Identify at least two ways the clean algebraic composition of probabilities in the opening example doesn't map to real-world systems, and explain what statistical tools you'd reach for in each case.