Appendix A — Mathematical foundations refresher

Mathematical notation is a compression format. Once you can decompress it, formulas that look intimidating turn out to be loops, ratios, and array operations you already write every day. This appendix is a refresher, not a course, and the goal is to jog rusty memories and provide a reference you can flip back to when a chapter introduces a symbol or technique you haven’t seen since university.

Everything here has a Python equivalent. If the notation is unfamiliar, read the code — it computes the same thing.

A.1 Greek letters

Data science uses Greek letters as shorthand for quantities that appear repeatedly. You don’t need to memorise these upfront; each chapter reintroduces the ones it uses. But when you encounter an unfamiliar symbol, this table is the place to look.

Table A.1: Greek letters and their common meanings.
| Symbol | Name | Typical meaning in this book |
|---|---|---|
| \(\mu\) | mu | Population mean; expected value |
| \(\sigma\) | sigma | Population standard deviation |
| \(\sigma^2\) | sigma squared | Population variance |
| \(\lambda\) | lambda | Rate parameter (Poisson, Exponential); regularisation strength |
| \(\varepsilon\) | epsilon | Error term; random noise (\(y = f(x) + \varepsilon\)) |
| \(\alpha\) | alpha | Significance level (hypothesis testing); shape parameter (Beta) |
| \(\beta\) | beta | Regression coefficient; shape parameter (Beta); Type II error rate |
| \(\theta\) | theta | Generic parameter being estimated |
| \(\rho\) | rho | Population correlation coefficient (Pearson); sample estimate is \(r\) |
| \(\pi\) | pi | The constant 3.14159…; occasionally \(\pi_k = P(\text{class} = k)\) in mixture models |

Two conventions to watch for: a hat (\(\hat{\beta}\)) means “estimated from data”, the model’s best guess at the true value. A bar (\(\bar{x}\)) means “sample average.” These marks appear on top of both Greek and Latin letters.

A.2 Reading mathematical expressions

When you encounter an unfamiliar formula, try reading it as pseudocode. Here is an example: the formula for Pearson’s correlation coefficient \(r\) (the sample estimate of the population correlation \(\rho\)):

\[ r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \cdot \sum_{i=1}^{n}(y_i - \bar{y})^2}} \]

Read it step by step:

  1. The numerator is a sum: for each observation \(i\), compute the product of how far \(x_i\) is from its mean and how far \(y_i\) is from its mean. Sum all those products. This captures co-movement.

  2. The denominator normalises by the spreads of \(x\) and \(y\): the square root of the product of their summed squared deviations. This forces the result into the range \([-1, 1]\).

  3. The whole expression asks: when \(x\) is above its mean, is \(y\) also above its mean? If yes (consistently), \(r\) is close to \(+1\). If they move in opposite directions, \(r\) is close to \(-1\). If there’s no pattern, \(r\) is near \(0\).

rng = np.random.default_rng(42)
x = rng.normal(50, 10, size=100)
y = 2 * x + rng.normal(0, 5, size=100)  # y is linearly related to x

# Manual calculation following the formula step by step
x_dev = x - np.mean(x)
y_dev = y - np.mean(y)

co_movement = x_dev * y_dev                            # element-wise products
x_spread = np.sqrt(np.sum(x_dev ** 2))
y_spread = np.sqrt(np.sum(y_dev ** 2))
r_manual = np.sum(co_movement) / (x_spread * y_spread)

# NumPy's built-in (returns a correlation matrix; [0,1] is the coefficient)
r_numpy = np.corrcoef(x, y)[0, 1]

print(f"Manual:  {r_manual:.4f}")
print(f"NumPy:   {r_numpy:.4f}")
Manual:  0.9566
NumPy:   0.9566

The pattern is general: most formulas in this book are sums, products, or ratios that you can translate directly into array operations. When a formula looks intimidating, implement it line by line; the code is often clearer than the notation. The rest of this appendix catalogues the building blocks you will encounter.

A.3 Subscripts and index notation

Statistical formulas use subscripts as array indices. \(x_i\) means the \(i\)th element of the vector \(\mathbf{x}\), equivalent to x[i]. \(X_{ij}\) means the element in row \(i\), column \(j\), equivalent to X[i, j]. When you see two indices, the first is usually the observation and the second the feature.

A summation like \(\sum_{i=1}^{n} x_i^2\) reads as: “loop \(i\) from \(1\) to \(n\), square each element, accumulate.” In NumPy: np.sum(x**2).
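
The mapping from index notation to array code can be made concrete (a small sketch; the arrays are illustrative). Note the off-by-one: mathematical indices start at 1, NumPy’s at 0.

```python
import numpy as np

x = np.array([4.0, 1.0, 3.0])
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# x_2 in maths (1-based) is x[1] in Python (0-based)
print(x[1])            # 1.0

# X_{32}, i.e. row 3, column 2, is X[2, 1]
print(X[2, 1])         # 6.0

# Σ x_i² : square each element, accumulate
print(np.sum(x ** 2))  # 4² + 1² + 3² = 26.0
```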

A.4 Summation and product notation

The capital sigma \(\Sigma\) means “add up a sequence of terms.” If you’ve written a for loop that accumulates a running total, you already understand summation.

\[ \sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n \]

The subscript (\(i=1\)) is the loop variable’s starting value, the superscript (\(n\)) is its end, and \(x_i\) is the expression evaluated on each iteration.

x = np.array([3, 7, 2, 9, 5])

# The mathematical expression  Σ x_i  is just:
total = np.sum(x)
print(f"Sum: {total}")

# And the mean  (1/n) Σ x_i  is:
mean = np.mean(x)
print(f"Mean: {mean}")
Sum: 26
Mean: 5.2

The capital pi \(\Pi\) is the multiplicative equivalent, read as “multiply together a sequence of terms”:

\[ \prod_{i=1}^{n} x_i = x_1 \times x_2 \times \cdots \times x_n \]

x = np.array([3, 7, 2, 9, 5])
product = np.prod(x)
print(f"Product: {product}")
Product: 1890

Products appear most often in probability, where the joint probability of independent events is the product of their individual probabilities. This only holds when events are genuinely independent; if they are correlated, you need conditional probabilities instead (see Section A.8).
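
As a toy illustration of that independence rule: the joint probability of three independent fair-coin heads is the product \(0.5 \times 0.5 \times 0.5\), which \(\Pi\) notation captures directly.

```python
import numpy as np

p_heads = np.array([0.5, 0.5, 0.5])  # three independent fair flips

# Π p_i : joint probability under independence
joint = np.prod(p_heads)
print(f"P(three heads) = {joint}")  # 0.125
```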

A.5 Functions, logarithms, and exponentials

The exponential function \(e^x\) (equivalently written \(\exp(x)\)) and the natural logarithm \(\ln(x)\) are inverses of each other:

\[ e^{\ln(x)} = x \qquad \text{and} \qquad \ln(e^x) = x \]

where \(e \approx 2.718\) is Euler’s number. These appear constantly in data science because many natural processes exhibit exponential growth or decay, and logarithms compress wide-ranging values into manageable scales.

# They undo each other
x = 100
print(f"exp(ln({x})) = {np.exp(np.log(x)):.1f}")
print(f"ln(exp(5))   = {np.log(np.exp(5)):.1f}")

# Logarithms turn multiplication into addition
a, b = 1000, 500
print(f"\nln({a} × {b})     = {np.log(a * b):.4f}")
print(f"ln({a}) + ln({b}) = {np.log(a) + np.log(b):.4f}")
exp(ln(100)) = 100.0
ln(exp(5))   = 5.0

ln(1000 × 500)     = 13.1224
ln(1000) + ln(500) = 13.1224

Three properties make logarithms indispensable in practice. First, they turn multiplication into addition: \(\ln(ab) = \ln(a) + \ln(b)\). This is why log-likelihoods are easier to work with than raw likelihoods: products of many small probabilities become sums. This matters computationally, too: multiplying thousands of probabilities together underflows to zero in floating-point arithmetic, while summing their logarithms stays numerically stable. Second, they compress dynamic range: \(\ln(1{,}000{,}000) \approx 13.8\), so when data spans several orders of magnitude (incomes, populations, word frequencies), a log scale makes patterns visible. Third, they are the basis of the logit transform: \(\text{logit}(p) = \ln\!\left(\frac{p}{1-p}\right)\), which maps probabilities in \((0, 1)\) to the entire real line, the foundation of logistic regression.
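
The underflow point is easy to demonstrate (a sketch with synthetic probabilities, not an example from the chapters): the raw product of a few thousand probabilities collapses to exactly zero in float64, while the sum of their logarithms stays finite and usable.

```python
import numpy as np

rng = np.random.default_rng(0)
probs = rng.uniform(0.1, 0.9, size=5000)  # 5,000 likelihood terms

raw_product = np.prod(probs)          # underflows: float64 bottoms out near 1e-308
log_likelihood = np.sum(np.log(probs))

print(f"Raw product:    {raw_product}")  # 0.0
print(f"Log-likelihood: {log_likelihood:.1f}")
```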

p = np.linspace(0.01, 0.99, 200)
z = np.linspace(-6, 6, 200)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
fig.patch.set_alpha(0)

# Left: logit
ax1.patch.set_alpha(0)
ax1.plot(p, np.log(p / (1 - p)), color='#0072B2', linewidth=2)
ax1.axhline(0, color='grey', linewidth=0.8, linestyle='-', alpha=0.5)
ax1.axvline(0.5, color='#E69F00', linewidth=1, linestyle=':', alpha=0.7)
ax1.set_xlabel('Probability $p$')
ax1.set_ylabel('logit($p$)')
ax1.set_title('Logit: probability → real line', fontsize=11)
ax1.spines['top'].set_visible(False)
ax1.spines['right'].set_visible(False)
ax1.yaxis.grid(True, linestyle=':', alpha=0.4, color='grey')
ax1.set_axisbelow(True)

# Right: sigmoid (inverse logit)
ax2.patch.set_alpha(0)
ax2.plot(z, 1 / (1 + np.exp(-z)), color='#0072B2', linewidth=2)
ax2.axhline(0.5, color='#E69F00', linewidth=1, linestyle=':', alpha=0.7)
ax2.axvline(0, color='grey', linewidth=0.8, linestyle='-', alpha=0.5)
ax2.set_xlabel('$z$')
ax2.set_ylabel('sigmoid($z$)')
ax2.set_title('Sigmoid: real line → probability', fontsize=11)
ax2.spines['top'].set_visible(False)
ax2.spines['right'].set_visible(False)
ax2.yaxis.grid(True, linestyle=':', alpha=0.4, color='grey')
ax2.set_axisbelow(True)

plt.tight_layout()
plt.show()
Figure A.1: The logit function maps probabilities (0, 1) to the full real line; the sigmoid function maps back. Logistic regression works in logit space — where linear relationships are natural — then converts predictions to probabilities via the sigmoid.

A.6 Calculus essentials

Most of the calculus you need for this book reduces to two ideas: derivatives tell you the rate of change, and integrals tell you the area under a curve.

A.6.1 Derivatives

The derivative of a function \(f(x)\) at a point tells you how quickly \(f\) changes as \(x\) changes. In Python terms, it is the slope of the function at that point.

\[ f'(x) = \frac{df}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h} \]

In plain terms: the derivative measures how much the output changes per unit change in input, as that change becomes infinitesimally small. You rarely need to compute derivatives by hand in data science; optimisation libraries and automatic differentiation frameworks (PyTorch, JAX) handle that. But recognising what a derivative means is essential, because every model-fitting algorithm is solving “find the parameter values where the derivative of the loss function equals zero.”
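
You can watch the limit definition converge numerically. For \(f(x) = x^2\) the true derivative at \(x = 2\) is \(4\), and the difference quotient \((f(x+h) - f(x))/h\) approaches it as \(h\) shrinks (a quick numerical sketch):

```python
def f(x):
    return x ** 2

x0 = 2.0
for h in [1.0, 0.1, 0.01, 0.001]:
    quotient = (f(x0 + h) - f(x0)) / h  # slope over a small interval
    print(f"h = {h:<6} slope ≈ {quotient:.4f}")
# Prints 5.0000, 4.1000, 4.0100, 4.0010, approaching f'(2) = 4
```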

x = np.linspace(-1, 4, 200)
f = x ** 2

# Tangent line at x0: y = f(x0) + f'(x0)(x - x0)
x0 = 2
f_x0 = x0 ** 2       # f(2) = 4
slope = 2 * x0        # f'(2) = 4

# Clip the tangent to a narrow window around the tangency point
tangent_x = np.linspace(0.5, 3.5, 100)
tangent_y = f_x0 + slope * (tangent_x - x0)

fig, ax = plt.subplots(figsize=(9, 5))
fig.patch.set_alpha(0)
ax.patch.set_alpha(0)
ax.set_title('The derivative is the slope of the tangent at a point',
             fontsize=12, pad=10)
ax.plot(x, f, color='#0072B2', linewidth=2, label=r'$f(x) = x^2$')
ax.plot(tangent_x, tangent_y, color='#E69F00', linewidth=1.5, linestyle='--',
        label=rf"Tangent at $x={x0}$: slope $= {slope}$")
ax.plot(x0, f_x0, 'o', color='#E69F00', markersize=9,
        markeredgecolor='white', markeredgewidth=1.5, zorder=5)
ax.annotate(f'$({x0},\;{f_x0})$', xy=(x0, f_x0),
            xytext=(x0 + 0.35, f_x0 - 1.5), fontsize=10,
            arrowprops=dict(arrowstyle='->', color='grey', lw=0.8),
            color='grey')
ax.set_xlim(-1, 4)
ax.set_ylim(-0.5, 16)
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.legend()
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.yaxis.grid(True, linestyle=':', alpha=0.4, color='grey')
ax.set_axisbelow(True)
plt.tight_layout()
plt.show()
Figure A.2: The derivative of f(x) = x² is 2x. At x = 2, the slope is 4 — the tangent line touches the curve and shows the instantaneous rate of change. Gradient descent uses exactly this value to decide which direction to step.

When a function has multiple inputs, partial derivatives measure the rate of change with respect to one variable while holding the others constant. The notation changes from \(\frac{d}{dx}\) to \(\frac{\partial}{\partial x}\).

The gradient \(\nabla f\) is the vector of all partial derivatives. It points in the direction of steepest increase. Gradient descent — the most common optimisation algorithm in machine learning — works by repeatedly stepping in the opposite direction: downhill towards a minimum.
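
As a toy sketch of the downhill idea (illustrative, not a chapter example): minimise \(f(x, y) = x^2 + y^2\), whose gradient is \(\nabla f = (2x, 2y)\), by repeatedly stepping against the gradient.

```python
import numpy as np

def grad(v):
    # Gradient of f(x, y) = x² + y² is (2x, 2y)
    return 2 * v

v = np.array([3.0, -2.0])   # starting point
learning_rate = 0.1         # step size

for _ in range(100):
    v = v - learning_rate * grad(v)   # step opposite the gradient: downhill

print(v)  # very close to the minimum at (0, 0)
```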

Author’s Note

Every model-fitting algorithm in this book — least squares regression, logistic regression, gradient-boosted trees — is doing the same thing at its core: searching for parameter values that minimise a loss function. The derivative is the compass that tells the algorithm which direction to step. Once you see that, the zoo of algorithms stops looking like a collection of unrelated methods and starts looking like variations on a single theme: define a loss, take the gradient, step downhill, repeat.

A.6.2 Integrals

An integral computes the area under a curve. For a probability density function (PDF), the area under the curve between two points gives the probability of observing a value in that range.

\[ P(a \leq X \leq b) = \int_a^b f(x)\, dx \]

The total area under any valid PDF equals 1 (certainty that some value will be observed).

x = np.linspace(-4, 4, 300)
y = stats.norm.pdf(x)

fig, ax = plt.subplots(figsize=(8, 4))
fig.patch.set_alpha(0)
ax.patch.set_alpha(0)
ax.set_title('Probability = area under the curve between two limits',
             fontsize=12, pad=10)
ax.plot(x, y, color='#0072B2', linewidth=2)

# Faint tail shading to show what is excluded
tail_mask = (x < -1) | (x > 1)
ax.fill_between(x, y, where=tail_mask, alpha=0.08, color='grey')

# Shade the area between -1 and 1
mask = (x >= -1) & (x <= 1)
ax.fill_between(x[mask], y[mask], alpha=0.45, color='#0072B2')

# Vertical markers at integration limits
ax.axvline(-1, color='#0072B2', linewidth=1, linestyle=':', alpha=0.7)
ax.axvline( 1, color='#0072B2', linewidth=1, linestyle=':', alpha=0.7)

# Direct annotation instead of distant legend
ax.text(0, 0.18, r'$P(-1 \leq X \leq 1) \approx 0.683$',
        ha='center', va='bottom', fontsize=11, color='#1a5276',
        fontweight='bold')

ax.set_xlabel('x')
ax.set_ylabel('Density')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.yaxis.grid(True, linestyle=':', alpha=0.4, color='grey')
ax.set_axisbelow(True)
plt.tight_layout()
plt.show()
Figure A.3: The shaded area under the standard Normal PDF (mean 0, standard deviation 1) between -1 and 1 gives P(-1 ≤ X ≤ 1) ≈ 0.683 — about 68% of the distribution.

In practice, you will rarely compute integrals yourself. The scipy.stats CDF method does it for you: stats.norm.cdf(1) - stats.norm.cdf(-1) gives the same 0.683 as the shaded area above.
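
A quick cross-check that the CDF difference matches the shaded area (scipy.integrate.quad is a general-purpose numerical integrator, used here only for verification):

```python
from scipy import stats
from scipy.integrate import quad

# CDF difference: P(-1 ≤ X ≤ 1) for the standard Normal
p_cdf = stats.norm.cdf(1) - stats.norm.cdf(-1)

# Numerically integrate the PDF over the same interval
p_integral, _ = quad(stats.norm.pdf, -1, 1)

print(f"CDF difference:      {p_cdf:.4f}")       # 0.6827
print(f"Integral of the PDF: {p_integral:.4f}")  # 0.6827
```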

A.7 Linear algebra essentials

Linear algebra provides the language for working with collections of numbers simultaneously. In data science, your data almost always lives in a matrix (rows = observations, columns = features), and you express most algorithms as matrix operations.

A.7.1 Vectors

A vector is an ordered list of numbers. In Python, it is a one-dimensional NumPy array.

\[ \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} \]

The dot product of two vectors multiplies corresponding elements and sums the results:

\[ \mathbf{x} \cdot \mathbf{y} = \sum_{i=1}^{n} x_i y_i \]

This is the core operation behind linear regression: a predicted value is the dot product of features and coefficients.

x = np.array([2, 3, 5])
y = np.array([1, 4, 2])

# These are all equivalent
print(f"np.dot:        {np.dot(x, y)}")
print(f"@ operator:    {x @ y}")
print(f"Manual sum:    {sum(a * b for a, b in zip(x, y))}")
np.dot:        24
@ operator:    24
Manual sum:    24

A.7.2 Matrices

A matrix is a two-dimensional array of numbers. In data science, your feature matrix \(\mathbf{X}\) typically has \(n\) rows (observations) and \(p\) columns (features).

\[ \mathbf{X} = \begin{bmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \\ x_{31} & x_{32} \end{bmatrix} \]

The transpose \(\mathbf{X}^T\) swaps rows and columns. Matrix multiplication \(\mathbf{A}\mathbf{B}\) requires the number of columns of \(\mathbf{A}\) to equal the number of rows of \(\mathbf{B}\); the inner dimensions must match. In NumPy, the @ operator handles both dot products and matrix multiplication.

X = np.array([[1, 2],
              [3, 4],
              [5, 6]])

print(f"X shape:    {X.shape}")      # (3, 2)
print(f"X^T shape:  {X.T.shape}")    # (2, 3)

# X^T @ X gives a (2, 2) matrix — this appears in the
# normal equations for linear regression
print(f"\nX^T @ X:\n{X.T @ X}")
X shape:    (3, 2)
X^T shape:  (2, 3)

X^T @ X:
[[35 44]
 [44 56]]
Engineering Bridge

Matrix multiplication is how ML frameworks batch-process multiple observations simultaneously. When you see X @ w in a model’s forward pass, that is \(n\) simultaneous dot products — one per row of \(\mathbf{X}\). The @ operator in NumPy maps directly to optimised BLAS routines, which is why vectorised matrix operations are orders of magnitude faster than Python for loops over individual rows. If you have ever profiled a hot loop and replaced it with a single library call, the same instinct applies here.
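
A rough benchmark makes the point (timings vary by machine, and the sizes here are arbitrary; this is a sketch):

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 50))   # 10,000 observations, 50 features
w = rng.normal(size=50)

def loop_version():
    # One dot product per row, in pure Python
    return [sum(x_ij * w_j for x_ij, w_j in zip(row, w)) for row in X]

def vectorised():
    # The same n dot products as a single BLAS call
    return X @ w

t_loop = timeit.timeit(loop_version, number=3)
t_vec = timeit.timeit(vectorised, number=3)
print(f"Python loop: {t_loop:.4f}s")
print(f"X @ w:       {t_vec:.4f}s (same result, far faster)")
```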

A.7.3 Eigenvalues and eigenvectors

If you have a system described by several correlated metrics — CPU usage, memory consumption, request latency — the metrics tend to move together. Eigenvectors of the covariance matrix identify the independent “directions of variation” in that system: the distinct ways the metrics co-move as a unit. Eigenvalues measure how much variation each direction captures.

Formally, an eigenvector of a matrix \(\mathbf{A}\) is a vector whose direction doesn’t change when \(\mathbf{A}\) is applied to it. It only gets scaled by a factor, the eigenvalue (conventionally written \(\lambda\) in this context, distinct from the rate parameter \(\lambda\) in Table A.1):

\[ \mathbf{A}\mathbf{v} = \lambda\mathbf{v} \]

This concept is central to dimensionality reduction (PCA): the eigenvectors of the covariance matrix define the principal component directions, and the eigenvalues tell you how much variance each component captures.

# Covariance matrix of two correlated features
C = np.array([[2.0, 1.2],
              [1.2, 1.0]])

eigenvalues, eigenvectors = np.linalg.eigh(C)

# eigh returns ascending order; reverse to match PCA convention (largest first)
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

print('Eigenvalues:', eigenvalues)
print('Eigenvectors (columns):')
print(eigenvectors)
pct = eigenvalues / eigenvalues.sum() * 100
print(f"\nVariance explained: PC1={pct[0]:.1f}%, PC2={pct[1]:.1f}%")
Eigenvalues: [2.8 0.2]
Eigenvectors (columns):
[[-0.83205029  0.5547002 ]
 [-0.5547002  -0.83205029]]

Variance explained: PC1=93.3%, PC2=6.7%

The first component captures about 93% of the variance: in PCA terms, you could represent this two-dimensional data with a single number and lose very little information.
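
You can confirm the defining property \(\mathbf{A}\mathbf{v} = \lambda\mathbf{v}\) directly (the matrix is re-created here so the snippet stands alone):

```python
import numpy as np

C = np.array([[2.0, 1.2],
              [1.2, 1.0]])
eigenvalues, eigenvectors = np.linalg.eigh(C)

# Check C v = λ v for each eigenpair (eigenvectors are the columns)
for lam, v in zip(eigenvalues, eigenvectors.T):
    print(np.allclose(C @ v, lam * v))  # True for both pairs
```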

A.8 Probability notation

Probability has its own compact notation. Here is the vocabulary you need.

Table A.2: Probability notation reference.
| Notation | Read as | Meaning |
|---|---|---|
| \(P(A)\) | “probability of A” | How likely event \(A\) is (between 0 and 1) |
| \(P(A \mid B)\) | “probability of A given B” | How likely \(A\) is, knowing \(B\) occurred |
| \(P(A \cap B)\) | “probability of A and B” | Both \(A\) and \(B\) occur |
| \(P(A \cup B)\) | “probability of A or B” | At least one of \(A\) or \(B\) occurs |
| \(X \sim N(\mu, \sigma^2)\) | “X follows a Normal” | \(X\) is drawn from a Normal with mean \(\mu\) and variance \(\sigma^2\) |
| \(E[X]\) | “expected value of X” | The long-run average; \(E[X] = \mu\) |
| \(\text{Var}(X)\) | “variance of X” | The expected squared deviation from the mean |

A note on the Normal distribution parameterisation: the mathematical convention \(N(\mu, \sigma^2)\) uses the variance as the second parameter. In scipy, however, stats.norm(loc=mu, scale=sigma) takes the standard deviation. Both refer to the same distribution; just be aware which parameterisation a given source is using.

The core rules that connect these symbols are:

Complement rule
\(P(\text{not } A) = 1 - P(A)\)
Addition rule
\(P(A \cup B) = P(A) + P(B) - P(A \cap B)\)
Multiplication rule
\(P(A \cap B) = P(A) \times P(B \mid A)\)
Bayes’ theorem
\(P(A \mid B) = \dfrac{P(B \mid A) \times P(A)}{P(B)}\)
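
These rules can be sanity-checked on a small enumerable example. For one fair six-sided die, take \(A\) = “roll is even” and \(B\) = “roll is greater than 3” (a toy setup, not from the chapters):

```python
from fractions import Fraction

outcomes = set(range(1, 7))               # one fair six-sided die
A = {o for o in outcomes if o % 2 == 0}   # even: {2, 4, 6}
B = {o for o in outcomes if o > 3}        # > 3:  {4, 5, 6}

def P(S):
    return Fraction(len(S), len(outcomes))

# Addition rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
print(P(A | B), "=", P(A) + P(B) - P(A & B))  # 2/3 = 2/3

# Multiplication rule: P(A ∩ B) = P(A) × P(B | A)
p_B_given_A = Fraction(len(A & B), len(A))
print(P(A & B), "=", P(A) * p_B_given_A)      # 1/3 = 1/3
```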

Bayes’ theorem is the engine behind an entire branch of statistics. It tells you how to update a belief: start with what you believed before observing data — the prior, \(P(A)\). Measure how well your hypothesis predicts what you observed — the likelihood, \(P(B \mid A)\). Normalise by the overall probability of the observation — the evidence, \(P(B)\). The result is your updated belief — the posterior, \(P(A \mid B)\).

# A medical screening test:
#   - 1% of the population has the condition (prior)
#   - The test correctly detects 95% of true cases (sensitivity)
#   - The test incorrectly flags 5% of healthy people (false positive rate)

p_condition = 0.01                     # P(condition)
p_positive_given_condition = 0.95      # P(positive | condition)
p_positive_given_healthy = 0.05        # P(positive | no condition)

# Law of total probability for the denominator
p_positive = (p_positive_given_condition * p_condition
              + p_positive_given_healthy * (1 - p_condition))

# Bayes: posterior
p_condition_given_positive = (p_positive_given_condition * p_condition) / p_positive
print(f"P(condition | positive test) = {p_condition_given_positive:.3f}")
print(f"Despite a 95%-sensitive test, a positive result means only a"
      f" {p_condition_given_positive:.0%} chance of having the condition.")
P(condition | positive test) = 0.161
Despite a 95%-sensitive test, a positive result means only a 16% chance of having the condition.

A.9 Distribution notation

A probability distribution describes the possible values of a random variable and how likely each one is. The notation differs slightly between discrete and continuous distributions.

For discrete distributions (outcomes you can count), the probability mass function (PMF) gives the probability of each specific outcome:

\[ P(X = k) \quad \text{e.g., } P(X = 3) = 0.18 \]

For continuous distributions (outcomes on a smooth scale), the probability density function (PDF) gives the density at each point. Probabilities come from integrating the PDF over an interval; the density at a single point is not itself a probability.

Both types share a cumulative distribution function (CDF):

\[ F(x) = P(X \leq x) \]

Both types also have an inverse CDF, the quantile function (called ppf in scipy), which answers “what value has \(q\)% of the distribution below it?”

# Discrete: Poisson(λ=5)
poisson = stats.poisson(mu=5)
print(f"PMF: P(X = 3) = {poisson.pmf(3):.4f}")
print(f"CDF: P(X ≤ 3) = {poisson.cdf(3):.4f}")
print(f"PPF: value at 95th percentile = {poisson.ppf(0.95):.0f}")

# Continuous: Normal(μ=0, σ²=1)
normal = stats.norm(loc=0, scale=1)  # scale is σ (standard deviation)
print(f"\nPDF: f(0)     = {normal.pdf(0):.4f}")
print(f"CDF: P(X ≤ 0) = {normal.cdf(0):.4f}")
print(f"PPF: value at 97.5th percentile = {normal.ppf(0.975):.4f}")
PMF: P(X = 3) = 0.1404
CDF: P(X ≤ 3) = 0.2650
PPF: value at 95th percentile = 9

PDF: f(0)     = 0.3989
CDF: P(X ≤ 0) = 0.5000
PPF: value at 97.5th percentile = 1.9600

A.10 Common formulas

These formulas recur throughout the book. Each is paired with its NumPy equivalent.

Sample mean:

\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \]

Sample variance (with Bessel’s correction, dividing by \(n - 1\) rather than \(n\) to correct for the slight downward bias when estimating population variance from a sample):

\[ s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 \]

Standard deviation is the square root of variance: \(s = \sqrt{s^2}\).

Standard error of the mean measures how precisely you’ve estimated the population mean:

\[ \text{SE} = \frac{s}{\sqrt{n}} \]

x = np.array([12, 15, 14, 10, 13, 16, 11, 14, 15, 12])

# Mean
mean = np.mean(x)
print(f"Mean: {mean}")

# Variance (ddof=1 for sample variance — Bessel's correction)
variance = np.var(x, ddof=1)
print(f"Variance: {variance:.2f}")

# Standard deviation
std = np.std(x, ddof=1)
print(f"Standard deviation: {std:.2f}")

# Standard error of the mean
se = std / np.sqrt(len(x))
print(f"Standard error: {se:.2f}")
Mean: 13.2
Variance: 3.73
Standard deviation: 1.93
Standard error: 0.61

Notice the \(\sqrt{n}\) in the denominator of the standard error: quadrupling your sample size halves your standard error. This square-root relationship between data volume and precision is one of the most important practical facts in statistics, and it explains the diminishing returns of simply collecting more data: going from 100 to 400 observations halves your uncertainty, but going from 10,000 to 40,000 barely moves it.
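
Simulation makes the square-root law visible (a sketch with synthetic Normal data):

```python
import numpy as np

rng = np.random.default_rng(7)

# For each n, draw 2,000 samples of size n and measure how much
# their means vary; the spread should halve at each 4× step in n
for n in [100, 400, 1600]:
    means = rng.normal(loc=50, scale=10, size=(2000, n)).mean(axis=1)
    print(f"n = {n:>5}: spread of sample means ≈ {means.std():.3f}")
# Theoretical SEs: 10/√100 = 1.0, 10/√400 = 0.5, 10/√1600 = 0.25
```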

A.11 Set notation

Sets appear in probability contexts. The notation maps directly to concepts you know from programming.

Table A.3: Set notation and Python equivalents.
| Symbol | Meaning | Python equivalent |
|---|---|---|
| \(\in\) | “is a member of” | x in S |
| \(\cup\) | Union (or) | A \| B (sets) |
| \(\cap\) | Intersection (and) | A & B (sets) |
| \(\subseteq\) | Subset (possibly equal) | A <= B or A.issubset(B) |
| \(\subset\) | Proper subset (\(A \neq B\)) | A < B |
| \(\emptyset\) | Empty set | set() |
| \(\mathbb{R}\) | All real numbers | float (conceptually) |
| \(\mathbb{R}^n\) | \(n\)-dimensional real space | np.ndarray of shape (n,) |
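
The Python column translates directly (a quick sketch using the built-in set type):

```python
A = {1, 2, 3}
B = {3, 4}

print(2 in A)          # ∈  membership: True
print(A | B)           # ∪  union: {1, 2, 3, 4}
print(A & B)           # ∩  intersection: {3}
print({1, 2} <= A)     # ⊆  subset: True
print({1, 2} < A)      # ⊂  proper subset: True
print(A & {5})         # ∅  no overlap: set()
```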