As software engineers, we’ve built our entire professional infrastructure on determinism. Tests assert exact equality. CI pipelines expect reproducible builds. Deployments are idempotent by design. Given the same input, a well-written function returns the same output — and if it doesn’t, we file a bug.
def generate_invoice(price: float, quantity: int, tax_rate: float) -> float:
    """Same inputs → same output. We bet our test suites on this."""
    return round(price * quantity * (1 + tax_rate), 2)

# This assertion will pass today, tomorrow, and on every machine
# in your CI pipeline. Your entire engineering practice depends
# on this being true.
assert generate_invoice(9.99, 3, 0.20) == 35.96
assert generate_invoice(9.99, 3, 0.20) == 35.96

print(f"Invoice total: £{generate_invoice(9.99, 3, 0.20)}")
Invoice total: £35.96
This is comforting. It’s testable. It’s reproducible. And it’s not how the real world works.
1.2 When the same input gives different outputs
Consider a different kind of question: how many customers will visit our website tomorrow? You can look at historical data, account for the day of the week, factor in marketing campaigns — and you’ll still get a different number from what actually happens. Not because your model is broken, but because the process that generates the data is inherently variable.
In statistics, we call this kind of quantity a random variable — a value governed by a probability distribution rather than determined by its inputs. When we observe a random variable repeatedly over time (as with daily visitor counts), the sequence forms a stochastic process — a time-ordered series of random outcomes.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Simulate daily visitors using a Poisson distribution — a model for
# event counts that we'll explore properly in the next chapter.
# For now, treat it as: "generate realistic count data with average 1,000."
fig, ax = plt.subplots(figsize=(10, 5))
for _ in range(10):
    daily_visitors = rng.poisson(lam=1000, size=30)
    ax.plot(daily_visitors, alpha=0.5, linewidth=1, color='steelblue')

ax.set_xlabel("Day")
ax.set_ylabel("Visitors")
ax.set_title("Same model, different outcomes: variability is inherent")
ax.axhline(y=1000, color='coral', linestyle='--', alpha=0.7,
           label='Expected value (λ=1,000)')
ax.legend()
plt.tight_layout()
plt.show()
Figure 1.1: Ten simulations of daily website visitors. Same parameters, different outcomes every time.
Every line in Figure 1.1 was generated by the same model with the same parameters. The dashed line marks the expected value — the long-run average that individual outcomes scatter around. The variation isn’t error — it’s the nature of the process.
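One way to see that "long-run average" in action is to average over more and more simulated days: individual days keep fluctuating, but the sample mean settles towards \(\lambda\). The sketch below reuses the Poisson model from Figure 1.1; the sample sizes are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Individual days vary widely around 1,000...
print("Five individual days:", rng.poisson(lam=1000, size=5))

# ...but the sample mean settles towards the expected value as we
# average over more and more days.
for n_days in (10, 100, 10_000):
    sample_mean = rng.poisson(lam=1000, size=n_days).mean()
    print(f"Mean over {n_days:>6,} days: {sample_mean:.1f}")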
Engineering Bridge
If you’ve worked with concurrent or distributed systems, you’ve already encountered nondeterminism. Race conditions, network latency jitter, and load balancer distribution all produce different outcomes under apparently identical conditions. The difference is that in systems engineering, we usually treat this variation as a problem to be eliminated. In data science, we treat it as information to be modelled.
1.3 The data-generating process
Behind every dataset is a data-generating process — the real-world mechanism that produces the observations we see. We never observe it directly; we only see its output. The goal of statistical modelling is to infer the properties of this process from the data it produces.
Think of it this way: if a function is a mapping from inputs to outputs, a data-generating process is a mapping from inputs to a distribution of possible outputs.
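To make that distinction concrete, here is a minimal sketch contrasting a deterministic function with a stochastic counterpart that returns a different draw on every call. The function names and the spend-to-visitors relationship are invented purely for illustration.

import numpy as np

rng = np.random.default_rng(7)

def expected_visitors(marketing_spend: float) -> float:
    """Deterministic: each input maps to exactly one output."""
    return 800 + 0.5 * marketing_spend

def observed_visitors(marketing_spend: float) -> int:
    """Stochastic: each input maps to a distribution of outputs;
    every call returns one draw from that distribution."""
    return rng.poisson(lam=expected_visitors(marketing_spend))

print(expected_visitors(400), expected_visitors(400))  # identical
print(observed_visitors(400), observed_visitors(400))  # two different draws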
Engineering Bridge
A data-generating process is like an API you can call but whose source code you cannot read. You observe the responses (data), form hypotheses about the internal logic (model), and test those hypotheses against new responses — but you never get to inspect the implementation directly. Reverse-engineering an API from its behaviour is something most engineers have done; statistical modelling is the same instinct applied to natural processes.
One way to express this mathematically is:
\[
y = f(x) + \varepsilon
\]
instead of the deterministic \(y = f(x)\). Here \(\varepsilon\) (the error term) represents the randomness we cannot predict — irreducible uncertainty with an average value of zero (any systematic component belongs in \(f(x)\), not in the noise).
This additive model is the simplest version and works well when outcomes are continuous (values that can fall anywhere on a smooth scale, like temperature or duration). For count data like our visitor example, the relationship between signal and noise is more nuanced — in a Poisson process, the variance equals the mean, so the noise and signal are entangled rather than simply added together. We will encounter richer models later; for now, the key insight is that \(\varepsilon\) exists at all.
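A quick simulation makes the contrast visible: with additive noise the spread stays fixed as the signal grows, whereas with Poisson noise the spread grows with the mean. This is only an illustrative sketch; the noise level and the signal values are arbitrary.

import numpy as np

rng = np.random.default_rng(1)
n = 100_000

for signal in (10, 100, 1000):
    # Additive model y = f(x) + ε: the noise has a fixed spread (here a
    # standard deviation of 5) regardless of how large the signal is.
    additive = signal + rng.normal(loc=0, scale=5, size=n)

    # Poisson model: the variance equals the mean, so the spread grows
    # with the signal and cannot be separated from it.
    counts = rng.poisson(lam=signal, size=n)

    print(f"signal={signal:>4}  additive std={additive.std():6.2f}  "
          f"poisson std={counts.std():6.2f}")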
Understanding that \(\varepsilon\) exists is the first step. Our job is to model \(f(x)\) as well as we can, while respecting the limits that \(\varepsilon\) imposes. That balance — signal versus noise, pattern versus randomness — is the fundamental shift from deterministic to probabilistic thinking.
Author’s Note
This was the hardest conceptual shift for me. My engineering instinct says: if the model doesn’t perfectly predict the outcome, the model is wrong. But in data science, a model that perfectly fits every observation is almost certainly overfitting — it’s memorising noise rather than learning signal. The natural response is to push harder — more features, more complexity, more capacity — and watch the training accuracy climb. But accuracy on new, unseen data starts moving in the other direction. Accepting irreducible uncertainty as a feature rather than a defect takes genuine rewiring. You have to stop treating residual error as a bug to fix and start treating it as information about the limits of what the data can tell you.
1.4 Measuring uncertainty
If outcomes are uncertain, we need a language for describing how uncertain. This is what probability distributions give us. Rather than saying “we’ll get 1,000 visitors tomorrow,” we can say “the number of visitors tomorrow follows a Poisson distribution with \(\lambda = 1\text{,}000\), giving us roughly a 95% chance of seeing between 938 and 1,062 visitors.” Here \(\lambda\) is the expected rate — the average number of daily visitors.
That range is a prediction interval — it tells us where we expect future observations to fall. It is distinct from a confidence interval, which we will meet in Confidence Intervals.
Two tools help us pin down the numbers. We write \(P(\text{event})\) to mean “the probability that event occurs.” A distribution’s cumulative distribution function (CDF) answers “how likely is it that we see a value at or below \(k\)?” — formally, \(P(X \leq k)\). Its inverse — the quantile function, which scipy calls the percent point function (ppf) — works the other way around: given a probability, it returns the corresponding value. Together, these let us answer questions like “how likely are 950 or fewer visitors?” and “what range covers 95% of outcomes?”
from scipy import stats

# Poisson distribution with lambda = 1000
poisson = stats.poisson(mu=1000)

# Place 2.5% in each tail, leaving 95% in the middle
lower, upper = poisson.ppf(0.025), poisson.ppf(0.975)
print(f"95% prediction interval: [{lower:.0f}, {upper:.0f}]")

# Note: for discrete distributions, CDF(k) = P(X ≤ k), not P(X < k)
print(f"P(visitors > 1,050) = {1 - poisson.cdf(1050):.3f}")
print(f"P(visitors ≤ 950) = {poisson.cdf(950):.3f}")
This is more useful than a single point estimate (a single best-guess value like “1,000 visitors”) because it tells us not just what we expect, but how much we should trust that expectation.
Engineering Bridge
This is directly analogous to SLOs (service-level objectives) and error budgets in site reliability engineering. You don’t say “our API latency is 50ms.” You say “our p50 latency is 50ms, our p99 is 200ms, and we have an SLO of 99.9% of requests under 300ms.” You’re already consuming distributional summaries — percentiles, tail probabilities, threshold-based SLOs. Data science asks you to go one step further: model the distribution itself, so you can reason about scenarios you haven’t observed yet.
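To make the parallel concrete, the sketch below computes the same kind of distributional summaries from a simulated latency sample. The lognormal model and its parameters are assumptions chosen only to produce plausible right-skewed response times, not a claim about any real service.

import numpy as np

rng = np.random.default_rng(3)

# Simulated request latencies in milliseconds: always positive, right-skewed.
latencies_ms = rng.lognormal(mean=np.log(50), sigma=0.6, size=100_000)

p50, p99 = np.percentile(latencies_ms, [50, 99])
within_slo = np.mean(latencies_ms < 300)

print(f"p50 latency: {p50:.0f} ms")
print(f"p99 latency: {p99:.0f} ms")
print(f"Fraction of requests under 300 ms: {within_slo:.4f}")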
1.5 Summary
The shift from deterministic to probabilistic thinking comes down to three ideas:
Real-world processes produce variable outcomes — the same conditions can lead to different results, and this variation is information, not error.
We model the data-generating process, not individual outcomes — our goal is to understand the mechanism and its uncertainty, captured by \(y = f(x) + \varepsilon\).
Uncertainty is quantifiable — probability distributions give us a precise language for describing what we know, what we don’t, and how confident we should be.
In the next chapter, Distributions: the type system of uncertainty, we will formalise this by exploring probability distributions in depth.
1.6 Exercises
Write a Python function that simulates rolling two dice 10,000 times and plots the distribution of totals. Why does the distribution have the shape it does?
Your monitoring system records API response times. These are stochastic — the same request type produces different latencies. What distribution might model this? (Hint: response times are always positive and often right-skewed — most responses are fast, but a long tail of slow responses pulls the distribution to the right.) Don’t worry about getting the “right” answer — think about what properties the distribution would need.
Write a function simulate_dgp(f, noise_std, x, n_simulations) that takes a deterministic function f, a noise level, an input x, and a number of simulations, and returns an array of outputs. Use it to show how increasing noise_std (i.e., increasing the spread of \(\varepsilon\)) changes the spread of outcomes while the average remains at f(x).
Conceptual: In software testing, a flaky test is one that sometimes passes and sometimes fails under identical conditions. In what sense is a flaky test a stochastic process? How does the engineering response (eliminate the flakiness) differ from the statistical response (model the variation)?