---
# Content: CC BY-NC-SA 4.0 | Code: MIT - see /LICENSE.md
title: "From deterministic to probabilistic thinking"
---
## The comfort of determinism {#sec-determinism}
As software engineers, we've built our entire professional infrastructure on determinism. Tests assert exact equality. CI pipelines expect reproducible builds. Deployments are idempotent by design. Given the same input, a well-written function returns the same output — and if it doesn't, we file a bug.
```{python}
#| label: deterministic-example
#| echo: true
def generate_invoice(price: float, quantity: int, tax_rate: float) -> float:
    """Same inputs → same output. We bet our test suites on this."""
    return round(price * quantity * (1 + tax_rate), 2)
# This assertion will pass today, tomorrow, and on every machine
# in your CI pipeline. Your entire engineering practice depends
# on this being true.
assert generate_invoice(9.99, 3, 0.20) == 35.96
assert generate_invoice(9.99, 3, 0.20) == 35.96
print(f"Invoice total: £{generate_invoice(9.99, 3, 0.20)}")
```
This is comforting. It's testable. It's reproducible. And it's not how the real world works.
## When the same input gives different outputs {#sec-stochastic}
Consider a different kind of question: *how many customers will visit our website tomorrow?* You can look at historical data, account for the day of the week, factor in marketing campaigns — and you'll still get a different number from what actually happens. Not because your model is broken, but because the process that generates the data is inherently variable.
In statistics, we call this kind of quantity a **random variable**, a value governed by a probability distribution rather than determined by its inputs. When we observe a random variable repeatedly over time (as with daily visitor counts), the sequence forms a **stochastic process**, a time-ordered series of random outcomes.
```{python}
#| label: fig-stochastic-example
#| echo: true
#| fig-cap: "Ten Poisson(λ=1,000) simulations over 30 days. Each trace is an independent realisation from the same model, with values typically ranging between 920 and 1,080 visitors per day."
#| fig-alt: "Line chart showing ten simulated time series of daily website visitors over 30 days. Ten blue traces each follow a different path, fluctuating between roughly 920 and 1,080 visitors per day. A dashed orange horizontal line marks the expected value of 1,000 visitors, around which all traces scatter without systematic bias."
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(42)
# Simulate daily visitors using a Poisson distribution — a model for
# event counts that we'll explore properly in the next chapter.
# For now, treat it as: "generate realistic count data with average 1,000."
fig, ax = plt.subplots(figsize=(10, 5))
fig.patch.set_alpha(0)
ax.patch.set_alpha(0)
days = np.arange(1, 31)
for _ in range(10):
    daily_visitors = rng.poisson(lam=1000, size=30)
    ax.plot(days, daily_visitors, alpha=0.7, linewidth=1, color='#0072B2')
ax.set_xlabel('Day')
ax.set_ylabel('Visitors')
ax.set_title('Same model, different outcomes: variability is inherent')
ax.axhline(y=1000, color='#E69F00', linestyle='--', linewidth=1.5, alpha=1.0,
           label='Expected value (λ=1,000)')
ax.set_xlim(1, 30)
ax.yaxis.grid(True, linestyle=':', alpha=0.4, color='grey')
ax.set_axisbelow(True)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.legend()
plt.tight_layout()
plt.show()
```
Every line in @fig-stochastic-example was generated by the same model with the same parameters. The dashed line marks the **expected value**, the long-run average that individual outcomes scatter around. The variation isn't error; it's the *nature of the process*.
::: {.callout-note}
## Engineering Bridge
If you've worked with concurrent or distributed systems, you've already encountered nondeterminism. Race conditions, network latency jitter, and load balancer distribution all produce different outcomes under apparently identical conditions. The difference is that in systems engineering, we usually treat this variation as a problem to be eliminated — we add retries, idempotency keys, and saga patterns to absorb it. In data science, we take the opposite approach: characterise the variation and make it useful.
:::
## The data-generating process {#sec-dgp}
Behind every dataset is a **data-generating process**, the real-world mechanism that produces the observations we see. We never observe it directly; we only see its output. The goal of statistical modelling is to infer the properties of this process from the data it produces.
Think of it this way: if a function is a mapping from inputs to outputs, a data-generating process is a mapping from inputs to a *distribution* of possible outputs.
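That contrast fits in a few lines of code. Here is a minimal sketch — the linear form $2x + 1$ and the noise level are illustrative assumptions, not anything from this chapter:
```{python}
#| label: dgp-vs-function
#| echo: true
import numpy as np
rng = np.random.default_rng(0)
def f_deterministic(x: float) -> float:
    """A function: one input, one output, every time."""
    return 2 * x + 1
def f_stochastic(x: float) -> float:
    """A data-generating process: one input, a *distribution* of outputs."""
    return 2 * x + 1 + rng.normal(loc=0, scale=0.5)
# The deterministic mapping is repeatable...
assert f_deterministic(3.0) == f_deterministic(3.0)
# ...the stochastic one almost never is, even with identical inputs.
samples = [f_stochastic(3.0) for _ in range(5)]
print(samples)  # five different values scattered around f(3.0) = 7
```
Calling `f_stochastic(3.0)` repeatedly is exactly the "same input, different outputs" situation from the visitor example, just stripped down to its skeleton.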
::: {.callout-note}
## Engineering Bridge
A data-generating process is like an API you can call but whose source code you cannot read. You observe the responses (data), form hypotheses about the internal logic (model), and test those hypotheses against new responses, but you never get to inspect the implementation directly. Reverse-engineering an API from its behaviour is something most engineers have done; statistical modelling is the same instinct applied to natural processes.
The analogy has limits. Unlike an API, a data-generating process has no documentation, no versioning, and no stability guarantees. The underlying mechanism can shift without warning. And fitting your model too precisely to past responses is like hard-coding an API client to match today's response format: it works until the next release breaks everything.
:::
One way to express this mathematically is to replace the deterministic $y = f(x)$ with
$$
y = f(x) + \varepsilon
$$
where $y$ is the outcome we observe, $x$ is the input, $f(x)$ is the systematic relationship we are trying to model, and $\varepsilon$ (epsilon) is the random component. Here $\varepsilon$ (the **error term**) represents the randomness we cannot explain given our model. We construct the model so that $\varepsilon$ has mean zero; any predictable component belongs in $f(x)$ by definition. What remains in $\varepsilon$ is genuine unpredictability: variation our model cannot, in principle, account for.
This additive model is the simplest version and works well when outcomes are continuous (values that can fall anywhere on a smooth scale, like temperature or duration). For count data like our visitor example, the relationship between signal and noise is subtler: in a Poisson process, the variance equals the mean, so the noise and signal are entangled rather than simply added together. We will encounter richer models later; for now, the key insight is that $\varepsilon$ exists at all.
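That variance-equals-mean entanglement is easy to verify by simulation, using the same visitor rate as above:
```{python}
#| label: poisson-mean-variance
#| echo: true
import numpy as np
rng = np.random.default_rng(42)
# Draw a large Poisson sample at the chapter's visitor rate of 1,000/day.
visitors = rng.poisson(lam=1000, size=100_000)
# For Poisson data the variance equals the mean, so the noise grows
# with the signal instead of sitting on top of it as a separate term.
print(f"mean:     {visitors.mean():.1f}")
print(f"variance: {visitors.var():.1f}")
```
Both numbers land close to 1,000 — there is no separate "noise knob" to turn independently of the rate.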
Understanding that $\varepsilon$ exists is the first step. Our job is to model $f(x)$ as well as we can, while respecting the limits that $\varepsilon$ imposes. That balance between signal and noise, between pattern and randomness, is the shift from deterministic to probabilistic thinking.
::: {.callout-tip}
## Author's Note
This is where engineering instinct most reliably fires in the wrong direction. The instinct says: if the model doesn't perfectly predict the outcome, the model is wrong. In a deterministic system, that instinct is correct — residual error really does mean a bug. But in data science, a model that perfectly fits every observation is almost certainly *overfitting*: it's memorising noise rather than learning signal. The natural response is to push harder — more features, more complexity, more capacity — and watch the training accuracy climb. But accuracy on new, unseen data starts going the other way. Accepting irreducible uncertainty as a feature rather than a defect takes genuine rewiring. You have to stop treating residual error as a bug to fix and start treating it as information about the limits of what the data can tell you.
:::
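A quick sketch makes the trap concrete — the sample size, noise level, and polynomial degrees here are illustrative choices, not from the text. A degree-9 polynomial through ten noisy points drives training error to essentially zero by memorising the noise:
```{python}
#| label: overfitting-sketch
#| echo: true
import numpy as np
rng = np.random.default_rng(0)
f = lambda x: 2 * x + 1  # the true signal
x_train = np.linspace(0, 1, 10)
y_train = f(x_train) + rng.normal(scale=0.3, size=10)  # signal + noise
x_test = np.linspace(0.05, 0.95, 10)
y_test = f(x_test) + rng.normal(scale=0.3, size=10)    # fresh draws
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```
Training error can only fall as degree increases (a higher-degree model contains every lower-degree one), but the degree-9 fit has memorised the particular $\varepsilon$ draws in the training set, and fresh data typically exposes it.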
## Measuring uncertainty {#sec-measuring-uncertainty}
If outcomes are uncertain, we need a language for describing *how* uncertain. This is what probability distributions give us. Rather than saying "we'll get 1,000 visitors tomorrow," we can say "the number of visitors tomorrow follows a Poisson distribution with $\lambda = 1\text{,}000$, giving us roughly a 95% chance of seeing between 938 and 1,062 visitors." Here $\lambda$ is the expected rate, the average number of daily visitors.
That range is a **prediction interval**: it tells us where we expect future observations to fall. It is distinct from a *confidence interval*, which quantifies uncertainty about a model parameter rather than a future observation. We will return to that distinction later in the book.
To make that interval precise, we need two complementary tools. We write $P(\text{event})$ to mean "the probability that event occurs." A distribution's **cumulative distribution function** (CDF) answers "how likely is it that we see a value at or below $k$?", formally $P(X \leq k)$.
Its inverse, the **quantile function**, works the other way around: given a probability, it returns the corresponding value. In `scipy`, this is called the **percent point function** (ppf). Together, CDF and quantile function let us answer questions like "how likely are fewer than 950 visitors?" and "what range covers 95% of outcomes?"
```{python}
#| label: uncertainty-quantification
#| echo: true
from scipy import stats
# Poisson distribution with lambda = 1000
# Note: scipy uses `mu` for the Poisson rate parameter, while numpy uses
# `lam`. Both refer to the same quantity (λ) — this inconsistency across
# libraries is a genuine rough edge, not you missing something.
poisson = stats.poisson(mu=1000)
# For a discrete distribution we cannot place exactly 2.5% in each tail.
# ppf(q) gives the smallest k where P(X <= k) >= q, so actual coverage
# is slightly more than 95%.
lower, upper = poisson.ppf(0.025), poisson.ppf(0.975)
print(f"95% prediction interval: [{lower:.0f}, {upper:.0f}]")
# Note: for discrete distributions, CDF(k) = P(X ≤ k), not P(X < k)
print(f"P(visitors > 1,050) = {1 - poisson.cdf(1050):.3f}")
print(f"P(visitors ≤ 950) = {poisson.cdf(950):.3f}")
```
Running this gives a prediction interval of [938, 1,062], a roughly 5–6% probability of exceeding 1,050 visitors, and a similar probability of falling to 950 or below — roughly symmetric because the Poisson distribution is approximately symmetric at large $\lambda$.
This is more useful than a single **point estimate** (a single best-guess value like "1,000 visitors") because it tells us not just what we *expect*, but how much we should trust that expectation.
::: {.callout-note}
## Engineering Bridge
This is directly analogous to SLOs (service-level objectives) and error budgets in site reliability engineering. You don't say "our API latency is 50ms." You say "our p50 latency is 50ms, our p99 is 200ms, and we have an SLO of 99.9% of requests under 300ms." You're already *consuming* distributional summaries — percentiles, tail probabilities, threshold-based SLOs. Data science asks you to go one step further: build the distribution yourself, from raw data, for a system with no prior SLO, so you can reason about scenarios you haven't observed yet.
:::
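Computing those summaries from raw data takes only a few lines. A sketch on simulated latencies — the lognormal model and its parameters are illustrative assumptions, chosen only to produce a plausible long tail:
```{python}
#| label: latency-percentiles
#| echo: true
import numpy as np
rng = np.random.default_rng(42)
# Simulate 100,000 request latencies in milliseconds. A lognormal gives
# the characteristic shape: most requests fast, a long slow tail.
latencies = rng.lognormal(mean=np.log(50), sigma=0.5, size=100_000)
p50, p99 = np.percentile(latencies, [50, 99])
under_slo = np.mean(latencies < 300)  # fraction of requests under 300 ms
print(f"p50: {p50:.0f} ms, p99: {p99:.0f} ms")
print(f"Requests under 300 ms: {under_slo:.2%}")
```
The p50, p99, and SLO fraction are all just different questions asked of the same distribution — which is the step this chapter is asking you to take deliberately.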
## Summary {#sec-deterministic-summary}
The shift from deterministic to probabilistic thinking comes down to three ideas:
1. **Real-world processes produce variable outcomes** — the same conditions can lead to different results, and this variation is information, not error.
2. **We model the data-generating process, not individual outcomes** — our goal is to understand the mechanism and its uncertainty, captured by $y = f(x) + \varepsilon$.
3. **Uncertainty is quantifiable** — probability distributions give us a precise language for describing what we know, what we don't, and how confident we should be.
In the next chapter, *Distributions: the type system of uncertainty*, we will formalise this by exploring probability distributions in depth.
## Exercises {#sec-deterministic-exercises}
1. Write a Python function that simulates rolling two dice 10,000 times and plots the distribution of totals. Why does the distribution have the shape it does?
2. Your monitoring system records API response times. These are stochastic — the same request type produces different latencies. What distribution might model this? What properties would a suitable distribution need? There is no single right answer — the interesting part is the reasoning.
3. Write a function `simulate_dgp(f, noise_std, x, n_simulations)` that takes a deterministic function `f`, a noise level, an input `x`, and a number of simulations, and returns an array of outputs. Draw $\varepsilon$ from a normal distribution with mean zero and standard deviation `noise_std`. Use it to show how increasing `noise_std` changes the spread of outcomes while the average remains at `f(x)`.
4. **Conceptual:** The DGP-as-API analogy from this chapter says that statistical modelling is like reverse-engineering an API from its responses. Identify two properties of a real API that a data-generating process does *not* have, and explain why each missing property makes statistical modelling harder than API reverse-engineering.