Thinking in Uncertainty

Transpose: Software Engineering to Data Science

Author

Matthew Gibbons

Published

25 February 2026

Preface

Who this book is for

You’ve been asked to evaluate a model someone built. Or review a data pipeline PR. Or sit in a meeting where someone says “statistically significant”, and you’re not sure whether they’re right. You’re a software engineer. You can build systems, debug code, and reason about complexity, but nobody taught you how to think statistically.

Most data science resources assume you either have a statistics background or no technical background at all. The first kind skips the foundations and leaves you with pattern-matching formulas you don’t understand. The second kind wastes your time explaining what a function is. This book does neither. It starts from what you already know (deterministic systems, type thinking, testing, and architecture) and builds the statistical reasoning you need to work confidently with data.

I wrote this book while making the same transition. Where a concept challenged my engineering instincts, I wrote about why. Where an analogy between software systems and statistical models clarified something, I made the connection explicit. Where textbooks assumed knowledge I didn’t have, I filled the gap. The result is a book that’s rigorous enough to be trustworthy, practical enough to be useful, and honest about where things are hard.

What this book covers

The book is organised into five parts, each building on the last.

Part 1: Foundations asks you to make one uncomfortable shift: accepting that the same input can produce different outputs, and that this isn’t a bug. From there, it builds the grammar you need. Distributions, descriptive statistics, and probability, all grounded in concepts you already know from engineering.

Part 2: Inference is where data starts answering questions. Hypothesis testing, confidence intervals, A/B testing, and Bayesian inference give you the tools to move from “I think this is true” to “the evidence supports this, with quantified uncertainty.” If you’ve ever had to decide whether a metric change is real or noise, this is the part that makes that rigorous.

Part 3: Modelling covers the algorithms: linear and logistic regression, regularisation, tree-based models, dimensionality reduction, clustering, and time series. Each is taught as a tool with trade-offs. When it works, when it breaks, and what the diagnostic signals look like.

Part 4: Engineering for Data Science is where your existing skills pay off. Reproducibility, data pipelines, MLOps, data at scale, and testing for data systems: the engineering practices that separate a prototype notebook from a production system. If Parts 1–3 teach you to think statistically, Part 4 teaches you to build statistically.

Part 5: Applied Data Science puts everything together on six industry problems: churn prediction, recommendation systems, demand forecasting, fraud detection, natural language processing, and computer vision. Each chapter works through a realistic problem end to end, from defining the business question to evaluating the deployed model.

Four appendices provide supporting material: a mathematical foundations refresher, a concept bridge mapping software engineering terms to data science terms, a reading list, and worked answers to all chapter exercises.

A note on scope. This is not a deep learning book; there are no neural network architectures or GPU training pipelines. It is not a “learn Python” book; Python fluency is assumed. And the treatment is computational and applied, aimed at a working understanding rather than mathematical formalism.

How this book works

Every concept follows the same rhythm: intuition first, then the mathematics, then executable code that makes it concrete. Formulae are always paired with a plain-English interpretation or a Python equivalent (often both). Mathematical notation is introduced gradually; if you haven’t seen Greek letters since university, the mathematical foundations appendix has a refresher.
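As a small taste of that pairing, here is the kind of formula-plus-Python treatment the chapters use, sketched with only the standard library (the book itself leans on numpy and scipy):

```python
import math
import statistics

# Sample standard deviation: s = sqrt( sum((x_i - mean)^2) / (n - 1) )
# Plain English: "on average, how far do values sit from the mean?"
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = sum(data) / len(data)
s_manual = math.sqrt(sum((x - mean) ** 2 for x in data) / (len(data) - 1))

# The library equivalent gives the same answer
s_library = statistics.stdev(data)
print(round(s_manual, 4), round(s_library, 4))  # → 2.1381 2.1381
```

The formula, the plain-English reading, and the code all say the same thing; seeing them side by side is the point.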

Throughout, two types of callout bridge the gap between engineering and statistical thinking:

Note: Engineering Bridge

These connect a data science concept to something you already understand from software engineering. For example, a data-generating process is like an API you can call but whose source code you can’t read. You observe the responses and reverse-engineer the logic. Where the analogy holds, we push it. Where it breaks down, we say so.
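A toy sketch of that analogy (the function name and the hidden logic here are illustrative, not from the book): a process whose internals you cannot inspect, which you probe by sampling responses and summarising them.

```python
import random
import statistics

def opaque_api(x):
    """A 'black-box endpoint': you see responses, never source code.
    (Hidden logic, for illustration only: a linear signal plus noise.)"""
    return 3.0 * x + random.gauss(0.0, 1.0)

random.seed(42)  # make the 'requests' reproducible

# Observe many responses across a range of inputs
xs = [x / 10 for x in range(100)]
ys = [opaque_api(x) for x in xs]

# Reverse-engineer the hidden slope: average rise over average run
slope_estimate = (statistics.mean(ys[50:]) - statistics.mean(ys[:50])) / (
    statistics.mean(xs[50:]) - statistics.mean(xs[:50])
)
print(round(slope_estimate, 2))  # close to the hidden 3.0, never exact
```

The estimate hovers near the true value but never equals it: that gap between the hidden process and what you can recover from its responses is the analogy in one line of output.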

Tip: Author’s Note

These are conceptual reactions to the material: moments where an idea conflicts with engineering instincts, where a familiar-looking concept turns out to work differently than expected, or where a shift in thinking changes how you approach a problem. They are responses to the ideas, not anecdotes.

Every chart in this book is accompanied by the code that produced it. This is intentional. Data science is a practice, not just a theory, and the code shows you how that practice works: which library calls produce which results, what parameters matter, and how to go from a statistical concept to a working implementation. When you encounter a similar problem in your own work, the code is there as a reference you can adapt.

All visualisations use the Okabe-Ito colour palette (Okabe and Ito 2002), designed to be distinguishable by readers with colour vision deficiencies. Figures include descriptive alt text, and the book is tested for screen reader compatibility. Accessibility is not an afterthought; if a chart cannot communicate its message to every reader, it is not doing its job.
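For readers who want the same palette in their own work, a minimal sketch: the eight standard Okabe-Ito hex values, with an illustrative (not book-specific) way of making them matplotlib's default colour cycle.

```python
# The eight Okabe-Ito colours (Okabe and Ito 2002), colour-blind safe
OKABE_ITO = [
    "#000000",  # black
    "#E69F00",  # orange
    "#56B4E9",  # sky blue
    "#009E73",  # bluish green
    "#F0E442",  # yellow
    "#0072B2",  # blue
    "#D55E00",  # vermilion
    "#CC79A7",  # reddish purple
]

# Make it the default colour cycle when matplotlib is available
try:
    import matplotlib.pyplot as plt
    plt.rcParams["axes.prop_cycle"] = plt.cycler(color=OKABE_ITO)
except ImportError:
    pass  # the palette is still usable as a plain list of hex strings
```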

What you’ll need

Python proficiency is assumed. We won’t explain list comprehensions or how pip install works. The code uses the standard data science stack: numpy, pandas, scipy, scikit-learn, matplotlib, and seaborn.

Basic maths is expected at the level of a first-year undergraduate course: enough to recognise a derivative, a summation, or a matrix multiplication, even if the details are rusty. The mathematical foundations appendix covers everything required.

No prior statistics or probability training is required; that is what this book provides, from first principles.

Running the code

All code examples are executable Python, designed to run in a Jupyter environment. You can install the dependencies with:

pip install -r requirements.txt

The complete source for this book, including all executable notebooks, is available in the book’s GitHub repository. Found an error? Have a better analogy? Issues and pull requests are welcome.

Getting started

The book is designed to be read in order. Each part builds on the last, and later chapters reference techniques and vocabulary established earlier. That said, if you want a feel for where this is heading, skim the chapter titles in Part 4 or Part 5 to see how the foundations pay off. Then start at Part 1, Chapter 1: the shift from deterministic to probabilistic thinking is where everything begins.