Thinking in Uncertainty
Transpose: Software Engineering to Data Science
Preface
Who this book is for
You’ve been asked to evaluate a model someone built. Or review a data pipeline PR. Or sit in a meeting where someone says “statistically significant” and you’re not sure whether they’re right. You’re a software engineer — you can build systems, debug code, and reason about complexity — but nobody taught you how to think statistically.
Now you want, or need, to understand data science. Not as a tourist, but with genuine depth.
Most data science resources either assume you have a statistics background and skip the engineering, or assume you have no technical background at all and waste your time explaining what a function is. This book does neither. It starts from what you already know and builds the statistical and analytical thinking you need to work confidently with data.
I wrote this book while making the same transition. The moments where I was confused, where something finally clicked, and where textbooks assumed knowledge I didn’t have — those are all captured in Author’s Note callouts throughout the chapters. The goal is to give you the guide I wish I’d had: rigorous enough to be trustworthy, practical enough to be useful, and honest about where things are genuinely hard.
What this book covers
The book is organised in five parts, each building on the last:
Part 1: Foundations rewires your thinking from deterministic to probabilistic. You’ll learn distributions, descriptive statistics, and probability — the grammar of uncertainty — with every concept grounded in something you already know from engineering.
Part 2: Inference gives you the tools to draw conclusions from data: hypothesis testing, confidence intervals, A/B testing, and Bayesian inference. This is where you stop guessing and start quantifying.
Part 3: Modelling covers the algorithms — linear and logistic regression, regularisation, tree-based models, dimensionality reduction, clustering, and time series. Each one is taught as a tool with trade-offs, not magic.
Part 4: Engineering for Data Science is where your existing skills become a superpower. Reproducibility, data pipelines, MLOps, scaling, and testing — applied to data science workflows.
Part 5: Applied Data Science puts everything together on real industry problems: churn prediction, recommendation systems, demand forecasting, fraud detection, NLP, and computer vision.
Four appendices provide supporting material: a mathematical foundations refresher, a concept bridge mapping SE terms to DS terms, a reading list, and answers to all chapter exercises.
This is not a deep learning book — there are no neural network architectures or GPU training pipelines. It is not a “learn Python” book — Python fluency is assumed. And it is not a proof-based statistics textbook — the treatment is computational and applied, aimed at building working understanding rather than mathematical formalism.
How this book works
Throughout the book, you’ll find two types of callout that bridge the gap between engineering and statistics:
The first connects a data science concept to something you already understand from software engineering. For example: a data-generating process is like an API you can call but whose source code you can't read; you observe the responses and reverse-engineer the logic (a short sketch follows below). Where the analogy breaks down, we say so.

The second, the Author's Note, captures honest reflections on the learning journey: moments of confusion, what finally made something click, and the gap between what textbooks assume and what engineers actually need. They are not performative humility; they are the notes I wish someone had left for me.
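To make that API analogy concrete, here is a minimal sketch in Python. Everything in it, including the black_box_api function and its hidden coefficients, is invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

def black_box_api(x):
    # The "hidden" data-generating process. Pretend you cannot read
    # this source: the coefficients here are invented for illustration.
    return 2.0 * x + 1.0 + rng.normal(0, 0.5, size=len(x))

# All you can do is observe responses...
x = np.linspace(0, 10, 20)
y = black_box_api(x)

# ...and reverse-engineer the logic, e.g. by fitting a line.
slope, intercept = np.polyfit(x, y, 1)
print(f"inferred: slope={slope:.2f}, intercept={intercept:.2f}")
# close to the hidden values 2 and 1
```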
Every concept follows the same rhythm: intuition first, then the mathematics, then executable code that makes it concrete. Formulae are always paired with a plain-English interpretation or a Python equivalent (often both). Mathematical notation is introduced gradually — if you haven’t seen Greek letters since university, the mathematical foundations appendix has a refresher.
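As a small taste of that pairing (this example is mine, not lifted from a chapter), take the sample mean: x̄ = (1/n) Σ xᵢ, which reads as "add the observations up and divide by how many there are", and which has a one-line Python equivalent:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# x̄ = (1/n) * Σ xᵢ : add the observations up, divide by the count
mean_by_hand = x.sum() / len(x)
mean_library = np.mean(x)  # the one-line library equivalent

print(mean_by_hand, mean_library)  # both print 5.142857142857143
```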
You’ll notice that every chart in this book is accompanied by the code that produced it. This is deliberate: the code is part of the explanation. Data science is a practice, not just a theory, and the code shows you how that practice works: which library calls produce which results, which parameters matter, and how to get from a statistical concept to a working implementation. When you encounter a similar problem in your own work, the code is there as a reference you can adapt.
What you’ll need
Python proficiency is assumed — we won’t explain list comprehensions or how pip install works. The code uses the standard data science stack: numpy, pandas, scipy, scikit-learn, matplotlib, and seaborn.
You'll also need basic maths at the level of a first-year undergraduate course: enough to recognise a derivative, a summation, or a matrix multiplication, even if the details are rusty. The mathematical foundations appendix covers everything you'll need.
No prior statistics or probability training is required. That is what this book provides, from first principles.
Running the code
All code examples are executable Python, designed to run in a Jupyter environment. You can install the dependencies with:
```
pip install -r requirements.txt
```

The complete source for this book, including all executable notebooks, is available at the GitHub repository. Found an error? Have a better analogy? Issues and pull requests are welcome.
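If you want to confirm the environment is ready, a quick sanity check along these lines (my suggestion, not part of the book's notebooks) imports each library in the stack and prints its version:

```python
# Sanity check: import each library in the book's stack and print
# its version to confirm the installation worked.
import numpy, pandas, scipy, sklearn, matplotlib, seaborn

for lib in (numpy, pandas, scipy, sklearn, matplotlib, seaborn):
    print(f"{lib.__name__:>12} {lib.__version__}")
```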
Getting started
The book is designed to be read in order — each chapter builds on the last. If you want a preview of how familiar territory fits in, glance at the chapter titles in Part 4; then start at Part 1 for the statistical foundations that make everything else click.