Appendix C — Recommended reading and resources
If a chapter left you wanting more mathematical rigour, more practical tooling, or a different perspective, the resources below are where to look. They are organised by purpose rather than by chapter, and each entry includes enough context for you to decide whether it is worth your time.
C.1 Statistical foundations
The books in this section add mathematical depth to the modelling concepts introduced in Parts 1–3, with increasing levels of rigour.
An Introduction to Statistical Learning with Applications in Python — Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, and Jonathan Taylor (Springer, 2023). The single best next step after this book. It covers the same modelling territory (regression, classification, tree methods, regularisation, clustering, dimensionality reduction) with greater mathematical depth alongside practical application. What it adds beyond this book is the why: principled justifications for the algorithms you have already learned to use, plus worked lab exercises in scikit-learn that reinforce each chapter. It is freely available online from the authors.
The Elements of Statistical Learning — Trevor Hastie, Robert Tibshirani, and Jerome Friedman (Springer, 2nd ed., 2009). The graduate-level treatment of the same material. ESL predates ISLP and was the original text; ISLP was written as the more accessible introduction. ESL supplies the full mathematical derivations, plus additional material on boosting, neural networks, graphical models, and high-dimensional inference. Reach for it when you want to understand why an algorithm works, not just how to use it. It too is freely available online from the authors.
All of Statistics: A Concise Course in Statistical Inference — Larry Wasserman (Springer, 2004). A compact, rigorous treatment of probability, statistical inference, and modelling. It assumes comfort with proof-based mathematics (measure-theoretic notation appears in places), so the step up from this book is steeper than the step to ISLP. A good match for engineers with strong maths backgrounds who want the full formal machinery behind hypothesis testing, confidence intervals, and nonparametric methods.
C.2 Bayesian thinking
Statistical Rethinking: A Bayesian Course with Examples in R and Stan — Richard McElreath (CRC Press, 2nd ed., 2020). If the Bayesian inference chapter left you wanting a deeper treatment, this is where to go. McElreath builds Bayesian models from first principles, with an emphasis on causal reasoning and model checking that complements the frequentist methods used in most of this book. The code examples use R and Stan rather than Python, but the conceptual content is language-independent. The accompanying lecture videos are excellent.
C.3 The two cultures
“Statistical Modeling: The Two Cultures” — Leo Breiman (Statistical Science, 2001). A short, influential paper that articulates the tension between data modelling (assume a generative model, estimate its parameters) and algorithmic modelling (treat the mechanism as unknown, optimise predictive accuracy). This tension runs throughout Parts 3 and 5 of this book: linear regression versus random forests, interpretability versus performance, understanding versus prediction. Engineers often find their instincts validated by Breiman’s argument for algorithmic modelling, which makes the counterarguments in the data modelling tradition worth your particular attention. Worth reading once you have finished the modelling chapters.
C.4 Machine learning engineering
“Hidden Technical Debt in Machine Learning Systems” — D. Sculley et al. (NeurIPS, 2015). The paper behind Section 17.1’s observation that model code is typically less than 5% of a production ML system. It catalogues the engineering challenges (data dependencies, configuration debt, feedback loops, pipeline jungles) that dominate real-world ML projects. Essential reading for any engineer moving into ML operations.
Designing Machine Learning Systems — Chip Huyen (O’Reilly, 2022). Covers the full lifecycle of production ML from an engineering perspective: feature stores, training-serving skew, data management, monitoring, and deployment patterns. Where Sculley et al. diagnose the problems, Huyen prescribes the solutions. This is the most directly relevant book for engineers who have finished Part 4 of this book and want to go deeper on ML infrastructure.
C.5 Tools and libraries
The book uses a small set of core libraries throughout. These are the official documentation sites and the most useful supplementary resources for each.
C.5.1 Core stack
scikit-learn — The workhorse for classical ML in Python. The user guide is exceptionally well-written: each algorithm page includes the mathematical formulation, practical tips, and references to the original papers. When in doubt about a method’s assumptions or hyperparameters, start here.
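The core scikit-learn workflow is a sketch worth having in mind when reading the user guide: compose preprocessing and a model into a pipeline, then evaluate with cross-validation. The iris dataset here is a stand-in for your own data; any estimator from the documentation slots into the same pattern.

```python
# A minimal scikit-learn workflow: a preprocessing + model pipeline
# evaluated with 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling inside the pipeline prevents leakage: the scaler is refit
# on each training fold, never on the held-out fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5)
print(f"mean accuracy: {scores.mean():.3f}")
```

Putting the scaler inside the pipeline rather than fitting it on the full dataset is the detail the user guide's model-evaluation pages stress most.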
statsmodels — The library for statistical modelling, hypothesis testing, and time series analysis. Its documentation is more academic than scikit-learn’s, but the examples gallery provides worked demonstrations of most techniques covered in Parts 2 and 3.
SciPy (scipy.stats) — Probability distributions, statistical tests, and numerical methods. Used throughout this book for distribution objects, hypothesis tests, and confidence intervals. The statistical functions documentation is comprehensive.
pandas — The primary library for tabular data manipulation in Python, used in virtually every chapter of this book. The user guide covers indexing, grouping, reshaping, and merging in detail, and the cookbook provides idiomatic solutions to common data-wrangling tasks. Engineers coming from SQL will find the "Comparison with SQL" page in the documentation particularly useful.
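For readers coming from SQL, the closest point of contact is groupby-aggregate, which corresponds directly to GROUP BY. A minimal sketch with made-up column names:

```python
# Groupby-aggregate in pandas, equivalent to:
#   SELECT region, SUM(amount), COUNT(*) FROM orders GROUP BY region
import pandas as pd

orders = pd.DataFrame({
    "region": ["north", "south", "north", "south", "north"],
    "amount": [10.0, 20.0, 30.0, 40.0, 50.0],
})

summary = orders.groupby("region")["amount"].agg(["sum", "count"])
print(summary)
```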
NumPy — The foundation for numerical computing in Python. The user guide section on broadcasting explains the mechanics behind the vectorised operations used throughout this book, and the internals documentation covers memory layout and stride tricks for when performance matters.
Matplotlib — The plotting library used for all figures in this book. The tutorials page covers the object-oriented API (which this book uses exclusively) and the gallery is useful for finding the right chart type.
Seaborn — A statistical visualisation library built on Matplotlib, used in this book for distribution plots, pair plots, and heatmaps. The tutorial explains when to reach for Seaborn over raw Matplotlib.
C.5.2 Model interpretability
SHAP — A library for computing Shapley values to explain individual model predictions. SHAP values appear naturally alongside the partial dependence plots discussed in Section 21.1 and provide a principled way to answer “why did the model make this prediction?”, a question that arises in every applied chapter in Part 5. The documentation includes tutorials for tree-based models, linear models, and deep learning.
C.5.3 Data engineering
Polars — A DataFrame library with lazy evaluation and automatic query optimisation, introduced in Section 19.1. Polars is a strong alternative to pandas for datasets that strain memory or require complex multi-step transformations. The user guide is well-organised and includes migration guides from pandas.
DVC (Data Version Control) — Version control for data and ML pipelines, discussed in Section 16.1. Extends Git to track large files and define reproducible pipeline stages. The getting started guide covers the core workflow.
MLflow — Experiment tracking, model registry, and deployment tooling, discussed in Section 16.1 and Section 18.1. The quickstart covers logging experiments and registering models.
C.5.4 Pipeline orchestration
Three orchestration tools are mentioned in Section 17.1, each with a different design philosophy.
Apache Airflow — The established standard with the largest ecosystem, but it carries significant operational overhead: scheduler administration, database backends, and worker management are non-trivial. Adopt it when your organisation already runs it or when you need its extensive integration library; do not start a greenfield project here without a strong reason.
Prefect — A more Pythonic approach, using decorators and native Python control flow rather than DAG definitions, with lower ceremony for smaller teams.
Dagster — Emphasises typed data contracts between pipeline stages, with built-in testing and observability that appeal to teams who value explicit data dependencies.
C.5.5 Data validation
pandera — Schema validation for pandas and Polars DataFrames, demonstrated in Section 20.1. Lets you define column-level type constraints, value ranges, and custom checks as declarative schemas. Lightweight and well-integrated with pytest.
Great Expectations — A more comprehensive data validation framework for production environments. Mentioned alongside pandera in Section 20.1. Includes data profiling, documentation generation, and integration with orchestration tools.
C.6 Forecasting
Prophet — A forecasting library developed at Meta, mentioned in Section 15.1 and Section 23.1. Its default uncertainty intervals are based on historical trend changes rather than a formal likelihood model, so they require careful validation — a significant caveat given this book’s emphasis on calibrated prediction intervals. That said, Prophet is hard to beat for fast baseline forecasts on business time series with multiple seasonalities, missing data, and holiday effects. The documentation includes quickstarts and performance benchmarks.
C.7 Topics beyond this book
Several chapters mention techniques that are beyond this book’s scope but are natural next steps for engineers who want to go further.
Deep learning and neural networks. The computer vision chapter notes that convolutional neural networks (CNNs) dominate modern image analysis, and the NLP chapter notes that transformer-based language models have largely superseded the bag-of-words and TF-IDF approaches taught in this book. For a practical introduction to deep learning in Python, Deep Learning with Python by François Chollet (Manning, 2nd ed., 2021) covers the fundamentals using Keras. For engineers who prefer to learn by building before formalising, fast.ai’s Practical Deep Learning for Coders takes the opposite approach: working code first, theory afterwards. For a more mathematical treatment, Dive into Deep Learning (d2l.ai) is a freely available interactive textbook.
Natural language processing. The NLP chapter covers classical NLP (tokenisation, TF-IDF, topic modelling) but stops short of word embeddings, attention mechanisms, and large language models. Speech and Language Processing by Dan Jurafsky and James Martin (Pearson, 3rd edition draft, freely available online) covers the full spectrum from classical to modern NLP.
Causal inference. Section B.8 identifies the places where engineering intuition reliably misfires in a data science context, including the distinction between correlation and causation in observational data. For engineers interested in going beyond A/B testing to causal reasoning from observational data, The Book of Why by Judea Pearl and Dana Mackenzie (Basic Books, 2018) is an accessible introduction to causal graphs and do-calculus — more manifesto than textbook, but effective at building intuition. For the actual methodology you would implement, Causal Inference: What If by Miguel Hernán and James Robins (Chapman & Hall/CRC, freely available online as a draft from the authors) covers the statistical foundations rigorously.
Conformal prediction. Mentioned briefly in Section 23.1 as a distribution-free method for constructing prediction intervals. This is an active area of research that provides finite-sample coverage guarantees without parametric assumptions, a natural extension of the prediction interval ideas in Section 15.1.
Approximate nearest neighbour search. Section 22.1 mentions that serving recommendations at scale requires approximate nearest neighbour (ANN) algorithms. Libraries like Faiss (Meta), ScaNN (Google), and Annoy (originally developed at Spotify) provide sublinear search over embedding spaces. These are infrastructure tools rather than statistical methods, but they are essential for deploying any embedding-based system in production.
C.8 How to choose what to read next
The right next step depends on where you want to go:
- Deeper statistical foundations → An Introduction to Statistical Learning (ISLP). This is the default recommendation for any reader.
- More mathematical rigour → All of Statistics (Wasserman) or Elements of Statistical Learning (ESL).
- Bayesian methods → Statistical Rethinking (McElreath).
- ML engineering and production systems → Designing Machine Learning Systems (Huyen), then the MLflow and DVC documentation.
- Deep learning → Deep Learning with Python (Chollet), fast.ai, or d2l.ai.
- Causal inference → The Book of Why (Pearl) for intuition, then Causal Inference: What If (Hernán & Robins) for methodology.
- Hands-on practice → Kaggle competitions and datasets provide structured problems with real data, community solutions, and immediate feedback, often a more effective learning path than another textbook.
The most dangerous trap in data science is not mathematical difficulty — it is the temptation to chase the flashiest tool before consolidating the fundamentals. Transformers before you understand logistic regression. Spark before you can write efficient pandas. Neural networks before you have tried a well-tuned gradient boosting model. One book worked through properly, with every exercise attempted and every code example run, teaches more than five books skimmed. Fluency in scikit-learn is worth more than familiarity with ten libraries. Breadth feels productive; depth actually is.