Appendix B — The SE to DS concept bridge
When you first encounter bootstrap resampling, it sounds foreign until you realise it is load testing. Residual analysis sounds statistical — until you recognise it as debugging. Alert fatigue is base rate neglect. Overfitting is premature optimisation. The concepts are the same; the vocabulary is different.
This appendix maps the concepts you know to the ones you are learning. It is organised by theme rather than by chapter, so you can look up an SE concept you understand and find the DS concept it connects to. Where the analogy breaks down, that is noted: a misleading bridge is worse than no bridge at all.
B.1 Thinking and reasoning
These are the fundamental shifts in perspective. They do not map to specific tools but to the way you frame problems.
| SE concept | DS concept | How they connect | Ch. |
|---|---|---|---|
| Determinism (same input → same output) | Stochastic processes (same input → distribution of outputs) | In SE, variability is a bug. In DS, variability is the data. The shift is accepting that \(\varepsilon\) (the residual error term) exists and is information, not error. | Section 1.1 |
| Debugging (examine error logs for patterns) | Residual analysis (examine model errors for patterns) | Both ask the same question: do the errors look random, or is there structure I missed? Clustered errors mean something is wrong; scattered errors mean you are done. | Section 9.1 |
| Reverse-engineering an API from its responses | Statistical modelling of a data-generating process | Both involve observing outputs and forming hypotheses about internal logic. Unlike an API, a data-generating process has no documentation, no versioning, and can shift without warning. | Section 1.1 |
| Specification before implementation | Experiment design before data collection | Changing acceptance criteria after seeing results invalidates the process in both cases. An A/B test designed after peeking at data is like modifying assertions to match observed output. | Section 7.1 |
| Iterative refinement (fix, deploy, observe) | Gradient descent (compute loss, update, repeat) | Both converge on a solution by taking small corrective steps. Most model-fitting algorithms work this way: define a loss, take the gradient, step downhill. OLS is an exception (it has a closed-form solution), but beyond linear regression, iterative optimisation is the norm. | Section 9.1 |
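The gradient-descent row is easy to make concrete. A minimal sketch, using hypothetical toy data with a true slope of 2 and intercept of 1: fit the same simple linear regression by the closed-form OLS solution and by iterative gradient descent, and check that both converge to the same answer.

```python
# Illustrative sketch: fitting y = w*x + b two ways on hypothetical toy data.
# Gradient descent takes small corrective steps; OLS has a closed form.
import random

random.seed(0)
xs = [i / 10 for i in range(50)]
ys = [2.0 * x + 1.0 + random.gauss(0, 0.1) for x in xs]  # true slope 2, intercept 1

# Closed-form OLS for simple linear regression
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
w_ols = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b_ols = my - w_ols * mx

# Gradient descent on mean squared error: compute loss gradient, update, repeat
w, b, lr = 0.0, 0.0, 0.01
for _ in range(5000):
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    w, b = w - lr * dw, b - lr * db

print(round(w_ols, 2), round(b_ols, 2))  # close to the true slope 2 and intercept 1
print(round(w, 2), round(b, 2))          # gradient descent lands on the same answer
```

Both routes arrive at the same coefficients; OLS gets there in one step only because the squared-error loss for linear regression happens to have a closed-form minimiser.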
B.2 Types, constraints, and validation
| SE concept | DS concept | How they connect | Ch. |
|---|---|---|---|
| Type systems (define domains and valid operations) | Probability distributions (constrain values and assign likelihoods) | A distribution is a type: it specifies what values are legal and how likely each one is. rv_discrete and rv_continuous are subclasses of rv_generic, polymorphism you already understand. | Section 2.1 |
| Schema specification (NOT NULL, CHECK constraints) | Dtype selection and validation in DataFrames | Type constraints catch silent runtime errors before they propagate. Choosing int8 vs float64 in a DataFrame is the same instinct as choosing VARCHAR(255) vs TEXT. | Section 19.1 |
| Input validation at system boundaries | Distributional assumptions about data | Choosing the wrong distribution for your data is like accepting malformed input. If your alert threshold assumes Normal latencies but your data is log-normal, you get false alarms or missed incidents. | Section 2.1 |
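The wrong-distribution failure mode in the last row can be sketched directly (all numbers hypothetical; latencies simulated as log-normal): a p99 threshold derived from a Normal assumption sits well below the empirical tail, so alerts fire too late.

```python
# Illustrative sketch: a p99 alert threshold computed under a Normal
# assumption vs the empirical p99 of skewed (log-normal) latencies.
import math
import random

random.seed(1)
# Simulated request latencies in ms: log-normal, i.e. heavily right-skewed
latencies = [math.exp(random.gauss(4.0, 0.5)) for _ in range(100_000)]

mean = sum(latencies) / len(latencies)
sd = math.sqrt(sum((x - mean) ** 2 for x in latencies) / len(latencies))

# If you assume Normality, p99 ≈ mean + 2.326 * sd
p99_normal = mean + 2.326 * sd

# Empirical p99: sort and read off the 99th percentile directly
p99_empirical = sorted(latencies)[int(0.99 * len(latencies))]

print(f"Normal-assumption p99: {p99_normal:.0f} ms")
print(f"Empirical p99:         {p99_empirical:.0f} ms")
# The Normal threshold underestimates the tail, so alerts fire too late
```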
B.3 Testing and evaluation
| SE concept | DS concept | How they connect | Ch. |
|---|---|---|---|
| Unit tests (assume correct, run, check for contradiction) | Hypothesis testing (assume \(H_0\), the null hypothesis; observe data, reject if evidence is strong) | Both follow the same logic: start with a null assumption, look for evidence against it. The difference is that data assertions account for sampling variability, so you need a significance level, not just assertEqual. | Section 5.1 |
| Dev/test/prod environments | Train/validation/test splits | Never evaluate a model on the data used to build it, for the same reason you don’t test code against the examples you wrote it for. The test set is production; the training set is development. | Section 9.1 |
| Multiple test runs (re-running until green) | Peeking at p-values (check repeatedly, stop at significance) | A flaky test that “passes” because you ran it enough times is exploiting multiple attempts, the same mechanism that makes peeking at p-values invalid. Every re-run is another opportunity to get a misleadingly green result by chance, just as each daily check of a p-value is another opportunity for a spurious significant result. | Section 7.1 |
| Specification testing (assert exact output) | Property-based testing (assert behavioural invariants) | ML models rarely produce exact outputs. Testing shifts from “output equals X” to “output has property Y”: predictions are monotonic, probabilities sum to one, performance exceeds a baseline. | Section 20.1 |
| Precision and recall in alerting systems | Precision and recall in classification | The trade-off is structurally similar: a sensitive detector catches more true positives but generates more false alarms. Tuning the threshold controls the balance. In hypothesis testing, Type I error (false alarm) and Type II error (missed detection) are related but not identical; precision in classification also depends on the operating threshold and the base rate of positives. | Section 5.1, Section 10.1 |
| Load testing (hit system with traffic, measure outcomes) | Bootstrap resampling (resample data, observe distribution of estimates) | Both simulate a process you cannot solve analytically. The key difference: load testing generates new traffic against a real system, while bootstrapping resamples from data you already have; there is no population to re-query, so the sample acts as a proxy for it. The bootstrap makes no parametric distributional assumptions (no assumption of Normality, for example) but does rely on the sample being representative. | Section 6.1 |
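The bootstrap itself fits in a few lines. A minimal sketch on hypothetical data (bootstrap_means is an illustrative helper, not a library function):

```python
# Illustrative sketch of the bootstrap: resample the data with replacement
# many times and observe the distribution of the statistic across resamples.
import random

random.seed(2)
# Hypothetical latency sample, exponentially distributed with mean ~200 ms
sample = [random.expovariate(1 / 200) for _ in range(500)]

def bootstrap_means(data, n_resamples=2000):
    means = []
    for _ in range(n_resamples):
        resample = random.choices(data, k=len(data))  # sample WITH replacement
        means.append(sum(resample) / len(resample))
    return sorted(means)

means = bootstrap_means(sample)
# A 95% percentile confidence interval: the middle 95% of resampled means
lo, hi = means[int(0.025 * len(means))], means[int(0.975 * len(means))]
print(f"Bootstrap 95% CI for the mean: ({lo:.1f}, {hi:.1f})")
```

No Normality was assumed anywhere; the interval comes entirely from the spread of the resampled estimates.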
B.4 Architecture and design patterns
| SE concept | DS concept | How they connect | Ch. |
|---|---|---|---|
| Separation of concerns | Time series decomposition (trend + seasonality + residual) | Isolate independent components for independent analysis. Each component can be understood, tested, and modified without affecting the others. | Section 15.1 |
| Extract Interface refactoring | PCA (compose correlated features into principal components) | Both reduce a complex surface to its essential dimensions. The goal differs (Extract Interface is about decoupling; PCA is about compression with minimal information loss), but both ask: what is the minimal representation that preserves the essential structure? | Section 13.1 |
| Hand-coded routing logic (if-else chains) | Decision trees (learned splitting rules) | Trees discover the conditions from data instead of you writing them by hand. The structure is identical (nested conditionals), but the rules come from optimisation, not domain knowledge. | Section 12.1 |
| Pattern matching with routers/dispatchers | Classification models (logistic regression, random forests) | Hand-coded rules are replaced by patterns learned from data. The model generalises where the rules would need case-by-case maintenance. | Section 10.1 |
| Anomaly grouping in distributed tracing | Clustering (infer group membership from metrics) | Both assign entities to groups based on observed behaviour rather than explicit labels. You have thousands of trace spans or data points with no predefined categories, and clustering discovers the groupings that the data itself suggests. | Section 14.1 |
| Resource throttling (CPU/memory limits per service) | Regularisation (penalty on model complexity) | Both constrain capacity to prevent runaway behaviour. The regularisation parameter \(\lambda\) controls how much the model can use each feature, just as resource limits control how much each service can consume. | Section 11.1 |
| Integer thresholds (quorum, retry limits) | DBSCAN (Density-Based Spatial Clustering of Applications with Noise) min_samples parameter | Both are integer thresholds that control sensitivity, but the purpose differs. A quorum prevents inconsistency by requiring agreement; min_samples controls density sensitivity by setting the minimum number of nearby points required to form a cluster core. The surface parallel is real but the mechanisms are distinct, so don’t over-read this one. | Section 14.1 |
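The if-else/decision-tree row can be made concrete with a one-level tree, a decision stump, learned from hypothetical toy data (learn_stump is an illustrative helper, not a library function):

```python
# Illustrative sketch: a one-level decision tree (a "stump") discovers the
# same kind of rule an engineer would hand-code, but learns the threshold
# from data by minimising misclassification.
# Hypothetical data: (request_latency_ms, is_timeout) pairs.

data = [(120, 0), (180, 0), (210, 0), (250, 0), (290, 0),
        (310, 1), (340, 1), (400, 1), (520, 1), (610, 1)]

def learn_stump(points):
    """Try a split between each pair of adjacent values; keep the best."""
    xs = sorted(x for x, _ in points)
    best_threshold, best_errors = None, len(points) + 1
    for a, b in zip(xs, xs[1:]):
        t = (a + b) / 2
        errors = sum((x > t) != bool(y) for x, y in points)
        if errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold

threshold = learn_stump(data)
print(f"learned rule: if latency > {threshold} then predict timeout")
# The hand-coded equivalent would have been: if latency > 300: ...
```

The structure is the nested conditional you would have written anyway; the difference is that the threshold came from optimisation over data, not from domain knowledge.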
B.5 Operations and monitoring
| SE concept | DS concept | How they connect | Ch. |
|---|---|---|---|
| SLOs and error budgets (p50, p95, p99 latencies) | Probability distributions and tail probabilities | You already consume distributional summaries: percentiles, tail probabilities, threshold-based SLOs. Data science asks you to build the distribution yourself, from raw data, for a system with no prior SLO. | Section 1.1 |
| Alert thresholds based on percentiles | Distributional assumptions for monitoring | Setting a threshold at p99 latency only works if you have the right distribution. Wrong assumption → wrong threshold → false alarms or missed incidents. | Section 2.1 |
| SLO compliance tracking (claimed vs observed) | Prediction interval coverage (nominal vs actual) | Both compare a claimed probability to what actually happens. If your 95% prediction interval only covers 88% of observations, the model is miscalibrated (its intervals are too narrow), like an SLO that claims 99.9% but delivers 99.5%. Miscalibration can go in either direction: intervals that are too wide waste capacity; intervals that are too narrow miss incidents. | Section 15.1 |
| Alert fatigue and false positive rates | Base rate neglect in Bayes’ theorem | A detector with 95% sensitivity and 5% false positive rate still generates mostly false alarms when the base rate is sufficiently low; with a 2% incident rate, for example, fewer than one in three alerts signals a real incident. This is why rare-event monitoring drowns in noise, and why Bayes’ theorem matters. | Section 4.1 |
| Data drift detection (KS tests on distributions) | Distribution monitoring | Monitor the distribution of incoming data, not individual observations. A shift in the input distribution signals that the model’s assumptions may no longer hold, the same instinct as monitoring for deployment regressions. | Section 18.1 |
| Monitoring aggregation windows | t-SNE (t-distributed Stochastic Neighbour Embedding) perplexity parameter | Both involve a scale parameter that trades off detail against structure. However, the mechanisms differ substantially: aggregation windows control temporal resolution in a straightforward, interpretable way, while t-SNE perplexity controls the effective neighbourhood size in a nonlinear projection whose output is not globally interpretable. Cluster positions and sizes are artefacts of the algorithm, not quantitative signals. This analogy has limits. | Section 13.1 |
| Metric autocorrelation (smoothing windows) | Autocorrelation in time series | Monitoring smoothing assumes that recent observations are correlated with the present. Autocorrelation formalises this: how strongly does today’s value predict tomorrow’s? | Section 15.1 |
| Sensitivity analysis in capacity planning | Partial dependence plots | Both vary one input while holding everything else fixed to understand its effect. A partial dependence plot is a sensitivity analysis for a model. | Section 21.1 |
| SLO accuracy (claimed vs measured percentiles) | Model calibration (predicted probabilities vs observed frequencies) | If your latency SLO claims the 99th percentile is 200ms but you measure 280ms, the claim is miscalibrated. A well-calibrated model is one whose 80% confidence intervals actually contain the true value 80% of the time. Engineers who build dashboards that report metrics will immediately understand why “the number should match reality.” | Section 10.1 |
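The arithmetic behind the alert-fatigue row is worth working through once. Using the figures quoted in that row (95% sensitivity, 5% false positive rate, 2% incident base rate):

```python
# The base-rate arithmetic from the alert-fatigue row, made explicit
# with Bayes' theorem.
sensitivity = 0.95   # P(alert | incident)
fpr = 0.05           # P(alert | no incident)
base_rate = 0.02     # P(incident)

# Total probability of an alert, from either a real incident or a false alarm
p_alert = sensitivity * base_rate + fpr * (1 - base_rate)

# Bayes' theorem: probability an alert reflects a real incident
p_incident_given_alert = sensitivity * base_rate / p_alert

print(f"P(alert) = {p_alert:.3f}")
print(f"P(incident | alert) = {p_incident_given_alert:.3f}")  # ~0.28
```

Fewer than one alert in three signals a real incident, despite the detector being 95% sensitive: the base rate, not the detector, dominates.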
B.6 Infrastructure and deployment
| SE concept | DS concept | How they connect | Ch. |
|---|---|---|---|
| Containers and reproducible builds | Environment pinning for reproducible analysis | Both solve the same problem: “it works on my machine.” Lock files, pinned dependencies, and containerisation prevent silent behavioural changes. The difference is that DS reproducibility also depends on random seeds and data snapshots. | Section 16.1 |
| Bitwise-reproducible builds | Statistical reproducibility (results within tolerance) | Software has layers of reproducibility: same binary, same behaviour, same test results. DS adds another: same conclusions, even if the 14th decimal place differs. np.allclose replaces assertEqual. | Section 16.1 |
| Fat JAR (bundle code + dependencies) | Serialised pipeline (bundle code + learned state) | Both package everything needed to run in production. But a model’s learned state (weights, splits, encoders) is more fragile than code, since it depends on the data it was trained on and can silently degrade. | Section 18.1 |
| Asynchronous vs synchronous processing | Batch vs real-time prediction | The same cost/latency/fault-tolerance trade-off applies. Batch prediction is cheaper and more fault-tolerant; real-time prediction has lower latency but higher operational complexity. | Section 18.1 |
| Dev/prod parity (12-factor app) | Train/serve skew prevention (skew arises when feature computation differs between training and serving) | Identical code paths for feature computation in training and serving prevent the same class of bugs as dev/prod parity. A feature that behaves differently at training time and serving time is a silent production failure. | Section 17.1 |
| Package registry (npm, PyPI) | Feature store (single source of truth for features) | Both provide canonical definitions that prevent scattered reimplementation. A feature store ensures every model computes “days since last purchase” the same way, just as a package registry ensures everyone uses the same library version. | Section 17.1 |
| SQL query optimisation | Lazy evaluation in Polars/Spark | Both let a query planner see the full computation before executing, enabling optimisations (predicate pushdown, projection pruning) that eager evaluation cannot. | Section 19.1 |
| ETL pipelines (extract, transform, load) | Feature engineering (transform raw data into model inputs) | Both are data transformation pipelines. Feature engineering adds the constraint that no information from the future can leak into historical records (data leakage) and that the same transformation must run identically at training time and serving time. | Section 17.1 |
| Git and package versioning | Model registry (MLflow, model cards) | Engineers understand versioning instinctively: track what changed, enable rollback, prevent stale artefacts from reaching production. A model registry serves the same purpose for learned artefacts, with the additional complexity that each model version also encodes the data it was trained on, a dimension that code versions do not have. | Section 18.1 |
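The tolerance-based equality mentioned in the reproducibility row is simple to demonstrate with the standard library: the same numbers summed in a different order produce different bits but the same conclusion (math.isclose is the scalar analogue of np.allclose).

```python
# Illustrative sketch: floating-point accumulation order changes the bits
# but not the conclusion, which is why DS reproducibility checks use
# tolerances instead of exact equality.
import math

values = [0.1] * 10

sequential = sum(values)                      # one running total
pairwise = sum(values[:5]) + sum(values[5:])  # different accumulation order

print(sequential == pairwise)                 # False: the bits differ
print(math.isclose(sequential, pairwise))     # True: same result within tolerance
```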
B.7 Trade-offs and decision-making
| SE concept | DS concept | How they connect | Ch. |
|---|---|---|---|
| Premature optimisation | Overfitting | Both result from investing too much complexity in solving the current problem at the expense of future performance. A model that perfectly fits the training data, like code optimised for today’s workload, breaks when conditions change. | Section 1.1 |
| Error budgets (tolerable failure rate) | Significance level \(\alpha\) (the tolerable false positive rate) | Both set a threshold for acceptable mistakes. An SLO that permits 0.1% errors is making the same kind of decision as a hypothesis test that permits a 5% false positive rate. The key difference is that \(\alpha\) is a pre-commitment, set before seeing the data and not adjustable afterwards, whereas error budgets are consumed over time and trigger policy changes. | Section 5.1 |
| Capacity planning (quadruple servers → 2× throughput) | Sample size (quadruple data → halve standard error) | Statistical precision scales with \(\sqrt{n}\): four times the data halves the standard error. The engineering analogy is illustrative but imperfect; server scaling is workload-dependent and can be near-linear for embarrassingly parallel tasks, while the \(\sqrt{n}\) relationship is universal. Both exhibit diminishing returns, but the rate differs. | Section 4.1 |
| Dependency pruning (remove unused packages) | Feature selection (remove uninformative features) | Both reduce complexity without losing capability. Lasso regularisation sets coefficients to exactly zero, the equivalent of removing an unused import. | Section 11.1 |
| SLO definition with edge cases | Target variable definition (what counts as “churned”?) | Both require explicit handling of ambiguity before building anything. If you can’t define the target precisely, the model, like the system, will optimise for the wrong thing. | Section 21.1 |
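The \(\sqrt{n}\) relationship from the sample-size row, sketched with an assumed (hypothetical) population standard deviation:

```python
# Illustrative sketch: quadrupling the sample size halves the standard error.
# Standard error of the mean = sigma / sqrt(n), so precision scales with sqrt(n).
import math

sigma = 50.0  # hypothetical population standard deviation
for n in (100, 400, 1600, 6400):
    se = sigma / math.sqrt(n)
    print(f"n = {n:5d}  standard error = {se:.2f}")
# Each 4x increase in data buys only a 2x improvement in precision.
```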
The most disorienting thing about data science is not the mathematical notation — it is discovering that the notation covers ideas you already know. Debugging is residual analysis. Load testing is bootstrapping. Alert fatigue is base rate neglect. The concepts are the same; the vocabulary is different. The real barrier to entry is the translation layer, not intelligence or mathematical ability.
B.8 Where the bridge breaks down
Not every SE concept has a DS counterpart, and forcing analogies where they don’t fit causes more confusion than it prevents. These are the places where engineering intuition reliably fires in the wrong direction.
Residual error is not a bug. In deterministic systems, any deviation from expected output means something is wrong. In statistical models, residual error (\(\varepsilon\)) is expected and irreducible. A model that perfectly predicts every training observation is almost certainly overfitting; it has memorised noise rather than learned signal. The instinct to eliminate all error must be actively suppressed.
More data has diminishing returns. Server scaling is workload-dependent and can be near-linear for embarrassingly parallel tasks. Statistical precision, by contrast, always scales with \(\sqrt{n}\): four times the data halves the uncertainty. After a certain point, collecting more data barely moves the needle. The correct response is not always “get more data” but often “get better data” or “use a better model.”
Deterministic reproducibility is insufficient. Two runs of the same code on the same data can produce different model weights (due to random initialisation, stochastic optimisation, or floating-point ordering). Statistical reproducibility means the conclusions are stable, not that every bit is identical. This is why DS tests use np.allclose with tolerances rather than exact equality.
There is no specification to test against. In SE, you write tests against a known specification. In DS, the “specification” is the unknown data-generating process. You can measure how well your model fits observed data, but you cannot verify it against ground truth, because ground truth is what the model is trying to approximate. This fundamental asymmetry means model evaluation is always indirect.
Correlation is not causation, and no amount of engineering will fix that. In systems engineering, if A consistently precedes B, you can trace the call chain and confirm causality. In observational data, correlation can arise from confounders, reverse causation, or coincidence. You need experimental design (A/B testing) to establish causation, not just observational analysis. This is the single most dangerous place for engineering intuition to mislead.
Optimisation landscapes are not convex. In software, if you fix all the bugs, you have a correct program. In DS, loss function landscapes can have local minima, saddle points, and flat regions where gradient descent stalls. “Fixed all the errors” does not mean you have found the best model. You may be at a local optimum. Engineering intuition says “keep optimising until done”; DS requires early stopping, learning rate schedules, and accepting good-enough.
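The local-minimum trap is easy to reproduce. A minimal sketch on an arbitrary non-convex function (chosen purely for illustration): gradient descent from two different starting points settles into two different minima, and only one is global.

```python
# Illustrative sketch: gradient descent on a non-convex function can stall
# at a local minimum. f(x) = x^4 - 3x^2 + x has two minima; which one you
# reach depends entirely on where you start.
def grad(x):
    """Derivative of f(x) = x^4 - 3x^2 + x."""
    return 4 * x**3 - 6 * x + 1

def descend(x, lr=0.01, steps=10_000):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

print(round(descend(+2.0), 3))  # stalls at the local minimum near x ≈ 1.13
print(round(descend(-2.0), 3))  # reaches the global minimum near x ≈ -1.30
```

Nothing about the descent from x = 2 signals that a better minimum exists elsewhere; the gradient is zero at both endpoints, which is exactly why “stopped improving” does not mean “best model”.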
The cost of being wrong is asymmetric and context-dependent. In SE, a wrong output is usually wrong in a uniform sense: a 500 error is a 500 error. In DS, a false negative and a false positive can have wildly different costs: a missed fraud case versus a blocked legitimate transaction, a missed cancer diagnosis versus an unnecessary biopsy. The precision/recall trade-off appears in the tables above, but the deeper point is that the cost matrix is problem-specific and must be defined before building anything. Engineering intuition treats all errors as equally bad; data science requires you to decide which errors matter more.