You’ve built search. Users type a query, your system returns results ranked by relevance, and the interaction ends. But what happens when the user opens your app and doesn’t type anything? The blank landing page is the hardest screen in product engineering — you need to show something useful before the user tells you what they want. Every time Netflix surfaces a row of films, Spotify generates a playlist, or an e-commerce site shows “customers also bought,” a recommendation system is making predictions about human preferences from incomplete data.
This is a different kind of prediction from what we’ve seen so far. Churn prediction (Section 21.1) asked a binary question — will this customer leave? Regression (Section 9.1) predicted a continuous value. Recommendation systems predict a matrix of preferences — every user’s opinion of every item — from a dataset where the vast majority of entries are missing. A platform with 100,000 users and 10,000 items has a billion possible user-item pairs, but a typical user might interact with a few hundred items. Even in that generous scenario, well over 95% of the matrix is empty, and real catalogues are usually sparser still.
The engineering challenge is familiar: you’re building a system that must serve personalised results at low latency for millions of users. The data science challenge is less familiar: you’re estimating preferences from sparse, noisy, implicit signals — and the thing you’re predicting (human taste) is subjective, context-dependent, and constantly shifting.
22.2 The interaction matrix
The foundation of most recommendation systems is the interaction matrix (also called the user-item matrix or ratings matrix). Each row is a user, each column is an item, and each entry records an interaction — a rating, a purchase, a click, or simply whether the user engaged with the item at all.
```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_users = 500
n_items = 200
n_interactions = 8000   # ~8% density — sparse but realistic

# Simulate user and item latent factors (hidden preferences and attributes)
n_factors = 5
user_factors = rng.normal(0, 1, (n_users, n_factors))
item_factors = rng.normal(0, 1, (n_items, n_factors))

# True preferences are dot products of latent factors plus noise
true_preferences = user_factors @ item_factors.T

# Sample which user-item pairs are observed (not random — biased towards
# items the user would prefer, mimicking real browsing behaviour)
observed_pairs = set()
while len(observed_pairs) < n_interactions:
    u = rng.integers(0, n_users)
    # Higher-preference items are more likely to be observed
    probs = np.exp(true_preferences[u]) / np.exp(true_preferences[u]).sum()
    i = rng.choice(n_items, p=probs)
    observed_pairs.add((u, i))

# Generate ratings (1-5) from true preferences + noise
rows = []
for u, i in observed_pairs:
    rating = np.clip(np.round(true_preferences[u, i] + rng.normal(0, 0.5) + 3), 1, 5)
    rows.append({'user_id': u, 'item_id': i, 'rating': int(rating)})

ratings_df = pd.DataFrame(rows)

print(f"Users: {n_users:,}")
print(f"Items: {n_items:,}")
print(f"Interactions: {len(ratings_df):,}")
print(f"Matrix density: {len(ratings_df) / (n_users * n_items):.1%}")
print("Rating distribution:")
print(ratings_df['rating'].value_counts().sort_index().to_string())
```
The 8% density we’ve simulated is generous — real-world systems often have densities below 1%. This sparsity is the central technical challenge. Most machine learning algorithms assume you have complete feature vectors for each observation. Here, the “feature vector” is the user’s preference for every item, and almost all of it is missing.
Engineering Bridge
The interaction matrix is the adjacency matrix of a bipartite graph — users and items are two distinct node types, and edges only run between them (users rate items, never other users). The matrix is rectangular, not square, which distinguishes it from the symmetric adjacency matrices in general graph problems. But the storage patterns are identical: you wouldn’t materialise a billion-cell dense matrix when only a fraction of edges exist. Recommendation systems work with sparse representations — coordinate lists, compressed sparse row (CSR) formats, or simply DataFrames of (user, item, value) triples. If you’ve worked with sparse data structures in graph processing or network analysis, the indexing and traversal patterns will be familiar.
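To make the storage point concrete, here is a minimal sketch at our demo scale, comparing a dense float32 matrix against the COO and CSR forms mentioned above (the names `coo` and `csr` are illustrative; at real-world sparsity the gap is far larger than it is here):

```python
from scipy.sparse import coo_matrix

# Coordinate (COO) form: three parallel arrays of (row, col, value) triples
coo = coo_matrix(
    (ratings_df['rating'].values.astype(np.float32),
     (ratings_df['user_id'].values, ratings_df['item_id'].values)),
    shape=(n_users, n_items)
)
csr = coo.tocsr()   # compressed sparse row: fast row slicing for per-user lookups

dense_bytes = n_users * n_items * 4   # hypothetical dense float32 matrix
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes

print(f"Dense matrix: {dense_bytes / 1024:.0f} KB")
print(f"CSR matrix:   {sparse_bytes / 1024:.0f} KB")
```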
22.3 Collaborative filtering: users who liked this also liked that
Collaborative filtering is the idea that users who agreed in the past will agree in the future. If you and I both rated the same ten films highly, and you loved a film I haven’t seen, there’s a reasonable chance I’d like it too. The algorithm doesn’t need to know anything about what the films are — it works entirely from the pattern of interactions.
There are two flavours. User-based collaborative filtering finds users similar to you and recommends items they liked. Item-based collaborative filtering finds items similar to ones you’ve liked and recommends those. The mechanics are nearly identical; they differ only in which axis of the interaction matrix you compute similarity over.
Let’s implement item-based collaborative filtering. For each pair of items, we compute the cosine similarity between their columns in the interaction matrix — treating each item as a vector of user ratings. Items that receive similar ratings from the same users will have high similarity.
```python
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Build sparse interaction matrix
interaction_matrix = csr_matrix(
    (ratings_df['rating'].values,
     (ratings_df['user_id'].values, ratings_df['item_id'].values)),
    shape=(n_users, n_items)
)

# Compute item-item similarity (cosine similarity between item columns)
item_similarity = cosine_similarity(interaction_matrix.T)
np.fill_diagonal(item_similarity, 0)   # don't recommend an item to itself

def recommend_items_cf(user_id, interaction_matrix, item_similarity, n=5):
    """Recommend items using item-based collaborative filtering."""
    user_ratings = interaction_matrix[user_id].toarray().flatten()
    rated_mask = user_ratings > 0

    # Predicted score: similarity-weighted average of the user's known ratings
    scores = item_similarity @ user_ratings

    # Normalise by the sum of absolute similarities to rated items
    sim_sums = np.abs(item_similarity[:, rated_mask]).sum(axis=1)
    sim_sums[sim_sums == 0] = 1   # avoid division by zero
    scores = scores / sim_sums

    # Exclude already-rated items
    scores[rated_mask] = -np.inf
    top_items = np.argsort(scores)[::-1][:n]
    return top_items, scores[top_items]

# Example: recommendations for user 0
rec_items, rec_scores = recommend_items_cf(0, interaction_matrix, item_similarity)
user_history = ratings_df[ratings_df['user_id'] == 0].sort_values('rating', ascending=False)

print(f"User 0 has rated {len(user_history)} items")
print(f"Top-rated items: {user_history.head(3)[['item_id', 'rating']].to_string(index=False)}")
print(f"\nRecommended items: {rec_items}")
print(f"Predicted scores: {np.round(rec_scores, 2)}")
```
The approach is intuitive and requires no training — you compute similarities once and generate recommendations on the fly. But it has significant limitations. Similarity computation is \(O(n^2)\) in the number of items (or users), which doesn’t scale well. More fundamentally, it can only recommend items that share rating patterns with items you’ve already rated — it struggles with the long tail of items that have few ratings.
22.4 Matrix factorisation: compressing preferences

Matrix factorisation takes a different approach. Instead of computing pairwise similarities, it decomposes the interaction matrix into two smaller matrices — one representing users and one representing items — in a shared low-dimensional latent space. If PCA (Section 13.3) finds the directions of maximum variance in a feature matrix, matrix factorisation finds the latent dimensions that best explain the observed ratings.
The idea is that a user’s rating of an item can be approximated as the dot product of a user vector and an item vector:

$$\hat{r}_{ui} = \mathbf{p}_u \cdot \mathbf{q}_i = \sum_{k=1}^{K} p_{uk} \, q_{ik}$$
where \(\mathbf{p}_u\) is a \(K\)-dimensional vector representing user \(u\)’s preferences, \(\mathbf{q}_i\) is a \(K\)-dimensional vector representing item \(i\)’s attributes, and \(K\) is the number of latent factors (typically 10–100). Each factor might correspond to something interpretable — one dimension might capture “action versus drama,” another “mainstream versus indie” — but the algorithm discovers these dimensions automatically from the data. This is the same principle behind the latent components in PCA and the latent topics in text analysis: compress high-dimensional data by finding its underlying structure.
We find \(\mathbf{P}\) and \(\mathbf{Q}\) by minimising the reconstruction error on observed ratings, with regularisation to prevent overfitting:

$$\min_{\mathbf{P}, \mathbf{Q}} \sum_{(u, i) \in \text{observed}} \left( r_{ui} - \mathbf{p}_u \cdot \mathbf{q}_i \right)^2 + \lambda \sum_u \|\mathbf{p}_u\|^2 + \lambda \sum_i \|\mathbf{q}_i\|^2$$
In words: minimise the squared error on observed ratings, plus an L2 penalty on every user and item vector. The regularisation term \(\lambda\) plays exactly the same role as in ridge regression (Section 11.4) — it penalises large parameter values to keep the model from fitting noise in the sparse observed data. A common extension adds explicit bias terms — a global mean \(\mu\), a per-user bias \(b_u\) (some users rate everything higher), and a per-item bias \(b_i\) (some items are systematically better-received) — so the prediction becomes \(\hat{r}_{ui} = \mu + b_u + b_i + \mathbf{p}_u \cdot \mathbf{q}_i\). Our simplified formulation omits these for clarity, but in production they often capture a large fraction of the variance.
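To make the optimisation concrete, here is a minimal sketch of the stochastic gradient descent approach with bias terms, before we turn to the chapter's SVD-based demonstration below. The learning rate, regularisation strength, epoch count, and the separate generator `rng_sgd` are illustrative choices, not values from the text:

```python
# A sketch of SGD on the regularised objective above, with bias terms.
# Hyperparameters are illustrative; a separate RNG avoids disturbing the
# shared `rng` stream used elsewhere in the chapter.
rng_sgd = np.random.default_rng(0)
K, lr, lam, n_epochs = 20, 0.01, 0.1, 10

P = rng_sgd.normal(0, 0.1, (n_users, K))   # user factors
Q = rng_sgd.normal(0, 0.1, (n_items, K))   # item factors
b_u = np.zeros(n_users)                    # per-user bias
b_i = np.zeros(n_items)                    # per-item bias
mu = ratings_df['rating'].mean()           # global mean

triples = ratings_df[['user_id', 'item_id', 'rating']].to_numpy()

for epoch in range(n_epochs):
    rng_sgd.shuffle(triples)
    for u, i, r in triples:
        u, i = int(u), int(i)
        err = r - (mu + b_u[u] + b_i[i] + P[u] @ Q[i])
        # Gradient steps on squared error plus the L2 penalty
        b_u[u] += lr * (err - lam * b_u[u])
        b_i[i] += lr * (err - lam * b_i[i])
        P[u], Q[i] = (P[u] + lr * (err * Q[i] - lam * P[u]),
                      Q[i] + lr * (err * P[u] - lam * Q[i]))

# Training error on the observed ratings
preds = (mu + b_u[triples[:, 0]] + b_i[triples[:, 1]]
         + np.sum(P[triples[:, 0]] * Q[triples[:, 1]], axis=1))
print(f"Training RMSE after SGD: {np.sqrt(np.mean((triples[:, 2] - preds) ** 2)):.2f}")
```

Unlike the SVD used next, this loop touches only observed ratings, which is why SGD (or ALS) is the standard choice in production.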
```python
from sklearn.decomposition import TruncatedSVD

# For demonstration, we'll use truncated SVD on the interaction matrix.
# SVD treats missing entries as zeros — implicitly telling the model that
# every unrated item has a rating of 0, far below the 1-5 observed scale.
# This depresses latent item factors and biases reconstruction. In
# production, you'd use alternating least squares (ALS) or stochastic
# gradient descent (SGD), which optimise only over observed entries and
# avoid this zero-imputation problem. Mean-centering each user's ratings
# before fitting can partially mitigate the bias. SVD is illustrative
# and available in scikit-learn, so we use it here.
n_components = 20
svd = TruncatedSVD(n_components=n_components, random_state=42)

# fit_transform returns U @ Sigma — singular values are already absorbed
user_latent = svd.fit_transform(interaction_matrix)   # (n_users, n_components)
# components_ is V^T, so .T gives V
item_latent = svd.components_.T                       # (n_items, n_components)

print(f"User latent matrix: {user_latent.shape}")
print(f"Item latent matrix: {item_latent.shape}")
print(f"Explained variance ratio: {svd.explained_variance_ratio_.sum():.1%}")

# Reconstruct ratings for user 0: (U @ Sigma) @ V^T
predicted_ratings = user_latent[0] @ item_latent.T
actual_ratings = interaction_matrix[0].toarray().flatten()
rated_items = np.where(actual_ratings > 0)[0]

comparison = pd.DataFrame({
    'item_id': rated_items,
    'actual': actual_ratings[rated_items],
    'predicted': np.round(predicted_ratings[rated_items], 1)
}).head(10)

print(f"\nUser 0 — actual vs predicted ratings:")
print(comparison.to_string(index=False))
```
```python
import matplotlib.pyplot as plt

# Compute average rating per item, excluding unrated entries (zeros)
dense = interaction_matrix.toarray()
masked = np.ma.masked_equal(dense, 0)
avg_ratings = masked.mean(axis=0).data

# Use the first two latent dimensions for visualisation
fig, ax = plt.subplots(figsize=(10, 5))
sc = ax.scatter(
    item_latent[:, 0], item_latent[:, 1],
    c=avg_ratings, cmap='viridis', alpha=0.8, s=40, edgecolors='none'
)
cbar = plt.colorbar(sc, ax=ax)
cbar.set_label('Average rating')
ax.set_xlabel('Latent dimension 1')
ax.set_ylabel('Latent dimension 2')
ax.set_title('Items in latent space: similar items cluster together')
ax.spines[['top', 'right']].set_visible(False)
plt.tight_layout()
plt.show()
```
Figure 22.1: Items projected into a two-dimensional latent space from the matrix factorisation. Clusters of nearby items share similar rating patterns — users who like one item in a cluster tend to like others nearby.
The latent space gives us something collaborative filtering alone doesn’t: a compact representation of why users and items are similar. Two items close in latent space share underlying attributes that drive similar ratings, even if few users have rated both. This is how matrix factorisation addresses sparsity — it generalises across the gaps by learning structure rather than relying on direct co-occurrence.
Author’s Note
Matrix factorisation clicked for me when I stopped thinking of it as a prediction algorithm and started thinking of it as compression. In our simulation, the interaction matrix has 100,000 cells (500 users × 200 items) but only 8,000 are filled in. Yet 20 latent dimensions can reconstruct those 8,000 entries reasonably well — which means the underlying structure of human preferences is far lower-dimensional than the raw matrix suggests. It’s the same insight behind dimensionality reduction in Section 13.3, but applied to a matrix that’s mostly empty. The compression is the model: by forcing the data through a bottleneck, you learn what matters and discard what’s noise.
22.5 Content-based filtering: using item features
Collaborative filtering works entirely from interaction patterns — it doesn’t need to know what items are. Content-based filtering takes the opposite approach: it recommends items similar to what the user has liked, based on item attributes.
If a user rated several Python programming books highly, a content-based system would recommend other Python books. It examines item features (genre, author, price range, description keywords) and builds a profile of each user’s preferences over those features.
```python
# Simulate item features: genre, price range, and quality indicators
n_genres = 8
item_genre_probs = rng.dirichlet(np.ones(n_genres) * 0.5, n_items)
item_price = rng.uniform(5, 100, n_items)
item_popularity = rng.exponential(1, n_items)

# Feature matrix: genre distribution + normalised price + popularity
item_content = np.column_stack([
    item_genre_probs,
    (item_price - item_price.mean()) / item_price.std(),
    (item_popularity - item_popularity.mean()) / item_popularity.std(),
])

def recommend_content_based(user_id, interaction_matrix, item_content, n=5):
    """Build a user profile from rated items and find similar unrated items."""
    user_ratings = interaction_matrix[user_id].toarray().flatten()
    rated_mask = user_ratings > 0
    if rated_mask.sum() == 0:
        return np.array([]), np.array([])

    # User profile: weighted average of item features, weighted by rating
    weights = user_ratings[rated_mask]
    user_profile = (item_content[rated_mask].T @ weights) / weights.sum()

    # Score unrated items by cosine similarity to user profile
    scores = cosine_similarity(user_profile.reshape(1, -1), item_content).flatten()
    scores[rated_mask] = -np.inf
    top_items = np.argsort(scores)[::-1][:n]
    return top_items, scores[top_items]

rec_items_cb, rec_scores_cb = recommend_content_based(0, interaction_matrix, item_content)
print(f"Content-based recommendations for user 0: {rec_items_cb}")
print(f"Similarity scores: {np.round(rec_scores_cb, 3)}")
```
Content-based recommendations for user 0: [ 89 199 1 165 187]
Similarity scores: [0.898 0.895 0.872 0.865 0.86 ]
The scores reflect how closely each recommended item’s features match the user’s profile. Notice that content-based scores are bounded between -1 and 1 (cosine similarity), unlike the unbounded predicted ratings from collaborative filtering — the two approaches operate on different scales, which is why the hybrid system below normalises both before combining them.
Content-based filtering has a distinct advantage: it can recommend items with no interaction history at all, as long as you have their features. A new book added to the catalogue can be recommended immediately based on its genre and description — no ratings required. We’ll see in Section 22.7 why this matters so much.
But content-based filtering has its own limitation: it can only recommend items similar to what the user has already seen. It doesn’t discover surprising connections — the fact that fans of obscure jazz also tend to enjoy nature documentaries, for instance. Collaborative filtering captures these cross-category patterns because it works from behaviour, not metadata.
22.6 Hybrid systems: combining signals
Before building a hybrid, it helps to see where each approach excels and where it falls short:
|  | Collaborative filtering | Content-based filtering |
| --- | --- | --- |
| Data needed | User-item interactions | Item features (metadata) |
| New items | Cannot recommend (no interactions) | Can recommend immediately |
| New users | Cannot personalise (no history) | Cannot personalise (no history) |
| Cross-category discovery | Strong — finds surprising patterns | Weak — only recommends similar items |
| Scalability | \(O(n^2)\) similarity; \(O(K)\) with MF | \(O(\text{items} \times \text{features})\) |
In practice, production recommendation systems combine multiple approaches. A hybrid system uses collaborative filtering as its primary signal, content-based features to handle cold-start items, and popularity as a fallback for completely new users.
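The blending itself is straightforward: compute both score vectors, rescale each to [0, 1] so they are comparable, and take a weighted combination controlled by \(\alpha\):

```python
def recommend_hybrid(user_id, interaction_matrix, item_similarity, item_content,
                     alpha=0.6, n=5):
    """
    Combine collaborative and content-based scores.
    alpha controls the weight: 1.0 = pure collaborative, 0.0 = pure content.
    """
    user_ratings = interaction_matrix[user_id].toarray().flatten()
    rated_mask = user_ratings > 0

    # Collaborative filtering scores
    cf_scores = item_similarity @ user_ratings
    sim_sums = np.abs(item_similarity[:, rated_mask]).sum(axis=1)
    sim_sums[sim_sums == 0] = 1
    cf_scores = cf_scores / sim_sums

    # Content-based scores
    if rated_mask.sum() > 0:
        weights = user_ratings[rated_mask]
        user_profile = (item_content[rated_mask].T @ weights) / weights.sum()
        cb_scores = cosine_similarity(
            user_profile.reshape(1, -1), item_content
        ).flatten()
    else:
        cb_scores = np.zeros(n_items)

    # Normalise both to [0, 1] for fair combination
    def min_max_scale(x):
        finite = x[np.isfinite(x)]
        if finite.max() == finite.min():
            return np.zeros_like(x)
        return (x - finite.min()) / (finite.max() - finite.min())

    cf_norm = min_max_scale(cf_scores)
    cb_norm = min_max_scale(cb_scores)
    combined = alpha * cf_norm + (1 - alpha) * cb_norm

    combined[rated_mask] = -np.inf
    top_items = np.argsort(combined)[::-1][:n]
    return top_items, combined[top_items]

rec_hybrid, scores_hybrid = recommend_hybrid(0, interaction_matrix,
                                             item_similarity, item_content)
print(f"Hybrid recommendations for user 0: {rec_hybrid}")
print(f"Combined scores: {np.round(scores_hybrid, 3)}")
```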
Hybrid recommendations for user 0: [ 67 141 1 56 113]
Combined scores: [0.986 0.979 0.97 0.965 0.96 ]
The mixing weight \(\alpha\) is a hyperparameter you’d tune empirically — and it might vary per user. A user with hundreds of ratings provides strong collaborative signal; a new user with three ratings benefits more from content-based features.
Engineering Bridge
A hybrid recommendation system follows the fan-out and merge pattern common in microservice architectures. Each recommendation strategy is an independent scorer — the request fans out to collaborative, content-based, and popularity services in parallel, then a blending layer merges the results, applies business rules (exclude out-of-stock items, enforce diversity), and returns the final ranked list. This separation of concerns — scoring, blending, filtering — maps cleanly to how engineers design composable systems. The key engineering decisions are familiar: what happens when the collaborative scorer is slow or unavailable? You apply the circuit breaker pattern — fall back to the content-based or popularity scores, serve a degraded but still useful response, and alert on the failure. Caching intermediate results and A/B testing different blending strategies are the same problems you’d solve in any aggregation layer.
22.7 The cold-start problem
The cold-start problem has two faces. New users have no interaction history, so collaborative filtering has nothing to work with. New items have no ratings, so they can’t appear in collaborative recommendations. Both are common in production — any growing platform adds users and items continuously.
Strategies for cold-starting users include asking for explicit preferences during onboarding (the “rate five items to get started” pattern), using demographic or contextual information (location, device, referral source), and falling back to popularity-based recommendations that gradually transition to personalised ones as interactions accumulate.
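One simple way to implement that gradual hand-off is a blending weight that grows with the user's interaction count. This is a minimal sketch; the saturation constant `n_0` is an illustrative choice, not a value from the text:

```python
def blend_weight(n_interactions, n_0=20):
    """Personalisation weight rising from 0 towards 1 as history accumulates.

    n_0 is an illustrative saturation constant: at n_0 interactions the blend
    is half personalised, half popularity-based.
    """
    return n_interactions / (n_interactions + n_0)

for n in [0, 3, 10, 50, 200]:
    w = blend_weight(n)
    print(f"{n:3d} interactions -> {w:.2f} personalised, {1 - w:.2f} popularity")
```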
For new items, content-based features bridge the gap. An item’s metadata — genre, price, description, creator — provides enough signal to place it in the latent space alongside similar items that do have ratings. This is sometimes called side information injection, and it’s one of the strongest arguments for hybrid approaches.
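A simple version of that idea, sketched under the assumption that content-similar items have similar latent factors: place a new, unrated item in the latent space by averaging the factors of its nearest neighbours in content-feature space. Production systems often learn an explicit mapping from content features to latent factors instead; the function and example item below are hypothetical.

```python
def estimate_latent_for_new_item(new_item_features, item_content, item_latent, k=10):
    """Approximate a latent vector for an unrated item from its content features.

    Finds the k most content-similar existing items and averages their latent
    factors, weighted by similarity. A sketch, not the only approach.
    """
    sims = cosine_similarity(new_item_features.reshape(1, -1), item_content).flatten()
    neighbours = np.argsort(sims)[::-1][:k]
    weights = np.clip(sims[neighbours], 0, None)
    if weights.sum() == 0:
        return item_latent[neighbours].mean(axis=0)
    return (item_latent[neighbours].T @ weights) / weights.sum()

# Hypothetical new item whose content features happen to match item 0's
new_features = item_content[0].copy()
new_latent = estimate_latent_for_new_item(new_features, item_content, item_latent)
print(f"Estimated latent vector shape: {new_latent.shape}")
```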
```python
# For a brand-new user with no interaction history, collaborative filtering
# has nothing to work with. The fallback: recommend the most popular and
# highest-rated items across all users.
item_stats = ratings_df.groupby('item_id').agg(
    n_ratings=('rating', 'count'),
    avg_rating=('rating', 'mean')
).reset_index()

# Weighted score: balance popularity and quality
item_stats['pop_score'] = (
    0.5 * (item_stats['n_ratings'] / item_stats['n_ratings'].max()) +
    0.5 * (item_stats['avg_rating'] / 5)
)

top_popular = item_stats.nlargest(5, 'pop_score')
print("Cold-start fallback — most popular items:")
print(top_popular[['item_id', 'n_ratings', 'avg_rating', 'pop_score']]
      .round(2).to_string(index=False))
```
22.8 Evaluation: measuring what matters

Evaluating a recommendation system is harder than evaluating a classifier or regressor. The standard machine learning metrics don’t capture what makes recommendations useful. A system that recommends the five most popular items to everyone might have decent accuracy but provides no personalisation.
For rating prediction, we can measure reconstruction error using root mean squared error (RMSE) — how closely the model’s predicted ratings match held-out ratings. But RMSE optimises for accuracy on items users happened to rate, not for whether unrated items would have been worth recommending. Most real systems care about ranking: does the system surface items the user would engage with, and in a useful order?
```python
from sklearn.model_selection import train_test_split

# Hold out 20% of interactions for testing. This is a random row-level split,
# so some users appear in both train and test. In production, you'd use a
# per-user temporal split (hold out each user's most recent interactions) to
# avoid information leakage.
train_df, test_df = train_test_split(ratings_df, test_size=0.2, random_state=42)

# Rebuild interaction matrix from training data only
train_matrix = csr_matrix(
    (train_df['rating'].values,
     (train_df['user_id'].values, train_df['item_id'].values)),
    shape=(n_users, n_items)
)

# Recompute item similarity on training data
train_item_sim = cosine_similarity(train_matrix.T)
np.fill_diagonal(train_item_sim, 0)

# Evaluate: precision@k and recall@k
def evaluate_precision_recall_at_k(test_df, train_matrix, item_sim, k=5, threshold=4):
    """
    For each user, recommend k items and check how many are in the test set
    with rating >= threshold.
    """
    precisions, recalls = [], []
    for user_id in test_df['user_id'].unique():
        # Relevant items: test items the user rated highly
        user_test = test_df[
            (test_df['user_id'] == user_id) & (test_df['rating'] >= threshold)
        ]
        if len(user_test) == 0:
            continue
        relevant = set(user_test['item_id'].values)

        # Generate recommendations from training data
        rec_items, _ = recommend_items_cf(user_id, train_matrix, item_sim, n=k)
        recommended = set(rec_items)

        hits = len(relevant & recommended)
        precisions.append(hits / k)
        recalls.append(hits / len(relevant) if len(relevant) > 0 else 0)
    return np.mean(precisions), np.mean(recalls)

for k in [5, 10, 20]:
    prec, rec = evaluate_precision_recall_at_k(
        test_df, train_matrix, train_item_sim, k=k
    )
    print(f"k={k:2d}  precision@k = {prec:.3f}  recall@k = {rec:.3f}")
```
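The same held-out split also lets us compute the RMSE measure mentioned above. A minimal sketch, refitting the truncated SVD on training data only and comparing against a global-mean baseline; remember the zero-imputation caveat from Section 22.4, so treat the absolute number with care:

```python
# RMSE on held-out ratings: refit the factorisation on training data only.
svd_train = TruncatedSVD(n_components=20, random_state=42)
user_latent_tr = svd_train.fit_transform(train_matrix)
item_latent_tr = svd_train.components_.T

test_users = test_df['user_id'].to_numpy()
test_items = test_df['item_id'].to_numpy()
test_ratings = test_df['rating'].to_numpy()

# Row-wise dot products between each test user's and item's latent vectors
preds = np.sum(user_latent_tr[test_users] * item_latent_tr[test_items], axis=1)
rmse = np.sqrt(np.mean((test_ratings - preds) ** 2))

# Baseline: predict the global training mean for every held-out rating
baseline_rmse = np.sqrt(np.mean((test_ratings - train_df['rating'].mean()) ** 2))

print(f"SVD RMSE:      {rmse:.2f}")
print(f"Baseline RMSE: {baseline_rmse:.2f}")
```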
Precision@k asks: of the \(k\) items recommended, how many were relevant (the fraction of true positives among recommendations)? Recall@k asks: of all relevant items, how many appeared in the top \(k\)? These are the same precision and recall concepts from Section 10.8, adapted for ranked lists rather than binary classification. A third common metric, NDCG (normalised discounted cumulative gain), also accounts for the position of relevant items — a relevant item at position 1 matters more than one at position 10.
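A minimal NDCG@k sketch, treating the held-out highly rated items as a binary relevant set; a graded version would use the actual ratings as gains:

```python
def ndcg_at_k(recommended, relevant, k=10):
    """Binary-gain NDCG@k: each relevant item earns gain 1, discounted by log2 of rank."""
    recommended = list(recommended)[:k]
    dcg = sum(1.0 / np.log2(rank + 2)       # rank 0 -> discount log2(2) = 1
              for rank, item in enumerate(recommended) if item in relevant)
    ideal_hits = min(len(relevant), k)      # best case: all relevant items at the top
    idcg = sum(1.0 / np.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Example for one user, reusing the held-out split from above
user_id = test_df['user_id'].iloc[0]
relevant = set(test_df[(test_df['user_id'] == user_id)
                       & (test_df['rating'] >= 4)]['item_id'])
rec_items_k, _ = recommend_items_cf(user_id, train_matrix, train_item_sim, n=10)
print(f"NDCG@10 for user {user_id}: {ndcg_at_k(rec_items_k, relevant, k=10):.3f}")
```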
```python
ks = list(range(1, 31))
precs, recs = [], []
for k in ks:
    p, r = evaluate_precision_recall_at_k(
        test_df, train_matrix, train_item_sim, k=k
    )
    precs.append(p)
    recs.append(r)

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ks, precs, color='steelblue', linewidth=2, linestyle='-',
        marker='o', markevery=5, label='Precision@k')
ax.plot(ks, recs, color='coral', linewidth=2, linestyle='--',
        marker='s', markevery=5, label='Recall@k')

# Mark the crossover point
cross_idx = np.argmin(np.abs(np.array(precs) - np.array(recs)))
ax.axvline(ks[cross_idx], color='#aaaaaa', linestyle=':', linewidth=1.2, alpha=0.7)
ax.annotate(f'Crossover ≈ k={ks[cross_idx]}',
            xy=(ks[cross_idx], precs[cross_idx]),
            xytext=(15, 10), textcoords='offset points',
            fontsize=10, color='#555555')

ax.set_xlabel('k (number of recommendations)')
ax.set_ylabel('Proportion')
ax.set_ylim(0, 1)
ax.set_title('Precision-recall trade-off in recommendations')
ax.legend(loc='center right')
ax.spines[['top', 'right']].set_visible(False)
plt.tight_layout()
plt.show()
```
Figure 22.2: Precision@k and Recall@k as a function of k. As k increases, recall improves (more relevant items are surfaced) but precision declines (more irrelevant items dilute the list). The dashed vertical line marks the crossover point.
But even these metrics miss something important. A recommendation system that only surfaces safe, popular choices might score well on precision but feel generic and unhelpful. Good recommendations balance relevance (items the user would rate highly), novelty (items the user hasn’t already discovered), and diversity (variety across the recommended list). These qualities are harder to measure offline and often require online A/B testing (Section 7.1) — showing real users different recommendation strategies and measuring engagement.
Offline evaluation has a deeper limitation: the test set only contains items users chose to interact with. Users who were already guided by a recommendation system had their choices shaped by it — a form of exposure bias (technically, the missing data is not missing at random). High precision on held-out ratings tells you the model is good at predicting preferences for items users were already inclined to discover; it tells you nothing about whether the system would surface items the user didn’t know they wanted. This is why online experimentation remains essential.
There is also a systemic concern worth flagging: recommendation systems create feedback loops. Popular items get recommended more often, which generates more interactions, which makes them appear even more popular. Over time, this concentrates attention on a shrinking set of items while the long tail becomes invisible. Strategies like reserving exploration slots for new or underrepresented items, applying inverse propensity weighting, and enforcing diversity constraints can counteract this — but you need to measure the effect explicitly (catalogue coverage, Gini coefficient of item exposure) rather than assuming the feedback loop is benign.
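Both of those exposure measures are easy to compute offline from the recommendation lists themselves. A minimal sketch using the item-based recommender and the training split from above (the helper names are illustrative):

```python
from collections import Counter

def exposure_stats(item_counts, n_items):
    """Catalogue coverage and Gini coefficient of item exposure."""
    counts = np.zeros(n_items)
    for item, c in item_counts.items():
        counts[item] = c
    coverage = (counts > 0).mean()
    # Gini: 0 = perfectly even exposure, 1 = all exposure on one item
    sorted_c = np.sort(counts)
    n = len(sorted_c)
    cum = np.cumsum(sorted_c)
    gini = (n + 1 - 2 * (cum / cum[-1]).sum()) / n if cum[-1] > 0 else 0.0
    return coverage, gini

# Tally how often each item appears in top-5 lists across all users
exposure = Counter()
for user_id in range(n_users):
    recs, _ = recommend_items_cf(user_id, train_matrix, train_item_sim, n=5)
    exposure.update(recs.tolist())

coverage, gini = exposure_stats(exposure, n_items)
print(f"Catalogue coverage: {coverage:.1%}")
print(f"Gini coefficient of exposure: {gini:.2f}")
```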
22.9 Implicit feedback: when you don’t have ratings
We’ve been working with explicit ratings — users telling us how much they liked something on a scale of 1 to 5. Most real-world systems don’t have this luxury. Instead, they observe implicit feedback: clicks, purchases, time spent, pages scrolled, items added to a basket and later abandoned. These signals are abundant but ambiguous. A user clicking on an item might mean genuine interest, an accidental tap, or morbid curiosity. Not clicking doesn’t mean dislike — the user may simply not have seen it.
Explicit feedback has both positive and negative signals — ratings of 1 and 5 are both informative. Implicit feedback is one-class: you observe what users did but not what they deliberately chose to avoid. This changes the modelling approach. Instead of predicting a rating, you predict a preference probability and weight it by a confidence derived from the interaction strength [1].
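Rewritten in this chapter's notation, the objective from [1] becomes a confidence-weighted squared loss over every user-item pair, not just the observed ones:

$$\min_{\mathbf{P}, \mathbf{Q}} \sum_{u, i} c_{ui} \left( \pi_{ui} - \mathbf{p}_u \cdot \mathbf{q}_i \right)^2 + \lambda \left( \sum_u \|\mathbf{p}_u\|^2 + \sum_i \|\mathbf{q}_i\|^2 \right)$$

where \(\pi_{ui} = 1\) if the interaction was observed and 0 otherwise, and \(c_{ui} = 1 + \alpha_c \, r_{ui}\) grows with the raw interaction strength \(r_{ui}\) (views, plays, purchases). Unobserved pairs enter as weak negatives with \(c_{ui} = 1\), which is exactly the "one-class" structure described above.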
```python
# Simulate implicit feedback: view counts instead of ratings
implicit_df = ratings_df.copy()

# Convert ratings to "interaction strength" (e.g., number of times viewed)
implicit_df['views'] = (
    implicit_df['rating'] * rng.poisson(3, len(implicit_df))
    + rng.poisson(1, len(implicit_df))
)
implicit_df = implicit_df.drop(columns=['rating'])

# In implicit feedback, confidence increases with interaction strength.
# The standard formulation uses c_ui = 1 + alpha_c * views, where alpha_c
# is a confidence scaling parameter (distinct from the blending weight in
# the hybrid recommender above).
alpha_conf = 40
implicit_df['confidence'] = 1 + alpha_conf * implicit_df['views']
implicit_df['preference'] = 1   # all observed interactions are positive

print("Implicit feedback sample:")
print(implicit_df.head(10).to_string(index=False))
print(f"\nConfidence range: [{implicit_df['confidence'].min()}, "
      f"{implicit_df['confidence'].max()}]")
```
The confidence-weighted approach (with \(\alpha_c\) as a hyperparameter you’d tune empirically — values in the literature range from 1 to 40 depending on the domain) treats every observed interaction as a positive signal with varying strength, and every unobserved interaction as a weak negative. A user who watched a film five times gives a stronger positive signal than one who watched thirty seconds and moved on — but both are positive, and neither is as informative as a user explicitly rating something one star.
Author’s Note
The shift from explicit to implicit feedback changed my understanding of what recommendation systems actually do. With ratings, you’re solving a regression problem — predict the number. With implicit data, you’re solving something closer to an information retrieval problem — rank the items the user would engage with above the ones they wouldn’t. This distinction matters because the loss function, the evaluation metrics, and the baseline expectations all change. Accuracy on a 1-5 scale is meaningless when your data is binary clicks. It’s a reminder that the framing of the problem shapes everything downstream — a lesson that applies well beyond recommendation systems.
22.10 Scaling recommendations
A recommendation system that takes ten seconds to generate results is useless in production, regardless of its accuracy. The scaling challenges are twofold: offline computation (training models and precomputing scores) and online serving (generating personalised recommendations at request time with low latency).
For collaborative filtering, the item-item similarity matrix is \(O(n^2)\) in items — manageable for thousands of items, prohibitive for millions. Matrix factorisation scales better: once you’ve learned the latent factors, generating a single user’s recommendations is a dot product over \(K\) dimensions, which is fast.
The standard production pattern is to precompute recommendations in a batch job (nightly or hourly), store the top-\(k\) items per user in a fast key-value store, and serve them directly on request. For users who need real-time adaptation (they just bought something), a lightweight re-ranking layer adjusts the precomputed list based on the latest interaction — removing items similar to what was just purchased and boosting items in related categories.
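As a rough illustration of that re-ranking layer, the sketch below demotes items too similar to a just-purchased item and re-sorts the list. The rank-derived base scores, similarity threshold, and penalty are illustrative assumptions, and the collaborative recommendations stand in for the precomputed list:

```python
def rerank_after_purchase(precomputed_items, purchased_item, item_similarity,
                          sim_threshold=0.4, penalty=0.5):
    """Re-rank a precomputed list after a purchase (illustrative values).

    Items very similar to the purchase are demoted; a stricter variant
    would remove them entirely and backfill from related categories.
    """
    precomputed_items = np.asarray(precomputed_items)
    base_scores = np.linspace(1.0, 0.5, len(precomputed_items))   # rank-derived scores
    sims = item_similarity[purchased_item, precomputed_items]
    adjusted = np.where(sims > sim_threshold, base_scores * penalty, base_scores)
    return precomputed_items[np.argsort(adjusted)[::-1]]

candidate_list, _ = recommend_items_cf(0, interaction_matrix, item_similarity, n=10)
reranked = rerank_after_purchase(candidate_list, purchased_item=candidate_list[0],
                                 item_similarity=item_similarity)
print(f"Before: {candidate_list}")
print(f"After:  {reranked}")
```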
Note that precomputation and model retraining operate on different schedules. You might recompute the top-\(k\) lists hourly (cheap — just matrix multiplication and sorting), but retrain the latent factor model weekly or monthly (expensive — full optimisation pass over all observed interactions). Serving stale lists from a fresh model is different from serving fresh lists from a stale model; monitoring both cadences matters.
```python
import time

# Demonstrate precomputation: generate top-10 recommendations for all users.
# This materialises the full score matrix — feasible at our demo scale (500×200)
# but not at production scale with millions of users and items. At scale, you'd
# use approximate nearest neighbour (ANN) libraries (e.g. Faiss, Annoy, ScaNN)
# to find the top-k items in latent space without computing all pairwise scores.
start = time.perf_counter()

all_scores = user_latent @ item_latent.T   # (n_users, n_items)

# Mask already-rated items
dense_interactions = interaction_matrix.toarray()
all_scores[dense_interactions > 0] = -np.inf

# Top-k per user
k = 10
top_k_items = np.argsort(all_scores, axis=1)[:, -k:][:, ::-1]

elapsed = time.perf_counter() - start
print(f"Precomputed top-{k} for {n_users} users in {elapsed*1000:.1f}ms")
print(f"Storage: {top_k_items.nbytes / 1024:.1f} KB for the lookup table")
print(f"\nUser 0 precomputed recommendations: {top_k_items[0]}")
```
Precomputed top-10 for 500 users in 3.7ms
Storage: 39.1 KB for the lookup table
User 0 precomputed recommendations: [ 87 194 154 99 45 145 153 195 36 167]
The precomputation is fast for our 500-user demonstration dataset — but the full score matrix grows as \(O(\text{users} \times \text{items})\), so it doesn’t scale to millions. The engineering patterns from Section 19.1 — chunked processing, vectorisation, and choosing the right data structure — apply directly to the batch scoring job, while ANN libraries handle the nearest-neighbour search in latent space.
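To give a feel for that serving path, here is a hedged sketch of top-k retrieval over the latent factors with Faiss. It assumes the `faiss-cpu` package is installed (it is not used elsewhere in this chapter), and it uses an exact inner-product index; at real scale you would swap in an approximate index:

```python
import faiss  # assumes the faiss-cpu package is installed

# Exact inner-product search over the item latent vectors
item_index = faiss.IndexFlatIP(n_components)
item_index.add(np.ascontiguousarray(item_latent, dtype=np.float32))

# Top-10 items for user 0 by dot product in latent space
query = np.ascontiguousarray(user_latent[:1], dtype=np.float32)
scores, item_ids = item_index.search(query, 10)
print(f"Top-10 for user 0 via Faiss: {item_ids[0]}")

# IndexFlatIP is brute force (exact). Approximate variants such as HNSW or
# IVF indexes trade a little recall for much lower query latency at scale.
```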
At the largest scale, neural approaches — two-tower models and transformer-based sequence recommenders — have largely supplanted matrix factorisation, at the cost of significantly more training infrastructure. The principles are the same (learn latent representations, score via dot products, serve via ANN), but the representation learning is deeper. Those architectures are beyond our scope here; the matrix factorisation framework gives you the conceptual foundation they build on.
Engineering Bridge
The precompute-and-serve pattern maps directly to the cache-aside architecture common in web services. You compute the expensive operation asynchronously (model training and score generation), store the results in a fast-access layer (Redis, Memcached, or a purpose-built feature store), and serve requests from the cache. Cache invalidation — the “hardest problem in computer science” — has a direct analogue: when should you recompute recommendations? After every interaction (expensive but fresh), on a schedule (cheap but stale), or on a hybrid trigger (recompute when the user’s interaction count crosses a threshold)? The trade-offs are identical to those in any caching system. One additional wrinkle: a user exercising their right to erasure under GDPR doesn’t just remove a row from the interaction table — their preferences are encoded in the model’s latent factors. Full compliance may require retraining the model after deletion, not just clearing the cache.
22.11 Summary
Recommendation is a missing-data problem. The interaction matrix is overwhelmingly sparse, and the core challenge is predicting the missing entries — what would each user think of items they haven’t seen?
Collaborative filtering works from behaviour, not features. It exploits the observation that users who agreed in the past tend to agree in the future. Item-based and user-based variants differ in which axis of the matrix they compute similarity over.
Matrix factorisation compresses the interaction matrix into latent factors. By forcing ratings through a low-dimensional bottleneck, it learns the underlying structure of preferences and generalises across sparse observations — the same principle as PCA applied to a matrix with missing values.
Content-based filtering uses item features to bridge the cold-start gap. New items without ratings can still be recommended based on their attributes. Hybrid systems combine collaborative and content-based signals for robustness.
Evaluation requires ranking metrics, not just prediction error. Precision@k, Recall@k, and NDCG measure whether the system surfaces relevant items in the top positions. Online A/B testing remains essential because offline metrics don’t capture novelty, diversity, or user satisfaction.
22.12 Exercises
Implement user-based collaborative filtering by computing cosine similarity between user rows (instead of item columns) of the interaction matrix. Compare its recommendations for user 0 with the item-based approach from Section 22.3. Under what conditions would you expect user-based to outperform item-based, and vice versa?
The hybrid recommender in Section 22.6 uses a fixed blending weight \(\alpha = 0.6\). Split the training data further into train and validation subsets (80/20), then write a function that tunes \(\alpha\) by evaluating precision@10 on the validation set for values of \(\alpha\) from 0 to 1 in steps of 0.1. Plot precision@10 as a function of \(\alpha\). Is the optimal blend closer to pure collaborative or pure content-based?
Evaluating diversity. Write a function that computes the intra-list diversity of a recommendation list — the average pairwise distance (1 minus cosine similarity) between the recommended items’ content features. Compute this metric for the collaborative, content-based, and hybrid recommenders. Which produces the most diverse recommendations? Why?
Conceptual: A product manager proposes using purchase history as the sole input for recommendations on an e-commerce site. What signals would this miss compared to using browsing behaviour (page views, search queries, time on page)? What are the risks of including browsing behaviour — when might a signal be misleading?
Conceptual: Recommendation systems create feedback loops — popular items get recommended more often, which generates more interactions, which makes them appear even more popular. How does this popularity bias affect the system over time? What mechanisms could you introduce to ensure that new or niche items still get discovered? How would you measure whether your intervention is working?
[1] Y. Hu, Y. Koren, and C. Volinsky, ‘Collaborative filtering for implicit feedback datasets’, in Proceedings of the 8th IEEE International Conference on Data Mining, 2008, pp. 263–272.
Source Code
---title: "Building a recommendation system"---## The problem every platform has {#sec-recsys-intro}You've built search. Users type a query, your system returns results ranked by relevance, and the interaction ends. But what happens when the user opens your app and doesn't type anything? The blank landing page is the hardest screen in product engineering — you need to show something useful before the user tells you what they want. Every time Netflix surfaces a row of films, Spotify generates a playlist, or an e-commerce site shows "customers also bought," a recommendation system is making predictions about human preferences from incomplete data.This is a different kind of prediction from what we've seen so far. Churn prediction (@sec-churn-intro) asked a binary question — will this customer leave? Regression (@sec-from-description-to-prediction) predicted a continuous value. Recommendation systems predict a *matrix* of preferences — every user's opinion of every item — from a dataset where the vast majority of entries are missing. A platform with 100,000 users and 10,000 items has a billion possible user-item pairs, but a typical user might interact with a few hundred items. That's over 99.999% missing data.The engineering challenge is familiar: you're building a system that must serve personalised results at low latency for millions of users. The data science challenge is less familiar: you're estimating preferences from sparse, noisy, implicit signals — and the thing you're predicting (human taste) is subjective, context-dependent, and constantly shifting.## The interaction matrix {#sec-interaction-matrix}The foundation of most recommendation systems is the **interaction matrix** (also called the user-item matrix or ratings matrix). Each row is a user, each column is an item, and each entry records an interaction — a rating, a purchase, a click, or simply whether the user engaged with the item at all.```{python}#| label: interaction-matrix-setup#| echo: trueimport numpy as npimport pandas as pdrng = np.random.default_rng(42)n_users =500n_items =200n_interactions =8000# ~8% density — sparse but realistic# Simulate user and item latent factors (hidden preferences and attributes)n_factors =5user_factors = rng.normal(0, 1, (n_users, n_factors))item_factors = rng.normal(0, 1, (n_items, n_factors))# True preferences are dot products of latent factors plus noisetrue_preferences = user_factors @ item_factors.T# Sample which user-item pairs are observed (not random — biased towards# items the user would prefer, mimicking real browsing behaviour)observed_pairs =set()whilelen(observed_pairs) < n_interactions: u = rng.integers(0, n_users)# Higher-preference items are more likely to be observed probs = np.exp(true_preferences[u]) / np.exp(true_preferences[u]).sum() i = rng.choice(n_items, p=probs) observed_pairs.add((u, i))# Generate ratings (1-5) from true preferences + noiserows = []for u, i in observed_pairs: rating = np.clip(np.round(true_preferences[u, i] + rng.normal(0, 0.5) +3), 1, 5) rows.append({'user_id': u, 'item_id': i, 'rating': int(rating)})ratings_df = pd.DataFrame(rows)print(f"Users: {n_users:,}")print(f"Items: {n_items:,}")print(f"Interactions: {len(ratings_df):,}")print(f"Matrix density: {len(ratings_df) / (n_users * n_items):.1%}")print(f"Rating distribution:")print(ratings_df['rating'].value_counts().sort_index().to_string())```The 8% density we've simulated is generous — real-world systems often have densities below 1%. This sparsity is the central technical challenge. 
Most machine learning algorithms assume you have complete feature vectors for each observation. Here, the "feature vector" is the user's preference for every item, and almost all of it is missing.::: {.callout-note}## Engineering BridgeThe interaction matrix is the adjacency matrix of a **bipartite graph** — users and items are two distinct node types, and edges only run between them (users rate items, never other users). The matrix is rectangular, not square, which distinguishes it from the symmetric adjacency matrices in general graph problems. But the storage patterns are identical: you wouldn't materialise a billion-cell dense matrix when only a fraction of edges exist. Recommendation systems work with sparse representations — coordinate lists, compressed sparse row (CSR) formats, or simply DataFrames of (user, item, value) triples. If you've worked with sparse data structures in graph processing or network analysis, the indexing and traversal patterns will be familiar.:::## Collaborative filtering: users who liked this also liked that {#sec-collaborative-filtering}**Collaborative filtering** is the idea that users who agreed in the past will agree in the future. If you and I both rated the same ten films highly, and you loved a film I haven't seen, there's a reasonable chance I'd like it too. The algorithm doesn't need to know *anything* about what the films are — it works entirely from the pattern of interactions.There are two flavours. **User-based** collaborative filtering finds users similar to you and recommends items they liked. **Item-based** collaborative filtering finds items similar to ones you've liked and recommends those. The mechanics are nearly identical; they differ only in which axis of the interaction matrix you compute similarity over.Let's implement item-based collaborative filtering. For each pair of items, we compute the **cosine similarity** between their columns in the interaction matrix — treating each item as a vector of user ratings. 
Items that receive similar ratings from the same users will have high similarity.```{python}#| label: item-based-cf#| echo: truefrom scipy.sparse import csr_matrixfrom sklearn.metrics.pairwise import cosine_similarity# Build sparse interaction matrixinteraction_matrix = csr_matrix( (ratings_df['rating'].values, (ratings_df['user_id'].values, ratings_df['item_id'].values)), shape=(n_users, n_items))# Compute item-item similarity (cosine similarity between item columns)item_similarity = cosine_similarity(interaction_matrix.T)np.fill_diagonal(item_similarity, 0) # don't recommend an item to itselfdef recommend_items_cf(user_id, interaction_matrix, item_similarity, n=5):"""Recommend items using item-based collaborative filtering.""" user_ratings = interaction_matrix[user_id].toarray().flatten() rated_mask = user_ratings >0# Predicted score: similarity-weighted average of the user's known ratings scores = item_similarity @ user_ratings# Normalise by the sum of absolute similarities to rated items sim_sums = np.abs(item_similarity[:, rated_mask]).sum(axis=1) sim_sums[sim_sums ==0] =1# avoid division by zero scores = scores / sim_sums# Exclude already-rated items scores[rated_mask] =-np.inf top_items = np.argsort(scores)[::-1][:n]return top_items, scores[top_items]# Example: recommendations for user 0rec_items, rec_scores = recommend_items_cf(0, interaction_matrix, item_similarity)user_history = ratings_df[ratings_df['user_id'] ==0].sort_values('rating', ascending=False)print(f"User 0 has rated {len(user_history)} items")print(f"Top-rated items: {user_history.head(3)[['item_id', 'rating']].to_string(index=False)}")print(f"\nRecommended items: {rec_items}")print(f"Predicted scores: {np.round(rec_scores, 2)}")```The approach is intuitive and requires no training — you compute similarities once and generate recommendations on the fly. But it has significant limitations. Similarity computation is $O(n^2)$ in the number of items (or users), which doesn't scale well. More fundamentally, it can only recommend items that share rating patterns with items you've already rated — it struggles with the **long tail** of items that have few ratings.## Matrix factorisation: compressing preferences {#sec-matrix-factorisation}Matrix factorisation takes a different approach. Instead of computing pairwise similarities, it decomposes the interaction matrix into two smaller matrices — one representing users and one representing items — in a shared low-dimensional latent space. If PCA (@sec-pca) finds the directions of maximum variance in a feature matrix, matrix factorisation finds the latent dimensions that best explain the observed ratings.The idea is that a user's rating of an item can be approximated as the dot product of a user vector and an item vector:$$\hat{r}_{ui} = \mathbf{p}_u \cdot \mathbf{q}_i = \sum_{k=1}^{K} p_{uk} \, q_{ik}$$where $\mathbf{p}_u$ is a $K$-dimensional vector representing user $u$'s preferences, $\mathbf{q}_i$ is a $K$-dimensional vector representing item $i$'s attributes, and $K$ is the number of latent factors (typically 10–100). Each factor might correspond to something interpretable — one dimension might capture "action versus drama," another "mainstream versus indie" — but the algorithm discovers these dimensions automatically from the data. 
This is the same principle behind the latent components in PCA and the latent topics in text analysis: compress high-dimensional data by finding its underlying structure.We find $\mathbf{P}$ and $\mathbf{Q}$ by minimising the reconstruction error on observed ratings, with regularisation to prevent overfitting:$$\min_{\mathbf{P}, \mathbf{Q}} \sum_{(u, i) \in \text{observed}} \left( r_{ui} - \mathbf{p}_u \cdot \mathbf{q}_i \right)^2 + \lambda \sum_u \|\mathbf{p}_u\|^2 + \lambda \sum_i \|\mathbf{q}_i\|^2$$In words: minimise the squared error on observed ratings, plus an L2 penalty on every user and item vector. The regularisation term $\lambda$ plays exactly the same role as in ridge regression (@sec-ridge) — it penalises large parameter values to keep the model from fitting noise in the sparse observed data. A common extension adds explicit bias terms — a global mean $\mu$, a per-user bias $b_u$ (some users rate everything higher), and a per-item bias $b_i$ (some items are systematically better-received) — so the prediction becomes $\hat{r}_{ui} = \mu + b_u + b_i + \mathbf{p}_u \cdot \mathbf{q}_i$. Our simplified formulation omits these for clarity, but in production they often capture a large fraction of the variance.```{python}#| label: matrix-factorisation#| echo: truefrom sklearn.decomposition import TruncatedSVD# For demonstration, we'll use truncated SVD on the interaction matrix.# SVD treats missing entries as zeros — implicitly telling the model that# every unrated item has a rating of 0, far below the 1-5 observed scale.# This depresses latent item factors and biases reconstruction. In# production, you'd use alternating least squares (ALS) or stochastic# gradient descent (SGD), which optimise only over observed entries and# avoid this zero-imputation problem. Mean-centering each user's ratings# before fitting can partially mitigate the bias. SVD is illustrative# and available in scikit-learn, so we use it here.n_components =20svd = TruncatedSVD(n_components=n_components, random_state=42)# fit_transform returns U @ Sigma — singular values are already absorbeduser_latent = svd.fit_transform(interaction_matrix) # (n_users, n_components)# components_ is V^T, so .T gives Vitem_latent = svd.components_.T # (n_items, n_components)print(f"User latent matrix: {user_latent.shape}")print(f"Item latent matrix: {item_latent.shape}")print(f"Explained variance ratio: {svd.explained_variance_ratio_.sum():.1%}")# Reconstruct ratings for user 0: (U @ Sigma) @ V^Tpredicted_ratings = user_latent[0] @ item_latent.Tactual_ratings = interaction_matrix[0].toarray().flatten()rated_items = np.where(actual_ratings >0)[0]comparison = pd.DataFrame({'item_id': rated_items,'actual': actual_ratings[rated_items],'predicted': np.round(predicted_ratings[rated_items], 1)}).head(10)print(f"\nUser 0 — actual vs predicted ratings:")print(comparison.to_string(index=False))``````{python}#| label: fig-latent-space#| echo: true#| fig-cap: "Items projected into a two-dimensional latent space from the matrix factorisation. Clusters of nearby items share similar rating patterns — users who like one item in a cluster tend to like others nearby."#| fig-alt: "Scatter plot showing 200 items projected onto two latent dimensions from the matrix factorisation. Points are coloured on a sequential scale from dark (low average rating) to bright (high average rating). 
Items with similar rating patterns cluster together in the latent space, demonstrating that the factorisation has learned meaningful preference structure."import matplotlib.pyplot as plt# Compute average rating per item, excluding unrated entries (zeros)dense = interaction_matrix.toarray()masked = np.ma.masked_equal(dense, 0)avg_ratings = masked.mean(axis=0).data# Use the first two latent dimensions for visualisationfig, ax = plt.subplots(figsize=(10, 5))sc = ax.scatter( item_latent[:, 0], item_latent[:, 1], c=avg_ratings, cmap='viridis', alpha=0.8, s=40, edgecolors='none')cbar = plt.colorbar(sc, ax=ax)cbar.set_label('Average rating')ax.set_xlabel('Latent dimension 1')ax.set_ylabel('Latent dimension 2')ax.set_title('Items in latent space: similar items cluster together')ax.spines[['top', 'right']].set_visible(False)plt.tight_layout()plt.show()```The latent space gives us something collaborative filtering alone doesn't: a compact representation of *why* users and items are similar. Two items close in latent space share underlying attributes that drive similar ratings, even if few users have rated both. This is how matrix factorisation addresses sparsity — it generalises across the gaps by learning structure rather than relying on direct co-occurrence.::: {.callout-tip}## Author's NoteMatrix factorisation clicked for me when I stopped thinking of it as a prediction algorithm and started thinking of it as compression. In our simulation, the interaction matrix has 100,000 cells (500 users × 200 items) but only 8,000 are filled in. Yet 20 latent dimensions can reconstruct those 8,000 entries reasonably well — which means the underlying structure of human preferences is far lower-dimensional than the raw matrix suggests. It's the same insight behind dimensionality reduction in @sec-pca, but applied to a matrix that's mostly empty. The compression *is* the model: by forcing the data through a bottleneck, you learn what matters and discard what's noise.:::## Content-based filtering: using item features {#sec-content-based}Collaborative filtering works entirely from interaction patterns — it doesn't need to know what items *are*. **Content-based filtering** takes the opposite approach: it recommends items similar to what the user has liked, based on item attributes.If a user rated several Python programming books highly, a content-based system would recommend other Python books. 
It examines item features (genre, author, price range, description keywords) and builds a profile of each user's preferences over those features.```{python}#| label: content-based#| echo: true# Simulate item features: genre, price range, and quality indicatorsn_genres =8item_genre_probs = rng.dirichlet(np.ones(n_genres) *0.5, n_items)item_price = rng.uniform(5, 100, n_items)item_popularity = rng.exponential(1, n_items)# Feature matrix: genre distribution + normalised price + popularityitem_content = np.column_stack([ item_genre_probs, (item_price - item_price.mean()) / item_price.std(), (item_popularity - item_popularity.mean()) / item_popularity.std(),])def recommend_content_based(user_id, interaction_matrix, item_content, n=5):"""Build a user profile from rated items and find similar unrated items.""" user_ratings = interaction_matrix[user_id].toarray().flatten() rated_mask = user_ratings >0if rated_mask.sum() ==0:return np.array([]), np.array([])# User profile: weighted average of item features, weighted by rating weights = user_ratings[rated_mask] user_profile = (item_content[rated_mask].T @ weights) / weights.sum()# Score unrated items by cosine similarity to user profile scores = cosine_similarity(user_profile.reshape(1, -1), item_content).flatten() scores[rated_mask] =-np.inf top_items = np.argsort(scores)[::-1][:n]return top_items, scores[top_items]rec_items_cb, rec_scores_cb = recommend_content_based(0, interaction_matrix, item_content)print(f"Content-based recommendations for user 0: {rec_items_cb}")print(f"Similarity scores: {np.round(rec_scores_cb, 3)}")```The scores reflect how closely each recommended item's features match the user's profile. Notice that content-based scores are bounded between -1 and 1 (cosine similarity), unlike the unbounded predicted ratings from collaborative filtering — the two approaches operate on different scales, which is why the hybrid system below normalises both before combining them.Content-based filtering has a distinct advantage: it can recommend items with no interaction history at all, as long as you have their features. A new book added to the catalogue can be recommended immediately based on its genre and description — no ratings required. We'll see in @sec-cold-start why this matters so much.But content-based filtering has its own limitation: it can only recommend items similar to what the user has already seen. It doesn't discover surprising connections — the fact that fans of obscure jazz also tend to enjoy nature documentaries, for instance. Collaborative filtering captures these cross-category patterns because it works from behaviour, not metadata.## Hybrid systems: combining signals {#sec-hybrid}Before building a hybrid, it helps to see where each approach excels and where it falls short:|| Collaborative filtering | Content-based filtering || ---| ---| ---|| **Data needed** | User-item interactions | Item features (metadata) || **New items** | Cannot recommend (no interactions) | Can recommend immediately || **New users** | Cannot personalise (no history) | Cannot personalise (no history) || **Cross-category discovery** | Strong — finds surprising patterns | Weak — only recommends similar items || **Scalability** | $O(n^2)$ similarity; $O(K)$ with MF | $O(\text{items} \times \text{features})$ |In practice, production recommendation systems combine multiple approaches. 
A **hybrid system** uses collaborative filtering as its primary signal, content-based features to handle cold-start items, and popularity as a fallback for completely new users.```{python}#| label: hybrid-recommender#| echo: truedef recommend_hybrid(user_id, interaction_matrix, item_similarity, item_content, alpha=0.6, n=5):""" Combine collaborative and content-based scores. alpha controls the weight: 1.0 = pure collaborative, 0.0 = pure content. """ user_ratings = interaction_matrix[user_id].toarray().flatten() rated_mask = user_ratings >0# Collaborative filtering scores cf_scores = item_similarity @ user_ratings sim_sums = np.abs(item_similarity[:, rated_mask]).sum(axis=1) sim_sums[sim_sums ==0] =1 cf_scores = cf_scores / sim_sums# Content-based scoresif rated_mask.sum() >0: weights = user_ratings[rated_mask] user_profile = (item_content[rated_mask].T @ weights) / weights.sum() cb_scores = cosine_similarity( user_profile.reshape(1, -1), item_content ).flatten()else: cb_scores = np.zeros(n_items)# Normalise both to [0, 1] for fair combinationdef min_max_scale(x): finite = x[np.isfinite(x)]if finite.max() == finite.min():return np.zeros_like(x)return (x - finite.min()) / (finite.max() - finite.min()) cf_norm = min_max_scale(cf_scores) cb_norm = min_max_scale(cb_scores) combined = alpha * cf_norm + (1- alpha) * cb_norm combined[rated_mask] =-np.inf top_items = np.argsort(combined)[::-1][:n]return top_items, combined[top_items]rec_hybrid, scores_hybrid = recommend_hybrid(0, interaction_matrix, item_similarity, item_content)print(f"Hybrid recommendations for user 0: {rec_hybrid}")print(f"Combined scores: {np.round(scores_hybrid, 3)}")```The mixing weight $\alpha$ is a hyperparameter you'd tune empirically — and it might vary per user. A user with hundreds of ratings provides strong collaborative signal; a new user with three ratings benefits more from content-based features.::: {.callout-note}## Engineering BridgeA hybrid recommendation system follows the **fan-out and merge** pattern common in microservice architectures. Each recommendation strategy is an independent scorer — the request fans out to collaborative, content-based, and popularity services in parallel, then a blending layer merges the results, applies business rules (exclude out-of-stock items, enforce diversity), and returns the final ranked list. This separation of concerns — scoring, blending, filtering — maps cleanly to how engineers design composable systems. The key engineering decisions are familiar: what happens when the collaborative scorer is slow or unavailable? You apply the **circuit breaker** pattern — fall back to the content-based or popularity scores, serve a degraded but still useful response, and alert on the failure. Caching intermediate results and A/B testing different blending strategies are the same problems you'd solve in any aggregation layer.:::## The cold-start problem {#sec-cold-start}The cold-start problem has two faces. **New users** have no interaction history, so collaborative filtering has nothing to work with. **New items** have no ratings, so they can't appear in collaborative recommendations. 
## The cold-start problem {#sec-cold-start}

The cold-start problem has two faces. **New users** have no interaction history, so collaborative filtering has nothing to work with. **New items** have no ratings, so they can't appear in collaborative recommendations. Both are common in production — any growing platform adds users and items continuously.

Strategies for cold-starting users include asking for explicit preferences during onboarding (the "rate five items to get started" pattern), using demographic or contextual information (location, device, referral source), and falling back to popularity-based recommendations that gradually transition to personalised ones as interactions accumulate.

For new items, content-based features bridge the gap. An item's metadata — genre, price, description, creator — provides enough signal to place it in the latent space alongside similar items that do have ratings. This is sometimes called **side information** injection, and it's one of the strongest arguments for hybrid approaches.

```{python}
#| label: cold-start-demo
#| echo: true

# For a brand-new user with no interaction history, collaborative filtering
# has nothing to work with. The fallback: recommend the most popular and
# highest-rated items across all users.
item_stats = ratings_df.groupby('item_id').agg(
    n_ratings=('rating', 'count'),
    avg_rating=('rating', 'mean')
).reset_index()

# Weighted score: balance popularity and quality
item_stats['pop_score'] = (
    0.5 * (item_stats['n_ratings'] / item_stats['n_ratings'].max())
    + 0.5 * (item_stats['avg_rating'] / 5)
)

top_popular = item_stats.nlargest(5, 'pop_score')
print("Cold-start fallback — most popular items:")
print(top_popular[['item_id', 'n_ratings', 'avg_rating', 'pop_score']]
      .round(2).to_string(index=False))
```

## Evaluation: measuring what matters {#sec-recsys-evaluation}

Evaluating a recommendation system is harder than evaluating a classifier or regressor. The standard machine learning metrics don't capture what makes recommendations *useful*. A system that recommends the five most popular items to everyone might have decent accuracy but provides no personalisation.

For rating prediction, we can measure reconstruction error using root mean squared error (RMSE) — how closely the model's predicted ratings match held-out ratings. But RMSE optimises for accuracy on items users happened to rate, not for whether unrated items would have been worth recommending. Most real systems care about **ranking**: does the system surface items the user would engage with, and in a useful order?
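For context on what an RMSE number even means, a quick baseline helps: predict every interaction with the item's mean rating and measure the in-sample error. This is a rough reference point rather than a proper evaluation (which needs a held-out split, as below):

```{python}
#| label: rmse-baseline
#| echo: true

# Illustrative baseline: predict each item's mean rating for every observed interaction.
# Any model's RMSE is only meaningful relative to a simple reference like this.
item_means = ratings_df.groupby('item_id')['rating'].transform('mean')
baseline_rmse = np.sqrt(np.mean((item_means - ratings_df['rating']) ** 2))
print(f"In-sample RMSE of an item-mean baseline: {baseline_rmse:.3f}")
```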
""" precisions, recalls = [], []for user_id in test_df['user_id'].unique():# Relevant items: test items the user rated highly user_test = test_df[ (test_df['user_id'] == user_id) & (test_df['rating'] >= threshold) ]iflen(user_test) ==0:continue relevant =set(user_test['item_id'].values)# Generate recommendations from training data rec_items, _ = recommend_items_cf(user_id, train_matrix, item_sim, n=k) recommended =set(rec_items) hits =len(relevant & recommended) precisions.append(hits / k) recalls.append(hits /len(relevant) iflen(relevant) >0else0)return np.mean(precisions), np.mean(recalls)for k in [5, 10, 20]: prec, rec = evaluate_precision_recall_at_k( test_df, train_matrix, train_item_sim, k=k )print(f"k={k:2d} precision@k = {prec:.3f} recall@k = {rec:.3f}")```**Precision@k** asks: of the $k$ items recommended, how many were relevant (the fraction of true positives among recommendations)? **Recall@k** asks: of all relevant items, how many appeared in the top $k$? These are the same precision and recall concepts from @sec-precision-recall-tradeoff, adapted for ranked lists rather than binary classification. A third common metric, **NDCG** (normalised discounted cumulative gain), also accounts for the *position* of relevant items — a relevant item at position 1 matters more than one at position 10.```{python}#| label: fig-k-tradeoff#| echo: true#| fig-cap: "Precision@k and Recall@k as a function of k. As k increases, recall improves (more relevant items are surfaced) but precision declines (more irrelevant items dilute the list). The dashed vertical line marks the crossover point."#| fig-alt: "Line chart with two curves showing how precision and recall change as the number of recommendations k increases from 1 to 30. The precision@k curve (solid line with circle markers) falls steadily as more items are recommended. The recall@k curve (dashed line with square markers) rises from near zero. The two curves cross at approximately k equals 12, marked by a vertical dashed line, illustrating the precision-recall trade-off in ranked recommendation lists."ks =list(range(1, 31))precs, recs = [], []for k in ks: p, r = evaluate_precision_recall_at_k( test_df, train_matrix, train_item_sim, k=k ) precs.append(p) recs.append(r)fig, ax = plt.subplots(figsize=(10, 5))ax.plot(ks, precs, color='steelblue', linewidth=2, linestyle='-', marker='o', markevery=5, label='Precision@k')ax.plot(ks, recs, color='coral', linewidth=2, linestyle='--', marker='s', markevery=5, label='Recall@k')# Mark the crossover pointcross_idx = np.argmin(np.abs(np.array(precs) - np.array(recs)))ax.axvline(ks[cross_idx], color='#aaaaaa', linestyle=':', linewidth=1.2, alpha=0.7)ax.annotate(f'Crossover ≈ k={ks[cross_idx]}', xy=(ks[cross_idx], precs[cross_idx]), xytext=(15, 10), textcoords='offset points', fontsize=10, color='#555555')ax.set_xlabel('k (number of recommendations)')ax.set_ylabel('Proportion')ax.set_ylim(0, 1)ax.set_title('Precision-recall trade-off in recommendations')ax.legend(loc='center right')ax.spines[['top', 'right']].set_visible(False)plt.tight_layout()plt.show()```But even these metrics miss something important. A recommendation system that only surfaces safe, popular choices might score well on precision but feel generic and unhelpful. Good recommendations balance **relevance** (items the user would rate highly), **novelty** (items the user hasn't already discovered), and **diversity** (variety across the recommended list). 
But even these metrics miss something important. A recommendation system that only surfaces safe, popular choices might score well on precision but feel generic and unhelpful. Good recommendations balance **relevance** (items the user would rate highly), **novelty** (items the user hasn't already discovered), and **diversity** (variety across the recommended list). These qualities are harder to measure offline and often require online A/B testing (@sec-ab-testing) — showing real users different recommendation strategies and measuring engagement.

Offline evaluation has a deeper limitation: the test set only contains items users chose to interact with. Users who were already guided by a recommendation system had their choices shaped by it — a form of **exposure bias** (technically, the missing data is not missing at random). High precision on held-out ratings tells you the model is good at predicting preferences for items users were already inclined to discover; it tells you nothing about whether the system would surface items the user didn't know they wanted. This is why online experimentation remains essential.

There is also a systemic concern worth flagging: recommendation systems create **feedback loops**. Popular items get recommended more often, which generates more interactions, which makes them appear even more popular. Over time, this concentrates attention on a shrinking set of items while the long tail becomes invisible. Strategies like reserving exploration slots for new or underrepresented items, applying inverse propensity weighting, and enforcing diversity constraints can counteract this — but you need to measure the effect explicitly (catalogue coverage, Gini coefficient of item exposure) rather than assuming the feedback loop is benign.
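Both measurements are cheap to compute from the recommendation lists themselves. A minimal sketch using the training-set recommender's top-10 lists; the Gini calculation is the standard formula applied to item exposure counts (0 means exposure is spread evenly, values near 1 mean it is concentrated on a few items):

```{python}
#| label: coverage-gini
#| echo: true

# Generate top-10 lists for every user, then measure how exposure is distributed.
exposure = np.zeros(n_items)
for user_id in range(n_users):
    rec_items, _ = recommend_items_cf(user_id, train_matrix, train_item_sim, n=10)
    if len(rec_items) > 0:
        exposure[rec_items] += 1

# Catalogue coverage: share of items that appear in at least one list
coverage = (exposure > 0).mean()

# Gini coefficient of exposure counts (sorted ascending)
sorted_exp = np.sort(exposure)
cum = np.cumsum(sorted_exp)
gini = 1 - 2 * np.sum(cum) / (cum[-1] * n_items) + 1 / n_items

print(f"Catalogue coverage: {coverage:.1%} of items appear in at least one top-10 list")
print(f"Gini coefficient of item exposure: {gini:.3f}")
```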
## Implicit feedback: when you don't have ratings {#sec-implicit-feedback}

We've been working with explicit ratings — users telling us how much they liked something on a scale of 1 to 5. Most real-world systems don't have this luxury. Instead, they observe **implicit feedback**: clicks, purchases, time spent, pages scrolled, items added to a basket and later abandoned. These signals are abundant but ambiguous. A user clicking on an item might mean genuine interest, an accidental tap, or morbid curiosity. Not clicking doesn't mean dislike — the user may simply not have seen it.

Explicit feedback has both positive and negative signals — ratings of 1 and 5 are both informative. Implicit feedback is **one-class**: you observe what users did but not what they deliberately chose to avoid. This changes the modelling approach. Instead of predicting a rating, you predict a **preference probability** and weight it by a **confidence** derived from the interaction strength [@hu2008].

```{python}
#| label: implicit-feedback
#| echo: true

# Simulate implicit feedback: view counts instead of ratings
implicit_df = ratings_df.copy()

# Convert ratings to "interaction strength" (e.g., number of times viewed)
implicit_df['views'] = (
    implicit_df['rating'] * rng.poisson(3, len(implicit_df))
    + rng.poisson(1, len(implicit_df))
)
implicit_df = implicit_df.drop(columns=['rating'])

# In implicit feedback, confidence increases with interaction strength.
# The standard formulation uses c_ui = 1 + alpha_c * views, where alpha_c
# is a confidence scaling parameter (distinct from the blending weight in
# the hybrid recommender above).
alpha_conf = 40
implicit_df['confidence'] = 1 + alpha_conf * implicit_df['views']
implicit_df['preference'] = 1  # all observed interactions are positive

print("Implicit feedback sample:")
print(implicit_df.head(10).to_string(index=False))
print(f"\nConfidence range: [{implicit_df['confidence'].min()}, "
      f"{implicit_df['confidence'].max()}]")
```

The confidence-weighted approach (with $\alpha_c$ as a hyperparameter you'd tune empirically — values in the literature range from 1 to 40 depending on the domain) treats every observed interaction as a positive signal with varying strength, and every *unobserved* interaction as a weak negative. A user who watched a film five times gives a stronger positive signal than one who watched thirty seconds and moved on — but both are positive, and neither is as informative as a user explicitly rating something one star.

::: {.callout-tip}
## Author's Note
The shift from explicit to implicit feedback changed my understanding of what recommendation systems actually do. With ratings, you're solving a regression problem — predict the number. With implicit data, you're solving something closer to an information retrieval problem — rank the items the user would engage with above the ones they wouldn't. This distinction matters because the loss function, the evaluation metrics, and the baseline expectations all change. Accuracy on a 1-5 scale is meaningless when your data is binary clicks. It's a reminder that the framing of the problem shapes everything downstream — a lesson that applies well beyond recommendation systems.
:::

## Scaling recommendations {#sec-recsys-scaling}

A recommendation system that takes ten seconds to generate results is useless in production, regardless of its accuracy. The scaling challenges are twofold: **offline computation** (training models and precomputing scores) and **online serving** (generating personalised recommendations at request time with low latency).

For collaborative filtering, the item-item similarity matrix is $O(n^2)$ in items — manageable for thousands of items, prohibitive for millions. Matrix factorisation scales better: once you've learned the latent factors, generating a single user's recommendations is a dot product over $K$ dimensions, which is fast.

The standard production pattern is to precompute recommendations in a batch job (nightly or hourly), store the top-$k$ items per user in a fast key-value store, and serve them directly on request.
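A sketch of what that serving path looks like, with a plain dictionary standing in for the key-value store and only a handful of users precomputed; the names (`top_k_store`, `serve_recommendations`) are illustrative:

```{python}
#| label: serving-sketch
#| echo: true

# Illustrative serving path: a dict stands in for the key-value store.
# Precompute top-10 lists for a few users (the batch job would cover all of them).
top_k_store = {
    u: recommend_items_cf(u, interaction_matrix, item_similarity, n=10)[0]
    for u in range(5)
}
fallback_items = top_popular['item_id'].to_numpy()  # popularity fallback from earlier


def serve_recommendations(user_id, n=5):
    """Request-time path: O(1) lookup, popularity fallback for unknown users."""
    precomputed = top_k_store.get(user_id)
    if precomputed is None:
        return fallback_items[:n]
    return precomputed[:n]


print(f"User 0 (precomputed): {serve_recommendations(0)}")
print(f"User 499 (not in store, falls back to popular): {serve_recommendations(499)}")
```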
For users who need real-time adaptation (they just bought something), a lightweight re-ranking layer adjusts the precomputed list based on the latest interaction — removing items similar to what was just purchased and boosting items in related categories.

Note that precomputation and model retraining operate on different schedules. You might recompute the top-$k$ lists hourly (cheap — just matrix multiplication and sorting), but retrain the latent factor model weekly or monthly (expensive — full optimisation pass over all observed interactions). Serving stale lists from a fresh model is different from serving fresh lists from a stale model; monitoring both cadences matters.

```{python}
#| label: scaling-precompute
#| echo: true

import time

# Demonstrate precomputation: generate top-10 recommendations for all users.
# This materialises the full score matrix — feasible at our demo scale (500×200)
# but not at production scale with millions of users and items. At scale, you'd
# use approximate nearest neighbour (ANN) libraries (e.g. Faiss, Annoy, ScaNN)
# to find the top-k items in latent space without computing all pairwise scores.
start = time.perf_counter()

all_scores = user_latent @ item_latent.T  # (n_users, n_items)

# Mask already-rated items
dense_interactions = interaction_matrix.toarray()
all_scores[dense_interactions > 0] = -np.inf

# Top-k per user
k = 10
top_k_items = np.argsort(all_scores, axis=1)[:, -k:][:, ::-1]

elapsed = time.perf_counter() - start
print(f"Precomputed top-{k} for {n_users} users in {elapsed*1000:.1f}ms")
print(f"Storage: {top_k_items.nbytes / 1024:.1f} KB for the lookup table")
print(f"\nUser 0 precomputed recommendations: {top_k_items[0]}")
```

The precomputation is fast for our 500-user demonstration dataset — but the full score matrix grows as $O(\text{users} \times \text{items})$, so it doesn't scale to millions. The engineering patterns from @sec-scale-intro — chunked processing, vectorisation, and choosing the right data structure — apply directly to the batch scoring job, while ANN libraries handle the nearest-neighbour search in latent space.

At the largest scale, neural approaches — two-tower models and transformer-based sequence recommenders — have largely supplanted matrix factorisation, at the cost of significantly more training infrastructure. The principles are the same (learn latent representations, score via dot products, serve via ANN), but the representation learning is deeper. Those architectures are beyond our scope here; the matrix factorisation framework gives you the conceptual foundation they build on.
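Returning to the serving path, the lightweight re-ranking layer described earlier can be sketched in a few lines. The similarity threshold is arbitrary, and a production version would also boost related categories and apply business rules rather than only filtering:

```{python}
#| label: rerank-sketch
#| echo: true

def rerank_after_purchase(user_id, purchased_item, sim_threshold=0.3, n=5):
    """Drop precomputed items that are too similar to what was just bought."""
    candidates = top_k_items[user_id]
    too_similar = (
        (item_similarity[purchased_item, candidates] > sim_threshold)
        | (candidates == purchased_item)
    )
    return candidates[~too_similar][:n]


purchased = top_k_items[0][0]  # pretend user 0 just bought their top recommendation
print(f"Before re-ranking: {top_k_items[0][:5]}")
print(f"After buying item {purchased}: {rerank_after_purchase(0, purchased)}")
```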
::: {.callout-note}
## Engineering Bridge
The precompute-and-serve pattern maps directly to the cache-aside architecture common in web services. You compute the expensive operation asynchronously (model training and score generation), store the results in a fast-access layer (Redis, Memcached, or a purpose-built feature store), and serve requests from the cache. Cache invalidation — the "hardest problem in computer science" — has a direct analogue: when should you recompute recommendations? After every interaction (expensive but fresh), on a schedule (cheap but stale), or on a hybrid trigger (recompute when the user's interaction count crosses a threshold)? The trade-offs are identical to those in any caching system. One additional wrinkle: a user exercising their right to erasure under GDPR doesn't just remove a row from the interaction table — their preferences are encoded in the model's latent factors. Full compliance may require retraining the model after deletion, not just clearing the cache.
:::

## Summary {#sec-recsys-summary}

1. **Recommendation is a missing-data problem.** The interaction matrix is overwhelmingly sparse, and the core challenge is predicting the missing entries — what would each user think of items they haven't seen?
2. **Collaborative filtering works from behaviour, not features.** It exploits the observation that users who agreed in the past tend to agree in the future. Item-based and user-based variants differ in which axis of the matrix they compute similarity over.
3. **Matrix factorisation compresses the interaction matrix into latent factors.** By forcing ratings through a low-dimensional bottleneck, it learns the underlying structure of preferences and generalises across sparse observations — the same principle as PCA applied to a matrix with missing values.
4. **Content-based filtering uses item features to bridge the cold-start gap.** New items without ratings can still be recommended based on their attributes. Hybrid systems combine collaborative and content-based signals for robustness.
5. **Evaluation requires ranking metrics, not just prediction error.** Precision@k, Recall@k, and NDCG measure whether the system surfaces relevant items in the top positions. Online A/B testing remains essential because offline metrics don't capture novelty, diversity, or user satisfaction.

## Exercises {#sec-recsys-exercises}

1. Implement **user-based** collaborative filtering by computing cosine similarity between user rows (instead of item columns) of the interaction matrix. Compare its recommendations for user 0 with the item-based approach from @sec-collaborative-filtering. Under what conditions would you expect user-based to outperform item-based, and vice versa?
2. The hybrid recommender in @sec-hybrid uses a fixed blending weight $\alpha = 0.6$. Split the training data further into train and validation subsets (80/20), then write a function that tunes $\alpha$ by evaluating precision@10 on the validation set for values of $\alpha$ from 0 to 1 in steps of 0.1. Plot precision@10 as a function of $\alpha$. Is the optimal blend closer to pure collaborative or pure content-based?
3. **Evaluating diversity.** Write a function that computes the **intra-list diversity** of a recommendation list — the average pairwise distance (1 minus cosine similarity) between the recommended items' content features. Compute this metric for the collaborative, content-based, and hybrid recommenders. Which produces the most diverse recommendations? Why?
4. **Conceptual:** A product manager proposes using purchase history as the sole input for recommendations on an e-commerce site. What signals would this miss compared to using browsing behaviour (page views, search queries, time on page)? What are the risks of including browsing behaviour — when might a signal be misleading?
5. **Conceptual:** Recommendation systems create feedback loops — popular items get recommended more often, which generates more interactions, which makes them appear even more popular. How does this **popularity bias** affect the system over time? What mechanisms could you introduce to ensure that new or niche items still get discovered? How would you measure whether your intervention is working?