Federated Learning: Privacy-Preserving Training

Train models without sharing data

W23D4 - Federated Averaging

The Big Idea

Centralized Training

Collect all data on one server, train normally

model.fit(all_data) # Privacy risk

Federated Training

Data stays on clients, share only model weights

server.aggregate(client_weights) # Data never leaves

Same model quality (or close to it), fundamentally different privacy properties.

1 Why Federated Learning?

Four forces pushing ML toward federated approaches:

Privacy Regulations

GDPR, CCPA, HIPAA restrict data collection and centralization. Federated learning keeps data where it originated.

Data Sovereignty

Organizations can't or won't share raw data. Hospitals, banks, and competitors can still collaborate on models.

Communication Cost

Edge devices generate massive amounts of data. Sending model updates is often cheaper than uploading the raw data.

User Trust

Users increasingly demand control over their data. "Your data never leaves your device" is a powerful promise.

2 The FedAvg Algorithm

Federated Averaging (McMahan et al., 2017) is the foundational federated learning algorithm.

Central Server

Holds global model weights. Never sees raw data.

Client 1

Action fans

Client 2

Comedy fans

Client 3

Drama fans

Client 4

Mixed taste

Client 5

Sci-fi fans

Client 6

Horror fans

Client 7

Romance fans

Client 8

Doc fans

One Round of FedAvg:

Broadcast

Server sends current global weights w_t to K randomly selected clients

Train

Each selected client trains locally for E epochs on their private data, producing updated weights w_k

Upload

Each client sends back their updated weights and sample count n_k (NOT their data)

Aggregate

Server computes weighted average of all updates to form the new global model
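The four steps above can be sketched in plain Python. This is a minimal illustration, not the lab's starter code: the model is a one-feature linear regressor, and the names `local_train` and `fedavg_round`, the learning rate, and the toy client datasets are all assumptions made for the example.

```python
import random

def local_train(weights, data, epochs=5, lr=0.1):
    """Train: run gradient descent on one client's private (x, y) pairs."""
    w, b = weights  # start from the broadcast global weights
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in data:
            err = (w * x + b) - y
            grad_w += 2 * err * x / len(data)
            grad_b += 2 * err / len(data)
        w -= lr * grad_w
        b -= lr * grad_b
    return [w, b]

def fedavg_round(global_weights, clients, k, rng):
    """One round: broadcast, train, upload, aggregate."""
    # Broadcast: pick K clients at random; each receives the global weights.
    selected = rng.sample(clients, k)
    # Train + Upload: each client returns (updated weights, sample count n_k).
    uploads = [(local_train(global_weights, data), len(data)) for data in selected]
    # Aggregate: sample-weighted average, one coordinate at a time.
    n_total = sum(n_k for _, n_k in uploads)
    return [
        sum(w_k[i] * n_k for w_k, n_k in uploads) / n_total
        for i in range(len(global_weights))
    ]

# Hypothetical toy clients: private (x, y) pairs drawn from y = 2x + 1.
rng = random.Random(0)
clients = [
    [(x, 2 * x + 1) for x in (rng.uniform(-1, 1) for _ in range(n))]
    for n in (20, 50, 30, 40)
]

weights = [0.0, 0.0]
for _ in range(50):
    weights = fedavg_round(weights, clients, k=2, rng=rng)
```

In this noiseless toy setup every client shares the same optimum, so the global weights approach w = 2, b = 1. Note that the server code only ever touches `uploads`, never the clients' `data`.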

Federated Averaging — Weighted Aggregation

w_{t+1} = Σ_k (n_k / n_total) × w_k, where n_total = Σ_k n_k over the selected clients
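To make the weighting concrete, here is a small worked example of the formula above. The function name `weighted_average` and the layout of a model as a dict of named parameter lists are assumptions chosen for illustration; the numbers are made up.

```python
def weighted_average(updates):
    """Combine client updates as sum_k (n_k / n_total) * w_k.

    `updates` is a list of (weights, n_k) pairs, where `weights` maps a
    parameter name to a flat list of floats (a hypothetical layout).
    """
    n_total = sum(n_k for _, n_k in updates)
    first_weights = updates[0][0]
    return {
        name: [
            sum((n_k / n_total) * w[name][i] for w, n_k in updates)
            for i in range(len(first_weights[name]))
        ]
        for name in first_weights
    }

# Client B holds three times as many samples as client A,
# so its weights count three times as much in the average.
client_a = ({"embedding": [1.0, 0.0], "bias": [0.0]}, 100)
client_b = ({"embedding": [3.0, 4.0], "bias": [2.0]}, 300)

new_global = weighted_average([client_a, client_b])
print(new_global)  # {'embedding': [2.5, 3.0], 'bias': [1.5]}
```

An unweighted mean would give an embedding of [2.0, 2.0] instead; weighting by n_k pulls the result toward the client that trained on more data.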

3 Non-IID Challenges

In the real world, each client's data follows a different distribution. This is the non-IID (not independent and identically distributed) problem, and it makes federated learning harder.

IID (Ideal)

Every client has a representative sample of all data. Easy to converge.

Client 1: [Action, Comedy, Drama, ...]
Client 2: [Action, Comedy, Drama, ...]
# Same distribution everywhere

Non-IID (Reality)

Each client has biased data reflecting their preferences. Harder to converge.

Client 1: [Action, Action, Action, ...]
Client 2: [Comedy, Comedy, Comedy, ...]
# Very different distributions!

In our simulation: We partition MovieLens by user_id % NUM_CLIENTS. Since users have different tastes, each client naturally gets non-IID data — no synthetic manipulation needed.
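The partition rule can be sketched as follows. This does not load MovieLens; the rating tuples are hypothetical toy rows, and the function name `partition_by_user` is an assumption. Only `NUM_CLIENTS` and the `user_id % NUM_CLIENTS` rule come from the lab description.

```python
from collections import Counter, defaultdict

NUM_CLIENTS = 8  # matches the 8 simulated clients in the lab

def partition_by_user(ratings, num_clients=NUM_CLIENTS):
    """Assign every rating to the client given by user_id % num_clients."""
    shards = defaultdict(list)
    for user_id, genre, rating in ratings:
        shards[user_id % num_clients].append((user_id, genre, rating))
    return dict(shards)

# Hypothetical toy ratings as (user_id, genre, rating); not real MovieLens rows.
ratings = [
    (1, "Action", 5), (1, "Action", 4), (1, "Comedy", 2),
    (2, "Comedy", 5), (2, "Comedy", 4),
    (9, "Action", 5),   # 9 % 8 == 1, so user 9 lands on the same client as user 1
    (10, "Drama", 3), (10, "Comedy", 5),
]

shards = partition_by_user(ratings)
for client_id, rows in sorted(shards.items()):
    genre_counts = Counter(genre for _, genre, _ in rows)
    print(client_id, dict(genre_counts))
```

All of a user's ratings land on the same client, so each shard inherits the tastes of its users. In this toy data client 1 ends up mostly Action and client 2 mostly Comedy, which is the kind of skew the non-IID panel describes.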

4 Tonight's Lab

$ python w23d4_starter.py

The starter auto-creates a virtual environment, downloads MovieLens 100K, trains a centralized baseline, then runs FedAvg across 8 simulated clients.

Your Mission:

Run Centralized Baseline: Train on all data combined — this is your upper bound

Implement FedAvg: Complete the broadcast, local training, and aggregation code

Compare: How close does federated get to centralized? What's the gap?

Experiment: Try more rounds, more local epochs, or add noise for differential privacy
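For the differential-privacy experiment, one common recipe is to clip each client's update to a maximum L2 norm and then add Gaussian noise before it is aggregated. The sketch below shows only that mechanical step. The function name `privatize_update` and the parameters `clip_norm` and `noise_std` are illustrative assumptions, clipping is an addition beyond the "add noise" hint above, and the values are not calibrated to any formal privacy guarantee.

```python
import math
import random

def privatize_update(update, clip_norm=1.0, noise_std=0.1, rng=None):
    """Clip an update to L2 norm <= clip_norm, then add Gaussian noise.

    `update` is a client's weight delta (local weights minus global weights)
    as a flat list of floats.
    """
    rng = rng or random.Random()
    norm = math.sqrt(sum(v * v for v in update))
    # Scale down only if the update exceeds the clipping threshold.
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [v * scale + rng.gauss(0.0, noise_std) for v in update]

delta = [3.0, 4.0]  # L2 norm 5.0, above the clip threshold of 1.0
clipped_only = privatize_update(delta, clip_norm=1.0, noise_std=0.0)
noisy = privatize_update(delta, clip_norm=1.0, noise_std=0.1, rng=random.Random(0))
```

Clipping bounds how much any single client can move the global model, and the noise masks what remains. Larger `noise_std` should mean more privacy and a wider gap to the centralized baseline, which is worth measuring in the comparison step.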

5 Resources

FedAvg Explainer

Animated visualization of FedAvg rounds with weighted averaging demo


Non-IID Explainer

Compare IID vs non-IID distributions and see convergence impact


Federated Monitoring

How to monitor models when you can't see the training data


Session Agenda

6:30 - 6:50

Why Federated Learning? Privacy + Regulation

6:50 - 7:20

FedAvg Algorithm Deep Dive

7:20 - 7:50

Non-IID Challenges + Convergence

7:50 - 8:30

Build Block: Implement FedAvg

8:30 - 9:00

Differential Privacy Extension

9:00 - 9:30

Centralized vs Federated Discussion