Federated Learning: Privacy-Preserving Training

Train models without sharing data

W23D4 - Federated Averaging

The Big Idea

Centralized Training

Collect all data on one server, train normally

model.fit(all_data) # Privacy risk

Federated Training

Data stays on clients, share only model weights

server.aggregate(client_weights) # Data never leaves

Same model quality (or close to it), fundamentally different privacy properties.

1 Why Federated Learning?

Four forces pushing ML toward federated approaches:

Privacy Regulations

GDPR, CCPA, HIPAA restrict data collection and centralization. Federated learning keeps data where it originated.

Data Sovereignty

Organizations can't or won't share raw data. Hospitals, banks, and competitors can still collaborate on models.

Communication Cost

Edge devices generate massive data. When the raw data on a device is larger than the model, sending model updates is cheaper than sending the data itself.

User Trust

Users increasingly demand control over their data. "Your data never leaves your device" is a powerful promise.

2 The FedAvg Algorithm

Federated Averaging (McMahan et al., 2017) is the foundational federated learning algorithm.

Central Server

Holds global model weights. Never sees raw data.

Client 1: Action fans
Client 2: Comedy fans
Client 3: Drama fans
Client 4: Mixed taste
Client 5: Sci-fi fans
Client 6: Horror fans
Client 7: Romance fans
Client 8: Doc fans

One Round of FedAvg:

Broadcast: Server sends current global weights $w_t$ to K randomly selected clients
Train: Each selected client trains locally for E epochs on their private data, producing updated weights $w_k$
Upload: Each client sends back their updated weights and sample count $n_k$ (NOT their data)
Aggregate: Server computes weighted average of all updates to form the new global model

Federated Averaging: Weighted Aggregation
$w_{t+1} = \sum_{k=1}^{K} \frac{n_k}{n_{\text{total}}} \, w_k$, where $n_{\text{total}} = \sum_{k=1}^{K} n_k$
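The four steps and the weighted average above can be sketched in a few lines of NumPy. This is a minimal illustration, not the lab starter's code: the linear model, the function names (`fedavg_aggregate`, `local_train`, `fedavg_round`), the learning rate, and the synthetic client data are all assumptions chosen to keep the example small.

```python
import numpy as np

def fedavg_aggregate(client_weights, client_counts):
    """Weighted average of client weights: sum_k (n_k / n_total) * w_k."""
    counts = np.asarray(client_counts, dtype=float)
    stacked = np.stack(client_weights)  # shape (K, d)
    return (counts[:, None] * stacked).sum(axis=0) / counts.sum()

def local_train(w, X, y, epochs=5, lr=0.1):
    """E epochs of full-batch gradient descent on mean squared error (toy model)."""
    w = w.copy()  # never modify the broadcast weights in place
    for _ in range(epochs):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def fedavg_round(global_w, clients, num_selected, rng, epochs=5, lr=0.1):
    """One round: broadcast, local training, upload, aggregate."""
    chosen = rng.choice(len(clients), size=num_selected, replace=False)
    updates, counts = [], []
    for k in chosen:
        X, y = clients[k]  # private data, only read on the client side
        updates.append(local_train(global_w, X, y, epochs, lr))
        counts.append(len(y))  # only weights and n_k are "uploaded"
    return fedavg_aggregate(updates, counts)

# Synthetic clients of different sizes (hypothetical data, not MovieLens).
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for n in (20, 50, 80, 110):
    X = rng.normal(size=(n, 2))
    y = X @ true_w + 0.01 * rng.normal(size=n)
    clients.append((X, y))

w = np.zeros(2)
for _ in range(30):
    w = fedavg_round(w, clients, num_selected=3, rng=rng)
print(w)  # should land close to true_w
```

Weighting by `n_k` means a client with 110 samples pulls the global model harder than one with 20, which is exactly what the aggregation formula says.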

3 Non-IID Challenges

In the real world, each client's data is different. This is the non-IID problem and it makes federated learning harder.

IID (Ideal)

Every client has a representative sample of all data. Easy to converge.

Client 1: [Action, Comedy, Drama, ...]
Client 2: [Action, Comedy, Drama, ...]
# Same distribution everywhere

Non-IID (Reality)

Each client has biased data reflecting their preferences. Harder to converge.

Client 1: [Action, Action, Action, ...]
Client 2: [Comedy, Comedy, Comedy, ...]
# Very different distributions!

In our simulation: We partition MovieLens by user_id % NUM_CLIENTS, so all of a user's ratings land on the same client. Since users have different tastes, each client naturally gets non-IID data, with no synthetic manipulation needed.
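A rough sketch of that partitioning rule, plus a quick way to inspect how skewed each client ends up. The row layout `(user_id, item_id, genre)`, the toy ratings, and the helper names `partition_by_user` and `genre_shares` are assumptions for illustration; the starter's actual data loading may look different.

```python
from collections import Counter, defaultdict

NUM_CLIENTS = 8  # the lab simulates eight clients

def partition_by_user(ratings, num_clients=NUM_CLIENTS):
    """Send every row to client user_id % num_clients."""
    shards = defaultdict(list)
    for row in ratings:
        user_id = row[0]
        shards[user_id % num_clients].append(row)
    return dict(shards)

def genre_shares(rows):
    """Fraction of a client's rows that fall into each genre."""
    counts = Counter(genre for _, _, genre in rows)
    total = sum(counts.values())
    return {genre: count / total for genre, count in counts.items()}

# Toy ratings (hypothetical): (user_id, item_id, genre)
ratings = [
    (1, 10, "Action"), (1, 11, "Action"), (1, 12, "Comedy"),
    (2, 10, "Comedy"), (2, 13, "Comedy"),
    (9, 14, "Action"),   # 9 % 8 == 1, so user 9 shares a client with user 1
    (10, 15, "Drama"),   # 10 % 8 == 2, so user 10 shares a client with user 2
]

shards = partition_by_user(ratings)
for client_id in sorted(shards):
    print(client_id, genre_shares(shards[client_id]))
```

Comparing the per-client shares against the shares over the whole dataset is a simple first check of how far from IID a partition is.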

4 Tonight's Lab

$ python w23d4_starter.py

The starter auto-creates a virtual environment, downloads MovieLens 100K, trains a centralized baseline, then runs FedAvg across 8 simulated clients.

Your Mission:

1. Run Centralized Baseline: Train on all data combined; this is your upper bound
2. Implement FedAvg: Complete the broadcast, local training, and aggregation code
3. Compare: How close does federated get to centralized? What's the gap?
4. Experiment: Try more rounds, more local epochs, or add noise for differential privacy
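For the differential privacy experiment in step 4, one common pattern is to clip each client's update to a maximum L2 norm and then add Gaussian noise scaled to that norm. The sketch below assumes weights are flat NumPy arrays and that noise is added on the client before upload; `privatize_update`, `clip_norm`, and `noise_multiplier` are hypothetical names, and the code does not compute a privacy budget (epsilon), so treat it as a starting point rather than a calibrated DP mechanism.

```python
import numpy as np

def privatize_update(global_w, client_w, clip_norm, noise_multiplier, rng):
    """Clip the update (client_w - global_w) to L2 norm clip_norm, then add Gaussian noise."""
    delta = client_w - global_w
    norm = np.linalg.norm(delta)
    # Scale down only if the update is longer than clip_norm.
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    clipped = delta * scale
    # Noise standard deviation is proportional to the clipping bound.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=delta.shape)
    return global_w + clipped + noise

rng = np.random.default_rng(0)
global_w = np.zeros(2)
client_w = np.array([3.0, 4.0])  # update has L2 norm 5

noisy_w = privatize_update(global_w, client_w, clip_norm=1.0,
                           noise_multiplier=0.5, rng=rng)
print(noisy_w)
```

Larger `noise_multiplier` values hide individual contributions better but slow convergence, so it is worth plotting the accuracy gap against the noise level.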

5 Resources

FedAvg Explainer

Animated visualization of FedAvg rounds with weighted averaging demo

Watch FedAvg →

Non-IID Explainer

Compare IID vs non-IID distributions and see convergence impact

Explore Non-IID →

Federated Monitoring

How to monitor models when you can't see the training data

Learn Constraints →

Session Agenda

6:30 - 6:50: Why Federated Learning? Privacy + Regulation
6:50 - 7:20: FedAvg Algorithm Deep Dive
7:20 - 7:50: Non-IID Challenges + Convergence
7:50 - 8:30: Build Block: Implement FedAvg
8:30 - 9:00: Differential Privacy Extension
9:00 - 9:30: Centralized vs Federated Discussion