Train models without sharing data
W23D4 - Federated Averaging
Collect all data on one server, train normally
model.fit(all_data) # Privacy risk
Data stays on clients, share only model weights
server.aggregate(client_weights) # Data never leaves
Same model quality (or close to it), fundamentally different privacy properties.
Four forces are pushing ML toward federated approaches:
GDPR, CCPA, HIPAA restrict data collection and centralization. Federated learning keeps data where it originated.
Organizations can't or won't share raw data. Hospitals, banks, and competitors can still collaborate on models.
Edge devices generate massive data. Sending model updates is cheaper than sending raw data.
Users increasingly demand control over their data. "Your data never leaves your device" is a powerful promise.
Federated Averaging (McMahan et al., 2017) is the foundational federated learning algorithm.
The server holds the global model weights and never sees raw data.
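The aggregation step can be sketched in a few lines. This is a minimal illustration, not the starter's actual code: the function name fedavg_aggregate, the list-of-lists weight layout, and the toy numbers are all assumptions. It weights each client's parameters by its share of the total training examples, which is the averaging rule used in the FedAvg paper.

```python
from typing import List, Sequence

# Assumed layout: a model is a list of layers, each layer a flat list of floats.
Weights = List[List[float]]


def fedavg_aggregate(client_weights: Sequence[Weights],
                     client_sizes: Sequence[int]) -> Weights:
    """Average client weights, weighting each client by its example count."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need at least one client and one size per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total number of examples must be positive")

    global_weights: Weights = []
    for layer in range(len(client_weights[0])):
        averaged = [0.0] * len(client_weights[0][layer])
        for weights, n in zip(client_weights, client_sizes):
            share = n / total  # this client's fraction of all examples
            for i, value in enumerate(weights[layer]):
                averaged[i] += share * value
        global_weights.append(averaged)
    return global_weights


# Toy example: client B has three times as much data as client A,
# so the result sits closer to B's weights.
client_a = [[1.0, 1.0], [0.0]]
client_b = [[4.0, 4.0], [2.0]]
new_global = fedavg_aggregate([client_a, client_b], client_sizes=[1, 3])
print(new_global)  # [[3.25, 3.25], [1.5]]
```

Only the weight lists and the example counts reach the server here; the training examples themselves are never passed in.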
In the real world, each client's data is different. This is the non-IID problem and it makes federated learning harder.
IID: every client has a representative sample of all the data, so training converges easily.
Client 1: [Action, Comedy, Drama, ...]
Client 2: [Action, Comedy, Drama, ...]
# Same distribution everywhere
Non-IID: each client has biased data reflecting its own preferences, so training is harder to converge.
Client 1: [Action, Action, Action, ...]
Client 2: [Comedy, Comedy, Comedy, ...]
# Very different distributions!
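One way to put a number on how different two clients are is to compare their label histograms. The sketch below uses total variation distance, which is a choice made here for illustration and not something the starter is known to compute; the helper names are hypothetical. A value of 0 means identical distributions and 1 means no overlap at all.

```python
from collections import Counter
from typing import Dict, Sequence


def label_distribution(labels: Sequence[str]) -> Dict[str, float]:
    """Turn a list of labels into a {label: fraction} histogram."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Half the sum of absolute differences between two histograms."""
    all_labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in all_labels)


# IID case: both clients see the same genre mix.
iid_gap = total_variation(
    label_distribution(["Action", "Comedy", "Drama"]),
    label_distribution(["Action", "Comedy", "Drama"]),
)

# Non-IID case: each client sees a single genre.
skew_gap = total_variation(
    label_distribution(["Action", "Action", "Action"]),
    label_distribution(["Comedy", "Comedy", "Comedy"]),
)

print(iid_gap, skew_gap)  # 0.0 1.0
```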
In our simulation, we partition MovieLens by user_id % NUM_CLIENTS.
Since users have different tastes, each client naturally gets non-IID data, with no synthetic manipulation needed.
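The partitioning rule can be sketched as follows. This is an illustration under assumptions, not the starter's code: the function name partition_by_user and the (user_id, movie_id, rating) row layout are made up for the example, and the toy rows are not real MovieLens data. NUM_CLIENTS is set to 8 to match the simulation.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

NUM_CLIENTS = 8  # the simulation runs FedAvg across 8 clients

# Assumed row layout: (user_id, movie_id, rating).
Rating = Tuple[int, int, float]


def partition_by_user(ratings: Iterable[Rating],
                      num_clients: int = NUM_CLIENTS) -> Dict[int, List[Rating]]:
    """Assign every rating to client number user_id % num_clients."""
    shards: Dict[int, List[Rating]] = defaultdict(list)
    for row in ratings:
        user_id = row[0]
        shards[user_id % num_clients].append(row)
    return dict(shards)


# Toy rows: users 1 and 9 both map to client 1, user 2 maps to client 2.
toy_ratings = [(1, 50, 5.0), (9, 50, 3.0), (2, 172, 4.0), (1, 181, 4.0)]
shards = partition_by_user(toy_ratings)
for client_id in sorted(shards):
    print(client_id, shards[client_id])
```

Because the client is chosen from user_id alone, all of a given user's ratings end up on the same client, so each shard reflects the tastes of the users assigned to it.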
The starter auto-creates a virtual environment, downloads MovieLens 100K, trains a centralized baseline, then runs FedAvg across 8 simulated clients.
Animated visualization of FedAvg rounds with weighted averaging demo