Understanding how data heterogeneity across federated clients affects model training, convergence, and accuracy in federated learning systems.
Toggle between IID (Independent and Identically Distributed) and Non-IID modes to see how data is partitioned across 6 federated clients. Each client represents a user with their own local dataset of movie ratings.
Compare how IID and Non-IID data distributions affect the global model's convergence during federated training. Non-IID data leads to client drift, causing oscillations and slower improvement in global model performance.
Understanding the theoretical foundations and practical implications of data distribution in federated learning.
Independent and Identically Distributed (IID) means that each client's data is drawn from the same underlying distribution and that samples are independent of one another. In federated learning, IID data means every client holds a representative subset of the full dataset: all genres, demographics, and patterns appear in roughly the same proportions on each device.
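As a rough illustration of the IID case, the sketch below shuffles a pool of genre-labelled samples and deals them out evenly to 6 clients. The genre names, the dataset size, and the helper names (`make_labels`, `iid_partition`, `genre_shares`) are assumptions made for this example, not part of the demo's actual implementation.

```python
import random
from collections import Counter

# Hypothetical genre labels, used only for illustration.
GENRES = ["action", "comedy", "drama", "horror", "romance"]

def make_labels(n_per_genre=600):
    """Build a balanced toy dataset: one genre label per sample."""
    return [g for g in GENRES for _ in range(n_per_genre)]

def iid_partition(labels, num_clients, seed=0):
    """Shuffle all samples, then deal them out round-robin to the clients."""
    rng = random.Random(seed)
    shuffled = labels[:]
    rng.shuffle(shuffled)
    return [shuffled[i::num_clients] for i in range(num_clients)]

def genre_shares(client_labels):
    """Fraction of a client's samples that fall in each genre."""
    counts = Counter(client_labels)
    return {g: counts[g] / len(client_labels) for g in GENRES}

if __name__ == "__main__":
    for i, part in enumerate(iid_partition(make_labels(), num_clients=6)):
        print(i, {g: round(s, 2) for g, s in genre_shares(part).items()})
```

Because every client receives a uniform random slice of the same pool, each client's genre shares should land close to the global share of 0.2 per genre, up to sampling noise.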
In practice, user data is almost never IID. People have unique preferences, behaviors, and contexts. A horror fan's watch history looks nothing like a rom-com enthusiast's. Geographic, demographic, and temporal factors all create systematic differences between clients, making non-IID the default condition for federated systems.
When a client trains on skewed local data, its model parameters "drift" away from the global optimum and toward the optimum of its own local objective. After several local SGD steps, the client's model becomes specialized for its own distribution. Averaging these divergent models produces a global model that may perform poorly for all clients, not just some.
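Client drift can be reproduced on a deliberately tiny problem. The sketch below assumes each client has a one-dimensional quadratic loss f_i(w) = a_i/2 * (w - c_i)^2 and that the server averages client models with equal weights after each round. The function names and the specific curvature and center values are illustrative assumptions, not taken from the demo.

```python
def local_sgd(w, curvature, center, lr, steps):
    """Run `steps` gradient steps on f(w) = curvature / 2 * (w - center)**2."""
    for _ in range(steps):
        w -= lr * curvature * (w - center)
    return w

def fedavg(clients, lr, local_steps, rounds, w0=0.0):
    """Equal-weight federated averaging over (curvature, center) clients."""
    w = w0
    for _ in range(rounds):
        local_models = [local_sgd(w, a, c, lr, local_steps) for a, c in clients]
        w = sum(local_models) / len(local_models)
    return w

def global_optimum(clients):
    """Minimizer of the average of the client losses."""
    return sum(a * c for a, c in clients) / sum(a for a, _ in clients)

if __name__ == "__main__":
    heterogeneous = [(1.0, 0.0), (4.0, 1.0)]  # clients disagree on the optimum
    homogeneous = [(1.0, 0.5), (4.0, 0.5)]    # clients share the same optimum
    print("global optimum:", global_optimum(heterogeneous))
    print("1 local step:  ", fedavg(heterogeneous, 0.1, 1, 200))
    print("20 local steps:", fedavg(heterogeneous, 0.1, 20, 200))
    print("shared optimum:", fedavg(homogeneous, 0.1, 20, 200))
```

With one local step per round, the heterogeneous clients reach the global optimum of 0.8. With 20 local steps, each client moves most of the way to its own optimum before averaging, and the result settles near 0.53 instead. When the clients share the same optimum, the number of local steps makes no difference to where training ends up.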
The most common form of non-IID data in recommendation systems is label distribution skew, where different clients have different proportions of each class or genre. Other forms include feature skew (the same label is associated with different features on different clients), quantity skew (dataset sizes vary across clients), and concept drift (distributions change over time).
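One common way to simulate label distribution skew is to split each label's samples across clients using proportions drawn from a Dirichlet distribution. This is a minimal sketch of that technique, not necessarily how this demo partitions its data. The concentration parameter `alpha` and the helper names are assumptions for illustration: a small `alpha` concentrates each label on a few clients, while a large `alpha` approaches an even split.

```python
import random
from collections import Counter

def dirichlet(alpha, k, rng):
    """Sample a symmetric Dirichlet(alpha) vector via normalized gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def label_skew_partition(labels, num_clients, alpha, seed=0):
    """Return one list of sample indices per client, skewed by label."""
    rng = random.Random(seed)
    by_label = {}
    for idx, lab in enumerate(labels):
        by_label.setdefault(lab, []).append(idx)
    clients = [[] for _ in range(num_clients)]
    for idxs in by_label.values():
        rng.shuffle(idxs)
        shares = dirichlet(alpha, num_clients, rng)
        start, cum = 0, 0.0
        for c, share in enumerate(shares):
            cum += share
            # The last client takes the remainder so no sample is dropped.
            end = len(idxs) if c == num_clients - 1 else round(cum * len(idxs))
            clients[c].extend(idxs[start:end])
            start = end
    return clients

def mean_tv_distance(labels, clients):
    """Average total variation distance between client and global label mixes."""
    n = len(labels)
    global_p = {lab: cnt / n for lab, cnt in Counter(labels).items()}
    dists = []
    for idxs in clients:
        if not idxs:  # skip clients that received no samples
            continue
        counts = Counter(labels[i] for i in idxs)
        dists.append(0.5 * sum(abs(counts[lab] / len(idxs) - p)
                               for lab, p in global_p.items()))
    return sum(dists) / len(dists)
```

The total variation distance gives a simple per-client skew score: 0 means the client's label mix matches the global mix, and values near 1 mean the client holds almost entirely different labels.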
The MovieLens dataset is a natural example of non-IID federated data. When we treat each user as a federated client, the inherent non-IID nature becomes clear: