This article gives a quick insight into federated machine learning and why it is popular today.
A cloud refers to a server, public or private, that is accessible over a network, usually the internet. It generally offers high processing power and storage, making it suitable for large computations. A cloud can be used to train AI models, but only when the data is available to it. However, since the cloud is typically a remote system, capturing data directly on it is rarely feasible, and capturing data on local devices and transmitting it to the cloud does not always yield real-time results. This is where federated learning comes in.
Federated learning enables machine learning while the data stays on the device. It uses a flexible architecture that supports a secure process for handling sensitive data and training models. Data privacy is now widely treated as an important responsibility, yet to introduce automation into fields like healthcare, biometrics, etc, real-time sensitive data is the core requirement. The important question therefore is: how do we train a model without collecting users' sensitive data, storing it centrally, and training on it? For optimal training of a machine learning model, the relevance of the data is a key consideration. Data is best when it comes directly from the source, but obtaining permission to collect it becomes an issue.
With federated machine learning, model training can be centralised over a decentralised data feed. The model is trained on the source device, whose configuration is first checked to confirm that it can handle the training. Data sources are selected based on how well they can provide the data. Once the model trains on the device, it sends the training results (not the data) to the server, and the training results from the other edge devices are sent to the server in the same way. Each device has a training threshold to avoid over-learning or exposing results traceable to unique data, a safeguard referred to as 'differential privacy'. Simply put, the model 'trains enough to remain unknown': with differential privacy in place, what the model memorises cannot be traced back to a particular user or device.
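The round described above can be sketched in code. The following is a minimal, illustrative simulation of one federated averaging scheme with a simple differential-privacy step (clip each device's update, then add noise before sharing it). All function and parameter names here are hypothetical, not from a real federated learning library, and the "devices" are simulated in one process.

```python
import numpy as np

def local_update(global_weights, local_data, lr=0.1):
    """Simulate on-device training: one gradient step of linear regression.
    The raw data (X, y) never leaves this function, i.e. the 'device'."""
    X, y = local_data
    grad = X.T @ (X @ global_weights - y) / len(y)
    return global_weights - lr * grad

def federated_round(global_weights, client_datasets, clip_norm=1.0, noise_std=0.01):
    """One round: each device trains locally, then shares only a clipped,
    noised weight update; the server averages the updates."""
    updates = []
    for data in client_datasets:
        new_w = local_update(global_weights, data)
        delta = new_w - global_weights              # only the update is shared
        norm = np.linalg.norm(delta)
        if norm > clip_norm:                        # bound any one device's influence
            delta = delta * (clip_norm / norm)
        delta = delta + np.random.normal(0, noise_std, delta.shape)  # DP-style noise
        updates.append(delta)
    return global_weights + np.mean(updates, axis=0)

# Simulate five devices, each holding its own slice of data.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(5):
    X = rng.normal(size=(20, 2))
    clients.append((X, X @ true_w))

w = np.zeros(2)
for _ in range(50):
    w = federated_round(w, clients)
# w now approximates true_w, although no device ever sent its data.
```

The key design point is that the server only ever sees the averaged, clipped, noised updates; the clipping threshold plays the role of the 'training threshold' mentioned above.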
Once training is complete, the model files are removed from the device to ensure no violation of privacy takes place. Server-device communication is pipelined with secure aggregation, which lets the server combine encrypted results and decrypt only the aggregates.
A secure aggregation protocol masks the training results, scrambling them in such a way that the masks add up to zero across devices. Once training succeeds and the results are sent to the server, testing is carried out on other devices that were selected as data sources but not used for training. In lay terms, each device acts as a data sample: some devices are used for training and some for testing.