Recently, we’ve seen the power of large-scale machine learning projects. Projects like GPT-2 and BigGAN involve model and dataset sizes that would be unfathomable only a few years ago, and they they are increasingly dominating the state of the art. Underlying the push for larger models and larger, more complex, and more diverse datasets is the common belief that all we need to do is scale up. But a couple different (orthogonal) approaches seem to be emerging to achieve this scale:

1. Better hardware. GPUs weren’t designed for deep learning, so why stop here? FPGAs, TPUs, neuromorphic computing all fall under this category.
2. More efficient algorithms and architectures. The earliest deep learning architectures, as it turns out, used many more parameters and computation than was necessary. EfficientNet comes to mind.
3. Distributed systems. This blog post will focus exclusively on this approach to ML scale.

## Why distributed ML?

OpenAI has observed that since 2012 AI model size is growing faster than Moore’s law, with the number of flops used for training doubling every 3.5 months. While they acknowledge that hardware has improved, they attribute the majority of this increase in size to distributed training.

But although we haven’t hit a wall in hardware yet, the same fundamental, physical limitations will present themselves, and so I would argue that multi-machine training will become increasingly important as we try to scale up models.

Sidenote: It’s important to make the distinction between distributed training and distributed inference. The case for distributed inference is somewhat obvious: if your ML service gets a lot of requests, you want to handle that load, so you distribute your model onto many machines. But scaling training is what leads to improved model accuracy, and it is a more challenging problem.

## One way to distribute training

Let’s assume we have a loss function of the form (empirical risk minimization):

$$L(\theta) = \frac{1}{N} \sum_{i=1}^N f(x_i; \theta)$$

Now notice that we can partition the dataset into $$M$$ parts and average over those parts $$L(\theta) = \frac{1}{M} \sum_{i\in [M]} \frac{1}{|B_i|} \sum_{i \in B_i} f(x_i; \theta)$$ This shows that we can split up our dataset across many machines and compute the loss of each partition of the dataset independently. If we want to compute the loss, we just average the per-machine losses!

Because of linearity, this also holds for the gradients:

$$\nabla L(\theta) = \frac{1}{M} \sum_{i\in [M]} \frac{1}{|B_i|} \sum_{i \in B_i} \nabla f(x_i; \theta)$$ This shows how we can use data parallelism to train a model with large batch sizes over multiple machines. On each iteration, each machine samples a random subset of their partition of the data, computes a gradient estimate $$g_{B_i}(\theta)$$ using backpropagation on that subset, and sends its gradient to a centralized parameter server that averages the received gradients and updates the parameters. The parameter server then sends its parameters back to the workers so that they can update their local copy of the model.

This gives an idea of how we can train models with massive batch sizes (since the effective batch size is the sum of the batch sizes for each machine). Larger batch sizes reduce gradient norms, which in in turn improve the convergence rate of SGD. (Large batch sizes have been used to train models on ImageNet in less than one hour.)

## Distributed ML without trust

Bitcoin/cryptocurrency mining consumes a substantial percentage of the world’s electricity. It has also increased the price of GPUs at a time when they are increasingly valuable for deep learning.

The question arises: can we harness the funamentally distributed and decentralized nature of computation in cryptocurrencies to train ML models? Bitcoin has demonstrated that people and organizations are willing to assemble a large amount of hardware and use electricity to make a small amount of money in return. Maybe we could do the same thing with machine learning?

Here’s a toy model for this type of approach:

1. The requester submits its “problem” to a decentralized network of nodes. This “problem” consists of a link to a dataset that can be downloaded in partitions.
2. The nodes in this network compute gradients on their respective subsets of the data, and send this back to the requester.
3. The requester can either trust these gradients, or compute the change in the loss itself. (in this setting, doing the latter is infeasible, so this might require new improvements in cheap gradient verification)
4. The request pays the nodes a small amount for their gradients. This gives the nodes an incentive to continue computing gradients for the requester.

Here are some problems with this approach

1. Network overhead Network latency poses a problem for this approach. If a single SGD update requires us to receive the new gradients, update our local copy of the parameters, and broadcast our copy of the parameters out to the workers, we may have to wait a significant amount of time until we can compute the next update.
2. Ensuring consistency. Like all distributed systems, we need to ensure that all participants have the same copy of the parameters. I.e., how do we know that we are getting fresh gradients from a node? Ensuring consistency in an adversarial environment is more challenging.
3. Trust. How do we know that the gradients that we receive from each of the workers is correct, i.e., that it is an unbiased estimator of the true gradient for that partition? How do the workers know that they will receive payment for computing the gradients?

The final of these is a difficult challenge, but there has been some promising work in the area. In particular, verifiable computation is the notion that we can provide a proof (certificate) with the output of our computation and use the proof to check that the output is correct without doing the computation ourselves. Verifiable computation has been extensively explored in cryptocurrencies, but there has already been some preliminary work to verify the output of a neural network. If we can verify the output of the model, there is promise that we can use similar techniques to verify

Another way to improve trust is to use a voting-style algorithm for gradients. In this situation, small subsets of workers compute gradients on the same examples and are synchronized. The requester only “accepts” the gradient returned by the majority. If any one of them defects from the majority, the requester does not pay them. This is vulnerable if a majority of workers “collude” against the requester, but it is possibly cheaper and simpler than verifiable computation.

The second trust issue can be mitigated by having a tight feedback loop between the requester and the workers; i.e., they receive small micropayments for computing individual gradients rather than one bulk payment at the end. This allows the workers to stop computing gradients if they don’t get payment (a kind of tit-for-tat relationship).

## Federated learning

Sometimes, the data that we train on is more valuable than the compute, and the data itself is already distributed across the network. In this setting, we might want a network of nodes (e.g., sensors, users, cameras, medical devices) to provide gradients for our model but not give away their data. This is the federated learning setting.

The idea is similar: ask each worker to train a local model of their data, or even compute gradient estimates on their local copy of the data and send it to a central server. With some tricks (e.g., adding Gaussian noise to the gradients), the hope is that the requester can’t reconstruct the data from these messages. Because we can’t ask the workers to give up their data, we have to trust the workers that their gradients are correct. But this is another interesting setting where we can harness a large number of workers to train a model that would otherwise be infeasible.