Adventures in Data Land
Parallel Stochastic Gradient Descent

Here’s the problem: you’ve optimized your stochastic gradient descent library but the code is still not fast enough. When streaming data off a disk/network you cannot exceed 100 MB/s, so processing a 1TB of data will take about 3-5 hours on a single machine. And classical stochastic gradient descent is sequential even in the context of multicore machines. Before looking at ways to speed up the algorithm let’s look at the basic SGD template:

Here f(x) is the loss function and Ω[x] is the regularizer. This procedure is entirely sequential, so how to accelerate it?

Multicore

One option is to execute the above loop in each core independently while working with a common shared weight vector x. This means that if we update x in a round-robin fashion we will have a delay of k-1 when using k cores. The delay is due to the time it takes between seeing an instance of f and when we can actually apply the update to x. The beauty of this approach is that one can prove convergence (thanks to Marty Zinkevich and John Langford) without losing anything in the rates relative to the single core setting. The intuition behind it is that as we converge optimization becomes more and more an averaging procedure and in this case a small delay does not hurt at all.

Multiple Machines

The Achilles heel of the above algorithm is that it requires extremely low latency between the individual processing nodes. While this is OK on a computer or on a GPU, it will fail miserably on a network of computers since we may have many miliseconds latency between the nodes. This suggests an even simpler algorithm for further parallelization:

  1. Overpartition the data into k blocks for k clusters (i.e. for each cluster simply randomly draw a fraction c = O(m/k) of the entire dataset). 
  2. Now perform stochastic gradient descent on each machine separately with constant learning rate.
  3. Average the solutions between different machines. 

Surprisingly enough this method can be shown to converge and give optimal speedup. A paper describing the proof for this simple algorithm is currently under review (thanks to Marty Zinkevich, Markus Weimer and Lihong Li). 

Lazy updates for generic regularization in SGD

Yesterday I wrote about how to do fast stochastic gradient descent updates for quadratic regularization. However, there are lots more regularizers which one would want to use for sparse updates of high-dimensional parameter vectors. In general the problem looks like this:

Whenever the gradients in stochastic gradient descent are sparse and whenever x is very high dimensional it would be very wasteful to use the standard SGD updates. You could use very clever data structures for storing vectors and updating. For instance John Duchi and Yoram Singer wrote a beautiful paper to address this. Or you could do something really simple … 

Novi Quadrianto and I ran into this problem when trying to compute an optimal tiering of webpages where we had a few billion pages that needed allocating over a bunch of storage systems. Each update in this case only dealt with a dozen or so pages, hence our gradients were very very sparse. So we used this trick:

  1. Instead of storing x, also store a time-stamp when a particular coordinate was addressed last. 
  2. In between updates caused by f, all updates for a coordinate look as follows:
  3. This is a difference equation. We can approximate this by a differential equation and accumulate over the coordinate changes while no updates from f but rather just the regularizer occur. So we get the following updates
  4. Now all you need to do is solve the differential equation once and you’re done. If you don’t want to keep the partial sums over η around (this can be costly if the updates are infrequent) you could use the approximation
  5. This saves a few GB whenever the updates only occur every billion steps …

This algorithm would obviously work for l1 programming. But it’ll work just as well for any other regularizer you could think of. 

Fast quadratic regularization for online learning

After a few discussions within Yahoo I’ve decided to post a bunch of tricks here. A lot of these are well known. Some others might be new. They’re small hacks, too small to put into a paper on their own, unless we allow 1 page papers. But they can make a big difference to make your code go fast or do useful things. So here it goes:

Assume you want to minimize the objective

In this case you could perform updates of the form

The problem here is that if the gradients are sparse, say 1000 features for a document while x may have millions, maybe even billions of nonzero coefficients, this update is extremely slow: for each meaningful coefficient update we also update thousands, maybe millions of coefficients that just shrink summarily. So what to do: keep a scaling coefficient, say c, separately. This yields:

Now we have simple update for the scaling coefficient and a sparse update for the parameter vector. As far as I recall this trick can be found e.g. in Leon Bottou’s stochastic gradient solver. We also used this in the implementation of the Kivinen, Williamson & Smola paper from 1998 on stochastic gradient descent for kernels.