Adventures in Data Land
Lazy updates for generic regularization in SGD

Yesterday I wrote about how to do fast stochastic gradient descent updates for quadratic regularization. However, there are lots more regularizers which one would want to use for sparse updates of high-dimensional parameter vectors. In general the problem looks like this:

Whenever the gradients in stochastic gradient descent are sparse and whenever x is very high dimensional it would be very wasteful to use the standard SGD updates. You could use very clever data structures for storing vectors and updating. For instance John Duchi and Yoram Singer wrote a beautiful paper to address this. Or you could do something really simple … 

Novi Quadrianto and I ran into this problem when trying to compute an optimal tiering of webpages where we had a few billion pages that needed allocating over a bunch of storage systems. Each update in this case only dealt with a dozen or so pages, hence our gradients were very very sparse. So we used this trick:

  1. Instead of storing x, also store a time-stamp when a particular coordinate was addressed last. 
  2. In between updates caused by f, all updates for a coordinate look as follows:
  3. This is a difference equation. We can approximate this by a differential equation and accumulate over the coordinate changes while no updates from f but rather just the regularizer occur. So we get the following updates
  4. Now all you need to do is solve the differential equation once and you’re done. If you don’t want to keep the partial sums over η around (this can be costly if the updates are infrequent) you could use the approximation
  5. This saves a few GB whenever the updates only occur every billion steps …

This algorithm would obviously work for l1 programming. But it’ll work just as well for any other regularizer you could think of.