Log-probabilities, semirings and floating point numbers
Here’s a trick/bug that is a) really well known in the research community, b) lots of beginners get it wrong nonetheless, c) simple unit tests may not detect it and d) it may be a really fatal bug in your code. You can even find it in large scale machine learning toolboxes.
Suppose you want to do mixture clustering and you happily compute p(x|y) and p(y), say by a mixture of Gaussians. Quite often you’ll need access to p(y|x) which can be computed via
Let’s assume that your code runs well in 2D but now you try it in 100 dimensions and it fails with a division by zero error. What’s going on? After all, your code is algebraically correct. Most likely you ignored the fact that floating point numbers have only a fixed precision and by exponentiating the argument of the Gaussian (recall, we have a normalization that is exponential in the number of dimensions) you encountered numerical underflow. What happened is that you were trying to store the relevant information in the exponent rather than the mantissa of a floating point number. On single precision you have 8 bit for the exponent and 23 for the mantissa and you just ran out of storage.
Here’s how you can pull things back into the mantissa - simply store log probabilities and operate with them. Hence instead of + and x you should use ‘log +’ and +.
In this case you will not get numerical underflow when adding probabilities since the argument in the log is going to be O(1) (we have at least one term which is exp(0) and the remaining terms are smaller than 1), or at least not due to running out of precision in the exponent.
In more general terms, what happened is that we replaced the operations ‘+’ and ‘x’ by two operations ‘log +’ and ‘+’ which act just in the same way as addition and multiplication. Aji and McEliece formalized this in their paper on the Generalized Distributive Law. Systems that satisfy these operations are called commutative semirings. Some examples are:
- Set union (+) and intersection (x)
- The tropical semiring which uses min (+) and + (x)
- Boolean AND (x) and OR (+)
- Plain old calculus with addition (+) and multiplication (x)
- The log + semiring with log + (+) and + (x)
Replacing these symbols in well known algorithms such as dynamic programming gives the forward-backward algorithm, shortest path calculations and others, but this is a story for another day.
Parallel Stochastic Gradient Descent
Here’s the problem: you’ve optimized your stochastic gradient descent library but the code is still not fast enough. When streaming data off a disk/network you cannot exceed 100 MB/s, so processing a 1TB of data will take about 3-5 hours on a single machine. And classical stochastic gradient descent is sequential even in the context of multicore machines. Before looking at ways to speed up the algorithm let’s look at the basic SGD template:
Here f(x) is the loss function and Ω[x] is the regularizer. This procedure is entirely sequential, so how to accelerate it?
One option is to execute the above loop in each core independently while working with a common shared weight vector x. This means that if we update x in a round-robin fashion we will have a delay of k-1 when using k cores. The delay is due to the time it takes between seeing an instance of f and when we can actually apply the update to x. The beauty of this approach is that one can prove convergence (thanks to Marty Zinkevich and John Langford) without losing anything in the rates relative to the single core setting. The intuition behind it is that as we converge optimization becomes more and more an averaging procedure and in this case a small delay does not hurt at all.
The Achilles heel of the above algorithm is that it requires extremely low latency between the individual processing nodes. While this is OK on a computer or on a GPU, it will fail miserably on a network of computers since we may have many miliseconds latency between the nodes. This suggests an even simpler algorithm for further parallelization:
- Overpartition the data into k blocks for k clusters (i.e. for each cluster simply randomly draw a fraction c = O(m/k) of the entire dataset).
- Now perform stochastic gradient descent on each machine separately with constant learning rate.
- Average the solutions between different machines.
Surprisingly enough this method can be shown to converge and give optimal speedup. A paper describing the proof for this simple algorithm is currently under review (thanks to Marty Zinkevich, Markus Weimer and Lihong Li).
Hashing for linear functions
This is the first of a few posts on hashing. It’s an incredibly powerful technique when working with discrete objects and sequences. And it’s also idiot-proof simple. I decided to post it after listening to a talk this morning where the speaker proposed something really complicated for what should be very simple.
A hash function is (in its ideal CS version) a mapping h from a domain X into a set [0 … N-1] of integers such that h(x) is essentially random. A fast implementation is e.g. Murmur hash. Note that as long as the hash is good it does not matter which one you use (I have received the latter question at least twice from referees in the past. It’s about as meaningful as asking which random number generator was used for sampling).
When we want to compute inner products of the form
we might run into a big problem - whenever the feature map is too high-dimensional we will not have enough memory to store w. A second problem is that we might have to store a dictionary for computing the features of x, e.g. when we have a bag of words representation. It is quite likely that the dictionary will take up much more memory than the actual parameter vector w. On top of that, we need to write code to generate the dictionary. Instead of that, let’s be lazy and do the following:
Here h is a hash function with range small enough to fit w into memory and σ is a binary (+1, -1) hash function. This looks pretty simple. It’s essentially the sketch that Charikar, Chen and Farrach-Colton suggested in 2003 for data streams. But it’s also very powerful. In particular one can show the following:
- The inner products are unbiased relative to the original linear function.
- The variance decreases with 1/N.
- No dictionary is needed. The hash function acts as our lookup device for the index. This is a massive space saver and it also is typically much faster.
- We can use this even when we don’t know the full dictionary at the time we start running an online solver. E.g. whenever we get new users we don’t need to allocate more memory for them.
- This can easily be extended to approximate matches. Say I have the string ‘hash’. Then I can simply insert the strings ‘*ash’, ‘h*sh’, ‘ha*h’, ‘has*’ to represent approximate matches.
- We can also extend it to substrings - simply add them to the hash table. This yields e.g. ‘sub’, ‘ubs’, ‘bst’, ‘str’, ‘tri’, ‘rin’, ‘ing’, ‘ngs’ when we are interested in trigrams. In general for all n grams the cost is linear in n and the length of the document.
- Adding some weighting to the features is trivial. Just do it. Go wild. Add meaningful substrings. Personalize for users. Add tasks. Use for social networks. Use collaborative filtering (more on this later).
To get started have a look at John Langford’s Vowpal Wabbit or James Petterson’s Stream. The theory can be found in this AISTATS and this ICML paper (note: the proof of the tail bound in the ICML paper is broken).
P.S.: This is a repost. Tumblr seems to have screwed up management of some of the post IDs.
In praise of the Second Binomial Formula
Here’s a simple trick you can use to compute pairs of distances: use the second binomial formula and a linear algebra library. These problems occur in RBF kernel computations (this is how it’s implemented e.g. in the Elefant library) and in k-nearest neighbor and EM clustering algorithms.
The problem: given sets X and Z compute the matrix of pairwise squared distances.
Simple solution: use a double for loop, take differences, square and be happy. This is slow for a number of reasons - it isn’t optimized for the architecture, it requires you to take differences, and it also means that you aren’t using memory blocking for it. Plus it’d be quite some work to write a multithreaded code for it.
Fast solution: use the second binomial formula to decompose the distance into norms of vectors in X and Z and an inner product between them.
To compute the latter you can simply use BLAS, or more specifically SGEMM or DGEMM for single or double precision respectively. A quad-core Xeon peaks out at 100GFlops for DGEMM, so computing all pairwise distances for |X| = |Z| = 100k and 1000 dimensions should take in the order of 100-200 seconds (I’m throwing in a 2x penalty for the overhead of computing distances). On a GPU this would be even faster. If you compute it one row at a time you should see about a 10x penalty in speed since DGEMV is much slower than DGEMM.
For sparse matrices you could use an inverted index (Jon Baxter suggested this), or you could be lazy and simply let a sparse matrix linear algebra library such as OSKI or Intel MKL do the work for you.
Lazy updates for generic regularization in SGD
Yesterday I wrote about how to do fast stochastic gradient descent updates for quadratic regularization. However, there are lots more regularizers which one would want to use for sparse updates of high-dimensional parameter vectors. In general the problem looks like this:
Whenever the gradients in stochastic gradient descent are sparse and whenever x is very high dimensional it would be very wasteful to use the standard SGD updates. You could use very clever data structures for storing vectors and updating. For instance John Duchi and Yoram Singer wrote a beautiful paper to address this. Or you could do something really simple …
Novi Quadrianto and I ran into this problem when trying to compute an optimal tiering of webpages where we had a few billion pages that needed allocating over a bunch of storage systems. Each update in this case only dealt with a dozen or so pages, hence our gradients were very very sparse. So we used this trick:
- Instead of storing x, also store a time-stamp when a particular coordinate was addressed last.
- In between updates caused by f, all updates for a coordinate look as follows:
- This is a difference equation. We can approximate this by a differential equation and accumulate over the coordinate changes while no updates from f but rather just the regularizer occur. So we get the following updates
- Now all you need to do is solve the differential equation once and you’re done. If you don’t want to keep the partial sums over η around (this can be costly if the updates are infrequent) you could use the approximation
- This saves a few GB whenever the updates only occur every billion steps …
This algorithm would obviously work for l1 programming. But it’ll work just as well for any other regularizer you could think of.
Easy kernel width choice
This is an idea that was originally put forward by Bernhard Schölkopf in his thesis: Assume you have an RBF (radial basis function) kernel and you want to know how to scale it. Recall that such a kernel is given by
We want that the argument of this function is O(1) for typical pairs of instances x and x’. Bernhard proposed to look at the dimensionality of x and rescale accordingly. This is a great heuristic. But it ignores correlation between the coordinates. A much simpler trick is to pick, say 1000 pairs (x,x’) at random from your dataset, compute the distance of all such pairs and take the median, the 0.1 and the 0.9 quantile. Now pick λ to be the inverse any of these three numbers. With a little bit of crossvalidation you will figure out which one of the three is best. In most cases you won’t need to search any further.
Fast quadratic regularization for online learning
After a few discussions within Yahoo I’ve decided to post a bunch of tricks here. A lot of these are well known. Some others might be new. They’re small hacks, too small to put into a paper on their own, unless we allow 1 page papers. But they can make a big difference to make your code go fast or do useful things. So here it goes:
Assume you want to minimize the objective
In this case you could perform updates of the form
The problem here is that if the gradients are sparse, say 1000 features for a document while x may have millions, maybe even billions of nonzero coefficients, this update is extremely slow: for each meaningful coefficient update we also update thousands, maybe millions of coefficients that just shrink summarily. So what to do: keep a scaling coefficient, say c, separately. This yields:
Now we have simple update for the scaling coefficient and a sparse update for the parameter vector. As far as I recall this trick can be found e.g. in Leon Bottou’s stochastic gradient solver. We also used this in the implementation of the Kivinen, Williamson & Smola paper from 1998 on stochastic gradient descent for kernels.