This is a follow-up on the hashing for linear functions post. It’s based on the HashCoFi paper that Markus Weimer, Alexandros Karatzoglou and I wrote for AISTATS’10. It deals with the issue of running out of memory when you want to use collaborative filtering for very large problems. Here’s the setting:

Assume you want to do Netflix-style collaborative filtering, i.e. you want to estimate entries in a ratings matrix of (user, movie) pairs. A rather effective approach is to use matrix factorization, that is, to approximate \(M = U^\top V\) where M is the ratings matrix, U is the (tall and skinny) matrix of features for each user, stacked up, and V is the counterpart for movies. This works well for the Netflix prize since the number of users and movies is comparatively small.

In reality we might have, say, 100 million users for whom we want to recommend products. One option is to distribute these users over several servers (similar to what a distributed hash table does, e.g. in libmemcached). Alternatively, if we want to keep it all on one server, we're facing the problem of having to store \(10^8 \cdot 100 \cdot 4 = 4 \cdot 10^{10}\) bytes, i.e. 40 GB, if we allocate 400 bytes per user, that is, 100 dimensions at 4 bytes each (a rather small footprint). Usually this is too big for all but the biggest servers. Even worse, suppose we have user churn: new users arrive while old users disappear (obviously we don't know whether they'll ever come back, so we don't really want to de-allocate the memory devoted to them). We cannot simply keep adding more RAM. One possible solution is to store the data on disk and fetch it whenever a user arrives. This costs us 5-10ms of latency per request. An SSD will improve this but it still limits throughput. Moreover, it requires cache management algorithms to interact with the collaborative filtering code.

Here’s a simple alternative: apply the hashing trick that we used for vectors to matrices. Recall that in the exact case we compute matrix entries via

$$M[i,j] = \sum_{k=1}^{K} U[i,k] V[j,k]$$

Now denote by \(h_u\) and \(h_v\) hash functions mapping pairs of integers to a given hash range \([1 \ldots N]\). Moreover, let \(\sigma_u\) and \(\sigma_v\) be corresponding Rademacher hash functions which return a sign in \(\{\pm 1\}\). Now replace the above sum by

$$M[i,j] = \sum_{k=1}^{K} u[h_u(i,k)] \sigma_u(i,k) v[h_v(j,k)] \sigma_v(j,k)$$

What happened is that all access into U is now replaced by access into a vector u of length N (and the same holds for V). Why does this work? First, we can prove that if we construct u and v from U and V via
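To make this concrete, here is a rough Python sketch of the hashed prediction. This is my own illustrative code, not the paper's: the multiply-shift hash mixer, the seed constants, and the use of a hash range \([0, N)\) rather than \([1 \ldots N]\) are all arbitrary choices for the sketch.

```python
def h(i, k, seed, N):
    """Illustrative integer hash mapping an (index, dimension) pair into [0, N).
    The mixing constants are arbitrary; the seed lets us derive several
    hash functions (h_u, h_v, and the sign hashes) from one mixer."""
    x = (i * 0x9E3779B1 + k * 0x85EBCA77 + seed * 0xC2B2AE35) & 0xFFFFFFFF
    x = ((x ^ (x >> 16)) * 0x45D9F3B) & 0xFFFFFFFF
    return (x ^ (x >> 16)) % N

def sigma(i, k, seed):
    """Rademacher sign hash: returns +1 or -1."""
    return 1 - 2 * h(i, k, seed + 101, 2)

def predict(u, v, i, j, K):
    """Approximate M[i,j] from the compressed parameter vectors u and v."""
    N = len(u)
    return sum(u[h(i, k, 1, N)] * sigma(i, k, 1) *
               v[h(j, k, 2, N)] * sigma(j, k, 2)
               for k in range(K))
```

Note that u and v are the only parameters we ever store; the full matrices U and V never materialize.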

$$u[k] = \sum_{h_u(i,j) = k} \sigma_u(i,j) U[i,j] \text{ and } v[k] = \sum_{h_v(i,j) = k} \sigma_v(i,j) V[i,j]$$

then the approximate version of \(M[i,j]\) converges to the correct \(M[i,j]\) with variance \(O(1/N)\), and moreover the estimate is unbiased. The exact expressions are a bit tedious to derive; they're given in the paper. In practice, things are even better than this rate suggests: since we never use U and V but always u and v, we simply optimize with respect to the compressed representation directly.
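Here's a small sketch of the construction above (again illustrative code, not the paper's). To make the sanity check exact, I pick \(N = nK\) and an injective "hash", so there are no collisions: the signs cancel because \(\sigma^2 = 1\), and the compressed prediction recovers \(M[i,j]\) perfectly. With a real hash and smaller N, collisions add the \(O(1/N)\) variance instead.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 5, 4
N = n * K                           # large enough for an injective hash
U = rng.normal(size=(n, K))         # user factors
V = rng.normal(size=(n, K))         # movie factors

h = lambda i, k: i * K + k          # injective on this domain (no collisions)
s_u = np.where(rng.random((n, K)) < 0.5, -1.0, 1.0)  # Rademacher signs
s_v = np.where(rng.random((n, K)) < 0.5, -1.0, 1.0)

def compress(Mat, s):
    """u[k] = sum over (i,j) with h(i,j) == k of sigma(i,j) * Mat[i,j]."""
    out = np.zeros(N)
    for i in range(Mat.shape[0]):
        for k in range(K):
            out[h(i, k)] += s[i, k] * Mat[i, k]
    return out

u, v = compress(U, s_u), compress(V, s_v)

def predict(i, j):
    """Hashed estimate of M[i,j]; exact here since h is injective."""
    return sum(u[h(i, k)] * s_u[i, k] * v[h(j, k)] * s_v[j, k]
               for k in range(K))
```

Shrinking N below \(nK\) in this sketch turns the exact recovery into the unbiased, \(O(1/N)\)-variance estimate discussed above.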

One of the advantages of the compressed representation is that we never really need to have any knowledge of all the rows of U. In particular, rather than mapping user IDs to rows in U we simply use the user ID as the hash key. If a new user appears, memory is effectively allocated to the new user by means of the hash function. If a user disappears, his parameters will simply get overwritten if we perform stochastic gradient descent with respect to the u and v vectors. The same obviously holds for movies or any other entity one would like to recommend.
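To see how training works without ever allocating per-user rows, here is a minimal stochastic gradient descent sketch for a single observed rating under squared loss. The hash and sign helpers implement the \(h_u, \sigma_u\) style functions above but are my own made-up stand-ins (as are the seed constants and the learning rate `lr`); the hash range is \([0, N)\).

```python
def h(i, k, seed, N):
    """Illustrative integer hash into [0, N); constants are arbitrary."""
    x = (i * 0x9E3779B1 + k * 0x85EBCA77 + seed * 0xC2B2AE35) & 0xFFFFFFFF
    x = ((x ^ (x >> 16)) * 0x45D9F3B) & 0xFFFFFFFF
    return (x ^ (x >> 16)) % N

def sigma(i, k, seed):
    """Rademacher sign hash in {+1, -1}."""
    return 1 - 2 * h(i, k, seed + 101, 2)

def predict(u, v, i, j, K):
    N = len(u)
    return sum(u[h(i, k, 1, N)] * sigma(i, k, 1) *
               v[h(j, k, 2, N)] * sigma(j, k, 2) for k in range(K))

def sgd_step(u, v, user_id, movie_id, rating, K, lr=0.05):
    """One SGD step on (prediction - rating)^2, updating u and v in place.
    The raw user/movie IDs are the hash keys: no ID-to-row table is needed,
    and a new user's 'memory' is whatever buckets the hash assigns it."""
    N = len(u)
    err = predict(u, v, user_id, movie_id, K) - rating
    for k in range(K):
        a, su = h(user_id, k, 1, N), sigma(user_id, k, 1)
        b, sv = h(movie_id, k, 2, N), sigma(movie_id, k, 2)
        ua, vb = u[a], v[b]          # snapshot so both updates use old values
        u[a] -= lr * err * su * sv * vb
        v[b] -= lr * err * su * sv * ua
```

A departed user's buckets simply get overwritten by later updates, which is exactly the graceful forgetting described above.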

Bottom line: we now have fast (in-memory) access to user parameters regardless of the number of users. The downside is that latency is still quite high: remember that the hash function requires us to access \(u[h_u(i,k)]\) for many different values of k. This means that each access in k is a cache miss, i.e. it costs us 100-200ns of RAM latency rather than the 10-20ns we'd pay for burst reads. How to break this latency barrier is the topic of one of the next posts.