May 2012
1 post
In defense of keeping data private
This is going to be contentious. And it somewhat goes against a lot of things that researchers hold holy. And it goes against my plan of keeping philosophy out of this blog. But it must be said since remaining silent has the potential of damaging science with proposals that sound good and are bad. The proposal is that certain conferences make it mandatory to publish datasets that were used for...
May 10th
December 2011
3 posts
Machine Learning Summer School Purdue Videos →
The MLSS 2011 videos from Purdue are now available on YouTube. Enjoy!
Dec 17th
Random numbers in constant storage
Many algorithms require random number generators to work. For instance, locality sensitive hashing requires one to compute the random projection matrix P in order to compute the hashes z = P x. Likewise, fast eigenvalue solvers in large matrices often rely on a random matrix, e.g. the paper by Halko, Martinsson and Tropp, SIAM Review 2011, which assumes that at some point we multiply a matrix M by...
Dec 17th
11 notes
Slides for the NIPS 2011 tutorial
The slides for the 2011 NIPS tutorial on Graphical Models for the Internet are online. Lots of stuff on parallelization, applications to user modeling, content recommendation, and content analysis here.  Livestream (16:00-18:00 European Standard Time) Part 1 [keynote] [pdf], Part 2 [powerpoint] [pdf]
Dec 12th
September 2011
1 post
The Neal Kernel and Random Kitchen Sinks
So you read a book on Reproducing Kernel Hilbert Spaces and you’d like to try out this kernel thing. But you’ve got a lot of data and most algorithms will give you an expansion that requires a number of kernel functions linear in the amount of data. Not good if you’ve got millions to billions of instances. You could try out low rank expansions such as the Nystrom method of...
Sep 24th
1 note
August 2011
1 post
Big Learning: Algorithms, Systems, and Tools for...
We’re organizing a workshop at NIPS 2011. Submission are solicited for a two day workshop December 16-17 in Sierra Nevada, Spain.  This workshop will address tools, algorithms, systems, hardware, and real-world problem domains related to large-scale machine learning (“Big Learning”). The Big Learning setting has attracted intense interest with active research spanning diverse fields...
Aug 31st
20 notes
June 2011
3 posts
Introduction to Graphical Models
Here’s a link to slides [Keynote, PDF] for a basic course on Graphical Models for the Internet that I’m giving at MLSS 2011 in Purdue that Vishy Vishwanathan is organizing. The selection is quite biased, limited, and subjective, but it’s meant to complement the other classes at the summer school. The slides are likely to grow, so in case of doubt, check for updates. Comments are...
Jun 17th
Distributed synchronization with the distributed...
Here’s a simple synchronization paradigm between many computers that scales with the number of machines involved and which essentially keeps cost at \(O(1)\) per machine. For lack of a better name I’m going to call it the distributed star since this is what the communication looks like. It’s quite similar to how memcached stores its (key,value) pairs.  Assume you have n...
Jun 9th
Speeding up Latent Dirichlet Allocation
The code to our LDA implementation on Hadoop is released on Github under the Mozilla Public License. It’s seriously fast and scales very well to 1000 machines or more (don’t worry, it runs on a single machine, too). We believe that at present this is the fastest implementation you can find, in particular if you want to have a) 1000s of topics, b) a large dictionary, c) a large number...
Jun 9th
48 notes
March 2011
3 posts
Bloom Filters
Bloom filters are one of the really ingenious and simple building blocks for randomized data structures. A great summary is the paper by Broder and Mitzenmacher. In this post I will briefly review its key ideas since it forms the basis of the Count-Min sketch of Cormode and Muthukrishnan, it will also be necessary for an accelerated version of the graph kernel of Shervashidze and Borgwardt, and...
Mar 30th
26 notes
Real simple covariate shift correction
Imagine you want to design some algorithm to detect cancer. You get data of healthy and sick people; you train your algorithm; it works fine giving you high accuracy and you conclude that you’re ready for a successful career in medical diagnostics. Not so fast … Many things could go wrong. In particular, the distributions that you work with for training and those in the wild might...
Mar 26th
31 notes
Graphical Models for the Internet
Here are a few tutorial slides I prepared with Amr Ahmed for WWW 2011 in Hyderabad next week. They describe in fairly basic (and in the end rather advanced) terms how one might use graphical models for the amounts of data available on the internet. Comments and feedback are much appreciated.  PDF Keynote
Mar 25th
February 2011
2 posts
Memory Latency, Hashing, Optimal Golomb Rulers and...
In many problems involving hashing we want to look up a range of elements from a vector, e.g. of the form \(v[h(i,j)]\) for arbitrary \(i\) and for a range of \(j \in \{1, \ldots, n\}\) where \(h(i,j)\) is a hash function. This happens e.g. for multiclass classification, collaborative filtering, and multitask learning.  While this works just fine in terms of estimation performance, traversing all...
Feb 12th
Collaborative Filtering considered harmful
Much excellent work has been published on collaborative filtering, in particular in terms of recovering missing entries in a matrix. The Netflix contest has contributed a significant amount to the progress in the field.  Alas, reality is not quite as simple as that. Very rarely will we ever be able to query a user about arbitrary movies, books, or other objects. Instead, user ratings are...
Feb 12th
1 note
September 2010
4 posts
Why
Some readers might wonder why I’m writing this blog. Here’s an (incomplete) list: It’s fun. There are lots of fantastic blogs discussing the philosophy and big questions of machine learning (e.g. John Langford’s hunch.net) but I couldn’t find many covering simple tricks of the trade. Scientific papers sometimes obscure simple ideas. In the most extreme case, a...
Sep 16th
26 notes
Hashing for Collaborative Filtering
This is a follow-up on the hashing for linear functions post. It’s based on the HashCoFi paper that Markus Weimer, Alexandros Karatzoglou and I wrote for AISTATS’10. It deals with the issue of running out of memory when you want to use collaborative filtering for very large problems. Here’s the setting: Assume you want to do Netflix-style collaborative filtering, i.e. you want...
Sep 16th
18 notes
Priority Sampling
Tamas Sarlos pointed out a much smarter strategy on how to obtain a sparse representation of a (possibly dense) vector: Priority Sampling by Duffield, Lund and Thorup (Journal of the ACM 2006).  The idea is quite ingenious and (surprisingly so) essentially optimal, as Mario Szegedy showed. Here’s the algorithm: For each \(x_i\) compute a priority \(p_i = \frac{x_i}{a_i}\) where \(a_i \sim...
Sep 7th
Random elements from a stream
This is a classic trick when dealing with data streams. It shows how to draw a random element from a sequence of instances without knowing beforehand how long the sequence is and which symbols occur.  Let us first assume that we knew the identities of all symbols. In this case finding a random symbol would be easy. All we require is that for each symbol s we draw a random variable \(\xi_s \sim...
Sep 6th
24 notes
August 2010
8 posts
Sparsifying a vector/matrix
Sometimes we want to compress vectors to reduce memory footprint or to minimize computational cost. This occurs, e.g. when we want to replace a probability vector by a sample from it. Without loss of generality, assume that our vector v has only nonnegative entries and that its entries sum up to 1, i.e. that it is a probability vector. If not, we simply apply our sampling scheme to the following: ...
Aug 27th
3 tags
Log-probabilities, semirings and floating point...
Here’s a trick/bug that is a) really well known in the research community, b) lots of beginners get it wrong nonetheless, c) simple unit tests may not detect it and d) it may be a really fatal bug in your code. You can even find it in large scale machine learning toolboxes. Suppose you want to do mixture clustering and you happily compute p(x|y) and p(y), say by a mixture of Gaussians....
Aug 21st
19 notes
5 tags
Parallel Stochastic Gradient Descent
Here’s the problem: you’ve optimized your stochastic gradient descent library but the code is still not fast enough. When streaming data off a disk/network you cannot exceed 100 MB/s, so processing a 1TB of data will take about 3-5 hours on a single machine. And classical stochastic gradient descent is sequential even in the context of multicore machines. Before looking at ways to...
Aug 19th
29 notes
Hashing for linear functions
This is the first of a few posts on hashing. It’s an incredibly powerful technique when working with discrete objects and sequences. And it’s also idiot-proof simple. I decided to post it after listening to a talk this morning where the speaker proposed something really complicated for what should be very simple. A hash function is (in its ideal CS version) a mapping h from a domain X into a set...
Aug 18th
3 tags
In praise of the Second Binomial Formula
Here’s a simple trick you can use to compute pairs of distances: use the second binomial formula and a linear algebra library. These problems occur in RBF kernel computations (this is how it’s implemented e.g. in the Elefant library) and in k-nearest neighbor and EM clustering algorithms.  The problem: given sets X and Z compute the matrix of pairwise squared distances. Simple...
Aug 18th
1 note
3 tags
Lazy updates for generic regularization in SGD
Yesterday I wrote about how to do fast stochastic gradient descent updates for quadratic regularization. However, there are lots more regularizers which one would want to use for sparse updates of high-dimensional parameter vectors. In general the problem looks like this: Whenever the gradients in stochastic gradient descent are sparse and whenever x is very high dimensional it would be very...
Aug 13th
8 notes
3 tags
Easy kernel width choice
This is an idea that was originally put forward by Bernhard Schölkopf in his thesis: Assume you have an RBF (radial basis function) kernel and you want to know how to scale it. Recall that such a kernel is given by We want that the argument of this function is O(1) for typical pairs of instances x and x’. Bernhard proposed to look at the dimensionality of x and rescale accordingly. This...
Aug 12th
3 tags
Fast quadratic regularization for online learning
After a few discussions within Yahoo I’ve decided to post a bunch of tricks here. A lot of these are well known. Some others might be new. They’re small hacks, too small to put into a paper on their own, unless we allow 1 page papers. But they can make a big difference to make your code go fast or do useful things. So here it goes: Assume you want to minimize the objective In this...
Aug 12th