One of the key features of a parameter server is that it, well, serves parameters. In particular, it serves more parameters than a single machine can typically hold and provides more bandwidth than what a single machine offers.

A sensible strategy to increase both aspects is to arrange data in the form of a bipartite graph with clients on one side and the server machines on the other. This way bandwidth and storage increase linearly with the number of machines involved. This is well understood. For instance, distributed (key,value) stores such as memcached or Basho Riak use it. It dates back to the ideas put forward e.g. in the STOC 1997 paper by David Karger et al. on Consistent Hashing and Random Trees.

A key problem is that we can obviously not store a mapping table from the keys to the machines. This would require a database that is of the same size as the set of keys and that would need to be maintained and updated on each client. One way around this is to use the argmin hash mapping. That is, given a machine pool \(M\), we assign a given (key,value) pair to the machine that has the smallest hash, i.e.

$$m(k, M) = \mathrm{argmin}_{m \in M} h(m,k)$$

The advantage of this scheme is that it allows for really good load balancing and repair. First off, the load is almost uniformly distributed, short of a small number of heavy hitters. Secondly, if a machine is removed or added to the machine pool, rebalancing affects all other machines uniformly. To see this, notice that the choice of machine with the smallest and second-smallest hash value is uniform.

Unfortunately, this is a stupid way of distributing (key,value) pairs for machine learning. And this is what we did in our 2010 VLDB and 2012 WSDM papers. To our excuse, we didn’t know any better. And others copied that approach … after all, how you can you improve on such nice rebalancing aspects.

This begs the question why it is a bad idea. It all comes down to the issue of synchronization. Basically, whenever a client attempts to synchronize its keys, it needs to traverse the list of the keys it owns and communicate with the appropriate servers. In the above scheme, it means that we need to communicate to a new random server for each key. This is amazingly costly. Probably the best comparison would be a P2P network where each byte is owned by a different machine. Downloads would take forever.

We ‘fixed’ this problem by cleverly reordering the access and then performing a few other steps of randomization. There’s even a nice load balancing lemma in the 2012 WSDM paper. However, a much better solution is to prevent the problem from happening and to borrow from key distribution algorithms such as Chord. In it, servers are inserted into a ring via a hash function. So are keys. This means that each server now owns a **contiguous segment of keys**. As a result, we can easily determine which keys go to which server, simply by knowing where in the ring the server sits.

In the picture above, keys are represented by little red stars. They are randomly assigned using a hash function via \(h(k)\) to the segments ‘owned’ by servers \(s\) that are inserted in the same way, i.e. via \(h(s)\). In the picture above, each server ‘owns’ the segment to its left. Also have a look at the Amazon Dynamo paper for a related description.

Obviously, such a load-balancing isn’t quite as ideal as the argmin hash. For instance, if a machine fails, the next machine inherits the entire segment. However, by inserting each server \(\log n\) times we can ensure that a good load balance is achieved and also that when machines are removed, there are several other machines that pick up the work. Moreover, it is now also very easy to replicate things (more on this later). If you’re curious on how to do this, have a look at Amar Phanishayee's excellent thesis. In a nutshell, the machines to the left hold the replicas. More details in the next post.

We’ve been busy building the next generation of a Parameter Server and it’s finally ready. It’s quite different from our previous designs, the main improvements being fault tolerance and self repair, a much improved network protocol, flexible consistency models, and a much more general interface.

In the next few posts I’ll explain the engineering decisions that went into this system which is capable of solving problems as diverse as very high throughput sketching, topic models, and optimization. And yes, it will be open source so you can build your own algorithms on top of it.

Zico Kolter and I proudly announce the 2014 Machine Learning Summer School in Pittsburgh. It will be held at Carnegie Mellon University in July 7-18, 2014. Our focus is on scalable data analysis and its applications, largely in the internet domain. So, if this is you PhD topic or if you’re planning on a startup in this area, come along.

Registration is open now. We will have scholarships for talented students to waive housing and attendance fees. Please submit your CVs on the site. See you in Pittsburgh.

Disks are slow and RAM is fast. Everyone knows that. But many optimization algorithms don’t take advantage of this. More to the point, disks currently stream at about 100-200 MB/s, solid state drives stream at over 500 MB/s with 1000x lower latency than disks, and main memory reigns supreme at about 10-100 GB/s bandwidth (depending on how many memory banks you have). This means that it is 100 times more expensive to retrieve instances from disk rather than recycling them once they’re already in memory. CPU caches are faster yet with 100-1000 GB/s of bandwidth. Everyone knows this. If not, read Jeff Dean’s slides. Page 13 is pure gold.

Ok, so what does this mean for machine learning? If you can keep things in memory, you can do things way faster. This is the main idea behind Spark. It’s a wonderful alternative to Hadoop. In other words, if your data fits into memory, you’re safe and you can process data way faster. A lot of datasets that are considered big in *academia* fit this bill. But what about *real* big data? Essentially you have two options - have the systems designer do the hard work or change your algorithm. This post is about the latter. And yes, there’s a good case to be made about who should do the work: the machine learners or the folks designing the computational infrastructure (I think it’s both).

So here’s the problem: Many online algorithms load data from disk, stream it through memory as efficiently as possible and *discard* it after seeing it once, only to pick it up later for another pass through the data. That is, these algorithms are **disk bound** rather than CPU bound. Several solvers try to address this by making the disk representation more efficient, e.g. Liblinear or VowpalWabbit, both of which user their own internal representation for efficiency. While this still makes for quite efficient code that can process up to 3TB of data per hour in any given pass, main memory is still much faster. This has led to the misconception that many machine learning algorithms are disk bound. But, they aren’t …

What if we could re-use data that’s in memory? For instance, use a ringbuffer where the disk writes into it (much more slowly) and the CPU reads from it (100 times more rapidly). The problem is what to do with an observation that we’ve already processed. A naive strategy would be to pretend that it is a new instance, i.e. we could simply update on it more than once. But this is very messy since we need to keep track of how many times we’ve seen the instance before, and it creates nonstationarity in the training set.

A much cleaner strategy is to switch to dual variables, similar to the updates in the Dualon of Shalev-Shwartz and Singer. This is what Shin Matsushima did in our dual cached loops paper. Have a look at StreamSVM here. Essentially, it keeps data in memory in a ringbuffer and updates the dual variables. This way, we’re guaranteed to make progress at each step, even if we’re revisiting the same observation more than once. To see what happens have a look at the graph below:

It’s just as fast as LibLinear provided that it’s all in memory. Algorithmically, what happens in the SVM case is that one updates the Lagrange multipliers \(\alpha_i\), while simultaneously keeping an estimate of the parameter vector \(w\) available.

That said, this strategy is more general: reuse data several times for optimization while it is in memory. If possible, perform successive updates by changing variables of an optimization that is well-defined regardless of the order in which (and how frequently) data is seen.

Imagine you have two graphs \(G\) and \(G’\) and you’d like to check how similar they are. If all vertices have unique attributes this is quite easy:

FOR ALL vertices \(v \in G \cup G’\) DO

- Check that \(v \in G\) and that \(v \in G’\)
- Check that the neighbors of v are the same in \(G\) and \(G’\)

This algorithm can be carried out in linear time in the size of the graph, alas many graphs do not have vertex attributes, let alone unique vertex attributes. In fact, graph isomorphism, i.e. the task of checking whether two graphs are identical, is a hard problem (it is still an open research question how hard it really is). In this case the above algorithm cannot be used since we have no idea which vertices we should match up.

The Weisfeiler-Lehman algorithm is a mechanism for assigning fairly unique attributes efficiently. Note that it isn’t guaranteed to work, as discussed in this paper by Douglas - this would solve the graph isomorphism problem after all. The idea is to assign fingerprints to vertices and their neighborhoods repeatedly. We assume that vertices have an attribute to begin with. If they don’t then simply assign all of them the attribute 1. Each iteration proceeds as follows:

FOR ALL vertices \(v \in G\) DO

- Compute a hash of \((a_v, a_{v_1}, \ldots, a_{v_n})\) where \(a_{v_i}\) are the attributes of the neighbors of vertex \(v\).
- Use the hash as vertex attribute for v in the next iteration.

The algorithm terminates when this iteration has converged in terms of unique assignments of hashes to vertices.

Note that it is *not* guaranteed to work for all graphs. In particular, it fails for graphs with a high degree of symmetry, e.g. chains, complete graphs, tori and stars. However, whenever it converges to a unique vertex attribute assignment it provides a certificate for graph isomorphism. Moreover, the sets of vertex attributes can be used to show that two graphs are not isomorphic (it suffices to verify that the sets differ at any stage).

Shervashidze et al. 2012 use this idea to define a similarity measure between graphs. Basically the idea is that graphs are most similar if many of their vertex identifiers match since this implies that the associated subgraphs match. Formally they compute a kernel using

$$k(G,G’) = \sum_{i=1}^d \lambda_d \sum_{v \in V} \sum_{v’ \in V’} \delta(a(v,i), a(v’,i))$$

Here \(a(v,i)\) denote the vertex attribute of \(v\) after WL iteration \(i\). Morevoer, \(\lambda_i\) are nonnegative coefficients that weigh how much the similarity at level \(i\) matters. Rather than a brute-force computation of the above test for equality we can sort vertex attribute sets. Note that vertices that have different attributes at any given iteration will never have the same attribute thereafter. This means that we can compare the two sets at all depths at at most \(O(d \cdot (|V| + |V’|))\) cost.

A similar trick is possible if we want to regress between vertices on the same graph since we can use the set of attributes that a vertex obtains during the iterations as features. Finally, we can make our life even easier if we don’t compute kernels at all and use a linear classifier on the vertex attributes directly.

This is going to be contentious. And it somewhat goes against a lot of things that researchers hold holy. And it goes against my plan of keeping philosophy out of this blog. But it must be said since remaining silent has the potential of damaging science with proposals that sound good and are bad.

The proposal is that certain conferences make it mandatory to publish datasets that were used for the experiments. This is a very bad idea and two things are getting confused here: scientific progress and common access. These two are not identical. Reproducibility is often confused with common access. To make these things a bit more clear, here’s an example where it’s more obvious:

CERN is a monster machine. There’s only one of its kind in the world. There are limited resources and it’s impossible for any arbitrary researcher to reproduce their experiments, simply because of the average physicist being short of the tens of billions of Dollars that it took to build it. Access to the accelerator is also limited. It requires qualification and resource planning. So, even if we think this is open, it isn’t really as open as it looks. And yes, working at CERN gives you an unfair advantage over all the researchers who don’t.

Likewise take medical research. Patient records are covered by HIPAA privacy constraints and there is absolutely no way for such records to be publicly released. The participants sign an entire chain of documents that tie them to not releasing such data publicly. In other words, common access is impossible. Reproducibility would require that someone, who wants to test a contentious result, needs to sign corresponding privacy documents before accessing the data. And yes, working with the ‘right’ hospitals gives you an unfair advantage over researchers who didn’t work building this relationship.

Lastly, user data on the internet. Users have every right for their comments, content, images, mails, etc. to be treated with the utmost respect and to be published only when it is in their interest and with their permission to do so. I believe that there is a material difference between data being made available for analytics purposes in a personalization system and data being made available ‘in the raw’ for any researcher to play with. The latter allows for individuals to inspect particular records and learn that Alice mailed Bob a love letter. Something that would make Charlie very upset if he found out. Hence common access is a non-starter.

There are very clear financial penalties for releasing private data - users would leave the service. Moreover, it would give a competitor an advantage over the releasing party. Since the data is largely collected by private parties at their expense it is not possible.

As for reproducibility - this is an issue. But provided that in case of a contentious result it is possible for a trusted researcher to check them, possibly after signing an NDA, this can be addressed. And yes, working for one of these companies gives you an unfair advantage.

In summary, while desirable, I strongly disagree with a mandatory publications policy. Yes, every effort should be made personally by researchers to see whether some data is releasable. And for publicly funded research this may well be the right thing to do. But to mandate it for industry would essentially do two things - it will make industrial research even more secretive than it already is (and that’s a terrible thing). And secondly, it will make academic research less relevant for real problems (I’ve seen my fair share and am guilty of my fair share of such papers).

Machine Learning Summer School Purdue Videos -

The MLSS 2011 videos from Purdue are now available on YouTube. Enjoy!

Many algorithms require random number generators to work. For instance, locality sensitive hashing requires one to compute the random projection matrix P in order to compute the hashes z = P x. Likewise, fast eigenvalue solvers in large matrices often rely on a random matrix, e.g. the paper by Halko, Martinsson and Tropp, SIAM Review 2011, which assumes that at some point we multiply a matrix M by a matrix P with Gaussian random entries.

The problem with these methods is that if we want to perform this projection operation in many places, we need to distribute the matrix P to several machines. This is undesirable since a) it introduces another stage of synchronization between machines and b) it requires space to store the matrix P in the first place. The latter is often bad since memory access can be much slower than computation, depending on how the memory is being accessed. The prime example here is multiplication with a sparse matrix which would require random memory access.

Instead, we simply recompute the entries by hashing. To motivate things consider the case where the entries of P are all drawn from the uniform distribution U[0,1]. For a hash function h with range [0 .. N] simply set \(U_{ij} = h(i,j)/N\). Since hash functions map (i,j) pairs to uniformly distributed uncorrelated numbers in the range [0 .. N] this essentially amounts to uniformly distributed random numbers that can be recomputed on the fly.

A slightly more involved example is how to draw Gaussian random variables. We may e.g. resort to the Box-Müller transform which shows how to convert two uniformly distributed random numbers into two Gaussians. While being quite wasteful (we use two random numbers rather than one), we simply use two uniform hashes and then compute

$$P_{ij} = \left({-2 \log h(i,j,1)/N}\right)^{\frac{1}{2}} \cos (2 \pi h(i,j,2)/N)$$

Since this is known to generate Gaussian random variables from uniform random variables this will give us Gaussian distributed hashes. Similar tricks work for other random variables. It means that things like Random Kitchen Sinks, Locality Sensitive Hashing, and related projection methods never really need to store the ‘random’ projection coefficients whenever memory is at a premium or whenever it would be too costly to synchronize the random numbers.

The slides for the 2011 NIPS tutorial on Graphical Models for the Internet are online. Lots of stuff on parallelization, applications to user modeling, content recommendation, and content analysis here.

Livestream (16:00-18:00 European Standard Time)

Part 1 [keynote] [pdf], Part 2 [powerpoint] [pdf]

So you read a book on Reproducing Kernel Hilbert Spaces and you’d like to try out this kernel thing. But you’ve got a lot of data and most algorithms will give you an expansion that requires a number of kernel functions linear in the amount of data. Not good if you’ve got millions to billions of instances.

You could try out low rank expansions such as the Nystrom method of Seeger and Williams, 2000, the randomized Sparse Greedy Matrix Approximation of Smola and Schölkopf, 2000 (the Nyström method is a special case where we only randomize by a single term), or the very efficient positive diagonal pivoting trick of Scheinberg and Fine, 2001. Alas, all those methods suffer from a serious problem: at training you need to multiply by the inverse of the reduced covariance matrix, which is \(O(d^2)\) cost for a d dimensional expansion. An example of an online algorithm that suffers from the same problem is this (NIPS award winning) paper of Csato and Opper, 2002. Assuming that we’d like to have d grow with the sample size this is not a very useful strategy. Instead, we want to find a method which has \(O(d)\) cost for d attributes yet shares good regularization properties that can be properly analyzed.

Enter Radford Neal’s seminal paper from 1994 on Gaussian Processes (a famous NIPS reject). In it he shows that a Neural Network with an infinite number of nodes and a Gaussian Prior over coefficients converges to a GP. More specifically, we get the kernel

$$k(x,x’) = E_{c}[\phi_c(x) \phi_c(x’)]$$

Here \(\phi_c(x)\) is a function parametrized by c, e.g. the location of a basis function, the degree of a polynomial, or the direction of a Fourier basis function. There is also a discussion regarding RKHS in a paper by Smola, Schölkof and Müller, 1998 that discusses this phenomenon in regularization networks. These ideas were promptly forgotten by its authors. One exception is the empirical kernel map where one uses a generic design matrix that is generated through the observations directly.

It was not until the paper by Rahimi and Recht, 2008 on random kitchen sinks that this idea regained popularity. In a nutshell the algorithm works as follows: Draw d values \(c_i\) from the distribution over c. Use the corresponding basis functions in a linear model with quadratic penalty on the expansion coefficients. This method works whenever the basis functions are well bounded. For instance, for the Fourier basis the functions are bounded by 1. The proof of convergence of the explicit function expansion to the kernel is then a simple consequence of Chernoff bounds.

In the random kitchen sinks paper Rahimi and Recht discuss RBF kernels and binary indicator functions. However, this works more generally for any set of well behaved set of basis functions used in generating a random design matrix. A few examples:

- Fourier basis with Gaussian parameters. Take functions of the form \(e^{i w^\top x}\) where the coefficients \(w\) are drawn from a Gaussian. This is the random kitchen sinks paper. Obviously you can use hash functions rather than an actual random number generator. This ensures that you don’t need to store all parameters \(w\).
- Pick random separating hyperplanes. This will effectively give you functions of bounded variation.
- Use the empirical kernel map, i.e. we use some function \(k(x,x’)\) for which we employ for \(x’\) a random subset of the data we wish to train on.