<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0"><channel><atom:link rel="hub" href="http://tumblr.superfeedr.com/" xmlns:atom="http://www.w3.org/2005/Atom"/><description>

Alex Smola</description><title>Adventures in Data Land</title><generator>Tumblr (3.0; @smolix)</generator><link>http://blog.smola.org/</link><item><title>In defense of keeping data private</title><description>&lt;p&gt;&lt;p class="commenter"&gt;&lt;span class="comment-body"&gt;This is going to be contentious. And it somewhat goes against a lot of things that researchers hold holy. And it goes against my plan of keeping philosophy out of this blog. But it must be said since remaining silent has the potential of damaging science with proposals that sound good and are bad.&lt;/span&gt;&lt;/p&gt;
&lt;p class="commenter"&gt;&lt;span class="comment-body"&gt;&lt;br/&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="commenter"&gt;The proposal is that certain conferences make it mandatory to publish the datasets that were used for the experiments. This is a very bad idea, and two things are getting confused here: &lt;span&gt;scientific progress and common access. These two are not identical. Reproducibility is often confused with common access. To make this clearer, here are some examples where the distinction is more obvious: &lt;/span&gt;&lt;/p&gt;
&lt;p class="commenter"&gt;&lt;span&gt;&lt;br/&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="commenter"&gt;&lt;span&gt;CERN is a monster machine. There&amp;#8217;s only one of its kind in the world. Resources are limited and it&amp;#8217;s impossible for an arbitrary researcher to reproduce their experiments, simply because the average physicist is short of the tens of billions of dollars that it took to build it. Access to the accelerator is also limited. It requires qualification and resource planning. So, even if we think this is open, it isn&amp;#8217;t really as open as it looks. And yes, working at CERN gives you an unfair advantage over all the researchers who don&amp;#8217;t. &lt;/span&gt;&lt;/p&gt;
&lt;p class="commenter"&gt;&lt;span&gt;&lt;br/&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="commenter"&gt;&lt;span&gt;Likewise take medical research. Patient records are covered by HIPAA privacy constraints and there is absolutely no way for such records to be publicly released. The participants sign an entire chain of documents that bind them to not releasing such data publicly. In other words, common access is impossible. Reproducibility would require that someone who wants to test a contentious result sign the corresponding privacy documents before accessing the data. And yes, working with the &amp;#8216;right&amp;#8217; hospitals gives you an unfair advantage over researchers who didn&amp;#8217;t work on building this relationship. &lt;/span&gt;&lt;/p&gt;
&lt;p class="commenter"&gt;&lt;span&gt;&lt;br/&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="commenter"&gt;&lt;span&gt;Lastly, user data on the internet. Users have every right to expect that their comments, content, images, mails, etc. are treated with the utmost respect and published only when it is in their interest and with their permission. I believe that there is a material difference between data being made available for analytics purposes in a personalization system and data being made available &amp;#8216;in the raw&amp;#8217; for any researcher to play with. The latter allows individuals to inspect particular records and learn that Alice mailed Bob a love letter, something that would make Charlie very upset if he found out. Hence common access is a non-starter. &lt;/span&gt;&lt;/p&gt;
&lt;p class="commenter"&gt;&lt;span&gt;&lt;br/&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="commenter"&gt;&lt;span&gt;There are very clear financial penalties for releasing private data - users would leave the service. Moreover, it would give competitors an advantage over the releasing party. Since the data is largely collected by private parties at their own expense, mandating its release is not feasible. &lt;/span&gt;&lt;/p&gt;
&lt;p class="commenter"&gt;&lt;span class="comment-body"&gt;As for reproducibility - this is an issue. But provided that, in the case of a contentious result, it is possible for a trusted researcher to check it, possibly after signing an NDA, this can be addressed. And yes, working for one of these companies gives you an unfair advantage. &lt;/span&gt;&lt;/p&gt;
&lt;p class="commenter"&gt;&lt;span class="comment-body"&gt;&lt;br/&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="commenter"&gt;&lt;span class="comment-body"&gt;In summary, while releasing data is desirable, I strongly disagree with a mandatory publication policy. Yes, every effort should be made personally by researchers to see whether some data is releasable. And for publicly funded research this may well be the right thing to do. But to mandate it for industry would essentially do two things. First, it would make industrial research even more secretive than it already is (and that&amp;#8217;s a terrible thing). Secondly, it would make academic research less relevant for real problems (I&amp;#8217;ve seen my fair share and am guilty of my fair share of such papers).&lt;/span&gt;&lt;/p&gt;&lt;/p&gt;</description><link>http://blog.smola.org/post/22786487711</link><guid>http://blog.smola.org/post/22786487711</guid><pubDate>Thu, 10 May 2012 10:39:33 -0700</pubDate></item><item><title>Machine Learning Summer School Purdue Videos</title><description>&lt;a href="http://www.youtube.com/playlist?list=PL2A65507F7D725EFB"&gt;Machine Learning Summer School Purdue Videos&lt;/a&gt;: &lt;p&gt;The MLSS 2011 videos from Purdue are now available on YouTube. Enjoy!&lt;/p&gt;</description><link>http://blog.smola.org/post/14345888700</link><guid>http://blog.smola.org/post/14345888700</guid><pubDate>Fri, 16 Dec 2011 23:14:58 -0800</pubDate></item><item><title>Random numbers in constant storage</title><description>&lt;p&gt;Many algorithms require random number generators to work. 
For instance, locality sensitive hashing requires one to compute the random projection matrix P in order to compute the hashes z = P x. Likewise, fast eigenvalue solvers for large matrices often rely on a random matrix, e.g. the paper by &lt;a href="http://amath.colorado.edu/faculty/martinss/Pubs/2010_HMT_random_review.pdf" title="SIAM Review"&gt;Halko, Martinsson and Tropp, SIAM Review 2011&lt;/a&gt;, which assumes that at some point we multiply a matrix M by a matrix P with Gaussian random entries. &lt;/p&gt;
&lt;p&gt;The problem with these methods is that if we want to perform this projection operation in many places, we need to distribute the matrix P to several machines. This is undesirable since a) it introduces another stage of synchronization between machines and b) it requires space to store the matrix P in the first place. The latter is often bad since memory access can be much slower than computation, depending on how the memory is being accessed. The prime example here is multiplication with a sparse matrix which would require random memory access. &lt;/p&gt;
&lt;p&gt;Instead, we simply recompute the entries by hashing. To motivate things consider the case where the entries of P are all drawn from the uniform distribution U[0,1]. For a hash function h with range [0 .. N] simply set \(P_{ij} = h(i,j)/N\). Since a good hash function maps (i,j) pairs to approximately uniformly distributed, uncorrelated numbers in the range [0 .. N], this essentially amounts to uniformly distributed random numbers that can be recomputed on the fly. &lt;/p&gt;
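&lt;p&gt;As a minimal sketch of recomputing uniform entries by hashing, the Python snippet below builds a 64-bit hash from BLAKE2b in the standard library. The choice of hash, the value of N, and the names h, uniform_entry and project are assumptions made for illustration and are not part of the original post; any hash with good mixing properties should serve the same purpose.&lt;/p&gt;
&lt;pre&gt;
```python
import hashlib
import struct

N = 2**64 - 1  # largest value of the 64-bit hash, so h / N lies in [0, 1]


def h(*key):
    """Hash an integer tuple such as (i, j) to an integer in [0 .. N]."""
    data = struct.pack("=%dq" % len(key), *key)
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "big")


def uniform_entry(i, j):
    """Recompute the (i, j) entry on the fly instead of reading it from memory."""
    return h(i, j) / N


def project(x, num_rows):
    """Compute z = P x without ever materializing the matrix P."""
    return [
        sum(uniform_entry(i, j) * x_j for j, x_j in enumerate(x))
        for i in range(num_rows)
    ]
```
&lt;/pre&gt;
&lt;p&gt;Every machine that calls project with the same arguments obtains the same result, so no matrix needs to be stored or synchronized.&lt;/p&gt;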
&lt;p&gt;A slightly more involved example is how to draw Gaussian random variables. We may e.g. resort to the &lt;a href="http://en.wikipedia.org/wiki/Box%E2%80%93Muller_transform" title="Box Muller transform"&gt;Box-Muller transform&lt;/a&gt; which shows how to convert two uniformly distributed random numbers into two Gaussians. While this is quite wasteful (we use two random numbers rather than one), we simply use two uniform hashes and then compute &lt;/p&gt;
&lt;p&gt;$$P_{ij} = \left(-2 \log \frac{h(i,j,1)}{N}\right)^{\frac{1}{2}} \cos \left(2 \pi \frac{h(i,j,2)}{N}\right)$$&lt;/p&gt;
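&lt;p&gt;The formula above can be sketched in Python as follows. This is an illustration under stated assumptions rather than a reference implementation: the BLAKE2b-based hash h and the name gaussian_entry are hypothetical, and the first hash is shifted by one so that the logarithm never sees zero, which deviates slightly from the formula as written.&lt;/p&gt;
&lt;pre&gt;
```python
import hashlib
import math
import struct

N = 2**64  # one more than the largest 64-bit hash value (assumption)


def h(*key):
    """Hash an integer tuple such as (i, j, 1) to a 64-bit integer."""
    data = struct.pack("=%dq" % len(key), *key)
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "big")


def gaussian_entry(i, j):
    """Recompute a standard normal entry for position (i, j) via Box-Muller."""
    # Shift by one so that u1 lies in (0, 1] and log(u1) stays finite.
    u1 = (h(i, j, 1) + 1) / N
    u2 = h(i, j, 2) / N
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)
```
&lt;/pre&gt;
&lt;p&gt;The third hash argument only serves to decorrelate the two uniform draws belonging to the same (i, j) pair.&lt;/p&gt;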
&lt;p&gt;Since this is known to generate Gaussian random variables from uniform random variables this will give us Gaussian distributed hashes. Similar tricks work for other random variables. It means that things like Random Kitchen Sinks, Locality Sensitive Hashing, and related projection methods never really need to store the &amp;#8216;random&amp;#8217; projection coefficients whenever memory is at a premium or whenever it would be too costly to synchronize the random numbers.&lt;/p&gt;</description><link>http://blog.smola.org/post/14345795830</link><guid>http://blog.smola.org/post/14345795830</guid><pubDate>Fri, 16 Dec 2011 23:11:10 -0800</pubDate></item><item><title>Slides for the NIPS 2011 tutorial</title><description>&lt;p&gt;The slides for the 2011 NIPS tutorial on Graphical Models for the Internet are online. Lots of stuff on parallelization, applications to user modeling, content recommendation, and content analysis here. &lt;/p&gt;
&lt;p&gt;&lt;a href="http://cevug.ugr.es/tv2/" title="Livestream"&gt;Livestream&lt;/a&gt; (16:00-18:00 European Standard Time)&lt;/p&gt;
&lt;p&gt;Part 1 [&lt;a href="http://alex.smola.org/talks/nips2011/part1.key" title="Part 1"&gt;keynote&lt;/a&gt;] [&lt;a href="http://alex.smola.org/talks/nips2011/part1.pdf" title="Part 1"&gt;pdf&lt;/a&gt;], Part 2 [&lt;a href="http://alex.smola.org/talks/nips2011/part2.pptx" title="Part 2"&gt;powerpoint&lt;/a&gt;] [&lt;a href="http://alex.smola.org/talks/nips2011/part2.pdf" title="Part 2"&gt;pdf&lt;/a&gt;]&lt;/p&gt;</description><link>http://blog.smola.org/post/14117021513</link><guid>http://blog.smola.org/post/14117021513</guid><pubDate>Mon, 12 Dec 2011 06:27:56 -0800</pubDate></item><item><title>The Neal Kernel and Random Kitchen Sinks</title><description>&lt;p&gt;So you read a &lt;a title="Learning with Kernels" href="http://www.amazon.com/Learning-Kernels-Regularization-Optimization-Computation/dp/0262194759"&gt;book&lt;/a&gt; on &lt;a title="Grace Wahba" href="http://www.ec-securehost.com/SIAM/CB59.html"&gt;Reproducing Kernel Hilbert Spaces&lt;/a&gt; and you&amp;#8217;d like to try out this kernel thing. But you&amp;#8217;ve got a lot of data and most algorithms will give you an expansion that requires a number of kernel functions linear in the amount of data. Not good if you&amp;#8217;ve got millions to billions of instances.&lt;/p&gt;
&lt;p&gt;You could try out low rank expansions such as the Nyström method of &lt;a title="Nystrom" href="http://lapmal.epfl.ch/papers/nystroem.pdf"&gt;Williams and Seeger&lt;/a&gt;, 2000, the randomized Sparse Greedy Matrix Approximation of &lt;a title="SGMA" href="http://arnetminer.org/dev.do?m=downloadpdf&amp;amp;url=http://arnetminer.org/pdf/PDFFiles2/--d---d-1258203727680/Sparse%20Greedy%20Matrix%20Approximation%20for%20Machine%20Learning1258205169211.pdf"&gt;Smola and Schölkopf&lt;/a&gt;, 2000 (the Nyström method is a special case where we only randomize by a single term), or the very efficient positive diagonal pivoting trick of &lt;a title="Pivoting" href="http://www.ai.mit.edu/projects/jmlr/papers/volume2/fine01a/fine01a.pdf"&gt;Fine and Scheinberg&lt;/a&gt;, 2001. Alas, all those methods suffer from a serious problem: at training time you need to multiply by the inverse of the reduced covariance matrix, which costs \(O(d^2)\) for a d dimensional expansion. An example of an online algorithm that suffers from the same problem is this (NIPS award winning) paper of &lt;a title="Csato Opper" href="http://www.ki.tu-berlin.de/fileadmin/fg135/Publikationen/Opper/papers02/CsOp02.pdf"&gt;Csato and Opper&lt;/a&gt;, 2002. Assuming that we&amp;#8217;d like to have d grow with the sample size, this is not a very useful strategy. Instead, we want to find a method which has \(O(d)\) cost for d attributes yet shares good regularization properties that can be properly analyzed.&lt;/p&gt;
&lt;p&gt;Enter Radford Neal&amp;#8217;s seminal paper from 1994 on &lt;a title="GP" href="http://www.cs.toronto.edu/~radford/ftp/pin.pdf"&gt;Gaussian Processes&lt;/a&gt; (a famous NIPS reject). In it he shows that a Neural Network with an infinite number of nodes and a Gaussian Prior over coefficients converges to a GP. More specifically, we get the kernel&lt;/p&gt;
&lt;p&gt;$$k(x,x') = E_{c}[\phi_c(x) \phi_c(x')]$$&lt;/p&gt;
&lt;p&gt;Here \(\phi_c(x)\) is a function parametrized by c, e.g. the location of a basis function, the degree of a polynomial, or the direction of a Fourier basis function. A paper by &lt;a title="Regularization" href="http://alex.smola.org/papers/1998/SmoSchMul98.pdf"&gt;Smola, Schölkopf and Müller&lt;/a&gt;, 1998 also discusses this phenomenon for RKHS in the context of regularization networks. These ideas were promptly forgotten by their authors. One exception is the &lt;a title="ekm" href="http://noble.gs.washington.edu/papers/schoelkopf_kernel.html"&gt;empirical kernel map&lt;/a&gt; where one uses a &lt;a title="svm linear" href="ftp://ftp.cs.wisc.edu/math-prog/talks/afosr.ps"&gt;generic design matrix&lt;/a&gt; that is generated through the observations directly. &lt;/p&gt;
&lt;p&gt;It was not until the paper by &lt;a title="rks" href="http://books.nips.cc/papers/files/nips21/NIPS2008_0885.pdf"&gt;Rahimi and Recht&lt;/a&gt;, 2008 on random kitchen sinks that this idea regained popularity. In a nutshell the algorithm works as follows:&lt;/p&gt;
&lt;ol&gt;&lt;li&gt;Draw d values \(c_i\) from the distribution over c.&lt;/li&gt;
&lt;li&gt;Use the corresponding basis functions in a linear model with quadratic penalty on the expansion coefficients.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This method works whenever the basis functions are well bounded. For instance, for the Fourier basis the functions are bounded by 1. The proof of convergence of the explicit function expansion to the kernel is then a simple consequence of Chernoff bounds.&lt;/p&gt;
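&lt;p&gt;The two steps can be sketched in plain Python as follows. This is a toy illustration under several assumptions that are not taken from the post or the paper's code: it uses the real-valued cosine form with a random phase in place of the complex exponential, draws frequencies from a Gaussian with scale 1/sigma (which corresponds to a Gaussian RBF kernel of width sigma), and solves the penalized least squares problem with a small hand-written Gaussian elimination. All function names are hypothetical.&lt;/p&gt;
&lt;pre&gt;
```python
import math
import random


def draw_basis(d, dim, sigma, rng):
    """Step 1: draw d parameters c_i = (w_i, b_i) for cosine basis functions."""
    return [
        ([rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)],
         rng.uniform(0.0, 2.0 * math.pi))
        for _ in range(d)
    ]


def features(x, basis):
    """Map x to the d-dimensional vector of scaled basis function values."""
    scale = math.sqrt(2.0 / len(basis))
    return [
        scale * math.cos(sum(w_k * x_k for w_k, x_k in zip(w, x)) + b)
        for w, b in basis
    ]


def solve(a, y):
    """Solve a z = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    m = [row[:] + [y_i] for row, y_i in zip(a, y)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (m[r][n] - tail) / m[r][r]
    return z


def fit_ridge(xs, ys, basis, penalty):
    """Step 2: linear model with quadratic penalty on the coefficients."""
    phi = [features(x, basis) for x in xs]
    d = len(basis)
    gram = [
        [sum(row[p] * row[q] for row in phi) + (penalty if p == q else 0.0)
         for q in range(d)]
        for p in range(d)
    ]
    rhs = [sum(row[p] * y for row, y in zip(phi, ys)) for p in range(d)]
    return solve(gram, rhs)


def predict(x, basis, coefficients):
    """Evaluate the fitted expansion at a new point x."""
    return sum(a * f for a, f in zip(coefficients, features(x, basis)))
```
&lt;/pre&gt;
&lt;p&gt;With this scaling the inner product of two feature vectors approximates the kernel value \(k(x,x')\), and the approximation tightens as d grows. The dense solve is only meant to keep the sketch self-contained; at scale one would use an iterative solver whose per-step cost is linear in d.&lt;/p&gt;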
&lt;p&gt;In the random kitchen sinks paper Rahimi and Recht discuss RBF kernels and binary indicator functions. However, this works more generally for any well behaved set of basis functions used in generating a random design matrix. A few examples:&lt;/p&gt;
&lt;ul&gt;&lt;li&gt;Fourier basis with Gaussian parameters. Take functions of the form \(e^{i w^\top x}\) where the coefficients \(w\) are drawn from a Gaussian. This is the random kitchen sinks paper. Obviously you can use hash functions rather than an actual random number generator. This ensures that you don&amp;#8217;t need to store all parameters \(w\).&lt;/li&gt;
&lt;li&gt;Pick random separating hyperplanes. This will effectively give you functions of bounded variation.&lt;/li&gt;
&lt;li&gt;Use the empirical kernel map, i.e. we use some function \(k(x,x')\) where for \(x'\) we employ a random subset of the data we wish to train on.&lt;/li&gt;
&lt;/ul&gt;</description><link>http://blog.smola.org/post/10572672684</link><guid>http://blog.smola.org/post/10572672684</guid><pubDate>Fri, 23 Sep 2011 16:01:51 -0700</pubDate></item><item><title>Big Learning: Algorithms, Systems, and Tools for Learning at Scale</title><description>&lt;p&gt;&lt;p class="p1"&gt;We&amp;#8217;re organizing a &lt;a title="Big Learning" href="http://www.biglearn.org"&gt;workshop at NIPS 2011&lt;/a&gt;. Submissions are solicited for a two day workshop December 16-17 in Sierra Nevada, Spain. &lt;/p&gt;
&lt;p class="p3"&gt;This workshop will address tools, algorithms, systems, hardware, and real-world problem domains related to large-scale machine learning (“Big Learning”). The Big Learning setting has attracted intense interest with active research spanning diverse fields including machine learning, databases, parallel and distributed systems, parallel architectures, and programming languages and abstractions. This workshop will bring together experts across these diverse communities to discuss recent progress, share tools and software, identify pressing new challenges, and exchange new ideas. Topics of interest include (but are not limited to):&lt;/p&gt;
&lt;p class="p3"&gt;&lt;strong&gt;Hardware Accelerated Learning&lt;/strong&gt;: Practicality and performance of specialized high-performance hardware (e.g. GPUs, FPGAs, ASICs) for machine learning applications.&lt;/p&gt;
&lt;p class="p3"&gt;&lt;strong&gt;Applications of Big Learning&lt;/strong&gt;: Practical application case studies; insights on end-users, typical data workflow patterns, common data characteristics (stream or batch); trade-offs between labeling strategies (e.g., curated or crowd-sourced); challenges of real-world system building.&lt;/p&gt;
&lt;p class="p4"&gt;&lt;strong&gt;Tools, Software, &amp;amp; Systems&lt;/strong&gt;: Languages and libraries for large-scale parallel or distributed learning. Preference will be given to approaches and systems that leverage cloud computing (e.g. Hadoop, DryadLINQ, EC2, Azure), scalable storage (e.g. RDBMS, NoSQL, graph databases), and/or specialized hardware (e.g. GPU, Multicore, FPGA, ASIC).&lt;/p&gt;
&lt;p class="p4"&gt;&lt;strong&gt;Models &amp;amp; Algorithms&lt;/strong&gt;: Applicability of different learning techniques in different situations (e.g., simple statistics vs. large structured models); parallel acceleration of computationally intensive learning and inference; evaluation methodology; trade-offs between performance and engineering complexity; principled methods for dealing with large numbers of features.&lt;/p&gt;
&lt;p class="p4"&gt;Submissions should be written as extended abstracts, no longer than 4 pages (excluding references) in the &lt;a title="LaTeX style" href="http://nips.cc/PaperInformation/StyleFiles"&gt;NIPS latex style&lt;/a&gt;. Relevant work previously presented in non-machine-learning conferences is strongly encouraged. Exciting work that was recently presented is allowed, provided that the extended abstract mentions this explicitly.  &lt;/p&gt;
&lt;p class="p4"&gt;Submission Deadline: September 30th, 2011.&lt;/p&gt;
&lt;p class="p4"&gt;Please refer to the &lt;a title="Big Learning submission" href="http://biglearn.org/index.php/Authorinfo"&gt;website for detailed submission instructions&lt;/a&gt;.&lt;/p&gt;&lt;/p&gt;</description><link>http://blog.smola.org/post/9604982818</link><guid>http://blog.smola.org/post/9604982818</guid><pubDate>Tue, 30 Aug 2011 16:36:51 -0700</pubDate></item><item><title>Introduction to Graphical Models</title><description>&lt;p&gt;Here&amp;#8217;s a link to slides [&lt;a title="MLSS Purdue" href="http://alex.smola.org/talks/purdue.key"&gt;Keynote&lt;/a&gt;, &lt;a title="MLSS Purdue" href="http://alex.smola.org/talks/purdue.pdf"&gt;PDF&lt;/a&gt;] for a basic course on Graphical Models for the Internet that I&amp;#8217;m giving at &lt;a title="MLSS 2011" href="http://learning.stat.purdue.edu/mlss/mlss/start"&gt;MLSS 2011&lt;/a&gt; in Purdue that Vishy Vishwanathan is organizing. The selection is quite biased, limited, and subjective, but it&amp;#8217;s meant to complement the other classes at the summer school.&lt;/p&gt;
&lt;p&gt;The slides are likely to grow, so in case of doubt, check for updates. Comments are most welcome. And yes, it&amp;#8217;s a horribly incomplete overview, due to space and time constraints.&lt;/p&gt;</description><link>http://blog.smola.org/post/6631465935</link><guid>http://blog.smola.org/post/6631465935</guid><pubDate>Fri, 17 Jun 2011 13:40:48 -0700</pubDate></item><item><title>Distributed synchronization with the distributed star</title><description>&lt;p&gt;Here&amp;#8217;s a simple synchronization paradigm between many computers that scales with the number of machines involved and which essentially keeps cost at \(O(1)\) per machine. For lack of a better name I&amp;#8217;m going to call it the distributed star since this is what the communication looks like. It&amp;#8217;s quite similar to how memcached stores its (key,value) pairs. &lt;/p&gt;
&lt;p&gt;Assume you have n computers, each of which has a copy of a large parameter vector w (typically several GB), and that we would like to keep these copies approximately synchronized.&lt;/p&gt;
&lt;p&gt;A simple version would be to pause the computers occasionally, have them send their copies to a central node, and then return with a consensus value. Unfortunately this takes \(O(|w| \log n)\) time if we aggregate things on a tree (we can reduce it by streaming data through but this makes the code a lot more tricky). Furthermore we need to stop processing while we do so. The latter may not even be possible and any local computation is likely to benefit from having most up-to-date parameters. &lt;/p&gt;
&lt;p&gt;Instead, we use the following: assume that we can break up the parameter vector into smaller (key, value) pairs that need synchronizing. We now have each computer send its local changes for each key to a central server, update the parameters there, and later receive information about global changes. So far this algorithm looks stupid - after all, when using n machines it would require \(O(|w| n)\) time to process since the central server is the bottleneck. This is where the distributed star comes in. Instead of keeping all data on a single server, we use the well known distributed hashing trick and send each key to a machine s chosen from a pool P of servers:&lt;/p&gt;
&lt;p&gt;$$s(\mathrm{key}, P) = \mathop{\mathrm{argmin}}_{s \in P} ~ h(\mathrm{key}, s)$$&lt;/p&gt;
&lt;p&gt;Here h is the hash function. Such a system spreads communication evenly and it leads to an \(O(|w| n/|P|)\) load per machine. In particular, if we make each of the computers involved in the local computation also members of the pool, i.e. if we have \(n = |P|\) we get an \(O(|w|)\) cost for keeping terms synchronized regardless of the number of machines involved. &lt;/p&gt;
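&lt;p&gt;Here is a minimal sketch of the argmin rule above, which resembles what is often called rendezvous (highest random weight) hashing. The machine names, the key names and the use of MD5 as \(h\) are assumptions made for illustration only; they are not taken from our system.&lt;/p&gt;
&lt;pre&gt;
```python
import hashlib


def h(key, server):
    """Deterministic hash of a (key, server) pair, returned as an integer."""
    data = (str(key) + "|" + str(server)).encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def server_for(key, pool):
    """Return the server in the pool that minimizes h(key, server)."""
    return min(pool, key=lambda server: h(key, server))


pool = ["machine-%d" % i for i in range(8)]   # hypothetical machine names
keys = ["w[%d]" % i for i in range(10000)]    # hypothetical parameter keys

# Count how many keys each machine is responsible for.
load = {server: 0 for server in pool}
for key in keys:
    load[server_for(key, pool)] += 1
print(load)
```
&lt;/pre&gt;
&lt;p&gt;Every client evaluates the same rule, so no lookup table needs to be shared. Removing a machine from the pool only reassigns the keys that machine was responsible for, since the argmin of all other keys is unchanged.&lt;/p&gt;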
&lt;p&gt;Obvious approximations: we assume that all machines are on the same switch. Moreover we assume that the times to open a TCP/IP connection are negligible (we keep them open after the first message) relative to the work to transmit the data. &lt;/p&gt;
&lt;p&gt;The reason I&amp;#8217;m calling this a distributed star is that for each key we have a star communication topology, it&amp;#8217;s just that we use a different star for each key. If anyone in systems knows what this thing is really called, I&amp;#8217;d greatly appreciate feedback. Memcached uses the same setup, alas it doesn&amp;#8217;t have versioned writes and callbacks, so we had to build our own system using &lt;a title="ICE" href="http://www.zeroc.com"&gt;ICE&lt;/a&gt;.&lt;/p&gt;</description><link>http://blog.smola.org/post/6361194871</link><guid>http://blog.smola.org/post/6361194871</guid><pubDate>Thu, 09 Jun 2011 13:01:00 -0700</pubDate></item><item><title>Speeding up Latent Dirichlet Allocation</title><description>&lt;p&gt;The code to our LDA implementation on Hadoop is released on &lt;a title="Yahoo LDA" href="https://github.com/shravanmn/Yahoo_LDA"&gt;Github&lt;/a&gt; under the Mozilla Public License. It&amp;#8217;s seriously fast and scales very well to 1000 machines or more (don&amp;#8217;t worry, it runs on a single machine, too). We believe that at present this is the fastest implementation you can find, in particular if you want to have a) 1000s of topics, b) a large dictionary, c) a large number of documents, and d) Gibbs sampling. It handles quite comfortably a billion documents. Shravan Narayanamurthy deserves all the credit for the code. The paper describing an earlier version of the system appeared in &lt;a title="VLDB paper" href="http://www.vldb.org/pvldb/vldb2010/pvldb_vol3/R63.pdf"&gt;VLDB 2010&lt;/a&gt;. &lt;/p&gt;
&lt;p&gt;Some background: Latent Dirichlet Allocation by Blei, Jordan and Ng &lt;a title="JMLR paper" href="http://jmlr.csail.mit.edu/papers/volume3/blei03a/blei03a.pdf"&gt;(JMLR 2003)&lt;/a&gt; is a great tool for aggregating terms beyond what simple clustering can do. While the original paper showed exciting results it wasn&amp;#8217;t terribly scalable. A significant improvement was the collapsed sampler of Griffiths and Steyvers &lt;a title="Collapsed Sampler" href="http://psiexp.ss.uci.edu/research/papers/sciencetopics.pdf"&gt;(PNAS 2004)&lt;/a&gt;. The key idea was that in an exponential families model with conjugate prior you can integrate out the natural parameter, thus providing a sampler that mixed much more rapidly. It uses the following update equation to sample the topic for a word.&lt;/p&gt;
&lt;p&gt;$$p(t|d,w) \propto \frac{n^*(t,d) + \alpha_t}{n^*(d) + \sum_{t&amp;#8217;} \alpha_{t&amp;#8217;}} \frac{n^*(t,w) + \beta_w}{n^*(t) + \sum_{w&amp;#8217;} \beta_{w&amp;#8217;}}$$&lt;/p&gt;
&lt;p&gt;Here t denotes the topic, d the document, w the word, and \(n(t,d), n(d), n(t,w), n(t)\) denote the number of words which satisfy a particular (topic, document), (document), (topic, word), (topic) combination. The starred quantities such as \(n^*(t,d)\) simply mean that we use the counts where the current word for which we need to resample the topic is omitted. &lt;/p&gt;
&lt;p&gt;Unfortunately the above formula is quite slow when it comes to drawing from a large number of topics. Worst of all, it is nonzero throughout. A rather ingenious trick was proposed by Yao, Mimno, and McCallum &lt;a title="fast sampler" href="http://www.cs.umass.edu/~mimno/papers/fast-topic-model.pdf"&gt;(KDD 2009)&lt;/a&gt;. It uses the fact that the relevant terms in the sum are sparse and only the \(\alpha\) and \(\beta\) dependent terms are dense (and obviously the number of words per document doesn&amp;#8217;t change, hence we can drop that, too). This yields&lt;/p&gt;
&lt;p&gt;$$p(t|d,w) \propto \frac{\alpha_t \beta_w}{n^*(t) + \sum_{w&amp;#8217;} \beta_{w&amp;#8217;}} + n^*(t,d) \frac{n^*(t,w) + \beta_w}{n^*(t) + \sum_{w&amp;#8217;} \beta_{w&amp;#8217;}} + \frac{\alpha_t n^*(t,w)}{n^*(t) + \sum_{w&amp;#8217;} \beta_{w&amp;#8217;}}$$&lt;/p&gt;
&lt;p&gt;Out of these three terms, only the first one is dense, all others are sparse. Hence, if we knew the sum over \(t\) for all three summands we could design a sampler which first samples which of the blocks is relevant and then which topic within that block. This is efficient since the first term doesn&amp;#8217;t actually depend on \(n(t,w)\) or \(n(t,d)\) but rather only on \(n(t)\) which can be updated efficiently after each new topic assignment. In other words, we are able to update the dense term in O(1) operations after each sampling step and the remaining terms are all sparse. This trick gives a 10-50 times speedup in the sampler over a dense representation.&lt;/p&gt;
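&lt;p&gt;To make the bucket idea concrete, here is a small sketch with made-up counts. It splits the numerator as \((n^*(t,d) + \alpha_t)(n^*(t,w) + \beta_w) = \alpha_t \beta_w + n^*(t,d)(n^*(t,w) + \beta_w) + \alpha_t n^*(t,w)\) and then samples a bucket first and a topic second. All numbers and variable names are hypothetical, and the sketch uses dense lists for readability, so it illustrates the algebra rather than the speedup.&lt;/p&gt;
&lt;pre&gt;
```python
import random

random.seed(0)
T = 20                                   # number of topics
alpha = [0.1] * T                        # alpha_t
beta_w, beta_sum = 0.01, 0.01 * 5000     # beta_w and the sum of beta over a 5000-word dictionary

# Hypothetical counts with the current word already removed (the starred counts).
n_t = [random.randint(50, 500) for _ in range(T)]    # n*(t)
n_td = [0] * T                                       # n*(t,d): sparse
n_tw = [0] * T                                       # n*(t,w): sparse
for t in (2, 7, 11):
    n_td[t] = random.randint(1, 20)
for t in (7, 15):
    n_tw[t] = random.randint(1, 20)

denom = [n_t[t] + beta_sum for t in range(T)]

# Dense form: (n*(t,d) + alpha_t) (n*(t,w) + beta_w) / (n*(t) + sum of beta).
dense = [(n_td[t] + alpha[t]) * (n_tw[t] + beta_w) / denom[t] for t in range(T)]

# Three buckets: smoothing-only, document-topic, and topic-word.
smooth = [alpha[t] * beta_w / denom[t] for t in range(T)]
doc = [n_td[t] * (n_tw[t] + beta_w) / denom[t] for t in range(T)]
word = [alpha[t] * n_tw[t] / denom[t] for t in range(T)]


def sample_topic():
    """Pick a bucket in proportion to its mass, then a topic inside it."""
    buckets = [smooth, doc, word]
    bucket = random.choices(buckets, weights=[sum(b) for b in buckets])[0]
    return random.choices(range(T), weights=bucket)[0]


print(sample_topic())
```
&lt;/pre&gt;
&lt;p&gt;A real sampler would store only the nonzero entries of the document-topic and topic-word buckets and would keep the bucket sums up to date incrementally rather than recomputing them.&lt;/p&gt;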
&lt;p&gt;To combine several machines we have two alternatives: one is to perform one sampling pass over the data and then reconcile the samplers. This was proposed by Newman, Asuncion, Smyth, and Welling &lt;a title="asuncion jmlr paper" href="http://www.ics.uci.edu/~asuncion/pubs/JMLR_09.pdf"&gt;(JMLR 2009)&lt;/a&gt;. While the approach proved to be feasible, it has a number of disadvantages. It only exercises the network while the CPU sits idle and vice versa. Secondly, a deferred update makes for slower mixing. Instead, one can simply have each sampler communicate with a distributed central storage continuously. In a nutshell, each node sends the differential to the global statekeeper and receives from it the latest global value. The key point is that this occurs asynchronously and moreover that we are able to decompose the state over several machines such that the available bandwidth grows with the number of machines involved. More on such distributed schemes in a later post.&lt;/p&gt;</description><link>http://blog.smola.org/post/6359713161</link><guid>http://blog.smola.org/post/6359713161</guid><pubDate>Thu, 09 Jun 2011 12:02:00 -0700</pubDate></item><item><title>Bloom Filters</title><description>&lt;p&gt;Bloom filters are one of the really ingenious and simple building blocks for randomized data structures. A great summary is the paper by &lt;a title="Bloom Filter Review" href="http://projecteuclid.org/DPubS?service=UI&amp;amp;version=1.0&amp;amp;verb=Display&amp;amp;handle=euclid.im/1109191032"&gt;Broder and Mitzenmacher&lt;/a&gt;. 
In this post I will briefly review its key ideas, since it forms the basis of the &lt;a title="Countmin" href="https://sites.google.com/site/countminsketch/"&gt;Count-Min sketch&lt;/a&gt; of Cormode and Muthukrishnan. It will also be necessary for an accelerated version of the graph kernel of &lt;a title="Nino's NIPS 2009 talk" href="http://videolectures.net/nips09_shervashidze_fsk/"&gt;Shervashidze and Borgwardt&lt;/a&gt;, and finally, a similar structure will be needed to process data streams over time for a real-time sketching service.&lt;/p&gt;
&lt;p&gt;At its heart a Bloom filter uses a bit vector b of length N and a set of k hash functions mapping arbitrary keys x into their hash values \(h_i(x) \in [1 .. N]\) where \(i \in \{1 .. k\}\) denotes the hash function. The Bloom filter allows us to perform approximate set membership tests where we have no false negatives but we may have a small number of false positives. &lt;/p&gt;
&lt;p&gt;Initialize(b): Set all \(b[i] = 0\)&lt;/p&gt;
&lt;p&gt;Insert(b,x): For all \(i \in \{1 .. k\}\) set \(b[h_i(x)] = 1\)&lt;/p&gt;
&lt;p&gt;Query(b, x): Return true if \(b[h_i(x)] = 1\) for all \(i \in \{1 .. k\}\), false otherwise&lt;/p&gt;
&lt;p&gt;Furthermore, unions and intersections between sets are easily achieved by performing bit-wise OR and AND operations, respectively, on the Bloom filters of the corresponding sets.&lt;/p&gt;
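&lt;p&gt;The three operations above can be sketched in a few lines. This is an illustrative toy rather than a tuned implementation: it stores one list entry per bit, uses zero-based positions instead of \([1 .. N]\), and obtains the k hash functions by salting SHA-256 with the index i, all of which are assumptions of this sketch.&lt;/p&gt;
&lt;pre&gt;
```python
import hashlib


class BloomFilter:
    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = [0] * n_bits          # Initialize: all bits are 0

    def _positions(self, x):
        # Derive k hash values by salting one hash function with the index i.
        for i in range(self.n_hashes):
            data = (str(i) + ":" + str(x)).encode("utf-8")
            digest = hashlib.sha256(data).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def insert(self, x):
        for pos in self._positions(x):
            self.bits[pos] = 1

    def query(self, x):
        return all(self.bits[pos] == 1 for pos in self._positions(x))

    def union(self, other):
        # Bitwise OR of two filters built with the same size and hash functions.
        result = BloomFilter(self.n_bits, self.n_hashes)
        result.bits = [a | b for a, b in zip(self.bits, other.bits)]
        return result


seen = BloomFilter(1024, 3)
for word in ["apple", "banana"]:
    seen.insert(word)
print(seen.query("apple"), seen.query("cherry"))
```
&lt;/pre&gt;
&lt;p&gt;Queries for inserted keys always return true. A query for a key that was never inserted usually returns false, but it may return true if all of its bits happen to be set by other keys.&lt;/p&gt;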
&lt;p&gt;It is clear that if we inserted x into the Bloom filter the query will return true, since all relevant bits in b are 1. To analyze the probability of a false positive take the probability of a bit being 1. After inserting m items using k hash functions on a range of N we have&lt;/p&gt;
&lt;p&gt;$$\Pr(b[i] = \mathrm{TRUE}) = 1 - (1 - \frac{1}{N})^{k m} \approx 1 - e^{-\frac{km}{N}}$$&lt;/p&gt;
&lt;p&gt;For a false positive to occur we need to have all k bits associated with the hash functions to be 1. Ignoring the fact that the hash functions might collide the probability of false positives is given by&lt;/p&gt;
&lt;p&gt;$$p \approx (1 - e^{-\frac{km}{N}})^k$$&lt;/p&gt;
&lt;p&gt;Taking derivatives with respect to \(\frac{km}{N}\) shows that the minimum is obtained for \(\log 2\), that is \(k = \frac{N}{m} \log 2\).&lt;/p&gt;
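&lt;p&gt;As a quick numerical illustration of this trade-off, the sketch below evaluates the approximate false positive rate for a range of k. The sizes N and m are hypothetical and chosen to give 10 bits per stored item.&lt;/p&gt;
&lt;pre&gt;
```python
import math


def false_positive_rate(n_bits, n_items, n_hashes):
    """Approximate rate (1 - exp(-k m / N)) ** k from the formula above."""
    return (1.0 - math.exp(-n_hashes * n_items / n_bits)) ** n_hashes


def optimal_hashes(n_bits, n_items):
    """Real-valued optimum k = (N / m) log 2; round to an integer in practice."""
    return n_bits / n_items * math.log(2)


N, m = 10000, 1000      # hypothetical sizes: 10 bits per stored item
k_star = optimal_hashes(N, m)
print(k_star)
for k in range(1, 13):
    print(k, false_positive_rate(N, m, k))
```
&lt;/pre&gt;
&lt;p&gt;At the optimum each bit is set with probability 1/2, so the false positive rate is \(2^{-k}\). With 10 bits per item the real-valued optimum is about 6.93, so in practice one would use 7 hash functions.&lt;/p&gt;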
&lt;p&gt;One of the really nice properties of the Bloom filter is that all memory is used to store the information about the set rather than an index structure storing the keys of the items. The downside is that it is impossible to read out b without knowing the queries. Also note that it is impossible to remove items from the Bloom filter once they&amp;#8217;ve been inserted. After all, we do not know whether some of the bits might have collided with another key, hence setting the corresponding bits to 0 would cause false negatives. &lt;/p&gt;</description><link>http://blog.smola.org/post/4206530042</link><guid>http://blog.smola.org/post/4206530042</guid><pubDate>Wed, 30 Mar 2011 03:47:00 -0700</pubDate></item><item><title>Real simple covariate shift correction</title><description>&lt;p&gt;Imagine you want to design some algorithm to detect cancer. You get data of healthy and sick people; you train your algorithm; it works fine giving you high accuracy and you conclude that you&amp;#8217;re ready for a successful career in medical diagnostics.&lt;/p&gt;
&lt;p&gt;Not so fast &amp;#8230;&lt;/p&gt;
&lt;p&gt;Many things could go wrong. In particular, the distributions that you work with for training and those in the wild might differ considerably. This happened to an unfortunate startup I had the opportunity to consult for many years ago. They were developing a blood test for a disease that affects mainly older men and they&amp;#8217;d managed to obtain a fair amount of blood samples from patients. It is considerably more difficult, though, to obtain blood samples from healthy men (mainly for ethical reasons). To compensate for that, they asked a large number of students on campus to donate blood and they performed their test. Then they asked me whether I could help them build a classifier to detect the disease. I told them that it would be very easy to distinguish between both datasets with probably near perfect accuracy. After all, the test subjects differed in age, hormone level, physical activity, diet, alcohol consumption, and many more factors unrelated to the disease. This was unlikely to be the case with real patients: Their sampling procedure had caused an extreme case of covariate shift that couldn&amp;#8217;t be corrected by conventional means. In other words, training and test data were so different that nothing useful could be done and they had wasted significant amounts of money. &lt;/p&gt;
&lt;p&gt;In general the situation is not quite so dire. Assume that we want to estimate some dependency \(p(y|x)\) for which we have labeled data \((x_i, y_i)\). Alas, the observations \(x_i\) are drawn from some distribution \(q(x)\) rather than the &amp;#8216;proper&amp;#8217; distribution \(p(x)\). If we adopt a risk minimization approach, that is, if we want to solve&lt;/p&gt;
&lt;p&gt;$$\mathrm{minimize}_{f} \frac{1}{m} \sum_{i=1}^m l(x_i, y_i, f(x_i)) + \frac{\lambda}{2} \|f\|^2$$&lt;/p&gt;
&lt;p&gt;we will need to reweight each instance by the ratio of probabilities that it would have been drawn from the correct distribution, that is, we need to reweight things by \(\frac{p(x_i)}{q(x_i)}\). This is the ratio of how frequently an instance would have occurred under the correct distribution vs. how frequently it occurred under the sampling distribution \(q\). It is sometimes also referred to as the Radon-Nikodym derivative. Such a method is called importance sampling and the following derivation shows why it is valid:&lt;/p&gt;
&lt;p&gt;$$\int f(x) dp(x) = \int f(x) \frac{dp(x)}{dq(x)} dq(x)$$&lt;/p&gt;
&lt;p&gt;Alas, we do not know \(\frac{dp(x)}{dq(x)}\), so before we can do anything useful we need to estimate the ratio. Many methods are available, e.g. some rather fancy operator theoretic ones which try to recalibrate the expectation operator directly using a minimum-norm or a maximum entropy principle. However, there exists a much more pedestrian, yet quite effective approach that will give almost as good results: logistic regression. &lt;/p&gt;
&lt;p&gt;After all, we know how to estimate probability ratios. This is achieved by learning a classifier to distinguish between data drawn from \(p\) and data drawn from \(q\). If it is impossible to distinguish between the two distributions then it means that the associated instances are equally likely to come from either one of the two distributions. On the other hand, any instances that can be well discriminated should be significantly over/underweighted accordingly. For simplicity&amp;#8217;s sake assume that we have an equal number of instances from both distributions, denoted by \(x_i \sim p(x)\) and \(x_i&amp;#8217; \sim q(x)\) respectively. Now denote by \(z_i\) labels which are 1 for data drawn from \(p\) and -1 for data drawn from \(q\). Then the probability of the label in the mixed dataset is given by&lt;/p&gt;
&lt;p&gt;$$p(z=1|x) = \frac{p(x)}{p(x) + q(x)}$$&lt;/p&gt;
&lt;p&gt;Hence, if we use a logistic regression approach which yields \(p(z=1|x) = \frac{1}{1 + e^{-f(x)}}\), it follows (after some simple algebra) that &lt;/p&gt;
&lt;p&gt;$$\frac{p(z=1|x)}{p(z=-1|x)} = e^{f(x)}.$$&lt;/p&gt;
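&lt;p&gt;Since \(p(z=1|x) / p(z=-1|x) = p(x) / q(x)\) in the balanced mixture, the identity above says that \(e^{f(x)}\) estimates the ratio we are after. Here is a toy check of this, as a sketch under stated assumptions: \(p\) and \(q\) are unit-variance Gaussians chosen for illustration, \(f(x) = a x + b\) is linear, and the fit uses plain gradient descent without the regularizer.&lt;/p&gt;
&lt;pre&gt;
```python
import math
import random

random.seed(0)
m = 2000
xs_p = [random.gauss(1.0, 1.0) for _ in range(m)]   # draws from p = N(1, 1), label z = +1
xs_q = [random.gauss(0.0, 1.0) for _ in range(m)]   # draws from q = N(0, 1), label z = -1
data = [(x, 1.0) for x in xs_p] + [(x, -1.0) for x in xs_q]

# Fit f(x) = a x + b by gradient descent on the average logistic loss
# log(1 + exp(-z f(x))); the regularizer is dropped for brevity.
a, b = 0.0, 0.0
for _ in range(300):
    grad_a = grad_b = 0.0
    for x, z in data:
        s = -z / (1.0 + math.exp(z * (a * x + b)))   # derivative of the loss w.r.t. f(x)
        grad_a += s * x
        grad_b += s
    a -= 0.5 * grad_a / len(data)
    b -= 0.5 * grad_b / len(data)


def weight(x):
    """Estimated ratio p(x) / q(x) = exp(f(x))."""
    return math.exp(a * x + b)


print(a, b)   # for these two Gaussians the exact log-ratio is x - 0.5
```
&lt;/pre&gt;
&lt;p&gt;For this pair of Gaussians the exact log-ratio is \(x - \frac{1}{2}\), so the fitted slope and offset should come out near 1 and -0.5, up to sampling noise.&lt;/p&gt;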
&lt;p&gt;Now we only need to solve the logistic regression problem&lt;/p&gt;
&lt;p&gt;$$\mathrm{minimize}_f \frac{1}{2m} \sum_{(x,z)} \log [1 + \exp(-z f(x))] + \frac{\lambda}{2} \|f\|^2$$&lt;/p&gt;
&lt;p&gt;to obtain \(f\). Subsequently we can use \(e^{f(x_i)}\) as covariate shift correction weights in training our actual classifier. The good news is that we can use an off-the-shelf tool such as logistic regression to deal with a decidedly nonstandard estimation problem. &lt;/p&gt;</description><link>http://blog.smola.org/post/4110255196</link><guid>http://blog.smola.org/post/4110255196</guid><pubDate>Sat, 26 Mar 2011 09:44:00 -0700</pubDate></item><item><title>Graphical Models for the Internet</title><description>&lt;p&gt;Here are a few tutorial slides I prepared with &lt;a title="Amr Ahmed" href="http://www.cs.cmu.edu/~amahmed/"&gt;Amr Ahmed&lt;/a&gt; for &lt;a title="WWW 2011" href="http://www.www2011india.com/"&gt;WWW 2011&lt;/a&gt; in Hyderabad next week. They describe in fairly basic (and in the end rather advanced) terms how one might use graphical models for the amounts of data available on the internet. Comments and feedback are much appreciated. &lt;/p&gt;
&lt;p&gt;&lt;a title="WWW 2011 tutorial" href="http://alex.smola.org/drafts/www11-1.pdf"&gt;PDF&lt;/a&gt; &lt;a title="WWW 2011 tutorial slides" href="http://alex.smola.org/drafts/www11-1.key"&gt;Keynote&lt;/a&gt;&lt;/p&gt;</description><link>http://blog.smola.org/post/4075687192</link><guid>http://blog.smola.org/post/4075687192</guid><pubDate>Thu, 24 Mar 2011 19:02:57 -0700</pubDate></item><item><title>Memory Latency, Hashing, Optimal Golomb Rulers and Feistel Networks</title><description>&lt;p&gt;In many problems involving hashing we want to look up a range of elements from a vector, e.g. of the form \(v[h(i,j)]\) for arbitrary \(i\) and for a range of \(j \in \{1, \ldots, n\}\) where \(h(i,j)\) is a hash function. This happens e.g. for multiclass classification, collaborative filtering, and multitask learning. &lt;/p&gt;
&lt;p&gt;While this works just fine in terms of estimation performance, traversing all values of j leads to an algorithm which is horrible in terms of memory access patterns. Modern RAM chips are much faster (over 10x) when it comes to reading values in sequence than when carrying out random reads. Furthermore, random access destroys the benefit of a cache. This leads to algorithms which are efficient in terms of their memory footprint, however, they can be relatively slow. One way to address this is to bound the range of \(h(i,j)\) for different values of j. Here are some ways of how we could do this:&lt;/p&gt;
&lt;ol&gt;&lt;li&gt;Decompose \(h(i,j) = h(i) + j\). This is computationally very cheap, it has good sequential access properties but it leads to horrible collisions should there ever be two \(i\) and \(i&amp;#8217;\) for which \(|h(i) - h(i&amp;#8217;)| \leq n\). &lt;/li&gt;
&lt;li&gt;Decompose \(h(i,j) = h(i) + h&amp;#8217;(j)\) where \(h&amp;#8217;(j)\) has a small range of values. &lt;br/&gt;This is a really bad idea since now we have a nontrivial probability of collision as soon as the range of \(h&amp;#8217;(j)\) is less than \(n^2\) due to the birthday paradox. Moreover, for adjacent values \(h(i)\) and \(h(i&amp;#8217;)\) we will get many collisions.&lt;/li&gt;
&lt;li&gt;Decompose \(h(i,j) = h(i) + g(j)\) where \(g(j)\) is an &lt;a title="Optimal Golomb Ruler" href="http://en.wikipedia.org/wiki/Golomb_ruler"&gt;Optimal Golomb Ruler&lt;/a&gt;.&lt;br/&gt;The latter is an increasing sequence of integers for which any pairwise distance occurs exactly once. In other words, for \(a \neq b\) the condition \(g(a) - g(b) = g(c) - g(d)\) implies that \(a = c\) and \(b = d\). &lt;a title="John Langford" href="http://hunch.net/~jl"&gt;John Langford&lt;/a&gt; proposed this to address the problem. In fact, it solves our problem since a) there are no collisions for a fixed \(i\) and b) for neighboring values \(h(i)\) and \(h(i&amp;#8217;)\) we will get at most one collision (due to the Golomb ruler property). Alas, this only works up to \(n=26\) since finding an Optimal Golomb Ruler is hard (it is currently unknown whether it is actually NP hard).&lt;/li&gt;
&lt;li&gt;An alternative that works for larger n and that is sufficiently simple to compute is to use cryptography. After all, all we want is that the hash function \(h&amp;#8217;(j)\) has small range and that it doesn&amp;#8217;t have any self collisions or any systematic collisions. We can achieve this by encrypting j using the key i to generate an encrypted message of N possible values. In other words we use&lt;br/&gt;$$h(i,j) = h(i) + \mathrm{crypt}(j|i,N)$$&lt;br/&gt;Since it is an encryption of j, the mapping is invertible and we won&amp;#8217;t have collisions between different j for a given value of i. Furthermore, for different i the encodings will be uncorrelated (after all, i is the key). Finally, we can control the range \(N&amp;gt;n\) simply by choosing the encryption algorithm. In this case the random memory access is of bounded range, hence the CPU cache will not suffer from too many misses.&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;A particularly nice algorithm is the &lt;a title="Feistel cipher" href="http://en.wikipedia.org/wiki/Feistel_cipher"&gt;Feistel cipher&lt;/a&gt;. It works as follows: define the iterative map&lt;/p&gt;
&lt;p&gt;$$f(x,y) = (y, x \mathrm{XOR} h(y))$$&lt;/p&gt;
&lt;p&gt;Here \(h\) is a hash function. After 4 iterations \((x,y) \to f(x,y)\) we obtain an encryption of \((x,y)\). Now use \(x=i\) and \(y = j\) to obtain the desired result. Basically we are trading off memory latency with computation (which is local).&lt;/p&gt;</description><link>http://blog.smola.org/post/3243371889</link><guid>http://blog.smola.org/post/3243371889</guid><pubDate>Fri, 11 Feb 2011 17:56:00 -0800</pubDate></item><item><title>Collaborative Filtering considered harmful</title><description>&lt;p&gt;Much excellent work has been published on collaborative filtering, in particular in terms of recovering missing entries in a matrix. The Netflix contest has contributed a significant amount to the progress in the field. &lt;/p&gt;
&lt;p&gt;Alas, reality is not quite as simple as that. Very rarely will we ever be able to query a user about arbitrary movies, books, or other objects. Instead, user ratings are typically expressed as &lt;em&gt;preferences&lt;/em&gt; rather than absolute statements: a preference for &lt;em&gt;Die Hard&lt;/em&gt; given a generic set of movies only tells us that the user appreciates action movies; however, a preference for &lt;em&gt;Die Hard&lt;/em&gt; over &lt;em&gt;Terminator&lt;/em&gt; or &lt;em&gt;Rocky&lt;/em&gt; suggests that the user might favor Bruce Willis over other action heroes. In other words, the context of user choice is vital when estimating user preferences. &lt;/p&gt;
&lt;p&gt;Hence if we attempt to estimate scores \(s_{ui}\) of user \(u\) regarding item \(i\) it is important to use the context within which the ratings have been obtained. For instance, if we are given a session of items \((i_1, \ldots, i_n)\) out of which item \(i^*\) was selected we might want to consider a logistic model of the form:&lt;/p&gt;
&lt;p&gt;$$-\log p(i^*|i_1, \ldots, i_n) = \log \left[\sum_{j=1}^n e^{s_{u i_j}} \right] - s_{ui^*}$$&lt;/p&gt;
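&lt;p&gt;The loss above is simple to compute for a single session. The following sketch uses made-up scores for one user; the function name and the score values are hypothetical.&lt;/p&gt;
&lt;pre&gt;
```python
import math


def session_nll(scores, chosen):
    """Negative log-probability that item `chosen` is picked from the session.

    `scores` maps each item shown in the session to the user's score s_ui.
    """
    top = max(scores.values())                      # shift for numerical stability
    log_sum = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return log_sum - scores[chosen]


# Hypothetical scores of one user for the three movies shown in a session.
scores = {"Die Hard": 2.0, "Terminator": 1.5, "Rocky": 0.5}
for movie in scores:
    print(movie, session_nll(scores, movie))
```
&lt;/pre&gt;
&lt;p&gt;The loss depends on the scores of all items in the session, not only on the selected one, which is exactly how the context of the choice enters the estimate.&lt;/p&gt;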
&lt;p&gt;The option of no action is easy to add, simply by adding the null score \(s_{u0}\) which captures the event of no action by a user.&lt;br/&gt;&lt;a title="Shuang Hong" href="http://www.cc.gatech.edu/~syang46/"&gt;Shuang Hong&lt;/a&gt; tried out this idea to get a significant performance improvement on a number of collaborative filtering datasets. Bottom line - make sure that the problem you&amp;#8217;re solving is actually the one that a) generated the data and b) that will help you in practice. That is, in many cases &lt;em&gt;matrix completion is not the problem &lt;/em&gt;you want to solve, even though it might win you benchmarks.&lt;/p&gt;</description><link>http://blog.smola.org/post/3241732437</link><guid>http://blog.smola.org/post/3241732437</guid><pubDate>Fri, 11 Feb 2011 16:18:00 -0800</pubDate></item><item><title>Why</title><description>&lt;p&gt;Some readers might wonder why I&amp;#8217;m writing this blog. Here&amp;#8217;s an (incomplete) list:&lt;/p&gt;
&lt;ul&gt;&lt;li&gt;It&amp;#8217;s fun.&lt;/li&gt;
&lt;li&gt;There are lots of fantastic blogs discussing the philosophy and big questions of machine learning (e.g. John Langford&amp;#8217;s &lt;a title="Hunch" href="http://hunch.net"&gt;hunch.net&lt;/a&gt;) but I couldn&amp;#8217;t find many covering simple tricks of the trade.&lt;/li&gt;
&lt;li&gt;Scientific papers sometimes obscure simple ideas. In the most extreme case, a paper will get rejected if the idea is presented in too simple terms (it happened to me more than once and the paper was praised once the simple parts had been obfuscated). Also, they need to come with ample evidence for why an idea works, strong theoretical guarantees and lots of experiments. This is all needed as a safeguard and it&amp;#8217;s really really important. But it often hides the basic idea.&lt;/li&gt;
&lt;li&gt;Some ideas are really cute and useful but not big enough to write a paper about. It&amp;#8217;s pointless to write 10 pages if the idea can be fully covered in 1 page. We&amp;#8217;d need a journal of 1 page ideas to deal with this.&lt;/li&gt;
&lt;li&gt;Many practitioners are scared to pick up a paper with many equations but they might be willing to spend 10 minutes reading a blog post.&lt;/li&gt;
&lt;/ul&gt;</description><link>http://blog.smola.org/post/1130285201</link><guid>http://blog.smola.org/post/1130285201</guid><pubDate>Wed, 15 Sep 2010 21:17:43 -0700</pubDate></item><item><title>Hashing for Collaborative Filtering</title><description>&lt;p&gt;This is a follow-up on the hashing for linear functions post. It&amp;#8217;s based on the &lt;a title="HashCoFi" href="http://jmlr.csail.mit.edu/proceedings/papers/v9/karatzoglou10a/karatzoglou10a.pdf"&gt;HashCoFi&lt;/a&gt; paper that &lt;a title="Markus Weimer" href="http://www.weimo.de"&gt;Markus Weimer&lt;/a&gt;, &lt;a title="Alexandros" href="http://www.ci.tuwien.ac.at/~alexis/"&gt;Alexandros Karatzoglou&lt;/a&gt; and I wrote for AISTATS&amp;#8217;10. It deals with the issue of running out of memory when you want to use collaborative filtering for very large problems. Here&amp;#8217;s the setting:&lt;/p&gt;
&lt;p&gt;Assume you want to do Netflix-style collaborative filtering, i.e. you want to estimate entries in a ratings matrix of (user, movie) pairs. A rather effective approach is to use matrix factorization, that is, to approximate \(M \approx U V^\top\) where M is the ratings matrix, U is the (tall and skinny) matrix of features for each user, stacked up, and V is the counterpart for movies. This works well for the Netflix prize since the number of users and movies is comparatively small.&lt;/p&gt;
&lt;p&gt;In reality we might have, say, 100 million users for which we might want to recommend products. One option is to distribute all these users over several servers (similar to what a distributed hash table mapping does, e.g. for libmemcached). Alternatively, if we want to keep it all on one server, we&amp;#8217;re facing the problem of having to store \(10^8 \cdot 100 \cdot 4 = 4 \cdot 10^{10}\) bytes, i.e. 40&amp;#160;GB, if we allocate 400 bytes per user, that is, 100 dimensions of 4 bytes each (that&amp;#8217;s a rather small footprint). Usually this is too big for all but the biggest servers. Even worse, suppose that we have user churn. That is, new users might be arriving while old users disappear (obviously we don&amp;#8217;t know whether they&amp;#8217;ll ever come back again so we don&amp;#8217;t really want to de-allocate the memory devoted to them). Obviously we cannot add more RAM. One possible solution is to store the data on disk and request it whenever a user arrives. This will cost us 5-10ms latency. An SSD will improve this but it still limits throughput. Moreover, it&amp;#8217;ll require cache management algorithms to interact with the collaborative filtering code. &lt;/p&gt;
&lt;p&gt;Here&amp;#8217;s a simple alternative: apply the hashing trick that we used for vectors to matrices. Recall that in the exact case we compute matrix entries via&lt;/p&gt;
&lt;p&gt;$$M[i,j] = \sum_{k=1}^{K} U[i,k] V[j,k]$$&lt;/p&gt;
&lt;p&gt;Now denote by \(h_u\) and \(h_v\) hash functions mapping pairs of integers to a given hash range \([1 \ldots N]\). Moreover, let \(\sigma_u\) and \(\sigma_v\) be corresponding Rademacher hash functions which return a binary hash in \(\{\pm 1\}\). Now replace the above sum via&lt;/p&gt;
&lt;p&gt;$$M[i,j] = \sum_{k=1}^{K} u[h_u(i,k)] \sigma_u(i,k) v[h_v(j,k)] \sigma_v(j,k)$$&lt;/p&gt;
&lt;p&gt;What happened is that now all access into U is replaced by access into a vector u of length N (and the same holds true for V). Why does this work? Firstly, we can prove that if we construct u and v from U and V via&lt;/p&gt;
&lt;p&gt;$$u[k] = \sum_{h_u(i,j) = k} \sigma_u(i,j) U[i,j] \text{ and } v[k] = \sum_{h_v(i,j) = k} \sigma_v(i,j) V[i,j]$$&lt;/p&gt;
&lt;p&gt;then the approximate version of \(M[i,j]\) converges to the correct \(M[i,j]\) with variance \(O(1/N)\) and moreover that the estimate is unbiased. Getting the exact expressions is a bit tedious and they&amp;#8217;re described in the paper. In practice, things are even better than this rate: since we never use U and V but always u and v we simply optimize with respect to the compressed representation. &lt;/p&gt;
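&lt;p&gt;To make the construction concrete, here is a minimal sketch of the compression and the hashed lookup. The names bucket, sign, compress, lookup and predict are made up for illustration, and the built-in tuple hash of Python merely stands in for a proper hash function, so treat this as a sketch under those assumptions rather than the implementation from the paper.&lt;/p&gt;

```python
import numpy as np

def bucket(seed, i, k, N):
    # Hypothetical hash h(i, k): built-in tuple hash reduced to range(N).
    # A real system would use a stronger hash function here.
    return hash((seed, i, k, 0)) % N

def sign(seed, i, k):
    # Hypothetical Rademacher hash sigma(i, k), returns +1 or -1.
    return 1 - 2 * (hash((seed, i, k, 1)) % 2)

def compress(U, N, seed):
    # u[b] = sum of sigma(i, k) * U[i, k] over all (i, k) hashed to bucket b
    u = np.zeros(N)
    rows, cols = U.shape
    for i in range(rows):
        for k in range(cols):
            u[bucket(seed, i, k, N)] += sign(seed, i, k) * U[i, k]
    return u

def lookup(u, seed, i, k):
    # Hashed replacement for the access U[i, k]
    return u[bucket(seed, i, k, len(u))] * sign(seed, i, k)

def predict(u, v, i, j, K, seed_u=0, seed_v=1):
    # Approximate M[i, j] using only the compressed vectors u and v
    return sum(lookup(u, seed_u, i, k) * lookup(v, seed_v, j, k)
               for k in range(K))
```

&lt;p&gt;In this sketch compress is only needed to check the approximation against an explicit U and V. In practice one would run the optimizer directly on u and v and call predict.&lt;/p&gt;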
&lt;p&gt;One of the advantages of the compressed representation is that we never really need to have any knowledge of all the rows of U. In particular, rather than mapping user IDs to rows in U we simply use the user ID as the hash key. If a new user appears, memory is effectively allocated to the new user by means of the hash function. If a user disappears, his parameters will simply get overwritten if we perform stochastic gradient descent with respect to the u and v vectors. The same obviously holds for movies or any other entity one would like to recommend. &lt;/p&gt;
&lt;p&gt;Bottom line - we now can have fast (in memory) access to user parameters regardless of the number of users. The downside is that the latency is still quite high: remember that the hash function requires us to access \(u[h_u(i,k)]\) for many different values of k. This means that each access in k is a cache miss, i.e. it&amp;#8217;ll cost us 100-200ns RAM latency rather than the 10-20ns we&amp;#8217;d pay for burst reads. How to break this latency barrier is the topic of one of the next posts.&lt;/p&gt;</description><link>http://blog.smola.org/post/1130198570</link><guid>http://blog.smola.org/post/1130198570</guid><pubDate>Wed, 15 Sep 2010 20:59:17 -0700</pubDate></item><item><title>Priority Sampling</title><description>&lt;p&gt;&lt;a title="Tamas Sarlos" href="http://research.yahoo.com/Tamas_Sarlos"&gt;Tamas Sarlos&lt;/a&gt; pointed out a much smarter strategy on how to obtain a sparse representation of a (possibly dense) vector: &lt;a title="Priority Sampling" href="http://doi.acm.org/10.1145/1314690.1314696"&gt;Priority Sampling&lt;/a&gt; by Duffield, Lund and Thorup (Journal of the ACM 2006).  The idea is quite ingenious and (surprisingly so) essentially optimal, as &lt;a title="Mario Szegedy" href="http://www.cs.rutgers.edu/~szegedy/"&gt;Mario Szegedy&lt;/a&gt; showed. Here&amp;#8217;s the algorithm:&lt;/p&gt;
&lt;p&gt;For each \(x_i\) compute a priority \(p_i = \frac{x_i}{a_i}\) where \(a_i \sim U(0, 1]\) is drawn independently from a uniform distribution. Denote by \(\tau\) the (k+1)-th largest such priority. Then pick the k indices i which satisfy \(p_i &amp;gt; \tau\) and assign them the value \(s_i = \max(x_i, \tau)\). All other coordinates are set to \(s_i = 0\).&lt;/p&gt;
&lt;p&gt;This provides an estimator with the following properties:&lt;/p&gt;
&lt;ol&gt;&lt;li&gt;The variance is no larger than that of the best k+1 sparse estimator.&lt;/li&gt;
&lt;li&gt;The entries \(s_i\) satisfy \(\mathbf{E}[s_i] = x_i\)&lt;/li&gt;
&lt;li&gt;The covariance vanishes, i.e. \(\mathbf{E}[s_i s_j] = x_i x_j\) for \(i \neq j\)&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;Note that we assumed that all \(x_i \geq 0\). Otherwise simply apply the same algorithm to \(|x_i|\) and then return signed versions of the estimate.&lt;/p&gt;
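&lt;p&gt;The steps above can be sketched in a few lines of NumPy. The function name priority_sample and the rng argument are assumptions made for this example. The sketch assumes nonnegative entries and that k is strictly smaller than the length of x.&lt;/p&gt;

```python
import numpy as np

def priority_sample(x, k, rng):
    # Assumes all entries of x are nonnegative and k is strictly
    # smaller than len(x), so a (k+1)-th priority exists.
    x = np.asarray(x, dtype=float)
    a = 1.0 - rng.random(x.shape)   # uniform on (0, 1], avoids division by zero
    p = x / a                       # priorities p_i = x_i / a_i
    order = np.argsort(-p)          # indices sorted by decreasing priority
    tau = p[order[k]]               # the (k+1)-th largest priority
    s = np.zeros_like(x)
    keep = order[:k]                # the k indices with priority above tau
    s[keep] = np.maximum(x[keep], tau)
    return s
```

&lt;p&gt;For instance, priority_sample(x, 3, np.random.default_rng(0)) returns a vector of the same length as x with at most 3 nonzero entries.&lt;/p&gt;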
</description><link>http://blog.smola.org/post/1078486350</link><guid>http://blog.smola.org/post/1078486350</guid><pubDate>Mon, 06 Sep 2010 18:19:00 -0700</pubDate></item><item><title>Random elements from a stream</title><description>&lt;p&gt;This is a classic trick when dealing with data streams. It shows how to draw a random element from a sequence of instances without knowing beforehand how long the sequence is and which symbols occur. &lt;/p&gt;
&lt;p&gt;Let us first assume that we knew the identities of all symbols. In this case finding a random symbol would be easy. All we require is that for each symbol s we draw an independent random variable \(\xi_s \sim U[0,1]\) from the uniform distribution and subsequently we choose the symbol $$s^* = \mathrm{argmin}_s \xi_s.$$ Since each s has equal probability of being associated with the smallest value \(\xi_s\) it follows that the draw is uniformly random.&lt;/p&gt;
&lt;p&gt;Now assume that instead of requesting a random variable \(\xi_s\) we simply compute the hash of s via \(h(s)\) and we set $$s^* = \mathrm{argmin}_s h(s).$$ For a draw from the space of hash functions this again is uniform. The advantage is that we essentially determined all the random bits when selecting \(h\) rather than at the time when we want to compute its value \(h(s)\). The second advantage is that we can now simply keep track of what is the currently smallest value of \(h(s)\) and update as we go along. We have the following algorithm:&lt;/p&gt;
&lt;pre&gt;INIT
   hstar = MAXINT 
   n = 0 
   sstar = NONE
FOR ALL incoming s DO
   IF h(s) = hstar:
      n = n + 1
   ELSE IF h(s) &amp;lt; hstar:
      n = 1
      hstar = h(s)
      sstar = s
RETURN (sstar, n)
&lt;/pre&gt;
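&lt;p&gt;The pseudocode above translates almost line by line into Python. In this sketch the hash h is an assumption: it takes the first 8 bytes of a SHA-256 digest, and any other well-mixed hash function would do. The names h and sample_with_count are made up for the example.&lt;/p&gt;

```python
import hashlib

def h(s):
    # Hypothetical hash function: first 8 bytes of SHA-256 as an integer
    digest = hashlib.sha256(str(s).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def sample_with_count(stream, hash_fn=h):
    hstar = None    # smallest hash value seen so far
    sstar = None    # symbol attaining the smallest hash value
    n = 0           # number of times that symbol occurred
    for s in stream:
        hs = hash_fn(s)
        if hstar is None or hstar > hs:
            # new minimum: restart the count for this symbol
            hstar, sstar, n = hs, s, 1
        elif hs == hstar:
            n += 1
    return sstar, n
```

&lt;p&gt;Like the pseudocode, the sketch treats two symbols with the same hash value as the same symbol, so it relies on hash collisions being negligible.&lt;/p&gt;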
&lt;p&gt;This algorithm will provide item counts for a random element of the sequence. If you want more than one sample, simply keep a list of the symbols with the k smallest hash values and their associated counts. Such algorithms can be used to compute the variance or other moments of a sequence. &lt;/p&gt;</description><link>http://blog.smola.org/post/1077104724</link><guid>http://blog.smola.org/post/1077104724</guid><pubDate>Mon, 06 Sep 2010 13:12:48 -0700</pubDate></item><item><title>Sparsifying a vector/matrix</title><description>&lt;p&gt;Sometimes we want to compress vectors to reduce memory footprint or to minimize computational cost. This occurs, e.g. when we want to replace a probability vector by a sample from it. Without loss of generality, assume that our vector v has only nonnegative entries and that its entries sum up to 1, i.e. that it is a probability vector. If not, we simply apply our sampling scheme to the following:&lt;/p&gt;
&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_l7rv1gTazQ1qasu5b.png"/&gt;&lt;/p&gt;
&lt;p&gt;There are a few strategies one could use to obtain a sparse vector s:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sampling uniformly at random&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For an n-dimensional vector simply use the sparsification rule: pick a random number j in [1..n]. Set the j-th coordinate in the sparse vector to &lt;img src="http://media.tumblr.com/tumblr_l7rv8mF9CY1qasu5b.png"/&gt; and all other terms to 0. For k draws from this distribution, simply average the draws. This scheme is unbiased but it has very high variance since there&amp;#8217;s a very high chance we&amp;#8217;ll miss out on the relevant components of v. This is particularly bad if v only has a small number of nonzero terms. DO NOT USE.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sampling according to v at random&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A much better method is to draw based on the size of the entries in v. That is, treat v as a probability distribution and draw from it. If we draw coordinate j in [1..n] then set the j-th coordinate of the sparse vector to &lt;img src="http://media.tumblr.com/tumblr_l7rvlz5Nou1qasu5b.png"/&gt; and all other terms to 0. As before, for k draws from this distribution, simply average. The advantage of this method is that now the entries of s are within the range [0, 1] and that moreover we home in on the nonzero terms with high weight. But we can do better &amp;#8230;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sampling according to v without replacement&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The key weakness in the above methods can be seen in the following example: for a vector of the form v = (0.99, 0.005, 0.005, 0, 0, &amp;#8230;., 0) we would need to perform many samples with replacement from v until we even draw a single instance from the second or third coordinate. This is very time consuming. However, once we&amp;#8217;ve drawn coordinate 1 we can inspect the corresponding value in v at no extra cost and there&amp;#8217;s no need to redraw it. &lt;/p&gt;
&lt;p&gt;Enter sampling without replacement: draw from v, remove the coordinate, renormalize the remainder to 1 and repeat. This gives us a sample drawn without replacement (see the code below which builds a heap and then adds/removes things from it while keeping stuff normalized). However, we need to weigh things differently based on how they&amp;#8217;re drawn. Here&amp;#8217;s what you do when drawing k terms without replacement:&lt;/p&gt;
&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_l7sa64qu2Z1qasu5b.png"/&gt;&lt;/p&gt;
&lt;p&gt;Here we initialize γ=1 and j is the index of the item drawn at the i-th step. Below is some (I hope relatively bug-free) sample code which implements such a sampler. As far as I recall, I found partial fragments of the heap class on &lt;a title="Stackoverflow" href="http://www.stackoverflow.com"&gt;stackoverflow&lt;/a&gt; but can&amp;#8217;t quite recall where I found it and who to attribute this to. This could be made more efficient in C++ but I think it conveys the general idea.&lt;/p&gt;
&lt;pre&gt;import math
import numpy
import scipy.sparse

class arrayHeap:
    # List of terms on array
    def __init__(self, probabilities):
        self.m = len(probabilities)            # sample size
        self.b = int(math.ceil(math.log(self.m,2))) # bits
        self.limit = 1 &amp;lt;&amp;lt; self.b
        self.heap = numpy.zeros(self.limit &amp;lt;&amp;lt; 1)    # allocate twice the size

        # allocate the leaves
        self.heap[self.limit:(self.limit + self.m)] = probabilities
        # iterate up the tree (this is O(m))
        for i in range(self.b-1,-1,-1):
            for j in range((1 &amp;lt;&amp;lt; i), (1 &amp;lt;&amp;lt; (i + 1))):
                self.heap[j] = self.heap[j &amp;lt;&amp;lt; 1] + self.heap[(j &amp;lt;&amp;lt; 1) + 1]
            
    # remove term from index (this costs O(log m) steps)
    def delete(self, index):
        i = index + self.limit
        w = self.heap[i]
        for j in range(self.b, -1, -1):
            self.heap[i] -= w
            i = i &amp;gt;&amp;gt; 1

    # add value w to index (this costs O(log m) steps)
    def add(self, index, w):
        i = index + self.limit
        for j in range(self.b, -1, -1):
            self.heap[i] += w
            i = i &amp;gt;&amp;gt; 1

    # sample from arrayHeap
    def sample(self):
        xi = self.heap[1] * numpy.random.rand()
        i = 1
        while (i &amp;lt; self.limit):
            i &amp;lt;&amp;lt;= 1
            if (xi &amp;gt;= self.heap[i]):
                xi -= self.heap[i]
                i += 1
        return (i - self.limit)

#Input: normalized probabilities, sample size
def sampleWithoutReplacement(p, n):
    heap = arrayHeap(p)
    samples = numpy.zeros(n, dtype=int) # result vector
    for j in range(n):
        samples[j] = heap.sample()
        heap.delete(samples[j])
    return samples

#Input: dense matrix p, resampling rate n
def sparsifyWithoutReplacement(p, n=1):
    (nr, nc) = p.shape
    res = scipy.sparse.lil_matrix(p.shape) # allocate sparse container
    for i in range(nr):
        tmp = sampleWithoutReplacement(p[i,:], n)
        weight = 1.0            # we start with full weight first
        for j in range(n):
            res[i,tmp[j]] = weight + (n-j - 1.0) * p[i,tmp[j]]
            weight -= p[i,tmp[j]]
    res *= (1/float(n))
    return res
&lt;/pre&gt;</description><link>http://blog.smola.org/post/1016514759</link><guid>http://blog.smola.org/post/1016514759</guid><pubDate>Thu, 26 Aug 2010 16:00:00 -0700</pubDate></item><item><title>Log-probabilities, semirings and floating point numbers</title><description>&lt;p&gt;Here&amp;#8217;s a trick/bug that is a) really well known in the research community, b) lots of beginners get it wrong nonetheless, c) simple unit tests may not detect it and d) it may be a really fatal bug in your code. You can even find it in large scale machine learning toolboxes.&lt;/p&gt;
&lt;p&gt;Suppose you want to do mixture clustering and you happily compute p(x|y) and p(y), say by a mixture of Gaussians. Quite often you&amp;#8217;ll need access to p(y|x) which can be computed via&lt;/p&gt;
&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_l7ruc7zSUB1qasu5b.png"/&gt;&lt;/p&gt;
&lt;p&gt;Let&amp;#8217;s assume that your code runs well in 2D but now you try it in 100 dimensions and it fails with a division by zero error. What&amp;#8217;s going on? After all, your code is algebraically correct. Most likely you ignored the fact that floating point numbers have only a fixed precision and by exponentiating the argument of the Gaussian (recall, we have a normalization that is exponential in the number of dimensions) you encountered numerical underflow. What happened is that you were trying to store the relevant information in the exponent rather than the mantissa of a floating point number. In &lt;a title="Floating point" href="http://en.wikipedia.org/wiki/Floating_point"&gt;single precision&lt;/a&gt; you have 8 bits for the exponent and 23 for the mantissa and you just ran out of storage. &lt;/p&gt;
&lt;p&gt;Here&amp;#8217;s how you can pull things back into the mantissa - simply store log probabilities and operate with them. Hence instead of + and x you should use &amp;#8216;log +&amp;#8217; and +.&lt;/p&gt;
&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_l7d4j8Cnc31qasu5b.png"/&gt;&lt;/p&gt;
&lt;p&gt;In this case you will not get numerical underflow when adding probabilities since the argument in the log is going to be O(1) (we have at least one term which is exp(0) and the remaining terms are smaller than 1), or at least not due to running out of precision in the exponent.&lt;/p&gt;
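&lt;p&gt;Here is a minimal NumPy sketch of the trick, assuming the usual formulation in which the largest log-probability is subtracted before exponentiating. The function names log_sum_exp and log_posterior are made up for this example, and the sketch assumes that at least one input is finite.&lt;/p&gt;

```python
import numpy as np

def log_sum_exp(a):
    # Computes log(sum(exp(a))) without leaving the representable range:
    # after subtracting the maximum, the largest term is exp(0) = 1.
    a = np.asarray(a, dtype=float)
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

def log_posterior(log_px_given_y, log_py):
    # log p(y|x) = log p(x|y) + log p(y) - log sum_y' p(x|y') p(y')
    joint = np.asarray(log_px_given_y, dtype=float) + np.asarray(log_py, dtype=float)
    return joint - log_sum_exp(joint)
```

&lt;p&gt;With log-likelihoods of -1000 and -1001 and equal priors, exponentiating directly yields 0/0 in floating point, whereas log_posterior returns finite log-probabilities that sum to 1 after exponentiation.&lt;/p&gt;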
&lt;p&gt;In more general terms, what happened is that we replaced the operations &amp;#8216;+&amp;#8217; and &amp;#8216;x&amp;#8217; by two operations &amp;#8216;log +&amp;#8217; and &amp;#8216;+&amp;#8217; which act just in the same way as addition and multiplication. Aji and McEliece formalized this in their paper on the &lt;a title="GDL" href="http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=825794"&gt;Generalized Distributive Law&lt;/a&gt;. Systems that satisfy these operations are called commutative semirings. Some examples are:&lt;/p&gt;
&lt;ul&gt;&lt;li&gt;Set union (+) and intersection (x)&lt;/li&gt;
&lt;li&gt;The tropical semiring which uses min (+) and + (x)&lt;/li&gt;
&lt;li&gt;Boolean AND (x) and OR (+)&lt;/li&gt;
&lt;li&gt;Plain old calculus with addition (+) and multiplication (x)&lt;/li&gt;
&lt;li&gt;The log + semiring with log + (+) and + (x)&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;Replacing these symbols in well known algorithms such as dynamic programming gives the forward-backward algorithm, shortest path calculations and others, but this is a story for another day.&lt;/p&gt;</description><link>http://blog.smola.org/post/987977550</link><guid>http://blog.smola.org/post/987977550</guid><pubDate>Sat, 21 Aug 2010 09:02:00 -0700</pubDate><category>trick</category><category>probability</category><category>floating point number</category></item></channel></rss>

