Adventures in Data Land
Random numbers in constant storage

Many algorithms require random number generators to work. For instance, locality sensitive hashing requires one to compute a random projection matrix P in order to compute the hashes z = P x. Likewise, fast eigenvalue solvers for large matrices often rely on a random matrix, e.g. the method of Halko, Martinsson and Tropp (SIAM Review 2011), which at some point multiplies a matrix M by a matrix P with Gaussian random entries.

The problem with these methods is that if we want to perform this projection operation in many places, we need to distribute the matrix P to several machines. This is undesirable since a) it introduces another stage of synchronization between machines and b) it requires space to store the matrix P in the first place. The latter is often a problem since memory access can be much slower than computation, depending on how the memory is accessed. The prime example is multiplication with a sparse matrix, which requires random memory access.

Instead, we simply recompute the entries by hashing. To motivate things, consider the case where the entries of P are all drawn from the uniform distribution U[0,1]. For a hash function h with range [0 .. N] simply set \(P_{ij} = h(i,j)/N\). Since hash functions map (i,j) pairs to uniformly distributed, uncorrelated numbers in the range [0 .. N], this essentially amounts to uniformly distributed random numbers that can be recomputed on the fly.
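As a concrete sketch, here is one way such hash-based uniform entries could look in Python. The particular hash (blake2b via hashlib) and the range N are illustrative assumptions, not part of the recipe above; any good hash with a fixed range works.

```python
# A minimal sketch of hash-based uniform entries, assuming a 64-bit hash
# built from hashlib.blake2b. The hash choice and seed are illustrative.
import hashlib
import struct

N = 2**64 - 1  # range of the hash values, i.e. h maps into [0 .. N]

def h(*args, seed=0):
    """Deterministic hash of an integer tuple to an integer in [0, N]."""
    data = struct.pack(f"<{len(args) + 1}q", seed, *args)
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return struct.unpack("<Q", digest)[0]

def uniform_entry(i, j):
    """Entry P_ij = h(i,j)/N, uniform on [0, 1], recomputable on the fly."""
    return h(i, j) / N
```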

A slightly more involved example is how to draw Gaussian random variables. We may, e.g., resort to the Box-Muller transform, which converts two uniformly distributed random numbers into two Gaussians. While this is somewhat wasteful (we consume two uniform numbers rather than one), we simply use two uniform hashes and then compute

$$P_{ij} = \sqrt{-2 \log\left(h(i,j,1)/N\right)} \, \cos\left(2 \pi \, h(i,j,2)/N\right)$$

Since the Box-Muller transform is known to generate Gaussian random variables from uniform random variables, this gives us Gaussian distributed hashes. Similar tricks work for other random variables. It means that things like Random Kitchen Sinks, Locality Sensitive Hashing, and related projection methods never really need to store the ‘random’ projection coefficients whenever memory is at a premium or whenever it would be too costly to synchronize the random numbers.
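A minimal sketch of the Gaussian case, reusing the hypothetical hash h and range N from the sketch above. The extra integer argument distinguishes the two uniforms needed per entry; shifting the first uniform away from zero so the logarithm stays finite is an implementation detail, not part of the formula itself.

```python
# Gaussian entries via the Box-Muller transform, computed from two hashes
# of (i, j). Assumes h and N as defined in the previous sketch.
import math

def gaussian_entry(i, j):
    """Entry P_ij ~ N(0, 1), recomputed on the fly from two uniform hashes."""
    u1 = (h(i, j, 1) + 1) / (N + 1)   # in (0, 1], keeps log(u1) finite
    u2 = h(i, j, 2) / N               # in [0, 1]
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)

# Usage sketch: multiply a sparse vector x by the implicit projection matrix P
# without ever storing P (the sparse-dict representation is illustrative).
def project(x_nonzeros, num_rows):
    """x_nonzeros: dict mapping column -> value of a sparse x; returns P x."""
    return [sum(gaussian_entry(i, j) * v for j, v in x_nonzeros.items())
            for i in range(num_rows)]
```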
