Adventures in Data Land
In defense of keeping data private

This is going to be contentious. And it somewhat goes against a lot of things that researchers hold holy. And it goes against my plan of keeping philosophy out of this blog. But it must be said since remaining silent has the potential of damaging science with proposals that sound good and are bad.


The proposal is that certain conferences make it mandatory to publish datasets that were used for the experiments. This is a very bad idea and two things are getting confused here: scientific progress and common access. These two are not identical. Reproducibility is often confused with common access. To make these things a bit more clear, here’s an example where it’s more obvious: 


CERN is a monster machine. There’s only one of its kind in the world. There are limited resources and it’s impossible for any arbitrary researcher to reproduce their experiments, simply because of the average physicist being short of the tens of billions of Dollars that it took to build it. Access to the accelerator is also limited. It requires qualification and resource planning. So, even if we think this is open, it isn’t really as open as it looks. And yes, working at CERN gives you an unfair advantage over all the researchers who don’t. 


Likewise take medical research. Patient records are covered by HIPAA privacy constraints and there is absolutely no way for such records to be publicly released. The participants sign an entire chain of documents that tie them to not releasing such data publicly. In other words, common access is impossible. Reproducibility would require that someone, who wants to test a contentious result, needs to sign corresponding privacy documents before accessing the data. And yes, working with the ‘right’ hospitals gives you an unfair advantage over researchers who didn’t work building this relationship. 


Lastly, user data on the internet. Users have every right for their comments, content, images, mails, etc. to be treated with the utmost respect and to be published only when it is in their interest and with their permission to do so. I believe that there is a material difference between data being made available for analytics purposes in a personalization system and data being made available ‘in the raw’ for any researcher to play with. The latter allows for individuals to inspect particular records and learn that Alice mailed Bob a love letter. Something that would make Charlie very upset if he found out. Hence common access is a non-starter. 


There are very clear financial penalties for releasing private data - users would leave the service. Moreover, it would give a competitor an advantage over the releasing party. Since the data is largely collected by private parties at their expense it is not possible. 

As for reproducibility - this is an issue. But provided that in case of a contentious result it is possible for a trusted researcher to check them, possibly after signing an NDA, this can be addressed. And yes, working for one of these companies gives you an unfair advantage. 


In summary, while desirable, I strongly disagree with a mandatory publications policy. Yes, every effort should be made personally by researchers to see whether some data is releasable. And for publicly funded research this may well be the right thing to do. But to mandate it for industry would essentially do two things - it will make industrial research even more secretive than it already is (and that’s a terrible thing). And secondly, it will make academic research less relevant for real problems (I’ve seen my fair share and am guilty of my fair share of such papers).

Blog comments powered by Disqus