Tuesday, 1 July 2008

Science is dead


Every so often, some ill-informed pundit gets a little too caught up in his or her sketchy understanding of a media misrepresentation and announces that the scientific method is now obsolete. Given the frequency with which this happens, I might be justified in wondering why I bother to get up and go to work in the morning. The latest evangelist is Chris Anderson, who is Editor in Chief of Wired: someone we might expect to know what he is talking about.

He doesn't, though: he claims that the existence of large amounts of data and good clustering algorithms will do away with the need to form and test hypotheses. In his enthusiasm, he comes surprisingly close to describing a computer that will give us the answer to Life, The Universe and Everything. The idea seems to be based on a soundbite from Google's research director, Peter Norvig, who altered George Box's famous quote to:

All models are wrong and, increasingly, you can succeed without them.

It is true enough that all models are wrong or at least limited. They are necessarily simplifications based on assumptions. That is their power as well as their limitation. It is also true that Google doesn't particularly care about the content of a site when it is ranking it: the decision is based on link statistics rather than semantics. Anderson dramatically overstates this point and gets utterly carried away in the 'no models' rhetoric. While it is true that clustering data can highlight previously unexpected correlations (meaning that the algorithm is not looking for any particular correlation), it is absurd to say that models are not required. For one thing, the algorithms that perform the clustering are themselves models. Metrics are used to determine whether the results are good enough and the algorithms are frequently tweaked to improve their performance. How could this be done without a model that determines when some set of performance data is better than another? It's just that the algorithms are not models of the correlation between content and search relevance. Sometimes you can get good results without understanding something in detail, but you still have to know how to interpret the results, which requires a model.
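To make the point about clustering concrete, here is a minimal sketch of a k-means style algorithm in plain Python. It is an illustration only, not anything Google or Anderson describes: the data points are invented, and the names `assign`, `kmeans` and `within_cluster_ss` are my own. Notice how many modelling decisions are needed before the data can "speak": the number of clusters, the starting centroids, the distance measure, and the score used to judge the result.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def assign(points, centroids):
    """Put each point in the cluster of its nearest centroid.

    Choosing Euclidean distance as the measure of 'nearest' is an assumption.
    """
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def mean_point(members):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(axis) / len(members) for axis in zip(*members))


def kmeans(points, centroids, iterations=20):
    """Alternate assignment and centroid updates for a fixed number of rounds.

    The number of starting centroids (k) is chosen by us, not by the data.
    """
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = assign(points, centroids)
        # An empty cluster keeps its old centroid.
        centroids = [mean_point(m) if m else c for c, m in zip(centroids, clusters)]
    return centroids, assign(points, centroids)


def within_cluster_ss(centroids, clusters):
    """Sum of squared distances to each centroid: one possible notion of 'good'."""
    return sum(dist(p, c) ** 2 for c, members in zip(centroids, clusters) for p in members)


# Invented data: three visually distinct groups of three points each.
points = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
    (9.0, 1.0), (9.1, 1.2), (8.9, 0.9),
]

# Same data, two different assumptions about how many clusters exist.
centroids_k2, clusters_k2 = kmeans(points, [points[0], points[3]])
centroids_k3, clusters_k3 = kmeans(points, [points[0], points[3], points[6]])

score_k2 = within_cluster_ss(centroids_k2, clusters_k2)
score_k3 = within_cluster_ss(centroids_k3, clusters_k3)

print("k=2 cluster sizes:", sorted(len(c) for c in clusters_k2), "score:", score_k2)
print("k=3 cluster sizes:", sorted(len(c) for c in clusters_k3), "score:", score_k3)
```

The algorithm happily returns an answer for either value of k. Deciding that one answer is better than the other needs a score, and this particular score generally falls as clusters are added, so interpreting it takes yet another assumption about what a sensible grouping looks like.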

Anderson goes way, way further than this. He says that science can work like this, entirely eliminating the need for the scientific method, which he claims is now obsolete. This suggestion is fascinatingly idiotic. Anderson feels that if you get a whole bunch of data and cluster it, you will learn new things about the world without first having to decide what you are looking for. There is some dubious truth in this, of course. A few years ago, I was involved with a project that used clustering techniques to determine targets for drug discovery. We weren't looking for a drug for any particular ailment: we were looking at proteins and dragging up correlations that looked promising. In other words, it told us where to start looking, which could reduce the time to market for new drugs. This is potentially a valuable tool, but it is far from making the scientific method obsolete, as Anderson would have realised if he had stopped to think about what he was saying for even an instant. Clustering algorithms like this are just another form of observation about the world, no different in principle to looking through a microscope. Therefore, they fit perfectly neatly into the scientific method along with every other type of observation. When we see a potential target for a drug, we hypothesise that it is in fact a good candidate and we test that hypothesis in an entirely conventional scientific way. Nothing new is happening here, other than the use of a slightly different kind of tool to look at the world and collect observations. And once again, while we didn't use a model relating proteins directly to their appropriateness as a drug target, we did need to know quite a lot about the properties of those proteins and how they behave. This is a model. We also had to know how to write an algorithm that generates appropriate results (a model) and a way to determine whether the results are actually any good (a model).
More importantly, the information generated was just the input to a further stage of enquiry conducted according to sound scientific principles. I for one, for example, would be rather reluctant to trust a drug that had been developed without an understanding of how it worked. If Anderson had thought his position through for a second, he would have come to the same conclusion.

Anderson gives some examples which, if anything, actually undermine his position. The first is quantum physics. He (rightly) points out that our current model is broken and that we've so far been unable to come up with a better one. Anderson says that we should let the data speak for itself, but seems to suffer a bit of an imagination failure at this point because he doesn't explain what this might be expected to achieve. Meanwhile, the 'broken' model of quantum physics has quietly and without fuss been used to create the very computers Anderson feels should be making that model obsolete. More fundamentally, the only way to find out exactly why and in what way the model is wrong is to use the testable predictions that the theory makes. How will we know that clustering results are correct? How can we apply them? We need theory (models). His other example is even worse:

The best practical example of this is the shotgun gene sequencing by J. Craig Venter. Enabled by high-speed sequencers and supercomputers that statistically analyze the data they produce, Venter went from sequencing individual organisms to sequencing entire ecosystems. In 2003, he started sequencing much of the ocean, retracing the voyage of Captain Cook. And in 2005 he started sequencing the air. In the process, he discovered thousands of previously unknown species of bacteria and other life-forms.

How did he know that he had discovered new species?

Venter can make some guesses about the animals — that they convert sunlight into energy in a particular way, or that they descended from a common ancestor.

And how can he know this? Because we have models of evolution and genetics that make predictions in a way that statistical correlation simply cannot.

Methods of this kind are already revolutionising the way science is carried out in practice. They are tools that can provide scientists with enormous amounts of data they previously had no access to. They can help to find areas worthy of further study. The same is true of the telescope, the microscope and every other piece of scientific equipment. I doubt anyone claimed that the invention of the electron microscope would make the scientific method obsolete because now we can just look at stuff, rather than having to determine its properties second-hand. The suggestion that clustering techniques and the availability of lots of data will do so is equally blithering.

Next week, there will be another self-important buffoon claiming that science is obsolete for some completely different and equally misguided reason, and Anderson's claims will be forgotten. I understand that Anderson has got caught up in his own enthusiasm and hasn't thought things through, but this kind of ill-informed nonsense is harmful to the already badly skewed public perception of science. For a senior journalist in a leading science and technology publication to do such a thing is breathtakingly incompetent at best.
