The most expensive data set I have modeled – don't blink (over $500M)

The cost of data generated in science is probably not something many people ponder. When I look back over some of the datasets I have looked at over the years, it's interesting to speculate on the costs of the underlying data. Working for pharmaceutical companies, data from specialist tests for toxicities (e.g. hERG) probably ran from the tens to hundreds of thousands of dollars for tens to hundreds of compounds of interest (back when the assays were low throughput and contracted out). Even high throughput screens performed inside companies probably had similar costs. But the prize for the most expensive data set I have worked on surely goes to the recent analysis of over 300 NIH funded chemical probes.

By conservative estimates this project, and therefore the data derived from it, likely cost well in excess of $500M (as quoted in 2010). If there are any bean counters out there with an updated value, please let me know. To date the countless grants have funded hundreds of screens. The result: a little over 300 probes so far. So each probe compound is worth well over $1M! And no, I am NOT exaggerating; this is by FAR the most expensive dataset I have had the opportunity to model. Let's put this in some perspective: it cost over $3B to fund the human genome project. When I think of potential for impact on healthcare, somehow this dataset does not register even close to the human genome. That's not to discount it, but what will it lead to? You could probably say the same for the human genome, but many would argue it's propelled many insights, projects and products in science (almost like the billions sunk in the space race in the 60's).

So what does this all mean? Well, the analysis we recently published suggested that over 20% of the probes were undesirable based on the experience (40 yrs) of a medicinal chemist. That suggests 20% of the > $500M may have been a complete waste of time. That's over $100M down the drain (conservatively). This of course is just small change for an agency that has an annual budget of $30.15 billion. $100M could make a significant impact on rare disease research; it could be used productively for Ebola research as well as many other diseases. But you may argue this project is past tense. The various research groups took their hefty chunk of overhead, and there were costs that were sunk in equipment and staffing. But the data lives on and resides in PubChem: half a billion dollars of data, just sitting there.

We used all the data to try to model the medicinal chemist's decisions using machine learning methods. This suggests that perhaps we can use such models to prioritize compounds before we invest more time and resources in them. Literally, the $500M machine learning model exists! This might never have happened but for a discussion with Christopher Lipinski and Nadia Litterman back in April to see if a model could be created to emulate his decision making in some way. So while I am amazed at the costs to generate this data, without it we would not have been able to do the analysis we did. Is this the end of it? Probably not. This is not really BIG DATA, but it took BIG MONEY.
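To illustrate the general idea of learning a chemist's accept/reject call from compound properties, here is a minimal sketch. The descriptors (molecular weight, cLogP), the toy training values, and the choice of a simple k-nearest-neighbour classifier are all illustrative assumptions on my part, not the method or data from the published analysis.

```python
# Toy sketch: emulate a medicinal chemist's desirable/undesirable call
# from simple molecular descriptors. All values below are hypothetical.
from math import dist

# Each entry: (molecular weight, cLogP) -> 1 = desirable, 0 = undesirable
training = [
    ((350.0, 2.1), 1),
    ((420.0, 3.0), 1),
    ((610.0, 5.8), 0),  # large and lipophilic: flagged as undesirable
    ((280.0, 1.2), 1),
    ((550.0, 6.5), 0),
]

def predict(descriptors, k=3):
    """Majority vote among the k nearest training compounds."""
    neighbours = sorted(training, key=lambda t: dist(t[0], descriptors))[:k]
    votes = sum(label for _, label in neighbours)
    return 1 if votes * 2 > k else 0

print(predict((300.0, 1.5)))  # small, moderately polar compound -> 1
print(predict((600.0, 6.0)))  # large, greasy compound -> 0
```

In practice one would use real molecular descriptors or fingerprints and a proper classifier with cross-validation, but the workflow is the same: train on compounds the chemist has already judged, then rank new candidates before spending money on them.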

If anyone has examples of more expensive datasets of molecules and screening data please point them my way.
