Nothing in life is perfect, but some things should be. For example, a free database should display the correct structure of a drug taken by millions of people every day. You would think this could be taken for granted. Well, as pointed out by Antony Williams, most databases have errors.
So when a database proclaims to be “a comprehensive resource of clinically approved drugs”, one takes a serious look. As it comes from the NIH, one expects it to be perfect.
Yesterday I downloaded the database and read the paper. I took just the HTS amenable compounds (>7000) and sent them along to Antony to look at. He immediately found errors and has just reported the scale of the problem on his blog.
So what does one do? It's pretty clear from the paper that they ripped compounds from databases like PubChem. As chemistry databases have proliferated in size, these errors have accumulated. Unfortunately, few researchers perform molecule curation, and this can have a dramatic effect on any computational models developed from the data.
In a manuscript that Antony and I recently submitted, we provide many examples of ‘molecular pollution’ across frequently used databases and raise awareness of the urgent need for government funding of data curation. Clearly something has to be done. What we should not do is add this dataset to other chemistry databases, as seems to happen all too frequently.
While our government rightly criticizes companies for polluting our environment, we in turn should take a close look at the agencies that are polluting the web with poorly curated databases of chemical structures. It is 2011; we have the technologies and computing power, so we should be able to get this right. Perhaps our only hope is that vigilant individuals like Antony can collaborate with others and find a way to clean up the chemistry mess on the web and create a GOLD STANDARD for molecule databases.