Nothing in life is perfect, but some things should be. One example is displaying, in a free database, the correct structure of a drug taken by millions of people every day. You would think this could be taken for granted. Well, as Antony Williams has pointed out, most databases have errors.
So when a database proclaims itself “a comprehensive resource of clinically approved drugs”, one takes a serious look, and since it comes from the NIH, one expects it to be perfect.
Yesterday I downloaded the database and read the paper. I took just the HTS-amenable compounds (>7000) and sent them to Antony to look at. He found errors immediately and has just reported the scale of the problem on his blog.
So what does one do? It’s pretty clear from the paper that they ripped compounds from databases like PubChem. As chemistry databases have proliferated in size, these errors have accumulated. Unfortunately, few researchers perform molecule curation, and this can have a dramatic effect on any computational models developed.
In a manuscript that Antony and I recently submitted, we provide many examples of ‘molecular pollution’ across frequently used databases and raise awareness of the urgent need for government funding of data curation. Clearly something has to be done. What we should not do is add this dataset to other chemistry databases, as seems to happen all too frequently.
While our government rightly criticizes companies for polluting our environment, we in turn should take a close look at the agencies that are polluting the web with poorly curated databases of chemical structures. It is 2011; we have the technologies and computing power, so we should be able to get this right. Perhaps our only hope is that vigilant individuals like Antony can collaborate with others and find a way to clean up the chemistry mess on the web and create a GOLD STANDARD for molecule databases.
FocusOnBiggerPicture says:
April 29, 2011 at 3:34 pm (UTC -5)
Perhaps you should help and contribute rather than be stuck in a loop of criticism. Antony Williams has an incentive in being critical. And ChemSpider has been full of errors from what I have seen historically. I would suggest both of you use your time and efforts in contributing to the community rather than obsessing over the occasional problems.
ChemConnector says:
April 29, 2011 at 11:11 pm (UTC -5)
Dear Focus from Washington…thanks for the feedback. Yes, ChemSpider has errors for sure. It’s 26 million compounds and we have inherited lots of errors from various places, but we improve the quality of the data set daily. We also use our curation work to improve Wikipedia (http://www.chemconnector.com/2008/01/09/dedicating-christmas-time-to-the-cause-of-curating-wikipedia/). I’ve personally handed over many curated datasets to scientists to help them with their work, and these have ended up in ChemSpider. We are presently building an interface to share our curation efforts from ChemSpider with other databases to reduce rework.
If all I did was criticize, I’d willingly accept the criticism. However, I curate data almost every evening for one project or another, outside of work hours. I am presently doing that for the NCGC dataset and the results will be handed back.
Sean has worked with me a number of times, as reported in our many publications, to generate high quality data for analysis and this is then made available in databases such as CDD and as supplementary data for others to use.
I wish the issues I was highlighting were occasional problems. They are not. The majority of the data are inherited with their errors. This CAN be cleaned up. Rather than anonymously criticize our efforts, I welcome you to lift the veil and contribute. I am more than willing to engage in a discussion to embrace a collaborative effort to improve data quality. It’s what I, and Sean, stand for.
sean says:
April 29, 2011 at 4:36 pm (UTC -5)
please read http://tinyurl.com/67x5knd.
Markus Sitzmann says:
April 29, 2011 at 5:05 pm (UTC -5)
Well, ChemSpider in particular belongs to the group of “polluters” in PubChem. Count the number of aspirin, benzene or ethanol structures submitted by ChemSpider to PubChem (linking only to a “deprecated” ChemSpider record). Or run an advanced search for ChemSpider records that also contain argon; here is an example:
http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=20187034&loc=ec_rcs
There are many other examples.
Antony Williams says:
May 2, 2011 at 9:41 am (UTC -5)
As commented here: http://www.chemconnector.com/2011/04/29/markush-misrepresentations-in-chemspider/
Yup…many of those marked as deprecated on our site, and therefore necessary to deprecate from PubChem, are from a historic definition of patent data. These will all get removed when we deprecate, and when we redeposit they will be gone from the set. NONE of the argon-related compounds that I deprecated tonight came from that dataset. If you look at my post on Mercury Argon, for example, that originated with PubChem, as did some of the others. But now they are gone….from our site at least.
Watch for the news shortly about the work we are doing to share deprecation information out with appropriate feeds. You should be able to use these directly when we expose them!
FocusOnBiggerPicture says:
May 1, 2011 at 1:23 pm (UTC -5)
Sean & Tony: I was inappropriate in my comments. The two of you do contribute quite a bit to the community, and I stand corrected.
Antony Williams says:
May 2, 2011 at 9:42 am (UTC -5)
Focus…thanks for the acknowledgement that we do contribute to the community. We have been doing our best to do so for a long time now so I acknowledge the retraction of your comments. It is appreciated.
Markush Misrepresentations in ChemSpider – ChemConnector Blog says:
April 29, 2011 at 9:36 pm (UTC -5)
[…] Markus also commented on Sean Ekins’ blog here: […]
A new scientific tool called collaboration » Collaborative Chemistry says:
May 2, 2011 at 2:58 pm (UTC -5)
[…] less money but in a more collaborative manner using computers (and a few human scientists)…such as database curation. Share […]
Recognition of Molecule “Hygiene” » Collaborative Chemistry says:
May 5, 2011 at 1:45 pm (UTC -5)
[…] blog on issues with the NCGC Pharmaceutical collection and my follow-up suggestion of the need to improve the situation. I looked at the website today and so far there has been no recognition of some of the structural […]
NCGC adds disclaimer to NPC browser » Collaborative Chemistry says:
May 23, 2011 at 8:30 am (UTC -5)
[…] progress of sorts on the compound quality issue in the NPC browser. Weeks after bringing this to their attention there is now a disclaimer (see snapshot below). Not sure when this was added but spotted it today […]
Persona(s) non grata: does database development need personas? » Collaborative Chemistry says:
June 1, 2011 at 3:28 pm (UTC -5)
[…] really work in our domain? Would they have helped ensure the quality of the molecules in the NCGC database? I think the answer is No (reviewers would have helped here, BTW did anyone reading this blog act […]
The sound of silence – is the chemistry database discussion being avoided? » Collaborative Chemistry says:
June 10, 2011 at 3:56 pm (UTC -5)
[…] had no idea that picking up comments from one blog and suggesting what we need to do in the chemistry database world …little if any response from the actual parties responsible or the journal involved. I think we […]
Collaborations, pufferfish, sea squirt, and database quality » Collaborative Chemistry says:
January 10, 2012 at 10:50 am (UTC -5)
[…] or small molecules for pharmacophore modeling. This blog has extensively covered the idea of the need for a gold standard molecule database. Antony Williams and I were asked to put an online editorial together for Drug Discovery Today. […]
Approved Drugs – a new mobile app for iOS » Collaborative Chemistry says:
June 15, 2012 at 3:41 pm (UTC -5)
[…] this be another starting point for a gold standard database of drugs? Will be interesting to see how this app develops in […]