Collaboration is about contribution

I have been in the blog world a very short time and already seem to have attracted some interesting comments. Let me be clear: I have been collaborating and contributing to science for over 15 years with groups all over the world, so to be told I should be “contributing to the community rather than obsessing over the occasional problems” by an anonymous person is quite a surprise!

My mission is to highlight how collaboration is important and necessary. Highlighting Antony Williams’ blog on the issues with the NCGC pharmaceutical collection is in line with this. Antony has certainly been a major force for cleaning up databases and other sources of molecules out there, including Wikipedia, and he has done this in a collaborative fashion. We all benefit from his and others’ efforts. I can bet he will spend his time doing this with this contentious dataset, and what thanks will he get? Another ill-informed commenter on my blog or his, who does not do their research, will say that we are being critical. An enlightened reader will say “heck, before I spend hours using this database, let me check the structure integrity first,” or better still they will find an incorrect structure, correct it, and tell NCGC. Clearly NCGC goofed big time, based on the comments from Trung, who was an author on their paper. The journal they published in should have checked the structures too. We as a community can do our bit. We have to be vigilant.

It is clear that:

1. As chemistry databases have proliferated in number and size, errors have accumulated.

2. While some databases check for errors and apply corrections suggested by users, this is the exception rather than the norm.

3. There needs to be a good-faith effort to check structures carefully before chemistry databases are made public.

4. The government should be funding quality database creation, not database proliferation.

For this ‘molecular pollution’ to end, we need to rethink our reliance on the NIH as a source of de facto gold-standard structure databases, when it clearly is not one. Even “skunk works” type efforts need to be considered, and all held to the same standard. We have the technological prowess and computing power to tackle complex challenges; cleaning up the web of garbage structures should be a priority, and it is certainly an opportunity to collaborate.


  1. Markus Sitzmann says:

    Sean, if you obey point 3 strictly, there will be no public databases anymore – not today, and not in the future. While I agree that we have to take care about error proliferation in databases, it is downright impossible to curate all structures out there. Just check out Antony’s blogs (the ChemSpider as well as the ChemConnector blog) where he asked for a community effort to confirm a certain structure … and well, quite often it ends up in a discussion about the right stereoisomer, tautomer, etc. – with no structure ever really nailed down.

    1. sean says:

      So we just continue to add more databases and exacerbate the problem? Not a smart solution. We need to do better.

      1. Markus Sitzmann says:

        Tell me a realistic strategy (including financing) and I will help you. Since I have worked in this field for years, I know that ChemSpider and Antony do a great job of promoting the fact that there is a real problem (which undoubtedly is there, no question). Regarding the solution, I must say I would be surprised if we curated more than a very small percentage of the chemical structure data treasure out there within the next five years – even if ChemSpider were able to increase the number of voluntary(?) curators ten-fold (or more). However, in my opinion the extent of the problem requires full-time jobs – hence, you have to finance them. In doing this you have to be careful not to collide with CAS or mutate into a CAS-like project yourself (i.e. taking fees for your curated data and being cautious that your curated data remains “your” data).

        Another important aspect is that you have to make users of a database sensitive to the fact that they always have to evaluate any information obtained from a database search to the level of detail that is required. That is the same kind of responsibility each chemist has in the lab – if you do a 20-step synthesis, you had better check the quality of the reactants for step 20 before you put everything into the same flask :-).

  2. Rajarshi says:

    I think Markus has a point – would we end up with a completely correct db in a finite amount of time?

    I also see your point that the proliferation of DBs is problematic. But do you have a solution to that? I can imagine that there are going to be instances in the future where people extract or publish datasets taken from various sources. Are you suggesting that all such activities stop and refer only to the One Correct Database?

    Come to think of it, would ChemSpider release all its curated structures? I assume that would have solved this whole problem, right?

    I fully agree that an NIH database is not necessarily a gold standard. Are there any such DBs out there? CAS maybe? More likely, some form of propagation of corrections will be a long-term solution (maybe a chemical equivalent of DAS).

    You also note: “We have the technological prowess and computing power to tackle complex challenges, cleaning up the web of garbage structures should be a priority and this is certainly an opportunity to collaborate.”

    How does computing power and technology help here? From your description, it looks to be a purely manual, human driven process.

    1. Antony Williams, ChemConnector says:

      Rajarshi…relative to your question “would we end up with a completely correct db in a finite amount of time”, I think the question is how big the database is and whether we have enough definitions with the data to build it. I will be discussing some of the challenges that the people assembling the data for the NPC Browser have. They have taken on a worthy challenge in trying to build it, but I think (don’t know) that either some processes need optimizing or some definitions need improvement. If you look at my recent posts (http://www.chemconnector.com/2011/04/28/reviewing-data-quality-in-the-ncgc-pharmaceutical-collection-browser/ and http://www.chemconnector.com/2011/05/02/what-is-a-drug-data-quality-in-the-ncgc-pharmaceutical-collection-browser-part-2/) I believe that it IS possible to define what the chemicals are that should be represented and make sure that the data are consistent. But it requires definition. In reality this is a very small database for the NPC Screening set. The next challenge is “completeness”… I am starting to sound a little like Bill Clinton when he asked that infamous question “what is *****”. If complete means that 100% of every piece of data is mapped and defined perfectly, it won’t happen. But I judge that for this dataset we could get to 90% pretty quickly. I am 50 records into reviewing the data at present and have found that >20% of the structures have incorrect stereochemistry relative to the definition of the compounds elsewhere.

      We are presently working on the Open PHACTS project http://www.openphacts.org/ and this will have a big impact in this area in the coming months I believe.

      I think I can comment on Sean’s views re. compute power and technology. Most of the easy errors I am catching in the NPC Browser are found by very simple searches looking for multiple undefined stereocenters, charge imbalances, etc., but I do agree that in general there is an additional curation step by me, as that is my preferred work process. But I think we can filter out a lot of the chaff quite automatically in the large databases, and we are presently active in doing that on the ChemSpider platform using our “Background Processing Framework”. More info will follow. Cheers
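One of the simple automated searches Antony describes above, flagging charge imbalances, can be sketched in a few lines of Python. This is a minimal stdlib-only illustration; the function names and record format are invented for this example, and a real curation pipeline would use a cheminformatics toolkit such as RDKit rather than regex parsing of SMILES.

```python
import re

# Formal charges in SMILES appear inside bracket atoms, e.g. [NH4+], [O-], [Fe+2].
# (The legacy repeated-sign style, e.g. [Cu++], is not handled by this sketch.)
_CHARGE = re.compile(r"\[[^\]]*?([+-])(\d*)[^\]]*\]")

def net_charge(smiles: str) -> int:
    """Sum the formal charges declared in a SMILES string."""
    total = 0
    for sign, digits in _CHARGE.findall(smiles):
        magnitude = int(digits) if digits else 1
        total += magnitude if sign == "+" else -magnitude
    return total

def flag_charge_imbalance(records):
    """Yield (record_id, smiles) pairs whose net formal charge is nonzero,
    i.e. candidates for manual curation."""
    for record_id, smiles in records:
        if net_charge(smiles) != 0:
            yield record_id, smiles
```

For example, ammonium chloride (`[NH4+].[Cl-]`) balances to zero, while a stray acetate anion (`CC(=O)[O-]`) would be flagged for review.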

  3. Markus Sitzmann says:

    Well, another point is that if you want to initiate a collaborative effort for the curation of chemical structures/data, you need to organize and release your “uncurated” data first in some way – that is simply a requirement for getting started.

  4. sean says:

    All good points. I freely admit I do not have all the answers, and why should I alone? This is something that affects consumers and developers of these databases alike. Clearly there are tools to check structure integrity, and these could be used to find errors across the various public databases on the web. Once the errors are found, manual curation kicks in. This is what Tony and a few others have been doing for years. But it needs to happen on a bigger scale, hence the need for more computing power and perhaps new structure-checking algorithms run against some standard database.
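As a sketch of what such cross-database error finding might look like, the toy function below compares two hypothetical databases keyed by a shared record identifier, using a canonical structure key (for example a standard InChI) as the basis for comparison. The names and data shapes here are invented for illustration; generating the keys themselves would require a cheminformatics toolkit.

```python
def find_structure_conflicts(reference, candidate):
    """Compare two {record_id: structure_key} maps.

    structure_key can be any canonical representation of a structure
    (for example a standard InChI). Returns the records present in both
    sources whose keys disagree -- the candidates for manual curation.
    """
    conflicts = {}
    for record_id, key in candidate.items():
        reference_key = reference.get(record_id)
        if reference_key is not None and reference_key != key:
            conflicts[record_id] = {"reference": reference_key, "candidate": key}
    return conflicts
```

Records missing from one source or the other are deliberately ignored here; coverage gaps are a separate question from outright structural disagreement.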

    99.9% of the people using the NCGC database will accept the structures at face value. There is no disclaimer saying a percentage of structures may be incorrect. Nowhere does it say that users of the database will need to verify structure integrity themselves. Repurposing is pretty big right now, and any effort to pull such molecules together is critical. I sense that what was a marathon effort may have fallen short in the last mile, and it will rely on the efforts of the few interested in maintaining higher standards to correct this. I had high hopes.

    And as for CAS, I do not think a set of a few thousand drugs is going to impact their efforts; this is not the same league. But if public tools from the NIH cannot even get 2,000, let alone 7,000, structures correct (of what should be well-documented structures), then Francis Collins should have just outsourced this to a company that would be held to some standard.
