Find out more about Sanfilippo Syndrome and Charcot-Marie-Tooth

A couple of the diseases I work on are described in two recent newsletters.

First up is Sanfilippo Syndrome, which is described in the Lysosomal Disease Network’s newsletter Indications. The article describes Jonah’s Just Begun and Phoenix Nest and was written by Jill Wood. I did a bit of proofreading of this article, so I hope it’s OK; I take responsibility for any typos.

Second is the latest CMT Update, which has an article on the release of a new mouse model for CMT2.

Happy reading, and a big well done to all involved in these rare diseases. These newsletters show what can be done by some very talented parents and patients!


Turning points and games with a purpose

Today I was talking to one of my old mentors from my postdoc days (in big pharma) for the first time in a few years. I realized that working with him, and the exposure to computational chemistry software, was a real turning point in my career in 1996. Several more turning points later, I am using computers every day in my research and writing, and that got me to where I am, as described in my last post. As these things do, that discussion led me to think about what the next turning point would be for me, and perhaps for science.

Last night I was catching up on some long overdue reading. Most people have heard of FoldIt, the protein folding / protein design game from the University of Washington. But what about other games with a purpose, like Open-Phylo and Dizeez? I bet few have heard of these computer games. Has anyone reading this played them? What did you get out of them? How long did you spend on them and did they maintain your interest? As someone who has not played any of these games, nor for that matter Angry Birds, I would welcome any insights before I try to find the time to give them a go. I am intrigued by how others can spend time on computers, and increasingly mobile devices, playing games, while for a couple of decades I have used computers and software as tools for work and research and little else. What if I could use games to get my work done?

So could my next turning point be using software games as tools for drug discovery? How do we bring the software we use out of the expert domain and put it into the hands of the crowd for public good? I do not have any answers but I thought I would start a little list of such games with a purpose.

Here are the first 3 for the biological sciences:

FoldIt (see this, this, this, this)




A triple life in science

Ever since embarking on my pretty unusual career path by leaving big pharma in 2001, I have been faced with several forks in the road and hard decisions on which way to go. Join a software company or not? Work with a university spin-out? Some were good decisions, others less so. Since 2008 I have generally opted for the path of least resistance, added these opportunities to the growing list and run with them. So when people ask me who I work for or what my day job is, I have to take a few minutes to explain that I work for a lot of different companies or organizations and wear a few different hats. I am pretty sure that when I visit a company for the first time and explain this, it can be hard to take in for people who have perhaps only ever worked for 1 or 2 companies. I can summarize my work life as a triple life. I have been giving this some thought recently as I was invited to talk to undergraduate students at the upcoming ACS meeting in Boston in August. Doubtless I will prepare some slides at some point.

Currently the largest slice of my time goes to CDD, where I work on neglected disease grants as PI (NIH, MM4TB) and write papers on the projects funded. After this I spend a good percentage of my time working on rare diseases: for Phoenix Nest, Inc., working on our enzyme replacement for Sanfilippo syndrome type D STTR with LABioMed, and also acting as CSO at the Hereditary Neuropathy Foundation, working on Charcot-Marie-Tooth research. I also volunteer my assistance to Hannah’s Hope For GAN. In addition, in my start-up Collaborations Pharmaceuticals, Inc., I work with collaborators to perform preliminary experiments to get data we can use in future grants and patents. A common theme here is writing STTR or other grants, as well as papers to raise awareness of the research undertaken. The final component is Collaborations In Chemistry, through which I do additional consulting for academia, biotech and consumer product companies on ADME/Tox, neglected diseases and pretty much anything that comes along. This is also an outlet for any other interesting computational collaborative project that comes along, such as last year’s foray into Ebola virus research and tweeting at conferences.

This diversity of projects is welcome because it keeps the work interesting, and I like to continually try something new in science, although it sometimes makes it hard to cull projects. I am fortunate to be able to collaborate with so many terrific groups of people who can tolerate this alternative career path. Yes, it is challenging to keep it all straight and manage time, but I would say working for oneself is something others should try if they are in a position to do it. I am perhaps at another of those crossroads, at which point I should probably hire an assistant to help, so if you know anyone who would be interested please let me know. Who needs a double, give me a triple life in science!


Open Source Bayesian Models (X2)

For the last 5-6 years I have been kind of obsessed (in a good way) with how we could get computational machine learning models for drug discovery to a point where they could be shared. The reasoning behind this is that we publish papers, but the models described in them never really get used by anyone else. It’s been a bit of a journey that, as of yesterday, resulted in Alex Clark and I having 2 papers accepted at JCIM here and here. I thought I would provide a bit more detail on why I think this is important.

It all started back in November 2009 when I had a meeting with Chris Waller, Eric Gifford, Rishi Gupta (all Pfizer employees), Barry Bunin and Moses Hohman (CDD) at Pfizer. The hope was to try to get access to data from big pharma as models in CDD Public or CDD Vault. What actually came out was something different but still useful. The light bulb went on at the table: why not compare commercial descriptors and algorithms with the open source descriptors and algorithms for different ADME datasets? A year later this work came out as a paper in Drug Metabolism and Disposition. Of course this also makes you think about how the reliance on expensive tools may be lessened.

Following this we put an SBIR together that helped to fund the development of the FCFP6 and ECFP6 descriptors (by Alex Clark) that are now on GitHub. These descriptors allowed Alex to build Bayesian models in TB Mobile 2.0 for target prediction. The most recent work published in JCIM builds on this to describe “the creation of a reference implementation of a Bayesian model-building software module, which we have released as an open source component that is now included in the Chemistry Development Kit (CDK) project, as well as implemented in the CDD Vault and in several mobile apps.”

There is still a lot of work to be done to get CDD Models to where I want it to be, and to validate models, but I hope that by making the software and models accessible we have helped others to run with it too. The second part is independent of the CDD efforts and was to show what could be achieved with these open source technologies: “we performed a large scale validation study in order to ensure that the technique generalizes to a broad variety of drug discovery datasets. To achieve this we have used the ChEMBL (version 20) database and split it into more than 2000 separate datasets, each of which consists of compounds and measurements with the same target and activity measurement.”
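
To make the idea of "one dataset per target and activity measurement" concrete, here is a minimal sketch of that kind of split. It is only an illustration: the record layout, the identifiers and the function name `split_by_target_and_type` are all hypothetical, and this is not the extraction code used for the paper or the ChEMBL schema.

```python
from collections import defaultdict

# Hypothetical activity records: (compound_id, target_id, activity_type, value).
# The identifiers and values are made up for illustration only.
records = [
    ("CPD1", "TARGET_A", "IC50", 120.0),
    ("CPD2", "TARGET_A", "IC50", 45.0),
    ("CPD1", "TARGET_A", "Ki", 30.0),
    ("CPD3", "TARGET_B", "IC50", 900.0),
]


def split_by_target_and_type(rows):
    """Group rows so each dataset shares one target and one activity type."""
    datasets = defaultdict(list)
    for compound, target, activity_type, value in rows:
        datasets[(target, activity_type)].append((compound, value))
    return dict(datasets)


datasets = split_by_target_and_type(records)
for key, members in sorted(datasets.items()):
    print(key, members)
```

Each key then identifies one candidate modeling dataset; anything beyond this grouping step (filtering, thresholds, deduplication) is outside what the sketch covers.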

We then made these models accessible on a website which can be used by anyone and uploaded into the mobile apps Alex developed.

We are immensely grateful to the 3 reviewers and editor (Alex Tropsha) of these manuscripts because they had double the workload. As I have done in the past I include the reviewer comments and our rebuttals to illustrate where the reviews made us modify the original submissions. Both papers were made open access - we have not had the proofs yet at the time of writing so there may be some typos needing correction.

It has been hugely rewarding working with Alex on this project, and the immediate benefit I see from the 2000 ChEMBL models is that anyone could take these and use them to do drug discovery / virtual screening on so many different targets. It’s pretty overwhelming to imagine having so many models, and while it’s not “Big data” for some, for us as modelers this is about as big as it gets. The community does need to realize it can get even bigger, as this represents just a fraction of the ChEMBL datasets, which are a moving target.


cover art idea


Paper 1


Manuscript ID: ci-2015-00143z
Title: “Open Source Bayesian Models: I. Application to ADME/Tox and Drug Discovery Datasets”
Author(s): Clark, Alex; Dole, Krishna; Coulon-Spektor, Anna; McNutt, Andrew; Grass, George; Freundlich, Joel; Reynolds, Robert; Ekins, Sean

Reviewer: 1

Well written article about a nice, free and open piece of work about a Bayesian model-building software module used to build an array of Bayesian models for ADME/Tox, in vitro and in vivo bioactivity and other physicochemical properties. The thorough description including code examples makes the method easily accessible for readers. Releasing the software as part of a widely cited open source tool kit make it easy to access and test. I hope the authors pay the open access fees for this article.

Response: Thank you. We plan on making both parts open access if accepted.

Reviewer: 2

The authors’ two-part publication on the development and application of their open-source tools for building Bayesian models is well-written and addresses an important need in the field: Easy development of predictive models in the CADD field with free and public tools, and easy sharing of such models within the research community. I therefore recommend publication after minor modifications.

Response: Thank you for your comments.

I have some reservation about the large number of citations of previous CDD work in either manuscript, which smacks a bit of company advertisement. However, these cited works seem relevant for the topic presented here, so I’ll give the authors the benefit of the doubt.

Response: We agree the selected citations are relevant to the manuscript. There are a handful that we would class as CDD papers e.g. describing TB Mobile and CDD Models. The majority of the references by Ekins et al. relate to work done outside of CDD that is relevant including both academic and industrial collaborations using machine learning.

While Bayesian classifiers are certainly useful (and have been widely applied), there are other modern machine-learning techniques such as kNN, random forests, and all the way up to the hot topic of Deep Learning, especially if one desires quantitative vs. just classification predictions. I am sure the reader would be interested in hearing the authors’ view on, if not possible plans for, implementation of such models in an open-source approach as described here.

Response: We agree there are many approaches, as we mention briefly, however if we were to go into detail our manuscript would be a review. We have now added the note “A more exhaustive review of the different machine learning approaches is outside the scope of this work.” We have chosen to focus exclusively on the Bayesian approach for the reasons provided, and have submitted these manuscripts because we have explicit new contributions to describe. We have previously compared Bayesian and other approaches for classification with different datasets and seen little difference between algorithms based on ROC assessments. While these other machine learning methods are of interest to anyone in the field, we respectfully decline to comment on them further, as we do not have a significant amount to add to the subject at this time.

As far as I can tell, the authors mention applicability domain (AD) only en passant in ms. I and not at all in ms. II. One cannot do (and publish) modern (Q)SAR without AD analysis. In ms. I, what is the “applicability number” mentioned on p.20? What are the “further measures” (p.30) of AD they plan to implement? In ms. II, the analysis of “balanced” vs. “diabolical” partitioning is cute and instructive (though neither really novel nor unexpected in its outcome) but most importantly, lacks AD analysis: One would assume that most of the predictions in the “diabolical” cases were out of AD. The authors need to do and present AD data.

Response: “Applicability Domain” usually refers to QSAR with continuous descriptors, not to Bayesian methods with binary fingerprints. Our goal in the manuscript is to enable extra-pharma drug discovery projects to exploit in silico machine learning methods that have until now been confined in practice to pharma and to a few academic groups. To do this we use previously published datasets (described and validated by ourselves and others elsewhere) to show the open algorithm / descriptors produce similar results for the ROC values. Our goal was not to compare applicability for the models. We have updated the description of the CDD Models implementation to clarify our simplistic approaches for domain transferability measures applied here “After the model has been created, each molecule in the user’s selected ‘project’ receives a relative score, applicability number (fraction of structural features shared with the training set), and maximum similarity number (maximum Tanimoto/Jaccard similarity to any of the “good” molecules).”
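
As a rough illustration of the two numbers quoted above, the sketch below computes them on fingerprints represented as plain Python sets of feature hashes. The function names, the toy fingerprints and the reading of "applicability number" as the fraction of a molecule's own features that occur anywhere in the training set are assumptions made for this example; this is not the CDD Models code.

```python
def applicability(features, training_features):
    """Fraction of a molecule's features that also occur in the training set."""
    if not features:
        return 0.0
    return len(features & training_features) / len(features)


def tanimoto(a, b):
    """Tanimoto/Jaccard similarity between two feature sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def max_similarity(features, good_molecules):
    """Highest Tanimoto similarity to any of the 'good' training molecules."""
    return max((tanimoto(features, g) for g in good_molecules), default=0.0)


# Toy fingerprints (sets of hypothetical feature hashes).
training = [{1, 2, 3, 4}, {3, 4, 5}]
good = [training[0]]  # pretend only the first training molecule is "good"
training_features = set().union(*training)

query = {1, 2, 6, 7}
app_number = applicability(query, training_features)  # 2 of 4 features are known
sim_number = max_similarity(query, good)              # 2 shared / 6 in the union
print(app_number, sim_number)
```

A low applicability number simply flags that much of the query molecule is made of features the model has never seen, which is the simplistic sense of domain intended here.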

In both papers, the authors talk about combining of assay result sets for the same target. In this context, they then do what most authors do to “ensure logical consistency” (ms.1, p.19, .l.39) by removing duplicates via averaging or exclusion of the compounds if the measurements are incompatible (ms.II, p.23-24). I have my issues with this default approach: What if these cases of incompatible results are exactly a warning sign that the entire two assays are mutually incompatible? Please report the extreme cases, i.e. the target:assay instance that had the highest percentage of incompatible results, both in terms of the fraction of all compounds, and the fraction of the overlap subset (compounds with multiple measurements reported). The point here is that if a significant number of compounds in the overlap set have divergent results, then maybe the combined collection should not be used for this mix-and-match approach altogether; and having only one measurement (with obviously no possibility for incompatible results) is actually not good but bad. This issue is obviously much more severe for quantitative models. But I am convinced that even classifiers are negatively affected by this. See for example the papers by Kalliokoski and Kramer et al. in the 2012-2013 time frame, analyzing these issues for ChEMBL data sets.

Response: We are entirely in agreement with the concerns expressed. We admit to being a little brief in describing how we reject incompatible results, though our description in paper 2 captures the essence of how we went about data preparation (e.g. the examples we give as “<3 and >4, or <6 and =7” for two incompatible groups). In the greater scheme of things, we are working toward a data collation system that is a little smarter, and can use provenance information to make more informed decisions about how to deal with clashes (e.g. one source more likely to be incorrect, or a “voting” winner takes all in the case of more than 2 options). For the moment, however, we have simply assumed that everything in ChEMBL is equally valid, and used a very simple conflict resolution system, and described it in minimal detail. We assert that this is reasonable for this project, since it defers to the ChEMBL curators, who have a rigorous process in place. It is important to point out, however, that the extraction process that we used to obtain model sources from ChEMBL has been carried out for the purpose of creating a large number of test cases containing highly realistic data, with the objectives being to (1) demonstrate that a significant amount of data is readily available, and (2) to build and validate additional algorithms for working with this abundance of models, in a way that is scalable in terms of human time. The ability to obtain thousands of models from public sources is quite novel in cheminformatics, and has only become viable in recent years due to improvements in the quality of public data, and open source algorithms. For purposes of using this data for a major prospective drug discovery campaign, we would recommend more attention to detail, which we are currently pursuing.
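
The kind of clash mentioned in this response can be sketched as a simple pairwise check on measurements that carry a relation ("<", ">" or "="). The function name `compatible` and the rule that two exact values are always treated as agreeing (so they can be averaged) are assumptions for illustration; the paper's actual conflict resolution may differ in its details.

```python
def compatible(m1, m2):
    """True when two (relation, value) measurements do not contradict each other.

    Relations are assumed to be one of "<", ">" or "=".
    """
    (r1, v1), (r2, v2) = m1, m2
    if r1 == "=" and r2 == "=":
        return True  # assumption: exact replicates are averaged, never rejected
    if r1 == "=":
        # The exact value must satisfy the other measurement's inequality.
        return v1 < v2 if r2 == "<" else v1 > v2
    if r2 == "=":
        return compatible(m2, m1)
    if r1 == r2:
        return True  # two bounds in the same direction always overlap
    upper = v1 if r1 == "<" else v2  # bound from the "<" measurement
    lower = v1 if r1 == ">" else v2  # bound from the ">" measurement
    return lower < upper


# The two incompatible examples quoted in the response.
print(compatible(("<", 3), (">", 4)))  # False
print(compatible(("<", 6), ("=", 7)))  # False
# A pair that can both be true at once.
print(compatible((">", 2), ("<", 5)))  # True
```

A compound whose measurements fail this check would be excluded from the dataset, while compatible exact values could be averaged.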

Paper 1 (ci-2015-00143z) Open Source Bayesian Models: I. Application to ADME/Tox and Drug Discovery Datasets:

p.4, l.27: “[…] have essentially put the experimental assays out of business.” – Do the authors have a reference for this or is this just hearsay or private discussions?

Response: We have had numerous discussions with ex-employees (whom we cannot cite) at big pharma and the wealth of papers from Pfizer over the last 5-10 years (which we cited in the sentence) clearly show the strength of models developed.

p.5, l.20: “The current development of technologies for open models and descriptors build on established methodologies.” – Is “build” a verb or a noun here? If the former, it should be “builds” since, to be grammatically correct, it has to refer to “development.”

Response: We have used ‘builds’.

p.5, l.46: An additional freely available web tool for the prediction of toxicities, physicochemical properties, and biological activities that the authors could cite is the Chemical Activity Predictor at http://cactus.nci.nih.gov/chemical/apps/cap.

Response: Thank you for bringing this to our attention. We have added “In addition, there are web tools for the prediction of bioactivities and physicochemical properties like the Chemistry Activity Predictor (GUSAR) {Zakharov, 2014 #7222}.”

p.31-32, sections Author Contributions, Conflicts of Interest, and Acknowledgments: Punctuation and name abbreviation issues (SE vs. S.E. etc.).

Response: We are grateful to the reviewer for taking the time to identify these errors, and have fixed each of them.

Reviewer: 3

The authors describe an implementation of Naïve Bayes within CDK and E/FCFP* descriptors. They show some examples with development and sharing the models using their development. The authors indicate that their development enhances CDK tools by allowing users to easily develop, publish and apply and share models. This is an interesting extension of CDK, which, however, on my opinion require a more focused article. Indeed, in this study the authors try to combine software development and benchmarking studies. With respect to the first study I suggest the authors to write it as an Application Note (see guidelines on the journal web page) while the second part of the study should be done as a proper benchmarking study (see also below) to prove that NB has a significant value to the readers of the journal.

Response: We thank the reviewer for their comments. We believe our work is worthy of a manuscript rather than an application note as it describes software development and application in paper 1. Paper 2 uses the software developed in paper 1 for a novel application, namely the challenge of building 1000’s of models from a very big dataset as well as automatically assigning classes from continuous datasets. Neither of the other two reviewers suggested publishing paper 1 as an application note.

In many places, the authors term Bayesian models instead of Naïve Bayesian (NB) model. NB is crude approximation of Bayesian modeling (e.g., there is an assumption that all descriptors are independent). Some short introduction to the theory of NM approach and its comparison with full Bayesian models, which provide optimal separation of classes, should be made. Several objective benchmarking studies, see e.g. http://www.cs.cornell.edu/~alexn/papers/empirical.icml06.pdf, have indicated that NB has did not have a good reputation in comparison to different modern methods. Moreover, since 2006 many new algorithms have appeared. Therefore, the application of this method in computer science literature is rather limited. To this extent the conclusion of the authors that NB performs similarly to other used approaches are unexpected. I believe that it is a result of a specific selection of the studies used in this comparison.

Response: We believe that we have defined this terminology well enough to be able to use the term “Bayesian” as shorthand notation. The method we describe is actually the Laplacian-corrected naive Bayesian, which is unwieldy. We introduce the difference in some detail, and why we have followed previous cheminformaticians in favouring this variant: it is highly amenable to the use of thousands of structure-derived fingerprints, but it has significant drawbacks, one of them being that the result is not a probability, which is different to versions such as the standard naive Bayesian approach. We have devoted a significant amount of discussion to this, and do not believe that any more is required. Our previous papers cited in this manuscript describe numerous examples of comparing Bayesian versus SVM versus Trees, in all cases we have seen little difference using the exact same molecular descriptors when comparing the ROC for test sets (leave out groups or external).

We included 3 references to describe Laplacian-corrected naive Bayesian:
Klon, A. E.; Lowrie, J. F.; Diller, D. J., Improved naive Bayesian modeling of numerical data for absorption, distribution, metabolism and excretion (ADME) property prediction. Journal of chemical information and modeling 2006, 46, 1945-56.
Rogers, D.; Brown, R. D.; Hahn, M., Using extended-connectivity fingerprints with Laplacian-modified Bayesian analysis in high-throughput screening follow-up. Journal of biomolecular screening 2005, 10, 682-6.
Chen, B.; Sheridan, R. P.; Hornak, V.; Voigt, J. H., Comparison of random forest and Pipeline Pilot Naive Bayes in prospective QSAR predictions. Journal of chemical information and modeling 2012, 52, 792-803.
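
For readers who have not met the Laplacian-corrected variant, the sketch below shows the form in which it is commonly stated: each fingerprint feature contributes log((A + 1) / (T × P + 1)), where T is the number of training molecules containing the feature, A is how many of those are active, and P is the overall fraction of actives. The function names and toy data are hypothetical, and this is a simplified illustration rather than the CDK implementation described in the paper.

```python
import math
from collections import Counter


def train(fingerprints, labels):
    """Build per-feature contributions from sets of feature hashes and bool labels."""
    total = Counter()   # T: molecules containing each feature
    active = Counter()  # A: active molecules containing each feature
    for fp, is_active in zip(fingerprints, labels):
        for feature in fp:
            total[feature] += 1
            if is_active:
                active[feature] += 1
    base_rate = sum(labels) / len(labels)  # P: overall fraction of actives
    return {
        f: math.log((active[f] + 1) / (total[f] * base_rate + 1))
        for f in total
    }


def score(model, fp):
    """Sum the contributions of known features; unseen features add nothing."""
    return sum(model.get(feature, 0.0) for feature in fp)


# Toy training set: four molecules, two active and two inactive.
fps = [{1, 2}, {1, 3}, {2, 4}, {4, 5}]
labels = [True, True, False, False]
model = train(fps, labels)

print(score(model, {1, 3}))  # positive: features seen mostly in actives
print(score(model, {4, 5}))  # negative: features seen only in inactives
```

Because the score is an unbounded sum of logarithms, it is useful for ranking molecules but is not a probability, which is the drawback noted in the response above.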

Indeed, while the authors provide comparison of some models to previous results, they do it almost exclusively using models from their own publications. Moreover, some of these publications were review articles (e.g., ref 108), which thus may have a limited value in terms of achieved accuracies.

Response: Our previous papers cited were not all reviews (like ref 108); we use ref 108 to simplify referencing the earlier papers described. In this manuscript we describe numerous examples of comparing Bayesian versus SVM versus Trees, and in all cases we have seen little difference using the exact same molecular descriptors. Our aim by comparison is to show that the ROC values (n fold validation) in the current study are similar to those used previously in our earlier studies.
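
Since the comparison in this response rests on ROC values, here is a minimal sketch of how an ROC area under the curve can be computed from scores and active/inactive labels, using the pairwise (Mann-Whitney) formulation. The function name and the toy numbers are made up for illustration, and this is not the validation code used in the papers.

```python
def roc_auc(scores, labels):
    """Probability that a random active outscores a random inactive (ties count half)."""
    actives = [s for s, is_active in zip(scores, labels) if is_active]
    inactives = [s for s, is_active in zip(scores, labels) if not is_active]
    if not actives or not inactives:
        raise ValueError("need at least one active and one inactive molecule")
    wins = sum((a > i) + 0.5 * (a == i) for a in actives for i in inactives)
    return wins / (len(actives) * len(inactives))


# Toy example: four scored molecules, two of them active.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [True, False, True, False]
auc = roc_auc(scores, labels)
print(auc)
```

The value depends only on how the molecules are ordered, so it can be applied directly to raw model scores without converting them to probabilities first.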

This is on my opinion is not sufficient. For several datasets, e.g. AMES mutagenicity, there are multiple benchmarking studies. A proper comparison of the performance of the proposed methodology to these results (using similar test protocols as provided within these studies) is required to prove the claim of the authors that models developed with NB are of similar quality as compared to other methods.

Response: We respectfully disagree that our validation is insufficient. In a previous publication we described the implementation of ECFP6/FCFP6 fingerprints for use in the CDK toolkit, which we performed with the intention of matching the efficacy of the original implementation that was designed by SciTegic and published. While details were withheld by SciTegic, we have previously established that our implementation has equivalent performance. Since we are using the exact same algorithm for deriving Bayesian models, it is hardly a stretch to expect that the Bayesian models we created would also perform similarly well, and we have presented a number of examples to indicate this is the case. We have used the same method in many other cases which are not (yet) published and found this to be the case also. Readers are also able to confirm this for themselves using the open source implementation. In short, we believe our claims are therefore valid, and come with plenty of supporting evidence. The goal of paper 1 was to show that our method with an open algorithm and descriptors could reproduce similar statistics for datasets which we have used previously. We believe it is acceptable to use our past datasets for this; many are published by others and prior papers provide this information. Our goal was not to focus on benchmarking of any one dataset (see comments above).

If this is not the case, I do not really see the advantages of development and sharing the NB models. The scientists will be doing this using the best available approaches notwithstanding whether they use public or commercial software. Indeed, the economic gain by applying most predictive algorithms can provide much better cost savings compared to the use of the models with lower prediction ability. This economic gain can be much higher than the software costs. Moreover, the problem with model sharing in many cases is not limited by the availability of open software or open descriptors. It is more related to problems with IP issues and data security.

Response: We have previously shown there is no significant difference between costly commercial software and using open source descriptors and algorithms with very large datasets at Pfizer and extensive external testing (Gupta et al., 2010 paper). We find the reviewer’s comments to be quite perplexing. If the reviewer means to say that there is zero value in creating an open source freely sharable implementation that can be easily used by any scientist on virtually any platform, when there are currently only expensive & proprietary products that are unavailable to all but a few… then it is hard to know where to start with rebutting this. Needless to say, we know from our own experience that this is simply not the case. Bayesian methods based on circular fingerprints are extremely useful (as we and many others have attested in the literature), and putting them in the hands of everyone who could possibly benefit from them has value that is self-evident, to say the least. We are also interested in the possibility of making other methods available in the same way. Making it easy to share the resulting models is a major theme of our work of late, and we are pursuing this goal as far as our resources allow. We have discussed some of the caveats of potentially revealing information about the molecules used to create the models, in order to provide the users with the ability to make an informed decision about IP protection. The benefits of sharing are numerous. Part of the challenge is that scientists create models and publish them in a way that does not make them accessible to others. We want to try to change that, and time will tell how much impact our efforts will have. Our motivation is not economic gain but scientific gain. Models can be shared securely in CDD or they can be made completely or partially open. Each use depends on the needs of the user for the particular project.

Last, but not lest, all data used in this study should be supplied together with the article (as zipped files with chemical structures, names of molecules and activity data; the original data sources can be also included.). The indicated links do not provide an access to all datasets (i.e., registration is required for some sets). This will be required to allow the readers to re-use them in other studies.

Response: As we have described in the data and materials availability statement: ‘All computational models are available from the authors upon request. All molecules for malaria, tuberculosis and cholera datasets from Table 1 are available in CDD Public (https://app.collaborativedrug.com/register) and the models from Table 2 are available from (http://molsync.com/bayesian1).’
The CDD public data from Table 1 is readily accessible after registering. Most of the datasets in Table 1 have already been published and made available by others (see citations). We include just one proprietary dataset (Caco-2). If scientists need access to the other datasets they can request them from us. We are not aware there is a requirement of the journal to make all data open. Clearly drug companies that publish data in JCIM do not do this frequently.

Since the content from paper 2 represents a rather large fraction of ChEMBL, it would be antisocial to include it as either a single file on the ACS server, or as thousands of smaller files. For this reason we prefer to host it ourselves (and make it accessible to the community) in a way that is more convenient to the reader.


Please make sure your COI statement appears in the manuscript:
“S.E. is a consultant for Collaborative Drug Discovery Inc. A.M.C. is the founder of Molecular Materials Informatics, Inc.”

Response: Yes, we included this.


Paper 2
Manuscript ID: ci-2015-00144w
Title: “Open Source Bayesian Models: II. Mining a “Big Dataset” to Create and Validate Models with ChEMBL”
Author(s): Clark, Alex; Ekins, Sean

Reviewer: 1

This paper is a companion to the software paper submitted in parallel. It describes the use the ChEMBL to test their two-state Bayesian classification described in the parallel paper.
The reasoning behind the study as well as the methodology is properly described and accessible, as is the extraction of the underlying data sets.
All models produced in this study are openly available, as is the software which has been integrated into the open source chemistry development kit (CDK).

Response: Thank you.

Reviewer: 2

Paper 2 (ci-2015-00144w) Open Source Bayesian Models: II. Mining a “Big Dataset” to Create and Validate Models with ChEMBL:

p.20, l.38: “independent not order dependent” – strangely phrased.

Response: Corrected.

p.23, l.23: “The first limit clause restricts to any of the assay identifiers for the block, which varies from one to thousands.” – Unclear phrasing and/or mangled grammar: Restricts what? And what varies from one to thousands?

Response: Corrected.

p.26, l.1ff: What software and method was used for this analysis and the plots?

Response: Analysis has been done by software described in these two manuscripts. Plots were created using original software (which we do not describe, since it is not novel and was created only to support the manuscript).

p.33, l.15: “described by Keiser et al., 80.” – Screwed-up punctuation. Or a sentence part missing?

Response: We have changed this as follows. “Similarity ensemble analysis (SEA) was described by Keiser et al., 80 which used 246 targets and 65,241 molecules and the Tanimoto similarity was compared for each pair of molecules. This approach was used to identify new targets for several known drugs that were not expected.”

p.33, l.34: “These had correct […]” – better: “These models…”

Response: Corrected.

p.33, l.37: “build models for adverse drug reactions these in turn” – comma (or semicolon or even full stop) missing before “these”.

Response: Corrected.

p.33, l.46: “Natives Bayesian” – I am pretty sure the authors meant “Naïve Bayesian.”

Response: Corrected.

p.33, l.48: “It was however shown that combining HTS a fingerprints […]” – Mangled sentence.

Response: Corrected.

p.34, l.25: “over 1800 molecules tested against over 800 molecules” – this makes no sense.

Response: Corrected. Should be 800 end points/assays.

p.36, l.8: “In this case, secure collaborative software would be used to transfer and run the model.” – Too much advertisement for CDD.

Response: We are stating a fact that if IP was to be maintained it would have to happen in a secure environment. We do not mention CDD explicitly.

p.38, sections Author Contributions, Conflicts of Interest, and Acknowledgments: Punctuation and name abbreviation issues (SE vs. S.E. etc.).

Response: Corrected.

Reviewer: 3

The authors have extracted and analyzed datasets extracted from ChemBL database using naïve bayes classifier. They tried to develop a threshold schema to separate quantitative data on classes of active and inactive compounds and made developed models and associated data available for download by the external users.

The mapping of naïve bayes scores to probability estimation is well known in the computer science literature, which has been addressed more than a decade ago, see e.g. http://www.research.ibm.com/people/z/zadrozny/kdd2002-Transf.pdf. I do not see a reason to develop “yet another” algorithm without providing a correct benchmarking and comparison of it with the previous studies.

Response: The reviewer has not taken into account the fact that we are describing the Laplacian-corrected variant of the naive Bayesian method. The references given refer to the conventional form, which is more popular outside of cheminformatics (which usually does not have to deal with thousands of fingerprints), and as such are describing the process of normalizing values that are already formally probabilities in the 0..1 range. The method that we have adopted generates values with arbitrary scale and range, and so this limits the extent to which they can be interpreted. The raw values are suitable for ordering, but little else. We are not aware of previously disclosed methods for transforming continuous values into a “probability-like” range, and we deem this to be of some value to cheminformatics. We have also described these issues in considerable detail in the text, and do not believe that any further discussion is necessary.
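To make the point about arbitrary scale concrete, here is a minimal sketch of a Laplacian-corrected naive Bayesian score. It assumes the commonly cited form of the correction, in which each fingerprint feature contributes log((A + 1) / (N * P + 1)), where N is the number of training molecules containing the feature, A is how many of those are active, and P is the overall active fraction. The function names and the toy data are hypothetical and this is not the code from the manuscripts or the CDK.

```python
import math
from collections import Counter


def train_laplacian_nb(fingerprints, labels):
    """Build per-feature contributions from a non-empty training set.

    fingerprints: list of sets of feature identifiers (hypothetical hashes)
    labels: list of booleans, True meaning active
    """
    base_rate = sum(labels) / len(labels)  # overall fraction of actives, P
    seen = Counter()         # N: molecules containing each feature
    seen_active = Counter()  # A: active molecules containing each feature
    for features, is_active in zip(fingerprints, labels):
        for feature in features:
            seen[feature] += 1
            if is_active:
                seen_active[feature] += 1
    # Laplacian-corrected log ratio for every feature in the training set
    return {
        feature: math.log((seen_active[feature] + 1) / (seen[feature] * base_rate + 1))
        for feature in seen
    }


def score(model, features):
    """Sum the contributions; features not seen in training contribute zero."""
    return sum(model.get(feature, 0.0) for feature in features)


# Toy data: four molecules, two active, with made-up feature ids
toy_fingerprints = [{1, 2}, {1, 3}, {2, 3}, {3}]
toy_labels = [True, True, False, False]
toy_model = train_laplacian_nb(toy_fingerprints, toy_labels)
```

The score is a sum of log ratios, so it grows with the number of features and is not confined to the 0..1 range, which is why the raw values are mainly useful for ranking.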

The authors should not substitute term “Bayesian models” with “Naive Bayes models”. NB is based on very strong assumptions about the statistical properties of descriptors and does not provide optimal models as full Bayesian modeling.

Response: As previously noted, we have used the term “Bayesian” as shorthand for “Laplacian-corrected naive Bayesian”, after having introduced the term. In the interests of literary quality, we have kept to this convention.

The article does not have a result section. It starts with description of data preparation, which belongs to the Data section. Actually, there is no need to specify sql queries used to extract data. Such technical information can be better published as supplementary materials or just skipped.

Response: We respectfully disagree. We have formatted the manuscript in a way that we believe serves the casual reader as well as anyone studying it in detail, and have provided all of the content categories that are expected of a research paper. While migrating the SQL queries to supplementary information would not be a dealbreaker for the overall value of the manuscript, we believe that it is useful to communicate to readers what work is required in order to transmute the data source into something that is immediately useful. Some readers may be under the impression that it is much easier or much harder than it really is, and anyone who is familiar with data processing methods would find it valuable. For this reason, we have retained this section of the manuscript as is.

The IC50 values used by the authors were collected from different articles, which were based on different experimental conditions. The authors should provide some arguments and discussions how the use of different experimental conditions affects the results and why such different data can be merged together.

Response: In some cases the ChEMBL data are also from a single lab so it depends on the dataset. We agree one would expect some interlab variability when the data comes from more than one laboratory. In this work, which is focused on method development, we have “passed the buck” to the ChEMBL team. We have explained in detail (hence the SQL queries) how we have chosen to assimilate values with the same target/assay types. To the extent that they are incompatible, this is a decision that was made by the curators of ChEMBL. To scientists using this data for prospective studies, it is up to them to decide whether it was reasonable for us to assume that the ChEMBL curation is good enough. We do not argue that this core assumption is appropriate for all drug discovery scenarios, but we do demonstrate our claim that by doing so, it is possible to produce a large number of models with an entirely automated method, and that this is of interest to the greater community. Whether the generally-high ROC values are indicative of high compatibility of data from different labs is not something we claim to have proven. However, the fact that there is sufficient high quality open data, and now methods, for creating well-performing models for many hundreds of biorelevant targets with adequate model sizes is in itself very interesting, and in our opinion, well worth sharing with the community.

I do not understand the arguments about the need to develop sophisticated algorithms to select a threshold for classification models. The regression models are much better suitable to work with quantitative models. If only a classification is required, the selection of a threshold depends on the intended use of the results (e.g., models developed for screening of new compounds with 10mM and 1µM should be based on the appropriate thresholds for the activity data). Because of these two arguments, I could not really follow the logic and need to design some new criteria for separation of active and inactive compounds. Again, this part belongs to the methodological part of the article.

Response: We may have erred on the side of assuming that this concept is familiar to all cheminformaticians involved in drug discovery, though we believe it is reasonable given the readership of the journal. In order to build a model based on 2-state classifications, it is necessary to have data that is classified as one of two states. Since bioassay data is typically given as continuous values, often in concentration units, the easiest way to do this is to define a threshold. The choice of threshold varies considerably depending on the circumstances, e.g. for some targets, only strong binders are interesting, while for other cases, the available data may not include many/any strong binders, and so a lower threshold is appropriate. The best choice is not necessarily obvious. When a handful of models is being considered, contextual scientific knowledge is usually available, with manual trial-and-error as a fallback, but for thousands of models, this represents a major scaling issue. It is our belief that most of the readers for whom this article ought to appeal will be familiar with this concept, and that the explanation we have provided in the manuscript is sufficient.
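The thresholding step described in this response can be illustrated with a small sketch. It assumes activities are IC50 values in nM, converts them to pIC50, and shows how the active/inactive split shifts as the cutoff moves. The function names and the toy values are hypothetical, and this is not the automated threshold selection procedure from the manuscript.

```python
import math


def pic50_from_nm(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 (-log10 of the molar value)."""
    return 9.0 - math.log10(ic50_nm)


def binarize(ic50_values_nm, threshold_pic50):
    """Label each measurement active (True) if its pIC50 meets the threshold."""
    return [pic50_from_nm(value) >= threshold_pic50 for value in ic50_values_nm]


def active_fraction(ic50_values_nm, threshold_pic50):
    """Fraction of measurements labelled active at a given threshold."""
    flags = binarize(ic50_values_nm, threshold_pic50)
    return sum(flags) / len(flags)


# Made-up IC50 values in nM, spanning strong to weak binders
toy_ic50_nm = [10.0, 100.0, 1000.0, 10000.0]

# Moving the cutoff changes the class balance the model is trained on
for cutoff in (4.5, 6.5, 8.5):
    print(cutoff, active_fraction(toy_ic50_nm, cutoff))
```

With a handful of targets the cutoff can be picked by eye; across thousands of assays the same decision has to be made by some automated rule, which is the scaling issue the response refers to.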

Thus, actually I did not find what are the results of this study and what is their value? Who will be the potential users of the developed models and how can be these results used? Unfortunately, the article does not have a clear answer to this question.

Response: We have spent some time describing the possibilities that arise from having thousands of models for bio-relevant targets based on high quality open data. We believe that this should be largely self-evident to anyone who is working in the drug discovery industry: having a model for almost every drug target conveniently on hand, and freely available, is transformative and quite different from the status quo. As we have described in the discussion, while others have built Bayesian models for multiple targets, none has considered the scale of what we have demonstrated with the ChEMBL data, namely over 2000 classification models. While we make no claims to the effect that these models are completely ready to be used directly for prospective drug discovery campaigns, it is a major step in the direction of creating large collections of models, and should be of very broad interest and applicable to other data collections. From experience and collaborations, we have already identified academic and commercial organizations that would benefit from the models, and fully intend to follow up any interesting results with disclosure in the literature.


My experiences reviewing at PLOS

While I am a supporter and proponent of open access publishing, I have in the past been fairly critical of some of these journals, in particular PLOS. Even though I still publish in their journals, I have held off from PLOS ONE in favor of F1000Research, who I feel have a better publishing and reviewing model which suits me just fine. Today something new happened, as far as I can tell, though. They (PLOS ONE) sent me, and probably thousands of others, thank-yous for reviewing! I am flattered... but hold on...

Sean Ekins
PLOS ONE Reviewer (2014)

May 2015

Dear Sean,

On behalf of PLOS and the PLOS ONE editorial team, I would like to thank you for participating in the peer review process this past year at PLOS ONE. We very much appreciate your valuable input in 2014. We know there are many claims on your time and expertise but with your help, we have continued to publish an influential, lively and highly accessed Open Access journal. Simply put, we could not do it without you and the thousands of other volunteers for PLOS ONE and the other PLOS journals who graciously contributed time reviewing manuscripts.

A public “Thank You” to our 2014 reviewers – including you – was published in February 2015.

(2015) PLOS ONE 2014 Reviewer Thank You. PLoS ONE 10(2): e0121093. doi:10.1371/journal.pone.0121093

Your name is listed in the Supporting Information file associated with the article. I hope that you will be able to use this letter, along with the article citation, to claim the credit and recognition you deserve within your institution for supporting PLOS ONE and Open Access publishing.

If you would ever like to provide feedback on our processes, we would very much welcome that. Please send your feedback to us at plosone@plos.org.

With Gratitude,

Damian Pattinson
Editorial Director

P.S. If you’d like to receive news and information from PLOS, opt-in here.

I appreciate the little email but at the same time it made me wonder if all of this is starting to get a little out of hand. Maybe it’s just me. I find that perhaps I am getting a bit more cynical, but do we really need to get credit for reviewing papers? Is it not our role as scientists to review papers? Honestly, as an editorial board member at other journals I get to see plenty of papers that get routed around to reviewers, and while I do not review as many papers as I used to, I feel I am doing my bit for scholarly publishing. I also know there are some mechanisms for paying reviewers, and while I do not want to stifle these alternative business models, they are not yet the norm. So quite honestly I do not feel the need to claim credit for reviewing someone’s paper. Yes, I like the idea of listing openly the names of reviewers of each paper (in the interests of transparency) and PLOS ONE are not there yet. I also like the idea of sharing reviewers’ reviews, as I have been trying to do here, and PLOS ONE do not do that either. But I for one will not be adding another line to my CV that states I reviewed for PLOS ONE. What next, a citation for reading a paper in PLOS ONE? Where do we draw the line? My request to them is to do the things that help with transparency and building confidence in reviewing rather than what appears trivial to me. I would prefer a discount on publishing with them if I review papers rather than some citation. That would be a very nice “Thank You”. I look forward to that kind of email in future, but it may be a long wait.


Contrasts in Pharmacology 2.0

Last week I was in Turin to give a talk at the Contrasts in Pharmacology 2.0 meeting, organized by the Fondazione Internazionale Menarini.

My talk title was provided to me: “Bigger data to increase drug discovery”.

I am grateful to the organizers for inviting me and was honored to be there alongside speakers from the WHO, GSK, Roche, Brigham and Women’s Hospital, University of York etc. It was a great opportunity to learn about other areas in pharmacology and meet some new researchers. The talks were all recorded and are available for online viewing.


Papers and posters for ACS Boston 2015

It is that time of the year when we get those automated emails telling us our abstracts have been scheduled for the next ACS in August. In addition to these I will be participating in a careers panel for undergraduates, which will be interesting; hopefully it will show the next generation that there is lots to do in drug discovery, cheminformatics, toxicology and rare diseases. I will also have a poster in the small chemical business section, which should be a new venture for me as a budding entrepreneur. So there is plenty of work ahead in prepping for these. Finally, there are posters that collaborators are submitting, which I hope to add in due course.

Here goes the list of papers to be presented at the 250th ACS National Meeting that will be held in Boston, Massachusetts, August 16-20, 2015.

PAPER ID: 2248982
PAPER TITLE: “Mining big datasets to create and validate machine learning models”

DIVISION: [CINF] Division of Chemical Information

PAPER ID: 2248973
PAPER TITLE: “Making it open: Putting cheminformatics to use against the Ebola virus”

DIVISION: [CINF] Division of Chemical Information

PAPER ID: 2246123
PAPER TITLE: “Applying cheminformatics and bioinformatics approaches to neglected tropical disease big data”

DIVISION: [CINF] Division of Chemical Information

PAPER ID: 2249028
PAPER TITLE: “Development and sharing of ADME/Tox and Drug Discovery machine learning models”

DIVISION: [COMP] Division of Computers in Chemistry

PAPER ID: 2248999
PAPER TITLE: “Mobile Apps for Transporter Drug-Drug Interaction Prediction – A Tool of the Future, now”

DIVISION: [TOXI] Division of Chemical Toxicology

PAPER ID: 2248989
PAPER TITLE: “Starting small companies focused on rare diseases”

DIVISION: [SCHB] Division of Small Chemical Businesses


The Necessary – a provocative vision of the scientific future

I was asked by Barry Bunin a few months ago to provide a short presentation on a vision for the future of drug discovery, which I finally gave this week. After thinking about this for a while and chatting with a few friends including Alex Clark, I wondered what was necessary in the future and that is how I came up with the title. This is not the kind of talk I would normally give so I have to thank Barry for getting me out of my dry science comfort zone. It is still a little odd imagining giving such a talk. Thankfully it was a short presentation. Some of the slides were objectives for a perfect world and I would say I am not a true believer of all of what I have on the slides. Here is my brief run down of what I was trying to present:

Slide 1. Title

Slide 2. A definition

Slide 3. I wanted to show how it was pretty hard to imagine what could be a scientific breakthrough in the future... what will it be in 2015? Also it was surprising that some discoveries (notably gene therapy), while advanced, are still not happening widely.

Slide 4. A utopian or dystopic future, you decide. Perhaps the one thing I would change is the life span: could it be longer? I have no solution as to how we would achieve any of these goals, but could they be future science advances in years to come?

Slide 5. A little explanation perhaps on how we will make some of these things happen. Purely science fiction perhaps, and that is fine, but today’s science fiction is tomorrow’s science.

Slide 6. The goal of this slide was to show some technologies that suggest that the near future / now has some pretty advanced technologies that were science fiction in the past.

Slide 7. The goal here was to show how parents are disrupting funding science and bringing gene therapies to clinical trials. One example has raised $6.5 million and in 6 years brought a potential treatment to clinical trials.

Slide 8. This is my slide of the near future. We are so close to many of these advances that involve computation and drug discovery, and some of them are tantalizingly close. This slide also shows that I am still skeptical.

Slide 9. The challenges of science today may be obsolete in the future – perhaps the most controversial slide – because what would we do as scientists if there was nothing more to learn and all knowledge was complete?

Slide 10. I wanted to end with the “now” and how we could kickstart some of the things on slides 4-5. This is our challenge: how to start on the future.



The next generation of Tuberculosis researchers

I like to help and mentor students in various ways when I can. Over the years this has led to some fun projects and computational models. For a long time I have been an Adjunct Associate Professor at the University of Maryland, yet only today did I hear about a group of undergraduates there working on Tuberculosis. I have no connection to this group at all so I feel like there is no conflict of interest. They are a group of interdisciplinary honors undergraduates trying to raise $6000 to fund their research project. Their goal, as they state, “is to find a known drug that decreases virulence of Mycobacterium tuberculosis by looking for protein interactions in ESX-1 and ESX-5 secretion systems or the Nuo operon”.
They simply need the money to buy a library of FDA approved drugs to screen and they have used crowdfunding to try to get there.
My only feedback would be that a library of known drugs may be an OK place to start, but there has been considerable screening of FDA drugs versus whole cell MTB etc., with limited traction. Perhaps they could also contact the NIH to get the NIH Clinical Set of over 700 drugs for free. Then they can save their money for follow up of compounds and in vitro ADME screens or in vivo testing. Also perhaps they could look at compounds that have already shown whole cell activity and work from them to deduce their mechanism; there are several thousand of these from NIH screens done at SRI. In addition there are the 177 actives from GSK.
Hopefully these young scientists represent the next generation of tuberculosis researchers. I applaud them for taking on such a project against a tough bug like Mycobacterium tuberculosis. I hope by raising awareness of their efforts we can help them reach their goal.


What warrants an erratum and why the old publishing model must change

On Friday morning my day started with an email, which I have marked up and added links to:

Dear Dr. Ekins:

It has come to our attention that an error was identified in your recent Perspective entitled “The parallel worlds of public and commercial bioactive chemistry data” published in the March 12, 2015 issue of the Journal of Medicinal Chemistry (please see attached).  We would like to request that you submit an Additions and Corrections to the Journal (instructions attached).


An editor at J Med Chem


I just uploaded the letter from CAS to Figshare.

During Friday I issued the requested erratum and then retracted it later in the day when I realized there was in fact no error. I emailed and left calls for the editor and admin so Monday should be fun...

What changed my mind was that two independent scientists, both longtime SciFinder users and authors on the paper, came to the same conclusion: in mid 2014 there was a problem with this patent in SciFinder (Chris Southan has now blogged more on it).

It’s all a storm in a teacup, as I thought we were pretty balanced in the article. Interestingly, when the paper went through extensive review and major revisions, no reviewer seemed to pick up on the same problem for CAS.

I think this highlights the difficulty with the old fashioned publishing model.

1. Authors submit paper to Journal on work they did months/ years ago

2. Months later they get reviews back

3. Weeks later they respond to reviews

4. Months later they get re-reviews back

5. Weeks later they re-respond to the reviewers

6. Weeks later it is accepted

7. Days later it goes ASAP

8. Weeks later proofs corrected and online

9. Months later paper published

10. By the time an article publishes it could reference databases and other sources long out of date and changed.

The publishing model ACS and other journals / societies use is way out of date and is not relevant anymore. Why should it take > 6 months to go from submission to publication of a perspective? This is not even a research article, where timeliness is even more critical.

And yet we still submit to ACS journals… Definitely as scientists we need more options.





