Open Source Bayesian Models (X2)

For the last 5-6 years I have been kind of obsessed (in a good way) with how we could get computational machine learning models for drug discovery to a point where they could be shared. The reasoning is that we publish papers, but the models described in them never really get used by anyone else. It's been a bit of a journey that, as of yesterday, resulted in Alex Clark and me having 2 papers accepted at JCIM here and here. I thought I would provide a bit more detail on why I think this is important.

It all started back in November 2009 when I had a meeting with Chris Waller, Eric Gifford, Rishi Gupta (all Pfizer employees), Barry Bunin and Moses Hohman (CDD) at Pfizer. The hope was to try to get access to data from big pharma as models in CDD Public or CDD Vault. What actually came out was something different but still useful. The light bulb went on at the table: why not compare commercial descriptors and algorithms with the open source descriptors and algorithms for different ADME datasets? A year later this work came out as a paper in Drug Metabolism and Disposition. Of course this also makes you think about how the reliance on expensive tools may be lessened.

Following this we put an SBIR together that helped to fund the development of the FCFP6 and ECFP6 descriptors (by Alex Clark) that are now on GitHub. These descriptors allowed Alex to build Bayesian models in TB Mobile 2.0 for target prediction. The most recent work published in JCIM builds on this to describe “the creation of a reference implementation of a Bayesian model-building software module, which we have released as an open source component that is now included in the Chemistry Development Kit (CDK) project, as well as implemented in the CDD Vault and in several mobile apps.”

There is still a lot of work to be done to get CDD Models to where I want it to be, and to validate models, but I hope that by making the software and models accessible we have helped others to run with it too. The second part is independent of the CDD efforts and was to show what could be achieved with these open source technologies: “we performed a large scale validation study in order to ensure that the technique generalizes to a broad variety of drug discovery datasets. To achieve this we have used the ChEMBL (version 20) database and split it into more than 2000 separate datasets, each of which consists of compounds and measurements with the same target and activity measurement.”

We then made these models accessible on a website which can be used by anyone and uploaded into the mobile apps Alex developed.

We are immensely grateful to the 3 reviewers and editor (Alex Tropsha) of these manuscripts because they had double the workload. As I have done in the past I include the reviewer comments and our rebuttals to illustrate where the reviews made us modify the original submissions. Both papers were made open access – we have not had the proofs yet at the time of writing so there may be some typos needing correction.

It has been hugely rewarding working with Alex on this project, and the immediate benefit I see from the 2000 ChEMBL models is that anyone could take these and use them to do drug discovery / virtual screening on so many different targets. It's pretty overwhelming to imagine having so many models, and while it's not “Big data” for some, for us as modelers this is about as big as it gets. The community does need to realize it can get even bigger, as this represents just a fraction of the ChEMBL datasets, which are a moving target.


cover art idea


Paper 1


Manuscript ID: ci-2015-00143z
Title: “Open Source Bayesian Models: I. Application to ADME/Tox and Drug Discovery Datasets”
Author(s): Clark, Alex; Dole, Krishna; Coulon-Spektor, Anna; McNutt, Andrew; Grass, George; Freundlich, Joel; Reynolds, Robert; Ekins, Sean

Reviewer: 1

Well written article about a nice, free and open piece of work about a Bayesian model-building software module used to build an array of Bayesian models for ADME/Tox, in vitro and in vivo bioactivity and other physicochemical properties. The thorough description including code examples makes the method easily accessible for readers. Releasing the software as part of a widely cited open source tool kit make it easy to access and test. I hope the authors pay the open access fees for this article.

Response: Thank you. We plan on making both parts open access if accepted.

Reviewer: 2

The authors’ two-part publication on the development and application of their open-source tools for building Bayesian models is well-written and addresses an important need in the field: Easy development of predictive models in the CADD field with free and public tools, and easy sharing of such models within the research community. I therefore recommend publication after minor modifications.

Response: Thank you for your comments.

I have some reservation about the large number of citations of previous CDD work in either manuscript, which smacks a bit of company advertisement. However, these cited works seem relevant for the topic presented here, so I’ll give the authors the benefit of the doubt.

Response: We agree the selected citations are relevant to the manuscript. There are a handful that we would class as CDD papers e.g. describing TB Mobile and CDD Models. The majority of the references by Ekins et al. relate to work done outside of CDD that is relevant including both academic and industrial collaborations using machine learning.

While Bayesian classifiers are certainly useful (and have been widely applied), there are other modern machine-learning techniques such as kNN, random forests, and all the way up to the hot topic of Deep Learning, especially if one desires quantitative vs. just classification predictions. I am sure the reader would be interested in hearing the authors’ view on, if not possible plans for, implementation of such models in an open-source approach as described here.

Response: We agree there are many approaches, as we mention briefly; however, if we were to go into detail our manuscript would be a review. We have now added the note “A more exhaustive review of the different machine learning approaches is outside the scope of this work.” We have chosen to focus exclusively on the Bayesian approach for the reasons provided, and have submitted these manuscripts because we have explicit new contributions to describe. We have previously compared Bayesian and other approaches for classification with different datasets and seen little difference between algorithms based on ROC assessments. While these other machine learning methods are of interest to anyone in the field, we respectfully decline to comment on them further, as we do not have a significant amount to add to the subject at this time.
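
For readers less familiar with the ROC assessments referred to in this response, here is a minimal sketch of how an ROC area under the curve (AUC) can be computed from model scores and known active/inactive labels. It uses the pairwise (Mann-Whitney) formulation; the function name `roc_auc` and the toy data are illustrative assumptions and are not taken from the papers or from the CDK code.

```python
def roc_auc(scores, labels):
    """ROC AUC as the probability that a randomly chosen active
    outscores a randomly chosen inactive (ties count as half)."""
    actives = [s for s, y in zip(scores, labels) if y]
    inactives = [s for s, y in zip(scores, labels) if not y]
    if not actives or not inactives:
        raise ValueError("need at least one active and one inactive")
    wins = 0.0
    for a in actives:
        for i in inactives:
            if a > i:
                wins += 1.0
            elif a == i:
                wins += 0.5
    return wins / (len(actives) * len(inactives))


# Toy example: four molecules, two active (1) and two inactive (0).
example_auc = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

Because the AUC depends only on the ordering of the scores, it can be used to compare methods whose raw outputs are on different scales.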

As far as I can tell, the authors mention applicability domain (AD) only en passant in ms. I and not at all in ms. II. One cannot do (and publish) modern (Q)SAR without AD analysis. In ms. I, what is the “applicability number” mentioned on p.20? What are the “further measures” (p.30) of AD they plan to implement? In ms. II, the analysis of “balanced” vs. “diabolical” partitioning is cute and instructive (though neither really novel nor unexpected in its outcome) but most importantly, lacks AD analysis: One would assume that most of the predictions in the “diabolical” cases were out of AD. The authors need to do and present AD data.

Response: “Applicability Domain” usually refers to QSAR with continuous descriptors, not to Bayesian methods with binary fingerprints. Our goal in the manuscript is to enable extra-pharma drug discovery projects to exploit in silico machine learning methods that have until now been confined in practice to pharma and to a few academic groups. To do this we use previously published datasets (described and validated by ourselves and others elsewhere) to show the open algorithm / descriptors produce similar results for the ROC values. Our goal was not to compare applicability for the models. We have updated the description of the CDD Models implementation to clarify our simplistic approaches for domain transferability measures applied here “After the model has been created, each molecule in the user’s selected ‘project’ receives a relative score, applicability number (fraction of structural features shared with the training set), and maximum similarity number (maximum Tanimoto/Jaccard similarity to any of the “good” molecules).”
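
The two numbers quoted in that description can be illustrated with a short sketch. It assumes each molecule is represented as a Python set of fingerprint hash codes, and that the "fraction of structural features shared" is taken relative to the query molecule's own features; both are assumptions made for illustration, and the function names are hypothetical rather than part of CDD Models.

```python
def applicability(query_fp, training_features):
    """Fraction of the query's features that also occur anywhere in the
    training set (assumed definition; the denominator is the query)."""
    if not query_fp:
        return 0.0
    return len(query_fp & training_features) / len(query_fp)


def tanimoto(fp_a, fp_b):
    """Tanimoto/Jaccard similarity between two sets of features."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def max_similarity(query_fp, good_fps):
    """Highest Tanimoto similarity between the query and any 'good' molecule."""
    return max((tanimoto(query_fp, fp) for fp in good_fps), default=0.0)


# Toy fingerprints: small integers stand in for ECFP6/FCFP6 hash codes.
query = {1, 2, 3, 4}
training_features = {1, 2, 9}
good_molecules = [{1, 2}, {1, 2, 3, 4, 5}]
app = applicability(query, training_features)
sim = max_similarity(query, good_molecules)
```

A low applicability number or a low maximum similarity is a hint that the query molecule looks unlike anything the model was trained on.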

In both papers, the authors talk about combining of assay result sets for the same target. In this context, they then do what most authors do to “ensure logical consistency” (ms.1, p.19, .l.39) by removing duplicates via averaging or exclusion of the compounds if the measurements are incompatible (ms.II, p.23-24). I have my issues with this default approach: What if these cases of incompatible results are exactly a warning sign that the entire two assays are mutually incompatible? Please report the extreme cases, i.e. the target:assay instance that had the highest percentage of incompatible results, both in terms of the fraction of all compounds, and the fraction of the overlap subset (compounds with multiple measurements reported). The point here is that if a significant number of compounds in the overlap set have divergent results, then maybe the combined collection should not be used for this mix-and-match approach altogether; and having only one measurement (with obviously no possibility for incompatible results) is actually not good but bad. This issue is obviously much more severe for quantitative models. But I am convinced that even classifiers are negatively affected by this. See for example the papers by Kalliokoski and Kramer et al. in the 2012-2013 time frame, analyzing these issues for ChEMBL data sets.

Response: We are entirely in agreement with the concerns expressed. We admit to being a little brief in describing how we reject incompatible results, though our description in paper 2 captures the essence of how we went about data preparation (e.g. the examples we give as “<3 and >4, or <6 and =7” for two incompatible groups). In the greater scheme of things, we are working toward a data collation system that is a little smarter, and can use provenance information to make more informed decisions about how to deal with clashes (e.g. one source more likely to be incorrect, or a “voting” winner takes all in the case of more than 2 options). For the moment, however, we have simply assumed that everything in ChEMBL is equally valid, and used a very simple conflict resolution system, and described it in minimal detail. We assert that this is reasonable for this project, since it defers to the ChEMBL curators, who have a rigorous process in place. It is important to point out, however, that the extraction process that we used to obtain model sources from ChEMBL has been carried out for the purpose of creating a large number of test cases containing highly realistic data, with the objectives being to (1) demonstrate that a significant amount of data is readily available, and (2) to build and validate additional algorithms for working with this abundance of models, in a way that is scalable in terms of human time. The ability to obtain thousands of models from public sources is quite novel in cheminformatics, and has only become viable in recent years due to improvements in the quality of public data, and open source algorithms. For purposes of using this data for a major prospective drug discovery campaign, we would recommend more attention to detail, which we are currently pursuing.
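
As a rough illustration of the kind of simple conflict resolution described in this response, the sketch below merges duplicate measurements for one compound, where each measurement is a (relation, value) pair such as ("<", 3) or ("=", 7). The policy shown (average the exact values, reject the compound when bounds contradict each other or the average) is an assumption chosen to reproduce the quoted examples; it is not the authors' actual code, and `resolve` is a hypothetical name.

```python
import math


def resolve(measurements):
    """Merge duplicate (relation, value) measurements for one compound.

    Returns a single (relation, value) pair, or None when the
    measurements are mutually incompatible (assumed policy).
    """
    exact = [v for r, v in measurements if r == "="]
    uppers = [v for r, v in measurements if r == "<"]
    lowers = [v for r, v in measurements if r == ">"]
    lo = max(lowers, default=-math.inf)  # tightest lower bound
    hi = min(uppers, default=math.inf)   # tightest upper bound
    if lo >= hi:
        return None  # e.g. "<3 and >4"
    if exact:
        mean = sum(exact) / len(exact)
        if not (lo < mean < hi):
            return None  # e.g. "<6 and =7"
        return ("=", mean)
    if uppers and not lowers:
        return ("<", hi)
    if lowers and not uppers:
        return (">", lo)
    return None  # bounded on both sides with no exact value: skip


clash_1 = resolve([("<", 3), (">", 4)])
clash_2 = resolve([("<", 6), ("=", 7)])
merged = resolve([("=", 5.0), ("=", 7.0)])
```

Counting how often `resolve` returns None for each target and assay combination would give the kind of incompatibility statistics the reviewer asks about.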

Paper 1 (ci-2015-00143z) Open Source Bayesian Models: I. Application to ADME/Tox and Drug Discovery Datasets:

p.4, l.27: “[…] have essentially put the experimental assays out of business.” – Do the authors have a reference for this or is this just hearsay or private discussions?

Response: We have had numerous discussions with ex-employees (whom we cannot cite) at big pharma and the wealth of papers from Pfizer over the last 5-10 years (which we cited in the sentence) clearly show the strength of models developed.

p.5, l.20: “The current development of technologies for open models and descriptors build on established methodologies.” – Is “build” a verb or a noun here? If the former, it should be “builds” since, to be grammatically correct, it has to refer to “development.”

Response: We have used ‘builds’.

p.5, l.46: An additional freely available web tool for the prediction of toxicities, physicochemical properties, and biological activities that the authors could cite is the Chemical Activity Predictor at http://cactus.nci.nih.gov/chemical/apps/cap.

Response: Thank you for bringing this to our attention. We have added “In addition, there are web tools for the prediction of bioactivities and physicochemical properties like the Chemistry Activity Predictor (GUSAR) {Zakharov, 2014 #7222}.”

p.31-32, sections Author Contributions, Conflicts of Interest, and Acknowledgments: Punctuation and name abbreviation issues (SE vs. S.E. etc.).

Response: We are grateful to the reviewer for taking the time to identify these errors, and have fixed each of them.

Reviewer: 3

The authors describe an implementation of Naïve Bayes within CDK and E/FCFP* descriptors. They show some examples with development and sharing the models using their development. The authors indicate that their development enhances CDK tools by allowing users to easily develop, publish and apply and share models. This is an interesting extension of CDK, which, however, on my opinion require a more focused article. Indeed, in this study the authors try to combine software development and benchmarking studies. With respect to the first study I suggest the authors to write it as an Application Note (see guidelines on the journal web page) while the second part of the study should be done as a proper benchmarking study (see also below) to prove that NB has a significant value to the readers of the journal.

Response: We thank the reviewer for their comments. We believe our work is worthy of a manuscript rather than an application note, as paper 1 describes both software development and its application. Paper 2 uses the software developed in paper 1 for a novel application, namely the challenge of building thousands of models from a very big dataset as well as automatically assigning classes from continuous datasets. Neither of the other two reviewers suggested publishing paper 1 as an application note.

In many places, the authors term Bayesian models instead of Naïve Bayesian (NB) model. NB is crude approximation of Bayesian modeling (e.g., there is an assumption that all descriptors are independent). Some short introduction to the theory of NM approach and its comparison with full Bayesian models, which provide optimal separation of classes, should be made. Several objective benchmarking studies, see e.g. http://www.cs.cornell.edu/~alexn/papers/empirical.icml06.pdf, have indicated that NB has did not have a good reputation in comparison to different modern methods. Moreover, since 2006 many new algorithms have appeared. Therefore, the application of this method in computer science literature is rather limited. To this extent the conclusion of the authors that NB performs similarly to other used approaches are unexpected. I believe that it is a result of a specific selection of the studies used in this comparison.

Response: We believe that we have defined this terminology well enough to be able to use the term “Bayesian” as shorthand notation. The method we describe is actually the Laplacian-corrected naive Bayesian, which is unwieldy. We introduce the difference in some detail, and why we have followed previous cheminformaticians in favouring this variant: it is highly amenable to the use of thousands of structure-derived fingerprints, but it has significant drawbacks, one of them being that the result is not a probability, unlike versions such as the standard naive Bayesian approach. We have devoted a significant amount of discussion to this, and do not believe that any more is required. Our previous papers cited in this manuscript describe numerous examples of comparing Bayesian versus SVM versus Trees; in all cases we have seen little difference using the exact same molecular descriptors when comparing the ROC for test sets (leave out groups or external).
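
To make the distinction concrete, here is a minimal sketch of a Laplacian-corrected naive Bayesian in the form commonly given in the cheminformatics literature: each fingerprint feature contributes log((A_f + 1) / (T_f * P + 1)), where T_f is the number of training molecules containing the feature, A_f is the number of those that are active, and P is the overall fraction of actives. The function names and toy data are illustrative assumptions; this is not the CDK reference implementation.

```python
import math
from collections import Counter


def train(fingerprints, actives):
    """Build a feature -> contribution table from sets of feature
    hash codes and a parallel list of active/inactive flags."""
    base_rate = sum(actives) / len(fingerprints)  # P: fraction of actives
    total = Counter()   # T_f: molecules containing feature f
    active = Counter()  # A_f: active molecules containing feature f
    for fp, is_active in zip(fingerprints, actives):
        for feature in fp:
            total[feature] += 1
            if is_active:
                active[feature] += 1
    return {
        f: math.log((active[f] + 1) / (total[f] * base_rate + 1))
        for f in total
    }


def score(model, fp):
    """Sum of contributions; features unseen in training contribute 0.
    The result is an unbounded score, not a probability."""
    return sum(model.get(f, 0.0) for f in fp)


# Toy training set: small integers stand in for fingerprint hash codes.
fps = [{1, 2}, {1, 3}, {2, 3}, {3, 4}]
flags = [True, True, False, False]
model = train(fps, flags)
raw = score(model, {1, 99})
```

In this toy set feature 1 occurs only in actives, so it gets a positive contribution of log(1.5), while feature 2 occurs equally in both classes and contributes exactly zero. The scores are useful for ranking, but their scale depends on the training set.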

We included 3 references to describe Laplacian-corrected naive Bayesian:
Klon, A. E.; Lowrie, J. F.; Diller, D. J., Improved naive Bayesian modeling of numerical data for absorption, distribution, metabolism and excretion (ADME) property prediction. Journal of chemical information and modeling 2006, 46, 1945-56.
Rogers, D.; Brown, R. D.; Hahn, M., Using extended-connectivity fingerprints with Laplacian-modified Bayesian analysis in high-throughput screening follow-up. Journal of biomolecular screening 2005, 10, 682-6.
Chen, B.; Sheridan, R. P.; Hornak, V.; Voigt, J. H., Comparison of random forest and Pipeline Pilot Naive Bayes in prospective QSAR predictions. Journal of chemical information and modeling 2012, 52, 792-803.

Indeed, while the authors provide comparison of some models to previous results, they do it almost exclusively using models from their own publications. Moreover, some of these publications were review articles (e.g., ref 108), which thus may have a limited value in terms of achieved accuracies.

Response: Not all of our previously cited papers were reviews; we use ref 108 (a review) to simplify referencing the earlier papers it describes. In this manuscript we describe numerous examples of comparing Bayesian versus SVM versus Trees; in all cases we have seen little difference using the exact same molecular descriptors. Our aim in making the comparison is to show that the ROC values (n-fold validation) in the current study are similar to those obtained in our earlier studies.

This is on my opinion is not sufficient. For several datasets, e.g. AMES mutagenicity, there are multiple benchmarking studies. A proper comparison of the performance of the proposed methodology to these results (using similar test protocols as provided within these studies) is required to prove the claim of the authors that models developed with NB are of similar quality as compared to other methods.

Response: We respectfully disagree that our validation is insufficient. In a previous publication we described the implementation of ECFP6/FCFP6 fingerprints for use in the CDK toolkit, which we performed with the intention of matching the efficacy of the original implementation that was designed by SciTegic and published. While details were withheld by SciTegic, we have previously established that our implementation has equivalent performance. Since we are using the exact same algorithm for deriving Bayesian models, it is hardly a stretch to expect that the Bayesian models we created would also perform similarly well, and we have presented a number of examples to indicate this is the case. We have used the same method in many other cases which are not (yet) published and found this to be the case also. Readers are also able to confirm this for themselves using the open source implementation. In short, we believe our claims are valid and come with plenty of supporting evidence. The goal of paper 1 was to show that our method with an open algorithm and descriptors could reproduce similar statistics for datasets which we have used previously. We believe it is acceptable to use our past datasets for this; many are published by others and prior papers provide this information. Our goal was not to focus on benchmarking of any one dataset (see comments above).

If this is not the case, I do not really see the advantages of development and sharing the NB models. The scientists will be doing this using the best available approaches notwithstanding whether they use public or commercial software. Indeed, the economic gain by applying most predictive algorithms can provide much better cost savings compared to the use of the models with lower prediction ability. This economic gain can be much higher than the software costs. Moreover, the problem with model sharing in many cases is not limited by the availability of open software or open descriptors. It is more related to problems with IP issues and data security.

Response: We have previously shown there is no significant difference between costly commercial software and using open source descriptors and algorithms with very large datasets at Pfizer and extensive external testing (Gupta et al., 2010 paper). We find the reviewer’s comments to be quite perplexing. If the reviewer means to say that there is zero value in creating an open source freely sharable implementation that can be easily used by any scientist on virtually any platform, when there are currently only expensive & proprietary products that are unavailable to all but a few… then it is hard to know where to start with rebutting this. Suffice it to say that we know from our own experience that this is simply not the case. Bayesian methods based on circular fingerprints are extremely useful (as we and many others have attested in the literature), and putting them in the hands of everyone who could possibly benefit from them has value that is self-evident, to say the least. We are also interested in the possibility of making other methods available in the same way. Making it easy to share the resulting models is a major theme of our work of late, and we are pursuing this goal as far as our resources allow. We have discussed some of the caveats of potentially revealing information about the molecules used to create the models, in order to provide the users with the ability to make an informed decision about IP protection. The benefits of sharing are numerous. Part of the challenge is that scientists create models and publish them in a way that does not make them accessible to others; we want to try to change that, and time will tell how much impact our efforts will have. Our motivation is not economic gain but scientific gain. Models can be shared securely in CDD or they can be made completely or partially open. Each use depends on the needs of the user for the particular project.

Last, but not lest, all data used in this study should be supplied together with the article (as zipped files with chemical structures, names of molecules and activity data; the original data sources can be also included.). The indicated links do not provide an access to all datasets (i.e., registration is required for some sets). This will be required to allow the readers to re-use them in other studies.

Response: As we have described under data and materials availability: ‘All computational models are available from the authors upon request. All molecules for malaria, tuberculosis and cholera datasets from Table 1 are available in CDD Public (https://app.collaborativedrug.com/register) and the models from Table 2 are available from (http://molsync.com/bayesian1).’

The CDD Public data from Table 1 is readily accessible after registering. Most of the datasets in Table 1 have already been published and made available by others (see citations). We include just one proprietary dataset (Caco-2). If scientists need access to the other datasets they can request them from us. We are not aware of any journal requirement to make all data open. Clearly drug companies that publish data in JCIM do not do this frequently.

Since the content from paper 2 represents a rather large fraction of ChEMBL, it would be antisocial to include it as either a single file on the ACS server, or as thousands of smaller files. For this reason we prefer to host it ourselves (and make it accessible to the community) in a way that is more convenient to the reader.


Please make sure your COI statement appears in the manuscript:
“S.E. is a consultant for Collaborative Drug Discovery Inc. A.M.C. is the founder of Molecular Materials Informatics, Inc.”

Response: Yes, we included this.


Paper 2
Manuscript ID: ci-2015-00144w
Title: “Open Source Bayesian Models: II. Mining a “Big Dataset” to Create and Validate Models with ChEMBL”
Author(s): Clark, Alex; Ekins, Sean

Reviewer: 1

This paper is a companion to the software paper submitted in parallel. It describes the use the ChEMBL to test their two-state Bayesian classification described in the parallel paper.
The reasoning behind the study as well as the methodology is properly described and accessible, as is the extraction of the underlying data sets.
All models produced in this study are openly available, as is the software which has been integrated into the open source chemistry development kit (CDK).

Response: Thank you.

Reviewer: 2

Paper 2 (ci-2015-00144w) Open Source Bayesian Models: II. Mining a “Big Dataset” to Create and Validate Models with ChEMBL:

p.20, l.38: “independent not order dependent” – strangely phrased.

Response: Corrected.

p.23, l.23: “The first limit clause restricts to any of the assay identifiers for the block, which varies from one to thousands.” – Unclear phrasing and/or mangled grammar: Restricts what? And what varies from one to thousands?

Response: Corrected.

p.26, l.1ff: What software and method was used for this analysis and the plots?

Response: Analysis has been done by software described in these two manuscripts. Plots were created using original software (which we do not describe, since it is not novel and was created only to support the manuscript).

p.33, l.15: “described by Keiser et al., 80.” – Screwed-up punctuation. Or a sentence part missing?

Response: We have changed this as follows: “Similarity ensemble analysis (SEA) was described by Keiser et al., 80 which used 246 targets and 65,241 molecules and the Tanimoto similarity was compared for each pair of molecules. This approach was used to identify new targets for several known drugs that were not expected.”

p.33, l.34: “These had correct […]” – better: “These models…”

Response: Corrected.

p.33, l.37: “build models for adverse drug reactions these in turn” – comma (or semicolon or even full stop) missing before “these”.

Response: Corrected.

p.33, l.46: “Natives Bayesian” – I am pretty sure the authors meant “Naïve Bayesian.”

Response: Corrected.

p.33, l.48: “It was however shown that combining HTS a fingerprints […]” – Mangled sentence.

Response: Corrected.

p.34, l.25: “over 1800 molecules tested against over 800 molecules” – this makes no sense.

Response: Corrected. It should be 800 end points/assays.

p.36, l.8: “In this case, secure collaborative software would be used to transfer and run the model.” – Too much advertisement for CDD.

Response: We are stating a fact: if IP were to be maintained, it would have to happen in a secure environment. We do not mention CDD explicitly.

p.38, sections Author Contributions, Conflicts of Interest, and Acknowledgments: Punctuation and name abbreviation issues (SE vs. S.E. etc.).

Response: Corrected.

Reviewer: 3

The authors have extracted and analyzed datasets extracted from ChemBL database using naïve bayes classifier. They tried to develop a threshold schema to separate quantitative data on classes of active and inactive compounds and made developed models and associated data available for download by the external users.

The mapping of naïve bayes scores to probability estimation is well known in the computer science literature, which has been addressed more than a decade ago, see e.g. http://www.research.ibm.com/people/z/zadrozny/kdd2002-Transf.pdf. I do not see a reason to develop “yet another” algorithm without providing a correct benchmarking and comparison of it with the previous studies.

Response: The reviewer has not taken into account the fact that we are describing the Laplacian-corrected variant of the naive Bayesian method. The references given refer to the conventional form, which is more popular outside of cheminformatics (which usually does not have to deal with thousands of fingerprints), and as such describe the process of normalizing values that are already formally probabilities in the 0..1 range. The method that we have adopted generates values with arbitrary scale and range, and so this limits the extent to which they can be interpreted. The raw values are suitable for ordering, but little else. We are not aware of previously disclosed methods for transforming these arbitrarily scaled values into a “probability-like” range, and we deem this to be of some value to cheminformatics. We have also described these issues in considerable detail in the text, and do not believe that any further discussion is necessary.

The authors should not substitute term “Bayesian models” with “Naive Bayes models”. NB is based on very strong assumptions about the statistical properties of descriptors and does not provide optimal models as full Bayesian modeling.

Response: As previously noted, we have used the term “Bayesian” as shorthand for “Laplacian-corrected naive Bayesian”, after having introduced the term. In the interests of literary quality, we have kept to this convention.

The article does not have a result section. It starts with description of data preparation, which belongs to the Data section. Actually, there is no need to specify sql queries used to extract data. Such technical information can be better published as supplementary materials or just skipped.

Response: We respectfully disagree. We have formatted the manuscript in a way that we believe serves the casual reader as well as anyone studying it in detail, and have provided all of the content categories that are expected of a research paper. While migrating the SQL queries to supplementary information would not be a dealbreaker for the overall value of the manuscript, we believe that it is useful to communicate to readers what work is required in order to transmute the data source into something that is immediately useful. Some readers may be under the impression that it is much easier or much harder than it really is, and anyone who is familiar with data processing methods would find it valuable. For this reason, we have retained this section of the manuscript as is.

The IC50 values used by the authors were collected from different articles, which were based on different experimental conditions. The authors should provide some arguments and discussions how the use of different experimental conditions affects the results and why such different data can be merged together.

Response: In some cases the ChEMBL data are also from a single lab, so it depends on the dataset. We agree one would expect some interlab variability when the data comes from more than one laboratory. In this work, which is focused on method development, we have “passed the buck” to the ChEMBL team. We have explained in detail (hence the SQL queries) how we have chosen to assimilate values with the same target/assay types. To the extent that they are incompatible, this is a decision that was made by the curators of ChEMBL. To scientists using this data for prospective studies, it is up to them to decide whether it was reasonable for us to assume that the ChEMBL curation is good enough. We do not argue that this core assumption is appropriate for all drug discovery scenarios, but we do demonstrate our claim that by doing so, it is possible to produce a large number of models with an entirely automated method, and that this is of interest to the greater community. Whether the generally-high ROC values are indicative of high compatibility of data from different labs is not something we claim to have proven. However, the fact that there is sufficient high quality open data – and now methods – for creating well-performing models for many hundreds of biorelevant targets with adequate model sizes is in itself very interesting, and in our opinion, well worth sharing with the community.

I do not understand the arguments about the need to develop sophisticated algorithms to select a threshold for classification models. The regression models are much better suitable to work with quantitative models. If only a classification is required, the selection of a threshold depends on the intended use of the results (e.g., models developed for screening of new compounds with 10mM and 1µM should be based on the appropriate thresholds for the activity data). Because of these two arguments, I could not really follow the logic and need to design some new criteria for separation of active and inactive compounds. Again, this part belongs to the methodological part of the article.

Response: We may have erred on the side of assuming that this concept is familiar to all cheminformaticians involved in drug discovery, though we believe it is reasonable given the readership of the journal. In order to build a model based on 2-state classifications, it is necessary to have data that is classified as one of two states. Since bioassay data is typically given as continuous values, often in concentration units, the easiest way to do this is to define a threshold. The choice of threshold varies considerably depending on the circumstances, e.g. for some targets, only strong binders are interesting, while for other cases, the available data may not include many/any strong binders, and so a lower threshold is appropriate. The best choice is not necessarily obvious. When a handful of models is being considered, contextual scientific knowledge is usually available, with manual trial-and-error as a fallback, but for thousands of models, this represents a major scaling issue. It is our belief that most of the readers to whom this article ought to appeal will be familiar with this concept, and that the explanation we have provided in the manuscript is sufficient.
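
The thresholding step described in this response can be sketched in a few lines. The example assumes activities are expressed on a -log10 molar scale (such as pIC50), so that higher means more potent, and it uses a deliberately naive rule for choosing a threshold automatically: pick the candidate whose active fraction is closest to a target. That rule is a placeholder for illustration only and is not the selection method described in paper 2; both function names are hypothetical.

```python
def classify(values, threshold):
    """Two-state labels: True (active) when the value meets the threshold."""
    return [v >= threshold for v in values]


def pick_threshold(values, candidates, target_fraction=0.5):
    """Naive automatic choice (assumption): the candidate threshold whose
    fraction of actives is closest to target_fraction."""
    def active_fraction(threshold):
        return sum(v >= threshold for v in values) / len(values)

    return min(candidates, key=lambda t: abs(active_fraction(t) - target_fraction))


# Toy pIC50-style data for one target.
activities = [4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
chosen = pick_threshold(activities, candidates=[5, 6, 7, 8])
labels = classify(activities, chosen)
```

With a handful of datasets the threshold would normally come from scientific context. An automatic rule of some kind only becomes necessary when the same step has to be repeated across thousands of datasets.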

Thus, actually I did not find what are the results of this study and what is their value? Who will be the potential users of the developed models and how can be these results used? Unfortunately, the article does not have a clear answer to this question.

Response: We have spent some time describing the possibilities that arise from having thousands of models for bio-relevant targets based on high quality open data. We believe that this should be largely self-evident to anyone who is working in the drug discovery industry: having a model for almost every drug target conveniently on hand, and freely available, is transformative and quite different from the status quo. As we have described in the discussion, while others have built Bayesian models for multiple targets, none has considered the scale of what we have demonstrated with the ChEMBL data – namely over 2000 classification models. While we make no claims to the effect that these models are completely ready to be used directly for prospective drug discovery campaigns, it is a major step in the direction of creating large collections of models, and should be of very broad interest and applicable to other data collections. From experience and collaborations, we have already identified academic and commercial organizations that would benefit from the models, and fully intend to follow up any interesting results with disclosure in the literature.
