Internship Title
Empowering Machine Learning Potential Models with Explainable Artificial Intelligence.
Specialty Areas
Intelligent Systems
Internship Location
Centre for Informatics and Systems of the University of Coimbra (CISUC), at the Department of Informatics Engineering of the University of Coimbra, and at the Institute for Chemical and Bioengineering, at the Department of Chemistry and Applied Biosciences of the Eidgenössische Technische Hochschule Zürich (ETH Zurich)
Background
For centuries, chemistry has increased wealth and quality of life, for example by introducing new materials and developing new medicines. With growing concerns about the environment and the impact of pollution, chemists have shifted their focus to sustainability. In 1998, Anastas and Warner [1] introduced the 12 principles of green chemistry. One of these principles emphasizes the importance of developing more efficient syntheses, a critical step towards sustainable chemistry.
Artificial intelligence (AI) has emerged as an important instrument for achieving a sustainable future. AI has huge potential to accelerate scientific discovery by extracting insight from historical data and guiding the experiments we perform in the lab. Within the broader field of AI, Machine Learning (ML) focuses on building models that learn patterns from data. ML techniques have proven useful across many areas of science and engineering. In chemistry, ML is being used to help develop new chemical products such as catalysts or drugs and to predict molecular properties, such as energy, among many other applications [2][3]. Depending on the model's purpose, there are many nuances that need to be carefully addressed, for instance how to represent an input, e.g. a molecule, in a form that a computer can learn from. Moreover, ML models need large amounts of data to achieve good inference accuracy, and they are highly dependent on the dataset they were trained on. Yet, when it comes to applying AI to chemistry, data is often scarce and/or difficult to access. There is therefore a need for models that can be trained with less data and that can extrapolate to domains beyond those they have already seen [4]. Furthermore, there has been considerable discussion about how much information is available on how models function: beyond efficient performance, there is an increasing need for transparency in the models' underlying logic.
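To illustrate the representation problem mentioned above, the sketch below (assuming the open-source RDKit package; the molecule and fingerprint settings are arbitrary choices made here for illustration) turns a SMILES string into a fixed-length bit vector that a standard ML model can consume.

```python
# Minimal sketch (assumes the open-source RDKit package): turning a molecule
# into a fixed-length numerical vector that an ML model can learn from.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CCO")  # ethanol, written as a SMILES string

# Morgan (ECFP-like) fingerprint: a 2048-bit vector encoding local atomic environments.
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)

features = np.zeros((2048,))
DataStructs.ConvertToNumpyArray(fp, features)  # feature vector for a standard ML model
print(features.shape, int(features.sum()))     # vector length and number of set bits
```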
One application of ML in chemistry that has received much attention recently is the development of empirical models of the potential-energy surface (PES). With accurate ML models of the PES, molecular properties relevant to, e.g., catalyst efficiency or drug activity can be simulated on the computer with unprecedented speed. In 2007, Behler and Parrinello [5] introduced an approach for building PESs with ab initio accuracy using Neural Network Potentials (NNPs). In more recent work, Smith et al. [6] proposed a new architecture (Accurate Neural network engIne for molecular energies, or ANI) that is transferable between different chemical environments. Their models have shown extraordinary performance in predicting the energy of organic molecules.
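To make the idea of an ML potential concrete, the following sketch uses TorchANI, an open-source PyTorch implementation of the ANI models (assumed available; the pretrained ANI-2x model and the approximate methane geometry are illustrative choices, not necessarily the setup that will be used in this work). A pretrained potential maps atomic numbers and coordinates to a total energy, and forces follow from automatic differentiation.

```python
# Minimal sketch using the open-source TorchANI package (assumed available):
# a pretrained ANI potential maps atomic numbers + coordinates to an energy.
import torch
import torchani

model = torchani.models.ANI2x(periodic_table_index=True)

# Methane: atomic numbers and Cartesian coordinates in Angstrom (approximate geometry).
species = torch.tensor([[6, 1, 1, 1, 1]])
coordinates = torch.tensor([[[ 0.000,  0.000,  0.000],
                             [ 0.629,  0.629,  0.629],
                             [-0.629, -0.629,  0.629],
                             [ 0.629, -0.629, -0.629],
                             [-0.629,  0.629, -0.629]]], requires_grad=True)

energy = model((species, coordinates)).energies               # total energy in Hartree
forces = -torch.autograd.grad(energy.sum(), coordinates)[0]   # forces via autograd
print(float(energy), forces.shape)
```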
Since its introduction, the ANI architecture has been used and improved by many other authors. Zhang et al. [7] developed an ANI derivative (ANI-1xnr) for modeling chemical reactions. Although the ANI method has shown good performance on molecules similar to those it was trained on, it has struggled to extrapolate to new molecules or bonding patterns without extensive retraining. In the best case, the method would learn the deeper principles of chemistry; the risk is that it resorts to simple pattern matching [8]. We therefore need to investigate these ML potentials to understand how and what they learn, so that we can improve them. One way to do this is by applying techniques from the emerging field of Explainable AI (XAI) [9]. XAI has presented itself as a new paradigm of AI: in addition to models with great accuracy and extrapolation capacity, people want to know why the models arrived at the results they did. In this regard, Miller [10] distinguishes between interpretability, justification, and explanation. Interpretability concerns the capacity of an observer to understand the cause of a decision; justification concerns why that decision is good; and explanation involves presenting information to a human that contains the context and cause of a given result. Creating explainable models is important to build trust and to increase acceptance among the people who will use them, and it also helps us understand what the models are doing so that we can improve them. Available techniques include SHAP and LIME [11], [12]. SHAP, for instance, shows which input features contributed most to the model's output. In chemistry, a useful XAI approach is the use of heatmaps, which can visually explain the contribution of each atom or each group, depending on how the heatmap is constructed. Recently, Rasmussen et al. [13] used such techniques to compare the results obtained with Random Forest models against Crippen logP values, taken as ground truth.
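As a concrete example of the heatmap approach, the sketch below (assuming the open-source RDKit package; the molecule is an arbitrary illustrative choice, and the exact signature of the drawing helper may differ between RDKit versions) computes Crippen's per-atom logP contributions, the same interpretable ground truth used by Rasmussen et al. [13], and renders them as an atom-coloured map.

```python
# Sketch of an atom-contribution heatmap (assumes RDKit). Crippen's logP is a sum
# of per-atom contributions, so it provides an interpretable ground truth against
# which XAI heatmaps can be compared, as in Rasmussen et al. [13].
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
from rdkit.Chem.Draw import SimilarityMaps

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, as an illustrative example

# Per-atom (logP, molar refractivity) contributions; keep only the logP part.
contribs = rdMolDescriptors._CalcCrippenContribs(mol)
atom_logp = [logp for logp, _mr in contribs]

# Colour each atom by its contribution: a visual "explanation" of the prediction.
# (The exact signature of this drawing helper may vary across RDKit versions.)
fig = SimilarityMaps.GetSimilarityMapFromWeights(mol, atom_logp)
fig.savefig("crippen_heatmap.png", bbox_inches="tight")
```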
Objective
The neural network potentials discussed above have demonstrated high accuracy when compared to ab initio methods. Yet, these models can be seen as black boxes, since humans cannot understand the logic behind the algorithm's predictions.
The aim of this master's thesis is to explore the explainability of state-of-the-art ML models used in chemistry. More specifically, the goal is to use the framework introduced by Rasmussen et al. [13] to make ML potentials, such as the ANI models, more amenable to human interpretation. In this way, we can better understand the logic behind the models and propose improvements. In this case, the models are used to predict molecular enthalpy, for which we can generate an interpretable ground-truth dataset. Based on this research aim, the following research objectives can be identified:
1. Prepare the ANI models for interpolation and extrapolation. This objective involves setting up the ANI models using two different approaches: one for interpolation using the QM9 dataset, and another for extrapolation using the ZINC dataset. The goal is to train the models and assess their ability to learn underlying chemistry versus relying on pattern matching.
2. Generate the Benson increments for the GDB-11 dataset. This objective involves using the dataset constructed by C. McGill and W. Green [14] to calculate the Benson increments as ground truth for molecular enthalpy. By comparing the explanations provided by the ANI models with the ground truth from the Benson increments, the extent to which the models capture the underlying chemistry can be evaluated.
3. Explore and compare various explainable artificial intelligence (XAI) techniques. This objective involves investigating different XAI techniques, such as heat maps, LIME (Local Interpretable Model-agnostic Explanations), and SHAP (Shapley Additive Explanations), to gain insights into the interpretability of the ANI models. The goal is to determine the most suitable technique and potentially develop improved techniques that align with the ANI models' characteristics.
4. Evaluate the interpretability of the ANI models using quantitative and qualitative measures (see the sketch after this list). This objective involves assessing the effectiveness of the implemented XAI techniques in providing interpretable explanations. It may involve quantitative metrics, such as interpretability scores, and qualitative evaluation through expert judgment and domain-specific knowledge to verify that the explanations align with known chemistry principles.
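One simple quantitative measure that could serve objective 4 is sketched below: correlate the per-group attributions produced by an XAI technique with the ground-truth Benson increments from objective 2. All numeric values are hypothetical placeholders, and the Pearson correlation is only one possible agreement metric.

```python
# Illustrative sketch of a quantitative interpretability check (objective 4):
# correlate per-group attributions from an XAI method with ground-truth Benson
# group increments (objective 2). All numbers below are hypothetical placeholders.
import numpy as np
from scipy.stats import pearsonr

# Ground-truth contributions to the enthalpy of formation, one value per group
# in a given molecule (e.g. Benson increments) -- placeholder values.
ground_truth = np.array([-10.2, -4.9, -4.9, -10.2])

# Attributions assigned to the same groups by an XAI technique (e.g. SHAP values
# aggregated over each group's atoms) -- placeholder values.
xai_attributions = np.array([-9.5, -5.6, -4.1, -11.0])

r, p = pearsonr(ground_truth, xai_attributions)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")  # r close to 1 means good agreement
```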
By addressing these research objectives, the study aims to contribute to the understanding and improvement of the interpretability of ANI models in chemistry by applying XAI techniques and assessing their effectiveness in providing clear and trustworthy explanations for users.
Work Plan - Semester 1
1- State of the art [Sept – Oct]
2- Problem statement, research aims, objectives, and questions [Nov]
3- Design and first implementation of the AI system [Nov – Jan]
4- Thesis proposal writing [Dec – Jan]
Work Plan - Semester 2
5- Improvement of the AI system [Feb – Apr]
6- Experimental Tests [Apr – May]
7- Paper writing [May – Jun]
8- Thesis writing [Feb – Jul]
Conditions
The work should take place both at the Centre for Informatics and Systems of the University of Coimbra (CISUC), Department of Informatics Engineering, University of Coimbra, and at the Institute for Chemical and Bioengineering, Department of Chemistry and Applied Biosciences, Eidgenössische Technische Hochschule Zürich (ETH Zurich).
Remarks
Advisors:
Luís Macedo, Department of Informatics Engineering, Faculty of Sciences and Technology, University of Coimbra
Kjell Jorner, Institute for Chemical and Bioengineering at the Department of Chemistry and Applied Biosciences, Eidgenössische Technische Hochschule Zürich (ETH).
References:
[1] P. T. Anastas and J. C. Warner, Green Chemistry: Theory and Practice. Oxford University Press, 2000. [Online]. Available: https://books.google.pt/books?id=_iMORRU42isC
[2] J. Seumer, J. K. S. Hansen, M. Brøndsted Nielsen, and J. H. Jensen, “Computational Evolution Of New Catalysts For The Morita–Baylis–Hillman Reaction,” Angewandte Chemie International Edition, Feb. 2023, doi: 10.1002/anie.202218565.
[3] A. Zhavoronkov et al., “Deep learning enables rapid identification of potent DDR1 kinase inhibitors,” Nat Biotechnol, vol. 37, no. 9, pp. 1038–1040, Sep. 2019, doi: 10.1038/s41587-019-0224-x.
[4] K. Jorner, “Putting Chemical Knowledge to Work in Machine Learning for Reactivity,” Chimia (Aarau), vol. 77, no. 1–2, pp. 22–30, 2023, doi: 10.2533/chimia.2023.22.
[5] J. Behler and M. Parrinello, “Generalized neural-network representation of high-dimensional potential-energy surfaces,” Phys Rev Lett, vol. 98, no. 14, Apr. 2007, doi: 10.1103/PhysRevLett.98.146401.
[6] J. S. Smith, O. Isayev, and A. E. Roitberg, “ANI-1: an extensible neural network potential with DFT accuracy at force field computational cost,” Chem Sci, vol. 8, no. 4, pp. 3192–3203, 2017, doi: 10.1039/C6SC05720A.
[7] S. Zhang et al., “Exploring the frontiers of chemistry with a general reactive machine learning potential.”
[8] J. A. Kammeraad, J. Goetz, E. A. Walker, A. Tewari, and P. M. Zimmerman, “What Does the Machine Learn? Knowledge Representations of Chemical Reactivity,” J Chem Inf Model, vol. 60, no. 3, pp. 1290–1301, Mar. 2020, doi: 10.1021/acs.jcim.9b00721.
[9] C. Molnar, Interpretable Machine Learning, 2nd ed. 2022. [Online]. Available: https://christophm.github.io/interpretable-ml-book
[10] T. Miller, “Explanation in artificial intelligence: Insights from the social sciences,” Artificial Intelligence, vol. 267. Elsevier B.V., pp. 1–38, Feb. 01, 2019. doi: 10.1016/j.artint.2018.07.007.
[11] S. M. Lundberg and S.-I. Lee, “A Unified Approach to Interpreting Model Predictions,” in Advances in Neural Information Processing Systems (NIPS), 2017. [Online]. Available: https://github.com/slundberg/shap
[12] M. T. Ribeiro, S. Singh, and C. Guestrin, “‘Why Should I Trust You?’ Explaining the Predictions of Any Classifier,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2016.
[13] M. H. Rasmussen, D. S. Christensen, and J. H. Jensen, “Do machines dream of atoms? Crippen’s logP as a quantitative molecular benchmark for explainable AI heatmaps,” 2022.
[14] E. Heid, C. J. McGill, F. H. Vermeire, and W. H. Green, “Characterizing Uncertainty in Machine Learning for Chemistry.”
Supervisor
Luís Macedo
macedo@dei.uc.pt