SMILES Toxicity Prediction

Data 2040 Final Project
April 20, 2021

Part Three
By Cameron Webster, Jillian Green, Chris Rohlicek, and Akshay Shah

We are graduate students in the Data Science Master’s program at Brown University. As part of our Deep Learning and Special Topics in Data Science course (Data 2040), we worked on processing molecular compounds using sequence models.

Introduction

Some of the greatest recent strides in machine learning research have come in natural language processing. These advances have arrived in the form of technologies like transformers, attention mechanisms, and context-based encoding schemes that learn high-level representations of the abstract concepts conveyed in language. While language gets most of the attention when these exciting developments are discussed, chemistry has also been an incredibly active area of research for these new methods. The connection between language and chemistry may not be obvious at first, but the door to a rich field of research opens once we consider that a molecule is just like a sentence written in the language of atoms and bonds.

In this project we explored a few methods of processing molecular compounds and predicting their properties. The compounds we used are encoded in the SMILES format, and the property we're predicting is activity on the SR-p53 target, which is commonly used as a proxy for the general toxicity of a compound.

Figure 1: Visual representations of SMILES. Top left: CC(C)=CCO, top right: CNC(C)(C)Cc1ccccc1.CNC(C)(C)Cc1ccccc1.O=S(=O)(O)O, bottom left: CCCCCCCCN1CCCC1=O, bottom right: CCC(C)(C)OC(=O)N[C@@H](Cc1c[nH]c2ccccc12)C(=O)N[C@@H](CCSC)C(=O)N[C@@H](CC(=O)O)C(=O)N[C@@H](Cc1ccccc1)C(N)=O.

The literature on this type of problem is incredibly varied, but some of the most common approaches center on RNNs and LSTMs used much as they are in standard NLP tasks, and on CNNs applied to 2D representations of the molecules. More recent work applies transformers to SMILES sequences in the form of modified BERT models trained on massive chemical datasets.

Our goal in this project is to investigate these methods and to experiment with a model that combines the different strengths of the CNN-, LSTM-, and BERT-based approaches to predict molecule toxicity.

Project Recap

In our first blog we focused on Exploratory Data Analysis (EDA) and our baseline model. Most notably, EDA showed that carbon is the most common token in the dataset and that the distribution of sequence lengths is centered around 20 tokens. We trained a SimpleRNN baseline model on the SMILES strings and, using AUROC as our metric, achieved a validation score of 0.5, i.e. no better than a random classifier.

In our second blog we implemented a transfer learning model (InceptionResNetV2) from the image-processing perspective. Our base model was instantiated with pretrained ImageNet weights and took a set of 2D image representations of our molecules as input. On top of the base model we added a GlobalAveragePooling2D layer and a Dense layer. After training for 3 epochs, our model achieved an AUROC of 0.783.

In this blog post we focus on improving each model and experimenting with a technique called ensembling to combine our two models for further improvement.

Improvements & Attempted Ideas

Our RNN baseline consisted of a single Keras SimpleRNN layer with 100 hidden units, fed a one-hot-encoded representation of the inputs passed through a Keras Embedding layer. One of the most important improvements to this architecture was replacing that embedded one-hot representation with a richer tokenization of the SMILES sequences feeding the sequence component of the model.

Two major disadvantages of one-hot-encoded sequence representations are the sparsity of the input tensors and the inability to capture relationships between SMILES tokens. To give our model a more detailed representation of these relationships, we explored pre-training Bidirectional Encoder Representations from Transformers (BERT) on our input. As the name implies, BERT is a self-supervised model that stacks several layers of Transformer blocks and is trained to predict masked tokens (masked language modeling) and whether one sequence follows another (next sentence prediction). Not only is BERT superior to recurrent models because it can be parallelized, it is also able to capture relationships across the entire sequence, since the transformer's attention mechanism can attend to all tokens in a given sequence. Below is an example of the attention mechanism of a transformer block.
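To make the idea concrete, here is a minimal, illustrative sketch of single-head scaled dot-product attention in NumPy (not our model code): each token's query is scored against every other token's key, and the resulting weights mix the value vectors into a contextualized representation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal single-head attention: every query attends to every key."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n_queries, n_keys) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                # weighted sum of value vectors

# Toy example: 4 tokens with 8-dimensional embeddings and random projections.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_attention(tokens @ W_q, tokens @ W_k, tokens @ W_v)
print(out.shape)  # (4, 8): one contextualized vector per token
```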

Figure 2: Example of the Attention Mechanism of a Transformer Block [1].

Initially, we proposed pre-training our own BERT model from scratch on a separate unlabeled set of SMILES sequences. However, given the computational demands of training such a model on the massive dataset it would require, we opted instead to leverage ChemBERTa, a pre-trained, SMILES-specific model from Seyone Chithrananda's profile on HuggingFace that adapts the improved RoBERTa architecture.
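Loading such a tokenizer from HuggingFace takes only a few lines. The sketch below is illustrative rather than our exact pipeline, and the checkpoint name is just one publicly available ChemBERTa model:

```python
from transformers import AutoTokenizer

# Example ChemBERTa checkpoint; the exact checkpoint used may differ.
tokenizer = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")

smiles = "CC(C)=CCO"  # one of the molecules shown in Figure 1
encoded = tokenizer(
    smiles,
    padding="max_length",   # pad every sequence to a fixed length
    truncation=True,
    max_length=128,
    return_tensors="np",    # NumPy arrays are easy to feed into a Keras model
)
print(encoded["input_ids"].shape)   # (1, 128) token ids
print(tokenizer.tokenize(smiles))   # the sub-token pieces the model actually sees
```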

Unlike natural language, SMILES strings have structural properties that map each sequence to a 2D molecular structure. For this reason, we chose to switch from a single uni-directional SimpleRNN layer to multiple bi-directional LSTM layers. Additionally, we included a 1D convolutional layer after the LSTM layers, before the final softmax-activated output, in order to capture more of the local structure of the sequence.

Image Model

Shown below is our InceptionResNetV2 model from the second blog post, which we also used in our final experiments.

Figure 3: Image Model Architecture.

Because of the good results we got with the architecture above, we didn't make any structural changes. However, we did make changes to our data to resolve a few issues involving duplicated samples. We trained this model for 30 epochs and achieved a peak validation AUROC of 0.787 and a test AUROC of 0.847.
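For reference, a minimal Keras sketch of this transfer-learning setup might look like the following; the input size, the frozen backbone, and the single sigmoid output are illustrative assumptions rather than our exact configuration.

```python
from tensorflow import keras
from tensorflow.keras import layers

# InceptionResNetV2 backbone with pretrained ImageNet weights, frozen for transfer learning.
base = keras.applications.InceptionResNetV2(
    include_top=False, weights="imagenet", input_shape=(299, 299, 3)
)
base.trainable = False

inputs = keras.Input(shape=(299, 299, 3))           # 2D renderings of the molecules
x = base(inputs, training=False)
x = layers.GlobalAveragePooling2D()(x)               # collapse spatial feature maps
outputs = layers.Dense(1, activation="sigmoid")(x)   # binary SR-p53 toxicity score
image_model = keras.Model(inputs, outputs, name="image_model")

image_model.compile(
    optimizer=keras.optimizers.Adam(),
    loss="binary_crossentropy",
    metrics=[keras.metrics.AUC(name="auroc")],
)
```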

Figure 4: Image Model Train History.

Sequence Model

Two major improvements we made to our baseline SimpleRNN were the implementation of our ChemBERTa tokenizer for data pre-processing and the use of bi-directional LSTM layers.

Figure 5: Sequence Model Architecture.

After experimenting with various numbers of layers, different initializations, learning rates, and optimizers, we settled on the architecture above, using Glorot uniform initialization, an initial learning rate of 0.01, and the Adam optimizer. We trained this model for 30 epochs and achieved a peak validation AUROC of 0.691.
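A minimal Keras sketch of this kind of sequence model is shown below; the layer sizes, vocabulary size, pooling choice, and sequence length are illustrative assumptions rather than our exact settings.

```python
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 600   # assumed size of the tokenizer vocabulary
MAX_LEN = 128      # assumed padded sequence length

inputs = keras.Input(shape=(MAX_LEN,), dtype="int32")           # ChemBERTa token ids
x = layers.Embedding(VOCAB_SIZE, 64)(inputs)                    # dense token embeddings
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
x = layers.Conv1D(32, kernel_size=3, activation="relu",         # local structure of the sequence
                  kernel_initializer="glorot_uniform")(x)
x = layers.GlobalMaxPooling1D()(x)
outputs = layers.Dense(2, activation="softmax")(x)              # softmax-activated output
sequence_model = keras.Model(inputs, outputs, name="sequence_model")

sequence_model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.01),
    loss="binary_crossentropy",
    metrics=[keras.metrics.AUC(name="auroc")],
)
```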

Figure 6: Sequence Model Train History.

Composite Model

The final step in our experimentation was to create a composite model that combines the sequence and image feature embeddings by concatenating them and feeding them through two dense layers. This model is a binary classifier of SR-p53 activity and takes two separate inputs: the tokenized SMILES sequences and the 2D image representations of the SMILES compounds. To combine the two models, we stripped off the original models' final dense layers and used a global average pooling layer to achieve matching shapes. After the concatenation, two more dense layers reduce the final output to shape (None, 2) for binary classification. The layer 'dense_1' uses a ReLU activation and l2 regularization; we included l2 to help normalize the ensembling process by making sure the weights of one model do not overshadow the other.

Figure 7: Composite Model Architecture.
Figure 8: Composite Model Diagram.
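Assuming the image_model and sequence_model sketched above, the composite wiring looks roughly like this; the dense layer size and the regularization strength are illustrative assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

# Re-use the two trained sub-models as feature extractors: drop their final
# dense layers and keep the pooled embeddings.
image_features = keras.Model(
    image_model.input, image_model.layers[-2].output, name="image_features"
)
sequence_features = keras.Model(
    sequence_model.input, sequence_model.layers[-2].output, name="sequence_features"
)

image_in = keras.Input(shape=image_model.input_shape[1:], name="molecule_image")
tokens_in = keras.Input(shape=sequence_model.input_shape[1:], dtype="int32",
                        name="smiles_tokens")

merged = layers.Concatenate()([image_features(image_in), sequence_features(tokens_in)])
x = layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-3),  # keep either branch from dominating
                 name="dense_1")(merged)
outputs = layers.Dense(2, activation="softmax")(x)           # shape (None, 2)

composite_model = keras.Model([image_in, tokens_in], outputs, name="composite_model")
```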

Our model parameters include the Adam optimizer with a learning rate of 0.005, binary cross-entropy loss, and AUROC as the evaluation metric. We use a smaller initial learning rate when training this model because the procedure is essentially a fine-tuning process on the two constituent models. AUROC is a common metric in these compound-property prediction problems because the data is almost always highly imbalanced, so AUROC gives the best picture of how well our model separates the classes. We included a ReduceLROnPlateau learning rate scheduler monitoring validation AUROC, and early stopping with a patience of 4, also monitoring validation AUROC. The model was fit for 10 epochs with a batch size of 32. The composite model achieved a peak validation AUROC of 0.942 and a test score of 0.897. These values support our hypothesis that combining models creates a better classifier, but they are also substantially higher than the state of the art, so it is very likely they are the result of a data issue somewhere in the pipeline.
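In Keras terms, the training setup described above corresponds roughly to the following; the callback factors and the training/validation array names are hypothetical placeholders, not our actual variables.

```python
from tensorflow import keras

composite_model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.005),
    loss="binary_crossentropy",
    metrics=[keras.metrics.AUC(name="auroc")],
)

callbacks = [
    # Reduce the learning rate when validation AUROC stops improving.
    keras.callbacks.ReduceLROnPlateau(monitor="val_auroc", mode="max",
                                      factor=0.5, patience=2),
    # Stop training after 4 epochs with no validation-AUROC improvement.
    keras.callbacks.EarlyStopping(monitor="val_auroc", mode="max",
                                  patience=4, restore_best_weights=True),
]

history = composite_model.fit(
    [train_images, train_tokens], train_labels,        # hypothetical array names
    validation_data=([val_images, val_tokens], val_labels),
    epochs=10,
    batch_size=32,
    callbacks=callbacks,
)
```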

Figure 9: Composite Model Training History.

Interpretations of Final Model

At first glance, we were very excited to see a 0.942 AUROC! But after inspecting various parts of our pipeline to check the validity of this result, we pinpointed multiple irregularities in the tail-end of our experiments. After spending many hours investigating these issues, we are still not confident enough to report this as our main model (see next section for more on this!). Therefore, we will interpret our InceptionResNetV2 as our final model.

Shown below is the ROC curve for our image model; from the comparison with the diagonal baseline curve, we see that our classifier has learned to discern the difference between the two classes.

Figure 10: Receiver Operating Characteristic Curve (ROC) for CNN Model.

… and we created a confusion matrix to confirm that our model is not simply predicting the majority class, which, as we saw with our RNN baseline, can be tempting behavior for a classifier on an imbalanced dataset.

Figure 11: Confusion Matrix for CNN model.
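Both diagnostics are straightforward to produce with scikit-learn; the sketch below uses small hypothetical arrays standing in for our held-out labels and the image model's predicted probabilities.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score, confusion_matrix

# y_true: 0/1 toxicity labels; y_prob: predicted probability of the toxic class
# (hypothetical values standing in for the test set and model output).
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.8, 0.2, 0.6, 0.9, 0.3, 0.7])

fpr, tpr, _ = roc_curve(y_true, y_prob)
print("AUROC:", roc_auc_score(y_true, y_prob))

plt.plot(fpr, tpr, label="image model")
plt.plot([0, 1], [0, 1], linestyle="--", label="random baseline")  # diagonal reference
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()

# Threshold the probabilities at 0.5 to inspect the confusion matrix.
print(confusion_matrix(y_true, (y_prob >= 0.5).astype(int)))
```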

If We Had More Time

We’re satisfied with how our project went as an exploration of some of the state-of-the-art methods for this type of problem and as a proof of concept for an interesting combination of a few of them, but with more time we would have loved to dive deeper into getting the most out of these models. Because of time limitations we had to cut a lot of model-specific optimization short, so the experiments we conducted could well have reached higher marks had we been able to spend more time on the individual architectures and hyperparameters. Beyond this, we would also have experimented with different ways of combining the two models — we chose our approach as an extension of the typical ensembling strategies that allows a little more collaboration between the models, but there are many other ways of combining these representations that could have lent fascinating insight into the problem.

As far as our attempted composite model goes, we would have used more time to get to the bottom of a potential data issue that impacted our results. After many hours of inspecting our data pipeline, we felt like we were very close to discovering the cause of the problem, but unfortunately our deadline caught up to us.

Takeaways & What We Learned

During this project we were introduced to a hands-on application of machine learning in an exciting area of active research, and we were challenged to research current evaluation methods and apply them in our own way. While we came away with more questions than we started with, the project was an enlightening exercise in the creative development of novel solutions to real-world problems.

This project exemplified the importance of preprocessing and how different preprocessing methods can impact a model’s results. The most notable conceptual insight we gained was that image representations of a molecule encode the relevant feature information in a way that is more easily accessible than in the sequence format. However, because the two formats are so different, we think there is certainly some way of combining the two molecule representations to develop a richer feature space.

For more information, check out our code in our GitHub repository, our final paper, or a recording of our project.

Thanks for reading!
