SMILES Toxicity Prediction
Part Two
By Cameron Webster, Jillian Green, Chris Rohlicek, and Akshay Shah
We are graduate students in the Data Science Master’s program at Brown University. As part of our Deep Learning and Special Topics in Data Science course (Data 2040), we worked on processing molecular compounds using sequence models.
A Brief Recap
In Blog 1 we discussed the dataset, EDA, and our baseline model. Our goal in this project is to find innovative ways of encoding SMILES representations of molecular compounds as sequences for predicting toxicity. The data consists of string representations of molecules, written in a format that encodes the atoms, bonds, and other components of a given compound. Previously, our SimpleRNN baseline model achieved a validation AUROC of 0.5, which is no better than a random classifier.
Introduction
During this phase of the project we implement a transfer learning model that tackles our problem from the image processing perspective. Inception-style networks apply filters of several different sizes in parallel, which lets them pick up structure at more than one scale, so we chose InceptionResNetV2 as our jumping-off point.
Why Implement a CNN?
At the core of our project, we’re hoping to find effective ways to create embeddings of our SMILES input sequences that can be used by our different neural networks. Last week we implemented an RNN baseline that treats each SMILES string as a sequence of tokens drawn from a hardcoded vocabulary, one that knows how SMILES encodes the different parts of a molecule. That approach has a lot of subtleties that we’ll return to in the final stage of our project. This week we took a detour to implement a different type of model that might lend some embedding insight: a convolutional neural network. Because CNNs are generally used for image processing, we prepared our data for this new approach by creating a matching image dataset of 2D representations of our molecules to go along with our dataset of SMILES sequences. The objective of the model is still to classify each molecule as positive or negative for SR-p53, our marker of toxicity, but the model now learns by extracting features from the molecules’ 2D structure rather than their sequential structure. The paper we used as inspiration can be found here: chemception.
Preprocessing Steps
We generated images of molecules using the SMILES strings from our training and testing sets.
First, the SMILES strings from the train and test datasets were converted into 2D colored images of the compounds using RDKit’s draw function. The images were then saved as PNG files and zipped. A mapping DataFrame was created to link each SMILES string and its target class to the associated PNG file name. See Figure 3 for sample images from our training set.
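The steps above can be sketched roughly as follows. This is a minimal illustration, not our exact script: the file naming scheme, the column names, the helper names (png_name, build_mapping, render_images), and the two example molecules with their labels are all hypothetical. The RDKit calls used are Chem.MolFromSmiles and Draw.MolToImage.

```python
from pathlib import Path


def png_name(index: int) -> str:
    """Hypothetical naming scheme: one zero-padded file name per molecule."""
    return f"mol_{index:05d}.png"


def build_mapping(smiles_list, labels):
    """Return one row per molecule linking SMILES, target class, and file name."""
    return [
        {"smiles": smi, "label": lab, "filename": png_name(i)}
        for i, (smi, lab) in enumerate(zip(smiles_list, labels))
    ]


def render_images(rows, out_dir, size=(300, 300)):
    """Draw each molecule with RDKit and save it as a PNG inside out_dir.

    Returns the SMILES strings that RDKit could not parse.
    """
    # Imported here so the mapping helpers above work without RDKit installed.
    from rdkit import Chem
    from rdkit.Chem import Draw

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    skipped = []
    for row in rows:
        mol = Chem.MolFromSmiles(row["smiles"])
        if mol is None:  # RDKit returns None for strings it cannot parse
            skipped.append(row["smiles"])
            continue
        # MolToImage returns a PIL image, which can be saved directly as PNG.
        Draw.MolToImage(mol, size=size).save(out_dir / row["filename"])
    return skipped


# Made-up example: ethanol and benzene with illustrative labels only.
rows = build_mapping(["CCO", "c1ccccc1"], [0, 1])
print(rows[0]["filename"])  # mol_00000.png
# render_images(rows, "train_images")  # needs RDKit; writes the PNG files
```

Passing rows to pandas.DataFrame would give a mapping table like the one described above, with one row per image.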
We used ImageDataGenerator from Keras to split the validation set out of the larger training set with a 0.2 validation split. We did not augment the image data: rotating, zooming, or flipping the images is unnecessary because the plotting function draws every molecule in the same format. The image size is (300, 300) and we chose a batch size of 64. The input tensor has shape (6904, 300, 300, 3) for training, (1725, 300, 300, 3) for validation, and (268, 300, 300, 3) for testing. In total, our training set has 8,629 images (6,904 for training plus 1,725 for validation) and our testing set has 268 images.
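As a rough sketch of the split described above: the helper names, the column names "filename" and "label", and the choice to apply no pixel scaling are assumptions for illustration. The split arithmetic also assumes the validation subset holds the rounded-down 20% of the rows, which reproduces the 6,904 and 1,725 figures.

```python
def split_sizes(n_images: int, validation_split: float):
    """Return (training, validation) sizes.

    Assumption: the validation subset gets floor(n * split) images and the
    training subset gets the remainder.
    """
    n_val = int(n_images * validation_split)
    return n_images - n_val, n_val


def make_generators(mapping_df, image_dir, batch_size=64, image_size=(300, 300)):
    """Build training and validation iterators from a mapping DataFrame.

    mapping_df is assumed to have a 'filename' column and a 'label' column
    holding class names as strings, which class_mode='categorical' expects.
    """
    # Imported here so split_sizes can be used without TensorFlow installed.
    # ImageDataGenerator is marked as deprecated in recent TensorFlow releases.
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    # No augmentation arguments are set, only the validation split.
    datagen = ImageDataGenerator(validation_split=0.2)
    common = dict(
        dataframe=mapping_df,
        directory=image_dir,
        x_col="filename",
        y_col="label",
        target_size=image_size,
        batch_size=batch_size,
        class_mode="categorical",
    )
    train = datagen.flow_from_dataframe(subset="training", **common)
    val = datagen.flow_from_dataframe(subset="validation", **common)
    return train, val


print(split_sizes(8629, 0.2))  # (6904, 1725)
```

With a batch size of 64, one epoch over 6,904 training images takes 108 batches, the last of which is partial.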
Model
What is Transfer Learning?
Transfer learning refers to the practice of building neural networks on top of pre-trained models. Our model uses InceptionResNetV2 with ImageNet weights as the base and a small dense network as the head. The head consists of a GlobalAveragePooling2D layer followed by a Dense layer with softmax activation.
Model Head
The layers that we add to our transfer learning model are below.
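A minimal sketch of such a head, assuming the Keras functional API and two output classes. The function name build_model and its parameters are hypothetical, and whether the base layers are frozen during training is not stated in this post, so the sketch leaves them at the Keras default.

```python
import tensorflow as tf


def build_model(weights="imagenet", input_shape=(300, 300, 3), n_classes=2):
    """InceptionResNetV2 base with a pooling + softmax head on top."""
    # include_top=False drops the original ImageNet classifier layers.
    base = tf.keras.applications.InceptionResNetV2(
        include_top=False, weights=weights, input_shape=input_shape
    )
    # Average each feature map over its spatial dimensions.
    x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
    # One probability per class (assumed here: non-toxic and toxic).
    outputs = tf.keras.layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model(inputs=base.input, outputs=outputs)


# model = build_model()              # downloads the ImageNet weights
# model = build_model(weights=None)  # random initialization, no download
```

Passing weights=None is handy for checking shapes quickly, but it gives up the benefit of transfer learning because the base then starts from random weights.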
Our network performs a binary classification, and because the head ends in a softmax over the two classes, we use categorical cross-entropy as the loss function. We chose the Adam optimizer with its default learning rate (0.001 in Keras). As discussed in the previous blog post, non-toxicity is more prevalent than toxicity in our dataset, so we chose AUROC as our metric to get a more informative measure of classification quality. We also included accuracy as a metric so we can continue to compare that value with the underlying proportion of the two classes. Finally, the model was trained for 3 epochs and achieved an AUROC of 0.783.
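To see why accuracy alone can mislead on imbalanced classes, here is a small self-contained illustration. The data is made up (nine non-toxic molecules and one toxic one), and the auroc helper is a plain pairwise implementation written for this example, not the Keras metric.

```python
def auroc(labels, scores):
    """AUROC as the chance that a random positive outscores a random negative.

    Ties count as one half (the Mann-Whitney formulation).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of correct predictions after thresholding the scores."""
    preds = [int(s >= threshold) for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Made-up toy data: 9 non-toxic molecules (0) and 1 toxic molecule (1).
labels = [0] * 9 + [1]

# A model that gives every molecule the same low score.
constant = [0.1] * 10
print(accuracy(labels, constant), auroc(labels, constant))  # 0.9 0.5

# A model that ranks the toxic molecule above all the others.
ranked = [0.1] * 9 + [0.4]
print(accuracy(labels, ranked), auroc(labels, ranked))  # 0.9 1.0
```

Both toy models reach 90% accuracy, which simply mirrors the class proportions, while AUROC separates the uninformative model (0.5) from the one that ranks the toxic molecule first (1.0).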
As you can see, our model began to overfit very quickly. Moving forward, we will adjust our training strategy to counteract this.
Conclusion & Next Steps
Moving into the final stage of our project, we’re going to dive deeper into the different approaches we’ve considered so far and see where they can lead us. These approaches generally fall into three categories: sequence learning methods (RNN, LSTM, etc.), image processing methods (CNN), and transformers, which according to the literature can incorporate elements of both of these paradigms. On these three fronts, we plan to work on more sophisticated sequence models, to experiment with combining the sequence and image approaches in a composite model built from our CNN and a recurrent network, and to experiment with image-based transformer approaches that pair our preliminary success on the 2D SMILES representations with an attention mechanism.
All code can be found in this GitHub repository.
Blog Post 3: here!
References
[1] https://jcheminf.biomedcentral.com/articles/10.1186/s13321-020-00423-w
[2] https://github.com/bigchem/transformer-cnn
[3] https://arxiv.org/pdf/2002.08264.pdf
[4] Mahdianpari, Masoud, et al. “Very Deep Convolutional Neural Networks for Complex Land Cover Mapping Using Multispectral Remote Sensing Imagery.” Remote Sensing, vol. 10, no. 7, 2018, p. 1119, doi:10.3390/rs10071119.