We are graduate students in the Data Science Master’s program at Brown University. As part of our Deep Learning and Special Topics in Data Science course (Data 2040), we worked on processing molecular compounds using sequence models.
Chemistry is a field that has recently experienced a great surge of research attention from the machine learning community. This crossover between disciplines has been spurred by the fact that many of the modern text-processing techniques that have been developed in machine learning are very useful in extracting information from chemical compounds. While this can feel like a pretty big conceptual leap to go from text to chemistry, we just have to ask: what’s the difference to a machine learning model between a sequence of letters and a sequence of atoms?
With the barrier between language and chemistry broken down, we enter a rich field of research investigating different methods of processing and extracting useful properties from a chemical compound expressed as a sequence of characters. One of the standard ways of writing compounds in this way is the SMILES (Simplified Molecular-Input Line-Entry System) format. SMILES uses different characters to symbolize different kinds of atoms, bonds, rings, etc., producing expressions like the one in Figure 1.
The problem that we’re tackling in this project is how to predict molecular toxicity from a molecule’s SMILES expression. The data we’re starting with is the Tox21 data set from the National Institutes of Health [1,10] and we’re using presence of the SR-p53 protein as a proxy for toxicity (more on this later in the EDA section).
In terms of approaches for this type of molecular analysis, the literature is broad and expanding quickly, but it focuses on the standard set of sequence methods that have led to many of the advances in NLP in recent years. These include recurrent models like RNNs, GRUs, and LSTMs, as well as more powerful models like variational auto-encoders and transformers, the latter of which use attention to learn more sophisticated sequence embeddings.
Intro To Tokenization
When working with SMILES data, the first preprocessing step needed to transform our data into a sequence format that our model can use is tokenization. Tokenization is a standard preprocessing step in any sequence-learning task. In text processing, tokenization may look like breaking up a sentence into a sequence of letters; likewise, tokenization in our problem looks like breaking up SMILES compounds into elements of our molecular alphabet.
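As a rough illustration of what tokenization can look like, here is a minimal sketch of a regex-based SMILES tokenizer. The exact rules we used are not spelled out here, so the pattern and the name tokenize_smiles are assumptions for illustration; the one idea it shows is that multi-character symbols such as Cl or bracketed atoms should stay together as single tokens.

```python
import re

# Multi-character tokens are listed before single characters so that
# "Cl" is not split into "C" and "l".
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"            # bracket atoms such as [NH4+] or [O-]
    r"|Br|Cl"                # two-letter atoms
    r"|%\d{2}"               # two-digit ring-closure labels such as %10
    r"|[A-Za-z]"             # single-letter atoms (aromatic ones are lowercase)
    r"|\d"                   # single-digit ring-closure labels
    r"|[()=#+\-/\\.:@*~$]"   # branches, bonds, charges, stereo marks
)


def tokenize_smiles(smiles):
    """Split a SMILES string into a list of tokens (illustrative rules only)."""
    tokens = SMILES_TOKEN.findall(smiles)
    # If any character was skipped, the joined tokens will not match the input.
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize: {smiles!r}")
    return tokens


print(tokenize_smiles("CC(=O)Cl"))  # ['C', 'C', '(', '=', 'O', ')', 'Cl']
```

A plain character-level split would also work as a first pass; the difference is that it would treat Cl as two separate symbols.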
Exploratory Data Analysis (EDA)
Before creating our model, we analyze and investigate the dataset and summarize our findings. This is an important step in data science models, because it allows us to better understand the data, the outliers, and any errors. We begin with high level summaries and then dive deeper into more specific analyses.
The dataset used for a preliminary baseline model comes from the Tox21 Data Challenge 2014. The purpose of this competition was to crowdsource data analysis on how certain chemicals do or do not interfere with certain biological pathways. The datasets for this competition contained tables of SMILES data-label pairs, where the labels for each dataset represented binary outcomes indicating whether or not a chemical interferes with a particular essential molecule’s function. Of the several datasets available, we chose the one with labels indicating interference in the function of p53, a tumor-suppressing protein that shows increased bodily prevalence in the face of abnormal cells propagating from replication. Known as the “guardian of the genome”, p53 plays a vital role in binding to DNA to prevent genetic mutations. Below is a graphic demonstrating the diversity of use-cases for p53 and its derivatives.
Here are a few visualizations from our exploratory data analysis of this dataset.
Figure 3 shows the top 30 tokens ranked by the number of molecules in which they occur at least once. In descending order, the 8 most frequently occurring SMILES tokens are carbon, the opening and closing statements of a new branch, the double bond, oxygen, a formal charge of plus or minus 1, nitrogen, and the formal charge of plus or minus 2. This makes intuitive sense: the atoms listed are three of the most commonly occurring elements in organic compounds and the other tokens represent fundamental concepts in chemistry.
Figure 4 shows the top 30 tokens in terms of total occurrence in the dataset; note the logarithmically scaled vertical axis. We see that carbon’s dominance in this figure is more pronounced than in the previous figure. This likely has to do with the presence of complex carbon structures, like rings, where multiple carbon atoms appear in succession. Additionally, the double bond occurs more frequently overall than the branching tokens even though it occurs less frequently in an at-least-once context. This likely has to do with the fact that while many molecules contain at least one branch, molecules that contain any double bonds tend to contain several, given the charges of atoms within the molecules.
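The difference between the two kinds of counts can be sketched in a few lines. The three molecules below are hypothetical examples chosen for illustration, not rows from Tox21, and the variable names are our own.

```python
from collections import Counter

# Hypothetical, already-tokenized molecules (not taken from Tox21).
tokenized = [
    ["C", "C", "(", "C", ")", "C"],                 # CC(C)C
    ["C", "C", "(", "C", ")", "O"],                 # CC(C)O
    ["C", "=", "C", "C", "=", "C", "C", "=", "C"],  # C=CC=CC=C
]

# At-least-once count: number of molecules that contain the token.
molecule_counts = Counter(tok for mol in tokenized for tok in set(mol))

# Total count: number of times the token appears across the whole dataset.
total_counts = Counter(tok for mol in tokenized for tok in mol)

print(molecule_counts["("], total_counts["("])  # 2 2
print(molecule_counts["="], total_counts["="])  # 1 3
```

In this toy data the branch token appears in more molecules, while the double bond appears more often overall, which is the same pattern described for the two figures.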
Figure 5 shows the distribution of sequence lengths in terms of number of SMILES tokens. While we can see the distribution of lengths is centered around 20 tokens, the distribution is nonetheless heavily skewed in the positive direction. Given the highly variable sequence lengths and the presence of outliers in this respect, we’ll have to add a preprocessing step that transforms the inputs into sequences of constant length. We’ll discuss this step in more detail when we cover the baseline model.
Additionally, we created a few visualizations of the two-dimensional structures of molecules based on their SMILES sequences to get a sense of how the mappings look:
Preprocessing & Baseline Model
We begin by preprocessing our sequence data.
Our first preprocessing step was dealing with unseen tokens, that is, tokens in the test set that are not present in the training set. With the help of Vinoj John Hosan, we created a new preprocessing tool called LabelEncoderExt, which builds on LabelEncoder and is able to handle new classes (unseen tokens). Overall, LabelEncoderExt works by replacing an unseen token with “Unknown” through its fit and transform functions. More specifically, the fit function takes in a list of data, fits the encoder using all the unique values, and adds “Unknown” to the label encoder’s list of classes. The transform function takes a data list and transforms it to an id list, assigning all new values the class “Unknown”. We use LabelEncoderExt to encode the tokenized training and test data. To do this, we traverse the tokenized training samples and store a NumPy array of integer-encoded samples.
Next, we preprocessed our encoded training and test data through padding. While most sequences have a length of less than 60, the longest sequence in our training data is 240 characters. We pad the encoded training and test data to length 240 (we add 0s to each sequence until it reaches the desired length). Our padded and encoded training data becomes X_train, our padded and encoded test data becomes X_test, and our y_train and y_test are the target columns in the original dataset (0 for non-toxic, 1 for toxic).
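The fit/transform behavior described above can be sketched without any dependencies. This is not the actual LabelEncoderExt code: the class name UnknownAwareEncoder is hypothetical, and starting the ids at 1 (so that 0 stays free for padding) is an assumption of this sketch rather than something stated in the post.

```python
class UnknownAwareEncoder:
    """Maps tokens to integer ids; tokens unseen during fit map to UNKNOWN."""

    UNKNOWN = "Unknown"

    def fit(self, tokens):
        vocab = sorted(set(tokens) | {self.UNKNOWN})
        # Assumption of this sketch: ids start at 1 so 0 stays free for padding.
        self.token_to_id_ = {tok: i for i, tok in enumerate(vocab, start=1)}
        return self

    def transform(self, tokens):
        unknown_id = self.token_to_id_[self.UNKNOWN]
        return [self.token_to_id_.get(tok, unknown_id) for tok in tokens]


train = [["C", "C", "O"], ["C", "=", "O"]]
test = [["C", "Br", "O"]]  # "Br" never appears in the training data

encoder = UnknownAwareEncoder().fit(tok for mol in train for tok in mol)
X_train_ids = [encoder.transform(mol) for mol in train]
X_test_ids = [encoder.transform(mol) for mol in test]
print(X_test_ids)  # [[2, 4, 3]]
```

Here the unseen token Br receives the same id as “Unknown” (4 in this toy vocabulary), so the test set can be encoded without raising an error.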
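A minimal sketch of the padding step is shown below, using a length of 6 instead of 240 so the result is easy to read. The helper name pad_sequences_post is hypothetical, and appending the zeros at the end is an assumption of this sketch; Keras' pad_sequences utility, for comparison, pads at the front unless padding="post" is passed.

```python
def pad_sequences_post(sequences, maxlen, value=0):
    """Right-pad (or truncate) each id sequence to exactly maxlen entries."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]  # truncate anything longer than maxlen
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded


encoded = [[2, 2, 3], [2, 1, 3, 3, 2]]
X = pad_sequences_post(encoded, maxlen=6)
print(X)  # [[2, 2, 3, 0, 0, 0], [2, 1, 3, 3, 2, 0]]
```

Because 0 is used as the padding value, it should not also be the id of a real token; otherwise the model cannot tell padding apart from that token.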
Before implementing our baseline model, we initialize Weights and Biases (wandb), which allows us to display training progress and diagnose issues early on. We create a sequential model with Masking, Embedding, and SimpleRNN layers and one Dense layer.
The masking layer “masks” the value of 0, telling the RNN to ignore the padding characters.
The embedding layer is the first hidden layer in a network, where input_dim is the length of our vocabulary (length of train_alphabet + 1 to account for the unknown token), output_dim is the size of the vector space in which tokens will be embedded (length of train_alphabet + 1), and input_length is the length of input sequences (240).
The SimpleRNN layer is a fully connected RNN where the output is fed back into the input; we chose 100 as the dimensionality of the output space.
We end with a dense layer that has an output size of 1, and uses sigmoid as the activation.
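To make the roles of these layers concrete, here is a toy forward pass written in plain Python. It is not the Keras implementation and not our trained model: the weights are random, the sizes are tiny (3 units rather than 100), and all names are hypothetical. It only illustrates the recurrence (the previous hidden state feeds into the next step), the effect that masking is meant to have (padding steps leave the hidden state untouched), and the final sigmoid output.

```python
import math
import random


def simple_rnn_final_state(ids, embeddings, W_x, W_h, b, pad_id=0):
    """Run a tanh RNN over ids and return the final hidden state.

    Time steps whose id equals pad_id are skipped, so the hidden state is
    carried through unchanged, which is the effect masking aims for.
    """
    units = len(b)
    h = [0.0] * units
    for token_id in ids:
        if token_id == pad_id:
            continue  # masked step: keep the previous hidden state
        x = embeddings[token_id]  # embedding lookup for this token
        h = [
            math.tanh(
                sum(W_x[j][k] * x[k] for k in range(len(x)))
                + sum(W_h[j][k] * h[k] for k in range(units))
                + b[j]
            )
            for j in range(units)
        ]
    return h


def dense_sigmoid(h, w, bias):
    """Single-output dense layer with a sigmoid activation."""
    z = sum(wi * hi for wi, hi in zip(w, h)) + bias
    return 1.0 / (1.0 + math.exp(-z))


rng = random.Random(0)


def rand_vector(n):
    return [rng.uniform(-0.5, 0.5) for _ in range(n)]


vocab_size, embed_dim, units = 5, 4, 3  # toy sizes for illustration
embeddings = [rand_vector(embed_dim) for _ in range(vocab_size)]
W_x = [rand_vector(embed_dim) for _ in range(units)]
W_h = [rand_vector(units) for _ in range(units)]
b = rand_vector(units)
w_out, b_out = rand_vector(units), 0.0

unpadded = [2, 2, 3]
padded = [2, 2, 3, 0, 0, 0]
p_unpadded = dense_sigmoid(
    simple_rnn_final_state(unpadded, embeddings, W_x, W_h, b), w_out, b_out
)
p_padded = dense_sigmoid(
    simple_rnn_final_state(padded, embeddings, W_x, W_h, b), w_out, b_out
)
print(p_unpadded == p_padded)  # True
```

With masking in place, the padded and unpadded versions of the same molecule produce the same prediction; without it, the trailing zeros would be processed as if they were real tokens.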
To compile our model, we use binary_crossentropy as the loss, SGD as the optimizer, and both accuracy and AUC as the metrics. After training our model for 10 epochs with a batch size of 32, we get a val_accuracy of 0.8955 and a val_auc of 0.500.
Toxic samples are much less common than non-toxic samples in our data: non-toxic samples outnumber toxic ones at a ratio of 9:1. With data this skewed, accuracy alone is misleading, since a model that always predicts non-toxic would already score about 90%. To account for this and get an informative measure of classification quality, we use AUROC, which is resilient to skewed data. Using AUROC, we can interpret our validation score of 0.5 as equivalent to a random classifier.
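The gap between the two metrics can be reproduced on a tiny hypothetical example. The labels and scores below are made up to mirror the 9:1 imbalance; they are not our model's outputs, and the helper functions are written from scratch for illustration. A model that assigns every molecule the same score is one pattern consistent with a high accuracy and an AUROC of 0.5, although we have not confirmed that this is what the baseline does.

```python
def accuracy(y_true, y_score, threshold=0.5):
    preds = [1 if s >= threshold else 0 for s in y_score]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)


def auroc(y_true, y_score):
    """Probability that a random positive is scored above a random negative.

    Ties count as half, which is the Mann-Whitney U formulation of AUROC.
    """
    pos = [s for s, y in zip(y_score, y_true) if y == 1]
    neg = [s for s, y in zip(y_score, y_true) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels with a 9:1 imbalance (0 = non-toxic, 1 = toxic).
y_true = [0] * 9 + [1]

# A model that gives every molecule the same low score.
constant_scores = [0.1] * 10
print(accuracy(y_true, constant_scores))  # 0.9
print(auroc(y_true, constant_scores))     # 0.5

# A model that ranks the toxic molecule highest, still below the threshold.
ranked_scores = [0.1] * 9 + [0.4]
print(accuracy(y_true, ranked_scores))    # 0.9
print(auroc(y_true, ranked_scores))       # 1.0
```

Both sets of scores give the same accuracy, but only AUROC distinguishes the model that ranks the toxic molecule above the rest from the one that carries no information.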
Conclusion & Next Steps
Looking forward, the most immediate next step is to create a stronger model to improve on our baseline. The main scoring metric we are focusing on is the AUROC, as is the convention when dealing with two-class classification and unbalanced data. In terms of architectural next steps, we will explore different methods of creating embeddings of our SMILES sequences, possibly using techniques like byte-pair encoding or other methods of alphabet extension, as well as LSTMs and transformers to learn more sophisticated embeddings based on context and attention.
The current baseline model is extremely simplistic, including only an embedding layer (for the purposes of data reshaping), a simple RNN layer, and a dense layer. Our next steps will explore the use of LSTMs and other carefully selected model parameters.
Outside of simple RNNs, we are planning to explore CNNs, transformers, and attention-based language models. Attention-based language models like BERT and GPT-3 have recently emerged as interesting modes of classification or seq2seq prediction for SMILES data. Advances in deep learning algorithms have yielded great progress in the field of molecular property classification. To that end, the goal of our project is to come up with innovative ways of encoding SMILES as sequences for predicting toxicity.