1. ABOUT THE DATASET
------------

Title: Dataset supporting the article 'Transformer-decoder GPT models for generating virtual screening libraries of HMGCR inhibitors: effects of temperature, prompt-length and transfer-learning strategies.'

Creators: Mauricio Cafiero (ORCID: 0000-0002-4895-1783)

Organisation: University of Reading

Rights-holders: University of Reading

Publication Year: 2024

Description: Raw data for virtual screening libraries generated by a generative, pre-trained transformer-decoder model. Models were pre-trained on a general drug database from ZINC15 and fine-tuned on inhibitors of HMGCR from ChEMBL. Libraries used different transfer-learning strategies, different prompt lengths and different temperatures. The resultant libraries were screened against a deep neural network trained on experimental HMGCR IC50 values to predict IC50 values, docking scores from AutoDock Vina, the quantitative estimate of drug-likeness, Tanimoto similarity to known statin drugs, and other properties. This dataset contains tables of properties as well as csv files with the generated libraries, a Tkinter-based GUI for interacting with the libraries, and docking poses for selected molecules.

Cite as: Cafiero, Mauricio (2024): Dataset supporting the article 'Transformer-decoder GPT models for generating virtual screening libraries of HMGCR inhibitors: effects of temperature, prompt-length and transfer-learning strategies.' University of Reading. Dataset. https://doi.org/10.17864/1947.001340

Related publication: M. Cafiero, 'Transformer-decoder GPT models for generating virtual screening libraries of HMGCR inhibitors: effects of temperature, prompt-length and transfer-learning strategies.' Submitted to ACS Journal of Chemical Information and Modeling.

Contact: m.cafiero@reading.ac.uk

2. TERMS OF USE
------------

Copyright 2024 University of Reading. This dataset is licensed under a Creative Commons Attribution 4.0 International Licence: https://creativecommons.org/licenses/by/4.0/.

3. PROJECT AND FUNDING INFORMATION
------------

N.A.

4. CONTENTS
------------

File listing

Statin_GPT_Data_Cafiero_2024.xlsx

This file contains raw data for the virtual screening libraries generated by a generative, pre-trained transformer-decoder model. Models were pre-trained on a general drug database from ZINC15 and fine-tuned on inhibitors of HMGCR from ChEMBL. Libraries used different transfer-learning strategies, different prompt lengths and different temperatures. The resultant libraries were screened against a deep neural network trained on experimental HMGCR IC50 values to predict IC50 values, docking scores from AutoDock Vina, the quantitative estimate of drug-likeness, Tanimoto similarity to known statin drugs, and other properties.

Tab                Contents
Training Stats     Training statistics for the GPT and DNN models used in this work.
Transfer-Frozen    Properties for virtual screening libraries using 1000 prompts, a prompt length of 12 tokens, and T = 0.0 and 0.5.
Transfer-All       Properties for virtual screening libraries using 1000 prompts, a prompt length of 12 tokens, and T = 0.0 and 0.5.
6Seed              Properties for virtual screening libraries using 5000 prompts, a prompt length of 6 tokens, and T = 0.0 and 0.5.
9Seed              Properties for virtual screening libraries using 5000 prompts, a prompt length of 9 tokens, and T = 0.0 and 0.5.
12Seed             Properties for virtual screening libraries using 5000 prompts, a prompt length of 12 tokens, and T = 0.0 and 0.5.
k-means            Results of k-means clustering analysis for the 2183 unique micromolar molecules generated by all 30 5K models.
Overlap            Overlap tables between each of the 5K libraries at T = 0.0 and 0.5.

Acronyms/Variables
Xfer     transfer
IC50     inhibitory concentration, 50% activity
MW       molecular weight
QED      quantitative estimate of drug-likeness
alogP    calculated log of the partition coefficient between water and octanol
HBA      hydrogen bond acceptor
HBD      hydrogen bond donor

Statin_GPT_Libraries_Cafiero_2024.zip

This file contains csv files with the generated libraries, a Tkinter-based GUI for interacting with the libraries, and docking poses for selected molecules.

Files/Folders                  Contents
Statin_model_GUI_properties    a Python/Tkinter GUI for querying the generated libraries
Statin_model_GUI_duplicates    a Python/Tkinter GUI for finding duplicate molecules between the various generated libraries
Poses                          a folder of sdf files containing poses for the molecules from each k-means cluster with the lowest IC50 value, as well as poses for 6 known statins
xfer_Learning_files            a folder containing a subfolder for each library generated. The subfolders contain csv files with SMILES strings, IC50 values, QED values, etc. A separate csv file contains similarity information for the libraries. A png file contains a heatmap showing Pearson correlations between properties for each library.
images_from_gui                an empty folder to hold images generated by the user using the GUI

Acronyms/Variables
IC50     inhibitory concentration, 50% activity
QED      quantitative estimate of drug-likeness
GUI      graphical user interface
SMILES   a text-based molecular representation
csv      comma-separated values file

5. METHODS
-----------

The methods below are adapted from the manuscript named above, which has been submitted for review to the journal named above. All referenced figures and equations can be found in the manuscript.

A dataset of all compiled HMGCR inhibitors was downloaded from BindingDB.org. The data were filtered to remove duplicates, outliers and null values, and to include only those entries with a specific IC50 value, i.e., values reported with > or < were excluded. This resulted in a dataset of 905 inhibitor SMILES strings and corresponding IC50 values in units of nanomolar (nM). This dataset will be referred to as the BDB905 dataset in the rest of this work.

The molecule SMILES strings were featurized with Mordred descriptors as implemented in DeepChem. Morgan fingerprints and RDKit descriptors were also tested but not used in production. The Mordred descriptors were reduced from 1613 features to 75 features using principal component analysis as implemented in scikit-learn. IC50 values (in nM) ranged from 0.16 nM to over 10^9 nM, and so they were transformed with the natural log function for the fitting process (called ln-IC50 here), leading to a much smaller range (-2 to about 13) which could be fit more easily.

The BDB905 dataset was fit using a modified form of the DenseNet architecture, wherein the network is made of blocks (referred to here as SkipDense blocks), each containing n dense layers and an optional skip connection from the block input to the block output, essentially allowing the block input to skip that block entirely while a second copy of the input progresses through the block as normal (Figure 1).
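For illustration only, the following is a minimal Keras sketch of such a SkipDense block (an optional skip connection around a stack of dense layers). It is not the author's code; the layer width, activation and regularization used here follow the settings given in the next paragraph.

    # Illustrative sketch of a "SkipDense" block: n dense layers with an
    # optional skip connection from the block input to the block output.
    # Not the author's code; hyperparameters follow the next paragraph
    # (400 nodes per layer, 4 dense layers per block, l2 = 0.01, LeakyReLU).
    import tensorflow as tf
    from tensorflow.keras import layers, regularizers

    def skip_dense_block(x, n_layers=4, units=400, use_skip=True, l2=0.01):
        """Pass x through n_layers dense layers; optionally add the block
        input back onto the block output (DenseNet-style skip)."""
        block_input = x
        for _ in range(n_layers):
            x = layers.Dense(units, kernel_regularizer=regularizers.l2(l2))(x)
            x = layers.LeakyReLU()(x)
        if use_skip:
            # Project the input to the block width if needed, then add it back.
            if block_input.shape[-1] != units:
                block_input = layers.Dense(units)(block_input)
            x = layers.Add()([x, block_input])
        return x

    # Example: a one-block regression model for 75 PCA-reduced Mordred features.
    inputs = tf.keras.Input(shape=(75,))
    hidden = skip_dense_block(inputs, use_skip=True)
    outputs = layers.Dense(1, activation="linear")(hidden)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.002),
                  loss="mean_absolute_error")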
DenseNet has been shown to have excellent performance for non-linear fitting [17]. The models used here had a normalization layer (for RDKit descriptors only; for Mordred descriptors, normalization was performed prior to PCA reduction), one, two or three SkipDense blocks, and a dense output layer. The models were trained with the Adam optimizer and used a learning rate of 0.002, an l2 regularization constant of 0.01, 400 nodes per layer, and 4 dense layers per block. LeakyReLU activation was used on all dense layers except the output, which used linear activation. The BDB905 dataset was split into training and validation sets (90/10 split) and models were trained for 150 epochs. The mean absolute error was used as the loss function for optimization, and training and validation scores were calculated for each model trained.

Eight models were trained and evaluated: 1 SkipDense block with a skip; 1 SkipDense block with no skip; 2 SkipDense blocks with skips in both; 2 SkipDense blocks with no skips; 2 SkipDense blocks with no skip in the first and a skip in the second; 2 SkipDense blocks with a skip in the first and no skip in the second; 3 SkipDense blocks with no skip in the first and skips in the second and third; and 3 SkipDense blocks with no skip in the first and third blocks and a skip in the second block. Since the models had similar losses and scores, a set of four statin molecules was used as a "fine-tuning" of sorts to select the final model for production. These four statins, cerivastatin, simvastatin, atorvastatin and rosuvastatin, have known IC50 values of 3.54, 2.74, 1.16 and 0.16 nM, respectively. The trained models were used to predict the IC50 values for these four molecules, and their accuracy and relative ordering were used to select the final model.

A dataset of 40,000 molecular SMILES strings was downloaded from the In Vitro dataset at ZINC15, which consists of substances that are reported or inferred active at 10 micromolar or less in binding assays. This dataset will be referred to as the ZN1540K dataset in the rest of this work. SMILES strings were tokenized using the SmilesTokenizer from DeepChem and the vocabulary file provided on their GitHub page. Tokenized SMILES strings were then padded with padding tokens ([PAD]) to the length of the longest SMILES string in the dataset. Inputs for each tokenized SMILES string were created as strings with a length of one less than the longest SMILES string in the dataset, missing the final token. Ground truth for each SMILES string was the same as the input string but shifted by one, i.e., missing the first token and including the last token.

The ZN1540K dataset was then used to pre-train four attention-based decoder models, which we will refer to as Generative Pre-trained Transformer (GPT) models. Each GPT consisted of an input layer, one to four transformer blocks, and a dense output layer (Figure 2). The general structure of this model was adapted from a text-based decoder. The transformer block used 256 nodes in the dense layers, ReLU activation in the first dense layer, and a dropout rate of 10%. Attention layers in the transformer blocks all used 4 attention heads and a key dimension of 256. The embedding layer had a dimension of 256, and the output dense layer had a dimension of 85 (the size of the vocabulary for the combined ZN1540K and ChEMBL1081 datasets) and used softmax activation to generate token probabilities. The training of the GPT models used the Nadam optimizer and sparse categorical cross-entropy for the loss function.
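As an illustration of the decoder architecture just described, a minimal Keras sketch of one transformer block with the stated dimensions follows. This is not the author's code: layer normalization and causal masking are standard decoder details assumed here, and positional embeddings are omitted for brevity.

    # Illustrative sketch of one decoder-style transformer block with the
    # dimensions described above (256-unit dense layers, ReLU in the first
    # dense layer, 10% dropout, 4 attention heads, key dimension 256).
    # Layer normalization and causal masking are assumed standard details.
    import tensorflow as tf
    from tensorflow.keras import layers

    class TransformerBlock(layers.Layer):
        def __init__(self, embed_dim=256, num_heads=4, key_dim=256,
                     ff_dim=256, dropout=0.1, **kwargs):
            super().__init__(**kwargs)
            self.attn = layers.MultiHeadAttention(num_heads=num_heads,
                                                  key_dim=key_dim)
            self.drop1 = layers.Dropout(dropout)
            self.norm1 = layers.LayerNormalization()
            self.ff1 = layers.Dense(ff_dim, activation="relu")
            self.ff2 = layers.Dense(embed_dim)
            self.drop2 = layers.Dropout(dropout)
            self.norm2 = layers.LayerNormalization()

        def call(self, x):
            # Causal self-attention: each position attends only to earlier ones.
            attn_out = self.attn(x, x, use_causal_mask=True)
            x = self.norm1(x + self.drop1(attn_out))
            # Position-wise feed-forward sub-layer.
            ff_out = self.ff2(self.ff1(x))
            return self.norm2(x + self.drop2(ff_out))

    # Example: token embedding -> one block -> softmax over the 85-token vocabulary.
    vocab_size, embed_dim = 85, 256
    tokens = tf.keras.Input(shape=(None,), dtype="int32")
    x = layers.Embedding(vocab_size, embed_dim)(tokens)
    x = TransformerBlock()(x)
    probs = layers.Dense(vocab_size, activation="softmax")(x)
    gpt = tf.keras.Model(tokens, probs)
    gpt.compile(optimizer="nadam", loss="sparse_categorical_crossentropy")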
A dataset of all HMGCR inhibitors was downloaded from ChEMBL. This dataset is distinct from the BDB905 dataset used above, despite having some overlap; the separate dataset was used to provide more data diversity to the process. As this dataset was used to train the GPT models on SMILES strings, all entries could be used regardless of whether they had a valid IC50 value, accounting for 1081 unique SMILES strings. We will refer to this dataset as the ChEMBL1081 dataset for the rest of this work.

The ChEMBL1081 dataset was used to generate four models. As specified above, four GPT models were pre-trained with one, two, three and four transformer blocks. Additional transformer blocks were added to each of the initial GPT models so that they all had a total of four blocks. The added blocks were then trained on the ChEMBL1081 dataset, while the weights for the pre-trained blocks were frozen. For example, the GPT with 2 transformer blocks had the weights for those two blocks frozen; two more blocks were added, and the new four-block model was trained with only the weights for the two new blocks and the output dense layer trainable. All layer details were the same as in section 2.2, and this training also used the Nadam optimizer and sparse categorical cross-entropy for the loss function. Pre-trained blocks were frozen to allow only the newly added blocks to learn from the ChEMBL1081 dataset.

The first of the resulting four models was referred to as NoX (no blocks transferred: all four blocks pre-trained, none trained on the ChEMBL1081 dataset). This model serves as a control, as it was trained on 40,000 drug-like molecules but had no specific HMGCR training. The next three models were referred to as the 1X, 2X and 3X models, having one, two or three pre-trained blocks transferred, and three, two and one blocks fine-tuned on the ChEMBL1081 dataset, respectively. Finally, a fifth model with four transformer blocks was trained only on the ChEMBL1081 dataset, with no transferred blocks. This was referred to as the SO model (statin-only). All five models were then trained for an additional 50 epochs with all weights on all blocks trainable, allowing cooperativity between the blocks, which had previously been divided into "drug-like" blocks and "statin-only" blocks. These fully optimized models were referred to as the NoXALL, 1XALL, 2XALL and 3XALL models.

A subset of the ChEMBL1081 dataset was created containing only those molecules with known IC50 values. This yielded a set of 232 molecules and was used as a control and for validation of the screening tests. This dataset will be referred to as the ChEMBL232 dataset in the rest of this work.

All models were used to generate molecule libraries for virtual screening. In the proof-of-concept models, molecules were generated by feeding each model a "seed" or prompt of 12 input tokens and asking it to predict the most likely next token in the sequence (called temperature = 0 molecule generation). This was done eighty times per seed, so that the resulting molecules consisted of 92 tokens each. Each model was fed 1000 12-token prompts, with the aim of generating 1000 molecules in each library. The seeds were generated by taking a random sample of 1000 (or, later, 5000) molecules from the combined ZN1540K and ChEMBL1081 datasets, tokenizing them, and choosing the first 12 tokens of each. Once generated, each token sequence was transformed back to a SMILES string for further analysis.
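Returning to the block-freezing step described above, the following minimal sketch shows the general idea in Keras, re-using the illustrative TransformerBlock from the previous sketch. The names and the commented-out usage are assumptions for illustration, not the author's code.

    # Illustrative sketch of the transfer-learning step: frozen pre-trained
    # blocks followed by newly added, trainable blocks and a trainable output
    # layer, trained on the ChEMBL1081 data. Names are illustrative only.
    import tensorflow as tf
    from tensorflow.keras import layers

    def build_transfer_model(pretrained_blocks, n_new_blocks,
                             vocab_size=85, embed_dim=256):
        tokens = tf.keras.Input(shape=(None,), dtype="int32")
        x = layers.Embedding(vocab_size, embed_dim)(tokens)
        for block in pretrained_blocks:      # blocks pre-trained on ZN1540K
            block.trainable = False          # freeze the transferred weights
            x = block(x)
        for _ in range(n_new_blocks):        # fresh blocks to learn HMGCR SMILES
            x = TransformerBlock()(x)
        outputs = layers.Dense(vocab_size, activation="softmax")(x)
        model = tf.keras.Model(tokens, outputs)
        model.compile(optimizer="nadam", loss="sparse_categorical_crossentropy")
        return model

    # e.g. a "2X"-style model: two frozen pre-trained blocks plus two new blocks
    # (two_zinc_blocks, chembl_inputs and chembl_targets are hypothetical names).
    # model_2x = build_transfer_model(pretrained_blocks=two_zinc_blocks, n_new_blocks=2)
    # model_2x.fit(chembl_inputs, chembl_targets, ...)
    #
    # For the final "ALL" stage, make every layer trainable and fit again:
    # model_2x.trainable = True
    # model_2x.compile(optimizer="nadam", loss="sparse_categorical_crossentropy")
    # model_2x.fit(chembl_inputs, chembl_targets, epochs=50)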
The library generation process was repeated with a higher "temperature" of 0.5 using a multinomial-like sampling strategy. Higher temperature, in this case, means that each token selected to add to the seed is not necessarily the most probable next token, but possibly a token of lesser likelihood. When the model is queried for the next most likely token, it does not give just a single token, but provides a probability for every token in the vocabulary (85 total tokens for this work). In temperature = 0 molecule generation, the token with the highest probability is chosen, but if the temperature is not zero, another token may be chosen. Higher-temperature generation was performed by taking the probability f_i(t) given by the model that the next token, t, is the i-th vocabulary word, transforming it according to a Boltzmann-like distribution, and then using the resulting distribution PD(t) as the probability function for the NumPy random.choice tool. The token selected by this process was then appended to the seed, repeating until each sequence reached a total of 92 tokens.

The libraries generated with the five models from section 2.3 are referred to as 1K12S libraries, where 12 is the number of seed tokens, or prompt length. Thus, for the five fully trained models, the resulting libraries are: NoXA1K12S (No Xfer-learning, All-layers trained, 1K prompts, 12 Seed tokens), 3XA1K12S (3 Xfer-learning layers, All-layers trained, 1K prompts, 12 Seed tokens), etc. After the proof-of-concept 1000-prompt libraries were tested, three more libraries were created for each of the five models at two temperatures (0.0 and 0.5), leading to thirty total libraries. These libraries each used 5000 prompts and had prompt lengths of six, nine, or twelve tokens. These libraries are referred to as 5KnS libraries, where n is the prompt length. Thus, for the five fully trained models in section 2.3, the resulting libraries are: NoXA5K12S (No Xfer-learning, All-layers trained, 5K prompts, 12 Seed tokens), 3XA5K12S (3 Xfer-learning layers, All-layers trained, 5K prompts, 12 Seed tokens), etc.

Several strategies were used to screen the generated libraries. First, the DNN from section 2.1 was used to predict an IC50 value for each molecule. The predicted IC50 values were used to separate the libraries into two sets: all molecules were included in a subset referred to here as "refined," and if the predicted IC50 value for a molecule was less than 1000 nM, that molecule was also added to a "docking" subset. All molecules in the docking subset were then docked in the HMGCR binding site using the Dockstring package for Python [23]. This package accepts a SMILES string as input and then prepares the molecule by protonating it at a pH of 7.4 using Open Babel, generating a conformation using ETKDG from RDKit, optimizing the structure with MMFF94, and computing charges for all atoms using Open Babel, all while maintaining any stereochemistry in the original SMILES string. The prepared molecule is then docked into the protein binding site using AutoDock Vina with default values of exhaustiveness, binding modes, and energy range. The prepared HMGCR binding site from the DUD-E database was used for docking. Poses were visualized with PyMOL.

RDKit [15] was used to calculate various ADME properties including molecular weight (MW), calculated log P (aLogP), hydrogen bond acceptors and donors (HBA, HBD), number of rotatable bonds, number of aromatic rings, polar surface area, and number of alerts for undesirable moieties.
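Returning to the temperature-based token selection described above: the exact Boltzmann-like transformation is given in the manuscript, but a minimal NumPy sketch of the sampling step, assuming the common re-weighting p_i proportional to f_i^(1/T), is:

    # Illustrative sketch of temperature sampling with numpy.random.choice.
    # The exact Boltzmann-like transformation is defined in the manuscript;
    # the common re-weighting p_i proportional to f_i**(1/T) is assumed here.
    import numpy as np

    def sample_next_token(token_probs, temperature=0.5):
        """token_probs: model probabilities for all 85 vocabulary tokens."""
        token_probs = np.asarray(token_probs, dtype=np.float64)
        if temperature == 0.0:
            # Greedy (T = 0) generation: always take the most probable token.
            return int(np.argmax(token_probs))
        # Higher T flattens the distribution, giving less likely tokens a chance.
        weights = token_probs ** (1.0 / temperature)
        pd = weights / weights.sum()
        return int(np.random.choice(len(pd), p=pd))

    # Generation loop (variable names illustrative): start from a 12-token seed
    # and append 80 sampled tokens to reach 92 tokens in total.
    # sequence = list(seed_tokens)
    # for _ in range(80):
    #     probs = gpt.predict(np.array([sequence]))[0, -1]  # next-token distribution
    #     sequence.append(sample_next_token(probs, temperature=0.5))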
These properties were also used to calculate the quantitative estimate of drug-likeness (QED) [27], which uses a fit of ADME properties to predict how drug-like a molecule will be. RDKit was also used to search for several substructures from known statin drugs: the atorvastatin pharmacophore (3,5-dihydroxypentanoic acid, which binds to ASP 671, LYS 672, and LYS 673 in HMGCR), the HMG coenzyme-A pharmacophore, a fluorophenyl ring and a methanesulphonamide group (both found in type-2 statins), and a butyryl group and decalin ring (both found in type-1 statins). Absolute numbers and percentages of these substructures in the libraries are reported.

Morgan fingerprints [28] (radius of 2, so roughly equivalent to extended-connectivity fingerprints of diameter 4) were used to find Tanimoto similarities for several sets of molecules. First, the average similarity of all of the molecules in each library was calculated by averaging the pairwise similarity over all unique pairs of molecules, so, for a library of n molecules, there were n(n-1)/2 unique similarity values. This was used as a measure of the amount of variation in each library. The average similarities were also calculated between all molecules in each library and a set of six statin molecules: atorvastatin, rosuvastatin, fluvastatin, simvastatin, lovastatin and pravastatin. The first three in this set are well-known type-2 statins, and the last three are well-known type-1 statins. This type of similarity to known actives is often used as a screening criterion.

In order to use similarity as a screening criterion, a benchmark must be established with the fingerprint and similarity method being used. To do this, the BDB905 dataset, which contained 905 experimental IC50 values, was examined. The molecules in the dataset were sorted by IC50 value from lowest to highest, and the average similarity for each set of 5 consecutive molecules was calculated, i.e., the average similarity was found for molecules 1-5, 6-10, 11-15, etc., with the rationale that if the molecules have similar activity, their similarity may correlate with that. Figure 3 shows the distribution of similarities for this dataset. The range that occurred most often was ~0.25, meaning that more sets of 5 molecules with similar IC50 values had similarities of ~0.25 than any other value. It is worth noting that the higher values most often correspond to lower IC50 groupings. For example, the highest similarity, 0.72, corresponded to the 11 to 15 grouping, which had an average IC50 value of 0.72 nM, and the second highest similarity, 0.57, corresponded to a grouping with an average IC50 of 1.22 nM. The lowest similarity, 0.12, on the other hand, corresponded to two groupings with average IC50 values of 7180 and 749800 nM. Thus, in this work, 0.25 is used as the cutoff value for similarity: if the similarity is 0.25 or above, there is a chance of similar activity.

The final set of unique, sub-micromolar molecules was analysed for ease of synthesis using the synthetic accessibility score, which breaks molecules down into fragments and uses fragment information from the ChEMBL database to estimate ease of synthesis. This method has been found to agree with expert analysis with an R^2 of 0.89.
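As an illustration of the fingerprint-based similarity calculations described above, a short RDKit sketch (not the author's code) computing the average pairwise Tanimoto similarity of a library from radius-2 Morgan fingerprints:

    # Illustrative sketch: average pairwise Tanimoto similarity for a library,
    # using radius-2 Morgan bit-vector fingerprints (roughly ECFP4).
    from itertools import combinations
    from rdkit import Chem
    from rdkit.Chem import AllChem, DataStructs

    def average_pairwise_similarity(smiles_list):
        mols = [Chem.MolFromSmiles(s) for s in smiles_list]
        fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048)  # radius 2
               for m in mols if m is not None]
        # A library of n molecules gives n*(n-1)/2 unique pairwise similarities.
        sims = [DataStructs.TanimotoSimilarity(a, b)
                for a, b in combinations(fps, 2)]
        return sum(sims) / len(sims)

    # The same fingerprints and TanimotoSimilarity call can be used to compare
    # each library molecule against the six reference statins.
    print(average_pairwise_similarity(["CCO", "CCN", "CCC"]))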