1. ABOUT THE DATASET
------------

Title: Dataset supporting the article 'Variable-temperature token sampling in decoder-GPT molecule-generation can produce more robust and potent virtual screening libraries.'

Creators: Mauricio Cafiero (ORCID: 0000-0002-4895-1783)

Organisation: University of Reading

Rights-holders: University of Reading

Publication Year: 2025

Description: Raw data for virtual screening libraries generated by a generative, pre-trained transformer-decoder model using variable-temperature decoding. In this scheme, various temperature ramps are used during the generation process, such that each token can have a different generation temperature. The model used for this is described in our previous work: DOI: 10.1021/acs.jcim.4c01309.

Cite as: Cafiero, Mauricio (2025): Dataset supporting the article 'Variable-temperature token sampling in decoder-GPT molecule-generation can produce more robust and potent virtual screening libraries.' University of Reading. Dataset. https://doi.org/10.17864/1947.001408

Related publication: M. Cafiero, 'Variable-temperature token sampling in decoder-GPT molecule-generation can produce more robust and potent virtual screening libraries.' Submitted to RSC Physical Chemistry Chemical Physics.

Contact: m.cafiero@reading.ac.uk

2. TERMS OF USE
------------

Copyright 2025 University of Reading. This dataset is licensed under a Creative Commons Attribution 4.0 International Licence: https://creativecommons.org/licenses/by/4.0/.

3. PROJECT AND FUNDING INFORMATION
------------

N.A.

4. CONTENTS
------------

File listing

VarTemp_Data_Cafiero_2025.xlsx

This file contains raw data for virtual screening libraries generated by a generative, pre-trained transformer-decoder model using variable-temperature decoding. In this scheme, various temperature ramps are used during the generation process, such that each token can have a different generation temperature.
The model used for this is described in our previous work: DOI: 10.1021/acs.jcim.4c01309.

Tab: Contents
Raw Data: Total valid, useable, and sub-micromolar numbers of molecules generated by the GPT with four prompt lengths (1 token, 3 tokens, 6 tokens, and 22-token scaffolds) for four sampling temperatures.
Overlap: Numbers of generated molecules that overlap between all libraries using various decoding schemes.
Correlations: Pearson correlations between IC50, docking score, SAS, and alogP for each library used in this work.
Group Membership: Total counts [top] and fraction of molecules [bottom] from each library in each k-means group (Groups 0-9).

Acronyms/Variables:
IC50: inhibitory concentration, 50% activity (nM)
MW: molecular weight (g/mol)
QED: quantitative estimate of drug-likeness
alogP: calculated log of the partition coefficient between water and octanol
HBA: hydrogen bond acceptor
HBD: hydrogen bond donor
Rot: rotatable bonds
arom: number of aromatic rings
Ator: Atorvastatin
Sim: Simvastatin
Rosu: Rosuvastatin
Lova: Lovastatin
Fluva: Fluvastatin
Prava: Pravastatin
SAS: synthetic accessibility score
score: docking score (kcal/mol)

VarTemp_Data_Cafiero_2025.zip

This file contains CSV files with the generated libraries and a Tkinter-based GUI for interacting with the libraries.

Files/Folders: Contents
Notebooks: a Python/Tkinter GUI for querying the generated libraries
anneal: CSV files for all generated libraries with SMILES and predicted IC50 values
anneal/Docking_props: CSV files for all generated libraries with SMILES, predicted IC50 values, docking scores, and ADME properties
anneal/K-means: CSV files with information about K-means group membership

Acronyms/Variables:
IC50: inhibitory concentration, 50% activity
GUI: graphical user interface
SMILES: a text-based molecular representation
CSV: comma-separated values

5. METHODS
-----------

The methods below are adapted from the manuscript named above, which has been submitted for review to the journal named above.
All referenced figures and equations can be found in the manuscript.

The statin-molecule GPT and the statin IC50-scoring dense neural network (DNN) from the previous work were used for all molecule generation and scoring in this work. Specifically, the GPT model with two transformer blocks trained on the Zn15 dataset of 40k in-vitro bio-active molecules and two transformer blocks trained on 1081 HMGCR inhibitor molecules from the ChEMBL database was used (referred to in that work as the 2XA model). The DNN was trained on 905 HMGCR inhibitors from the Binding Database. The previous work showed that shorter prompt lengths produced virtual screening libraries with lower IC50 values and other desirable properties.

Prompts for the GPT are tokens corresponding to elements of molecular SMILES strings, so a prompt length of 10 would correspond to the first 10 characters of a SMILES string. The 2XA GPT had a vocabulary of 85 tokens, which is also used here. In this work, the shortest prompt length from the previous work, six tokens, is replicated in order to establish continuity with that work. In the previous work, the six tokens were taken as the first six tokens from a set of 5,000 molecular SMILES strings chosen randomly from the 41,000-molecule training set. The same set of 5,000 SMILES strings is used here. In addition, three other prompt lengths were also tested here: three tokens and one token, taken from the same set of 5,000 SMILES strings as the previous work, and a set of 22-token scaffolds, repeated to make 5,000 total prompts. These scaffolds are shown in Figure 1 and correspond to the pharmacophores for Atorvastatin, Rosuvastatin, HMG-Coenzyme A, and Simvastatin. For all but the Simvastatin pharmacophore, both the protonated and deprotonated forms were used, for a total of seven scaffolds. These seven scaffolds were then repeated to create a list of 5,000 prompts.
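The prompt sets described here can be sketched in a few lines of Python. This is a minimal illustration, not the code used to build the dataset: per-character tokenization stands in for the model's actual SMILES tokenizer, and the helper names, start-token name, and padding position are assumptions.

```python
# Illustrative sketch of prompt construction (hypothetical helper names).
# Per-character tokenization is a simplification of SMILES tokenization.
from itertools import cycle, islice

START = "<start>"  # assumed name for the model's start/padding token

def fixed_length_prompts(smiles_list, n_tokens):
    """Take the first n_tokens tokens of each SMILES string (1, 3, or 6 here)."""
    return [list(s)[:n_tokens] for s in smiles_list]

def scaffold_prompts(scaffolds, n_prompts=5000, target_len=22):
    """Pad shorter scaffolds to a uniform length with start tokens,
    then repeat the scaffolds in order to produce n_prompts prompts."""
    padded = [[START] * (target_len - len(s)) + list(s) for s in scaffolds]
    return list(islice(cycle(padded), n_prompts))
```
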
One of the scaffolds corresponded to 22 tokens, while the others corresponded to 17 tokens, so the shorter scaffolds were padded with five extra "start" tokens to achieve a uniform length of 22 for all of the scaffolds. The 2XA GPT model was used to generate up to 5,000 molecules for each of the four prompt lengths, with the various temperature-based schemes discussed below.

Figure 1. Scaffolds used as prompts in the generation process: a. the pharmacophore for Atorvastatin, b. the pharmacophore for Rosuvastatin, c. the pharmacophore for HMG-Coenzyme A, and d. the pharmacophore for Simvastatin.

To generate a molecule from a prompt, at each step k of the generation process, the existing set of tokens (or just the prompt, if it is the first step) is passed through the model, and the probability P_k(i) that each token i of the 85 tokens in the vocabulary is the next token is calculated. With T = 0.0, or greedy decoding, the token with the highest probability is chosen each time. In temperature-based sampling, at each step the probabilities are scaled according to the temperature:

PS_k(i) = P_k(i)^(1/T) / Σ_j P_k(j)^(1/T)    EQ. 1

where PS_k are the scaled probabilities. This scaling serves to even out the probabilities, so that at higher temperatures the difference between the highest and lowest probabilities decreases. These scaled probabilities are then used to randomly choose the next token, with higher-probability tokens more likely to be chosen. At higher temperatures, though, even the less likely tokens have an appreciable chance of being chosen. This in turn results in less probable, more varied molecules being generated, and a wider chemical space being sampled. In the variable-temperature token generation used in this work, the token-selection process switches between greedy decoding and temperature-based sampling while the molecule is being generated, and the temperature increases or decreases during the process as well.
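The decoding step can be sketched with NumPy. This is a minimal illustration under the scheme described (greedy decoding below a small cutoff, EQ. 1 scaling otherwise); the function name and cutoff argument are our own.

```python
import numpy as np

def sample_next_token(probs, T, rng, greedy_below=0.015):
    """Choose the next-token index from the model probabilities P_k.

    Below `greedy_below` the highest-probability token is taken (greedy
    decoding, T = 0); otherwise each probability is raised to 1/T and
    renormalised (EQ. 1) before sampling.
    """
    probs = np.asarray(probs, dtype=float)
    if T < greedy_below:
        return int(np.argmax(probs))      # greedy decoding
    scaled = probs ** (1.0 / T)           # P_k(i)^(1/T)
    scaled /= scaled.sum()                # EQ. 1 denominator
    return int(rng.choice(len(scaled), p=scaled))
```

At T = 1 the distribution is unchanged; as T grows, the scaled distribution flattens toward uniform, so low-probability tokens are drawn more often.
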
In this work, three increasing temperature schemes were tested. First, a slowly increasing exponential:

T = T_o χ e^(χ-1)    EQ. 2

where T_o sets the scale of the ramp (the final temperature for the increasing ramps and the initial temperature for the decreasing ones) and χ is the ratio of the current step, k, to the maximum number of steps, k_max. Next, a more rapidly increasing exponential was tested:

T = T_o [1 - e^(-χ) + χ e^(-χ)]    EQ. 3

Finally, an increasing sigmoid was tested, activating at 50% of k_max:

T = T_o / (1 + e^(-(k - 0.5 k_max)))    EQ. 4

Three analogous decreasing temperature schemes were also tested: a slowly decreasing function:

T = T_o (1 - χ) e^(-χ)    EQ. 5

a more rapidly decreasing function:

T = T_o (1 - χ) e^χ    EQ. 6

and a decreasing sigmoid, activating at 50% of k_max:

T = T_o / (1 + e^(k - 0.5 k_max))    EQ. 7

Figure 2 shows each of these temperature ramps beginning or ending at T = 0.5, with a maximum number of generation steps of 90.

Figure 2. Variable-temperature sampling schemes with an increasing ramp beginning at zero and ending at 0.5 (EQ. 2-4, left) and a decreasing ramp beginning at 0.5 and ending at zero (EQ. 5-7, right). The x-axis is step/token number (k) and the y-axis is temperature.

The final temperature ramp used in this work, based on the results obtained with EQs 2-7, was an increasing sigmoid activated at 10% of the total number of generation steps. This ramp is shown in Figure 3. In all temperature ramps used in this work, T = 0.0, or greedy decoding, was used for any temperature less than 0.015, in order to improve numerical stability in the generation process.

Figure 3. Sigmoidal variable-temperature sampling scheme with an increasing ramp beginning at zero and ending at 0.5. The sigmoid is centered at 10% of the maximum token length. The x-axis is step/token number (k) and the y-axis is temperature.

In this work, libraries of up to 5,000 molecules were generated for each of the four prompt lengths with four set temperatures: 0.0, 0.5, 1.0, and 2.0.
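EQs 2-7 translate directly into code. A minimal sketch (the function and scheme labels are our own; the caller is assumed to apply the greedy cutoff at T < 0.015):

```python
import math

def ramp_temperature(k, k_max, T_o=0.5, scheme="sigmoid_up"):
    """Temperature at generation step k for the ramps in EQs 2-7."""
    chi = k / k_max                      # ratio of current to maximum step
    if scheme == "slow_up":              # EQ. 2, slowly increasing
        return T_o * chi * math.exp(chi - 1)
    if scheme == "fast_up":              # EQ. 3, more rapidly increasing
        return T_o * (1 - math.exp(-chi) + chi * math.exp(-chi))
    if scheme == "sigmoid_up":           # EQ. 4, activates at 50% of k_max
        return T_o / (1 + math.exp(-(k - 0.5 * k_max)))
    if scheme == "slow_down":            # EQ. 5, slowly decreasing
        return T_o * (1 - chi) * math.exp(-chi)
    if scheme == "fast_down":            # EQ. 6, more rapidly decreasing
        return T_o * (1 - chi) * math.exp(chi)
    if scheme == "sigmoid_down":         # EQ. 7, decreasing sigmoid
        return T_o / (1 + math.exp(k - 0.5 * k_max))
    raise ValueError(scheme)
```

Each increasing ramp runs from T = 0 at k = 0 to T_o at k = k_max, and each decreasing ramp runs from T_o down to zero, matching the figure described above.
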
Libraries were also then generated using EQs 2-7 to vary the temperature during the generation process, all beginning or ending at T = 0.5. This temperature was chosen as it was found to produce robust, potent libraries in the previous work. Finally, the increasing sigmoid of EQ. 4, activated at 10% of the number of generation steps, was used, ending at four temperatures: 0.5, 1.0, 1.5, and 2.0. Overall this produced fourteen temperature variations for each of the four prompt lengths, or fifty-six total libraries.

All of the molecules in these fifty-six libraries were characterized in several ways, including predicted IC50 values, docking scores, synthetic accessibility scores, ADME properties, the presence of various chemical moieties, and various molecular similarities. The DNN from the previous work was used to calculate an IC50 value for each molecule. Each molecule was then docked in the HMGCR binding site (structure from the curated DUD-E database) using AutoDock Vina via DockString. The DockString package prepares the molecule by protonating it at a pH of 7.4 with Open Babel, generating a conformation with ETKDG from RDKit, optimizing the structure with MMFF94 (also from RDKit), and computing charges for the atoms with Open Babel, maintaining the stereochemistry in the original SMILES string. The Synthetic Accessibility Score (SAS) for each molecule was computed using the SAS tool in RDKit. The Quantitative Estimate of Druglikeness (QED) was calculated for each molecule, along with the pharmacokinetic properties that make up the QED, including alogP, molecular weight, number of hydrogen bond donors, number of hydrogen bond acceptors, number of aromatic rings, and number of rotatable bonds. Average values for these properties for each library may be found in the supporting data.
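The QED and pharmacokinetic-property calculations can be sketched with RDKit (assuming RDKit is installed). This is an illustrative sketch, not the dataset's pipeline: RDKit's Crippen MolLogP stands in for alogP, and the docking and SAS steps are omitted.

```python
# Sketch of per-molecule property characterisation with RDKit.
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski, QED

def characterise(smiles):
    """Return the QED and its component properties, or None for invalid SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None                               # invalid/unusable SMILES
    return {
        "QED":   QED.qed(mol),                    # quantitative estimate of drug-likeness
        "MW":    Descriptors.MolWt(mol),          # molecular weight (g/mol)
        "alogP": Descriptors.MolLogP(mol),        # Crippen logP as a stand-in for alogP
        "HBD":   Lipinski.NumHDonors(mol),
        "HBA":   Lipinski.NumHAcceptors(mol),
        "Rot":   Descriptors.NumRotatableBonds(mol),
        "arom":  Lipinski.NumAromaticRings(mol),
    }
```
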
The libraries were analysed for the presence of several moieties typical of type I and type II statins: the HMG coenzyme-A pharmacophore, a fluorophenyl ring and a methanesulfonamide group (both found in type II statins), and a butyryl group and decalin ring (both found in type I statins). The counts for these moieties in each library are presented in the supporting data.

Tanimoto similarities between every pair of molecules in each library, and between each molecule in each library and a set of known statin molecules, were calculated using Morgan fingerprints of radius 2, which are roughly equivalent to extended-connectivity fingerprints of diameter 4. The percentages of each library that showed a greater than 0.25 similarity to Atorvastatin and Simvastatin (a representative type II and type I statin, respectively) are shown in Table 1 (%A and %S). Percent similarities to other statins are shown in the supporting data, and largely follow the patterns for the representative type I and II molecules. Also in the supporting data are the percentages of pairs in each library that have a similarity of more than 0.25. This characteristic can serve as a measure of the diversity of each library, as a higher percentage of similar pairs means that the library covers a smaller chemical space.

Finally, Pearson correlations between several of the properties presented here were calculated, including correlations between IC50 and docking score, IC50 and SAS, IC50 and alogP, and docking score and SAS. The correlation between IC50 and docking score is important because the IC50 for a molecule is predicted by a DNN, which in turn is trained on features derived from the SMILES string for each molecule. Thus, other than a few rudimentary properties such as the number of rotatable bonds and the polar surface area, there is no 3D structural information about the molecule in the IC50 calculation.
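The similarity analysis can be sketched with RDKit Morgan fingerprints of radius 2 (assuming RDKit is installed; the helper names and the 2048-bit fingerprint size are our own choices):

```python
# Sketch of Tanimoto similarity with radius-2 Morgan (ECFP4-like) fingerprints.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def tanimoto(smiles_a, smiles_b, radius=2, n_bits=2048):
    """Tanimoto similarity between two molecules' Morgan fingerprints."""
    fps = [
        AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), radius, nBits=n_bits)
        for s in (smiles_a, smiles_b)
    ]
    return DataStructs.TanimotoSimilarity(fps[0], fps[1])

def fraction_similar(library, reference, cutoff=0.25):
    """Fraction of a library with Tanimoto similarity > cutoff to a reference."""
    hits = sum(tanimoto(s, reference) > cutoff for s in library)
    return hits / len(library)
```

The same `fraction_similar` pattern, applied pairwise within one library, gives the diversity measure described above.
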
Likewise, there is no information about the physical, 3D fit of the molecule in the binding site in the IC50 calculation. The docking score, however, is based wholly on the three-dimensional structure of the molecule and its complementarity with the binding site. The greater the agreement between the IC50 value and the docking score, the more trustworthy each becomes. As a guideline, Pearson coefficients between 0 and ±0.3 can be considered weak correlations, values from ±0.3 to ±0.5 can be considered medium-strength correlations, and values from ±0.5 to ±1 can be considered strong correlations. The IC50/score correlation is provided here, and the other correlations are available in the supporting data, either in tables or in heatmap images.
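The correlation analysis and the strength guideline above can be sketched with NumPy (the function names are our own):

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two property vectors
    (e.g. per-molecule IC50 values and docking scores)."""
    return float(np.corrcoef(x, y)[0, 1])

def strength(r):
    """Classify |r| using the guideline quoted above."""
    r = abs(r)
    if r < 0.3:
        return "weak"
    if r < 0.5:
        return "medium"
    return "strong"
```
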