How to cite this Dataset
Cafiero, Mauricio (2024): Dataset supporting the article 'Transformer-decoder GPT models for generating virtual screening libraries of HMGCR inhibitors: effects of temperature, prompt-length and transfer-learning strategies'. University of Reading. Dataset. https://doi.org/10.17864/1947.001340
Description
Raw data for virtual screening libraries generated by a generative, pre-trained transformer-decoder model. Models were pre-trained on a general drug database from ZINC15 and fine-tuned on inhibitors of HMGCR from ChEMBL. The libraries were generated using different transfer-learning strategies, prompt lengths and temperatures. The resulting libraries were screened using IC50 values predicted by a deep neural network trained on experimental HMGCR IC50 values, docking scores from AutoDock Vina, quantitative estimate of drug-likeness (QED), Tanimoto similarity to known statin drugs, and other properties. This dataset contains tables of properties, CSV files with the generated libraries, a Tkinter-based GUI for interacting with the libraries, and docking poses for selected molecules.
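As an illustration of what the temperature setting means when sampling from a language model, the sketch below applies temperature scaling to a softmax over token scores. It is not taken from the dataset or the article: the function name `temperature_softmax` and the example scores are hypothetical, and the models behind this dataset may apply temperature differently.

```python
import math

def temperature_softmax(logits, temperature):
    """Turn raw token scores into probabilities after dividing by temperature.

    Hypothetical helper for illustration only; not code from the dataset.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed scores for three candidate next tokens (made-up values).
logits = [2.0, 1.0, 0.0]
sharp = temperature_softmax(logits, 0.5)  # low temperature: favours the top token
flat = temperature_softmax(logits, 2.0)   # high temperature: closer to uniform
```

Under this convention, lower temperatures concentrate probability on the highest-scoring token, while higher temperatures spread it across more tokens and so tend to produce more varied output.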
| Resource Type: | Dataset |
|---|---|
| Creators: | Cafiero, Mauricio |
| Rights-holders: | University of Reading |
| Data Publisher: | University of Reading |
| Publication Year: | 2024 |
| Data last accessed: | 20 November 2025 |
| DOI: | https://doi.org/10.17864/1947.001340 |
| Metadata Record URL: | https://researchdata.reading.ac.uk/id/eprint/1340 |
| Organisational units: | Life Sciences > School of Chemistry, Food and Pharmacy > Department of Chemistry |
| Participating Organisations: | University of Reading |
| Keywords: | GPT, machine learning, drug design |
| Rights: | |
| Data Availability: | OPEN |

