AI and Synthetic Pharmacokinetics
Researchers use AI to produce synthetic pharmacokinetic data.
Rob Coker | 3 min read | Technology

In a new development for AI in drug discovery, researchers at the University of Waterloo, Canada, have introduced a novel generative model capable of producing high-quality synthetic pharmacokinetic (PK) data for drug compounds. Dubbed "Imagand", the SMILES-to-Pharmacokinetics (S2PK) diffusion model could reduce the time and cost associated with drug development.
The study addresses the persistent problem of data sparsity in pharmacokinetics – the study of how a drug is absorbed, distributed, metabolized, and excreted by the body. Because PK data are often collected independently and in isolated datasets, researchers studying drug combinations or conducting high-throughput screening frequently face significant gaps in available information. Imagand seeks to fill these gaps with synthetic data that mimics real-world distributions.
Diffusion models, originally developed for generating high-quality images and video, have recently found new applications in molecular science. These models work by systematically corrupting input data and then learning to reconstruct it from noise. The Imagand model builds on this framework by training a transformer-based diffusion system to predict PK properties from molecular SMILES strings – simplified notations of chemical structures – using pre-trained chemical language models.
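The "corrupt, then reconstruct" idea can be made concrete with a small sketch of the forward (noising) half of a generic diffusion process. This is the standard DDPM-style formulation, not Imagand's actual implementation: the linear schedule, the step count, and the three example values are assumptions chosen for illustration, and only the Python standard library is used.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances rising linearly (an assumed, DDPM-style schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alpha(betas):
    """Running product of (1 - beta_t): the share of signal variance left at step t."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def corrupt(x0, t, alpha_bars, rng):
    """Draw x_t ~ N(sqrt(abar_t) * x0, (1 - abar_t) * I) in a single jump."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, noise)]
    # In a DDPM-style setup, a network is trained to recover `noise` from (xt, t).
    return xt, noise

rng = random.Random(0)
alpha_bars = cumulative_alpha(linear_beta_schedule(1000))
x0 = [0.2, -1.3, 0.7]  # hypothetical, already-normalized PK values for one molecule
xt, eps = corrupt(x0, 999, alpha_bars, rng)  # at the last step, xt is almost pure noise
```

Generation runs the process in reverse: starting from noise, the trained network removes a little of it at each step, conditioned in Imagand's case on an embedding of the molecule's SMILES string.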
Imagand can generate 12 different PK target properties, including absorption rates, volume of distribution, half-life, clearance rates, and toxicity indicators, for over 30,000 drug-like molecules. The model leverages a novel noise strategy called Discrete Local Gaussian Noise (DLGN), which breaks down complex data distributions into more manageable segments, improving training stability and output realism.
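The article does not spell out how DLGN works, so the following is only one possible reading of "breaking a distribution into segments", not the authors' algorithm. It bins the observed values, fits a Gaussian to each occupied bin, and draws noise by picking a bin in proportion to its size and sampling from that bin's local Gaussian. The function names and the bin count are hypothetical.

```python
import random
import statistics

def fit_local_gaussians(values, num_bins):
    """Split the value range into equal-width bins; return (count, mean, std) per occupied bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1.0  # fall back to 1.0 if all values are identical
    bins = [[] for _ in range(num_bins)]
    for v in values:
        idx = min(int((v - lo) / width), num_bins - 1)  # the maximum goes in the last bin
        bins[idx].append(v)
    return [
        (len(members), statistics.fmean(members), statistics.pstdev(members))
        for members in bins
        if members
    ]

def sample_noise(segments, rng):
    """Pick a segment with probability proportional to its count, then sample its Gaussian."""
    weights = [count for count, _, _ in segments]
    _, mu, sigma = rng.choices(segments, weights=weights, k=1)[0]
    return rng.gauss(mu, sigma)

rng = random.Random(1)
# Toy bimodal data standing in for a PK property with two clusters of values.
data = [rng.gauss(-2.0, 0.3) for _ in range(500)] + [rng.gauss(3.0, 0.5) for _ in range(500)]
segments = fit_local_gaussians(data, num_bins=20)
draws = [sample_noise(segments, rng) for _ in range(2000)]
```

Unlike a single standard Gaussian, noise drawn this way stays close to the shape of the data it was fitted on, which is the intuition behind the stability and realism the article attributes to DLGN.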
Realistically synthetic data
A key challenge in any synthetic data generation task is ensuring that the output accurately reflects the complexity of the real world. To that end, the authors evaluated Imagand’s outputs using both statistical and task-based benchmarks. Metrics such as Hellinger distance (for distribution similarity), Pearson and Spearman correlation coefficients (for bivariate relationships), and regression task performance were used to compare synthetic versus real data.
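For readers unfamiliar with these metrics, here is a minimal sketch of how each can be computed. It uses only the Python standard library so it runs anywhere; in practice NumPy or SciPy would be the usual choice. The bin count and function names are assumptions, and the study's exact settings may differ.

```python
import math

def hellinger(real, synth, num_bins=10):
    """Hellinger distance between two samples, histogrammed on shared bins (0 = identical, 1 = disjoint)."""
    lo = min(min(real), min(synth))
    hi = max(max(real), max(synth))
    width = (hi - lo) / num_bins or 1.0

    def hist(values):
        counts = [0] * num_bins
        for v in values:
            counts[min(int((v - lo) / width), num_bins - 1)] += 1
        return [c / len(values) for c in counts]

    p, q = hist(real), hist(synth)
    overlap = sum(math.sqrt(a * b) for a, b in zip(p, q))  # Bhattacharyya coefficient
    return math.sqrt(max(0.0, 1.0 - overlap))

def pearson(x, y):
    """Pearson correlation: strength of the linear relationship between x and y."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks (monotonic relationship)."""
    return pearson(ranks(x), ranks(y))
```

To check bivariate relationships, the correlation between two PK properties would be computed once on real data and once on synthetic data; the closer the two coefficients, the better the synthetic data preserves that relationship.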
Across most datasets, Imagand’s synthetic PK values were not only statistically similar to their real-world counterparts, but also enhanced performance in downstream machine learning tasks. In several experiments, regression models trained on synthetic data outperformed those trained on real data when evaluated against held-out test sets.
In predicting aqueous solubility (AqSolDB), models trained on synthetic data achieved a Pearson correlation of 0.731, compared with 0.756 for models trained on real data, nearly matching real-real benchmarks. Similarly, for plasma protein binding and lipophilicity, synthetic data yielded competitive metrics in terms of both mean squared error and correlation coefficients. These findings were validated over 30 trials per dataset to ensure statistical robustness.
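The task-based comparison follows a "train on synthetic, test on real" pattern, which can be sketched as below. Everything here is a toy: the linear relationship, the dataset sizes, and the one-feature least-squares model are made up for illustration and say nothing about the study's actual models or results. In particular, the "synthetic" set is simply drawn from the same process as the real data, which is an idealization.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

def mse(model, xs, ys):
    """Mean squared error of a fitted (slope, intercept) pair on the given data."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(42)

def toy_dataset(n, noise):
    """Hypothetical (descriptor, PK value) pairs from a made-up linear relationship."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [1.5 * x + 2.0 + rng.gauss(0.0, noise) for x in xs]
    return xs, ys

real_train = toy_dataset(20, noise=1.0)        # small "real" training set
synthetic_train = toy_dataset(200, noise=1.0)  # larger stand-in for generated data
real_test = toy_dataset(100, noise=1.0)        # held-out real data shared by both models

# Both models are scored on the same held-out real data, never on synthetic data.
mse_real = mse(fit_line(*real_train), *real_test)
mse_synth = mse(fit_line(*synthetic_train), *real_test)
```

Repeating this over many random splits, as the study did with 30 trials per dataset, gives a distribution of scores rather than a single, possibly lucky, comparison.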
By bridging the data gap in PK studies, Imagand could unlock new possibilities for computational drug design, repurposing, and toxicology. The research team also aims to democratize access to Imagand code and PK data, particularly for researchers or startups without the resources to conduct costly in vivo or in vitro experiments. The team suggests that future iterations of Imagand will explore the generation of categorical PK properties (e.g. whether a compound is hepatotoxic or not), expansion to larger and more diverse datasets, and integration with other modalities such as transcriptomic or proteomic data.