From Beginner to Expert: AI in Drug Discovery – A Definitive Guide from Lab to Market (Part 2: “Data and Algorithms Behind AI-Driven R&D”) explains which data types AI relies on, what sources those data come from, and how representative model families are used in practice.

TOC

1. Data and Models: The Core of AI in Drug Discovery

In Part 1, we mapped where AI can plug into the drug discovery and development value chain and what it can – and cannot – do. In Part 2, we zoom in on the foundations: data and models.

In practical terms, most AI projects in drug discovery come down to two questions:

  • From which data, and with which labels or objectives, are we learning?
  • Which model family is best suited to the structure and scale of those data?

Once these are clear, researchers and non-technical stakeholders can discuss AI initiatives using a shared vocabulary and can more realistically assess whether a given AI strategy is well-founded.

2. Key Data Types Used in Drug Discovery

Let us first review the main data types used in AI-driven drug discovery and where they typically appear along the value chain.

2-1. Chemical Structure Data (Small Molecules / Fragments)

For small-molecule projects, chemical structure and activity data form the core.

  • SMILES, InChI, and graph representations (atoms and bonds)
  • Physicochemical descriptors (logP, pKa, molecular weight, polar surface area, etc.)
  • Activity values against targets (IC50, Kd, Emax, and so on)

AI models use these to predict potency, selectivity, and ADMET properties, and to perform de novo molecular design. Graph neural networks (GNNs) have become popular because they can operate directly on molecular graphs rather than relying solely on engineered descriptors.
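To make the graph view concrete, the sketch below (plain Python, with hand-entered approximate atomic masses and a hypothetical ethanol graph, purely illustrative) encodes a molecule as atoms plus bonds and derives one whole-molecule descriptor and one per-atom feature from it:

```python
# Minimal sketch: a molecule as a graph of atoms (nodes) and bonds (edges).
# Atomic masses are approximate; hydrogens are listed explicitly for clarity.

ATOMIC_MASS = {"C": 12.011, "O": 15.999, "H": 1.008}

# Ethanol (CH3CH2OH): node list + bond list (pairs of indices into the node list).
atoms = ["C", "C", "O", "H", "H", "H", "H", "H", "H"]
bonds = [(0, 1), (1, 2), (0, 3), (0, 4), (0, 5), (1, 6), (1, 7), (2, 8)]

def molecular_weight(atoms):
    """A simple whole-molecule descriptor computed from the node labels."""
    return sum(ATOMIC_MASS[a] for a in atoms)

def degree(atom_index, bonds):
    """A simple per-atom feature: number of bonded neighbors."""
    return sum(atom_index in bond for bond in bonds)

print(round(molecular_weight(atoms), 2))  # ethanol, about 46.07
print(degree(0, bonds))                   # the first carbon has 4 neighbors
```

Descriptor-based models consume summary numbers like the molecular weight above; GNNs instead operate on the raw `atoms`/`bonds` structure and per-atom features such as the degree.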

2-2. Bioassay and Screening Data

High-throughput screening (HTS), phenotypic screening, and related assays provide another essential data source.

  • Binary labels (active/inactive) or multi-class outcomes
  • Curve-fit parameters derived from concentration–response data
  • Multi-assay profiles for each compound or biological entity

AI can re-score hits, predict activity in related assays, and prioritize follow-up experiments. However, assay noise, protocol differences, and batch effects make robust preprocessing and normalization critical.
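One standard preprocessing step is to normalize each plate separately before pooling data across plates. A minimal sketch (stdlib Python, with invented plate readouts) using robust z-scores, which resist outliers such as genuine hits better than mean/SD:

```python
import statistics

def robust_z(values):
    """Robust z-scores using the median and MAD, which resist outliers
    (including genuine screening hits) better than mean/SD."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    scale = 1.4826 * mad  # consistency factor so MAD approximates SD for normal data
    return [(v - med) / scale for v in values]

# Two hypothetical plates with different baselines (a batch effect);
# the last well on each plate is a screening hit.
plate_a = [100, 102, 98, 101, 250]
plate_b = [60, 59, 61, 58, 160]

# Normalizing each plate separately puts the hits on a comparable scale.
for plate in (plate_a, plate_b):
    print([round(z, 1) for z in robust_z(plate)])
```

After per-plate normalization, the two hits stand far outside the background on the same scale even though the raw plates had very different baselines.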

2-3. Omics Data (Genomics, Transcriptomics, Proteomics, etc.)

Omics datasets are central to target discovery and patient stratification.

  • Genomic alterations (SNVs, CNVs, structural variants, etc.)
  • Gene expression profiles (RNA-seq, arrays)
  • Protein expression and post-translational modifications

AI models use these high-dimensional data for:

  • Disease subtype clustering
  • Prognostic and treatment-response prediction
  • Candidate target and biomarker discovery

Going forward, multi-omics integration will be increasingly important, as single-omics signals alone often fail to capture the full biological context.

2-4. Structural Biology and 3D Data (Proteins and Complexes)

Experimental and predicted 3D structures – from crystallography, NMR, cryo-EM, and structure prediction methods – provide:

  • Binding pocket shapes and physicochemical properties
  • Interaction patterns with ligands or binding partners
  • Hints about allosteric sites and conformational dynamics

AI can use these structures to predict binding affinity, infer function from sequence, and support design for antibodies, peptides, molecular glues, PROTACs, and other modalities.

2-5. Imaging Data (Cellular, Tissue, Pathology, In Vivo)

Imaging is another domain where AI excels.

  • Cellular imaging and high-content screening
  • Tissue and pathology slides
  • In vivo imaging (MRI, PET, CT, optical imaging, etc.)

From these, AI can extract morphological features, subcellular localization, and tumor microenvironment characteristics, supporting phenotypic screening and biomarker discovery using computer vision (CNNs, Vision Transformers, and related architectures).

2-6. Clinical and Real-World Data (EHR, Registries, Claims)

Clinical trial data and real-world evidence underpin development strategy and lifecycle management.

  • Electronic health records (diagnoses, prescriptions, labs, vitals, free-text notes)
  • Registries and cohort datasets
  • Claims and dispensing data

With these, AI can:

  • Improve patient recruitment by identifying eligible patients
  • Analyze real-world effectiveness and safety
  • Characterize patterns of use, including off-label prescribing

Privacy, bias, and missing data are major concerns that must be addressed from the outset.

2-7. Text, Literature, Patents, and Internal Reports

With the rise of large language models, text data have become a strategic asset.

  • Scientific articles, reviews, and conference abstracts
  • Patent documents
  • Internal reports, study summaries, and meeting minutes

AI can mine these texts for:

  • Competitive and technology landscape insights
  • Early detection of safety signals and mechanisms
  • Hypothesis generation (targets, combinations, biomarkers)

However, generated text is not equivalent to scientific truth, so rigorous fact-checking and citation management remain essential.

3. Main Model Families and Their Roles in R&D

Next, we summarize four broad categories of models frequently used in AI-driven drug discovery.

3-1. Classical Machine Learning (Random Forests, SVM, k-NN, etc.)

Traditional QSAR and many ADMET prediction tasks still rely heavily on classical ML.

  • Inputs: molecular descriptors, physicochemical features, and simple engineered features
  • Models: random forests, gradient boosting, support vector machines, and others
  • Strengths: works with relatively small datasets, reasonably interpretable, easy to deploy
  • Weaknesses: less suited to very high-dimensional or multi-modal data

In niche toxicity endpoints or proprietary assays with limited data, classical ML can still be the most practical choice.
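As a toy illustration of how simple these methods can be, here is a k-NN-style QSAR regressor in plain Python (the descriptor vectors and pIC50 labels are invented for illustration; real pipelines would use curated descriptors and a library such as scikit-learn):

```python
import math

def knn_predict(query, training_data, k=3):
    """Predict an activity value as the mean over the k nearest
    neighbors in descriptor space (Euclidean distance)."""
    neighbors = sorted(training_data, key=lambda xy: math.dist(xy[0], query))[:k]
    return sum(y for _, y in neighbors) / k

# Hypothetical descriptor vectors (logP, MW/100, TPSA/100) -> pIC50 labels.
train = [
    ((1.2, 2.5, 0.6), 6.1),
    ((1.4, 2.7, 0.5), 6.3),
    ((3.8, 4.1, 0.2), 7.8),
    ((4.0, 4.3, 0.1), 8.0),
    ((0.2, 1.8, 1.1), 5.2),
]

# A query compound close to the first two training compounds.
print(round(knn_predict((1.3, 2.6, 0.55), train, k=2), 2))  # -> 6.2
```

The similarity-in-descriptor-space assumption behind k-NN is essentially the classical QSAR premise that structurally similar compounds tend to have similar activity.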

3-2. Deep Learning (CNNs, RNNs, Transformers, GNNs)

Deep learning underpins many modern AI successes across data types.

  • CNNs for images and grid-like data
  • RNNs / LSTMs / Transformers for sequences, time series, and text
  • GNNs for molecular graphs and interaction networks

GNNs are particularly attractive because they can learn directly from graph structures, capturing local and global patterns in molecules or biological networks. The trade-offs are higher data and compute requirements and more complex model design.
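The core GNN update can be sketched without any learned weights: each node aggregates its neighbors' features and combines them with its own. A minimal round of message passing on an invented three-atom chain:

```python
# One round of message passing on a tiny molecular graph: each node
# updates its feature vector using the sum of its neighbors' features.
# The graph and node features are invented for illustration.

adjacency = {0: [1], 1: [0, 2], 2: [1]}                 # a 3-atom chain
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 0.0]}

def message_pass(adjacency, features):
    """Aggregate neighbor features (sum) and add them to the node's own
    feature: the core update behind most GNN layers, minus learned weights."""
    updated = {}
    for node, neighbors in adjacency.items():
        dim = len(features[node])
        agg = [sum(features[n][d] for n in neighbors) for d in range(dim)]
        updated[node] = [x + a for x, a in zip(features[node], agg)]
    return updated

print(message_pass(adjacency, features))
# After one round, the middle atom has "seen" both of its neighbors;
# stacking k rounds propagates information across k-bond neighborhoods.
```

Real GNN layers insert learned weight matrices and nonlinearities into this update, which is where the higher data and compute requirements come from.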

3-3. Generative Models (VAE, GAN, Diffusion Models)

Generative models fuel the popular image of “AI designing molecules.”

  • VAEs learn continuous latent spaces from which new molecules can be sampled
  • GANs pit a generator against a discriminator to produce high-quality synthetic samples
  • Diffusion models iteratively denoise from random noise to create structures

Beyond generating “many molecules,” conditional generation – e.g., under target or property constraints – makes these models more useful for practical design tasks.
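The gap between unconstrained and conditional generation can be illustrated with a toy rejection-sampling loop (the "decoder" below is a stand-in stub, not a trained model, and the property is an invented logP-like value):

```python
import random

random.seed(0)

def toy_decoder(z):
    """Stand-in for a learned decoder: maps a 2-D latent vector to a
    hypothetical molecular property (e.g., a logP-like value)."""
    return 2.0 * z[0] + 0.5 * z[1]

def conditional_sample(target_min, target_max, n=5):
    """Naive conditional generation: sample latent vectors and keep only
    those whose decoded property falls in the desired range. Practical
    systems condition the generator itself, which is far more
    sample-efficient than this rejection loop."""
    kept = []
    while len(kept) < n:
        z = (random.uniform(-1, 1), random.uniform(-1, 1))
        prop = toy_decoder(z)
        if target_min <= prop <= target_max:
            kept.append((z, prop))
    return kept

for z, prop in conditional_sample(0.5, 1.0):
    print(f"latent=({z[0]:+.2f}, {z[1]:+.2f}) -> property={prop:.2f}")
```

Even in this toy, every returned sample satisfies the property constraint, which is the practical point of conditioning: design under constraints rather than generation at random.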

3-4. Multi-Modal and Foundation Models

A recent trend is multi-modal and foundation models that jointly learn from multiple data types.

  • Models that connect molecules, protein sequences, structures, and assay data
  • Surrogate models linking omics with clinical outcomes
  • Systems that bridge text (papers, patents) with structural and assay data

These models aim to learn broadly useful representations that can be fine-tuned for specific tasks with relatively little labeled data. However, their performance and risk profile are tightly coupled to the quality and bias of their pre-training data.

4. Where the Data Come From in Practice

In practice, AI in drug discovery typically combines several data sources:

  • Public databases
    Compound–activity datasets, protein structure repositories, omics resources, and open cohorts.
    Pros: accessible, support reproducible science.
    Cons: noisy, heterogeneous, and often inconsistent in metadata and assay conditions.
  • Internal experimental data
    Company-owned screening, toxicology, preclinical, clinical, and manufacturing data.
    Pros: highly relevant to the company’s portfolio and strategy.
    Cons: fragmented across systems, varying formats, missing metadata.
  • Collaborative and consortium data
    Shared datasets from academia, other companies, and public–private partnerships.
    Pros: access to scale and diversity otherwise unattainable.
    Cons: complex contracts, IP questions, privacy, and standardization issues.
  • Commercial datasets
    Curated real-world data, literature-derived datasets, and specialized annotations.
    Pros: cleaned and enriched for analysis.
    Cons: cost and licensing constraints on internal sharing and reuse.

A common pattern is to pre-train on public data and fine-tune on internal data. This can work well, but only if distribution shifts between these data sources are carefully managed.

5. Pitfalls: Data Quality, Bias, and Label Design

In many projects, data quality, bias, and label design matter more than the choice of model architecture.

5-1. Label Design: What Counts as “Ground Truth”?

A frequent failure mode is poorly defined labels that only partially reflect reality.

  • Using IC50 from a single assay as a label while ignoring cell type and conditions
  • Defining clinical labels by survival alone without accounting for follow-up time or concomitant medications
  • Collapsing nuanced safety assessments into a binary “toxic/not toxic” label that hides subjective judgment and reporting bias

Before building models, teams should explicitly discuss how well their labels represent the real-world outcomes they care about and which aspects have been deliberately simplified or discarded.

5-2. Data Leakage

Data leakage can silently inflate performance metrics.

  • The same compound or patient appears in both training and test sets
  • Future information (e.g., post-treatment data) leaks into input features
  • Variables that are unavailable in practice (e.g., batch IDs or manual QC flags) are inadvertently used as features

This leads to models that look excellent on internal benchmarks but fail in real use. Robust evaluation design with domain and ML experts is essential.
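The first leakage mode above has a simple structural fix: split by group (compound, patient, site) rather than by record. A minimal group-aware split in stdlib Python, on hypothetical assay records:

```python
import random

def group_split(records, group_key, test_frac=0.2, seed=42):
    """Split so that every record sharing a group (e.g., the same compound
    or patient) lands entirely in train OR test -- never both."""
    groups = sorted({group_key(r) for r in records})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, int(len(groups) * test_frac))
    test_groups = set(groups[:n_test])
    train = [r for r in records if group_key(r) not in test_groups]
    test = [r for r in records if group_key(r) in test_groups]
    return train, test

# Hypothetical assay records: (compound_id, assay, activity).
records = [("cpd-1", "A", 6.2), ("cpd-1", "B", 5.9),
           ("cpd-2", "A", 7.1), ("cpd-3", "A", 4.8),
           ("cpd-3", "B", 5.0), ("cpd-4", "B", 6.6)]

train, test = group_split(records, group_key=lambda r: r[0], test_frac=0.25)
overlap = {r[0] for r in train} & {r[0] for r in test}
print("compound overlap between train and test:", overlap)  # set()
```

For chemistry specifically, splitting by Bemis-Murcko scaffold rather than exact compound ID is a common, stricter variant of the same idea.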

5-3. Bias and Generalizability

Clinical and real-world datasets often contain structural biases: uneven representation of populations, institutions, and treatment patterns. Models can inadvertently learn and amplify these biases, resulting in:

  • Performance drops in under-represented groups or regions
  • Over-reliance on idiosyncratic patterns from specific sites

Data collection and model evaluation should therefore be designed with generalizability in mind, explicitly checking for subgroup performance and robustness.
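A subgroup check can be as simple as reporting accuracy per site or population instead of a single aggregate number. A sketch with invented labels, where a mediocre overall score hides a complete failure on one site:

```python
from collections import defaultdict

def accuracy_by_subgroup(predictions, labels, subgroups):
    """Report accuracy per subgroup so that a strong overall number
    cannot hide a weak under-represented group."""
    hits, totals = defaultdict(int), defaultdict(int)
    for pred, label, group in zip(predictions, labels, subgroups):
        totals[group] += 1
        hits[group] += int(pred == label)
    return {g: hits[g] / totals[g] for g in totals}

# Hypothetical predictions: the model is perfect on site_A but fails on site_B.
preds  = [1, 1, 0, 1, 0, 0, 1, 0]
labels = [1, 1, 0, 1, 1, 1, 0, 1]
sites  = ["site_A"] * 4 + ["site_B"] * 4

print(accuracy_by_subgroup(preds, labels, sites))
# site_A: 1.0 vs site_B: 0.0 -- a red flag hidden by the 0.5 overall average
```

The same pattern extends to any metric (AUROC, calibration, sensitivity) stratified by region, demographic group, or data-collecting institution.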

6. From Data to Deployed Model: A Typical Workflow

Finally, a simplified end-to-end workflow for AI projects in drug discovery looks like this:

  • (1) Define the question: what do we want to predict or generate (e.g., activity, toxicity, biomarkers, response)?
  • (2) Inventory the data: identify available sources and gaps.
  • (3) Clean and engineer features: handle missing values, normalize, correct batch effects, and construct features.
  • (4) Select and train models: choose among classical ML, deep learning, and generative approaches.
  • (5) Design evaluation and validation: adopt appropriate splits and metrics, including external or prospective validation when possible.
  • (6) Close the loop with experiments or clinical data: use new results to refine and update models.

From Part 3 onward, we will instantiate this workflow for specific modalities (small molecules, antibodies, cell and gene therapies, nucleic acid drugs, and others) and discuss where AI tends to generate tangible value most efficiently.

My Thoughts and Future Outlook

Once you try to operationalize AI in drug discovery, you quickly realize that the central challenge is not “which algorithm is best,” but rather how we design data and labels. In this sense, AI in drug discovery is evolving into a competition over data and question design rather than a pure race for model sophistication. Two teams using the same public data can reach very different levels of impact depending on how they define labels, construct cohorts, and set up evaluation.

At the same time, there is a familiar gap between “we seem to have plenty of data” and “we actually have very little data we can use for modeling.” Bridging this gap requires early collaboration between data scientists and domain experts across R&D, translational medicine, safety, and manufacturing. Questions such as “Is this signal noise or biology?”, “Is this endpoint clinically meaningful?”, and “Can we realistically collect this data at scale?” have to be answered jointly. In the next parts of this series, we will dive deeper into concrete, modality-specific examples and gradually build a more practical blueprint for designing data and models that truly matter.

This article has been edited by the Morningglorysciences team.


Author of this article

After completing graduate school, I trained at a top-tier research hospital in the U.S., where I became deeply involved in developing treatments and therapeutics. I have since worked for several major pharmaceutical companies in the U.S., focusing on research, business development, venture creation, and investment. During this time, I have also served as a faculty member in a university graduate program.
