Scaling Generative Foundation Models for Chest Radiography with Rectified Flow Transformers

Fabio De Sousa Ribeiro1,2,* · Emma A.M. Stanley1,* · Charles Jones1 · Tian Xia1 · Dominic C. Marshall1,3 · Laurent Renard Triché4 · Christopher V. Cosgriff6,7 · Panagiotis Dimitrakopoulos2,5 · Sotirios A. Tsaftaris2,5 · Ben Glocker1,2
1Imperial College London · 2Causality in Healthcare AI (CHAI) Hub · 3Cleveland Clinic London · 4Department of Perioperative Medicine, CHU Clermont-Ferrand · 5University of Edinburgh · 6Department of Medicine, Massachusetts General Hospital · 7Broad Institute of MIT and Harvard · *Joint first authors

TL;DR

  • First billion-parameter scale generative foundation model for chest radiography trained from scratch.
  • Built on CXR7-1M, the largest open-source CXR dataset, with over 1M radiographs and clinical expert-guided metadata.
  • State-of-the-art synthesis fidelity, indistinguishable from real radiographs to clinical experts.

Frontier generative foundation model for chest radiography. a) The proposed CXR7-1M dataset, harmonised from seven existing datasets and augmented with additional radiologist-guided metadata. b) Radiographic rectified flow transformer (RadiT), and VAE trained with a domain-specific Rad-DINO perceptual loss (Rad-VAE). c) Synthetic 512×512 resolution chest radiographs generated using our RadiT XL (1.3B) model.

Abstract

We introduce the first generative foundation model for chest radiograph synthesis trained from scratch at the billion-parameter scale. Existing radiographic AI models often suffer from poor generalisation across patient subpopulations, institutions, and acquisition settings, resulting in limited real-world clinical utility. Controlled, high-fidelity synthesis of chest radiographs is a promising path toward diversifying clinical datasets and evaluating the robustness of diagnostic models. Therefore, we present the largest specialist generative foundation model for chest radiographs to date, with over 1.3B parameters, trained for 1.6T tokens on a curated, heterogeneous dataset comprising 1.2M radiographs and clinical expert-guided metadata. Our model supports controllable radiograph generation and editing across multiple demographic subgroups, acquisition views, and a dozen pathologies. Moreover, we significantly advance the state-of-the-art in radiograph synthesis fidelity, producing images that are indistinguishable from real radiographs to clinical experts.

Key Contributions

The CXR7-1M Dataset

We collate the CXR7-1M dataset, the largest-scale open-source chest X-ray dataset to date, comprising over 1.2M radiographs, harmonised from multiple existing datasets and paired with radiologist-guided metadata systematically extracted through expert consultation.

CXR7-1M dataset composition, showing the seven different original dataset sources and all the available patient metadata variables, which were harmonised through iterative consultation with clinical experts.

Frontier Generative Foundation Model for Chest Radiography

We build a series of scaled rectified flow transformers for chest X-ray generation, up to 1.3B parameters. Our largest model, RadiT XL, was trained for 1.6T tokens and attains four-fold FDD and ten-fold KDD improvements over the prior state of the art on the CheXGenBench benchmark.
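As a rough illustration of the rectified flow objective behind these models, the sketch below uses plain Python lists in place of image or latent tokens. The convention that t = 0 is noise and t = 1 is data, all function names, and the `oracle_model` stand-in for the transformer are assumptions made for illustration; the actual RadiT parameterisation, conditioning mechanism, and sampler are not specified on this page.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def flow_loss(model, x1, metadata, rng):
    """Single-sample rectified flow loss: regress the constant velocity x1 - x0."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # Gaussian noise sample
    t = rng.random()                          # uniform time in [0, 1)
    xt = interpolate(x0, x1, t)
    pred = model(xt, t, metadata)             # predicted velocity at (xt, t)
    target = [b - a for a, b in zip(x0, x1)]
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(x1)

def euler_sample(model, x0, metadata, steps=8):
    """Integrate dx/dt = model(x, t, metadata) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = model(x, k * dt, metadata)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def oracle_model(xt, t, metadata):
    """Hypothetical stand-in for the transformer: the exact velocity field when
    the data distribution is the single point stored in metadata["target"]."""
    return [(c - x) / (1.0 - t) for c, x in zip(metadata["target"], xt)]

rng = random.Random(0)
meta = {"target": [0.5, -1.0, 2.0]}           # toy conditioning, not real patient metadata
loss = flow_loss(oracle_model, meta["target"], meta, rng)
noise = [rng.gauss(0.0, 1.0) for _ in range(3)]
sample = euler_sample(oracle_model, noise, meta)
```

In this toy setting the oracle velocity matches the regression target, so the loss is zero up to rounding, and because the path is a straight line the Euler sampler lands on the target point regardless of the step count. A trained network only approximates this behaviour.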

We optimise our VAE variants for radiographic fidelity by either training them from scratch or LoRA fine-tuning from a FLUX.2 VAE base, using a domain-specific Rad-DINO perceptual loss.

a) Latent-space rectified flow models operate on Rad-VAE latent tokens and use patient metadata conditioning to generate controllable chest radiographs. b) Pixel-space rectified flow models operate directly on image patch tokens with the same metadata conditioning interface, avoiding an explicit VAE bottleneck.

Comparative evaluation of CXR generative fidelity. All metrics were computed using Rad-DINO features. Benchmark results are from CheXGenBench (Dutt et al., 2025). We also report results on two internal test splits from CXR7-1M, a MIMIC-CXR 5K split and a separate 50K split. Superscript (pix) denotes our flow model variants trained in pixel-space at 512×512 resolution.
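To give a feel for how a feature-space fidelity metric of this kind can be computed, here is a minimal sketch of an unbiased kernel distance (squared maximum mean discrepancy) between two sets of feature vectors. It assumes that KDD is a kernel distance in the same family as KID, evaluated on Rad-DINO features; the cubic polynomial kernel follows the usual KID convention and is an assumption here, as are the function names and the toy two-dimensional "features". The exact CheXGenBench implementation may differ.

```python
def poly_kernel(x, y):
    """Cubic polynomial kernel (x.y / d + 1)^3, the conventional KID kernel."""
    d = len(x)
    dot = sum(a * b for a, b in zip(x, y))
    return (dot / d + 1.0) ** 3

def kernel_distance(real, fake):
    """Unbiased squared-MMD estimate between two lists of feature vectors."""
    m, n = len(real), len(fake)
    # Within-set terms exclude the diagonal (i == j) to keep the estimate unbiased.
    k_rr = sum(poly_kernel(real[i], real[j])
               for i in range(m) for j in range(m) if i != j)
    k_ff = sum(poly_kernel(fake[i], fake[j])
               for i in range(n) for j in range(n) if i != j)
    # Cross term uses every real/fake pair.
    k_rf = sum(poly_kernel(r, f) for r in real for f in fake)
    return k_rr / (m * (m - 1)) + k_ff / (n * (n - 1)) - 2.0 * k_rf / (m * n)

# Toy 2-D "features"; real Rad-DINO embeddings are high-dimensional.
real_feats = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
near_feats = [[0.1, 0.0], [0.9, 0.1], [0.0, 1.1]]   # close to the real set
far_feats = [[3.0, 3.0], [4.0, 3.0], [3.0, 4.0]]    # shifted away from it

d_near = kernel_distance(real_feats, near_feats)
d_far = kernel_distance(real_feats, far_feats)
```

Lower values indicate that the synthetic features are closer to the real ones. With very small samples the unbiased estimator can be slightly negative, which is expected and not a bug.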

Clinical Expert-Guided Causal Model

We introduce an expert-designed, clinically plausible causal graph for chest X-ray and instantiate it as the largest continuous flow-based causal model to date, spanning 19 demographic and radiological variables, and unlocking scalable exact abduction for discrete factors.

Proposed clinical expert-informed causal graph of demographic factors and radiologic findings, developed through iterative discussions with three clinical experts. A continuous flow-based structural causal model was then built such that CXR synthesis can reflect known clinical dependencies between variables.
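Counterfactual queries in a structural causal model follow the standard abduction, action, prediction recipe, which the toy sketch below walks through for a single parent-child pair. The affine mechanism, its coefficients, and the names `mechanism`, `abduct`, and `counterfactual` are hypothetical and carry no clinical meaning; the model described here uses continuous flow-based mechanisms over 19 variables, which this example does not attempt to reproduce.

```python
def mechanism(parent, noise):
    """Toy invertible mechanism: child = 2*parent + (1 + 0.5*parent) * noise.
    Invertible in the noise as long as 1 + 0.5*parent is non-zero."""
    return 2.0 * parent + (1.0 + 0.5 * parent) * noise

def abduct(parent, child):
    """Abduction: recover the exogenous noise that explains the observation."""
    return (child - 2.0 * parent) / (1.0 + 0.5 * parent)

def counterfactual(parent_obs, child_obs, parent_cf):
    """Abduction, action, prediction for one observed (parent, child) pair."""
    noise = abduct(parent_obs, child_obs)   # 1. abduction: infer the noise
    # 2. action: replace the observed parent with the intervened value parent_cf
    return mechanism(parent_cf, noise)      # 3. prediction: push the noise forward

child_cf = counterfactual(parent_obs=2.0, child_obs=5.0, parent_cf=4.0)
child_same = counterfactual(parent_obs=2.0, child_obs=5.0, parent_cf=2.0)
```

Because the mechanism is exactly invertible in the noise term, intervening with the observed parent value returns the observed child unchanged, which is a useful sanity check for any invertible mechanism.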

Controllable Image Editing

We evaluate our models on controllable image generation and editing, showing that RadiT XL achieves high-fidelity control over multiple demographic, acquisition-view, and clinical attributes.

Clinically Indistinguishable Synthetic Radiographs

We conduct a blind real-vs-synthetic discrimination experiment and show that RadiT XL produces chest radiographs that expert clinicians find indistinguishable from real ones.

Clinical experts' performance on the real-vs-synthetic task across 2 presentations. Near-chance accuracy and low intra- and inter-rater Cohen's kappa indicate high synthetic image realism.
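For readers less familiar with the agreement statistic, the sketch below computes accuracy and Cohen's kappa for a two-class real-vs-synthetic task. The label lists are made up for illustration and are not the study data; the function names are likewise hypothetical. Comparing one rater's answers across the two presentations gives an intra-rater kappa, while comparing two different raters gives an inter-rater kappa.

```python
from collections import Counter

def accuracy(answers, truth):
    """Fraction of answers that match the ground-truth labels."""
    return sum(a == t for a, t in zip(answers, truth)) / len(truth)

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each label sequence's own label frequencies.
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return float("nan")   # undefined when both sequences use a single label
    return (p_o - p_e) / (1.0 - p_e)

# Made-up toy labels, not results from the experiment.
truth = ["real", "synthetic", "synthetic", "real"]
first_pass = ["real", "real", "synthetic", "synthetic"]
second_pass = ["real", "synthetic", "real", "synthetic"]

acc = accuracy(first_pass, truth)
kappa = cohens_kappa(first_pass, second_pass)
```

A kappa near 0 means the two label sequences agree no more often than chance would predict, and a kappa of 1 means perfect agreement. Accuracy near 0.5 combined with kappa near 0 is the pattern one would expect if raters were effectively guessing.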
Synthetic chest radiographs generated by RadiT XL that all three clinical experts thought were real.