FragFM: Hierarchical Framework for Efficient Molecule Generation via Fragment-Level Discrete Flow Matching

ICLR 2026
Joongwon Lee1,*, Seonghwan Kim1,2,*, Seokhyun Moon1,*, Hyunwoo Kim3,†, Woo Youn Kim1,4,5,†
1KAIST, Department of Chemistry 2KAIST, InnoCORE AI-CRED Institute 3Dongguk University, College of Pharmacy 4KAIST, Department of Data Science 5HITS
*Equal Contribution  †Corresponding Authors
Overview of the FragFM framework
Overview. (a) A coarse-to-fine autoencoder maps an atom-level graph G to a fragment-level graph 𝒢 and a latent z; discrete + continuous flows generate (𝒢, z) while fθ picks fragments from a stochastic bag ℬ. (b) The decoder reconstructs atom-level edges via Blossom matching.

Abstract

We introduce FragFM, a novel hierarchical framework based on fragment-level discrete flow matching for efficient molecular graph generation. FragFM generates molecules at the fragment level, leveraging a coarse-to-fine autoencoder to reconstruct details at the atom level. Together with a stochastic fragment bag strategy to effectively handle a large fragment space, our framework enables more efficient, scalable molecular generation. We demonstrate that our fragment-based approach achieves better property control than atom-based methods and offers additional flexibility through fragment-bag conditioning. We also propose a Natural Product Generation benchmark (NPGen) to evaluate the ability of modern molecular graph generative models to generate natural product-like molecules. Since natural products are biologically prevalidated and differ from typical drug-like molecules, our benchmark provides a more challenging yet meaningful evaluation relevant to drug discovery. We conduct a comparative study of FragFM against various models on diverse molecular generation benchmarks, including NPGen, demonstrating superior performance. The results highlight the potential of fragment-based generative modeling for large-scale, property-aware molecular design, paving the way for more efficient exploration of chemical space.

Contributions

Molecular graph generative models

Atom-based
  • Chemically implausible structures
  • Poor scaling
  • Low controllability

Fragment-based
  • Limited fragment library
  • Poor scaling
  • Hard to recover the whole atom-level graph

FragFM (ours)

The first fragment-level graph flow matching framework that explores meaningful chemical space at scale.

What FragFM solves

| Challenge | FragFM |
| --- | --- |
| Large fragment vocabulary | Stochastic fragment bag |
| Generalizing to new fragments | GNN fragment embedding |
| One-to-many atom mapping | Coarse-to-fine autoencoder |

What FragFM shows

  • Higher performance
  • Scalability
  • Better conditioning

Method

FragFM rests on two ideas. (i) A coarse-to-fine autoencoder compresses an atom-level graph G into a much smaller fragment-level graph 𝒢 plus a single continuous latent z that carries the atom-level connectivity. (ii) On that fragment-level graph, discrete flow matching (DFM) generates fragment types; to make the huge fragment vocabulary tractable, training and sampling are restricted to a stochastic fragment bag ℬ ⊂ ℱ, optimized with an Info-NCE objective.

Coarse-to-Fine Autoencoder

The encoder first applies a deterministic fragmentation rule (BRICS) to split G into fragments, then a graph neural network φenc encodes both G and 𝒢 into a latent z that remembers which atoms were bonded across fragment cuts. The decoder φdec scores candidate inter-fragment atom pairs and feeds those scores into the Blossom matching algorithm to recover the exact atom-level edges E.
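For intuition, here is a minimal sketch of the decoder's final step, using networkx's max_weight_matching as a stand-in Blossom implementation; the pair_scores dictionary and atom indexing are hypothetical illustrations, not FragFM's actual interface.

```python
import networkx as nx

def match_cut_atoms(pair_scores):
    """Pick inter-fragment bonds by maximum-weight (Blossom) matching.

    pair_scores: {(atom_u, atom_v): score} over candidate atom pairs across
    fragment cuts, scored by the decoder phi_dec (names illustrative).
    """
    g = nx.Graph()
    for (u, v), score in pair_scores.items():
        g.add_edge(u, v, weight=score)
    # Blossom algorithm: one-to-one pairing maximizing the total decoder score.
    return nx.max_weight_matching(g, maxcardinality=True)

# Toy example: the decoder prefers bonding atom 0 to 2 and atom 1 to 3.
print(match_cut_atoms({(0, 2): 0.9, (0, 3): 0.1, (1, 2): 0.2, (1, 3): 0.8}))
```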

Coarse-to-fine autoencoder overview
Coarse-to-fine autoencoder. Atom graph G is rule-compressed to fragment graph 𝒢; a GNN encoder emits latent z; decoder + Blossom matching reconstruct atom-level edges.
Reconstruction accuracy on the test set (%)

| Dataset | Bond | Graph |
| --- | --- | --- |
| MOSES | 99.99 | 99.93 |
| GuacaMol | 99.98 | 99.42 |
| ZINC250k | 99.64 | 98.71 |
| NPGen | 99.71 | 97.43 |
$$ \textbf{Encoder:}\;\; \mathbf{G}\xrightarrow{\text{Rule}}\mathcal{G},\quad (\mathbf{G},\mathcal{G})\xrightarrow{\phi_{\text{enc}}} z $$ $$ \textbf{Decoder:}\;\; (\mathcal{G},z)\xrightarrow{\phi_{\text{dec}}}\text{score}\xrightarrow{\text{Blossom}}\mathbf{E} $$
Reconstruction: the fragment-level graph plus the latent z is a near-lossless compression, achieving >99% bond-level accuracy across all four benchmarks (see table).

Fragment-Level Flow Matching with Stochastic Bag

On the coarse graph X = (𝒢, z), FragFM runs discrete flow matching (DFM) over fragment types and continuous flow matching over the latent. To avoid a softmax over the full fragment vocabulary ℱ (which can exceed 10⁵ fragments), each step samples a bag ℬ ⊂ ℱ of size N and restricts the denoiser's softmax to ℬ. A neural density-ratio estimator fθ(Xt, x) is trained with an Info-NCE loss against one positive and N−1 negative fragments:

$$ \mathcal{L}(\theta) = -\,\mathbb{E}_{\mathcal{B}}\!\left[\,\log \frac{f_{\theta}(X_t,\,x^{+})} {\displaystyle\sum_{y\in\mathcal{B}} f_{\theta}(X_t,\,y)} \right] $$
Info-NCE pushes the positive fragment x⁺ above the N−1 sampled negatives inside ℬ, so fθ approximates the density ratio p1|t(x | Xt) / p1(x).
Scaling: training and sampling cost is O(N) per step, independent of |ℱ|, decoupling model cost from vocabulary size.
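Because the softmax in the loss runs over the bag only, the objective reduces to a cross-entropy over N classes. A minimal PyTorch sketch, assuming logits are the log-scores log fθ(Xt, y) for each in-bag fragment (shapes and names are illustrative):

```python
import torch
import torch.nn.functional as F

def bag_infonce_loss(logits, pos_idx):
    """Info-NCE over a stochastic fragment bag.

    logits : (batch, N) log f_theta(X_t, y) for every fragment y in the bag B
    pos_idx: (batch,) index of the positive fragment x+ within each bag
    Cross-entropy over the bag equals -log softmax at x+, i.e. the loss above.
    """
    return F.cross_entropy(logits, pos_idx)

# Toy usage: one sample, bag of N = 4 fragments, positive at index 2.
print(bag_infonce_loss(torch.tensor([[0.1, -0.3, 1.2, 0.0]]), torch.tensor([2])))
```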

At inference, the trained fΞΈ defines an in-bag posterior over 𝓑, which the sampler integrates into a one-step Euler transition kernel:

$$ p^{\theta}_{1|t,\mathcal{B}}(x_1 \mid X_t, \mathcal{B}) = \frac{\mathbf{1}_{\mathcal{B}}(x_1)\, f_{\theta}(X_t, x_1)} {\sum_{y\in\mathcal{B}} f_{\theta}(X_t, y)} $$ $$ p^{\theta}_{t+\Delta t|t}(x_{t+\Delta t}\mid X_t, \mathcal{B}) = \mathbb{E}_{x_1\sim p^{\theta}_{1|t,\mathcal{B}}} \!\left[\, p_{t+\Delta t|t}(x_{t+\Delta t}\mid X_t, x_1)\,\right] $$
Posterior: a softmax over the sampled bag ℬ, never over the full vocabulary ℱ.
Kernel: a standard DFM Euler step; as N → |ℱ| it converges to the exact transition kernel, so a small Δt and a moderate N suffice in practice.
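The step itself is cheap. Below is a sketch of one Euler transition for the fragment types, assuming a linear masking-style corruption schedule (one common DFM choice; FragFM's exact schedule may differ), with all names illustrative:

```python
import torch

MASK = -1  # sentinel for fragment slots not yet committed (assumed mask schedule)

@torch.no_grad()
def dfm_euler_step(logits, states, t, dt):
    """One in-bag DFM Euler transition over fragment-node types.

    logits: (n_nodes, N) log-scores f_theta(X_t, .) restricted to the bag B
    states: (n_nodes,) current in-bag fragment indices, or MASK if undecided
    """
    # Sample a denoised guess x1 from the in-bag posterior p_{1|t,B}.
    x1 = torch.distributions.Categorical(logits=logits).sample()
    # Under a linear mask schedule, a masked node unmasks w.p. dt / (1 - t).
    jump = torch.rand_like(states, dtype=torch.float) < dt / (1.0 - t)
    return torch.where((states == MASK) & jump, x1, states)

# Toy usage: 3 fragment nodes, bag of N = 5, stepping from t = 0.5 by dt = 0.1.
print(dfm_euler_step(torch.randn(3, 5), torch.tensor([MASK, 3, MASK]), 0.5, 0.1))
```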
Parameterization of fragment embedder and prediction head f_theta
Parameterization of fθ. (left) Each fragment in ℬ is embedded by a GNN fragment embedder; coarse-graph nodes are mapped to fixed-size vectors; fθ(Xt, x) is their inner product. (right) The coarse-graph embedder first vectorizes each node, then processes the whole graph with a graph transformer. Shared fragment embeddings let FragFM generalize to fragments unseen during training.
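The inner-product head in the figure admits a very small sketch. The two encoders below are placeholders for the GNN fragment embedder and the coarse-graph transformer, so only the scoring head is meant literally:

```python
import torch
import torch.nn as nn

class BagScorer(nn.Module):
    """f_theta(X_t, x) as an inner product of node and fragment embeddings."""

    def __init__(self, frag_encoder: nn.Module, node_encoder: nn.Module):
        super().__init__()
        self.frag_encoder = frag_encoder  # GNN over each fragment's atom graph
        self.node_encoder = node_encoder  # graph transformer over coarse graph

    def forward(self, coarse_graph, bag):
        h_nodes = self.node_encoder(coarse_graph)  # (n_nodes, d)
        h_frags = self.frag_encoder(bag)           # (N, d), shared embedder
        return h_nodes @ h_frags.T                 # (n_nodes, N) log-scores
```

Because fragments are scored through a shared embedder rather than a fixed output layer, any fragment that can be embedded can be scored, which is what allows swapping in unseen fragments at inference.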

Benchmarks

NPGen β€” A Natural Product Benchmark

NPGen draws 658,566 natural products from COCONUT, averaging 35.0 heavy atoms (vs. 21.7 for MOSES and 27.9 for GuacaMol). Evaluation adds NP-likeness and NP-Classifier KL divergences that reflect biological functionality rather than already-saturated distributional overlap.

UMAP of MOSES, GuacaMol, and NPGen
Distinct chemical space. UMAP of 5k samples per dataset.
Representative NPGen molecules
Representative NPGen molecules with NP-Classifier pathway/superclass/class annotations.

Quantitative Results

FragFM is a graph generative model, evaluated against atom- and fragment-level graph baselines on four standard molecular-generation benchmarks, matching or setting the state of the art across all of them. Autoregressive/sequence and one-shot diffusion/flow baselines are shown for context.

MOSES (25k generated molecules)

| Model | Valid ↑ | Unique ↑ | Novel ↑ | Filters ↑ | FCD ↓ | SNN ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| Training set | 100.0 | 100.0 | – | 100.0 | 0.48 | 0.59 |
| GraphINVENT | 96.4 | 99.8 | – | 95.0 | 1.22 | 0.54 |
| JT-VAE | 100.0 | 100.0 | 99.9 | 97.8 | 1.00 | 0.53 |
| DiGress | 85.7 | 100.0 | 95.0 | 97.1 | 1.19 | 0.52 |
| DisCo | 88.3 | 100.0 | 97.7 | 95.6 | 1.44 | 0.50 |
| Cometh | 90.5 | 100.0 | 96.4 | 97.2 | 1.44 | 0.51 |
| Cometh-PC | 90.5 | 99.9 | 92.6 | 99.1 | 1.27 | 0.54 |
| DeFoG | 92.8 | 99.9 | 92.1 | 98.9 | 1.95 | 0.55 |
| FragFM (ours) | 99.8 | 100.0 | 87.1 | 99.1 | 0.58 | 0.56 |
GuacaMol (10k generated molecules)

| Model | Val. ↑ | V.U. ↑ | V.U.N. ↑ | KL Div. ↑ | FCD ↑ |
| --- | --- | --- | --- | --- | --- |
| Training set | 100.0 | 100.0 | – | 99.9 | 92.8 |
| MCTS | 100.0 | 100.0 | 95.4 | 82.2 | 1.5 |
| DiGress | 85.2 | 85.2 | 85.1 | 92.9 | 68.0 |
| DisCo | 86.6 | 86.6 | 86.5 | 92.6 | 59.7 |
| Cometh | 94.4 | 94.4 | 93.5 | 94.1 | 67.4 |
| Cometh-PC | 98.9 | 98.9 | 97.6 | 96.7 | 72.7 |
| DeFoG | 99.0 | 99.0 | 97.9 | 97.7 | 73.8 |
| FragFM (ours) | 99.7 | 99.3 | 95.0 | 97.4 | 85.8 |
ZINC250k (25k generated molecules)

| Model | Valid ↑ | NSPDK ↓ | FCD ↓ |
| --- | --- | --- | --- |
| Training set | – | 0.0001 | 0.062 |
| GraphAF | 67.92 | 0.0432 | 16.128 |
| GraphDF | 89.72 | 0.1737 | 33.899 |
| MolHF | 94.75 | 0.0709 | 22.230 |
| GDSS | 97.12 | 0.0192 | 14.032 |
| GSDM | 92.57 | 0.0168 | 12.435 |
| GruM | 98.32 | 0.0023 | 2.235 |
| SwinGNN | 86.16 | 0.0047 | 4.398 |
| DiGress | 94.98 | 0.0021 | 3.482 |
| GGFlow | 99.63 | 0.0010 | 1.455 |
| FragFM (ours) | 99.81 | 0.0002 | 0.630 |
NPGen (30k generated natural-product-like molecules)

| Model | Val. ↑ | Unique ↑ | Novel ↑ | NP Score KL ↓ | Pathway KL ↓ | Superclass KL ↓ | Class KL ↓ | FCD ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Training set | 100.0 | 100.0 | – | 0.0006 | 0.0002 | 0.0028 | 0.0094 | 0.01 |
| GraphAF | 79.1 | 63.6 | 95.6 | 0.8546 | 0.9713 | 3.3907 | 6.6905 | 25.11 |
| JT-VAE | 100.0 | 97.2 | 99.5 | 0.5437 | 0.1055 | 1.2895 | 2.5645 | 4.07 |
| HierVAE | 100.0 | 81.5 | 97.7 | 0.3021 | 0.4230 | 0.5771 | 1.4073 | 8.95 |
| DiGress | 85.4 | 99.7 | 99.9 | 0.1957 | 0.0229 | 0.3370 | 1.0309 | 2.05 |
| DeFoG | 85.9 | 98.4 | 99.2 | 0.1550 | 0.1252 | 0.4134 | 1.3597 | 4.46 |
| FragFM (ours) | 98.0 | 99.0 | 95.4 | 0.0374 | 0.0196 | 0.1482 | 0.3570 | 1.34 |

Pathway, Superclass, and Class are NP-Classifier KL divergences (↓).

NPGen Samples Across Models

Sample galleries are shown per model (FragFM, DiGress, GraphAF, HierVAE, JT-VAE), each split into small molecules (up to 30 heavy atoms) and large molecules (31–60 heavy atoms).

Generated Sample Gallery

Random valid molecules generated by FragFM on the standard drug-like benchmarks MOSES and GuacaMol.

Conditional Generation

Property-Controlled Generation

FragFM exposes two guidance knobs: standard classifier guidance λX, which pulls the flow trajectory toward a target property c, and fragment-bag reweighting λℬ, which biases the fragment bag ℬ itself, up-weighting fragments whose predicted property matches the target.

Unconditional vs conditional fragment bag sampling
Fragment-bag guidance (λℬ). At inference, the fragment bag ℬ is reweighted by a fragment-level property predictor: fragments helpful for target c are amplified, others suppressed, yielding the conditional bag ℬ^c.

Both knobs drop out of a single Bayes decomposition of the property-conditioned in-bag transition kernel:

$$ p(X_{t+\Delta t}\mid X_t,\mathcal{B}^{c},c) \;=\; \underbrace{\;p(X_{t+\Delta t}\mid X_t,\mathcal{B}^{c})\;}_{\mathcal{B}^{c}\text{ steered by }\lambda_{\mathcal{B}}} \;\cdot\; \underbrace{\;\frac{p(c\mid X_{t+\Delta t},X_t,\mathcal{B}^{c})}{p(c\mid X_t,\mathcal{B}^{c})}\;}_{\text{steered by }\lambda_X} $$
λℬ biases the bag ℬ^c toward fragments whose predicted property matches c.
λX is the classifier-guidance strength on the in-bag transition kernel.
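A sketch of how λℬ could enter bag sampling, assuming a fragment-level property predictor that returns a log-score per vocabulary entry (all names hypothetical, not FragFM's actual API):

```python
import torch

def sample_conditional_bag(frag_log_scores, n_bag, lam_bag):
    """Draw the conditional bag B^c by reweighting fragment probabilities.

    frag_log_scores: (|F|,) predictor log-scores log p(c | fragment)
    lam_bag        : guidance strength lambda_B; 0 recovers a uniform bag
    """
    probs = torch.softmax(lam_bag * frag_log_scores, dim=0)
    # Sample N distinct fragments, favoring those predicted to match c.
    return torch.multinomial(probs, n_bag, replacement=False)

# Toy usage: vocabulary of 1,000 fragments, bag size N = 384.
bag_idx = sample_conditional_bag(torch.randn(1000), n_bag=384, lam_bag=2.0)
```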

On the hard JAK2 docking-score task (top 0.08% of ZINC250k), bag guidance alone reaches the target while DiGress collapses in validity.

QED MAE-FCD
QED. MAE–FCD across target values.
JAK2 docking KDE
JAK2 KDE. Only FragFM reaches −11.0 kcal/mol.
Bag guidance effect
Bag guidance. λℬ shifts MAE–FCD curves; red = bag-only (λX = 0).

Synthesizability via Fragment-Bag Swap

AIZynthFinder retrosynthesis on 25k MOSES samples: FragFM reaches 77% solved, close to the MOSES test set (80%). Swapping the fragment bag to fragments drawn from solved (synthesizable) molecules, with no retraining, pushes this to 85%, exceeding the dataset itself.
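The bag swap itself requires no model machinery. A minimal sketch of building the restricted bag with RDKit's BRICS decomposition (FragFM's exact fragment bookkeeping, e.g. attachment-point handling and canonicalization, may differ):

```python
from rdkit import Chem
from rdkit.Chem import BRICS

def solved_fragment_bag(solved_smiles):
    """Union of BRICS fragments from AIZynthFinder-solved molecules.

    The generator is then sampled with this restricted bag using the same
    trained weights; no retraining is involved.
    """
    bag = set()
    for smi in solved_smiles:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            bag.update(BRICS.BRICSDecompose(mol))  # fragment SMILES strings
    return bag
```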

Figure: swapping the fragment bag improves synthesizability without retraining. Each glyph is a chemical fragment; FragFM assembles fragments into molecules. Restricting the general bag to the subset of fragments drawn only from AIZynth-solved molecules (faded glyphs are filtered out) lifts the solved rate from 77% to 85% with the same model weights θ and no retraining.
| Model | AIZynthFinder Solved ↑ | 1 step | 2 steps | 3 steps | 4+ steps |
| --- | --- | --- | --- | --- | --- |
| MOSES (test set) | 80.1 | 53.9 | 16.0 | 5.17 | 5.0 |
| FragFM | 77.0 | 46.3 | 18.7 | 6.2 | 5.8 |
| FragFM · solved bag | 85.0 | 52.3 | 20.7 | 6.6 | 5.4 |
Fraction (%) of 25k MOSES generations solved by AIZynthFinder, split by route length.
AIZynthFinder synthesis-step distribution across models
Baseline comparison. FragFM has the lowest unsolved rate among learned generators.

Scalability

Denoising at the fragment level, not over individual atoms, keeps validity above 95% and FCD below 1.0 even under aggressive step reduction, with roughly 5× faster wall-clock sampling than DiGress on NPGen.

Sampling step ablation
Step ablation. Atom-based baselines collapse as steps shrink; FragFM holds up.
Wall-clock sampling time
Wall-clock time. MOSES and NPGen; (·) = sampling steps.

Additional Results

Ablations over the fragmentation rule, sampling temperature, fragment-bag size, and how well FragFM recovers rare fragments.

(a) Robustness to fragmentation rule: BRICS vs. RECAP vs. rBRICS, no retraining
Swapping BRICS for RECAP or rBRICS, keeping all hyperparameters fixed, barely moves the quality metrics. The framework is not tied to a specific fragmentation scheme.
| Rule | Valid ↑ | Unique ↑ | Novel ↑ | Filters ↑ | FCD ↓ | SNN ↑ | Valid ↑ | NSPDK ↓ | FCD ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Training set | 100.0 | 100.0 | – | 100.0 | 0.48 | 0.59 | 100.0 | 0.0001 | 0.062 |
| BRICS (default) | 99.8 | 100.0 | 87.1 | 99.1 | 0.58 | 0.56 | 99.81 | 0.0002 | 0.630 |
| RECAP | 99.8 | 99.9 | 83.6 | 99.3 | 0.56 | 0.57 | 99.66 | 0.0003 | 0.580 |
| rBRICS | 99.8 | 100.0 | 88.5 | 98.7 | 0.58 | 0.56 | 99.79 | 0.0003 | 0.563 |

The first six metric columns are MOSES; the last three are ZINC250k.
(b) Temperature scaling on MOSES: quality–novelty trade-off via 𝒯pred and 𝒯bag
Two knobs shift the quality–novelty frontier: 𝒯pred on the denoiser's softmax and 𝒯bag on fragment-bag sampling (see the sketch after the table). Raising 𝒯bag pushes Novel and Scaf up at a small FCD cost.
| 𝒯pred | 𝒯bag | Valid ↑ | Unique ↑ | Novel ↑ | Filters ↑ | FCD ↓ | SNN ↑ | Scaf ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1.0 | 1.0 | 99.8 | 100.0 | 87.1 | 99.1 | 0.58 | 0.56 | 10.9 |
| 1.0 | 1.5 | 99.2 | 100.0 | 94.2 | 98.3 | 0.90 | 0.52 | 13.5 |
| 1.5 | 1.0 | 99.7 | 100.0 | 88.6 | 98.8 | 0.88 | 0.54 | 11.0 |
| 1.5 | 1.5 | 99.2 | 100.0 | 94.5 | 98.3 | 0.91 | 0.51 | 13.1 |
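Both knobs are ordinary softmax temperature scaling applied at different points; a one-line sketch (illustrative, not FragFM's code):

```python
import torch

def tempered_probs(logits, temperature):
    """Softmax with temperature: T_pred tempers the denoiser's in-bag softmax,
    T_bag tempers bag sampling. T > 1 flattens the distribution, trading a
    little FCD for more novelty and scaffold diversity."""
    return torch.softmax(logits / temperature, dim=-1)
```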
(c) Effect of fragment-bag size on MOSES: inference-time and training-time bag-size ablations
Performance largely saturates above the default (N = 384, dashed line). Inference-time bag size matters more than training-time bag size: a moderate bag suffices for training, while inference quality continues to improve with larger Ninference.
Effect of inference-time fragment-bag size
(a) Inference-time bag size (with Ntrain = 384). Validity / Filters / FCD vs. Ninference on a log scale.
Effect of training-time fragment-bag size
(b) Training-time bag size (with Ninference = 384). The black dashed line marks the default FragFM configuration.
(d) Long-tail fragment recovery: 20 rarest fragments per dataset, training vs. FragFM ratios
For each dataset, the 20 rarest fragments (k = 1 is the rarest) are plotted. FragFM-generated ratios closely track the training-set ratios, even for fragments that appear only a handful of times in training.
Long-tail fragment recovery on MOSES
(a) MOSES
Long-tail fragment recovery on GuacaMol
(b) GuacaMol
Long-tail fragment recovery on NPGen
(c) NPGen