GRADE

Grounded Reasoning Assessment for Discipline-informed Editing

The First Benchmark for Evaluating Discipline-Informed Knowledge and Reasoning in Image Editing

Multimodal · 520 Samples · 10 Domains
I. Abstract

GRADE is the first benchmark for evaluating discipline-informed knowledge and reasoning in image editing. It comprises 520 carefully curated samples across 10 academic domains, from natural science to social science, and provides a multi-dimensional automated evaluation protocol that jointly assesses Discipline Reasoning, Visual Consistency, and Logical Readability.

II. Overview

GRADE Distribution

Distribution of 520 samples across 10 academic disciplines with fine-grained sub-categories.

III. Evaluation Pipeline

Discipline Reasoning

Evaluates whether edited results correctly reflect the underlying disciplinary knowledge through a structured, QA-guided VQA protocol. Scoring uses weighted binary checkpoints; scores fall in the range [0, 1].
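As a minimal sketch of this weighted binary scoring (the function and variable names below are our own illustration, not the GRADE codebase), each reasoning question contributes its weight when answered correctly, and the sample's reasoning score is the weighted fraction of correct answers:

```python
def reasoning_score(correct, weights):
    """Weighted binary scoring over one sample's reasoning questions.

    correct : list of 0/1 flags, one per reasoning question
    weights : per-question weights (positive numbers)
    Returns the weighted fraction of correct answers, in [0, 1].
    """
    return sum(w * c for w, c in zip(weights, correct)) / sum(weights)
```

For example, with questions weighted [2, 1, 1] and the middle one answered wrong, the score is 3/4 = 0.75.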

Visual Consistency

Assesses whether edits blend consistently with the expected visual structure. Three task-dependent types: Localized, Style, and Independence. Scored on {0, 1, 2}.

Logical Readability

Evaluates whether the edited image presents discipline content in a clear, logically coherent, and interpretable form. Scored on {0, 1, 2}.

Accuracy (Strict)

A sample counts as passed only when all reasoning questions are answered correctly and both Visual Consistency and Logical Readability receive full marks.

Relax Score

A weighted combination of the three dimensions: Reasoning contributes 60%, Consistency 30%, and Readability 10%, each normalized to a 0–100 scale.
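The two headline metrics above can be sketched as follows (a hedged illustration under the stated definitions; function names and the exact normalization are our assumptions, with Reasoning in [0, 1] and the other two dimensions in {0, 1, 2}):

```python
def strict_accuracy(reasoning, consistency, readability):
    """Strict pass: all reasoning questions correct (score 1.0)
    and full marks (2/2) on both other dimensions."""
    return reasoning == 1.0 and consistency == 2 and readability == 2

def relax_score(reasoning, consistency, readability):
    """Relax Score: normalize each dimension to [0, 100],
    then weight Reasoning 60%, Consistency 30%, Readability 10%."""
    return (0.6 * reasoning * 100
            + 0.3 * (consistency / 2) * 100
            + 0.1 * (readability / 2) * 100)
```

A perfect sample scores 100; a sample with reasoning 0.5, consistency 1, and readability 2 scores 30 + 15 + 10 = 55.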

IV. Leaderboard

Model                        Reasoning  Consistency  Readability  Accuracy

Proprietary Models
 1. Nano Banana Pro               77.5         89.5         95.8      46.2
 2. Nano Banana 2                 72.6         86.4         95.9      39.6
 3. Seedream 5.0                  64.1         87.5         90.6      24.7
 4. GPT-Image-1.5                 54.5         82.3         90.7      16.0
 5. FLUX.2 Max                    47.8         67.2         68.6      11.9
 6. Nano Banana                   42.2         75.1         82.0       9.0
 7. Seedream 4.5                  41.3         55.6         82.1       6.9
 8. GPT-Image-1.0                 44.0         65.2         82.3       6.0
 9. FLUX.2 Pro                    38.9         55.5         70.3       4.4
10. Seedream 4.0                  32.4         53.2         77.0       3.1

Open-Source Models
11. Qwen-Edit-2511                18.6         45.2         52.1       2.7
12. Step-1x (think+reflect)       19.2         57.2         66.9       2.3
13. Step-1x (think)               17.6         56.3         68.2       1.4
14. DreamOmni                     17.4         83.2         89.1       1.0
15. Step-1x                       17.3         52.8         63.7       1.0
16. Bagel                         15.2         58.6         69.8       0.6
17. Bagel (think)                 15.6         54.8         67.8       0.2
18. ICEdit                         9.8         33.2         56.5       0.2
19. FLUX.2 dev                    11.3         17.6         49.6       0.2
20. OmniGen                        9.7         33.6         51.6       0.0
V. Benchmark Explorer

Select a discipline and sub-category to view sample editing tasks with grading rubrics.

VI. Citation

@article{grade2026,
  title={GRADE: Benchmarking Discipline-Informed Reasoning in Image Editing},
  author={Mingxin Liu and Ziqian Fan and Zhaokai Wang and Leyao Gu and Zirun Zhu and Yiguo He and Yuchen Yang and Changyao Tian and Xiangyu Zhao and Ning Liao and Shaofeng Zhang and Qibing Ren and Zhihang Zhong and Xuanhe Zhou and Junchi Yan and Xue Yang},
  year={2026}
}