AGFSync: Leveraging AI-Generated Feedback for Preference Optimization in Text-to-Image Generation

1Beihang University; 2Peking University; 3Tsinghua University;
4Huazhong University of Science and Technology; 5Zhongguancun Laboratory;
Equal contribution   ✉ Corresponding author  

Our paper has been accepted by AAAI-2025!

Highlights

  • High-Quality Dataset Construction: We release a 45.8K-entry dataset of AI-generated prompts, corresponding SDXL images, and QA pairs, supporting T2I research while reducing reliance on manual annotation.

  • AI-driven Alignment Method: AGFSync leverages multiple metrics and DPO fine-tuning to improve text fidelity and aesthetics without manual intervention.

  • Exceptional Performance and High Efficiency: AGFSync outperforms existing alignment methods in text alignment and image quality while remaining efficient and inexpensive, requiring no human annotation.

Summary Video

Motivation

Motivation Image 1
Motivation Image 2
Motivation Image 3
Motivation Image 4

1. Text-to-Image (T2I) models face challenges: While T2I models can generate images from text, achieving consistent alignment between text and image content and ensuring logical coherence remain significant hurdles.

2. Multimodal models offer new possibilities: Vision-language models combine strong natural language capabilities with image understanding, allowing their cross-modal reasoning to replace human evaluators in assessing image quality.

3. Manual annotation is unsustainable and costly: With T2I models scaling from 860M to 2.6B parameters (SD v1.5 to SDXL), the data required for training is growing rapidly. Relying on costly manual annotation is impractical, making AI-driven annotation essential.

Examples of Our Open-Sourced Dataset

These examples show that each entry in our preference dataset consists of a caption together with a corresponding preferred (good) image and rejected (bad) image. This structure enables precise evaluation of text-image alignment.
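The structure of a dataset entry can be sketched as follows. This is a minimal illustration only: the field names (`caption`, `chosen_image`, `rejected_image`, `qa_pairs`) and file paths are hypothetical, and the released dataset's actual schema may differ.

```python
# Hypothetical entry layout; field names are illustrative, not the released schema.
entry = {
    "caption": "A red bicycle leaning against a brick wall",  # AI-generated prompt
    "chosen_image": "images/0001_chosen.png",      # higher-scoring SDXL sample
    "rejected_image": "images/0001_rejected.png",  # lower-scoring SDXL sample
    "qa_pairs": [  # question-answer pairs used for VQA-based scoring
        {"question": "Is there a bicycle?", "answer": "yes"},
        {"question": "What color is the bicycle?", "answer": "red"},
    ],
}

def is_valid(e):
    # Minimal consistency check: one caption paired with a chosen and a
    # rejected image, plus at least one QA pair for alignment scoring.
    required = ("caption", "chosen_image", "rejected_image", "qa_pairs")
    return all(k in e for k in required) and len(e["qa_pairs"]) > 0
```

A loader for the real dataset would perform the same kind of validation before constructing preference pairs for training.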


Dataset Example

Abstract

Text-to-Image (T2I) diffusion models have achieved remarkable success in image generation. Despite this progress, challenges remain in prompt-following ability, image quality, and the lack of high-quality datasets essential for refining these models. As acquiring labeled data is costly, we introduce AGFSync, a framework that enhances T2I diffusion models through Direct Preference Optimization (DPO) in a fully AI-driven approach. AGFSync utilizes Vision-Language Models (VLMs) to assess image quality across style, coherence, and aesthetics, generating feedback data within an AI-driven loop. Applying AGFSync to leading T2I models such as SD v1.4, SD v1.5, and SDXL-base, our extensive experiments on the TIFA dataset demonstrate notable improvements in VQA scores, aesthetic evaluations, and performance on the HPSv2 benchmark, consistently outperforming the base models. AGFSync's method of refining T2I diffusion models paves the way for scalable alignment techniques. Our code and dataset are publicly available.

Method Overview

Overview of AGFSync. This framework consists of three main steps: (1) Preference Candidate Set Generation, (2) Preference Pair Construction, and (3) DPO Alignment. AGFSync effectively learns from AI-generated feedback data using DPO, eliminating the need for human annotation, model architecture modifications, or reinforcement learning.
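The DPO alignment step above optimizes the model on preference pairs without a reward model. As a rough sketch, the standard DPO objective on a single pair is shown below; this uses generic per-sample log-probabilities for illustration, whereas training a diffusion model requires a diffusion-adapted variant of the loss, and all names here are illustrative.

```python
import math

def dpo_loss(beta, pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l):
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    pi_logp_w / pi_logp_l:   policy log-probs of the chosen / rejected sample
    ref_logp_w / ref_logp_l: frozen reference-model log-probs of the same samples
    beta:                    strength of the KL-like penalty toward the reference
    """
    margin = (pi_logp_w - pi_logp_l) - (ref_logp_w - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# When the policy prefers the chosen sample more strongly than the reference
# does, the margin is positive and the loss drops below log(2) ≈ 0.693.
loss = dpo_loss(beta=0.1, pi_logp_w=-4.0, pi_logp_l=-6.0,
                ref_logp_w=-5.0, ref_logp_l=-5.5)
```

Minimizing this loss pushes the policy to assign relatively higher likelihood to the AI-preferred image in each pair while the reference term keeps it anchored to the base model.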

BibTeX

@article{an2024agfsync,
    title={AGFSync: Leveraging AI-Generated Feedback for Preference Optimization in Text-to-Image Generation},
    author={An, Jingkun and Zhu, Yinghao and Li, Zongjian and Feng, Haoran and Chen, Bohua and Shi, Yemin and Pan, Chengwei},
    journal={arXiv preprint arXiv:2403.13352},
    year={2024}
}