Researchers from Sun Yat-sen University and Lenovo Research have proposed AutoStudio, a training-free multi-agent framework for multi-round interactive image generation. AutoStudio is capable of generating diverse images while maintaining subject consistency across multiple rounds of interaction with users.
How AutoStudio Works
AutoStudio employs three LLM-based agents to interpret user intent and generate layout guidance for the Stable Diffusion (SD) model. It also introduces a Parallel-UNet (P-UNet) architecture and a subject-initialized generation method, which supply the SD model with subject-aware features and help it generate high-quality images in which multiple subjects stay consistent from round to round.
Related Links
- Project website: https://howe183.github.io/AutoStudio.io/
- Paper: https://arxiv.org/pdf/2406.01388
- Code: https://github.com/donahowe/AutoStudio
Paper Reading: AutoStudio: Crafting Consistent Subjects in Multi-turn Interactive Image Generation
Abstract
As state-of-the-art text-to-image (T2I) generation models have become adept at producing excellent single images, a more challenging task, multi-round interactive image generation, has begun to attract attention in the research community. This task requires a model to interact with users over multiple rounds to generate a coherent sequence of images. However, because users may switch subjects frequently, current approaches struggle to maintain subject consistency while generating diverse images. To address this issue, we introduce a training-free multi-agent framework called AutoStudio.
AutoStudio uses three LLM-based agents to handle interactions and an SD-based agent to generate high-quality images. Specifically, AutoStudio includes:
- A Subject Manager for interpreting the interactive dialogue and managing the context of each subject
- A Layout Generator for generating fine-grained bounding boxes that control subject positions
- A Supervisor for providing layout refinement suggestions
- A Drawer for completing image generation
Additionally, we introduce Parallel-UNet (P-UNet) to replace the original UNet in the Drawer; it uses two parallel cross-attention modules to exploit subject-aware features. We also introduce a subject-initialized generation method to better preserve small subjects. AutoStudio can thus generate a sequence of multi-subject images interactively and consistently. Extensive experiments on the public CMIGBench benchmark and human evaluations show that AutoStudio maintains multi-subject consistency across multiple rounds, and that it improves on the previous state of the art by 13.65% in average Fréchet Inception Distance and 2.83% in average character-character similarity.
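The idea of two parallel cross-attention modules can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names, the absence of learned query/key/value projections, and the weighted-sum fusion controlled by `subject_weight` are all assumptions made for illustration. The sketch only shows the general pattern of letting the same latent tokens attend to text features in one branch and to per-subject reference features in the other, then combining the two results.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention; learned projections are omitted for brevity."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


def parallel_cross_attention(latents, text_feats, subject_feats, subject_weight=0.5):
    """Run a text branch and a subject branch side by side, then blend them.

    The linear blend is a hypothetical fusion rule chosen for this sketch.
    """
    text_out = cross_attention(latents, text_feats, text_feats)
    subject_out = cross_attention(latents, subject_feats, subject_feats)
    return [
        [(1.0 - subject_weight) * t + subject_weight * s for t, s in zip(t_row, s_row)]
        for t_row, s_row in zip(text_out, subject_out)
    ]
```

With a single text token and a single subject token, each branch simply returns that token's features, so `subject_weight=0.5` yields their midpoint and `subject_weight=0.0` reduces to ordinary text-only cross-attention.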
Method
AutoStudio combines four agents with a subject database to perform multi-round, multi-subject interactive image generation:
- Subject Manager interprets the user dialogue and tracks the context of each subject
- Layout Generator proposes a bounding-box layout for the subjects
- Supervisor provides layout refinement suggestions
- Drawer generates images based on the refined layout and the subject database
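The control flow of one interaction round can be sketched as follows. This is a toy stand-in, not the released code: the LLM agents and the diffusion model are replaced by deterministic placeholder functions, and every name here (`Box`, `run_round`, the dictionary format of a user turn, the clamping rule used by the supervisor) is hypothetical. The point is only to show how a shared database lets a later round reuse a subject introduced in an earlier one.

```python
from dataclasses import dataclass


@dataclass
class Box:
    subject_id: str
    x: float  # left edge, as a fraction of canvas width
    y: float  # top edge, as a fraction of canvas height
    w: float
    h: float


def subject_manager(turn):
    """Placeholder for the LLM that maps a dialogue turn to stable subject IDs."""
    return [name.strip().lower() for name in turn["subjects"]]


def layout_generator(subject_ids):
    """Placeholder layout: a naive left-to-right split that may overflow the canvas."""
    width = 1.0 / max(len(subject_ids), 1)
    return [
        Box(sid, i * width - 0.05, 0.25, width + 0.1, 0.5)
        for i, sid in enumerate(subject_ids)
    ]


def supervisor(layout):
    """Placeholder refinement: clamp every box to the unit canvas."""
    refined = []
    for b in layout:
        x0, y0 = max(0.0, b.x), max(0.0, b.y)
        x1, y1 = min(1.0, b.x + b.w), min(1.0, b.y + b.h)
        refined.append(Box(b.subject_id, x0, y0, x1 - x0, y1 - y0))
    return refined


def drawer(caption, layout, database, round_idx):
    """Placeholder drawer: reuse stored subjects, register new ones."""
    rendered = []
    for box in layout:
        if box.subject_id in database:
            rendered.append((box.subject_id, "reused"))
        else:
            # A real system would store reference features; a label suffices here.
            database[box.subject_id] = f"reference from round {round_idx}"
            rendered.append((box.subject_id, "new"))
    return {"prompt": caption, "subjects": rendered}


def run_round(turn, database, round_idx):
    """One round: manage subjects, propose a layout, refine it, draw."""
    subject_ids = subject_manager(turn)
    layout = supervisor(layout_generator(subject_ids))
    return drawer(turn["caption"], layout, database, round_idx)
```

Calling `run_round` twice with the same `database` dictionary, first with subjects `["Girl", "Dog"]` and then with `["dog", "cat"]`, marks the dog as reused in the second round while the cat is registered as new.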
Results Showcase
[Figures: continuous dialogue; multi-round interactive image generation; multi-functionality binding]
Conclusion
This paper introduces AutoStudio, a training-free multi-agent framework for multi-round interactive image generation. AutoStudio employs three LLM-based agents to interpret user intent and generate layout guidance for the SD model. It also introduces the Parallel-UNet (P-UNet) architecture and a subject-initialized generation method, which supply the SD model with subject-aware features and help it generate high-quality images with consistent subjects. Experiments on CMIGBench and human evaluations show that AutoStudio outperforms prior methods, opening up new possibilities for advanced and user-friendly text-to-image applications.