AutoStudio: AI Creates Comics with Consistent Characters & Plots

Researchers from Sun Yat-sen University and Lenovo Research have proposed AutoStudio, a training-free multi-agent framework for multi-round interactive image generation. AutoStudio is capable of generating diverse images while maintaining subject consistency across multiple rounds of interaction with users.

How AutoStudio Works

AutoStudio employs three LLM-based agents to interpret user intent and produce layout guidance for a Stable Diffusion (SD) model. It also introduces a Parallel-UNet (P-UNet) architecture and a subject-initialized generation method, which give the SD model subject-aware features and help it generate high-quality images in which multiple subjects stay consistent. In the authors' experiments, AutoStudio outperforms prior methods across the evaluated tasks.

Related Links

  • Project website: https://howe183.github.io/AutoStudio.io/
  • Paper: https://arxiv.org/pdf/2406.01388
  • Code: https://github.com/donahowe/AutoStudio

Paper Reading: AutoStudio: Crafting Consistent Subjects in Multi-turn Interactive Image Generation

Abstract

State-of-the-art text-to-image (T2I) generation models already excel at producing single images, so a more challenging task, multi-round interactive image generation, has begun to attract attention in the research community. This task requires the model to interact with users over multiple rounds to generate a coherent sequence of images. However, because users may switch subjects frequently, current efforts struggle to maintain subject consistency while generating diverse images. To address this issue, we introduce a training-free multi-agent framework called AutoStudio.

AutoStudio uses three LLM-based agents to handle interactions and one SD-based agent to generate high-quality images. Specifically, AutoStudio consists of:

  1. A Subject Manager that interprets the interactive dialogue and manages the context of each subject
  2. A Layout Generator that generates fine-grained bounding boxes to control subject positions
  3. A Supervisor that provides layout refinement suggestions
  4. A Drawer that completes image generation
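
To make the division of labor concrete, here is a minimal sketch of the kind of layout that could pass between the Layout Generator and the Supervisor. The `Box` and `review_layout` names, the normalized [0, 1] coordinate convention, and the overlap threshold are all illustrative assumptions and are not taken from the AutoStudio code. In the framework itself, refinement suggestions come from an LLM rather than from hand-written rules like these.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical layout entry: one subject and its bounding box.

    Coordinates are assumed to be normalized to [0, 1], with (x0, y0) the
    top-left corner and (x1, y1) the bottom-right corner.
    """
    subject: str
    x0: float
    y0: float
    x1: float
    y1: float

    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes (0.0 when they do not overlap)."""
    w = min(a.x1, b.x1) - max(a.x0, b.x0)
    h = min(a.y1, b.y1) - max(a.y0, b.y0)
    if w <= 0 or h <= 0:
        return 0.0
    inter = w * h
    union = a.area() + b.area() - inter
    return inter / union if union > 0 else 0.0


def review_layout(layout: list[Box], max_iou: float = 0.5) -> list[str]:
    """Rule-based stand-in for a supervisor: return refinement suggestions."""
    suggestions = []
    # Flag boxes that are degenerate or leave the canvas.
    for box in layout:
        inside = 0.0 <= box.x0 < box.x1 <= 1.0 and 0.0 <= box.y0 < box.y1 <= 1.0
        if not inside:
            suggestions.append(f"{box.subject}: box must lie inside the canvas")
    # Flag pairs of subjects whose boxes overlap heavily.
    for i, a in enumerate(layout):
        for b in layout[i + 1:]:
            if iou(a, b) > max_iou:
                suggestions.append(f"{a.subject}/{b.subject}: boxes overlap too much")
    return suggestions


layout = [
    Box("girl", 0.10, 0.20, 0.50, 0.90),
    Box("dog", 0.15, 0.25, 0.50, 0.90),   # nearly on top of the girl
    Box("kite", 0.70, 0.00, 1.20, 0.30),  # spills past the right edge
]
for line in review_layout(layout):
    print(line)
```

A layout that draws no suggestions would go straight to the Drawer; otherwise the suggestions would be fed back to revise the boxes.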

Additionally, we introduce Parallel-UNet to replace the original UNet in the Drawer; it uses two parallel cross-attention modules to exploit subject-aware features. We also introduce a subject-initialized generation method to better preserve small subjects. AutoStudio can thus generate a sequence of multi-subject images interactively and consistently. Extensive experiments on the public CMIGBench benchmark and human evaluations show that AutoStudio maintains multi-subject consistency well across multiple rounds, and that it improves on the state of the art by 13.65% in average Fréchet Inception Distance and 2.83% in average character-character similarity.
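
The idea of two parallel cross-attention modules can be illustrated with a small, dependency-free sketch. It runs the same image-latent queries through one attention branch over text tokens and another over stored subject features, then blends the two outputs. The function names, the absence of learned projections, and the weighted-sum blend controlled by `subject_weight` are assumptions made for illustration; the abstract does not specify how the two branches are combined.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head scaled dot-product attention without learned projections.

    queries: list of d-dimensional vectors (e.g. image latent tokens)
    context: list of d-dimensional vectors used as both keys and values
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, context)) for j in range(d)])
    return out


def parallel_cross_attention(latents, text_tokens, subject_tokens, subject_weight=0.5):
    """Run two cross-attention branches on the same queries and blend them.

    The weighted sum is an assumption of this sketch, not the paper's formula.
    """
    text_out = cross_attention(latents, text_tokens)
    subject_out = cross_attention(latents, subject_tokens)
    return [
        [(1.0 - subject_weight) * t + subject_weight * s for t, s in zip(t_row, s_row)]
        for t_row, s_row in zip(text_out, subject_out)
    ]


latents = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[1.0, 1.0], [0.5, -0.5]]
subject_tokens = [[2.0, 0.0]]  # a single stored subject feature

blended = parallel_cross_attention(latents, text_tokens, subject_tokens)
print(blended)
```

Setting `subject_weight` to 0.0 recovers plain text-conditioned attention, which makes it easy to see what the second branch contributes.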

Method

AutoStudio uses four agents and a subject database to accomplish multi-round, multi-subject interactive image generation:

  1. The Subject Manager interprets the user dialogue
  2. The Layout Generator proposes layouts
  3. The Supervisor provides layout refinement suggestions
  4. The Drawer generates images based on the refined layout and the subject database
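
A minimal sketch of the loop described above, with each agent reduced to a deterministic stub so the data flow is visible. Everything here is hypothetical: the function names, the keyword-matching stand-in for the LLM, the equal-strip layout, and the string placeholders stored in the database are assumptions for illustration and do not mirror the AutoStudio code. The point is only that a feature saved on a subject's first appearance is reused in later rounds.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SubjectDatabase:
    """Hypothetical store mapping a subject name to the feature saved for it."""
    features: dict[str, str] = field(default_factory=dict)

    def get_or_create(self, name: str, turn: int) -> str:
        # First appearance: record a placeholder feature; later turns reuse it.
        return self.features.setdefault(name, f"{name}@turn{turn}")


def subject_manager(utterance: str, known: set[str]) -> list[str]:
    """Stub for the LLM agent: pick out known subjects named in the text."""
    words = [w.strip(".,!?").lower() for w in utterance.split()]
    return list(dict.fromkeys(w for w in words if w in known))


def layout_generator(subjects: list[str]) -> dict[str, tuple]:
    """Stub: give each subject an equal-width vertical strip of the canvas."""
    n = len(subjects)
    return {s: (i / n, 0.0, (i + 1) / n, 1.0) for i, s in enumerate(subjects)}


def supervisor(layout: dict[str, tuple]) -> dict[str, tuple]:
    """Stub: shrink every box by a small margin so neighbours do not touch."""
    m = 0.02
    return {s: (x0 + m, y0 + m, x1 - m, y1 - m) for s, (x0, y0, x1, y1) in layout.items()}


def drawer(layout: dict[str, tuple], db: SubjectDatabase, turn: int) -> dict:
    """Stub for the diffusion agent: report which stored feature guides each box."""
    return {s: {"box": box, "feature": db.get_or_create(s, turn)} for s, box in layout.items()}


def run_dialogue(turns: list[str], vocabulary: set[str]) -> list[dict]:
    """Process each user turn in order, sharing one database across turns."""
    db = SubjectDatabase()
    images = []
    for turn, utterance in enumerate(turns, start=1):
        subjects = subject_manager(utterance, vocabulary)
        layout = supervisor(layout_generator(subjects))
        images.append(drawer(layout, db, turn))
    return images


images = run_dialogue(
    ["A girl walks her dog.", "The dog chases a cat."],
    vocabulary={"girl", "dog", "cat"},
)
for image in images:
    print({s: info["feature"] for s, info in image.items()})
```

In the second round the dog still points at the feature recorded in the first round, while the cat gets a new entry.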

Results Showcase

[Figures: Continuous Dialogue; Multi-Round Interactive Image Generation; Multi-Functionality Binding]

Conclusion

This paper introduces AutoStudio, a training-free multi-agent framework for multi-round interactive image generation. Three LLM-based agents interpret user intent and produce layout guidance for the SD model, while the Parallel-UNet (P-UNet) architecture and the subject-initialized generation method supply subject-aware features that help the model generate high-quality images with multi-subject consistency. The experiments show that AutoStudio outperforms prior methods on the evaluated tasks, which opens up new possibilities for advanced, user-friendly text-to-image applications.