GEM: Generating Engaging Multimodal Content

Chongyang Gao, Yiren Jian, Natalia Denisenko, Soroush Vosoughi, V. S. Subrahmanian

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI-24), AI, Arts & Creativity track. Pages 7654-7662. https://doi.org/10.24963/ijcai.2024/847

Generating engaging multimodal content is a key objective in numerous applications, such as online advertisements that capture user attention through a synergy of images and text. In this paper, we introduce GEM, a novel framework for generating engaging multimodal image-text posts. GEM operates in two phases. First, it combines a pre-trained engagement discriminator with a technique for deriving an effective continuous prompt tailored to the Stable Diffusion model. Second, it runs an iterative algorithm that produces coherent and compelling image-sentence pairs centered on a specified topic of interest. Through experimental analysis and human evaluations, we show that the image-sentence pairs generated by GEM not only surpass several established baselines in engagement but also achieve superior image-text alignment.
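The abstract describes a two-phase procedure but gives no implementation details, so the following is a minimal, hypothetical Python sketch of what the phase-two iterative search might look like. All function names, signatures, the candidate-sampling loop structure, and the additive engagement-plus-alignment objective are illustrative assumptions, not GEM's published method.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple
import random

# Hypothetical stand-ins for GEM's components; the paper does not publish
# these interfaces, so every signature below is an illustrative assumption.
GenerateImage = Callable[[List[float], str], str]   # continuous prompt + topic -> image id
GenerateText = Callable[[str, str], str]            # topic + image -> candidate sentence
PairScore = Callable[[str, str], float]             # (image, sentence) -> scalar score


@dataclass
class GemConfig:
    num_iterations: int = 5        # rounds of alternating refinement (assumed)
    candidates_per_round: int = 8  # image-sentence pairs sampled per round (assumed)


def gem_iterate(
    topic: str,
    prompt: List[float],
    gen_image: GenerateImage,
    gen_text: GenerateText,
    engage: PairScore,
    align: PairScore,
    cfg: GemConfig,
) -> Tuple[Optional[Tuple[str, str]], float]:
    """One plausible reading of GEM's phase two: repeatedly sample
    image-sentence pairs around the topic and keep the pair that scores
    highest under a combined engagement-and-alignment objective.
    The exact objective and weighting in the paper are not specified here."""
    best_pair: Optional[Tuple[str, str]] = None
    best_score = float("-inf")
    for _ in range(cfg.num_iterations):
        for _ in range(cfg.candidates_per_round):
            image = gen_image(prompt, topic)
            sentence = gen_text(topic, image)
            score = engage(image, sentence) + align(image, sentence)
            if score > best_score:
                best_pair, best_score = (image, sentence), score
    return best_pair, best_score


if __name__ == "__main__":
    # Toy usage with random stubs, purely to show the control flow.
    rng = random.Random(0)
    pair, score = gem_iterate(
        topic="electric bikes",
        prompt=[0.0] * 16,  # placeholder continuous prompt vector
        gen_image=lambda p, t: f"img_{rng.randrange(1000)}",
        gen_text=lambda t, img: f"caption about {t} for {img}",
        engage=lambda img, s: rng.random(),
        align=lambda img, s: rng.random(),
        cfg=GemConfig(),
    )
    print(pair, score)
```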
Keywords:
Application domains: Images, movies and visual arts
Application domains: Text, literature and creative language
Methods and resources: Machine learning, deep learning, neural models, reinforcement learning
Theory and philosophy of arts and creativity in AI systems: Autonomous creative or artistic AI