Eliminating the Cross-Domain Misalignment in Text-guided Image Inpainting

Eliminating the Cross-Domain Misalignment in Text-guided Image Inpainting

Muqi Huang, Chaoyue Wang, Yong Luo, Lefei Zhang

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence
Main Track. Pages 875-883. https://doi.org/10.24963/ijcai.2024/97

Text-guided image inpainting has rapidly garnered prominence as a task in user-directed image synthesis, aiming to complete the occluded image regions following the textual prompt provided. However, current methods usually grapple with issues arising from the disparity between low-level pixel data and high-level semantic descriptions, which results in inpainted sections not harmonizing with the original image (either structurally or texturally). In this study, we introduce a Structure-Aware Inpainting Learning scheme and an Asymmetric Cross Domain Attention to address these cross-domain misalignment challenges. The proposed structure-aware learning scheme employs features of an intermediate modality as structure guidance to bridge the gap between text information and low-level pixels. Meanwhile, asymmetric cross-domain attention enhances the texture consistency between inpainted and unmasked regions. Our experiments show exceptional performance on leading datasets such as MS-COCO and Open Images, surpassing state-of-the-art text-guided image inpainting methods. Code is released at: https://github.com/MucciH/ECDM-inpainting
Keywords:
Computer Vision: CV: Image and video synthesis and generation 
Computer Vision: CV: Vision, language and reasoning