Evaluating and Complementing Vision-to-Language Technology for People who are Blind with Conversational Crowdsourcing

Elliot Salisbury, Ece Kamar, Meredith Ringel Morris

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence
Best Sister Conferences. Pages 5349-5353. https://doi.org/10.24963/ijcai.2018/751

We study how real-time crowdsourcing can be used both for evaluating the value provided by existing automated approaches and for enabling workflows that provide scalable and useful alt text to blind users. We show that the shortcomings of existing AI image captioning systems frequently hinder a user's understanding of an image they cannot see to a degree that even clarifying conversations with sighted assistants cannot correct. Based on analysis of clarifying conversations collected from our studies, we design experiences that can effectively assist users in a scalable way without the need for real-time interaction. Our results provide lessons and guidelines that the designers of future AI captioning systems can use to improve labeling of social media imagery for blind users.
Keywords:
Natural Language Processing: Resources and Evaluation
Humans and AI: Human-AI Collaboration
Humans and AI: Human Computation and Crowdsourcing
Humans and AI: Ethical Issues in AI