In the VTQA challenge, a model is expected to answer a question based on a given image-text pair. To answer VTQA questions, the proposed model needs to: (1) identify the entities in the image and text that the question refers to, (2) align the multimedia representations of the same entity, and (3) perform multi-step reasoning across text and image and output an open-ended answer. The VTQA dataset consists of 10,124 image-text pairs and 23,781 questions. The images are real images from the MSCOCO dataset and contain a variety of entities. Annotators were asked to first write text relevant to the image, then pose questions based on the image-text pair, and finally answer the questions in an open-ended manner.
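To make the data format concrete, the sketch below shows how one VTQA item (image-text pair, question, open-ended answer) might be represented in Python. The field names and the `load_examples` helper are illustrative assumptions, not the dataset's official schema.

```python
from dataclasses import dataclass
from typing import List, Dict

@dataclass
class VTQAExample:
    """One VTQA item: an image-text pair plus an open-ended question.

    Field names are hypothetical and only illustrate the structure
    described above (image, annotator-written text, question, answer).
    """
    image_id: str   # identifier of the MSCOCO image
    text: str       # annotator-written text relevant to the image
    question: str   # question requiring both image and text to answer
    answer: str     # free-form (open-ended) answer string

def load_examples(records: List[Dict]) -> List[VTQAExample]:
    """Convert raw JSON-like records into typed examples."""
    return [
        VTQAExample(
            image_id=r["image_id"],
            text=r["text"],
            question=r["question"],
            answer=r["answer"],
        )
        for r in records
    ]
```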
Information diversity, multimedia multi-step reasoning, and open-ended answer generation make our task more challenging than existing datasets. The aim of this challenge is to develop and benchmark models capable of multimedia entity alignment, multi-step reasoning, and open-ended answer generation.
As illustrated in Figure 1, given an image-text pair and a question, a system is required to answer the question in natural language. Importantly, the system needs to: (1) analyze the question and identify the key entities, (2) align the key entities between image and text, and (3) generate the answer according to the question and the aligned entities. For example, in Figure 1, the key entity of Q1 is “Elena”. Based on the textual cue “gold hair”, we can determine that the second person from the right in the image is “Elena”. Finally, we answer “suit” based on the image information. Q2 is a more complex question, and answering it requires repeating these steps several times.
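The following sketch illustrates the three-step process described above as a minimal Python pipeline. It is not the challenge's reference implementation: the keyword-based entity extraction, the string-overlap alignment, and the `image_regions` format (dicts with `description` and `attributes` fields, e.g. from an object detector) are all simplifying assumptions standing in for learned components.

```python
from typing import List, Dict

def extract_key_entities(question: str) -> List[str]:
    """Step 1: find the entities the question refers to.

    A trivial capitalization heuristic stands in for a real question parser.
    """
    return [tok.strip("?.,").lower() for tok in question.split() if tok[0].isupper()]

def align_entities(entities: List[str], text: str,
                   image_regions: List[Dict]) -> Dict[str, Dict]:
    """Step 2: link each key entity to an image region via the text.

    Real systems would align learned visual and textual embeddings;
    here we match a region's textual description (e.g. "gold hair")
    against the sentence that mentions the entity.
    """
    aligned = {}
    for entity in entities:
        mention = next((s for s in text.split(".") if entity in s.lower()), "")
        for region in image_regions:
            if any(w in mention.lower() for w in region["description"].split()):
                aligned[entity] = region
                break
    return aligned

def answer_question(question: str, text: str, image_regions: List[Dict]) -> str:
    """Step 3: produce the answer from the aligned entities.

    A real model would perform multi-step reasoning and free-form
    generation; this sketch just reads an attribute off the aligned region.
    """
    entities = extract_key_entities(question)
    aligned = align_entities(entities, text, image_regions)
    if not aligned:
        return "unknown"
    region = next(iter(aligned.values()))
    return region.get("attributes", ["unknown"])[0]

# Example mirroring Q1 in Figure 1 (data values are made up for illustration):
# answer_question("What is Elena wearing?",
#                 "Elena is the woman with gold hair.",
#                 [{"description": "woman with gold hair", "attributes": ["suit"]}])
# -> "suit"
```

For a more complex question such as Q2, the same extract-align-answer loop would run repeatedly, with each pass feeding its result into the next reasoning step.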