Abstract: The audio-visual question answering (AVQA) task, which aims to answer questions derived from the original videos, has attracted extensive attention in the fields of multimedia, computer vision, ...