Abstract: Most existing multimodal named entity recognition (MNER) methods align images and text poorly and fail to fuse image-text semantic information effectively, leading to suboptimal MNER ...