An explainable visual-navigation setting where an agent recounts indoor scenes and links generated language to visual explanation maps.
Sep 5, 2023