Papers
Tag Filter
Video Question Answering
Self-Chained Image-Language Model for Video Localization and Question Answering
Published: 5/12/2023
Self-Recurrent Video Localization and Question Answering, BLIP-2 Based Vision-Language Model, Video Question Answering, Temporal Keyframe Localization, Unlabeled Video Localization Optimization
The SeViLA framework tackles video question answering and the problems caused by uniform frame sampling. Built on the BLIP-2 model, it combines temporal keyframe localization with QA, significantly improving performance while reducing the need for expen…
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
Published: 12/19/2024
Multimodal Large Language Model, Visual-Spatial Intelligence Benchmark, Spatial Reasoning, Video Question Answering, Cognitive Map Generation
This work introduces VSI-Bench to evaluate the spatial reasoning of multimodal large language models from videos. It reveals emerging spatial awareness and local world models, and finds that cognitive map generation improves spatial distance understanding beyond standard linguistic reasoning tec…