Single-Stage Visual Query Localization
in Egocentric Videos

(NeurIPS 2023)

Hanwen Jiang1    Santhosh Ramakrishnan1    Kristen Grauman1,2   

1UT Austin    2FAIR, Meta   

Paper | Code

Visual Query Localization (VQL) in long-form egocentric videos requires spatio-temporal search and localization of visually specified objects, and is vital for building episodic memory systems. Prior work develops complex multi-stage pipelines that leverage well-established object detection and tracking methods to perform VQL. However, each stage is trained independently, and the complexity of the pipeline results in slow inference. We propose VQLoC, a novel single-stage VQL framework that is end-to-end trainable. Our key idea is to first build a holistic understanding of the query-video relationship and then perform spatio-temporal localization in a single-shot manner. Specifically, we establish the query-video relationship by jointly considering query-to-frame correspondences between the query and each video frame and frame-to-frame correspondences between nearby video frames. Our experiments demonstrate that our approach outperforms prior VQL methods by 20% in accuracy while achieving a 10x improvement in inference speed. VQLoC is also the top entry on the Ego4D VQ2D challenge leaderboard.

Problem Definition

The goal is to search for and localize an open-set visual query in a long-form video, jointly predicting the time window in which the query object appears and the corresponding spatial bounding boxes.
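To make the inputs and outputs concrete, below is a minimal sketch of the task interface as simple Python dataclasses. The class and field names are illustrative assumptions for exposition, not the actual Ego4D annotation schema.

    # Hypothetical sketch of the VQL task interface; names are illustrative.
    from dataclasses import dataclass
    from typing import List

    import numpy as np


    @dataclass
    class VisualQuery:
        video_id: str           # long-form egocentric video to search
        query_crop: np.ndarray  # image crop of the query object, shape (H, W, 3)


    @dataclass
    class BoxInFrame:
        frame_idx: int          # frame index within the video
        x1: float               # top-left corner, in pixels
        y1: float
        x2: float               # bottom-right corner, in pixels
        y2: float


    @dataclass
    class ResponseTrack:
        start_frame: int        # time window in which the query object appears
        end_frame: int
        boxes: List[BoxInFrame] # one spatial bounding box per frame in the window

Given a VisualQuery, the task is to output a ResponseTrack, i.e. the temporal window plus a box per frame inside it.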


Challenges

The challenges of the task arise from: i) the "needle-in-the-haystack" nature of searching long-form videos; ii) the diversity of open-set queries; and iii) the lack of "exact" matches of the query in the target videos.



Method

Our model VQLoC establishes a holistic understanding of the query-video relationship and predicts the results in a single shot based on this understanding. VQLoC first establishes the query-to-frame relationship using a spatial transformer, which outputs spatial correspondence features. These features are then propagated and refined within local temporal windows by a spatio-temporal transformer, yielding query-video correspondence features. Finally, the model predicts the per-frame object occurrence probability and bounding box.
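The PyTorch sketch below illustrates this pipeline under simplifying assumptions: the token pooling, window size, layer counts, and head designs are ours for illustration and do not reproduce the authors' exact implementation.

    # Minimal sketch of the single-stage pipeline described above (illustrative,
    # not the authors' implementation).
    import torch
    import torch.nn as nn


    class VQLoCSketch(nn.Module):
        def __init__(self, dim=256, heads=8, window=5):
            super().__init__()
            self.window = window
            # Query-to-frame: cross-attention from frame tokens to query tokens
            # produces spatial correspondence features for each frame.
            self.query_to_frame = nn.MultiheadAttention(dim, heads, batch_first=True)
            # Frame-to-frame: a transformer encoder over nearby frames propagates
            # and refines the correspondence features in local temporal windows.
            layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
            self.spatio_temporal = nn.TransformerEncoder(layer, num_layers=2)
            # Per-frame prediction heads.
            self.occurrence_head = nn.Linear(dim, 1)  # is the object in this frame?
            self.box_head = nn.Linear(dim, 4)         # (cx, cy, w, h), normalized

        def forward(self, frame_tokens, query_tokens):
            # frame_tokens: (T, N, D) backbone tokens for T frames, N patches each
            # query_tokens: (M, D)    backbone tokens for the visual query crop
            T, N, D = frame_tokens.shape
            q = query_tokens.unsqueeze(0).expand(T, -1, -1)            # (T, M, D)
            corr, _ = self.query_to_frame(frame_tokens, q, q)          # (T, N, D)
            # Pool patch tokens into one correspondence feature per frame.
            frame_feat = corr.mean(dim=1)                              # (T, D)
            # Refine each frame feature jointly with its temporal neighbours.
            refined = []
            for t in range(T):
                lo, hi = max(0, t - self.window), min(T, t + self.window + 1)
                ctx = self.spatio_temporal(frame_feat[lo:hi].unsqueeze(0))
                refined.append(ctx[0, t - lo])
            refined = torch.stack(refined)                             # (T, D)
            prob = self.occurrence_head(refined).sigmoid().squeeze(-1) # (T,)
            boxes = self.box_head(refined).sigmoid()                   # (T, 4)
            return prob, boxes

For example, passing frame tokens of shape (T, N, D) and query tokens of shape (M, D) yields a per-frame occurrence probability of shape (T,) and normalized boxes of shape (T, 4), from which the response track can be read off by thresholding the probabilities.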



Results

Our model VQLoC achieves a 20% performance gain over prior work and improves inference speed by 10x. When using backbones of different sizes, VQLoC demonstrates a reasonable speed-performance tradeoff curve.



Visualization

We show the identified response tracks of the query objects with bounding boxes.


Video


Citation


@article{jiang2023vqloc,
   title={Single-Stage Visual Query Localization in Egocentric Videos},
   author={Jiang, Hanwen and Ramakrishnan, Santhosh and Grauman, Kristen},
   journal={ArXiv},
   year={2023},
   volume={2306.09324}
}