Video Test-Time Adaptation for Action Recognition

1 Institute of Computer Graphics and Vision, Graz University of Technology, Austria
2 Christian Doppler Laboratory for Semantic 3D Computer Vision
3 Christian Doppler Laboratory for Embedded Machine Learning
4 Goethe University Frankfurt, Germany
5 MIT-IBM Watson AI Lab, USA

Abstract

Although action recognition systems can achieve top performance when evaluated on in-distribution test points, they are vulnerable to unanticipated distribution shifts in test data. However, test-time adaptation of video action recognition models against common distribution shifts has so far not been demonstrated.

We propose to address this problem with an approach tailored to spatio-temporal models that is capable of adapting on a single video sample at each step. It consists of a feature distribution alignment technique that aligns online estimates of test set statistics towards the training statistics. We further enforce prediction consistency over temporally augmented views of the same test video sample.

Evaluations on three benchmark action recognition datasets show that our proposed technique is architecture-agnostic and able to significantly boost the performance of both the state-of-the-art convolutional architecture TANet and the Video Swin Transformer. Our proposed method demonstrates a substantial performance gain over existing test-time adaptation approaches, both in evaluations of a single distribution shift and in the challenging case of random distribution shifts.


Pipeline


The online adaptation is applied to videos that are received sequentially; here we show the adaptation process at iteration i. We first compute online estimates of the test statistics by 1) sampling two temporally augmented views from the test video and computing the statistics on multi-layer feature maps across the two views, and 2) performing an exponential moving average of the statistics across iterations. Afterwards, we perform feature distribution alignment by minimizing the discrepancy between the pre-computed training statistics and the online estimates of the test statistics. In addition, we enforce prediction consistency over the temporally augmented views to further improve performance.
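
The steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the released implementation: the names `view_statistics`, `OnlineStats`, `alignment_loss` and `consistency_loss`, the feature layout (views, channels, time, height, width), the momentum value of 0.1, and the choice of an L1 discrepancy are all assumptions made for the example. Only a single feature layer is shown, and the training statistics are placeholders.

```python
import numpy as np


def view_statistics(features):
    """Per-channel mean and variance of one layer's feature maps.

    features: array of shape (views, channels, time, height, width)
    (assumed layout). Statistics are pooled over the views and over
    all spatio-temporal positions.
    """
    mean = features.mean(axis=(0, 2, 3, 4))
    var = features.var(axis=(0, 2, 3, 4))
    return mean, var


class OnlineStats:
    """Exponential moving average of per-channel statistics across iterations."""

    def __init__(self, momentum=0.1):  # momentum value is an assumption
        self.momentum = momentum
        self.mean = None
        self.var = None

    def update(self, mean, var):
        if self.mean is None:
            # First test video: initialise the estimates directly.
            self.mean, self.var = mean.copy(), var.copy()
        else:
            m = self.momentum
            self.mean = m * mean + (1 - m) * self.mean
            self.var = m * var + (1 - m) * self.var
        return self.mean, self.var


def alignment_loss(test_mean, test_var, train_mean, train_var):
    """L1 discrepancy between test and training statistics (assumed form)."""
    return np.abs(test_mean - train_mean).sum() + np.abs(test_var - train_var).sum()


def consistency_loss(probs):
    """L1 distance of each view's class probabilities to their average.

    probs: array of shape (views, classes).
    """
    mean_pred = probs.mean(axis=0, keepdims=True)
    return np.abs(probs - mean_pred).sum()


# Toy run: three sequentially received "videos", two views each.
rng = np.random.default_rng(0)
train_mean, train_var = np.zeros(4), np.ones(4)  # placeholder training statistics
stats = OnlineStats(momentum=0.1)
for i in range(3):
    feats = rng.normal(loc=0.5, scale=2.0, size=(2, 4, 8, 7, 7))
    mean, var = view_statistics(feats)
    ema_mean, ema_var = stats.update(mean, var)
    loss = alignment_loss(ema_mean, ema_var, train_mean, train_var)
```

In an actual adaptation loop these losses would be computed in an automatic differentiation framework, summed over several layers, and minimized with respect to the model parameters at every iteration; NumPy is used here only to make the arithmetic explicit.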



Examples of Corrupted Videos


BibTeX

@inproceedings{lin2023video,
  title={Video Test-Time Adaptation for Action Recognition},
  author={Lin, Wei and Mirza, Muhammad Jehanzeb and Kozinski, Mateusz and Possegger, Horst and Kuehne, Hilde and Bischof, Horst},
  booktitle={CVPR},
  year={2023}
}