Interpreting Video Features

This website contains animated figures both from the main article and the supplementary material of our paper:

Interpreting Video Features: a Comparison of 3D Convolutional Networks and Convolutional LSTM Networks

Joonatan Mänttäri*, Sofia Broomé*, John Folkesson, Hedvig Kjellström. *Joint first authorship

ACCV 2020, 15th Asian Conference on Computer Vision, to appear.

@InProceedings{ManttariBroome_2020_Interpreting_Video_Features,
  title={Interpreting Video Features: a Comparison of 3D Convolutional Networks and Convolutional LSTM Networks},
  author={Joonatan M\"antt\"ari* and Sofia Broom\'e* and John Folkesson and Hedvig Kjellstr\"om},
  year={2020},
  booktitle={Computer Vision - ACCV 2020, 15th Asian Conference on Computer Vision, to appear. (*Joint first authors)},
  month={December}
}

The hyperparameters used in the work can be found at the bottom of the page. If you find the article useful for your research, please cite it.

Abstract: A number of techniques for interpretability have been presented for deep learning in computer vision, typically with the goal of understanding what the networks have actually learned underneath a given classification decision. However, interpretability for deep video architectures is still in its infancy and we do not yet have a clear concept of how to decode spatiotemporal features. In this paper, we present a study comparing how 3D convolutional networks and convolutional LSTM networks learn features across temporally dependent frames. This is the first comparison of two video models that both convolve to learn spatial features but that have principally different methods of modeling time. Additionally, we extend the concept of meaningful perturbation introduced by Fong et al. to the temporal dimension to search for the most meaningful part of a sequence for a classification decision.
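To make the idea of searching for the most meaningful part of a sequence concrete, here is a minimal sketch. It is not the method from the paper: it swaps in an exhaustive sliding-window search, and the two perturbations (`freeze` and `reverse`), the toy `upward_motion_score` function, and the integer "frames" are all hypothetical names and stand-ins chosen for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Frames = Sequence[int]  # toy stand-in: one number per frame instead of an image
Perturbation = Callable[[Frames, int, int], List[int]]


def freeze(frames: Frames, start: int, end: int) -> List[int]:
    """Hold the frame just before the window for the whole window [start, end)."""
    anchor = frames[start - 1] if start > 0 else frames[0]
    return list(frames[:start]) + [anchor] * (end - start) + list(frames[end:])


def reverse(frames: Frames, start: int, end: int) -> List[int]:
    """Play the frames inside the window [start, end) backwards."""
    return list(frames[:start]) + list(frames[start:end])[::-1] + list(frames[end:])


def most_meaningful_window(
    frames: Frames,
    score: Callable[[Frames], float],
    perturb: Perturbation,
    width: int,
) -> Tuple[Tuple[int, int], float]:
    """Return the window whose perturbation lowers the score most (first on ties)."""
    if not 0 < width <= len(frames):
        raise ValueError("width must be between 1 and the number of frames")
    best_window, best_score = (0, width), score(perturb(frames, 0, width))
    for start in range(1, len(frames) - width + 1):
        candidate = score(perturb(frames, start, start + width))
        if candidate < best_score:
            best_window, best_score = (start, start + width), candidate
    return best_window, best_score


def upward_motion_score(frames: Frames) -> float:
    """Toy 'classifier': fraction of frame-to-frame steps that move up."""
    steps = list(zip(frames, frames[1:]))
    return sum(b > a for a, b in steps) / len(steps)


# Toy clip: an object rests, moves up over three steps, then rests again.
clip = [0, 0, 0, 1, 2, 3, 3, 3]
window, frozen_score = most_meaningful_window(clip, upward_motion_score, freeze, width=3)
```

In this toy setting, the selected window overlaps the frames where the motion happens, which is the intuition behind looking for the part of a sequence that matters most for a classification decision. A real model would replace `upward_motion_score` with the network's class score on actual video frames.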

Result samples from the main article

Class	Scores I3D	Scores CLSTM
Moving something and something away from each other	OS: 0.994 FS: 0.083 RS: 0.856	OS: 0.312 FS: 0.186 RS: 0.125
Moving something and something closer to each other Predicted: 38: Moving something and something so they collide with each other 135: Something falling like a rock	OS: 0.547 FS: 0.028 RS: 0.053 P: 38	OS: 0.257 FS: 0.079 RS: 0.122 P: 135
Moving something and something so they pass each other	OS: 0.999 FS: 0.002 RS: 0.414	OS: 0.788 FS: 0.392 RS: 0.537
Moving something up	OS: 0.804 FS: 0.016 RS: 0.667	OS: 0.546 FS: 0.121 RS: 0.764
Moving something up Predicted: 146: Taking one of many similar things on the table 100: Pushing something so that it slightly moves	OS: 0.685 FS: 0.003 RS: 0.048 CS: 0.001 P: 146	OS: 0.221 FS: 0.182 RS: 0.350 CS: 0.005 P: 100
Pretending to take something from somewhere Predicted: 27: Lifting something up completely without letting it drop down	OS: 0.284 FS: 0.003 RS: 0.006	OS: 0.600 FS: 0.167 RS: 0.088 CS: 0.004 P: 27
Turning the camera downwards while filming something	OS: 1.000 FS: 0.001 RS: 0.011	OS: 0.158 FS: 0.063 RS: 0.093
Turning the camera upwards while filming something	OS: 0.990 FS: 0.001 RS: 0.000	OS: 0.806 FS: 0.177 RS: 0.181

Further Examples of Spatio-temporal Features:

Below, we present results for 22 additional randomly selected sequences (two from each class) from the Something-something dataset. As mentioned in the main article, we selected eleven classes where the two models had comparable performance, both poor and strong. The four classes not appearing above are (I3D F1 score / C-LSTM F1 score): moving something and something so they collide with each other (0.16/0.03), burying something in something (0.1/0.06), turning the camera left while filming something (0.94/0.79), and turning the camera right while filming something (0.91/0.8).

Class	Mask Losses I3D	Mask Losses CLSTM
Turning the camera downwards while filming something	OS: 1.000 FS: 0.015 RS: 0.003	OS: 0.373 FS: 0.547 RS: 0.224
Turning the camera downwards while filming something	OS: 0.999 FS: 0.000 RS: 0.000	OS: 0.921 FS: 0.238 RS: 0.460
Turning the camera left while filming something	OS: 0.997 FS: 0.012 RS: 0.014	OS: 0.988 FS: 0.183 RS: 0.105
Turning the camera left while filming something	OS: 0.999 FS: 0.001 RS: 0.451	OS: 0.985 FS: 0.229 RS: 0.094
Turning the camera right while filming something Predicted: 157: Tilting something with something on it until it falls off	OS: 0.940 FS: 0.106 RS: 0.017	OS: 0.192 FS: 0.261 RS: 0.140 CS: 0.103 P: 157
Turning the camera right while filming something	OS: 0.947 FS: 0.005 RS: 0.188	OS: 0.708 FS: 0.093 RS: 0.119
Turning the camera upwards while filming something	OS: 0.999 FS: 0.001 RS: 0.002	OS: 0.687 FS: 0.205 RS: 0.149
Turning the camera upwards while filming something	OS: 0.997 FS: 0.002 RS: 0.064	OS: 0.689 FS: 0.108 RS: 0.129
Moving something and something away from each other Predicted: 121: Removing something, revealing something behind	OS: 0.917 FS: 0.058 RS: 0.071	OS: 0.297 FS: 0.155 RS: 0.294 CS: 0.085 P: 121
Moving something and something away from each other Predicted: 130: Showing that something is inside something	OS: 0.991 FS: 0.022 RS: 0.956	OS: 0.259 FS: 0.081 RS: 0.146 CS: 0.008 P: 130
Moving something and something closer to each other Predicted: 173: Wiping something off of something 100: Pushing something so that it slightly moves	OS: 0.273 FS: 0.004 RS: 0.245 CS: 0.001 P: 173	OS: 0.1230 FS: 0.375 RS: 0.200 CS: 0.012 P: 100
Moving something and something closer to each other	OS: 0.932 FS: 0.002 RS: 0.007	OS: 0.453 FS: 0.063 RS: 0.198
Moving something and something so they collide with each other	OS: 0.686 FS: 0.003 RS: 0.000	OS: 0.620 FS: 0.145 RS: 0.129
Moving something and something so they collide with each other Predicted: 37: Moving something and something closer to each other	OS: 0.810 FS: 0.055 RS: 0.419	OS: 0.333 FS: 0.119 RS: 0.276 CS: 0.030 P: 37
Moving something and something so they pass each other	OS: 0.997 FS: 0.007 RS: 0.974	OS: 0.737 FS: 0.490 RS: 0.140
Moving something and something so they pass each other Predicted: 37: Moving something and something closer to each other	OS: 0.694 FS: 0.010 RS: 0.003 CS: 0.273 P: 37	OS: 0.813 FS: 0.227 RS: 0.830 CS: 0.142 P: 37
Burying something in something Predicted: 145: Taking one of many similar things on the table 157: Tilting something with something on it until it falls off	OS: 0.619 FS: 0.010 RS: 0.216 CS: 0.020 P: 145	OS: 0.013 FS: 0.079 RS: 0.262 CS: 0.001 P: 157
Burying something in something Predicted: 106: Putting something into something 5: Closing something	OS: 0.177 FS: 0.007 RS: 0.130 CS: 0.027 P: 106	OS: 0.112 FS: 0.147 RS: 0.327 CS: 0.002 P: 5
Moving something up Predicted: 27: Lifting something up completely without letting it drop down 100: Pushing something so that it slightly moves	OS: 0.848 FS: 0.065 RS: 0.380 CS: 0.003 P: 27	OS: 0.229 FS: 0.102 RS: 0.269 CS: 0.003 P: 100
Moving something up Predicted: 100: Pushing something so that it slightly moves	OS: 0.755 FS: 0.012 RS: 0.032	OS: 0.230 FS: 0.146 RS: 0.200 CS: 0.003 P: 100
Pretending to take something from somewhere Predicted: 160: Touching (without moving) part of something	OS: 0.810 FS: 0.019 RS: 0.682 CS: 0.000 P: 160	OS: 0.179 FS: 0.073 RS: 0.162 CS: 0.004 P: 160
Pretending to take something from somewhere Predicted: 145: Stuffing something into something 160: Touching (without moving) part of something	OS: 0.325 FS: 0.012 RS: 0.126 CS: 0.047 P: 145	OS: 0.418 FS: 0.062 RS: 0.266 CS: 0.011 P: 160