
This website contains animated figures from both the main article and the supplementary material of our paper:

Interpreting Video Features: a Comparison of 3D Convolutional Networks and Convolutional LSTM Networks

Joonatan Mänttäri*, Sofia Broomé*, John Folkesson, Hedvig Kjellström. *Joint first authorship

ACCV 2020, 15th Asian Conference on Computer Vision, to appear.

Paper and code.

@InProceedings{ManttariBroome_2020_Interpreting_Video_Features,
  title={Interpreting Video Features: a Comparison of 3D Convolutional Networks and Convolutional LSTM Networks},
  author={Joonatan M\"antt\"ari* and Sofia Broom\'e* and John Folkesson and Hedvig Kjellstr\"om},
  year={2020},
  booktitle={Computer Vision - ACCV 2020, 15th Asian Conference on Computer Vision, to appear. (*Joint first authors)},
  month={December}
}

The hyperparameters used in the work can be found at the bottom of the page. If you find the article useful for your research, please cite it.

Abstract: A number of techniques for interpretability have been presented for deep learning in computer vision, typically with the goal of understanding what the networks have actually learned underneath a given classification decision. However, interpretability for deep video architectures is still in its infancy and we do not yet have a clear concept of how to decode spatiotemporal features. In this paper, we present a study comparing how 3D convolutional networks and convolutional LSTM networks learn features across temporally dependent frames. This is the first comparison of two video models that both convolve to learn spatial features but that have principally different methods of modeling time. Additionally, we extend the concept of meaningful perturbation introduced by Fong et al. to the temporal dimension to search for the most meaningful part of a sequence for a classification decision.

Result samples from the main article

Class | Scores I3D | I3D | CLSTM | Scores CLSTM
Moving something and something away from each other | OS: 0.994, FS: 0.083, RS: 0.856 | img | img | OS: 0.312, FS: 0.186, RS: 0.125
Moving something and something closer to each other [Predicted: 38: Moving something and something so they collide with each other; 135: Something falling like a rock] | OS: 0.547, FS: 0.028, RS: 0.053, P: 38 | img | img | OS: 0.257, FS: 0.079, RS: 0.122, P: 135
Moving something and something so they pass each other | OS: 0.999, FS: 0.002, RS: 0.414 | img | img | OS: 0.788, FS: 0.392, RS: 0.537
Moving something up | OS: 0.804, FS: 0.016, RS: 0.667 | img | img | OS: 0.546, FS: 0.121, RS: 0.764
Moving something up [Predicted: 146: Taking one of many similar things on the table; 100: Pushing something so that it slightly moves] | OS: 0.685, FS: 0.003, RS: 0.048, CS: 0.001, P: 146 | img | img | OS: 0.221, FS: 0.182, RS: 0.350, CS: 0.005, P: 100
Pretending to take something from somewhere [Predicted: 27: Lifting something up completely without letting it drop down] | OS: 0.284, FS: 0.003, RS: 0.006 | img | img | OS: 0.600, FS: 0.167, RS: 0.088, CS: 0.004, P: 27
Turning the camera downwards while filming something | OS: 1.000, FS: 0.001, RS: 0.011 | img | img | OS: 0.158, FS: 0.063, RS: 0.093
Turning the camera upwards while filming something | OS: 0.990, FS: 0.001, RS: 0.000 | img | img | OS: 0.806, FS: 0.177, RS: 0.181

Further Examples of Spatio-temporal Features:

Below, we present results for 22 additional randomly selected sequences (two from each class) from the Something-something dataset. As mentioned in the main article, we selected eleven classes where the two models had comparable performance, both poor and strong. The four classes not appearing above are (I3D F1 score/C-LSTM F1 score): moving something and something so they collide with each other (0.16/0.03), burying something in something (0.1/0.06), turning the camera left while filming something (0.94/0.79), and turning the camera right while filming something (0.91/0.8).

Class | Mask Losses I3D | I3D | CLSTM | Mask Losses CLSTM
Turning the camera downwards while filming something | OS: 1.000, FS: 0.015, RS: 0.003 | img | img | OS: 0.373, FS: 0.547, RS: 0.224
Turning the camera downwards while filming something | OS: 0.999, FS: 0.000, RS: 0.000 | img | img | OS: 0.921, FS: 0.238, RS: 0.460
Turning the camera left while filming something | OS: 0.997, FS: 0.012, RS: 0.014 | img | img | OS: 0.988, FS: 0.183, RS: 0.105
Turning the camera left while filming something | OS: 0.999, FS: 0.001, RS: 0.451 | img | img | OS: 0.985, FS: 0.229, RS: 0.094
Turning the camera right while filming something [Predicted: 157: Tilting something with something on it until it falls off] | OS: 0.940, FS: 0.106, RS: 0.017 | img | img | OS: 0.192, FS: 0.261, RS: 0.140, CS: 0.103, P: 157
Turning the camera right while filming something | OS: 0.947, FS: 0.005, RS: 0.188 | img | img | OS: 0.708, FS: 0.093, RS: 0.119
Turning the camera upwards while filming something | OS: 0.999, FS: 0.001, RS: 0.002 | img | img | OS: 0.687, FS: 0.205, RS: 0.149
Turning the camera upwards while filming something | OS: 0.997, FS: 0.002, RS: 0.064 | img | img | OS: 0.689, FS: 0.108, RS: 0.129
Moving something and something away from each other [Predicted: 121: Removing something, revealing something behind] | OS: 0.917, FS: 0.058, RS: 0.071 | img | img | OS: 0.297, FS: 0.155, RS: 0.294, CS: 0.085, P: 121
Moving something and something away from each other [Predicted: 130: Showing that something is inside something] | OS: 0.991, FS: 0.022, RS: 0.956 | img | img | OS: 0.259, FS: 0.081, RS: 0.146, CS: 0.008, P: 130
Moving something and something closer to each other [Predicted: 173: Wiping something off of something; 100: Pushing something so that it slightly moves] | OS: 0.273, FS: 0.004, RS: 0.245, CS: 0.001, P: 173 | img | img | OS: 0.1230, FS: 0.375, RS: 0.200, CS: 0.012, P: 100
Moving something and something closer to each other | OS: 0.932, FS: 0.002, RS: 0.007 | img | img | OS: 0.453, FS: 0.063, RS: 0.198
Moving something and something so they collide with each other | OS: 0.686, FS: 0.003, RS: 0.000 | img | img | OS: 0.620, FS: 0.145, RS: 0.129
Moving something and something so they collide with each other [Predicted: 37: Moving something and something closer to each other] | OS: 0.810, FS: 0.055, RS: 0.419 | img | img | OS: 0.333, FS: 0.119, RS: 0.276, CS: 0.030, P: 37
Moving something and something so they pass each other | OS: 0.997, FS: 0.007, RS: 0.974 | img | img | OS: 0.737, FS: 0.490, RS: 0.140
Moving something and something so they pass each other [Predicted: 37: Moving something and something closer to each other] | OS: 0.694, FS: 0.010, RS: 0.003, CS: 0.273, P: 37 | img | img | OS: 0.813, FS: 0.227, RS: 0.830, CS: 0.142, P: 37
Burying something in something [Predicted: 145: Taking one of many similar things on the table; 157: Tilting something with something on it until it falls off] | OS: 0.619, FS: 0.010, RS: 0.216, CS: 0.020, P: 145 | img | img | OS: 0.013, FS: 0.079, RS: 0.262, CS: 0.001, P: 157
Burying something in something [Predicted: 106: Putting something into something; 5: Closing something] | OS: 0.177, FS: 0.007, RS: 0.130, CS: 0.027, P: 106 | img | img | OS: 0.112, FS: 0.147, RS: 0.327, CS: 0.002, P: 5
Moving something up [Predicted: 27: Lifting something up completely without letting it drop down; 100: Pushing something so that it slightly moves] | OS: 0.848, FS: 0.065, RS: 0.380, CS: 0.003, P: 27 | img | img | OS: 0.229, FS: 0.102, RS: 0.269, CS: 0.003, P: 100
Moving something up [Predicted: 100: Pushing something so that it slightly moves] | OS: 0.755, FS: 0.012, RS: 0.032 | img | img | OS: 0.230, FS: 0.146, RS: 0.200, CS: 0.003, P: 100
Pretending to take something from somewhere [Predicted: 160: Touching (without moving) part of something] | OS: 0.810, FS: 0.019, RS: 0.682, CS: 0.000, P: 160 | img | img | OS: 0.179, FS: 0.073, RS: 0.162, CS: 0.004, P: 160
Pretending to take something from somewhere [Predicted: 145: Stuffing something into something; 160: Touching (without moving) part of something] | OS: 0.325, FS: 0.012, RS: 0.126, CS: 0.047, P: 145 | img | img | OS: 0.418, FS: 0.062, RS: 0.266, CS: 0.011, P: 160

Hyperparameters for model training

Model (dataset) | Dropout Rate | Weight Decay | Optimizer | Epochs | Momentum
I3D (smth-smth) | 0.5 | 0 | ADAM | 13 | -
I3D (KTH) | 0.7 | 5E-5 | ADAM | 30 | -
C-LSTM (smth-smth) | 0 | 0 | SGD | 105 | 0.2
C-LSTM (KTH) | 0.5 | 1E-4 | SGD | 21 | 0.2
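
For readers who want to reuse these settings, the training table above can be transcribed into a small configuration object. The following is only a sketch: the names `TrainConfig`, `TRAIN_CONFIGS` and `optimizer_kwargs` are hypothetical and not taken from the released code, and the table lists no training learning rate, so none is included.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrainConfig:
    """One row of the training hyperparameter table (hypothetical container)."""
    dropout: float
    weight_decay: float
    optimizer: str
    epochs: int
    momentum: Optional[float]  # None where the table shows "-"


# Values copied from the table; keys are (model, dataset).
TRAIN_CONFIGS = {
    ("I3D", "smth-smth"): TrainConfig(0.5, 0.0, "ADAM", 13, None),
    ("I3D", "KTH"): TrainConfig(0.7, 5e-5, "ADAM", 30, None),
    ("C-LSTM", "smth-smth"): TrainConfig(0.0, 0.0, "SGD", 105, 0.2),
    ("C-LSTM", "KTH"): TrainConfig(0.5, 1e-4, "SGD", 21, 0.2),
}


def optimizer_kwargs(cfg: TrainConfig) -> dict:
    """Collect the optimizer arguments the table specifies for one setup."""
    kwargs = {"weight_decay": cfg.weight_decay}
    if cfg.momentum is not None:  # momentum is only listed for the SGD rows
        kwargs["momentum"] = cfg.momentum
    return kwargs


cfg = TRAIN_CONFIGS[("C-LSTM", "KTH")]
print(cfg.optimizer, cfg.epochs, optimizer_kwargs(cfg))
```

The resulting dictionary could be passed to whichever optimizer class a training script uses, together with a learning rate chosen separately.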

Hyperparameters for temporal mask inference

Dataset | Lambda1 | Lambda2 | Beta | Optimizer | Iterations | Learning Rate
Smth-Smth | 0.01 | 0.02 | 3 | ADAM | 300 | 0.001
KTH | 0.02 | 0.04 | 3 | ADAM | 300 | 0.001
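
This page does not spell out how Lambda1, Lambda2 and Beta enter the mask objective. The sketch below therefore assumes a regularizer in the style of Fong et al.'s meaningful perturbation: an L1 term on the temporal mask weighted by Lambda1, plus a total-variation term with exponent Beta weighted by Lambda2. The function name `mask_regularizer` and the dictionary `MASK_CONFIGS` are hypothetical, and the classification-score term of the full objective is omitted.

```python
# Mask-inference settings copied from the table above.
MASK_CONFIGS = {
    "Smth-Smth": {"lambda1": 0.01, "lambda2": 0.02, "beta": 3,
                  "iterations": 300, "lr": 0.001},
    "KTH": {"lambda1": 0.02, "lambda2": 0.04, "beta": 3,
            "iterations": 300, "lr": 0.001},
}


def mask_regularizer(mask, lambda1, lambda2, beta):
    """Assumed regularizer for a temporal mask with one value per frame.

    l1 penalizes how much of the sequence the mask covers;
    tv penalizes differences between neighboring frames.
    """
    l1 = sum(abs(m) for m in mask)
    tv = sum(abs(b - a) ** beta for a, b in zip(mask, mask[1:]))
    return lambda1 * l1 + lambda2 * tv


# A toy five-frame mask that is switched on for frames 2 and 3.
toy_mask = [0.0, 0.0, 1.0, 1.0, 0.0]
cfg = MASK_CONFIGS["Smth-Smth"]
print(mask_regularizer(toy_mask, cfg["lambda1"], cfg["lambda2"], cfg["beta"]))
```

In an actual optimization loop, this value would be added to a term measuring the classifier's score on the perturbed sequence and minimized with ADAM for the listed number of iterations; that part depends on the model and is not shown here.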