Internship: End-to-end architectures for large-scale video recognition


The internship will take place at NAVER LABS Europe (NLE) and will be jointly supervised by Philippe Weinzaepfel, research scientist at NLE, and Cordelia Schmid, Inria research director and head of the Thoth team at Inria Grenoble. Grenoble lies in the French Alps and offers ideal conditions for skiing, hiking, climbing, etc.


State-of-the-art CNN architectures for video recognition are based on a two-stream architecture [1]: one stream operates on RGB frames and the other on optical flow. For instance, in action classification, the state-of-the-art I3D approach [2] uses a two-stream architecture in which each stream relies on 3D convolutions. More recently, additional streams have been proposed, for instance based on body part segmentation [3]. In action detection, i.e., the task of spatio-temporal recognition of the actions in a video, the best approaches are all based on a two-stream architecture [4,5]. Even for lower-level tasks such as video segmentation, two-stream architectures obtain state-of-the-art performance [6,7].

Figure: Example of two-stream architectures (credit [1]).

However, these architectures have several limitations. First, the two streams are most often trained independently and combined only at test time using late fusion. Second, the optical flow is extracted using a hand-crafted approach and artificially converted into images. Third, extracting the optical flow and then running both streams leads to a computational cost that does not allow real-time applications or large-scale processing.
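To make the first limitation concrete, here is a minimal sketch of late fusion, assuming each stream outputs one vector of class scores (logits) per video. The function names and the flow_weight parameter are hypothetical and chosen for illustration only; this is not the exact fusion rule of any cited paper. The point is that the two streams interact only through a weighted average of their final class probabilities.

```python
import math


def softmax(logits):
    """Turn raw class scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(rgb_logits, flow_logits, flow_weight=0.5):
    """Weighted average of the per-class probabilities of the two streams.

    flow_weight is a hypothetical mixing parameter: 0.0 keeps only the
    RGB stream, 1.0 keeps only the optical flow stream.
    """
    if len(rgb_logits) != len(flow_logits):
        raise ValueError("both streams must score the same set of classes")
    rgb_probs = softmax(rgb_logits)
    flow_probs = softmax(flow_logits)
    return [
        (1.0 - flow_weight) * r + flow_weight * f
        for r, f in zip(rgb_probs, flow_probs)
    ]


def predict(rgb_logits, flow_logits, flow_weight=0.5):
    """Index of the class with the highest fused probability."""
    fused = late_fusion(rgb_logits, flow_logits, flow_weight)
    return max(range(len(fused)), key=fused.__getitem__)
```

Because the fusion happens after both networks have produced their scores, no gradient flows between the streams during training, which is why this scheme does not support end-to-end learning of their interaction.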


The goal of this internship/PhD is to improve all these aspects. A first idea consists in making both streams operate on the same input, by adapting the flow stream to the RGB stream. This would save computational time, make it easier to study how the streams interact, and allow them to be trained end-to-end. A second idea could be to estimate the optical flow using a neural network and to train the whole architecture end-to-end.

Skills and profile

We are looking for a strongly motivated master's student with an interest in computer vision and deep learning. This project requires a strong background in applied mathematics and excellent programming skills. A successful project can lead to a PhD supervised jointly between NLE and the Thoth team at Inria Grenoble.


Please send a CV, a letter of motivation, the names of two referees and transcripts of grades by e-mail to and .

This position is now closed.


[1] Simonyan, K., & Zisserman, A. Two-stream convolutional networks for action recognition in videos. NIPS 2014.
[2] Carreira, J., & Zisserman, A. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. CVPR 2017.
[3] Zolfaghari, M., Oliveira, G. L., Sedaghat, N., & Brox, T. Chained Multi-stream Networks Exploiting Pose, Motion, and Appearance for Action Classification and Detection. ICCV 2017.
[4] Peng, X., & Schmid, C. Multi-region two-stream R-CNN for action detection. ECCV 2016.
[5] Kalogeiton, V., Weinzaepfel, P., Ferrari, V., & Schmid, C. Action Tubelet Detector for Spatio-Temporal Action Localization. ICCV 2017.
[6] Tokmakov, P., Alahari, K., & Schmid, C. Learning motion patterns in videos. CVPR 2017.
[7] Jain, S. D., Xiong, B., & Grauman, K. FusionSeg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos. CVPR 2017.