
Internship: Human 3D Shape Estimation from a Single Image


The internship will be in the Thoth and Morpheo teams at Inria Grenoble and will be co-supervised by Gregory Rogez, Jean-Sebastien Franco and Cordelia Schmid.


While the recent success of deep learning methods is undeniable across a wide array of classical computer vision problems, ranging from image feature extraction to semantic segmentation, their applicability to higher-level 3D problems is still an open challenge. A growing trend in the community is to examine how such inference tools can be used to estimate 3D shape and pose characteristics [1] from images and videos. This is relevant to a wide range of applications, from self-driving cars to augmented reality.

In the research community, recent works have shown the success of deep network architectures for the problem of retrieving 3D features such as kinematic joints [2,3] or surface characterizations [4], with extremely encouraging results. Such successes, sometimes achieved with simple, standard network architectures such as AlexNet [5] or VGG [6], naturally raise the question of applicability of these methodologies for the more challenging problem of end-to-end full 3D shape retrieval. Is it possible to design an architecture that produces full 3D shapes corresponding to humans observed in an input image? This is the problem we propose to tackle during this internship.

Figure: Example of an architecture for human 3D shape estimation (credit [1]).


Naturally, however simple its formulation, this objective raises several key challenges. First, there is an unsolved representational issue. CNNs are most at ease with regular 2D input and output grids, whereas here the envisioned architecture must somewhere bridge the gap between inputs that are still 2D and a 3D shape parameterization that is yet to be defined. Second, the dimensionality of the problem is considerably higher than what existing 3D networks have been shown to handle, because the parameterization sought is no longer restricted to a subset of the variability, e.g. the kinematic pose of humans, but extends to an intrinsically finer description that should also accurately account for surface details. Third, the training sets for this problem are yet to be designed and produced. The large data variability of 3D problems has motivated some initial efforts to produce fully synthetic training sets [7], where such variability can be scripted. Yet recent successful methods underscore the need for training data that is as realistic as possible, both for the general applicability of the estimation and to keep the underlying network architecture simple and as free as possible of domain adaptation.
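The dimensionality gap mentioned above can be made concrete with a small sketch. It is only an illustration under stated assumptions, not the method to be developed during the internship: the 17-joint skeleton and the 10-coefficient shape code are hypothetical choices, the linear decoder with random values is a toy stand-in for a learned shape model, and the vertex count is that of the SMPL template mesh used in [1].

```python
import random

NUM_JOINTS = 17        # assumed skeleton size, for illustration only
NUM_VERTICES = 6890    # vertex count of the SMPL template mesh used in [1]
NUM_SHAPE_COEFFS = 10  # assumed size of a low-dimensional shape code


def output_size(num_points, dims=3):
    """Number of scalars a network must regress for num_points 3D points."""
    return num_points * dims


def decode_shape(mean, basis, coeffs):
    """Toy linear decoder: vertices = mean + sum_k coeffs[k] * basis[k]."""
    vertices = list(mean)
    for weight, direction in zip(coeffs, basis):
        for i, delta in enumerate(direction):
            vertices[i] += weight * delta
    return vertices


rng = random.Random(0)
flat_size = output_size(NUM_VERTICES)

# Random placeholders standing in for a learned mean shape and shape basis.
mean_shape = [rng.uniform(-1.0, 1.0) for _ in range(flat_size)]
shape_basis = [[rng.gauss(0.0, 0.01) for _ in range(flat_size)]
               for _ in range(NUM_SHAPE_COEFFS)]

pose_size = output_size(NUM_JOINTS)
# An all-zero shape code decodes to the mean shape.
mesh = decode_shape(mean_shape, shape_basis, [0.0] * NUM_SHAPE_COEFFS)

print("pose regression outputs:", pose_size)
print("free-form mesh outputs:", flat_size)
print("shape-code outputs:", NUM_SHAPE_COEFFS)
```

Regressing every vertex coordinate directly requires several hundred times more outputs than regressing joint positions, while a compact shape code keeps the output small at the cost of restricting the surface detail that can be represented. Choosing where to sit between these extremes is part of the representational question raised above.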


This MSc thesis takes place in a solid environment for tackling such challenges, bringing together internationally recognized expertise in visual recognition on the one hand (Thoth) and in 3D tracking and capture on the other (Morpheo). The candidate will benefit from state-of-the-art 3D capture equipment through the Equipex Kinovis platform, which we will leverage to produce highly detailed 3D training data in a highly controlled environment; this data can in turn help the methods produce accurate results in general acquisition situations. We expect this MSc to yield advances toward the larger goal of shape estimation from images, ultimately leading to subsequent PhD work with broad impact on 3D vision and to publications at top-level conferences in the computer vision, computer graphics, and 3D capture communities.


During the MSc, the student will become familiar with deep learning (CNN) techniques and relevant elements of 3D vision, in order to provide a model and application for an initial subset of human capture situations. The master candidate will first review existing efforts on the subject, including the bibliography suggested below and previous MSc efforts within the teams. He/she is then expected to contribute a substantial advance toward the goal of 3D human shape estimation from images, leveraging an existing 3D human pose estimation framework developed in the team [2,3].

The master student will perform the following tasks:
-Study the relevant bibliography and identify relevant existing datasets.
-Discuss and propose a solution to this problem with the advisors.
-Produce a preliminary implementation of the proposed solution.
-Validate this solution on a reasonably sized dataset, either existing or newly acquired. The student will have access to the Kinovis platform to perform her/his own 3D acquisitions.
-Write a thesis detailing the proposed method, bibliography and experiments.

Skills and profile

We are looking for a creative and highly motivated master student (preferably in Computer Science or Applied Mathematics) with an interest in computer vision and deep learning. Fluency in spoken English or French is mandatory. This project requires strong mathematical knowledge in linear algebra, geometry, and statistics, and excellent programming skills in Python, C++ and/or Matlab. Prior courses or knowledge in computer vision, computational geometry, mesh processing, computer graphics, signal processing, or machine learning are a plus. A successful project can lead to a PhD supervised jointly between the Thoth and Morpheo teams at Inria Grenoble.


Please send a CV, a letter of motivation, the names of two referees and transcripts of grades by e-mail to, and .


[1] F. Bogo, A. Kanazawa, C. Lassner, P. V. Gehler, J. Romero, M. J. Black: Keep It SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image. ECCV 2016.
[2] G. Rogez, P. Weinzaepfel, C. Schmid: LCR-Net: Localization-Classification-Regression for Human Pose. CVPR 2017.
[3] G. Rogez, C. Schmid: MoCap-guided Data Augmentation for 3D Pose Estimation in the Wild. NIPS 2016.
[4] L. Wei, Q. Huang, D. Ceylan, E. Vouga, H. Li: Dense Human Body Correspondences Using Convolutional Networks. CVPR 2016.
[5] A. Krizhevsky, I. Sutskever, G. E. Hinton: ImageNet Classification with Deep Convolutional Neural Networks. NIPS 2012.
[6] K. Simonyan, A. Zisserman: Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556, 2014.
[7] G. Varol, J. Romero, X. Martin, N. Mahmood, M. J. Black, I. Laptev, C. Schmid: Learning from Synthetic Humans. CVPR 2017.