Markerless 3D Human Motion Inference Framework from 2D Videos using Deep Learning

Project Description

Objectives

Research Question:
Is it possible to accurately predict the human body’s 3D joint positions from 2D video images without any markers? If this is feasible, such a system could be used in a wide variety of areas, including forensic and clinical gait analysis and human action classification.

Goals:
The proposed work is to (1) build a high-resolution human motion dataset in 2D and 3D, (2) develop a research platform for a Deep Learning-based inference engine that estimates 3D human body joint positions from 2D video images without using markers, (3) complete a pilot study investigating the feasibility of the engine’s prediction models, and (4) seek external funding opportunities.

Background

The aim of this project is to develop a markerless system for analyzing the 3D movement of human body joints. Accurate 3D key joint positions can be acquired with a sophisticated setup in which markers are attached to the body and a 3D position tracking system traces the markers’ movements. In practice, however, such a system cannot be used to acquire data in forensic and clinical settings, for tasks such as identifying a person, classifying an action, or detecting movement abnormality: people do not walk around in special marker-covered suits inside an area where motion capture cameras are installed.

Many challenges exist in using 2D video streams to extract these important features in 3D, because body parts occlude one another in 2D video images taken from a single angle. Before the dawn of the Deep Learning era, this approach was rare and not accurate enough to identify joints in 3D [1], [2]. Thanks to breakthroughs in artificial neural networks such as Deep Convolutional Neural Networks [3], Deep Residual Learning [4], and Generative Adversarial Networks [5], new and more accurate approaches have been proposed for 2D pose estimation [6], [7] and for 3D shape estimation from 2D [8]–[20].

Approach

We propose the following approach to address the aforementioned issues. The overall system can be understood as 3D joint estimation from 2D joints without using markers. See Fig. 1 for more details.

Fig. 1. System overview. The proposed system has two sub-sections: 3D pose estimator from 2D video input (top) and gait feature extractor (bottom). (a) 2D input video, (b) 2D pose estimator, (c) extracted 2D joint points, (d) skeletal data of human pose, (e) skeletal model converter, (f) 2D joint points for 3D estimation, (g) 3D pose estimator, (h) 3D joint points, (i) feature extractor, (j) feature data, and (k) applications. 
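
To make the data flow of Fig. 1 concrete, the following minimal Python sketch wires the stages together. All function names, array shapes, and return types are illustrative assumptions on our part, not the actual implementation.

```python
import numpy as np

def estimate_2d_pose(frame: np.ndarray) -> np.ndarray:
    """(b) 2D pose estimator: (H, W, 3) RGB frame -> (J, 2) joint pixels."""
    raise NotImplementedError  # e.g., a Part Affinity Fields model [6]

def convert_skeleton(joints_2d: np.ndarray) -> np.ndarray:
    """(e) Skeletal model converter: remap the 2D estimator's joint set
    to the joint set the 3D estimator expects."""
    raise NotImplementedError

def estimate_3d_pose(joints_2d: np.ndarray) -> np.ndarray:
    """(g) 3D pose estimator: (J, 2) joint pixels -> (J, 3) positions."""
    raise NotImplementedError  # e.g., a lifting network such as [22]

def extract_features(joints_3d: np.ndarray) -> dict:
    """(i) Feature extractor: gait features (joint angles, etc.)."""
    raise NotImplementedError

def process_frame(frame: np.ndarray) -> dict:
    """(a)-(k): run one video frame through the full pipeline."""
    joints_2d = convert_skeleton(estimate_2d_pose(frame))
    return extract_features(estimate_3d_pose(joints_2d))
```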

In order to build an accurate 3D pose estimator, it is essential to have large, high-quality training data. The earlier approaches mentioned above were not designed to produce 3D joint positions accurate enough for forensic and/or clinical analyses; they aimed mostly at action classification, where precise 3D joint points are not necessary. The accuracies reported by [13], [16] and other state-of-the-art methods are around 20 mm in average error and 80 mm in MPJPE (mean per joint position error). Also, according to [21], there is no strong evidence that state-of-the-art methods can be utilized in forensic gait analysis, where the variability in measured joint angles must be less than ±5 degrees. Our accuracy goal is to meet this requirement so that the estimated 3D joint positions can be used to analyze, for example, gait features such as initial contact hip extension, initial contact left knee flexion, and terminal stance right leg inclination. See Fig. 2.

Fig. 2. Example of angular features of human body movements.
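
For reference, the two accuracy measures discussed above can be computed as follows. This is a minimal NumPy sketch of our own (not code from the cited papers): MPJPE as the mean Euclidean distance between predicted and ground-truth joints, and a three-point joint angle such as knee flexion. The example coordinates are made up.

```python
import numpy as np

def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean per joint position error: Euclidean distance between predicted
    and ground-truth joints, averaged over joints (and frames, if batched).
    Units follow the inputs; Human3.6M-style results are reported in mm."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def joint_angle(a: np.ndarray, b: np.ndarray, c: np.ndarray) -> float:
    """Angle at joint b (degrees) between segments b->a and b->c,
    e.g., the knee angle from hip (a), knee (b), and ankle (c)."""
    u, v = a - b, c - b
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# Made-up coordinates in mm: a slightly flexed knee (about 172 degrees,
# i.e., roughly 8 degrees of flexion relative to a straight leg).
hip   = np.array([0.0, 900.0,  0.0])
knee  = np.array([0.0, 480.0, 30.0])
ankle = np.array([0.0,  60.0,  0.0])
print(joint_angle(hip, knee, ankle))
```

Meeting the ±5 degree requirement means an angle computed from estimated joints must stay within 5 degrees of the same angle computed from ground-truth joints.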

Thus, our proposal includes a data acquisition system that will capture accurate 3D body joint positions as ground truth. The acquired data will be used to train a Deep Neural Network to estimate 3D human body joint positions accurately. The proposed system therefore consists of three modules: data acquisition, 2D joint estimation, and 3D joint estimation.
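
One way the data acquisition module could organize its output is one record per synchronized capture instant, pairing the 2D video frame with the marker-derived 3D ground truth. The record layout below is a sketch under our own assumptions; the field names and shapes are illustrative.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MotionSample:
    """One synchronized training record (illustrative field names)."""
    frame: np.ndarray      # (H, W, 3) RGB video frame
    depth: np.ndarray      # (H, W) depth map, if a depth sensor is present
    joints_2d: np.ndarray  # (J, 2) joint pixel coordinates
    joints_3d: np.ndarray  # (J, 3) marker-based ground truth, in mm
    timestamp: float       # capture time, for aligning the modalities
```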

2D Joint Estimation: We will use the method of [6] for 2D human pose estimation, from which we will obtain 2D joint positions. We will retrain the original network with new data from our data acquisition system to achieve higher accuracy. Fig. 3 shows the overall pipeline of the 2D joint estimator that we plan to use.

Fig. 3. Overall pipeline. (a) The method takes the entire image as input and feeds it through two branches ((b) and (c)). (d) A set of bipartite matchings associates body part candidates. (e) The candidates are assembled into full-body poses. [6]
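
To illustrate step (d), the sketch below scores a single candidate limb with the part-affinity-field line integral that [6] computes before bipartite matching: field vectors are sampled along the segment between two joint candidates and projected onto the segment’s direction. The array layout (separate x and y field channels indexed as [row, col]) is our assumption.

```python
import numpy as np

def paf_connection_score(paf_x: np.ndarray, paf_y: np.ndarray,
                         p1, p2, n_samples: int = 10) -> float:
    """Average dot product of the part affinity field with the unit
    vector of the candidate limb p1 -> p2, sampled along the segment."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    d = p2 - p1
    length = np.linalg.norm(d)
    if length < 1e-8:
        return 0.0
    u = d / length  # unit direction of the candidate limb
    score = 0.0
    for t in np.linspace(0.0, 1.0, n_samples):
        x, y = (p1 + t * d).astype(int)  # nearest-pixel sampling
        score += paf_x[y, x] * u[0] + paf_y[y, x] * u[1]
    return score / n_samples
```

A high score means the field consistently points along the segment, so the two joint candidates likely belong to the same limb; these scores serve as the edge weights in the bipartite matching of step (d).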

3D Joint Estimation: To estimate 3D joint positions, we will use the method of [22], which takes 2D skeletal data as input and performs 3D pose estimation. This method is simple yet effective compared to other methods [8]–[20]. The original system includes its own 2D pose estimator; we decided, however, to replace it with [6] after comparing their performances. Fig. 4 illustrates the 3D joint estimator that we plan to use.

Fig. 4. Overall pipeline. A unit building block is applied twice with a skip connection to form a residual block, and two such blocks are stacked in the 3D pose estimator. The output is an array of 3D joint positions. [22]
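
As a reference for the architecture in Fig. 4, here is a minimal PyTorch sketch of the residual, fully connected lifting network of [22]. The hyperparameters (width 1024, dropout 0.5, two residual blocks) follow the paper’s defaults as we understand them; the joint count of 17 (the Human3.6M convention) is an assumption.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Linear -> BatchNorm -> ReLU -> Dropout, applied twice, with a
    skip connection: the unit building block of [22] repeated twice."""
    def __init__(self, width: int = 1024, p_drop: float = 0.5):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(width, width), nn.BatchNorm1d(width),
            nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(width, width), nn.BatchNorm1d(width),
            nn.ReLU(), nn.Dropout(p_drop),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.layers(x)

class Lifter2Dto3D(nn.Module):
    """Lift (J, 2) joint pixels to (J, 3) positions."""
    def __init__(self, n_joints: int = 17, width: int = 1024,
                 n_blocks: int = 2):
        super().__init__()
        self.inp = nn.Linear(n_joints * 2, width)
        self.blocks = nn.Sequential(
            *[ResidualBlock(width) for _ in range(n_blocks)])
        self.out = nn.Linear(width, n_joints * 3)

    def forward(self, joints_2d: torch.Tensor) -> torch.Tensor:
        b = joints_2d.shape[0]                 # (batch, J, 2)
        x = self.inp(joints_2d.reshape(b, -1))
        return self.out(self.blocks(x)).reshape(b, -1, 3)

# Example: a batch of eight 2D poses lifted to 3D.
model = Lifter2Dto3D()
pred_3d = model(torch.randn(8, 17, 2))  # -> (8, 17, 3)
```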

Expected Outcomes

The project outcomes will be as follows: (1) human motion datasets with 2D joints, 3D joints, and depth information; (2) a 2D human pose estimator trained on the new 2D dataset; and (3) a 3D human pose estimator trained on the new 3D joint dataset.

Significance of the Activity

The proposed work will contribute not only new benchmark datasets for assessing human motion analyses but also methods for extracting 3D human motion features from widely available 2D videos. This will benefit the University in teaching and research on robotics, computer vision, and security.

Bibliography/References

[1] E. Adeli-Mosabbeb, M. Fathy, and F. Zargari, “Model-based human gait tracking, 3D reconstruction and recognition in uncalibrated monocular video,” Imaging Sci. J., vol. 60, no. 1, pp. 9–28, Feb. 2012.

[2] “Tracking hybrid 2D-3D human models from multiple views,” in Proceedings IEEE International Workshop on Modelling People (MPeople’99), 1999, pp. 11–18.

[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105.

[4] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” arXiv:1512.03385 [cs], Dec. 2015.

[5] I. J. Goodfellow et al., “Generative Adversarial Networks,” arXiv:1406.2661 [cs, stat], Jun. 2014.

[6] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, 2017, pp. 1302–1310.

[7] G. Papandreou, T. Zhu, L.-C. Chen, S. Gidaris, J. Tompson, and K. Murphy, “PersonLab: Person Pose Estimation and Instance Segmentation with a Bottom-Up, Part-Based, Geometric Embedding Model,” arXiv:1803.08225 [cs], Mar. 2018.

[8] D. Tome, C. Russell, and L. Agapito, “Lifting from the Deep: Convolutional 3D Pose Estimation from a Single Image,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, 2017, pp. 5689–5698.

[9] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black, “Keep it SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image,” arXiv:1607.08128 [cs], Jul. 2016.

[10] D. Mehta et al., “Single-Shot Multi-Person 3D Body Pose Estimation From Monocular RGB Input,” arXiv:1712.03453 [cs], Dec. 2017.

[11] G. Rogez, P. Weinzaepfel, and C. Schmid, “LCR-Net++: Multi-person 2D and 3D Pose Detection in Natural Images,” IEEE Trans. Pattern Anal. Mach. Intell., pp. 1–1, 2019.

[12] C.-H. Chen and D. Ramanan, “3D Human Pose Estimation = 2D Pose Estimation + Matching,” arXiv:1612.06524 [cs], Dec. 2016.

[13] F. Moreno-Noguer, “3D Human Pose Estimation from a Single Image via Distance Matrix Regression,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, 2017, pp. 1561–1570.

[14] U. Iqbal, A. Doering, H. Yasin, B. Krüger, A. Weber, and J. Gall, “A dual-source approach for 3D human pose estimation from single images,” Comput. Vis. Image Underst., vol. 172, pp. 37–49, Jul. 2018.

[15] S. Li and A. B. Chan, “3D Human Pose Estimation from Monocular Images with Deep Convolutional Neural Network,” in Computer Vision — ACCV 2014, vol. 9004, D. Cremers, I. Reid, H. Saito, and M.-H. Yang, Eds. Cham: Springer International Publishing, 2015, pp. 332–347.

[16] Q. Wan, W. Zhang, and X. Xue, “DeepSkeleton: Skeleton Map for 3D Human Pose Regression,” arXiv:1711.10796 [cs], Nov. 2017.

[17] X. Zhou, Q. Huang, X. Sun, X. Xue, and Y. Wei, “Towards 3D Human Pose Estimation in the Wild: a Weakly-supervised Approach,” arXiv:1704.02447 [cs], Apr. 2017.

[18] X. Zhou, M. Zhu, S. Leonardos, K. G. Derpanis, and K. Daniilidis, “Sparseness Meets Deepness: 3D Human Pose Estimation from Monocular Video,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4966–4975.

[19] X. Zhou, S. Leonardos, X. Hu, and K. Daniilidis, “3D shape estimation from 2D landmarks: A convex relaxation approach,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 2015, pp. 4447–4455.

[20] Y. Kudo, K. Ogaki, Y. Matsui, and Y. Odagiri, “Unsupervised Adversarial Learning of 3D Human Pose from 2D Joint Locations,” arXiv:1803.08244 [cs], Mar. 2018.

[21] Royal Society (London), Forensic gait analysis: a primer for courts. London: The Royal Society, 2017.

[22] J. Martinez, R. Hossain, J. Romero, and J. J. Little, “A Simple Yet Effective Baseline for 3d Human Pose Estimation,” in 2017 IEEE International Conference on Computer Vision (ICCV), Venice, 2017, pp. 2659–2668.