Project Description
Objectives
Research Question:
Is it possible to accurately predict the human body’s 3D joint positions from 2D video images without any markers? If this is feasible, such a system could be used in a wide variety of areas, including forensic and clinical gait analysis and human action classification.
Goals:
The proposed work is to (1) build a high-resolution human motion dataset in 2D and 3D, (2) develop a research platform for a deep-learning-based inference engine that estimates the human body’s 3D joint positions from 2D video images without using markers, (3) complete a pilot study to investigate the feasibility of the inference engine’s prediction models, and (4) seek external funding opportunities.
Background
The aim of this project is to develop a markerless system to analyze the 3D movement of human body joints. Accurate 3D positions of key joints can be acquired with a sophisticated setup in which markers are attached to the body and a 3D position tracking system traces the markers’ movements. In practice, however, such a system cannot be used to acquire data in forensic and clinical settings, for tasks such as identifying a person, classifying an action, or detecting abnormality in body movement: people do not walk around in a special marker-covered suit inside an area where motion capture cameras are installed.
Many challenges exist in extracting these important 3D features from 2D video streams, because body parts occlude one another in video taken from a single angle. Before the dawn of the deep learning era, this approach was rare and not accurate enough to identify joints in 3D [1], [2]. Thanks to breakthroughs in artificial neural networks such as deep convolutional neural networks [3], deep residual learning [4], and generative adversarial networks [5], new and more accurate approaches have been proposed for 2D pose estimation [6], [7] and for 3D shape estimation from 2D images [8]–[20].
Approach
We propose to take the following approach to address the aforementioned issues. The overall system can be understood as 3D joint estimation from 2D joints without using markers. See Fig. 1 for details.
Fig. 1. System overview. The proposed system has two sub-sections: 3D pose estimator from 2D video input (top) and gait feature extractor (bottom). (a) 2D input video, (b) 2D pose estimator, (c) extracted 2D joint points, (d) skeletal data of human pose, (e) skeletal model converter, (f) 2D joint points for 3D estimation, (g) 3D pose estimator, (h) 3D joint points, (i) feature extractor, (j) feature data, and (k) applications.
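The dataflow of Fig. 1 can be sketched as a composition of four stages. The function names and joint counts below are placeholders chosen for illustration, not the actual models; each stage simply returns dummy data of the expected shape.

```python
# Minimal sketch of the Fig. 1 dataflow. All stage implementations are
# placeholder assumptions; only the shapes of the data passed between
# stages are meant to be illustrative.
import numpy as np

def estimate_2d_pose(frame):            # (b)-(c): 2D pose estimator
    return np.zeros((17, 2))            # e.g. 17 joints, (x, y) in pixels

def convert_skeletal_model(joints_2d):  # (e)-(f): skeletal model converter
    return joints_2d - joints_2d.mean(axis=0)  # e.g. root-center the pose

def estimate_3d_pose(joints_2d):        # (g)-(h): 3D pose estimator
    return np.zeros((17, 3))            # lifted (x, y, z) joint positions

def extract_features(joints_3d):        # (i)-(j): gait feature extractor
    return {"hip_extension_deg": 0.0}   # placeholder gait feature

frame = np.zeros((480, 640, 3))         # (a): one 2D video frame
features = extract_features(
    estimate_3d_pose(convert_skeletal_model(estimate_2d_pose(frame))))
```

The features dictionary then feeds the applications in (k).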
In order to build an accurate 3D pose estimator, it is essential to have a large, high-quality training dataset. The earlier approaches mentioned above were not designed to produce 3D joint positions accurate enough for forensic and/or clinical analyses; they aimed mostly at action classification, for which precise 3D joint points are not necessary. The accuracies reported in [13], [16] and other state-of-the-art methods are around 20 mm average error and 80 mm MPJPE (mean per joint position error). Also, according to [21], there is no strong evidence that state-of-the-art methods can be utilized in forensic gait analysis: the variability in measuring joint angles must be less than ±5 degrees. Our accuracy goal is to meet this requirement so that the estimated 3D joint positions can be used to analyze gait features such as initial contact hip extension, initial contact left knee flexion, and terminal stance right leg inclination. See Fig. 2.
Fig. 2. Example of angular features of human body movements.
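The two quantities above, MPJPE and joint angles, have standard definitions that can be stated concretely. A minimal sketch (joint names in the docstrings are illustrative):

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per joint position error: the Euclidean distance between
    predicted and ground-truth 3D joint positions, averaged over joints
    (reported in mm when the inputs are in mm)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def joint_angle_deg(a, b, c):
    """Angle at joint b (degrees) formed by 3D points a-b-c,
    e.g. hip-knee-ankle for knee flexion."""
    u, v = a - b, c - b
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
```

For example, three collinear points (a fully extended limb) give an angle of 180 degrees, and the ±5-degree requirement from [21] then bounds the acceptable variability of `joint_angle_deg` across repeated estimates.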
Thus, our proposal includes a data acquisition system that will capture exact 3D body joint positions as ground truth. The acquired data will be used to train a deep neural network to estimate 3D human body joint positions accurately. The proposed system will therefore comprise three modules.
2D Joint Estimation: We will use the method from [6] for 2D human pose estimation, from which we will obtain 2D joint positions. We will retrain the original network with new data from our data acquisition system to achieve higher accuracy. Fig. 3 shows the overall pipeline of the 2D joint estimator that we plan to use.
Fig. 3. Overall pipeline. (a) The method takes the entire image as input and processes it in two branches ((b) and (c)). (d) A set of bipartite matchings associates body part candidates. (e) The parts are assembled into full-body poses. [6]
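Because the 2D estimator and the 3D estimator generally use different joint orderings (for example, a COCO-style skeleton versus a Human3.6M-style skeleton), the skeletal model converter of Fig. 1(e) must re-index the joints. A minimal sketch, where `index_map` is a hypothetical mapping chosen by the user:

```python
import numpy as np

def convert_skeleton(joints_src, index_map, n_dst):
    """Skeletal model converter (stage (e) in Fig. 1): re-index 2D
    joints from the 2D estimator's ordering into the ordering the 3D
    estimator expects. index_map is {source_index: destination_index};
    joints without a mapping are left NaN so downstream code can mask
    or interpolate them."""
    out = np.full((n_dst, 2), np.nan)
    for src_idx, dst_idx in index_map.items():
        out[dst_idx] = joints_src[src_idx]
    return out
```

The concrete index map depends on the joint definitions of the chosen 2D and 3D estimators and would be fixed once per model pair.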
3D Joint Estimation: To estimate 3D joint positions, we will use the method from [22], which takes 2D skeletal data as input and performs 3D pose estimation. This method is simple yet efficient compared to other methods [8]–[20]. The original system includes its own 2D pose estimator; however, we decided to replace it with [6] after comparing their performance. Fig. 4 illustrates the 3D joint estimator that we plan to use.
Fig. 4. Overall pipeline. A unit building block is repeated twice, and the repeated block is concatenated twice again in the 3D pose estimator. The output is an array of 3D joint positions. [22]
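The lifting network of Fig. 4 can be sketched as a forward pass over residual blocks of fully connected layers. This is an approximation of the architecture in [22] (batch normalization and dropout from the original are omitted), and the layer width of 1024 and joint count of 16 are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(x, w, b):
    return x @ w + b

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, block_params):
    """One unit building block: two linear+ReLU layers plus a skip
    connection (batch norm and dropout omitted in this sketch)."""
    h = relu(linear(x, *block_params[0]))
    h = relu(linear(h, *block_params[1]))
    return x + h

def lift_2d_to_3d(joints_2d, params):
    """Forward pass: 16 joints x 2 coords (flattened) -> 16 joints x 3."""
    x = relu(linear(joints_2d.reshape(-1), *params["in"]))  # 32 -> 1024
    for blk in params["blocks"]:                            # two blocks
        x = residual_block(x, blk)
    return linear(x, *params["out"]).reshape(16, 3)         # 1024 -> 48

def init(d_in, d_out):
    return rng.normal(0.0, 0.01, (d_in, d_out)), np.zeros(d_out)

params = {
    "in": init(32, 1024),
    "blocks": [[init(1024, 1024), init(1024, 1024)] for _ in range(2)],
    "out": init(1024, 48),
}
pose_3d = lift_2d_to_3d(np.zeros((16, 2)), params)  # shape (16, 3)
```

In training, the network weights would be fit by minimizing the distance between `pose_3d` and the ground-truth 3D joints from the data acquisition system.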
Expected Outcomes
The project outcomes will be as follows: (1) human motion datasets containing 2D joints, 3D joints, and depth information; (2) a 2D human pose estimator trained on the new 2D dataset; and (3) a 3D human pose estimator trained on the new 3D joint dataset.
Significance of the Activity
The proposed work contributes not only new benchmark datasets that can be used to assess human motion analyses but also methods for extracting 3D human motion features from widely available 2D videos. This will benefit the University in teaching and research on robotics, computer vision, and security.
Bibliography/References
[5] I. J. Goodfellow et al., “Generative Adversarial Networks,” arXiv:1406.2661 [cs, stat], Jun. 2014.