Pytorch-TensorRT-Detection

Pose Estimation with PoseNet

Pose estimation consists of locating various body parts (aka keypoints) that form a skeletal topology (aka links). Pose estimation has a variety of applications including gestures, AR/VR, HMI (human/machine interface), and posture/gait correction. Pre-trained models are provided for human body and hand pose estimation that are capable of detecting multiple people per frame.

The poseNet object accepts an image as input, and outputs a list of object poses. Each object pose contains a list of detected keypoints, along with their locations and links between keypoints. You can query these to find particular features. poseNet can be used from Python and C++.

As examples of using the poseNet class, sample programs are provided for C++ (posenet) and Python (posenet.py).

These samples are able to detect the poses of multiple humans in images, videos, and camera feeds. For more info about the various types of input/output streams supported, see the Camera Streaming and Multimedia page.

Pose Estimation on Images

First, let’s try running the posenet sample on some example images. In addition to the input/output paths, the sample accepts optional command-line options, such as the --network flag described under Pre-trained Pose Estimation Models.

If you’re using the Docker container, it’s recommended to save the output images to the images/test mounted directory. These images will then be easily viewable from your host device under jetson-inference/data/images/test (for more info, see Mounted Data Volumes).

Here are some examples of human pose estimation using the default Pose-ResNet18-Body model:

# C++
$ ./posenet "images/humans_*.jpg" images/test/pose_humans_%i.jpg

# Python
$ ./posenet.py "images/humans_*.jpg" images/test/pose_humans_%i.jpg

note: the first time you run each model, TensorRT will take a few minutes to optimize the network. This optimized network file is then cached to disk, so future runs using the model will load faster.

There are also test images of people under "images/peds_*.jpg" that you can try as well.

Pose Estimation from Video

To run pose estimation on a live camera stream or video, pass in a device or file path from the Camera Streaming and Multimedia page.

# C++
$ ./posenet /dev/video0     # csi://0 if using MIPI CSI camera

# Python
$ ./posenet.py /dev/video0  # csi://0 if using MIPI CSI camera

<img src="https://github.com/dusty-nv/jetson-inference/raw/dev/docs/images/posenet-video-body.jpg" width="750">

To run hand pose estimation instead, select the resnet18-hand model with the --network flag:

# C++
$ ./posenet --network=resnet18-hand /dev/video0

# Python
$ ./posenet.py --network=resnet18-hand /dev/video0

<img src="https://github.com/dusty-nv/jetson-inference/raw/dev/docs/images/posenet-video-hands.jpg" width="750">

Pre-trained Pose Estimation Models

Below are the pre-trained pose estimation networks available for download, and the associated --network argument to posenet used for loading the pre-trained models:

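| Model                 | CLI argument     | NetworkType enum | Keypoints |
| --------------------- | ---------------- | ---------------- | --------- |
| Pose-ResNet18-Body    | resnet18-body    | RESNET18_BODY    | 18        |
| Pose-ResNet18-Hand    | resnet18-hand    | RESNET18_HAND    | 21        |
| Pose-DenseNet121-Body | densenet121-body | DENSENET121_BODY | 18        |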

note: to download additional networks, run the Model Downloader tool
             $ cd jetson-inference/tools
             $ ./download-models.sh

You can specify which model to load by setting the --network flag on the command line to one of the corresponding CLI arguments from the table above. By default, Pose-ResNet18-Body is used if the optional --network flag isn’t specified.
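
As a rough sketch of how your own script might validate a --network argument against the table above, the following uses Python's standard argparse module. The POSE_MODELS dictionary and the parse_network helper are hypothetical names introduced for illustration and are not part of jetson-inference. Restricting the choices to the three listed models is also a simplification made by this sketch.

```python
import argparse

# CLI argument -> (model name, keypoint count), copied from the table above.
# POSE_MODELS is a hypothetical helper for this sketch, not a library API.
POSE_MODELS = {
    "resnet18-body": ("Pose-ResNet18-Body", 18),
    "resnet18-hand": ("Pose-ResNet18-Hand", 21),
    "densenet121-body": ("Pose-DenseNet121-Body", 18),
}


def parse_network(argv):
    """Return the selected --network value, defaulting to resnet18-body."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--network", default="resnet18-body",
                        choices=sorted(POSE_MODELS))
    # parse_known_args ignores other arguments such as input/output paths
    args, _ = parser.parse_known_args(argv)
    return args.network


if __name__ == "__main__":
    network = parse_network(["--network=resnet18-hand", "/dev/video0"])
    name, num_keypoints = POSE_MODELS[network]
    print(f"{network}: {name} with {num_keypoints} keypoints")
```

Calling parse_network([]) falls back to resnet18-body, mirroring the documented default.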

Working with Object Poses

If you want to access the pose keypoint locations, the poseNet.Process() function returns a list of poseNet.ObjectPose structures. Each object pose represents one object (i.e. one person) and contains a list of detected keypoints and links - see the Python and C++ docs for more info.
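
To make that layout concrete, here is a minimal stand-in for the result structure written with plain Python dataclasses. It only mirrors the members mentioned on this page (ID, Keypoints, Links, FindKeypoint). The class definitions, the representation of each link as a pair of indices into Keypoints, the name field stored on each keypoint, and the sample coordinates are all assumptions made for illustration, not the library's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Keypoint:
    name: str   # simplification: this mock stores the keypoint name directly
    x: float    # image-space coordinates, in pixels
    y: float


@dataclass
class ObjectPose:
    ID: int
    Keypoints: List[Keypoint] = field(default_factory=list)
    # each link joins two entries of Keypoints, given as a pair of list indices
    Links: List[Tuple[int, int]] = field(default_factory=list)

    def FindKeypoint(self, name: str) -> int:
        """Return the index of the named keypoint in Keypoints, or -1 if absent."""
        for idx, keypoint in enumerate(self.Keypoints):
            if keypoint.name == name:
                return idx
        return -1


# made-up sample data: one person with two detected keypoints and one link
pose = ObjectPose(
    ID=0,
    Keypoints=[Keypoint("left_shoulder", 100.0, 80.0),
               Keypoint("left_wrist", 160.0, 140.0)],
    Links=[(0, 1)],
)

print(pose.FindKeypoint("left_wrist"))   # index into pose.Keypoints
print(pose.FindKeypoint("right_wrist"))  # -1, not detected in this pose
```

The point of the sketch is the lookup pattern: a negative index means the keypoint was not detected, so check for it before indexing into Keypoints.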

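Below is a Python example for finding the 2D direction (in image space) that a person is pointing, by forming a vector from the left_shoulder keypoint to the left_wrist keypoint. The setup lines that load the model and a test image are assumptions added so the snippet is complete; adjust the model and image path to suit:

# newer releases expose these modules; older ones use jetson.inference / jetson.utils
from jetson_inference import poseNet
from jetson_utils import loadImage

# assumed setup: the default body model and one of the test images
net = poseNet("resnet18-body")
img = loadImage("images/humans_0.jpg")

# run pose estimation; returns a list of poseNet.ObjectPose
poses = net.Process(img)

for pose in poses:
    # find the keypoint index from the list of detected keypoints
    # you can find these keypoint names in the model's JSON file,
    # or with net.GetKeypointName() / net.GetNumKeypoints()
    left_wrist_idx = pose.FindKeypoint('left_wrist')
    left_shoulder_idx = pose.FindKeypoint('left_shoulder')

    # if the keypoint index is < 0, it means it wasn't found in the image
    if left_wrist_idx < 0 or left_shoulder_idx < 0:
        continue

    left_wrist = pose.Keypoints[left_wrist_idx]
    left_shoulder = pose.Keypoints[left_shoulder_idx]

    # vector from the shoulder to the wrist, i.e. the direction the arm extends
    point_x = left_wrist.x - left_shoulder.x
    point_y = left_wrist.y - left_shoulder.y

    print(f"person {pose.ID} is pointing towards ({point_x}, {point_y})")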

This was a simple example, but you can make it more useful with further manipulation of the vectors and by looking up more keypoints. There are also more advanced techniques that use machine learning on the pose results for gesture classification, like in the trt_hand_pose project.
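
As one possible form of that further vector manipulation, the sketch below normalizes a shoulder-to-wrist vector and converts it to an angle. The pointing_direction function is a hypothetical helper, not part of jetson-inference, and it takes plain (x, y) tuples so it runs without the library. It assumes image-space coordinates where y grows downward, so the y difference is flipped to make a positive angle mean pointing up on screen.

```python
import math


def pointing_direction(shoulder, wrist):
    """Return (unit_x, unit_y, angle_deg) for the vector from shoulder to wrist.

    Both arguments are (x, y) tuples in image space, where y grows downward.
    Returns None when the two points coincide and no direction is defined.
    """
    dx = wrist[0] - shoulder[0]
    dy = wrist[1] - shoulder[1]
    length = math.hypot(dx, dy)
    if length == 0.0:
        return None
    # flip the y difference so that a positive angle means pointing up on screen
    angle_deg = math.degrees(math.atan2(shoulder[1] - wrist[1], dx))
    return dx / length, dy / length, angle_deg


# made-up coordinates: wrist to the right of the shoulder, then directly above it
print(pointing_direction((100.0, 80.0), (160.0, 80.0)))
print(pointing_direction((100.0, 80.0), (100.0, 20.0)))
```

With real results, you would pass (left_shoulder.x, left_shoulder.y) and (left_wrist.x, left_wrist.y) from a detected pose. The unit vector is independent of how large the person appears in the frame, which makes thresholds on the angle easier to reuse across images.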


Next | Monocular Depth Estimation
Back | Running the Live Camera Segmentation Demo