

ActionRecognitionNet Model Card

Model Overview

The model described in this card is an action recognition network, which aims to recognize what people do in videos. Six pretrained ActionRecognitionNet models are delivered: three 2D models trained with RGB frames, with optical flow generated on A100 with the NVOF SDK, and with optical flow generated on Jetson Xavier with VPI, respectively. There are also three 3D models with the same input types as the 2D models. Both model families are trained on a subset of HMDB51.

Model Architecture

Both the 2D and 3D models are built with a ResNet-style backbone. They take a sequence of RGB frames or optical flow gray images as input and predict the action label of those frames. The training algorithm optimizes the network to minimize the cross-entropy loss for classification.
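The TAO network definition itself is not reproduced in this card. As a rough illustration of the approach described above (a ResNet-style backbone over a stack of frames, trained with a cross-entropy objective), the following PyTorch sketch may help; the layer sizes, class count, and training loop are illustrative assumptions, not the shipped implementation.

```python
# Minimal sketch (not the TAO implementation): a ResNet-style 3D backbone
# that classifies a clip of stacked frames with cross-entropy loss.
import torch
import torch.nn as nn

class BasicBlock3D(nn.Module):
    """One residual block built from 3x3x3 convolutions."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # residual connection

class TinyActionNet3D(nn.Module):
    """Toy stand-in for the 3D variant: stem -> residual blocks -> pooled classifier."""
    def __init__(self, in_channels=3, num_classes=5):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv3d(in_channels, 64, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
        )
        self.blocks = nn.Sequential(BasicBlock3D(64), BasicBlock3D(64))
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Linear(64, num_classes)

    def forward(self, x):  # x: (batch, channels, frames, height, width)
        x = self.blocks(self.stem(x))
        return self.fc(self.pool(x).flatten(1))

if __name__ == "__main__":
    model = TinyActionNet3D(in_channels=3, num_classes=5)   # 5 classes as in HMDB5 (assumed)
    criterion = nn.CrossEntropyLoss()                       # classification objective
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    clips = torch.randn(2, 3, 32, 56, 56)   # 2 dummy clips of 32 frames each
    labels = torch.tensor([0, 3])           # dummy class indices
    loss = criterion(model(clips), labels)
    loss.backward()
    optimizer.step()
    print(f"cross-entropy loss: {loss.item():.4f}")
```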


The models are trained on a subset of HMDB51. We pick the videos of walk, ride_bike, run, fall_floor, and push out of HMDB51 to form HMDB5. The training videos vary in visible body parts, camera motion, camera viewpoint, number of people involved in the action, and video quality.

Visible body parts: upper body, full body, lower body
Camera viewpoint: front, back, left, right
Number of people involved in the action: single, two, three
Video size: most videos are 320x240

TAO Toolkit supports training ActionRecognitionNet with RGB input or optical flow input. The data must be organized as follows: the dataset is divided into directories by class, and each class directory contains multiple video clip folders, each holding the corresponding RGB frames (rgb), optical flow x-axis grayscale images (u), and optical flow y-axis grayscale images (v), as sketched below.
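A minimal sketch of this layout and of a helper that indexes it follows. Only the per-clip rgb, u, and v sub-folders come from the description above; the root name, class folder names, clip folder names, and the index_dataset helper are illustrative assumptions.

```python
# Sketch of the expected dataset layout (names other than rgb/u/v are assumptions):
#
#   dataset_root/
#       walk/
#           clip_0001/
#               rgb/   RGB frames as images
#               u/     optical flow x-axis grayscale images
#               v/     optical flow y-axis grayscale images
#           clip_0002/
#               ...
#       ride_bike/
#       run/
#       fall_floor/
#       push/
from pathlib import Path

def index_dataset(dataset_root):
    """Map each video clip folder to its class label and sorted rgb/u/v frame paths."""
    samples = []
    for class_dir in sorted(p for p in Path(dataset_root).iterdir() if p.is_dir()):
        for clip_dir in sorted(p for p in class_dir.iterdir() if p.is_dir()):
            samples.append({
                "label": class_dir.name,
                "rgb": sorted((clip_dir / "rgb").glob("*")),
                "u": sorted((clip_dir / "u").glob("*")),
                "v": sorted((clip_dir / "v").glob("*")),
            })
    return samples

if __name__ == "__main__":
    for sample in index_dataset("dataset_root")[:3]:
        print(sample["label"], len(sample["rgb"]), "rgb frames")
```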
The evaluation dataset is obtained by randomly collecting 10% of the videos per class out of HMDB5. The evaluation videos are likewise diverse in visible body parts, camera motion, camera viewpoint, number of people involved in the action, and video quality. The key performance indicator is the accuracy of action recognition, measured with two inference schemes.

Center evaluation: inference is performed on the middle frames of the video clip. For example, if the model requires 32 frames as input and a video clip has 128 frames, the frames from index 48 to index 79 are used for inference.

Conv evaluation: inference is performed on 10 segments of a video clip. We uniformly divide the clip into 10 parts, choose the center of each part as a start point, and pick 32 consecutive frames from each start point to form the inference segments. The final label of the video is determined by the average score over those 10 segments. The sampling arithmetic for both schemes is sketched below.
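This is a sketch of the sampling logic only, assuming a 32-frame model input; the function names are ours, and the clamping of segment start points near the end of the clip is an assumption about boundary handling.

```python
# Sketch of the two evaluation sampling strategies described above.
# For a 128-frame clip and a 32-frame input, center_indices() yields
# frames 48..79; conv_segments() yields 10 windows whose scores are averaged.

def center_indices(num_frames, seq_len=32):
    """Center evaluation: a single window taken from the middle of the clip."""
    start = (num_frames - seq_len) // 2
    return list(range(start, start + seq_len))

def conv_segments(num_frames, seq_len=32, num_segments=10):
    """Conv evaluation: windows starting at the center of each of 10 uniform
    parts of the clip (start points clamped so the window stays in the clip)."""
    part = num_frames / num_segments
    segments = []
    for i in range(num_segments):
        start = int(i * part + part / 2)                  # center of the i-th part
        start = max(0, min(start, num_frames - seq_len))  # boundary handling (assumed)
        segments.append(list(range(start, start + seq_len)))
    return segments

def conv_predict(scores_per_segment):
    """Final label = argmax of the per-class scores averaged over the segments."""
    num_classes = len(scores_per_segment[0])
    mean = [sum(seg[c] for seg in scores_per_segment) / len(scores_per_segment)
            for c in range(num_classes)]
    return max(range(num_classes), key=lambda c: mean[c])

if __name__ == "__main__":
    print(center_indices(128)[0], center_indices(128)[-1])  # -> 48 79
    print([seg[0] for seg in conv_segments(128)])           # 10 segment start indices
```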

The inference performance is measured with trtexec on Jetson Nano, Xavier NX, AGX Xavier, and NVIDIA T4 GPUs. The Jetson devices run at the Max-N configuration for maximum system performance. The reported data is inference-only performance; end-to-end performance with streaming video data may vary slightly depending on the application.

This model needs to be used with NVIDIA hardware and software. For hardware, the model can run on any NVIDIA GPU, including NVIDIA Jetson devices. The model can only be used with the Train Adapt Optimize (TAO) Toolkit, the DeepStream SDK, or TensorRT. The primary use case intended for this model is to recognize the action performed in a sequence of RGB frames and optical flow gray images.
