Human Activity Analysis


Video cameras have become increasingly prevalent, with higher resolutions and frame rates. Humans are often the focus of these videos, making human motion analysis an important field. This thesis explores the level of detail necessary to distinguish human activities for regression tasks, such as body tracking, and for activity classification.

We first consider activities that can be distinguished by their appearance at a single moment in time. Specifically, we use a database-retrieval approach both to approximate the full 3D pose of the hand from a single frame and to classify its configuration. To index the database, we present a novel silhouette signature and signature distance that capture differences in both the extension and abduction of the fingers.

Next, we consider more complex activities, like typing, that are characterized by a motion texture: statistical regularities in space and time. A single frame is inadequate to distinguish such activities, and tracking the detailed sequence of body and object elements may be difficult because of occlusions or temporal aliasing. Such activities are characterized not by a detailed sequence of 3D poses but by the motion texture they produce. We propose a new motion texture activity representation for computer vision tasks that require this spatial-temporal reasoning. Autocorrelation captures temporal aspects of an activity signal that may be unbounded in time, and we show how it can be computed efficiently using an exponential-moving-average formulation. An optional space-time aggregation handles a potentially variable number of motion signals. This motion texture representation transforms any input activity signal into a fixed-size representation, even when the activity itself has varying extent in space and time. As a result of this conversion, any off-the-shelf classifier can be applied to detect the activity.
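The abstract mentions estimating autocorrelation over a temporally unbounded signal with an exponential moving average. A minimal sketch of that general idea follows; the function name, the lag window, and the decay rate `alpha` are illustrative assumptions, not the thesis's actual formulation:

```python
import numpy as np

def ema_autocorrelation(signal, max_lag, alpha=0.05):
    """Running autocorrelation estimate over a (possibly unbounded) stream.

    For each lag tau in [0, max_lag], maintain an exponential moving
    average of the lagged product x[t] * x[t - tau]. The descriptor is
    a fixed-size vector of max_lag + 1 values regardless of how long
    the activity lasts.  (Illustrative sketch, not the thesis's code.)
    """
    r = np.zeros(max_lag + 1)        # EMA of lagged products, one per lag
    history = np.zeros(max_lag + 1)  # recent samples: history[tau] == x[t - tau]
    for t, x in enumerate(signal):
        history = np.roll(history, 1)
        history[0] = x
        n = min(t, max_lag)          # only lags with a valid past sample
        r[: n + 1] = (1 - alpha) * r[: n + 1] + alpha * x * history[: n + 1]
    return r
```

Because each update touches only a fixed-size state, the cost per frame is O(max_lag) and independent of the sequence length, which is what makes the formulation suitable for signals of unbounded duration.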

For evaluation, we show how our representation can be used as a motion texture ``layer'' within a convolutional neural network. We first study typing detection, and use our method with trajectories from corner points as input. The resulting motion texture descriptor captures hand-object motion patterns that we use within a privacy-filter pipeline to obscure potentially sensitive content, like passcodes. We also study the more abstract challenge of identity recognition by gait and demonstrate significant improvements over the state of the art using silhouette sequences as input to our autocorrelation network. Further, we show that adding a shallow network before the autocorrelation computation and training the network end-to-end learns a more robust activity feature.





Carley, Cassandra Mariette (2018). Human Activity Analysis. Dissertation, Duke University. Retrieved from


Duke's student scholarship is made available to the public using a Creative Commons Attribution-NonCommercial-NoDerivatives (CC BY-NC-ND) license.