Video Motion Analysis with Limited Labeled Training Data
The introduction of Convolutional Neural Networks (CNNs) has significantly advanced the performance of many computer vision systems on benchmark data sets, including those for motion analysis. However, this impressive performance often relies on large amounts of labeled training data, which may not always be available. In this dissertation, we discuss the issue of limited labeled data from two perspectives. First, when the training set is small, a model tends to overfit and generalize poorly to unseen data. We build a connection between a popular CNN architecture and its improved generalization ability. Second, we examine two motion analysis tasks where labeled data is insufficient. We propose an unsupervised refinement method for pixel-level tracking, i.e., motion field estimation, and a weakly supervised method for an object-level tracking task.
More specifically, Residual Neural Networks (ResNets), a popular CNN architecture, improve the generalization ability of CNN models and have become the de facto backbone of choice for deep learning models. The main idea of ResNets is simple: a skip connection is added over a few convolutional layers. However, the reasons for their superiority remain poorly understood. To analyze why ResNets generalize better with the additional identity connections, we map a residual network to an equivalent plain network for a simplified (but still deep and non-linear) setting, and find that the large weights induced by the identity connections improve the network's stability to added noise, which in turn improves the model's generalization ability.
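The residual-to-plain mapping can be illustrated with a scalar sketch (a toy simplification, not the dissertation's full deep non-linear analysis): a residual layer y = x + w·x maps to a plain layer with weight 1 + w, and that near-identity (large) weight makes the output relatively insensitive to noise added to the weights, compared with a plain layer whose weight is small.

```python
def layer(x, weight):
    # One plain linear "layer" in the scalar toy model.
    return weight * x

# A residual layer y = x + w*x maps to a plain layer with weight 1 + w:
w_res = 1.0 + 0.05    # mapped weight, dominated by the identity connection
w_small = 0.05        # a plain layer of the same scale, no skip connection

x, delta = 2.0, 0.01  # input, and noise added to the weight

def rel_change(w):
    # Relative output change when noise `delta` perturbs the weight.
    return abs(layer(x, w + delta) - layer(x, w)) / abs(layer(x, w))

print(rel_change(w_res))    # ~0.0095: near-identity weight is stable
print(rel_change(w_small))  # 0.2: small weight amplifies relative error
```

The relative change works out to delta/|w|, so the larger the effective weight contributed by the identity connection, the smaller the impact of the same additive noise.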
The lack of direct supervision signals makes it hard to estimate 2D image motion accurately, and even more so near motion boundaries (MBs), i.e., the curves across which the motion field is discontinuous. First, motion is discontinuous across MBs, while typical estimators assume smoothness. Second, the features extracted to find point correspondences between frames are contaminated by multiple motions. Direct supervision from ground-truth labels could mitigate this problem, but dense per-pixel motion annotation is prohibitively expensive, if not impossible. We show that accurate prediction of MBs from imperfect motion estimates helps improve motion estimates near MBs. Our method first detects MBs from the input motion estimated by an existing unsupervised estimator. It then improves the motion predictions near these boundaries by replacing them with motion sampled a bit farther away from the MB.
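A minimal 1-D sketch of this refinement idea follows; the threshold and offset are hypothetical placeholders, not the dissertation's learned boundary detector, and real MBs live on 2-D curves rather than points on a line. The sketch flags boundaries where the input motion changes sharply, then overwrites estimates adjacent to each boundary with values sampled a few pixels farther away on the same side.

```python
def refine_motion(motion, thresh=0.5, offset=2):
    """Replace motion estimates near detected boundaries with motion
    sampled `offset` pixels away from the boundary (1-D toy version)."""
    n = len(motion)
    # Step 1: detect boundary locations from the imperfect input motion.
    boundaries = [i for i in range(n - 1)
                  if abs(motion[i + 1] - motion[i]) > thresh]
    refined = list(motion)
    for b in boundaries:
        # Step 2: overwrite estimates on each side of the boundary with
        # the motion a bit farther from it.
        for i in range(max(0, b - offset + 1), b + 1):
            refined[i] = motion[max(0, b - offset)]
        for i in range(b + 1, min(n, b + offset)):
            refined[i] = motion[min(n - 1, b + offset)]
    return refined

# Two constant-motion regions with a smeared transition at the boundary:
blurred = [1.0, 1.0, 1.0, 1.4, 2.6, 3.0, 3.0, 3.0]
print(refine_motion(blurred))  # [1.0, 1.0, 1.0, 1.0, 3.0, 3.0, 3.0, 3.0]
```

The smeared values 1.4 and 2.6 at the transition are snapped back to the motions of their respective regions, which is the qualitative behavior the refinement targets.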
Next, we propose a weakly supervised method for Multi-Object Tracking (MOT). A popular tracking framework named tracking by detection first detects objects of interest in each frame, extracts an appearance embedding for each detection, and then associates these detections across frames based on the embeddings. Recently, joint models that perform both object detection and embedding extraction in one forward pass have been proposed and shown to be substantially more efficient at inference than running the two stages sequentially. Existing joint models, however, require fully annotated tracking data for training: both ground-truth bounding boxes for detection and identity labels linking these bounding boxes across frames. When such data is unavailable, we propose a weakly supervised method that augments ground-truth bounding boxes with appearance embeddings computed by an off-the-shelf Re-Identification (Re-ID) model. The Re-ID model is trained on independent Re-ID data that contains only identity labels, and the augmented appearance embeddings serve as pseudo ground-truth embeddings.
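The construction of these pseudo ground-truth targets can be sketched as follows. This is a schematic under stated assumptions: `reid_model` stands in for any off-the-shelf Re-ID network, frames are plain 2-D pixel arrays, and the stub "model" used in the usage example (mean pixel value as a one-element embedding) is purely illustrative.

```python
def crop(frame, box):
    # Extract the image patch inside an (x0, y0, x1, y1) bounding box.
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in frame[y0:y1]]

def make_pseudo_targets(frames, gt_boxes, reid_model):
    """Augment each ground-truth box with a Re-ID embedding that the
    joint detect-and-embed model is trained to regress, so no identity
    labels across frames are needed."""
    targets = []
    for frame, boxes in zip(frames, gt_boxes):
        for box in boxes:
            emb = reid_model(crop(frame, box))  # pseudo GT embedding
            targets.append((box, emb))
    return targets

# Toy usage with a stub Re-ID "model": mean pixel value as the embedding.
frame = [[0, 0, 9, 9],
         [0, 0, 9, 9]]
stub_reid = lambda p: sum(sum(r) for r in p) / sum(len(r) for r in p)
targets = make_pseudo_targets([frame], [[(2, 0, 4, 2)]], stub_reid)
print(targets)  # [((2, 0, 4, 2), 9.0)]
```

Because the Re-ID model is trained on separate identity-labeled data, the joint tracker sees only boxes plus these fixed embedding targets, which is the weak supervision the method relies on.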
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Rights for Collection: Duke Dissertations