Video Motion Analysis with Limited Labeled Training Data

Yu, Shuzhi

Date

2023

Authors

Yu, Shuzhi

Advisors

Tomasi, Carlo

Abstract

The introduction of Convolutional Neural Networks (CNNs) has significantly advanced the performance of many computer vision systems on benchmark data sets, including those for motion analysis. However, this performance often relies heavily on labeled training samples, which may not always be available. In this dissertation, we discuss the issue of limited labeled data from two perspectives. First, a model trained on a small data set tends to overfit and generalize poorly to unseen data. We build a connection between a popular CNN architecture and its improved generalization ability. Second, we examine two motion analysis tasks for which labeled data is insufficient. We propose an unsupervised refinement method for pixel-level tracking, i.e., motion field estimation, and a weakly supervised method for an object-level tracking task.

More specifically, Residual Neural Networks (ResNets), a popular CNN architecture, improve the generalization ability of CNN models and have become the de facto backbone of choice for deep learning models. The main idea of ResNets is simple: a skip connection is added over some convolutional layers. However, the reasons for their superiority are not well understood. To analyze why ResNets generalize better with the additional identity connections, we map a simplified (but still deep and non-linear) residual network to a plain network and find that the large weights resulting from the identity connections improve stability to noise added to the network, which in turn improves the model’s generalization ability.
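The idea of mapping a residual layer to a plain one can be illustrated with a minimal sketch. It covers only a single linear layer, which is an assumption made here for brevity; the dissertation analyzes a deep, non-linear simplification. Under that assumption, a residual layer y = x + Wx computes the same output as a plain layer with weights I + W, so the identity connection shows up as large diagonal weights. The function names are hypothetical.

```python
def matvec(W, x):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def residual_layer(W, x):
    """Linear residual layer: y = x + W x (skip connection over one layer)."""
    return [xi + yi for xi, yi in zip(x, matvec(W, x))]


def to_plain_weights(W):
    """Weights of the equivalent plain layer: I + W."""
    n = len(W)
    return [[W[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
            for i in range(n)]


# Illustrative values only, not taken from the dissertation.
W = [[0.1, -0.2], [0.05, 0.3]]
x = [1.0, 2.0]

res_out = residual_layer(W, x)
plain_out = matvec(to_plain_weights(W), x)
```

Both `res_out` and `plain_out` hold the same vector up to floating-point rounding, and the diagonal entries of `to_plain_weights(W)` are dominated by the added identity.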

The lack of direct supervision signals makes it hard to estimate 2D image motion accurately, and even more so near motion boundaries (MBs), i.e., the curves of discontinuity of the motion field, for two reasons. First, motion is discontinuous across MBs, while typical estimators assume smoothness. Second, the features extracted to find point correspondences between frames are contaminated by multiple motions. The direct supervision signal provided by ground-truth labels may mitigate this problem, but dense annotation of the motion at every pixel is very expensive, if not impossible. We show that accurate prediction of MBs from imperfect motion estimates helps improve motion estimates near MBs. Our method first detects MBs from the input motion estimated by an existing unsupervised estimator. Then, it improves the motion predictions near these boundaries by replacing them with motion estimates taken slightly farther from the MB.
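The two-step refinement described above can be sketched on a one-dimensional toy motion field. This is an illustration under strong assumptions, not the dissertation's method: boundaries are found here by thresholding the jump between neighboring pixels, whereas the abstract does not specify how MBs are detected, and the `threshold` and `radius` parameters are hypothetical.

```python
def detect_boundaries(flow, threshold):
    """Return indices i such that a boundary lies between pixels i and i + 1.

    Stand-in detector: a boundary is declared wherever the jump between
    neighboring motion values exceeds the threshold.
    """
    return [i for i in range(len(flow) - 1)
            if abs(flow[i + 1] - flow[i]) > threshold]


def refine_near_boundaries(flow, boundaries, radius):
    """Replace estimates within `radius` pixels of each boundary with the
    estimate found `radius` pixels farther away on the same side."""
    refined = list(flow)
    n = len(flow)
    for b in boundaries:
        left_src = max(b - radius, 0)           # source pixel on the left side
        right_src = min(b + 1 + radius, n - 1)  # source pixel on the right side
        for i in range(left_src + 1, b + 1):
            refined[i] = flow[left_src]
        for i in range(b + 1, right_src):
            refined[i] = flow[right_src]
    return refined


# Toy estimate: the true motion steps from 0 to 4, but the estimate is
# smeared across the boundary, as a smoothness-based estimator might do.
estimated = [0.0, 0.0, 0.0, 0.0, 1.0, 3.0, 4.0, 4.0, 4.0, 4.0]

boundaries = detect_boundaries(estimated, threshold=1.5)
refined = refine_near_boundaries(estimated, boundaries, radius=2)
```

In this toy case the smeared values on each side of the detected boundary are overwritten with the cleaner values two pixels away, which restores a sharp step.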

Next, we propose a weakly supervised method for Multi-Object Tracking (MOT). A popular tracking framework named tracking by detection first detects objects of interest in each frame, extracts an appearance embedding for each detection, and then associates these detections across frames based on the embeddings. Recently, a joint model that performs both object detection and embedding in one forward pass has been proposed, and it has been shown by and large to improve inference over conducting the two operations sequentially. Existing joint models require fully annotated tracking data for training, which includes both the ground-truth bounding boxes for detection and identity labels for these bounding boxes across frames. For cases where this data is unavailable, we propose a weakly supervised method that augments ground-truth bounding boxes with appearance embeddings computed by an off-the-shelf Re-Identification (Re-ID) model. The Re-ID model is trained on independent Re-ID data that contains only identity labels, and the augmented appearance embeddings serve as pseudo-ground-truth embeddings.
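The association step of tracking by detection can be sketched as follows. This is a minimal illustration, not the dissertation's tracker: it assumes non-zero embedding vectors, uses cosine similarity, and matches greedily, where practical trackers often solve an optimal assignment instead. The function names and the `min_sim` threshold are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def associate(prev_embs, curr_embs, min_sim=0.5):
    """Greedily match current detections to previous ones by similarity.

    Returns a dict mapping current-detection index -> previous-detection
    index. Detections left unmatched would start new tracks.
    """
    pairs = sorted(
        ((cosine(p, c), i, j)
         for i, p in enumerate(prev_embs)
         for j, c in enumerate(curr_embs)),
        reverse=True,
    )
    matches, used_prev = {}, set()
    for sim, i, j in pairs:
        if sim < min_sim:
            break  # remaining pairs are even less similar
        if i in used_prev or j in matches:
            continue
        matches[j] = i
        used_prev.add(i)
    return matches


# Toy 2-D embeddings for two objects seen in consecutive frames.
prev_frame = [[1.0, 0.0], [0.0, 1.0]]
curr_frame = [[0.1, 0.9], [0.9, 0.2]]

matches = associate(prev_frame, curr_frame)
```

Here current detection 0 is linked to previous detection 1 and current detection 1 to previous detection 0, because those pairs have the most similar embeddings.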

Type

Dissertation

Department

Computer Science

Subjects

Computer science, Artificial intelligence, Generalization Ability, Motion Analysis, Motion Boundary, Multi-Object Tracking, Optical Flow, Weakly Supervised Learning

Permalink

https://hdl.handle.net/10161/27587

Citation

Yu, Shuzhi (2023). Video Motion Analysis with Limited Labeled Training Data. Dissertation, Duke University. Retrieved from https://hdl.handle.net/10161/27587.

Collections

Dissertations

Except where otherwise noted, student scholarship that was shared on DukeSpace after 2009 is made available to the public under a Creative Commons Attribution / Non-commercial / No derivatives (CC-BY-NC-ND) license. All rights in student work shared on DukeSpace before 2009 remain with the author and/or their designee, whose permission may be required for reuse.
