Browsing by Author "Tomasi, Carlo"
Item Open Access 3D Object Representations for Robot Perception (2019) Burchfiel, Benjamin Clark Malloy
Reasoning about 3D objects is one of the most critical perception problems robots face; outside of navigation, most interactions between a robot and its environment are object-centric. Object-centric robot perception has long relied on maintaining an explicit database of 3D object models with the assumption that encountered objects will be exact copies of entries in the database; however, as robots move into unstructured environments such as human homes, the variation of encountered objects increases and maintaining an explicit object database becomes infeasible. This thesis introduces a general-purpose 3D object representation that allows the joint estimation of a previously unencountered object's class, pose, and 3D shape, crucial foundational tasks for general robot perception.
We present the first method capable of performing all three of these tasks simultaneously, Bayesian Eigenobjects (BEOs), and show that it outperforms competing approaches which estimate only object shape and class given a known object pose. BEOs use an approximate Bayesian version of Principal Component Analysis to learn an explicit low-dimensional subspace containing the 3D shapes of objects of interest, which allows for efficient shape inference at high object resolutions. We then extend BEOs to produce Hybrid Bayesian Eigenobjects (HBEOs), a fusion of linear subspace methods with modern convolutional network approaches, enabling realtime inference from a single depth image. Because HBEOs use a Convolutional Network to project partially observed objects onto the learned subspace, they allow the object representation to be larger and more expressive without impacting the inductive power of the model. Experimentally, we show that HBEOs offer significantly improved performance on all tasks compared to their BEO predecessors. Finally, we leverage the explicit 3D shape estimate produced by BEOs to further extend the state of the art in category-level pose estimation by fusing probabilistic pose predictions with a silhouette-based reconstruction prior. We also illustrate the advantages of combining both probabilistic pose estimation and shape verification, via an ablation study, and show that both portions of the system contribute to its performance. Taken together, these methods comprise a significant step towards creating a general-purpose 3D perceptual foundation for robotics systems, upon which problem-specific systems may be built.
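As a rough illustration of the subspace idea behind BEOs (not the thesis implementation, and with a plain least-squares projection standing in for the Bayesian one), the sketch below learns a low-dimensional basis from voxelized training shapes and completes a partially observed shape; the function names and the flattened-voxel representation are illustrative assumptions.

```python
import numpy as np

def learn_shape_subspace(train_shapes, k):
    """train_shapes: (n_shapes, n_voxels) array of flattened voxel grids;
    k: subspace dimension. Returns the mean shape and a rank-k basis."""
    mean = train_shapes.mean(axis=0)
    # Principal directions via SVD; rows of vt are the basis vectors.
    _, _, vt = np.linalg.svd(train_shapes - mean, full_matrices=False)
    return mean, vt[:k]                        # (n_voxels,), (k, n_voxels)

def complete_partial_shape(partial, observed_mask, mean, basis):
    """Least-squares projection of a partially observed voxel grid onto the
    learned subspace, then reconstruction of the full shape."""
    A = basis[:, observed_mask].T              # (n_observed, k)
    b = partial[observed_mask] - mean[observed_mask]
    coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
    return mean + coeffs @ basis               # full-resolution shape estimate
```

HBEOs, as the abstract notes, replace this projection step with a convolutional network that maps a single depth image directly to subspace coefficients.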
Item Open Access Applying machine learning to investigate long-term insect-plant interactions preserved on digitized herbarium specimens. (Applications in plant sciences, 2020-06) Meineke, Emily K; Tomasi, Carlo; Yuan, Song; Pryer, Kathleen M
Premise: Despite the economic significance of insect damage to plants (i.e., herbivory), long-term data documenting changes in herbivory are limited. Millions of pressed plant specimens are now available online and can be used to collect big data on plant-insect interactions during the Anthropocene. Methods: We initiated development of machine learning methods to automate extraction of herbivory data from herbarium specimens by training an insect damage detector and a damage type classifier on two distantly related plant species (Quercus bicolor and Onoclea sensibilis). We experimented with (1) classifying six types of herbivory and two control categories of undamaged leaf, and (2) detecting two of the damage categories for which several hundred annotations were available. Results: Damage detection results were mixed, with a mean average precision of 45% in the simultaneous detection and classification of two types of damage. However, damage classification on hand-drawn boxes identified the correct type of herbivory 81.5% of the time in eight categories. The damage classifier was accurate for categories with 100 or more test samples. Discussion: These tools are a promising first step for the automation of herbivory data collection. We describe ongoing efforts to increase the accuracy of these models, allowing researchers to extract similar data and apply them to biological hypotheses.
Item Open Access Assisting Unsupervised Optical Flow Estimation with External Information (2023) Yuan, Shuai
Optical flow estimation is a long-standing problem in computer vision with broad applications in autonomous driving, robotics, and other areas. Due to the scarcity of ground-truth labels, the unsupervised estimation of optical flow is especially important. However, it is a poorly constrained problem and presents challenges in the presence of occlusions, motion boundaries, non-Lambertian surfaces, lack of texture, and illumination changes. Therefore, we explore using external information, namely partial labels, semantics, and stereo views, to assist unsupervised optical flow estimation.
Supervised training of optical flow predictors generally yields better accuracy than unsupervised training. However, the improved performance comes at an often high annotation cost. Semi-supervised training trades off accuracy against annotation cost. We use a simple yet effective semi-supervised training method to show that even a small fraction of labels can improve flow accuracy by a significant margin over unsupervised training. In addition, we propose active learning methods based on simple heuristics to further reduce the number of labels required to achieve the same target accuracy. Our experiments on both synthetic and real optical flow datasets show that our semi-supervised networks generally need around 50% of the labels to achieve close to full-label accuracy, and only around 20% with active learning on Sintel. We also analyze and provide insights into the factors that may influence active learning performance. Code is available at https://github.com/duke-vision/optical-flow-active-learning-release.
Unsupervised optical flow estimation is especially hard near occlusions and motion boundaries and in low-texture regions.
We show that additional information such as semantics and domain knowledge can help better constrain this problem. We introduce SemARFlow, an unsupervised optical flow network designed for autonomous driving data that takes estimated semantic segmentation masks as additional inputs. This additional information is injected into the encoder and into a learned upsampler that refines the flow output. In addition, a simple yet effective semantic augmentation module provides self-supervision when learning flow and its boundaries for vehicles, poles, and sky. Together, these injections of semantic information improve the KITTI-2015 optical flow test error rate from 11.80% to 8.38%. We also show visible improvements around object boundaries as well as a greater ability to generalize across datasets. Code is available at https://github.com/duke-vision/semantic-unsup-flow-release.
Both optical flow and stereo disparities are image matches and can therefore benefit from joint training. Depth and 3D motion provide geometric rather than photometric information and can further improve optical flow. Accordingly, we design a first network that estimates flow and disparity jointly and is trained without supervision. A second network, trained with optical flow from the first as pseudo-labels, takes disparities from the first network, estimates 3D rigid motion at every pixel, and reconstructs optical flow again. A final stage fuses the outputs from the two networks. In contrast with previous methods that only consider camera motion, our method also estimates the rigid motions of dynamic objects, which are of key interest in applications. This leads to better optical flow with visibly more detailed occlusions and object boundaries. Our unsupervised pipeline achieves 7.36% optical flow error on the KITTI-2015 benchmark and outperforms the previous state of the art of 9.38% by a wide margin. It also achieves slightly better or comparable stereo depth results. Code will be made available.
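The semi-supervised setting described in this abstract mixes an unsupervised photometric term with a supervised endpoint-error term on the labeled fraction. A rough sketch follows; the warping function, loss weights, and absence of occlusion handling are simplifying assumptions, not the thesis recipe.

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Backward-warp img (B,C,H,W) with flow (B,2,H,W) using grid_sample."""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(img.device)       # (2,H,W)
    coords = grid.unsqueeze(0) + flow                                 # (B,2,H,W)
    # Normalize to [-1, 1]; grid_sample expects a (B,H,W,2) sampling grid.
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid_n = torch.stack((coords_x, coords_y), dim=-1)
    return F.grid_sample(img, grid_n, align_corners=True)

def semi_supervised_loss(flow, img1, img2, gt_flow=None, w_sup=1.0, w_photo=1.0):
    """Photometric loss on every sample; endpoint error only where labels exist."""
    photo = (img1 - warp(img2, flow)).abs().mean()
    if gt_flow is None:                        # unlabeled sample
        return w_photo * photo
    epe = torch.norm(flow - gt_flow, dim=1).mean()
    return w_sup * epe + w_photo * photo
```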
Item Open Access Efficient selection of disambiguating actions for stereo vision (2010) Schaeffer, Monika
In many domains that involve the use of sensors, such as robotics or sensor networks, there are opportunities to use some form of active sensing to disambiguate data from noisy or unreliable sensors. These disambiguating actions typically take time and expend energy. One way to choose the next disambiguating action is to select the action with the greatest expected entropy reduction, or information gain. In this work, we consider active sensing in aid of stereo vision for robotics. Stereo vision is a powerful sensing technique for mobile robots, but it can fail in scenes that lack strong texture. In such cases, a structured light source, such as a vertical laser line, can be used for disambiguation. By treating the stereo matching problem as a specially structured HMM-like graphical model, we demonstrate that for a scan line with n columns and maximum stereo disparity d, the entropy-minimizing aim point for the laser can be selected in O(nd) time, a cost no greater than that of the stereo algorithm itself. A typical HMM formulation would suggest at least O(nd²) time for the entropy calculation alone.
Item Open Access Extended Subwindow Search and Pictorial Structures (2012) Gu, Zhiqiang
In computer vision, the pictorial structure model represents an object in an image by parts that are arranged in a deformable configuration. Each part describes an object's local photometric appearance, and the configuration encodes the global geometric layout. This model has been very successful in recent object recognition systems.
We extend the pictorial structure model in three aspects. First, when the model contains only a single part, we develop new methods, ranging from regularized subwindow search and nested window search to twisted window search, for handling richer priors and more flexible shapes. Second, we develop the notion of a weak pictorial structure, as opposed to a strong one, for characterizing a loose geometric layout in a rotationally invariant way. Third, we develop nested models that encode topological inclusion relations between parts to represent richer patterns.
We show that all the extended models can be efficiently matched to images by using dynamic programming and variants of the generalized distance transform, which computes the lower envelope of transformed cones on a dense image grid. This transform turns out to be important for a wide variety of computer vision tasks and often accelerates the computation at hand by an order of magnitude. We demonstrate improved results in either quality or speed, and sometimes both, in object matching, saliency measure, online and offline tracking, object localization and recognition.
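For a concrete sense of the generalized distance transform mentioned above, here is a minimal one-dimensional version with a cone (L1) penalty; the models in the thesis operate on dense 2D grids with richer deformation costs, so this is only an illustration.

```python
import numpy as np

def gdt_1d_l1(f, slope=1.0):
    """Generalized distance transform D[x] = min_y (f[y] + slope * |x - y|).
    Computes the lower envelope of cones rooted at each sample in two passes,
    O(n) time for n samples."""
    d = np.asarray(f, dtype=float).copy()
    for i in range(1, len(d)):                 # forward pass
        d[i] = min(d[i], d[i - 1] + slope)
    for i in range(len(d) - 2, -1, -1):        # backward pass
        d[i] = min(d[i], d[i + 1] + slope)
    return d
```

A 2D transform factors into independent row and column passes, and quadratic deformation costs can be handled with the lower-envelope algorithm of Felzenszwalb and Huttenlocher in the same linear-time spirit.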
Item Open Access Human Activity Analysis (2018) Carley, Cassandra Mariette
Video cameras have become increasingly prevalent, with higher resolution and frame-rates. Humans are often the focus of these videos, making human motion analysis an important field. This thesis explores the level of detail necessary to distinguish human activities for tasks of regression, like body tracking, and activity classification.
We first consider activities that can be distinguished by their appearance during a single moment in time. Specifically, we use a database-retrieval approach to both approximate the full 3D pose of the hand from a single frame and classify the hand into its configuration. To index the database we present a novel silhouette signature and signature distance that capture differences in both the extension and abduction of fingers.
Next, we consider more complex activities, like typing, that are characterized by a motion texture, or statistical regularities in space and time. A single frame is inadequate to distinguish such activities, and it may be difficult to track the detailed sequence of body and object elements because of occlusions or temporal aliasing. Further, such activities are not characterized by a detailed sequence of 3D poses, but rather by the motion texture they produce in space and time. We propose a new motion texture activity representation for computer vision tasks that require such spatial-temporal reasoning. Autocorrelation is used to capture temporal aspects of an activity signal that may be unbounded in time, and we show how it can be efficiently computed using an exponential moving average formulation. An optional space-time aggregation handles a potentially variable number of motion signals. This motion texture representation transforms any input signal of an activity into a fixed-size representation, even when the activity itself has varying extents in space and time. As a result of this conversion, any off-the-shelf classifier can be applied to detect the activity.
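A minimal sketch of how such an autocorrelation descriptor could be maintained with an exponential moving average follows; the maximum lag, decay rate, and scalar input signal are illustrative choices rather than the thesis configuration.

```python
import numpy as np

class EMAAutocorrelation:
    """Running autocorrelation descriptor for a (possibly unbounded) 1D
    motion signal, updated with an exponential moving average."""

    def __init__(self, max_lag, alpha=0.05):
        self.alpha = alpha
        self.history = np.zeros(max_lag + 1)   # ring buffer of recent samples
        self.corr = np.zeros(max_lag + 1)      # EMA estimate of E[x_t * x_{t-lag}]
        self.t = 0

    def update(self, x):
        self.history = np.roll(self.history, 1)
        self.history[0] = x
        lags = min(self.t, len(self.corr) - 1)
        prods = x * self.history[: lags + 1]   # x_t * x_{t-lag} for each lag
        self.corr[: lags + 1] = ((1 - self.alpha) * self.corr[: lags + 1]
                                 + self.alpha * prods)
        self.t += 1
        return self.corr                       # fixed-size descriptor
```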
For evaluation, we show how our representation can be used as a motion texture "layer" within a convolutional neural network. We first study typing detection, and use our method with trajectories from corner points as input. The resulting motion texture descriptor captures hand-object motion patterns that we use within a privacy-filter pipeline to obscure potentially sensitive content, like passcodes. We also study the more abstract challenge of identity recognition by gait and demonstrate significant improvements over the state of the art using silhouette sequences as input to our autocorrelation network. Further, we show that adding a shallow network before the autocorrelation computation and training the network end-to-end learns a more robust activity feature.
Item Open Access Leveraging Data Augmentation in Limited-Label Scenarios for Improved Generalization (2024) Ravindran, Swarna Kamlam
The resurgence of Convolutional Neural Networks (CNNs) from the early foundational work is largely attributed to the advent of extensive manually labeled datasets, which has made it possible to train high-capacity models with strong generalization capabilities. However, the annotation cost for these datasets is often prohibitive, and so training CNNs on limited data in a fully-supervised setting remains a crucial problem. Data augmentation is a promising direction for improving generalization in scarce data settings.
We study foundational augmentation techniques, including Mixed Sample Data Augmentations (MSDAs) and a no-parameter variant of RandAugment termed Preset-RandAugment, in the fully supervised scenario. We observe that Preset-RandAugment excels in limited-data contexts while MSDAs are moderately effective. In order to explain this behaviour, we refine ideas about diversity and realism from prior work and propose new ways to measure them. We postulate an additional property when data is limited: augmentations should encourage faster convergence by helping the model learn stable and invariant low-level features, focusing on less class-specific patterns. We explain the effectiveness of Preset-RandAugment in terms of these properties and identify low-level feature transforms as a key contributor to performance.
Building on these insights, we introduce a novel augmentation technique called RandMSAugment that integrates complementary strengths of existing methods. It combines low-level feature transforms from Preset-RandAugment with interpolation and cut-and-paste from MSDA. We improve image diversity through added stochasticity in the mixing process. RandMSAugment significantly outperforms the competition on CIFAR-100, STL-10, and Tiny-Imagenet. With very small training sets (4, 25, 100 samples/class), RandMSAugment achieves compelling performance gains between 4.1% and 6.75%. Even with more training data (500 samples/class) we improve performance by 1.03% to 2.47%. We also incorporate RandMSAugment augmentations into a semi-supervised learning (SSL) framework and show promising improvements over the state-of-the-art SSL method, FlexMatch. The improvements are more significant when the number of labeled samples is smaller. RandMSAugment does not require hyperparameter tuning, extra validation data, or cumbersome optimizations.
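To make the ingredients concrete, the sketch below combines mixup-style interpolation with CutMix-style cut-and-paste under a random choice. It only illustrates the components the abstract names and is not the authors' RandMSAugment; the mixing distribution and patch sizing are illustrative.

```python
import numpy as np

def mixed_sample(img_a, img_b, label_a, label_b, rng=None):
    """Combine two images either by interpolation or by cut-and-paste,
    chosen at random; labels are mixed with the same weight."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(1.0, 1.0)                       # mixing weight in [0, 1]
    if rng.random() < 0.5:                         # mixup-style interpolation
        out = lam * img_a + (1 - lam) * img_b
    else:                                          # CutMix-style paste
        h, w = img_a.shape[:2]
        ch, cw = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
        y0, x0 = rng.integers(0, h - ch + 1), rng.integers(0, w - cw + 1)
        out = img_a.copy()
        out[y0:y0 + ch, x0:x0 + cw] = img_b[y0:y0 + ch, x0:x0 + cw]
        lam = 1 - (ch * cw) / (h * w)              # actual area-based weight
    label = lam * label_a + (1 - lam) * label_b    # soft label
    return out, label
```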
Finally, we combine RandMSAugment with another powerful generalization tool, ensembling, for fully-supervised training with limited samples. We show additional improvements on the three classification benchmarks, which range between 2% and 5%. We empirically demonstrate that the gains due to ensembling are larger when the individual networks have moderate accuracies, i.e., outside of the low and high extremes. Furthermore, we introduce a simulation tool capable of providing insights about the maximum accuracy achievable through ensembling under various conditions.
Item Open Access Motion Boundary and Occlusion Reasoning for Video Analysis (2022) Kim, Hannah
With the increasing prevalence of video cameras, video motion analysis has become an important research area in vision. Motion in video is often represented in the form of dense Optical Flow fields, which specify the motion of each pixel from one frame to the next. While existing flow predictors achieve almost sub-pixel performance on existing benchmarks, they still suffer in three particular areas. The first area is near motion boundaries, or the curves across which the optical flow field is discontinuous. The second is in occlusion regions, sets of pixels in one frame without a corresponding pixel in the other. The optical flow is not defined for these occlusion pixels. The third is in regions with large motion, as they require high computational and memory costs. This dissertation examines these three challenges for motion boundary detection, occlusion detection, video interpolation, and occlusion-based adversarial attack detection for optical flow.
First, we propose a convolutional neural network named MONet to jointly detect motion boundaries and occlusion regions in video, both forward and backward in time. Since both motion boundaries and occlusion regions disrupt correspondences across frames, we first use a cost map of the Euclidean distances between each feature in one frame and its closest feature in the next. To reason in two time directions simultaneously, we directly warp the estimated occlusion region and motion boundary maps between the two frames, preserving features in occlusion regions. As motion boundaries align with occlusion region boundaries, we use an attention mechanism and a gradient module to encourage the network to focus on the useful 2D spatial regions predicted by the other task. MONet achieves state-of-the-art results for both tasks on various benchmarks.
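A minimal sketch of such a correspondence cost map follows; the feature source, search radius, and brute-force window search are illustrative assumptions rather than MONet's implementation.

```python
import torch
import torch.nn.functional as F

def min_distance_cost_map(feat1, feat2, radius=4):
    """feat1, feat2: (C, H, W) feature maps from two frames. Returns an (H, W)
    map of the distance from each feature in frame 1 to its closest feature in
    frame 2 within a (2*radius+1)^2 search window."""
    c, h, w = feat1.shape
    pad = F.pad(feat2, (radius,) * 4, value=float("inf"))
    best = torch.full((h, w), float("inf"), device=feat1.device)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            shifted = pad[:, dy:dy + h, dx:dx + w]
            dist = torch.norm(feat1 - shifted, dim=0)
            best = torch.minimum(best, dist)
    return best          # large values suggest occlusion or a motion boundary
```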
Next, we consider the video interpolation task, which aims to interpolate an intermediate frame given the two consecutive image frames around it. We first present a novel visual transformer module, named Cross Similarity (CS), to globally aggregate input image features with appearances similar to those of the interpolated frame. These aggregated features are then used to refine the interpolated prediction. To account for occlusions in the aggregated CS features, we propose an Image Attention (IA) module that allows the network to focus on CS features from one frame over those of the other. Additionally, we augment our training dataset with an occluder patch that moves across frames to improve the network's robustness to occlusions and large motion. We supervise our IA module so that the network is encouraged to down-weight the features that are occluded by these patches. Because existing methods tend to yield overly smooth predictions, especially near motion boundaries, we use an additional training loss based on image gradients to yield sharper predictions.
Finally, we examine the effect of patch-based adversarial attacks on flow networks, attacks that introduce occlusions and motion boundaries into the inputs, and present the first method to detect and localize these attacks without any fine-tuning or prior knowledge about them. In particular, we detect the occlusion patch attacks via iterative optimization on the activations from the inner layers of any pre-trained optical flow network, identifying a subset of anomalous activations.
Item Open Access People Tracking and Re-Identification from Multiple Cameras (2018) Ristani, Ergys
In many surveillance or monitoring applications, one or more cameras view several people that move in an environment. Multi-person tracking amounts to using the videos from these cameras to determine who is where at all times. The problem is very challenging both computationally and conceptually. On one hand, the amount of video to process is enormous, while near real-time performance is desired. On the other hand, people's varying appearance due to lighting, occlusions, viewpoint changes, and unpredictable motion in blind spots make person re-identification challenging.
This dissertation makes several contributions to person re-identification and multi-person tracking from multiple cameras. We present a weighted triplet loss for learning appearance descriptors which addresses both problems uniformly, doesn't suffer from the imbalance between positive and negative examples, and remains robust to outliers. We introduce the largest tracking benchmark to date, DukeMTMC, and adequate performance measures that emphasize correct person identification. A correlation clustering formulation for associating person observations is then introduced which maximizes agreements on the evidence graph. We assemble a tracker called DeepCC that combines an existing person detector, hierarchical and online reasoning, our appearance features and correlation clustering association. DeepCC achieves increased performance on two challenging sequences from the DukeMTMC benchmark, and ablation experiments demonstrate the merits of individual components.
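A rough sketch of a soft-weighted triplet loss in this spirit follows; the softmax weighting over positive and negative distances is illustrative and may differ in detail from the thesis formulation.

```python
import torch
import torch.nn.functional as F

def weighted_triplet_loss(embeddings, labels, margin=1.0):
    """Soft-weighted triplet loss sketch: for each anchor, positives and
    negatives are weighted by a softmax over their distances instead of
    picking single hardest examples."""
    dist = torch.cdist(embeddings, embeddings)            # (N, N) pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    pos_mask, neg_mask = same & ~eye, ~same

    # Emphasize far positives and near negatives with soft weights.
    w_pos = torch.softmax(dist.masked_fill(~pos_mask, float("-inf")), dim=1)
    w_neg = torch.softmax((-dist).masked_fill(~neg_mask, float("-inf")), dim=1)

    d_pos = (w_pos * dist).nan_to_num(0.0).sum(dim=1)     # weighted positive distance
    d_neg = (w_neg * dist).nan_to_num(0.0).sum(dim=1)     # weighted negative distance
    return F.relu(d_pos - d_neg + margin).mean()
```

Using all positives and negatives with soft weights, rather than only the hardest pair, is one way to reduce sensitivity to outlier examples, which is the robustness property the abstract highlights.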
Item Open Access Tree Topology Estimation (2013) Estrada, Rolando Jose
Tree-like structures are fundamental in nature. A wide variety of two-dimensional imaging techniques allow us to image trees. However, an image of a tree typically includes spurious branch crossings and the original relationships of ancestry among edges may be lost. We present a methodology for estimating the most likely topology of a rooted, directed, three-dimensional tree given a single two-dimensional image of it. We regularize this inverse problem via a prior parametric tree-growth model that realistically captures the morphology of a wide variety of trees. We show that the problem of estimating the optimal tree has linear complexity if ancestry is known, but is NP-hard if it is lost. For the latter case, we present both a greedy approximation algorithm and a heuristic search algorithm that effectively explore the space of possible trees. Experimental results on retinal vessel, plant root, and synthetic tree datasets show that our methodology is both accurate and efficient.
Item Open Access Video Motion Analysis with Limited Labeled Training Data (2023) Yu, Shuzhi
The introduction of Convolutional Neural Networks (CNNs) has significantly advanced the performance of many computer vision systems on benchmark data sets, including those for motion analysis. However, impressive performance often relies heavily on labeled training samples, which may not always be available. In this dissertation, we discuss the issue of limited labeled data from two perspectives. First, with a small training set, a model tends to overfit and not generalize well to unseen data. We build a connection between a popular CNN architecture and its improved generalization ability. Then, we examine two motion analysis tasks where labeled data is insufficient. We propose an unsupervised refinement method for pixel-level tracking, i.e., motion field estimation, and a weakly supervised method for an object-level tracking task.
More specifically, an existing popular CNN architecture named Residual Neural Networks (ResNets) is able to improve the generalization ability of CNN models, and it has become the de facto backbone network of choice for deep learning models. The main idea of ResNets is simple: a skip connection is added over some convolutional layers. However, the reasons for their superiority remain poorly understood. In an attempt to analyze why ResNets generalize better with the additional identity connections, we map a simplified (but still deep and non-linear) residual network to an equivalent plain network and find that the resulting large weights inherited from the identity connections improve stability to noise added to the network, which in turn improves the model's generalization ability.
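For reference, here is a basic residual block showing the skip connection under discussion; this is a standard textbook form, not the specific networks analyzed in the thesis.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: output = relu(F(x) + x), where F is a small
    stack of convolutions. The identity shortcut is the skip connection."""

    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)   # skip connection: add the input back
```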
The lack of direct supervision signals makes it hard to estimate 2D image motion accurately, and even more so near motion boundaries (MBs), i.e., the curves of discontinuity of the motion field. First, motion is discontinuous across MBs, while typical estimators assume smoothness. Second, the features extracted to find point correspondences between frames are contaminated by multiple motions. The direct supervision signal provided by ground-truth labels may mitigate this problem, but the dense annotation of the motion for each pixel is very expensive, if not impossible. We show that accurate prediction of MBs from imperfect motion estimates helps improve motion estimates near MBs. Our method first detects MBs from the input motion estimated by an existing unsupervised estimator. Then, it improves the motion predictions near these boundaries by replacing them with the motion a bit farther away from the MB.
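A minimal sketch of this replace-near-the-boundary idea follows; the band width and the nearest-outside-pixel rule are illustrative simplifications of the thesis method.

```python
import numpy as np
from scipy import ndimage

def refine_flow_near_boundaries(flow, mb_mask, band=3):
    """flow: (H, W, 2) motion field; mb_mask: boolean (H, W) motion-boundary map.
    Replaces flow inside a band around the boundaries with the flow of the
    nearest pixel outside that band."""
    near_mb = ndimage.binary_dilation(mb_mask, iterations=band)
    # For every pixel, indices of the nearest pixel NOT in the band.
    _, (iy, ix) = ndimage.distance_transform_edt(near_mb, return_indices=True)
    refined = flow.copy()
    refined[near_mb] = flow[iy[near_mb], ix[near_mb]]
    return refined
```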
Next, we propose a weakly supervised method for Multi-Object Tracking (MOT). A popular tracking framework named tracking by detection first detects objects of interest in each frame, extracts an appearance embedding for each detection, and then associates these detections across frames based on the embeddings. Recently, a joint model that does both object detection and embedding in one forward pass has been proposed, and has been shown by and large to improve inference over conducting the two operations sequentially. The existing joint models require fully annotated tracking data for training, which includes both the ground-truth bounding boxes for detection and identity labels for these bounding boxes across frames. When this data is unavailable, we propose a weakly supervised method that augments ground-truth bounding boxes with appearance embeddings computed by an off-the-shelf Re-Identification (Re-ID) model. The Re-ID model is trained on independent Re-ID data that only contains identity labels, and the augmented appearance embeddings serve as pseudo ground-truth embeddings.
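A minimal sketch of how such pseudo ground-truth embeddings could be produced with a frozen, off-the-shelf Re-ID model follows; the model interface and crop size are illustrative assumptions.

```python
import torch

@torch.no_grad()
def pseudo_gt_embeddings(frame, boxes, reid_model, crop_size=(256, 128)):
    """frame: (3, H, W) image tensor; boxes: list of (x1, y1, x2, y2).
    Crops each ground-truth box, resizes it, and runs a frozen Re-ID model;
    the resulting embeddings serve as pseudo ground truth for the tracker's
    embedding head."""
    crops = []
    for x1, y1, x2, y2 in boxes:
        crop = frame[:, int(y1):int(y2), int(x1):int(x2)].unsqueeze(0)
        crops.append(torch.nn.functional.interpolate(
            crop, size=crop_size, mode="bilinear", align_corners=False))
    reid_model.eval()
    return reid_model(torch.cat(crops, dim=0))   # (num_boxes, embed_dim)
```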
Item Open Access Video Motion: Finding Complete Motion Paths for Every Visible Point (2013) Ricco, Susanna Maria
The problem of understanding motion in video has been an area of intense research in computer vision for decades. The traditional approach is to represent motion using optical flow fields, which describe the two-dimensional instantaneous velocity at every pixel in every frame. We present a new approach to describing motion in video in which each visible world point is associated with a sequence-length video motion path. A video motion path lists the location where a world point would appear if it were visible in every frame of the sequence. Each motion path is coupled with a vector of binary visibility flags for the associated point that identify the frames in which the tracked point is unoccluded.
We represent paths for all visible points in a particular sequence using a single linear subspace. The key insight we exploit is that, for many sequences, this subspace is low-dimensional, scaling with the complexity of the deformations and the number of independent objects in the scene, rather than the number of frames in the sequence. Restricting all paths to lie within a single motion subspace provides strong regularization that allows us to extend paths through brief occlusions, relying on evidence from the visible frames to hallucinate the unseen locations.
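As a rough illustration (not the thesis pipeline, which also handles missing data and adds a compaction step), a sequence-specific path basis can be estimated from fully visible point tracks with a plain SVD.

```python
import numpy as np

def path_basis_from_tracks(tracks, k):
    """tracks: (n_points, 2*F) matrix of fully visible point tracks, with x and
    y coordinates stacked over F frames. Returns the mean path, a rank-k path
    basis, and the singular values."""
    mean = tracks.mean(axis=0)
    _, s, vt = np.linalg.svd(tracks - mean, full_matrices=False)
    return mean, vt[:k], s    # mean path, (k, 2*F) basis, singular values
```

Partially visible paths can then be completed by fitting their basis coefficients on the visible frames alone, which is what allows paths to be extended through brief occlusions.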
This thesis presents our mathematical model of video motion. We define a path objective function that optimizes a set of paths given estimates of visible intervals, under the assumption that motion is generally spatially smooth and that the appearance of a tracked point remains constant over time. We estimate visibility based on global properties of all paths, enforcing the physical requirement that at least one tracked point must be visible at every pixel in the video. The model assumes the existence of an appropriate path motion basis; we find a sequence-specific basis through analysis of point tracks from a frame-to-frame tracker. Tracking failures caused by image noise, non-rigid deformations, or occlusions complicate the problem by introducing missing data. We update standard trackers to aggressively reinitialize points lost in earlier frames. Finally, we improve on standard Principal Component Analysis with missing data by introducing a novel compaction step that associates these relocalized points, reducing the amount of missing data that must be overcome. The full system achieves state-of-the-art results, recovering dense, accurate, long-range point correspondences in the face of significant occlusions.