Browsing by Author "Sun, Xiaobai"
Results Per Page
Sort Options
Item Open Access All-Near-Neighbor Search Among High-Dimensional Data via Hierarchical Sparse Graph Code Filtering(2020) Iliopoulos, Alexandros StavrosThis thesis addresses the problem of all-to-all near-neighbor (all-NN) search among a large dataset of discrete points in a high-dimensional feature space endowed with a distance metric. Near-neighbor search is fundamental to numerous computational tasks arising in modern data analysis, modeling, and knowledge discovery. The accuracy and efficiency of near-neighbor search depends on the underlying structure of the dataset. In emerging data-driven analysis applications, the dataset is not necessarily stationary, the structure is not necessarily built once for all queries, and search is not necessarily limited to a few query points. To facilitate accurate and efficient near-neighbor search in a stationary or dynamic dataset, we make a systematic investigation of all-NN search at multiple resolution levels, with attention to several standing or previously overlooked issues associated with high-dimensional feature spaces.
The key contributions are outlined as follows. (i) We proclaim the inherent sparse distribution of a finite discrete dataset in a high-dimensional space. We demonstrate that exploiting this sparsity is pivotal to dispelling the so-called curse of dimensionality in theoretical analysis and practical algorithm design. (ii) We introduce the adjacency graph hierarchy (AGH) as an analysis apparatus as well as base infrastructure for efficient all-NN search algorithm design. (iii) We present an efficient AGH construction algorithm by directly and deterministically pinpointing adjacent nodes at each resolution level via sparse adjacency code filtering, in lieu of exhaustive search within local node neighborhoods. We provide at the same time a statistical estimate of the significant gap in algorithm complexity between direct location of adjacent nodes and local search. (iv) We introduce an adjacency-sensitive, graph-based but decentralized approach for hierarchical partition of a discrete dataset, with potential advantage for accurate and efficient all-NN search in comparison to existing non-interacting partition structures.
Item Open Access An Exploratory Study of Cell Growth and Division Data(2016) Niu, TongUnderstanding the mechanism behind cell-size control of bacteria such as E. coli has been an active research topic for decades. Until about 2010, most studies were limited to measurements at the cell population level. Recently advanced technologies and techniques, including Mother Machine and modern microscopic imaging, as well as new biological insights and experiment techniques have enabled measurements of cell growth and division at individual cell level with high spatial and temporal resolution. This unprecedented data allowed scientists to take a closer look into the patterns of cell growth and division, to search for new answers, and to gain a better understanding of the natural law governing cell growth. This thesis study is based on six data sets provided by the You lab at Duke University and by the Jun lab at University of California, San Diego. We focus on exploratory and quantitative study of inter-generation relationships among certain cell growth parameters and variables, including the variation in each relationship.
We report and describe three particular findings. (1) The variation in cell size at birth between two successive generations, i.e., mother and daughter generations, is better captured in two separate zones in the mother-daughter plane of initial cell size. This leads to a two-phase linear model with much narrower noise spread. The model can be used to connect or compensate for certain well-known models. (2) The variation in doubling time between two successive generations is confined in a finite region with non-uniform density. Specifically, the confining region is apparently triangular in the successive-generation plot of doubling time. The highest variation of doubling time in daughter-generation corresponds to the median for the mother- generation. This variation structure of successive-generation is further used to predict the variations in succeeding generations via an iterative simulation method. (3) The relationship between cell size and doubling time is studied with Fourier analysis. We found that the two dominant frequencies of the cell-size time series are nearly constant for each lineage. Furthermore, the top dominant frequency location is invariant under random re-ordering of generations. This invariance suggests that the dominant frequency is an intra-generation property, i.e., an inherent property. These findings are descriptive rather than explanatory. Nonetheless, they provide additional angles to examine, re-examine and improve our understanding of cell growth control mechanism. The findings also show, perhaps in a subtle way, the effect of computational concepts and techniques, which we used to probe into data on the finding, artifacts and their separation.
Item Open Access Digital Stack Photography and Its Applications(2014) Hu, JunThis work centers on digital stack photography and its applications.
A stack of images refer, in a broader sense, to an ensemble of
associated images taken with variation in one or more than one various
values in one or more parameters in system configuration or setting.
An image stack captures and contains potentially more information than
any of the constituent images. Digital stack photography (DST)
techniques explore the rich information to render a synthesized image
that oversteps the limitation in a digital camera's capabilities.
This work considers in particular two basic DST problems, which had
been challenging, and their applications. One is high-dynamic-range
(HDR) imaging of non-stationary dynamic scenes, in which the stacked
images vary in exposure conditions. The other
is large scale panorama composition from multiple images. In this
case, the image components are related to each other by the spatial
relation among the subdomains of the same scene they covered and
captured jointly. We consider the non-conventional, practical and
challenge situations where the spatial overlap among the sub-images is
sparse (S), irregular in geometry and imprecise from the designed
geometry (I), and the captured data over the overlap zones are noisy
(N) or lack of features. We refer to these conditions simply as the
S.I.N. conditions.
There are common challenging issues with both problems. For example,
both faced the dominant problem with image alignment for
seamless and artifact-free image composition. Our solutions to the
common problems are manifested differently in each of the particular
problems, as a result of adaption to the specific properties in each
type of image ensembles. For the exposure stack, existing
alignment approaches struggled to overcome three main challenges:
inconsistency in brightness, large displacement in dynamic scene and
pixel saturation. We exploit solutions in the following three
aspects. In the first, we introduce a model that addresses and admits
changes in both geometric configurations and optical conditions, while
following the traditional optical flow description. Previous models
treated these two types of changes one or the other, namely, with
mutual exclusions. Next, we extend the pixel-based optical flow model
to a patch-based model. There are two-fold advantages. A patch has
texture and local content that individual pixels fail to present. It
also renders opportunities for faster processing, such as via
two-scale or multiple-scale processing. The extended model is then
solved efficiently with an EM-like algorithm, which is reliable in the
presence of large displacement. Thirdly, we present a generative
model for reducing or eliminating typical artifacts as a side effect
of an inadequate alignment for clipped pixels. A patch-based texture
synthesis is combined with the patch-based alignment to achieve an
artifact free result.
For large-scale panorama composition under the S.I.N. conditions, we
have developed an effective solution scheme that significantly reduces
both processing time and artifacts. Previously existing approaches can
be roughly categorized as either geometry-based composition or feature
based composition. In the former approach, one relies on precise
knowledge of the system geometry, by design and/or calibration. It
works well with a far-away scene, in which case there is only limited
variation in projective geometry among the sub-images. However, the
system geometry is not invariant to physical conditions such as
thermal variation, stress variation and etc.. The composition with
this approach is typically done in the spatial space. The other
approach is more robust to geometric and optical conditions. It works
surprisingly well with feature-rich and stationary scenes, not well
with the absence of recognizable features. The composition based on
feature matching is typically done in the spatial gradient domain. In
short, both approaches are challenged by the S.I.N. conditions. With
certain snapshot data sets obtained and contributed by Brady et al,
these methods either fail in composition or render images with
visually disturbing artifacts. To overcome the S.I.N. conditions, we
have reconciled these two approaches and made successful and
complementary use of both priori and approximate information about
geometric system configuration and the feature information from the
image data. We also designed and developed a software architecture
with careful extraction of primitive function modules that can be
efficiently implemented and executed in parallel. In addition to a
much faster processing speed, the resulting images are clear and
sharper at the overlapping zones, without typical ghosting artifacts.
Item Open Access Exploiting Common Structures Across Multiple Network Propagation Schemes(2018) Chen, XiPageRank is arguably the most popular graph-node ranking algorithm which has been used to analyze large networks.
This paper presents how to utilize common structures across multiple damping schemes in PageRank algorithm and its variants to improve computation efficiency. One contribution we have made is to make use of Krylov subspaces to fast approximate PageRank of the same graph, same personalized distribution vector across different damping factors. In the first run we compute an invariant subspace, which is the major time cost of our algorithm. The original network connection matrix is represented in the invariant subspace by a small matrix.
Afterward, for each different damping factor only the small matrix is involved, followed by a single matrix-vector multiplication. Another contribution is to re-index graph nodes, based on graph spectral method to improve data locality, and accelerate sparse matrix-vector multiplication. We applied the new method to large real-world network graphs and make comparisons between existing iterative methods and our method. Our algorithm has substantial lower cost in amortized time when multiple damping factors are used for network analysis.
Item Open Access Exploiting Semantic Word Relationships for Improved Unsupervised Academic Document Classification(2018-04-24) Yan, DavidIn our thesis work, we investigate on unsupervised theme-based categorization or classification of documents in a large corpus, where the documents are represented in a high-dimensional feature space. Unsupervised classification of text documents is in high demand with the data explosion of the modern age, yet it remains a challenging problem. In particular, we re-examine two key areas: the manner in which the documents are represented and the process by which the documents are clustered. Two innovative methods are presented in the joint research work with Rob Martorano. The first is on feature transformation, which is elaborated on in this thesis. We point out that existing approaches for document feature description serve well for author identification but pose limitations on theme-based classification. In our feature transformation, we discount esoteric use of words by authors and disclose and exploit semantic similarities and associations among different words used by different authors. We first locate semantically close words by utilizing word embedding techniques and products based on much larger word collections, external to the terms used in a particular document corpus. We then make numerical associations among term neighbors with similar semantic meanings; we denote these term neighborhoods as semantic elements. Using semantic elements, we use a self-tuning Gaussian blurring technique to increase association between documents that share similar context patterns. The second contribution is on cluster revision, which is briefly discussed in this thesis and elaborated more in Rob Martorano’s thesis. Clustering algorithms are typically used after feature dimension reduction. Some properties are preserved, and some are lost in the reduced dimension space. Some clusters are fragmented into smaller ones, and some are merged. We revise the clustering results by going back to the high dimensional space. We characterize the cluster features with what we refer to as stochastic barcodes. We developed a software architecture composed of the following major components. The first component uses semantic elements to form a refined document feature space using our novel feature transformation method. The second component performs a dimension reduction on the document feature space, then forms and refines the subsequent document clusters. We show, with experimental results on real-word document corpora, improvements made by our approach in comparison to existing and influential ones.Item Open Access Hyperspectral Image Classification with Nonlinear Methods(2016) Liu, TianchengThis thesis introduces two related lines of study on classification of hyperspectral images with nonlinear methods. First, it describes a quantitative and systematic evaluation, by the author, of each major component in a pipeline for classifying hyperspectral images (HSI) developed earlier in a joint collaboration [23]. The pipeline, with novel use of nonlinear classification methods, has reached beyond the state of the art in classification accuracy on commonly used benchmarking HSI data [6], [13]. More importantly, it provides a clutter map, with respect to a predetermined set of classes, toward the real application situations where the image pixels not necessarily fall into a predetermined set of classes to be identified, detected or classified with.
The particular components evaluated are a) band selection with band-wise entropy spread, b) feature transformation with spatial filters and spectral expansion with derivatives c) graph spectral transformation via locally linear embedding for dimension reduction, and d) statistical ensemble for clutter detection. The quantitative evaluation of the pipeline verifies that these components are indispensable to high-accuracy classification.
Secondly, the work extends the HSI classification pipeline with a single HSI data cube to multiple HSI data cubes. Each cube, with feature variation, is to be classified of multiple classes. The main challenge is deriving the cube-wise classification from pixel-wise classification. The thesis presents the initial attempt to circumvent it, and discuss the potential for further improvement.
Item Open Access On Graph Clustering or Community Detection: A Characteristic Analysis and Its Implications(2021) Liu, TianchengGraphs or networks represent relational data in an abstract and unifying form in terms of vertices and edges. Graph clustering in graph study language, or community detection in network science and engineering language, is fundamental to exploratory analysis of relational data, at different levels of depth and in broad applications. The main objectives of graph clustering are to capture, characterize and categorize heterogeneous clusters or diverse community patterns intrinsic to the data, and beyond that, to discover hidden structures or functional relationships.In the present work, we investigate clustering models and methodologies themselves. We make several original and impactful contributions in three key stages. (I) We introduce the dialectical component principle (DCP) for graph clustering. For the DCP realization, we formulate a general model representation in terms of two dialectical encoding functions (interdependent and opposing potentials in coupling and decoupling, or attraction and repulsion) and a trainable balance/bias ratio between the encoding components. The general representation encompasses notable existing models or clustering functions. We analyze the fundamental properties of DCP models acting on graph data. The analysis offers new insight into the known and previously unknown advantages or disadvantages of existing models. Furthermore, the study leads to our design of a novel, competitive and rich family of DCP models, named the BlueRed family. (II) We introduce an innovative characteristic analysis of DCP models on undirected, connected graphs. Conventional comparison of cluster configurations by a clustering function on a graph relies typically on a fixed or lucky value of the so-called resolution parameter, which is related to the balance/bias ratio of DCP components. In contrast, we investigate the relationship among cluster configurations across the entire, infinite range of the resolution parameter. With the characteristic analysis, we address and answer for the first time critical yet previously unasked questions: the existence of global minimum cluster configurations, their locations, their stability and sensitivity, and resolution quantization. (III) We present an algorithmic architecture based on the model family and characteristic analysis. The architecture assembles a set of effective graph operation constituents, deploys a set of economic search strategies for optimal solutions, and is equipped with a set of iteration criteria and post-evaluation measures. It accommodates existing clustering algorithms. We provide empirical evidence and comparison results with synthetic graphs and real-world networks.
Item Open Access Semantic Term “Blurring” and Stochastic “Barcoding” for Improved Unsupervised Text Classification(2018-04) Martorano, RobertThe abundance of text data being produced in the modern age makes it increasingly important to intuitively group, categorize, or classify text data by theme for efficient retrieval and search. Yet, the high dimensionality and imprecision of text data, or more generally language as a whole, prove to be challenging when attempting to perform unsupervised document clustering. In this thesis, we present two novel methods for improving unsupervised document clustering/classification by theme. The first is to improve document representations. We look to exploit “term neighborhoods” and “blur” semantic weight across neighboring terms. These neighborhoods are located in the semantic space afforded by “word embeddings.” The second method is for cluster revision, based on what we deem as “stochastic barcoding”, or “S- Barcode” patterns. Text data is inherently high dimensional, yet clustering typically takes place in a low dimensional representation space. Our method utilizes lower dimension clustering results as initial cluster configurations, and iteratively revises the configuration in the high dimensional space. We show with experimental results how both of the two methods improve the quality of document clustering. While this thesis elaborates on the two new conceptual contributions, a joint thesis by David Yan details the feature transformation and software architecture we developed for unsupervised document classification.Item Open Access Symmetric completion of Deformable Registration via Bi-residual Inversion(2018) Dubey, Abhishek KumarDeformable image registration is fundamental and critical to both diagnostics and therapeutics in precision and personalized medicine. Existing software packages for deformable registrations either provide only a forward deformation vector field (DVF), or render both for-
ward and backward DVFs with time-intensive computation. The latter has multiple advantages in medical image analysis and processing. However, its latency, which is substantially longer than clinical time windows, hinders the transition of its benefits to clinical applications. This study aims at facilitating the transition with algorithmic innovation. We introduce a novel registration approach, which substantiates a significant shift of the conventionally perceived tradeoff boundary between efficiency on the one side and functionality and accuracy on the other. In the new approach, we utilize efficient existing methods for forward DVF estimation. We complete symmetric registration with a backward DVF estimation, at high computation speed comparable to the forward DVF generation, and at high accuracy in inverse consistency as well as in registration. The forward DVF is possibly refined also in the symmetric augmentation or completion process. The efficacy of our approach is supported by theoretical analysis and empirical results. The key conceptual and algorithmic innovation is adaptive use of forward and backward inverse consistency (IC) residuals as feedbacks to refining DVF estimation.
The forward IC residual was used heuristically in earlier work. We give theoretical explana-
tion and conditions on when non-adaptive feedback succeed or fail. We provide furthermore
a framework of algorithm design for DVF inversion with a simple adaptive feedback control
mechanism. The use of backward IC residuals is original. The iteration with backward IC
residuals as updates may be seen as an implicit Newton iteration, by convergence rate analysis.
It has great advantages in simplicity, efficiency and robustness over the explicit Newton iter-
ation, for DVF inversion. The algorithm framework is completed with convergence analysis,
controllability condition, pre-evaluation of the initial forward DVF data, and post-evaluation
of DVF estimates. Experiment results with our approach on synthetic data and real thoracic
CT images show significant improvements in both registration and inverse-consistency errors.
They are also in a remarkable agreement with the analysis-based predictions.
Item Open Access Variable Damping Effect on Network Propagation(2018) Qian, YuchenIn modern network analysis, the PageRank algorithm has been used as an indispens-
able tool to determine the importance and relevance of the network nodes. Inspite
of extensive research conducted to accelerate the algorithm or its variants, there are
few studies about the effects of the damping factors on the ranking distribution.
To understand how the damping factor can affect the rank distribution in different
PageRank models, specifically, the directed surfer model by Brin and Page, and the
heat-kernel PageRank by Chung. We studied the ranking vector (steady state distri-
bution) under different damping factor values with each model. Enabled by efficient
batch calculation of the ranking vectors, we conducted systematic experiments to
measure the discrepancies of the distributions, explored and explained the capability
of adjusting the steady-state distribution via the change in damping factors. Experi-
mental results show that the steady-state distribution by Brin-Page model responses
non-linearly to the change in damping factor α, while by Chung’s heat-kernel model,
the damping factor β casts negligble effect on steady-state distribution. With this
phenomenon, Brin-Page model may be preferable over Chung’s model on utilizing
the non-linear relationship between the damping factor and steady-state distribution.
The relationship can be utilized also to find the propagation speed(damping factor)
from observations of two or more consecutive distributions.