Duke University Libraries
DukeSpace Scholarship by Duke Authors

Microclustering: When the Cluster Sizes Grow Sublinearly with the Size of the Data Set

Authors
Miller, Jeffrey
Betancourt, Brenda
Zaidi, Abbas
Wallach, Hanna
Steorts, Rebecca C
Repository Usage Stats
254 views, 93 downloads
Abstract
Most generative models for clustering implicitly assume that the number of data points in each cluster grows linearly with the total number of data points. Finite mixture models, Dirichlet process mixture models, and Pitman-Yor process mixture models make this assumption, as do all other infinitely exchangeable clustering models. However, for some tasks, this assumption is undesirable. For example, when performing entity resolution, the size of each cluster is often unrelated to the size of the data set. Consequently, each cluster contains a negligible fraction of the total number of data points. Such tasks therefore require models that yield clusters whose sizes grow sublinearly with the size of the data set. We address this requirement by defining the microclustering property and introducing a new model that exhibits this property. We compare this model to several commonly used clustering models by checking model fit using real and simulated data sets.
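
As an illustrative sketch (not taken from the paper), the linear-growth behavior described in the abstract can be seen by simulating a Chinese restaurant process, the partition distribution underlying Dirichlet process mixture models. The function names, the concentration parameter alpha = 1.0, the seed, and the sample sizes below are assumptions chosen for illustration only.

```python
import random


def crp_cluster_sizes(n, alpha, seed=0):
    """Simulate cluster sizes for n points drawn from a Chinese restaurant process.

    Hypothetical helper for illustration; alpha is the concentration parameter.
    """
    rng = random.Random(seed)
    sizes = []
    for i in range(n):
        # With i points already seated, the next point joins cluster k with
        # probability sizes[k] / (i + alpha) and opens a new cluster with
        # probability alpha / (i + alpha).
        u = rng.random() * (i + alpha)
        acc = 0.0
        for k, s in enumerate(sizes):
            acc += s
            if u < acc:
                sizes[k] += 1
                break
        else:
            # No existing cluster was selected, so start a new one.
            sizes.append(1)
    return sizes


def largest_cluster_fraction(n, alpha=1.0, seed=0):
    """Return the share of the n points that falls in the largest cluster."""
    sizes = crp_cluster_sizes(n, alpha, seed)
    return max(sizes) / n


if __name__ == "__main__":
    # Assumed sample sizes; a single seed gives one noisy draw per size.
    for n in (100, 1000, 10000):
        print(n, round(largest_cluster_fraction(n), 3))
```

Under this kind of model the share of data held by the largest cluster typically does not shrink toward zero as n grows, whereas a model with the microclustering property would drive that share toward zero. Averaging over several seeds gives a steadier picture than a single run.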
Type
Journal article
Subject
stat.ME
stat.AP
stat.CO
stat.ML
Permalink
https://hdl.handle.net/10161/11816
Collections
  • Scholarly Articles

Scholars@Duke

Rebecca Carter Steorts

Associate Professor of Statistical Science
You can find more information about my research group and work at: https://resteorts.github.io/
Recent papers of mine can be found at https://arxiv.org/search/?query=steorts&searchtype=all&source=header
Open Access

Articles written by Duke faculty are made available through the campus open access policy. For more information see: Duke Open Access Policy

Rights for Collection: Scholarly Articles


Works are deposited here by their authors and represent their research and opinions, not those of Duke University. Some materials and descriptions may include offensive content.
