Data Expeditions

This repository includes datasets developed as part of the Data Expeditions program sponsored by the Information Initiative at Duke (iiD). The datasets were put together by teams of graduate students for Duke faculty to use in their courses. Each dataset comes with suggested questions amenable to both exploratory data analysis and advanced mathematical/statistical modeling. The datasets concern topics from multiple disciplines. The iiD, in collaboration with the Duke Social Science Research Institute and the Duke Library, makes Expeditions datasets available in this repository with the intention of allowing many Duke faculty and students to take advantage of these resources for learning quantitative methods.


Recent Submissions

Now showing 1 - 12 of 12
  • ItemOpen Access
    ENV 350S / PUBPOL 280S Seminar in Marine Conservation Leadership
    (2016) Stefanski, Stephanie; Smith, Martin D
    Duke PhD student Stephanie Stefanski recently taught a class focused on the process of designing and implementing an economic valuation survey and analyzing its results. The class was given as a module to inform the broader class themes of policy design and cost-benefit analysis in fisheries and marine resource management. The data file contains 1,526 observations of U.S. households who responded to an online Qualtrics survey in May 2012 about their familiarity with and willingness to pay to protect marine biodiversity in the Gulf of Mexico by paying additional taxes to fund an expansion of a marine sanctuary in the northern Gulf. There are 92 variables, which include socio-demographic characteristics of respondents, their answers to willingness-to-pay questions, and their answers to debriefing questions. Stephanie gave a presentation describing the context and motivation of the study and the main questions used in the survey. She then demonstrated to students the Stata commands and code used to visualize the data through histograms and frequency charts. These data visualizations informed the different types of regression analyses Stephanie taught the class. Finally, the students separated into small groups to discuss one of four discussion questions on policy implications. The purpose of the exercise is to help students think critically about survey design and implementation, about how the results of surveys can be used to inform a variety of policies, and about why people support environmental policy. The module successfully engaged students in learning about a published study and the data collection and analysis process it entailed. The class discussion fostered critical thinking about how to connect this type of data analysis and survey design to their own research and to addressing environmental challenges and policies beyond the scope of the study.
  • ItemOpen Access
    STA 112, Data Science, Statcast
    (2016-12-12) Coleman, Jake; Rundel, Colin
    In this Data Exploration, students were introduced to the baseball dataset Statcast, which included every pitch thrown in the first week of the 2016 season, with 21 characteristics. The students were tasked with using the R packages dplyr and ggplot2 to answer data exploration and summarization questions. The exercises challenged them to use information about the data as well as newly acquired computation skills. The Statcast data is owned by MLB Advanced Media, L.P. and was downloaded via a search for all pitches from 4/1/16 to 4/7/16. Statcast is a relatively new dataset (introduced in 2015), including all pitch characteristics from its precursor PitchF/X (such as pitch movement, type, start and end velocity, etc.). Statcast also added tracking of the ball during the entirety of the play, as well as tracking for all fielders. Full Statcast data is not yet available to the public, but Baseball Savant gives the public access to Statcast-added batted ball variables such as launch angle and batted ball speed. The dplyr package is a powerful tool for exploring data, using a simple structure to perform complex data management tasks. Students were introduced to dplyr in a previous lecture, and used the Statcast data to gain hands-on experience working with data. Their tasks ranged from simple summaries to sophisticated manipulation (as real data is rarely in perfect form for the desired analysis). They also integrated the R package ggplot2 to visualize some of their findings and draw further conclusions.
  • ItemOpen Access
    Math 412 - Topology with Applications
    (2016-06-24) Ghadyali, Hamza; Bendich, Paul L
    Highlights of Data Expedition:
      • Students explored daily observations of local climate data spanning the past 35 years.
      • Topological Data Analysis, or TDA for short, provides cutting-edge tools for studying the geometry of data in arbitrarily high dimensions.
      • Using TDA tools, students discovered intrinsic dynamical features of the data and learned how to quantify periodic phenomena in a time series.
      • Since nature invariably produces noisy data that rarely has exact periodicity, students also considered the theoretical basis of almost-periodicity and even invented and tested new mathematical definitions of almost-periodic functions.
    Summary
    The dataset we used for this data expedition comes from the Global Historical Climatology Network. “GHCN (Global Historical Climatology Network)-Daily is an integrated database of daily climate summaries from land surface stations across the globe.” We focused on the daily maximum and minimum temperatures from January 1, 1980 to April 1, 2015 collected from RDU International Airport. Through a guided series of exercises designed to be performed in Matlab, students explore these time series, initially by direct visualization and basic statistical techniques. Then students are guided through a special sliding-window construction which transforms a time series into a high-dimensional geometric curve. These high-dimensional curves can be visualized by projecting down to lower dimensions as in the figure below (Figure 1); however, our focus here was to use persistent homology to directly study the high-dimensional embedding. The shape of these curves carries meaningful information, but how one describes the “shape” of data depends on the scale at which the data is considered, and choosing the appropriate scale is rarely obvious. Persistent homology overcomes this obstacle by allowing us to quantitatively study geometric features of the data across multiple scales.
Through this data expedition, students are introduced to numerically computing persistent homology using the Rips collapse algorithm and interpreting the results. In the specific context of sliding-window constructions, 1-dimensional persistent homology can reveal the nature of periodic structure in the original data. I created a special technique to study how these high-dimensional sliding-window curves form loops in order to quantify the periodicity. Students are guided through this construction and learn how to visualize and interpret this information. Climate data is extremely complex (as anyone who has suffered from a bad weather prediction can attest), and numerous variables play a role in determining our daily weather and temperatures. This complexity, coupled with the imperfections of measuring devices, results in very noisy data, so the annual seasonal periodicity is far from exact. To this end, I have students explore existing theoretical notions of almost-periodicity and test them on the data. They find that some existing definitions are inadequate in this context. Hence I challenged them to invent new mathematics by proposing and testing their own definitions. These students rose to the challenge and suggested a number of creative definitions. While autocorrelation and spectral methods based on Fourier analysis are often used to explore periodicity, the construction here provides an alternative paradigm to quantify periodic structure in almost-periodic signals using tools from topological data analysis.
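The sliding-window construction described above can be sketched in a few lines. The block below is a minimal illustration in Python (the course exercises themselves were in Matlab), using the standard delay-coordinate form of the embedding; it is not necessarily the exact construction used in the expedition. The function name `sliding_window_embedding`, the parameters `dim` and `tau`, and the synthetic sine signal are all assumptions made for illustration, and the persistent homology step is not shown.

```python
import math


def sliding_window_embedding(series, dim, tau=1):
    """Map a 1-D series to points in R^dim using delay coordinates.

    Point i is (x[i], x[i + tau], ..., x[i + (dim - 1) * tau]).
    The names `dim` and `tau` are illustrative, not taken from the course.
    """
    if dim < 1 or tau < 1:
        raise ValueError("dim and tau must be positive integers")
    span = (dim - 1) * tau
    n_points = len(series) - span
    if n_points <= 0:
        raise ValueError("series is too short for this dim and tau")
    return [
        tuple(series[i + k * tau] for k in range(dim))
        for i in range(n_points)
    ]


# Synthetic stand-in for a seasonal signal: a noiseless sine wave with a
# 365-sample period. This is NOT the GHCN temperature data.
period = 365
signal = [math.sin(2 * math.pi * t / period) for t in range(3 * period)]

# With a delay of about a quarter period, the embedded sine traces a
# nearly circular closed loop in the plane, revisited once per period.
cloud = sliding_window_embedding(signal, dim=2, tau=period // 4)
```

For an exactly periodic signal the embedded curve returns to its starting point after one period, which is the kind of loop that 1-dimensional persistent homology can detect; noisy, almost-periodic data yields a loop that only approximately closes.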
  • ItemOpen Access
    North Carolina Traffic Stops
    (2014) Owens-Oas, Derek
  • ItemOpen Access
    Major League Baseball and National Basketball Association regular season data by team
    (2014) Futoma, Joseph; McAlinn, Kenichiro
    With the rise of sports statistics, especially sabermetrics in baseball, statistics have proven crucial not only for managing teams and assessing player value, but also for forecasting team and individual performance. In this data expedition, we provided undergraduates with detailed information about each team from every NBA and MLB game during the 2010-2011 and 2013 seasons, respectively. For baseball, for each of the 2430 games we have 23 batting stats (e.g. hits, runs batted in, home runs) and 23 pitching stats (e.g. strikeouts, runs allowed). For basketball, for each of the 1230 games we have 20 stats (e.g. field goals, free throws, rebounds).
  • ItemOpen Access
    Exploring lemur olfactory communication
    (2015-11-30) Smyth, Kendra; Greene, Lydia
    In Fall 2015, we (Kendra Smyth & Lydia Greene) led a Data Expeditions (DE) workshop in Advanced Research in Evolutionary Anthropology, a senior-level class on the research process. The goal of the workshop was to get students familiar with the R language and introduce them to a range of statistical techniques that might be useful for analyzing their own senior thesis data. In the workshop, we used a lemur scent-marking dataset compiled by Greene during her undergraduate honors thesis at Duke. By using these data, we aimed to make statistics seem both accessible and relatable to these students. Although teaching students the specific commands in R is undeniably valuable, the true reward from the Data Expeditions came from seeing students understand key concepts in statistics and from giving them the tools to begin the process of analyzing their own data.
  • ItemOpen Access
    Math 412: Music + Topology
    (2014) Tralie, Christopher
    In this mini assignment you will explore an application of "sliding windows and persistence" on time series data (see Jose Perea's paper for more theory). Specifically, you will look at how to transform musical audio data into a high-dimensional point cloud/curve which can be probed with TDA methods. You will make use of a visualization program called LoopDitty to gain some intuition about what points in various persistence diagrams might mean. Please follow the directions below and submit an electronic writeup with the answers to all of the questions and any observations you have. Enjoy!
  • ItemOpen Access
    2015 Call for Proposals
    (2015) Bendich, Paul L; Calderbank, Robert; Reiter, Jerome P
  • ItemOpen Access
    Signal, noise, and bias in yeast MNase-seq data
    (2014) MacAlpine, David Michael
    This is an optional challenge for students interested in applying what we have learned in class to a real computational genomics research problem; practicing the skills of using Python or R (or any other tool you wish) to visualize, analyze, model, and interpret real genomic data; and exploring the science linking chromatin structure and transcriptional regulation. Since this problem represents an open challenge for the genomics community, you are free to choose the approaches you use to analyze the data, as well as the questions you explore.
  • ItemOpen Access
    2014 Data Expeditions Call for Proposals
    (2015-11-30) Reiter, Jerome P; Calderbank, Robert