Browsing by Author "Rundel, Colin"
Item Open Access An educator's perspective of the tidyverse
Çetinkaya-Rundel, Mine; Hardin, Johanna; Baumer, Benjamin S; McNamara, Amelia; Horton, Nicholas J; Rundel, Colin
Computing makes up a large and growing component of data science and statistics courses. Many of those courses, especially when taught by faculty who are statisticians by training, teach R as the programming language. A number of instructors have opted to build much of their teaching around the use of the tidyverse. The tidyverse, in the words of its developers, "is a collection of R packages that share a high-level design philosophy and low-level grammar and data structures, so that learning one package makes it easier to learn the next" (Wickham et al. 2019). The shared principles have led to the widespread adoption of the tidyverse ecosystem. No small part of this usage is because the tidyverse tools have been intentionally designed to ease the learning process and cognitive load for users as they engage with each new piece of the larger ecosystem. Moreover, the functionality offered by the packages within the tidyverse spans the entire data science cycle, which includes data import, visualisation, wrangling, modeling, and communication. We believe the tidyverse provides an effective and efficient pathway to data science mastery for students at a variety of different levels of experience. In this paper, we introduce the tidyverse from an educator's perspective, touching on the what (a brief introduction to the tidyverse), the why (pedagogical benefits, opportunities, and challenges), the how (scoping and implementation options), and the where (details on courses, curricula, and student populations).
Item Open Access Measuring Baseball Defensive Value Using Statcast Data
(2017) Jordan, Drew
Multiple methods of measuring the defensive value of baseball players have been developed.
These methods commonly rely on human batted ball charters, which inherently introduces the possibility of measurement error and lack of objectivity to these metrics. Using newly available Statcast data, we construct a new metric, SAFE 2.0, that utilizes Bayesian hierarchical logistic regression to calculate the probability that a given batted ball will be caught by a fielder. We use kernel density estimation to approximate the relative frequency of each batted ball in our data. We also incorporate the run consequence of each batted ball. Combining the catch probability, the relative frequency, and the run consequence of batted balls over a grid, we arrive at our new metric, SAFE 2.0. We apply our method to all batted balls hit to centerfield in the 2016 Major League Baseball season, and rank all centerfielders according to their relative performance for the 2016 season as measured by SAFE 2.0. We then compare these rankings to the rankings of the most commonly used measure of defensive value, Ultimate Zone Rating.
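The abstract above describes combining three quantities over a grid: a catch probability, a relative frequency, and a run consequence. The abstract does not give the exact formula, so the following is only a minimal sketch of one plausible combination, a frequency-weighted sum over grid cells. The function name `weighted_grid_value` and all numbers are hypothetical and are not taken from the thesis.

```python
# Hypothetical illustration only: the function name and every grid value
# below are assumptions, not the thesis's actual SAFE 2.0 definition or data.

def weighted_grid_value(catch_prob, rel_freq, run_value):
    """Sum catch probability x relative frequency x run consequence over grid cells."""
    if not (len(catch_prob) == len(rel_freq) == len(run_value)):
        raise ValueError("all grids must have the same number of cells")
    total_freq = sum(rel_freq)
    if total_freq <= 0:
        raise ValueError("relative frequencies must sum to a positive value")
    # Normalise the frequencies so they form a discrete distribution over the grid.
    weights = [f / total_freq for f in rel_freq]
    return sum(p * w * r for p, w, r in zip(catch_prob, weights, run_value))


# Two made-up grid cells, flattened into lists.
catch_prob = [0.5, 1.0]   # assumed probability that the fielder catches the ball
rel_freq = [1.0, 3.0]     # assumed (unnormalised) density of batted balls
run_value = [0.8, 0.4]    # assumed run consequence of a ball landing in each cell
score = weighted_grid_value(catch_prob, rel_freq, run_value)
print(round(score, 3))
```

In the thesis, the catch probabilities come from a Bayesian hierarchical logistic regression and the frequencies from kernel density estimation. Here they are simply supplied as lists to show how the three ingredients might be multiplied and summed.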
Item Open Access Spatial Assignments Using Intrinsic Markers to Infer Migratory Patterns
(2017) Qian, Lei
In ecology, it is extremely useful to model migratory connections, as such models can be built upon to produce further research into organisms and the environment. However, this can be difficult because of the high cost of tracking animals using extrinsic factors such as tagging or electronic chips, and because tracking may influence the organisms' behavior afterwards. I use a Bayesian Gaussian process method developed by Rundel et al. (2013) to model migratory bird connectivity patterns using intrinsic markers, namely allele counts, on a spatial scope to predict the breeding grounds of a genetic sample.
Item Open Access STA 112, Data Science, Statcast
(2016-12-12) Coleman, Jake; Rundel, Colin
In this Data Exploration, students were introduced to the baseball dataset Statcast, downloaded from baseballsavant.mlb.com, which included every pitch thrown in the first week of the 2016 season, with 21 characteristics. The students were tasked with using the R packages dplyr and ggplot2 to answer data exploration and summarization questions. The exercises challenged them to use information about the data as well as newly acquired computational skills. The Statcast data is owned by MLB Advanced Media, L.P. and was downloaded from a search performed on baseballsavant.mlb.com for all pitches from 4/1/16 to 4/7/16. Statcast is a relatively new dataset (introduced in 2015) that includes all pitch characteristics from its precursor PITCHf/x (such as pitch movement, type, start and end velocity, etc.). Statcast also added tracking of the ball during the entirety of the play, as well as tracking for all fielders. Full Statcast data is not yet available to the public, but Baseball Savant gives the public access to Statcast-added batted ball variables such as launch angle and batted ball speed. The dplyr package is an extremely powerful tool for exploring data, using a simple structure to perform complex data management tasks. Students were introduced to dplyr in a previous lecture, and used the Statcast data to gain hands-on experience working with data. Their tasks ranged from simple summaries to sophisticated manipulation (as real data is rarely in perfect form for the desired analysis). They also used the R package ggplot2 to visualize some of their findings and draw further conclusions.