--- title: "ETC1010: Data Modelling and Computing" author: "Professor Di Cook, EBS, Monash U." output: learnr::tutorial: css: "css/logo.css" runtime: shiny_prerendered --- ```{r setup, include=FALSE} library(learnr) library(emo) knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE, collapse = TRUE, fig.height = 4, fig.width = 8, fig.align = "center", cache = FALSE, comment = NA ) tutorial_html_dependency() ``` # Tidy data ## Course web site This is a link to the course web site, in case you need to go back and forth between tutorial and web materials: http://dmac.dicook.org ## Overview - Motivation - Terminology of data - Different examples of data - Steps in making data tidy - Lots of examples ![](images/1237b41b-7e70-4377-a230-90b0e3ff4001.jpg) *You’re ready to sit down with a newly-obtained dataset, excited about how it will open a world of insight and understanding, and then find you can’t use it. You’ll first have to spend a significant amount of time to restructure the data to even begin to produce a set of basic descriptive statistics or link it to other data you’ve been using.* *“Tidy data” is a term meant to provide a framework for producing data that conform to standards that make data easier to use. Tidy data may still require some cleaning for analysis, but the job will be much easier.* John Spencer Source: [Measure Evaluation](https://www.measureevaluation.org/resources/newsroom/blogs/tidy-data-and-how-to-get-it) ### *FOLLOW ALONG WITH THE NOTES* - Open your ETC1010 `Rproj` - Download the `Rmd` file from the class web site - Download these data files to your data directory - `graduate-programs.csv` - `genes.csv` - `ASN00086282.dly` - `TB_notifications_2018-03-18.csv` - `french_fries.rda` - `rates.csv` - `WebAirport_FY_1986-2017.xls` - Open the `lecture2a_tidydata.Rmd` file, that you have just downloaded in your RStudio window, and run the chunks of code as we talk about them. - Also have the [notes](https://ebsmonash.shinyapps.io/lecture2a_tidydata/) open in your web browser, so you can do the inline quizzes - for fun. ## Example: US graduate programs This is data from a study on US grad programs. It originally came in an excel file containing rankings of many different programs. This data set contains information on four programs astronomy, economics, entomology and psychology. ```{r readgrad} library(tidyverse) grad <- read_csv("data/graduate-programs.csv") grad %>% top_n(10) ``` What's good about the format? - Rows contain information about the institution - Columns contain types of information, like average number of publications, average number of citations, % completion, It makes it easy to make summaries: ```{r gradsummary} grad %>% count(subject) grad %>% filter(subject == "economics") %>% summarise(m=mean(NumStud), s=sd(NumStud)) grad %>% filter(subject == "economics") %>% ggplot(aes(x=NumStud, y=MedianTimetoDegree)) + geom_point() + theme(aspect.ratio=1) ``` What do we learn from these summaries? ```{r variable, echo=FALSE} quiz( question("The average number of graduate students per economics program is ", answer("about 61", correct = TRUE), answer("about 39")), question("What is the best description of the relationship between number of students and median time to degree?", answer("as the number of students increases the median time to degree increases, weakly", correct = TRUE), answer("as the number of students increases the variability in median time to degree decreases")) ) ``` ## Terminology of data - A __variable__ is a quantity, quality, or property that you can measure. For the grad programs, these would be all the column headers. - An __observation__ is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. I’ll sometimes refer to an observation as a data point. For the grad programs, this is institution, and program, uniquley define the observation. - A value is the state of a variable when you measure it. The value of a variable typically changes from observation to observation. ## Tidy tabular form __Tabular data__ is a set of values, each associated with a variable and an observation. Tabular data is __tidy__ if each value is placed in its own `cell`, each variable in its own column, and each observation in its own row. The grad programs data in tabular form. It also is in **wide** tidy form, because there are multiple columns containing different variables. It can also be useful - for the process of data analysis - to arrange data into **long** tidy form, where each value is uniquely identified. ```{r} grad %>% gather(variable, value, -subject, -Inst) ``` ## Different examples of data For each of these data examples, **let's try together to identify the variables and the observations** - some are HARD! ### Genes experiment `r ji("thinking")` ```{r} genes <- read_csv("data/genes.csv") genes ``` ### Melbourne weather What are the variables? Observations? `r ji("scared")` ```{r} melbtemp <- read_fwf("data/ASN00086282.dly", col_positions=fwf_widths(c(11, 4, 2, 4, rep(c(5, 1, 1, 1), 31)))) melbtemp %>% select(X1, X2, X3, X4, X5, X9, X13, X17, X21, X25, X29, X33) ``` ### TB incidence This is current tuberculosis data taken from [WHO](http://www.who.int/tb/country/data/download/en/), the case notifications table. `r set.seed(2018); ji("sick")` ```{r} tb <- read_csv("data/TB_notifications_2018-03-18.csv") %>% select(country, year, starts_with("new_sp_")) %>% filter(year>1996, year<2012) tb %>% top_n(20) ``` ### French fries 10 week sensory experiment, 12 individuals assessed taste of french fries on several scales (how potato-y, buttery, grassy, rancid, paint-y do they taste?), fried in one of 3 different oils, replicated twice. First few rows: ```{r} load("data/french_fries.rda") french_fries %>% top_n(10) ``` ![](images/french_fries.png) What are the variables? Observations? ### Contingency table This data is collated from this story: [41% Of Fliers Think You’re Rude If You Recline Your Seat](http://fivethirtyeight.com/datalab/airplane-etiquette-recline-seat/) ```{r} fly <- read_csv("data/fly_tbl.csv") fly ``` What are the variables? ## Messy vs tidy Messy data is messy in its own way. You can make unique solutions, but then another data set comes along, and you have to again make a unique solution. Tidy data can be though of as legos. Once you have this form, you can put it together in so many different ways, to make different analyses. ## Tidy verbs Source: A drawing made by Alison Horst [@allison_horst](https://twitter.com/allison_horst?lang=en) - `gather`: specify the **keys** (identifiers) and the **values** (measures) to make long form (used to be called melting) - `spread`: variables in columns (used to be called casting) - `separate`: split character columns ## Tidying genes data ```{r} genes genes_long <- genes %>% gather(variable, expr, -id) genes_long ``` ### Separate columns ```{r} genes_long %>% separate(variable, c("trt", "leftover"), "-") ``` ```{r} genes_long %>% separate(variable, c("trt", "leftover"), "-") %>% separate(leftover, c("time", "rep"), "\\.") ``` ### Now spread to examine different aspects #### Examine treatments against each other ```{r} genes_long <- genes_long %>% separate(variable, c("trt", "leftover"), "-") %>% separate(leftover, c("time", "rep"), "\\.") genes_long %>% spread(trt, expr) %>% ggplot(aes(x=WI, y=WM, colour=id)) + geom_point() ``` Generally, some negative association within each gene, WM is low if WI is high. #### Examine replicates against each other ```{r} genes_long %>% spread(rep, expr) %>% ggplot(aes(x=R1, y=R4, colour=id)) + geom_point() + coord_equal() ``` Very roughly, replicate 4 is like replicate 1, eg if one is low, the other is low. That's a good thing, that the replicates are fairly similar. ## Try this yourself Here is a little data set to practice gather, spread and separate on. ```{r} kb <- read_csv("data/koala_bilby.csv") kb ``` Try to do these by yourself, but the code is in the Rmd file if you need help. ### Exercise 1 Gather the data into long form, naming the two new variables, `label` and `count` ```{r echo=FALSE} kb_long <- kb %>% gather(label, count, -ID) kb_long ``` ### Exercise 2 Separate the labels into two new variables, `animal`, `state` ```{r echo=FALSE} kb_long <- kb_long %>% separate(label, c("animal", "state")) kb_long ``` ### Exercise 3 Spread the long form data into wide form, where the columns are the states. ```{r echo=FALSE} kb_long %>% spread(state, count) ``` ### Exercise 4 Spread the long form data into wide form, where the columns are the animals. ```{r echo=FALSE} kb_long %>% spread(animal, count) ``` ## Tidying Melbourne weather This is seriously messy! Read the data first, just the columns that we need. The data is in fixed column width format, which means a certain number of columns are designated for the particular records, eg columns 1-11 contains the station id, columns 12-15 contain the year, ... ```{r} melbtemp <- read_fwf("data/ASN00086282.dly", col_positions=fwf_widths(c(11, 4, 2, 4, rep(c(5, 1, 1, 1), 31)), col_names = c("station", "year", "month", "variable", paste0("X", 5:128)))) melbtemp ``` X5, X9, ... contain the temperature and precipitation records. The remaining columns have data quality flags. We want to ignore the data quality flags for now, and select every fourth variable. ### Sometime easier to index select - `[...]` allows indexing of elements of a vector, or collection of numbers - `c(1:4, seq(5,128,4))` means collect items 1 through 4, and then every 4th item until the 128'th ```{r} melbtemp <- melbtemp[,c(1:4, seq(5, 128, 4))] melbtemp ``` ### Make long form ```{r} melbtemp_long <- melbtemp %>% gather(day, value, X5:X125) melbtemp_long ``` This is basically what we need. The day variable doesn't have the right values in it, but we can fix this later. ### Use spread to look at different things But there are a couple of things that I need to fix to make it possible to do these plots. We will discuss these operations next week. ```{r} melbtemp_long <- melbtemp_long %>% filter(variable %in% c("TMIN", "TMAX", "PRCP")) %>% mutate(value = ifelse(value == -9999, NA, value)) ``` ### What are the numbers? - What is the range of temperature and precipitation? - What units could this be in? - Why are some values -9999? Go to the data source: [https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt](https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt) ```{r units, echo=FALSE} quiz( question("Which is the best description of the temperature units? ", answer("degrees farehnheit F"), answer("degrees Kelvin K"), answer("degrees C x10", correct = TRUE)), question("What is the best description of the precipitation units", answer("mm x10", correct = TRUE), answer("inches")), question("What does -9999 mean?", answer("it was really cold"), answer("the keyboard got stuck"), answer("the value was missing", correct = TRUE)) ) ``` ### Now use spread #### Max vs Min ```{r} melbtemp_long %>% spread(variable, value) %>% ggplot(aes(x=TMIN, y=TMAX)) + geom_point() + coord_equal() + geom_abline(slope=1, intercept=0) ``` Oh, TMAX is always higher than TMIN. This is a good thing. #### Winter vs Summer ```{r} melbtemp_long %>% filter(variable == "TMAX") %>% spread(month, value) %>% ggplot(aes(x=`07`, y=`01`)) + geom_point() + xlab("July") + ylab("Jan") + coord_equal() + geom_abline(slope=1, intercept=0) ``` Same day of the month, same year, January has higher maximum temperatures than July. Except two days, the July max exceeded the January max. (This comparison is a bit whacky, though right?) ## Lab exercise 1 [41% Of Fliers Think You’re Rude If You Recline Your Seat](http://fivethirtyeight.com/datalab/airplane-etiquette-recline-seat/). In the following table, V1 is a response to the question "Is it rude to recline your seat on a plane?", and V2 is the response to the question "Do you ever recline your seat when you fly?". The data is in the form of a contingency table. ```{r} fly_tbl <- read_csv("data/fly_tbl.csv") fly_tbl ``` a. What are the variables and observations in this data? b. Put the data in tidy long form (using the names `V2` as the key variable, and `count` as the value). c. Use the `rename` function to make the variable names a little shorter. You can get the data from [data/fly_tbl.csv](http://dmac.dicook.org/lectures/data/fly_tbl.csv) ```{r echo=FALSE, eval=FALSE} library(tidyverse) fly_tbl <- read_csv("data/fly_tbl.csv") fly_tbl %>% gather(V2, count, -V1) ``` ## Lab exercise 2 Your job is to tidy the TB incidence data. - Write down the steps that will be needed go from raw data to tidy - Write the code piece by piece You can get the data from [TB_notifications_2018-03-18.csv](http://dmac.dicook.org/lectures/data/TB_notifications_2018-03-18.csv) ```{r echo=FALSE, eval=FALSE} tb <- read_csv("data/TB_notifications_2018-03-18.csv") %>% select(country, year, starts_with("new_sp_")) %>% filter(year>1996, year<2012) tb %>% top_n(20) tb_tidy <- tb %>% gather(stuff, count, starts_with("new_sp_")) %>% separate(stuff, c("stuff1", "stuff2", "genderage")) %>% separate(genderage, c("gender", "age"), sep=1) %>% select(-stuff1, -stuff2) tb_tidy ``` ## Lab exercise 3 For the data set, `rates.csv`, ```{r} rates <- read_csv("data/rates.csv") head(rates) ``` a. What are the variables and observations? b. Gather the five currencies, AUD, GBP, JPY, CNY, CAD, make it into tidy long form. c. Make line plots of the currencies, like the plot below, and describe the similarities and differences between the currencies. You can get the data from [rates.csv](http://dmac.netlify.com/lectures/data/rates.csv) ```{r echo=FALSE, results='hide', fig.show='hide'} library(tidyverse) rates <- read_csv("data/rates.csv") ggplot(rates, aes(x=date, y=AUD)) + geom_line() ``` ```{r echo=FALSE, results='hide', fig.width=6, fig.height=10} rates_long <- rates %>% select(date, AUD, GBP, JPY, CNY, CAD) %>% gather(currency, rate, -date) ggplot(rates_long, aes(x=date, y=rate)) + geom_line() + facet_wrap(~currency, ncol=1, scales="free_y") ``` ## Lab exercise 4 This is tough! From the web site [Department of Infrastructure, Regional Development and Cities](https://bitre.gov.au/publications/ongoing/airport_traffic_data.aspx) download the Airport Traffic Data 1985–86 to 2016–17. It is 2.7Mb. Extract the "Airport Passengers" table into R with this code ```{r} library(readxl) passengers <- read_xls("data/WebAirport_FY_1986-2017.xls", sheet=3, skip=6) passengers ``` Tidy the data, to produce a data set with these columns - airport: all of the airports. - year - type_of_flight: DOMESTIC, INTERNATIONAL - bound: IN or OUT ```{r eval=FALSE, echo=FALSE} passengers <- passengers %>% select(-X__1, -X__2, -Rank, -starts_with("TOTAL")) %>% rename(INBOUND_DOM=INBOUND, INBOUND_INTL=INBOUND__1, OUTBOUND_DOM=OUTBOUND, OUTBOUND_INTL=OUTBOUND__1, INBOUND_TOTL=INBOUND__2, OUTBOUND_TOTL=OUTBOUND__2) %>% gather(where, amount, starts_with("IN"), starts_with("OUT")) ``` ## Lab quiz Time to take the lab quiz.