Instructions

In this week’s lab, the main goal is to learn how to tidy a data set. On the due date, turn in your Rmd file and the html product.

Exercise 1

Open your project for this class. Make sure all your work is done relative to this project.

Open the lab2.Rmd file provided with the instructions. You can add your answers to the questions in this document to that file.

Exercise 2

These are warmups for the later questions. We will work through the examples in the lecture notes, and make sure that they all work for you, and that you know what each function does.

  1. Tidy the genes data.
    1. Which function makes the long tidy form?
    2. Which function separates the text strings into two variables?
    3. What two functions create new text values that have no leading “W”?
library(tidyverse)
genes <- read_csv("genes.csv")
gtidy <- genes %>%
  gather(variable, expr, -id) %>%
  separate(variable, c("trt", "leftover"), "-") %>%
  separate(leftover, c("time", "rep"), "\\.") %>%
  mutate(trt = sub("W", "", trt)) %>%
  mutate(rep = sub("R", "", rep))
  2. Tidy the Melbourne weather station data.
    1. What does the function read.fwf do? What does the argument c(11, 4, 2, 4, rep(c(5, 1, 1, 1), 31)) do in the function?
    2. What is the difference between [] and ()? What does [,c(1,2,3,4,seq(5,128,4))] do in the second line?
    3. Which function removes records other than the temperature and precipitation?
    4. What does the spread function do here?
    5. Why do the raw temperature and precipitation values get divided by 10?
melbtemp <- read.fwf("ASN00086282.dly", 
                     c(11, 4, 2, 4, rep(c(5, 1, 1, 1), 31)), fill=TRUE)
melbtemp <- melbtemp[,c(1,2,3,4,seq(5,128,4))]
colnames(melbtemp) <- c("id", "year", "month", "var", paste0("V",1:31))
melbtemp <- melbtemp %>% 
  gather(day, value, V1:V31) %>%
  mutate(day = sub("V", "", day)) %>%
  mutate(value=ifelse(value==-9999, NA, value)) %>%
  filter(var %in% c("PRCP", "TMAX", "TMIN")) %>%
  spread(var, value) %>%
  mutate(PRCP=PRCP/10, TMAX=TMAX/10, TMIN=TMIN/10)
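
To see what each step of the weather pipeline does, it can help to run the same verbs on a tiny data frame first. The following is a minimal sketch, assuming made-up values and only two day columns instead of 31; the object names `widths`, `day_cols`, `toy` and `toy_tidy` are hypothetical and not part of the lab files.

```r
library(tidyr)
library(dplyr)

# Widths used in the lab code: 4 header fields, then 4 fields for each of 31 days
widths <- c(11, 4, 2, 4, rep(c(5, 1, 1, 1), 31))
length(widths)              # 128 fields per record
day_cols <- seq(5, 128, 4)  # every 4th column, starting at column 5
length(day_cols)            # 31, one column per day of the month

# Made-up records (hypothetical values, two days only)
toy <- tibble(
  year  = c(2000, 2000),
  month = c(1, 1),
  var   = c("TMAX", "TMIN"),
  V1    = c(251, 143),
  V2    = c(-9999, 150)
)

toy_tidy <- toy %>%
  gather(day, value, V1:V2) %>%                          # wide to long: one row per day and variable
  mutate(day = sub("V", "", day),                        # "V1" becomes "1"
         value = ifelse(value == -9999, NA, value)) %>%  # -9999 is treated as missing
  spread(var, value) %>%                                 # one column per measured variable
  mutate(TMAX = TMAX / 10, TMIN = TMIN / 10)             # rescale, as in the lab code

toy_tidy
```

In tidyr 1.0.0 and later, `pivot_longer()` and `pivot_wider()` are the recommended replacements for `gather()` and `spread()`; the older functions still work, so the lab code runs as written.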

Exercise 3

41% Of Fliers Think You’re Rude If You Recline Your Seat. In the following table, V1 is a response to the question “Is it rude to recline your seat on a plane?”, and V2 is the response to the question “Do you ever recline your seat when you fly?”.

fly_tbl <- read_csv("fly_tbl.csv")
library(knitr)
kable(fly_tbl)
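| V1                  | V2:Always | V2:Usually | V2:About half the time | V2:Once in a while | V2:Never |
|:--------------------|----------:|-----------:|-----------------------:|-------------------:|---------:|
| No, not rude at all |       124 |        145 |                     82 |                116 |       35 |
| Yes, somewhat rude  |         9 |         27 |                     35 |                129 |       81 |
| Yes, very rude      |         3 |          3 |                     NA |                 11 |       54 |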
  1. What are the variables in this data?
  2. Tidy the data.

Exercise 4

For the data set, rates.csv,

rates <- read_csv("rates.csv")
head(rates)
# A tibble: 6 x 169
        date      AED     AFN      ALL      AMD      ANG      AOA      ARS
      <date>    <dbl>   <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
1 2017-06-20 3.673014 68.1380 119.4394 481.2024 1.783103 165.9165 16.14300
2 2017-06-21 3.673014 68.0805 118.8077 479.9032 1.776957 165.9165 16.21644
3 2017-06-22 3.673014 68.1400 118.6834 480.8324 1.782248 165.9165 16.14900
4 2017-06-23 3.673014 68.1047 118.0250 478.8100 1.775643 165.9165 16.17800
5 2017-06-24 3.673014 68.1047 118.0250 478.8100 1.775643 165.9165 16.17800
6 2017-06-25 3.673014 67.9365 118.0800 478.7900 1.775558 165.9165 16.10150
# ... with 161 more variables: AUD <dbl>, AWG <dbl>, AZN <dbl>, BAM <dbl>,
#   BBD <int>, BDT <dbl>, BGN <dbl>, BHD <dbl>, BIF <dbl>, BMD <int>,
#   BND <dbl>, BOB <dbl>, BRL <dbl>, BSD <int>, BTC <dbl>, BTN <dbl>,
#   BWP <dbl>, BYN <dbl>, BZD <dbl>, CAD <dbl>, CDF <dbl>, CHF <dbl>,
#   CLF <dbl>, CLP <dbl>, CNH <dbl>, CNY <dbl>, COP <dbl>, CRC <dbl>,
#   CUC <int>, CUP <dbl>, CVE <dbl>, CZK <dbl>, DJF <dbl>, DKK <dbl>,
#   DOP <dbl>, DZD <dbl>, EGP <dbl>, ERN <dbl>, ETB <dbl>, EUR <dbl>,
#   FJD <dbl>, FKP <dbl>, GBP <dbl>, GEL <dbl>, GGP <dbl>, GHS <dbl>,
#   GIP <dbl>, GMD <dbl>, GNF <dbl>, GTQ <dbl>, GYD <dbl>, HKD <dbl>,
#   HNL <dbl>, HRK <dbl>, HTG <dbl>, HUF <dbl>, IDR <dbl>, ILS <dbl>,
#   IMP <dbl>, INR <dbl>, IQD <dbl>, IRR <dbl>, ISK <dbl>, JEP <dbl>,
#   JMD <dbl>, JOD <dbl>, JPY <dbl>, KES <dbl>, KGS <dbl>, KHR <dbl>,
#   KMF <dbl>, KPW <dbl>, KRW <dbl>, KWD <dbl>, KYD <dbl>, KZT <dbl>,
#   LAK <dbl>, LBP <dbl>, LKR <dbl>, LRD <dbl>, LSL <dbl>, LYD <dbl>,
#   MAD <dbl>, MDL <dbl>, MGA <dbl>, MKD <dbl>, MMK <dbl>, MNT <dbl>,
#   MOP <dbl>, MRO <dbl>, MUR <dbl>, MVR <dbl>, MWK <dbl>, MXN <dbl>,
#   MYR <dbl>, MZN <dbl>, NAD <dbl>, NGN <dbl>, NIO <dbl>, NOK <dbl>, ...
  1. Write down what the variables are.
  2. Make a time series (line plot) of the Australian dollar cross rate against the US dollar (USD). What day was the best day to exchange USD into AUD?
  3. Focusing on the five currencies AUD, GBP, JPY, CNY and CAD, convert the data into tidy form. Show the code.
  4. Make a faceted time series plot of the five currencies, where each currency is shown on its own scale. Is there a predominant pattern among these five currencies in the rate relative to the USD? Is there a currency trending differently?
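
As a hint for the last two questions, here is a minimal sketch of the general pattern on made-up data: select a few columns, gather them into long form, and facet with a free y axis so that each series gets its own scale. The currency codes `XXA`, `XXB` and `XXC` and all values are hypothetical and do not come from rates.csv.

```r
library(tidyr)
library(dplyr)
library(ggplot2)

# Made-up wide data: one column per (hypothetical) currency
toy_rates <- tibble(
  date = as.Date("2017-06-20") + 0:2,
  XXA  = c(1.30, 1.32, 1.31),
  XXB  = c(110.5, 111.0, 110.2),
  XXC  = c(0.78, 0.79, 0.77)
)

# Keep two of the currencies and reshape to one row per date and currency
toy_long <- toy_rates %>%
  select(date, XXA, XXB) %>%
  gather(currency, rate, -date)

# One panel per currency; scales = "free_y" lets each panel choose its own y range
p <- ggplot(toy_long, aes(x = date, y = rate)) +
  geom_line() +
  facet_wrap(~currency, ncol = 1, scales = "free_y")
p
```

Without `scales = "free_y"` all panels share one y axis, so a currency with large values would flatten the others.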

Exercise 5

Read in the billboard top 100 music data, which contains N’Sync and Backstreet Boys songs that entered the billboard charts in the year 2000.

billboard <- read_csv("billboard.csv")
  1. What’s in this data? What’s 1-76? What are the variables?
  2. Convert this data into a long format appropriate for plotting a time series (date on the x axis, chart position on the y axis)
  3. Make a time series plot, where each song is its own line, and the line is coloured by the artist. It should look like this. Which song was the most successful? What song was the least successful? (Explain your reasoning, or metrics, for saying a song is more or less successful.)

Exercise 6

This data was pulled from https://www.whaleshark.org in 2013. It lists verified encounters with whale sharks across the globe.

whalesharks <- read_csv("whaleshark-encounters.csv")
  1. Is a whale shark a whale?
  2. What are the observations in this data?
  3. What information is potentially replicated on multiple lines? That is, the records are an example of repeated measurements, sometimes called longitudinal or panel data.
  4. Compute the number of records for each whale shark whose identity is known (Marked Individual). Which individual has the most sightings? How many unmarked individuals are recorded in the database?
  5. Let’s make a map of the encounters. Code is below. What part of the code below draws the map? What part adds points for the encounters?
  6. Change the code to colour males and females differently. Do males and females roam the same locations?
library(maps)
library(ggthemes)
world_map <- map_data("world")
ggplot(world_map) + 
  geom_polygon(aes(x=long, y=lat, group=group), 
               fill="grey90", colour="white") + 
  theme_map() +
  geom_point(data=whalesharks, aes(x=Longitude, y=Latitude),
             colour="salmon", alpha=0.5)

ggplot(world_map) + 
  geom_polygon(aes(x=long, y=lat, group=group), 
               fill="grey90", colour="white") + 
  theme_map() +
  geom_point(data=filter(whalesharks, !is.na(Sex)),
             aes(x=Longitude, y=Latitude, colour=Sex),
             alpha=0.5)

Exercise 7

The file budapest.csv has a subset of web click-through data related to hotel searches for Budapest. Each line in this data corresponds to a summary of a person looking for a hotel on the Expedia web site. For these questions, the answers don’t require you to code, but to map out what operations you need to perform on the data.

budapest <- read_csv("budapest.csv")
  1. Is the data in tidy form? What are the observations, and list a couple of the variables.
  2. If I want to answer the question “What proportion of people searching actually booked a hotel room?” what would I need to do to the data? (The variable recording the searcher’s final decision is CLICK_THRU_TYP_ID, and the code indicating a booking is 3406.)
  3. If I want to answer the question “What day of the week are most people searching for hotels?” what would I need to do with the data? (There are two kinds of date variables in the data: when they are searching, SRCH_DATETM, and the dates for which they want the hotel room, SRCH_BEGIN_USE_DATE and SRCH_END_USE_DATE.)
  4. If I want to answer the question “How far ahead of the check-in date do people typically search for a hotel room?” what needs to be done with the data?
  5. If I want to answer the question “Does the existence of a promotion, IS_PROMO_FLAG, tend to result in a higher likelihood of a booking?” what operations do you need to do on the data?
  6. There are a lot of missing values (NAs) in the data, particularly in the booking variable. If an NA essentially means that the person searching quit the site without making a booking, how would you recode the missing values?
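
No code is required for this exercise, but the kinds of operations involved can be sketched on a made-up data frame. The following is one possible sketch, assuming the date columns have already been parsed as dates (the actual format in budapest.csv may differ) and that an NA in CLICK_THRU_TYP_ID means no booking. The values and the new column names `booked`, `search_day` and `days_ahead` are hypothetical.

```r
library(dplyr)

# Made-up searches; column names follow the lab text, values are invented
toy <- tibble(
  SRCH_DATETM         = as.Date(c("2017-06-19", "2017-06-20", "2017-06-20")),
  SRCH_BEGIN_USE_DATE = as.Date(c("2017-07-01", "2017-06-25", "2017-08-01")),
  CLICK_THRU_TYP_ID   = c(3406, NA, 3406)
)

toy <- toy %>%
  mutate(
    # Assumption: NA means the searcher left without booking, so it counts as FALSE
    booked     = !is.na(CLICK_THRU_TYP_ID) & CLICK_THRU_TYP_ID == 3406,
    # Day of the week of the search (names depend on the R session's locale)
    search_day = weekdays(SRCH_DATETM),
    # Days between the search and the requested check-in date
    days_ahead = as.numeric(SRCH_BEGIN_USE_DATE - SRCH_DATETM)
  )

# Proportion of searches that ended in a booking
prop_booked <- mean(toy$booked)

# Number of searches per day of the week
count(toy, search_day)
```

The same `booked` indicator could then be summarised within groups of another variable, for example with `group_by()` and `summarise()`.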