Instructions

In this week’s lab, the main goal is to gain some experience in building models to explore and explain data. We will start with the famous gapminder data, and use regression models to study temporal trends in life expectancy across the globe. Then we will use decision trees to build a spam filter, using data collected by Dr Cook and her students a number of years ago.

Warmups

  • Watch Hans Rosling’s TED talk
  • Email headers have a lot of information about where the email originated from, where the reply goes to, size, domains, … This is usually hidden by your mail handling software, unless you specifically request to view it. See if you can get your mail handler to show you the full header of an email. For example, here is a sample from an email to me from David Frazier:
Received: by 10.103.136.194 with HTTP; Sun, 17 Sep 2017 14:57:50 -0700 (PDT)
In-Reply-To: <CAFvWOFKt9C-WYAWi0-QfA_0x+ej=5DSLsPoPY4NVh29Y=sDf8w@mail.gmail.com>
References: <6A89C7A8-CA54-42BE-938F-CF41CCE2F362@monash.edu> <CAFvWOFKt9C-WYAWi0-QfA_0x+ej=5DSLsPoPY4NVh29Y=sDf8w@mail.gmail.com>
From: David Frazier <david.frazier@monash.edu>
Date: Mon, 18 Sep 2017 07:57:50 +1000
Message-ID: <CAFvWOF+i6U=tFsb2v+2yQ1L91zXXcusSLKBe=XJwHXdY-7JZJQ@mail.gmail.com>
Subject: Re: formula sheet
To: Dianne Cook <dicook@monash.edu>
Content-Type: multipart/mixed; boundary="001a114fcb3aabdccf055969b77e"

--001a114fcb3aabdccf055969b77e
Content-Type: multipart/alternative; boundary="001a114fcb3aabdccd055969b77c"

--001a114fcb3aabdccd055969b77c
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi There,

...

Cheers,
David

Exercise 1

Open your project for this class. Make sure all your work is done relative to this project.

Open the lab10.Rmd file provided with the instructions. You can edit this file and add your answers to questions in this document.

Exercise 2

The data has demographics of life expectancy and GDP per capita for 142 countries reported every 5 years between 1952 and 2007.

Observations: 1,704
Variables: 6
$ country   <fctr> Afghanistan, Afghanistan, Afghanistan, Afghanistan,...
$ continent <fctr> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...
  1. How would you describe the following plot?
ggplot(data=gapminder, aes(x=year, y=lifeExp, group=country)) +
  geom_line(alpha=0.5)

  1. The first year in the data is 1952, so for model fitting we are going to shift year to begin in 1950, which makes the model easier to interpret.

  2. Then let’s fit a model for Australia

  3. Interpret the model. (This means explain how life expectancy changes over years, since 1950, using the parameter estimates of the model.)

  4. What was the average life expectancy in 1950?

  5. What was the average life expectancy in 2000?

  6. By how much did average life expectancy change over those 50 years?

  7. We can get various diagnostics out for the model with the broom package: the parameter estimates and their significance, the goodness of fit statistics, and model diagnostics.

  8. What column of the diagnostics contains the (a) fitted values, (b) residuals?
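
The steps in this exercise can be sketched as follows. This is a minimal sketch, assuming the data comes from the gapminder package. The names gapminder2 and year1950 match those used in the code for Exercise 3, but their definition here is an assumption; oz and oz_lm are hypothetical names chosen for illustration.

```r
library(dplyr)
library(broom)
library(gapminder)

# Shift year so that 0 corresponds to 1950 (assumed definition of year1950)
gapminder2 <- gapminder %>%
  mutate(year1950 = year - 1950)

# Fit a simple linear model to the Australian data
oz <- gapminder2 %>% filter(country == "Australia")
oz_lm <- lm(lifeExp ~ year1950, data = oz)

# Parameter estimates and their significance
tidy(oz_lm)
# Goodness of fit statistics for the whole model
glance(oz_lm)
# Per-observation diagnostics, including fitted values and residuals
augment(oz_lm)

# Model-based average life expectancy in 1950 and 2000
predict(oz_lm, newdata = data.frame(year1950 = c(0, 50)))
```

The intercept is the model's estimate for 1950 because year1950 is 0 in that year, and the slope is the estimated change in life expectancy per year.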

Exercise 3

Now we are going to fit a simple linear model separately to every country, and use the model fits to simplify the patterns across the globe, in order to be able to explain the changes in life expectancy.

This code will compute the models for you:

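library("dplyr")
library("tidyr")
library("purrr")
library("broom")
library("knitr")
# Nest the data: one row per country, with its observations in a list column
by_country <- gapminder2 %>% 
  select(country, year1950, lifeExp, continent) %>%
  group_by(country, continent) %>% 
  nest()
# Fit a simple linear model to each country's data
by_country <- by_country %>% 
  mutate(
    model = purrr::map(data, ~ lm(lifeExp ~ year1950, 
                                  data = .))
  )
# Extract the parameter estimates from each model
country_coefs <- by_country %>% 
  mutate(coefs = purrr::map(model, broom::tidy)) %>%
  select(country, continent, coefs) %>%
  unnest(coefs)
# One row per country, with the intercept and slope (year1950) as columns
country_coefs <- country_coefs %>% 
  select(country, continent, term, estimate) %>% 
  spread(term, estimate) %>%
  rename(intercept = `(Intercept)`)
kable(head(country_coefs))
country_coefs %>%
  filter(country == "Australia")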

It is also possible to use a for loop to compute the slope and intercept for each country.
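
A for loop version might look like the following. This is a sketch rather than the lab's own solution: it assumes the gapminder package as the data source and rebuilds gapminder2 so that it runs on its own. The names countries and loop_coefs are hypothetical.

```r
library(gapminder)

# Shift year so that 0 corresponds to 1950 (assumed definition of year1950)
gapminder2 <- as.data.frame(gapminder)
gapminder2$year1950 <- gapminder2$year - 1950

countries <- unique(as.character(gapminder2$country))
n <- length(countries)

# Pre-allocate one row of estimates per country
loop_coefs <- data.frame(country = countries,
                         intercept = numeric(n),
                         year1950 = numeric(n))

for (i in seq_len(n)) {
  # Subset to one country and fit its linear model
  d <- gapminder2[gapminder2$country == countries[i], ]
  fit <- lm(lifeExp ~ year1950, data = d)
  # Store the intercept and the slope (the year1950 coefficient)
  loop_coefs$intercept[i] <- coef(fit)[["(Intercept)"]]
  loop_coefs$year1950[i] <- coef(fit)[["year1950"]]
}

head(loop_coefs)
```

The result has the same information as country_coefs, without the continent column.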

  1. Pick your favourite country (other than Australia). Find the parameter estimates from the country_coefs data frame. Do a hand-sketch of the fitted model.

Exercise 4

  1. Make a scatterplot of the linear model estimates for each country, slope vs intercept. Colour the points by continent. Make the plot interactive using the plotly package, and find out which countries had a negative slope.

  2. Statistically summarise the relationship between intercept and slope, using words like no association, positive linear association, negative linear association, weak, moderate, strong, outliers, clusters.

  3. Do you see a difference between continents? If so, explain what you see.

  4. What does it mean for a country to have a high intercept, e.g. 70?

  5. What does it mean for a country to have a high slope, e.g. 0.7?
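
One possible way to draw the interactive scatterplot is sketched below. It recomputes the estimates so that it runs on its own, and it names the slope column slope for readability; in the country_coefs data frame from Exercise 3 the slope is in the column year1950. The object name p is hypothetical.

```r
library(dplyr)
library(ggplot2)
library(plotly)
library(gapminder)

# Recompute intercept and slope per country so this sketch is self-contained
country_coefs <- gapminder %>%
  mutate(year1950 = year - 1950) %>%
  group_by(country, continent) %>%
  summarise(
    intercept = coef(lm(lifeExp ~ year1950))[["(Intercept)"]],
    slope     = coef(lm(lifeExp ~ year1950))[["year1950"]],
    .groups = "drop"
  )

# Slope against intercept, coloured by continent; the label aesthetic is
# not drawn by geom_point() but appears in the plotly tooltip
p <- ggplot(country_coefs,
            aes(x = intercept, y = slope,
                colour = continent, label = country)) +
  geom_point()

# Convert the static plot to an interactive one; hover to see the country
ggplotly(p)

# Countries with a negative slope can also be listed directly
country_coefs %>% filter(slope < 0)
```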

Exercise 5

Now we are going to examine the fit for each country. We might expect that a linear model is a better fit for some countries and not so good for other countries. Here is the code to extract the model diagnostics for each country’s model.

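# Extract the goodness of fit statistics from each country's model
country_fit <- by_country %>% 
  mutate(fit = purrr::map(model, broom::glance)) %>%
  select(country, continent, fit) %>%
  unnest(fit)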

Or you can use a for loop to compute this.
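
A for loop for the fit statistics could be sketched like this. It assumes the gapminder package as the data source and collects only \(R^2\); the names countries and loop_fit are hypothetical.

```r
library(gapminder)

# Shift year so that 0 corresponds to 1950 (assumed definition of year1950)
gapminder2 <- as.data.frame(gapminder)
gapminder2$year1950 <- gapminder2$year - 1950

countries <- unique(as.character(gapminder2$country))

# Pre-allocate one R^2 value per country
loop_fit <- data.frame(country = countries, r.squared = NA_real_)

for (i in seq_along(countries)) {
  d <- gapminder2[gapminder2$country == countries[i], ]
  fit <- lm(lifeExp ~ year1950, data = d)
  # summary() of an lm object stores R^2 in the r.squared element
  loop_fit$r.squared[i] <- summary(fit)$r.squared
}

# Countries with the weakest linear fit come first
head(loop_fit[order(loop_fit$r.squared), ])
```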

  1. Plot the \(R^2\) values as a histogram.

  2. Examine the countries with the worst fit, countries with \(R^2<0.45\), by making scatterplots of the data, with the linear model overlaid.

Each of these countries has a big dip in life expectancy during the period of the study. Explain these dips using world history and current affairs information. (Feel free to google for news stories.)
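
The two plots in this exercise might be produced along these lines. This is a sketch under assumptions: it uses the gapminder package and, to stay self-contained, computes \(R^2\) as the squared correlation between year and life expectancy, which equals the \(R^2\) of a simple linear regression, rather than calling broom::glance. The name badfit is hypothetical.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

gapminder2 <- gapminder %>% mutate(year1950 = year - 1950)

# R^2 per country, as the squared correlation (simple regression only)
country_fit <- gapminder2 %>%
  group_by(country, continent) %>%
  summarise(r.squared = cor(year1950, lifeExp)^2, .groups = "drop")

# Histogram of the R^2 values
ggplot(country_fit, aes(x = r.squared)) +
  geom_histogram(binwidth = 0.05)

# Countries where the linear model fits poorly
badfit <- country_fit %>% filter(r.squared < 0.45)

# Data for those countries, with the fitted line overlaid
gapminder2 %>%
  filter(country %in% badfit$country) %>%
  ggplot(aes(x = year, y = lifeExp)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~country)
```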

Exercise 6

The file SPAM-503.csv contains summaries of a week’s worth of emails from 19 people. Each email was manually labelled as spam or not. We decided to examine our emails because the university had recently changed its spam filters, and emails from the university president were being sent to spam. Spam filters have improved dramatically in the last decade, but something happened to Monash mail this past week: emails from Monash students and departmental colleagues have been discovered in the spam folder. Here is a description of the variables:

1. ISUid: ISU id 

2. id: e-mail id (some count from 1 to number of mails you got, so that
you can get back to the original message for the line of data -
to help with checking for strange results.)

3. Day of Week: Sun, Mon, Tue, Wed, Thu, Fri, Sat

4. Time of Day: 0-23 (only integer values)

5. Size [kb]: Size of e-mail in kilobytes

6. Box: Is the sender in any of your Inboxes or Outbox (i.e. known to you) 
yes, no

7. Domain: Domain name of sender's e-mail address (only last segment): .edu,
.net, .com, .org, .gov, .mil, .de, .fr, .ru,

8. Local: Sender's e-mail is in local domain i.e. xx@yy.iastate.edu 
yes, no

9. Digits: Number of digits (0-9) in the
sender's name: e.g. lottery2003@yahoo.com will be 4.

10. name: Whether the name field is a full name, a single word, or empty: 
e.g. "Andreas Buja <andreas@research.att.com>" is name
 "bob <lottery2003@yahoo.com>" is single 
 "<lottery2003@yahoo.com>" is empty

11. %capital: % capital letters in subject line

12. NSpecial: Number of special characters (i.e. non a-z, A-Z or 0-9) in subject

Spam words in subject line:

13. credit: mortgage, sale, approve, credit -> yes/no
14. sucker: earn, free, save -> yes/no
15. porn: nude, sex, enlarge, improve -> yes/no
16. chain: pass, forward, help -> yes/no
17. username: Is your username/name listed in subject line -> yes/no

18. Large text in e-mail 
yes, no (only yes, if html e-mail and size="+3" or size="5" or
higher. Visual inspection of e-mail will tell.)

19. Probability of being spam, according to ISU spam filter.  Look for
"Probability=x%" in the header of the email. And record the "x" or an
NA if the message doesn't have a probability. This variable will be
used to compare our classification results from our data. (Has a lot of missing values, because not everyone read email through the university mail system.)

20. Extended spam/mail category 
commercial->com,
lists->list, 
newsletter->news, 
ordinary->ord

21. Spam
yes, no
  1. Build a tree model to predict whether the email is spam or ham. Split your data into a 50% training and 50% test set. Build the tree on the training data, and predict the test data. (I have done some initial tuning, which suggests using minsplit=10 and cp=0.005.)

  2. Compute the proportion of false positives (ham being predicted to be spam), and false negatives (spam predicted to be ham), in your training and test data. Which is the worse error here?

  3. What combinations of variables tend to suggest the email is spam?
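
The workflow for this exercise can be sketched with the rpart package. Because the column names in SPAM-503.csv are not listed here exactly, this sketch uses the kyphosis data that ships with rpart as a stand-in: the response Kyphosis plays the role of the spam label, with "present" treated as the analogue of spam. The seed and all object names are arbitrary choices, and the error rates are computed within each true class, which is one of several reasonable definitions.

```r
library(rpart)

set.seed(2017)  # arbitrary seed, so the random split is reproducible

# Stand-in data (NOT the spam data): kyphosis ships with rpart
d <- kyphosis
n <- nrow(d)

# Split into 50% training and 50% test sets
train_idx <- sample(n, size = floor(n / 2))
train <- d[train_idx, ]
test  <- d[-train_idx, ]

# Fit a classification tree with the tuning values given in the lab
tree_fit <- rpart(Kyphosis ~ ., data = train, method = "class",
                  control = rpart.control(minsplit = 10, cp = 0.005))

# Predict classes for the test set and cross-tabulate against the truth
pred <- predict(tree_fit, newdata = test, type = "class")
conf <- table(truth = test$Kyphosis, predicted = pred)
conf

# False positives: true "absent" predicted "present", as a share of "absent"
false_pos <- conf["absent", "present"] / sum(conf["absent", ])
# False negatives: true "present" predicted "absent", as a share of "present"
false_neg <- conf["present", "absent"] / sum(conf["present", ])
c(false_pos = false_pos, false_neg = false_neg)
```

For the lab itself, the same steps would apply after reading SPAM-503.csv, with the spam column as the response. Printing tree_fit shows the splits, which helps with the question about which combinations of variables suggest spam.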