The main goal of this week's lab is to practice reading and handling different data formats. On the due date, turn in your Rmd file and the HTML output.
Open your project for this class. Make sure all your work is done relative to this project.
Open the lab5.Rmd file provided with the instructions. You can edit this file and add your answers to the questions in this document.
The financial world operates with many different currencies, and for countries to trade and people to travel, these currencies are converted, bought and sold. The web site https://openexchangerates.org/ provides access to (daily) cross rates of currencies relative to the USD. The data is provided in JSON form. With a free account you can pull 30 days' worth of data, every day. Carson Sievert's web site explains how to do this, and provides R code. To follow the instructions and get your own data, you need to sign up for an instant free account and use the key provided to get the data.
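To get a feel for the JSON format before pulling your own data, here is a minimal sketch that parses one day's worth of rates into a data frame. It assumes the jsonlite and lubridate packages are installed. The JSON string is a made-up stand-in for a real response, the field names (base, rates) and the URL pattern in the comment are assumptions to check against the site's documentation, and the date is hypothetical.

```r
library(jsonlite)
library(lubridate)

# A real request is assumed to look something like
#   https://openexchangerates.org/api/historical/2018-03-01.json?app_id=YOUR_KEY
# Here a made-up response with invented values is parsed instead.
one_day <- '{"base": "USD", "rates": {"AUD": 1.30, "EUR": 0.85, "JPY": 110.0}}'

parsed <- fromJSON(one_day)

# Turn the named list of rates into a long data frame, one row per currency
rates <- data.frame(
  date     = ymd("2018-03-01"),  # hypothetical date of the request
  currency = names(parsed$rates),
  rate     = unlist(parsed$rates, use.names = FALSE),
  stringsAsFactors = FALSE
)

rates
```

Repeating this for each of the 30 days and stacking the results with rbind() would give one data frame with a proper date column.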
lubridate package data handling routines to make R recognise the date column as a date.

This task is to take the example of the Australian electorate data from the lecture notes (Lecture 5 Reading different data formats) and make sure that you can do the work to make an electoral map of Australia, and then extend it with a little more information.
mapshaper package to thin out the boundaries.

In previous labs, we worked with the PISA 2015 data. In this question, you will extract this data for yourself.
Find the data on the OECD PISA web site, http://www.oecd.org/pisa/data/2015database/. Download the SPSS format "Student questionnaire data file (419MB)". (The downloaded file name should be CY6_MS_CMB_STU_QQQ.sav.) It is quite large, so if you have trouble, your tutor has the file on a USB stick, which you can copy.
Read the data into R using this code:
How many students are in the data set?
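Reading an SPSS file and counting its rows can be sketched as follows, assuming the haven package is installed. So that the example runs without the 419MB download, it first writes a tiny made-up data frame to a temporary .sav file. With the real data you would point read_sav() at CY6_MS_CMB_STU_QQQ.sav instead.

```r
library(haven)

# Toy stand-in for the PISA student file: three made-up rows
toy <- data.frame(
  CNT     = c("AUS", "AUS", "NZL"),
  PV1MATH = c(500, 520, 495),
  stringsAsFactors = FALSE
)

# Write it out in SPSS format to a temporary file, then read it back
sav_path <- tempfile(fileext = ".sav")
write_sav(toy, sav_path)
students <- read_sav(sav_path)

# Each row is one student, so the row count is the number of students
n_students <- nrow(students)
n_students
```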
If you continue to work with the data in R, some operations will be slow. That is fine if you just want to focus on one country, but if we want to make calculations and fit models on all the data, your computer will sit and spin a lot. The best approach is to create a small database, and use this to do calculations. This code creates the database, in your project folder:
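One way to build such a database is sketched below, assuming the DBI and RSQLite packages. The data frame is a small made-up stand-in for the full PISA data, and the file name PISA_toy.sqlite and the table name student are hypothetical choices, not names required by the lab.

```r
library(DBI)
library(RSQLite)

# Toy stand-in for the full PISA data frame (values are made up)
pisa_toy <- data.frame(
  CNT     = c("AUS", "AUS", "NZL", "NZL"),
  PV1MATH = c(500, 520, 495, 505),
  stringsAsFactors = FALSE
)

# Hypothetical file name; a temporary directory is used here so the sketch
# leaves no files behind, whereas the lab writes into the project folder
db_path <- file.path(tempdir(), "PISA_toy.sqlite")
con <- dbConnect(SQLite(), dbname = db_path)

# Copy the data frame into a table called "student" (an assumed name)
dbWriteTable(con, "student", pisa_toy, overwrite = TRUE)

# Check that the table exists and holds all the rows
tables <- dbListTables(con)
n_rows <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM student")$n

dbDisconnect(con)
```

Once the file exists, later sessions only need dbConnect() with the same path, so the large .sav file does not have to be read again.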
Let’s test the speed
From the R object:
Using sqlite database:
(I get 27.09944 secs directly in R, and 1.61168 secs using the database.)
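A comparison of this kind can be sketched with system.time(). The example below assumes the DBI and RSQLite packages and uses simulated scores rather than the PISA data, so the timings it reports will not match the ones quoted above. The grouped mean is only one possible operation to time.

```r
library(DBI)
library(RSQLite)

set.seed(1)
# Simulated scores, not PISA data: 200,000 rows over 50 made-up country codes
toy <- data.frame(
  CNT     = sample(sprintf("C%02d", 1:50), 2e5, replace = TRUE),
  PV1MATH = rnorm(2e5, mean = 500, sd = 100),
  stringsAsFactors = FALSE
)

con <- dbConnect(SQLite(), dbname = ":memory:")
dbWriteTable(con, "student", toy)

# Time the grouped mean computed on the R object
t_r <- system.time(
  means_r <- aggregate(PV1MATH ~ CNT, data = toy, FUN = mean)
)

# Time the same summary computed inside the database
t_db <- system.time(
  means_db <- dbGetQuery(
    con,
    "SELECT CNT, AVG(PV1MATH) AS PV1MATH FROM student GROUP BY CNT ORDER BY CNT"
  )
)

dbDisconnect(con)

# Elapsed seconds for each approach; results vary by machine
c(r = t_r[["elapsed"]], db = t_db[["elapsed"]])
```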
Using the code below, how many different plausible scores are generated for each student, in math, reading and science?
Compute the averages across the multiple math scores, and save in an R object. Make a dotplot against country, ordered from top score to lowest. What are the top three countries? What is Australia’s rank?
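The counting, averaging and ranking steps can be sketched in base R on a toy data set. The scores below are made up, only two plausible-value columns are included, and the column naming pattern (PV followed by a number and MATH) is an assumption to check against the real variable names.

```r
# Toy data with made-up scores; the real PISA columns are assumed to follow
# the same PV<number>MATH naming pattern
toy <- data.frame(
  CNT     = c("AUS", "AUS", "SGP", "SGP", "NZL", "NZL"),
  PV1MATH = c(490, 510, 560, 570, 495, 500),
  PV2MATH = c(500, 520, 565, 575, 490, 505),
  stringsAsFactors = FALSE
)

# Find the plausible-value columns for math and count them
pv_cols <- grep("^PV[0-9]+MATH$", names(toy), value = TRUE)
n_pv <- length(pv_cols)

# Average each student's plausible values, then average by country
toy$math <- rowMeans(toy[, pv_cols])
country_means <- aggregate(math ~ CNT, data = toy, FUN = mean)

# Sort from highest to lowest and attach ranks
country_means <- country_means[order(-country_means$math), ]
country_means$rank <- seq_len(nrow(country_means))

# Dot plot with the top country at the top (dotchart draws bottom-up)
dotchart(rev(country_means$math), labels = rev(country_means$CNT),
         xlab = "Mean math score")
```

A country's rank can then be looked up with, for example, country_means$rank[country_means$CNT == "AUS"].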
Database operations typically only operate on a column by column basis, so calculating statistics such as the standard deviation can be a challenge. (Try it, and see what happens if you ask the database to compute the standard deviation of the math scores instead of the mean, using the sd function.) You can do this with a direct SQL query (the ugly code is below). Do it! Then make a plot which shows the mean and a segment indicating one standard deviation below and above the mean, by country, sorted from highest to lowest average.
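As an illustration of the idea, and not necessarily the query intended for the lab, the sketch below asks the database only for counts and averages, then finishes the standard deviation in R. It assumes the DBI and RSQLite packages, uses made-up scores, and the table and column names (student, math) are hypothetical.

```r
library(DBI)
library(RSQLite)

# Made-up scores standing in for the averaged PISA math values
toy <- data.frame(
  CNT  = c("AUS", "AUS", "AUS", "SGP", "SGP", "SGP"),
  math = c(480, 500, 520, 550, 570, 560),
  stringsAsFactors = FALSE
)

con <- dbConnect(SQLite(), dbname = ":memory:")
dbWriteTable(con, "student", toy)

# Ask the database only for standard aggregates: count, mean, mean of squares
stats <- dbGetQuery(con, "
  SELECT CNT,
         COUNT(math)      AS n,
         AVG(math)        AS avg_math,
         AVG(math * math) AS avg_sq
  FROM student
  GROUP BY CNT
")
dbDisconnect(con)

# Sample standard deviation from the aggregates:
# var = (mean of squares - squared mean) * n / (n - 1)
stats$sd <- sqrt((stats$avg_sq - stats$avg_math^2) * stats$n / (stats$n - 1))

# Sort by mean, highest first, then draw mean +/- one sd per country
stats <- stats[order(-stats$avg_math), ]
y <- rev(seq_len(nrow(stats)))
plot(stats$avg_math, y, yaxt = "n", ylab = "", xlab = "Math score", pch = 19,
     xlim = range(stats$avg_math - stats$sd, stats$avg_math + stats$sd))
axis(2, at = y, labels = stats$CNT, las = 1)
segments(stats$avg_math - stats$sd, y, stats$avg_math + stats$sd, y)
```

The mean-of-squares formula is simple to express in SQL, but it can lose precision when the spread is tiny relative to the mean, so it is worth checking a few countries against sd() on the R object.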