2 Section 1 Overview

In the Data Import section, you will learn how import data into R.

After completing this section, you will be able to:

  • Import data from spreadsheets.
  • Identify and set your working directory and specify the path to a file.
  • Use the readr and readxl packages to import spreadsheets.
  • Use R-base functions to import spreadsheets.
  • Download files from the internet using R.

The textbook for this section is available here.

2.1 Importing Spreadsheets

The textbook for this section is available here.

Key points

  • Many datasets are stored in spreadsheets. A spreadsheet is essentially a file version of a data frame with rows and columns.
  • Spreadsheets have rows separated by returns and columns separated by a delimiter. The most common delimiters are comma, semicolon, white space and tab.
  • Many spreadsheets are raw text files and can be read with any basic text editor. However, some formats are proprietary and cannot be read with a text editor, such as Microsoft Excel files (.xls).
  • Most import functions assume that the first row of a spreadsheet file is a header with column names. To know if the file has a header, it helps to look at the file with a text editor before trying to import it.

2.2 Paths and the Working Directory

The textbook for this section is available here.

Key points

  • The working directory is where R looks for files and saves files by default.
  • See your working directory with getwd(). Change your working directory with setwd().
  • We suggest you create a directory for each project and keep your raw data inside that directory.
  • Use the file.path() function to generate a full path from a relative path and a file name. Use file.path() instead of paste() because file.path() is aware of your operating system and will use the correct slashes to navigate your machine.
  • The file.copy() function copies a file to a new path.

Code

# see working directory
getwd()

# change your working directory
setwd()
# set path to the location for raw data files in the dslabs package and list files
path <- system.file("extdata", package="dslabs")
list.files(path)
## [1] "2010_bigfive_regents.xls"                                "carbon_emissions.csv"                                    "fertility-two-countries-example.csv"                    
## [4] "HRlist2.txt"                                             "life-expectancy-and-fertility-two-countries-example.csv" "murders.csv"                                            
## [7] "olive.csv"                                               "RD-Mortality-Report_2015-18-180531.pdf"                  "ssa-death-probability.csv"
# generate a full path to a file
filename <- "murders.csv"
fullpath <- file.path(path, filename)
fullpath
## [1] "/Library/Frameworks/R.framework/Versions/4.0/Resources/library/dslabs/extdata/murders.csv"
# copy file from dslabs package to your working directory
file.copy(fullpath, getwd())
## [1] FALSE
# check if the file exists
file.exists(filename)
## [1] TRUE

2.3 The readr and readxl Packages

The textbook for this section is available here.

Key points

  • readr is the tidyverse library that includes functions for reading data stored in text file spreadsheets into R. Functions in the package include read_csv(), read_tsv(), read_delim() and more. These differ by the delimiter they use to split columns.
  • The readxl package provides functions to read Microsoft Excel formatted files.
  • The excel_sheets() function gives the names of the sheets in the Excel file. These names are passed to the sheet argument for the readxl functions read_excel(), read_xls() and read_xlsx().
  • The read_lines() function shows the first few lines of a file in R.

Code

if(!require(dslabs)) install.packages("dslabs")
## Loading required package: dslabs
if(!require(tidyverse)) install.packages("tidyverse")
## Loading required package: tidyverse
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
if(!require(readxl)) install.packages("readxl")
## Loading required package: readxl
library(dslabs)
library(tidyverse)    # includes readr
library(readxl)

# inspect the first 3 lines
read_lines("murders.csv", n_max = 3)
## [1] "state,abb,region,population,total" "Alabama,AL,South,4779736,135"      "Alaska,AK,West,710231,19"
# read file in CSV format
dat <- read_csv(filename)
## Parsed with column specification:
## cols(
##   state = col_character(),
##   abb = col_character(),
##   region = col_character(),
##   population = col_double(),
##   total = col_double()
## )
#read using full path
dat <- read_csv(fullpath)
## Parsed with column specification:
## cols(
##   state = col_character(),
##   abb = col_character(),
##   region = col_character(),
##   population = col_double(),
##   total = col_double()
## )
head(dat)
## # A tibble: 6 x 5
##   state      abb   region population total
##   <chr>      <chr> <chr>       <dbl> <dbl>
## 1 Alabama    AL    South     4779736   135
## 2 Alaska     AK    West       710231    19
## 3 Arizona    AZ    West      6392017   232
## 4 Arkansas   AR    South     2915918    93
## 5 California CA    West     37253956  1257
## 6 Colorado   CO    West      5029196    65
#Ex:
path <- system.file("extdata", package = "dslabs")
files <- list.files(path)
files
## [1] "2010_bigfive_regents.xls"                                "carbon_emissions.csv"                                    "fertility-two-countries-example.csv"                    
## [4] "HRlist2.txt"                                             "life-expectancy-and-fertility-two-countries-example.csv" "murders.csv"                                            
## [7] "olive.csv"                                               "RD-Mortality-Report_2015-18-180531.pdf"                  "ssa-death-probability.csv"
filename <- "murders.csv"
filename1 <- "life-expectancy-and-fertility-two-countries-example.csv"
filename2 <- "fertility-two-countries-example.csv"
dat=read.csv(file.path(path, filename))
dat1=read.csv(file.path(path, filename1))
dat2=read.csv(file.path(path, filename2))

2.4 Importing Data Using R-base Functions

The textbook for this section is available here.

Key point

  • R-base import functions (read.csv(), read.table(), read.delim()) generate data frames rather than tibbles and character variables are converted to factors. This can be avoided by setting the argument stringsAsFactors=FALSE.

Code

# read.csv converts strings to factors
dat2 <- read.csv(filename)
class(dat2$abb)
## [1] "character"
class(dat2$region)
## [1] "character"

2.5 Downloading Files from the Internet

The textbook for this section is available here.

Key points

  • The read_csv() function and other import functions can read a URL directly.
  • If you want to have a local copy of the file, you can use download.file().
  • tempdir() creates a directory with a name that is very unlikely not to be unique.
  • tempfile() creates a character string that is likely to be a unique filename.

Code

url <- "https://raw.githubusercontent.com/rafalab/dslabs/master/inst/extdata/murders.csv"
dat <- read_csv(url)
## Parsed with column specification:
## cols(
##   state = col_character(),
##   abb = col_character(),
##   region = col_character(),
##   population = col_double(),
##   total = col_double()
## )
download.file(url, "murders.csv")
tempfile()
## [1] "/var/folders/6m/nz2p76pn679b692c99t644bm0000gn/T//RtmpnqMzj7/file3447474aaa04"
tmp_filename <- tempfile()
download.file(url, tmp_filename)
dat <- read_csv(tmp_filename)
## Parsed with column specification:
## cols(
##   state = col_character(),
##   abb = col_character(),
##   region = col_character(),
##   population = col_double(),
##   total = col_double()
## )
file.remove(tmp_filename)
## [1] TRUE

2.6 Assessment Part 1 - Data Import

  1. Which of the following is NOT part of the data wrangling process?
  • A. Importing data into R
  • B. Formatting dates/times
  • C. Checking correlations between your variables
  • D. Tidying data
  1. Which files could be opened in a basic text editor?

Select ALL that apply.

  • A. data.txt
  • B. data.csv
  • C. data.xlsx
  • D. data.tsv
  1. You want to analyze a file containing race finish times for a recent marathon. You open the file in a basic text editor and see lines that look like the following:
initials,state,age,time
vib,MA,61,6:01
adc,TX,45,5:45
kme,CT,50,4:19

What type of file is this?

  • A. A comma-delimited file without a header
  • B. A tab-delimited file with a header
  • C. A white space-delimited file without a header
  • D. A comma-delimited file with a header
  1. Assume the following is the full path to the directory that a student wants to use as their working directory in R: “/Users/student/Documents/projects/”

Which of the following lines of code CANNOT set the working directory to the desired “projects” directory?

  • A. setwd("~/Documents/projects/")
  • B. setwd("/Users/student/Documents/projects/")
  • C. setwd(/Users/student/Documents/projects/)
  • D. dir <- "/Users/student/Documents/projects" setwd(dir)
  1. We want to copy the “murders.csv” file from the dslabs package into an existing folder “data”, which is located in our HarvardX-Wrangling projects folder. We first enter the code below into our RStudio console.
> getwd()  
[1] "C:/Users/UNIVERSITY/Documents/Analyses/HarvardX-Wrangling"  
> filename <- "murders.csv"  
> path <- system.file("extdata", package = "dslabs")  

Which of the following commands would NOT successfully copy “murders.csv” into the folder “data”?

  • A.

file.copy(file.path(path, "murders.csv"), getwd())

  • B.
file.copy(file.path(path, filename), getwd())
  • C.

file.copy(file.path(path, "murders.csv"), file.path(getwd(), "data"))

  • D.
file.location <- file.path(system.file("extdata", package = "dslabs"), "murders.csv")
file.destination <- file.path(getwd(),"data") 
file.copy(file.location, file.destination)
  1. You are not sure whether the murders.csv file has a header row. How could you check this?

Select ALL that apply.

  • A. Open the file in a basic text editor.
  • B. In the RStudio “Files” pane, click on your file, then select “View File”.
  • C. Use the command read_lines (remembering to specify the number of rows with the n_max argument).
  1. What is one difference between read_excel and read_xlsx?
  • A. read_excel() also reads meta-data from the excel file, such as sheet names, while read_xlsx() only reads the first sheet in a file.
  • B. read_excel() reads both .xls and .xlsx files by detecting the file format from its extension, while read_xlsx() only reads .xlsx files.
  • C. read_excel() is part of the readr package, while read_xlsx() is part of the readxl package and has more options.
  • D. read_xlsx() has been replaced by read_excel() in a recent readxl package update.
  1. You have a file called “times.txt” that contains race finish times for a marathon. The first four lines of the file look like this:
initials,state,age,time
vib,MA,61,6:01
adc,TX,45,5:45
kme,CT,50,4:19

Which line of code will NOT produce a tibble with column names “initials”, “state”, “age”, and “time”?

  • A. race_times <- read_csv("times.txt")
  • B. race_times <- read.csv("times.txt")
  • C. race_times <- read_csv("times.txt", col_names = TRUE)
  • D. race_times <- read_delim("times.txt", delim = “,”)
  1. You also have access to marathon finish times in the form of an Excel document named “times.xlsx”. In the Excel document, different sheets contain race information for different years. The first sheet is named “2015”, the second is named “2016”, and the third is named “2017”.

Which line of code will NOT import the data contained in the “2016” tab of this Excel sheet?

  • A. times_2016 <- read_excel("times.xlsx", sheet = 2)
  • B. times_2016 <- read_xlsx("times.xlsx", sheet = “2”)
  • C. times_2016 <- read_excel("times.xlsx", sheet = "2016")
  • D. times_2016 <- read_xlsx("times.xlsx", sheet = 2)
  1. You have a comma-separated values file that contains the initials, home states, ages, and race finish times for marathon runners. The runners’ initials contain three characters for the runners’ first, middle, and last names (for example, “KME”).

You read in the file using the following code.

race_times <- read.csv(“times.csv”)

What is the data type of the initials in the object race_times?

  • A. integers
  • B. characters
  • C. factors
  • D. logical
  1. Which of the following is NOT a real difference between the readr import functions and the base R import functions?
  • A. The import functions in the readr package all start as read_, while the import functions for base R all start with read.
  • B. Base R import functions automatically convert character columns to factors.
  • C. The base R import functions can read .csv files, but cannot files with other delimiters, such as .tsv files, or fixed-width files.
  • D. Base R functions import data as a data frame, while readr functions import data as a tibble.
  1. You read in a file containing runner information and marathon finish times using the following code.
race_times <- read.csv(“times.csv”, stringsAsFactors = F)

What is the class of the object race_times?

  • A. data frame
  • B. tibble
  • C. matrix
  • D. vector
  1. Select the answer choice that summarizes all of the actions that the following lines of code can perform. Please note that the url below is an example and does not lead to data.
url <- "https://raw.githubusercontent.com/MyUserName/MyProject/master/MyData.csv "
dat <- read_csv(url)
download.file(url, "MyData.csv")
  • A. Create a tibble in R called dat that contains the information contained in the csv file stored on Github and save that tibble to the working directory.
  • B. Create a matrix in R called dat that contains the information contained in the csv file stored on Github. Download the csv file to the working directory and name the downloaded file “MyData.csv”.
  • C. Create a tibble in R called dat that contains the information contained in the csv file stored on Github. Download the csv file to the working directory and randomly assign it a temporary name that is very likely to be unique.
  • D. Create a tibble in R called dat that contains the information contained in the csv file stored on Github. Download the csv file to the working directory and name the downloaded file “MyData.csv”.

2.7 Assessment Part 2 - Data Import

  1. Inspect the file at the following URL:

https://raw.githubusercontent.com/rasbt/python-machine-learning-book/master/code/datasets/wdbc/wdbc.data

Which readr function should be used to import this file?

  • A. read_table()
  • B. read_csv()
  • C. read_csv2()
  • D. read_tsv()
  • E. None of the above
  1. Check the documentation for the readr function you chose in the previous question to learn about its arguments. Determine which arguments you need to the file from the previous question:
url <- "https://raw.githubusercontent.com/rasbt/python-machine-learning-book/master/code/datasets/wdbc/wdbc.data"

Does this file have a header row? Does the readr function you chose need any additional arguments to import the data correctly?

  • A. Yes, there is a header. No arguments are needed.
  • B. Yes, there is a header. The header=TRUE argument is necessary.
  • C. Yes, there is a header. The col_names=TRUE argument is necessary.
  • D. No, there is no header. No arguments are needed.
  • E. No, there is no header. The header=FALSE argument is necessary.
  • F. No, there is no header. The col_names=FALSE argument is necessary.
  1. Inspect the imported data from the previous question.

How many rows are in the dataset?

url <- "https://raw.githubusercontent.com/rasbt/python-machine-learning-book/master/code/datasets/wdbc/wdbc.data"
          df <- read_csv(url, col_names = FALSE)
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   X2 = col_character()
## )
## See spec(...) for full column specifications.
          nrow(df)
## [1] 569

How many columns are in the dataset?

ncol(df)
## [1] 32