2 Section 1 Overview
In the Data Import section, you will learn how import data into R.
After completing this section, you will be able to:
- Import data from spreadsheets.
- Identify and set your working directory and specify the path to a file.
- Use the readr and readxl packages to import spreadsheets.
- Use R-base functions to import spreadsheets.
- Download files from the internet using R.
The textbook for this section is available here.
2.1 Importing Spreadsheets
The textbook for this section is available here.
Key points
- Many datasets are stored in spreadsheets. A spreadsheet is essentially a file version of a data frame with rows and columns.
- Spreadsheets have rows separated by returns and columns separated by a delimiter. The most common delimiters are comma, semicolon, white space and tab.
- Many spreadsheets are raw text files and can be read with any basic text editor. However, some formats are proprietary and cannot be read with a text editor, such as Microsoft Excel files (
.xls
). - Most import functions assume that the first row of a spreadsheet file is a header with column names. To know if the file has a header, it helps to look at the file with a text editor before trying to import it.
2.2 Paths and the Working Directory
The textbook for this section is available here.
Key points
- The working directory is where R looks for files and saves files by default.
- See your working directory with
getwd()
. Change your working directory withsetwd()
. - We suggest you create a directory for each project and keep your raw data inside that directory.
- Use the
file.path()
function to generate a full path from a relative path and a file name. Usefile.path()
instead ofpaste()
becausefile.path()
is aware of your operating system and will use the correct slashes to navigate your machine. - The
file.copy()
function copies a file to a new path.
Code
# see working directory
getwd()
# change your working directory
setwd()
# set path to the location for raw data files in the dslabs package and list files
system.file("extdata", package="dslabs")
path <-list.files(path)
## [1] "2010_bigfive_regents.xls" "carbon_emissions.csv" "fertility-two-countries-example.csv"
## [4] "HRlist2.txt" "life-expectancy-and-fertility-two-countries-example.csv" "murders.csv"
## [7] "olive.csv" "RD-Mortality-Report_2015-18-180531.pdf" "ssa-death-probability.csv"
# generate a full path to a file
"murders.csv"
filename <- file.path(path, filename)
fullpath <- fullpath
## [1] "/Library/Frameworks/R.framework/Versions/4.0/Resources/library/dslabs/extdata/murders.csv"
# copy file from dslabs package to your working directory
file.copy(fullpath, getwd())
## [1] FALSE
# check if the file exists
file.exists(filename)
## [1] TRUE
2.3 The readr and readxl Packages
The textbook for this section is available here.
Key points
- readr is the tidyverse library that includes functions for reading data stored in text file spreadsheets into R. Functions in the package include
read_csv()
,read_tsv()
,read_delim()
and more. These differ by the delimiter they use to split columns. - The readxl package provides functions to read Microsoft Excel formatted files.
- The
excel_sheets()
function gives the names of the sheets in the Excel file. These names are passed to the sheet argument for the readxl functionsread_excel()
,read_xls()
andread_xlsx()
. - The
read_lines()
function shows the first few lines of a file in R.
Code
if(!require(dslabs)) install.packages("dslabs")
## Loading required package: dslabs
if(!require(tidyverse)) install.packages("tidyverse")
## Loading required package: tidyverse
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
if(!require(readxl)) install.packages("readxl")
## Loading required package: readxl
library(dslabs)
library(tidyverse) # includes readr
library(readxl)
# inspect the first 3 lines
read_lines("murders.csv", n_max = 3)
## [1] "state,abb,region,population,total" "Alabama,AL,South,4779736,135" "Alaska,AK,West,710231,19"
# read file in CSV format
read_csv(filename) dat <-
## Parsed with column specification:
## cols(
## state = col_character(),
## abb = col_character(),
## region = col_character(),
## population = col_double(),
## total = col_double()
## )
#read using full path
read_csv(fullpath) dat <-
## Parsed with column specification:
## cols(
## state = col_character(),
## abb = col_character(),
## region = col_character(),
## population = col_double(),
## total = col_double()
## )
head(dat)
## # A tibble: 6 x 5
## state abb region population total
## <chr> <chr> <chr> <dbl> <dbl>
## 1 Alabama AL South 4779736 135
## 2 Alaska AK West 710231 19
## 3 Arizona AZ West 6392017 232
## 4 Arkansas AR South 2915918 93
## 5 California CA West 37253956 1257
## 6 Colorado CO West 5029196 65
#Ex:
system.file("extdata", package = "dslabs")
path <- list.files(path)
files <- files
## [1] "2010_bigfive_regents.xls" "carbon_emissions.csv" "fertility-two-countries-example.csv"
## [4] "HRlist2.txt" "life-expectancy-and-fertility-two-countries-example.csv" "murders.csv"
## [7] "olive.csv" "RD-Mortality-Report_2015-18-180531.pdf" "ssa-death-probability.csv"
"murders.csv"
filename <- "life-expectancy-and-fertility-two-countries-example.csv"
filename1 <- "fertility-two-countries-example.csv"
filename2 <-read.csv(file.path(path, filename))
dat=read.csv(file.path(path, filename1))
dat1=read.csv(file.path(path, filename2)) dat2=
2.4 Importing Data Using R-base Functions
The textbook for this section is available here.
Key point
- R-base import functions (
read.csv()
,read.table()
,read.delim()
) generate data frames rather than tibbles and character variables are converted to factors. This can be avoided by setting the argumentstringsAsFactors=FALSE
.
Code
# read.csv converts strings to factors
read.csv(filename)
dat2 <-class(dat2$abb)
## [1] "character"
class(dat2$region)
## [1] "character"
2.5 Downloading Files from the Internet
The textbook for this section is available here.
Key points
- The
read_csv()
function and other import functions can read a URL directly. - If you want to have a local copy of the file, you can use
download.file()
. tempdir()
creates a directory with a name that is very unlikely not to be unique.tempfile()
creates a character string that is likely to be a unique filename.
Code
"https://raw.githubusercontent.com/rafalab/dslabs/master/inst/extdata/murders.csv"
url <- read_csv(url) dat <-
## Parsed with column specification:
## cols(
## state = col_character(),
## abb = col_character(),
## region = col_character(),
## population = col_double(),
## total = col_double()
## )
download.file(url, "murders.csv")
tempfile()
## [1] "/var/folders/6m/nz2p76pn679b692c99t644bm0000gn/T//RtmpnqMzj7/file3447474aaa04"
tempfile()
tmp_filename <-download.file(url, tmp_filename)
read_csv(tmp_filename) dat <-
## Parsed with column specification:
## cols(
## state = col_character(),
## abb = col_character(),
## region = col_character(),
## population = col_double(),
## total = col_double()
## )
file.remove(tmp_filename)
## [1] TRUE
2.6 Assessment Part 1 - Data Import
- Which of the following is NOT part of the data wrangling process?
- A. Importing data into R
- B. Formatting dates/times
- C. Checking correlations between your variables
- D. Tidying data
- Which files could be opened in a basic text editor?
Select ALL that apply.
- A. data.txt
- B. data.csv
- C. data.xlsx
- D. data.tsv
- You want to analyze a file containing race finish times for a recent marathon. You open the file in a basic text editor and see lines that look like the following:
initials,state,age,time61,6:01
vib,MA,45,5:45
adc,TX,50,4:19 kme,CT,
What type of file is this?
- A. A comma-delimited file without a header
- B. A tab-delimited file with a header
- C. A white space-delimited file without a header
- D. A comma-delimited file with a header
- Assume the following is the full path to the directory that a student wants to use as their working directory in R: “/Users/student/Documents/projects/”
Which of the following lines of code CANNOT set the working directory to the desired “projects” directory?
-
A.
setwd("~/Documents/projects/")
-
B.
setwd("/Users/student/Documents/projects/")
-
C.
setwd(/Users/student/Documents/projects/)
-
D.
dir <- "/Users/student/Documents/projects" setwd(dir)
- We want to copy the “murders.csv” file from the dslabs package into an existing folder “data”, which is located in our HarvardX-Wrangling projects folder. We first enter the code below into our RStudio console.
> getwd()
1] "C:/Users/UNIVERSITY/Documents/Analyses/HarvardX-Wrangling"
[> filename <- "murders.csv"
> path <- system.file("extdata", package = "dslabs")
Which of the following commands would NOT successfully copy “murders.csv” into the folder “data”?
- A.
file.copy(file.path(path, "murders.csv"), getwd())
- B.
file.copy(file.path(path, filename), getwd())
- C.
file.copy(file.path(path, "murders.csv"), file.path(getwd(), "data"))
- D.
file.location <- file.path(system.file("extdata", package = "dslabs"), "murders.csv")
file.destination <- file.path(getwd(),"data")
file.copy(file.location, file.destination)
- You are not sure whether the murders.csv file has a header row. How could you check this?
Select ALL that apply.
- A. Open the file in a basic text editor.
- B. In the RStudio “Files” pane, click on your file, then select “View File”.
-
C. Use the command
read_lines
(remembering to specify the number of rows with then_max
argument).
- What is one difference between
read_excel
andread_xlsx
?
-
A.
read_excel()
also reads meta-data from the excel file, such as sheet names, whileread_xlsx()
only reads the first sheet in a file. -
B.
read_excel()
reads both .xls and .xlsx files by detecting the file format from its extension, whileread_xlsx()
only reads .xlsx files. -
C.
read_excel()
is part of the readr package, whileread_xlsx()
is part of the readxl package and has more options. -
D.
read_xlsx()
has been replaced byread_excel()
in a recent readxl package update.
- You have a file called “times.txt” that contains race finish times for a marathon. The first four lines of the file look like this:
initials,state,age,time61,6:01
vib,MA,45,5:45
adc,TX,50,4:19 kme,CT,
Which line of code will NOT produce a tibble with column names “initials”, “state”, “age”, and “time”?
-
A.
race_times <- read_csv("times.txt")
-
B.
race_times <- read.csv("times.txt")
-
C.
race_times <- read_csv("times.txt", col_names = TRUE)
-
D.
race_times <- read_delim("times.txt", delim = “,”)
- You also have access to marathon finish times in the form of an Excel document named “times.xlsx”. In the Excel document, different sheets contain race information for different years. The first sheet is named “2015”, the second is named “2016”, and the third is named “2017”.
Which line of code will NOT import the data contained in the “2016” tab of this Excel sheet?
-
A.
times_2016 <- read_excel("times.xlsx", sheet = 2)
-
B.
times_2016 <- read_xlsx("times.xlsx", sheet = “2”)
-
C.
times_2016 <- read_excel("times.xlsx", sheet = "2016")
-
D.
times_2016 <- read_xlsx("times.xlsx", sheet = 2)
- You have a comma-separated values file that contains the initials, home states, ages, and race finish times for marathon runners. The runners’ initials contain three characters for the runners’ first, middle, and last names (for example, “KME”).
You read in the file using the following code.
read.csv(“times.csv”) race_times <-
What is the data type of the initials in the object race_times
?
- A. integers
- B. characters
- C. factors
- D. logical
- Which of the following is NOT a real difference between the readr import functions and the base R import functions?
-
A. The import functions in the readr package all start as
read_
, while the import functions for base R all start withread
. - B. Base R import functions automatically convert character columns to factors.
- C. The base R import functions can read .csv files, but cannot files with other delimiters, such as .tsv files, or fixed-width files.
- D. Base R functions import data as a data frame, while readr functions import data as a tibble.
- You read in a file containing runner information and marathon finish times using the following code.
read.csv(“times.csv”, stringsAsFactors = F) race_times <-
What is the class of the object race_times
?
- A. data frame
- B. tibble
- C. matrix
- D. vector
- Select the answer choice that summarizes all of the actions that the following lines of code can perform. Please note that the url below is an example and does not lead to data.
"https://raw.githubusercontent.com/MyUserName/MyProject/master/MyData.csv "
url <- read_csv(url)
dat <-download.file(url, "MyData.csv")
-
A. Create a tibble in R called
dat
that contains the information contained in the csv file stored on Github and save that tibble to the working directory. -
B. Create a matrix in R called
dat
that contains the information contained in the csv file stored on Github. Download the csv file to the working directory and name the downloaded file “MyData.csv”. -
C. Create a tibble in R called
dat
that contains the information contained in the csv file stored on Github. Download the csv file to the working directory and randomly assign it a temporary name that is very likely to be unique. -
D. Create a tibble in R called
dat
that contains the information contained in the csv file stored on Github. Download the csv file to the working directory and name the downloaded file “MyData.csv”.
2.7 Assessment Part 2 - Data Import
- Inspect the file at the following URL:
Which readr function should be used to import this file?
-
A.
read_table()
-
B.
read_csv()
-
C.
read_csv2()
-
D.
read_tsv()
- E. None of the above
- Check the documentation for the readr function you chose in the previous question to learn about its arguments. Determine which arguments you need to the file from the previous question:
"https://raw.githubusercontent.com/rasbt/python-machine-learning-book/master/code/datasets/wdbc/wdbc.data" url <-
Does this file have a header row? Does the readr function you chose need any additional arguments to import the data correctly?
- A. Yes, there is a header. No arguments are needed.
-
B. Yes, there is a header. The
header=TRUE
argument is necessary. -
C. Yes, there is a header. The
col_names=TRUE
argument is necessary. - D. No, there is no header. No arguments are needed.
-
E. No, there is no header. The
header=FALSE
argument is necessary. -
F. No, there is no header. The
col_names=FALSE
argument is necessary.
- Inspect the imported data from the previous question.
How many rows are in the dataset?
"https://raw.githubusercontent.com/rasbt/python-machine-learning-book/master/code/datasets/wdbc/wdbc.data"
url <- read_csv(url, col_names = FALSE) df <-
## Parsed with column specification:
## cols(
## .default = col_double(),
## X2 = col_character()
## )
## See spec(...) for full column specifications.
nrow(df)
## [1] 569
How many columns are in the dataset?
ncol(df)
## [1] 32