2 Section 1 Overview
Section 1 introduces you to R Basics, Functions and Data Types.
In Section 1, you will learn to:
- Appreciate the rationale for data analysis using R
- Define objects and perform basic arithmetic and logical operations
- Use pre-defined functions to perform operations on objects
- Distinguish between various data types
The textbook for this section is available here.
2.1 Motivation
Here is a link to the textbook section on the motivation for this course.
2.2 Getting started
Here is a link to the textbook section on Getting Started with R.
Key Points
- R was developed by statisticians and data analysts as an interactive environment for data analysis.
- Some of the advantages of R are that (1) it is free and open source, (2) it has the capability to save scripts, (3) there are numerous resources for learning, and (4) it is easy for developers to share software implementations.
- Expressions are evaluated in the R console when you type the expression into the console and hit Return.
- A great advantage of R over point and click analysis software is that you can save your work as scripts.
- “Base R” is what you get after you first install R. Additional components are available via packages.
# installing the dslabs package
if(!require(dslabs)) install.packages("dslabs")
## Loading required package: dslabs
# loading the dslabs package into the R session
library(dslabs)
2.3 Installing R and RStudio
2.3.1 Installing R
To install R to work on your own computer, you can download it freely from the Comprehensive R Archive Network (CRAN). Note that CRAN makes several versions of R available: versions for multiple operating systems and releases older than the current one. Read the CRAN instructions to make sure you download the correct version. If you need further help, you can read the walkthrough in this chapter of the textbook.
2.3.2 Installing RStudio
RStudio is an integrated development environment (IDE). We highly recommend installing and using RStudio to edit and test your code. You can install RStudio through the RStudio website. Their cheatsheet is a great resource. You must install R before installing RStudio.
2.3.3 Textbook Link
Here is a link to the textbook section on Installing R and RStudio.
2.4 R Basics - Objects
Here is a link to the textbook section on objects in R.
Key Points
- To define a variable, we may use the assignment symbol "<-".
- There are two ways to see the value stored in a variable: (1) type the variable name into the console and hit Return, or (2) type print(variable_name), with the name unquoted, and hit Return.
- Objects are things stored in R. They can be variables, functions, etc.
- The ls() function shows the names of the objects saved in your workspace.
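The key points above can be tried out in a short sketch. The variable name side_length is a hypothetical one chosen for illustration; it does not come from the textbook.

```r
# Assign a value to a (hypothetical) variable named side_length
side_length <- 3

# Typing the name or calling print() on it shows the stored value
side_length
print(side_length)

# Quoting the name prints the character string, not the stored value
print("side_length")

# ls() returns the names of the objects in the current workspace
"side_length" %in% ls()
```

Note that print("side_length") displays the text itself, which is why the variable name is passed unquoted when you want its value.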
Solving the equation \(x^2+x-1=0\)
# assigning values to variables
a <- 1
b <- 1
c <- -1
# solving the quadratic equation
(-b + sqrt(b^2 - 4*a*c)) / (2*a)
## [1] 0.618034
(-b - sqrt(b^2 - 4*a*c)) / (2*a)
## [1] -1.618034
2.5 R Basics - Functions
Here is a link to the textbook section on functions.
Key points
- In general, to evaluate a function we need to use parentheses. If we type a function without parentheses, R shows us the code for the function. Most functions also require an argument, that is, something to be written inside the parentheses.
- To access help files, we may use the help function help(“function name”), or write the question mark followed by the function name.
- The help file shows you the arguments the function is expecting, some of which are required and some are optional. If an argument is optional, a default value is assigned with the equal sign. The args() function also shows the arguments a function needs.
- To specify arguments, we use the equals sign. If no argument name is used, R assumes you’re entering arguments in the order shown in the help file.
- Creating and saving a script makes code much easier to execute.
- To make your code more readable, use intuitive variable names and include comments (using the “#” symbol) to remind yourself why you wrote a particular line of code.
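As a minimal sketch of how argument matching works, the example below uses the base R function log, whose arguments are x and base. The result variable names are hypothetical and chosen only for illustration.

```r
# args() shows the argument names and any default values
args(log)

# Without argument names, R matches by position: x = 8, base = 2
by_position <- log(8, 2)

# With argument names, the order no longer matters
by_name   <- log(x = 8, base = 2)
reordered <- log(base = 2, x = 8)

# Omitting the optional base argument uses its default, exp(1),
# which gives the natural logarithm
natural <- log(8)

by_position
by_name
reordered
natural
```

All three calls that specify base 2 return the same value, while the last call differs because the default base is used.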
2.6 Assessment - R Basics
- What is the sum of the first n positive integers? We can use the formula \(n(n+1)/2\) to quickly compute this quantity.
# Here is how you compute the sum for the first 20 integers
20*(20+1)/2
## [1] 210
# However, we can define a variable to use the formula for other values of n
n <- 20
n*(n+1)/2
## [1] 210
n <- 25
n*(n+1)/2
## [1] 325
# Below, write code to calculate the sum of the first 100 integers
n <- 100
n*(n+1)/2
## [1] 5050
- What is the sum of the first 1000 positive integers? We can use the formula \(n(n+1)/2\) to quickly compute this quantity.
# Below, write code to calculate the sum of the first 1000 integers
n <- 1000
n*(n+1)/2
## [1] 500500
- Run the following code in the R console.
n <- 1000
x <- seq(1, n)
sum(x)
## [1] 500500
Based on the result, what do you think the functions seq and sum do?
- A. sum creates a list of numbers and seq adds them up.
- B. seq creates a list of numbers and sum adds them up.
- C. seq computes the difference between two arguments and sum computes the sum of 1 through 1000.
- D. sum always returns the same number.
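One way to explore this question is to inspect the intermediate object and compare the result with the formula \(n(n+1)/2\) used earlier. This is only a sketch; the names first_values and matches_formula are hypothetical.

```r
n <- 1000
x <- seq(1, n)

# Look at the first few entries and the number of entries in x
first_values <- head(x)
first_values
length(x)

# Compare the sum of the entries with the closed-form formula
matches_formula <- sum(x) == n * (n + 1) / 2
matches_formula
```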
- In math and programming we say we evaluate a function when we replace arguments with specific values. So if we type log2(16) we evaluate the log2 function to get the log base 2 of 16, which is 4.
In R it is often useful to evaluate a function inside another function. For example, sqrt(log2(16)) will calculate the log to the base 2 of 16 and then compute the square root of that value. So the first evaluation gives a 4 and this gets evaluated by sqrt to give the final answer of 2.
# log to the base 2
log2(16)
## [1] 4
# sqrt of the log to the base 2 of 16:
sqrt(log2(16))
## [1] 2
# Compute log to the base 10 (log10) of the sqrt of 100. Do not use variables.
log10(sqrt(100))
## [1] 1
- Which of the following will always return the numeric value stored in x? You can try out examples and use the help system in the R console.
- A. log(10^x)
- B. log10(x^10)
- C. log(exp(x))
- D. exp(log(x, base = 2))
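A quick way to try out examples is to evaluate all four expressions for one test value and compare each with x. The sketch below assumes an arbitrary value x = 2.5; a single value can rule options out but cannot by itself prove that an expression always returns x.

```r
# An arbitrary test value (assumption made for illustration)
x <- 2.5

# Evaluate each candidate expression
candidates <- c(
  A = log(10^x),
  B = log10(x^10),
  C = log(exp(x)),
  D = exp(log(x, base = 2))
)
candidates

# Compare with x using a small tolerance for floating-point error
matches <- abs(candidates - x) < 1e-8
matches
```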
2.7 Data Types
You can find the section of the textbook on data types here.
Key Points
- The function “class” helps us determine the type of an object.
- Data frames can be thought of as tables with rows representing observations and columns representing different variables.
- To access data from columns of a data frame, we use the dollar sign symbol, which is called the accessor.
- A vector is an object consisting of several entries and can be a numeric vector, a character vector, or a logical vector.
- We use quotes to distinguish between variable names and character strings.
- Factors are useful for storing categorical data, and are more memory efficient than storing characters.
Code
# loading the murders dataset
data(murders)
# determining that the murders dataset is of the "data frame" class
class(murders)
## [1] "data.frame"
# finding out more about the structure of the object
str(murders)
## 'data.frame': 51 obs. of 5 variables:
## $ state : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ abb : chr "AL" "AK" "AZ" "AR" ...
## $ region : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
## $ population: num 4779736 710231 6392017 2915918 37253956 ...
## $ total : num 135 19 232 93 1257 ...
# showing the first 6 lines of the dataset
head(murders)
## state abb region population total
## 1 Alabama AL South 4779736 135
## 2 Alaska AK West 710231 19
## 3 Arizona AZ West 6392017 232
## 4 Arkansas AR South 2915918 93
## 5 California CA West 37253956 1257
## 6 Colorado CO West 5029196 65
# using the accessor operator to obtain the population column
murders$population
## [1] 4779736 710231 6392017 2915918 37253956 5029196 3574097 897934 601723 19687653 9920000 1360301 1567582 12830632 6483802 3046355 2853118 4339367 4533372
## [20] 1328361 5773552 6547629 9883640 5303925 2967297 5988927 989415 1826341 2700551 1316470 8791894 2059179 19378102 9535483 672591 11536504 3751351 3831074
## [39] 12702379 1052567 4625364 814180 6346105 25145561 2763885 625741 8001024 6724540 1852994 5686986 563626
# displaying the variable names in the murders dataset
names(murders)
## [1] "state" "abb" "region" "population" "total"
# determining how many entries are in a vector
pop <- murders$population
length(pop)
## [1] 51
# vectors can be of class numeric and character
class(pop)
## [1] "numeric"
class(murders$state)
## [1] "character"
# logical vectors are either TRUE or FALSE
z <- 3 == 2
z
## [1] FALSE
class(z)
## [1] "logical"
# factors are another type of class
class(murders$region)
## [1] "factor"
# obtaining the levels of a factor
levels(murders$region)
## [1] "Northeast" "South" "North Central" "West"
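To see how a factor differs from a character vector, the sketch below builds a small factor from a hypothetical vector of region names (not the murders data) and inspects how it is stored: a factor keeps integer codes plus a table of levels.

```r
# A small character vector made up for illustration
regions <- c("South", "West", "West", "South", "Northeast")
class(regions)

# Converting to a factor; levels are sorted alphabetically by default
region_factor <- factor(regions)
levels(region_factor)

# Each entry is stored as an integer code pointing at a level
region_codes <- as.integer(region_factor)
region_codes

# The class is "factor", but the underlying storage type is integer
class(region_factor)
typeof(region_factor)
```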
2.8 Assessment - Data Types
- We’re going to be using the following dataset for this module. Run this code in the console.
library(dslabs)
data(murders)
Next, use the function str to examine the structure of the murders object. We can see that this object is a data frame with 51 rows and five columns.
str(murders)
## 'data.frame': 51 obs. of 5 variables:
## $ state : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ abb : chr "AL" "AK" "AZ" "AR" ...
## $ region : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
## $ population: num 4779736 710231 6392017 2915918 37253956 ...
## $ total : num 135 19 232 93 1257 ...
Which of the following best describes the variables represented in this data frame?
- A. The 51 states.
- B. The murder rates for all 50 states and DC.
- C. The state name, the abbreviation of the state name, the state’s region, and the state’s population and total number of murders for 2010.
- D. str shows no relevant information.
- In the previous question, we saw the different variables that are a part of this dataset from the output of the str() function. The function names() is specifically designed to extract the column names from a data frame.
# Load package and data
library(dslabs)
data(murders)
# Use the function names to extract the variable names
names(murders)
## [1] "state" "abb" "region" "population" "total"
- In this module we have learned that every variable has a class. For example, the class can be character, numeric or logical. The function class() can be used to determine the class of an object.
Here we are going to determine the class of one of the variables in the murders data frame. To extract variables from a data frame we use $, referred to as the accessor.
# To access the population variable from the murders dataset use this code:
p <- murders$population
# To determine the class of object `p` we use this code:
class(p)
## [1] "numeric"
# Use the accessor to extract state abbreviations and assign it to a
a <- murders$abb
# Determine the class of a
class(a)
## [1] "character"
- An important lesson you should learn early on is that there are multiple ways to do things in R. For example, to generate the first five integers we note that 1:5 and seq(1,5) return the same result.
There are also multiple ways to access variables in a data frame. For example we can use the double square brackets [[ instead of the accessor $.
If you instead try to access a column with just one bracket,
murders["population"]
R returns a subset of the original data frame containing just this column. This new object will be of class data.frame rather than a vector. To access the column itself you need to use either the $ accessor or the double square brackets [[.
Parentheses, in contrast, are mainly used alongside functions to indicate what argument the function should be doing something to. For example, when we did class(p) in the last question, we wanted the function class to do something related to the argument p.
This is an example of how R can be a bit idiosyncratic sometimes. It is very common to find it confusing at first.
# We extract the population like this:
p <- murders$population
# This is how we do the same with the square brackets:
o <- murders[["population"]]
# We can confirm these two are the same
identical(o, p)
## [1] TRUE
# Use square brackets to extract `abb` from `murders` and assign it to b
b <- murders[["abb"]]
# Check if `a` and `b` are identical
identical(a, b)
## [1] TRUE
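The difference between single and double brackets described in this question can be checked on a small data frame. The sketch below uses a hypothetical three-row data frame named toy, built from the first rows shown earlier, so that it runs without loading dslabs.

```r
# A small data frame made up for illustration
toy <- data.frame(
  abb = c("AL", "AK", "AZ"),
  population = c(4779736, 710231, 6392017)
)

# Single brackets return a data frame with one column
single <- toy["population"]
class(single)

# Double brackets and the $ accessor return the column itself, a vector
double <- toy[["population"]]
class(double)
identical(double, toy$population)
```

The same comparison can be run on murders once the dslabs package is loaded.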
- Using the str() command, we saw that the region column stores a factor. You can corroborate this by using the class command on the region column.
The function levels shows us the categories for the factor.
# We can see the class of the region variable using class
class(murders$region)
## [1] "factor"
# Determine the number of regions included in this variable
length(levels(murders$region))
## [1] 4
- The function table takes a vector as input and returns the frequency of each unique element in the vector.
# Here is an example of what the table function does
x <- c("a", "a", "b", "b", "b", "c")
table(x)
## x
## a b c
## 2 3 1
# Write one line of code to show the number of states per region
table(murders$region)
##
## Northeast South North Central West
## 9 17 12 13
2.9 Section 1 Assessment
- To find the solutions to an equation of the form \(ax^2+bx+c=0\), use the quadratic equation: \(x=\frac{-b\pm\sqrt{b^2-4ac}}{2a}\).
What are the two solutions to \(2x^2-x-4=0\)? Use the quadratic equation. (Report the greater of the two solutions first, using 3 significant digits for both solutions.)
options(digits = 3)
a <- 2
b <- -1
c <- -4
(-b+sqrt(b^2-4*a*c))/(2*a)
## [1] 1.69
(-b-sqrt(b^2-4*a*c))/(2*a)
## [1] -1.19
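If the same calculation is needed for several equations, it can be wrapped in a function. The helper below, solve_quadratic, is a hypothetical name introduced here for illustration and is not part of the course material; it assumes real roots (a non-negative discriminant) and a nonzero a.

```r
# Hypothetical helper: returns both roots of a*x^2 + b*x + constant = 0,
# larger root first when a > 0. Assumes the discriminant is non-negative.
solve_quadratic <- function(a, b, constant) {
  discriminant <- b^2 - 4 * a * constant
  (-b + c(1, -1) * sqrt(discriminant)) / (2 * a)
}

# The equation from this question, rounded to 3 significant digits
roots <- signif(solve_quadratic(2, -1, -4), 3)
roots

# The equation solved earlier in the section: x^2 + x - 1 = 0
earlier_roots <- solve_quadratic(1, 1, -1)
earlier_roots
```

The third argument is named constant rather than c so that it does not share a name with the base R function c() used inside the helper.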
- Use R to compute log base 4 of 1024. You can use the help function to learn how to use arguments to change the base of the log function.
log(1024, base = 4)
## [1] 5
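As a cross-check, the change-of-base identity \(\log_4 1024 = \log 1024 / \log 4\) should give the same answer as the base argument. The variable names below are hypothetical.

```r
# Using the base argument of log
with_base_argument <- log(1024, base = 4)

# Using the change-of-base identity with natural logs
with_natural_logs <- log(1024) / log(4)

# Using the change-of-base identity with log2
with_log2 <- log2(1024) / log2(4)

# Compare with a tolerance, since floating-point results may differ slightly
all.equal(with_base_argument, with_natural_logs)
all.equal(with_base_argument, with_log2)
```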
- Load the movielens dataset.
data(movielens)
str(movielens)
## 'data.frame': 100004 obs. of 7 variables:
## $ movieId : int 31 1029 1061 1129 1172 1263 1287 1293 1339 1343 ...
## $ title : chr "Dangerous Minds" "Dumbo" "Sleepers" "Escape from New York" ...
## $ year : int 1995 1941 1996 1981 1989 1978 1959 1982 1992 1991 ...
## $ genres : Factor w/ 901 levels "(no genres listed)",..: 762 510 899 120 762 836 81 762 844 899 ...
## $ userId : int 1 1 1 1 1 1 1 1 1 1 ...
## $ rating : num 2.5 3 3 2 4 2 2 2 3.5 2 ...
## $ timestamp: int 1260759144 1260759179 1260759182 1260759185 1260759205 1260759151 1260759187 1260759148 1260759125 1260759131 ...
How many rows are in the dataset? 100004
How many different variables are in the dataset? 7
What is the variable type of title?
- A. It is a text (txt) variable
- B. It is a chronological (chr) variable
- C. It is a string (str) variable
- D. It is a numeric (num) variable
- E. It is an integer (int) variable
- F. It is a factor (Factor) variable
- G. It is a character (chr) variable
What is the variable type of genres?
- A. It is a text (txt) variable
- B. It is a chronological (chr) variable
- C. It is a string (str) variable
- D. It is a numeric (num) variable
- E. It is an integer (int) variable
- F. It is a factor (Factor) variable
- G. It is a character (chr) variable
- We already know we can use the levels() function to determine the levels of a factor. A different function, nlevels(), may be used to determine the number of levels of a factor.
Use this function to determine how many levels are in the factor genres in the movielens data frame.
nlevels(movielens$genres)
## [1] 901
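The relationship between levels(), nlevels() and table() can be seen on a small factor. The sketch below uses a hypothetical factor named genre_toy rather than the movielens data, so it runs without dslabs.

```r
# A small factor made up for illustration
genre_toy <- factor(c("Drama", "Comedy", "Drama", "Action"))

# nlevels() gives the same number as counting the levels directly
nlevels(genre_toy)
length(levels(genre_toy))

# table() shows how many entries fall in each level
genre_counts <- table(genre_toy)
genre_counts
```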