2 Section 1 Overview

Section 1 introduces you to R Basics, Functions and Datatypes.

In Section 1, you will learn to:

Appreciate the rationale for data analysis using R
Define objects and perform basic arithmetic and logical operations
Use pre-defined functions to perform operations on objects
Distinguish between various data types

The textbook for this section is available here.

2.1 Motivation

Here is a link to the textbook section on the motivation for this course.

2.2 Getting started

Here is a link to the textbook section on Getting Started with R.

Key Points

R was developed by statisticians and data analysts as an interactive environment for data analysis.
Some of the advantages of R are that (1) it is free and open source, (2) it has the capability to save scripts, (3) there are numerous resources for learning, and (4) it is easy for developers to share software implementations.
Expressions are evaluated in the R console when you type the expression into the console and hit Return.
A great advantage of R over point and click analysis software is that you can save your work as scripts.
“Base R” is what you get after you first install R. Additional components are available via packages.

# installing the dslabs package
if(!require(dslabs)) install.packages("dslabs")

## Loading required package: dslabs

# loading the dslabs package into the R session
library(dslabs)

2.3 Installing R and RStudio

2.3.1 Installing R

To install R on your own computer, you can download it for free from the Comprehensive R Archive Network (CRAN). Note that CRAN makes several versions of R available: versions for multiple operating systems and releases older than the current one. Read the CRAN instructions to ensure you download the correct version. If you need further help, you can read the walkthrough in this chapter of the textbook.

2.3.2 Installing RStudio

RStudio is an integrated development environment (IDE). We highly recommend installing and using RStudio to edit and test your code. You can install RStudio through the RStudio website. Their cheatsheet is a great resource. You must install R before installing RStudio.

2.3.3 Textbook Link

Here is a link to the textbook section on Installing R and RStudio.

2.4 R Basics - Objects

Here is a link to the textbook section on objects in R.

Key Points

To define a variable, we may use the assignment symbol "<-".
There are two ways to see the value stored in a variable: (1) type the variable name into the console and hit Return, or (2) type print(variable_name), with the name unquoted, and hit Return.
Objects are the things stored in R. They can be variables, functions, etc.
The ls() function shows the names of the objects saved in your workspace.
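
As a minimal sketch of the points above, the following defines a variable, displays it in both ways, and lists the workspace. The variable name side_length is made up for illustration.

```r
# define a variable with the assignment operator (hypothetical name)
side_length <- 3

# display the stored value: type the name, or pass the name to print()
side_length
print(side_length)

# quoting the name prints the character string, not the stored value
print("side_length")

# list the names of the objects currently saved in the workspace
ls()
```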

Solving the equation $x^2+x-1=0$

# assigning values to variables
a <- 1
b <- 1
c <- -1

# solving the quadratic equation
(-b + sqrt(b^2 - 4*a*c) ) / ( 2*a )

## [1] 0.618034

(-b - sqrt(b^2 - 4*a*c) ) / ( 2*a )

## [1] -1.618034

2.5 R Basics - Functions

Here is a link to the textbook section on functions.

Key points

In general, to evaluate a function we need to use parentheses. If we type a function name without parentheses, R shows us the code for the function. Most functions also require an argument, that is, something to be written inside the parentheses.
To access help files, we may use the help function, for example help("log"), or type a question mark followed by the function name, as in ?log.
The help file shows you the arguments the function is expecting, some of which are required and some are optional. If an argument is optional, a default value is assigned with the equal sign. The args() function also shows the arguments a function needs.
To specify arguments, we use the equals sign. If no argument name is used, R assumes you’re entering arguments in the order shown in the help file.
Creating and saving a script makes code much easier to execute.
To make your code more readable, use intuitive variable names and include comments (using the “#” symbol) to remind yourself why you wrote a particular line of code.
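
The points about arguments can be tried out with the base R function log, which takes an argument x and an optional argument base. This is only a sketch; the value 8 and the base 2 are arbitrary choices for illustration.

```r
# typing a function name without parentheses shows its definition
log

# args() lists the arguments; base is optional because it has a default
args(log)

# named arguments can be supplied in any order
log(x = 8, base = 2)
log(base = 2, x = 8)

# without names, R matches arguments by position: x first, then base
log(8, 2)
```

All three calls to log compute the same quantity, the log base 2 of 8.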

2.6 Assessment - R Basics

What is the sum of the first n positive integers? We can use the formula $n(n+1)/2$ to quickly compute this quantity.

# Here is how you compute the sum for the first 20 integers
20*(20+1)/2

## [1] 210

# However, we can define a variable to use the formula for other values of n
n <- 20
n*(n+1)/2

## [1] 210

n <- 25
n*(n+1)/2

## [1] 325

# Below, write code to calculate the sum of the first 100 integers
n<-100
n*(n+1)/2

## [1] 5050

What is the sum of the first 1000 positive integers? We can use the formula $n(n+1)/2$ to quickly compute this quantity.

# Below, write code to calculate the sum of the first 1000 integers 
n<-1000
n*(n+1)/2

## [1] 500500

Run the following code in the R console.

n <- 1000
x <- seq(1, n)
sum(x)

## [1] 500500

Based on the result, what do you think the functions seq and sum do?

A. sum creates a list of numbers and seq adds them up.
B. seq creates a list of numbers and sum adds them up.
C. seq computes the difference between two arguments and sum computes the sum of 1 through 1000.
D. sum always returns the same number.

In math and programming we say we evaluate a function when we replace arguments with specific values. So if we type log2(16) we evaluate the log2 function to get the log base 2 of 16 which is 4.

In R it is often useful to evaluate a function inside another function. For example, sqrt(log2(16)) will calculate the log to the base 2 of 16 and then compute the square root of that value. So the first evaluation gives a 4 and this gets evaluated by sqrt to give the final answer of 2.

# log to the base 2 
log2(16)

## [1] 4

# sqrt of the log to the base 2 of 16:
sqrt(log2(16))

## [1] 2

# Compute log to the base 10 (log10) of the sqrt of 100. Do not use variables.
log10(sqrt(100))

## [1] 1

Which of the following will always return the numeric value stored in x? You can try out examples and use the help system in the R console.

A. log(10^x)
B. log10(x^10)
C. log(exp(x))
D. exp(log(x, base = 2))
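
One way to try out examples, as the question suggests, is to evaluate each option at a sample value and compare the result with x. The value 3 below is an arbitrary choice; a single test value cannot prove that an expression always returns x, but it can rule options out.

```r
# a sample value chosen only for illustration
x <- 3

log(10^x)              # option A
log10(x^10)            # option B
log(exp(x))            # option C
exp(log(x, base = 2))  # option D

# compare a result with x, allowing for floating-point rounding
isTRUE(all.equal(log(exp(x)), x))
```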

2.7 Data Types

You can find the section of the textbook on data types here.

Key Points

The function “class” helps us determine the type of an object.
Data frames can be thought of as tables with rows representing observations and columns representing different variables.
To access data from columns of a data frame, we use the dollar sign symbol, which is called the accessor.
A vector is an object consisting of several entries and can be a numeric vector, a character vector, or a logical vector.
We use quotes to distinguish between variable names and character strings.
Factors are useful for storing categorical data, and are more memory efficient than storing characters.

Code

# loading the murders dataset
data(murders)

# determining that the murders dataset is of the "data frame" class
class(murders)

## [1] "data.frame"

# finding out more about the structure of the object
str(murders)

## 'data.frame':    51 obs. of  5 variables:
##  $ state     : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ abb       : chr  "AL" "AK" "AZ" "AR" ...
##  $ region    : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
##  $ population: num  4779736 710231 6392017 2915918 37253956 ...
##  $ total     : num  135 19 232 93 1257 ...

# showing the first 6 lines of the dataset
head(murders)

##        state abb region population total
## 1    Alabama  AL  South    4779736   135
## 2     Alaska  AK   West     710231    19
## 3    Arizona  AZ   West    6392017   232
## 4   Arkansas  AR  South    2915918    93
## 5 California  CA   West   37253956  1257
## 6   Colorado  CO   West    5029196    65

# using the accessor operator to obtain the population column
murders$population

##  [1]  4779736   710231  6392017  2915918 37253956  5029196  3574097   897934   601723 19687653  9920000  1360301  1567582 12830632  6483802  3046355  2853118  4339367  4533372
## [20]  1328361  5773552  6547629  9883640  5303925  2967297  5988927   989415  1826341  2700551  1316470  8791894  2059179 19378102  9535483   672591 11536504  3751351  3831074
## [39] 12702379  1052567  4625364   814180  6346105 25145561  2763885   625741  8001024  6724540  1852994  5686986   563626

# displaying the variable names in the murders dataset
names(murders)

## [1] "state"      "abb"        "region"     "population" "total"

# determining how many entries are in a vector
pop <- murders$population
length(pop)

## [1] 51

# vectors can be of class numeric and character
class(pop)

## [1] "numeric"

class(murders$state)

## [1] "character"

# logical vectors are either TRUE or FALSE
z <- 3 == 2
z

## [1] FALSE

class(z)

## [1] "logical"

# factors are another type of class
class(murders$region)

## [1] "factor"

# obtaining the levels of a factor
levels(murders$region)

## [1] "Northeast"     "South"         "North Central" "West"
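
The key point that factors are more memory efficient than characters can be checked with the base R function object.size. The sketch below uses made-up data: a long vector that repeats the four region names. The exact sizes depend on the R build, so none are shown here.

```r
# a long character vector with only four distinct values (made-up data)
region_chr <- rep(c("Northeast", "South", "North Central", "West"),
                  times = 25000)

# the same values as a factor: integer codes plus a short set of levels
region_fct <- factor(region_chr)

# compare the memory used by each representation
object.size(region_chr)
object.size(region_fct)

# the categories are kept once, as the levels of the factor
levels(region_fct)
```

Note that factor() sorts the levels alphabetically by default, so their order here differs from the order in murders$region.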

2.8 Assessment - Data Types

We’re going to be using the following dataset for this module. Run this code in the console.

library(dslabs)  
data(murders)

Next, use the function str to examine the structure of the murders object. We can see that this object is a data frame with 51 rows and five columns.

str(murders)

## 'data.frame':    51 obs. of  5 variables:
##  $ state     : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ abb       : chr  "AL" "AK" "AZ" "AR" ...
##  $ region    : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
##  $ population: num  4779736 710231 6392017 2915918 37253956 ...
##  $ total     : num  135 19 232 93 1257 ...

Which of the following best describes the variables represented in this data frame?

A. The 51 states.
B. The murder rates for all 50 states and DC.
C. The state name, the abbreviation of the state name, the state’s region, and the state’s population and total number of murders for 2010.
D. str shows no relevant information.

In the previous question, we saw the different variables that are a part of this dataset from the output of the str() function. The function names() is specifically designed to extract the column names from a data frame.

# Load package and data
library(dslabs)
data(murders)

# Use the function names to extract the variable names 
names(murders)

## [1] "state"      "abb"        "region"     "population" "total"

In this module we have learned that every variable has a class. For example, the class can be a character, numeric or logical. The function class() can be used to determine the class of an object.

Here we are going to determine the class of one of the variables in the murders data frame. To extract variables from a data frame we use $, referred to as the accessor.

# To access the population variable from the murders dataset use this code:
p <- murders$population 

# To determine the class of object `p` we use this code:
class(p)

## [1] "numeric"

# Use the accessor to extract the state abbreviations and assign them to `a`
a <- murders$abb

# Determine the class of a
class(a)

## [1] "character"

An important lesson you should learn early on is that there are multiple ways to do things in R. For example, to generate the first five integers we note that 1:5 and seq(1,5) return the same result.
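
A quick check of this claim, using hypothetical variable names:

```r
# two ways to generate the first five integers
first_five_colon <- 1:5
first_five_seq <- seq(1, 5)

first_five_colon
first_five_seq

# confirm that the two objects are exactly the same
identical(first_five_colon, first_five_seq)
```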

There are also multiple ways to access variables in a data frame. For example, we can use double square brackets [[ ]] instead of the accessor $.

If you instead try to access a column with just one bracket,

murders["population"]

R returns a subset of the original data frame containing just this column. This new object will be of class data.frame rather than a vector. To access the column itself you need to use either the $ accessor or the double square brackets [[.
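
The difference can be seen without loading any package. The sketch below builds a three-row toy data frame (the name toy is made up; the values are copied from the first rows of murders shown earlier) and compares the class of each result.

```r
# a small stand-in for the murders data frame (hypothetical object)
toy <- data.frame(state = c("Alabama", "Alaska", "Arizona"),
                  population = c(4779736, 710231, 6392017))

# single brackets return a data frame with one column
class(toy["population"])

# double brackets and the $ accessor return the column itself
class(toy[["population"]])
class(toy$population)

# the last two approaches give the same vector
identical(toy[["population"]], toy$population)
```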

Parentheses, in contrast, are mainly used with functions to enclose the arguments the function operates on. For example, when we ran class(p) in the last question, we asked the function class to operate on the argument p.

This is an example of how R can be a bit idiosyncratic sometimes. It is very common to find it confusing at first.

# We extract the population like this:
p <- murders$population

# This is how we do the same with the square brackets:
o <- murders[["population"]]

# We can confirm these two are the same
identical(o, p)

## [1] TRUE

# Use square brackets to extract `abb` from `murders` and assign it to b
b <- murders[["abb"]]

# Check if `a` and `b` are identical 
identical(a, b)

## [1] TRUE

Using the str() command, we saw that the region column stores a factor. You can corroborate this by using the class command on the region column.

The function levels shows us the categories for the factor.

# We can see the class of the region variable using class
class(murders$region)

## [1] "factor"

# Determine the number of regions included in this variable 
length(levels(murders$region))

## [1] 4

The function table takes a vector as input and returns the frequency of each unique element in the vector.

# Here is an example of what the table function does
x <- c("a", "a", "b", "b", "b", "c")
table(x)

## x
## a b c 
## 2 3 1

# Write one line of code to show the number of states per region
table(murders$region)

## 
##     Northeast         South North Central          West 
##             9            17            12            13

2.9 Section 1 Assessment

To find the solutions to an equation of the form $ax^2+bx+c=0$, use the quadratic formula: $x=\frac{-b\pm\sqrt{b^2-4ac}}{2a}$.

What are the two solutions to $2x^2-x-4=0$? Use the quadratic formula. (Report the greater of the two solutions first, using 3 significant digits for both solutions.)

options(digits = 3)
a <- 2
b <- -1
c <- -4
(-b+sqrt(b^2-4*a*c))/(2*a)

## [1] 1.69

(-b-sqrt(b^2-4*a*c))/(2*a)

## [1] -1.19
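
As an optional sanity check, the two solutions can be substituted back into $2x^2-x-4$ to confirm that the result is zero up to rounding error. This is a sketch with hypothetical variable names; the tolerance 1e-9 is an arbitrary choice.

```r
# coefficients of 2x^2 - x - 4 = 0 (hypothetical names)
coef_a <- 2
coef_b <- -1
coef_c <- -4

# both roots at once: c(1, -1) supplies the plus and the minus sign
discriminant <- coef_b^2 - 4 * coef_a * coef_c
roots <- (-coef_b + c(1, -1) * sqrt(discriminant)) / (2 * coef_a)
roots

# substitute each root back into the left-hand side of the equation
remainders <- coef_a * roots^2 + coef_b * roots + coef_c
remainders

# floating-point arithmetic may leave a tiny nonzero remainder
all(abs(remainders) < 1e-9)
```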

Use R to compute log base 4 of 1024. You can use the help function to learn how to use arguments to change the base of the log function.

log(1024, base = 4)

## [1] 5

Load the movielens dataset and examine its structure.

data(movielens)
str(movielens)

## 'data.frame':    100004 obs. of  7 variables:
##  $ movieId  : int  31 1029 1061 1129 1172 1263 1287 1293 1339 1343 ...
##  $ title    : chr  "Dangerous Minds" "Dumbo" "Sleepers" "Escape from New York" ...
##  $ year     : int  1995 1941 1996 1981 1989 1978 1959 1982 1992 1991 ...
##  $ genres   : Factor w/ 901 levels "(no genres listed)",..: 762 510 899 120 762 836 81 762 844 899 ...
##  $ userId   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ rating   : num  2.5 3 3 2 4 2 2 2 3.5 2 ...
##  $ timestamp: int  1260759144 1260759179 1260759182 1260759185 1260759205 1260759151 1260759187 1260759148 1260759125 1260759131 ...

How many rows are in the dataset? 100004

How many different variables are in the dataset? 7

What is the variable type of title?

A. It is a text (txt) variable
B. It is a chronological (chr) variable
C. It is a string (str) variable
D. It is a numeric (num) variable
E. It is an integer (int) variable
F. It is a factor (Factor) variable
G. It is a character (chr) variable

What is the variable type of genres?

A. It is a text (txt) variable
B. It is a chronological (chr) variable
C. It is a string (str) variable
D. It is a numeric (num) variable
E. It is an integer (int) variable
F. It is a factor (Factor) variable
G. It is a character (chr) variable

We already know we can use the levels() function to determine the levels of a factor. A different function, nlevels(), may be used to determine the number of levels of a factor.

Use this function to determine how many levels are in the factor genres in the movielens data frame.

nlevels(movielens$genres)

## [1] 901