3 Section 2 Overview

In Section 2.1, you will:

Create numeric and character vectors.
Name the elements of a vector.
Generate numeric sequences.
Access specific elements or parts of a vector.
Coerce data into different data types as needed.

In Section 2.2, you will:

Sort vectors in ascending and descending order.
Extract the indices of the sorted elements from the original vector.
Find the maximum and minimum elements, as well as their indices, in a vector.
Rank the elements of a vector in increasing order.

In Section 2.3, you will:

Perform arithmetic between a vector and a single number.
Perform arithmetic between two vectors of the same length.

3.1 Vectors

The textbook for this section is available here.

Key Points

The function c(), which stands for concatenate, is useful for creating vectors.
Another useful function for creating vectors is the seq() function, which generates sequences.
Subsetting with square brackets lets us access specific elements of a vector.

Code

# We may create vectors of class numeric or character with the concatenate function
codes <- c(380, 124, 818)
country <- c("italy", "canada", "egypt")

# We can also name the elements of a numeric vector
# Note that the two lines of code below have the same result
codes <- c(italy = 380, canada = 124, egypt = 818)
codes <- c("italy" = 380, "canada" = 124, "egypt" = 818)

# We can also name the elements of a numeric vector using the names() function
codes <- c(380, 124, 818)
country <- c("italy","canada","egypt")
names(codes) <- country

# Using square brackets is useful for subsetting to access specific elements of a vector
codes[2]

## canada 
##    124

codes[c(1,3)]

## italy egypt 
##   380   818

codes[1:2]

##  italy canada 
##    380    124

# If the entries of a vector are named, they may be accessed by referring to their name
codes["canada"]

## canada 
##    124

codes[c("egypt","italy")]

## egypt italy 
##   818   380

3.2 Vectors - Vector Coercion

The textbook for this section is available here.

Key Points

In general, coercion is an attempt by R to be flexible with data types by guessing what was meant when an entry does not match the expected type. For example, when defining x as

x <- c(1, "canada", 3)

R coerces the data into characters. It guesses that, because you put a character string in the vector, you meant the 1 and 3 to be the character strings "1" and "3".

The function as.character() turns numbers into characters.
The function as.numeric() turns characters into numbers.
In R, missing data is assigned the value NA.
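
The key points above can be tried out with a short sketch. The object names y and z are chosen only for this illustration, and suppressWarnings() is used solely to keep the output quiet; without it, R prints a warning when a string cannot be converted.

```r
# Mixing numbers and a string in c() coerces every entry to character
x <- c(1, "canada", 3)
class(x)

# as.character() turns numbers into character strings
y <- as.character(1:3)
y

# as.numeric() turns those strings back into numbers
as.numeric(y)

# A string that cannot be read as a number becomes NA
# (normally with a warning, silenced here for the illustration)
z <- suppressWarnings(as.numeric(c("1", "b", "3")))
z
is.na(z)
```

Here is.na(z) marks the second entry, the one that could not be converted.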

3.3 Assessment - Vectors

A vector is a series of values, all of the same type. Vectors are the most basic data structure in R and can hold numeric data, character data, or logical data. In R, you can create a vector with the concatenate (or combine) function c().

You place the vector elements, separated by commas, between the parentheses. For example, a numeric vector would look something like this:

cost <- c(50, 75, 90, 100, 150)

# Here is an example creating a numeric vector named cost
cost <- c(50, 75, 90, 100, 150)

# Create a numeric vector named temp to store the temperatures listed in the instructions
# Make sure to follow the same order in the instructions
temp <- c("Beijing"=35, "Lagos"=88, "Paris"=42, "Rio de Janeiro"=84, "San Juan"=81, "Toronto"=30)
cost

## [1]  50  75  90 100 150

temp

##        Beijing          Lagos          Paris Rio de Janeiro       San Juan        Toronto 
##             35             88             42             84             81             30

class(temp)

## [1] "numeric"

As in the previous question, we are going to create a vector. Only this time, we learn to create character vectors. The main difference is that these have to be written as strings and so the names are enclosed within double quotes.

A character vector would look something like this:

food <- c("pizza", "burgers", "salads", "cheese", "pasta")

# here is an example of how to create a character vector
food <- c("pizza", "burgers", "salads", "cheese", "pasta")

# Create a character vector called city to store the city names
# Make sure to follow the same order as in the instructions
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")

We have successfully assigned the temperatures as numeric values to temp and the city names as character values to city. But can we associate each temperature with its city? Yes! We can do so using a function we already know, names(), which assigns names to the numeric values.

It would look like this:

cost <- c(50, 75, 90, 100, 150)
food <- c("pizza", "burgers", "salads", "cheese", "pasta")
names(cost) <- food

# Associate the cost values with their corresponding food items
cost <- c(50, 75, 90, 100, 150)
food <- c("pizza", "burgers", "salads", "cheese", "pasta")
names(cost) <- food

# You already wrote this code
temp <- c(35, 88, 42, 84, 81, 30)
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")

# Associate the temperature values with their corresponding cities
names(temp) <- city
temp

##        Beijing          Lagos          Paris Rio de Janeiro       San Juan        Toronto 
##             35             88             42             84             81             30

If we want to display only selected values from the object, R can help us do that easily.

For example, if we want to see the cost of the last 3 items in our food list, we would type:

cost[3:5]

Note that we could also type cost[c(3,4,5)] and get the same result. The : operator helps us condense the code and get consecutive values.

# cost of the last 3 items in our food list:
cost[3:5]

## salads cheese  pasta 
##     90    100    150

# temperatures of the first three cities in the list:
temp[1:3]

## Beijing   Lagos   Paris 
##      35      88      42

In the previous question, we accessed the temperatures for consecutive cities (the first three). But what if we want to access the temperatures for any two specific cities?

An example: To access the cost of pizza (1st) and pasta (5th food item) in our list, the code would be:

cost[c(1,5)]

# Access the cost of pizza and pasta from our food list 
cost[c(1,5)]

## pizza pasta 
##    50   150

# Define temp
temp <- c(35, 88, 42, 84, 81, 30)
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")
names(temp) <- city

# Access the temperatures of Paris and San Juan
temp[c(3,5)]

##    Paris San Juan 
##       42       81

The : operator helps us create sequences of numbers. For example, 32:99 creates a vector of the consecutive integers from 32 to 99.

Then, if we want to know the length of this sequence, all we need to do is use the length() function.

# Create a vector m of integers that starts at 32 and ends at 99.
m <- 32:99

# Determine the length of object m.
length(m)

## [1] 68

# Create a vector x of integers that starts at 12 and ends at 73.
x <- 12:73

# Determine the length of object x.
length(x)

## [1] 62

We can also create different types of sequences in R. For example, in seq(7, 49, 7), the first argument defines the start and the second defines the end. The default is to go up in increments of 1, but a third argument lets us set the size of the increment.

# Create a vector with the multiples of 7, smaller than 50.
seq(7, 49, 7)

## [1]  7 14 21 28 35 42 49

# Create a vector containing all the positive odd numbers smaller than 100.
# The numbers should be in ascending order
seq(1,99,2)

##  [1]  1  3  5  7  9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95 97 99

The second argument of the function seq is actually a maximum, not necessarily the end.

So if we type

seq(7, 50, 7)

we get the same vector as if we type

seq(7, 49, 7)

This can be useful because sometimes all we want are sequential numbers that are smaller than some value. Let’s look at an example.

# We can create a vector with the multiples of 7, smaller than 50 like this 
seq(7, 49, 7)

## [1]  7 14 21 28 35 42 49

# But note that the second argument does not need to be the last number
# It simply determines the maximum value permitted
# so the following line of code produces the same vector as seq(7, 49, 7)
seq(7, 50, 7)

## [1]  7 14 21 28 35 42 49

# Create a sequence of numbers from 6 to 55, with 4/7 increments and determine its length
length(seq(6,55,4/7))

## [1] 86

The seq() function has another useful argument, length.out. It lets us generate a sequence of a prespecified length whose values increase by a constant amount.

For example, this line of code

x <- seq(0, 100, length.out = 5)

produces the numbers 0, 25, 50, 75, 100.

Let’s create a vector and check the class of the object produced.

# Store the sequence in the object a
a <- seq(1, 10, length.out = 100)

# Determine the class of a
class(a)

## [1] "numeric"

We have discussed the numeric class. We just saw that the seq function can generate objects of this class.

For another example, type

class(seq(1, 10, 0.5))

into the console and note that the class is numeric. R has another type of vector we have not described, the integer class. You can create an integer by adding the letter L after a whole number. If you type

class(3L)

in the console, you see that this is an integer and not a numeric. For most practical purposes, integers and numerics are indistinguishable. For example, 3 the integer minus 3 the numeric is 0. To see this, type the following in the console:

3L - 3

The main difference is that integers occupy less space in computer memory, so for big computations, using integers can have a substantial impact.
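
The points about integers and numerics can be checked with a small sketch. The names d, n, size_int and size_num, and the vector length used, are arbitrary choices for this illustration. object.size() comes from the utils package, which R attaches by default.

```r
# 3L is stored as an integer, 3 as a numeric
class(3L)
class(3)

# Mixed arithmetic works; the result is numeric and equals 0
d <- 3L - 3
d
class(d)

# Compare the memory used by equally long integer and numeric vectors
# (n is an arbitrary length chosen for this illustration)
n <- 100000
size_int <- object.size(integer(n))
size_num <- object.size(numeric(n))
size_int < size_num
```

The last line returns TRUE, since the integer vector needs fewer bytes than the numeric vector of the same length.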

# Store the sequence in the object a
a <- seq(1,10)

# Determine the class of a
class(a)

## [1] "integer"

Let’s confirm that 1L is an integer not a numeric.

# Check the class of 1
class(1)

## [1] "numeric"

# Confirm the class of 1L is integer
class(1L)

## [1] "integer"

The concept of coercion is a very important one. Watching the video, we learned that when an entry does not match what an R function is expecting, R tries to guess what we meant before throwing an error. This might get confusing at times.

As we’ve discussed in earlier questions, there are numeric and character vectors. The character vectors are placed in quotes and the numerics are not.

We can avoid issues with coercion in R by explicitly changing characters to numerics and vice versa. This is known as typecasting. The function as.numeric(x) converts character strings to numbers. There is an equivalent function, as.character(x), that converts its argument to a string.

Let’s practice doing this!

# Define the vector x
x <- c(1, 3, 5,"a")

# Note that the x is character vector
x

## [1] "1" "3" "5" "a"

# Typecast the vector to get a numeric vector
# You will get a warning but that is ok
x <- as.numeric(x)

## Warning: NAs introduced by coercion

# Print the converted vector
x

## [1]  1  3  5 NA

3.4 Sorting

The textbook for this section is available here.

Key Points

The function sort() sorts a vector in increasing order.
The function order() produces the indices needed to obtain the sorted vector, e.g. a result of 2 3 1 5 4 means the sorted vector will be produced by listing the 2nd, 3rd, 1st, 5th, and then 4th item of the original vector.
The function rank() gives us the ranks of the items in the original vector.
The function max() returns the largest value while which.max() returns the index of the largest value. The functions min() and which.min() work similarly for minimum values.
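
A small sketch of these functions on a toy vector may help. The vector below is chosen for illustration so that order() gives the 2 3 1 5 4 result described above.

```r
x <- c(31, 4, 15, 92, 65)

# Sort in increasing order, then in decreasing order
sort(x)                      # 4 15 31 65 92
sort(x, decreasing = TRUE)   # 92 65 31 15 4

# Indices that put x in increasing order
order(x)                     # 2 3 1 5 4
x[order(x)]                  # same as sort(x)

# Rank of each entry within the original vector
rank(x)                      # 3 1 2 5 4

# Largest and smallest values, and where they are
max(x)                       # 92
which.max(x)                 # 4
min(x)                       # 4
which.min(x)                 # 2
```

Note the difference between order() and rank(): order() says which entry to pick first, second, and so on, while rank() says where each entry would land in the sorted vector.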

3.5 Assessment - Sorting

When looking at a dataset, we may want to sort the data in an order that makes more sense for analysis. Let’s learn to do this using the murders dataset as an example.

# Load the murders dataset from the dslabs package
library(dslabs)
data(murders)

# Access the `state` variable and store it in an object
states <- murders$state

# Sort the object alphabetically and redefine the object 
states <- sort(states) 

# Report the first alphabetical value  
states[1]

## [1] "Alabama"

# Access population values from the dataset and store it in pop
pop <- murders$population

# Sort the object and save it in the same object
pop <- sort(pop)

# Report the smallest population size 
pop[1]

## [1] 563626

The function order() returns the index vector needed to sort the vector. This implies that sort(x) and x[order(x)] give the same result.

This can be useful for finding row numbers with certain properties such as “the row for the state with the smallest population”. Remember that when we extract a variable from a data frame the order of the resulting vector is the same as the order of the rows of the data frame. So for example, the entries of the vector murders$state are ordered in the same way as the states if you go down the rows of murders.
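
The idea can be sketched on made-up data before applying it to murders. The vectors region and pop_toy below are hypothetical and exist only for this illustration.

```r
# Hypothetical toy data, not taken from the murders dataset
region  <- c("north", "south", "east", "west")
pop_toy <- c(520, 110, 340, 275)

# sort(x) and x[order(x)] give the same vector
identical(sort(pop_toy), pop_toy[order(pop_toy)])

# The same index vector reorders a companion vector,
# so each name follows its value
ord <- order(pop_toy)
region[ord]      # "south" "west" "east" "north"

# The first index points at the entry with the smallest value
ord[1]           # 2
region[ord[1]]   # "south"
```

Because both vectors share the same ordering, an index computed from one can be used to look up the matching entry in the other.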

# Access population from the dataset and store it in pop
pop <- murders$population

# Use the command order to find the vector of indexes that order pop and store in object ord
ord <- order(pop)

# Find the index number of the entry with the smallest population size
ord[1]

## [1] 51

We can perform the same operation as in the previous exercise using the function which.min, which returns the index of the minimum value.

# Find the index of the smallest value for variable total 
which.min(murders$total)

## [1] 46

# Find the index of the smallest value for population
which.min(murders$population)

## [1] 51

Now we know how small the smallest state is and we know which row represents it. However, which state is it?

# Define the variable i to be the index of the smallest state
i <- which.min(murders$population)

# Define variable states to hold the states
states <- murders$state

# Use the index you just defined to find the state with the smallest population
states[i]

## [1] "Wyoming"

You can create a data frame using the data.frame function.

Here is a quick example:

temp <- c(35, 88, 42, 84, 81, 30)
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")
city_temps <- data.frame(name = city, temperature = temp)

# Store temperatures in an object 
temp <- c(35, 88, 42, 84, 81, 30)

# Store city names in an object 
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")

# Create data frame with city names and temperature 
city_temps <- data.frame(name = city, temperature = temp)

# Define a variable states to be the state names 
states <- murders$state

# Define a variable ranks to determine the population size ranks 
ranks <- rank(murders$population)

# Create a data frame my_df with the state name and its rank
my_df <- data.frame(name = states, rank = ranks)

This exercise is somewhat more challenging. We are going to repeat the previous exercise, but this time order my_df so that the states are ordered from least populous to most.

# Define a variable states to be the state names from the murders data frame
states <- murders$state

# Define a variable ranks to determine the population size ranks 
ranks <- rank(murders$population)

# Define a variable ind to store the indexes needed to order the population values
ind <- order(murders$population)

# Create a data frame my_df with the state name and its rank and ordered from least populous to most 
my_df <- data.frame(states = states[ind], ranks = ranks[ind])

The na_example dataset represents a series of counts. It is included in the dslabs package.

You can quickly examine the object using

library(dslabs)
data(na_example)
str(na_example)

However, when we compute the average we obtain an NA. You can see this by typing

mean(na_example)

# Using new dataset 
library(dslabs)
data(na_example)

# Checking the structure 
str(na_example)

##  int [1:1000] 2 1 3 2 1 3 1 4 3 2 ...

# Find out the mean of the entire dataset 
mean(na_example)

## [1] NA

# Use is.na to create a logical index ind that tells which entries are NA
ind <- is.na(na_example)

# Count the NAs: summing a logical vector counts its TRUE entries
sum(ind)

## [1] 145

We previously computed the average of na_example using mean(na_example) and obtained NA. This is because the function mean returns NA if it encounters at least one NA. A common operation is therefore to remove the entries that are NA and then perform operations on the rest.

# Note what we can do with the ! operator
x <- c(1, 2, 3)
ind <- c(FALSE, TRUE, FALSE)
x[!ind]

## [1] 1 3

# Create the ind vector
library(dslabs)
data(na_example)
ind <- is.na(na_example)

# We saw that this gives an NA
mean(na_example)

## [1] NA

# Compute the average, for entries of na_example that are not NA
mean(na_example[!ind])

## [1] 2.3
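
As a side note, mean() also accepts an na.rm argument that drops missing values before averaging. The sketch below uses a hypothetical toy vector, counts, to show that this gives the same result as the logical-index approach used above.

```r
# Hypothetical toy vector with two missing entries
counts <- c(2, NA, 3, 1, NA, 4)

# A single NA makes the plain mean NA
mean(counts)

# Remove the NAs with a logical index, as in the exercise
mean(counts[!is.na(counts)])   # 2.5

# Or ask mean() to drop them itself
mean(counts, na.rm = TRUE)     # 2.5
```

Either form is fine; the logical index is more general because the same ind vector can be reused to subset other objects.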

3.6 Vector arithmetic

The textbook for this section is available here.

Key Points

In R, arithmetic operations on vectors occur element-wise.

Code

# The name of the state with the maximum population is found by doing the following
murders$state[which.max(murders$population)]

## [1] "California"

# how to obtain the murder rate
murder_rate <- murders$total / murders$population * 100000

# ordering the states by murder rate, in decreasing order
murders$state[order(murder_rate, decreasing=TRUE)]

##  [1] "District of Columbia" "Louisiana"            "Missouri"             "Maryland"             "South Carolina"       "Delaware"             "Michigan"            
##  [8] "Mississippi"          "Georgia"              "Arizona"              "Pennsylvania"         "Tennessee"            "Florida"              "California"          
## [15] "New Mexico"           "Texas"                "Arkansas"             "Virginia"             "Nevada"               "North Carolina"       "Oklahoma"            
## [22] "Illinois"             "Alabama"              "New Jersey"           "Connecticut"          "Ohio"                 "Alaska"               "Kentucky"            
## [29] "New York"             "Kansas"               "Indiana"              "Massachusetts"        "Nebraska"             "Wisconsin"            "Rhode Island"        
## [36] "West Virginia"        "Washington"           "Colorado"             "Montana"              "Minnesota"            "South Dakota"         "Oregon"              
## [43] "Wyoming"              "Maine"                "Utah"                 "Idaho"                "Iowa"                 "North Dakota"         "Hawaii"              
## [50] "New Hampshire"        "Vermont"
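
Element-wise arithmetic is easiest to see on small vectors. The sketch below uses hypothetical toy vectors (inches, a and b) chosen only for this illustration.

```r
# Hypothetical toy vectors for illustration
inches <- c(69, 62, 66, 70)

# Arithmetic between a vector and a single number:
# the operation is applied to every element
inches * 2.54    # heights converted to centimeters
inches - 69      # 0 -7 -3 1

# Arithmetic between two vectors of the same length:
# entries are combined position by position
a <- c(1, 2, 3)
b <- c(10, 20, 30)
a + b            # 11 22 33
b / a            # 10 10 10
```

The murder rate computation above works the same way: each state's total is divided by that state's population.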

3.7 Assessment - Vector Arithmetic

Previously we created this data frame.

temp <- c(35, 88, 42, 84, 81, 30)
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")
city_temps <- data.frame(name = city, temperature = temp)

# Assign city names to `city` 
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")

# Store temperature values in `temp`
temp <- c(35, 88, 42, 84, 81, 30)

# Convert temperature into Celsius and overwrite the original values of 'temp' with these Celsius values
temp <- 5/9 * (temp - 32)

# Create a data frame `city_temps` 
city_temps <- data.frame(name = city, temperature = temp)

We can use some of what we have learned to perform calculations that would otherwise be quite complicated. For example, the sum 1 + 1/2^2 + 1/3^2 + ... + 1/100^2 takes just two lines of code.

# Define an object `x` with the numbers 1 through 100
x <- seq(1,100)

# Compute the sum 
sum((1/x)^2)

## [1] 1.63

Compute the per 100,000 murder rate for each state and store it in the object murder_rate. Then compute the average murder rate for the US using the function mean. What is the average?

# Store the per 100,000 murder rate for each state in murder_rate
murder_rate <- murders$total / murders$population * 100000

# Calculate the average murder rate in the US 
mean(murder_rate)

## [1] 2.78

3.8 Section 2 Assessment

Consider the vector x <- c(2, 43, 27, 96, 18).

Match the following outputs to the function which produces that output. Options include sort(x), order(x), rank(x) and none of these.

x <- c(2, 43, 27, 96, 18)
sort(x)

## [1]  2 18 27 43 96

order(x)

## [1] 1 5 3 2 4

rank(x)

## [1] 1 4 3 5 2

1, 2, 3, 4, 5: none of these

1, 5, 3, 2, 4: order(x)

1, 4, 3, 5, 2: rank(x)

2, 18, 27, 43, 96: sort(x)

Continue working with the vector x <- c(2, 43, 27, 96, 18).

x <- c(2, 43, 27, 96, 18)
min(x)

## [1] 2

which.min(x)

## [1] 1

max(x)

## [1] 96

which.max(x)

## [1] 4

min(x): 2

which.min(x): 1

max(x): none of these

which.max(x): 4

Mandi, Amy, Nicole, and Olivia all ran different distances in different time intervals. Their distances (in miles) and times (in minutes) are as follows:

name <- c("Mandi", "Amy", "Nicole", "Olivia")
distance <- c(0.8, 3.1, 2.8, 4.0)
time <- c(10, 30, 40, 50)

Write a line of code to convert time to hours. Remember there are 60 minutes in an hour. Then write a line of code to calculate the speed of each runner in miles per hour. Speed is distance divided by time.

How many hours did Olivia run?

hours <- time/60
hours[4]

## [1] 0.833

What was Mandi’s speed in miles per hour?

speed <- distance/hours
speed[1]

## [1] 4.8

Which runner had the fastest speed?

name[which.max(speed)]

## [1] "Amy"