3 Section 2 Overview
In Section 2.1, you will:
- Create numeric and character vectors.
- Name the elements of a vector.
- Generate numeric sequences.
- Access specific elements or parts of a vector.
- Coerce data into different data types as needed.
In Section 2.2, you will:
- Sort vectors in ascending and descending order.
- Extract the indices of the sorted elements from the original vector.
- Find the maximum and minimum elements, as well as their indices, in a vector.
- Rank the elements of a vector in increasing order.
In Section 2.3, you will:
- Perform arithmetic between a vector and a single number.
- Perform arithmetic between two vectors of the same length.
3.1 Vectors
The textbook for this section is available here.
Key Points
- The function c(), which stands for concatenate, is useful for creating vectors.
- Another useful function for creating vectors is the seq() function, which generates sequences.
- Subsetting lets us access specific parts of a vector by using square brackets.
Code
# We may create vectors of class numeric or character with the concatenate function
codes <- c(380, 124, 818)
country <- c("italy", "canada", "egypt")
# We can also name the elements of a numeric vector
# Note that the two lines of code below have the same result
codes <- c(italy = 380, canada = 124, egypt = 818)
codes <- c("italy" = 380, "canada" = 124, "egypt" = 818)
# We can also name the elements of a numeric vector using the names() function
codes <- c(380, 124, 818)
country <- c("italy","canada","egypt")
names(codes) <- country
# Using square brackets is useful for subsetting to access specific elements of a vector
codes[2]
## canada
## 124
codes[c(1,3)]
## italy egypt
## 380 818
codes[1:2]
## italy canada
## 380 124
# If the entries of a vector are named, they may be accessed by referring to their name
codes["canada"]
## canada
## 124
codes[c("egypt","italy")]
## egypt italy
## 818 380
3.2 Vectors - Vector Coercion
The textbook for this section is available here.
Key Points
- In general, coercion is an attempt by R to be flexible with data types by guessing what was meant when an entry does not match the expected type. For example, when defining x as
x <- c(1, "canada", 3)
R coerces the data into characters. It guesses that because you put a character string in the vector, you meant the 1 and 3 to actually be the character strings "1" and "3".
- The function as.character() turns numbers into characters.
- The function as.numeric() turns characters into numbers.
- In R, missing data is assigned the value NA.
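The key points above can be tried out in a short sketch. The object names x, y, z and w are chosen here for illustration only, and suppressWarnings() is used just to keep the output quiet; without it R prints a warning that NAs were introduced by coercion.

```r
# Mixing numbers and a string in c() coerces every entry to character
x <- c(1, "canada", 3)
class(x)

# as.character() turns numbers into character strings
y <- as.character(1:3)
y

# as.numeric() turns those strings back into numbers
z <- as.numeric(y)
z

# A string that cannot be read as a number becomes NA
w <- suppressWarnings(as.numeric(c("1", "b", "3")))
is.na(w)
```

The last line returns a logical vector marking which entries are missing.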
3.3 Assessment - Vectors
- A vector is a series of values, all of the same type. Vectors are the most basic data structure in R and can hold numeric data, character data, or logical data. In R, you can create a vector with the concatenate (or combine) function c(). You place the vector elements, separated by commas, between the parentheses. For example, a numeric vector would look something like this:
cost <- c(50, 75, 90, 100, 150)
# Here is an example creating a numeric vector named cost
cost <- c(50, 75, 90, 100, 150)
# Create a numeric vector to store the temperatures listed in the instructions into a vector named temp
# Make sure to follow the same order in the instructions
temp <- c("Beijing"=35, "Lagos"=88, "Paris"=42, "Rio de Janeiro"=84, "San Juan"=81, "Toronto"=30)
cost
## [1] 50 75 90 100 150
temp
## Beijing Lagos Paris Rio de Janeiro San Juan Toronto
## 35 88 42 84 81 30
class(temp)
## [1] "numeric"
- As in the previous question, we are going to create a vector, only this time a character vector. The main difference is that the entries have to be written as strings, so they are enclosed in double quotes.
A character vector would look something like this:
food <- c("pizza", "burgers", "salads", "cheese", "pasta")
# here is an example of how to create a character vector
food <- c("pizza", "burgers", "salads", "cheese", "pasta")
# Create a character vector called city to store the city names
# Make sure to follow the same order as in the instructions
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")
- We have successfully assigned the temperatures as numeric values to temp and the city names as character values to city. But can we associate each temperature with its city? Yes! We can do so using a function we already know: names. We assign names to the numeric values.
It would look like this:
cost <- c(50, 75, 90, 100, 150)
food <- c("pizza", "burgers", "salads", "cheese", "pasta")
names(cost) <- food
# Associate the cost values with its corresponding food item
cost <- c(50, 75, 90, 100, 150)
food <- c("pizza", "burgers", "salads", "cheese", "pasta")
names(cost) <- food
# You already wrote this code
temp <- c(35, 88, 42, 84, 81, 30)
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")
# Associate the temperature values with its corresponding city
names(temp) <- city
temp
## Beijing Lagos Paris Rio de Janeiro San Juan Toronto
## 35 88 42 84 81 30
- If we want to display only selected values from the object, R can help us do that easily.
For example, if we want to see the cost of the last 3 items in our food list, we would type:
cost[3:5]
Note that we could also type cost[c(3,4,5)] and get the same result. The : operator helps us condense the code and get consecutive values.
# cost of the last 3 items in our food list:
cost[3:5]
## salads cheese pasta
## 90 100 150
# temperatures of the first three cities in the list:
temp[1:3]
## Beijing Lagos Paris
## 35 88 42
- In the previous question, we accessed the temperatures of consecutive cities (the first three). But what if we want to access the temperatures of any 2 specific cities?
An example: to access the cost of pizza (1st) and pasta (5th food item) in our list, the code would be:
cost[c(1,5)]
# Access the cost of pizza and pasta from our food list
cost[c(1,5)]
## pizza pasta
## 50 150
# Define temp
temp <- c(35, 88, 42, 84, 81, 30)
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")
names(temp) <- city
# Access the temperatures of Paris and San Juan
temp[c(3,5)]
## Paris San Juan
## 42 81
- The : operator helps us create sequences of numbers. For example, 32:99 would create a vector of integers from 32 to 99.
Then, if we want to know the length of this sequence, all we need to do is use the length function.
# Create a vector m of integers that starts at 32 and ends at 99.
m <- 32:99
# Determine the length of object m.
length(m)
## [1] 68
# Create a vector x of integers that starts at 12 and ends at 73.
x <- 12:73
# Determine the length of object x.
length(x)
## [1] 62
- We can also create different types of sequences in R. For example, in seq(7, 49, 7), the first argument defines the start, and the second the end. The default is to go up in increments of 1, but a third argument lets us specify the increment.
# Create a vector with the multiples of 7, smaller than 50.
seq(7, 49, 7)
## [1] 7 14 21 28 35 42 49
# Create a vector containing all the positive odd numbers smaller than 100.
# The numbers should be in ascending order
seq(1,99,2)
## [1] 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95 97 99
- The second argument of the function seq is actually a maximum, not necessarily the end.
So if we type
seq(7, 50, 7)
we actually get the same vector of numbers as if we type
seq(7, 49, 7)
This can be useful because sometimes all we want are sequential numbers that are smaller than some value. Let’s look at an example.
# We can create a vector with the multiples of 7, smaller than 50 like this
seq(7, 49, 7)
## [1] 7 14 21 28 35 42 49
# But note that the second argument does not need to be the last number
# It simply determines the maximum value permitted
# so the following line of code produces the same vector as seq(7, 49, 7)
seq(7, 50, 7)
## [1] 7 14 21 28 35 42 49
# Create a sequence of numbers from 6 to 55, with 4/7 increments and determine its length
length(seq(6,55,4/7))
## [1] 86
- The seq() function has another useful argument: length.out. This argument lets us generate sequences that increase by the same amount but have a prespecified length.
For example, this line of code
x <- seq(0, 100, length.out = 5)
produces the numbers 0, 25, 50, 75, 100.
Let’s create a vector and see what the class of the resulting object is.
# Store the sequence in the object a
a <- seq(1, 10, length.out = 100)
# Determine the class of a
class(a)
## [1] "numeric"
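The two ways of controlling a sequence described in the last few exercises can be placed side by side. This is only a sketch; the object names by_seven and five_points are made up for illustration.

```r
# Fix the step with `by`: the second argument acts as an upper bound,
# so the sequence stops at 49 even though 50 was given
by_seven <- seq(7, 50, by = 7)
by_seven

# Fix the number of elements with `length.out`: R works out the step
five_points <- seq(0, 100, length.out = 5)
five_points

# The lengths follow from the arguments
length(by_seven)
length(five_points)
```

Naming the third argument (by or length.out) makes it explicit which of the two behaviors is being requested.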
- We have discussed the numeric class. We just saw that the seq function can generate objects of this class.
For another example, type
class(seq(1, 10, 0.5))
into the console and note that the class is numeric. R has another type of vector we have not described, the integer class. You can create an integer by adding the letter L after a whole number. If you type
class(3L)
in the console, you see this is an integer and not a numeric. For most practical purposes, integers and numerics are indistinguishable. For example, 3 the integer minus 3 the numeric is 0. To see this, type this in the console:
3L - 3
The main difference is that integers occupy less space in computer memory, so for big computations using integers can have a substantial impact.
# Store the sequence in the object a
a <- seq(1,10)
# Determine the class of a
class(a)
## [1] "integer"
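The integer and numeric classes can be compared directly in a small sketch. The names int_three and num_three are hypothetical and used only for this illustration.

```r
# The L suffix creates an integer; a bare whole number is numeric
int_three <- 3L
num_three <- 3
class(int_three)
class(num_three)

# Arithmetic and comparison treat the two the same way
int_three - num_three
int_three == num_three

# The : operator produces integers, while a fractional step gives numerics
class(1:10)
class(seq(1, 10, 0.5))
```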
- Let’s confirm that 1L is an integer, not a numeric.
# Check the class of 1
class(1)
## [1] "numeric"
# Confirm the class of 1L is integer
class(1L)
## [1] "integer"
- The concept of coercion is a very important one. Watching the video, we learned that when an entry does not match what an R function is expecting, R tries to guess what we meant before throwing an error. This might get confusing at times.
As we’ve discussed in earlier questions, there are numeric and character vectors. The entries of character vectors are placed in quotes and the numerics are not.
We can avoid issues with coercion in R by changing characters to numerics and vice versa. This is known as typecasting. The function as.numeric(x) converts character strings to numbers. There is an equivalent function that converts its argument to a string, as.character(x).
Let’s practice doing this!
# Define the vector x
x <- c(1, 3, 5,"a")
# Note that x is a character vector
x
## [1] "1" "3" "5" "a"
# Typecast the vector to get a numeric vector
# You will get a warning but that is ok
x <- as.numeric(x)
## Warning: NAs introduced by coercion
x
## [1] 1 3 5 NA
3.4 Sorting
The textbook for this section is available here.
Key Points
- The function sort() sorts a vector in increasing order.
- The function order() produces the indices needed to obtain the sorted vector, e.g. a result of 2 3 1 5 4 means the sorted vector will be produced by listing the 2nd, 3rd, 1st, 5th, and then 4th item of the original vector.
- The function rank() gives us the ranks of the items in the original vector.
- The function max() returns the largest value, while which.max() returns the index of the largest value. The functions min() and which.min() work similarly for minimum values.
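The functions in the key points can be compared on one small vector. The values below are chosen for illustration so that order() returns the 2 3 1 5 4 pattern mentioned above; they are not taken from any dataset in this section.

```r
# A small example vector (illustrative values)
x <- c(31, 4, 15, 92, 65)

sort(x)       # the values in increasing order
order(x)      # indices that would sort x
rank(x)       # rank of each entry within x

# Indexing with order(x) reproduces sort(x)
x[order(x)]

# Largest and smallest values together with their positions
max(x)
which.max(x)
min(x)
which.min(x)
```

Note that order() and rank() answer different questions: order(x)[1] is the position of the smallest entry, while rank(x)[1] is how the first entry ranks among all entries.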
3.5 Assessment - Sorting
- When looking at a dataset, we may want to sort the data in an order that makes more sense for analysis. Let’s learn to do this using the murders dataset as an example.
# Load the murders dataset from the dslabs package
library(dslabs)
data(murders)
# Access the `state` variable and store it in an object
states <- murders$state
# Sort the object alphabetically and redefine the object
states <- sort(states)
# Report the first alphabetical value
states[1]
## [1] "Alabama"
# Access population values from the dataset and store it in pop
pop <- murders$population
# Sort the object and save it in the same object
pop <- sort(pop)
# Report the smallest population size
pop[1]
## [1] 563626
- The function order() returns the index vector needed to sort the vector. This implies that sort(x) and x[order(x)] give the same result.
This can be useful for finding row numbers with certain properties such as “the row for the state with the smallest population”. Remember that when we extract a variable from a data frame the order of the resulting vector is the same as the order of the rows of the data frame. So for example, the entries of the vector murders$state are ordered in the same way as the states if you go down the rows of murders.
# Access population from the dataset and store it in pop
pop <- murders$population
# Use the command order to find the vector of indexes that order pop and store in object ord
ord <- order(pop)
# Find the index number of the entry with the smallest population size
ord[1]
## [1] 51
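The link between order() and the rows of a data frame can also be sketched without the murders data. The towns data frame below, including its names and population values, is entirely made up for illustration.

```r
# Hypothetical data frame, invented for this example
towns <- data.frame(
  name = c("Avon", "Brook", "Cedar", "Dale"),
  population = c(5200, 830, 12000, 2900)
)

# Index vector that sorts the population column
ord <- order(towns$population)
ord

# sort(x) and x[order(x)] give the same result
identical(sort(towns$population), towns$population[ord])

# The first index is the row of the smallest town
towns$name[ord[1]]

# Because columns keep the row order, the same indices reorder whole rows
towns[ord, ]
```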
- We can actually perform the same operation as in the previous exercise using the function which.min. It returns the index of the minimum value.
# Find the index of the smallest value for variable total
which.min(murders$total)
## [1] 46
# Find the index of the smallest value for population
which.min(murders$population)
## [1] 51
- Now we know how small the smallest state is and we know which row represents it. However, which state is it?
# Define the variable i to be the index of the smallest state
i <- which.min(murders$population)
# Define variable states to hold the states
states <- murders$state
# Use the index you just defined to find the state with the smallest population
states[i]
## [1] "Wyoming"
- You can create a data frame using the data.frame function.
Here is a quick example:
temp <- c(35, 88, 42, 84, 81, 30)
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")
city_temps <- data.frame(name = city, temperature = temp)
# Store temperatures in an object
temp <- c(35, 88, 42, 84, 81, 30)
# Store city names in an object
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")
# Create data frame with city names and temperature
city_temps <- data.frame(name = city, temperature = temp)
# Define a variable states to be the state names
states <- murders$state
# Define a variable ranks to determine the population size ranks
ranks <- rank(murders$population)
# Create a data frame my_df with the state name and its rank
my_df <- data.frame(name = states, rank = ranks)
- This exercise is somewhat more challenging. We are going to repeat the previous exercise, but this time order my_df so that the states are ordered from least populous to most.
# Define a variable states to be the state names from the murders data frame
states <- murders$state
# Define a variable ranks to determine the population size ranks
ranks <- rank(murders$population)
# Define a variable ind to store the indexes needed to order the population values
ind <- order(murders$population)
# Create a data frame my_df with the state name and its rank and ordered from least populous to most
my_df <- data.frame(states = states[ind], ranks = ranks[ind])
- The na_example dataset represents a series of counts. It is included in the dslabs package.
You can quickly examine the object using
library(dslabs)
data(na_example)
str(na_example)
However, when we compute the average we obtain an NA. You can see this by typing
mean(na_example)
# Using new dataset
library(dslabs)
data(na_example)
# Checking the structure
str(na_example)
## int [1:1000] 2 1 3 2 1 3 1 4 3 2 ...
# Find out the mean of the entire dataset
mean(na_example)
## [1] NA
# Use is.na to create a logical index ind that tells which entries are NA
ind <- is.na(na_example)
# Determine how many NA ind has using the sum function
sum(ind)
## [1] 145
- We previously computed the average of na_example using mean(na_example) and obtained NA. This is because the function mean returns NA if it encounters at least one NA. A common operation is therefore to remove the entries that are NA and then perform operations on the rest.
# Note what we can do with the ! operator
x <- c(1, 2, 3)
ind <- c(FALSE, TRUE, FALSE)
x[!ind]
## [1] 1 3
# Create the ind vector
library(dslabs)
data(na_example)
ind <- is.na(na_example)
# We saw that this gives an NA
mean(na_example)
## [1] NA
# Compute the average, for entries of na_example that are not NA
mean(na_example[!ind])
## [1] 2.3
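The same steps can be followed on a tiny vector where the arithmetic is easy to check by hand. The vector x below is made up for illustration. The final line uses the na.rm argument of mean(), which is not covered in the exercise above; it is shown only as an alternative that gives the same answer.

```r
# A small vector with two missing entries (illustrative values)
x <- c(2, NA, 4, 6, NA)

# Logical index of the missing entries, and how many there are
ind <- is.na(x)
sum(ind)

# The mean is NA as long as any entry is NA
mean(x)

# Drop the missing entries with !ind, then average the rest
mean(x[!ind])

# Alternative: ask mean() to skip missing values itself
mean(x, na.rm = TRUE)
```

Summing a logical vector works because TRUE counts as 1 and FALSE as 0.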
3.6 Vector arithmetic
The textbook for this section is available here.
Key Points
- In R, arithmetic operations on vectors occur element-wise.
Code
# The name of the state with the maximum population is found by doing the following
murders$state[which.max(murders$population)]
## [1] "California"
# how to obtain the murder rate
murder_rate <- murders$total / murders$population * 100000
# ordering the states by murder rate, in decreasing order
murders$state[order(murder_rate, decreasing=TRUE)]
## [1] "District of Columbia" "Louisiana" "Missouri" "Maryland" "South Carolina" "Delaware" "Michigan"
## [8] "Mississippi" "Georgia" "Arizona" "Pennsylvania" "Tennessee" "Florida" "California"
## [15] "New Mexico" "Texas" "Arkansas" "Virginia" "Nevada" "North Carolina" "Oklahoma"
## [22] "Illinois" "Alabama" "New Jersey" "Connecticut" "Ohio" "Alaska" "Kentucky"
## [29] "New York" "Kansas" "Indiana" "Massachusetts" "Nebraska" "Wisconsin" "Rhode Island"
## [36] "West Virginia" "Washington" "Colorado" "Montana" "Minnesota" "South Dakota" "Oregon"
## [43] "Wyoming" "Maine" "Utah" "Idaho" "Iowa" "North Dakota" "Hawaii"
## [50] "New Hampshire" "Vermont"
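Element-wise arithmetic is easier to see on vectors short enough to check by hand. The vectors total and population below are hypothetical values invented for this sketch, and the rate is computed per 100 rather than per 100,000 to keep the numbers simple.

```r
# Two hypothetical vectors of the same length
total <- c(10, 20, 30)
population <- c(1000, 4000, 2000)

# Vector and a single number: the number is applied to every element
total * 2

# Two vectors: the operation pairs up elements by position
total + population

# Combining both, as in the murder rate calculation
rate <- total / population * 100
rate
```

Here rate[2] is 20 / 4000 * 100, computed from the second element of each vector.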
3.7 Assessment - Vector Arithmetic
- Previously we created this data frame.
temp <- c(35, 88, 42, 84, 81, 30)
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")
city_temps <- data.frame(name = city, temperature = temp)
# Assign city names to `city`
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")
# Store temperature values in `temp`
temp <- c(35, 88, 42, 84, 81, 30)
# Convert temperature into Celsius and overwrite the original values of 'temp' with these Celsius values
temp <- 5/9 * (temp - 32)
# Create a data frame `city_temps`
city_temps <- data.frame(name = city, temperature = temp)
- We can use some of what we have learned to perform calculations that would otherwise be quite complicated. Let’s see an example.
# Define an object `x` with the numbers 1 through 100
x <- seq(1,100)
# Compute the sum
sum((1/x)^2)
## [1] 1.63
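As a side note, the infinite version of this sum, 1 + 1/2^2 + 1/3^2 + ..., is known to converge to pi^2/6. The sketch below compares the 100-term sum with that limit; the names partial_sum and limit are chosen here for illustration.

```r
# Sum of 1/n^2 for n = 1, ..., 100, computed element-wise
x <- seq(1, 100)
partial_sum <- sum(1 / x^2)
partial_sum

# Limit of the infinite sum
limit <- pi^2 / 6
limit

# The 100-term sum falls short of the limit by less than 0.01
limit - partial_sum
```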
- Compute the per 100,000 murder rate for each state and store it in the object murder_rate. Then compute the average murder rate for the US using the function mean. What is the average?
# Store the per 100,000 murder rate for each state in murder_rate
murder_rate <- murders$total / murders$population * 100000
# Calculate the average murder rate in the US
mean(murder_rate)
## [1] 2.78
3.8 Section 2 Assessment
- Consider the vector x <- c(2, 43, 27, 96, 18).
Match the following outputs to the function which produces that output. Options include sort(x), order(x), rank(x) and none of these.
x <- c(2, 43, 27, 96, 18)
sort(x)
## [1] 2 18 27 43 96
order(x)
## [1] 1 5 3 2 4
rank(x)
## [1] 1 4 3 5 2
1, 2, 3, 4, 5: none of these
1, 5, 3, 2, 4: order(x)
1, 4, 3, 5, 2: rank(x)
2, 18, 27, 43, 96: sort(x)
- Continue working with the vector x <- c(2, 43, 27, 96, 18).
x <- c(2, 43, 27, 96, 18)
min(x)
## [1] 2
which.min(x)
## [1] 1
max(x)
## [1] 96
which.max(x)
## [1] 4
min(x): 2
which.min(x): 1
max(x): none of these
which.max(x): 4
- Mandi, Amy, Nicole, and Olivia all ran different distances in different time intervals. Their distances (in miles) and times (in minutes) are as follows:
name <- c("Mandi", "Amy", "Nicole", "Olivia")
distance <- c(0.8, 3.1, 2.8, 4.0)
time <- c(10, 30, 40, 50)
Write a line of code to convert time to hours. Remember there are 60 minutes in an hour. Then write a line of code to calculate the speed of each runner in miles per hour. Speed is distance divided by time.
How many hours did Olivia run?
hours <- time/60
hours[4]
## [1] 0.833
What was Mandi’s speed in miles per hour?
speed <- distance/hours
speed[1]
## [1] 4.8
Which runner had the fastest speed?
name[which.max(speed)]
## [1] "Amy"