Week 1: Introduction to R
Jul 12, 2018 00:00 · 3095 words · 15 minute read
R and RStudio
R is an open-source programming language for statistical analysis. Because it is open source, it is free to use, unlike comparable software such as Stata, SAS, and SPSS. R first appeared in the early 1990s and has become arguably the most popular language for statistical analysis.
RStudio is the most popular graphical interface (an integrated development environment, or IDE) for R. It is developed by a company that originally had the same name and was renamed Posit in 2022.
Read more about R and RStudio here: R, RStudio
Install R and RStudio here: install R, install RStudio
RStudio interface
- R Script is where R code is typed and saved. We can send code from the script to the Console with Ctrl + Enter (Windows) or Command + Enter (Mac). We can also add notes after typing “#”; anything after “#” on a line is a comment and is not executed.
- Console is where code is executed and results are printed. We can type code directly in the Console or run it from the R Script section.
- Environment is where all the objects (data) created in the session are listed.
- The last section includes several tabs. Directory (the Files tab) shows a folder on your computer where you store your files. Plots displays any static graphs you create. Packages lists all the installed R packages. Help is where you can find help on functions and packages. Viewer displays any interactive graphs or tables you create.
Basics
Arithmetic
Arithmetic operators
- Addition: +
- Subtraction: -
- Multiplication: *
- Division: /
- Exponentiation: ^
- Modulo (Remainder from division): %%
- Integer Division: %/%
Examples
2018 %% 10
## [1] 8
(2000 + 18) %% 5 # more complex
## [1] 3
2018 %/% 5
## [1] 403
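The two division operators fit together: the integer quotient times the divisor, plus the remainder, gives back the original number. A minimal sketch of that relationship, using the numbers from the examples above (the object names `year`, `divisor`, `quotient`, and `remainder` are only illustrative):

```r
year    <- 2018
divisor <- 5

quotient  <- year %/% divisor   # integer division
remainder <- year %% divisor    # remainder from the division

# Rebuild the original number from the two parts
quotient * divisor + remainder == year
```

The last line should print TRUE for any whole number and any nonzero divisor.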
Assignment
Assign 4 to an object x
x <- 4 # assign 4 to x
x # prints x
## [1] 4
Use object x in calculation
y <- 2
x*y
## [1] 8
Assign result of calculation to another object z
z <- x*y
z
## [1] 8
Data types in R
- numerics: Decimal values like 4.5
- integers: Whole numbers like 4L. The L suffix tells R to store the value as an integer; a plain 4 is stored as a numeric. Integers are also numerics.
- logical: Boolean values TRUE and FALSE
- characters: Text or string values, e.g. “Kabul”
Create different objects for each data type
my_numeric <- 2018
my_integer <- 8L
my_character <- "Kabul"
my_logical <- TRUE
Check class of objects
class(my_numeric)
## [1] "numeric"
class(my_integer)
## [1] "integer"
class(my_character)
## [1] "character"
class(my_logical)
## [1] "logical"
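The difference between a numeric and an integer is easy to miss, because both print the same way. The short sketch below checks a plain 4 against 4L using the base functions class(), is.numeric(), and is.integer():

```r
class(4)         # "numeric": a plain 4 is stored as a decimal (double) value
class(4L)        # "integer": the L suffix requests an integer
is.numeric(4L)   # TRUE: integers also count as numeric
is.integer(4)    # FALSE: without L, 4 is not stored as an integer
```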
Data objects in R
We have multiple types of objects in R:
- Scalars: Single-element objects, like one number or one character string (we don’t cover them in this course)
- Vectors: A sequence (collection) of data with one dimension, its length
- Matrices: A collection of elements of the same data type arranged into rows and columns
- Arrays: Similar to matrices, but they can have more than two dimensions (we don’t cover them in this course)
- Data Frames: A collection of data points arranged into rows and columns (like matrices), but each column can have a different data type (unlike matrices)
- Lists: A collection of data objects, such as vectors, matrices, data frames, or even other lists
Vectors
Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data. In other words, a vector is a sequence of data points of a single type.
For example, a numeric vector holds numeric data and looks like this: 1, 2.1, -4, 6.8, 2, -1. A character vector holds character data and looks like this: “Kabul”, “India”, “Boat”, “5”, “.”. A logical vector holds logical data and looks like this: FALSE, TRUE, TRUE, FALSE, TRUE.
There are special functions that create a vector. For example:
1:4
## [1] 1 2 3 4
rep("hello", 10) # repeat "hello", 10 times
## [1] "hello" "hello" "hello" "hello" "hello" "hello" "hello" "hello"
## [9] "hello" "hello"
In general, we create a vector with the combine function c()
numeric_vector <- c(1, 2.1, -4, 6.8, 2, -1)
character_vector <- c("Kabul", "India", "کشتی", "5", ".")
boolean_vector <- c(FALSE, TRUE, TRUE, FALSE, TRUE)
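Because a vector holds one data type, c() converts its inputs to a common type when they differ. The sketch below illustrates this with two made-up vectors (`mixed_vector` and `number_and_logical` are illustrative names, not objects used elsewhere in the lesson):

```r
# Numbers and logicals mixed with text all become character strings
mixed_vector <- c(1, "Kabul", TRUE)
mixed_vector          # "1" "Kabul" "TRUE"
class(mixed_vector)   # "character"

# Logicals mixed with numbers become 1 (TRUE) and 0 (FALSE)
number_and_logical <- c(1.5, TRUE, FALSE)
number_and_logical    # 1.5 1.0 0.0
```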
Vector arithmetic
Arithmetic on vectors works element by element.
A <- c(1, 2, 3)
B <- c(10, 10, 10)
A + B
## [1] 11 12 13
A / B
## [1] 0.1 0.2 0.3
Vector index
A vector has several characteristics, such as class, length, and elements. The length of a vector is the number of elements it has, irrespective of its class.
A <- c("Kabul", "Mazar", "Herat", "Jalalabad", "Kandahar")
length(A)
## [1] 5
Vector index: Retrieve parts of a vector by index inside “[ ]”. For example, element 1 of vector A is “Kabul”.
A[1]
## [1] "Kabul"
Let’s look at multiple indices.
A[3:5] # or A[c(3:5)] or A[c(3,4,5)]
## [1] "Herat" "Jalalabad" "Kandahar"
There is also the negative index, which excludes the listed elements.
A[-c(1:2)] # NOT A[-1:2], why???
## [1] "Herat" "Jalalabad" "Kandahar"
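One way to see why A[-1:2] does not work is to print the index on its own. The minus sign is applied before the colon, so -1:2 is the sequence from -1 up to 2, which mixes a negative index with positive ones. The sketch below uses tryCatch() only to capture the error message instead of stopping the script; `result` is an illustrative name:

```r
A <- c("Kabul", "Mazar", "Herat", "Jalalabad", "Kandahar")

-1:2       # same as (-1):2, that is -1 0 1 2
-(1:2)     # -1 -2, the same as -c(1:2)

A[-(1:2)]  # drops the first two elements

# Mixing negative and positive indices is an error; keep the message as text
result <- tryCatch(A[-1:2], error = function(e) conditionMessage(e))
result
```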
We know A has 5 elements. What happens if we ask for the 6th element?
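A quick way to find out is to try it. R does not raise an error for an index past the end of a vector; it returns NA (a missing value) for each position that does not exist:

```r
A <- c("Kabul", "Mazar", "Herat", "Jalalabad", "Kandahar")

A[6]          # NA: there is no sixth element
is.na(A[6])   # TRUE
A[4:7]        # "Jalalabad" "Kandahar" NA NA
```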
Matrices
A matrix, another object type in R, is a collection of elements of the same data type (numeric, character, or logical) arranged into a fixed number of rows and columns.
A <- matrix(1:9, nrow = 3)
A
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
Or a character matrix
B <- matrix(c("Kabul", "Mazar", "Herat", "Kandahar"), nrow = 2)
B
##      [,1]    [,2]
## [1,] "Kabul" "Herat"
## [2,] "Mazar" "Kandahar"
A matrix has its own class. How do we check the class of an object?
class(A)
class(B)
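The exact output of class() for a matrix depends on the R version: from R 4.0.0 onward it reports both "matrix" and "array", while earlier versions report only "matrix". The checks below should behave the same in either case; they rely only on the base functions is.matrix(), dim(), and typeof():

```r
A <- matrix(1:9, nrow = 3)
B <- matrix(c("Kabul", "Mazar", "Herat", "Kandahar"), nrow = 2)

is.matrix(A)   # TRUE
dim(A)         # 3 3: number of rows, then number of columns
dim(B)         # 2 2
typeof(A)      # "integer": the data type of the elements
typeof(B)      # "character"
```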
Matrix index
Vectors have one dimension. Matrices have two dimensions. Consistent with the number of dimensions, a matrix index has two values: the row and the column, in that order.
Let’s look at matrix A:
A
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
How do we get the value in the second row and second column?
A[2, 2]
## [1] 5
How do we get all elements of the second column?
A[, 2]
## [1] 4 5 6
How do we get all elements of the first row?
A[1, ]
## [1] 1 4 7
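Indices can also be ranges, which selects a block of the matrix. Note that selecting a single row or column returns a plain vector; the drop = FALSE argument keeps the result as a matrix. A short sketch on the same matrix A:

```r
A <- matrix(1:9, nrow = 3)

A[1:2, 2:3]            # rows 1 to 2 of columns 2 to 3, still a matrix
A[, 2]                 # a single column comes back as a plain vector
A[, 2, drop = FALSE]   # drop = FALSE keeps it as a 3 x 1 matrix
```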
Practice indexing matrices a little more on your own. It is both very useful and super cool.
Matrix Arithmetic
Matrix arithmetic resembles vector arithmetic: operations are applied element by element. Let’s look at a few examples.
A <- matrix(1:9, nrow = 3)
B <- matrix(11:19, nrow = 3)
A
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
B
##      [,1] [,2] [,3]
## [1,]   11   14   17
## [2,]   12   15   18
## [3,]   13   16   19
- Matrix operation with a scalar
A * 2
##      [,1] [,2] [,3]
## [1,]    2    8   14
## [2,]    4   10   16
## [3,]    6   12   18
- Matrix operation with another matrix
A * B
##      [,1] [,2] [,3]
## [1,]   11   56  119
## [2,]   24   75  144
## [3,]   39   96  171
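Note that * multiplies matching elements; it is not the matrix product from linear algebra. R has a separate operator, %*%, for that. The sketch below contrasts the two on a small 2 x 2 matrix (`M` and `identity_2` are illustrative names, not objects from the lesson):

```r
M <- matrix(1:4, nrow = 2)   # filled by column: rows are (1, 3) and (2, 4)
identity_2 <- diag(2)        # 2 x 2 identity matrix

M * M             # element-wise: 1 9 / 4 16
M %*% M           # matrix product: 7 15 / 10 22
M %*% identity_2  # multiplying by the identity gives back M
```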
Joining matrices
There are two ways to join matrices:
- rbind: add new rows
- cbind: add new columns
In a row bind, we stack one matrix on top of another matrix.
In a column bind, we place one matrix to the right of another matrix.
Let’s join matrix A and B, once by row and once by column:
A <- matrix(1:9, nrow = 3)
B <- matrix(11:19, nrow = 3)
rbind(A, B)
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
## [4,]   11   14   17
## [5,]   12   15   18
## [6,]   13   16   19
cbind(A, B)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    4    7   11   14   17
## [2,]    2    5    8   12   15   18
## [3,]    3    6    9   13   16   19
Switch A and B’s position in rbind() and cbind() and find out what happens.
rbind(B, A)
cbind(B, A)
Data frames
Data frames are similar to matrices in that they have rows and columns, but they differ in that their contents do not all have to be of the same data type. Data frames allow each column to have its own data type. For example, one column can be numeric, and another column can be character. Most data sets you will encounter are data frames.
R comes with pre-installed data frames. You can call a pre-installed data frame directly by typing its name. Let’s look at one of the most used data frames, mtcars.
mtcars
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
As you can see, typing a data frame’s name prints all of it. The head() and tail() functions print only the first or the last rows, respectively (6 rows by default).
head(mtcars, 3) # only three rows
##                mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
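tail() works the same way from the other end of the data frame. A short sketch on the built-in mtcars data:

```r
tail(mtcars, 3)             # the last three rows
rownames(tail(mtcars, 3))   # "Ferrari Dino" "Maserati Bora" "Volvo 142E"
nrow(head(mtcars))          # 6: the default number of rows for head()
```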
As data frames can contain multiple types of data, such as numeric, character, logical, or integer, we need a function that summarizes this information. That function is str(), which shows the structure of your data frame or of any other object (try it with a vector, matrix, list, etc.).
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Creating data frames
Data frames can be created in multiple ways. Let’s see a few cases.
- Data frame from a matrix
A <- matrix(1:9, nrow = 3)
A_df <- data.frame(A)
class(A_df)
## [1] "data.frame"
- Data frame from multiple vectors of the same length
numbers <- 1:5
cities <- c("Kabul", "Mazar", "Herat", "Jalalabad", "Kandahar")
population_in_million <- c(4.635, 0.693, 1.780, 0.356, 0.557)
Afg_major_cities <- data.frame(numbers, cities, population_in_million)
Afg_major_cities
##   numbers    cities population_in_million
## 1       1     Kabul                 4.635
## 2       2     Mazar                 0.693
## 3       3     Herat                 1.780
## 4       4 Jalalabad                 0.356
## 5       5  Kandahar                 0.557
- Data frame from imported tabular data (e.g. Excel)
# We will study importing data in the next lecture
read.csv("~/Documents/PRSO/Programs/Data Analysis with R/Day 1/Afg_major_cities.csv", header = TRUE)
##   X numbers    cities population_in_million
## 1 1       1     Kabul                 4.635
## 2 2       2     Mazar                 0.693
## 3 3       3     Herat                 1.780
## 4 4       4 Jalalabad                 0.356
## 5 5       5  Kandahar                 0.557
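The extra X column in the imported data most likely holds row numbers that were saved when the CSV file was written; that is an assumption, since the file itself is not shown here. The sketch below writes a small data frame to a temporary file with row.names = FALSE and reads it back, so no X column appears. The names `cities_df`, `csv_path`, and `imported` are illustrative, and the temporary path stands in for a real file location:

```r
cities_df <- data.frame(
  numbers = 1:5,
  cities = c("Kabul", "Mazar", "Herat", "Jalalabad", "Kandahar"),
  population_in_million = c(4.635, 0.693, 1.780, 0.356, 0.557)
)

csv_path <- tempfile(fileext = ".csv")   # hypothetical temporary location

# row.names = FALSE stops write.csv() from saving the row numbers as a column
write.csv(cities_df, csv_path, row.names = FALSE)

imported <- read.csv(csv_path, header = TRUE)
colnames(imported)   # "numbers" "cities" "population_in_million"
```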
Data frame index
Data frame indices work largely the same way as matrix indices: we give the row number and the column number, in that order, inside “[ ]”.
Let’s look at Kabul, which is the first row in the Afg_major_cities data frame.
Afg_major_cities[1, ]
##   numbers cities population_in_million
## 1       1  Kabul                 4.635
There are two ways to index a data frame by column position. For example, the third column can be called with [, 3] or [[3]]. The second form simply means “the third element” and is widely used with list objects.
Another useful technique available for data frames is indexing by row name and column name.
Let’s print the population column (“population_in_million”).
Afg_major_cities[, "population_in_million"]
## [1] 4.635 0.693 1.780 0.356 0.557
Note that the result of selecting a single column, with either kind of index, is a vector, not a data frame. Selecting a row, as in Afg_major_cities[1, ], still returns a data frame.
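If a single column should stay a data frame, the drop = FALSE argument of “[ ]” can be used. The sketch below rebuilds the cities data under the illustrative name `cities_df` so that it runs on its own:

```r
cities_df <- data.frame(
  numbers = 1:5,
  cities = c("Kabul", "Mazar", "Herat", "Jalalabad", "Kandahar"),
  population_in_million = c(4.635, 0.693, 1.780, 0.356, 0.557)
)

pop_vector <- cities_df[, "population_in_million"]                # a numeric vector
pop_df     <- cities_df[, "population_in_million", drop = FALSE]  # a one-column data frame
one_row    <- cities_df[1, ]                                      # a one-row data frame

class(pop_vector)   # "numeric"
class(pop_df)       # "data.frame"
class(one_row)      # "data.frame"
```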
There are additional indexing techniques that are specific to data frames and lists.
R uses the $ operator to select a variable (column) inside a data frame.
Afg_major_cities$population_in_million # is the same as Afg_major_cities[, 3]
## [1] 4.635 0.693 1.780 0.356 0.557
We can also select a single element, for example the 3rd element of the 2nd column, by combining the $ operator with a vector index.
Afg_major_cities$cities[3] # is the same as Afg_major_cities[3, 2]
## [1] Herat
## Levels: Herat Jalalabad Kabul Kandahar Mazar
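The Levels line in this output shows that the cities column was stored as a factor rather than as plain text. That was the default for data.frame() before R 4.0.0; from R 4.0.0 onward, text columns stay character unless asked otherwise, so newer versions print just "Herat". The sketch below sets the stringsAsFactors argument explicitly to show both behaviours (`as_factor` and `as_text` are illustrative names):

```r
cities <- c("Kabul", "Mazar", "Herat", "Jalalabad", "Kandahar")

as_factor <- data.frame(cities, stringsAsFactors = TRUE)
as_text   <- data.frame(cities, stringsAsFactors = FALSE)

class(as_factor$cities)   # "factor"
class(as_text$cities)     # "character"
as_text$cities[3]         # "Herat", printed without a Levels line
```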
Column and Row Names
Most data are stored in data frames, and it is important to assign appropriate names to columns (and sometimes rows) to help recognize them.
Let’s start by looking at the column names and row names of our data frame using the colnames() and rownames() functions.
colnames(Afg_major_cities)
## [1] "numbers" "cities" "population_in_million"
rownames(Afg_major_cities)
## [1] "1" "2" "3" "4" "5"
To change column or row names, we can use the same functions. Here, we capitalize the column names, which requires replacing all the names at once.
colnames(Afg_major_cities) <- c("Numbers", "Cities", "Population_in_million")
Afg_major_cities
##   Numbers    Cities Population_in_million
## 1       1     Kabul                 4.635
## 2       2     Mazar                 0.693
## 3       3     Herat                 1.780
## 4       4 Jalalabad                 0.356
## 5       5  Kandahar                 0.557
To change only one or a few column or row names, we use indexing techniques. Let’s change the 2nd and 3rd column names.
colnames(Afg_major_cities)[c(2,3)] <- c("Major_cities", "Population")
Afg_major_cities
##   Numbers Major_cities Population
## 1       1        Kabul      4.635
## 2       2        Mazar      0.693
## 3       3        Herat      1.780
## 4       4    Jalalabad      0.356
## 5       5     Kandahar      0.557
Lists
A list is another type of object in R, and a unique one: it can hold other types of objects within itself. For example, a list can include multiple data frames, vectors, matrices, scalars, and even other lists as its elements.
To create a list, we use the list() function.
my_list <- list(x = 4, matrix(1:4, nrow = 2), Afg_major_cities)
my_list
## $x
## [1] 4
##
## [[2]]
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
##
## [[3]]
##   Numbers Major_cities Population
## 1       1        Kabul      4.635
## 2       2        Mazar      0.693
## 3       3        Herat      1.780
## 4       4    Jalalabad      0.356
## 5       5     Kandahar      0.557
We can access each element using double brackets [[ ]]. For example, let’s access the 3rd element of my_list, which is our data frame.
my_list[[3]]
##   Numbers Major_cities Population
## 1       1        Kabul      4.635
## 2       2        Mazar      0.693
## 3       3        Herat      1.780
## 4       4    Jalalabad      0.356
## 5       5     Kandahar      0.557
We can go further and access a column of the data frame (the 3rd element of my_list) using the data frame indexing techniques.
my_list[[3]]$Population
## [1] 4.635 0.693 1.780 0.356 0.557
# or other techniques... (you get the idea)
# my_list[[3]][, 3]
# my_list[[3]][[3]]
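Named elements, such as x in the list above, can also be reached by name, and single brackets behave differently from double brackets. The sketch below uses a smaller stand-in list (`small_list` is an illustrative name, and letters[1:3] replaces the data frame) so that it runs on its own:

```r
small_list <- list(x = 4, matrix(1:4, nrow = 2), letters[1:3])

small_list$x             # 4: access by name with $
small_list[["x"]]        # 4: access by name with [[ ]]
small_list[[2]][1, 2]    # 3: row 1, column 2 of the matrix
class(small_list[1])     # "list": single brackets return a shorter list
class(small_list[[1]])   # "numeric": double brackets return the element itself
names(small_list)        # "x" "" "": unnamed elements have empty names
```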
Relational Operators
Relational operators compare values and return a logical (boolean) result.
- < for less than
- > for greater than
- <= for less than or equal to
- >= for greater than or equal to
- == for equal to each other
- != for not equal to each other
For example:
1 < 2
## [1] TRUE
2 == 1
## [1] FALSE
We can use relational operators with vectors too. The comparison is made element by element and returns a logical vector.
For example:
A <- c(1, 4, 2)
B <- c(3, 3, 1)
A >= B
## [1] FALSE TRUE TRUE
NOTE: If one vector is shorter, its elements are recycled until it matches the length of the longer vector. For example, a shorter vector with three elements (3, 2, 1) compared against a six-element vector is recycled to six elements (3, 2, 1, 3, 2, 1).
For example:
A <- c(1, 4, 2, 5, 2, 2)
B <- c(3, 2, 1)
A >= B # Vector B becomes c(3, 2, 1, 3, 2, 1)
## [1] FALSE TRUE TRUE TRUE TRUE TRUE
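Recycling also happens when the longer length is not a whole multiple of the shorter one, but in that case R adds a warning. The sketch below captures the warning text with tryCatch() and then computes the comparison quietly with suppressWarnings(); `warning_text` and `comparison` are illustrative names:

```r
A <- c(1, 4, 2, 5, 2)   # five elements
B <- c(3, 2, 1)         # three elements, recycled to c(3, 2, 1, 3, 2)

# Keep the warning message as text instead of printing it
warning_text <- tryCatch(A >= B, warning = function(w) conditionMessage(w))
warning_text

# The comparison itself still returns a result
comparison <- suppressWarnings(A >= B)
comparison   # FALSE TRUE TRUE TRUE TRUE
```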