Week 1: Introduction to R

Jul 12, 2018

R and RStudio

R is an open-source programming language for statistical analysis. Because it is open-source, it is free to use, unlike similar software such as Stata, SAS and SPSS. R was created in the early 1990s and has arguably become the most popular language for statistical analysis.

RStudio is the most popular graphical user interface (GUI) for R. It is developed by a company of the same name.

Read more about R and RStudio here: R, RStudio

Install R and RStudio here: install R, install RStudio

RStudio interface

R Script is where R code is typed and saved. We can send code from the R script to the Console with Ctrl + Enter (Windows) or Command + Enter (Mac). We can also add notes after typing “#”; anything after “#” on a line is a comment and is not executed.
Console is where code is executed and results are printed. We can type code directly in the console or run it from the R script section.
Environment is where all the objects (data) are saved.
The last section includes several tabs. Directory is a link to the folder on your computer where you store your files. Plots displays any static graphs you create. Packages lists all the installed R packages. Help is where you can find help on functions and packages. Viewer displays any interactive graphs or tables you create.

Basics

Arithmetic

Arithmetic operators

Addition: +
Subtraction: -
Multiplication: *
Division: /
Exponentiation: ^
Modulo (Remainder from division): %%
Integer Division: %/%

Examples

2018 %% 10

## [1] 8

(2000 + 18) %% 5 # more complex

## [1] 3

2018 %/% 5

## [1] 403
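
The two division operators fit together: multiplying the integer quotient by the divisor and adding the remainder gives back the original number. A minimal sketch of that check, using the same numbers as above (the object names quotient and remainder are only illustrative):

```r
x <- 2018
y <- 5

quotient  <- x %/% y   # integer division: how many whole times y fits into x
remainder <- x %% y    # modulo: what is left over

quotient                        # 403
remainder                       # 3
quotient * y + remainder == x   # TRUE
```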

Assignment

Assign 4 to an object x

x <- 4 # assign 4 to x
x # prints x

## [1] 4

Use object x in calculation

y <- 2
x*y

## [1] 8

Assign result of calculation to another object z

z <- x*y
z

## [1] 8

Data types in R

numerics: Decimal values like 4.5
integers: Whole numbers like 4, written with an L suffix (4L) to store them as integers. Integers are also numerics.
logical: Boolean values TRUE and FALSE
characters: Text or string values, e.g. “Kabul”

Create different objects for each data type

my_numeric   <- 2018
my_integer   <- 8L
my_character <- "Kabul"
my_logical   <- TRUE

Check class of objects

class(my_numeric)

## [1] "numeric"

class(my_integer)

## [1] "integer"

class(my_character)

## [1] "character"

class(my_logical)

## [1] "logical"
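
To see what "integers are also numerics" means in practice, we can ask R directly. This is a small sketch using the base functions is.numeric(), is.integer() and typeof(); it repeats the two objects created above so it runs on its own.

```r
my_numeric <- 2018   # no L suffix, so R stores this as a double
my_integer <- 8L     # the L suffix asks for an integer

is.numeric(my_integer)   # TRUE: integers count as numeric
is.integer(my_numeric)   # FALSE: 2018 without L is not an integer
typeof(my_numeric)       # "double"
typeof(my_integer)       # "integer"
```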

Data objects in R

We have multiple types of objects in R:

Scalars: Single-element objects, like one number or one character string (we don’t cover them in this course)
Vectors: A sequence (collection) of data with one dimension, that is, length
Matrices: A collection of elements of the same data type arranged into rows and columns
Arrays: Similar to matrices but can have more than two dimensions (we don’t cover them in this course)
Data Frames: A collection of data points arranged into rows and columns (like matrices), but each column can have a different data type (unlike matrices)
Lists: A collection/list of data objects, such as vectors, matrices, data frames, or even other lists

Vectors

Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data. In other words, a vector is a sequence (or set) of data points.

For example, a numeric vector holds numeric data and looks like this: 1, 2.1, -4, 6.8, 2, -1. A character vector holds character data and looks like this: “Kabul”, “India”, “Boat”, “5”, “.”. A logical vector holds logical data and looks like this: FALSE, TRUE, TRUE, FALSE, TRUE.

There are special functions that create a vector. For example:

1:4

## [1] 1 2 3 4

rep("hello", 10) # repeat "hello", 10 times

##  [1] "hello" "hello" "hello" "hello" "hello" "hello" "hello" "hello"
##  [9] "hello" "hello"

In general, we create a vector with the combine function c()

numeric_vector <- c(1, 2.1, -4, 6.8, 2, -1)
character_vector <- c("Kabul", "India", "کشتی", "5", ".")
boolean_vector <- c(FALSE, TRUE, TRUE, FALSE, TRUE)
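
A vector holds only one data type. If we mix types inside c(), R converts every element to a common type rather than raising an error. A short sketch (the object names mixed and mixed_numbers are only illustrative):

```r
# Numbers, text and logicals together: everything becomes character
mixed <- c(1, "Kabul", TRUE)
class(mixed)   # "character"
mixed          # "1" "Kabul" "TRUE"

# Numbers and logicals together: TRUE becomes 1, FALSE becomes 0
mixed_numbers <- c(2.5, TRUE, FALSE)
class(mixed_numbers)   # "numeric"
mixed_numbers          # 2.5 1.0 0.0
```

This is why "5" in the character vector above is text, not a number.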

Vector Arithmetic

Element by element operation

A <- c(1, 2, 3)
B <- c(10, 10, 10)
A + B

## [1] 11 12 13

A / B

## [1] 0.1 0.2 0.3

Vector index

A vector has many characteristics, such as class, length, and elements. The length of a vector is the number of elements it has, irrespective of its class.

A <- c("Kabul", "Mazar", "Herat", "Jalalabad", "Kandahar")
length(A)

## [1] 5

Vector index: Retrieve parts of a vector by index inside “[ ]”. For example, element 1 of vector A is “Kabul”.

A[1]

## [1] "Kabul"

Let’s look at multiple indices

A[3:5] # or A[c(3:5)] or A[c(3,4,5)]

## [1] "Herat"     "Jalalabad" "Kandahar"

There is also negative indexing, which excludes the listed elements

A[-c(1:2)] # NOT A[-1:2], why???

## [1] "Herat"     "Jalalabad" "Kandahar"
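
On the "why???" in the comment above: the minus sign is applied before the colon, so -1:2 is the sequence from -1 to 2, not "minus 1 and minus 2". R does not allow negative and positive indices in the same "[ ]", so A[-1:2] fails. A small sketch that shows this, using try() so the error does not stop the script:

```r
A <- c("Kabul", "Mazar", "Herat", "Jalalabad", "Kandahar")

-1:2       # -1  0  1  2   (minus binds tighter than the colon)
-(1:2)     # -1 -2         (parentheses force the sequence first)

A[-(1:2)]  # same result as A[-c(1:2)]

# Mixing negative and positive indices is an error
result <- try(A[-1:2], silent = TRUE)
inherits(result, "try-error")   # TRUE
```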

We know A has 5 elements. What happens if we ask for 6th element?
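
A quick way to find out is to try it. For an ordinary vector, R does not raise an error when the index is past the end; it returns NA (a missing value) instead.

```r
A <- c("Kabul", "Mazar", "Herat", "Jalalabad", "Kandahar")

A[6]          # NA: there is no 6th element
is.na(A[6])   # TRUE
A[c(1, 6)]    # "Kabul" NA
```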

Matrices

A matrix, another object type in R, is a collection of elements of the same data type (numeric, character, or logical) arranged into a fixed number of rows and columns.

A <- matrix(1:9, nrow = 3)
A

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

Or a character matrix

B <- matrix(c("Kabul", "Mazar", "Herat", "Kandahar"), nrow = 2)
B

##      [,1]    [,2]      
## [1,] "Kabul" "Herat"   
## [2,] "Mazar" "Kandahar"

A matrix is its own class. How do we check the class of an object?

class(A)
class(B)
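
The exact output of class() for a matrix depends on the R version: older versions print "matrix", while R 4.0.0 and later print "matrix" "array". A sketch of checks that work the same way in both cases, using is.matrix(), dim() and typeof():

```r
A <- matrix(1:9, nrow = 3)
B <- matrix(c("Kabul", "Mazar", "Herat", "Kandahar"), nrow = 2)

is.matrix(A)   # TRUE
is.matrix(B)   # TRUE
dim(A)         # 3 3  (rows, columns)
dim(B)         # 2 2
typeof(A)      # "integer": the type of the elements
typeof(B)      # "character"
```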

Matrix index

Vectors have one dimension; matrices have two. Consistent with the number of dimensions, a matrix index also has two values: row first, then column.

Let’s look at matrix A:

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

How can we call the value in the second row and second column?

A[2, 2]

## [1] 5

How can we call all elements of the second column?

A[, 2]

## [1] 4 5 6

How do we call all elements of the first row?

A[1, ]

## [1] 1 4 7
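
The same vector indexing tricks (ranges, c(), negative indices) work in each position of a matrix index. A few more combinations to try, as a sketch; the object names top_corners and second_col are only illustrative:

```r
A <- matrix(1:9, nrow = 3)

# Rows 1 to 2, columns 1 and 3: the result is a 2 x 2 matrix
top_corners <- A[1:2, c(1, 3)]
top_corners

# Everything except the first row
A[-1, ]

# A single column normally comes back as a plain vector;
# drop = FALSE keeps it as a 3 x 1 matrix
second_col <- A[, 2, drop = FALSE]
dim(second_col)   # 3 1
```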

Practice indexing matrices a little more on your own. It is both very useful and super cool.

Matrix Arithmetic

Matrix arithmetic resembles vector arithmetic. Let’s look at a few examples to understand.

A <- matrix(1:9, nrow = 3)
B <- matrix(11:19, nrow = 3)
A

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

B

##      [,1] [,2] [,3]
## [1,]   11   14   17
## [2,]   12   15   18
## [3,]   13   16   19

Matrix operation with a scalar

A * 2

##      [,1] [,2] [,3]
## [1,]    2    8   14
## [2,]    4   10   16
## [3,]    6   12   18

Matrix operation with another matrix

A * B

##      [,1] [,2] [,3]
## [1,]   11   56  119
## [2,]   24   75  144
## [3,]   39   96  171
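
Note that A * B multiplies element by element, just like vector arithmetic. It is not the matrix product from linear algebra; for that, R has a separate operator, %*%. A short sketch comparing the two (the object names elementwise and product are only illustrative):

```r
A <- matrix(1:9, nrow = 3)
B <- matrix(11:19, nrow = 3)

elementwise <- A * B     # each element of A times the matching element of B
product     <- A %*% B   # rows of A combined with columns of B

elementwise[1, 1]   # 1 * 11 = 11
product[1, 1]       # 1*11 + 4*12 + 7*13 = 150
```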

Joining matrices

There are two ways to join matrices:

rbind: add new rows
cbind: add new columns

In row bind, we add one matrix on top of another matrix.

In column bind, we add one matrix on the right side of another matrix.

Let’s join matrix A and B, once by row and once by column:

A <- matrix(1:9, nrow = 3)
B <- matrix(11:19, nrow = 3)
rbind(A, B)

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
## [4,]   11   14   17
## [5,]   12   15   18
## [6,]   13   16   19

cbind(A, B)

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    4    7   11   14   17
## [2,]    2    5    8   12   15   18
## [3,]    3    6    9   13   16   19

Switch A and B’s position in rbind() and cbind() and find out what happens.

rbind(B, A)
cbind(B, A)
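
One way to check your answer without printing the whole result is to look at the dimensions and the first row or column. A sketch, assuming the same A and B as above (the object names stacked and side_by_side are only illustrative):

```r
A <- matrix(1:9, nrow = 3)
B <- matrix(11:19, nrow = 3)

stacked      <- rbind(B, A)   # B's rows come first, then A's
side_by_side <- cbind(B, A)   # B's columns come first, then A's

dim(stacked)        # 6 3
dim(side_by_side)   # 3 6
stacked[1, ]        # 11 14 17: the first row now comes from B
side_by_side[, 1]   # 11 12 13: the first column now comes from B
```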

Data frames

Data frames are similar to matrices in that they have multiple rows and columns, but unlike matrices their contents do not have to be of a single data type. Data frames allow each column to have its own data type. For example, one column can be numeric, and another column can be character. Most real-world data sets are stored as data frames.

R comes with pre-installed data frames. You can call a pre-installed data set directly by typing its name. Let’s look at one of the most used data frames, mtcars.

mtcars

##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

As you can see, typing a data frame name prints all of it. Let’s use the head() and tail() functions instead, which by default print only the first 6 rows and the last 6 rows, respectively.

head(mtcars, 3) # only three rows

##                mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
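
tail() works the same way from the other end of the data frame. A short sketch; the row names in the comment are the last three rows of the full mtcars printout above.

```r
# Last three rows of mtcars
tail(mtcars, 3)

# Without a second argument, both functions return 6 rows
nrow(head(mtcars))   # 6
nrow(tail(mtcars))   # 6

rownames(tail(mtcars, 3))   # "Ferrari Dino" "Maserati Bora" "Volvo 142E"
```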

As data frames can contain multiple types of data, such as numeric, character, logical, or integer, we need a function that gives a summary of this information. That function is str(), which shows you the structure of your data frame or of any other object (try it with a vector, matrix, list, etc.).

str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
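
As suggested, str() is not limited to data frames. Here is a sketch that tries it on a few small objects and also uses dim() to confirm the "32 obs. of 11 variables" line; run it to see how the str() summary changes with the object type.

```r
dim(mtcars)   # 32 11  (rows, columns)

str(1:5)                         # an integer vector
str(c("Kabul", "Herat"))         # a character vector
str(matrix(1:4, nrow = 2))       # a matrix
str(list(x = 4, y = "Kabul"))    # a list with two elements
```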

Creating data frame

Data frames can be created in multiple ways. Let’s see a few cases.

Data frame from matrix

A <- matrix(1:9, nrow = 3)
A_df <- data.frame(A)
class(A_df)

## [1] "data.frame"

Data frame from multiple vectors of the same length

numbers <- 1:5
cities <- c("Kabul", "Mazar", "Herat", "Jalalabad", "Kandahar")
population_in_million <- c(4.635, 0.693, 1.780, 0.356, 0.557)
Afg_major_cities <- data.frame(numbers, cities, population_in_million)
Afg_major_cities

##   numbers    cities population_in_million
## 1       1     Kabul                 4.635
## 2       2     Mazar                 0.693
## 3       3     Herat                 1.780
## 4       4 Jalalabad                 0.356
## 5       5  Kandahar                 0.557

Data frame from imported tabular data (e.g. Excel)

# We will study importing data in the next lecture
read.csv("~/Documents/PRSO/Programs/Data Analysis with R/Day 1/Afg_major_cities.csv", header = TRUE)

##   X numbers    cities population_in_million
## 1 1       1     Kabul                 4.635
## 2 2       2     Mazar                 0.693
## 3 3       3     Herat                 1.780
## 4 4       4 Jalalabad                 0.356
## 5 5       5  Kandahar                 0.557

Data frame index

Data frame indices work largely the same way as matrix indices: we give the row number and column number, in that order, inside “[ ]”.

Let’s see Kabul’s population which is the first row in the Afg_major_cities data frame.

Afg_major_cities[1, ]

##   numbers cities population_in_million
## 1       1  Kabul                 4.635

There are two ways to index a data frame by column. For example, the third column of a data frame can be called with [, 3] or [[3]]. The second method simply means “the third element” and is widely used with list objects.

Another useful indexing technique available for data frames is indexing by row name and column name.

Let’s print population column (“population_in_million”).

Afg_major_cities[, "population_in_million"]

## [1] 4.635 0.693 1.780 0.356 0.557

Note that selecting a single column in either of these ways returns a vector, not a data frame. Selecting a row, as in Afg_major_cities[1, ], still returns a data frame.
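
If we want a single column to stay a data frame, there are two common options: add drop = FALSE, or use single brackets with only the column name. A minimal sketch; cities_df is an illustrative stand-in that rebuilds the same data so the example runs on its own.

```r
cities_df <- data.frame(
  numbers = 1:5,
  cities = c("Kabul", "Mazar", "Herat", "Jalalabad", "Kandahar"),
  population_in_million = c(4.635, 0.693, 1.780, 0.356, 0.557)
)

as_vector <- cities_df[, "population_in_million"]                # plain numeric vector
as_df_1   <- cities_df[, "population_in_million", drop = FALSE]  # one-column data frame
as_df_2   <- cities_df["population_in_million"]                  # also a one-column data frame

is.data.frame(as_vector)   # FALSE
is.data.frame(as_df_1)     # TRUE
is.data.frame(as_df_2)     # TRUE
```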

There are additional unique indexing techniques.

R uses the $ operator to select a variable (column) inside a data frame.

Afg_major_cities$population_in_million # is the same as Afg_major_cities[, 3]

## [1] 4.635 0.693 1.780 0.356 0.557

We can also select an element, for example the 3rd element of the 2nd column, using the $ operator.

Afg_major_cities$cities[3] # is the same as Afg_major_cities[3, 2]

## [1] Herat
## Levels: Herat Jalalabad Kabul Kandahar Mazar
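
The "Levels:" line appears because the cities column is stored as a factor. In R versions before 4.0.0, data.frame() converted character vectors to factors by default; from R 4.0.0 onward the default is stringsAsFactors = FALSE, so the same code prints "Herat" as plain text. The sketch below sets the argument explicitly so the behavior does not depend on the R version (the object names with_factor and with_text are only illustrative):

```r
cities <- c("Kabul", "Mazar", "Herat", "Jalalabad", "Kandahar")
population_in_million <- c(4.635, 0.693, 1.780, 0.356, 0.557)

with_factor <- data.frame(cities, population_in_million, stringsAsFactors = TRUE)
with_text   <- data.frame(cities, population_in_million, stringsAsFactors = FALSE)

class(with_factor$cities)   # "factor"
class(with_text$cities)     # "character"

with_text$cities[3]                    # "Herat"
as.character(with_factor$cities[3])    # "Herat": convert a factor value back to text
```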

Column and Row Names

Most data are stored in data frames, and it is important to assign appropriate names to columns (and sometimes rows) to help recognize them.

Let’s start by looking at the column names and row names of our data frame using the colnames() and rownames() functions.

colnames(Afg_major_cities)

## [1] "numbers"               "cities"                "population_in_million"

rownames(Afg_major_cities)

## [1] "1" "2" "3" "4" "5"

To change column or row names, we can use the same functions. Here, we capitalize the column names, which requires replacing all of the names at once.

colnames(Afg_major_cities) <- c("Numbers", "Cities", "Population_in_million")
Afg_major_cities

##   Numbers    Cities Population_in_million
## 1       1     Kabul                 4.635
## 2       2     Mazar                 0.693
## 3       3     Herat                 1.780
## 4       4 Jalalabad                 0.356
## 5       5  Kandahar                 0.557

To change one or a few column/row names, we use indexing techniques. Let’s change 2nd and 3rd column names.

colnames(Afg_major_cities)[c(2,3)] <- c("Major_cities", "Population")
Afg_major_cities

##   Numbers Major_cities Population
## 1       1        Kabul      4.635
## 2       2        Mazar      0.693
## 3       3        Herat      1.780
## 4       4    Jalalabad      0.356
## 5       5     Kandahar      0.557

Lists

A list is another type of object in R, and a unique one. It is unique because it can contain other types of objects. For example, a list can include multiple data frames, vectors, matrices, scalars, and even other lists as its elements.

To create a list, we use the list() function.

my_list <- list(x = 4, matrix(1:4, nrow = 2), Afg_major_cities)
my_list

## $x
## [1] 4
## 
## [[2]]
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## 
## [[3]]
##   Numbers Major_cities Population
## 1       1        Kabul      4.635
## 2       2        Mazar      0.693
## 3       3        Herat      1.780
## 4       4    Jalalabad      0.356
## 5       5     Kandahar      0.557

We can access each element using double brackets [[ ]]. For example, we access the 3rd element of my_list which is our data frame.

my_list[[3]]

##   Numbers Major_cities Population
## 1       1        Kabul      4.635
## 2       2        Mazar      0.693
## 3       3        Herat      1.780
## 4       4    Jalalabad      0.356
## 5       5     Kandahar      0.557

We can go further and access a column of the data frame (3rd element of my_list) using data frame indexing techniques.

my_list[[3]]$Population

## [1] 4.635 0.693 1.780 0.356 0.557

# or other techniques... (you get the idea)
# my_list[[3]][, 3]
# my_list[[3]][[3]]
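
Two more details about list indexing are worth trying. Named elements, like x above, can also be reached with $ or with their name inside [[ ]], and single brackets [ ] return a smaller list rather than the element itself. A sketch with a shortened version of my_list so it runs on its own:

```r
my_list <- list(x = 4, matrix(1:4, nrow = 2))

my_list$x        # 4
my_list[["x"]]   # 4

class(my_list[1])     # "list": single brackets keep the list wrapper
class(my_list[[1]])   # "numeric": double brackets return the element itself

names(my_list)   # "x" "": the second element has no name
```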

Relational Operators

Relational operators compare values and return a boolean (TRUE or FALSE).

< for less than
> for greater than
<= for less than or equal to
>= for greater than or equal to
== for equal to each other
!= for not equal to each other

For example:

1 < 2

## [1] TRUE

2 == 1

## [1] FALSE

We can use relational operators with vectors too, which output boolean vectors.

For example:

A <- c(1, 4, 2)
B <- c(3, 3, 1)
A >= B

## [1] FALSE  TRUE  TRUE

NOTE: If one vector is shorter, the elements of the shorter vector are recycled. For example, if the shorter vector has three elements (3, 2, 1) and the longer vector has six, the shorter one is recycled to six elements (3, 2, 1, 3, 2, 1).

For example:

A <- c(1, 4, 2, 5, 2, 2)
B <- c(3, 2, 1)
A >= B # Vector B becomes c(3, 2, 1, 3, 2, 1)

## [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
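
Recycling also happens when the longer length is not a multiple of the shorter one, but in that case R gives a warning that the longer object length is not a multiple of the shorter object length. A sketch with a five-element A; suppressWarnings() is used here only to keep the output clean, and rep() shows what the recycled B looks like.

```r
A <- c(1, 4, 2, 5, 2)
B <- c(3, 2, 1)

# B is recycled to c(3, 2, 1, 3, 2); R would normally warn here
result <- suppressWarnings(A >= B)
result   # FALSE TRUE TRUE TRUE TRUE

# The recycled version of B, written out explicitly
rep(B, length.out = length(A))   # 3 2 1 3 2
```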