Week 2: On Data Frames and Intro to ggplot2 For Making Graphs
Jul 19, 2018
Data Frames
Data frames are among the most widely used data objects in R. A data frame is a collection of vectors of the same length, which are called columns. Each column has its own data type (e.g. numeric, character, etc.). Data frames resemble spreadsheets (e.g. MS Excel), which is one reason for their widespread use.
Data Frame Index
Column(s) or row(s) of a data frame can be selected using indices inside square brackets, i.e. “[ ]”, written in the form [row, column]. For example, if we have a data frame, df:
df <- data.frame(numbers = 1:5,
cities = c("Kabul", "Mazar", "Herat", "Jalalabad", "Kandahar"),
population = c(4.635, 0.693, 1.780, 0.356, 0.557))
df
## numbers cities population
## 1 1 Kabul 4.635
## 2 2 Mazar 0.693
## 3 3 Herat 1.780
## 4 4 Jalalabad 0.356
## 5 5 Kandahar 0.557
then df[2, 2] selects the element at the intersection of the second row and the second column of the data set.
df[2, 2]
## [1] "Mazar"
df[2, ] selects the entire second row (and returns a data frame!),
df[2, ]
## numbers cities population
## 2 2 Mazar 0.693
df[, 2] selects the second column (and returns a vector),
df[, 2]
## [1] "Kabul" "Mazar" "Herat" "Jalalabad" "Kandahar"
df[-4, ] selects all rows and columns of the df data frame except for the 4th row (and returns a data frame!)
df[-4, ]
## numbers cities population
## 1 1 Kabul 4.635
## 2 2 Mazar 0.693
## 3 3 Herat 1.780
## 5 5 Kandahar 0.557
and df[c(2,3,1), ] selects the first three rows but reorders them (returns a data frame with 3 rows and 3 columns).
df[c(2,3,1), ]
## numbers cities population
## 2 2 Mazar 0.693
## 3 3 Herat 1.780
## 1 1 Kabul 4.635
There is also the “$” operator, known as the component selector, which selects one column of a data frame by name. For example, df$cities selects the column cities (and returns a vector).
df$cities
## [1] "Kabul" "Mazar" "Herat" "Jalalabad" "Kandahar"
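The indexing forms above cover positions only. As a minimal sketch of a few related forms (not part of the original examples), the block below re-creates df and shows how to keep a single column as a data frame, select columns by name, and filter rows with a logical condition. The stringsAsFactors = FALSE argument and the object names cities_df, by_name and large are assumptions added for illustration.

```r
# Re-create the example data frame. stringsAsFactors = FALSE is an assumption
# added here so that `cities` is a character vector on every R version.
df <- data.frame(numbers = 1:5,
                 cities = c("Kabul", "Mazar", "Herat", "Jalalabad", "Kandahar"),
                 population = c(4.635, 0.693, 1.780, 0.356, 0.557),
                 stringsAsFactors = FALSE)

# drop = FALSE keeps the result as a one-column data frame instead of a vector
cities_df <- df[, 2, drop = FALSE]

# Columns can be selected by name rather than by position
by_name <- df[, c("cities", "population")]

# Rows can be selected with a logical condition on a column
large <- df[df$population > 1, ]
large
```

Selecting by name tends to be more robust than selecting by position, because the code keeps working if columns are later reordered.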
Create New Column/Variable
We defined data frames as a collection of vectors of the same length. It is therefore possible to add a vector of the same length as another column (variable) to a data frame. For example, here we add a new column to the data frame df created earlier. There are multiple ways to do so, of which only two are explored here. One way is to use the data.frame() function. Note that this returns a new data frame and leaves df itself unchanged unless the result is assigned.
df
## numbers cities population
## 1 1 Kabul 4.635
## 2 2 Mazar 0.693
## 3 3 Herat 1.780
## 4 4 Jalalabad 0.356
## 5 5 Kandahar 0.557
data.frame(df, new_column = c(T,T,F,F,T))
## numbers cities population new_column
## 1 1 Kabul 4.635 TRUE
## 2 2 Mazar 0.693 TRUE
## 3 3 Herat 1.780 FALSE
## 4 4 Jalalabad 0.356 FALSE
## 5 5 Kandahar 0.557 TRUE
The second way is to use the component selector $, which modifies df in place.
df$new_column <- c(T,T,F,F,T)
df
## numbers cities population new_column
## 1 1 Kabul 4.635 TRUE
## 2 2 Mazar 0.693 TRUE
## 3 3 Herat 1.780 FALSE
## 4 4 Jalalabad 0.356 FALSE
## 5 5 Kandahar 0.557 TRUE
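A new column does not have to be typed in by hand. The sketch below, which assumes the same df as above, derives one column from an existing column, fills another with a single recycled value, and shows that a vector of the wrong length is rejected. The column names share, country and bad are hypothetical names chosen for this illustration.

```r
# Re-create the example data frame (stringsAsFactors = FALSE is an assumption)
df <- data.frame(numbers = 1:5,
                 cities = c("Kabul", "Mazar", "Herat", "Jalalabad", "Kandahar"),
                 population = c(4.635, 0.693, 1.780, 0.356, 0.557),
                 stringsAsFactors = FALSE)

# Derive a column from an existing one: each city's share of the listed total
df$share <- df$population / sum(df$population)

# A single value is recycled to every row
df$country <- "Afghanistan"

# A vector with 2 elements does not fit a data frame with 5 rows,
# so the assignment fails; tryCatch() captures the error for inspection
result <- tryCatch({
  df$bad <- c(TRUE, FALSE)
  "added"
}, error = function(e) "error")
result
```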
Inspect Data Frames
Some useful functions for working with data frames:
- nrow(): number of rows
- ncol(): number of columns
- dim(): dimensions
- str(): returns the structure of the data frame
- colnames(): returns the column names of the data frame
- rownames(): returns the row names (if any) of the data frame
- summary(): returns summary statistics
- head(): returns the first 6 observations of the data frame
- tail(): returns the last 6 observations of the data frame
Let’s use the above functions on the Titanic data set, which ships with R (in the datasets package). Titanic is stored as a table, so first I turn it into a data frame,
titanic <- as.data.frame(Titanic)
Now, we use the functions on the titanic data frame to learn about this data set.
nrow(titanic)
## [1] 32
ncol(titanic)
## [1] 5
dim(titanic)
## [1] 32 5
str(titanic)
## 'data.frame': 32 obs. of 5 variables:
## $ Class : Factor w/ 4 levels "1st","2nd","3rd",..: 1 2 3 4 1 2 3 4 1 2 ...
## $ Sex : Factor w/ 2 levels "Male","Female": 1 1 1 1 2 2 2 2 1 1 ...
## $ Age : Factor w/ 2 levels "Child","Adult": 1 1 1 1 1 1 1 1 2 2 ...
## $ Survived: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Freq : num 0 0 35 0 0 0 17 0 118 154 ...
colnames(titanic)
## [1] "Class" "Sex" "Age" "Survived" "Freq"
rownames(titanic)
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14"
## [15] "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28"
## [29] "29" "30" "31" "32"
summary(titanic)
## Class Sex Age Survived Freq
## 1st :8 Male :16 Child:16 No :16 Min. : 0.00
## 2nd :8 Female:16 Adult:16 Yes:16 1st Qu.: 0.75
## 3rd :8 Median : 13.50
## Crew:8 Mean : 68.78
## 3rd Qu.: 77.00
## Max. :670.00
head(titanic)
## Class Sex Age Survived Freq
## 1 1st Male Child No 0
## 2 2nd Male Child No 0
## 3 3rd Male Child No 35
## 4 Crew Male Child No 0
## 5 1st Female Child No 0
## 6 2nd Female Child No 0
tail(titanic)
## Class Sex Age Survived Freq
## 27 3rd Male Adult Yes 75
## 28 Crew Male Adult Yes 192
## 29 1st Female Adult Yes 140
## 30 2nd Female Adult Yes 80
## 31 3rd Female Adult Yes 76
## 32 Crew Female Adult Yes 20
To inspect the elements of a single column (or vector), there are other functions. Note that their input must be a column (or a vector):
- table(): returns a table of frequencies
- unique(): returns unique values
- summary(): returns summary statistics of a column
Let’s take these functions and use them on the Class column of the titanic data frame:
table(titanic$Class)
##
## 1st 2nd 3rd Crew
## 8 8 8 8
unique(titanic$Class) # Levels are printed because Class is a factor (R's type for categorical data)
## [1] 1st 2nd 3rd Crew
## Levels: 1st 2nd 3rd Crew
summary(titanic$Class)
## 1st 2nd 3rd Crew
## 8 8 8 8
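Note that table(titanic$Class) counts rows, not people: each of the 32 rows is one combination of Class, Sex, Age and Survived, and the number of people in that combination is stored in Freq. As a sketch of how the actual head counts could be recovered, the block below sums Freq within groups using xtabs(). The object names people_by_class, class_by_survival and total_people are illustrative.

```r
titanic <- as.data.frame(Titanic)

# Sum the Freq column within each class to get the number of people per class
people_by_class <- xtabs(Freq ~ Class, data = titanic)
people_by_class

# The same idea with two columns gives a Class-by-Survived cross-tabulation
class_by_survival <- xtabs(Freq ~ Class + Survived, data = titanic)
class_by_survival

# Total number of people recorded in the data set
total_people <- sum(titanic$Freq)
total_people
```

The total agrees with the summary() output above: a mean Freq of 68.78 over 32 rows corresponds to about 2201 people.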
Basic Graphs with ggplot2
ggplot2 is a data visualization package for R, created by Hadley Wickham in 2005. ggplot2 is based on the Grammar of Graphics, a general scheme for data visualization which breaks up graphs into components. Before going further, install and load the package:
install.packages("ggplot2") # or the tidyverse package, which includes ggplot2
library(ggplot2) # loads package
Components of ggplot2 Graphs
Based on the grammar of graphics, any graphic has 6 components. This week, we only touch on the 1st, 2nd and 6th components (data, geom and facets).
- data: What you want to visualize, including variables (columns) to be mapped to aesthetic attributes.
- geom: Geometric objects that are drawn to represent the data: bars, lines, points, etc.
- stats: Statistical transformations of the data, such as binning or averaging.
- scales: Map values in the data space to values in an aesthetic space (color, shape, size…)
- coord: Coordinate system; provides axes and gridlines to make it possible to read the graph.
- facets: Breaking up the data into subsets, to be displayed independently on a grid.
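To make the list above concrete, here is a minimal sketch that labels the three components covered this week (data, geom, facets) in a single plot. It uses the built-in mtcars data, and the object name p is an arbitrary choice. The remaining components (stats, scales, coord) are filled in by ggplot2 defaults when they are not specified.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) + # data: data frame plus aesthetic mappings
  geom_point() +                            # geom: draw one point per row
  facet_wrap(~ cyl)                         # facets: one panel per value of cyl

# A plot is an ordinary R object; printing it draws the graph
print(p)
```

Storing the plot in an object makes it possible to add further components to it later with +.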
Types of Plots With ggplot2
- For one categorical variable, use barplot
ggplot(mtcars, aes(x = factor(cyl))) + # data component; factor() treats the numeric cyl as categorical
geom_bar() # geom component
- For one continuous variable, use histogram or density plot
ggplot(mtcars, aes(x = mpg)) + # data component
geom_histogram() # geom component
ggplot(mtcars, aes(x = mpg)) +
geom_density()
- For two categorical variables, use barplot for one variable and label another variable with colors.
ggplot(mtcars, aes(x = factor(gear), fill = factor(cyl))) + # data component
geom_bar() # geom component
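geom_bar() has a position argument that allows creating fill and dodge barplots: "fill" stacks the bars and scales each stack to the same height to show proportions, and "dodge" places the bars side by side. See the following examples: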
ggplot(mtcars, aes(x = factor(gear), fill = factor(cyl))) +
geom_bar(position = "fill")
ggplot(mtcars, aes(x = factor(gear), fill = factor(cyl))) +
geom_bar(position = "dodge")
- For one categorical and one continuous variable, use a boxplot or density plot (there are also swarmplot, stripplot, violinplot)
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) + # Note the continuous variable is on the y axis
geom_boxplot()
ggplot(mtcars, aes(x = mpg, fill = factor(cyl))) + # Note the continuous variable is on the x axis
geom_density(alpha = .5) # using alpha we assign 50% color transparency
- For two continuous variables, use scatter plot
ggplot(mtcars, aes(x = wt, y = mpg)) + # data component
geom_point() # geom component
- For two categorical and one continuous variables, use a boxplot and use color or facets to visualize the other categorical variable. It is also possible to use a density plot, geom_density(), with facets (see above for an example of a density plot).
ggplot(mtcars, aes(x = factor(cyl), y = mpg, col = factor(gear))) +
geom_boxplot() +
facet_wrap(~factor(gear)) # facet component
- For two continuous and one categorical variables, use a scatterplot and use color or facets
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point() +
facet_wrap(~factor(cyl)) # facet component
- For three continuous variables, one would use a 3D scatterplot, which is not available in ggplot2, but it is possible to map the third continuous variable to color or size instead (shape only works for categorical variables).
ggplot(mtcars, aes(x = wt, y = mpg, size = disp)) +
geom_point(alpha = .5) # alpha controls transparency of points
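One detail about the histogram example earlier in this list: when geom_histogram() is called without arguments, it uses 30 bins and prints a message suggesting that a better value be chosen. The sketch below shows two ways to set the bins explicitly. The values 2.5 and 10 are arbitrary choices for illustration, not recommendations.

```r
library(ggplot2)

# Choose the bin width explicitly: each bar covers 2.5 mpg
p_binwidth <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2.5)

# Or fix the number of bins instead
p_bins <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10)

print(p_binwidth)
print(p_bins)
```

The shape of a histogram can change noticeably with the bin width, so it is worth trying a few values.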
Apply a Different Theme
There are a few prepared themes to change the look of your plots in ggplot2:
ggplot(mtcars, aes(x = wt, y = mpg, color = disp)) +
geom_point() +
theme_bw() # applies black and white theme
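theme_bw() is one of several complete themes that come with ggplot2. As a short sketch, the block below stores the plot in an object (base_plot, a name chosen here) and applies two other built-in themes, then adjusts a single setting on top of theme_bw() with theme().

```r
library(ggplot2)

# Build the plot once, then try different themes on it
base_plot <- ggplot(mtcars, aes(x = wt, y = mpg, color = disp)) +
  geom_point()

p_minimal <- base_plot + theme_minimal() # minimal theme with no background annotations
p_classic <- base_plot + theme_classic() # axis lines, no gridlines

# theme() changes individual settings; here the legend moves below the plot
p_custom <- base_plot + theme_bw() + theme(legend.position = "bottom")

print(p_custom)
```

theme() should come after the complete theme in the chain, because a complete theme such as theme_bw() replaces all earlier theme settings.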