Week 2: On Data Frames and Intro to ggplot2 For Making Graphs

Jul 19, 2018 00:00 · 1607 words · 8 minute read r tips

Data Frames

Data frames are the most widely used data object in R. A data frame is a collection of vectors of the same length, called columns. Each column has its own data type (e.g. numeric, character). Data frames resemble spreadsheets (e.g. MS Excel), which is one reason for their widespread use.

Data Frame Index

Columns or rows of a data frame can be selected using indices inside square brackets, “[ ]”, in the form df[row, column]. For example, if we have a data frame, df:

df <- data.frame(numbers = 1:5, 
                 cities = c("Kabul", "Mazar", "Herat", "Jalalabad", "Kandahar"), 
                 population = c(4.635, 0.693, 1.780, 0.356, 0.557))
df
##   numbers    cities population
## 1       1     Kabul      4.635
## 2       2     Mazar      0.693
## 3       3     Herat      1.780
## 4       4 Jalalabad      0.356
## 5       5  Kandahar      0.557

then df[2, 2] selects the intersection of the second row and the second column of the dataset.

df[2, 2]
## [1] "Mazar"

df[2, ] selects the entire second row (and returns a data frame!),

df[2, ]
##   numbers cities population
## 2       2  Mazar      0.693

df[, 2] selects the second column (and returns a vector),

df[, 2]
## [1] "Kabul"     "Mazar"     "Herat"     "Jalalabad" "Kandahar"

df[-4, ] selects all rows and columns of df except the 4th row (and returns a data frame!),

df[-4, ]
##   numbers   cities population
## 1       1    Kabul      4.635
## 2       2    Mazar      0.693
## 3       3    Herat      1.780
## 5       5 Kandahar      0.557

and df[c(2,3,1), ] selects the first three rows but reorders them (returns a data frame with 3 rows and 3 columns).

df[c(2,3,1), ]
##   numbers cities population
## 2       2  Mazar      0.693
## 3       3  Herat      1.780
## 1       1  Kabul      4.635

There is also “$”, known as the component selector, which selects one column of a data frame by name. For example, df$cities selects the column cities (and returns a vector).

df$cities
## [1] "Kabul"     "Mazar"     "Herat"     "Jalalabad" "Kandahar"
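As a small illustration of the indexing rules above, the sketch below selects columns by name, keeps a single column as a data frame with drop = FALSE, and filters rows with a logical condition. The threshold of 1 is an arbitrary value chosen for illustration, and stringsAsFactors = FALSE is set explicitly so that cities is a character column in any R version.

```r
df <- data.frame(numbers = 1:5,
                 cities = c("Kabul", "Mazar", "Herat", "Jalalabad", "Kandahar"),
                 population = c(4.635, 0.693, 1.780, 0.356, 0.557),
                 stringsAsFactors = FALSE)

# Select columns by name instead of by position
by_name <- df[, c("cities", "population")]

# Keep a single column as a one-column data frame rather than a vector
one_col <- df[, 2, drop = FALSE]

# Logical indexing: keep only the rows where the condition is TRUE
# (the threshold 1 is arbitrary, chosen only for illustration)
large <- df[df$population > 1, ]
large
```

Selecting by name tends to be more robust than selecting by position, because the code keeps working if columns are added or reordered.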

Create New Column/Variable

We defined data frames as a collection of vectors of the same length. It is therefore possible to add another vector of the same length to a data frame as a new column (variable). For example, here we add a new column to the data frame df created earlier. There are multiple ways to do so, of which only two are explored here. One way is to use the data.frame() function, which returns a new data frame and leaves df unchanged unless the result is assigned.

df
##   numbers    cities population
## 1       1     Kabul      4.635
## 2       2     Mazar      0.693
## 3       3     Herat      1.780
## 4       4 Jalalabad      0.356
## 5       5  Kandahar      0.557
data.frame(df, new_column = c(TRUE, TRUE, FALSE, FALSE, TRUE))
##   numbers    cities population new_column
## 1       1     Kabul      4.635       TRUE
## 2       2     Mazar      0.693       TRUE
## 3       3     Herat      1.780      FALSE
## 4       4 Jalalabad      0.356      FALSE
## 5       5  Kandahar      0.557       TRUE

The second way is to use the component selector $, which modifies df itself.

df$new_column <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
df
##   numbers    cities population new_column
## 1       1     Kabul      4.635       TRUE
## 2       2     Mazar      0.693       TRUE
## 3       3     Herat      1.780      FALSE
## 4       4 Jalalabad      0.356      FALSE
## 5       5  Kandahar      0.557       TRUE
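The difference between the two approaches can be sketched as follows: data.frame() builds a new object, while `$` assignment changes the existing one. The names df2 and share are hypothetical, introduced only for this sketch, and the data frame is a reduced version of df.

```r
df <- data.frame(numbers = 1:5,
                 population = c(4.635, 0.693, 1.780, 0.356, 0.557))

# data.frame() returns a new object; df itself is unchanged
df2 <- data.frame(df, new_column = c(TRUE, TRUE, FALSE, FALSE, TRUE))
ncol(df)   # df still has its original 2 columns
ncol(df2)  # df2 has the extra column

# `$` assignment modifies df; the new column can be computed from existing ones
# (share is a hypothetical column name used only in this sketch)
df$share <- df$population / sum(df$population)
df
```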

Inspect Data Frames

Some useful functions for working with data frames:

  • nrow(): number of rows
  • ncol(): number of columns
  • dim(): dimension
  • str(): returns the structure of the data frame
  • colnames(): returns the column names of the data frame
  • rownames(): returns the row names (if any) of the data frame
  • summary(): returns summary statistics
  • head(): returns the first 6 observations of the data frame (by default)
  • tail(): returns the last 6 observations of the data frame (by default)

Let’s use the above functions on the Titanic data set, which ships with R (in the datasets package) as a table object. First, I turn it into a data frame,

titanic <- as.data.frame(Titanic)

Now we apply these functions to the titanic data frame to learn about the data set.

nrow(titanic)
## [1] 32
ncol(titanic)
## [1] 5
dim(titanic)
## [1] 32  5
str(titanic)
## 'data.frame':    32 obs. of  5 variables:
##  $ Class   : Factor w/ 4 levels "1st","2nd","3rd",..: 1 2 3 4 1 2 3 4 1 2 ...
##  $ Sex     : Factor w/ 2 levels "Male","Female": 1 1 1 1 2 2 2 2 1 1 ...
##  $ Age     : Factor w/ 2 levels "Child","Adult": 1 1 1 1 1 1 1 1 2 2 ...
##  $ Survived: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Freq    : num  0 0 35 0 0 0 17 0 118 154 ...
colnames(titanic)
## [1] "Class"    "Sex"      "Age"      "Survived" "Freq"
rownames(titanic)
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14"
## [15] "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28"
## [29] "29" "30" "31" "32"
summary(titanic)
##   Class       Sex        Age     Survived      Freq       
##  1st :8   Male  :16   Child:16   No :16   Min.   :  0.00  
##  2nd :8   Female:16   Adult:16   Yes:16   1st Qu.:  0.75  
##  3rd :8                                   Median : 13.50  
##  Crew:8                                   Mean   : 68.78  
##                                           3rd Qu.: 77.00  
##                                           Max.   :670.00
head(titanic)
##   Class    Sex   Age Survived Freq
## 1   1st   Male Child       No    0
## 2   2nd   Male Child       No    0
## 3   3rd   Male Child       No   35
## 4  Crew   Male Child       No    0
## 5   1st Female Child       No    0
## 6   2nd Female Child       No    0
tail(titanic)
##    Class    Sex   Age Survived Freq
## 27   3rd   Male Adult      Yes   75
## 28  Crew   Male Adult      Yes  192
## 29   1st Female Adult      Yes  140
## 30   2nd Female Adult      Yes   80
## 31   3rd Female Adult      Yes   76
## 32  Crew Female Adult      Yes   20
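head() and tail() also accept an n argument when 6 rows is not the right amount. A minimal sketch, with row counts picked arbitrarily:

```r
titanic <- as.data.frame(Titanic)

head(titanic, n = 3)    # first 3 rows
tail(titanic, n = 2)    # last 2 rows
head(titanic, n = -30)  # a negative n drops rows from the end: all but the last 30
```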

To inspect elements of a column (or a vector), we have other functions. Note, we have to input a column (or a vector):

  • table() returns a table of frequency
  • unique() returns unique values
  • summary() returns summary statistics of a column

Let’s use these functions on the Class column of the titanic data frame:

table(titanic$Class)
## 
##  1st  2nd  3rd Crew 
##    8    8    8    8
unique(titanic$Class) # Levels are printed because Class is a factor (R's type for categorical data)
## [1] 1st  2nd  3rd  Crew
## Levels: 1st 2nd 3rd Crew
summary(titanic$Class)
##  1st  2nd  3rd Crew 
##    8    8    8    8
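Because each row of titanic is a combination of categories rather than a single person, table() above counts rows (8 per class), not people. One possible way to get head counts instead is to sum the Freq column within each class; the name people_by_class is made up for this sketch.

```r
titanic <- as.data.frame(Titanic)

# Sum the Freq column within each level of Class
people_by_class <- tapply(titanic$Freq, titanic$Class, sum)
people_by_class

# The same totals using the formula interface
xtabs(Freq ~ Class, data = titanic)
```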

Basic Graphs with ggplot2

ggplot2 is a data visualization package for R, created by Hadley Wickham in 2005. ggplot2 is based on the Grammar of Graphics, a general scheme for data visualization which breaks up graphs into components. Before going further, install and load the package:

install.packages("ggplot2") # or the tidyverse package, which includes ggplot2
library(ggplot2) # loads package

Components of ggplot2 Graphs

Based on the Grammar of Graphics, any graphic has 6 components. This week, we only touch on the 1st, 2nd and 6th components.

  1. data: What you want to visualize, including variables (columns) to be mapped to aesthetic attributes.
  2. geom: Geometric objects that are drawn to represent the data: bars, lines, points, etc.
  3. stats: Statistical transformations of the data, such as binning or averaging.
  4. scales: Map values in the data space to values in an aesthetic space (color, shape, size…)
  5. coord: Coordinate system; provides axes and gridlines to make it possible to read the graph.
  6. facets: Breaking up the data into subsets, to be displayed independently on a grid.
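To make the six components concrete, here is a rough sketch of one plot that spells each of them out explicitly. The specific choices (a linear fit, the "Dark2" palette, the x limits) are arbitrary and only there to give every component a visible role.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +                    # data and x/y mappings
  geom_point(aes(color = factor(cyl))) +                       # geom: points colored by cyl
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE) +    # stat: fitted straight line
  scale_color_brewer(palette = "Dark2") +                      # scale: maps cyl values to colors
  coord_cartesian(xlim = c(1, 6)) +                            # coord: Cartesian axes with x limits
  facet_wrap(~ gear)                                           # facets: one panel per gear value
p
```

In everyday use most of these are left at their defaults, which is why the examples in this post usually name only the data, a geom and sometimes a facet.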

Types of Plots With ggplot2

  1. For one categorical variable, use a bar plot
ggplot(mtcars, aes(x = factor(cyl))) + # data component; factor() treats the numeric cyl as categorical
  geom_bar() # geom component

  2. For one continuous variable, use a histogram or a density plot
ggplot(mtcars, aes(x = mpg)) +  # data component
  geom_histogram() # geom component

ggplot(mtcars, aes(x = mpg)) +
  geom_density()

  3. For two categorical variables, use a bar plot for one variable and show the other variable with fill colors.
ggplot(mtcars, aes(x = factor(gear), fill = factor(cyl))) +  # data component
  geom_bar() # geom component

The position argument of geom_bar() makes it possible to create fill (proportional) and dodge (side-by-side) bar plots. See the following examples:

ggplot(mtcars, aes(x = factor(gear), fill = factor(cyl))) +
  geom_bar(position = "fill")

ggplot(mtcars, aes(x = factor(gear), fill = factor(cyl))) +
  geom_bar(position = "dodge")

  4. For one categorical and one continuous variable, use a boxplot or a density plot (there are also swarm plots, strip plots and violin plots)
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +  # Note continuous variable is on y axis
  geom_boxplot()

ggplot(mtcars, aes(x = mpg, fill = factor(cyl))) + # Note continuous variable is on x axis
  geom_density(alpha = .5) # using alpha we assign 50% color transparency

  5. For two continuous variables, use a scatter plot
ggplot(mtcars, aes(x = wt, y = mpg)) +  # data component
  geom_point() # geom component

  6. For two categorical and one continuous variable, use a boxplot and use color or facets to show the second categorical variable. It is also possible to use a density plot, geom_density(), with facets (see above for an example of a density plot).
ggplot(mtcars, aes(x = factor(cyl), y = mpg, col = factor(gear))) +
  geom_boxplot() +
  facet_wrap(~factor(gear)) # facet component

  7. For two continuous and one categorical variable, use a scatter plot and use color or facets
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~factor(cyl)) # facet component

  8. For three continuous variables, one option is a 3D scatter plot, which is not available in ggplot2, but it is possible to use color or size to visualize the third continuous variable (shape only works for categorical variables).
ggplot(mtcars, aes(x = wt, y = mpg, size = disp)) +
  geom_point(alpha = .5) # alpha controls transparency of points
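The violin and strip plots mentioned in the list for one categorical and one continuous variable have no example above. A minimal sketch of both, layered in one plot; the jitter width of 0.1 is an arbitrary choice.

```r
library(ggplot2)

p_violin <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_violin() +                        # violin: density shape of mpg per cyl group
  geom_jitter(width = 0.1, height = 0)   # strip plot: raw points, jittered horizontally only
p_violin
```

Setting height = 0 keeps the mpg values exact, so the jitter only spreads points sideways to reduce overlap.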

Apply a Different Theme

ggplot2 ships with a few ready-made themes that change the look of your plots:

ggplot(mtcars, aes(x = wt, y = mpg, color = disp)) +
  geom_point() + 
  theme_bw() # applies black and white theme
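Since a plot can be stored in a variable, it is easy to compare themes on the same plot. The sketch below tries two other built-in themes and then adjusts a single element with theme(); the variable names p and p_custom and the legend placement are chosen only for illustration.

```r
library(ggplot2)

# Store the plot once, then try different built-in themes on it
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = disp)) +
  geom_point()

p + theme_minimal()   # minimal theme with no background annotations
p + theme_classic()   # axis lines and no gridlines

# theme() adjusts individual elements on top of a complete theme
p_custom <- p + theme_bw() + theme(legend.position = "bottom")
p_custom
```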
