Week 3: Data Manipulation with dplyr
Jul 27, 2018 00:00 · 1506 words · 8 minute read
About tidyverse
There are thousands of packages developed by equally large number of developers that provide new features or improve R’s features. One of the most influential and useful set of packages is tidyverse which is a collection of other packages for data manipulation, exploration and visualization. All tidyverse packages share an underlying design philosophy, grammar, and data structures. The following are some of the most important tidyverse packages:
- dplyr for data manipulation and data summary
- ggplot2 for data visualization
- tidyr for data reshape
- readr for fast and friendly data import/export
- stringr for string manipulation
In this post we forcus on dplyr which is perhaps the most useful of all tidyverse packages.
Pipe Operator
But before we jump to dplyr, let’s learn about pipe operator %>%. Pipe operator is arguably one of the most useful utilities in R language that is made available while loading any of the tidyverse, dplyr, or magrittr packages. The aim of pipe operator is to make reading and understanding R codes easier by expressing a complex operation into a sequence simple functions and steps.
library(dplyr) #library(tidyverse)
The pipe operator %>% takes a value or the result of an expression, and insert it into the next function. For example, head of mtcars data can be written in the following manner using the pipe operator, instead of head(mtcars)
:
mtcars %>% head()
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
It is also possible to add multiple pipe operators, with each forwarding the result of previous expression into the next function. It is also common to break the line after each pipe operator to help with readability. Here, we turn Titanic data into a data frame and then print its head:
Titanic %>%
data.frame() %>%
head()
## Class Sex Age Survived Freq
## 1 1st Male Child No 0
## 2 2nd Male Child No 0
## 3 3rd Male Child No 35
## 4 Crew Male Child No 0
## 5 1st Female Child No 0
## 6 2nd Female Child No 0
If we were to not use pipe operators for the above cammand, we woud have to type, which is more difficult to read and understand.
head(data.frame(Titanic))
## Class Sex Age Survived Freq
## 1 1st Male Child No 0
## 2 2nd Male Child No 0
## 3 3rd Male Child No 35
## 4 Crew Male Child No 0
## 5 1st Female Child No 0
## 6 2nd Female Child No 0
About dplyr
dplyr package provides a set of functions that makes data manipulation extremely easy. There are many functions within dplyr package, but we will focus on a handful of them that are considered a MUST.
select()
chooses variables/columnsslice()
chooses observations/rowsfilter()
chooses observations/rows based on a logical conditionmutate()
adds new variables that are functions of existing onesarrange()
sorts the data frame based on a column’s valuesummarize()
reduces data frame to a single summary statistictally()
counts number of rowsgroup_by()
changes the ordering of the observations/rows
Select
This function subsets a data frame by its columns or variables. select()
takes name of data frame as its first argument, followed by columns we want to select or unselect.
Titanic %>%
data.frame() %>%
select(Sex, Survived, Freq) %>%
head()
## Sex Survived Freq
## 1 Male No 0
## 2 Male No 0
## 3 Male No 35
## 4 Male No 0
## 5 Female No 0
## 6 Female No 0
There are a few functions that allows selecting variables based on their name or location in a data frame. The following are a non-exhaustive list. See select_helpers in the help or type ?select_helpers for a complete list.
starts_with()
selects variables that their names start with a string specified inside this functionends_with()
selects variables that their names end with a string specified inside this functioncontains()
selects variables that their name contains a string specified inside this function
Titanic %>%
data.frame() %>%
select(starts_with("s")) %>%
head()
## Sex Survived
## 1 Male No
## 2 Male No
## 3 Male No
## 4 Male No
## 5 Female No
## 6 Female No
Slice
Titanic %>%
data.frame() %>%
slice(1:10)
## Class Sex Age Survived Freq
## 1 1st Male Child No 0
## 2 2nd Male Child No 0
## 3 3rd Male Child No 35
## 4 Crew Male Child No 0
## 5 1st Female Child No 0
## 6 2nd Female Child No 0
## 7 3rd Female Child No 17
## 8 Crew Female Child No 0
## 9 1st Male Adult No 118
## 10 2nd Male Adult No 154
Filter
Titanic %>%
data.frame() %>%
filter(Age == "Child")
## Class Sex Age Survived Freq
## 1 1st Male Child No 0
## 2 2nd Male Child No 0
## 3 3rd Male Child No 35
## 4 Crew Male Child No 0
## 5 1st Female Child No 0
## 6 2nd Female Child No 0
## 7 3rd Female Child No 17
## 8 Crew Female Child No 0
## 9 1st Male Child Yes 5
## 10 2nd Male Child Yes 11
## 11 3rd Male Child Yes 13
## 12 Crew Male Child Yes 0
## 13 1st Female Child Yes 1
## 14 2nd Female Child Yes 13
## 15 3rd Female Child Yes 14
## 16 Crew Female Child Yes 0
Mutate
Creates a new variable and adds it to the data frame.
mtcars %>%
mutate(model = rownames(mtcars))
## mpg cyl disp hp drat wt qsec vs am gear carb model
## 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4
## 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Mazda RX4 Wag
## 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 Datsun 710
## 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 Hornet 4 Drive
## 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 Hornet Sportabout
## 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 Valiant
## 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 Duster 360
## 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 Merc 240D
## 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 Merc 230
## 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 Merc 280
## 11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 Merc 280C
## 12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 Merc 450SE
## 13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 Merc 450SL
## 14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 Merc 450SLC
## 15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 Cadillac Fleetwood
## 16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 Lincoln Continental
## 17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 Chrysler Imperial
## 18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 Fiat 128
## 19 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 Honda Civic
## 20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 Toyota Corolla
## 21 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 Toyota Corona
## 22 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 Dodge Challenger
## 23 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 AMC Javelin
## 24 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 Camaro Z28
## 25 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 Pontiac Firebird
## 26 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 Fiat X1-9
## 27 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 Porsche 914-2
## 28 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 Lotus Europa
## 29 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 Ford Pantera L
## 30 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 Ferrari Dino
## 31 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 Maserati Bora
## 32 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 Volvo 142E