Week 3: Data Manipulation with dplyr

Jul 27, 2018 00:00 · 1506 words · 8 minute read r tips

About tidyverse

There are thousands of packages developed by equally large number of developers that provide new features or improve R’s features. One of the most influential and useful set of packages is tidyverse which is a collection of other packages for data manipulation, exploration and visualization. All tidyverse packages share an underlying design philosophy, grammar, and data structures. The following are some of the most important tidyverse packages:

  1. dplyr for data manipulation and data summary
  2. ggplot2 for data visualization
  3. tidyr for data reshape
  4. readr for fast and friendly data import/export
  5. stringr for string manipulation

In this post we forcus on dplyr which is perhaps the most useful of all tidyverse packages.

Pipe Operator

But before we jump to dplyr, let’s learn about pipe operator %>%. Pipe operator is arguably one of the most useful utilities in R language that is made available while loading any of the tidyverse, dplyr, or magrittr packages. The aim of pipe operator is to make reading and understanding R codes easier by expressing a complex operation into a sequence simple functions and steps.

library(dplyr) #library(tidyverse)

The pipe operator %>% takes a value or the result of an expression, and insert it into the next function. For example, head of mtcars data can be written in the following manner using the pipe operator, instead of head(mtcars):

mtcars %>% head()
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

It is also possible to add multiple pipe operators, with each forwarding the result of previous expression into the next function. It is also common to break the line after each pipe operator to help with readability. Here, we turn Titanic data into a data frame and then print its head:

Titanic %>%
  data.frame() %>%
  head()
##   Class    Sex   Age Survived Freq
## 1   1st   Male Child       No    0
## 2   2nd   Male Child       No    0
## 3   3rd   Male Child       No   35
## 4  Crew   Male Child       No    0
## 5   1st Female Child       No    0
## 6   2nd Female Child       No    0

If we were to not use pipe operators for the above cammand, we woud have to type, which is more difficult to read and understand.

head(data.frame(Titanic))
##   Class    Sex   Age Survived Freq
## 1   1st   Male Child       No    0
## 2   2nd   Male Child       No    0
## 3   3rd   Male Child       No   35
## 4  Crew   Male Child       No    0
## 5   1st Female Child       No    0
## 6   2nd Female Child       No    0

About dplyr

dplyr package provides a set of functions that makes data manipulation extremely easy. There are many functions within dplyr package, but we will focus on a handful of them that are considered a MUST.

  • select() chooses variables/columns
  • slice() chooses observations/rows
  • filter() chooses observations/rows based on a logical condition
  • mutate() adds new variables that are functions of existing ones
  • arrange() sorts the data frame based on a column’s value
  • summarize() reduces data frame to a single summary statistic
  • tally() counts number of rows
  • group_by() changes the ordering of the observations/rows

Select

This function subsets a data frame by its columns or variables. select() takes name of data frame as its first argument, followed by columns we want to select or unselect.

Titanic %>% 
  data.frame() %>% 
  select(Sex, Survived, Freq) %>%
  head()
##      Sex Survived Freq
## 1   Male       No    0
## 2   Male       No    0
## 3   Male       No   35
## 4   Male       No    0
## 5 Female       No    0
## 6 Female       No    0

There are a few functions that allows selecting variables based on their name or location in a data frame. The following are a non-exhaustive list. See select_helpers in the help or type ?select_helpers for a complete list.

  • starts_with() selects variables that their names start with a string specified inside this function
  • ends_with() selects variables that their names end with a string specified inside this function
  • contains() selects variables that their name contains a string specified inside this function
Titanic %>%
  data.frame() %>%
  select(starts_with("s")) %>%
  head()
##      Sex Survived
## 1   Male       No
## 2   Male       No
## 3   Male       No
## 4   Male       No
## 5 Female       No
## 6 Female       No

Slice

Titanic %>% 
data.frame() %>% 
slice(1:10)
##    Class    Sex   Age Survived Freq
## 1    1st   Male Child       No    0
## 2    2nd   Male Child       No    0
## 3    3rd   Male Child       No   35
## 4   Crew   Male Child       No    0
## 5    1st Female Child       No    0
## 6    2nd Female Child       No    0
## 7    3rd Female Child       No   17
## 8   Crew Female Child       No    0
## 9    1st   Male Adult       No  118
## 10   2nd   Male Adult       No  154

Filter

Titanic %>% 
data.frame() %>% 
filter(Age == "Child")
##    Class    Sex   Age Survived Freq
## 1    1st   Male Child       No    0
## 2    2nd   Male Child       No    0
## 3    3rd   Male Child       No   35
## 4   Crew   Male Child       No    0
## 5    1st Female Child       No    0
## 6    2nd Female Child       No    0
## 7    3rd Female Child       No   17
## 8   Crew Female Child       No    0
## 9    1st   Male Child      Yes    5
## 10   2nd   Male Child      Yes   11
## 11   3rd   Male Child      Yes   13
## 12  Crew   Male Child      Yes    0
## 13   1st Female Child      Yes    1
## 14   2nd Female Child      Yes   13
## 15   3rd Female Child      Yes   14
## 16  Crew Female Child      Yes    0

Mutate

Creates a new variable and adds it to the data frame.

mtcars %>% 
mutate(model = rownames(mtcars))
##     mpg cyl  disp  hp drat    wt  qsec vs am gear carb               model
## 1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4           Mazda RX4
## 2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4       Mazda RX4 Wag
## 3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1          Datsun 710
## 4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1      Hornet 4 Drive
## 5  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2   Hornet Sportabout
## 6  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1             Valiant
## 7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4          Duster 360
## 8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2           Merc 240D
## 9  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2            Merc 230
## 10 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4            Merc 280
## 11 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4           Merc 280C
## 12 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3          Merc 450SE
## 13 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3          Merc 450SL
## 14 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3         Merc 450SLC
## 15 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4  Cadillac Fleetwood
## 16 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4 Lincoln Continental
## 17 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4   Chrysler Imperial
## 18 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1            Fiat 128
## 19 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2         Honda Civic
## 20 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1      Toyota Corolla
## 21 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1       Toyota Corona
## 22 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2    Dodge Challenger
## 23 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2         AMC Javelin
## 24 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4          Camaro Z28
## 25 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2    Pontiac Firebird
## 26 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1           Fiat X1-9
## 27 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2       Porsche 914-2
## 28 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2        Lotus Europa
## 29 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4      Ford Pantera L
## 30 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6        Ferrari Dino
## 31 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8       Maserati Bora
## 32 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2          Volvo 142E
tweet Share