Linear regression with R
Jun 12, 2018 00:00 · 948 words · 5 minute read
Estimating linear regression
Before starting, I am going to load a number of utility libraries for this session.
library(tidyverse)
library(broom)
library(plotly)
I start with a simple linear regression using the mtcars dataset. I want to look at how the weight of a car (wt) is associated with how many miles it travels per gallon of gasoline (mpg), an indicator of fuel economy. I begin with summary statistics and a scatter plot, using base R's plot()
function for the latter.
summary(mtcars[, c("wt", "mpg")])
## wt mpg
## Min. :1.513 Min. :10.40
## 1st Qu.:2.581 1st Qu.:15.43
## Median :3.325 Median :19.20
## Mean :3.217 Mean :20.09
## 3rd Qu.:3.610 3rd Qu.:22.80
## Max. :5.424 Max. :33.90
plot(mtcars$wt, mtcars$mpg)
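For a quick numeric check of the association the plot suggests, the Pearson correlation can be computed with base R's cor() function:
cor(mtcars$wt, mtcars$mpg)
This comes out to roughly -0.87; as a sanity check, its square matches the R-squared of the regression reported below.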
Both the plot and the correlation suggest that heavier cars have lower mileage. Next, I regress mpg on weight using linear regression.
lm(mpg ~ wt, data = mtcars)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Coefficients:
## (Intercept) wt
## 37.285 -5.344
The output from the lm()
function quantifies a negative relationship between weight and mileage. However, it does not report other useful information such as standard errors and the R-squared, which become available when the fitted model is passed to the summary()
function.
summary(lm(mpg ~ wt, data = mtcars))
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
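Note that summary() reports standard errors rather than confidence intervals themselves. If the intervals are needed, they can be obtained by passing the fitted model to base R's confint() function:
confint(lm(mpg ~ wt, data = mtcars))
This yields approximately (33.45, 41.12) for the intercept and (-6.49, -4.20) for wt, i.e. each estimate plus or minus about 2.04 (the 0.975 quantile of a t-distribution with 30 degrees of freedom) times its standard error.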
Furthermore, augment()
from the broom package gives access to the fitted values and residuals of this regression model, in the form of a data.frame.
broom::augment(lm(mpg ~ wt, data = mtcars)) %>% head()
## .rownames mpg wt .fitted .se.fit .resid .hat
## 1 Mazda RX4 21.0 2.620 23.28261 0.6335798 -2.2826106 0.04326896
## 2 Mazda RX4 Wag 21.0 2.875 21.91977 0.5714319 -0.9197704 0.03519677
## 3 Datsun 710 22.8 2.320 24.88595 0.7359177 -2.0859521 0.05837573
## 4 Hornet 4 Drive 21.4 3.215 20.10265 0.5384424 1.2973499 0.03125017
## 5 Hornet Sportabout 18.7 3.440 18.90014 0.5526562 -0.2001440 0.03292182
## 6 Valiant 18.1 3.460 18.79325 0.5552829 -0.6932545 0.03323551
## .sigma .cooksd .std.resid
## 1 3.067494 1.327407e-02 -0.76616765
## 2 3.093068 1.723963e-03 -0.30743051
## 3 3.072127 1.543937e-02 -0.70575249
## 4 3.088268 3.020558e-03 0.43275114
## 5 3.097722 7.599578e-05 -0.06681879
## 6 3.095184 9.210650e-04 -0.23148309
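Because augment() returns a plain data.frame, these columns can be piped straight into ggplot2. As one sketch of what this enables, a residuals-versus-fitted plot, a common diagnostic, takes just a few lines:
broom::augment(lm(mpg ~ wt, data = mtcars)) %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  theme_bw()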
There is also the tidy()
function from the broom package, which turns the regression estimates into a “tidy” data.frame.
broom::tidy(lm(mpg ~ wt, data = mtcars)) %>% head()
## term estimate std.error statistic p.value
## 1 (Intercept) 37.285126 1.877627 19.857575 8.241799e-19
## 2 wt -5.344472 0.559101 -9.559044 1.293959e-10
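If confidence intervals are wanted in this data.frame as well, tidy() accepts a conf.int argument:
broom::tidy(lm(mpg ~ wt, data = mtcars), conf.int = TRUE)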
Visualizing linear regression
The best way to visualize a simple linear regression is a scatter plot with the fitted regression line overlaid; since there is only one predictor, a 2-dimensional plot suffices. Here I use ggplot2’s ggplot()
function, which is made available through the tidyverse package.
ggplot(mtcars, aes(x = wt, y = mpg)) +  # weight on the x-axis, as in the regression mpg ~ wt
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_bw()
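Note that geom_smooth(method = "lm") refits the regression internally. An equivalent plot, sketched here by reusing the coefficients from the lm() fit above, draws the same line explicitly with geom_abline():
fit <- lm(mpg ~ wt, data = mtcars)
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_abline(intercept = coef(fit)[1], slope = coef(fit)[2]) +
  theme_bw()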
Visualizing a multiple regression is more challenging, as each variable in the model needs its own dimension. Since we cannot visualize more than 3 dimensions, a model with at most two right-hand-side variables can be plotted. Here, I plot weight, mpg, and horsepower using the plotly package, which creates interactive plots.
plotly::plot_ly(data = mtcars, x = ~wt, y = ~mpg, z = ~hp,
                type = "scatter3d", mode = "markers")
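The plot suggests that mileage falls with both weight and horsepower. The corresponding two-predictor model, which I have not estimated above, can be fit and tidied in exactly the same way:
broom::tidy(lm(mpg ~ wt + hp, data = mtcars))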