Simple Linear Regression

Let’s explore simple linear regression with a classic example: predicting a car’s stopping distance (\(y\)) based on its speed (\(x\)) using R’s cars dataset. Imagine you’re driving—the faster you go, the longer it takes to stop. This relationship isn’t perfect (other factors like road conditions matter too), but we can model the average trend. Simple linear regression does this by fitting a straight line through the data points. The line’s equation looks like \[\text{distance} = \text{intercept} + \text{slope} \times \text{speed} + \text{random error.}\] Here, the intercept is the predicted distance when speed is 0, and the slope tells us how many feet of stopping distance each additional mph adds. We estimate these coefficients by minimizing the squared prediction errors, called residuals.
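For reference, the least-squares estimates that R computes have simple closed forms (writing \(\hat\beta_0\) for the intercept and \(\hat\beta_1\) for the slope):

\[\hat\beta_1 = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i}(x_i - \bar{x})^2}, \qquad \hat\beta_0 = \bar{y} - \hat\beta_1\,\bar{x}.\]

Equivalently, the slope is the sample covariance of speed and distance divided by the sample variance of speed, and the fitted line always passes through the point of means \((\bar{x}, \bar{y})\).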

library(ggpubr)  # also loads ggplot2

# Step 1: Scatterplot with regression line and correlation
ggplot(cars, aes(x = speed, y = dist)) +
  geom_point(color = "blue", size = 2) +  # Plot data points
  geom_smooth(method = "lm", se = TRUE, color = "red") +  # Add regression line with confidence interval
  stat_regline_equation(label.y = 120) +  # Display regression equation 
  stat_cor(method = "pearson", label.y = 110) +  # Display Pearson correlation coefficient
  labs(
    title = "Speed vs. Stopping Distance",
    x = "Speed (mph)",
    y = "Stopping Distance (ft)"
  ) +
  theme_minimal()

# Step 2: Fit the linear model
model <- lm(dist ~ speed, data = cars) 
summary(model)  # Coefficient estimates, standard errors, and fit statistics

Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.069  -9.525  -2.272   9.215  43.201 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17.5791     6.7584  -2.601   0.0123 *  
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,    Adjusted R-squared:  0.6438 
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
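The coefficients in this output can be sanity-checked by hand, since the least-squares slope is just the covariance of speed and distance divided by the variance of speed. A quick sketch in base R:

```r
# Recompute the least-squares estimates from first principles
slope <- with(cars, cov(speed, dist) / var(speed))
intercept <- mean(cars$dist) - slope * mean(cars$speed)

slope      # ≈ 3.9324, matching the speed coefficient above
intercept  # ≈ -17.579, matching the (Intercept) estimate above
```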

For the cars data, the fitted line is \(\text{distance} = -17.58 + 3.93 \times \text{speed}\). This means, on average, stopping distance increases by ~3.93 feet per mph. But before trusting this, we check if the data “behaves” like a straight-line relationship: plot the points, see if they scatter evenly around the line, and ensure residuals aren’t patterned (like curves or funnels). If all looks good, we can estimate stopping distances for new speeds, like predicting ~61 feet at 20 mph. The key takeaway: regression helps us model trends while acknowledging real-world randomness!
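To make those last two steps concrete, here is a minimal sketch using base R’s standard tools: the built-in diagnostic plots for spotting patterned residuals, and predict() for estimating stopping distance at a new speed (the 20 mph prediction below reproduces the ~61 ft figure from the fitted line):

```r
model <- lm(dist ~ speed, data = cars)

# Residual diagnostics: look for curves (nonlinearity) or funnels (non-constant variance)
par(mfrow = c(1, 2))
plot(model, which = 1)  # Residuals vs. fitted values
plot(model, which = 2)  # Normal Q-Q plot of residuals
par(mfrow = c(1, 1))

# Predict stopping distance at 20 mph, with a 95% prediction interval
predict(model, newdata = data.frame(speed = 20), interval = "prediction")
# The "fit" column is about 61.07 ft
```

The prediction interval is much wider than the confidence band on the regression line, because it accounts for the scatter of individual cars around the trend, not just uncertainty in the line itself.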