Let’s explore simple linear regression with a classic example: predicting a car’s stopping distance (\(y\)) from its speed (\(x\)) using R’s built-in cars dataset. Imagine you’re driving: the faster you go, the longer it takes to stop. This relationship isn’t perfect (other factors like road conditions matter too), but we can model the average trend. Simple linear regression does this by fitting a straight line through the data points. The line’s equation looks like \[\text{distance} = \text{intercept} + \text{slope} \times \text{speed} + \text{random error}.\] Here, the intercept is the predicted distance when speed is 0, and the slope tells us how much distance increases, on average, per additional mph. We estimate both by least squares, choosing the values that minimize the sum of squared residuals (the differences between observed and fitted distances).
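To make the idea of minimizing prediction errors concrete, here is the ordinary least squares criterion, which is what R’s lm() fits. The symbols \(\beta_0\) (intercept), \(\beta_1\) (slope), and the hats for estimates are notation introduced here for illustration; \(x_i\) and \(y_i\) are the speed and distance of the \(i\)-th observation.

\[(\hat{\beta}_0, \hat{\beta}_1) = \arg\min_{\beta_0,\,\beta_1} \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_i\right)^2\]

Solving this gives the closed-form estimates

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},\]

where \(\bar{x}\) and \(\bar{y}\) are the sample means of speed and distance.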
library(ggplot2)  # ggplot(), geom_point(), geom_smooth()
library(ggpubr)   # stat_regline_equation(), stat_cor()

# Scatterplot with regression line and correlation
ggplot(cars, aes(x = speed, y = dist)) +
  geom_point(color = "blue", size = 2) +                   # Plot data points
  geom_smooth(method = "lm", se = TRUE, color = "red") +   # Add regression line with confidence interval
  stat_regline_equation(label.y = 120) +                   # Display regression equation
  stat_cor(method = "pearson", label.y = 110) +            # Display Pearson correlation coefficient
  labs(
    title = "Speed vs. Stopping Distance",
    x = "Speed (mph)",
    y = "Stopping Distance (ft)"
  ) +
  theme_minimal()
# Step 2: Fit the linear model
model <- lm(dist ~ speed, data = cars)
summary(model)  # See coefficients
Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max
-29.069  -9.525  -2.272   9.215  43.201

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791     6.7584  -2.601   0.0123 *
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,	Adjusted R-squared:  0.6438
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
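As a sanity check, the two coefficients in the summary can be reproduced by hand. The sketch below assumes the standard closed-form least squares solution (slope = sum of cross-deviations divided by sum of squared deviations of speed; intercept = mean distance minus slope times mean speed). The variable names `x`, `y`, `slope`, `intercept`, and `r2` are illustrative, not part of the original analysis.

```r
# Reproduce the lm() coefficients with the closed-form least squares formulas
x <- cars$speed
y <- cars$dist

# Slope: co-variation of speed and distance relative to the variation in speed
slope <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: forces the fitted line through the point (mean(x), mean(y))
intercept <- mean(y) - slope * mean(x)

c(intercept = intercept, slope = slope)

# Compare with lm(); all.equal() allows for floating-point rounding
fit <- lm(dist ~ speed, data = cars)
all.equal(unname(coef(fit)), c(intercept, slope))

# In simple linear regression, R-squared is the squared Pearson correlation
r2 <- cor(x, y)^2
r2
```

The hand-computed values should agree with the `Estimate` column, and `r2` with the `Multiple R-squared` line, up to rounding.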
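For the cars data, the fitted line is \(\text{distance} = -17.58 + 3.93 \times \text{speed}\). This means that, on average, stopping distance increases by about 3.93 feet for each additional mph, and speed alone accounts for roughly 65% of the variation in stopping distance (R-squared = 0.6511). The negative intercept has no physical meaning: speed 0 lies outside the observed range of 4 to 25 mph, so the line shouldn’t be extrapolated that far. Before trusting the fit, we check whether the data “behaves” like a straight-line relationship: plot the points, see if they scatter evenly around the line, and make sure the residuals show no pattern (such as a curve or a funnel shape). If all looks good, we can estimate stopping distances for new speeds: at 20 mph, the model predicts \(-17.58 + 3.93 \times 20 \approx 61\) feet. The key takeaway: regression helps us model trends while acknowledging real-world randomness!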
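The residual check and the 20 mph prediction described above can be sketched in base R. This is one possible way to do it, not the only one: the residuals-versus-fitted plot is a common diagnostic chosen here for illustration, and the names `new_speed` and `pred` are hypothetical. The model is refit so the snippet runs on its own.

```r
# Refit the model so this snippet is self-contained
model <- lm(dist ~ speed, data = cars)

# Residuals vs. fitted values: look for curves (non-linearity)
# or funnels (non-constant spread) around the zero line
plot(fitted(model), resid(model),
     xlab = "Fitted stopping distance (ft)",
     ylab = "Residual (ft)",
     main = "Residuals vs. Fitted")
abline(h = 0, lty = 2)

# Predict stopping distance at 20 mph; newdata must use the same
# column name as the predictor in the model formula
new_speed <- data.frame(speed = 20)
pred <- predict(model, newdata = new_speed, interval = "prediction")
pred  # columns: fit (point prediction), lwr and upr (95% prediction interval)
```

The `fit` column gives the point prediction of about 61 feet; the `lwr` and `upr` columns show how much an individual stop at that speed could plausibly differ from the average trend.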