How to Calculate a Linear Regression
What Is Linear Regression?
Simple linear regression models the relationship between a single predictor variable x and a continuous response variable y using the equation ŷ = β₀ + β₁x, where β₀ is the y-intercept and β₁ is the slope. The goal is to find the line that best fits the data by minimizing the sum of squared residuals (the vertical distances between observed and predicted values). This method is called ordinary least squares (OLS).
Formulas for Slope and Intercept
The least-squares slope is: β₁ = Σ[(xᵢ − x̄)(yᵢ − ȳ)] / Σ(xᵢ − x̄)², which is the covariance of x and y divided by the variance of x. The intercept is: β₀ = ȳ − β₁ · x̄. These two formulas guarantee that the regression line always passes through the point (x̄, ȳ), the centroid of the data.
Step-by-Step Calculation
Step 1: Compute the means x̄ and ȳ.
Step 2: For each pair (xᵢ, yᵢ), compute (xᵢ − x̄), (yᵢ − ȳ), their product, and (xᵢ − x̄)².
Step 3: Sum the products to get the numerator of β₁, and sum the squared x-deviations to get the denominator.
Step 4: Compute β₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)².
Step 5: Compute β₀ = ȳ − β₁ · x̄.
Step 6: Write the regression equation ŷ = β₀ + β₁x.
Worked Example
Using the data pairs (x, y): (1,3), (2,5), (3,4), (4,7), (5,8): x̄ = 3, ȳ = 5.4. The numerator Σ(xᵢ − x̄)(yᵢ − ȳ) = (−2)(−2.4) + (−1)(−0.4) + (0)(−1.4) + (1)(1.6) + (2)(2.6) = 4.8 + 0.4 + 0 + 1.6 + 5.2 = 12. Σ(xᵢ − x̄)² = 4+1+0+1+4 = 10. So β₁ = 12/10 = 1.2 and β₀ = 5.4 − 1.2(3) = 1.8. The regression equation is ŷ = 1.8 + 1.2x.
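The steps above can be checked with a short script. This is a minimal sketch in plain Python (no libraries assumed), using the same five data pairs as the worked example:

```python
# Ordinary least squares for one predictor, following steps 1-6 above.
xs = [1, 2, 3, 4, 5]   # x values from the worked example
ys = [3, 5, 4, 7, 8]   # y values from the worked example

n = len(xs)
x_bar = sum(xs) / n    # step 1: mean of x = 3.0
y_bar = sum(ys) / n    # step 1: mean of y = 5.4

# Steps 2-3: sum of deviation products and sum of squared x-deviations.
num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # 12.0
den = sum((x - x_bar) ** 2 for x in xs)                        # 10.0

slope = num / den                   # step 4: beta_1 = 12 / 10 = 1.2
intercept = y_bar - slope * x_bar   # step 5: beta_0 = 5.4 - 1.2*3 = 1.8

print(f"y-hat = {intercept:.1f} + {slope:.1f}x")  # step 6
```

Running it reproduces the hand calculation: ŷ = 1.8 + 1.2x.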
Measuring Model Fit with R²
The coefficient of determination R² measures how well the regression line explains the variability in y. It is computed as R² = 1 − (SS_res / SS_tot), where SS_res = Σ(yᵢ − ŷᵢ)² (sum of squared residuals) and SS_tot = Σ(yᵢ − ȳ)² (total sum of squares). For an OLS model with an intercept, R² ranges from 0 to 1; an R² of 0.85 means the linear model explains 85% of the variation in y. For simple linear regression, R² equals the square of the Pearson correlation coefficient r.
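Continuing the worked example, R² can be computed directly from the fitted line ŷ = 1.8 + 1.2x; a minimal sketch in plain Python:

```python
# R-squared for the fitted line y-hat = 1.8 + 1.2x on the example data.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 4, 7, 8]
b0, b1 = 1.8, 1.2                    # coefficients from the worked example

y_bar = sum(ys) / len(ys)
preds = [b0 + b1 * x for x in xs]    # fitted values y-hat_i

ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))  # Σ(yᵢ − ŷᵢ)² = 2.8
ss_tot = sum((y - y_bar) ** 2 for y in ys)             # Σ(yᵢ − ȳ)² = 17.2

r_squared = 1 - ss_res / ss_tot      # 1 − 2.8/17.2 ≈ 0.837
print(round(r_squared, 3))
```

So the line explains roughly 84% of the variation in y for this small data set.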
Checking Regression Assumptions
OLS regression relies on linearity (the true relationship is linear), independence of observations, and homoscedasticity (constant variance of residuals across all levels of x). Normally distributed residuals are not needed for the coefficient estimates themselves, but they are needed for valid hypothesis tests and confidence intervals. Diagnostic plots (residuals vs. fitted values, a Q-Q plot of residuals, and a scale-location plot) help you assess these assumptions. Violations may call for transformations (e.g., taking the log of y) or alternative methods such as weighted least squares.
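A full diagnostic normally uses plots, but the raw ingredients are just fitted values and residuals. This sketch computes them for the example fit; the split-half spread comparison at the end is an informal, made-up stand-in for a scale-location plot, not a formal test:

```python
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 4, 7, 8]
b0, b1 = 1.8, 1.2                    # coefficients from the worked example

fitted = [b0 + b1 * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]

# OLS with an intercept forces the residuals to sum to (numerically) zero.
print(sum(residuals))

# Informal spread check: compare residual magnitudes for low x vs. high x.
# A large imbalance would hint at heteroscedasticity.
half = len(xs) // 2
low_spread = max(abs(r) for r in residuals[:half])
high_spread = max(abs(r) for r in residuals[half:])
print(low_spread, high_spread)
```

These fitted/residual pairs are exactly what a residuals-vs-fitted plot would display.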
Multiple Regression Overview
Multiple linear regression extends the model to include two or more predictors: ŷ = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ. The coefficients are estimated simultaneously using matrix algebra: β = (XᵀX)⁻¹Xᵀy. Adjusted R² is preferred over R² for multiple regression because it penalizes for adding predictors that do not improve fit. Multicollinearity — high correlation among predictors — can inflate standard errors and make coefficient estimates unstable.
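The normal-equations formula β = (XᵀX)⁻¹Xᵀy can be sketched in plain Python by solving the linear system (XᵀX)β = Xᵀy with Gaussian elimination. The two-predictor data set below is made up for illustration: it satisfies y = 1 + 2x₁ + 3x₂ exactly, so the recovered coefficients should be 1, 2, and 3:

```python
def fit_multiple(x_rows, y):
    """Least-squares fit via the normal equations (X^T X) beta = X^T y."""
    X = [[1.0] + list(r) for r in x_rows]   # prepend an intercept column
    n, k = len(X), len(X[0])

    # Form X^T X and X^T y.
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)]
           for a in range(k)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]

    # Solve the k x k system by Gaussian elimination with partial pivoting.
    A = [XtX[a][:] + [Xty[a]] for a in range(k)]    # augmented matrix
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k + 1):
                A[r][c] -= f * A[col][c]

    # Back-substitution.
    beta = [0.0] * k
    for r in range(k - 1, -1, -1):
        beta[r] = (A[r][k] - sum(A[r][c] * beta[c] for c in range(r + 1, k))) / A[r][r]
    return beta

# Hypothetical data generated from y = 1 + 2*x1 + 3*x2 (exact fit).
beta = fit_multiple([(1, 1), (2, 1), (1, 2), (3, 2), (2, 3)],
                    [6, 8, 9, 13, 14])
print(beta)
```

In practice you would use a numerical library rather than hand-rolled elimination (inverting XᵀX directly is numerically fragile when predictors are highly correlated, which is the multicollinearity problem described above), but the sketch shows what those libraries compute.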