Linear Regression

Top Interview Questions

About Linear Regression

 

Linear Regression: An In-Depth Overview

Linear Regression is one of the most fundamental and widely used algorithms in statistics and machine learning. It is primarily employed for predictive modeling, helping to understand the relationship between a dependent variable (target) and one or more independent variables (predictors). Despite its simplicity, linear regression serves as the foundation for more complex techniques and is essential for data analysis across industries such as finance, healthcare, marketing, and engineering.


1. Concept of Linear Regression

At its core, linear regression assumes a linear relationship between the dependent variable ( Y ) and independent variable(s) ( X ). Mathematically, the simplest form is expressed as:

[
Y = \beta_0 + \beta_1 X + \epsilon
]

Where:

  • ( Y ) = Dependent variable (outcome we want to predict)

  • ( X ) = Independent variable (predictor or feature)

  • ( \beta_0 ) = Intercept (value of Y when X=0)

  • ( \beta_1 ) = Slope (amount by which Y changes for a unit change in X)

  • ( \epsilon ) = Error term (difference between actual and predicted values)

When there is more than one independent variable, it becomes Multiple Linear Regression:

[
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_n X_n + \epsilon
]

This allows modeling of more complex real-world scenarios.
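
A minimal sketch in Python (scikit-learn, with a small made-up dataset, so the numbers are purely illustrative) of fitting both forms:

# Minimal sketch: simple and multiple linear regression with scikit-learn.
# The data below is invented purely for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

# Simple linear regression: one predictor (e.g., years of experience -> salary)
X_simple = np.array([[1], [2], [3], [4], [5]])            # single feature column
y = np.array([35000, 42000, 50000, 58000, 66000])

simple_model = LinearRegression().fit(X_simple, y)
print("Intercept (beta_0):", simple_model.intercept_)
print("Slope (beta_1):", simple_model.coef_[0])

# Multiple linear regression: several predictors per observation
X_multi = np.array([
    [1, 2.5],    # e.g., [experience, skill rating]
    [2, 3.0],
    [3, 3.5],
    [4, 4.0],
    [5, 4.5],
])
multi_model = LinearRegression().fit(X_multi, y)
print("Coefficients (beta_1 ... beta_n):", multi_model.coef_)
print("Prediction for a new observation:", multi_model.predict([[6, 5.0]]))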


2. Assumptions of Linear Regression

Linear regression works well under certain assumptions. Understanding these assumptions is critical because violation can lead to misleading results:

  1. Linearity: The relationship between the independent and dependent variables should be linear.

  2. Independence: Observations must be independent of each other.

  3. Homoscedasticity: Constant variance of errors ((\epsilon)) across all levels of independent variables.

  4. Normality of Errors: The residuals (differences between actual and predicted values) should be normally distributed.

  5. No multicollinearity (for multiple regression): Independent variables should not be highly correlated with each other.

If these assumptions hold, linear regression produces unbiased, consistent, and efficient estimates of the coefficients.


3. Types of Linear Regression

Linear regression can be broadly categorized into:

  1. Simple Linear Regression:
    Involves one independent variable predicting a dependent variable. Example: Predicting a person’s weight based on their height.

  2. Multiple Linear Regression:
    Involves multiple independent variables. Example: Predicting house prices based on size, location, and number of bedrooms.

  3. Polynomial Regression (technically an extension):
    Models nonlinear relationships by introducing polynomial terms of predictors. Example: ( Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon )

  4. Ridge and Lasso Regression (Regularized Linear Regression):
    These methods add penalties to prevent overfitting in models with many variables. Ridge uses ( L2 ) penalty, Lasso uses ( L1 ) penalty.


4. Objective and Cost Function

The goal of linear regression is to fit the best line that minimizes the error between predicted and actual values. The most common approach is Ordinary Least Squares (OLS), which minimizes the sum of squared errors (SSE):

[
SSE = \sum_{i=1}^{n} (Y_i - \hat{Y_i})^2
]

Where:

  • ( Y_i ) = Actual value

  • ( \hat{Y_i} ) = Predicted value

Minimizing SSE ensures that predictions are as close as possible to actual outcomes. For large datasets, gradient descent can also be used to adjust the coefficients iteratively.
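
As a sketch of how OLS can be solved directly, the coefficients that minimize SSE satisfy the normal equations ( (X^T X)\beta = X^T Y ); below is a numpy version on synthetic data (all names and values are illustrative):

# Sketch: solving OLS through the normal equations with numpy (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0, 10, size=n)
y = 3.0 + 2.0 * x + rng.normal(0, 1, size=n)   # true beta_0 = 3, beta_1 = 2, plus noise

X = np.column_stack([np.ones(n), x])           # design matrix with an intercept column
beta = np.linalg.solve(X.T @ X, X.T @ y)       # solve (X^T X) beta = X^T y
print("Estimated beta_0, beta_1:", beta)

sse = np.sum((y - X @ beta) ** 2)              # sum of squared errors at the optimum
print("SSE:", sse)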


5. Model Evaluation Metrics

Evaluating a linear regression model requires assessing how well the model predicts the dependent variable. Common metrics include:

  1. R-squared (( R^2 )):
    Measures the proportion of variance in the dependent variable explained by independent variables. Ranges from 0 to 1.

[
R^2 = 1 - \frac{\sum (Y_i - \hat{Y_i})^2}{\sum (Y_i - \bar{Y})^2}
]

  2. Adjusted R-squared:
    Adjusts R-squared for the number of predictors in the model, preventing overestimation of model performance in multiple regression.

  3. Mean Absolute Error (MAE):
    Average of absolute differences between predicted and actual values.

[
MAE = \frac{1}{n} \sum_{i=1}^{n} |Y_i - \hat{Y_i}|
]

  4. Mean Squared Error (MSE) & Root Mean Squared Error (RMSE):
    Measures the average squared difference. RMSE is the square root of MSE and is in the same unit as the dependent variable.

[
MSE = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y_i})^2
]
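
A short sketch of computing these metrics in Python with scikit-learn (y_true and y_pred below are placeholder arrays):

# Sketch: computing common regression metrics (placeholder arrays for illustration).
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.8, 5.4, 7.0, 10.5])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                      # same units as the dependent variable
r2 = r2_score(y_true, y_pred)

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R^2={r2:.3f}")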


6. Steps in Building a Linear Regression Model

  1. Data Collection: Gather relevant data for predictors and target variables.

  2. Data Preprocessing: Handle missing values, outliers, and categorical variables (encoding).

  3. Exploratory Data Analysis (EDA): Visualize relationships using scatter plots, correlation matrices, etc.

  4. Splitting Data: Divide dataset into training and testing sets to evaluate performance.

  5. Model Training: Fit a linear regression model using OLS or gradient descent.

  6. Model Evaluation: Assess performance using metrics like R-squared, MAE, RMSE.

  7. Model Interpretation: Examine coefficients to understand the impact of each variable.

  8. Prediction: Use the model to predict new observations.
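
Roughly, these steps might look like the following sketch (synthetic data standing in for a real dataset; the values are invented):

# Sketch of the end-to-end workflow on synthetic data (illustrative only).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# 1-3. Collect / preprocess / explore (here: generate clean synthetic data)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 4.0 + X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.5, size=200)

# 4. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 5. Train the model (OLS under the hood)
model = LinearRegression().fit(X_train, y_train)

# 6. Evaluate on the held-out test set
y_pred = model.predict(X_test)
print("MAE:", mean_absolute_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))

# 7. Interpret the fitted coefficients
print("Intercept:", model.intercept_, "Coefficients:", model.coef_)

# 8. Predict a new observation
print("New prediction:", model.predict([[0.1, -0.2, 0.3]]))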


7. Advantages of Linear Regression

  1. Simplicity and Interpretability: Easy to understand and explain.

  2. Computational Efficiency: Requires less computation than complex models.

  3. Basis for Other Models: Foundation for logistic regression, generalized linear models, and regularized regression.

  4. Predictive Power: Works well for linear relationships and moderate datasets.


8. Limitations of Linear Regression

  1. Assumes Linearity: Fails if relationships are nonlinear.

  2. Sensitive to Outliers: Extreme values can significantly impact the model.

  3. Multicollinearity: Highly correlated predictors distort coefficient estimates.

  4. Overfitting/Underfitting: May overfit if too many predictors or underfit if relationships are complex.

  5. Cannot Model Complex Patterns: Limited compared to tree-based or neural network models.


9. Applications of Linear Regression

  1. Finance: Predicting stock prices, loan default rates, or credit scores.

  2. Healthcare: Estimating disease progression, patient outcomes, or drug effectiveness.

  3. Marketing: Forecasting sales, customer behavior, or advertising effectiveness.

  4. Economics: Modeling economic indicators like GDP growth, inflation, or unemployment.

  5. Engineering: Predicting material strength, energy consumption, or equipment failure.

Linear regression’s versatility makes it applicable in almost any field where quantitative prediction is needed.

Fresher Interview Questions

 

1. What is Linear Regression?

Answer:
Linear Regression is a supervised machine learning algorithm used to predict a continuous numerical value based on one or more input variables.

It finds the best-fit straight line that represents the relationship between:

  • Independent variable(s) (X)

  • Dependent variable (Y)

Example:
Predicting salary based on years of experience.


2. What is the equation of Linear Regression?

Answer:
The equation of a straight line:

[
Y = mX + c
]

Where:

  • Y = dependent variable (output)

  • X = independent variable (input)

  • m = slope (how much Y changes when X changes)

  • c = intercept (value of Y when X = 0)


3. What is Simple Linear Regression?

Answer:
Simple Linear Regression involves:

  • One independent variable

  • One dependent variable

Example:
Salary = f(Experience)

[
Salary = m \times Experience + c
]


4. What is Multiple Linear Regression?

Answer:
Multiple Linear Regression uses more than one independent variable.

Equation:

[
Y = b_0 + b_1X_1 + b_2X_2 + ... + b_nX_n
]

Example:
House price based on:

  • Area

  • Location

  • Number of rooms


5. What are the assumptions of Linear Regression?

Answer:
Linear Regression works best when the following assumptions are met:

  1. Linearity – Relationship between X and Y is linear

  2. Independence – Observations are independent

  3. Homoscedasticity – Constant variance of errors

  4. Normality – Residuals are normally distributed

  5. No multicollinearity – Independent variables are not highly correlated


6. What is the cost function in Linear Regression?

Answer:
The cost function measures how well the model fits the data.

Most commonly used:

Mean Squared Error (MSE)

[
MSE = \frac{1}{n} \sum (Y_{actual} - Y_{predicted})^2
]

The goal is to minimize this error.


7. What is Gradient Descent?

Answer:
Gradient Descent is an optimization algorithm used to find the best values of m and c by minimizing the cost function.

Steps:

  1. Start with random values of m and c

  2. Calculate error

  3. Update m and c in the direction of minimum error

  4. Repeat until convergence
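
A from-scratch sketch of these steps for the single-variable case (the data, learning rate, and iteration count are arbitrary choices for illustration):

# Sketch: gradient descent for y = m*x + c on illustrative data.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.2, 7.1, 8.9, 11.2, 12.8])   # roughly y = 2x + 3

m, c = 0.0, 0.0           # 1. start with (arbitrary) initial values
alpha = 0.01              # learning rate
n = len(x)

for _ in range(5000):                      # 4. repeat until (approximate) convergence
    y_pred = m * x + c
    error = y_pred - y                     # 2. calculate the error
    dm = (2.0 / n) * np.sum(error * x)     # gradient of MSE with respect to m
    dc = (2.0 / n) * np.sum(error)         # gradient of MSE with respect to c
    m -= alpha * dm                        # 3. update m and c towards lower error
    c -= alpha * dc

print("m:", round(m, 3), "c:", round(c, 3))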


8. What is Learning Rate?

Answer:
Learning rate (α) controls the size of each update step during gradient descent.

  • Too large → overshoots minimum

  • Too small → slow convergence


9. What are residuals?

Answer:
Residual = Actual value − Predicted value

[
Residual = Y_{actual} - Y_{predicted}
]

Residuals help evaluate model accuracy.


10. What is R-squared (R²)?

Answer:
R² shows how much variance in the dependent variable is explained by the model.

  • R² = 1 → perfect fit

  • R² = 0 → no explanatory power

Example:
R² = 0.85 means 85% of the variation is explained.


11. What is Adjusted R-squared?

Answer:
Adjusted R² adjusts R² based on the number of independent variables.

It penalizes adding irrelevant features.


12. Difference between R² and Adjusted R²

R²                                     Adjusted R²
Increases when variables are added     Increases only if variables are useful
Can be misleading                      More reliable

13. What is Multicollinearity?

Answer:
Multicollinearity occurs when independent variables are highly correlated with each other.

Problems:

  • Unstable coefficients

  • Difficult interpretation

Solution:

  • Remove correlated variables

  • Use VIF (Variance Inflation Factor)


14. What is VIF?

Answer:
VIF measures how much multicollinearity inflates the variance of a coefficient estimate.

  • VIF < 5 → acceptable

  • VIF > 10 → serious multicollinearity
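
A sketch of computing VIF per feature with statsmodels (the DataFrame and its column names are invented for illustration):

# Sketch: VIF per feature with statsmodels (invented columns; one near-duplicate feature).
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "area_sqft": rng.normal(1500, 300, 100),
    "rooms": rng.integers(1, 6, 100).astype(float),
})
df["area_sqm"] = df["area_sqft"] * 0.0929 + rng.normal(0, 5, 100)  # nearly collinear with area_sqft

X = sm.add_constant(df)                 # include an intercept column
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)   # large values for area_sqft / area_sqm signal multicollinearity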


15. What is Overfitting?

Answer:
Overfitting occurs when the model:

  • Fits training data very well

  • Performs poorly on unseen data

Causes:

  • Too many features

  • Small dataset


16. What is Underfitting?

Answer:
Underfitting happens when:

  • Model is too simple

  • Cannot capture patterns in data


17. How to handle overfitting in Linear Regression?

Answer:

  • Use regularization

  • Reduce features

  • Increase data

  • Cross-validation


18. What is Regularization?

Answer:
Regularization adds a penalty term to the cost function to reduce overfitting.


19. What is Ridge Regression?

Answer:
Ridge Regression adds L2 penalty.

[
Cost = MSE + \lambda \sum w^2
]

  • Reduces large coefficients

  • Does not make them zero


20. What is Lasso Regression?

Answer:
Lasso adds L1 penalty.

[
Cost = MSE + \lambda \sum |w|
]

  • Can shrink coefficients to zero

  • Performs feature selection
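
A short sketch of both with scikit-learn, where alpha plays the role of λ (the values and data are arbitrary):

# Sketch: Ridge (L2) and Lasso (L1) regression with scikit-learn; alpha = lambda, chosen arbitrarily.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 0.1, size=100)  # only 2 of 5 features matter

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("Ridge coefficients:", np.round(ridge.coef_, 3))   # shrunk, but rarely exactly zero
print("Lasso coefficients:", np.round(lasso.coef_, 3))   # irrelevant features driven to zero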


21. Difference between Ridge and Lasso

Ridge                        Lasso
L2 penalty                   L1 penalty
No feature elimination       Feature elimination
Handles multicollinearity    Sparse model

22. What evaluation metrics are used for Linear Regression?

Answer:

  • Mean Absolute Error (MAE)

  • Mean Squared Error (MSE)

  • Root Mean Squared Error (RMSE)

  • R² Score


23. What is RMSE?

Answer:
RMSE is the square root of MSE.

[
RMSE = \sqrt{MSE}
]

It is in the same unit as the target variable.


24. Can Linear Regression handle non-linear data?

Answer:
No, but we can:

  • Use polynomial features

  • Apply transformations (log, square)


25. When should Linear Regression not be used?

Answer:

  • When relationship is highly non-linear

  • Presence of strong outliers

  • Categorical target variable


26. Real-time example of Linear Regression

Answer:

  • Salary prediction

  • House price prediction

  • Sales forecasting

  • Demand prediction


27. What libraries are used for Linear Regression in Python?

Answer:

  • scikit-learn

  • statsmodels

  • numpy

  • pandas


28. Explain Linear Regression in one line (Interview Tip)

Answer:
Linear Regression predicts continuous values by finding the best-fit straight line between input and output variables.


29. What are the advantages of Linear Regression?

Answer:

  • Simple and easy to implement

  • Interpretable results

  • Fast training

  • Works well with linearly related data


30. What are the disadvantages of Linear Regression?

Answer:

  • Sensitive to outliers

  • Assumes linearity

  • Poor performance with complex data


31. What are model coefficients in Linear Regression?

Answer:
Model coefficients represent the relationship strength between independent variables and the dependent variable.

  • Positive coefficient → Y increases as X increases

  • Negative coefficient → Y decreases as X increases

Example:
If coefficient = 5 → 1 unit increase in X increases Y by 5 units.


32. What is the intercept in Linear Regression?

Answer:
The intercept is the value of Y when all X values are zero.

It helps position the regression line correctly.


33. How do you detect outliers in Linear Regression?

Answer:

  • Box plots

  • Z-score

  • IQR (Interquartile Range)

  • Scatter plots

Outliers can skew the regression line.


34. How do outliers affect Linear Regression?

Answer:

  • Change slope significantly

  • Increase error values

  • Reduce model accuracy

Linear Regression is sensitive to outliers.


35. How to handle outliers?

Answer:

  • Remove extreme outliers

  • Transform data (log, square root)

  • Use robust regression techniques


36. What is homoscedasticity?

Answer:
Homoscedasticity means the variance of residuals is constant across all values of X.

This keeps coefficient estimates and statistical tests reliable.


37. What is heteroscedasticity?

Answer:
Heteroscedasticity occurs when residual variance changes with X.

Impact:

  • Unreliable coefficients

  • Invalid statistical tests


38. How do you detect heteroscedasticity?

Answer:

  • Residual vs fitted value plot

  • Breusch–Pagan test

  • White test
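
A sketch of the Breusch–Pagan test with statsmodels (synthetic data whose error variance deliberately grows with X):

# Sketch: detecting heteroscedasticity with the Breusch-Pagan test (synthetic data).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 200)
y = 2.0 * x + rng.normal(0, x)          # error variance grows with x (heteroscedastic)

X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print("Breusch-Pagan p-value:", lm_pvalue)   # small p-value -> evidence of heteroscedasticity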


39. How to fix heteroscedasticity?

Answer:

  • Log transformation

  • Weighted Least Squares

  • Remove outliers


40. What is normality of residuals?

Answer:
Residuals should follow a normal distribution.

This helps in:

  • Accurate confidence intervals

  • Hypothesis testing


41. What is correlation vs regression?

Answer:

Correlation                       Regression
Measures relationship strength    Predicts values
No causation                      Assumes dependency
Symmetric                         Directional

42. Can Linear Regression be used for classification?

Answer:
No. Linear Regression predicts continuous values.

For classification, use:

  • Logistic Regression


43. What is Ordinary Least Squares (OLS)?

Answer:
OLS minimizes the sum of squared residuals to find the best-fit line.

It is the most common method used in Linear Regression.


44. What is feature scaling and why is it needed?

Answer:
Feature scaling brings all features to a similar range.

Required for:

  • Faster convergence

  • Gradient descent efficiency

Methods:

  • Standardization

  • Normalization


45. Is feature scaling required for Linear Regression?

Answer:
Not mandatory for OLS, but important for gradient descent and regularization.


46. What is polynomial regression?

Answer:
Polynomial Regression models non-linear relationships by adding polynomial terms.

Example:

[
Y = b_0 + b_1X + b_2X^2
]
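
A sketch using scikit-learn's PolynomialFeatures (degree and data are illustrative):

# Sketch: polynomial regression via polynomial feature expansion (illustrative data).
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 1.0 + 2.0 * x.ravel() + 0.5 * x.ravel() ** 2 + rng.normal(0, 0.2, 50)

model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
model.fit(x, y)
print("Prediction at x = 2:", model.predict([[2.0]]))   # model is still linear in b_0, b_1, b_2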


47. What is bias in Linear Regression?

Answer:
Bias is error due to oversimplified assumptions.

High bias leads to underfitting.


48. What is variance in Linear Regression?

Answer:
Variance measures how much the model changes with different datasets.

High variance leads to overfitting.


49. Explain Bias–Variance Tradeoff

Answer:

  • Low bias + high variance → overfitting

  • High bias + low variance → underfitting

Goal: Balance both.


50. What is cross-validation?

Answer:
Cross-validation tests model performance on multiple data splits.

Most common:

  • K-Fold Cross Validation


51. What is K-Fold Cross Validation?

Answer:
Data is divided into K equal parts.

  • Train on K-1 folds

  • Test on remaining fold

  • Repeat K times
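
A sketch with scikit-learn (K = 5; the data is synthetic):

# Sketch: 5-fold cross-validation of a linear regression model (synthetic data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.3, size=150)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")
print("R^2 per fold:", np.round(scores, 3))
print("Mean R^2:", scores.mean())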


52. What is p-value in Linear Regression?

Answer:
p-value measures statistical significance of a feature.

  • p < 0.05 → significant feature

  • p > 0.05 → insignificant feature


53. What is confidence interval?

Answer:
Confidence interval provides a range where the true coefficient lies.

Common: 95% confidence interval.


54. What is dummy variable trap?

Answer:
Occurs when dummy variables are perfectly correlated.

Solution:

  • Drop one dummy variable


55. How do you handle categorical variables in Linear Regression?

Answer:

  • One-Hot Encoding

  • Label Encoding (carefully)
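
A sketch with pandas; using drop_first=True keeps k-1 dummies per category, which also avoids the dummy variable trap from the previous question (the columns are made up):

# Sketch: one-hot encoding a categorical column with pandas (made-up columns).
import pandas as pd

df = pd.DataFrame({
    "area": [1200, 1500, 900, 2000],
    "city": ["Delhi", "Mumbai", "Delhi", "Chennai"],
})

# drop_first=True keeps k-1 dummies per category, avoiding the dummy variable trap
encoded = pd.get_dummies(df, columns=["city"], drop_first=True)
print(encoded)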


56. What is data leakage?

Answer:
Data leakage occurs when test data influences training.

It results in overly optimistic performance.


57. What is train-test split?

Answer:
Data is split into:

  • Training set

  • Testing set

Common ratio:

  • 70:30 or 80:20


58. What are leverage points?

Answer:
Leverage points are data points with extreme X values.

They strongly influence the regression line.


59. What is Cook’s Distance?

Answer:
Measures the influence of each observation on the regression model.

High value → influential point.
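
A sketch of computing Cook's Distance with statsmodels (synthetic data with one deliberately injected outlier):

# Sketch: Cook's Distance per observation with statsmodels (synthetic data, one injected outlier).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 3.0 + 2.0 * x + rng.normal(0, 1, 50)
y[10] += 30                              # inject an influential outlier

model = sm.OLS(y, sm.add_constant(x)).fit()
cooks_d, _ = model.get_influence().cooks_distance
print("Most influential observation:", int(np.argmax(cooks_d)), "Cook's D:", cooks_d.max())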


60. Explain Linear Regression workflow

Answer:

  1. Collect data

  2. Clean data

  3. Handle missing values

  4. Feature scaling

  5. Train model

  6. Evaluate model

  7. Tune model


61. How do you explain Linear Regression to a non-technical person?

Answer:
It finds a straight line that best predicts future values based on past data.


62. What are common mistakes in Linear Regression?

Answer:

  • Ignoring assumptions

  • Not handling outliers

  • Using too many features

  • Not validating model


63. What is the effect of adding irrelevant features?

Answer:

  • Reduced model performance

  • Overfitting

  • Lower Adjusted R²


64. Can Linear Regression handle missing values?

Answer:
No. Missing values must be:

  • Removed

  • Imputed (mean/median)


65. Interview Tip: How do you justify Linear Regression?

Answer:
Use Linear Regression when:

  • Relationship is linear

  • Target variable is continuous

  • Interpretability is important

Experienced Interview Questions

 

1. What is Linear Regression? Explain with an equation.

Answer:
Linear Regression is a supervised learning algorithm used to model the relationship between a dependent variable (target) and one or more independent variables (features) by fitting a linear equation.

Simple Linear Regression:
[
y = \beta_0 + \beta_1 x + \varepsilon
]

Multiple Linear Regression:
[
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n + \varepsilon
]

Where:

  • (y) → dependent variable

  • (x) → independent variable(s)

  • (\beta_0) → intercept

  • (\beta_i) → coefficients

  • (\varepsilon) → error term


2. What are the key assumptions of Linear Regression?

Answer:
Linear Regression relies on the following assumptions:

  1. Linearity – Relationship between features and target is linear

  2. Independence – Observations are independent

  3. Homoscedasticity – Constant variance of residuals

  4. Normality of Errors – Residuals are normally distributed

  5. No Multicollinearity – Independent variables are not highly correlated

Violation of these assumptions can lead to biased or inefficient estimates.


3. What is Ordinary Least Squares (OLS)?

Answer:
OLS is a method used to estimate regression coefficients by minimizing the sum of squared residuals.

[
\text{Minimize } \sum (y_i - \hat{y}_i)^2
]

OLS provides:

  • Best Linear Unbiased Estimator (BLUE)

  • Efficient estimates when assumptions hold


4. How do you interpret regression coefficients?

Answer:

  • Slope ((\beta_i)): Change in target variable for a one-unit change in the predictor, keeping other variables constant

  • Intercept ((\beta_0)): Expected value of the target when all predictors are zero

Example:
If salary = 30,000 + 2,000 × years_of_experience
→ Each additional year increases salary by ₹2,000.


5. What is R-squared and Adjusted R-squared?

Answer:

R-squared ((R^2))

Measures proportion of variance explained by the model.

[
R^2 = 1 - \frac{SS_{res}}{SS_{tot}}
]

  • Ranges from 0 to 1

  • Increases with more features (even irrelevant ones)

Adjusted R-squared

Penalizes unnecessary features.

[
Adjusted\ R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}
]

where ( n ) = number of observations and ( p ) = number of predictors.

Preferred for multiple regression models.


6. What is multicollinearity? How do you detect it?

Answer:
Multicollinearity occurs when independent variables are highly correlated, leading to unstable coefficients.

Detection methods:

  • Correlation matrix

  • Variance Inflation Factor (VIF)

[
VIF_i = \frac{1}{1 - R_i^2}
]

where ( R_i^2 ) is obtained by regressing the i-th predictor on all the other predictors.

  • VIF > 5 → moderate issue

  • VIF > 10 → severe multicollinearity

Solutions:

  • Remove correlated features

  • Use PCA

  • Apply Ridge Regression


7. What are residuals? How do you analyze them?

Answer:
Residuals are differences between actual and predicted values:

[
Residual = y - \hat{y}
]

Residual analysis includes:

  • Residual vs fitted plot (check homoscedasticity)

  • Q-Q plot (check normality)

  • Autocorrelation plot (Durbin–Watson test)


8. Difference between Linear Regression and Logistic Regression?

Linear Regression             Logistic Regression
Predicts continuous values    Predicts probabilities
Uses OLS                      Uses Maximum Likelihood
Output is unbounded           Output between 0 and 1
Uses MSE loss                 Uses Log loss

9. What is Gradient Descent in Linear Regression?

Answer:
Gradient Descent is an iterative optimization algorithm used when datasets are large.

Update rule:
[
\beta = \beta - \alpha \frac{\partial J(\beta)}{\partial \beta}
]

Types:

  • Batch Gradient Descent

  • Stochastic Gradient Descent (SGD)

  • Mini-batch Gradient Descent


10. What is overfitting in Linear Regression?

Answer:
Overfitting occurs when the model captures noise instead of patterns, performing well on training data but poorly on test data.

Causes:

  • Too many features

  • Multicollinearity

  • Small dataset

Prevention:

  • Regularization

  • Feature selection

  • Cross-validation


11. Explain Regularization in Linear Regression.

Answer:
Regularization adds a penalty term to reduce model complexity.

Ridge Regression (L2)

[
Loss = MSE + \lambda \sum \beta^2
]

  • Shrinks coefficients

  • Handles multicollinearity

Lasso Regression (L1)

[
Loss = MSE + \lambda \sum |\beta|
]

  • Performs feature selection

Elastic Net

Combination of L1 and L2.


12. What evaluation metrics are used for Linear Regression?

Answer:

  • Mean Absolute Error (MAE)

  • Mean Squared Error (MSE)

  • Root Mean Squared Error (RMSE)

  • R-squared

  • Adjusted R-squared


13. How do you handle outliers in Linear Regression?

Answer:

  • Detect using boxplots, Cook’s distance

  • Transform data (log, sqrt)

  • Remove or cap outliers

  • Use robust regression


14. What is the bias–variance tradeoff?

Answer:

  • High bias → underfitting

  • High variance → overfitting

Goal is to balance both using regularization and proper model complexity.


15. Real-time scenario: When would Linear Regression fail?

Answer:

  • Non-linear relationships

  • Extreme outliers

  • Heteroscedastic data

  • Categorical variables not encoded

  • Time-series data with autocorrelation


16. How do you check if Linear Regression is suitable for a dataset?

Answer:

  • Scatter plots for linearity

  • Correlation analysis

  • Residual diagnostics

  • Compare with non-linear models


17. Explain feature scaling in Linear Regression.

Answer:
Feature scaling improves convergence in gradient descent.

Methods:

  • Standardization (Z-score)

  • Min-Max Scaling

Important when using regularization.


18. What is Cook’s Distance?

Answer:
Measures influence of each data point on regression coefficients.

  • Large Cook’s Distance → influential outlier


19. What is the difference between parametric and non-parametric models?

Answer:

  • Linear Regression → parametric (assumes fixed form)

  • Decision Trees → non-parametric


20. How do you deploy a Linear Regression model in production?

Answer:

  • Train and validate model

  • Serialize model (pickle/joblib)

  • Build API (Flask/FastAPI)

  • Monitor performance and drift
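
A minimal sketch, assuming the model was trained with scikit-learn and is served with FastAPI (the file name, endpoint path, and request schema are illustrative choices, not a prescribed setup):

# Sketch: serialize a trained model and expose it through a small FastAPI endpoint.
# File name, endpoint path, and schema below are illustrative choices.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

# Elsewhere, after training: joblib.dump(model, "linreg_model.joblib")

app = FastAPI()
model = joblib.load("linreg_model.joblib")   # load the serialized model at startup


class Features(BaseModel):
    values: list[float]                      # one row of predictor values


@app.post("/predict")
def predict(features: Features):
    prediction = model.predict([features.values])[0]
    return {"prediction": float(prediction)}

# Run with: uvicorn app:app --reload, then monitor predictions for drift over time.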


21. What is the difference between correlation and regression?

Answer:

  • Correlation measures the strength and direction of a relationship between two variables.

  • Regression models the relationship and predicts the dependent variable.

Correlation does not imply causation, while regression attempts to quantify impact.


22. What happens if assumptions of Linear Regression are violated?

Answer:

  • Linearity violated → Biased predictions

  • Homoscedasticity violated → Inefficient estimates

  • Normality violated → Invalid hypothesis tests

  • Multicollinearity → Unstable coefficients

  • Autocorrelation → Incorrect confidence intervals


23. What is heteroscedasticity? How do you handle it?

Answer:
Heteroscedasticity occurs when variance of residuals changes with predictors.

Detection:

  • Residual vs fitted plot

  • Breusch–Pagan test

Handling:

  • Log or Box-Cox transformation

  • Weighted Least Squares

  • Robust standard errors


24. What is autocorrelation in Linear Regression?

Answer:
Autocorrelation means residuals are correlated across observations, common in time-series data.

Detection:

  • Durbin–Watson test (value ~2 is ideal)

Solutions:

  • Add lag variables

  • Use time-series models

  • Generalized Least Squares
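
A sketch of the Durbin–Watson check on OLS residuals with statsmodels (synthetic data):

# Sketch: Durbin-Watson statistic on OLS residuals (synthetic data).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
x = np.arange(100, dtype=float)
y = 1.0 + 0.5 * x + rng.normal(0, 1, 100)

model = sm.OLS(y, sm.add_constant(x)).fit()
dw = durbin_watson(model.resid)
print("Durbin-Watson statistic:", round(dw, 2))   # a value near 2 suggests little autocorrelation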


25. Explain hypothesis testing in Linear Regression.

Answer:
Used to test statistical significance of coefficients.

  • Null Hypothesis (H₀): β = 0

  • Alternative Hypothesis (H₁): β ≠ 0

Tests used:

  • t-test → individual coefficients

  • F-test → overall model significance


26. What is p-value? How do you interpret it?

Answer:
The p-value is the probability of observing results at least as extreme as the observed ones, assuming the null hypothesis is true.

  • p < 0.05 → statistically significant

  • p ≥ 0.05 → not significant

Lower p-value → stronger evidence against null hypothesis.


27. Why is adjusted R² preferred over R²?

Answer:
Adjusted R² penalizes irrelevant features.

  • R² always increases

  • Adjusted R² increases only if feature improves the model

Best metric for multiple linear regression.


28. Explain bias and variance in Linear Regression.

Answer:

  • Bias: Error from overly simple model

  • Variance: Error from overly complex model

Linear regression generally has low variance and relatively high bias, which is why it tends to underfit complex relationships.


29. How does sample size affect Linear Regression?

Answer:

  • Small dataset → unreliable coefficients

  • Large dataset → stable estimates, better generalization

Rule of thumb: 10–15 observations per predictor.


30. What is feature selection? Why is it important?

Answer:
Feature selection reduces noise and improves interpretability.

Methods:

  • Forward selection

  • Backward elimination

  • Recursive Feature Elimination (RFE)

  • Lasso regression


31. What is the difference between Ridge and Lasso?

Ridge                        Lasso
L2 penalty                   L1 penalty
Shrinks coefficients         Can make coefficients zero
No feature selection         Performs feature selection
Handles multicollinearity    Sparse model

32. What is Elastic Net and why is it used?

Answer:
Elastic Net combines Ridge and Lasso penalties.

Used when:

  • Many correlated features

  • Lasso selects only one variable


33. Explain polynomial regression.

Answer:
Polynomial regression models non-linear relationships by adding polynomial terms.

Example:
[
y = \beta_0 + \beta_1 x + \beta_2 x^2
]

Still considered linear in coefficients.


34. How do categorical variables work in Linear Regression?

Answer:
Categorical variables are converted using dummy encoding.

  • Avoid dummy variable trap by dropping one category

  • Use One-Hot Encoding


35. What is the dummy variable trap?

Answer:
Occurs when dummy variables are perfectly correlated.

Solution: Drop one category to avoid multicollinearity.


36. How do interaction terms work?

Answer:
Interaction terms capture combined effect of features.

Example:
[
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2
]

Used when one variable’s effect depends on another.


37. What is leverage in Linear Regression?

Answer:
Leverage measures how far a data point’s predictor values are from the mean.

High leverage points can strongly influence the model.


38. Difference between confidence interval and prediction interval?

Answer:

  • Confidence interval → Mean prediction

  • Prediction interval → Individual prediction (wider)


39. What is cross-validation? Why is it used?

Answer:
Cross-validation evaluates model stability.

  • k-Fold CV

  • Leave-One-Out CV

Reduces overfitting and bias.


40. Explain model interpretability in Linear Regression.

Answer:
Linear regression is highly interpretable:

  • Coefficients show feature impact

  • Sign indicates direction

  • Magnitude shows strength

Preferred in finance, healthcare, policy models.


41. What is Weighted Linear Regression?

Answer:
Assigns weights to observations.

Used when:

  • Data reliability varies

  • Heteroscedasticity exists
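
A sketch with statsmodels WLS, where the weights down-weight noisier observations (the weighting rule here is an illustrative choice):

# Sketch: Weighted Least Squares with statsmodels (illustrative weighting rule).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 200)
y = 2.0 + 1.5 * x + rng.normal(0, x)      # noise grows with x (heteroscedastic)

X = sm.add_constant(x)
weights = 1.0 / (x ** 2)                  # give less weight to high-variance observations
wls_model = sm.WLS(y, X, weights=weights).fit()
print(wls_model.params)                   # estimated intercept and slope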


42. How does scaling affect coefficients?

Answer:
Scaling changes coefficient magnitude but not predictions.

Important for:

  • Regularization

  • Gradient descent convergence


43. What is model drift in Linear Regression?

Answer:
Occurs when data distribution changes over time.

Handled by:

  • Monitoring metrics

  • Retraining models

  • Data validation


44. When should you not use Linear Regression?

Answer:

  • Non-linear patterns

  • Extreme outliers

  • Categorical target

  • Complex interactions


45. Real-time scenario: Sales prediction model performs poorly in production. What steps do you take?

Answer:

  1. Check data drift

  2. Re-evaluate assumptions

  3. Inspect residuals

  4. Remove outliers

  5. Add interaction or polynomial terms

  6. Retrain model


46. What is Cook’s Distance and why is it important?

Answer:
Measures influence of data points.

High Cook’s Distance → model sensitive to that point.


47. Explain why Linear Regression is sensitive to outliers.

Answer:
OLS minimizes squared errors, giving more weight to extreme values.


48. How do you interpret negative R²?

Answer:
Model performs worse than predicting the mean.

Indicates poor fit or incorrect assumptions.


49. What is Maximum Likelihood Estimation (MLE) in Linear Regression?

Answer:
MLE estimates parameters assuming normally distributed errors.

Equivalent to OLS under Gaussian noise.


50. How do you explain Linear Regression to a business stakeholder?

Answer:
“It shows how much each factor impacts the outcome and helps forecast future values using historical trends.”


51. What is the difference between training error and test error?

Answer:

  • Training error → fit on known data

  • Test error → generalization ability


52. Can Linear Regression handle missing values?

Answer:
No. Missing values must be:

  • Imputed

  • Removed


53. What is a partial regression plot?

Answer:
Shows relationship between target and one predictor, controlling for others.


54. Explain why Linear Regression is a baseline model.

Answer:
Simple, fast, interpretable, and sets performance benchmark.


55. How do you tune hyperparameters in regularized regression?

Answer:
Use Grid Search with Cross-Validation to tune λ.
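
A sketch with scikit-learn, where the alpha grid stands in for candidate λ values (data and grid are arbitrary):

# Sketch: tuning Ridge regularization strength with grid search + cross-validation.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + rng.normal(0, 0.5, size=200)

param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}    # candidate lambda values
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print("Best alpha:", search.best_params_["alpha"])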