Top Interview Questions
Linear Regression is one of the most fundamental and widely used algorithms in statistics and machine learning. It is primarily employed for predictive modeling, helping to understand the relationship between a dependent variable (target) and one or more independent variables (predictors). Despite its simplicity, linear regression serves as the foundation for more complex techniques and is essential for data analysis across industries such as finance, healthcare, marketing, and engineering.
At its core, linear regression assumes a linear relationship between the dependent variable $Y$ and independent variable(s) $X$. Mathematically, the simplest form is expressed as:
$$Y = \beta_0 + \beta_1 X + \epsilon$$
Where:
$Y$ = Dependent variable (the outcome we want to predict)
$X$ = Independent variable (predictor or feature)
$\beta_0$ = Intercept (value of $Y$ when $X = 0$)
$\beta_1$ = Slope (amount by which $Y$ changes for a unit change in $X$)
$\epsilon$ = Error term (difference between actual and predicted values)
When there is more than one independent variable, it becomes Multiple Linear Regression:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \epsilon$$
This allows modeling of more complex real-world scenarios.
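To make this concrete, here is a minimal scikit-learn sketch of fitting a multiple linear regression. The data is synthetic and the feature meanings (size, bedrooms) are illustrative assumptions, not taken from any real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic illustrative data: house size (sq ft) and bedrooms -> price
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(500, 3500, 200),   # size
    rng.integers(1, 6, 200),       # bedrooms
])
y = 50_000 + 120 * X[:, 0] + 15_000 * X[:, 1] + rng.normal(0, 20_000, 200)

# Fits beta_0 (intercept) and beta_1..beta_n (coefficients) by least squares
model = LinearRegression()
model.fit(X, y)

print("Intercept (beta_0):", model.intercept_)
print("Coefficients (beta_1, beta_2):", model.coef_)
print("Prediction for a 2000 sq ft, 3-bedroom house:", model.predict([[2000, 3]]))
```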
Linear regression works well under certain assumptions. Understanding these assumptions is critical because violation can lead to misleading results:
Linearity: The relationship between the independent and dependent variables should be linear.
Independence: Observations must be independent of each other.
Homoscedasticity: Constant variance of the errors ($\epsilon$) across all levels of the independent variables.
Normality of Errors: The residuals (differences between actual and predicted values) should be normally distributed.
No multicollinearity (for multiple regression): Independent variables should not be highly correlated with each other.
If these assumptions hold, linear regression produces unbiased, consistent, and efficient estimates of the coefficients.
Linear regression can be broadly categorized into:
Simple Linear Regression:
Involves one independent variable predicting a dependent variable. Example: Predicting a person’s weight based on their height.
Multiple Linear Regression:
Involves multiple independent variables. Example: Predicting house prices based on size, location, and number of bedrooms.
Polynomial Regression (technically an extension):
Models nonlinear relationships by introducing polynomial terms of the predictors. Example: $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon$
Ridge and Lasso Regression (Regularized Linear Regression):
These methods add penalties to prevent overfitting in models with many variables. Ridge uses an $L_2$ penalty; Lasso uses an $L_1$ penalty.
The goal of linear regression is to fit the best line that minimizes the error between predicted and actual values. The most common approach is Ordinary Least Squares (OLS), which minimizes the sum of squared errors (SSE):
$$SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$
Where:
$Y_i$ = Actual value
$\hat{Y}_i$ = Predicted value
Minimizing SSE ensures that predictions are as close as possible to actual outcomes. Gradient descent can also be used to iteratively adjust coefficients in large datasets.
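As an illustration of how OLS works, the sketch below solves the normal equations $\beta = (X^TX)^{-1}X^Ty$ on synthetic data with NumPy and cross-checks the answer with `numpy.linalg.lstsq`; the data and variable names are assumptions made purely for the example.

```python
import numpy as np

# Synthetic data: y = 2 + 3x + noise
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 100)
y = 2.0 + 3.0 * x + rng.normal(0, 1.0, 100)

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# Normal equations: beta = (X^T X)^{-1} X^T y, which minimizes the SSE
beta = np.linalg.solve(X.T @ X, X.T @ y)
print("Intercept, slope:", beta)

# SSE at the fitted coefficients
sse = np.sum((y - X @ beta) ** 2)
print("SSE:", sse)

# Same answer via NumPy's built-in least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print("lstsq check:", beta_lstsq)
```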
Evaluating a linear regression model requires assessing how well the model predicts the dependent variable. Common metrics include:
R-squared ($R^2$):
Measures the proportion of variance in the dependent variable explained by independent variables. Ranges from 0 to 1.
$$R^2 = 1 - \frac{\sum_i (Y_i - \hat{Y}_i)^2}{\sum_i (Y_i - \bar{Y})^2}$$
Adjusted R-squared:
Adjusts R-squared for the number of predictors in the model, preventing overestimation of model performance in multiple regression.
Mean Absolute Error (MAE):
Average of absolute differences between predicted and actual values.
$$MAE = \frac{1}{n} \sum_{i=1}^{n} |Y_i - \hat{Y}_i|$$
Mean Squared Error (MSE) & Root Mean Squared Error (RMSE):
Measures the average squared difference. RMSE is the square root of MSE and is in the same unit as the dependent variable.
$$MSE = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$
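A minimal sketch of computing these metrics with scikit-learn, using made-up actual and predicted values purely for illustration:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Made-up actual vs. predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.5, 9.0, 11.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.5, 10.6])

r2 = r2_score(y_true, y_pred)               # proportion of variance explained
mae = mean_absolute_error(y_true, y_pred)   # average absolute error
mse = mean_squared_error(y_true, y_pred)    # average squared error
rmse = np.sqrt(mse)                         # same units as the target

print(f"R^2 = {r2:.3f}, MAE = {mae:.3f}, MSE = {mse:.3f}, RMSE = {rmse:.3f}")
```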
Building a linear regression model typically follows these steps:
Data Collection: Gather relevant data for the predictors and target variable.
Data Preprocessing: Handle missing values, outliers, and categorical variables (encoding).
Exploratory Data Analysis (EDA): Visualize relationships using scatter plots, correlation matrices, etc.
Splitting Data: Divide dataset into training and testing sets to evaluate performance.
Model Training: Fit a linear regression model using OLS or gradient descent.
Model Evaluation: Assess performance using metrics like R-squared, MAE, RMSE.
Model Interpretation: Examine coefficients to understand the impact of each variable.
Prediction: Use the model to predict new observations.
Key advantages of linear regression:
Simplicity and Interpretability: Easy to understand and explain.
Computational Efficiency: Requires less computation than complex models.
Basis for Other Models: Foundation for logistic regression, generalized linear models, and regularized regression.
Predictive Power: Works well for linear relationships and moderate datasets.
Key limitations:
Assumes Linearity: Fails if relationships are nonlinear.
Sensitive to Outliers: Extreme values can significantly impact the model.
Multicollinearity: Highly correlated predictors distort coefficient estimates.
Overfitting/Underfitting: May overfit if too many predictors or underfit if relationships are complex.
Cannot Model Complex Patterns: Limited compared to tree-based or neural network models.
Common applications by industry:
Finance: Predicting stock prices, loan defaults, or credit scores.
Healthcare: Estimating disease progression, patient outcomes, or drug effectiveness.
Marketing: Forecasting sales, customer behavior, or advertising effectiveness.
Economics: Modeling economic indicators like GDP growth, inflation, or unemployment.
Engineering: Predicting material strength, energy consumption, or equipment failure.
Linear regression’s versatility makes it applicable in almost any field where quantitative prediction is needed.
Answer:
Linear Regression is a supervised machine learning algorithm used to predict a continuous numerical value based on one or more input variables.
It finds the best-fit straight line that represents the relationship between:
Independent variable(s) (X)
Dependent variable (Y)
Example:
Predicting salary based on years of experience.
Answer:
The equation of a straight line:
$$Y = mX + c$$
Where:
Y = dependent variable (output)
X = independent variable (input)
m = slope (how much Y changes when X changes)
c = intercept (value of Y when X = 0)
Answer:
Simple Linear Regression involves:
One independent variable
One dependent variable
Example:
Salary = f(Experience)
$$\text{Salary} = m \times \text{Experience} + c$$
Answer:
Multiple Linear Regression uses more than one independent variable.
Equation:
$$Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_n X_n$$
Example:
House price based on:
Area
Location
Number of rooms
Answer:
Linear Regression works best when the following assumptions are met:
Linearity – Relationship between X and Y is linear
Independence – Observations are independent
Homoscedasticity – Constant variance of errors
Normality – Residuals are normally distributed
No multicollinearity – Independent variables are not highly correlated
Answer:
The cost function measures how well the model fits the data.
Most commonly used:
$$MSE = \frac{1}{n} \sum (Y_{\text{actual}} - Y_{\text{predicted}})^2$$
The goal is to minimize this error.
Answer:
Gradient Descent is an optimization algorithm used to find the best values of m and c by minimizing the cost function.
Steps:
Start with random values of m and c
Calculate error
Update m and c in the direction of minimum error
Repeat until convergence
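The steps above can be sketched in a few lines of NumPy for simple linear regression; the data is synthetic and the learning rate and iteration count are arbitrary illustrative choices.

```python
import numpy as np

# Synthetic data: y = 4 + 2.5x + noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 5, 200)
y = 4.0 + 2.5 * x + rng.normal(0, 0.5, 200)

m, c = 0.0, 0.0          # start with arbitrary slope and intercept
alpha = 0.05             # learning rate (illustrative choice)
n = len(x)

for _ in range(2000):
    y_pred = m * x + c
    error = y_pred - y
    # Gradients of the MSE with respect to m and c
    dm = (2 / n) * np.sum(error * x)
    dc = (2 / n) * np.sum(error)
    # Update in the direction that reduces the error
    m -= alpha * dm
    c -= alpha * dc

print(f"Estimated slope m = {m:.3f}, intercept c = {c:.3f}")
```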
Answer:
The learning rate (α) controls the size of each step taken during gradient descent.
Too large → overshoots minimum
Too small → slow convergence
Answer:
Residual = Actual value − Predicted value
$$\text{Residual} = Y_{\text{actual}} - Y_{\text{predicted}}$$
Residuals help evaluate model accuracy.
Answer:
R² shows how much variance in the dependent variable is explained by the model.
R² = 1 → perfect fit
R² = 0 → no explanatory power
Example:
R² = 0.85 means 85% of the variation is explained.
Answer:
Adjusted R² adjusts R² based on the number of independent variables.
It penalizes adding irrelevant features.
| R² | Adjusted R² |
|---|---|
| Increases when variables are added | Increases only if variables are useful |
| Can be misleading | More reliable |
Answer:
Multicollinearity occurs when independent variables are highly correlated with each other.
Problems:
Unstable coefficients
Difficult interpretation
Solution:
Remove correlated variables
Use VIF (Variance Inflation Factor)
Answer:
VIF measures how much the variance of a regression coefficient is inflated by multicollinearity.
VIF < 5 → acceptable
VIF > 10 → serious multicollinearity
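A sketch of computing VIF with statsmodels on a made-up DataFrame; the column names and correlation structure are assumptions chosen to make the effect visible.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Made-up predictors; x2 is deliberately correlated with x1
rng = np.random.default_rng(0)
x1 = rng.normal(0, 1, 300)
x2 = 0.9 * x1 + rng.normal(0, 0.3, 300)   # highly correlated with x1
x3 = rng.normal(0, 1, 300)                # independent
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# Add a constant so the VIFs correspond to a model with an intercept
X_const = sm.add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])],
    index=X_const.columns[1:],
)
print(vif)   # expect inflated VIFs for x1 and x2, and ~1 for x3
```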
Answer:
Overfitting occurs when the model:
Fits training data very well
Performs poorly on unseen data
Causes:
Too many features
Small dataset
Answer:
Underfitting happens when:
Model is too simple
Cannot capture patterns in data
Answer:
Use regularization
Reduce features
Increase data
Cross-validation
Answer:
Regularization adds a penalty term to the cost function to reduce overfitting.
Answer:
Ridge Regression adds L2 penalty.
$$\text{Cost} = MSE + \lambda \sum w^2$$
Reduces large coefficients
Does not make them zero
Answer:
Lasso adds L1 penalty.
$$\text{Cost} = MSE + \lambda \sum |w|$$
Can shrink coefficients to zero
Performs feature selection
| Ridge | Lasso |
|---|---|
| L2 penalty | L1 penalty |
| No feature elimination | Feature elimination |
| Handles multicollinearity | Sparse model |
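A minimal scikit-learn sketch contrasting the two penalties on synthetic data; the `alpha` values are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.datasets import make_regression

# Synthetic regression problem where only a few features truly matter
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients toward zero
lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: can set coefficients exactly to zero

print("Ridge coefficients:", np.round(ridge.coef_, 2))
print("Lasso coefficients:", np.round(lasso.coef_, 2))
print("Coefficients set to zero by Lasso:", int(np.sum(lasso.coef_ == 0)))
```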
Answer:
Mean Absolute Error (MAE)
Mean Squared Error (MSE)
Root Mean Squared Error (RMSE)
R² Score
Answer:
RMSE is the square root of MSE.
$$RMSE = \sqrt{MSE}$$
It is in the same unit as the target variable.
Answer:
No, but we can:
Use polynomial features
Apply transformations (log, square)
Answer:
When relationship is highly non-linear
Presence of strong outliers
Categorical target variable
Answer:
Salary prediction
House price prediction
Sales forecasting
Demand prediction
Answer:
scikit-learn
statsmodels
numpy
pandas
Answer:
Linear Regression predicts continuous values by finding the best-fit straight line between input and output variables.
Answer:
Simple and easy to implement
Interpretable results
Fast training
Works well with linearly related data
Answer:
Sensitive to outliers
Assumes linearity
Poor performance with complex data
Answer:
Model coefficients represent the relationship strength between independent variables and the dependent variable.
Positive coefficient → Y increases as X increases
Negative coefficient → Y decreases as X increases
Example:
If coefficient = 5 → 1 unit increase in X increases Y by 5 units.
Answer:
The intercept is the value of Y when all X values are zero.
It helps position the regression line correctly.
Answer:
Box plots
Z-score
IQR (Interquartile Range)
Scatter plots
Outliers can skew the regression line.
Answer:
Change slope significantly
Increase error values
Reduce model accuracy
Linear Regression is sensitive to outliers.
Answer:
Remove extreme outliers
Transform data (log, square root)
Use robust regression techniques
Answer:
Homoscedasticity means the variance of residuals is constant across all values of X.
This ensures reliable predictions.
Answer:
Heteroscedasticity occurs when residual variance changes with X.
Impact:
Unreliable coefficients
Invalid statistical tests
Answer:
Residual vs fitted value plot
Breusch–Pagan test
White test
Answer:
Log transformation
Weighted Least Squares
Remove outliers
Answer:
Residuals should follow a normal distribution.
This helps in:
Accurate confidence intervals
Hypothesis testing
Answer:
| Correlation | Regression |
|---|---|
| Measures relationship strength | Predicts values |
| No causation | Assumes dependency |
| Symmetric | Directional |
Answer:
No. Linear Regression predicts continuous values.
For classification, use:
Logistic Regression
Answer:
OLS minimizes the sum of squared residuals to find the best-fit line.
It is the most common method used in Linear Regression.
Answer:
Feature scaling brings all features to a similar range.
Required for:
Faster convergence
Gradient descent efficiency
Methods:
Standardization
Normalization
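A brief sketch of both methods using scikit-learn preprocessing on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Made-up features on very different scales (e.g. income vs. age)
X = np.array([[45_000.0, 23.0],
              [72_000.0, 41.0],
              [120_000.0, 35.0],
              [58_000.0, 29.0]])

standardized = StandardScaler().fit_transform(X)   # zero mean, unit variance (Z-score)
normalized = MinMaxScaler().fit_transform(X)       # rescaled to the [0, 1] range

print("Standardized:\n", standardized.round(2))
print("Normalized:\n", normalized.round(2))
```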
Answer:
Not mandatory for OLS, but important for gradient descent and regularization.
Answer:
Polynomial Regression models non-linear relationships by adding polynomial terms.
Example:
$$Y = b_0 + b_1 X + b_2 X^2$$
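A sketch of polynomial regression with scikit-learn's `PolynomialFeatures` in a pipeline; the degree and the synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Synthetic quadratic relationship: y = 1 + 2x + 3x^2 + noise
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 150).reshape(-1, 1)
y = 1 + 2 * x[:, 0] + 3 * x[:, 0] ** 2 + rng.normal(0, 1, 150)

# Degree-2 polynomial terms, then an ordinary linear fit on those terms
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                      LinearRegression())
model.fit(x, y)

print("Coefficients (b1, b2):", model.named_steps["linearregression"].coef_)
print("Intercept (b0):", model.named_steps["linearregression"].intercept_)
```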
Answer:
Bias is error due to oversimplified assumptions.
High bias leads to underfitting.
Answer:
Variance measures how much the model changes with different datasets.
High variance leads to overfitting.
Answer:
Low bias + high variance → overfitting
High bias + low variance → underfitting
Goal: Balance both.
Answer:
Cross-validation tests model performance on multiple data splits.
Most common:
K-Fold Cross Validation
Answer:
Data is divided into K equal parts.
Train on K-1 folds
Test on remaining fold
Repeat K times
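A short scikit-learn sketch of the procedure above with K = 5 (an arbitrary but common choice) on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold

# Synthetic data for illustration
X, y = make_regression(n_samples=300, n_features=5, noise=15.0, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)   # 5 folds: train on 4, test on 1, repeat
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print("R^2 per fold:", scores.round(3))
print("Mean R^2:", scores.mean().round(3))
```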
Answer:
p-value measures statistical significance of a feature.
p < 0.05 → significant feature
p > 0.05 → insignificant feature
Answer:
A confidence interval gives a range that is likely to contain the true coefficient at a given confidence level.
Common: 95% confidence interval.
Answer:
Occurs when dummy variables are perfectly correlated.
Solution:
Drop one dummy variable
Answer:
One-Hot Encoding
Label Encoding (carefully)
Answer:
Data leakage occurs when test data influences training.
It results in overly optimistic performance.
Answer:
Data is split into:
Training set
Testing set
Common ratio:
70:30 or 80:20
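A minimal scikit-learn sketch using an 80:20 split on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=500, n_features=4, noise=10.0, random_state=0)

# 80% of the rows go to training, 20% to testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print("Test R^2:", model.score(X_test, y_test))   # evaluated only on unseen rows
```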
Answer:
Leverage points are data points with extreme X values.
They strongly influence the regression line.
Answer:
Measures the influence of each observation on the regression model.
High value → influential point.
Answer:
Collect data
Clean data
Handle missing values
Feature scaling
Train model
Evaluate model
Tune model
Answer:
It finds a straight line that best predicts future values based on past data.
Answer:
Ignoring assumptions
Not handling outliers
Using too many features
Not validating model
Answer:
Reduced model performance
Overfitting
Lower Adjusted R²
Answer:
No. Missing values must be:
Removed
Imputed (mean/median)
Answer:
Use Linear Regression when:
Relationship is linear
Target variable is continuous
Interpretability is important
Answer:
Linear Regression is a supervised learning algorithm used to model the relationship between a dependent variable (target) and one or more independent variables (features) by fitting a linear equation.
Simple Linear Regression:
$$y = \beta_0 + \beta_1 x + \varepsilon$$
Multiple Linear Regression:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon$$
Where:
$y$ → dependent variable
$x$ → independent variable(s)
$\beta_0$ → intercept
$\beta_i$ → coefficients
$\varepsilon$ → error term
Answer:
Linear Regression relies on the following assumptions:
Linearity – Relationship between features and target is linear
Independence – Observations are independent
Homoscedasticity – Constant variance of residuals
Normality of Errors – Residuals are normally distributed
No Multicollinearity – Independent variables are not highly correlated
Violation of these assumptions can lead to biased or inefficient estimates.
Answer:
OLS is a method used to estimate regression coefficients by minimizing the sum of squared residuals.
$$\text{Minimize } \sum_i (y_i - \hat{y}_i)^2$$
OLS provides:
Best Linear Unbiased Estimator (BLUE)
Efficient estimates when assumptions hold
Answer:
Slope ($\beta_i$): Change in the target variable for a one-unit change in the predictor, keeping other variables constant
Intercept ($\beta_0$): Expected value of the target when all predictors are zero
Example:
If salary = 30,000 + 2,000 × years_of_experience
→ Each additional year of experience increases salary by ₹2,000.
Answer:
R-squared measures the proportion of variance explained by the model.
$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$
Ranges from 0 to 1
Increases with more features (even irrelevant ones)
Adjusted R-squared penalizes unnecessary features:
$$\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$$
Preferred for multiple regression models.
Answer:
Multicollinearity occurs when independent variables are highly correlated, leading to unstable coefficients.
Detection methods:
Correlation matrix
Variance Inflation Factor (VIF)
$$VIF_i = \frac{1}{1 - R_i^2}$$
where $R_i^2$ comes from regressing predictor $i$ on the other predictors.
VIF > 5 → moderate issue
VIF > 10 → severe multicollinearity
Solutions:
Remove correlated features
Use PCA
Apply Ridge Regression
Answer:
Residuals are differences between actual and predicted values:
$$\text{Residual} = y - \hat{y}$$
Residual analysis includes:
Residual vs fitted plot (check homoscedasticity)
Q-Q plot (check normality)
Autocorrelation plot (Durbin–Watson test)
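A statsmodels sketch that fits OLS on synthetic data and runs the diagnostics listed above; the data and thresholds are illustrative.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Synthetic data and an OLS fit (assumptions hold by construction)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 5 + 2 * x + rng.normal(0, 1, 200)
results = sm.OLS(y, sm.add_constant(x)).fit()

residuals = results.resid
fitted = results.fittedvalues

# Residual vs. fitted values: should show no systematic pattern if homoscedastic
print("Corr(residuals, fitted):", round(float(np.corrcoef(residuals, fitted)[0, 1]), 4))

# Durbin-Watson statistic: a value near 2 suggests no autocorrelation
print("Durbin-Watson:", round(float(durbin_watson(residuals)), 3))

# Q-Q plot for normality of residuals (uncomment if matplotlib is available)
# sm.qqplot(residuals, line="45")
```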
| Linear Regression | Logistic Regression |
|---|---|
| Predicts continuous values | Predicts probabilities |
| Uses OLS | Uses Maximum Likelihood |
| Output is unbounded | Output between 0 and 1 |
| Uses MSE loss | Uses Log loss |
Answer:
Gradient Descent is an iterative optimization algorithm used when datasets are large.
Update rule:
$$\beta := \beta - \alpha \frac{\partial J(\beta)}{\partial \beta}$$
Types:
Batch Gradient Descent
Stochastic Gradient Descent (SGD)
Mini-batch Gradient Descent
Answer:
Overfitting occurs when the model captures noise instead of patterns, performing well on training data but poorly on test data.
Causes:
Too many features
Multicollinearity
Small dataset
Prevention:
Regularization
Feature selection
Cross-validation
Answer:
Regularization adds a penalty term to reduce model complexity.
Ridge (L2 penalty):
$$\text{Loss} = MSE + \lambda \sum \beta^2$$
Shrinks coefficients
Handles multicollinearity
Lasso (L1 penalty):
$$\text{Loss} = MSE + \lambda \sum |\beta|$$
Performs feature selection
Elastic Net: a combination of the L1 and L2 penalties.
Answer:
Mean Absolute Error (MAE)
Mean Squared Error (MSE)
Root Mean Squared Error (RMSE)
R-squared
Adjusted R-squared
Answer:
Detect using boxplots, Cook’s distance
Transform data (log, sqrt)
Remove or cap outliers
Use robust regression
Answer:
High bias → underfitting
High variance → overfitting
Goal is to balance both using regularization and proper model complexity.
Answer:
Non-linear relationships
High outliers
Heteroscedastic data
Categorical variables not encoded
Time-series data with autocorrelation
Answer:
Scatter plots for linearity
Correlation analysis
Residual diagnostics
Compare with non-linear models
Answer:
Feature scaling improves convergence in gradient descent.
Methods:
Standardization (Z-score)
Min-Max Scaling
Important when using regularization.
Answer:
Measures influence of each data point on regression coefficients.
Large Cook’s Distance → influential outlier
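A minimal statsmodels sketch of Cook's Distance via `get_influence()`, with one artificial influential point appended for illustration.

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data with one deliberately influential point appended
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 3 + 1.5 * x + rng.normal(0, 1, 50)
x = np.append(x, 25.0)    # extreme X value (high leverage)
y = np.append(y, 0.0)     # far from the trend line

results = sm.OLS(y, sm.add_constant(x)).fit()
cooks_d = results.get_influence().cooks_distance[0]   # one value per observation

print("Largest Cook's Distance:", round(float(cooks_d.max()), 3))
print("Index of most influential point:", int(cooks_d.argmax()))   # expect the appended point
```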
Answer:
Linear Regression → parametric (assumes fixed form)
Decision Trees → non-parametric
Answer:
Train and validate model
Serialize model (pickle/joblib)
Build API (Flask/FastAPI)
Monitor performance and drift
Answer:
Correlation measures the strength and direction of a relationship between two variables.
Regression models the relationship and predicts the dependent variable.
Correlation does not imply causation, while regression attempts to quantify impact.
Answer:
Linearity violated → Biased predictions
Homoscedasticity violated → Inefficient estimates
Normality violated → Invalid hypothesis tests
Multicollinearity → Unstable coefficients
Autocorrelation → Incorrect confidence intervals
Answer:
Heteroscedasticity occurs when variance of residuals changes with predictors.
Detection:
Residual vs fitted plot
Breusch–Pagan test
Handling:
Log or Box-Cox transformation
Weighted Least Squares
Robust standard errors
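A sketch of detecting heteroscedasticity with the Breusch–Pagan test in statsmodels, on synthetic data whose noise deliberately grows with the predictor:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Synthetic data where the error variance grows with x (heteroscedastic by construction)
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 300)
y = 2 + 3 * x + rng.normal(0, x)      # noise standard deviation proportional to x

X = sm.add_constant(x)
results = sm.OLS(y, X).fit()

# Breusch-Pagan: the null hypothesis is homoscedasticity
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(results.resid, X)
print("Breusch-Pagan p-value:", round(lm_pvalue, 4))   # small p-value -> heteroscedasticity
```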
Answer:
Autocorrelation means residuals are correlated across observations, common in time-series data.
Detection:
Durbin–Watson test (value ~2 is ideal)
Solutions:
Add lag variables
Use time-series models
Generalized Least Squares
Answer:
Used to test statistical significance of coefficients.
Null Hypothesis (H₀): β = 0
Alternative Hypothesis (H₁): β ≠ 0
Tests used:
t-test → individual coefficients
F-test → overall model significance
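A statsmodels sketch showing where these tests appear in practice; the data is synthetic and one predictor is deliberately irrelevant.

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data: x1 matters, x2 is pure noise
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)          # unrelated to y
y = 4 + 2.5 * x1 + rng.normal(0, 1, n)

X = sm.add_constant(np.column_stack([x1, x2]))
results = sm.OLS(y, X).fit()

print(results.summary())          # per-coefficient t-statistics/p-values and the overall F-test
print("p-values:", results.pvalues.round(4))   # expect x2's p-value to be large
print("F-statistic p-value:", float(results.f_pvalue))
```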
Answer:
p-value measures probability of observing results assuming null hypothesis is true.
p < 0.05 → statistically significant
p ≥ 0.05 → not significant
Lower p-value → stronger evidence against null hypothesis.
Answer:
Adjusted R² penalizes irrelevant features.
R² always increases
Adjusted R² increases only if feature improves the model
Best metric for multiple linear regression.
Answer:
Bias: Error from overly simple model
Variance: Error from overly complex model
Linear regression is typically a low-variance, relatively high-bias model, so it tends to underfit when the true relationship is complex.
Answer:
Small dataset → unreliable coefficients
Large dataset → stable estimates, better generalization
Rule of thumb: 10–15 observations per predictor.
Answer:
Feature selection reduces noise and improves interpretability.
Methods:
Forward selection
Backward elimination
Recursive Feature Elimination (RFE)
Lasso regression
| Ridge | Lasso |
|---|---|
| L2 penalty | L1 penalty |
| Shrinks coefficients | Can make coefficients zero |
| No feature selection | Performs feature selection |
| Handles multicollinearity | Sparse model |
Answer:
Elastic Net combines Ridge and Lasso penalties.
Used when:
There are many correlated features
Plain Lasso would otherwise keep only one variable from each correlated group
Answer:
Polynomial regression models non-linear relationships by adding polynomial terms.
Example:
$$y = \beta_0 + \beta_1 x + \beta_2 x^2$$
Still considered linear in coefficients.
Answer:
Categorical variables are converted using dummy encoding.
Avoid dummy variable trap by dropping one category
Use One-Hot Encoding
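A pandas sketch of one-hot encoding with `drop_first=True` to avoid the dummy variable trap; the column values are made up.

```python
import pandas as pd

# Made-up data with one categorical predictor
df = pd.DataFrame({
    "area": [1200, 1500, 900, 2000],
    "city": ["Delhi", "Mumbai", "Delhi", "Chennai"],
})

# drop_first=True drops one dummy column per category to avoid perfect collinearity
encoded = pd.get_dummies(df, columns=["city"], drop_first=True)
print(encoded)
```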
Answer:
Occurs when dummy variables are perfectly correlated.
Solution: Drop one category to avoid multicollinearity.
Answer:
Interaction terms capture combined effect of features.
Example:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$$
Used when one variable’s effect depends on another.
Answer:
Leverage measures how far a data point’s predictor values are from the mean.
High leverage points can strongly influence the model.
Answer:
Confidence interval → Mean prediction
Prediction interval → Individual prediction (wider)
Answer:
Cross-validation evaluates model stability.
k-Fold CV
Leave-One-Out CV
Reduces overfitting and bias.
Answer:
Linear regression is highly interpretable:
Coefficients show feature impact
Sign indicates direction
Magnitude shows strength
Preferred in finance, healthcare, policy models.
Answer:
Assigns weights to observations.
Used when:
Data reliability varies
Heteroscedasticity exists
Answer:
Scaling changes coefficient magnitude but not predictions.
Important for:
Regularization
Gradient descent convergence
Answer:
Occurs when data distribution changes over time.
Handled by:
Monitoring metrics
Retraining models
Data validation
Answer:
Non-linear patterns
High outliers
Categorical target
Complex interactions
Answer:
Check data drift
Re-evaluate assumptions
Inspect residuals
Remove outliers
Add interaction or polynomial terms
Retrain model
Answer:
Measures influence of data points.
High Cook’s Distance → model sensitive to that point.
Answer:
OLS minimizes squared errors, giving more weight to extreme values.
Answer:
Model performs worse than predicting the mean.
Indicates poor fit or incorrect assumptions.
Answer:
MLE estimates parameters assuming normally distributed errors.
Equivalent to OLS under Gaussian noise.
Answer:
“It shows how much each factor impacts the outcome and helps forecast future values using historical trends.”
Answer:
Training error → fit on known data
Test error → generalization ability
Answer:
No. Missing values must be:
Imputed
Removed
Answer:
Shows relationship between target and one predictor, controlling for others.
Answer:
Simple, fast, interpretable, and sets performance benchmark.
Answer:
Use Grid Search with Cross-Validation to tune λ.
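A closing sketch of tuning the regularization strength with scikit-learn's `GridSearchCV` (λ is called `alpha` there); the candidate grid and data are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=20.0, random_state=0)

# Candidate values of the regularization strength (lambda, named alpha in sklearn)
param_grid = {"alpha": np.logspace(-3, 3, 13)}

search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="r2")
search.fit(X, y)

print("Best alpha:", search.best_params_["alpha"])
print("Best cross-validated R^2:", round(search.best_score_, 3))
```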