Top Interview Questions
Machine Learning (ML) is one of the most transformative technologies of the modern digital era. It is a subset of Artificial Intelligence (AI) that enables computers to learn from data and improve their performance without being explicitly programmed. Instead of following rigid instructions, machine learning systems identify patterns, make decisions, and predict outcomes based on historical data. Today, machine learning powers many everyday applications such as search engines, recommendation systems, voice assistants, fraud detection systems, and autonomous vehicles.
Machine Learning is the science of designing algorithms and models that allow systems to learn from experience. The term was first coined by Arthur Samuel in 1959, who defined it as a “field of study that gives computers the ability to learn without being explicitly programmed.” In simple words, machine learning focuses on building systems that automatically improve through exposure to data.
For example, instead of programming a system with fixed rules to detect spam emails, a machine learning model is trained using thousands of labeled emails (spam and non-spam). Over time, the model learns patterns and improves its ability to classify new emails accurately.
The machine learning process typically involves the following steps:
Data Collection – Gathering relevant data from sources such as databases, sensors, logs, or user interactions.
Data Preprocessing – Cleaning the data by handling missing values, removing noise, and normalizing features.
Feature Selection/Engineering – Choosing important variables that contribute to accurate predictions.
Model Selection – Selecting an appropriate algorithm based on the problem type.
Training – Feeding data to the model so it can learn patterns.
Evaluation – Measuring model performance using metrics like accuracy, precision, recall, or RMSE.
Deployment and Monitoring – Using the model in real-world applications and continuously improving it.
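The workflow above can be illustrated with a minimal scikit-learn sketch. This is only an outline, using a built-in toy dataset in place of real collected data; the choice of Logistic Regression and the split ratio are arbitrary illustrations, not a prescribed setup.

```python
# Minimal sketch of the ML workflow (toy dataset stands in for real collected data).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Data collection (here: a bundled example dataset)
X, y = load_breast_cancer(return_X_y=True)

# 2. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Preprocessing: scale features (fit on training data only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 4-5. Model selection and training
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# 6. Evaluation
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```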
Machine learning is broadly classified into three main types:
Supervised learning uses labeled data, meaning the input data is paired with correct output labels. The goal is to learn a mapping between inputs and outputs.
Common algorithms include:
Linear Regression
Logistic Regression
Decision Trees
Random Forest
Support Vector Machines (SVM)
Neural Networks
Examples of supervised learning applications:
Email spam detection
Credit risk assessment
Image classification
Disease prediction
In unsupervised learning, the data does not contain labeled outputs. The model tries to identify hidden patterns or structures within the data.
Common algorithms include:
K-Means Clustering
Hierarchical Clustering
DBSCAN
Principal Component Analysis (PCA)
Applications include:
Customer segmentation
Market basket analysis
Anomaly detection
Data compression
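As a small illustration of unsupervised learning, the sketch below clusters synthetic, unlabeled 2-D points with K-Means; the data and the choice of three clusters are purely illustrative.

```python
# K-Means clustering on synthetic, unlabeled data (illustrative only).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)  # generated labels are ignored

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print("Cluster sizes :", [(labels == k).sum() for k in range(3)])
print("Cluster centers:\n", kmeans.cluster_centers_)
```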
Reinforcement learning involves an agent that learns by interacting with an environment. The agent receives rewards or penalties based on its actions and aims to maximize cumulative rewards.
Key concepts include:
Agent
Environment
Actions
Rewards
Applications include:
Game playing (e.g., AlphaGo)
Robotics
Autonomous vehicles
Resource optimization
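To make the agent–environment–reward loop concrete, here is a toy tabular Q-learning sketch on a hypothetical 5-state corridor where the agent moves left or right and is rewarded only for reaching the rightmost state. The environment and hyperparameters are invented for illustration.

```python
# Toy Q-learning: states 0..4 in a corridor, actions 0=left / 1=right, reward at state 4.
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2  # learning rate, discount factor, exploration rate

rng = np.random.default_rng(0)
for episode in range(500):
    state = 0
    while state != n_states - 1:
        # epsilon-greedy action selection
        action = rng.integers(n_actions) if rng.random() < epsilon else int(Q[state].argmax())
        next_state = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
        reward = 1.0 if next_state == n_states - 1 else 0.0
        # Q-learning update rule
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print(Q)  # the "right" action should end up with the higher value in every state
```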
Some of the most widely used machine learning algorithms are:
Linear Regression – Predicts continuous values based on linear relationships.
Logistic Regression – Used for binary classification problems.
Decision Trees – Tree-like models used for classification and regression.
Random Forest – An ensemble of decision trees for improved accuracy.
Support Vector Machines (SVM) – Finds optimal boundaries between classes.
K-Nearest Neighbors (KNN) – Classifies data based on similarity.
Neural Networks – Inspired by the human brain, used in deep learning applications.
Artificial Intelligence (AI) is the broader concept of machines simulating human intelligence.
Machine Learning (ML) is a subset of AI focused on learning from data.
Deep Learning (DL) is a subset of ML that uses multi-layered neural networks to process large volumes of complex data.
For example, image recognition systems often use deep learning techniques like Convolutional Neural Networks (CNNs).
Machine learning has widespread applications across industries:
Healthcare – Disease diagnosis, medical image analysis, drug discovery
Finance – Fraud detection, algorithmic trading, credit scoring
Retail – Recommendation systems, demand forecasting, pricing optimization
Transportation – Self-driving cars, traffic prediction
Education – Personalized learning platforms
Cybersecurity – Intrusion detection and threat analysis
Machine learning offers several advantages:
Automates decision-making processes
Improves accuracy over time
Handles large and complex datasets efficiently
Reduces human intervention
Enables predictive analytics
Despite its benefits, machine learning has several challenges:
Requires large amounts of quality data
Computationally expensive
Model interpretability issues
Risk of bias in data
Ethical and privacy concerns
Addressing these challenges requires careful data handling, transparent algorithms, and responsible AI practices.
The future of machine learning is promising and rapidly evolving. With advancements in computing power, cloud technologies, and big data, machine learning models are becoming more powerful and accessible. Emerging trends include AutoML, Explainable AI (XAI), federated learning, and integration with IoT and blockchain technologies. Machine learning is expected to play a critical role in shaping smart cities, personalized healthcare, and intelligent automation.
Answer:
Machine Learning (ML) is a branch of Artificial Intelligence (AI) that enables systems to learn from data and make predictions or decisions without being explicitly programmed. Instead of writing rules, ML uses algorithms to identify patterns in data and improve over time.
Example: Predicting house prices based on historical data like size, location, and number of rooms.
Answer:
Machine Learning is mainly divided into three types:
Supervised Learning:
The algorithm is trained on labeled data (input + output).
Goal: Predict output for new data.
Example: Predicting salary based on experience.
Unsupervised Learning:
The algorithm works on unlabeled data.
Goal: Find hidden patterns or groupings.
Example: Customer segmentation in marketing.
Reinforcement Learning:
The algorithm learns by trial and error using rewards or penalties.
Example: Training a robot to walk or play a game like chess.
Answer:
| Aspect | AI | ML | Deep Learning (DL) |
|---|---|---|---|
| Definition | Intelligence demonstrated by machines | Algorithms that learn from data | Neural networks with multiple layers |
| Data Requirement | Not always | Needs data | Needs huge amounts of data |
| Complexity | Basic to advanced | Medium | High |
| Example | Chess AI, Chatbots | Linear Regression, SVM | Image recognition, NLP models |
Answer:
Overfitting: Model performs very well on training data but poorly on new/unseen data.
Cause: Model is too complex or training data is too limited.
Solution: Use more data, apply regularization, or choose a simpler model.
Underfitting: Model performs poorly on both training and test data.
Cause: Model is too simple or the features are insufficient.
Solution: Use a more complex model or add features.
Example:
Predicting house prices using only the number of bedrooms (underfitting) vs using 50 irrelevant features (overfitting).
Answer:
Features: Input variables used to make predictions.
Example: Age, income, education in predicting loan approval.
Labels: Output or target variable we want to predict.
Example: Loan approved (Yes/No).
Answer:
Common supervised learning algorithms:
Linear Regression: Predicts continuous numeric output.
Logistic Regression: Predicts binary outcome (Yes/No).
Decision Trees: Splits data based on features.
Random Forest: Ensemble of decision trees to improve accuracy.
Support Vector Machine (SVM): Finds a hyperplane that separates classes.
K-Nearest Neighbors (KNN): Classifies based on nearest points in feature space.
Answer:
Common unsupervised learning algorithms:
K-Means Clustering: Groups data points into k clusters.
Hierarchical Clustering: Builds a hierarchy of clusters.
Principal Component Analysis (PCA): Reduces dimensionality.
Association Rule Learning: Finds relationships between variables (e.g., Market Basket Analysis).
Answer:
A confusion matrix evaluates classification models by showing:
| Actual \ Predicted | Positive | Negative |
|---|---|---|
| Positive | TP | FN |
| Negative | FP | TN |
TP: True Positive, correctly predicted positive.
TN: True Negative, correctly predicted negative.
FP: False Positive, incorrectly predicted positive.
FN: False Negative, incorrectly predicted negative.
Metrics from confusion matrix:
Accuracy = (TP + TN) / Total
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
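These quantities can be computed directly with scikit-learn; the labels below are made up purely to show the calls.

```python
# Confusion-matrix metrics on a small made-up example.
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))   # rows = actual, cols = predicted: [[TN, FP], [FN, TP]]
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
```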
Answer:
| Aspect | Classification | Regression |
|---|---|---|
| Output | Categorical (Yes/No) | Continuous (numbers) |
| Example | Email spam detection | Predicting house prices |
| Algorithm | Logistic Regression, SVM | Linear Regression, SVR |
Answer:
Cross-validation is a technique to validate the performance of a model on unseen data.
K-Fold Cross-Validation: Divides data into k parts, trains on k-1 parts, tests on 1 part, repeats k times.
Helps prevent overfitting and gives a more robust estimate of model performance.
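A minimal k-fold sketch with scikit-learn, using a bundled dataset purely for illustration:

```python
# 5-fold cross-validation with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=5)

print("Fold accuracies:", scores)
print("Mean accuracy  :", scores.mean())
```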
Answer:
Learning rate is a hyperparameter that controls how much the model weights are updated during training.
Too high → model may overshoot minimum (fail to converge).
Too low → slow convergence, may get stuck in local minima.
| Aspect | Parametric Model | Non-Parametric Model |
|---|---|---|
| Parameters | Fixed number | Flexible, depends on data |
| Example | Linear Regression | KNN, Decision Trees |
| Assumption | Assumes data distribution | No assumption on distribution |
Bias: Error due to wrong assumptions in the learning algorithm → underfitting.
Variance: Error due to sensitivity to training data → overfitting.
Goal: Find a balance → bias-variance tradeoff.
Answer:
Feature scaling normalizes data so that all features contribute equally.
Methods:
Standardization: (x - mean) / standard deviation
Min-Max Scaling: (x - min) / (max - min)
Importance: Algorithms like KNN, SVM, and gradient descent perform better with scaled data.
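A short sketch of both methods (note that the scalers are fitted on training data only, then applied to new data); the numbers are invented for illustration.

```python
# Standardization vs. min-max scaling with scikit-learn.
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_test = np.array([[1.5, 250.0]])

std = StandardScaler().fit(X_train)   # learns mean and standard deviation from training data
mm = MinMaxScaler().fit(X_train)      # learns min and max from training data

print(std.transform(X_test))          # (x - mean) / standard deviation
print(mm.transform(X_test))           # (x - min) / (max - min)
```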
Answer:
Regularization prevents overfitting by adding penalty terms to the loss function:
L1 Regularization (Lasso): Adds absolute value of weights → can reduce some weights to 0 (feature selection).
L2 Regularization (Ridge): Adds squared value of weights → reduces magnitude of weights.
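A compact comparison sketch on synthetic regression data; the alpha values are arbitrary and chosen only to show the typical effect of L1 vs. L2.

```python
# L1 (Lasso) vs. L2 (Ridge) regularization on synthetic data.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=42)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso zero weights:", (lasso.coef_ == 0).sum())  # L1 can drive coefficients to exactly 0
print("Ridge zero weights:", (ridge.coef_ == 0).sum())  # L2 shrinks them but rarely zeroes them
```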
Scikit-learn: Supervised & unsupervised algorithms, preprocessing, metrics.
TensorFlow / Keras: Deep learning frameworks for neural networks.
Pandas / NumPy: Data manipulation and numerical computations.
Matplotlib / Seaborn: Data visualization.
ML focuses on prediction; statistics focuses on inference.
ML can handle large datasets and complex relationships.
Statistics emphasizes hypothesis testing and confidence intervals.
Answer:
Hyperparameters are settings chosen before training the model.
Examples: Learning rate, number of trees in Random Forest, k in KNN.
Hyperparameter tuning is done via Grid Search or Random Search.
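A brief Grid Search sketch with scikit-learn; the parameter grid below is an arbitrary illustration, not a recommended configuration.

```python
# Hyperparameter tuning with GridSearchCV.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}

search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV score  :", search.best_score_)
```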
Answer:
PCA is a dimensionality reduction technique that transforms features into principal components while retaining maximum variance.
Helps reduce overfitting, speeds up training, and improves visualization.
Common metrics:
Mean Absolute Error (MAE) – average absolute difference.
Mean Squared Error (MSE) – average squared difference.
Root Mean Squared Error (RMSE) – square root of MSE.
R-squared (R²) – proportion of variance explained by model.
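These metrics can be computed as follows (the predictions are made up just to show the calls):

```python
# Regression metrics on a small made-up example.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mse = mean_squared_error(y_true, y_pred)
print("MAE :", mean_absolute_error(y_true, y_pred))
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))
print("R²  :", r2_score(y_true, y_pred))
```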
Answer:
A Decision Tree is a supervised learning algorithm used for classification and regression.
It splits data based on feature values into branches to make predictions.
The root node represents the feature that best splits the data.
Advantages: Easy to understand, no scaling required.
Disadvantages: Prone to overfitting.
Example: Predicting whether a student passes based on hours studied and attendance.
Answer:
Random Forest is an ensemble learning method that builds multiple decision trees and aggregates their results.
Helps improve accuracy and reduce overfitting.
Each tree is trained on a random subset of data and random subset of features.
Example: Predicting customer churn in telecom using multiple features.
Answer:
SVM is a supervised algorithm for classification and regression.
Finds a hyperplane that best separates classes in feature space.
Kernel trick: Allows SVM to work with non-linear data.
Example: Classifying emails as spam or not spam.
Answer:
KNN is a lazy learning algorithm used for classification and regression.
Predicts output based on the majority class of k nearest points in feature space.
Distance metrics: Euclidean, Manhattan, etc.
Simple but computationally expensive with large datasets.
Answer:
Gradient Descent is an optimization algorithm used to minimize the loss function in ML models.
Updates model weights in the opposite direction of the gradient.
Types:
Batch Gradient Descent: Uses entire dataset → slow but stable.
Stochastic Gradient Descent (SGD): Updates weights per sample → fast but noisy.
Mini-batch Gradient Descent: Updates weights per batch → balanced approach.
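A minimal NumPy sketch of the batch update rule for simple linear regression; SGD and mini-batch differ only in how many samples feed each update. The synthetic data and learning rate are illustrative assumptions.

```python
# Batch gradient descent for y = w*x + b on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3.0 * x + 2.0 + rng.normal(0, 1, 100)   # true w = 3, b = 2, plus noise

w, b, lr = 0.0, 0.0, 0.01
for _ in range(1000):
    y_pred = w * x + b
    grad_w = (2 / len(x)) * np.sum((y_pred - y) * x)  # d(MSE)/dw
    grad_b = (2 / len(x)) * np.sum(y_pred - y)        # d(MSE)/db
    w -= lr * grad_w                                   # step opposite the gradient
    b -= lr * grad_b

print(f"Learned w={w:.2f}, b={b:.2f}")
```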
Answer:
Deep Learning is a subset of ML that uses neural networks with multiple layers to learn from large datasets.
Works well with images, text, and speech.
Popular architectures: CNN (Convolutional Neural Network), RNN (Recurrent Neural Network), Transformers.
Example: Face recognition, self-driving cars.
Answer:
A neural network is inspired by the human brain and consists of layers:
Input layer – receives features.
Hidden layers – perform computations using weights and activation functions.
Output layer – gives predictions.
Activation functions: Sigmoid, ReLU, Tanh.
Neural networks learn by backpropagation to minimize loss.
| Aspect | Bagging | Boosting |
|---|---|---|
| Purpose | Reduce variance | Reduce bias |
| How | Builds multiple models in parallel and averages results | Builds sequential models where each learns from previous errors |
| Example | Random Forest | AdaBoost, Gradient Boosting |
| Aspect | PCA (Principal Component Analysis) | LDA (Linear Discriminant Analysis) |
|---|---|---|
| Goal | Reduce dimensionality | Reduce dimensionality with class separation |
| Supervised / Unsupervised | Unsupervised | Supervised |
| Example | Visualizing high-dimensional data | Face recognition with labeled classes |
Answer:
RL is a type of ML where an agent learns to take actions in an environment to maximize cumulative reward.
Key components: Agent, Environment, Reward, Policy.
Algorithms: Q-Learning, Deep Q-Networks (DQN).
Example: Training AI for self-driving cars or playing games like Chess/Go.
Answer:
Selection Bias: Data collected is not representative of the population.
Sampling Bias: Some groups are over or under-represented.
Measurement Bias: Data collection process is flawed.
| Aspect | Batch Learning | Online Learning |
|---|---|---|
| Data | Entire dataset at once | One data point at a time |
| Update | Model trained once | Model updated continuously |
| Example | Linear Regression | Stock price prediction |
Answer:
Remove rows or columns with missing values (if few).
Imputation: Fill missing values using mean, median, mode, or prediction models.
Advanced: Use algorithms like XGBoost that handle missing data internally.
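A short imputation sketch with scikit-learn; the array is invented just to show missing values being filled.

```python
# Mean imputation of missing values with scikit-learn.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])

imputer = SimpleImputer(strategy="mean")   # other options: "median", "most_frequent"
print(imputer.fit_transform(X))
```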
ROC Curve: Graph of True Positive Rate (Recall) vs False Positive Rate at different thresholds.
AUC (Area Under Curve): Measures model’s ability to distinguish classes.
AUC = 1 → perfect model
AUC = 0.5 → random guessing
Answer:
Clustering is an unsupervised learning technique that groups data points based on similarity.
Algorithms: K-Means, Hierarchical, DBSCAN.
Applications: Market segmentation, anomaly detection.
Answer:
Overfitting in deep learning occurs when the model memorizes training data but fails on new data.
Solutions:
Regularization (L1, L2, Dropout)
Data augmentation
Early stopping
Reduce network complexity
High bias → underfitting → simple model.
High variance → overfitting → complex model.
Goal: Find optimal model complexity with minimal error.
| Aspect | Parametric | Non-Parametric |
|---|---|---|
| Assumption | Assumes data distribution | No assumption |
| Examples | Linear Regression, Logistic Regression | KNN, Decision Trees, SVM |
| Complexity | Low | High |
| Flexibility | Less flexible | More flexible |
Answer:
Ensemble learning combines multiple models to improve accuracy and reduce errors.
Bagging: Random Forest
Boosting: AdaBoost, Gradient Boosting
Stacking: Combines predictions of different models using a meta-model.
Answer:
For multi-class classification, the confusion matrix shows actual vs predicted counts for each class.
Helps compute metrics like accuracy, precision, recall, and F1-score per class.
Example: Handwritten digit recognition (0-9) uses a 10x10 confusion matrix.
Answer:
| Aspect | Machine Learning | Deep Learning | Statistical Learning |
|---|---|---|---|
| Focus | Prediction and decision making | Automated feature extraction + prediction | Understanding relationships in data |
| Data Requirement | Moderate to large datasets | Very large datasets | Moderate datasets |
| Complexity | Medium | High | Low to medium |
| Model Interpretability | Usually interpretable (trees, regression) | Often black-box (neural networks) | Highly interpretable |
| Example | Random Forest, SVM | CNN, RNN | Linear regression, GLM |
Answer:
Imbalanced datasets occur when one class dominates. Solutions include:
Resampling Techniques:
Oversampling minority class: SMOTE, ADASYN
Undersampling majority class: Random undersampling
Algorithmic Approaches:
Use class weights in models like Logistic Regression, XGBoost
Use algorithms robust to imbalance like Balanced Random Forest
Evaluation Metrics:
Accuracy can be misleading; use F1-score, Precision, Recall, ROC-AUC.
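As one example of the algorithmic approach, most scikit-learn classifiers accept class weights. The sketch below uses synthetic imbalanced data and is illustrative only; the exact metric gains will vary by dataset.

```python
# Using class weights for an imbalanced binary problem.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

print("F1 without weights:", f1_score(y_te, plain.predict(X_te)))
print("F1 with weights   :", f1_score(y_te, weighted.predict(X_te)))
```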
Answer:
Overfitting occurs when a model memorizes training data. Solutions:
Regularization: L1, L2, ElasticNet
Cross-validation: K-Fold, Stratified K-Fold
Feature Selection: Remove irrelevant or highly correlated features
Ensemble Methods: Bagging, Boosting
Dropout: For neural networks
Early Stopping: Monitor validation loss during training
Answer:
Cross-validation evaluates a model’s performance on unseen data. Common types:
K-Fold Cross-Validation: Data split into k folds
Stratified K-Fold: Maintains class balance for classification
Leave-One-Out CV: For very small datasets
For time-series: Use TimeSeriesSplit to maintain temporal order, avoiding leakage from future to past.
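A short sketch showing how TimeSeriesSplit keeps every training fold strictly before its test fold (the ten observations are placeholders for time-ordered data):

```python
# TimeSeriesSplit preserves temporal order: each test fold comes after its training fold.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)   # 10 time-ordered observations

for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("train:", train_idx, "test:", test_idx)
```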
Answer:
Feature engineering is creating or transforming features to improve model performance.
Types:
Encoding categorical variables: One-hot, label encoding
Scaling/Normalization: MinMaxScaler, StandardScaler
Creating new features: Date/time decomposition, interaction terms
Dimensionality reduction: PCA, t-SNE
Importance: Good features often matter more than complex models.
Answer:
Imputation: Mean, median, mode, or predictive models
Forward/Backward Fill: For time-series data
Indicator Variables: Flag missing values as a separate feature
Pipeline Automation: Use tools like Scikit-learn Pipelines or FeatureStore to handle missing data consistently
Answer:
Ensemble learning combines multiple models to improve performance.
Bagging (Bootstrap Aggregation): Random Forest → reduces variance
Boosting: XGBoost, LightGBM → reduces bias by sequential learning
Stacking: Combines predictions of multiple models using a meta-model
Scenario: For fraud detection, boosting models often outperform a single decision tree.
Answer:
Grid Search: Exhaustive search over predefined parameters
Random Search: Random sampling from parameter distributions
Bayesian Optimization: Probabilistic model to find optimal parameters efficiently
Automated Tools: Optuna, HyperOpt, or Scikit-learn RandomizedSearchCV
Tip: Use cross-validation to avoid overfitting during hyperparameter tuning.
Answer:
Metrics Tracking: Accuracy, F1-score, RMSE, AUC
Drift Detection:
Data Drift: Input distribution changes
Concept Drift: Relationship between features and target changes
Logging & Alerts: Track model predictions and errors
Model Retraining: Trigger retraining when performance drops below threshold
Tools: MLflow, Kubeflow, Seldon, Prometheus
High Bias: Underfitting → e.g., Linear Regression on complex data
High Variance: Overfitting → e.g., Decision Tree without depth limit
Solution: Regularization, pruning trees, ensemble methods, cross-validation
| Aspect | Parametric | Non-Parametric |
|---|---|---|
| Assumptions | Assumes data distribution | No assumptions |
| Examples | Linear Regression, Logistic Regression | KNN, Decision Trees, SVM |
| Flexibility | Less flexible | More flexible |
| Data Requirement | Small datasets | Large datasets |
Answer:
Filter Methods: Pearson correlation, Chi-Square, ANOVA
Wrapper Methods: Recursive Feature Elimination (RFE), forward/backward selection
Embedded Methods: Lasso Regression, Tree-based feature importance
Scenario: Selecting top 10 features out of 100 for customer churn prediction.
Answer:
Gradient Descent Variants: SGD, Mini-batch, Batch
Momentum: Helps accelerate SGD
Adam: Adaptive learning rate optimizer widely used in deep learning
RMSProp: Adaptive learning rate for non-stationary objectives
Answer:
One-Hot Encoding: Convert each category into a binary vector
Label Encoding: Assign integer labels to categories
Target Encoding: Use mean of target variable per category
Embeddings: Neural network-based representation for high-cardinality features
Tree-Based: XGBoost, LightGBM, CatBoost → fast and handles missing data
Linear Models: SGDClassifier, Logistic Regression with sparse data
Clustering: MiniBatchKMeans for large datasets
Answer:
Correlation Matrix: Remove highly correlated features
Variance Inflation Factor (VIF): Remove features with high VIF
Dimensionality Reduction: PCA to combine correlated features
ROC Curve: Works well for balanced datasets
Precision-Recall Curve: Better for imbalanced datasets
Metrics: F1-score balances precision and recall
Answer:
Monitor performance metrics continuously
Retrain model periodically
Online learning: Update model incrementally
Alert triggers: If error rate exceeds threshold
Answer:
XGBoost: Gradient boosting library optimized for speed and accuracy
Advantages:
Handles missing values internally
Regularization to prevent overfitting
Parallel and distributed computing support
Widely used in Kaggle competitions and real-world projects
Classification:
Accuracy, Precision, Recall, F1-score, ROC-AUC, Log-loss
Regression:
Mean Absolute Error (MAE)
Mean Squared Error (MSE)
Root Mean Squared Error (RMSE)
R² Score
| Aspect | Batch Learning | Online Learning |
|---|---|---|
| Data Input | Entire dataset at once | One data point at a time |
| Model Update | Trained once | Continuously updated |
| Use Case | Static datasets | Streaming data (real-time) |
| Examples | Linear Regression, Random Forest | SGD, Online Naive Bayes |
Dimensionality reduction: PCA, feature selection
Sampling: Random or stratified sampling
Distributed computing: Spark MLlib, Dask, Hadoop
Mini-batch training: Neural networks
Number of trees (n_estimators): more trees generally improve accuracy but slow down training
Maximum depth (max_depth): limits tree depth to prevent overfitting
Minimum samples per leaf (min_samples_leaf): controls tree size
Features considered per split (max_features): decorrelates the trees
| Aspect | Generative | Discriminative |
|---|---|---|
| Goal | Model joint probability P(x, y) | Model conditional probability P(y \| x) |
| Examples | Naive Bayes, HMM | Logistic Regression, SVM |
| Use Case | Text generation, Speech | Classification |
Answer:
Steps:
Save model (pickle, joblib)
Expose via API (Flask, FastAPI, Django)
Containerize using Docker
Orchestrate using Kubernetes for scaling
Monitor performance and retrain periodically
Tools: MLflow, Kubeflow, Seldon, AWS SageMaker, Azure ML
Answer:
CNNs are specialized neural networks for image and spatial data.
Layers:
Convolution Layer: Applies filters to extract features
Pooling Layer: Reduces dimensionality (MaxPooling, AvgPooling)
Fully Connected Layer: Makes predictions
Advantages: Captures spatial hierarchies, fewer parameters than dense networks
Example: Image classification, object detection
Answer:
RNNs: Designed for sequential data; maintains memory of previous inputs
Problem: Vanishing gradients in long sequences
Solution: LSTM (Long Short-Term Memory) networks
Gates: Input, Forget, Output
Allows learning long-term dependencies
Example: Text generation, speech recognition, stock price prediction
Answer:
Transformer architecture uses attention mechanisms instead of recurrence.
Key Components:
Multi-head self-attention
Positional encoding
Feed-forward layers
Advantages: Parallel training, handles long sequences efficiently
Example: GPT, BERT, NLP tasks
Answer:
Data Collection: Text corpus
Text Preprocessing: Tokenization, stemming/lemmatization, stopword removal
Feature Extraction: Bag of Words, TF-IDF, Word embeddings (Word2Vec, GloVe)
Modeling: Logistic Regression, Naive Bayes, LSTM, Transformer-based models
Evaluation: Accuracy, F1-score, BLEU score for text generation
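A compact sketch of such a pipeline (TF-IDF features plus Logistic Regression) on a tiny made-up corpus; a real sentiment model would need far more data and proper evaluation.

```python
# Minimal sentiment-style text classifier: TF-IDF features + Logistic Regression.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great movie, loved it", "terrible plot, waste of time",
         "wonderful acting", "boring and bad"]
labels = [1, 0, 1, 0]   # 1 = positive, 0 = negative (made-up labels)

clf = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["loved the acting", "what a waste"]))
```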
Answer:
Challenges: Seasonality, trends, autocorrelation
Models:
Classical: ARIMA, SARIMA, Exponential Smoothing
ML-based: Random Forest, XGBoost with lag features
Deep Learning: LSTM, GRU
Preprocessing:
Stationarity check (ADF test)
Scaling/normalization
Creating lag and rolling window features
| Aspect | ARIMA | LSTM |
|---|---|---|
| Type | Statistical model | Deep learning model |
| Handles Non-Linearity | Limited | Can handle non-linear patterns |
| Data Requirement | Small to medium datasets | Large datasets |
| Feature Engineering | Minimal (lags, differencing) | Can include multiple features |
Metrics:
MAE (Mean Absolute Error) – Average absolute difference
MSE (Mean Squared Error) – Penalizes larger errors
RMSE (Root Mean Squared Error) – Same unit as original data
MAPE (Mean Absolute Percentage Error) – Relative error metric
Answer:
Components: Agent, Environment, Action, Reward, Policy, Value Function
Types:
Model-free: Q-Learning, SARSA
Model-based: Learn transition probabilities
Applications: Robotics, game AI (Chess, Go), recommendation systems
Answer:
Data leakage occurs when information from the test set influences training, leading to inflated metrics.
Prevention:
Split data properly before preprocessing
Avoid using future information in time-series models
Apply feature engineering only on training data
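Wrapping preprocessing in a Pipeline is one common safeguard: the scaler is then refit on each training fold and never sees the corresponding test fold. A minimal sketch, using a bundled dataset for illustration:

```python
# Fitting the scaler inside a Pipeline so cross-validation never leaks test data into preprocessing.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
pipe = make_pipeline(StandardScaler(), SVC())

print(cross_val_score(pipe, X, y, cv=5).mean())  # scaler is refit on each training fold only
```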
Answer:
Embeddings are dense vector representations of words or entities capturing semantic meaning.
Types:
Word2Vec, GloVe → static embeddings
BERT, GPT → contextual embeddings
Applications: Sentiment analysis, search, recommendation systems
Answer:
Attention allows a model to focus on relevant parts of input while making predictions.
Widely used in seq2seq models, transformers
Benefit: Captures long-range dependencies and improves translation/generation tasks
Answer:
Resampling techniques: SMOTE, ADASYN per class
Algorithmic: Class weights in cross-entropy loss
Evaluation Metrics: Macro-averaged Precision, Recall, F1-score
Ensemble Methods: Balanced Random Forest, Gradient Boosting with sampling
Answer:
Monitor input data distribution (feature drift)
Monitor output distribution (prediction drift)
Use statistical tests (KS-test, Chi-square) for drift detection
Trigger retraining or online learning when drift exceeds threshold
| Aspect | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| Speed | Fast, parallelizable | Faster on large datasets | Moderate, optimized for categorical features |
| Handling Categorical | Requires preprocessing | Supports categorical features | Native categorical support |
| Overfitting Control | L1/L2 regularization | Leaf-wise growth control | Ordered boosting, regularization |
Feature Importance: Tree-based models
SHAP Values: Contribution of each feature for a prediction
LIME: Local interpretable model-agnostic explanations
Partial Dependence Plots: Relationship between features and output
Steps:
Save model (pickle, joblib, ONNX)
API layer (Flask, FastAPI, gRPC)
Containerization (Docker)
Orchestration (Kubernetes, AWS SageMaker endpoints)
Monitor performance and retrain periodically
Tools: MLflow, Seldon, Kubeflow, Airflow
| Aspect | Batch ML | Online ML | Streaming ML |
|---|---|---|---|
| Data Input | Entire dataset at once | One sample at a time | Continuous data stream |
| Update | Train once, static model | Incremental update | Near real-time predictions |
| Use Case | Historical analysis | Stock price updates | Real-time recommendations |
Feature engineering (interaction terms, polynomial features)
Hyperparameter tuning (Grid Search, Bayesian optimization)
Ensemble methods (stacking, boosting, bagging)
Data augmentation for images/text
Dimensionality reduction for high-dimensional data
Statistical Methods: Z-score, IQR
ML-based: Isolation Forest, One-Class SVM, Autoencoders
Applications: Fraud detection, predictive maintenance
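A small Isolation Forest sketch on synthetic data with a few injected outliers; the contamination value is an illustrative assumption.

```python
# Anomaly detection with Isolation Forest on synthetic data.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal = rng.normal(0, 1, size=(200, 2))
outliers = rng.uniform(6, 8, size=(5, 2))     # injected anomalies
X = np.vstack([normal, outliers])

iso = IsolationForest(contamination=0.05, random_state=42).fit(X)
pred = iso.predict(X)                         # -1 = anomaly, 1 = normal

print("Detected anomalies:", int((pred == -1).sum()))
```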
Remove highly correlated features (correlation matrix)
Use Regularization (L1/Lasso)
Dimensionality reduction (PCA)
Tree-based algorithms (Random Forest, XGBoost) are less sensitive