Big Data

Top Interview Questions

About Big Data

Big Data refers to extremely large and complex sets of data that cannot be effectively captured, stored, processed, or analyzed using traditional data management tools and techniques. In today’s digital world, data is generated at an unprecedented rate from diverse sources such as social media platforms, mobile devices, sensors, business transactions, websites, and Internet of Things (IoT) devices. Big Data plays a crucial role in helping organizations gain insights, improve decision-making, enhance customer experiences, and create competitive advantages.

Understanding Big Data

Big Data is not just about large volumes of data; it also involves managing and analyzing data that is fast-moving, diverse in format, and uncertain in nature. Traditional databases and data processing systems struggle to handle such data efficiently. As a result, new technologies, architectures, and analytical approaches have emerged to address the challenges posed by Big Data.

The concept of Big Data is commonly described using the five Vs:

  1. Volume – Refers to the massive amount of data generated every second. Organizations collect terabytes and petabytes of data from multiple sources, including customer interactions, transaction records, and machine-generated logs.

  2. Velocity – Indicates the speed at which data is generated, processed, and analyzed. Real-time or near-real-time data processing is often required in applications such as fraud detection, stock trading, and online recommendations.

  3. Variety – Represents the different types of data, including structured data (databases, spreadsheets), semi-structured data (XML, JSON), and unstructured data (text, images, videos, audio).

  4. Veracity – Refers to the quality and reliability of data. Big Data often contains inconsistencies, missing values, or inaccuracies, making data validation and cleansing critical.

  5. Value – Highlights the importance of extracting meaningful insights from data. The true worth of Big Data lies in the actionable intelligence it provides, not just in its size.

Big Data Architecture

A typical Big Data architecture is designed to handle large-scale data ingestion, storage, processing, and analysis. It usually consists of the following layers:

  • Data Sources: Data is collected from various sources such as databases, social media platforms, sensors, applications, and logs.

  • Data Ingestion: Tools like Apache Kafka, Flume, or Sqoop are used to collect and transfer data into the system.

  • Data Storage: Distributed storage systems such as Hadoop Distributed File System (HDFS), cloud storage, or NoSQL databases store large volumes of data efficiently.

  • Data Processing: Frameworks like Apache Hadoop MapReduce and Apache Spark process data in batch or real-time modes.

  • Data Analytics and Visualization: Analytical tools and dashboards help users explore data and generate insights for decision-making.

Big Data Technologies

Several technologies and frameworks support Big Data processing:

  • Apache Hadoop: An open-source framework that enables distributed storage and processing of large datasets across clusters of computers.

  • Apache Spark: A fast, in-memory data processing engine that supports batch processing, streaming, machine learning, and graph analytics.

  • NoSQL Databases: Databases such as MongoDB, Cassandra, and HBase are designed to handle unstructured and semi-structured data.

  • Data Warehouses and Data Lakes: Modern data warehouses and data lakes store raw and processed data for analytics and reporting.

  • Cloud Platforms: Cloud services like AWS, Azure, and Google Cloud provide scalable infrastructure and managed Big Data services.

Applications of Big Data

Big Data is widely used across industries to solve complex problems and uncover valuable insights:

  • Healthcare: Analyzing patient records, medical images, and wearable device data to improve diagnosis, treatment, and preventive care.

  • Finance: Detecting fraud, managing risk, and analyzing market trends in real time.

  • Retail and E-commerce: Understanding customer behavior, personalizing recommendations, and optimizing inventory management.

  • Manufacturing: Monitoring equipment performance and predicting failures using sensor data.

  • Telecommunications: Improving network performance and customer retention through usage analytics.

  • Government and Smart Cities: Enhancing public services, traffic management, and resource utilization.

Benefits of Big Data

The adoption of Big Data offers numerous advantages:

  • Better Decision-Making: Data-driven insights lead to more informed and accurate business decisions.

  • Cost Optimization: Identifying inefficiencies and optimizing processes reduces operational costs.

  • Improved Customer Experience: Personalized services and targeted marketing enhance customer satisfaction.

  • Innovation: Big Data enables organizations to develop new products, services, and business models.

  • Competitive Advantage: Companies that effectively leverage Big Data can outperform their competitors.

Challenges of Big Data

Despite its benefits, Big Data presents several challenges:

  • Data Security and Privacy: Protecting sensitive data from breaches and ensuring compliance with regulations.

  • Data Quality: Ensuring accuracy, consistency, and reliability across large datasets.

  • Scalability: Managing rapidly growing data volumes while maintaining performance.

  • Skill Shortage: Lack of skilled professionals such as data engineers, data scientists, and analysts.

  • Integration Complexity: Combining data from diverse sources and systems can be difficult.

Future of Big Data

The future of Big Data is closely tied to advancements in artificial intelligence (AI), machine learning (ML), and cloud computing. As data continues to grow, organizations will increasingly rely on automated analytics, real-time processing, and intelligent systems to extract insights. Technologies such as edge computing, data mesh, and augmented analytics are shaping the next generation of Big Data solutions.

Fresher Interview Questions

 

1. What is Big Data?

Answer:
Big Data refers to extremely large volumes of data that cannot be processed efficiently using traditional databases and data-processing tools. This data can be structured, semi-structured, or unstructured and is generated from sources such as social media, sensors, mobile devices, transactions, and logs. Big Data technologies help store, process, and analyze this data to extract valuable insights.


2. What are the characteristics of Big Data?

Answer:
Big Data is commonly described using the 5 V’s:

  1. Volume – Huge amounts of data (terabytes to petabytes).

  2. Velocity – Speed at which data is generated and processed (real-time or near real-time).

  3. Variety – Different data formats (text, images, videos, JSON, XML).

  4. Veracity – Quality and reliability of data.

  5. Value – Useful insights derived from data.


3. Why is Big Data important?

Answer:
Big Data helps organizations:

  • Make better business decisions

  • Understand customer behavior

  • Improve operational efficiency

  • Detect fraud and security threats

  • Enable predictive analytics and AI models


4. What are the types of Big Data?

Answer:

  1. Structured Data – Organized data in rows and columns (e.g., RDBMS tables).

  2. Semi-Structured Data – Partial structure (e.g., JSON, XML, CSV).

  3. Unstructured Data – No predefined format (e.g., images, videos, social media posts).


5. What is Hadoop?

Answer:
Apache Hadoop is an open-source Big Data framework used to store and process large datasets across distributed systems. It provides fault tolerance, scalability, and high availability.


6. What are the core components of Hadoop?

Answer:
Hadoop has three main components:

  1. HDFS (Hadoop Distributed File System) – Storage layer

  2. YARN (Yet Another Resource Negotiator) – Resource management

  3. MapReduce – Data processing framework


7. What is HDFS?

Answer:
HDFS is a distributed file system that stores data across multiple machines. It breaks large files into blocks and distributes them across nodes to ensure fault tolerance and high availability.


8. What is MapReduce?

Answer:
MapReduce is a programming model used to process large datasets in parallel.

  • Map – Processes input data and produces key-value pairs

  • Reduce – Aggregates and processes the output from Map tasks
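
As a concrete illustration of the two phases above, here is a hedged word-count sketch written as two Hadoop Streaming-style Python scripts (the file names and the way the job is submitted with the Hadoop Streaming jar are assumptions):

  # ----- mapper.py : the "Map" phase -----
  # Reads raw lines from stdin and emits (word, 1) key-value pairs.
  import sys

  for line in sys.stdin:
      for word in line.split():
          print(f"{word}\t1")

  # ----- reducer.py : the "Reduce" phase -----
  # Input arrives sorted by key, so counts can be aggregated per word.
  import sys

  current_word, count = None, 0
  for line in sys.stdin:
      word, value = line.rstrip("\n").split("\t")
      if word != current_word:
          if current_word is not None:
              print(f"{current_word}\t{count}")
          current_word, count = word, 0
      count += int(value)
  if current_word is not None:
      print(f"{current_word}\t{count}")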


9. What is YARN?

Answer:
YARN manages cluster resources and schedules jobs. It allows multiple data-processing engines (like Spark, Hive) to run on Hadoop efficiently.


10. What is Apache Spark?

Answer:
Apache Spark is a fast, in-memory Big Data processing engine. It supports batch processing, real-time streaming, machine learning, and graph processing.


11. Difference between Hadoop and Spark?

Answer:

Hadoop                | Spark
----------------------|------------------------------
Disk-based processing | In-memory processing
Slower                | Much faster
Uses MapReduce        | Uses a DAG execution engine
Good for batch        | Supports batch and streaming

12. What is Hive?

Answer:
Apache Hive is a data warehouse tool used for querying and analyzing large datasets stored in Hadoop using SQL-like language (HiveQL).


13. What is Pig?

Answer:
Apache Pig is a high-level scripting platform for analyzing large datasets. It uses Pig Latin, which is easier than writing MapReduce programs.


14. What is HBase?

Answer:
HBase is a NoSQL column-oriented database that runs on HDFS. It is used for real-time read/write access to Big Data.


15. What is NoSQL?

Answer:
NoSQL databases are non-relational databases designed to handle large volumes of unstructured or semi-structured data with high scalability.


16. Types of NoSQL databases?

Answer:

  1. Key-Value Stores – Redis

  2. Document Stores – MongoDB

  3. Column-Family Stores – HBase, Cassandra

  4. Graph Databases – Neo4j


17. What is data replication in HDFS?

Answer:
Data replication means storing multiple copies of data blocks across different nodes to ensure fault tolerance. Default replication factor is 3.


18. What is data locality?

Answer:
Data locality means moving computation closer to where data resides instead of moving data across the network, improving performance.


19. What is Big Data Analytics?

Answer:
Big Data Analytics involves analyzing large datasets to discover patterns, trends, and insights using tools like Hadoop, Spark, and machine learning algorithms.


20. What is batch processing?

Answer:
Batch processing handles large volumes of accumulated data in one run at scheduled intervals (e.g., daily sales reports).


21. What is real-time processing?

Answer:
Real-time processing handles data instantly as it arrives (e.g., fraud detection, live streaming analytics).


22. What is Apache Kafka?

Answer:
Apache Kafka is a distributed messaging system used for real-time data streaming between applications.


23. What is ETL in Big Data?

Answer:
ETL stands for Extract, Transform, Load. It extracts data from sources, transforms it into a usable format, and loads it into a data warehouse or Big Data system.
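
A minimal ETL sketch in PySpark follows; the file paths and column names are assumptions made only for illustration.

  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F

  spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

  raw = spark.read.option("header", True).csv("/data/raw/orders.csv")   # Extract

  cleaned = (raw
             .dropna(subset=["order_id"])                               # Transform: drop bad rows
             .withColumn("amount", F.col("amount").cast("double"))      # fix types
             .withColumn("order_date", F.to_date("order_date")))        # normalize dates

  cleaned.write.mode("overwrite").parquet("/data/warehouse/orders/")    # Load

  spark.stop()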


24. What is data ingestion?

Answer:
Data ingestion is the process of collecting and importing data into Big Data systems from various sources.


25. What is fault tolerance in Big Data?

Answer:
Fault tolerance ensures the system continues to function even if hardware or software failures occur, mainly achieved through replication and distributed architecture.


26. What is scalability?

Answer:
Scalability is the ability to increase system capacity, typically by adding more nodes (horizontal scaling) rather than only upgrading existing hardware (vertical scaling).


27. What is cloud Big Data?

Answer:
Cloud Big Data uses cloud platforms (AWS, Azure, GCP) to store and process large datasets with flexibility and cost efficiency.


28. What are common Big Data use cases?

Answer:

  • Recommendation systems

  • Fraud detection

  • Social media analytics

  • Healthcare analytics

  • Financial risk analysis


29. What skills are required for a Big Data fresher?

Answer:

  • SQL and basic programming (Java/Python)

  • Hadoop & Spark basics

  • Linux commands

  • Data concepts and analytics basics


30. What are challenges in Big Data?

Answer:

  • Data security

  • Data quality

  • Storage and processing costs

  • Managing complex architectures


31. What is data partitioning in Big Data?

Answer:
Data partitioning is the process of dividing large datasets into smaller, manageable parts called partitions. Each partition is processed independently, which improves performance and parallelism. In Hadoop and Spark, partitioning helps distribute data across multiple nodes.


32. What is a data block in HDFS?

Answer:
A data block is the minimum unit of storage in HDFS.

  • Default block size: 128 MB
    Large block sizes reduce the number of disk seeks and improve performance for large files.


33. What is NameNode in Hadoop?

Answer:
The NameNode is the master node of HDFS. It manages:

  • File system metadata

  • File permissions

  • Block locations
    It does not store actual data.


34. What is DataNode?

Answer:
DataNodes store the actual data blocks in HDFS. They perform read/write operations and report their status to the NameNode.


35. What is Secondary NameNode?

Answer:
The Secondary NameNode periodically takes checkpoints of metadata. It does not replace the NameNode but helps reduce recovery time during failures.


36. What is a Hadoop cluster?

Answer:
A Hadoop cluster is a group of machines (nodes) connected together to store and process Big Data.

  • Master nodes – NameNode, ResourceManager

  • Worker nodes – DataNodes, NodeManagers


37. What is schema-on-read?

Answer:
Schema-on-read means data is structured when it is read, not when stored. Hadoop follows this approach, allowing flexible data storage.


38. Difference between schema-on-read and schema-on-write?

Answer:

Schema-on-Read       | Schema-on-Write
---------------------|----------------------
Applied at read time | Applied at write time
Flexible             | Rigid
Used in Hadoop       | Used in RDBMS

39. What is Apache Sqoop?

Answer:
Apache Sqoop is used to transfer data between RDBMS and Hadoop efficiently.
Example: Importing data from MySQL into HDFS.


40. What is Apache Flume?

Answer:
Apache Flume is used to collect and move large amounts of log data into HDFS from multiple sources.


41. What is data serialization?

Answer:
Serialization converts data into a format suitable for storage or transmission.
Common formats: Avro, Parquet, ORC


42. What is Avro?

Answer:
Avro is a row-based serialization format that supports schema evolution and is commonly used for data exchange.


43. What is Parquet?

Answer:
Parquet is a columnar storage format optimized for analytics and query performance. It reduces storage and improves read speed.
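
A small PySpark sketch (paths are assumptions) of why Parquet helps analytics: write a DataFrame as compressed Parquet, then read back only the columns a query needs.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("parquet-sketch").getOrCreate()

  df = spark.createDataFrame(
      [(1, "alice", 120.0), (2, "bob", 75.5)],
      ["user_id", "name", "amount"],
  )

  # Columnar layout plus compression keeps files small (Snappy is the usual default codec).
  df.write.mode("overwrite").option("compression", "snappy").parquet("/tmp/sales_parquet")

  # Column pruning: only the referenced columns are read from disk.
  spark.read.parquet("/tmp/sales_parquet").select("user_id", "amount").show()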


44. What is ORC?

Answer:
ORC (Optimized Row Columnar) is a columnar format mainly used with Hive for fast query execution.


45. What is compression in Big Data?

Answer:
Compression reduces data size to save storage and improve processing speed.
Examples: Snappy, Gzip, LZO


46. What is Spark RDD?

Answer:
RDD (Resilient Distributed Dataset) is Spark’s core data structure. It is:

  • Immutable

  • Distributed

  • Fault tolerant


47. What are Spark transformations and actions?

Answer:

  • Transformations – Create new RDDs (map, filter)

  • Actions – Return results (count, collect)
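
A short PySpark sketch of the distinction above: map and filter only build the lineage, and nothing runs until an action such as count() or collect() is called (a local SparkContext is assumed).

  from pyspark import SparkContext

  sc = SparkContext(appName="rdd-sketch")

  numbers = sc.parallelize(range(10))            # create an RDD
  evens = numbers.filter(lambda x: x % 2 == 0)   # transformation: no job runs yet
  squared = evens.map(lambda x: x * x)           # transformation: still lazy

  print(squared.count())                         # action: triggers execution -> 5
  print(squared.collect())                       # action: [0, 4, 16, 36, 64]

  sc.stop()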


48. What is Spark DataFrame?

Answer:
A DataFrame is a distributed collection of data organized into named columns, similar to a table in SQL.


49. What is Spark SQL?

Answer:
Spark SQL allows querying structured data using SQL and integrates with Hive metastore.
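
A minimal Spark SQL sketch (table and column names are assumptions): register a DataFrame as a temporary view and query it with SQL.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("spark-sql-sketch").getOrCreate()

  orders = spark.createDataFrame(
      [("o1", "electronics", 250.0), ("o2", "books", 40.0), ("o3", "electronics", 99.0)],
      ["order_id", "category", "amount"],
  )
  orders.createOrReplaceTempView("orders")

  spark.sql("""
      SELECT category, SUM(amount) AS revenue
      FROM orders
      GROUP BY category
  """).show()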


50. What is Spark Streaming?

Answer:
Spark Streaming processes real-time data streams in micro-batches.
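
A minimal streaming sketch using Structured Streaming, the newer API that also runs in micro-batches; the socket source, host, and port are assumptions for local experimentation.

  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F

  spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

  lines = (spark.readStream
           .format("socket")
           .option("host", "localhost")
           .option("port", 9999)
           .load())

  word_counts = (lines
                 .select(F.explode(F.split(F.col("value"), " ")).alias("word"))
                 .groupBy("word")
                 .count())

  query = (word_counts.writeStream
           .outputMode("complete")   # emit full counts each micro-batch
           .format("console")
           .start())
  query.awaitTermination()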


51. What is data skew?

Answer:
Data skew occurs when data is unevenly distributed across partitions, causing performance issues.


52. What is cluster computing?

Answer:
Cluster computing involves multiple computers working together as a single system to process Big Data efficiently.


53. What is Big Data security?

Answer:
Big Data security includes authentication, authorization, encryption, and data governance.


54. What is Kerberos?

Answer:
Kerberos is an authentication protocol used in Hadoop for secure access.


55. What is a data lake?

Answer:
A data lake stores raw structured and unstructured data in its native format.


56. Difference between Data Lake and Data Warehouse?

Answer:

Data Lake      | Data Warehouse
---------------|-----------------
Raw data       | Processed data
Schema-on-read | Schema-on-write
Low cost       | Higher cost

57. What is machine learning in Big Data?

Answer:
Machine learning in Big Data uses algorithms to analyze large datasets and make predictions using tools like Spark MLlib.


58. What is MLlib?

Answer:
MLlib is Spark’s machine learning library.


59. What is data governance?

Answer:
Data governance defines rules, policies, and standards for data usage, security, and quality.


60. What is a data pipeline?

Answer:
A data pipeline is a set of processes that move data from source to destination automatically.


61. What is data lineage?

Answer:
Data lineage tracks the origin, movement, and transformation of data.


62. What is fault recovery in Hadoop?

Answer:
Hadoop recovers from failures using data replication and task re-execution.


63. What is speculative execution?

Answer:
Hadoop runs duplicate copies of slow-running tasks on other nodes and uses the result of whichever copy finishes first, preventing stragglers from delaying the job.


64. What is Big Data testing?

Answer:
Big Data testing validates data accuracy, performance, and scalability of Big Data systems.


65. What is a typical Big Data architecture?

Answer:

  1. Data sources

  2. Data ingestion

  3. Storage (HDFS, Data Lake)

  4. Processing (Spark, Hive)

  5. Analytics and visualization


66. What are common Big Data interview mistakes by freshers?

Answer:

  • Confusing Hadoop with Spark

  • Ignoring data formats

  • Lack of SQL knowledge

  • Not understanding use cases


67. Explain a simple Big Data use case.

Answer:
An e-commerce company analyzes customer purchase data to recommend products and improve sales.


68. What programming languages are used in Big Data?

Answer:

  • Java

  • Python

  • Scala

  • SQL


69. What is metadata?

Answer:
Metadata is data about data, such as file size, format, and schema.


70. Why is SQL important in Big Data?

Answer:
SQL helps query and analyze large datasets easily using Hive and Spark SQL.

 

Experienced Interview Questions

 

1. What are the advanced features of Hadoop you have worked on?

Answer:
With around four years of experience, a candidate will typically have worked with:

  • YARN for resource management and multi-tenancy

  • High Availability (HA) NameNode setup

  • HDFS Federation for scaling metadata

  • Data compression using Snappy, ORC, or Parquet

  • Security via Kerberos authentication and Ranger/Atlas

  • HDFS snapshots for backup and recovery


2. Explain the difference between Hadoop 1.x and 2.x.

Answer:

Feature             | Hadoop 1.x     | Hadoop 2.x
--------------------|----------------|---------------------------------
Resource Management | MapReduce only | YARN (supports multiple engines)
Scalability         | Limited        | High, supports 10k+ nodes
High Availability   | No             | Yes, Active/Standby NameNode
Cluster Utilization | Low            | Better through YARN
Processing Engines  | MapReduce      | MapReduce, Spark, Tez, Storm

3. What is HDFS Federation and why is it needed?

Answer:
HDFS Federation allows multiple NameNodes to manage namespaces independently, improving scalability. Needed when a single NameNode cannot handle millions of files in a large cluster.


4. What is the YARN architecture?

Answer:
YARN has:

  • ResourceManager (RM) – Master, allocates resources

  • NodeManager (NM) – Manages resources and tasks on nodes

  • ApplicationMaster (AM) – Manages a single application

  • Container – Execution unit with allocated CPU/memory

YARN separates resource management from processing, enabling multiple frameworks to run on Hadoop.


5. How do you optimize Hadoop performance?

Answer:

  • Adjust block size based on file size

  • Enable compression for storage efficiency

  • Use combiner functions to reduce network I/O

  • Proper partitioning to avoid data skew

  • Memory tuning for MapReduce and YARN containers

  • Avoid small files; use HAR files or sequence files


6. Explain how to handle small files problem in HDFS.

Answer:
HDFS is optimized for large files; small files increase NameNode memory usage. Solutions:

  • Use SequenceFile or HAR files

  • Merge small files using Spark/Hive

  • Use HBase for random access storage


7. What is Apache Spark’s DAG scheduler?

Answer:
Spark builds a Directed Acyclic Graph (DAG) for all transformations. The DAG scheduler:

  • Divides jobs into stages

  • Optimizes task execution

  • Supports fault tolerance

Unlike Hadoop MapReduce, Spark executes DAGs in-memory for faster processing.


8. What is Spark’s lazy evaluation and why is it useful?

Answer:
Lazy evaluation means Spark does not execute transformations immediately; it builds a DAG and executes only on an action (e.g., collect, count).
Benefits:

  • Optimized execution

  • Reduced data shuffling

  • Faster processing
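
Building on the points above, here is a small DataFrame sketch of lazy evaluation (the path and column names are assumptions): the filter and aggregation only build a logical plan, which Spark optimizes before any action runs.

  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F

  spark = SparkSession.builder.appName("lazy-eval-sketch").getOrCreate()

  events = spark.read.parquet("/data/events")            # nothing is read yet

  daily = (events
           .filter(F.col("status") == "ok")              # transformation, still lazy
           .groupBy("event_date")
           .agg(F.count("*").alias("events")))           # transformation, still lazy

  daily.explain()                                        # inspect the optimized plan
  print(daily.count())                                   # action: the whole pipeline runs now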


9. How do you handle data skew in Spark?

Answer:

  • Use salting of keys for joins

  • Repartition using repartition() or coalesce()

  • Filter skewed partitions separately

  • Use broadcast joins for small lookup tables
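
A hedged sketch of the salting technique listed above (paths, column names, and the bucket count are assumptions): spread a hot key across N buckets on the large side and replicate the small side N times so every bucket still finds its match.

  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F

  spark = SparkSession.builder.appName("salting-sketch").getOrCreate()
  N = 10  # number of salt buckets

  large = spark.read.parquet("/data/clicks")      # skewed on user_id
  small = spark.read.parquet("/data/users")

  # Large side: append a random salt to the join key.
  large_salted = (large
                  .withColumn("salt", (F.rand() * N).cast("int"))
                  .withColumn("join_key", F.concat_ws("_", "user_id", "salt")))

  # Small side: explode each row into N salted copies.
  small_salted = (small
                  .withColumn("salt", F.explode(F.array([F.lit(i) for i in range(N)])))
                  .withColumn("join_key", F.concat_ws("_", "user_id", "salt")))

  joined = large_salted.join(small_salted, "join_key").drop("salt", "join_key")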


10. What is the difference between narrow and wide transformations in Spark?

Answer:

Feature          | Narrow      | Wide
-----------------|-------------|----------------------
Shuffle required | No          | Yes
Examples         | map, filter | reduceByKey, join
Performance      | Fast        | Slower, network-heavy

11. How do you tune Spark performance?

Answer:

  • Adjust executor memory and cores

  • Increase parallelism (spark.default.parallelism)

  • Use broadcast variables for small tables

  • Cache frequently used RDDs/DataFrames

  • Optimize joins and shuffle operations

  • Use columnar formats (Parquet/ORC)
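
A few of these knobs expressed as code; the values shown are assumptions that depend on the cluster and workload, not recommendations.

  from pyspark.sql import SparkSession

  spark = (SparkSession.builder
           .appName("tuning-sketch")
           .config("spark.executor.memory", "8g")           # executor heap size
           .config("spark.executor.cores", "4")             # cores per executor
           .config("spark.sql.shuffle.partitions", "400")   # shuffle parallelism for DataFrames
           .config("spark.default.parallelism", "400")      # default RDD parallelism
           .getOrCreate())

  df = spark.read.parquet("/data/facts")
  df.cache()           # keep a frequently reused dataset in memory
  df.count()           # materialize the cache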


12. Explain Spark Streaming vs. Kafka Streams.

Answer:

Feature    | Spark Streaming     | Kafka Streams
-----------|---------------------|--------------------
Processing | Micro-batch         | True streaming
Latency    | Seconds             | Milliseconds
Language   | Scala, Python, Java | Java, Scala
Use Case   | Batch + Streaming   | Real-time analytics

13. How do you monitor and debug Hadoop jobs?

Answer:

  • Use YARN ResourceManager UI for job status

  • Check logs in HDFS (stderr, stdout)

  • Use Spark UI for DAG visualization

  • Use Ganglia/Prometheus for cluster monitoring


14. How do you design a data pipeline for real-time and batch processing?

Answer:

  1. Ingestion: Kafka/Flume/Sqoop

  2. Storage: HDFS, S3, or HBase

  3. Processing: Spark (batch/streaming)

  4. Analytics: Hive, Spark SQL, MLlib

  5. Visualization: Tableau, Power BI, or custom dashboards


15. Explain HBase architecture and use cases.

Answer:

  • Master – Manages regions

  • RegionServer – Handles read/write requests

  • HFiles – Stores data

Use Cases:

  • Time-series data

  • Real-time analytics

  • IoT data storage


16. What are the differences between HBase and Hive?

Answer:

Feature   | HBase       | Hive
----------|-------------|---------------
Data type | NoSQL       | SQL-like
Access    | Random      | Batch queries
Speed     | Low latency | High latency
Use case  | Real-time   | Reporting

17. How do you handle fault tolerance in Spark?

Answer:

  • Use RDD lineage to recompute lost partitions

  • Enable checkpointing for long-running jobs

  • Use HDFS replication to store persistent data


18. How do you ensure data security in Hadoop clusters?

Answer:

  • Enable Kerberos authentication

  • Use HDFS permissions and ACLs

  • Use Apache Ranger for fine-grained authorization

  • Encrypt data at rest and in transit


19. Explain workflow orchestration tools you’ve used.

Answer:

  • Apache Oozie – Workflow scheduler for Hadoop jobs

  • Airflow – DAG-based orchestration for ETL pipelines

  • NiFi – Real-time data flow management


20. How do you handle schema evolution in Hive/Parquet?

Answer:

  • Add new columns with default values

  • Use nullable fields

  • Maintain backward/forward compatibility with Parquet or Avro


21. Explain Big Data testing strategies.

Answer:

  • Data validation – Compare source and processed data

  • Performance testing – Measure job execution time

  • Integration testing – End-to-end pipeline testing

  • Boundary testing – Handle nulls, empty files, corrupt records


22. How do you optimize Hive queries?

Answer:

  • Use partitioning and bucketing

  • Use columnar formats (ORC/Parquet)

  • Enable vectorized query execution

  • Minimize cross-joins and cartesian joins

  • Use Tez or Spark execution engine
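
A hedged HiveQL sketch of partitioning, bucketing, and ORC storage (table and column names are assumptions); it is run through spark.sql() here for consistency with the other examples, and equivalent DDL works in the Hive CLI or Beeline.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("hive-sketch").enableHiveSupport().getOrCreate()

  spark.sql("""
      CREATE TABLE IF NOT EXISTS sales (
          order_id STRING,
          amount   DOUBLE
      )
      PARTITIONED BY (sale_date STRING)        -- prune partitions at query time
      CLUSTERED BY (order_id) INTO 32 BUCKETS  -- bucketing for joins and sampling
      STORED AS ORC                            -- columnar format
  """)

  # Queries that filter on the partition column read only the matching partitions.
  spark.sql("SELECT COUNT(*) FROM sales WHERE sale_date = '2024-01-01'").show()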


23. What is a broadcast join in Spark?

Answer:
When one table is small enough to fit in memory, broadcast it to all worker nodes to avoid shuffling large datasets, improving performance.
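
A minimal broadcast-join sketch (paths and the join key are assumptions): the hint tells Spark to ship the small dimension table to every executor instead of shuffling the large fact table.

  from pyspark.sql import SparkSession
  from pyspark.sql.functions import broadcast

  spark = SparkSession.builder.appName("broadcast-join-sketch").getOrCreate()

  facts = spark.read.parquet("/data/transactions")     # large table
  dims = spark.read.parquet("/data/merchants")         # small lookup table

  joined = facts.join(broadcast(dims), "merchant_id")  # no shuffle of the large side
  joined.explain()                                     # plan should show BroadcastHashJoin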


24. Explain checkpointing in Spark.

Answer:

  • Checkpointing saves RDDs/DataFrames to HDFS to prevent recomputation in case of failures.

  • Used in long lineage DAGs or streaming applications.
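
A short checkpointing sketch following the points above (the directory and input paths are assumptions): the RDD's data is persisted to HDFS so a failure does not have to replay the full lineage.

  from pyspark import SparkContext

  sc = SparkContext(appName="checkpoint-sketch")
  sc.setCheckpointDir("hdfs:///checkpoints/app1")   # reliable storage for checkpoints

  rdd = sc.textFile("hdfs:///data/events").map(lambda line: line.split(","))
  rdd.checkpoint()      # mark for checkpointing; written when the next action runs
  rdd.count()           # action triggers both the computation and the checkpoint write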


25. What is lambda architecture?

Answer:
Lambda architecture handles batch + real-time processing:

  • Batch layer – Stores all data, performs batch analytics

  • Speed layer – Handles real-time stream processing

  • Serving layer – Combines batch and real-time views for querying


26. What is kappa architecture?

Answer:
Kappa architecture processes all data as real-time streams; batch layer is not used. Simplifies maintenance and reduces redundancy.


27. What tools have you used for Big Data visualization and reporting?

Answer:

  • Tableau, Power BI

  • Superset, Kibana

  • Zeppelin, Jupyter notebooks

  • Custom dashboards via Python/Scala


28. Scenario Question: How would you process 10 TB of log data daily for analytics?

Answer:

  1. Ingestion: Flume/Kafka for streaming logs

  2. Storage: HDFS/S3 with compression

  3. Processing: Spark batch jobs or streaming pipeline

  4. Analytics: Hive/Spark SQL, aggregations

  5. Visualization: Dashboard with alerting


29. Scenario Question: How do you handle slow-running Spark jobs?

Answer:

  • Identify data skew and repartition

  • Cache intermediate results

  • Adjust executor memory and cores

  • Optimize joins using broadcast variables

  • Use columnar formats and filter early


30. Common Challenges for 4-year Experienced Big Data Professionals

  • Handling large-scale cluster tuning

  • Optimizing real-time pipelines

  • Ensuring data security & compliance

  • Managing multi-tenant environments

  • Debugging complex workflows


31. What is Hadoop High Availability (HA) and how do you implement it?

Answer:
Hadoop HA ensures NameNode failure does not stop the cluster.

  • Active NameNode serves requests, Standby NameNode keeps metadata synchronized.

  • Uses Zookeeper to manage automatic failover.

  • Implementation steps:

    1. Configure two NameNodes in HA mode

    2. Use shared storage (NFS or Quorum Journal Manager)

    3. Enable automatic failover using Zookeeper


32. What are Hadoop Counters and their use?

Answer:
Counters are user-defined or system metrics in MapReduce to monitor job progress.
Examples:

  • Number of processed records

  • Number of failed records

  • Number of skipped corrupt files

Useful for debugging and performance monitoring.


33. How do you debug a failing MapReduce job?

Answer:

  • Check job logs in ResourceManager UI

  • Analyze TaskTracker/NodeManager logs

  • Use counters to track data processed

  • Identify skewed input splits

  • Check memory or disk issues


34. Explain Spark partitioning and shuffle.

Answer:

  • Partitioning divides data across nodes to enable parallel processing.

  • Narrow transformations do not cause shuffle; wide transformations (join, reduceByKey) cause data shuffle, which can slow performance.

  • Optimizations:

    • Use partitionBy or salting

    • Reduce shuffles with broadcast joins

    • Coalesce or repartition strategically


35. What is speculative execution in Hadoop?

Answer:

  • Runs duplicate copies of slow-running tasks so the job finishes faster.

  • Prevents straggler tasks from delaying the job.

  • Controlled by configuration: mapreduce.map.speculative and mapreduce.reduce.speculative.


36. What is Spark Catalyst Optimizer?

Answer:

  • Spark SQL uses Catalyst Optimizer for query optimization.

  • Performs logical plan generation → optimization → physical plan → code generation.

  • Improves execution speed for DataFrames and SQL queries.


37. Explain broadcast variables and accumulators in Spark.

Answer:

  • Broadcast Variables: Read-only variables shared across all nodes to avoid sending large data multiple times.

  • Accumulators: Shared counters that tasks can only add to and the driver reads, used for aggregation (e.g., counting errors).
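
A short sketch of both mechanisms described above (the lookup data is an assumption); note that accumulator updates are only guaranteed exactly once inside actions.

  from pyspark import SparkContext

  sc = SparkContext(appName="shared-vars-sketch")

  country_codes = sc.broadcast({"IN": "India", "US": "United States"})  # read-only on executors
  bad_records = sc.accumulator(0)                                       # tasks add; driver reads

  def resolve(code):
      if code not in country_codes.value:
          bad_records.add(1)        # count unknown codes across all tasks
          return "UNKNOWN"
      return country_codes.value[code]

  result = sc.parallelize(["IN", "US", "XX", "IN"]).map(resolve).collect()
  print(result)                 # ['India', 'United States', 'UNKNOWN', 'India']
  print(bad_records.value)      # 1 (read on the driver after the action)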


38. How do you handle skewed joins in Spark?

Answer:

  • Broadcast small tables

  • Apply salting on the skewed key

  • Filter skewed keys and handle them separately

  • Use repartitionByRange


39. Explain Tungsten optimization in Spark.

Answer:
Tungsten is Spark’s memory and CPU optimization framework:

  • Uses off-heap memory

  • Optimizes code generation for CPU efficiency

  • Reduces garbage collection overhead


40. How do you monitor and tune Spark jobs?

Answer:

  • Use Spark UI for DAG visualization, stage/task metrics

  • Enable event logs for offline analysis

  • Adjust executor memory, cores, shuffle partitions

  • Cache frequently used datasets


41. What is Kafka and its key components?

Answer:
Kafka is a distributed streaming platform. Components:

  • Producer: Sends messages to Kafka topics

  • Consumer: Reads messages from topics

  • Broker: Kafka server storing messages

  • Zookeeper: Manages cluster coordination

  • Topic & Partition: Logical and physical division of messages


42. Explain Kafka partitions and offsets.

Answer:

  • Partitions allow parallelism and scalability.

  • Offset is a unique sequential ID of a message in a partition.

  • Consumers track offsets for fault-tolerant consumption.
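
A hedged sketch using the kafka-python package (the broker address, topic name, and consumer group are assumptions): the producer writes keyed messages to a topic, and the consumer reads them while committing offsets per partition.

  from kafka import KafkaProducer, KafkaConsumer

  producer = KafkaProducer(bootstrap_servers="localhost:9092")
  producer.send("clicks", key=b"user-42", value=b'{"page": "/home"}')  # same key -> same partition
  producer.flush()

  consumer = KafkaConsumer(
      "clicks",
      bootstrap_servers="localhost:9092",
      group_id="analytics",             # the consumer group tracks offsets per partition
      auto_offset_reset="earliest",     # where to start when no committed offset exists
      enable_auto_commit=True,
  )
  for message in consumer:
      print(message.partition, message.offset, message.value)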


43. How do you handle exactly-once semantics in Kafka?

Answer:

  • Use idempotent producers to avoid duplicates

  • Enable transactional producers for atomic writes

  • Consumers commit offsets after processing


44. What is HBase coprocessor?

Answer:

  • Similar to triggers/stored procedures in RDBMS.

  • Observer: Intercepts CRUD operations

  • Endpoint: Performs batch processing on server side

  • Reduces network overhead by moving computation to servers


45. How do you optimize HBase reads/writes?

Answer:

  • Use row keys wisely to avoid hotspotting

  • Enable block cache for frequently read data

  • Use bloom filters to reduce disk access

  • Tune memstore and block sizes


46. Explain Hive optimization techniques.

Answer:

  • Use partitioning and bucketing

  • Store data in ORC/Parquet formats

  • Use vectorized queries

  • Avoid cartesian joins, use map-side joins

  • Enable Cost-Based Optimizer (CBO)


47. What is Hive ACID transaction?

Answer:

  • ACID in Hive ensures atomicity, consistency, isolation, durability for inserts, updates, deletes.

  • Requires transactional tables with ORC format.

  • Enables compaction to reduce delta files.


48. What is ZooKeeper and its role in Hadoop/Spark?

Answer:

  • ZooKeeper is a coordination service for distributed applications.

  • Roles:

    • Manage leader election

    • Maintain configuration synchronization

    • Support HA failover


49. How do you handle real-time alerts using Big Data pipelines?

Answer:

  1. Ingest data using Kafka or Flume

  2. Process streams using Spark Streaming or Flink

  3. Apply business rules/thresholds

  4. Push alerts via email, Slack, or dashboards

  5. Maintain logs for audit


50. Scenario: Your MapReduce jobs are slow because the input produces too many splits. How do you fix it?

Answer:

  • Increase block size to reduce the number of splits

  • Use CombineFileInputFormat to merge small files

  • Optimize mapper and reducer code

  • Enable compression to reduce disk I/O


51. Scenario: Processing 1 TB of JSON data daily using Spark. How would you design?

Answer:

  1. Ingestion: Kafka/S3/Flume

  2. Parsing: Use Spark DataFrames or Datasets

  3. Storage: Parquet format for analytics

  4. Processing: Partition and cache data, avoid shuffles

  5. ETL/aggregation: Use Spark SQL

  6. Visualization: Tableau or Superset
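
A hedged end-to-end sketch of the design above (the paths, schema fields, and partition column are assumptions): parse the day's JSON, filter early, and write partitioned Parquet for downstream Spark SQL.

  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F

  spark = SparkSession.builder.appName("json-pipeline-sketch").getOrCreate()

  raw = spark.read.json("s3://landing/events/2024-01-01/")     # the day's JSON input

  events = (raw
            .withColumn("event_date", F.to_date("event_ts"))
            .filter(F.col("event_type").isNotNull()))           # filter early, before any shuffle

  (events.write
         .mode("append")
         .partitionBy("event_date")                             # partitioned Parquet for analytics
         .parquet("s3://warehouse/events/"))

  # Downstream aggregation with Spark SQL
  events.createOrReplaceTempView("events")
  spark.sql("SELECT event_type, COUNT(*) AS cnt FROM events GROUP BY event_type").show()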


52. How do you secure a Big Data cluster?

Answer:

  • Enable Kerberos authentication

  • Configure HDFS permissions and ACLs

  • Use Ranger/Atlas for fine-grained authorization

  • Enable data encryption at rest and in transit

  • Monitor logs for suspicious activities


53. Explain Lambda vs Kappa architecture for real-time analytics.

Answer:

Feature    | Lambda                  | Kappa
-----------|-------------------------|-----------------------
Layers     | Batch + Speed + Serving | Only stream layer
Complexity | High                    | Simpler
Use Case   | Historical + Real-time  | Stream-only pipelines
Tools      | Spark, Kafka            | Kafka + Spark/Flink

54. How do you tune HDFS for performance?

Answer:

  • Adjust block size (128–256 MB)

  • Set replication factor based on reliability

  • Enable compression (Snappy, LZO)

  • Co-locate compute and storage nodes for data locality

  • Avoid small files problem


55. What is data lineage and why is it important?

Answer:

  • Data lineage tracks origin, transformations, and movement of data.

  • Helps in debugging, auditing, and compliance.

  • Tools: Apache Atlas, Talend, Informatica.