Data Engineer Interview Questions
10 curated questions with evaluation guidance for hiring managers.
How would you design a data pipeline that processes 100 million records daily with low latency?
Should discuss architecture choices (batch vs. streaming), tools (Spark, Kafka, Airflow), partitioning strategies, checkpointing, and monitoring. Look for understanding of scale and trade-offs.
Explain the differences between star schema and snowflake schema. When would you use each?
Should explain denormalization trade-offs, query performance, storage costs, and maintenance complexity. Prefer a star schema for simple analytics and dashboards; use a snowflake schema for complex dimension hierarchies where normalization pays off.
How do you handle data quality issues in a production pipeline?
Should discuss validation checks, data contracts, monitoring dashboards, alerting, and quarantine strategies for bad data. Look for proactive quality measures rather than reactive fixes.
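A strong candidate can sketch the quarantine pattern concretely. A minimal illustration, assuming hypothetical field names and rules (a real pipeline would load these from data contracts):

```python
# Row-level validation with a quarantine path: clean records flow
# downstream, bad records are held with the reasons they failed.

def validate(record):
    """Return a list of rule violations for one record (empty = clean)."""
    errors = []
    if not record.get("user_id"):
        errors.append("missing user_id")
    amount = record.get("amount")
    if amount is None or amount < 0:
        errors.append("invalid amount")
    return errors

def split_batch(records):
    """Route clean records downstream; quarantine the rest with reasons."""
    clean, quarantined = [], []
    for r in records:
        errors = validate(r)
        if errors:
            quarantined.append({"record": r, "errors": errors})
        else:
            clean.append(r)
    return clean, quarantined

batch = [
    {"user_id": "u1", "amount": 120.0},
    {"user_id": "", "amount": 50.0},
    {"user_id": "u3", "amount": -5.0},
]
clean, quarantined = split_batch(batch)
```

Candidates who pair this with alerting on the quarantine rate, rather than silently dropping rows, are demonstrating the proactive mindset you want.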
Describe your experience with change data capture (CDC). How would you implement it?
Should explain CDC patterns (log-based, trigger-based, timestamp-based), tools (Debezium, AWS DMS), and trade-offs. Look for understanding of exactly-once processing challenges.
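For timestamp-based CDC, expect the candidate to describe a watermark loop. A simplified sketch with in-memory rows standing in for a source-table query (columns and values are illustrative assumptions):

```python
# Timestamp-based CDC: pull only rows updated after the last watermark,
# then advance the watermark to the newest change seen.

def extract_changes(rows, watermark):
    """Return rows updated strictly after `watermark`, plus the new watermark."""
    changed = [r for r in rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in changed), default=watermark)
    return changed, new_watermark

source = [
    {"id": 1, "updated_at": 100},
    {"id": 2, "updated_at": 205},
    {"id": 3, "updated_at": 310},
]
changes, wm = extract_changes(source, watermark=200)
```

Good candidates will volunteer this approach's weaknesses: it misses hard deletes and rows updated within the same timestamp tick, which is why log-based tools like Debezium are preferred at scale.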
How do you optimize a slow-running Spark job?
Should mention common issues: data skew, improper partitioning, shuffle operations, caching, broadcast joins, and serialization. Look for systematic profiling approach before optimization.
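Profiling before tuning can be probed concretely: ask how they would confirm skew. A toy sketch of key-distribution profiling (keys and threshold are illustrative assumptions; in Spark this would be a `groupBy().count()` on the join key):

```python
# Detect data skew by measuring how much of the dataset each join key
# accounts for. A dominant key suggests salting or a broadcast join.

from collections import Counter

def skewed_keys(keys, threshold=0.5):
    """Return keys that account for more than `threshold` of all records."""
    counts = Counter(keys)
    total = len(keys)
    return [k for k, c in counts.items() if c / total > threshold]

join_keys = ["IN"] * 80 + ["US"] * 15 + ["UK"] * 5
hot = skewed_keys(join_keys)
```

A candidate who jumps straight to "increase executors" without this kind of measurement is optimizing blind.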
Walk me through how you would migrate an on-premises data warehouse to a cloud platform.
Should discuss assessment, tool selection (Snowflake, BigQuery, Redshift), migration strategy (lift-and-shift vs. re-architect), testing, and parallel running. Look for risk mitigation approach.
How do you handle schema evolution in a data lake?
Should discuss schema registry, backward/forward compatibility, Delta Lake or Iceberg for ACID transactions, and impact on downstream consumers. Look for governance awareness.
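You can ask candidates to define backward compatibility precisely. A minimal sketch of the rule a schema registry enforces, with an illustrative field layout (real systems use Avro/Protobuf compatibility modes):

```python
# Backward-compatibility check for an evolving schema: existing fields
# must keep their names and types, and any new field must be nullable
# so that data written under the old schema can still be read.

def is_backward_compatible(old, new):
    """old/new map field name -> {"type": str, "nullable": bool}."""
    for name, spec in old.items():
        if name not in new or new[name]["type"] != spec["type"]:
            return False  # dropping or retyping a field breaks consumers
    for name, spec in new.items():
        if name not in old and not spec["nullable"]:
            return False  # a new required field breaks old data
    return True

old = {"user_id": {"type": "string", "nullable": False}}
new_ok = {
    "user_id": {"type": "string", "nullable": False},
    "coupon": {"type": "string", "nullable": True},
}
new_bad = {"user_id": {"type": "int", "nullable": False}}

ok = is_backward_compatible(old, new_ok)
bad = is_backward_compatible(old, new_bad)
```

Strong answers connect this check to downstream consumers: the rule exists so that yesterday's dashboards keep working after today's deploy.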
Explain the concept of data lineage and why it matters.
Should discuss tracking data from source to consumption, impact analysis, debugging, compliance (relevant for Indian data protection laws), and tools (Apache Atlas, custom solutions).
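Lineage can feel abstract, so ask the candidate to make impact analysis concrete. A toy sketch of lineage as a directed graph with a reachability query (asset names and edges are illustrative assumptions):

```python
# Table-level lineage as a directed graph. downstream() answers the
# impact-analysis question: "what breaks if this source table changes?"

def downstream(lineage, node, seen=None):
    """Return every asset reachable from `node` in the lineage graph."""
    if seen is None:
        seen = set()
    for child in lineage.get(node, []):
        if child not in seen:
            seen.add(child)
            downstream(lineage, child, seen)
    return seen

lineage = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["marts.daily_revenue", "marts.customer_ltv"],
}
impacted = downstream(lineage, "raw.orders")
```

Candidates with real lineage experience will also mention capturing these edges automatically (from SQL parsing or orchestrator metadata) rather than maintaining them by hand.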
How do you design for data privacy and compliance (DPDP Act) in your data pipelines?
Should mention PII detection, data masking, encryption at rest and in transit, access controls, and audit logging. Awareness of India's Digital Personal Data Protection Act is important.
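Candidates should be able to distinguish masking from pseudonymisation. A minimal sketch of both, assuming hypothetical field names and a hard-coded salt (in production the salt lives in a secrets manager):

```python
# PII handling before records land in the lake: deterministic hashing
# preserves joinability; masking keeps a field safe but human-readable.

import hashlib

SALT = b"example-salt"  # assumption: load from a secrets manager in production

def pseudonymise(value):
    """Deterministic salted hash so the field can still be joined on."""
    return hashlib.sha256(SALT + value.encode()).hexdigest()

def mask_email(email):
    """Keep the domain for analytics; hide the local part."""
    local, _, domain = email.partition("@")
    return f"{local[0]}***@{domain}"

record = {"email": "asha@example.com"}
safe = {
    "email_masked": mask_email(record["email"]),
    "email_key": pseudonymise(record["email"]),
}
```

Under the DPDP Act, strong answers also cover erasure: deterministic keys let you honour deletion requests by dropping the mapping rather than rewriting every downstream table.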
Describe your approach to testing data pipelines.
Should discuss unit testing transformations, integration testing pipelines, data quality assertions, test data generation, and CI/CD for data. Look for treating data code with the same rigor as application code.
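Ask for a concrete example of a transformation test. A minimal sketch with a hand-built fixture (the transformation and fields are illustrative assumptions; the point is that data code gets asserted like application code):

```python
# Unit-testing a pure transformation: small fixture in, exact dict out.

def daily_revenue(orders):
    """Aggregate completed orders into revenue per day."""
    totals = {}
    for o in orders:
        if o["status"] == "completed":
            totals[o["date"]] = totals.get(o["date"], 0.0) + o["amount"]
    return totals

def test_daily_revenue_ignores_cancelled():
    orders = [
        {"date": "2024-05-01", "status": "completed", "amount": 100.0},
        {"date": "2024-05-01", "status": "cancelled", "amount": 40.0},
        {"date": "2024-05-02", "status": "completed", "amount": 60.0},
    ]
    assert daily_revenue(orders) == {"2024-05-01": 100.0, "2024-05-02": 60.0}

test_daily_revenue_ignores_cancelled()
```

Candidates who keep transformations pure (no I/O inside the function) make this kind of test trivial, which is itself a design signal worth probing.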
Want AI-generated interview questions tailored to your specific job description? Workro analyses your JD and generates behavioural and technical questions calibrated for the role, seniority level, and required skills — in seconds.