Data Science Technical Interview Questions
10 curated questions with evaluation guidance for hiring managers.
Explain the bias-variance trade-off. How do you diagnose whether your model has high bias or high variance?
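Should explain bias (underfitting) and variance (overfitting) clearly, mention learning curves and cross-validation scores for diagnosis, and discuss solutions for each case (e.g., a more expressive model or better features for high bias; more data, regularization, or a simpler model for high variance).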
How do you handle missing data in a dataset? What are the pros and cons of different imputation methods?
Should discuss deletion, mean/median imputation, KNN imputation, MICE, and model-based imputation. Look for understanding of how different methods can introduce bias.
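As a reference point for evaluating answers, the sketch below shows one way simple imputation can distort a variable. It is a minimal illustration using only the Python standard library; the `incomes` values and the `impute` helper are hypothetical and not drawn from any real dataset.

```python
import statistics

# Hypothetical income column; None marks a missing entry.
incomes = [30, 32, 35, None, 38, 40, None, 250]

observed = [x for x in incomes if x is not None]
mean_fill = statistics.mean(observed)      # pulled upward by the outlier (250)
median_fill = statistics.median(observed)  # robust to the outlier


def impute(values, fill):
    """Replace each missing entry with a constant fill value."""
    return [fill if v is None else v for v in values]


mean_imputed = impute(incomes, mean_fill)
median_imputed = impute(incomes, median_fill)

# Constant-fill imputation shrinks the spread of the column, because every
# filled value sits at a single point instead of varying like real data.
var_observed = statistics.pvariance(observed)
var_mean_imputed = statistics.pvariance(mean_imputed)

print("mean fill:", mean_fill, "median fill:", median_fill)
print("variance before:", var_observed, "after mean imputation:", var_mean_imputed)
```

A strong candidate should be able to explain why the variance drops after filling and why the median is often the safer constant when the column is skewed.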
Explain the difference between supervised and unsupervised learning. Give examples of algorithms in each category.
Should clearly distinguish labeled vs. unlabeled data, discuss regression/classification vs. clustering/dimensionality reduction, and mention real-world use cases.
How do you evaluate a classification model? What metrics would you use and why?
Should discuss accuracy, precision, recall, F1, AUC-ROC, and the confusion matrix. Should explain when each metric matters (e.g., on imbalanced data, F1 or precision and recall are more informative than accuracy). Look for context-driven metric selection.
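The point about imbalanced data can be made concrete with a small sketch. The function name `classification_metrics` and the two sets of predictions are illustrative assumptions; the labels describe a hypothetical dataset with 95 negatives and 5 positives.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (0/1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical imbalanced labels: 95 negatives followed by 5 positives.
y_true = [0] * 95 + [1] * 5

# Baseline that predicts "negative" for every sample.
always_negative = [0] * 100

# A model that finds 4 of the 5 positives at the cost of 6 false alarms.
model_pred = [0] * 89 + [1] * 6 + [1] * 4 + [0] * 1

baseline = classification_metrics(y_true, always_negative)
model = classification_metrics(y_true, model_pred)

print("baseline:", baseline)
print("model:   ", model)
```

The baseline scores higher on accuracy yet never finds a positive case, while the second set of predictions has lower accuracy but a much higher F1. Candidates who can explain that gap are showing the context-driven reasoning this question is meant to surface.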
Explain how a random forest model works. How does it differ from a single decision tree?
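Should explain that a random forest is an ensemble of decision trees, each trained on a bootstrap sample (bagging) with a random subset of features considered at each split, and that predictions are combined by majority vote (classification) or averaging (regression). Should discuss advantages (reduced overfitting) and trade-offs (lower interpretability, higher computation).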
What is regularization and why is it important? Compare L1 (Lasso) and L2 (Ridge) regularization.
Should explain that regularization prevents overfitting by penalizing large coefficients: L1 drives some coefficients to exactly zero (feature selection), while L2 shrinks all coefficients toward zero without eliminating them. Look for mathematical understanding and practical applications.
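The different behavior of the two penalties can be sketched for a single coefficient. This is a simplified illustration that assumes orthonormal features, so each coefficient can be solved independently from its unpenalized least-squares estimate `b`. Under that assumption, minimizing 0.5*(b - w)^2 + lam*|w| gives soft-thresholding (L1), and minimizing 0.5*(b - w)^2 + 0.5*lam*w^2 gives proportional shrinkage (L2). The function names and example values are hypothetical.

```python
def ridge_shrink(b, lam):
    """L2 solution for one coefficient: scale toward zero, never exactly zero."""
    return b / (1.0 + lam)


def lasso_shrink(b, lam):
    """L1 solution for one coefficient: soft-thresholding at lam."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0  # small coefficients are removed entirely


# Hypothetical unpenalized estimates: two strong and two weak features.
ols_estimates = [3.0, -1.5, 0.4, -0.2]
lam = 0.5

ridge = [ridge_shrink(b, lam) for b in ols_estimates]
lasso = [lasso_shrink(b, lam) for b in ols_estimates]

print("ridge:", ridge)
print("lasso:", lasso)
```

In this sketch the two weak coefficients become exactly zero under L1 but only smaller under L2, which is the feature-selection behavior a candidate should be able to describe.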
How do you approach feature selection when you have hundreds of features?
Should discuss filter methods (correlation, chi-square), wrapper methods (RFE), embedded methods (Lasso, feature importance), and domain knowledge. Look for practical dimensionality reduction experience.
Explain cross-validation. Why do you use it and what are the different strategies?
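Should explain k-fold, stratified k-fold (for imbalanced classes), and time-series cross-validation, and why averaging over several folds gives a more reliable performance estimate than a single train-test split. Look for understanding of data leakage prevention, such as fitting preprocessing steps on the training folds only.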
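The difference between plain k-fold and a time-series split can be sketched with index generators. This is a minimal standard-library illustration, not a replacement for a library implementation; the function names `k_fold_indices` and `expanding_window_indices` are hypothetical, and stratification and shuffling are deliberately left out.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists; every sample is tested exactly once.

    Assumes the rows are already in random order (no shuffling is done here).
    """
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def expanding_window_indices(n_samples, n_splits, test_size):
    """Yield (train, test) index lists where training always precedes testing.

    Assumes rows are sorted by time, oldest first.
    """
    for i in range(n_splits):
        test_start = n_samples - (n_splits - i) * test_size
        if test_start <= 0:
            raise ValueError("not enough samples for the requested splits")
        train = list(range(test_start))
        test = list(range(test_start, test_start + test_size))
        yield train, test


k_folds = list(k_fold_indices(10, 3))
ts_folds = list(expanding_window_indices(10, n_splits=3, test_size=2))

for train, test in ts_folds:
    print("train:", train, "test:", test)
```

In the time-series variant no training index is ever later than a test index, which is how this strategy avoids leaking future information into the model.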
How do you deploy a machine learning model to production? What considerations are important?
Should discuss model serialization (pickle, ONNX), serving (API, batch), monitoring (data drift, model degradation), and CI/CD for ML. Look for MLOps awareness.
What is the central limit theorem and why is it important in data science?
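Should explain that the sampling distribution of the sample mean approaches a normal distribution as the sample size grows, regardless of the shape of the underlying distribution (provided its variance is finite). Should cover its role in hypothesis testing and confidence intervals, and its practical implications for A/B testing.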
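A short simulation can illustrate the theorem. The sketch below draws repeated samples from an exponential distribution (strongly skewed, with mean 1 and standard deviation 1) and looks at how the sample means behave. The sample size, number of trials, and seed are arbitrary choices made for illustration.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 50   # observations per sample
TRIALS = 2000      # number of repeated samples


def one_sample_mean(n):
    """Mean of n draws from an exponential distribution with rate 1."""
    return statistics.fmean([random.expovariate(1.0) for _ in range(n)])


means = [one_sample_mean(SAMPLE_SIZE) for _ in range(TRIALS)]

# Theory: the means centre on the population mean (1.0) with standard
# error sigma / sqrt(n) = 1 / sqrt(50).
expected_se = 1.0 / math.sqrt(SAMPLE_SIZE)
observed_se = statistics.stdev(means)

# If the means are roughly normal, about 95% should fall within
# 1.96 standard errors of the population mean.
within = sum(1 for m in means if abs(m - 1.0) <= 1.96 * expected_se)
coverage = within / TRIALS

print("mean of sample means:", statistics.fmean(means))
print("expected SE:", expected_se, "observed SE:", observed_se)
print("share within 1.96 SE:", coverage)
```

The same reasoning underlies confidence intervals in A/B tests: even when individual user metrics are skewed, the mean over a large group can be treated as approximately normal.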