AI prompts for data scientists that produce code you can actually trust
Data scientists were early adopters of AI coding assistants, but a vague prompt tends to produce code that silently breaks on edge cases, and untested AI-generated code is a production risk, not a shortcut. The prompts that work share a pattern: a specific expert role, your actual schema or data context, a structured output format, and an explicit instruction to surface assumptions. These templates bake that in, so the output is something you can review and ship rather than rewrite.
Last updated · By the Prompt Orange team
Top prompts for data scientists
1. Write an exploratory data analysis
“Analyse this dataset”
Too vague—AI has to guess what you want
“You are a senior data scientist. I have a pandas DataFrame `df` with columns: user_id (int), signup_date (datetime), plan (categorical: free/pro/team), monthly_revenue (float), churned (bool). Write Python (pandas + matplotlib) for an exploratory analysis: missing-value summary, distribution of monthly_revenue by plan, churn rate by plan and signup cohort, and a correlation check. Add a one-line comment on each chart explaining what to look for. Flag any assumptions you made about the data.”
Specific, clear, ready to use
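For a sense of what a good response to this prompt looks like, here is a minimal sketch of the tabular part of the analysis (missing values, churn rate by plan and by signup cohort). The sample rows are invented purely for illustration, the cohort is assumed to mean calendar month of signup, and the matplotlib charts are left out to keep it short.

```python
import pandas as pd

# Hypothetical sample rows matching the schema in the prompt; not real data.
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "signup_date": pd.to_datetime([
        "2024-01-05", "2024-01-20", "2024-02-02",
        "2024-02-15", "2024-03-01", "2024-03-09",
    ]),
    "plan": pd.Categorical(
        ["free", "pro", "team", "free", "pro", "free"],
        categories=["free", "pro", "team"],
    ),
    "monthly_revenue": [0.0, 29.0, 99.0, 0.0, None, 0.0],
    "churned": [True, False, False, True, False, False],
})

# Missing-value summary: count and share of nulls per column.
missing = pd.DataFrame({
    "n_missing": df.isna().sum(),
    "pct_missing": df.isna().mean().round(3),
})

# Churn rate by plan: the mean of a boolean column is the share of True values.
churn_by_plan = df.groupby("plan", observed=True)["churned"].mean()

# Churn rate by signup cohort (assumption: cohort = calendar month of signup).
cohort = df["signup_date"].dt.to_period("M")
churn_by_cohort = df.groupby(cohort)["churned"].mean()

print(missing)
print(churn_by_plan)
print(churn_by_cohort)
```

When you review real output, check that the assumptions the model flags (for example, how it defines a cohort) match what you meant.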
2. Debug a model that won't learn
“Why is my model not working?”
Too vague—AI has to guess what you want
“I'm training a binary classifier with scikit-learn (LogisticRegression) on ~50k rows, 30 features, classes split 95/5. Validation AUC is stuck around 0.5. Walk through the most likely causes in priority order — class imbalance, leakage, unscaled features, a constant/ID column — and for each, give the one-line diagnostic check to confirm or rule it out before I change anything. Don't suggest switching models yet.”
Specific, clear, ready to use
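The diagnostic checks this prompt asks for are usually cheap to run yourself. The following is a rough sketch of what they might look like in pandas, not a definitive checklist. The function name `quick_diagnostics`, the 0.95 correlation threshold, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def quick_diagnostics(X: pd.DataFrame, y: pd.Series) -> dict:
    """Cheap checks for common causes of a chance-level validation AUC."""
    numeric = X.select_dtypes(include="number")
    # Drop constant columns before correlating; their correlation is undefined.
    varying = numeric.loc[:, numeric.nunique() > 1]
    corr = varying.corrwith(y.astype(float)).abs()
    return {
        # Class imbalance: share of each class in the target.
        "class_share": y.value_counts(normalize=True).to_dict(),
        # Constant columns carry no signal.
        "constant_cols": [c for c in X.columns if X[c].nunique(dropna=False) <= 1],
        # ID-like columns: non-float columns where every value is unique.
        "id_like_cols": [
            c for c in X.columns
            if not pd.api.types.is_float_dtype(X[c]) and X[c].nunique() == len(X)
        ],
        # Unscaled features: very different standard deviations across columns.
        "feature_std": numeric.std().round(2).to_dict(),
        # Possible leakage: a feature almost perfectly correlated with the target
        # (0.95 is an arbitrary illustrative threshold).
        "suspect_leak_cols": corr[corr > 0.95].index.tolist(),
    }


# Synthetic data for illustration only: 200 rows, 95/5 class split.
rng = np.random.default_rng(0)
y = pd.Series((np.arange(200) < 10).astype(int), name="target")
X = pd.DataFrame({
    "row_id": np.arange(200),
    "const": 1.0,
    "age": rng.normal(40, 10, size=200),
    "income": rng.normal(50_000, 15_000, size=200),
    "leaky": y + rng.normal(0, 0.01, size=200),
})

report = quick_diagnostics(X, y)
print(report)
```

Running checks like these before changing anything is the point of the prompt: each one confirms or rules out a cause without touching the model.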
3. Write a SQL feature query
“Write me a SQL query for features”
Too vague—AI has to guess what you want
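“Write a PostgreSQL query that builds a feature table for churn prediction, one row per customer. Source tables: customers(id, created_at), orders(customer_id, created_at, amount), sessions(customer_id, started_at). Features: total_orders, total_spend, avg_order_value, days_since_last_order, sessions_last_30d, tenure_days. Use CTEs that aggregate each source table separately before joining, so orders and sessions don't multiply each other's rows. Handle customers with zero orders: COALESCE counts and sums to 0, leave days_since_last_order as NULL rather than 0, and avoid dividing by zero in avg_order_value. Add a comment above each feature. Make it deterministic and readable: compute every time-based feature against a single reference date so re-running the query gives the same result.”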
Specific, clear, ready to use
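To show the query shape this prompt is aiming for, here is a small sketch that runs against an in-memory SQLite database from Python. SQLite is only a stand-in so the example is self-contained: the date functions (`date(...)`, `julianday(...)`) differ from PostgreSQL, the rows are made up, the `:as_of` reference-date parameter is an assumption, and only a subset of the features is included.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, created_at TEXT);
CREATE TABLE orders (customer_id INTEGER, created_at TEXT, amount REAL);
CREATE TABLE sessions (customer_id INTEGER, started_at TEXT);
-- Hypothetical rows: customer 2 has sessions but no orders.
INSERT INTO customers VALUES (1, '2024-01-01'), (2, '2024-02-01');
INSERT INTO orders VALUES (1, '2024-03-01', 20.0), (1, '2024-03-11', 40.0);
INSERT INTO sessions VALUES
    (1, '2024-03-20'), (1, '2024-03-25'), (1, '2024-01-02'), (2, '2024-03-28');
""")

QUERY = """
WITH order_stats AS (
    -- Aggregate orders on their own so the later join cannot fan out.
    SELECT customer_id,
           COUNT(*)        AS total_orders,
           SUM(amount)     AS total_spend,
           MAX(created_at) AS last_order_at
    FROM orders
    GROUP BY customer_id
),
session_stats AS (
    -- Sessions in the 30 days up to the reference date.
    SELECT customer_id, COUNT(*) AS sessions_last_30d
    FROM sessions
    WHERE started_at >= date(:as_of, '-30 days') AND started_at <= :as_of
    GROUP BY customer_id
)
SELECT
    c.id AS customer_id,
    -- Counts and sums default to 0 for customers with no orders.
    COALESCE(o.total_orders, 0) AS total_orders,
    COALESCE(o.total_spend, 0)  AS total_spend,
    -- Stays NULL when there is no order; 0 would mean "ordered today".
    CAST(julianday(:as_of) - julianday(o.last_order_at) AS INTEGER)
        AS days_since_last_order,
    COALESCE(s.sessions_last_30d, 0) AS sessions_last_30d
FROM customers c
LEFT JOIN order_stats o ON o.customer_id = c.id
LEFT JOIN session_stats s ON s.customer_id = c.id
ORDER BY c.id
"""

rows = conn.execute(QUERY, {"as_of": "2024-03-31"}).fetchall()
for row in rows:
    print(row)
```

Passing the reference date as a parameter, rather than calling the current date inside the query, is what keeps repeated runs reproducible.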
4. Explain a model to stakeholders
“Explain my model results”
Too vague—AI has to guess what you want
“I built a gradient-boosted model predicting which trial users convert to paid (precision 0.71, recall 0.44 on the positive class). Write a 200-word summary for non-technical executives: what the model does, what precision and recall mean in plain business terms for this use case, the single most important caveat, and one recommended action. No jargon, no formulas — translate metrics into 'out of every 100 users it flags, ~71 actually convert'.”
Specific, clear, ready to use
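The plain-language translation the prompt asks for is simple arithmetic, and it is worth checking the model's wording against it. A minimal sketch, where `plain_language_metrics` is a hypothetical helper name rather than anything from a library:

```python
def plain_language_metrics(precision: float, recall: float, per: int = 100) -> dict:
    """Turn precision and recall into 'out of every N' counts."""
    flagged_right = round(precision * per)  # flagged users who really convert
    caught = round(recall * per)            # real converters the model flags
    return {
        "flagged_that_convert": flagged_right,
        "flagged_that_do_not": per - flagged_right,
        "converters_caught": caught,
        "converters_missed": per - caught,
    }


# Metrics quoted in the prompt: precision 0.71, recall 0.44.
summary = plain_language_metrics(0.71, 0.44)
print(summary)
```

With the numbers from the prompt, this reads as: of every 100 users the model flags, about 71 convert, and of every 100 users who actually convert, the model catches about 44 and misses about 56.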
5. Review code for data leakage
“Check my ML code”
Too vague—AI has to guess what you want
“Review the following scikit-learn pipeline specifically for data leakage and evaluation mistakes — nothing else. Check for: scaling/encoding fitted before the train/test split, target-derived features, time-series rows shuffled across the split, and metrics computed on training data. For each issue found, quote the offending line, explain why it leaks, and show the corrected version. If you find none, say so explicitly.”
Specific, clear, ready to use
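As a reference point for the first item on that checklist, here is a sketch of the leak-free pattern a reviewer would expect to see: split first, then fit the scaler inside a pipeline on the training rows only. The data is synthetic and the model choice is only an example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = (X[:, 0] + rng.normal(size=200) > 5.0).astype(int)

# Split BEFORE any fitting, so test rows never influence preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The pipeline fits the scaler on the training rows only, then reuses those
# statistics to transform the test rows at score time.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

scaler = model.named_steps["standardscaler"]
test_accuracy = model.score(X_test, y_test)

# The scaler's means match the training set, not the full dataset.
print(np.allclose(scaler.mean_, X_train.mean(axis=0)))
print(test_accuracy)
```

The leaky version of this code would call `StandardScaler().fit_transform(X)` on the full dataset before splitting, which lets test-set statistics shape the training features.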