ChatGPT has become a go-to tool for data professionals looking to speed up exploratory analysis, generate SQL, write Python scripts, and make sense of messy datasets — all without leaving a chat window.

But there’s a gap between what people think ChatGPT can do and what it actually delivers. This guide closes that gap. You’ll get a realistic breakdown of capabilities, 20 real prompts with actual outputs, an honest comparison against specialized tools like Julius AI, and a section most articles skip entirely: what data you should never upload, and why.
What ChatGPT Can (and Cannot) Do for Data Analysis
Advanced Data Analysis: How It Actually Works
ChatGPT’s data analysis capability — originally called Code Interpreter, then Advanced Data Analysis — is now built directly into GPT-4o for all paid plans. When you upload a file and ask a question, ChatGPT spins up a sandboxed Python environment server-side, writes code, executes it, and returns both the output and the code that generated it.
The sandbox comes pre-loaded with the libraries data professionals actually use:
- pandas + NumPy — data manipulation
- matplotlib, Seaborn, Plotly — visualization
- scikit-learn — machine learning
- SciPy + statsmodels — statistical analysis
- openpyxl — Excel I/O
As of GPT-4o, you can also connect files directly from Google Drive and Microsoft OneDrive, and the interactive chart output lets you customize axis labels and colors before downloading.
What ChatGPT Handles Well
| Task | ChatGPT Performance |
|---|---|
| EDA on a new CSV (shape, dtypes, missing values, distributions) | Excellent |
| SQL generation from plain English | Excellent |
| Writing pandas cleaning scripts | Excellent |
| Generating matplotlib/seaborn charts | Very good |
| Explaining statistical concepts | Very good |
| Feature engineering for ML | Good |
| Debugging code you paste in | Good |
| Basic regression/classification | Adequate |
| Training large ML models | Not feasible |
| Analyzing datasets > 50 MB | Limited |
| Real-time or streaming data | Not supported |
Where It Falls Short
Context and file size limits: GPT-4o has a 128,000-token context window. More practically, spreadsheets uploaded for data analysis are capped at approximately 50 MB. High-density CSV files with many columns often fail below this limit. You’re not getting the full 128K for data — part of that budget is consumed by system instructions and your conversation history.
Hallucinations in numerical work: ChatGPT’s overall hallucination rate is approximately 1.5% in 2025 benchmarks — but that’s across all tasks. In numerical analysis, the failure mode is more dangerous: it produces confident-looking wrong answers rather than admitting uncertainty. Time-series calculations with gaps, for example, can silently produce incorrect year-over-year percentages without flagging division-by-zero issues.
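The gap problem is easy to demonstrate. Below is a minimal, illustrative sketch (toy numbers, not from any real dataset) of how a naive `pct_change` on a monthly series with a missing month silently compares non-adjacent periods, and how reindexing to a complete date range surfaces the gap instead:

```python
import pandas as pd

# Monthly revenue with March missing -- the kind of gap naive code ignores
revenue = pd.Series(
    [100.0, 110.0, 130.0],
    index=pd.to_datetime(["2024-01-01", "2024-02-01", "2024-04-01"]),
)

# Naive month-over-month change: the April row is silently compared with
# February, so the "one-month" growth figure actually spans two months
naive = revenue.pct_change(fill_method=None)

# Safer: reindex to a complete monthly range first, so the gap becomes NaN
full_index = pd.date_range("2024-01-01", "2024-04-01", freq="MS")
safe = revenue.reindex(full_index).pct_change(fill_method=None)

print(naive.iloc[-1])  # ~0.1818: plausible-looking but misleading
print(safe.iloc[-1])   # nan: the gap is flagged instead of hidden
```

This is exactly the failure mode Prompt 20 later in this article is designed to catch.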
Session impermanence: The Python sandbox resets between sessions. Any intermediate files, trained models, or variables you created are gone when you close the chat — unless you explicitly download them.
No live database connections: Unlike Julius AI (covered below), ChatGPT requires file uploads. It cannot connect directly to Snowflake, BigQuery, or PostgreSQL.
ChatGPT vs. Julius AI vs. Claude: Comparison Table
| Dimension | ChatGPT (GPT-4o) | Julius AI | Claude (Sonnet/Opus) |
|---|---|---|---|
| Code execution | Yes (Python sandbox) | Yes (automated) | Yes (Python + Node.js) |
| File size limit | ~50 MB spreadsheets | Not specified | 30 MB per file |
| Context window | 128K tokens | Multi-model | 200K+ tokens |
| Interactive charts | Yes (customizable) | Yes (native) | Static via code |
| Reusable workflows | No | Yes (Notebooks) | No |
| Database connectors | No | Snowflake, BigQuery, Postgres (Pro) | No |
| Projects/multi-file | 20–40 files | Notebooks | 50+ files |
| Price (base paid) | $20/mo (Plus) | $29.16/mo | $20/mo (Pro) |
| Free tier | Very limited | 5 messages/month | Very limited |
Use ChatGPT when: You want the fastest path from “here’s a CSV” to working charts and analysis. It’s also the best choice for learning — its explanations are verbose and educational.

Use Julius AI when: You’re a non-coder who needs repeatable analysis workflows, or you need live database connections without file exports. Julius’s Notebooks let you save a workflow once and re-run it with new data automatically — ChatGPT has no equivalent.
Use Claude when: You’re working with large, complex datasets that stress ChatGPT’s context limits, or you need higher-quality production-ready Python code. Claude also outputs properly formatted .xlsx files (ChatGPT defaults to CSV). In head-to-head testing, Claude’s code required fewer revision cycles and asked clarifying questions before writing.
How to Upload and Analyze a Dataset: Step-by-Step
Here’s exactly what happens when you upload a CSV to ChatGPT for analysis:
Step 1: Upload the file
Click the paperclip icon in the chat input (or drag and drop). ChatGPT accepts CSV, XLSX, TSV, JSON, and PDF. For spreadsheets, stay under 50 MB.
Step 2: Start with a structured EDA prompt
Do not just write “analyze this.” Give ChatGPT a specific checklist:
Act as a senior data analyst. I've uploaded a CSV of customer transactions.
Please:
1. Show me the shape (rows x columns), data types, and first 5 rows
2. Identify any missing values by column (count + percentage)
3. Show the distribution of the 'amount' column with mean, median, std, and a histogram
4. Flag any outliers using the IQR method
5. Generate a correlation matrix heatmap for all numerical columns
Tell me if any column has data quality issues I should address before analysis.
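For reference, step 2 of that checklist typically comes back as code like the following. This is a hand-written sketch on a toy DataFrame (column names are illustrative), not ChatGPT's literal output:

```python
import pandas as pd
import numpy as np

# Toy stand-in for the uploaded transactions file (illustrative only)
df = pd.DataFrame({
    "amount": [120.0, 85.5, np.nan, 300.0, 95.0],
    "category": ["A", "B", "B", None, "A"],
})

# Missing values by column: count and percentage
missing = pd.DataFrame({
    "null_count": df.isnull().sum(),
    "null_pct": (df.isnull().mean() * 100).round(1),
})
print(missing)
```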
Step 3: Iterate with follow-up questions
Once you have the initial EDA, drill in: “Show me a breakdown of returns by product category” or “Rewrite the histogram with log scale for the x-axis.”
Step 4: Download outputs
Use the download link ChatGPT provides for each generated chart. To download a cleaned dataset, prompt: “Export the cleaned dataframe as a new CSV.”
20 Best ChatGPT Prompts for Data Analysis
These prompts are organized by task. Where it adds the most value, the output ChatGPT actually returns is shown alongside the prompt.
Exploratory Data Analysis (EDA)
Prompt 1 — Full EDA sweep
Act as a data scientist. Analyze this dataset and return:
- Shape, dtypes, and .describe() for numerical columns
- Missing value count and percentage per column
- Top 5 most frequent values in each categorical column
- Skewness for numerical columns
- A list of recommended next steps based on what you find
Output: A structured markdown report with embedded tables and a list of data quality findings.
Prompt 2 — Outlier detection
Using the IQR method, identify outliers in the 'price' column.
For each outlier: show the row index, the actual value, and how many IQRs away from Q1/Q3 it is.
Then show me a box plot.
Output: A table of outlier rows + a matplotlib box plot with IQR bounds marked.
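The core IQR logic ChatGPT generates for this prompt usually looks like the sketch below (toy data; the "distance in IQRs" here is measured from Q3 for high-side outliers):

```python
import pandas as pd

prices = pd.Series([10, 12, 11, 13, 12, 11, 95], name="price")

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Rows outside the Tukey fences
outliers = prices[(prices < lower) | (prices > upper)]

# How far past Q3 each high-side outlier sits, in IQR units
distance = ((outliers - q3) / iqr).round(2)
print(outliers.to_dict(), distance.to_dict())
```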
Prompt 3 — Distribution analysis
For each numerical column, generate:
1. A histogram with 20 bins
2. The skewness and kurtosis values
3. A recommendation: "normally distributed," "right-skewed," or "left-skewed"
Prompt 4 — Correlation investigation
Compute the Pearson correlation matrix for all numerical columns.
Create a heatmap with values annotated.
List the top 5 strongest positive correlations and top 5 strongest negative correlations as a table.
Data Cleaning
Prompt 5 — Handle missing values
I have a dataset with missing values. Please:
1. Show which columns have nulls
2. For numerical columns: fill nulls with the column median
3. For categorical columns: fill nulls with the mode
4. Show the null count before and after
Write pandas code I can reuse.
Output:

```python
print("Missing before:\n", df.isnull().sum())

num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

cat_cols = df.select_dtypes(include="object").columns
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

print("\nMissing after:\n", df.isnull().sum())
```
Prompt 6 — Clean messy columns
The 'revenue' column contains values like '$1,234.56' and '(500.00)' (negatives in parentheses).
The 'date' column mixes formats: '2024-01-15', '01/15/2024', and 'January 15, 2024'.
Write pandas code to clean both columns.
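A working version of what this prompt should return, sketched on a three-row toy frame (note that `format="mixed"` requires pandas 2.0 or later):

```python
import pandas as pd

df = pd.DataFrame({
    "revenue": ["$1,234.56", "(500.00)", "$89.00"],
    "date": ["2024-01-15", "01/15/2024", "January 15, 2024"],
})

# Strip '$' and ',', then convert accounting-style '(500.00)' to -500.00
df["revenue"] = (
    df["revenue"]
    .str.replace(r"[$,]", "", regex=True)
    .str.replace(r"^\((.*)\)$", r"-\1", regex=True)
    .astype(float)
)

# Mixed date formats: per-element parsing (pandas >= 2.0)
df["date"] = pd.to_datetime(df["date"], format="mixed")
print(df)
```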
Prompt 7 — Deduplicate intelligently
Find duplicate rows based on ['customer_id', 'order_date', 'product_id'].
Keep the row with the highest 'amount'.
Show how many duplicates were removed.
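The standard pandas idiom for "keep the best row per duplicate group" is sort-then-drop, sketched here on toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "order_date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "product_id": ["A", "A", "B"],
    "amount": [50.0, 75.0, 20.0],
})

keys = ["customer_id", "order_date", "product_id"]
before = len(df)

# Sort so the highest amount comes first within each duplicate group,
# then keep only the first occurrence of each key combination
deduped = (
    df.sort_values("amount", ascending=False)
      .drop_duplicates(subset=keys, keep="first")
      .sort_index()
)
print(f"Removed {before - len(deduped)} duplicate rows")
```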
Prompt 8 — Standardize categorical values
The 'region' column has values like 'US', 'usa', 'United States', 'U.S.A'.
Write code to standardize all variations to 'United States'.
Use a mapping dictionary so I can easily add new variations later.
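One maintainable pattern for this (a sketch; normalize case and punctuation first so the mapping stays small, and fall back to the original value for unmapped regions):

```python
import pandas as pd

df = pd.DataFrame({"region": ["US", "usa", "United States", "U.S.A", "Canada"]})

# Extendable mapping: keys are lowercase, punctuation-stripped variants
REGION_MAP = {
    "us": "United States",
    "usa": "United States",
    "united states": "United States",
}

normalized = df["region"].str.lower().str.replace(".", "", regex=False)
df["region"] = normalized.map(REGION_MAP).fillna(df["region"])
print(df["region"].tolist())
```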
SQL Generation
Prompt 9 — Complex aggregation with window functions
Schema: sales(sale_id, rep_id, region, product, revenue, sale_date)
Write SQL to show each rep's monthly revenue total for 2024,
ranked within their region. Show only the top 3 reps per region per month.
Use CTEs, not subqueries.
Output:

```sql
WITH monthly_totals AS (
    SELECT
        rep_id,
        region,
        DATE_TRUNC('month', sale_date) AS sale_month,
        SUM(revenue) AS monthly_revenue
    FROM sales
    WHERE sale_date >= '2024-01-01' AND sale_date < '2025-01-01'
    GROUP BY rep_id, region, DATE_TRUNC('month', sale_date)
),
ranked AS (
    SELECT *,
        RANK() OVER (
            PARTITION BY region, sale_month
            ORDER BY monthly_revenue DESC
        ) AS region_rank
    FROM monthly_totals
)
SELECT rep_id, region, sale_month, monthly_revenue, region_rank
FROM ranked
WHERE region_rank <= 3
ORDER BY region, sale_month, region_rank;
```
Note: DATE_TRUNC is PostgreSQL/BigQuery syntax. Specify your database for MySQL (DATE_FORMAT) or SQL Server (DATETRUNC).
Prompt 10 — Debug a broken query
This query is supposed to return monthly active users, but it returns more rows than expected:
[paste your query]
Here's the table schema: [paste DDL]
What's wrong, and how do I fix it?
Prompt 11 — Convert between SQL dialects
Convert this MySQL query to BigQuery Standard SQL:
[paste query]
Flag any functions that behave differently between the two dialects.
Python / Pandas
Prompt 12 — Feature engineering for ML
I have a sales dataset with: customer_id, order_date, amount, product_category.
Create these features for a churn prediction model:
- days_since_last_purchase
- total_purchases_last_90_days
- avg_order_value
- most_frequent_category (mode per customer)
Group by customer_id and return a customer-level feature table.
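A sketch of the feature table this prompt produces, using a hypothetical reference date (`as_of`) that real churn features need but the prompt leaves implicit:

```python
import pandas as pd

tx = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "order_date": pd.to_datetime(["2024-05-01", "2024-06-15", "2024-03-01"]),
    "amount": [100.0, 50.0, 200.0],
    "product_category": ["books", "books", "toys"],
})
as_of = pd.Timestamp("2024-07-01")  # assumed reference date

# Transactions inside the trailing 90-day window
recent = tx[tx["order_date"] >= as_of - pd.Timedelta(days=90)]

features = tx.groupby("customer_id").agg(
    avg_order_value=("amount", "mean"),
    most_frequent_category=("product_category", lambda s: s.mode()[0]),
    last_purchase=("order_date", "max"),
)
features["days_since_last_purchase"] = (as_of - features["last_purchase"]).dt.days
features["total_purchases_last_90_days"] = (
    recent.groupby("customer_id").size().reindex(features.index, fill_value=0)
)
print(features.drop(columns="last_purchase"))
```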
Prompt 13 — Pivot and reshape
I have transactions in long format: date, store_id, product, sales.
Reshape to wide format: date as rows, each store_id as a column, sales as values.
Fill missing store/date combos with 0.
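This reshape is a one-liner with `pivot_table`, which also handles the zero-fill (toy data shown):

```python
import pandas as pd

long = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "store_id": ["S1", "S2", "S1"],
    "product": ["widget", "widget", "widget"],
    "sales": [10, 20, 15],
})

# Long to wide: dates as rows, one column per store, 0 for missing combos
wide = long.pivot_table(
    index="date", columns="store_id", values="sales",
    aggfunc="sum", fill_value=0,
)
print(wide)
```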
Prompt 14 — Time series resampling
I have daily transaction data for 3 years.
Resample to weekly and monthly using:
- sum for 'revenue'
- mean for 'avg_order_value'
- count for 'transaction_count'
Plot all three metrics as a 3-panel line chart.
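The per-column aggregation in this prompt maps to a single `resample(...).agg(...)` call with a dict. A sketch on synthetic daily data (constant values, purely to make the aggregation visible):

```python
import pandas as pd
import numpy as np

rng = pd.date_range("2024-01-01", periods=60, freq="D")
daily = pd.DataFrame({
    "revenue": np.full(60, 100.0),
    "avg_order_value": np.full(60, 50.0),
    "transaction_count": np.full(60, 2),
}, index=rng)

# Different aggregation per column, as the prompt specifies
monthly = daily.resample("MS").agg({
    "revenue": "sum",            # totals add up
    "avg_order_value": "mean",   # averages should not be summed
    "transaction_count": "count" # rows per period
})
print(monthly)
```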
Visualization
Prompt 15 — Comparison chart
Create a grouped bar chart comparing Q1, Q2, Q3, Q4 revenue for each product category.
Use distinct colors per quarter, add value labels on each bar, and include a legend.
Export as PNG at 300 DPI.
Prompt 16 — Annotated scatter plot
Create a scatter plot of 'marketing_spend' vs 'revenue', colored by 'region'.
Add a regression line and show the R² value in the chart title.
Annotate the top 5 highest-revenue points with their store names.
Reporting
Prompt 17 — Executive summary
Based on this sales data, write a 3-paragraph executive summary for a non-technical audience.
Include: key wins, main concerns, and one recommended action.
Use specific numbers from the data.
Prompt 18 — Anomaly narration
Look at this weekly revenue time series.
Identify any weeks where revenue was more than 2 standard deviations from the mean.
For each anomaly: give the week, the actual value, the expected range, and a possible explanation if the data supports one.
Prompt 19 — Report template generation
Create a Python script that generates a monthly performance report PDF for any input month.
Include: total revenue, MoM change, top 5 products, bottom 5 products, and a trend chart.
Use reportlab or FPDF.
Prompt 20 — Edge case check (always use this)
After completing any calculation involving time-series data or percentages,
also check for these edge cases and flag any rows affected:
- Division by zero
- Missing periods in the time series
- Negative values where only positive values are expected
- Dates out of expected range
This is the most important prompt to add to your workflow. ChatGPT’s most dangerous failure mode is producing confident-looking wrong numbers — especially in time-series calculations with gaps.
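If you prefer to run the check yourself rather than trust ChatGPT to, the same four edge cases are a few lines of pandas. A sketch (`flag_edge_cases` is a hypothetical helper name; the date-range check here covers the out-of-range and missing-period cases for a monthly series):

```python
import pandas as pd

def flag_edge_cases(df, value_col, date_col, freq="MS"):
    """Return the edge-case findings from Prompt 20 for a periodic series."""
    findings = {}
    # Zeros that would blow up a percentage or ratio downstream
    findings["zero_denominators"] = int((df[value_col] == 0).sum())
    # Negatives where only positive values are expected
    findings["negative_values"] = int((df[value_col] < 0).sum())
    # Periods absent between the first and last observed dates
    expected = pd.date_range(df[date_col].min(), df[date_col].max(), freq=freq)
    findings["missing_periods"] = sorted(
        set(expected) - set(pd.to_datetime(df[date_col]))
    )
    return findings

df = pd.DataFrame({
    "month": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-04-01"]),
    "revenue": [100.0, 0.0, -5.0],
})
findings = flag_edge_cases(df, "revenue", "month")
print(findings)
```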
ChatGPT for SQL: Writing, Debugging, and Optimizing
ChatGPT is one of the most effective free tools available for SQL work. It does not need a live database connection: describe your schema in text (or paste DDL), ask in plain English, and it generates queries that are usually correct on the first pass, subject to the caveats below.
What it handles well:
- JOINs, GROUP BY, window functions, CTEs, date logic
- Debugging slow or broken queries when you paste the execution plan
- Converting between dialects: MySQL → PostgreSQL → BigQuery
- Explaining what an unfamiliar query does, line by line
Watch out for:
- It defaults to PostgreSQL syntax — always specify your target database
- Date functions differ significantly across dialects; double-check them
- For very complex queries (5+ JOINs, nested CTEs), verify output against real data before production use
ChatGPT for Python: pandas, matplotlib, scikit-learn
The Python code ChatGPT generates is production-adjacent — correct often enough to be useful, but not reliably enough to deploy without review.
Strengths: Cleaning and transformation scripts are clean and commented. Standard scikit-learn pipelines (split → scale → fit → evaluate) come out correct. Matplotlib and Seaborn code works for all standard chart types.
Known issues:
- Occasionally uses deprecated function signatures (e.g., `infer_datetime_format=True`, removed in pandas 2.0)
- May hallucinate argument names for less common libraries
- Scripts assume in-memory data, with no chunking logic for large files
Best practice: Paste generated code into a notebook with a sample of real data before running on the full dataset.
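The "no chunking logic" gap is worth knowing how to close yourself, since ChatGPT rarely adds it unprompted. A minimal chunked-aggregation sketch (an in-memory `StringIO` stands in for a large file path):

```python
import io
import pandas as pd

# Stand-in for a large CSV; in practice pass a file path instead
csv_data = io.StringIO("amount\n" + "\n".join(str(i) for i in range(10)))

# Aggregate incrementally instead of loading everything at once
total = 0.0
rows = 0
for chunk in pd.read_csv(csv_data, chunksize=4):
    total += chunk["amount"].sum()
    rows += len(chunk)

print(rows, total)
```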
ChatGPT for Excel and Power BI
Two approaches work for Excel:
1. Upload the .xlsx file directly. ChatGPT reads multi-sheet workbooks. Reference specific sheet names in prompts: “analyze the ‘Sales’ sheet, not ‘Summary’.”
2. Use the Excel add-in (launched 2025). Query your spreadsheet data directly from within Excel — no file export required.
Common Excel tasks ChatGPT handles well:
- Translating VLOOKUP/XLOOKUP logic into pandas merge operations
- Generating SUMIF/SUMIFS equivalents as groupby aggregations
- Building pivot table logic in Python for datasets too large for Excel
- Explaining complex nested formulas in plain language
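To make the SUMIFS translation concrete, here is a sketch of the mapping on toy data (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["East", "East", "West"],
    "product": ["A", "B", "A"],
    "sales": [100, 200, 300],
})

# Excel: =SUMIFS(sales, region, "East")  ->  boolean mask + sum
east_total = df.loc[df["region"] == "East", "sales"].sum()

# SUMIFS repeated for every region at once  ->  a single groupby
by_region = df.groupby("region")["sales"].sum()
print(east_total, by_region.to_dict())
```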
For Power BI: ChatGPT helps write DAX measures, debug M query errors, and optimize data model relationships — but it cannot connect to your Power BI workspace directly.
Data Privacy: What NOT to Upload to ChatGPT
Most ChatGPT tutorials skip this section entirely. Getting it wrong has real legal and business consequences.
The Scale of the Problem
A 2025 study found 34.8% of employee ChatGPT inputs contained sensitive data — up from 11% in 2023. The highest-risk categories:
- PII (names, email, phone, addresses) — GDPR Article 5 violations
- PHI (patient records, diagnoses) — HIPAA violations on non-Enterprise plans
- Financial projections, M&A plans — trade secret exposure
- Source code — the Samsung incident (2023) is the landmark case: proprietary semiconductor code was uploaded, potentially entered training data, and led to a company-wide ChatGPT ban
OpenAI’s Data Retention by Plan
| Plan | Used for training | How to opt out |
|---|---|---|
| Free / Plus | Yes, by default | Manual toggle in settings |
| Team (Business) | No by default | Off by default |
| Enterprise | No by default | Off by default |
| API | No by default | Per data processing agreement |
The key rule: if you are on Free or Plus and have not manually disabled training data use in settings, your uploaded data may be used to train future models.
Healthcare Data: HIPAA Hard No
Free, Plus, and Team tiers are not HIPAA-compliant. Only ChatGPT Enterprise with a signed Business Associate Agreement (BAA) meets healthcare data requirements. Uploading patient data on any other plan is a potential HIPAA violation.
How to Sanitize Data Before Uploading
If you need ChatGPT to analyze real data patterns, anonymize first:
```python
import hashlib
import pandas as pd

def pseudonym(value: str) -> int:
    # Note: Python's built-in hash() is randomized per process, so pseudonyms
    # would change between runs; a stable digest keeps them consistent
    return int(hashlib.sha256(str(value).encode()).hexdigest(), 16) % 10000

# Replace PII before uploading
df["customer_name"] = df["customer_name"].apply(lambda x: f"Customer_{pseudonym(x)}")
df["email"] = df["email"].apply(lambda x: f"user_{pseudonym(x)}@example.com")
df["phone"] = "[REDACTED]"

# Round financial figures to remove exact proprietary data
df["revenue"] = df["revenue"].round(-3)  # round to nearest thousand
```
For regulated industries: use abstracted sample data that mirrors your real data’s structure without containing actual records. ChatGPT can find patterns in representative samples.
Shadow AI Risk
49% of workers report using AI tools at work without IT approval — often sharing sensitive data with free ChatGPT. If your organization handles regulated data (financial, health, legal), establish a clear policy before widespread adoption. Italian regulators fined OpenAI €15 million in December 2024 for GDPR violations — enforcement is real.
Real Limitations You Need to Know
Hallucinations in numerical work: The ~1.5% overall hallucination rate understates the risk in data contexts. The dangerous pattern is confident-looking wrong answers in multi-step numerical calculations — not obvious errors. Always use Prompt 20 (edge case checking) and verify key findings against your source data.
Context window vs. file size in practice: A 50 MB CSV often exceeds what’s usable within the 128K context. For large datasets, upload a representative sample (1,000–50,000 rows) or pre-aggregate before uploading.
Session impermanence: The Python sandbox resets when you close a chat. Projects (Plus and above) add memory, but ChatGPT decides what to retain — you cannot control it fully.
Compute limits: Complex ML training and neural network fine-tuning will time out. For anything beyond standard scikit-learn models, you need a local environment or cloud notebook.
Best Alternatives for Data Analysis
Julius AI — Purpose-Built for Data Work
Julius AI was built specifically for data analysis. The practical differences:
- Notebooks: Save an analysis workflow once, upload new data next month, and it re-runs automatically. ChatGPT has nothing equivalent.
- Live database connections (Pro plan): Direct Snowflake, BigQuery, and PostgreSQL connectors — no file export cycle.
- Statistical outputs: R², p-values, and confidence intervals appear directly in the UI without writing code.
- Scheduled refreshes: Workflows can auto-run on a schedule.
For BI analysts running the same analysis repeatedly, Julius AI’s workflow automation is significantly more productive than re-prompting ChatGPT from scratch each time.
Claude — Better for Large, Complex Datasets
Claude’s 200K+ token context window outperforms ChatGPT when your data or analysis is large. It produces cleaner Python code with fewer iteration cycles, outputs properly formatted .xlsx files (ChatGPT defaults to CSV), and handles multi-document synthesis better. For investment-grade analysis or cross-referencing multiple large reports simultaneously, Claude is the stronger choice.
Build the Underlying Skills
Understanding what ChatGPT’s generated code actually does makes you dramatically better at catching its mistakes. DataCamp’s Python for Data Analysis and SQL Fundamentals tracks are well-suited for data professionals who want to work more effectively alongside AI tools — not be dependent on them.
FAQ
Can ChatGPT analyze large datasets?
Spreadsheets are capped at approximately 50 MB. For larger datasets, sample down to 10,000–50,000 representative rows, or pre-aggregate in your database first.
Is ChatGPT free for data analysis?
The free tier has very limited file upload access. For reliable data analysis, ChatGPT Plus ($20/month) includes full Advanced Data Analysis with GPT-4o.
Can ChatGPT connect to my database directly?
No — it requires file uploads. For direct database connections, Julius AI (Pro plan) connects to Snowflake, BigQuery, and PostgreSQL.
How accurate is the analysis?
Accurate enough for exploration and prototyping. Not production-reliable without human review. The most dangerous failure mode is confident-looking wrong numerical answers, not obvious errors.
Should I upload customer data to ChatGPT?
Not to the free or Plus tier without opting out of training data use. For any PII or regulated data, use at minimum ChatGPT Team (Business). For healthcare data, only ChatGPT Enterprise with a signed BAA is compliant.
What’s the difference between ChatGPT and Code Interpreter?
Code Interpreter was the original name (2023). It became Advanced Data Analysis, then became part of standard GPT-4o. Same feature, different era.
Can ChatGPT replace a data analyst?
No — it replaces specific tasks: boilerplate code, standard charts, SQL translation. It does not replace domain judgment, data strategy, stakeholder communication, or verifying whether outputs make business sense.
Conclusion
ChatGPT is most valuable as a productivity multiplier for data professionals who already understand their domain. The 20 prompts above cover the routine 80% of data analysis work — EDA, cleaning, SQL, visualization, and reporting — where ChatGPT reliably saves hours.
Where it breaks down: large datasets, complex ML, guaranteed numerical accuracy, and anything requiring real-time or live data connections.
For your next project:
- Start with Prompt 1 (full EDA sweep) to understand your dataset fast
- Always append Prompt 20 (edge case checking) to any numerical calculation
- Anonymize sensitive data before uploading — even to ChatGPT Plus
- If you run repeatable workflows, test Julius AI alongside ChatGPT
- To build the foundation that makes AI-assisted analysis more effective, DataCamp’s Python and SQL tracks are worth the investment
Disclosure: links on this page go through a /go/ redirect for click tracking. Some may be affiliate links that support this site at no extra cost to you.