How ChatGPT Works for Advanced Data Analysis: Upload, Clean, Model, and Visualize
- Graziano Stefanelli
- May 31
- 4 min read
Write plain-language commands—ChatGPT turns each request into Python code that runs in the background.
Keep files in an isolated workspace—Uploaded data stays inside a restricted container with no external network access.
Iterate through analysis—Clean, model, and visualise data step-by-step in the same chat.
The runtime emits plain Python that imports common data-science libraries (pandas, numpy, matplotlib, scipy, statsmodels, scikit-learn) to read files, clean tables, calculate statistics, fit lightweight machine-learning models, and save images or outputs to the scratch folder. The code is procedural, single-file, network-isolated, and designed to finish within a few minutes of CPU time.
Let's explore how it all works, topic by topic:

1 │ Intent Detection and Planning
When you type a request such as “Plot revenue growth by region and run a linear regression”, ChatGPT classifies the individual tasks hidden in that sentence. Internally it builds a small action graph:
[load file] ─▶ [group data by region] ─▶ [plot line chart] ─▶ [fit regression]
Each node is decorated with the libraries it will need (pandas for grouping, matplotlib for plotting, statsmodels for regression) plus the expected inputs and outputs. This graph becomes the blueprint for the code it will emit in later steps.
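One way to picture that plan is as a simple list of annotated steps. The structure below is purely illustrative; the real internal representation is not public:

```python
# Hypothetical sketch of the action graph; step and library names are illustrative
plan = [
    {"step": "load file",       "lib": "pandas",      "output": "df"},
    {"step": "group by region", "lib": "pandas",      "output": "df_region"},
    {"step": "plot line chart", "lib": "matplotlib",  "output": "chart.png"},
    {"step": "fit regression",  "lib": "statsmodels", "output": "model"},
]

for node in plan:
    print(f"{node['step']:>16} -> {node['lib']}")
```

Each dictionary corresponds to one node in the graph, carrying the library it needs and the output it hands to the next step.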
If the model sees ambiguities—say, no file name was given—it responds with a follow-up question rather than guessing and risking an error.

2 │ File Upload and Secure Sandbox
Your uploaded files land in a scratch directory (/mnt/data) inside a containerised runtime.
The container cannot reach the public internet or your local machine.
Memory and CPU are throttled to keep runaway jobs from monopolising hardware.
A file quota (≈120 MB per run) prevents “data exfiltration by oversize dump” attacks.
When the conversation ends, the whole container—including temporary code, caches, and data—is destroyed, so nothing persists unnoticed.
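A guard like the file quota can be sketched in a few lines. `MAX_UPLOAD_BYTES` and `check_upload` are illustrative names, not the platform's actual implementation:

```python
import os

MAX_UPLOAD_BYTES = 120 * 1024 * 1024  # approximate per-run quota (assumption)

def check_upload(path):
    """Reject uploads that exceed the sandbox's size quota."""
    size = os.path.getsize(path)
    if size > MAX_UPLOAD_BYTES:
        raise ValueError(f"{path} exceeds the {MAX_UPLOAD_BYTES}-byte quota")
    return size
```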
3 │ Automatic Code Generation
With the plan from Section 1, ChatGPT now writes Python:
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Load the uploaded workbook and aggregate revenue by region
df = pd.read_excel('/mnt/data/sales.xlsx')
df_region = df.groupby('Region')['Revenue'].sum().reset_index()

# Plot the aggregated revenue
plt.plot(df_region['Region'], df_region['Revenue'])
plt.title('Revenue by Region')

# Fit a simple OLS regression of revenue against a numeric region index
df_region['Region_Index'] = range(len(df_region))
X = sm.add_constant(df_region['Region_Index'])
y = df_region['Revenue']
model = sm.OLS(y, X).fit()
print(model.summary())
A small kernel executes this snippet, and any standard output or generated figures are captured. If an exception is raised, the traceback is returned to the language model so it can self-debug (often by rewriting the code with a fix) or ask you for missing information.
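That execute-and-self-debug loop can be sketched roughly as follows; `run_with_retry` and the `fixer` callback are illustrative stand-ins for the real kernel and language model:

```python
import traceback

def run_with_retry(source, fixer=None, max_attempts=2):
    """Execute generated code; on failure, pass the traceback to a fixer
    (standing in for the language model) and retry with the rewritten code."""
    for attempt in range(max_attempts):
        namespace = {}
        try:
            exec(source, namespace)
            return namespace
        except Exception:
            tb = traceback.format_exc()
            if fixer is None or attempt == max_attempts - 1:
                raise
            source = fixer(source, tb)  # the model rewrites the code from the traceback

# Example: the "model" fixes a divide-by-zero on the second attempt
result = run_with_retry("x = 1 / 0", fixer=lambda src, tb: "x = 0")
```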
4 │ Data-Wrangling Layer
Once the raw file is in a pandas DataFrame, the model can carry out every staple of extract-transform-load (ETL) work:
Cleaning — remove empty rows, convert string “ 1,234 ” to numeric 1234, standardise timestamps.
Reshaping — melt, pivot, stack to move between wide and long formats.
Joins/Merges — SQL-style left/right/inner joins across multiple tables.
Window Operations — rolling averages, cumulative sums, lagged values for time-series.
Vectorised math — element-wise arithmetic that is far faster than Python loops.
Because everything is executed in one process, intermediate results stay in memory, so chained operations are cheap (“Take the grouped data from two lines ago; now bucket it into quartiles”).
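A condensed example of the cleaning and in-memory chaining described above, using invented sample data:

```python
import pandas as pd

# Invented sample with the messiness described above
df = pd.DataFrame({
    "Region":  ["North", "North", "South", None],
    "Revenue": [" 1,234 ", "2,000", "950", None],
})

# Cleaning: drop empty rows, strip whitespace and thousands separators, cast to numeric
df = df.dropna()
df["Revenue"] = (
    df["Revenue"].str.strip().str.replace(",", "", regex=False).astype(int)
)

# Chained aggregation: the intermediate DataFrame stays in memory
totals = df.groupby("Region")["Revenue"].sum()
print(totals)
```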
5 │ Computation and Modelling
The same environment supports statistical inference and machine-learning routines up to medium scale:
Descriptive stats: means, percentiles, skew/kurtosis.
Hypothesis tests: t-tests, ANOVA, chi-square, Kolmogorov-Smirnov (KS).
Predictive models: linear/logistic regression, random forest, gradient boosting, K-Means, DBSCAN.
Time-series: ARIMA, SARIMAX, Holt-Winters, Prophet.
Cross-validation and model diagnostics: R², RMSE, classification reports, residual plots.
Heavy GPU work (deep neural nets, multimillion-row training sets) is intentionally out of scope—the sandbox is CPU-only and time-boxed to a few minutes—so the focus is rapid exploratory analysis rather than production-grade ML pipelines.
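A minimal sketch of this kind of CPU-scale modelling, with invented data standing in for an uploaded table:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

# Invented monthly revenue with a linear trend plus noise
rng = np.random.default_rng(0)
months = np.arange(24).reshape(-1, 1)
revenue = 100 + 5 * months.ravel() + rng.normal(0, 3, size=24)

model = LinearRegression().fit(months, revenue)
pred = model.predict(months)

print(f"slope = {model.coef_[0]:.2f}")
print(f"R^2   = {r2_score(revenue, pred):.3f}")
print(f"RMSE  = {mean_squared_error(revenue, pred) ** 0.5:.2f}")
```

This is the exploratory scale the sandbox targets: a fit plus diagnostics in well under a second of CPU time.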
6 │ Visualisation Pipeline
matplotlib is the default because it integrates well with the headless server. Typical flow:
fig, ax = plt.subplots()
ax.bar(years, revenue, width=0.5)
ax.set_xlabel('Year'); ax.set_ylabel('Revenue ($M)')
fig.savefig('/mnt/data/revenue_bar.png', dpi=120)
The saved PNG is streamed back and displayed in-chat. If you ask for interactivity, ChatGPT can switch to plotly and embed an HTML chunk; for publication-quality graphics it can alter resolution, fonts, and aspect ratios. You can also instruct it to bundle multiple figures into a single PowerPoint or PDF—those files become downloadable links.
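Bundling several figures into one document can be done with matplotlib's `PdfPages`; the file name and revenue figures below are invented:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, as in the sandbox
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

years = [2021, 2022, 2023]
revenue = [1.2, 1.5, 1.1]  # invented figures, $M

with PdfPages("report.pdf") as pdf:  # illustrative output path
    fig, ax = plt.subplots()
    ax.bar(years, revenue)
    ax.set_title("Revenue by Year")
    pdf.savefig(fig)
    plt.close(fig)

    fig, ax = plt.subplots()
    ax.plot(years, revenue, marker="o")
    ax.set_title("Revenue Trend")
    pdf.savefig(fig)
    plt.close(fig)
```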
7 │ Dialogue-Driven Iteration
Because the model remembers the execution context during the session, you can iterate naturally:
“Great—now colour the bars red if revenue fell versus the prior year.”
ChatGPT pulls the existing years and revenue arrays from RAM, writes a new conditional colouring loop, updates the PNG, and posts it. This conversational loop replaces the back-and-forth edit-run cycle of a traditional notebook while keeping the same expressiveness.
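The conditional-colouring step from that follow-up might look like this (data invented):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, as in the sandbox
import matplotlib.pyplot as plt

years = [2021, 2022, 2023, 2024]
revenue = [1.2, 1.5, 1.1, 1.4]  # invented figures, $M

# Red where revenue fell versus the prior year; the first bar has no prior year
colors = ["grey"] + [
    "red" if curr < prev else "grey"
    for prev, curr in zip(revenue, revenue[1:])
]

fig, ax = plt.subplots()
ax.bar(years, revenue, color=colors)
ax.set_xlabel("Year")
ax.set_ylabel("Revenue ($M)")
fig.savefig("revenue_bar.png", dpi=120)  # illustrative output path
```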
8 │ Limits and Safeguards
Resource caps — hard wall on RAM, CPU seconds, and file size; jobs that exceed limits return an error.
No external secrets — environment variables and network sockets are stripped.
Stateless after shutdown — once you close or reset the chat, every file and variable is wiped.
Opt-in tier — Advanced Data Analysis lives behind the Plus/Enterprise paywall so free-tier users cannot accidentally run code.
Moderation layer — outbound results are scanned to block disallowed content (malware, PII leaks).
These guardrails let the system offer real Python power while remaining safe for both the user and the platform.
DATA STUDIOS