Neural Tech Daily

Pandas Data Wrangling in 60 Minutes: From CSV to a Clean DataFrame

A 60-minute pandas walkthrough: read CSV / Excel / JSON, coerce dtypes, handle missing data, group-by + agg, merge, resample, and plot — using pandas 3.0.


What you’ll need

Pandas is the canonical Python library for tabular data: read a file into a DataFrame, clean it, reshape it, and hand it off to a model, a dashboard, or a SQL warehouse. The current stable release is pandas 3.0.1, dated 17 February 2026 per the project’s own release listing. 1 This tutorial is a 60-minute walk from a raw CSV to a clean, typed, aggregated, plotted DataFrame. It assumes working Python, a virtualenv, and Jupyter or any REPL.

The cited pandas user-guide sections frame the workflow as nine discrete moves: load, inspect, coerce dtypes, deal with missing values, filter and select, group and aggregate, merge with another frame, resample on time, and plot. We follow that order, then finish by writing the result back out. The 3.0 release also shipped two behavioural shifts that change muscle memory from the 2.x era: a dedicated str dtype by default, and copy-on-write semantics on every indexing operation. 2 The examples here use the new defaults rather than the 2.x patterns that may show up in older blog posts.


Image: pandas.pydata.org landing page, used for editorial coverage of the library covered in this tutorial.

Prerequisites:

  • Python 3.11 or newer (pandas 3.0 bumped the floor from 3.9). 3
  • A virtualenv. Conda or uv work; steps are identical.
  • Optional: Jupyter Lab or a notebook environment if you want inline output and plots.
  • A sample CSV. We use a 10,000-row e-commerce orders file with columns order_id, customer_id, order_date, country, category, unit_price, quantity, discount_pct, payment_method. Substitute your own if you have one.

Install pandas and matplotlib. The [performance] extra brings in optional accelerators (numexpr, bottleneck); install it if you plan to work with frames above a few million rows.

python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install 'pandas[performance]' matplotlib jupyterlab
python -c "import pandas as pd; print(pd.__version__)"

If the printed version is below 3.0, your environment is pinned to an old release somewhere; run pip install --upgrade pandas and re-check before continuing.

Step 1: read the file

pd.read_csv() is the workhorse. The defaults handle most well-formed CSVs, but four parameters earn their keep on real-world data: dtype (per-column type hints), parse_dates (ISO-format date columns), na_values (project-specific sentinels for missingness like -999 or N/A), and usecols (skip columns you don’t need). 4

import pandas as pd

df = pd.read_csv(
    "orders.csv",
    dtype={"order_id": "str", "customer_id": "str", "payment_method": "category"},
    parse_dates=["order_date"],
    na_values=["", "N/A", "-999"],
    usecols=["order_id", "customer_id", "order_date", "country",
             "category", "unit_price", "quantity", "discount_pct",
             "payment_method"],
)

A few notes. Casting IDs to str is the single most common defence against pandas silently turning a numeric-looking ID into an int64 and stripping leading zeros. Casting low-cardinality categorical columns like payment_method to category cuts memory use significantly when the column has many fewer unique values than rows; exact savings depend on the column’s cardinality and average string length. Specifying parse_dates at read time keeps the conversion in one place and saves a separate pass over the column afterwards.
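A minimal sketch of the leading-zero problem, assuming a hypothetical three-row sample in place of orders.csv (the in-memory buffer and the sample values are illustrative only):

```python
import io

import pandas as pd

# Hypothetical sample standing in for orders.csv
raw = "order_id,payment_method\n00123,card\n00456,cash\n00789,card\n"

# Without hints, pandas infers integers and the leading zeros are lost
inferred = pd.read_csv(io.StringIO(raw))

# With dtype hints, IDs stay as text and payment_method becomes a category
typed = pd.read_csv(
    io.StringIO(raw),
    dtype={"order_id": "str", "payment_method": "category"},
)

print(inferred["order_id"].tolist())
print(typed["order_id"].tolist())
print(typed["payment_method"].dtype)
```

The first print shows plain integers; the second shows the IDs exactly as they appear in the file.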

For Excel and JSON the equivalents are pd.read_excel("file.xlsx", sheet_name=0) and pd.read_json("file.json", lines=True) (use lines=True for newline-delimited JSON, the standard log-shipping format). 5
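As a rough illustration of lines=True, the sketch below reads two newline-delimited JSON records from an in-memory buffer. The records and field values are made up for the example; a real log file would be passed by path instead.

```python
import io

import pandas as pd

# Hypothetical newline-delimited JSON: one complete object per line
ndjson = (
    '{"order_id": "A1", "quantity": 2}\n'
    '{"order_id": "A2", "quantity": 5}\n'
)

# lines=True tells read_json to parse each line as its own record
events = pd.read_json(io.StringIO(ndjson), lines=True)
print(events)
```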

Step 2: inspect what you got

Four lines surface most schema problems before you start wrangling.

df.shape          # (rows, columns)
df.head(3)        # first three rows as a DataFrame
df.dtypes         # column → dtype mapping
df.describe(include="all")  # summary stats, all columns

df.info() is the densest single command: row count, column dtypes, non-null counts, memory usage. Read its output before writing any transformation code. If non-null count for a column equals row count, that column has no missing values; if it’s lower, you have a missingness pattern to deal with in Step 4.
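The same comparison that df.info() invites by eye can be done programmatically. This is a small sketch on a hypothetical three-row frame; df.count() returns the non-null count per column, which is then compared with the row count.

```python
import pandas as pd

# Hypothetical frame with one gap in unit_price
df = pd.DataFrame(
    {"order_id": ["a1", "a2", "a3"], "unit_price": [9.5, None, 3.0]}
)

non_null = df.count()                      # non-null count per column
incomplete = non_null[non_null < len(df)]  # columns with at least one gap
print(incomplete)
```

Any column listed in incomplete is a candidate for the missing-data handling in Step 4.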


Image: pandas user guide — IO tools, used for editorial coverage of the readers covered in this step.

Step 3: coerce and clean dtypes

The default dtype that read_csv infers is usually right for numerics and dates but often wrong for IDs, codes, and currency strings. Two patterns cover the cases this tutorial cares about.

Stringly-typed numerics. A column read as str (object in pandas 2.x) because it contains stray text needs pd.to_numeric with errors="coerce" to turn unparseable values into NaN rather than raising.

df["unit_price"] = pd.to_numeric(df["unit_price"], errors="coerce")
df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce").astype("Int64")

Int64 (capital I) is pandas’ nullable integer dtype. Unlike NumPy’s int64, it can hold missing values (displayed as <NA>, i.e. pd.NA) alongside integers, which makes it the correct representation when an integer column has any missing values at all.
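To make the coercion concrete, here is a small sketch on a hypothetical quantity column containing one stray text value:

```python
import pandas as pd

# Hypothetical raw column: three parseable values and one stray token
raw = pd.Series(["3", "7", "n/a", "12"])

# Unparseable entries become missing instead of raising
quantity = pd.to_numeric(raw, errors="coerce").astype("Int64")

print(quantity)
print(quantity.dtype)
```

The stray value ends up as a missing entry while the rest stay whole integers. One caveat worth checking on your own data: the cast to Int64 is expected to raise rather than truncate if a parsed value has a fractional part such as 1.5.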

Datetimes. If you skipped parse_dates at read time, convert now.

df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

In pandas 3.0, parsing strings to datetimes defaults to microsecond resolution (datetime64[us]) rather than nanoseconds, and timezone handling uses the standard library’s zoneinfo rather than pytz. 6 If you previously hard-coded datetime64[ns] in dtype maps, expect the dtype string to read differently in 3.0; the behaviour is the same for the operations in this tutorial.
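A quick way to see what your installed version does is to parse a few strings and print the resulting dtype. The sample values below are hypothetical; the second one is an impossible calendar date and the third is not a date at all.

```python
import pandas as pd

# Hypothetical raw dates, two of which cannot be parsed
raw = pd.Series(["2026-01-15", "2026-02-30", "not a date"])

order_date = pd.to_datetime(raw, errors="coerce")

print(order_date.dtype)            # resolution depends on the pandas version
print(order_date.isna().tolist())  # which rows failed to parse
```

The printed dtype is the thing to compare against any hard-coded dtype strings in older code.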

Step 4: handle missing data

Pandas represents missing values as NaN for float columns, NaT for datetimes, and pd.NA for nullable extension dtypes such as Int64. The user guide lists four strategies and the trade-off each one makes: drop, fill with a constant, fill with a column statistic, or fill from another row. 7

df.isna().sum()             # count NaN per column
df.isna().mean() * 100      # percentage NaN per column

df = df.dropna(subset=["order_id", "order_date"])  # drop rows missing keys

df["discount_pct"] = df["discount_pct"].fillna(0)  # absence means no discount

df["unit_price"] = df["unit_price"].fillna(
    df.groupby("category")["unit_price"].transform("median")
)

The last line is the workhorse pattern: for each row with a missing unit_price, fill from the median of that row’s category rather than a global median. transform("median") returns a Series aligned to the original index, so the assignment lines up row-for-row.

For time-series gaps where the previous observation is the best guess, df.ffill() carries the last valid value forward. For interpolation, df.interpolate(method="linear") or method="time" if the index is a DatetimeIndex. Choose the strategy that matches what missingness actually means in the dataset; never default to fillna(0) for a column where zero has business meaning different from “we don’t know”.
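The difference between carrying forward and interpolating is easiest to see side by side. This sketch uses a hypothetical five-day series with a two-day gap; the names readings, carried, and interpolated are illustrative.

```python
import pandas as pd

# Hypothetical daily series with a two-day gap in the middle
idx = pd.date_range("2026-01-01", periods=5, freq="D")
readings = pd.Series([1.0, None, None, 4.0, 6.0], index=idx)

carried = readings.ffill()                          # repeat the last valid value
interpolated = readings.interpolate(method="time")  # straight line across the gap

print(pd.DataFrame({"raw": readings, "ffill": carried, "time": interpolated}))
```

With ffill the gap is filled with the value from 1 January; with time-based interpolation the gap is filled with evenly spaced values between the two known neighbours.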


Image: pandas user guide — Working with missing data, used for editorial coverage of the strategies covered in this step.

Step 5: filter and select

Three patterns cover most of the row-and-column subsetting work.

Boolean masking. Build a mask with comparison operators, then index the frame with it. Parentheses around each clause are non-optional because & and | bind tighter than > or ==.

mask = (df["unit_price"] > 0) & (df["quantity"] >= 1)
clean = df[mask]

Label-based access with .loc[]. Select rows by label or boolean mask, and columns by name. The right-hand argument can be a list of columns or a single string.

clean = df.loc[mask, ["order_id", "order_date", "unit_price", "quantity"]]

Query strings with .query(). When the condition gets long, .query() reads better than nested booleans. The string is parsed as an expression at runtime, so column names that are not valid Python identifiers (names with spaces, for example) need backticks.

clean = df.query("unit_price > 0 and quantity >= 1 and country == 'IN'")
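For the backtick case, a short sketch on a hypothetical frame whose price column has a space in its name:

```python
import pandas as pd

# Hypothetical frame; "unit price" is not a valid Python identifier
df = pd.DataFrame(
    {
        "unit price": [0.0, 4.5, 9.0],
        "quantity": [1, 0, 3],
    }
)

# Backticks let query() refer to the column with a space in its name
clean = df.query("`unit price` > 0 and quantity >= 1")
print(clean)
```

Only the last row passes both conditions.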

A 3.0 behaviour note: under copy-on-write semantics every subset or filter behaves as a copy, which means chained assignment like df[df.x > 0]['y'] = 1 no longer modifies df, and the old SettingWithCopyWarning has been removed. 8 Assign through .loc instead: df.loc[df.x > 0, "y"] = 1.
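A minimal sketch of the two safe patterns, on a hypothetical frame with columns x and y. Both should behave the same on 2.x and 3.0: write through .loc when you mean to change the original, and take an explicit copy when you mean to work on a subset.

```python
import pandas as pd

df = pd.DataFrame({"x": [-1, 2, 3], "y": [0, 0, 0]})

# Single-step assignment through .loc modifies df itself
df.loc[df["x"] > 0, "y"] = 1

# An explicit copy makes it clear the subset is independent of df
positive = df[df["x"] > 0].copy()
positive["y"] = 99

print(df)
print(positive)
```

After this runs, df carries the .loc update but none of the changes made to positive.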

Step 6: group and aggregate

Group-by is pandas’ split-apply-combine engine. The split step partitions rows by one or more keys; the apply step runs a reduction on each partition; the combine step stitches the per-group results into a single frame. 9

# Total revenue by category
df["revenue"] = df["unit_price"] * df["quantity"] * (1 - df["discount_pct"] / 100)
df.groupby("category")["revenue"].sum()

# Multiple stats at once
df.groupby("category").agg(
    revenue_total=("revenue", "sum"),
    revenue_mean=("revenue", "mean"),
    order_count=("order_id", "count"),
    customers=("customer_id", "nunique"),
)

# Group by two keys
df.groupby(["country", "category"])["revenue"].sum().unstack()

The named-aggregation syntax in the second example (name=("column", "function")) is the cleanest way to get multiple stats with predictable column names; the older df.groupby("category").agg(["sum", "mean"]) form returns MultiIndex columns, which are harder to work with downstream.
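The difference in column layout can be seen on a tiny frame. The category names and revenue figures below are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {"category": ["toys", "toys", "books"], "revenue": [10.0, 30.0, 5.0]}
)

# List-of-functions form: columns come back as (column, function) pairs
stacked = df.groupby("category")[["revenue"]].agg(["sum", "mean"])

# Named aggregation: flat columns with the names you chose
named = df.groupby("category").agg(
    revenue_total=("revenue", "sum"),
    revenue_mean=("revenue", "mean"),
)

print(stacked.columns.tolist())
print(named.columns.tolist())
```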

transform vs apply. transform returns a Series aligned to the original frame’s index, useful for filling group-aware values like the median imputation in Step 4. apply returns whatever the function returns (a Series, DataFrame, or scalar) and is the slower, more flexible escape hatch.
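A short sketch of the shape difference, using a hypothetical three-row frame. transform hands back one value per original row; apply with a custom function here hands back one value per group.

```python
import pandas as pd

df = pd.DataFrame(
    {"category": ["toys", "toys", "books"], "revenue": [10.0, 30.0, 5.0]}
)

# transform: same length and index as df, so it can be used row-for-row
group_total = df.groupby("category")["revenue"].transform("sum")
df["share_of_category"] = df["revenue"] / group_total

# apply: runs an arbitrary function per group; this one returns a scalar
spread = df.groupby("category")["revenue"].apply(lambda s: s.max() - s.min())

print(df)
print(spread)
```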

Step 7: merge with another frame

pd.merge is the SQL-style join. The four join kinds (inner, left, right, outer) match SQL semantics one-for-one. 10

customers = pd.read_csv("customers.csv",
                        dtype={"customer_id": "str"},
                        parse_dates=["signup_date"])

enriched = df.merge(
    customers[["customer_id", "signup_date", "segment"]],
    on="customer_id",
    how="left",
    validate="m:1",
)

The validate="m:1" argument tells pandas that the join is many-to-one: many orders per customer, one customer row per ID. If the assumption is violated (a duplicate customer_id in the customers frame), pandas raises rather than silently producing a row-multiplied result. Use validate on every merge in a production pipeline; the cost is one uniqueness check on the join key, and the debugging time it saves is measured in hours.
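To see the guard fire, the sketch below builds a hypothetical customers frame with a duplicated customer_id and attempts the same many-to-one merge:

```python
import pandas as pd

orders = pd.DataFrame(
    {"order_id": ["o1", "o2", "o3"], "customer_id": ["c1", "c1", "c2"]}
)

# Hypothetical lookup table with c2 accidentally listed twice
customers = pd.DataFrame(
    {"customer_id": ["c1", "c2", "c2"], "segment": ["retail", "wholesale", "retail"]}
)

error_message = None
try:
    orders.merge(customers, on="customer_id", how="left", validate="m:1")
except pd.errors.MergeError as exc:
    # Raised because the right-hand keys are not unique
    error_message = str(exc)

print(error_message)
```

Without validate, the same merge would quietly return four rows for three orders.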

For stacking frames with the same schema along the row axis, use pd.concat([df1, df2], ignore_index=True). The ignore_index=True resets the index to a fresh 0..N-1 range; without it the result keeps both frames’ original indices, which often produces duplicates.
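A small sketch of the index behaviour, with two hypothetical monthly frames:

```python
import pandas as pd

jan = pd.DataFrame({"order_id": ["o1", "o2"]})
feb = pd.DataFrame({"order_id": ["o3", "o4"]})

kept = pd.concat([jan, feb])                      # original indices preserved
fresh = pd.concat([jan, feb], ignore_index=True)  # new 0..N-1 index

print(kept.index.tolist())
print(fresh.index.tolist())
```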

Pandas 3.0 added anti-joins (how="left_anti", how="right_anti") to the merge surface, the equivalent of SQL NOT EXISTS, useful when you want orders from customers who don’t appear in the customers table, or vice versa. 11
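If you need the same result on an environment that predates the anti-join options, one common workaround is a left merge with indicator=True followed by a filter. This is a swapped-in technique rather than the 3.0 API; the frames below are hypothetical, and the customers frame is assumed to be unique on customer_id.

```python
import pandas as pd

orders = pd.DataFrame(
    {"order_id": ["o1", "o2", "o3"], "customer_id": ["c1", "c2", "c9"]}
)
customers = pd.DataFrame({"customer_id": ["c1", "c2"]})

# indicator=True adds a _merge column saying where each row was found
flagged = orders.merge(customers, on="customer_id", how="left", indicator=True)

# Keep only the orders with no matching customer row
orphans = flagged.loc[flagged["_merge"] == "left_only", ["order_id", "customer_id"]]
print(orphans)
```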

Step 8: resample on time

When the frame has a DatetimeIndex, resample() is the time-aware group-by. The frequency strings are the offset aliases documented in the user guide; from pandas 2.2 onward, including 3.0, the month-end / quarter-end / year-end aliases are ME, QE, YE, replacing the older M, Q, Y, which were deprecated in 2.2. 12

df = df.set_index("order_date").sort_index()

daily = df["revenue"].resample("D").sum()
weekly = df["revenue"].resample("W").sum()
monthly = df["revenue"].resample("ME").sum()

# Multiple stats per period
df.resample("ME").agg(
    revenue=("revenue", "sum"),
    orders=("order_id", "count"),
    avg_order_value=("revenue", "mean"),
)

For sub-day frequencies (min, h, s), the same pattern applies. resample() accepts the same reduction methods as groupby (sum, mean, min, max, count, ohlc for open-high-low-close) and respects the closed and label parameters: closed controls which side of each interval is inclusive (left or right), and label controls which edge of the interval names the resulting row.
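The effect of closed is easiest to see with observations that sit exactly on bin edges. This sketch uses three hypothetical ticks at 00:00, 00:05, and 00:10 and resamples them into 10-minute bins both ways.

```python
import pandas as pd

idx = pd.to_datetime(
    ["2026-01-01 00:00", "2026-01-01 00:05", "2026-01-01 00:10"]
)
ticks = pd.Series([1, 2, 4], index=idx)

# Default for sub-day frequencies: left-closed bins such as [00:00, 00:10)
left = ticks.resample("10min").sum()

# Right-closed bins such as (00:00, 00:10], labelled by their right edge
right = ticks.resample("10min", closed="right", label="right").sum()

print(left)
print(right)
```

In the left-closed version the 00:10 tick starts a new bin; in the right-closed version it belongs to the bin that ends at 00:10, and the 00:00 tick falls into the bin before it.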


Image: pandas user guide — Group by, used for editorial coverage of the split-apply-combine pattern.

Step 9: plot

Pandas wraps matplotlib for quick exploratory plots. The .plot() accessor on a Series or DataFrame returns a matplotlib Axes object, which means anything matplotlib can do (titles, axis labels, formatters, second y-axes) works on the returned object. 13

import matplotlib.pyplot as plt

ax = monthly.plot(kind="line", figsize=(10, 4), title="Monthly revenue")
ax.set_ylabel("Revenue")
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("monthly_revenue.png", dpi=120)

kind accepts line, bar, barh, hist, box, kde, area, and pie on both Series and DataFrames, plus scatter and hexbin on DataFrames only. For a quick comparison across categories, df.groupby("category")["revenue"].sum().plot(kind="barh") is a one-liner. For exploratory work in Jupyter, %matplotlib inline (or the newer %matplotlib widget for interactive plots) handles display.

If a project graduates past exploration, swap to seaborn for statistical plots or plotly for interactive ones. Pandas’ built-in plotting is for the speed-of-thought phase, not for the final dashboard.

Step 10: write back out

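# order_date became the index in Step 8; reset_index() turns it back into
# a regular column so that index=False does not drop it on write
out = df.reset_index()

# CSV (most portable; loses dtype info)
out.to_csv("orders_clean.csv", index=False)

# Parquet (preserves dtypes; smaller files; needs pyarrow)
out.to_parquet("orders_clean.parquet", engine="pyarrow", compression="zstd")

# Excel (slow on large frames; needs openpyxl; use only when the consumer demands it)
out.to_excel("orders_clean.xlsx", index=False, sheet_name="Orders")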

For anything larger than a few hundred thousand rows, Parquet with zstd compression is the right default: it preserves dtypes (including the new str and nullable integer types), files are smaller, and reading back is dramatically faster than CSV.

Common pitfalls

Three failures show up repeatedly when teams move from script-level pandas to production:

Silent dtype regression. A column read as int64, exported to CSV, and re-read may come back as float64 if any row picked up a NaN between writes. Either use Parquet for round-trip persistence or explicitly cast on every re-read.
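A minimal sketch of the regression, round-tripping a hypothetical two-row frame through an in-memory CSV buffer instead of a file on disk:

```python
import io

import pandas as pd

df = pd.DataFrame(
    {
        "order_id": ["001", "002"],
        "quantity": pd.array([1, None], dtype="Int64"),
    }
)

buf = io.StringIO()
df.to_csv(buf, index=False)
csv_text = buf.getvalue()

# Re-read with no hints: types are inferred again from the text
naive = pd.read_csv(io.StringIO(csv_text))

# Re-read with explicit casts: the original types come back
restored = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"order_id": "str", "quantity": "Int64"},
)

print(naive.dtypes)
print(restored.dtypes)
```

The naive re-read turns quantity into float64 and strips the zeros from order_id; the explicit dtype map avoids both.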

Index confusion. Group-by results carry the group keys as the index; .reset_index() is often the missing piece before a downstream join or plot. If a merge fails with a cryptic “no common columns” error, check whether the keys are sitting in the index rather than as regular columns.

Memory on big frames. A 10-million-row frame with mostly-object columns can balloon past available RAM. The fix order, cheapest first: cast strings with low cardinality to category, cast 32-bit-fitting integers to int32, downcast float64 to float32 only when the precision loss is acceptable, and only then reach for Dask, Polars, or chunked reads. Profile with df.memory_usage(deep=True) before changing anything.
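As a rough sketch of the first two fixes, the example below builds a hypothetical 100,000-row frame, measures it, applies the category and int32 casts, and measures again. Actual savings will depend on your data and pandas version.

```python
import pandas as pd

# Hypothetical frame: a low-cardinality string column and small integers
df = pd.DataFrame(
    {
        "payment_method": ["card", "cash", "upi", "card"] * 25_000,
        "quantity": [1, 2, 3, 4] * 25_000,
    }
)

before = df.memory_usage(deep=True).sum()

slim = df.astype({"payment_method": "category", "quantity": "int32"})
after = slim.memory_usage(deep=True).sum()

print(f"before: {before:,} bytes")
print(f"after:  {after:,} bytes")
```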


Image: pandas user guide — Time series, used for editorial coverage of resampling and the current frequency aliases.

Where to go next

The pandas user guide is the canonical reference; each of the sections we walked through has its own deep dive (the “Working with missing data” page alone runs to several thousand words on imputation strategies). For working developers, the practical follow-up is wiring the cleaned frame into something downstream: a Streamlit dashboard for ad-hoc exploration, a dbt model for warehouse-side transformation, or a scikit-learn pipeline for modelling. Each of those workflows starts from a frame that looks like the one this tutorial produces.

Two ecosystem notes worth tracking. Polars (a Rust-backed DataFrame library with a pandas-adjacent API) is a serious alternative once frames push past 10-million rows and pandas starts feeling slow. The pandas project itself has been investing in PyArrow-backed dtypes (dtype_backend="pyarrow" on the readers) which close much of the speed gap for the workloads where Polars used to be the obvious choice. Test both on your own data before picking sides.

How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.

Sources consulted

Cited Sources

  1. pandas official site: current stable release listed as 3.0.1, dated 17 February 2026.
  2. pandas 3.0.0 release notes: dedicated str dtype default, copy-on-write semantics, datetime resolution inference, zoneinfo as default timezone backend.
  3. pandas 3.0.0 release notes: minimum supported Python raised to 3.11, minimum NumPy raised to 1.26.
  4. pandas user guide, IO tools: read_csv parameters dtype, parse_dates, na_values, usecols.
  5. pandas user guide, IO tools: read_excel sheet_name and read_json lines parameters.
  6. pandas 3.0.0 release notes: datetime parsing defaults to microsecond resolution, zoneinfo replaces pytz as default timezone backend.
  7. pandas user guide, Working with missing data: NaN, pd.NA, fillna, dropna, ffill, interpolate.
  8. pandas 3.0.0 release notes: copy-on-write semantics, SettingWithCopyWarning removed, chained assignment no longer mutates source.
  9. pandas user guide, Group by: split-apply-combine, named aggregation syntax, transform vs apply.
  10. pandas user guide, Merge, join, concatenate: how parameter accepts inner / left / right / outer; validate argument for relationship assertion.
  11. pandas 3.0.0 release notes: anti-join support added to merge (how=left_anti, how=right_anti).
  12. pandas user guide, Time series: resample method, frequency alias table including ME, QE, YE replacements for M, Q, Y.
  13. pandas user guide, Visualization with matplotlib: .plot accessor, kind parameter values, returned Axes object.
