Why Data Wrangling Comes First

Ask most people what data scientists do and they’ll describe modelling: training algorithms, building classifiers, running predictions. Ask a working data scientist what they actually spend their time on and the honest answer is almost always the same: cleaning, transforming, joining, validating, and reshaping data until it’s in a form that can be analysed.

The commonly cited figure, that 80% of data science work is data preparation, is directionally accurate. In banking and telco environments, where data comes from multiple legacy systems with inconsistent schemas, conflicting keys, and no single customer master, the proportion is often higher.

But the framing of data wrangling as the necessary drudgery before the interesting work begins is wrong. It’s not the price of admission. It’s the practice itself.

What data wrangling actually develops

When you spend serious time cleaning and preparing data, you develop something that cannot be built any other way: an intimate understanding of the dataset you’re working with.

You know where the nulls are and what they mean. You know which identifiers are reliable and which have collisions. You know which fields were populated by a human and which were populated by a system — and what class of errors each carries.
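
As a rough illustration of the profiling that builds this knowledge, the sketch below counts missing values per column and lists identifiers that appear on more than one row. The `customers` frame, its column names, and its values are invented for the example and do not come from any real system.

```python
import pandas as pd

# Hypothetical customer extract; columns and values are invented for illustration.
customers = pd.DataFrame({
    "customer_id": ["C001", "C002", "C002", "C003", "C004"],
    "name": ["Acme Ltd", "Birch Plc", "Birch PLC", None, "Dune Co"],
    "signup_date": ["2021-01-05", None, "2021-03-11", "2021-04-02", None],
})

# Where are the nulls? Count of missing values per column.
null_counts = customers.isna().sum()

# Which identifiers collide? IDs that appear on more than one row.
id_counts = customers["customer_id"].value_counts()
colliding_ids = id_counts[id_counts > 1].index.tolist()

print(null_counts)
print(colliding_ids)
```

The counts only tell you where to look. Working out what a null means, or why an identifier repeats, still takes reading the rows themselves.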

This knowledge is what separates the analyst who builds a model that works in production from the one who builds a model that works in a notebook and fails when it meets real data.

What tools to learn, and in what order

Data wrangling in Python means pandas, and specifically pandas at the level where you can reshape a dataset without looking up the syntax: groupby, merge, melt, pivot, apply, and map, plus string cleaning, date parsing, and handling mixed types.
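
To make a few of those operations concrete, here is a minimal sketch that takes a small wide table through melt, date parsing, merge, groupby, and pivot. The tables `usage_wide` and `segments`, and all the values in them, are hypothetical.

```python
import pandas as pd

# Hypothetical wide table: one row per customer, one column per month.
usage_wide = pd.DataFrame({
    "customer_id": ["C001", "C002"],
    "2024-01": [10, 5],
    "2024-02": [12, 7],
})
# Hypothetical lookup table mapping each customer to a segment.
segments = pd.DataFrame({
    "customer_id": ["C001", "C002"],
    "segment": ["retail", "business"],
})

# melt: wide -> long, one row per customer per month.
usage_long = usage_wide.melt(
    id_vars="customer_id", var_name="month", value_name="usage"
)

# Date parsing: turn the month labels into real timestamps.
usage_long["month"] = pd.to_datetime(usage_long["month"], format="%Y-%m")

# merge: attach the segment to each usage row.
enriched = usage_long.merge(segments, on="customer_id", how="left")

# groupby: total usage per segment.
totals = enriched.groupby("segment")["usage"].sum()

# pivot: long -> wide again, indexed by customer.
back_to_wide = usage_long.pivot(
    index="customer_id", columns="month", values="usage"
)

print(totals)
print(back_to_wide)
```

Going from wide to long and back again is a useful drill, because most reshaping problems are some combination of those two moves.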

It also means SQL — not just SELECT queries, but window functions, CTEs, aggregations across complex joins. In most commercial environments, the data lives in a database before it ever reaches Python.
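
A small example of the kind of query meant here: a CTE combined with a window function to pick each customer's most recent transaction. It is run through Python's built-in sqlite3 module purely so the sketch is self-contained. The `transactions` table and its rows are invented, and it assumes the bundled SQLite is version 3.25 or later, which is where window function support was added.

```python
import sqlite3

# In-memory database with a hypothetical transactions table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (customer_id TEXT, txn_date TEXT, amount REAL);
    INSERT INTO transactions VALUES
        ('C001', '2024-01-03', 50.0),
        ('C001', '2024-01-20', 75.0),
        ('C002', '2024-01-11', 20.0),
        ('C002', '2024-02-02', 30.0),
        ('C002', '2024-02-15', 10.0);
""")

# The CTE ranks each customer's transactions from newest to oldest;
# the outer query keeps only the newest one per customer.
query = """
WITH ranked AS (
    SELECT
        customer_id,
        txn_date,
        amount,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY txn_date DESC
        ) AS recency_rank
    FROM transactions
)
SELECT customer_id, txn_date, amount
FROM ranked
WHERE recency_rank = 1
ORDER BY customer_id;
"""
latest = conn.execute(query).fetchall()
conn.close()

print(latest)
```

Doing this in the database, rather than pulling every row into Python first, is usually the cheaper option when the table is large.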

And it means developing instincts: Does this distribution make sense? Why are there 17 spellings of this organisation’s name in the customer table? Why do the join keys match on one date and produce duplicates on another?
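
Two of those questions can be turned into quick checks. The sketch below, on invented data, first counts how many raw spellings collapse onto one normalised organisation key, then uses the `validate` argument of `merge` to fail loudly when a join key that should be unique is not. The normalisation rule is deliberately crude and is an assumption for illustration, not a recommended matching method.

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical accounts table with inconsistent organisation names.
accounts = pd.DataFrame({
    "org_name": ["Acme Ltd", "ACME LTD.", " acme ltd ", "Birch Plc"],
    "account_id": [1, 2, 3, 4],
})

# Crude normalisation: trim, lowercase, drop punctuation, collapse whitespace.
accounts["org_key"] = (
    accounts["org_name"]
    .str.strip()
    .str.lower()
    .str.replace(r"[^\w\s]", "", regex=True)
    .str.replace(r"\s+", " ", regex=True)
)

# How many distinct raw spellings map to each normalised key?
spellings_per_org = accounts.groupby("org_key")["org_name"].nunique()

# Hypothetical reference table that should have one row per org_key but does not.
org_master = pd.DataFrame({
    "org_key": ["acme ltd", "birch plc", "birch plc"],
    "region": ["north", "south", "east"],
})

# validate="many_to_one" raises MergeError if the right-hand keys are not unique.
try:
    accounts.merge(org_master, on="org_key", how="left", validate="many_to_one")
    join_is_clean = True
except MergeError:
    join_is_clean = False

# Without the check, the same join silently returns more rows than it was given.
unchecked = accounts.merge(org_master, on="org_key", how="left")

print(spellings_per_org)
print(join_is_clean, len(accounts), len(unchecked))
```

Comparing row counts before and after a join is the cheapest version of this check; `validate` simply makes the expectation explicit in the code.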

The mindset shift that matters most

The analysts I’ve seen progress fastest are the ones who got genuinely good at data wrangling early — not as a chore, but as a discipline. They develop an intuition for data quality that never leaves them.

The ones who rushed to modelling never quite lost the habit of trusting their data uncritically. That habit costs them, repeatedly, in ways they often don’t trace back to the root cause.

There is no shortcut through this stage. Build the skill properly.