How to clean a tabular dataset
Last updated on 2025-03-28
Estimated time: 50 minutes
Overview
Questions
- What is ‘clean’ data?
- How can we find inconsistencies in tabular data?
- How can we correct inconsistencies in tabular data?
Objectives
- Describe what data cleaning is and why it is important
- Find and resolve inconsistencies within a tabular dataset programmatically (e.g. datetime formats, numeric precision)
- Identify missing values within a tabular dataset using filters
- Correct spelling mistakes using spell check tools and find + replace
- Standardise text formats using spreadsheet functions
- Describe the pros and cons of using spreadsheets for data collection and cleaning
Challenge 1: Can you do it?
Open film_dataset.csv.
- How many missing values are there in the ‘film_title’ column?
- Are there any duplicate entries in the dataset? If so, how many?
- There are 7 missing values in the film_title column
- There are 5 duplicate rows in the dataset
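The two checks in this challenge can also be done programmatically. The following is a minimal sketch in Python using only the standard library, not the lesson's prescribed method. It runs on a small made-up sample rather than on film_dataset.csv itself. The `year` column, the sample rows, the helper names and the set of strings treated as "missing" are all assumptions made for illustration.

```python
import csv
import io
from collections import Counter

# Assumption: these strings (compared case-insensitively) count as "missing".
# Adjust the set to match how gaps are recorded in your own data.
MISSING_MARKERS = {"", "na", "n/a", "nan", "null"}


def load_rows(text_stream):
    """Read CSV text into a list of dicts, one per data row."""
    return list(csv.DictReader(text_stream))


def count_missing(rows, column):
    """Count rows whose value in `column` is absent or a missing marker."""
    return sum(
        1
        for row in rows
        if (row.get(column) or "").strip().lower() in MISSING_MARKERS
    )


def count_duplicate_rows(rows):
    """Count rows that exactly repeat an earlier row.

    The first occurrence of each row is not counted as a duplicate.
    """
    counts = Counter(tuple(str(value) for value in row.values()) for row in rows)
    return sum(n - 1 for n in counts.values())


# Hypothetical sample data, not the contents of film_dataset.csv.
SAMPLE = """film_title,year
Alien,1979
,1982
Alien,1979
N/A,1999
"""

rows = load_rows(io.StringIO(SAMPLE))
print("Missing film_title values:", count_missing(rows, "film_title"))
print("Duplicate rows:", count_duplicate_rows(rows))
```

To try this on the real file, replace the sample with `with open("film_dataset.csv", newline="") as f: rows = load_rows(f)`. Note that the duplicate check only catches exact matches: rows that differ in capitalisation or stray whitespace are counted as distinct, which is one reason to standardise text formats before looking for duplicates.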
Key Points
- keypoint