How to clean a tabular dataset

Last updated on 2025-03-28

Estimated time: 50 minutes

Overview

Questions

  • What is ‘clean’ data?
  • How can we find inconsistencies in tabular data?
  • How can we correct inconsistencies in tabular data?

Objectives

  • Describe what data cleaning is and why it is important
  • Find and resolve inconsistencies within a tabular dataset programmatically (e.g. datetime formats, numeric precision)
  • Identify missing values within a tabular dataset using filters
  • Correct spelling mistakes using spell check tools and find + replace
  • Standardise text formats using spreadsheet functions
  • Describe the pros and cons of using spreadsheets for data collection and cleaning

Challenge 1: Can you do it?

Open film_dataset.csv.

  1. How many missing values are there in the ‘film_title’ column?
  2. Are there any duplicate entries in the dataset? If so, how many?

Solution

  1. There are 7 missing values in the ‘film_title’ column.
  2. There are 5 duplicate rows in the dataset.
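
The two counts in this challenge can also be checked programmatically. The following is a minimal sketch using only Python's standard library. It is one possible approach, not necessarily the one intended for this lesson. The contents of film_dataset.csv are not shown here, so the sketch runs on a small made-up sample. The sample data, the ‘year’ column, and the helper names count_missing and count_duplicate_rows are all assumptions made for illustration. It treats empty or whitespace-only cells as missing, and counts a row as a duplicate only when every field matches an earlier row.

```python
import csv
import io

# Hypothetical stand-in for film_dataset.csv (the real file is not shown here).
SAMPLE_CSV = """film_title,year
Alien,1979
,1982
Alien,1979
Heat,1995
 ,2001
"""


def count_missing(rows, column):
    """Count rows where `column` is absent, empty, or whitespace only."""
    return sum(1 for row in rows if not (row.get(column) or "").strip())


def count_duplicate_rows(rows):
    """Count rows that exactly repeat an earlier row (first occurrences are not counted)."""
    seen = set()
    duplicates = 0
    for row in rows:
        # Convert values to strings so the key is always hashable.
        key = tuple(str(value) for value in row.values())
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates


# Read the sample into a list of dictionaries, one per data row.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

print("Missing film_title values:", count_missing(rows, "film_title"))
print("Duplicate rows:", count_duplicate_rows(rows))
```

To try this on the real file, replace io.StringIO(SAMPLE_CSV) with an opened file, for example open("film_dataset.csv", newline="", encoding="utf-8"), where the encoding is an assumption. The counts printed for the sample will differ from the answers above, which refer to the real dataset. Note also that ‘duplicate rows’ can be counted in more than one way: this sketch counts only the repeats, not the first occurrence of each repeated row.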

Key Points

  • keypoint