Content from What is Research Data?


Last updated on 2026-02-05

Estimated time: 60 minutes

Overview

Questions

  • What is research data, and why is it important in academic and scientific research?
  • What are the different types of research data?
  • Where can research data come from?
  • What are the key components of research data management (RDM)?

Objectives

  • Describe the common data types found in research datasets
  • Identify the different sources research data can come from
  • Explain what research data management involves (collection, storage, organisation, sharing, etc.)

Understanding data types


Alex is a researcher studying artworks in The Metropolitan Museum of Art. They have just received a dataset containing information about paintings, sculptures, textiles, drawings, and photographs from across the museum’s collection. Before Alex can analyse anything, they need to understand what kinds of data the dataset contains.

Even though the dataset looks like a simple spreadsheet, each column has an underlying data type, and knowing these types will help Alex (and you!) avoid errors, clean the dataset effectively, and choose the right kinds of visualisations or analyses.

Callout

Quick think

Before reading on, ask yourself:

  • What kinds of columns would you expect to see in a museum dataset?

Titles? Dates? Measurements? Artist names?

What is a data type?

A data type describes the kind of information a value represents. It tells the computer how to interpret the data:

  • Is it text?
  • A number?
  • A date?
  • A true/false flag?

When a dataset mixes formats (e.g., a date stored as text, or a number stored as a string), analysis becomes harder and mistakes are more likely. Alex will soon discover that the MET dataset contains a mix of clean values and some messy ones - for example, dates written as 1990, “ca. 1931”, and “07/06/2019 00:00”. Understanding data types helps Alex make sense of this variation.

Why data types matter

Data types are important because they affect how the computer reads, stores, and analyses information. If a column is stored in the wrong format, it can lead to errors, misleading results, or limitations in what you can do with the data.

In summary: knowing data types helps ensure the dataset is trustworthy, analysable, and ready for exploration.

Common data types: examples from the MET Museum dataset

Here are some common data types you’ll encounter:

  • String: Text or characters, like "Claude Monet" or "Oil on canvas"
  • Integer: Whole numbers, like 1985, 42, or 0
  • Float: Decimal numbers, like 27.5 or 3.14
  • Boolean: True/False values, like TRUE, FALSE, Yes, No
  • Datetime: Calendar dates or timestamps, like "2020-01-01" or "12/11/2027"
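If you later work with this data in a programming language such as Python, the same five types map onto built-in Python types. A minimal sketch, using illustrative values rather than real MET records:

```python
from datetime import date

# The example values below are illustrative; they are not pulled from the
# real MET dataset.
artist = "Claude Monet"          # string (str in Python)
year = 1985                      # integer (int)
height_cm = 27.5                 # float
is_public_domain = True          # boolean (bool)
acquired = date(2020, 1, 1)      # date

# Print each value alongside the type Python assigns it
for value in (artist, year, height_cm, is_public_domain, acquired):
    print(repr(value), "->", type(value).__name__)
```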

Now that Alex has identified the main data types, let’s try classifying some values ourselves.

Challenge

Challenge: What data type is it?

Alex found the following values in the MET dataset. For each one, decide what data type it currently is (it may not be what you think it should be!).

"Claude Monet"
1872
"ca. 1931"
"07/06/2019 00:00"
2021-07-14
27.5
"Oil on canvas"
TRUE

Write down the data type you would assign to each value.

"Claude Monet"           is a string
1872                     is an integer
"ca. 1931"               is a string (messy date)
"07/06/2019 00:00"       is a string (looks like a date but stored as text)
2021-07-14               is a date
27.5                     is a float
"Oil on canvas"          is a string
TRUE                     is a boolean

Notice how several values look like dates or numbers but are stored as text - this is common in real datasets and affects how we analyse them.

Identifying data types in your own dataset

So far, Alex has been looking at individual values. In practice, researchers usually work with whole datasets at once, often in spreadsheets, CSV files, or databases. Let’s think about how to identify data types when your data is laid out in columns.

Imagine Alex opens the MET Museum dataset in a spreadsheet. Each column represents a variable, and each row represents an artwork. The column headers might look something like this:

| Object ID | Title             | Artist       | Object Date | Medium         | Is Public Domain | Height (cm) |
|-----------|-------------------|--------------|-------------|----------------|------------------|-------------|
| 436121    | Water Lilies      | Claude Monet | 1906        | Oil on canvas  | TRUE             | 200.5       |
| 459055    | Untitled          | Unknown      | ca. 1931    | Gelatin silver | FALSE            | 27          |
| 12345     | Portrait of a Man | Rembrandt    | 07/06/2019  | Oil on panel   | TRUE             | 98.0        |

Even without doing any analysis, Alex can already start identifying data types by asking a few simple questions about each column.

Look at the values, not just the column name

Column names are helpful, but they don’t always tell the full story. For each column, Alex checks:

  • Are the values mostly text, numbers, dates, or true/false?
  • Do all the values follow the same format?
  • Are there any “odd” entries that don’t match the rest?

Discussion

Challenge: Trust the name or the values?

Which of these columns would you inspect most carefully, and why?

  • ObjectID
  • Title
  • Artist
  • Object Date
  • Medium
  • Is Public Domain
  • Height_cm

Write down one reason based on the values you might expect to see.

Watch out for mixed data types in a single column

One of the most common problems in spreadsheets is mixing data types in the same column. Alex notices that Object Date contains:

  • 1906 (looks like an integer)
  • ca. 1931 (text)
  • 07/06/2019 (date-like text)

Even though these all describe dates, the computer will usually treat the entire column as text, which makes it hard to sort, filter, or calculate with.
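The problem can be seen in a small plain-Python sketch: we try a couple of assumed candidate formats for each value, and “ca. 1931” matches none of them, so it stays unparsed.

```python
from datetime import datetime

# The candidate formats are assumptions: a bare year, and a day/month/year date.
values = ["1906", "ca. 1931", "07/06/2019"]
formats = ["%Y", "%d/%m/%Y"]

parsed_values = {}
for value in values:
    parsed = None
    for fmt in formats:
        try:
            parsed = datetime.strptime(value, fmt)
            break
        except ValueError:
            pass  # this format didn't match; try the next one
    parsed_values[value] = parsed
    print(f"{value!r:>14} -> {parsed}")
```

Note that "07/06/2019" only parses cleanly because we assumed day-first ordering; a US lab would read the same string as July 6th.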

Challenge

Challenge: Thinking about date formats

Look at the following values:

  • 1906

  • ca. 1931

  • 07/06/2019

  • Which values would be easy to convert?

  • Which values would be difficult or ambiguous?

  • What information is missing?

  • What assumptions might you need to make?

  • Some values only include a year, with no month or day.
  • Some values include uncertainty or approximation (e.g. “ca.”).
  • Some values depend on regional date conventions, making them ambiguous.
  • Converting dates may require assumptions, additional metadata, or decisions about how to represent uncertainty.

These are common issues in real datasets and will be addressed later in the course.

Use spreadsheet tools to check data types

Spreadsheets don’t just display data; they also interpret it. Most spreadsheet software gives visual and functional clues that indicate how values are stored, which can help you identify the underlying data type of a column.

Challenge

Challenge: What data type is this column?

Alex opens a different part of the MET dataset containing information about exhibitions and acquisitions.

| Column name      | Values                                                 |
|------------------|--------------------------------------------------------|
| Accession Number | 1975.1, 2003.45a, 1988.12                              |
| Department       | European Paintings, Asian Art, Modern and Contemporary |
| Acquisition Year | 1998, 2005, Unknown                                    |
| Credit Line      | Gift of John Smith, Purchase, Bequest                  |
| On Display       | Yes, No                                                |
| Gallery Number   | 802, 305, NA                                           |
| Last Updated     | 2022-11-03, 03/07/2021, 15 Aug 2020                    |

For each column:

  1. Decide what the data type currently is in the spreadsheet.
  2. Decide what the data type should ideally be for analysis.

Think about:

  • Mixed formats and missing values
  • Columns that look numeric but include text
  • Dates written in different ways

You do not need to clean the data, just identify the data types.

| Column name      | Current data type    | Ideal data type |
|------------------|----------------------|-----------------|
| Accession Number | String               | String          |
| Department       | String               | String          |
| Acquisition Year | String (mixed)       | Integer or date |
| Credit Line      | String               | String          |
| On Display       | String               | Boolean         |
| Gallery Number   | String (mixed)       | Integer         |
| Last Updated     | String (mixed dates) | Date            |

Notes:

  • Accession Number looks numeric but contains letters and punctuation, so it must be text.
  • Acquisition Year mixes numbers with "Unknown", forcing the column to be stored as text.
  • On Display represents a yes/no value but is stored as strings.
  • Gallery Number includes numeric values and missing data (NA), which often results in text storage.
  • Last Updated represents dates, but inconsistent formats prevent it from being treated as a date automatically.
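If Alex were checking these columns programmatically, a sketch along these lines (assuming the pandas library, and a toy reconstruction of three of the columns rather than the real file) would reveal the same problems:

```python
import pandas as pd

# A toy reconstruction of three of the columns above, not the real dataset.
df = pd.DataFrame({
    "Acquisition Year": ["1998", "2005", "Unknown"],
    "On Display": ["Yes", "No", "Yes"],
    "Gallery Number": ["802", "305", "NA"],
})
print(df.dtypes)  # every column is "object", i.e. stored as text

# An attempted numeric conversion turns unparseable entries ("Unknown")
# into missing values, flagging exactly where the messy data is.
years = pd.to_numeric(df["Acquisition Year"], errors="coerce")
print(years)
```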

Where research data comes from


Research data can originate from many different sources. Understanding where data comes from helps researchers assess its reliability, limitations, and appropriate uses.

What is a data source?

A data source is the origin of the data - where it was collected, generated, or obtained. This could be a person, an instrument, a database, a sensor, or a computational process.

Discussion

Quick check

Which of the following could be considered a data source?

  • A spreadsheet downloaded from a website
  • A survey respondent
  • A microscope
  • A computer model

Answer: All of them.

Primary data

Primary data is data collected directly by the researcher for a specific research question. This might include surveys, interviews, experiments, field observations, or measurements. Primary data offers high relevance but often requires more time and resources to collect.

Discussion

Reflect

Have you ever collected primary data?

  • What made it valuable?
  • What made it challenging?

Secondary data

Secondary data is data that was originally collected by someone else for a different purpose and reused in a new study. Examples include government statistics, museum collections, published datasets, or previously published research data. Secondary data saves time but may not perfectly match the research question.

Generated or synthetic data

Generated or synthetic data is created through computational processes such as simulations, models, or algorithms. This includes data produced by climate models, agent-based simulations, or machine learning systems. Synthetic data is useful for testing hypotheses or protecting privacy, but depends heavily on the assumptions of the model.

Sensor and observational data

Sensor and observational data is collected automatically or systematically through observation, often over time. Examples include environmental sensors, satellite imagery, traffic counters, or wildlife cameras. This data can be large and continuous, requiring careful storage and management.

Data from instruments, tools, and experiments

Scientific instruments, laboratory equipment, or specialised tools produce this type of data. Examples include microscope images, sequencing data, spectrometer readings, or experimental measurements. Instrument data often requires calibration, metadata, and specialised software to interpret.

Discussion

Metadata moment

Why might metadata (for example, calibration settings or units) be especially important for this kind of data?

Examples of data sources in different disciplines

Different fields rely on different data sources. For example, historians may use archival documents, social scientists may use surveys or census data, natural scientists may collect experimental measurements, and digital humanities researchers may work with digitised texts or images.

Considering data quality and limitations

Every data source has limitations. Researchers should consider how the data was collected, potential biases, missing values, accuracy, and whether the data is appropriate for their research question. Understanding these limitations is essential for responsible analysis and interpretation.

Challenge

Classify it

You download a CSV file of air pollution measurements collected by a government agency.

Is this:

  • Primary data
  • Secondary data

Secondary data because you didn’t collect it yourself.

Challenge

Challenge: What kind of data source is this?

Below are several research scenarios. For each one, decide what type of data source is being described.

You may find that more than one category could apply; choose the best fit.

  1. A researcher records temperature and humidity every 10 minutes using a weather station on a university rooftop.

  2. A PhD student analyses digitised letters from a national archive that were scanned and published online by another institution.

  3. A social scientist designs and distributes a questionnaire to study students’ experiences of remote learning.

  4. A computer scientist creates a simulated dataset to test how an algorithm behaves under different conditions.

  5. A biologist collects gene expression data using a sequencing machine in a laboratory experiment.

  1. Sensor and observational data
    (Data collected automatically and repeatedly over time.)

  2. Secondary data
    (Data reused from an existing collection created by others.)

  3. Primary data
    (Data collected directly by the researcher for a specific study.)

  4. Generated or synthetic data
    (Data created through simulation or computational processes.)

  5. Data from instruments, tools, and experiments
    (Data produced by specialised scientific equipment.)

Introduction to research data management


What is Research Data Management (RDM)?

Why RDM matters

The research data lifecycle

RDM in practice: Alex and the MET Dataset


Content from Structuring Research Materials


Last updated on 2026-03-03

Estimated time: 60 minutes

Overview

Questions

  • How can you structure data using a standard folder system for better organisation?
  • What are the benefits of using a consistent file naming convention in research data management?
  • Why is version control important, and how can it be incorporated into file naming practices?
  • In what ways can version control tools like Git and GitHub be useful for managing data?

Objectives

  • Organise your research data into a standard folder structure
  • Name files with a consistent naming convention
  • Understand why version control is important, and how to incorporate this into your naming conventions
  • Explain why version control software such as Git/GitHub can be useful for certain types of data.

Folder systems


Alex has recently started a PhD on a project that has been running for a few years. They have been given access to the project’s folders and have been asked by their supervisor to look through some files left by a researcher who recently left.

Challenge

Organising files into a folder structure Part I

In groups, look through Folder 2.1 and discuss the following questions:

  • What problems can you identify with how files are organised?
  • How many different datasets can you identify?
  • Which files are ‘raw’ vs ‘processed’ data?
  • How would you improve the organisation of the files?

When thinking about how you could improve the organisation of the files, consider whether it might be easier to split them between different folders and whether any might need renaming.

Alex goes to their supervisor and explains the problems they have found. The supervisor asks Alex to improve the organisation of the files, concerned that no-one will be able to find anything.

Challenge

Organising files into a folder structure Part II

Individually, look at Folder 2.1 again and, within it, create a set of folders to organise each file into. You may want to create subfolders inside some of these folders too. Organise the files into your folders.

  • Alex sees that during the project the researcher gave a presentation at a conference and that alongside the slides, there are lots of documents relating to their attendance. Think about the different ways files related to the conference might be stored and consider which might be better for a team needing to share files, versus how an individual might be happy to store them.
  • There are also various files related to a data analysis: can you see the different stages of that analysis process? How might you organise them?

Poor organisation can make it difficult to find files or to even see that a specific file exists. This can become a massive problem where multiple people are working together or on projects that run over a number of years. Out-of-date versions of files may end up being used and shared, and important documents may be effectively lost. Relying on search tools to find documents assumes that you know that the document exists, and that you know how it was named; if you weren’t the person who created it how would you know about it? If you think back to documents you created a few years ago, would you still be able to say what they were all called, what the latest versions were called, and what they all related to?

Taking a few moments to think about the structures you use to store files can save a lot of stress, and time, both for you and anyone you work with. If you work across multiple projects it can be worth coming up with a consistent approach, so that you and anyone you work with always know where in the folder structure to find the same types of files. Structure folders hierarchically too: start broad and drill down into specific areas.

It can be worth thinking of an old fashioned set of filing cabinets:

[Figure: a three-drawer filing cabinet with the top drawer open to show a set of files inside]
  • each filing cabinet is a project
  • each drawer is an aspect of the project e.g. drawer 1 for data collection; drawer 2 for analysis; drawer 3 for papers and presentations
  • within each drawer are folders containing files about specific subsections of that aspect e.g. in drawer 2 there are separate folders for code, raw data, cleaned data, graphs/figures, and reports
  • within each folder in each drawer, there may be further sub-sections….

However, do be sensible about the level and number of folders you use: if you have lots of folders that only contain one file, you may have too many, making it more difficult and time-consuming to navigate. If you have very few folders, then there may be too many files in a folder, making it difficult to find the relevant one.

Give each folder a name that is meaningful and concisely describes the contents of the folder, such as “raw_data”, “conference_presentations”, “expenses”.

Whatever structures you choose, it is worth periodically reviewing them to make sure they are still fulfilling their purpose. Perhaps a section of the project folders can be archived? Perhaps there are now enough files of a particular type to necessitate a new folder?

In summary:

  • Use folders! Don’t just save files onto your desktop and expect to be able to find everything in future.
  • Structure hierarchically. Start broad and drill down into specific areas or projects
  • Use a sensible number of folders. Too few or too many may both make it difficult or time-consuming to find files.
  • Use sensible names. Consider project names and the types of files in each folder, such as “raw_data”, “conference_presentations” or “expenses”
  • Develop a consistent approach across projects
  • Review folder content periodically, and consider moving folders and files that are no longer needed into an ‘archive’ folder

File naming


Alex takes another look at the folder system they have created. The files are easier to search through, but they notice that there are lots of inconsistencies in how those files are named.

Discussion

Naming Files Part I

Open Folder 2.2 and look at the names of the files. Can you identify any problems with the way the files are named? What kinds of issues might they cause for those working on the project?

Poor file naming practices can make it difficult and time-consuming to find files, and lead to people working on the wrong files, or even overwriting important files, thus losing important data. Just as with folder structure, taking some time early in a project to develop a naming convention can save time and effort in the long run, both for your future self, and for any colleagues you work with. Below are some key considerations when creating file names:

What information to include

Carefully consider what information someone would need about the file to know it is the one they want. Do they need to know when it was created? Do they need to know what type of data it contains, for example, raw data, clean data? Does the file relate to a specific ID number? Some of those items of information might be good candidates to form part of the file name.

Whether you need to be able to order the files by a characteristic

For example, will you need to be able to easily select the file that was created most recently? Or quickly find a file by a sample ID? If so, you will want that element of the name to be at the beginning of the file name. Sometimes you might have to prioritise one of those requirements: for example, maybe you need to find the most recent file relating to a specific sample, in which case you might name the files using the format:

sampleID_date

Special Characters and Spaces

Avoid using special characters (such as ? # ! " £ $ % ^ & * { } @ / | < >), as operating systems and applications may handle these very differently: some may be completely unable to open a file whose name contains them, or may not recognise the name at all. Some special characters also have a meaning in particular programming languages, and may be interpreted as instructions to the computer rather than as part of the file name.

Similarly, spaces in file names can cause problems:

Challenge

Naming Files Part II

Look at the file name below:

STAR final results.xls

How do you think this might be interpreted by a computer? How might you rewrite the name to avoid that?

If you have spaces in a file name, the computer may interpret a space as marking the end of the file name, and therefore not treat the rest of the name as part of the file. Alternatively, it may interpret the name as several file names listed one after the other. For example, STAR final results.xls might be read as:

1 file named STAR followed by the command ‘final’…

or as 3 files named:

  • STAR
  • final
  • results.xls

A better way to write this file name would be STAR_final_results.xls or STAR-final-results.xls

Recommendation: Use only numbers and letters (without accents) and use hyphens and underscores instead of spaces, to separate the different parts of the file name.
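The recommendation can be sketched as a small Python helper; safe_name is our own illustrative name, not a standard library function:

```python
import re

def safe_name(name: str) -> str:
    """Replace spaces with underscores, then drop any character that is
    not an unaccented letter, digit, dot, hyphen, or underscore."""
    name = name.replace(" ", "_")
    return re.sub(r"[^A-Za-z0-9._-]", "", name)

print(safe_name("STAR final results.xls"))   # STAR_final_results.xls
```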

Dates

Challenge

Naming Files Part III

Look at the file names below:

  1. 01022026_sputum_culture_results.csv

  2. 03_09_2025_sputum_culture_results.csv

  3. 05Jun25_sputum_culture_results.csv

  4. 120126_sputum_culture_results.csv

  5. 12252026_sputum_culture_results.csv

What order were those files created in? Are you sure?

If the files came from laboratories in both the UK and in the USA, would that raise any concerns about how to read the dates on the files?

Dates are a frequent cause of issues for researchers. Researchers from different countries may read date numbers differently: “05062026” may be one person’s 5th of June (e.g. in the UK), while for another it’s 6th of May (e.g. in the USA). Ensuring that everyone looking at the date reads it correctly can be the difference between the correct file being selected, and the wrong one.

Discussion

Naming Files Part IV

Look at those file names again:

  1. 01022026_sputum_culture_results.csv

  2. 03_09_2025_sputum_culture_results.csv

  3. 05Jun25_sputum_culture_results.csv

  4. 120126_sputum_culture_results.csv

  5. 12252026_sputum_culture_results.csv

Note that they were given in the order they would appear in a folder, i.e. in numerical and alphabetical order.

Can you think of a better way to write the dates, so that the files appear in date order?

A good format for dates is YYYYMMDD (the ISO 8601 standard), or YYYY-MM-DD. This format ensures that everything can be easily ordered by year, then month, and then day. Seeing the year at the start of the date also indicates to those looking at the files that the date is probably being handled in this way.
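A quick plain-Python sketch shows why: alphabetical sorting keeps ISO-dated names in chronological order, but scrambles DDMMYYYY-style names. The file names are illustrative.

```python
# File names using DDMMYYYY dates, listed in the order they were created:
ddmmyyyy = [
    "05062025_results.csv",   # 5 June 2025
    "12012026_results.csv",   # 12 January 2026
    "01022026_results.csv",   # 1 February 2026
]
# The same files renamed with ISO-style YYYYMMDD dates:
iso = [
    "20250605_results.csv",
    "20260112_results.csv",
    "20260201_results.csv",
]

print(sorted(ddmmyyyy))   # alphabetical order scrambles the chronology
print(sorted(iso))        # alphabetical order matches chronological order
```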

Version control


Alex has now organised the folders and renamed many of the files. Things look much better, but as they work through the project, they notice something worrying.

For several documents, there are multiple slightly different versions of the same file, and it’s not always clear:

  • Which one is the most recent

  • What changed between versions

  • Who made those changes, or why

Some files include ‘final’ in the name… sometimes more than once.

Challenge

Versions Everywhere

Look at the files in Folder 2.3:

  • How many different versions of the same document can you find?

  • How can you tell which one is the “latest”?

  • Are you confident you’d pick the right file to work on?

  • What might go wrong if different people used different versions?

You don’t need to open the files to answer these questions.
Focus on the file names themselves:

  • Look for words like final, draft, v1, v2, or dates
  • Notice whether the versioning scheme is consistent across files
  • Ask yourself what assumptions you’re making when deciding which file is “latest”

Would someone new to the project make the same assumptions?

Version control is about tracking change over time, so that you can:

  • See what changed

  • Return to earlier versions if needed

  • Understand how a file reached its current state

Just setting out clear file naming conventions and using them consistently can go a long way towards ensuring you can do all of these things.

Recommendations

  • Use version numbers to indicate the order that document versions were created in. Decimal points can be used to indicate intermediate versions; whole numbers to indicate major versions at key points in the document’s lifecycle. For example, v0.1 might indicate a very early first draft of a paper, v1.0 might be the first version that was circulated to other authors for initial comments, v1.1 an updated version based on the comments, v2.0 might be the version that was submitted to the publisher and v3.0 might be the version that was resubmitted after revisions based on reviewer comments.
  • Sometimes appending dates may be more appropriate, if so use the YYYYMMDD format and consider its position in the file name so that the files are always ordered correctly if sorted alphabetically
  • Where a document has reached a key point in its lifecycle, such as being submitted to a publisher, it may be helpful to append a short word or phrase such as “Submitted”, “Submitted revision”, to clarify that (but do use version numbers too!)
  • Where there are lots of versions of a document in a folder, it may be appropriate to create a subfolder to keep previous versions in: this helps you and any colleagues to be able to quickly find the current version, particularly important if the document is a Standard Operating Procedure or Manual.

However, while file naming conventions can help, they aren’t always enough on their own.

Version control tools


For some types of work, particularly text-based files like code, scripts, and documentation, specialised version control tools can be extremely useful.

These tools are designed to:

  • Automatically record changes

  • Keep a history of edits

  • Show exactly what changed between versions

  • Support collaboration without overwriting others’ work

Challenge

When might tools help?

Consider the following types of files:

  • Word documents

  • Spreadsheets

  • Analysis scripts (e.g. R, Python)

  • Survey questionnaires

  • Images or PDFs

In small groups, discuss:

  • Which of these changes frequently?

  • Which are hard to merge if two people edit them?

  • Which might benefit most from automated version tracking?

Imagine two people editing the file at the same time: what would be easy to reconcile, and what would be painful?

Not all files benefit equally from version control tools. Files such as images or Excel files (known as binary files, i.e. non-text files) can be harder to manage, while plain text files work particularly well.

Challenge

Thinking ahead

Without worrying about how to use them yet:

  • What advantages might a version control tool offer over manual file naming?

  • What new challenges might it introduce?

  • In what situations might it be unnecessary or overkill?

You might want to think about: how do you currently keep track of changes? What information gets lost when files are renamed or overwritten?

Also consider scale: one person vs a team? One week vs several years?

Callout

Git and GitHub

Git is a version control tool, and GitHub is a platform that hosts Git projects and supports collaboration.

They are particularly useful for certain types of data, especially files that are:

  • text-based (such as code, scripts, and documentation)
  • edited frequently
  • worked on by more than one person

They are generally less helpful for files like images, PDFs, spreadsheets, or heavily formatted documents (such as Word or LibreOffice files), where combining changes is difficult.

File naming can help manage major versions (for example, report_v1, report_v2), but version control tools go further; they can record every change and allow you to return to a specific point in time, a bit like “Track Changes” for an entire project.

You won’t be expected to use Git or GitHub yet. For now, it’s enough to understand why such tools exist and when they might be useful.

Terminology preview

You may hear version control tools described using terms like:

  • Repository: a project’s home, containing the files and their record of changes
  • Commit: a saved snapshot of changes, with a short note about what was done
  • History: the timeline of commits showing how the project evolved

You don’t need to know how to use these yet. For now, think of them as names for ideas you’ve already encountered when trying to keep track of different versions of files.

Content from Tabular Data Collection


Last updated on 2026-02-04

Estimated time: 60 minutes

Overview

Questions

  • What types of variables are commonly found in tabular data?
  • What kinds of data inconsistencies can affect the quality of a dataset?
  • What are some common causes of inconsistent or messy data?
  • What practices can help ensure clean, consistent data during collection and entry?
  • Why is it important to provide clear instructions or rules when collecting data?
  • What is a data dictionary, and why is it useful?

Objectives

After following this episode, learners will be able to:

  • List variable types and formats
  • Identify inconsistencies in data that can cause problems during analysis
  • Describe methods that can be used during data collection and data entry that can prevent inconsistencies
  • Write guidance for how to collect and enter data
  • Create a data dictionary describing a dataset

Variables, data types and formats


Alex has received a dataset from the MET museum and needs to understand the types of variables before exploring or analysing it further.

Prerequisite

Follow along: Open up the dataset

You should have downloaded a dataset called Met_Objects_Dataset_sample.txt as part of the setup instructions. Please open this file in whatever spreadsheet software you are using (e.g. LibreOffice, Excel). The file is tab delimited (i.e. within each row a tab character is used to separate values into their columns), so you may need to use whatever Text to Columns tool your spreadsheet software provides to convert it into columnar data. The first row contains the column headers.
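If you prefer to open the file programmatically, here is a sketch assuming the pandas library, with a tiny inline sample standing in for Met_Objects_Dataset_sample.txt; the real file can be read the same way by passing its path instead of the inline text.

```python
import io
import pandas as pd

# Two lines of tab-delimited text standing in for the downloaded file.
sample = "Object ID\tTitle\tArtist\n436121\tWater Lilies\tClaude Monet\n"

# sep="\t" tells pandas the columns are separated by tab characters.
df = pd.read_csv(io.StringIO(sample), sep="\t")

print(df.columns.tolist())
print(df.loc[0, "Artist"])
```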

What is a data point?

A data point is a single piece of information collected for one variable about one item.

In the MET museum dataset Alex is using, each row is an object (like a painting or sculpture), and each cell is a data point.

Alex’s dataset looks like a spreadsheet, but underneath, each column contains a specific data type, which we covered in section 1. Knowing these helps to avoid errors and choose the right tools for analysis.

What is a variable?

A variable is a characteristic or attribute that can take on different values. In tabular data, variables are usually represented as columns, where each row contains an observation or entry.

However, the concept of a variable is independent of format: it’s not defined by being a column, but by being a consistent type of information collected across observations.

For example, in Alex’s MET dataset, variables might include objectid, artistname, or dateacquired.

Types of variables

Numeric variables

Variables that represent measurable quantities. These can be integers or floats. Numeric variables can be Discrete, which means they take on specific, separate values (often counts), or Continuous, which can take on any value within a range (often measurements).

Examples:

  • objectid = 12345 (integer, discrete — a unique ID number)
  • heightcm = 23.5 (float, continuous — a measurement in centimetres)
  • objectdate = 1890 (integer, discrete — a specific year)

String variables

Free-form or descriptive text.

Examples:

  • artistdisplayname = "Claude Monet" (string)
  • title = "Woman with a Parasol" (string)

Categorical variables

Variables that represent groups or categories. These could be strings, integers, or floats - anything used to label a category!
Categorical variables can be Nominal, which means there is no inherent order (e.g., artistnationality), or Ordinal, which means the categories follow a logical order (e.g., popularity).

Examples:

  • gender = "Female", "Male" (string, nominal)
  • medium = "Marble", "Bronze", "Oil on canvas" (string, nominal)
  • istimelinework = "Yes" / "No" (string, nominal — or Boolean: TRUE / FALSE)
  • artistdecade = 1950, 1960, 1980 (integer, ordinal — ordered decades)
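
This lesson works in spreadsheets, but if you later move to a programming language, ordinal categories can be represented explicitly. Here is a minimal sketch using Python’s pandas library (an assumption, not part of this lesson’s setup), with the artistdecade values from the examples above:

```python
import pandas as pd

# Ordered categorical: the decades have a defined order, so comparisons
# and sorting respect it. Values are the illustrative ones from above.
decades = pd.Categorical(
    [1950, 1980, 1960, 1950],
    categories=[1950, 1960, 1970, 1980],
    ordered=True,
)

s = pd.Series(decades)
print(s.min())                    # earliest decade present
print(s.sort_values().tolist())   # values in their defined order
```

Declaring the order up front is what distinguishes an ordinal variable from a plain nominal label in code.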

Date/time variables

Variables that represent dates or times.

Examples:

  • lastconserv = "2001-05-12" (datetime or string)
  • objectdate = 1990 (integer), "ca. 1890" (string)
Callout

Note on overlapping types:

Some variables can belong to more than one category depending on their use and format. For example:

objectdate = 1890 might be treated as a numeric variable (discrete integer) if used for sorting or calculations.

The same objectdate could also be considered a date/time variable if formatted as "1890-01-01" and used in time-based analyses.

artistdecade = 1950 could be a categorical variable (ordinal) if grouped into decade-based categories for comparison.

It’s okay for a single value to have more than one interpretation - what matters is how it’s used in context.

Caution

⚠️ Some columns might look like numbers but contain inconsistent formats (e.g., “ca. 1890”). These need cleaning before they can be analysed as dates.

Summary

Understanding the difference between conceptual types (how the data is used or interpreted) and technical types (how the data is stored or formatted) is key for working effectively with tabular data. For example, a column might be technically an integer but conceptually a category (like decades or survey scores).

| Conceptual Type | Technical Type | Description | Example |
| --- | --- | --- | --- |
| Nominal | String | Categories, no order | artistnationality = Australian |
| Ordinal | String | Categories with order | popularity = high |
| Discrete Numeric | Integer | Countable numbers | objectid = 123456 |
| Continuous Numeric | Integer, Float | Measurable, decimals allowed | height = 27.5 |
| Boolean | Boolean | Yes/No, True/False | ishighlight = TRUE |
| Date/Time | Datetime | Dates or times | lastconserv = 12/11/2027 |
| Textual | String | Free text | artistdisplayname = Claude Monet |
| Identifier | Integer/String | Unique reference | objectnumber = 1982.456 |
Callout

Tip for learners (like Alex):

Understanding both the conceptual meaning and the technical format of your data helps you clean it correctly, document it clearly, and analyse it without errors.
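
The conceptual/technical distinction can also be seen programmatically. As a sketch (assuming Python’s pandas library, which is not part of this lesson’s spreadsheet workflow, and using made-up values modelled on the MET columns):

```python
import pandas as pd

# Illustrative MET-like columns; the values are invented, not from the dataset
df = pd.DataFrame({
    "objectid": [1001, 1002],                       # identifier, stored as integer
    "height": [27.5, 14.0],                         # continuous numeric (float)
    "artistnationality": ["Australian", "French"],  # nominal category, stored as string
    "ishighlight": [True, False],                   # boolean
})

# dtypes shows only the *technical* types pandas infers
print(df.dtypes)
```

Note that pandas reports objectid as an integer: the *conceptual* type is up to you. objectid is technically int64, but conceptually an identifier, so summing or averaging it would be meaningless.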

Identify inconsistencies in data


Before we can clean or analyse data, it’s important to check for inconsistencies, values that don’t follow a standard or expected format. These might include:

  • Different spellings or formats for the same category
  • Mixed use of upper/lower case
  • Inconsistent date formats
  • Unexpected blank or missing values
  • Invalid or impossible values (e.g. negative heights, future birth dates)

These inconsistencies can lead to errors or misleading results if not corrected.

Example: Inconsistencies in the artistgender column

Here’s an example of how the same concept (“artistgender”) can be recorded in many different ways:

| objectid | artistgender |
| --- | --- |
| 1001 | Female |
| 1002 | female |
| 1003 | F |
| 1004 | Male |
| 1005 | MALE |
| 1006 | M |
| 1007 | Unknown |
| 1008 | |

We can see:

  • "Female", "female", and "F" all refer to the same category
  • "Male", "MALE", and "M" are also equivalent
  • "Unknown" and the blank entry might indicate missing or uncertain data

These differences need to be standardised before analysis — for example, converting all values to lowercase and replacing shorthand terms with full words.
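
The standardisation just described can be sketched in a few lines of Python (an assumption for illustration; in a spreadsheet you would use find-and-replace and LOWER instead). The shorthand mapping is our own choice of convention:

```python
# Assumed convention: lowercase everything, expand shorthand, treat
# "Unknown" and blanks as missing (None)
SHORTHAND = {"f": "female", "m": "male"}

def standardise_gender(value):
    """Return a standardised gender value, or None if missing/unknown."""
    if value is None or value.strip() == "":
        return None
    v = value.strip().lower()
    v = SHORTHAND.get(v, v)
    return None if v == "unknown" else v

raw = ["Female", "female", "F", "Male", "MALE", "M", "Unknown", ""]
print([standardise_gender(v) for v in raw])
```

After this pass, every variant of the same category maps to a single value, and the two kinds of missing data are represented consistently.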

Challenge

Challenge 1: Can you find any inconsistencies or problems with data entered into a spreadsheet?

Let’s have a deep dive into the Met_Objects_Dataset_sample.txt dataset. Using a coloured fill identify any inconsistencies or problem data in the spreadsheet that you think might cause problems for anyone analysing the data.

Inconsistencies might include measurements recorded in different units, differing formats for dates, mixed upper/lower case, or the same thing indicated in a variety of different ways.


Next, we’ll look at how to avoid these kinds of issues from happening in the first place.

Prevent inconsistencies during data collection


As we’ve seen so far, our dataset contains a number of inconsistencies that will complicate analysis. In an ideal world, we would have avoided introducing these errors while collecting the data. It’s always simpler to avoid inconsistencies in the first place, rather than trying to fix them later!

How could we have adjusted our data collection to avoid this? Let’s take the lastconserv column as an example, which represents the date when the object was last conserved. Here we see a large number of different date / time formats, including:

  • 28/01/2025 = day / month / year
  • 07/21/2023 = month / day / year
  • 26.03.23 = day.month.year
  • 07/06/2019 00:00 = day / month / year hour:minute

To avoid this, we could have enforced a specific date/time format during collection. For example, if we were using a form, we could have limited responses in this field to only accept dates as year-month-day, with no time entry allowed.

There are also some incorrect dates in this column, e.g. 30/2/2024 (30th February 2024). February only has 28 days, or 29 during a leap year, so this date is impossible. We could have avoided this by providing some kind of date validation in the form - e.g. using a calendar input that only contains real dates.
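
Both ideas - enforcing one format and rejecting impossible dates - come for free if entries are run through a strict date parser. A minimal sketch in Python (an assumption; a form builder or spreadsheet validation rule achieves the same thing):

```python
from datetime import datetime

def parse_conservation_date(text):
    """Accept only year-month-day; anything else raises ValueError."""
    return datetime.strptime(text, "%Y-%m-%d").date()

print(parse_conservation_date("2025-01-28"))  # a valid entry

# Impossible dates like 30 February are rejected automatically:
try:
    parse_conservation_date("2024-02-30")
except ValueError as err:
    print("rejected:", err)
```

A value like "26.03.23" or "07/06/2019 00:00" would likewise fail the strict parse, so mixed formats never reach the dataset.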

Some general guidelines

  • Avoid free text fields during data collection. This increases the risk of spelling mistakes, additional spaces, etc., which will complicate the final analysis.

  • If a column should only contain particular values, then enforce this! For example, you could use a drop-down menu with set options to choose from.

  • Add validation to avoid ‘impossible’ values. For example, are values only valid within a certain range? Are negative values valid?

  • Where multiple formats are possible (e.g. with dates / times), enforce a specific format.

Challenge

Challenge: Methods to prevent inconsistencies during data collection

In a small group, consider how you could prevent the other inconsistencies you identified in the dataset. What checks or rules could you introduce during data collection?

There are many different solutions to these inconsistencies, but here are some examples:

  • istimelinework could have used a drop-down menu that enforced only two choices of True or False.
  • a check could have been added to enforce that accessionyear (the year the object entered the collection) is always after objectdate (the year the object was created).
  • artistnationality could have used a drop-down menu containing set nationality options. This would have avoided inconsistencies like France vs French.
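
Checks like these can be written down as a small validation function run over each row. The sketch below (Python, assumed for illustration; the allowed-values list is our own invented example) encodes the three rules above:

```python
# Hypothetical allowed set for the drop-down; a real form would list all nationalities
ALLOWED_NATIONALITIES = {"French", "American", "Australian"}

def validate_record(record):
    """Return a list of problems found in one row (empty list = valid)."""
    problems = []
    if record["istimelinework"] not in (True, False):
        problems.append("istimelinework must be True or False")
    if record["accessionyear"] < record["objectdate"]:
        problems.append("accessionyear is before the object was created")
    if record["artistnationality"] not in ALLOWED_NATIONALITIES:
        problems.append("unrecognised nationality: " + record["artistnationality"])
    return problems

record = {"istimelinework": True, "accessionyear": 1982,
          "objectdate": 1890, "artistnationality": "France"}
print(validate_record(record))  # flags "France" (the drop-down expects "French")
```

Running every incoming row through such a function at entry time means inconsistencies are caught immediately, while the person entering the data can still fix them.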

Write data collection guidelines


As we saw in the last section, there are many additional checks / rules we could have added during data collection to make our dataset more consistent and easier to analyse. It’s good practice to document these rules before data collection takes place so that it’s clear to yourself, along with any collaborators and future users of your dataset, exactly how values were collected. This will also be invaluable when it’s time to write the methods section of any papers or reports that use this data.

Make sure you include information about how to handle missing values. How will these be represented in your table? It is also useful to explicitly state why a value may be missing (if possible). For example, in our dataset some objects are made by manufacturing companies like United Merchants & Manufacturers rather than an individual artist - in this case artistgender will be missing, as it doesn’t apply in this scenario.

Discussion

Write data collection guidelines

Choose a variable from the dataset (e.g. lastconserv) and write some bullet-point guidelines for its collection. For example:

  • Which values are valid for this variable?

  • Which format should be used?

  • Are any checks required against other variables in the table?

  • If a value is missing, how should it be represented? E.g. NA, None, not applicable

Data Dictionaries


What is a Data Dictionary?

A data dictionary is a table that describes the variables in your dataset. It provides key information such as:

  • Variable name (column header)
  • Description (what the variable represents)
  • Data type (e.g., string, integer, boolean, datetime)
  • Possible values or format (especially for categorical variables)
  • Units (if relevant)

Data dictionaries help others (and future you!) understand and use your data consistently and correctly.

Example: Wildlife Observations Dataset

Here’s a sample data dictionary for a fictional dataset tracking wildlife sightings in a nature reserve:

| Variable Name | Description | Data Type | Possible Values / Format | Units |
| --- | --- | --- | --- | --- |
| sighting_id | Unique ID for each observation | Integer | 1, 2, 3, … | N/A |
| species_name | Name of the animal species observed | String | e.g., “Red Fox”, “Barn Owl” | N/A |
| count | Number of individuals seen | Integer | 0, 1, 2, … | Count |
| observation_date | Date the observation was recorded | Datetime | YYYY-MM-DD | N/A |
| location | Area of the park where sighting occurred | String | “North Woods”, “Wetland Trail” | N/A |
| is_endangered | Whether the species is endangered | Boolean | TRUE, FALSE | N/A |
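
Parts of a data dictionary can be bootstrapped from the data itself. As a sketch (assuming Python’s pandas library and a toy version of the wildlife table above): the names, technical types, and example values are automatic, but the descriptions and units still need a human.

```python
import pandas as pd

# Toy subset of the wildlife dataset above
df = pd.DataFrame({
    "sighting_id": [1, 2, 3],
    "species_name": ["Red Fox", "Barn Owl", "Red Fox"],
    "is_endangered": [False, True, False],
})

# Skeleton data dictionary: one row per variable
skeleton = pd.DataFrame({
    "variable_name": df.columns,
    "data_type": [str(t) for t in df.dtypes],
    "example_value": [df[c].iloc[0] for c in df.columns],
})
print(skeleton)
```

Starting from a generated skeleton makes it much more likely that every column actually gets documented.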
Discussion

Challenge: Write a Data Dictionary for Alex

Alex is trying to make sense of the MET museum dataset. Help Alex out by creating a mini data dictionary!

  1. Open the file Met_Objects_Dataset_sample.txt
  2. Choose three variables (columns) from the dataset
  3. For each one, write down:
    • The variable name
    • A short description
    • The data type (e.g., string, integer, date)
    • Any possible values or units, if relevant

Work in pairs or small groups and compare your answers.

Key Points
  • Variables have both a conceptual type (how they are interpreted) and a technical type (how they are stored)
  • Inconsistent spellings, cases, formats, and missing values complicate analysis and should be standardised
  • Enforcing formats, drop-down options, and validation rules during data collection prevents inconsistencies
  • A data dictionary documents each variable’s name, description, type, format, and units

Content from How to clean a tabular dataset


Last updated on 2025-03-28 | Edit this page

Estimated time: 50 minutes

Overview

Questions

  • What is ‘clean’ data?
  • How can we find inconsistencies in tabular data?
  • How can we correct inconsistencies in tabular data?

Objectives

  • Describe what data cleaning is and why it is important
  • Find and resolve inconsistencies within a tabular dataset programmatically (e.g. datetime, numeric precision)
  • Identify missing values within a tabular dataset using filters
  • Correct spelling mistakes using spell check tools and find + replace
  • Standardise text formats using spreadsheet functions
  • Describe the pros and cons of using spreadsheets for data collection and cleaning
Challenge

Challenge 1: Can you do it?

Open film_dataset.csv.

  1. How many missing values are there in the ‘film_title’ column?
  2. Are there any duplicate entries in the dataset? If so, how many?

Solution:

  1. There are 7 missing values in the film_title column
  2. There are 5 duplicate rows in the dataset
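
Counts like these can also be computed programmatically. A sketch using Python’s pandas library (an assumption; film_dataset.csv itself is not reproduced here, so this uses a tiny made-up table with the same kinds of problems):

```python
import pandas as pd

# Invented example data: two missing titles and two fully duplicated rows
df = pd.DataFrame({
    "film_title": ["Alien", None, "Alien", "Heat", None],
    "year": [1979, 1986, 1979, 1995, 1986],
})

print(df["film_title"].isna().sum())  # missing values in one column
print(df.duplicated().sum())          # rows identical to an earlier row
```

The same two calls, applied to the real film_dataset.csv loaded with pd.read_csv, would reproduce the counts given in the solution above.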
Key Points
  • Data cleaning means finding and resolving inconsistencies such as mixed formats, missing values, and spelling mistakes
  • Filters, find and replace, and spreadsheet functions can locate and standardise problem values

Content from Introduction to R


Last updated on 2025-03-28 | Edit this page

Estimated time: 60 minutes

Overview

Questions

  • What is….

Objectives

  • Objective 1