Content from What is Research Data?
Last updated on 2026-02-05
Overview
Questions
- What is research data, and why is it important in academic and scientific research?
- What are the different types of research data?
- Where can research data come from?
- What are the key components of research data management (RDM)?
Objectives
- Data types
- Sources of data
- What is research data management (collection, storage, organisation, sharing, etc.)?
Understanding data types
Alex is a researcher studying artworks in The Metropolitan Museum of Art. They have just received a dataset containing information about paintings, sculptures, textiles, drawings, and photographs from across the museum’s collection. Before Alex can analyse anything, they need to understand what kinds of data the dataset contains.
Even though the dataset looks like a simple spreadsheet, each column has an underlying data type, and knowing these types will help Alex (and you!) avoid errors, clean the dataset effectively, and choose the right kinds of visualisations or analyses.
Quick think
Before reading on, ask yourself:
- What kinds of columns would you expect to see in a museum dataset?
Titles? Dates? Measurements? Artist names?
What is a data type?
A data type describes the kind of information a value represents. It tells the computer how to interpret the data:
- Is it text?
- A number?
- A date?
- A true/false flag?
When a dataset mixes formats (e.g., a date stored as text, or a number stored as a string), analysis becomes harder and mistakes are more likely. Alex will soon discover that the MET dataset contains a mix of clean values and some messy ones - for example, dates written as 1990, “ca. 1931”, and “07/06/2019 00:00”. Understanding data types helps Alex make sense of this variation.
Why data types matter
Data types are important because they affect how the computer reads, stores, and analyses information. If a column is stored in the wrong format, it can lead to errors, misleading results, or limitations in what you can do with the data.
In summary: knowing data types helps ensure the dataset is trustworthy, analysable, and ready for exploration.
Common data types: examples from the MET Museum dataset
Here are some common data types you’ll encounter:
- String: Text or characters, like "Claude Monet" or "Oil on canvas"
- Integer: Whole numbers, like 1985, 42, or 0
- Float: Decimal numbers, like 27.5 or 3.14
- Boolean: True/False values, like TRUE, FALSE, Yes, No
- Datetime: Calendar dates or timestamps, like "2020-01-01" or "12/11/2027"
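These types map directly onto the basic types in most programming languages. As a quick illustration in Python (the values below are illustrative, not taken from the real dataset):

```python
from datetime import date

artist = "Claude Monet"      # string (str): text or characters
year = 1985                  # integer (int): a whole number
height_cm = 27.5             # float: a decimal number
is_public_domain = True      # boolean (bool): a true/false flag
acquired = date(2020, 1, 1)  # datetime (date): a calendar date

# Python can report the type of each value:
for value in (artist, year, height_cm, is_public_domain, acquired):
    print(type(value).__name__)
```

Knowing which of these types a column *should* hold is the first step in spotting values that don't fit.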
Now that Alex has identified the main data types, let’s try classifying some values ourselves.
Challenge
Challenge: What data type is it?
Alex found the following values in the MET dataset. For each one, decide what data type it currently is (it may not be what you think it should be!).
"Claude Monet"
1872
"ca. 1931"
"07/06/2019 00:00"
2021-07-14
27.5
"Oil on canvas"
TRUE
Write down the data type you would assign to each value.
"Claude Monet" is a string
1872 is an integer
"ca. 1931" is a string (messy date)
"07/06/2019 00:00" is a string (looks like a date but stored as text)
2021-07-14 is a date
27.5 is a float
"Oil on canvas" is a string
TRUE is a boolean
Notice how several values look like dates or numbers but are stored as text - this is common in real datasets and affects how we analyse them.
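A short Python sketch makes the difference concrete: digits stored as text behave like text, not numbers, until you explicitly convert them (the values are the ones from the challenge above):

```python
year = 1872            # a true integer: arithmetic works
year_as_text = "1872"  # the same digits stored as a string

print(year + 1)                # 1873
print(int(year_as_text) + 1)   # 1873, but only after explicit conversion

# Messy values like "ca. 1931" cannot be converted directly:
try:
    int("ca. 1931")
except ValueError:
    print("not a clean number")
```

Trying `year_as_text + 1` directly would raise a TypeError, because adding a number to text is undefined.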
Identifying data types in your own dataset
So far, Alex has been looking at individual values. In practice, researchers usually work with whole datasets at once, often in spreadsheets, CSV files, or databases. Let’s think about how to identify data types when your data is laid out in columns.
Imagine Alex opens the MET Museum dataset in a spreadsheet. Each column represents a variable, and each row represents an artwork. The column headers might look something like this:
| Object ID | Title | Artist | Object Date | Medium | Is Public Domain | Height (cm) |
|---|---|---|---|---|---|---|
| 436121 | Water Lilies | Claude Monet | 1906 | Oil on canvas | TRUE | 200.5 |
| 459055 | Untitled | Unknown | ca. 1931 | Gelatin silver | FALSE | 27 |
| 12345 | Portrait of a Man | Rembrandt | 07/06/2019 | Oil on panel | TRUE | 98.0 |
Even without doing any analysis, Alex can already start identifying data types by asking a few simple questions about each column.
Look at the values, not just the column name
Column names are helpful, but they don’t always tell the full story. For each column, Alex checks:
- Are the values mostly text, numbers, dates, or true/false?
- Do all the values follow the same format?
- Are there any “odd” entries that don’t match the rest?
Challenge: Trust the name or the values?
Which of these columns would you inspect most carefully, and why?
ObjectID, Title, Artist, Object Date, Medium, Is Public Domain, Height_cm
Write down one reason based on the values you might expect to see.
Watch out for mixed data types in a single column
One of the most common problems in spreadsheets is mixing data types in the same column. Alex notices that Object Date contains:
- 1906 (looks like an integer)
- ca. 1931 (text)
- 07/06/2019 (date-like text)
Even though these all describe dates, the computer will usually treat the entire column as text, which makes it hard to sort, filter, or calculate with.
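A minimal Python sketch shows why: a column that mixes numbers and text cannot even be sorted, and once everything is text, sorting gives alphabetical rather than chronological order (the values are hypothetical, following the Object Date example above):

```python
# Hypothetical Object Date values, mixing an integer and two strings:
object_date = [1906, "ca. 1931", "07/06/2019"]

try:
    sorted(object_date)          # comparing int with str is undefined
except TypeError:
    print("cannot sort a column that mixes numbers and text")

# If every value is stored as text instead, sorting runs, but the order
# is alphabetical, not chronological:
as_text = [str(v) for v in object_date]
print(sorted(as_text))           # ['07/06/2019', '1906', 'ca. 1931']
```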
Challenge: Thinking about date formats
Look at the following values:
1906
ca. 1931
07/06/2019

- Which values would be easy to convert?
- Which values would be difficult or ambiguous?
- What information is missing?
- What assumptions might you need to make?
- Some values only include a year, with no month or day.
- Some values include uncertainty or approximation (e.g. “ca.”).
- Some values depend on regional date conventions, making them ambiguous.
- Converting dates may require assumptions, additional metadata, or decisions about how to represent uncertainty.
These are common issues in real datasets and will be addressed later in the course.
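One common approach is to try a list of expected formats and flag anything that matches none of them. Here is a minimal sketch; note that treating slashed dates as day-first is an assumption you would need to document:

```python
from datetime import datetime

def try_parse(value, formats=("%Y", "%d/%m/%Y", "%Y-%m-%d")):
    """Return a datetime for the first matching format, or None."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            pass
    return None

print(try_parse("1906"))        # 1906-01-01 00:00:00 (year only: 1 Jan assumed)
print(try_parse("07/06/2019"))  # 2019-06-07 00:00:00 (day-first assumed: ambiguous!)
print(try_parse("ca. 1931"))    # None: approximate dates need a human decision
```

Every `None` result is a value that needs a documented decision rather than an automatic conversion.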
Use spreadsheet tools to check data types
Spreadsheets don’t just display data; they also interpret it. Most spreadsheet software gives visual and functional clues that indicate how values are stored, which can help you identify the underlying data type of a column.
Challenge: What data type is this column?
Alex opens a different part of the MET dataset containing information about exhibitions and acquisitions.
| Column name | Values |
|---|---|
| Accession Number | 1975.1, 2003.45a, 1988.12 |
| Department | European Paintings, Asian Art, Modern and Contemporary |
| Acquisition Year | 1998, 2005, Unknown |
| Credit Line | Gift of John Smith, Purchase, Bequest |
| On Display | Yes, No |
| Gallery Number | 802, 305, NA |
| Last Updated | 2022-11-03, 03/07/2021, 15 Aug 2020 |
For each column:
- Decide what the data type currently is in the spreadsheet.
- Decide what the data type should ideally be for analysis.
Think about:
- Mixed formats and missing values
- Columns that look numeric but include text
- Dates written in different ways
You do not need to clean the data, just identify the data types.
| Column name | Current data type | Ideal data type |
|---|---|---|
| Accession Number | String | String |
| Department | String | String |
| Acquisition Year | String (mixed) | Integer or date |
| Credit Line | String | String |
| On Display | String | Boolean |
| Gallery Number | String (mixed) | Integer |
| Last Updated | String (mixed dates) | Date |
Notes:
- Accession Number looks numeric but contains letters and punctuation, so it must be text.
- Acquisition Year mixes numbers with "Unknown", forcing the column to be stored as text.
- On Display represents a yes/no value but is stored as strings.
- Gallery Number includes numeric values and missing data (NA), which often results in text storage.
- Last Updated represents dates, but inconsistent formats prevent it from being treated as a date automatically.
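Moving a column from its current type to the ideal one usually means deciding how to treat the odd values. A sketch for the On Display and Gallery Number columns; treating "NA" and "Unknown" as missing is an explicit assumption:

```python
def to_bool(value):
    """Map 'Yes'/'No' strings to booleans; anything else becomes None."""
    return {"Yes": True, "No": False}.get(value)

def to_int(value, missing=("NA", "Unknown", "")):
    """Convert numeric text to int, treating listed markers as missing."""
    if value in missing:
        return None
    return int(value)

print([to_bool(v) for v in ["Yes", "No", "Yes"]])  # [True, False, True]
print([to_int(v) for v in ["802", "305", "NA"]])   # [802, 305, None]
```

How missing values should be represented (None, a sentinel number, a blank) is itself a data management decision worth recording.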
Where research data comes from
Research data can originate from many different sources. Understanding where data comes from helps researchers assess its reliability, limitations, and appropriate uses.
What is a data source?
A data source is the origin of the data - where it was collected, generated, or obtained. This could be a person, an instrument, a database, a sensor, or a computational process.
Quick check
Which of the following could be considered a data source?
- A spreadsheet downloaded from a website
- A survey respondent
- A microscope
- A computer model
Answer: All of them.
Primary data
Primary data is data collected directly by the researcher for a specific research question. This might include surveys, interviews, experiments, field observations, or measurements. Primary data offers high relevance but often requires more time and resources to collect.
Reflect
Have you ever collected primary data?
- What made it valuable?
- What made it challenging?
Secondary data
Secondary data is data that was originally collected by someone else for a different purpose and reused in a new study. Examples include government statistics, museum collections, published datasets, or previously published research data. Secondary data saves time but may not perfectly match the research question.
Generated or synthetic data
Generated or synthetic data is created through computational processes such as simulations, models, or algorithms. This includes data produced by climate models, agent-based simulations, or machine learning systems. Synthetic data is useful for testing hypotheses or protecting privacy, but depends heavily on the assumptions of the model.
Sensor and observational data
Sensor and observational data is collected automatically or systematically through observation, often over time. Examples include environmental sensors, satellite imagery, traffic counters, or wildlife cameras. This data can be large and continuous, requiring careful storage and management.
Data from instruments, tools, and experiments
Scientific instruments, laboratory equipment, or specialised tools produce this type of data. Examples include microscope images, sequencing data, spectrometer readings, or experimental measurements. Instrument data often requires calibration, metadata, and specialised software to interpret.
Metadata moment
Why might metadata (for example, calibration settings or units) be especially important for this kind of data?
Examples of data sources in different disciplines
Different fields rely on different data sources. For example, historians may use archival documents, social scientists may use surveys or census data, natural scientists may collect experimental measurements, and digital humanities researchers may work with digitised texts or images.
Considering data quality and limitations
Every data source has limitations. Researchers should consider how the data was collected, potential biases, missing values, accuracy, and whether the data is appropriate for their research question. Understanding these limitations is essential for responsible analysis and interpretation.
Classify it
You download a CSV file of air pollution measurements collected by a government agency.
Is this:
- Primary data
- Secondary data
Secondary data because you didn’t collect it yourself.
Challenge: What kind of data source is this?
Below are several research scenarios. For each one, decide what type of data source is being described.
You may find that more than one category could apply; choose the best fit.
A researcher records temperature and humidity every 10 minutes using a weather station on a university rooftop.
A PhD student analyses digitised letters from a national archive that were scanned and published online by another institution.
A social scientist designs and distributes a questionnaire to study students’ experiences of remote learning.
A computer scientist creates a simulated dataset to test how an algorithm behaves under different conditions.
A biologist collects gene expression data using a sequencing machine in a laboratory experiment.
1. Sensor and observational data (data collected automatically and repeatedly over time).
2. Secondary data (data reused from an existing collection created by others).
3. Primary data (data collected directly by the researcher for a specific study).
4. Generated or synthetic data (data created through simulation or computational processes).
5. Data from instruments, tools, and experiments (data produced by specialised scientific equipment).
Introduction to research data management
Content from Structuring Research Materials
Last updated on 2026-03-03
Overview
Questions
- How can you structure data using a standard folder system for better organisation?
- What are the benefits of using a consistent file naming convention in research data management?
- Why is version control important, and how can it be incorporated into file naming practices?
- In what ways can version control tools like Git and GitHub be useful for managing data?
Objectives
- Organise your research data into a standard folder structure
- Name files with a consistent naming convention
- Understand why version control is important, and how to incorporate this into your naming conventions
- Explain why version control software such as Git/GitHub can be useful for certain types of data.
Folder systems
Alex has recently started a PhD on a project that has been running for a few years. He has been given access to the project’s folders and has been asked by his supervisor to look through some files left by a researcher who recently left.
Organising files into a folder structure Part I
In groups, look through Folder 2.1 and discuss the following questions:
- What problems can you identify with how files are organised?
- How many different datasets can you identify?
- Which files are ‘raw’ vs ‘processed’ data?
- How would you improve the organisation of the files?
When thinking about how you could improve the organisation of the files, consider whether it might be easier to split them between different folders and whether any might need renaming
Alex goes to his supervisor and explains the problems he has found. His supervisor asks him to improve the organisation of the files as he is concerned that no-one will be able to find anything.
Organising files into a folder structure Part II
Individually, look at Folder 2.1 again and, within it create a set of folders to organise each file into. You may want to create subfolders inside some of these folders too. Organise the files into your folders.
- Alex sees that during the project the researcher gave a presentation at a conference and that alongside the slides, there are lots of documents relating to his attendance. Think about the different ways files related to the conference might be stored and consider which might be better for a team needing to share files, versus how an individual might be happy to store them.
- There are also various files related to a data analysis: can you see the different stages of that analysis process? How might you organise them?
Poor organisation can make it difficult to find files or to even see that a specific file exists. This can become a massive problem where multiple people are working together or on projects that run over a number of years. Out-of-date versions of files may end up being used and shared, and important documents may be effectively lost. Relying on search tools to find documents assumes that you know that the document exists, and that you know how it was named; if you weren’t the person who created it how would you know about it? If you think back to documents you created a few years ago, would you still be able to say what they were all called, what the latest versions were called, and what they all related to?
Taking a few moments to think about the structures you use to store files can save a lot of stress, and time, both for you and anyone you work with. If you work across multiple projects it can be worth coming up with a consistent approach, so that you and anyone you work with always knows where in the folder structure to find the same types of files. Structure folders hierarchically too: start broad and drill down into specific areas.
It can be worth thinking of an old fashioned set of filing cabinets:

- each filing cabinet is a project
- each drawer is an aspect of the project e.g. drawer 1 for data collection; drawer 2 for analysis; drawer 3 for papers and presentations
- within each drawer are folders containing files about specific subsections of that aspect e.g. in drawer 2 there are separate folders for code, raw data, cleaned data, graphs/ figures, and reports
- within each folder in each drawer, there may be further sub-sections….
However, do be sensible about the level and number of folders you use: if you have lots of folders that only contain one file, you may have too many, making it more difficult and time-consuming to navigate. If you have very few folders, then there may be too many files in a folder, making it difficult to find the relevant one.
Give each folder a name that is meaningful and concisely describes the contents of the folder, such as “raw_data”, “conference_presentations”, “expenses”.
Whatever structures you choose, it is worth periodically reviewing them to make sure they are still fulfilling their purpose. Perhaps a section of the project folders can be archived? Perhaps there are now enough files of a particular type to necessitate a new folder?
In summary:
- Use folders! Don’t just save files onto your desktop and expect to be able to find everything in future.
- Structure hierarchically. Start broad and drill down into specific areas or projects
- Use a sensible number of folders. Too few or too many may both make it difficult or time-consuming to find files.
- Use sensible names. Consider project names and the types of files in each folder, such as “raw_data”, “conference_presentations” or “expenses”
- Develop a consistent approach across projects
- Review folder content periodically, and consider moving folders and files that are no longer needed into an ‘archive’ folder
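A consistent structure can even be scripted, so every new project starts the same way. A minimal Python sketch; the folder names are just examples following the advice above, not a required standard:

```python
from pathlib import Path

def create_project(root, subfolders=("raw_data", "clean_data", "code",
                                     "figures", "reports", "archive")):
    """Create a standard folder skeleton for a new project."""
    for name in subfolders:
        Path(root, name).mkdir(parents=True, exist_ok=True)

create_project("example_project")
print(sorted(p.name for p in Path("example_project").iterdir()))
```

Running the same helper for each new project guarantees that collaborators always find the same layout.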
File naming
Alex takes another look at the folder system he has created. The files are easier to search through, but he notices that there are lots of inconsistencies in how those files are named.
Naming Files Part I
Open Folder 2.2 and look at the names of the files. Can you identify any problems with the way the files are named? What kinds of issues might they cause for those working on the project?
Poor file naming practices can make it difficult and time-consuming to find files, and lead to people working on the wrong files, or even overwriting important files, thus losing important data. Just as with folder structure, taking some time early in a project to develop a naming convention can save time and effort in the long run, both for your future self, and for any colleagues you work with. Below are some key considerations when creating file names:
What information to include
Carefully consider what information someone would need about the file to know it is the one they want. Do they need to know when it was created? Do they need to know what type of data it contains, for example, raw data, clean data? Does the file relate to a specific ID number? Some of those items of information might be good candidates to form part of the file name.
Whether you need to be able to order the files by a characteristic
For example, will you need to be able to easily select the file that was created most recently? Or quickly find a file by a sample ID? If so, you will want that element of the name to be at the beginning of the file name. Sometimes you might have to prioritise one of those requirements: for example, maybe you need to find the most recent file relating to a specific sample, in which case you might name the files using the format:
sampleID_date
Special Characters and Spaces
Avoid using special characters (such as ? # ! " £ $ % ^ & * { } @ / | < >) as operating systems and apps may handle these very differently, sometimes being completely unable to open a file that has them in its name, or not recognising them at all. Some special characters have a meaning in particular programming languages, and may be interpreted as instructions to the computer rather than as part of the file name.
Similarly, spaces in file names can cause problems:
Naming Files Part II
Look at the file name below:
STAR final results.xls
How do you think this might be interpreted by a computer? How might you rewrite the name to avoid that?
If you have spaces in a file name, the computer may interpret a space as marking the end of the file name, and therefore not treat the rest of the name as part of it. Alternatively, it may interpret the name as several file names listed one after the other, e.g. STAR final results.xls is either:
1 file named STAR followed by the command ‘final’…
3 files named:
- STAR
- final
- results.xls
A better way to write this file name would be STAR_final_results.xls or STAR-final-results.xls
Recommendation: Use only numbers and letters (without accents) and use hyphens and underscores instead of spaces, to separate the different parts of the file name.
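A small helper can apply this recommendation automatically. The sketch below uses one reasonable set of replacement rules (spaces become underscores, other risky characters are dropped); these rules are a choice, not the only option:

```python
import re

def safe_filename(name):
    """Replace spaces with underscores and drop other risky characters."""
    name = name.replace(" ", "_")
    # Keep only unaccented letters, digits, dots, hyphens and underscores.
    return re.sub(r"[^A-Za-z0-9._-]", "", name)

print(safe_filename("STAR final results.xls"))    # STAR_final_results.xls
print(safe_filename("results £100 (draft)?.csv")) # results_100_draft.csv
```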
Dates
Naming Files Part III
Look at the file names below:
01022026_sputum_culture_results.csv
03_09_2025_sputum_culture_results.csv
05Jun25_sputum_culture_results.csv
120126_sputum_culture_results.csv
12252026_sputum_culture_results.csv
What order were those files created in? Are you sure?
If the files came from laboratories in both the UK and in the USA, would that raise any concerns about how to read the dates on the files?
Dates are a frequent cause of issues for researchers. Researchers from different countries may read date numbers differently: “05062026” may be one person’s 5th of June (e.g. in the UK), while for another it’s 6th of May (e.g. in the USA). Ensuring that everyone looking at the date reads it correctly can be the difference between the correct file being selected, and the wrong one.
Naming Files Part IV
Look at those file names again:
01022026_sputum_culture_results.csv
03_09_2025_sputum_culture_results.csv
05Jun25_sputum_culture_results.csv
120126_sputum_culture_results.csv
12252026_sputum_culture_results.csv
Note that they were given in the order they would appear in a folder, i.e. in numerical and alphabetical order.
Can you think of a better way to write the dates, so that the files appear in date order?
A good format for dates is YYYYMMDD (the ISO 8601 standard), or YYYY-MM-DD. This format ensures that everything can be easily ordered by year, then month, and then day. Seeing the year at the start of the date also indicates to those looking at the files that the date is probably being handled in this way.
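Because alphabetical order and date order coincide for YYYYMMDD names, an ordinary sort puts the files in chronological order. The file names below are illustrative, rewriting the earlier examples into the recommended format:

```python
files = [
    "20260201_sputum_culture_results.csv",
    "20250903_sputum_culture_results.csv",
    "20250605_sputum_culture_results.csv",
]

# An ordinary alphabetical sort is now also a chronological sort:
for name in sorted(files):
    print(name)
```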
Version control
Alex has now organised the folders and renamed many of the files. Things look much better, but as he works through the project, he notices something worrying.
For several documents, there are multiple slightly different versions of the same file, and it’s not always clear:
- Which one is the most recent
- What changed between versions
- Who made those changes, or why
Some files include ‘final’ in the name… sometimes more than once.
Versions Everywhere
Look at the files in Folder 2.3:
- How many different versions of the same document can you find?
- How can you tell which one is the “latest”?
- Are you confident you’d pick the right file to work on?
- What might go wrong if different people used different versions?
You don’t need to open the files to answer these questions.
Focus on the file names themselves:
- Look for words like final, draft, v1, v2, or dates
- Notice whether the versioning scheme is consistent across files
- Ask yourself what assumptions you’re making when deciding which file is “latest”
Would someone new to the project make the same assumptions?
Version control is about tracking change over time, so that you can:
- See what changed
- Return to earlier versions if needed
- Understand how a file reached its current state
Just setting out clear file naming conventions and using them consistently can go a long way towards ensuring you can do all of these things.
Recommendations
- Use version numbers to indicate the order that document versions were created in. Decimal points can be used to indicate intermediate versions; whole numbers to indicate major versions at key points in the document’s lifecycle. For example, v0.1 might indicate a very early first draft of a paper, v1.0 might be the first version that was circulated to other authors for initial comments, v1.1 an updated version based on the comments, v2.0 might be the version that was submitted to the publisher, and v3.0 might be the version that was resubmitted after revisions based on reviewer comments.
- Sometimes appending dates may be more appropriate, if so use the YYYYMMDD format and consider its position in the file name so that the files are always ordered correctly if sorted alphabetically
- Where a document has reached a key point in its lifecycle, such as being submitted to a publisher, it may be helpful to append a short word or phrase such as “Submitted”, “Submitted revision”, to clarify that (but do use version numbers too!)
- Where there are lots of versions of a document in a folder, it may be appropriate to create a subfolder to keep previous versions in: this helps you and any colleagues to be able to quickly find the current version, particularly important if the document is a Standard Operating Procedure or Manual.
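If the vN.M scheme is applied consistently, even the version numbers can be sorted reliably by software. A sketch (note that plain alphabetical sorting would put v10 before v2, which is why the numbers are compared as integers):

```python
import re

def version_key(filename):
    """Extract (major, minor) from a name like 'paper_v1.2.docx'."""
    match = re.search(r"v(\d+)\.(\d+)", filename)
    return (int(match.group(1)), int(match.group(2)))

drafts = ["paper_v2.0.docx", "paper_v0.1.docx",
          "paper_v10.0.docx", "paper_v1.1.docx"]
print(sorted(drafts, key=version_key))
```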
However, while file naming conventions can help, they aren’t always enough on their own.
Version control tools
For some types of work, particularly text-based files like code, scripts, and documentation, specialised version control tools can be extremely useful.
These tools are designed to:
- Automatically record changes
- Keep a history of edits
- Show exactly what changed between versions
- Support collaboration without overwriting others’ work
When might tools help?
Consider the following types of files:
- Word documents
- Spreadsheets
- Analysis scripts (e.g. R, Python)
- Survey questionnaires
- Images or PDFs
In small groups, discuss:
- Which of these change frequently?
- Which are hard to merge if two people edit them?
- Which might benefit most from automated version tracking?
Imagine two people editing the file at the same time: what would be easy to reconcile, and what would be painful?
Not all files benefit equally from version control tools. Files such as images or Excel files (known as binary files, i.e. non-text files) can be harder to manage, while plain text files work particularly well.
Thinking ahead
Without worrying about how to use them yet:
- What advantages might a version control tool offer over manual file naming?
- What new challenges might it introduce?
- In what situations might it be unnecessary or overkill?
You might want to think about: how do you currently keep track of changes? What information gets lost when files are renamed or overwritten?
Also consider scale: one person vs a team? One week vs several years?
Git and GitHub
Git is a version control tool, and GitHub is a platform that hosts Git projects and supports collaboration.
They are particularly useful for certain types of data, especially files that are:
- text-based (such as code, scripts, and documentation)
- edited frequently
- worked on by more than one person
They are generally less helpful for files like images, PDFs, spreadsheets, or heavily formatted documents (such as Word or LibreOffice files), where combining changes is difficult.
File naming can help manage major versions (for example, report_v1, report_v2), but version control tools go further; they can record every change and allow you to return to a specific point in time, a bit like “Track Changes” for an entire project.
You won’t be expected to use Git or GitHub yet. For now, it’s enough to understand why such tools exist and when they might be useful.
Terminology preview
You may hear version control tools described using terms like:
- Repository: a project’s home, containing the files and their record of changes
- Commit: a saved snapshot of changes, with a short note about what was done
- History: the timeline of commits showing how the project evolved
You don’t need to know how to use these yet. For now, think of them as names for ideas you’ve already encountered when trying to keep track of different versions of files.
Content from Tabular Data Collection
Last updated on 2026-02-04
Overview
Questions
- What types of variables are commonly found in tabular data?
- What kinds of data inconsistencies can affect the quality of a dataset?
- What are some common causes of inconsistent or messy data?
- What practices can help ensure clean, consistent data during collection and entry?
- Why is it important to provide clear instructions or rules when collecting data?
- What is a data dictionary, and why is it useful?
Objectives
After following this episode, learners will be able to:
- List variable types and formats
- Identify inconsistencies in data that can cause problems during analysis
- Describe methods that can be used during data collection and data entry that can prevent inconsistencies
- Write guidance for how to collect and enter data
- Create a data dictionary describing a dataset
Variables, data types and formats
Alex has received a dataset from the MET museum and needs to understand the types of variables before exploring or analysing it further.
Follow along: Open up the dataset
You should have downloaded a dataset called Met_Objects_Dataset_sample.txt as part of the setup instructions. Please open this file in whatever spreadsheet software you are using (e.g. LibreOffice, Excel). The file is tab delimited (i.e. within each row, a tab character separates the values into their columns), so you may need to use whatever Text to Columns tool your spreadsheet software provides to convert it into columnar data. The first row contains the column headers.
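The same file can also be read programmatically. The sketch below uses a small in-memory stand-in for a few lines of the tab-delimited data; reading the real file works the same way, passing its filename to open() instead:

```python
import csv
import io

# Stand-in for a few lines of the dataset: tab-delimited, header row first.
sample = (
    "Object ID\tTitle\tArtist\n"
    "436121\tWater Lilies\tClaude Monet\n"
)

reader = csv.reader(io.StringIO(sample), delimiter="\t")
headers = next(reader)   # first row: column headers
rows = list(reader)      # remaining rows: one artwork each

print(headers)  # ['Object ID', 'Title', 'Artist']
print(rows[0])  # ['436121', 'Water Lilies', 'Claude Monet']
```

Note that every value comes back as a string; deciding which columns to convert to other types is exactly the task of this episode.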
What is a data point?
A data point is a single piece of information collected for one variable about one item.
In the MET museum dataset Alex is using, each row is an object (like a painting or sculpture), and each cell is a data point.
Alex’s dataset looks like a spreadsheet, but underneath, each column contains a specific data type, which we covered in section 1. Knowing these helps to avoid errors and choose the right tools for analysis.
What is a variable?
A variable is a characteristic or attribute that can take on different values. In tabular data, variables are usually represented as columns, where each row contains an observation or entry.
However, the concept of a variable is independent of format: it’s not defined by being a column, but by being a consistent type of information collected across observations.
For example, in Alex’s MET dataset, variables might include objectid, artistname, or dateacquired.
Types of variables
Numeric variables
Variables that represent measurable quantities. These can be integers or floats. Numeric variables can be Discrete, which means they take on specific, separate values (often counts), or Continuous, which can take on any value within a range (often measurements).
Examples:
- objectid → 12345 (integer, discrete — a unique ID number)
- heightcm → 23.5 (float, continuous — a measurement in centimeters)
- objectdate → 1890 (integer, discrete — a specific year)
String variables
Free-form or descriptive text.
Examples:
- `artistdisplayname` → `"Claude Monet"` (string)
- `title` → `"Woman with a Parasol"` (string)
Categorical variables
Variables that represent groups or categories. These could be
strings, integers, or floats - anything used to label a category!
Categorical variables can be Nominal, which means there is no
inherent order (e.g., artistnationality), or
Ordinal, which means the categories follow a logical order
(e.g., popularity).
Examples:
- `gender` → `"Female"`, `"Male"` (string, nominal)
- `medium` → `"Marble"`, `"Bronze"`, `"Oil on canvas"` (string, nominal)
- `istimelinework` → `"Yes"` / `"No"` (string, nominal — or Boolean: `TRUE`/`FALSE`)
- `artistdecade` → `1950`, `1960`, `1980` (integer, ordinal — ordered decades)
Date/time variables
Variables that represent dates or times.
Examples:
- `lastconserv` → `"2001-05-12"` (datetime or string)
- `objectdate` → `1990` (integer), `"ca. 1890"` (string)
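As a rough sketch, the variable types above map naturally onto Python’s built-in types. The column names come from the lesson; the values below are invented for illustration.

```python
from datetime import date

# Illustrative record for one museum object; the values are made up.
object_record = {
    "objectid": 12345,                    # numeric, discrete (integer)
    "heightcm": 23.5,                     # numeric, continuous (float)
    "artistdisplayname": "Claude Monet",  # string (free text)
    "medium": "Oil on canvas",            # categorical (nominal), stored as a string
    "istimelinework": True,               # Boolean
    "lastconserv": date(2001, 5, 12),     # date/time
}

for name, value in object_record.items():
    print(f"{name}: {value!r} ({type(value).__name__})")
```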
Note on overlapping types:
Some variables can belong to more than one category depending on their use and format. For example:
- `objectdate = 1890` might be treated as a numeric variable (discrete integer) if used for sorting or calculations.
- The same `objectdate` could also be considered a date/time variable if formatted as `"1890-01-01"` and used in time-based analyses.
- `artistdecade = 1950` could be a categorical variable (ordinal) if grouped into decade-based categories for comparison.
It’s okay for a single value to have more than one interpretation - what matters is how it’s used in context.
⚠️ Some columns might look like numbers but contain inconsistent formats (e.g., “ca. 1890”). These need cleaning before they can be analysed as dates.
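One common first cleaning step for such columns is to pull out just the four-digit year, whatever the surrounding format. A minimal sketch in Python (a real cleaning pass would need to handle more cases, e.g. date ranges or BCE dates):

```python
import re

def extract_year(raw):
    """Return the first four-digit number found in a messy date value, or None."""
    match = re.search(r"\b(\d{4})\b", str(raw))
    return int(match.group(1)) if match else None

# The example values below come from the lesson's description of the column.
for value in ["1990", "ca. 1931", "07/06/2019 00:00"]:
    print(value, "->", extract_year(value))
```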
Summary
Understanding the difference between conceptual types (how the data is used or interpreted) and technical types (how the data is stored or formatted) is key for working effectively with tabular data. For example, a column might be technically an integer but conceptually a category (like decades or survey scores).
| Conceptual Type | Technical Type | Description | Example |
|---|---|---|---|
| Nominal | String | Categories, no order | artistnationality = Australian |
| Ordinal | String | Categories with order | popularity = high |
| Discrete Numeric | Integer | Countable numbers | objectid = 123456 |
| Continuous Numeric | Float | Measurable, decimals allowed | height = 27.5 |
| Boolean | Boolean | Yes/No, True/False | ishighlight = TRUE |
| Date/Time | Datetime | Dates or times | lastconserv = 2021-11-12 |
| Textual | String | Free text | artistdisplayname = Claude Monet |
| Identifier | Integer/String | Unique reference | objectnumber = 1982.456 |
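To make the conceptual/technical distinction concrete, here is a small Python sketch: decades stored technically as integers, but used conceptually as ordered categories. The values are invented.

```python
# The same values have a technical type (how they are stored) and a
# conceptual type (how they are used).
artistdecade = [1950, 1980, 1960, 1950]

# Technically these are integers...
assert all(isinstance(d, int) for d in artistdecade)

# ...but conceptually they are ordinal categories: we care about grouping
# and order, not arithmetic (1950 + 1960 is meaningless here).
categories = sorted(set(artistdecade))
print(categories)  # [1950, 1960, 1980]
```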
Tip for learners (like Alex):
Understanding both the conceptual meaning and the technical format of your data helps you clean it correctly, document it clearly, and analyse it without errors.
Identify inconsistencies in data
Before we can clean or analyse data, it’s important to check for inconsistencies: values that don’t follow a standard or expected format. These might include:
- Different spellings or formats for the same category
- Mixed use of upper/lower case
- Inconsistent date formats
- Unexpected blank or missing values
- Invalid or impossible values (e.g. negative heights, future birth dates)
These inconsistencies can lead to errors or misleading results if not corrected.
Example: Inconsistencies in the `artistgender` column
Here’s an example of how the same concept (“artistgender”) can be recorded in many different ways:
| objectid | artistgender |
|---|---|
| 1001 | Female |
| 1002 | female |
| 1003 | F |
| 1004 | Male |
| 1005 | MALE |
| 1006 | M |
| 1007 | Unknown |
| 1008 | |
We can see:
- `"Female"`, `"female"`, and `"F"` all refer to the same category
- `"Male"`, `"MALE"`, and `"M"` are also equivalent
- `"Unknown"` and the blank entry might indicate missing or uncertain data
These differences need to be standardised before analysis — for example, converting all values to lowercase and replacing shorthand terms with full words.
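A minimal Python sketch of that standardisation, assuming the mapping shown in the example table above (the mapping itself is our choice, not part of the dataset):

```python
# Assumed mapping from the variants seen in the table to standard values.
GENDER_MAP = {"f": "female", "female": "female", "m": "male", "male": "male"}

def standardise_gender(value):
    """Lowercase the value and expand shorthand; anything else becomes 'unknown'."""
    key = str(value).strip().lower()
    return GENDER_MAP.get(key, "unknown")  # "Unknown" and blanks both land here

raw = ["Female", "female", "F", "Male", "MALE", "M", "Unknown", ""]
print([standardise_gender(v) for v in raw])
```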
Challenge 1: Can you find any inconsistencies or problems with data entered into a spreadsheet?
Let’s have a deep dive into the Met_Objects_Dataset_sample.txt dataset. Using a coloured fill, identify any inconsistencies or problem data in the spreadsheet that you think might cause problems for anyone analysing the data.
Inconsistencies might include measurements in different units, differing formats for dates, differing cases, or the same thing indicated in a variety of different ways.
Next, we’ll look at how to avoid these kinds of issues from happening in the first place.
Prevent inconsistencies during data collection
As we’ve seen so far, our dataset contains a number of inconsistencies that will complicate analysis. In an ideal world, we would have avoided introducing these errors while collecting the data. It’s always simpler to avoid inconsistencies in the first place, rather than trying to fix them later!
How could we have adjusted our data collection to avoid this? Let’s
take the lastconserv column as an example, which represents
the date when the object was last conserved. Here we see a large number
of different date / time formats, including:
- 28/01/2025 = day / month / year
- 07/21/2023 = month / day / year
- 26.03.23 = day.month.year
- 07/06/2019 00:00 = day / month / year hour:minute
To avoid this, we could have enforced a specific date/time format during collection. For example, if we were using a form, we could have limited responses in this field to only accept dates as year-month-day, with no time entry allowed.
There are also some incorrect dates in this column, e.g. 30/2/2024 (30th February 2024). February only has 28 days, or 29 during a leap year, so this date is impossible. We could have avoided this by providing some kind of date validation in the form, e.g. using a calendar input that only contains real dates.
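If a form with a calendar input isn’t available, the same validation can be applied after the fact. A sketch in Python, assuming the enforced year-month-day format suggested above:

```python
from datetime import datetime

def is_valid_date(value, fmt="%Y-%m-%d"):
    """Check that a value is a real calendar date in the enforced format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:  # raised for wrong formats and impossible dates
        return False

print(is_valid_date("2024-02-29"))  # True: 2024 is a leap year
print(is_valid_date("2024-02-30"))  # False: February never has 30 days
print(is_valid_date("28/01/2025"))  # False: wrong format
```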
Some general guidelines
- Avoid free text fields during data collection. Free text increases the risk of spelling mistakes, additional spaces etc. that will complicate the final analysis.
- If a column should only contain particular values, then enforce this! For example, you could use a drop-down menu with set options to choose from.
- Add validation to avoid ‘impossible’ values. For example, are values only valid within a certain range? Are negative values valid?
- Where multiple formats are possible (e.g. with dates/times), enforce a specific format.
Challenge: Methods to prevent inconsistencies during data collection
In a small group, consider how you could prevent the other inconsistencies you identified in the dataset. What checks or rules could you introduce during data collection?
There are many different solutions to these inconsistencies, but here are some examples:
- `istimelinework` could have used a drop-down menu that enforced only two choices of `True` or `False`.
- A check could have been added to enforce that `accessionyear` (the year the object entered the collection) is always after `objectdate` (the year the object was created).
- `artistnationality` could have used a drop-down menu containing set nationality options. This would have avoided inconsistencies like `France` vs `French`.
Write data collection guidelines
As we saw in the last section, there are many additional checks / rules we could have added during data collection to make our dataset more consistent and easier to analyse. It’s good practice to document these rules before data collection takes place so that it’s clear to yourself, along with any collaborators and future users of your dataset, exactly how values were collected. This will also be invaluable when it’s time to write the methods section of any papers or reports that use this data.
Make sure you include information about how to handle missing values.
How will these be represented in your table? It is also useful to
explicitly state why a value may be missing (if possible). For example,
in our dataset some objects are made by manufacturing companies like
United Merchants & Manufacturers rather than an
individual artist - in this case artistgender will be
missing, as it doesn’t apply in this scenario.
Write data collection guidelines
Choose a variable from the dataset (e.g. `lastconserv`) and write some bullet-point guidelines for its collection. For example:

- Which values are valid for this variable?
- Which format should be used?
- Are any checks required against other variables in the table?
- If a value is missing, how should it be represented? E.g. `NA`, `None`, `not applicable`
Data Dictionaries
What is a Data Dictionary?
A data dictionary is a table that describes the variables in your dataset. It provides key information such as:
- Variable name (column header)
- Description (what the variable represents)
- Data type (e.g., string, integer, boolean, datetime)
- Possible values or format (especially for categorical variables)
- Units (if relevant)
Data dictionaries help others (and future you!) understand and use your data consistently and correctly.
Example: Wildlife Observations Dataset
Here’s a sample data dictionary for a fictional dataset tracking wildlife sightings in a nature reserve:
| Variable Name | Description | Data Type | Possible Values / Format | Units |
|---|---|---|---|---|
| `sighting_id` | Unique ID for each observation | Integer | 1, 2, 3, … | N/A |
| `species_name` | Name of the animal species observed | String | e.g., “Red Fox”, “Barn Owl” | N/A |
| `count` | Number of individuals seen | Integer | 0, 1, 2, … | Count |
| `observation_date` | Date the observation was recorded | Datetime | YYYY-MM-DD | N/A |
| `location` | Area of the park where sighting occurred | String | “North Woods”, “Wetland Trail” | N/A |
| `is_endangered` | Whether the species is endangered | Boolean | TRUE, FALSE | N/A |
Challenge: Write a Data Dictionary for Alex
Alex is trying to make sense of the MET museum dataset. Help Alex out by creating a mini data dictionary!
- Open the file `Met_Objects_Dataset_sample.txt`
- Choose three variables (columns) from the dataset
- For each one, write down:
  - The variable name
  - A short description
  - The data type (e.g., string, integer, date)
  - Any possible values or units, if relevant
Work in pairs or small groups and compare your answers.
Content from How to clean a tabular dataset
Last updated on 2025-03-28 | Edit this page
Overview
Questions
- What is ‘clean’ data?
- How can we find inconsistencies in tabular data?
- How can we correct inconsistencies in tabular data?
Objectives
- Describe what data cleaning is and why it is important
- Find and resolve inconsistencies within a tabular dataset programmatically (e.g. datetime formats, numeric precision)
- Identify missing values within a tabular dataset using filters
- Correct spelling mistakes using spell check tools and find + replace
- Standardise text formats using spreadsheet functions
- Describe the pros and cons of using spreadsheets for data collection and cleaning
Challenge 1: Can you do it?
Open film_dataset.csv.
- How many missing values are there in the ‘film_title’ column?
- Are there any duplicate entries in the dataset? If so, how many?
- There are 7 missing values in the `film_title` column
- There are 5 duplicate rows in the dataset
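These checks can also be done programmatically. Here is a hedged sketch using a tiny made-up CSV standing in for film_dataset.csv (the values are invented, so the counts differ from the answers above):

```python
import csv
import io

# A tiny illustrative stand-in for film_dataset.csv.
data = """film_title,year
Alien,1979
,1982
Alien,1979
Blade Runner,1982
,1999
"""
rows = list(csv.DictReader(io.StringIO(data)))

# Missing values: rows where film_title is blank.
missing = sum(1 for r in rows if not r["film_title"].strip())

# Duplicate rows: total rows minus the number of distinct rows.
duplicates = len(rows) - len({tuple(r.items()) for r in rows})

print("missing titles:", missing)     # 2
print("duplicate rows:", duplicates)  # 1
```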
Content from Introduction to R
Last updated on 2025-03-28 | Edit this page
Overview
Questions
- What is….
Objectives
- Objective 1