Content from What is Research Data?
Last updated on 2026-02-05
Estimated time: 60 minutes
Overview
Questions
- What is research data, and why is it important in academic and scientific research?
- What are the different types of research data?
- Where can research data come from?
- What are the key components of research data management (RDM)?
Objectives
- Data types
- Sources of data
- What is research data management (collection, storage, organisation, sharing etc)
Understanding data types
Alex is a researcher studying artworks in The Metropolitan Museum of Art. They have just received a dataset containing information about paintings, sculptures, textiles, drawings, and photographs from across the museum’s collection. Before Alex can analyse anything, they need to understand what kinds of data the dataset contains.
Even though the dataset looks like a simple spreadsheet, each column has an underlying data type, and knowing these types will help Alex (and you!) avoid errors, clean the dataset effectively, and choose the right kinds of visualisations or analyses.
Quick think
Before reading on, ask yourself:
- What kinds of columns would you expect to see in a museum dataset?
Titles? Dates? Measurements? Artist names?
What is a data type?
A data type describes the kind of information a value represents. It tells the computer how to interpret the data:
- Is it text?
- A number?
- A date?
- A true/false flag?
When a dataset mixes formats (e.g., a date stored as text, or a number stored as a string), analysis becomes harder and mistakes are more likely. Alex will soon discover that the MET dataset contains a mix of clean values and some messy ones - for example, dates written as 1990, “ca. 1931”, and “07/06/2019 00:00”. Understanding data types helps Alex make sense of this variation.
Why data types matter
Data types are important because they affect how the computer reads, stores, and analyses information. If a column is stored in the wrong format, it can lead to errors, misleading results, or limitations in what you can do with the data.
In summary: knowing data types helps ensure the dataset is trustworthy, analysable, and ready for exploration.
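As a small illustration of how the wrong format can mislead, the Python sketch below sorts the same three heights twice: once stored as text and once stored as numbers. The values are invented for this example and the variable names are hypothetical.

```python
# Hypothetical heights in centimetres, stored two ways.
heights_as_text = ["200.5", "27", "98.0"]
heights_as_numbers = [200.5, 27.0, 98.0]

# Text is compared character by character, so "200.5" sorts before "27".
sorted_text = sorted(heights_as_text)

# Numbers are compared by value.
sorted_numbers = sorted(heights_as_numbers)

print(sorted_text)     # ['200.5', '27', '98.0']
print(sorted_numbers)  # [27.0, 98.0, 200.5]

# Arithmetic needs real numbers, so the text must be converted first.
total_height = sum(float(h) for h in heights_as_text)
print(total_height)    # 325.5
```

Sorted as text, the tallest object appears first in what looks like an ascending list, which is exactly the kind of quiet error a wrong data type can introduce.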
Common data types, examples from the MET Museum dataset
Here are some common data types you’ll encounter:
- String: Text or characters, like "Claude Monet" or "Oil on canvas"
- Integer: Whole numbers, like 1985, 42, or 0
- Float: Decimal numbers, like 27.5 or 3.14
- Boolean: True/False values, like TRUE, FALSE, Yes, No
- Datetime: Calendar dates or timestamps, like "2020-01-01" or "12/11/2027"
Now that Alex has identified the main data types, let’s try classifying some values ourselves.
Challenge
Challenge: What data type is it?
Alex found the following values in the MET dataset. For each one, decide what data type it currently is (it may not be what you think it should be!).
"Claude Monet"
1872
"ca. 1931"
"07/06/2019 00:00"
2021-07-14
27.5
"Oil on canvas"
TRUE
Write down the data type you would assign to each value.
"Claude Monet" is a string
1872 is an integer
"ca. 1931" is a string (messy date)
"07/06/2019 00:00" is a string (looks like a date but stored as text)
2021-07-14 is a date
27.5 is a float
"Oil on canvas" is a string
TRUE is a boolean
Notice how several values look like dates or numbers but are stored as text - this is common in real datasets and affects how we analyse them.
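If you later work with the data in a programming language, the same distinctions appear there too. The sketch below uses Python purely as an example: it stores the challenge values and asks Python what type each one is. Python's type names (str, int, float, bool, date) differ slightly from the labels used above, and other tools use their own names.

```python
from datetime import date

values = [
    "Claude Monet",        # string
    1872,                  # integer
    "ca. 1931",            # string (an approximate date kept as text)
    "07/06/2019 00:00",    # string (looks like a date, stored as text)
    date(2021, 7, 14),     # date
    27.5,                  # float
    "Oil on canvas",       # string
    True,                  # boolean
]

# type(value).__name__ gives Python's own name for each data type.
type_names = [type(value).__name__ for value in values]
for value, name in zip(values, type_names):
    print(repr(value), "->", name)
```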
Identifying data types in your own dataset
So far, Alex has been looking at individual values. In practice, researchers usually work with whole datasets at once, often in spreadsheets, CSV files, or databases. Let’s think about how to identify data types when your data is laid out in columns.
Imagine Alex opens the MET Museum dataset in a spreadsheet. Each column represents a variable, and each row represents an artwork. The column headers might look something like this:
| Object ID | Title | Artist | Object Date | Medium | Is Public Domain | Height (cm) |
|---|---|---|---|---|---|---|
| 436121 | Water Lilies | Claude Monet | 1906 | Oil on canvas | TRUE | 200.5 |
| 459055 | Untitled | Unknown | ca. 1931 | Gelatin silver | FALSE | 27 |
| 12345 | Portrait of a Man | Rembrandt | 07/06/2019 | Oil on panel | TRUE | 98.0 |
Even without doing any analysis, Alex can already start identifying data types by asking a few simple questions about each column.
Look at the values, not just the column name
Column names are helpful, but they don’t always tell the full story. For each column, Alex checks:
- Are the values mostly text, numbers, dates, or true/false?
- Do all the values follow the same format?
- Are there any “odd” entries that don’t match the rest?
Challenge: Trust the name or the values?
Which of these columns would you inspect most carefully, and why?
ObjectID
Title
Artist
Object Date
Medium
Is Public Domain
Height_cm
Write down one reason based on the values you might expect to see.
Watch out for mixed data types in a single column
One of the most common problems in spreadsheets is mixing data types in the same column. Alex notices that Object Date contains:
- 1906 (looks like an integer)
- ca. 1931 (text)
- 07/06/2019 (date-like text)
Even though these all describe dates, the computer will usually treat the entire column as text, which makes it hard to sort, filter, or calculate with.
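One way to spot a mixed column is to check what each value looks like and count how many different kinds turn up. The Python sketch below is a rough illustration only: the function name guess_type is invented here, and it assumes that slash-separated dates are written day/month/year, which may not be true of the real data.

```python
from datetime import datetime

def guess_type(text):
    """Return a rough label for what a text value looks like."""
    try:
        int(text)
        return "integer"
    except ValueError:
        pass
    try:
        # Assumption: slash-separated dates are day/month/year.
        datetime.strptime(text, "%d/%m/%Y")
        return "date-like text"
    except ValueError:
        return "text"

object_dates = ["1906", "ca. 1931", "07/06/2019"]
guesses = {value: guess_type(value) for value in object_dates}
print(guesses)

# More than one label means the column is mixed and will be stored as text.
is_mixed = len(set(guesses.values())) > 1
print(is_mixed)
```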
Challenge: Thinking about date formats
Look at the following values:
1906
ca. 1931
07/06/2019
Which values would be easy to convert?
Which values would be difficult or ambiguous?
What information is missing?
What assumptions might you need to make?
- Some values only include a year, with no month or day.
- Some values include uncertainty or approximation (e.g. “ca.”).
- Some values depend on regional date conventions, making them ambiguous.
- Converting dates may require assumptions, additional metadata, or decisions about how to represent uncertainty.
These are common issues in real datasets and will be addressed later in the course.
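The ambiguity caused by regional conventions can be seen directly in code. In the Python sketch below, the same text is read once as day/month/year and once as month/day/year. Both readings succeed, so the computer cannot tell you which one was intended.

```python
from datetime import datetime

raw = "07/06/2019"

# Reading the value as day/month/year (common in the UK).
day_first = datetime.strptime(raw, "%d/%m/%Y").date()

# Reading the same value as month/day/year (common in the USA).
month_first = datetime.strptime(raw, "%m/%d/%Y").date()

print(day_first.isoformat())    # 2019-06-07
print(month_first.isoformat())  # 2019-07-06
```

Choosing between the two readings is a decision about the data that needs evidence, such as knowing where and how the value was recorded.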
Use spreadsheet tools to check data types
Spreadsheets don’t just display data; they also interpret it. Most spreadsheet software gives visual and functional clues that indicate how values are stored, which can help you identify the underlying data type of a column.
Challenge: What data type is this column?
Alex opens a different part of the MET dataset containing information about exhibitions and acquisitions.
| Column name | Values |
|---|---|
| Accession Number | 1975.1, 2003.45a, 1988.12 |
| Department | European Paintings, Asian Art, Modern and Contemporary |
| Acquisition Year | 1998, 2005, Unknown |
| Credit Line | Gift of John Smith, Purchase, Bequest |
| On Display | Yes, No |
| Gallery Number | 802, 305, NA |
| Last Updated | 2022-11-03, 03/07/2021, 15 Aug 2020 |
For each column:
- Decide what the data type currently is in the spreadsheet.
- Decide what the data type should ideally be for analysis.
Think about:
- Mixed formats and missing values
- Columns that look numeric but include text
- Dates written in different ways
You do not need to clean the data, just identify the data types.
| Column name | Current data type | Ideal data type |
|---|---|---|
| Accession Number | String | String |
| Department | String | String |
| Acquisition Year | String (mixed) | Integer or date |
| Credit Line | String | String |
| On Display | String | Boolean |
| Gallery Number | String (mixed) | Integer |
| Last Updated | String (mixed dates) | Date |
Notes:
- Accession Number looks numeric but contains letters and punctuation, so it must be text.
- Acquisition Year mixes numbers with "Unknown", forcing the column to be stored as text.
- On Display represents a yes/no value but is stored as strings.
- Gallery Number includes numeric values and missing data (NA), which often results in text storage.
- Last Updated represents dates, but inconsistent formats prevent it from being treated as a date automatically.
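Cleaning is not part of this challenge, but it can help to see what moving from the current type to the ideal type might involve. The Python sketch below is one possible approach, not a prescribed method: the helper names to_int_or_none and to_bool are invented for illustration, and it assumes that "NA" and "Unknown" are the only markers used for missing values.

```python
def to_int_or_none(text, missing=("NA", "Unknown", "")):
    """Convert text to an integer, keeping missing markers as None."""
    if text.strip() in missing:
        return None
    return int(text)

def to_bool(text):
    """Convert 'Yes'/'No' text to True/False."""
    lookup = {"yes": True, "no": False}
    return lookup[text.strip().lower()]

gallery_numbers = [to_int_or_none(v) for v in ["802", "305", "NA"]]
acquisition_years = [to_int_or_none(v) for v in ["1998", "2005", "Unknown"]]
on_display = [to_bool(v) for v in ["Yes", "No"]]

print(gallery_numbers)    # [802, 305, None]
print(acquisition_years)  # [1998, 2005, None]
print(on_display)         # [True, False]
```

Recording missing values as an explicit "no value" rather than as text is what allows the rest of the column to be treated as numbers.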
Where research data comes from
Research data can originate from many different sources. Understanding where data comes from helps researchers assess its reliability, limitations, and appropriate uses.
What is a data source?
A data source is the origin of the data - where it was collected, generated, or obtained. This could be a person, an instrument, a database, a sensor, or a computational process.
Quick check
Which of the following could be considered a data source?
- A spreadsheet downloaded from a website
- A survey respondent
- A microscope
- A computer model
Answer: All of them.
Primary data
Primary data is data collected directly by the researcher for a specific research question. This might include surveys, interviews, experiments, field observations, or measurements. Primary data offers high relevance but often requires more time and resources to collect.
Reflect
Have you ever collected primary data?
- What made it valuable?
- What made it challenging?
Secondary data
Secondary data is data that was originally collected by someone else for a different purpose and reused in a new study. Examples include government statistics, museum collections, published datasets, or previously published research data. Secondary data saves time but may not perfectly match the research question.
Generated or synthetic data
Generated or synthetic data is created through computational processes such as simulations, models, or algorithms. This includes data produced by climate models, agent-based simulations, or machine learning systems. Synthetic data is useful for testing hypotheses or protecting privacy, but depends heavily on the assumptions of the model.
Sensor and observational data
Sensor and observational data is collected automatically or systematically through observation, often over time. Examples include environmental sensors, satellite imagery, traffic counters, or wildlife cameras. This data can be large and continuous, requiring careful storage and management.
Data from instruments, tools, and experiments
Scientific instruments, laboratory equipment, or specialised tools produce this type of data. Examples include microscope images, sequencing data, spectrometer readings, or experimental measurements. Instrument data often requires calibration, metadata, and specialised software to interpret.
Metadata moment
Why might metadata (for example, calibration settings or units) be especially important for this kind of data?
Examples of data sources in different disciplines
Different fields rely on different data sources. For example, historians may use archival documents, social scientists may use surveys or census data, natural scientists may collect experimental measurements, and digital humanities researchers may work with digitised texts or images.
Considering data quality and limitations
Every data source has limitations. Researchers should consider how the data was collected, potential biases, missing values, accuracy, and whether the data is appropriate for their research question. Understanding these limitations is essential for responsible analysis and interpretation.
Classify it
You download a CSV file of air pollution measurements collected by a government agency.
Is this:
- Primary data
- Secondary data
Secondary data because you didn’t collect it yourself.
Challenge: What kind of data source is this?
Below are several research scenarios. For each one, decide what type of data source is being described.
You may find that more than one category could apply; choose the best fit.
A researcher records temperature and humidity every 10 minutes using a weather station on a university rooftop.
A PhD student analyses digitised letters from a national archive that were scanned and published online by another institution.
A social scientist designs and distributes a questionnaire to study students’ experiences of remote learning.
A computer scientist creates a simulated dataset to test how an algorithm behaves under different conditions.
A biologist collects gene expression data using a sequencing machine in a laboratory experiment.
Sensor and observational data (Data collected automatically and repeatedly over time.)
Secondary data (Data reused from an existing collection created by others.)
Primary data (Data collected directly by the researcher for a specific study.)
Generated or synthetic data (Data created through simulation or computational processes.)
Data from instruments, tools, and experiments (Data produced by specialised scientific equipment.)
Introduction to research data management
Content from Structuring Research Materials
Last updated on 2026-05-26
Estimated time: 60 minutes
Overview
Questions
- How can you structure data using a standard folder system for better organisation?
- What are the benefits of using a consistent file naming convention in research data management?
- Why is version control important, and how can it be incorporated into file naming practices?
- In what ways can version control tools like Git and GitHub be useful for managing data?
Objectives
- Organise your research data into a standard folder structure
- Name files with a consistent naming convention
- Understand why version control is important, and how to incorporate this into your naming conventions
- Explain why version control software such as Git/GitHub can be useful for certain types of data.
Folder systems
Alex has recently started a PhD on a project that has been running for a few years. He has been given access to the project’s folders and has been asked by his supervisor to look through some files left by a researcher who recently left.
Episode setup: Learners should have downloaded and unzipped the exercise materials as part of the lesson setup - Folder 2.1, 2.2, and 2.3 are all included. Check at the start of the session that everyone has done this before moving on to the first exercise.
Estimated timing: ~50 minutes of teaching plus 10 minutes of exercises, though the two folder structure exercises (Parts I and II) together often take 15–20 minutes including debrief.
Organising files into a folder structure Part I
In groups, look through Folder 2.1, without opening any of the files in it, and discuss the following questions:
- What problems can you identify with how files are organised?
- How many different datasets can you identify?
- Which files are ‘raw’ vs ‘processed’ data?
- How would you improve the organisation of the files?
When thinking about how you could improve the organisation of the files, consider whether it might be easier to split them between different folders and whether any might need renaming.
Running Part I live: Put learners into groups of 3–4. In online delivery, use breakout rooms for 5–7 minutes, then debrief in the main room. In-person, table groups work well. Remind learners not to open any files - the point is what they can (and cannot) tell from the file and folder names alone.
Key things to draw out in debrief:
- Files dumped at the top level with no folder structure
- Files that are hard to identify without opening them
- Inconsistent or unhelpful names that make it unclear what a file contains
Prompt learners: If you came back to this folder in two years, or if a colleague unfamiliar with the project had to use it, what would they struggle with?
Alex goes to his supervisor and explains the problems he has found. His supervisor asks him to improve the organisation of the files as he is concerned that no-one will be able to find anything.
Organising files into a folder structure Part II
Individually, look at Folder 2.1 again and, within it, create a set of folders to organise the files into. You may want to create subfolders inside some of these folders too. Then move the files into your folders. Please note: You will not need to open or rename any of the files for this exercise.
- Alex sees that during the project the researcher gave a presentation at a conference and that alongside the slides, there are lots of documents relating to his attendance. Think about the different ways files related to the conference might be stored and consider which might be better for a team needing to share files, versus how an individual might be happy to store them.
- There are also various files related to a data analysis: can you see the different stages of that analysis process? How might you organise them?
Debriefing Part II: Ask one or two learners to share their folder structure - online, ask them to share their screen; in-person, sketch it on a whiteboard. Emphasise that there is no single correct answer, but highlight the key principles: hierarchy, meaningful names, and consistency. Common themes to look for:
- Separating raw data from processed or cleaned data
- Grouping conference or travel documents separately from research data
- Having a clear “current work” area versus an archive
If there is time, ask learners to compare their structure with a neighbour and discuss any differences.
Poor organisation can make it difficult to find files, or to even realise that a specific file exists. This becomes a particular problem when multiple people are working together, or on projects that run over a number of years.
When organisation breaks down:
- Out-of-date versions of files may end up being used and shared
- Important documents can be effectively lost
- Search tools only help if you already know the file exists and roughly what it was called - if you weren’t the person who created it, how would you know?
If you think back to documents you created a few years ago, would you still be able to say what they were all called, what the latest versions were, and what they all related to?
Taking a few moments to think about folder structure early on can save a lot of stress and time - both for your future self and for anyone you work with. If you work across multiple projects, developing a consistent approach means everyone always knows where to look for a given type of file. Structure folders hierarchically: start broad and drill down into specific areas.
It can be worth thinking of an old fashioned set of filing cabinets:

- each filing cabinet is a project
- each drawer is an aspect of the project e.g. drawer 1 for data collection; drawer 2 for analysis; drawer 3 for papers and presentations
- within each drawer are folders containing files about specific subsections of that aspect e.g. in drawer 2 there are separate folders for code, raw data, cleaned data, graphs/ figures, and reports
- within each folder in each drawer, there may be further sub-sections….
In practice, a research project folder might look something like this:
my_project/
├── data/
│   ├── raw/
│   └── processed/
├── analysis/
│   ├── scripts/
│   └── outputs/
├── reports/
│   ├── drafts/
│   └── submitted/
└── admin/
    ├── conference/
    └── expenses/
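If you set up the same layout for every new project, a short script can create it for you. The Python sketch below is a minimal example under stated assumptions: the function name create_project is invented here, the folder list simply mirrors the example layout above, and the folders are built inside a temporary directory so that running it does not touch your real files.

```python
from pathlib import Path
import tempfile

# Folder layout taken from the example project structure.
FOLDERS = [
    "data/raw",
    "data/processed",
    "analysis/scripts",
    "analysis/outputs",
    "reports/drafts",
    "reports/submitted",
    "admin/conference",
    "admin/expenses",
]

def create_project(root):
    """Create the folder layout under root and return the created paths."""
    root = Path(root)
    created = []
    for folder in FOLDERS:
        path = root / folder
        # parents=True also creates data/, analysis/, and so on as needed.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created

# Build the layout inside a temporary directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    created = create_project(Path(tmp) / "my_project")
    all_exist = all(p.is_dir() for p in created)
    print(len(created), "folders created:", all_exist)
```

To use it for a real project, you would call create_project with the path where the project should live instead of a temporary directory.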
However, do be sensible about the level and number of folders you use: if you have lots of folders that only contain one file, you may have too many, making it more difficult and time-consuming to navigate. If you have very few folders, then there may be too many files in a folder, making it difficult to find the relevant one.
Give each folder a name that is meaningful and concisely describes the contents of the folder, such as “raw_data”, “conference_presentations”, “expenses”.
Whatever structures you choose, it is worth periodically reviewing them to make sure they are still fulfilling their purpose. Perhaps a section of the project folders can be archived? Perhaps there are now enough files of a particular type to necessitate a new folder?
In summary:
- Use folders! Don’t just save files onto your desktop and expect to be able to find everything in future.
- Structure hierarchically. Start broad and drill down into specific areas or projects.
- Use a sensible number of folders. Too few or too many may both make it difficult or time-consuming to find files.
- Use sensible names. Consider project names and the types of files in each folder, such as “raw_data”, “conference_presentations” or “expenses”
- Develop a consistent approach across projects
- Review folder content periodically, and consider moving folders and files that are no longer needed into an ‘archive’ folder
File naming
Alex takes another look at the folder system he has created. The files are easier to search through, but he notices that there are lots of inconsistencies in how those files are named.
Running the naming discussion: Give learners 3–5 minutes to look through Folder 2.2 individually before opening discussion. Typical problems learners identify include: inconsistent capitalisation, spaces in names, special characters, vague or unhelpful names, and dates written in different or ambiguous formats. Let learners surface these themselves before introducing the guidance that follows.
Naming Files Part I
Open Folder 2.2 and look at the names of the files (you will not need to open the files). Can you identify any problems with the way the files are named? What kinds of issues might they cause for those working on the project?
Poor file naming practices can make it difficult and time-consuming to find files. They can also lead to people working on the wrong version, or overwriting important files and losing data that cannot be recovered.
Taking some time at the start of a project to agree on a naming convention will save effort in the long run - both for you and for any colleagues you work with.
Below are some key considerations when creating file names:
What information to include
Carefully consider what information someone would need about the file to know it is the one they want. Do they need to know when it was created? Do they need to know what type of data it contains, for example, raw data, clean data? Does the file relate to a specific ID number? Some of those items of information might be good candidates to form part of the file name.
Whether you need to be able to order the files by a characteristic
Whatever you most often need to find should go at the start of the file name, because files sort alphabetically from left to right. Some examples:
- Date first - useful when the date is the primary way you distinguish files, for example a daily instrument export where there is only one file per day: 20260513_readings.csv
- Sample ID first - useful when you process many samples on the same day, making the date a poor distinguishing feature. Putting the sample ID first means all files for a given sample sit together when sorted: SAMP042_20260513_raw.csv
- Participant or site ID first - common in clinical or social research where data is organised around individuals or locations rather than dates: P014_interview_transcript.txt
Sometimes you may need to balance two requirements. If you need to find the most recent file for a specific sample, you might use sampleID_date, which groups by sample first and sorts chronologically within each group.
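The effect of the sampleID_date pattern can be checked with a plain alphabetical sort. In the Python sketch below, the file names are invented for illustration and the helper latest_for is hypothetical. It relies on the dates being written year first, so that alphabetical order matches date order.

```python
# Hypothetical file names following a sampleID_date_stage pattern.
files = [
    "SAMP042_20260513_raw.csv",
    "SAMP007_20260513_raw.csv",
    "SAMP042_20260401_raw.csv",
    "SAMP007_20260620_raw.csv",
]

# A plain alphabetical sort groups by sample, then by date within a sample.
ordered = sorted(files)
for name in ordered:
    print(name)

def latest_for(sample_id, names):
    """Return the most recent file for one sample (the last after sorting)."""
    matches = sorted(n for n in names if n.startswith(sample_id + "_"))
    return matches[-1] if matches else None

latest = latest_for("SAMP042", files)
print(latest)  # SAMP042_20260513_raw.csv
```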
Special Characters and Spaces
Avoid using special characters (such as ?#!“£$%^&*{}@/|<> ) as operating systems and apps may handle these very differently, sometimes being completely unable to open a file with them in their name, or not recognising them at all. Some special characters have a meaning in particular programming languages, and may be interpreted as instructions to the computer rather than as part of the file name.
What are special characters?
Special characters are any characters that are not letters (A–Z, a–z), numbers (0–9), hyphens (-), or underscores (_). Common examples include:
? # ! " £ $ % ^ & * ( ) { } @ / \ | < > : ; ' ~
Some of these characters have a specific meaning to operating systems or software, which is why they can cause problems in folder and file names:
- / and \ are used to separate folders in file paths (e.g. Documents/project/data), so including one in a file name can confuse the computer about where the path ends and the name begins
- : is used in Windows file paths (e.g. C:\) and is not allowed in file or folder names on Windows
- Characters like *, ?, and " are used by operating systems and command-line tools as instructions (e.g. * means “match any file”), so they may be misinterpreted when they appear in a name
- Some characters may display differently or cause errors when files are shared across different operating systems (Windows, Mac, Linux), making it harder to open or find those files
Similarly, spaces in file names can cause problems:
Naming Files Part II
Look at the file name below:
STAR final results.xlsx
How do you think this might be interpreted by a computer? How might you rewrite the name to avoid that?
If you have spaces in a file name, the computer may interpret a space as marking the end of the file name, and therefore not treat the rest of the name as part of it. Alternatively, it may interpret the name as several file names listed one after the other, e.g. STAR final results.xlsx is read as either:
1 file named STAR followed by the command ‘final’…
3 files named:
- STAR
- final
- results.xlsx
A better way to write this file name would be STAR_final_results.xlsx or STAR-final-results.xlsx
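You can see this splitting for yourself without risking any files. The Python sketch below uses shlex.split from the standard library, which breaks a line of text into parts in the way a typical Unix-style command line would. The command name "open" is only a placeholder and nothing is actually run.

```python
import shlex

name_with_spaces = "STAR final results.xlsx"
name_with_underscores = "STAR_final_results.xlsx"

# shlex.split breaks a command line into parts the way a POSIX shell would.
parts_spaces = shlex.split("open " + name_with_spaces)
parts_underscores = shlex.split("open " + name_with_underscores)

print(parts_spaces)       # ['open', 'STAR', 'final', 'results.xlsx']
print(parts_underscores)  # ['open', 'STAR_final_results.xlsx']
```

With spaces, the command line sees three separate items after the command; with underscores, it sees one file name.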
Recommendation: Use only numbers and letters (without accents) and use hyphens and underscores instead of spaces, to separate the different parts of the file name.
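If you inherit files that do not follow this recommendation, a small script can suggest safer names. The Python sketch below is one possible approach rather than a standard tool: the function name safe_name is invented here, it only proposes a new name without renaming anything, and it replaces accented letters along with other special characters, which may not be what you want for every project.

```python
import re

def safe_name(name):
    """Replace spaces and special characters in a file name with underscores."""
    stem, dot, extension = name.rpartition(".")
    if not dot:                      # no extension present
        stem, extension = name, ""
    # Keep letters, digits, hyphens and underscores; replace anything else.
    cleaned = re.sub(r"[^A-Za-z0-9_-]+", "_", stem).strip("_")
    return cleaned + ("." + extension if extension else "")

print(safe_name("STAR final results.xlsx"))   # STAR_final_results.xlsx
print(safe_name("pilot data (v2)!.csv"))      # pilot_data_v2.csv
```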
Dates
Naming Files Part III
Look at the file names below:
01022026_sputum_culture_results.csv
03_09_2025_sputum_culture_results.csv
05Jun25_sputum_culture_results.csv
120126_sputum_culture_results.csv
12252026_sputum_culture_results.csv
What order were those files created in? Are you sure?
If the files came from laboratories in both the UK and in the USA, would that raise any concerns about how to read the dates on the files?
Running the dates challenges: Parts III and IV work well as quick whole-group exercises - display the file names on screen and ask learners to answer by a show of hands or a quick poll. The UK/US date ambiguity (e.g. does 05062026 mean 5 June or 6 May?) is often a surprise even to experienced researchers. Let the discussion run briefly before moving on.
Dates are a frequent cause of issues for researchers. Researchers from different countries may read date numbers differently: “05062026” may be one person’s 5th of June (e.g. in the UK), while for another it’s 6th of May (e.g. in the USA). Ensuring that everyone looking at the date reads it correctly can be the difference between the correct file being selected, and the wrong one.
Naming Files Part IV
Look at those file names again:
01022026_sputum_culture_results.csv
03_09_2025_sputum_culture_results.csv
05Jun25_sputum_culture_results.csv
120126_sputum_culture_results.csv
12252026_sputum_culture_results.csv
Note that they were given in the order they would appear in a folder, i.e. in numerical and alphabetical order.
Can you think of a better way to write the dates, so that the files appear in date order?
A good format for dates is YYYYMMDD or YYYY-MM-DD (both follow the ISO 8601 standard). Writing dates this way means files sort by year, then month, and then day. Seeing the year at the start of the date also signals to those looking at the files that the date is probably written in this way.
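The difference in sort order is easy to demonstrate. The Python sketch below writes three invented dates into file names in two ways, day first and year first, and then sorts each set alphabetically, as a file browser would.

```python
from datetime import date

# Hypothetical creation dates for three result files.
dates = [date(2025, 9, 3), date(2026, 1, 12), date(2025, 6, 5)]

# Day-first names (DDMMYYYY) versus year-first names (YYYYMMDD).
day_first = [d.strftime("%d%m%Y") + "_results.csv" for d in dates]
year_first = [d.strftime("%Y%m%d") + "_results.csv" for d in dates]

print(sorted(day_first))
# ['03092025_results.csv', '05062025_results.csv', '12012026_results.csv']
print(sorted(year_first))
# ['20250605_results.csv', '20250903_results.csv', '20260112_results.csv']
```

Sorted day first, the September 2025 file appears before the June 2025 file. Sorted year first, alphabetical order and date order are the same.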
Does your field have its own naming conventions?
Before inventing a naming convention from scratch, it is worth checking whether your discipline already has a standard - some fields have well-established conventions that make data easier to share, deposit in repositories, and reuse by others.
A few examples:
- Neuroimaging (BIDS): The Brain Imaging Data Structure standard specifies both folder layout and file names, such as sub-01_ses-01_task-rest_bold.nii.gz. Files deposited to OpenNeuro must follow this format.
- Clinical research (CDISC): The Clinical Data Interchange Standards Consortium defines naming and structure for trial data submitted to regulators (e.g. the FDA and EMA). Variable names, domain codes, and file layouts are all specified.
- Genomics: Sequencing files submitted to repositories such as NCBI’s Sequence Read Archive are assigned standard accession numbers. Many labs also follow community conventions for raw read files, such as sampleID_R1.fastq.gz.
- Ecology and biodiversity: The Darwin Core standard defines a common vocabulary for recording species observations, including how to express dates, locations, and identifiers.
If a standard exists for your field, using it from the start will save work later - particularly when it comes to depositing data in a repository or sharing it with collaborators.
Version control
Alex has now organised the folders and renamed many of the files. Things look much better, but as he works through the project, he notices something worrying.
For several documents, there are multiple slightly different versions of the same file, and it’s not always clear:
Which one is the most recent
What changed between versions
Who made those changes, or why
Some files include ‘final’ in the name… sometimes more than once.
Setting up the Versions Everywhere challenge: Read or paraphrase the Reinhart-Rogoff callout below to the group before setting learners off on the task - it raises the stakes and motivates the exercise. Give learners 5 minutes to look through Folder 2.3 individually, then debrief as a group.
In debrief, ask: How confident are you that you picked the right file? Most learners will have some uncertainty - that uncertainty is the teaching point.
When the wrong version makes the news
In 2010, economists Reinhart and Rogoff published a widely cited paper claiming that countries with government debt above 90% of GDP experienced sharply slower economic growth. The paper influenced austerity policy in the UK, US, and Europe. In 2013, a graduate student attempting to replicate the work found that several rows of data had been accidentally excluded from a formula in their Excel spreadsheet - the wrong version of the analysis had effectively been published. When corrected, the paper’s central finding largely disappeared. The policies had already been implemented.
Versions Everywhere
Look at the files in Folder 2.3 (do not open the files):
How many different versions of the same document can you find?
How can you tell which one is the “latest”?
Are you confident you’d pick the right file to work on?
What might go wrong if different people used different versions?
You don’t need to open the files to answer these questions.
Focus on the file names themselves:
- Look for words like final, draft, v1, v2, or dates
- Notice whether the versioning scheme is consistent across files
- Ask yourself what assumptions you’re making when deciding which file is “latest”
Would someone new to the project make the same assumptions?
Can’t I just check “Date Modified”?
When trying to work out which file is the most recent, it might seem easy to sort by the date modified or date created shown in your file browser. Unfortunately, these timestamps are not always reliable guides:
- Copying a file resets its “date created” to the moment it was copied, even if the original was created years earlier
- Date modified can change unexpectedly - opening a file and accidentally pressing a key, or a programme auto-saving, can update the timestamp without any meaningful change to the content
- Moving files between folders, drives, or computers can reset one or both timestamps, depending on the operating system
- Restoring files from a backup may set the timestamps to the moment of restoration rather than when the file was last genuinely edited
- Synchronisation tools (such as cloud storage services) may update timestamps when files are synced, even if the content has not changed
This means that “Date Modified” can appear to show a recent date on a file that is actually an old version, and an older date on what is actually the most current copy. Relying on timestamps alone to identify the latest version of a file is therefore risky. Including version information directly in the file name, or using version control software, is a much more reliable approach.
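The first point above can be reproduced with a minimal Python sketch. The file names (`report.txt`, `report_copy.txt`) are hypothetical, and exact timestamp behaviour depends on the operating system and the tool doing the copying; here `shutil.copy` stands in for a plain copy that does not preserve timestamps.

```python
import os
import shutil
import tempfile
import time

# Hypothetical file names, created in a throwaway temporary folder.
workdir = tempfile.mkdtemp()
original = os.path.join(workdir, "report.txt")
copy = os.path.join(workdir, "report_copy.txt")

with open(original, "w") as f:
    f.write("first draft\n")

# Pretend the original was last edited a year ago.
one_year_ago = time.time() - 365 * 24 * 60 * 60
os.utime(original, (one_year_ago, one_year_ago))

# shutil.copy copies the content but not the original timestamps.
shutil.copy(original, copy)

original_mtime = os.path.getmtime(original)
copy_mtime = os.path.getmtime(copy)

# The copy has identical content but looks about a year "newer".
copy_looks_newer = copy_mtime > original_mtime
print("Copy looks newer than original:", copy_looks_newer)
```

Both files contain exactly the same text, so the newer timestamp on the copy says nothing about which one holds the most recent content.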
When the wrong data file reaches the clinic
Between 2006 and 2010, a research team at Duke University published a series of papers claiming that genomic signatures could predict how cancer patients would respond to chemotherapy. Other researchers attempting to reproduce the work found systematic data errors: rows had been shifted, sample labels mixed up, and in some cases, earlier or incorrect versions of data files had been used in analyses. By the time the problems were fully documented, clinical trials had been opened on the basis of the flawed results. Several papers were retracted, and the trials were halted.
The researchers who uncovered the errors later described the process of reconstructing which file had been used for which analysis as one of the central difficulties - version information simply had not been recorded.
Baggerly, K.A. & Coombes, K.R. (2009). Deriving chemosensitivity from cell lines: forensic bioinformatics and reproducible research in high-throughput biology. Annals of Applied Statistics, 3(4), 1309–1334. https://doi.org/10.1214/09-AOAS291
Version control is about tracking change over time, so that you can:
See what changed
Return to earlier versions if needed
Understand how a file reached its current state
Just setting out clear file naming conventions and using them consistently can go a long way towards ensuring you can do all of these things.
Recommendations
Use version numbers to indicate the order that document versions were created in. Decimal points can be used to indicate intermediate versions; whole numbers to indicate major versions at key points in the document’s lifecycle.
For example:
- v0.1: a very early first draft
- v1.0: the first version circulated to other authors for initial comments
- v1.1: an updated version based on those comments
- v2.0: the version submitted to a publisher
- v3.0: the version resubmitted after revisions based on reviewer comments
Sometimes appending dates may be more appropriate. If so, use the YYYYMMDD format and consider its position in the file name so that the files are always ordered correctly when sorted alphabetically.
Where a document has reached a key point in its lifecycle, such as being submitted to a publisher, it may be helpful to append a short word or phrase such as “Submitted” or “Submitted_revision” to clarify that - but do use version numbers too!
Where there are lots of versions of a document in a folder, it may be appropriate to create a subfolder to keep previous versions in. This helps you and any colleagues to be able to quickly find the current version, which is particularly important if the document is a Standard Operating Procedure or Manual.
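As an illustration of why the position and format of version information matters, here is a minimal Python sketch. The file names are hypothetical, and it assumes the minor number simply counts up (so v1.10 comes after v1.2), which is one possible reading of the numbering scheme above rather than the only one.

```python
import re

# Hypothetical file names following the conventions above.
files = [
    "report_v1.10.docx",
    "report_v1.2.docx",
    "report_v0.1.docx",
    "report_v2.0.docx",
]

def version_key(name):
    """Return (major, minor) from a name containing 'v<major>.<minor>'."""
    match = re.search(r"v(\d+)\.(\d+)", name)
    if match is None:
        raise ValueError(f"No version number found in {name!r}")
    return int(match.group(1)), int(match.group(2))

# Plain alphabetical sorting puts v1.10 before v1.2.
alphabetical = sorted(files)

# Sorting on the parsed numbers gives the order of creation
# (under the assumption stated above).
by_version = sorted(files, key=version_key)
latest = by_version[-1]

# Dates written as YYYYMMDD sort correctly as plain text.
dated = sorted(["20240115_notes.txt", "20231201_notes.txt", "20240102_notes.txt"])

print(alphabetical)
print(by_version)
print(dated)
```

A file browser sorts names alphabetically, like `alphabetical` here. If you expect more than nine intermediate versions, zero-padding the number (for example v1.02) is one way to keep alphabetical order and version order the same.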
However, while file naming conventions can help, they aren’t always enough on their own.
Version control tools
For some types of work, particularly text-based files like code, scripts, and documentation, specialised version control tools can be extremely useful.
These tools are designed to:
Automatically record changes
Keep a history of edits
Show exactly what changed between versions
Support collaboration without overwriting others’ work
Running the tools discussion: A brief (5-minute) group discussion before introducing Git and GitHub. The key insight to steer towards is that plain text files - code, scripts, plain-text documentation - are easy for version control tools to compare line by line, while binary formats (images, Word documents, spreadsheets) are not. The concept of “diffing” (seeing exactly what changed between two versions) is what makes version control most powerful for code.
Learners often ask whether they should use Git for everything. The short answer is: not necessarily - the cloud collaboration callout that follows is a good practical alternative for documents and spreadsheets.
When might tools help?
Consider the following types of files:
Word documents
Spreadsheets
Analysis scripts (e.g. R, Python)
Survey questionnaires
Images or PDFs
In small groups, discuss:
Which of these changes frequently?
Which are hard to merge if two people edit them?
Which might benefit most from automated version tracking?
Imagine two people editing the file at the same time: what would be easy to reconcile, and what would be painful?
Not all files benefit equally from version control tools. Files such as images or spreadsheets (i.e. non-text files) can be harder to manage, while plain text files work particularly well.
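To see why plain text works so well, here is a minimal sketch of a line-by-line comparison using Python's built-in `difflib` module. The two "versions" and their file names are made up for illustration; version control tools produce similar output with their own machinery.

```python
import difflib

# Two hypothetical versions of a short analysis script, as lists of lines.
version_1 = [
    "data = load('survey.csv')",
    "mean_age = data['age'].mean()",
    "print(mean_age)",
]
version_2 = [
    "data = load('survey.csv')",
    "median_age = data['age'].median()",
    "print(median_age)",
]

# unified_diff marks removed lines with '-' and added lines with '+'.
diff = list(
    difflib.unified_diff(
        version_1,
        version_2,
        fromfile="analysis_v1.py",
        tofile="analysis_v2.py",
        lineterm="",
    )
)

for line in diff:
    print(line)
```

The unchanged first line is shown for context, while the two edited lines appear once with a `-` (old) and once with a `+` (new). A Word document or image is not stored as readable lines of text, so this kind of comparison is much less informative for those files.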
A simpler approach for non-text files: cloud collaboration tools
For files like Word documents, spreadsheets, or presentations, a shared cloud storage service (such as OneDrive, Google Drive, or SharePoint) can be a practical alternative. Rather than emailing a file back and forth - and ending up with multiple slightly different copies in different inboxes - collaborators can work from a single shared link. Everyone edits the same file, in the same place, at the same time.
This avoids the version confusion that comes with emailing attachments, and most of these tools also keep a version history so you can see earlier states of the file if needed.
However, it can make keeping very specific versions of files trickier, and it can be confusing if a collaborator makes unexpected changes, so think about how you will manage changes, and agree a process with everyone involved.
Thinking ahead
Without worrying about how to use them yet:
What advantages might a version control tool offer over manual file naming?
What new challenges might it introduce?
In what situations might it be unnecessary or overkill?
You might want to think about: how do you currently keep track of changes? What information gets lost when files are renamed or overwritten?
Also consider scale: one person vs a team? One week vs several years?
Git and GitHub
Git is a version control tool, and GitHub is a platform that hosts Git projects and supports collaboration. They are primarily designed for managing code and other plain text files - they are not a general-purpose solution for all research files.
They are particularly useful for files that are:
- text-based (such as code, scripts, and documentation)
- edited frequently
- worked on by more than one person
They are generally less helpful for files like images, PDFs, spreadsheets, or heavily formatted documents (such as Word or LibreOffice files). Not all collaborators will feel comfortable using them either, as there is a bit of a learning curve.
File naming can help manage major versions (for example,
report_v1, report_v2), but version control
tools go further; they can record every change and allow you to
return to a specific point in time, a bit like “Track Changes” for an
entire project.
You won’t be expected to use Git or GitHub yet. For now, it’s enough to understand why such tools exist and when they might be useful.
Terminology preview
You may hear version control tools described using terms like:
- Repository: a project’s home: the files and their record of changes
- Commit: a saved snapshot of changes, with a short note about what was done
- History: the timeline of commits showing how the project evolved
You don’t need to know how to use these yet. For now, think of them as names for ideas you’ve already encountered when trying to keep track of different versions of files.
- Organise files into a hierarchical folder structure: start broad and drill down into specific areas, using a sensible number of folders with meaningful names
- Use a consistent file naming convention across a project so that you and your colleagues can easily find and identify files
- Avoid spaces and special characters in file names; use hyphens or underscores to separate parts of the name instead
- Use the YYYYMMDD (ISO 8601) date format in file names to ensure files sort correctly in date order
- Use version numbers (e.g. v1.0, v1.1) or date-based suffixes to track document versions, and keep earlier versions in a clearly named subfolder
- Version control tools such as Git and GitHub are particularly useful for text-based files (code, scripts, documentation) that change frequently or are worked on collaboratively
Content from Tabular Data Collection
Last updated on 2026-06-02
Estimated time: 60 minutes
Overview
Questions
- What types of variables are commonly found in tabular data?
- What kinds of data inconsistencies can affect the quality of a dataset?
- What are some common causes of inconsistent or messy data?
- What practices can help ensure clean, consistent data during collection and entry?
- Why is it important to provide clear instructions or rules when collecting data?
- What is a data dictionary, and why is it useful?
Objectives
After following this episode, learners will be able to:
- List variable types and formats
- Identify inconsistencies in data that can cause problems during analysis
- Describe methods that can be used during data collection and data entry that can prevent inconsistencies
- Write guidance for how to collect and enter data
- Create a data dictionary describing a dataset
Variables, data types and formats
Alex has received a dataset from the MET museum and needs to understand the types of variables before exploring or analysing it further.
Follow along: Open up the dataset
You should have downloaded a dataset called Met_Objects_Dataset_sample.txt as part of the setup instructions. Please open this file in whatever spreadsheet software you are using (e.g. LibreOffice, Excel). The file is tab delimited (i.e. within each row a tab character separates the values into their columns), so you may need to use whatever Text to Columns tool your spreadsheet software provides to convert it into columnar data. The first row contains the column headers.
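If you are curious what "tab delimited" looks like outside a spreadsheet, here is a minimal Python sketch using the built-in `csv` module. The three sample rows are made up for illustration and only use a few of the column names mentioned in this episode; the real file has many more columns.

```python
import csv
import io

# A tiny made-up sample in a tab-delimited layout; "\t" is the tab character.
sample = (
    "objectid\ttitle\tobjectdate\n"
    "1001\tWoman with a Parasol\t1875\n"
    "1002\tUntitled\tca. 1890\n"
)

# For the real file you would pass
# open("Met_Objects_Dataset_sample.txt", newline="")
# instead of io.StringIO(sample).
reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
rows = list(reader)

print(reader.fieldnames)   # column headers taken from the first row
print(rows[1]["objectdate"])
```

Note that every value is read as text, including `1875`. Deciding which columns should become numbers or dates is a separate step.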
What is a data point?
A data point is a single piece of information collected for one variable about one item.
In the MET museum dataset Alex is using, each row is an object (like a painting or sculpture), and each cell is a data point.
Alex’s dataset looks like a spreadsheet, but underneath, each column contains a specific data type, which we covered in section 1. Knowing these helps to avoid errors and choose the right tools for analysis.
What is a variable?
A variable is a characteristic or attribute that can take on different values. In tabular data, variables are usually represented as columns, where each row contains an observation or entry.
However, the concept of a variable is independent of format: it’s not defined by being a column, but by being a consistent type of information collected across observations.
For example, in Alex’s MET dataset, variables might include objectid, artistname, or dateacquired.
Types of variables
Numeric variables
Variables that represent measurable quantities. These can be integers or floats. Numeric variables can be Discrete, which means they take on specific, separate values (often counts), or Continuous, which can take on any value within a range (often measurements).
Examples:
- objectid → 12345 (integer, discrete - a unique ID number)
- heightcm → 23.5 (float, continuous - a measurement in centimeters)
- objectdate → 1890 (integer, discrete - a specific year)
String variables
Free-form or descriptive text.
Examples:
- artistdisplayname → "Claude Monet" (string)
- title → "Woman with a Parasol" (string)
Categorical variables
Variables that represent groups or categories. These could be
strings, integers, or floats - anything used to label a category!
Categorical variables can be Nominal, which means there is no
inherent order (e.g., artistnationality), or
Ordinal, which means the categories follow a logical order
(e.g., popularity).
Examples:
- gender → "Female", "Male" (string, nominal)
- medium → "Marble", "Bronze", "Oil on canvas" (string, nominal)
- istimelinework → "Yes" / "No" (string, nominal - or Boolean: TRUE / FALSE)
- artistdecade → 1950, 1960, 1980 (integer, ordinal - ordered decades)
Date/time variables
Variables that represent dates or times.
Examples:
- lastconserv → "2001-05-12" (datetime or string)
- objectdate → 1990 (integer), "ca. 1890" (string)
Note on overlapping types:
Some variables can belong to more than one category depending on their use and format. For example:
objectdate = 1890 might be treated as a numeric variable
(discrete integer) if used for sorting or calculations.
The same objectdate could also be considered a date/time
variable if formatted as "1890-01-01" and used in
time-based analyses.
artistdecade = 1950 could be a categorical variable
(ordinal) if grouped into decade-based categories for comparison.
It’s okay for a single value to have more than one interpretation - what matters is how it’s used in context.
⚠️ Some columns might look like numbers but contain inconsistent formats (e.g., “ca. 1890”). These need cleaning before they can be analysed as dates.
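As a rough illustration of what such cleaning might involve, the Python sketch below pulls a four-digit year out of mixed values. The function name `extract_year` and the sample values are hypothetical, and the approach is deliberately simple: it discards the uncertainty expressed by "ca.", so in practice you would keep the original column alongside the cleaned one.

```python
import re

def extract_year(value):
    """Return the first four-digit year found in value as an int, or None."""
    match = re.search(r"\b(\d{4})\b", str(value))
    return int(match.group(1)) if match else None

# A mix of integer and text values, as might appear in objectdate.
raw_values = [1990, "ca. 1890", "1931", "unknown"]
years = [extract_year(v) for v in raw_values]

print(years)
```

Values with no recognisable year come back as `None`, which makes them easy to find and review by hand.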
Summary
Understanding the difference between conceptual types (how the data is used or interpreted) and technical types (how the data is stored or formatted) is key for working effectively with tabular data. For example, a column might be technically an integer but conceptually a category (like decades or survey scores).
| Conceptual Type | Technical Type | Description | Example |
|---|---|---|---|
| Nominal | String | Categories, no order | artistnationality = Australian |
| Ordinal | String | Categories with order | popularity = high |
| Discrete Numeric | Integer | Countable numbers | objectid = 123456 |
| Continuous Numeric | Integer, Float | Measurable, decimals allowed | height = 27.5 |
| Boolean | Boolean | Yes/No, True/False | ishighlight = TRUE |
| Date/Time | Datetime | Dates or times | lastconserv = 12/11/2027 |
| Textual | String | Free text | artistdisplayname = Claude Monet |
| Identifier | Integer/String | Unique reference | objectnumber = 1982.456 |
Tip for learners (like Alex):
Understanding both the conceptual meaning and the technical format of your data helps you clean it correctly, document it clearly, and analyse it without errors.
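The gap between technical and conceptual type can be seen in a minimal Python sketch. The `popularity` values and their order (low, medium, high) are assumptions made for illustration.

```python
# Hypothetical values for a popularity column, stored technically as strings.
popularity = ["medium", "high", "low", "high", "medium"]

# Treated as plain strings, the values sort alphabetically.
as_strings = sorted(set(popularity))

# Treated as ordinal categories, they follow an explicit logical order.
order = ["low", "medium", "high"]
rank = {label: position for position, label in enumerate(order)}
as_ordinal = sorted(set(popularity), key=rank.get)

print(as_strings)
print(as_ordinal)
```

The computer only knows the technical type (string), so it sorts alphabetically unless you tell it about the conceptual type (ordinal) by supplying the intended order.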
Identify inconsistencies in data
Before we can clean or analyse data, it’s important to check for inconsistencies: values that don’t follow a standard or expected format. These might include:
- Different spellings or formats for the same category
- Mixed use of upper/lower case
- Inconsistent date formats
- Unexpected blank or missing values
- Invalid or impossible values (e.g. negative heights, future birth dates)
These inconsistencies can lead to errors or misleading results if not corrected.
Example: Inconsistencies in the artistgender column
Here’s an example of how the same concept (“artistgender”) can be recorded in many different ways:
| objectid | artistgender |
|---|---|
| 1001 | Female |
| 1002 | female |
| 1003 | F |
| 1004 | Male |
| 1005 | MALE |
| 1006 | M |
| 1007 | Unknown |
| 1008 | |
We can see:
- "Female", "female", and "F" all refer to the same category
- "Male", "MALE", and "M" are also equivalent
- "Unknown" and the blank entry might indicate missing or uncertain data
These differences need to be standardised before analysis, for example by converting all values to lowercase and replacing shorthand terms with full words.
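A minimal Python sketch of that standardisation step is shown below. The mapping, the choice of lowercase labels, and the decision to treat a blank entry as "unknown" are all assumptions for illustration; your own project guidelines should say how these cases are handled.

```python
# Mapping from cleaned-up raw values to one standard label per category.
# The labels, and treating blanks as "unknown", are assumptions.
STANDARD_LABELS = {
    "female": "female",
    "f": "female",
    "male": "male",
    "m": "male",
    "unknown": "unknown",
    "": "unknown",
}

def standardise_gender(value):
    """Lowercase and trim the value, then replace shorthand with a full word."""
    cleaned = (value or "").strip().lower()
    if cleaned not in STANDARD_LABELS:
        raise ValueError(f"Unexpected value: {value!r}")
    return STANDARD_LABELS[cleaned]

# The eight values from the example table above.
raw = ["Female", "female", "F", "Male", "MALE", "M", "Unknown", ""]
standardised = [standardise_gender(v) for v in raw]

print(standardised)
```

Raising an error for unexpected values, rather than silently guessing, means new variations are noticed instead of being quietly miscategorised.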
Challenge 1: Can you find any inconsistencies or problems with data entered into a spreadsheet?
Let’s have a deep dive into the Met_Objects_Dataset_sample.txt dataset. Using a coloured fill identify any inconsistencies or problem data in the spreadsheet that you think might cause problems for anyone analysing the data.
Inconsistencies might include measurements in different units, differing formats for dates, differing cases, or the same thing being indicated in a variety of different ways.
Next, we’ll look at how to avoid these kinds of issues from happening in the first place.
Prevent inconsistencies during data collection
As we’ve seen so far, our dataset contains a number of inconsistencies that will complicate analysis. In an ideal world, we would have avoided introducing these errors while collecting the data. It’s always simpler to avoid inconsistencies in the first place, rather than trying to fix them later!
How could we have adjusted our data collection to avoid this? Let’s
take the lastconserv column as an example, which represents
the date when the object was last conserved. Here we see a large number
of different date / time formats, including:
- 28/01/2025 = day / month / year
- 07/21/2023 = month / day / year
- 26.03.23 = day.month.year
- 07/06/2019 00:00 = day / month / year hour:minute
To avoid this, we could have enforced a specific date/time format during collection. For example, if we were using a form, we could have limited responses in this field to only accept dates as year-month-day, with no time entry allowed.
There are also some incorrect dates in this column
e.g. 30/2/2024 (30th February 2024). February only has 28
days, or 29 during a leap year, so this date is impossible. We could
have avoided this by providing some kind of date validation in the form
- e.g. using a calendar input that only contains real dates.
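The kind of check a form might run behind the scenes can be sketched in a few lines of Python. The function name `parse_iso_date` is hypothetical; it relies on `datetime.strptime`, which raises an error for text that does not match the pattern or is not a real calendar date.

```python
from datetime import datetime

def parse_iso_date(text):
    """Return a date if text is a real calendar date written year-month-day, else None."""
    try:
        return datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        return None

valid = parse_iso_date("2025-01-28")
impossible = parse_iso_date("2024-02-30")     # 30th February does not exist
wrong_format = parse_iso_date("28/01/2025")   # not year-month-day

print(valid, impossible, wrong_format)
```

Only the first value is accepted. The other two return `None`, so they could be rejected at the point of entry rather than discovered during analysis.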
Some general guidelines
Avoid free text fields during data collection. This increases the risk of spelling mistakes, additional spaces etc., that will complicate the final analysis.
If a column should only contain particular values, then enforce this! For example, you could use a drop-down menu with set options to choose from.
Add validation to avoid ‘impossible’ values. For example, are values only valid within a certain range? Are negative values valid?
Where multiple formats are possible (e.g. with dates / times), enforce a specific format.
Challenge: Methods to prevent inconsistencies during data collection
In a small group, consider how you could prevent the other inconsistencies you identified in the dataset. What checks or rules could you introduce during data collection?
There are many different solutions to these inconsistencies, but here are some examples:
- istimelinework could have used a drop-down menu that enforced only two choices of True or False.
- A check could have been added to enforce that accessionyear (the year the object entered the collection) is always after objectdate (the year the object was created).
- artistnationality could have used a drop-down menu containing set nationality options. This would have avoided inconsistencies like France vs French.
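These three rules could be expressed as automated checks. The Python sketch below is one possible way to do so; the function name `check_record`, the short list of allowed nationalities, and the decision to allow an object to be acquired in the same year it was made are all assumptions for illustration.

```python
# Hypothetical allowed values; a real list would come from the project's guidelines.
ALLOWED_NATIONALITIES = {"French", "American", "Japanese"}

def check_record(record):
    """Return a list of problems found in one row (empty if there are none)."""
    problems = []
    if not isinstance(record["istimelinework"], bool):
        problems.append("istimelinework must be True or False")
    if record["artistnationality"] not in ALLOWED_NATIONALITIES:
        problems.append("artistnationality is not in the allowed list")
    # Assumption: acquisition in the same year as creation is acceptable.
    if record["accessionyear"] < record["objectdate"]:
        problems.append("accessionyear is before objectdate")
    return problems

good = {"istimelinework": True, "artistnationality": "French",
        "accessionyear": 1950, "objectdate": 1875}
bad = {"istimelinework": "Yes", "artistnationality": "France",
       "accessionyear": 1850, "objectdate": 1875}

print(check_record(good))
print(check_record(bad))
```

The first record passes all three checks, while the second is flagged three times. A drop-down menu in a form achieves the same result earlier, by making the invalid values impossible to enter.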
Write data collection guidelines
As we saw in the last section, there are many additional checks / rules we could have added during data collection to make our dataset more consistent and easier to analyse. It’s good practice to document these rules before data collection takes place so that it’s clear to yourself, along with any collaborators and future users of your dataset, exactly how values were collected. This will also be invaluable when it’s time to write the methods section of any papers or reports that use this data.
Make sure you include information about how to handle missing values.
How will these be represented in your table? It is also useful to
explicitly state why a value may be missing (if possible). For example,
in our dataset some objects are made by manufacturing companies like
United Merchants & Manufacturers rather than an
individual artist - in this case artistgender will be
missing, as it doesn’t apply in this scenario.
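Once a convention for missing values is written down, it can be applied consistently when the data is read. The Python sketch below assumes, purely for illustration, that missing values are recorded either as an empty cell or as `NA`; the function name `read_value` is hypothetical.

```python
# Assumed convention: an empty cell or "NA" means the value is missing.
MISSING_MARKERS = {"", "NA"}

def read_value(text):
    """Return None for an agreed missing-value marker, otherwise the trimmed text."""
    cleaned = text.strip()
    return None if cleaned in MISSING_MARKERS else cleaned

column = ["Female", "NA", "", "Male", " NA "]
values = [read_value(v) for v in column]
missing_count = values.count(None)

print(values)
print(missing_count)
```

If the guidelines did not fix the marker in advance, variations such as `N/A`, `none`, or `-` would each need to be discovered and added by hand.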
Write data collection guidelines
Choose a variable from the dataset (e.g. lastconserv)
and write some bullet-point guidelines for its collection. For
example:
Which values are valid for this variable?
Which format should be used?
Are any checks required against other variables in the table?
If a value is missing, how should it be represented? E.g. NA, None, not applicable
Data Dictionaries
What is a Data Dictionary?
A data dictionary is a table that describes the variables in your dataset. It provides key information such as:
- Variable name (column header)
- Description (what the variable represents)
- Data type (e.g., string, integer, boolean, datetime)
- Possible values or format (especially for categorical variables)
- Units (if relevant)
Data dictionaries help others (and future you!) understand and use your data consistently and correctly.
Example: Wildlife Observations Dataset
Here’s a sample data dictionary for a fictional dataset tracking wildlife sightings in a nature reserve:
| Variable Name | Description | Data Type | Possible Values / Format | Units |
|---|---|---|---|---|
| sighting_id | Unique ID for each observation | Integer | 1, 2, 3, … | N/A |
| species_name | Name of the animal species observed | String | e.g., “Red Fox”, “Barn Owl” | N/A |
| count | Number of individuals seen | Integer | 0, 1, 2, … | Count |
| observation_date | Date the observation was recorded | Datetime | YYYY-MM-DD | N/A |
| location | Area of the park where sighting occurred | String | “North Woods”, “Wetland Trail” | N/A |
| is_endangered | Whether the species is endangered | Boolean | TRUE, FALSE | N/A |
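A data dictionary is written for people, but it can also be turned into something a computer can check against. The Python sketch below is one hypothetical way to do that for the wildlife example: it reduces the dictionary to variable names and expected types, and representing "Datetime" with Python's `date` type is an assumption made for this sketch.

```python
from datetime import date

# The data dictionary above, reduced to variable name -> expected Python type.
# Mapping "Datetime" to datetime.date is an assumption made for this sketch.
DATA_DICTIONARY = {
    "sighting_id": int,
    "species_name": str,
    "count": int,
    "observation_date": date,
    "location": str,
    "is_endangered": bool,
}

def check_against_dictionary(record):
    """Return a list of problems: missing, wrongly typed, or undocumented variables."""
    problems = []
    for name, expected_type in DATA_DICTIONARY.items():
        if name not in record:
            problems.append(f"{name}: missing")
        elif not isinstance(record[name], expected_type):
            problems.append(f"{name}: expected {expected_type.__name__}")
    for name in record:
        if name not in DATA_DICTIONARY:
            problems.append(f"{name}: not in data dictionary")
    return problems

sighting = {
    "sighting_id": 1,
    "species_name": "Red Fox",
    "count": 2,
    "observation_date": date(2024, 5, 14),
    "location": "North Woods",
    "is_endangered": False,
}

print(check_against_dictionary(sighting))
```

This only checks names and types. Possible values and units from the dictionary would need their own rules, which is a good reason to record them clearly in the first place.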
Challenge: Write a Data Dictionary for Alex
Alex is trying to make sense of the MET museum dataset. Help Alex out by creating a mini data dictionary!
- Open the file Met_Objects_Dataset_sample.txt
- Choose three variables (columns) from the dataset
- For each one, write down:
  - The variable name
  - A short description
  - The data type (e.g., string, integer, date)
  - Any possible values or units, if relevant
Work in pairs or small groups and compare your answers.
- Variables in tabular data can be numeric, string, categorical, or date/time, and a single variable may have both a conceptual type (how it is used) and a technical format (how it is stored).
- Data inconsistencies such as mixed cases, varying formats, and invalid values can cause errors during analysis and should be identified before working with a dataset.
- Inconsistencies are easier to prevent than to fix, so enforcing formats, using drop-down menus, and adding validation rules during data collection reduces the need for cleaning later.
- Documenting data collection guidelines before collecting data ensures consistency and supports reproducibility for collaborators and future users.
- A data dictionary describes the variables in a dataset, including their names, types, possible values, and units, making the dataset easier to understand and use correctly.
Content from How to clean a tabular dataset
Last updated on 2025-03-28
Estimated time: 50 minutes
Overview
Questions
- What is ‘clean’ data?
- How can we find inconsistencies in tabular data?
- How can we correct inconsistencies in tabular data?
Objectives
- Describe what data cleaning is and why it is important
- Find and resolve inconsistencies within a tabular dataset programmatically (e.g. datetime, numeric precision)
- Identify missing values within a tabular dataset using filters
- Correct spelling mistakes using spell check tools and find + replace
- Standardise text formats using spreadsheet functions
- Describe the pros and cons of using spreadsheets for data collection and cleaning
Challenge 1: Can you do it?
Open film_dataset.csv.
- How many missing values are there in the ‘film_title’ column?
- Are there any duplicate entries in the dataset? If so, how many?
- There are 7 missing values in the film_title column
- There are 5 duplicate rows in the dataset
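The same two questions can also be answered programmatically. The Python sketch below uses a tiny made-up sample rather than film_dataset.csv itself; the `release_year` column is hypothetical, and the definitions of "missing" (empty after trimming spaces) and "duplicate" (every value identical to an earlier row) are assumptions you may want to adjust.

```python
import csv
import io

# A tiny made-up sample; the release_year column is hypothetical.
sample = (
    "film_title,release_year\n"
    "Alien,1979\n"
    ",1982\n"
    "Alien,1979\n"
    "Heat,1995\n"
)

rows = list(csv.DictReader(io.StringIO(sample)))

# A value counts as missing here if it is empty after trimming spaces.
missing_titles = sum(1 for row in rows if not row["film_title"].strip())

# A row counts as a duplicate if an identical row has already been seen.
seen = set()
duplicate_rows = 0
for row in rows:
    key = tuple(row.values())
    if key in seen:
        duplicate_rows += 1
    else:
        seen.add(key)

print(missing_titles, duplicate_rows)
```

To run this on the real file, you would replace `io.StringIO(sample)` with the opened file, for example `open("film_dataset.csv", newline="")`.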
- keypoint
Content from Introduction to R
Last updated on 2025-03-28
Estimated time: 60 minutes
Overview
Questions
- What is….
Objectives
- Objective 1