Content from What is Research Data?


Last updated on 2025-03-28

Overview

Questions

  • What is research data, and why is it important in academic and scientific research?
  • What are the different types of research data?
  • Where can research data come from?
  • What are the key components of research data management (RDM)?

Objectives

  • Data types
  • Sources of data
  • What research data management is (collection, storage, organisation, sharing, etc.)

This is a lesson created via The Carpentries Workbench. It is written in Pandoc-flavored Markdown for static files and R Markdown for dynamic files that can render code into output. Please refer to the Introduction to The Carpentries Workbench for full documentation.

What you need to know is that there are three sections required for a valid Carpentries lesson:

  1. questions are displayed at the beginning of the episode to prime the learner for the content.
  2. objectives are the learning objectives for an episode displayed with the questions.
  3. keypoints are displayed at the end of the episode to reinforce the objectives.

Challenge 1: Can you do it?

What is the output of this command?

R

paste("This", "new", "lesson", "looks", "good")

OUTPUT

[1] "This new lesson looks good"

Challenge 2: how do you nest solutions within challenge blocks?

You can add a line with at least three colons and a solution tag.

Figures


You can use standard markdown for static figures with the following syntax:

![optional caption that appears below the figure](figure url){alt='alt text for accessibility purposes'}

[Figure: Blue Carpentries hex person logo with no text. Caption: "You belong in The Carpentries!"]

Callout

Callout sections can highlight information.

They are sometimes used to emphasise particularly important points but are also used in some lessons to present “asides”: content that is not central to the narrative of the lesson, e.g. by providing the answer to a commonly-asked question.

Math


One of our episodes contains \(\LaTeX\) equations when describing how to create dynamic reports with {knitr}, so we now use mathjax to describe this:

$\alpha = \dfrac{1}{(1 - \beta)^2}$ becomes: \(\alpha = \dfrac{1}{(1 - \beta)^2}\)

Cool, right?

Key Points

  • Use .md files for episodes when you want static content
  • Use .Rmd files for episodes when you need to generate output
  • Run sandpaper::check_lesson() to identify any issues with your lesson
  • Run sandpaper::build_lesson() to preview your lesson locally

Content from Structuring Research Materials


Last updated on 2025-03-28

Overview

Questions

  • How can you structure data using a standard folder system for better organisation?
  • What are the benefits of using a consistent file naming convention in research data management?
  • Why is version control important, and how can it be incorporated into file naming practices?
  • In what ways can version control tools like Git and GitHub be useful for managing data?

Objectives

  • Organise research data into a standard folder structure
  • Name files with a consistent naming convention
  • Explain why version control is important, and how to incorporate it into naming conventions
  • Explain why version control software such as Git/GitHub can be useful for certain types of data

Organising files into a folder structure

In groups, look through the folder of data that you have been given:

  • What problems can you identify with how files are organised?
  • How many different datasets can you identify?
  • Which files are ‘raw’ vs ‘processed’ data?
  • How would you improve the organisation of the files? E.g. how would you split them between different folders?

Content from Tabular Data Collection


Last updated on 2025-06-02

Overview

Questions

  • What types of variables are commonly found in tabular data?
  • What kinds of data inconsistencies can affect the quality of a dataset?
  • What are some common causes of inconsistent or messy data?
  • What practices can help ensure clean, consistent data during collection and entry?
  • Why is it important to provide clear instructions or rules when collecting data?
  • What is a data dictionary, and why is it useful?

Objectives

After following this episode, learners will be able to:

  • List variable types and formats
  • Identify inconsistencies in data that can cause problems during analysis
  • Describe methods that can be used during data collection and data entry that can prevent inconsistencies
  • Write guidance for how to collect and enter data
  • Create a data dictionary describing a dataset

Variables, data types and formats


Alex has received a dataset from the MET museum and needs to understand the types of variables before exploring or analysing it further.

Follow along: Open up the dataset

You should have downloaded a dataset called Met_Objects_Dataset_sample.txt as part of the setup instructions. Please open this file in whatever spreadsheet software you are using (e.g. LibreOffice, Excel). The file is tab delimited (i.e. within each row a tab character separates the values into their columns), so you may need to use whatever Text to Columns tool your spreadsheet software provides to convert it into columnar data. The first row contains the column headers.
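
For anyone who would rather inspect the file with a script than a spreadsheet, here is a minimal Python sketch of reading tab-delimited text. The two sample rows and their column names are made up for illustration and are not taken from the real file.

```python
import csv
import io

# Hypothetical two-row extract standing in for Met_Objects_Dataset_sample.txt;
# the column names and values are illustrative only.
sample_text = (
    "objectid\tartistdisplayname\tmedium\n"
    "1001\tClaude Monet\tOil on canvas\n"
    "1002\tAuguste Rodin\tBronze\n"
)

# delimiter="\t" tells the reader that a tab character separates the columns.
reader = csv.DictReader(io.StringIO(sample_text), delimiter="\t")
rows = list(reader)

print(reader.fieldnames)   # column headers taken from the first row
print(rows[0]["medium"])   # one data point: the medium of the first object
```

To read the real file, the `io.StringIO(...)` object could be replaced with `open("Met_Objects_Dataset_sample.txt", newline="", encoding="utf-8")`. The UTF-8 encoding is an assumption.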

What is a data point?

A data point is a single piece of information collected for one variable about one item.

In the MET museum dataset Alex is using, each row is an object (like a painting or sculpture), and each cell is a data point.

Alex’s dataset looks like a spreadsheet, but underneath, each column contains a specific data type. Knowing these helps to avoid errors and choose the right tools for analysis.

Basic data types

Before we look at different types of variables, here are some common data types you’ll encounter:

  • String: Text or characters, like "Claude Monet" or "Oil on canvas"
  • Integer: Whole numbers, like 1985, 42, or 0
  • Float: Decimal numbers, like 27.5 or 3.14
  • Boolean: True/False values, like TRUE, FALSE, Yes, No
  • Datetime: Calendar dates or timestamps, like "2020-01-01" or "12/11/2027"
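
As a rough illustration, the same basic types exist in most programming languages. The sketch below uses Python. The variable names and values are hypothetical.

```python
from datetime import date

# Hypothetical values for one museum object, one per basic data type.
artist = "Claude Monet"                        # string
year = 1985                                    # integer
height_cm = 27.5                               # float
is_highlight = True                            # Boolean
acquired = date.fromisoformat("2020-01-01")    # date parsed from ISO text

for value in (artist, year, height_cm, is_highlight, acquired):
    print(type(value).__name__, value)
```

A value such as "12/11/2027" cannot be parsed this way, because it does not say whether the day or the month comes first.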

What is a variable?

A variable is a characteristic or attribute that can take on different values. In tabular data, variables are usually represented as columns, where each row contains an observation or entry.

However, the concept of a variable is independent of format: it is not defined by being a column, but by being a consistent type of information collected across observations.

For example, in Alex’s MET dataset, variables might include objectid, artistname, or dateacquired.

Types of variables

Numeric variables

Variables that represent measurable quantities. These can be integers or floats. Numeric variables can be Discrete, which means they take on specific, separate values (often counts), or Continuous, which can take on any value within a range (often measurements).

Examples:

  • objectid: 12345 (integer, discrete; a unique ID number)
  • heightcm: 23.5 (float, continuous; a measurement in centimeters)
  • objectdate: 1890 (integer, discrete; a specific year)

String variables

Free-form or descriptive text.

Examples:

  • artistdisplayname: "Claude Monet" (string)
  • title: "Woman with a Parasol" (string)

Categorical variables

Variables that represent groups or categories. These could be strings, integers, or floats - anything used to label a category!
Categorical variables can be Nominal, which means there is no inherent order (e.g., artistnationality), or Ordinal, which means the categories follow a logical order (e.g., popularity).

Examples:

  • gender: "Female", "Male" (string, nominal)
  • medium: "Marble", "Bronze", "Oil on canvas" (string, nominal)
  • istimelinework: "Yes" / "No" (string, nominal; or Boolean: TRUE / FALSE)
  • artistdecade: 1950, 1960, 1980 (integer, ordinal; ordered decades)

Date/time variables

Variables that represent dates or times.

Examples:

  • lastconserv: "2001-05-12" (datetime or string)
  • objectdate: 1990 (integer), "ca. 1890" (string)

Callout

Note on overlapping types:

Some variables can belong to more than one category depending on their use and format. For example:

objectdate = 1890 might be treated as a numeric variable (discrete integer) if used for sorting or calculations.

The same objectdate could also be considered a date/time variable if formatted as "1890-01-01" and used in time-based analyses.

artistdecade = 1950 could be a categorical variable (ordinal) if grouped into decade-based categories for comparison.

It’s okay for a single value to have more than one interpretation - what matters is how it’s used in context.

Caution

⚠️ Some columns might look like numbers but contain inconsistent formats (e.g., “ca. 1890”). These need cleaning before they can be analysed as dates.
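
One possible way to clean such a column is sketched below in Python. It assumes that the first four-digit number in a value is the year of interest. The helper name `extract_year` is hypothetical, and the rule discards the "ca." uncertainty, which may need recording in a separate column.

```python
import re

def extract_year(raw):
    """Return the first four-digit year found in raw, or None if there is none."""
    match = re.search(r"\b(\d{4})\b", str(raw))
    return int(match.group(1)) if match else None

for raw in [1890, "ca. 1890", "1890-01-01", "unknown"]:
    print(repr(raw), "->", extract_year(raw))
```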

Summary

Understanding the difference between conceptual types (how the data is used or interpreted) and technical types (how the data is stored or formatted) is key for working effectively with tabular data. For example, a column might be technically an integer but conceptually a category (like decades or survey scores).

| Conceptual Type | Technical Type | Description | Example |
|---|---|---|---|
| Nominal | String | Categories, no order | artistnationality = Australian |
| Ordinal | String | Categories with order | popularity = high |
| Discrete Numeric | Integer | Countable numbers | objectid = 123456 |
| Continuous Numeric | Integer, Float | Measurable, decimals allowed | height = 27.5 |
| Boolean | Boolean | Yes/No, True/False | ishighlight = TRUE |
| Date/Time | Datetime | Dates or times | lastconserv = 12/11/2027 |
| Textual | String | Free text | artistdisplayname = Claude Monet |
| Identifier | Integer/String | Unique reference | objectnumber = 1982.456 |
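
To make the conceptual versus technical distinction concrete, the Python sketch below treats the same integer column in two ways. The artistdecade values are invented for illustration.

```python
from collections import Counter

# Hypothetical artistdecade values: stored as integers, used as categories.
artist_decades = [1950, 1960, 1950, 1980, 1960, 1950]

# Treated as a number, the mean is computable but not very meaningful.
mean_decade = sum(artist_decades) / len(artist_decades)

# Treated as an ordinal category, counting per decade is the natural summary.
counts = Counter(artist_decades)
for decade in sorted(counts):
    print(decade, counts[decade])
```

The technical type (integer) allows the arithmetic, but the conceptual type (ordinal category) suggests that counting is the more useful operation.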

Callout

Tip for learners (like Alex):

Understanding both the conceptual meaning and the technical format of your data helps you clean it correctly, document it clearly, and analyse it without errors.

Identify inconsistencies in data


Before we can clean or analyse data, it’s important to check for inconsistencies: values that don’t follow a standard or expected format. These might include:

  • Different spellings or formats for the same category
  • Mixed use of upper/lower case
  • Inconsistent date formats
  • Unexpected blank or missing values
  • Invalid or impossible values (e.g. negative heights, future birth dates)

These inconsistencies can lead to errors or misleading results if not corrected.

Example: Inconsistencies in the artistgender column

Here’s an example of how the same concept (“artistgender”) can be recorded in many different ways:

| objectid | artistgender |
|---|---|
| 1001 | Female |
| 1002 | female |
| 1003 | F |
| 1004 | Male |
| 1005 | MALE |
| 1006 | M |
| 1007 | Unknown |
| 1008 | (blank) |

We can see:

  • "Female", "female", and "F" all refer to the same category
  • "Male", "MALE", and "M" are also equivalent
  • "Unknown" and the blank entry might indicate missing or uncertain data

These differences need to be standardised before analysis — for example, converting all values to lowercase and replacing shorthand terms with full words.
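
A minimal Python sketch of that standardisation is shown below. The mapping, and the choice to treat both "Unknown" and blank entries as missing, are assumptions made for illustration. A real project would need to decide and document this.

```python
# Mapping from cleaned-up spellings to one agreed label (an assumed scheme).
GENDER_LABELS = {"female": "female", "f": "female", "male": "male", "m": "male"}

def standardise_gender(raw):
    """Lowercase and trim raw, expand shorthand, and use None for missing data."""
    cleaned = (raw or "").strip().lower()
    if cleaned in ("", "unknown"):
        return None
    # Values outside the mapping are returned cleaned, so they stay visible.
    return GENDER_LABELS.get(cleaned, cleaned)

values = ["Female", "female", "F", "Male", "MALE", "M", "Unknown", ""]
print([standardise_gender(v) for v in values])
```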

Challenge 1: Can you find any inconsistencies or problems with data entered into a spreadsheet?

Let’s take a closer look at the Met_Objects_Dataset_sample.txt dataset. Using a coloured fill, highlight any inconsistencies or problem data in the spreadsheet that you think might cause problems for anyone analysing the data.

Inconsistencies might include measurements in different units, differing date formats, differing letter case, or the same thing being recorded in several different ways.


Next, we’ll look at how to avoid these kinds of issues from happening in the first place.

Prevent inconsistencies during data collection


As we’ve seen so far, our dataset contains a number of inconsistencies that will complicate analysis. In an ideal world, we would have avoided introducing these errors while collecting the data. It’s always simpler to avoid inconsistencies in the first place, rather than trying to fix them later!

How could we have adjusted our data collection to avoid this? Let’s take the lastconserv column as an example, which represents the date when the object was last conserved. Here we see a large number of different date / time formats, including:

  • 28/01/2025 = day / month / year
  • 07/21/2023 = month / day / year
  • 26.03.23 = day.month.year
  • 07/06/2019 00:00 = day / month / year hour:minute

To avoid this, we could have enforced a specific date/time format during collection. For example, if we were using a form, we could have limited responses in this field to only accept dates as year-month-day, with no time entry allowed.

There are also some incorrect dates in this column e.g. 30/2/2024 (30th February 2024). February only has 28 days, or 29 during a leap year, so this date is impossible. We could have avoided this by providing some kind of date validation in the form - e.g. using a calendar input that only contains real dates.
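
The kind of check a form could apply can be sketched in Python. The helper name `parse_iso_date` is hypothetical. It accepts only year-month-day entries that are real calendar dates and returns None for everything else.

```python
from datetime import datetime

def parse_iso_date(text):
    """Return a date for text in year-month-day form, or None if it is invalid."""
    try:
        return datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        return None

for entry in ["2025-01-28", "2024-02-30", "28/01/2025", "2019-06-07 00:00"]:
    print(entry, "->", parse_iso_date(entry))
```

Only the first entry is accepted: the second is an impossible date, the third uses a different format, and the fourth includes a time.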

Some general guidelines

  • Avoid free text fields during data collection. This increases the risk of spelling mistakes, additional spaces etc., that will complicate the final analysis.

  • If a column should only contain particular values, then enforce this! For example, you could use a drop-down menu with set options to choose from.

  • Add validation to avoid ‘impossible’ values. For example, are values only valid within a certain range? Are negative values valid?

  • Where multiple formats are possible (e.g. with dates / times), enforce a specific format.
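
Two of these guidelines, a fixed set of allowed values and a valid range, can be sketched as a simple check in Python. The allowed media and the 0 to 1000 cm height range are illustrative assumptions, not rules from the dataset.

```python
# Hypothetical rules: the allowed media and the height range are illustrative.
ALLOWED_MEDIA = {"Marble", "Bronze", "Oil on canvas"}

def check_entry(medium, height_cm):
    """Return a list of problems found in one entry (empty if it passes)."""
    problems = []
    if medium not in ALLOWED_MEDIA:
        problems.append(f"medium {medium!r} is not in the allowed list")
    if not 0 < height_cm <= 1000:
        problems.append(f"height_cm {height_cm} is outside 0-1000")
    return problems

print(check_entry("Bronze", 23.5))
print(check_entry("bronze ", -4))
```

The second call reports two problems: the medium has the wrong case and a trailing space, and the height is negative.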

Challenge: Methods to prevent inconsistencies during data collection

In a small group, consider how you could prevent the other inconsistencies you identified in the dataset. What checks or rules could you introduce during data collection?

There are many different solutions to these inconsistencies, but here are some examples:

  • istimelinework could have used a drop down menu that enforced only two choices of True or False.
  • a check could have been added to enforce that accessionyear (the year the object entered the collection) is never earlier than objectdate (the year the object was created).
  • artistnationality could have used a drop-down menu containing set nationality options. This would have avoided inconsistencies like France vs French.

Write data collection guidelines


As we saw in the last section, there are many additional checks / rules we could have added during data collection to make our dataset more consistent and easier to analyse. It’s good practice to document these rules before data collection takes place so that it’s clear to yourself, along with any collaborators and future users of your dataset, exactly how values were collected. This will also be invaluable when it’s time to write the methods section of any papers or reports that use this data.

Make sure you include information about how to handle missing values. How will these be represented in your table? It is also useful to explicitly state why a value may be missing (if possible). For example, in our dataset some objects are made by manufacturing companies like United Merchants & Manufacturers rather than an individual artist - in this case artistgender will be missing, as it doesn’t apply in this scenario.

Challenge: Write data collection guidelines

Choose a variable from the dataset (e.g. lastconserv) and write some bullet-point guidelines for its collection. For example:

  • Which values are valid for this variable?

  • Which format should be used?

  • Are any checks required against other variables in the table?

  • If a value is missing, how should it be represented? E.g. NA, None, not applicable
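
Whichever representation is chosen, using it consistently makes missing values easy to detect later. The Python sketch below maps several spellings onto one marker. The list of tokens is an assumed convention for illustration.

```python
# Tokens treated as "missing" here are an assumed convention for illustration.
MISSING_TOKENS = {"", "na", "n/a", "none", "not applicable"}

def normalise_missing(value):
    """Return None for any missing-value token, otherwise the trimmed text."""
    text = value.strip()
    return None if text.lower() in MISSING_TOKENS else text

raw_values = ["2001-05-12", "NA", " ", "not applicable", "None"]
print([normalise_missing(v) for v in raw_values])
```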

Data Dictionaries


What is a Data Dictionary?

A data dictionary is a table that describes the variables in your dataset. It provides key information such as:

  • Variable name (column header)
  • Description (what the variable represents)
  • Data type (e.g., string, integer, boolean, datetime)
  • Possible values or format (especially for categorical variables)
  • Units (if relevant)

Data dictionaries help others (and future you!) understand and use your data consistently and correctly.

Example: Wildlife Observations Dataset

Here’s a sample data dictionary for a fictional dataset tracking wildlife sightings in a nature reserve:

| Variable Name | Description | Data Type | Possible Values / Format | Units |
|---|---|---|---|---|
| sighting_id | Unique ID for each observation | Integer | 1, 2, 3, … | N/A |
| species_name | Name of the animal species observed | String | e.g., “Red Fox”, “Barn Owl” | N/A |
| count | Number of individuals seen | Integer | 0, 1, 2, … | Count |
| observation_date | Date the observation was recorded | Datetime | YYYY-MM-DD | N/A |
| location | Area of the park where sighting occurred | String | “North Woods”, “Wetland Trail” | N/A |
| is_endangered | Whether the species is endangered | Boolean | TRUE, FALSE | N/A |
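
A data dictionary can also be used to check data automatically. The Python sketch below reduces the wildlife dictionary to a mapping from variable name to expected type and tests one made-up observation against it. The helper name `type_errors` and the sample row are hypothetical.

```python
from datetime import date

# The wildlife data dictionary above, reduced to name -> expected Python type.
DATA_DICTIONARY = {
    "sighting_id": int,
    "species_name": str,
    "count": int,
    "observation_date": date,
    "location": str,
    "is_endangered": bool,
}

def type_errors(row):
    """List the variables in row that are missing or have an unexpected type."""
    return [
        name
        for name, expected in DATA_DICTIONARY.items()
        if not isinstance(row.get(name), expected)
    ]

# Hypothetical observation; "count" is deliberately stored as text.
row = {
    "sighting_id": 1, "species_name": "Red Fox", "count": "2",
    "observation_date": date(2024, 5, 1), "location": "North Woods",
    "is_endangered": False,
}
print(type_errors(row))
```

This checks types only. Allowed values, formats and units from the dictionary would need further rules.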

Challenge: Write a Data Dictionary for Alex

Alex is trying to make sense of the MET museum dataset. Help Alex out by creating a mini data dictionary!

  1. Open the file Met_Objects_Dataset_sample.txt
  2. Choose three variables (columns) from the dataset
  3. For each one, write down:
    • The variable name
    • A short description
    • The data type (e.g., string, integer, date)
    • Any possible values or units, if relevant

Work in pairs or small groups and compare your answers.

Key Points

  • keypoint 1
  • keypoint 2

Content from How to clean a tabular dataset


Last updated on 2025-03-28

Overview

Questions

  • What is ‘clean’ data?
  • How can we find inconsistencies in tabular data?
  • How can we correct inconsistencies in tabular data?

Objectives

  • Describe what data cleaning is and why it is important
  • Find and resolve inconsistencies within a tabular dataset programmatically (e.g. datetime formats, numeric precision)
  • Identify missing values within a tabular dataset using filters
  • Correct spelling mistakes using spell check tools and find + replace
  • Standardise text formats using spreadsheet functions
  • Describe the pros and cons of using spreadsheets for data collection and cleaning

Challenge 1: Can you do it?

Open film_dataset.csv.

  1. How many missing values are there in the ‘film_title’ column?
  2. Are there any duplicate entries in the dataset? If so, how many?

Solution

  1. There are 7 missing values in the film_title column
  2. There are 5 duplicate rows in the dataset
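
The same two questions can also be answered with a script. The Python sketch below runs on a small made-up table rather than the real film_dataset.csv, so its counts do not match the answers above. Only the film_title column name comes from the challenge.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for film_dataset.csv; only film_title comes from the text.
sample_csv = """film_title,year
Alien,1979
,1982
Alien,1979
Heat,1995
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# A value counts as missing when the cell is empty or only whitespace.
missing_titles = sum(1 for r in rows if not r["film_title"].strip())

# A row is a duplicate when an identical row has already appeared.
row_counts = Counter(tuple(r.values()) for r in rows)
duplicate_rows = sum(n - 1 for n in row_counts.values())

print(missing_titles, duplicate_rows)
```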

Key Points

  • keypoint

Content from Introduction to R


Last updated on 2025-03-28

Overview

Questions

  • What is….

Objectives

  • Objective 1