Content from What is Research Data?
Last updated on 2025-03-28
Estimated time: 60 minutes
Overview
Questions
- What is research data, and why is it important in academic and scientific research?
- What are the different types of research data?
- Where can research data come from?
- What are the key components of research data management (RDM)?
Objectives
- Data types
- Sources of data
- What is research data management (collection, storage, organisation, sharing etc)
This is a lesson created via The Carpentries Workbench. It is written in Pandoc-flavored Markdown for static files and R Markdown for dynamic files that can render code into output. Please refer to the Introduction to The Carpentries Workbench for full documentation.
What you need to know is that there are three sections required for a valid Carpentries lesson:
- `questions` are displayed at the beginning of the episode to prime the learner for the content.
- `objectives` are the learning objectives for an episode displayed with the questions.
- `keypoints` are displayed at the end of the episode to reinforce the objectives.
Inline instructor notes can help inform instructors of timing challenges associated with the lessons. They appear in the “Instructor View”.
Challenge 1: Can you do it?
What is the output of this command?
R
paste("This", "new", "lesson", "looks", "good")
OUTPUT
[1] "This new lesson looks good"
Challenge 2: how do you nest solutions within challenge blocks?
You can add a line with at least three colons and a `solution` tag.
Figures
You can use standard markdown for static figures with the following syntax:
{alt='alt text for accessibility purposes'}
Callout
Callout sections can highlight information.
They are sometimes used to emphasise particularly important points but are also used in some lessons to present “asides”: content that is not central to the narrative of the lesson, e.g. by providing the answer to a commonly-asked question.
Math
One of our episodes contains \(\LaTeX\) equations when describing how to create dynamic reports with {knitr}, so we now use mathjax to describe this:
$\alpha = \dfrac{1}{(1 - \beta)^2}$
becomes: \(\alpha = \dfrac{1}{(1 - \beta)^2}\)
Cool, right?
Key Points
- Use `.md` files for episodes when you want static content
- Use `.Rmd` files for episodes when you need to generate output
- Run `sandpaper::check_lesson()` to identify any issues with your lesson
- Run `sandpaper::build_lesson()` to preview your lesson locally
Content from Structuring Research Materials
Last updated on 2025-03-28
Estimated time: 60 minutes
Overview
Questions
- How can you structure data using a standard folder system for better organisation?
- What are the benefits of using a consistent file naming convention in research data management?
- Why is version control important, and how can it be incorporated into file naming practices?
- In what ways can version control tools like Git and GitHub be useful for managing data?
Objectives
- Organise research data into a standard folder structure
- Name files with a consistent naming convention
- Understand why version control is important, and how to incorporate it into file naming conventions
- Explain why version control software such as Git/GitHub can be useful for certain types of data.
Organising files into a folder structure
In groups, look through the folder of data that you have been given:
- What problems can you identify with how files are organised?
- How many different datasets can you identify?
- Which files are ‘raw’ vs ‘processed’ data?
- How would you improve the organisation of the files? E.g. how would you split them between different folders?
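One possible answer to the last question can be sketched in code. The folder names below are hypothetical; only the split between ‘raw’ and ‘processed’ data comes from the exercise, and your group may well settle on a different layout.

```python
from pathlib import Path
import tempfile

# Hypothetical folder names; only the raw/processed split comes from the exercise.
FOLDERS = ["data/raw", "data/processed", "docs", "scripts"]


def create_project_folders(root: Path) -> list:
    """Create each folder under root (if missing) and return the created paths."""
    created = []
    for name in FOLDERS:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


# Demonstrate inside a temporary directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    paths = create_project_folders(Path(tmp) / "my_project")
    layout = sorted(p.relative_to(tmp).as_posix() for p in paths)
    print(layout)
```

Keeping raw files in their own folder makes it harder to overwrite them by accident while producing processed versions.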
Content from Tabular Data Collection
Last updated on 2025-06-02
Estimated time: 60 minutes
Overview
Questions
- What types of variables are commonly found in tabular data?
- What kinds of data inconsistencies can affect the quality of a dataset?
- What are some common causes of inconsistent or messy data?
- What practices can help ensure clean, consistent data during collection and entry?
- Why is it important to provide clear instructions or rules when collecting data?
- What is a data dictionary, and why is it useful?
Objectives
After following this episode, learners will be able to:
- List variable types and formats
- Identify inconsistencies in data that can cause problems during analysis
- Describe methods that can be used during data collection and data entry that can prevent inconsistencies
- Write guidance for how to collect and enter data
- Create a data dictionary describing a dataset
Variables, data types and formats
Alex has received a dataset from the MET museum and needs to understand the types of variables before exploring or analysing it further.
Follow along: Open up the dataset
You should have downloaded a dataset called Met_Objects_Dataset_sample.txt as part of the setup instructions. Please open this file in your spreadsheet software (e.g. LibreOffice, Excel). The file is tab-delimited (i.e. within each row a tab character separates the values into their columns), so you may need to use the Text to Columns tool that your spreadsheet software provides to convert it into columnar data. The first row contains the column headers.
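If you would rather inspect the file with code than with a spreadsheet, the following is a minimal sketch of reading tab-delimited text with Python's standard `csv` module. The two-row sample and its column names are invented for illustration, and the UTF-8 encoding mentioned in the comment is an assumption about the real file.

```python
import csv
import io

# Invented sample standing in for the real file; the column names are
# illustrative and not copied from the dataset.
sample = (
    "objectid\ttitle\tmedium\n"
    "1001\tUntitled\tOil on canvas\n"
    "1002\tVase\tBronze\n"
)


def read_tab_delimited(handle):
    """Return the rows of a tab-delimited file as dicts keyed by column header."""
    return list(csv.DictReader(handle, delimiter="\t"))


rows = read_tab_delimited(io.StringIO(sample))
print(rows[0]["medium"])

# With the real file you would pass an open file handle instead, e.g.
# with open("Met_Objects_Dataset_sample.txt", newline="", encoding="utf-8") as f:
#     rows = read_tab_delimited(f)
```

Every value comes back as a string, so the data types discussed in the next section still have to be decided and applied by you.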
What is a data point?
A data point is a single piece of information collected for one variable about one item.
In the MET museum dataset Alex is using, each row is an object (like a painting or sculpture), and each cell is a data point.
Alex’s dataset looks like a spreadsheet, but underneath, each column contains a specific data type. Knowing these helps to avoid errors and choose the right tools for analysis.
Basic data types
Before we look at different types of variables, here are some common data types you’ll encounter:
- **String**: Text or characters, like `"Claude Monet"` or `"Oil on canvas"`
- **Integer**: Whole numbers, like `1985`, `42`, or `0`
- **Float**: Decimal numbers, like `27.5` or `3.14`
- **Boolean**: True/False values, like `TRUE`, `FALSE`, `Yes`, `No`
- **Datetime**: Calendar dates or timestamps, like `"2020-01-01"` or `"12/11/2027"`
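To make these types concrete, here is a rough sketch of how a program might guess the type of a single cell. The function `guess_type` is hypothetical, and its rules are deliberately simple: it only recognises dates written as year-month-day, so a value like `"12/11/2027"` falls through to string.

```python
from datetime import datetime


def guess_type(text: str) -> str:
    """Guess a basic data type for one cell of text (a rough heuristic only)."""
    value = text.strip()
    if value.upper() in {"TRUE", "FALSE", "YES", "NO"}:
        return "boolean"
    try:
        int(value)
        return "integer"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        pass
    try:
        # Only the YYYY-MM-DD layout is recognised as a date here.
        datetime.strptime(value, "%Y-%m-%d")
        return "datetime"
    except ValueError:
        return "string"


for cell in ["Claude Monet", "1985", "27.5", "TRUE", "2020-01-01", "12/11/2027"]:
    print(cell, "->", guess_type(cell))
```

The last example shows why mixed date formats cause trouble: software that expects one layout will quietly treat the others as plain text.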
What is a variable?
A variable is a characteristic or attribute that can take on different values. In tabular data, variables are usually represented as columns, where each row contains an observation or entry.
However, the concept of a variable is independent of format: it’s not defined by being a column, but by being a consistent type of information collected across observations.
For example, in Alex’s MET dataset, variables might include objectid, artistname, or dateacquired.
Types of variables
Numeric variables
Variables that represent measurable quantities. These can be integers or floats. Numeric variables can be Discrete, which means they take on specific, separate values (often counts), or Continuous, which can take on any value within a range (often measurements).
Examples:
- `objectid` → `12345` (integer, discrete: a unique ID number)
- `heightcm` → `23.5` (float, continuous: a measurement in centimetres)
- `objectdate` → `1890` (integer, discrete: a specific year)
String variables
Free-form or descriptive text.
Examples:
- `artistdisplayname` → `"Claude Monet"` (string)
- `title` → `"Woman with a Parasol"` (string)
Categorical variables
Variables that represent groups or categories. These could be strings, integers, or floats - anything used to label a category! Categorical variables can be Nominal, which means there is no inherent order (e.g., `artistnationality`), or Ordinal, which means the categories follow a logical order (e.g., `popularity`).
Examples:
- `gender` → `"Female"`, `"Male"` (string, nominal)
- `medium` → `"Marble"`, `"Bronze"`, `"Oil on canvas"` (string, nominal)
- `istimelinework` → `"Yes"` / `"No"` (string, nominal, or Boolean: `TRUE` / `FALSE`)
- `artistdecade` → `1950`, `1960`, `1980` (integer, ordinal: ordered decades)
Date/time variables
Variables that represent dates or times.
Examples:
- `lastconserv` → `"2001-05-12"` (datetime or string)
- `objectdate` → `1990` (integer), `"ca. 1890"` (string)
Callout
Note on overlapping types:
Some variables can belong to more than one category depending on their use and format. For example:
- `objectdate = 1890` might be treated as a numeric variable (discrete integer) if used for sorting or calculations.
- The same `objectdate` could also be considered a date/time variable if formatted as `"1890-01-01"` and used in time-based analyses.
- `artistdecade = 1950` could be a categorical variable (ordinal) if grouped into decade-based categories for comparison.
It’s okay for a single value to have more than one interpretation - what matters is how it’s used in context.
Caution
⚠️ Some columns might look like numbers but contain inconsistent formats (e.g., “ca. 1890”). These need cleaning before they can be analysed as dates.
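As a small illustration of that cleaning step, the sketch below pulls a four-digit year out of values such as `"ca. 1890"`. The helper `extract_year` is hypothetical, and it throws away the uncertainty implied by "ca.", so in practice you would keep the original column alongside the cleaned one.

```python
import re


def extract_year(value: str):
    """Return the first four-digit number in value as an int, or None if absent."""
    match = re.search(r"\b(\d{4})\b", value)
    return int(match.group(1)) if match else None


for raw in ["1890", "ca. 1890", "1890-01-01", "unknown"]:
    print(repr(raw), "->", extract_year(raw))
```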
Summary
Understanding the difference between conceptual types (how the data is used or interpreted) and technical types (how the data is stored or formatted) is key for working effectively with tabular data. For example, a column might be technically an integer but conceptually a category (like decades or survey scores).
Conceptual Type | Technical Type | Description | Example |
---|---|---|---|
Nominal | String | Categories, no order | artistnationality = Australian |
Ordinal | String | Categories with order | popularity = high |
Discrete Numeric | Integer | Countable numbers | objectid = 123456 |
Continuous Numeric | Integer, Float | Measurable, decimals allowed | height = 27.5 |
Boolean | Boolean | Yes/No, True/False | ishighlight = TRUE |
Date/Time | Datetime | Dates or times | lastconserv = 12/11/2027 |
Textual | String | Free text | artistdisplayname = Claude Monet |
Identifier | Integer/String | Unique reference | objectnumber = 1982.456 |
Callout
Tip for learners (like Alex):
Understanding both the conceptual meaning and the technical format of your data helps you clean it correctly, document it clearly, and analyse it without errors.
Identify inconsistencies in data
Before we can clean or analyse data, it’s important to check for inconsistencies: values that don’t follow a standard or expected format. These might include:
- Different spellings or formats for the same category
- Mixed use of upper/lower case
- Inconsistent date formats
- Unexpected blank or missing values
- Invalid or impossible values (e.g. negative heights, future birth dates)
These inconsistencies can lead to errors or misleading results if not corrected.
Example: Inconsistencies in the `artistgender` column
Here’s an example of how the same concept (“artistgender”) can be recorded in many different ways:
objectid | artistgender |
---|---|
1001 | Female |
1002 | female |
1003 | F |
1004 | Male |
1005 | MALE |
1006 | M |
1007 | Unknown |
1008 |
We can see:
- `"Female"`, `"female"`, and `"F"` all refer to the same category
- `"Male"`, `"MALE"`, and `"M"` are also equivalent
- `"Unknown"` and the blank entry might indicate missing or uncertain data
These differences need to be standardised before analysis — for example, converting all values to lowercase and replacing shorthand terms with full words.
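The standardisation just described could look something like the sketch below. The mapping and the function name are illustrative, and treating both `"Unknown"` and blank cells as `None` is an assumption; your own guidelines might keep those two cases separate.

```python
# Mapping from lowercase variants to one standard label. The labels chosen
# and the handling of "unknown" are assumptions made for illustration.
STANDARD = {"female": "female", "f": "female", "male": "male", "m": "male"}


def standardise_gender(value):
    """Lowercase and expand shorthand; return None for blank or unrecognised values."""
    cleaned = (value or "").strip().lower()
    return STANDARD.get(cleaned)


raw = ["Female", "female", "F", "Male", "MALE", "M", "Unknown", ""]
standardised = [standardise_gender(v) for v in raw]
print(standardised)
```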
Challenge 1: Can you find any inconsistencies or problems with data entered into a spreadsheet?
Let’s take a deep dive into the Met_Objects_Dataset_sample.txt dataset. Using a coloured fill, highlight any inconsistent or problematic values in the spreadsheet that you think might cause problems for anyone analysing the data.
Inconsistencies might include measurements in different units, differing date formats, differing letter cases, or the same thing being recorded in several different ways.
Next, we’ll look at how to avoid these kinds of issues from happening in the first place.
Prevent inconsistencies during data collection
As we’ve seen so far, our dataset contains a number of inconsistencies that will complicate analysis. In an ideal world, we would have avoided introducing these errors while collecting the data. It’s always simpler to avoid inconsistencies in the first place, rather than trying to fix them later!
How could we have adjusted our data collection to avoid this? Let’s take the `lastconserv` column as an example, which represents the date when the object was last conserved. Here we see a large number of different date / time formats, including:
- 28/01/2025 = day / month / year
- 07/21/2023 = month / day / year
- 26.03.23 = day.month.year
- 07/06/2019 00:00 = day / month / year hour:minute
To avoid this, we could have enforced a specific date/time format during collection. For example, if we were using a form, we could have limited responses in this field to only accept dates as year-month-day, with no time entry allowed.
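To show why enforcing one format up front is easier, here is a sketch of what repairing the four formats above after the fact might involve. The list of formats and the order they are tried in are assumptions: a value like `07/06/2019` is read as day/month only because that pattern is tried first, and nothing in the value itself tells us that is right.

```python
from datetime import datetime

# Formats are tried in order. "07/06/2019" is ambiguous, and is read as
# day/month purely because that pattern comes first (an assumption).
FORMATS = ["%d/%m/%Y", "%m/%d/%Y", "%d.%m.%y", "%d/%m/%Y %H:%M"]


def to_iso_date(text: str):
    """Return text as YYYY-MM-DD using the first format that matches, else None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


for raw in ["28/01/2025", "07/21/2023", "26.03.23", "07/06/2019 00:00"]:
    print(raw, "->", to_iso_date(raw))
```

The guesswork in this function is exactly what a fixed year-month-day entry format would have made unnecessary.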
There are also some incorrect dates in this column, e.g. `30/2/2024` (30th February 2024). February only has 28 days, or 29 during a leap year, so this date is impossible. We could have avoided this by providing some kind of date validation in the form, e.g. using a calendar input that only contains real dates.
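A form's calendar input does this check for you, but the same idea can be sketched in a few lines. The function name is hypothetical and the day/month/year layout is assumed; Python's `datetime.strptime` raises a `ValueError` for calendar dates that do not exist.

```python
from datetime import datetime


def is_real_date(text: str, fmt: str = "%d/%m/%Y") -> bool:
    """Return True if text matches fmt and names a date that exists in the calendar."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False


print(is_real_date("30/2/2024"))  # 30th February never exists
print(is_real_date("29/2/2024"))  # 2024 is a leap year
```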
Some general guidelines
Avoid free text fields during data collection. Free text increases the risk of spelling mistakes, additional spaces etc., that will complicate the final analysis.
If a column should only contain particular values, then enforce this! For example, you could use a drop-down menu with set options to choose from.
Add validation to avoid ‘impossible’ values. For example, are values only valid within a certain range? Are negative values valid?
Where multiple formats are possible (e.g. with dates / times), enforce a specific format.
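These guidelines can be expressed as simple checks that run when a value is entered. The sketch below is one way to do that; the list of allowed options and the upper height limit are hypothetical and would need to be agreed for your own dataset.

```python
# Illustrative option list and range limit; both are assumptions.
ALLOWED_MEDIUMS = {"Marble", "Bronze", "Oil on canvas"}
MAX_HEIGHT_CM = 1000.0


def check_entry(medium: str, height_cm: float) -> list:
    """Return a list of human-readable problems found in one entry (empty if none)."""
    problems = []
    # Restricting to set options plays the role of a drop-down menu.
    if medium not in ALLOWED_MEDIUMS:
        problems.append(f"medium {medium!r} is not one of the allowed options")
    # Range validation rejects 'impossible' values such as negative heights.
    if not 0 < height_cm <= MAX_HEIGHT_CM:
        problems.append(f"height {height_cm} cm is outside the valid range")
    return problems


print(check_entry("Bronze", 23.5))
print(check_entry("bronze ", -5))
```

Note that the second entry fails the first check only because of case and a trailing space, which is the kind of mistake a drop-down menu prevents.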
Challenge: Methods to prevent inconsistencies during data collection
In a small group, consider how you could prevent the other inconsistencies you identified in the dataset. What checks or rules could you introduce during data collection?
There are many different solutions to these inconsistencies, but here are some examples:
- `istimelinework` could have used a drop-down menu that enforced only two choices of `True` or `False`.
- A check could have been added to enforce that `accessionyear` (the year the object entered the collection) is always after `objectdate` (the year the object was created).
- `artistnationality` could have used a drop-down menu containing set nationality options. This would have avoided inconsistencies like `France` vs `French`.
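The check between `accessionyear` and `objectdate` is an example of a rule that compares two variables in the same row. A minimal sketch follows; the sample rows are invented, and reading "after" as "not earlier than" (so that an object made and acquired in the same year passes) is an assumption.

```python
def find_date_conflicts(rows):
    """Return the objectid of every row whose accessionyear is before its objectdate."""
    return [
        row["objectid"]
        for row in rows
        if row["accessionyear"] < row["objectdate"]
    ]


# Invented rows for illustration only.
rows = [
    {"objectid": 1, "objectdate": 1890, "accessionyear": 1925},
    {"objectid": 2, "objectdate": 1950, "accessionyear": 1949},
    {"objectid": 3, "objectdate": 2001, "accessionyear": 2001},
]
conflicts = find_date_conflicts(rows)
print(conflicts)
```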
Write data collection guidelines
As we saw in the last section, there are many additional checks / rules we could have added during data collection to make our dataset more consistent and easier to analyse. It’s good practice to document these rules before data collection takes place so that it’s clear to yourself, along with any collaborators and future users of your dataset, exactly how values were collected. This will also be invaluable when it’s time to write the methods section of any papers or reports that use this data.
Make sure you include information about how to handle missing values. How will these be represented in your table? It is also useful to explicitly state why a value may be missing (if possible). For example, in our dataset some objects are made by manufacturing companies like `United Merchants & Manufacturers` rather than an individual artist - in this case `artistgender` will be missing, as it doesn’t apply in this scenario.
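One way to make the reason for a missing value explicit is to use different markers for "not recorded" and "does not apply". The markers below are a hypothetical convention, not one taken from the dataset; whatever you choose should be written into your guidelines.

```python
# Hypothetical convention: blank or "NA" means the value was not recorded,
# while "not applicable" means no value can exist for this row.
MISSING_MARKERS = {"", "NA"}
NOT_APPLICABLE = "not applicable"


def describe_value(value: str) -> str:
    """Classify a cell as 'missing', 'not applicable' or 'present'."""
    cleaned = value.strip()
    if cleaned in MISSING_MARKERS:
        return "missing"
    if cleaned.lower() == NOT_APPLICABLE:
        return "not applicable"
    return "present"


for cell in ["Female", "", "NA", "not applicable"]:
    print(repr(cell), "->", describe_value(cell))
```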
Write data collection guidelines
Choose a variable from the dataset (e.g. `lastconserv`) and write some bullet-point guidelines for its collection. For example:
- Which values are valid for this variable?
- Which format should be used?
- Are any checks required against other variables in the table?
- If a value is missing, how should it be represented? E.g. NA, None, not applicable
Data Dictionaries
What is a Data Dictionary?
A data dictionary is a table that describes the variables in your dataset. It provides key information such as:
- Variable name (column header)
- Description (what the variable represents)
- Data type (e.g., string, integer, boolean, datetime)
- Possible values or format (especially for categorical variables)
- Units (if relevant)
Data dictionaries help others (and future you!) understand and use your data consistently and correctly.
Example: Wildlife Observations Dataset
Here’s a sample data dictionary for a fictional dataset tracking wildlife sightings in a nature reserve:
Variable Name | Description | Data Type | Possible Values / Format | Units |
---|---|---|---|---|
`sighting_id` | Unique ID for each observation | Integer | 1, 2, 3, … | N/A |
`species_name` | Name of the animal species observed | String | e.g., “Red Fox”, “Barn Owl” | N/A |
`count` | Number of individuals seen | Integer | 0, 1, 2, … | Count |
`observation_date` | Date the observation was recorded | Datetime | YYYY-MM-DD | N/A |
`location` | Area of the park where sighting occurred | String | “North Woods”, “Wetland Trail” | N/A |
`is_endangered` | Whether the species is endangered | Boolean | TRUE, FALSE | N/A |
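A data dictionary can also be used by a program to check that new rows follow the documented types. The sketch below encodes the wildlife dictionary above as one parser per variable; the helper names and the two sample rows are invented for illustration.

```python
from datetime import datetime


def parse_bool(text):
    """Accept only the two values listed in the data dictionary."""
    if text not in ("TRUE", "FALSE"):
        raise ValueError(f"expected TRUE or FALSE, got {text!r}")
    return text == "TRUE"


def parse_date(text):
    """Accept only dates written as YYYY-MM-DD."""
    return datetime.strptime(text, "%Y-%m-%d").date()


# One parser per variable, mirroring the Data Type column of the dictionary.
PARSERS = {
    "sighting_id": int,
    "species_name": str,
    "count": int,
    "observation_date": parse_date,
    "location": str,
    "is_endangered": parse_bool,
}


def find_problems(row: dict) -> list:
    """Return the names of variables that are absent or fail their parser."""
    problems = []
    for name, parser in PARSERS.items():
        try:
            parser(row[name])
        except (KeyError, ValueError):
            problems.append(name)
    return problems


# Invented sample rows, with values as they would arrive from a text file.
good = {"sighting_id": "1", "species_name": "Red Fox", "count": "2",
        "observation_date": "2024-05-01", "location": "North Woods",
        "is_endangered": "FALSE"}
bad = dict(good, count="two", observation_date="01/05/2024")
print(find_problems(good))
print(find_problems(bad))
```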
Challenge: Write a Data Dictionary for Alex
Alex is trying to make sense of the MET museum dataset. Help Alex out by creating a mini data dictionary!
- Open the file `Met_Objects_Dataset_sample.txt`
- Choose three variables (columns) from the dataset
- For each one, write down:
  - The variable name
  - A short description
  - The data type (e.g., string, integer, date)
  - Any possible values or units, if relevant
Work in pairs or small groups and compare your answers.
Key Points
- keypoint 1
- keypoint 2
Content from How to clean a tabular dataset
Last updated on 2025-03-28
Estimated time: 50 minutes
Overview
Questions
- What is ‘clean’ data?
- How can we find inconsistencies in tabular data?
- How can we correct inconsistencies in tabular data?
Objectives
- Describe what data cleaning is and why it is important
- Find and resolve inconsistencies within a tabular dataset programmatically (e.g. datetime, numeric precision)
- Identify missing values within a tabular dataset using filters
- Correct spelling mistakes using spell check tools and find + replace
- Standardise text formats using spreadsheet functions
- Describe the pros and cons of using spreadsheets for data collection and cleaning
Challenge 1: Can you do it?
Open `film_dataset.csv`.
- How many missing values are there in the ‘film_title’ column?
- Are there any duplicate entries in the dataset? If so, how many?
- There are 7 missing values in the `film_title` column
- There are 5 duplicate rows in the dataset
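The same two questions can be answered with code instead of spreadsheet filters. The sketch below runs on a small invented sample rather than the real `film_dataset.csv`, so its counts are not the answers above; the `year` column is hypothetical, and "duplicate" is taken to mean a row that exactly repeats an earlier one.

```python
import csv
import io

# Invented sample; only the film_title column name comes from the challenge.
sample = (
    "film_title,year\n"
    "Alien,1979\n"
    ",1982\n"
    "Alien,1979\n"
    "Heat,1995\n"
)


def count_missing(rows, column):
    """Count rows where the given column is empty or only whitespace."""
    return sum(1 for row in rows if not row[column].strip())


def count_duplicates(rows):
    """Count rows that exactly repeat an earlier row."""
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(row.items())
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates


rows = list(csv.DictReader(io.StringIO(sample)))
print(count_missing(rows, "film_title"), count_duplicates(rows))
```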
Key Points
- keypoint
Content from Introduction to R
Last updated on 2025-03-28
Estimated time: 60 minutes
Overview
Questions
- What is….
Objectives
- Objective 1