Key Points
What is Research Data?
- Use
.mdfiles for episodes when you want static content - Use
.Rmdfiles for episodes when you need to generate output - Run
sandpaper::check_lesson()to identify any issues with your lesson - Run
sandpaper::build_lesson()to preview your lesson locally
Structuring Research Materials
- Organise files into a hierarchical folder structure: start broad and drill down into specific areas, using a sensible number of folders with meaningful names
- Use a consistent file naming convention across a project so that you and your colleagues can easily find and identify files
- Avoid spaces and special characters in file names; use hyphens or underscores to separate parts of the name instead
- Use the YYYYMMDD (ISO 8601) date format in file names to ensure files sort correctly in date order
- Use version numbers (e.g. v1.0, v1.1) or date-based suffixes to track document versions, and keep earlier versions in a clearly named subfolder
- Version control tools such as Git and GitHub are particularly useful for text-based files (code, scripts, documentation) that change frequently or are worked on collaboratively
Tabular Data Collection
- Variables in tabular data can be numeric, string, categorical, or date/time, and a single variable may have both a conceptual type (how it is used) and a technical format (how it is stored).
- Data inconsistencies such as mixed cases, varying formats, and invalid values can cause errors during analysis and should be identified before working with a dataset.
- Inconsistencies are easier to prevent than to fix, so enforcing formats, using drop-down menus, and adding validation rules during data collection reduces the need for cleaning later.
- Documenting data collection guidelines before collecting data ensures consistency and supports reproducibility for collaborators and future users.
- A data dictionary describes the variables in a dataset, including their names, types, possible values, and units, making the dataset easier to understand and use correctly.
How to clean a tabular dataset
- keypoint