Reproducible research

Good practices

December 2023

Nicolas Casajus

Introduction

What is reproducibility?

Reproducibility is about results that can be obtained by someone else (or you in the future) given the same data and the same code. This is a technical problem.

We talk about computational reproducibility.

Why does it matter?

An article about computational results is advertising, not scholarship. The actual scholarship is the full software environment, code and data, that produced the result.

Claerbout & Karrenbach (1992)¹

Reproducibility has the potential to serve as a minimum standard for judging scientific claims (…).

Peng (2011)²

Sharing the code and the data is now a prerequisite for publishing in many journals

Reproducibility spectrum

Each degree of reproducibility requires additional skills and time. While some of those skills (e.g. literate programming, version control, setting up environments) pay off in the long run, they can require a high up-front investment.

Concepts

According to Wilson et al. (2017)¹, good practices for better reproducibility can be organized into the following six topics:

Data management

Project organization

Tracking changes

Collaboration

Manuscript

Code & Software

Data management

Raw data

General recommendations¹

Save and backup the raw data
Do not modify raw data (even for minor changes)
Raw data should be in a read-only mode (🔒)
Any modification produces an output or derived data
Write code for data acquisition (when possible)
- database requests
- API requests
- download.file(), wget, curl, etc.
Describe and document raw data (README, metadata, etc.)

Proposed files organization

project/
└─ data/
   └─ raw-data/
      ├─ raw-data-1.csv 🔒
      ├─ ...
      └─ README.md
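
The recommendations above can be sketched in a few shell commands. This is only an illustration under assumptions: the project lives in a temporary folder, the CSV content is made up, and the `printf` line stands in for a real acquisition step (e.g. `curl` or `wget`).

```shell
#!/bin/sh
set -eu

# Hypothetical project location (a temporary folder for this demo)
project="$(mktemp -d)/project"
mkdir -p "$project/data/raw-data"

# Stand-in for a real acquisition step (e.g. curl or wget)
printf 'site,count\nA,12\nB,7\n' > "$project/data/raw-data/raw-data-1.csv"

# Document the raw data next to it
printf '# Raw data\n\nraw-data-1.csv: hypothetical survey counts.\n' \
  > "$project/data/raw-data/README.md"

# Remove write permission for everyone: the file is now read-only
chmod a-w "$project/data/raw-data/raw-data-1.csv"

# Show the resulting permissions
ls -l "$project/data/raw-data"
```

Removing write permission does not replace a backup, but it protects the raw file against accidental edits.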

Derived data

General recommendations¹

Modified raw data becomes derived data (or an output)
Record all the steps used to process data
Create the data you wish to see in the world
Create analysis-friendly data: tidy data

Source: https://allisonhorst.com/other-r-fun

Proposed files organization

project/
├─ data/
│  ├─ raw-data/
│  │  ├─ raw-data-1.csv 🔒
│  │  ├─ ...
│  │  └─ README.md
│  │
│  └─ derived-data/
│     ├─ derived-data-1.RData
│     └─ ...
│
└─ code/
   ├─ process-raw-data-1.R
   └─ ...

Alternative

project/
├─ data/
│  ├─ raw-data-1.csv 🔒
│  ├─ ...
│  └─ README.md
│
├─ outputs/
│  ├─ output-1.RData
│  └─ ...
│
└─ code/
   ├─ process-raw-data-1.R
   └─ ...
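
A processing step that follows this organization might look like the sketch below. The data, the cleaning rule (drop rows with a missing count) and the CSV output format are assumptions made for the example. The point is that the raw file is only read and the result is written to derived-data/.

```shell
#!/bin/sh
set -eu

# Hypothetical project in a temporary folder
project="$(mktemp -d)/project"
mkdir -p "$project/data/raw-data" "$project/data/derived-data"

# Hypothetical raw data, locked after acquisition
raw="$project/data/raw-data/raw-data-1.csv"
printf 'site,count\nA,12\nB,NA\nC,7\n' > "$raw"
chmod a-w "$raw"

# Processing step: keep the header and drop rows with a missing count.
# The result goes to derived-data/; the raw file is only read.
derived="$project/data/derived-data/derived-data-1.csv"
awk -F',' 'NR == 1 || $2 != "NA"' "$raw" > "$derived"

cat "$derived"
```

Because the step is a script, it can be rerun at any time to rebuild the derived data from the untouched raw data.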

Data submission

Submit data to a reputable DOI-issuing repository so that others can access and cite it

Write a data paper: Scientific Data, Data in Brief, etc.

Do not forget to write good metadata (e.g. EML)

Develop tools to access and handle published data (e.g. API, R package, Shiny app, etc.)

Project organization

First things first

« There are only two hard things in Computer Science: cache invalidation and naming things. »

Phil Karlton

Three principles for naming files¹

Human readable

Name contains information on the content
Respect the concept of a slug from semantic URLs

Machine readable

Regular expression and globbing friendly
- avoid spaces and accented characters
- good use of punctuation and case

# File names ----
files <- c("2020-survey_A.csv", "2021-survey_A.csv", "2021-survey_B.csv")

# Extract years ----
strsplit(files, "-") |>              # Split string by '-'
  lapply(function(x) x[1]) |>        # Get the first element
  unlist() |>                        # Convert to vector
  as.numeric()                       # Convert to numeric

[1] 2020 2021 2021

# Extract surveys ----
strsplit(files, "-") |>              # Split string by '-'
  lapply(function(x) x[2]) |>        # Get the second element
  unlist() |>                        # Convert to vector
  gsub("survey_|\\.csv", "", x = _)  # Clean output

[1] "A" "A" "B"
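
Machine-readable names are just as easy to parse outside R. The sketch below reuses the file names of the R example and extracts the same fields with plain shell parameter expansion (the variable names are only illustrative).

```shell
#!/bin/sh
set -eu

# Same file names as in the R example
for f in 2020-survey_A.csv 2021-survey_A.csv 2021-survey_B.csv; do
  year="${f%%-*}"          # Everything before the first '-'
  rest="${f#*-survey_}"    # Everything after 'survey_'
  survey="${rest%.csv}"    # Drop the extension
  echo "$year $survey"
done
```

No regular expression is needed here because the delimiters ('-', '_' and '.') are used consistently.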

Play well with default ordering

# File names (bad) ----
files <- c("1-survey_A.csv", "2-survey_B.csv", "10-survey_C.csv")

# Sort file names ----
sort(files)

[1] "1-survey_A.csv"  "10-survey_C.csv" "2-survey_B.csv"

# File names (better) ----
files <- c("01-survey_A.csv", "02-survey_B.csv", "10-survey_C.csv")

# Sort file names ----
sort(files)

[1] "01-survey_A.csv" "02-survey_B.csv" "10-survey_C.csv"
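
Zero-padded prefixes do not have to be typed by hand. As a minimal sketch (the file name pattern is hypothetical), `printf` with the `%02d` format can generate them, and the default lexicographic sort then matches the numeric order.

```shell
#!/bin/sh
set -eu

# Build file names with a zero-padded, two-digit prefix
names="$(for i in 1 2 10; do printf '%02d-survey.csv\n' "$i"; done)"

# Default (lexicographic) sorting now matches numeric order
sorted="$(printf '%s\n' "$names" | LC_ALL=C sort)"
printf '%s\n' "$sorted"
```

Choose the padding width from the largest number you expect (e.g. `%03d` if there may be more than 99 files).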

Naming variables

Be consistent and follow the guidelines of your community

Research compendium

The goal of a research compendium is to provide a standard and easily recognisable way for organizing the digital materials of a project to enable others to inspect, reproduce, and extend the research.

Marwick B, Boettiger C & Mullen L (2018)¹

Three generic principles

1. Files organized according to the conventions of the community
2. Clear separation of data, method, and output
3. Specify the computational environment that was used

A research compendium must be self-contained

Research compendium

Strong flexibility in the structure of a compendium

Small compendium

project/
├─ .git/
├─ data/ 🔒
├─ code/
│  └─ script.R
├─ outputs/
├─ project.Rproj
├─ .gitignore
└─ README.md

Medium compendium

project/
├─ .git/
├─ data/
│  ├─ raw-data/ 🔒
│  └─ derived-data/
├─ R/
│  ├─ function-x.R
│  └─ function-y.R
├─ analyses/
│  ├─ script-1.R
│  └─ script-n.R
├─ outputs/
├─ project.Rproj
├─ .gitignore
├─ DESCRIPTION
├─ LICENSE
├─ make.R
└─ README.md

Large compendium

project/
├─ .git/
├─ .github/
│  └─ workflows/
│     ├─ workflow-1.yaml
│     └─ workflow-n.yaml
├─ renv/
├─ data/
│  ├─ raw-data/ 🔒
│  └─ derived-data/
├─ R/
│  ├─ function-x.R
│  └─ function-y.R
├─ analyses/
│  ├─ script-x.R
│  └─ script-n.R
├─ outputs/
├─ paper/
│  ├─ references.bib
│  ├─ style.csl
│  └─ paper.Rmd
├─ project.Rproj
├─ .gitignore
├─ DESCRIPTION
├─ LICENSE
├─ CITATION.cff
├─ make.R
├─ Dockerfile
├─ renv.lock
└─ README.md
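
As a starting point, the skeleton of the small compendium can be created from a terminal. This is a sketch under assumptions: the project is created in a temporary folder and the files are empty placeholders. The .git/ folder and the .Rproj file are left to `git init` and RStudio.

```shell
#!/bin/sh
set -eu

# Hypothetical location for the new compendium
project="$(mktemp -d)/project"

# Folders of the small compendium
mkdir -p "$project/data" "$project/code" "$project/outputs"

# Minimal files (contents are placeholders)
printf '# Project title\n' > "$project/README.md"
: > "$project/.gitignore"
: > "$project/code/script.R"

# List what was created
find "$project" | sort
```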

RStudio Project

Use the power of RStudio Project

File > New Project...

RStudio IDE will create a .Rproj file (a simple text file) at the root of the folder

Double-click on a .Rproj file to open a fresh instance of RStudio, w/ the working directory pointing at the folder root
This will help you to create a self-contained workspace (= compendium)

In a few slides, we will talk about setwd()

In the meantime

RStudio IDE - Minimal configuration for a better reproducibility

Tools > Global Options > General

Never save your workspace as .RData
Decide what you want to save and use
save(), saveRDS(), write.csv(), etc.

Never save your command history
Write your code in scripts not in the console

Follow these two recommendations and use an RStudio Project, and you’ll¹:

never use rm(list = ls()) again
never use setwd() again

What’s wrong with `rm(list = ls())`?¹

Does NOT create a fresh session

It just deletes user-created objects from the global workspace

Other changes may have been made to the session, like options(), library(), etc.

You may get a wrong impression of reproducibility

The solution?

Write every script assuming it will be run in a fresh session

What’s wrong with `setwd()`?¹

Usually used to create absolute paths

# Absolute path on Windows
setwd("C:\\Users\\janedoe\\Documents\\projectname")

# Absolute path on macOS
setwd("/Users/johndoe/Dropbox/work/projectname")

# Absolute path on GNU/Linux
setwd("/home/johnsmith/git-projects/projectname")

Not portable and not reproducible

The chance of the setwd() command having the desired effect – making the file paths work – for anyone besides its author is 0%. It’s also unlikely to work for the author one or two years or computers from now. The project is not self-contained and portable.

Jenny Bryan (2017)²

Building robust paths

# Output of here::here() on Windows
here::here()
## [1] "C:/Users/janedoe/Documents/project"

# Output of here::here() on macOS
here::here()
## [1] "/Users/johndoe/Dropbox/work/project"

# Output of here::here() on GNU/Linux
here::here()
## [1] "/home/johnsmith/git-projects/project"

Use the package here to create project-relative paths

# Build a path from the project root ----
here::here("data", "raw-data", "raw-data-1.csv")
## [1] "/home/johnsmith/git-projects/project/data/raw-data/raw-data-1.csv"

# Read a file w/ a project-relative path ----
data <- read.csv(here::here("data", "raw-data", "raw-data-1.csv"))

here will search for a .Rproj file (or a .here file) to locate the project root
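
The idea of locating a project root from a marker file can be illustrated in a few lines of shell. This sketch is not how the here package is implemented. The function `find_root` and the folder layout are hypothetical and only show the principle: walk up the folder tree until a .here or .Rproj file is found.

```shell
#!/bin/sh
set -eu

# Walk up from a starting folder until a marker file is found.
# Illustration only: this is not how the here package is implemented.
find_root() {
  dir="$1"
  while [ "$dir" != "/" ]; do
    # A .here file or any .Rproj file marks the project root
    if [ -e "$dir/.here" ] || ls "$dir"/*.Rproj >/dev/null 2>&1; then
      printf '%s\n' "$dir"
      return 0
    fi
    dir="$(dirname "$dir")"
  done
  return 1
}

# Hypothetical project with a marker at its root
project="$(mktemp -d)/project"
mkdir -p "$project/data/raw-data"
: > "$project/project.Rproj"

# Called from a subfolder, the root is still found
root="$(find_root "$project/data/raw-data")"
echo "$root/data/raw-data/raw-data-1.csv"
```

Because paths are built from the detected root, the same code works wherever the project folder is copied.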

The `DESCRIPTION` file

Main component of an R package, the DESCRIPTION file can be added to a research compendium to describe project metadata

Package: projectname
Type: Package
Title: The Title of the Project
Authors@R: c(
    person(given   = "John",
           family  = "Doe",
           role    = c("aut", "cre", "cph"),
           email   = "john.doe@domain.com",
           comment = c(ORCID = "9999-9999-9999-9999")))
Description: A paragraph providing a full description of the project.
License: GPL (>= 2)
Imports:
    devtools,
    here

It can be used to list all external packages required by the project

You should consider the DESCRIPTION file as the only file to list your external packages

Do not use library() or install.packages() anymore

Dealing w/ dependencies

To call a function (bar()) from an external package (foo), usually you use the function bar() after calling library("foo")

But,

for readability purposes, it’s not perfect (where does the function bar() come from?)
you can have a conflict w/ a function also named bar() but from the package baz also attached with library(). You are not sure which function you are really using.

library("tidyverse")
## ── Attaching core tidyverse packages ──────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1
## ── Conflicts ────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package to force all conflicts to become errors

A solution to prevent conflict is to call an external function as foo::bar()

library() will load and attach a package
:: will just load a package

Dealing w/ dependencies

In the DESCRIPTION file,

list external packages under the tag Imports if you call functions as foo::bar() - recommended
list external packages under the tag Depends if you want to attach a package (e.g. ggplot2)

Package: projectname
Type: Package
Title: The Title of the Project
Authors@R: c(
    person(given   = "John",
           family  = "Doe",
           role    = c("aut", "cre", "cph"),
           email   = "john.doe@domain.com",
           comment = c(ORCID = "9999-9999-9999-9999")))
Description: A paragraph providing a full description of the project.
License: GPL (>= 2)
Depends:
    ggplot2
Imports:
    devtools,
    here

With this setup, you call ggplot() directly but here::here() w/ its package prefix in your code

Have a look at the tag Remotes to list packages only available on GitHub, GitLab, etc.

Dealing w/ dependencies

Editing the DESCRIPTION file is not enough to install and access external packages.

You have to run these two command lines:

# With devtools
# Install missing packages ----
devtools::install_deps()

# Load and attach (if Depends is used) packages ----
devtools::load_all()

# Or w/ remotes and pkgload (the packages devtools relies on)
# Install missing packages ----
remotes::install_deps()

# Load and attach (if Depends is used) packages ----
pkgload::load_all()

If you don’t want to upgrade your packages, use remotes::install_deps(upgrade = "never")

Wrap-up: w/ a DESCRIPTION file and the functions install_deps() and load_all(), no need to use library() or install.packages() anymore

The `README` file

A README is a text file that introduces and explains your project

each research compendium should contain a README
you can write several READMEs (project, data, etc.)

Source: https://github.com/frbcesab/chessboard/blob/main/README.md

GitHub and other code hosting platforms recognize and interpret README written in Markdown (README.md)

If you want to run code inside your README you can write a README.Rmd (or .qmd) and convert it to a README.md

The `README` file

A good README should answer the following questions¹:

Why should I use it?
How do I get it?
How do I use it?

Main sections (for a research compendium)

Title
Description
Content (file organization)
Prerequisites
Installation
Usage
License
Citation
Acknowledgements
References

Choose a `LICENSE`

By default, your work is released under exclusive copyright (no license)

Always select an appropriate license for your project

Two major camps of open source licenses¹
- Permissive licenses
- Copyleft licenses

The choosealicense.com website can help you choose

Examples

you want a permissive license so people can use your code with minimal restrictions: MIT / Apache / BSD
you want a copyleft license so that all derivatives and bundles of your code are also open source: GPLv2 / GPLv3
your project primarily contains data and you want minimal restrictions: CC-0 / CC-BY / ODbL

Tracking changes

Motivations

Questions

Which version of analyses.R is the final one?
What about data.csv?
What are the differences between versions?
Who has contributed to these versions? When?

Comments

It becomes difficult to come up with new version names
And this folder: what a mess!

We need a tool that deals with versions for us

Presentation of `git`

Created by Linus Torvalds in 2005 for the development of the Linux kernel
Distributed version control system (DVCS) - Peer-to-peer approach

Advantages of VCS (and git)

make contributions transparent (what / who / when / why)
keep the entire history of a file (and project)
inspect a file throughout its lifetime
revert back to a previous version
handle multiple versions (branches)
keep your working copy clean
facilitate collaborations w/ code hosting platforms (GitHub, GitLab, Bitbucket, etc.)
backup your project

A word of warning

Git and GitHub are not the same thing

Git is free and open-source software
GitHub (and similar platforms) is a web platform to host and share projects that use git

In other words:

You do not need GitHub to use git but you cannot use GitHub without using git

GUI clients

Git is a command-line tool (CLI)
You interact with git using a terminal
All commands start w/ the keyword git (git status / log / add / commit)

But a lot of third-party tools provide a graphical interface to git
(e.g. RStudio, GitKraken, GitHub Desktop, extensions for VSCode, VSCodium, neovim, etc.)

Just keep in mind that for some operations you will need to use the terminal

Zoom on RStudio

[Screenshots of the RStudio Git integration: main panel; stage files, view differences and commit changes; view history and versions]

Installation & Configuration

Installation

To install git you can follow this tutorial: https://frbcesab.github.io/training-courses/installation.html

Configuration

To use git, you need to set your identity (user name and email), which will be attached to all your commits.
Open a terminal and run:

git config --global user.name  "Jane Doe"
git config --global user.email "jane.doe@mail.com"

Initialization

For each project, git must be initialized w/

git init

How does `git` work?

git takes a sequence of snapshots
Each snapshot can contain changes for one or many file(s)
User chooses which files to ‘save’ in a snapshot and when
(unlike file hosting services like Dropbox, Google Drive, etc.)

In the git universe, a snapshot is a version, i.e. the state of the whole project at a specific point in time

Creating a snapshot is a two-step process:

1. Stage files: select which files to add to the version
2. Commit changes: save the version and add metadata (commit message)

Basic workflow

Initialize git in an (empty) folder (repository)

git init

The three areas of a git repository:

working copy: current state of the directory (what you actually see)
staging area: selected files that will be added to the next version
repository: area w/ all the versions (the .git/ subdirectory)

Basic workflow

Add new files in the repository

git status

# On branch main
# 
# No commits yet
# 
# Untracked files:
#   README.md
#   analyses.R
#   data.csv
# 
# Nothing added to commit but untracked files present
# Use "git add <file>..." to track

Basic workflow

Stage (select) one file

git add data.csv

git status

# On branch main
# 
# No commits yet
# 
# Changes to be committed:
#   (use "git rm --cached <file>..." to unstage)
#   new file:   data.csv
# 
# Untracked files:
#   (use "git add <file>..." to track)
#   README.md
#   analyses.R

Basic workflow

Stage (select) several files

git add data.csv analyses.R

git status

# On branch main
# 
# No commits yet
# 
# Changes to be committed:
#   (use "git rm --cached <file>..." to unstage)
#   new file:   analyses.R
#   new file:   data.csv
# 
# Untracked files:
#   (use "git add <file>..." to track)
#   README.md

Basic workflow

Stage (select) all files

git add .

git status

# On branch main
# 
# No commits yet
# 
# Changes to be committed:
#   (use "git rm --cached <file>..." to unstage)
#   new file:   analyses.R
#   new file:   data.csv
#   new file:   README.md

Basic workflow

Commit changes to create a new version

git commit -m "a good commit message"

Basic workflow

Now we are up-to-date

git status

# On branch main
# nothing to commit, working tree clean
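
The whole workflow can be replayed end to end in a throwaway folder. The sketch below assumes git is installed. The folder, the file contents and the identity are placeholders, and the identity is set for this repository only (no --global).

```shell
#!/bin/sh
set -eu

# Hypothetical repository in a temporary folder
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Identity for this repository only (placeholder values)
git config user.name  "Jane Doe"
git config user.email "jane.doe@mail.com"

# Three new, untracked files
printf 'x,y\n1,2\n'   > data.csv
printf '# analyses\n' > analyses.R
printf '# Project\n'  > README.md

# Stage everything, then commit a first version
git add .
git commit -q -m "Add data, analysis script and README"

# The working tree is now clean and the history has one commit
git status --short
git log --oneline
```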

The status of a file

With git a file can be untracked or tracked¹. If it’s tracked it can be:

unmodified
modified and unstaged
modified and staged

The `.gitignore`

We can also tell git to ignore specific files: it’s the purpose of the .gitignore file

Which files? For instance:

passwords, tokens and other secrets
temporary files
build files
large files

The syntax is simple:

# Ignore a specific file
README.html

# Ignore all PDF
*.pdf

# Ignore a folder
data/

# Ignore a subfolder
data/raw-data/

# Ignore a specific file in a subfolder
data/raw-data/raw-data.csv

Template for projects available here
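
To check that a .gitignore does what you expect, you can ask git directly w/ git check-ignore. A minimal sketch, assuming git is installed and using made-up file names in a temporary repository:

```shell
#!/bin/sh
set -eu

# Hypothetical repository in a temporary folder
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Ignore all PDF files and the whole data/ folder
printf '*.pdf\ndata/\n' > .gitignore

# Hypothetical files
mkdir data
: > data/raw-data.csv
: > report.pdf
: > analyses.R

# Ask git which paths are ignored
for f in report.pdf data/raw-data.csv analyses.R; do
  if git check-ignore -q "$f"; then
    echo "ignored:     $f"
  else
    echo "not ignored: $f"
  fi
done
```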

Commits

When committing a new version (w/ git commit), the following information is recorded:

WHO - the person who has made the changes (automatically added by git)
WHEN - the date of the commit (automatically added by git)
WHAT - the files that have been modified (selected by the user w/ git add)
WHY - the reason for the commit, i.e. what has been done compared to the previous version (added by the user w/ git commit)

Print the git history w/ git log

git log --stat

# (*) commit dc4366af56223f533a32ef38794b4e567b0b7422
# Author: ahasverus <ahasverus@users.noreply.github.com>
# Date:   Wed Jun 28 14:11:20 2023 +0200
# 
#     docs: update CITATION.cff file
# 
#  CITATION.cff | 6 +++---
#  1 file changed, 3 insertions(+), 3 deletions(-)
# 
# (*) commit 75c0072892544f779105067efc17d361365554e1 
# Author: Nicolas Casajus <nicolas.casajus@fondationbiodiversite.fr>
# Date:   Wed Jun 28 14:00:15 2023 +0200
# 
#     docs: update references in README
# 
#  README.Rmd | 6 +++---
#  README.md  | 6 +++---
#  2 files changed, 6 insertions(+), 6 deletions(-)
# 
# (*) commit 6a00f3539636726c3715a7e94eea04bd30ef8f69
# Author: Nicolas Casajus <nicolas.casajus@fondationbiodiversite.fr>
# Date:   Wed Jun 28 12:39:48 2023 +0000
# 
#     style: align R code lines in MS
# 
#  joss-paper/paper.bib | 2 +-
#  joss-paper/paper.md  | 8 ++++----
#  2 files changed, 5 insertions(+), 5 deletions(-)
#
# (*) commit 6f7d632c7d1dad3acddfb128a2c4a07f1e8a4e9c
# ...

Commit message

A commit message has a title line, and an optional body

# Commit message w/ title and body
git commit -m "title" -m "body"

# Commit message w/ only title
git commit -m "title"

What is a good commit message?

A good commit title:

should be capitalized (according to the git documentation)
should be short (50 characters or less)
should be informative and unambiguous
should use active voice and the imperative mood ("Fix bug", not "Fixed bug")

Template provided by git:

Capitalized, short (50 chars or less) summary

More detailed explanatory text, if necessary.  Wrap it to about 72
characters or so.  In some contexts, the first line is treated as the
subject of an email and the rest of the text as the body.  The blank
line separating the summary from the body is critical (unless you omit
the body entirely); tools like rebase will confuse you if you run the
two together.

Write your commit message in the imperative: "Fix bug" and not 
"Fixed bug" or "Fixes bug."  This convention matches up with commit 
messages generated by commands like git merge and git revert.

Further paragraphs come after blank lines.

- Bullet points are okay, too

- Typically a hyphen or asterisk is used for the bullet, followed by a
  single space, with blank lines in between, but conventions vary here

- Use a hanging indent

Link external references as: Fix #23

An optional body can be added to provide detailed information and to link external references (e.g. issue, pull request, etc.)
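
The sketch below shows how the two -m flags end up in the commit: the first becomes the title (subject) and the second the body, separated by a blank line. It assumes git is installed. The repository, identity and messages are placeholders.

```shell
#!/bin/sh
set -eu

# Hypothetical repository in a temporary folder
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name  "Jane Doe"
git config user.email "jane.doe@mail.com"

printf 'x,y\n1,2\n' > data.csv
git add data.csv

# First -m is the title, second -m becomes the body
git commit -q -m "Add survey data" \
  -m "Counts per site, as received from the field team."

# Title (subject) and body are stored separately
git log -1 --format='%s'
git log -1 --format='%b'
```

For longer bodies, running git commit without -m opens your editor, which is more comfortable than chaining flags.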

When should you commit?

Commit a new version when you reach a milestone
Create small and atomic commits
Commit a state that is actually working

Undoing things

1. Undo recent, uncommitted and unstaged changes

You have modified a file but have not staged changes and you want to restore the previous version

git status

# On branch main
# Changes not staged for commit:
#   (use "git add <file>..." to stage changes)
#   (use "git restore <file>..." to discard changes)
#   modified:   data.csv
#
# No changes added to commit

# Restore one file (discard unstaged changes)
git restore data.csv

git status

# On branch main
# Nothing to commit, working tree clean

To discard all unstaged changes:

# Discard all unstaged changes
git restore .

Undoing things

2. Unstage uncommitted files

You have modified and staged file(s) but have not committed changes yet and you want to unstage file(s) and restore the previous version

git status

# On branch main
# Changes to be committed:
#   (use "git restore --staged <file>..." to unstage)
#   modified:   data.csv

# Unstage one file
git restore --staged data.csv

git status

# On branch main
# Changes not staged for commit:
#   (use "git add <file>..." to stage changes)
#   (use "git restore <file>..." to discard changes)
#   modified:   data.csv
#
# No changes added to commit

You can now restore the previous version w/:

# Discard changes (restore previous version)
git restore data.csv
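The two steps can also be combined. A minimal sketch, assuming a throwaway repository with a demo identity and hypothetical file content: git restore accepts --staged and --worktree together, which unstages the file and restores the committed version in one command.

```shell
set -eu

# Throwaway repository (assumption: nothing here touches a real project)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.org"

printf 'id,value\n1,10\n' > data.csv
git add data.csv
git commit -q -m "add data"

# Modify the file and stage the change
printf '2,20\n' >> data.csv
git add data.csv
git status --short

# Unstage AND discard the change in a single command
git restore --staged --worktree data.csv

status=$(git status --porcelain)
echo "status after restore: '${status}'"
```

Running the two commands separately, as on the slide, leaves a chance to look at the change again before discarding it.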

3. Change the most recent commit message

You have committed a new version but you want to change the commit message

# Change message of the last commit
git commit --amend

Note that this rewrites the commit and therefore changes its ID: avoid amending a commit that has already been pushed
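The change of ID can be checked directly. A minimal sketch, assuming a throwaway repository, a demo identity and a hypothetical file notes.txt; the -m flag is used here only to avoid opening an editor.

```shell
set -eu

# Throwaway repository (hypothetical file and messages)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.org"

echo "some notes" > notes.txt
git add notes.txt
git commit -q -m "add nots"          # typo in the message
before=$(git rev-parse HEAD)

# Rewrite the message of the last commit without opening an editor
git commit -q --amend -m "add notes"
after=$(git rev-parse HEAD)

message=$(git log -1 --format=%s)
count=$(git rev-list --count HEAD)

echo "before: ${before}"
echo "after:  ${after}"
echo "message: ${message} (${count} commit in history)"
```

The history still contains a single commit, but under a new ID: the old commit has been replaced, not edited.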

4. Revert one commit

You want to reverse the effects of a commit: use git revert

# Print git history
git log --oneline

# f960dd3 (HEAD -> main) commit 4
# dd4472c commit 3
# 2bb9bb4 commit 2
# 2d79e7e commit 1

# Revert commit dd4472c
git revert dd4472c

# Print git history
git log --oneline

# d62ad3e (HEAD -> main) Revert "commit 3"
# f960dd3 commit 4
# dd4472c commit 3
# 2bb9bb4 commit 2
# 2d79e7e commit 1

git revert does not alter the existing history: it creates a new commit that applies the inverse changes
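The same scenario can be replayed in a sandbox. A minimal sketch, assuming a throwaway repository in which each commit adds its own (hypothetical) file, so that the revert cannot conflict; --no-edit keeps the default revert message instead of opening an editor.

```shell
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.org"

# Four commits, each adding its own file
for i in 1 2 3 4; do
  echo "content ${i}" > "file${i}.txt"
  git add "file${i}.txt"
  git commit -q -m "commit ${i}"
done

# 'commit 3' is the parent of the latest commit
target=$(git rev-parse HEAD~1)

# Create a new commit that cancels the effects of 'commit 3'
git revert --no-edit "$target"
git --no-pager log --oneline

count=$(git rev-list --count HEAD)
subject=$(git log -1 --format=%s)
```

The file added by commit 3 is gone from the working tree, while commit 3 itself remains visible in the history.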

5. Deleting commits

You want to delete one or more commits: use git reset --hard

# Print git history
git log --oneline

# f960dd3 (HEAD -> main) commit 4
# dd4472c commit 3
# 2bb9bb4 commit 2
# 2d79e7e commit 1

# Delete the two most recent commits
git reset --hard HEAD~2

# Print git history
git log --oneline

# 2bb9bb4 (HEAD -> main) commit 2
# 2d79e7e commit 1

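git reset --hard rewrites the history and also discards uncommitted changes to tracked files. Be careful with this command, especially if the commits have already been pushed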
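One possible safety net is to keep a pointer to the current state before resetting. A minimal sketch, assuming a throwaway repository and a hypothetical branch named backup (the slides do not prescribe this step):

```shell
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.org"

# Four commits on a hypothetical file
for i in 1 2 3 4; do
  echo "version ${i}" > data.csv
  git add data.csv
  git commit -q -m "commit ${i}"
done

# Keep a pointer to the current tip before rewriting history
git branch backup

# Delete the two most recent commits
git reset -q --hard HEAD~2
after_reset=$(git rev-list --count HEAD)

# Changed your mind? Move the branch back to the saved tip
git reset -q --hard backup
after_recover=$(git rev-list --count HEAD)

echo "commits after reset: ${after_reset}, after recovery: ${after_recover}"
```

Without such a branch, git reflog still lists the previous positions of HEAD for a while, which is often enough to find a deleted commit again.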

Back in time

Use the command git checkout to inspect your files as they were at a specific point in time

# Print git history
git log --oneline

# f960dd3 (HEAD -> main) commit 4
# dd4472c commit 3
# 2bb9bb4 commit 2
# 2d79e7e commit 1

# Back in time
git checkout 2bb9bb4

# Print git history
git log --oneline

# 2bb9bb4 (HEAD) commit 2
# 2d79e7e commit 1

# Back to the most recent commit (the branch you were on)
git checkout -

# Print git history
git log --oneline

# f960dd3 (HEAD -> main) commit 4
# dd4472c commit 3
# 2bb9bb4 commit 2
# 2d79e7e commit 1
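If only one file needs to be inspected, there is an alternative that leaves HEAD where it is. A minimal sketch, assuming a throwaway repository and hypothetical file content: git show accepts a commit and a path separated by a colon and prints the file as it was in that commit.

```shell
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.org"

# Two versions of the same file
echo "version 1" > data.csv
git add data.csv
git commit -q -m "commit 1"
echo "version 2" > data.csv
git commit -q -am "commit 2"

# ID of the older commit
old=$(git rev-parse --short HEAD~1)

# Print the file as it was in that commit, without moving HEAD
old_content=$(git show "${old}:data.csv")
new_content=$(cat data.csv)

echo "at ${old}: ${old_content}"
echo "now:        ${new_content}"
```

Since nothing is checked out, the working tree is untouched and there is no need to come back with git checkout - afterwards.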

Collaboration

Code hosting platforms

GitHub and similar platforms are cloud-based git repository hosting services

They are well suited to collaborating on projects tracked by git

Services

Full integration of version control (commits, history, differences)
Easy collaboration w/ branches, forks, pull requests
Issues tracking system
Enhanced documentation rendering (README, Wiki)
Static website hosting
Automation & monitoring (CI/CD)

Presentation of GitHub

Overview

Created in 2008
For-profit company (property of Microsoft since 2018)
Used by more than 100 million developers around the world

Advantages

User-friendly interface for git
Free account w/ unlimited public/private repositories
Organization account (w/ free plan)
Advanced tools for collaboration

Presentation of GitHub

*Source:* https://github.com/frbcesab/rwoslite

Create a repository

Clone a repository

On a terminal

# Change path
cd path/to/store/repository

# Clone repository (SSH)
git clone git@github.com:ahasverus/projectname.git

# Clone repository (HTTPS)
# git clone https://github.com/ahasverus/projectname.git

# Go to repo
cd projectname

Clone a repository w/ RStudio

Working w/ GitHub

Add a new file: README.md

git status

# On branch main
# Your branch is up to date with 'origin/main'
#
# Untracked files:
#   README.md
# 
# Nothing added to commit but untracked files present
# Use "git add <file>..." to track

Working w/ GitHub

Stage changes

git add .

git status

# On branch main
# Your branch is up to date with 'origin/main'
#
# Changes to be committed:
#   (use "git restore --staged <file>..." to unstage)
#   new file:   README.md

Working w/ GitHub

Commit changes

git commit -m "add README"

git status

# On branch main
# Your branch is ahead of 'origin/main' by 1 commit.
#   (use "git push" to publish your local commits)
# 
# nothing to commit, working tree clean

Working w/ GitHub

Push changes to remote

git push

# The first time you push a branch, you may need to set its upstream:
git push -u origin main

git status

# On branch main
# Your branch is up to date with 'origin/main'.
# 
# nothing to commit, working tree clean

Working w/ GitHub

Pull changes from remote

git pull

git status

# On branch main
# Your branch is up to date with 'origin/main'.
# 
# nothing to commit, working tree clean

Help me, I can’t push!

When you try to push, you might see the following error message:

git push

# To github.com:ahasverus/projectname.git
#  ! [rejected]        main -> main (fetch first)
#
# error: failed to push some refs to 'github.com:ahasverus/projectname.git'
#
# hint: Updates were rejected because the remote contains work that you do
# hint: not have locally. This is usually caused by another repository pushing
# hint: to the same ref. You may want to first integrate the remote changes
# hint: (e.g., 'git pull ...') before pushing again.
# hint: See the 'Note about fast-forwards' in 'git push --help' for details.

Just run git pull, then try git push again
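The situation can be reproduced without GitHub. A minimal sketch, assuming a local bare repository standing in for the remote and two hypothetical collaborators (alice and bob) with hypothetical files; git pull is called with --rebase explicitly, because recent versions of git ask how to reconcile diverging branches.

```shell
set -eu

# Demo identity for every repository created below (assumption)
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.org"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.org"

work=$(mktemp -d)
cd "$work"

# A seed repository, published as a bare 'remote', then cloned twice
git init -q seed
echo "# Project" > seed/README.md
git -C seed add README.md
git -C seed commit -q -m "add README"
git clone -q --bare seed remote.git
git clone -q remote.git alice
git clone -q remote.git bob

# Alice publishes a change first
echo "analysis" > alice/analysis.R
git -C alice add analysis.R
git -C alice commit -q -m "add analysis"
git -C alice push -q

# Bob commits without pulling: his push is rejected
echo "figures" > bob/figures.R
git -C bob add figures.R
git -C bob commit -q -m "add figures"
if git -C bob push -q 2>/dev/null; then
  first_push="accepted"
else
  first_push="rejected"
fi
echo "first push: ${first_push}"

# Integrate the remote work, then push again
git -C bob pull -q --rebase
git -C bob push -q

remote_count=$(git -C remote.git rev-list --count HEAD)
echo "commits on the remote: ${remote_count}"
```

Because the two collaborators edited different files, the pull succeeds without any manual intervention.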

Help me, I can’t pull!

When you try to pull, you might see the following error message:

git pull

# [...]
# Auto-merging README.md
# CONFLICT (content): Merge conflict in README.md
#
# error: could not apply b8302e6... edit README
#
# hint: Resolve all conflicts manually, mark them as resolved with
# hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
# hint: You can instead skip this commit: run "git rebase --skip".
# hint: To abort and get back to the state before "git rebase", 
# hint: run "git rebase --abort".

Welcome to the wonderful world of git conflicts

Resolving conflicts

What is a (lexical) conflict?

A git conflict appears when two versions cannot be merged by git because changes have been made to the same lines.

README.md - Version A

# The SURPRISE pizza

An amazing surprise of the dev team dedicated just to your fancy
thirst for fortune and originality.

README.md - Version B

# The Surprise Pizza

An amazing surprise of the dev team dedicated just to your fancy
thirst for fortune and originality.

Git will identify conflicts in files:

<<<<<<< HEAD
# The SURPRISE pizza
=======
# The Surprise Pizza
>>>>>>> b8302e6 (edit README)

An amazing surprise of the dev team dedicated just to your fancy
thirst for fortune and originality.

You have to decide which version you want to keep.

README.md - Final version

# My wonderful pizza

An amazing surprise of the dev team dedicated just to your fancy
thirst for fortune and originality.

Once you have decided, remove the conflict markers, stage the file with git add, and run git rebase --continue.
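The whole cycle (conflict, decision, resolution) can be rehearsed locally. A minimal sketch, assuming a local bare repository standing in for GitHub and two hypothetical clones (alice and bob); the README titles are those of the slides, and GIT_EDITOR=true is only there to keep the original commit message without opening an editor.

```shell
set -eu

# Demo identity for every repository created below (assumption)
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.org"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.org"

work=$(mktemp -d)
cd "$work"

git init -q seed
printf '# The surprise pizza\n' > seed/README.md
git -C seed add README.md
git -C seed commit -q -m "add README"
git clone -q --bare seed remote.git
git clone -q remote.git alice
git clone -q remote.git bob

# Version A is pushed first
printf '# The SURPRISE pizza\n' > alice/README.md
git -C alice commit -q -am "edit README (A)"
git -C alice push -q

# Version B edits the same line: the pull stops on a conflict
printf '# The Surprise Pizza\n' > bob/README.md
git -C bob commit -q -am "edit README (B)"
git -C bob pull -q --rebase >/dev/null 2>&1 || echo "conflict detected"
markers=$(grep -c '^<<<<<<<' bob/README.md)

# Write the final version (this removes the conflict markers),
# mark the file as resolved and finish the rebase
printf '# My wonderful pizza\n' > bob/README.md
git -C bob add README.md
GIT_EDITOR=true git -C bob rebase --continue
git -C bob push -q

final=$(git -C remote.git show HEAD:README.md)
echo "README on the remote: ${final}"
```

If the resolution goes wrong halfway, git rebase --abort brings the repository back to its state before the pull.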

Collaboration - GitHub Flow

**Source:** https://edav.info/github.html

Working w/ issues

…

Code & Software

Work in progress…

Manuscript

Work in progress…

Notes

https://annakrystalli.me/talks/r-in-repro-research.html#1

https://library.seg.org/doi/abs/10.1190/1.1822162

https://inundata.org/talks/rstd19/#/

https://ropensci-archive.github.io/reproducibility-guide/sections/introduction/

https://insileco.io/workshop_reproducibility/#1

https://arxiv.org/pdf/1609.00037.pdf

https://ram.berkeley.edu/

https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001745#s5

https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005097

https://eliocamp.github.io/reproducibility-with-r/materials/day1/01-introduction/

https://www.youtube.com/watch?v=KHMW8fV2NXo

https://the-turing-way.netlify.app/reproducible-research/reproducible-research.html

https://annakrystalli.me/rrresearchACCE20/filenaming-view.html

https://allisonhorst.com/other-r-fun

https://www.tidyverse.org/blog/2017/12/workflow-vs-script/

https://doi.org/10.7287/peerj.preprints.3159v2

https://raps-with-r.dev/
