Reproducible research

Good practices

December 2023

Nicolas Casajus

Introduction

What is reproducibility?


Reproducibility is about results that can be obtained by someone else (or by you in the future) given the same data and the same code. This is a technical problem.


This is what we call computational reproducibility.

Why does it matter?

An article about computational results is advertising, not scholarship. The actual scholarship is the full software environment, code and data, that produced the result.

Claerbout & Karrenbach (1992)


Reproducibility has the potential to serve as a minimum standard for judging scientific claims (…).

Peng (2011)


 Sharing the code and the data is now a prerequisite for publishing in many journals

Reproducibility spectrum


Source: Peng (2011)


Each degree of reproducibility requires additional skills and time. While some of those skills (e.g. literate programming, version control, setting up environments) pay off in the long run, they can require a high up-front investment.

Concepts

According to Wilson et al. (2017), good practices for better reproducibility can be organized into the following six topics:

  • Data management
  • Project organization
  • Tracking changes
  • Collaboration
  • Manuscript
  • Code & Software

 Data management

Raw data

General recommendations

  • Save and back up the raw data
  • Do not modify raw data (even for minor changes)
  • Raw data should be in read-only mode (🔒)
  • Any modification produces an output or derived data
  • Write code for data acquisition (when possible)
    • database requests
    • API requests
    • download.file(), wget, curl, etc.
  • Describe and document raw data (README, metadata, etc.)

Proposed file organization

project/
└─ data/
   └─ raw-data/
      ├─ raw-data-1.csv 🔒
      ├─ ...
      └─ README.md
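One possible way to put raw data in read-only mode from a terminal is sketched below. It is only an illustration: the project lives in a throwaway temporary directory, and the file contents are made up. On a shared or networked drive, permissions may behave differently.

```shell
#!/bin/sh
set -eu

# Hypothetical project location: a throwaway temporary directory
project="$(mktemp -d)/project"
mkdir -p "$project/data/raw-data"

# Stand-in for a real raw data file (contents are made up)
printf 'site,year,count\nA,2020,12\n' > "$project/data/raw-data/raw-data-1.csv"

# Document the raw data next to it
printf 'raw-data-1.csv: survey counts (illustrative)\n' \
  > "$project/data/raw-data/README.md"

# Remove write permission for everyone: the file becomes read-only
chmod 444 "$project/data/raw-data/raw-data-1.csv"

# Inspect the permission bits of the raw data file
mode="$(ls -l "$project/data/raw-data/raw-data-1.csv" | cut -c1-10)"
echo "$mode"
```

Any later processing step should then write to a separate location (derived data or outputs) instead of touching this file.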

Derived data

General recommendations

  • Modified raw data become derived data (or an output)
  • Record all the steps used to process data
  • Create the data you wish to see in the world
  • Create analysis-friendly data: tidy data

Proposed file organization

project/
├─ data/
│  ├─ raw-data/
│  │  ├─ raw-data-1.csv 🔒
│  │  ├─ ...
│  │  └─ README.md
│  │
│  └─ derived-data/
│     ├─ derived-data-1.RData
│     └─ ...
│
└─ code/
   ├─ process-raw-data-1.R
   └─ ...


Alternative

project/
├─ data/
│  ├─ raw-data-1.csv 🔒
│  ├─ ...
│  └─ README.md
│
├─ outputs/
│  ├─ output-1.RData
│  └─ ...
│
└─ code/
   ├─ process-raw-data-1.R
   └─ ...

Data submission

  • Submit data to a reputable DOI-issuing repository so that others can access and cite it
  • Do not forget to write good metadata (e.g. EML)
  • Develop tools to access and handle published data (e.g. API, R package, ShinyApp, etc.)

 Project organization

First things first

« There are only two hard things in Computer Science: cache invalidation and naming things. »

Phil Karlton


Three principles for naming files

  1. Human readable
    • Name contains information on the content
    • Respect the concept of slug from semantic URLs

  2. Machine readable
    • Regular expression and globbing friendly
      • avoid spaces and accented characters
      • good use of punctuation and case

# File names ----
files <- c("2020-survey_A.csv", "2021-survey_A.csv", "2021-survey_B.csv")

# Extract years ----
strsplit(files, "-") |>              # Split string by '-'
  lapply(function(x) x[1]) |>        # Get the first element
  unlist() |>                        # Convert to vector
  as.numeric()                       # Convert to numeric
[1] 2020 2021 2021

# Extract surveys ----
strsplit(files, "-") |>              # Split string by '-'
  lapply(function(x) x[2]) |>        # Get the second element
  unlist() |>                        # Convert to vector
  gsub("survey_|\\.csv", "", x = _)  # Clean output
[1] "A" "A" "B"

  3. Play well with default ordering

# File names (bad) ----
files <- c("1-survey_A.csv", "2-survey_B.csv", "10-survey_C.csv")

# Sort file names ----
sort(files)
[1] "1-survey_A.csv"  "10-survey_C.csv" "2-survey_B.csv"

# File names (better) ----
files <- c("01-survey_A.csv", "02-survey_B.csv", "10-survey_C.csv")

# Sort file names ----
sort(files)
[1] "01-survey_A.csv" "02-survey_B.csv" "10-survey_C.csv"

Source: xkcd
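The same naming principles also pay off outside R. The sketch below reuses the file names from the R example above in a plain POSIX shell; the zero-padded names at the end are hypothetical and only illustrate the ordering issue.

```shell
#!/bin/sh
set -eu

# Same illustrative file names as in the R example
files="2020-survey_A.csv 2021-survey_A.csv 2021-survey_B.csv"

# Machine readable: the year is everything before the first '-'
years=""
for f in $files; do
  years="$years ${f%%-*}"
done
years="${years# }"   # drop the leading space
echo "$years"

# Default ordering: zero-pad numeric prefixes so that lexical order
# matches numeric order (hypothetical file names)
padded="$(for i in 1 2 10; do printf '%02d-survey.csv\n' "$i"; done | LC_ALL=C sort)"
echo "$padded"
```

Because the delimiter is always '-', a single parameter expansion is enough to recover the year, with no regular expression needed.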

Naming variables

 Be consistent and follow the guidelines of your community

Research compendium

The goal of a research compendium is to provide a standard and easily recognisable way for organizing the digital materials of a project to enable others to inspect, reproduce, and extend the research.

Marwick B, Boettiger C & Mullen L (2018)



Three generic principles

  1. Files organized according to the conventions of the community
  2. Clear separation of data, method, and output
  3. Specify the computational environment that was used


 A research compendium must be self-contained

Research compendium

 Strong flexibility in the structure of a compendium


Small compendium

project/
├─ .git/
├─ data/ 🔒
├─ code/
│  └─ script.R
├─ outputs/
├─ project.Rproj
├─ .gitignore
└─ README.md

Medium compendium

project/
├─ .git/
├─ data/
│  ├─ raw-data/ 🔒
│  └─ derived-data/
├─ R/
│  ├─ function-x.R
│  └─ function-y.R
├─ analyses/
│  ├─ script-1.R
│  └─ script-n.R
├─ outputs/
├─ project.Rproj
├─ .gitignore
├─ DESCRIPTION
├─ LICENSE
├─ make.R
└─ README.md

Large compendium

project/
├─ .git/
├─ .github/
│  └─ workflows/
│     ├─ workflow-1.yaml
│     └─ workflow-n.yaml
├─ renv/
├─ data/
│  ├─ raw-data/ 🔒
│  └─ derived-data/
├─ R/
│  ├─ function-x.R
│  └─ function-y.R
├─ analyses/
│  ├─ script-1.R
│  └─ script-n.R
├─ outputs/
├─ paper/
│  ├─ references.bib
│  ├─ style.csl
│  └─ paper.Rmd
├─ project.Rproj
├─ .gitignore
├─ DESCRIPTION
├─ LICENSE
├─ CITATION.cff
├─ make.R
├─ Dockerfile
├─ renv.lock
└─ README.md

RStudio Project

Use the power of RStudio Project

File > New Project...

RStudio IDE will create a .Rproj file (a simple text file) at the root of the folder

  • Double-click on a .Rproj file to open a fresh instance of RStudio, w/ the working directory pointing at the folder root
  • This will help you to create a self-contained workspace (= compendium)

 In a few slides, we will talk about setwd()

In the meantime

RStudio IDE - Minimal configuration for a better reproducibility

Tools > Global Options > General

  • Never save your workspace as .RData
     Decide what you want to save and use save(), saveRDS(), write.csv(), etc.


  • Never save your command history
     Write your code in scripts, not in the console


Follow these two recommendations and use an RStudio Project, and you’ll:

  • never again use rm(list = ls())
  • never again use setwd()

What’s wrong with rm(list = ls())?

Does NOT create a fresh session

 It just deletes user-created objects from the global workspace


Other changes may have been made to the session, like options(), library(), etc.

 You may get a wrong impression of reproducibility


The solution?

 Write every script assuming it will be run in a fresh session

What’s wrong with setwd()?

Usually used to create absolute paths

# Absolute path on Windows
setwd("C:\\Users\\janedoe\\Documents\\projectname")

# Absolute path on MacOS
setwd("/Users/johndoe/Dropbox/work/projectname")

# Absolute path on GNU/Linux
setwd("/home/johnsmith/git-projects/projectname")

 Not portable and not reproducible


The chance of the setwd() command having the desired effect – making the file paths work – for anyone besides its author is 0%. It’s also unlikely to work for the author one or two years or computers from now. The project is not self-contained and portable.

Jenny Bryan (2017)

Building robust paths


# Output of here::here() on Windows
here::here()
## [1] "C:/Users/janedoe/Documents/project"

# Output of here::here() on MacOS
here::here()
## [1] "/Users/johndoe/Dropbox/work/project"

# Output of here::here() on GNU/Linux
here::here()
## [1] "/home/johnsmith/git-projects/project"


Use the here package to build paths from the project root

# Build a path from the project root ----
here::here("data", "raw-data", "raw-data-1.csv")
## [1] "/home/johnsmith/git-projects/project/data/raw-data/raw-data-1.csv"

# Read a file using this path ----
data <- read.csv(here::here("data", "raw-data", "raw-data-1.csv"))


here searches for a .Rproj file (or a .here file) to locate the project root

The DESCRIPTION file

Main component of an R package, the DESCRIPTION file can be added to a research compendium to describe project metadata


Package: projectname
Type: Package
Title: The Title of the Project
Authors@R: c(
    person(given   = "John",
           family  = "Doe",
           role    = c("aut", "cre", "cph"),
           email   = "john.doe@domain.com",
           comment = c(ORCID = "9999-9999-9999-9999")))
Description: A paragraph providing a full description of the project.
License: GPL (>= 2)
Imports:
    devtools,
    here


 It can be used to list all external packages required by the project


You should consider the DESCRIPTION file as the only file to list your external packages

 Do not use library() or install.packages() anymore

Dealing w/ dependencies

To call a function (bar()) from an external package (foo), usually you use the function bar() after calling library("foo")

But,

  • for readability purposes, it’s not perfect (where does the function bar() come from?)
  • you can have a conflict w/ a function also named bar() but from the package baz also attached with library(). You are not sure which function you are really using.
library("tidyverse")
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package to force all conflicts to become errors


 A solution to prevent conflict is to call an external function as foo::bar()

  • library() will load and attach a package
  • :: will just load a package

Dealing w/ dependencies

 In the DESCRIPTION file,

  • list external packages under the tag Imports if you call functions as foo::bar() - recommended
  • list external packages under the tag Depends if you want to attach a package (e.g. ggplot2)


Package: projectname
Type: Package
Title: The Title of the Project
Authors@R: c(
    person(given   = "John",
           family  = "Doe",
           role    = c("aut", "cre", "cph"),
           email   = "john.doe@domain.com",
           comment = c(ORCID = "9999-9999-9999-9999")))
Description: A paragraph providing a full description of the project.
License: GPL (>= 2)
Depends:
    ggplot2
Imports:
    devtools,
    here


With this setting, you will use ggplot() but here::here() in your code


 Have a look at the tag Remotes to list packages only available on GitHub, GitLab, etc.

Dealing w/ dependencies

Editing the DESCRIPTION file is not enough to install and access external packages.


 You have to run these two command lines:

# Install missing packages ----
devtools::install_deps()

# Load and attach (if Depends is used) packages ----
devtools::load_all()

Or

# Install missing packages ----
remotes::install_deps()

# Load and attach (if Depends is used) packages ----
pkgload::load_all()


If you don’t want to upgrade your packages, use remotes::install_deps(upgrade = "never")



Wrap-up: w/ a DESCRIPTION file and the functions install_deps() and load_all(), no need to use library() or install.packages() anymore

The README file

A README is a text file that introduces and explains your project

  • each research compendium should contain a README
  • you can write several READMEs (project, data, etc.)


 GitHub and other code hosting platforms recognize and interpret README written in Markdown (README.md)


 If you want to run code inside your README you can write a README.Rmd (or .qmd) and convert it to a README.md

The README file

A good README should answer the following questions:

  • Why should I use it?
  • How do I get it?
  • How do I use it?

Main sections (for a research compendium)

  • Title
  • Description
  • Content (file organization)
  • Prerequisites
  • Installation
  • Usage
  • License
  • Citation
  • Acknowledgements
  • References

Choose a LICENSE

  • By default your work will be released under exclusive copyright - No License
  • Always select an appropriate license for your project
  • Two major camps of open source licenses

    • Permissive licenses
    • Copyleft licenses


 The choosealicense.com website can help you choose


 Examples

  • you want a permissive license so people can use your code with minimal restrictions: MIT / Apache / BSD
  • you want a copyleft license so that all derivatives and bundles of your code are also open source: GPLv2 / GPLv3
  • your project primarily contains data and you want minimal restrictions: CC-0 / CC-BY / ODbL

 Tracking changes

Motivations

Project content (without git)

Questions

  • Which version of analyses.R is the final one?
  • What about data.csv?
  • What are the differences between versions?
  • Who has contributed to these versions? When?

Comments

  • It becomes difficult to find new version names
  • And this folder: what a mess!


 We need a tool that deals with versions for us

Motivations

Project content (with git)

Presentation of git


Advantages of VCS (and git)

  • make contributions transparent (what / who / when / why)
  • keep the entire history of a file (and project)
  • inspect a file throughout its life time
  • revert back to a previous version
  • handle multiple versions (branches)
  • keep your working copy clean
  • facilitate collaborations w/ code hosting platforms
    (GitHub, GitLab, Bitbucket, etc.)
  • backup your project

A word of warning

Git and GitHub are not the same thing

  • Git is a free and open-source software
  • GitHub (and co) is a web platform to host and share projects that use git


In other words:

You do not need GitHub to use git but you cannot use GitHub without using git

GUI clients


  • Git is used through a command-line interface (CLI)
  • You interact with git using a terminal
  • All commands start w/ the keyword git
    (git status / log / add / commit)

But a lot of third-party tools provide a graphical interface to git
(e.g. RStudio, GitKraken, GitHub Desktop, extensions for VSCode, VSCodium, neovim, etc.)


Just keep in mind that for some operations you will need to use the terminal

Zoom on RStudio

  • Git main panel
  • Stage files, view differences and commit changes
  • View history and versions

Installation & Configuration

Installation

To install git you can follow this tutorial: https://frbcesab.github.io/training-courses/installation.html


Configuration

To use git, you need to store your credentials (username and email) that will be added to all your commits.
Open a terminal and run:

git config --global user.name  "Jane Doe"
git config --global user.email "jane.doe@mail.com"


Initialization

For each project, git must be initialized w/

git init

How does git work?

  • git takes a sequence of snapshots
  • Each snapshot can contain changes for one or many file(s)
  • User chooses which files to ‘save’ in a snapshot and when
    (!= file hosting services like Dropbox, Google Drive, etc.)


In the git universe, a snapshot is a version, i.e. the state of the whole project at a specific point in time


Taking a snapshot is a two-step process:

  1. Stage files: select which files to add to the version
  2. Commit changes: save the version and add metadata (commit message)

Basic workflow

 Initialize git in a (empty) folder (repository)


git init


The three areas of a git repository:

  • working copy: current state of the directory (what you actually see)
  • staging area: selected files that will be added to the next version
  • repository: area w/ all the versions (the .git/ subdirectory)

Basic workflow

 Add new files in the repository


git status

# On branch main
# 
# No commits yet
# 
# Untracked files:
#   README.md
#   analyses.R
#   data.csv
# 
# Nothing added to commit but untracked files present
# Use "git add <file>..." to track

Basic workflow

 Stage (select) one file


git add data.csv


git status

# On branch main
# 
# No commits yet
# 
# Changes to be committed:
#   (use "git rm --cached <file>..." to unstage)
#   new file:   data.csv
# 
# Untracked files:
#   (use "git add <file>..." to track)
#   README.md
#   analyses.R

Basic workflow

 Stage (select) several files


git add data.csv analyses.R


git status

# On branch main
# 
# No commits yet
# 
# Changes to be committed:
#   (use "git rm --cached <file>..." to unstage)
#   new file:   analyses.R
#   new file:   data.csv
# 
# Untracked files:
#   (use "git add <file>..." to track)
#   README.md

Basic workflow

 Stage (select) all files


git add .


git status

# On branch main
# 
# No commits yet
# 
# Changes to be committed:
#   (use "git rm --cached <file>..." to unstage)
#   new file:   analyses.R
#   new file:   data.csv
#   new file:   README.md

Basic workflow

 Commit changes to create a new version


git commit -m "a good commit message"

Basic workflow

 Now we are up-to-date


git status

# On branch main
# nothing to commit, working tree clean
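The whole workflow above can be replayed end to end in a throwaway repository. This is a minimal sketch: the temporary directory, the file contents and the identity are placeholders, and the identity is set for this repository only (no --global flag).

```shell
#!/bin/sh
set -eu

# Throwaway repository in a temporary directory (hypothetical location)
repo="$(mktemp -d)"
cd "$repo"
git init --quiet

# Placeholder identity, stored for this repository only
git config user.name  "Jane Doe"
git config user.email "jane.doe@mail.com"

# Create the three files used in the slides (contents are made up)
echo "# Project" > README.md
echo "x <- 1"    > analyses.R
echo "id,value"  > data.csv

# Stage everything, then commit to create the first version
git add .
git commit --quiet -m "Add README, analysis script and data"

# A clean working tree gives an empty machine-readable status
status="$(git status --porcelain)"
n_commits="$(git rev-list --count HEAD)"
echo "commits: $n_commits"
```

git status --porcelain is used here instead of the human-readable output because it is stable and easy to check in a script.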

The status of a file



With git a file can be untracked or tracked. If it’s tracked it can be:

  • unmodified
  • modified and unstaged
  • modified and staged

The .gitignore

We can also tell git to ignore specific files: this is the purpose of the .gitignore file


Which files? For instance:

  • passwords, tokens and other secrets
  • temporary files
  • build files
  • large files

The syntax is simple:

# Ignore a specific file
README.html

# Ignore all PDF
*.pdf

# Ignore a folder
data/

# Ignore a subfolder
data/raw-data/

# Ignore a specific file in a subfolder
data/raw-data/raw-data.csv
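To see the effect of these patterns, the sketch below writes two of them to a .gitignore in a throwaway repository and asks git what it still sees. File names are illustrative; git check-ignore is a standard git command that reports whether a path is ignored.

```shell
#!/bin/sh
set -eu

# Throwaway repository (hypothetical location)
repo="$(mktemp -d)"
cd "$repo"
git init --quiet

# Ignore all PDF files and the raw data subfolder
printf '*.pdf\ndata/raw-data/\n' > .gitignore

# Create files that match the patterns, and one that does not
mkdir -p data/raw-data
touch report.pdf data/raw-data/raw-data.csv analyses.R

# Untracked files that git still sees: ignored files are absent
visible="$(git status --porcelain --untracked-files=all)"
echo "$visible"

# Ask git directly whether a path is ignored (exit status 0 = ignored)
git check-ignore -q report.pdf && ignored="yes" || ignored="no"
echo "report.pdf ignored: $ignored"
```

Note that .gitignore itself is a regular file: it shows up as untracked and is normally committed with the project.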


 Template for projects available here

Commits

When committing a new version (w/ git commit), the following information must be added:

  • WHO - the person who has made the changes
    (automatically added by git)
  • WHEN - the date of the commit
    (automatically added by git)
  • WHAT - the files that have been modified
    (selected by the user w/ git add)
  • WHY - the reason of the commit, i.e. what has been done compared to the previous version
    (added by the user w/ git commit)

Print the git history w/ git log

git log --stat

# (*) commit dc4366af56223f533a32ef38794b4e567b0b7422
# Author: ahasverus <ahasverus@users.noreply.github.com>
# Date:   Wed Jun 28 14:11:20 2023 +0200
# 
#     docs: update CITATION.cff file
# 
#  CITATION.cff | 6 +++---
#  1 file changed, 3 insertions(+), 3 deletions(-)
# 
# (*) commit 75c0072892544f779105067efc17d361365554e1 
# Author: Nicolas Casajus <nicolas.casajus@fondationbiodiversite.fr>
# Date:   Wed Jun 28 14:00:15 2023 +0200
# 
#     docs: update references in README
# 
#  README.Rmd | 6 +++---
#  README.md  | 6 +++---
#  2 files changed, 6 insertions(+), 6 deletions(-)
# 
# (*) commit 6a00f3539636726c3715a7e94eea04bd30ef8f69
# Author: Nicolas Casajus <nicolas.casajus@fondationbiodiversite.fr>
# Date:   Wed Jun 28 12:39:48 2023 +0000
# 
#     style: align R code lines in MS
# 
#  joss-paper/paper.bib | 2 +-
#  joss-paper/paper.md  | 8 ++++----
#  2 files changed, 5 insertions(+), 5 deletions(-)
#
# (*) commit 6f7d632c7d1dad3acddfb128a2c4a07f1e8a4e9c
# ...

Commit message

A commit message has a title line, and an optional body

# Commit message w/ title and body
git commit -m "title" -m "body"

# Commit message w/ only title
git commit -m "title"


What is a good commit message?

A good commit title:

  • should be capitalized (according to the git documentation)
  • should be short (less than 50 characters)
  • should be informative and unambiguous
  • should use active voice and present tense

Template provided by git:

Capitalized, short (50 chars or less) summary

More detailed explanatory text, if necessary.  Wrap it to about 72
characters or so.  In some contexts, the first line is treated as the
subject of an email and the rest of the text as the body.  The blank
line separating the summary from the body is critical (unless you omit
the body entirely); tools like rebase will confuse you if you run the
two together.

Write your commit message in the imperative: "Fix bug" and not 
"Fixed bug" or "Fixes bug."  This convention matches up with commit 
messages generated by commands like git merge and git revert.

Further paragraphs come after blank lines.

- Bullet points are okay, too

- Typically a hyphen or asterisk is used for the bullet, followed by a
  single space, with blank lines in between, but conventions vary here

- Use a hanging indent

Link external references as: Fix #23


An optional body can be added to provide detailed information and to link external references (e.g. issue, pull request, etc.)
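As a small illustration of the title/body split, the sketch below commits with two -m options and reads each part back with git log format placeholders (%s for the subject, %b for the body). The repository, file and messages are made up for the example.

```shell
#!/bin/sh
set -eu

# Throwaway repository with a placeholder identity
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git config user.name  "Jane Doe"
git config user.email "jane.doe@mail.com"

# One made-up file to commit
echo "x <- 1" > analyses.R
git add analyses.R

# First -m is the title, second -m becomes the body
git commit --quiet -m "Add analysis script" \
                   -m "Compute the first summary statistics."

# Read the two parts back from the history
title="$(git log -1 --format=%s)"
body="$(git log -1 --format=%b)"
echo "title: $title"
echo "body:  $body"
```

Git inserts the blank line between title and body itself when several -m options are given.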

When should you commit?


  • Commit a new version when you reach a milestone
  • Create small and atomic commits
  • Commit a state that is actually working

Undoing things

  1. Undo recent, uncommitted and unstaged changes

You have modified a file but have not staged changes and you want to restore the previous version


git status

# On branch main
# Changes not staged for commit:
#   (use "git add <file>..." to stage changes)
#   (use "git restore <file>..." to discard changes)
#   modified:   data.csv
#
# No changes added to commit

# Restore one file (discard unstaged changes)
git restore data.csv


git status

# On branch main
# Nothing to commit, working tree clean


 To discard all changes:

# Cancel all non-staged changes
git restore .
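The steps above can be tried safely in a throwaway repository. This is only a sketch: the temporary directory, the file content and the identity are assumptions made for the demonstration. Because `git restore` cannot be undone, the sketch reviews the changes with `git diff` before discarding them.

```shell
set -e

# Throwaway repository (placeholder location and identity)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

printf 'id,value\n1,10\n' > data.csv
git add data.csv
git commit -q -m "Add data"

# Modify the file without staging it
echo "2,20" >> data.csv
git status --short

# Review what would be lost: discarded changes cannot be recovered
git --no-pager diff data.csv

# Discard the unstaged changes
git restore data.csv

# Empty output means the working tree is clean again
remaining=$(git status --porcelain)
lines=$(wc -l < data.csv | tr -d ' ')
echo "Lines in data.csv: $lines"
```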

Undoing things

  2. Undo staged, uncommitted changes

You have modified and staged file(s) but have not committed changes yet and you want to unstage file(s) and restore the previous version


git status

# On branch main
# Changes to be committed:
#   (use "git restore --staged <file>..." to unstage)
#   modified:   data.csv

# Unstage one file
git restore --staged data.csv


git status

# On branch main
# Changes not staged for commit:
#   (use "git add <file>..." to stage changes)
#   (use "git restore <file>..." to discard changes)
#   modified:   data.csv
#
# No changes added to commit

You can now restore the previous version w/:

# Discard changes (restore previous version)
git restore data.csv
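A minimal sketch of the two-step undo in a scratch repository (location, file content and identity are placeholders). It uses `git diff --staged` and `git diff` to show where the change sits before and after unstaging.

```shell
set -e

# Scratch repository (placeholder location and identity)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

printf 'id,value\n1,10\n' > data.csv
git add data.csv
git commit -q -m "Add data"

# Modify and stage the file
echo "2,20" >> data.csv
git add data.csv
staged_before=$(git diff --staged --name-only)

# Step 1: unstage (the modification stays in the working tree)
git restore --staged data.csv
staged_after=$(git diff --staged --name-only)
unstaged=$(git diff --name-only)

# Step 2: discard the modification itself
git restore data.csv
final=$(git status --porcelain)

echo "Staged before: $staged_before"
echo "Unstaged after step 1: $unstaged"
```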

Undoing things

  3. Change the most recent commit message

You have committed a new version but you want to change the commit message


# Change message of the last commit
git commit --amend


Note that this will also change the ID of the commit
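To illustrate the note above, this sketch amends a commit message in a scratch repository and compares the commit ID before and after. The repository, the messages and the identity are placeholders. The `-m` option is used here so that no editor opens.

```shell
set -e

# Scratch repository (placeholder location and identity)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "x,y" > data.csv
git add data.csv
git commit -q -m "add dta"     # message with a typo

old_id=$(git rev-parse HEAD)

# Rewrite the message of the last commit
git commit -q --amend -m "Add data"

new_id=$(git rev-parse HEAD)
count=$(git rev-list --count HEAD)

echo "Before: $old_id"
echo "After:  $new_id"
echo "Number of commits: $count"
```

Because the ID changes, amending a commit that has already been pushed would make your history differ from the remote one, so it is usually safer to amend only unpushed commits.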

Undoing things

  4. Revert one commit

You want to reverse the effects of a commit: use git revert


# Print git history
git log --oneline

# f960dd3 (HEAD -> main) commit 4
# dd4472c commit 3
# 2bb9bb4 commit 2
# 2d79e7e commit 1


# Revert commit dd4472c
git revert dd4472c


# Print git history
git log --oneline

# d62ad3e (HEAD -> main) Revert "commit 3"
# f960dd3 commit 4
# dd4472c commit 3
# 2bb9bb4 commit 2
# 2d79e7e commit 1

git revert does not alter the history and creates a new commit
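The same sequence can be replayed in a scratch repository. In this sketch, each of the four commits adds its own file (hypothetical names `file1.txt` to `file4.txt`) so that the revert applies without conflict. The `--no-edit` option keeps the default message instead of opening an editor.

```shell
set -e

# Scratch repository (placeholder location and identity)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

# Four commits, one new file each
for i in 1 2 3 4; do
  echo "content $i" > "file$i.txt"
  git add "file$i.txt"
  git commit -q -m "commit $i"
done

# ID of "commit 3" (the parent of the latest commit)
target=$(git rev-parse HEAD~1)

# Create a new commit that reverses the effects of commit 3
git revert --no-edit "$target" > /dev/null

git --no-pager log --oneline
count=$(git rev-list --count HEAD)
last=$(git log -1 --format=%s)
```

The history now has five commits: the original four plus the revert. The file added by commit 3 is gone, and the file added by commit 4 is untouched.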

Undoing things

  5. Deleting commits

You want to delete one or more commits: use git reset --hard


# Print git history
git log --oneline

# f960dd3 (HEAD -> main) commit 4
# dd4472c commit 3
# 2bb9bb4 commit 2
# 2d79e7e commit 1


# Delete the two most recent commits
git reset --hard HEAD~2


# Print git history
git log --oneline

# 2bb9bb4 (HEAD -> main) commit 2
# 2d79e7e commit 1

git reset --hard alters the history and discards uncommitted changes in tracked files. Be careful with this command
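One possible safety net, sketched below in a scratch repository, is to create a branch pointing at the current commit before resetting. The branch name `backup` is hypothetical, as are the file names and the identity. The deleted commits stay reachable from that branch and can be restored.

```shell
set -e

# Scratch repository (placeholder location and identity)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

for i in 1 2 3 4; do
  echo "content $i" > "file$i.txt"
  git add "file$i.txt"
  git commit -q -m "commit $i"
done

# Safety net: a branch that keeps pointing at commit 4 (hypothetical name)
git branch backup

# Delete the two most recent commits from the current branch
git reset -q --hard HEAD~2
count_after=$(git rev-list --count HEAD)

# Changed your mind? Move the current branch back to the saved commit
git reset -q --hard backup
count_restored=$(git rev-list --count HEAD)

echo "Commits after reset:   $count_after"
echo "Commits after restore: $count_restored"
```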

Back in time

Use the command git checkout to inspect the project as it was at a specific commit (git will report a detached HEAD)


# Print git history
git log --oneline

# f960dd3 (HEAD -> main) commit 4
# dd4472c commit 3
# 2bb9bb4 commit 2
# 2d79e7e commit 1


# Back in time
git checkout 2bb9bb4


# Print git history
git log --oneline

# 2bb9bb4 (HEAD) commit 2
# 2d79e7e commit 1


# Back to the most recent commit
git checkout -


# Print git history
git log --oneline

# f960dd3 (HEAD -> main) commit 4
# dd4472c commit 3
# 2bb9bb4 commit 2
# 2d79e7e commit 1
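If you only need to look at one file as it was in an older commit, `git show <commit>:<path>` prints it without moving HEAD. This is an alternative to the checkout shown above, not part of the original workflow. The sketch below uses a scratch repository with placeholder content.

```shell
set -e

# Scratch repository (placeholder location and identity)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

# Three successive versions of the same file
for i in 1 2 3; do
  echo "version $i" > data.csv
  git add data.csv
  git commit -q -m "commit $i"
done

# ID of the second commit
old=$(git rev-parse HEAD~1)

# Print data.csv as it was in that commit (working tree and HEAD are untouched)
old_content=$(git show "$old:data.csv")
current_content=$(cat data.csv)

echo "In $old: $old_content"
echo "Now:     $current_content"
```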

 Collaboration

Code hosting platforms

GitHub and similar platforms are cloud-based git repository hosting services

  Perfect solutions to collaborate on projects tracked by git


Services

  • Full integration of version control (commits, history, differences)
  • Easy collaboration w/ branches, forks, pull requests
  • Issue tracking system
  • Enhanced documentation rendering (README, Wiki)
  • Static website hosting
  • Automation & monitoring (CI/CD)

Code hosting platforms

Main platforms

Presentation of GitHub

Overview

  • Created in 2008
  • For-profit company (owned by Microsoft since 2018)
  • Used by more than 100 million developers around the world


Advantages

  • User-friendly interface for git
  • Free account w/ unlimited public/private repositories
  • Organization account (w/ free plan)
  • Advanced tools for collaboration

Create a repository

Clone a repository

On a terminal

# Change path
cd path/to/store/repository

# Clone repository (SSH)
git clone git@github.com:ahasverus/projectname.git

# Clone repository (HTTPS)
# git clone https://github.com/ahasverus/projectname.git

# Go to repo
cd projectname

Clone a repository w/ RStudio


Select Version Control

Select Git

Copy the URL and fill all the fields

Working w/ GitHub

 Add a new file: README.md


git status

# On branch main
# Your branch is up to date with 'origin/main'
#
# Untracked files:
#   README.md
# 
# Nothing added to commit but untracked files present
# Use "git add <file>..." to track

Working w/ GitHub

 Stage changes


git add .


git status

# On branch main
# Your branch is up to date with 'origin/main'
#
# Changes to be committed:
#   (use "git restore --staged <file>..." to unstage)
#   new file:   README.md

Working w/ GitHub

 Commit changes


git commit -m "add README"


git status

# On branch main
# Your branch is ahead of 'origin/main' by 1 commit.
#   (use "git push" to publish your local commits)
# 
# nothing to commit, working tree clean

Working w/ GitHub

 Push changes to remote


git push

# If the branch has no upstream yet (e.g. first push), use:
git push -u origin main


git status

# On branch main
# Your branch is up to date with 'origin/main'.
# 
# nothing to commit, working tree clean
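The effect of `-u` can be seen without a GitHub account. In this sketch a local bare repository stands in for the remote. This is an assumption for demonstration purposes, and all paths and the identity are placeholders. After `git push -u origin main`, the local branch tracks `origin/main`, so later pushes and pulls need no arguments.

```shell
set -e

work=$(mktemp -d)

# A local bare repository plays the role of the GitHub remote (stand-in)
git init -q --bare "$work/remote.git"

# Local project
git init -q "$work/projectname"
cd "$work/projectname"
git symbolic-ref HEAD refs/heads/main     # make sure the branch is named main
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"
git remote add origin "$work/remote.git"

echo "# projectname" > README.md
git add README.md
git commit -q -m "add README"

# First push: publish the branch and record origin/main as its upstream
git push -q -u origin main

# Name of the upstream branch now tracked by main
upstream=$(git rev-parse --abbrev-ref --symbolic-full-name "@{u}")
echo "Upstream: $upstream"

# From now on a plain push is enough
echo "more text" >> README.md
git commit -q -am "edit README"
git push -q
```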

Working w/ GitHub

 Pull changes from remote


git pull


git status

# On branch main
# Your branch is up to date with 'origin/main'.
# 
# nothing to commit, working tree clean

Help me, I can't push!

When you try to push, you might see the following error message:

git push

# To github.com:ahasverus/projectname.git
#  ! [rejected]        main -> main (fetch first)
#
# error: failed to push some refs to 'github.com:ahasverus/projectname.git'
#
# hint: Updates were rejected because the remote contains work that you do
# hint: not have locally. This is usually caused by another repository pushing
# hint: to the same ref. You may want to first integrate the remote changes
# hint: (e.g., 'git pull ...') before pushing again.
# hint: See the 'Note about fast-forwards' in 'git push --help' for details.


 Just git pull and try to git push again
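The rejected push can be reproduced locally. In this sketch a bare repository stands in for GitHub, and two clones (hypothetically named `alice` and `bob`) stand in for two collaborators. The pull is written as `git pull --rebase`, an assumption consistent with the rebase hints shown elsewhere in these slides, because recent git versions may refuse a plain `git pull` on diverged branches unless a strategy is configured.

```shell
set -e

work=$(mktemp -d)

# Stand-in for the GitHub repository
git init -q --bare "$work/remote.git"
git -C "$work/remote.git" symbolic-ref HEAD refs/heads/main

# First collaborator: initial commit
git clone -q "$work/remote.git" "$work/alice" 2> /dev/null
cd "$work/alice"
git symbolic-ref HEAD refs/heads/main
git config user.name "Alice"; git config user.email "alice@example.com"
echo "# projectname" > README.md
git add README.md
git commit -q -m "add README"
git push -q -u origin main

# Second collaborator: clone the project
git clone -q "$work/remote.git" "$work/bob"
cd "$work/bob"
git config user.name "Bob"; git config user.email "bob@example.com"

# Alice pushes new work in the meantime
cd "$work/alice"
echo "a" > a.txt; git add a.txt; git commit -q -m "add a.txt"
git push -q origin main

# Bob commits and tries to push: rejected, the remote has work he lacks
cd "$work/bob"
echo "b" > b.txt; git add b.txt; git commit -q -m "add b.txt"
if git push -q origin main 2> /dev/null; then first_push=accepted; else first_push=rejected; fi

# Integrate the remote changes, then push again
git pull -q --rebase origin main
git push -q origin main

remote_count=$(git -C "$work/remote.git" rev-list --count main)
echo "First push: $first_push, commits on remote: $remote_count"
```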

Help me, I can't pull!

When you try to pull, you might see the following error message:

git pull

# [...]
# Auto-merging README.md
# CONFLICT (content): Merge conflict in README.md
#
# error: could not apply b8302e6... edit README
#
# hint: Resolve all conflicts manually, mark them as resolved with
# hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
# hint: You can instead skip this commit: run "git rebase --skip".
# hint: To abort and get back to the state before "git rebase", 
# hint: run "git rebase --abort".


 Welcome to the wonderful world of git conflicts

Resolving conflicts

What is a (lexical) conflict?

A git conflict appears when two versions cannot be merged by git because changes have been made to the same lines.


README.md - Version A

# The SURPRISE pizza

An amazing surprise of the dev team dedicated just to your fancy
thirst for fortune and originality.

README.md - Version B

# The Surprise Pizza

An amazing surprise of the dev team dedicated just to your fancy
thirst for fortune and originality.


Git will identify conflicts in files:

<<<<<<< HEAD
# The SURPRISE pizza
=======
# The Surprise Pizza
>>>>>>> b8302e6 (edit README)

An amazing surprise of the dev team dedicated just to your fancy
thirst for fortune and originality.


README.md - Final version

# My wonderful pizza

An amazing surprise of the dev team dedicated just to your fancy
thirst for fortune and originality.


 You have to decide which version you want to keep: edit the file, remove the conflict markers, then mark the conflict as resolved with git add.
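A complete resolution can be rehearsed in a scratch repository. This sketch creates the conflict with two local branches and `git rebase` instead of a real `git pull`. The branch name `edit-title`, the shortened README text and the identity are assumptions for illustration. `GIT_EDITOR=true` simply accepts the default commit message when the rebase continues.

```shell
set -e

# Scratch repository (placeholder location and identity)
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"
git config user.email "demo@example.com"

printf '# The surprise pizza\n\nAn amazing surprise.\n' > README.md
git add README.md
git commit -q -m "add README"

# Version B, on a separate branch (hypothetical name)
git checkout -q -b edit-title
printf '# The Surprise Pizza\n\nAn amazing surprise.\n' > README.md
git commit -q -am "edit README"

# Version A, on main: same line changed differently
git checkout -q main
printf '# The SURPRISE pizza\n\nAn amazing surprise.\n' > README.md
git commit -q -am "shout the title"

# Replaying version B on top of version A stops with a conflict
git checkout -q edit-title
git rebase main > /dev/null 2>&1 || true
markers=$(grep -c '^<<<<<<<' README.md)

# Resolve: write the final version (no markers left), then stage it
printf '# My wonderful pizza\n\nAn amazing surprise.\n' > README.md
git add README.md

# Finish the rebase
GIT_EDITOR=true git rebase --continue > /dev/null

title=$(head -n 1 README.md)
count=$(git rev-list --count HEAD)
echo "Conflict blocks found: $markers, final title: $title"
```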

Collaboration - GitHub Flow


Working w/ issues

…

 Code & Software

Code & Software

Work in progress…

 Manuscript

Manuscript

Work in progress…

Notes