Code Convention

last updated: 14.06.2023

Reasoning

Creating a project and code that follows established conventions and guidelines in natural sciences, specifically in the field of ecology, ensures that your work is comprehensible to others who review it. Whether it’s your collaborators, fellow researchers, or even future self several months down the line, adhering to these best practices has numerous benefits. Implementing code that aligns with these conventions simplifies the task of identifying and rectifying errors, comparing replicated code, and making significant modifications without disrupting the overall functionality.

R-specific

I am generally working with R programming language, so most of the specifics are related to that.

Holistic workflow

Each R - session is treated as disposable. Always start R with a blank slate (no saved data in memory). It is preferred to start new sessions often and regularly.

For objects, which are created over computationally demanding processes (data, which might be important) should be saved from memory to a hard drive (e.g. as a rds file).

Project structure

Each project should be created as a self-contain unit, i.e. a project-oriented workflow. The R project consists of data and codes with individual scripts and functions. All scripts are stored in the R/ folder, data in Data etc.

A default project template can be found here.

The default folder structure:

├─ Data
|   ├─ Input
|   ├─ Processed
|   └─ Temp
├─ Outputs
|   ├─ Data
|   ├─ Figures
|   └─ Tables
├─ R
|   ├─ ___Init_project___.R
|   ├─ 00_Config_file.R
|   ├─ 00_Master.R
|   ├─ 01_Data_processing
|   |   └─ Run_01.R
|   ├─ 02_Main_analyses
|   |   └─ Run_02.R
|   ├─ 03_Supplementary_analyses
|   |   └─ Run_03.R
|   ├─ Functions
|       └─ example_fc.R
├─ README.md
├─ renv
|   └─ library_list.lock
└─ [project name].Rproj

Files & folders

Folders and files can have numbering to guide a user to the sequences of analyses. However, this can be added later in the project as it causes various issues with version control.

Folders names

  • Underscore with only the first letter capitalized (Capital_snake_style)

File naming

  • Underscore with only the first letter capitalized (Capital_snake_style)
  • File name should contain dates
    • dates should be in YYYY-MM-DD
    • see RUtilpol for easy handling of “version control of files”

Temporary file

  • temporary data files should not hold any important information
  • No links between Temp files and scripts on GitHub
  • Data/Temp should Include in .gitignore

Safe path

All paths should be using the here package to make sure that they work on all machines.

Package dependencies

The {r, eval=falseenv} package, is an R-packages dependency management, which is set up for reproducibility.

The ___Init_project___.R script is used for the preparation of all R-packages. Mainly it will install {r, eval=falseUtilpol} and all dependencies, which is used through the project as version control of files.

Cascade of R scripts

This project is constructed using a script cascade. This means that the 00_Master.R, located within R folder, executes all scripts within sub-folders of the R folder, which in turn, executes all their sub-folders (e.g., R/01_Data_processing/Run_01.R executes R/01_Data_processing/01_full_data_process.R, R/01_Data_processing/02_data_overview.R, R/01_Data_processing/03_data_ant_counts.R, …).

Configuration file

The configuration file (00_Config_file) holds the utmost importance in a project. It serves a multitude of essential purposes, such as defining global variables, loading functions, specifying file paths, and more. Every other file within a project should initiate with a reference to the configuration file (e.g., source("00_Config_file")), as it aims to minimize repetition and establish an abstraction layer that allows for centralized changes. By declaring paths in the configuration file that are utilized across multiple scripts, the user can simply refer to them by their variable name in the scripts. This approach enables seamless modifications, including renaming variables or transitioning from down-sampled to full data. All you need to do is update the relevant path in one place, and the change will automatically propagate throughout your project.

One script per task

Each script should be also self-contained. It means that it should start with loading data and finish with saving results. Therefore, each script should be able to be run without having any other data in memory (except for data from 00_Config_file).

Each script should do just one task (see also file name). If it is hard to describe one task, it is better to split the script into several.

Code

Coding style

My coding style is a combination of various sources (Tidyverse and Google, and others). I am adopting it, as I am progressing in my career. However, style should be consistent at least within a single project.

Code (script) structure

One script should serve one purpose and that should be obvious from its name. The script is always partitioned into clearly readable chunks (see below).

Script annotation (comments)

Script header

The script header should contain the name of the project, objectives (purpose) of that script, authors, and rough date (year of the project).

Example of a header:

#----------------------------------------------------------#
#
#
#                     Project name 
#
#                      Script name
#                      - continue
#
#                       Authors 
#                        Year
#
#----------------------------------------------------------#

Section header

Each section of a script should begin with a header which consists of a name wrapped by two lines. The name of a header should start with a capital letter. Each header name should be followed by ----- so that it is automatically picked by IDE as a section header.

Empty lines should be placed before each header to separate chunks.

Headings can have various hierarchies:

  1. #----------------------------------------------------------#
  2. #--------------------------------------------------#
  3. #----------------------------------------#

Example of a header:

#----------------------------------------------------------#
# Load data -----
#----------------------------------------------------------#

Header names can be denoted by numbers, with subsections separated by *.*

Example of a numbered header:

#----------------------------------------------------------#
# 1. Estimate diversity -----
#----------------------------------------------------------#


#--------------------------------------------------#
# 1.1. Fit model -----
#--------------------------------------------------#

Single-line comments

Adding comments to code plays a pivotal role in ensuring reproducibility and preserving code knowledge for future reference. When things change or break, the user will be thankful for comments. There’s no need to comment excessively or unnecessarily, but a comment describing what a large or complex chunk of code does is always helpful. The first letter of a comment is capitalized and spaced away from the pound sign (#).

Example of a single-line comment:

# This is a comment.

Multi-line comment

Multi-line comments should start with a capital letter and the new line should start with one tab.

Example of a multi-line comment:

# This is a very long comment, where I need to describe
#    what this code is doing

Inline comment

Inline comments should always start with a space.

Example of inline comment:

function(
    agr = 1 # This is an example of an inline comment
)

Commenting functions

Function decoration should be placed before each function. See functions for details.

Code width

No line of code should be longer than 80 characters (including comments). Users can visualise the 80 characters line in selected IDER

Names of objects and function

"There are only two hard things in Computer Science: cache invalidation and naming things."

Object names

Object and function should be using snake_style. The . in names is somewhat popular but it causes issues with names of methods and should be therefore avoided. The names are preferred to be very descriptive, more expressive and more explicit (note that the default linter setting of long names can be disabled).

The names should be nouns and start with the type of object: - data_* - for data - special subcategory is table_* for tables (mainly as an object for reference). Note that all tables can be data but now vice versa. - list_ - for lists - vec_ - for vectors - mod_* - for statistical model - res_ - special category, which can be used within the function to name an object to be returned (return(res_*)).

Examples of good names:

# data
data_diversity_survey

# list
list_diversity_individual_plots

# vector
vec_region_names

# model
mod_diversity_linear

# result
res_estimated_weight

Function names

Names of functions should be verbs and describe the expected functionality.

Examples of good function names

estimate_alpha_diversity()

get_first_value()

tranform_into_character()

Internal function

Note that it is possible to start a function with a "." (e.g., .get_reound_value()) flag internal functions.

Column (variable) names in data.frames

snake_style is preferred for column names in both data.frames and tibbles. Note that the janitor package can be used to edit this automatically.

Syntax

Many of the syntax issues can be checked/fixed by lintr and styler packages, which can be used to automate lots of the tedious aspects.

Spaces (empty character)

Space (" ") should be always placed: - after a comma - before and after infix operators (==, +, -, <-, ~, etc.`)

Exceptions: - No spaces inside or outside parentheses for regular function calls - Operators with high precedence should not be surounced by space :, ::, :::, $, @, [, [[, ^, unary -

New line ()

I prefer to have code more vertical than horizontal. Therefore, there are quite a lot of new lines.

Usage of a semicolon (;) to indicate a new line is not preferred.

A new line should be:

1. After an object assignment (<-)

data_diversity <-
  read_data(...)

An exception is an assignment of function.

get_data <- function(...){
    ...
}

2. After a pipe operator (%>%)

Note that there should be a space before a pipe

data_diversity <-
  get_data() %>%
  transform_to_percentages()

3. After a function argument

This should be true for both function declaration and usage. The exception is a single argument.

get_data <- function(arg1 = foo,
                     arg2 = here::here()) {
  ...
}

data_diversity <-
  get_data(
    arg1 = foo,
    arg2 = here::here()
  )

vec_selected_regions <-
  get_regions(arg1 = foo)

4. Parentheses

Each type of parentheses (brackets) has its own rules:

round ( )
  • should not be placed on separate first and last line
  • always space before the bracket (unless it’s a function)
  • new line after start if it is a multi-argument function

Examples:

1 + ( a + b ) 

get_data(arg = foo)

get_data(
    agr1 = foo,
    agr2 = here::here()
)
Square [ ]
  • Never space before the bracket
  • always space instead of missing value

Examples:

list_diversity_for_each_plot[[1]]

data_cars[, 2]
Curly { }
  • Use only for functions and expressions
  • { should be the last character on a line and should never be on its own
  • } should be the first character on a line
  • Always new brackets after else unless followed by if
  • Not used for chunks of code

Examples:

get_data <- function(agr1){
    ...
}

if(
    logical_test
) {
 ...
} else {
    ...
}

try(
    expr = {
    ...
    }
)

Assignment

Always use the left assignment <-.

Do NOT use: - right assignment (->) - equals (=)

There should be a new line after the assignment. Note that rarely singe-line assignment can be used:

data_diversity <-
  get_data()

prefered_shape <- "triangle"

Logical evaluation

Always use TRUE and FALSE, instead of T and F

Functions

For function calls, always state the arguments even though R can have anonymous arguments. The only exception is for functions, where arguments are not known (i.e. ... argument).

Tidyverse

It is preferred to use the Tidyverse version of functions over base ones:

Base R Better Style, Performance, and Utility
read.csv() readr::read_csv()
df$some_column df %>% dplyr::pull(some_column)
df$some_column = ... df %>% dplyr::mutate(some_column = ...)

Namespace

Always use the full package namespace with a function call. This helps to track the source of function in a script:

data_diversity %>%
  dplyr::mutate(
    beta_diverisity = 0
  )

Creating functions

Specific rules apply for making custom functions: - For naming of functions see function names - Each function (declaration) should be placed in a separate script named the function. Therefore, there should be only a single function in each function script. - function should always return (return(res_value))

Anonymous functions

In various instances, it might be better to not create a new function but to use an anonymous function (e.g. inside of purrr::map_*()).

In that case, the user should use tidle (~) for change in map default values in the function:

purrr::map(
    .f = ~ {
        mean(.x)
    }
)

For purrr::pmap_*(), the user should use ..1, ..2, etc

purrr::pmap(
    .l = list(
        list_1,
        list_2,
        list_3,
     .f = ~ {
         ..1 + ..2 + ..3
     }
    )
)

Function documentation

Each function should have documentation at the beginning of the function using the roxygen2 package. This can be useful also for project-specific functions (not just within the package) as it is easier to transition to a custom package.