How to Open Science: Writing the Data Analyses

Time to talk about reproducibility

May 22, 2023

At this point, you are almost done collecting your data. Since you know the general format your data is stored in, you can now begin writing or revising your data analyses to conform to that format. This example will emulate someone writing the analyses with code or third-party software, so there are a few additional considerations to be aware of. More specifically, we need to talk about reproducibility and communication.

What is Reproducibility?

Reproducibility is a simpler form of replicability where, given the data and analysis, you are able to reproduce the exact numbers reported in the paper. This feels relatively intuitive: you want the results to be consistent while also giving context as to how you obtained them.

In practice, however, making something reproducible is difficult. Providing the code or the third-party software file is usually not enough. You need to specify which libraries and versions you are using, reference file locations relative to the project rather than to your machine, provide information on the operating system and hardware in case of specific requirements, and so on. Because your machine is already set up, you make many underlying assumptions without noticing, so you need to take additional steps to ensure everything works elsewhere.
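As a rough illustration of recording that context, the sketch below collects the Python version, operating system, machine type, and installed package versions using only the standard library. The function name `environment_report` and the package list are assumptions made for this example, so swap in whatever your analysis actually imports.

```python
import platform
from importlib import metadata


def environment_report(packages):
    """Collect interpreter, OS, hardware, and package versions as a dict."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the gap instead of failing, so a report is always produced.
            versions[name] = "not installed"
    return {
        "python": platform.python_version(),
        "operating_system": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }


if __name__ == "__main__":
    # Hypothetical package list; replace it with what your analysis imports.
    report = environment_report(["numpy", "pandas"])
    for key, value in report.items():
        print(f"{key}: {value}")
```

Saving this output next to your results gives a reader something concrete to compare against when the numbers do not match on their machine.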

How do I make my analyses reproducible?

There are a lot of steps you can take to make your analyses reproducible, and plenty of guides are available online. Additionally, I make blog posts every Thursday about how to improve reproducibility. However, if we want to break it down as simply as possible: either have someone else try to run your code, or run the code on a new machine.

Each method comes with its own benefits and downsides. A colleague can provide feedback on what was or was not working. They can also share the workload by fixing issues themselves and letting you know how they solved them. However, a colleague will typically have a machine or setup similar to your own, which lets more obscure issues fly under the radar. Running your code on a new machine solves that problem, since no underlying assumptions carry over. The downside is that you are the one who has to solve every issue, which is harder on an unfamiliar operating system.

One of the things I typically recommend is using Docker, which essentially creates an isolated operating system instance via containers that you can use to test your code. Additionally, container scripts are relatively transferable between operating systems, assuming the underlying bytecode has no issues.

How do I communicate my results?

Once you have written your data analyses, you need a way to communicate your results and explain what your source code or third-party software file is doing. This makes your work easier to understand and more robust whenever someone down the line decides to use or build on it.

Format as the Report

The output of your data analyses should be formatted such that each number in the paper can be easily linked to a number in the output. You can output more numbers than you report, but that typically makes it harder to tell which output a given number in the paper refers to, especially when several values are similar.

If you want to output more results, I would add a flag specifically for reproducing the study as reported. When enabled, only the reported results are displayed. When disabled, the entire output is displayed.
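One way such a flag might look is sketched below with Python's `argparse`. The flag name `--paper-only`, the result names, and the placeholder values are all hypothetical and chosen only for illustration; they do not come from any real study.

```python
import argparse

# Names of the results that appear in the paper (hypothetical).
REPORTED = ("mean_response_time", "effect_size")


def select_results(results, paper_only):
    """Return only the reported results when paper_only is True, else everything."""
    if paper_only:
        return {name: results[name] for name in REPORTED}
    return dict(results)


def main(argv=None):
    parser = argparse.ArgumentParser(description="Run the data analyses.")
    parser.add_argument(
        "--paper-only",
        action="store_true",
        help="print only the numbers reported in the paper",
    )
    args = parser.parse_args(argv)

    # Placeholder values standing in for real analysis output.
    results = {
        "mean_response_time": 1.0,
        "effect_size": 2.0,
        "intermediate_sum": 3.0,
    }
    for name, value in select_results(results, args.paper_only).items():
        print(f"{name}: {value}")
```

Calling `main(["--paper-only"])` prints only the two reported values, while `main([])` prints all three. Keeping the list of reported names in one place also makes it easy to check that every number in the paper has a matching line in the output.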

Document Source Code

If you have source code, you should fully document anything the consumer might come into contact with. Documenting the source code improves its readability and helps you remember why a certain thing was implemented in a certain way. I usually recommend doing this as you are writing the code. Comments come more naturally in the moment, since you are already explaining each subsection to yourself as you revise the logic toward the final product.
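To give a sense of what that documentation can look like, here is a small sketch of a documented function. The function `standard_error` is a made-up example rather than part of any particular analysis, and the docstring layout is just one common convention.

```python
import math


def standard_error(values):
    """Return the standard error of the mean for a sequence of numbers.

    Parameters
    ----------
    values : sequence of float
        Observations from a single condition. Needs at least two values.

    Returns
    -------
    float
        Sample standard deviation divided by the square root of n.

    Raises
    ------
    ValueError
        If fewer than two values are given.
    """
    n = len(values)
    if n < 2:
        raise ValueError("standard_error needs at least two values")
    mean = sum(values) / n
    # Divide by n - 1 (Bessel's correction) because the values are a sample,
    # not the whole population.
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return math.sqrt(variance / n)
```

The docstring tells the consumer what goes in and what comes out, while the inline comment records why a choice was made, which is the part that is hardest to recover later.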

Add a README

A README contains a general description of the work along with any setup instructions necessary to run or reproduce the results. All work should have an associated README, regardless of whether the source code is already documented. The README is a great way to provide the general context the consumer needs to have everything work properly.

Some Additional Thoughts

On the Subject of Docker

There are plenty of good tutorials for Docker out there. I do plan on writing a blog post on Docker within the next few months as well. I want to provide a generalizable script that can be applied to a given environment with minimal changes, such that the author only needs to do a few searches.

Do not assume people know anything

In general, most consumers of your work do not understand how or why something is implemented in a certain way. As such, you should always document as much as possible. The only assumptions you should make are about what software you are writing in and about any general syntax your community uses to write shorthand comments.

Fears of errors

Some researchers might not make their data analyses public for fear they might be wrong. This is a valid concern, since an error could invalidate the results of the work. However, that's just how work is. You'll make mistakes sometimes. You can try to mitigate this by having others double-check your work, but in general, just be honest and report the corrected results in a follow-up work or preprint if needed.

Let's Discover Open Science and Reproducibility