How to Open Science: Preparation Before Data Collection

There is more to data collection than just collecting data.

May 01, 2023

Now that we have a general idea of what study we want to conduct, we have to actually run it. Many studies collect data to run analyses on. However, you should not start collecting data right away. If you do, you might run into a number of issues that make it difficult or impossible to make your data open. As such, let's review some questions to ask and steps to take before beginning to collect the data for your study.

What data are you collecting?

This is probably the most important question to ask yourself, if you haven't already. What data you end up collecting will influence the outcome of the study. There are two directions to go when deciding what data you want to collect: targeted or broad.

Targeted Data Collection

Targeted data collection is when you know exactly what data you plan to collect. Any data which is extraneous or unused is not collected from the associated location. What to collect would typically already be specified in your preregistration, in some prior analysis which you plan to replicate, or by an existing dataset. The targeted approach is typically much simpler to handle, as you can easily justify why you collect each piece of data and plan for exactly what you need from the participants. On the other hand, it is difficult to perform secondary analyses on targeted datasets when those analyses fall outside the original scope of the study. Additionally, any exploratory analyses would be limited.

Broad Data Collection

Broad data collection is the exact opposite: you try to collect as much data as possible from the associated location. This approach is typically used in papers which report on datasets, conduct secondary analyses, or explore many different avenues of research. Within the education technology field, there are plenty of dataset providers like ASSISTments and Eedi. The pros and cons of broad data collection are the opposite of those of targeted collection: it is much easier to conduct subsequent analyses and exploratory research; however, it is harder to justify collecting the entire dataset and to release it freely.

Who are you collecting the data from?

Who you collect the data from impacts what data can be released and what extra steps you need to take. For example, collecting data from primary school children requires a lot more safeguards and assurances compared to adults or college students. There are numerous edge cases depending on where you collect the data, so check your local laws to see what steps you must take during data collection and how the data must be stored. However, there are still some general concepts which all researchers should be aware of.

Human Ethics Committee

When conducting research on human subjects, you typically must get the research approved by a human ethics committee. The human ethics committee is responsible for protecting the rights and welfare of human subjects participating in research. Most universities and organizations have some administrative body responsible for approving, disapproving, monitoring, modifying, or exempting research from these principles. In the United States, the human ethics committee is referred to as the Institutional Review Board, or IRB for short. It may also have different names, such as the Independent Ethics Committee (IEC), Ethical Review Board (ERB), or Research Ethics Board (REB), depending on where you are.

Whenever you create the paperwork necessary for the human ethics committee, you should make sure it states that the data can be shared with any researcher who requests it for their own use case. If there is no data sharing clause within the paperwork itself, then the committee assumes the data will not be released. In that case, you would need to refile the paperwork with the correct information before you could share the data.

I recommend consulting the human ethics committee and submitting the necessary paperwork regardless of how you are using human subjects. It is better to have a document stating you are exempt from the process than to defend that position yourself. The administrative bodies responsible will always be better versed in the subject than you and can provide additional points of view that you did not think of. For example, my research which emailed authors about open science and reproducibility within their papers was exempted from the review process. However, during initial discussions, there was some concern over researchers potentially harming themselves through the information they revealed within the study, which could have sent it through the regular review process. So, it is always better to be safe than sorry.

Data Sharing Agreements

Data Sharing Agreements, or DSAs, are common amongst organizations or on publicly available datasets. Similar to a license, these typically indicate the terms and conditions under which you are allowed to use the dataset. DSAs differ between organizations, but they usually have three things in common: the dataset cannot be redistributed from its original location, the dataset must not be used to deanonymize participants, and the paper the dataset originated from or the dataset itself must be cited. Some datasets also contain non-commercial and public study clauses to make sure the dataset can only be used within a research setting. Organizations sometimes have stricter requirements, usually requiring you to destroy the data after the study has been completed.

Whenever signing a data sharing agreement, you should make sure of the following things when possible:

The dataset can be made public, or some metadata can be made public along with a way to request the original dataset.
The agreement allows researchers outside the initial group to use the dataset or metadata made publicly available for their own research.
The dataset is properly licensed for usage within your study and can be reported on without bias from the organization.

If you cannot share the data...

Of course, there are numerous cases where you are unable to share the raw data. Most of the time, it boils down to the human ethics committee and the data sharing agreements above. However, it can also be because of security concerns, existing policies, etc. In these cases, there are a few additional things you can do to comply while also making your data partially open.

Any of the situations below must be written into any data sharing agreements or paperwork submitted to the human ethics committee.

Protected Access Repositories

The dataset can be uploaded to a secured third-party repository, like the Inter-university Consortium for Political and Social Research (ICPSR), which external researchers can access via some documented process. These repositories typically require authors to provide detailed instructions on how to access the dataset, review results obtained before public disclosure, and provide detailed documentation on the dataset. These are particularly useful if you want a more minimal intervention approach to managing a dataset.

Can Request from Author

On the other hand, you can instead provide a clear pathway for requesting the dataset from the authors themselves. This is what most authors do, as it is much simpler to set up: a Google Form collects the requests, and access is then provided through some cloud storage or by email. However, it also requires more intervention from the authors themselves, even years after the study has been released and the research published.

Subsetted, Transformed, or Anonymized Datasets

If you are unable to release the raw data itself, you might be able to release a different form of the data instead. Subset datasets are essentially regular datasets with only some of their data released. This is typically used by researchers whose studies are ongoing or who plan to conduct additional studies. Transformed datasets are aggregates of the data, usually with any preprocessing already applied. Transformations are particularly useful when trying to obscure easily identifiable information within the raw data, such as one minority student within a majority classroom. They also simplify any additional steps, as only the information on the aggregate level needs to be made secure. Anonymized datasets alter the data such that you cannot link it back to its original owner. Most publicly released datasets anonymize their subjects to protect their identities. Researchers typically use a combination of these three strategies before publicly releasing a dataset.
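
As a rough illustration of how these three strategies might be combined, here is a minimal Python sketch. Everything in it is an assumption made for the example: the field names, the sample records, the salted hash used to replace identifiers, and the minimum group size used to drop small groups from the aggregate. Note that replacing identifiers with salted hashes is only one possible technique and does not by itself guarantee anonymity, so treat this as a starting point rather than a complete procedure.

```python
import hashlib
import secrets
from collections import defaultdict
from statistics import mean

# Hypothetical raw records; the field names are illustrative only.
RAW = [
    {"student_id": "s-001", "classroom": "A", "score": 0.8, "comment": "enjoyed the lesson"},
    {"student_id": "s-002", "classroom": "A", "score": 0.6, "comment": "too fast"},
    {"student_id": "s-003", "classroom": "B", "score": 0.9, "comment": "fine"},
]


def subset(records, fields):
    """Subset: keep only the listed fields of each record."""
    return [{f: r[f] for f in fields} for r in records]


def anonymize(records, id_field, salt):
    """Anonymize: replace the identifier with a truncated salted hash.

    The salt (bytes) must stay private; with it, identifiers could be re-derived.
    """
    out = []
    for r in records:
        digest = hashlib.sha256(salt + r[id_field].encode("utf-8")).hexdigest()
        out.append({**r, id_field: digest[:12]})
    return out


def aggregate(records, group_field, value_field, min_group_size=2):
    """Transform: report per-group counts and means, dropping small groups.

    Groups smaller than min_group_size are omitted so that a single
    participant cannot be read directly off the aggregate.
    """
    groups = defaultdict(list)
    for r in records:
        groups[r[group_field]].append(r[value_field])
    return {
        g: {"n": len(values), "mean": mean(values)}
        for g, values in groups.items()
        if len(values) >= min_group_size
    }


if __name__ == "__main__":
    salt = secrets.token_bytes(16)  # generated once, never released
    kept = subset(RAW, ["student_id", "classroom", "score"])
    released = anonymize(kept, "student_id", salt)
    print(released)
    print(aggregate(released, "classroom", "score"))
```

In this sketch, the free-text comment is never released, the identifiers are replaced, and classroom B is left out of the aggregate because it contains a single student.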

Recoverable Metadata

I touched upon this briefly above. If you are unable to release the dataset itself, you should at least obtain the rights to release metadata associated with the dataset which can be used to recover the information if another researcher requests the same dataset from the organization. Without this metadata, replication is not much affected, since at worst you will receive more data; reproducing the original result, however, will be almost impossible.
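
To make this more concrete, here is a minimal Python sketch of what such a metadata record might contain. The structure is purely hypothetical: the `provider` and `request` fields, the column list, and the SHA-256 fingerprint are my own assumptions about what would help another researcher confirm they received the same data, not a standard any organization requires.

```python
import hashlib
import json


def dataset_fingerprint(rows):
    """Return a SHA-256 hex digest that does not depend on row order.

    Each row (a dict of JSON-serializable values) is serialized with
    sorted keys, and the serialized rows are sorted before hashing.
    """
    canonical = sorted(json.dumps(row, sort_keys=True) for row in rows)
    digest = hashlib.sha256()
    for line in canonical:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def build_metadata(rows, provider, request):
    """Build a releasable description of a dataset without the data itself.

    `provider` and `request` are free-form, hypothetical fields describing
    who holds the data and what was asked for (filters, date range, etc.).
    """
    columns = sorted({key for row in rows for key in row})
    return {
        "provider": provider,
        "request": request,
        "columns": columns,
        "row_count": len(rows),
        "sha256": dataset_fingerprint(rows),
    }


if __name__ == "__main__":
    # Illustrative rows only; a real dataset would be loaded from storage.
    rows = [
        {"user": 1, "problem": "p1", "correct": True},
        {"user": 2, "problem": "p1", "correct": False},
    ]
    metadata = build_metadata(
        rows,
        provider="Example Organization",
        request={"date_range": ["2022-09-01", "2022-12-31"], "grades": [6, 7]},
    )
    print(json.dumps(metadata, indent=2))
```

A researcher who later obtains a dataset from the same organization could recompute the fingerprint and compare it against the released value to check whether they are working with the same rows.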

Now you have a better understanding of some of the additional steps needed to collect data.

Some Additional Thoughts

There are always risks involved

There are always risks involved when releasing datasets into the world. Someone may use them for malicious purposes, while others could use them outside of their appropriate license. That is always the risk when dealing with openly available materials. My general opinion is that you should always provide some access point to request the data whenever possible, even without making it 'public'. However, I also believe that there will always be more benefit than harm to making information publicly available. Regardless, it is your choice, within the limitations of the agreements and the ethics committee, what to do with the data.

Let's Discover Open Science and Reproducibility