Data Enclaves

Storing identifiable data securely

Jun 08, 2023

There are many ways to store data on the internet. When publishing data that contains sensitive information, most researchers choose to anonymize the dataset. This protects the study's participants from being identified through their personal or demographic information. But what if you want to make the raw dataset, personal information included, available? Setting aside the legal concerns, you can do so by using a data enclave.

What is a Data Enclave?

A data enclave is a remote, secure location for storing a dataset that contains identifiable information. Implementations differ widely, however, because researchers do not agree on what exactly counts as a data enclave. In general, the implementations can be classified into three levels, each with more restrictive access than the last.

The Three Levels of Restriction

Protected Access

The least restrictive version of a data enclave is one where the data is hosted at some remote location and access can be requested, usually being granted under some conditions. An accessor may need to sign a contract or agree to a license promising to use the data only for the requested purpose and to destroy it afterwards. Most researchers don't think of this as a data enclave. Instead, the data is considered to have protected access, meaning that its use must be requested. Once access is granted, the data can be downloaded onto the local machine for use within your experiment.

Virtual Data Enclave

The next level is similar to protected access with one caveat: the data cannot be downloaded onto the local machine. Instead, the accessor must remote into a virtual machine to view and use the data. This means that any source code that uses the data must be written on that machine or copied over to it. Since a virtual machine is being used, this is known as a virtual data enclave. A good portion of researchers consider this to be what a data enclave actually is. One archive site that uses virtual data enclaves is the Inter-university Consortium for Political and Social Research (ICPSR).

Private Data Enclave

The final level prevents the accessor from viewing the raw data at all. Instead, the accessor is given a description of the metadata to write their source code against. The source code and any other necessary files must then be submitted to the data enclave, usually in the form of a job, which runs the source code in an isolated container. After the source code has run, the outcome is passed through an output function that aggregates the results, making it impossible to trace them back to individual participants. Because of all the privacy involved, I call this a private data enclave. The remaining researchers consider this to be the definition of a data enclave in general. One software architecture that implements this design is the MOOC Replication Framework (MORF).
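
To make the flow above a little more concrete, here is a minimal Python sketch of the idea. It is not how MORF or any real enclave is implemented: the records, the METADATA description, the run_job and aggregate_output functions, and the minimum record count are all made up for illustration, and a plain function call stands in for the isolated container.

```python
from statistics import mean

# Hypothetical raw records held inside the enclave.
# The accessor never sees these directly.
_RAW_RECORDS = [
    {"student_id": "s1", "age": 19, "score": 70.0},
    {"student_id": "s2", "age": 22, "score": 80.0},
    {"student_id": "s3", "age": 20, "score": 90.0},
    {"student_id": "s4", "age": 21, "score": 60.0},
]

# What the accessor is given instead: field names and types only.
METADATA = {"student_id": "str", "age": "int", "score": "float"}


def aggregate_output(values, min_count=3):
    """Output function: release a count and a mean, never per-record values.

    The min_count threshold is an assumption of this sketch, added so that
    an aggregate over very few records is not released.
    """
    values = list(values)
    if len(values) < min_count:
        raise ValueError("too few records to release an aggregate")
    return {"count": len(values), "mean": mean(values)}


def run_job(job, min_count=3):
    """Run accessor-supplied code on the raw data and return only aggregates.

    A real enclave would run the job in an isolated container; here it is
    just a function call on a copy of each record.
    """
    per_record = [job(dict(record)) for record in _RAW_RECORDS]
    return aggregate_output(per_record, min_count)


# The accessor's job, written against METADATA without seeing the data.
def score_of(record):
    return record["score"]


result = run_job(score_of)
print(result)
```

The accessor only ever receives the dictionary returned by run_job, so the individual scores stay inside the enclave.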

How do I use a data enclave?

Well, that depends on the service provider you choose, which in turn should depend on the risk analysis you conduct on your data. Each service has its own steps to follow for implementation. You can also try building one yourself, but unless you are up to date on security research, I would recommend using one that already exists.

Some Additional Thoughts

Legal Concerns

You can only publish and distribute data as permitted by the contract, license, or waiver signed with the participants. Even a private data enclave cannot be used if the contract does not permit outside researchers to view or analyze the data. In those cases, you would have to become a sub-contractor of the university to be able to access the data itself.

Let's Discover Open Science and Reproducibility