Anonymising Datasets
These guidelines have been written specifically for studies using REDCap; however, they also apply to studies using other electronic data collection tools, e.g. Bristol Online Surveys (BOS).
The Principal Investigator for the study is responsible for ensuring that an adequate risk assessment has been completed on the study dataset schema before the REDCap project is moved to Live status.
Most REDCap projects are used to collect participant data for clinical research. A person’s medical history is classified as strictly confidential by the University. If your REDCap project is not being used to collect medical data, please contact the REDCap support team for advice.
Datasets that contain both medical data about a person and information that allows the person to be identified (this includes name, address, phone number and NHS number) are governed by the Data Protection Act. The University has a legal responsibility to safeguard this information.
Consider the implications if an exported dataset from REDCap were accidentally made public, or if the REDCap application were compromised and a malicious hacker obtained the data. In both of these cases University of Bristol Information Security must be informed. See http://www.bristol.ac.uk/infosec/policies/ for further information and http://www.bristol.ac.uk/infosec/uobdata/reportloss/ to report a data loss (or suspected data loss). If the dataset falls under the Data Protection Act, the breach must also be reported to the Information Commissioner’s Office (ICO).
REDCap is provided solely for anonymised clinical data linked by a ParticipantID; this is classified as linked anonymised data. A separate spreadsheet or database is used to store any participant identifiers. With the data split and stored on separate physical servers, a data compromise on the REDCap system does not need to be reported to the ICO, as no identifiable clinical data has been lost.
Creating a linked anonymised dataset involves making sure that:
- The Link IDs are held securely and in a separate location to the REDCap data
- There are no Direct Identifiers
- The likelihood of being able to identify an individual by combining any Indirect Identifiers is very low
- Fields where users are able to enter free text and other Potential Identifiers are minimised.
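One way to get a rough feel for the third point is to count how many records share each combination of Indirect Identifier values; a combination held by very few participants deserves a closer look. The following is a minimal sketch only. The field names and the threshold of 5 are illustrative assumptions and are not taken from the BRMS guidelines or the ICO code of practice.

```python
from collections import Counter


def small_groups(records, indirect_fields, threshold=5):
    """Return value combinations shared by fewer than `threshold` records.

    `records` is a list of dicts, one per participant.
    `indirect_fields` names the Indirect Identifiers to combine.
    The default threshold is an illustrative assumption, not a policy value.
    """
    counts = Counter(
        tuple(record.get(field) for field in indirect_fields)
        for record in records
    )
    return {combo: n for combo, n in counts.items() if n < threshold}


# Hypothetical example data: field names are made up for illustration.
example = [
    {"age_band": "40-49", "postcode_area": "BS8", "gender": "F"},
    {"age_band": "40-49", "postcode_area": "BS8", "gender": "F"},
    {"age_band": "80-89", "postcode_area": "BS1", "gender": "M"},
]
risky = small_groups(example, ["age_band", "postcode_area", "gender"], threshold=2)
```

In this example `risky` holds only the combination that appears once. A count like this is a screening aid and does not replace the risk assessment itself.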
The ICO does not provide detailed guidance with regard to the above. Researchers are required to follow the BRMS guidelines; studies that wish to store data excluded by the BRMS guidelines must apply to the Secretary’s Office for authorisation.
BRMS Guidelines for Linked Anonymised Data Sets
Link IDs
Typically this is a single randomly assigned ParticipantID.
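As an illustration, a random ParticipantID can be generated so that it carries no information about the participant (no initials, dates or recruitment order). This is a minimal sketch under stated assumptions: the ID length, the alphabet and the function names are hypothetical choices, not a prescribed format.

```python
import secrets

# Alphabet without easily confused characters (0/O, 1/I/L); an illustrative choice.
ALPHABET = "23456789ABCDEFGHJKMNPQRSTUVWXYZ"


def new_participant_id(existing, length=8):
    """Return a random ID that is not already present in `existing`."""
    while True:
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if candidate not in existing:
            return candidate


def build_link_table(identifier_rows):
    """Map a fresh random ParticipantID to each row of participant identifiers.

    The returned table is the link between the two datasets, so it belongs
    with the identifiers and not with the REDCap data.
    """
    link_table = {}
    for row in identifier_rows:
        link_table[new_participant_id(link_table)] = row
    return link_table
```

Only the ParticipantID would then be entered in REDCap; the link table stays with the separately stored identifiers.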
Direct Identifiers
Direct Identifiers must not be stored on REDCap and should be stored in a separate spreadsheet or database. Direct Identifiers include any of the following:
- First name
- Middle name(s)
- Surname
- Address
- Full postcode
- Phone number
- Fax number
- Mobile number
- Email address (unless using Surveys, see below)
- NHS number
- National Insurance Number
- Other identifying reference number
- Names of relatives, friends or colleagues
- Photographs/video showing face or other distinguishing features
The Secretary’s Office has authorised the storage of a single email address per participant if the sensitivity of the clinical data is low. This is to allow the REDCap project to use Surveys to email questionnaires to participants. Studies that collect data on sexual health or drug use must not store email addresses and must send survey links using a university email account.
Indirect Identifiers
There are typically four types of indirect identifiers:
- Demographic data: marital status, gender, ethnicity, occupation, salary
- Geographic data: most usually postcode, but can be any specific geographic location, e.g. GP surgery
- Dates: date of birth, date of death, admission date
- Medical identifiers: height, weight, unusual medical condition
There are two approaches to reducing the likelihood of re-identification using Indirect Identifiers:
- Reduce the granularity of the data you are requesting: ranges instead of exact values for items like salary, height or weight; age at baseline instead of date of birth; month and year, or year only, for other dates; the first three characters of a postcode instead of the full postcode.
- Store the Indirect Identifiers with the Direct Identifiers in a separate spreadsheet or database. Much of this information does not change over time and can therefore be collected at the point of consent. Indirect Identifiers that are required for the analysis can be imported into a statistical package when needed. The resulting combined dataset would be considered an identifiable clinical dataset and must be secured accordingly.
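The first approach can be sketched as a handful of small helper functions applied before data is entered or imported. The function names and the band width are hypothetical; the transformations themselves follow the examples listed above.

```python
from datetime import date


def age_at_baseline(date_of_birth, baseline):
    """Whole years between date of birth and the baseline date."""
    had_birthday = (baseline.month, baseline.day) >= (date_of_birth.month, date_of_birth.day)
    return baseline.year - date_of_birth.year - (0 if had_birthday else 1)


def to_range(value, width):
    """Replace an exact number with a band of the given width."""
    lower = (int(value) // width) * width
    return f"{lower}-{lower + width - 1}"


def month_and_year(d):
    """Drop the day from a date, keeping year and month."""
    return d.strftime("%Y-%m")


def postcode_prefix(postcode, chars=3):
    """First `chars` characters of a postcode, upper-cased and trimmed."""
    return postcode.strip().upper()[:chars].strip()


# Hypothetical example values.
age = age_at_baseline(date(1980, 6, 15), date(2020, 6, 14))
height_band = to_range(172, 10)
admitted = month_and_year(date(2019, 3, 7))
area = postcode_prefix("bs8 1th")
```

With these example values, `age` is 39, `height_band` is "170-179", `admitted` is "2019-03" and `area` is "BS8". Wider bands lower the re-identification risk further but also reduce analytical detail, so the width is a judgement to make per study.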
Potential Identifiers
The use of Notes Box (Paragraph Text) fields should be minimised. If REDCap is being used to store data like Medical Notes Review or Serious Adverse Events information, then training for the staff entering this data must be provided, with guidelines, so that identifiable data is not stored together with the clinical data.
If Notes Box (Paragraph Text) fields are used to collect data directly from the participant using surveys, each field must be risk assessed. If there is the possibility that the user might enter identifiable data then a suitable warning, using the REDCap Field Note, must be added.
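As an aid to this risk assessment, the project's data dictionary can be scanned for Notes Box fields that have no Field Note, so that each one can be reviewed. This is a rough sketch only. It assumes the data dictionary has been exported as CSV with columns headed "Variable / Field Name", "Field Type" and "Field Note", and that Notes Box fields have the type "notes"; check these assumptions against your own export before relying on it.

```python
import csv
import io


def notes_fields_missing_warning(dictionary_csv_text):
    """List free-text (notes) fields whose Field Note is empty.

    Column headers and the "notes" type value are assumptions about the
    exported data dictionary; adjust them to match the actual file.
    """
    reader = csv.DictReader(io.StringIO(dictionary_csv_text))
    return [
        row["Variable / Field Name"]
        for row in reader
        if row["Field Type"].strip().lower() == "notes"
        and not row["Field Note"].strip()
    ]


# Hypothetical three-column extract of a data dictionary.
sample = (
    "Variable / Field Name,Field Type,Field Note\n"
    "comments,notes,\n"
    "ae_detail,notes,Do not enter names or contact details\n"
    "age_band,dropdown,\n"
)
to_review = notes_fields_missing_warning(sample)
```

Here `to_review` contains only "comments". A field appearing in the list still needs a human decision on whether a warning is required; the scan only shows where none is present.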
Source Documentation
Information Commissioner’s Office: Anonymisation: managing data protection risk code of practice