In this section:
- Understand data purpose and context
- Understand specific safeguarding requirements
- Determine the right safeguarding approach
- Document and monitor safeguarding actions
- Build good governance around data and safeguarding
For data to be shared routinely and safely, it must not contain:
- Information which may identify an individual or community
- Sensitive information
- Any data that could trigger, create or contribute to a threat, issue, breach or vulnerability
This advice suggests ways to safeguard data so that personal and sensitive information is protected when data is shared or released.
De-identification involves removing or altering information that identifies an individual, or is reasonably likely to do so.
De-sensitisation involves removing any sensitive data that could require protection. Sensitive data includes information that, if released, could trigger, create or contribute to a threat, issue, financial impact, breach or vulnerability.
Together, de-identifying and de-sensitising are referred to as safeguarding data.
Safeguarding generally involves:
- removing personal identifiers, such as an individual’s name, address, date of birth or other identifying information
- removing or altering other information that may allow an individual to be identified, for example, because of a rare characteristic of the individual, or a combination of unique or remarkable characteristics that enable identification, and
- removing or altering information that can reveal sensitivities.
To prepare for safeguarding, the following steps should be considered.
Understanding the purpose of the data, how it was derived and its attributes will help to identify and evaluate any risks that may need to be managed.
Once the context of the data is understood and any potential risks have been flagged, identify the specific aspects of the data that need safeguarding.
The following questions can help identify what safeguarding is required:
- Where is the personal or sensitive information contained in the dataset?
- Is identifying or sensitive information spread through a number of data elements?
- Does the dataset contain direct identifiers?
- Does it contain characteristics or other forms of information that would enable personal identity or other sensitivities to be determined?
- Does a single data element need to be safeguarded or will a range of elements require treatment?
- Does the dataset involve direct identification, where individuals are explicitly identified through, for example, a list of full names?
- Is indirect identification possible through the combination of two or more data sources?
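One way to explore the questions about indirect identification is to count how many records share each combination of potentially identifying attributes. The following is a minimal sketch only; the field names (`age_range`, `postcode`, `profession`) and the sample records are hypothetical and are not drawn from any real dataset.

```python
from collections import Counter


def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )


def smallest_group_size(records, quasi_identifiers):
    """Size of the smallest group; 1 means at least one record is unique."""
    sizes = group_sizes(records, quasi_identifiers)
    return min(sizes.values()) if sizes else 0


def unique_records(records, quasi_identifiers):
    """Return records whose combination of values appears only once."""
    sizes = group_sizes(records, quasi_identifiers)
    return [
        record
        for record in records
        if sizes[tuple(record[field] for field in quasi_identifiers)] == 1
    ]


# Hypothetical sample data for illustration only.
records = [
    {"age_range": "25-39", "postcode": "3000", "profession": "nurse"},
    {"age_range": "25-39", "postcode": "3000", "profession": "nurse"},
    {"age_range": "40-54", "postcode": "3000", "profession": "judge"},
]

at_risk = unique_records(records, ["age_range", "postcode", "profession"])
```

Here the third record is the only one with its combination of values, so it would be flagged for further treatment. A record that is not unique within a dataset may still be identifiable when combined with other sources, so a check like this is a starting point rather than a guarantee.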
Once the type of sensitive information requiring protection has been identified, one or more of the following methods may be used to safeguard it.
Although most apply to personal information, they can also be applied to safeguard other forms of sensitive information:
- Remove or modify personal identifiers such as a person’s name, address and date of birth as an essential component of de-identification.
- Combine information or data that is likely to enable identification of an individual into categories. For example, age could be combined and expressed in ranges, like ‘age range: 25-39’, rather than in a single year like ‘age: 27’.
- Manufacture ‘synthetic data’, which can be generated from original data and then substituted for it, while preserving some of the patterns contained in the original data. This removes any personal or sensitive information, while allowing general conclusions or insights to be released.
- Introduce small amounts of random error, for example by rounding or swapping data between records. This could include swapping modified identifying information for one person with the information for another person with similar characteristics, to prevent the release of unique personal information.
- Suppress data to ensure identifying or sensitive data is not released. For example, you may want to redact commercially sensitive figures or company names. Data suppression may impair the utility of a dataset, so design suppression methods with care.
- Alter identifiable information in a small way (introducing a tolerable error) so that the aggregate information or data is not significantly affected but the original values cannot be known with certainty.
- Remove, encrypt or modify quasi-identifiers that are unique to an individual or that, in combination with other unique or remarkable characteristics (such as profession or significant dates), are reasonably likely to identify an individual.
- Remove names and attach a coded reference or pseudonym to each record. This will allow data to be released without the individual being identified. The same individual should always be replaced by the same coded reference or pseudonym, so that insights from the data are maintained. With pseudonymisation, it is important to ensure that other publicly accessible data, such as electoral roll data, cannot be used to reintroduce the names that have been removed from the dataset.
- Aggregate or display data as totals rather than individual values, so no data relating to or identifying any individual is shown. Small numbers in totals are often suppressed through ‘blurring’ or by being omitted altogether.
- Reduce the risk to privacy when publishing spatial information by:
  - increasing a mapping area to cover more properties or occupants
  - reducing the frequency or timeliness of publication, so that it covers more events, makes it harder to identify a recent case, or does not reveal additional data such as the time or date of an event
  - removing the final ‘octet’ on IP addresses to degrade the location data they contain
  - using formats, such as heat maps, that provide an overview without allowing the inference of detailed information about a particular place or person
  - never publishing spatial information at a household level.
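Several of the methods above can be sketched in a few lines of code. The example below is illustrative only and covers four of them: combining ages into ranges, attaching a consistent pseudonym, suppressing small totals, and removing the final octet of an IP address. The age bands, the suppression threshold of 5, the 12-character pseudonym length and the use of a keyed hash (HMAC-SHA-256) are all assumptions made for this sketch, not requirements stated in this advice.

```python
import hashlib
import hmac
import ipaddress

# Hypothetical age bands; choose bands that suit the dataset.
AGE_BANDS = [(0, 24), (25, 39), (40, 54), (55, None)]


def age_band(age, bands=AGE_BANDS):
    """Express an exact age as a range, e.g. 27 -> '25-39'."""
    for low, high in bands:
        if high is None:
            if age >= low:
                return f"{low}+"
        elif low <= age <= high:
            return f"{low}-{high}"
    raise ValueError(f"age {age} is outside the configured bands")


def pseudonymise(name, secret_key):
    """Replace a name with a coded reference.

    The same name always yields the same reference for a given key.
    The key must be kept secret, otherwise names could be re-derived.
    """
    normalised = " ".join(name.split()).casefold()
    digest = hmac.new(secret_key, normalised.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]  # shortened for readability (assumption)


def suppress_small_counts(totals, threshold=5):
    """Replace totals below the (assumed) threshold with None."""
    return {
        category: (count if count >= threshold else None)
        for category, count in totals.items()
    }


def truncate_ipv4(address):
    """Zero the final octet of an IPv4 address, e.g. 203.0.113.77 -> 203.0.113.0."""
    network = ipaddress.ip_network(f"{address}/24", strict=False)
    return str(network.network_address)
```

For example, `age_band(27)` returns `'25-39'` and `truncate_ipv4("203.0.113.77")` returns `'203.0.113.0'`. A pseudonym produced this way does not, on its own, prevent re-identification through other attributes left in the record, so it would normally be combined with the other methods listed.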
When sharing data after safeguarding, any safeguarding actions that have been performed need to be described in data quality statements, to help users understand any limitations to or alterations of the data. The safeguards should also be described in the metadata about the datasets.
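As an illustration, the safeguarding actions applied to a dataset could be recorded in a simple machine-readable form alongside the data quality statement. This is a sketch under assumptions: the structure and the field names (`field_name`, `technique`, `description`, `last_reviewed`) are hypothetical and do not follow any particular metadata standard.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SafeguardingAction:
    field_name: str   # data element that was treated
    technique: str    # e.g. "generalisation", "suppression"
    description: str  # plain-language note for data users


def build_safeguarding_metadata(dataset_title, actions, last_reviewed):
    """Assemble a metadata record describing the safeguards applied."""
    return {
        "dataset": dataset_title,
        "safeguarding_actions": [asdict(action) for action in actions],
        "last_reviewed": last_reviewed,
    }


# Hypothetical example values for illustration only.
metadata = build_safeguarding_metadata(
    "Example service usage dataset",
    [
        SafeguardingAction("age", "generalisation", "Ages are expressed in ranges."),
        SafeguardingAction("count", "suppression", "Small totals are omitted."),
    ],
    last_reviewed="YYYY-MM-DD",
)
metadata_json = json.dumps(metadata, indent=2)
```

Recording the review date with the actions also supports the routine review of safeguards.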
Ongoing monitoring is an important part of safeguarding processes. Safeguarding risks and risk treatments should be reviewed routinely and systematically, to keep pace with advances in technology and in re-identification methods, and to stay aligned with best-practice de-sensitisation techniques.
To build good governance around safeguarding, establish protection processes such as:
- clear responsibilities for authorising and overseeing the safeguarding process
- staff training
- monitoring processes, to ensure safeguarding techniques are applied correctly, effectively and consistently
- procedures for identifying cases where data safeguarding may be difficult to achieve in practice
- maintaining awareness of new techniques for safeguarding data and any potential re-identification risks
- liaison with other organisations performing safeguarding work
- awareness of the broader data environment, such as other related datasets and people who may have a motivation to exploit the data beyond its intended use.
You can also seek independent validation of your safeguarding approaches by engaging a third party to perform motivated re-identification or linking.
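A linking test of this kind can be approximated internally before engaging a third party. The sketch below joins a safeguarded dataset to a second dataset on shared attributes and reports released records that match exactly one named person. The datasets, field names and matching rule are all hypothetical; a motivated third party would typically use richer data and less exact matching.

```python
from collections import defaultdict


def link_records(released, external, keys):
    """Return (released_record, external_record) pairs where the shared
    attributes in `keys` match exactly one record in the external data."""
    index = defaultdict(list)
    for person in external:
        index[tuple(person[key] for key in keys)].append(person)

    matches = []
    for record in released:
        candidates = index.get(tuple(record[key] for key in keys), [])
        if len(candidates) == 1:  # a single candidate suggests re-identification risk
            matches.append((record, candidates[0]))
    return matches


# Hypothetical data for illustration only.
released = [
    {"ref": "a1", "age_range": "25-39", "postcode": "3000"},
    {"ref": "b2", "age_range": "40-54", "postcode": "3001"},
]
external = [
    {"name": "Person X", "age_range": "25-39", "postcode": "3000"},
    {"name": "Person Y", "age_range": "25-39", "postcode": "3000"},
    {"name": "Person Z", "age_range": "40-54", "postcode": "3001"},
]

risky = link_records(released, external, ["age_range", "postcode"])
```

In this example only record `b2` links to a single person, so it would be a candidate for further treatment such as wider ranges or suppression.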
Good data management practices across your organisation will also enable you to safely gather and store data, and ensure it is protected against misuse, interference, loss, unauthorised access, modification or release.
Data creation and management processes should also be designed to ensure that minimal personal information is collected and that protecting data is as seamless as possible. Data collection processes should also ensure that members of the public who are represented in government datasets:
- are aware of and consent to their data being collected and released
- understand what the data will be used for, and
- know how to access their information, or to make a complaint.
Last updated: 19 June 2019