Data anonymization

Data anonymization is a type of information sanitization whose intent is privacy protection. It is the process of removing personally identifiable information from data sets, so that the people whom the data describe remain anonymous.

Data anonymization has been defined as a "process by which personal data is altered in such a way that a data subject can no longer be identified directly or indirectly, either by the data controller alone or in collaboration with any other party."^[1] Data anonymization may enable the transfer of information across a boundary, such as between two departments within an agency or between two agencies, while reducing the risk of unintended disclosure, and in certain environments in a manner that enables evaluation and analytics post-anonymization.

In the context of medical data, anonymized data refers to data from which the patient cannot be identified by the recipient of the information. The name, address, and full postcode must be removed, together with any other information which, in conjunction with other data held by or disclosed to the recipient, could identify the patient.^[2]

There will always be a risk that anonymized data may not stay anonymous over time. Pairing the anonymized dataset with other data, clever techniques and raw power are some of the ways previously anonymous data sets have become de-anonymized; The data subjects are no longer anonymous.

De-anonymization is the reverse process in which anonymous data is cross-referenced with other data sources to re-identify the anonymous data source.^[3]Generalization and perturbation are the two popular anonymization approaches for relational data.^[4] The process of obscuring data with the ability to re-identify it later is also called pseudonymization and is one way companies can store data in a way that is HIPAA compliant.

However, according to ARTICLE 29 DATA PROTECTION WORKING PARTY, Directive 95/46/EC refers to anonymisation in Recital 26 "signifies that to anonymise any data, the data must be stripped of sufficient elements such that the data subject can no longer be identified. More precisely, that data must be processed in such a way that it can no longer be used to identify a natural person by using “all the means likely reasonably to be used” by either the controller or a third party. An important factor is that the processing must be irreversible. The Directive does not clarify how such a de-identification process should or could be performed. The focus is on the outcome: that data should be such as not to allow the data subject to be identified via “all” “likely” and “reasonable” means. Reference is made to codes of conduct as a tool to set out possible anonymisation mechanisms as well as retention in a form in which identification of the data subject is “no longer possible”.^[5]