Anonymized data – curse or blessing of data protection?! (May 2020)

The desire for the “free use of data” is widespread. Yet European data protection law addresses anonymization only insufficiently. This creates uncertainty as to whether and when anonymization techniques can be used effectively. Such uncertainty hinders the development of suitable anonymization techniques and the much-emphasized, necessary (European) progress in the field of AI and Big Data.

Both public and private institutions require data for purposes that go beyond the original collection purpose, or they use data independently of that purpose. Such data is needed for the development of AI and Big Data applications, for market research, statistical analysis and product innovation, as well as for Open Data initiatives. If personal data, once collected, is (further) used for a new purpose, a data protection justification is required again; in such cases, often only the consent of the data subject comes into question, if anything at all. For many of these purposes, however, data records without personal content are sufficient. This is where anonymized data comes into play. Yet considerable uncertainties remain on the way to (sufficiently) anonymous data and its legal treatment.

 

1. Legal requirements for anonymization

With its principle of data minimization, the General Data Protection Regulation (“GDPR”) expressly provides for the use of anonymized data in the above-mentioned cases. If no conclusions about identified or identifiable persons can be drawn from anonymized data, such data can be used “freely”.

 

1.1 When is data anonymous?

Unlike pseudonymization, “anonymization” is not specifically defined in Art. 4 GDPR. Recital 26 merely states that the principles of data protection should not apply to anonymous information, i.e. information which does not relate to an identified or identifiable natural person, or personal data rendered anonymous in such a way that the data subject is not or is no longer identifiable.

Anonymization requires more than pseudonymization, for which it is sufficient that the person can still be identified by adding separately stored additional information. Anonymized data no longer contains any personal reference.

Absolute anonymization exists where de-anonymization is no longer possible for anyone, regardless of any additional knowledge a third party may hold. Such absolute anonymization can be achieved with only very few of the anonymization techniques currently available. However, it is not required under data protection law. Instead, so-called “de facto anonymization” is sufficient, i.e. de-anonymization need not be 100% impossible. Pursuant to the standards of the Article 29 Working Party, it suffices that the residual risk of identifying a natural person is excluded as far as possible and that, under practical considerations, it is no longer possible (i) to single out a particular person from a data set, (ii) to link the records relating to a person with each other, or (iii) to infer information about a person from such a data set (cf. Article 29 Working Party, Working Paper 216).
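
To make the first of these criteria, singling out, more tangible, the following minimal Python sketch checks whether any record in a hypothetical table of quasi-identifiers is unique. The field names and values are invented for illustration only; the legal test is broader and also covers linkability and inference.

```python
# Minimal sketch: can a record still be "singled out" from a table of
# quasi-identifiers? (Illustrative only; WP 216 also covers linkability
# and inference.)
from collections import Counter

# Hypothetical sample records; field names and values are assumptions.
records = [
    {"zip": "10115", "age_range": "20-30", "gender": "f"},
    {"zip": "10115", "age_range": "20-30", "gender": "f"},
    {"zip": "10115", "age_range": "30-40", "gender": "m"},
    {"zip": "10245", "age_range": "20-30", "gender": "m"},
]

QUASI_IDENTIFIERS = ("zip", "age_range", "gender")

def smallest_group_size(rows, keys=QUASI_IDENTIFIERS):
    """Return k, the size of the smallest group of rows sharing the same
    quasi-identifier values. k == 1 means at least one record is unique
    and can therefore be singled out."""
    groups = Counter(tuple(row[k] for k in keys) for row in rows)
    return min(groups.values())

k = smallest_group_size(records)
print(f"k-anonymity of the sample: {k}")
if k == 1:
    print("At least one record is unique on its quasi-identifiers -> singling out possible.")
```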

 

1.2 Does anonymization require a legal basis?

The legal nature of anonymization is itself controversial. Some argue that anonymization is not “processing” within the meaning of the GDPR, as it is not covered by the GDPR’s protective purpose (i.e. protecting data subjects from losing control over their data). Moreover, since the GDPR does not apply to anonymous data and anonymization is even required by the GDPR, the argument goes, this process should not be made more difficult by the GDPR.

On the other hand, it is only through the process of anonymization that non-personal data is created. Consequently, the application of the GDPR cannot be ruled out until anonymous data is actually created.

 

1.3 Which legal bases are applicable?

Apart from consent, two legal bases come into consideration, depending on how the anonymization is set up. Firstly, if personal data is collected solely for the purpose of anonymization, the processing can be justified under Art. 6 (1)(f) GDPR (legitimate interests of the controller). Secondly, anonymization can be permissible under Art. 6 (4) GDPR (further processing) where it concerns personal data that already exists and was initially collected for other purposes. Any subsequent use of the data, once anonymized, does not require a legal basis at all, since the GDPR no longer applies.

Given that anonymization itself serves to protect data subjects, and in the interest of effective data protection, anonymization could also be privileged compared to other processing activities under the GDPR. Both the GDPR and anonymization are intended to protect data subjects. To implement and promote the principle of data minimization, the requirement of a legal basis could be waived for anonymization, or privileges could be granted with regard to the fulfilment of information obligations.

 

2. Practical requirements for anonymization

Uncertainties also exist with regard to the use of anonymization techniques. Friction arises, in particular, because a sufficient degree of anonymity must be achieved by generalizing, deleting, falsifying, adding to the existing information or synthesizing the data, while at the same time a necessary degree of (statistical) validity should be maintained, which the respective anonymization procedure can destroy.
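
As an illustration of this tension, the following Python sketch (with invented numbers) adds random noise of increasing strength to a list of ages: the stronger the noise, the better individual values are hidden, but the further the aggregate statistics drift away from the true values.

```python
# Minimal sketch of the trade-off described above: adding random noise
# (randomization / falsification) hides individual values but distorts
# the statistics more as the noise grows. All numbers are invented.
import random
import statistics

random.seed(42)

true_ages = [23, 31, 37, 44, 52, 58, 61, 67]

def add_noise(values, scale):
    """Add zero-mean Gaussian noise; a larger `scale` means stronger
    anonymization but also a larger distortion of the data."""
    return [v + random.gauss(0, scale) for v in values]

for scale in (1, 5, 20):
    noisy = add_noise(true_ages, scale)
    print(f"noise scale {scale:>2}: "
          f"mean {statistics.mean(noisy):5.1f} (true {statistics.mean(true_ages):.1f}), "
          f"max individual error {max(abs(a - b) for a, b in zip(true_ages, noisy)):.1f}")
```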

High demands must also be placed on the effectiveness of anonymization. Advancing technical development and the growing availability of additional data make de-anonymization ever easier to achieve. In many cases, what is actually achieved is therefore not anonymization but merely pseudonymization.

Ultimately, the assessment of anonymization techniques comes down to taking into account and, as far as possible, excluding the residual risk of identifying the data subject. Before an anonymization technique is used, this evaluation requires appropriate planning: the respective strengths and weaknesses of the available procedures must be examined, and the prerequisites and objectives of the chosen procedure determined. The most suitable solution should be selected case by case, taking into account the legal requirements for anonymous data described above.

Against this background, it is not sufficient to eliminate only the so-called “direct identifiers”, i.e. names, addresses, personal identification numbers, bank details or telephone numbers. Enough additional attributes remain that can at least indirectly identify a person, for example by linking several indirect pieces of information or other correlating knowledge. These attributes must therefore be further distorted so that the personal reference disappears as far as possible. Four overarching families of techniques are available, which can produce a sufficient degree of anonymity depending on the purpose and scope of the subsequent use of the data (further information can be found here); each is illustrated in the short sketch after the following list:

  • Randomization (random alteration of data): the attributes are changed according to predefined randomized patterns, for example (i) by replacing the values of an attribute, with a certain probability, by other possible values, (ii) by adding a random value to the existing values, or (iii) by multiplying the values by a random factor.
  • Generalization (in particular aggregation): exact values are replaced by less precise ones, e.g. by combining data into ranges (age 25 becomes age range 20-30). It is then no longer possible to determine from the resulting group which exact value a person has within that range, which makes re-identification by singling out considerably more difficult. In the extreme case, individuals can no longer be singled out at all, only groups of records with identical attributes.
  • Permutation (random swapping of data): the values of an attribute are swapped between records, thus removing the direct link between the data and the person concerned.
  • Data synthesis (creation of completely new, synthetic data): the real data is discarded entirely and replaced by newly generated values. The statistical distributions from which the new data are drawn are estimated from the real data, usually by machine learning, so that the synthetic data is statistically as similar as possible to the real data. Both the individual values of all attributes and the relationships between attributes are generated artificially.
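
The following Python sketch illustrates, in deliberately simplified form, what each of these four technique families does to a toy data set. All names and values are invented, and the one-line implementations are only meant to convey the idea; a real anonymization project additionally requires the risk assessment described above.

```python
# Illustrative, simplified sketches of the four technique families, applied
# to a toy list of ages and salaries. All values are invented.
import random
import statistics

random.seed(0)
ages     = [25, 31, 37, 44, 52]
salaries = [42_000, 51_000, 48_000, 60_000, 55_000]

# 1. Randomization: perturb each value, e.g. by adding zero-mean noise.
noisy_salaries = [s + random.gauss(0, 2_000) for s in salaries]

# 2. Generalization: replace exact values by coarser ranges (age 25 -> "20-30").
def generalize_age(age, width=10):
    lower = (age // width) * width
    return f"{lower}-{lower + width}"

age_ranges = [generalize_age(a) for a in ages]

# 3. Permutation: shuffle one attribute so the values stay real but lose
#    their link to the individual record.
permuted_salaries = salaries[:]
random.shuffle(permuted_salaries)

# 4. Data synthesis: discard the real values and draw new ones from a
#    distribution estimated on the real data (here: a simple normal fit).
mu, sigma = statistics.mean(salaries), statistics.stdev(salaries)
synthetic_salaries = [round(random.gauss(mu, sigma)) for _ in salaries]

print("randomized: ", [round(s) for s in noisy_salaries])
print("generalized:", age_ranges)
print("permuted:   ", permuted_salaries)
print("synthetic:  ", synthetic_salaries)
```

In practice, these building blocks are combined and parameterized (e.g. noise scale, range width, minimum group size) until the residual risk of re-identification is, in the sense described above, excluded as far as possible.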

 

Outlook

There is still no uniform approach to the creation and use of anonymized data. This has negative effects on the German and European economy in the area of digitization: American and Chinese companies are already gaining ground here, although there is no lack of promising European solutions. It is therefore up to the legislator to ensure that these solutions can be applied in a legally secure manner. Much light still needs to be shed on the anonymization that data protection law itself calls for. This has prompted the German Federal Commissioner for Data Protection and Freedom of Information to launch his first consultation procedure on the subject of anonymization (see here). After evaluation of the numerous statements received from academia and business, this could prove a first step towards greater legal certainty.