Models for Extracting Process Mining Event Logs With Privacy in Mind

With privacy regulations and compliance requirements growing stricter worldwide, process mining must be applied with due regard for privacy rights.

Only then will it deliver a net positive impact on organisational goals.

Specifically, extracting event logs from business systems is a critical step in process mining and process discovery. Event logs provide tangible evidence for making operational decisions rather than basing these decisions on assumptions.

However, event logs also contain sensitive information such as names, designations, timestamps, location data, and activity details that can be used to identify individuals. How are experts tackling this challenge?

Pseudonymisation: A Conventional Approach to Data Privacy

The need to balance privacy against deep process insights has given rise to numerous techniques. One of them is pseudonymisation: replacing personally identifiable information fields within a dataset with artificial identifiers, or pseudonyms.

Simply put, the pseudonymisation technique gives data managers the flexibility to use pseudonyms wherever possible. That way, they mask personally identifiable information and make it difficult to correlate it with the actual entities.

For instance, if an organisation does not want process analysts to know the name and contact of the employees working on a process, these fields can be replaced with alpha-numeric strings, restricting information access to authorised users only.
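As a rough illustration of that replacement, the sketch below swaps name and contact fields for random alphanumeric strings and keeps the mapping in a separate lookup table. The field names (`employee_name`, `contact`), the sample events, and the `pseudonymise` helper are assumptions made for this example, not part of any particular process mining tool.

```python
import secrets

def pseudonymise(events, fields=("employee_name", "contact")):
    """Replace each distinct value in `fields` with a random alphanumeric pseudonym."""
    lookup = {}   # (field, real value) -> pseudonym; store this apart from the log
    output = []
    for event in events:
        masked = dict(event)              # copy, so the original log is untouched
        for field in fields:
            if field in masked:
                key = (field, masked[field])
                if key not in lookup:
                    lookup[key] = secrets.token_hex(4).upper()
                masked[field] = lookup[key]
        output.append(masked)
    return output, lookup

# Hypothetical event log rows
events = [
    {"case_id": "C1", "activity": "Approve invoice",
     "employee_name": "Asha Rao", "contact": "asha@example.com"},
    {"case_id": "C2", "activity": "Approve invoice",
     "employee_name": "Asha Rao", "contact": "asha@example.com"},
    {"case_id": "C2", "activity": "Pay invoice",
     "employee_name": "Ben Cole", "contact": "ben@example.com"},
]

pseudonymised_events, lookup_table = pseudonymise(events)
```

Because the same real value always maps to the same pseudonym, analysts can still see that one resource handled several cases, while only holders of `lookup_table` can map a pseudonym back to a person.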

4 Applications of the Pseudonymisation Technique

Pseudonymisation as a battle-tested data privacy protection technique is usually applied in four ways.

Suppression

Suppression removes certain data elements when a rare combination of values puts individuals at risk of re-identification.

Source: Anonymization Methods
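One way to picture suppression is the sketch below, written for this article rather than taken from the source above. It blanks out the chosen attributes wherever their combination appears fewer than `min_count` times. The threshold, the `"*"` placeholder, and the `role` and `site` fields are illustrative assumptions.

```python
from collections import Counter

def suppress_rare(records, keys, min_count=2):
    """Replace `keys` with '*' in records whose combination of values is rare."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    result = []
    for r in records:
        if combos[tuple(r[k] for k in keys)] < min_count:
            result.append({**r, **{k: "*" for k in keys}})  # suppressed copy
        else:
            result.append(dict(r))                          # unchanged copy
    return result

# Hypothetical records: the single Director in Oslo is easy to single out
records = [
    {"case_id": "C1", "role": "Clerk", "site": "Pune"},
    {"case_id": "C2", "role": "Clerk", "site": "Pune"},
    {"case_id": "C3", "role": "Director", "site": "Oslo"},
]

suppressed = suppress_rare(records, keys=("role", "site"))
```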

Swapping

Swapping exchanges the values of an attribute between records, so that a value no longer sits alongside the individual it belongs to.

Source: Researchgate
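A minimal sketch of swapping, written for this article and assuming a simple random shuffle of one column, might look as follows. The `swap_attribute` helper, the `seed` parameter, and the `salary_band` field are hypothetical.

```python
import random

def swap_attribute(records, field, seed=None):
    """Shuffle the values of `field` across records, leaving other fields in place."""
    rng = random.Random(seed)
    values = [r[field] for r in records]
    rng.shuffle(values)
    return [{**r, field: v} for r, v in zip(records, values)]

# Hypothetical records
staff = [
    {"case_id": "C1", "salary_band": "A"},
    {"case_id": "C2", "salary_band": "B"},
    {"case_id": "C3", "salary_band": "C"},
    {"case_id": "C4", "salary_band": "D"},
]

swapped = swap_attribute(staff, "salary_band", seed=11)
```

The overall distribution of the attribute stays the same, so aggregate statistics on that column survive, but the link between a value and a specific record is broken.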

Masking

Masking is implemented by converting the personally identifiable information into a hash string.

Source: SAS® Help Center
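As a hedged sketch of hash-based masking, written for this article rather than taken from the source above, the snippet below uses a keyed hash (HMAC-SHA256) from the Python standard library. The key value and the `mask` helper are assumptions for illustration. A keyed hash is used here because plain, unkeyed hashes of names can often be reversed by hashing a list of likely names and comparing.

```python
import hashlib
import hmac

def mask(value, key):
    """Return a keyed SHA-256 digest of `value` as a hexadecimal string."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

# Hypothetical secret; in practice it would be kept outside the event log
SECRET_KEY = b"demo-key-change-me"

masked_name = mask("Asha Rao", SECRET_KEY)
```

The same input and key always produce the same string, so masked values can still be grouped and counted during analysis.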

Generalisation

Generalisation replaces a specific value with a broader range. For instance, an age of 42 can be replaced with the range 40-50.

Source: Sciencedirect
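The age example can be sketched in a few lines, written for this article rather than taken from the source above. The `generalise_age` helper and the bucket width of 10 are assumptions chosen to match the 40-50 range mentioned in the text.

```python
def generalise_age(age, width=10):
    """Map an exact age to the range that contains it, e.g. 42 -> '40-50'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width}"

age_range = generalise_age(42)
```

A wider bucket hides more individuals in each range but leaves less detail for analysis.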

Even though pseudonymisation as a technique has been around for some time, modern data scientists are divided on its sustained relevance in defending privacy. The concern is the fundamentally ad hoc nature of pseudonymisation methods and the lack of any way to rank them by the protection they offer.

While procedures like Suppression, Swapping, Masking, and Generalisation do get the job done, they offer no way to gauge the level of privacy their application ensures for process mining event logs.

Privacy Models: Guaranteeing Confidentiality by Design

Privacy models should maintain the integrity of personally identifiable information and the confidentiality of data by design.

Besides strong protection, approaching data security through Privacy Models adds the advantage of mathematically proving the degree of privacy, which aligns them with today's stringent regulatory and compliance requirements.

Some of the Privacy Models that have recently been gaining prominence in the field of process mining include:

Differential Privacy Model

The Differential Privacy Model obscures information about individual entities, so that no one can confidently conclude from the output whether a given entity's information was included in the original dataset.

It dwells on the intersection of general and private information, mathematically ensuring that anyone seeing the analysis will reach essentially the same inference about an individual's personal information, irrespective of whether that individual is included in the analysis.

Thus, with the Differential Privacy Model, event logs provide enough data for the process mining engine to extract the necessary insights. At the same time, the confidentiality of personally identifiable information remains intact.
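One common way to achieve this is the Laplace mechanism, sketched below for a single count such as the number of cases that contain a given activity. The sketch assumes each individual affects the count by at most one (`sensitivity=1.0`). The `private_count` helper, the example figures, and the chosen `epsilon` are illustrative assumptions, not settings from any specific process mining platform.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(true_count, epsilon, sensitivity=1.0, rng=None):
    """Return the count plus Laplace noise with scale sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    return true_count + laplace_noise(sensitivity / epsilon, rng)

# Hypothetical query: 128 cases include the activity "Manual rework"
noisy_count = private_count(128, epsilon=0.5, rng=random.Random(7))
```

A smaller `epsilon` adds more noise and gives stronger privacy. A larger one keeps the count closer to the true value, so the parameter acts as the measurable privacy setting that the conventional techniques lack.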

Secure Multi-Party Computation (SMPC)

The SMPC model distributes data processing across multiple parties, none of which has visibility into the data being processed by the others. This model meets stringent compliance, security, and privacy needs for processing data, without exposing or moving it.

The SMPC model is ideal for mining processes in multi-organisational contexts, where actionable process insights can be derived from the private event logs of the entities without anyone actually sharing their logs.
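A simple building block behind many SMPC protocols is additive secret sharing. The sketch below simulates, in a single process, three organisations computing the total number of cases across their logs without any of them revealing its own count. The modulus, the function names, and the case counts are assumptions for illustration. A real deployment would run each party on separate infrastructure over secure channels.

```python
import secrets

MODULUS = 2**61 - 1  # all arithmetic is done modulo this prime

def make_shares(value, n_parties):
    """Split `value` into n random-looking shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

def secure_sum(private_values):
    """Simulate a secure sum: party i sends its j-th share to party j."""
    n = len(private_values)
    all_shares = [make_shares(v, n) for v in private_values]
    # Each party j adds up the shares it received and publishes only that sum.
    partial_sums = [sum(all_shares[i][j] for i in range(n)) % MODULUS
                    for j in range(n)]
    return sum(partial_sums) % MODULUS

# Hypothetical case counts held privately by three organisations
total_cases = secure_sum([120, 75, 310])
```

Each share on its own looks like a random number, so a party learns nothing about another party's count from the share it receives. Only the combined total is revealed.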

k-Anonymity

Inspired by the data security and privacy concept introduced in 1998, the k-Anonymity Privacy Model is based on the idea that information about a particular entity can be reliably obscured by making it indistinguishable from that of at least k-1 other entities with the same attributes.

Often referred to as the power of ‘hiding in the crowd’, k-Anonymity involves pooling the data of individuals in a large group, which makes it challenging to correlate the information with specific individuals, masking their identities.
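To make the 'crowd' measurable, the sketch below computes the k that a dataset currently satisfies: the size of the smallest group of records sharing the same values for the chosen identifying attributes. The `k_anonymity` helper and the `age_range` and `department` fields are hypothetical names chosen for this example.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0

# Hypothetical, already generalised records
people = [
    {"age_range": "40-50", "department": "Finance"},
    {"age_range": "40-50", "department": "Finance"},
    {"age_range": "30-40", "department": "Sales"},
    {"age_range": "30-40", "department": "Sales"},
]

k = k_anonymity(people, ("age_range", "department"))
```

If the result falls below the target k, further generalisation or suppression of the identifying attributes can raise it.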

The Way Forward

Whether process data managers choose to continue with the conventional techniques or embrace the more advanced privacy models, focusing on maintaining the confidentiality of entities will be a persisting reality in process mining.

There are different ways to handle privacy when extracting process mining event logs, and you might need to mix and match multiple methods. Success will hinge on formulating a balanced approach that delivers a measurable guarantee of privacy without preventing process mining platforms from operating at their peak.
