Using Machine Learning for Detecting Personally Identifiable Information

Safeguarding personal data is a critical component of corporate governance. Data breaches can result in hefty fines and significant reputational damage. Personal data encompasses a wide range of information, including names, addresses, phone numbers, locations, bank details, and National Insurance numbers.

While structured databases make it relatively straightforward to identify personally identifiable information (PII), a substantial amount of PII also resides in unstructured “free text” data, such as comment fields, emails, reports, and customer service call transcriptions. Companies need effective methods to detect PII within unstructured data to ensure compliance with data protection regulations such as the GDPR. PII detection also enables redaction or anonymisation of sensitive information.

The AI community has focused considerable research on machine learning techniques like Named Entity Recognition (NER) and Part-of-Speech (POS) tagging, which have applications in PII detection. Existing NER research typically focuses on identifying people, locations, and organizations, as other PII elements like phone numbers and ID numbers can often be found through more mechanical search approaches. However, there are nuanced edge cases, such as distinguishing between a person’s name and a location with the same name.

Infotel UK Consulting, in partnership with the National Innovation Centre for Data, examined three approaches to leveraging machine learning for PII detection, each with its own advantages and drawbacks: pre-trained language models, fine-tuning existing language models, and generative AI. Given companies’ cautious approach to sensitive data, the team prioritized solutions that can operate on-premises without relying on external third-party services, which limits the achievable performance compared to the latest large language models (LLMs).

The first use case, detecting whether a piece of text contains any personally identifiable information (PII), is a binary classification problem – the model must determine if the input text contains PII or not. In contrast, the more complex use cases of redaction and anonymisation require the machine learning model to operate at the phrase, word, or sub-word level. A single input text may contain various types of PII, such as names, addresses, and locations, all of which need to be accurately identified within the text.
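To make the distinction concrete, the sketch below (plain Python, with invented example text, labels and character offsets purely for illustration) contrasts what the two task formulations return for the same input.

```python
# Two ways of framing PII detection on the same piece of free text.
text = "Please contact Sarah Jones at 191 555 0123 about the Newcastle office."

# Use case 1: binary classification - does the text contain any PII at all?
contains_pii = True  # a classifier produces a single yes/no label for the whole text

# Use case 2: token/span-level detection - which spans are PII, and of what type?
pii_spans = [
    {"text": "Sarah Jones",  "label": "PERSON",       "start": 15, "end": 26},
    {"text": "191 555 0123", "label": "PHONE_NUMBER", "start": 30, "end": 42},
    {"text": "Newcastle",    "label": "LOCATION",     "start": 53, "end": 62},
]

# Redaction then becomes a simple substitution over the detected spans.
for span in sorted(pii_spans, key=lambda s: s["start"], reverse=True):
    text = text[:span["start"]] + f"[{span['label']}]" + text[span["end"]:]
print(text)  # Please contact [PERSON] at [PHONE_NUMBER] about the [LOCATION] office.
```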

The initial processing step is to tokenize the input text. While tokens often correspond to individual words, the tokenizer’s finite vocabulary means some words may be represented by multiple tokens. This is particularly relevant for PII that spans multiple words, such as “Newcastle upon Tyne” representing a single location.
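As an illustration, the snippet below shows this splitting in practice. It assumes the Hugging Face transformers library and uses the publicly available bert-base-cased tokenizer purely as an example; any sub-word tokenizer behaves similarly.

```python
from transformers import AutoTokenizer

# Load a sub-word (WordPiece) tokenizer; bert-base-cased is just one example choice.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

text = "Sarah moved to Newcastle upon Tyne last year."
tokens = tokenizer.tokenize(text)
print(tokens)
# Less common words may be broken into several sub-word pieces, and in any case
# the single location "Newcastle upon Tyne" spans multiple tokens that must all
# be labelled as part of the same entity.
```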

There are different standards for labeling tokens, such as BIO (beginning, inside, outside) and BILOU (beginning, inside, last, outside, unit). Understanding the specific tokenization and labeling scheme used by a given model is critical when using or fine-tuning it, as the dataset’s labels must match the model’s expected format.
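A minimal sketch of word-level BIO labelling, and of propagating those labels onto sub-word tokens, is shown below. The word-to-token split is hard-coded for illustration rather than produced by a real tokenizer, and some pipelines instead mask the trailing pieces of a word rather than labelling them.

```python
# Word-level BIO labels for "Sarah lives in Newcastle upon Tyne".
words = ["Sarah", "lives", "in", "Newcastle", "upon", "Tyne"]
labels = ["B-PER", "O", "O", "B-LOC", "I-LOC", "I-LOC"]

# A hard-coded illustration of how a sub-word tokenizer might split each word.
word_to_tokens = {
    "Sarah": ["Sarah"], "lives": ["lives"], "in": ["in"],
    "Newcastle": ["New", "##castle"], "upon": ["upon"], "Tyne": ["Ty", "##ne"],
}

def align_labels(words, labels, word_to_tokens):
    """Copy each word's BIO label onto its sub-word tokens, turning the extra
    pieces of a B- word into I- so the scheme stays consistent."""
    token_labels = []
    for word, label in zip(words, labels):
        pieces = word_to_tokens[word]
        continuation = label.replace("B-", "I-")
        token_labels.append(label)
        token_labels.extend([continuation] * (len(pieces) - 1))
    return token_labels

print(align_labels(words, labels, word_to_tokens))
# ['B-PER', 'O', 'O', 'B-LOC', 'I-LOC', 'I-LOC', 'I-LOC', 'I-LOC']
```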

In recent years, the Transformer architecture has dominated Natural Language Processing (NLP), and all the models in this study are based on Transformer models (Vaswani et al., 2017).

 

Caption: Our findings from comparing pre-trained language models, fine-tuned language models, and generative AI for detecting PII in data.

Discussion

Pre-Trained Language Models

Pre-trained models are the easiest to work with, as they are ready to use out of the box. However, finding a suitable pre-trained model that has been trained for the specific task required can be time-consuming. Some large language models (LLMs) are trained on a single language, typically English or Chinese, while others are multilingual. Additionally, we need a labeled dataset to test the model and evaluate its performance.
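As an example, an off-the-shelf NER model can be applied in a few lines with the Hugging Face transformers pipeline API. The dslim/bert-base-NER checkpoint named below is just one publicly available English model trained to tag people, locations and organisations; it would need to be downloaded and cached locally for a purely on-premises deployment.

```python
from transformers import pipeline

# Load a pre-trained NER model as-is; no training of our own is required.
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

text = "Sarah Jones works for Infotel in Newcastle upon Tyne."
for entity in ner(text):
    # Each result contains the entity type, the matched text and a confidence score.
    print(entity["entity_group"], entity["word"], float(entity["score"]))
```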

Fine-Tuning Language Models

Ideally, companies could invest in research and development to train their own LLMs for their specific needs. Realistically, however, this approach may not be commercially feasible.

The fine-tuning process starts with a pre-trained model. The final layer (the head) of the model is replaced with a new, fully connected layer with randomly initialized weights. During fine-tuning, the weights of the pre-trained model are frozen, and only the weights in the new layer are trained.
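A minimal sketch of this set-up with the Hugging Face transformers library is shown below; the base checkpoint and the label set are illustrative assumptions, and the data preparation and training loop that would follow are omitted.

```python
from transformers import AutoModelForTokenClassification

# BIO labels for the entity types the new head should predict (illustrative).
label_list = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

# Passing num_labels replaces the original head with a new, randomly
# initialised classification layer of the right size.
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",
    num_labels=len(label_list),
)

# Freeze the pre-trained encoder so only the new head is updated during training.
for param in model.base_model.parameters():
    param.requires_grad = False

print([name for name, p in model.named_parameters() if p.requires_grad])
# Only the classifier head's weight and bias remain trainable.
```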

Generative AI

The machine learning models discussed so far have been discriminative, meaning they were trained to find the best decision boundaries for classifying the words in the input text as a Person, Location, or Organization. Once trained, these models are deterministic, producing the same output for a given input.

Generative models, such as Large Language Models (LLMs), work with probabilities. They learn a probability distribution during training and can produce non-deterministic output, where a given input can generate different outputs at different times. This is one reason why generative models are prone to “hallucination” and can generate inaccurate information.

Prompt engineering can be used to reduce hallucination and steer the LLM to produce an output that is more relevant to our requirements. The prompt can include some example inputs and outputs to guide the LLM (few-shot learning) or no examples (zero-shot).
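As an illustration, a few-shot prompt for PII extraction could be assembled as below. The example texts, the JSON output format and the commented-out generate call are all placeholders for whatever locally hosted LLM is actually used.

```python
# Hand-picked demonstrations that show the model the exact output format we want.
EXAMPLES = [
    ("John Smith rang from 0191 555 0123.",
     '{"PERSON": ["John Smith"], "PHONE_NUMBER": ["0191 555 0123"], "LOCATION": []}'),
    ("The meeting will be held in the Leeds office.",
     '{"PERSON": [], "PHONE_NUMBER": [], "LOCATION": ["Leeds"]}'),
]

def build_prompt(text: str) -> str:
    """Build a few-shot prompt: an instruction, worked examples, then the new input."""
    parts = ["Extract all personally identifiable information from the text. "
             "Answer with JSON only."]
    for example_text, example_answer in EXAMPLES:
        parts.append(f"Text: {example_text}\nAnswer: {example_answer}")
    parts.append(f"Text: {text}\nAnswer:")
    return "\n\n".join(parts)

prompt = build_prompt("Please contact Sarah Jones about the Newcastle office.")
# response = local_llm.generate(prompt)  # hypothetical call to an on-premises LLM
print(prompt)
```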

Retrieval-Augmented Generation (RAG) can be used to select the example inputs and outputs for a prompt. This process involves:

  • Generating an “embedding” – a numerical vector that captures the contextual meaning of the input text.
  • Identifying the closest embeddings from a pre-processed dataset of examples.
  • Incorporating these retrieved examples into the prompt.
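A minimal sketch of this retrieval step is shown below, assuming the sentence-transformers library for the embeddings. The example pool and the all-MiniLM-L6-v2 embedding model are illustrative choices; in practice the pool would be a labelled dataset embedded once in advance.

```python
from sentence_transformers import SentenceTransformer, util

# A small pool of pre-labelled example texts; normally this would be embedded offline.
example_pool = [
    "John Smith rang from 0191 555 0123.",
    "The meeting will be held in the Leeds office.",
    "Invoice 4471 was paid on 3 March.",
]

# 1. Generate embeddings for the pool and for the new input text.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
pool_embeddings = embedder.encode(example_pool, convert_to_tensor=True)
query = "Please contact Sarah Jones about the Newcastle office."
query_embedding = embedder.encode(query, convert_to_tensor=True)

# 2. Identify the examples whose embeddings are closest to the input's.
scores = util.cos_sim(query_embedding, pool_embeddings)[0]
top_indices = scores.argsort(descending=True)[:2]

# 3. These retrieved examples (together with their labels) go into the prompt.
retrieved = [example_pool[int(i)] for i in top_indices]
print(retrieved)
```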

Infotel UK Consulting’s Little RAGNER Whitepaper

As we’ve explored, large language models and natural language processing can be incredibly powerful tools when it comes to detecting and extracting personally identifiable information (PII) from large datasets. However, the sheer scale and complexity of modern data sources present challenges that require equally modern solutions.

Our debut whitepaper, “Little RAGNER: Toward Lightweight, Generative, Named Entity Recognition Through Prompt Engineering and Multi-Level Retrieval Augmented Generation”, collates our research and delves into the development of a new approach to named entity recognition and retrieval-augmented generation that aims to be more lightweight and efficient than traditional methods.

By leveraging prompt engineering and multi-level retrieval-augmented generation, the Little RAGNER system can identify and extract PII accurately, even in unstructured, high-volume data. The key lies in its ability to seamlessly integrate large language models with targeted prompts that guide the model towards the most relevant information. This hybrid approach allows for fast, accurate entity recognition without the need for resource-intensive fine-tuning or complex architectural changes.

The multilevel retrieval component enhances the model’s understanding of context and relationships, which allows it to better distinguish between true PII and incidental mentions.

The implications of this research are significant, particularly as businesses face mounting pressure to safeguard sensitive data. The research conducted within our Little RAGNER project represents a major step forward in making PII detection more scalable, efficient, and accessible to a wide range of applications and industries.

By striking a balance between the power of large language models and the precision of tailored prompts and retrieval techniques, this innovative system holds the potential to transform how we approach data privacy and compliance in the years to come.

    Download our whitepaper: “Little RAGNER: Toward Lightweight, Generative, Named Entity Recognition Through Prompt Engineering and Multi-Level Retrieval Augmented Generation”