Can we remove a customer's personally identifiable information (PII) from emails, chat messages, and other customer interactions while retaining the value of the content for building models that improve the customer experience of the firm's services? In this blog, we describe how we redact PII and tokenize data to help discover, classify, and protect sensitive entities at JPMorgan Chase.

What is Personally Identifiable Information (PII)?

PII is any information that can be used to distinguish or trace an individual's identity such as name, social security number, date and place of birth, mother's maiden name, biometric records, and any other information that is linked or linkable to an individual, including medical, educational, financial, and employment information.

Protecting customers' PII is a fundamental legal, regulatory, and business requirement for the firm. While much PII sits in structured columns and can therefore be removed easily, sources such as customer call transcripts, emails, and messages are unstructured, and a customer may disclose PII such as addresses, names, and social security numbers at any point within a conversation. This creates a challenge for data use, as this information must be safeguarded before the data can be used.

How do you safeguard PII?

PII redaction methods help discover, classify, and protect sensitive entities. There are multiple techniques to safeguard PII within text data. In each case, PII content is identified and then either removed, replaced with a generic string, or masked. The following are the types of entities the PII redaction tool can detect and redact.

Numeric Entities

Numeric identifiers such as Social Security numbers, phone numbers, and codes can appear as digits, as number words, or as a combination of the two. All of these forms are removed. Sequences of digits are identified using pattern matching with regular expressions. Number words ('one', 'two', …) are also matched. Numeric content is either removed or replaced with a generic string.
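As a rough illustration of the pattern matching described above, the following Python sketch treats runs of digits and number words as numeric units and replaces each chain of units with a generic string. The placeholder text, the word list, and the function name are assumptions made for this example and are not taken from the redaction package itself.

```python
import re

# Hypothetical placeholder; the actual generic string is not specified here.
NUM_PLACEHOLDER = "[NUMBER]"

# Illustrative subset of number words, not the package's actual list.
NUMBER_WORDS = "zero|one|two|three|four|five|six|seven|eight|nine|ten"

# A numeric unit is either a run of digits or a whole number word.
_UNIT = rf"(?:\d+|\b(?:{NUMBER_WORDS})\b)"

# Units may be chained with spaces, hyphens, dots, or parentheses, so that
# "123-45-6789" or "five five five 1234" is treated as a single entity.
NUMERIC_RE = re.compile(rf"{_UNIT}(?:[\s\-.()]*{_UNIT})*", re.IGNORECASE)


def redact_numeric(text: str, placeholder: str = NUM_PLACEHOLDER) -> str:
    """Replace every chain of digits and number words with the placeholder."""
    return NUMERIC_RE.sub(placeholder, text)


print(redact_numeric("My SSN is 123-45-6789"))         # My SSN is [NUMBER]
print(redact_numeric("call five five five 1234 now"))  # call [NUMBER] now
```

A pattern like this also redacts harmless uses of number words ("one of them"), which trades some content for safety.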

Physical Addresses

The US Postal Service dataset of 20,000 city and state names across US states and territories is used to identify city and state names. Street names are matched from the words that precede a street-type word such as road or avenue. In tests on real physical addresses from https://openaddresses.io, physical address redaction was 99+% effective.
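The two ideas above, a lookup list of place names plus matching on the words before a street-type word, can be sketched in Python as follows. The tiny city and state sets stand in for the full dataset, and the street-type list, placeholder, and function names are assumptions for illustration only.

```python
import re

# Tiny illustrative stand-ins for the full city and state list (assumption).
CITIES = {"springfield", "columbus", "new york"}
STATES = {"ohio", "illinois"}

# Optional house number, one to three capitalised words, then a street-type
# word. Only the street-type part is matched case-insensitively.
STREET_RE = re.compile(
    r"\b(?:\d+\s+)?(?:[A-Z][a-z]+\s+){1,3}"
    r"(?i:street|st|road|rd|avenue|ave|boulevard|blvd|lane|drive)\b\.?"
)


def _gazetteer_regex(names):
    # Longest names first so that multi-word names are matched whole.
    ordered = sorted(names, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(n) for n in ordered) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


PLACE_RE = _gazetteer_regex(CITIES | STATES)


def redact_address(text: str, placeholder: str = "[ADDRESS]") -> str:
    """Redact street names first, then city and state names."""
    text = STREET_RE.sub(placeholder, text)
    return PLACE_RE.sub(placeholder, text)


print(redact_address("I live at 12 Oak Road, Springfield, Illinois."))
# I live at [ADDRESS], [ADDRESS], [ADDRESS].
```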

Email Addresses

The format of Internet email addresses is defined by a standard [IETF RFC 5322] published by the Internet Engineering Task Force (IETF). Unfortunately, the specification doesn't lend itself to simple pattern matching. The PII redaction package uses a domain database of about 8000 top level domains plus a regular expression pattern matcher.
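One way to combine a pattern matcher with a domain list is sketched below in Python. The regular expression is deliberately much simpler than the full RFC 5322 grammar, and the small `KNOWN_TLDS` set, the placeholder, and the function name are hypothetical stand-ins rather than the package's actual contents.

```python
import re

# Illustrative subset; a real deployment would load a full domain database.
KNOWN_TLDS = {"com", "org", "net", "edu", "gov", "io"}

# Simplified shape: local part, "@", dotted domain, final label captured.
EMAIL_RE = re.compile(
    r"[A-Za-z0-9._%+\-]+@(?:[A-Za-z0-9\-]+\.)+([A-Za-z]{2,})"
)


def redact_emails(text: str, placeholder: str = "[EMAIL]") -> str:
    """Redact candidates whose final domain label is a known top level domain."""

    def _replace(match):
        tld = match.group(1).lower()
        return placeholder if tld in KNOWN_TLDS else match.group(0)

    return EMAIL_RE.sub(_replace, text)


print(redact_emails("Contact mary.smith@example.com today"))
# Contact [EMAIL] today
```

Checking the final label against a domain list reduces false positives from strings that merely look like addresses, at the cost of missing addresses under domains absent from the list.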

Proper Names

Proper name recognition uses a Natural Language Processing (NLP) technique called Named-Entity Recognition (NER). NER identifies named-entities in sentences and classifies the entities by type. For example, if the text is:

Mary disputed a transaction with Online Shoppers that was reported in May 2020.

then the NER would label it as follows:

[Mary]Person disputed a transaction with [Online Shoppers]Organization that was reported in [May 2020]Time.

Existing NER tools use a combination of grammar-based and statistical models. Grammatical features use sentence structure to inform the categorization of proper nouns. The PII redaction package uses the Stanford Named Entity Recognizer, which uses a statistical modeling technique called Conditional Random Field (CRF). A CRF uses a graph model to take context into account when making predictions about a word in the text.
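Once a tagger has labeled each token, redaction itself is a simple pass over the labels. The Python sketch below does not call any NER library; it assumes the tagger output is already available as (token, label) pairs, and the label names and the sample output for the example sentence are hypothetical.

```python
from itertools import groupby

# Entity types to redact. The label names are an assumption about the
# tagger's output format, not something confirmed by this article.
REDACT_LABELS = {"PERSON", "ORGANIZATION"}


def redact_entities(tagged_tokens, redact_labels=REDACT_LABELS):
    """Collapse each run of same-label entity tokens into one placeholder."""
    out = []
    for label, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label in redact_labels:
            out.append(f"[{label}]")
        else:
            out.extend(token for token, _ in run)
    return " ".join(out)


# Hypothetical tagger output for the example sentence.
tagged = [
    ("Mary", "PERSON"), ("disputed", "O"), ("a", "O"), ("transaction", "O"),
    ("with", "O"), ("Online", "ORGANIZATION"), ("Shoppers", "ORGANIZATION"),
    ("that", "O"), ("was", "O"), ("reported", "O"), ("in", "O"),
    ("May", "DATE"), ("2020", "DATE"), (".", "O"),
]

redacted = redact_entities(tagged)
print(redacted)
# [PERSON] disputed a transaction with [ORGANIZATION] that was reported in May 2020 .
```

Grouping consecutive tokens keeps a multi-word name as a single placeholder, although two different names that are directly adjacent would also be merged into one.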

In testing, the NER-based redaction of proper nouns is 92% accurate. Improving NER models continues to be an active area of research.

Tokenization

Many machine learning models that deal with text data perform an additional step called tokenization. Here we propose a custom tokenization process that converts text to integer sequences in a more secure way:

  1. Text tokenization: turning the original text into a sequence of word tokens, for example, ['hello', 'world'].
  2. Hashing: applying a one-way hashing algorithm such as SHA-256 to each word (token) and then replacing the token with its hashed value. The algorithm is called 'one-way' because it is computationally infeasible to invert, which is why such algorithms are widely used in cryptographic applications. For practical purposes each hash is unique across the dictionary of words, and each time the same word is seen in the stream the same hash value is produced.
  3. Sequencing: mapping hash codes to sequential integers. This is achieved by replacing each hash code with an integer index, such as the position at which that hash first appears in the original text stream. This step saves storage and transmission space by replacing long strings with integers. Sequential integers are also required when constructing a large word embedding matrix for NLP models.
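The three steps above can be sketched in Python as follows. This is a minimal illustration rather than the production implementation: the word-splitting rule, the function names, and the choice to number hashes by order of first appearance (leaving 0 free) are all assumptions.

```python
import hashlib
import re


def tokenize(text):
    """Step 1: split lower-cased text into word tokens (simplified rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def hash_token(token):
    """Step 2: replace a token with its SHA-256 hex digest."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def to_sequence(hashes, index=None):
    """Step 3: map each hash to an integer, numbered by first appearance."""
    index = {} if index is None else index
    sequence = []
    for h in hashes:
        if h not in index:
            index[h] = len(index) + 1  # 0 is left free, e.g. for padding
        sequence.append(index[h])
    return sequence, index


tokens = tokenize("Hello world, hello again")
hashes = [hash_token(t) for t in tokens]
sequence, index = to_sequence(hashes)

print(tokens)    # ['hello', 'world', 'hello', 'again']
print(sequence)  # [1, 2, 1, 3]
```

The `index` dictionary plays the role of the dictionary mentioned in the text: the same integers can only be reproduced by someone who holds it.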

A common technique to make it more difficult to reverse engineer the original text from the tokens is to randomize the hashing function. A random seed value, or salt, can be added at the hashing step. The result is a stream of numbers that is close to impossible to reverse engineer without the original dictionary.
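A minimal sketch of salting in Python is shown below. It uses HMAC-SHA-256 from the standard library as one common way to key a hash with a secret value; the text only says that a seed or salt is added, so the choice of HMAC, the salt length, and the function names are assumptions.

```python
import hashlib
import hmac
import secrets


def make_salt(n_bytes=32):
    """Generate a random secret salt (the length is an arbitrary choice)."""
    return secrets.token_bytes(n_bytes)


def salted_hash(token, salt):
    """Hash a token keyed by the salt using HMAC-SHA-256."""
    return hmac.new(salt, token.encode("utf-8"), hashlib.sha256).hexdigest()


salt_a = make_salt()
salt_b = make_salt()

# The same word and salt always give the same hash, so the token stream
# stays consistent for model training.
print(salted_hash("hello", salt_a) == salted_hash("hello", salt_a))  # True

# A different salt gives an unrelated hash for the same word.
print(salted_hash("hello", salt_a) == salted_hash("hello", salt_b))  # False
```

The salt has to be stored securely and reused for every dataset that must share the same token mapping.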

The combination of PII redaction and tokenization provides a high level of resilience against data attacks such as the rainbow table attack or word frequency analysis. In a rainbow table attack, a table is precomputed over all possible input words so that a cryptographic hash function can be reversed by lookup. This technique is used, for example, for cracking password hashes. Such tables are usually used to recover passwords (or credit card numbers, etc.) up to a certain length and consisting of a limited set of characters. However, without access to the same dictionary, an attacker cannot reproduce the mapping from hashes to integers produced by the sequencing step. The salt provides an additional safeguard.
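The effect of a salt on a precomputed table can be illustrated with a small Python sketch. For simplicity it uses a plain lookup table instead of a true rainbow table (which stores hash chains to save space), and the vocabulary, salt value, and function names are made up for the example.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


# Attacker's precomputed table: hash -> word, built from a guessed vocabulary.
vocabulary = ["account", "dispute", "transaction", "mary"]
lookup = {sha256_hex(w.encode()): w for w in vocabulary}


def attack(hashes):
    """Return the recovered word for each hash, or None if not in the table."""
    return [lookup.get(h) for h in hashes]


message = ["mary", "dispute", "transaction"]

# Unsalted hashes of dictionary words are recovered by simple lookup.
unsalted = [sha256_hex(w.encode()) for w in message]
print(attack(unsalted))  # ['mary', 'dispute', 'transaction']

# Hypothetical salt; a real one would be random and kept secret.
salt = b"example-secret-salt"
salted = [sha256_hex(salt + w.encode()) for w in message]
print(attack(salted))    # [None, None, None]
```

After the sequencing step the attacker would not even see these hashes, only integers, which is the further protection the text describes.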

In word frequency analysis, suppose that the tokenized stream is intercepted. Could the original text or meaning be recovered using knowledge of word frequency in common text? This scenario is likewise unlikely, since PII is either already redacted or of very low frequency in the token stream.

Applying PII removal

Industry Solutions

Recently, one public cloud provider announced an extension of its speech transcription service in which PII is redacted from the transcript after transcription is performed, and another offers a data loss prevention service that runs in the cloud. A disadvantage of these approaches is that on-premises data needs to be moved to the cloud before it can be cleaned.

JPMC Solution

A typical use case is an on-premises data set that includes PII values in an unstructured text stream. Each block of text is redacted using the redaction package. Optionally, the redacted text can then be tokenized.

Conclusion

Using PII removal, PII can be redacted from a text corpus. Using tokenization, each string of text is replaced with a non-sensitive token. Together, PII removal and tokenization enable teams to make small and secure datasets available to ML teams.