Introducing easier de-identification of Cloud Storage data

By mullaned2002

August 8, 2022


De-identification of Cloud Storage just got easier

Many organizations require effective processes and techniques for removing or obfuscating certain sensitive information in the data they store. An important tool for achieving this goal is de-identification, which NIST defines as a technique that “removes identifying information from a dataset so that individual data cannot be linked with specific individuals. De-identification can reduce the privacy risk associated with collecting, processing, archiving, distributing or publishing information.”

We are always striving to make data security easier, and today we are happy to announce the availability of a de-identification action for our Cloud Storage inspection jobs. You can now de-identify Cloud Storage objects, folders, and buckets without running your own pipeline or custom code. We have also enhanced our transforms with a new dictionary replacement method that can help you achieve stronger privacy protection, especially with unstructured data such as customer support chat logs.

The “De-identify findings” Action

The “de-identify findings” action for Cloud DLP inspection jobs is a fully managed feature that creates a de-identified copy of the data objects that are inspected. This means that you can inspect a Cloud Storage bucket for sensitive data such as personally identifiable information (PII) and then create a redacted copy of those objects, all with a few clicks in the Console UI. There is no need to write custom code or manage complex pipelines, and because the feature is fully managed, it auto-scales without you needing to manage quota.

This new action supports the following data types:

Text files

Comma- or tab-separated values

Images (see regional limitations)

Once enabled, the DLP job inspects the data and writes a de-identified copy of every supported file to the output bucket or folder.
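For readers who prefer the API to the Console, the following is a rough sketch of what an inspection job with a de-identify action might look like as a Python dictionary. The field names are assumptions modeled on the DLP v2 API and should be verified against the API reference before use. The bucket paths, template name, and the helper `build_inspect_job` are hypothetical placeholders, and the sketch only builds the job body without calling any service.

```python
def build_inspect_job(input_url: str, output_url: str, deidentify_template: str) -> dict:
    """Return an inspection job body that also requests de-identified copies.

    Field names are assumptions modeled on the DLP v2 API; verify them
    against the official reference before sending a real request.
    """
    return {
        # Which Cloud Storage objects to inspect.
        "storage_config": {
            "cloud_storage_options": {"file_set": {"url": input_url}}
        },
        # Which kinds of sensitive data to look for.
        "inspect_config": {
            "info_types": [{"name": "PERSON_NAME"}, {"name": "EMAIL_ADDRESS"}]
        },
        # The de-identify action: where to write copies and how to transform them.
        "actions": [
            {
                "deidentify": {
                    "cloud_storage_output": output_url,
                    "transformation_config": {
                        "deidentify_template": deidentify_template
                    },
                    "file_types_to_transform": ["TEXT_FILE", "CSV"],
                }
            }
        ],
    }


if __name__ == "__main__":
    job = build_inspect_job(
        "gs://example-input-bucket/**",   # hypothetical input location
        "gs://example-output-bucket/",    # hypothetical output location
        "projects/example-project/deidentifyTemplates/example-template",
    )
    print(job["actions"][0]["deidentify"]["cloud_storage_output"])
```

Keeping the output location separate from the input location matches the idea of writing a de-identified copy rather than modifying the inspected objects in place.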

You can also use the new de-identify action on Job Triggers, which run on a recurring schedule, to automatically de-identify new content as it appears. This is useful for creating a workflow with a safe drop zone for incoming files that need to be de-identified before being made accessible.

What can automatic De-identification do?

Cloud DLP provides a set of transformation techniques that de-identify sensitive data while aiming to keep the data useful for your business. These techniques include:

Redaction: Deletes all or part of a detected sensitive value.

Replacement: Replaces a detected sensitive value with a specified surrogate value.

Masking: Replaces a number of characters of a sensitive value with a specified surrogate character, such as a hash (#) or asterisk (*).

Crypto-based tokenization: Encrypts the original sensitive data value using a cryptographic key. Cloud DLP supports several types of tokenization, including transformations that can be reversed, or “re-identified.”

Bucketing: “Generalizes” a sensitive value by replacing it with a range of values. (For example, replacing a specific age with an age range, or temperatures with ranges corresponding to “Hot,” “Medium,” and “Cold.”)

Date shifting: Shifts sensitive date values by a random amount of time.

Time extraction: Extracts or preserves specified portions of date and time values.
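To make a few of the techniques listed above concrete, here is a minimal sketch in plain Python. It does not use Cloud DLP and is not how the service implements them. The function names, the default masking character, and the 10-year bucket width are all illustrative assumptions.

```python
from datetime import date, timedelta
import random


def mask(value: str, masking_char: str = "#", keep_last: int = 0) -> str:
    """Masking: replace characters with a surrogate, optionally keeping a suffix."""
    n_masked = max(len(value) - keep_last, 0)
    return masking_char * n_masked + value[n_masked:]


def bucket_age(age: int, width: int = 10) -> str:
    """Bucketing: generalize an exact age into a fixed-width range."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def shift_date(d: date, max_days: int, rng: random.Random) -> date:
    """Date shifting: move a date by a random number of days in [-max_days, max_days]."""
    return d + timedelta(days=rng.randint(-max_days, max_days))


def extract_year(d: date) -> int:
    """Time extraction: keep only the year portion of a date."""
    return d.year


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only so repeated runs behave the same
    print(mask("4111111111111111", keep_last=4))
    print(bucket_age(37))
    print(shift_date(date(2022, 8, 8), 30, rng))
    print(extract_year(date(2022, 8, 8)))
```

Each function trades precision for privacy in a different way: masking hides characters, bucketing and time extraction coarsen a value, and date shifting perturbs it while keeping it a valid date.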

New Dictionary Replace method

When a sensitive data element is found, dictionary replacement replaces it with a randomly selected value from a list of words that you provide. This transformation method is especially useful if you want the redacted output to have more realistic surrogate values.

Consider the following example: You collect customer support chat logs as part of providing service to your customers. These support chat logs contain various types of personally identifiable information (PII), including people’s names and email addresses. Cloud DLP can find and de-identify the sensitive elements with static replacements such as “[REDACTED]” to help prevent someone from seeing this sensitive data.

With the new dictionary replacement method you can instead replace these findings with a randomly selected value from a dictionary. This dictionary replacement provides two key benefits over static replacement:

The resulting output can look more realistic

Because the output looks more realistic, it can help conceal any residual names (a privacy de-identification technique sometimes referred to as “hiding in plain sight”)

An example of this:

Input:

[Agent] Hi, my name is Jason, can I have your name?

[Customer] My name is Valeria

[Agent] In case we need to contact you, what is your email address?

[Customer] My email is [email protected]

[Agent] Thank you. How can I help you?

De-identified Output:

[Agent] Hi, my name is Gavaia, can I have your name?

[Customer] My name is Bijal

[Agent] In case we need to contact you, what is your email address?

[Customer] My email is [email protected]

[Agent] Thank you. How can I help you?

As you can see in the output, the names and email addresses have been replaced with random values that both protect the original sensitive information and make the output look more realistic. This can make the data more useful and help “hide” any residual PII.
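The idea behind dictionary replacement can be sketched in a few lines of plain Python. This is an illustration only, not the Cloud DLP implementation: detection is faked with a fixed list of known names and a simple email pattern, and the word lists, constants, and the function `dictionary_replace` are hypothetical.

```python
import random
import re

# Hypothetical stand-in for detection: a fixed list of known names and a
# simple email pattern. A real inspection would use infoType detectors.
KNOWN_NAMES = ["Jason", "Valeria"]
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# Word lists you would supply for the replacement (illustrative values only).
NAME_DICTIONARY = ["Gavaia", "Bijal", "Izumi", "Tomas"]
EMAIL_DICTIONARY = ["user1@example.com", "user2@example.com"]


def dictionary_replace(text: str, rng: random.Random) -> str:
    """Replace each detected name or email with a random word-list entry."""
    name_pattern = re.compile(
        r"\b(?:" + "|".join(map(re.escape, KNOWN_NAMES)) + r")\b"
    )
    # Replace emails first so a name embedded in an address is not half-rewritten.
    text = EMAIL_PATTERN.sub(lambda _m: rng.choice(EMAIL_DICTIONARY), text)
    return name_pattern.sub(lambda _m: rng.choice(NAME_DICTIONARY), text)


if __name__ == "__main__":
    chat = (
        "[Agent] Hi, my name is Jason, can I have your name?\n"
        "[Customer] My name is Valeria\n"
        "[Customer] My email is valeria@example.org\n"
    )
    print(dictionary_replace(chat, random.Random()))
```

In this sketch every finding is replaced independently, so the same name may receive different surrogates in different places. Because the surrogates look like real names, a name the detector missed would be hard to tell apart from the replaced ones, which is the “hiding in plain sight” effect described above.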

Next Steps:

To learn more about de-identification, check out our technical docs, try de-identification of Cloud Storage data in the Cloud Console, and watch a recent Google I/O talk on de-identification of data.

Cloud Blog
