Data Management & GDPR: Let’s Discover the Five Pillars for Success Using Talend

When the General Data Protection Regulation (GDPR) goes live in May 2018, businesses will need to track and trace sensitive data and determine how it is processed across their information supply chain. As a result, businesses will need to approach data management carefully and comply with “privacy by design” principles. In a GDPR world, this means each new digital service that uses personal data must take data protection into account from the outset. A recent survey from the International Association of Privacy Professionals indicates that up to 75,000 Data Protection Officers (DPOs) will need to be hired globally in the run-up to May 2018 to manage EU citizens’ personal data.

Breaches of some GDPR provisions could lead to data watchdogs levying fines of up to €20 million or 4% of global annual turnover for the preceding financial year, whichever is the greater. This leaves many IT departments looking for a playbook to handle these impending data regulations. According to a recent Dell survey, which polled 821 IT professionals worldwide, 97% said their companies didn’t have a plan in place to implement the new law. That needs to change as GDPR will have major implications for the IT landscape of any business and especially in terms of data management best practice.

Getting Prepared for GDPR

To fully comply with and prepare for GDPR, organizations will need to create and maintain a holistic data inventory to know what Personally Identifiable Information (PII) they have stored and processed. To achieve this, they’ll need access to the latest metadata management techniques.

They will also be required to trace their data. They need to know if, how, and when a customer has opted in. Traceability is a key part of the GDPR mandate, and to deliver it, organizations need to implement a PII data hub where they can pull all relevant data together in one place. They also need to reconcile and harmonize the disparate PII data into a “single version of the truth” using data quality and master data management (MDM), together with metadata management to establish data lineage.

The key GDPR concept of privacy by design will also become more significant for enforcing data protection within systems that contain sensitive information, such as data warehouses or cloud applications. That is where data masking, data anonymization, and data pseudonymization should be considered.

The next step is establishing data governance policies. These could relate to implementing parameters around opt-in periods or archiving historical data, for example. Organizations also need to foster accountability. Appointing a DPO might be mandatory for most businesses, but the DPO alone cannot enforce the rules for data protection across all the systems in the company that refer to PII. For example, the data trail for one customer might cover information held in the sales department, as well as data in marketing, finance, legal, maintenance, and even mobile or Internet of Things systems. Within each of these systems, the person responsible for managing that data is likely to be different. Collaborative data stewardship, empowered by self-service apps, will be critical in fostering accountability across all stakeholders.

Finally, there is a need for businesses not just to protect their data but also to open it, using data integration and data services technologies. That’s particularly important because, under the terms of GDPR, the data subject has the right to ask organizations to provide them with relevant data they hold about them. They can also ask for the ‘right to be forgotten’, for corrections to be made if data is inaccurate, and for relevant data to be delivered to them in a machine-readable format.

Tackling The 5 Pillars of GDPR with Talend

First pillar

The first pillar focuses on bringing all data into a data lake and using tools like Hadoop to track it within this environment.  This would allow companies to collect all data that requires attention, but also connect it into a hub where it can be discovered, harmonized, cleansed, protected, governed, and shared safely. Once done, organizations could achieve a broader view beyond the data lake by reaching upstream and looking at data sources before they fill the lake, such as the CRM, marketing, and digital systems. This would enable them to get an end-to-end view of their information supply chain and ensure data governance, quality, and stewardship at the point of origin.

This holistic approach may be favored by many larger organizations looking to get an end-to-end view across their whole information supply chain using enterprise metadata management solutions. Nevertheless, it can be expensive and time-consuming to achieve with respect to the GDPR deadlines, which is why a data lake approach, although perhaps not the ultimate destination for all companies, is a pragmatic milestone for GDPR compliance.

In this approach, the first pillar to be pursued relates to data capture and integration. It’s important here to capture every piece of PII, together with data related to consent, across all data sources and then reconcile it into a 360° view of the identity of each customer (figure 1). The challenge is that businesses typically know their customers, or employees, in many different contexts. An airline may know a customer through their Twitter account, as a passenger, and as a frequent flyer, for example.

So how can organizations achieve this 360-degree identity view? Talend’s Big Data Platform can help to populate it. It embeds a native data quality component to match disparate data, helping businesses understand that John Smith is the same person as jsmith@widgets.com or @JohnSmith, for example.
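
The matching step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Talend's matching logic: the record fields (`name`, `email`, `twitter`) and the functions `candidate_keys` and `group_identities` are hypothetical, and the exact-key heuristic is deliberately naive (it would also link a Jane Smith to jsmith@widgets.com). Production data quality tools typically rely on fuzzy matching and survivorship rules instead.

```python
import re


def candidate_keys(record):
    """Derive normalized keys that might identify the same person."""
    keys = set()
    name = record.get("name")
    if name:
        parts = re.findall(r"[a-z]+", name.lower())
        if len(parts) >= 2:
            first, last = parts[0], parts[-1]
            keys.add(first + last)      # e.g. "johnsmith"
            keys.add(first[0] + last)   # e.g. "jsmith"
    email = record.get("email")
    if email:
        local_part = email.split("@", 1)[0].lower()
        keys.add(re.sub(r"[^a-z]", "", local_part))
    handle = record.get("twitter")
    if handle:
        keys.add(re.sub(r"[^a-z]", "", handle.lower()))
    keys.discard("")
    return keys


def group_identities(records):
    """Group record indexes that share a key, directly or transitively."""
    parent = list(range(len(records)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}
    for idx, record in enumerate(records):
        for key in candidate_keys(record):
            if key in first_seen:
                parent[find(idx)] = find(first_seen[key])
            else:
                first_seen[key] = idx

    groups = {}
    for idx in range(len(records)):
        groups.setdefault(find(idx), []).append(idx)
    return sorted(groups.values())
```

Called on the three John Smith records from the example plus an unrelated customer, `group_identities` returns one group containing the first three indexes and a second group for the other customer.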

Talend and GDPR Data Quality

Figure 1: Talend combines Data Quality, data stewardship, data and big data integration into a unified platform to collect, standardize, reconcile, certify, protect and propagate PII data

Talend Master Data Management (MDM) can also be leveraged, not only to reconcile the data around a common master data record, but also to enable governance and stewardship on top of it for data protection and to propagate it safely across the required systems. In the context of GDPR, MDM also has particular relevance for managing opt-ins. Under GDPR, opt-ins need to apply across multiple applications, so businesses need to consider them across a range of areas: for email campaigns, for personalizing the website with best offers, and for other applications such as billing or customer service. All these elements are likely to require different applications to process them, so MDM will help reconcile, protect, and create an audit trail of personal data in one place (figure 2) and then apply it across the different applications.

Talend and GDPR Data Lineage

Figure 2: Talend provides record level lineage with undo/redo capabilities, thereby providing an audit trail for opt-ins and any other data that relate to a data subject
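
To make the idea of an opt-in audit trail more concrete, here is a minimal Python sketch of an append-only consent ledger for one data subject. It is an illustration only and does not reflect how Talend MDM stores consent: the class names `ConsentEvent` and `ConsentLedger`, the purpose labels, and the `source` field are all hypothetical. The sketch assumes events are appended in chronological order.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ConsentEvent:
    purpose: str        # e.g. "email_campaigns" (hypothetical label)
    granted: bool       # True for opt-in, False for opt-out
    source: str         # where the choice was captured
    timestamp: datetime


@dataclass
class ConsentLedger:
    subject_id: str
    events: list = field(default_factory=list)

    def record(self, purpose, granted, source, timestamp=None):
        """Append a consent change; earlier events are never modified."""
        ts = timestamp or datetime.now(timezone.utc)
        self.events.append(ConsentEvent(purpose, granted, source, ts))

    def is_opted_in(self, purpose):
        """Return the latest recorded choice, defaulting to no consent."""
        for event in reversed(self.events):
            if event.purpose == purpose:
                return event.granted
        return False

    def history(self, purpose):
        """Return the full audit trail for one purpose, oldest first."""
        return [e for e in self.events if e.purpose == purpose]
```

Keeping one ledger per data subject, keyed by purpose, lets each downstream application (email, personalization, billing) ask the same question of the same record instead of maintaining its own copy of the opt-in.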

Second pillar

The second pillar, data classification and lineage, involves helping businesses define and categorize the data that needs to be accessed, pinpointing where it is located across the system, and gauging how that information is related to other relevant information across the system. When using a Hadoop environment for the GDPR data lake, this can be achieved using Apache Atlas and Cloudera Navigator, technologies that provide a map of the business data within Hadoop. Talend’s Big Data Platform tightly integrates with those environments to provide the data lineage for data flows, highlighting where PII comes from and where it goes. In addition, Talend Metadata Manager can draw the information supply chain across any system and beyond the data lake (figure 3). This type of metadata management, in turn, enables potentially anyone in the organization to know where the data is through a business glossary (figure 4) and also reveals the relevant files or databases within which it is stored, thereby effectively establishing data lineage.

Talend and GDPR Data Metadata Manager

Figure 3: Talend can automatically harvest data to create your PII inventory and draw an end-to-end view of your information supply chain, so you know where your data comes from and where it goes

Talend and GDPR Data Metadata Explorer

Figure 4: Talend Business Glossary helps you create a reference, and document and classify your critical data elements, with direct links to the datasets that contain them
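
Conceptually, data lineage is a directed graph of datasets, and "where does PII come from and where does it go" is a reachability question on that graph. The following Python sketch shows the idea under simple assumptions: the dataset names and the functions `build_lineage` and `reachable` are hypothetical, and the edge list would in practice be harvested from a metadata tool rather than typed by hand.

```python
from collections import defaultdict, deque


def build_lineage(edges):
    """Index (source, target) data flows in both directions."""
    downstream = defaultdict(set)
    upstream = defaultdict(set)
    for source, target in edges:
        downstream[source].add(target)
        upstream[target].add(source)
    return downstream, upstream


def reachable(graph, start):
    """Breadth-first search: every dataset reachable from `start`."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    seen.discard(start)  # only relevant if the flows contain a cycle
    return seen


# Hypothetical flows: two operational sources feed the lake,
# which in turn feeds a dashboard and a marketing export.
EDGES = [
    ("crm.contacts", "lake.customers"),
    ("web.signups", "lake.customers"),
    ("lake.customers", "bi.dashboard"),
    ("lake.customers", "marketing.export"),
]
```

With these flows, `reachable(downstream, "crm.contacts")` lists every place CRM contact data ends up, and `reachable(upstream, "bi.dashboard")` lists every source that feeds the dashboard.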

Third pillar

The third GDPR pillar is data anonymization and pseudonymization. Here, the latest semantic discovery capabilities enable organizations to automatically detect whether there is sensitive data, such as credit card numbers or email addresses, within newly loaded data sources. This is an important capability because it alerts organizations to potential data privacy issues, effectively directing them to data sources that may require attention for GDPR compliance. They can then ask themselves the key question: do I really need to expose this sensitive data in this context?
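
A much-simplified version of this kind of discovery can be sketched in Python. The sketch below is an assumption-laden illustration rather than Talend's semantic discovery: the function names, the 80% threshold, and the two detectors (a basic email pattern and a Luhn checksum for card numbers) are all chosen for the example.

```python
import re

# Deliberately simple email pattern; real-world validation is looser.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def luhn_valid(digits):
    """Check a string of digits against the Luhn checksum."""
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    for position, char in enumerate(reversed(digits)):
        value = int(char)
        if position % 2 == 1:   # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def looks_like_card(value):
    stripped = value.replace(" ", "").replace("-", "")
    return stripped.isdigit() and luhn_valid(stripped)


def looks_like_email(value):
    return EMAIL_RE.fullmatch(value) is not None


def classify_column(values, threshold=0.8):
    """Label a column if enough non-empty values match one detector."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return None
    detectors = {"email": looks_like_email, "credit_card": looks_like_card}
    for label, detect in detectors.items():
        hits = sum(1 for v in non_empty if detect(v))
        if hits / len(non_empty) >= threshold:
            return label
    return None


def scan_dataset(columns):
    """Map column name -> detected sensitive type, for flagged columns only."""
    flagged = {}
    for name, values in columns.items():
        label = classify_column(values)
        if label:
            flagged[name] = label
    return flagged
```

Running `scan_dataset` on a newly loaded source, represented here as a dictionary of column name to list of string values, returns only the columns that appear to hold email addresses or card numbers, which is the shortlist a data steward would then review.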

Applying those techniques can also limit what falls within the scope of the GDPR: data that has been truly anonymized is no longer personal data, although pseudonymized data still is. For example, sensitive data could be accessible in a CRM system but masked when used for analytics or development and testing. Related to this is the concept of data shuffling, a type of data masking in which the values in a column are randomly shuffled across records so that they can no longer be tied to an individual, while the values themselves are preserved. In this way, privacy is preserved, but analytics and data testing can still take place using the original data values. Data masking and shuffling are part of Talend Data Quality, which provides a single set of tools to build data quality controls across all Talend integration platforms (figure 5). It generates native code to run data quality controls and data anonymization in the right place, on-premises or in the cloud, and at the right time, on data at rest or data in flight.

Talend and GDPR Data Masking

Figure 5: Talend provides data masking and shuffling capabilities for batch and real-time data streams, and for any audience, including business users through self-service tools
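
The three techniques discussed in this pillar (masking, pseudonymization, and shuffling) can each be illustrated with a few lines of Python. This is a rough sketch, not the code Talend Data Quality generates: the function names are hypothetical, rows are assumed to be plain dictionaries, and the keyed hash is only one of several possible pseudonymization schemes.

```python
import hashlib
import hmac
import random


def mask_email(email):
    """Hide the local part of an address, keeping its first character."""
    local, separator, domain = email.partition("@")
    if not separator or not local:
        return "*" * len(email)
    return local[0] + "*" * (len(local) - 1) + "@" + domain


def pseudonymize(value, secret_key):
    """Replace a value with a stable token using a keyed hash (HMAC-SHA256).

    Whoever holds `secret_key` can recompute the token for a known value,
    so this is pseudonymization, not anonymization.
    """
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def shuffle_column(rows, column, seed=None):
    """Return copies of `rows` with one column's values permuted.

    The set of values is preserved, so aggregates over the column are
    unchanged, but each value is detached from its original row.
    """
    values = [row[column] for row in rows]
    random.Random(seed).shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]
```

Note that shuffling a very small table, or a column with a few distinctive values, may still allow re-identification, so the choice of technique depends on the data and on who will see it.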

Fourth pillar

Pillar four, self-service curation and certification, supports the delegation of authority from data experts like data protection officers or data stewards to business users. Think about a sales engineer who might be best positioned to ensure the contact data related to their accounts is up to date, or a campaign manager who becomes accountable for checking and proving that a consent mechanism has been put in place by the partners they work with to enrich the marketing database with new contact data. To ensure that anyone in the organization can manage their data usage in a compliant manner, the business will need to provide straightforward self-service apps, such as Talend Data Preparation and Talend Data Stewardship (figure 6), to the different departments, thereby providing them with enhanced autonomy.

Fifth pillar

The final pillar of GDPR, data portability, allows customers to easily access their personal data, request its rectification or erasure, and reclaim it. To facilitate this capability, businesses could implement a data download tool with data integration software, for example. The business would start from a list of all the customers who have asked for their personal data.

Talend and GDPR Data Stewardship

Figure 6: Talend allows you to delegate accountability for personal data to potentially anyone in the organization through self-service data preparation and stewardship tools

Based on that list, they could run a job using Talend Data Integration to create a comma-separated values (CSV) file for each customer and then send it automatically by email, for example (figure 7). The other option is to use an open API. In other words, the business could open a GDPR service on its website and encourage any customer wanting to know what data the business holds about them to access it. This approach could be supported by Talend Data Services, which can expose real-time data services through a standard, well-documented, and easy-to-consume API, such as REST.

Talend and GDPR Data Integration

Figure 7: Complying with the right to data portability with Talend Data Integration.
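
The batch variant of this process, one CSV per requesting customer, can be sketched with Python's standard library. This is an illustrative sketch rather than a Talend job: the record layout (a `subject_id` key on every record) and the function names are assumptions, and the delivery step (email or API response) is left out.

```python
import csv
import io


def export_subject_csv(subject_id, records):
    """Return CSV text containing every record held about one data subject."""
    rows = [r for r in records if r["subject_id"] == subject_id]
    # Records from different systems may carry different fields, so the
    # header is the sorted union of all keys; missing values are left empty.
    fieldnames = sorted({key for row in rows for key in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def export_pending_requests(requested_ids, records):
    """Build one CSV document per customer who asked for their data."""
    return {sid: export_subject_csv(sid, records) for sid in requested_ids}
```

CSV is used here because the article names it; any structured, commonly used, machine-readable format would serve the same purpose, and the same function could back a REST endpoint instead of a batch job.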

Ready for Today & Tomorrow

As GDPR approaches, companies are becoming increasingly concerned that it could threaten their business. In tackling this threat, organizations first need to put in place the right platform. Here, we see the data lake as an excellent candidate to create the identity data hub and become the focal point. The data lake will be used not only to document, categorize, and map the data but also to track and trace the changes applied to it and deliver data services to data subjects in line with their rights (right of access, right of rectification, data portability, and the right to be forgotten).

As shown, organizations can build on this platform to implement the five pillars of data management best practice and put in place the latest data management capabilities to deliver everything from data capture, integration, classification, and lineage to data anonymization, self-service curation, and data portability. By doing so, they will put themselves in a position to manage the changes brought by GDPR more effectively and further develop a best-practice approach to data management that will help drive success today and into the future.
