Data Masking: Testing with Near-real Data
DESCRIPTION
Organizations worldwide collect data about customers, users, products, and services. Striving to get the most out of collected data, they use it to fuel many day-to-day processes including software testing, development, and personnel training. The majority of this collected data is sensitive and falls under specific government regulations or industry standards that define policies for privacy and generally limit or prohibit using the data for these secondary purposes. Data masking solves this problem. It replaces sensitive information with data that looks real and is structurally similar to the actual information but is useless to anyone trying to obtain the real data. Learn about the process, pros and cons of static and dynamic data masking architectures, subsetting, randomization, generalization, shuffling, and other basic techniques used to set up data masking. Discover how to start data masking and learn about common challenges on data masking projects.
TRANSCRIPT
T20 Test Techniques
5/2/2013 3:00:00 PM
Data Masking: Testing with
Near-real Data
Presented by:
Martin Kralj
Ekobit
Brought to you by:
340 Corporate Way, Suite 300, Orange Park, FL 32073
888-268-8770 ∙ 904-278-0524 ∙ [email protected] ∙ www.sqe.com
Martin Kralj
Martin Kralj is responsible for the data masking line of tools and services at Ekobit. In his fifteen years in the software industry, Martin has worked as a business analyst, enterprise software development professional, consultant, project manager, and customer support engineer. He has held key roles on teams producing Ekobit’s flagship products, TeamCompanion and BizDataX, and directed software projects in-house and worldwide. Recently Martin has specialized in application lifecycle management, particularly agile software development methodologies, team work, and data masking. He presents at various conferences and writes about software.
13.5.2013
Data Masking
• Testing with Near-real Data
About me
~ Martin Kralj
~ Software development
~ Project management and ALM + consulting
~ BizDataX by Ekobit
  • Complex data relationships
  • Large databases
  • Near-real data
  • Designed for enterprise
Agenda
~ Handling sensitive data
  • Define “sensitive”
  • Norms and regulations
~ Data masking
  • Concepts and basic techniques
  • How can we do it?
  • Scripts vs. tools and platforms
Comply with data privacy and security laws
USA norms and regulations
~ Nationwide
  • HIPAA (Health Insurance Portability and Accountability Act)
  • HITECH (Health Information Technology for Economic and Clinical Health Act)
~ State specific, California as an example
  • CMIA (Confidentiality of Medical Information Act)
  • IPA (Information Practices Act)
  • PAHRA (Patient Access to Health Records Act)
  • IIPPA (Insurance Information and Privacy Protection Act)
  • Security Breach Notification Law
~ Industry wide
  • PCI DSS (Payment Card Industry Data Security Standard)
Self-interests and reputation
~ Corporate rules
~ Competition and industrial espionage
~ Protecting intellectual property
~ Ethical reasons and protection of reputation
Work with near-real data
~ Format preserving and context sensitive
~ Secondary usage of sensitive data is avoided
Demo: is it real or fabricated?
Suppression
Original data:
ID  First Name  Last Name  Date of Birth  Phone           Gender
1   Sasha       Cortez     20.7.1967      1-340-337-7194  Female
2   Neve        Dyer       17.11.1975     1-599-974-8272  Female
3   September   Graves     9.6.1977       1-404-899-2966  Female
4   Theodore    Graves     27.10.1962     1-266-364-7119  Male
5   Donovan     Hoover     19.3.1978      1-728-752-4244  Male
6   Lynn        Joyner     16.12.1984     1-124-859-5234  Female
7   Quon        May        19.11.1954     1-406-895-7153  Female
8   Berk        Mcclain    18.7.1966      1-938-803-0464  Male
9   Hakeem      Ray        9.4.1964       1-734-314-8964  Male
10  Paki        Sellers    10.11.1956     1-641-173-5621  Male
After suppression (some rows and columns removed):
ID  First Name  Last Name  Gender
2   Neve        Dyer       Female
4   Theodore    Graves     Male
5   Donovan     Hoover     Male
7   Quon        May        Female
8   Berk        Mcclain    Male
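A minimal sketch of suppression, assuming rows are held as Python dictionaries. The function name `suppress`, the column keys, and the `KEPT_IDS` filter are illustrative assumptions; the slide does not say by what rule rows were selected, so the IDs are simply copied from its second table.

```python
# Three rows copied from the slide's original table.
rows = [
    {"id": 1, "first_name": "Sasha", "last_name": "Cortez",
     "date_of_birth": "20.7.1967", "phone": "1-340-337-7194", "gender": "Female"},
    {"id": 2, "first_name": "Neve", "last_name": "Dyer",
     "date_of_birth": "17.11.1975", "phone": "1-599-974-8272", "gender": "Female"},
    {"id": 3, "first_name": "September", "last_name": "Graves",
     "date_of_birth": "9.6.1977", "phone": "1-404-899-2966", "gender": "Female"},
]

# IDs that survive in the slide's second table (the selection rule is not stated).
KEPT_IDS = {2, 4, 5, 7, 8}


def suppress(rows, drop_columns, keep_row):
    """Drop the given columns and keep only rows accepted by keep_row."""
    return [
        {key: value for key, value in row.items() if key not in drop_columns}
        for row in rows
        if keep_row(row)
    ]


masked = suppress(
    rows,
    drop_columns={"date_of_birth", "phone"},
    keep_row=lambda row: row["id"] in KEPT_IDS,
)
print(masked)
```

Suppression is the bluntest technique: whatever is removed cannot leak, but tests that depend on the removed columns or rows can no longer run against the result.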
Shuffling
ID  First Name  Last Name  Gender
1               Cortez     Female
2               Dyer       Female
3               Graves     Female
4               Graves     Male
5               Hoover     Male
6               Joyner     Female
7               May        Female
8               Mcclain    Male
9               Ray        Male
10              Sellers    Male
First names to be redistributed across the rows:
Sasha, Neve, September, Theodore, Donovan, Lynn, Quon, Berk, Hakeem, Paki
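Shuffling can be sketched as a permutation of one column's values across rows. The helper `shuffle_column` and the `seed` parameter are assumptions made for illustration, not part of any tool shown in the talk.

```python
import random

# Three rows taken from the slide; only first_name will be shuffled.
rows = [
    {"id": 1, "first_name": "Sasha", "last_name": "Cortez", "gender": "Female"},
    {"id": 2, "first_name": "Neve", "last_name": "Dyer", "gender": "Female"},
    {"id": 3, "first_name": "September", "last_name": "Graves", "gender": "Female"},
]


def shuffle_column(rows, column, seed=None):
    """Return copies of rows with the values of one column permuted across rows."""
    rng = random.Random(seed)
    values = [row[column] for row in rows]
    rng.shuffle(values)
    # Every other column stays attached to its original row.
    return [{**row, column: value} for row, value in zip(rows, values)]


shuffled = shuffle_column(rows, "first_name", seed=7)
for row in shuffled:
    print(row["id"], row["first_name"], row["last_name"])
```

A plain shuffle keeps the real values in the data set and may leave some of them in their original row, so on small tables it offers weak protection. This simple version also ignores the Gender column, so a name can land in a row of the other gender; a context-sensitive shuffle would permute within each gender group.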
Redaction (blacking-out)
ID  First Name  Last Name  Age
1   Sasha       Cortez     44
2   Neve        Dyer       36
3   September   Graves     34
4   Theodore    Graves     49
5   Donovan     Hoover     33
6   Lynn        Joyner     27
7   Quon        May        57
8   Berk        Mcclain    45
9   Hakeem      Ray        47
10  Paki        Sellers    55
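A minimal sketch of redaction, assuming the Age column is the one being blacked out (the transcript does not preserve which cells were covered on the slide). The names `redact` and `redact_column` and the mask character are illustrative choices.

```python
def redact(value, mask_char="X"):
    """Black out a value, keeping only its length."""
    return mask_char * len(str(value))


def redact_column(rows, column):
    """Return copies of rows with one column blacked out."""
    return [{**row, column: redact(row[column])} for row in rows]


rows = [
    {"id": 1, "first_name": "Sasha", "last_name": "Cortez", "age": 44},
    {"id": 2, "first_name": "Neve", "last_name": "Dyer", "age": 36},
]

redacted = redact_column(rows, "age")
print(redacted)
```

Keeping the original length still reveals something about the value; a fixed-width mask avoids that at the cost of a less realistic format.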
Generalization
ID  First Name  Last Name  Age
1   Sasha       Cortez     41-50
2   Neve        Dyer       31-40
3   September   Graves     31-40
4   Theodore    Graves     41-50
5   Donovan     Hoover     31-40
6   Lynn        Joyner     21-30
7   Quon        May        51-
8   Berk        Mcclain    41-50
9   Hakeem      Ray        41-50
10  Paki        Sellers    51-
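The age bands on this slide can be reproduced with a small bucketing function. This is a sketch under assumptions: the ten-year width and the open-ended top band starting at 51 are read off the slide's table, while the function name and the handling of ages below 1 are hypothetical.

```python
def generalize_age(age, width=10, top=51):
    """Map an exact age to a band such as '41-50'; ages from `top` up share one open band."""
    if age < 1:
        raise ValueError("this sketch only handles ages of 1 and above")
    if age >= top:
        return f"{top}-"
    # Bands start at 1, 11, 21, ... so that 31..40 maps to '31-40'.
    low = ((age - 1) // width) * width + 1
    return f"{low}-{low + width - 1}"


ages = [44, 36, 34, 49, 33, 27, 57, 45, 47, 55]  # Age column from the redaction slide
print([generalize_age(age) for age in ages])
```

Wider bands give stronger protection but make the data less useful for tests that depend on exact ages.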
Randomization, generating and substitution
Original data:
ID  First Name  Last Name  Phone
1   Sasha       Cortez     1-340-337-7194
2   Neve        Dyer       1-599-974-8272
3   September   Graves     1-404-899-2966
4   Theodore    Graves     1-266-364-7119
5   Donovan     Hoover     1-728-752-4244
6   Lynn        Joyner     1-124-859-5234
7   Quon        May        1-406-895-7153
8   Berk        Mcclain    1-938-803-0464
9   Hakeem      Ray        1-734-314-8964
10  Paki        Sellers    1-641-173-5621
After randomization (phone numbers replaced):
ID  First Name  Last Name  Phone
1   Sasha       Cortez     1-182-260-6935
2   Neve        Dyer       1-886-794-9258
3   September   Graves     1-847-263-1225
4   Theodore    Graves     1-341-810-3139
5   Donovan     Hoover     1-982-608-9112
6   Lynn        Joyner     1-960-142-1834
7   Quon        May        1-872-132-9340
8   Berk        Mcclain    1-612-726-9353
9   Hakeem      Ray        1-157-361-5540
10  Paki        Sellers    1-834-906-6092
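A minimal sketch of format-preserving randomization for the phone column. It assumes, based on the slide's two tables, that the leading `1-` is kept and every other digit is replaced; the function name and the `keep_prefix` parameter are hypothetical.

```python
import random


def randomize_phone(phone, rng, keep_prefix=2):
    """Replace each digit after the first keep_prefix characters with a random digit.

    Separators stay where they are, so the result has the same format as the input.
    """
    head, tail = phone[:keep_prefix], phone[keep_prefix:]
    return head + "".join(
        str(rng.randrange(10)) if ch.isdigit() else ch for ch in tail
    )


rng = random.Random(2013)  # seeded only to make the sketch repeatable
original = "1-340-337-7194"
masked_phone = randomize_phone(original, rng)
print(original, "->", masked_phone)
```

Python's `random` module is not cryptographically secure, which is acceptable for an illustration but worth reviewing before masking real data.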
Masking techniques
~ Suppression
~ Shuffling
~ Redaction (blacking out)
~ Generalization
~ Randomization, generating and substitution
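Dynamic data masking
[Diagram] Real data (1234-5678-4011) feeds both processes from a single store:
~ Primary process reads the real value: 1234-5678-4011
~ Secondary process reads through DDM and sees: XXXX-XXXX-4011
Static data masking
[Diagram] SDM turns real data (1234-5678-4011) into a separate masked data set (XXXX-XXXX-4011):
~ Primary process reads the real data: 1234-5678-4011
~ Secondary processes read the masked data: XXXX-XXXX-4011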
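The difference between the two architectures can be sketched in a few lines. This is an illustration under assumptions: `mask_card`, `read_card`, and the role names `"primary"` and `"secondary"` are hypothetical and do not describe any particular product.

```python
def mask_card(number, visible_groups=1, mask_char="X"):
    """Mask every dash-separated group except the last visible_groups ones."""
    groups = number.split("-")
    hidden = len(groups) - visible_groups
    return "-".join(
        mask_char * len(group) if index < hidden else group
        for index, group in enumerate(groups)
    )


production = [{"card": "1234-5678-4011"}]

# Static data masking: build a masked copy once; secondary processes
# only ever read the copy, never the production data.
masked_copy = [{**row, "card": mask_card(row["card"])} for row in production]


# Dynamic data masking: a single store, with masking applied at read time
# depending on who is asking.
def read_card(row, role):
    return row["card"] if role == "primary" else mask_card(row["card"])


print(masked_copy[0]["card"])
print(read_card(production[0], "primary"))
print(read_card(production[0], "secondary"))
```

In the static variant the real values never reach the test environment. In the dynamic variant they stay in one place, and protection depends on the masking layer being applied to every read.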
Demo: simple script
Tools: masking logic built-in
Tools: declarative approach
Demo: define rules (simplicity and power)
Tools: performance and expertise
~ Explicit and implicit parallelism
~ Automatic and scheduled execution
~ Notifications, monitoring and auditing
~ Efficient processing of large amounts of data
~ Deterministic or repeatable masking
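Deterministic (repeatable) masking means the same input always yields the same masked value, so related tables stay consistent across runs. One common way to get this is a keyed hash that selects a substitute from a fixed pool. The sketch below assumes that approach; the pool, the key, and the function name are hypothetical.

```python
import hashlib
import hmac

# Hypothetical pool of substitute first names.
SUBSTITUTE_NAMES = ["Alex", "Blake", "Casey", "Devon", "Emery"]


def deterministic_substitute(value, key, pool=SUBSTITUTE_NAMES):
    """Pick a substitute from pool; the same value and key always give the same result."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(pool)
    return pool[index]


key = b"example-masking-key"  # assumption: in practice kept secret and managed outside the script
first_run = deterministic_substitute("Sasha", key)
second_run = deterministic_substitute("Sasha", key)
print(first_run, second_run)
```

With a small pool, different originals will often map to the same substitute. Whether that matters depends on whether the masked column has to stay unique.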
Tools: systematic approach
~ Thorough analysis of existing infrastructure, data, people and processes
~ Natural separation of roles and responsibilities
~ Data can be handled like other must-haves and daily routines
~ Accountability and traceability
Conclusion
~ Data masking improves data security and the handling of sensitive data in general
~ The technology is heavily underutilized
  • Treated as a side job for an administrator or programmer
  • There is no real control
~ Do it by the book
  • Systematic, project-based approach
  • Explore and use specialized tools
  • Ask for help