Data Security Guidelines

May 2010 – Version 1.0

Gary Waldrom


Entity Class

A logical model of an identifiable party.

• Each instance of an entity defined within the system should be identified and marked for drill-down investigation

Domains

A logical structure of attributes represented within a single entity.

• Each instance of a domain structure that is listed within the spreadsheet (slides 5-8) and contained within an identified Entity should be marked for further drill-down investigation

Attributes

Individual data fields under data type constraints and associated business and integrity rules.

• Each attribute type that is listed within the spreadsheet (slides 5-8) and contained within an identified domain is a candidate for data obfuscation based on the data obfuscation rules

Candidate Selection


Level 1

• Sensitivity level 1 is a unique identifier by which a party can be identified without further reference to other sensitive information (High Cardinality). All instances should be obfuscated or masked

Level 2

• Sensitivity level 2 is information which collectively, i.e. across more than one instance, may form a positive identification of a party. In isolation this data, although deemed sensitive, gives no direct and unique identification of the party; however, the more attributes are supplied, the closer they come to forming a sensitivity level 1 identifier without a level 1 attribute being involved (Normal Cardinality). All combined instances must be obfuscated

Level 3

• Sensitivity level 3 is data with a Low Cardinality ratio. All combined instances should be obfuscated, although individual instances will not identify a party

Data Sensitivity

Risk of Identification of Parties

Unique Identifier – Sensitivity Level 1

• ∞

• High risk: these identifiers will uniquely identify a party and are traceable through various public-domain systems

Composite Identifiers – Sensitivity Level 2

• n exponent

• Becomes an identifier as multiple instances increase cardinality; the exponent is based on cardinality

Low Cardinality Identifiers – Sensitivity Level 3

• n + Composite

• Multiple composites increase identification; cardinality increases as instances are added


Attribute Identification


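Entity | Domain | Attribute | Data Type (Generic) | Classification
Client | Name | Firstname(s) | Character | 2
Client | Name | Surnames / Family Name | Character | 2
Client | Name | Title / Prefix | Denormalised: Character | 3
Client | Name | Suffix | Denormalised: Character | 3
Client | Name | Salutation | Denormalised: Character | 2
Client | Address | House Number/Name | Character | 2
Client | Address | Address Line 1 | Character | 2
Client | Address | Address Line 4 | Character | 2
Client | Address | State / County / Canton / Region etc | Denormalised: Character | 3
Client | Address | Zip / Post Code | Character | 2
Client | Address | Country | Denormalised: Character | 3
Client | Contact | Home Telephone Number | Character | 1
Client | Contact | Work Telephone Number | Character | 1
Client | Contact | Cell/Mobile Number | Character | 1
Client | Contact | Additional Telephone Numbers | Character | 1
Client | Contact | Email1 | Character | 1
Client | Contact | Email2 | Character | 1
Client | Contact | Additional Email Accts | Character | 1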




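Entity | Domain | Attribute | Data Type (Generic) | Classification
Client | Personal Details | Date of Birth | Date | 3
Client | Personal Details | Gender | Denormalised: Character | 3
Client | Personal Details | Political Persuasion | Denormalised: Character | 3
Client | Personal Details | Religious or Philosophical Beliefs | Denormalised: Character | 3
Client | Personal Details | Sexual Persuasion | Denormalised: Character | 3
Client | Personal Details | Race or Ethnic Origin | Denormalised: Character | 3
Client | Personal Details | Accusations or Suspicions | Denormalised: Character | 3
Client | Personal Details | Convictions / Judgements / Criminal Records | Denormalised: Character | 3
Client | Personal Details | Notes | Long Character (Free text could hold sensitive details) | 1
Client | Personal Details | Internet usage & web tracking information | Character / W3C Logs | 2
Client | Personal Details | Physical and/or Mental Health | Character | 3
Client | Personal Details | Source of Wealth | Long Character (Free text could hold sensitive details) | 1
Client | Personal Details | Nationality | Denormalised: Character | 3
Client | Personal Details | Domicile | Denormalised: Character | 3
Client | Personal Details | Spouse | Name Domain | 2
Client | Personal Details | Children | Name Domain | 2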




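Entity | Domain | Attribute | Data Type (Generic) | Classification
Client | Natural Keys | SSN / Tax ID / NI Number | Character | 1
Client | Natural Keys | Passport Number | Character | 1
Client | Natural Keys | Login IDs & Passwords | Character | 1
Client | Natural Keys | Union / Club / Society Membership | Character | 1
Client | Natural Keys | Bank Account Number(s) | Number | 1
Client | Natural Keys | Sort Code(s) | Number | 2
Client | Natural Keys | Account Name(s) | Character | 1
Client | Natural Keys | Residential Address | Address Domain | 2
Client | Linked Data | Beneficiary | Beneficiary Entity | 1
Client | Linked Data | IFA | IFA Entity | 2
Client | Linked Data | Intermediary | Intermediary Entity | 2
Client | Linked Data | Sub Account | Sub Account Entity | 1
Client | Linked Data | Accountant | Accountant Entity | 2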




Entity | Domain | Classification
Beneficiary | All Client Entity Domains | 2
IFA | All Client Entity Domains | 3
Intermediary | All Client Entity Domains | 3
Sub Account | All Client Entity Domains | 1

Classification Key

Sensitivity Level 1

Sensitivity level 1 is a unique identifier by which a party can be identified without further reference to other sensitive information (High Cardinality). All instances should be obfuscated

Sensitivity Level 2

Sensitivity level 2 is information which collectively, i.e. across more than one instance, may form a positive identification of a party. In isolation this data, although deemed sensitive, gives no direct and unique identification of the party; however, the more attributes are supplied, the closer they come to forming a sensitivity level 1 identifier without a level 1 attribute being involved (Normal Cardinality). All combined instances must be obfuscated

Sensitivity Level 3

Sensitivity level 3 is data with a Low Cardinality ratio. All combined instances should be obfuscated, although individual instances will not identify a party

Note: For normalised data types, the obfuscation layer is applied at the reference table level

Use-Case Example of Composite Identifiers (Sensitivity Level 2)

• First Name: Cardinality >= 1,000,000

• Surname: Cardinality >= 100,000

• Country: Cardinality >= 10,000

• Region: Cardinality >= 100

• Post Code: Cardinality >= 5

• House Number: Cardinality <= 2

Increase of positive identification by the accumulation of sensitivity level 2 attributes held within the same domain.

Diagram labels: Obfuscation point; Point of probability. Data is purely for reference.
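The narrowing effect in this use case can be sketched in a few lines of Python. This is a minimal illustration, not part of the original guidelines: the records, the field names and the function `candidates_remaining` are all hypothetical, and a real assessment would run against actual attribute cardinalities.

```python
# Hypothetical toy records; every value here is invented for illustration.
RECORDS = [
    {"first_name": "Ann", "surname": "Lee", "country": "UK", "post_code": "AB1"},
    {"first_name": "Ann", "surname": "Lee", "country": "UK", "post_code": "CD2"},
    {"first_name": "Ann", "surname": "Ray", "country": "UK", "post_code": "AB1"},
    {"first_name": "Bob", "surname": "Lee", "country": "FR", "post_code": "EF3"},
]


def candidates_remaining(records, target, attributes):
    """Count the records still matching `target` as each attribute is revealed.

    Returns a list of (attribute, matching_record_count) pairs, in the order
    the attributes are added to the composite.
    """
    result = []
    known = []
    for attr in attributes:
        known.append(attr)
        matches = sum(
            all(record[a] == target[a] for a in known) for record in records
        )
        result.append((attr, matches))
    return result


steps = candidates_remaining(
    RECORDS, RECORDS[0], ["first_name", "surname", "country", "post_code"]
)
for attribute, count in steps:
    print(f"{attribute}: {count} candidate(s)")
```

Once the count reaches 1, the composite behaves like a sensitivity level 1 identifier even though no level 1 attribute was supplied.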

Use-Case Example of Composite Identifiers (Sensitivity Level 3)


• Gender

• Country

• Region: Cardinality >= 1,000,000

• Date of Birth: Cardinality >= 3,000

• Surname: Cardinality >= 5

• Post Code: Cardinality <= 2

Little increase of positive identification by the accumulation of sensitivity level 3 attributes until the addition of a sensitivity level 2 attribute.

Diagram labels: Obfuscation point; Point of probability. Data is purely for reference.


Numeric Obfuscation

Numbers used in aggregate functions and checked for accuracy (i.e. holdings, values, transactions) should not be obfuscated if all other attributes within the domain/entity structure have been obfuscated and there is no method of reversing the obfuscation layer to identify sensitive data against the values. Barring that:

• Integers should be obfuscated to a length equal to or less than that of the original number and must still conform to any specific business rules

• Fixed-point numbers should be obfuscated to a precision equal to or less than the original, retaining the original scale, and must still conform to any specific business rules

• Floating-point numbers should be obfuscated to a precision and scale equal to or less than the original and must still conform to any specific business rules

• Currency/percentage formatting over numeric values should be retained for verification purposes

• Ordinal numbers should have the alphabetic element obfuscated in the same way as an alpha data element, retaining the same two-character format
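The integer and fixed-point rules in this section can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, no business rules are applied, and the random source is passed in so that a real deployment could choose its own generator.

```python
import random
from decimal import Decimal


def obfuscate_integer(value: int, rng: random.Random) -> int:
    """Replace an integer with a random one of the same digit length and sign."""
    digits = len(str(abs(value)))
    low = 0 if digits == 1 else 10 ** (digits - 1)
    replacement = rng.randint(low, 10 ** digits - 1)
    return -replacement if value < 0 else replacement


def obfuscate_fixed_point(value: Decimal, rng: random.Random) -> Decimal:
    """Randomise the digits of a finite fixed-point number.

    The exponent (scale) is retained; random leading zeros can shorten the
    coefficient, so the precision is equal to or less than the original.
    """
    sign, digits, exponent = value.as_tuple()
    new_digits = tuple(rng.randint(0, 9) for _ in digits)
    return Decimal((sign, new_digits, exponent))


# random.SystemRandom draws from the operating system's entropy source.
rng = random.SystemRandom()
print(obfuscate_integer(48213, rng))
print(obfuscate_fixed_point(Decimal("1234.50"), rng))
```

Currency or percentage formatting would be applied to the obfuscated value by the same presentation layer that formats the original, which is why the sketch operates on raw numbers only.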


Alpha Obfuscation

Alphabetic and Alphanumeric data types should be obfuscated retaining the original structure of the underlying data; however, certain exceptions exist for search/view criteria:

• SGML/XML/HTML/XHTML/RSS data formats must retain XML reserved characters in order for them to be used in native views, DTD, XLS, Web-based formats etc.

• Embedded Java code must be retained but underlying attributes obfuscated
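A structure-preserving alpha obfuscation might look like the sketch below. It is an assumption about one workable approach, not the method prescribed by these guidelines: the function name `obfuscate_alpha` is hypothetical, letters are replaced by random ASCII letters of the same case, digits by random digits, and every other character is left untouched.

```python
import random
import string


def obfuscate_alpha(text: str, rng: random.Random) -> str:
    """Scramble letters and digits while keeping length, case and punctuation."""
    out = []
    for ch in text:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(rng.choice(pool))
        else:
            # Whitespace, punctuation and reserved characters such as < > &
            # pass through unchanged.
            out.append(ch)
    return "".join(out)


rng = random.SystemRandom()
print(obfuscate_alpha("Flat 4b, O'Neil Street", rng))
```

Because element names in markup are themselves letters, this sketch would scramble them; markup or embedded code would need to be parsed first so that only the underlying attribute values are passed to the function.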


Key Obfuscation

Obfuscating keys risks breaking Declarative Referential Integrity when the data is presented to certain applications that rely upon it, thus:

• Natural keys that are identified as sensitive data can only be anonymised/masked

• Natural keys that are identified as non-sensitive are out of scope and may be retained

• Surrogate keys are out of scope and should be retained


Date Obfuscation 1

Dates should retain the original date format of the National Character set of the underlying data.

• Day numbers should be obfuscated but retain the 1-31 format

• Day-of-the-week numbers should be obfuscated but retain the 0-6 or 1-7 formatting, dependent on platform

• Day names should be obfuscated as per the alpha data element; however, the length of the day name must be changed to a length between 6 and 9 that is not the same as the length of the original day name

• Month numbers should be obfuscated but retain the 1-12 format

• Ordinal numbers should have the alphabetic element obfuscated in the same way as an alpha data element, retaining the same two-character format
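The day-name rule (a new length between 6 and 9 that differs from the original length) can be sketched as below. The function name and the choice of lowercase ASCII letters with a leading capital are assumptions made for illustration only.

```python
import random
import string


def obfuscate_day_name(day_name: str, rng: random.Random) -> str:
    """Return random letters whose length is 6-9 but differs from the original."""
    allowed_lengths = [n for n in range(6, 10) if n != len(day_name)]
    length = rng.choice(allowed_lengths)
    letters = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
    return letters.capitalize()


rng = random.SystemRandom()
for name in ["Monday", "Wednesday", "Saturday"]:
    print(name, "->", obfuscate_day_name(name, rng))
```

Changing the length as well as the letters means the obfuscated value no longer reveals which day names are candidates by their length.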


Date Obfuscation 2

Dates should retain the original date format of the National Character set of the underlying data.

• Month names should be obfuscated as per the alpha data element; however, the length of the month name must be changed to a length between 3 and 12 that is not the same as the length of the original month name. Abbreviated month names should be obfuscated retaining the 3-character format

• Year numbers should always retain the four-digit format including the century. Years in the past should fall in the range (current year - any validation criteria) to (current year - 1), and projected years in the range (current year + 1) to (current year + any validation criteria). This could potentially cause problems with date verification functions: any function code which performs these verifications must utilise the same seed value as the date value and must fully enclose all other dates within the same block

• Decision support systems relying on “roll-forward”/“roll-back” date scenarios and date range queries must retain the requested period change between two dates
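One way to retain the period between two dates is to move every related date by the same random offset, which is a simple reading of the "same seed value" requirement. The sketch below assumes this approach; the function name and the `max_shift_days` parameter are hypothetical, and the year-range validation described in this section is not enforced here.

```python
import random
from datetime import date, timedelta
from typing import List


def shift_dates(dates: List[date], max_shift_days: int, rng: random.Random) -> List[date]:
    """Shift a group of related dates by one shared, non-zero random offset."""
    if max_shift_days < 1:
        raise ValueError("max_shift_days must be at least 1")
    offset = 0
    while offset == 0:
        offset = rng.randint(-max_shift_days, max_shift_days)
    return [d + timedelta(days=offset) for d in dates]


opened = date(2009, 3, 15)
closed = date(2010, 1, 20)
new_opened, new_closed = shift_dates([opened, closed], 365, random.SystemRandom())
# The interval is unchanged, so roll-forward and date range logic still work.
print(new_closed - new_opened == closed - opened)
```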

Granularity of Access to Sensitive Data

Development

• Development environments must be fully obfuscated at the data level (not through obfuscated views), as developers usually hold higher privileges in these environments

UAT

• UAT environments must be fully obfuscated to all Development, Support, and Non-Authorised users

• Business users may see sensitive data based on their individual levels of authorisation

• Access to data by Support users should be disallowed if possible

• If access is allowed for “fix-on-fail” functionality this must be keystroke logged through an auditing application

Production

• Production environments must be fully obfuscated to all Development, Support, and Non-Authorised users

• Business users may see sensitive data based on their individual levels of authorisation

• Access to data by Support users should be disallowed if possible

• If access is allowed for “fix-on-fail” functionality this must be keystroke logged through an auditing application


Business Users, Development & Support

Business Users only

Development & Support only
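The access rules in this section reduce to a small decision function. The sketch below is illustrative only: the environment and user-type labels and the function name are assumptions, and the audited "fix-on-fail" exception for Support users is deliberately not modelled.

```python
def sees_clear_data(environment: str, user_type: str, authorised: bool = False) -> bool:
    """Return True if this user may see un-obfuscated sensitive data."""
    env = environment.upper()
    if env == "DEV":
        # Development data is obfuscated at the data level for everyone.
        return False
    if env in ("UAT", "PROD"):
        # Only business users with individual authorisation see clear data.
        return user_type == "business" and authorised
    raise ValueError(f"unknown environment: {environment!r}")


print(sees_clear_data("PROD", "business", authorised=True))
print(sees_clear_data("UAT", "support", authorised=True))
print(sees_clear_data("DEV", "business", authorised=True))
```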

Deployment Methods

Data Security Guideline Policy

Full Environment Access Control

Prod, UAT, SIT & Dev environments are fully segregated by user type or privilege level.

Data obfuscation/anonymisation/masking is performed through ETL tools from one environment to the next.

Shared Environment Access

Prod, UAT, SIT & Dev environments may share different user types, i.e. business, developers, support. The level of granularity must be defined on a per-user-type or privilege-level basis.

Data is obfuscated/anonymised/masked based on the authority level of the user type or privilege level.

Hybrid Environments

Prod, UAT, & SIT environments may be obfuscated at a user type level, but transfers of data into Dev environments may be performed through ETL utilities.

Data is obfuscated to the same rules, but the deployment method uses both technical methods.


Benefits & Drawbacks of Deployment Methods

Full Environment Access Control

Benefits

• Leverage existing tools capabilities and vendor support

• Guaranteed obfuscation contained within the environment

• User access managed at different layer to data access

• Access to environment determines visibility

Drawbacks

• ETL tool license/platform costs

• Load window issues

• Metadata & cipher security concerns

Shared Environment Access Control

Benefits

• Higher level of access granularity, greater flexibility

• Define the level of encryption to conform to national regulatory controls

• No load window issues: all users share the same data instance

Drawbacks

• Development costs

• Requires clear delineation of user roles and role management

• Proprietary technology solutions

Hybrid Environments

Benefits

• All prior mentioned

• Greater flexibility in defining a solution which fits with a current “modus operandi”

Drawbacks

• All prior mentioned

• Potential support complexity issues


Data Obfuscation Methodology

Full Environmental Access Control

• No data obfuscation; non-authorised users have no access

Hybrid Environment

• No access to PROD; obfuscation in UAT based on roles and rules; ETL obfuscation into DEV

Shared Environment

• Data obfuscation based on roles and rules of sensitivity


Environmental Control (Access Method)


Diagram elements: PROD (Instance 1); UAT (Instance 2, obfuscated); DEV (Instance 3, obfuscated); ETL steps through Informatica (apply obfuscation rules); user groups: Business Users, Development & Support Users.

Environmental Control (Hybrid Method)


Diagram elements: PROD (Instance 1); UAT (Instance 1 or 2); DEV (Instance 3, totally obfuscated); Obfuscation Layer; ETL step through Informatica (apply obfuscation rules); Periodic Refresh or Duplex Feed; user groups: Business Users, Development & Support Users.

Appendix

Terms of Reference

Lingual Reference

Risk Impact/Probability

Non-Deterministic Obfuscation

Monte Carlo Method

Dynamic Obfuscation Function Methods



Terms of Reference

Anonymous/Anonymised

To remain unidentified or nameless, i.e. NULL. A field that is anonymous would therefore not show any data at all, and you could not verify the structure of the data

Obfuscate/Obfuscated

To confuse or scramble, i.e. encrypt. You could therefore verify that a date is a date, albeit the wrong one, that a number is a number, albeit the wrong one, and that alpha is alpha in the same structure. You would see the structure, but the sensitive data would be indecipherable

Mask/Masked

To cover or hide. This would normally be used in password protection, where an asterisk is displayed for each character typed

Lingual Reference

Anonymous and Obfuscate are both used in a literary sense: an anonymous writer is unknown, whereas a writer using a nom de plume is obfuscated

Risk impact/Probability


Probability - A risk is an event that "may" occur. The probability of it occurring can range anywhere from just above 0% to just below 100%. (Note: It can't be exactly 100%, because then it would be a certainty, not a risk. And it can't be exactly 0%, or it wouldn't be a risk.)

Impact - A risk, by its very nature, always has a negative impact. However, the size of the impact varies in terms of cost and impact on some other critical factor.

We apply these rules to determine when to obfuscate data and when not to

Non-Deterministic Obfuscation

A variety of factors can cause an algorithm to behave in a way which is not deterministic, i.e. non-deterministic:

• If it uses external state other than the input, such as user input, a global variable, a hardware timer value, a random value, or stored disk data.

• If it operates in a way that is timing-sensitive, for example if it has multiple processors writing to the same data at the same time. In this case, the precise order in which each processor writes its data will affect the result.

• If a hardware error causes its state to change in an unexpected way.

A major problem with deterministic algorithms is that sometimes we don't want the results to be predictable. For example, if you are playing an on-line game of blackjack that shuffles its deck using a pseudorandom number generator, a clever gambler might guess precisely the numbers the generator will choose and so determine the entire contents of the deck ahead of time, allowing him to cheat. Similar problems arise in cryptography, where private keys are often generated using such a generator. This sort of problem is generally avoided by using a cryptographically secure pseudo-random number generator.
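The blackjack example can be reproduced with Python's standard library. In this sketch the function names are hypothetical; `random.Random` is a seeded, deterministic generator, while `secrets.SystemRandom` draws from the operating system's cryptographically secure source.

```python
import random
import secrets


def shuffled_deck_seeded(seed: int) -> list:
    """Shuffle with a seeded generator: the same seed gives the same order."""
    deck = list(range(52))
    random.Random(seed).shuffle(deck)
    return deck


def shuffled_deck_secure() -> list:
    """Shuffle with the OS entropy source: there is no seed to guess."""
    deck = list(range(52))
    secrets.SystemRandom().shuffle(deck)
    return deck


# Anyone who learns the seed can reproduce the whole deck.
print(shuffled_deck_seeded(42) == shuffled_deck_seeded(42))
print(shuffled_deck_secure()[:5])
```

The same reasoning applies to obfuscation: if the generator's state can be recovered, the obfuscation layer can potentially be reversed.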


The Monte Carlo Methods

Monte Carlo methods are computational algorithms that rely on repeated random sampling to compute their results; one such use is a stochastic function to create an obfuscation layer.

Stochastic programming is a framework for modelling optimization problems that involve uncertainty.

Because of their reliance on repeated computation of random or pseudo-random numbers, these methods are best suited to, and tend to be used in, situations where it is infeasible or impossible to compute an exact result with a deterministic algorithm, thus ensuring data obfuscation.

These are the building blocks for secure obfuscation of highly sensitive data within the banking environment and will satisfy an external audit.


Dynamic Obfuscation Function Methods


This is an example of a high-level data obfuscation function in which a decision is made, based on the previous criteria, about when to obfuscate, followed by the process of obfuscation for an alpha data type (simplest form).

Data is obfuscated at the decision point using the underlying technology's info-gap, non-probabilistic methods of random number generation, which create seed data for the ASCII conversion of real data.
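A decision-then-obfuscate function of this kind might be sketched as follows. This is not the original function from the guidelines: the decision rule is an assumed reading of the sensitivity levels (level 1 always, levels 2 and 3 only when combined with other attributes), the names are hypothetical, and Python's `secrets` module stands in for the unspecified random number source.

```python
import secrets
import string


def should_obfuscate(sensitivity_level: int, attributes_in_domain: int) -> bool:
    """Decision point: level 1 always, levels 2 and 3 only when combined."""
    if sensitivity_level == 1:
        return True
    if sensitivity_level in (2, 3):
        return attributes_in_domain > 1
    raise ValueError(f"unknown sensitivity level: {sensitivity_level}")


def obfuscate_if_required(value: str, sensitivity_level: int, attributes_in_domain: int) -> str:
    """Replace ASCII letters with random letters of the same case when required."""
    if not should_obfuscate(sensitivity_level, attributes_in_domain):
        return value
    out = []
    for ch in value:
        if ch in string.ascii_uppercase:
            out.append(secrets.choice(string.ascii_uppercase))
        elif ch in string.ascii_lowercase:
            out.append(secrets.choice(string.ascii_lowercase))
        else:
            out.append(ch)  # digits and punctuation are left as-is in this sketch
    return "".join(out)


print(obfuscate_if_required("Smith-Jones", 1, 1))
print(obfuscate_if_required("Smith", 3, 1))
```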
