Tokenization


• Tokenization

https://store.theartofservice.com/the-tokenization-toolkit.html

C preprocessor - Phases

Tokenization: The preprocessor breaks the result of the preceding phase into preprocessing tokens and whitespace. It replaces comments with whitespace.
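As a rough sketch of this phase (an illustration, not the real cpp), the fragment below splits a line of C source into preprocessing tokens while comments contribute only whitespace; the regular expression and function name are invented for the example.

```python
import re

# Split source text into preprocessing tokens; comments are treated as whitespace.
TOKEN = re.compile(r"""
    /\*.*?\*/ | //[^\n]*        # comments (contribute no token, only whitespace)
  | [A-Za-z_]\w*                # identifiers
  | \d\w*                       # pp-numbers (simplified)
  | "(?:\\.|[^"\\])*"           # string literals
  | \S                          # any other single non-whitespace character
""", re.VERBOSE | re.DOTALL)

def preprocess_tokenize(source: str) -> list[str]:
    tokens = []
    for match in TOKEN.finditer(source):
        text = match.group(0)
        if text.startswith(("/*", "//")):
            continue  # a comment is replaced by whitespace, so it yields no token
        tokens.append(text)
    return tokens

print(preprocess_tokenize("int x = 42; /* answer */ // trailing"))
# ['int', 'x', '=', '42', ';']
```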


Enterprise search Content processing and analysis

As part of processing and analysis, tokenization is applied to split the content into tokens, which are the basic matching units. It is also common to normalize tokens to lower case to provide case-insensitive search, as well as to normalize accents to provide better recall.
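A minimal sketch of the normalization described above (the function names are invented; a production search engine would use its own analyzers): tokens are lower-cased and accents are stripped via Unicode decomposition.

```python
import re
import unicodedata

def normalize_token(token: str) -> str:
    """Lower-case a token and strip accents (NFKD decomposition, drop combining marks)."""
    decomposed = unicodedata.normalize("NFKD", token.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def tokenize(content: str) -> list[str]:
    # split content into word tokens, the basic matching units
    return [normalize_token(t) for t in re.findall(r"\w+", content)]

print(tokenize("Café Münster serves CRÈME brûlée"))
# ['cafe', 'munster', 'serves', 'creme', 'brulee']
```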


Lexical analysis - Token

A token is a string of one or more characters that is significant as a group. The process of forming tokens from an input stream of characters is called tokenization.


Lexical analysis - Tokenization

Tokenization is the process of demarcating and possibly classifying sections of a string of input characters. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task of parsing input.
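A small sketch of demarcating and classifying sections of an input string (the token names and patterns are made up for the example):

```python
import re

# Each pattern demarcates a section of the input; the group name classifies it.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=]"),
    ("SKIP",   r"\s+"),
]
LEXER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text: str):
    for match in LEXER.finditer(text):
        if match.lastgroup != "SKIP":
            yield (match.lastgroup, match.group())   # classified token, passed on to later stages

print(list(tokenize("total = price * 3")))
# [('IDENT', 'total'), ('OP', '='), ('IDENT', 'price'), ('OP', '*'), ('NUMBER', '3')]
```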


PerspecSys - Technology

The AppProtex Cloud Data Control Gateway secures data in software-as-a-service and platform-as-a-service provider applications through the use of encryption or tokenization. Gartner, a research and advisory firm, refers to this type of technology as a cloud encryption gateway and categorizes providers of this technology as cloud access security brokers.


PerspecSys - Technology

Within the Gateway, organizations may define encryption and tokenization options at the field level.


PerspecSys - Standards

Its tokenization option was evaluated by Coalfire, a PCI DSS Qualified Security Assessor (QSA) and FedRAMP 3PAO, to ensure that it adheres to industry guidelines.


Identity resolution - Data preprocessing

Standardization can be accomplished through simple rule-based data transformations or more complex procedures such as lexicon-based tokenization and probabilistic hidden Markov models.
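A hedged sketch of what simple rule-based standardization with a small lexicon might look like; the abbreviations and rules here are invented for illustration, and the ambiguity noted in the comments is exactly what probabilistic models such as hidden Markov models are brought in to resolve.

```python
import re

# Invented lexicon: expand common abbreviations so that equivalent records compare equal.
LEXICON = {"st": "street", "st.": "street", "ave": "avenue", "ave.": "avenue"}

def standardize(record: str) -> list[str]:
    tokens = re.findall(r"[a-z0-9.]+", record.lower())              # rule: lower-case, split on spaces
    return [LEXICON.get(tok, tok.rstrip(".")) for tok in tokens]    # rule: expand known abbreviations

print(standardize("123 Main St."))     # ['123', 'main', 'street']
print(standardize("123 MAIN STREET"))  # ['123', 'main', 'street']
# Note: a token like "dr" could mean "doctor" or "drive"; resolving that from context
# is where lexicon-based tokenization plus probabilistic models earn their keep.
```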



Syntax (programming languages) - Levels of syntax

This modularity is sometimes possible, but in many real-world languages an earlier step depends on a later step – for example, the lexer hack in C exists because tokenization depends on context.
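A toy illustration of that context dependence (not an actual C compiler): in C, "A * B;" is a pointer declaration if A names a type but a multiplication if it does not, so the tokenizer consults the parser's symbol table; the names below are invented.

```python
import re

typedef_names = {"size_t"}   # filled in by the parser as typedef declarations are seen

def lex(source: str):
    for word in re.findall(r"[A-Za-z_]\w*|\S", source):
        if re.match(r"[A-Za-z_]", word):
            # the classification of an identifier depends on parser state
            yield ("TYPE_NAME" if word in typedef_names else "IDENTIFIER", word)
        else:
            yield ("PUNCT", word)

print(list(lex("size_t * n ;")))  # size_t tokenized as TYPE_NAME  -> later parsed as a declaration
print(list(lex("x * n ;")))       # x tokenized as IDENTIFIER      -> later parsed as a multiplication
```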


Tokenization (disambiguation)

• Tokenization in language processing (both natural and computer)


Tokenization

Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes input for further processing such as parsing or text mining. Tokenization is useful both in linguistics (where it is a form of text segmentation) and in computer science, where it forms part of lexical analysis.
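A minimal sketch of this pipeline (the regular expression is chosen for illustration): the text is broken into word and punctuation tokens, and the token list then feeds a later stage, here a simple frequency count of the kind used in text mining.

```python
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    # words, numbers, and individual punctuation marks become tokens
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("The quick brown fox jumps over the lazy dog. The dog sleeps.")
print(tokens[:6])                                          # ['The', 'quick', 'brown', 'fox', 'jumps', 'over']
print(Counter(t.lower() for t in tokens).most_common(2))   # [('the', 3), ('dog', 2)]
```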


Tokenization - Methods and obstacles

Typically, tokenization occurs at the word level. However, it is sometimes difficult to define what is meant by a "word". Often a tokenizer relies on simple heuristics, for example: tokens are separated by whitespace or punctuation characters, and contiguous strings of alphabetic characters or digits form a single token.
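The sketch below applies exactly that heuristic and shows where it gets murky; whether contractions, hyphenated words, or abbreviations should count as one token is a policy decision the heuristic cannot make by itself.

```python
import re

def heuristic_tokenize(text: str) -> list[str]:
    # contiguous letters/digits form one token; each punctuation mark becomes its own token
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)

print(heuristic_tokenize("don't re-enter the U.S. market"))
# ['don', "'", 't', 're', '-', 'enter', 'the', 'U', '.', 'S', '.', 'market']
```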


Tokenization - Methods and obstacles

Tokenization is particularly difficult for languages written in scriptio continua, which exhibit no word boundaries, such as Ancient Greek, Chinese, or Thai. (Huang, C., Simon, P., Hsieh, S., Prevot, L. (2007), Rethinking Chinese Word Segmentation: Tokenization, Character Classification, or Word Break Identification, http://www.aclweb.org/anthology/P/P07/P07-2018.pdf)


Tokenization - Services

• TokenEx (http://tokenex.com) – a cost-effective tokenization solution for one-time, recurring, and archival transaction data.


Tokenization (data security)

Tokenization can be used to safeguard sensitive data involving, for example, bank accounts, financial statements, medical records, criminal records, driver's licenses, loan applications, stock trades, voter registrations, and other types of personally identifiable information (PII). (What is Tokenization?, http://www.shift4.com/dotn/4tify/trueTokenization.cfm)


Tokenization (data security)

In the payment card industry (PCI) context, tokens are used to reference cardholder data that is stored in a separate database, application, or off-site secure facility. (Shift4 Corporation Releases Tokenization in Depth White Paper, http://www.shift4.com/pr_20080917_tokenizationindepth.cfm)


Tokenization (data security)

Building an alternate payments ecosystem requires a number of entities working together in order to deliver near-field communication (NFC) or other technology-based payment services to the end users. One of the issues is interoperability between the players, and to resolve this issue the role of trusted service manager (TSM) is proposed to establish a technical link between mobile network operators (MNOs) and providers of services, so that these entities can work together. Tokenization can play a role in mediating such services.


Tokenization (data security)

The Payment Card Industry Data Security Standard (PCI DSS), an industry-wide standard that must be met by any organization that stores, processes, or transmits cardholder data, mandates that credit card data must be protected when stored (The Payment Card Industry Data Security Standard, https://www.pcisecuritystandards.org/security_standards/pci_dss.shtml). Tokenization, as applied to payment card data, is often implemented to meet this mandate, replacing credit card numbers in some systems with a random value (Can Tokenization of Credit Card Numbers Satisfy PCI Requirements?, http://searchsecurity.techtarget.com/expert/KnowledgebaseAnswer/0,289625,sid14_gci1275256,00.html). Tokens can be formatted in a variety of ways.
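A minimal sketch of that substitution (an illustration only, not a PCI-compliant implementation; the names are invented): the card number is swapped for a random value, and the real number survives only inside a separate token vault.

```python
import secrets

vault: dict[str, str] = {}   # token -> original card number; in practice a hardened, audited data store

def tokenize_pan(pan: str) -> str:
    token = "tok_" + secrets.token_hex(8)   # random value with no mathematical relationship to the PAN
    vault[token] = pan
    return token

def detokenize(token: str) -> str:
    return vault[token]                      # only the vault can map a token back to cardholder data

token = tokenize_pan("4111111111111111")
print(token)               # e.g. 'tok_9f3c0d2a1b7e8c44' -- this value is what downstream systems store
print(detokenize(token))   # '4111111111111111'
```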


Tokenization (data security)

Tokenization makes it more difficult for hackers to gain access to cardholder data outside of the token storage system. Implementation of tokenization could simplify the requirements of the PCI DSS, as systems that no longer store or process sensitive data are removed from the scope of the PCI audit. (Securing Data: What Tokenization Does, http://www.etronixlabs.com/tokenization/)


Credit card fraud - Countermeasures

• Tokenization (data security) – not storing the full number in computer systems


Speech synthesis

In a speech synthesis front end, the conversion of raw text containing symbols such as numbers and abbreviations into written-out words is often called text normalization, pre-processing, or tokenization.


Informix - Key Products

There is also an advanced data warehouse edition of Informix. This version includes the Informix Warehouse Accelerator, which uses a combination of newer technologies, including in-memory data, tokenization, deep compression, and columnar database technology, to provide very high performance on business intelligence and data warehouse-style queries.

Yacc

Yacc produces only a parser (phrase analyzer); for full syntactic analysis this requires an external lexical analyzer to perform the first tokenization stage (word analysis), which is then followed by the parsing stage proper. Lexical-analyzer generators, such as Lex or Flex, are widely available. The IEEE POSIX P1003.2 standard defines the functionality and requirements for both Lex and Yacc.


Credit card number - Security

• Tokenization (data security) – in which an artificial account number (token) is printed, stored, or transmitted in place of the true account number.


OpenNLP

It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution.


Index (search engine) - Document parsing

The terms 'indexing', 'parsing', and 'tokenization' are used interchangeably in corporate slang.


Index (search engine) - Document parsing

Natural language processing, as of 2006, is the subject of continuous research and technological improvement. Tokenization presents many challenges in extracting the necessary information from documents for indexing to support quality searching. Tokenization for indexing involves multiple technologies, the implementations of which are commonly kept as corporate secrets.


Index (search engine) - Challenges in natural language processing

The goal during tokenization is to identify words for which users will search.


Index (search engine) - Tokenization

During tokenization, the parser identifies sequences of characters which represent words and other elements, such as punctuation, which are represented by numeric codes, some of which are non-printing control characters.


Index (search engine) - Language recognition

If the search engine supports multiple languages, a common initial step during tokenization is to identify each document's language; many of the subsequent steps are language dependent (such as stemming and part-of-speech tagging).


Index (search engine) - Format analysis

If the search engine supports multiple document formats, documents must be prepared for tokenization.


Index (search engine) - Section recognition

Some search engines incorporate section recognition, the identification of major parts of a document, prior to tokenization.


Index (search engine) - Meta tag indexing

The design of the HTML markup language initially included support for meta tags for the very purpose of being properly and easily indexed, without requiring tokenization. (Berners-Lee, T., Hypertext Markup Language - 2.0, RFC 1866, Network Working Group, November 1995)


Applesoft BASIC - Speed issues, features

Furthermore, because the language used tokenization, a programmer had to avoid using any consecutive letters that were also Applesoft commands or operations: one could not use the name SCORE for a variable, because the interpreter would read the OR as a Boolean operator, rendering it SC OR E; nor could one use BACKGROUND, because the command GR invoked the low-resolution graphics mode, in this case creating a syntax error.
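A toy illustration (not Applesoft's actual tokenizer) of why such names broke: the tokenizer scans for reserved words wherever they occur, even inside what the programmer intended as a single variable name; the keyword list is a tiny invented subset.

```python
KEYWORDS = ["PRINT", "AND", "OR", "GR"]   # small subset for the example

def crunch(source: str) -> list[str]:
    tokens, buf, i = [], "", 0
    while i < len(source):
        for kw in KEYWORDS:
            if source[i:i + len(kw)] == kw:
                if buf:
                    tokens.append(buf)
                    buf = ""
                tokens.append(f"<{kw}>")   # the keyword is stored as a single token (one byte in Applesoft)
                i += len(kw)
                break
        else:
            buf += source[i]
            i += 1
    if buf:
        tokens.append(buf)
    return tokens

print(crunch("SCORE=10"))      # ['SC', '<OR>', 'E=10']      -- SCORE is split around OR
print(crunch("BACKGROUND=1"))  # ['BACK', '<GR>', 'OUND=1']  -- GR is recognized mid-name
```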


Identifier - In computer languages

However, a common restriction is not to permit whitespace characters and language operators; this simplifies tokenization by making the language free-form and context-free.


Identifier - In computer languages

This overlap can be handled in various ways: these words may be forbidden from being identifiers – which simplifies tokenization and parsing – in which case they are reserved words; they may both be allowed but distinguished in other ways, such as via stropping; or keyword sequences may be allowed as identifiers, with the intended sense determined from context, which requires a context-sensitive lexer.


Tokens - Computing

• Tokenization (data security), the process of substituting a sensitive data element with a non-sensitive equivalent




Underscore - Multi-word identifiers

However, spaces are not typically permitted inside identifiers, as they are treated as delimiters between tokens.


W-shingling

The document "a rose is a rose is a rose" can be tokenized as follows: (a, rose, is, a, rose, is, a, rose).
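A short sketch of how those tokens then feed w-shingling (w = 4 here): every contiguous run of w tokens becomes a shingle, and the set of unique shingles is what gets compared between documents.

```python
def shingles(tokens: list[str], w: int = 4) -> set[tuple[str, ...]]:
    # every contiguous run of w tokens is one shingle; duplicates collapse in the set
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

tokens = "a rose is a rose is a rose".split()
print(shingles(tokens, w=4))
# 3 unique shingles: ('a','rose','is','a'), ('rose','is','a','rose'), ('is','a','rose','is')
```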


Slot machines - Description

Recently, some casinos have chosen to take advantage of a concept commonly known as tokenization, where one token buys more than one credit.


VTD-XML - Non-Extractive, Document-Centric Parsing

Traditionally, a lexical analyzer represents tokens (the small units of indivisible character values) as discrete string objects. This approach is designated extractive parsing. In contrast, non-extractive tokenization mandates that one keep the source text intact, and use offsets and lengths to describe those tokens.
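A sketch of the non-extractive idea (the markup-versus-text split is simplified for illustration and is not VTD-XML's actual tokenizer): the source stays intact and each token is recorded only as an offset and a length, with substrings materialized on demand.

```python
import re

def non_extractive_tokenize(source: str) -> list[tuple[int, int]]:
    # record (offset, length) for each token instead of creating string objects
    return [(m.start(), len(m.group())) for m in re.finditer(r"<[^>]+>|[^<]+", source)]

source = "<note><to>Tove</to></note>"
records = non_extractive_tokenize(source)
print(records)                           # [(0, 6), (6, 4), (10, 4), (14, 5), (19, 7)]

offset, length = records[2]
print(source[offset:offset + length])    # 'Tove' -- sliced from the original text only when needed
```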


CipherCloud

(Hickey, CipherCloud Uses Encryption, Tokenization to Bolster Cloud Security, CRN, February 14, 2011)


CipherCloud - Platform

(… Snooping, The Washington Times, August 18, 2013) The company uses tokenization, which is the process of substituting a sensitive data element with a non-sensitive equivalent.


Parsing expression grammar - Advantages

Parsers for languages expressed as a CFG, such as LR parsers, require a separate tokenization step to be done first, which breaks up the input based on the location of spaces, punctuation, etc. The tokenization is necessary because of the way these parsers use lookahead to parse CFGs that meet certain requirements in linear time. PEGs do not require tokenization to be a separate step, and tokenization rules can be written in the same way as any other grammar rule.
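A small scannerless sketch in that spirit (a hand-written recursive-descent recognizer, not a generated PEG parser): the "lexical" rules spacing() and number() are written as ordinary grammar rules next to the structural rule sum(), so no separate tokenization pass is needed.

```python
import re

class Parser:
    # Grammar: sum <- spacing number ('+' spacing number)*
    def __init__(self, text: str):
        self.text, self.pos = text, 0

    def spacing(self):                       # a lexical rule, expressed like any other rule
        while self.pos < len(self.text) and self.text[self.pos].isspace():
            self.pos += 1

    def number(self) -> int:                 # another lexical rule
        m = re.match(r"\d+", self.text[self.pos:])
        if not m:
            raise SyntaxError(f"number expected at position {self.pos}")
        self.pos += m.end()
        self.spacing()
        return int(m.group())

    def sum(self) -> int:                    # the structural rule
        self.spacing()
        total = self.number()
        while self.pos < len(self.text) and self.text[self.pos] == "+":
            self.pos += 1
            self.spacing()
            total += self.number()
        return total

print(Parser("12 + 30+ 8").sum())   # 50
```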


ProPay

ProPay, Inc. is an American financial services company headquartered in Lehi, Utah. The company provides payment solutions that include merchant accounts, payment processing, ACH services, pre-paid cards, and other payment-related products. ProPay also provides end-to-end encryption and tokenization services. In December 2012, ProPay was acquired by Total System Services, Inc. (TSYS), a publicly traded company (NYSE: TSS).


ProPay - History

In 2009, ProPay was among a handful of companies that began to offer an end-to-end encryption and tokenization service (ProPay Unlocks ProtectPay Encrypted Credit Card Processing, TMC.net, 02/20/2009). At that time, ProPay also introduced the MicroSecure Card Reader®, allowing small merchants to securely accept card-present transactions (Pocket Credit Card Reader Takes Transactions on the Go, PC World, 01/07/2009). In 2010, ProPay received the Independent Sales Organization of the Year award from the Electronic Transactions Association (ProPay Receives 2010 Electronic Transaction Association ISO of the Year Award, Silicon Slopes, 04/20/2010).


Casio fx-7000G - Programming

Tokenization is performed by using characters and symbols in place of long lines of code to minimize the amount of memory being used.


Cuban art

A movement that mirrored this artistic piece was underway, in which the shape of Cuba became a token in the artwork, in a phase known as tokenization.
