Tokenization in NLP: Responsibilities of the Lexer and the Parser


Page 1:

Tokenization in NLP: Responsibilities of the Lexer and the Parser

We can tokenize identifiers, assignment symbols, and integer literals correctly.

In general, whitespace is insignificant.

Page 2:

Protecting Sensitive Data Through Tokenization

For the input foo = 42, three tokens are recognized:

• foo (identifier)
• = (symbol)
• 42 (integer literal)

Let's consider the input foo = 42bar, which is invalid due to the (significant) missing space between 42 and bar. My lexer incorrectly recognizes the following tokens:

• foo (identifier)
• = (symbol)
• 42 (integer literal)
• bar (identifier)

Page 3:

(Cont.) Once the lexer sees the digit 4, it keeps reading until it encounters a non-digit.

It therefore consumes the 2 and stores 42 as an integer literal token. Because whitespace is insignificant, the lexer discards any whitespace (if there is any) and starts reading the next token: it finds the identifier bar.
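A minimal sketch (my illustration, not the lecture's actual lexer) makes this greedy behavior concrete: each pattern consumes as many characters as it can, so 42bar silently splits into an integer literal and an identifier.

import re

# Token patterns, tried in order; each one matches greedily.
TOKEN_SPEC = [
    ("INT",    r"\d+"),           # integer literal
    ("IDENT",  r"[A-Za-z_]\w*"),  # identifier
    ("SYMBOL", r"="),             # assignment symbol
    ("WS",     r"\s+"),           # whitespace, discarded below
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def lex(source):
    tokens = []
    for m in MASTER.finditer(source):
        if m.lastgroup != "WS":   # whitespace is insignificant
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(lex("foo = 42"))
# [('IDENT', 'foo'), ('SYMBOL', '='), ('INT', '42')]
print(lex("foo = 42bar"))
# [('IDENT', 'foo'), ('SYMBOL', '='), ('INT', '42'), ('IDENT', 'bar')]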

Page 4:

Tokenization

Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.

Example of tokenization:

Input: Friends, Romans, Countrymen, lend me your Book Please

Output: Friends | Romans | Countrymen | lend | me | your | Book | Please

Here, tokens are loosely referred to as terms or words.

1. A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.
2. A type is the class of all tokens containing the same character sequence.
3. A term is a (perhaps normalized) type that is included in the IR system's dictionary.

The set of index terms could be entirely distinct from the tokens; for instance, they could be semantic identifiers in a taxonomy. In practice, however, in modern IR systems they are strongly related to the tokens in the document.

Terms are usually derived from tokens by various normalization processes.
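A small sketch of the chopping just described: split on whitespace, strip surrounding punctuation, and contrast tokens (instances) with types (distinct character sequences).

import string

def tokenize(text):
    # Split on whitespace, then throw away surrounding punctuation.
    return [w.strip(string.punctuation) for w in text.split()]

tokens = tokenize("Friends, Romans, Countrymen, lend me your Book Please")
print(tokens)
# ['Friends', 'Romans', 'Countrymen', 'lend', 'me', 'your', 'Book', 'Please']
print(len(tokens), "tokens,", len(set(tokens)), "types")
# 8 tokens, 8 types (the counts diverge as soon as a word repeats)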

Page 5:

Tokenization (Cont.)

We can chop on whitespace and throw away punctuation characters. In English, there are a number of tricky cases. For example, various uses of the apostrophe for possession and contractions.

Sentence: Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing.

For O'Neill, which of the following is the desired tokenization?
neill   oneill   o'neill   o' neill   o neill ?

And for aren't, is it:
aren't   arent   are n't   aren t ?

Page 6:

A simple strategy is to just split on all non-alphanumeric characters. While o neill looks okay, aren t looks intuitively bad. For all of them, the choices determine which Boolean queries will match: a query of neill AND capital will match in three cases but not the other two.

By processing queries with the same tokenizer, it is guaranteed that a sequence of characters in a text will always match the same sequence typed in a query.
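To make the trade-off concrete, here is a sketch of the split-on-non-alphanumeric strategy:

import re

def split_simple(text):
    # Split on every run of non-alphanumeric characters, drop empty pieces.
    return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]

print(split_simple("O'Neill"))   # ['O', 'Neill']  -- looks okay
print(split_simple("aren't"))    # ['aren', 't']   -- looks intuitively bad

Whatever choice is made, applying the same function to both documents and queries keeps the two sides consistent.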

Page 7:

Normalization (Equivalence Classing of Terms)

Having broken the documents into tokens, the easy case is when a token in the query simply matches a token in the token list of the document. However, there are many cases in which two character sequences are not quite the same, but a match should still occur.

For example, if you search for USA, you might hope to also match documents containing U.S.A.

• Token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens. The most standard way to normalize is to implicitly create equivalence classes, normally named after one member of the set.

• For example, if the tokens anti-discriminatory and antidiscriminatory are both mapped onto the term antidiscriminatory, in both the document text and queries, then searches for one term will retrieve documents that contain either.

The advantage of just using mapping rules that remove characters like hyphens is that the equivalence classing is implicit, rather than being fully calculated in advance: the terms that happen to become identical as a result of these rules are the equivalence classes. It is only easy to write rules of this sort that remove characters.

Since the equivalence classes are implicit, it is not obvious when you might want to add characters.

For instance, it would be hard to know to turn antidiscriminatory into anti-discriminatory.
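A sketch of this implicit equivalence classing: a mapping rule that only removes characters, so every token that normalizes to the same string lands in the same class. Applied to both documents and queries, a search for either spelling retrieves both.

def normalize(token):
    # Remove hyphens and periods, fold case; classes are implicit in the output.
    return token.replace("-", "").replace(".", "").lower()

for t in ["anti-discriminatory", "antidiscriminatory", "U.S.A.", "USA"]:
    print(f"{t!r:25} -> {normalize(t)!r}")
# 'anti-discriminatory'     -> 'antidiscriminatory'
# 'antidiscriminatory'      -> 'antidiscriminatory'
# 'U.S.A.'                  -> 'usa'
# 'USA'                     -> 'usa'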

Page 9:

Small Tokenizer in Python

Please go through

https://docs.python.org/3/library/tokenize.html

Suppose we have <tag:some_text>, where tag is drawn from a finite set of values and some_text is just text.

The delimiters <, : and > can be escaped by a single \ if they appear in, and as, text.

• I used the regex r"((\\)?[<:>])" along with re.finditer to find the delimiters, then removed the backslash when necessary by checking token.startswith('\\').

• The general problem is that if more backslashes come before the delimiter, this regex is wrong, e.g. "<tag:Some \\\\< text>" -> ['<', 'tag', ':', 'Some \\\\', '<', ' text', '>']; see the sketch below.
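One way to fix this (a sketch of my own, not from the docs page above) is to treat a delimiter as escaped only when it is preceded by an odd number of backslashes, since pairs of backslashes escape each other:

import re

# A delimiter is real only when preceded by an even number of backslashes
# (pairs escape each other); (?<!\\) guarantees the run of pairs is not
# itself preceded by one more backslash.
DELIM = re.compile(r"(?<!\\)(?:\\\\)*([<:>])")

def tokenize(s):
    tokens, pos = [], 0
    for m in DELIM.finditer(s):
        d = m.start(1)               # position of the delimiter itself
        if d > pos:
            tokens.append(s[pos:d])  # text before it (unescaping is a separate pass)
        tokens.append(s[d])
        pos = d + 1
    if pos < len(s):
        tokens.append(s[pos:])       # trailing text
    return tokens

print(tokenize(r"<tag:Some \\< text>"))
# ['<', 'tag', ':', 'Some \\\\', '<', ' text', '>']
print(tokenize(r"a \< b"))
# ['a \\< b']  -- the escaped '<' stays inside the text token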

Page 10:

Tokenizing first and last name as one token

Check whether it is possible to tokenize a text such that first and last names are combined into one token.

For example, if my text is:

text = "Prem Chand is the HOD"

then text.split() results in:

['Prem', 'Chand', 'is', 'the', 'HOD']

Can we recognize the first and last name to get ['Prem Chand', 'is', 'the', 'HOD'] as tokens?
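One possible approach (assuming spaCy and its en_core_web_sm model are installed; the output depends on what the NER model actually recognizes) is to let named-entity recognition find person names, then merge each entity span into a single token:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Prem Chand is the HOD")

with doc.retokenize() as retokenizer:
    for ent in doc.ents:          # e.g. 'Prem Chand' tagged as PERSON
        retokenizer.merge(ent)    # collapse the entity span into one token

print([token.text for token in doc])
# ['Prem Chand', 'is', 'the', 'HOD']  (if the model tags 'Prem Chand')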

Page 11:

• Tokenization is not limited to NLP only…

Page 12:

AES Encryption Method of Tokenization

Here, tokenization is a method of protecting sensitive data by replacing it with surrogate values. Let's look at the AES encryption approach.

Using AES encryption, both the host (acquirer or gateway) and the merchant (e-commerce as well as brick-and-mortar merchants have demonstrated a need to store transaction data in an accessible manner for a wide range of purposes) are completely relieved of the burden of storing cardholder data in the clear. Using this approach, all sensitive data is processed within the FIPS 140-2 Level 3-validated Secure Cryptographic Device (SCD) of the Excrypt SSP Series HSM.

Benefits of the AES encryption method of tokenization:

1. Helps reduce PCI DSS compliance costs.
2. Cryptographically strong, hardware-based AES encryption protects all data at rest and in transit.
3. Supports single- and multi-use tokens, enabling multiple transactions.

Page 14:

Point-to-Point Encryption (P2PE) Technology

• By enabling Point-to-Point Encryption (P2PE) technology to be used alongside tokenization, Futurex has developed a methodology for protecting data both in transit and at rest.

Point-to-Point Encryption enables protection of cardholder data, including the Primary Account Number, from the moment it is captured at the Point of Sale.

a) Cardholder data encryption reduces the scope of PCI DSS compliance for merchants and reduces the risk of exposing card numbers, so card fraud can be reduced.

b) By directly encrypting this data at the point of capture and following immediately with tokenization, data is protected both in transit and at rest.

c) The de-tokenization process, in which tokens are replaced with the original cardholder data upon request by production systems, is of great importance as well.

By supporting tokenization, de-tokenization, and P2PE, Futurex HSMs offer a comprehensive solution for securing the entire transaction process.
