Introduction & Tokenization
Ling570: Shallow Processing Techniques for NLP
September 28, 2011
Roadmap
- Course Overview
- Tokenization
- Homework #1
Course Overview
Course Information
- Course web page: http://courses.washington.edu/ling570
- Syllabus: schedule and readings; links to other readings, slides, and class recordings
  - Slides posted before class, but may be revised
- Catalyst tools:
  - GoPost discussion board for class issues
  - CollectIt dropbox for homework submission and TA comments
  - Gradebook for viewing all grades
GoPost Discussion Board
- Main venue for course-related questions and discussion
- What not to post: personal, confidential questions; homework solutions
- What to post: almost anything else course-related
  - "Can someone explain...?" "Is this really supposed to take this long to run?"
- Key location for class participation: post questions or answers
- Your discussion space: Sanghoun & I will not jump in often
GoPost
- Emily's 5-minute rule: if you've been stuck on a problem for more than 5 minutes, post to the GoPost!
- Mechanics: please use your UW NetID as your user ID
- Please post early and often! Don't wait until the last minute.
- Notifications: decide how you want to receive GoPost postings
Email
- Should be used only for personal or confidential issues: grading issues, extended absences, other problems
- General questions/comments go on GoPost
- Please send email from your UW account; include Ling570 in the subject
- If you don't receive a reply in 24 hours, please follow up
Homework Submission
- All homework should be submitted through CollectIt as a tar file:
    tar cvf hw1.tar hw1_dir
- Homework is due 11:45 Wednesdays
- Late homework receives a 10%/day penalty (incremental)
- Most major programming languages are accepted: C/C++/C#, Java, Python, Perl, Ruby
  - If you want to use something else, please check first
- Please follow the naming and organization guidelines in the HW
- Expect to spend 10-20 hours/week, including HW docs
Grading
- Assignments: 90%
- Class participation: 10%
- No midterm or final exams
- Grades in Catalyst Gradebook
- TA feedback returned through CollectIt
- Incomplete: only if all work is completed up to the last two weeks (UW policy)
Recordings
- All classes will be recorded; links to recordings appear in the syllabus
- Available to all students, DL and in-class
- Please remind me to: record the meeting (look for the red dot); repeat in-class questions
- Note: the instructor's screen is projected in class, so assume the chat window is always public
Contact Info
- Gina: Email: [email protected]
  - Office hour: Fridays 10-11 (before Treehouse meeting); Location: Padelford B-201; or by arrangement
  - Available by Skype or Adobe Connect; all DL students should arrange a short online meeting
- TA: Sanghoun Song: Email: [email protected]
  - Office hour: Time: TBD, see GoPost; Location:
Online Option
- Please check that you are registered for the correct section:
  - CLMS online: Section A
  - State-funded: Section B
  - CLMS in-class: Section C
  - NLT/SCE online (or in-class): Section D
- Online attendance for in-class students: not more than 3 times per term (e.g., missed bus, ice)
- Please enter the meeting room 5-10 minutes before the start of class
- Try to stay online throughout class
Online Tip
If you see:

  "You are not logged into Connect. The problem is one of the following: the permissions on the resource you are trying to access are incorrectly set. Please contact your instructor/Meeting Host/etc.; or you do not have a Connect account but need to have one. For UWEO students: If you have just created your UW NetID or just enrolled in a course....."

clear your cache, then close and restart your browser.
Course Description
Course Prerequisites
- Programming languages: Java/C++/Python/Perl/...
- Operating systems: basic Unix/Linux
- CS 326 (Data Structures) or equivalent:
  - Lists, trees, queues, stacks, hash tables, ...
  - Sorting, searching, dynamic programming, ...
  - Automata, regular expressions, ...
- Stat 391 (Probability and Statistics): random variables, conditional probability, Bayes' rule, ...
Textbook
- Jurafsky and Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2nd edition, 2008
  - Available from the UW Bookstore, Amazon, etc.
- Reference: Manning and Schutze, Foundations of Statistical Natural Language Processing
Topics in Ling570
- Unit #1: Formal Languages and Automata (2-3 weeks)
  - Formal languages
  - Finite-state automata
  - Finite-state transducers
  - Morphological analysis
- Unit #2: Ngram Language Models and HMMs
  - Ngram language models and smoothing
  - Part-of-speech (POS) tagging: HMM, Ngram
Topics in Ling570
- Unit #3: Classification (2-3 weeks)
  - Intro to classification
  - POS tagging with classifiers
  - Chunking
  - Named Entity (NE) recognition
- Other topics (2 weeks)
  - Intro, tokenization
  - Clustering
  - Information extraction
  - Summary
Roadmap
- Motivation: applications
- Language and thought
- Knowledge of language
- Cross-cutting themes: ambiguity, evaluation, & multi-linguality
- Course overview
Motivation: Applications
Applications of speech and language processing:
- Call routing
- Information retrieval
- Question-answering
- Machine translation
- Dialog systems
- Spam tagging
- Spell- and grammar-checking
- Sentiment analysis
- Information extraction
- ...
Shallow vs Deep Processing
- Shallow processing (Ling 570)
  - Usually relies on surface forms (e.g., words)
  - Less elaborate linguistic representations
  - E.g., part-of-speech tagging, morphology, chunking
- Deep processing (Ling 571)
  - Relies on more elaborate linguistic representations
  - Deep syntactic analysis (parsing)
  - Rich spoken language understanding (NLU)
![Page 23: Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649d2c5503460f94a02f6c/html5/thumbnails/23.jpg)
Shallow or Deep?
Applications of speech and language processing: call routing, information retrieval, question-answering, machine translation, dialog systems, spam tagging, spell- and grammar-checking, sentiment analysis, information extraction, ...
![Page 24: Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649d2c5503460f94a02f6c/html5/thumbnails/24.jpg)
Language & Intelligence
- Turing Test (1950): operationalize intelligence
  - Two contestants: human, computer
  - Judge: human
  - Test: interact via text questions
  - Question: can you tell which contestant is human?
- Crucially requires language use and understanding
![Page 25: Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649d2c5503460f94a02f6c/html5/thumbnails/25.jpg)
Limitations of Turing TestELIZA (Weizenbaum 1966)
Simulates Rogerian therapist User: You are like my father in some waysELIZA: WHAT RESEMBLANCE DO YOU SEEUser: You are not very aggressiveELIZA: WHAT MAKES YOU THINK I AM NOT
AGGRESSIVE…
![Page 26: Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649d2c5503460f94a02f6c/html5/thumbnails/26.jpg)
Limitations of Turing TestELIZA (Weizenbaum 1966)
Simulates Rogerian therapist User: You are like my father in some waysELIZA: WHAT RESEMBLANCE DO YOU SEEUser: You are not very aggressiveELIZA: WHAT MAKES YOU THINK I AM NOT
AGGRESSIVE...Passes the Turing Test!! (sort of)
![Page 27: Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649d2c5503460f94a02f6c/html5/thumbnails/27.jpg)
Limitations of Turing TestELIZA (Weizenbaum 1966)
Simulates Rogerian therapist User: You are like my father in some waysELIZA: WHAT RESEMBLANCE DO YOU SEEUser: You are not very aggressiveELIZA: WHAT MAKES YOU THINK I AM NOT
AGGRESSIVE...Passes the Turing Test!! (sort of)“You can fool some of the people....”
![Page 28: Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649d2c5503460f94a02f6c/html5/thumbnails/28.jpg)
Limitations of Turing Test
- ELIZA (Weizenbaum 1966): simulates a Rogerian therapist
    User:  You are like my father in some ways
    ELIZA: WHAT RESEMBLANCE DO YOU SEE
    User:  You are not very aggressive
    ELIZA: WHAT MAKES YOU THINK I AM NOT AGGRESSIVE
    ...
- Passes the Turing Test!! (sort of) "You can fool some of the people...."
- Simple pattern-matching technique; very shallow processing
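The pattern-matching idea can be sketched in a few lines of Python. The rules below are hypothetical stand-ins, not Weizenbaum's actual script: each rule matches a surface pattern and reflects the user's own words back.

```python
import re

# Toy ELIZA-style rules: (pattern, response template).
# These two rules are illustrative, not ELIZA's real rule set.
RULES = [
    (re.compile(r"you are (.*)", re.I), "WHAT MAKES YOU THINK I AM {0}"),
    (re.compile(r"i am (.*)", re.I), "WHY DO YOU SAY YOU ARE {0}"),
]

def respond(utterance: str) -> str:
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            # Reflect the user's words back, stripped of final punctuation.
            return template.format(match.group(1).rstrip(".!?").upper())
    return "TELL ME MORE"  # catch-all when no pattern matches

print(respond("You are not very aggressive"))
# -> WHAT MAKES YOU THINK I AM NOT AGGRESSIVE... minus any real understanding:
# the program never represents what "aggressive" means.
```

Note there is no linguistic analysis at all here: no parse, no semantics, just string substitution, which is exactly why ELIZA's apparent success says little about intelligence.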
![Page 29: Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649d2c5503460f94a02f6c/html5/thumbnails/29.jpg)
Turing Test Revived“On the web, no one knows you’re a….”
Problem: ‘bots’Automated agents swamp servicesChallenge: Prove you’re human
Test: Something human can do, ‘bot can’t
![Page 30: Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649d2c5503460f94a02f6c/html5/thumbnails/30.jpg)
Turing Test Revived“On the web, no one knows you’re a….”
Problem: ‘bots’Automated agents swamp servicesChallenge: Prove you’re human
Test: Something human can do, ‘bot can’t
Solution: CAPTCHAsDistorted images: trivial for human; hard for ‘bot
![Page 31: Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649d2c5503460f94a02f6c/html5/thumbnails/31.jpg)
Turing Test Revived
- "On the web, no one knows you're a...."
- Problem: 'bots': automated agents swamp services
- Challenge: prove you're human
- Test: something a human can do, a 'bot can't
- Solution: CAPTCHAs
  - Distorted images: trivial for a human, hard for a 'bot
  - Key: perception, not reasoning
![Page 32: Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649d2c5503460f94a02f6c/html5/thumbnails/32.jpg)
Knowledge of LanguageWhat does HAL (of 2001, A Space Odyssey)
need to know to converse?
Dave: Open the pod bay doors, HAL.
HAL: I'm sorry, Dave. I'm afraid I can't do that.
![Page 33: Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649d2c5503460f94a02f6c/html5/thumbnails/33.jpg)
Knowledge of Language
What does HAL (of 2001: A Space Odyssey) need to know to converse?

  Dave: Open the pod bay doors, HAL. (request)
  HAL: I'm sorry, Dave. I'm afraid I can't do that. (statement)

- Phonetics & Phonology (Ling 450/550)
  - Sounds of a language, acoustics
  - Legal sound sequences in words
- Morphology (Ling 570)
  - Recognize, produce variation in word forms
  - Singular vs. plural: door + SG -> door; door + PL -> doors
  - Verb inflection: be + 1st person, SG, present -> am
- Part-of-speech tagging (Ling 570)
  - Identify word use in a sentence
  - Bay (noun), not verb or adjective
- Syntax (Ling 566: analysis; Ling 570: chunking; Ling 571: parsing)
  - Order and group words in a sentence
  - *I'm I do, sorry that afraid Dave I can't.
- Semantics (Ling 571)
  - Word meaning: individual (lexical), combined (compositional)
  - 'Open': AGENT cause THEME to become open
  - 'pod bay doors': (pod bay) doors
- Pragmatics/Discourse/Dialogue (Ling 571, maybe)
  - Interpret utterances in context
  - Speech act: request, statement
  - Reference resolution: I = HAL; that = 'open doors'
  - Politeness: I'm sorry, I'm afraid I can't
![Page 39: Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649d2c5503460f94a02f6c/html5/thumbnails/39.jpg)
Cross-cutting Themes
- Ambiguity: how can we select among alternative analyses?
- Evaluation: how well does this approach perform on a standard data set? When incorporated into a full system?
- Multi-linguality: can we apply this approach to other languages? How much do we have to modify it to do so?
![Page 40: Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649d2c5503460f94a02f6c/html5/thumbnails/40.jpg)
Ambiguity“I made her duck”
Means....
![Page 41: Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649d2c5503460f94a02f6c/html5/thumbnails/41.jpg)
Ambiguity“I made her duck”
Means.... I caused her to duck down
![Page 42: Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649d2c5503460f94a02f6c/html5/thumbnails/42.jpg)
Ambiguity“I made her duck”
Means.... I caused her to duck down I made the (carved) duck she has
![Page 43: Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649d2c5503460f94a02f6c/html5/thumbnails/43.jpg)
Ambiguity“I made her duck”
Means.... I caused her to duck down I made the (carved) duck she has I cooked duck for her
![Page 44: Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649d2c5503460f94a02f6c/html5/thumbnails/44.jpg)
Ambiguity“I made her duck”
Means.... I caused her to duck down I made the (carved) duck she has I cooked duck for her I cooked the duck she owned
![Page 45: Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649d2c5503460f94a02f6c/html5/thumbnails/45.jpg)
Ambiguity
"I made her duck"

Means....
- I caused her to duck down
- I made the (carved) duck she has
- I cooked duck for her
- I cooked the duck she owned
- I magically turned her into a duck
![Page 46: Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649d2c5503460f94a02f6c/html5/thumbnails/46.jpg)
Ambiguity: POS
"I made her duck"
- 'made': V
- 'duck': N (the carved duck, the duck she owned) or V (duck down)
- 'her': Pron (cooked duck for her, turned her into a duck) or Poss (her duck)
![Page 47: Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649d2c5503460f94a02f6c/html5/thumbnails/47.jpg)
Ambiguity: Syntax“I made her duck”
Means.... I made the (carved) duck she has
((VP (V made) (NP (POSS her) (N duck)))
![Page 48: Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649d2c5503460f94a02f6c/html5/thumbnails/48.jpg)
Ambiguity: Syntax
"I made her duck"
- I made the (carved) duck she has:
  (VP (V made) (NP (POSS her) (N duck)))
- I cooked duck for her:
  (VP (V made) (NP (PRON her)) (NP (N duck)))
![Page 49: Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649d2c5503460f94a02f6c/html5/thumbnails/49.jpg)
Ambiguity
- Pervasive
- Pernicious
- Particularly challenging for computational systems
- A problem we will return to again and again in class
![Page 50: Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649d2c5503460f94a02f6c/html5/thumbnails/50.jpg)
TokenizationGiven input text, split into words or sentences
Tokens: words, numbers, punctuation
Example:Sherwood said reaction has been "very positive.”Sherwood said reaction has been ” very positive . "
Why tokenize?
![Page 51: Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649d2c5503460f94a02f6c/html5/thumbnails/51.jpg)
Tokenization
- Given input text, split it into words or sentences
- Tokens: words, numbers, punctuation
- Example:
    Input:  Sherwood said reaction has been "very positive."
    Output: Sherwood said reaction has been " very positive . "
- Why tokenize? Identify basic units for downstream processing
![Page 52: Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649d2c5503460f94a02f6c/html5/thumbnails/52.jpg)
TokenizationProposal 1: Split on white space
Good enough
![Page 53: Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649d2c5503460f94a02f6c/html5/thumbnails/53.jpg)
TokenizationProposal 1: Split on white space
Good enough? No
Why not?
![Page 54: Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649d2c5503460f94a02f6c/html5/thumbnails/54.jpg)
TokenizationProposal 1: Split on white space
Good enough? No
Why not?Multi-linguality:
Languages without white space delimiters: Chinese, Japanese
Agglutinative languages (Finnish)Compounding languages (German)
E.g. Lebensversicherungsgesellschaftsangestellter
“Life insurance company employee”
![Page 55: Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649d2c5503460f94a02f6c/html5/thumbnails/55.jpg)
Tokenization
- Proposal 1: split on white space
- Good enough? No. Why not?
- Multi-linguality:
  - Languages without white-space delimiters: Chinese, Japanese
  - Agglutinative languages (Finnish)
  - Compounding languages (German), e.g. Lebensversicherungsgesellschaftsangestellter, "life insurance company employee"
- Even for English, it misses punctuation
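Proposal 1 is a one-liner in Python, and running it on the slide's own example shows the English failure mode: punctuation stays glued to the adjacent words.

```python
# Proposal 1: tokenize by splitting on white space only.
def whitespace_tokenize(text: str) -> list[str]:
    return text.split()

tokens = whitespace_tokenize('Sherwood said reaction has been "very positive."')
print(tokens[-1])  # -> positive."  (the quote and period are not split off)
```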
Tokenization - again
- Proposal 2: split on white space and punctuation (for English)
- Good enough? No
- Problem: non-splitting punctuation
  - 1.23 -> 1 . 23 (wrong); 1,23 -> 1 , 23 (wrong)
  - don't -> don t (wrong); E-mail -> E mail (wrong); etc.
- Problem: non-splitting whitespace
  - Names: New York; collocations: pick up
- What's a word?
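Proposal 2 can be sketched with a regular expression that makes every punctuation character its own token; one run reproduces the non-splitting-punctuation failures above, tearing apart both the contraction and the decimal.

```python
import re

# Proposal 2: split on white space AND punctuation
# (every punctuation mark becomes its own token).
def punct_tokenize(text: str) -> list[str]:
    return re.findall(r"\w+|[^\w\s]", text)

print(punct_tokenize("don't pay 1.23"))
# -> ['don', "'", 't', 'pay', '1', '.', '23']  (both splits are wrong)
```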
![Page 61: Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649d2c5503460f94a02f6c/html5/thumbnails/61.jpg)
Sentence SegmentationSimilar issues
Proposal: Split on ‘.’
Problems?
![Page 62: Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649d2c5503460f94a02f6c/html5/thumbnails/62.jpg)
Sentence SegmentationSimilar issues
Proposal: Split on ‘.’
Problems?Other punctuation: ?,!,etcNon-boundary periods:
1.23Mr. SherwordM.p.g.….
Solutions?
![Page 63: Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649d2c5503460f94a02f6c/html5/thumbnails/63.jpg)
Sentence Segmentation
- Similar issues
- Proposal: split on '.'
- Problems?
  - Other punctuation: ?, !, etc.
  - Non-boundary periods: 1.23, Mr. Sherwood, m.p.g., ...
- Solutions?
  - Rules, dictionaries (esp. abbreviations)
  - Marked-up text + machine learning
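The rules-plus-dictionary solution can be sketched as a toy splitter that breaks on ., ?, ! except after a known abbreviation or before a digit (a decimal point). The abbreviation set here is a tiny illustrative stand-in for a real lexicon.

```python
import re

# Toy rule-based sentence splitter; ABBREVS is a hypothetical mini-lexicon.
ABBREVS = {"mr", "mrs", "dr"}

def split_sentences(text: str) -> list[str]:
    sentences, start = [], 0
    for m in re.finditer(r"[.?!]", text):
        i = m.start()
        words = text[start:i].split()
        last = words[-1].lower() if words else ""
        next_is_digit = i + 1 < len(text) and text[i + 1].isdigit()
        if m.group() == "." and (last in ABBREVS or next_is_digit):
            continue  # abbreviation or decimal point: not a boundary
        sentences.append(text[start:i + 1].strip())
        start = i + 1
    if text[start:].strip():
        sentences.append(text[start:].strip())  # trailing unterminated text
    return sentences

print(split_sentences("Mr. Sherwood paid 1.23 dollars. Was it worth it?"))
# -> ['Mr. Sherwood paid 1.23 dollars.', 'Was it worth it?']
```

A real system would learn these decisions from marked-up text rather than hand-listing abbreviations, which is the second solution the slide mentions.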
![Page 64: Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649d2c5503460f94a02f6c/html5/thumbnails/64.jpg)
Homework #1
![Page 65: Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649d2c5503460f94a02f6c/html5/thumbnails/65.jpg)
General Notes and Naming
- All code must run on the CL cluster under Condor
- Each programming question needs a corresponding Condor command file, named e.g. hw1_q1.cmd, which should run with:
    $ condor_submit hw1_q1.cmd
- General comments should appear in hw1.(pdf|txt)
- Your code may be tested on new data
![Page 66: Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649d2c5503460f94a02f6c/html5/thumbnails/66.jpg)
Q1-Q2: Building and Testing a Rule-Based English Tokenizer
- Q1: eng_tokenize.*
  - The Condor file must contain: Input = <input_file>, Output = <output_file>
  - You may use additional arguments
  - The ith input line should correspond to the ith output line
  - Don't waste too much time making it perfect!!
- Q2: Explain how your tokenizer handles different phenomena (numbers, hyphenated words, etc.) and identify remaining problems.
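A sketch of what such a submit file might look like, assuming the tokenizer is wrapped in an executable script (eng_tokenize.sh is a hypothetical name); only the Input/Output lines are required by the assignment, and the other fields are standard Condor submit commands shown for illustration:

```
# hw1_q1.cmd: sketch of a Condor submit file for Q1
executable = eng_tokenize.sh
input      = <input_file>
output     = <output_file>
error      = hw1_q1.err
log        = hw1_q1.log
queue
```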
![Page 67: Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649d2c5503460f94a02f6c/html5/thumbnails/67.jpg)
Q3: make_vocab.*
Given the input text:

  The teacher bought the bike. The bike is expensive.

the output should be:

  The 2
  the 1
  teacher 1
  bought 1
  bike. 1
  bike 1
  is 1
  expensive. 1
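The Q3 counts can be reproduced with a simple counter over whitespace-separated tokens; the ordering used here (descending count, then first appearance) is an assumption, and the required ordering is whatever the assignment specifies.

```python
from collections import Counter

# Count whitespace-separated tokens, as in the Q3 example.
# Note "The" and "the", and "bike." and "bike", count as distinct
# types, which is exactly what Q4's comparison is probing.
def make_vocab(text: str) -> list[tuple[str, int]]:
    counts = Counter(text.split())
    return sorted(counts.items(), key=lambda item: -item[1])

for word, n in make_vocab("The teacher bought the bike. The bike is expensive."):
    print(word, n)  # first line: The 2
```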
![Page 68: Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649d2c5503460f94a02f6c/html5/thumbnails/68.jpg)
Q4: Investigating Tokenization
- Run the programs from Q1 and Q3
- Compare vocabularies with and without tokenization
![Page 69: Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011](https://reader035.vdocuments.net/reader035/viewer/2022062407/56649d2c5503460f94a02f6c/html5/thumbnails/69.jpg)
Next Time
- Probability
- Formal languages
- Finite-state automata