
Lecture 1:

Course Introduction, Setup

Ling 1330/2330 Computational Linguistics

Na-Rae Han, 8/27/2019


Objectives

Course Introduction

Student survey

Syllabus: http://www.pitt.edu/~naraehan/ling1330/index.html

CourseWeb

Computational linguistics: what it is, what we will learn

Introduction to Python 3

Using Python Interactive Shell (ex. IDLE, IPython)

Setup, housekeeping

A warm-up: processing a string

Language engineering applications

Language engineering is at the center of many essential tools and applications:

Internet search engines (Google, Yahoo, Bing...)

Automatic (machine) translation systems

Voice and speech recognition (ASR telephone systems)

Smart assistants (Alexa, Siri, Cortana, Google Assistant)

Automatic essay scoring systems (e-rater)

Spelling and grammatical error correction

... and many more.

Language engineering & AI

Computational linguistics: an interdisciplinary field dealing with modeling of natural language from a computational perspective.

A “natural language”?

Any real-world human language.

cf. constructed, artificial, or computer languages such as: formal (logic) language, computer-programming languages, Klingon, etc.

Areas: natural language processing (NLP), natural language understanding, speech processing and synthesis, dialogue systems…

part of AI (Artificial Intelligence)

Two types of architecture

Approach 1: rule-based, symbolic

A finite set of hand-written rules is composed by trained linguists

A system applies these rules in a deterministic fashion

that is a beautiful work → noun

my parents work too much → verb

“work is: (1) a noun if following an adjective or preposition, (2) a verb if following a plural noun.”

Logic and semantic relations such as entailment are programmed in.

Ex. If "Every X does Y" is true, "Some X does Y" is also true.
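
The rule for "work" above can be sketched in a few lines of Python. This is only a toy illustration of the rule-based idea: the word lists and the helper names `tag_work` and `tag_sentence` are made up for this example and are not part of the course materials.

```python
# Toy rule-based tagger for the word "work" (illustrative only).
# The tiny word lists below are assumptions, not a real lexicon.
ADJECTIVES = {"beautiful", "hard", "mountainous"}
PREPOSITIONS = {"at", "to", "after"}
PLURAL_NOUNS = {"parents", "students", "linguists"}


def tag_work(previous_word):
    """Apply the hand-written rules deterministically to one context."""
    prev = previous_word.lower()
    if prev in ADJECTIVES or prev in PREPOSITIONS:
        return "noun"
    if prev in PLURAL_NOUNS:
        return "verb"
    return "unknown"  # no rule fires


def tag_sentence(sentence):
    """Return one tag for each occurrence of 'work' that has a preceding word."""
    words = sentence.lower().split()
    return [
        tag_work(words[i - 1])
        for i, word in enumerate(words)
        if word == "work" and i > 0
    ]


print(tag_sentence("that is a beautiful work"))  # ['noun']
print(tag_sentence("my parents work too much"))  # ['verb']
```

The same input always yields the same tag, and any context the rules do not anticipate falls through to "unknown".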

Two types of architecture

Approach 2: statistical, data-driven

First, a huge collection of human-annotated text (“corpus”) is built

An algorithm automatically extracts statistical patterns from the corpus:

If work follows an adjective (ADJ), it is a noun x% of the time.

If work follows a plural noun (NNS), it is a verb y% of the time.

To process a new text, a system applies these rules in a probabilistic fashion, making a prediction

ex: mountainous/ADJ work/?

→ NN (probability 0.8) vs. VB (0.2)

Known as machine learning

This is the current dominant approach

Fueled by BIG DATA, aka corpora
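
As a rough sketch of the data-driven idea, the code below counts how often "work" receives each tag after a given preceding tag, then turns the counts into probabilities. The seven-sentence "corpus" is invented and deliberately contrived so that the ADJ case comes out at 0.8 vs. 0.2 like the example above; the function names `train` and `predict` are likewise made up for illustration.

```python
from collections import Counter, defaultdict

# A tiny, made-up annotated corpus of (word, tag) pairs.
# Real corpora are vastly larger; these counts are contrived.
TAGGED_SENTENCES = [
    [("hard", "ADJ"), ("work", "NN")],
    [("good", "ADJ"), ("work", "NN")],
    [("beautiful", "ADJ"), ("work", "NN")],
    [("honest", "ADJ"), ("work", "NN")],
    [("rich", "ADJ"), ("work", "VB")],
    [("parents", "NNS"), ("work", "VB")],
    [("students", "NNS"), ("work", "VB")],
]


def train(tagged_sentences, target="work"):
    """Count the tags of `target`, grouped by the tag of the preceding word."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for (_, prev_tag), (word, tag) in zip(sentence, sentence[1:]):
            if word == target:
                counts[prev_tag][tag] += 1
    return counts


def predict(counts, prev_tag):
    """Turn the raw counts for one context into relative frequencies."""
    total = sum(counts[prev_tag].values())
    return {tag: n / total for tag, n in counts[prev_tag].items()}


counts = train(TAGGED_SENTENCES)
print(predict(counts, "ADJ"))  # {'NN': 0.8, 'VB': 0.2}
print(predict(counts, "NNS"))  # {'VB': 1.0}
```

For mountainous/ADJ work/?, such a system would pick NN because it is the more probable tag in that context, even though "mountainous work" never appeared in the training data.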

Corpus

Corpus (pl: corpora): a collection of electronic text

Some corpora are just a collection of raw texts; others contain richer linguistic annotation (part-of-speech, syntactic structure, named entities, and more)

Is an essential resource for:

data-oriented linguistic research

language engineering

These days, corpora are subsumed under …

Data. (big, small, naturally occurring…)
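
To make the raw vs. annotated distinction concrete, here is a minimal sketch using plain Python lists. The two sentences come from the earlier slides; the part-of-speech labels are illustrative, loosely in Penn Treebank style, and the variable names are assumptions made for this example.

```python
# Raw corpus: just text, with no linguistic information attached.
raw_corpus = [
    "My parents work too much.",
    "That is a beautiful work.",
]

# Annotated corpus: every token is paired with a part-of-speech tag.
# The tags are illustrative, not taken from a real annotated corpus.
annotated_corpus = [
    [("My", "PRP$"), ("parents", "NNS"), ("work", "VBP"),
     ("too", "RB"), ("much", "RB"), (".", ".")],
    [("That", "DT"), ("is", "VBZ"), ("a", "DT"),
     ("beautiful", "JJ"), ("work", "NN"), (".", ".")],
]

# Raw text supports simple surface counts.
raw_hits = sum(sentence.lower().count("work") for sentence in raw_corpus)
print(raw_hits)  # 2

# Annotation lets us ask how each occurrence is actually used.
work_tags = [
    tag
    for sentence in annotated_corpus
    for word, tag in sentence
    if word == "work"
]
print(work_tags)  # ['VBP', 'NN']
```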

Role of linguists

What is the role of linguists in all of this?

Linguistic expertise intersecting with engineering is in high demand. Tasks include:

Build linguistic infrastructures: grammars, lexicons, ontologies (knowledge bank)

Design and manage linguistic annotation projects

Consult for quality enhancement (machine translation systems, search results, etc.)

Build NLP systems, train models

Set standards and guidelines for localization

Working knowledge of computational linguistics is essential

Computational linguistics

What is the goal of computational linguistics?

Represent human language in terms of information processing.

Theoretical computational linguistics

Investigates the computability of human language and linguistic theories.

Formal language theory and automata theory.

Applied computational linguistics

The focus is on building (working) models of human language.

Natural language processing (NLP) and engineering (NLE), human language technologies (HLT), artificial intelligence (AI)

Often centers around building practical applications: spell checkers, machine translation systems, automatic graders, etc.

Two distinct flavors: symbolic vs. statistical approaches

Computational linguistics course

Old LING 1330 course description:

The class was about theoretical and symbolic computational linguistics.

In both linguistics and computer science, we need to study languages and their grammar from a mathematical point of view. This course is an introduction to the mathematical theory of language and its applications. The first half will deal mainly with elements of the theory of automata and its relation to grammars. The second half will survey ways in which this theory can be applied to English grammar and to the design of programming languages. We will concentrate on syntax, but will also pay some attention to theories of meaning.

What we learn in this class

Theoretical computational linguistics

Formal language theory

Automata, the Chomsky hierarchy

We will cover only the basic ideas.

Applied computational linguistics

Text and corpus processing

Statistical methods using NLTK

Examples of natural language processing applications:

Spelling correction, POS tagger, machine translation, text classification

How to do well in this course

1. Be hard-working. If you are fairly new to programming, expect to put in 4 credits' worth of work.

2. Be detail-oriented. Do not skim for "important" stuff! You should learn everything.

Pay attention to instructions & details when doing homework.

3. Be organized. Class moves fast, and there are a lot of assignments.

4. Be persistent. Coding has frustrating moments. Keep at it!

5. Ask for help. When you have had enough of hair-pulling, use email or office hours.


Python: script vs. shell

1. A Python script is a plain-text file with a .py extension.

2. The Python shell is an interactive environment where Python responds to each individual Python command.

Valuable tool for learning!

[Screenshots: IDLE shell (Python interpreter); Python SCRIPT file]

Following up a script in shell

Variable name is available in shell for subsequent commands
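
The sketch below imitates this workflow outside IDLE, under some assumptions. It writes a hypothetical two-line script (the file name `hello.py` and the variables `greeting` and `words` are made up) to a temporary folder, runs it with the standard library's `runpy.run_path`, and then inspects what the script left behind, much as you would at the `>>>` prompt after running a script.

```python
import os
import runpy
import tempfile

# Hypothetical script contents; the variable names are made up.
SCRIPT = "greeting = 'Hello, world!'\nwords = greeting.split()\n"

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "hello.py")
    with open(path, "w") as f:
        f.write(SCRIPT)
    # run_path executes the file and returns its global variables as a dict,
    # comparable to the names that remain in the shell after a script runs.
    leftover = runpy.run_path(path)

print(leftover["greeting"])    # Hello, world!
print(len(leftover["words"]))  # 2
```

In IDLE itself no such helper is needed: after the script runs, you can simply type `greeting` or `words` at the prompt.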

Command history

Every command you type in is stored as command history.

You can quickly bring up and edit previously entered commands using these shortcuts (IDLE):

Windows: Alt+p / Alt+n

Mac: Ctrl+p / Ctrl+n (previous/next)

But these key combos are cumbersome! Let's remap to:

(Up arrow: previous command)

(Down arrow: next command)

>>> print('Hello, world!')

Hello, world!

>>> print('Hello, worl

(You can edit command line)

Customize command history keys

Changing command history key bindings to:

(Up arrow: previous command)

(Down arrow: next command)

Instructions:

1. From the menu, go to "Options → Configure IDLE"

2. Click "Keys" tab

3. Scroll down and click on the line starting with "history-next". Click button "Get New Keys for Selection"

4. Scroll down to find "Down Arrow" and click on it.

5. The new key is now set to "<Key-Down>". Press OK.

Arrows are commonly used in shell


6. You are prompted to name your custom key scheme. Give any name.

7. Now repeat the above process for "history-previous", but this time select "Up Arrow".

Housekeeping (1)

New to programming/Python? Follow the setup below.

Use the Python.org (https://www.python.org/) distribution, version 3.7.

Windows installation & configuration (do BOTH!): http://www.pitt.edu/~naraehan/python3/getting_started_win_install.html

http://www.pitt.edu/~naraehan/python3/getting_started_win_config.html

Mac installation: http://www.pitt.edu/~naraehan/python3/getting_started_mac_install.html

➔ continued

Housekeeping (2)

Designated folder for coding work:

Create a folder within Documents called “ling1330” or “ling2330”, and store all your coding work there.

So this folder will be:

C:\Users\yourname\Documents\ling1330 (Windows)

/Users/yourname/Documents/ling1330 (Mac)

Verify your default "current working directory":

NOT good: things like C:\Program Files (x86)\Python37-32

Good! sensible default CWD
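
One way to check the current working directory from the Python shell is the standard `os` module. The target path below is only an assumption that follows the folder naming suggested above; adjust it to your own account and course number.

```python
import os

# Where is Python currently working?
cwd = os.getcwd()
print(cwd)

# Hypothetical course folder, built from your home directory.
target = os.path.join(os.path.expanduser("~"), "Documents", "ling1330")
print(target)

# True only if the folder has already been created.
print(os.path.isdir(target))

# To switch for the current session only, uncomment the next line:
# os.chdir(target)
```

If the first line printed is somewhere inside the Python installation folder, the default still needs fixing; `os.chdir` changes it only until the shell restarts.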

Warm-up: fox in sox

Let's practice string processing.

Download this template script: http://www.pitt.edu/~naraehan/ling1330/fox_in_sox.TEMPLATE.py

Complete [1] – [5].

10 minutes
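
The template's tasks [1] – [5] are not reproduced in these slides, so the following is only a generic sketch of the kind of string processing involved. The text is a made-up tongue-twister line, not the one in the template, and the steps shown are assumptions chosen for illustration.

```python
# A made-up line of text; NOT the contents of the course template.
text = "Fox in socks. Socks on fox. Fox in box."

# Normalize case, strip periods, and split into a list of words.
lowered = text.lower()
words = lowered.replace(".", "").split()
print(words)

# Count how many times one word occurs.
print(words.count("fox"))   # 3

# Unique word types, alphabetized.
types = sorted(set(words))
print(types)                # ['box', 'fox', 'in', 'on', 'socks']

# Join the first three words back into a single string.
print("-".join(words[:3]))  # fox-in-socks
```

Each of these lines can be tried one at a time in the shell to see the intermediate result before adding it to a script.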

Speed-coding fox_in_sox.py

Video of me completing the script, in 5x speed!

http://www.pitt.edu/~naraehan/ling1330/code_in_shell.mp4

Watch closely! This may well be the most important video yet… It will change your life.

How to program efficiently

The rookie way:

You are doing all your coding in the script window, blindly.

You only use the IDLE shell to verify the output of your script.

How to program efficiently

The seasoned Python programmer way:

After a script is run, all the variables and the objects in it are still available in IDLE shell for you to tinker with.

Experimenting in SHELL is much quicker – instant feedback!

You should go back and forth between SCRIPT and SHELL windows, successively building up your script.

Wrap-up

Exercise #1 out

Due Thursday 9am, on CourseWeb

Next class (Thu):

Writing structured programs