course notes for unit 1 of the udacity course cs262 programming languages


Upload: iain-mcculloch

Post on 27-Oct-2014


CS262 - Unit 1: String Patterns

Building a Web Browser Course and Project Overview Field Trip: Mozilla

Breaking Up Strings Regular Expressions Finite State Machines Conclusion Answers


Building a Web Browser

In this course you’ll learn the theory and practice of programming languages. The course will culminate in the construction of a web browser. Your web browser will take HTML and JavaScript, the primary languages of the web, as input and use them to produce an image of the webpage.

You may well be familiar with HTML, which describes the basics of web pages. However, you might be less familiar with JavaScript. This allows us to define computations in web pages so that we can have a lot of power and also some flashy graphics.

For example, sites may use JavaScript to animate tabs, so that when you scroll over them a drop-down appears. If you look at the source code for pages which have these features, you’ll see that they use both HTML and JavaScript.

Course and Project Overview

We start with the source code of a web page, which will be in HTML and JavaScript. Next, we’ll break that code down into important words. Then we need to understand the structure of the words that we’ve found. Finally, we’ll figure out the meaning of that structure.

Building the web browser should be a lot of fun, but the overall goal of the course isn’t actually to build a production-quality browser, but rather to use the goal of building the browser as a way to structure our exploration of computer science.

Field Trip: Mozilla

In this class, we’ll only be dealing with a very restricted subset of JavaScript. We don’t pay any attention to the Document Object Model, for example. You might be wondering how useful it is to focus on a subset of JavaScript or HTML. Can the skills that we’ll learn for making a lexer, parser and interpreter for a smaller language carry over to important tasks in the real world? Westley put this question to Brendan Eich, the inventor of JavaScript and the CTO of Mozilla.

The answer is yes. In Mozilla, and in other browsers, you often deal with a subset of JavaScript. There’s even something called JSON, which is more-or-less a subset of JavaScript, which is quite popular and is a useful way of describing trees of data. It turns out that you want to parse JSON really quickly, even when it’s loaded as if it were JavaScript.

A lot of JavaScript libraries, especially the ones with “query” in their name are generating compiled functions that are optimised to match a certain query against a Document Object Model. So there’s code generation going on, and a certain amount of partial evaluation going on. A subset of the language is being used to construct matches, and rules for matching trees.


Breaking Up Strings

We want to break up strings, like the source code for a web page, into important words, and we’re going to use Python to do it. Let’s say we are given some code like this:

<b>Hello 1…

One approach to breaking this up would be to use Python’s string.find function to find the space between the “Hello” and the “1”, and then split the string into two parts - everything to the right of the space and everything to the left of the space:

<b>Hello 1…

Python's string.find function is often described as finding a needle in a haystack. For example, let’s say we want to find the "fun" in "Mifune Toshiro":

"Mifune Toshiro".find("fun")

The ‘needle’ is “fun”, and we want to find the first copy of it in the ‘haystack’ which is the string "Mifune Toshiro". The answer we get will be the string index of the beginning of “fun” which, in this case is 2.

Remember: Python starts counting positions at zero; so that the "M" is in position 0, the "i" is in position 1, and the "f" is in position 2.

Let’s see another example. To find the space “ “ in “Hello world” we would use:

"Hello world".find(" ")

and the result we get would be 5.

You can give the find() function a starting position. If we wanted to find “1” in the string "1 + 1=2", the first occurrence is at position 0, but if we start at position 2:

"1 + 1=2".find("1", 2)

The result we get would now be 4.

If the needle you’re looking for doesn’t actually occur in your haystack

"haystack".find("needle")

Python will return -1 (negative one) to indicate that the needle was not found anywhere in the given string.
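The find() examples above can be collected into one short script. This is only a sketch that replays the calls already shown; the variable names are chosen here for illustration.

```python
# Each call returns the index of the first match, or -1 if there is none.
haystack = "Mifune Toshiro"

first_fun = haystack.find("fun")        # "fun" starts at index 2
first_space = "Hello world".find(" ")   # the space is at index 5
second_one = "1 + 1=2".find("1", 2)     # search begins at index 2
missing = "haystack".find("needle")     # the needle is not present

print(first_fun, first_space, second_one, missing)  # prints: 2 5 4 -1
```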


Breaking Up Strings Quiz

What result should we expect from the Python interpreter for each of the following expressions?

"Ada Lovelace".find(" ")

"Alan Turing".find("n", 4)

Selecting Substrings

Now we know how to find positions, or ‘indices’, in strings. What we want to do now is to “chop up” those strings into substrings. Once I know where the spaces are, I can start splitting a sentence into words. The Python syntax for this looks something like this:

"hello"[1:3]

This means “find the substring that starts at the first number (1), and goes up to, but not including, the second number (3)”. In this case, the answer the Python interpreter returns will be 'el'.

You can also leave out one of the number specifiers. [1:] means “start at index 1 and continue to the end of the string”. [:4] means “start at the beginning of the string, and continue up to, but not including, the character in index position 4”.
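The three slicing forms described above can be tried side by side. This is a small sketch using the same "hello" string; the variable names are illustrative only.

```python
word = "hello"

middle = word[1:3]  # from index 1 up to, but not including, index 3
tail = word[1:]     # from index 1 to the end of the string
head = word[:4]     # from the start up to, but not including, index 4

print(middle, tail, head)  # prints: el ello hell
```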

Now that you know how to find a substring within a string, and how to chop strings up, let’s see if you can combine them together to write a Python procedure.

Selecting Substrings Quiz

Let p and q be strings containing two words separated by a space. For example: "bell hooks", "grace hopper", "alonzo church".

Write a procedure called myfirst_yoursecond(p,q) that returns True if the first word in p equals the second word in q.


Split

Splitting words by spaces is such a common task that Python has a built-in function, string.split() that does just that. For example, if we were to enter:

"Jane Eyre".split( )

into the Python interpreter, we would get a list of the words in the string that are separated by spaces:

["Jane", "Eyre"]
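As a quick check of the behaviour described above, the sketch below calls split() with no argument. The second string is a made-up example, used to show that a run of several spaces is treated as a single separator.

```python
words = "Jane Eyre".split()
print(words)  # prints: ['Jane', 'Eyre']

# With no argument, a run of whitespace counts as one separator.
spaced = "Wuthering    Heights".split()
print(spaced)       # prints: ['Wuthering', 'Heights']
print(len(spaced))  # prints: 2
```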

Split Quiz

What is the number of elements in the list returned by the split( ) function in each of the following?

"Python is fun".split( )

"July-August 1842".split( )

"6*9==42".split( )

Regular Expressions

We will also want to split strings that include elements like hyphens or operational symbols like the second and third examples in the last quiz. This suggests that we need more control over splitting strings, so that we can split on things other than spaces. It turns out that there is a tool that lets us do just that.

Regular expressions are a popular and concise notation for specifying sets of strings, and can be used as a tool to give you more control over how you split strings.

Suppose you want to find all the numbers in a string. You could make ten different calls to s.find(), looking for 1, then 2, and so on:

s.find("1")

s.find("2")

s.find("3")

…

This would get really tedious, really fast. It turns out that regular expressions are a much better way to do this.


The term "regular" has special meaning in mathematics and computer science theory. For now, however, it just means simple strings. An “expression” is simply a concise notation. You’ve probably seen this before in a maths class, but you might not have thought about it in this way.

If we write a mathematical expression like:

x² = 4 or 5 < x < 9

each of these ‘admits’, or corresponds to, some possible values for x:

x² = 4 → x = 2 or x = -2

5 < x < 9 → x = 6 or 7 or 8 (taking x to be an integer)

All these values of x satisfy the corresponding mathematical equation. So, these mathematical expressions are concise notations for potentially (very) large sets of values. Consider the expression

50 < x < 90

Where x has 39 possible integer values that satisfy the expression.
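The count of 39 can be checked by brute force. The sketch below simply tries every integer in a range wide enough to cover the expression; the range bounds are an arbitrary choice.

```python
# Collect every integer x that satisfies 50 < x < 90.
values = [x for x in range(0, 200) if 50 < x < 90]

print(len(values))            # prints: 39
print(values[0], values[-1])  # prints: 51 89
```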

So, these mathematical expressions are very concise and allow us to describe a large number of integers or numbers. In a similar way, regular expressions are going to provide a very concise method for describing large numbers of simple strings.

Let’s introduce our first regular expression:

[1-3]

This matches or ‘denotes’ the three strings, “1”, “2”, “3”.

The underlying idea is that the regular expression has some symbol on the left and some symbol on the right, and it matches everything in between:

[4-8] → “4”, “5”, “6”, “7”, “8”

[a-c] → “a” “b” “c”

Regular expressions are very popular and very useful online and in computing in general. Credit cards, phone numbers, addresses and emails are all handled by regular expressions on websites you probably already use every day.

Regular expressions are commonly used when you want to enter structured data. Things like date of birth, social security number, and email address, all have different structured formats. Regular expressions allow you to make sense of this type of data and process it when you see it on web pages.


Single Digits Quiz

Select all of the strings that exactly match [0-9]:

0 1 10 11 05 9 Isak Dinesen

Import Re

Commercial or industrial software is often so big that it doesn’t fit onto one page. It is often broken up into chunks, just like a book is broken into chapters. In computer science, a module is a repository or library of functions and data. In Python we use import to bring in a module.

Python comes with a bunch of functions relating to regular expressions, so we won’t have to re-invent the wheel. We just need to import these functions into our code and then we can use them as we require. Python’s regular expressions module is called re, so, to have access to these regular expression functions just include this statement at the top of your code:

import re

If we’re going to write regular expressions out in Python, we need to know what they look like. Python regular expressions look just like strings, except that regular expressions begin with a lower-case ‘r’:

string: "[0-9]"

regular expression: r"[0-9]"

The string variable above is a 5-character string. The regular expression matches ten 1-digit strings.

Technically, the r in r"[0-9]" means raw string rather than regular expression, but for this course the latter is a good mnemonic. The difference has to do with escape sequences, a topic we will cover later in this unit.
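To make the distinction concrete, here is a small sketch. It assumes Python 3 and uses re.fullmatch() and string.printable, neither of which appears in the notes, to list which single characters the pattern accepts.

```python
import re
import string

plain = "[0-9]"
raw = r"[0-9]"
same = (plain == raw)  # no backslashes, so both spell the same 5 characters
print(len(plain), same)  # prints: 5 True

# With a backslash the two spellings differ:
print(len("\n"), len(r"\n"))  # prints: 1 2

# Try every printable character against the pattern.
digits_matched = [c for c in string.printable if re.fullmatch(r"[0-9]", c)]
print(len(digits_matched))  # prints: 10
```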

Writing regular expressions is a creative process. You, as the programmer, actually have to do it.


One of the most common functions involving regular expressions is findall(). This takes a regular expression and a string, and returns a list of all the sub-strings that match the given regular expression. For example:

re.findall(r"[0-9]", "1+2==3")

returns the list: ["1", "2", "3"].

The ‘re’ indicates that the function comes from the regular expression library. We must have already imported re for this to work.

The function:

re.findall(r"[a-c]", "Barbara Liskov")

returns the list ["a", "b", "a", "a"].

Note that the uppercase B does not match the regular expression given.

Findall() Quiz

Which elements are returned by each of the following findall() expressions:

re.findall(r"[0-9]", "Mir Taqi 1723")

re.findall(r"[A-Z]", "Mir Dard 1721")

re.findall(r"[0-9]", "11 - 7 == 4")

Designing JavaScript

Brendan Eich describes the history and design of JavaScript.

When Brendan arrived at Netscape, he didn’t have much time to write JavaScript and it was important that he already had the necessary skills. He’d been interested in writing programming language implementations for his entire career. “Every practicing programmer should take the time to invent a language at some point. There is often a need that you have, that no particular language is perfect for. It’s educational, and will often solve your problem better than any other language. I did this myself on a number of occasions and it certainly prepared me for JavaScript”.

In this class we will learn a lot of this. Regular expressions. Finite state machines. Having a lexer fall out of that automatically. Context-free grammars. Parsing. Having a parser fall out of that automatically.


Concatenation

Now that we’ve mastered single-character regular expressions, let’s look at gluing them together. We’re going to need to find important punctuation elements like “/>” and “==” to reason about JavaScript and HTML, and thus build our web browser. We need to be able to concatenate (put right next to each other) and repeat regular expressions.

In fact, with regular expressions that’s actually as simple as just writing two regular expressions right next to each other:

r"[a-c][1-2]"

This will match six possible strings:

“a1”, “a2”, “b1”, “b2”, “c1”, “c2”

The first letter in each string comes from the first regular expression and the number matches the second regular expression. In effect, we’ve concatenated the letters a to c with the numbers 1 to 2 to match more complicated strings. You may have noticed that we suddenly had quite a few strings from a relatively small regular expression. In fact, if we put:

r"[0-9][0-9]"

we match 100 strings: “00”, “01”, “02”, … “98”, “99”
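These counts can be confirmed by enumeration. The sketch below assumes Python 3 (for re.fullmatch()) and tests candidate strings built from small, arbitrarily chosen alphabets that are a little larger than the patterns need.

```python
import re
import string
from itertools import product

# Candidates: one of a-d followed by one of 0-3.
candidates = ["".join(p) for p in product("abcd", "0123")]
matched = [s for s in candidates if re.fullmatch(r"[a-c][1-2]", s)]
print(matched)  # prints: ['a1', 'a2', 'b1', 'b2', 'c1', 'c2']

# Candidates: every 2-character string over the digits plus 'x' and 'y'.
alphabet = string.digits + "xy"
count = sum(
    1
    for p in product(alphabet, repeat=2)
    if re.fullmatch(r"[0-9][0-9]", "".join(p))
)
print(count)  # prints: 100
```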

Concatenation Quiz

Which of the following are return elements of:

re.findall(r"[a-z][0-9]", "a1 2b cc3 44d")

‘a1’ ‘2b’ ‘b2’ ‘cc’ ‘cc3’ ‘44’ ‘d4’ ‘’ ‘c3’


Regular Expressions At Mozilla

Steve Fink at Mozilla describes using multiple regular expressions to make the Firefox web browser better.

We’ve been having a problem recently with the Firefox browser where we get periodic stalls. This is often caused by some event happening, like a key press or a timer, and the browser takes 100ms or so to service it. I pulled out my regular toolkit of Perl and regular expressions and I wrote a little script. One regular expression to identify this class declaration, another to find what it inherits from, another one to find a run method which has to exist on all of these (which just gives me a good place to inject my code), and it basically just worked. That’s just one example of where I’ve done this. I tend to do this fairly often.

One Or More

It’s time to introduce a new regular expression which is really handy when we want to match one, or more, of something. This is ‘+’. This is a very concise way of matching what is really an infinite number of possibilities. If we write:

r"a+"

this matches “a”, “aa”, “aaa”, “aaaa”, …

The + ‘looks back’ at the preceding regular expression and changes the meaning, so that instead of just matching it once, you match it once or more – as many times as you like.

r"[0-1]+" matches “0”, “1”, “00”, “01”, “10”, “11”, “000”, “001”, …

There is a minor ambiguity that we need to clear up about the +. Let’s say we’re looking for:

re.findall(r"[0-9]+", "13 from 1 in 1776")

One possible answer is: ['13', '1', '1776']

But the + just means “one or more”. Do we have to match them all at the same time?

Could we also say: ['1', '3', '1','1','7', '7','6']?

It turns out that there is a rule in regular expressions called “maximal munch” which says that a regular expression should ‘consume’ or ‘eat’, or match the biggest string that it can, and not its smaller parts. So findall gives the following from the above expression:

['13', '1', '1776']
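The effect of maximal munch is easy to see by running the pattern with and without the +. This is a small sketch; the variable names are illustrative.

```python
import re

text = "13 from 1 in 1776"

# With +, each match grabs the longest run of digits it can.
greedy = re.findall(r"[0-9]+", text)
print(greedy)  # prints: ['13', '1', '1776']

# Without +, every digit is matched on its own.
single = re.findall(r"[0-9]", text)
print(single)  # prints: ['1', '3', '1', '1', '7', '7', '6']
```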


One Or More Quiz

Which of the following are elements of the return value of:

re.findall(r"[0-9][ ][0-9]+", "a1 2b cc3 44d")

‘a1 2’ ‘1 2’ ‘1 2b’ ‘2 3’ ‘44’ ‘3 44’ ‘3 44d’

Finite State Machines

We want to do even more with regular expressions, such as matching a word or a number. To do this, we’re going to introduce a visual representation for regular expressions that shows exactly what’s going on behind the scenes. Then we will follow along in Python. Suppose we have the expression:

r"[0-9]+%"

Any character that just appears on its own, like the % in the expression above, is matched directly. So the expression will match strings like 30%, 99% and 2%.

Here, we’ve drawn a finite state machine, which is a visual representation of this regular expression:

The arrow on the left of the picture indicates where we start. The three circles represent ‘states’. They represent where we’re up to when we’re matching a string against the regular expression. The other arrows are called ‘edges’, or ‘transitions’. They tell us when to move from one state to another.


So, we start in state 1. If we see a digit matching 0-9, we move over to state 2. You’ll have noticed that state 3 has a double circle. That indicates that state 3 is an ‘accepting state’. If we end up in an accepting state at the end of the input, this finite state machine matches the given string.

Let’s think about what happens for a few input strings. We’ll start with the input string “23%”.

We start in the start state, state 1. We see a ‘2’, which matches 0-9, so we move to state 2. The next thing we see is a 3, so we follow the upper loop back to state 2 (these are sometimes called ‘self-loops’ – a loop that takes us back to where we started). Now we see the ‘%’ sign which takes us to state 3. Since state 3 is an accepting state, the finite state machine accepts this string, ‘23%’, just like our regular expression would.

What would happen if we had just the string “2”? We would start in the start state, we’d see a ‘2’ so we move to state 2, and then we’re done. We ran out of input, but we’re not in an accepting state. The finite state machine rejects this, just like our regular expression would.

Finally, let’s consider the string “2x”. Again, we start in the start state. Again, we see a ‘2’, so we move to state 2. But now we see the ‘x’, and there’s no outgoing edge for an ‘x’ from state 2. So we fall off the finite state machine and die! When this happens our finite state machine does not accept the input.
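We can check that the regular expression agrees with the three traces above. The sketch below assumes Python 3 and uses re.fullmatch(), which is not used in the notes; it succeeds only when the whole string matches, which corresponds to ending in an accepting state with no input left over.

```python
import re

pattern = r"[0-9]+%"

# True means the whole string is accepted; False means it is rejected.
results = {s: re.fullmatch(pattern, s) is not None for s in ["23%", "2", "2x"]}
print(results)  # prints: {'23%': True, '2': False, '2x': False}
```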

Accepting States Quiz

Which of these are accepted exactly and fully?

‘a1’ ‘aa’ ‘2b’ ‘ ‘ ‘cc3’ ‘44d’


FSM Evolution Quiz

I want to change the FSM from the previous quiz into one that accepts r"[a-z]+[0-9]".

Which state should get the edge? What’s the label for that edge going to be?

Disjunction

Consider this new finite state machine:

It accepts words, i.e. one-or-more letters (e.g. ‘w’, ‘o’, ‘r’, ‘d’), and also numbers of one-or-more digits (e.g. ‘1’, ‘2’, ‘3’). It has two accepting states making it much more powerful. Can we do the same thing with regular expressions?

It turns out that we can, but we need to introduce a new regular expression operator. The vertical bar (‘pipe’) in the expression below means “match either the thing to the left of the bar, or the thing to the right of the bar”.

r"[a-z]+|[0-9]+"

The formal name for this is ‘disjunction’, but we can just read it as ‘or’. So:

“match [a-z]+ or [0-9]+”

Let’s consider an example. In this case, we want to find all the matches of lowercase [a-z]+ or [0-9]+ in the phrase “Goethe 1749”:

re.findall(r"[a-z]+|[0-9]+", "Goethe 1749")

What we will get is:

['oethe', '1749']
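The capital G is dropped because neither side of the bar matches an uppercase letter. The sketch below repeats the example and then, as an illustration that goes slightly beyond the notes, widens the first alternative to [A-Za-z] so that the whole word is found.

```python
import re

text = "Goethe 1749"

lower_or_digits = re.findall(r"[a-z]+|[0-9]+", text)
print(lower_or_digits)  # prints: ['oethe', '1749']

# Assumption for illustration: [A-Za-z] combines two ranges in one set.
any_case_or_digits = re.findall(r"[A-Za-z]+|[0-9]+", text)
print(any_case_or_digits)  # prints: ['Goethe', '1749']
```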


Disjunction In FSMs Quiz

Which of the following are accepted by this finite state machine:

‘a’ ‘ ‘ ‘Havel 1936’ ‘havel 2011’ ‘1933’

Disjunction Construction Quiz

Let’s try approaching the problem from the other direction.

Assign to the variable regexp a regular expression that matches either the exact string ab, or one or more digits.


Options

Now we have a way to choose between options in our regular expressions using the construct a|b. Another very common choice is between something and nothing, a|nothing, so that part of a string is optional. For example, when you are writing numbers, a number may begin with a negative sign, but it doesn’t have to. Here is a finite state machine that accepts numbers with, or without, a leading negative sign:

Notice that the parts of the FSM in the boxes are identical. This duplication goes against our goal of developing a concise method. Conceptually it might be simpler if we had an edge that consumes no input:

By convention, we use the Greek letter epsilon, ε, to indicate an edge that takes no input. You can think of this as meaning “consume no input” or, if you prefer, you can think of it as the empty string.

Continuing the theme that anything that can be done in a finite state machine can be done in a regular expression, and vice versa, we now introduce a new regular expression, the question mark, ?, to indicate “optional”. We can also read this as “the previous thing zero or one times”. A regular expression which accepts numbers which may optionally have a leading negative sign would be:

r"-?[0-9]+"

If we apply it to the string "1861-1941 R. Tagore" using:

re.findall(r"-?[0-9]+", "1861-1941 R. Tagore")

we will get ['1861', '-1941']
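To see what the question mark contributes, the sketch below runs the pattern r"-?[0-9]+" next to the same pattern without the optional sign.

```python
import re

text = "1861-1941 R. Tagore"

# The optional '-' is included in a match whenever it is present.
with_sign = re.findall(r"-?[0-9]+", text)
print(with_sign)  # prints: ['1861', '-1941']

# Without '-?', the hyphen is skipped and only the digits are returned.
without_sign = re.findall(r"[0-9]+", text)
print(without_sign)  # prints: ['1861', '1941']
```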


Escape Sequences

Just as we have the plus symbol for one or more copies, we can also use another regular expression, star, ‘*’, for zero or more copies. You can always convert between the two expressions using:

a+ ≡ aa*

The + is more common for specifying Python and JavaScript.
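The equivalence between a+ and aa* can be spot-checked on a handful of strings. This is a sketch rather than a proof: it assumes Python 3 for re.fullmatch(), and the sample strings are chosen arbitrarily.

```python
import re

samples = ["", "a", "aa", "aaa", "b", "ab", "aab"]

plus = [s for s in samples if re.fullmatch(r"a+", s)]
star = [s for s in samples if re.fullmatch(r"aa*", s)]
print(plus)          # prints: ['a', 'aa', 'aaa']
print(plus == star)  # prints: True

# On its own, a* also accepts the empty string (zero copies).
zero_ok = re.fullmatch(r"a*", "") is not None
print(zero_ok)  # prints: True
```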

So now we have a series of symbols that have special meanings in regular expressions:

+ * ? [ ]

They help us denote sets of strings. But what if the string I want to match is just a plus sign, ‘+’? How do I do this if ‘+’ simply means one or more of what went before?

We’re going to solve this by using something called escape sequences, but first, let’s introduce them by means of an analogy.

In Python, you can define a string by using either double quotes, as in “string”, or by single quotes, as in ‘string’. If you wanted to define a single string that reads:

P & P is Jane’s favourite book.

and you used single quotes, Python would be confused by the single quote in your string. In this case, you could simply use double quotes and all would be well. But what if your string includes quoted dialogue?

I said, “P & P is Jane’s favourite book.”

Now we’re using both single and double quotes in our string. How do we get around this?

Well, Python will actually let you get around this by using triple quotes:

"""I said, "P & P is Jane's favourite book." """

But there is another way to do this too. If I put a backslash in front of a quote, Python will treat the quote as being part of the string, and not as the end of the string. We’re escaping out of quotes being string delimiters.

"I said, \"P & P is Jane's favourite book.\" "

The backslash is called the escape character, and the combination \" is called an escape sequence.
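Both ways of writing the string produce exactly the same value, as this short sketch shows. The variable names are illustrative.

```python
# Triple quotes let both kinds of quote appear inside the string.
triple = """I said, "P & P is Jane's favourite book." """

# Escaping the inner double quotes works inside ordinary double quotes.
escaped = "I said, \"P & P is Jane's favourite book.\" "

print(triple == escaped)  # prints: True
print(escaped)
```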


It turns out that we can do the same thing with regular expressions. If we wanted to find the string ‘++’ we can use:

r"\+\+"

This expression has two escape sequences, and will find only the string ‘++’.
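Here is the escaped pattern in use on a made-up input string. The sketch also shows re.escape(), a helper from Python's re module that is not mentioned in the notes; it adds the backslashes for you.

```python
import re

# Only the doubled plus signs are found; the single '+' is ignored.
found = re.findall(r"\+\+", "i++ and j+ and k++")
print(found)  # prints: ['++', '++']

# re.escape() builds the escaped pattern from the literal text.
escaped = re.escape("++")
print(escaped == r"\+\+")  # prints: True
```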

Hyphenation Quiz

Assign to the variable regexp a Python regular expression that matches lower-case words [a-z] or singly-hyphenated lower-case words.

This was a particularly tricky quiz, and the right answer really wasn’t obvious. We’re really interested in supporting phone numbers from lots of countries and these might be in a range of formats. We’re really only interested in a hyphen if it’s followed by more digits. Conceptually, you might say that we want to group the hyphen and the following digits and say that we’ll have either all of them, or none of them.

Re Challenges Quiz

Assign to the variable regexp a Python regular expression that matches single-argument mathematical functions. The function name is a lowercase word [a-z], the function argument must be a number [0-9], and there may optionally be spaces before and/or after the argument.

Quoted Strings

Quoted strings, that is, strings that are surrounded by double-quotes, are a tricky issue that comes up in both JavaScript and HTML. Let’s think about how we can use the power of regular expressions to separate quoted strings from other words. Consider the quoted string:

"I said, \"Hello.\""

As it stands this could easily be misinterpreted. Essentially, we just want to remove the outer double-quotes, but if we just use string.find() repeatedly to find double quotes we will end up with all four from this string which isn’t what we want.


This approach could end up mistakenly returning the two strings:

"I said, \" and "".

In a shocking twist, it turns out we have to use regular expressions instead. First though, to make our lives easier, we’ll introduce a couple of new regular expressions.

The first new regular expression is the dot, or period, ‘.’, which matches any character except a new-line (what you get when you press Enter or Return). For example, the regular expression:

re.findall(r"[0-9].[0-9]", "1a1 222 cc3")

will match 3-character strings beginning and ending with a digit from 0-9, with any character (except a line break) in the middle. In this case it will return:

['1a1', '222']

The second new regular expression lets us specify anything except the given characters. The symbol is the circumflex, ‘^’. Placed first inside square brackets, [ ], it means Not, or Set complement. So, the regular expression:

re.findall(r"[0-9][^ab]", "1a1 222 cc3")

will find two character strings that start with the decimal digit 0-9, followed by anything that is not a and is also not b. In this case it will return:

['1 ', '22', '2 ']
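Two further small examples, on made-up input strings, show the dot refusing to match a line break and a negated set picking out everything that is not a digit.

```python
import re

# The middle character may be anything except a new-line.
dot = re.findall(r"a.c", "abc a\nc a-c")
print(dot)  # prints: ['abc', 'a-c']

# [^0-9]+ matches runs of characters that are not digits.
non_digits = re.findall(r"[^0-9]+", "10 Downing St")
print(non_digits)  # prints: [' Downing St']
```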

Structure

When an expression gets complicated in mathematics, we can add parentheses to show the structure or grouping. For example:

(x – 3) * 5 is different from x – (3 * 5)

Python regular expressions have similar parentheses, but they are written a little differently. The closing parenthesis looks just the same as in mathematics, but the opening parenthesis has 3 characters:

(?: )

In the regular expression:

(?:xyz)

we are matching the whole group ‘xyz’.


Suppose we wanted to recognise words made out of combinations of the musical syllables: do re mi fa so la ti. We could try something like:

re.findall(r"do+|re+|mi+", "mimi rere midore doo-wop")

However, the results returned by this function aren’t what we wanted at all:

['mi', 'mi', 're', 're', 'mi', 'do', 're', 'doo']

The clue to what went wrong is in the final ‘doo’. The ‘+’ symbols in the regular expression above only apply to the second letter in each string. So, the regular expression "mi+" will match the strings ‘mi’, ‘mii’, ‘miii’, and so on.

To get the result we were after, we have to re-write the regular expression slightly:

re.findall(r"(?:do|re|mi)+", "mimi rere midore doo-wop")

Now the whole group enclosed by the parentheses, repeated one or more times, will be matched by the regular expression:

['mimi', 'rere', 'midore', 'do']
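Python also accepts plain parentheses for grouping, but they change what findall() returns, which is one reason to prefer the three-character form here. The sketch below writes the alternatives with the vertical bar for ‘or’ and compares the two kinds of group.

```python
import re

text = "mimi rere midore doo-wop"

# (?: ) only groups, so findall returns each whole match.
non_capturing = re.findall(r"(?:do|re|mi)+", text)
print(non_capturing)  # prints: ['mimi', 'rere', 'midore', 'do']

# Plain ( ) also captures, so findall returns only the last syllable
# that the group matched within each match.
capturing = re.findall(r"(do|re|mi)+", text)
print(capturing)  # prints: ['mi', 're', 're', 'do']
```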

Escaping The Escape Quiz

Assign to regexp a regular expression for double-quoted string literals that allows for escaped double quotes.

Representing A FSM

A finite state machine can be represented (or encoded) in Python. We use Python dictionaries (or maps) to represent a finite state machine's edges. Here is a finite state machine that corresponds to the regular expression "a+1+":


Let’s verify that the finite state machine matches the regular expression by tracing the route of the input ‘aa1’ through the machine:

1. We start in state 1.
2. We receive the character ‘a’ and jump to state 2.
3. We receive another ‘a’ and self-loop back to state 2.
4. We receive a ‘1’ and jump to state 3.
5. State 3 is an accepting state, and we have no more input, so we end and accept the string.

This is more-or-less what the computer does “under the hood” to check strings against regular expressions or to evaluate finite state machines. All you really need to do is keep track of where you are in the input, and which state you’re in.

So let’s write a computer program in Python to check whether a finite state machine accepts a string.

The first thing we have to decide is how are we going to represent the finite state machine? We can’t pass a picture into Python! For the states, we can just pass in a list of the states. It’s the edges we need to think about.

For an edge, what we really want to know is “if I’m in state 1, and the next input is a, where do I go”. We can use a Python dictionary to do this. The dictionary will map each pair of current state and input to the next state, so we can find the next state by looking it up in the dictionary:

edges[(1, 'a')] = 2


Before we begin, let’s have a quick refresher on Python dictionaries and tuples.

A Python dictionary, or ‘map’, is a set of zero or more key-value pairs, surrounded by curly brackets. Its purpose is to associate one thing (the ‘value’) with another (the ‘key’).

You make a new, empty, dictionary in Python using the construct:

is_flower = {}

We add or update dictionary elements using the construct:

<Dictionary>[<Key>] = <Value>

So, for the is_flower dictionary we defined above, we might add entries as follows:

is_flower['rose'] = True

is_flower['dog'] = False

Now, is_flower['rose'] will return True.

We could also create the dictionary by specifying all of the bindings as:

is_flower = {'rose': True, 'dog': False}

A Python tuple is just an immutable, or unchangeable, list. We might have a tuple that holds the Cartesian coordinates of some object, say at (1,5) on the grid:

point = (1,5)

We can access the elements in the same way that we would for a list:

point[0] == 1

point[1] == 5
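Putting the two ideas together, a tuple can itself be used as a dictionary key, which is what the edges dictionary relies on. The sketch below is illustrative; the 'tulip' entry and the variable names are made up for the example.

```python
is_flower = {'rose': True, 'dog': False}
is_flower['tulip'] = True  # add a new key-value pair

point = (1, 5)
try:
    point[0] = 7  # tuples cannot be changed after creation
except TypeError:
    immutable = True

# A (state, input) tuple works as a dictionary key.
edges = {}
edges[(1, 'a')] = 2
has_edge = (1, 'a') in edges  # True: there is an edge for this pair
missing = (1, 'b') in edges   # False: no such edge
print(edges[(1, 'a')], has_edge, missing)  # prints: 2 True False
```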


FSM Simulator

So now let’s encode our finite state machine in Python. We can make a dictionary for the edges:

edges = {(1, 'a'): 2, (2, 'a'): 2, (2, '1') : 3, (3, '1') : 3}

We’ll also need to know which states are accepting states. For this, we can just have a list of all the accepting states:

accepting = [3]

You might think that we’d need a list of all the nodes. In fact, we can get away without it because all of the nodes we actually care about already appear in the listing of the edges.
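If we ever did want the list of nodes, it can be recovered from the edges. The sketch below is one possible way to do it, using the edges and accepting values defined above; the names states and alphabet are introduced here only for illustration.

```python
edges = {(1, 'a'): 2, (2, 'a'): 2, (2, '1'): 3, (3, '1'): 3}
accepting = [3]

# Every state appears as the source or the destination of some edge
# (accepting states are added too, in case one has no edges at all).
states = {src for (src, _) in edges} | set(edges.values()) | set(accepting)

# The input characters the machine knows about are the edge labels.
alphabet = {ch for (_, ch) in edges}

print(sorted(states))    # prints: [1, 2, 3]
print(sorted(alphabet))  # prints: ['1', 'a']
```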

We are now in a position to define our procedure, fsmsim(), for the finite state machine simulator. We’ll pass the string we want to test, the current node, the edges dictionary and the list of accepting strings into the procedure as parameters.

The first thing we need to do is test whether the string is empty. If it is, we check whether our current state is an accepting state and return True if it is, and False if it isn’t. If it isn’t an empty string, then we can define letter to be the character at position 0 in the string.

The pseudo code below shows the steps that will follow:

def fsmsim(string, current, edges, accepting):
    if string == "":
        return current in accepting
    else:
        letter = string[0]
        # Is there a valid edge?
        # If so, take it.
        # If not, return False.
        # HINT: use recursion.

FSM Simulator Quiz

Complete the code for fsmsim().


FSM Interpretation Quiz

Assign values to edges and accepting to encode the regular expression r"q*". Name your start state 1.

More FSM Encoding Quiz

Define edges and accepting to encode r"[a-b][c-d]?". Name your start state 1.

MIS MSF Quiz

(FSM SIM in reverse!)

Provide two different strings that are accepted by the FSM given the following values for edges and accepting:

edges = {(1,'a'):2, (1,'b'):3, (2,'c'):4, (3,'d'):5, (5,'c'):2, (5,'f'):6, (5,'g'):1}

accepting = [6]


Epsilon And Ambiguity

It turns out that Python’s regular expression module, re, actually uses something very similar to fsmsim() ‘under the hood’. You just take the regular expression, turn it into a finite-state machine (which we’ve done forwards and backwards many times), and then check with a simple recursive procedure whether the finite-state machine accepts a string.

However, the simulations we’ve seen so far haven’t handled epsilon transitions, ε, or ambiguity.

By ambiguity, I mean: what if there are two outgoing edges from the same state, both labelled ‘a’?

Let’s say one of the edges leads to an accepting state and the other doesn’t. What should we do? Well, there is a formal definition for this kind of ambiguity:

A finite-state machine accepts a string, s, if there exists even one path from the start state to any accepting state that follows s.

This doesn’t really solve our problem. Put simply, we didn’t code for either epsilon transitions or ambiguity, so we will have to go back to the code.
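To make the definition concrete, here is one possible sketch of a simulator that copes with ambiguity. It is not the course's code: the name nfsmsim and the convention that each edge maps to a list of destination states are assumptions made for this example, and epsilon transitions are still not handled.

```python
def nfsmsim(string, current, edges, accepting):
    """Return True if at least one path that follows string ends in an accepting state."""
    if string == "":
        return current in accepting
    letter = string[0]
    # Try every possible destination; succeed if any one of them succeeds.
    for destination in edges.get((current, letter), []):
        if nfsmsim(string[1:], destination, edges, accepting):
            return True
    return False

# Two outgoing edges labelled 'a' from state 1: one leads to the
# non-accepting state 2, the other to the accepting state 3.
edges = {(1, 'a'): [2, 3]}
accepting = [3]

print(nfsmsim("a", 1, edges, accepting))    # True
print(nfsmsim("aa", 1, edges, accepting))   # False
```

Trying each branch in turn mirrors the phrase “there exists even one path”: the search stops as soon as any single path reaches an accepting state.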

Phone It In Quiz

Suppose we want to recognise phone numbers, with or without hyphens. This is a common problem in e-commerce.

Define a regular expression, regexp, that works for any number of groups of any (non-empty) size, separated by 1 hyphen. Each group is [0-9]+.


Inverting The Problem Quiz

Here is an FSM that accepts the phone number language from the previous quiz. One edge is blank. What is the missing label?

Nondeterminism

These “easy-to-write” FSMs that we’ve been using, that involve epsilon transitions, or ambiguity, are formally known as non-deterministic finite-state machines. In this context, non-deterministic simply means that you may not know where to go next. The model involves choices.

A “lock-step” finite-state machine, with no epsilon edges or ambiguity, is known as a deterministic finite-state machine. Our finite-state machine simulation function can handle these deterministic FSMs. That makes them really useful for implementing regular expressions.

It turns out that every non-deterministic finite-state machine has a corresponding deterministic finite-state machine that accepts exactly the same strings.

Non-deterministic FSMs are not more powerful than deterministic FSMs, they are just more convenient. It’s easier to write them down!

Let’s consider an example. The following finite-state machine is equivalent to the regular expression –

r"ab?c"


This is a very non-deterministic finite-state machine. The two epsilon transitions at node 2 represent the explicit choice of having a ‘b’, or skipping it. Let’s see a deterministic finite-state machine that does exactly the same thing:

This may need a little explanation.

After we see an ‘a’, we could be in state 2, 3, 6, or 4 of the non-deterministic FSM. We have just recorded all of them as the name of the new state in the model above.

From here, if we see a ‘b’ (and we survived!), we must have been in state 3, at which point we just move to state 4. By contrast, if we had seen a ‘c’, it must have been that we were in state 4, and we’re now in state 5. Finally, if we’re in state 4 and then see a ‘c’, we just move to state 5, which is the accepting state.

So this deterministic state machine accepts the same language as the non-deterministic one above, the two strings ‘abc’ and ‘ac’, but it doesn’t have any epsilon transitions or ambiguity.

So the idea here is to build a deterministic machine D, where every state in D corresponds to a set of states in the non-deterministic machine.
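The construction can be sketched in code. This is an illustration, not the course's implementation: the function names, the separate eps dictionary for epsilon edges, and the exact edge layout of the r"ab?c" machine (guessed from the description above, since the diagram is not reproduced here) are all assumptions.

```python
def epsilon_closure(states, eps):
    """All states reachable from the given states using only epsilon edges."""
    closure = set(states)
    todo = list(states)
    while todo:
        state = todo.pop()
        for nxt in eps.get(state, []):
            if nxt not in closure:
                closure.add(nxt)
                todo.append(nxt)
    return frozenset(closure)

def nfa_to_dfa(start, edges, eps, accepting, alphabet):
    """Each state of the deterministic machine is a frozenset of original states."""
    dfa_start = epsilon_closure([start], eps)
    dfa_edges = {}
    dfa_accepting = []
    seen = {dfa_start}
    todo = [dfa_start]
    while todo:
        current = todo.pop()
        # A combined state accepts if any original state in it accepts.
        if any(s in accepting for s in current):
            dfa_accepting.append(current)
        for letter in alphabet:
            targets = []
            for s in current:
                targets.extend(edges.get((s, letter), []))
            if not targets:
                continue  # no edge on this letter: we "fall off the world"
            destination = epsilon_closure(targets, eps)
            dfa_edges[(current, letter)] = destination
            if destination not in seen:
                seen.add(destination)
                todo.append(destination)
    return dfa_start, dfa_edges, dfa_accepting

# Assumed non-deterministic machine for r"ab?c".
edges = {(1, 'a'): [2], (3, 'b'): [4], (4, 'c'): [5]}
eps = {2: [3, 6], 6: [4]}
start, dfa_edges, dfa_accepting = nfa_to_dfa(1, edges, eps, [5], "abc")

print(sorted(dfa_edges[(start, 'a')]))   # [2, 3, 4, 6]
```

Because frozensets can be dictionary keys, the resulting dfa_edges and dfa_accepting can be passed to a deterministic simulator such as fsmsim() unchanged.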

Nondet To Det Quiz

Here we have a non-deterministic machine:

It has ambiguity. In state 1, there are two ways to go if the input string is an ‘a’. It also has epsilon transitions.


Here, we have started to make the deterministic equivalent FSM:

When we entered the non-deterministic FSM, we could only be in state 1. If we then see an ‘a’, we could move to state 2, or state 4, or we could take the epsilon transition to state 5, or we could keep going and take the free epsilon transition to 6. The node we move to in the deterministic machine will therefore be 2456. This is an accepting state because state 6 is an accepting state. The original machine would accept ‘a’, so we want this machine to also accept ‘a’.

In the ‘converted’ world, a state accepts if any of the corresponding original states were accepting states.

What happens if we see a ‘c’ in states 2, 4, 5, or 6? If we are in state 2, state 4, or state 6 we fall off the world. If we’re in state 5 we move to state 6, which is an accepting state.

Now, there are some other ways to get out of states 2, 4, 5, and 6, and when we do, we end up in states 2 or 3. In our deterministic FSM the new state will be state 23, and it is an accepting state since state 3 in the original machine was also an accepting state.

If we’re in 2 or 3 and see a ‘b’, from 2 we’d move to 3, and from 3 we’d fall off the world. If we’re in 2 or 3 and see a ‘c’, from 2 we’d fall off the world and from 3 we’d move back to 3. Thus from state 23, on either a ‘b’ or a ‘c’ we end up in state 3, which is also an accepting state. If we’re in state 3, there’s a self-loop back to state 3.

What should the label be for the edge linking states 2456 and 23 in the deterministic model above?


Conclusion

Let’s wrap up what we’ve learned in this unit.

STRINGS – are just sequences of characters.

REGULAR EXPRESSIONS – a concise notation for specifying sets of strings.

    o more flexible than using fixed string matching.

FINITE-STATE MACHINES – are a pictorial equivalent of regular expressions.

DETERMINISTIC – every FSM can be converted to a deterministic FSM.

FSM SIMULATION – it is very easy (~10 lines of recursive code) to see if a deterministic FSM accepts a string.

Now we know how to implement regular expressions: take the regular expression, turn it into a finite-state machine, make that FSM deterministic, and then call fsmsim(). From here on, we’ll just use Python’s regular expression library, re, but we should always remember that it is doing something very much like these steps ‘under the hood’.

In the next unit, we are going to use what we have learned to specify important parts of HTML and JavaScript, like string constants or hypertext tags, as the first step towards writing our web browser.


Answers

Quiz: Breaking Up Strings

3 9

Selecting Substrings Quiz

def myfirst_yoursecond(p, q):
    pindex = p.find(" ")
    qindex = q.find(" ")
    p_word1 = p[:pindex]       # first word of p
    q_word2 = q[qindex+1:]     # everything after the first space in q
    return p_word1 == q_word2

Split Quiz

3 2 1

Single Digits Quiz

0 1 10 11 05 9 Isak Dinesen

Findall() Quiz

[‘1’, ‘7’, ‘2’, ‘3’] [‘M’, ‘D’] [‘1’, ‘1’, ‘7’, ‘4’]


Concatenation Quiz

‘a1’ ‘2b’ ‘b2’ ‘cc’ ‘cc3’ ‘44’ ‘d4’ ‘’ ‘c3’

One Or More Quiz

‘a1 2’ ‘1 2’ ‘1 2b’ ‘2 3’ ‘44’ ‘3 44’ ‘3 44d’

Accepting States Quiz

‘a1’ ‘aa’ ‘2b’ ‘ ‘ ‘cc3’ ‘44d’


FSM Evolution Quiz

2 a – z

Disjunction In FSMs Quiz

‘a’ ‘ ‘ ‘Havel 1936’ ‘havel 2011’ ‘1933’

Disjunction Construction Quiz

regexp = r"ab|[0-9]+"

Hyphenation Quiz

regexp = r"[a-z]+-?[a-z]*"

Re Challenges Quiz

regexp = r"[a-z]+\( *-?[0-9]+ *\)"

Escaping The Escape Quiz

regexp = r'"(?:[^\\]|(?:\\.))*"'


FSM Simulator Quiz

def fsmsim(string, current, edges, accepting):
    if string == "":
        return current in accepting
    else:
        letter = string[0]
        # Is there a valid edge?
        if (current, letter) in edges:
            # If so, take it.
            destination = edges[(current, letter)]
            remaining_string = string[1:]
            return fsmsim(remaining_string, destination, edges, accepting)
        # If not, return False.
        else:
            return False
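To try the simulator out, we can run it on the edges and accepting values defined in the FSM Simulator section, which encode r"a+1+". The comparison against re.fullmatch() and the choice of test strings are additions for this example, not part of the quiz answer.

```python
import re

def fsmsim(string, current, edges, accepting):
    if string == "":
        return current in accepting
    letter = string[0]
    if (current, letter) in edges:
        destination = edges[(current, letter)]
        return fsmsim(string[1:], destination, edges, accepting)
    return False

edges = {(1, 'a'): 2, (2, 'a'): 2, (2, '1'): 3, (3, '1'): 3}
accepting = [3]

# Compare our simulator with Python's re module on a few strings.
for s in ["aaa111", "a1", "a1a", "", "111"]:
    via_fsm = fsmsim(s, 1, edges, accepting)
    via_re = re.fullmatch(r"a+1+", s) is not None
    print(repr(s), via_fsm, via_re)
```

The two columns should agree on every string: only "aaa111" and "a1" are accepted.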

Fsm Interpretation Quiz

edges = {(1,'q'):1}

accepting = [1]

More FSM Encoding Quiz

edges = {(1,'a'): 2, (1,'b'): 2, (2,'c'): 3, (2,'d'): 3}

accepting = [2, 3]

MIS MSF Quiz

s1 = "bdf"

s2 = "bdgbdf"
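We can check these answers by walking the machine one character at a time. The helper below is a loop-based variant written for this example (the name accepts is our own); it behaves like the recursive fsmsim() on deterministic machines.

```python
def accepts(string, start, edges, accepting):
    """Follow one edge per character; fail as soon as an edge is missing."""
    current = start
    for letter in string:
        if (current, letter) not in edges:
            return False
        current = edges[(current, letter)]
    return current in accepting

edges = {(1, 'a'): 2, (1, 'b'): 3, (2, 'c'): 4, (3, 'd'): 5,
         (5, 'c'): 2, (5, 'f'): 6, (5, 'g'): 1}
accepting = [6]

print(accepts("bdf", 1, edges, accepting))      # True
print(accepts("bdgbdf", 1, edges, accepting))   # True
print(accepts("ac", 1, edges, accepting))       # False
```

The string "ac" ends in state 4, which is not an accepting state. Any accepted string has to finish with the 'f' edge from state 5 into state 6.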

Phone It In Quiz

regexp = r"[0-9]+(?:-[0-9]+)*"
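A quick way to see this expression at work is re.findall(). The sample strings below are made up for illustration.

```python
import re

regexp = r"[0-9]+(?:-[0-9]+)*"

# Whole numbers are matched with or without hyphens.
print(re.findall(regexp, "123-4567"))         # ['123-4567']
print(re.findall(regexp, "1234567"))          # ['1234567']
print(re.findall(regexp, "08-78-88-88-88"))   # ['08-78-88-88-88']

# A leading or trailing hyphen is not part of the match.
print(re.findall(regexp, "-6"))               # ['6']
print(re.findall(regexp, "12-"))              # ['12']
```

The group is written as (?:...) rather than (...) because re.findall() returns only the contents of the group when a pattern contains a capturing group, and here we want the whole matched number.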

Inverting The Problem Quiz

0-9

Nondet To Det Quiz

b