Functions with infinite domains, automata, and regular expressions

Post on 09-Feb-2021


    Figure 6.1: Once you know how to multiply multi-digit numbers, you can do so for every number $n$ of digits, but if you had to describe multiplication using Boolean circuits or NAND-CIRC programs, you would need a different program/circuit for every length $n$ of the input.

    6 Functions with Infinite domains, Automata, and Regular expressions

    “An algorithm is a finite answer to an infinite number of questions.”, Attributed to Stephen Kleene.

    The model of Boolean circuits (or equivalently, the NAND-CIRC programming language) has one very significant drawback: a Boolean circuit can only compute a finite function $f$. In particular, since every gate has two inputs, a size $s$ circuit can compute on an input of length at most $2s$. Thus this model does not capture our intuitive notion of an algorithm as a single recipe to compute a potentially infinite function. For example, the standard elementary school multiplication algorithm is a single algorithm that multiplies numbers of all lengths. However, we cannot express this algorithm as a single circuit, but rather need a different circuit (or equivalently, a NAND-CIRC program) for every input length (see Fig. 6.1).

    In this chapter, we extend our definition of computational tasks to consider functions with the unbounded domain of $\{0,1\}^*$. We focus on the question of defining what tasks to compute, mostly leaving the question of how to compute them to later chapters, where we will see Turing machines and other computational models for computing on unbounded inputs. However, we will see one example of a simple restricted model of computation: deterministic finite automata (DFAs).

    Introduction to Theoretical Computer Science
    Compiled on 3.16.2021 13:56

    Learning Objectives:
    • Define functions on unbounded length inputs, that cannot be described by a finite size table of inputs and outputs.
    • Equivalence with the task of deciding membership in a language.
    • Deterministic finite automata (optional): A simple example for a model for unbounded computation.
    • Equivalence with regular expressions.

    This chapter: A non-mathy overview
    In this chapter, we discuss functions that take as input strings of arbitrary length. We will often focus on the special case of Boolean functions, where the output is a single bit. These are still infinite functions since their inputs have unbounded length and hence such a function cannot be computed by any single Boolean circuit. In the second half of this chapter, we discuss finite automata, a computational model that can compute unbounded length functions. Finite automata are not as powerful as Python or other general-purpose programming languages but can serve as an introduction to these more general models. We also show a beautiful result: the functions computable by finite automata are precisely the ones that correspond to regular expressions. However, the reader can also feel free to skip automata and go straight to our discussion of Turing machines in Chapter 7.

    Figure 6.2: The NAND circuit and NAND-CIRC program for computing the XOR of 5 bits. Note how the circuit for $\mathrm{XOR}_5$ merely repeats four times the circuit to compute the XOR of 2 bits.

    6.1 FUNCTIONS WITH INPUTS OF UNBOUNDED LENGTH

    Up until now, we considered the computational task of mapping some string of length $n$ into a string of length $m$. However, in general, computational tasks can involve inputs of unbounded length. For example, the following Python function computes the function $\mathrm{XOR} : \{0,1\}^* \rightarrow \{0,1\}$, where $\mathrm{XOR}(x)$ equals $1$ iff the number of $1$'s in $x$ is odd. (In other words, $\mathrm{XOR}(x) = \sum_{i=0}^{|x|-1} x_i \mod 2$ for every $x \in \{0,1\}^*$.) As simple as it is, the XOR function cannot be computed by a single Boolean circuit. Rather, for every $n$, we can compute $\mathrm{XOR}_n$ (the restriction of XOR to $\{0,1\}^n$) using a different circuit (e.g., see Fig. 6.2).

    def XOR(X):
        '''Takes list X of 0's and 1's
        Outputs 1 if the number of 1's is odd and outputs 0 otherwise'''
        result = 0
        for i in range(len(X)):
            result = (result + X[i]) % 2
        return result

    Previously in this book, we studied the computation of finite functions $f : \{0,1\}^n \rightarrow \{0,1\}^m$. Such a function $f$ can always be described by listing all the $2^n$ values it takes on inputs $x \in \{0,1\}^n$. In this chapter, we consider functions such as XOR that take inputs of unbounded size. While we can describe XOR using a finite number of symbols (in fact, we just did so above), it takes infinitely many possible inputs, and so we cannot just write down all of its values. The same is true for many other functions capturing important computational tasks, including addition, multiplication, sorting, finding paths in graphs, fitting curves to points, and so on. To contrast with the finite case, we will sometimes call a function $F : \{0,1\}^* \rightarrow \{0,1\}$ (or $F : \{0,1\}^* \rightarrow \{0,1\}^*$) infinite. However, this does not mean that $F$ takes as input strings of infinite length! It just means that $F$ can take as input a string that can be arbitrarily long, and so we cannot simply write down a table of all the outputs of $F$ on different inputs.

    Big Idea 8: A function $F : \{0,1\}^* \rightarrow \{0,1\}^*$ specifies the computational task mapping an input $x \in \{0,1\}^*$ into the output $F(x)$.

    As we have seen before, restricting attention to functions that use binary strings as inputs and outputs does not detract from our generality, since other objects, including numbers, lists, matrices, images, videos, and more, can be encoded as binary strings.

    As before, it is essential to differentiate between specification and implementation. For example, consider the following function:

    $$\mathrm{TWINP}(x) = \begin{cases} 1 & \exists\, p \in \mathbb{N} \text{ s.t. } p,\ p+2 \text{ are primes and } p > |x| \\ 0 & \text{otherwise} \end{cases} \tag{6.1}$$

    This is a mathematically well-defined function. For every $x$, $\mathrm{TWINP}(x)$ has a unique value which is either $0$ or $1$. However, at the moment, no one knows of a Python program that computes this function. The Twin prime conjecture posits that for every $n$ there exists $p > n$ such that both $p$ and $p + 2$ are primes. If this conjecture is true, then TWINP is easy to compute indeed: the program def T(x): return 1 will do the trick. However, mathematicians have tried unsuccessfully to prove this conjecture since 1849. That said, whether or not we know how to implement the function TWINP, the definition above provides its specification.
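To make the gap between specification and implementation concrete, here is a minimal sketch of the obvious brute-force attempt at TWINP. The names `is_prime`, `next_twin_prime` and `twinp_search` are ours and purely illustrative. The function returns 1 whenever it halts, but it is only known to halt on every input if the twin prime conjecture is true, so as far as anyone knows it is not guaranteed to compute TWINP.

```python
def is_prime(n):
    '''Trial division; adequate for small illustrative inputs.'''
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def next_twin_prime(n):
    '''Smallest p > n with p and p + 2 both prime.
    Loops forever if no such p exists.'''
    p = n + 1
    while not (is_prime(p) and is_prime(p + 2)):
        p += 1
    return p

def twinp_search(x):
    '''Brute-force attempt at TWINP on the bit list x.
    If it halts, the answer 1 is correct; it never outputs 0.'''
    next_twin_prime(len(x))
    return 1
```

On every input we can actually try, the search finishes quickly; the difficulty is that nobody can currently prove it finishes for every input length.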

    6.1.1 Varying inputs and outputs

    Many of the functions that interest us take more than one input. For example, the function

    $$\mathrm{MULT}(x, y) = x \cdot y \tag{6.2}$$

    takes the binary representation of a pair of integers $x, y \in \mathbb{N}$, and outputs the binary representation of their product $x \cdot y$. However, since we can represent a pair of strings as a single string, we will consider functions such as MULT as mapping $\{0,1\}^*$ to $\{0,1\}^*$. We will typically not be concerned with low-level details such as the precise way to represent a pair of integers as a string, since virtually all choices will be equivalent for our purposes.

    Footnote (Twin prime conjecture, mentioned above): https://en.wikipedia.org/wiki/Twin_prime


    Another example of a function we want to compute is

    $$\mathrm{PALINDROME}(x) = \begin{cases} 1 & \forall\, i \in [|x|],\ x_i = x_{|x|-1-i} \\ 0 & \text{otherwise} \end{cases} \tag{6.3}$$
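As a quick illustration, PALINDROME can be computed by a few lines of Python in the same style as the XOR program above. This is a sketch of ours, using zero-based positions, so that position $i$ is compared with its mirror image $|x|-1-i$.

```python
def PALINDROME(X):
    '''Takes list X of 0's and 1's.
    Outputs 1 if X reads the same forwards and backwards, 0 otherwise.'''
    n = len(X)
    for i in range(n):
        # compare position i with its mirror position n - 1 - i
        if X[i] != X[n - 1 - i]:
            return 0
    return 1
```

Like XOR, this single program handles inputs of every length, including the empty list.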

    PALINDROME has a single bit as output. Functions with a single bit of output are known as Boolean functions. Boolean functions are central to the theory of computation, and we will discuss them often in this book. Note that even though Boolean functions have a single bit of output, their input can be of arbitrary length. Thus they are still infinite functions that cannot be described via a finite table of values.

    “Booleanizing” functions. Sometimes it might be convenient to obtain a Boolean variant for a non-Boolean function. For example, the following is a Boolean variant of MULT.

    $$\mathrm{BMULT}(x, y, i) = \begin{cases} i\text{-th bit of } x \cdot y & i < |x \cdot y| \\ 0 & \text{otherwise} \end{cases} \tag{6.4}$$

    If we can compute BMULT via any programming language such as Python, C, Java, etc., we can compute MULT as well, and vice versa.

    Solved Exercise 6.1 (Booleanizing general functions). Show that for every function $F : \{0,1\}^* \rightarrow \{0,1\}^*$, there exists a Boolean function $BF : \{0,1\}^* \rightarrow \{0,1\}$ such that a Python program to compute $BF$ can be transformed into a program to compute $F$ and vice versa.

    ■

    Solution:

    For every $F : \{0,1\}^* \rightarrow \{0,1\}^*$, we can define

    $$BF(x, i, b) = \begin{cases} F(x)_i & i < |F(x)|,\ b = 0 \\ 1 & i < |F(x)|,\ b = 1 \\ 0 & i \geq |F(x)| \end{cases} \tag{6.5}$$

    to be the function that on input $x \in \{0,1\}^*$, $i \in \mathbb{N}$, $b \in \{0,1\}$ outputs the $i$-th bit of $F(x)$ if $b = 0$ and $i < |F(x)|$. If $b = 1$ then $BF(x, i, b)$ outputs $1$ iff $i < |F(x)|$, and hence this allows us to compute the length of $F(x)$.

    Computing $BF$ from $F$ is straightforward. For the other direction, given a Python function BF that computes $BF$, we can compute $F$ as follows:

    def F(x):
        res = []
        i = 0
        # BF(x,i,1) is 1 exactly while i is a valid position in F(x)
        while BF(x,i,1):
            res.append(BF(x,i,0))
            i += 1
        return res
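To see the round trip in action, here is a small self-contained sketch. The function `F_example` (which outputs every input bit twice) is a hypothetical stand-in for $F$ chosen only for illustration, and `F_from_BF` repeats the loop above under a different name so that the two can be compared.

```python
def F_example(x):
    '''A hypothetical F: output each input bit twice.'''
    out = []
    for bit in x:
        out += [bit, bit]
    return out

def BF(x, i, b):
    '''Boolean variant of F_example, following the three cases of (6.5).'''
    y = F_example(x)
    if i >= len(y):
        return 0
    return y[i] if b == 0 else 1

def F_from_BF(x):
    '''Recover F_example(x) using only calls to BF.'''
    res = []
    i = 0
    while BF(x, i, 1):
        res.append(BF(x, i, 0))
        i += 1
    return res
```

Note that `F_from_BF` never calls `F_example` directly: it learns the output length from the $b = 1$ queries and the output bits from the $b = 0$ queries.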

    โ– 

    6.1.2 Formal Languages

    For every Boolean function $F : \{0,1\}^* \rightarrow \{0,1\}$, we can define the set $L_F = \{ x \mid F(x) = 1 \}$ of strings on which $F$ outputs $1$. Such sets are known as languages. This name is rooted in formal language theory as pursued by linguists such as Noam Chomsky. A formal language is a subset $L \subseteq \{0,1\}^*$ (or more generally $L \subseteq \Sigma^*$ for some finite alphabet $\Sigma$). The membership or decision problem for a language $L$ is the task of determining, given $x \in \{0,1\}^*$, whether or not $x \in L$. If we can compute the function $F$, then we can decide membership in the language $L_F$ and vice versa. Hence, many texts such as [Sip97] refer to the task of computing a Boolean function as “deciding a language”. In this book, we mostly describe computational tasks using the function notation, which is easier to generalize to computation with more than one bit of output. However, since the language terminology is so popular in the literature, we will sometimes mention it.
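The correspondence between a Boolean function $F$ and the language $L_F$ can be sketched in a few lines. The helper names `in_language` and `language_up_to` are ours; since $L_F$ is typically infinite, the second helper lists only its members of length at most $n$.

```python
from itertools import product

def XOR(X):
    '''1 iff the number of 1's in the bit list X is odd.'''
    return sum(X) % 2

def in_language(F, x):
    '''Decide membership of the bit list x in L_F by computing F.'''
    return F(x) == 1

def language_up_to(F, n):
    '''The strings of length at most n that belong to L_F.'''
    return {''.join(str(b) for b in x)
            for k in range(n + 1)
            for x in product([0, 1], repeat=k)
            if F(list(x)) == 1}
```

For example, `language_up_to(XOR, 2)` contains exactly the strings `1`, `01` and `10`.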

    6.1.3 Restrictions of functions

    If $F : \{0,1\}^* \rightarrow \{0,1\}$ is a Boolean function and $n \in \mathbb{N}$ then the restriction of $F$ to inputs of length $n$, denoted as $F_n$, is the finite function $f : \{0,1\}^n \rightarrow \{0,1\}$ such that $f(x) = F(x)$ for every $x \in \{0,1\}^n$. That is, $F_n$ is the finite function that is only defined on inputs in $\{0,1\}^n$, but agrees with $F$ on those inputs. Since $F_n$ is a finite function, it can be computed by a Boolean circuit, implying the following theorem:

    Theorem 6.1 (Circuit collection for every infinite function). Let $F : \{0,1\}^* \rightarrow \{0,1\}$. Then there is a collection $\{C_n\}_{n \in \{1,2,\ldots\}}$ of circuits such that for every $n > 0$, $C_n$ computes the restriction $F_n$ of $F$ to inputs of length $n$.

    Proof. This is an immediate corollary of the universality of Boolean circuits. Indeed, since $F_n$ maps $\{0,1\}^n$ to $\{0,1\}$, Theorem 4.15 implies that there exists a Boolean circuit $C_n$ to compute it. In fact, the size of this circuit is at most $c \cdot 2^n / n$ gates for some constant $c \leq 10$.

    ■

    In particular, Theorem 6.1 implies that there exists such a circuit collection $\{C_n\}$ even for the TWINP function we described before, even though we do not know of any program to compute it. Indeed, this is not that surprising: for every particular $n \in \mathbb{N}$, $\mathrm{TWINP}_n$ is either the constant zero function or the constant one function, both of which can be computed by very simple Boolean circuits. Hence a collection of circuits $\{C_n\}$ that computes TWINP certainly exists. The difficulty in computing TWINP using Python or any other programming language arises from the fact that we do not know for each particular $n$ what is the circuit $C_n$ in this collection.
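The restriction $F_n$ of Section 6.1.3 is a finite object that can be written down explicitly. As a minimal sketch (the helper name `restriction_table` is ours), the following builds the truth table of $F_n$ from a Python function computing $F$; it is this finite table of $2^n$ rows that a circuit $C_n$ can realize.

```python
from itertools import product

def XOR(X):
    '''1 iff the number of 1's in the bit list X is odd.'''
    return sum(X) % 2

def restriction_table(F, n):
    '''Truth table of F_n: maps every x in {0,1}^n (as a tuple) to F(x).'''
    return {x: F(list(x)) for x in product([0, 1], repeat=n)}
```

The table for each fixed $n$ is finite, but no single table covers all input lengths, which is why the theorem speaks of a collection of circuits, one per length.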

    6.2 DETERMINISTIC FINITE AUTOMATA (OPTIONAL)

    All our computational models so far (Boolean circuits and straight-line programs) were only applicable for finite functions.

    In Chapter 7, we will present Turing machines, which are the central models of computation for unbounded input length functions. However, in this section we present the more basic model of deterministic finite automata (DFA). Automata can serve as a good stepping-stone for Turing machines, though they will not be used much in later parts of this book, and so the reader can feel free to skip ahead to Chapter 7. DFAs turn out to be equivalent in power to regular expressions: a powerful mechanism to specify patterns, which is widely used in practice. Our treatment of automata is relatively brief. There are plenty of resources that help you get more comfortable with DFAs. In particular, Chapter 1 of Sipser’s book [Sip97] contains an excellent exposition of this material. There are also many websites with online simulators for automata, as well as translators from regular expressions to automata and vice versa (see for example here and here).

    At a high level, an algorithm is a recipe for computing an output from an input via a combination of the following steps:

    1. Read a bit from the input
    2. Update the state (working memory)
    3. Stop and produce an output

    For example, recall the Python program that computes the XOR function:

    def XOR(X):
        '''Takes list X of 0's and 1's
        Outputs 1 if the number of 1's is odd and outputs 0 otherwise'''
        result = 0
        for i in range(len(X)):
            result = (result + X[i]) % 2
        return result

    In each step, this program reads a single bit X[i] and updates its state result based on that bit (flipping result if X[i] is 1 and keeping it the same otherwise). When it is done traversing the input, the program outputs result. In computer science, such a program is called a single-pass constant-memory algorithm since it makes a single pass over the input and its working memory is finite. (Indeed, in this case, result can either be 0 or 1.) Such an algorithm is also known as a Deterministic Finite Automaton or DFA (another name for DFAs is a finite state machine). We can think of such an algorithm as a “machine” that can be in one of $C$ states, for some constant $C$. The machine starts in some initial state and then reads its input $x \in \{0,1\}^*$ one bit at a time. Whenever the machine reads a bit $\sigma \in \{0,1\}$, it transitions into a new state based on $\sigma$ and its prior state. The output of the machine is based on the final state. Every single-pass constant-memory algorithm corresponds to such a machine. If an algorithm uses $c$ bits of memory, then the contents of its memory can be represented as a string of length $c$. Therefore such an algorithm can be in one of at most $2^c$ states at any point in the execution.

    Figure 6.3: A deterministic finite automaton that computes the XOR function. It has two states $0$ and $1$, and when it observes $\sigma$ it transitions from $v$ to $v \oplus \sigma$.

    Footnote (online automata simulators and translators, mentioned above): http://ivanzuzak.info/noam/webapps/fsm2regex/ and https://cyberzhg.github.io/toolbox/nfa2dfa

    We can specify a DFA of $C$ states by a list of $C \cdot 2$ rules. Each rule will be of the form “If the DFA is in state $v$ and the bit read from the input is $\sigma$ then the new state is $v'$”. At the end of the computation, we will also have a rule of the form “If the final state is one of the following … then output 1, otherwise output 0”. For example, the Python program above can be represented by a two-state automaton for computing XOR of the following form:

    • Initialize in the state $0$.
    • For every state $s \in \{0,1\}$ and input bit $\sigma$ read, if $\sigma = 1$ then change to state $1 - s$, otherwise stay in state $s$.
    • At the end output $1$ iff $s = 1$.

    We can also describe a $C$-state DFA as a labeled graph of $C$ vertices. For every state $s$ and bit $\sigma$, we add a directed edge labeled with $\sigma$ from $s$ to the state $s'$ such that if the DFA is at state $s$ and reads $\sigma$ then it transitions to $s'$. (If the state stays the same then this edge will be a self-loop; similarly, if $s$ transitions to $s'$ in both the case $\sigma = 0$ and $\sigma = 1$ then the graph will contain two parallel edges.) We also label the set $\mathcal{S}$ of states on which the automaton will output $1$ at the end of the computation. This set is known as the set of accepting states. See Fig. 6.3 for the graphical representation of the XOR automaton.

    Formally, a DFA is specified by (1) the table of the $C \cdot 2$ rules, which can be represented as a transition function $T$ that maps a state $s \in [C]$ and bit $\sigma \in \{0,1\}$ to the state $s' \in [C]$ which the DFA will transition to from state $s$ on input $\sigma$ and (2) the set $\mathcal{S}$ of accepting states. This leads to the following definition.


    Definition 6.2 (Deterministic Finite Automaton). A deterministic finite automaton (DFA) with $C$ states over $\{0,1\}$ is a pair $(T, \mathcal{S})$ with $T : [C] \times \{0,1\} \rightarrow [C]$ and $\mathcal{S} \subseteq [C]$. The finite function $T$ is known as the transition function of the DFA. The set $\mathcal{S}$ is known as the set of accepting states.

    Let $F : \{0,1\}^* \rightarrow \{0,1\}$ be a Boolean function with the infinite domain $\{0,1\}^*$. We say that $(T, \mathcal{S})$ computes a function $F : \{0,1\}^* \rightarrow \{0,1\}$ if for every $n \in \mathbb{N}$ and $x \in \{0,1\}^n$, if we define $s_0 = 0$ and $s_{i+1} = T(s_i, x_i)$ for every $i \in [n]$, then

    $$s_n \in \mathcal{S} \Leftrightarrow F(x) = 1 \tag{6.6}$$

    Make sure not to confuse the transition function of an automaton ($T$ in Definition 6.2), which is a finite function specifying the table of “rules” which it follows, with the function the automaton computes ($F$ in Definition 6.2) which is an infinite function.

    Remark 6.3 (Definitions in other texts). Deterministic finite automata can be defined in several equivalent ways. In particular Sipser [Sip97] defines a DFA as a five-tuple $(Q, \Sigma, \delta, q_0, F)$ where $Q$ is the set of states, $\Sigma$ is the alphabet, $\delta$ is the transition function, $q_0$ is the initial state, and $F$ is the set of accepting states. In this book the set of states is always of the form $Q = \{0, \ldots, C - 1\}$ and the initial state is always $q_0 = 0$, but this makes no difference to the computational power of these models. Also, we restrict our attention to the case that the alphabet $\Sigma$ is equal to $\{0,1\}$.
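Definition 6.2 translates almost word for word into code. The following is a minimal sketch, assuming we represent the transition function $T$ as a Python dictionary keyed by (state, bit) pairs and $\mathcal{S}$ as a Python set; the names `run_dfa`, `T_XOR` and `S_XOR` are ours.

```python
def run_dfa(T, S, x):
    '''Run the DFA (T, S) on the bit list x, starting from state 0.
    T maps (state, bit) to the next state; S is the set of accepting states.'''
    s = 0
    for bit in x:
        s = T[(s, bit)]   # s_{i+1} = T(s_i, x_i)
    return 1 if s in S else 0

# The two-state XOR automaton: reading sigma moves state v to v XOR sigma.
T_XOR = {(v, sigma): v ^ sigma for v in (0, 1) for sigma in (0, 1)}
S_XOR = {1}
```

The pair `(T_XOR, S_XOR)` is a fixed, finite description, while `run_dfa` applies it to inputs of any length, which is exactly the distinction between the transition function and the function the automaton computes.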

    Solved Exercise 6.2 (DFA for $(010)^*$). Prove that there is a DFA that computes the following function $F$:

    $$F(x) = \begin{cases} 1 & 3 \text{ divides } |x| \text{ and } \forall\, i \in [|x|/3],\ x_{3i} x_{3i+1} x_{3i+2} = 010 \\ 0 & \text{otherwise} \end{cases} \tag{6.7}$$

    ■

    Solution:

    When asked to construct a deterministic finite automaton, it is often useful to start by constructing a single-pass constant-memory algorithm using a more general formalism (for example, using pseudocode or a Python program). Once we have such an algorithm, we can mechanically translate it into a DFA. Here is a simple Python program for computing $F$:

    Figure 6.4: A DFA that outputs $1$ only on inputs $x \in \{0,1\}^*$ that are a concatenation of zero or more copies of $010$. The state $0$ is both the starting state and the only accepting state. The table denotes the transition function $T$, which maps the current state and symbol read to the new state.

    def F(X):
        '''Return 1 iff X is a concatenation of zero/more copies of [0,1,0]'''
        if len(X) % 3 != 0:
            return False
        ultimate = 0
        penultimate = 1
        antepenultimate = 0
        for idx, b in enumerate(X):
            antepenultimate = penultimate
            penultimate = ultimate
            ultimate = b
            if idx % 3 == 2 and ((antepenultimate, penultimate, ultimate) != (0,1,0)):
                return False
        return True

    Since we keep three Boolean variables, the working memory can be in one of $2^3 = 8$ configurations, and so the program above can be directly translated into an 8 state DFA. While this is not needed to solve the question, by examining the resulting DFA, we can see that we can merge some states and obtain a 4 state automaton, described in Fig. 6.4. See also Fig. 6.5, which depicts the execution of this DFA on a particular input.
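For concreteness, here is one possible 4-state automaton for this function written as an explicit transition table. The state numbering is our assumption and need not coincide with the one in Fig. 6.4: state 0 means "at a block boundary", state 1 means "just read 0", state 2 means "just read 01", and state 3 is a dead state that is never left. The helper `agree_up_to` checks the table against a direct test of the definition on all short inputs.

```python
from itertools import product

def run_dfa(T, S, x):
    '''Run the DFA (T, S) on the bit list x, starting from state 0.'''
    s = 0
    for bit in x:
        s = T[(s, bit)]
    return 1 if s in S else 0

T_010 = {
    (0, 0): 1, (0, 1): 3,   # a block must start with 0
    (1, 0): 3, (1, 1): 2,   # then comes 1
    (2, 0): 0, (2, 1): 3,   # then 0, which closes the block
    (3, 0): 3, (3, 1): 3,   # dead state
}
S_010 = {0}

def F_reference(x):
    '''Direct check that x is zero or more copies of 0,1,0.'''
    ok = len(x) % 3 == 0 and all(
        tuple(x[j:j + 3]) == (0, 1, 0) for j in range(0, len(x), 3))
    return 1 if ok else 0

def agree_up_to(n):
    '''True iff the DFA and F_reference agree on all inputs of length <= n.'''
    return all(run_dfa(T_010, S_010, list(x)) == F_reference(list(x))
               for k in range(n + 1)
               for x in product([0, 1], repeat=k))
```

Such an exhaustive check on short inputs is not a proof of correctness, but it is a cheap way to catch mistakes in a hand-written transition table.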

    โ– 

    6.2.1 Anatomy of an automaton (finite vs. unbounded)

    Now that we are considering computational tasks with unbounded input sizes, it is crucial to distinguish between the components of our algorithm that have fixed length and the components that grow with the input size. For the case of DFAs these are the following:

    Constant size components: Given a DFA $A$, the following quantities are fixed independent of the input size:

    • The number of states $C$ in $A$.

    • The transition function $T$ (which has $2C$ inputs, and so can be specified by a table of $2C$ rows, each entry in which is a number in $[C]$).

    • The set $\mathcal{S} \subseteq [C]$ of accepting states. This set can be described by a string in $\{0,1\}^C$ specifying which states are in $\mathcal{S}$ and which are not.


    Together the above means that we can fully describe an automaton using finitely many symbols. This is a property we require out of any notion of “algorithm”: we should be able to write down a complete specification of how it produces an output from an input.

    Components of unbounded size: The following quantities relating to a DFA are not bounded by any constant. We stress that these are still finite for any given input.

    • The length of the input $x \in \{0,1\}^*$ that the DFA is provided. The input length is always finite, but not a priori bounded.

    • The number of steps that the DFA takes can grow with the length of the input. Indeed, a DFA makes a single pass on the input and so it takes precisely $|x|$ steps on an input $x \in \{0,1\}^*$.

    Figure 6.5: Execution of the DFA of Fig. 6.4. The number of states and the transition function size are bounded, but the input can be arbitrarily long. If the DFA is at state $s$ and observes the value $\sigma$ then it moves to the state $T(s, \sigma)$. At the end of the execution the DFA accepts iff the final state is in $\mathcal{S}$.

    6.2.2 DFA-computable functions

    We say that a function $F : \{0,1\}^* \rightarrow \{0,1\}$ is DFA computable if there exists some DFA that computes $F$. In Chapter 4 we saw that every finite function is computable by some Boolean circuit. Thus, at this point, you might expect that every infinite function is computable by some DFA. However, this is very much not the case. We will soon see some simple examples of infinite functions that are not computable by DFAs, but for starters, let us prove that such functions exist.

    Theorem 6.4 (DFA-computable functions are countable). Let DFACOMP be the set of all Boolean functions $F : \{0,1\}^* \rightarrow \{0,1\}$ such that there exists a DFA computing $F$. Then DFACOMP is countable.

    Proof Idea:


    Every DFA can be described by a finite length string, which yields an onto map from $\{0,1\}^*$ to DFACOMP: namely, the function that maps a string describing an automaton $A$ to the function that it computes.

    ⋆

    Proof of Theorem 6.4. Every DFA can be described by a finite string, representing the transition function $T$ and the set of accepting states, and every DFA $A$ computes some function $F : \{0,1\}^* \rightarrow \{0,1\}$. Thus we can define the following function $StDC : \{0,1\}^* \rightarrow \mathrm{DFACOMP}$:

    $$StDC(a) = \begin{cases} F & a \text{ represents automaton } A \text{ and } F \text{ is the function } A \text{ computes} \\ \mathrm{ONE} & \text{otherwise} \end{cases} \tag{6.8}$$

    where $\mathrm{ONE} : \{0,1\}^* \rightarrow \{0,1\}$ is the constant function that outputs $1$ on all inputs (and is a member of DFACOMP). Since by definition, every function $F$ in DFACOMP is computable by some automaton, $StDC$ is an onto function from $\{0,1\}^*$ to DFACOMP, which means that DFACOMP is countable (see Section 2.4.2).
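A complementary way to see the countability is to list all DFAs one after another. The sketch below is ours and is not the string encoding used in the proof: it enumerates every $C$-state DFA for $C = 1, 2, 3, \ldots$, so that each DFA (and hence each DFA-computable function) appears at some finite position in the list. For a fixed $C$ there are $C^{2C} \cdot 2^C$ such automata.

```python
from itertools import product

def all_dfas(C):
    '''Yield every C-state DFA over {0,1} as a pair (T, S).'''
    keys = [(s, b) for s in range(C) for b in (0, 1)]
    for targets in product(range(C), repeat=2 * C):
        T = dict(zip(keys, targets))          # one of the C^(2C) transition tables
        for mask in product((0, 1), repeat=C):
            S = {s for s in range(C) if mask[s]}   # one of the 2^C accepting sets
            yield T, S

def enumerate_dfas():
    '''Yield all DFAs: first the 1-state ones, then the 2-state ones, and so on.'''
    C = 1
    while True:
        yield from all_dfas(C)
        C += 1
```

The generator `enumerate_dfas` never terminates, but every particular DFA is reached after finitely many steps, which is all that countability requires.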

    โ– 

    Since the set of all Boolean functions is uncountable, we get the following corollary:

    Theorem 6.5 (Existence of DFA-uncomputable functions). There exists a Boolean function $F : \{0,1\}^* \rightarrow \{0,1\}$ that is not computable by any DFA.

    Proof. If every Boolean function $F$ is computable by some DFA, then DFACOMP equals the set ALL of all Boolean functions, but by Theorem 2.12, the latter set is uncountable, contradicting Theorem 6.4.

    ■

    6.3 REGULAR EXPRESSIONS

    Searching for a piece of text is a common task in computing. At its heart, the search problem is quite simple. We have a collection $X = \{x_0, \ldots, x_k\}$ of strings (e.g., files on a hard-drive, or student records in a database), and the user wants to find out the subset of all the $x \in X$ that are matched by some pattern (e.g., all files whose names end with the string .txt). In full generality, we can allow the user to specify the pattern by specifying a (computable) function $F : \{0,1\}^* \rightarrow \{0,1\}$, where $F(x) = 1$ corresponds to the pattern matching $x$. That is, the user provides a program $P$ in a programming language such as Python, and the system returns all $x \in X$ such that $P(x) = 1$. For example, one could search for all text files that contain the string important document or perhaps (letting $P$ correspond to a neural-network based classifier) all images that contain a cat. However, we don’t want our system to get into an infinite loop just trying to evaluate the program $P$! For this reason, typical systems for searching files or databases do not allow users to specify the patterns using full-fledged programming languages. Rather, such systems use restricted computational models that on the one hand are rich enough to capture many of the queries needed in practice (e.g., all filenames ending with .txt, or all phone numbers of the form (617) xxx-xxxx), but on the other hand are restricted enough so that queries can be evaluated very efficiently on huge files and in particular cannot result in an infinite loop.

    One of the most popular such computational models is regular expressions. If you ever used an advanced text editor, a command-line shell, or have done any kind of manipulation of text files, then you have probably come across regular expressions.

    A regular expression over some alphabet $\Sigma$ is obtained by combining elements of $\Sigma$ with the operation of concatenation, as well as $|$ (corresponding to or) and $*$ (corresponding to repetition zero or more times). (Common implementations of regular expressions in programming languages and shells typically include some extra operations on top of $|$ and $*$, but these operations can be implemented as “syntactic sugar” using the operators $|$ and $*$.) For example, the following regular expression over the alphabet $\{0,1\}$ corresponds to the set of all strings $x \in \{0,1\}^*$ where every digit is repeated at least twice:

    $$(00(0^*)|11(1^*))^* \tag{6.9}$$

    The following regular expression over the alphabet $\{a, \ldots, z, 0, \ldots, 9\}$ corresponds to the set of all strings that consist of a sequence of one or more of the letters $a$-$d$ followed by a sequence of one or more digits (without a leading zero):

    $$(a|b|c|d)(a|b|c|d)^*(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)^* \tag{6.10}$$
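Both expressions can be tried out with Python's standard `re` module, whose syntax includes concatenation, `|` and `*` (along with many extra operators that we do not use here). This is a small sketch of ours; the helper `matches` uses `re.fullmatch` so that the whole string, rather than a substring, has to match.

```python
import re

# (6.9): every digit is repeated at least twice.
E1 = r'(00(0*)|11(1*))*'
# (6.10): one or more of the letters a-d, then digits with no leading zero.
E2 = r'(a|b|c|d)(a|b|c|d)*(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*'

def matches(e, x):
    '''1 if the entire string x matches the expression e, else 0.'''
    return 1 if re.fullmatch(e, x) else 0
```

For instance, `matches(E1, '0011100')` is 1 while `matches(E1, '0010')` is 0, and `matches(E2, 'abc12078')` is 1 while `matches(E2, 'ab012')` is 0 because of the leading zero.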

    Formally, regular expressions are defined by the following recursive definition:

    Definition 6.6 (Regular expression). A regular expression $e$ over an alphabet $\Sigma$ is a string over $\Sigma \cup \{ (, ), |, *, \emptyset, "" \}$ that has one of the following forms:

    1. $e = \sigma$ where $\sigma \in \Sigma$

    2. $e = (e'|e'')$ where $e', e''$ are regular expressions.

    3. $e = (e')(e'')$ where $e', e''$ are regular expressions. (We often drop the parentheses when there is no danger of confusion and so write this as $e' e''$.)

    4. $e = (e')^*$ where $e'$ is a regular expression.

    Finally we also allow the following “edge cases”: $e = \emptyset$ and $e = ""$. These are the regular expressions corresponding to accepting no strings, and accepting only the empty string respectively.

    We will drop parentheses when they can be inferred from the context. We also use the convention that OR and concatenation are left-associative, and we give highest precedence to $*$, then concatenation, and then OR. Thus for example we write $00^*|11$ instead of $((0)(0^*))|((1)(1))$.

    https://goo.gl/2vTAFU

    Every regular expression $e$ corresponds to a function $\Phi_e : \Sigma^* \rightarrow \{0,1\}$ where $\Phi_e(x) = 1$ if $x$ matches the regular expression. For example, if $e = (00|11)^*$ then $\Phi_e(110011) = 1$ but $\Phi_e(101) = 0$ (can you see why?).

    The formal definition of $\Phi_e$ is one of those definitions that is more cumbersome to write than to grasp. Thus it might be easier for you first to work out the definition on your own, and then check that it matches what is written below.

    Definition 6.7 (Matching a regular expression). Let $e$ be a regular expression over the alphabet $\Sigma$. The function $\Phi_e : \Sigma^* \rightarrow \{0,1\}$ is defined as follows:

    1. If $e = \sigma$ then $\Phi_e(x) = 1$ iff $x = \sigma$.

    2. If $e = (e'|e'')$ then $\Phi_e(x) = \Phi_{e'}(x) \vee \Phi_{e''}(x)$ where $\vee$ is the OR operator.

    3. If $e = (e')(e'')$ then $\Phi_e(x) = 1$ iff there is some $x', x'' \in \Sigma^*$ such that $x$ is the concatenation of $x'$ and $x''$ and $\Phi_{e'}(x') = \Phi_{e''}(x'') = 1$.

    4. If $e = (e')^*$ then $\Phi_e(x) = 1$ iff there is some $k \in \mathbb{N}$ and some $x_0, \ldots, x_{k-1} \in \Sigma^*$ such that $x$ is the concatenation $x_0 \cdots x_{k-1}$ and $\Phi_{e'}(x_i) = 1$ for every $i \in [k]$.

    5. Finally, for the edge cases $\Phi_{\emptyset}$ is the constant zero function, and $\Phi_{""}$ is the function that only outputs $1$ on the empty string $""$.

    We say that a regular expression $e$ over $\Sigma$ matches a string $x \in \Sigma^*$ if $\Phi_e(x) = 1$.
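The recursive definition of $\Phi_e$ can be mirrored by a short recursive program. The sketch below is ours and makes one representational assumption: instead of parsing strings, a regular expression is given as a nested tuple such as `('cat', e1, e2)`, with the tags `'sym'`, `'or'`, `'cat'`, `'star'`, `'empty'` (for $\emptyset$) and `'eps'` (for `""`) being our own labels. It follows the definition case by case and makes no attempt at efficiency; its running time can be exponential in the length of $x$.

```python
def match(e, x):
    '''Return 1 iff the string x matches the expression e (given as a nested tuple).'''
    kind = e[0]
    if kind == 'sym':                      # case 1: a single symbol
        return 1 if x == e[1] else 0
    if kind == 'or':                       # case 2: OR of the two sub-expressions
        return match(e[1], x) | match(e[2], x)
    if kind == 'cat':                      # case 3: try every split x = x' x''
        return 1 if any(match(e[1], x[:i]) and match(e[2], x[i:])
                        for i in range(len(x) + 1)) else 0
    if kind == 'star':                     # case 4: zero or more pieces
        if x == '':
            return 1                       # k = 0 pieces
        # Peel off a non-empty first piece, then match the rest against e again.
        # Empty pieces can be skipped without loss of generality.
        return 1 if any(match(e[1], x[:i]) and match(e, x[i:])
                        for i in range(1, len(x) + 1)) else 0
    if kind == 'empty':                    # case 5: matches nothing
        return 0
    if kind == 'eps':                      # case 5: matches only the empty string
        return 1 if x == '' else 0
    raise ValueError('unknown expression type: %r' % (kind,))

# The expression (00|11)* from the example above.
ZERO, ONE = ('sym', '0'), ('sym', '1')
E = ('star', ('or', ('cat', ZERO, ZERO), ('cat', ONE, ONE)))
```

With this representation, `match(E, '110011')` returns 1 and `match(E, '101')` returns 0, in agreement with the values of $\Phi_e$ stated earlier.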

    The definitions above are not inherently difficult but are a bit cumbersome. So you should pause here and go over them again until you understand why they correspond to our intuitive notion of regular expressions. This is important not just for understanding regular expressions themselves (which are used time and again in a great many applications) but also for getting better at understanding recursive definitions in general.

    A Boolean function is called โ€œregularโ€ if it outputs 1 on preciselythe set of strings that are matched by some regular expression. That is,

    Definition 6.8 โ€” Regular functions / languages. Let ฮฃ be a finite set and๐น โˆถ ฮฃโˆ— โ†’ {0, 1} be a Boolean function. We say that ๐น is regular if๐น = ฮฆ๐‘’ for some regular expression ๐‘’.

    Similarly, for every formal language $L \subseteq \Sigma^*$, we say that $L$ is regular if and only if there is a regular expression $e$ such that $x \in L$ iff $e$ matches $x$.

    ■ Example 6.9 (A regular function). Let $\Sigma = \{a, b, c, d, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9\}$ and $F : \Sigma^* \rightarrow \{0,1\}$ be the function such that $F(x)$ outputs 1 iff $x$ consists of one or more of the letters $a$-$d$ followed by a sequence of one or more digits (without a leading zero). Then $F$ is a regular function, since $F = \Phi_e$ where

    $$e = (a|b|c|d)(a|b|c|d)^*(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)^* \qquad (6.11)$$

    is the expression we saw in (6.10).

    If we wanted to verify, for example, that $\Phi_e(abc12078) = 1$, we can do so by noticing that the expression $(a|b|c|d)$ matches the string $a$, $(a|b|c|d)^*$ matches $bc$, $(0|1|2|3|4|5|6|7|8|9)$ matches the string 1, and the expression $(0|1|2|3|4|5|6|7|8|9)^*$ matches the string 2078. Each one of those boils down to a simpler expression. For example, the expression $(a|b|c|d)^*$ matches the string $bc$ because both of the one-character strings $b$ and $c$ are matched by the expression $a|b|c|d$.
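The check in Example 6.9 can also be tried mechanically. The following is a minimal sketch, assuming that the expression (6.11) can be transcribed symbol for symbol into the syntax of Python's standard `re` module; the names `LETTER`, `DIGIT`, `PATTERN` and `phi` are ours and do not appear in the text.

```python
import re

# Expression (6.11), transcribed into Python's `re` syntax (an assumption of this sketch).
LETTER = "(a|b|c|d)"
DIGIT = "(0|1|2|3|4|5|6|7|8|9)"
PATTERN = re.compile(LETTER + LETTER + "*" + DIGIT + DIGIT + "*")


def phi(x: str) -> int:
    """Return 1 iff the whole string x is matched by the expression, and 0 otherwise."""
    return 1 if PATTERN.fullmatch(x) else 0


print(phi("abc12078"))  # 1
print(phi("12078"))     # 0, since there is no leading letter
print(phi("abc"))       # 0, since there is no digit
```

Note that `fullmatch` is used rather than `search`, since $\Phi_e$ asks whether the entire string is matched and not whether some substring is.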

    Regular expressions can be defined over any finite alphabet $\Sigma$, but as usual, we will mostly focus our attention on the binary case, where $\Sigma = \{0,1\}$. Most (if not all) of the theoretical and practical general insights about regular expressions can be gleaned from studying the binary case.

    6.3.1 Algorithms for matching regular expressions

    Regular expressions would not be very useful for search if we could not evaluate, given a regular expression $e$, whether a string $x$ is matched by $e$. Luckily, there is an algorithm to do so. Specifically, there is an algorithm (think "Python program" though later we will formalize the notion of algorithms using Turing machines) that on input a regular expression $e$ over the alphabet $\{0,1\}$ and a string $x \in \{0,1\}^*$, outputs 1 iff $e$ matches $x$ (i.e., outputs $\Phi_e(x)$).

    Indeed, Definition 6.7 actually specifies a recursive algorithm for computing $\Phi_e$. Specifically, each one of our operations (concatenation, OR, and star) can be thought of as reducing the task of testing whether an expression $e$ matches a string $x$ to testing whether some sub-expressions of $e$ match substrings of $x$. Since these sub-expressions are always shorter than the original expression, this yields a recursive algorithm for checking if $e$ matches $x$, which will eventually terminate at the base cases of the expressions that correspond to a single symbol or the empty string.


    Algorithm 6.10 (Regular expression matching).

    Input: Regular expression $e$ over $\Sigma$, $x \in \Sigma^*$
    Output: $\Phi_e(x)$

    procedure Match($e$, $x$)
        if $e = \emptyset$ then return 0
        if $x = ""$ then return MatchEmpty($e$)
        if $e \in \Sigma$ then return 1 iff $x = e$
        if $e = (e'|e'')$ then return Match($e'$, $x$) or Match($e''$, $x$)
        if $e = (e')(e'')$ then
            for $i \in [|x| + 1]$ do
                if Match($e'$, $x_0 \cdots x_{i-1}$) and Match($e''$, $x_i \cdots x_{|x|-1}$) then return 1
            end for
        end if
        if $e = (e')^*$ then
            if $e' = ""$ then return Match("", $x$)    # ("")* is the same as ""
            for $i \in [|x|]$ do
                # $x_0 \cdots x_{i-1}$ is shorter than $x$
                if Match($e$, $x_0 \cdots x_{i-1}$) and Match($e'$, $x_i \cdots x_{|x|-1}$) then return 1
            end for
        end if
        return 0
    end procedure

    We assume above that we have a procedure MatchEmpty that on input a regular expression $e$ outputs 1 if and only if $e$ matches the empty string "".
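The recursion of Algorithm 6.10 can be sketched in Python as follows. This is only an illustration under our own assumptions: expressions are encoded as nested tuples (an encoding that is not part of the text), and the helper `match_empty` plays the role of the MatchEmpty procedure. The special case for `("")*` is omitted because the loop in the star case already returns the right answer for it.

```python
# Expressions are nested tuples (a representation assumed for this sketch):
#   ("empty",)       the expression ∅       ("eps",)        the expression ""
#   ("sym", c)       a single symbol c      ("or", e1, e2)  the expression e1|e2
#   ("cat", e1, e2)  the expression (e1)(e2)
#   ("star", e1)     the expression (e1)*


def match_empty(e):
    """Return True iff the expression e matches the empty string."""
    kind = e[0]
    if kind in ("eps", "star"):
        return True
    if kind in ("empty", "sym"):
        return False
    if kind == "or":
        return match_empty(e[1]) or match_empty(e[2])
    return match_empty(e[1]) and match_empty(e[2])  # concatenation


def match(e, x):
    """Return True iff the expression e matches the string x."""
    kind = e[0]
    if kind == "empty":
        return False
    if x == "":
        return match_empty(e)
    if kind == "eps":
        return False  # x is non-empty here
    if kind == "sym":
        return x == e[1]
    if kind == "or":
        return match(e[1], x) or match(e[2], x)
    if kind == "cat":
        # Try every split of x into a prefix for e[1] and a suffix for e[2].
        return any(match(e[1], x[:i]) and match(e[2], x[i:])
                   for i in range(len(x) + 1))
    # Star: a strictly shorter prefix matched by e itself, and the rest matched by e[1].
    return any(match(e, x[:i]) and match(e[1], x[i:]) for i in range(len(x)))


ZERO_ONE_STAR = ("star", ("cat", ("sym", "0"), ("sym", "1")))  # the expression (01)*
print(match(ZERO_ONE_STAR, "0101"))  # True
print(match(ZERO_ONE_STAR, "010"))   # False
```

Every recursive call is made either on a smaller expression or on the same expression with a strictly shorter string, which is why the recursion terminates.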

    The key observation is that in our recursive definition of regular expressions, whenever $e$ is made up of one or two expressions $e', e''$ then these two regular expressions are smaller than $e$. Eventually (when they have size 1) they must correspond to the non-recursive case of a single alphabet symbol. Correspondingly, the recursive calls made in Algorithm 6.10 always correspond to a shorter expression or (in the case of an expression of the form $(e')^*$) a shorter input string. Thus, we can prove the correctness of Algorithm 6.10 on inputs of the form $(e, x)$ by induction over $|e| + |x|$, a quantity that strictly decreases in every recursive call. The base case is when either $x = ""$ or $e$ is a single alphabet symbol, "" or $\emptyset$. In the case the expression is of the form $e = (e'|e'')$ or $e = (e')(e'')$, we make recursive calls with the shorter expressions $e', e''$. In the case the expression is of the form $e = (e')^*$, we make recursive calls with either a shorter string $x$ and the same expression, or with the shorter expression $e'$ and a string $x'$ that is equal in length or shorter than $x$.

    Solved Exercise 6.3 (Match the empty string). Give an algorithm that on input a regular expression $e$, outputs 1 if and only if $\Phi_e("") = 1$.

    ■

    Solution:

    We can obtain such a recursive algorithm by using the following observations:

    1. An expression of the form "" or $(e')^*$ always matches the empty string.

    2. An expression of the form $\sigma$, where $\sigma \in \Sigma$ is an alphabet symbol, never matches the empty string.

    3. The regular expression $\emptyset$ does not match the empty string.

    4. An expression of the form $e'|e''$ matches the empty string if and only if one of $e'$ or $e''$ matches it.

    5. An expression of the form $(e')(e'')$ matches the empty string if and only if both $e'$ and $e''$ match it.

    Given the above observations, we see that the following algorithm will check if $e$ matches the empty string:

    procedure MatchEmpty($e$)
        if $e = ""$ then return 1
        if $e = \emptyset$ or $e \in \Sigma$ then return 0
        if $e = (e'|e'')$ then return MatchEmpty($e'$) or MatchEmpty($e''$)
        if $e = (e')(e'')$ then return MatchEmpty($e'$) and MatchEmpty($e''$)
        if $e = (e')^*$ then return 1
    end procedure

    ■

    6.4 EFFICIENT MATCHING OF REGULAR EXPRESSIONS (OPTIONAL)

    Algorithm 6.10 is not very efficient. For example, given an expression involving concatenation or the "star" operation and a string of length $n$, it can make $n$ recursive calls, and hence it can be shown that in the worst case Algorithm 6.10 can take time exponential in the length of the input string $x$. Fortunately, it turns out that there is a much more efficient algorithm that can match regular expressions in linear (i.e., $O(n)$) time. Since we have not yet covered the topics of time and space complexity, we describe this algorithm in high-level terms, without making the computational model precise. Rather we will use the colloquial notion of $O(n)$ running time as used in introductory programming courses and whiteboard coding interviews. We will see a formal definition of time complexity in Chapter 13.

    Theorem 6.11 (Matching regular expressions in linear time). Let $e$ be a regular expression. Then there is an $O(n)$ time algorithm that computes $\Phi_e$.

    The implicit constant in the $O(n)$ term of Theorem 6.11 depends on the expression $e$. Thus, another way to state Theorem 6.11 is that for every expression $e$, there is some constant $c$ and an algorithm $A$ that computes $\Phi_e$ on $n$-bit inputs using at most $c \cdot n$ steps. This makes sense since in practice we often want to compute $\Phi_e(x)$ for a small regular expression $e$ and a large document $x$. Theorem 6.11 tells us that we can do so with running time that scales linearly with the size of the document, even if it has (potentially) worse dependence on the size of the regular expression.

    We prove Theorem 6.11 by obtaining a more efficient recursive algorithm that determines whether $e$ matches a string $x \in \{0,1\}^n$ by reducing this task to determining whether a related expression $e'$ matches $x_0, \ldots, x_{n-2}$. This will result in an expression for the running time of the form $T(n) = T(n-1) + O(1)$, which solves to $T(n) = O(n)$.

    Restrictions of regular expressions. The central definition for the algorithm behind Theorem 6.11 is the notion of a restriction of a regular expression. The idea is that for every regular expression $e$ and symbol $\sigma$ in its alphabet, it is possible to define a regular expression $e[\sigma]$ such that $e[\sigma]$ matches a string $x$ if and only if $e$ matches the string $x\sigma$. For example, if $e$ is the regular expression $01|(01)^*(01)$ (i.e., one or more occurrences of 01) then $e[1]$ is equal to $0|(01)^*0$ and $e[0]$ will be $\emptyset$. (Can you see why?)

    Algorithm 6.12 computes the restriction $e[\sigma]$ given a regular expression $e$ and an alphabet symbol $\sigma$. It always terminates, since the recursive calls it makes are always on expressions smaller than the input expression. Its correctness can be proven by induction on the length of the regular expression $e$, with the base cases being when $e$ is "", $\emptyset$, or a single alphabet symbol $\tau$.


    Algorithm 6.12 (Restricting regular expression).

    Input: Regular expression $e$ over $\Sigma$, symbol $\sigma \in \Sigma$
    Output: Regular expression $e' = e[\sigma]$ such that $\Phi_{e'}(x) = \Phi_e(x\sigma)$ for every $x \in \Sigma^*$

    procedure Restrict($e$, $\sigma$)
        if $e = ""$ or $e = \emptyset$ then return $\emptyset$
        if $e = \tau$ for $\tau \in \Sigma$ then return "" if $\tau = \sigma$ and return $\emptyset$ otherwise
        if $e = (e'|e'')$ then return (Restrict($e'$, $\sigma$) | Restrict($e''$, $\sigma$))
        if $e = (e')^*$ then return $(e')^*$(Restrict($e'$, $\sigma$))
        if $e = (e')(e'')$ and $\Phi_{e''}("") = 0$ then return $(e')$(Restrict($e''$, $\sigma$))
        if $e = (e')(e'')$ and $\Phi_{e''}("") = 1$ then return $(e')$(Restrict($e''$, $\sigma$)) | Restrict($e'$, $\sigma$)
    end procedure

    Using this notion of restriction, we can define the following recursive algorithm for regular expression matching:

    Algorithm 6.13 (Regular expression matching in linear time).

    Input: Regular expression $e$ over $\Sigma$, $x \in \Sigma^n$ where $n \in \mathbb{N}$
    Output: $\Phi_e(x)$

    procedure FMatch($e$, $x$)
        if $x = ""$ then return MatchEmpty($e$)
        Let $e' \leftarrow$ Restrict($e$, $x_{n-1}$)
        return FMatch($e'$, $x_0 \cdots x_{n-2}$)
    end procedure

    By the definition of a restriction, for every $\sigma \in \Sigma$ and $x' \in \Sigma^*$, the expression $e$ matches $x'\sigma$ if and only if $e[\sigma]$ matches $x'$. Hence for every $e$ and $x \in \Sigma^n$, $\Phi_{e[x_{n-1}]}(x_0 \cdots x_{n-2}) = \Phi_e(x)$ and Algorithm 6.13 does return the correct answer. The only remaining task is to analyze its running time. Note that Algorithm 6.13 uses the MatchEmpty procedure of Solved Exercise 6.3 in the base case that $x = ""$. However, this is OK since this procedure's running time depends only on $e$ and is independent of the length of the original input.
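To see restriction in action, here is a small Python sketch of Algorithms 6.12 and 6.13. It is an illustration under our own assumptions rather than a faithful implementation: expressions are encoded as nested tuples (an encoding that is not part of the text), `fmatch` is written as a loop over the symbols of $x$ from last to first instead of as a recursion, and restricted expressions are never simplified. Because of the last point the intermediate expressions can grow, so this sketch by itself does not achieve the linear running time discussed in the text.

```python
# Tuple encoding assumed for this sketch:
#   ("empty",) = ∅, ("eps",) = "", ("sym", c), ("or", e1, e2), ("cat", e1, e2), ("star", e1)
EMPTY, EPS = ("empty",), ("eps",)


def match_empty(e):
    """Return True iff e matches the empty string."""
    kind = e[0]
    if kind in ("eps", "star"):
        return True
    if kind in ("empty", "sym"):
        return False
    if kind == "or":
        return match_empty(e[1]) or match_empty(e[2])
    return match_empty(e[1]) and match_empty(e[2])  # concatenation


def restrict(e, s):
    """Return an expression e[s] that matches x iff e matches x followed by s."""
    kind = e[0]
    if kind in ("empty", "eps"):
        return EMPTY
    if kind == "sym":
        return EPS if e[1] == s else EMPTY
    if kind == "or":
        return ("or", restrict(e[1], s), restrict(e[2], s))
    if kind == "star":
        return ("cat", e, restrict(e[1], s))
    # Concatenation (e1)(e2): the last symbol is consumed by e2, or by e1
    # when e2 can match the empty string.
    tail = ("cat", e[1], restrict(e[2], s))
    if match_empty(e[2]):
        return ("or", tail, restrict(e[1], s))
    return tail


def fmatch(e, x):
    """Decide whether e matches x by restricting to the symbols of x, last to first."""
    for s in reversed(x):
        e = restrict(e, s)
    return match_empty(e)


ZERO_ONE = ("cat", ("sym", "0"), ("sym", "1"))
E = ("or", ZERO_ONE, ("cat", ("star", ZERO_ONE), ZERO_ONE))  # 01|(01)*(01)
print(fmatch(E, "0101"))  # True
print(fmatch(E, "010"))   # False
```

Here `E` is the example expression $01|(01)^*(01)$ from the discussion of restrictions; restricting it to the symbol 0 yields an expression that matches nothing, in line with $e[0] = \emptyset$.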

    For simplicity, let us restrict our attention to the case that the alphabet $\Sigma$ is equal to $\{0,1\}$. Define $C(\ell)$ to be the maximum number of operations that Algorithm 6.12 takes when given as input a regular expression $e$ over $\{0,1\}$ of at most $\ell$ symbols. The value $C(\ell)$ can be shown to be polynomial in $\ell$, though this is not important for this theorem, since we only care about the dependence of the time to compute $\Phi_e(x)$ on the length of $x$ and not about the dependence of this time on the length of $e$.

    Algorithm 6.13 is a recursive algorithm that on input an expression $e$ and a string $x \in \{0,1\}^n$, does computation of at most $C(|e|)$ steps and then calls itself with input some expression $e'$ and a string $x'$ of length $n-1$. It will terminate after $n$ steps when it reaches a string of length 0. So, the running time $T(e,n)$ that it takes for Algorithm 6.13 to compute $\Phi_e$ for inputs of length $n$ satisfies the recursive equation:

    $$T(e,n) = \max\{ T(e[0], n-1), T(e[1], n-1) \} + C(|e|) \qquad (6.12)$$

    (In the base case $n = 0$, $T(e,0)$ is equal to some constant depending only on $e$.) To get some intuition for the expression Eq. (6.12), let us open up the recursion for one level, writing $T(e,n)$ as

    $$T(e,n) = \max\{ T(e[0][0], n-2) + C(|e[0]|),\; T(e[0][1], n-2) + C(|e[0]|),\; T(e[1][0], n-2) + C(|e[1]|),\; T(e[1][1], n-2) + C(|e[1]|) \} + C(|e|) . \qquad (6.13)$$

    Continuing this way, we can see that $T(e,n) \leq n \cdot C(L) + O(1)$ where $L$ is the largest length of any expression $e'$ that we encounter along the way. Therefore, the following claim suffices to show that Algorithm 6.13 runs in $O(n)$ time:

    Claim: Let $e$ be a regular expression over $\{0,1\}$, then there is a number $L(e) \in \mathbb{N}$, such that for every sequence of symbols $\alpha_0, \ldots, \alpha_{n-1}$, if we define $e' = e[\alpha_0][\alpha_1] \cdots [\alpha_{n-1}]$ (i.e., restricting $e$ to $\alpha_0$, and then $\alpha_1$ and so on and so forth), then $|e'| \leq L(e)$.

    Proof of claim: For a regular expression $e$ over $\{0,1\}$ and $\alpha \in \{0,1\}^m$, we denote by $e[\alpha]$ the expression $e[\alpha_0][\alpha_1] \cdots [\alpha_{m-1}]$ obtained by restricting $e$ to $\alpha_0$ and then to $\alpha_1$ and so on. We let $S(e) = \{ e[\alpha] \;|\; \alpha \in \{0,1\}^* \}$. We will prove the claim by showing that for every $e$, the set $S(e)$ is finite, and hence so is the number $L(e)$ which is the maximum length of $e'$ for $e' \in S(e)$.

    We prove this by induction on the structure of $e$. If $e$ is a symbol, the empty string, or the empty set, then this is straightforward to show, as the only expressions $S(e)$ can contain are the expression itself, "", and $\emptyset$. Otherwise we split into the two cases (i) $e = (e')^*$ and (ii) $e = e'e''$, where $e', e''$ are smaller expressions (and hence by the induction hypothesis $S(e')$ and $S(e'')$ are finite). (The remaining case $e = e'|e''$ is simpler: since $e[\alpha] = e'[\alpha]|e''[\alpha]$, the set $S(e)$ has at most $|S(e')| \cdot |S(e'')|$ elements.) In the case (i), if $e = (e')^*$ then $e[\alpha]$ is either equal to $(e')^* e'[\alpha]$ or it is simply the empty set if $e'[\alpha] = \emptyset$. Since $e'[\alpha]$ is in the set $S(e')$, the number of distinct expressions in $S(e)$ is at most $|S(e')| + 1$. In the case (ii), if $e = e'e''$ then all the restrictions of $e$ to strings $\alpha$ will either have the form $e'e''[\alpha]$ or the form $e'e''[\alpha] | e'[\alpha']$ where $\alpha'$ is some string such that $\alpha = \alpha'\alpha''$ and $e''[\alpha'']$ matches the empty string. Since $e''[\alpha] \in S(e'')$ and $e'[\alpha'] \in S(e')$, the number of the possible distinct expressions of the form $e[\alpha]$ is at most $|S(e'')| + |S(e'')| \cdot |S(e')|$. This completes the proof of the claim.

    The bottom line is that while running Algorithm 6.13 on a regular expression $e$, all the expressions we ever encounter are in the finite set $S(e)$, no matter how large the input $x$ is, and so the running time of Algorithm 6.13 satisfies the equation $T(n) = T(n-1) + C'$ for some constant $C'$ depending on $e$. This solves to $O(n)$ where the implicit constant in the $O$ notation can (and will) depend on $e$ but crucially, not on the length of the input $x$.

    6.4.1 Matching regular expressions using DFAs

    Theorem 6.11 is already quite impressive, but we can do even better. Specifically, no matter how long the string $x$ is, we can compute $\Phi_e(x)$ by maintaining only a constant amount of memory and moreover making a single pass over $x$. That is, the algorithm will scan the input $x$ once from start to finish, and then determine whether or not $x$ is matched by the expression $e$. This is important in the common case of trying to match a short regular expression over a huge file or document that might not even fit in our computer's memory. Of course, as we have seen before, a single-pass constant-memory algorithm is simply a deterministic finite automaton. As we will see in Theorem 6.16, a function can be computed by a regular expression if and only if it can be computed by a DFA. We start with showing the "only if" direction:

    Theorem 6.14 (DFA for regular expression matching). Let $e$ be a regular expression. Then there is an algorithm that on input $x \in \{0,1\}^*$ computes $\Phi_e(x)$ while making a single pass over $x$ and maintaining a constant amount of memory.
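To make "a single pass and a constant amount of memory" concrete, the following sketch runs a hand-written three-state DFA for the expression $(01)^*$. The state numbering, the dictionary encoding of the transition function, and the names `TRANSITIONS`, `ACCEPTING` and `run_dfa` are our own assumptions; the automaton drawn in the book's figures may label its states differently.

```python
# States (our own numbering): 0 = expecting a "0" (also the start and the only
# accepting state), 1 = just read a "0" and expecting a "1", 2 = dead state.
TRANSITIONS = {
    (0, "0"): 1, (0, "1"): 2,
    (1, "0"): 2, (1, "1"): 0,
    (2, "0"): 2, (2, "1"): 2,
}
ACCEPTING = {0}


def run_dfa(transitions, accepting, x, start=0):
    """Scan x once from left to right, remembering only the current state."""
    state = start
    for symbol in x:
        state = transitions[(state, symbol)]
    return 1 if state in accepting else 0


print(run_dfa(TRANSITIONS, ACCEPTING, "0101"))  # 1
print(run_dfa(TRANSITIONS, ACCEPTING, "010"))   # 0
```

The only memory carried between symbols is the variable `state`, which takes one of three values regardless of the length of the input.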

    Proof Idea:

    The single-pass constant-memory algorithm for checking if a string matches a regular expression is presented in Algorithm 6.15. The idea is to replace the recursive algorithm of Algorithm 6.13 with a dynamic program, using the technique of memoization. If you have not yet taken an algorithms course, you might not know these techniques. This is OK; while this more efficient algorithm is crucial for the many practical applications of regular expressions, it is not of great importance for this book.

    ⋆

    https://goo.gl/kgLdX1 https://en.wikipedia.org/wiki/Memoization


    Algorithm 6.15 (Regular expression matching by a DFA).

    Input: Regular expression $e$ over $\Sigma$, $x \in \Sigma^n$ where $n \in \mathbb{N}$
    Output: $\Phi_e(x)$

    procedure DFAMatch($e$, $x$)
        Let $S \leftarrow S(e)$ be the set $\{ e[\alpha] \;|\; \alpha \in \{0,1\}^* \}$ as defined in the proof of Theorem 6.11.
        for $e' \in S$ do
            Let $v_{e'} \leftarrow 1$ if $\Phi_{e'}("") = 1$ and $v_{e'} \leftarrow 0$ otherwise
        end for
        for $i \in [n]$ do
            Let $last_{e'} \leftarrow v_{e'}$ for all $e' \in S$
            Let $v_{e'} \leftarrow last_{e'[x_i]}$ for all $e' \in S$
        end for
        return $v_e$
    end procedure

    Proof of Theorem 6.14. Algorithm 6.15 checks if a given string $x \in \Sigma^*$ is matched by the regular expression $e$. For every regular expression $e$, this algorithm has a constant number $2|S(e)|$ of Boolean variables ($v_{e'}, last_{e'}$ for $e' \in S(e)$), and it makes a single pass over the input string. Hence it corresponds to a DFA. We prove its correctness by induction on the length $n$ of the input. Specifically, we will argue that before reading the $i$-th bit of $x$, the variable $v_{e'}$ is equal to $\Phi_{e'}(x_0 \cdots x_{i-1})$ for every $e' \in S(e)$. In the case $i = 0$ this holds since we initialize $v_{e'} = \Phi_{e'}("")$ for all $e' \in S(e)$. For $i > 0$ this holds by induction, since the inductive hypothesis implies that $last_{e'} = \Phi_{e'}(x_0 \cdots x_{i-2})$ for all $e' \in S(e)$, and by the definition of the set $S(e)$, for every $e' \in S(e)$ and $x_{i-1} \in \Sigma$, $e'' = e'[x_{i-1}]$ is in $S(e)$ and $\Phi_{e'}(x_0 \cdots x_{i-1}) = \Phi_{e''}(x_0 \cdots x_{i-2})$.

    ■

    6.4.2 Equivalence of regular expressions and automata

    Recall that a Boolean function $F : \{0,1\}^* \rightarrow \{0,1\}$ is defined to be regular if it is equal to $\Phi_e$ for some regular expression $e$. (Equivalently, a language $L \subseteq \{0,1\}^*$ is defined to be regular if there is a regular expression $e$ such that $e$ matches $x$ iff $x \in L$.) The following theorem is the central result of automata theory:

    Theorem 6.16 (DFA and regular expression equivalency). Let $F : \{0,1\}^* \rightarrow \{0,1\}$. Then $F$ is regular if and only if there exists a DFA $(T, \mathcal{S})$ that computes $F$.

    Proof Idea:


    Figure 6.6: A deterministic finite automaton that computes the function $\Phi_{(01)^*}$.

    Figure 6.7: Given a DFA of $C$ states, for every $v, w \in [C]$ and number $t \in \{0, \ldots, C\}$ we define the function $F^t_{v,w} : \{0,1\}^* \rightarrow \{0,1\}$ to output one on input $x \in \{0,1\}^*$ if and only if when the DFA is initialized in the state $v$ and is given the input $x$, it will reach the state $w$ while going only through the intermediate states $\{0, \ldots, t-1\}$.

    One direction follows from Theorem 6.14, which shows that for every regular expression $e$, the function $\Phi_e$ can be computed by a DFA (see for example Fig. 6.6). For the other direction, we show that given a DFA $(T, \mathcal{S})$, for every $v, w \in [C]$ we can find a regular expression that would match $x \in \{0,1\}^*$ if and only if the DFA, starting in state $v$, will end up in state $w$ after reading $x$.

    ⋆

    Proof of Theorem 6.16. Since Theorem 6.14 proves the "only if" direction, we only need to show the "if" direction. Let $A = (T, \mathcal{S})$ be a DFA with $C$ states that computes the function $F$. We need to show that $F$ is regular.

    For every $v, w \in [C]$, we let $F_{v,w} : \{0,1\}^* \rightarrow \{0,1\}$ be the function that maps $x \in \{0,1\}^*$ to 1 if and only if the DFA $A$, starting at the state $v$, will reach the state $w$ if it reads the input $x$. We will prove that $F_{v,w}$ is regular for every $v, w$. This will prove the theorem, since by Definition 6.2, $F(x)$ is equal to the OR of $F_{0,w}(x)$ for every $w \in \mathcal{S}$. Hence if we have a regular expression for every function of the form $F_{v,w}$ then (using the $|$ operation), we can obtain a regular expression for $F$ as well.

    To give regular expressions for the functions $F_{v,w}$, we start by defining the following functions $F^t_{v,w}$: for every $v, w \in [C]$ and $0 \leq t \leq C$, $F^t_{v,w}(x) = 1$ if and only if starting from $v$ and observing $x$, the automaton reaches $w$ with all intermediate states being in the set $[t] = \{0, \ldots, t-1\}$ (see Fig. 6.7). That is, while $v, w$ themselves might be outside $[t]$, $F^t_{v,w}(x) = 1$ if and only if throughout the execution of the automaton on the input $x$ (when initiated at $v$) it never enters any of the states outside $[t]$ and still ends up at $w$. If $t = 0$ then $[t]$ is the empty set, and hence $F^0_{v,w}(x) = 1$ if and only if the automaton reaches $w$ from $v$ directly on $x$, without any intermediate state. If $t = C$ then all states are in $[t]$, and hence $F^t_{v,w} = F_{v,w}$.

    We will prove the theorem by induction on $t$, showing that $F^t_{v,w}$ is regular for every $v, w$ and $t$. For the base case of $t = 0$, $F^0_{v,w}$ is regular for every $v, w$ since it can be described by an expression built from "", $\emptyset$, 0, and 1 using the $|$ operator. Specifically, if $v \neq w$ then $F^0_{v,w}(x) = 1$ if and only if $x$ consists of a single symbol $\sigma \in \{0,1\}$ and $T(v,\sigma) = w$. Therefore in this case $F^0_{v,w}$ corresponds to one of the four regular expressions $0|1$, 0, 1 or $\emptyset$, depending on whether $A$ transitions to $w$ from $v$ when it reads either 0 or 1, only one of these symbols, or neither. If $v = w$ then the empty string also takes the automaton from $v$ to $w$ without any intermediate state, and so $F^0_{v,v}$ corresponds to the expression obtained as above with "" added as a further alternative (for example, "" alone if no symbol loops from $v$ back to itself).

    Inductive step: Now that we've seen the base case, let us prove the general case by induction. Assume, via the induction hypothesis, that for every $v', w' \in [C]$, we have a regular expression $R^t_{v',w'}$ that computes $F^t_{v',w'}$. We need to prove that $F^{t+1}_{v,w}$ is regular for every $v, w$. If the automaton arrives from $v$ to $w$ using the intermediate states $[t+1]$, then it visits the $t$-th state zero or more times. If the path labeled by $x$ causes the automaton to get from $v$ to $w$ without visiting the $t$-th state at all, then $x$ is matched by the regular expression $R^t_{v,w}$. If the path labeled by $x$ causes the automaton to get from $v$ to $w$ while visiting the $t$-th state $k > 0$ times then we can think of this path as:

    • First travel from $v$ to $t$ using only intermediate states in $[t]$.

    • Then go from $t$ back to itself $k-1$ times using only intermediate states in $[t]$.

    • Then go from $t$ to $w$ using only intermediate states in $[t]$.

    Therefore in this case the string $x$ is matched by the regular expression $R^t_{v,t}(R^t_{t,t})^* R^t_{t,w}$. (See also Fig. 6.8.)

    Therefore we can compute $F^{t+1}_{v,w}$ using the regular expression

    $$R^t_{v,w} \;|\; R^t_{v,t}(R^t_{t,t})^* R^t_{t,w} . \qquad (6.14)$$

    This completes the proof of the inductive step and hence of the theorem.

    ■

    Figure 6.8: If we have regular expressions $R^t_{v',w'}$ corresponding to $F^t_{v',w'}$ for every $v', w' \in [C]$, we can obtain a regular expression $R^{t+1}_{v,w}$ corresponding to $F^{t+1}_{v,w}$. The key observation is that a path from $v$ to $w$ using $\{0, \ldots, t\}$ either does not touch $t$ at all, in which case it is captured by the expression $R^t_{v,w}$, or it goes from $v$ to $t$, comes back to $t$ zero or more times, and then goes from $t$ to $w$, in which case it is captured by the expression $R^t_{v,t}(R^t_{t,t})^* R^t_{t,w}$.
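The construction in the proof of Theorem 6.16 is mechanical enough to run. The sketch below is one possible rendering under our own assumptions: regular expressions are produced as pattern strings for Python's `re` module, `None` stands for $\emptyset$, the helper names `alt`, `cat`, `star` and `dfa_to_regex` are ours, and the base case for $v = w$ admits the empty string as well as any single symbol that loops from $v$ back to itself. The resulting expressions are correct but far from minimal, since no simplification beyond removing duplicates is attempted.

```python
import re
from itertools import product


def alt(a, b):
    """The expression a|b, where None stands for the empty-set expression."""
    if a is None:
        return b
    if b is None or a == b:
        return a
    return f"(?:{a}|{b})"


def cat(a, b):
    """The concatenation (a)(b); concatenating with the empty set gives the empty set."""
    if a is None or b is None:
        return None
    return a + b


def star(a):
    """The expression (a)*; both the empty set and "" star to ""."""
    if a is None or a == "":
        return ""
    return f"(?:{a})*"


def dfa_to_regex(T, accepting, C):
    """Return a pattern for the strings that take state 0 to an accepting state (None if there are none)."""
    # Base case t = 0: no intermediate states are allowed.
    R = [[None] * C for _ in range(C)]
    for v in range(C):
        for w in range(C):
            r = "" if v == w else None
            for s in "01":
                if T[(v, s)] == w:
                    r = alt(r, s)
            R[v][w] = r
    # Inductive step (6.14): additionally allow state t as an intermediate state.
    for t in range(C):
        R = [[alt(R[v][w], cat(cat(R[v][t], star(R[t][t])), R[t][w]))
              for w in range(C)] for v in range(C)]
    result = None
    for w in sorted(accepting):
        result = alt(result, R[0][w])
    return result


# A three-state DFA for (01)* (our own numbering; state 2 is a dead state).
T = {(0, "0"): 1, (0, "1"): 2, (1, "0"): 2, (1, "1"): 0, (2, "0"): 2, (2, "1"): 2}
pattern = re.compile(dfa_to_regex(T, {0}, 3))

# Compare the expression with a direct simulation of the DFA on all short strings.
for n in range(7):
    for bits in product("01", repeat=n):
        x = "".join(bits)
        state = 0
        for s in x:
            state = T[(state, s)]
        assert bool(pattern.fullmatch(x)) == (state == 0)
```

The final loop is a sanity check rather than a proof: it only confirms that the generated expression and the automaton agree on every binary string of length at most 6.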

    6.4.3 Closure properties of regular expressions

    If $F$ and $G$ are regular functions computed by the expressions $e$ and $f$ respectively, then the expression $e|f$ computes the function $H = F \vee G$ defined as $H(x) = F(x) \vee G(x)$. Another way to say this is that the set of regular functions is closed under the OR operation. That is, if $F$ and $G$ are regular then so is $F \vee G$. An important corollary of Theorem 6.16 is that this set is also closed under the NOT operation:


    Lemma 6.17 (Regular expressions closed under complement). If $F : \{0,1\}^* \rightarrow \{0,1\}$ is regular then so is the function $\overline{F}$, where $\overline{F}(x) = 1 - F(x)$ for every $x \in \{0,1\}^*$.

    Proof. If $F$ is regular then by Theorem 6.16 it can be computed by a DFA $A = (T, \mathcal{A})$ with some $C$ states. But then the DFA $\overline{A} = (T, [C] \setminus \mathcal{A})$, which does the same computation but flips the set of accepting states, will compute $\overline{F}$. By Theorem 6.16 this implies that $\overline{F}$ is regular as well.

    ■

    Since $a \wedge b = \overline{\overline{a} \vee \overline{b}}$, Lemma 6.17 implies that the set of regular functions is closed under the AND operation as well. Moreover, since OR, NOT and AND are a universal basis, this set is also closed under NAND, XOR, and any other finite function. That is, we have the following corollary:
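The proof of Lemma 6.17 amounts to a one-line change to a DFA, which the following sketch illustrates on a three-state automaton for $(01)^*$. The automaton, its state numbering, and the names `run_dfa` and `complement_accepting` are our own assumptions for the purpose of the illustration.

```python
from itertools import product


def run_dfa(transitions, accepting, x, start=0):
    """Return 1 iff the DFA ends in an accepting state after reading x."""
    state = start
    for symbol in x:
        state = transitions[(state, symbol)]
    return 1 if state in accepting else 0


def complement_accepting(accepting, num_states):
    """Flip the accepting set: the states of [C] that are not accepting."""
    return set(range(num_states)) - set(accepting)


# A three-state DFA for (01)* (our own numbering; state 2 is a dead state).
T = {(0, "0"): 1, (0, "1"): 2, (1, "0"): 2, (1, "1"): 0, (2, "0"): 2, (2, "1"): 2}
A = {0}
A_BAR = complement_accepting(A, 3)

# The flipped automaton computes 1 - F(x) on every string of length at most 5.
for n in range(6):
    for bits in product("01", repeat=n):
        x = "".join(bits)
        assert run_dfa(T, A_BAR, x) == 1 - run_dfa(T, A, x)
```

The transition function is untouched; only the set of accepting states changes, which is why the complement needs no additional states.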

    Theorem 6.18 (Closure of regular expressions). Let $f : \{0,1\}^k \rightarrow \{0,1\}$ be any finite Boolean function, and let $F_0, \ldots, F_{k-1} : \{0,1\}^* \rightarrow \{0,1\}$ be regular functions. Then the function $G(x) = f(F_0(x), F_1(x), \ldots, F_{k-1}(x))$ is regular.

    Proof. This is a direct consequence of the closure of regular functions under OR and NOT (and hence AND), combined with Theorem 4.13, that states that every $f$ can be computed by a Boolean circuit (which is simply a combination of the AND, OR, and NOT operations).

    ■

    6.5 LIMITATIONS OF REGULAR EXPRESSIONS AND THE PUMPING LEMMA

    The efficiency of regular expression matching makes them very useful. This is why operating systems and text editors often restrict their search interface to regular expressions and do not allow searching by specifying an arbitrary function. However, this efficiency comes at a cost. As we have seen, regular expressions cannot compute every function. In fact, there are some very simple (and useful!) functions that they cannot compute. Here is one example:

    Lemma 6.19 (Matching parentheses). Let $\Sigma = \{\langle, \rangle\}$ and MATCHPAREN $: \Sigma^* \rightarrow \{0,1\}$ be the function that given a string of parentheses, outputs 1 if and only if every opening parenthesis is matched by a corresponding closed one. Then there is no regular expression over $\Sigma$ that computes MATCHPAREN.

    Lemma 6.19 is a consequence of the following result, which is known as the pumping lemma:


    Theorem 6.20 (Pumping Lemma). Let $e$ be a regular expression over some alphabet $\Sigma$. Then there is some number $n_0$ such that for every $w \in \Sigma^*$ with $|w| > n_0$ and $\Phi_e(w) = 1$, we can write $w = xyz$ for strings $x, y, z \in \Sigma^*$ satisfying the following conditions:

    1. $|y| \geq 1$.

    2. $|xy| \leq n_0$.

    3. $\Phi_e(x y^k z) = 1$ for every $k \in \mathbb{N}$.

    Figure 6.9: To prove the "pumping lemma" we look at a word $w$ that is much larger than the regular expression $e$ that matches it. In such a case, part of $w$ must be matched by some sub-expression of the form $(e')^*$, since this is the only operator that allows matching words longer than the expression. If we look at the "leftmost" such sub-expression and define $y^k$ to be the string that is matched by it, we obtain the partition needed for the pumping lemma.

    Proof Idea:

    The idea behind the proof is the following. Let $n_0$ be twice the number of symbols that are used in the expression $e$. Then the only way that there is some $w$ with $|w| > n_0$ and $\Phi_e(w) = 1$ is that $e$ contains the $*$ (i.e. star) operator and that there is a non-empty substring $y$ of $w$ that was matched by $(e')^*$ for some sub-expression $e'$ of $e$. We can now repeat $y$ any number of times and still get a matching string. See also Fig. 6.9.

    ⋆

    The pumping lemma is a bit cumbersome to state, but one way to remember it is that it simply says the following: "if a string matching a regular expression is long enough, one of its substrings must be matched using the $*$ operator".
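One way to get a feel for the statement is to search for a decomposition $w = xyz$ by brute force. The sketch below does this with Python's `re` module; the function name `find_pumping_split`, the bound `max_k`, and the choice $n_0 = 6$ in the example are our own assumptions. Since it only tries $k = 0, \ldots,$ `max_k`, a returned split is evidence for condition 3 of the lemma, not a proof of it.

```python
import re


def find_pumping_split(pattern, w, n0, max_k=5):
    """Search for w = x y z with |y| >= 1 and |xy| <= n0 such that
    x y^k z is matched by `pattern` for every k in 0..max_k."""
    regex = re.compile(pattern)
    for end in range(1, min(n0, len(w)) + 1):  # end = |xy|
        for start in range(end):               # start = |x|
            x, y, z = w[:start], w[start:end], w[end:]
            if all(regex.fullmatch(x + y * k + z) for k in range(max_k + 1)):
                return x, y, z
    return None


# For (01)* the block "01" can be repeated or removed freely.
print(find_pumping_split("(01)*", "01010101", 6))  # ('', '01', '010101')

# An expression without a star matches only finitely many strings, so no split can be pumped.
print(find_pumping_split("0101", "0101", 4))       # None
```

The second call does not contradict the lemma: an expression with no star operator matches no string longer than the expression itself, so the lemma's hypothesis $|w| > n_0$ is never met.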


    Proof of Theorem 6.20. To prove the lemma formally, we use induction on the length of the expression. Like all induction proofs, this will be somewhat lengthy, but at the end of the day it directly follows the intuition above that somewhere we must have used the star operation. Reading this proof, and in particular understanding how the formal proof below corresponds to the intuitive idea above, is a very good way to get more comfortable with inductive proofs of this form.

    Our inductive hypothesis is that for an expression of length $n$, $n_0 = 2n$ satisfies the conditions of the lemma. The base case is when the expression is a single symbol $\sigma \in \Sigma$ or the expression is $\emptyset$ or "". In all these cases the conditions of the lemma are satisfied simply because $n_0 = 2$ and there is no string $x$ of length larger than $n_0$ that is matched by the expression.

    We now prove the inductive step. Let $e$ be a regular expression with $n > 1$ symbols. We set $n_0 = 2n$ and let $w \in \Sigma^*$ be a string satisfying $|w| > n_0$. Since $e$ has more than one symbol, it has one of the forms (a) $e'|e''$, (b) $(e')(e'')$, or (c) $(e')^*$, where in all these cases the subexpressions $e'$ and $e''$ have fewer symbols than $e$ and hence satisfy the induction hypothesis.

    In case (a), every string $w$ matched by $e$ must be matched by either $e'$ or $e''$. If $e'$ matches $w$ then, since $|w| > 2|e'|$, by the induction hypothesis there exist $x, y, z$ with $w = xyz$, $|y| \geq 1$ and $|xy| \leq 2|e'| < n_0$ such that $e'$ (and therefore also $e = e'|e''$) matches $xy^kz$ for every $k$. The same argument works in the case that $e''$ matches $w$.

    In case (b), if $w$ is matched by $(e')(e'')$ then we can write $w = w'w''$ where $e'$ matches $w'$ and $e''$ matches $w''$. We split into subcases. If $|w'| > 2|e'|$, then by the induction hypothesis there exist $x, y, z'$ with $|y| \geq 1$ and $|xy| \leq 2|e'| < n_0$ such that $w' = xyz'$ and $e'$ matches $xy^kz'$ for every $k \in \mathbb{N}$. This completes the proof, since if we set $z = z'w''$ then $w = w'w'' = xyz$ and $e = (e')(e'')$ matches $xy^kz$ for every $k \in \mathbb{N}$. Otherwise, if $|w'| \leq 2|e'|$, then since $|w| = |w'| + |w''| > n_0 = 2(|e'| + |e''|)$, it must be that $|w''| > 2|e''|$. Hence by the induction hypothesis there exist $x', y, z$ such that $w'' = x'yz$, $|y| \geq 1$, $|x'y| \leq 2|e''|$, and $e''$ matches $x'y^kz$ for every $k \in \mathbb{N}$. But now if we set $x = w'x'$ we see that $|xy| \leq |w'| + |x'y| \leq 2|e'| + 2|e''| = n_0$, and on the other hand the expression $e = (e')(e'')$ matches $xy^kz = w'x'y^kz$ for every $k \in \mathbb{N}$.

    In case (c), if $w$ is matched by $(e')^*$ then $w = w_0 \cdots w_t$ where for every $i \in [t]$, $w_i$ is a nonempty string matched by $e'$. If $|w_0| > 2|e'|$, then we can use the same approach as in the concatenation case above. Otherwise, we simply note that if $x$ is the empty string, $y = w_0$, and $z = w_1 \cdots w_t$, then $|xy| \leq n_0$ and $xy^kz$ is matched by $(e')^*$ for every $k \in \mathbb{N}$.

    ■
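The conclusion of the lemma can be checked by brute force on small examples. The following is a minimal sketch, assuming Python's `re` module as a stand-in for the regular expressions of this chapter (the two should agree on simple patterns built only from `|`, concatenation and `*`). The helper name `find_pump` and the bound `max_k` are illustrative choices, and testing finitely many values of $k$ is only evidence, not a proof.

```python
import re

def find_pump(pattern, w, n0, max_k=5):
    """Search for a split w = x y z with |y| >= 1 and |xy| <= n0 such that
    x y^k z is matched by `pattern` for every k in 0..max_k.
    Returns (x, y, z), or None if no such split exists."""
    regex = re.compile(pattern)
    for xy_len in range(1, min(n0, len(w)) + 1):
        for x_len in range(xy_len):
            x, y, z = w[:x_len], w[x_len:xy_len], w[xy_len:]
            if all(regex.fullmatch(x + y * k + z) for k in range(max_k + 1)):
                return x, y, z
    return None

# Assumption: we count every character of the pattern as a "symbol",
# so n0 = 2 * len(pattern) mirrors the choice made in the proof.
pattern = r"(00|11)*1"
n0 = 2 * len(pattern)
w = "00" * 10 + "1"   # longer than n0 and matched by the pattern
print(find_pump(pattern, w, n0))
```

The split that is found repeats a block consumed by the starred sub-expression, which is exactly the intuition behind the proof.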


    Remark 6.21: Recursive definitions and inductive proofs. When an object is recursively defined (as in the case of regular expressions) then it is natural to prove properties of such objects by induction. That is, if we want to prove that all objects of this type have property $P$, then it is natural to use an inductive step that says that if $o', o'', o'''$ etc. have property $P$ then so does an object $o$ that is obtained by composing them.

    Using the pumping lemma, we can easily prove Lemma 6.19 (i.e., the non-regularity of the “matching parenthesis” function):

    Proof of Lemma 6.19. Suppose, towards a contradiction, that there is an expression $e$ such that $\Phi_e = \text{MATCHPAREN}$. Let $n_0$ be the number obtained from Theorem 6.20 and let $w = \langle^{n_0}\rangle^{n_0}$ (i.e., $n_0$ left parentheses followed by $n_0$ right parentheses). Then we see that if we write $w = xyz$ as in Theorem 6.20, the condition $|xy| \leq n_0$ implies that $y$ consists solely of left parentheses. Hence the string $xy^2z$ will contain more left parentheses than right parentheses. Hence $\text{MATCHPAREN}(xy^2z) = 0$, but by the pumping lemma $\Phi_e(xy^2z) = 1$, contradicting our assumption that $\Phi_e = \text{MATCHPAREN}$.

    ■
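The key step of this proof, that every allowed split forces $y$ to consist of left parentheses only, can be confirmed exhaustively for small $n_0$. This is a sketch under two assumptions: MATCHPAREN is taken to be the usual "balanced parentheses" function, and the characters `(` and `)` stand in for the two symbols of the alphabet. The function names are illustrative.

```python
def matchparen(s):
    """Return 1 if s is a balanced sequence of parentheses, else 0."""
    depth = 0
    for c in s:
        depth += 1 if c == "(" else -1
        if depth < 0:          # a right parenthesis with no partner
            return 0
    return 1 if depth == 0 else 0

def pumping_always_breaks(n0):
    """For w = '(' * n0 + ')' * n0, check that every split w = x y z with
    |y| >= 1 and |xy| <= n0 gives MATCHPAREN(x y^2 z) = 0."""
    w = "(" * n0 + ")" * n0
    for xy_len in range(1, n0 + 1):
        for x_len in range(xy_len):
            x, y, z = w[:x_len], w[x_len:xy_len], w[xy_len:]
            if matchparen(x + y * 2 + z) == 1:
                return False
    return True

print(all(pumping_always_breaks(n0) for n0 in range(1, 30)))
```

No choice of split survives pumping with $k = 2$, which is the contradiction the proof derives.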

    The pumping lemma is a very useful tool to show that certain functions are not computable by a regular expression. However, it is not an “if and only if” condition for regularity: there are non-regular functions that still satisfy the pumping lemma conditions. To understand the pumping lemma, it is crucial to follow the order of quantifiers in Theorem 6.20. In particular, the number $n_0$ in the statement of Theorem 6.20 depends on the regular expression (in the proof we chose $n_0$ to be twice the number of symbols in the expression). So, if we want to use the pumping lemma to rule out the existence of a regular expression $e$ computing some function $F$, we need to be able to choose an appropriate input $w \in \{0,1\}^*$ that can be arbitrarily large and satisfies $F(w) = 1$. This makes sense if you think about the intuition behind the pumping lemma: we need $w$ to be large enough so as to force the use of the star operator.
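Written out with explicit quantifiers, the statement has the following shape. This is a restatement for reference, assuming the conditions $|y| \geq 1$ and $|xy| \leq n_0$ that the proofs in this section use when invoking Theorem 6.20. The second formula is the logical negation of the first, which is what a non-regularity proof has to establish.

```latex
% Conclusion of the pumping lemma (holds whenever F is regular):
\[
\exists n_0 \;\; \forall w \,\bigl(|w| > n_0,\ F(w) = 1\bigr) \;\;
\exists x, y, z \,\bigl(w = xyz,\ |y| \geq 1,\ |xy| \leq n_0\bigr) \;\;
\forall k \in \mathbb{N} : \; F(xy^kz) = 1
\]

% Its negation (showing this suffices to conclude that F is not regular):
\[
\forall n_0 \;\; \exists w \,\bigl(|w| > n_0,\ F(w) = 1\bigr) \;\;
\forall x, y, z \,\bigl(w = xyz,\ |y| \geq 1,\ |xy| \leq n_0\bigr) \;\;
\exists k \in \mathbb{N} : \; F(xy^kz) = 0
\]
```

In the negated form, the objects we get to choose are $w$ (as a function of $n_0$) and $k$ (as a function of $x, y, z$); everything else must be handled for all possible values.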

    Figure 6.10: A cartoon of a proof using the pumping lemma that a function $F$ is not regular. The pumping lemma states that if $F$ is regular then there exists a number $n_0$ such that for every large enough $w$ with $F(w) = 1$, there exists a partition of $w$ to $w = xyz$ satisfying certain conditions such that for every $k \in \mathbb{N}$, $F(xy^kz) = 1$. You can imagine a pumping-lemma based proof as a game between you and the adversary. Every “there exists” quantifier corresponds to an object you are free to choose on your own (and base your choice on previously chosen objects). Every “for every” quantifier corresponds to an object the adversary can choose arbitrarily (and again based on prior choices) as long as it satisfies the conditions. A valid proof corresponds to a strategy by which no matter what the adversary does, you can win the game by obtaining a contradiction which would be a choice of $k$ that would result in $F(xy^kz) = 0$, hence violating the conclusion of the pumping lemma.

    Solved Exercise 6.4: Palindromes is not regular. Prove that the following function over the alphabet $\{0, 1, ;\}$ is not regular: $\text{PAL}(w) = 1$ if and only if $w = u;u^R$ where $u \in \{0,1\}^*$ and $u^R$ denotes $u$ “reversed”: the string $u_{|u|-1} \cdots u_0$. (The Palindrome function is most often defined without an explicit separator character ;, but the version with such a separator is a bit cleaner, and so we use it here. This does not make much difference, as one can easily encode the separator as a special binary string instead.)

    ■

    Solution:

    We use the pumping lemma. Suppose, towards a contradiction, that there is a regular expression $e$ computing PAL, and let $n_0$ be the number obtained by the pumping lemma (Theorem 6.20). Consider the string $w = 0^{n_0};0^{n_0}$. Since the reverse of the all-zero string is the all-zero string, $\text{PAL}(w) = 1$. Now, by the pumping lemma, if PAL is computed by $e$, then we can write $w = xyz$ such that $|xy| \leq n_0$, $|y| \geq 1$ and $\text{PAL}(xy^kz) = 1$ for every $k \in \mathbb{N}$. In particular, it must hold that $\text{PAL}(xz) = 1$, but this is a contradiction, since $xz = 0^{n_0-|y|};0^{n_0}$ and so its two parts are not of the same length and in particular are not the reverse of one another.

    ■
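The solution pumps "down" (taking $k = 0$) rather than up. A small exhaustive check of that step, written as a sketch with illustrative function names, looks as follows.

```python
def pal(w):
    """PAL(w) = 1 iff w = u;u^R for some u in {0,1}*."""
    if w.count(";") != 1:
        return 0
    u, v = w.split(";")
    if any(c not in "01" for c in u + v):
        return 0
    return 1 if v == u[::-1] else 0

def removing_y_always_breaks(n0):
    """For w = 0^n0 ; 0^n0, check that every split w = x y z with
    |y| >= 1 and |xy| <= n0 gives PAL(xz) = 0."""
    w = "0" * n0 + ";" + "0" * n0
    assert pal(w) == 1
    for xy_len in range(1, n0 + 1):
        for x_len in range(xy_len):
            x, z = w[:x_len], w[xy_len:]
            if pal(x + z) == 1:
                return False
    return True

print(all(removing_y_always_breaks(n0) for n0 in range(1, 30)))
```

Because $|xy| \leq n_0$, the removed block $y$ always lies to the left of the separator, so the two halves end up with different lengths.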

    For yet another example of a pumping-lemma based proof, see Fig. 6.10, which illustrates a cartoon of the proof of the non-regularity of the function $F : \{0,1\}^* \rightarrow \{0,1\}$ which is defined as $F(x) = 1$ iff $x = 0^n1^n$ for some $n \in \mathbb{N}$ (i.e., $x$ consists of a string of consecutive zeroes, followed by a string of consecutive ones of the same length).

    6.6 ANSWERING SEMANTIC QUESTIONS ABOUT REGULAR EXPRESSIONS

    Regular expressions have applications beyond search. For example, regular expressions are often used to define tokens (such as what is a valid variable identifier, or keyword) in the design of parsers, compilers and interpreters for programming languages. Regular expressions have other applications too: for example, in recent years, the world of networking moved from fixed topologies to “software defined networks”. Such networks are routed by programmable switches that can implement policies such as “if packet is secured by SSL then forward it to A, otherwise forward it to B”. To represent such policies we need a language that is on one hand sufficiently expressive to capture the policies we want to implement, but on the other hand sufficiently restrictive so that we can quickly execute them at network speed and also be able to answer questions such as “can C see the packets moved from A to B?”. The NetKAT network programming language (https://goo.gl/oeJNuw) uses a variant of regular expressions to achieve precisely that. For this application, it is important that we are not merely able to answer whether an expression $e$ matches a string $x$ but also answer semantic questions about regular expressions such as “do expressions $e$ and $e'$ compute the same function?” and “does there exist a string $x$ that is matched by the expression $e$?”. The following theorem shows that we can answer the latter question:

    Theorem 6.22: Emptiness of regular languages is computable. There is an algorithm that given a regular expression $e$, outputs 1 if and only if $\Phi_e$ is the constant zero function.

    Proof Idea:

    The idea is that we can directly observe this from the structure of the expression. The only way a regular expression $e$ computes the constant zero function is if $e$ has the form $\emptyset$ or is obtained by concatenating $\emptyset$ with other expressions.

    ⋆

    Proof of Theorem 6.22. Define a regular expression to be “empty” if it computes the constant zero function. Given a regular expression $e$, we can determine if $e$ is empty using the following rules:

    • If $e$ has the form $\sigma$ or "" then it is not empty.

    • If $e$ is not empty then $e|e'$ is not empty for every $e'$.

    • If $e$ is not empty then $e^*$ is not empty.

    • If $e$ and $e'$ are both not empty then $e\,e'$ is not empty.

    • $\emptyset$ is empty.

    Using these rules, it is straightforward to come up with a recursive algorithm to determine emptiness.

    ■
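The rules above can be turned into a recursive procedure along the following lines. This is a minimal sketch: the syntax-tree classes (`Empty`, `Epsilon`, `Sym`, `Or`, `Concat`, `Star`) are a hypothetical representation chosen for illustration, not one defined in the text. It also assumes the usual semantics in which a starred expression may be repeated zero times, so that $(e)^*$ matches "" for every $e$ (a slightly stronger statement than the star rule listed above).

```python
from dataclasses import dataclass

class Regex:
    """Base class for a regular-expression syntax tree (hypothetical)."""

@dataclass
class Empty(Regex):      # the expression ∅, which matches no string
    pass

@dataclass
class Epsilon(Regex):    # the expression "", which matches only the empty string
    pass

@dataclass
class Sym(Regex):        # a single alphabet symbol
    char: str

@dataclass
class Or(Regex):         # e | e'
    left: Regex
    right: Regex

@dataclass
class Concat(Regex):     # e e'
    left: Regex
    right: Regex

@dataclass
class Star(Regex):       # (e)*
    inner: Regex

def is_empty(e):
    """Return True iff e computes the constant zero function."""
    if isinstance(e, Empty):
        return True
    if isinstance(e, (Epsilon, Sym)):
        return False
    if isinstance(e, Or):       # empty only if both alternatives are empty
        return is_empty(e.left) and is_empty(e.right)
    if isinstance(e, Concat):   # empty as soon as one factor is empty
        return is_empty(e.left) or is_empty(e.right)
    if isinstance(e, Star):     # assumption: zero repetitions match ""
        return False
    raise TypeError(f"not a regular expression: {e!r}")

print(is_empty(Concat(Sym("0"), Empty())), is_empty(Or(Empty(), Sym("1"))))
```

The recursion visits each node of the expression once, so its running time is linear in the size of the expression.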

    Using Theorem 6.22, we can obtain an algorithm that determines whether or not two regular expressions $e$ and $e'$ are equivalent, in the sense that they compute the same function.

    Theorem 6.23: Equivalence of regular expressions is computable. Let $\text{REGEQ} : \{0,1\}^* \rightarrow \{0,1\}$ be the function that on input (a string representing) a pair of regular expressions $e, e'$, $\text{REGEQ}(e, e') = 1$ if and only if $\Phi_e = \Phi_{e'}$. Then there is an algorithm that computes REGEQ.

    Proof Idea:

    The idea is to show that given a pair of regular expressions $e$ and $e'$ we can find an expression $e''$ such that $\Phi_{e''}(x) = 1$ if and only if $\Phi_e(x) \neq \Phi_{e'}(x)$. Therefore $\Phi_{e''}$ is the constant zero function if and only if $e$ and $e'$ are equivalent, and thus we can test for emptiness of $e''$ to determine equivalence of $e$ and $e'$.

    ⋆

    Proof of Theorem 6.23. We will prove Theorem 6.23 from Theorem 6.22. (The two theorems are in fact equivalent: it is easy to prove Theorem 6.22 from Theorem 6.23, since checking for emptiness is the same as checking equivalence with the expression $\emptyset$.) Given two regular expressions $e$ and $e'$, we will compute an expression $e''$ such that $\Phi_{e''}(x) = 1$ if and only if $\Phi_e(x) \neq \Phi_{e'}(x)$. One can see that $e$ is equivalent to $e'$ if and only if $e''$ is empty.

    We start with the observation that for every two bits $a, b \in \{0,1\}$, $a \neq b$ if and only if

    $$(a \wedge \overline{b}) \vee (\overline{a} \wedge b) . \tag{6.15}$$

    Hence we need to construct $e''$ such that for every $x$,

    $$\Phi_{e''}(x) = (\Phi_e(x) \wedge \overline{\Phi_{e'}(x)}) \vee (\overline{\Phi_e(x)} \wedge \Phi_{e'}(x)) . \tag{6.16}$$

    To construct the expression $e''$, we will show how given any pair of expressions $e$ and $e'$, we can construct expressions $e \wedge e'$ and $\overline{e}$ that compute the functions $\Phi_e \wedge \Phi_{e'}$ and $\overline{\Phi_e}$ respectively. (Computing the expression for $e \vee e'$ is straightforward using the $|$ operation of regular expressions.)

    Specifically, by Lemma 6.17, regular functions are closed under negation, which means that for every regular expression $e$, there is an expression $\overline{e}$ such that $\Phi_{\overline{e}}(x) = 1 - \Phi_e(x)$ for every $x \in \{0,1\}^*$. Now, for every two expressions $e$ and $e'$, the expression

    $$e \wedge e' = \overline{(\overline{e} \,|\, \overline{e'})} \tag{6.17}$$

    computes the AND of the two expressions. Given these two transformations, we see that for every pair of regular expressions $e$ and $e'$ we can find a regular expression $e''$ satisfying (6.16) such that $e''$ is empty if and only if $e$ and $e'$ are equivalent.

    ■
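The same "is the XOR ever 1?" test can be run directly on automata instead of on expressions. The sketch below is not the construction of the proof: it assumes both expressions have already been converted to DFAs, and it uses a hypothetical representation of a DFA as a triple `(delta, start, accepting)`. It explores the reachable pairs of states of the two automata and reports a difference as soon as exactly one of them accepts.

```python
from collections import deque

def dfa_equivalent(dfa1, dfa2, alphabet="01"):
    """Each DFA is (delta, start, accepting), where delta[(state, symbol)]
    is the next state. Returns True iff both accept exactly the same strings."""
    (d1, s1, acc1), (d2, s2, acc2) = dfa1, dfa2
    seen = {(s1, s2)}
    queue = deque([(s1, s2)])
    while queue:
        p, q = queue.popleft()
        # XOR of the two acceptance bits: some string separates the DFAs
        if (p in acc1) != (q in acc2):
            return False
        for c in alphabet:
            nxt = (d1[(p, c)], d2[(q, c)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True

# Two DFAs for "even number of 1s": a 2-state one and a redundant 4-state one.
even2 = ({(0, "0"): 0, (0, "1"): 1, (1, "0"): 1, (1, "1"): 0}, 0, {0})
mod4 = {(s, c): (s + int(c)) % 4 for s in range(4) for c in "01"}
even4 = (mod4, 0, {0, 2})
# A DFA for "number of 1s divisible by 4", which differs on the input "11".
div4 = (mod4, 0, {0})

print(dfa_equivalent(even2, even4), dfa_equivalent(even2, div4))
```

Since there are only finitely many pairs of states, the search always terminates, which mirrors the fact that emptiness of $e''$ is decidable.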

    ✓ Chapter Recap

    • We model computational tasks on arbitrarily large inputs using infinite functions $F : \{0,1\}^* \rightarrow \{0,1\}^*$.

    • Such functions take an arbitrarily long (but still finite!) string as input, and cannot be described by a finite table of inputs and outputs.

    • A function with a single bit of output is known as a Boolean function, and the task of computing it is equivalent to deciding a language $L \subseteq \{0,1\}^*$.

    • Deterministic finite automata (DFAs) are one simple model for computing (infinite) Boolean functions.

    • There are some functions that cannot be computed by DFAs.

    • The set of functions computable by DFAs is the same as the set of languages that can be recognized by regular expressions.

    6.7 EXERCISES

    Exercise 6.1: Closure properties of regular functions. Suppose that $F, G : \{0,1\}^* \rightarrow \{0,1\}$ are regular. For each one of the following definitions of the function $H$, either prove that $H$ is always regular or give a counterexample for regular $F, G$ that would make $H$ not regular.

    1. $H(x) = F(x) \vee G(x)$.

    2. $H(x) = F(x) \wedge G(x)$.

    3. $H(x) = \text{NAND}(F(x), G(x))$.

    4. $H(x) = F(x^R)$ where $x^R$ is the reverse of $x$: $x^R = x_{n-1}x_{n-2} \cdots x_0$ for $n = |x|$.

    5. $H(x) = \begin{cases} 1 & x = uv \text{ s.t. } F(u) = G(v) = 1 \\ 0 & \text{otherwise} \end{cases}$

    6. $H(x) = \begin{cases} 1 & x = uu \text{ s.t. } F(u) = G(u) = 1 \\ 0 & \text{otherwise} \end{cases}$

    7. $H(x) = \begin{cases} 1 & x = uu^R \text{ s.t. } F(u) = G(u) = 1 \\ 0 & \text{otherwise} \end{cases}$

    ■

    Exercise 6.2. One among the following two functions that map $\{0,1\}^*$ to $\{0,1\}$ can be computed by a regular expression, and the other one cannot. For the one that can be computed by a regular expression, write the expression that does it. For the one that cannot, prove that this cannot be done using the pumping lemma.

    • $F(x) = 1$ if 4 divides $\sum_{i=0}^{|x|-1} x_i$ and $F(x) = 0$ otherwise.

    • $G(x) = 1$ if and only if $\sum_{i=0}^{|x|-1} x_i \geq |x|/4$ and $G(x) = 0$ otherwise.

    ■

    Exercise 6.3: Non-regularity.

    1. Prove that the following function $F : \{0,1\}^* \rightarrow \{0,1\}$ is not regular. For every $x \in \{0,1\}^*$, $F(x) = 1$ iff $x$ is of the form $x = 1^{3^i}$ for some $i > 0$.

    2. Prove that the following function $F : \{0,1\}^* \rightarrow \{0,1\}$ is not regular. For every $x \in \{0,1\}^*$, $F(x) = 1$ iff $\sum_j x_j = 3^i$ for some $i > 0$.

    ■

    6.8 BIBLIOGRAPHICAL NOTES

    The relation of regular expressions with finite automata is a beautiful topic, which we only touch upon in this text. It is covered more extensively in [Sip97; HMU14; Koz97]. These texts also discuss topics such as non-deterministic finite automata (NFA) and the relation between context-free grammars and pushdown automata.

    The automaton of Fig. 6.4 was generated using the FSM simulator of Ivan Zuzak and Vedrana Jankovic (http://ivanzuzak.info/noam/webapps/fsm_simulator/). Our proof of Theorem 6.11 is closely related to the Myhill-Nerode Theorem (https://goo.gl/mnKVMP). One direction of the Myhill-Nerode theorem can be stated as saying that if $e$ is a regular expression then there is at most a finite number of strings $z_0, \ldots, z_{k-1}$ such that $\Phi_{e[z_i]} \neq \Phi_{e[z_j]}$ for every $0 \leq i \neq j < k$.
