Post on 09-Feb-2021
Figure 6.1: Once you know how to multiply multi-digit numbers, you can do so for every number $n$ of digits, but if you had to describe multiplication using Boolean circuits or NAND-CIRC programs, you would need a different program/circuit for every length $n$ of the input.
6 Functions with Infinite domains, Automata, and Regular expressions

"An algorithm is a finite answer to an infinite number of questions." Attributed to Stephen Kleene.
The model of Boolean circuits (or equivalently, the NAND-CIRC programming language) has one very significant drawback: a Boolean circuit can only compute a finite function $f$. In particular, since every gate has two inputs, a size $s$ circuit can compute on an input of length at most $2s$. Thus this model does not capture our intuitive notion of an algorithm as a single recipe to compute a potentially infinite function. For example, the standard elementary school multiplication algorithm is a single algorithm that multiplies numbers of all lengths. However, we cannot express this algorithm as a single circuit, but rather need a different circuit (or equivalently, a NAND-CIRC program) for every input length (see Fig. 6.1).
In this chapter, we extend our definition of computational tasks to consider functions with the unbounded domain of $\{0,1\}^*$. We focus on the question of defining what tasks to compute, mostly leaving the question of how to compute them to later chapters, where we will see Turing machines and other computational models for computing on unbounded inputs. However, we will see one example of a simple restricted model of computation: deterministic finite automata (DFAs).
This chapter: A non-mathy overview
In this chapter, we discuss functions that take as input strings of arbitrary length. We will often focus on the special case of Boolean functions, where the output is a single bit. These are still infinite functions since their inputs have unbounded length and hence such a function cannot be computed by any single Boolean circuit.
In the second half of this chapter, we discuss finite automata, a computational model that can compute unbounded length functions. Finite automata are not as powerful as Python or other general-purpose programming languages but can serve as an introduction to these more general models. We also show a beautiful result: the functions computable by finite automata are precisely the ones that correspond to regular expressions. However, the reader can also feel free to skip automata and go straight to our discussion of Turing machines in Chapter 7.

Learning Objectives:
• Define functions on unbounded length inputs, that cannot be described by a finite size table of inputs and outputs.
• Equivalence with the task of deciding membership in a language.
• Deterministic finite automatons (optional): A simple example for a model for unbounded computation.
• Equivalence with regular expressions.

Compiled on 3.16.2021 13:56

introduction to theoretical computer science

Figure 6.2: The NAND circuit and NAND-CIRC program for computing the XOR of 5 bits. Note how the circuit for $\text{XOR}_5$ merely repeats four times the circuit to compute the XOR of 2 bits.
6.1 FUNCTIONS WITH INPUTS OF UNBOUNDED LENGTH
Up until now, we considered the computational task of mapping some string of length $n$ into a string of length $m$. However, in general, computational tasks can involve inputs of unbounded length. For example, the following Python function computes the function $\text{XOR} : \{0,1\}^* \rightarrow \{0,1\}$, where $\text{XOR}(x)$ equals $1$ iff the number of $1$'s in $x$ is odd. (In other words, $\text{XOR}(x) = \sum_{i=0}^{|x|-1} x_i \mod 2$ for every $x \in \{0,1\}^*$.) As simple as it is, the XOR function cannot be computed by a single Boolean circuit. Rather, for every $n$, we can compute $\text{XOR}_n$ (the restriction of XOR to $\{0,1\}^n$) using a different circuit (e.g., see Fig. 6.2).
def XOR(X):
    '''Takes list X of 0's and 1's
       Outputs 1 if the number of 1's is odd and outputs 0 otherwise'''
    result = 0
    for i in range(len(X)):
        result = (result + X[i]) % 2
    return result
Previously in this book, we studied the computation of finite functions $f : \{0,1\}^n \rightarrow \{0,1\}^m$. Such a function $f$ can always be described by listing all the $2^n$ values it takes on inputs $x \in \{0,1\}^n$. In this chapter, we consider functions such as XOR that take inputs of unbounded size. While we can describe XOR using a finite number of symbols (in fact, we just did so above), it takes infinitely many possible inputs, and so we cannot just write down all of its values. The same is true for many other functions capturing important computational tasks, including addition, multiplication, sorting, finding paths in graphs, fitting curves to points, and so on. To contrast with the finite case, we will sometimes call a function $F : \{0,1\}^* \rightarrow \{0,1\}$ (or $F : \{0,1\}^* \rightarrow \{0,1\}^*$) infinite. However, this does not mean that $F$ takes as input strings of infinite length! It just means that $F$ can take as input a string that can be arbitrarily long, and so we cannot simply write down a table of all the outputs of $F$ on different inputs.
Big Idea 8: A function $F : \{0,1\}^* \rightarrow \{0,1\}^*$ specifies the computational task mapping an input $x \in \{0,1\}^*$ into the output $F(x)$.
As we have seen before, restricting attention to functions that use binary strings as inputs and outputs does not detract from our generality, since other objects, including numbers, lists, matrices, images, videos, and more, can be encoded as binary strings.
As before, it is essential to differentiate between specification and implementation. For example, consider the following function:

$$\text{TWINP}(x) = \begin{cases} 1 & \exists_{p \in \mathbb{N}} \text{ s.t. } p, p+2 \text{ are primes and } p > |x| \\ 0 & \text{otherwise} \end{cases} \tag{6.1}$$
This is a mathematically well-defined function. For every $x$, $\text{TWINP}(x)$ has a unique value which is either $0$ or $1$. However, at the moment, no one knows of a Python program that computes this function. The Twin prime conjecture posits that for every $n$ there exists $p > n$ such that both $p$ and $p+2$ are primes. If this conjecture is true, then TWINP is indeed easy to compute: the program def T(x): return 1 will do the trick. However, mathematicians have tried unsuccessfully to prove this conjecture since 1849. That said, whether or not we know how to implement the function TWINP, the definition above provides its specification.
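To make the gap between specification and implementation concrete, here is a minimal sketch of the obvious search procedure. The names `is_prime` and `twinp_search` are ours and do not come from the text. The procedure returns 1 as soon as it finds a twin prime pair above $|x|$, but it has no way of ever returning 0: if no such pair existed, it would simply loop forever. So it computes TWINP only under the assumption that the Twin prime conjecture is true.

```python
def is_prime(n):
    '''Trial division; adequate for the small numbers in this illustration.'''
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def twinp_search(x):
    '''Search for a prime p > len(x) such that p + 2 is also prime.
    Returns 1 once such a p is found; it never returns 0.'''
    p = len(x) + 1
    while True:
        if is_prime(p) and is_prime(p + 2):
            return 1
        p += 1
```

On every input where the search succeeds, the answer agrees with the one-line program that always returns 1; the difficulty lies entirely in knowing whether the search always succeeds.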
6.1.1 Varying inputs and outputs
Many of the functions that interest us take more than one input. For example, the function

$$\text{MULT}(x, y) = x \cdot y \tag{6.2}$$

takes the binary representation of a pair of integers $x, y \in \mathbb{N}$, and outputs the binary representation of their product $x \cdot y$. However, since we can represent a pair of strings as a single string, we will consider functions such as MULT as mapping $\{0,1\}^*$ to $\{0,1\}^*$. We will typically not be concerned with low-level details such as the precise way to represent a pair of integers as a string, since virtually all choices will be equivalent for our purposes.
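As one illustration of representing a pair of strings as a single string, the sketch below doubles every bit of the first string, writes the separator 01, and then appends the second string unchanged. This particular encoding and the names `encode_pair` and `decode_pair` are our own choice for the example; the text does not fix an encoding.

```python
def encode_pair(x, y):
    '''Encode two binary strings as one: double each bit of x, then "01", then y.'''
    return ''.join(2 * b for b in x) + '01' + y

def decode_pair(z):
    '''Invert encode_pair: read doubled bits until the separator "01" appears.'''
    x_bits = []
    i = 0
    while z[i] == z[i + 1]:          # "00" or "11" encodes one bit of x
        x_bits.append(z[i])
        i += 2
    return ''.join(x_bits), z[i + 2:]   # skip the "01" separator
```

Because a doubled bit is always 00 or 11, the first occurrence of 01 at an even position marks the end of the first string unambiguously, so decoding never needs to guess.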
https://en.wikipedia.org/wiki/Twin_prime
Another example of a function we want to compute is

$$\text{PALINDROME}(x) = \begin{cases} 1 & \forall_{i \in [|x|]} \; x_i = x_{|x|-1-i} \\ 0 & \text{otherwise} \end{cases} \tag{6.3}$$
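In the same spirit as the XOR program, a short Python sketch of this function could look as follows. It assumes zero-based indexing, so position $i$ is compared with its mirror image at position $|x|-1-i$.

```python
def PALINDROME(x):
    '''Takes a list (or string) x of bits.
    Outputs 1 iff x reads the same forwards and backwards.'''
    n = len(x)
    for i in range(n):
        if x[i] != x[n - 1 - i]:
            return 0
    return 1
```

Like XOR, this is a single finite description that works for inputs of every length.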
PALINDROME has a single bit as output. Functions with a single bit of output are known as Boolean functions. Boolean functions are central to the theory of computation, and we will discuss them often in this book. Note that even though Boolean functions have a single bit of output, their input can be of arbitrary length. Thus they are still infinite functions that cannot be described via a finite table of values.

"Booleanizing" functions. Sometimes it might be convenient to obtain a Boolean variant for a non-Boolean function. For example, the following is a Boolean variant of MULT.

$$\text{BMULT}(x, y, i) = \begin{cases} i^{th} \text{ bit of } x \cdot y & i < |x \cdot y| \\ 0 & \text{otherwise} \end{cases} \tag{6.4}$$
If we can compute BMULT via any programming language such as Python, C, Java, etc., we can compute MULT as well, and vice versa.
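Here is a small sketch of one direction, under assumptions of ours: numbers are Python integers, "bit $i$" means the coefficient of $2^i$, and $|x \cdot y|$ is the bit length of the product. Since the product of an $a$-bit number and a $b$-bit number has at most $a+b$ bits, it is enough to query BMULT at the indices $i < a+b$. The name `MULT_from_BMULT` is hypothetical.

```python
def BMULT(x, y, i):
    '''Stand-in Boolean variant: bit i (least significant first) of x*y,
    and 0 for every index past the end of the product.'''
    return (x * y >> i) & 1

def MULT_from_BMULT(x, y):
    '''Recover the product using only single-bit queries to BMULT.'''
    bound = x.bit_length() + y.bit_length()   # the product has at most this many bits
    return sum(BMULT(x, y, i) << i for i in range(bound))
```

The other direction is immediate: given a program for MULT, a single bit of its output can be read off directly.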
Solved Exercise 6.1 (Booleanizing general functions). Show that for every function $F : \{0,1\}^* \rightarrow \{0,1\}^*$, there exists a Boolean function $\text{BF} : \{0,1\}^* \rightarrow \{0,1\}$ such that a Python program to compute BF can be transformed into a program to compute $F$ and vice versa.
■
Solution:
For every $F : \{0,1\}^* \rightarrow \{0,1\}^*$, we can define

$$\text{BF}(x, i, b) = \begin{cases} F(x)_i & i < |F(x)|,\; b = 0 \\ 1 & i < |F(x)|,\; b = 1 \\ 0 & i \geq |F(x)| \end{cases} \tag{6.5}$$

to be the function that on input $x \in \{0,1\}^*$, $i \in \mathbb{N}$, $b \in \{0,1\}$ outputs the $i^{th}$ bit of $F(x)$ if $b = 0$ and $i < |F(x)|$. If $b = 1$ then $\text{BF}(x, i, b)$ outputs $1$ iff $i < |F(x)|$, and hence this allows us to compute the length of $F(x)$.
Computing BF from $F$ is straightforward. For the other direction, given a Python function BF that computes BF, we can compute $F$ as follows:
def F(x):
    res = []
    i = 0
    while BF(x,i,1):
        res.append(BF(x,i,0))
        i += 1
    return res
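The "straightforward" direction can be sketched too. Below, `make_BF` is a hypothetical helper of ours that wraps any Python function returning a list of bits, and `recover_F` is the loop from the solution with BF passed in as an argument. The choice of reversing the input as the example $F$ is arbitrary.

```python
def make_BF(F):
    '''Build the Boolean variant BF(x, i, b) of a function F that returns a list of bits.'''
    def BF(x, i, b):
        y = F(x)
        if i >= len(y):
            return 0                  # index past the end of F(x)
        return 1 if b == 1 else y[i]  # b = 1 asks "is i a valid index?"
    return BF

def recover_F(BF, x):
    '''Same loop as in the solution, with BF passed in explicitly.'''
    res = []
    i = 0
    while BF(x, i, 1):
        res.append(BF(x, i, 0))
        i += 1
    return res

reverse = lambda x: x[::-1]           # an arbitrary example of F
BF_reverse = make_BF(reverse)
```

Running `recover_F(BF_reverse, x)` uses only single-bit answers yet reconstructs the whole output of `reverse(x)`.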
■
6.1.2 Formal Languages
For every Boolean function $F : \{0,1\}^* \rightarrow \{0,1\}$, we can define the set $L_F = \{ x \mid F(x) = 1 \}$ of strings on which $F$ outputs $1$. Such sets are known as languages. This name is rooted in formal language theory as pursued by linguists such as Noam Chomsky. A formal language is a subset $L \subseteq \{0,1\}^*$ (or more generally $L \subseteq \Sigma^*$ for some finite alphabet $\Sigma$). The membership or decision problem for a language $L$ is the task of determining, given $x \in \{0,1\}^*$, whether or not $x \in L$. If we can compute the function $F$, then we can decide membership in the language $L_F$ and vice versa. Hence, many texts such as [Sip97] refer to the task of computing a Boolean function as "deciding a language". In this book, we mostly describe computational tasks using the function notation, which is easier to generalize to computation with more than one bit of output. However, since the language terminology is so popular in the literature, we will sometimes mention it.
6.1.3 Restrictions of functions
If $F : \{0,1\}^* \rightarrow \{0,1\}$ is a Boolean function and $n \in \mathbb{N}$ then the restriction of $F$ to inputs of length $n$, denoted as $F_n$, is the finite function $f : \{0,1\}^n \rightarrow \{0,1\}$ such that $f(x) = F(x)$ for every $x \in \{0,1\}^n$. That is, $F_n$ is the finite function that is only defined on inputs in $\{0,1\}^n$, but agrees with $F$ on those inputs. Since $F_n$ is a finite function, it can be computed by a Boolean circuit, implying the following theorem:

Theorem 6.1 (Circuit collection for every infinite function). Let $F : \{0,1\}^* \rightarrow \{0,1\}$. Then there is a collection $\{C_n\}_{n \in \{1,2,\ldots\}}$ of circuits such that for every $n > 0$, $C_n$ computes the restriction $F_n$ of $F$ to inputs of length $n$.
Proof. This is an immediate corollary of the universality of Boolean circuits. Indeed, since $F_n$ maps $\{0,1\}^n$ to $\{0,1\}$, Theorem 4.15 implies that there exists a Boolean circuit $C_n$ to compute it. In fact, the size of this circuit is at most $c \cdot 2^n / n$ gates for some constant $c \leq 10$.
■
In particular, Theorem 6.1 implies that there exists such a circuit collection $\{C_n\}$ even for the TWINP function we described before, even though we do not know of any program to compute it. Indeed, this is not that surprising: for every particular $n \in \mathbb{N}$, $\text{TWINP}_n$ is either the constant zero function or the constant one function, both of which can be computed by very simple Boolean circuits. Hence a collection of circuits $\{C_n\}$ that computes TWINP certainly exists. The difficulty in computing TWINP using Python or any other programming language arises from the fact that we do not know, for each particular $n$, which circuit $C_n$ belongs to this collection.
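A restriction $F_n$ is finite in a very concrete sense: it can be written out as a table with $2^n$ rows, which is what makes a circuit for it possible. The sketch below builds such a table for XOR; the helper name `restriction_table` is ours, and the XOR function is restated so that the block is self-contained.

```python
from itertools import product

def XOR(X):
    '''Outputs 1 iff the number of 1's in X is odd.'''
    result = 0
    for b in X:
        result = (result + b) % 2
    return result

def restriction_table(F, n):
    '''Truth table of F_n: maps each of the 2**n inputs of length n to F's output.'''
    return {x: F(x) for x in product((0, 1), repeat=n)}

XOR3 = restriction_table(XOR, 3)   # a finite description of XOR restricted to length 3
```

No such table exists for XOR itself, since listing its values would require infinitely many rows.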
6.2 DETERMINISTIC FINITE AUTOMATA (OPTIONAL)
All our computational models so far (Boolean circuits and straight-line programs) were only applicable for finite functions.
In Chapter 7, we will present Turing machines, which are the central models of computation for unbounded input length functions. However, in this section we present the more basic model of deterministic finite automata (DFA). Automata can serve as a good stepping-stone for Turing machines, though they will not be used much in later parts of this book, and so the reader can feel free to skip ahead to Chapter 7. DFAs turn out to be equivalent in power to regular expressions: a powerful mechanism to specify patterns, which is widely used in practice. Our treatment of automata is relatively brief. There are plenty of resources that help you get more comfortable with DFAs. In particular, Chapter 1 of Sipser's book [Sip97] contains an excellent exposition of this material. There are also many websites with online simulators for automata, as well as translators from regular expressions to automata and vice versa (see for example the fsm2regex and nfa2dfa web tools).
At a high level, an algorithm is a recipe for computing an output from an input via a combination of the following steps:

1. Read a bit from the input
2. Update the state (working memory)
3. Stop and produce an output
For example, recall the Python program that computes the XOR function:

def XOR(X):
    '''Takes list X of 0's and 1's
       Outputs 1 if the number of 1's is odd and outputs 0 otherwise'''
    result = 0
    for i in range(len(X)):
        result = (result + X[i]) % 2
    return result
In each step, this program reads a single bit X[i] and updates its state result based on that bit (flipping result if X[i] is 1 and keeping it the same otherwise). When it is done traversing the input, the program outputs result. In computer science, such a program is called a single-pass constant-memory algorithm since it makes a single pass over the input and its working memory is finite. (Indeed, in this case, result can either be 0 or 1.) Such an algorithm is also known as a Deterministic Finite Automaton or DFA (another name for DFAs is a finite state machine). We can think of such an algorithm as a "machine" that can be in one of $C$ states, for some constant $C$. The machine starts in some initial state and then reads its input $x \in \{0,1\}^*$ one bit at a time. Whenever the machine reads a bit $b \in \{0,1\}$, it transitions into a new state based on $b$ and its prior state. The output of the machine is based on the final state. Every single-pass constant-memory algorithm corresponds to such a machine. If an algorithm uses $c$ bits of memory, then the contents of its memory can be represented as a string of length $c$. Therefore such an algorithm can be in one of at most $2^c$ states at any point in the execution.

Figure 6.3: A deterministic finite automaton that computes the XOR function. It has two states 0 and 1, and when it observes $b$ it transitions from $v$ to $v \oplus b$.

http://ivanzuzak.info/noam/webapps/fsm2regex/
https://cyberzhg.github.io/toolbox/nfa2dfa
We can specify a DFA of $C$ states by a list of $C \cdot 2$ rules. Each rule will be of the form "If the DFA is in state $v$ and the bit read from the input is $b$ then the new state is $v'$". At the end of the computation, we will also have a rule of the form "If the final state is one of the following … then output 1, otherwise output 0". For example, the Python program above can be represented by a two-state automaton for computing XOR of the following form:

• Initialize in the state $0$.
• For every state $s \in \{0,1\}$ and input bit $b$ read, if $b = 1$ then change to state $1 - s$, otherwise stay in state $s$.
• At the end output $1$ iff $s = 1$.
We can also describe a $C$-state DFA as a labeled graph of $C$ vertices. For every state $s$ and bit $b$, we add a directed edge labeled with $b$ between $s$ and the state $s'$ such that if the DFA is at state $s$ and reads $b$ then it transitions to $s'$. (If the state stays the same then this edge will be a self-loop; similarly, if $s$ transitions to $s'$ in both the case $b = 0$ and $b = 1$ then the graph will contain two parallel edges.) We also label the set $\mathcal{S}$ of states on which the automaton will output $1$ at the end of the computation. This set is known as the set of accepting states. See Fig. 6.3 for the graphical representation of the XOR automaton.

Formally, a DFA is specified by (1) the table of the $C \cdot 2$ rules, which can be represented as a transition function $T$ that maps a state $s \in [C]$ and bit $b \in \{0,1\}$ to the state $s' \in [C]$ which the DFA will transition to from state $s$ on input $b$ and (2) the set $\mathcal{S}$ of accepting states. This leads to the following definition.
Definition 6.2 (Deterministic Finite Automaton). A deterministic finite automaton (DFA) with $C$ states over $\{0,1\}$ is a pair $(T, \mathcal{S})$ with $T : [C] \times \{0,1\} \rightarrow [C]$ and $\mathcal{S} \subseteq [C]$. The finite function $T$ is known as the transition function of the DFA. The set $\mathcal{S}$ is known as the set of accepting states.
Let $F : \{0,1\}^* \rightarrow \{0,1\}$ be a Boolean function with the infinite domain $\{0,1\}^*$. We say that $(T, \mathcal{S})$ computes $F$ if for every $n \in \mathbb{N}$ and $x \in \{0,1\}^n$, if we define $s_0 = 0$ and $s_{i+1} = T(s_i, x_i)$ for every $i \in [n]$, then

$$s_n \in \mathcal{S} \Leftrightarrow F(x) = 1 \tag{6.6}$$
Make sure not to confuse the transition function of an automaton ($T$ in Definition 6.2), which is a finite function specifying the table of "rules" which it follows, with the function the automaton computes ($F$ in Definition 6.2), which is an infinite function.

Remark 6.3 (Definitions in other texts). Deterministic finite automata can be defined in several equivalent ways. In particular Sipser [Sip97] defines a DFA as a five-tuple $(Q, \Sigma, \delta, q_0, F)$ where $Q$ is the set of states, $\Sigma$ is the alphabet, $\delta$ is the transition function, $q_0$ is the initial state, and $F$ is the set of accepting states. In this book the set of states is always of the form $Q = \{0, \ldots, C-1\}$ and the initial state is always $q_0 = 0$, but this makes no difference to the computational power of these models. Also, we restrict our attention to the case that the alphabet $\Sigma$ is equal to $\{0,1\}$.
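Definition 6.2 translates almost word for word into a short simulator. In the sketch below, the transition function is stored as a Python dictionary keyed by (state, bit) pairs; this representation and the names `run_dfa`, `T_xor`, and `S_xor` are our own choices for illustration.

```python
def run_dfa(T, S, x):
    '''Run the DFA (T, S) on the bit sequence x, starting from state 0.
    T maps (state, bit) to the next state; S is the set of accepting states.'''
    s = 0
    for b in x:
        s = T[(s, b)]
    return 1 if s in S else 0

# Two-state automaton for XOR: reading a 1 flips the state, reading a 0 keeps it.
T_xor = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
S_xor = {1}
```

The dictionary `T_xor` has exactly $2C = 4$ entries no matter how long the input is; only the loop in `run_dfa` grows with the input.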
Solved Exercise 6.2 (DFA for $(010)^*$). Prove that there is a DFA that computes the following function $F$:

$$F(x) = \begin{cases} 1 & 3 \text{ divides } |x| \text{ and } \forall_{i \in [|x|/3]} \; x_{3i} x_{3i+1} x_{3i+2} = 010 \\ 0 & \text{otherwise} \end{cases} \tag{6.7}$$
■

Solution:
When asked to construct a deterministic finite automaton, it is often useful to start by constructing a single-pass constant-memory algorithm using a more general formalism (for example, using pseudocode or a Python program). Once we have such an algorithm, we can mechanically translate it into a DFA. Here is a simple Python program for computing $F$:

Figure 6.4: A DFA that outputs 1 only on inputs $x \in \{0,1\}^*$ that are a concatenation of zero or more copies of 010. The state 0 is both the starting state and the only accepting state. The table denotes the transition function $T$, which maps the current state and symbol read to the new state.
def F(X):
    '''Return 1 iff X is a concatenation of zero/more copies of [0,1,0]'''
    if len(X) % 3 != 0:
        return False
    ultimate = 0
    penultimate = 1
    antepenultimate = 0
    for idx, b in enumerate(X):
        antepenultimate = penultimate
        penultimate = ultimate
        ultimate = b
        if idx % 3 == 2 and ((antepenultimate, penultimate, ultimate) != (0,1,0)):
            return False
    return True
Since we keep three Boolean variables, the working memory can be in one of $2^3 = 8$ configurations, and so the program above can be directly translated into an 8 state DFA. While this is not needed to solve the question, by examining the resulting DFA, we can see that we can merge some states and obtain a 4 state automaton, described in Fig. 6.4. See also Fig. 6.5, which depicts the execution of this DFA on a particular input.
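For concreteness, here is one possible 4-state automaton for this function, written as a transition table in Python. The state numbering is our own and need not coincide with the one in Fig. 6.4, apart from state 0 being the start state and the only accepting state. The helper `agrees_with_regex` is a sanity check against Python's `re` module rather than a proof.

```python
import re
from itertools import product

# States: 0 = at a block boundary (start, accepting), 1 = read "0",
# 2 = read "01", 3 = dead state that is never left.
T_010 = {
    (0, 0): 1, (0, 1): 3,
    (1, 0): 3, (1, 1): 2,
    (2, 0): 0, (2, 1): 3,
    (3, 0): 3, (3, 1): 3,
}
S_010 = {0}

def F_dfa(x):
    '''Evaluate the 4-state automaton on a string of "0"/"1" characters.'''
    s = 0
    for ch in x:
        s = T_010[(s, int(ch))]
    return 1 if s in S_010 else 0

def agrees_with_regex(max_len):
    '''Compare F_dfa with the pattern (010)* on every string up to max_len.'''
    for n in range(max_len + 1):
        for bits in product('01', repeat=n):
            x = ''.join(bits)
            if F_dfa(x) != (1 if re.fullmatch(r'(010)*', x) else 0):
                return False
    return True
```

Note that the automaton never counts the input length explicitly: a length that is not a multiple of 3 simply leaves it in state 1, 2, or 3, none of which is accepting.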
■
6.2.1 Anatomy of an automaton (finite vs. unbounded)
Now that we are considering computational tasks with unbounded input sizes, it is crucial to distinguish between the components of our algorithm that have fixed length and the components that grow with the input size. For the case of DFAs these are the following:

Constant size components: Given a DFA $A$, the following quantities are fixed independent of the input size:

• The number of states $C$ in $A$.
• The transition function $T$ (which has $2C$ inputs, and so can be specified by a table of $2C$ rows, each entry in which is a number in $[C]$).
• The set $\mathcal{S} \subseteq [C]$ of accepting states. This set can be described by a string in $\{0,1\}^C$ specifying which states are in $\mathcal{S}$ and which are not.
Together the above means that we can fully describe an automaton using finitely many symbols. This is a property we require out of any notion of "algorithm": we should be able to write down a complete specification of how it produces an output from an input.
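To illustrate what "finitely many symbols" can mean, the sketch below writes a DFA down as a single binary string: the number of states in unary, then the $2C$ table entries in fixed width, then $C$ accept/reject bits. The format and the names `encode_dfa` and `decode_dfa` are assumptions made for this example; many other encodings would serve equally well.

```python
def encode_dfa(C, T, S):
    '''Describe a C-state DFA as a binary string:
    C in unary, then the 2C table entries, then C accept/reject bits.'''
    width = max(1, (C - 1).bit_length())        # bits needed to name a state
    header = '1' * C + '0'
    table = ''.join(format(T[(s, b)], '0%db' % width)
                    for s in range(C) for b in (0, 1))
    accept = ''.join('1' if s in S else '0' for s in range(C))
    return header + table + accept

def decode_dfa(a):
    '''Invert encode_dfa, recovering (C, T, S) from the string.'''
    C = a.index('0')                            # length of the unary header
    width = max(1, (C - 1).bit_length())
    pos = C + 1
    T = {}
    for s in range(C):
        for b in (0, 1):
            T[(s, b)] = int(a[pos:pos + width], 2)
            pos += width
    S = {s for s in range(C) if a[pos + s] == '1'}
    return C, T, S
```

For the two-state XOR automaton this produces a description of only nine bits, regardless of how long the inputs fed to the automaton are.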
Components of unbounded size: The following quantities relating to a DFA are not bounded by any constant. We stress that these are still finite for any given input.

• The length of the input $x \in \{0,1\}^*$ that the DFA is provided. The input length is always finite, but not a priori bounded.
• The number of steps that the DFA takes can grow with the length of the input. Indeed, a DFA makes a single pass on the input and so it takes precisely $|x|$ steps on an input $x \in \{0,1\}^*$.

Figure 6.5: Execution of the DFA of Fig. 6.4. The number of states and the transition function size are bounded, but the input can be arbitrarily long. If the DFA is at state $s$ and observes the value $b$ then it moves to the state $T(s, b)$. At the end of the execution the DFA accepts iff the final state is in $\mathcal{S}$.
6.2.2 DFA-computable functions
We say that a function $F : \{0,1\}^* \rightarrow \{0,1\}$ is DFA computable if there exists some DFA that computes $F$. In Chapter 4 we saw that every finite function is computable by some Boolean circuit. Thus, at this point, you might expect that every infinite function is computable by some DFA. However, this is very much not the case. We will soon see some simple examples of infinite functions that are not computable by DFAs, but for starters, let us prove that such functions exist.

Theorem 6.4 (DFA-computable functions are countable). Let DFACOMP be the set of all Boolean functions $F : \{0,1\}^* \rightarrow \{0,1\}$ such that there exists a DFA computing $F$. Then DFACOMP is countable.
Proof Idea:
Every DFA can be described by a finite length string, which yields an onto map from $\{0,1\}^*$ to DFACOMP: namely, the function that maps a string describing an automaton $A$ to the function that it computes.
■
Proof of Theorem 6.4. Every DFA can be described by a finite string, representing the transition function $T$ and the set of accepting states, and every DFA $A$ computes some function $F : \{0,1\}^* \rightarrow \{0,1\}$. Thus we can define the following function $StDC : \{0,1\}^* \rightarrow \text{DFACOMP}$:

$$StDC(a) = \begin{cases} F & a \text{ represents automaton } A \text{ and } F \text{ is the function } A \text{ computes} \\ \text{ONE} & \text{otherwise} \end{cases} \tag{6.8}$$

where $\text{ONE} : \{0,1\}^* \rightarrow \{0,1\}$ is the constant function that outputs $1$ on all inputs (and is a member of DFACOMP). Since by definition, every function $F$ in DFACOMP is computable by some automaton, $StDC$ is an onto function from $\{0,1\}^*$ to DFACOMP, which means that DFACOMP is countable (see Section 2.4.2).
■
Since the set of all Boolean functions is uncountable, we get the following corollary:

Theorem 6.5 (Existence of DFA-uncomputable functions). There exists a Boolean function $F : \{0,1\}^* \rightarrow \{0,1\}$ that is not computable by any DFA.

Proof. If every Boolean function $F$ is computable by some DFA, then DFACOMP equals the set ALL of all Boolean functions, but by Theorem 2.12, the latter set is uncountable, contradicting Theorem 6.4.
■
6.3 REGULAR EXPRESSIONS
Searching for a piece of text is a common task in computing. At its heart, the search problem is quite simple. We have a collection $X = \{x_0, \ldots, x_k\}$ of strings (e.g., files on a hard-drive, or student records in a database), and the user wants to find out the subset of all the $x \in X$ that are matched by some pattern (e.g., all files whose names end with the string .txt). In full generality, we can allow the user to specify the pattern by specifying a (computable) function $F : \{0,1\}^* \rightarrow \{0,1\}$, where $F(x) = 1$ corresponds to the pattern matching $x$. That is, the user provides a program $P$ in a programming language such as Python, and the system returns all $x \in X$ such that $P(x) = 1$. For example, one could search for all text files that contain the string important document or perhaps (letting $P$ correspond to a neural-network based classifier) all images that contain a cat. However, we don't want our system to get into an infinite loop just trying to evaluate the program $P$! For this reason, typical systems for searching files or databases do not allow users to specify the patterns using full-fledged programming languages. Rather, such systems use restricted computational models that on the one hand are rich enough to capture many of the queries needed in practice (e.g., all filenames ending with .txt, or all phone numbers of the form (617) xxx-xxxx), but on the other hand are restricted enough so that queries can be evaluated very efficiently on huge files and in particular cannot result in an infinite loop.
One of the most popular such computational models is regular expressions. If you ever used an advanced text editor, a command-line shell, or have done any kind of manipulation of text files, then you have probably come across regular expressions.
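As a small illustration of pattern-based search, the sketch below filters a collection of strings using Python's standard `re` module as the pattern language. The function name `search` and the sample file names and phone numbers are made up for this example. Note that Python's syntax offers conveniences such as `.`, `\d`, and `{3}` that go beyond the basic operations discussed in this chapter.

```python
import re

def search(X, pattern):
    '''Return the strings in the collection X that the pattern matches in full.'''
    compiled = re.compile(pattern)
    return [x for x in X if compiled.fullmatch(x)]

files = ['notes.txt', 'photo.png', 'todo.txt', 'txt']             # made-up file names
phones = ['(617) 555-0123', '(212) 555-0100', '(617) 55-50123']   # made-up numbers

txt_files = search(files, r'.*\.txt')
boston_numbers = search(phones, r'\(617\) \d{3}-\d{4}')
```

The user supplies only a pattern string, not a program, so the system never has to run arbitrary user code in order to answer the query.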
A regular expression over some alphabet $\Sigma$ is obtained by combining elements of $\Sigma$ with the operation of concatenation, as well as $|$ (corresponding to or) and $*$ (corresponding to repetition zero or more times). (Common implementations of regular expressions in programming languages and shells typically include some extra operations on top of $|$ and $*$, but most of these operations can be implemented as "syntactic sugar" using the operators $|$ and $*$.) For example, the following regular expression over the alphabet $\{0,1\}$ corresponds to the set of all strings $x \in \{0,1\}^*$ where every digit is repeated at least twice:

$$(00(0^*)|11(1^*))^* \;. \tag{6.9}$$

The following regular expression over the alphabet $\{a, \ldots, z, 0, \ldots, 9\}$ corresponds to the set of all strings that consist of a sequence of one or more of the letters $a$-$d$ followed by a sequence of one or more digits (without a leading zero):

$$(a|b|c|d)(a|b|c|d)^*(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)^* \;. \tag{6.10}$$
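Both expressions can be tried out directly in Python, whose `re` module accepts this syntax unchanged. The variable names and the helper `matches` below are ours; `fullmatch` is used because we want the whole string, not just a prefix, to match.

```python
import re

# (6.9): blocks of at least two 0s or at least two 1s
every_digit_repeated = re.compile(r'(00(0*)|11(1*))*')

# (6.10): one or more letters a-d, then a number with no leading zero
letters_then_number = re.compile(
    r'(a|b|c|d)(a|b|c|d)*(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*')

def matches(regex, x):
    '''1 if the whole string x matches the compiled expression, else 0.'''
    return 1 if regex.fullmatch(x) else 0
```

For instance, `matches(every_digit_repeated, '0011100')` is 1 because the string splits into the blocks 00, 111, 00, while `'0010'` is rejected because its lone 1 is not repeated.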
Formally, regular expressions are defined by the following recursive definition:

Definition 6.6 (Regular expression). A regular expression $e$ over an alphabet $\Sigma$ is a string over $\Sigma \cup \{ (, ), |, *, \emptyset, "" \}$ that has one of the following forms:

1. $e = \sigma$ where $\sigma \in \Sigma$
2. $e = (e'|e'')$ where $e', e''$ are regular expressions.
3. $e = (e')(e'')$ where $e', e''$ are regular expressions. (We often drop the parentheses when there is no danger of confusion and so write this as $e' \; e''$.)
4. $e = (e')^*$ where $e'$ is a regular expression.

Finally we also allow the following "edge cases": $e = \emptyset$ and $e = ""$. These are the regular expressions corresponding to accepting no strings, and accepting only the empty string respectively.

We will drop parentheses when they can be inferred from the context. We also use the convention that OR and concatenation are left-associative, and we give highest precedence to $*$, then concatenation, and then OR. Thus for example we write $00^*|11$ instead of $((0)(0^*))|((1)(1))$.

https://goo.gl/2vTAFU
Every regular expression $e$ corresponds to a function $\Phi_e : \Sigma^* \rightarrow \{0,1\}$ where $\Phi_e(x) = 1$ if $x$ matches the regular expression. For example, if $e = (00|11)^*$ then $\Phi_e(110011) = 1$ but $\Phi_e(101) = 0$ (can you see why?).

The formal definition of $\Phi_e$ is one of those definitions that is more cumbersome to write than to grasp. Thus it might be easier for you first to work out the definition on your own, and then check that it matches what is written below.
Definition 6.7 (Matching a regular expression). Let $e$ be a regular expression over the alphabet $\Sigma$. The function $\Phi_e : \Sigma^* \rightarrow \{0,1\}$ is defined as follows:

1. If $e = \sigma$ then $\Phi_e(x) = 1$ iff $x = \sigma$.
2. If $e = (e'|e'')$ then $\Phi_e(x) = \Phi_{e'}(x) \vee \Phi_{e''}(x)$ where $\vee$ is the OR operator.
3. If $e = (e')(e'')$ then $\Phi_e(x) = 1$ iff there is some $x', x'' \in \Sigma^*$ such that $x$ is the concatenation of $x'$ and $x''$ and $\Phi_{e'}(x') = \Phi_{e''}(x'') = 1$.
4. If $e = (e')^*$ then $\Phi_e(x) = 1$ iff there is some $k \in \mathbb{N}$ and some $x_0, \ldots, x_{k-1} \in \Sigma^*$ such that $x$ is the concatenation $x_0 \cdots x_{k-1}$ and $\Phi_{e'}(x_i) = 1$ for every $i \in [k]$.
5. Finally, for the edge cases $\Phi_\emptyset$ is the constant zero function, and $\Phi_{""}$ is the function that only outputs $1$ on the empty string "".

We say that a regular expression $e$ over $\Sigma$ matches a string $x \in \Sigma^*$ if $\Phi_e(x) = 1$.
The definitions above are not inherently difficult but are a bit cumbersome. So you should pause here and go over them again until you understand why they correspond to our intuitive notion of regular expressions. This is important not just for understanding regular expressions themselves (which are used time and again in a great many applications) but also for getting better at understanding recursive definitions in general.
A Boolean function is called "regular" if it outputs $1$ on precisely the set of strings that are matched by some regular expression. That is,

Definition 6.8 (Regular functions / languages). Let $\Sigma$ be a finite set and $F : \Sigma^* \rightarrow \{0,1\}$ be a Boolean function. We say that $F$ is regular if $F = \Phi_e$ for some regular expression $e$.
Similarly, for every formal language $L \subseteq \Sigma^*$, we say that $L$ is regular if and only if there is a regular expression $e$ such that $x \in L$ iff $e$ matches $x$.

Example 6.9 (A regular function). Let $\Sigma = \{a, b, c, d, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9\}$ and $F : \Sigma^* \rightarrow \{0,1\}$ be the function such that $F(x)$ outputs $1$ iff $x$ consists of one or more of the letters $a$-$d$ followed by a sequence of one or more digits (without a leading zero). Then $F$ is a regular function, since $F = \Phi_e$ where

$$e = (a|b|c|d)(a|b|c|d)^*(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)^* \tag{6.11}$$

is the expression we saw in (6.10).
If we wanted to verify, for example, that $\Phi_e(abc12078) = 1$, we can do so by noticing that the expression $(a|b|c|d)$ matches the string $a$, $(a|b|c|d)^*$ matches $bc$, $(1|2|3|4|5|6|7|8|9)$ matches the string $1$, and the expression $(0|1|2|3|4|5|6|7|8|9)^*$ matches the string $2078$. Each one of those boils down to a simpler expression. For example, the expression $(a|b|c|d)^*$ matches the string $bc$ because both of the one-character strings $b$ and $c$ are matched by the expression $a|b|c|d$.
Regular expressions can be defined over any finite alphabet $\Sigma$, but as usual, we will mostly focus our attention on the binary case, where $\Sigma = \{0,1\}$. Most (if not all) of the theoretical and practical general insights about regular expressions can be gleaned from studying the binary case.
6.3.1 Algorithms for matching regular expressions
Regular expressions would not be very useful for search if we could not evaluate, given a regular expression $e$, whether a string $x$ is matched by $e$. Luckily, there is an algorithm to do so. Specifically, there is an algorithm (think "Python program" though later we will formalize the notion of algorithms using Turing machines) that on input a regular expression $e$ over the alphabet $\{0,1\}$ and a string $x \in \{0,1\}^*$, outputs $1$ iff $e$ matches $x$ (i.e., outputs $\Phi_e(x)$).

Indeed, Definition 6.7 actually specifies a recursive algorithm for computing $\Phi_e$. Specifically, each one of our operations (concatenation, OR, and star) can be thought of as reducing the task of testing whether an expression $e$ matches a string $x$ to testing whether some sub-expressions of $e$ match substrings of $x$. Since these sub-expressions are always shorter than the original expression, this yields a recursive algorithm for checking if $e$ matches $x$, which will eventually terminate at the base cases of the expressions that correspond to a single symbol or the empty string.
Algorithm 6.10 (Regular expression matching).

Input: Regular expression $e$ over $\Sigma$, string $x \in \Sigma^*$
Output: $\Phi_e(x)$

procedure Match($e$, $x$)
    if $e = \emptyset$ then return 0
    if $x$ = "" then return MatchEmpty($e$)
    if $e \in \Sigma$ then return 1 iff $x = e$
    if $e = (e'|e'')$ then return Match($e'$, $x$) or Match($e''$, $x$)
    if $e = (e')(e'')$ then
        for $i \in [|x|+1]$ do
            if Match($e'$, $x_0 \cdots x_{i-1}$) and Match($e''$, $x_i \cdots x_{|x|-1}$) then return 1
        end for
    end if
    if $e = (e')^*$ then
        if $e'$ = "" then return Match("", $x$)
        # ("")* is the same as ""
        for $i \in [|x|]$ do
            # $x_0 \cdots x_{i-1}$ is shorter than $x$
            if Match($e$, $x_0 \cdots x_{i-1}$) and Match($e'$, $x_i \cdots x_{|x|-1}$) then return 1
        end for
    end if
    return 0
end procedure

We assume above that we have a procedure MatchEmpty that on input a regular expression $e$ outputs 1 if and only if $e$ matches the empty string "".
The key observation is that in our recursive definition of regular expressions, whenever $e$ is made up of one or two expressions $e', e''$ then these two regular expressions are smaller than $e$. Eventually (when they have size 1) they must correspond to the non-recursive case of a single alphabet symbol. Correspondingly, the recursive calls made in Algorithm 6.10 always correspond to a shorter expression or (in the case of an expression of the form $(e')^*$) a shorter input string. Thus, we can prove the correctness of Algorithm 6.10 on inputs of the form $(e,x)$ by induction over $|e| + |x|$, a quantity that strictly decreases in every recursive call. The base case is when either $x$ = "" or $e$ is a single alphabet symbol, "" or $\emptyset$. In the case the expression is of the form $e = (e'|e'')$ or $e = (e')(e'')$, we make recursive calls with the shorter expressions $e', e''$. In the case the expression is of the form $e = (e')^*$, we make recursive calls with either a shorter string $x$ and the same expression, or with the shorter expression $e'$ and a string $x'$ that is equal in length or shorter than $x$.

Solved Exercise 6.3 (Match the empty string). Give an algorithm that on input a regular expression $e$, outputs 1 if and only if $\Phi_e("") = 1$.
■
Solution:

We can obtain such a recursive algorithm by using the following observations:

1. An expression of the form "" or $(e')^*$ always matches the empty string.

2. An expression of the form $\sigma$, where $\sigma \in \Sigma$ is an alphabet symbol, never matches the empty string.

3. The regular expression $\emptyset$ does not match the empty string.

4. An expression of the form $e'|e''$ matches the empty string if and only if one of $e'$ or $e''$ matches it.

5. An expression of the form $(e')(e'')$ matches the empty string if and only if both $e'$ and $e''$ match it.

Given the above observations, we see that the following algorithm will check if $e$ matches the empty string:

procedure MatchEmpty($e$)
    if $e = \emptyset$ then return 0
    if $e$ = "" then return 1
    if $e \in \Sigma$ then return 0
    if $e = (e'|e'')$ then return MatchEmpty($e'$) or MatchEmpty($e''$)
    if $e = (e')(e'')$ then return MatchEmpty($e'$) and MatchEmpty($e''$)
    if $e = (e')^*$ then return 1
end procedure
■
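To make the recursion concrete, here is a minimal Python sketch of Algorithm 6.10 together with the MatchEmpty procedure of Solved Exercise 6.3. The encoding of expressions as nested tuples and the function names `match` and `match_empty` are assumptions of this sketch rather than notation from the text, and the functions return Python booleans in place of 0 and 1.

```python
# Hypothetical encoding of regular expressions as nested tuples:
#   ("empty",)       the expression ∅
#   ("eps",)         the expression ""
#   ("sym", c)       a single alphabet symbol c
#   ("or", e1, e2)   e1|e2
#   ("cat", e1, e2)  (e1)(e2)
#   ("star", e1)     (e1)*


def match_empty(e):
    """Return True iff e matches the empty string (Solved Exercise 6.3)."""
    kind = e[0]
    if kind in ("empty", "sym"):
        return False
    if kind in ("eps", "star"):
        return True
    if kind == "or":
        return match_empty(e[1]) or match_empty(e[2])
    if kind == "cat":
        return match_empty(e[1]) and match_empty(e[2])
    raise ValueError(f"unknown expression: {e!r}")


def match(e, x):
    """Return True iff e matches the string x (Algorithm 6.10)."""
    kind = e[0]
    if kind == "empty":
        return False
    if x == "":
        return match_empty(e)
    if kind == "eps":
        return False  # x is nonempty at this point
    if kind == "sym":
        return x == e[1]
    if kind == "or":
        return match(e[1], x) or match(e[2], x)
    if kind == "cat":
        # Try every way of splitting x into a prefix and a suffix.
        return any(match(e[1], x[:i]) and match(e[2], x[i:])
                   for i in range(len(x) + 1))
    if kind == "star":
        if e[1] == ("eps",):
            return False  # ("")* is the same as "", and x is nonempty
        # The prefix x[:i] is strictly shorter than x, so this terminates.
        return any(match(e, x[:i]) and match(e[1], x[i:])
                   for i in range(len(x)))
    raise ValueError(f"unknown expression: {e!r}")


# The expression (01)* in this encoding.
E = ("star", ("cat", ("sym", "0"), ("sym", "1")))
print(match(E, "0101"), match(E, "010"), match(E, ""))
```

The `any(...)` loops correspond to the two `for` loops of Algorithm 6.10, which is also where the potentially exponential running time comes from.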
6.4 EFFICIENT MATCHING OF REGULAR EXPRESSIONS (OPTIONAL)

Algorithm 6.10 is not very efficient. For example, given an expression involving concatenation or the “star” operation and a string of length $n$, it can make $n$ recursive calls, and hence it can be shown that in the worst case Algorithm 6.10 can take time exponential in the length of the input string $x$. Fortunately, it turns out that there is a much more efficient algorithm that can match regular expressions in linear (i.e., $O(n)$) time. Since we have not yet covered the topics of time and space complexity, we describe this algorithm in high level terms, without making the computational model precise. Rather we will use the colloquial notion of $O(n)$ running time as used in introduction to programming courses and whiteboard coding interviews. We will see a formal definition of time complexity in Chapter 13.
Theorem 6.11 (Matching regular expressions in linear time). Let $e$ be a regular expression. Then there is an $O(n)$ time algorithm that computes $\Phi_e$.

The implicit constant in the $O(n)$ term of Theorem 6.11 depends on the expression $e$. Thus, another way to state Theorem 6.11 is that for every expression $e$, there is some constant $c$ and an algorithm $A$ that computes $\Phi_e$ on $n$-bit inputs using at most $c \cdot n$ steps. This makes sense since in practice we often want to compute $\Phi_e(x)$ for a small regular expression $e$ and a large document $x$. Theorem 6.11 tells us that we can do so with running time that scales linearly with the size of the document, even if it has (potentially) worse dependence on the size of the regular expression.

We prove Theorem 6.11 by obtaining a more efficient recursive algorithm that determines whether $e$ matches a string $x \in \{0,1\}^n$ by reducing this task to determining whether a related expression $e'$ matches $x_0, \ldots, x_{n-2}$. This will result in an expression for the running time of the form $T(n) = T(n-1) + O(1)$, which solves to $T(n) = O(n)$.

Restrictions of regular expressions. The central definition for the algorithm behind Theorem 6.11 is the notion of a restriction of a regular expression. The idea is that for every regular expression $e$ and symbol $\sigma$ in its alphabet, it is possible to define a regular expression $e[\sigma]$ such that $e[\sigma]$ matches a string $x$ if and only if $e$ matches the string $x\sigma$. For example, if $e$ is the regular expression $01|(01)^*(01)$ (i.e., one or more occurrences of $01$) then $e[1]$ is equal to $0|(01)^*0$ and $e[0]$ will be $\emptyset$. (Can you see why?)

Algorithm 6.12 computes the restriction $e[\sigma]$ given a regular expression $e$ and an alphabet symbol $\sigma$. It always terminates, since the recursive calls it makes are always on expressions smaller than the input expression. Its correctness can be proven by induction on the length of the regular expression $e$, with the base cases being when $e$ is "", $\emptyset$, or a single alphabet symbol $\tau$.
Algorithm 6.12 (Restricting regular expression).

Input: Regular expression $e$ over $\Sigma$, symbol $\sigma \in \Sigma$
Output: Regular expression $e' = e[\sigma]$ such that $\Phi_{e'}(x) = \Phi_e(x\sigma)$ for every $x \in \Sigma^*$

procedure Restrict($e$, $\sigma$)
    if $e$ = "" or $e = \emptyset$ then return $\emptyset$
    if $e = \tau$ for $\tau \in \Sigma$ then return "" if $\tau = \sigma$ and return $\emptyset$ otherwise
    if $e = (e'|e'')$ then return (Restrict($e'$, $\sigma$) | Restrict($e''$, $\sigma$))
    if $e = (e')^*$ then return $(e')^*$(Restrict($e'$, $\sigma$))
    if $e = (e')(e'')$ and $\Phi_{e''}("") = 0$ then return $(e')$(Restrict($e''$, $\sigma$))
    if $e = (e')(e'')$ and $\Phi_{e''}("") = 1$ then return ($(e')$(Restrict($e''$, $\sigma$))) | Restrict($e'$, $\sigma$)
end procedure
Using this notion of restriction, we can define the following recursive algorithm for regular expression matching:

Algorithm 6.13 (Regular expression matching in linear time).

Input: Regular expression $e$ over $\Sigma$, string $x \in \Sigma^n$ where $n \in \mathbb{N}$
Output: $\Phi_e(x)$

procedure FMatch($e$, $x$)
    if $x$ = "" then return MatchEmpty($e$)
    Let $e' \leftarrow$ Restrict($e$, $x_{n-1}$)
    return FMatch($e'$, $x_0 \cdots x_{n-2}$)
end procedure
By the definition of a restriction, for every $\sigma \in \Sigma$ and $x' \in \Sigma^*$, the expression $e$ matches $x'\sigma$ if and only if $e[\sigma]$ matches $x'$. Hence for every $e$ and $x \in \Sigma^n$, $\Phi_{e[x_{n-1}]}(x_0 \cdots x_{n-2}) = \Phi_e(x)$ and Algorithm 6.13 does return the correct answer. The only remaining task is to analyze its running time. Note that Algorithm 6.13 uses the MatchEmpty procedure of Solved Exercise 6.3 in the base case that $x$ = "". However, this is OK since this procedure's running time depends only on $e$ and is independent of the length of the original input.
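The restriction operation and the matching algorithm built on it can be sketched in Python as follows. The tuple encoding of expressions and the names `restrict`, `fmatch`, and `match_empty` are assumptions of this sketch. It applies the restriction rules literally and never simplifies the resulting expressions, so it illustrates the correctness argument rather than the running-time bound.

```python
# Hypothetical encoding: ("empty",), ("eps",), ("sym", c),
# ("or", e1, e2), ("cat", e1, e2), ("star", e1).


def match_empty(e):
    """Return True iff e matches the empty string."""
    kind = e[0]
    if kind in ("eps", "star"):
        return True
    if kind == "or":
        return match_empty(e[1]) or match_empty(e[2])
    if kind == "cat":
        return match_empty(e[1]) and match_empty(e[2])
    return False  # "empty" and "sym"


def restrict(e, s):
    """Return e[s]: an expression that matches x iff e matches x + s."""
    kind = e[0]
    if kind in ("empty", "eps"):
        return ("empty",)
    if kind == "sym":
        return ("eps",) if e[1] == s else ("empty",)
    if kind == "or":
        return ("or", restrict(e[1], s), restrict(e[2], s))
    if kind == "star":
        # The last piece of the match is nonempty and must end with s.
        return ("cat", e, restrict(e[1], s))
    # kind == "cat": the final symbol s comes from the right factor, or from
    # the left factor when the right factor can match the empty string.
    left, right = e[1], e[2]
    result = ("cat", left, restrict(right, s))
    if match_empty(right):
        result = ("or", result, restrict(left, s))
    return result


def fmatch(e, x):
    """Peel off the last symbol of x, restricting e each time."""
    while x:
        e = restrict(e, x[-1])
        x = x[:-1]
    return match_empty(e)


# The example from the text: e = 01|(01)*(01).
ZERO_ONE = ("cat", ("sym", "0"), ("sym", "1"))
E = ("or", ZERO_ONE, ("cat", ("star", ZERO_ONE), ZERO_ONE))
print(fmatch(E, "0101"), fmatch(E, "010"))
print(fmatch(restrict(E, "1"), "0"), fmatch(restrict(E, "0"), "0"))
```

The loop in `fmatch` is the recursion of Algorithm 6.13 written iteratively; each iteration does work that depends only on the current expression, not on the remaining length of the string.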
For simplicity, let us restrict our attention to the case that the alphabet $\Sigma$ is equal to $\{0,1\}$. Define $C(\ell)$ to be the maximum number of operations that Algorithm 6.12 takes when given as input a regular expression $e$ over $\{0,1\}$ of at most $\ell$ symbols. The value $C(\ell)$ can be shown to be polynomial in $\ell$, though this is not important for this theorem, since we only care about the dependence of the time to compute $\Phi_e(x)$ on the length of $x$ and not about the dependence of this time on the length of $e$.

Algorithm 6.13 is a recursive algorithm that, on input an expression $e$ and a string $x \in \{0,1\}^n$, does computation of at most $C(|e|)$ steps and then calls itself with input some expression $e'$ and a string $x'$ of length $n-1$. It will terminate after $n$ steps when it reaches a string of length 0. So, the running time $T(e,n)$ that it takes for Algorithm 6.13 to compute $\Phi_e$ for inputs of length $n$ satisfies the recursive equation:

$$T(e,n) = \max\{T(e[0], n-1),\, T(e[1], n-1)\} + C(|e|) \tag{6.12}$$

(In the base case $n = 0$, $T(e,0)$ is equal to some constant depending only on $e$.) To get some intuition for the expression Eq. (6.12), let us open up the recursion for one level, writing $T(e,n)$ as

$$T(e,n) = \max\{T(e[0][0], n-2) + C(|e[0]|),\; T(e[0][1], n-2) + C(|e[0]|),\; T(e[1][0], n-2) + C(|e[1]|),\; T(e[1][1], n-2) + C(|e[1]|)\} + C(|e|). \tag{6.13}$$

Continuing this way, we can see that $T(e,n) \leq n \cdot C(L) + O(1)$ where $L$ is the largest length of any expression $e'$ that we encounter along the way. Therefore, the following claim suffices to show that Algorithm 6.13 runs in $O(n)$ time:
Claim: Let $e$ be a regular expression over $\{0,1\}$, then there is a number $L(e) \in \mathbb{N}$, such that for every sequence of symbols $\alpha_0, \ldots, \alpha_{n-1}$, if we define $e' = e[\alpha_0][\alpha_1] \cdots [\alpha_{n-1}]$ (i.e., restricting $e$ to $\alpha_0$, and then $\alpha_1$ and so on and so forth), then $|e'| \leq L(e)$.

Proof of claim: For a regular expression $e$ over $\{0,1\}$ and $\alpha \in \{0,1\}^n$, we denote by $e[\alpha]$ the expression $e[\alpha_0][\alpha_1] \cdots [\alpha_{n-1}]$ obtained by restricting $e$ to $\alpha_0$ and then to $\alpha_1$ and so on. We let $S(e) = \{ e[\alpha] \mid \alpha \in \{0,1\}^* \}$. We will prove the claim by showing that for every $e$, the set $S(e)$ is finite, and hence so is the number $L(e)$ which is the maximum length of $e'$ for $e' \in S(e)$.

We prove this by induction on the structure of $e$. If $e$ is a symbol, the empty string, or the empty set, then this is straightforward to show as the most expressions $S(e)$ can contain are the expression itself, "", and $\emptyset$. Otherwise we split to the two cases (i) $e = (e')^*$ and (ii) $e = e'e''$, where $e', e''$ are smaller expressions (and hence by the induction hypothesis $S(e')$ and $S(e'')$ are finite). (The remaining case $e = e'|e''$ is simpler: every restriction of $e$ has the form $e'[\alpha]|e''[\alpha]$, so $S(e)$ contains at most $|S(e')| \cdot |S(e'')|$ expressions.) In the case (i), if $e = (e')^*$ then $e[\alpha]$ is either equal to $(e')^* e'[\alpha]$ or it is simply the empty set if $e'[\alpha] = \emptyset$. Since $e'[\alpha]$ is in the set $S(e')$, the number of distinct expressions in $S(e)$ is at most $|S(e')| + 1$. In the case (ii), if $e = e'e''$ then all the restrictions of $e$ to strings $\alpha$ will either have the form $e'e''[\alpha]$ or the form $e'e''[\alpha] | e'[\alpha']$ where $\alpha'$ is some string such that $\alpha = \alpha'\alpha''$ and $e''[\alpha'']$ matches the empty string. Since $e''[\alpha] \in S(e'')$ and $e'[\alpha'] \in S(e')$, the number of the possible distinct expressions of the form $e[\alpha]$ is at most $|S(e'')| + |S(e'')| \cdot |S(e')|$. This completes the proof of the claim.

The bottom line is that while running Algorithm 6.13 on a regular expression $e$, all the expressions we ever encounter are in the finite set $S(e)$, no matter how large the input $x$ is, and so the running time of Algorithm 6.13 satisfies the equation $T(n) = T(n-1) + C'$ for some constant $C'$ depending on $e$. This solves to $O(n)$ where the implicit constant in the $O$ notation can (and will) depend on $e$ but crucially, not on the length of the input $x$.
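For readers who prefer to see the recursion unrolled explicitly, the following display sketches the computation. The names $e_i$ and $K$ are introduced only for this sketch: $e_0 = e$, each $e_{i+1}$ is the restriction of $e_i$ that attains the maximum in Eq. (6.12), and $K$ denotes the largest base-case cost $T(e', 0)$ over the finite set $S(e)$.

```latex
\begin{aligned}
T(e, n) &\le T(e_1, n-1) + C(|e_0|) \\
        &\le T(e_2, n-2) + C(|e_0|) + C(|e_1|) \\
        &\;\;\vdots \\
        &\le T(e_n, 0) + \sum_{i=0}^{n-1} C(|e_i|) \\
        &\le K + n \cdot C(L(e)) .
\end{aligned}
```

The last step uses the claim: every $e_i$ lies in $S(e)$, so $|e_i| \le L(e)$, and $C$ is non-decreasing because it is defined as a maximum over all expressions of at most $\ell$ symbols.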
6.4.1 Matching regular expressions using DFAs

Theorem 6.11 is already quite impressive, but we can do even better. Specifically, no matter how long the string $x$ is, we can compute $\Phi_e(x)$ by maintaining only a constant amount of memory and moreover making a single pass over $x$. That is, the algorithm will scan the input $x$ once from start to finish, and then determine whether or not $x$ is matched by the expression $e$. This is important in the common case of trying to match a short regular expression over a huge file or document that might not even fit in our computer's memory. Of course, as we have seen before, a single-pass constant-memory algorithm is simply a deterministic finite automaton. As we will see in Theorem 6.16, a function can be computed by a regular expression if and only if it can be computed by a DFA. We start with showing the “only if” direction:

Theorem 6.14 (DFA for regular expression matching). Let $e$ be a regular expression. Then there is an algorithm that on input $x \in \{0,1\}^*$ computes $\Phi_e(x)$ while making a single pass over $x$ and maintaining a constant amount of memory.
Proof Idea:

The single-pass constant-memory algorithm for checking if a string matches a regular expression is presented in Algorithm 6.15. The idea is to replace the recursive algorithm of Algorithm 6.13 with a dynamic program (https://goo.gl/kgLdX1), using the technique of memoization (https://en.wikipedia.org/wiki/Memoization). If you haven't yet taken an algorithms course, you might not know these techniques. This is OK; while this more efficient algorithm is crucial for the many practical applications of regular expressions, it is not of great importance for this book.
⋆
Algorithm 6.15 (Regular expression matching by a DFA).

Input: Regular expression $e$ over $\Sigma$, string $x \in \Sigma^n$ where $n \in \mathbb{N}$
Output: $\Phi_e(x)$

procedure DFAMatch($e$, $x$)
    Let $S \leftarrow S(e)$ be the set $\{ e[\alpha] \mid \alpha \in \{0,1\}^* \}$ as defined in the proof of Theorem 6.11
    for $e' \in S$ do
        Let $v_{e'} \leftarrow 1$ if $\Phi_{e'}("") = 1$ and $v_{e'} \leftarrow 0$ otherwise
    end for
    for $i \in [n]$ do
        Let $last_{e'} \leftarrow v_{e'}$ for all $e' \in S$
        Let $v_{e'} \leftarrow last_{e'[x_i]}$ for all $e' \in S$
    end for
    return $v_e$
end procedure
Proof of Theorem 6.14. Algorithm 6.15 checks if a given string $x \in \Sigma^*$ is matched by the regular expression $e$. For every regular expression $e$, this algorithm has a constant number $2|S(e)|$ of Boolean variables ($v_{e'}$, $last_{e'}$ for $e' \in S(e)$), and it makes a single pass over the input string. Hence it corresponds to a DFA. We prove its correctness by induction on the length $n$ of the input. Specifically, we will argue that before reading the $i$-th bit of $x$, the variable $v_{e'}$ is equal to $\Phi_{e'}(x_0 \cdots x_{i-1})$ for every $e' \in S(e)$. In the case $i = 0$ this holds since we initialize $v_{e'} = \Phi_{e'}("")$ for all $e' \in S(e)$. For $i > 0$ this holds by induction since the inductive hypothesis implies that $last_{e'} = \Phi_{e'}(x_0 \cdots x_{i-2})$ for all $e' \in S(e)$, and by the definition of the set $S(e)$, for every $e' \in S(e)$ and $x_{i-1} \in \Sigma$, $e'' = e'[x_{i-1}]$ is in $S(e)$ and $\Phi_{e'}(x_0 \cdots x_{i-1}) = \Phi_{e''}(x_0 \cdots x_{i-2})$.
■
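The following Python sketch mirrors Algorithm 6.15: it first collects the set $S(e)$ by repeatedly restricting, and then makes one left-to-right pass over the input while keeping one Boolean per expression in that set. Several choices here are assumptions of the sketch and not part of the text: the tuple encoding, the helper names, and in particular the smart constructors `mk_or` and `mk_cat`, which normalize expressions (dropping ∅ and "", and storing alternatives as a set) so that syntactically different but obviously equal restrictions collapse and the collected set stays small.

```python
EMPTY, EPS = ("empty",), ("eps",)


def mk_or(a, b):
    """Union with alternatives stored as a flattened, duplicate-free set."""
    alts = set()
    for t in (a, b):
        if t == EMPTY:
            continue
        if t[0] == "or":
            alts |= t[1]
        else:
            alts.add(t)
    if not alts:
        return EMPTY
    return next(iter(alts)) if len(alts) == 1 else ("or", frozenset(alts))


def mk_cat(a, b):
    """Concatenation that simplifies away the empty set and the empty string."""
    if EMPTY in (a, b):
        return EMPTY
    if a == EPS:
        return b
    return a if b == EPS else ("cat", a, b)


def nullable(e):
    """True iff e matches the empty string."""
    kind = e[0]
    if kind == "or":
        return any(nullable(a) for a in e[1])
    if kind == "cat":
        return nullable(e[1]) and nullable(e[2])
    return kind in ("eps", "star")


def restrict(e, s):
    """Return a normalized expression matching x iff e matches x + s."""
    kind = e[0]
    if kind in ("empty", "eps"):
        return EMPTY
    if kind == "sym":
        return EPS if e[1] == s else EMPTY
    if kind == "or":
        result = EMPTY
        for a in e[1]:
            result = mk_or(result, restrict(a, s))
        return result
    if kind == "star":
        return mk_cat(e, restrict(e[1], s))
    result = mk_cat(e[1], restrict(e[2], s))
    return mk_or(result, restrict(e[1], s)) if nullable(e[2]) else result


def closure(e, alphabet):
    """Collect every expression reachable from e by repeated restriction."""
    seen, todo = {e}, [e]
    while todo:
        cur = todo.pop()
        for s in alphabet:
            nxt = restrict(cur, s)
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen


def dfa_match(e, x, alphabet="01"):
    """Single pass over x, one Boolean v[f] per expression f in S(e)."""
    S = closure(e, alphabet)
    step = {(f, s): restrict(f, s) for f in S for s in alphabet}
    v = {f: nullable(f) for f in S}
    for c in x:
        last = v
        v = {f: last[step[f, c]] for f in S}
    return v[e]


E = ("star", ("cat", ("sym", "0"), ("sym", "1")))  # (01)*
print(len(closure(E, "01")), dfa_match(E, "0101"), dfa_match(E, "0110"))
```

For `(01)*` the collected set has only three elements, which matches the three-state automaton one would draw by hand for this expression.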
6.4.2 Equivalence of regular expressions and automata

Recall that a Boolean function $F : \{0,1\}^* \rightarrow \{0,1\}$ is defined to be regular if it is equal to $\Phi_e$ for some regular expression $e$. (Equivalently, a language $L \subseteq \{0,1\}^*$ is defined to be regular if there is a regular expression $e$ such that $e$ matches $x$ iff $x \in L$.) The following theorem is the central result of automata theory:

Theorem 6.16 (DFA and regular expression equivalency). Let $F : \{0,1\}^* \rightarrow \{0,1\}$. Then $F$ is regular if and only if there exists a DFA $(T, \mathcal{S})$ that computes $F$.

Proof Idea:

One direction follows from Theorem 6.14, which shows that for every regular expression $e$, the function $\Phi_e$ can be computed by a DFA (see for example Fig. 6.6). For the other direction, we show that given a DFA $(T, \mathcal{S})$, for every $v, w \in [C]$ we can find a regular expression that would match $x \in \{0,1\}^*$ if and only if the DFA, starting in state $v$, will end up in state $w$ after reading $x$.
⋆

Figure 6.6: A deterministic finite automaton that computes the function $\Phi_{(01)^*}$.

Figure 6.7: Given a DFA of $C$ states, for every $v, w \in [C]$ and number $t \in \{0, \ldots, C\}$ we define the function $F^t_{v,w} : \{0,1\}^* \rightarrow \{0,1\}$ to output one on input $x \in \{0,1\}^*$ if and only if when the DFA is initialized in the state $v$ and is given the input $x$, it will reach the state $w$ while going only through the intermediate states $\{0, \ldots, t-1\}$.
Proof of Theorem 6.16. Since Theorem 6.14 proves the “only if” direction, we only need to show the “if” direction. Let $A = (T, \mathcal{S})$ be a DFA with $C$ states that computes the function $F$. We need to show that $F$ is regular.

For every $v, w \in [C]$, we let $F_{v,w} : \{0,1\}^* \rightarrow \{0,1\}$ be the function that maps $x \in \{0,1\}^*$ to 1 if and only if the DFA $A$, starting at the state $v$, will reach the state $w$ if it reads the input $x$. We will prove that $F_{v,w}$ is regular for every $v, w$. This will prove the theorem, since by Definition 6.2, $F(x)$ is equal to the OR of $F_{0,w}(x)$ for every $w \in \mathcal{S}$. Hence if we have a regular expression for every function of the form $F_{v,w}$ then (using the $|$ operation), we can obtain a regular expression for $F$ as well.

To give regular expressions for the functions $F_{v,w}$, we start by defining the following functions $F^t_{v,w}$: for every $v, w \in [C]$ and $0 \leq t \leq C$, $F^t_{v,w}(x) = 1$ if and only if starting from $v$ and observing $x$, the automaton reaches $w$ with all intermediate states being in the set $[t] = \{0, \ldots, t-1\}$ (see Fig. 6.7). That is, while $v, w$ themselves might be outside $[t]$, $F^t_{v,w}(x) = 1$ if and only if throughout the execution of the automaton on the input $x$ (when initiated at $v$) it never enters any of the states outside $[t]$ and still ends up at $w$. If $t = 0$ then $[t]$ is the empty set, and hence $F^0_{v,w}(x) = 1$ if and only if the automaton reaches $w$ from $v$ directly on $x$, without any intermediate state. If $t = C$ then all states are in $[t]$, and hence $F^t_{v,w} = F_{v,w}$.

We will prove the theorem by induction on $t$, showing that $F^t_{v,w}$ is regular for every $v, w$ and $t$. For the base case of $t = 0$, $F^0_{v,w}$ is regular for every $v, w$ since it can be described as $\emptyset$ or as an OR of some of the expressions "", $0$, and $1$. Specifically, if $v \neq w$ then $F^0_{v,w}(x) = 1$ if and only if $x$ consists of a single symbol $\sigma \in \{0,1\}$ and $T(v, \sigma) = w$. Therefore in this case $F^0_{v,w}$ corresponds to one of the four regular expressions $0|1$, $0$, $1$ or $\emptyset$, depending on whether $A$ transitions to $w$ from $v$ when it reads either 0 or 1, only one of these symbols, or neither. If $v = w$ then $F^0_{v,w}(x) = 1$ if and only if $x$ is the empty string or $x$ is a single symbol $\sigma$ with $T(v, \sigma) = v$ (a self-loop), so in this case $F^0_{v,v}$ corresponds to "" OR-ed with the symbols, if any, that label self-loops at $v$.
Inductive step: Now that we've seen the base case, let us prove the general case by induction. Assume, via the induction hypothesis, that for every $v', w' \in [C]$, we have a regular expression $R^t_{v',w'}$ that computes $F^t_{v',w'}$. We need to prove that $F^{t+1}_{v,w}$ is regular for every $v, w$. If the automaton arrives from $v$ to $w$ using the intermediate states $[t+1]$, then it visits the $t$-th state zero or more times. If the path labeled by $x$ causes the automaton to get from $v$ to $w$ without visiting the $t$-th state at all, then $x$ is matched by the regular expression $R^t_{v,w}$. If the path labeled by $x$ causes the automaton to get from $v$ to $w$ while visiting the $t$-th state $k > 0$ times then we can think of this path as:

• First travel from $v$ to $t$ using only intermediate states in $[t]$.

• Then go from $t$ back to itself $k-1$ times using only intermediate states in $[t]$.

• Then go from $t$ to $w$ using only intermediate states in $[t]$.

Therefore in this case the string $x$ is matched by the regular expression $R^t_{v,t}(R^t_{t,t})^* R^t_{t,w}$. (See also Fig. 6.8.)

Therefore we can compute $F^{t+1}_{v,w}$ using the regular expression

$$R^t_{v,w} \;|\; R^t_{v,t}(R^t_{t,t})^* R^t_{t,w} . \tag{6.14}$$

This completes the proof of the inductive step and hence of the theorem.
■
Figure 6.8: If we have regular expressions $R^t_{v',w'}$ corresponding to $F^t_{v',w'}$ for every $v', w' \in [C]$, we can obtain a regular expression $R^{t+1}_{v,w}$ corresponding to $F^{t+1}_{v,w}$. The key observation is that a path from $v$ to $w$ using $\{0, \ldots, t\}$ either does not touch $t$ at all, in which case it is captured by the expression $R^t_{v,w}$, or it goes from $v$ to $t$, comes back to $t$ zero or more times, and then goes from $t$ to $w$, in which case it is captured by the expression $R^t_{v,t}(R^t_{t,t})^* R^t_{t,w}$.
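The inductive construction in the proof of Theorem 6.16 can be sketched as a short Python function that builds the expressions $R^t_{v,w}$ for $t = 0, 1, \ldots, C$ and emits the result in the syntax of Python's `re` module. The function name `dfa_to_regex`, the dictionary representation of the transition function, and the use of `None` for $\emptyset$ are assumptions of this sketch; the example automaton is a three-state DFA for $(01)^*$ chosen for illustration (state 2 is a dead state). The base case includes self-loop symbols for $v = w$.

```python
import re


def union(a, b):
    """a|b, where None stands for the empty set."""
    if a is None or a == b:
        return b
    if b is None:
        return a
    return f"(?:{a}|{b})"


def concat(*parts):
    """Concatenation; the result is the empty set if any part is."""
    return None if any(p is None for p in parts) else "".join(parts)


def star(a):
    """(a)*; the star of the empty set or of "" matches only ""."""
    return "" if a is None or a == "" else f"(?:{a})*"


def dfa_to_regex(T, accepting, C, alphabet="01"):
    """Return an re-syntax pattern for the DFA (start state 0), or None."""
    # Base case t = 0: no intermediate states are allowed.
    R = [[None] * C for _ in range(C)]
    for v in range(C):
        for w in range(C):
            r = "" if v == w else None
            for s in alphabet:
                if T[v, s] == w:
                    r = union(r, s)
            R[v][w] = r
    # Inductive step, following Eq. (6.14): allow state t as an intermediate.
    for t in range(C):
        R = [[union(R[v][w], concat(R[v][t], star(R[t][t]), R[t][w]))
              for w in range(C)] for v in range(C)]
    result = None
    for w in sorted(accepting):
        result = union(result, R[0][w])
    return result


# Hypothetical DFA for (01)*: 0 = start/accepting, 1 = "saw a 0", 2 = dead.
T = {(0, "0"): 1, (0, "1"): 2, (1, "0"): 2, (1, "1"): 0,
     (2, "0"): 2, (2, "1"): 2}
pattern = dfa_to_regex(T, {0}, 3)
print(pattern)
print(bool(re.fullmatch(pattern, "0101")), bool(re.fullmatch(pattern, "011")))
```

The emitted pattern is typically much longer than a hand-written expression for the same function; the construction shows that an expression exists, not that it is short.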
6.4.3 Closure properties of regular expressions

If $F$ and $G$ are regular functions computed by the expressions $e$ and $f$ respectively, then the expression $e|f$ computes the function $H = F \vee G$ defined as $H(x) = F(x) \vee G(x)$. Another way to say this is that the set of regular functions is closed under the OR operation. That is, if $F$ and $G$ are regular then so is $F \vee G$. An important corollary of Theorem 6.16 is that this set is also closed under the NOT operation:

Lemma 6.17 (Regular expressions closed under complement). If $F : \{0,1\}^* \rightarrow \{0,1\}$ is regular then so is the function $\overline{F}$, where $\overline{F}(x) = 1 - F(x)$ for every $x \in \{0,1\}^*$.

Proof. If $F$ is regular then by Theorem 6.16 it can be computed by a DFA $A = (T, \mathcal{S})$ with some $C$ states. But then the DFA $\overline{A} = (T, [C] \setminus \mathcal{S})$, which does the same computation but flips the set of accepted states, will compute $\overline{F}$. By Theorem 6.16 this implies that $\overline{F}$ is regular as well.
■
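The proof of Lemma 6.17 amounts to one line of code once a DFA is written down explicitly. The sketch below assumes a dictionary representation of the transition function and uses the hypothetical names `run_dfa` and `complement`; the example automaton is a three-state DFA for $(01)^*$ chosen for illustration.

```python
def run_dfa(T, accepting, x, start=0):
    """Single pass over x; return True iff the DFA ends in an accepting state."""
    state = start
    for c in x:
        state = T[state, c]
    return state in accepting


def complement(accepting, C):
    """Flip the accepting states of a DFA with states 0, ..., C-1."""
    return set(range(C)) - set(accepting)


# Hypothetical DFA for (01)*: 0 = start/accepting, 1 = "saw a 0", 2 = dead.
T = {(0, "0"): 1, (0, "1"): 2, (1, "0"): 2, (1, "1"): 0,
     (2, "0"): 2, (2, "1"): 2}
ACCEPT = {0}

print(run_dfa(T, ACCEPT, "0101"), run_dfa(T, complement(ACCEPT, 3), "0101"))
```

Because the transition function is untouched, both automata visit exactly the same states on every input and differ only in the final answer.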
Since $a \wedge b = \overline{\overline{a} \vee \overline{b}}$, Lemma 6.17 implies that the set of regular functions is closed under the AND operation as well. Moreover, since OR, NOT and AND are a universal basis, this set is also closed under NAND, XOR, and any other finite function. That is, we have the following corollary:

Theorem 6.18 (Closure of regular expressions). Let $f : \{0,1\}^k \rightarrow \{0,1\}$ be any finite Boolean function, and let $F_0, \ldots, F_{k-1} : \{0,1\}^* \rightarrow \{0,1\}$ be regular functions. Then the function $G(x) = f(F_0(x), F_1(x), \ldots, F_{k-1}(x))$ is regular.

Proof. This is a direct consequence of the closure of regular functions under OR and NOT (and hence AND), combined with Theorem 4.13, which states that every $f$ can be computed by a Boolean circuit (which is simply a combination of the AND, OR, and NOT operations).
■

6.5 LIMITATIONS OF REGULAR EXPRESSIONS AND THE PUMPING LEMMA
The efficiency of regular expression matching makes them very useful. This is why operating systems and text editors often restrict their search interface to regular expressions and do not allow searching by specifying an arbitrary function. However, this efficiency comes at a cost. As we have seen, regular expressions cannot compute every function. In fact, there are some very simple (and useful!) functions that they cannot compute. Here is one example:

Lemma 6.19 (Matching parentheses). Let $\Sigma = \{\langle, \rangle\}$ and $MATCHPAREN : \Sigma^* \rightarrow \{0,1\}$ be the function that given a string of parentheses, outputs 1 if and only if every opening parenthesis is matched by a corresponding closed one. Then there is no regular expression over $\Sigma$ that computes $MATCHPAREN$.

Lemma 6.19 is a consequence of the following result, which is known as the pumping lemma:
Theorem 6.20 (Pumping Lemma). Let $e$ be a regular expression over some alphabet $\Sigma$. Then there is some number $n_0$ such that for every $w \in \Sigma^*$ with $|w| > n_0$ and $\Phi_e(w) = 1$, we can write $w = xyz$ for strings $x, y, z \in \Sigma^*$ satisfying the following conditions:

1. $|y| \geq 1$.

2. $|xy| \leq n_0$.

3. $\Phi_e(xy^kz) = 1$ for every $k \in \mathbb{N}$.
Figure 6.9: To prove the “pumping lemma” we look at a word $w$ that is much larger than the regular expression $e$ that matches it. In such a case, part of $w$ must be matched by some sub-expression of the form $(e')^*$, since this is the only operator that allows matching words longer than the expression. If we look at the “leftmost” such sub-expression and define $y^k$ to be the string that is matched by it, we obtain the partition needed for the pumping lemma.

Proof Idea:

The idea behind the proof is the following. Let $n_0$ be twice the number of symbols that are used in the expression $e$. Then the only way that there is some $w$ with $|w| > n_0$ and $\Phi_e(w) = 1$ is that $e$ contains the $*$ (i.e., star) operator and that there is a non-empty substring $y$ of $w$ that was matched by $(e')^*$ for some sub-expression $e'$ of $e$. We can now repeat $y$ any number of times and still get a matching string. See also Fig. 6.9.
⋆

The pumping lemma is a bit cumbersome to state, but one way to remember it is that it simply says the following: “if a string matching a regular expression is long enough, one of its substrings must be matched using the $*$ operator”.
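To see the three conditions of Theorem 6.20 in action, the sketch below searches for a decomposition $w = xyz$ for a concrete expression, using Python's `re` module as the matcher. The function name `find_pumping_split` and the cutoff `max_k` are assumptions of this sketch, and so is the value passed for $n_0$. Since only finitely many values of $k$ are tried, this is a sanity check of the statement, not a proof of it.

```python
import re


def find_pumping_split(pattern, w, n0, max_k=5):
    """Search for (x, y, z) with w = xyz, |y| >= 1, |xy| <= n0 such that
    x + y*k + z is matched for every k in 0..max_k; None if no split works."""
    for i in range(n0):                               # i = |x|
        for j in range(i + 1, min(n0, len(w)) + 1):   # j = |xy|
            x, y, z = w[:i], w[i:j], w[j:]
            if all(re.fullmatch(pattern, x + y * k + z)
                   for k in range(max_k + 1)):
                return x, y, z
    return None


# (01)* with a matched word of length 12 and an illustrative bound n0 = 10.
print(find_pumping_split(r"(01)*", "01" * 6, 10))
```

For `(01)*` the search settles on an empty $x$ and $y = 01$, which is exactly a substring matched by the starred sub-expression, as the proof idea predicts.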
Proof of Theorem 6.20. To prove the lemma formally, we use induction on the length of the expression. Like all induction proofs, this will be somewhat lengthy, but at the end of the day it directly follows the intuition above that somewhere we must have used the star operation. Reading this proof, and in particular understanding how the formal proof below corresponds to the intuitive idea above, is a very good way to get more comfortable with inductive proofs of this form.

Our inductive hypothesis is that for an $n$ length expression, $n_0 = 2n$ satisfies the conditions of the lemma. The base case is when the expression is a single symbol $\sigma \in \Sigma$ or the expression is $\emptyset$ or "". In all these cases the conditions of the lemma are satisfied simply because $n_0 = 2$ and there is no string of length larger than $n_0$ that is matched by the expression.

We now prove the inductive step. Let $e$ be a regular expression with $n > 1$ symbols. We set $n_0 = 2n$ and let $w \in \Sigma^*$ be a string satisfying $|w| > n_0$. Since $e$ has more than one symbol, it has one of the forms (a) $e'|e''$, (b) $(e')(e'')$, or (c) $(e')^*$, where in all these cases the subexpressions $e'$ and $e''$ have fewer symbols than $e$ and hence satisfy the induction hypothesis.

In the case (a), every string $w$ matched by $e$ must be matched by either $e'$ or $e''$. If $e'$ matches $w$ then, since $|w| > 2|e'|$, by the induction hypothesis there exist $x, y, z$ with $w = xyz$, $|y| \geq 1$ and $|xy| \leq 2|e'| < n_0$ such that $e'$ (and therefore also $e = e'|e''$) matches $xy^kz$ for every $k$. The same argument works in the case that $e''$ matches $w$.

In the case (b), if $w$ is matched by $(e')(e'')$ then we can write $w = w'w''$ where $e'$ matches $w'$ and $e''$ matches $w''$. We split to subcases. If $|w'| > 2|e'|$ then by the induction hypothesis there exist $x, y, z'$ with $|y| \geq 1$, $|xy| \leq 2|e'| < n_0$ such that $w' = xyz'$ and $e'$ matches $xy^kz'$ for every $k \in \mathbb{N}$. This completes the proof since if we set $z = z'w''$ then we see that $w = w'w'' = xyz$ and $e = (e')(e'')$ matches $xy^kz$ for every $k \in \mathbb{N}$. Otherwise, if $|w'| \leq 2|e'|$ then since $|w| = |w'| + |w''| > n_0 = 2(|e'| + |e''|)$, it must be that $|w''| > 2|e''|$. Hence by the induction hypothesis there exist $x', y, z$ such that $|y| \geq 1$, $|x'y| \leq 2|e''|$ and $e''$ matches $x'y^kz$ for every $k \in \mathbb{N}$. But now if we set $x = w'x'$ we see that $|xy| \leq |w'| + |x'y| \leq 2|e'| + 2|e''| = n_0$ and on the other hand the expression $e = (e')(e'')$ matches $xy^kz = w'x'y^kz$ for every $k \in \mathbb{N}$.

In case (c), if $w$ is matched by $(e')^*$ then $w = w_0 \cdots w_t$ where for every $i \in [t+1]$, $w_i$ is a nonempty string matched by $e'$. If $|w_0| > 2|e'|$, then we can use the same approach as in the concatenation case above. Otherwise, we simply note that if $x$ is the empty string, $y = w_0$, and $z = w_1 \cdots w_t$ then $|xy| \leq n_0$ and $xy^kz$ is matched by $(e')^*$ for every $k \in \mathbb{N}$.
■
Remark 6.21 (Recursive definitions and inductive proofs). When an object is recursively defined (as in the case of regular expressions) then it is natural to prove properties of such objects by induction. That is, if we want to prove that all objects of this type have property $P$, then it is natural to use an inductive step that says that if $o', o'', o'''$ etc. have property $P$ then so does an object $o$ that is obtained by composing them.

Using the pumping lemma, we can easily prove Lemma 6.19 (i.e., the non-regularity of the “matching parenthesis” function):

Proof of Lemma 6.19. Suppose, towards the sake of contradiction, that there is an expression $e$ such that $\Phi_e = MATCHPAREN$. Let $n_0$ be the number obtained from Theorem 6.20 and let $w = \langle^{n_0} \rangle^{n_0}$ (i.e., $n_0$ left parentheses followed by $n_0$ right parentheses). Then we see that if we write $w = xyz$ as in Theorem 6.20, the condition $|xy| \leq n_0$ implies that $y$ consists solely of left parentheses. Hence the string $xy^2z$ will contain more left parentheses than right parentheses. Hence $MATCHPAREN(xy^2z) = 0$ but by the pumping lemma $\Phi_e(xy^2z) = 1$, contradicting our assumption that $\Phi_e = MATCHPAREN$.
■
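The argument above can be checked by brute force for any particular value of $n_0$. In the sketch below, the names `match_paren` and `pumping_splits`, the use of ASCII parentheses in place of $\langle$ and $\rangle$, and the sample value `n0 = 5` are all assumptions made for illustration; `match_paren` implements the usual notion of balanced parentheses as one reading of the definition in Lemma 6.19.

```python
def match_paren(w):
    """True iff w is a balanced string over the alphabet {"(", ")"}."""
    depth = 0
    for c in w:
        depth += 1 if c == "(" else -1
        if depth < 0:          # a closing parenthesis with no partner
            return False
    return depth == 0


def pumping_splits(w, n0):
    """Yield every (x, y, z) with w = xyz, |y| >= 1 and |xy| <= n0."""
    for i in range(n0):                    # i = |x|
        for j in range(i + 1, n0 + 1):     # j = |xy|
            yield w[:i], w[i:j], w[j:]


n0 = 5                                     # illustrative value only
w = "(" * n0 + ")" * n0
print(match_paren(w))
# Whatever split is chosen, pumping y once more breaks the balance.
print(all(not match_paren(x + y * 2 + z) for x, y, z in pumping_splits(w, n0)))
```

The second check succeeds for every allowed split because the constraint $|xy| \leq n_0$ forces $y$ to lie entirely inside the block of opening parentheses.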
The pumping lemma is a very useful tool to show that certain functions are not computable by a regular expression. However, it is not an “if and only if” condition for regularity: there are non-regular functions that still satisfy the pumping lemma conditions. To understand the pumping lemma, it is crucial to follow the order of quantifiers in Theorem 6.20. In particular, the number $n_0$ in the statement of Theorem 6.20 depends on the regular expression (in the proof we chose $n_0$ to be twice the number of symbols in the expression). So, if we want to use the pumping lemma to rule out the existence of a regular expression $e$ computing some function $F$, we need to be able to choose an appropriate input $w \in \{0,1\}^*$ that can be arbitrarily large and satisfies $F(w) = 1$. This makes sense if you think about the intuition behind the pumping lemma: we need $w$ to be large enough as to force the use of the star operator.
Solved Exercise 6.4 (Palindromes is not regular). Prove that the following function over the alphabet $\{0, 1, ;\}$ is not regular: $PAL(w) = 1$ if and only if $w = u;u^R$ where $u \in \{0,1\}^*$ and $u^R$ denotes $u$ “reversed”: the string $u_{|u|-1} \cdots u_0$. (The Palindrome function is most often defined without an explicit separator character $;$, but the version with such a separator is a bit cleaner, and so we use it here. This does not make much difference, as one can easily encode the separator as a special binary string instead.)
■

Figure 6.10: A cartoon of a proof using the pumping lemma that a function $F$ is not regular. The pumping lemma states that if $F$ is regular then there exists a number $n_0$ such that for every large enough $w$ with $F(w) = 1$, there exists a partition of $w$ to $w = xyz$ satisfying certain conditions such that for every $k \in \mathbb{N}$, $F(xy^kz) = 1$. You can imagine a pumping-lemma based proof as a game between you and the adversary. Every “there exists” quantifier corresponds to an object you are free to choose on your own (and base your choice on previously chosen objects). Every “for every” quantifier corresponds to an object the adversary can choose arbitrarily (and again based on prior choices) as long as it satisfies the conditions. A valid proof corresponds to a strategy by which no matter what the adversary does, you can win the game by obtaining a contradiction, which would be a choice of $k$ that would result in $F(xy^kz) = 0$, hence violating the conclusion of the pumping lemma.
Solution:

We use the pumping lemma. Suppose toward the sake of contradiction that there is a regular expression $e$ computing $PAL$, and let $n_0$ be the number obtained by the pumping lemma (Theorem 6.20). Consider the string $w = 0^{n_0};0^{n_0}$. Since the reverse of the all zero string is the all zero string, $PAL(w) = 1$. Now, by the pumping lemma, if $PAL$ is computed by $e$, then we can write $w = xyz$ such that $|xy| \leq n_0$, $|y| \geq 1$ and $PAL(xy^kz) = 1$ for every $k \in \mathbb{N}$. In particular, it must hold that $PAL(xz) = 1$, but this is a contradiction, since $xz = 0^{n_0 - |y|};0^{n_0}$ and so its two parts are not of the same length and in particular are not the reverse of one another.
■

For yet another example of a pumping-lemma based proof, see Fig. 6.10 which illustrates a cartoon of the proof of the non-regularity of the function $F : \{0,1\}^* \rightarrow \{0,1\}$ which is defined as $F(x) = 1$ iff $x = 0^n1^n$ for some $n \in \mathbb{N}$ (i.e., $x$ consists of a string of consecutive zeroes, followed by a string of consecutive ones of the same length).
6.6 ANSWERING SEMANTIC QUESTIONS ABOUT REGULAR EXPRESSIONS
Regular expressions have applications beyond search. For example, regular expressions are often used to define tokens (such as what is a valid variable identifier, or keyword) in the design of parsers, compilers and interpreters for programming languages. Regular expressions have other applications too: for example, in recent years, the world of networking moved from fixed topologies to "software defined networks". Such networks are routed by programmable switches that can implement policies such as "if packet is secured by SSL then forward it to A, otherwise forward it to B". To represent such policies we need a language that is on one hand sufficiently expressive to capture the policies we want to implement, but on the other hand sufficiently restrictive so that we can quickly execute them at network speed and also be able to answer questions such as "can C see the packets moved from A to B?". The NetKAT network programming language (https://goo.gl/oeJNuw) uses a variant of regular expressions to achieve precisely that. For this application, it is important that we are not merely able to answer whether an expression $e$ matches a string $x$ but also answer semantic questions about regular expressions such as "do expressions $e$ and $e'$ compute the same function?" and "does there exist a string $x$ that is matched by the expression $e$?". The following theorem shows that we can answer the latter question:
Theorem 6.22 (Emptiness of regular languages is computable). There is an algorithm that given a regular expression $e$, outputs 1 if and only if $\Phi_e$ is the constant zero function.
Proof Idea:
The idea is that we can read this off directly from the structure of the expression. A regular expression $e$ computes the constant zero function only if $e$ is $\emptyset$, or is built from $\emptyset$ in a way that leaves no string to match, for example by concatenating $\emptyset$ with other expressions.
⋆
Proof of Theorem 6.22. Define a regular expression to be "empty" if it computes the constant zero function. Given a regular expression $e$, we can determine if $e$ is empty using the following rules:
• If $e$ has the form $\sigma$ or "" then it is not empty.
• $e|e'$ is empty if and only if both $e$ and $e'$ are empty. In particular, if $e$ is not empty then $e|e'$ is not empty for every $e'$.
• $e^*$ is never empty, since it always matches the empty string.
• $e\,e'$ is empty if and only if at least one of $e$ and $e'$ is empty. In particular, if $e$ and $e'$ are both not empty then $e\,e'$ is not empty.
• $\emptyset$ is empty.
Using these rules, it is straightforward to come up with a recursive algorithm to determine emptiness.
■
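A minimal sketch of such a recursive algorithm follows. The tuple encoding of expressions (tags `"empty"`, `"eps"`, `"sym"`, `"or"`, `"cat"`, `"star"`) and the function name `is_empty` are assumptions made for this illustration, not notation from the text. The sketch treats $(e')^*$ as never empty, on the assumption that the star of any expression matches the empty string (the case of zero repetitions).

```python
# Hypothetical encoding of regular expressions as nested tuples.
EMPTY = ("empty",)   # the expression ∅
EPS = ("eps",)       # the expression ""


def is_empty(e) -> bool:
    """Return True iff the expression e matches no string at all."""
    tag = e[0]
    if tag == "empty":
        return True
    if tag in ("sym", "eps"):      # a single symbol, or ""
        return False
    if tag == "or":                # e'|e'' is empty only if both sides are
        return is_empty(e[1]) and is_empty(e[2])
    if tag == "cat":               # e' e'' is empty if either side is
        return is_empty(e[1]) or is_empty(e[2])
    if tag == "star":              # (e')* always matches the empty string
        return False
    raise ValueError(f"unknown expression tag: {tag!r}")
```

For instance, `is_empty(("cat", ("sym", "0"), EMPTY))` is true, while `is_empty(("star", EMPTY))` is false. The recursion visits each sub-expression once, so the running time is linear in the size of the expression.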
Using Theorem 6.22, we can obtain an algorithm that determines whether or not two regular expressions $e$ and $e'$ are equivalent, in the sense that they compute the same function.
Theorem 6.23 (Equivalence of regular expressions is computable). Let $REGEQ : \{0,1\}^* \rightarrow \{0,1\}$ be the function that, on input (a string representing) a pair of regular expressions $e, e'$, outputs $REGEQ(e, e') = 1$ if and only if $\Phi_e = \Phi_{e'}$. Then there is an algorithm that computes $REGEQ$.
Proof Idea:
The idea is to show that given a pair of regular expressions $e$ and $e'$ we can find an expression $e''$ such that $\Phi_{e''}(x) = 1$ if and only if $\Phi_e(x) \neq \Phi_{e'}(x)$. Therefore $\Phi_{e''}$ is the constant zero function if and only if $e$ and $e'$ are equivalent, and thus we can test for emptiness of $e''$ to determine equivalence of $e$ and $e'$.
⋆
Proof of Theorem 6.23. We will prove Theorem 6.23 from Theorem 6.22. (The two theorems are in fact equivalent: it is easy to prove Theorem 6.22 from Theorem 6.23, since checking for emptiness is the same as checking equivalence with the expression $\emptyset$.) Given two regular expressions $e$ and $e'$, we will compute an expression $e''$ such that $\Phi_{e''}(x) = 1$ if and only if $\Phi_e(x) \neq \Phi_{e'}(x)$. One can see that $e$ is equivalent to $e'$ if and only if $e''$ is empty.
We start with the observation that for every pair of bits $a, b \in \{0,1\}$, $a \neq b$ if and only if

$$(a \wedge \overline{b}) \vee (\overline{a} \wedge b) . \qquad (6.15)$$

Hence we need to construct $e''$ such that for every $x$,

$$\Phi_{e''}(x) = (\Phi_e(x) \wedge \overline{\Phi_{e'}(x)}) \vee (\overline{\Phi_e(x)} \wedge \Phi_{e'}(x)) . \qquad (6.16)$$

To construct the expression $e''$, we will show how given any pair of expressions $e$ and $e'$, we can construct expressions $e \wedge e'$ and $\overline{e}$ that compute the functions $\Phi_e \wedge \Phi_{e'}$ and $\overline{\Phi_e}$ respectively. (Computing the expression for $e \vee e'$ is straightforward using the $|$ operation of regular expressions.)
Specifically, by Lemma 6.17, regular functions are closed under negation, which means that for every regular expression $e$, there is an expression $\overline{e}$ such that $\Phi_{\overline{e}}(x) = 1 - \Phi_e(x)$ for every $x \in \{0,1\}^*$. Now, for every two expressions $e$ and $e'$, the expression

$$e \wedge e' = \overline{(\overline{e} | \overline{e'})} \qquad (6.17)$$

computes the AND of the two expressions. Given these two transformations, we see that for every pair of regular expressions $e$ and $e'$ we can find a regular expression $e''$ satisfying (6.16) such that $e''$ is empty if and only if $e$ and $e'$ are equivalent.
■
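The proof works at the level of expressions. As a hedged illustration of the same "test the XOR for emptiness" idea, the sketch below carries it out on DFAs instead; this is a swapped-in technique, not the construction of the proof, and it assumes both expressions have already been converted to DFAs (the conversion is not shown). The triple encoding `(start, accepting, delta)` and the function names are hypothetical.

```python
from collections import deque


def dfa_difference_witness(d1, d2, alphabet="01"):
    """Return a shortest string accepted by exactly one DFA, or None if none exists.

    Each DFA is a hypothetical triple (start, accepting_set, delta), where
    delta[(state, symbol)] is the next state and is defined for every pair.
    """
    (s1, acc1, t1), (s2, acc2, t2) = d1, d2
    start = (s1, s2)
    seen = {start}
    queue = deque([(start, "")])
    while queue:
        (p, q), word = queue.popleft()
        if (p in acc1) != (q in acc2):      # XOR of the two outputs, as in (6.16)
            return word
        for a in alphabet:
            nxt = (t1[(p, a)], t2[(q, a)])  # one step of the product automaton
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, word + a))
    return None


def dfa_equivalent(d1, d2, alphabet="01") -> bool:
    """Two DFAs are equivalent iff their XOR accepts no string."""
    return dfa_difference_witness(d1, d2, alphabet) is None
```

The breadth-first search visits at most one pair of states per element of the product of the two state sets, so it always terminates; returning a witness string rather than a bare yes/no answer makes a negative result easy to verify by hand.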
✓ Chapter Recap
• We model computational tasks on arbitrarily large inputs using infinite functions $F : \{0,1\}^* \rightarrow \{0,1\}^*$.
• Such functions take an arbitrarily long (but still finite!) string as input, and cannot be described by a finite table of inputs and outputs.
• A function with a single bit of output is known as a Boolean function, and the task of computing it is equivalent to deciding a language $L \subseteq \{0,1\}^*$.
• Deterministic finite automata (DFAs) are one simple model for computing (infinite) Boolean functions.
• There are some functions that cannot be computed by DFAs.
• The set of functions computable by DFAs is the same as the set of languages that can be recognized by regular expressions.
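As a small illustration of the recap, the following sketch shows how one finite description handles inputs of every length. The function `run_dfa`, the triple encoding, and the two-state parity automaton `XOR_DFA` are hypothetical choices made for this example.

```python
def run_dfa(start, accepting, delta, x: str) -> int:
    """Run a DFA on a binary string x of any length and return its output bit."""
    state = start
    for symbol in x:
        state = delta[(state, symbol)]   # one table lookup per input symbol
    return int(state in accepting)


# Hypothetical example: a two-state DFA that outputs 1 iff x has an odd number of ones.
XOR_DFA = (0, {1}, {(0, "0"): 0, (0, "1"): 1, (1, "0"): 1, (1, "1"): 0})
```

The same four-entry transition table is used whether `x` has three symbols or three million, which is exactly what a single Boolean circuit cannot offer.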
6.7 EXERCISES
Exercise 6.1 (Closure properties of regular functions). Suppose that $F, G : \{0,1\}^* \rightarrow \{0,1\}$ are regular. For each one of the following definitions of the function $H$, either prove that $H$ is always regular or give a counterexample for regular $F, G$ that would make $H$ not regular.
1. $H(x) = F(x) \vee G(x)$.
2. $H(x) = F(x) \wedge G(x)$.
3. $H(x) = NAND(F(x), G(x))$.
4. $H(x) = F(x^R)$ where $x^R$ is the reverse of $x$: $x^R = x_{n-1}x_{n-2} \cdots x_0$ for $n = |x|$.
5. $H(x) = \begin{cases} 1 & x = uv \text{ s.t. } F(u) = G(v) = 1 \\ 0 & \text{otherwise} \end{cases}$
6. $H(x) = \begin{cases} 1 & x = uu \text{ s.t. } F(u) = G(u) = 1 \\ 0 & \text{otherwise} \end{cases}$
7. $H(x) = \begin{cases} 1 & x = uu^R \text{ s.t. } F(u) = G(u) = 1 \\ 0 & \text{otherwise} \end{cases}$
■
Exercise 6.2. One among the following two functions that map $\{0,1\}^*$ to $\{0,1\}$ can be computed by a regular expression, and the other one cannot. For the one that can be computed by a regular expression, write the expression that does it. For the one that cannot, prove that this cannot be done using the pumping lemma.
• $F(x) = 1$ if 4 divides $\sum_{i=0}^{|x|-1} x_i$ and $F(x) = 0$ otherwise.
• $G(x) = 1$ if and only if $\sum_{i=0}^{|x|-1} x_i \geq |x|/4$ and $G(x) = 0$ otherwise.
■
Exercise 6.3 (Non-regularity).
1. Prove that the following function $F : \{0,1\}^* \rightarrow \{0,1\}$ is not regular. For every $x \in \{0,1\}^*$, $F(x) = 1$ iff $x$ is of the form $x = 1^{3^i}$ for some $i > 0$.
2. Prove that the following function $F : \{0,1\}^* \rightarrow \{0,1\}$ is not regular. For every $x \in \{0,1\}^*$, $F(x) = 1$ iff $\sum_j x_j = 3^i$ for some $i > 0$.
■
6.8 BIBLIOGRAPHICAL NOTES
The relation of regular expressions with finite automata is a beautiful topic, which we only touch upon in this text. It is covered more extensively in [Sip97; HMU14; Koz97]. These texts also discuss topics such as non-deterministic finite automata (NFA) and the relation between context-free grammars and pushdown automata.
The automaton of Fig. 6.4 was generated using the FSM simulator (http://ivanzuzak.info/noam/webapps/fsm_simulator/) of Ivan Zuzak and Vedrana Jankovic. Our proof of Theorem 6.11 is closely related to the Myhill-Nerode Theorem (https://goo.gl/mnKVMP). One direction of the Myhill-Nerode theorem can be stated as saying that if $e$ is a regular expression then there is at most a finite number of strings $z_0, \ldots, z_{k-1}$ such that $\Phi_{e[z_i]} \neq \Phi_{e[z_j]}$ for every $0 \leq i \neq j < k$.