
Page 1: DFA minimization algorithms in map reduce

DFA Minimization Algorithms in Map-Reduce

Iraj Hedayati Somarin

Master Thesis Defense – January 2016

Computer Science and Software Engineering
Faculty of Engineering and Computer Science

Concordia University

Supervisor: Gösta K. Grahne
Examiner: Brigitte Jaumard
Examiner: Hovhannes A. Harutyunyan
Chair: Rajagopalan Jayakumar

Page 2: DFA minimization algorithms in map reduce

2

Outline
• Introduction
• DFA Minimization in Map-Reduce
• Cost Analysis
• Experimental Results
• Conclusion

Page 3: DFA minimization algorithms in map reduce

3

INTRODUCTION

An introduction to the problem and the related work done so far

Page 4: DFA minimization algorithms in map reduce

4

DFA, Big-Data and our Motivation
• Finite Automata
• Deterministic Finite Automata (DFA): A = ⟨Q, Σ, δ, s, F⟩
• DFA minimization is the process of:
  • Removing unreachable states
  • Merging non-distinguishable states
• What is Big-Data? (e.g., peta = 2^50 or 10^15)
• DFA minimization is insufficiently studied for data-intensive applications and parallel environments
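The first minimization step above can be sketched in code. The dict-based encoding of A = ⟨Q, Σ, δ, s, F⟩ below is an illustrative assumption, not the thesis's representation; removing unreachable states is a plain BFS over δ:

```python
from collections import deque

# Hypothetical DFA A = <Q, Sigma, delta, s, F> as plain Python data.
Q = {0, 1, 2, 3}                 # states
SIGMA = {"a", "b"}               # alphabet
DELTA = {                        # delta: (state, symbol) -> state
    (0, "a"): 1, (0, "b"): 2,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 2,
    (3, "a"): 0, (3, "b"): 3,    # state 3 is unreachable from s
}
S = 0                            # start state
F = {2}                          # final states

def reachable_states(start, delta):
    """BFS from the start state: the 'remove unreachable states' step."""
    seen = {start}
    frontier = deque([start])
    while frontier:
        q = frontier.popleft()
        for (p, _), r in delta.items():
            if p == q and r not in seen:
                seen.add(r)
                frontier.append(r)
    return seen

reach = reachable_states(S, DELTA)
print(sorted(reach))  # [0, 1, 2] -- state 3 is dropped
```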

Page 5: DFA minimization algorithms in map reduce

5

DFA Minimization Methods (Watson, 1993)

[Taxonomy figure: Equivalence of States · Equivalence Relation · Bottom-Up · Top-Down · Layer-wise · Unordered · State Pairs · Point-Wise · Brzozowski]

Denote a partition on the state set; then: (formula lost in extraction)

Page 6: DFA minimization algorithms in map reduce

6

Moore’s Algorithm (Moore, 1956)
• Input is a DFA A = ⟨Q, Σ, δ, s, F⟩ with n = |Q| states and k = |Σ| symbols
• Initialize a partition over Q separating final from non-final states: P⁰ = {F, Q ∖ F}
• Iteratively refine the partition using an equivalence relation in each iteration
• Complexity: O(k·n²)
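A minimal single-machine sketch of Moore's refinement, assuming a dict-based DFA encoding (illustrative, not the thesis's representation):

```python
def moore_minimize_blocks(Q, sigma, delta, F):
    """Moore's refinement: start from {F, Q \\ F}; each round, two states stay
    in the same block only if their successors agree on every symbol.
    Worst case O(k * n^2): up to n rounds of O(k * n) work."""
    block = {q: 1 if q in F else 0 for q in Q}   # initial partition {F, Q \ F}
    while True:
        # signature: current block plus the successor blocks for each symbol
        sig = {q: (block[q],) + tuple(block[delta[(q, a)]] for a in sorted(sigma))
               for q in Q}
        ids = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        new_block = {q: ids[sig[q]] for q in Q}
        if len(set(new_block.values())) == len(set(block.values())):
            return new_block                      # no block was split: done
        block = new_block

# Hypothetical 4-state DFA over {a}: states 0 and 1 are non-distinguishable.
Q, SIGMA, F = {0, 1, 2, 3}, {"a"}, {3}
DELTA = {(0, "a"): 2, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3}
b = moore_minimize_blocks(Q, SIGMA, DELTA, F)
print(b[0] == b[1], b[1] == b[2])  # True False
```

The signature always contains the state's current block, so each round can only split blocks, never merge them; the loop stops at the first fixpoint.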

Page 7: DFA minimization algorithms in map reduce

7

Hopcroft’s Algorithm (Hopcroft, 1971)

• The idea is to avoid some of the unnecessary operations of Moore's method
• Input is a DFA A = ⟨Q, Σ, δ, s, F⟩ with n = |Q| states and k = |Σ| symbols
• Initialize the partition over Q as {F, Q ∖ F}
• Keep a list of splitters
• Iteratively split blocks using a splitter ⟨B, a⟩, where B is a block and a ∈ Σ
• Update the list of splitters
• Complexity: O(k·n·log n)
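The splitter loop can be sketched as follows. This is a simplified single-machine version (stale splitters for already-split blocks remain in the queue, which is harmless), not the Map-Reduce variant developed later in the thesis:

```python
def hopcroft_blocks(Q, sigma, delta, F):
    """Hopcroft's splitter-queue refinement; with the 'enqueue the smaller
    half' rule the full algorithm runs in O(k * n * log n)."""
    # preimage[a][q] = set of states p with delta(p, a) == q
    preimage = {a: {} for a in sigma}
    for (p, a), q in delta.items():
        preimage[a].setdefault(q, set()).add(p)
    P = [B for B in (set(F), set(Q) - set(F)) if B]   # partition {F, Q \ F}
    work = [(frozenset(B), a) for B in P for a in sigma]
    while work:
        B, a = work.pop()
        X = set()                                     # states entering B on a
        for q in B:
            X |= preimage[a].get(q, set())
        for Y in list(P):
            inter, diff = Y & X, Y - X
            if inter and diff:                        # splitter <B, a> splits Y
                P.remove(Y)
                P += [inter, diff]
                for c in sigma:                       # enqueue the smaller half
                    work.append((frozenset(min(inter, diff, key=len)), c))
    return P

# Hypothetical 4-state DFA over {a}: the minimal DFA merges states 0 and 1.
Q, SIGMA, F = {0, 1, 2, 3}, {"a"}, {3}
DELTA = {(0, "a"): 2, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3}
blocks = hopcroft_blocks(Q, SIGMA, DELTA, F)
print(sorted(sorted(B) for B in blocks))  # [[0, 1], [2], [3]]
```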

Page 8: DFA minimization algorithms in map reduce

8

Hopcroft’s Algorithm (Example)

[Example figure: blocks P, P1, P2 with QUE = {⟨P, a⟩, ⟨P1, a⟩, ⟨P2, a⟩}; after block B is split into B1 and B2, QUE = QUE ∪ {⟨B1, a⟩}]

Page 9: DFA minimization algorithms in map reduce

9

Map-Reduce Model

[Diagram: original data blocks (Data 1–4) on the DFS are read by Mapper 1 and Mapper 2 in the Mapping phase; the mapped data is shuffled to Reducer 1–3 in the Reduce phase, and the output (Data 1–3) is written back to the DFS]
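The map/shuffle/reduce cycle in the diagram can be simulated on one machine. The transition-counting job below is an invented example for illustration, not one of the thesis's algorithms:

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(records, mapper, reducer):
    """Toy single-machine model: map each input record to (key, value)
    pairs, shuffle (sort + group) by key, then reduce each key group."""
    mapped = [kv for rec in records for kv in mapper(rec)]        # Mapping
    mapped.sort(key=itemgetter(0))                                # Shuffle
    return [reducer(k, [v for _, v in group])                     # Reduce
            for k, group in groupby(mapped, key=itemgetter(0))]

# Hypothetical job: count outgoing transitions per source state of a DFA.
transitions = [(0, "a", 1), (0, "b", 2), (1, "a", 1), (2, "b", 2)]
counts = map_reduce(transitions,
                    mapper=lambda t: [(t[0], 1)],
                    reducer=lambda k, vals: (k, sum(vals)))
print(counts)  # [(0, 2), (1, 1), (2, 1)]
```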

Page 10: DFA minimization algorithms in map reduce

10

Related Works in Parallel DFA Minimization

1) Employing the EREW-PRAM model (Moore's method) (Ravikumar and Xiong 1996)

2) Employing the CRCW-PRAM model (Moore's method) (Tewari et al. 2002)

3) Employing the Map-Reduce model (Moore's method) [Moore-MR] (Harrafi 2015)

• The challenge is how to store block numbers; the three works handle it by:
  1) Parallel in-block sorting, renaming blocks serially
  2) A parallel perfect hashing function and partial sums
  3) Taking no action

Page 11: DFA minimization algorithms in map reduce

11

Cost Model
• Communication Complexity (Yao 1979 & Kushilevitz 1997)
• The Lower Bound Recipe for Replication Rate (Afrati et al. 2013)
• Computational Complexity of Map-Reduce (Turan 2015)

Page 12: DFA minimization algorithms in map reduce

12

Cost Model – Communication Complexity

• Yao’s two-party model: Alice holds x ∈ {0,1}ⁿ and Bob holds y ∈ {0,1}ⁿ; they want to compute f : {0,1}ⁿ × {0,1}ⁿ → {0,1}. How much communication is required?

• Upper bound (worst case): n + 1 bits

[Figure: the communication matrix over A ⊂ {0,1}ⁿ and B ⊂ {0,1}ⁿ partitioned into monochromatic rectangles Rec 1 … Rec 6]

• Lower bound: log₂ t, where t is the number of f-monochromatic rectangles needed to partition the matrix

• The fooling set is a well-known method for lower-bounding the number of f-monochromatic rectangles
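As a standard illustration of the fooling-set method (not taken from the slides), consider the equality function EQ(x, y) = 1 iff x = y:

```latex
% Fooling set for EQ:
S = \{ (x,x) : x \in \{0,1\}^n \}, \qquad |S| = 2^n.
% For distinct x \ne x', \mathrm{EQ}(x,x') = 0 while
% \mathrm{EQ}(x,x) = \mathrm{EQ}(x',x') = 1, so no 1-monochromatic
% rectangle contains two elements of S. Hence at least 2^n rectangles
% are needed, and
D(\mathrm{EQ}) \;\ge\; \log_2 2^n \;=\; n.
```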

Page 13: DFA minimization algorithms in map reduce

13

Cost Model – Lower Bound Recipe(Afrati et al. 2013)

• Reducer i (for i = 1, …, n) receives ρᵢ input records, bounded by the reducer capacity q, and covers g(ρᵢ) of the outputs

• Input I, output O; covering all outputs requires Σᵢ₌₁ⁿ g(ρᵢ) ≥ |O|

• Replication rate:

R = (Σᵢ₌₁ⁿ ρᵢ) / |I|
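A toy computation of the replication rate R = (Σᵢ ρᵢ)/|I|; the schema and numbers below are invented for illustration:

```python
def replication_rate(reducer_inputs, input_size):
    """R = (sum of inputs received by all reducers) / |I|  (Afrati et al. 2013)."""
    return sum(reducer_inputs) / input_size

# Hypothetical schema: |I| = 6 input records, each sent to 2 reducers,
# spread over 3 reducers of capacity 4.
r = replication_rate([4, 4, 4], 6)
print(r)  # 2.0
```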

Page 14: DFA minimization algorithms in map reduce

14

Cost Model – Computational Complexity(Turan 2015)

• Let us denote a Turing machine M where:
  • a flag indicates whether it is a mapper task or a reducer task
  • a parameter indicates the round number
  • a parameter indicates the input size
  • a parameter indicates the reducer size

• There is a bounded-space and bounded-time Turing machine for each task (bounds lost in extraction)

Page 15: DFA minimization algorithms in map reduce

15

DFA MINIMIZATION IN MAP-REDUCE

The proposed algorithms for minimizing a DFA in the Map-Reduce model

Page 16: DFA minimization algorithms in map reduce

16

Enhancement to Moore-MR
• Moore-MR (Harrafi 2015):
  • Input: the DFA's transition records
  • Pre-processing: generate auxiliary records from the input
  • Mapping schema: map every transition record to reducers according to its states
  • Reducer task: compute the new block number using Moore's method
• Note that to accomplish its task, a reducer requires the block number of every state it has a transition to; transitions are responsible for carrying this data
• The challenge: new block numbers are concatenations of other block numbers, so the size of each block number grows with every round

Page 17: DFA minimization algorithms in map reduce

17

Enhancement to Moore-MR: PPHF-MR

• Given a one-to-one (perfect) hash function on block numbers, renaming can be done in parallel
• Mapping: map every record to a reducer by its hashed block number
• Reducer task: assign new block numbers from a dedicated range, determined by the reducer number

Moore-MR-PPHF is obtained by applying PPHF-MR after each iteration of Moore-MR
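The exact hash function is elided on the slide. One common one-to-one (pairing) function on pairs of numbers below n, shown purely as an assumption and not necessarily the thesis's PPHF, is h(x, y) = x·n + y:

```python
def pair_hash(x, y, n):
    """One-to-one on {0..n-1} x {0..n-1}: distinct pairs get distinct
    values, so a pair of block numbers can be renamed to one small integer."""
    assert 0 <= x < n and 0 <= y < n
    return x * n + y

def unpair(h, n):
    """Inverse of pair_hash, showing that it is injective."""
    return divmod(h, n)

n = 10
# All n*n pairs map to n*n distinct values, and the mapping inverts cleanly.
assert len({pair_hash(x, y, n) for x in range(n) for y in range(n)}) == n * n
print(pair_hash(3, 7, n), unpair(37, n))  # 37 (3, 7)
```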

Page 18: DFA minimization algorithms in map reduce

18

Hopcroft-MR

• Pre-Processing (Mapper → Reducer)
• Iterate while QUE is not empty:
  • PartitionDetect (Mapper → Reducer)
  • BlockUpdate (Mapper → Reducer)
  • PPHF-MR (Mapper → Reducer)
• Construct the minimal DFA

Record types (keyed by hashes h(q), h(p), h(πₚ) of states and blocks):

• Transition tuple: Δ, blocks[a, Bi]
• Block tuple: blocks[a, Bi]
• Update tuple: new block number, blocks[a, Bi]

Page 19: DFA minimization algorithms in map reduce

19

Hopcroft-MR vs. Hopcroft-MR-PAR

• In Hopcroft-MR we pick one splitter at a time, while in Hopcroft-MR-PAR we pick all the splitters from QUE at once

• In Hopcroft-MR, (formula lost in extraction)

• In Hopcroft-MR-PAR, A (formula lost in extraction)

• where A is a bit vector

Page 20: DFA minimization algorithms in map reduce

20

COST ANALYSIS

Analyzing the cost measures of the proposed algorithms, and finding lower and upper bounds for each

Page 21: DFA minimization algorithms in map reduce

21

Communication Cost Bounds
• Upper bound for the DFA minimization problem in parallel environments
• Lower bound for the DFA minimization problem in parallel environments

Page 22: DFA minimization algorithms in map reduce

22

Lower Bound on Replication Rate

• For every input record (transition), a reducer produces exactly one record of output

• The output is exactly equal in size to the input, containing the updated transitions; hence the replication rate is bounded below by 1

Page 23: DFA minimization algorithms in map reduce

23

Moore-MR-PPHF

• The bound depends on the number of Map-Reduce rounds (formula lost in extraction)

Page 24: DFA minimization algorithms in map reduce

24

Hopcroft-MR

Page 25: DFA minimization algorithms in map reduce

25

Hopcroft-MR-PAR

Page 26: DFA minimization algorithms in map reduce

26

Comparison of Complexity Measures

Algorithm                 | Replication Rate | Communication Cost | Sensitive to Skewness
Lower Bound               | 1                | –                  |
Moore-MR (Harrafi 2015)   |                  |                    | No
Moore-MR-PPHF             |                  |                    | No
Hopcroft-MR               |                  |                    | Yes
Hopcroft-MR-PAR           |                  |                    | Yes

Page 27: DFA minimization algorithms in map reduce

27

EXPERIMENTAL RESULTS

Plotting the results gathered from running the proposed algorithms on different data sets

Page 28: DFA minimization algorithms in map reduce

28

Data Generator – Circular
[Figures: input DFA and minimized DFA]

Page 29: DFA minimization algorithms in map reduce

29

Data Generator – Duplicated Random
[Figures: input DFA and minimized DFA]

Page 30: DFA minimization algorithms in map reduce

30

Data Generator – Linear

Page 31: DFA minimization algorithms in map reduce

31

Moore-MR vs. Moore-MR-PPHF

Page 32: DFA minimization algorithms in map reduce

32

Circular DFA

Page 33: DFA minimization algorithms in map reduce

33

Replicated Random DFA

Page 34: DFA minimization algorithms in map reduce

34

Number of Rounds

Page 35: DFA minimization algorithms in map reduce

35

CONCLUSION

Concluding the work done in this thesis and suggesting future work and open questions

Page 36: DFA minimization algorithms in map reduce

36

Conclusion
• In this work we studied DFA minimization algorithms in Map-Reduce and PRAM
• Proposed an enhancement to a DFA minimization algorithm in Map-Reduce by introducing a PPHF in Map-Reduce
• Proposed a new Map-Reduce algorithm based on Hopcroft's method
• Found a lower bound on the replication rate in Map-Reduce and on the communication cost in parallel environments for the DFA minimization problem
• Studied different measures of Map-Reduce algorithms
• Found that two critical measures are missing: sensitivity to skewness and horizontal growth of data

Page 37: DFA minimization algorithms in map reduce

37

Future Work
• Reducer capacity vs. number of rounds trade-off
• Investigating other minimization methods
• Extending the complexity model and class
• Is it possible to compare Map-Reduce algorithms with algorithms in different models (PRAM, serial, etc.)?

Page 38: DFA minimization algorithms in map reduce

38

Thank you

Questions & Answers