Work @ Fudan University
![Page 1: Work @ Fudan University](https://reader034.vdocuments.net/reader034/viewer/2022042509/56816620550346895dd97456/html5/thumbnails/1.jpg)
1
Work @ Fudan University
Chen, Yaoliang
![Page 2: Work @ Fudan University](https://reader034.vdocuments.net/reader034/viewer/2022042509/56816620550346895dd97456/html5/thumbnails/2.jpg)
2
ENGINEERING WORK
• TTS System
▫ A Chinese Text-To-Speech system
• SafeDB
▫ Bug backlog
• SMemoHelper
▫ A small tool that helps learn English words
• Fraud Detecting
▫ Time series tech
![Page 3: Work @ Fudan University](https://reader034.vdocuments.net/reader034/viewer/2022042509/56816620550346895dd97456/html5/thumbnails/3.jpg)
3
RESEARCH WORK
• CGAP-align: A high-performance DNA short-read alignment tool
▫ Coauthored with BCM. Bioinformatics, in progress
▫ NDBC demo
• On Encoding Shortest Paths in Large Graphs
▫ Coauthored with Jian Pei. VLDB, in progress
▫ Coauthored with Haixun Wang. SIGMOD, in progress
▫ NDBC
• Other projects
![Page 4: Work @ Fudan University](https://reader034.vdocuments.net/reader034/viewer/2022042509/56816620550346895dd97456/html5/thumbnails/4.jpg)
4
CGAP-ALIGN: BACKGROUND
• Baylor College of Medicine
• Sequence alignment and its significance
▫ Reference & Reads
ACTAGCGATATAACCCTTTCCCTTTCCCTTT CACGAT
• Given a number z, a reference X, and a read W, we want to find a substring W' = X[i, i+1, …, j] such that EditDistance(W, W') ≤ z.
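The problem statement above can be illustrated with a minimal sketch. It uses the classic dynamic-programming formulation for approximate substring matching (free start position in the reference), which is not the method of the slides, only a baseline that makes the definition concrete. The function name `approximate_matches` and the return format are assumptions made for this illustration.

```python
def approximate_matches(x: str, w: str, z: int) -> list[tuple[int, int]]:
    """Return (end, distance) pairs: some substring of x ending at index
    `end` (exclusive) has edit distance `distance` <= z to the read w."""
    m = len(w)
    # prev[i] = edit distance between w[:i] and the empty string.
    prev = list(range(m + 1))
    hits = []
    for j, xc in enumerate(x, start=1):
        # cur[0] = 0: a match may start anywhere in the reference for free.
        cur = [0] * (m + 1)
        for i in range(1, m + 1):
            cost = 0 if w[i - 1] == xc else 1
            cur[i] = min(prev[i - 1] + cost,  # match or substitution
                         prev[i] + 1,         # extra reference character
                         cur[i - 1] + 1)      # extra read character
        if cur[m] <= z:
            hits.append((j, cur[m]))
        prev = cur
    return hits


reference = "AACGTATCGACG"
print(approximate_matches(reference, "TATC", 0))
```

This runs in O(|X| · |W|) time, which is far too slow for a whole genome and is the reason index-based tools such as BWA are used instead.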
![Page 5: Work @ Fudan University](https://reader034.vdocuments.net/reader034/viewer/2022042509/56816620550346895dd97456/html5/thumbnails/5.jpg)
5
CHALLENGES
[Figure: DNA sequences in GenBank]
• A human genome sequence
▫ 2000: € 1,000,000,000 in ~10 years
▫ 2008: € 50 - 100,000 in ~4 months
▫ 2010: € 5 - 10,000 in ~2 weeks
▫ ... 2015: € 1,000 in ~1 day
▫ ... 2020: € 10 in ~1 hour to minutes
![Page 6: Work @ Fudan University](https://reader034.vdocuments.net/reader034/viewer/2022042509/56816620550346895dd97456/html5/thumbnails/6.jpg)
6
PERFORMANCE OF BWA
• Burrows-Wheeler Alignment Tool (BWA)
▫ A popular tool for aligning gene fragments against a large reference sequence
• Optimization of BWA
▫ Code level
▫ Algorithm level
• BWA performance: T = N × Taln
▫ N: the number of modified reads produced by enumerating all mismatches and gaps of the read
▫ Taln: time to locate the modified reads in the reference during the alignment stage
![Page 7: Work @ Fudan University](https://reader034.vdocuments.net/reader034/viewer/2022042509/56816620550346895dd97456/html5/thumbnails/7.jpg)
7
OPTIMIZATION
• Optimizing Taln: efficiency for matching
▫ Suffix T-array
• Optimizing N: pruning ability to avoid enumerating unnecessary mismatches and gaps
▫ Data-Conscious D-Array Calculating
![Page 8: Work @ Fudan University](https://reader034.vdocuments.net/reader034/viewer/2022042509/56816620550346895dd97456/html5/thumbnails/8.jpg)
8
SUFFIX T-ARRAY
• Suffix Tree
• Suffix Array Based on BWT (FM-index)
• Comparison
[Figure: a suffix tree (Root with children A, C, G, T, and a leaf with b = 2) next to the FM-index, with the ranges R(AA), R(TT), R(TC) marked; Ref = ATCTTCAAGA, Read = TAA]
![Page 9: Work @ Fudan University](https://reader034.vdocuments.net/reader034/viewer/2022042509/56816620550346895dd97456/html5/thumbnails/9.jpg)
9
BURROWS-WHEELER TRANSFORM
All rotations of T = mississippi#:
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows (F is the first column, L is the last column):
F            L
# mississipp i
i #mississip p
i ppi#missis s
i ssippi#mis s
i ssissippi# m
m ississippi #
p i#mississi p
p pi#mississ i
s ippi#missi s
s issippi#mi s
s sippi#miss i
s sissippi#m i
From Yuval Rikover
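The construction on this slide (list all rotations, sort them, read off the last column) can be sketched directly. This is a naive illustration that materializes every rotation; practical tools build the BWT from a suffix array instead. The function name `bwt` is an assumption made for this sketch, and it relies on `#` sorting before the lowercase letters, which holds for Python's default string ordering.

```python
def bwt(text: str) -> tuple[str, str]:
    """Return (F, L): the first and last columns of the sorted rotation matrix.

    Assumes text ends with a unique sentinel '#' that sorts before
    every other character.
    """
    if not text.endswith("#") or text.count("#") != 1:
        raise ValueError("text must end with a single '#' sentinel")
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    first = "".join(row[0] for row in rotations)
    last = "".join(row[-1] for row in rotations)
    return first, last


f_col, l_col = bwt("mississippi#")
print(f_col)  # #iiiimppssss
print(l_col)  # ipssm#pissii
```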
![Page 10: Work @ Fudan University](https://reader034.vdocuments.net/reader034/viewer/2022042509/56816620550346895dd97456/html5/thumbnails/10.jpg)
10
BURROWS-WHEELER TRANSFORM
Reminder: Recovering T from L
1. Find F by sorting L
2. First char of T? m
3. Find m in L
4. L[i] precedes F[i] in T. Therefore we get mi
5. How do we choose the correct i in L?
▫ The i's are in the same order in L and F
▫ As are the rest of the chars
6. i is followed by s: mis
7. And so on…
F = #iiiimppssss
L = ipssm#pissii
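The recovery procedure relies on one property: the k-th occurrence of a character in L and the k-th occurrence of the same character in F are the same text position. A minimal sketch of that idea follows. Note that it walks the text backward from the sentinel (the usual LF-mapping formulation) rather than forward as in the numbered steps, and the function name `inverse_bwt` is an assumption made for this illustration.

```python
from collections import Counter


def inverse_bwt(last: str, sentinel: str = "#") -> str:
    """Rebuild T from its BWT column L, assuming T ends with a unique
    sentinel that sorts before every other character."""
    counts = Counter(last)
    # first_row[c] = index of the first row of F that starts with c.
    first_row, total = {}, 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]
    # rank[i] = how many times last[i] occurs in last[:i].
    seen, rank = Counter(), []
    for ch in last:
        rank.append(seen[ch])
        seen[ch] += 1
    # Row 0 starts with the sentinel, so last[0] is the final real character.
    row, reversed_text = 0, []
    for _ in range(len(last) - 1):
        ch = last[row]
        reversed_text.append(ch)
        row = first_row[ch] + rank[row]  # LF mapping
    return "".join(reversed(reversed_text)) + sentinel


print(inverse_bwt("ipssm#pissii"))  # mississippi#
```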
![Page 11: Work @ Fudan University](https://reader034.vdocuments.net/reader034/viewer/2022042509/56816620550346895dd97456/html5/thumbnails/11.jpg)
11
NEXT: COUNT P IN T
• Backward-search algorithm
• Uses only L (output of BWT)
• Relies on 2 structures:
▫ C[1,…,|Σ|]: C[c] contains the total number of text chars in T which are alphabetically smaller than c (including repetitions of chars)
▫ Occ(c,q): number of occurrences of char c in prefix L[1,q] (a rank query)
Example
• C[ ] for T = mississippi#: C[i] = 1, C[m] = 5, C[p] = 6, C[s] = 8
• occ(s, 5) = 2
• occ(s, 12) = 4
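The two structures can be sketched in a few lines. This version recomputes Occ by scanning L on every call, which is only meant to make the definitions concrete; a real FM-index answers Occ in constant time from sampled counts. The names `build_c` and `occ` are assumptions made for this sketch, and `occ` follows the 1-based prefix convention L[1,q] used on the slide.

```python
from collections import Counter


def build_c(text: str) -> dict[str, int]:
    """C[c] = number of characters in text that are strictly smaller than c."""
    counts = Counter(text)
    table, total = {}, 0
    for ch in sorted(counts):
        table[ch] = total
        total += counts[ch]
    return table


def occ(last: str, ch: str, q: int) -> int:
    """Number of occurrences of ch in L[1..q] (1-based, inclusive)."""
    return last[:q].count(ch)


L = "ipssm#pissii"  # BWT of mississippi#
print(build_c("mississippi#"))
print(occ(L, "s", 5), occ(L, "s", 12))
```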
![Page 12: Work @ Fudan University](https://reader034.vdocuments.net/reader034/viewer/2022042509/56816620550346895dd97456/html5/thumbnails/12.jpg)
12
SUBSTRING SEARCH IN T (COUNT THE PATTERN OCCURRENCES)
• T = mississippi#, L = ipssm#pissii, P = si
• Available info: L and C (# 1, i 2, m 7, p 8, s 10)
• First step: fr and lr delimit the rows prefixed by char "i"
• Inductive step: Given fr, lr for P[j+1,p], take c = P[j]
▫ Find the first c in L[fr, lr]
▫ Find the last c in L[fr, lr]
• occ = 2 [lr - fr + 1]
• Occ() oracle is enough
[Figure: sorted rotation matrix of mississippi# with the fr and lr pointers; the middle columns are marked "unknown"]
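The first step and the inductive step can be sketched as a single loop over the pattern, read right to left. This is a minimal illustration using 1-based row numbers for fr and lr and the definition of C as the number of smaller characters; the Occ oracle is a plain scan here, which a real FM-index would replace with a constant-time lookup. The function name `backward_search` is an assumption made for this sketch.

```python
from collections import Counter


def backward_search(last: str, pattern: str) -> int:
    """Count occurrences of pattern in T using only L (the BWT of T)."""
    counts = Counter(last)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total  # number of characters smaller than ch
        total += counts[ch]

    def occ(ch: str, q: int) -> int:
        # Occurrences of ch in L[1..q], 1-based and inclusive.
        return last[:q].count(ch)

    fr, lr = 1, len(last)  # start with every row of the matrix
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        fr = c_table[ch] + occ(ch, fr - 1) + 1
        lr = c_table[ch] + occ(ch, lr)
        if fr > lr:
            return 0
    return lr - fr + 1


print(backward_search("ipssm#pissii", "si"))  # 2
```

For P = si the loop first narrows the range to rows 2 to 5 (those starting with i) and then to rows 9 to 10, giving lr - fr + 1 = 2 occurrences.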
![Page 13: Work @ Fudan University](https://reader034.vdocuments.net/reader034/viewer/2022042509/56816620550346895dd97456/html5/thumbnails/13.jpg)
13
SUFFIX T-ARRAY
• Backward search
• Store "First" and "Last" (k and l) values
![Page 14: Work @ Fudan University](https://reader034.vdocuments.net/reader034/viewer/2022042509/56816620550346895dd97456/html5/thumbnails/14.jpg)
14
BACKWARD-SEARCH EXAMPLE
• P = CAA
▫ i = 3, c = 'A'
▫ i = 2, c = 'A': First = First(AA), Last = Last(AA)
▫ i = 1, c = 'C': First = C['T'] + Occ('C', First(AA)) + 1, Last = C['T'] + Occ('C', Last(AA))
[Figure: FM-index lookup following the path Root → A → A]
Root
![Page 15: Work @ Fudan University](https://reader034.vdocuments.net/reader034/viewer/2022042509/56816620550346895dd97456/html5/thumbnails/15.jpg)
15
OPTIMIZATION
• Optimizing Taln: efficiency for matching
▫ Suffix T-array
• Optimizing N: pruning ability to avoid enumerating unnecessary mismatches and gaps
▫ Data-Conscious D-Array Calculating
![Page 16: Work @ Fudan University](https://reader034.vdocuments.net/reader034/viewer/2022042509/56816620550346895dd97456/html5/thumbnails/16.jpg)
16
D-ARRAY: MOTIVATION
• e(W)
▫ The minimal number of edit operations needed to make W align exactly onto the reference X
• D-array
▫ D[i]: lower bound of e(W[0…i])
![Page 17: Work @ Fudan University](https://reader034.vdocuments.net/reader034/viewer/2022042509/56816620550346895dd97456/html5/thumbnails/17.jpg)
17
D ARRAY: MOTIVATION
• Given a string W and an arbitrary split of W into substrings W = w1w2…wk, we have $e(W) \ge \sum_{i=1}^{k} e(w_i)$.
• D array in BWA
▫ Split W into several small strings W = w1w2…wk with e(wi) = 1 for all i. The correctness of the algorithm depends on the inequality $e(W) \ge \sum_{i=1}^{k} e(w_i)$.
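The BWA-style segmentation can be sketched as a greedy scan: extend the current segment until it no longer occurs anywhere in the reference, count one unavoidable edit, and start a new segment. The sketch below uses Python's substring test in place of the FM-index lookup that BWA performs, so it illustrates the bound, not the implementation. The function name `calculate_d` is an assumption made for this illustration.

```python
def calculate_d(w: str, x: str) -> list[int]:
    """D[i] = lower bound on the edits needed to align w[0..i] onto x.

    Each time the current segment stops being a substring of x, at least
    one edit is unavoidable, so the bound grows by one and a new segment
    begins at the next character.
    """
    d, edits, start = [], 0, 0
    for i in range(len(w)):
        if w[start:i + 1] not in x:  # stand-in for an FM-index lookup
            edits += 1
            start = i + 1
        d.append(edits)
    return d


print(calculate_d("AGTCAA", "AACGTATCGACG"))  # [0, 1, 1, 1, 2, 2]
```

During the search, a partial alignment that has already spent more edits than the budget minus D allows can be abandoned early, which is what reduces N.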
![Page 18: Work @ Fudan University](https://reader034.vdocuments.net/reader034/viewer/2022042509/56816620550346895dd97456/html5/thumbnails/18.jpg)
18
D ARRAY: MOTIVATION
• Example: Reference X = "AACGTATCGACG"
▫ W
▫ D
• A better segmentation: consider e(·) = 2
▫ W
▫ D
▫ Calculating e(·) costs exponential time
▫ Needs pre-computation
[Figure: the read A G T C A A with its segments and D values under both segmentations]
![Page 19: Work @ Fudan University](https://reader034.vdocuments.net/reader034/viewer/2022042509/56816620550346895dd97456/html5/thumbnails/19.jpg)
19
SOLUTION - FREQUENT PATTERN
Train Reads → Frequent Patterns → Trie DFA
• FASTA file F containing training reads
• Should be similar to the reads in practice
• Data-conscious
• Mining Frequent Patterns (FPs)
• State-of-the-art methods
• Our solution: A simple DFS on FM-index
▫ Count = Last - First + 1
• Generate prefix trie T for the FPs with e(w) = 2.
• Refine T to a DFA GT
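The "simple DFS on FM-index" idea can be sketched as follows: each DFS step prepends one character to the current pattern with a backward-search step, reads the pattern's frequency as Count = Last - First + 1, and prunes the branch when the count drops below a threshold. Everything about the setup here is an assumption made for illustration: the `$` separator appended to each read, the naive rotation-sorting index, the `min_count` and `max_len` parameters, and the function names. The filtering by e(w) described on the slide is not included.

```python
from collections import Counter


def build_index(reads: list[str]) -> tuple[str, dict[str, int]]:
    """Naive BWT column L and C table over the reads, each followed by '$'."""
    text = "".join(read + "$" for read in reads)
    rows = sorted(text[i:] + text[:i] for i in range(len(text)))
    last = "".join(row[-1] for row in rows)
    counts = Counter(text)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    return last, c_table


def mine_frequent_patterns(reads, min_count, max_len, alphabet="ACGT"):
    """Return {pattern: count} for patterns occurring at least min_count times."""
    last, c_table = build_index(reads)
    found = {}

    def occ(ch, q):
        return last[:q].count(ch)  # occurrences of ch in L[1..q]

    def dfs(pattern, first, last_row):
        for ch in alphabet:
            if ch not in c_table:
                continue
            f = c_table[ch] + occ(ch, first - 1) + 1
            l = c_table[ch] + occ(ch, last_row)
            count = l - f + 1
            # A longer pattern is never more frequent, so pruning is safe.
            if count >= min_count:
                extended = ch + pattern
                found[extended] = count
                if len(extended) < max_len:
                    dfs(extended, f, l)

    dfs("", 1, len(last))
    return found


print(mine_frequent_patterns(["ACAC", "ACG"], min_count=2, max_len=2))
```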
![Page 20: Work @ Fudan University](https://reader034.vdocuments.net/reader034/viewer/2022042509/56816620550346895dd97456/html5/thumbnails/20.jpg)
20
TRIE DETERMINISTIC FINITE AUTOMATON
• Why Trie DFA?
▫ When doing alignment online, we need to find all the FPs contained in a read
▫ This operation should be no more expensive than O(|W|)
![Page 21: Work @ Fudan University](https://reader034.vdocuments.net/reader034/viewer/2022042509/56816620550346895dd97456/html5/thumbnails/21.jpg)
21
TRIE DETERMINISTIC FINITE AUTOMATON
Offline Index: Construction
• String set (FP set)
▫ AA
▫ C
▫ G
▫ T
▫ AC
▫ AG
• The prefix trie is done. We start to construct the DFA.
[Figure: prefix trie rooted at R with node 1 for A and leaves LAA, LAC, LAG, LC, LG, LT, next to the DFA with states numbered 1 to 7]
![Page 22: Work @ Fudan University](https://reader034.vdocuments.net/reader034/viewer/2022042509/56816620550346895dd97456/html5/thumbnails/22.jpg)
22
RE-ORDERING
• DFS order: minimizes the average hop between each jump (7% up)
[Figure: the trie states before and after renumbering in DFS order]
![Page 23: Work @ Fudan University](https://reader034.vdocuments.net/reader034/viewer/2022042509/56816620550346895dd97456/html5/thumbnails/23.jpg)
23
TRIE DETERMINISTIC FINITE AUTOMATON
Online Query
• String set (FP set)
▫ AA
▫ AC
▫ AG
▫ C
▫ G
▫ T
• W = "CACAT"
[Figure: the trie DFA from the construction slide; reading W visits R, LC, 1, LAC, 1, LT]
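The offline construction and the online query can be sketched with the standard Aho-Corasick technique, which turns a prefix trie into a DFA by filling in every missing transition from the failure state. Whether the slides use exactly this construction is an assumption; the sketch only shows that such an automaton reports all FPs in a read in one pass. The function names, the state numbering, and the restriction to the alphabet ACGT are choices made for this illustration.

```python
from collections import deque


def build_trie_dfa(patterns, alphabet="ACGT"):
    """Return (goto, out): goto[state][ch] is the next state and
    out[state] lists every pattern that ends when state is reached."""
    goto, out = [{}], [[]]
    for pattern in patterns:  # 1. build the prefix trie
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)

    fail = [0] * len(goto)
    queue = deque()
    for ch in alphabet:  # 2. complete the root
        if ch in goto[0]:
            queue.append(goto[0][ch])
        else:
            goto[0][ch] = 0
    while queue:  # 3. breadth-first completion of every other state
        state = queue.popleft()
        for ch in alphabet:
            if ch in goto[state]:  # real trie edge
                child = goto[state][ch]
                fail[child] = goto[fail[state]][ch]
                out[child] = out[child] + out[fail[child]]
                queue.append(child)
            else:  # missing edge: borrow it from the failure state
                goto[state][ch] = goto[fail[state]][ch]
    return goto, out


def find_patterns(read, goto, out):
    """Return (start, pattern) for every pattern occurrence in read.
    Assumes read only contains characters of the DFA's alphabet."""
    state, hits = 0, []
    for i, ch in enumerate(read):
        state = goto[state][ch]  # exactly one transition per character
        for pattern in out[state]:
            hits.append((i - len(pattern) + 1, pattern))
    return hits


goto, out = build_trie_dfa(["AA", "AC", "AG", "C", "G", "T"])
print(find_patterns("CACAT", goto, out))
# [(0, 'C'), (1, 'AC'), (2, 'C'), (4, 'T')]
```

Because every state has a transition for every character, the query never backtracks, so the scan takes one table lookup per character of W plus the time to report the matches.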
![Page 24: Work @ Fudan University](https://reader034.vdocuments.net/reader034/viewer/2022042509/56816620550346895dd97456/html5/thumbnails/24.jpg)
24
EXPERIMENT
• Optimizing Taln: efficiency for matching
▫ Suffix T-array (20% up)
• Optimizing N: pruning ability to avoid enumerating unnecessary mismatches and gaps
▫ Data-Conscious D-Array Calculating (0-200% up)
![Page 25: Work @ Fudan University](https://reader034.vdocuments.net/reader034/viewer/2022042509/56816620550346895dd97456/html5/thumbnails/25.jpg)
25
ON ENCODING SHORTEST PATHS IN LARGE GRAPHS
• Background
▫ Consider a graph G = (V, E), where V is a set of vertices and E ⊆ V × V is a set of edges.
• FH-Partition
![Page 26: Work @ Fudan University](https://reader034.vdocuments.net/reader034/viewer/2022042509/56816620550346895dd97456/html5/thumbnails/26.jpg)
26
EXAMPLES
7 → 10: FH(7,10) = 9; FH(9,10) = 2; FH(2,10) = 10
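Reading FH(u, v) as the first hop on a shortest path from u to v, the example shows how a full path is recovered by repeated lookups: 7, 9, 2, 10. A minimal sketch of that lookup chain follows. The graph below is hypothetical (the slide's graph is not in the transcript), it is assumed to be undirected and unweighted, and the function names are inventions for this illustration.

```python
from collections import deque


def first_hops_to(adj: dict[int, list[int]], target: int) -> dict[int, int]:
    """For every vertex v that can reach target, return FH(v, target):
    the neighbor of v on one shortest path to target (BFS from target)."""
    first_hop = {}
    visited = {target}
    queue = deque([target])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in visited:
                visited.add(v)
                first_hop[v] = u  # v's shortest path to target starts via u
                queue.append(v)
    return first_hop


def shortest_path(first_hop: dict[int, int], source: int, target: int) -> list[int]:
    """Rebuild the path by following first hops until the target is reached."""
    path = [source]
    while path[-1] != target:
        path.append(first_hop[path[-1]])
    return path


# Hypothetical graph containing the chain 7 - 9 - 2 - 10 from the example.
adj = {7: [9], 9: [7, 2], 2: [9, 10], 10: [2]}
fh = first_hops_to(adj, 10)
print(fh[7], fh[9], fh[2])       # 9 2 10
print(shortest_path(fh, 7, 10))  # [7, 9, 2, 10]
```

Storing one first hop per (source, target) pair takes space quadratic in |V|, which is why the encoding of these tables matters for large graphs.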
![Page 27: Work @ Fudan University](https://reader034.vdocuments.net/reader034/viewer/2022042509/56816620550346895dd97456/html5/thumbnails/27.jpg)
27
PROBLEM STATEMENT
•Numbering Function
![Page 28: Work @ Fudan University](https://reader034.vdocuments.net/reader034/viewer/2022042509/56816620550346895dd97456/html5/thumbnails/28.jpg)
28
MCN IS NP-HARD!!
![Page 29: Work @ Fudan University](https://reader034.vdocuments.net/reader034/viewer/2022042509/56816620550346895dd97456/html5/thumbnails/29.jpg)
29
WORKFLOW
Compute FH-Partitions → Get Numbering Function(s) → Encoding FH-Partitions
• Compute a naïve numbering function
• Store the FH-partitions
• Reduce to TSP
• Region tree
• Multi numbering functions
• Further compression
• Answering queries efficiently
![Page 30: Work @ Fudan University](https://reader034.vdocuments.net/reader034/viewer/2022042509/56816620550346895dd97456/html5/thumbnails/30.jpg)
30
EXPERIMENTS
![Page 31: Work @ Fudan University](https://reader034.vdocuments.net/reader034/viewer/2022042509/56816620550346895dd97456/html5/thumbnails/31.jpg)
31
Thank you!