INCREMENTAL MAINTENANCE OF LENGTH NORMALIZED INDEXES FOR APPROXIMATE STRING MATCHING
- Ashwin Joshi
PROBLEM
Consider a real system
- Tens of millions of strings
- Updated on an hourly basis
- Practical scenario
  1. Updates buffered
  2. Index rebuilt weekly
- Re-computation time = a few hours
- Limitations of online systems
INTRODUCTION
- Inverse Document Frequency (idf)
- Partial Score Contribution
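The two quantities named above can be sketched as follows. The definitions here are assumptions, since the slide gives only the names: idf(t) = log2(1 + N / df(t)), where N is the number of strings and df(t) the number of strings containing token t, and the partial score contribution of t to a string s is taken to be idf(t) / L2(s). The helper names `idf_table` and `partial_score` and the sample strings are hypothetical.

```python
import math
from collections import Counter


def idf_table(strings):
    """Map each token to log2(1 + N / df) (assumed idf definition)."""
    n = len(strings)
    # df counts each token at most once per string.
    df = Counter(tok for s in strings for tok in set(s))
    return {tok: math.log2(1 + n / d) for tok, d in df.items()}


def partial_score(token, s, idf):
    """Assumed contribution of one token to string s: idf(t) / L2(s)."""
    l2 = math.sqrt(sum(idf[t] ** 2 for t in set(s)))
    return idf[token] / l2


# Hypothetical toy collection of tokenized strings.
strings = [["main", "st"], ["main", "ave"], ["park", "ave"], ["main", "rd"]]
idf = idf_table(strings)
```

Under these assumptions a rare token such as "st" gets a higher idf than a frequent one such as "main", and the squared partial scores of any one string sum to 1.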
LENGTH NORMALIZATION
Types: L0, L1 and L2. Why is L2 preferred?
Similarity: S(q, s) = (sum of idf(t)^2 over tokens t shared by q and s) / (L(q) · L(s))
e.g. Query q = {t1, t2, t3}, String s1 = {t1}, String s2 = {t1, t2, t3},
and idf(t1) = 10, idf(t2) = 8, idf(t3) = 2.
For L0, S0(q, s1) = 100/3 > S0(q, s2) = 168/9
For L1, S1(q, s1) = 100/200 > S1(q, s2) = 168/400
For L2, S2(q, s1) = 100/(10 · √168) ≈ 0.77 < S2(q, s2) = 168/168 = 1
Only L2 ranks the exact match s2 above the partial match s1.
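The example can be checked with a short script. This is a minimal sketch, assuming L0 is the token count, L1 the sum of idf values, and L2 the square root of the sum of squared idf values, which is what the example's denominators imply. The function names are hypothetical.

```python
import math

# idf values taken from the example.
idf = {"t1": 10.0, "t2": 8.0, "t3": 2.0}


def length(tokens, norm):
    """Length of a token set under the L0, L1 or L2 norm."""
    if norm == 0:
        return float(len(tokens))
    if norm == 1:
        return sum(idf[t] for t in tokens)
    if norm == 2:
        return math.sqrt(sum(idf[t] ** 2 for t in tokens))
    raise ValueError("norm must be 0, 1 or 2")


def similarity(q, s, norm):
    """Squared idf of the shared tokens, divided by the product of lengths."""
    shared = sum(idf[t] ** 2 for t in q & s)
    return shared / (length(q, norm) * length(s, norm))


q = {"t1", "t2", "t3"}
s1 = {"t1"}
s2 = {"t1", "t2", "t3"}

for norm in (0, 1, 2):
    print(f"L{norm}: S(q, s1) = {similarity(q, s1, norm):.3f}, "
          f"S(q, s2) = {similarity(q, s2, norm):.3f}")
```

With L0 and L1 the single-token string s1 scores higher than the exact match s2. With L2 the exact match scores 1 and s1 scores about 0.77.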
APPROXIMATE STRING MATCHING
Theorem: Length Boundedness
Determine strings that are either too short or too long to match the query
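A length filter of this kind can be sketched as below. The bound used is an assumption derived from the L2-normalized similarity, not a quotation of the slide's theorem: because the shared squared idf weight can be at most min(L2(q)^2, L2(s)^2), a string s can reach S2(q, s) >= tau only if tau · L2(q) <= L2(s) <= L2(q) / tau. The threshold name `tau`, the function names, and the sample data are hypothetical.

```python
import math


def l2_length(tokens, idf):
    """L2 length of a token list: sqrt of the sum of squared idf values."""
    return math.sqrt(sum(idf[t] ** 2 for t in set(tokens)))


def length_bounds(query, idf, tau):
    """Shortest and longest L2 length a string may have to reach threshold tau."""
    lq = l2_length(query, idf)
    return tau * lq, lq / tau


def candidates(query, strings, idf, tau):
    """Keep only the strings whose length lies inside the bounds."""
    lo, hi = length_bounds(query, idf, tau)
    return [s for s in strings if lo <= l2_length(s, idf) <= hi]


idf = {"t1": 10.0, "t2": 8.0, "t3": 2.0}
query = ["t1", "t2", "t3"]
strings = [["t3"], ["t1"], ["t1", "t2"], ["t1", "t2", "t3"]]
kept = candidates(query, strings, idf, tau=0.8)
```

Here `["t3"]` and `["t1"]` are too short to reach a threshold of 0.8 and are pruned without computing their similarity. The two longer strings remain as candidates.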
MAINTENANCE OPERATIONS
Propagating Updates
1. Insert
2. Delete
3. Modify: effectively a ‘Delete’ followed by an ‘Insert’
INSERT
Insert S7
- Generate new tokens
- Add new strings
- N changes -> idf changes -> L changes
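The chain "N changes -> idf changes -> L changes" can be illustrated with a small sketch. It assumes idf(t) = log2(1 + N / df(t)) and L2 lengths; the slide does not state either formula. The strings and the function name `idf_and_lengths` are hypothetical.

```python
import math
from collections import Counter


def idf_and_lengths(strings):
    """Recompute every idf and every L2 length from scratch."""
    n = len(strings)
    df = Counter(t for s in strings for t in set(s))
    idf = {t: math.log2(1 + n / d) for t, d in df.items()}
    lengths = [math.sqrt(sum(idf[t] ** 2 for t in set(s))) for s in strings]
    return idf, lengths


before = [["main", "st"], ["main", "ave"], ["park", "ave"]]
# Hypothetical insert that shares no token with the existing strings.
after = before + [["elm", "rd"]]

idf_before, len_before = idf_and_lengths(before)
idf_after, len_after = idf_and_lengths(after)

# Count the existing strings whose length changed because of the insert.
changed_lengths = sum(
    1 for a, b in zip(len_before, len_after) if not math.isclose(a, b)
)
```

Even though the new string shares no token with the old ones, N grows from 3 to 4, so every existing idf and all three existing lengths change. This is the cost that exact propagation would pay on each insert.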
RELAXED PROPAGATION
Relaxation of N
- What is Nb?
- Divergence between N & Nb
Relaxation of df
- Definition of dfp(ti)
- Range of dfp(ti)
Relaxed similarity S2~
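One way to picture relaxing df is the sketch below. It is an illustrative policy, not the definition from the slides: it assumes dfp(ti) is the df value last used to compute idf, and that a change is propagated only when the true df leaves the multiplicative band [dfp / rho, dfp · rho]. The class name `RelaxedDf`, the band, and the starting values are all hypothetical.

```python
class RelaxedDf:
    """Tracks the true df and the df value last used to compute idf (dfp)."""

    def __init__(self, df, rho):
        self.df = df      # true document frequency
        self.dfp = df     # value the index currently reflects
        self.rho = rho    # assumed relaxation factor, rho > 1

    def update(self, delta):
        """Apply a df change; return True only if it must be propagated."""
        self.df += delta
        if self.dfp / self.rho <= self.df <= self.dfp * self.rho:
            return False          # still inside the band: index untouched
        self.dfp = self.df        # outside the band: refresh and propagate
        return True


tracker = RelaxedDf(df=105, rho=1.1)
propagated = [tracker.update(+1) for _ in range(12)]
```

With these values the first ten insertions stay inside the band and cost nothing, the eleventh triggers one propagation, and the twelfth is absorbed again. The trade-off is that scores computed between propagations use slightly stale idf values.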
LOSS IN PRECISION
Assume total possible divergence in idf
Relaxed similarity,
For ρ = 1.1 and query threshold,
Equation 1:
Equation 2:
UPDATE PROPAGATION ALGORITHM
EXPERIMENT (DBLP)
- Period = 30 days
- 2460433 author/id pairs
- 5712041 total words
- 269281 distinct words
- 33461 total updates
- 32121 insertions, 1340 deletions
EXPERIMENT (BUSINESS LISTING)
QUERY ACCURACY
THANK YOU.