
Voice Based Information Retrieval System

How far is it from a text-based retrieval system?

PRAJNA BHANDARY
CMSC 676

MOTIVATION

● Ever-increasing Internet bandwidth, ever-decreasing storage costs and the fast development of multimedia technologies have paved the way for more and more multimedia network content.

● The main motivation for many researchers in this area is to help visually challenged individuals get information using a speech recognition device.

INTRODUCTION

There are 3 different tasks of the voice-based retrieval system:

● Using text queries to retrieve spoken documents
○ Referred to as Spoken Document Retrieval
○ Found that the queries need to be long in order for retrieval to be more efficient

● Using spoken queries to retrieve text documents
○ Voice Search
○ The information to be retrieved is usually an existing text database, such as those in directory assistance applications, with lexical variations and so on but primarily without recognition uncertainty.

● Using spoken queries to retrieve spoken documents
○ In this case the speech recognition uncertainty exists on both sides, the queries and the documents, and therefore this is naturally a more difficult task.

COMPARISON

Text-Based vs. Voice-Based

Resources
● Text-based: rich resources, with huge quantities of text documents available over the Internet. The quantity continues to increase exponentially due to convenient access.
● Voice-based: spoken/multimedia content is the new trend. It can be realized even sooner given mature technologies.

Accuracy
● Text-based: retrieved documents are properly ranked and filtered, and retrieval accuracy is acceptable to users.
● Voice-based: problems with speech recognition errors, especially for spontaneous speech under adverse environments.

User-System Interaction
● Text-based: retrieved documents are easily summarised on-screen and thus easily scanned and selected by the user. The user may easily select query terms suggested for next-iteration retrieval in an interactive process.
● Voice-based: spoken/multimedia documents are not easily summarised on-screen and are thus difficult to scan and select. Lacks efficient user-system interaction.

RETRIEVAL ACCURACY

● Lattice-based Approaches
○ Position Specific Posterior Lattices (PSPL)
○ Confusion Networks (CN)
○ Time-based Merging for Indexing (TMI)
○ Time-anchored Lattice Expansion (TALE)

● Position Specific Posterior Lattices (PSPL)
○ Locating a word in a segment according to the position (or sequence ordering) of the word in a path, stored as a tuple (W, d, pos, prob).

● Confusion Networks (CN)
○ Clustering several words in a segment according to similar time spans and word pronunciation.
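As a rough illustration of the PSPL tuple (W, d, pos, prob), the sketch below builds such tuples from a handful of weighted recognition paths. The function name `build_pspl`, the input format (a list of word sequences with path posteriors) and the example hypotheses are assumptions made for illustration; a real system would derive the posteriors from a full recognition lattice rather than from an explicit path list.

```python
from collections import defaultdict


def build_pspl(segment_id, paths):
    """Build PSPL-style tuples (word, segment, position, posterior).

    `paths` is a list of (word_sequence, path_posterior) pairs whose
    posteriors are assumed to sum to 1 (hypothetical input format).
    """
    posterior = defaultdict(float)
    for words, prob in paths:
        # A word's posterior at a position is the total probability
        # of all paths that place the word at that position.
        for pos, word in enumerate(words):
            posterior[(word, pos)] += prob
    return sorted(
        (word, segment_id, pos, prob)
        for (word, pos), prob in posterior.items()
    )


# Hypothetical recognition hypotheses for one spoken segment "d1".
example_paths = [
    (["voice", "search"], 0.6),
    (["boys", "search"], 0.3),
    (["voice", "church"], 0.1),
]
example_index = build_pspl("d1", example_paths)
```

Each word ends up with one entry per position, so competing hypotheses such as "voice" and "boys" at position 0 are both kept with their posteriors instead of only the single best transcript.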

● Relevance ranking
○ Relevance scores are computed between the segments and a query Q, which is a sequence of words {W_j, j = 1, 2, ..., Q}.
○ First, the expected tapered count S(d, W_i...W_{i+N−1}) is calculated for each N-gram {W_i...W_{i+N−1}} of the query within a spoken segment d. The results are then aggregated to produce a score S_N-gram(d, Q) for each order N.
○ Here L is the lattice obtained from d, and k is the cluster number in the PSPL or CN structure.
○ The different proximity types, one for each N-gram order allowed by the query length Q, are finally combined by a weighted sum to give the final relevance score S(d, Q).
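The scoring scheme described here can be sketched as follows. This is only an illustration under stated assumptions: the tapering function log(1 + x), the summation over cluster positions k, and the weights are choices made for this sketch and are not taken from the text; the names `ngram_score` and `relevance` are hypothetical.

```python
import math


def ngram_score(posteriors, ngram):
    """Tapered expected count of one query N-gram in a segment.

    `posteriors` maps (word, k) to the posterior of `word` in cluster k.
    The log(1 + x) tapering is an assumption of this sketch.
    """
    clusters = {k for (_, k) in posteriors}
    expected = 0.0
    for k in clusters:
        prob = 1.0
        # The N-gram must occupy consecutive clusters k, k+1, ...
        for offset, word in enumerate(ngram):
            prob *= posteriors.get((word, k + offset), 0.0)
        expected += prob
    return math.log(1.0 + expected)


def relevance(posteriors, query, weights):
    """Weighted sum over N-gram orders 1..len(query)."""
    total = 0.0
    for n in range(1, len(query) + 1):
        s_n = sum(
            ngram_score(posteriors, query[i:i + n])
            for i in range(len(query) - n + 1)
        )
        total += weights[n - 1] * s_n
    return total


# Hypothetical posteriors for one segment and a two-word query.
example_posteriors = {
    ("voice", 0): 0.7, ("boys", 0): 0.3,
    ("search", 1): 0.9, ("church", 1): 0.1,
}
example_score = relevance(example_posteriors, ["voice", "search"], [1.0, 2.0])
```

Giving higher weights to longer N-grams rewards segments in which the query words appear next to each other, which is the proximity effect the weighted sum is meant to capture.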

USER-SYSTEM INTERACTION

● Multi-modal dialogue: for a query given by the user, the retrieval system produces a topic hierarchy constructed from the retrieved spoken documents to be shown on the screen.

● Semantic analysis of spoken documents

● Automatic generation of summaries and titles for spoken documents

● Query-based local semantic structuring of spoken documents

● Semantic structuring of spoken documents

● Interactive retrieval in a dialogue loop

● Key term extraction from spoken documents, based on latent topic significance

PROPOSED MODEL

[Figure: flowchart of the proposed model. Voice → Voice to text → Keyword → Pattern matching, BoW (bag of words) → "If match with DB" decision (yes/no) → Voice-based reply / Voice reply]

This is a three-step process:

1. Speech to text
2. Pattern matching
3. Text to speech

VOICE TO TEXT

● Fuzzy logic can be used to match speech in different accents, e.g. the word "Vector" has different pronunciations.

● Thus a single word can be represented by a fuzzy set.

● Since this is too specific to fit in a generic model of speech recognition, we can have a more general model based on the fuzzification of phonemes.

● This model is applied to spoken sentences. One fuzzy set is based on accents, the second on speeds of pronunciation and the third on emphasis.
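The idea of representing a single word by a fuzzy set of pronunciations can be sketched as below. Everything concrete here is hypothetical: the phoneme strings, the membership degrees, the threshold and the function name `recognise` are invented for illustration and do not come from any particular recogniser.

```python
# Hypothetical fuzzy sets: each word maps pronunciation variants
# (as phoneme strings) to a membership degree in [0, 1].
FUZZY_WORDS = {
    "vector": {"V EH K T ER": 1.0, "V EH K T AH": 0.8, "W EH K T AR": 0.6},
    "victor": {"V IH K T ER": 1.0, "V EH K T ER": 0.3},
}


def recognise(pronunciation, fuzzy_words=FUZZY_WORDS, threshold=0.5):
    """Return (word, degree) for the best-matching fuzzy set.

    The word is None when no membership degree reaches `threshold`.
    """
    best_word, best_degree = None, 0.0
    for word, fuzzy_set in fuzzy_words.items():
        degree = fuzzy_set.get(pronunciation, 0.0)
        if degree > best_degree:
            best_word, best_degree = word, degree
    if best_degree < threshold:
        return None, best_degree
    return best_word, best_degree
```

An accented variant such as "W EH K T AR" still maps to "vector", only with a lower degree of membership than the canonical pronunciation.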

BAG-of-WORDS

● A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:
○ A vocabulary of known words.
○ A measure of the presence of known words.

● The steps followed:
○ Collect data
○ Create vocabulary
○ Create document vector
○ Managing vocabulary
○ Scoring words
○ Word hashing
○ TF-IDF
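A minimal sketch of some of these steps (vocabulary creation, document vectors and TF-IDF scoring) follows. The tokenizer, the function names and the particular TF-IDF variant (term frequency normalised by document length, IDF as log(N / df)) are assumptions made for this example; vocabulary management and word hashing are left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(documents):
    """Vocabulary of known words: every distinct token, sorted."""
    return sorted({token for doc in documents for token in tokenize(doc)})


def count_vector(document, vocabulary):
    """Document vector: occurrence count of each vocabulary word."""
    counts = Counter(tokenize(document))
    return [counts[term] for term in vocabulary]


def tf_idf_vector(document, documents, vocabulary):
    """Score words by term frequency times inverse document frequency."""
    counts = Counter(tokenize(document))
    total = sum(counts.values()) or 1
    token_sets = [set(tokenize(doc)) for doc in documents]
    vector = []
    for term in vocabulary:
        df = sum(1 for tokens in token_sets if term in tokens)
        idf = math.log(len(documents) / df) if df else 0.0
        vector.append((counts[term] / total) * idf)
    return vector


# Hypothetical two-document collection.
docs = [
    "voice search retrieves text",
    "voice queries retrieve spoken documents",
]
vocab = build_vocabulary(docs)
```

A word such as "voice" that occurs in every document gets a TF-IDF score of zero, while a word found in only one document is weighted up.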

PATTERN MATCHING

● The Boyer-Moore (BM) algorithm can be used. It positions the pattern over the leftmost characters in the text and attempts to match it from right to left. If no mismatch occurs, the pattern has been found.

● Otherwise, the algorithm computes a shift: the amount by which the pattern is moved to the right before a new matching attempt is undertaken.

● The shift is computed using two heuristics:
○ Match heuristic: when the pattern is moved to the right, it must
i. match all characters previously matched, and
ii. bring a different character to the position in the text that caused the mismatch.
○ Occurrence heuristic: for every symbol x of the alphabet, with m the length of the pattern,

$d[x] = \min\{\, s \mid s = m \ \text{or}\ (0 \le s < m \ \text{and}\ \mathit{pattern}[m - s] = x) \,\}$
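The right-to-left matching and the occurrence heuristic of Boyer-Moore can be sketched as below. This is a simplified variant, not the full algorithm: it uses only the occurrence table d[x] and falls back to a shift of one position where the match heuristic would normally apply. The function names are hypothetical, and indices are 0-based, so pattern[m - s] in the 1-based formula becomes pattern[m - 1 - s].

```python
def occurrence_table(pattern):
    """d[x]: distance of the rightmost occurrence of x from the pattern end.

    Characters that do not occur in the pattern are absent from the
    table and are treated as d[x] = m by the search function.
    """
    m = len(pattern)
    table = {}
    for s in range(m):
        # Visit characters from the end, keeping the smallest s per symbol.
        table.setdefault(pattern[m - 1 - s], s)
    return table


def bm_search(text, pattern):
    """Return the index of the first occurrence of pattern, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    d = occurrence_table(pattern)
    end = m - 1  # text index aligned with the last pattern character
    while end < n:
        i, j = end, m - 1
        # Compare from right to left.
        while j >= 0 and text[i] == pattern[j]:
            i -= 1
            j -= 1
        if j < 0:
            return i + 1  # no mismatch: pattern found
        # Align the mismatching text character with its rightmost
        # occurrence in the pattern, moving at least one position.
        end = i + max(d.get(text[i], m), m - j)
    return -1
```

For example, `bm_search("here is a simple example", "example")` returns 17; a text character that does not occur in the pattern lets the pattern jump past it entirely.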

TEXT TO VOICE

● After getting the text, the system must analyse it and then transform it into a phonetic description.

● NLP module:
○ Digital Signal Processing (DSP) module: it transforms the symbolic information received into audible speech.
○ Text analysis: first the text is segmented into tokens. The token-to-word conversion creates the orthographic form of the token; for example, "Mr" becomes "mister" and numbers like 2 are transformed to "two".
○ Application of pronunciation rules: after the text analysis is completed, pronunciation rules can be applied. Letters may be silent in a word (h in caught) or correspond to several phonemes (m in maximum).
■ Dictionary-based solution: a dictionary can be used where all forms of possible words are stored.
■ Rule-based solution: rules are generated from the phonological knowledge of dictionaries. Only words with some exception in pronunciation are included.
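The text-analysis step (tokenisation followed by token-to-word conversion) can be sketched as follows. The abbreviation table, the restriction to numbers below 100 and the function names are assumptions made for illustration; only the "Mr" and "2" examples come from the description above.

```python
import re

# Hypothetical abbreviation table; only "Mr" comes from the text.
ABBREVIATIONS = {"Mr": "mister", "Dr": "doctor"}

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven",
    "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
    "fifteen", "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the range 0..99."""
    if not 0 <= n < 100:
        raise ValueError("this sketch only handles 0..99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else TENS[tens] + " " + ONES[ones]


def normalise(text):
    """Segment text into tokens and convert each token to word form."""
    words = []
    for token in re.findall(r"[A-Za-z]+|[0-9]+", text):
        if token.isdigit():
            n = int(token)
            # Larger numbers are left unchanged in this sketch.
            words.append(number_to_words(n) if n < 100 else token)
        else:
            words.append(ABBREVIATIONS.get(token, token.lower()))
    return words
```

With these tables, `normalise("Mr Smith has 2 cats")` yields the word list mister, smith, has, two, cats, which is the form a later pronunciation stage would work from.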

CONCLUSION & FUTURE SCOPE

It can be concluded that this approach is efficient in terms of reduced computational complexity and reduced time.

● Research is being done to make the whole process telephonic.

● Limitations of Bag-of-Words
○ Vocabulary
○ Sparsity
○ Meaning
