Voice Based Information Retrieval System
How far is it from text based retrieval system?
PRAJNA BHANDARY
CMSC 676
MOTIVATION
● Ever-increasing Internet bandwidth, ever-decreasing storage costs and the fast development of multimedia technologies have paved the way for more and more multimedia network content.
● The main motivation for many researchers in this area is to help visually challenged individuals get information through a device that uses speech recognition.
INTRODUCTION
There are 3 different tasks of the Voice based Retrieval System:
● Using text queries to retrieve spoken documents
○ Referred to as Spoken Document Retrieval
○ Queries were found to need to be long for retrieval to be more efficient
● Using spoken queries to retrieve text documents
○ Voice Search
○ The information to be retrieved is usually an existing text database, such as those in directory assistance applications, with lexical variations and so on but primarily without recognition uncertainty.
● Using spoken queries to retrieve spoken documents
○ In this case speech recognition uncertainty exists on both the query side and the document side, so this is naturally a more difficult task.
COMPARISON
● Resources
○ Text-based: Rich resources, with huge quantities of text documents available over the Internet. Quantity continues to increase exponentially due to convenient access.
○ Voice-based: Spoken/multimedia content is the new trend. Can be realized even sooner given mature technologies.
● Accuracy
○ Text-based: Retrieval accuracy is acceptable to users, and retrieved documents are properly ranked and filtered.
○ Voice-based: Problems with speech recognition errors, especially for spontaneous speech under adverse environments.
● User-System Interaction
○ Text-based: Retrieved documents are easily summarised on-screen and thus easily scanned and selected by the user. The user may easily select query terms suggested for the next retrieval iteration in an interactive process.
○ Voice-based: Spoken/multimedia documents are not easily summarised on-screen and thus difficult to scan and select. Lacks efficient user-system interaction.
RETRIEVAL ACCURACY
● Lattice-based Approaches
○ Position Specific Posterior Lattices (PSPL)
○ Confusion Networks (CN)
○ Time-based Merging for Indexing (TMI)
○ Time-anchored Lattice Expansion (TALE)
● Position Specific Posterior Lattices (PSPL)
○ Locates a word in a segment according to the position (or sequence ordering) of the word in a path, stored as a tuple (W, d, pos, prob).
● Confusion Networks (CN)
○ Clusters several words in a segment according to similar time spans and word pronunciation.
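The (W, d, pos, prob) tuples of a PSPL can be illustrated with a minimal Python sketch. It assumes the lattice has already been reduced to a handful of paths with normalised posteriors; the function name `build_pspl`, the segment id `"d1"` and the example paths are hypothetical and only serve the illustration.

```python
from collections import defaultdict

def build_pspl(segment_id, paths):
    """Collapse lattice paths into position-specific posteriors.

    paths: list of (word_sequence, path_posterior) pairs whose posteriors
    sum to 1 (an assumed, already-normalised view of the lattice).
    Returns a list of (W, d, pos, prob) tuples.
    """
    posterior = defaultdict(float)
    for words, path_prob in paths:
        for pos, word in enumerate(words):
            # A word's posterior at a position is the total probability
            # of all paths that place the word at that position.
            posterior[(word, pos)] += path_prob
    return [(word, segment_id, pos, prob)
            for (word, pos), prob in sorted(posterior.items())]

# Hypothetical recognition hypotheses for one spoken segment.
paths = [
    (["voice", "search", "system"], 0.5),
    (["voice", "surge", "system"], 0.25),
    (["choice", "search", "system"], 0.25),
]
index = build_pspl("d1", paths)
```

Each entry in `index` records how likely a word is at a given position, so competing hypotheses such as "search" and "surge" both stay searchable.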
● Relevance ranking
○ Relevance scores are computed between the segments and a query Q, which is a sequence of words {W_j, j = 1, 2, ..., Q}.
○ First calculate the expected tapered count for each N-gram {W_i ... W_{i+N−1}} within the query in a spoken segment d, S(d, W_i ... W_{i+N−1}), then aggregate the results to produce a score S_N-gram(d, Q) for each order N.
RETRIEVAL ACCURACY (Cont’d)
Here L is the lattice obtained from d and k is the cluster number in the PSPL or CN structures. The different proximity types, one for each N-gram order allowed by the query length Q, are finally combined by a weighted sum to give the final relevance score S(d, Q).
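The scoring steps can be sketched in Python as follows. The slide's equations did not survive extraction, so the exact form used here is an assumption based on the usual PSPL formulation: the tapered count is log(1 + sum over k of the product of position posteriors), the N-gram score sums over all N-grams of the query, and the final score is a weighted sum over N. The names `ngram_score`, `relevance` and the weights are hypothetical.

```python
import math

def ngram_score(posterior, ngram):
    """Expected tapered count (assumed form):
    log(1 + sum_k prod_l P(W_{i+l}, k + l | L))."""
    positions = {pos for (_, pos) in posterior}
    expected = 0.0
    for k in positions:
        prob = 1.0
        for offset, word in enumerate(ngram):
            prob *= posterior.get((word, k + offset), 0.0)
        expected += prob
    return math.log(1.0 + expected)

def relevance(posterior, query, weights):
    """Weighted sum of N-gram scores for N = 1 .. len(query)."""
    total = 0.0
    for n in range(1, len(query) + 1):
        s_n = sum(ngram_score(posterior, query[i:i + n])
                  for i in range(len(query) - n + 1))
        total += weights[n - 1] * s_n
    return total

# Hypothetical position posteriors P(W, k | L) for one segment d.
posterior = {("voice", 0): 0.75, ("choice", 0): 0.25,
             ("search", 1): 0.75, ("surge", 1): 0.25,
             ("system", 2): 1.0}
score = relevance(posterior, ["voice", "search"], weights=[1.0, 2.0])
```

Giving higher-order N-grams a larger weight rewards segments in which the query words appear next to each other in the same order.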
USER-SYSTEM INTERACTION
● Multi-modal dialogue: for a query given by the user, the retrieval system produces a topic hierarchy constructed from the retrieved spoken documents to be shown on the screen.
● Semantic analysis of spoken documents
● Automatic Generation of Summaries and Titles for spoken documents
● Query-based Local Semantic Structuring of Spoken Documents
● Semantic Structuring of spoken documents
● Interactive retrieval in Dialogue loop
● Key term extraction from spoken documents
○ Based on latent topic significance
PROPOSED MODEL
[Flow diagram: Voice → Voice to text → Keyword → Pattern matching, BoW (bag of words) → match with DB? (yes / no) → Voice-based reply → Voice reply]
This is a three-step process:
1. Speech to text
2. Pattern matching
3. Text to speech
VOICE TO TEXT
● Fuzzy logic can be used to match speech in different accents, e.g. the word “Vector” has different pronunciations.
● Thus a single word can be represented by a fuzzy set.
● Since this is too specific to fit into a generic model of speech recognition, a more general model based on fuzzification of phonemes can be used.
● This model is applied to spoken sentences. One fuzzy set is based on accents, the second on speed of pronunciation and the third on emphasis.
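The idea of representing a word by a fuzzy set of pronunciations can be sketched in Python. This is only an illustration: the phoneme strings, the membership degrees and the names `FUZZY_WORDS` and `best_word` are all hypothetical, and a real system would fuzzify phonemes rather than whole words.

```python
# Hypothetical fuzzy sets: pronunciation variant -> degree of membership in [0, 1].
FUZZY_WORDS = {
    "vector": {"V EH K T ER": 1.0, "V EH K T AH": 0.8, "W EH K T ER": 0.6},
    "victor": {"V IH K T ER": 1.0, "V EH K T ER": 0.3},
}

def best_word(pronunciation, threshold=0.5):
    """Return (word, membership) for the best-matching word,
    or None when no word reaches the threshold."""
    scored = [(members.get(pronunciation, 0.0), word)
              for word, members in FUZZY_WORDS.items()]
    degree, word = max(scored)
    return (word, degree) if degree >= threshold else None

match = best_word("W EH K T ER")
```

An accented variant still maps to "vector", only with a lower degree of membership than the canonical pronunciation.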
BAG-of-WORDS
● A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:
○ A vocabulary of known words.
○ A measure of the presence of known words.
● The steps followed:
■ Collect data
■ Create Vocabulary
■ Create Document Vector
■ Managing Vocabulary
■ Scoring words
■ Word Hashing
■ TF-IDF
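The vocabulary, document vector and TF-IDF steps can be sketched in plain Python. The sketch assumes whitespace tokenisation and one common TF-IDF variant (term frequency times log(N / document frequency)); the example documents and function names are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lower-casing and whitespace splitting are enough here.
    return text.lower().split()

def build_vocabulary(documents):
    return sorted({word for doc in documents for word in tokenize(doc)})

def document_vector(text, vocabulary):
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

def tf_idf(documents, vocabulary):
    n_docs = len(documents)
    vectors = [document_vector(doc, vocabulary) for doc in documents]
    # Number of documents containing each vocabulary word.
    doc_freq = [sum(1 for vec in vectors if vec[j] > 0)
                for j in range(len(vocabulary))]
    weighted = []
    for vec in vectors:
        total = sum(vec) or 1
        weighted.append([(count / total) * math.log(n_docs / doc_freq[j])
                         for j, count in enumerate(vec)])
    return weighted

documents = ["voice search retrieves text",
             "voice query retrieves spoken documents"]
vocabulary = build_vocabulary(documents)
vectors = [document_vector(doc, vocabulary) for doc in documents]
weights = tf_idf(documents, vocabulary)
```

Words that occur in every document, such as "voice" here, receive a TF-IDF weight of zero, while words specific to one document keep a positive weight.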
PATTERN MATCHING
● The Boyer-Moore (BM) algorithm can be used. It positions the pattern over the leftmost characters in the text and attempts to match it from right to left. If no mismatch occurs, the pattern is found; otherwise the algorithm computes a shift, the amount by which the pattern is moved to the right before a new matching attempt is undertaken.
● Shift is computed using two heuristics:
○ Match heuristic: when the pattern is moved to the right, it must
i. match all characters previously matched, and
ii. bring a different character to the position in the text that caused the mismatch
○ Occurrence heuristic: for a text character x and a pattern of length m, the shift table is
d[x] = min{ s | s = m or (0 ≤ s < m and pattern[m − s] = x) }
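A minimal Python sketch of the right-to-left scan with the occurrence heuristic is given below. It is a simplified variant, not the full Boyer-Moore algorithm: the match heuristic is left out, and the shift is forced to be at least 1. The function names are hypothetical, and indices are 0-based, so d[x] is the distance from the rightmost occurrence of x to the end of the pattern.

```python
def occurrence_table(pattern):
    """d[x]: distance from the rightmost occurrence of x to the pattern end.
    Characters absent from the pattern default to m at lookup time."""
    m = len(pattern)
    table = {}
    for i, ch in enumerate(pattern):
        table[ch] = m - 1 - i  # later (more rightward) occurrences overwrite
    return table

def bm_search(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    d = occurrence_table(pattern)
    k = 0  # current alignment of the pattern's first character
    while k <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[k + j]:
            j -= 1  # compare from right to left
        if j < 0:
            return k  # no mismatch: pattern found
        # Align the mismatching text character with its rightmost
        # occurrence in the pattern; never shift by less than 1.
        shift = d.get(text[k + j], m) - (m - 1 - j)
        k += max(1, shift)
    return -1

position = bm_search("here is a simple example", "example")
```

When the mismatching text character does not occur in the pattern at all, the pattern jumps completely past it, which is where the algorithm gains its speed.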
TEXT TO VOICE
● After the text is obtained, it must be analysed and then transformed into a phonetic description.
● NLP module:
○ Text analysis: first the text is segmented into tokens. Token-to-word conversion then creates the orthographic form of each token, for example "Mr" becomes "mister" and numbers like 2 are transformed to "two".
○ Application of pronunciation rules: after the text analysis is completed, pronunciation rules can be applied. A letter may correspond to no phoneme (silent h in caught) or to several phonemes (m in maximum).
■ Dictionary based solution: a dictionary is used in which all possible forms of words are stored.
■ Rule based solution: rules are generated from the phonological knowledge of dictionaries. Only words whose pronunciation is an exception to the rules are included.
● Digital Signal Processing (DSP) module: transforms the symbolic information it receives into audible speech.
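The text analysis step (tokenisation followed by token-to-word conversion) can be sketched in Python. The abbreviation and digit tables are small hypothetical examples, and numbers are expanded digit by digit, which is a simplification.

```python
import re

# Hypothetical lookup tables for token-to-word conversion.
ABBREVIATIONS = {"mr": "mister", "dr": "doctor"}
DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def normalize(text):
    """Segment text into tokens and convert each token to its spoken word form."""
    tokens = re.findall(r"[A-Za-z]+|\d", text)  # words, or single digits
    words = []
    for token in tokens:
        key = token.lower()
        if key in ABBREVIATIONS:
            words.append(ABBREVIATIONS[key])
        elif key in DIGITS:
            words.append(DIGITS[key])
        else:
            words.append(key)
    return words

spoken_form = normalize("Mr Smith has 2 cats")
```

The resulting word list is what the pronunciation stage would then map to phonemes, either through a dictionary lookup or through rules.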
CONCLUSION & FUTURE SCOPE
It can be concluded that this approach is efficient in terms of reduced computational complexity and reduced time.
● Research is being done to make the whole process telephonic
● Limitations of Bag-of-Words
○ Vocabulary
○ Sparsity
○ Meaning