multilingual search system
DESCRIPTION
Multilingual search system as part of Information Retrieval. The presentation deals with the implementation of a search system using Solr.
TRANSCRIPT
Multilingual Search System
TEAM NAME – SHIELD
Vamshi Krishna Padidela (50169645)
Manikant Manohar Kapuganti (50170071)
Pramod Rangaraju (50169514)
Sudheer Bondada (50170321)
Nikhil Ayyagari (50169485)
Introduction
In this project, we built a Solr-powered retrieval system for searching tweets.
The dataset contains 11,000 tweets in multiple languages, collected through Twitter’s REST API. The tweets cover two topics, ISIS and health, each with significant subtopics.
The UI for the search system is built on the Banana framework, which has powerful dashboard capabilities for visualizing big data analytics.
We implemented the following components:
1. Content Tagging (Monolingual)
2. Faceted Search
3. Cross-Document Analytics
4. Topic Models and/or LSI
Content Tagging (Monolingual)
We implemented content tagging using Alchemy’s Entity Extraction API.
The Alchemy API identifies proper nouns (places, people, organizations) using Natural Language Processing.
The tags that the Alchemy API returns for each tweet are added to that tweet in a new field, “tags”.
The new JSON file with the added “tags” field is re-indexed in Solr.
The tags give insight into metrics such as the popularity of a person or place over a period of time.
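The tagging step described above can be sketched roughly as follows. This is only an illustration: `extract_entities` is a hypothetical stand-in for the entity-extraction service (it naively treats capitalized words as proper nouns and does not call the Alchemy API), and the tweet fields `id` and `text` are assumed names.

```python
import json


def extract_entities(text):
    """Hypothetical stand-in for an entity-extraction service.

    It naively treats capitalized words as proper nouns; a real
    service would rely on NLP models instead.
    """
    words = [w.strip(".,!?:;\"'()") for w in text.split()]
    return sorted({w for w in words if w[:1].isupper()})


def add_tags(tweets, extractor=extract_entities):
    """Return copies of the tweets with a new "tags" field added."""
    tagged = []
    for tweet in tweets:
        enriched = dict(tweet)  # copy, so the input list is left untouched
        enriched["tags"] = extractor(tweet.get("text", ""))
        tagged.append(enriched)
    return tagged


# Made-up sample tweets, for illustration only.
tweets = [
    {"id": "1", "text": "Flu cases rise in Paris and London"},
    {"id": "2", "text": "new vaccine trial begins"},
]
tagged_tweets = add_tags(tweets)

# The enriched JSON is what would be re-indexed in Solr.
print(json.dumps(tagged_tweets, indent=2))
```

Passing the extractor as a parameter keeps the enrichment logic independent of whichever tagging service is actually used.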
Results from Alchemy API’s content tagging
Tags for a search field
Tags displayed in order of most used
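Ordering tags by how often they are used could be done along these lines. It is a minimal sketch that assumes each tweet already carries a “tags” list; the sample data is invented.

```python
from collections import Counter


def rank_tags(tagged_tweets):
    """Count tag occurrences across tweets, most used first."""
    counts = Counter()
    for tweet in tagged_tweets:
        counts.update(tweet.get("tags", []))
    return counts.most_common()


# Invented sample data, for illustration only.
sample = [
    {"id": "1", "tags": ["Syria", "WHO"]},
    {"id": "2", "tags": ["Syria"]},
    {"id": "3", "tags": ["WHO", "Syria", "Ebola"]},
]
ranking = rank_tags(sample)
for tag, count in ranking:
    print(tag, count)
```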
Faceted Search
Faceted search is available in the Banana framework, where the search can be restricted by fields such as text, language, and location.
Facets work like filters, with the addition of a document count for each value.
Faceted search helps in building dashboards for various analytical purposes.
Faceted search is also called faceted browsing, faceted navigation, guided navigation, and sometimes parametric search.
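To illustrate how facets differ from plain filters, the sketch below builds a Solr select URL with the standard `facet`, `facet.field`, and `fq` parameters and then pairs up the value/count list that Solr returns by default in its JSON output. The core name `tweets`, the field names `lang` and `location`, and the sample response are assumptions made for the example, not details from the project.

```python
from urllib.parse import urlencode


def build_facet_query(base_url, text, facet_fields, filters=None):
    """Build a Solr /select URL that asks for facet counts."""
    params = [("q", text), ("wt", "json"), ("rows", "10"), ("facet", "true")]
    # One facet.field parameter per field we want counts for.
    params += [("facet.field", field) for field in facet_fields]
    # Each fq parameter narrows the result set, like a filter.
    params += [("fq", f"{field}:{value}")
               for field, value in (filters or {}).items()]
    return f"{base_url}/select?{urlencode(params)}"


def parse_facet_counts(response, field):
    """Turn Solr's flat [value, count, value, count, ...] list into pairs."""
    flat = response["facet_counts"]["facet_fields"][field]
    return list(zip(flat[0::2], flat[1::2]))


# Assumed local Solr instance and core name.
url = build_facet_query(
    "http://localhost:8983/solr/tweets",
    "text:health",
    facet_fields=["lang", "location"],
    filters={"lang": "en"},
)
print(url)

# Invented response fragment in Solr's default flat facet format.
sample_response = {
    "facet_counts": {"facet_fields": {"lang": ["en", 120, "de", 30]}}
}
lang_counts = parse_facet_counts(sample_response, "lang")
print(lang_counts)
```

The `fq` parameter only restricts results, while `facet.field` returns the per-value document counts that a dashboard can display next to each filter option.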
Facets and filters
Pie chart showing the geographical distribution
Cross-Document Analytics
Distribution of tweets against time and location
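A distribution like this can be computed by aggregating across documents. The sketch below is one possible way to do it outside the dashboard; the field names `created_at` and `location` and the timestamp format are assumptions, and the sample tweets are invented.

```python
from collections import Counter
from datetime import datetime


def distribution(tweets):
    """Count tweets per calendar day and per location."""
    by_day = Counter()
    by_location = Counter()
    for tweet in tweets:
        # Assumed ISO-like timestamp, e.g. 2015-11-20T10:15:00Z.
        created = datetime.strptime(tweet["created_at"], "%Y-%m-%dT%H:%M:%SZ")
        by_day[created.date().isoformat()] += 1
        by_location[tweet.get("location") or "unknown"] += 1
    return by_day, by_location


# Invented sample tweets, for illustration only.
sample_tweets = [
    {"created_at": "2015-11-20T10:15:00Z", "location": "Paris"},
    {"created_at": "2015-11-20T18:40:00Z", "location": "London"},
    {"created_at": "2015-11-21T07:05:00Z", "location": "Paris"},
    {"created_at": "2015-11-21T09:30:00Z"},
]
by_day, by_location = distribution(sample_tweets)
print(dict(by_day))
print(dict(by_location))
```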
Topic Models-LSI
We implemented Latent Semantic Indexing (LSI) on the collected data to demonstrate semantic search instead of keyword search.
Latent Dirichlet Allocation (LDA) is a probabilistic topic model that extends the ideas behind LSI.
LDA is responsible for extracting collections of topics.
LDA processes the tweets to find the topic distribution for each document and the document distribution for each topic.
The LDA algorithm is invoked on the vectors generated from the Sequence file.
We are using MALLET (Machine Learning for Language Toolkit) for topic generation. (Results pending)
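To show how LSI enables semantic rather than keyword matching, here is a minimal sketch using a truncated SVD from NumPy. It is not the project's pipeline: the four toy documents, the raw term counts, and the choice of `k = 2` are assumptions made for illustration.

```python
import numpy as np

# Toy corpus, invented for illustration.
docs = [
    "health flu vaccine",
    "flu vaccine shot",
    "isis syria attack",
    "syria attack news",
]

vocab = sorted({word for doc in docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocab)}


def to_vector(text):
    """Raw term-count vector over the toy vocabulary."""
    vec = np.zeros(len(vocab))
    for word in text.split():
        if word in index:
            vec[index[word]] += 1.0
    return vec


# Term-document matrix: one column per document.
X = np.column_stack([to_vector(doc) for doc in docs])

# Truncated SVD: keep the k strongest latent dimensions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
U_k = U[:, :k]
doc_vecs = (U_k.T @ X).T  # each row is a document in latent space


def lsi_similarities(query):
    """Cosine similarity between the query and each document in latent space."""
    q = U_k.T @ to_vector(query)
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q)
    return doc_vecs @ q / norms


sims = lsi_similarities("health")
print(np.round(sims, 3))
```

The word "health" appears only in the first document, yet the second document also scores highly because it shares "flu" and "vaccine" with the first; a plain keyword match would miss it.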
Search System UI – 1/2
Search System UI – 2/2
Thank You!!