Bangalore Executive Seminar 2015: Case Study - Text Analysis on MongoDB for a Large Enterprise
Post on 06-Apr-2017
Case Study-Text Analytics on Unstructured Data
2 © Happiest Minds – Confidential
Our Business
Digital Transformation for enterprises and technology providers leveraging an integrated set of disruptive technologies
• Big Data & Analytics
• Mobility
• Security
• Cloud
• Social Computing
• Unified Communications
• BPM, Workflow
• Business Integration
• IoT
Digital Enterprise
Infrastructure Transformation and Managed Services
Public / Hybrid Cloud Transformation
Virtualization - Server, Storage, Network
Private Cloud Migrations
End to End Infrastructure Transformation Services
Managed Infrastructure Services (GNOC) – 24X7
Basic - Managed Infra Services
Advanced - Managed Infra Services
• Database Monitoring and Mgmt.
• Cloud Operations Monitoring and Mgmt.
• Mobility / BYOD Monitoring and Mgmt.
• Big Data Provisioning and Mgmt.
• Audio/Video/IPT Monitoring and Mgmt.
Cloud Adoption Strategy
End to End Infrastructure Advisory Services
Data Center Consolidation
Software Defined Data Center
Messaging & Collaboration
Audio/Video
Big Data Adoption
BYOD/Mobility
Smart Workspace Computing
Virtualization Adoption
Big Data
Software Defined DCS, Network, Storage
Next Gen DCS Provisioning & Mgmt.
Converged Systems Migration
Automation & Administration
VDI Migration
Unified Device Migration
Messaging / Collaboration
Audio/Video/IPT Enhancements
Mobility / BYOD
MongoDB migration / Acceleration
Database migration / Acceleration
Hadoop migration / Acceleration
• Service Desk
• NOC Services – Server, Network, Storage, Database, Backup / Archival, Asset Mgmt., Vendor Mgmt.
Enhanced - Managed Infra Services
ITSM Definition
Cloud and Next Gen Data Center Services
Unified Communication and Device Management
Our Expertise
Data at Rest
Data in Motion
Structured, Multi-structured
Flume
Sqoop
Impala
Kafka
Pentaho
Hadoop
Columnar
Document
MPP
Apache, Cloudera, Hortonworks
HBase, Cassandra
MongoDB
Greenplum
Distributed: MapReduce
Streaming: Storm
Documentation & Indexation: SolrCloud, Lucene
Visualization
Predictive Analysis
Machine Learning
Text Mining , NLP
R, Revolution R, Python
Mahout
Tableau, QlikView
QlikView, Cloudera, Cassandra, Revolution Analytics, Nagios, Apache Ambari, Ganglia, 10Gen, DataSta, Platfora, Tableau
CASE STUDY
About Customer
• A multinational corporation based in India with revenue of US$33 billion.
• Involved in steel, energy and infrastructure services.
• Operations in 29 countries, employing over 60 thousand people.
• The pilot is for the Oil and Gas division of the company.
Business Requirement
• What transactions happened, with whom, and when.
• Regulatory requirement: communication logs must be kept and reviewed by auditors.
• Primary use case: text analytics on email, chat and audio data combined to spot deceitful transactions.
The Business Problem
• No single view of all communications that happened through email, chat and voice.
• The auditors' review process was daunting: they had to read through numerous email and chat files and listen to audio files to qualify a transaction as ‘clean’.
• Heavy dependency on the people maintaining these file systems.
• No scientific reasoning to back the auditors' findings.
• Brand reputation was at risk.
Data Challenge
• Semi-structured and unstructured data: email, voice and chat.
• There is no unique ID to bind the communication between the channels.
• Need for an algorithm for deep relevancy calculation.
• More documents are added to collections all the time.
• Extracting features from the documents is a challenge due to high dimensionality and latent variables.
The Approach
• Extract data from email, chat message and audio files – Java.
• Preprocess data: harvesting, synchronizing and harmonizing rich data from communication media – Python and Java.
• Storing, accessing and processing – MongoDB.
• Cluster coherent documents by topic using LDA – Java MapReduce.
• Report generation – BIRT.
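The harmonizing step can be pictured as mapping each channel's records onto one shared shape before loading. The sketch below is only an illustration: the slides do not give the schema, so every field name (`from`, `user`, `caller`, `epoch`, `sender`, `sent_at` and so on) is a hypothetical choice, and the audio branch assumes a transcript has already been produced.

```python
from datetime import datetime, timezone


def harmonize(channel, raw):
    """Map one raw record from a channel onto a shared document shape.

    All field names here are hypothetical; the slides do not specify them.
    """
    if channel == "email":
        sender, recipient, text = raw["from"], raw["to"], raw["body"]
    elif channel == "chat":
        sender, recipient, text = raw["user"], raw["peer"], raw["message"]
    elif channel == "audio":
        # Assumes the call has been transcribed upstream; empty text otherwise.
        sender, recipient, text = raw["caller"], raw["callee"], raw.get("transcript", "")
    else:
        raise ValueError(f"unknown channel: {channel}")

    return {
        "channel": channel,
        # Normalise identities so the same person matches across channels.
        "sender": sender.strip().lower(),
        "recipient": recipient.strip().lower(),
        # Store one timezone-aware timestamp regardless of the source format.
        "sent_at": datetime.fromtimestamp(raw["epoch"], tz=timezone.utc),
        "body": text,
    }
```

Documents in this shape could then be inserted into a single collection, which is one way to work around the lack of a unique ID across channels.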
Why MongoDB
Storing email, chat and voice files is in itself a problem. A SQL database is not the right fit, given its enforced constraints and strictly relational model.
• With all the files as JSON objects, it made sense to have MongoDB take in these objects and query them efficiently.
• MongoDB's support for an extremely simple and flexible data model allowed storing similar but different objects (chat and email from a user) as embedded documents, rather than storing data in different relational tables and relying on complex joins to retrieve it.
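As a rough illustration of "similar but different objects as embedded documents", the sketch below shows one user document holding both email and chat entries with different fields. The structure and every field name are assumptions made for this example, not the schema used in the pilot.

```python
# Hypothetical document: one user with embedded communications of mixed shape.
trader = {
    "_id": "trader-001",
    "name": "Example Trader",
    "communications": [
        # An email entry carries a subject line ...
        {"channel": "email", "subject": "Quote request", "body": "Please send a quote."},
        # ... while a chat entry carries a room name instead.
        {"channel": "chat", "room": "desk-7", "body": "Quote sent."},
    ],
}


def communications(doc, channel):
    """Return the embedded entries for one channel, no join required."""
    return [entry for entry in doc["communications"] if entry["channel"] == channel]
```

Because both entry types live inside the same parent document, reading a user's email and chat history is a single lookup rather than a join across tables.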
Why MongoDB
• GridFS, MongoDB's storage system for large objects, helped to store voice files in raw format. Using MongoDB's efficient sort and filter options, we were able to integrate and retrieve email, chat and voice data as one group.
• MongoDB's Aggregation framework provided an easy way to work on large documents and transform them into aggregated results.
• Java MapReduce was used to construct a term-document matrix to identify commonalities in a corpus of documents. LDA (Latent Dirichlet Allocation), an advanced statistical method, was used to determine the topic-to-term mappings from document contents.
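To make the term-document matrix concrete, here is a minimal sketch of its construction in a map/reduce style. It is written in plain Python rather than the Java MapReduce named on the slide, the tokenizer is a deliberately crude assumption, and it stops at the matrix: the LDA step itself is not shown.

```python
import re
from collections import Counter, defaultdict


def map_terms(doc_id, text):
    """Map step: emit ((term, doc_id), 1) for every token in the document."""
    for token in re.findall(r"[a-z]+", text.lower()):
        yield (token, doc_id), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per (term, doc_id) key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals


def term_document_matrix(corpus):
    """Build {term: {doc_id: count}} from a {doc_id: text} corpus."""
    pairs = (pair for doc_id, text in corpus.items() for pair in map_terms(doc_id, text))
    matrix = defaultdict(dict)
    for (term, doc_id), count in reduce_counts(pairs).items():
        matrix[term][doc_id] = count
    return dict(matrix)
```

Terms that appear in several documents show up as rows with more than one entry, which is the kind of shared vocabulary a topic model such as LDA works from.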
Why MongoDB
• MongoDB’s dynamic schema was a key feature that enabled our developers to track a new metric on the fly from arbitrary data pulled from SAP ERP.
• Various indexing techniques, including a text index and additional secondary indexes, allowed us to quickly filter traders by the user’s preferred criteria.
• High availability: responsiveness of our service is key to our SLA. MongoDB's replication feature gave us resilience to failures.
Data Consolidation and Analysis Architecture
Presentation Tier
The topmost level of the application is the user interface. Its main function is to translate tasks and results into something the user can understand.
Logic Tier
This layer coordinates the application, processes commands, makes logical decisions and evaluations, and performs aggregations. It also moves and processes data between the two surrounding layers.
Data Tier
Information is stored in and retrieved from a database or file system. The information is then passed to the logic tier for processing and eventually back to the user.
Database
Data Parser/Loader
Email, Chat, Audio
Query
• Get list of messages sent to a Yahoo ID within a date range?
• Get list of all communications sent to a Yahoo ID on June 10?
Reports
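The two example queries on this slide could be expressed as MongoDB filter documents along the following lines. This is a sketch only: `recipient` and `sent_at` are assumed field names, and the "on June 10" case is treated as a half-open one-day range.

```python
from datetime import datetime, timedelta


def sent_to_between(recipient, start, end):
    """Filter for communications to one recipient in the range [start, end)."""
    return {"recipient": recipient, "sent_at": {"$gte": start, "$lt": end}}


def sent_to_on(recipient, day):
    """Filter for communications to one recipient on a single calendar day."""
    start = datetime(day.year, day.month, day.day)  # midnight at the start of the day
    return sent_to_between(recipient, start, start + timedelta(days=1))
```

With a driver such as PyMongo, a filter like this would be passed to `collection.find(...)`; leaving out any channel condition is what returns email, chat and audio entries together.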
Text Analytics Application Model
Rich Queries: Find everybody who did a $25 million transaction last week, or did a transaction with a supplier in China last week worth $1 million to $25 million.
Aggregation: What is the number of traded deals for a particular product in a time range?
Text Search: Find all documents that mention a supplier.
MapReduce: List all documents based on a topic or document terms (discovered using the LDA algorithm).
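The first three query types above might look roughly like this as MongoDB filter and pipeline documents. The operators (`$or`, `$gte`, `$lte`, `$lt`, `$match`, `$group`, `$sum`, `$text`) are standard MongoDB, but the field names `traded_at`, `amount_usd`, `supplier.country` and `product` are assumptions made for illustration.

```python
from datetime import datetime


def rich_query(week_start, week_end):
    """Deals last week worth $25M, or $1M to $25M with a supplier in China."""
    return {
        "traded_at": {"$gte": week_start, "$lt": week_end},
        "$or": [
            {"amount_usd": 25_000_000},
            {
                "supplier.country": "China",
                "amount_usd": {"$gte": 1_000_000, "$lte": 25_000_000},
            },
        ],
    }


def deals_pipeline(product, start, end):
    """Aggregation pipeline counting traded deals for one product in a time range."""
    return [
        {"$match": {"product": product, "traded_at": {"$gte": start, "$lt": end}}},
        {"$group": {"_id": "$product", "deals": {"$sum": 1}}},
    ]


def text_search(supplier):
    """Text-search filter; it only works on a collection that has a text index."""
    return {"$text": {"$search": supplier}}
```

The two filters are the kind of argument a driver's `find` call takes, and the pipeline is what `aggregate` takes; building them as plain dictionaries keeps the query logic testable without a running database.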
What was achieved
• The results were staggering!
• RAAD: Completed pilot development in three weeks, which otherwise would have taken a couple of months.
• Performance: The application responded to user queries within a 50-millisecond window. MongoDB enabled low-latency queries across thousands of documents.
Business Benefits
• Provided a single view of all communications per transaction.
• Auditors' evaluation time was brought down from 2 weeks to 1 day.
• Saved around 300 man-hours previously consumed by manual consolidation of data from email, chat and audio servers.
• The text analytics application offered new insights, such as a deeper understanding of the supplier market, which was not possible before.
Thank you
www.happiestminds.com