architecture
© 2008 Palantir Technologies Inc. All rights reserved.
Architecture & Scalability
An overview of the Palantir Server Architecture
Akash Jain, Director of Engineering
Overview
Palantir Server Architecture
– A fully featured, enterprise-grade analytic platform
– Robust, scalable, open and maintainable
In this talk
– Dispatch Server
– Oracle DB
– Search Server
– Job Server
– Raptor Server
Server Architecture
[Diagram: Dispatch Server; Revisioning DB (Oracle Database storage, JDBC 3.0 w/ SSL); Search Server (Lucene index storage, HTTPS); Job Server (shared storage, HTTPS; job data and specs, job logs and results)]
Dispatch Server
Clients connect here
– “Gateway to Palantir”
– Clients can only connect here
Connects to database
– Access control
– Revisioning database
Connects to search and federated search
Responsible for job creation and scheduling
Roadmap: Revisioning DB
Revisioning DB
Persistence store
Oracle 10g RDBMS
Enterprise-grade
– Scalability
– Backup and Maintenance
– Industry Standard
– Large DBA community
JDBC 3.0 with SSL
[Diagram: Dispatch Server and Revisioning DB (Oracle Database storage), linked by JDBC 3.0 w/ SSL]
Roadmap: Search Server
Search Server
Built on Apache Lucene
– Leverage text processing capability
– IR Library -> Enterprise Server
– Full-text search capability
– Custom fuzzy search using approxes
Why build our own?
– Flexibility: database agnostic
– Security: built into indexes
– Scalability
[Diagram: Search Server with Lucene index storage]
Clustered Search: Scale Parameters
Palantir Search Server scales horizontally
User scale
– Number of concurrent requests
Data scale
– Additional corpora/data sources
– Also includes manually entered data
Clustered Search: Mirroring
Mirroring for User Scalability
– Redundancy across machines
– Index write requests go to all mirrors
– Search requests go to one mirror
– More mirrors -> more concurrent queries
[Diagram: three search mirrors, each with its own Lucene index storage. Index request A is sent to every mirror; search requests 1, 2 and 3 each go to a single mirror. Second panel, "Increased Throughput": six search mirrors serving search requests 1 through 6.]
Clustered Search: Partitioning
Partitioning for Data Scale
– Split data across many machines
– Search requests go to all partitions
– Index write requests go to one partition
– More partitions -> more data with constant index size
[Diagram: three search partitions, each with its own Lucene index storage. Index requests 1, 2 and 3 each go to a single partition; search request A is sent to every partition. Second panel, "Increased Index Capacity": six search partitions receiving index requests 1 through 6.]
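Partitioning inverts the mirroring rule: each write lands on one partition and each search visits all of them. The sketch below illustrates that routing under stated assumptions. `Partition` and `PartitionedSearch` are hypothetical names, the dictionary stands in for a Lucene index, and routing writes by a CRC32 hash of the document id is an assumption (the slides do not say how a partition is chosen).

```python
import zlib


class Partition:
    """Hypothetical stand-in for one search partition holding a slice of the data."""

    def __init__(self, name):
        self.name = name
        self.docs = {}  # doc_id -> text

    def index(self, doc_id, text):
        self.docs[doc_id] = text

    def search(self, term):
        term = term.lower()
        return [d for d, t in self.docs.items() if term in t.lower()]


class PartitionedSearch:
    """Each index write goes to one partition; each search goes to all partitions."""

    def __init__(self, partitions):
        self.partitions = list(partitions)

    def _owner(self, doc_id):
        # Hash-based routing is an assumption made for this example.
        slot = zlib.crc32(doc_id.encode("utf-8")) % len(self.partitions)
        return self.partitions[slot]

    def index(self, doc_id, text):
        owner = self._owner(doc_id)
        owner.index(doc_id, text)
        return owner.name

    def search(self, term):
        hits = []
        for partition in self.partitions:
            hits.extend(partition.search(term))
        return sorted(hits)


cluster = PartitionedSearch([Partition(f"partition-{i}") for i in range(1, 4)])
for doc_id in ("1", "2", "3", "4", "5", "6"):
    cluster.index(doc_id, f"document {doc_id} about shipping")
all_hits = cluster.search("shipping")
```

Each document is stored exactly once, so total capacity grows with the number of partitions while every individual index stays roughly the same size.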
Roadmap: Job Server
Job Server
The job server runs asynchronous jobs on behalf of clients
– Bulk data imports
– Persistent searches
– LDAP auth syncs
Many job servers
[Diagram: Dispatch Server, Job Server (HTTPS), and shared storage holding job data and specs and job logs and results]
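One way to picture the hand-off through shared storage is a spec file written by the dispatch side and a result file written by a job server. The sketch below is an assumption-laden toy, not the actual job protocol: the directory layout, the JSON format, the `submit_job`, `run_pending` and `read_result` functions, and the `bulk_import` handler are all hypothetical.

```python
import json
import tempfile
import uuid
from pathlib import Path


def submit_job(shared, kind, payload):
    """Dispatch side: write a job spec to shared storage and return its id."""
    job_id = uuid.uuid4().hex
    spec = {"id": job_id, "kind": kind, "payload": payload}
    (shared / "specs" / f"{job_id}.json").write_text(json.dumps(spec))
    return job_id


# Hypothetical handler table; a real job server would run imports, searches, syncs.
HANDLERS = {
    "bulk_import": lambda payload: {"rows_imported": len(payload["rows"])},
}


def run_pending(shared):
    """Job server side: run every spec that does not have a result yet."""
    done = 0
    for spec_path in sorted((shared / "specs").glob("*.json")):
        result_path = shared / "results" / spec_path.name
        if result_path.exists():
            continue
        spec = json.loads(spec_path.read_text())
        try:
            output = HANDLERS[spec["kind"]](spec["payload"])
            result = {"status": "ok", "output": output}
        except Exception as exc:  # record the failure instead of losing the job
            result = {"status": "failed", "error": repr(exc)}
        result_path.write_text(json.dumps(result))
        done += 1
    return done


def read_result(shared, job_id):
    """Client side: return the result if the job has finished, else None."""
    path = shared / "results" / f"{job_id}.json"
    return json.loads(path.read_text()) if path.exists() else None


shared = Path(tempfile.mkdtemp())  # stands in for the shared storage volume
(shared / "specs").mkdir()
(shared / "results").mkdir()

job_id = submit_job(shared, "bulk_import", {"rows": [1, 2, 3]})
before = read_result(shared, job_id)  # job not run yet
ran = run_pending(shared)
after = read_result(shared, job_id)
```

The client gets a job id back immediately and polls for the result later, which is what makes the job asynchronous. With many job servers reading the same storage, a real system would also need a way to claim a job so that two servers do not run it twice; that is left out here.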
Systems Diagram
[Diagram: network zones (External Network, DMZ, Internal Network) containing Client, Dispatch Server, Rev DB (Oracle Database storage, JDBC 3.0 w/ SSL), Search Server (Lucene index storage), Job Server, and shared storage (job data and specs, job logs and results); HTTPS links between components]
Raptor Overview
Raptor sits in front of data sources
Raptor indexes the data source and answers search queries
Raptor monitors changes in your data source and sends them to Palantir
Federated Search
Raptor is Palantir’s federated search server
– Rich data modeling
– Extensible searching
– Highly scalable indexing and search capabilities
Leverages
– Palantir Data Import Pipeline
– Palantir Clustered Search Server
With Raptor:
Data owners control data
You control performance characteristics
Raptor Query Process
Search Query
• Hits Palantir Search Server
• Federated to Raptor instances if applicable
• Supports both keyword search and structured queries
Results Collection
• Results are sorted using relevance from each search
Import to Palantir
• On-The-Fly (OTF) Import
• Sourcing information retained for each attribute imported
• Enables full Palantir functionality
[Diagram: Raptor A, Raptor B and Raptor C searching; a Palantir query result combining results from Raptor C, Raptor B and Raptor A]
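The fan-out and results-collection steps can be sketched as a merge of per-instance result lists by relevance. This is a minimal illustration, not Raptor's API: the `Hit` record, the `federated_search` function, and the stub instances are hypothetical, and treating relevance scores from different instances as directly comparable is an assumption.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    source: str       # which instance produced the hit (sourcing is retained)
    doc_id: str
    relevance: float


def federated_search(query, instances):
    """Send the query to every instance and merge the hits by relevance.

    `instances` maps an instance name to a callable returning
    (doc_id, relevance) pairs; both are stand-ins for real search calls.
    """
    per_source = []
    for name, search_fn in instances.items():
        hits = [Hit(name, doc_id, score) for doc_id, score in search_fn(query)]
        per_source.append(sorted(hits, key=lambda h: h.relevance, reverse=True))
    # Each list is already sorted in descending order, so a k-way merge suffices.
    return list(heapq.merge(*per_source, key=lambda h: h.relevance, reverse=True))


# Stub instances with made-up scores, for illustration only.
instances = {
    "Raptor A": lambda q: [("a1", 0.9), ("a2", 0.4)],
    "Raptor B": lambda q: [("b1", 0.7)],
    "Raptor C": lambda q: [("c1", 0.8), ("c2", 0.1)],
}
results = federated_search("shipping", instances)
```

Keeping the `source` field on every hit mirrors the slide's point that sourcing information is retained when results are imported.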
Raptor Scale Characteristics
Data Scale
– 100 million row Netflix dataset
– 10 million document Usenet corpus
– 1.5 million entity-extracted Wikipedia corpus
Indexing Performance
– 1m rows/hour structured indexing
– 500k docs/hour unstructured document indexing
– 100k docs/hour entity-extracted document indexing
Searching Performance
– Sub-second search processing
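As a back-of-the-envelope reading of these figures, the snippet below divides each corpus size by an indexing rate. The pairing of each dataset with a rate is an assumption (the slide does not say which rate applies to which corpus), as is the idea that a single rate holds for the whole run, so the outputs are rough illustrations rather than reported results.

```python
def hours_to_index(items, items_per_hour):
    """Time to index `items` at a constant rate, in hours."""
    return items / items_per_hour


# Assumed pairings of corpus size and indexing rate, taken from the two lists above.
estimates = {
    "structured rows (100M at 1M/hour)": hours_to_index(100_000_000, 1_000_000),
    "unstructured docs (10M at 500k/hour)": hours_to_index(10_000_000, 500_000),
    "entity-extracted docs (1.5M at 100k/hour)": hours_to_index(1_500_000, 100_000),
}

for label, hours in estimates.items():
    print(f"{label}: about {hours:.0f} hours")
```

Under those assumptions the three corpora would take roughly 100, 20 and 15 hours to index.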
Summary
Palantir server components support a robust, scalable platform for analysis
Leverage enterprise-grade infrastructure
Raptor provides further scalability