architecture

© 2008 Palantir Technologies Inc. All rights reserved.

Architecture & Scalability

An overview of the Palantir Server Architecture

Akash Jain, Director of Engineering

Overview

Palantir Server Architecture
– A fully-featured, enterprise-grade analytic platform
– Robust, scalable, open and maintainable

In this talk
– Dispatch Server
– Oracle DB
– Search Server
– Job Server
– Raptor Server

Server Architecture

[Diagram: Dispatch Server; Revisioning DB (Oracle database storage), reached over JDBC 3.0 with SSL; Search Server (Lucene index storage), reached over HTTPS; Job Server, reached over HTTPS, with shared storage holding job data and specs, and job logs and results.]

Dispatch Server

Clients connect here
– “Gateway to Palantir”
– Clients can only connect here

Connects to database
– Access control
– Revisioning database

Connects to search and federated search

Responsible for job creation and scheduling

Roadmap: Revisioning DB

Revisioning DB

Persistence store

Oracle 10g RDBMS

Enterprise-grade
– Scalability
– Backup and Maintenance
– Industry Standard
– Large DBA community

JDBC 3.0 with SSL

[Diagram: Dispatch Server connected to the Revisioning DB over JDBC 3.0 with SSL]

Roadmap: Search Server

Search Server

Built on Apache Lucene
– Leverage text processing capability
– IR Library -> Enterprise Server
– Full-text search capability
– Custom fuzzy search using approxes

Why build our own?
– Flexibility: database agnostic
– Security: built into indexes
– Scalability

[Diagram: Search Server backed by Lucene index storage]

Clustered Search Scale Parameters

Palantir Search Server scales horizontally

User scale
– Number of concurrent requests

Data scale
– Additional corpora/data sources
– Also includes manually entered data

Clustered Search Mirroring

Mirroring for User Scalability
– Redundancy across machines
– Index write requests go to all mirrors
– Search requests go to one mirror
– More mirrors -> more concurrent queries

[Diagram: index request A is sent to every search mirror, each with its own Lucene index storage, while search requests 1, 2, and 3 each go to a single mirror. Increased throughput: with six mirrors, search requests 1 through 6 are served at the same time.]
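The mirroring rules above can be sketched in a few lines of Python. This is an illustration only, not Palantir's implementation: the `SearchMirror` and `MirroredCluster` classes, the in-memory dictionary standing in for a Lucene index, and the round-robin choice of mirror are all assumptions made for the example.

```python
from itertools import cycle


class SearchMirror:
    """Hypothetical stand-in for one search mirror holding a full index copy."""

    def __init__(self, name):
        self.name = name
        self.index = {}    # doc_id -> text (stands in for a Lucene index)
        self.searches = 0  # number of search requests this mirror served

    def write(self, doc_id, text):
        self.index[doc_id] = text

    def search(self, term):
        self.searches += 1
        return sorted(d for d, t in self.index.items() if term in t.lower())


class MirroredCluster:
    def __init__(self, mirrors):
        self.mirrors = list(mirrors)
        # Round-robin is an assumed routing policy; the slides only say
        # that a search request goes to one mirror.
        self._next = cycle(self.mirrors)

    def index(self, doc_id, text):
        # Index write requests go to all mirrors.
        for mirror in self.mirrors:
            mirror.write(doc_id, text)

    def search(self, term):
        # Each search request goes to exactly one mirror.
        return next(self._next).search(term)


cluster = MirroredCluster(SearchMirror(f"mirror-{i}") for i in range(3))
cluster.index("A", "Quarterly shipping report")
results = [cluster.search("shipping") for _ in range(6)]
searches_per_mirror = [m.searches for m in cluster.mirrors]
```

Because every mirror holds the same documents, any mirror can answer any query, so adding mirrors spreads the query load without changing results. The cost is that every write is repeated on every mirror.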

Clustered Search Partitioning

Partitioning for Data Scale
– Split data across many machines
– Search requests go to all partitions
– Index write requests go to one partition
– More partitions -> more data with constant index size per partition

[Diagram: index requests 1, 2, and 3 each go to a single search partition, each with its own Lucene index storage, while search request A is sent to every partition. Increased index capacity: with six partitions, index requests 1 through 6 are spread across the partitions.]
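Partitioning inverts the mirroring rules: a write lands on one partition and a search fans out to all of them. The sketch below illustrates that under stated assumptions. The `SearchPartition` and `PartitionedCluster` classes are hypothetical, and routing by a CRC32 hash of the document ID is one possible placement rule, not something the slides specify.

```python
import zlib


class SearchPartition:
    """Hypothetical stand-in for one partition holding a slice of the data."""

    def __init__(self, name):
        self.name = name
        self.index = {}  # doc_id -> text (stands in for a Lucene index)

    def write(self, doc_id, text):
        self.index[doc_id] = text

    def search(self, term):
        return [d for d, t in self.index.items() if term in t.lower()]


class PartitionedCluster:
    def __init__(self, count):
        self.partitions = [SearchPartition(f"partition-{i}") for i in range(count)]

    def _owner(self, doc_id):
        # Assumed placement rule: a stable hash of the document ID.
        # crc32 is used because Python's built-in hash() of a str
        # varies between interpreter runs.
        slot = zlib.crc32(doc_id.encode("utf-8")) % len(self.partitions)
        return self.partitions[slot]

    def index(self, doc_id, text):
        # Index write requests go to one partition.
        self._owner(doc_id).write(doc_id, text)

    def search(self, term):
        # Search requests go to all partitions; partial results are combined.
        hits = []
        for partition in self.partitions:
            hits.extend(partition.search(term))
        return sorted(hits)


pcluster = PartitionedCluster(3)
for i in range(1, 7):
    pcluster.index(f"doc-{i}", f"Index request {i}")

all_hits = pcluster.search("index request")
stored = sum(len(p.index) for p in pcluster.partitions)
```

Each document is stored once, so total capacity grows with the number of partitions, while every query has to wait for all partitions to answer. Mirroring and partitioning address different limits (query load versus data volume), which is why the slides present them as separate scale parameters.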

Roadmap: Job Server

Job Server

The job server runs asynchronous jobs on behalf of clients
– Bulk data imports
– Persistent searches
– LDAP auth syncs

Many job servers

[Diagram: Dispatch Server, Job Server (HTTPS), and shared storage holding job data and specs, and job logs and results]
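A minimal sketch of the "many job servers" idea, assuming jobs are queued by the dispatch server and picked up by whichever job server is free. Everything here is hypothetical: the in-memory queue and dictionary stand in for the shared storage of job specs and results, threads stand in for separate job server machines, and `run_job` is a placeholder for real work.

```python
import queue
import threading


def run_job(spec):
    # Placeholder job body; a real job might be a bulk data import,
    # a persistent search, or an LDAP auth sync.
    return f"{spec['kind']} finished for {spec['target']}"


def job_server(name, pending, shared_storage, lock):
    """Take job specs until none remain, writing each result to shared storage."""
    while True:
        try:
            job_id, spec = pending.get_nowait()
        except queue.Empty:
            return
        result = run_job(spec)
        with lock:
            shared_storage[job_id] = {"server": name, "result": result}


# The dispatch server's role (job creation and scheduling) is reduced
# here to putting specs on a queue.
specs = {
    1: {"kind": "bulk data import", "target": "dataset-a"},
    2: {"kind": "persistent search", "target": "query-b"},
    3: {"kind": "LDAP auth sync", "target": "directory-c"},
}
pending = queue.Queue()
for job_id, spec in specs.items():
    pending.put((job_id, spec))

shared_storage = {}
lock = threading.Lock()
servers = [
    threading.Thread(
        target=job_server,
        args=(f"job-server-{i}", pending, shared_storage, lock),
    )
    for i in range(2)
]
for server in servers:
    server.start()
for server in servers:
    server.join()
```

Since the job servers share no state other than the queue and the storage, more of them can be added without changing how jobs are created. Which server runs which job is not deterministic in this sketch.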

Systems Diagram

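[Diagram: the server components laid out across three network zones (External Network, DMZ, Internal Network). Labels: Client (HTTPS), Dispatch Server, Rev DB (JDBC 3.0 with SSL), Search Server with Lucene index storage (HTTPS), Job Server (HTTPS), and shared storage holding job data and specs, and job logs and results.]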

Raptor Overview

Raptor sits in front of data sources

Raptor indexes the data source and answers search queries

Raptor monitors changes in your data source and sends them to Palantir

Federated Search

Raptor is Palantir’s federated search server
– Rich data modeling
– Extensible searching
– Highly scalable indexing and search capabilities

Leverages
– Palantir Data Import Pipeline
– Palantir Clustered Search Server

With Raptor:
– Data owners control data
– You control performance characteristics

Raptor Query Process

Search Query
• Hits Palantir Search Server
• Federated to Raptor instances if applicable
• Supports both keyword search and structured queries

Results Collection
• Results are sorted using relevance from each search

Import to Palantir
• On-The-Fly (OTF) Import
• Sourcing information retained for each attribute imported
• Enables full Palantir functionality

[Diagram: a search query fans out to Raptor A, Raptor B, and Raptor C, each searching its own data source; their results are collected into a single Palantir query result.]
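The query and results-collection steps can be sketched as a fan-out followed by a merge. This is an assumed, simplified model rather than Raptor's actual interface: `federated_search`, `make_source`, the `Hit` record, and the toy term-frequency score are all invented for the example. Each hit keeps the name of the instance it came from, in the spirit of retaining sourcing information.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    source: str       # which instance produced the hit
    doc_id: str
    relevance: float


def make_source(corpus):
    """Build a toy search function over a dict of doc_id -> text."""
    def search(query):
        terms = query.lower().split()
        hits = []
        for doc_id, text in corpus.items():
            words = text.lower().split()
            # Toy relevance: fraction of words that match a query term.
            score = sum(words.count(term) for term in terms) / len(words)
            if score > 0:
                hits.append((doc_id, score))
        return hits
    return search


def federated_search(query, sources):
    """Send the query to every source and merge the hits by relevance."""
    merged = []
    for name, search_fn in sources.items():
        for doc_id, relevance in search_fn(query):
            merged.append(Hit(name, doc_id, relevance))
    return sorted(merged, key=lambda hit: hit.relevance, reverse=True)


sources = {
    "Raptor A": make_source({"a1": "wire transfer records", "a2": "flight manifest"}),
    "Raptor B": make_source({"b1": "transfer transfer log"}),
    "Raptor C": make_source({"c1": "phone directory"}),
}
merged = federated_search("transfer", sources)
ranking = [(hit.source, hit.doc_id) for hit in merged]
```

Sorting raw scores from independent indexes, as done here, assumes the scores are comparable across instances. A real federated system may need to normalize them first.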

Raptor Scale Characteristics

Data Scale
– 100 million row Netflix dataset
– 10 million document Usenet corpus
– 1.5 million entity extracted Wikipedia corpus

Indexing Performance
– 1m rows/hour structured indexing
– 500k docs/hour unstructured document indexing
– 100k docs/hour entity-extracted document indexing

Searching Performance
– Sub-second search processing
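To put the figures above side by side, the short calculation below divides each dataset size by an indexing rate. The pairing of datasets with rates is an assumption made for illustration (Netflix rows as structured, Usenet as unstructured documents, Wikipedia as entity-extracted documents). The slides do not state how long any of these datasets took to index, and the calculation assumes a constant rate with no parallelism.

```python
# (label, size, assumed indexing rate per hour)
datasets = [
    ("Netflix dataset (structured rows)", 100_000_000, 1_000_000),
    ("Usenet corpus (unstructured documents)", 10_000_000, 500_000),
    ("Wikipedia corpus (entity-extracted documents)", 1_500_000, 100_000),
]

# Estimated hours = items / (items per hour), under the assumptions above.
hours = {label: size / rate for label, size, rate in datasets}

for label, estimate in hours.items():
    print(f"{label}: about {estimate:.0f} hours")
```

Under these assumptions the estimates come to roughly 100, 20, and 15 hours respectively.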

Summary

Palantir server components support a robust, scalable platform for analysis

Leverage enterprise-grade infrastructure

Raptor provides further scalability
