

Page 1: Instantly Finding a Needle of Data in a Haystack of …...Instantly Finding a Needle of Data in a Haystack of Large-Scale NFS Environment Author Touretsky, Gregory Created Date 9/17/2015

2015 Storage Developer Conference. © Insert Your Company Name. All Rights Reserved.

Instantly Finding a Needle of Data in a Haystack of Large-Scale NFS Environment

Gregory Touretsky, INFINIDAT


Who am I?

1997-2015: IT Systems Engineer / Solutions Architect at a Fortune 100 company

As of 9/20: Product Manager at Infinidat


Introducing INFINIDAT

• Founded in 2011
• >100 patents
• Production deployments at Fortune 500 companies
• Over 200PB deployed in the field
• InfiniBox:
  • 99.99999% uptime
  • 2PB in 42U rack
  • Multi-protocol
  • 8kW max power consumption
  • 750k IOPS
  • Zero-wait RESTful API


INFINIDAT NAS

• >250,000 Spec SFS ops/sec at first release
• Infinidat Hyper-scale filesystem
• Scales to billions of files per FS
• 4,000 100,000 max filesystems
• Virtually unlimited size
• N+2 architecture
• InfiniSnaps
• No performance impact


And now to my past life experience…


Looking for answers?


How big is the Haystack?

• Over 1,000 NFS file servers
• Over 35PB of configured capacity
• 125,000 NFS file systems
• Over 90,000 NFS clients
• 103B files in the Unified Name Space


What are the Needles?

• Comments within code
• Keywords
• Where is environment variable defined?
• Errors in logs
• Finding all references to a variable, method, or class throughout a project
• …


grep -R <text> <path>
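As a rough illustration of why this baseline struggles at scale: a recursive grep has to walk every directory and read every file on each invocation, so over NFS its cost grows with the amount of data rather than with the number of matches. A minimal Python sketch of that behaviour follows; the function name `naive_grep` is hypothetical and not a tool from the talk.

```python
import os
from typing import Iterator, Tuple


def naive_grep(text: str, path: str) -> Iterator[Tuple[str, int, str]]:
    """Yield (file, line number, line) for every line containing `text`.

    Mimics what `grep -R <text> <path>` has to do: visit every file under
    `path` and read it in full, every time a search is run.
    """
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                with open(full, "r", errors="replace") as handle:
                    for lineno, line in enumerate(handle, start=1):
                        if text in line:
                            yield full, lineno, line.rstrip("\n")
            except OSError:
                # Unreadable files are skipped rather than aborting the scan.
                continue
```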

Is there a better way?


Requirements and basic assumptions

• NAS appliance = “black box”
• Indexing SLA = 24 hours
• Avoid NAS DoS due to indexing
• Differentiate scanning load by NAS type
• Reuse Search “hints”: owner, time, project, …
• Future: one interface to search through all data sources (SharePoint, Wiki, Web, DB, NFS, …)


Considered solutions

• Enterprise search (Google Appliance)
  • Cost
  • NFS integration
• Open source – Solr
  • Concerns with scalability, supportability
  • Less applicable for other potential use cases (logs, etc)
  • Missing scalable crawler
• Open source – ElasticSearch
  • Missing scalable crawler


Implementation architecture

[Architecture diagram. Labels: Mount points and FS metadata; Whitelist and blacklist; Scale-out Crawlers; ElasticSearch; Content for indexing; NFS search; Data Virtualization layer; Readers @ NB; List of files for indexing (changes since previous scan, include/exclude); Scale-out Readers]

Ensure fast processing without performance impact on file servers
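The flow named in the diagram labels can be sketched roughly as follows: a crawler lists files that changed since the previous scan and pass the include/exclude rules, and a reader turns each listed file into a document for indexing. This is a minimal single-process sketch, not the implementation from the talk. The names `crawl_changed` and `read_document`, the glob-based whitelist/blacklist format, and the document fields are all assumptions.

```python
import fnmatch
import os
from typing import Dict, Iterator, List


def crawl_changed(root: str, since: float,
                  whitelist: List[str], blacklist: List[str]) -> Iterator[str]:
    """Yield files under `root` modified after `since` (epoch seconds)."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            # Assumed rule: the blacklist wins, then the whitelist must match.
            if any(fnmatch.fnmatch(path, pat) for pat in blacklist):
                continue
            if not any(fnmatch.fnmatch(path, pat) for pat in whitelist):
                continue
            try:
                if os.stat(path).st_mtime > since:
                    yield path
            except OSError:
                continue  # file vanished between listing and stat


def read_document(path: str, max_bytes: int = 1_000_000) -> Dict[str, object]:
    """Read one file into a dict that an indexer could consume."""
    info = os.stat(path)
    with open(path, "rb") as handle:
        # Only the first `max_bytes` are read, to bound load on the file server.
        text = handle.read(max_bytes).decode("utf-8", errors="replace")
    return {"path": path, "owner_uid": info.st_uid,
            "mtime": info.st_mtime, "text": text}
```

In a scale-out deployment the two functions would run in separate pools, with the crawler output feeding a queue that readers consume.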


Pools of crawlers, readers

• One job per file system
  • Multi-threaded crawler
• Internal batch queues management system
  • Queue per file server
  • “MaxRunning” limit per file server
  • Dynamic vs. static limit
• Root access to NFS
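One way to picture the per-server “MaxRunning” limit is a semaphore per file server that caps how many crawl or read jobs touch that server at once. The sketch below is an assumption-laden illustration using static limits only. `ServerThrottle` and `run_jobs` are hypothetical names, and the talk's internal batch queue system is not described in enough detail to reproduce.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Tuple


class ServerThrottle:
    """Caps concurrently running jobs per file server (static limits)."""

    def __init__(self, max_running: Dict[str, int], default: int = 1) -> None:
        self._lock = threading.Lock()
        self._limits = dict(max_running)
        self._default = default
        self._slots: Dict[str, threading.BoundedSemaphore] = {}

    def _slot(self, server: str) -> threading.BoundedSemaphore:
        # Create the semaphore lazily, once per server.
        with self._lock:
            if server not in self._slots:
                limit = self._limits.get(server, self._default)
                self._slots[server] = threading.BoundedSemaphore(limit)
            return self._slots[server]

    def run(self, server: str, job: Callable[[], object]) -> object:
        with self._slot(server):  # blocks while the server is at its limit
            return job()


def run_jobs(jobs: Iterable[Tuple[str, Callable[[], object]]],
             throttle: ServerThrottle, workers: int = 8) -> List[object]:
    """Run (server, job) pairs on a shared pool, e.g. one job per file system."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(throttle.run, server, job) for server, job in jobs]
        return [f.result() for f in futures]
```

A waiting job still occupies a worker thread here, which a real queue manager would avoid. A dynamic variant would adjust each server's limit from observed load instead of a fixed table.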


Indexing considerations

• One index per filesystem
• White list, black list support
• Index updates
• ElasticSearch cluster configuration
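With one index per filesystem, each mount path needs a stable index name. Elasticsearch index names must be lowercase and may not contain characters such as `/`, spaces, or commas, so a path cannot be used directly. The sketch below shows one possible mapping. The function `index_name_for`, the `fs-` prefix, and the hash suffix are assumptions for illustration, not the naming scheme used in the talk.

```python
import hashlib
import re


def index_name_for(mount_path: str, prefix: str = "fs-") -> str:
    """Derive a stable, Elasticsearch-safe index name from a mount path."""
    # Lowercase and collapse every run of disallowed characters into "-".
    cleaned = re.sub(r"[^a-z0-9]+", "-", mount_path.lower()).strip("-")
    # The hash suffix keeps names distinct when different paths clean to
    # the same slug (for example "/nfs/a_b" and "/nfs/a-b").
    digest = hashlib.sha1(mount_path.encode("utf-8")).hexdigest()[:8]
    return f"{prefix}{cleaned[:100]}-{digest}"
```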


User interface

• WebUI
• CLI
  • Grep-like for backward compatibility
• Expose full Lucene interface
  • title:"The Right Way" AND text:go
  • "jakarta apache"~10
  • …
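A grep-like CLI can pass Lucene syntax through to Elasticsearch using the `query_string` query. The sketch below only builds the request body and performs no network I/O. The function name `build_search_body`, the `path` field, and the use of a `bool` query with a `prefix` filter are assumptions about how such a tool might be written against a recent Elasticsearch version, not details from the talk.

```python
from typing import Any, Dict, Optional


def build_search_body(lucene_query: str, path_prefix: Optional[str] = None,
                      size: int = 20) -> Dict[str, Any]:
    """Build an Elasticsearch search body from a Lucene-syntax query string."""
    must = [{"query_string": {"query": lucene_query}}]
    body: Dict[str, Any] = {"size": size, "query": {"bool": {"must": must}}}
    if path_prefix is not None:
        # Assumes each document stores its file path in a keyword field "path",
        # so a grep-style <path> argument becomes a prefix filter.
        body["query"]["bool"]["filter"] = [{"prefix": {"path": path_prefix}}]
    return body
```

For example, `build_search_body('"jakarta apache"~10', "/nfs/mymount")` yields a body that could be serialized with `json.dumps` and sent to the `_search` endpoint of the relevant index.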


Security aspects

• Index encryption
• ElasticSearch cluster access control
• Search access control – preserve NFS permissions
  • Initially – at the index (filesystem) granularity level
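Access control at index (filesystem) granularity can be pictured as a filter applied before any query runs: a user may search an index only if they are allowed on the corresponding filesystem. The sketch below assumes a precomputed table mapping each index to the group IDs permitted on its filesystem. The function `searchable_indices` and that table are hypothetical, and how the pilot derived permissions from NFS is not stated in the slides.

```python
from typing import Dict, Iterable, List, Set


def searchable_indices(user_gids: Iterable[int],
                       index_acl: Dict[str, Set[int]]) -> List[str]:
    """Return indices whose filesystem admits at least one of the user's groups."""
    groups = set(user_gids)
    # An index is searchable if the user's groups intersect its allowed groups.
    return sorted(name for name, allowed in index_acl.items() if groups & allowed)
```

This is coarser than per-file NFS permissions: anyone admitted to a filesystem can see matches from every indexed file in it.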


Pilot status

• One project’s data is indexed
  • 85 file systems / 71TB / 409M files
  • Index size: 17TB
• Access control
• UI
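A few ratios follow directly from the pilot figures: the index is roughly 24% of the source data, the average file is about 174 kB, and each file system holds about 4.8 million files on average. The short calculation below reproduces them, assuming decimal terabytes (the slide does not say whether TB or TiB is meant).

```python
TB = 10**12  # decimal terabytes: an assumption, the slide does not specify

data_bytes = 71 * TB
index_bytes = 17 * TB
files = 409_000_000
filesystems = 85

index_ratio = index_bytes / data_bytes   # index size relative to source data
avg_file_bytes = data_bytes / files      # mean file size in bytes
files_per_fs = files / filesystems       # mean number of files per file system

print(f"index/data ratio: {index_ratio:.1%}")
print(f"average file size: {avg_file_bytes / 1000:.0f} kB")
print(f"files per file system: {files_per_fs / 1e6:.1f}M")
```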


It works!


Before:
grep -R myKeyWord /nfs/mymount
<results … results … results>
32:41.11

After:
ESgrep myKeyWord /nfs/mymount
<results … results … results>
0:00.49

ESgrep myKeyWord
<results … results … results>
0:00.52
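To put the timings in proportion, the elapsed values can be converted to seconds and compared. The sketch assumes the figures are in minutes:seconds format, as printed by common shell timing output. The helper `to_seconds` is a name introduced here for illustration.

```python
def to_seconds(elapsed: str) -> float:
    """Convert an 'M:SS.ss' elapsed time to seconds."""
    minutes, seconds = elapsed.split(":")
    return int(minutes) * 60 + float(seconds)


grep_time = to_seconds("32:41.11")      # recursive grep over one mount
esgrep_scoped = to_seconds("0:00.49")   # ESgrep restricted to the same mount
esgrep_global = to_seconds("0:00.52")   # ESgrep with no path argument

speedup_scoped = grep_time / esgrep_scoped
speedup_global = grep_time / esgrep_global

print(f"scoped search: about {speedup_scoped:.0f}x faster")
print(f"unscoped search: about {speedup_global:.0f}x faster")
```

Under that assumption the indexed search is roughly 4,000 times faster for the same mount. The unscoped run is not a like-for-like comparison, since it has no path argument and the grep baseline covered only one mount.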


What if NAS is not a “black box”?


Find changes within the filesystem

• Indexing
• Data replication (rsync, etc)
• Data aging / recycling
• Controlled technology
• Antivirus
• Backup
• …


CRAWLING


Find changes within the filesystem

• Future directions for better integration:
  • NAS to provide fast diff from last index (Faster, prevents network RTT)
  • Move from “pull” to “push” (Improve SLA)


CRAWLING
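What a "fast diff from last index" would hand to the indexer is a list of added, modified, and deleted paths. The sketch below computes that list from two in-memory listings of path to (mtime, size). A NAS that is not a black box could produce the same result internally, for example from snapshots, without a client-side crawl. The names `Diff` and `diff_listings` and the listing format are assumptions.

```python
from typing import Dict, List, NamedTuple, Tuple

FileMeta = Tuple[float, int]  # (mtime in epoch seconds, size in bytes)


class Diff(NamedTuple):
    added: List[str]
    modified: List[str]
    deleted: List[str]


def diff_listings(previous: Dict[str, FileMeta],
                  current: Dict[str, FileMeta]) -> Diff:
    """Compare two filesystem listings and report what changed between them."""
    added = sorted(p for p in current if p not in previous)
    deleted = sorted(p for p in previous if p not in current)
    # A file counts as modified when its mtime or size differs.
    modified = sorted(p for p in current
                      if p in previous and current[p] != previous[p])
    return Diff(added, modified, deleted)
```

Added and modified paths go to the readers for re-indexing, and deleted paths are removed from the index.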


Do you always know the path to your data?


NAS = Network Attached… Search?

• Search by content
• Search by file name
• Search by owner
• Search by type
• …


Questions
