H-Hypermap - Heatmap Analytics at Scale: Presented by David Smiley, D W Smiley LLC

Post on 16-Apr-2017


October 11-14, 2016 • Boston, MA

H-Hypermap: Heatmap Analytics at Scale

David Smiley
Freelance Search Developer/Consultant

About: David Smiley
• Software Engineer (16 years)
• Search (7 years)
• Java (full-stack), Web, Spatial
• Freelance search consultant / developer
• Apache Lucene / Solr committer & PMC
• Wrote first book on Solr, updated twice

Agenda
• About this project
• Architecture
• Solr & time sharding
• Experiences with:
  – Kotlin, Dropwizard, Swagger
  – Kafka
  – Docker, Kontena
• Solr for geo-enrichment
• Solr adapter for Lucene BKD Lat-Lon point search & sort
• Heatmaps
  – Existing functionality
    • demo
  – New functionality

H-Hypermap / BOP
• Harvard University, CGA: Center for Geospatial Analysis http://gis.harvard.edu
• Harvard Hypermap Project
  – Managed by Ben Lewis
• BOP “Billion Object Platform”
  – Funded by the Sloan Foundation

BOP Requirements Summary
• Most recent ~billion geo-tweets
• Realtime search (<5 sec latency)
• Sub-second queries
  – Including heatmaps!
• On the cheap: ~6 mediocre boxes

Provide a proof-of-concept platform designed to lower the barrier for researchers who need to access big streaming spatio-temporal datasets.

Logical High-Level Architecture

[Diagram components: Harvesting, Enrichment, Archival, Realtime; Archival and Realtime each serve various clients]

• Data flows via Apache Kafka
• Systems expose HTTP web services

“BOP”

[Diagram components: the BOP Kafka topic, Ingester, ZooKeeper, Web-Service, and Solr shards W51, W52, W53, W54, ..., RT]

Kafka Streams
• Create Solr doc
• Routes to shard

REST/JSON API
• Keyword search
• Faceting
• Heatmaps
• CSV export

BOP Solr Sharding Architecture

[Diagram: a lone Realtime collection/shard (1-25 hrs; copy then delete, at night) and the collections G_North_America and G_Elsewhere, each with time shards T2016_05_20, T2016_05_06, T2016_04_22, T2016_04_08, … covering 4-5 mo.]

• Realtime shard is where realtime search happens. No caches, but small.
• Primary collections have useful caches
• Housekeeping tasks:
  – Move data from RT to primary
  – Create new shards; expire old
  – Merge/optimize shards
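Moving data from the realtime shard into the primary collections requires choosing a time shard for each document. The following is a minimal Python sketch of one way to do that, assuming each shard name encodes its start date and a shard covers everything up to the next shard's start. The function name `time_shard_for` and the routing rule itself are illustrative assumptions, not the project's actual code.

```python
from bisect import bisect_right
from datetime import date, datetime

# Shard start dates taken from the shard names shown in the diagram.
# Assumption: each shard covers documents from its start date up to the next start.
SHARD_STARTS = [date(2016, 4, 8), date(2016, 4, 22), date(2016, 5, 6), date(2016, 5, 20)]


def time_shard_for(created_at: datetime) -> str:
    """Return the name of the time shard whose range contains created_at (hypothetical rule)."""
    # Index of the latest shard start that is <= the document's date.
    idx = bisect_right(SHARD_STARTS, created_at.date()) - 1
    if idx < 0:
        raise ValueError("timestamp predates the oldest shard")
    return SHARD_STARTS[idx].strftime("T%Y_%m_%d")


print(time_shard_for(datetime(2016, 5, 10, 12, 0)))
```

Keeping the start dates in a sorted list makes creating a new shard or expiring an old one a matter of appending or dropping an entry.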

Building a Search Web-Service
• Kotlin language (JVM based)
  – Nullity as first-class language feature
• Dropwizard framework
  – Designed for web-services
• Swagger
  – Dynamically generated dev UI for web-services

Apache Kafka
• Kafka: a scalable message/queue platform
• See new Kafka Streams & Kafka Connect APIs
• No back-pressure; can be a challenge
• Non-obvious use:
  – For storage; time partitioning
• Lots of benefits yet serious limitations

Docker
• Easy to find/try/use software
  – No installation
  – Simplified configuration (env variables)
  – Common logging
  – Isolated
• Ideal for:
  – Continuous integration servers
  – Trying new software
  – Production advantages
• But “new”

Docker in Production
• I use “Kontena”
• Common logging, machine/proc stats, security
  – VPN to secure network; access everything as local
• No longer need to care about:
  – Ansible, Chef, Puppet, etc.
  – Security at network or proxy; not service specific
• Challenges: state & big-data

Enrichment

[Diagram: Kafka Topic → Enrich → Kafka Topic]

• Twitter Sentiment Classifier
• Geo: Solr with regional polygons & metadata
• Geo: Query Solr via spatial point query; attach related metadata to tweet

Solr for Geo Enrichment
• Tweets (docs) can have a geo lat/lon
• Enrich tweet with Country, State/Province, …
  – Gazetteer lookup (point-in-polygon)

Data Set                      Features   Raw size   Index time   Index size
Admin2                        46,311     824 MB     510 min      892 MB
US States                     74,002     747 MB     4.9 min      840 MB
Massachusetts Census Blocks   154,621    152 MB     5.9 min      507 MB
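To make the gazetteer lookup concrete, here is a small Python sketch of point-in-polygon enrichment. It is a brute-force illustration only: the project delegates this to Solr's spatial index, whereas this sketch uses a plain ray-casting test over a hypothetical in-memory list of regions (`REGIONS`, `enrich`, and the rectangle data are all made up for the example).

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting test; ring is a list of (lon, lat) vertices of a simple polygon."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point can be crossed.
        if (y1 > lat) != (y2 > lat):
            # Longitude at which this edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Hypothetical gazetteer: two adjacent rectangular regions with metadata.
REGIONS = [
    {"name": "Region A", "ring": [(0, 0), (10, 0), (10, 10), (0, 10)]},
    {"name": "Region B", "ring": [(10, 0), (20, 0), (20, 10), (10, 10)]},
]


def enrich(tweet):
    """Return a copy of the tweet with the first containing region's name attached, if any."""
    enriched = dict(tweet)
    for region in REGIONS:
        if point_in_polygon(tweet["lon"], tweet["lat"], region["ring"]):
            enriched["region"] = region["name"]
            break
    return enriched


print(enrich({"text": "hello", "lon": 15, "lat": 5}))
```

A linear scan like this would not scale to the feature counts in the table above, which is the motivation for indexing the polygons in Solr.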

Fast Point-in-Polygon Tricks

Index/Config
• Optimize to 1 segment
• RptWithGeometrySpatialField
  – precisionModel="floating_single"
  – autoIndex="true"
• <cache name="perSegSpatialFieldCache_WKT" …

Search
• Embed Solr (in-process)
• Use docValues, not stored
  – fl=block:field(GEOID10)

Query like this:
• q={!field cache=false f=WKT}Intersects(POINT($lon $lat))

Sub-Millisecond!
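As an illustration of the query shape on this slide, the Python sketch below fills in the `$lon` and `$lat` placeholders and assembles the request parameters. The `q` and `fl` values follow the slide; the function name, the `rows` value, and the use of URL-encoded parameters are assumptions made for the example (the slide recommends embedding Solr in-process rather than going over HTTP).

```python
from urllib.parse import urlencode


def point_in_polygon_params(lon: float, lat: float) -> dict:
    """Build Solr parameters for a point-in-polygon lookup against the WKT field."""
    return {
        # Field query parser, with the filter cache bypassed as on the slide.
        "q": "{!field cache=false f=WKT}Intersects(POINT(%s %s))" % (lon, lat),
        # Return the block id from docValues rather than from a stored field.
        "fl": "block:field(GEOID10)",
        # Assumption: only one match is wanted per point.
        "rows": 1,
    }


params = point_in_polygon_params(-71.06, 42.36)
query_string = urlencode(params)
print(query_string)
```

Note that WKT points are written as longitude first, then latitude, which is why the parameters are ordered that way.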

Lucene “LatLonPoint”
• Uses new PointValues (BKD index) in Lucene 6
• Fastest: http://home.apache.org/~mikemccand/geobench.html
• Presently in Lucene sandbox module
• Some limitations: WGS84 points only
• Credit to Rob Muir and Mike McCandless

Solr Adapter For LatLonPoint
• New Solr FieldType for Lucene LatLonPoint
  – Filter points by circle, rect, polygon
  – Distance sort; but no boosting

Coming soon! Solr 6.4?

Heatmaps: Spatial Grid Faceting
• Spatial density summary grid faceting, also useful for point-plotting search results
• Lucene & Solr APIs
• Scalable & fast usually…
• Usually rendered with a gradient radius
• See: http://spacemansteve.github.io/leaflet-solr-heatmap/example/index.html

How-to: Heatmaps
• On an RPT field
    geo="false"
    worldBounds="ENVELOPE(-180, 180, 180, -180)"
    prefixTree="packedQuad"
• Query:
    /select?facet=true
      &facet.heatmap=geo_rpt
      &facet.heatmap.geom=["-180 -90" TO "180 90"]
      &facet.heatmap.format=ints2D or png

// Normal Solr response...
"facet_counts":{
  ... // facet response fields
  "facet_heatmaps":{
    "geo_rpt":[
      "gridLevel",2,
      "columns",32,
      "rows",32,
      "minX",-180.0,
      "maxX",180.0,
      "minY",-90.0,
      "maxY",90.0,
      "counts_ints2D", [null, null, [0, 1, ... ]]
      ...
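A client has to unpack this response before rendering it. The Python sketch below shows one way to do so, assuming the heatmap arrives as the flat name/value list shown above, that a `null` row stands for a row of zeros, and that rows run top-down (row 0 at `maxY`), which is my understanding of Solr's documented layout. The helper names and the small 4x2 sample grid are invented for the example.

```python
def parse_heatmap(flat):
    """Turn the flat [name, value, name, value, ...] list into a dict."""
    return dict(zip(flat[0::2], flat[1::2]))


def dense_grid(heatmap):
    """Expand counts_ints2D so that null rows become rows of zeros."""
    columns = heatmap["columns"]
    rows = heatmap["counts_ints2D"] or []  # the whole grid may be null when empty
    return [row if row is not None else [0] * columns for row in rows]


def cell_bounds(heatmap, row, col):
    """Return (minX, minY, maxX, maxY) of a cell, assuming row 0 is the top row."""
    width = (heatmap["maxX"] - heatmap["minX"]) / heatmap["columns"]
    height = (heatmap["maxY"] - heatmap["minY"]) / heatmap["rows"]
    min_x = heatmap["minX"] + col * width
    max_y = heatmap["maxY"] - row * height
    return (min_x, max_y - height, min_x + width, max_y)


# Made-up sample in the same shape as the response above, shrunk to 4 columns x 2 rows.
sample = ["gridLevel", 1, "columns", 4, "rows", 2,
          "minX", -180.0, "maxX", 180.0, "minY", -90.0, "maxY", 90.0,
          "counts_ints2D", [None, [0, 1, 0, 2]]]

heatmap = parse_heatmap(sample)
grid = dense_grid(heatmap)
print(grid)
print(cell_bounds(heatmap, 1, 3))
```

The cell bounds are what a map layer needs in order to place each count, for example as the centre of a gradient circle.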

New HeatmapSpatialField
• Why?
  – With new BKD/PointValues, no “RPT” field to use
  – Scalable for heatmaps; don’t worry about search
    • Scalable at all resolutions; many millions of docs/shard
  – Can be specific about grid resolutions

Coming soon! Solr 6.4?

Heatmaps with Stats
• Instead of counting docs; calculate a metric
  – Ex: avg(minuteOfDay)
• Will require JSON Facet API
• Inherently slower than just doc counts

Coming soon! Solr 6.4?
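The idea of computing a metric per cell rather than a count can be sketched outside Solr. The Python example below averages a numeric field per grid cell over in-memory documents. It only illustrates the concept and says nothing about how the planned Solr feature is implemented; the function name, the document layout, and the assumption that every point lies within the given bounds are mine.

```python
from collections import defaultdict


def grid_average(docs, min_x, max_x, min_y, max_y, columns, rows, field):
    """Average `field` per grid cell; returns {(row, col): mean}, with row 0 at the top."""
    cell_w = (max_x - min_x) / columns
    cell_h = (max_y - min_y) / rows
    sums = defaultdict(float)
    counts = defaultdict(int)
    for doc in docs:
        # Clamp so that points exactly on the max edge fall into the last cell.
        col = min(int((doc["lon"] - min_x) / cell_w), columns - 1)
        row = min(int((max_y - doc["lat"]) / cell_h), rows - 1)
        sums[(row, col)] += doc[field]
        counts[(row, col)] += 1
    return {cell: sums[cell] / counts[cell] for cell in sums}


# Made-up documents for illustration.
docs = [
    {"lon": -90, "lat": 45, "minuteOfDay": 600},
    {"lon": -100, "lat": 10, "minuteOfDay": 720},
    {"lon": 100, "lat": -45, "minuteOfDay": 60},
]
averages = grid_average(docs, -180, 180, -90, 90, columns=2, rows=2, field="minuteOfDay")
print(averages)
```

Each document's field value has to be read to compute the mean, which matches the slide's point that this is inherently slower than counting.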

Final Remarks
• Open-Source
  – https://github.com/dsmiley/hhypermap-bop
• In-progress
• Improvements to Solr expected to be available before December; officially in Solr 6.4.
