linked data infrastructure soma - the insight centre for data … · 2016-02-23 · distributed...

Soma: Linked Data Infrastructure

Post on 12-Jun-2020

Page 1: Linked Data Infrastructure Soma - The Insight Centre for Data … · 2016-02-23 · Distributed Application Scheduling ... “dockerizing” you gitlab project 4. Once accepted, every

Soma: Linked Data Infrastructure

What is Soma?

It’s Big Data Candy for the Cloud.

The Soma platform helps data scientists collaborate to discover and share new facts from large datasets hosted on shared infrastructure.

All this while lowering development and operations costs.

Meet our Customers

Expert
See themselves as “experts” or authorities on a subject. Want the big picture and like easy-to-use specialised applications with great visualisation.

Creative
People who see themselves as data “artists”. Need to explain the meaning of the data. Good generalists who can code, with a flair for the visual or data narrative.

Engineer
See themselves as “engineers”. Focused on the technical problem of managing data: how to get it, store it, and learn from it. Normally strong software developers with some O/R statistics.

Researcher
See themselves as “scientists”. People with a deep academic background in maths, machine learning & modeling complex processes. Reluctant coders.

Customers we support now

Creative
Need to explain the meaning of the data. Good generalists who can code, with a flair for the visual or data narrative.

Engineer
Focused on the technical problem of managing data. Normally strong software developers.

Researcher
People with a deep academic background in science, maths, machine learning. Reluctant coders.

What we deliver to customers

Creative
Now:
● Gitlab integration
● from gitlab
● Web facing applications

Researcher
Now:
● Discovery early adopters
Early September:
● Discovery platform rollout

Engineer
Now:
● Big Data Cluster
● Container Management
November:
● Storage frameworks

Features

Fully operational big data station
Right Now
Mesos based Cloud O/S
● Cluster of 88 CPUs, 295 GB of memory
● Distributed Application Scheduling
● Resource Scheduling
Container Management
DNS service discovery

Deployment

● Gitlab
● Mesos Cluster
● Zookeeper Cluster
● HDFS Cluster
● Integrated DNS
● CI servers
● Docker Registry

Deeper Dive

Gitlab
● All applications MUST be in gitlab

Mesos Cluster and Container Manager
● Let’s have a look at what is running right now:

Lambda architecture

“can mix both batch and real-time processing”

“process at batch and real-time Velocity”
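
As a rough illustration of how a Lambda architecture mixes batch and real-time processing, the sketch below keeps a periodically recomputed batch view alongside an incremental real-time view and merges the two at query time. The class name `LambdaCounter`, its methods, and the "count events per key" workload are hypothetical and are not taken from Soma.

```python
from collections import Counter


class LambdaCounter:
    """Toy Lambda architecture that counts events per key (hypothetical)."""

    def __init__(self):
        self.master_log = []         # append-only record of every event
        self.batch_view = Counter()  # recomputed from the full log by the batch layer
        self.speed_view = Counter()  # incremental counts since the last batch run

    def ingest(self, key):
        # Every event is stored in the master log and also updates the speed layer.
        self.master_log.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        # Batch layer: recompute the view from the whole log, then clear the
        # speed layer because the batch view now covers those events.
        self.batch_view = Counter(self.master_log)
        self.speed_view = Counter()

    def query(self, key):
        # Serving layer: merge the batch view with the real-time view.
        return self.batch_view[key] + self.speed_view[key]


counter = LambdaCounter()
for event in ["a", "a", "b"]:
    counter.ingest(event)
counter.run_batch()    # batch view now holds a=2, b=1
counter.ingest("a")    # arrives after the batch run, held by the speed layer
```

Here `counter.query("a")` combines the two events covered by the batch run with the one that arrived afterwards.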

Data sources

Features

Source Control Management
Continuous Deployment
Service Monitoring
Always available key datasets
● DBpedia
● Semantic Web Dog Food
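
Linked datasets such as DBpedia are typically queried with SPARQL over HTTP. The sketch below only builds the request URL for such a query. The endpoint address is a placeholder, since the slides do not say where Soma exposes its hosted datasets, and the helper name `build_sparql_url` is hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Placeholder endpoint: the real Soma address is not given in the slides.
ENDPOINT = "http://soma.example.org/sparql"

# Ask for a few labels of one DBpedia resource.
QUERY = """
SELECT ?label WHERE {
  <http://dbpedia.org/resource/Galway>
      <http://www.w3.org/2000/01/rdf-schema#label> ?label .
} LIMIT 5
""".strip()


def build_sparql_url(endpoint, query):
    """Return a GET URL carrying the query in the standard `query` parameter."""
    return endpoint + "?" + urlencode({"query": query})


url = build_sparql_url(ENDPOINT, QUERY)
# The result format would normally be requested with an Accept header,
# for example "application/sparql-results+json".
```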

Continuous Deployment

1. Have a gitlab account
2. Ask Research Ops to add the Soma Role to your project
3. If you are accepted, you will be guided through “dockerizing” your gitlab project
4. Once accepted, every push to your master branch will be deployed and accessible online through Soma.
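
The last step above, deploying on every push to master, can be sketched as a small filter over incoming push events. The slides do not describe how Soma implements this, so the sketch assumes a service that receives GitLab push webhook payloads (which carry `object_kind` and `ref` fields). The function name `should_deploy` is hypothetical.

```python
def should_deploy(event, branch="master"):
    """Return True when a GitLab push event targets the deployment branch."""
    return (
        event.get("object_kind") == "push"
        and event.get("ref") == f"refs/heads/{branch}"
    )


# Example payload fragments (only the fields the filter looks at).
push_to_master = {"object_kind": "push", "ref": "refs/heads/master"}
push_to_feature = {"object_kind": "push", "ref": "refs/heads/my-feature"}
tag_event = {"object_kind": "tag_push", "ref": "refs/tags/v1.0"}
```

Only `push_to_master` passes the filter; pushes to other branches and tag events are ignored.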

Features

Integrated Discovery platform
SOMA Discover - hosted discovery tool based on the Smarter Data project, allowing exploration of data and sharing of results.

Other internal tools such as Sig.ma, Social Lens, and other projects to follow.

Goals for Research Ops

Nurture a Data Engineering community at Insight with supportive experts, shared tools & best practices

Provide a Shared analytics platform for Data Scientists at Insight (Soma)

Encourage new research and engagements with the wider big data analytics research community

Nurture
● Provide a structured approach to managing and releasing all Engineering IP (Code and Data) at Insight
  ○ Source control (Git)
  ○ Release management
  ○ Assist in IP management
● Provide Quality Circles for Engineering practices
  ○ 2 Groups - Data Visualisation & Big Data. Workshops to commence this month.

Provide
● Build big data infrastructure for Insight
  ○ Soma platform
● Support ongoing Hadoop development
  ○ Hadoop clusters, Dataspace support
● Support Ad Hoc projects requiring scale
  ○ Cancer atlas
● Provide “Big Data” Expertise to the Linked Data group
  ○ Hadoop, Yarn, Mesos, Spark, Dataspace, Mongo and Virtuoso

Problems being met

● High cost in research when data scales to “Big Data” [P1]
  ○ Ad Hoc Maintenance of big data sets is expensive [P2]
  ○ Development complexity of valuable Big Data jobs is prohibitive [P3]
● The high cost of Operating Big Data infrastructure [P4]
  ○ Scarcity of hardware and lack of funds for new Hardware [P5]
  ○ Inability to maintain a core operations team [P7]
● Missed opportunity for researchers to collaborate [P6]

Soma serving our customers

Soma Create - Serves data fresh from the source. Has large queryable datasets that are both highly available & up-to-date. Has a service to mash these up.

Soma Engineer - Provides a Lambda architecture consuming, cleaning, processing and loading the data to the data layer.

Soma Discover - Useful blocks of processing that can be connected together using a nice GUI. Works with many datastores.

Soma Expert - Vertical applications solving real-world problems. These apps are built by Insight’s Data Researchers and Data Creatives.
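
The Soma Discover idea of processing blocks that are connected together can be sketched as plain function composition. This is only an illustration of the concept: the block names, the record shape, and the `pipeline` helper are hypothetical and do not reflect the actual Soma Discover implementation or its GUI.

```python
from functools import reduce


def pipeline(*blocks):
    """Connect blocks left to right: the output of one feeds the next."""
    return lambda data: reduce(lambda acc, block: block(acc), blocks, data)


# Hypothetical blocks. The first two take and return a list of records.
def drop_missing(records):
    return [r for r in records if r.get("value") is not None]


def scale(factor):
    return lambda records: [{**r, "value": r["value"] * factor} for r in records]


def total(records):
    # Terminal block: reduces the records to a single number.
    return sum(r["value"] for r in records)


flow = pipeline(drop_missing, scale(2), total)
result = flow([{"value": 1}, {"value": None}, {"value": 3}])
```

Because every block has the same shape (data in, data out), blocks can be rearranged or swapped without touching the others, which is what makes a drag-and-connect interface feasible.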

The 4 kinds of Data Scientist

Expert
See themselves as “experts” or authorities on a subject. Want the big picture and like easy-to-use specialised applications with great visualisation.

Creative
People who see themselves as data “artists”. Need to explain the meaning of the data. Good generalists who can code, with a flair for the visual or data narrative.

Engineer
See themselves as “engineers”. Focused on the technical problem of managing data: how to get it, store it, and learn from it. Normally strong software developers with some O/R statistics.

Researcher
See themselves as “scientists”. People with a deep academic background in maths, machine learning & modeling complex processes. Reluctant coders.

Goals

Soma to be a complete ecosystem to help researchers deliver “Big Data” distributed applications

Showcase Insight expertise
Standardize best practices for linked data at big data scales
Delivers targeted applications & tools
Tools to build complex analytics apps & job management

Distributed O/S (Better than cloud)

● We use Mesos based infrastructure to provide
  ○ Scheduling of process execution for Jobs/Applications across the cluster
  ○ Resource scheduling of the needed CPU/Memory/Storage for these applications
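
To make the idea of resource scheduling concrete, the sketch below places tasks on nodes with a simple first-fit rule based on free CPU and memory. This is a deliberate simplification and not how Mesos works internally (Mesos offers resources to frameworks, which then decide what to run). The `Node`, `Task`, and `schedule` names and the example sizes are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpus: float
    mem_gb: float


@dataclass
class Task:
    name: str
    cpus: float
    mem_gb: float


def schedule(tasks, nodes):
    """First-fit placement: returns {task name: node name, or None if unplaced}."""
    placement = {}
    for task in tasks:
        placement[task.name] = None
        for node in nodes:
            if node.cpus >= task.cpus and node.mem_gb >= task.mem_gb:
                node.cpus -= task.cpus      # reserve the resources on this node
                node.mem_gb -= task.mem_gb
                placement[task.name] = node.name
                break
    return placement


nodes = [Node("n1", 4, 16), Node("n2", 8, 32)]
tasks = [Task("web", 2, 4), Task("spark", 6, 24), Task("db", 4, 16)]
placement = schedule(tasks, nodes)
```

In this run `web` fits on `n1`, `spark` needs `n2`, and `db` stays unplaced because neither node has 4 CPUs left.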

SOMA Discover (Data)

Where we are now

What we have
Soma Engineer - Standard Mesos platform - Provides a Lambda architecture consuming, cleaning, processing and loading the data to the data layer.

Soma Discover - Smarter Data - an interactive, expressive query tool that creates data blocks & visualisations

What we need help on
Soma Expert - Pivoty - a medical index built from standard HCLS datasets that uses a Pivot Browser

Soma Create - The Insight Standard Dataset - a shared, queryable standard set of big-data sources