Everybody Into the Data Lake! Gail Radecki, CHCP, American Academy of Allergy Asthma & Immunology Ezra Wolfe, EthosCE Devin Zuczek, EthosCE Drupal as a Data Warehouse


Post on 02-Aug-2020


Page 1: Drupal as a Data Warehouse · Drupal as a Data Warehouse. ... The Importance of Outcomes Medical Specialty Society Needs assessment Accreditation. We need a solution! My name is Ezra

Everybody Into the Data Lake!

Gail Radecki, CHCP, American Academy of Allergy Asthma & Immunology

Ezra Wolfe, EthosCE

Devin Zuczek, EthosCE

Drupal as a Data Warehouse

Page 2

“Above all else, show the data.”

Edward Tufte

Page 3

My name is Gail…

● Pushy

● A know-just-enough

● Annoying

...and I’m one of “those” customers.

Page 4

The World of Continuing Medical Education

● Accreditation/Compliance

● Needs Assessment

● Grants Reconciliation

Page 5

Data reporting...stayed the same

Page 6

This is where Community comes in handy

● EthosCE User Group

● Support tickets

● Online Community

Page 7

The Importance of Outcomes

● Medical Specialty Society

● Needs assessment

● Accreditation

Page 8

We need a solution!

Page 9

My name is Ezra.

I am the product manager.

This is me with a bad haircut.

Page 10

The Product:

● 65+ hospitals, health systems, associations

● Almost 1 million learner accounts

● 7.9 million course enrollments

Page 11

“All happy customers are alike; each unhappy customer is unhappy about their reports.”

Leo Tolstoy

Page 12

Existing process

1. Customer requests a new report

2. Requirements

3. Development feasibility

4. Back and forth with customer

5. Make the view, put it in code

6. Change management, code review, documentation, tests

7. Custom report is deployed! We are done!

8. Repeat steps #1-7 again because "that" customer forgot something

9. Repeat steps #1-7 again later because of a product schema change

Page 13

My name is Devin and I am a systems architect.

Throwback to PHP 4?

This is a duck.

Page 14

Big data!!!

Phase 1: Big data

Phase 2: ?

Phase 3: Profit!

Page 15

Making a plan

● Previous solutions involved a suite of modules: Views, Homebox, Charts + HighCharts, Views Data Export.

● Building our own tool would have been a distraction from our core business — we are not data scientists.

● We needed a tool that we could give to customers to report on their own data instead of us doing it.

Page 16

Vendor selection

● Vendors don’t know about...

  ○ Drupal

  ○ PHP serialization

  ○ Webform

● It’s your job to do your due diligence and ensure you select the correct system

Page 17

BYODW?

(Bring your own data warehouse?)

Page 18

Did I mention where I work??

● Non-profit

● Tech maintenance vs. adding features

● We have to justify EVERYTHING

Page 19

Page 20

First steps

● Could we point a tool at Drupal and have it report out of the box?

● Do we need a data warehouse?

To answer the question of why reporting on Drupal data in its native form is not optimal, we have to look at how the data is stored.

Page 21

DBA 101: Tables

U. Corp. changed their name. Making this change requires...

people

user_id  name         employer_name
1        Barry Cuda   U. Corp.
2        Abby Normal  Gekko & Co.
3        Rita Book    U. Corp.
4        Ray O’Sun    U. Corp.

Page 22

DBA 101: Tables

U. Corp. changed their name. Making this change requires...

3 row updates. Not good at scale, as this will lock those rows for editing!

people

user_id  name         employer_name
1        Barry Cuda   Initech
2        Abby Normal  Gekko & Co.
3        Rita Book    Initech
4        Ray O’Sun    Initech
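The rename can be tried out in a throwaway database. This is only a sketch using Python's built-in sqlite3 module (the slides do not name a database for this example), with the table and values copied from the slide.

```python
import sqlite3

# In-memory database holding the flat "people" table from the slide.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (user_id INTEGER PRIMARY KEY, name TEXT, employer_name TEXT)"
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [
        (1, "Barry Cuda", "U. Corp."),
        (2, "Abby Normal", "Gekko & Co."),
        (3, "Rita Book", "U. Corp."),
        (4, "Ray O'Sun", "U. Corp."),
    ],
)

# Renaming the employer has to rewrite every row that repeats the name.
cur = conn.execute(
    "UPDATE people SET employer_name = ? WHERE employer_name = ?",
    ("Initech", "U. Corp."),
)
rows_touched = cur.rowcount
print(rows_touched)
```

The number of rows written grows with the number of employees, which is the scaling problem the slide points at.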

Page 23

Normalizing

Eliminate columns with duplicate data by creating separate tables, and identify that data with a key.

Move data that does not depend on the primary key into its own table.

People (before)

uid  name         employer_name
1    Barry Cuda   Initech
2    Abby Normal  Gekko & Co
3    Rita Book    Initech

People (after; the eid column is crossed out on the slide)

uid  name         eid
1    Barry Cuda   X
2    Abby Normal  X
3    Rita Book    X

Employer data

eid  name
5    Initech
6    Gekko & Co.

Employer relation

uid  eid
1    5
2    6
3    5
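A sketch of the normalized layout, again assuming Python's sqlite3 purely for illustration. The table names people, employer, and employer_relation are stand-ins for the slide's "People", "Employer data", and "Employer relation"; the employer starts out as "U. Corp." so the rename can be shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE people (uid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employer (eid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employer_relation (uid INTEGER, eid INTEGER);
    INSERT INTO people VALUES (1, 'Barry Cuda'), (2, 'Abby Normal'), (3, 'Rita Book');
    INSERT INTO employer VALUES (5, 'U. Corp.'), (6, 'Gekko & Co.');
    INSERT INTO employer_relation VALUES (1, 5), (2, 6), (3, 5);
    """
)

# The employer name now lives in exactly one row, so the rename is one write.
cur = conn.execute("UPDATE employer SET name = 'Initech' WHERE eid = 5")
rows_touched = cur.rowcount

# Reading a person's employer back requires joins through the relation table.
names = conn.execute(
    """
    SELECT p.name, e.name
    FROM people p
    JOIN employer_relation r ON r.uid = p.uid
    JOIN employer e ON e.eid = r.eid
    ORDER BY p.uid
    """
).fetchall()
print(rows_touched, names)
```

Writes get cheaper, but every read now pays for the joins.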

Page 24

DBA 201: Normalizing

How do we efficiently store a user, full name, location, and employer?

User

user_id  user_name
1        john
2        jane

Profile

profile_id  user_id
2           1
3           2

Location

loc_id  province  country
3       PA        US
4       NJ        US

Profile location

profile_id  loc_id
2           3
3           4

Profile name

profile_id  fullname
2           John Smith
3           Jane Doe

Profile employer

profile_id  emp_id
2           6
3           7

Employer

employer_id  name
6            A, LLC.
7            D Corp

No data is unnecessarily dependent.

An update only requires 1 write.

...that includes the database!
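To see what this costs a report writer, here is a sketch that loads the seven tables above into SQLite (an assumption for illustration only) and rebuilds one flat row per user. The snake_case table names are made up from the slide's table titles.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE user (user_id INTEGER PRIMARY KEY, user_name TEXT);
    CREATE TABLE profile (profile_id INTEGER PRIMARY KEY, user_id INTEGER);
    CREATE TABLE location (loc_id INTEGER PRIMARY KEY, province TEXT, country TEXT);
    CREATE TABLE profile_location (profile_id INTEGER, loc_id INTEGER);
    CREATE TABLE profile_name (profile_id INTEGER, fullname TEXT);
    CREATE TABLE profile_employer (profile_id INTEGER, emp_id INTEGER);
    CREATE TABLE employer (employer_id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO user VALUES (1, 'john'), (2, 'jane');
    INSERT INTO profile VALUES (2, 1), (3, 2);
    INSERT INTO location VALUES (3, 'PA', 'US'), (4, 'NJ', 'US');
    INSERT INTO profile_location VALUES (2, 3), (3, 4);
    INSERT INTO profile_name VALUES (2, 'John Smith'), (3, 'Jane Doe');
    INSERT INTO profile_employer VALUES (2, 6), (3, 7);
    INSERT INTO employer VALUES (6, 'A, LLC.'), (7, 'D Corp');
    """
)

# Six joins are needed just to answer "who is this user and where do they work?"
report = conn.execute(
    """
    SELECT u.user_name, pn.fullname, l.province, l.country, e.name
    FROM user u
    JOIN profile p ON p.user_id = u.user_id
    JOIN profile_name pn ON pn.profile_id = p.profile_id
    JOIN profile_location pl ON pl.profile_id = p.profile_id
    JOIN location l ON l.loc_id = pl.loc_id
    JOIN profile_employer pe ON pe.profile_id = p.profile_id
    JOIN employer e ON e.employer_id = pe.emp_id
    ORDER BY u.user_id
    """
).fetchall()
print(report)
```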

Page 25

The problem

Operational database

● Fast writes

● Holds data

● Relational model not informative

● Built for integrity

Reporting database

● Fast, easy reads

● Holds information

● Relational model more informative to users

- E.F. Codd, "Further Normalization of the Data Base Relational Model"

Page 26

Another module

Page 27

Denormalizer module

This module reads the Drupal-defined schema to programmatically build denormalized tables or views for use in data warehousing flows.

Example: A node of type “chinchilla” with a rate field. We could also request the raw vote table.

1. Denormalized table name

2. Entity/table to denormalize

3. The primary key to use

4. The changed key to use

Page 28

SELECT * FROM...
developlocal.drupal_node dn_course
LEFT JOIN developlocal.drupal_field_data_field_course_rating_access field_course_rating_access
  ON field_course_rating_access.entity_type = 'node' AND field_course_rating_access.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_course_catalog field_course_catalog
  ON field_course_catalog.entity_type = 'node' AND field_course_catalog.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_course_date field_course_date
  ON field_course_date.entity_type = 'node' AND field_course_date.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_course_disclosure field_course_disclosure
  ON field_course_disclosure.entity_type = 'node' AND field_course_disclosure.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_course_event_date field_course_event_date
  ON field_course_event_date.entity_type = 'node' AND field_course_event_date.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_course_external_url field_course_external_url
  ON field_course_external_url.entity_type = 'node' AND field_course_external_url.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_course_format field_course_format
  ON field_course_format.entity_type = 'node' AND field_course_format.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_taxonomy_term_data field_course_format_tax
  ON field_course_format_tid = field_course_format_tax.tid
LEFT JOIN developlocal.drupal_field_data_field_course_image field_course_image
  ON field_course_image.entity_type = 'node' AND field_course_image.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_course_live field_course_live
  ON field_course_live.entity_type = 'node' AND field_course_live.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_course_location field_course_location
  ON field_course_location.entity_type = 'node' AND field_course_location.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_course_password field_course_password
  ON field_course_password.entity_type = 'node' AND field_course_password.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_course_rating field_course_rating
  ON field_course_rating.entity_type = 'node' AND field_course_rating.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_course_transcript field_course_transcript
  ON field_course_transcript.entity_type = 'node' AND field_course_transcript.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_enrollment_requirement_min field_enrollment_requirement_min
  ON field_enrollment_requirement_min.entity_type = 'node' AND field_enrollment_requirement_min.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_requirements_max field_requirements_max
  ON field_requirements_max.entity_type = 'node' AND field_requirements_max.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_requirements_min field_requirements_min
  ON field_requirements_min.entity_type = 'node' AND field_requirements_min.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_show_child_catalog field_show_child_catalog
  ON field_show_child_catalog.entity_type = 'node' AND field_show_child_catalog.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_show_child_transcript field_show_child_transcript
  ON field_show_child_transcript.entity_type = 'node' AND field_show_child_transcript.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_og_group_ref og_group_ref
  ON og_group_ref.entity_type = 'node' AND og_group_ref.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_accme_data field_accme_data
  ON field_accme_data.entity_type = 'node' AND field_accme_data.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_activity_format field_activity_format
  ON field_activity_format.entity_type = 'node' AND field_activity_format.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_course_level field_course_level
  ON field_course_level.entity_type = 'node' AND field_course_level.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_course_topic field_course_topic
  ON field_course_topic.entity_type = 'node' AND field_course_topic.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_custom_target_audience field_custom_target_audience
  ON field_custom_target_audience.entity_type = 'node' AND field_custom_target_audience.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_presenter field_presenter
  ON field_presenter.entity_type = 'node' AND field_presenter.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_faculty_list field_faculty_list
  ON field_faculty_list.entity_type = 'node' AND field_faculty_list.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_hotel_link field_hotel_link
  ON field_hotel_link.entity_type = 'node' AND field_hotel_link.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_hotel_photo field_hotel_photo
  ON field_hotel_photo.entity_type = 'node' AND field_hotel_photo.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_venue_phone field_venue_phone
  ON field_venue_phone.entity_type = 'node' AND field_venue_phone.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_course_category field_course_category
  ON field_course_category.entity_type = 'node' AND field_course_category.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_taxonomy_term_data field_course_category_tax
  ON field_course_category_tid = field_course_category_tax.tid
LEFT JOIN developlocal.drupal_field_data_field_related_courses_view field_related_courses_view
  ON field_related_courses_view.entity_type = 'node' AND field_related_courses_view.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_show_on_calendar field_show_on_calendar
  ON field_show_on_calendar.entity_type = 'node' AND field_show_on_calendar.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_course_node cn
  ON cn.nid = dn_course.nid
WHERE (dn_course.type IN ('course', 'group_event_series_event', 'course_imported'))
GROUP BY `dn_course`.nid;

Querying field data
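Nobody wants to write that query by hand, which is why it is generated. The sketch below is not the Denormalizer module's code or API. It is a hypothetical Python helper, build_denormalized_view, that repeats the same LEFT JOIN pattern once per field. It assumes Drupal 7 style storage, where a field lives in a table named field_data_<field> with a <field>_value column, and it uses SQLite with the "chinchilla" example from the previous slide.

```python
import sqlite3


def build_denormalized_view(view_name, base_table, primary_key, entity_type, bundle, fields):
    """Return CREATE VIEW SQL that flattens one bundle and its fields.

    Identifiers are interpolated directly, so only pass trusted names.
    """
    select = ["base.*"]
    joins = []
    for field in fields:
        # One LEFT JOIN per field table, the same shape as in the slide's query.
        select.append(f"{field}.{field}_value AS {field}")
        joins.append(
            f"LEFT JOIN field_data_{field} {field} "
            f"ON {field}.entity_type = '{entity_type}' "
            f"AND {field}.entity_id = base.{primary_key}"
        )
    return (
        f"CREATE VIEW {view_name} AS SELECT {', '.join(select)} "
        f"FROM {base_table} base {' '.join(joins)} "
        f"WHERE base.type = '{bundle}'"
    )


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE node (nid INTEGER PRIMARY KEY, type TEXT, title TEXT, changed INTEGER);
    CREATE TABLE field_data_field_rate (entity_type TEXT, entity_id INTEGER, field_rate_value INTEGER);
    INSERT INTO node VALUES (1, 'chinchilla', 'Chip', 100), (2, 'chinchilla', 'Dale', 200), (3, 'page', 'About', 300);
    INSERT INTO field_data_field_rate VALUES ('node', 1, 5);
    """
)

sql = build_denormalized_view("d_chinchilla", "node", "nid", "node", "chinchilla", ["field_rate"])
conn.execute(sql)
rows = conn.execute("SELECT nid, title, field_rate FROM d_chinchilla ORDER BY nid").fetchall()
print(rows)
```

The view exposes one row per chinchilla with the field as an ordinary column; nodes without a value get NULL because of the LEFT JOIN.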

Page 29

dw_mysiteprod.dw_user

user_id  fullname      province  country  employer  changed
1        Barry Cuda    PA        US       A, LLC    2018-01-01
2        Abbey Normal  NJ        US       B, Inc.   2018-05-25

Denormalizer will build a denormalized query and dump that data into a separate database.

On subsequent runs, it will only insert or update new and changed records based on the ID and "changed key".
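The incremental run can be sketched as follows. This is not Denormalizer's implementation: sync_changed and the src_user table are hypothetical, SQLite stands in for both databases, and "changed" is stored as an ISO date string so that plain string comparison orders it correctly.

```python
import sqlite3


def sync_changed(source, dest, last_changed):
    """Copy rows changed after last_changed; return (rows copied, new high-water mark)."""
    rows = source.execute(
        "SELECT user_id, fullname, employer, changed FROM src_user WHERE changed > ?",
        (last_changed,),
    ).fetchall()
    # INSERT OR REPLACE keys on the primary key: new IDs insert, known IDs update.
    dest.executemany(
        "INSERT OR REPLACE INTO dw_user (user_id, fullname, employer, changed) VALUES (?, ?, ?, ?)",
        rows,
    )
    return len(rows), max([last_changed] + [row[3] for row in rows])


source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE src_user (user_id INTEGER PRIMARY KEY, fullname TEXT, employer TEXT, changed TEXT)")
source.executemany(
    "INSERT INTO src_user VALUES (?, ?, ?, ?)",
    [(1, "Barry Cuda", "A, LLC", "2018-01-01"), (2, "Abbey Normal", "B, Inc.", "2018-05-25")],
)
dest = sqlite3.connect(":memory:")
dest.execute("CREATE TABLE dw_user (user_id INTEGER PRIMARY KEY, fullname TEXT, employer TEXT, changed TEXT)")

first = sync_changed(source, dest, "")  # initial load copies everything
source.execute("UPDATE src_user SET employer = 'Initech', changed = '2018-06-01' WHERE user_id = 1")
second = sync_changed(source, dest, first[1])  # only the edited row moves
print(first, second)
```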

Page 30

Prod DB: drupal_user, drupal_node, drupal_profile, drupal_field_data, drupal_field_data, drupal_field_data, secret_table, external_table

Data DB: d_chinchilla, d_user, d_external_table

Only the data you want is exposed in the database containing the denormalized data.

Third party tools can be configured to connect as a user that only has read access to the data DB.

A read-only replica can be used for increased performance.

Page 31

d_user, d_chinchilla, external_table, sendgrid_opensg_ad

Data...lake?

(denormalizer)

????

“The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

- James Dixon, Pentaho

Page 32

Tricks

uid  colors
1    [blue, green]
2    [blue, green, red]

uid  color
1    blue
1    green
2    blue
2    green
2    red

We can do some things that should not be done in an operational database, like combining multiple values into an array column, to be "unnested" later in a reporting database.

item_id  p_id
1        [6,7,8]
6        [7,8]
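Unnesting is a small operation. A plain Python sketch of the idea follows; the unnest helper is made up for illustration, and a reporting database would normally do this with its own array functions.

```python
def unnest(rows):
    """Expand (uid, [values]) pairs into one (uid, value) row per array element."""
    for uid, colors in rows:
        for color in colors:
            yield (uid, color)


nested = [(1, ["blue", "green"]), (2, ["blue", "green", "red"])]
flat = list(unnest(nested))
print(flat)
```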

Page 33

What’s the Difference?

Data lake

● You don't know what you're using this data for.

● Data can be structured or unstructured.

● Retain all data. Avoids creating data silos.

Data warehouse

● You know what you're using the data for.

● Data typically structured or processed.

● Data is modeled to make logical sense to the user, or to create “information”.

Page 34

BLT ETL, ELT?

E = Extract

T = Transform

L = Load

B = Bacon

Page 35

Singer

The open-source standard for writing scripts that move data.

https://www.singer.io

Page 36

Singer

Taps extract data from any source and write it to a standard stream in a JSON-based format.

Targets consume data from taps and do something with it, like load it into a file, API or database.
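A toy tap and target make the division of labor concrete. This sketch does not use the singer-python library. It hand-writes the three Singer message types (SCHEMA, RECORD, STATE) as JSON lines; the stream name d_user and the "last_user_id" state key are assumptions chosen for the example.

```python
import json


def tap(rows, stream="d_user"):
    """Yield Singer-style JSON lines: one SCHEMA, one RECORD per row, then a STATE."""
    schema = {"properties": {"user_id": {"type": "integer"}, "fullname": {"type": "string"}}}
    yield json.dumps({"type": "SCHEMA", "stream": stream, "schema": schema, "key_properties": ["user_id"]})
    last_id = None
    for row in rows:
        yield json.dumps({"type": "RECORD", "stream": stream, "record": row})
        last_id = row["user_id"]
    yield json.dumps({"type": "STATE", "value": {"last_user_id": last_id}})


def target(lines):
    """Consume tap output, upsert records by key, and return (tables, last state)."""
    tables, keys, state = {}, {}, None
    for line in lines:
        message = json.loads(line)
        if message["type"] == "SCHEMA":
            keys[message["stream"]] = message["key_properties"]
            tables.setdefault(message["stream"], {})
        elif message["type"] == "RECORD":
            key = tuple(message["record"][k] for k in keys[message["stream"]])
            tables[message["stream"]][key] = message["record"]
        elif message["type"] == "STATE":
            state = message["value"]
    return tables, state


rows = [{"user_id": 1, "fullname": "Barry Cuda"}, {"user_id": 2, "fullname": "Abbey Normal"}]
tables, state = target(tap(rows))
print(tables["d_user"], state)
```

Because the two sides only share a text stream, any tap can be piped into any target.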

Page 37

Singer cycle

Tap → Target → State

1. The tap extracts information from the source.

2. The target takes in data from the tap and transforms it to what the target needs. It outputs the state that the tap also sent over.

3. State is then passed back to the tap, and the process continues at the last replication point.

# tap-mysql -c catalog.json -s state.json | target-postgres > state.json

The catalog defines how to replicate the data. We use Drupal to generate these "streams".

# drush singer-json > catalog.json
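The role of the state file across repeated runs can be sketched without any Singer tooling. Everything here is hypothetical (run_cycle, the "changed" bookmark, the in-memory warehouse dict). The one design point it tries to show is that the new state is written to a temporary file and swapped in only after the run has finished, so an interrupted run cannot leave a half-written state file behind.

```python
import json
import os
import tempfile


def run_cycle(source_rows, warehouse, state_path):
    """One tap-and-target pass: resume from saved state, load new rows, save state."""
    state = {"changed": ""}
    if os.path.exists(state_path):
        with open(state_path) as handle:
            state = json.load(handle)

    # "Tap": only rows past the last replication point.
    new_rows = [row for row in source_rows if row["changed"] > state["changed"]]

    # "Target": load the rows and advance the bookmark.
    for row in new_rows:
        warehouse[row["user_id"]] = row
        state["changed"] = max(state["changed"], row["changed"])

    # Save state last, via a temp file that replaces the old one in a single step.
    tmp_path = state_path + ".tmp"
    with open(tmp_path, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, state_path)
    return len(new_rows)


state_path = os.path.join(tempfile.mkdtemp(), "state.json")
source = [
    {"user_id": 1, "fullname": "Barry Cuda", "changed": "2018-01-01"},
    {"user_id": 2, "fullname": "Abbey Normal", "changed": "2018-05-25"},
]
warehouse = {}
first = run_cycle(source, warehouse, state_path)
source.append({"user_id": 3, "fullname": "Rita Book", "changed": "2018-06-01"})
second = run_cycle(source, warehouse, state_path)
print(first, second)
```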

Page 38

Singer

Data lake (denormalized): d_user, d_chinchilla, external_table, sendgrid_opensg_ad

Singer

Targets: Athena, BigQuery, PostgreSQL, Redshift, ???

Page 39

Our solution

● Drupal and Singer being used to populate the data lake

● Production database on AWS Aurora provides a read-only replica of the lake

● Every ~15 minutes, Singer extracts new data from the lake (Drupal plus other data in MySQL) and loads it into PostgreSQL

● Looker transforms and reports on data

● Customer modeling generated from Drupal

● 100M+ rows being processed

Drupal → Singer → PostgreSQL

Page 40

Rollout & Results

● Customer interviews, mockups, beta tests for feature design

● Pre-built embedded dashboards in all customer sites

● Self-service reporting as an add-on

● No price increase

● Data warehouse became a competitive advantage

Page 41

Data Can Be Beautiful

Page 42

Support Desk Heroes

Custom reports are now built by non-developers.

Often for free!

Sometimes in minutes!

Page 43

Have we arrived?

● Customers appreciate all the work

● Still some things to work out on both ends

Page 44

Join us!

Page 45

Resources

● Singer ETL: https://www.singer.io/

● Denormalizer: https://www.drupal.org/project/denormalizer

● Drupal Singer: https://www.drupal.org/sandbox/devin/3006817

● MySQL tap: https://github.com/singer-io/tap-mysql

● PostgreSQL tap: https://github.com/singer-io/tap-postgres/tree/master/tap_postgres

● PostgreSQL target: https://github.com/statsbotco/target-postgres

Some projects are reaching maturity. Stay tuned!

https://looker.com/blog/introducing-singer

Page 46

Join us for contribution opportunities

Friday, April 12, 2019

Mentored Contribution: 9:00-18:00, Room: 602

First Time Contributor Workshop: 9:00-12:00, Room: 606

General Contribution: 9:00-18:00, Room: 6A

#DrupalContributions

Page 47

What did you think?

Locate this session at the DrupalCon Seattle website:

http://seattle2019.drupal.org/schedule

Take the Survey!

https://www.surveymonkey.com/r/DrupalConSeattle