Everybody Into the Data Lake!
Gail Radecki, CHCP, American Academy of Allergy, Asthma & Immunology
Ezra Wolfe, EthosCE
Devin Zuczek, EthosCE
Drupal as a Data Warehouse
“Above all else, show the data.”
- Edward Tufte
My name is Gail…
● Pushy
● A know-just-enough
● Annoying
...and I’m one of “those” customers.
The World of Continuing Medical Education
● Accreditation/Compliance
● Needs Assessment
● Grants Reconciliation
Data reporting...stayed the same
This is where Community comes in handy
● EthosCE User Group
● Support tickets
● Online Community
The Importance of Outcomes
● Medical Specialty Society
● Needs assessment
● Accreditation
We need a solution!
My name is Ezra.
I am the product manager.
This is me with a bad haircut.
The Product:
● 65+ hospitals, health systems, associations
● Almost 1 million learner accounts
● 7.9 million course enrollments
“All happy customers are alike; each unhappy customer is unhappy about their reports.”
- Leo Tolstoy
Existing process
1. Customer requests a new report
2. Requirements
3. Development feasibility
4. Back and forth with customer
5. Make the view, put it in code
6. Change management, code review, documentation, tests
7. Custom report is deployed! We are done!
8. Repeat steps #1-7 again because "that" customer forgot something
9. Repeat steps #1-7 again later because of a product schema change
My name is Devin and I am a systems architect.
Throwback to PHP 4?
This is a duck.
Big data!!!
Phase 1
Big data
Phase 2
?
Phase 3
Profit!
Making a plan
● Previous solutions involved a suite of modules: Views, Homebox, Charts + HighCharts, Views Data Export.
● Building our own tool would have been a distraction from our core business — we are not data scientists.
● We needed a tool that we could give to customers to report on their own data instead of us doing it.
Vendor selection
● Vendors don’t know about...
○ Drupal
○ PHP serialization
○ Webform
● It’s your job to do your due diligence and ensure you select the correct system
BYODW? (Bring your own data warehouse?)
Did I mention where I work??
● Non-profit
● Tech maintenance vs. adding features
● We have to justify EVERYTHING
First steps
● Could we point a tool at Drupal and have it report out of the box?
● Do we need a data warehouse?
To answer the question of why reporting on Drupal data in its native form is not optimal, we have to look at how the data is stored.
DBA 101: Tables
U. Corp. changed their name. Making this change requires...
people
user_id  name         employer_name
1        Barry Cuda   U. Corp.
2        Abby Normal  Gekko & Co.
3        Rita Book    U. Corp.
4        Ray O’Sun    U. Corp.
3 row updates. Not good at scale, as this will lock those rows for editing!
people
user_id  name         employer_name
1        Barry Cuda   Initech
2        Abby Normal  Gekko & Co.
3        Rita Book    Initech
4        Ray O’Sun    Initech
Normalizing
Eliminate columns with duplicate data by creating separate tables, and identify that data with a key.
Move data that does not depend on the primary key into its own table.
People
uid  name         employer_name
1    Barry Cuda   Initech
2    Abby Normal  Gekko & Co.
3    Rita Book    Initech

People
uid  name         eid
1    Barry Cuda   X
2    Abby Normal  X
3    Rita Book    X

Employer data
eid  name
5    Initech
6    Gekko & Co.

Employer relation
uid  eid
1    5
2    6
3    5
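The effect of normalizing on writes can be sketched with an in-memory SQLite database. This is only an illustration: the table name people_flat, the use of SQLite, and the inclusion of the fourth person from the earlier slide in the normalized tables are assumptions, not part of the talk.

```python
import sqlite3

# Rows from the "DBA 101" slide, before the employer changes its name.
people = [
    (1, "Barry Cuda", "U. Corp."),
    (2, "Abby Normal", "Gekko & Co."),
    (3, "Rita Book", "U. Corp."),
    (4, "Ray O'Sun", "U. Corp."),
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Flat layout: the employer name is repeated on every person row.
    CREATE TABLE people_flat (user_id INTEGER PRIMARY KEY, name TEXT, employer_name TEXT);
    -- Normalized layout: the employer name is stored once and referenced by key.
    CREATE TABLE people (uid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employer (eid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employer_relation (uid INTEGER, eid INTEGER);
""")
conn.executemany("INSERT INTO people_flat VALUES (?, ?, ?)", people)
conn.executemany("INSERT INTO people VALUES (?, ?)", [(uid, name) for uid, name, _ in people])
conn.executemany("INSERT INTO employer VALUES (?, ?)", [(5, "U. Corp."), (6, "Gekko & Co.")])
conn.executemany("INSERT INTO employer_relation VALUES (?, ?)", [(1, 5), (2, 6), (3, 5), (4, 5)])

# Renaming the employer touches every matching row in the flat table...
flat_updates = conn.execute(
    "UPDATE people_flat SET employer_name = ? WHERE employer_name = ?",
    ("Initech", "U. Corp."),
).rowcount

# ...but only one row in the normalized employer table.
normalized_updates = conn.execute(
    "UPDATE employer SET name = ? WHERE eid = ?", ("Initech", 5)
).rowcount

# Reading the employer name back now requires joins.
employers = conn.execute("""
    SELECT p.name, e.name
    FROM people p
    JOIN employer_relation r ON r.uid = p.uid
    JOIN employer e ON e.eid = r.eid
    ORDER BY p.uid
""").fetchall()

print(flat_updates, normalized_updates)
```

The write gets cheaper, but the read now needs two joins, which is the trade-off the following slides build on.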
DBA 201: Normalizing
How do we efficiently store a user, full name, location, and employer?
User
user_id  user_name
1        john
2        jane

Profile
profile_id  user_id
2           1
3           2

Location
loc_id  province  country
3       PA        US
4       NJ        US

Profile location
profile_id  loc_id
2           3
3           4

Profile name
profile_id  fullname
2           John Smith
3           Jane Doe

Profile employer
profile_id  emp_id
2           6
3           7

Employer
employer_id  name
6            A, LLC.
7            D Corp
No data is unnecessarily dependent.
An update only requires 1 write.
...that includes the database!
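To see what this layout costs a report writer, here is a minimal sketch that loads the seven tables above into SQLite and rebuilds one readable row per user. The use of SQLite and the snake_case table names are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE "user" (user_id INTEGER PRIMARY KEY, user_name TEXT);
    CREATE TABLE profile (profile_id INTEGER PRIMARY KEY, user_id INTEGER);
    CREATE TABLE location (loc_id INTEGER PRIMARY KEY, province TEXT, country TEXT);
    CREATE TABLE profile_location (profile_id INTEGER, loc_id INTEGER);
    CREATE TABLE profile_name (profile_id INTEGER, fullname TEXT);
    CREATE TABLE profile_employer (profile_id INTEGER, emp_id INTEGER);
    CREATE TABLE employer (employer_id INTEGER PRIMARY KEY, name TEXT);

    INSERT INTO "user" VALUES (1, 'john'), (2, 'jane');
    INSERT INTO profile VALUES (2, 1), (3, 2);
    INSERT INTO location VALUES (3, 'PA', 'US'), (4, 'NJ', 'US');
    INSERT INTO profile_location VALUES (2, 3), (3, 4);
    INSERT INTO profile_name VALUES (2, 'John Smith'), (3, 'Jane Doe');
    INSERT INTO profile_employer VALUES (2, 6), (3, 7);
    INSERT INTO employer VALUES (6, 'A, LLC.'), (7, 'D Corp');
""")

# Six joins are needed just to answer "who is this user and where do they work?"
FLAT_QUERY = """
    SELECT u.user_id, pn.fullname, l.province, l.country, e.name AS employer
    FROM "user" u
    LEFT JOIN profile p           ON p.user_id = u.user_id
    LEFT JOIN profile_name pn     ON pn.profile_id = p.profile_id
    LEFT JOIN profile_location pl ON pl.profile_id = p.profile_id
    LEFT JOIN location l          ON l.loc_id = pl.loc_id
    LEFT JOIN profile_employer pe ON pe.profile_id = p.profile_id
    LEFT JOIN employer e          ON e.employer_id = pe.emp_id
    ORDER BY u.user_id
"""
rows = conn.execute(FLAT_QUERY).fetchall()
for row in rows:
    print(row)
```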
The problem
Operational database
● Fast writes
● Holds data
● Relational model not informative
● Built for integrity
Reporting database
● Fast, easy reads
● Holds information
● Relational model more informative to users
- E.F. Codd, "Further Normalization of the Data Base Relational Model"
Another module
Denormalizer module
This module reads the schema that Drupal defines and programmatically builds denormalized tables or views for use in data warehousing flows.
Example: A node of type “chinchilla” with a rate field. We could also request the raw vote table.
1. Denormalized table name
2. Entity/table to denormalize
3. The primary key to use
4. The changed key to use
SELECT * FROM...developlocal.drupal_node dn_course
LEFT JOIN developlocal.drupal_field_data_field_course_rating_access field_course_rating_access ON field_course_rating_access.entity_type = 'node' AND field_course_rating_access.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_course_catalog field_course_catalog ON field_course_catalog.entity_type = 'node' AND field_course_catalog.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_course_date field_course_date ON field_course_date.entity_type = 'node' AND field_course_date.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_course_disclosure field_course_disclosure ON field_course_disclosure.entity_type = 'node' AND field_course_disclosure.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_course_event_date field_course_event_date ON field_course_event_date.entity_type = 'node' AND field_course_event_date.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_course_external_url field_course_external_url ON field_course_external_url.entity_type = 'node' AND field_course_external_url.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_course_format field_course_format ON field_course_format.entity_type = 'node' AND field_course_format.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_taxonomy_term_data field_course_format_tax ON field_course_format_tid = field_course_format_tax.tid
LEFT JOIN developlocal.drupal_field_data_field_course_image field_course_image ON field_course_image.entity_type = 'node' AND field_course_image.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_course_live field_course_live ON field_course_live.entity_type = 'node' AND field_course_live.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_course_location field_course_location ON field_course_location.entity_type = 'node' AND field_course_location.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_course_password field_course_password ON field_course_password.entity_type = 'node' AND field_course_password.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_course_rating field_course_rating ON field_course_rating.entity_type = 'node' AND field_course_rating.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_course_transcript field_course_transcript ON field_course_transcript.entity_type = 'node' AND field_course_transcript.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_enrollment_requirement_min field_enrollment_requirement_min ON field_enrollment_requirement_min.entity_type = 'node' AND field_enrollment_requirement_min.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_requirements_max field_requirements_max ON field_requirements_max.entity_type = 'node' AND field_requirements_max.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_requirements_min field_requirements_min ON field_requirements_min.entity_type = 'node' AND field_requirements_min.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_show_child_catalog field_show_child_catalog ON field_show_child_catalog.entity_type = 'node' AND field_show_child_catalog.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_show_child_transcript field_show_child_transcript ON field_show_child_transcript.entity_type = 'node' AND field_show_child_transcript.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_og_group_ref og_group_ref ON og_group_ref.entity_type = 'node' AND og_group_ref.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_accme_data field_accme_data ON field_accme_data.entity_type = 'node' AND field_accme_data.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_activity_format field_activity_format ON field_activity_format.entity_type = 'node' AND field_activity_format.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_course_level field_course_level ON field_course_level.entity_type = 'node' AND field_course_level.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_course_topic field_course_topic ON field_course_topic.entity_type = 'node' AND field_course_topic.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_custom_target_audience field_custom_target_audience ON field_custom_target_audience.entity_type = 'node' AND field_custom_target_audience.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_presenter field_presenter ON field_presenter.entity_type = 'node' AND field_presenter.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_faculty_list field_faculty_list ON field_faculty_list.entity_type = 'node' AND field_faculty_list.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_hotel_link field_hotel_link ON field_hotel_link.entity_type = 'node' AND field_hotel_link.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_hotel_photo field_hotel_photo ON field_hotel_photo.entity_type = 'node' AND field_hotel_photo.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_venue_phone field_venue_phone ON field_venue_phone.entity_type = 'node' AND field_venue_phone.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_course_category field_course_category ON field_course_category.entity_type = 'node' AND field_course_category.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_taxonomy_term_data field_course_category_tax ON field_course_category_tid = field_course_category_tax.tid
LEFT JOIN developlocal.drupal_field_data_field_related_courses_view field_related_courses_view ON field_related_courses_view.entity_type = 'node' AND field_related_courses_view.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_show_on_calendar field_show_on_calendar ON field_show_on_calendar.entity_type = 'node' AND field_show_on_calendar.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_course_node cn ON cn.nid = dn_course.nid
WHERE (dn_course.type IN ('course', 'group_event_series_event', 'course_imported'))
GROUP BY `dn_course`.nid;
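The query above follows a regular pattern: one LEFT JOIN per field table, keyed on entity type and entity ID. A rough sketch of generating such a query from a list of field names might look like the following. The function build_denormalized_query and its parameters are hypothetical; this is not the Denormalizer module's actual code, which is written for Drupal and reads the schema itself.

```python
def build_denormalized_query(base_table, alias, key, fields,
                             prefix="drupal_field_data_", entity_type="node"):
    """Build one SELECT with a LEFT JOIN per field table (hypothetical helper)."""
    lines = [f"SELECT * FROM {base_table} {alias}"]
    for field in fields:
        # Each field lives in its own table, named <prefix><field>,
        # and is aliased by the bare field name as in the query above.
        lines.append(
            f"LEFT JOIN {prefix}{field} {field} "
            f"ON {field}.entity_type = '{entity_type}' "
            f"AND {field}.entity_id = {alias}.{key}"
        )
    return "\n".join(lines)


sql = build_denormalized_query(
    base_table="drupal_node",
    alias="dn_course",
    key="nid",
    fields=["field_course_date", "field_course_format"],
)
print(sql)
```

Field names are interpolated directly into the SQL text here, so a real implementation would need to take them from a trusted schema definition rather than from user input.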
Querying field data
dw_mysiteprod.dw_user
user_id  fullname     province  country  employer  changed
1        Barry Cuda   PA        US       A, LLC    2018-01-01
2        Abby Normal  NJ        US       B, Inc.   2018-05-25
Denormalizer will build a denormalized query and dump that data into a separate database.
On subsequent runs, it will only insert or update new and changed records based on the ID and "changed key".
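The incremental behavior described above (insert or update only what changed, based on the ID and the "changed key") can be illustrated with a small sketch. The table names src_user and dw_user, the function sync_changed, and the use of SQLite with ISO date strings are assumptions for the example, not the module's implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_user (user_id INTEGER PRIMARY KEY, fullname TEXT, changed TEXT);
    CREATE TABLE dw_user  (user_id INTEGER PRIMARY KEY, fullname TEXT, changed TEXT);
    INSERT INTO src_user VALUES
        (1, 'Barry Cuda',  '2018-01-01'),
        (2, 'Abby Normal', '2018-05-25');
""")


def sync_changed(conn, since):
    """Copy rows changed after `since`; return (rows copied, new high-water mark)."""
    rows = conn.execute(
        "SELECT user_id, fullname, changed FROM src_user "
        "WHERE changed > ? ORDER BY changed",
        (since,),
    ).fetchall()
    # The primary key decides whether a row is inserted or replaced.
    conn.executemany("INSERT OR REPLACE INTO dw_user VALUES (?, ?, ?)", rows)
    new_mark = rows[-1][2] if rows else since
    return len(rows), new_mark


# First run: everything is new.
first_count, mark = sync_changed(conn, "")

# A record changes in the source; only that record is picked up next time.
conn.execute(
    "UPDATE src_user SET fullname = 'Barry M. Cuda', changed = '2018-06-01' "
    "WHERE user_id = 1"
)
second_count, mark = sync_changed(conn, mark)

dw_rows = conn.execute(
    "SELECT user_id, fullname FROM dw_user ORDER BY user_id"
).fetchall()
print(first_count, second_count, mark, dw_rows)
```

A strict "greater than" comparison can miss rows that share the exact timestamp of the last synced row, so a production version would likely need a finer-grained changed key or an overlap window.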
Prod DB: drupal_user, drupal_node, drupal_profile, drupal_field_data, drupal_field_data, drupal_field_data, secret_table, external_table
Data DB: d_chinchilla, d_user, d_external_table
Only the data you want is exposed in the database containing the denormalized data.
Third party tools can be configured to connect as a user that only has read access to the data DB.
A read-only replica can be used for increased performance.
d_user, d_chinchilla, external_table, sendgrid_opensg_ad
Data...lake?
(denormalizer)
????
“The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
- James Dixon, Pentaho
Tricks
uid  colors
1    [blue, green]
2    [blue, green, red]

uid  color
1    blue
1    green
2    blue
2    green
2    red

We can do some things that should not be done in an operational database, like combining multiple values into an array column, to be "unnested" later in a reporting database.

item_id  p_id
1        [6,7,8]
6        [7,8]
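The "unnest" step turns one row with an array column into one row per array element. In a reporting database this is usually done in SQL; the plain-Python sketch below only illustrates the transformation, and the function name unnest is an assumption.

```python
def unnest(rows, key, array_column, value_name):
    """Expand each row's array column into one output row per element."""
    return [
        {key: row[key], value_name: value}
        for row in rows
        for value in row[array_column]
    ]


nested = [
    {"uid": 1, "colors": ["blue", "green"]},
    {"uid": 2, "colors": ["blue", "green", "red"]},
]

flat = unnest(nested, key="uid", array_column="colors", value_name="color")
for row in flat:
    print(row["uid"], row["color"])
```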
What’s the Difference?
Data lake
● You don't know what you're using this data for.
● Data can be structured or unstructured.
● Retain all data. Avoids creating data silos.
Data warehouse
● You know what you're using the data for.
● Data typically structured or processed.
● Data is modeled to make logical sense to the user, or to create “information”.
BLT ETL, ELT?
E = Extract
T = Transform
L = Load
B = Bacon
Singer
The open-source standard for writing scripts that move data.
https://www.singer.io
Singer
Taps extract data from any source and write it to a standard stream in a JSON-based format.
Targets consume data from taps and do something with it, like load it into a file, API or database.
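As an illustration of the JSON-based stream, a tap writes one JSON message per line, using SCHEMA, RECORD and STATE message types. The sketch below builds such lines with only the standard library. The stream name d_user, the columns, and the bookmark layout inside the STATE value are assumptions for the example; real taps normally use the singer-python helper library instead.

```python
import json


def tap_messages(rows, stream="d_user", bookmark_column="changed"):
    """Yield Singer-style JSON lines for the given rows (illustrative only)."""
    # SCHEMA describes the stream before any records are sent.
    yield json.dumps({
        "type": "SCHEMA",
        "stream": stream,
        "key_properties": ["user_id"],
        "schema": {
            "type": "object",
            "properties": {
                "user_id": {"type": "integer"},
                "fullname": {"type": "string"},
                "changed": {"type": "string"},
            },
        },
    })
    # One RECORD message per row.
    for row in rows:
        yield json.dumps({"type": "RECORD", "stream": stream, "record": row})
    # STATE carries a bookmark so the next run can resume where this one ended.
    if rows:
        yield json.dumps({"type": "STATE", "value": {stream: rows[-1][bookmark_column]}})


lines = list(tap_messages([
    {"user_id": 1, "fullname": "Barry Cuda", "changed": "2018-01-01"},
    {"user_id": 2, "fullname": "Abby Normal", "changed": "2018-05-25"},
]))
for line in lines:
    print(line)
```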
Singer cycle
Tap Target
State
1. The tap extracts information from the source
2. The target takes in data from the tap and transforms it into the form the destination needs. It also outputs the state that the tap sent along with the data.
3. State is then passed back to the tap, and the process continues at the last replication point.
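The three steps can be simulated in a few lines to show how state makes the second run incremental. Everything here is a toy under stated assumptions: run_tap and run_target are hypothetical stand-ins for real tap and target programs, the destination is a dict rather than a database, and SCHEMA messages are left out for brevity.

```python
import json

SOURCE = [
    {"user_id": 1, "fullname": "Barry Cuda", "changed": "2018-01-01"},
    {"user_id": 2, "fullname": "Abby Normal", "changed": "2018-05-25"},
]


def run_tap(source, state):
    """Step 1: extract rows newer than the bookmark and emit them as JSON lines."""
    bookmark = state.get("d_user", "")
    new_rows = sorted(
        (row for row in source if row["changed"] > bookmark),
        key=lambda row: row["changed"],
    )
    lines = [json.dumps({"type": "RECORD", "stream": "d_user", "record": row})
             for row in new_rows]
    if new_rows:
        lines.append(json.dumps(
            {"type": "STATE", "value": {"d_user": new_rows[-1]["changed"]}}
        ))
    return lines


def run_target(lines, destination, state):
    """Step 2: load records into the destination and return the last state seen."""
    for line in lines:
        message = json.loads(line)
        if message["type"] == "RECORD":
            record = message["record"]
            destination[record["user_id"]] = record
        elif message["type"] == "STATE":
            state = message["value"]
    return state


destination = {}

# First run: no state yet, so everything is extracted and loaded.
state = run_target(run_tap(SOURCE, {}), destination, {})
first_state = dict(state)

# Step 3: the saved state is passed back to the tap, which resumes after it.
SOURCE.append({"user_id": 3, "fullname": "Rita Book", "changed": "2018-07-01"})
second_lines = run_tap(SOURCE, state)
state = run_target(second_lines, destination, state)

print(first_state, state, len(second_lines), len(destination))
```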
# tap-mysql -c catalog.json -s state.json | target-postgres > state.json
The catalog defines how to replicate the data. We use Drupal to generate these "streams".
# drush singer-json > catalog.json
Singer
Data lake (denormalized): d_user, d_chinchilla, external_table, sendgrid_opensg_ad
Singer -> Athena, BigQuery, PostgreSQL, Redshift, ???
Our solution
● Drupal and Singer being used to populate the data lake
● Production database on AWS Aurora provides a read-only replica of the lake
● ~15 minutes: Singer extracts new data from the lake (Drupal + other data in MySQL) and loads it into PostgreSQL
● Looker transforms and reports on data
● Customer modeling generated from Drupal
● 100M+ rows being processed
Drupal -> Singer -> PostgreSQL
Rollout & Results
● Customer interviews, mockups, beta tests for feature design
● Pre-built embedded dashboards in all customer sites
● Self-service reporting as an add-on
● No price increase
● Data warehouse became a competitive advantage
Data Can Be Beautiful
Support Desk Heroes
Custom reports are now built by non-developers.
Often for free!
Sometimes in minutes!
Have we arrived?
● Customers appreciate all the work
● Still some things to work out on both ends
Join us!
Resources
● Singer ETL: https://www.singer.io/
● Denormalizer: https://www.drupal.org/project/denormalizer
● Drupal Singer: https://www.drupal.org/sandbox/devin/3006817
● MySQL tap: https://github.com/singer-io/tap-mysql
● PostgreSQL tap: https://github.com/singer-io/tap-postgres/tree/master/tap_postgres
● PostgreSQL target: https://github.com/statsbotco/target-postgres
Some projects are reaching maturity. Stay tuned!
https://looker.com/blog/introducing-singer
What did you think?
Locate this session at the DrupalCon Seattle website:
http://seattle2019.drupal.org/schedule
Take the Survey!
https://www.surveymonkey.com/r/DrupalConSeattle