stop worrying & love the sql - a case study
Post on 15-Jul-2015
784 Views
Preview:
TRANSCRIPT
Stop WorryingLove the SQL! (the Quepid story)
OpenSource Connections
Me/UsDoug Turnbull@softwaredoug
Likes: Solr, Elasticsearch, Cassandra, Postgres
OpenSource Connections@o19s
Search, Discovery and Analytics
Let us introduce you to freelancing!
OpenSource Connections
Most Importantly we do...Make my search results more relevant!
“Search Relevancy”
What database works best for problem X?“(No)SQL Architect/Trusted Advisor”
OpenSource Connections
How products actually get built
Rena: Doug, John can you come by this afternoon?
One of our Solr-based products needs some urgent relevancy work
Its Friday, it needs to get done today!
Us: Sure!
The Client(Rena!)smart cookie!
OpenSource Connections
A few hours laterUs: we’ve made a bit of progress!
image frustration-1081 by jseliger2
Rena: but everytime we fix something, we break an existing search!
Us: yeah! we’re stuck in a whack-a-mole-game
other image: whack a mole by jencu
OpenSource Connections
Whack-a-MoleWhat search relevancy work actually looks like
OpenSource Connections
I HAVE AN IDEA● Middle of the afternoon, I stop doing search
work and start throwing together some python
from flask import Flaskapp = Flask(__name__)
Everyone: Doug, stop that, you have important search work to do!
Me: We’re not making any progress!WE NEED A WAY TO REGRESSION TEST OUR RELEVANCY AS WE TUNE!
Everyone: You’re nuts!
OpenSource Connections
What did I make?Focus on gathering stakeholder (ie Rena) feedback on search, coupled w/ workbench tuning against that feedback
Today we have customers...
… forget that, tell me about your failures!
OpenSource Connections
Our war storyMy mistakes:
● Building a product● Selling a product● As a user experience engineer● As an Angular developer● At choosing databases
OpenSource Connections
Quepid 0.0.0.0.0.0.1Track multiple user searches
for this query (hdmi cables) Rena rates this document as a good/bad search result
need to store:<search> -> <id for search result> -> <rating 1-10>“hdmi cables” -> “doc1234” -> “10”
*Actual UI may have been much uglier
OpenSource Connections
Data structure selection under duress
● What’s simple, easy, and will persist our data?
● What plays well with python?
● What can I get working now in Rena’s office?
OpenSource Connections
Redis● In memory “Data Structure Server”
○ hashes, lists, simple key-> value storage
● Persistent -- write to disk every X minutes
OpenSource Connections
Redis
from redis import Redisredis = Redis()redis.set("foo", "bar")redis.get("foo") # gets ‘bar’
$ pip install redis
Easy to install and go! Specific to our problem:
from redis import Redisredis = Redis()
ratings = {“doc1234”: “10”, “doc532”: “5”}searchQuery = “hdmi cables”
redis.hsetall(searchQuery, ratings)
Store a hash table at “hdmi cables” with:
“doc1234” -> “10”“doc532” -> “5”
OpenSource Connections
Success!● My insanity paid off that afternoon
● Now we’re left with a pile of hacked together (terrible) code -- now what?
OpenSource Connections
Adding some features● Would like to add multiple “cases”
(different search projects that solve different problems)
● Would like to add user accounts
● Still a one-off for Silverchair
OpenSource Connections
CasesTuning a cable shopping site... … vs state laws
OpenSource Connections
Cases in Redis?
from redis import Redisredis = Redis()
ratings = {“doc1234”: “10”, “doc532”: “5”}searchQuery = “hdmi cables”
redis.hset(searchQuery, ratings)
Recall our existing implementation“data model”
Out of the box, redis can deal with 2 levels deep:{
“hdmi cables”: {“doc1234”: “10”,“doc532”: “5”
},“ethernet cables”...
}
Can’t add extra layer (redis hash only one layer)
{“cable site”: {“hdmi cables”: {...}“ethernet cables”: {...}
}“laws site: {...}}
OpenSource Connections
Time to give up Redis?“All problems in computer science can be solved by another level of indirection” -- David Wheeler
Crazy Idea: Add dynamic prefix to query keys to indicate case, ie:{
“case_cablestore_hdmi cables”: {“doc1234”: “10”,“doc532”: “5”
},“case_cablestore_ethernet cables”: {… },“case_statelaws_car tax”: { …}
}
Queries for “Cable Store” case
Query for “State Laws” case
redis.keys(“case_cablestore*”)
To Fetch:
OpenSource Connections
Store other info about cases?New problem: we need to store some information about cases, case name, et
{“case_cablestore_hdmi cables”: {
“doc1234”: “10”,“doc532”: “5”
},“case_cablestore_ethernet cables”: {… },“case_statelaws_car tax”: { …}
}
Where would it go here?{
“case_cablestore” {“name”: “cablestore”,“created” “20140101”
},“case_cablestore_query_hdmi cables”: {
“doc1234”: “10”,“doc532”: “5”
},“case_cablestore_query_ethernet cables”:
{… },“case_statelaws_query_car tax”: { …}
}
OpenSource Connections
Oh but let’s add usersExtrapolating on past patterns {
“user_doug” {“name”: “Doug”,“created_date”: “20140101”
},“user_doug_case_cablestore” {
“name”: “cablestore”,“created_date” “20140101”
},“user_doug_case_cablestore_query_hdmi cables”: {
“doc1234”: “10”,“doc532”: “5”
},“user_doug_case_cablestore_query_ethernet cables”:
{… },“user_tom_case_statelaws_query_car tax”: { …}
}image: Rage Wallpaper from Flickr user Thoth God of Knowledge
You right now!
OpenSource Connections
Step BackWe ask ourselves: Is this tool a product? Is it useful outside of this customer?
What level of software engineering helps us move forward?
● Migrate to RDMS?● “NoSQL” options?● Clean up use of Redis somehow?
OpenSource Connections
SubRedis
Operationalizes hierarchy inside of redis
https://github.com/softwaredoug/subredis
from redis import Redisfrom subredis import SubRedisredis = Redis()
sr = SubRedis(“case_%s” % caseId , redis)
ratings = {“doc1234”: “10”, “doc532”: “5”}searchQuery = “hdmi cables”
sr.hsetall(searchQuery, ratings)
Create a redis sandbox for this case
Interact with this case’s queries with redis sandbox specific to that case
Behind the scenes, subredis queries/appends the case_1 prefix to everything
OpenSource Connections
SubRedis == composable
userSr = SubRedis(“user_%s” % userId , redis)
caseSr = SubRedis(“case_%s” % caseId , userSr)
# Sandbox redis for queries about userratings = {“doc1234”: “10”, “doc532”: “5”}searchQuery = “hdmi cables”
caseSr.hsetall(searchQuery, ratings)
SubRedis takes any Redis like thing, and works safely in that sandbox
Now working on sandbox, within a sandbox
OpenSource Connections
Does something reasonable under the hood
{
“user_1_name”: “Doug”,“user_1_created_date”: “Doug”,“user_1_case_1_name”: “name”: “cablestore”“user_1_case_1_hdmi cables”: {
“doc1234”: “10”,“doc532”: “5”
},“user_2_name”, “Rena”,...
}
AllRedis
user_1 subred.
case_1subred.
OpenSource Connections
We reflect again● Ok we tried this out as a product. Launched.
● Paid off *some* tech debt, but wtf are we doing
● Works well enough, we’ve got a bunch of new features, forge ahead
OpenSource Connections
We reflect again● We have real customers
● Our backend is evolving away from simple key-value storage○ user accounts? users that share cases? stored
search snapshots? etc etc
OpenSource Connections
Attack of the relationalGiven our current set of tools, how would we solve the problem“case X can be shared between multiple users”?
{
“user_1_name”: “Doug”,“user_1_created_date”: “Doug”,“user_1_case_1_name”: “name”: “cablestore”“user_1_case_1_hdmi cables”: {
“doc1234”: “10”,“doc532”: “5”
},“user_2_name”, “Rena”,“user_2_case_1_name”: “name”: “cablestore”“user_2_case_1_hdmi cables”: {
“doc1234”: “10”,“doc532”: “5”
},}
Could duplicate the data? This stinks!
● Updates require visiting many (every?) user, looking for this case
● Bloated database
Duplicate the data?
OpenSource Connections
Attack of the relationalGiven our current set of tools, how would we solve the problem“case X can be shared between multiple users”?
{
“user_1_name”: “Doug”,“user_1_created_date”: “Doug”,“user_1_cases”: [1, ...]“case_1_name”: “name”: “cablestore”“case_1_hdmi cables”: {
“doc1234”: “10”,“doc532”: “5”
},“user_2_name”, “Rena”,“user_2_cases”: [1, ...]...
}
User 1
Case 1
User 2
Store list of owned cases
Break out cases to a top-level record?
OpenSource Connections
SudRedisRelational?{
“user_1_name”: “Doug”,“user_1_created_date”: “Doug”,“user_1_cases”: [1, ...]“case_1_name”: “name”: “cablestore”“case_1_hdmi cables”: {
“doc1234”: “10”,“doc532”: “5”
},“user_2_name”, “Rena”,“user_2_cases”: [1, ...]...
}
We’ve actually just normalized our data.
Why was this good?● We want to update case 1 in isolation
without anomalies● We don’t want to visit every user to
update case 1!● We want to avoid duplication
We just made our “NoSQL” database a bit relational
OpenSource Connections
Other Problems● Simple CRUD tasks like “delete a case”
need to be coded up
● We’re managing our own record ids
● Is any of this atomic? does it occur in isolation?
OpenSource Connections
What’s our next DB?● These problems are hard, we need a new
DB
● We also need better tooling!
OpenSource Connections
Irony● This is the exact situation we warn clients
about in our (No)SQL Architect Roles.○ Relational == General Purpose○ Many-many, many-one, one-many, etc○ Relational == consistent tooling
○ NoSQL == solve specific problems well
OpenSource Connections
So we went relational!● Took advantage of great tooling: MySQL,
Sqlalchemy (ORM), Alembic (migrations)
● Modeled our data relationships exactly like we needed them to be modeled
OpenSource Connections
Map db Python classes
class SearchQuery(Base): __tablename__ = 'query' id = Column(Integer, primary_key=True) search_string = Column(String) ratings = relationship("QueryRating")
class QueryRating(Base): __tablename__ = 'rating' id = Column(Integer, primary_key=True) doc_id = Column(String) rating = Column(Integer)
Can model my domain in coder-friendly classes class SearchQuery(Base):
__tablename__ = 'query' id = Column(Integer, primary_key=True) search_string = Column(String) ratings = relationship("QueryRating")
class QueryRating(Base): __tablename__ = 'rating' id = Column(Integer, primary_key=True) doc_id = Column(String) rating = Column(Integer)
OpenSource Connections
Easy CRUDq = SearchQuery(search_string=”hdmi cable”)db.session.add(q)db.session.commit()
del q.ratings[0]db.session.add(q)db.session.commit()
q = SearchQuery.query.filter(id=1).one()q.search_string=”foo”db.session.add(q)db.session.commit()
Create!
Delete!
Update!
OpenSource Connections
Migrations are good
alembic revision --autogenerate -m "name for tries"alembic upgrade headalembic downgrade 0ab51c25c
How do you upgrade your database to add/move/reorganize data?
● Redis this was always done manually/scripted
● Migrations with RDMS are a very robust/well-understood way to handle this
SQLAlchemy has “alembic” to help:
OpenSource Connections
Modeling Users ←→ Casesassociation_table = Table(case2users, Base.metadata, Column('case_id', Integer, ForeignKey('case.id')), Column('user_id', Integer, ForeignKey('user.id')))
class User(Base): __tablename__ = 'user' id = Column(Integer, primary_key=True) cases = relationship("Case", secondary=association_table)
class Case(Base): __tablename__ = 'case' id = Column(Integer, primary_key=True)
Can model many-many relationships
OpenSource Connections
Ultimate Query Flexibilityfor user in User.query.all(): for case in user.cases: print case.caseName
for user in User.query.filter(User.isPaying==True): for case in user.cases: print case.caseName
Print all cases:
Cases from paying members:
OpenSource Connections
Lots of things easier● backups● robust hosting services (RDS)● industrial strength ACID with flexible
querying● 3rd-party tooling (ie VividCortex for MySQL)
OpenSource Connections
When NoSQL?● Solve specific problems well
○ Optimize for specific query patterns○ Full-Text Search (Elasticsearch, Solr)○ Caching, shared data structure (Redis)
● Optimize for specific scaling problems○ Provide a denormalized “view” of your data for
specific task
OpenSource Connections
Final ThoughtsSometimes RDMS’s have harder initial hurdle for setup, figuring out migrations; data modeling; etc
Why isn’t the easy path the wise path?
OpenSource Connections
In conclusion
top related