Exploration of multidimensional biomedical data in PubChem, presented by Lianyi Han at Solr...
Exploration of multidimensional biomedical data in PubChem
Lianyi Han
National Center for Biotechnology Information
Advances science and health by providing access to biomedical and genomic information.
Literature
• PubMed
• PMC
• PubMed Health
• …
Sequences
• Proteins
• Genes & Expression
• Genome & Maps
• …
Chemicals & Bioassays
• PubChem Databases
• BioSystems
• …
Software & Tools
• BLAST
• Structure Search
• Entrez/Eutils
Structure & Domains
• Structure
• CDD
• …
Provides information on the biological activities of small molecules and beyond
PubChem
Substance
Compound
Bioactivities
Literature (link)
Target
Patent
Pathways
23 million citations
The Challenge
• Variety: heterogeneous documents with many-to-many relationships
• Volume:
  – 200M+ bioactivity data
  – 40M+ compounds
  – 600K+ bioassays
  – 20K+ pathways
  – 9K targets
• Velocity: query wide quickly, query deep quickly, facet search quickly
Answers
The Direction
Velocity
Variety
Volume
Existing Search Systems
• ASN.1, XML schema
• RDBMS (SQL)
• In-house NoSQL search engine
• Specialized search engine
• Home-brewed messaging system
• Queue systems
A new search system
• Features?
• Scalability?
• Accessibility?
• Maintenance?
• Reusability?
• Extensibility?
• Cost-effective?
Archive Analysis
The feature requirements for the new search system
• Full-text search
• Highlighting
• Faceting
• Molecular formula search
• 2D similarity search
• Molecular superstructure/substructure search
• Joins and cascading joins to search wide and deep
• Transfer search results effectively across services
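As a rough illustration of the 2D similarity search requirement, the sketch below ranks molecules by the Tanimoto coefficient between binary fingerprints. The slides do not say which similarity measure or fingerprint format is used, so both are assumptions here: fingerprints are modelled as sets of "on" bit positions, and the library entries are made-up toy data.

```javascript
// Illustrative sketch only. Assumption: a fingerprint is a Set of the
// positions of its "on" bits; real fingerprints are usually fixed-width bit vectors.

// Tanimoto coefficient: |A ∩ B| / |A ∪ B|.
// Two empty fingerprints are scored 0 here (a convention chosen for this sketch).
function tanimoto(fpA, fpB) {
  let shared = 0;
  for (const bit of fpA) {
    if (fpB.has(bit)) shared += 1;
  }
  const union = fpA.size + fpB.size - shared;
  return union === 0 ? 0 : shared / union;
}

// Score every library entry against the query, keep those at or above
// the threshold, and return them best-first.
function similaritySearch(queryFp, library, threshold) {
  return library
    .map((entry) => ({ id: entry.id, score: tanimoto(queryFp, entry.fp) }))
    .filter((hit) => hit.score >= threshold)
    .sort((a, b) => b.score - a.score);
}

// Hypothetical toy library (not PubChem data).
const library = [
  { id: "A", fp: new Set([1, 2, 3, 4]) },
  { id: "B", fp: new Set([1, 2, 9]) },
  { id: "C", fp: new Set([7, 8]) },
];

const hits = similaritySearch(new Set([1, 2, 3]), library, 0.5);
console.log(hits); // entries A (0.75) and B (0.5); C shares no bits and is dropped
```

A linear scan like this is only meant to show the scoring; a production search over tens of millions of compounds would need an index or pre-filtering step on top of it.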
We can make the feature set complete in SOLR!
• Full-text search (SOLR)
• Highlighting (SOLR)
• Faceting (SOLR)
• Molecular formula search (implement MF search in SOLR)
• 2D similarity search (implement 2D fingerprint search in SOLR)
• Molecular superstructure/substructure search (SOLR-5244)
• Joins and cascading joins to search wide and deep (SOLR-4787)
• Transfer search results effectively across services (SOLR-4787, SOLR-5244)
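To make the first three items and the join concrete, here is a minimal sketch of how a single request could combine full-text search, highlighting, faceting, and a join filter. The parameters `q`, `hl`, `hl.fl`, `facet`, `facet.field`, `fq`, and the `{!join from=… to=… fromIndex=…}` query parser are standard Solr features. Everything else is an assumption for illustration: the host, the `compound` and `bioactivity` core names, and the field names `description`, `doc_type`, `cid`, and `activity_outcome`. The sketch uses Solr's stock join parser, not the join implementations from SOLR-4787.

```javascript
// Illustrative sketch only: builds a Solr /select URL; it does not send the request.
// Core names and field names below are hypothetical.
function buildSolrQuery(baseUrl, text) {
  const params = new URLSearchParams();
  params.append("q", text);                 // full-text query
  params.append("hl", "true");              // turn highlighting on
  params.append("hl.fl", "description");    // field to highlight (assumed name)
  params.append("facet", "true");           // turn faceting on
  params.append("facet.field", "doc_type"); // field to facet on (assumed name)
  // Restrict to compounds whose cid appears in active records of another core.
  params.append(
    "fq",
    "{!join from=cid to=cid fromIndex=bioactivity}activity_outcome:active"
  );
  params.append("wt", "json");              // ask for a JSON response
  return `${baseUrl}/select?${params.toString()}`;
}

const url = buildSolrQuery("http://localhost:8983/solr/compound", "aspirin");
console.log(url);
```

Using `URLSearchParams` keeps the local-params syntax (`{!join …}`) correctly percent-encoded, which is easy to get wrong when concatenating strings by hand.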
Architecture
UI/UX
Web API
RDBMS, NoSQL (SOLR), specialized search backend
Caching/List handling
The Backend
• Backend components (SOLR + SQL + specialized search engine)
  – Configuration
  – Importing pipeline
    • Dumping & importing (SGE farm)
    • DIH (JDBC)
  – Replication
  – Warm-up
• Web API
  – Encapsulates the backend implementation
  – Load balancing and throttling
  – Generic data model for heterogeneous documents
  – Query language
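The slides do not describe how the Web API throttles requests. One common approach is a token bucket, sketched below purely as an assumption about what such a layer might look like; the class name, parameters, and injected clock are all hypothetical.

```javascript
// Illustrative sketch only: a token-bucket throttle, not PubChem's implementation.
// The clock is injected so the behaviour can be checked deterministically.
class TokenBucket {
  constructor(capacity, refillPerSecond, now = () => Date.now()) {
    this.capacity = capacity;               // maximum burst size
    this.refillPerSecond = refillPerSecond; // sustained request rate
    this.now = now;                         // returns the current time in milliseconds
    this.tokens = capacity;                 // start with a full bucket
    this.lastRefill = now();
  }

  // Returns true if the request may proceed, false if it should be rejected.
  tryAcquire() {
    const current = this.now();
    const elapsedSeconds = (current - this.lastRefill) / 1000;
    // Add tokens for the elapsed time, never exceeding the capacity.
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSeconds * this.refillPerSecond
    );
    this.lastRefill = current;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Usage with a fake clock: a burst of 2 is allowed, the third call is rejected,
// and one more request is allowed after one second has passed.
let fakeTime = 0;
const bucket = new TokenBucket(2, 1, () => fakeTime);
const results = [bucket.tryAcquire(), bucket.tryAcquire(), bucket.tryAcquire()];
fakeTime = 1000;
results.push(bucket.tryAcquire());
console.log(results); // [ true, true, false, true ]
```

Placing a limiter like this in the Web API layer, rather than in each backend, fits the slide's point that the API encapsulates the backend implementation.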
The Frontend
• Easier to develop or expand based on modern web technologies
  – One backend, multiple frontends
  – One data model, multiple presentations
• UI/UX design
  – MVC
  – Reusability
  – Mobile-browser friendly
  – Interactivity & accessibility
The Frontends
• PubChem widgets (beta)
  – Reusable UI components
• PubChem new search (beta)
  – A new search system that delivers multiple search features
Briefly on UI architecture
• PubChem widgets as an example
PubChem widgets
ExtJS components
Data model/store
Web API
backend
Controller
Demo: PubChem widget
• http://jsfiddle.net/Gtbg7/
PubChem.widget.CreateGridTable({
  gridtabletype: 'pcassay',
  cid: 2244,
  renderTo: 'table',
  width: '90%',
  height: 400
});
More PubChem widgets
Demo: PubChem Search
• https://pubchem.ncbi.nlm.nih.gov/search/
Desktop Mobile
Faceting
Molecular Formula Search
Super/sub Structure Search
Full-text Search
Brief Summary on PubChem Search Demo
Thanks
• Yu Bo
• Renata Geer
• Asta Gindulyte
• Siqian He
• Paul Thiessen
• Jiyao Wang
• Jeff Zhang
• Steve Bryant
• Lewis Geer
• Evan Bolton
• Yanli Wang
• NCBI IEB and IRB
This research was supported [in part] by the Intramural Research Program of the NIH, National Library of Medicine.
Questions
About this talk: [email protected]
PubChem: https://www.facebook.com/pubchem
NCBI: https://www.facebook.com/ncbi.nlm