psi meeting 2013 - psidev.infopsidev.info/sites/default/files/2018-03/psi_april_2013.pdf · psi...
Outline
● Summary of 2012/2013 activities and achievements
● MIRIAM and identifiers.org
● MITAB 2.7 and MIQL 2.7
● Clustering
● New PSICQUIC reference implementation
● PSICQUIC view update
● Data Distribution Best Practices
Summary of 2012/2013 activities and achievements
PSICQUIC Hackathon
● 28th May – 1st June 2012
● 10 developers from 7 different partners
➢ BioJS, Cytoscape, DIP, InnateDB, IntAct, MatrixDB, MINT, MPIDB
➢ http://code.google.com/p/psicquic/wiki/PSICQUICHackathon2012
● 2 working groups
➢ SOLR team:
– reference implementation
– indexing MITAB 2.5, 2.6 and 2.7 using SOLR
– MIQL 2.7
– XML indexing and PSICQUIC webservices improvements
➢ Client team:
– PSICQUIC view visualization: table, network and search
– Cytoscape plugin
2012/2013 releases
● MITAB 2.7: http://code.google.com/p/psimi/wiki/PsimiTab27Format
● MIQL 2.7: http://code.google.com/p/psicquic/wiki/MiqlReference27
● PSICQUIC reference implementation: http://code.google.com/p/psicquic/wiki/PsicquicSpec_1_3_Rest
➢ LUCENE 1.2.3
➢ SOLR 1.3.9
● PSI-MI java libraries: http://code.google.com/p/psimi/downloads/list
➢ psi25-xml parser 1.8.3
➢ psimitab parser 1.8.3
➢ psi25-xml to RDF/Biopax converter 1.8.3
➢ Calimocho 2.5.0
➢ Calimocho to XGMML converter 2.5.0.3
● PSICQUIC-view: http://www.ebi.ac.uk/Tools/webservices/psicquic/view/main.xhtml
PSICQUIC growth
● + 25 million binary interactions since 2012
● + 2 services since 2012 => total of 28 services and one more in progress (FlyBase)
Work in progress...
● PSICQUIC/MITAB 2.7 publication submitted and in review
● PSICQUIC view and download all button
● BioJS : new javascript components for molecular interaction visualization
● Clustering improvements (new web interface, …)
● JAMI (Java framework for molecular interactions)
➢ XML/MITAB validator prototype
➢ Enricher
MIRIAM and Identifiers.org
Introduction: http://identifiers.org/about
MIRIAM/identifiers.org benefits
● PSICQUIC links to data entries (pubmed, uniprot, ensembl...)
➢ Automatic remapping when services down → more reliable links
● Up to date resource with database accession regular expressions
➢ Do not duplicate work in psi-mi ontology
More reliable PSICQUIC links (1)
• Several locations/resources for accessing uniprot P00533
➢ 3 existing resources for accessing P00533
➢ identifiers.org/uniprot/P00533
More reliable PSICQUIC links (2)
• Use the most reliable location/resource for uniprot P00533
➢ identifiers.org/uniprot/P00533?profile=most_reliable
More reliable PSICQUIC links (3)
• Use the uniprotkb location/resource for uniprot P00533
➢ identifiers.org/uniprot/P00533?resource=MIR:00100134
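The three link variants above follow one pattern, which can be sketched as a small URL builder. This is only an illustration: the function name and its arguments are hypothetical, and the only facts taken from the slides are the path layout and the profile and resource query parameters.

```python
from urllib.parse import urlencode

BASE = "http://identifiers.org"

def identifiers_url(namespace, accession, profile=None, resource=None):
    """Build an identifiers.org link (hypothetical helper, not an official client)."""
    url = f"{BASE}/{namespace}/{accession}"
    params = {}
    if profile is not None:
        params["profile"] = profile      # e.g. most_reliable
    if resource is not None:
        params["resource"] = resource    # e.g. a MIRIAM resource id
    if params:
        # keep ':' readable so MIR:00100134 is not percent-encoded
        url += "?" + urlencode(params, safe=":")
    return url

print(identifiers_url("uniprot", "P00533"))
print(identifiers_url("uniprot", "P00533", profile="most_reliable"))
print(identifiers_url("uniprot", "P00533", resource="MIR:00100134"))
```

A PSICQUIC client could call such a helper when rendering links for identifier columns, instead of hard-coding one URL per database.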
Up to date database links and regular expressions (1)
Up to date database links and regular expressions (2)
What next?
● The CV MI database terms should have xrefs to MIRIAM namespace
● The regular expressions in the database MI terms could be obsoleted to rely on MIRIAM

- Hierarchy information
- No data/formats update
- Relies on MIRIAM for the regular expressions and links

- More work for the MI CV maintainers
- MIRIAM namespaces not visible in MITAB/XML
- Need to update PSI-XML validator

Maybe XML 3.0?
MITAB 2.7 and MIQL 2.7
MITAB 2.7: introduction
● Format description at http://code.google.com/p/psicquic/wiki/MITAB27Format
● Extension of MITAB 2.6 and 2.5
● Total of 42 columns
● Can contain the minimum information recommended by MIMIx
MITAB 2.7: Complex expansion
● Distinguish true binary interactions from binary interactions expanded from n-ary interactions
● Know the method used to expand
➢ psi-mi:"MI:1060"(spoke expansion)
➢ psi-mi:"MI:1061"(matrix expansion)
➢ psi-mi:"MI:1062"(bipartite expansion)
● Spoke, Matrix, Bipartite: recognized for backward compatibility
MITAB 2.7: re-build n-ary from spoke expansion?
[Figure: two n-ary interactions expanded with the spoke model]
● Interaction id 1: bait A with preys B, C and D => A-B, A-C, A-D
● Interaction id 2: bait E with preys F and G => E-F, E-G
● 5 binary interactions, 2 n-ary interactions
● Each binary interaction carries its interaction id and the expansion method (Spoke)
Need
● interactor id
● expansion method
● interaction id
Not enough
● Publication
● Detection method
● Host organism
● Interaction type
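A minimal sketch of the regrouping idea, assuming each binary row has already been reduced to the three items listed under "Need" (interactor ids, interaction id, expansion method). The tuple layout and the function name are assumptions made for this example; no MITAB parsing is done here.

```python
from collections import defaultdict

def rebuild_nary(binary_rows):
    """Regroup expanded binary rows into n-ary participant lists.

    Each row is (id_a, id_b, interaction_id, expansion). Rows whose
    expansion is '-' are true binary interactions and are skipped.
    """
    participants = defaultdict(set)
    for id_a, id_b, interaction_id, expansion in binary_rows:
        if expansion == "-":
            continue
        participants[interaction_id].update((id_a, id_b))
    return {iid: sorted(ids) for iid, ids in participants.items()}

# The 5 spoke-expanded rows of the example
rows = [
    ("A", "B", "1", "spoke"),
    ("A", "C", "1", "spoke"),
    ("A", "D", "1", "spoke"),
    ("E", "F", "2", "spoke"),
    ("E", "G", "2", "spoke"),
]
nary = rebuild_nary(rows)
print(nary)
```

Grouping on publication, detection method, host organism or interaction type instead would merge unrelated experiments that happen to share those values, which is why the interaction id is required.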
MITAB 2.7: re-build n-ary from bipartite expansion?
[Figure: the same two n-ary interactions expanded with the bipartite model]
● Interaction id 1: interaction node I1 linked to interactors A, B, C and D => I1-A, I1-B, I1-C, I1-D
● Interaction id 2: interaction node I2 linked to interactors E, F and G => I2-E, I2-F, I2-G
● 7 binary interactions, 2 n-ary interactions
● Each binary interaction carries the expansion method (Bipartite)
Need
● interactor id
● expansion method
● interaction id
MITAB 2.7: re-build n-ary from matrix expansion?
[Figure: the same two n-ary interactions expanded with the matrix model]
● Interaction id 1: A-B, A-C, A-D, B-C, B-D, C-D
● Interaction id 2: E-F, E-G, F-G
● 9 binary interactions, 2 n-ary interactions
● Each binary interaction carries its interaction id and the expansion method (Matrix)
Need
● interactor id
● expansion method
● interaction id
Not enough
● Publication
● Detection method
● Host organism
● Interaction type
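The binary counts quoted for the three expansion models (5, 7 and 9) can be checked with a short sketch. The function names are hypothetical, and the interaction nodes I1 and I2 are the labels used in the bipartite example.

```python
from itertools import combinations

def spoke(bait, preys):
    """One binary pair per prey, always anchored on the bait."""
    return [(bait, prey) for prey in preys]

def matrix(participants):
    """Every unordered pair of participants."""
    return list(combinations(participants, 2))

def bipartite(interaction_node, participants):
    """One pair per participant, linked to a node standing for the interaction."""
    return [(interaction_node, p) for p in participants]

counts = {
    "spoke": len(spoke("A", ["B", "C", "D"])) + len(spoke("E", ["F", "G"])),
    "bipartite": len(bipartite("I1", list("ABCD"))) + len(bipartite("I2", list("EFG"))),
    "matrix": len(matrix(list("ABCD"))) + len(matrix(list("EFG"))),
}
print(counts)
```

For an interaction with n participants this gives n-1 pairs for spoke, n for bipartite and n(n-1)/2 for matrix.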
MITAB 2.7: MIMIx columns
● Participant's biological roles (col 17 and 18)
➢ Ex: psi-mi:"MI:0684"(ancillary)
● Participant's experimental roles (col 19 and 20)
➢ Ex: psi-mi:"MI:0496"(bait)
● Participant identification methods (col 41 and 42)
➢ Ex: psi-mi:"MI:0113"(western blot)
● Host organism for the experiment (col 29)
➢ Ex: taxid:-1(in vitro)
MITAB 2.7: new types of interactions accepted
● Negative interactions (col 36)
● Self interactions:
– homodimers, homotrimers, …
– auto-catalysis, …
Inter-molecular (P P)
➢ Unique id A (col 1): P
➢ Unique id B (col 2): P
➢ …
➢ Stoichiometry A (col 39): x
➢ Stoichiometry B (col 40): 0
Intra-molecular (P)
➢ Unique id A (col 1): P
➢ Unique id B (col 2): -
➢ …
➢ Stoichiometry A (col 39): 1
➢ Stoichiometry B (col 40): -
MITAB 2.7: interactor types
● Columns 21 and 22
➢ Ex: psi-mi:"MI:0327"(peptide)
● Solve some ambiguity with interactor identifiers
● More precise than registry tags
MITAB 2.7: interactor and interaction xrefs
● Interactor xrefs (col 23 and 24)
● Interaction xrefs (col 25)
➢ Ex: go:"GO:0005057"(receptor signaling protein activity)
➢ Ex: intact:EBI-626658(see-also)
• Give more information about the interactor or interaction
• Not an identifier
• Lighten the first 6 columns
• Not used for clustering
• Use the cross reference type
MITAB 2.7: interactor and interaction annotations
● Interactor annotations (col 26 and 27)
● Interaction annotations (col 28)
➢ Ex: dataset:Cancer - Interactions investigated in the context of cancer
➢ Ex: imex-curation
PSICQUIC Registry tags
MITAB 2.7: participant's features
● Users want to ask: “show me all evidence where molecule X has binding domains”
● Binding sites AND other features (e.g. tags, PTMs, ...)
yes
➢ binding site:51-124(IPR003651)
➢ binding site:45..53-119..129
➢ binding site:n-51,99-123
➢ gst tag:c-c
➢ his tag:?-?
no
➢ 51-124(IPR003651)
➢ 45..53-119..129
➢ n-51,99-123
➢ c-c
➢ ?-?
MITAB 2.7: more...
● Interaction parameters (col 30)
➢ Ex: kd:9.0x10^-7 (molar)
● Creation date (col 31)
➢ Ex: 2011/03/15
● Last update date (col 32)
➢ Ex: 2011/04/05
● Interactor checksum (col 33 and 34)
➢ Ex: rogid:bjwQTTv7ws6z/T+fM8bNGnEsEXk6239
● Interaction checksum (col 35)
➢ Ex: rigid:G6RtLd3+FtR/ZtRciwH2vj9R0Tc
MITAB 2.7: limitations and issues
● 42 columns!
● Feature, checksum, confidence and parameter types can only be names
● Cannot represent linked features and inferred interactions
● Cannot export feature xrefs and annotations
● Not all the columns have the same syntax
● Same syntax does not mean same content
● Cell types, tissues and compartments cannot be specified in host organism column.
• Issue when converting to XML where Xref is mandatory
• Cannot recognize MI from MOD terms
• Names can be ambiguous
MITAB: what next?
● Only column names
● A syntax per column
● Customize....
– Number of columns
– Order of columns
MIQL 2.7: introduction
● Fields description at http://code.google.com/p/psicquic/wiki/MiqlReference27
● Extension of MIQL 2.5
● Total of 35 fields
MIQL 2.7: new fields
MIQL 2.7: examples
● I want to filter out expanded binary interactions
➢ complex:"-"
● I want to include negative interactions
➢ negative:(true OR false)
● I want all interactions having parameters
➢ param:true
● I want all interactions having stoichiometry
➢ stc:true
● I want all interactions having binding sites
➢ ftypeA:"binding site" AND ftypeB:"binding site"
● I want all intra-molecular interactions
➢ idA:\- OR idB:\-
● I want all internally-curated interactions
➢ annot:"internally-curated"
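A sketch of how one of these MIQL queries could be turned into a PSICQUIC REST request. It assumes the REST layout search/query/<MIQL> with the format, firstResult and maxResults parameters; the base URL below is a placeholder, not a real service, and the function name is hypothetical.

```python
from urllib.parse import quote, urlencode

def psicquic_query_url(base_url, miql, fmt="tab27", first_result=0, max_results=50):
    """Build a REST query URL for a PSICQUIC service (illustrative helper)."""
    # The MIQL query travels in the URL path, so every reserved character is escaped
    path = quote(miql, safe="")
    params = urlencode({
        "format": fmt,
        "firstResult": first_result,
        "maxResults": max_results,
    })
    return f"{base_url.rstrip('/')}/search/query/{path}?{params}"

# Placeholder base URL: replace with the address published by the registry
url = psicquic_query_url(
    "http://example.org/psicquic/webservices/current",
    "negative:(true OR false)",
    max_results=10,
)
print(url)
```

Escaping matters for queries such as idA:\- OR idB:\- because the backslash and spaces are not valid in a raw URL path.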
What should we do?
● Export and index MITAB 2.7
➢ Complex expansion
➢ MIMIx information
➢ Registry tags and tagging interaction
● Use PSICQUIC registry tags that are important at the interaction level
● Move
➢ Gene names and other names to alias columns (col 5 and 6)
➢ Extra unique identifiers to alternative identifiers (col 3 and 4)
➢ Rogid, InChI key and rigid to checksum columns (col 33, 34 and 35)
➢ GO and non identifiers to xref columns (col 23, 24 and 25)
PSICQUIC clustering
Clustering binary interactions
• Clustering = regrouping multiple interaction evidences of a unique pair of interactors in a single MITAB line.
• It boils down to grouping molecule pairs, hence the importance of describing your molecules properly
• Necessary for a user doing data analysis and interaction networking
• http://code.google.com/p/micluster/
Before clustering:
A-B : Y2H
A-B : CIP
A-C : Y2H
A-B : pull down
A-D : pull down
After clustering:
A-B : Y2H | CIP | pull down
A-C : Y2H
A-D : pull down
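The example can be reproduced with a few lines of grouping code. This is a minimal sketch of the idea only, not the micluster implementation; the input tuples and the function name are assumptions.

```python
def cluster(evidences):
    """Merge evidences of the same interactor pair into one line.

    evidences: iterable of (id_a, id_b, detection_method).
    The pair is sorted so that A-B and B-A fall into the same cluster.
    """
    clustered = {}
    for id_a, id_b, method in evidences:
        pair = tuple(sorted((id_a, id_b)))
        methods = clustered.setdefault(pair, [])
        if method not in methods:
            methods.append(method)
    return {f"{a}-{b}": " | ".join(methods) for (a, b), methods in clustered.items()}

evidences = [
    ("A", "B", "Y2H"),
    ("A", "B", "CIP"),
    ("A", "C", "Y2H"),
    ("A", "B", "pull down"),
    ("A", "D", "pull down"),
]
for pair, methods in cluster(evidences).items():
    print(pair, ":", methods)
```

The result depends entirely on the identifiers used as keys, which is why the molecules need to be described consistently across services.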
How to deal with ambiguous identifiers?
• Depends on the list of identifiers provided by each PSICQUIC service
1: A1-B : A1 → uniprotkb:Q5R7D3|uniprotkb:P08107
+
2: A2-B : A2 → uniprotkb:Q5R7D3
= 1 interaction but should it be 2?
3: A2-B : A2 → uniprotkb:P08107
- Use one identifier per species
- Ambiguous identifiers (uniprot gene and organism demerge) can be moved to xrefs
Should we cluster MITAB 2.7?
● Lose experiment/interaction hierarchy: some information is specific to the experiment!
– Experimental roles
– Interaction parameters
– Features and tags
– Host organism
● Some fields are confusing when clustered
– Complex expansion
– Interactor types
– Negative
– Stoichiometry
● Some fields make sense associated with source
– Created date
– Update date
Clustering improvements
● Relying on aliases for identifying molecules? => names are not identifiers
● Proposing other clustering options? (sequence+organism, checksum)
● Respecting the Data Distribution Best Practices avoids inconsistent results => better data integration and analysis for the user
Clustering alternatives
● Clustering unique binary pairs during indexing?
➢ Ex: a new field 'binary': identifier1-identifier2
● Getting the unique binary pairs is instantaneous
● Can have statistics related to a binary pair
● Identifiers always sorted so always same order
● Possibility to keep relationships of original MITAB
● Needs to agree on common identifiers
● Needs regular protein updates
● Not flexible if several identifiers
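A sketch of what such a 'binary' field could hold, assuming one agreed identifier per interactor. The function name and the sample rows are illustrative only.

```python
from collections import Counter

def binary_key(identifier_a, identifier_b):
    """Order-independent key for a pair: identifiers are sorted before joining."""
    return "-".join(sorted((identifier_a, identifier_b)))

# Illustrative rows: the same pair reported in both orders, plus a self interaction
rows = [
    ("uniprotkb:P12345", "uniprotkb:P00533"),
    ("uniprotkb:P00533", "uniprotkb:P12345"),
    ("uniprotkb:P00533", "uniprotkb:P00533"),
]
pair_counts = Counter(binary_key(a, b) for a, b in rows)
print(pair_counts)
```

Because the key is computed once at indexing time, counting or listing unique pairs becomes a lookup on a single field, but the key is only as good as the identifiers chosen.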
New PSICQUIC reference implementation
LUCENE reference implementation 1.2.3
● Input: MITAB 2.5
● Indexing: Lucene indexing (3.0), Calimocho 2.5.0, Psimitab parser 1.8.3
● PSICQUIC 1.2
● MIQL 2.5 (14 fields)
● Output formats: tab25 (default), tab25-bin, xgmml, Biopax, RDF
● Fix some memory issues (pagination, threads, …)
● Use psimitab parser and XML converter 1.8.3 with bug fixes
● Improved XGMML export performance (no 5000-interaction limit)
SOLR reference implementation 1.3.9
● Input: MITAB 2.5, MITAB 2.6, MITAB 2.7
● Indexing: SOLR indexing (3.6.0), Calimocho (2.5.0), Spring batch
● PSICQUIC 1.3
● MIQL 2.7 (35 fields)
● Output formats: tab25 (default), tab26, tab27, xgmml, Biopax, RDF
● Use psimitab parser and XML converter 1.8.3 with bug fixes (can convert MITAB 2.7 to PSI-XML 2.5)
● Improved XGMML export performance (no 5000-interaction limit)
● Common SOLR schema
What is SOLR?
● Web application and web server
● Based on LUCENE => compatible with MIQL
● SolrJ: java API to index/search
● HTTP requests to SOLR
● Caching results
● Provides admin interface
– Browse indexed data
– Access schema and configuration
– Server, cache and index statistics
SOLR admin interface
[Screenshot callouts: help/documentation; query; schema, config, statistics]
SOLR results interface
[Screenshot callouts: query parameters; number of results; document and 'stored' fields]
What is faceting?
● Breaks up search results into multiple categories
● Show counts for each category (facet field)
● Allows user to restrict/filter search based on those facets
Provides statistics about the content of the results for a given query
Example of faceting
Facet results
facet=true&facet.field=species_s
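A sketch of a facet request and of reading the counts back. The q, rows, wt, facet and facet.field parameters are standard SOLR query parameters; the select URL is a placeholder, and the sample facet list is made up for illustration. SOLR's JSON writer returns facet counts as a flat list by default, alternating value and count.

```python
from urllib.parse import urlencode

def solr_facet_url(select_url, query, facet_field):
    """Build a SOLR select URL that returns only facet counts (rows=0)."""
    params = [
        ("q", query),
        ("rows", 0),                  # no documents, statistics only
        ("facet", "true"),
        ("facet.field", facet_field),
        ("wt", "json"),
    ]
    return f"{select_url}?{urlencode(params)}"

def pair_facets(flat_counts):
    """Turn ['human', 12, 'mouse', 3] into {'human': 12, 'mouse': 3}."""
    return dict(zip(flat_counts[0::2], flat_counts[1::2]))

# Placeholder host and path; adapt to the local SOLR installation
url = solr_facet_url("http://localhost:8983/solr/select", "brca2", "species_s")
print(url)
print(pair_facets(["taxid:9606", 12, "taxid:10090", 3]))  # made-up counts
```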
Search: how is data indexed? (1)
● MIQL 2.7 fields indexed but not stored
● Bug fix: split by ':' and duplicated terms!
➢ Ex: MI:0356 => MI, 0356
➢ Ex: taxid:9606(human)|taxid:9606(homo sapiens) => taxid, 9606, human, taxid, 9606, homo, sapiens
● Default fields (free text search)
➢ identifier, pubauth, pubid, interaction_id, detmethod, type, species
Search: how is data indexed? (2)
● Database, value and text for general xrefs
➢ Ex: uniprotkb:P12345 => uniprotkb, P12345 and uniprotkb:P12345
➢ Ex: taxid:9606(human) => taxid, 9606, human and taxid:9606
➢ Ex: uniprotkb:brca2(gene name) => uniprotkb, brca2, “gene name” and uniprotkb:brca2
● Features, annotations
➢ Ex: figure legend:Fig 3. => “figure legend”, “Fig 3.”
➢ Ex: binding site:12-12(text) => “binding site”
● Negative (always excluded by default!)
➢ Ex: '-' or false => false
● Parameters and stoichiometry
➢ Ex: '1' or 'kd:9.0x10^-7 (molar)' => true
➢ Ex: '-' => false
● Publication first author
➢ Ex: 'author (date)' => “author”, “date”
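The xref and presence rules above can be sketched in a few lines. This is an illustration of the mapping described on the slide, not the indexer's actual code; the regular expression and function names are assumptions and do not handle escaped characters.

```python
import re

# db:value or db:"value", optionally followed by (text)
XREF = re.compile(r'(?P<db>[^:]+):"?(?P<value>[^"(]+)"?(?:\((?P<text>[^)]*)\))?')

def index_terms(xref):
    """Terms indexed for one xref: database, value, optional text, and db:value."""
    match = XREF.fullmatch(xref)
    if match is None:
        return [xref]  # unknown syntax: index the raw value
    db, value, text = match.group("db", "value", "text")
    terms = [db, value]
    if text:
        terms.append(text)
    terms.append(f"{db}:{value}")
    return terms

def has_value(column):
    """Presence flag used for parameters and stoichiometry: '-' means empty."""
    return column.strip() not in ("", "-")

print(index_terms("uniprotkb:P12345"))
print(index_terms("taxid:9606(human)"))
print(index_terms("uniprotkb:brca2(gene name)"))
print(has_value("kd:9.0x10^-7 (molar)"), has_value("-"))
```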
Search: how is data indexed? (3)
● Ignore parentheses
● Case insensitive
● Discard common English words (a, with, …)
● Discard empty space before and after a word
● White space tokenizer => search for exact words
➢ Ex: BRCA2 will not match BRCA2b
➢ Ex: P12345 will not match P12345-1 => use P12345*
➢ Ex: experimental will match both 'experimental method' and 'experimental feature'
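The matching behaviour described above can be imitated in plain Python. This is a rough approximation for illustration, not the SOLR analyzer: the stop-word list contains only the two words named on the slide, and wildcard handling is delegated to fnmatch.

```python
from fnmatch import fnmatchcase

STOP_WORDS = {"a", "with"}  # only the words named above; the real list is longer

def tokenize(text):
    """Lower-case, drop parentheses, split on white space, remove stop words."""
    cleaned = text.replace("(", " ").replace(")", " ").lower()
    return [token for token in cleaned.split() if token not in STOP_WORDS]

def matches(query, text):
    """True if the query equals a whole token, or matches it through a * wildcard."""
    pattern = query.lower()
    return any(fnmatchcase(token, pattern) for token in tokenize(text))

print(matches("BRCA2", "BRCA2b"))
print(matches("P12345", "P12345-1"))
print(matches("P12345*", "P12345-1"))
print(matches("experimental", "experimental method"))
```

The point of the sketch is that a token is matched whole: a shorter query never matches a longer token unless the user adds the wildcard.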
What is stored and returned?
● MIQL fields + non searchable fields ending with '_o'
➢ Ex: taxidA_o, pbioroleA_o, checksumA_o
● Excludes copy fields
➢ id, alias, identifier, ptype, pbiorole, ftype, species, pmethod
● Stores the original MITAB column
● Missing fields are automatically replaced by '-'
PSICQUIC facet fields
● MIQL fields ending with '_s'
➢ Ex: species_s, pbiorole_s
● Stores the original MITAB cross reference
➢ Ex: taxid:9606(human) => taxid:9606
● Exact match
● Excludes text
Current indexing issues and possible improvements
● More default fields?
● Alias names: fuzzy search allowed?
● Annotation description: fuzzy search should be allowed
● Sort fields cannot be multivalued!
➢ Unique identifier?
➢ MITAB not clustered => controlled vocabulary terms
➢ Current issue with publication (pubmed, imex)
➢ Cannot sort by annotations and xrefs!
SOLR and PSICQUIC installation
PSICQUIC webservice extensions
● Add a sort parameter
● Allowing faceting
➢ Define method name (not getByQuery for backward compatibility)
➢ Use SOLR XML to return facets or facets embedded in the response?
Current PSICQUIC specifications issues
● SOAP and REST discrepancies
➢ Do we maintain both?
➢ Should we update SOAP with new REST methods?
● Update and improve documentation, bug tracker, FAQ
PSICQUIC view update
Data Distribution Best Practices