neuroimaging databases: a data engineering perspective
![Page 1: Neuroimaging Databases: A Data Engineering Perspective](https://reader036.vdocuments.net/reader036/viewer/2022062315/568158c5550346895dc60df1/html5/thumbnails/1.jpg)
Neuroimaging Databases: A Data Engineering Perspective
Amarnath Gupta
University of California San Diego
![Page 2: Neuroimaging Databases: A Data Engineering Perspective](https://reader036.vdocuments.net/reader036/viewer/2022062315/568158c5550346895dc60df1/html5/thumbnails/2.jpg)
2 IMAGE03, Edinburgh
Three Queries
Emp(eID, name, degree, salary).
Project(pID, start_date, end_date, status).
Dept(dID, name, mgrID).
Works_For(pID, eID, location).
1. Which employees have a Ph.D. degree and work in the San Francisco office?
select E.eID
from emp E, works_for W
where E.degree = 'Ph.D' and
E.eID = W.eID and
W.location = 'San Francisco'
2. Find a pair of employees who always work on the same project in the same location.
select E1.eID, E2.eID
from emp E1, emp E2, works_for W1, works_for W2
where E1.eID = W1.eID and E2.eID = W2.eID and
W1.pID = W2.pID and
W1.location = W2.location and
E1.eID != E2.eID
3. In La Jolla SEARS, find all employees E who earn more than the average manager’s salary (over all departments), and list the managers M who earn less than E.
select E.eID, M.eID
from emp E, emp M, dept D
where E.salary > (
select avg(salary)
from emp E2, dept D2
where E2.eID = D2.mgrID
) and M.eID = D.mgrID and
E.salary > M.salary
group by E.eID
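As a sanity check, the first two queries can be run against a toy instance. The sketch below uses Python's sqlite3 with made-up rows. It assumes that "always" in query 2 means the two employees have identical sets of (project, location) assignments, which is stricter than the four-way join on the slide (that join returns any pair sharing at least one assignment). The table contents and the `q1`/`q2` names are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table emp(eID integer primary key, name text, degree text, salary real);
create table works_for(pID integer, eID integer, location text);
-- Made-up rows for illustration only.
insert into emp values (1,'Ann','Ph.D',120), (2,'Bob','M.S.',90), (3,'Cy','Ph.D',100);
insert into works_for values
  (10,1,'San Francisco'), (11,1,'La Jolla'),
  (10,2,'San Francisco'), (11,2,'La Jolla'),
  (10,3,'San Francisco');
""")

# Query 1: Ph.D. holders who work in San Francisco.
q1 = conn.execute("""
select distinct E.eID
from emp E, works_for W
where E.degree = 'Ph.D' and E.eID = W.eID and W.location = 'San Francisco'
order by E.eID
""").fetchall()

# Query 2 under the assumed "identical assignment sets" reading:
# neither employee has a (project, location) assignment the other lacks.
q2 = conn.execute("""
select E1.eID, E2.eID
from emp E1, emp E2
where E1.eID < E2.eID
  and exists (select 1 from works_for W where W.eID = E1.eID)
  and not exists (
    select 1 from works_for W1 where W1.eID = E1.eID and not exists (
      select 1 from works_for W2
      where W2.eID = E2.eID and W2.pID = W1.pID and W2.location = W1.location))
  and not exists (
    select 1 from works_for W2 where W2.eID = E2.eID and not exists (
      select 1 from works_for W1
      where W1.eID = E1.eID and W1.pID = W2.pID and W1.location = W2.location))
""").fetchall()

print(q1, q2)
```

Even on the enterprise schema, the "always" wording already needs a double negation; the neuroscience versions of these queries add further difficulty on top of that.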
![Page 3: Neuroimaging Databases: A Data Engineering Perspective](https://reader036.vdocuments.net/reader036/viewer/2022062315/568158c5550346895dc60df1/html5/thumbnails/3.jpg)
Now Try These Queries
1. In mice, which ‘calcium binding’ proteins are found in the brain region ‘hippocampus’?
2. Find protein pairs that act as voltage-gated channels and are always co-localized in the region “cerebellum”.
3. In mouse-strain X, find all brain regions R which express more α-synuclein than the average α-synuclein expression level over all other brain regions, and list the brain regions S that express less α-synuclein than R.
A. Why are these queries inherently harder?
B. Why is it a very hard task to build systems that would answer queries like these and produce scientifically valid results?
![Page 4: Neuroimaging Databases: A Data Engineering Perspective](https://reader036.vdocuments.net/reader036/viewer/2022062315/568158c5550346895dc60df1/html5/thumbnails/4.jpg)
The Data Modeling Problem
Lack of disciplined abstraction in modeling the data
![Page 5: Neuroimaging Databases: A Data Engineering Perspective](https://reader036.vdocuments.net/reader036/viewer/2022062315/568158c5550346895dc60df1/html5/thumbnails/5.jpg)
Large Scale Brain Maps
• Custom high precision montaging stage
• 40 X 30 image panels
• 40X 1.3 oil objective
• 800 Mb full resolution TIFF
![Page 6: Neuroimaging Databases: A Data Engineering Perspective](https://reader036.vdocuments.net/reader036/viewer/2022062315/568158c5550346895dc60df1/html5/thumbnails/6.jpg)
The Molecular Distribution Case
• Protein localization queries
  – Which proteins are found more in the granule cell layer than in the Purkinje cell layer?
  – Are proteins P1 and P2 always co-localized, sometimes co-localized or never co-localized in the cerebellum?
  – Which proteins follow the distribution pattern CA1 > (basal ganglia ~ deep cerebellar nuclei) > CA3?
• The abstract model
  – Array Data Model (Libkin, Machlin, Wong 1996)
  – Histogram Data Model (Santini, Gupta 1999)
• A molecular distribution can be modeled as a “block histogram” where the “base dimensions” are in R² (or R³)
• A cell in the histogram can contain a tuple (or a vector) of aggregate values
![Page 7: Neuroimaging Databases: A Data Engineering Perspective](https://reader036.vdocuments.net/reader036/viewer/2022062315/568158c5550346895dc60df1/html5/thumbnails/7.jpg)
Block Histogram as an ADT
type image {
  id: identifier,
  picture: blob,
  regions: set(region),
  color block histogram: 2Darray(histogram),
};
type region {
  label: string,
  shape: polygon
};
type histogram {
  variable name: string,
  value: 1Darray(bucket),
};
type bucket {
  start bucket: integer,
  end bucket: integer,
  count: integer
};
Abstract Data Types Block Histogram
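The type declarations above translate fairly directly into record types. The following is a minimal sketch using Python dataclasses; the field names follow the slide, while the `Polygon` alias, the use of `bytes` for the blob, and the `total` helper are assumptions added for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed encoding: a polygon is a list of (x, y) vertices.
Polygon = List[Tuple[float, float]]


@dataclass
class Bucket:
    start_bucket: int
    end_bucket: int
    count: int


@dataclass
class Histogram:
    variable_name: str
    value: List[Bucket]  # 1D array of buckets

    def total(self) -> int:
        """Hypothetical helper: sum of the counts over all buckets."""
        return sum(b.count for b in self.value)


@dataclass
class Region:
    label: str
    shape: Polygon


@dataclass
class Image:
    id: str
    picture: bytes  # stands in for the blob
    regions: List[Region]
    color_block_histogram: List[List[Histogram]]  # 2D array of histograms


cell = Histogram("protein_amt", [Bucket(0, 63, 5), Bucket(64, 127, 7)])
img = Image(
    id="img-001",
    picture=b"",
    regions=[Region("CA1", [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)])],
    color_block_histogram=[[cell]],
)
print(img.color_block_histogram[0][0].total())
```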
![Page 8: Neuroimaging Databases: A Data Engineering Perspective](https://reader036.vdocuments.net/reader036/viewer/2022062315/568158c5550346895dc60df1/html5/thumbnails/8.jpg)
Querying Block Histograms
• Which proteins follow the distribution pattern CA1 > (basal ganglia ~ deep cerebellar nuclei) > CA3?
  – cut: histogram × polygon → histogram
  – agg: agg_func × histogram × attribute_name → number
  – sim_dist: number × number → number
select protein
from brain_level_protein_distributions D, mouse_atlas M
where a1 is agg(avg, D.pd_hist.cut(M.ca1_poly), protein_amt)
and a2 is agg(avg, D.pd_hist.cut(M.bg_poly), protein_amt)
and a3 is agg(avg, D.pd_hist.cut(M.dcn_poly), protein_amt)
and a4 is agg(avg, D.pd_hist.cut(M.ca3_poly), protein_amt)
and sim_dist(a1, a2) > 0.2 and sim_dist(a2, a3) < 0.1 and sim_dist(a3, a4) > 0.2
Similar models on Volumes and Surfaces are being developed
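To make the three operators concrete, here is a small Python sketch. The slide does not define them beyond their signatures, so everything here is an assumption: a block histogram is a dict from grid cell (row, col) to a dict of aggregate values, a polygon is a list of (x, y) vertices, `cut` keeps the cells whose centers fall inside the polygon, and `sim_dist` is taken to be a relative difference.

```python
from statistics import mean


def point_in_polygon(x, y, poly):
    """Ray-casting test: is (x, y) inside the polygon given as (x, y) vertices?"""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def cut(hist, poly):
    """cut: histogram x polygon -> histogram (cells whose center lies in poly)."""
    return {(r, c): cell for (r, c), cell in hist.items()
            if point_in_polygon(c + 0.5, r + 0.5, poly)}


def agg(func, hist, attr):
    """agg: agg_func x histogram x attribute_name -> number."""
    return func(cell[attr] for cell in hist.values())


def sim_dist(a, b):
    """Assumed definition: relative difference in [0, 1] for non-negative inputs."""
    denom = max(abs(a), abs(b))
    return 0.0 if denom == 0 else abs(a - b) / denom


# Toy 4x4 histogram: high protein amount in the left half, low in the right.
hist = {(r, c): {"protein_amt": 10 if c < 2 else 2}
        for r in range(4) for c in range(4)}
left_poly = [(0, 0), (2, 0), (2, 4), (0, 4)]
right_poly = [(2, 0), (4, 0), (4, 4), (2, 4)]

a_left = agg(mean, cut(hist, left_poly), "protein_amt")
a_right = agg(mean, cut(hist, right_poly), "protein_amt")
print(a_left, a_right, sim_dist(a_left, a_right))
```

With these definitions, a condition such as `sim_dist(a1, a2) > 0.2` reads as "the two regional averages differ by more than 20% of the larger one".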
![Page 9: Neuroimaging Databases: A Data Engineering Perspective](https://reader036.vdocuments.net/reader036/viewer/2022062315/568158c5550346895dc60df1/html5/thumbnails/9.jpg)
The Representation Selection Problem
Often multiple representations of the data are created for different purposes, but the queries are over the “generic” data
![Page 10: Neuroimaging Databases: A Data Engineering Perspective](https://reader036.vdocuments.net/reader036/viewer/2022062315/568158c5550346895dc60df1/html5/thumbnails/10.jpg)
Surface Representations
Neuroscientists use different representations of the cortex surfaces for different purposes
Fiducial representation: as-exact-as-possible representation of the cortex, with all the folds and the creases of the actual surface. Allows the measurement of all geometric quantities of interest, including differential properties (Gaussian curvature, ...), but most quantities are difficult to compute, as they require the integration of the local properties of the surface.
Spherical map: the cortex can be projected on the surface of a sphere in a way that preserves (approximately) the distances between points. This representation affords the efficient computation of distances, areas, and topological relations, but not of properties related to the curvature of the surface.
Flat map: preserves the area of the regions, but introduces cuts so that distances and topological properties can’t be computed
All these representations are stored in the database, but scientists ask questions on a conceptual model based on the fiducial representation. How can we rewrite the query to make optimal use of the available representations?
![Page 11: Neuroimaging Databases: A Data Engineering Perspective](https://reader036.vdocuments.net/reader036/viewer/2022062315/568158c5550346895dc60df1/html5/thumbnails/11.jpg)
Configuration
<RewriteConfiguration>
  <ReplaceTypes>
    <Type name="Cortex"/>
    <Type name="Spherical"/>
  </ReplaceTypes>
  <FunctionParameterTable>
    <Function name="Area">
      <Type name="Cortex" feasibility="2"/>
      <Type name="Spherical" feasibility="5"/>
      <Type name="Flat" feasibility="8"/>
    </Function>
    <Function name="Connectivity">
      <Type name="Cortex" feasibility="5"/>
      <Type name="Spherical" feasibility="5"/>
      <Type name="Flat" feasibility="0"/>
    </Function>
  </FunctionParameterTable>
  <AttributeConversionTable>
    <ConversionSpec type="Cortex">
      <Attribute name="A">
        <Translation type="Spherical"> $.Q </Translation>
        <Translation type="Flat"> $.A </Translation>
      </Attribute>
    </ConversionSpec>
    <ConversionSpec type="Spherical">
      <Attribute name="AREA">
        <Translation type="Flat"> $.A*2 </Translation>
      </Attribute>
    </ConversionSpec>
  </AttributeConversionTable>
  <Strategy>
    <Step type="replacement" threshold="2"/>
    <Step type="consolidation" />
  </Strategy>
</RewriteConfiguration>
Function-type table: for each function of the geometric data cartridge, lists the various representations, and the feasibility of computing that function with the given data type. Feasibility=0 means that the function can’t be computed with data of that type.
Conversion of attributes between representations
Query rewriting strategy
declaration of the types that can be replaced
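A rewriter has to read this configuration before it can decide anything. The sketch below is one possible way to load the function-type table and the strategy steps with Python's standard `xml.etree.ElementTree`; the embedded XML is a trimmed copy of the slide's configuration, and the helper names `load_feasibility` and `usable_types` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the configuration shown on the slide.
CONFIG = """
<RewriteConfiguration>
  <FunctionParameterTable>
    <Function name="Area">
      <Type name="Cortex" feasibility="2"/>
      <Type name="Spherical" feasibility="5"/>
      <Type name="Flat" feasibility="8"/>
    </Function>
    <Function name="Connectivity">
      <Type name="Cortex" feasibility="5"/>
      <Type name="Spherical" feasibility="5"/>
      <Type name="Flat" feasibility="0"/>
    </Function>
  </FunctionParameterTable>
  <Strategy>
    <Step type="replacement" threshold="2"/>
    <Step type="consolidation"/>
  </Strategy>
</RewriteConfiguration>
""".strip()


def load_feasibility(xml_text):
    """Return {function name: {type name: feasibility}} from the config."""
    root = ET.fromstring(xml_text)
    return {
        func.get("name"): {
            t.get("name"): int(t.get("feasibility")) for t in func.findall("Type")
        }
        for func in root.iter("Function")
    }


def usable_types(table, function):
    """Types on which the function can be computed at all (feasibility > 0)."""
    return [name for name, score in table[function].items() if score > 0]


table = load_feasibility(CONFIG)
steps = [step.get("type") for step in ET.fromstring(CONFIG).find("Strategy")]

print(table["Area"])
print(usable_types(table, "Connectivity"))
print(steps)
```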
![Page 12: Neuroimaging Databases: A Data Engineering Perspective](https://reader036.vdocuments.net/reader036/viewer/2022062315/568158c5550346895dc60df1/html5/thumbnails/12.jpg)
Variable replacement-step 1
[Figure: two query trees with select, from, and where branches over variables a, b, c. The first has the predicate F(b); the second has F(b) AND b=]
Insertion of the new variable
VLDB 2002
![Page 13: Neuroimaging Databases: A Data Engineering Perspective](https://reader036.vdocuments.net/reader036/viewer/2022062315/568158c5550346895dc60df1/html5/thumbnails/13.jpg)
Variable replacement-step 2
[Figure: two query trees with select, from, and where branches over variables a, b, c. One has F(b) AND b=; the other has F(replace()) AND b=]
During consolidation, every other function that can be efficiently computed using the variable (which has already been inserted) will be computed using it.
![Page 14: Neuroimaging Databases: A Data Engineering Perspective](https://reader036.vdocuments.net/reader036/viewer/2022062315/568158c5550346895dc60df1/html5/thumbnails/14.jpg)
Scenario
R(α, β): Replace if the current representation has efficiency less than α AND there is a representation with efficiency at least β

| | Fiducial | Spherical | Flat |
|---|---|---|---|
| Gauss | 6 | 8 | 1 |
| Area | 2 | 5 | 8 |
| Connectivity | 5 | 6 | 0 |

Query:
select *
from Cortex c
where (Connectivity(c.TOPO) = 2 AND Gauss(c.PEAKS) < 2) AND Area(c.PTS) < 100
Strategy 1:
1. R(3,3): Area -> Flat
Strategy 2:
1. R(8,8): Gauss -> Spherical, Area -> Flat
Strategy 3:
1. R(8,8): Gauss -> Spherical, Area -> Flat
2. C:
3. R(6,6): Connectivity -> Spherical
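The three strategies can be reproduced with a few lines of code. The sketch below reads the rule as R(α, β), where α and β are assumed names for the two thresholds, and uses the efficiency values from this slide with rows Gauss, Area, Connectivity and columns Fiducial, Spherical, Flat. Picking the highest-scoring alternative is an assumption; the slide only requires that some representation reaches β. The consolidation step C is not modeled.

```python
# Efficiency of computing each function on each representation (from the slide).
FEASIBILITY = {
    "Gauss": {"Fiducial": 6, "Spherical": 8, "Flat": 1},
    "Area": {"Fiducial": 2, "Spherical": 5, "Flat": 8},
    "Connectivity": {"Fiducial": 5, "Spherical": 6, "Flat": 0},
}


def replace_step(assignment, alpha, beta, table=FEASIBILITY):
    """Apply R(alpha, beta) once to a {function: representation} assignment."""
    new = dict(assignment)
    for func, current in assignment.items():
        scores = table[func]
        if scores[current] < alpha:
            # Assumption: choose the best available representation.
            best = max(scores, key=scores.get)
            if scores[best] >= beta and best != current:
                new[func] = best
    return new


# The query is posed on the fiducial (Cortex) representation.
start = {"Gauss": "Fiducial", "Area": "Fiducial", "Connectivity": "Fiducial"}

strategy1 = replace_step(start, 3, 3)
strategy2 = replace_step(start, 8, 8)
strategy3 = replace_step(strategy2, 6, 6)  # consolidation step omitted

for name, result in [("1", strategy1), ("2", strategy2), ("3", strategy3)]:
    print("Strategy", name, result)
```

Under these assumptions the output matches the slide: strategy 1 moves only Area to Flat, strategy 2 also moves Gauss to Spherical, and strategy 3 additionally moves Connectivity to Spherical.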
![Page 15: Neuroimaging Databases: A Data Engineering Perspective](https://reader036.vdocuments.net/reader036/viewer/2022062315/568158c5550346895dc60df1/html5/thumbnails/15.jpg)
A Thought
• Multimedia Databases advocated the need to query by features and k-NN queries
• The mainstream DBs haven’t quite “bought” the idea of features
• Is this the time to think about how attribute-value based querying and feature-based querying would work together?
![Page 16: Neuroimaging Databases: A Data Engineering Perspective](https://reader036.vdocuments.net/reader036/viewer/2022062315/568158c5550346895dc60df1/html5/thumbnails/16.jpg)
The Semantic Rewriting Problem
The user prefers to query on a high-level schema (remember “conceptual query languages”?)
So the system should rewrite the query on the logical schema but the rewriting should be semantically sound
![Page 17: Neuroimaging Databases: A Data Engineering Perspective](https://reader036.vdocuments.net/reader036/viewer/2022062315/568158c5550346895dc60df1/html5/thumbnails/17.jpg)
A Deception
• The scientific question
  – Are proteins P1 and P2 always co-localized, sometimes co-localized or never co-localized in the cerebellum?
• The database queries
  – Find all images I such that
    • anatomic structure A is observed in I
    • A is cerebellum OR part-of(A, cerebellum)
    • R_P1 is a region where P1 is found in I
    • R_P2 is a region where P2 is found in I
    • boundary(A) overlaps boundary(R_P1)
    • boundary(A) overlaps boundary(R_P2)
  – Count the number of images I
  – Similarly find other images where
    • P1 is present but P2 is not in the same regions
  – Report the ratios
part-of(A, cerebellum)
  – Find all images I1, I2 such that
    • anatomic structure A1 is observed in I1
    • anatomic structure A2 is observed in I2
    • A1 is cerebellum OR part-of(A1, cerebellum)
    • part-of(A2, A1)
    • R_P1 is a region where P1 is found in I1
    • R_P2 is a region where P2 is found in I2
    • boundary(A1) overlaps boundary(R_P1)
    • boundary(A2) overlaps boundary(R_P2)
![Page 18: Neuroimaging Databases: A Data Engineering Perspective](https://reader036.vdocuments.net/reader036/viewer/2022062315/568158c5550346895dc60df1/html5/thumbnails/18.jpg)
An External “Knowledge Source”
ANATOM Domain Map
SSDBM 2000
![Page 19: Neuroimaging Databases: A Data Engineering Perspective](https://reader036.vdocuments.net/reader036/viewer/2022062315/568158c5550346895dc60df1/html5/thumbnails/19.jpg)
Using the Ontology
• SMOP – a simple matter of (query) planning?
  – Rewrite the query with the ontology source O, and write a rule to execute the O.part_of predicate first
• Semantic Correctness
  – Purkinje cells are part of the cerebellum
  – dendrite is a compartment of the (generic) neuron
  – Should the images be selected if
    • Image I has P1, P2 in a region marked ‘dendrite’?
    • Image I has P1 in a region labeled ‘dendrite’ and P2 in a different region also marked ‘dendrite’?
    • Image I1 has P1 in a region marked ‘Purkinje Cell’ and I2 has P2 in a region marked ‘Purkinje cell dendrite’?
    • Image I1 has P1 in a region marked ‘SER’ and P2 in a region marked ‘Spine’, both covered by a larger region marked ‘dendrite’?
• How can these cases be automatically taken care of in the query rewriting process?
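The "execute the O.part_of predicate first" plan can be sketched as follows: expand the anatomical term through the ontology, then filter image annotations by the expanded set. The edge list, image identifiers, and function names below are toy assumptions, and the sketch deliberately ignores the semantic-correctness cases listed above (it treats every part_of edge as safe to follow).

```python
from collections import defaultdict

# Toy part_of edges (part, whole); a real ontology source would supply these.
PART_OF_EDGES = [
    ("Purkinje cell", "cerebellum"),
    ("Purkinje cell dendrite", "Purkinje cell"),
    ("cerebellum", "brain"),
    ("hippocampus", "brain"),
]


def parts_of(whole, edges=PART_OF_EDGES):
    """All terms reachable from `whole` by following part_of edges downward."""
    children = defaultdict(list)
    for part, parent in edges:
        children[parent].append(part)
    seen, stack = set(), [whole]
    while stack:
        node = stack.pop()
        for part in children[node]:
            if part not in seen:
                seen.add(part)
                stack.append(part)
    return seen


def images_in(whole, annotations, edges=PART_OF_EDGES):
    """Images with a region labeled `whole` or any of its transitive parts."""
    allowed = parts_of(whole, edges) | {whole}
    return sorted({img for img, label in annotations if label in allowed})


# Hypothetical (image, region label) annotations.
annotations = [
    ("I1", "Purkinje cell"),
    ("I2", "Purkinje cell dendrite"),
    ("I3", "hippocampus"),
]
print(images_in("cerebellum", annotations))
```

The transitive closure is the easy part; deciding whether two regions labeled with related terms really denote the same place is where the rewriting can go wrong.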
![Page 20: Neuroimaging Databases: A Data Engineering Perspective](https://reader036.vdocuments.net/reader036/viewer/2022062315/568158c5550346895dc60df1/html5/thumbnails/20.jpg)
The Ontology Search Problem (aside from the subsumption problem)
The Ontology can be viewed as a large graph where the edges denote relations. These edges may have many labels with widely different semantics. We need to perform meaningful graph-search over them.
![Page 21: Neuroimaging Databases: A Data Engineering Perspective](https://reader036.vdocuments.net/reader036/viewer/2022062315/568158c5550346895dc60df1/html5/thumbnails/21.jpg)
Graph-Structured Knowledge Sources
• Taxonomies are often directed and acyclic
  – Querying labeled graphs
• A large fragment of the ontologies we encounter are DAGs where edges are often transitive
• We represent DAGs in a relational structure
  – Each node carries its DFS traversal numbers
  – Ancestor and Descendant operations become range queries
  – Left biased Numbering scheme
    » Merge nodes: have pointers to all parents
    » Other nodes: have pointers to leftmost parents
    » Parent pointers carry edge labels
  – Path Expressions are evaluated using an extension of the PathStack algorithm (Srivastava et al, 2001)
    » Adds linear (in the number of variables of the path expression) complexity over PathStack
What about more general graphs? What about graphs where the edge labels have specific semantics?
Current 2003
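The idea that DFS traversal numbers turn ancestor and descendant tests into range comparisons can be illustrated on a tree. The sketch below assigns each node a (pre, post) pair and checks interval containment. It covers only the tree case; the merge-node parent pointers of the left-biased scheme and the PathStack extension are not modeled, and the node names are toy data.

```python
def dfs_number(children, root):
    """Assign pre/post DFS numbers; a node's interval contains its descendants'."""
    pre, post = {}, {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        pre[node] = counter
        for child in children.get(node, []):
            visit(child)
        counter += 1
        post[node] = counter

    visit(root)
    return pre, post


def is_ancestor(a, d, pre, post):
    """Range test: a is a proper ancestor of d iff d's interval is inside a's."""
    return pre[a] < pre[d] and post[d] < post[a]


def descendants(a, pre, post):
    """Descendant query as a range scan over the numbering."""
    return sorted(n for n in pre if pre[a] < pre[n] and post[n] < post[a])


# Toy taxonomy (tree only).
children = {
    "brain": ["cerebellum", "hippocampus"],
    "cerebellum": ["Purkinje cell"],
    "hippocampus": ["CA1", "CA3"],
}
pre, post = dfs_number(children, "brain")
print(is_ancestor("brain", "CA1", pre, post))
print(is_ancestor("cerebellum", "CA1", pre, post))
print(descendants("hippocampus", pre, post))
```

Stored as two integer columns in a relation, the same comparisons become ordinary range predicates that an index can serve.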
![Page 22: Neuroimaging Databases: A Data Engineering Perspective](https://reader036.vdocuments.net/reader036/viewer/2022062315/568158c5550346895dc60df1/html5/thumbnails/22.jpg)
Modeling Interactions (Towards a “Disease Map”)
• An interaction in a graph is
  – A labeled edge
    • regulates(A,B)
  – A parameterized edge
    • regulates(up)(A,B)
  – The specialization of an edge
    • activates(A,B,phosphorylation)::regulates(A,B)
  – A conditional edge
    • inhibits(A,B,deacetylation) binds_to (C,A) exists((low(nitrogen)):condition)
  – A complex edge
    • inhibits(binding(A,B), binding(C,D))
  – A state transition
    • releases(Byck1p,Tpk1p)
  – …
PRECOND bound(Byck1p,Tpk1p)
THEN binds_to (cAMP, Byck1p)
POSTCOND bound(Byck1p,cAMP) AND free(Tpk1p)
[Figure: edge diagrams for A regulates B; A regulates(up) B; A’ regulates B’; A activates B; (A binds to B) inhibits(proc, proc) (C binds to D); A inhibits B]
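The state-transition edge (PRECOND / THEN / POSTCOND) can be read as a small rule over a set of facts. The following Python sketch encodes the releases(Byck1p,Tpk1p) example that way. The representation of facts as tuples, the `Transition` class, and the choice to retract the precondition facts when the rule fires are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

# Assumed encoding: a fact is a tuple such as ("bound", "Byck1p", "Tpk1p").
Fact = Tuple[str, ...]


@dataclass(frozen=True)
class Transition:
    name: str
    precond: FrozenSet[Fact]   # must hold before the event
    trigger: Fact              # the THEN event
    postcond: FrozenSet[Fact]  # holds after the event

    def apply(self, state: FrozenSet[Fact], event: Fact) -> FrozenSet[Fact]:
        """Fire the transition if the event matches and the preconditions hold."""
        if event != self.trigger or not self.precond <= state:
            return state
        # Assumption: precondition facts are retracted when the rule fires.
        return (state - self.precond) | self.postcond


releases = Transition(
    name="releases(Byck1p,Tpk1p)",
    precond=frozenset({("bound", "Byck1p", "Tpk1p")}),
    trigger=("binds_to", "cAMP", "Byck1p"),
    postcond=frozenset({("bound", "Byck1p", "cAMP"), ("free", "Tpk1p")}),
)

state = frozenset({("bound", "Byck1p", "Tpk1p")})
after = releases.apply(state, ("binds_to", "cAMP", "Byck1p"))
unchanged = releases.apply(state, ("binds_to", "cAMP", "Tpk1p"))
print(sorted(after))
```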
![Page 23: Neuroimaging Databases: A Data Engineering Perspective](https://reader036.vdocuments.net/reader036/viewer/2022062315/568158c5550346895dc60df1/html5/thumbnails/23.jpg)
The Feasible Rewriting Problem
If sources admit limited access patterns, can feasible plans be constructed?
![Page 24: Neuroimaging Databases: A Data Engineering Perspective](https://reader036.vdocuments.net/reader036/viewer/2022062315/568158c5550346895dc60df1/html5/thumbnails/24.jpg)
A Touch of Theory (Nash and Ludäscher, 2003)
• Web sources, functions and web services can be modeled as relations with limited access patterns
• Planning an arbitrary Union of Conjunctive Queries (UCQ) with negation
  – Checking feasibility is equivalent to checking containment for UCQ and is hence Π₂ᴾ-complete
  – Plan computation for UCQ queries can be approximated by producing an underestimate and an overestimate of the query and deferring the feasibility check
  – Complete answers can be obtained even if parts of the plan are not answerable
    • partial results are produced when some of the conjuncts are feasible
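A limited access pattern says which arguments of a source must be supplied as input before it can be called. The sketch below checks the simplest related property for a single positive conjunctive query: whether its atoms can be ordered so that every input argument is bound by an earlier atom. This is only a sufficient condition; full feasibility (equivalence to some executable query) and negation are not handled. The relation names, the 'i'/'o' pattern strings, and the function name are assumptions.

```python
def executable_order(atoms, patterns, bound=()):
    """Greedily order atoms so each call has its input ('i') arguments bound.

    atoms:    list of (relation, args) with args a tuple of variable names
    patterns: relation -> list of access patterns such as "io" or "oo"
    Returns the ordered plan, or None if no such ordering exists.
    """
    bound = set(bound)
    remaining = list(atoms)
    plan = []
    progress = True
    while remaining and progress:
        progress = False
        for atom in remaining:
            rel, args = atom
            for pat in patterns[rel]:
                if all(a in bound for a, p in zip(args, pat) if p == "i"):
                    plan.append((rel, args, pat))
                    bound.update(args)  # outputs are now available as inputs
                    remaining.remove(atom)
                    progress = True
                    break
            if progress:
                break  # restart the scan with the enlarged set of bound variables
    return plan if not remaining else None


# Hypothetical sources: the atlas can be scanned freely, but the protein
# source needs a polygon as input.
query = [("Protein", ("poly", "p")), ("Atlas", ("region", "poly"))]
ok = executable_order(query, {"Atlas": ["oo"], "Protein": ["io"]})
stuck = executable_order(query, {"Atlas": ["io"], "Protein": ["io"]})
print(ok)
print(stuck)
```

Because the set of bound variables only grows, the greedy scan finds an ordering whenever one exists for this restricted case.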
![Page 25: Neuroimaging Databases: A Data Engineering Perspective](https://reader036.vdocuments.net/reader036/viewer/2022062315/568158c5550346895dc60df1/html5/thumbnails/25.jpg)
The Execution Planning Problem
Remote, Distributed Functions, and Data Movement
(where Data Engineering meets the Grid Environments)
![Page 26: Neuroimaging Databases: A Data Engineering Perspective](https://reader036.vdocuments.net/reader036/viewer/2022062315/568158c5550346895dc60df1/html5/thumbnails/26.jpg)
Planning Queries with Functions
where a1 is agg(avg, D.pd_hist.cut(M.ca1_poly), protein_amt) and … sim_dist(a1, a2) > 0.2 and …
Standard Mediator
• X0 ← select ca1_poly from M @AtlasSource
• X1 ← D.pd_hist.cut(X0) @Datacutter
• a1 ← avg(X0, protein_amt) @Mediator
• temp_store(a1) @MediatorStore
Distributed System over the Grid
• Create transaction T1(
    X0 ← select ca1_poly from M @AtlasSource
    Store X0 into $V1 @AtlasWrapper)
• Create transaction T2(
    X1 ← D.pd_hist.cut(fetch(X0, $T0)) @Datacutter
    Store X1 into $V2 @TempStore)
• a1 ← avg(X0, protein_amt)
• temp_store(a1) @MediatorStore
![Page 27: Neuroimaging Databases: A Data Engineering Perspective](https://reader036.vdocuments.net/reader036/viewer/2022062315/568158c5550346895dc60df1/html5/thumbnails/27.jpg)
Planning Queries with Functions
Distributed System over the Grid with GridService Catalog
• Create transaction T1(
    X0 ← select ca1_poly from M @AtlasSource
    Store X0 into $V1 @AtlasWrapper)
• Create transaction T2(
    ServiceCatalog.lookup(histogram_cutting_service, $resource, $paramList)
    R1 ← constructRequest((X1 ← D.pd_hist.cut(fetch(X0, $T0))), $resource, $paramList)
    X1 ← ExecuteRequest(R1))
• Create transaction T3(
    S1 ← getSize(X1)
    ServiceCatalog.lookup(dataStorageService, S1, $resource, $params2)
    R2 ← constructRequest((Store X1 into $V2), $resource, $params2))
How do you plan (and cost estimate) the operations?
![Page 28: Neuroimaging Databases: A Data Engineering Perspective](https://reader036.vdocuments.net/reader036/viewer/2022062315/568158c5550346895dc60df1/html5/thumbnails/28.jpg)
The “Goodness of Result” Problem
The query retrieves information from the information sources.
The Result Processor may need to estimate the “quality” of the results with respect to a reference
![Page 29: Neuroimaging Databases: A Data Engineering Perspective](https://reader036.vdocuments.net/reader036/viewer/2022062315/568158c5550346895dc60df1/html5/thumbnails/29.jpg)
Two Viewpoints
• The application person
  – Send the result retrieved
    • Case 1
      – To a statistical package and compute standard statistics S1…Sk
    • Case 2
      – To a program that generates a specialized random set of data and matches the statistical significance of the retrieved results
• The database person
  – For these applications
    • Can we perform the queries on a sample rather than the entire data? Any guidelines on the sampling method?
    • Can we use approximations instead of producing exact answers?
    • Should we find only “interesting” or “most frequent” data by using data mining algorithms?
    • Can we package the descriptive statistics that a DBMS can compute to make the overall work more efficient?
    • Can the use of user-defined aggregates (cf. ATLAS project at UCLA) help eliminate the statistical package?
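As a small illustration of the last question, a user-defined aggregate can push a descriptive statistic into the query itself. The sketch below registers a sample standard deviation with Python's sqlite3 via `create_aggregate`. SQLite is only a stand-in here (it is not the ATLAS system mentioned on the slide), and the table, column names, and values are made up.

```python
import math
import sqlite3


class SampleStdDev:
    """User-defined aggregate: sample standard deviation (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def step(self, value):
        if value is None:
            return
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)

    def finalize(self):
        if self.n < 2:
            return None  # undefined for fewer than two values
        return math.sqrt(self.m2 / (self.n - 1))


conn = sqlite3.connect(":memory:")
conn.create_aggregate("stddev_samp", 1, SampleStdDev)
conn.execute("create table expression(region text, amount real)")
# Made-up measurements for illustration only.
conn.executemany(
    "insert into expression values (?, ?)",
    [("CA1", 2.0), ("CA1", 4.0), ("CA1", 6.0), ("CA3", 5.0)],
)
rows = conn.execute(
    "select region, stddev_samp(amount) from expression "
    "group by region order by region"
).fetchall()
print(rows)
```

The statistic is computed in a single pass during the scan, so the result set never has to leave the database for this step.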
![Page 30: Neuroimaging Databases: A Data Engineering Perspective](https://reader036.vdocuments.net/reader036/viewer/2022062315/568158c5550346895dc60df1/html5/thumbnails/30.jpg)
In Essence
• A tour of a few “database-y” problems we have encountered so far in our work with Neuroimaging and associated information
  – Still scratching the surface of most problems
• The help of forward-thinking domain scientists has been the most crucial asset in figuring out the problems at a deeper-than-usual level
• We, the database scientists, need to be “cross-thinkers” and venture beyond our own domain of specific expertise to develop a holistic approach to these problems
• There are many more exciting problems – let’s go get them!!
![Page 31: Neuroimaging Databases: A Data Engineering Perspective](https://reader036.vdocuments.net/reader036/viewer/2022062315/568158c5550346895dc60df1/html5/thumbnails/31.jpg)
Acknowledging
• Maryann Martone – who always asks hard questions I don’t know how to answer
• Bertram Ludäscher – who has finally convinced me that “theory” is more practical than I thought
• Simone Santini – the feature-man who (almost) always wins the argument on any technical matter
• Animesh Ray – the geneticist, who is forcing me to learn and think about process interactions and models of complex phenomena
• Mark Ellisman – the godfather who excels at making offers we can’t refuse
• The staff and students who make it happen