QUERY BY EXCEL A. Witkowski, S. Bellamkonda, T. Bozkaya, B. A. Naimat, L. Sheng, S. Subramanian, A. Waingold Oracle Corporation

Post on 04-Feb-2016

TRANSCRIPT

Page 1: QUERY BY EXCEL

QUERY BY EXCEL

A. Witkowski, S. Bellamkonda, T. Bozkaya, B. A. Naimat, L. Sheng,

S. Subramanian, A. Waingold

Oracle Corporation

Page 2: QUERY BY EXCEL

Spreadsheets

Spreadsheets are established analytical tools:

– Attractive user interface
– Easy-to-use computational model
– Interactivity for what-if analysis

But they do not offer:

– Scalability
– Parallelization
– A unified view of the data model

Page 3: QUERY BY EXCEL

Our proposal

QUERY BY EXCEL (QBX) combines:

– Presentational and interactive modeling power of Excel (spreadsheet tools)
– Computational power and scalability of an RDBMS via analytical extensions

Page 4: QUERY BY EXCEL

QBX – How it works

The analyst builds a model using Excel. The model is translated to SQL and stored in relational views.

Analysts designate areas in Excel as relational sources (RTables). An RTable can be transformed into another RTable using Excel operations corresponding to Outer Join, Selection, Projection and Aggregation. Analyst does not write any SQL during this process.

Analysts write Excel formulas on samples of relational sources that fit in a spreadsheet. The translated SQL works on the whole data for scalability.

Business Reporting tools can access the relational views for consolidation.

Page 5: QUERY BY EXCEL

Analytic SQL Extensions for QBX

– SQL MODEL (Witkowski et al., SIGMOD 2003)
– SQL PIVOT (Cunningham et al., VLDB 2004)

Page 6: QUERY BY EXCEL

QBX Architecture

[Figure: QBX architecture diagram. Labels: Excel Analyst; EXCEL (Interaction & Modeling); QBX (Excel -> SQL Translation, Persistence); RDBMS (Database Schema, QBX generated SQL Objects); Application RDBMS User.]

Page 7: QUERY BY EXCEL

QBX Metadata

Cells( eid, sheet, row, col, x, f)

Example spreadsheet:

     A    B    C      D
1              Sale   Diff
2              10.00
3              12.00  =C3-C2

For this Excel spreadsheet, we store five rows in the Cells table: C1, C2, C3, D1, D3.

eid  sheet  row  col  x        f
1    1      1    3    'Sale'
1    1      1    4    'Diff'
1    1      2    3    '10.00'
1    1      3    3    '12.00'
1    1      3    4             '=C3-C2'

RTables( eid, RTable, sheet, row, col, sample, RTableView, …)

Excels( eid, name, owner, ExcelBinary, SQLView)
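The Cells layout above can be illustrated with a small Python sketch that flattens a sparse sheet into (eid, sheet, row, col, x, f) tuples. The function names `col_number` and `cells_rows` are hypothetical, and the split of constants into x and formulas into f is an assumption based on the column names, not something the slides state explicitly.

```python
def col_number(letters):
    """Convert an Excel column label such as 'C' or 'AA' to a 1-based number."""
    n = 0
    for ch in letters.upper():
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n


def cells_rows(eid, sheet, sheet_cells):
    """Flatten {address: content} into (eid, sheet, row, col, x, f) tuples.

    Assumption: constants are stored in x, formulas (content starting with
    '=') are stored in f, and empty cells are not stored at all.
    """
    rows = []
    for address, content in sheet_cells.items():
        letters = "".join(ch for ch in address if ch.isalpha())
        digits = "".join(ch for ch in address if ch.isdigit())
        is_formula = content.startswith("=")
        x = None if is_formula else content
        f = content if is_formula else None
        rows.append((eid, sheet, int(digits), col_number(letters), x, f))
    # Sort by (row, col) so the output order is deterministic.
    return sorted(rows, key=lambda r: (r[2], r[3]))


example = {"C1": "Sale", "D1": "Diff", "C2": "10.00", "C3": "12.00", "D3": "=C3-C2"}
rows = cells_rows(1, 1, example)
for row in rows:
    print(row)
```

Only non-empty cells produce rows, which is why the three-row example sheet yields five tuples.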

Page 8: QUERY BY EXCEL

QBX Infrastructure

– Interaction and Modeling Component (VBA add-on)
   – Menu interface (QBX)
   – QBX->RTables manages RTables (import, add column, save as relational view, ...)
   – QBX->Spreadsheet translates Excel to SQL, saves and loads it
– Persistence Component (VBA add-on)
– Translation Component

Page 9: QUERY BY EXCEL

Excel to SQL Translation

– Fix Frame Translation
– Table Translation
– Unified Translation

Page 10: QUERY BY EXCEL

Fix Frame Translation

     A    B             C
1    1    =A1+A2
2    2    =A3+1
3    3    =sum(A1:A3)

SELECT sheet, row, col, x FROM cells
MODEL DIMENSION BY (sheet, row, col) MEASURES (x)
RULES AUTOMATIC ORDER
( x[1,1,2] = x[1,1,1] + x[1,2,1],    -- B1=A1+A2
  x[1,2,2] = x[1,3,1] + 1,           -- B2=A3+1
  x[1,3,2] = sum(x)[1,1<=row<=3,1]   -- B3=sum(A1:A3)
);
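As a rough illustration of the fix frame idea, the following Python sketch turns an Excel formula into a rule string in the slide's notation. It is not the QBX translator: `to_rule` and `col_number` are hypothetical names, and the sketch only understands plain cell references and `sum` over a single-column range (a function name such as `LOG10` would be misread as a cell).

```python
import re

# Two alternatives: sum over a range, or a plain cell reference.
TOKEN = re.compile(r"sum\(([A-Z]+)(\d+):([A-Z]+)(\d+)\)|([A-Z]+)(\d+)", re.IGNORECASE)


def col_number(letters):
    """Convert an Excel column label such as 'B' to a 1-based number."""
    n = 0
    for ch in letters.upper():
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n


def to_rule(sheet, target, formula):
    """Translate one formula into a rule over x[sheet, row, col]."""

    def repl(m):
        if m.group(1):  # sum(<col><row>:<col><row>)
            c1, r1 = col_number(m.group(1)), int(m.group(2))
            c2, r2 = col_number(m.group(3)), int(m.group(4))
            if c1 != c2:
                raise ValueError("sketch handles single-column ranges only")
            return f"sum(x)[{sheet},{r1}<=row<={r2},{c1}]"
        return f"x[{sheet},{int(m.group(6))},{col_number(m.group(5))}]"

    t = TOKEN.fullmatch(target)
    if t is None or t.group(5) is None:
        raise ValueError("target must be a single cell such as 'B1'")
    lhs = f"x[{sheet},{int(t.group(6))},{col_number(t.group(5))}]"
    body = formula.lstrip("=")
    return f"{lhs} = {TOKEN.sub(repl, body)}"


for cell, formula in [("B1", "=A1+A2"), ("B2", "=A3+1"), ("B3", "=sum(A1:A3)")]:
    print(to_rule(1, cell, formula))
```

Each cell address maps to fixed dimension values, which is what makes this a "fix frame": the generated rules refer to absolute (sheet, row, col) positions.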

Page 11: QUERY BY EXCEL

Fix Frame Translation - VLOOKUP / HLOOKUP

We use a REFERENCE SQL MODEL. VLOOKUP(key, (<rs,cs>, <re,ce>), col) is translated as:

REFERENCE vlookup_ref ON

( SELECT k.x key, v.x value

FROM cells k, cells v

WHERE k.col=cs AND v.col=cs+col-1 AND

k.row >= rs AND k.row <= re AND

v.row=k.row )

DIMENSION BY (key) MEASURES (value)

Page 12: QUERY BY EXCEL

Fix Frame Translation-VLOOKUP - HLOOKUP

EXAMPLE: A3 =Vlookup(C3, A1:B4, 2)

SELECT row, col, x FROM cells
MODEL
  REFERENCE vlookup_ref ON
    ( SELECT k.x key, v.x value
      FROM cells k, cells v
      WHERE k.col = 1 AND v.col = 2 AND
            k.row >= 1 AND k.row <= 4 AND
            v.row = k.row )
    DIMENSION BY (key) MEASURES (value)
  MAIN DIMENSION BY (row, col) MEASURES (x)
  RULES
  ( x[3,1] = vlookup_ref.value[ x[3,3] ] );

     A    B    C    D
1    1    2    3    4
2    11   6    7    8
3    9    10   11   12
4    13   14   15   16
5    17   18   19   20
6
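The reference model in this example behaves like a key-to-value map built from two columns of the Cells table. A minimal Python sketch of that behaviour follows; `build_reference` is a hypothetical helper, and the lookup is an exact match, which is an assumption about the translated semantics (Excel's own VLOOKUP defaults to an approximate match on sorted keys).

```python
def build_reference(cells, rs, cs, re_, col):
    """Map key -> value for rows rs..re_.

    Keys come from column cs, values from column cs + col - 1, mirroring
    the WHERE clause of the vlookup_ref reference query.
    """
    value_col = cs + col - 1
    return {
        cells[(row, cs)]: cells[(row, value_col)]
        for row in range(rs, re_ + 1)
        if (row, cs) in cells and (row, value_col) in cells
    }


# The 5x4 grid from the slide, keyed by (row, col).
grid = [
    [1, 2, 3, 4],
    [11, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
    [17, 18, 19, 20],
]
cells = {
    (r, c): v
    for r, line in enumerate(grid, start=1)
    for c, v in enumerate(line, start=1)
}

# VLOOKUP(C3, A1:B4, 2): keys in column A, values in column B, rows 1..4.
vlookup_ref = build_reference(cells, rs=1, cs=1, re_=4, col=2)
key = cells[(3, 3)]             # the content of C3
result = vlookup_ref.get(key)   # None when the key is absent
print(key, result)
```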

Page 13: QUERY BY EXCEL

Table Translation

Table Translation creates named, protected regions within Excel called RTables.

We keep the associated metadata for RTable regions (PK, PK-FK constraints, etc.).

A direct RTable represents either:

– An RDBMS table (entire table or sample)
– An RDBMS view

Direct RTables can be created through the QBX menu.

A derived RTable represents the result of relational operations on other RTables.

Page 14: QUERY BY EXCEL

RTable Example

     A       B      C      D      …   I         J
1    FACT                             TIME_D
2    City    Prod   Month  Sale       Month     Year
3    LA      tv     M1.00  10.00      M1.00     Y.00
4    LA      radio  M2.00  12.00      M2.00     Y.00
5    LA      tv     M1.01  14.00
6    LA      radio  M2.01  16.00      PROD_D
7    Boston  tv     M1.00  20.00      Prod      Categ
8    Boston  radio  M2.00  22.00      tv        Video
9    Boston  tv     M1.01  24.00      radio     Audio
10   Boston  radio  M2.01  26.00
11                                    REGION_D
12                                    City      State
13                                    LA        CA
14                                    Boston    MA

RTable regions:

FACT(A2:D10)
TIME_D(I2:J4)
PROD_D(I7:J9)
REGION_D(I12:J14)

Page 15: QUERY BY EXCEL

Table Translation - Operations

– Inter-column calculations
   – Adding a new (calculated) column to an RTable
– Projection
– Joining of RTables
   – The closest Excel operation to a join is HLOOKUP/VLOOKUP, which is similar to a relational OUTER JOIN.
   – Steps for R1 LEFT OUTER JOIN R2 ON R1.col1 = R2.col2:
      1. A new column is added.
      2. The new column is populated with VLOOKUP(R1.Col1, R2, R2.Col2).
– Aggregation
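The two join steps listed above can be sketched in Python as a lookup-populated column. This is an illustration only: `left_outer_join` is a hypothetical helper, the "dvd" row is invented to show the unmatched case, and the sketch assumes the lookup column is a key of R2 (with duplicate keys a real SQL join would return more rows, while a lookup returns only the first match).

```python
def left_outer_join(r1, r2, r1_col, r2_key, r2_value, new_col):
    """Add new_col to every row of r1, filled by looking up r1_col in r2."""
    lookup = {}
    for row in r2:
        # First match wins, as with a lookup function.
        lookup.setdefault(row[r2_key], row[r2_value])
    # Step 1: add the column. Step 2: populate it; unmatched rows get None,
    # which plays the role of the NULL produced by an outer join.
    return [{**row, new_col: lookup.get(row[r1_col])} for row in r1]


fact = [
    {"City": "LA", "Prod": "tv", "Sale": 10.0},
    {"City": "LA", "Prod": "radio", "Sale": 12.0},
    {"City": "Boston", "Prod": "dvd", "Sale": 5.0},  # hypothetical unmatched row
]
prod_d = [
    {"Prod": "tv", "Categ": "Video"},
    {"Prod": "radio", "Categ": "Audio"},
]

joined = left_outer_join(fact, prod_d, "Prod", "Prod", "Categ", "Categ")
for row in joined:
    print(row)
```

Every row of the left RTable survives, which is why the lookup corresponds to an outer join and not an inner one.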

Page 16: QUERY BY EXCEL

Inter-column Calculations

Computations involving columns of the same row. Example: A1 = B1+D1

     A    B    C    D
1    10   8         2
2    15   9         6
3    18   2         16
…    …    …         …

MODEL
DIMENSION BY (row, col) MEASURES (x)
RULES
( x[ANY, 1] = x[cv(row),2] + x[cv(row),4] )
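To make the effect of the rule concrete, here is a small Python sketch that applies it to the three visible rows. `apply_rule` is a hypothetical name, and the sketch assumes that `ANY` only matches cells that already exist, so column A starts with placeholder values that the rule overwrites.

```python
def apply_rule(cells):
    """Emulate x[ANY, 1] = x[cv(row), 2] + x[cv(row), 4].

    cells maps (row, col) -> value. Only existing column-1 cells are
    updated; cv(row) is the row of the cell currently being assigned.
    """
    out = dict(cells)
    for (row, col) in cells:
        if col == 1:
            out[(row, 1)] = cells[(row, 2)] + cells[(row, 4)]
    return out


# Columns B and D from the slide; column A holds placeholders (assumption).
cells = {
    (1, 1): None, (1, 2): 8, (1, 4): 2,
    (2, 1): None, (2, 2): 9, (2, 4): 6,
    (3, 1): None, (3, 2): 2, (3, 4): 16,
}
result = apply_rule(cells)
print([result[(row, 1)] for row in (1, 2, 3)])
```

A single rule replaces one formula per row, which is what lets the translated SQL run over the whole table instead of the sample shown in the spreadsheet.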

Page 17: QUERY BY EXCEL

Table Translation – Join Example

     A       B      C      D      E      F     G      …   I      J
1    FACT                                                 TIME_D
2    City    Prod   Month  State  Categ  Year  Sale       Month  Year
3    LA      tv     M1.00  CA     video  Y.00  10.00      M1.00  Y.00
4    LA      radio  M2.00  CA     audio  Y.00  12.00      M2.00  Y.00
5    LA      tv     M1.01  CA     video  Y.01  14.00
6    LA      radio  M2.01  CA     audio  Y.01  16.00      PROD_D
7    Boston  tv     M1.00  MA     video  Y.00  20.00      Prod   Categ
8    Boston  radio  M2.00  MA     audio  Y.00  22.00      tv     Video
9    Boston  tv     M1.01  MA     video  Y.01  24.00      radio  Audio
10   Boston  radio  M2.01  MA     audio  Y.01  26.00

E3=VLOOKUP(B3,I8:J9,2)

F3=VLOOKUP(C3,I3:J4,2)

Page 18: QUERY BY EXCEL

Join SQL

SELECT f.city, f.prod, f.month, g.state, p.categ, t.year, sale,
       row_number() over (order by f.city, f.prod, f.month) rn
FROM
  fact f left outer join time_d t on f.month = t.month
         left outer join prod_d p on f.prod = p.prod
         left outer join geog_d g on f.city = g.city
ORDER BY city NULLS LAST, prod NULLS LAST,
         month NULLS LAST;

Page 19: QUERY BY EXCEL

Table Translation - Aggregation

Aggregation in Excel can be done through the DATA PIVOTTABLE operation. In the RDBMS this corresponds to:

– Aggregation via the SQL GROUP BY operator
– Aggregation via the SQL PIVOT operator

Page 20: QUERY BY EXCEL

Aggregation – SQL GROUP BY

     L      M     N
1    AGG_Q
2    State  Year  Total
3    CA     Y.00  22.00
4           Y.01  30.00
5                 52.00
6    MA     Y.00  42.00
7           Y.01  50.00
8                 92.00

SELECT state, year, sum(amt) amt,
       row_number() over (order by state, year) rn
FROM
  fact f left outer join time_d t on f.month = t.month
         left outer join prod_d p on f.prod = p.prod
         left outer join geog_d g on f.city = g.city
GROUP BY
  GROUPING SETS ((state, year), (state))
ORDER BY state NULLS LAST,
         year NULLS LAST;
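The effect of GROUPING SETS ((state, year), (state)) can be checked with a short Python sketch over the joined rows of the example. `grouping_sets` is a hypothetical helper; the subtotal rows carry `None` for year, standing in for the NULL that SQL produces, and the sort key mimics NULLS LAST.

```python
from collections import defaultdict

# (state, year, sale) for the eight joined FACT rows of the example.
joined = [
    ("CA", "Y.00", 10.0), ("CA", "Y.00", 12.0),
    ("CA", "Y.01", 14.0), ("CA", "Y.01", 16.0),
    ("MA", "Y.00", 20.0), ("MA", "Y.00", 22.0),
    ("MA", "Y.01", 24.0), ("MA", "Y.01", 26.0),
]


def grouping_sets(rows):
    """Aggregate by (state, year) and by (state) in one pass."""
    totals = defaultdict(float)
    for state, year, sale in rows:
        totals[(state, year)] += sale  # grouping set (state, year)
        totals[(state, None)] += sale  # grouping set (state): year is NULL
    # ORDER BY state NULLS LAST, year NULLS LAST
    return sorted(
        totals.items(),
        key=lambda kv: (kv[0][0], kv[0][1] is None, kv[0][1] or ""),
    )


agg = grouping_sets(joined)
for (state, year), total in agg:
    print(state, year, total)
```

The per-state subtotal lands after that state's yearly rows, matching the layout of the AGG_Q region.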

Page 21: QUERY BY EXCEL

Translation of Fix Frame Operations on RTables

Excel computation is possible once we map relational data to 2-D form. This is called linearization.

– Assignment Linearization
– Reference Linearization

Page 22: QUERY BY EXCEL

Reference Linearization

     L      M     N      O
1    AGG_Q
2    State  Year  Total  Ratio
3    CA     Y.00  22.00  =N3/N5
4           Y.01  30.00  =N4/N5
5                 52.00  =N5/N5
6    MA     Y.00  42.00  =N6/N8
7           Y.01  50.00  =N7/N8
8                 92.00  =N8/N8

SELECT row, col, x FROM cells
MODEL
  REFERENCE r ON
    ( SELECT rn, state, time, total FROM RT )
    DIMENSION BY (rn)
    MEASURES (state, time, total)
  DIMENSION BY (row, col) MEASURES (x)
  ( x[3, 15] = r.total[1] / r.total[3],  -- =N3/N5
    x[4, 15] = r.total[2] / r.total[3],  -- =N4/N5
    x[5, 15] = r.total[3] / r.total[3],  -- =N5/N5
    x[6, 15] = r.total[4] / r.total[6],  -- =N6/N8
    x[7, 15] = r.total[5] / r.total[6],  -- =N7/N8
    x[8, 15] = r.total[6] / r.total[6]   -- =N8/N8
  );
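The mapping from spreadsheet addresses to positions in the reference model can be sketched in Python as follows. This is an assumption-laden illustration: `linearize` is a hypothetical name, `rn` is taken to number the RTable's data rows from 1 in display order, and the caller supplies which column letter corresponds to which measure.

```python
import re

CELL = re.compile(r"([A-Z]+)(\d+)")


def col_number(letters):
    """Convert an Excel column label such as 'O' to a 1-based number."""
    n = 0
    for ch in letters.upper():
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n


def linearize(target, formula, first_data_row, measures):
    """Rewrite RTable cell references as references into model r.

    measures maps a column label to a measure name, e.g. {'N': 'total'}.
    Spreadsheet row k of the RTable becomes rn = k - first_data_row + 1.
    """

    def repl(m):
        letters, row = m.group(1), int(m.group(2))
        rn = row - first_data_row + 1
        return f"r.{measures[letters]}[{rn}]"

    t = CELL.fullmatch(target)
    if t is None:
        raise ValueError("target must be a single cell such as 'O3'")
    lhs = f"x[{int(t.group(2))}, {col_number(t.group(1))}]"
    return f"{lhs} = {CELL.sub(repl, formula.lstrip('='))}"


# AGG_Q data starts in spreadsheet row 3; column N holds the total measure.
print(linearize("O3", "=N3/N5", 3, {"N": "total"}))
print(linearize("O6", "=N6/N8", 3, {"N": "total"}))
```

The left-hand side keeps its spreadsheet coordinates, while the right-hand side reads from the relational source through the reference model.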

Page 23: QUERY BY EXCEL

Relative Referencing to RTables

Introducing a new lookup function for referencing values in Rtables:

RTLOOKUP(RTREGION, COL, {PKEYS})

     L      M     N      O
1    AGG_Q
2    State  Year  Total  Ratio
3    CA     Y.00  22.00  0.42
4           Y.01  30.00  0.58
5                 52.00  1.00
6    MA     Y.00  42.00  0.46
7           Y.01  50.00  0.54
8                 92.00  1.00

O3 = N3/rtlookup(L2:N8,3,L3,NULL)

O4 = N4/rtlookup(L2:N8,3,L3,NULL)
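A possible reading of RTLOOKUP is a lookup by primary key instead of by cell position. The Python sketch below emulates that reading; the signature mirrors RTLOOKUP(RTREGION, COL, {PKEYS}) but the implementation is an assumption, and the rows here carry the state explicitly even where the spreadsheet display leaves the repeated value blank.

```python
def rtlookup(rtable, col, *pkeys):
    """Return the value in 1-based column col of the row whose leading
    key columns equal pkeys, or None when no row matches."""
    for row in rtable:
        if tuple(row[:len(pkeys)]) == tuple(pkeys):
            return row[col - 1]
    return None


# AGG_Q as (state, year, total); year is None on the per-state subtotal rows.
agg_q = [
    ("CA", "Y.00", 22.0), ("CA", "Y.01", 30.0), ("CA", None, 52.0),
    ("MA", "Y.00", 42.0), ("MA", "Y.01", 50.0), ("MA", None, 92.0),
]

# Ratio = Total / (subtotal of the same state), located by key (state, NULL).
ratios = [
    round(total / rtlookup(agg_q, 3, state, None), 2)
    for state, _year, total in agg_q
]
print(ratios)
```

Because the denominator is located by key, the same formula works for every row and does not depend on how many rows each state contributes.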

Page 24: QUERY BY EXCEL

Relative Referencing to RTables

CREATE VIEW AGG_Q AS
SELECT state, year, ratio, sum(amt) amt,
       row_number() over (order by state, year) rn
FROM fact f left outer join time_d t on f.month = t.month
            left outer join prod_d p on f.prod = p.prod
            left outer join geog_d g on f.city = g.city
GROUP BY GROUPING SETS ((state, year), (state))
MODEL DIMENSION BY (state, year) MEASURES (total, 0 ratio)
( ratio[ANY, ANY] = total[CV(state), CV(year)] / total[CV(state), null] )
ORDER BY state nulls last, year nulls last;

Page 25: QUERY BY EXCEL

Optimizations

– Collapsing of equivalent rules
– For loops vs. existential form
– Existing optimizations of SQL Model functionality
   – Rule pruning
   – Filter pushdown
   – Others…

Page 26: QUERY BY EXCEL

Optimizations- Collapsing of Rules

A1=B1+D1, copied down the column:

     A    B    C    D
1    10   8         2
2    15   9         6
3    18   2         16
4    …    …         …
5

MODEL
DIMENSION BY (row, col) MEASURES (x)
RULES
(
  x[1,1] = x[1,2] + x[1,4],
  x[2,1] = x[2,2] + x[2,4],
  ...
  x[20,1] = x[20,2] + x[20,4]
);

==

MODEL
DIMENSION BY (row, col) MEASURES (x)
RULES
( x[for row from 1 to 20, 1] = x[CV(row),2] + x[CV(row),4] );
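One way to picture the collapsing step is to normalise each rule to row offsets relative to its target and merge runs of consecutive rows that share the same shape. The Python sketch below does this under simplifying assumptions: `collapse` and `render` are hypothetical names, a rule is a target cell plus a tuple of referenced cells, and the right-hand side is always a sum of those references.

```python
def collapse(rules):
    """Merge rules that differ only in the row index.

    Each rule is (row, col, refs), with refs a tuple of (row, col) cells
    summed on the right-hand side. Returns groups covering row ranges.
    """
    groups = []
    open_by_sig = {}
    for row, col, refs in sorted(rules):
        # Signature: target column plus references relative to the target row.
        sig = (col, tuple((r - row, c) for r, c in refs))
        group = open_by_sig.get(sig)
        if group is not None and group["last"] == row - 1:
            group["last"] = row  # extends a consecutive run
        else:
            group = {"sig": sig, "first": row, "last": row}
            open_by_sig[sig] = group
            groups.append(group)
    return groups


def render(group):
    """Print a group as a single FOR-loop rule in the slide's notation."""
    col, rel = group["sig"]

    def ref(dr, c):
        row = "CV(row)" if dr == 0 else f"CV(row){dr:+d}"
        return f"x[{row},{c}]"

    rhs = " + ".join(ref(dr, c) for dr, c in rel)
    return f"x[for row from {group['first']} to {group['last']},{col}] = {rhs}"


# Twenty rules of the form x[r,1] = x[r,2] + x[r,4].
rules = [(row, 1, ((row, 2), (row, 4))) for row in range(1, 21)]
groups = collapse(rules)
print(render(groups[0]))
```

Twenty per-row rules become one rule, so the number of rules to compile no longer grows with the number of spreadsheet rows.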

Page 27: QUERY BY EXCEL

Optimizations- Collapsing of Rules

[Figure: compilation time (y-axis, 0 to 45) versus number of rules (x-axis: 500, 1000, 2500, 5000, 10000).]

Page 28: QUERY BY EXCEL

Optimization – For Loops vs Existential Rules

A1=B1+D1, copied down the column:

     A    B    C    D
1    10   8         2
2    15   9         6
3    18   2         16
4    …    …         …
5

For-loop form:

MODEL
DIMENSION BY (row, col) MEASURES (x)
RULES
( x[for row from 1 to 20, 1] = x[CV(row),2] + x[CV(row),4] );

VS

Existential form:

MODEL
DIMENSION BY (row, col) MEASURES (x)
RULES
( x[1<=row<=20, 1] = x[CV(row),2] + x[CV(row),4] );
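The difference between the two forms can be pictured as enumerating target rows and doing point lookups versus scanning every cell and testing a predicate. The Python sketch below is a rough model of that access pattern, not of the engine's implementation: `for_loop_form` and `existential_form` are hypothetical names, and the data set is invented.

```python
def for_loop_form(cells, first, last):
    """x[for row from first to last, 1]: enumerate rows, point lookups.

    The work is proportional to the number of rows assigned.
    """
    out = dict(cells)
    for row in range(first, last + 1):
        out[(row, 1)] = cells[(row, 2)] + cells[(row, 4)]
    return out


def existential_form(cells, first, last):
    """x[first<=row<=last, 1]: scan all cells, update the ones that match.

    The work is proportional to the total number of cells.
    """
    out = dict(cells)
    for (row, col) in cells:
        if col == 1 and first <= row <= last:
            out[(row, 1)] = cells[(row, 2)] + cells[(row, 4)]
    return out


# Invented data: 1000 rows with columns A (placeholder 0), B and D present.
cells = {}
for r in range(1, 1001):
    cells[(r, 1)] = 0
    cells[(r, 2)] = r
    cells[(r, 4)] = 2 * r

by_loop = for_loop_form(cells, 1, 20)
by_scan = existential_form(cells, 1, 20)
print(by_loop == by_scan, by_loop[(20, 1)])
```

When the target cells already exist, both forms produce the same result; which one is cheaper then depends on how large a fraction of the cells the rule touches.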

Page 29: QUERY BY EXCEL

Optimization – For Loops vs Existential Rules

[Figure: lookup time / scan time (y-axis, 0 to 1.6) versus percentage of cells modified (x-axis: 0.1, 0.5, 1, 5, 10, 20, 40). Annotations: "Use For Loops", "Use scan".]

Page 30: QUERY BY EXCEL

Conclusion

Our goal is to translate Excel computation to SQL so that business models built in Excel can be stored and queried in an RDBMS.

We proposed translation techniques for expressing Excel computation in RDBMS SQL using new analytic extensions.

We proposed representation techniques for relational data in Excel using RTables and described how Excel operations on RTables can be simulated in SQL.

We discussed how this proposed system would fit into our RDBMS SQL execution engine and benefit from all its capabilities and optimizations.

Page 31: QUERY BY EXCEL

What is ahead?

Excel

– Pivoting and advanced filtering turned out to be essential, but a few more relational-friendly extensions would go a long way, particularly in simulating joins and window function computations.

SQL

– RDBMS SQL needs to be extended to cover the functionality provided in Excel, particularly financial functions.

Page 32: QUERY BY EXCEL

Questions & Answers