
Advantages and Disadvantages of Using SAS/MDDB®, HOLAP, and SAS/IntrNet® in the Development of an Interactive System

Lori Guido
Richard Denby

Housing and Household Economic Statistics Division (HHES)

Overview

• U.S. Census Bureau best known for conducting a decennial census
• A decennial census is
  – A national event that involves everyone
  – Conducted in years ending in a zero

Census Data

• Three major categories
  – “Apportionment” data
  – “Redistricting” data
  – “Long form” data

Census Data Continued ...

• “Apportionment” data
  – Population totals for each state
  – Used to determine the number of seats in the U.S. House of Representatives for each state
  – Given to the U.S. President by December 31

Census Data Continued ...

• “Redistricting” data
  – Used to delineate congressional and other election districts
  – Given to the states within one year of the census

Census Data Continued ...

• “Long form” data
  – Used to manage or evaluate federal programs
  – Collected on a form containing many more questions than the standard form
  – 53 questions covering 34 subjects
  – 1 in 6 households received the “long form”
  – Review scheduled to begin during Fall 2001

The Task

Find SAS technologies best suited for the review of the Census 2000 long form data

What Was Done

• Three prototypes were developed
  – SAS 6.12 prototype
  – HOLAP prototype using SAS 8.1
  – Web-based prototype using SAS 8.1

Selection Criteria

• Easy to maintain
• Flexible
• 40,000,000+ obs
• New data incorporated on a “flow” basis
• View data for multiple states within the same report
• Easy to deploy
• Display data quickly

The SAS V6.12 Prototype

• Developed by HHES
• Client-server approach
• Each PC has to have
  – SAS 6.12
  – The application’s config file
  – The application’s profile file

The SAS V6.12 Prototype’s Data

• Reside on Unix servers running the Solaris operating system

• Accessed using “Remote library services” via SAS/CONNECT®

• One metabase file on a Novell server
• Application files on a Novell server

The SAS V6.12 Prototype’s Files

• For each state and subject area there is
  – One detail data set
  – One MDDB
  – One set of SAS/EIS® reports
  – One set of application files

The SAS V6.12 Prototype

– Data are displayed via a gif file
– Hot spots link the states to an icon that lists the reports for each state
– Access one state’s data at a time
– One set of 74 reports for each state (3400+)

The SAS V6.12 Prototype

– Seven seconds to display a summary report
– Two seconds to display 557,000 detailed obs from a data set of 29,700,000 obs
– “Off the shelf” objects
– No override methods are used
– Minimal knowledge of screen control language (SCL) needed

SAS V6.12 Prototype - Problem

• Formats to display meaningful character values instead of coded values

• “Show detail” option does not show any detail records

• “Reach-through” process searches for the character values used in the formats instead of the stored data values

SAS V6.12 Prototype - Solution

• Create a separate field with the meaningful character value

The HOLAP Prototype

• Developed by SAS and HHES
• Client-server approach
• Retains all the functionality of the SAS v6.12 prototype
• Uses hybrid online analytical processing (HOLAP) technology

HOLAP Technology

• Allows users to access data from multiple sources as if the data were coming from a single data source.

• A SAS view allows concurrent access to the detail data sets
  – Must be rerun each time a new state’s data is added
  – Input to a “template” MDDB

The HOLAP Template MDDB

• Stores the hierarchies, formats, and base table attributes to pass to the resulting HOLAP cube

• Does not have to be rerun every time a new state is added

• Will need to be rerun only when the hierarchies, formats, or base tables change.

The HOLAP Cube

• Stored in the central repository
• Holds the location of each component of the logical data group and what it contains
• Has to be updated each time a new state is added.

The HOLAP Technology

• Decreases the number of reports to be created

• Allows one single set of 74 reports to be used for any state

• Multiple states and multiple years can be viewed in one report

• Allows the reports to be created in advance

The HOLAP Prototype

• Client-server approach
• Each PC has to have
  – SAS 8.1
  – The application’s config file
  – The application’s profile file
  – The central repository

The HOLAP Prototype’s Data

• Reside on Unix servers
• Accessed using “Remote library services” via SAS/CONNECT®

• Central repository holds the metabase and application information

The HOLAP Prototype’s Files

• For each state and subject area there is
  – One detail data set
  – One MDDB

The HOLAP Prototype

• Data are displayed via a gif file
• Hot spots link the states to an icon that lists the reports
• Override method allows multiple states to be chosen
• A list box holds the selected states

The HOLAP Prototype - Problem

• “Show detail data” was slow
  – 40 minutes to display 269,104 obs from a data set of 18,366,027 obs
  – Caused by the view that is used to point to all of the detail data sets
  – SAS does not use the indexes on the data sets when accessing the data via a view

The HOLAP Prototype - Solution

• An override method
  – Captures the cell the cursor is sitting on
  – Determines what the cell represents
  – Bypasses the HOLAP cube and view
  – Goes directly to the data set that contains the detail data
  – Subsets the data based on the cell’s contents
  – Five seconds to display 269,104 detailed obs from a data set of 18,366,027 obs

The HOLAP Prototype

• Incorporated the use of display formats
  – Data set values did not have to be re-coded
  – More testing is needed to verify that the problem is completely fixed.
• Decreased disk space storage requirements with savings ranging from 32% to 63%

The HOLAP Prototype

– Twelve seconds to display a summary report
– Five seconds to display 269,104 detailed obs from a data set of 18,366,027 obs
– Customized objects
– Override methods are used
– More in-depth knowledge of SCL needed

The Web-based Prototype

• Developed by SAS and HHES
• Web-based approach
• Uses SAS/IntrNet® features
• Retains all the functionality of the SAS v6.12 prototype
• Uses HOLAP technology

The Web-based Prototype

• Each PC has to have a browser
• Data reside on Unix servers
• Central repository, application’s config file and profile file reside on Unix server
• Central repository holds the metabase and application information

The Web-based Prototype’s Files

• For each state and subject area there is
  – One detail data set
  – One MDDB per subject area

The Web-based Prototype

• Data displayed via an HTML file
• JavaScript captures the mouse clicks and places the selections in the list box
• Macro program code accesses the HOLAP cube to retrieve the selected data
• MDDB report viewer (MRV) is used to display the data

Web-based Prototype

• Setting up the SAS/IntrNet® server is difficult
• Seven seconds to display a summary report
• 80 seconds to display the first set of 50 detailed observations
• Extensive knowledge of JavaScript will be required to maintain this prototype

Analysis

• HHES compared the prototypes with each other, reviewing the benefits and disadvantages of each

• HHES also evaluated the prototypes against the selection criteria to decide which technology would be suited for the Census 2000 Long Form Data Review System

Findings

• No clear-cut winner
• No prototype met all of HHES’s criteria
• All three prototypes could handle large files
• All three prototypes can incorporate new data on a “flow” basis easily
• Only the prototypes using HOLAP allow users to view multiple states

Findings Continued …

• The SAS v6.12 prototype
  – Least flexible
  – Easiest for HHES to maintain
• The HOLAP prototype
  – More flexible
  – Harder to maintain due to the override methods

Findings Continued …

• The Web-based prototype
  – More flexible
  – Easiest to deploy
  – Harder to maintain due to the JavaScript

HHES Will Probably Go Client-server HOLAP

– Fewer reports to create

– SAS v8.1 is closer to 8.2 than 6.12

– HOLAP technology will allow the users to view multiple states

– Potential problems with the HOLAP prototype should be the easiest to handle

– The benefits of using the HOLAP prototype technology outweigh the other prototypes’ advantages

Conclusion

• More research is needed
  – Tuning
  – Load testing
  – Expand HHES’s knowledge of the technology used in the prototypes

Next Steps

• Currently HHES is creating a SAS v8.1 HOLAP system using Census 1990 data to fully load test the technology
  – Try different tuning strategies
  – Research and test the other outstanding prototype questions
  – Gain more override and SCL experience.

CONTACT INFORMATION

Richard A. Denby
U.S. Census Bureau
4700 Silver Hill Road, 8500-3
Washington, DC 20233-1912
phone 301-457-6810
fax 301-457-3248
email [email protected]

Lori A. Guido
U.S. Census Bureau
4700 Silver Hill Road, 8500-3
Washington, DC 20233-1912
phone 301-457-3204
fax 301-457-3499
email [email protected]

REFERENCES

SAS® is a registered trademark of SAS in the United States of America and in other countries

Questions?

Advantages and disadvantages of using MDDBs, HOLAP and SAS/IntrNet in the development of an interactive system

Lori A. Guido, U.S. Census Bureau
Richard A. Denby, U.S. Census Bureau

1. Introduction

The U.S. Census Bureau is best known for conducting a decennial census, because it is a national event that involves everyone. A decennial census is conducted in years ending in a zero. By December 31 of a census year, the Census Bureau must provide to the President state population totals and the number of seats to which each state is entitled in the House of Representatives. These are called “apportionment” data.

One of the most important uses of decennial census data is to delineate congressional and other election districts. These are called “redistricting” data. Most states have tight deadlines for completing their redistricting work so that it will be finished in time for the 2002 elections. The Census Bureau is required to provide redistricting data to the states within one year of the decennial census, and these data are the first set of detailed data from the census. Producing these data is a massive undertaking that involves tabulating the characteristics of more than 280 million people in 120 million housing units assigned to 39,000 governmental entities in 7.5 million census blocks.

The apportionment and redistricting data are derived from the total universe of all persons. Additionally, approximately one in six households received a “long form” questionnaire containing 53 questions covering 34 subjects. Every question in Census 2000 was required by law to manage or evaluate federal programs or was needed to meet legal requirements stemming from U.S. court decisions such as the Voting Rights Act. Federal dollars supporting schools, employment services, housing assistance, highway construction, hospital services, programs for the elderly, and more are distributed based on census data.

The Housing and Household Economic Statistics Division (HHES) of the U.S. Census Bureau researched different SAS technologies to find ones best suited for the development of an interactive system to review the Census 2000 long form data. Three different prototype systems were developed. All three systems use a client-server approach with data stored on centralized, large-scale Unix servers and the data accessed through PCs running Windows 95. While all the systems use multidimensional databases (MDDBs), and all the systems surface the same data, each system uses distinct SAS technologies. The first system uses SAS v6.12 and the MDDB report object in SAS/EIS® to create 3400+ commonly used reports. The second system uses SAS v8.1’s Hybrid On-Line Analytical Processing (HOLAP) techniques to build a “proxy” HOLAP cube in addition to the MDDB report object in SAS/EIS®, but only 74 reports had to be created. The third system uses SAS v8.1’s HOLAP and SAS/IntrNet® and other web products to surface the data. This paper describes the advantages and disadvantages of each technology used, including development, deployment, and performance issues, in addition to the space savings obtained via the use of formats.

2. Problem

The technology had to meet several criteria to be suited for the Census 2000 Long Form Data Review system. It had to handle data files containing more than five million observations. It had to allow for programming flexibility, because the specifications for the system might change even once the system was in production. The resulting system had to be easy to maintain and enhance: HHES will have to develop a review system in a very short amount of time, and the analysts will have very little time to review the data, so the programmers will have to be able to fix any problem with the application quickly. New data had to be incorporated on a “flow” basis easily, without impacting the data that were already there. The technology had to allow users to view several groupings of data at once. The systems developed with this new technology had to be easy to deploy to several different user communities, each of which had a different PC environment. Of course, the data had to be surfaced to the user as quickly as possible.

3. The SAS v6.12 Prototype

The first prototype HHES developed uses SAS 6.12 under Windows 95 and resides on users’ PC desktops. Each PC has to have SAS 6.12, the application’s config file and the application’s profile file installed on it. The data reside on Unix servers running the Solaris operating system. “Remote library services” via SAS/CONNECT are used to access the MDDBs and the underlying detail data sets on the Unix servers. There is one detail data set, one MDDB, one set of reports and one set of application files for each state. One metabase file holds all of the metadata for all of the states. The application files and the metabase file reside on a Novell server. For this system, more than 3400 reports would have to be created, with one set of 74 reports having to be created for each state. In other words, there would have to be 50 copies of each report, since each report could only access one MDDB, or one state’s data. If a problem with a label or title was found in one of the reports, all the copies of that report would have to be updated; there is no way to mass-correct the reports. Also, the risk of making mistakes increases when developing such a large number of reports.

HHES identified a substantial limitation in the use of SAS formats with version 6.12. Analysts want to see display formats that are more meaningful than the underlying coded values. For example, the SAS/EIS report displays the words “male” and “female” in place of the coded values of “1” and “2”. The problem is that when analysts use the “show detail” option from the EIS multi-dimensional report object, none of the detail records making up the cell total are found. The reason is that during the “reach-through” process, SAS internally searched for “male,” when in reality all the data values for males were stored as “1”.

So, for reports where data ranges or labels have to be displayed, a separate field was created and the appropriate range was stored in it. As another example, the “Drilldown 3 Highest Degree” report, shown in Figure xx, displays the number of people who have completed certain levels of schooling (EHIGH) by the edit parameters that were used to allocate the data if the field was blank on the response form (FLHIGH). For each observation, a field called QHIGH contained a character code that stood for a certain level of education obtained. The pre-tabbing processing created another field called EHIGH, which contained the character string corresponding to the character code. Figure x shows a proc print of a few observations of the QHIGH and EHIGH fields from the detail data set that goes into this report.
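The mismatch between displayed labels and stored codes, and the separate-field workaround, can be illustrated with a minimal sketch. It is written in Python rather than SAS, and every name in it (`SEX_FORMAT`, `reach_through`, `add_label_field`, the `sex_label` field) is hypothetical; it mimics the behavior described above and is not the SAS/EIS implementation.

```python
# Hypothetical illustration only: these names are not part of SAS/EIS.
SEX_FORMAT = {"1": "male", "2": "female"}  # stored code -> display label

# Made-up detail records that store the coded value, as the detail data sets do.
detail = [
    {"id": 1, "sex": "1"},
    {"id": 2, "sex": "2"},
    {"id": 3, "sex": "1"},
]

def reach_through(rows, field, value):
    """Return the rows whose stored value in `field` equals `value`."""
    return [row for row in rows if row[field] == value]

def add_label_field(rows, field, fmt, new_field):
    """Workaround: store the meaningful label in a separate field."""
    return [{**row, new_field: fmt[row[field]]} for row in rows]

# Searching the coded field for the displayed label finds nothing.
broken = reach_through(detail, "sex", "male")

# After the label is stored in its own field, the same search succeeds.
recoded = add_label_field(detail, "sex", SEX_FORMAT, "sex_label")
fixed = reach_through(recoded, "sex_label", "male")
```

The cost of this workaround is an extra character field on every observation, which is the storage overhead the v8.1 prototype later avoids.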

The data are surfaced via a gif file that contains a map that links each state’s hot spot to an icon that lists the reports for each state. This configuration only allows access to one state’s data at a time. It takes seven seconds to surface a summary report. It takes two seconds to surface 557,000 detailed observations from a data set of 29,700,000 observations. Most of the objects used for this system were “off the shelf.” No override methods are used. Extensive SCL knowledge was not needed for the creation of this system.

4. The SAS v8.1 HOLAP Prototype

HHES worked with SAS to develop the next two prototypes. This prototype took advantage of new features found in SAS v8.1, but also retained all of the functionality contained in the SAS v6.12 prototype. “Remote library services” are also used in this system to access the MDDBs and the underlying detail data sets on the Unix servers. There is still one detail data set and one MDDB for each state. Now, a central repository contains all the reports and the metabase registrations. However, the repository has to reside on each PC. Each PC also has to have SAS 8.1, the application’s config file and the application’s profile file installed on it. The prototype’s main feature is a new SAS technology called hybrid online analytical processing (HOLAP). This technology allows users to access data from multiple local and remote MDDBs and data sets as if the data were coming from a single data source. A SAS view allowing concurrent access to all of the detail data sets is created. The view must be rerun each time a new state’s data is added. This view is the input to a “template” MDDB. The template MDDB stores the hierarchies, formats, and base table attributes to pass to the resulting HOLAP cube. It does not have to be rerun every time a new state is added. It only needs to be rerun when the hierarchies, formats, or base tables change.

The HOLAP cube is stored in the central repository. It holds the location of each component of the logical data group and what it contains, and it has to be updated each time a new state is added. Through the use of the HOLAP technology, the number of reports that will have to be created has been decreased from the more than 3400 required in the SAS v6.12 prototype to a single set of 74 reports that can be used for any state. The reports can be created in advance, decreasing the amount of time between file processing and file availability to the users. Users of the system will also be able to view data for multiple states and multiple years in one report.
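The idea of one report definition serving any combination of states can be sketched as follows. This is a Python illustration under stated assumptions, not the HOLAP implementation: `state_summaries` stands in for the per-state MDDBs, `run_report` for a single report definition, and all counts are made up.

```python
# Made-up summary counts; each entry stands in for one state's MDDB.
state_summaries = {
    "CA": {"male": 120, "female": 130},
    "TX": {"male": 80, "female": 90},
}

def run_report(summaries, states):
    """One report definition, reused for whichever states are selected."""
    totals = {}
    for state in states:
        for category, count in summaries[state].items():
            totals[category] = totals.get(category, 0) + count
    return totals

single = run_report(state_summaries, ["CA"])
combined = run_report(state_summaries, ["CA", "TX"])

# Adding a state only registers new data; the report definition is untouched.
state_summaries["NY"] = {"male": 50, "female": 60}
with_new_state = run_report(state_summaries, ["CA", "TX", "NY"])
```

In the v6.12 design the equivalent of `run_report` would exist once per state, which is why a label fix had to be repeated across every copy.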

The data are surfaced via a modified version of the gif file used in the SAS v6.12 prototype. The file still contains the map, but for this prototype, there are hot spots for each state, as well as regions and the entire US. The desktop frame class, which points to SASHELP.EIS.RUNGRPH1.SCL, is overridden so that new SCL code that allows multiple states to be selected will execute. It takes twelve seconds to surface a summary report. With the override method described below, it takes five seconds to surface 269,104 detailed observations from a data set of 18,366,027 observations.

Another override method is used to correct a potential performance problem that HHES discovered. While in the “Wage/Salary Income” report, when only California was chosen, a “show detail data” took 40 minutes to subset 269,104 records from the 18,366,027 available and present the subset on the screen. SAS believes this performance problem is caused by the HOLAP cube, which uses a view to point to all of the detail data sets.

SAS does not use the indexes on the data sets when accessing the data via a view. Even if only one state is selected, SAS reads all the detail data sets in the data group sequentially to pull out the data that was chosen. The prototype now uses an override method that captures the cell the cursor is sitting on when a “show detail data” is invoked. The SCL code determines what cell the cursor is on, bypasses the HOLAP cube, goes directly to the data set that contains the detail data for the cell and subsets the data based on the cell’s contents. Using this method, the same 269,104 observations in the Texas detail data set were displayed in five seconds. A more in-depth knowledge of SCL is needed to maintain or update this prototype due to the use of these override methods.
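The difference between reading through a view over every detail data set and going directly to the one data set behind the selected cell can be sketched as below. This is a Python illustration with made-up data and hypothetical function names; it only counts rows read and does not model SAS indexes or SCL.

```python
from itertools import chain

# Made-up detail data: one list of rows per state, standing in for one
# detail data set per state.
DETAIL_SETS = {
    "CA": [
        {"state": "CA", "band": "low"},
        {"state": "CA", "band": "high"},
        {"state": "CA", "band": "low"},
    ],
    "TX": [
        {"state": "TX", "band": "high"},
        {"state": "TX", "band": "low"},
    ],
    "NY": [{"state": "NY", "band": "low"}],
}

def show_detail_via_view(detail_sets, state, band):
    """Scan every data set in turn, as a view over all of them would."""
    rows_read, hits = 0, []
    for row in chain.from_iterable(detail_sets.values()):
        rows_read += 1
        if row["state"] == state and row["band"] == band:
            hits.append(row)
    return hits, rows_read

def show_detail_direct(detail_sets, state, band):
    """Go straight to the one data set the selected cell belongs to."""
    rows = detail_sets[state]
    hits = [row for row in rows if row["band"] == band]
    return hits, len(rows)

via_view = show_detail_via_view(DETAIL_SETS, "CA", "low")
direct = show_detail_direct(DETAIL_SETS, "CA", "low")
print(via_view[1], direct[1])  # rows read by each approach
```

Both calls return the same detail rows; only the amount of data read differs, and that gap grows with every state added to the data group.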

The prototype also incorporated reach-through from display formats to data set values, so the values did not have to be re-coded. This decreased disk space storage requirements, with savings ranging from 32% to 63%. For example, the original education data set for Texas, which was created in SAS version 6.12 and had re-coded values stored in the data set, was 9 gigabytes. After the data set was converted to version 8.1 and the formats were applied, the data set was 3.4 gigabytes. This produced a 61% savings in disk space.
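Why display formats save space can be seen in a rough sketch: a re-coded label is stored on every observation, while a format stores each label once. The Python below uses hypothetical codes, labels, and fixed-width assumptions; the real QHIGH coding and SAS storage layout are not given here, and because a real data set holds many other fields, the proportional saving in this toy example is far larger than the 32% to 63% reported.

```python
# Hypothetical codes and labels; the real education coding is not given.
EDUCATION_FORMAT = {
    "01": "No schooling completed",
    "09": "High school graduate",
    "13": "Bachelor's degree",
}
codes = ["09", "13", "09", "01"] * 1000  # 4,000 made-up observations

def bytes_with_recoded_field(codes, fmt):
    """Code plus a fixed-width label field stored on every observation."""
    code_width = max(len(code) for code in fmt)
    label_width = max(len(label) for label in fmt.values())
    return len(codes) * (code_width + label_width)

def bytes_with_format(codes, fmt):
    """Only the code on every observation, plus one shared format table."""
    code_width = max(len(code) for code in fmt)
    format_table = sum(len(code) + len(label) for code, label in fmt.items())
    return len(codes) * code_width + format_table

recoded = bytes_with_recoded_field(codes, EDUCATION_FORMAT)
formatted = bytes_with_format(codes, EDUCATION_FORMAT)
saving = 1 - formatted / recoded
```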

5. The SAS v8.1 IntrNet® Prototype

The third prototype takes advantage of the new features used in the SAS v8.1 HOLAP prototype, retains most of the functionality contained in the SAS v6.12 prototype, and also takes advantage of the features of SAS/IntrNet®. This was the first attempt at the U.S. Census Bureau to build a web-enabled HOLAP system incorporating “reach-through” to the underlying detail data sets. Sun’s Web server is used with SAS/IntrNet® to access the MDDBs and the underlying detail data sets on the Unix servers. There is still one detail data set and one MDDB for each state. The HOLAP technology is still used, and all the reports and the metabase registrations are still contained in a central repository. However, SAS v8.1, the application’s config file, the application’s profile file and the repository reside on the Unix server.

The SAS v8.1 IntrNet prototype differs from the SAS v8.1 HOLAP prototype in how the data are surfaced. The data are surfaced via an HTML file that contains a map that has hot spots for each state, as well as regions and the entire US. JavaScript captures the mouse clicks and places the selections in the list box. A SAS macro program takes the list of selected states and manipulates some code to access the HOLAP cube that points to the selected data. Included in the prototype is a Java applet created with AppDev Studio 1.2 on a PC, then packaged and FTP’d to the Solaris Web Server root.
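The step of turning a list of selected states into a data selection can be sketched as below. This is Python, not the prototype’s SAS macro or JavaScript; `build_state_filter` and the `state in (...)` clause it produces are assumptions made for illustration, since the actual generated code is not shown.

```python
def build_state_filter(selected, valid_states):
    """Turn the list-box selections into a filter clause (hypothetical format)."""
    if not selected:
        raise ValueError("at least one state must be selected")
    unknown = [state for state in selected if state not in valid_states]
    if unknown:
        raise ValueError(f"unknown states: {unknown}")
    # Drop repeated clicks on the same state while keeping click order.
    unique = list(dict.fromkeys(selected))
    quoted = ", ".join(f"'{state}'" for state in unique)
    return f"state in ({quoted})"

clause = build_state_filter(["CA", "TX", "CA"], {"CA", "TX", "NY"})
```

Validating the selections against a known list before building the clause keeps arbitrary text from a web form out of the generated code.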

Setting up the SAS/IntrNet® server is very tedious. Knowledge of Unix system administration, web server software (Apache or Sun web server) and general installation procedures is needed to perform this task. The auto installer did not work for this product. The instructions were lengthy and written in pieces, so the installer could not get an overall picture of all the steps necessary to complete the task. Extensive knowledge of JavaScript will be required to maintain or expand this prototype. Summary reports do not look the same (see Figure xy). Expanding the reports is very tedious and time consuming. Response time varies with the size of the files. It takes seven seconds to surface a summary report. It takes 80 seconds to surface the first set of 50 detailed observations.

6. Findings

HHES compared the prototypes to each other, reviewing the benefits and disadvantages of each. HHES also evaluated the prototypes against the criteria to be used to decide which technology would be suited for the Census 2000 Long Form Data Review system. HHES could find no clear-cut winner based on this analysis. No prototype met all of HHES’s criteria. All three prototypes could handle data files containing more than five million observations. All three prototypes can incorporate new data on a “flow” basis easily, without impacting the data that were already there. Only the prototypes that use the HOLAP technology allow users to view several groupings of data at once. The SAS v6.12 prototype was able to surface the data faster than either of the SAS v8.1 prototypes.

The SAS v6.12 prototype was the least flexible because of the number of reports that would have to be created. The SAS v8.1 HOLAP prototype and the SAS v8.1 IntrNet prototype offer more flexibility. Changes to the report labels and titles that would affect all the states can be made quickly in the prototypes that use the HOLAP technology, since each has only one set of reports that can access any of the states.

The SAS v6.12 prototype was the easiest to maintain, because it uses “off the shelf” SAS objects. The objects are well documented. The HHES programmers have a lot of experience using SAS v6.12.

The SAS v8.1 HOLAP prototype is the second hardest application to maintain, due to the use of the override methods and SCL code. Except for the override methods, this prototype is very similar to the SAS v6.12 prototype. The HOLAP creation process was very easy to understand, maintain and modify. There are objects in SAS v8.1 that would allow a user to choose which state they would like to view. The use of such an object would eliminate the need for the map and the need to override the desktop frame class. However, an override method had to be used to meet HHES’s criterion of fast response time to surface detail data. Organizations with more override method and SCL programming experience would be able to maintain the prototype with ease.

HHES thinks the SAS v8.1 IntrNet prototype is the most complicated and the hardest prototype to enhance or maintain. This assessment is based on HHES’s lack of experience with JavaScript and the problems with the server installation. Organizations with more JavaScript and SAS/IntrNet server installation experience might be able to maintain the prototype with ease.

The SAS v8.1 IntrNet prototype is the easiest application to deploy. All that is needed on the PC is a web browser. Both the SAS v6.12 and the SAS v8.1 HOLAP prototypes require that the appropriate version of SAS be installed on the user’s PC. With a very slight modification, the prototypes could probably use a network version of SAS, but we have not tested this yet. The config and profile files needed by the SAS v6.12 prototype can be copied to the user’s PC via a network release package. HHES is not sure if the central repository in the SAS v8.1 HOLAP prototype can be deployed to the user’s PC via a network release package. Another possible option for the SAS v8.1 HOLAP prototype deployment would be to have the central repository reside on a network server, but HHES has not tested the ramifications of this configuration yet.

7. Conclusion

More research is needed. HHES did not try tuning anything, nor did HHES fully load test the prototypes. HHES is leaning toward using the technology used in the SAS v8.1 HOLAP prototype for five reasons. One, there are fewer reports to create. Two, SAS v8.1 is closer to the most current SAS version than SAS v6.12. Three, the use of the HOLAP technology will allow the users to view groups of data instead of individual states. Four, the potential problems we might encounter with the override methods and SCL code in the SAS v8.1 HOLAP prototype will probably be overcome with additional training and mentoring from more experienced programmers. Unfortunately, the HHES programmers have the most experience with SAS/IntrNet and JavaScript of any programmers at the Census Bureau, so mentoring is not an option if we use the SAS v8.1 IntrNet prototype technology. Five, the benefits of using the SAS v8.1 HOLAP prototype technology outweigh the SAS v6.12 prototype’s performance advantage.

Currently we are creating a SAS v8.1 HOLAP system using Census 1990 data. This will allow us to fully load test the technology before we go into production with Census 2000 data. It will allow us to try different tuning strategies. If these tuning strategies work on Census 1990 data, they are almost guaranteed to work on Census 2000 data. It will also allow us to research and test the other outstanding questions that we had with the prototype, while allowing the programmers to gain more override and SCL experience.