Module 3: DLI Ordering and Processing - Carleton University, DLI/IDD 1997 Workshop
TRANSCRIPT
DLI/IDD 1997 Workshop
lr:ed.ch:module3 1
DLI Ordering and Processing
3.1 Identifying files for acquisition
The two driving forces behind data acquisition are an expressed need on the part
of a user and collection development policy.
Be aware that it is explicit DLI policy that data files should be acquired only in
response to an expressed user need, and not acquired from DLI simply because a
file is available.
The presumption here is that a reference interview has taken place. Therefore,
ascertain the following:
either the specific data file the user requires, or
the variables the user needs, including
what coding of these variables is needed
what time period the data should cover
what geographic area the user needs to describe
whom or what the user needs to describe: individuals, groups of
individuals, or a geographic area (i.e. what unit of observation is
needed)
what the user intends to do with the numbers
what product the user wants
what software the user will be using
what platform the user will be doing his/her work on.
3.1.1 Selecting files
The characteristics of a data file can be categorized into those which describe the
intellectual content of the data file, and those which describe the physical form
of the data file.
Characteristics of the intellectual content:
Variables
substantive (dependent variables) versus demographic (independent
variables)
level of coding of variables (e.g. age or income as a categorical or
continuous variable)
Time period of data
date of data collection
time period covered by data (which is not necessarily the date of
collection; e.g., the income variables in the Survey of Consumer
Finances and the Census normally refer to the previous year.)
Geography
coverage of geographic area
availability of variables to identify required level of geography. For
example, the Survey of Consumer Finances microdata files
contain coding for region, but not province or census metropolitan
area.
Level of observation
microdata
aggregate data
time series data
One can aggregate from microdata to aggregate data, but cannot
disaggregate already aggregated data to smaller units or to
microdata.
                             Desired Output
Data Type           Microdata   Aggregate data   Time-series data
Microdata               x             x
Aggregate data                        x                 x
Time-series data                                        x
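The one-way nature of this relationship can be sketched with a small, hypothetical example (the file name and data values are made up): given microdata records, aggregate totals are easily produced with standard Unix tools, but nothing in the resulting totals would let you recover the individual records.

```shell
# Hypothetical microdata: one record per respondent,
# column 1 = region code, column 2 = income.
cat > micro.dat <<'EOF'
1 30000
1 20000
2 40000
EOF

# Aggregate: total income per region.
awk '{sum[$1] += $2} END {for (r in sum) print r, sum[r]}' micro.dat | sort

# The reverse step -- recovering the three original records
# from the two regional totals -- is impossible.
```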
For spatial data, and georeferenced data (standard Statistics Canada geography), it
is possible to merge smaller units into larger ones, but not to disaggregate larger
units into smaller ones. The following table displays, for 1991 census geographic
products, the level of geography that can be generated (output) from spatial or
georeferenced products at the commonly used input levels:
Output:   ea    ct    cma/ca    csd    cd    fed    rp
Input:
ea         x     x       x       x      x     x      x
ct*              x       x      (x)    (x)          (x)
cma/ca                   x
csd                              x      x            x
cd                                      x            x
fed                                           x      x
*Note: with 1971, 1976, and 1981 census tract level data (which included census
tracts in census metropolitan areas/census agglomerates as well as provincial
census tracts) it was possible to also generate census-subdivision, census division
and region/province level data. With 1986 and 1991 census-tract level data, which
includes only census tracts in census metropolitan areas/census agglomerates, this
is no longer possible.
Edition/version of the data
version of data: different software-dependent versions
edition (1st, 2nd, 3rd, etc.)
Characteristics of format
Access to the data
direct access by the user or access through an intermediary
ease of use (familiarity of software)
remote delivery versus hardware/software infrastructure to
use/access
Input requirements of task
‘manual’ input
output from another task
Output requirements of task (what output does user need?)
subset (file) for further analysis
generic format
software-dependent format
report
table
map
3.2 Ordering/acquisition process
The acquisition process consists of two parts: first establishing that the data are
available from DLI (and if not, where else the data might be available), and
secondly, actually acquiring a copy of the data.
3.2.1 Establishing availability
3.2.1.1 If you know the title of the data file you require, you can determine its
availability through DLI using information from the DLI web site or the
DLI mail lists:
the DLI WWW-site:
lists data files that will become available through DLI
lists data files that are currently available for ftp from the DLI ftp site,
including the subdirectory in which each is to be found
http://www.statcan.ca/english/Dli/dli.htm
http://www.statcan.ca/francais/Dli/dli_f.htm
The ‘dlilist’ listserv:
New data files are announced via [email protected]. If you have not already done
so, you should be subscribed to this listserv.
To subscribe, send an e-mail message to [email protected],
with message text: subscribe dlilist [your first name] [your last name]
It is a good idea to save the messages from this listserv, perhaps in alphabetical
order by title, for future reference, so that when you receive an inquiry, you will
have the information to hand.
Dlilist runs on the listproc software. It is important to distinguish between
messages which manage your subscription to the list (these should be sent
only to [email protected]) and messages intended for everyone else on the
list (these should be sent to [email protected]).
Table of listproc commands:
N.B. These commands should be sent to: [email protected]
help [topic]                      get information re listproc commands
set [listname] [option] [argument]
                                  change [option] to new value [argument]
subscribe [listname] [your name]  subscribe to specified list
unsubscribe [listname]            remove yourself from specified list
signoff [listname]                same as 'unsubscribe'
recipients [listname]             receive a listing of non-concealed people
                                  subscribed to specified list
review [listname]                 same as 'recipients'
information [listname]            receive general information file about specified list
index [listname]                  get a list of files in [list] archive
get [archive] [filename]          receive a copy of specified file(s) from specified archive
search [archive] [pattern]        receive a list of archive files that contain the
                                  character string [pattern]
which                             receive a list of the local lists to which you are
                                  subscribed
3.2.1.2 If you do not know the title of the data file:
A separate handout will include a list of other reference tools which are available to
determine the exact title and version/edition of the data file you require.
3.2.1.3 If the data are not currently available via DLI, but you think they should or
might be, send an inquiry to [email protected]. Responses are now
received within a day or two.
A WWW-based order status system has been set up on the DLI WWW-page, at:
http://www.statcan.ca/english/Dli/dli.htm
http://www.statcan.ca/francais/Dli/dli_f.htm
Access to this order status page is restricted in two ways:
only the official DLI-contact person at your institution can access the page
access is only possible from a platform whose IP-address has been
registered with Mr. Jackie Godfrey (e-mail [email protected] to
register your IP-address for this purpose).
3.2.1.4 Knowing what data are not available from DLI, nor will be, is as important
as knowing what is available.
Data files that are not available through DLI:
data that are collected by Statistics Canada but for which no standard
computer-readable product is produced (such as the CANSIM cross-
classified database, etc.)
data collected by other federal government departments (other than
Statistics Canada)
data collected by provincial and municipal government departments
attitudinal or poll data
A separate handout will offer suggestions as to where to search for data files that
are not available via DLI.
3.2.2 Acquiring the data
Data files from DLI are disseminated in two major ways:
those that are cd-rom products are mailed in one copy only, with
documentation, to the official DLI-contact.
all other products are made available via ftp to the official DLI-contact from
the Statistics Canada ftp site at ftp://ftp.statcan.ca
The official DLI-contact receives occasional e-mail messages containing the current
password for the ftp site. There are plans to provide a WWW-interface so that files
can also be ftp'd via WWW.
To acquire a copy of the data,
if the data file is a cd-rom product, the official DLI-contact person should
send an e-mail message to [email protected]
if the data file is available via ftp, the official DLI-contact person should:
ftp the data from ftp://ftp.statcan.ca
and
send an e-mail message to [email protected] requesting a copy of
the documentation.
3.2.3 Ftping files
The File Transfer Protocol allows you to copy files from one computer to another. It
is possible to transfer files to or from a remote computer, regardless of format,
allowing you to retrieve software programs, graphic images, sound files, etc., in
addition to ASCII text files.
To access ftp-able resources you must either use a superclient (such as Mosaic,
Netscape, lynx, etc.), a gopher client, or an ftp client (such as (Unix) ftp, or
(Windows) WinSockFTP or Rapid Filer, or (DOS) kermit).
[Note from Laine: Be aware that Rapid Filer is the only ftp-client I am aware
of that will allow you to ftp the content of an entire subdirectory with one
command; however, it does so by assuming the mode of the ftp of each file on
the basis of the file extension, and thus is only useful when all files in the
subdirectory have standard Mime-compliant extensions. Other clients, such as
WS_ftp, require you to specify the mode of the ftp transfer on a file-by-file
basis, but in return give you control over the mode of transfer of each file.]
SYNTAX
(Unix client): ftp [site] [port]
(from within a superclient): ftp://[site]:[port]
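By way of illustration only, a typical session with the Unix ftp client follows this pattern (the account, password, directory, and file names shown are placeholders, not actual DLI credentials):

```
$ ftp ftp.statcan.ca
Name: [account name]
Password: [current DLI password]
ftp> cd [subdirectory]
ftp> dir
ftp> binary
ftp> get [filename]
ftp> quit
```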
COMMANDS (selected)
FTP accepts only a limited set of commands. Not all FTP servers accept all
commands listed below; which commands are acceptable will depend on both the
FTP server software installed on the remote host and the FTP client software you
are using. Use ‘help’ or ‘help ftp’ to display commands available on the version of
FTP software you are using. Selected FTP commands are:
Command Action
Navigation on the REMOTE host (host you are ftping to):
cd [dirname] change remote working directory
cdup change directory ('upwards') to the parent directory
cd .. change directory ('upwards') one subdirectory
close close connection to remote host
del [fn] delete a file
dir list files in current directory
get [fn] |more display specified file without copying it to local system - use ‘q’
to exit
help display help information
ls -al list content of current directory
mdel * delete all files in a directory
mkdir [dirname] make a new directory
open [site] connect to new system [site]
pwd print path information to current directory
rmdir [dirname] remove a directory
Navigation on the LOCAL host (host you are ftping from):
!dir display a list of files in current directory
!dir |more display list of files one page at a time
lcd [dirname] change 'local' directory
lcd .. move up/back one directory
lcdup move up/back one directory
!ls -al display a list of files in current directory
!ls -al |more display list of files one page at a time
!mkdir [dirname] create a new directory with name [dirname]
!pwd display path information to current local directory
!rmdir [dirname] delete directory
Copying files between hosts
ascii change transfer mode to transfer ASCII text files
binary change transfer mode to transfer binary files
prompt turn prompting for transfer of each individual file off/on
get [fn] copy specified file from remote host to local host
mget * copy all files from remote host to local host
mput * copy all files from local host to remote host
put [fn] copy specified file from local host to remote host
<ctrl><c> cancel file transfer
<ctrl><z> suspend file transfer
bg %1 resume suspended transfer in the background
Notes (for the Unix ftp client):
You must be logged on to both the remote and local computer simultaneously,
and have an account and password on both computers.
Read the Readme.first files. On a Unix server (such as ftp.statcan.ca) it is possible
to read files in remote ftp directories without actually 'getting' them first.
Type get [filename] |more
and use <ctrl><c> or <q> to exit reading mode.
Actually, it is a good idea to both read the Readme.first files, as well as ‘get’
them, for later reference.
Two files that are especially useful on the ftp.statcan.ca site are in the ftp
root directory (the very first directory you see when you login):
Readme.first contains a list of data file titles with corresponding
subdirectory name
Dirlist.txt contains a directory listing of the entire ftp site, and is
very useful for discovering how the site is organized and
where all possible variants of a file are, especially in very
complex subdirectories, such as the ‘geography’
subdirectory.
Unix systems are case-sensitive. ALWAYS give the directory or filename(s)
exactly as shown, including punctuation, and upper and lower case characters.
Distinguish between directories and files. Only files can be transferred.
Enter 'cd [subdirectory name]' to move from one directory to the next lower
subdirectory.
Enter 'cd ..' or 'cdup' to move back up in the subdirectory hierarchy, one directory
level at a time. Use cd ../../.. to move up three subdirectories at a time, etc.
To 'get' or 'put' a file, you must know how much disk space you have free, and how
big the file is; use 'ls -al' or 'dir' to display file sizes. On a Unix system, use ‘df’ to
display remaining disk space before you run ftp.
When 'getting' a file, you may supply a filename for the incoming file, if you wish to
change it, e.g. get Readme.first readme.gss10.
Use 'get' to get one file, or 'mget' to get several files with the same filename
characteristics. The 'wild card' is '*', but can be used only with ‘mget’.
E.g. mget *.txt
If you make a mistake in typing a filename, try to backspace using:
the backspace key, or
<ctrl><h>, or
<ctrl><backspace>.
To abort a file transfer, the terminal interrupt key sequence is usually <ctrl><c>.
To merely suspend file transfer use <ctrl><z> followed by 'bg %1' to restart the
transfer in the background. Be very scrupulous in checking file sizes after ftping in
the background.
It is considered good netiquette to avoid using ftp sites during their working hours,
and to not linger at a site any longer than is necessary to retrieve the files you need.
Transferring files between different environments
When ftping files, i.e. transferring them from one computing environment to
another, two things are very important:
whether the file contains ‘binary’ codes, especially when being copied
between an ASCII environment and an EBCDIC environment;
end-of-line conventions in the environments between which the file is being
transferred.
When files are uploaded in binary mode, they are copied from one system to
another exactly. This is absolutely essential for files in a software-dependent
format, especially files which contain binary codes (i.e. ASCII upper-128 codes). If
the file being copied contains binary characters and is uploaded in ASCII mode,
some of the binary characters may be interpreted as ftp control characters, and
either terminate the ftp session, or merely result in the corruption of the
transferred file. When transferring a file containing binary codes between an ASCII
environment and an EBCDIC environment, translation problems may also occur
with ASCII upper-128 codes if the file is not transferred in binary mode.
When files are uploaded in ASCII mode, however, although the bulk of the file is
uploaded byte by byte as is, some things do change, especially when you are ftp-ing
between different operating systems. Uploading from a DOS/Windows environment,
to a Unix environment (or vice versa), in ASCII mode, the end-of-line (EOL)
character at the end of each physical record is changed from the two characters
DOS uses (CR-LF) to the single character Unix uses (LF). CMS, on the other hand,
normally stores files in a fixed-length format with no end-of-line characters at
all; the length of each record is constant.
NEWLINE CHARACTERS IN PLAIN TEXT DATA FILES 1

Data files                    typically have added    bytes per line
'mainframe' tapes             neither CR nor LF       zero
written for IBM mainframes    neither CR nor LF       zero
written for Unix              LF                      one
written for DOS               both CR and LF          two
written for Macintosh         CR                      one
In short, if you need to preserve the existing EOLs, move the file between different
operating systems in ASCII mode; if EOL codes are irrelevant (e.g. in system files)
or the file contains binary fields, move the file between operating systems in binary
mode.
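A minimal sketch of these EOL differences, using standard Unix tools (the file names are made up): the same three-character line occupies four bytes with a Unix EOL and five with a DOS EOL, and deleting the CRs converts one form to the other.

```shell
# Unix EOL: LF only -> 'abc' plus 1 byte = 4 bytes
printf 'abc\n' > unix.txt
wc -c < unix.txt

# DOS EOL: CR+LF -> 'abc' plus 2 bytes = 5 bytes
printf 'abc\r\n' > dos.txt
wc -c < dos.txt

# Deleting the CRs turns a DOS text file into a Unix one.
tr -d '\r' < dos.txt > dos2unix.txt
cmp unix.txt dos2unix.txt && echo 'files are now identical'
```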
Enter the ‘binary’ command before ftping a binary file. Use the ‘ascii’ command
prior to transferring text files. The following table lists some common file name
extensions, as well as some commonly occurring data-related file types, and
whether the files should be transferred in binary or ascii mode.
File type/extension: Ftp mode: Operating system:
.arc file binary
.cat file binary DOS/Mac
.com file binary DOS
.doc file binary DOS/Mac
.exe file binary DOS/Unix
.gz file binary
.tar file binary
1 Data services and collections (Part 2)./ Geraci, Diane, Chuck Humphrey, and Jim Jacobs. [Ann
Arbor, Mich.]: Inter-University Consortium for Political and Social Research, August 1996. P.
265.
.wp[n] file binary DOS/Mac
.z file binary
.zip file binary
.Z file binary
ArcInfo export files ascii DOS/Mac
ASCII text file ascii
dBase file (.dbf) binary DOS/Mac
DDMS dictionary file ascii DOS
Lotus 1-2-3 file binary DOS/Mac
MapInfo files binary DOS/Mac
PDF file (.pdf) binary
PostScript file (.ps) ascii
raw data ascii
SPSS system file binary
SAS system file binary
SPSS export file ascii
SAS export file binary
SPSS command file ascii
SAS command file ascii
If in doubt, copy the file twice, once in binary mode and once in ascii mode, to
different filenames of course, and see which one gives the most satisfactory
result.
And one last piece of advice: make a print-screen of the directory listing of the
files you are copying from the remote site - this will help later when you are
doing the post-processing of the files.
REFERENCES:
Anonymous FTP : frequently asked questions (FAQ)./Perry Rovers. Nov. 30,
1995.
[*]ftp://rtfm.mit.edu/pub/usenet-by-group/news.answers/ftp-list/
[*]http://www.cis.ohio-state.edu/hypertext/faq/ftp-list/faq/faq.html
File transfer protocol (FTP)./ Postel, J and J. Reynolds. Oct. 1985 (RFC 959)
[*]ftp://nic.merit.edu/documents/rfc/rfc0959.txt
How to use anonymous ftp./ P. Deutsch, A. Emtage, A. Marine. May
1994. (FYI 24; RFC 1635)
[*]ftp://ftp.internic.net/rfc/rfc1635.txt
[*]ftp://a.cni.org/pub/FYI-RFC/fyi24.txt
Compression FAQ./ Lemson, David
[*]ftp://rtfm.mit.edu/pub/usenet-by-group/news.answers/compression-faq/
[*]http://www.cis.ohio-state.edu/hypertext/faq/usenet/compression-faq/top.html
File compression, archiving, and text[-]binary formats./ Lemson, David
[*]ftp://ftp.cso.uiuc.edu/doc/pcnet/compression
3.3 Post-processing files
This section discusses the processing that you may or may not, depending on your
circumstances, go through to ensure:
that you received the entire data file,
that the files you received are complete and useable, and
that the task of actually using the data is as easy and uncomplicated for
your users as possible.
Different types of files require very different post-processing, partly depending on
the format of the files, and partly depending on how you intend to deliver the data
to your users:
data files that are not accompanied by retrieval software require fairly
extensive post-processing, but there are a number of ways in which to
deliver the data to your users,
data files that are accompanied by retrieval software require relatively
little post-processing, but your options for delivering the data to your users
are more limited,
some fairly extensive post-processing of georeferenced data files will be
needed before they can be used with contemporary GIS software,
documentation files will also need more or less extensive post-processing,
depending on how you intend to deliver the information to your users.
3.3.1 Post-processing: first steps
The first steps in post-processing are common to all file types.
Step 1: Check the number of physical carriers
First check that you have received all the physical carriers, if the data have arrived
on a removable medium such as diskette or cd-rom. The number of diskettes, cd-
roms, etc. should be indicated in any of the following:
a covering letter, or
a separate manifest accompanying the data file, or
somewhere in the codebook/user manual which describes the data file.
Step 2: Check physical files, file names, and file sizes
(N.B. THIS IS ESPECIALLY IMPORTANT IF YOU HAVE FTP’D THE FILES,
BUT UNIMPORTANT IF THE DATA ARRIVED ON A CD-ROM.) Next check that
you have received the correct number and size of physical files. The number of
physical files which comprise the data file, as well as such information as the
dataset names and perhaps even the physical file sizes, should be indicated in any
of:
a covering letter, or
the Readme.first file (a manifest) in the file subdirectory on the ftp site, or
somewhere in the codebook/user manual describing the data file.
Determine the number, dataset names, and sizes of the physical files you have
actually received, and compare this information with the information
gleaned from the accompanying documentation.
System: Command:
MSDOS dir [drive|directory]
Windows [use File Manager]
Unix ls -al
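One simple way to make this comparison on a Unix system, assuming you have typed the expected file names into a manifest of your own (manifest.txt and the .dat names below are made up for this sketch): sort both lists and let diff report anything missing or unexpected.

```shell
# Type the expected file names (from the covering letter or
# Readme.first) into a manifest, one per line.
printf 'file1.dat\nfile2.dat\n' > manifest.txt

# For this sketch, pretend these are the files we just ftp'd.
touch file1.dat file2.dat

# List what was actually received, sort both lists, and compare.
# No output from diff means the two lists match exactly.
ls *.dat | sort > received.txt
sort manifest.txt > expected.txt
diff expected.txt received.txt && echo 'all files present'
```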
The rules for dataset names and/or file names vary from system to system:
in Unix, file names may be up to 255 characters, but some older System V
Unix systems only allow 14 characters. File names may have extensions of
any length, preceded by a period (e.g. ‘.sys’). Filenames can contain any
special character except /.
in VM/CMS, file names consist of three parts: the filename (fn), the file type
(ft) and the file mode (fm). Each of fn and ft can be up to eight characters
(alphabetic, or numeric, and some special characters); the fm is usually given
as one character. The three parts of the name are usually given separated by
a blank.
in DOS/Windows, file names are restricted to a maximum of eight characters,
and may be composed of alphabetic, numeric, or special characters (but only:
!@#$%^&()_-{} or ‘). Normally file names are followed by a period and a
maximum three character extension. DOS/Windows file names must never
include a blank.
Similarly, display of file size varies from environment to environment:
in Unix, disk file size is displayed in terms of bytes, including the end-of-line
character (‘LF’ (linefeed), ASCII octal 012) at the end of each record (line, or
row) in the file,
in DOS/Windows, disk file size is displayed in terms of bytes, including the
two end-of-line characters ('CR' and 'LF' (carriage return and line feed),
ASCII octal 015 and 012) at the end of each record in the file,
in VM/CMS, and indeed most IBM mainframe operating systems, disk file size
is displayed not in bytes but in terms of maximum number of characters per
record (record length), number of records, and record type (whether fixed
length (‘fb’), or variable length (‘vb’)). To ascertain the file size in terms of
bytes, multiply the record length by the number of records, as displayed by
the command: filel [fn ft fm]. This calculation will be accurate for files with
fixed length records. File size display usually also includes the number and
size of disk blocks needed to store the file. N.B. there are no end-of-line
characters in files on IBM mainframe computers.
in VMS, disk file size is displayed in terms of number of blocks (usually of
either 512 or 1024 bytes per block).
When comparing file names and file sizes, you must take into consideration the
physical environment in which the original information was generated (for DLI
files, Unix) and
that in which you have just checked it. When file name or size discrepancies occur,
try to determine whether or not they may be the result of simply having been
moved from one environment to another.
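The arithmetic can be sketched for a hypothetical fixed-length file of 1,000 records with a record length of 80: the same data occupy a different number of bytes on each system, purely because of the EOL conventions described earlier.

```shell
records=1000
lrecl=80

cms_bytes=$((lrecl * records))          # CMS: no EOL characters
unix_bytes=$((cms_bytes + records))     # Unix: one LF per record
dos_bytes=$((cms_bytes + 2 * records))  # DOS: CR+LF per record

echo "CMS: $cms_bytes  Unix: $unix_bytes  DOS: $dos_bytes"
```

A 1,000-byte discrepancy between a reported and an observed size for such a file is therefore exactly what a move between CMS and Unix would produce.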
If the information about what you should have received and what you actually
received do not match, try to determine what is wrong, which will usually be one of
the following:
you have not yet uncompressed/unbundled the files (if they were received in a
compressed and/or bundled format). If this is the case, uncompress/unbundle
the files (see below), and check again. Alternatively,
the discrepancy may be a result of having moved from one environment to
another, or
the file is incomplete, or
the documentation is in error or incomplete, or
the documentation does not match the data file (i.e. you have the wrong
documentation or the wrong data file).
Document the discrepancies, and if you cannot satisfactorily account for small
discrepancies by their having been shifted from one environment to another, and
the discrepancies are confirmed by the checks of the internal consistency (see
below), contact the source from which you received the data.
Step 3: Uncompress/unbundle any compressed files
Data arriving via ftp are usually in either one of the Unix compressed and/or
bundled formats, or in an MS-DOS/Windows compatible compressed format
(Statistics Canada typically uses PKSFX to compress/bundle files from a
DOS/Windows environment). Increasingly, software to unbundle/uncompress these
formats is becoming available on several systems.
The main compressed/bundled formats are:
[fn].exe - file is compressed and/or bundled as a 'self-extracting' file, usually
with PKSFX. Move to the appropriate platform, or uncompress on
Unix using ‘unzip’.
To uncompress in DOS: [fn]
To uncompress in Unix: unzip [fn].exe
[fn].Z - file is compressed using Unix 'compress' software. ‘Uncompress’
software is available for DOS, Unix, Mac, etc., platforms.
To uncompress: uncompress *.Z
[fn].tar - usually several files have been 'bundled' together in one physical file
by the 'tar' program, so that you don't have to ftp over a whole bunch
of individual files. ‘Tar’ is available for DOS, Unix, Mac, etc.,
platforms.
To untar: tar -xvf [filename].tar
N.B. .tar files are often subsequently compressed, so that ftping will be
faster. Such files have both a .tar followed by a .Z extension. First
they must be uncompressed using 'uncompress', then un-tarred
using 'tar'.
[fn].gz - file is compressed using the gzip program. It can be uncompressed 'on
the fly' during ftping, but this means that it is the full file that is
ftp'ed over, not the compressed file, and this will take much more
time. Instead, it is more efficient to ftp the file in binary, and gunzip
it locally. Gunzip is available for DOS, Unix, Mac, etc., platforms.
To un-gzip: gunzip [filename].gz
To un-gzip ‘on the fly’ within ftp: get [filename]
Note: to gunzip on the fly, do NOT use the ‘.gz’ extension on the
filename with the ‘get’ command.
E.g. get [filename]
[fn].zip - file(s) have been compressed [and possibly bundled] using the pkzip
program. The file must be uncompressed/unbundled using ‘pkunzip’,
which is available for DOS, Unix, Mac, etc., platforms.
To uncompress in DOS: pkunzip *.zip
To uncompress in Unix: unzip [fn].zip
[fn].arc - file(s) have been compressed [and possibly bundled] using the pkarc
program. The file must be uncompressed/unbundled using 'pkxarc',
which is available for DOS, Unix, Mac, etc., platforms.
Extension To uncompress/unbundle
.arc pkxarc [fn].arc
.exe [fn]
.Z uncompress [fn].Z
.gz gunzip [fn].gz
.tar tar -xvf [fn].tar
.tar.Z first 'uncompress [fn].tar.Z', then 'tar -xvf [fn].tar'
.zip pkunzip [fn].zip
Uncompress/unbundle the files as appropriate, and generate and print yet another
ls -al |more or dir listing of the uncompressed/unbundled files. Check this listing
against the Readme.first file to ensure that you have all files and that they are the
correct size.
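A small round-trip sketch of the bundle-then-compress pattern described above, using gzip/gunzip (more widely installed today than compress/uncompress; the file names and contents are made up):

```shell
# Bundle two small files, compress the bundle, then reverse both steps.
printf 'data one\n' > part1.txt
printf 'data two\n' > part2.txt

tar -cf bundle.tar part1.txt part2.txt   # bundle
gzip bundle.tar                          # -> bundle.tar.gz

rm part1.txt part2.txt                   # pretend we just ftp'd bundle.tar.gz

gunzip bundle.tar.gz                     # first uncompress...
tar -xvf bundle.tar                      # ...then unbundle
cat part1.txt part2.txt
```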
For a fairly complete discussion of compression/bundling and related software, as
well as locations of compression/bundling software, see:
Compression FAQ./ Lemson, David
ftp://rtfm.mit.edu/pub/usenet/news.answers/compression-faq/*
File compression, archiving, and text[-]binary formats./ Lemson, David
ftp://ftp.cso.uiuc.edu/doc/pcnet/compression
Step 4: Determining what the files are
A data file can consist of a number of component physical files: one or more physical
data files, program files and documentation files. Simply put, a data file contains
data, a program file contains either a program or the instructions to a program as to
how to read the data, and documentation files contain information about the data
(often called ‘metadata’) which is needed by the user to understand what the data
mean.
Physical data files
Data files are those that contain the data themselves, which may be either numeric
(i.e. recorded as numbers) or alphabetic (recorded in the letters of the alphabet) or a
combination of the two. One data file can consist of one or more physical data files.
For example, the Canadian General Social Survey on time use (General social
survey number 2), conducted by Statistics Canada in 1987, consists of three
physical data files:
Main file - one record for each respondent in the file.
Incident file - one record for each reported time usage incident per
respondent (i.e. many incidents for each respondent)
Summary file - one record for each respondent, consisting of a
summary of the incident records.
Each of these data files will have its own program files, and may also have separate
documentation files.
Further, the content of data files can be differentiated by other characteristics:
organization of records: the relationship of physical records to logical records,
level/type of observation: what each logical record describes,
field structure: the way in which variables are coded in each record.
Organization of records
Level/type of observation    flat    hierarchical    unstructured
microdata                     x           x
macrodata (aggregate)         x           x
time-series                   x           x
vector                                    x
raster                        x
text                          x           x                x
The level/type of observation determines the type of analysis for which the file is
appropriate, i.e. the use researchers are likely to make of the file. The organization
of the records in the file affects how the structure of the data file is defined to a
statistical package.
Field structure
raw data file
    fixed field files
        card image (80 characters per record)
        'lrecl' (record length in bytes less than 80 or greater than 80)
    delimited field files
        comma delimited
        blank delimited
        other character delimited
    tagged fields
software-dependent data file
    system file
    transport/export file
Field structure, like the organization of the records in the file, affects how the
structure of the data file is defined to a statistical package.
Raw data files, that is physical data files that contain only data (numeric or
alphabetic or both), are not much of a problem. If the documentation is complete,
and adequately and accurately reports what variables are coded in what columns,
they can be read by almost any commercially available statistical package.
Raw data files are relatively easily recognized by the following characteristics:
if the data are numbers, the numbers should be displayed in discernible vertical
patterns (columns), or separated by blanks or commas, e.g.
35000200000000000000000000014171963224000000099000000100999
24421200000000000000000000013791902114000000099001001014000
24000200000000000000000000001851896115000000099000000100999
35555200000000000000000000013221059124000000099145010302056
12000200000000000000000000014761904222000000099000000100999
if the data are alphabetic, the text should be legible, and there should be a
discernible vertical pattern to the display,
there should be no characters that do not occur on a standard keyboard.
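These checks can be scripted with standard Unix tools. A minimal sketch (the file name 'sample.dat' is hypothetical; its contents are the five records shown above):

```shell
# Recreate the five sample records shown above in a scratch file.
cat > sample.dat <<'EOF'
35000200000000000000000000014171963224000000099000000100999
24421200000000000000000000013791902114000000099001001014000
24000200000000000000000000001851896115000000099000000100999
35555200000000000000000000013221059124000000099145010302056
12000200000000000000000000014761904222000000099000000100999
EOF

# A raw data file should contain only printable characters: count the
# lines holding any byte outside the printable-ASCII range.
nonascii=$(LC_ALL=C grep -c '[^ -~]' sample.dat || true)
echo "$nonascii"    # 0 for a clean raw data file
```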
System or export files, on the other hand, will often have a partially eye-readable
header which will give you a clue as to what software is needed to read the file.
Program files
Raw data files are often accompanied by separate ‘program’ files (or ‘control
command files’) that contain the instructions to a specific program (e.g. SAS or
SPSS) as to how to read the raw data file. For example, the first few lines of the
SPSS control command file which describes how to read the above data might look
like this:
title Census of Canada, 1981 - public use microdata file individuals
data list file=in /
Prov 1-2
Cma 3-5
etc.
This file instructs SPSS to read a numeric variable in columns 1 and 2 (counting
from left to right) and assign the name ‘prov’ to it, and to read a numeric variable to
be named ‘cma’ in columns 3 through 5. Thus, ‘prov’ is a two-digit number (in the
above case, with values '35', '24', '24', '35', and '12' respectively) and 'cma' is a three-digit number (with values '000', '421', '000', '555', and '000' respectively).
Some files actually come with both SPSS and SAS control command files, or even
control command files for other programs, such as TSP, SHAZAM, TPL, etc.
Unfortunately, none of these programs can understand commands formatted for
any of the other packages.
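The same column extraction can be mimicked with standard Unix tools, which is a handy way to spot-check that the documentation matches the data. A sketch (the file name 'sample.dat' is hypothetical; its contents are the five records shown earlier):

```shell
cat > sample.dat <<'EOF'
35000200000000000000000000014171963224000000099000000100999
24421200000000000000000000013791902114000000099001001014000
24000200000000000000000000001851896115000000099000000100999
35555200000000000000000000013221059124000000099145010302056
12000200000000000000000000014761904222000000099000000100999
EOF

# What SPSS does with 'Prov 1-2' and 'Cma 3-5', done with cut:
cut -c1-2 sample.dat    # prov: 35, 24, 24, 35, 12
cut -c3-5 sample.dat    # cma:  000, 421, 000, 555, 000
```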
System files
Some program files actually contain the program that reads the data file (either as a source file, which must be compiled, or more frequently as a compiled file, often bundled with the data into one physical file). Others accompany a data file in a proprietary format (usually binary) that can be read only by the program that created it; such a file is usually called a 'system file'.
When a statistical program such as SPSS reads a file of control commands and the
data that they describe, that information, both the instructions as to how to read
the data, and the data themselves, are stored in temporary workspace until the user
stops running the program.
Storing the instructions and data in one file in workspace makes it faster for the
user, who can refer to variables by name, recode them, perform various
transformations, etc. But getting them into that format from the initial form of raw
data and control commands takes some time. Most programs therefore also allow
you to store this temporary work file as a separate physical file, which is
‘permanent’ in the sense that it does not disappear when you stop running the
program -- much in the same way that you can store a WordPerfect document with
[Figure: a statistical program (e.g., SAS, SPSS, Minitab) reads a raw data file plus a control command file into a temporary work file, which can be saved as a system file, or written out as an export file, a raw data file, a record layout, frequencies, or content information.]
all the formatting, changes, etc., so that it does not disappear when you exit
WordPerfect. This file is called a ‘system file’ and it is intended to make things
faster and easier for the user. Lotus 1-2-3 and dBase also allow you to store the system file as a permanent file; the difference between these programs and SAS or SPSS is that the dBase or 1-2-3 file format must be defined within the program, rather than in a separate physical file which is plain text and therefore eye-readable and portable. Unfortunately, a system file is in a special format which is not eye-readable. It may well not be readable by other versions of the same program, especially versions running under a different operating system than the one on which the system file was created, and it is certainly not readable by any program other than the one that created it. For example, a system file created using SPSS(PC) under DOS will not be readable by SPSS for Windows without a special interface, and is certainly not readable by SAS.
Both SAS and SPSS are major, well-known statistical packages, and each includes commands to read and write all of these formats.
READING AND WRITING SPSS FILE FORMATS
File format        Command to read               Command to write
raw data           data list file='[path/fn]'    write outfile='[path/fn]'
SPSS system file   get file='[path/fn]'          save outfile='[path/fn]'
SPSS export file   import file='[path/fn]'       export outfile='[path/fn]'
Data Interchange Formats
In addition, both SPSS and SAS (as well as other programs, such as ArcInfo) will
produce a data interchange format, called an 'export file' in SPSS or a 'transport file' in SAS. The object of the interchange file is to be able to move a system file from one operating system or package to another. For example, to use on a machine running SPSS for Windows a system file created with SPSS under the Unix operating system, you would have to convert the system file to an export file on the Unix platform (using SPSS) and then use SPSS for Windows to 'import' the file, thus creating a new system file.
Alternatively, you could write out the data stored in the system file as a raw data
file, and move that to the Windows platform. But then you would have to go
through the whole process of redefining to SPSS for Windows where and how to
read every variable and value in the file -- this information about how to read the
data is preserved when you write out an export file, and in turn import it.
While some data interchange formats are not readable by other programs, they are
readable by the programs that produce them. In some situations, a major package
will read the data interchange format of a major competitor. For example, SAS will
read an SPSS export file, although SPSS does not currently read a SAS transport
file. Both SAS and SPSS will read the dBase (.DBF) and Lotus (.WKS) data interchange formats, covering a common database format and a common spreadsheet format respectively.
READING AND WRITING SAS FILE FORMATS
File format       Command to read                  Command to write
raw data          data; infile [path/fn];          data; file [path/fn]; put;
SAS system file   data; set [path/fn]; or          data [path/fn];
                  [proc] data=[libn].[dsn];
SAS export file   proc cimport infile=[path/fn];   proc cport file=[path/fn];
Looking at the files
How do you tell which files are the data, program, and/or documentation files? And what format is each of the files you have received in? You have several sources of clues:
Step 1: read the Readme.first file accompanying the data files, which should
explain the format (especially the software dependence) of all files. Remember not
to take this information as gospel truth, however -- always check the files
themselves to see what they really contain.
Step 2: look at the extensions of the filenames, or the entire filenames. This
information too can be used as a guideline, but should not be relied on entirely
without checking the files themselves.
Step 3: look at the files themselves:
in Unix, several commands are available (although not all will be available on
all flavours of Unix):
To simply display a file:
browse displays a file one screen at a time
cat displays an entire file (scrolling)
cio 'Check it out'. This Perl script not only lists the first 10 lines of
a file, but also indicates how many records are in the file, as well as
the number of records by line length.
head displays the first 10 lines of a file (see also ‘tail +0’)
less displays a file one screen at a time
more displays a file one screen at a time
pg displays a file one screen at a time
tail displays the last 10 lines of a file by default
tail +0 displays a file from the beginning (i.e. the entire file)
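Where 'cio' is not available, its report can be approximated with standard tools. A sketch, using a hypothetical three-record file 'myfile':

```shell
printf 'aaaa\nbb\ncccc\n' > myfile     # hypothetical 3-record file

head -10 myfile                        # first 10 lines
wc -l < myfile                         # number of records
awk '{ print length($0) }' myfile |    # number of records by line length
    sort -n | uniq -c
```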
To look more closely at what’s in a file, especially at the non-printable characters
(e.g. those odd spaces):
cat -evt displays an entire file, including most non-printing characters
(including ASCII upper-128), tabs, and end-of-lines.
hexdump displays a file in hexadecimal, octal, decimal, and ASCII
od displays a file in octal, decimal, hexadecimal, and ASCII
vis displays a file including non-printable characters in octal etc.
xd displays a file in octal and hexadecimal
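For example, 'cat -evt' (the flag order is immaterial) and 'od -c' make an otherwise invisible tab character obvious. The file name 'odd.txt' is hypothetical:

```shell
printf 'a\tb\n' > odd.txt   # one record containing a hidden tab

cat -evt odd.txt            # shows the tab as ^I and the end-of-line as $
od -c odd.txt               # shows every byte; the tab appears as \t
```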
in DOS/Windows, there are a few programs that will allow you to look at the
content of a file, including:
type            displays a file in text only (use this with small files
                only)
list.com        available at ftp://princeton.edu/pub/misg-lib/UTIL/list.com;
                displays a file in text or in hexadecimal
vedit           proprietary software from Greenview Data Inc.; can display
                both ASCII and EBCDIC files, in text and hexadecimal mode
Norton Utilities Disk Editor
                proprietary software; displays a file in text or in
                hexadecimal
Xtree Pro       proprietary software; displays a file in text or in
                hexadecimal
for word processor files, try to import them with either Microsoft Word or
WordPerfect, and let the software detect which format it thinks the file is
(N.B. this is relatively easy, but doesn’t always result in a readable file.)
3.3.2 Post-processing data files (without accompanying retrieval software)
From here on in, the post-processing of files varies according to what the files are,
and whether or not they are software dependent, and/or are accompanied by
retrieval software. First, the fairly extensive post-processing needed by generic data
files without accompanying retrieval software.
Step 1: check the 'internal' completeness or integrity of the physical
files by comparing the number of records in each file, and the record
length or lengths of its records, against the information provided in the
documentation. To do this, you may find it expedient actually to process the
documentation files first.
Number of records
First, look at the manifest or codebook carefully, to determine if it indicates the
number of records in each physical data file. In a raw data file with one record per
respondent, this is the same as the number of cases (or respondents), also known as
‘the N’. If the codebook discusses an ‘unweighted N’ and a ‘weighted N’ (or cases, or
respondents), the number you are interested in is the ‘unweighted N’.
Determining number of records:
System Software Command
DOS maxline maxline [filename]
Unix wc wc -l [filename]
Unix cio2 cio2 [filename]
VM/CMS filelist filel [fn] [ft] [fm]
Software availability:
System Software Available as/at:
DOS maxline ftp://datalib.library.ualberta.ca
Unix wc standard Unix utility
Unix cio2 ftp://gort.ucsd.edu/pub/jj/cio2
VM/CMS filelist standard VM/CMS utility
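On Unix, for example, the check against the documented N takes one line. A sketch with a hypothetical five-record file ('survey.dat' and its documented N of 5 are for illustration only):

```shell
# Hypothetical raw data file with one record per respondent, N = 5.
printf 'case1\ncase2\ncase3\ncase4\ncase5\n' > survey.dat

documented_n=5
records=$(wc -l < survey.dat)
if [ "$records" -eq "$documented_n" ]; then
    echo "record count matches documented N"
else
    echo "MISMATCH: $records records, codebook says $documented_n"
fi
```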
Although in a raw data file, the number of records corresponds to the number of
cases, there are other instances in which the number of records is of use as a simple
control check only, and bears no relationship to the intellectual content:
if there is more than one record per respondent (often the case in older files,
especially those with a record length of 80 characters per record (see
discussion of ‘record length’ below)), the number of records should equal the
number of respondents multiplied by the number of records per respondent.
That is, #records = #cases * #records_per_respondent, and the number of
records_per_respondent should be documented in the codebook;
the data file is a hierarchical file, in which case the number of cases will vary
with the type of case, and the number of records is usually not given in the
documentation;
the data file consists of time series data, or administrative data, in which case
very often the number of records will vary with the length of the time series in
each case, and the number of records will bear little relationship to the number
of cases,
in the case of text data, the number of records is usually not given in the
accompanying documentation, and since there are no ‘cases’ per se, will in any
case have little meaning in terms of intellectual content,
in the case of system files and export files, the data are no longer arranged by
case, and therefore the number of physical records has no relationship to the
number of cases or respondents. SPSS will report the number of cases when
EXPORTing a system file to an export file, or IMPORTing an export file to a
system file. Alternatively, to determine the number of respondents in an
existing system file, generate frequencies on a variable with few values, such
as the ‘sex’ variable which occurs in many data files.
E.g. The following SPSS commands will generate frequencies on the variable
‘sexr’ (sex of respondent) in the SPSS system file ‘ind91.sys’ in the
subdirectory '/data/91census':
Title Census of Canada, 1991 - microdata file.
get file=’/data/91census/ind91.sys’
frequencies variables=sexr
finish
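For a raw data file the same frequency check can be done without SPSS: cut out the column and tabulate it. A sketch (the file 'resp.dat' and the sex code in column 5 are hypothetical):

```shell
cat > resp.dat <<'EOF'
00001
00002
00001
00001
00002
EOF

# Tabulate the values in column 5, as 'frequencies' would:
cut -c5 resp.dat | sort | uniq -c
# the counts (3 + 2 here) should sum to the documented N (5)
```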
Record length
Next, determine the ‘record length’ of each physical file. If neither the manifest nor
the codebook gives the record length of each physical file, look at the codebook parts
that describe the layout of each of the physical data files that make up the whole
data file, or at accompanying program (control command) files. Look specifically for
the highest column location number that is mentioned in each record layout. This is
the 'record length', and represents the number of characters or columns there
should be in each physical record (or 'row'). This number should match the
record length determined from the file itself:
Determining record length:
System Command/calculation
DOS (average record length + 2) = (#bytes/#records) (CR/LF adds 2 bytes per record)
DOS maxline3
Unix cio2 [filename]
Unix (average record length + 1) = (#bytes/#records) (LF adds 1 byte per record)
3Maxline is a program written in C by the ESRC Data Archive, University of Essex. It reports the length of the
longest line in a file, as well as the number of control characters and characters with an octal code greater than 177.
VM/CMS filel [fn] [ft] [fm]
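On Unix the calculation amounts to dividing the byte count by the record count and subtracting 1 for the newline, as in the table above. A sketch with a hypothetical file of 5-character records:

```shell
# Two records of 5 characters each, plus one newline byte per record.
printf '12345\n67890\n' > rl.dat

bytes=$(wc -c < rl.dat)
records=$(wc -l < rl.dat)
echo $(( bytes / records - 1 ))   # record length: 5
```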
Instances in which the record length is of little intellectual meaning, but still useful
as a control include:
system files and export files usually have a predetermined record length
required by different versions of the program. E.g. SPSS under VM/CMS
requires that system files have a record length of 1024 and that export files
have a record length of 80. Since in both of these cases, the data are no longer
organized on a record by case basis, the record length has no meaning other
than as a control that program requirements are met,
comma-delimited and blank-delimited files do not depend on a fixed-field
structure, and usually the documentation gives no information about record
lengths,
text files also seldom contain fixed field information, and therefore record
length has little intellectual meaning.
And what do you do if these don’t match what the manifest or the documentation
claim? Double check all your calculations, and try to determine, as in step 2 above,
whether it is the documentation that is in error or incomplete (or in the worst case
scenario, the wrong documentation for that file) or the data file that is incomplete.
Document the discrepancies, and if they are confirmed by the checks in Step 2,
contact the source from which you received the data.
3.3.2.2 Data files with accompanying software
Data files that come bundled with accompanying software, usually on CD-ROM, but
sometimes via ftp (such as Quikstat or the Health indicators database), should be
installed in a DOS/Windows environment in the usual way, following the
accompanying instructions for installation.
If the files are available via ftp, they will be organized in a sequence of /disk[n]
subdirectories. Download the structure as is and either copy onto actual diskettes
before installation, or otherwise maintain the subdirectory structure in order to
install correctly.
3.3.2.3 Spatial data files
In order to be used with GIS software, the geographic reference files need to be
structured in a special way. One important requirement is preserving the
leading zeros in the geographic identification fields. Be kind to your
GIS librarian/users! Post-process geographic reference files as follows:
read the file into your favourite statistical package with:
all geographic identifiers defined as alphabetic variables (to preserve the
leading ‘00's)
read longitude as a variable with 6 decimal places, and convert it to a
negative number (multiply it by -1)
read latitude in as a variable with 4 decimal places
write the file out as a flat ASCII text file, preserving the alphabetic
variables intact and writing latitude and longitude as signed, floating
decimal numbers
read the flat ASCII text file into Geoformat
use Geoformat to write out two sets of files:
one comma-delimited for use in ArcInfo
one as a .dbf file, for use with MapInfo and ArcView
Geoformat (a stand-alone FoxPro application) is available on the ftp site in: ftp://ftp.statcan.ca/geography/geogfiles/refdata/geofmt/*
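The longitude/latitude step can be sketched with awk in place of a statistical package. The file name 'georef.dat', its field layout (identifier, longitude, latitude), and its values are all hypothetical:

```shell
# geographic identifier, longitude, latitude -- blank-delimited
cat > georef.dat <<'EOF'
001 75.123456 45.1234
024 63.500000 44.0000
EOF

# Keep the identifier as a string (leading zeros intact), negate the
# longitude, and write a comma-delimited record.
awk '{ printf "%s,%.6f,%.4f\n", $1, -1 * $2, $3 }' georef.dat
```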
3.3.3 Post-processing documentation files
There are two major types of documentation files: free-form text files, which
describe the data, and control command files which are input to major statistical
packages and must be structured in a specific way.
3.3.3.1 Codebook files
Codebook files vary widely both in their content and in their formats, as well as in
their very names, for there is no standardization as to content or naming of
codebook files.
Ideal codebook contents
Although there are no standards per se, it is generally accepted that good and
complete (optimum) documentation should include all of the following:
official title of the data file and identification of the principal investigator(s)
and their institutional affiliations
a description of the objectives of the research project that collected the data,
its intellectual goals, and its history, project members, sources of funding and
related acknowledgements, etc.
a description of how, when, where, and by whom the data were collected,
verified, processed and cleaned. If the data were collected via a survey of some
sort, descriptions of the sampling frame and procedures, the universe,
response rates, weighting and weight variables, interviewer training, etc.
if the data were compiled from existing sources (e.g. other data files, print
sources, etc.), identification of the sources.
copies of the questionnaire, instructions to interviewers, etc., or a printout of
the CATI/CAPI program.
description of data verification procedures (internal consistency checking),
data cleaning, and so on.
a statement of units of analysis or observations (i.e. who or what do the data
describe)
a description of the organization of the data file, i.e. a listing of the physical
files, their datasetnames, sizes (bytes, number of records, record lengths),
software dependency, and respective content of each physical file
a description of each variable in each physical data file (N.B. it is most useful
if this is given in the order in which the variables are recorded in the data file,
i.e. ‘natural order’), including
sequential order of the variable in the data file,
the question number and full text of the question that generated the
variable, or its exact meaning, source, or other information necessary to
understanding what the variable represents,
column location(s), and record number (if there is more than one record per
case) in which the variable is coded,
type of variable (alphabetic, numeric, number of decimals, whether signed,
etc.)
variable names (if the variables have been assigned names in an SPSS or
SAS file)
universe, e.g. who was asked this question,
all valid codes and the exact meaning of each code, including missing data
codes,
the unweighted frequency of each code (including missing data codes), i.e.
how many times it occurs (actual count and as a percentage of responses),
or, if a code has many possible values (e.g. income), summary statistics
such as the lowest code, the highest code, the range of the code, and the
missing data codes
imputation and editing information, e.g. if the values have been estimated
for any cases
the method by which constructed or derived variables have been created
skip patterns applicable to the variable.
many Statistics Canada files which describe survey microdata files also
contain approximate variance tables. These are needed by researchers who
are doing analyses involving population estimates, and need to compare
their results with the original population estimates from Statistics
Canada.
a bibliography of publications relating to or based on the data file.
Bare minimum codebook content
The above is an idealized list, and not all that many codebooks include all of the
above information. Nor is it obtainable for all data files.
At an absolute bare minimum, however, documentation should include:
official title, and principal investigators
date(s) that the data were collected or dates to which the data refer
number of cases and the unit of analysis
a description of the organization of the data file, i.e. a listing of the physical
files, their datasetnames, sizes (bytes, number of records, record lengths),
software dependency, and content of each of the physical files
record layout, showing full question text of each variable, variable name (if
applicable), column location(s) (in ‘natural order’), variable type, etc.
all existing codes for each variable, and their meanings (value labels)
frequencies of each variable, or summary statistics
a copy of the data collection instrument (questionnaire)
references to publications about or based on the data file.
Common names of documentation files and software-dependent formats are:
data map or record layout
a (usually) computer-generated listing, which may contain any of: variable
names, variable types, column locations, number of decimals, and
missing data codes. Usually these are generic flat text files, with no software
dependency.
DDMS dictionary (files have extensions *2.DBF, *3.DBF, *4.DBF etc.)
Data Dictionary Management Software was developed in Clipper by Health
and Welfare Canada, in the DOS environment. Information is stored in 4
physical files: a directory file which stores title and principal investigator
information, as well as 3 additional files (*2.dbf, *3.dbf, and *4.dbf) which
store respectively variable level, value level, and comment level information.
The source files (*2.dbf, *3.dbf and *4.dbf) are text, but formatted specifically
for DDMS. Files with extensions *5.dbf and *6.dbf are DDMS output
codebook and index files respectively and are flat text files intended to be
printed, but which can be re-routed to a file using a standard utility such as
prn2file.
word processor files (WordPerfect, Microsoft Word, Microsoft Windows, WordStar,
MultiMate, XyWrite, etc.) (extensions .DOC, .WP, .CAT, etc.)
System files containing text formatting commands (the preamble) as well as
the documentation text, usually in the same physical file. Most PC-based
word processors can both import texts formatted for most other current word
processor packages (usually with some corruption of formatting), as well as
write out a flat, text file without the formatting instructions or preamble
(often denoted as ASCII (DOS) text).
Portable Document Format files (extension .PDF)
System files which contain graphic control as well as the text, in a format
which can be read by the Adobe Acrobat viewer (available for Windows,
Macintosh, and Unix from http://www.adobe.com).
PostScript files (extension .PS)
Either text files, containing PostScript programming language (from Adobe)
commands, which control the printer, as well as the documentation text, or
system files (actually, a graphic bit map) generated from the PostScript
program commands and the text by a PostScript interpreter. Device
independent: PostScript code generated by an application can be printed in
any printer with a PostScript interpreter, or displayed using PostScript
interpreter software such as Ghostscript.
HTML (Hypertext Markup Language) files (extension '.htm' or '.html')
Text files containing HTML commands, which control display as well as some
content. Device independent. Can be displayed by any HTML viewer, such as
lynx, Netscape or Mosaic, etc.
SGML (Standard Generalized Markup Language) files
Text files containing SGML commands, which control form of content, rather
than display. Device independent. Can be displayed by any SGML viewer.
ASCII text files (or ‘flat character file’)
a common name for a file which contains only the characters found on a
standard keyboard (the ASCII lower-128 codes) and which is not dependent
on any software, but can be printed or listed with any file listing or printing
software. It is irrelevant whether such a file is stored in ASCII or in
EBCDIC, and it can be moved from one environment to the other with
impunity.
Procedure for printing machine-readable codebooks
1. Copy the file(s)
The first step in codebook processing should always be to copy the codebook file(s) to
your working subdirectory. Never under any circumstances make changes
directly to the archival copy of the codebook files!
Either ftp the files to your account or workstation, or, set up a separate working
subdirectory under the same account, and copy the file(s) to that subdirectory.
Whether you copy/ftp the files to a Unix or DOS/Windows platform depends on the
original format of the files to a certain extent.
Software dependent codebook formats
Extension    Operating system   Binary/ASCII   Probable format
.cat, .wpd   DOS                B              WordPerfect (used by some Stats Can
                                               departments) -- use MS Word or WordPerfect
.dbf         DOS                A|B            dBase/DDMS -- may need DDMS
.doc         DOS                B              MS Word (version unspecified) -- use
                                               MS Word or WordPerfect