Module 3: DLI Ordering and Processing - Carleton University, DLI/IDD 1997 Workshop
TRANSCRIPT
DLI/IDD 1997 Workshop
lr:ed.ch:module3 1
DLI Ordering and Processing
3.1 Identifying files for acquisition
The two driving forces behind data acquisition are an expressed need on the part
of a user and collection development policy.
Be aware that it is explicit DLI policy that data files should be acquired only in
response to an expressed user need, and not acquired from DLI simply because a
file is available.
The presumption here is that a reference interview has taken place. Therefore,
ascertain the following:
either the specific data file the user requires, or
the variables the user needs, including
what coding of these variables is needed
what time period the data should cover
what geographic area the user needs to describe
whom or what the user needs to describe: individuals, groups of
individuals, or a geographic area (i.e. what unit of observation is
needed)
what the user intends to do with the numbers
what product the user wants
what software the user will be using
what platform the user will be doing his/her work on.
3.1.1 Selecting files
The characteristics of a data file can be categorized into those which describe the
intellectual content of the data file, and those which describe the physical form
of the data file.
Characteristics of the intellectual content:
Variables
substantive (dependent variables) versus demographic (independent
variables)
level of coding of variables (e.g. age or income as a categorical or
continuous variable)
Time period of data
date of data collection
time period covered by data (which is not necessarily the date of
collection; e.g., the income variables in the Survey of Consumer
Finances and the Census normally refer to the previous year.)
Geography
coverage of geographic area
availability of variables to identify required level of geography. For
example, the Survey of Consumer Finances microdata files
contain coding for region, but not province or census metropolitan
area.
Level of observation
microdata
aggregate data
time series data
One can aggregate from microdata to aggregate data, but cannot
disaggregate already aggregated data to smaller units or to
microdata.
                             Desired Output
Data Type           Microdata   Aggregate data   Time-series data
Microdata               x             x
Aggregate data                        x                 x
Time-series data                                        x
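The one-way nature of this relationship can be sketched with a small, hypothetical example (the file name and data values are made up): given microdata records, aggregate totals are easily produced with standard Unix tools, but nothing in the resulting totals would let you recover the individual records.

```shell
# Hypothetical microdata: one record per respondent,
# column 1 = region code, column 2 = income.
cat > micro.dat <<'EOF'
1 30000
1 20000
2 40000
EOF

# Aggregate: total income per region.
awk '{sum[$1] += $2} END {for (r in sum) print r, sum[r]}' micro.dat | sort

# The reverse step -- recovering the three original records
# from the two regional totals -- is impossible.
```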
For spatial data, and georeferenced data (standard Statistics Canada geography), it
is possible to merge smaller units into larger ones, but not to disaggregate larger
units into smaller ones. The following table displays, for 1991 census geographic
products, the level of geography that can be generated (output) from spatial or
georeferenced products at the commonly used input levels:
Output:   ea    ct    cma/ca    csd    cd    fed    rp
Input:
ea         x     x       x       x      x     x      x
ct*              x       x      (x)    (x)          (x)
cma/ca                   x
csd                              x      x            x
cd                                      x            x
fed                                           x      x
*Note: with 1971, 1976, and 1981 census tract level data (which included census
tracts in census metropolitan areas/census agglomerates as well as provincial
census tracts) it was possible to also generate census-subdivision, census division
and region/province level data. With 1986 and 1991 census-tract level data, which
includes only census tracts in census metropolitan areas/census agglomerates, this
is no longer possible.
Edition/version of the data
version of data: different software-dependent versions
edition (1st, 2nd, 3rd, etc.)
Characteristics of format
Access to the data
direct access by the user or access through an intermediary
ease of use (familiarity of software)
remote delivery versus hardware/software infrastructure to
use/access
Input requirements of task
‘manual’ input
output from another task
Output requirements of task (what output does user need?)
subset (file) for further analysis
generic format
software-dependent format
report
table
map
3.2 Ordering/acquisition process
The acquisition process consists of two parts: first establishing that the data are
available from DLI (and if not, where else the data might be available), and
secondly, actually acquiring a copy of the data.
3.2.1 Establishing availability
3.2.1.1 If you know the title of the data file you require, you can determine its
availability through DLI using information from the DLI web site or the
DLI mail lists:
the DLI WWW-site:
lists data files that will become available through DLI
lists data files that are currently available for ftp from the DLI ftp site,
including the subdirectory in which each is to be found
http://www.statcan.ca/english/Dli/dli.htm
http://www.statcan.ca/francais/Dli/dli_f.htm
The ‘dlilist’ listserv:
New data files are announced via [email protected]. If you have not already done
so, you should be subscribed to this listserv.
To subscribe, send an e-mail message to [email protected],
with message text: subscribe dlilist [your first name] [your last name]
It is a good idea to save the messages from this listserv, perhaps in alphabetical
order by title, for future reference, so that when you receive an inquiry, you will
have the information to hand.
Dlilist runs on the listproc software. It is important to distinguish between
messages which manage your subscription to the list (these should be sent
only to [email protected]) and messages intended for everyone else on the
list (these should be sent to [email protected]).
Table of listproc commands:
N.B. These commands should be sent to: [email protected]
help [topic]                      get information re listproc commands
set [listname] [option] [argument]
                                  change [option] to new value [argument]
subscribe [listname] [your name]  subscribe to specified list
unsubscribe [listname]            remove yourself from specified list
signoff [listname]                same as 'unsubscribe'
recipients [listname]             receive a listing of non-concealed people
                                  subscribed to specified list
review [listname]                 same as 'recipients'
information [listname]            receive general information file about specified list
index [listname]                  get a list of files in [list] archive
get [archive] [filename]          receive a copy of specified file(s) from specified archive
search [archive] [pattern]        receive a list of archive files that contain the
                                  character string [pattern]
which                             receive a list of the local lists to which you are
                                  subscribed
3.2.1.2 If you do not know the title of the data file:
A separate handout will include a list of other reference tools which are available to
determine the exact title and version/edition of the data file you require.
3.2.1.3 If the data are not currently available via DLI, but you think they should or
might be, send an inquiry to [email protected]. Responses are now
received within a day or two.
A WWW-based order status system has been set up on the DLI WWW-page, at:
http://www.statcan.ca/english/Dli/dli.htm
http://www.statcan.ca/francais/Dli/dli_f.htm
Access to this order status page is restricted in two ways:
only the official DLI-contact person at your institution can access the page
access is only possible from a platform whose IP-address has been
registered with Mr. Jackie Godfrey (e-mail [email protected] to
register your IP-address for this purpose).
3.2.1.4 Knowing what data are not available from DLI, nor will be, is as important
as knowing what is available.
Data files that are not available through DLI:
data that are collected by Statistics Canada but for which no standard
computer-readable product is produced (such as the CANSIM cross-
classified database, etc.)
data collected by other federal government departments (other than
Statistics Canada)
data collected by provincial and municipal government departments
attitudinal or poll data
A separate handout will offer suggestions as to where to search for data files that
are not available via DLI.
3.2.2 Acquiring the data
Data files from DLI are disseminated in two major ways:
those that are cd-rom products are mailed in one copy only, with
documentation, to the official DLI-contact.
all other products are made available via ftp to the official DLI-contact from
the Statistics Canada ftp site at ftp://ftp.statcan.ca
The official DLI-contact receives occasional e-mail messages containing the current
password for the ftp site. There are plans to provide a WWW-interface so that files
can also be ftp'd via WWW.
To acquire a copy of the data,
if the data file is a cd-rom product, the official DLI-contact person should
send an e-mail message to [email protected]
if the data file is available via ftp, the official DLI-contact person should:
ftp the data from ftp://ftp.statcan.ca
and
send an e-mail message to [email protected] requesting a copy of
the documentation.
3.2.3 Ftping files
The File Transfer Protocol allows you to copy files from one computer to another. It
is possible to transfer files to or from a remote computer, regardless of format,
allowing you to retrieve software programs, graphic images, sound files, etc., in
addition to ASCII text files.
To access ftp-able resources you must either use a superclient (such as Mosaic,
Netscape, lynx, etc.), a gopher client, or an ftp client (such as (Unix) ftp, or
(Windows) WinSockFTP or Rapid Filer, or (DOS) kermit).
[Note from Laine: Be aware that Rapid Filer is the only ftp-client I am aware
of that will allow you to ftp the content of an entire subdirectory with one
command; however, it does so by assuming the mode of the ftp of each file on
the basis of the file extension, and thus is only useful when all files in the
subdirectory have standard Mime-compliant extensions. Other clients, such as
WS_ftp, require you to specify the mode of the ftp transfer on a file-by-file
basis, but in return give you control over the mode of transfer of each file.]
SYNTAX
(Unix client): ftp [site] [port]
(from within a superclient): ftp://[site]:[port]
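By way of illustration only, a typical session with the Unix ftp client follows this pattern (the account, password, directory, and file names shown are placeholders, not actual DLI credentials):

```
$ ftp ftp.statcan.ca
Name: [account name]
Password: [current DLI password]
ftp> cd [subdirectory]
ftp> dir
ftp> binary
ftp> get [filename]
ftp> quit
```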
COMMANDS (selected)
FTP accepts only a limited set of commands. Not all FTP servers accept all
commands listed below; which commands are acceptable will depend on both the
FTP server software installed on the remote host and the FTP client software you
are using. Use ‘help’ or ‘help ftp’ to display commands available on the version of
FTP software you are using. Selected FTP commands are:
Command Action
Navigation on the REMOTE host (host you are ftping to):
cd [dirname] change remote working directory
cdup change directory ('upwards') to the parent directory
cd .. change directory ('upwards') one subdirectory
close close connection to remote host
del [fn] delete a file
dir list files in current directory
get [fn] |more display specified file without copying it to local system - use ‘q’
to exit
help display help information
ls -al list content of current directory
mdel * delete all files in a directory
mkdir [dirname] make a new directory
open [site] connect to new system [site]
pwd print path information to current directory
rmdir [dirname] remove a directory
Navigation on the LOCAL host (host you are ftping from):
!dir display a list of files in current directory
!dir |more display list of files one page at a time
lcd [dirname] change 'local' directory
lcd .. move up/back one directory
lcdup move up/back one directory
!ls -al display a list of files in current directory
!ls -al |more display list of files one page at a time
!mkdir [dirname] create a new directory with name [dirname]
!pwd display path information to current local directory
!rmdir [dirname] delete directory
Copying files between hosts
ascii change transfer mode to transfer ASCII text files
binary change transfer mode to transfer binary files
prompt turn prompting for transfer of each individual file off/on
get [fn] copy specified file from remote host to local host
mget * copy all files from remote host to local host
mput * copy all files from local host to remote host
put [fn] copy specified file from local host to remote host
<ctrl><c> cancel file transfer
<ctrl><z> suspend file transfer
bg %1 resume suspended transfer in the background
Notes (for the Unix ftp client):
You must be logged on to both the remote and local computer simultaneously,
and have an account and password on both computers.
Read the Readme.first files. On a Unix server (such as ftp.statcan.ca) it is possible
to read files in remote ftp directories without actually 'getting' them first.
Type get [filename] |more
and use <ctrl><c> or <q> to exit reading mode.
Actually, it is a good idea to both read the Readme.first files, as well as ‘get’
them, for later reference.
Two files that are especially useful on the ftp.statcan.ca site are in the ftp
root directory (the very first directory you see when you login):
Readme.first contains a list of data file titles with corresponding
subdirectory name
Dirlist.txt contains a directory listing of the entire ftp site, and is
very useful for discovering how the site is organized and
where all possible variants of a file are, especially in very
complex subdirectories, such as the ‘geography’
subdirectory.
Unix systems are case-sensitive. ALWAYS give the directory or filename(s)
exactly as shown, including punctuation, and upper and lower case characters.
Distinguish between directories and files. Only files can be transferred.
Enter 'cd [subdirectory name]' to move from one directory to the next lower
subdirectory.
Enter 'cd ..' or 'cdup' to move back up in the subdirectory hierarchy, one directory
level at a time. Use cd ../../.. to move up three subdirectories at a time, etc.
To 'get' or 'put' a file, you must know how much disk space you have free, and how
big the file is; use 'ls -al' or 'dir' to display file sizes. On a Unix system, use ‘df’ to
display remaining disk space before you run ftp.
When 'getting' a file, you may supply a filename for the incoming file, if you wish to
change it, e.g. get Readme.first readme.gss10.
Use 'get' to get one file, or 'mget' to get several files with the same filename
characteristics. The 'wild card' is '*', but can be used only with ‘mget’.
E.g. mget *.txt
If you make a mistake in typing a filename, try to backspace using:
the backspace key, or
<ctrl><h>, or
<ctrl><backspace>.
To abort a file transfer, the terminal interrupt key sequence is usually <ctrl><c>.
To merely suspend file transfer use <ctrl><z> followed by 'bg %1' to restart the
transfer in the background. Be very scrupulous in checking file sizes after ftping in
the background.
It is considered good netiquette to avoid using ftp sites during their working hours,
and to not linger at a site any longer than is necessary to retrieve the files you need.
Transferring files between different environments
When ftping files, i.e. transferring them from one computing environment to
another, two things are very important:
whether the file contains ‘binary’ codes, especially when being copied
between an ASCII environment and an EBCDIC environment;
end-of-line conventions in the environments between which the file is being
transferred.
When files are uploaded in binary mode, they are copied from one system to
another exactly. This is absolutely essential for files in a software-dependent
format, especially files which contain binary codes (i.e. ASCII upper-128 codes). If
the file being copied contains binary characters and is uploaded in ASCII mode,
some of the binary characters may be interpreted as ftp control characters, and
either terminate the ftp session, or merely result in the corruption of the
transferred file. When transferring a file containing binary codes between an ASCII
environment and an EBCDIC environment, translation problems may also occur
with ASCII upper-128 codes if the file is not transferred in binary mode.
When files are uploaded in ASCII mode, however, although the bulk of the file is
uploaded byte by byte as is, some things do change, especially when you are ftp-ing
between different operating systems. Uploading from a DOS/Windows environment,
to a Unix environment (or vice versa), in ASCII mode, the end-of-line (EOL)
character at the end of each physical record is changed from the two characters
DOS uses (CR-LF) to the single character Unix uses (LF). CMS, on the other hand,
normally stores files in a fixed-length format with no end-of-line characters at
all; the length of each record is constant.
NEWLINE CHARACTERS IN PLAIN TEXT DATA FILES 1

Data files                    typically have added    bytes per line
'mainframe' tapes             neither CR nor LF       zero
written for IBM mainframes    neither CR nor LF       zero
written for Unix              LF                      one
written for DOS               both CR and LF          two
written for Macintosh         CR                      one
In short, if you need to preserve the existing EOLs, move the file between different
operating systems in ASCII mode; if EOL codes are irrelevant (e.g. in system files)
or the file contains binary fields, move the file between operating systems in binary
mode.
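A minimal sketch of these EOL differences, using standard Unix tools (the file names are made up): the same three-character line occupies four bytes with a Unix EOL and five with a DOS EOL, and deleting the CRs converts one form to the other.

```shell
# Unix EOL: LF only -> 'abc' plus 1 byte = 4 bytes
printf 'abc\n' > unix.txt
wc -c < unix.txt

# DOS EOL: CR+LF -> 'abc' plus 2 bytes = 5 bytes
printf 'abc\r\n' > dos.txt
wc -c < dos.txt

# Deleting the CRs turns a DOS text file into a Unix one.
tr -d '\r' < dos.txt > dos2unix.txt
cmp unix.txt dos2unix.txt && echo 'files are now identical'
```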
Enter the ‘binary’ command before ftping a binary file. Use the ‘ascii’ command
prior to transferring text files. The following table lists some common file name
extensions, as well as some commonly occurring data-related file types, and
whether the files should be transferred in binary or ascii mode.
File type/extension: Ftp mode: Operating system:
.arc file binary
.cat file binary DOS/Mac
.com file binary DOS
.doc file binary DOS/Mac
.exe file binary DOS/Unix
.gz file binary
.tar file binary
1 Data services and collections (Part 2)./ Geraci, Diane, Chuck Humphrey, and Jim Jacobs. [Ann
Arbor, Mich.]: Inter-University Consortium for Political and Social Research, August 1996. P.
265.
.wp[n] file binary DOS/Mac
.z file binary
.zip file binary
.Z file binary
ArcInfo export files ascii DOS/Mac
ASCII text file ascii
dBase file (.dbf) binary DOS/Mac
DDMS dictionary file ascii DOS
Lotus 1-2-3 file binary DOS/Mac
MapInfo files binary DOS/Mac
PDF file (.pdf) binary
PostScript file (.ps) ascii
raw data ascii
SPSS system file binary
SAS system file binary
SPSS export file ascii
SAS export file binary
SPSS command file ascii
SAS command file ascii
If in doubt, copy the file twice, once in binary mode and once in ascii mode, to
different filenames of course, and see which one gives the most satisfactory
result.
And one last piece of advice: make a print-screen of the directory listing of the
files you are copying from the remote site - this will help later when you are
doing the post-processing of the files.
REFERENCES:
Anonymous FTP : frequently asked questions (FAQ)./Perry Rovers. Nov. 30,
1995.
[*]ftp://rtfm.mit.edu/pub/usenet-by-group/news.answers/ftp-list/
[*]http://www.cis.ohio-state.edu/hypertext/faq/ftp-list/faq/faq.html
File transfer protocol (FTP)./ Postel, J and J. Reynolds. Oct. 1985 (RFC 959)
[*]ftp://nic.merit.edu/documents/rfc/rfc0959.txt
How to use anonymous ftp./ P. Deutsch, A. Emtage, A. Marine. May
1994. (FYI 24; RFC 1635)
[*]ftp://ftp.internic.net/rfc/rfc1635.txt
[*]ftp://a.cni.org/pub/FYI-RFC/fyi24.txt
Compression FAQ./ Lemson, David
[*]ftp://rtfm.mit.edu/pub/usenet-by-group/news.answers/compression-faq/
[*]http://www.cis.ohio-state.edu/hypertext/faq/usenet/compression-faq/top.html
File compression, archiving, and text[-]binary formats./ Lemson, David
[*]ftp://ftp.cso.uiuc.edu/doc/pcnet/compression
3.3 Post-processing files
This section discusses the processing that you may or may not, depending on your
circumstances, go through to ensure:
that you received the entire data file,
that the files you received are complete and useable, and
that the task of actually using the data is as easy and uncomplicated for
your users as possible.
Different types of files require very different post-processing, partly depending on
the format of the files, and partly depending on how you intend to deliver the data
to your users:
data files that are not accompanied by retrieval software require fairly
extensive post-processing, but there are a number of ways in which to
deliver the data to your users,
data files that are accompanied by retrieval software require relatively
little post-processing, but your options for delivering the data to your users
are more limited,
some fairly extensive post-processing of georeferenced data files will be
needed before they can be used with contemporary GIS software,
documentation files will also need more or less extensive post-processing,
depending on how you intend to deliver the information to your users.
3.3.1 Post-processing: first steps
The first steps in post-processing are common to all file types.
Step 1: Check the number of physical carriers
First check that you have received all the physical carriers, if the data have arrived
on a removable medium such as diskette or cd-rom. The number of diskettes, cd-
roms, etc. should be indicated in any of the following:
a covering letter, or
a separate manifest accompanying the data file, or
somewhere in the codebook/user manual which describes the data file.
Step 2: Check physical files, file names, and file sizes
(N.B. THIS IS ESPECIALLY IMPORTANT IF YOU HAVE FTP’D THE FILES,
BUT UNIMPORTANT IF THE DATA ARRIVED ON A CD-ROM.) Next check that
you have received the correct number and size of physical files. The number of
physical files which comprise the data file, as well as such information as the
dataset names and perhaps even the physical file sizes, should be indicated in any
of:
a covering letter, or
the Readme.first file (a manifest) in the file subdirectory on the ftp site, or
somewhere in the codebook/user manual describing the data file.
Determine the number, dataset names, and sizes of the physical files you have
actually received, and compare this information with the information
gleaned from the accompanying documentation.
System: Command:
MSDOS dir [drive|directory]
Windows [use File Manager]
Unix ls -al
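One simple way to make this comparison on a Unix system, assuming you have typed the expected file names into a manifest of your own (manifest.txt and the .dat names below are made up for this sketch): sort both lists and let diff report anything missing or unexpected.

```shell
# Type the expected file names (from the covering letter or
# Readme.first) into a manifest, one per line.
printf 'file1.dat\nfile2.dat\n' > manifest.txt

# For this sketch, pretend these are the files we just ftp'd.
touch file1.dat file2.dat

# List what was actually received, sort both lists, and compare.
# No output from diff means the two lists match exactly.
ls *.dat | sort > received.txt
sort manifest.txt > expected.txt
diff expected.txt received.txt && echo 'all files present'
```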
The rules for dataset names and/or file names vary from system to system:
in Unix, file names may be up to 255 characters, but some older System V
Unix systems only allow 14 characters. File names may have extensions of
any length, preceded by a period (e.g. ‘.sys’). Filenames can contain any
special character except /.
in VM/CMS, file names consist of three parts: the filename (fn), the file type
(ft) and the file mode (fm). Each of fn and ft can be up to eight characters
(alphabetic, or numeric, and some special characters); the fm is usually given
as one character. The three parts of the name are usually given separated by
a blank.
in DOS/Windows, file names are restricted to a maximum of eight characters,
and may be composed of alphabetic, numeric, or special characters (but only:
!@#$%^&()_-{} or ‘). Normally file names are followed by a period and a
maximum three character extension. DOS/Windows file names must never
include a blank.
Similarly, display of file size varies from environment to environment:
in Unix, disk file size is displayed in terms of bytes, including the end-of-line
character (‘LF’ (linefeed), ASCII octal 012) at the end of each record (line, or
row) in the file,
in DOS/Windows, disk file size is displayed in terms of bytes, including the
two end-of-line characters ('CR' and 'LF' (carriage return and line feed),
ASCII octal 015 and 012) at the end of each record in the file,
in VM/CMS, and indeed most IBM mainframe operating systems, disk file size
is displayed not in bytes but in terms of maximum number of characters per
record (record length), number of records, and record type (whether fixed
length (‘fb’), or variable length (‘vb’)). To ascertain the file size in terms of
bytes, multiply the record length by the number of records, as displayed by
the command: filel [fn ft fm]. This calculation will be accurate for files with
fixed length records. File size display usually also includes the number and
size of disk blocks needed to store the file. N.B. there are no end-of-line
characters in files on IBM mainframe computers.
in VMS, disk file size is displayed in terms of number of blocks (usually of
either 512 or 1024 bytes per block).
When comparing file names and file sizes, you must take into consideration the
physical environment in which the original information was generated (for DLI
files, Unix) and
that in which you have just checked it. When file name or size discrepancies occur,
try to determine whether or not they may be the result of simply having been
moved from one environment to another.
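The arithmetic can be sketched for a hypothetical fixed-length file of 1,000 records with a record length of 80: the same data occupy a different number of bytes on each system, purely because of the EOL conventions described earlier.

```shell
records=1000
lrecl=80

cms_bytes=$((lrecl * records))          # CMS: no EOL characters
unix_bytes=$((cms_bytes + records))     # Unix: one LF per record
dos_bytes=$((cms_bytes + 2 * records))  # DOS: CR+LF per record

echo "CMS: $cms_bytes  Unix: $unix_bytes  DOS: $dos_bytes"
```

A 1,000-byte discrepancy between a reported and an observed size for such a file is therefore exactly what a move between CMS and Unix would produce.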
If the information about what you should have received and what you actually
received do not match, try to determine what is wrong, which will usually be one of
the following:
you have not yet uncompressed/unbundled the files (if they were received in a
compressed and/or bundled format). If this is the case, uncompress/unbundle
the files (see below), and check again. Alternatively,
the discrepancy may be a result of having moved from one environment to
another, or
the file is incomplete, or
the documentation is in error or incomplete, or
the documentation does not match the data file (i.e. you have the wrong
documentation or the wrong data file).
Document the discrepancies, and if you cannot satisfactorily account for small
discrepancies by their having been shifted from one environment to another, and
the discrepancies are confirmed by the checks of the internal consistency (see
below), contact the source from which you received the data.
Step 3: Uncompress/unbundle any compressed files
Data arriving via ftp are usually in either one of the Unix compressed and/or
bundled formats, or in an MS-DOS/Windows compatible compressed format
(Statistics Canada typically uses PKSFX to compress/bundle files from a
DOS/Windows environment). Increasingly, software to unbundle/uncompress these
formats is becoming available on several systems.
The main compressed/bundled formats are:
[fn].exe - file is compressed and/or bundled as a 'self-extracting' file, usually
with PKSFX. Move to the appropriate platform, or uncompress on
Unix using ‘unzip’.
To uncompress in DOS: [fn]
To uncompress in Unix: unzip [fn].exe
[fn].Z - file is compressed using Unix 'compress' software. ‘Uncompress’
software is available for DOS, Unix, Mac, etc., platforms.
To uncompress: uncompress *.Z
[fn].tar - usually several files have been 'bundled' together in one physical file
by the 'tar' program, so that you don't have to ftp over a whole bunch
of individual files. ‘Tar’ is available for DOS, Unix, Mac, etc.,
platforms.
To untar: tar -xvf [filename].tar
N.B. .tar files are often subsequently compressed, so that ftping will be
faster. Such files have both a .tar followed by a .Z extension. First
they must be uncompressed using 'uncompress', then un-tarred
using 'tar'.
[fn].gz - file is compressed using the gzip program. It can be uncompressed 'on
the fly' during ftping, but this means that it is the full file that is
ftp'ed over, not the compressed file, and this will take much more
time. Instead, it is more efficient to ftp the file in binary, and gunzip
it locally. Gunzip is available for DOS, Unix, Mac, etc., platforms.
To un-gzip: gunzip [filename].gz
To un-gzip ‘on the fly’ within ftp: get [filename]
Note: to gunzip on the fly, do NOT use the ‘.gz’ extension on the
filename with the ‘get’ command.
E.g. get [filename]
[fn].zip - file(s) have been compressed [and possibly bundled] using the pkzip
program. The file must be uncompressed/unbundled using ‘pkunzip’,
which is available for DOS, Unix, Mac, etc., platforms.
To uncompress in DOS: pkunzip *.zip
To uncompress in Unix: unzip [fn].zip
[fn].arc - file(s) have been compressed [and possibly bundled] using the pkarc
program. The file must be uncompressed/unbundled using 'pkxarc',
which is available for DOS, Unix, Mac, etc., platforms.
Extension To uncompress/unbundle
.arc pkxarc [fn].arc
.exe [fn]
.Z uncompress [fn].Z
.gz gunzip [fn].gz
.tar tar -xvf [fn].tar
.tar.Z first 'uncompress [fn].tar.Z', then 'tar -xvf [fn].tar'
.zip pkunzip [fn].zip
Uncompress/unbundle the files as appropriate, and generate and print yet another
ls -al |more or dir listing of the uncompressed/unbundled files. Check this listing
against the Readme.first file to ensure that you have all files and that they are the
correct size.
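A small round-trip sketch of the bundle-then-compress pattern described above, using gzip/gunzip (more widely installed today than compress/uncompress; the file names and contents are made up):

```shell
# Bundle two small files, compress the bundle, then reverse both steps.
printf 'data one\n' > part1.txt
printf 'data two\n' > part2.txt

tar -cf bundle.tar part1.txt part2.txt   # bundle
gzip bundle.tar                          # -> bundle.tar.gz

rm part1.txt part2.txt                   # pretend we just ftp'd bundle.tar.gz

gunzip bundle.tar.gz                     # first uncompress...
tar -xvf bundle.tar                      # ...then unbundle
cat part1.txt part2.txt
```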
For a fairly complete discussion of compression/bundling and related software, as
well as locations of compression/bundling software, see:
Compression FAQ./ Lemson, David
ftp://rtfm.mit.edu/pub/usenet/news.answers/compression-faq/*
File compression, archiving, and text[-]binary formats./ Lemson, David
ftp://ftp.cso.uiuc.edu/doc/pcnet/compression
Step 4: Determining what the files are
A data file can consist of a number of component physical files: one or more physical
data files, program files and documentation files. Simply put, a data file contains
data, a program file contains either a program or the instructions to a program as to
how to read the data, and documentation files contain information about the data
(often called ‘metadata’) which is needed by the user to understand what the data
mean.
Physical data files
Data files are those that contain the data themselves, which may be either numeric
(i.e. recorded as numbers) or alphabetic (recorded in the letters of the alphabet) or a
combination of the two. One data file can consist of one or more physical data files.
For example, the Canadian General Social Survey on time use (General social
survey number 2), conducted by Statistics Canada in 1987, consists of three
physical data files:
Main file - one record for each respondent in the file.
Incident file - one record for each reported time usage incident per
respondent (i.e. many incidents for each respondent)
Summary file - one record for each respondent, consisting of a
summary of the incident records.
Each of these data files will have its own program files, and may also have separate
documentation files.
Further, the content of data files can be differentiated by other characteristics:
organization of records: the relationship of physical records to logical records,
level/type of observation: what each logical record describes,
field structure: the way in which variables are coded in each record.
Organization of records
Level/type of observation    flat    hierarchical    unstructured
microdata                     x           x
macrodata (aggregate)         x           x
time-series                   x           x
vector                                    x
raster                        x
text                          x           x                x
The level/type of observation determines the type of analysis for which the file is
appropriate, i.e. the use researchers are likely to make of the file. The organization
of the records in the file affects how the structure of the data file is defined to a
statistical package.
Field structure
raw data file
    fixed field files
        card image (80 characters per record)
        'lrecl' (record length in bytes less than 80 or greater than 80)
    delimited field files
        comma delimited
        blank delimited
        other character delimited
    tagged fields
software-dependent data file
    system file
    transport/export file
Field structure, like the organization of the records in the file, affects how the
structure of the data file is defined to a statistical package.
Raw data files, that is physical data files that contain only data (numeric or
alphabetic or both), are not much of a problem. If the documentation is complete,
and adequately and accurately reports what variables are coded in what columns,
they can be read by almost any commercially available statistical package.
Raw data files are relatively easily recognized by the following characteristics:
if the data are numbers, the numbers should be displayed in discernible vertical
patterns (columns), or separated by blanks or commas, e.g.
35000200000000000000000000014171963224000000099000000100999
24421200000000000000000000013791902114000000099001001014000
24000200000000000000000000001851896115000000099000000100999
35555200000000000000000000013221059124000000099145010302056
12000200000000000000000000014761904222000000099000000100999
if the data are alphabetic, the text should be legible, and there should be a
discernible vertical pattern to the display,
there should be no characters that do not occur on a standard keyboard.
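These checks can be scripted with standard Unix tools. A minimal sketch (the file name 'sample.dat' is hypothetical; its contents are the five records shown above):

```shell
# Recreate the five sample records shown above in a scratch file.
cat > sample.dat <<'EOF'
35000200000000000000000000014171963224000000099000000100999
24421200000000000000000000013791902114000000099001001014000
24000200000000000000000000001851896115000000099000000100999
35555200000000000000000000013221059124000000099145010302056
12000200000000000000000000014761904222000000099000000100999
EOF

# A raw data file should contain only printable characters: count the
# lines holding any byte outside the printable-ASCII range.
nonascii=$(LC_ALL=C grep -c '[^ -~]' sample.dat || true)
echo "$nonascii"    # 0 for a clean raw data file
```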
System or export files, on the other hand, will often have a partially eye-readable
header which will give you a clue as to what software is needed to read the file.
Program files
Raw data files are often accompanied by separate ‘program’ files (or ‘control
command files’) that contain the instructions to a specific program (e.g. SAS or
SPSS) as to how to read the raw data file. For example, the first few lines of the
SPSS control command file which describes how to read the above data might look
like this:
title Census of Canada, 1981 - public use microdata file individuals
data list file=in /
Prov 1-2
Cma 3-5
etc.
This file instructs SPSS to read a numeric variable in columns 1 and 2 (counting
from left to right) and assign the name ‘prov’ to it, and to read a numeric variable to
be named ‘cma’ in columns 3 through 5. Thus, ‘prov’ is a two-digit number (in the
above case, with values '35', '24', '24', '35', and '12' respectively) and 'cma' is a three-digit number (with values '000', '421', '000', '555', and '000' respectively).
Some files actually come with both SPSS and SAS control command files, or even
control command files for other programs, such as TSP, SHAZAM, TPL, etc.
Unfortunately, none of these programs can understand commands formatted for
any of the other packages.
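The same column extraction can be mimicked with standard Unix tools, which is a handy way to spot-check that the documentation matches the data. A sketch (the file name 'sample.dat' is hypothetical; its contents are the five records shown earlier):

```shell
cat > sample.dat <<'EOF'
35000200000000000000000000014171963224000000099000000100999
24421200000000000000000000013791902114000000099001001014000
24000200000000000000000000001851896115000000099000000100999
35555200000000000000000000013221059124000000099145010302056
12000200000000000000000000014761904222000000099000000100999
EOF

# What SPSS does with 'Prov 1-2' and 'Cma 3-5', done with cut:
cut -c1-2 sample.dat    # prov: 35, 24, 24, 35, 12
cut -c3-5 sample.dat    # cma:  000, 421, 000, 555, 000
```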
System files
Some program files actually contain the program that reads the data file (either as a source file, which must be compiled, or more frequently as a compiled file, often bundled with the data into one physical file). Others accompany a data file in a proprietary format (usually binary) that can be read only by the program that created it; such a file is usually called a 'system file'.
When a statistical program such as SPSS reads a file of control commands and the
data that they describe, that information, both the instructions as to how to read
the data, and the data themselves, are stored in temporary workspace until the user
stops running the program.
Storing the instructions and data in one file in workspace makes it faster for the
user, who can refer to variables by name, recode them, perform various
transformations, etc. But getting them into that format from the initial form of raw
data and control commands takes some time. Most programs therefore also allow
you to store this temporary work file as a separate physical file, which is
‘permanent’ in the sense that it does not disappear when you stop running the
program -- much in the same way that you can store a WordPerfect document with
[Figure: a statistical program (e.g., SAS, SPSS, Minitab) reads a raw data file plus a control command file into a temporary work file, which can be saved as a system file, or written out as an export file, a raw data file, a record layout, frequencies, or content information.]
all the formatting, changes, etc., so that it does not disappear when you exit
WordPerfect. This file is called a ‘system file’ and it is intended to make things
faster and easier for the user. Lotus 1-2-3 and dBase also allow you to store the system file as a permanent file; the difference between these programs and SAS or SPSS is that the dBase or 1-2-3 file format must be defined within the program, rather than in a separate physical file which is plain text and therefore eye-readable and portable. Unfortunately, a system file is in a special format which is not eye-readable. It may well not be readable by other versions of the same program, especially versions running under a different operating system than the one on which the system file was created, and it is certainly not readable by any program other than the one that created it. For example, a system file created using SPSS(PC) under DOS will not be readable by SPSS for Windows without a special interface, and is certainly not readable by SAS.
Both SAS and SPSS are major, well-known statistical packages, and each includes commands to read and write all of these formats.
READING AND WRITING SPSS FILE FORMATS
File format        Command to read               Command to write
raw data           data list file='[path/fn]'    write outfile='[path/fn]'
SPSS system file   get file='[path/fn]'          save outfile='[path/fn]'
SPSS export file   import file='[path/fn]'       export outfile='[path/fn]'
Data Interchange Formats
In addition, both SPSS and SAS (as well as other programs, such as ArcInfo) will
produce a data interchange format, called an 'export file' in SPSS or a 'transport file' in SAS. The object of the interchange file is to be able to move a system file from one operating system or package to another. For example, to use on a machine running SPSS for Windows a system file created with SPSS under the Unix operating system, you would have to convert the system file to an export file on the Unix platform (using SPSS) and then use SPSS for Windows to 'import' the file, thus creating a new system file.
Alternatively, you could write out the data stored in the system file as a raw data
file, and move that to the Windows platform. But then you would have to go
through the whole process of redefining to SPSS for Windows where and how to
read every variable and value in the file -- this information about how to read the
data is preserved when you write out an export file, and in turn import it.
While some data interchange formats are not readable by other programs, they are
readable by the programs that produce them. In some situations, a major package
will read the data interchange format of a major competitor. For example, SAS will
read an SPSS export file, although SPSS does not currently read a SAS transport
file. Both SAS and SPSS will read the dBase (.DBF) and Lotus (.WKS) data interchange formats, covering a common database format and a common spreadsheet format respectively.
READING AND WRITING SAS FILE FORMATS
File format       Command to read                  Command to write
raw data          data; infile [path/fn];          data; file [path/fn]; put;
SAS system file   data; set [path/fn]; or          data [path/fn];
                  [proc] data=[libn].[dsn];
SAS export file   proc cimport infile=[path/fn];   proc cport file=[path/fn];
Looking at the files
How do you tell which files are the data, program, and/or documentation files? And what format is each of the files you have received in? You have several sources of clues:
Step 1: read the Readme.first file accompanying the data files, which should
explain the format (especially the software dependence) of all files. Remember not
to take this information as gospel truth, however -- always check the files
themselves to see what they really contain.
Step 2: look at the extensions of the filenames, or the entire filenames. This
information too can be used as a guideline, but should not be relied on entirely
without checking the files themselves.
Step 3: look at the files themselves:
in Unix, several commands are available (although not all will be available on
all flavours of Unix):
To simply display a file:
browse displays a file one screen at a time
cat displays an entire file (scrolling)
cio 'Check it out'. This Perl script not only lists the first 10 lines of
a file, but also indicates how many records are in the file, as well as
the number of records by line length.
head displays the first 10 lines of a file (see also ‘tail +0’)
less displays a file one screen at a time
more displays a file one screen at a time
pg displays a file one screen at a time
tail displays the last 10 lines of a file by default
tail +0 displays a file from the beginning (i.e. the entire file)
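Where 'cio' is not available, its report can be approximated with standard tools. A sketch, using a hypothetical three-record file 'myfile':

```shell
printf 'aaaa\nbb\ncccc\n' > myfile     # hypothetical 3-record file

head -10 myfile                        # first 10 lines
wc -l < myfile                         # number of records
awk '{ print length($0) }' myfile |    # number of records by line length
    sort -n | uniq -c
```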
To look more closely at what’s in a file, especially at the non-printable characters
(e.g. those odd spaces):
cat -evt displays an entire file, including most non-printing characters
(including ASCII upper-128), tabs, and end-of-lines.
hexdump displays a file in hexadecimal, octal, decimal, and ASCII
od displays a file in octal, decimal, hexadecimal, and ASCII
vis displays a file including non-printable characters in octal etc.
xd displays a file in octal and hexadecimal
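For example, 'cat -evt' (the flag order is immaterial) and 'od -c' make an otherwise invisible tab character obvious. The file name 'odd.txt' is hypothetical:

```shell
printf 'a\tb\n' > odd.txt   # one record containing a hidden tab

cat -evt odd.txt            # shows the tab as ^I and the end-of-line as $
od -c odd.txt               # shows every byte; the tab appears as \t
```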
in DOS/Windows, there are a few programs that will allow you to look at the
content of a file, including:
type            displays a file in text only (use this with small files
                only)
list.com        available at ftp://princeton.edu/pub/misg-lib/UTIL/list.com;
                displays a file in text or in hexadecimal
vedit           proprietary software from Greenview Data Inc.; can display
                both ASCII and EBCDIC files, in text and hexadecimal mode
Norton Utilities Disk Editor
                proprietary software; displays a file in text or in
                hexadecimal
Xtree Pro       proprietary software; displays a file in text or in
                hexadecimal
for word processor files, try to import them with either Microsoft Word or
WordPerfect, and let the software detect which format it thinks the file is
(N.B. this is relatively easy, but doesn’t always result in a readable file.)
3.3.2 Post-processing data files (without accompanying retrieval software)
From here on in, the post-processing of files varies according to what the files are,
and whether or not they are software dependent, and/or are accompanied by
retrieval software. First, the fairly extensive post-processing needed by generic data
files without accompanying retrieval software.
Step 1: check the 'internal' completeness or integrity of the physical
files by comparing the number of records in each file, and the record
length or lengths of its records, against the information provided in the
documentation. To do this, you may find it expedient actually to process the
documentation files first.
Number of records
First, look at the manifest or codebook carefully, to determine if it indicates the
number of records in each physical data file. In a raw data file with one record per
respondent, this is the same as the number of cases (or respondents), also known as
‘the N’. If the codebook discusses an ‘unweighted N’ and a ‘weighted N’ (or cases, or
respondents), the number you are interested in is the ‘unweighted N’.
Determining number of records:
System Software Command
DOS maxline maxline [filename]
Unix wc wc -l [filename]
Unix cio2 cio2 [filename]
VM/CMS filelist filel [fn] [ft] [fm]
Software availability:
System Software Available as/at:
DOS maxline ftp://datalib.library.ualberta.ca
Unix wc standard Unix utility
Unix cio2 ftp://gort.ucsd.edu/pub/jj/cio2
VM/CMS filelist standard VM/CMS utility
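On Unix, for example, the check against the documented N takes one line. A sketch with a hypothetical five-record file ('survey.dat' and its documented N of 5 are for illustration only):

```shell
# Hypothetical raw data file with one record per respondent, N = 5.
printf 'case1\ncase2\ncase3\ncase4\ncase5\n' > survey.dat

documented_n=5
records=$(wc -l < survey.dat)
if [ "$records" -eq "$documented_n" ]; then
    echo "record count matches documented N"
else
    echo "MISMATCH: $records records, codebook says $documented_n"
fi
```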
Although in a raw data file, the number of records corresponds to the number of
cases, there are other instances in which the number of records is of use as a simple
control check only, and bears no relationship to the intellectual content:
if there is more than one record per respondent (often the case in older files,
especially those with a record length of 80 characters per record (see
discussion of ‘record length’ below)), the number of records should equal the
number of respondents multiplied by the number of records per respondent.
That is, #records = #cases * #records_per_respondent, and the number of
records_per_respondent should be documented in the codebook;
the data file is a hierarchical file, in which case the number of cases will vary
with the type of case, and the number of records is usually not given in the
documentation;
the data file consists of time series data, or administrative data, in which case
very often the number of records will vary with the length of the time series in
each case, and the number of records will bear little relationship to the number
of cases,
in the case of text data, the number of records is usually not given in the
accompanying documentation, and since there are no ‘cases’ per se, will in any
case have little meaning in terms of intellectual content,
in the case of system files and export files, the data are no longer arranged by
case, and therefore the number of physical records has no relationship to the
number of cases or respondents. SPSS will report the number of cases when
EXPORTing a system file to an export file, or IMPORTing an export file to a
system file. Alternatively, to determine the number of respondents in an
existing system file, generate frequencies on a variable with few values, such
as the ‘sex’ variable which occurs in many data files.
E.g. The following SPSS commands will generate frequencies on the variable
‘sexr’ (sex of respondent) in the SPSS system file ‘ind91.sys’ in the
subdirectory '/data/91census':
Title Census of Canada, 1991 - microdata file.
get file=’/data/91census/ind91.sys’
frequencies variables=sexr
finish
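For a raw data file the same frequency check can be done without SPSS: cut out the column and tabulate it. A sketch (the file 'resp.dat' and the sex code in column 5 are hypothetical):

```shell
cat > resp.dat <<'EOF'
00001
00002
00001
00001
00002
EOF

# Tabulate the values in column 5, as 'frequencies' would:
cut -c5 resp.dat | sort | uniq -c
# the counts (3 + 2 here) should sum to the documented N (5)
```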
Record length
Next, determine the ‘record length’ of each physical file. If neither the manifest nor
the codebook gives the record length of each physical file, look at the codebook parts
that describe the layout of each of the physical data files that make up the whole
data file, or at accompanying program (control command) files. Look specifically for
the highest column location number that is mentioned in each record layout. This is
the 'record length', and represents the number of characters or columns there
should be in each physical record (or 'row'). This number should match the
record length determined from the file itself:
Determining record length:
System Command/calculation
DOS (average record length + 2) = (#bytes/#records) (CR/LF adds 2 bytes per record)
DOS maxline3
Unix cio2 [filename]
Unix (average record length + 1) = (#bytes/#records) (LF adds 1 byte per record)
3Maxline is a program written in C by the ESRC Data Archive, University of Essex. It reports the length of the
longest line in a file, as well as the number of control characters and characters with an octal code greater than 177.
VM/CMS filel [fn] [ft] [fm]
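On Unix the calculation amounts to dividing the byte count by the record count and subtracting 1 for the newline, as in the table above. A sketch with a hypothetical file of 5-character records:

```shell
# Two records of 5 characters each, plus one newline byte per record.
printf '12345\n67890\n' > rl.dat

bytes=$(wc -c < rl.dat)
records=$(wc -l < rl.dat)
echo $(( bytes / records - 1 ))   # record length: 5
```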
Instances in which the record length is of little intellectual meaning, but still useful
as a control include:
system files and export files usually have a predetermined record length
required by different versions of the program. E.g. SPSS under VM/CMS
requires that system files have a record length of 1024 and that export files
have a record length of 80. Since in both of these cases, the data are no longer
organized on a record by case basis, the record length has no meaning other
than as a control that program requirements are met,
comma-delimited and blank-delimited files do not depend on a fixed-field
structure, and usually the documentation gives no information about record
lengths,
text files also seldom contain fixed field information, and therefore record
length has little intellectual meaning.
And what do you do if these don’t match what the manifest or the documentation
claim? Double check all your calculations, and try to determine, as in step 2 above,
whether it is the documentation that is in error or incomplete (or in the worst case
scenario, the wrong documentation for that file) or the data file that is incomplete.
Document the discrepancies, and if they are confirmed by the checks in Step 2,
contact the source from which you received the data.
3.3.2.2 Data files with accompanying software
Data files that come bundled with accompanying software, usually on CD-ROM, but
sometimes via ftp (such as Quikstat or the Health indicators database), should be
installed in a DOS/Windows environment in the usual way, following the
accompanying instructions for installation.
If the files are available via ftp, they will be organized in a sequence of /disk[n]
subdirectories. Download the structure as is and either copy onto actual diskettes
before installation, or otherwise maintain the subdirectory structure in order to
install correctly.
3.3.2.3 Spatial data files
In order to be used with GIS software, the geographic reference files need to be
structured in a special way. One important requirement is preserving the
leading zeros in the geographic identification fields. Be kind to your
GIS librarian/users! Post-process geographic reference files as follows:
read the file into your favourite statistical package with:
all geographic identifiers defined as alphabetic variables (to preserve the
leading ‘00's)
read longitude as a variable with 6 decimal places, and convert it to a
negative number (multiply it by -1)
read latitude in as a variable with 4 decimal places
write the file out as a flat ASCII text file, preserving the alphabetic
variables intact and writing latitude and longitude as signed, floating
decimal numbers
read the flat ASCII text file into Geoformat
use Geoformat to write out two sets of files:
one comma-delimited for use in ArcInfo
one as a .dbf file, for use with MapInfo and ArcView
Geoformat (a stand-alone FoxPro application) is available on the ftp site in: ftp://ftp.statcan.ca/geography/geogfiles/refdata/geofmt/*
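The longitude/latitude step can be sketched with awk in place of a statistical package. The file name 'georef.dat', its field layout (identifier, longitude, latitude), and its values are all hypothetical:

```shell
# geographic identifier, longitude, latitude -- blank-delimited
cat > georef.dat <<'EOF'
001 75.123456 45.1234
024 63.500000 44.0000
EOF

# Keep the identifier as a string (leading zeros intact), negate the
# longitude, and write a comma-delimited record.
awk '{ printf "%s,%.6f,%.4f\n", $1, -1 * $2, $3 }' georef.dat
```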
3.3.3 Post-processing documentation files
There are two major types of documentation files: free-form text files, which
describe the data, and control command files which are input to major statistical
packages and must be structured in a specific way.
3.3.3.1 Codebook files
Codebook files vary widely both in their content and in their formats, as well as in
their very names, for there is no standardization as to content or naming of
codebook files.
Ideal codebook contents
Although there are no standards per se, it is generally accepted that good and
complete (optimum) documentation should include all of the following:
official title of the data file and identification of the principal investigator(s)
and their institutional affiliations
a description of the objectives of the research project that collected the data,
its intellectual goals, and its history, project members, sources of funding and
related acknowledgements, etc.
a description of how, when, where, and by whom the data were collected,
verified, processed and cleaned. If the data were collected via a survey of some
sort, descriptions of the sampling frame and procedures, the universe,
response rates, weighting and weight variables, interviewer training, etc.
if the data were compiled from existing sources (e.g. other data files, print
sources, etc.), identification of the sources.
copies of the questionnaire, instructions to interviewers, etc., or a printout of
the CATI/CAPI program.
description of data verification procedures (internal consistency checking),
data cleaning, and so on.
a statement of units of analysis or observations (i.e. who or what do the data
describe)
a description of the organization of the data file, i.e. a listing of the physical
files, their datasetnames, sizes (bytes, number of records, record lengths),
software dependency, and respective content of each physical file
a description of each variable in each physical data file (N.B. it is most useful
if this is given in the order in which the variables are recorded in the data file,
i.e. ‘natural order’), including
sequential order of the variable in the data file,
the question number and full text of the question that generated the
variable, or its exact meaning, source, or other information necessary to
understanding what the variable represents,
column location(s), and record number (if there is more than one record per
case) in which the variable is coded,
type of variable (alphabetic, numeric, number of decimals, whether signed,
etc.)
variable names (if the variables have been assigned names in an SPSS or
SAS file)
universe, e.g. who was asked this question,
all valid codes and the exact meaning of each code, including missing data
codes,
the unweighted frequency of each code (including missing data codes), i.e.
how many times it occurs (actual count and as a percentage of responses),
or, if a code has many possible values (e.g. income), summary statistics
such as the lowest code, the highest code, the range of the code, and the
missing data codes
imputation and editing information, e.g. if the values have been estimated
for any cases
the method by which constructed or derived variables have been created
skip patterns applicable to the variable.
many Statistics Canada files which describe survey microdata files also
contain approximate variance tables. These are needed by researchers who
are doing analyses involving population estimates, and need to compare
their results with the original population estimates from Statistics
Canada.
a bibliography of publications relating to or based on the data file.
Bare minimum codebook content
The above is an idealized list, and not all that many codebooks include all of the
above information. Nor is it obtainable for all data files.
At an absolute bare minimum, however, documentation should include:
official title, and principal investigators
date(s) that the data were collected or dates to which the data refer
number of cases and the unit of analysis
a description of the organization of the data file, i.e. a listing of the physical
files, their datasetnames, sizes (bytes, number of records, record lengths),
software dependency, and content of each of the physical files
record layout, showing full question text of each variable, variable name (if
applicable), column location(s) (in ‘natural order’), variable type, etc.
all existing codes for each variable, and their meanings (value labels)
frequencies of each variable, or summary statistics
a copy of the data collection instrument (questionnaire)
references to publications about or based on the data file.
Common names of documentation files and software-dependent formats are:
data map or record layout
a (usually) computer-generated listing, which may contain any of: variable
names, variable types, column locations, number of decimals, and
missing data codes. Usually these are generic flat text files, with no software
dependency.
DDMS dictionary (files have extensions *2.DBF, *3.DBF, *4.DBF etc.)
Data Dictionary Management Software was developed in Clipper by Health
and Welfare Canada, in the DOS environment. Information is stored in 4
physical files: a directory file which stores title and principal investigator
information, as well as 3 additional files (*2.dbf, *3.dbf, and *4.dbf) which
store respectively variable level, value level, and comment level information.
The source files (*2.dbf, *3.dbf and *4.dbf) are text, but formatted specifically
for DDMS. Files with extensions *5.dbf and *6.dbf are DDMS output
codebook and index files respectively and are flat text files intended to be
printed, but which can be re-routed to a file using a standard utility such as
prn2file.
word processor files (WordPerfect, Microsoft Word, Microsoft Windows, WordStar,
MultiMate, XyWrite, etc.) (extensions .DOC, .WP, .CAT, etc.)
System files containing text formatting commands (the preamble) as well as
the documentation text, usually in the same physical file. Most PC-based
word processors can both import texts formatted for most other current word
processor packages (usually with some corruption of formatting), as well as
write out a flat, text file without the formatting instructions or preamble
(often denoted as ASCII (DOS) text).
Portable Document Format files (extension .PDF)
System files which contain graphic control as well as the text, in a format
which can be read by the Adobe Acrobat viewer (available for Windows,
Macintosh, and Unix from http://www.adobe.com).
PostScript files (extension .PS)
Either text files, containing PostScript programming language (from Adobe)
commands, which control the printer, as well as the documentation text, or
system files (actually, a graphic bit map) generated from the PostScript
program commands and the text by a PostScript interpreter. Device
independent: PostScript code generated by an application can be printed in
any printer with a PostScript interpreter, or displayed using PostScript
interpreter software such as Ghostscript.
HTML (Hypertext Markup Language) files (extension '.htm' or '.html')
Text files containing HTML commands, which control display as well as some
content. Device independent. Can be displayed by any HTML viewer, such as
lynx, Netscape or Mosaic, etc.
SGML (Standard Generalized Markup Language) files
Text files containing SGML commands, which control form of content, rather
than display. Device independent. Can be displayed by any SGML viewer.
ASCII text files (or ‘flat character file’)
a common name for a file which contains only the characters found on a
standard keyboard (the ASCII lower-128 codes) and which is not dependent
on any software, but can be printed or listed with any file listing or printing
software. It is irrelevant whether such a file is stored in ASCII or in
EBCDIC, and it can be moved from one environment to the other with
impunity.
Procedure for printing machine-readable codebooks
1. Copy the file(s)
The first step in codebook processing should always be to copy the codebook file(s) to
your working subdirectory. Never under any circumstances make changes
directly to the archival copy of the codebook files!
Either ftp the files to your account or workstation, or, set up a separate working
subdirectory under the same account, and copy the file(s) to that subdirectory.
Whether you copy/ftp the files to a Unix or DOS/Windows platform depends on the
original format of the files to a certain extent.
Software dependent codebook formats
Extension    Operating system   Binary/ASCII   Probable format
.cat, .wpd   DOS                B              WordPerfect (used by some Stats Can
                                               departments) -- use MS Word or WordPerfect
.dbf         DOS                A|B            dBase/DDMS -- may need DDMS
.doc         DOS                B              MS Word (version unspecified) -- use
                                               MS Word or WordPerfect