attribute grammars for validating chunk-based binary file...

50
i Attribute Grammars for Validating Chunk-based Binary File Formats William Underwood Sandra Laib ICL/ITDSD Working Paper 11-03 July 2011 Information Technology and Decision Support Division Information and Communications Laboratory Georgia Tech Research Institute Atlanta, Georgia Research was sponsored by The Army Research Laboratory (ARL) and the Applied Research Division of the National Archives and Records Administration (NARA) and was accomplished under Cooperative Agreement Number W911NF-10-2-0030. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory, NARA, or the US Government. The U.S. government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation heron.

Upload: truongnguyet

Post on 25-Jun-2018

235 views

Category:

Documents


5 download

TRANSCRIPT

Page 1: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

i

Attribute Grammars for Validating

Chunk-based Binary File Formats

William Underwood

Sandra Laib

ICLITDSD Working Paper 11-03

July 2011

Information Technology and Decision Support Division

Information and Communications Laboratory

Georgia Tech Research Institute

Atlanta Georgia

Research was sponsored by The Army Research Laboratory (ARL) and the Applied Research

Division of the National Archives and Records Administration (NARA) and was accomplished

under Cooperative Agreement Number W911NF-10-2-0030 The views and conclusions

contained in this document are those of the authors and should not be interpreted as representing

the official policies either expressed or implied of the Army Research Laboratory NARA or

the US Government The US government is authorized to reproduce and distribute reprints for

Government purposes notwithstanding any copyright notation heron

ii

ABSTRACT

The capability to validate and view or play binary file formats as well as to convert binary file

formats to standard or current file formats is critically important to the preservation of digital

data and records

Context-free grammars are used for defining programming languages There are established

methods for using these grammars to check the syntax of programs and to compile programs into

assembly or machine language programs Grammars have also been used to define page

description languages and vector graphics languages However current technologies for context-

free grammars syntax checkers and compilers are limited to string languages or string data

types

This report describes the extension of context-free grammars from strings to binary files Binary

files are arrays of data types such as long and short integers floating-point numbers and pointers

as well as characters The concept of an attribute grammar is extended to these context-free array

grammars Attribute grammars have been used to define a number of chunk-based file formats

A parser generator has been used with some of these grammars to generate syntax checkers

(recognizers) for validating binary file formats The significance of these results is that with

these extensions to core computer science concepts traditional parsercompiler technologies can

be used as a part of a cost effective preservation strategy for binary file formats

iii

TABLE OF CONTENTS

1 INTRODUCTION 1

11 BACKGROUND 1

12 PURPOSE 1

13 SCOPE 1

2 BINARY FILE FORMATS 1

3 CONTEXT-FREE GRAMMARS AND ATTRIBUTE GRAMMARS 3

31 CONTEXT-FREE GRAMMARS FOR ARRAYS OF DATA TYPES 3

32 ATTRIBUTE GRAMMARS 4

33 SEMANTIC GRAMMAR 5

4 GRAMMARS FOR CHUNK-BASED BINARY FILE FORMATS 6

41 INTERLEAVED BITMAP (ILBM) 6

42 RIFF WAVEFORM PCM AUDIO FILE FORMAT (WAVE) 9

5 RECURSIVE DESCENT PARSERS FOR BINARY FILE GRAMMARS 10

6 A PARSER GENERATOR FOR RECOGNIZING BINARY FILE FORMATS 11

61 ANTLR 11

62 USING ANTLR TO GENERATE A PARSER FOR A BINARY FILE FORMAT 11

621 An ANTLR Grammar for an ILBM Chunk-based File Format 11

622 The Lexer and Parser 16

63 AN ANTLR GRAMMAR FOR THE RIFF WAVEFORM PCM FILE FORMAT 18

7 RELATED RESEARCH 20

8 CONCLUSION 21

REFERENCES 24

APPENDIX A CHUNK-BASED BINARY FILE FORMATS 33

A1 ELECTRONIC ARTS INTERCHANGE FILE FORMAT (IFF) 33

A11 Interleaved Bitmap (ILBM) 33

A12 8-bit Sampled Voice (8SVX) 33

A13 Cel Animation (ANIM) 34

A14 IFF Formatted Text (FTXT) 35

A15 IFF Simple Musical Score (SMUS) 35

A16 Planar Bitmap (PBM) 35

A2 STANDARD MIDI FILE FORMAT 35

A3 AUDIO INTERCHANGE FILE FORMAT 35

A4 RESOURCE INTERCHANGE FILE FORMATS 37

A41 RIFF-Waveform (WAVE) 37

A42 WAVEFORMATEX 37

A43 WAVEFORMATEXTENSIBLE 37

A44 Broadcast Wave Format 37

iv

A45 Windows Audio-Video Interleave (AVI) 37

A46 Animated Windows Cursor (ANI) 37

A47 RIFF MIDIfile (RMID) 37

A48 Device-Independent Bitmap (RIFF RDIB) 37

A49 WebP 37

A5 JPEG 38

A51 JPEGJFIF 38

A52 JPEGExif 38

A53 JPEG 2000 39

A54 JPEG 2000 Codestream 39

A6 ADVANCED SYSTEMS FORMAT 39

A61 Windows Media Audio 7-9 39

A62 Windows Media Video 7-91 39

A7 PORTABLE NETWORK GRAPHICS 39

A71 Multiple-image Network Graphics 40

A72 JPEG Network Graphics 40

A8 BINARY INTERCHANGE FILE FORMAT 40

A9 3D STUDIO 40

A91 3D Studio File Format ndash 3ds 40

A92 3D Studio Material Library Files 40

A10 ANIM8OR 40

A11 ANIMATOR 40

A111 Animator Cel 40

A112 Animator Pro Fli 40

A113 Animator Pro fLC 40

A111 Animator Cel 40

A114 Animator Pro PIC 40

A12 AUDIO MANAGER MODULE 41

A13 AUTODESK DRAWING INTERCHANGE FILES (DXF) (BINARY) 41

A14 CALIGARI TRUESPACE SCENEOBJECT (BINARY) 41

A15 CORELDRAW VECTOR GRAPHICS FILE 41

A16 COREL METAFILE EXCHANGE IMAGE 41

A17 MUSIC MODULES 41

A171 DigiTrekker Module Format 41

A172 Graoumf Tracker modules 41

A173 Oktalyzer 41

A18 EA MULTIMEDIA GAME FILE FORMATS 41

A181 Electronic Arts Music (asf) 41

A182 Electronic Arts CMV 41

A183 Electronic Arts DCT 41

A184 Electronic Arts MAD 42

A185 Electronic Arts MPC 42

A186 Electronic Arts VP7 42

A187 Electronic Arts TGV 42

A188 Electronic Arts TGQ 42

v

A189 Electronic Arts TQI 42

A19 IMAGINE 3D DATA DESCRIPTION (TDDD) 42

A20 LIGHTWAVE 3D OBJECTS 42

A201 LWOB 42

A202 LWO2 42

A203 LWLO 42

A21 LUXOLOGY MODO 3D OBJECT (LXOB) 42

A22 PAINT SHOP PRO (PSP) 42

A23 PROVECTOR 2D DRAWING 43

A24 PROWRITE DOCUMENT 43

A25 INTERACTIVEFICTION 43

A251 Blorb Z-machine Resource File 43

A252 Quetzal 43

A26 APPLE QUICKTIME 43

A27 MAYA IFF BITMAP 43

A28 APPLE CORE AUDIO FORMAT 43

A29 AMERICAN LASER GAMES (MM) 43

A30 SMUSH ANIMATION FORMAT 43

A301 SAN 43

A302 SNM 43

A31 STRUCTURED DATA EXCHANGE FORMAT (SDXF) 43

A32 OGG VORBIS 44

A33 SIMPLE VECTOR ARRAY FORMAT 44

A34 NESTED LIST FILE 44

A35 CINEMA 4D 44

A36 THE SIMS INTERCHANGE FILE FORMAT 44

vi

List of Figures

Figure 1 File format (Layout) of a RIFF container for a Webp Image 2

Figure 2 ILBM File hampic Displayed 7

Figure 3 An Attribute Grammar for an ILBM Chunk-based File Format 7

Figure 4 A Parse Tree for the ILBM File hampic 8

Figure 5 Binary File Grammar for RIFF WAVEFORM PCM File Format 10

Figure 6 An ANTLR Grammar for the ILBM File Format 16

Figure 7 A Java Application for Parsing an ILBM File 17

Figure 8 Output of the ILBM Parser Applied to the file BLKIFF 18

Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format 20

1

1 Introduction

11 Background

Automated tools are required for identifying and validating the formats of the huge number of

files ingested into digital data and record archives Automated tools are also needed for

viewingplaying text and binary file formats and for converting legacy and obsolete file formats

to standard or current formats Technologies such as data description languages have emerged to

address these preservation challenges [Dunckley et al 2007]

Context-free grammars have been used to specify the syntax of programming languages

Attribute grammars provide a framework for formally specifying the semantics of a language

based on its context-free grammar and for addressing the mildly context-sensitive features of

programming languages such as agreement of the data types of variables in expressions with data

type declarations Such grammars have parsingtranslation algorithms that can be used for

syntax-checking and interpretation or translation of the languages the grammars define

12 Purpose

The research question addressed by this research is whether it is possible to extend the context-

free grammars used to specify the syntax of programming languages to the specification of

binary file formats and to use these grammars with parsers for validating the file formats of

binary files

13 Scope

The next section discusses the traditional approach to specifying file formats and two of the

major families of binary file formats In section 3 extensions to the concepts of context-free

grammars and attribute grammars are described that enable the specification of binary file

formats In section 4 examples of attribute array grammars for a chunk-based binary file formats

are presented In section 5 recursive descent parsers for these classes of grammars are described

In section 6 experience in using ANTLR a parser generator for LL() string grammars in

generating parsers for recognizing the formats of binary files is discussed In section 7 related

research is described Section 8 summarizes the results

2 Binary File Formats

Traditionally a file (or record) layout was a form showing how fields were positioned in a file

(or record) The fields are named and have as attributes a data type length and sometimes a

constant value Fields are offset at addresses relative to the beginning of a file File formats are

still specified in this manner For instance Figure 1 shows the format layout of a RIFF container

for the VP8 codec of a Webp image [Google 2010]

2

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk starting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

Figure 1 File format (Layout) of a RIFF container for a Webp Image

Many binary file formats are specified using pseudo-regular expressions or pseudo-EBNF

notation EBNF (Extended Backus-Naur Form) is a notation for expressing context-free

grammars in a compact human readable way Pseudo-EBNF is similar in concept to pseudocode

which is a high-level description of a computer algorithm that is intended for human

understanding but that omits details that would be necessary for computer execution The

pseudo-EBNF (or regular expression) specification of a file format is augmented with a natural

language description of the details that are not actually expressible in the EBNF (or regular

expression) notation These details are often the context-sensitive relationships of the size of an

array to its actual length or the relationship of an address pointer to the actual location of the

data pointed to in the file These context-sensitive relationships are not expressible in a context-

free grammar (or EBNF notation) It is a goal of this research to extend the EBNF notation and

concept of a context-free grammar to include the details that are necessary to precisely specify

the file formats of binary file formats that include these context-sensitive features

Binary file formats with a similar file structure are referred to as a family of file formats There

are two families of binary file formats that are readily distinguished the chunk-based and the

directory-based file formats

Chunk-based file formats were created by Electronic Arts and Commodore-Amiga as the

Interchange File Format (IFF) [Morrison 1985] An IFF file itself is one entire IFF chunk A

chunk consists of an ID tag a size and size bytes of data The data may include chunks called

subchunks Subchunks have the same structure as chunks Subchunks can have subchunks

Chunk-based file formats are primarily used as containers for multimedia but have also been

used for word processing files including pictures and other figures File format specifications for

chunk-based formats may use other terms than chunks in describing the format for instance

atoms in QuickTimeMP4 segments in JPEG and ―tagged data representations as in

AutoCAD DXF

3

The Audio Interchange File Format (AIFF) developed by Apple computer in 1988 was based on

Electronic Arts Interchange File Format Apple Core Audio Format (CAF) Applelsquos replacement

format for AIFF remains a chunk-based format [Apple Computer 2005] The Resource

Interchange File Format (RIFF) introduced by IBM and Microsoft [1991] is also based on

Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-Video

Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis WebP a picture format

recently introduced by Google[2010] also uses RIFF as a container

Microsoftlsquos Advanced Systems Format (ASF) [2004] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV) Microsoftlsquos Binary Interchange File Format [Rentz 2008]

(Microsoft Excellsquos File Format) is chunk-based

Another family of binary file formats is directory-based A directory-based file format consists of

one or more directory tables which contain one or more directory entries A directory entry

specifies where the actual data for a type of information is located This is the scheme used in

TIFF files OLE (Microsoft Object Linking and Embedding) files OASIS OpenDocument and

Microsoft Open Office files

3 Context-Free Grammars and Attribute Grammars

Context-free grammars have been widely used to define programming languages A context-free

grammar G is a quadruple ltN Σ S Pgt where

N is a finite set of non-terminal symbols

Σ is a finite set of terminal symbols

S ε N is the start symbol

P is a set of production rules of the form N rarr N Σ

The set of all strings over a vocabulary Σ is denoted by Σ The language generated by a context-

free grammar G is the set L(G) = w w Σ and S w

31 Context-Free Grammars for Arrays of Data Types

An array is a data structure consisting of a collection of elements (values or variables) each

identified by an index An array is stored so that the position of each element can be computed

from its index For example an array of 10 integer variables with indices 0 through 9 may be

4

stored as 10 (4 byte) words at file addresses 0 4 8 hellip36 so that the element with index i has the

address 4 times i

Arrays are used to implement many other data structures such as lists and strings In most

modern computers the internal memory and external memory of storage devices (and files on

those devices) is a one-dimensional array of data types whose indices are their addresses

A data type is a classification of one of various types of data such as floating-point integer

character pointer or Boolean that determines the possible values for that type the operations

that can be performed on values of that type and the way values of that type can be stored Data

types have names such as int16 int32 float char bool ptr We define a binary file to be an array

of values of data of various types

We define a context-free binary file grammar BG as a quintuple ltN D Σ S Pgt where

N is a finite set of non-terminal symbols

D is a set of data types

Σ is a finite set of binary values of data types D called terminals

S ε N is the start symbol

P is a set of production rules of the form N rarr N Σ

Let DataTypes indicate the union of all the values of all datatypes D The set of all binary files is

BinaryFiles = [IndicesrarrDataTypes] where Indices = 1 The language generated by a

context-free array (binary file) grammar AG is the set L(AG) = w w BinaryFiles and S w

By primitive nonterminal is meant a symbol which would be replaced by one or more terminal

symbols were an additional production rule applied In other words a primitive nonterminal is a

nonterminal that appears on the left-hand side of productions which only have one or more

terminal symbols on the right We adopt the notation shown below for primitive nonterminals

ltprimitive_nonterminal_symbol type=datatype_name constant=valuegt

32 Attribute Grammars

Context-free grammars cannot represent context-sensitive aspects of programming languages

such as (1) enforcing the constraint that all variables are declared before they are used or (2)

checking the number of parameters in a function call against the number in the functionlsquos

declaration Context-free grammars have also been used to define the syntax of English and other

natural languages but they cannot define such context-sensitivity as subject verb agreement in

English sentences A number of extensions of context-free grammars have been proposed to

5

address the context-sensitivity of programming languages and natural languages These include

indexed [Aho 1968] recording [Barth 1979] affix [Koster 1971] VW [van Wijngaarden 1974]

and attribute grammars [Knuth 1969 1971]

Context-free grammars also cannot represent the semantics of programming languages Knuth

[1968 1971] proposed an extension of context-free grammars termed attribute grammars that

addresses the semantics as well as the context-sensitivity of programming languages

An attribute grammar GA is a triple ltG A ARgt where G is a context-free grammar for the

language A associates each grammar symbol X isin (N Σ) with a set of attributes and AR

associates each production R isin P with a set of attribute computation rules A(X) where X isin (N

Σ) can be further partitioned into two sets synthesized attributes S(X) and inherited attributes

I(X) AR(R) where R isin P contains rules for computing inherited and synthesized attributes

associated with the symbols in the production R Synthesized attributes are evaluated and passed

up from deeper terminals and non-terminals while inherited attributes are passed down from

upper non-terminals

An L-attributed grammar is an attribute grammar that uses

i) synthesized attributes and

ii) inherited attributes where the inherited attribute depends only on

a) attributes of the other child nodes to its left

b) inherited attributes of P itself

L-attributed grammars allow the attributes to be evaluated in one left-to-right traversal of the

abstract syntax tree As a result attribute evaluation in L-attributed grammars can be

incorporated conveniently in top-down parsing

33 Semantic Grammar

An interpreter for a binary file format that is using a grammar to display or play the contents of a

file or to convert it to another file format needs to produce a semantic representation that is

appropriate to the task Natural language understanding components of dialog systems face a

similar need to produce a semantic representation that is appropriate to the dialogue task To

address this need some developers of dialogue systems make use of an approach called semantic

grammar [Issar and Ward 1993 Ward and Issar 1994] A semantic grammar is a CFG in which

the non-terminals on the left-hand side of rules correspond to semantic entities A parser for a

semantic grammar produces a labeling of the input string with semantic node labels for example

SHOW FLIGHTS ORIGIN DESTINATION DEPART_DATE DEPART_TIME Show me flights from Boston to San Francisco on Tuesday morning

6

A semantic grammar is just what we need to file in the values of a programming language data

structure Semantic grammars are not applicable to generic systems but are limited to specific

domains This is not a problem in specifying or parsing a binary file format since a grammar for

a binary file format is not meant to apply to all binary file formats but just to a single format or

to a family of file formats

4 Grammars for Chunk-based Binary File Formats

The attribute grammar that we present will be a mixture of synthesized semantic rules (attribute

values of the left side of a production are calculated from attribute values of symbols on the right

side) and inherited rules (attribute values of symbols on the right side of a production are

calculated from attribute values of the symbol on the left side)

The field names of the file format spec are represented as nonterminals in the grammar Constant

values in fields are terminals that appear on the right-hand side of productions Nonterminal

fieldnames have data types associated with them

41 InterLeaved BitMap (ILBM)

The author of the Interleaved Bitmap (ILBM) file format recognized that the structure of the file

format could be specified using pseudo EBNF rules and C data structure declarations [Morrison

1986] In this section the ―grammar of the ILBM file format specification is converted to a

context-free binary grammar with synthesized and inherited attributes

Figure 2 shows a display of an ILBM file hampic Figure 3 shows an attribute grammar for its

chunk-based binary file format The attribute grammar is a mixture of synthesized semantic rules

(attribute values of the left side of a production are calculated from attribute values of symbols

on the right side) and inherited rules (attribute values of symbols on the right side of a

production are calculated from attribute values of the symbol on the left side) The attribute

grammar is also an L-attributed grammar Figure 4 shows a parse tree for the file hampic

7

Figure 2 ILBM File hampic Displayed

ltILBMgt rarr ―FORM ltcksize TYPE=UINT32gt ―ILBM ltBMHDgt ltCMAPgt ltCAMGgt ltBODYgt

ILBMcksize = cksizevalue

ltBMHDgt rarr ―BMHD ltcksize TYPE=uint32gt ltBitmapHeadergt

ltBitMapheadergt rarr ltwidth type=UINT32gt

lt height type=UINT16gt

lt xposition type=INT16gt

ltyposition type=INT16gt

ltnplanes type=BYTEgt

ltmasking type=BYTEgt

ltcompression type=BYTEgt

ltreserved type=BYTEgt

lttransparentcolor type=UINT16gt

ltxaspect type=BYTEgt

ltyaspect type=BYTEgt

ltpagewidth type=INT16gt

ltpageheight type=INT16gt

ltCMAPgt -rarr―CMAP ltcksize type=UINT32gt ltcolorgt[n] n=cksizeval3

ltCAMGgt rarr ―CAMG ltcksizetype=UINT32gt ltviewmode type=INT32gt

ltBODYgt rarr ―BODY ltcksize type=UINT32gt ltdata type=BYTEgt[cksize]

ltcolorgt rarr ltred type=BYTEgt ltgreen type= BYTEgt ltblue type=BYTEgt

Figure 3 An Attribute Grammar for an ILBM Chunk-based File Format

8

Figure 4 A Parse Tree for the ILBM File hampic

9

42 RIFF Waveform PCM Audio File Format (WAVE)

The waveform PCM File format is was originally specified in an extended BNF notation [IBM amp

Microsoft 1991] The chunk size was omitted from the EBNF rules Figure 5 shows the EBNF

rules with the cksize inserted To convert this grammar into a binary file format grammar data

types are added for primitive non-terminals

ltWAVE-formgt -gt bdquoRIFF‟ cksize WAVE

ltfmt-ckgt

[ltfact-ckgt]

[ltcue-ckgt]

[ltplaylist-ckgt]

[ltassoc-data-listgt]

ltwave-datagt

ltfmt-ckgt -gt bdquofmt ‟ cksize ltcommon-fieldsgt ltwBitsPerSampleWORDgt

ltcommon-fieldsgt -gt ltwFormatTagWORD value=0x0001gt

ltwChannelsWORDgt

ltdwSamplesPerSecDWORDgt

ltdwAvgBytesPerSecDWORDgt

ltwBlockAlignWORDgt

ltwave-datagt -gt ltdata-ckgt | ltwave-listgt

ltdata-ckgt -gt data cksize ltwave-datagt

ltwave-listgt -gt LIST cksize wavl ltdata-ckgt | ltsilence-ckgt

ltsilence-ckgt -gt slnt cksize ltdwSamplesDWORDgt

ltfact-ckgt -gt factxksize ltdwFileSizeDWORDgt

ltcue-ckgt -gt cue cksize ltdwCuePointsDWORDgt ltcue-pointgt+

ltcue-pointgt -gt ltdwNameDWORDgt

ltdwPositionDWORDgt

ltfccChunkFOURCCgt

ltdwChunkStartDWORDgt

ltdwBlockStartDWORDgt

ltdwSampleOffsetDWORDgt

ltplaylist-ckgt -gt bdquoplst‟ cksize ltdwSegmentsDWORDgt ltplay-segmentgt+

ltplay-segmentgt -gt ltdwNameDWORDgt

ltdwLengthDWORDgt

ltdwLoopsDWORD

ltassoc-data-listgt-gt bdquoLIST‟ cksize adtl ltlabl-ckgtltnote-ckgtltltxt-ckgtltfile-ckgt

ltlabl-ckgt -gt bdquolabl‟ cksize ltdwNameDWORDgt ltdataZSTRgt

ltnote-ckgt -gt bdquonote‟ cksize ltdwNameDWORDgt ltdataZSTRgt

10

ltltxt-ckgt -gt bdquoltxt‟ cksize ltdwNameDWORDgt

ltdwSampleLengthDWORDgt

ltdwPurposeDWORDgt

ltwCountryWORDgt

ltwLanguageWORDgt

ltwDialectWORDgt

ltwCodePageWORDgt

ltdataBYTEgt+

ltfile-ckgt -gt bdquofile‟ cksize ltdwNameDWORDgt ltdwMedTypeDWORDgt

ltfileDataBYTEgt+

Figure 5 Binary File Grammar for RIFF WAVEFORM PCM File Format

5 Recursive Descent Parsers for Binary File Grammars

A recursive descent parser is a top-down parser built from a set of recursive (or non recursive)

procedures where each procedure implements one of the production rules of the grammar A

predictive parser is a recursive descent parser that does not require backtracking Predictive

parsing is possible only for the class of LL(k) grammars which are the context-free grammars

for which there exists some positive integer k that allows a recursive descent parser to decide

which production to use by examining only the next k tokens of input A recursive descent parser

for the binary file format grammars defined in this paper does not require a separate lexer to

identify the data of a file The data types are predicted by the grammar rules

The semantic rules of an L-attributed grammar can be included in the recursive descent parser to

check for context-sensitive aspects of the grammar such as chunk size or image dimensions and

use these values in parsing the chunk data or images The pointers to file addresses that occur in

the file formats are handled by pushing the current address in the file onto a pushdown stack

When the parser exhausts the procedures implementing production rules corresponding to the

pointed to location the topmost address on the pushdown stack is popped and the procedure

corresponding to the nonterminal at that position is executed

Recursive descent parsers with a pushdown stack can be used with binary file format grammars

to create file format recognizers A recognizer (syntax checker or validator) is a parser that reads

a file and generates an error if the file does not conform to the syntax specified by the grammar

11

6 A Parser Generator for Recognizing Binary File Formats

What we want is a parser generator whose input is a binary file attribute grammar for a particular

binary file format and whose generated output is the Java source code of a recursive descent

parser (with a pushdown stack if needed) for the class of binary file formats specified by the

grammar Such a parser generator does not yet exist but there is a widely used parser generator

for attribute grammars for string-based languages

61 ANTLR

ANTLR (ANother Tool for Language Recognition) is a parser generator that uses LL() parsing

[Parr amp Quong 1995 Parr 2007] ANTLR takes as input an attribute grammar that specifies a

language and generates as output source code for a recognizer for that language ANTLR

supports generating code in a number of the programming languages including C Java

JavaScript and Python

ANTLR also generates a lexer from lexical rules in the grammar A lexer is a program that

converts a sequence of characters into tokens A lexer is not needed to parse binary file formats

because the data types predicted by the parser are the tokens of the grammar Furthermore the

capability to recognize LL() grammars is not needed because the binary file format grammars

specified so far do not require lookahead of k symbols

62 Using ANTLR to Generate a Parser for a Binary File Format

It is possible to use ANTLR to test our binary file format grammars in recognizing the chunk-

based and directory-based binary file formats This is accomplished by writing a lexer that treats

each input byte in a file as a character token Then functions are created for each data type in the

binary file grammar that convert the appropriate number of character tokens to binary data types

eg int 16 int32 etc This is effective if someone clumsy and has allowed us to test our

attribute grammars for binary file formats as well as to demonstrate the feasibility of creating a

parser generator for binary file grammars

621 An ANTLR Grammar for an ILBM Chunk-based File Format

Figure 6 shows a specification for the ILBM file format that is input to ANTLR The output of

ANTLR is a lexer and parser for ILBM files

12

grammar iblm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in lexerheader is placed at the top of the lexer file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Grammar Rules used to create the Parser

Start symbol ilbm

iblm returns [boolean result]

FORM ckSize ILBM

Print out ILBM chunksize

Systemoutprintln(ID ILBM Size + $ckSizevalue)

Semantic predicate that ensures that at least on BMHD is found

propertyChunks $propertyChunksfoundBMHD == true

dataChunks

body

$result = true

The bmhd cmap and camg property chunks can occur in any order

At least one BMHD data chunk must be present

propertyChunks returns [boolean foundBMHD]

(bmhd $foundBMHD = true

| cmap

| camg)+

bmhd

BMHD ckSize

Systemoutprintln(ID BMHD Size + $ckSizevalue)

13

bitMapHeader

bitMapHeader

width = uWordValue

height = uWordValue

xPosition = wordValue

yPosition = wordValue

nPanes = uByteValue

masking = uByteValue

compression = uByteValue

reserved1 = uByteValue

transparentColor = uWordValue

xAspect = uByteValue

yAspect = uByteValue

pageWidth = wordValue

pageHeight = wordValue

Print out values in the bitmap structure

Systemoutprintln( Width + $widthvalue)

Systemoutprintln( Height + $heightvalue)

Systemoutprintln( X Position + $xPositionvalue)

Systemoutprintln( Y Position + $yPositionvalue)

Systemoutprintln( Number of Panes + $nPanesvalue)

Systemoutprintln( Masking + $maskingvalue)

Systemoutprintln( Compression + $compressionvalue)

Systemoutprintln( Reserved1 + $reserved1value)

Systemoutprintln( TransParent Color +

$transparentColorvalue)

Systemoutprintln( X Aspest + $xAspectvalue)

Systemoutprintln( Y Aspect + $yAspectvalue)

Systemoutprintln( Page Width + $pageWidthvalue)

Systemoutprintln( Page Height + $pageHeightvalue)

cmap returns [long value]

Initialize the index value i in the cmap function

init int i = 0

CMAP ckSize

Print out chunk size

$value = $ckSizevalue

Systemoutprintln(ID CMAP Size + $ckSizevalue)

(colorRegister

Print out color values

Systemoutprintln( Color[ + i++ + ] +

$colorRegisterredValue +

+ $colorRegistergreenValue +

+ $colorRegisterblueValue +

)

)+

colorRegister returns [int redValue int greenValue int blueValue]

14

red = uByteValue green = uByteValue blue = uByteValue

$redValue = $redvalue

$greenValue = $greenvalue

$blueValue = $bluevalue

camg

CAMG ckSize longValue

Systemoutprintln(ID CAMG Size + $ckSizevalue)

Systemoutprintln( View Modes 0x +

LongtoHexString($longValuevalue))

The crng and ccrt data chunks

dataChunks

crng

| ccrt

crng

CRNG ckSize

Systemoutprintln(ID CRNG Size + $ckSizevalue)

cRange

cRange

pad1 = wordValue $pad1value == 0

rate = wordValue

active = wordValue

low = uByteValue

high = uByteValue

ccrt

CCRT ckSize

Systemoutprintln(ID CCRT Size + $ckSizevalue)

cycleInfo

cycleInfo

direction = wordValue

start = uByteValue

end = uByteValue

seconds = longValue

microseconds = longValue

pad = wordValue $padvalue == 0

body

BODY ckSize

Systemoutprintln(ID BODY Size + $ckSizevalue)

$ckSizevalue gt 0 data

Systemoutprintln( Number of bytes + $datavalue)

$datavalue == $ckSizevalue

defines the data chunk of the body chunk

data returns [long value]

(b+=BYTE)+ $bsize() gt 0

$value = $bsize()

15

The cksize is the value of a long integer

ckSize returns [long value]

longValue

$value = $longValuevalue

defines the longValue data typein terms of BYTE terminals

longValue returns [long value] throws UnsupportedEncodingException

h1 = BYTE h2 = BYTE l1= BYTE l2 = BYTE

$value = (long)

( getUByteValue($h1) ltlt 24 |

getUByteValue($h2) ltlt 16 |

getUByteValue($l1) ltlt 8 |

getUByteValue($l2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uWordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (short)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uByteValue returns [int value]

BYTE

calls the getUByteValue from a BYTE token

16

$value = (int) getUByteValue($BYTE)

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rules used to create the Lexer

FORM FORM

ILBM ILBM

BMHD BMHD

CMAP CMAP

CAMG CAMG

CRNG CRNG

CCRT CCRT

BODY BODY

BYTE

Figure 6 An ANTLR Grammar for the ILBM File Format

622 The Lexer and Parser

When provided the file described in the previous section ANTLR generates the ilbmlexer and

ilbmparser Java programs The Java application shown in Figure 6 uses these programs to

validate the ILBM file BLKIFF It produces the output shown in Figure 7

package orggtrigrammarschunk

import javaioIOException

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRStringStream

import organtlrruntimeCharStream

import organtlrruntimeCommonTokenStream

import organtlrruntimeRecognitionException

17

import organtlrruntimeTokenStream

import gtriorggrammarschunkiblmLexer

import gtriorggrammarschunkiblmParser

public class Test

static String filePath = CUserssl170PicturesIFF-ilbmBLKIFF

public static void main(String[] args) throws RecognitionException IOException

ANTLRFileStream stream = new ANTLRFileStream(filePath ISO-8859-1)

iblmLexer lexer = new iblmLexer(stream)

TokenStream tokenStream = new CommonTokenStream(lexer)

iblmParser parser = new iblmParser(tokenStream)

Systemoutprintln(FILE + filePath)

Systemoutprintln(Successfully parsed = + parseriblm())

Figure 7 A Java Application for Parsing an ILBM File

FILE CUserssl170PicturesIFF-ilbmBLKIFF

ID ILBM Size 2556

ID BMHD Size 20

Width 200

Height 144

X Position 0

Y Position 0

Number of Panes 6

Masking0

Compression1

Reserved10

TransParent Color0

X Aspest3

Y Aspect3

Page Width200

Page Height144

ID CMAP Size 48

Color[0] 0 0 0

Color[1] 0 128 128

Color[2] 0 255 255

Color[3] 255 0 255

Color[4] 0 255 0

Color[5] 255 0 0

Color[6] 0 0 255

Color[7] 255 255 0

Color[8] 128 128 128

Color[9] 192 192 192

Color[10] 128 0 0

Color[11] 0 128 0

Color[12] 128 128 0

Color[13] 0 0 128

Color[14] 128 0 128

18

Color[15] 255 255 255

ID CAMG Size 4

View Modes 0x800

ID BODY Size 2448

Number of bytes 2448

Successfully parsed = true

Figure 8 Output of the ILBM Parser Applied to the file BLKIFF

63 An ANTLR Grammar for the RIFF Waveform PCM File Format

Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR

The output of ANTLR is a lexer and parser for WAVEFOR PCM files

grammar waveform_pcm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text contained in lexerheader is placed at the top of the lexer

file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Default ByteOrder to Little Endian

String byteOrder = II

Rules used to create the Parser

start symbol waveform_pcm

waveform_pcm returns [boolean result]

19

ckId = fourCC $ckIdvalueequals(RIFF)

ckSize = dWordValue

formType = fourCC $formTypevalueequals(WAVE)

formatSubChunk

dataSubChunk

$result = true

fourCC returns [String value]

BYTE BYTE BYTE BYTE $value = $text

formatSubChunk

subChunkId = fourCC $subChunkIdvalueequals(fmt )

ckSize = dWordValue

commonFields

pcmSpecificField

commonFields

formatTag = wordValue Format category

channels = wordValue Number of channels

samplesPerSec = dWordValue Sampling rate

avgBytesPerSec = dWordValue For buffer estimation

blockAlign = wordValue Data block size

pcmSpecificField

bitsPerSample = wordValue Sample size

dataSubChunk

subChunkId = fourCC $subChunkIdvalueequals(data)

ckSize = dWordValue

waveData[$ckSizevalue]

waveData [long size]

(size gt= 0=gt BYTE size--)+

defines the long data type in terms of BYTE terminals

dWordValue returns [long value] throws UnsupportedEncodingException

b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE

if (byteOrderequals(II))

$value = (long)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 |

getUByteValue($b3) ltlt 16 |

getUByteValue($b4) ltlt 24 )

else

$value = (long)

( getUByteValue($b1) ltlt 24 |

getUByteValue($b2) ltlt 16 |

20

getUByteValue($b3) ltlt 8 |

getUByteValue($b4) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

if (byteOrderequals(II))

$value = (int)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 )

else

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rule used to create the Lexer

BYTE

Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format

7 Related Research

Researchers have developed a number of data description languages to support validation of file

formats EAST (Enhanced ADA Subset) is a data description language developed by the

Consultative Committee for Space Data Systems [2010] The Data Entity Dictionary

Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic

information The EAST description is used to interpret and provide access to information in

binary and text files

DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to

manipulate Java jar files and ELF object files The PADSC data description language can be

used to define the file formats of text and binary files [Fisher amp Gruber 2005] The PADSC

compiler compiles the description into tools that can be used to recognize manipulate and

transform the data into other formats

21

The Data Format Description Language (DFDL) is being developed by a Working Group of the

Open Grid Forum [OGF 2011] The data description language is based on a subset of W3C XML

Schema using ltxsappinfogt annotations to carry the extra information necessary to describe non-

XML physical representations Version 1 of the language specification was published in 2011

and a parser is being implemented

Each of the data description languages described above can be used to define data types and file

structures of binary files However they are based on file structure declarations such as those of

C or Java The binary file format grammar described in this paper most closely resembles DFDL

However the binary file grammar described in this paper is the only data description language

based on formal grammars that is used for creating recognizers for file formats

JHOVE (JSTORHarvard Object Validation Environment) is an extensible system designed to

provide automated and efficient identification and validation of the formats of digital files

JHOVE is a format-specific digital object validation API written in Java JHOVE supports

validation of the following formats AIFF ASCII GIF HTML JPEG JPEG 2000 PDF TIFF

UTF-8 WAVE and XML JHOVE2 is second generation validation environment [California

digital Library 2011]

The research reported in this paper is similar in intent to that of the JHOVE projectsmdashvalidation

of binary file formats As a matter of fact binary file grammars and parsers have been

constructed for the chunk-based file formats AIFF JPEG and WAVE as well as the directory-

based format TIFF However the research reported herein differs from that of the JHOVE

project in that what is sought is a technology for generating validators for binary file formats

from grammars specifying the binary file format

8 Conclusion

The research question addressed by his research is whether it is possible to extend the context-

free grammars used to specify the syntax of programming languages to the specification of

binary file formats and to use these grammars with parsers for validating the file formats of

binary files Amajor family of binary file formats chunk-based was described Then extensions

to the concepts of context-free grammars and attribute grammars were described that enable the

specification of binary file formats Examples of attribute array grammars for two chunk-based

binary file format were then presented Recursive descent parsers for this classes of grammars

was then described Experience in using ANTLR a parser generator for LL() string grammars

in generating parsers for recognizing the formats of binary files was discussed Finally related

research in data description languages for binary file formats was described and the JHOVE

project in creating Java-based tools for validating file formats was described

22

It is concluded that it is possible to extend context-free grammars to the specification of chunk-

based binary file formats Furthermore these grammars can be used with recursive descent

parsers for validating the file formats of chunk-based binary files It remains to be determined

whether these attribute grammars based on binary file (array) grammars are adequate to define

other families of binary file formats

To be a practical technology for generating recognizers (validators) for binary file formats a

parser generator is needed that creates from a binary file grammar a recursive descent parser

(with pushdown stack if needed) ANTLR which accepts LL() grammars as input and

generates a lexer as well as a parser is too heavyweight a parser generator for this task and does

not have the requisite data types built in

A capability to validate binary file formats assumes that the file format of a file is already

known In this research we use a file type identifier based on the UNIX file command and a

magic file that we have created [Underwood 2009] We have samples of the chunk-based file

formats discussed in this report including those referenced in Appendix a We also have file

signature tests (magic tests) for all these file types

We would like to have a parser generator (also called a compilerndashcompiler) that would take as

input an attribute grammar for a binary file format and generate a parser or interpreter that

recognizes the file format or that interprets the file format and displays or plays the contents The

closest parser generator that we have found to having these capabilities is ANTLR

(wwwantlrorg ) ANTLR will generate a top-down recursive descent parser from a context-free

attribute grammar However it works with strings and generates a separate scanner

Appendix A lists the names of about 75 file formats that belong to the family of ―chunk-based

file formats We have specifications for each of file formats and samples of files with these

formats

These file formats have a similar structure We have developed binary file grammars for two of

these binary file formats We should be able to do so for all of them We used these grammars

with the extensions to ANTLR to generate top-down recursive descent parsers for the file

formats defined with these grammars

The significance of success in this research task is that if binary file formats can be specified

with binary file grammars then only one parsing generator is needed to generate the parsers for

verifying that a file conforms to a particular format Similarly the same parsing generator can be

used for conversion of legacy file formats to current or standard formats Finally the same parser

generator can be used for generating viewerplayers for most file formats This would increase

23

the likelihood of preserving and making available into the indefinite future hose digital records

encoded in binary file formats

24

References

[Aho 1968] A V Aho Indexed Grammars ndash An Extension of Context-free Grammars Journal

of the ACM Vol 15 Issue 4 Oct 1968

[Alias-wavefront 1999] Maya File Formats Maya 20 1999

httpcaadarchethzchinfomayamanual

[Amigan 2011] Amigan Software IFF FORM and Chunk Registry

httpamigan1emunetregiffhtml

[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description

May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description

[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats

Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21

[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files

Version 13 Apple Computer Inc January 1989

wwwaesnashvilleorgPDFsAIFF20Specspdf

[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-

webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt

[Apple 2005] Apple Core Audio Format Specification 2005b

httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_

specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101

[Apple 2007] QuickTime File Format Specification

httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref

acehtml

[ASF 2001] ASF format version 10 reconstruction January 2001

httpavifilesourceforgenetasf-10htm

[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January

1997 wwwdcsedacukhomemxrgfx3d3DSspec

[Autodesk 2009] Autodesk DXF Reference January 2009

httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112

25

[Back 2002] G Back (2002) A specification and scripting language for binary data Generative

Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-

77

[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V

22 1979 pp 243-256

[Baum amp Noel 2002]D Baum amp G Noel The Simms Interchange File Format 2002

httpsimtechsourceforgenettechiffhtml

[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New

Horizons Software Inc 1987 httpamigan1emunetregWORDtxt

[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995

ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt

[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010

httpondasourceforgenetnlfnlfFormathtml

[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm amp Onda file

formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml

[California Digital Library 2011] California Digital Library JHOVE2 Userlsquos Guide 2011

httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf

[Caligari 1998] Caligri Caligari trueSpace file format Document Version 6 July

2002 httpwwwcaligaricomdownloadzipscncob65zip

[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information

Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements

and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-

t81pdf

[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description

Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf

[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998

[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt

[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D

Taliesin Inc httpamigan1emunetregDR2Dtxt

26

[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG

decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt

[Dunckley et al 2007] Dunckley M Rankin S Conway E amp Giaretta D (2007) The use of

file description languages for file format identification and validation Proc PV 2007

Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical

Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-

detailsw=50089

[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format

Tech 3285 July 2001 httptechebuchdocstechtech3285pdf

Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml

[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd

May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html

[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495

wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml

[Fisher amp Gruber 2005] Fisher K amp Gruber R (2005) PADS A domain specific language for

processing ad hoc data ACM Conference on Programming language Design and

Implementation ACM Press httpwwwpadsprojorgpaperspldipdf

[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG

1995

[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal

Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September

1997 wwwgnelsondemoncoukzspecquetzalhtml

[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called

Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)

httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt

[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03

httpwwwanim8orcomresourcesan8_formattxt

[Google 2010] WebP RIFF Container 2010

httpcodegooglecomspeedwebpdocsriff_containerhtml

[Haaland] Mike Haaland Flic Files (FLI) Format description

httpwwwmartinreddynetgfxanimFLItxt

27

[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube

Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf

[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice

Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt

[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata

Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1

1995

[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)

August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21

[IBM amp Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data

Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt

wwwtactilemediacominfoMCI_Control_Infohtml

httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf

[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec

11 2008 httpduskblueorgprojtoymidimidiformatpdf

[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10

ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf

[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable

Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C

Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110

[Issar amp Ward 1993] S Issar and W Ward CMUlsquos robust spoken language understanding

system EUOSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf

[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software

November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application

Version 50) wwwjasccomsupportkbarticlespspspecasp

[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image

File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version

22 April 2002

httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf

[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994

httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt

28

[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory

vol 2 pp 127_145 1968

[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical

Systems Theory vol 5 pp 95_96 1971

[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)

North-Holland Publ Co Amsterdam 1971 pp 95-109

[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language

Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml

httpwebcswpiedu~kalcoursescompilersmodule4mysahtml

[LightWave 1996] LightWave 3D Object File Format Oct 16 1996

httpsandboxdeosglightwavehtm

[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWo2] November 9 2001

httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml

[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer

1997 httppaulbourkenetdataformatscinema4dcinema4dpdf

[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and

Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf

[Maishein] Max Maischein PIC(PRO) Format

http14791177212extrafileformatgraphicspicpicprotxt

[Martin 2007] Michael Martin SVAF format description Version 02 January 2007

httpsvafsourceforgenetsvafhtml

[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and

Data Techniques July 29 1992 Revision 1097

httpnetghostnarodrugffvendspecmicriffms_rifftxt

[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New

Multimedia Data Types and Data Techniques Revision 30 April 15 1994

[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE

httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-

usstreamhhstreamaudprop_7wz7asp

[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003

December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx

29

[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-

uslibraryms899421aspx

[Morrison 1985] Jerry Morrison ―EA IFF 85rdquo Standard for Interchange Format Files

Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt

[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17

1986 wwwfine-viewcomjplabsdocilbmtxt

[Morrison 1986b] J Morrison ―SMUS IFF Simple Musical Score Electronic Arts Inc

February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus

[MultimediaWiki 2006] Smush Animation Format August 2006

httpwikimultimediacxindexphptitle=SANChunk_Forma

[MultimeduiaWiki 2008a] Electronic Arts CMV April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_CMV

[MultimediaWiki 2008b] Electronic Arts DCT April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_DCT

[MultimediaWiki 2008c] Electronic Arts MAD April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MAD

[MultimediaWiki 2008d] Electronic Arts MPC April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MPC

[MultimediaWiki 2008e] Electronic Arts VP6 April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_VP6

[MultimediaWiki 2008f] Electronic Arts TGV August 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGV

[MultimediaWiki 2008g] Electronic Arts TGQ November 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ

[MultimediaWiki 2008h] American Laser Games MM January 2008

httpwikimultimediacxindexphptitle=American_Laser_Games_MM

[MultimediaWiki 2009a] Electronic Arts TQI February 2009

httpwikimultimediacxindexphptitle=Electronic_Arts_TQI

30

[MultimediaWiki 2009b] SANM October 2009

httpwikimultimediacxindexphptitle=SANM

[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240

httpcodegooglecompll-parser-

generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42

[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter

written in Python in the open source graphics file convertor UniConvertor

httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer

[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification

January 31 2011 wwwogforgdocumentsGFD174pdf

[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988

wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt

[Parr amp Quong 1995] Parr T J amp Quong R W (1995) ANTLR A predicated- ll(k) parser

generator Software Practice and Experience 25 7 (July) 789-810

httpwwwantlrorgarticle1055550346383antlrpdf

[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific

Languages Pragmatic

[Peters] Christoph Peters The Ultimate 3D model file format (v 20)

httpultimate3dorgU3DModelFileFormatV2htm

[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003

httpxiphorgoggdocrfc3533txt

[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20

wwweblongcomzarfblorb

httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt

[ Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011

wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml

[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)

Format Version 10 2001 wwwlibpngorgpubmngspecCredits

31

[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File

Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008

httpscopenofficeorgexcelfileformatpdf

[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF

[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments

3072 March 2001 wwwietforgrfcrfc3072txt

[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft

26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt

[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and

old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21

[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988

httpamigan1emunetregANIMtxt

[Tamir 2005] Giora Tamir C RIFF Parser June 2005

wwwcodeprojectcomKBfilesriffparseraspx

[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications

In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi

Khosrow-Pour Information Science Publishing pp 268-273 2008

wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf

[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for

File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute

September 2009

[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical

Engineering The University of Queensland 1996

httpiteeuqeduau~cristinastudentsdaviddavidhtml

[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language

understanding system ARPA Human Language Technologies Workshop Plainsboro New

Jersey httpaclldcupenneduHH94H94-1039pdf

[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars

Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)

Springer-Verlag Berlin 1974 pp 9-16

32

[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley

1995 (chapter 10) (Attribute Grammars)

[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010

httpxiphorgvorbisdocVorbis_I_spechtml

[Zappe 1994] H Zappe Oktalyzer 1994

httpwwwwotsitorglistaspsearch=oktampbutton=GO21

33

Appendix A Chunk-based Binary File Formats

This appendix references specifications for about 75 chunk-based binary file formats and

discusses some of those specifications

A1 Electronic Arts Interchange File Format (IFF)

[Amigan 2011]

Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the

specification for the Interchange File Format [Morrison 1985]

A11 Interleaved Bitmap (ILBM)

[Morrison 1986]

This file format was discussed in section 41 of this report

A12 8-bit Sampled Voice (8SVX)

The file format specification for the 9-bit sampled voice file format uses C programming

language data structures to define the format [Hayes and Morrison 1985] The following is a re-

expression of that specification as a context-free binary file grammar

Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata

cksize rarr ULONG

ckdata rarr voice8header bodyck

ckdata rarr voice8header nameck copyrightck bodyck

ckdata rarr voice8header copyrightck nameck bodyck

ckdata rarr voice8header nameck bodyck

ckdata rarr voice8header copyrightck bodyck

ckdata rarr voice8header authorck bodyck

ckdata rarr voice8header annotation bodyck

ckdata rarr voice8header atakck releaseck bodyck

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 2: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

ii

ABSTRACT

The capability to validate and view or play binary file formats as well as to convert binary file

formats to standard or current file formats is critically important to the preservation of digital

data and records

Context-free grammars are used for defining programming languages There are established

methods for using these grammars to check the syntax of programs and to compile programs into

assembly or machine language programs Grammars have also been used to define page

description languages and vector graphics languages However current technologies for context-

free grammars syntax checkers and compilers are limited to string languages or string data

types

This report describes the extension of context-free grammars from strings to binary files Binary

files are arrays of data types such as long and short integers floating-point numbers and pointers

as well as characters The concept of an attribute grammar is extended to these context-free array

grammars Attribute grammars have been used to define a number of chunk-based file formats

A parser generator has been used with some of these grammars to generate syntax checkers

(recognizers) for validating binary file formats The significance of these results is that with

these extensions to core computer science concepts traditional parsercompiler technologies can

be used as a part of a cost effective preservation strategy for binary file formats

iii

TABLE OF CONTENTS

1 INTRODUCTION 1

11 BACKGROUND 1

12 PURPOSE 1

13 SCOPE 1

2 BINARY FILE FORMATS 1

3 CONTEXT-FREE GRAMMARS AND ATTRIBUTE GRAMMARS 3

31 CONTEXT-FREE GRAMMARS FOR ARRAYS OF DATA TYPES 3

32 ATTRIBUTE GRAMMARS 4

33 SEMANTIC GRAMMAR 5

4 GRAMMARS FOR CHUNK-BASED BINARY FILE FORMATS 6

41 INTERLEAVED BITMAP (ILBM) 6

42 RIFF WAVEFORM PCM AUDIO FILE FORMAT (WAVE) 9

5 RECURSIVE DESCENT PARSERS FOR BINARY FILE GRAMMARS 10

6 A PARSER GENERATOR FOR RECOGNIZING BINARY FILE FORMATS 11

61 ANTLR 11

62 USING ANTLR TO GENERATE A PARSER FOR A BINARY FILE FORMAT 11

621 An ANTLR Grammar for an ILBM Chunk-based File Format 11

622 The Lexer and Parser 16

63 AN ANTLR GRAMMAR FOR THE RIFF WAVEFORM PCM FILE FORMAT 18

7 RELATED RESEARCH 20

8 CONCLUSION 21

REFERENCES 24

APPENDIX A CHUNK-BASED BINARY FILE FORMATS 33

A1 ELECTRONIC ARTS INTERCHANGE FILE FORMAT (IFF) 33

A11 Interleaved Bitmap (ILBM) 33

A12 8-bit Sampled Voice (8SVX) 33

A13 Cel Animation (ANIM) 34

A14 IFF Formatted Text (FTXT) 35

A15 IFF Simple Musical Score (SMUS) 35

A16 Planar Bitmap (PBM) 35

A2 STANDARD MIDI FILE FORMAT 35

A3 AUDIO INTERCHANGE FILE FORMAT 35

A4 RESOURCE INTERCHANGE FILE FORMATS 37

A41 RIFF-Waveform (WAVE) 37

A42 WAVEFORMATEX 37

A43 WAVEFORMATEXTENSIBLE 37

A44 Broadcast Wave Format 37

iv

A45 Windows Audio-Video Interleave (AVI) 37

A46 Animated Windows Cursor (ANI) 37

A47 RIFF MIDIfile (RMID) 37

A48 Device-Independent Bitmap (RIFF RDIB) 37

A49 WebP 37

A5 JPEG 38

A51 JPEGJFIF 38

A52 JPEGExif 38

A53 JPEG 2000 39

A54 JPEG 2000 Codestream 39

A6 ADVANCED SYSTEMS FORMAT 39

A61 Windows Media Audio 7-9 39

A62 Windows Media Video 7-91 39

A7 PORTABLE NETWORK GRAPHICS 39

A71 Multiple-image Network Graphics 40

A72 JPEG Network Graphics 40

A8 BINARY INTERCHANGE FILE FORMAT 40

A9 3D STUDIO 40

A91 3D Studio File Format ndash 3ds 40

A92 3D Studio Material Library Files 40

A10 ANIM8OR 40

A11 ANIMATOR 40

A111 Animator Cel 40

A112 Animator Pro Fli 40

A113 Animator Pro fLC 40

A111 Animator Cel 40

A114 Animator Pro PIC 40

A12 AUDIO MANAGER MODULE 41

A13 AUTODESK DRAWING INTERCHANGE FILES (DXF) (BINARY) 41

A14 CALIGARI TRUESPACE SCENEOBJECT (BINARY) 41

A15 CORELDRAW VECTOR GRAPHICS FILE 41

A16 COREL METAFILE EXCHANGE IMAGE 41

A17 MUSIC MODULES 41

A171 DigiTrekker Module Format 41

A172 Graoumf Tracker modules 41

A173 Oktalyzer 41

A18 EA MULTIMEDIA GAME FILE FORMATS 41

A181 Electronic Arts Music (asf) 41

A182 Electronic Arts CMV 41

A183 Electronic Arts DCT 41

A184 Electronic Arts MAD 42

A185 Electronic Arts MPC 42

A186 Electronic Arts VP7 42

A187 Electronic Arts TGV 42

A188 Electronic Arts TGQ 42

v

A189 Electronic Arts TQI 42

A19 IMAGINE 3D DATA DESCRIPTION (TDDD) 42

A20 LIGHTWAVE 3D OBJECTS 42

A201 LWOB 42

A202 LWO2 42

A203 LWLO 42

A21 LUXOLOGY MODO 3D OBJECT (LXOB) 42

A22 PAINT SHOP PRO (PSP) 42

A23 PROVECTOR 2D DRAWING 43

A24 PROWRITE DOCUMENT 43

A25 INTERACTIVEFICTION 43

A251 Blorb Z-machine Resource File 43

A252 Quetzal 43

A26 APPLE QUICKTIME 43

A27 MAYA IFF BITMAP 43

A28 APPLE CORE AUDIO FORMAT 43

A29 AMERICAN LASER GAMES (MM) 43

A30 SMUSH ANIMATION FORMAT 43

A301 SAN 43

A302 SNM 43

A31 STRUCTURED DATA EXCHANGE FORMAT (SDXF) 43

A32 OGG VORBIS 44

A33 SIMPLE VECTOR ARRAY FORMAT 44

A34 NESTED LIST FILE 44

A35 CINEMA 4D 44

A36 THE SIMS INTERCHANGE FILE FORMAT 44

vi

List of Figures

Figure 1 File format (Layout) of a RIFF container for a Webp Image 2

Figure 2 ILBM File hampic Displayed 7

Figure 3 An Attribute Grammar for an ILBM Chunk-based File Format 7

Figure 4 A Parse Tree for the ILBM File hampic 8

Figure 5 Binary File Grammar for RIFF WAVEFORM PCM File Format 10

Figure 6 An ANTLR Grammar for the ILBM File Format 16

Figure 7 A Java Application for Parsing an ILBM File 17

Figure 8 Output of the ILBM Parser Applied to the file BLKIFF 18

Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format 20

1

1 Introduction

11 Background

Automated tools are required for identifying and validating the formats of the huge number of

files ingested into digital data and record archives Automated tools are also needed for

viewingplaying text and binary file formats and for converting legacy and obsolete file formats

to standard or current formats Technologies such as data description languages have emerged to

address these preservation challenges [Dunckley et al 2007]

Context-free grammars have been used to specify the syntax of programming languages

Attribute grammars provide a framework for formally specifying the semantics of a language

based on its context-free grammar and for addressing the mildly context-sensitive features of

programming languages such as agreement of the data types of variables in expressions with data

type declarations Such grammars have parsingtranslation algorithms that can be used for

syntax-checking and interpretation or translation of the languages the grammars define

12 Purpose

The research question addressed by this research is whether it is possible to extend the context-

free grammars used to specify the syntax of programming languages to the specification of

binary file formats and to use these grammars with parsers for validating the file formats of

binary files

13 Scope

The next section discusses the traditional approach to specifying file formats and two of the

major families of binary file formats In section 3 extensions to the concepts of context-free

grammars and attribute grammars are described that enable the specification of binary file

formats In section 4 examples of attribute array grammars for a chunk-based binary file formats

are presented In section 5 recursive descent parsers for these classes of grammars are described

In section 6 experience in using ANTLR a parser generator for LL() string grammars in

generating parsers for recognizing the formats of binary files is discussed In section 7 related

research is described Section 8 summarizes the results

2 Binary File Formats

Traditionally a file (or record) layout was a form showing how fields were positioned in a file

(or record) The fields are named and have as attributes a data type length and sometimes a

constant value Fields are offset at addresses relative to the beginning of a file File formats are

still specified in this manner For instance Figure 1 shows the format layout of a RIFF container

for the VP8 codec of a Webp image [Google 2010]

2

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk starting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

Figure 1 File format (Layout) of a RIFF container for a Webp Image

Many binary file formats are specified using pseudo-regular expressions or pseudo-EBNF

notation EBNF (Extended Backus-Naur Form) is a notation for expressing context-free

grammars in a compact human readable way Pseudo-EBNF is similar in concept to pseudocode

which is a high-level description of a computer algorithm that is intended for human

understanding but that omits details that would be necessary for computer execution The

pseudo-EBNF (or regular expression) specification of a file format is augmented with a natural

language description of the details that are not actually expressible in the EBNF (or regular

expression) notation These details are often the context-sensitive relationships of the size of an

array to its actual length or the relationship of an address pointer to the actual location of the

data pointed to in the file These context-sensitive relationships are not expressible in a context-

free grammar (or EBNF notation) It is a goal of this research to extend the EBNF notation and

concept of a context-free grammar to include the details that are necessary to precisely specify

the file formats of binary file formats that include these context-sensitive features

Binary file formats with a similar file structure are referred to as a family of file formats There

are two families of binary file formats that are readily distinguished the chunk-based and the

directory-based file formats

Chunk-based file formats were created by Electronic Arts and Commodore-Amiga as the

Interchange File Format (IFF) [Morrison 1985] An IFF file itself is one entire IFF chunk A

chunk consists of an ID tag a size and size bytes of data The data may include chunks called

subchunks Subchunks have the same structure as chunks Subchunks can have subchunks

Chunk-based file formats are primarily used as containers for multimedia but have also been

used for word processing files including pictures and other figures File format specifications for

chunk-based formats may use other terms than chunks in describing the format for instance

atoms in QuickTimeMP4 segments in JPEG and ―tagged data representations as in

AutoCAD DXF

3

The Audio Interchange File Format (AIFF) developed by Apple computer in 1988 was based on

Electronic Arts Interchange File Format Apple Core Audio Format (CAF) Applelsquos replacement

format for AIFF remains a chunk-based format [Apple Computer 2005] The Resource

Interchange File Format (RIFF) introduced by IBM and Microsoft [1991] is also based on

Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-Video

Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis WebP a picture format

recently introduced by Google[2010] also uses RIFF as a container

Microsoftlsquos Advanced Systems Format (ASF) [2004] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV) Microsoftlsquos Binary Interchange File Format [Rentz 2008]

(Microsoft Excellsquos File Format) is chunk-based

Another family of binary file formats is directory-based A directory-based file format consists of

one or more directory tables which contain one or more directory entries A directory entry

specifies where the actual data for a type of information is located This is the scheme used in

TIFF files OLE (Microsoft Object Linking and Embedding) files OASIS OpenDocument and

Microsoft Open Office files

3 Context-Free Grammars and Attribute Grammars

Context-free grammars have been widely used to define programming languages A context-free

grammar G is a quadruple ltN Σ S Pgt where

N is a finite set of non-terminal symbols

Σ is a finite set of terminal symbols

S ε N is the start symbol

P is a set of production rules of the form N rarr N Σ

The set of all strings over a vocabulary Σ is denoted by Σ The language generated by a context-

free grammar G is the set L(G) = w w Σ and S w

31 Context-Free Grammars for Arrays of Data Types

An array is a data structure consisting of a collection of elements (values or variables) each

identified by an index An array is stored so that the position of each element can be computed

from its index For example an array of 10 integer variables with indices 0 through 9 may be

4

stored as 10 (4 byte) words at file addresses 0 4 8 hellip36 so that the element with index i has the

address 4 times i

Arrays are used to implement many other data structures such as lists and strings In most

modern computers the internal memory and external memory of storage devices (and files on

those devices) is a one-dimensional array of data types whose indices are their addresses

A data type is a classification of one of various types of data such as floating-point integer

character pointer or Boolean that determines the possible values for that type the operations

that can be performed on values of that type and the way values of that type can be stored Data

types have names such as int16 int32 float char bool ptr We define a binary file to be an array

of values of data of various types

We define a context-free binary file grammar BG as a quintuple ltN D Σ S Pgt where

N is a finite set of non-terminal symbols

D is a set of data types

Σ is a finite set of binary values of data types D called terminals

S ε N is the start symbol

P is a set of production rules of the form N rarr N Σ

Let DataTypes indicate the union of all the values of all datatypes D The set of all binary files is

BinaryFiles = [IndicesrarrDataTypes] where Indices = 1 The language generated by a

context-free array (binary file) grammar AG is the set L(AG) = w w BinaryFiles and S w

By primitive nonterminal is meant a symbol which would be replaced by one or more terminal

symbols were an additional production rule applied In other words a primitive nonterminal is a

nonterminal that appears on the left-hand side of productions which only have one or more

terminal symbols on the right We adopt the notation shown below for primitive nonterminals

ltprimitive_nonterminal_symbol type=datatype_name constant=valuegt

32 Attribute Grammars

Context-free grammars cannot represent context-sensitive aspects of programming languages

such as (1) enforcing the constraint that all variables are declared before they are used or (2)

checking the number of parameters in a function call against the number in the functionlsquos

declaration Context-free grammars have also been used to define the syntax of English and other

natural languages but they cannot define such context-sensitivity as subject verb agreement in

English sentences A number of extensions of context-free grammars have been proposed to

5

address the context-sensitivity of programming languages and natural languages These include

indexed [Aho 1968] recording [Barth 1979] affix [Koster 1971] VW [van Wijngaarden 1974]

and attribute grammars [Knuth 1969 1971]

Context-free grammars also cannot represent the semantics of programming languages Knuth

[1968 1971] proposed an extension of context-free grammars termed attribute grammars that

addresses the semantics as well as the context-sensitivity of programming languages

An attribute grammar GA is a triple ltG A ARgt where G is a context-free grammar for the

language A associates each grammar symbol X isin (N Σ) with a set of attributes and AR

associates each production R isin P with a set of attribute computation rules A(X) where X isin (N

Σ) can be further partitioned into two sets synthesized attributes S(X) and inherited attributes

I(X) AR(R) where R isin P contains rules for computing inherited and synthesized attributes

associated with the symbols in the production R Synthesized attributes are evaluated and passed

up from deeper terminals and non-terminals while inherited attributes are passed down from

upper non-terminals

An L-attributed grammar is an attribute grammar that uses

i) synthesized attributes and

ii) inherited attributes where the inherited attribute depends only on

a) attributes of the other child nodes to its left

b) inherited attributes of P itself

L-attributed grammars allow the attributes to be evaluated in one left-to-right traversal of the

abstract syntax tree As a result attribute evaluation in L-attributed grammars can be

incorporated conveniently in top-down parsing

33 Semantic Grammar

An interpreter for a binary file format that is using a grammar to display or play the contents of a

file or to convert it to another file format needs to produce a semantic representation that is

appropriate to the task Natural language understanding components of dialog systems face a

similar need to produce a semantic representation that is appropriate to the dialogue task To

address this need some developers of dialogue systems make use of an approach called semantic

grammar [Issar and Ward 1993 Ward and Issar 1994] A semantic grammar is a CFG in which

the non-terminals on the left-hand side of rules correspond to semantic entities A parser for a

semantic grammar produces a labeling of the input string with semantic node labels for example

SHOW FLIGHTS ORIGIN DESTINATION DEPART_DATE DEPART_TIME Show me flights from Boston to San Francisco on Tuesday morning

6

A semantic grammar is just what we need to file in the values of a programming language data

structure Semantic grammars are not applicable to generic systems but are limited to specific

domains This is not a problem in specifying or parsing a binary file format since a grammar for

a binary file format is not meant to apply to all binary file formats but just to a single format or

to a family of file formats

4 Grammars for Chunk-based Binary File Formats

The attribute grammar that we present will be a mixture of synthesized semantic rules (attribute

values of the left side of a production are calculated from attribute values of symbols on the right

side) and inherited rules (attribute values of symbols on the right side of a production are

calculated from attribute values of the symbol on the left side)

The field names of the file format spec are represented as nonterminals in the grammar Constant

values in fields are terminals that appear on the right-hand side of productions Nonterminal

fieldnames have data types associated with them

41 InterLeaved BitMap (ILBM)

The author of the Interleaved Bitmap (ILBM) file format recognized that the structure of the file

format could be specified using pseudo EBNF rules and C data structure declarations [Morrison

1986] In this section the ―grammar of the ILBM file format specification is converted to a

context-free binary grammar with synthesized and inherited attributes

Figure 2 shows a display of an ILBM file hampic Figure 3 shows an attribute grammar for its

chunk-based binary file format The attribute grammar is a mixture of synthesized semantic rules

(attribute values of the left side of a production are calculated from attribute values of symbols

on the right side) and inherited rules (attribute values of symbols on the right side of a

production are calculated from attribute values of the symbol on the left side) The attribute

grammar is also an L-attributed grammar Figure 4 shows a parse tree for the file hampic

7

Figure 2 ILBM File hampic Displayed

ltILBMgt rarr ―FORM ltcksize TYPE=UINT32gt ―ILBM ltBMHDgt ltCMAPgt ltCAMGgt ltBODYgt

ILBMcksize = cksizevalue

ltBMHDgt rarr ―BMHD ltcksize TYPE=uint32gt ltBitmapHeadergt

ltBitMapheadergt rarr ltwidth type=UINT32gt

lt height type=UINT16gt

lt xposition type=INT16gt

ltyposition type=INT16gt

ltnplanes type=BYTEgt

ltmasking type=BYTEgt

ltcompression type=BYTEgt

ltreserved type=BYTEgt

lttransparentcolor type=UINT16gt

ltxaspect type=BYTEgt

ltyaspect type=BYTEgt

ltpagewidth type=INT16gt

ltpageheight type=INT16gt

ltCMAPgt -rarr―CMAP ltcksize type=UINT32gt ltcolorgt[n] n=cksizeval3

ltCAMGgt rarr ―CAMG ltcksizetype=UINT32gt ltviewmode type=INT32gt

ltBODYgt rarr ―BODY ltcksize type=UINT32gt ltdata type=BYTEgt[cksize]

ltcolorgt rarr ltred type=BYTEgt ltgreen type= BYTEgt ltblue type=BYTEgt

Figure 3 An Attribute Grammar for an ILBM Chunk-based File Format

8

Figure 4 A Parse Tree for the ILBM File hampic

9

42 RIFF Waveform PCM Audio File Format (WAVE)

The waveform PCM File format is was originally specified in an extended BNF notation [IBM amp

Microsoft 1991] The chunk size was omitted from the EBNF rules Figure 5 shows the EBNF

rules with the cksize inserted To convert this grammar into a binary file format grammar data

types are added for primitive non-terminals

ltWAVE-formgt -gt bdquoRIFF‟ cksize WAVE

ltfmt-ckgt

[ltfact-ckgt]

[ltcue-ckgt]

[ltplaylist-ckgt]

[ltassoc-data-listgt]

ltwave-datagt

ltfmt-ckgt -gt bdquofmt ‟ cksize ltcommon-fieldsgt ltwBitsPerSampleWORDgt

ltcommon-fieldsgt -gt ltwFormatTagWORD value=0x0001gt

ltwChannelsWORDgt

ltdwSamplesPerSecDWORDgt

ltdwAvgBytesPerSecDWORDgt

ltwBlockAlignWORDgt

ltwave-datagt -gt ltdata-ckgt | ltwave-listgt

ltdata-ckgt -gt data cksize ltwave-datagt

ltwave-listgt -gt LIST cksize wavl ltdata-ckgt | ltsilence-ckgt

ltsilence-ckgt -gt slnt cksize ltdwSamplesDWORDgt

ltfact-ckgt -gt factxksize ltdwFileSizeDWORDgt

ltcue-ckgt -gt cue cksize ltdwCuePointsDWORDgt ltcue-pointgt+

ltcue-pointgt -gt ltdwNameDWORDgt

ltdwPositionDWORDgt

ltfccChunkFOURCCgt

ltdwChunkStartDWORDgt

ltdwBlockStartDWORDgt

ltdwSampleOffsetDWORDgt

ltplaylist-ckgt -gt bdquoplst‟ cksize ltdwSegmentsDWORDgt ltplay-segmentgt+

ltplay-segmentgt -gt ltdwNameDWORDgt

ltdwLengthDWORDgt

ltdwLoopsDWORD

ltassoc-data-listgt-gt bdquoLIST‟ cksize adtl ltlabl-ckgtltnote-ckgtltltxt-ckgtltfile-ckgt

ltlabl-ckgt -gt bdquolabl‟ cksize ltdwNameDWORDgt ltdataZSTRgt

ltnote-ckgt -gt bdquonote‟ cksize ltdwNameDWORDgt ltdataZSTRgt

10

ltltxt-ckgt -gt bdquoltxt‟ cksize ltdwNameDWORDgt

ltdwSampleLengthDWORDgt

ltdwPurposeDWORDgt

ltwCountryWORDgt

ltwLanguageWORDgt

ltwDialectWORDgt

ltwCodePageWORDgt

ltdataBYTEgt+

ltfile-ckgt -gt bdquofile‟ cksize ltdwNameDWORDgt ltdwMedTypeDWORDgt

ltfileDataBYTEgt+

Figure 5 Binary File Grammar for RIFF WAVEFORM PCM File Format

5 Recursive Descent Parsers for Binary File Grammars

A recursive descent parser is a top-down parser built from a set of recursive (or non recursive)

procedures where each procedure implements one of the production rules of the grammar A

predictive parser is a recursive descent parser that does not require backtracking Predictive

parsing is possible only for the class of LL(k) grammars which are the context-free grammars

for which there exists some positive integer k that allows a recursive descent parser to decide

which production to use by examining only the next k tokens of input A recursive descent parser

for the binary file format grammars defined in this paper does not require a separate lexer to

identify the data of a file The data types are predicted by the grammar rules

The semantic rules of an L-attributed grammar can be included in the recursive descent parser to

check for context-sensitive aspects of the grammar such as chunk size or image dimensions and

use these values in parsing the chunk data or images The pointers to file addresses that occur in

the file formats are handled by pushing the current address in the file onto a pushdown stack

When the parser exhausts the procedures implementing production rules corresponding to the

pointed to location the topmost address on the pushdown stack is popped and the procedure

corresponding to the nonterminal at that position is executed

Recursive descent parsers with a pushdown stack can be used with binary file format grammars

to create file format recognizers A recognizer (syntax checker or validator) is a parser that reads

a file and generates an error if the file does not conform to the syntax specified by the grammar

11

6 A Parser Generator for Recognizing Binary File Formats

What we want is a parser generator whose input is a binary file attribute grammar for a particular

binary file format and whose generated output is the Java source code of a recursive descent

parser (with a pushdown stack if needed) for the class of binary file formats specified by the

grammar Such a parser generator does not yet exist but there is a widely used parser generator

for attribute grammars for string-based languages

61 ANTLR

ANTLR (ANother Tool for Language Recognition) is a parser generator that uses LL() parsing

[Parr amp Quong 1995 Parr 2007] ANTLR takes as input an attribute grammar that specifies a

language and generates as output source code for a recognizer for that language ANTLR

supports generating code in a number of the programming languages including C Java

JavaScript and Python

ANTLR also generates a lexer from lexical rules in the grammar A lexer is a program that

converts a sequence of characters into tokens A lexer is not needed to parse binary file formats

because the data types predicted by the parser are the tokens of the grammar Furthermore the

capability to recognize LL() grammars is not needed because the binary file format grammars

specified so far do not require lookahead of k symbols

62 Using ANTLR to Generate a Parser for a Binary File Format

It is possible to use ANTLR to test our binary file format grammars in recognizing the chunk-

based and directory-based binary file formats This is accomplished by writing a lexer that treats

each input byte in a file as a character token Then functions are created for each data type in the

binary file grammar that convert the appropriate number of character tokens to binary data types

eg int 16 int32 etc This is effective if someone clumsy and has allowed us to test our

attribute grammars for binary file formats as well as to demonstrate the feasibility of creating a

parser generator for binary file grammars

621 An ANTLR Grammar for an ILBM Chunk-based File Format

Figure 6 shows a specification for the ILBM file format that is input to ANTLR The output of

ANTLR is a lexer and parser for ILBM files

12

grammar iblm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in lexerheader is placed at the top of the lexer file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Grammar Rules used to create the Parser

Start symbol ilbm

iblm returns [boolean result]

FORM ckSize ILBM

Print out ILBM chunksize

Systemoutprintln(ID ILBM Size + $ckSizevalue)

Semantic predicate that ensures that at least on BMHD is found

propertyChunks $propertyChunksfoundBMHD == true

dataChunks

body

$result = true

The bmhd cmap and camg property chunks can occur in any order

At least one BMHD data chunk must be present

propertyChunks returns [boolean foundBMHD]

(bmhd $foundBMHD = true

| cmap

| camg)+

bmhd

BMHD ckSize

Systemoutprintln(ID BMHD Size + $ckSizevalue)

13

bitMapHeader

bitMapHeader

width = uWordValue

height = uWordValue

xPosition = wordValue

yPosition = wordValue

nPanes = uByteValue

masking = uByteValue

compression = uByteValue

reserved1 = uByteValue

transparentColor = uWordValue

xAspect = uByteValue

yAspect = uByteValue

pageWidth = wordValue

pageHeight = wordValue

Print out values in the bitmap structure

Systemoutprintln( Width + $widthvalue)

Systemoutprintln( Height + $heightvalue)

Systemoutprintln( X Position + $xPositionvalue)

Systemoutprintln( Y Position + $yPositionvalue)

Systemoutprintln( Number of Panes + $nPanesvalue)

Systemoutprintln( Masking + $maskingvalue)

Systemoutprintln( Compression + $compressionvalue)

Systemoutprintln( Reserved1 + $reserved1value)

Systemoutprintln( TransParent Color +

$transparentColorvalue)

Systemoutprintln( X Aspest + $xAspectvalue)

Systemoutprintln( Y Aspect + $yAspectvalue)

Systemoutprintln( Page Width + $pageWidthvalue)

Systemoutprintln( Page Height + $pageHeightvalue)

cmap returns [long value]

Initialize the index value i in the cmap function

init int i = 0

CMAP ckSize

Print out chunk size

$value = $ckSizevalue

Systemoutprintln(ID CMAP Size + $ckSizevalue)

(colorRegister

Print out color values

Systemoutprintln( Color[ + i++ + ] +

$colorRegisterredValue +

+ $colorRegistergreenValue +

+ $colorRegisterblueValue +

)

)+

colorRegister returns [int redValue int greenValue int blueValue]

14

red = uByteValue green = uByteValue blue = uByteValue

$redValue = $redvalue

$greenValue = $greenvalue

$blueValue = $bluevalue

camg

CAMG ckSize longValue

Systemoutprintln(ID CAMG Size + $ckSizevalue)

Systemoutprintln( View Modes 0x +

LongtoHexString($longValuevalue))

The crng and ccrt data chunks

dataChunks

crng

| ccrt

crng

CRNG ckSize

Systemoutprintln(ID CRNG Size + $ckSizevalue)

cRange

cRange

pad1 = wordValue $pad1value == 0

rate = wordValue

active = wordValue

low = uByteValue

high = uByteValue

ccrt

CCRT ckSize

Systemoutprintln(ID CCRT Size + $ckSizevalue)

cycleInfo

cycleInfo

direction = wordValue

start = uByteValue

end = uByteValue

seconds = longValue

microseconds = longValue

pad = wordValue $padvalue == 0

body

BODY ckSize

Systemoutprintln(ID BODY Size + $ckSizevalue)

$ckSizevalue gt 0 data

Systemoutprintln( Number of bytes + $datavalue)

$datavalue == $ckSizevalue

defines the data chunk of the body chunk

data returns [long value]

(b+=BYTE)+ $bsize() gt 0

$value = $bsize()

15

The cksize is the value of a long integer

ckSize returns [long value]

longValue

$value = $longValuevalue

defines the longValue data typein terms of BYTE terminals

longValue returns [long value] throws UnsupportedEncodingException

h1 = BYTE h2 = BYTE l1= BYTE l2 = BYTE

$value = (long)

( getUByteValue($h1) ltlt 24 |

getUByteValue($h2) ltlt 16 |

getUByteValue($l1) ltlt 8 |

getUByteValue($l2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uWordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (short)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uByteValue returns [int value]

BYTE

calls the getUByteValue from a BYTE token

16

$value = (int) getUByteValue($BYTE)

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rules used to create the Lexer

FORM FORM

ILBM ILBM

BMHD BMHD

CMAP CMAP

CAMG CAMG

CRNG CRNG

CCRT CCRT

BODY BODY

BYTE

Figure 6 An ANTLR Grammar for the ILBM File Format

622 The Lexer and Parser

When provided the file described in the previous section ANTLR generates the ilbmlexer and

ilbmparser Java programs The Java application shown in Figure 6 uses these programs to

validate the ILBM file BLKIFF It produces the output shown in Figure 7

package orggtrigrammarschunk

import javaioIOException

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRStringStream

import organtlrruntimeCharStream

import organtlrruntimeCommonTokenStream

import organtlrruntimeRecognitionException

17

import organtlrruntimeTokenStream

import gtriorggrammarschunkiblmLexer

import gtriorggrammarschunkiblmParser

public class Test

static String filePath = CUserssl170PicturesIFF-ilbmBLKIFF

public static void main(String[] args) throws RecognitionException IOException

ANTLRFileStream stream = new ANTLRFileStream(filePath ISO-8859-1)

iblmLexer lexer = new iblmLexer(stream)

TokenStream tokenStream = new CommonTokenStream(lexer)

iblmParser parser = new iblmParser(tokenStream)

Systemoutprintln(FILE + filePath)

Systemoutprintln(Successfully parsed = + parseriblm())

Figure 7 A Java Application for Parsing an ILBM File

FILE CUserssl170PicturesIFF-ilbmBLKIFF

ID ILBM Size 2556

ID BMHD Size 20

Width 200

Height 144

X Position 0

Y Position 0

Number of Panes 6

Masking0

Compression1

Reserved10

TransParent Color0

X Aspest3

Y Aspect3

Page Width200

Page Height144

ID CMAP Size 48

Color[0] 0 0 0

Color[1] 0 128 128

Color[2] 0 255 255

Color[3] 255 0 255

Color[4] 0 255 0

Color[5] 255 0 0

Color[6] 0 0 255

Color[7] 255 255 0

Color[8] 128 128 128

Color[9] 192 192 192

Color[10] 128 0 0

Color[11] 0 128 0

Color[12] 128 128 0

Color[13] 0 0 128

Color[14] 128 0 128

18

Color[15] 255 255 255

ID CAMG Size 4

View Modes 0x800

ID BODY Size 2448

Number of bytes 2448

Successfully parsed = true

Figure 8 Output of the ILBM Parser Applied to the file BLKIFF

63 An ANTLR Grammar for the RIFF Waveform PCM File Format

Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR

The output of ANTLR is a lexer and parser for WAVEFOR PCM files

grammar waveform_pcm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text contained in lexerheader is placed at the top of the lexer

file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Default ByteOrder to Little Endian

String byteOrder = II

Rules used to create the Parser

start symbol waveform_pcm

waveform_pcm returns [boolean result]

19

ckId = fourCC $ckIdvalueequals(RIFF)

ckSize = dWordValue

formType = fourCC $formTypevalueequals(WAVE)

formatSubChunk

dataSubChunk

$result = true

fourCC returns [String value]

BYTE BYTE BYTE BYTE $value = $text

formatSubChunk

subChunkId = fourCC $subChunkIdvalueequals(fmt )

ckSize = dWordValue

commonFields

pcmSpecificField

commonFields

formatTag = wordValue Format category

channels = wordValue Number of channels

samplesPerSec = dWordValue Sampling rate

avgBytesPerSec = dWordValue For buffer estimation

blockAlign = wordValue Data block size

pcmSpecificField

bitsPerSample = wordValue Sample size

dataSubChunk

subChunkId = fourCC $subChunkIdvalueequals(data)

ckSize = dWordValue

waveData[$ckSizevalue]

waveData [long size]

(size gt= 0=gt BYTE size--)+

defines the long data type in terms of BYTE terminals

dWordValue returns [long value] throws UnsupportedEncodingException

b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE

if (byteOrderequals(II))

$value = (long)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 |

getUByteValue($b3) ltlt 16 |

getUByteValue($b4) ltlt 24 )

else

$value = (long)

( getUByteValue($b1) ltlt 24 |

getUByteValue($b2) ltlt 16 |

20

getUByteValue($b3) ltlt 8 |

getUByteValue($b4) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

if (byteOrderequals(II))

$value = (int)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 )

else

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rule used to create the Lexer

BYTE

Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format

7 Related Research

Researchers have developed a number of data description languages to support validation of file

formats EAST (Enhanced ADA Subset) is a data description language developed by the

Consultative Committee for Space Data Systems [2010] The Data Entity Dictionary

Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic

information The EAST description is used to interpret and provide access to information in

binary and text files

DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to

manipulate Java jar files and ELF object files The PADSC data description language can be

used to define the file formats of text and binary files [Fisher amp Gruber 2005] The PADSC

compiler compiles the description into tools that can be used to recognize manipulate and

transform the data into other formats

21

The Data Format Description Language (DFDL) is being developed by a Working Group of the

Open Grid Forum [OGF 2011] The data description language is based on a subset of W3C XML

Schema using ltxsappinfogt annotations to carry the extra information necessary to describe non-

XML physical representations Version 1 of the language specification was published in 2011

and a parser is being implemented

Each of the data description languages described above can be used to define data types and file

structures of binary files However they are based on file structure declarations such as those of

C or Java The binary file format grammar described in this paper most closely resembles DFDL

However the binary file grammar described in this paper is the only data description language

based on formal grammars that is used for creating recognizers for file formats

JHOVE (JSTORHarvard Object Validation Environment) is an extensible system designed to

provide automated and efficient identification and validation of the formats of digital files

JHOVE is a format-specific digital object validation API written in Java JHOVE supports

validation of the following formats AIFF ASCII GIF HTML JPEG JPEG 2000 PDF TIFF

UTF-8 WAVE and XML JHOVE2 is second generation validation environment [California

digital Library 2011]

The research reported in this paper is similar in intent to that of the JHOVE projectsmdashvalidation

of binary file formats As a matter of fact binary file grammars and parsers have been

constructed for the chunk-based file formats AIFF JPEG and WAVE as well as the directory-

based format TIFF However the research reported herein differs from that of the JHOVE

project in that what is sought is a technology for generating validators for binary file formats

from grammars specifying the binary file format

8 Conclusion

The research question addressed by his research is whether it is possible to extend the context-

free grammars used to specify the syntax of programming languages to the specification of

binary file formats and to use these grammars with parsers for validating the file formats of

binary files Amajor family of binary file formats chunk-based was described Then extensions

to the concepts of context-free grammars and attribute grammars were described that enable the

specification of binary file formats Examples of attribute array grammars for two chunk-based

binary file format were then presented Recursive descent parsers for this classes of grammars

was then described Experience in using ANTLR a parser generator for LL() string grammars

in generating parsers for recognizing the formats of binary files was discussed Finally related

research in data description languages for binary file formats was described and the JHOVE

project in creating Java-based tools for validating file formats was described

22

It is concluded that it is possible to extend context-free grammars to the specification of chunk-

based binary file formats Furthermore these grammars can be used with recursive descent

parsers for validating the file formats of chunk-based binary files It remains to be determined

whether these attribute grammars based on binary file (array) grammars are adequate to define

other families of binary file formats

To be a practical technology for generating recognizers (validators) for binary file formats a

parser generator is needed that creates from a binary file grammar a recursive descent parser

(with pushdown stack if needed) ANTLR which accepts LL() grammars as input and

generates a lexer as well as a parser is too heavyweight a parser generator for this task and does

not have the requisite data types built in

A capability to validate binary file formats assumes that the file format of a file is already

known In this research we use a file type identifier based on the UNIX file command and a

magic file that we have created [Underwood 2009] We have samples of the chunk-based file

formats discussed in this report including those referenced in Appendix a We also have file

signature tests (magic tests) for all these file types

We would like to have a parser generator (also called a compilerndashcompiler) that would take as

input an attribute grammar for a binary file format and generate a parser or interpreter that

recognizes the file format or that interprets the file format and displays or plays the contents The

closest parser generator that we have found to having these capabilities is ANTLR

(wwwantlrorg ) ANTLR will generate a top-down recursive descent parser from a context-free

attribute grammar However it works with strings and generates a separate scanner

Appendix A lists the names of about 75 file formats that belong to the family of ―chunk-based

file formats We have specifications for each of file formats and samples of files with these

formats

These file formats have a similar structure We have developed binary file grammars for two of

these binary file formats We should be able to do so for all of them We used these grammars

with the extensions to ANTLR to generate top-down recursive descent parsers for the file

formats defined with these grammars

The significance of success in this research task is that if binary file formats can be specified

with binary file grammars then only one parsing generator is needed to generate the parsers for

verifying that a file conforms to a particular format Similarly the same parsing generator can be

used for conversion of legacy file formats to current or standard formats Finally the same parser

generator can be used for generating viewerplayers for most file formats This would increase

23

the likelihood of preserving and making available into the indefinite future hose digital records

encoded in binary file formats

24

References

[Aho 1968] A V Aho Indexed Grammars ndash An Extension of Context-free Grammars Journal

of the ACM Vol 15 Issue 4 Oct 1968

[Alias-wavefront 1999] Maya File Formats Maya 20 1999

httpcaadarchethzchinfomayamanual

[Amigan 2011] Amigan Software IFF FORM and Chunk Registry

httpamigan1emunetregiffhtml

[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description

May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description

[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats

Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21

[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files

Version 13 Apple Computer Inc January 1989

wwwaesnashvilleorgPDFsAIFF20Specspdf

[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-

webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt

[Apple 2005] Apple Core Audio Format Specification 2005b

httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_

specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101

[Apple 2007] QuickTime File Format Specification

httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref

acehtml

[ASF 2001] ASF format version 10 reconstruction January 2001

httpavifilesourceforgenetasf-10htm

[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January

1997 wwwdcsedacukhomemxrgfx3d3DSspec

[Autodesk 2009] Autodesk DXF Reference January 2009

httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112

25

[Back 2002] G Back (2002) A specification and scripting language for binary data Generative

Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-

77

[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V

22 1979 pp 243-256

[Baum amp Noel 2002]D Baum amp G Noel The Simms Interchange File Format 2002

httpsimtechsourceforgenettechiffhtml

[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New

Horizons Software Inc 1987 httpamigan1emunetregWORDtxt

[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995

ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt

[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010

httpondasourceforgenetnlfnlfFormathtml

[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm amp Onda file

formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml

[California Digital Library 2011] California Digital Library JHOVE2 Userlsquos Guide 2011

httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf

[Caligari 1998] Caligri Caligari trueSpace file format Document Version 6 July

2002 httpwwwcaligaricomdownloadzipscncob65zip

[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information

Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements

and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-

t81pdf

[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description

Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf

[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998

[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt

[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D

Taliesin Inc httpamigan1emunetregDR2Dtxt

26

[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG

decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt

[Dunckley et al 2007] Dunckley M Rankin S Conway E amp Giaretta D (2007) The use of

file description languages for file format identification and validation Proc PV 2007

Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical

Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-

detailsw=50089

[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format

Tech 3285 July 2001 httptechebuchdocstechtech3285pdf

Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml

[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd

May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html

[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495

wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml

[Fisher amp Gruber 2005] Fisher K amp Gruber R (2005) PADS A domain specific language for

processing ad hoc data ACM Conference on Programming language Design and

Implementation ACM Press httpwwwpadsprojorgpaperspldipdf

[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG

1995

[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal

Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September

1997 wwwgnelsondemoncoukzspecquetzalhtml

[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called

Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)

httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt

[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03

httpwwwanim8orcomresourcesan8_formattxt

[Google 2010] WebP RIFF Container 2010

httpcodegooglecomspeedwebpdocsriff_containerhtml

[Haaland] Mike Haaland Flic Files (FLI) Format description

httpwwwmartinreddynetgfxanimFLItxt

27

[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube

Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf

[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice

Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt

[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata

Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1

1995

[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)

August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21

[IBM amp Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data

Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt

wwwtactilemediacominfoMCI_Control_Infohtml

httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf

[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec

11 2008 httpduskblueorgprojtoymidimidiformatpdf

[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10

ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf

[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable

Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C

Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110

[Issar amp Ward 1993] S Issar and W Ward CMUlsquos robust spoken language understanding

system EUOSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf

[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software

November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application

Version 50) wwwjasccomsupportkbarticlespspspecasp

[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image

File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version

22 April 2002

httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf

[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994

httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt

28

[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory

vol 2 pp 127_145 1968

[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical

Systems Theory vol 5 pp 95_96 1971

[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)

North-Holland Publ Co Amsterdam 1971 pp 95-109

[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language

Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml

httpwebcswpiedu~kalcoursescompilersmodule4mysahtml

[LightWave 1996] LightWave 3D Object File Format Oct 16 1996

httpsandboxdeosglightwavehtm

[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWo2] November 9 2001

httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml

[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer

1997 httppaulbourkenetdataformatscinema4dcinema4dpdf

[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and

Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf

[Maishein] Max Maischein PIC(PRO) Format

http14791177212extrafileformatgraphicspicpicprotxt

[Martin 2007] Michael Martin SVAF format description Version 02 January 2007

httpsvafsourceforgenetsvafhtml

[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and

Data Techniques July 29 1992 Revision 1097

httpnetghostnarodrugffvendspecmicriffms_rifftxt

[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New

Multimedia Data Types and Data Techniques Revision 30 April 15 1994

[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE

httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-

usstreamhhstreamaudprop_7wz7asp

[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003

December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx

29

[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-

uslibraryms899421aspx

[Morrison 1985] Jerry Morrison ―EA IFF 85rdquo Standard for Interchange Format Files

Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt

[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17

1986 wwwfine-viewcomjplabsdocilbmtxt

[Morrison 1986b] J Morrison ―SMUS IFF Simple Musical Score Electronic Arts Inc

February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus

[MultimediaWiki 2006] Smush Animation Format August 2006

httpwikimultimediacxindexphptitle=SANChunk_Forma

[MultimeduiaWiki 2008a] Electronic Arts CMV April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_CMV

[MultimediaWiki 2008b] Electronic Arts DCT April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_DCT

[MultimediaWiki 2008c] Electronic Arts MAD April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MAD

[MultimediaWiki 2008d] Electronic Arts MPC April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MPC

[MultimediaWiki 2008e] Electronic Arts VP6 April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_VP6

[MultimediaWiki 2008f] Electronic Arts TGV August 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGV

[MultimediaWiki 2008g] Electronic Arts TGQ November 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ

[MultimediaWiki 2008h] American Laser Games MM January 2008

httpwikimultimediacxindexphptitle=American_Laser_Games_MM

[MultimediaWiki 2009a] Electronic Arts TQI February 2009

httpwikimultimediacxindexphptitle=Electronic_Arts_TQI

30

[MultimediaWiki 2009b] SANM October 2009

httpwikimultimediacxindexphptitle=SANM

[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240

httpcodegooglecompll-parser-

generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42

[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter

written in Python in the open source graphics file convertor UniConvertor

httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer

[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification

January 31 2011 wwwogforgdocumentsGFD174pdf

[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988

wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt

[Parr amp Quong 1995] Parr T J amp Quong R W (1995) ANTLR A predicated- ll(k) parser

generator Software Practice and Experience 25 7 (July) 789-810

httpwwwantlrorgarticle1055550346383antlrpdf

[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific

Languages Pragmatic

[Peters] Christoph Peters The Ultimate 3D model file format (v 20)

httpultimate3dorgU3DModelFileFormatV2htm

[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003

httpxiphorgoggdocrfc3533txt

[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20

wwweblongcomzarfblorb

httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt

[ Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011

wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml

[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)

Format Version 10 2001 wwwlibpngorgpubmngspecCredits

31

[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File

Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008

httpscopenofficeorgexcelfileformatpdf

[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF

[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments

3072 March 2001 wwwietforgrfcrfc3072txt

[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft

26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt

[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and

old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21

[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988

httpamigan1emunetregANIMtxt

[Tamir 2005] Giora Tamir C RIFF Parser June 2005

wwwcodeprojectcomKBfilesriffparseraspx

[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications

In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi

Khosrow-Pour Information Science Publishing pp 268-273 2008

wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf

[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for

File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute

September 2009

[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical

Engineering The University of Queensland 1996

httpiteeuqeduau~cristinastudentsdaviddavidhtml

[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language

understanding system ARPA Human Language Technologies Workshop Plainsboro New

Jersey httpaclldcupenneduHH94H94-1039pdf

[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars

Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)

Springer-Verlag Berlin 1974 pp 9-16

32

[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley

1995 (chapter 10) (Attribute Grammars)

[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010

httpxiphorgvorbisdocVorbis_I_spechtml

[Zappe 1994] H Zappe Oktalyzer 1994

httpwwwwotsitorglistaspsearch=oktampbutton=GO21

33

Appendix A Chunk-based Binary File Formats

This appendix references specifications for about 75 chunk-based binary file formats and

discusses some of those specifications

A1 Electronic Arts Interchange File Format (IFF)

[Amigan 2011]

Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the

specification for the Interchange File Format [Morrison 1985]

A11 Interleaved Bitmap (ILBM)

[Morrison 1986]

This file format was discussed in section 41 of this report

A12 8-bit Sampled Voice (8SVX)

The file format specification for the 9-bit sampled voice file format uses C programming

language data structures to define the format [Hayes and Morrison 1985] The following is a re-

expression of that specification as a context-free binary file grammar

Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata

cksize rarr ULONG

ckdata rarr voice8header bodyck

ckdata rarr voice8header nameck copyrightck bodyck

ckdata rarr voice8header copyrightck nameck bodyck

ckdata rarr voice8header nameck bodyck

ckdata rarr voice8header copyrightck bodyck

ckdata rarr voice8header authorck bodyck

ckdata rarr voice8header annotation bodyck

ckdata rarr voice8header atakck releaseck bodyck

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 3: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

iii

TABLE OF CONTENTS

1 INTRODUCTION 1

11 BACKGROUND 1

12 PURPOSE 1

13 SCOPE 1

2 BINARY FILE FORMATS 1

3 CONTEXT-FREE GRAMMARS AND ATTRIBUTE GRAMMARS 3

31 CONTEXT-FREE GRAMMARS FOR ARRAYS OF DATA TYPES 3

32 ATTRIBUTE GRAMMARS 4

33 SEMANTIC GRAMMAR 5

4 GRAMMARS FOR CHUNK-BASED BINARY FILE FORMATS 6

41 INTERLEAVED BITMAP (ILBM) 6

42 RIFF WAVEFORM PCM AUDIO FILE FORMAT (WAVE) 9

5 RECURSIVE DESCENT PARSERS FOR BINARY FILE GRAMMARS 10

6 A PARSER GENERATOR FOR RECOGNIZING BINARY FILE FORMATS 11

61 ANTLR 11

62 USING ANTLR TO GENERATE A PARSER FOR A BINARY FILE FORMAT 11

621 An ANTLR Grammar for an ILBM Chunk-based File Format 11

622 The Lexer and Parser 16

63 AN ANTLR GRAMMAR FOR THE RIFF WAVEFORM PCM FILE FORMAT 18

7 RELATED RESEARCH 20

8 CONCLUSION 21

REFERENCES 24

APPENDIX A CHUNK-BASED BINARY FILE FORMATS 33

A1 ELECTRONIC ARTS INTERCHANGE FILE FORMAT (IFF) 33

A11 Interleaved Bitmap (ILBM) 33

A12 8-bit Sampled Voice (8SVX) 33

A13 Cel Animation (ANIM) 34

A14 IFF Formatted Text (FTXT) 35

A15 IFF Simple Musical Score (SMUS) 35

A16 Planar Bitmap (PBM) 35

A2 STANDARD MIDI FILE FORMAT 35

A3 AUDIO INTERCHANGE FILE FORMAT 35

A4 RESOURCE INTERCHANGE FILE FORMATS 37

A41 RIFF-Waveform (WAVE) 37

A42 WAVEFORMATEX 37

A43 WAVEFORMATEXTENSIBLE 37

A44 Broadcast Wave Format 37

iv

A45 Windows Audio-Video Interleave (AVI) 37

A46 Animated Windows Cursor (ANI) 37

A47 RIFF MIDIfile (RMID) 37

A48 Device-Independent Bitmap (RIFF RDIB) 37

A49 WebP 37

A5 JPEG 38

A51 JPEGJFIF 38

A52 JPEGExif 38

A53 JPEG 2000 39

A54 JPEG 2000 Codestream 39

A6 ADVANCED SYSTEMS FORMAT 39

A61 Windows Media Audio 7-9 39

A62 Windows Media Video 7-91 39

A7 PORTABLE NETWORK GRAPHICS 39

A71 Multiple-image Network Graphics 40

A72 JPEG Network Graphics 40

A8 BINARY INTERCHANGE FILE FORMAT 40

A9 3D STUDIO 40

A91 3D Studio File Format ndash 3ds 40

A92 3D Studio Material Library Files 40

A10 ANIM8OR 40

A11 ANIMATOR 40

A111 Animator Cel 40

A112 Animator Pro Fli 40

A113 Animator Pro fLC 40

A111 Animator Cel 40

A114 Animator Pro PIC 40

A12 AUDIO MANAGER MODULE 41

A13 AUTODESK DRAWING INTERCHANGE FILES (DXF) (BINARY) 41

A14 CALIGARI TRUESPACE SCENEOBJECT (BINARY) 41

A15 CORELDRAW VECTOR GRAPHICS FILE 41

A16 COREL METAFILE EXCHANGE IMAGE 41

A17 MUSIC MODULES 41

A171 DigiTrekker Module Format 41

A172 Graoumf Tracker modules 41

A173 Oktalyzer 41

A18 EA MULTIMEDIA GAME FILE FORMATS 41

A181 Electronic Arts Music (asf) 41

A182 Electronic Arts CMV 41

A183 Electronic Arts DCT 41

A184 Electronic Arts MAD 42

A185 Electronic Arts MPC 42

A186 Electronic Arts VP7 42

A187 Electronic Arts TGV 42

A188 Electronic Arts TGQ 42

v

A189 Electronic Arts TQI 42

A19 IMAGINE 3D DATA DESCRIPTION (TDDD) 42

A20 LIGHTWAVE 3D OBJECTS 42

A201 LWOB 42

A202 LWO2 42

A203 LWLO 42

A21 LUXOLOGY MODO 3D OBJECT (LXOB) 42

A22 PAINT SHOP PRO (PSP) 42

A23 PROVECTOR 2D DRAWING 43

A24 PROWRITE DOCUMENT 43

A25 INTERACTIVEFICTION 43

A251 Blorb Z-machine Resource File 43

A252 Quetzal 43

A26 APPLE QUICKTIME 43

A27 MAYA IFF BITMAP 43

A28 APPLE CORE AUDIO FORMAT 43

A29 AMERICAN LASER GAMES (MM) 43

A30 SMUSH ANIMATION FORMAT 43

A301 SAN 43

A302 SNM 43

A31 STRUCTURED DATA EXCHANGE FORMAT (SDXF) 43

A32 OGG VORBIS 44

A33 SIMPLE VECTOR ARRAY FORMAT 44

A34 NESTED LIST FILE 44

A35 CINEMA 4D 44

A36 THE SIMS INTERCHANGE FILE FORMAT 44

vi

List of Figures

Figure 1 File format (Layout) of a RIFF container for a Webp Image 2

Figure 2 ILBM File hampic Displayed 7

Figure 3 An Attribute Grammar for an ILBM Chunk-based File Format 7

Figure 4 A Parse Tree for the ILBM File hampic 8

Figure 5 Binary File Grammar for RIFF WAVEFORM PCM File Format 10

Figure 6 An ANTLR Grammar for the ILBM File Format 16

Figure 7 A Java Application for Parsing an ILBM File 17

Figure 8 Output of the ILBM Parser Applied to the file BLKIFF 18

Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format 20

1

1 Introduction

11 Background

Automated tools are required for identifying and validating the formats of the huge number of

files ingested into digital data and record archives Automated tools are also needed for

viewingplaying text and binary file formats and for converting legacy and obsolete file formats

to standard or current formats Technologies such as data description languages have emerged to

address these preservation challenges [Dunckley et al 2007]

Context-free grammars have been used to specify the syntax of programming languages

Attribute grammars provide a framework for formally specifying the semantics of a language

based on its context-free grammar and for addressing the mildly context-sensitive features of

programming languages such as agreement of the data types of variables in expressions with data

type declarations Such grammars have parsingtranslation algorithms that can be used for

syntax-checking and interpretation or translation of the languages the grammars define

12 Purpose

The research question addressed by this research is whether it is possible to extend the context-

free grammars used to specify the syntax of programming languages to the specification of

binary file formats and to use these grammars with parsers for validating the file formats of

binary files

13 Scope

The next section discusses the traditional approach to specifying file formats and two of the

major families of binary file formats In section 3 extensions to the concepts of context-free

grammars and attribute grammars are described that enable the specification of binary file

formats In section 4 examples of attribute array grammars for a chunk-based binary file formats

are presented In section 5 recursive descent parsers for these classes of grammars are described

In section 6 experience in using ANTLR a parser generator for LL() string grammars in

generating parsers for recognizing the formats of binary files is discussed In section 7 related

research is described Section 8 summarizes the results

2 Binary File Formats

Traditionally a file (or record) layout was a form showing how fields were positioned in a file

(or record) The fields are named and have as attributes a data type length and sometimes a

constant value Fields are offset at addresses relative to the beginning of a file File formats are

still specified in this manner For instance Figure 1 shows the format layout of a RIFF container

for the VP8 codec of a Webp image [Google 2010]

2

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk starting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

Figure 1 File format (Layout) of a RIFF container for a Webp Image

Many binary file formats are specified using pseudo-regular expressions or pseudo-EBNF

notation EBNF (Extended Backus-Naur Form) is a notation for expressing context-free

grammars in a compact human readable way Pseudo-EBNF is similar in concept to pseudocode

which is a high-level description of a computer algorithm that is intended for human

understanding but that omits details that would be necessary for computer execution The

pseudo-EBNF (or regular expression) specification of a file format is augmented with a natural

language description of the details that are not actually expressible in the EBNF (or regular

expression) notation These details are often the context-sensitive relationships of the size of an

array to its actual length or the relationship of an address pointer to the actual location of the

data pointed to in the file These context-sensitive relationships are not expressible in a context-

free grammar (or EBNF notation) It is a goal of this research to extend the EBNF notation and

concept of a context-free grammar to include the details that are necessary to precisely specify

the file formats of binary file formats that include these context-sensitive features

Binary file formats with a similar file structure are referred to as a family of file formats There

are two families of binary file formats that are readily distinguished the chunk-based and the

directory-based file formats

Chunk-based file formats were created by Electronic Arts and Commodore-Amiga as the

Interchange File Format (IFF) [Morrison 1985] An IFF file itself is one entire IFF chunk A

chunk consists of an ID tag a size and size bytes of data The data may include chunks called

subchunks Subchunks have the same structure as chunks Subchunks can have subchunks

Chunk-based file formats are primarily used as containers for multimedia but have also been

used for word processing files including pictures and other figures File format specifications for

chunk-based formats may use other terms than chunks in describing the format for instance

atoms in QuickTimeMP4 segments in JPEG and ―tagged data representations as in

AutoCAD DXF

3

The Audio Interchange File Format (AIFF) developed by Apple computer in 1988 was based on

Electronic Arts Interchange File Format Apple Core Audio Format (CAF) Applelsquos replacement

format for AIFF remains a chunk-based format [Apple Computer 2005] The Resource

Interchange File Format (RIFF) introduced by IBM and Microsoft [1991] is also based on

Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-Video

Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis WebP a picture format

recently introduced by Google[2010] also uses RIFF as a container

Microsoftlsquos Advanced Systems Format (ASF) [2004] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV) Microsoftlsquos Binary Interchange File Format [Rentz 2008]

(Microsoft Excellsquos File Format) is chunk-based

Another family of binary file formats is directory-based A directory-based file format consists of

one or more directory tables which contain one or more directory entries A directory entry

specifies where the actual data for a type of information is located This is the scheme used in

TIFF files OLE (Microsoft Object Linking and Embedding) files OASIS OpenDocument and

Microsoft Open Office files

3 Context-Free Grammars and Attribute Grammars

Context-free grammars have been widely used to define programming languages A context-free

grammar G is a quadruple ltN Σ S Pgt where

N is a finite set of non-terminal symbols

Σ is a finite set of terminal symbols

S ε N is the start symbol

P is a set of production rules of the form N rarr N Σ

The set of all strings over a vocabulary Σ is denoted by Σ The language generated by a context-

free grammar G is the set L(G) = w w Σ and S w

31 Context-Free Grammars for Arrays of Data Types

An array is a data structure consisting of a collection of elements (values or variables) each

identified by an index An array is stored so that the position of each element can be computed

from its index For example an array of 10 integer variables with indices 0 through 9 may be

4

stored as 10 (4 byte) words at file addresses 0 4 8 hellip36 so that the element with index i has the

address 4 times i

Arrays are used to implement many other data structures such as lists and strings In most

modern computers the internal memory and external memory of storage devices (and files on

those devices) is a one-dimensional array of data types whose indices are their addresses

A data type is a classification of one of various types of data such as floating-point integer

character pointer or Boolean that determines the possible values for that type the operations

that can be performed on values of that type and the way values of that type can be stored Data

types have names such as int16 int32 float char bool ptr We define a binary file to be an array

of values of data of various types

We define a context-free binary file grammar BG as a quintuple ltN D Σ S Pgt where

N is a finite set of non-terminal symbols

D is a set of data types

Σ is a finite set of binary values of data types D called terminals

S ε N is the start symbol

P is a set of production rules of the form N rarr N Σ

Let DataTypes indicate the union of all the values of all datatypes D The set of all binary files is

BinaryFiles = [IndicesrarrDataTypes] where Indices = 1 The language generated by a

context-free array (binary file) grammar AG is the set L(AG) = w w BinaryFiles and S w

By primitive nonterminal is meant a symbol which would be replaced by one or more terminal

symbols were an additional production rule applied In other words a primitive nonterminal is a

nonterminal that appears on the left-hand side of productions which only have one or more

terminal symbols on the right We adopt the notation shown below for primitive nonterminals

ltprimitive_nonterminal_symbol type=datatype_name constant=valuegt

32 Attribute Grammars

Context-free grammars cannot represent context-sensitive aspects of programming languages

such as (1) enforcing the constraint that all variables are declared before they are used or (2)

checking the number of parameters in a function call against the number in the functionlsquos

declaration Context-free grammars have also been used to define the syntax of English and other

natural languages but they cannot define such context-sensitivity as subject verb agreement in

English sentences A number of extensions of context-free grammars have been proposed to

5

address the context-sensitivity of programming languages and natural languages These include

indexed [Aho 1968] recording [Barth 1979] affix [Koster 1971] VW [van Wijngaarden 1974]

and attribute grammars [Knuth 1969 1971]

Context-free grammars also cannot represent the semantics of programming languages Knuth

[1968 1971] proposed an extension of context-free grammars termed attribute grammars that

addresses the semantics as well as the context-sensitivity of programming languages

An attribute grammar GA is a triple ltG A ARgt where G is a context-free grammar for the

language A associates each grammar symbol X isin (N Σ) with a set of attributes and AR

associates each production R isin P with a set of attribute computation rules A(X) where X isin (N

Σ) can be further partitioned into two sets synthesized attributes S(X) and inherited attributes

I(X) AR(R) where R isin P contains rules for computing inherited and synthesized attributes

associated with the symbols in the production R Synthesized attributes are evaluated and passed

up from deeper terminals and non-terminals while inherited attributes are passed down from

upper non-terminals

An L-attributed grammar is an attribute grammar that uses

i) synthesized attributes and

ii) inherited attributes where the inherited attribute depends only on

a) attributes of the other child nodes to its left

b) inherited attributes of P itself

L-attributed grammars allow the attributes to be evaluated in one left-to-right traversal of the

abstract syntax tree As a result attribute evaluation in L-attributed grammars can be

incorporated conveniently in top-down parsing

33 Semantic Grammar

An interpreter for a binary file format that is using a grammar to display or play the contents of a

file or to convert it to another file format needs to produce a semantic representation that is

appropriate to the task Natural language understanding components of dialog systems face a

similar need to produce a semantic representation that is appropriate to the dialogue task To

address this need some developers of dialogue systems make use of an approach called semantic

grammar [Issar and Ward 1993 Ward and Issar 1994] A semantic grammar is a CFG in which

the non-terminals on the left-hand side of rules correspond to semantic entities A parser for a

semantic grammar produces a labeling of the input string with semantic node labels for example

SHOW FLIGHTS ORIGIN DESTINATION DEPART_DATE DEPART_TIME Show me flights from Boston to San Francisco on Tuesday morning

6

A semantic grammar is just what we need to file in the values of a programming language data

structure Semantic grammars are not applicable to generic systems but are limited to specific

domains This is not a problem in specifying or parsing a binary file format since a grammar for

a binary file format is not meant to apply to all binary file formats but just to a single format or

to a family of file formats

4 Grammars for Chunk-based Binary File Formats

The attribute grammar that we present will be a mixture of synthesized semantic rules (attribute

values of the left side of a production are calculated from attribute values of symbols on the right

side) and inherited rules (attribute values of symbols on the right side of a production are

calculated from attribute values of the symbol on the left side)

The field names of the file format spec are represented as nonterminals in the grammar Constant

values in fields are terminals that appear on the right-hand side of productions Nonterminal

fieldnames have data types associated with them

41 InterLeaved BitMap (ILBM)

The author of the Interleaved Bitmap (ILBM) file format recognized that the structure of the file

format could be specified using pseudo EBNF rules and C data structure declarations [Morrison

1986] In this section the ―grammar of the ILBM file format specification is converted to a

context-free binary grammar with synthesized and inherited attributes

Figure 2 shows a display of an ILBM file hampic Figure 3 shows an attribute grammar for its

chunk-based binary file format The attribute grammar is a mixture of synthesized semantic rules

(attribute values of the left side of a production are calculated from attribute values of symbols

on the right side) and inherited rules (attribute values of symbols on the right side of a

production are calculated from attribute values of the symbol on the left side) The attribute

grammar is also an L-attributed grammar Figure 4 shows a parse tree for the file hampic

7

Figure 2 ILBM File hampic Displayed

ltILBMgt rarr ―FORM ltcksize TYPE=UINT32gt ―ILBM ltBMHDgt ltCMAPgt ltCAMGgt ltBODYgt

ILBMcksize = cksizevalue

ltBMHDgt rarr ―BMHD ltcksize TYPE=uint32gt ltBitmapHeadergt

ltBitMapheadergt rarr ltwidth type=UINT32gt

lt height type=UINT16gt

lt xposition type=INT16gt

ltyposition type=INT16gt

ltnplanes type=BYTEgt

ltmasking type=BYTEgt

ltcompression type=BYTEgt

ltreserved type=BYTEgt

lttransparentcolor type=UINT16gt

ltxaspect type=BYTEgt

ltyaspect type=BYTEgt

ltpagewidth type=INT16gt

ltpageheight type=INT16gt

ltCMAPgt -rarr―CMAP ltcksize type=UINT32gt ltcolorgt[n] n=cksizeval3

ltCAMGgt rarr ―CAMG ltcksizetype=UINT32gt ltviewmode type=INT32gt

ltBODYgt rarr ―BODY ltcksize type=UINT32gt ltdata type=BYTEgt[cksize]

ltcolorgt rarr ltred type=BYTEgt ltgreen type= BYTEgt ltblue type=BYTEgt

Figure 3 An Attribute Grammar for an ILBM Chunk-based File Format

8

Figure 4 A Parse Tree for the ILBM File hampic

9

42 RIFF Waveform PCM Audio File Format (WAVE)

The waveform PCM File format is was originally specified in an extended BNF notation [IBM amp

Microsoft 1991] The chunk size was omitted from the EBNF rules Figure 5 shows the EBNF

rules with the cksize inserted To convert this grammar into a binary file format grammar data

types are added for primitive non-terminals

ltWAVE-formgt -gt bdquoRIFF‟ cksize WAVE

ltfmt-ckgt

[ltfact-ckgt]

[ltcue-ckgt]

[ltplaylist-ckgt]

[ltassoc-data-listgt]

ltwave-datagt

ltfmt-ckgt -gt bdquofmt ‟ cksize ltcommon-fieldsgt ltwBitsPerSampleWORDgt

ltcommon-fieldsgt -gt ltwFormatTagWORD value=0x0001gt

ltwChannelsWORDgt

ltdwSamplesPerSecDWORDgt

ltdwAvgBytesPerSecDWORDgt

ltwBlockAlignWORDgt

ltwave-datagt -gt ltdata-ckgt | ltwave-listgt

ltdata-ckgt -gt data cksize ltwave-datagt

ltwave-listgt -gt LIST cksize wavl ltdata-ckgt | ltsilence-ckgt

ltsilence-ckgt -gt slnt cksize ltdwSamplesDWORDgt

ltfact-ckgt -gt factxksize ltdwFileSizeDWORDgt

ltcue-ckgt -gt cue cksize ltdwCuePointsDWORDgt ltcue-pointgt+

ltcue-pointgt -gt ltdwNameDWORDgt

ltdwPositionDWORDgt

ltfccChunkFOURCCgt

ltdwChunkStartDWORDgt

ltdwBlockStartDWORDgt

ltdwSampleOffsetDWORDgt

ltplaylist-ckgt -gt bdquoplst‟ cksize ltdwSegmentsDWORDgt ltplay-segmentgt+

ltplay-segmentgt -gt ltdwNameDWORDgt

ltdwLengthDWORDgt

ltdwLoopsDWORD

ltassoc-data-listgt-gt bdquoLIST‟ cksize adtl ltlabl-ckgtltnote-ckgtltltxt-ckgtltfile-ckgt

ltlabl-ckgt -gt bdquolabl‟ cksize ltdwNameDWORDgt ltdataZSTRgt

ltnote-ckgt -gt bdquonote‟ cksize ltdwNameDWORDgt ltdataZSTRgt

10

ltltxt-ckgt -gt bdquoltxt‟ cksize ltdwNameDWORDgt

ltdwSampleLengthDWORDgt

ltdwPurposeDWORDgt

ltwCountryWORDgt

ltwLanguageWORDgt

ltwDialectWORDgt

ltwCodePageWORDgt

ltdataBYTEgt+

ltfile-ckgt -gt bdquofile‟ cksize ltdwNameDWORDgt ltdwMedTypeDWORDgt

ltfileDataBYTEgt+

Figure 5 Binary File Grammar for RIFF WAVEFORM PCM File Format

5 Recursive Descent Parsers for Binary File Grammars

A recursive descent parser is a top-down parser built from a set of recursive (or non recursive)

procedures where each procedure implements one of the production rules of the grammar A

predictive parser is a recursive descent parser that does not require backtracking Predictive

parsing is possible only for the class of LL(k) grammars which are the context-free grammars

for which there exists some positive integer k that allows a recursive descent parser to decide

which production to use by examining only the next k tokens of input A recursive descent parser

for the binary file format grammars defined in this paper does not require a separate lexer to

identify the data of a file The data types are predicted by the grammar rules

The semantic rules of an L-attributed grammar can be included in the recursive descent parser to

check for context-sensitive aspects of the grammar such as chunk size or image dimensions and

use these values in parsing the chunk data or images The pointers to file addresses that occur in

the file formats are handled by pushing the current address in the file onto a pushdown stack

When the parser exhausts the procedures implementing production rules corresponding to the

pointed to location the topmost address on the pushdown stack is popped and the procedure

corresponding to the nonterminal at that position is executed

Recursive descent parsers with a pushdown stack can be used with binary file format grammars

to create file format recognizers A recognizer (syntax checker or validator) is a parser that reads

a file and generates an error if the file does not conform to the syntax specified by the grammar

11

6 A Parser Generator for Recognizing Binary File Formats

What we want is a parser generator whose input is a binary file attribute grammar for a particular

binary file format and whose generated output is the Java source code of a recursive descent

parser (with a pushdown stack if needed) for the class of binary file formats specified by the

grammar Such a parser generator does not yet exist but there is a widely used parser generator

for attribute grammars for string-based languages

61 ANTLR

ANTLR (ANother Tool for Language Recognition) is a parser generator that uses LL() parsing

[Parr amp Quong 1995 Parr 2007] ANTLR takes as input an attribute grammar that specifies a

language and generates as output source code for a recognizer for that language ANTLR

supports generating code in a number of the programming languages including C Java

JavaScript and Python

ANTLR also generates a lexer from lexical rules in the grammar A lexer is a program that

converts a sequence of characters into tokens A lexer is not needed to parse binary file formats

because the data types predicted by the parser are the tokens of the grammar Furthermore the

capability to recognize LL() grammars is not needed because the binary file format grammars

specified so far do not require lookahead of k symbols

62 Using ANTLR to Generate a Parser for a Binary File Format

It is possible to use ANTLR to test our binary file format grammars in recognizing the chunk-

based and directory-based binary file formats This is accomplished by writing a lexer that treats

each input byte in a file as a character token Then functions are created for each data type in the

binary file grammar that convert the appropriate number of character tokens to binary data types

eg int 16 int32 etc This is effective if someone clumsy and has allowed us to test our

attribute grammars for binary file formats as well as to demonstrate the feasibility of creating a

parser generator for binary file grammars

621 An ANTLR Grammar for an ILBM Chunk-based File Format

Figure 6 shows a specification for the ILBM file format that is input to ANTLR The output of

ANTLR is a lexer and parser for ILBM files

12

grammar iblm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in lexerheader is placed at the top of the lexer file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Grammar Rules used to create the Parser

Start symbol ilbm

iblm returns [boolean result]

FORM ckSize ILBM

Print out ILBM chunksize

Systemoutprintln(ID ILBM Size + $ckSizevalue)

Semantic predicate that ensures that at least on BMHD is found

propertyChunks $propertyChunksfoundBMHD == true

dataChunks

body

$result = true

The bmhd cmap and camg property chunks can occur in any order

At least one BMHD data chunk must be present

propertyChunks returns [boolean foundBMHD]

(bmhd $foundBMHD = true

| cmap

| camg)+

bmhd

BMHD ckSize

Systemoutprintln(ID BMHD Size + $ckSizevalue)

13

bitMapHeader

bitMapHeader

width = uWordValue

height = uWordValue

xPosition = wordValue

yPosition = wordValue

nPanes = uByteValue

masking = uByteValue

compression = uByteValue

reserved1 = uByteValue

transparentColor = uWordValue

xAspect = uByteValue

yAspect = uByteValue

pageWidth = wordValue

pageHeight = wordValue

Print out values in the bitmap structure

Systemoutprintln( Width + $widthvalue)

Systemoutprintln( Height + $heightvalue)

Systemoutprintln( X Position + $xPositionvalue)

Systemoutprintln( Y Position + $yPositionvalue)

Systemoutprintln( Number of Panes + $nPanesvalue)

Systemoutprintln( Masking + $maskingvalue)

Systemoutprintln( Compression + $compressionvalue)

Systemoutprintln( Reserved1 + $reserved1value)

Systemoutprintln( TransParent Color +

$transparentColorvalue)

Systemoutprintln( X Aspest + $xAspectvalue)

Systemoutprintln( Y Aspect + $yAspectvalue)

Systemoutprintln( Page Width + $pageWidthvalue)

Systemoutprintln( Page Height + $pageHeightvalue)

cmap returns [long value]

Initialize the index value i in the cmap function

init int i = 0

CMAP ckSize

Print out chunk size

$value = $ckSizevalue

Systemoutprintln(ID CMAP Size + $ckSizevalue)

(colorRegister

Print out color values

Systemoutprintln( Color[ + i++ + ] +

$colorRegisterredValue +

+ $colorRegistergreenValue +

+ $colorRegisterblueValue +

)

)+

colorRegister returns [int redValue int greenValue int blueValue]

14

red = uByteValue green = uByteValue blue = uByteValue

$redValue = $redvalue

$greenValue = $greenvalue

$blueValue = $bluevalue

camg

CAMG ckSize longValue

Systemoutprintln(ID CAMG Size + $ckSizevalue)

Systemoutprintln( View Modes 0x +

LongtoHexString($longValuevalue))

The crng and ccrt data chunks

dataChunks

crng

| ccrt

crng

CRNG ckSize

Systemoutprintln(ID CRNG Size + $ckSizevalue)

cRange

cRange

pad1 = wordValue $pad1value == 0

rate = wordValue

active = wordValue

low = uByteValue

high = uByteValue

ccrt

CCRT ckSize

Systemoutprintln(ID CCRT Size + $ckSizevalue)

cycleInfo

cycleInfo

direction = wordValue

start = uByteValue

end = uByteValue

seconds = longValue

microseconds = longValue

pad = wordValue $padvalue == 0

body

BODY ckSize

Systemoutprintln(ID BODY Size + $ckSizevalue)

$ckSizevalue gt 0 data

Systemoutprintln( Number of bytes + $datavalue)

$datavalue == $ckSizevalue

defines the data chunk of the body chunk

data returns [long value]

(b+=BYTE)+ $bsize() gt 0

$value = $bsize()

15

The cksize is the value of a long integer

ckSize returns [long value]

longValue

$value = $longValuevalue

defines the longValue data typein terms of BYTE terminals

longValue returns [long value] throws UnsupportedEncodingException

h1 = BYTE h2 = BYTE l1= BYTE l2 = BYTE

$value = (long)

( getUByteValue($h1) ltlt 24 |

getUByteValue($h2) ltlt 16 |

getUByteValue($l1) ltlt 8 |

getUByteValue($l2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uWordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (short)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uByteValue returns [int value]

BYTE

calls the getUByteValue from a BYTE token

16

$value = (int) getUByteValue($BYTE)

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rules used to create the Lexer

FORM FORM

ILBM ILBM

BMHD BMHD

CMAP CMAP

CAMG CAMG

CRNG CRNG

CCRT CCRT

BODY BODY

BYTE

Figure 6 An ANTLR Grammar for the ILBM File Format

622 The Lexer and Parser

When provided the file described in the previous section ANTLR generates the ilbmlexer and

ilbmparser Java programs The Java application shown in Figure 6 uses these programs to

validate the ILBM file BLKIFF It produces the output shown in Figure 7

package orggtrigrammarschunk

import javaioIOException

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRStringStream

import organtlrruntimeCharStream

import organtlrruntimeCommonTokenStream

import organtlrruntimeRecognitionException

17

import organtlrruntimeTokenStream

import gtriorggrammarschunkiblmLexer

import gtriorggrammarschunkiblmParser

public class Test

static String filePath = CUserssl170PicturesIFF-ilbmBLKIFF

public static void main(String[] args) throws RecognitionException IOException

ANTLRFileStream stream = new ANTLRFileStream(filePath ISO-8859-1)

iblmLexer lexer = new iblmLexer(stream)

TokenStream tokenStream = new CommonTokenStream(lexer)

iblmParser parser = new iblmParser(tokenStream)

Systemoutprintln(FILE + filePath)

Systemoutprintln(Successfully parsed = + parseriblm())

Figure 7 A Java Application for Parsing an ILBM File

FILE CUserssl170PicturesIFF-ilbmBLKIFF

ID ILBM Size 2556

ID BMHD Size 20

Width 200

Height 144

X Position 0

Y Position 0

Number of Panes 6

Masking0

Compression1

Reserved10

TransParent Color0

X Aspest3

Y Aspect3

Page Width200

Page Height144

ID CMAP Size 48

Color[0] 0 0 0

Color[1] 0 128 128

Color[2] 0 255 255

Color[3] 255 0 255

Color[4] 0 255 0

Color[5] 255 0 0

Color[6] 0 0 255

Color[7] 255 255 0

Color[8] 128 128 128

Color[9] 192 192 192

Color[10] 128 0 0

Color[11] 0 128 0

Color[12] 128 128 0

Color[13] 0 0 128

Color[14] 128 0 128

18

Color[15] 255 255 255

ID CAMG Size 4

View Modes 0x800

ID BODY Size 2448

Number of bytes 2448

Successfully parsed = true

Figure 8 Output of the ILBM Parser Applied to the file BLKIFF

63 An ANTLR Grammar for the RIFF Waveform PCM File Format

Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR

The output of ANTLR is a lexer and parser for WAVEFOR PCM files

grammar waveform_pcm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text contained in lexerheader is placed at the top of the lexer

file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Default ByteOrder to Little Endian

String byteOrder = II

Rules used to create the Parser

start symbol waveform_pcm

waveform_pcm returns [boolean result]

19

ckId = fourCC $ckIdvalueequals(RIFF)

ckSize = dWordValue

formType = fourCC $formTypevalueequals(WAVE)

formatSubChunk

dataSubChunk

$result = true

fourCC returns [String value]

BYTE BYTE BYTE BYTE $value = $text

formatSubChunk

subChunkId = fourCC $subChunkIdvalueequals(fmt )

ckSize = dWordValue

commonFields

pcmSpecificField

commonFields

formatTag = wordValue Format category

channels = wordValue Number of channels

samplesPerSec = dWordValue Sampling rate

avgBytesPerSec = dWordValue For buffer estimation

blockAlign = wordValue Data block size

pcmSpecificField

bitsPerSample = wordValue Sample size

dataSubChunk

subChunkId = fourCC $subChunkIdvalueequals(data)

ckSize = dWordValue

waveData[$ckSizevalue]

waveData [long size]

(size gt= 0=gt BYTE size--)+

defines the long data type in terms of BYTE terminals

dWordValue returns [long value] throws UnsupportedEncodingException

b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE

if (byteOrderequals(II))

$value = (long)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 |

getUByteValue($b3) ltlt 16 |

getUByteValue($b4) ltlt 24 )

else

$value = (long)

( getUByteValue($b1) ltlt 24 |

getUByteValue($b2) ltlt 16 |

20

getUByteValue($b3) ltlt 8 |

getUByteValue($b4) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

if (byteOrderequals(II))

$value = (int)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 )

else

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rule used to create the Lexer

BYTE

Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format

7 Related Research

Researchers have developed a number of data description languages to support validation of file

formats EAST (Enhanced ADA Subset) is a data description language developed by the

Consultative Committee for Space Data Systems [2010] The Data Entity Dictionary

Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic

information The EAST description is used to interpret and provide access to information in

binary and text files

DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to

manipulate Java jar files and ELF object files The PADSC data description language can be

used to define the file formats of text and binary files [Fisher amp Gruber 2005] The PADSC

compiler compiles the description into tools that can be used to recognize manipulate and

transform the data into other formats

21

The Data Format Description Language (DFDL) is being developed by a Working Group of the

Open Grid Forum [OGF 2011] The data description language is based on a subset of W3C XML

Schema using ltxsappinfogt annotations to carry the extra information necessary to describe non-

XML physical representations Version 1 of the language specification was published in 2011

and a parser is being implemented

Each of the data description languages described above can be used to define data types and file

structures of binary files However they are based on file structure declarations such as those of

C or Java The binary file format grammar described in this paper most closely resembles DFDL

However the binary file grammar described in this paper is the only data description language

based on formal grammars that is used for creating recognizers for file formats

JHOVE (JSTORHarvard Object Validation Environment) is an extensible system designed to

provide automated and efficient identification and validation of the formats of digital files

JHOVE is a format-specific digital object validation API written in Java JHOVE supports

validation of the following formats AIFF ASCII GIF HTML JPEG JPEG 2000 PDF TIFF

UTF-8 WAVE and XML JHOVE2 is second generation validation environment [California

digital Library 2011]

The research reported in this paper is similar in intent to that of the JHOVE projectsmdashvalidation

of binary file formats As a matter of fact binary file grammars and parsers have been

constructed for the chunk-based file formats AIFF JPEG and WAVE as well as the directory-

based format TIFF However the research reported herein differs from that of the JHOVE

project in that what is sought is a technology for generating validators for binary file formats

from grammars specifying the binary file format

8 Conclusion

The research question addressed by his research is whether it is possible to extend the context-

free grammars used to specify the syntax of programming languages to the specification of

binary file formats and to use these grammars with parsers for validating the file formats of

binary files Amajor family of binary file formats chunk-based was described Then extensions

to the concepts of context-free grammars and attribute grammars were described that enable the

specification of binary file formats Examples of attribute array grammars for two chunk-based

binary file format were then presented Recursive descent parsers for this classes of grammars

was then described Experience in using ANTLR a parser generator for LL() string grammars

in generating parsers for recognizing the formats of binary files was discussed Finally related

research in data description languages for binary file formats was described and the JHOVE

project in creating Java-based tools for validating file formats was described

22

It is concluded that it is possible to extend context-free grammars to the specification of chunk-

based binary file formats Furthermore these grammars can be used with recursive descent

parsers for validating the file formats of chunk-based binary files It remains to be determined

whether these attribute grammars based on binary file (array) grammars are adequate to define

other families of binary file formats

To be a practical technology for generating recognizers (validators) for binary file formats a

parser generator is needed that creates from a binary file grammar a recursive descent parser

(with pushdown stack if needed) ANTLR which accepts LL() grammars as input and

generates a lexer as well as a parser is too heavyweight a parser generator for this task and does

not have the requisite data types built in

A capability to validate binary file formats assumes that the file format of a file is already

known In this research we use a file type identifier based on the UNIX file command and a

magic file that we have created [Underwood 2009] We have samples of the chunk-based file

formats discussed in this report including those referenced in Appendix a We also have file

signature tests (magic tests) for all these file types

We would like to have a parser generator (also called a compilerndashcompiler) that would take as

input an attribute grammar for a binary file format and generate a parser or interpreter that

recognizes the file format or that interprets the file format and displays or plays the contents The

closest parser generator that we have found to having these capabilities is ANTLR

(wwwantlrorg ) ANTLR will generate a top-down recursive descent parser from a context-free

attribute grammar However it works with strings and generates a separate scanner

Appendix A lists the names of about 75 file formats that belong to the family of ―chunk-based

file formats We have specifications for each of file formats and samples of files with these

formats

These file formats have a similar structure We have developed binary file grammars for two of

these binary file formats We should be able to do so for all of them We used these grammars

with the extensions to ANTLR to generate top-down recursive descent parsers for the file

formats defined with these grammars

The significance of success in this research task is that if binary file formats can be specified

with binary file grammars then only one parsing generator is needed to generate the parsers for

verifying that a file conforms to a particular format Similarly the same parsing generator can be

used for conversion of legacy file formats to current or standard formats Finally the same parser

generator can be used for generating viewerplayers for most file formats This would increase

23

the likelihood of preserving and making available into the indefinite future hose digital records

encoded in binary file formats

24

References

[Aho 1968] A V Aho Indexed Grammars ndash An Extension of Context-free Grammars Journal

of the ACM Vol 15 Issue 4 Oct 1968

[Alias-wavefront 1999] Maya File Formats Maya 20 1999

httpcaadarchethzchinfomayamanual

[Amigan 2011] Amigan Software IFF FORM and Chunk Registry

httpamigan1emunetregiffhtml

[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description

May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description

[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats

Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21

[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files

Version 13 Apple Computer Inc January 1989

wwwaesnashvilleorgPDFsAIFF20Specspdf

[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-

webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt

[Apple 2005] Apple Core Audio Format Specification 2005b

httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_

specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101

[Apple 2007] QuickTime File Format Specification

httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref

acehtml

[ASF 2001] ASF format version 10 reconstruction January 2001

httpavifilesourceforgenetasf-10htm

[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January

1997 wwwdcsedacukhomemxrgfx3d3DSspec

[Autodesk 2009] Autodesk DXF Reference January 2009

httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112

25

[Back 2002] G Back (2002) A specification and scripting language for binary data Generative

Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-

77

[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V

22 1979 pp 243-256

[Baum amp Noel 2002]D Baum amp G Noel The Simms Interchange File Format 2002

httpsimtechsourceforgenettechiffhtml

[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New

Horizons Software Inc 1987 httpamigan1emunetregWORDtxt

[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995

ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt

[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010

httpondasourceforgenetnlfnlfFormathtml

[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm amp Onda file

formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml

[California Digital Library 2011] California Digital Library JHOVE2 Userlsquos Guide 2011

httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf

[Caligari 1998] Caligri Caligari trueSpace file format Document Version 6 July

2002 httpwwwcaligaricomdownloadzipscncob65zip

[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information

Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements

and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-

t81pdf

[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description

Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf

[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998

[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt

[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D

Taliesin Inc httpamigan1emunetregDR2Dtxt

26

[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG

decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt

[Dunckley et al 2007] Dunckley M Rankin S Conway E amp Giaretta D (2007) The use of

file description languages for file format identification and validation Proc PV 2007

Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical

Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-

detailsw=50089

[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format

Tech 3285 July 2001 httptechebuchdocstechtech3285pdf

Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml

[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd

May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html

[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495

wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml

[Fisher amp Gruber 2005] Fisher K amp Gruber R (2005) PADS A domain specific language for

processing ad hoc data ACM Conference on Programming language Design and

Implementation ACM Press httpwwwpadsprojorgpaperspldipdf

[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG

1995

[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal

Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September

1997 wwwgnelsondemoncoukzspecquetzalhtml

[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called

Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)

httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt

[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03

httpwwwanim8orcomresourcesan8_formattxt

[Google 2010] WebP RIFF Container 2010

httpcodegooglecomspeedwebpdocsriff_containerhtml

[Haaland] Mike Haaland Flic Files (FLI) Format description

httpwwwmartinreddynetgfxanimFLItxt

27

[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube

Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf

[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice

Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt

[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata

Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1

1995

[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)

August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21

[IBM amp Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data

Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt

wwwtactilemediacominfoMCI_Control_Infohtml

httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf

[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec

11 2008 httpduskblueorgprojtoymidimidiformatpdf

[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10

ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf

[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable

Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C

Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110

[Issar amp Ward 1993] S Issar and W Ward CMUlsquos robust spoken language understanding

system EUOSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf

[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software

November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application

Version 50) wwwjasccomsupportkbarticlespspspecasp

[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image

File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version

22 April 2002

httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf

[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994

httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt

28

[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory

vol 2 pp 127_145 1968

[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical

Systems Theory vol 5 pp 95_96 1971

[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)

North-Holland Publ Co Amsterdam 1971 pp 95-109

[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language

Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml

httpwebcswpiedu~kalcoursescompilersmodule4mysahtml

[LightWave 1996] LightWave 3D Object File Format Oct 16 1996

httpsandboxdeosglightwavehtm

[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWo2] November 9 2001

httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml

[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer

1997 httppaulbourkenetdataformatscinema4dcinema4dpdf

[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and

Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf

[Maishein] Max Maischein PIC(PRO) Format

http14791177212extrafileformatgraphicspicpicprotxt

[Martin 2007] Michael Martin SVAF format description Version 02 January 2007

httpsvafsourceforgenetsvafhtml

[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and

Data Techniques July 29 1992 Revision 1097

httpnetghostnarodrugffvendspecmicriffms_rifftxt

[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New

Multimedia Data Types and Data Techniques Revision 30 April 15 1994

[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE

httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-

usstreamhhstreamaudprop_7wz7asp

[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003

December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx

29

[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-

uslibraryms899421aspx

[Morrison 1985] Jerry Morrison ―EA IFF 85rdquo Standard for Interchange Format Files

Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt

[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17

1986 wwwfine-viewcomjplabsdocilbmtxt

[Morrison 1986b] J Morrison ―SMUS IFF Simple Musical Score Electronic Arts Inc

February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus

[MultimediaWiki 2006] Smush Animation Format August 2006

httpwikimultimediacxindexphptitle=SANChunk_Forma

[MultimeduiaWiki 2008a] Electronic Arts CMV April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_CMV

[MultimediaWiki 2008b] Electronic Arts DCT April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_DCT

[MultimediaWiki 2008c] Electronic Arts MAD April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MAD

[MultimediaWiki 2008d] Electronic Arts MPC April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MPC

[MultimediaWiki 2008e] Electronic Arts VP6 April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_VP6

[MultimediaWiki 2008f] Electronic Arts TGV August 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGV

[MultimediaWiki 2008g] Electronic Arts TGQ November 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ

[MultimediaWiki 2008h] American Laser Games MM January 2008

httpwikimultimediacxindexphptitle=American_Laser_Games_MM

[MultimediaWiki 2009a] Electronic Arts TQI February 2009

httpwikimultimediacxindexphptitle=Electronic_Arts_TQI

30

[MultimediaWiki 2009b] SANM October 2009

httpwikimultimediacxindexphptitle=SANM

[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240

httpcodegooglecompll-parser-

generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42

[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter

written in Python in the open source graphics file convertor UniConvertor

httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer

[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification

January 31 2011 wwwogforgdocumentsGFD174pdf

[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988

wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt

[Parr amp Quong 1995] Parr T J amp Quong R W (1995) ANTLR A predicated- ll(k) parser

generator Software Practice and Experience 25 7 (July) 789-810

httpwwwantlrorgarticle1055550346383antlrpdf

[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific

Languages Pragmatic

[Peters] Christoph Peters The Ultimate 3D model file format (v 20)

httpultimate3dorgU3DModelFileFormatV2htm

[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003

httpxiphorgoggdocrfc3533txt

[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20

wwweblongcomzarfblorb

httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt

[ Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011

wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml

[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)

Format Version 10 2001 wwwlibpngorgpubmngspecCredits

31

[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File

Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008

httpscopenofficeorgexcelfileformatpdf

[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF

[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments

3072 March 2001 wwwietforgrfcrfc3072txt

[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft

26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt

[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and

old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21

[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988

httpamigan1emunetregANIMtxt

[Tamir 2005] Giora Tamir C RIFF Parser June 2005

wwwcodeprojectcomKBfilesriffparseraspx

[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications

In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi

Khosrow-Pour Information Science Publishing pp 268-273 2008

wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf

[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for

File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute

September 2009

[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical

Engineering The University of Queensland 1996

httpiteeuqeduau~cristinastudentsdaviddavidhtml

[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language

understanding system ARPA Human Language Technologies Workshop Plainsboro New

Jersey httpaclldcupenneduHH94H94-1039pdf

[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars

Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)

Springer-Verlag Berlin 1974 pp 9-16

32

[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley

1995 (chapter 10) (Attribute Grammars)

[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010

httpxiphorgvorbisdocVorbis_I_spechtml

[Zappe 1994] H Zappe Oktalyzer 1994

httpwwwwotsitorglistaspsearch=oktampbutton=GO21

33

Appendix A Chunk-based Binary File Formats

This appendix references specifications for about 75 chunk-based binary file formats and

discusses some of those specifications

A1 Electronic Arts Interchange File Format (IFF)

[Amigan 2011]

Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the

specification for the Interchange File Format [Morrison 1985]

A11 Interleaved Bitmap (ILBM)

[Morrison 1986]

This file format was discussed in section 41 of this report

A12 8-bit Sampled Voice (8SVX)

The file format specification for the 9-bit sampled voice file format uses C programming

language data structures to define the format [Hayes and Morrison 1985] The following is a re-

expression of that specification as a context-free binary file grammar

Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata

cksize rarr ULONG

ckdata rarr voice8header bodyck

ckdata rarr voice8header nameck copyrightck bodyck

ckdata rarr voice8header copyrightck nameck bodyck

ckdata rarr voice8header nameck bodyck

ckdata rarr voice8header copyrightck bodyck

ckdata rarr voice8header authorck bodyck

ckdata rarr voice8header annotation bodyck

ckdata rarr voice8header atakck releaseck bodyck

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 4: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

iv

A45 Windows Audio-Video Interleave (AVI) 37

A46 Animated Windows Cursor (ANI) 37

A47 RIFF MIDIfile (RMID) 37

A48 Device-Independent Bitmap (RIFF RDIB) 37

A49 WebP 37

A5 JPEG 38

A51 JPEGJFIF 38

A52 JPEGExif 38

A53 JPEG 2000 39

A54 JPEG 2000 Codestream 39

A6 ADVANCED SYSTEMS FORMAT 39

A61 Windows Media Audio 7-9 39

A62 Windows Media Video 7-91 39

A7 PORTABLE NETWORK GRAPHICS 39

A71 Multiple-image Network Graphics 40

A72 JPEG Network Graphics 40

A8 BINARY INTERCHANGE FILE FORMAT 40

A9 3D STUDIO 40

A91 3D Studio File Format ndash 3ds 40

A92 3D Studio Material Library Files 40

A10 ANIM8OR 40

A11 ANIMATOR 40

A111 Animator Cel 40

A112 Animator Pro Fli 40

A113 Animator Pro fLC 40

A111 Animator Cel 40

A114 Animator Pro PIC 40

A12 AUDIO MANAGER MODULE 41

A13 AUTODESK DRAWING INTERCHANGE FILES (DXF) (BINARY) 41

A14 CALIGARI TRUESPACE SCENEOBJECT (BINARY) 41

A15 CORELDRAW VECTOR GRAPHICS FILE 41

A16 COREL METAFILE EXCHANGE IMAGE 41

A17 MUSIC MODULES 41

A171 DigiTrekker Module Format 41

A172 Graoumf Tracker modules 41

A173 Oktalyzer 41

A18 EA MULTIMEDIA GAME FILE FORMATS 41

A181 Electronic Arts Music (asf) 41

A182 Electronic Arts CMV 41

A183 Electronic Arts DCT 41

A184 Electronic Arts MAD 42

A185 Electronic Arts MPC 42

A186 Electronic Arts VP7 42

A187 Electronic Arts TGV 42

A188 Electronic Arts TGQ 42

v

A189 Electronic Arts TQI 42

A19 IMAGINE 3D DATA DESCRIPTION (TDDD) 42

A20 LIGHTWAVE 3D OBJECTS 42

A201 LWOB 42

A202 LWO2 42

A203 LWLO 42

A21 LUXOLOGY MODO 3D OBJECT (LXOB) 42

A22 PAINT SHOP PRO (PSP) 42

A23 PROVECTOR 2D DRAWING 43

A24 PROWRITE DOCUMENT 43

A25 INTERACTIVEFICTION 43

A251 Blorb Z-machine Resource File 43

A252 Quetzal 43

A26 APPLE QUICKTIME 43

A27 MAYA IFF BITMAP 43

A28 APPLE CORE AUDIO FORMAT 43

A29 AMERICAN LASER GAMES (MM) 43

A30 SMUSH ANIMATION FORMAT 43

A301 SAN 43

A302 SNM 43

A31 STRUCTURED DATA EXCHANGE FORMAT (SDXF) 43

A32 OGG VORBIS 44

A33 SIMPLE VECTOR ARRAY FORMAT 44

A34 NESTED LIST FILE 44

A35 CINEMA 4D 44

A36 THE SIMS INTERCHANGE FILE FORMAT 44

vi

List of Figures

Figure 1 File format (Layout) of a RIFF container for a Webp Image 2

Figure 2 ILBM File hampic Displayed 7

Figure 3 An Attribute Grammar for an ILBM Chunk-based File Format 7

Figure 4 A Parse Tree for the ILBM File hampic 8

Figure 5 Binary File Grammar for RIFF WAVEFORM PCM File Format 10

Figure 6 An ANTLR Grammar for the ILBM File Format 16

Figure 7 A Java Application for Parsing an ILBM File 17

Figure 8 Output of the ILBM Parser Applied to the file BLKIFF 18

Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format 20

1

1 Introduction

11 Background

Automated tools are required for identifying and validating the formats of the huge number of

files ingested into digital data and record archives Automated tools are also needed for

viewingplaying text and binary file formats and for converting legacy and obsolete file formats

to standard or current formats Technologies such as data description languages have emerged to

address these preservation challenges [Dunckley et al 2007]

Context-free grammars have been used to specify the syntax of programming languages

Attribute grammars provide a framework for formally specifying the semantics of a language

based on its context-free grammar and for addressing the mildly context-sensitive features of

programming languages such as agreement of the data types of variables in expressions with data

type declarations Such grammars have parsingtranslation algorithms that can be used for

syntax-checking and interpretation or translation of the languages the grammars define

12 Purpose

The research question addressed by this research is whether it is possible to extend the context-

free grammars used to specify the syntax of programming languages to the specification of

binary file formats and to use these grammars with parsers for validating the file formats of

binary files

13 Scope

The next section discusses the traditional approach to specifying file formats and two of the

major families of binary file formats In section 3 extensions to the concepts of context-free

grammars and attribute grammars are described that enable the specification of binary file

formats In section 4 examples of attribute array grammars for a chunk-based binary file formats

are presented In section 5 recursive descent parsers for these classes of grammars are described

In section 6 experience in using ANTLR a parser generator for LL() string grammars in

generating parsers for recognizing the formats of binary files is discussed In section 7 related

research is described Section 8 summarizes the results

2 Binary File Formats

Traditionally a file (or record) layout was a form showing how fields were positioned in a file

(or record) The fields are named and have as attributes a data type length and sometimes a

constant value Fields are offset at addresses relative to the beginning of a file File formats are

still specified in this manner For instance Figure 1 shows the format layout of a RIFF container

for the VP8 codec of a Webp image [Google 2010]

2

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk starting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

Figure 1 File format (Layout) of a RIFF container for a Webp Image

Many binary file formats are specified using pseudo-regular expressions or pseudo-EBNF

notation EBNF (Extended Backus-Naur Form) is a notation for expressing context-free

grammars in a compact human readable way Pseudo-EBNF is similar in concept to pseudocode

which is a high-level description of a computer algorithm that is intended for human

understanding but that omits details that would be necessary for computer execution The

pseudo-EBNF (or regular expression) specification of a file format is augmented with a natural

language description of the details that are not actually expressible in the EBNF (or regular

expression) notation These details are often the context-sensitive relationships of the size of an

array to its actual length or the relationship of an address pointer to the actual location of the

data pointed to in the file These context-sensitive relationships are not expressible in a context-

free grammar (or EBNF notation) It is a goal of this research to extend the EBNF notation and

concept of a context-free grammar to include the details that are necessary to precisely specify

the file formats of binary file formats that include these context-sensitive features

Binary file formats with a similar file structure are referred to as a family of file formats There

are two families of binary file formats that are readily distinguished the chunk-based and the

directory-based file formats

Chunk-based file formats were created by Electronic Arts and Commodore-Amiga as the

Interchange File Format (IFF) [Morrison 1985] An IFF file itself is one entire IFF chunk A

chunk consists of an ID tag a size and size bytes of data The data may include chunks called

subchunks Subchunks have the same structure as chunks Subchunks can have subchunks

Chunk-based file formats are primarily used as containers for multimedia but have also been

used for word processing files including pictures and other figures File format specifications for

chunk-based formats may use other terms than chunks in describing the format for instance

atoms in QuickTimeMP4 segments in JPEG and ―tagged data representations as in

AutoCAD DXF

3

The Audio Interchange File Format (AIFF) developed by Apple computer in 1988 was based on

Electronic Arts Interchange File Format Apple Core Audio Format (CAF) Applelsquos replacement

format for AIFF remains a chunk-based format [Apple Computer 2005] The Resource

Interchange File Format (RIFF) introduced by IBM and Microsoft [1991] is also based on

Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-Video

Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis WebP a picture format

recently introduced by Google[2010] also uses RIFF as a container

Microsoftlsquos Advanced Systems Format (ASF) [2004] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV) Microsoftlsquos Binary Interchange File Format [Rentz 2008]

(Microsoft Excellsquos File Format) is chunk-based

Another family of binary file formats is directory-based A directory-based file format consists of

one or more directory tables which contain one or more directory entries A directory entry

specifies where the actual data for a type of information is located This is the scheme used in

TIFF files OLE (Microsoft Object Linking and Embedding) files OASIS OpenDocument and

Microsoft Open Office files

3 Context-Free Grammars and Attribute Grammars

Context-free grammars have been widely used to define programming languages A context-free

grammar G is a quadruple ltN Σ S Pgt where

N is a finite set of non-terminal symbols

Σ is a finite set of terminal symbols

S ε N is the start symbol

P is a set of production rules of the form N rarr N Σ

The set of all strings over a vocabulary Σ is denoted by Σ The language generated by a context-

free grammar G is the set L(G) = w w Σ and S w

31 Context-Free Grammars for Arrays of Data Types

An array is a data structure consisting of a collection of elements (values or variables) each

identified by an index An array is stored so that the position of each element can be computed

from its index For example an array of 10 integer variables with indices 0 through 9 may be

4

stored as 10 (4 byte) words at file addresses 0 4 8 hellip36 so that the element with index i has the

address 4 times i

Arrays are used to implement many other data structures such as lists and strings In most

modern computers the internal memory and external memory of storage devices (and files on

those devices) is a one-dimensional array of data types whose indices are their addresses

A data type is a classification of one of various types of data such as floating-point integer

character pointer or Boolean that determines the possible values for that type the operations

that can be performed on values of that type and the way values of that type can be stored Data

types have names such as int16 int32 float char bool ptr We define a binary file to be an array

of values of data of various types

We define a context-free binary file grammar BG as a quintuple ltN D Σ S Pgt where

N is a finite set of non-terminal symbols

D is a set of data types

Σ is a finite set of binary values of data types D called terminals

S ε N is the start symbol

P is a set of production rules of the form N rarr N Σ

Let DataTypes indicate the union of all the values of all datatypes D The set of all binary files is

BinaryFiles = [IndicesrarrDataTypes] where Indices = 1 The language generated by a

context-free array (binary file) grammar AG is the set L(AG) = w w BinaryFiles and S w

By primitive nonterminal is meant a symbol which would be replaced by one or more terminal

symbols were an additional production rule applied In other words a primitive nonterminal is a

nonterminal that appears on the left-hand side of productions which only have one or more

terminal symbols on the right We adopt the notation shown below for primitive nonterminals

ltprimitive_nonterminal_symbol type=datatype_name constant=valuegt

32 Attribute Grammars

Context-free grammars cannot represent context-sensitive aspects of programming languages

such as (1) enforcing the constraint that all variables are declared before they are used or (2)

checking the number of parameters in a function call against the number in the functionlsquos

declaration Context-free grammars have also been used to define the syntax of English and other

natural languages but they cannot define such context-sensitivity as subject verb agreement in

English sentences A number of extensions of context-free grammars have been proposed to

5

address the context-sensitivity of programming languages and natural languages These include

indexed [Aho 1968] recording [Barth 1979] affix [Koster 1971] VW [van Wijngaarden 1974]

and attribute grammars [Knuth 1969 1971]

Context-free grammars also cannot represent the semantics of programming languages Knuth

[1968 1971] proposed an extension of context-free grammars termed attribute grammars that

addresses the semantics as well as the context-sensitivity of programming languages

An attribute grammar GA is a triple ltG A ARgt where G is a context-free grammar for the

language A associates each grammar symbol X isin (N Σ) with a set of attributes and AR

associates each production R isin P with a set of attribute computation rules A(X) where X isin (N

Σ) can be further partitioned into two sets synthesized attributes S(X) and inherited attributes

I(X) AR(R) where R isin P contains rules for computing inherited and synthesized attributes

associated with the symbols in the production R Synthesized attributes are evaluated and passed

up from deeper terminals and non-terminals while inherited attributes are passed down from

upper non-terminals

An L-attributed grammar is an attribute grammar that uses

i) synthesized attributes and

ii) inherited attributes where the inherited attribute depends only on

a) attributes of the other child nodes to its left

b) inherited attributes of P itself

L-attributed grammars allow the attributes to be evaluated in one left-to-right traversal of the

abstract syntax tree As a result attribute evaluation in L-attributed grammars can be

incorporated conveniently in top-down parsing

33 Semantic Grammar

An interpreter for a binary file format that is using a grammar to display or play the contents of a

file or to convert it to another file format needs to produce a semantic representation that is

appropriate to the task Natural language understanding components of dialog systems face a

similar need to produce a semantic representation that is appropriate to the dialogue task To

address this need some developers of dialogue systems make use of an approach called semantic

grammar [Issar and Ward 1993 Ward and Issar 1994] A semantic grammar is a CFG in which

the non-terminals on the left-hand side of rules correspond to semantic entities A parser for a

semantic grammar produces a labeling of the input string with semantic node labels for example

SHOW FLIGHTS ORIGIN DESTINATION DEPART_DATE DEPART_TIME Show me flights from Boston to San Francisco on Tuesday morning

6

A semantic grammar is just what we need to file in the values of a programming language data

structure Semantic grammars are not applicable to generic systems but are limited to specific

domains This is not a problem in specifying or parsing a binary file format since a grammar for

a binary file format is not meant to apply to all binary file formats but just to a single format or

to a family of file formats

4 Grammars for Chunk-based Binary File Formats

The attribute grammar that we present will be a mixture of synthesized semantic rules (attribute

values of the left side of a production are calculated from attribute values of symbols on the right

side) and inherited rules (attribute values of symbols on the right side of a production are

calculated from attribute values of the symbol on the left side)

The field names of the file format spec are represented as nonterminals in the grammar Constant

values in fields are terminals that appear on the right-hand side of productions Nonterminal

fieldnames have data types associated with them

41 InterLeaved BitMap (ILBM)

The author of the Interleaved Bitmap (ILBM) file format recognized that the structure of the file

format could be specified using pseudo EBNF rules and C data structure declarations [Morrison

1986] In this section the ―grammar of the ILBM file format specification is converted to a

context-free binary grammar with synthesized and inherited attributes

Figure 2 shows a display of an ILBM file hampic Figure 3 shows an attribute grammar for its

chunk-based binary file format The attribute grammar is a mixture of synthesized semantic rules

(attribute values of the left side of a production are calculated from attribute values of symbols

on the right side) and inherited rules (attribute values of symbols on the right side of a

production are calculated from attribute values of the symbol on the left side) The attribute

grammar is also an L-attributed grammar Figure 4 shows a parse tree for the file hampic

7

Figure 2 ILBM File hampic Displayed

ltILBMgt rarr ―FORM ltcksize TYPE=UINT32gt ―ILBM ltBMHDgt ltCMAPgt ltCAMGgt ltBODYgt

ILBMcksize = cksizevalue

ltBMHDgt rarr ―BMHD ltcksize TYPE=uint32gt ltBitmapHeadergt

ltBitMapheadergt rarr ltwidth type=UINT32gt

lt height type=UINT16gt

lt xposition type=INT16gt

ltyposition type=INT16gt

ltnplanes type=BYTEgt

ltmasking type=BYTEgt

ltcompression type=BYTEgt

ltreserved type=BYTEgt

lttransparentcolor type=UINT16gt

ltxaspect type=BYTEgt

ltyaspect type=BYTEgt

ltpagewidth type=INT16gt

ltpageheight type=INT16gt

ltCMAPgt -rarr―CMAP ltcksize type=UINT32gt ltcolorgt[n] n=cksizeval3

ltCAMGgt rarr ―CAMG ltcksizetype=UINT32gt ltviewmode type=INT32gt

ltBODYgt rarr ―BODY ltcksize type=UINT32gt ltdata type=BYTEgt[cksize]

ltcolorgt rarr ltred type=BYTEgt ltgreen type= BYTEgt ltblue type=BYTEgt

Figure 3 An Attribute Grammar for an ILBM Chunk-based File Format

8

Figure 4 A Parse Tree for the ILBM File hampic

9

42 RIFF Waveform PCM Audio File Format (WAVE)

The waveform PCM File format is was originally specified in an extended BNF notation [IBM amp

Microsoft 1991] The chunk size was omitted from the EBNF rules Figure 5 shows the EBNF

rules with the cksize inserted To convert this grammar into a binary file format grammar data

types are added for primitive non-terminals

ltWAVE-formgt -gt bdquoRIFF‟ cksize WAVE

ltfmt-ckgt

[ltfact-ckgt]

[ltcue-ckgt]

[ltplaylist-ckgt]

[ltassoc-data-listgt]

ltwave-datagt

ltfmt-ckgt -gt bdquofmt ‟ cksize ltcommon-fieldsgt ltwBitsPerSampleWORDgt

ltcommon-fieldsgt -gt ltwFormatTagWORD value=0x0001gt

ltwChannelsWORDgt

ltdwSamplesPerSecDWORDgt

ltdwAvgBytesPerSecDWORDgt

ltwBlockAlignWORDgt

ltwave-datagt -gt ltdata-ckgt | ltwave-listgt

ltdata-ckgt -gt data cksize ltwave-datagt

ltwave-listgt -gt LIST cksize wavl ltdata-ckgt | ltsilence-ckgt

ltsilence-ckgt -gt slnt cksize ltdwSamplesDWORDgt

ltfact-ckgt -gt factxksize ltdwFileSizeDWORDgt

ltcue-ckgt -gt cue cksize ltdwCuePointsDWORDgt ltcue-pointgt+

ltcue-pointgt -gt ltdwNameDWORDgt

ltdwPositionDWORDgt

ltfccChunkFOURCCgt

ltdwChunkStartDWORDgt

ltdwBlockStartDWORDgt

ltdwSampleOffsetDWORDgt

ltplaylist-ckgt -gt bdquoplst‟ cksize ltdwSegmentsDWORDgt ltplay-segmentgt+

ltplay-segmentgt -gt ltdwNameDWORDgt

ltdwLengthDWORDgt

ltdwLoopsDWORD

ltassoc-data-listgt-gt bdquoLIST‟ cksize adtl ltlabl-ckgtltnote-ckgtltltxt-ckgtltfile-ckgt

ltlabl-ckgt -gt bdquolabl‟ cksize ltdwNameDWORDgt ltdataZSTRgt

ltnote-ckgt -gt bdquonote‟ cksize ltdwNameDWORDgt ltdataZSTRgt

10

ltltxt-ckgt -gt bdquoltxt‟ cksize ltdwNameDWORDgt

ltdwSampleLengthDWORDgt

ltdwPurposeDWORDgt

ltwCountryWORDgt

ltwLanguageWORDgt

ltwDialectWORDgt

ltwCodePageWORDgt

ltdataBYTEgt+

ltfile-ckgt -gt bdquofile‟ cksize ltdwNameDWORDgt ltdwMedTypeDWORDgt

ltfileDataBYTEgt+

Figure 5 Binary File Grammar for RIFF WAVEFORM PCM File Format

5 Recursive Descent Parsers for Binary File Grammars

A recursive descent parser is a top-down parser built from a set of recursive (or non recursive)

procedures where each procedure implements one of the production rules of the grammar A

predictive parser is a recursive descent parser that does not require backtracking Predictive

parsing is possible only for the class of LL(k) grammars which are the context-free grammars

for which there exists some positive integer k that allows a recursive descent parser to decide

which production to use by examining only the next k tokens of input A recursive descent parser

for the binary file format grammars defined in this paper does not require a separate lexer to

identify the data of a file The data types are predicted by the grammar rules

The semantic rules of an L-attributed grammar can be included in the recursive descent parser to

check for context-sensitive aspects of the grammar such as chunk size or image dimensions and

use these values in parsing the chunk data or images The pointers to file addresses that occur in

the file formats are handled by pushing the current address in the file onto a pushdown stack

When the parser exhausts the procedures implementing production rules corresponding to the

pointed to location the topmost address on the pushdown stack is popped and the procedure

corresponding to the nonterminal at that position is executed

Recursive descent parsers with a pushdown stack can be used with binary file format grammars

to create file format recognizers A recognizer (syntax checker or validator) is a parser that reads

a file and generates an error if the file does not conform to the syntax specified by the grammar

11

6 A Parser Generator for Recognizing Binary File Formats

What we want is a parser generator whose input is a binary file attribute grammar for a particular

binary file format and whose generated output is the Java source code of a recursive descent

parser (with a pushdown stack if needed) for the class of binary file formats specified by the

grammar Such a parser generator does not yet exist but there is a widely used parser generator

for attribute grammars for string-based languages

61 ANTLR

ANTLR (ANother Tool for Language Recognition) is a parser generator that uses LL() parsing

[Parr amp Quong 1995 Parr 2007] ANTLR takes as input an attribute grammar that specifies a

language and generates as output source code for a recognizer for that language ANTLR

supports generating code in a number of the programming languages including C Java

JavaScript and Python

ANTLR also generates a lexer from lexical rules in the grammar A lexer is a program that

converts a sequence of characters into tokens A lexer is not needed to parse binary file formats

because the data types predicted by the parser are the tokens of the grammar Furthermore the

capability to recognize LL() grammars is not needed because the binary file format grammars

specified so far do not require lookahead of k symbols

62 Using ANTLR to Generate a Parser for a Binary File Format

It is possible to use ANTLR to test our binary file format grammars in recognizing the chunk-

based and directory-based binary file formats This is accomplished by writing a lexer that treats

each input byte in a file as a character token Then functions are created for each data type in the

binary file grammar that convert the appropriate number of character tokens to binary data types

eg int 16 int32 etc This is effective if someone clumsy and has allowed us to test our

attribute grammars for binary file formats as well as to demonstrate the feasibility of creating a

parser generator for binary file grammars

621 An ANTLR Grammar for an ILBM Chunk-based File Format

Figure 6 shows a specification for the ILBM file format that is input to ANTLR The output of

ANTLR is a lexer and parser for ILBM files

12

grammar iblm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in lexerheader is placed at the top of the lexer file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Grammar Rules used to create the Parser

Start symbol ilbm

iblm returns [boolean result]

FORM ckSize ILBM

Print out ILBM chunksize

Systemoutprintln(ID ILBM Size + $ckSizevalue)

Semantic predicate that ensures that at least on BMHD is found

propertyChunks $propertyChunksfoundBMHD == true

dataChunks

body

$result = true

The bmhd cmap and camg property chunks can occur in any order

At least one BMHD data chunk must be present

propertyChunks returns [boolean foundBMHD]

(bmhd $foundBMHD = true

| cmap

| camg)+

bmhd

BMHD ckSize

Systemoutprintln(ID BMHD Size + $ckSizevalue)

13

bitMapHeader

bitMapHeader

width = uWordValue

height = uWordValue

xPosition = wordValue

yPosition = wordValue

nPanes = uByteValue

masking = uByteValue

compression = uByteValue

reserved1 = uByteValue

transparentColor = uWordValue

xAspect = uByteValue

yAspect = uByteValue

pageWidth = wordValue

pageHeight = wordValue

Print out values in the bitmap structure

Systemoutprintln( Width + $widthvalue)

Systemoutprintln( Height + $heightvalue)

Systemoutprintln( X Position + $xPositionvalue)

Systemoutprintln( Y Position + $yPositionvalue)

Systemoutprintln( Number of Panes + $nPanesvalue)

Systemoutprintln( Masking + $maskingvalue)

Systemoutprintln( Compression + $compressionvalue)

Systemoutprintln( Reserved1 + $reserved1value)

Systemoutprintln( TransParent Color +

$transparentColorvalue)

Systemoutprintln( X Aspest + $xAspectvalue)

Systemoutprintln( Y Aspect + $yAspectvalue)

Systemoutprintln( Page Width + $pageWidthvalue)

Systemoutprintln( Page Height + $pageHeightvalue)

cmap returns [long value]

Initialize the index value i in the cmap function

init int i = 0

CMAP ckSize

Print out chunk size

$value = $ckSizevalue

Systemoutprintln(ID CMAP Size + $ckSizevalue)

(colorRegister

Print out color values

Systemoutprintln( Color[ + i++ + ] +

$colorRegisterredValue +

+ $colorRegistergreenValue +

+ $colorRegisterblueValue +

)

)+

colorRegister returns [int redValue int greenValue int blueValue]

14

red = uByteValue green = uByteValue blue = uByteValue

$redValue = $redvalue

$greenValue = $greenvalue

$blueValue = $bluevalue

camg

CAMG ckSize longValue

Systemoutprintln(ID CAMG Size + $ckSizevalue)

Systemoutprintln( View Modes 0x +

LongtoHexString($longValuevalue))

The crng and ccrt data chunks

dataChunks

crng

| ccrt

crng

CRNG ckSize

Systemoutprintln(ID CRNG Size + $ckSizevalue)

cRange

cRange

pad1 = wordValue $pad1value == 0

rate = wordValue

active = wordValue

low = uByteValue

high = uByteValue

ccrt

CCRT ckSize

Systemoutprintln(ID CCRT Size + $ckSizevalue)

cycleInfo

cycleInfo

direction = wordValue

start = uByteValue

end = uByteValue

seconds = longValue

microseconds = longValue

pad = wordValue $padvalue == 0

body

BODY ckSize

Systemoutprintln(ID BODY Size + $ckSizevalue)

$ckSizevalue gt 0 data

Systemoutprintln( Number of bytes + $datavalue)

$datavalue == $ckSizevalue

defines the data chunk of the body chunk

data returns [long value]

(b+=BYTE)+ $bsize() gt 0

$value = $bsize()

15

The cksize is the value of a long integer

ckSize returns [long value]

longValue

$value = $longValuevalue

defines the longValue data typein terms of BYTE terminals

longValue returns [long value] throws UnsupportedEncodingException

h1 = BYTE h2 = BYTE l1= BYTE l2 = BYTE

$value = (long)

( getUByteValue($h1) ltlt 24 |

getUByteValue($h2) ltlt 16 |

getUByteValue($l1) ltlt 8 |

getUByteValue($l2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uWordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (short)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uByteValue returns [int value]

BYTE

calls the getUByteValue from a BYTE token

16

$value = (int) getUByteValue($BYTE)

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rules used to create the Lexer

FORM FORM

ILBM ILBM

BMHD BMHD

CMAP CMAP

CAMG CAMG

CRNG CRNG

CCRT CCRT

BODY BODY

BYTE

Figure 6 An ANTLR Grammar for the ILBM File Format

622 The Lexer and Parser

When provided the file described in the previous section ANTLR generates the ilbmlexer and

ilbmparser Java programs The Java application shown in Figure 6 uses these programs to

validate the ILBM file BLKIFF It produces the output shown in Figure 7

package orggtrigrammarschunk

import javaioIOException

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRStringStream

import organtlrruntimeCharStream

import organtlrruntimeCommonTokenStream

import organtlrruntimeRecognitionException

17

import organtlrruntimeTokenStream

import gtriorggrammarschunkiblmLexer

import gtriorggrammarschunkiblmParser

public class Test

static String filePath = CUserssl170PicturesIFF-ilbmBLKIFF

public static void main(String[] args) throws RecognitionException IOException

ANTLRFileStream stream = new ANTLRFileStream(filePath ISO-8859-1)

iblmLexer lexer = new iblmLexer(stream)

TokenStream tokenStream = new CommonTokenStream(lexer)

iblmParser parser = new iblmParser(tokenStream)

Systemoutprintln(FILE + filePath)

Systemoutprintln(Successfully parsed = + parseriblm())

Figure 7 A Java Application for Parsing an ILBM File

FILE CUserssl170PicturesIFF-ilbmBLKIFF

ID ILBM Size 2556

ID BMHD Size 20

Width 200

Height 144

X Position 0

Y Position 0

Number of Panes 6

Masking0

Compression1

Reserved10

TransParent Color0

X Aspest3

Y Aspect3

Page Width200

Page Height144

ID CMAP Size 48

Color[0] 0 0 0

Color[1] 0 128 128

Color[2] 0 255 255

Color[3] 255 0 255

Color[4] 0 255 0

Color[5] 255 0 0

Color[6] 0 0 255

Color[7] 255 255 0

Color[8] 128 128 128

Color[9] 192 192 192

Color[10] 128 0 0

Color[11] 0 128 0

Color[12] 128 128 0

Color[13] 0 0 128

Color[14] 128 0 128

18

Color[15] 255 255 255

ID CAMG Size 4

View Modes 0x800

ID BODY Size 2448

Number of bytes 2448

Successfully parsed = true

Figure 8 Output of the ILBM Parser Applied to the file BLKIFF

63 An ANTLR Grammar for the RIFF Waveform PCM File Format

Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR

The output of ANTLR is a lexer and parser for WAVEFOR PCM files

grammar waveform_pcm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text contained in lexerheader is placed at the top of the lexer

file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Default ByteOrder to Little Endian

String byteOrder = II

Rules used to create the Parser

start symbol waveform_pcm

waveform_pcm returns [boolean result]

19

ckId = fourCC $ckIdvalueequals(RIFF)

ckSize = dWordValue

formType = fourCC $formTypevalueequals(WAVE)

formatSubChunk

dataSubChunk

$result = true

fourCC returns [String value]

BYTE BYTE BYTE BYTE $value = $text

formatSubChunk

subChunkId = fourCC $subChunkIdvalueequals(fmt )

ckSize = dWordValue

commonFields

pcmSpecificField

commonFields

formatTag = wordValue Format category

channels = wordValue Number of channels

samplesPerSec = dWordValue Sampling rate

avgBytesPerSec = dWordValue For buffer estimation

blockAlign = wordValue Data block size

pcmSpecificField

bitsPerSample = wordValue Sample size

dataSubChunk

subChunkId = fourCC $subChunkIdvalueequals(data)

ckSize = dWordValue

waveData[$ckSizevalue]

waveData [long size]

(size gt= 0=gt BYTE size--)+

defines the long data type in terms of BYTE terminals

dWordValue returns [long value] throws UnsupportedEncodingException

b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE

if (byteOrderequals(II))

$value = (long)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 |

getUByteValue($b3) ltlt 16 |

getUByteValue($b4) ltlt 24 )

else

$value = (long)

( getUByteValue($b1) ltlt 24 |

getUByteValue($b2) ltlt 16 |

20

getUByteValue($b3) ltlt 8 |

getUByteValue($b4) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

if (byteOrderequals(II))

$value = (int)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 )

else

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rule used to create the Lexer

BYTE

Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format

7 Related Research

Researchers have developed a number of data description languages to support validation of file

formats EAST (Enhanced ADA Subset) is a data description language developed by the

Consultative Committee for Space Data Systems [2010] The Data Entity Dictionary

Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic

information The EAST description is used to interpret and provide access to information in

binary and text files

DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to

manipulate Java jar files and ELF object files The PADSC data description language can be

used to define the file formats of text and binary files [Fisher amp Gruber 2005] The PADSC

compiler compiles the description into tools that can be used to recognize manipulate and

transform the data into other formats

21

The Data Format Description Language (DFDL) is being developed by a Working Group of the

Open Grid Forum [OGF 2011] The data description language is based on a subset of W3C XML

Schema using ltxsappinfogt annotations to carry the extra information necessary to describe non-

XML physical representations Version 1 of the language specification was published in 2011

and a parser is being implemented

Each of the data description languages described above can be used to define data types and file

structures of binary files However they are based on file structure declarations such as those of

C or Java The binary file format grammar described in this paper most closely resembles DFDL

However the binary file grammar described in this paper is the only data description language

based on formal grammars that is used for creating recognizers for file formats

JHOVE (JSTORHarvard Object Validation Environment) is an extensible system designed to

provide automated and efficient identification and validation of the formats of digital files

JHOVE is a format-specific digital object validation API written in Java JHOVE supports

validation of the following formats AIFF ASCII GIF HTML JPEG JPEG 2000 PDF TIFF

UTF-8 WAVE and XML JHOVE2 is second generation validation environment [California

digital Library 2011]

The research reported in this paper is similar in intent to that of the JHOVE projectsmdashvalidation

of binary file formats As a matter of fact binary file grammars and parsers have been

constructed for the chunk-based file formats AIFF JPEG and WAVE as well as the directory-

based format TIFF However the research reported herein differs from that of the JHOVE

project in that what is sought is a technology for generating validators for binary file formats

from grammars specifying the binary file format

8 Conclusion

The research question addressed by his research is whether it is possible to extend the context-

free grammars used to specify the syntax of programming languages to the specification of

binary file formats and to use these grammars with parsers for validating the file formats of

binary files Amajor family of binary file formats chunk-based was described Then extensions

to the concepts of context-free grammars and attribute grammars were described that enable the

specification of binary file formats Examples of attribute array grammars for two chunk-based

binary file format were then presented Recursive descent parsers for this classes of grammars

was then described Experience in using ANTLR a parser generator for LL() string grammars

in generating parsers for recognizing the formats of binary files was discussed Finally related

research in data description languages for binary file formats was described and the JHOVE

project in creating Java-based tools for validating file formats was described

22

It is concluded that it is possible to extend context-free grammars to the specification of chunk-

based binary file formats Furthermore these grammars can be used with recursive descent

parsers for validating the file formats of chunk-based binary files It remains to be determined

whether these attribute grammars based on binary file (array) grammars are adequate to define

other families of binary file formats

To be a practical technology for generating recognizers (validators) for binary file formats a

parser generator is needed that creates from a binary file grammar a recursive descent parser

(with pushdown stack if needed) ANTLR which accepts LL() grammars as input and

generates a lexer as well as a parser is too heavyweight a parser generator for this task and does

not have the requisite data types built in

A capability to validate binary file formats assumes that the file format of a file is already

known In this research we use a file type identifier based on the UNIX file command and a

magic file that we have created [Underwood 2009] We have samples of the chunk-based file

formats discussed in this report including those referenced in Appendix a We also have file

signature tests (magic tests) for all these file types

We would like to have a parser generator (also called a compilerndashcompiler) that would take as

input an attribute grammar for a binary file format and generate a parser or interpreter that

recognizes the file format or that interprets the file format and displays or plays the contents The

closest parser generator that we have found to having these capabilities is ANTLR

(wwwantlrorg ) ANTLR will generate a top-down recursive descent parser from a context-free

attribute grammar However it works with strings and generates a separate scanner

Appendix A lists the names of about 75 file formats that belong to the family of ―chunk-based

file formats We have specifications for each of file formats and samples of files with these

formats

These file formats have a similar structure We have developed binary file grammars for two of

these binary file formats We should be able to do so for all of them We used these grammars

with the extensions to ANTLR to generate top-down recursive descent parsers for the file

formats defined with these grammars

The significance of success in this research task is that if binary file formats can be specified

with binary file grammars then only one parsing generator is needed to generate the parsers for

verifying that a file conforms to a particular format Similarly the same parsing generator can be

used for conversion of legacy file formats to current or standard formats Finally the same parser

generator can be used for generating viewerplayers for most file formats This would increase

23

the likelihood of preserving and making available into the indefinite future hose digital records

encoded in binary file formats

24

References

[Aho 1968] A V Aho Indexed Grammars ndash An Extension of Context-free Grammars Journal

of the ACM Vol 15 Issue 4 Oct 1968

[Alias-wavefront 1999] Maya File Formats Maya 20 1999

httpcaadarchethzchinfomayamanual

[Amigan 2011] Amigan Software IFF FORM and Chunk Registry

httpamigan1emunetregiffhtml

[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description

May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description

[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats

Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21

[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files

Version 13 Apple Computer Inc January 1989

wwwaesnashvilleorgPDFsAIFF20Specspdf

[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-

webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt

[Apple 2005] Apple Core Audio Format Specification 2005b

httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_

specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101

[Apple 2007] QuickTime File Format Specification

httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref

acehtml

[ASF 2001] ASF format version 10 reconstruction January 2001

httpavifilesourceforgenetasf-10htm

[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January

1997 wwwdcsedacukhomemxrgfx3d3DSspec

[Autodesk 2009] Autodesk DXF Reference January 2009

httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112

25

[Back 2002] G Back (2002) A specification and scripting language for binary data Generative

Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-

77

[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V

22 1979 pp 243-256

[Baum amp Noel 2002]D Baum amp G Noel The Simms Interchange File Format 2002

httpsimtechsourceforgenettechiffhtml

[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New

Horizons Software Inc 1987 httpamigan1emunetregWORDtxt

[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995

ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt

[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010

httpondasourceforgenetnlfnlfFormathtml

[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm amp Onda file

formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml

[California Digital Library 2011] California Digital Library JHOVE2 Userlsquos Guide 2011

httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf

[Caligari 1998] Caligri Caligari trueSpace file format Document Version 6 July

2002 httpwwwcaligaricomdownloadzipscncob65zip

[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information

Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements

and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-

t81pdf

[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description

Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf

[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998

[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt

[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D

Taliesin Inc httpamigan1emunetregDR2Dtxt

26

[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG

decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt

[Dunckley et al 2007] Dunckley M Rankin S Conway E amp Giaretta D (2007) The use of

file description languages for file format identification and validation Proc PV 2007

Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical

Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-

detailsw=50089

[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format

Tech 3285 July 2001 httptechebuchdocstechtech3285pdf

Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml

[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd

May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html

[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495

wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml

[Fisher amp Gruber 2005] Fisher K amp Gruber R (2005) PADS A domain specific language for

processing ad hoc data ACM Conference on Programming language Design and

Implementation ACM Press httpwwwpadsprojorgpaperspldipdf

[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG

1995

[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal

Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September

1997 wwwgnelsondemoncoukzspecquetzalhtml

[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called

Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)

httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt

[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03

httpwwwanim8orcomresourcesan8_formattxt

[Google 2010] WebP RIFF Container 2010

httpcodegooglecomspeedwebpdocsriff_containerhtml

[Haaland] Mike Haaland Flic Files (FLI) Format description

httpwwwmartinreddynetgfxanimFLItxt

27

[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube

Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf

[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice

Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt

[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata

Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1

1995

[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)

August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21

[IBM amp Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data

Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt

wwwtactilemediacominfoMCI_Control_Infohtml

httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf

[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec

11 2008 httpduskblueorgprojtoymidimidiformatpdf

[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10

ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf

[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable

Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C

Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110

[Issar amp Ward 1993] S Issar and W Ward CMUlsquos robust spoken language understanding

system EUOSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf

[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software

November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application

Version 50) wwwjasccomsupportkbarticlespspspecasp

[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image

File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version

22 April 2002

httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf

[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994

httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt

28

[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory

vol 2 pp 127_145 1968

[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical

Systems Theory vol 5 pp 95_96 1971

[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)

North-Holland Publ Co Amsterdam 1971 pp 95-109

[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language

Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml

httpwebcswpiedu~kalcoursescompilersmodule4mysahtml

[LightWave 1996] LightWave 3D Object File Format Oct 16 1996

httpsandboxdeosglightwavehtm

[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWo2] November 9 2001

httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml

[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer

1997 httppaulbourkenetdataformatscinema4dcinema4dpdf

[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and

Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf

[Maishein] Max Maischein PIC(PRO) Format

http14791177212extrafileformatgraphicspicpicprotxt

[Martin 2007] Michael Martin SVAF format description Version 02 January 2007

httpsvafsourceforgenetsvafhtml

[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and

Data Techniques July 29 1992 Revision 1097

httpnetghostnarodrugffvendspecmicriffms_rifftxt

[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New

Multimedia Data Types and Data Techniques Revision 30 April 15 1994

[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE

httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-

usstreamhhstreamaudprop_7wz7asp

[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003

December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx

29

[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-

uslibraryms899421aspx

[Morrison 1985] Jerry Morrison ―EA IFF 85rdquo Standard for Interchange Format Files

Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt

[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17

1986 wwwfine-viewcomjplabsdocilbmtxt

[Morrison 1986b] J Morrison ―SMUS IFF Simple Musical Score Electronic Arts Inc

February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus

[MultimediaWiki 2006] Smush Animation Format August 2006

httpwikimultimediacxindexphptitle=SANChunk_Forma

[MultimeduiaWiki 2008a] Electronic Arts CMV April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_CMV

[MultimediaWiki 2008b] Electronic Arts DCT April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_DCT

[MultimediaWiki 2008c] Electronic Arts MAD April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MAD

[MultimediaWiki 2008d] Electronic Arts MPC April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MPC

[MultimediaWiki 2008e] Electronic Arts VP6 April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_VP6

[MultimediaWiki 2008f] Electronic Arts TGV August 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGV

[MultimediaWiki 2008g] Electronic Arts TGQ November 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ

[MultimediaWiki 2008h] American Laser Games MM January 2008

httpwikimultimediacxindexphptitle=American_Laser_Games_MM

[MultimediaWiki 2009a] Electronic Arts TQI February 2009

httpwikimultimediacxindexphptitle=Electronic_Arts_TQI

30

[MultimediaWiki 2009b] SANM October 2009

httpwikimultimediacxindexphptitle=SANM

[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240

httpcodegooglecompll-parser-

generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42

[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter

written in Python in the open source graphics file convertor UniConvertor

httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer

[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification

January 31 2011 wwwogforgdocumentsGFD174pdf

[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988

wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt

[Parr amp Quong 1995] Parr T J amp Quong R W (1995) ANTLR A predicated- ll(k) parser

generator Software Practice and Experience 25 7 (July) 789-810

httpwwwantlrorgarticle1055550346383antlrpdf

[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific

Languages Pragmatic

[Peters] Christoph Peters The Ultimate 3D model file format (v 20)

httpultimate3dorgU3DModelFileFormatV2htm

[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003

httpxiphorgoggdocrfc3533txt

[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20

wwweblongcomzarfblorb

httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt

[ Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011

wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml

[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)

Format Version 10 2001 wwwlibpngorgpubmngspecCredits

31

[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File

Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008

httpscopenofficeorgexcelfileformatpdf

[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF

[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments

3072 March 2001 wwwietforgrfcrfc3072txt

[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft

26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt

[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and

old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21

[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988

httpamigan1emunetregANIMtxt

[Tamir 2005] Giora Tamir C RIFF Parser June 2005

wwwcodeprojectcomKBfilesriffparseraspx

[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications

In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi

Khosrow-Pour Information Science Publishing pp 268-273 2008

wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf

[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for

File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute

September 2009

[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical

Engineering The University of Queensland 1996

httpiteeuqeduau~cristinastudentsdaviddavidhtml

[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language

understanding system ARPA Human Language Technologies Workshop Plainsboro New

Jersey httpaclldcupenneduHH94H94-1039pdf

[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars

Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)

Springer-Verlag Berlin 1974 pp 9-16

32

[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley

1995 (chapter 10) (Attribute Grammars)

[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010

httpxiphorgvorbisdocVorbis_I_spechtml

[Zappe 1994] H Zappe Oktalyzer 1994

httpwwwwotsitorglistaspsearch=oktampbutton=GO21

33

Appendix A Chunk-based Binary File Formats

This appendix references specifications for about 75 chunk-based binary file formats and

discusses some of those specifications

A1 Electronic Arts Interchange File Format (IFF)

[Amigan 2011]

Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the

specification for the Interchange File Format [Morrison 1985]

A11 Interleaved Bitmap (ILBM)

[Morrison 1986]

This file format was discussed in section 41 of this report

A12 8-bit Sampled Voice (8SVX)

The file format specification for the 9-bit sampled voice file format uses C programming

language data structures to define the format [Hayes and Morrison 1985] The following is a re-

expression of that specification as a context-free binary file grammar

Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata

cksize rarr ULONG

ckdata rarr voice8header bodyck

ckdata rarr voice8header nameck copyrightck bodyck

ckdata rarr voice8header copyrightck nameck bodyck

ckdata rarr voice8header nameck bodyck

ckdata rarr voice8header copyrightck bodyck

ckdata rarr voice8header authorck bodyck

ckdata rarr voice8header annotation bodyck

ckdata rarr voice8header atakck releaseck bodyck

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 5: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

v

A189 Electronic Arts TQI 42

A19 IMAGINE 3D DATA DESCRIPTION (TDDD) 42

A20 LIGHTWAVE 3D OBJECTS 42

A201 LWOB 42

A202 LWO2 42

A203 LWLO 42

A21 LUXOLOGY MODO 3D OBJECT (LXOB) 42

A22 PAINT SHOP PRO (PSP) 42

A23 PROVECTOR 2D DRAWING 43

A24 PROWRITE DOCUMENT 43

A25 INTERACTIVEFICTION 43

A251 Blorb Z-machine Resource File 43

A252 Quetzal 43

A26 APPLE QUICKTIME 43

A27 MAYA IFF BITMAP 43

A28 APPLE CORE AUDIO FORMAT 43

A29 AMERICAN LASER GAMES (MM) 43

A30 SMUSH ANIMATION FORMAT 43

A301 SAN 43

A302 SNM 43

A31 STRUCTURED DATA EXCHANGE FORMAT (SDXF) 43

A32 OGG VORBIS 44

A33 SIMPLE VECTOR ARRAY FORMAT 44

A34 NESTED LIST FILE 44

A35 CINEMA 4D 44

A36 THE SIMS INTERCHANGE FILE FORMAT 44

vi

List of Figures

Figure 1 File format (Layout) of a RIFF container for a Webp Image 2

Figure 2 ILBM File hampic Displayed 7

Figure 3 An Attribute Grammar for an ILBM Chunk-based File Format 7

Figure 4 A Parse Tree for the ILBM File hampic 8

Figure 5 Binary File Grammar for RIFF WAVEFORM PCM File Format 10

Figure 6 An ANTLR Grammar for the ILBM File Format 16

Figure 7 A Java Application for Parsing an ILBM File 17

Figure 8 Output of the ILBM Parser Applied to the file BLKIFF 18

Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format 20

1

1 Introduction

11 Background

Automated tools are required for identifying and validating the formats of the huge number of

files ingested into digital data and record archives Automated tools are also needed for

viewingplaying text and binary file formats and for converting legacy and obsolete file formats

to standard or current formats Technologies such as data description languages have emerged to

address these preservation challenges [Dunckley et al 2007]

Context-free grammars have been used to specify the syntax of programming languages

Attribute grammars provide a framework for formally specifying the semantics of a language

based on its context-free grammar and for addressing the mildly context-sensitive features of

programming languages such as agreement of the data types of variables in expressions with data

type declarations Such grammars have parsingtranslation algorithms that can be used for

syntax-checking and interpretation or translation of the languages the grammars define

12 Purpose

The research question addressed by this research is whether it is possible to extend the context-

free grammars used to specify the syntax of programming languages to the specification of

binary file formats and to use these grammars with parsers for validating the file formats of

binary files

13 Scope

The next section discusses the traditional approach to specifying file formats and two of the

major families of binary file formats In section 3 extensions to the concepts of context-free

grammars and attribute grammars are described that enable the specification of binary file

formats In section 4 examples of attribute array grammars for a chunk-based binary file formats

are presented In section 5 recursive descent parsers for these classes of grammars are described

In section 6 experience in using ANTLR a parser generator for LL() string grammars in

generating parsers for recognizing the formats of binary files is discussed In section 7 related

research is described Section 8 summarizes the results

2 Binary File Formats

Traditionally a file (or record) layout was a form showing how fields were positioned in a file

(or record) The fields are named and have as attributes a data type length and sometimes a

constant value Fields are offset at addresses relative to the beginning of a file File formats are

still specified in this manner For instance Figure 1 shows the format layout of a RIFF container

for the VP8 codec of a Webp image [Google 2010]

2

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk starting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

Figure 1 File format (Layout) of a RIFF container for a Webp Image

Many binary file formats are specified using pseudo-regular expressions or pseudo-EBNF

notation EBNF (Extended Backus-Naur Form) is a notation for expressing context-free

grammars in a compact human readable way Pseudo-EBNF is similar in concept to pseudocode

which is a high-level description of a computer algorithm that is intended for human

understanding but that omits details that would be necessary for computer execution The

pseudo-EBNF (or regular expression) specification of a file format is augmented with a natural

language description of the details that are not actually expressible in the EBNF (or regular

expression) notation These details are often the context-sensitive relationships of the size of an

array to its actual length or the relationship of an address pointer to the actual location of the

data pointed to in the file These context-sensitive relationships are not expressible in a context-

free grammar (or EBNF notation) It is a goal of this research to extend the EBNF notation and

concept of a context-free grammar to include the details that are necessary to precisely specify

the file formats of binary file formats that include these context-sensitive features

Binary file formats with a similar file structure are referred to as a family of file formats There

are two families of binary file formats that are readily distinguished the chunk-based and the

directory-based file formats

Chunk-based file formats were created by Electronic Arts and Commodore-Amiga as the

Interchange File Format (IFF) [Morrison 1985] An IFF file itself is one entire IFF chunk A

chunk consists of an ID tag a size and size bytes of data The data may include chunks called

subchunks Subchunks have the same structure as chunks Subchunks can have subchunks

Chunk-based file formats are primarily used as containers for multimedia but have also been

used for word processing files including pictures and other figures File format specifications for

chunk-based formats may use other terms than chunks in describing the format for instance

atoms in QuickTimeMP4 segments in JPEG and ―tagged data representations as in

AutoCAD DXF

3

The Audio Interchange File Format (AIFF) developed by Apple computer in 1988 was based on

Electronic Arts Interchange File Format Apple Core Audio Format (CAF) Applelsquos replacement

format for AIFF remains a chunk-based format [Apple Computer 2005] The Resource

Interchange File Format (RIFF) introduced by IBM and Microsoft [1991] is also based on

Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-Video

Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis WebP a picture format

recently introduced by Google[2010] also uses RIFF as a container

Microsoftlsquos Advanced Systems Format (ASF) [2004] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV) Microsoftlsquos Binary Interchange File Format [Rentz 2008]

(Microsoft Excellsquos File Format) is chunk-based

Another family of binary file formats is directory-based A directory-based file format consists of

one or more directory tables which contain one or more directory entries A directory entry

specifies where the actual data for a type of information is located This is the scheme used in

TIFF files OLE (Microsoft Object Linking and Embedding) files OASIS OpenDocument and

Microsoft Open Office files

3 Context-Free Grammars and Attribute Grammars

Context-free grammars have been widely used to define programming languages A context-free

grammar G is a quadruple ltN Σ S Pgt where

N is a finite set of non-terminal symbols

Σ is a finite set of terminal symbols

S ε N is the start symbol

P is a set of production rules of the form N rarr N Σ

The set of all strings over a vocabulary Σ is denoted by Σ The language generated by a context-

free grammar G is the set L(G) = w w Σ and S w

31 Context-Free Grammars for Arrays of Data Types

An array is a data structure consisting of a collection of elements (values or variables) each

identified by an index An array is stored so that the position of each element can be computed

from its index For example an array of 10 integer variables with indices 0 through 9 may be

4

stored as 10 (4 byte) words at file addresses 0 4 8 hellip36 so that the element with index i has the

address 4 times i

Arrays are used to implement many other data structures such as lists and strings In most

modern computers the internal memory and external memory of storage devices (and files on

those devices) is a one-dimensional array of data types whose indices are their addresses

A data type is a classification of one of various types of data such as floating-point integer

character pointer or Boolean that determines the possible values for that type the operations

that can be performed on values of that type and the way values of that type can be stored Data

types have names such as int16 int32 float char bool ptr We define a binary file to be an array

of values of data of various types

We define a context-free binary file grammar BG as a quintuple ltN D Σ S Pgt where

N is a finite set of non-terminal symbols

D is a set of data types

Σ is a finite set of binary values of data types D called terminals

S ε N is the start symbol

P is a set of production rules of the form N rarr N Σ

Let DataTypes indicate the union of all the values of all datatypes D The set of all binary files is

BinaryFiles = [IndicesrarrDataTypes] where Indices = 1 The language generated by a

context-free array (binary file) grammar AG is the set L(AG) = w w BinaryFiles and S w

By primitive nonterminal is meant a symbol which would be replaced by one or more terminal

symbols were an additional production rule applied In other words a primitive nonterminal is a

nonterminal that appears on the left-hand side of productions which only have one or more

terminal symbols on the right We adopt the notation shown below for primitive nonterminals

ltprimitive_nonterminal_symbol type=datatype_name constant=valuegt

32 Attribute Grammars

Context-free grammars cannot represent context-sensitive aspects of programming languages

such as (1) enforcing the constraint that all variables are declared before they are used or (2)

checking the number of parameters in a function call against the number in the functionlsquos

declaration Context-free grammars have also been used to define the syntax of English and other

natural languages but they cannot define such context-sensitivity as subject verb agreement in

English sentences A number of extensions of context-free grammars have been proposed to

5

address the context-sensitivity of programming languages and natural languages These include

indexed [Aho 1968] recording [Barth 1979] affix [Koster 1971] VW [van Wijngaarden 1974]

and attribute grammars [Knuth 1969 1971]

Context-free grammars also cannot represent the semantics of programming languages Knuth

[1968 1971] proposed an extension of context-free grammars termed attribute grammars that

addresses the semantics as well as the context-sensitivity of programming languages

An attribute grammar GA is a triple ltG A ARgt where G is a context-free grammar for the

language A associates each grammar symbol X isin (N Σ) with a set of attributes and AR

associates each production R isin P with a set of attribute computation rules A(X) where X isin (N

Σ) can be further partitioned into two sets synthesized attributes S(X) and inherited attributes

I(X) AR(R) where R isin P contains rules for computing inherited and synthesized attributes

associated with the symbols in the production R Synthesized attributes are evaluated and passed

up from deeper terminals and non-terminals while inherited attributes are passed down from

upper non-terminals

An L-attributed grammar is an attribute grammar that uses

i) synthesized attributes and

ii) inherited attributes where the inherited attribute depends only on

a) attributes of the other child nodes to its left

b) inherited attributes of P itself

L-attributed grammars allow the attributes to be evaluated in one left-to-right traversal of the

abstract syntax tree As a result attribute evaluation in L-attributed grammars can be

incorporated conveniently in top-down parsing

33 Semantic Grammar

An interpreter for a binary file format that is using a grammar to display or play the contents of a

file or to convert it to another file format needs to produce a semantic representation that is

appropriate to the task Natural language understanding components of dialog systems face a

similar need to produce a semantic representation that is appropriate to the dialogue task To

address this need some developers of dialogue systems make use of an approach called semantic

grammar [Issar and Ward 1993 Ward and Issar 1994] A semantic grammar is a CFG in which

the non-terminals on the left-hand side of rules correspond to semantic entities A parser for a

semantic grammar produces a labeling of the input string with semantic node labels for example

SHOW FLIGHTS ORIGIN DESTINATION DEPART_DATE DEPART_TIME Show me flights from Boston to San Francisco on Tuesday morning

6

A semantic grammar is just what we need to file in the values of a programming language data

structure Semantic grammars are not applicable to generic systems but are limited to specific

domains This is not a problem in specifying or parsing a binary file format since a grammar for

a binary file format is not meant to apply to all binary file formats but just to a single format or

to a family of file formats

4 Grammars for Chunk-based Binary File Formats

The attribute grammar that we present will be a mixture of synthesized semantic rules (attribute

values of the left side of a production are calculated from attribute values of symbols on the right

side) and inherited rules (attribute values of symbols on the right side of a production are

calculated from attribute values of the symbol on the left side)

The field names of the file format spec are represented as nonterminals in the grammar Constant

values in fields are terminals that appear on the right-hand side of productions Nonterminal

fieldnames have data types associated with them

41 InterLeaved BitMap (ILBM)

The author of the Interleaved Bitmap (ILBM) file format recognized that the structure of the file

format could be specified using pseudo EBNF rules and C data structure declarations [Morrison

1986] In this section the ―grammar of the ILBM file format specification is converted to a

context-free binary grammar with synthesized and inherited attributes

Figure 2 shows a display of an ILBM file hampic Figure 3 shows an attribute grammar for its

chunk-based binary file format The attribute grammar is a mixture of synthesized semantic rules

(attribute values of the left side of a production are calculated from attribute values of symbols

on the right side) and inherited rules (attribute values of symbols on the right side of a

production are calculated from attribute values of the symbol on the left side) The attribute

grammar is also an L-attributed grammar Figure 4 shows a parse tree for the file hampic

7

Figure 2 ILBM File hampic Displayed

ltILBMgt rarr ―FORM ltcksize TYPE=UINT32gt ―ILBM ltBMHDgt ltCMAPgt ltCAMGgt ltBODYgt

ILBMcksize = cksizevalue

ltBMHDgt rarr ―BMHD ltcksize TYPE=uint32gt ltBitmapHeadergt

ltBitMapheadergt rarr ltwidth type=UINT32gt

lt height type=UINT16gt

lt xposition type=INT16gt

ltyposition type=INT16gt

ltnplanes type=BYTEgt

ltmasking type=BYTEgt

ltcompression type=BYTEgt

ltreserved type=BYTEgt

lttransparentcolor type=UINT16gt

ltxaspect type=BYTEgt

ltyaspect type=BYTEgt

ltpagewidth type=INT16gt

ltpageheight type=INT16gt

ltCMAPgt -rarr―CMAP ltcksize type=UINT32gt ltcolorgt[n] n=cksizeval3

ltCAMGgt rarr ―CAMG ltcksizetype=UINT32gt ltviewmode type=INT32gt

ltBODYgt rarr ―BODY ltcksize type=UINT32gt ltdata type=BYTEgt[cksize]

ltcolorgt rarr ltred type=BYTEgt ltgreen type= BYTEgt ltblue type=BYTEgt

Figure 3 An Attribute Grammar for an ILBM Chunk-based File Format

8

Figure 4 A Parse Tree for the ILBM File hampic

9

42 RIFF Waveform PCM Audio File Format (WAVE)

The waveform PCM File format is was originally specified in an extended BNF notation [IBM amp

Microsoft 1991] The chunk size was omitted from the EBNF rules Figure 5 shows the EBNF

rules with the cksize inserted To convert this grammar into a binary file format grammar data

types are added for primitive non-terminals

ltWAVE-formgt -gt bdquoRIFF‟ cksize WAVE

ltfmt-ckgt

[ltfact-ckgt]

[ltcue-ckgt]

[ltplaylist-ckgt]

[ltassoc-data-listgt]

ltwave-datagt

ltfmt-ckgt -gt bdquofmt ‟ cksize ltcommon-fieldsgt ltwBitsPerSampleWORDgt

ltcommon-fieldsgt -gt ltwFormatTagWORD value=0x0001gt

ltwChannelsWORDgt

ltdwSamplesPerSecDWORDgt

ltdwAvgBytesPerSecDWORDgt

ltwBlockAlignWORDgt

ltwave-datagt -gt ltdata-ckgt | ltwave-listgt

ltdata-ckgt -gt data cksize ltwave-datagt

ltwave-listgt -gt LIST cksize wavl ltdata-ckgt | ltsilence-ckgt

ltsilence-ckgt -gt slnt cksize ltdwSamplesDWORDgt

ltfact-ckgt -gt factxksize ltdwFileSizeDWORDgt

ltcue-ckgt -gt cue cksize ltdwCuePointsDWORDgt ltcue-pointgt+

ltcue-pointgt -gt ltdwNameDWORDgt

ltdwPositionDWORDgt

ltfccChunkFOURCCgt

ltdwChunkStartDWORDgt

ltdwBlockStartDWORDgt

ltdwSampleOffsetDWORDgt

ltplaylist-ckgt -gt bdquoplst‟ cksize ltdwSegmentsDWORDgt ltplay-segmentgt+

ltplay-segmentgt -gt ltdwNameDWORDgt

ltdwLengthDWORDgt

ltdwLoopsDWORD

ltassoc-data-listgt-gt bdquoLIST‟ cksize adtl ltlabl-ckgtltnote-ckgtltltxt-ckgtltfile-ckgt

ltlabl-ckgt -gt bdquolabl‟ cksize ltdwNameDWORDgt ltdataZSTRgt

ltnote-ckgt -gt bdquonote‟ cksize ltdwNameDWORDgt ltdataZSTRgt

10

ltltxt-ckgt -gt bdquoltxt‟ cksize ltdwNameDWORDgt

ltdwSampleLengthDWORDgt

ltdwPurposeDWORDgt

ltwCountryWORDgt

ltwLanguageWORDgt

ltwDialectWORDgt

ltwCodePageWORDgt

ltdataBYTEgt+

ltfile-ckgt -gt bdquofile‟ cksize ltdwNameDWORDgt ltdwMedTypeDWORDgt

ltfileDataBYTEgt+

Figure 5 Binary File Grammar for RIFF WAVEFORM PCM File Format

5 Recursive Descent Parsers for Binary File Grammars

A recursive descent parser is a top-down parser built from a set of recursive (or non recursive)

procedures where each procedure implements one of the production rules of the grammar A

predictive parser is a recursive descent parser that does not require backtracking Predictive

parsing is possible only for the class of LL(k) grammars which are the context-free grammars

for which there exists some positive integer k that allows a recursive descent parser to decide

which production to use by examining only the next k tokens of input A recursive descent parser

for the binary file format grammars defined in this paper does not require a separate lexer to

identify the data of a file The data types are predicted by the grammar rules

The semantic rules of an L-attributed grammar can be included in the recursive descent parser to

check for context-sensitive aspects of the grammar such as chunk size or image dimensions and

use these values in parsing the chunk data or images The pointers to file addresses that occur in

the file formats are handled by pushing the current address in the file onto a pushdown stack

When the parser exhausts the procedures implementing production rules corresponding to the

pointed to location the topmost address on the pushdown stack is popped and the procedure

corresponding to the nonterminal at that position is executed

Recursive descent parsers with a pushdown stack can be used with binary file format grammars

to create file format recognizers A recognizer (syntax checker or validator) is a parser that reads

a file and generates an error if the file does not conform to the syntax specified by the grammar

11

6 A Parser Generator for Recognizing Binary File Formats

What we want is a parser generator whose input is a binary file attribute grammar for a particular

binary file format and whose generated output is the Java source code of a recursive descent

parser (with a pushdown stack if needed) for the class of binary file formats specified by the

grammar Such a parser generator does not yet exist but there is a widely used parser generator

for attribute grammars for string-based languages

61 ANTLR

ANTLR (ANother Tool for Language Recognition) is a parser generator that uses LL() parsing

[Parr amp Quong 1995 Parr 2007] ANTLR takes as input an attribute grammar that specifies a

language and generates as output source code for a recognizer for that language ANTLR

supports generating code in a number of the programming languages including C Java

JavaScript and Python

ANTLR also generates a lexer from lexical rules in the grammar A lexer is a program that

converts a sequence of characters into tokens A lexer is not needed to parse binary file formats

because the data types predicted by the parser are the tokens of the grammar Furthermore the

capability to recognize LL() grammars is not needed because the binary file format grammars

specified so far do not require lookahead of k symbols

62 Using ANTLR to Generate a Parser for a Binary File Format

It is possible to use ANTLR to test our binary file format grammars in recognizing the chunk-

based and directory-based binary file formats This is accomplished by writing a lexer that treats

each input byte in a file as a character token Then functions are created for each data type in the

binary file grammar that convert the appropriate number of character tokens to binary data types

eg int 16 int32 etc This is effective if someone clumsy and has allowed us to test our

attribute grammars for binary file formats as well as to demonstrate the feasibility of creating a

parser generator for binary file grammars

621 An ANTLR Grammar for an ILBM Chunk-based File Format

Figure 6 shows a specification for the ILBM file format that is input to ANTLR The output of

ANTLR is a lexer and parser for ILBM files

12

grammar iblm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in lexerheader is placed at the top of the lexer file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Grammar Rules used to create the Parser

Start symbol ilbm

iblm returns [boolean result]

FORM ckSize ILBM

Print out ILBM chunksize

Systemoutprintln(ID ILBM Size + $ckSizevalue)

Semantic predicate that ensures that at least on BMHD is found

propertyChunks $propertyChunksfoundBMHD == true

dataChunks

body

$result = true

The bmhd cmap and camg property chunks can occur in any order

At least one BMHD data chunk must be present

propertyChunks returns [boolean foundBMHD]

(bmhd $foundBMHD = true

| cmap

| camg)+

bmhd

BMHD ckSize

Systemoutprintln(ID BMHD Size + $ckSizevalue)

13

bitMapHeader

bitMapHeader

width = uWordValue

height = uWordValue

xPosition = wordValue

yPosition = wordValue

nPanes = uByteValue

masking = uByteValue

compression = uByteValue

reserved1 = uByteValue

transparentColor = uWordValue

xAspect = uByteValue

yAspect = uByteValue

pageWidth = wordValue

pageHeight = wordValue

Print out values in the bitmap structure

Systemoutprintln( Width + $widthvalue)

Systemoutprintln( Height + $heightvalue)

Systemoutprintln( X Position + $xPositionvalue)

Systemoutprintln( Y Position + $yPositionvalue)

Systemoutprintln( Number of Panes + $nPanesvalue)

Systemoutprintln( Masking + $maskingvalue)

Systemoutprintln( Compression + $compressionvalue)

Systemoutprintln( Reserved1 + $reserved1value)

Systemoutprintln( TransParent Color +

$transparentColorvalue)

Systemoutprintln( X Aspest + $xAspectvalue)

Systemoutprintln( Y Aspect + $yAspectvalue)

Systemoutprintln( Page Width + $pageWidthvalue)

Systemoutprintln( Page Height + $pageHeightvalue)

cmap returns [long value]

Initialize the index value i in the cmap function

init int i = 0

CMAP ckSize

Print out chunk size

$value = $ckSizevalue

Systemoutprintln(ID CMAP Size + $ckSizevalue)

(colorRegister

Print out color values

Systemoutprintln( Color[ + i++ + ] +

$colorRegisterredValue +

+ $colorRegistergreenValue +

+ $colorRegisterblueValue +

)

)+

colorRegister returns [int redValue int greenValue int blueValue]

14

red = uByteValue green = uByteValue blue = uByteValue

$redValue = $redvalue

$greenValue = $greenvalue

$blueValue = $bluevalue

camg

CAMG ckSize longValue

Systemoutprintln(ID CAMG Size + $ckSizevalue)

Systemoutprintln( View Modes 0x +

LongtoHexString($longValuevalue))

The crng and ccrt data chunks

dataChunks

crng

| ccrt

crng

CRNG ckSize

Systemoutprintln(ID CRNG Size + $ckSizevalue)

cRange

cRange

pad1 = wordValue $pad1value == 0

rate = wordValue

active = wordValue

low = uByteValue

high = uByteValue

ccrt

CCRT ckSize

Systemoutprintln(ID CCRT Size + $ckSizevalue)

cycleInfo

cycleInfo

direction = wordValue

start = uByteValue

end = uByteValue

seconds = longValue

microseconds = longValue

pad = wordValue $padvalue == 0

body

BODY ckSize

Systemoutprintln(ID BODY Size + $ckSizevalue)

$ckSizevalue gt 0 data

Systemoutprintln( Number of bytes + $datavalue)

$datavalue == $ckSizevalue

defines the data chunk of the body chunk

data returns [long value]

(b+=BYTE)+ $bsize() gt 0

$value = $bsize()

15

The cksize is the value of a long integer

ckSize returns [long value]

longValue

$value = $longValuevalue

defines the longValue data typein terms of BYTE terminals

longValue returns [long value] throws UnsupportedEncodingException

h1 = BYTE h2 = BYTE l1= BYTE l2 = BYTE

$value = (long)

( getUByteValue($h1) ltlt 24 |

getUByteValue($h2) ltlt 16 |

getUByteValue($l1) ltlt 8 |

getUByteValue($l2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uWordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (short)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uByteValue returns [int value]

BYTE

calls the getUByteValue from a BYTE token

16

$value = (int) getUByteValue($BYTE)

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rules used to create the Lexer

FORM FORM

ILBM ILBM

BMHD BMHD

CMAP CMAP

CAMG CAMG

CRNG CRNG

CCRT CCRT

BODY BODY

BYTE

Figure 6 An ANTLR Grammar for the ILBM File Format

622 The Lexer and Parser

When provided the file described in the previous section ANTLR generates the ilbmlexer and

ilbmparser Java programs The Java application shown in Figure 6 uses these programs to

validate the ILBM file BLKIFF It produces the output shown in Figure 7

package orggtrigrammarschunk

import javaioIOException

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRStringStream

import organtlrruntimeCharStream

import organtlrruntimeCommonTokenStream

import organtlrruntimeRecognitionException

17

import organtlrruntimeTokenStream

import gtriorggrammarschunkiblmLexer

import gtriorggrammarschunkiblmParser

public class Test

static String filePath = CUserssl170PicturesIFF-ilbmBLKIFF

public static void main(String[] args) throws RecognitionException IOException

ANTLRFileStream stream = new ANTLRFileStream(filePath ISO-8859-1)

iblmLexer lexer = new iblmLexer(stream)

TokenStream tokenStream = new CommonTokenStream(lexer)

iblmParser parser = new iblmParser(tokenStream)

Systemoutprintln(FILE + filePath)

Systemoutprintln(Successfully parsed = + parseriblm())

Figure 7 A Java Application for Parsing an ILBM File

FILE CUserssl170PicturesIFF-ilbmBLKIFF

ID ILBM Size 2556

ID BMHD Size 20

Width 200

Height 144

X Position 0

Y Position 0

Number of Panes 6

Masking0

Compression1

Reserved10

TransParent Color0

X Aspest3

Y Aspect3

Page Width200

Page Height144

ID CMAP Size 48

Color[0] 0 0 0

Color[1] 0 128 128

Color[2] 0 255 255

Color[3] 255 0 255

Color[4] 0 255 0

Color[5] 255 0 0

Color[6] 0 0 255

Color[7] 255 255 0

Color[8] 128 128 128

Color[9] 192 192 192

Color[10] 128 0 0

Color[11] 0 128 0

Color[12] 128 128 0

Color[13] 0 0 128

Color[14] 128 0 128

18

Color[15] 255 255 255

ID CAMG Size 4

View Modes 0x800

ID BODY Size 2448

Number of bytes 2448

Successfully parsed = true

Figure 8 Output of the ILBM Parser Applied to the file BLKIFF

63 An ANTLR Grammar for the RIFF Waveform PCM File Format

Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR

The output of ANTLR is a lexer and parser for WAVEFOR PCM files

grammar waveform_pcm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text contained in lexerheader is placed at the top of the lexer

file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Default ByteOrder to Little Endian

String byteOrder = II

Rules used to create the Parser

start symbol waveform_pcm

waveform_pcm returns [boolean result]

19

ckId = fourCC $ckIdvalueequals(RIFF)

ckSize = dWordValue

formType = fourCC $formTypevalueequals(WAVE)

formatSubChunk

dataSubChunk

$result = true

fourCC returns [String value]

BYTE BYTE BYTE BYTE $value = $text

formatSubChunk

subChunkId = fourCC $subChunkIdvalueequals(fmt )

ckSize = dWordValue

commonFields

pcmSpecificField

commonFields

formatTag = wordValue Format category

channels = wordValue Number of channels

samplesPerSec = dWordValue Sampling rate

avgBytesPerSec = dWordValue For buffer estimation

blockAlign = wordValue Data block size

pcmSpecificField

bitsPerSample = wordValue Sample size

dataSubChunk

subChunkId = fourCC $subChunkIdvalueequals(data)

ckSize = dWordValue

waveData[$ckSizevalue]

waveData [long size]

(size gt= 0=gt BYTE size--)+

defines the long data type in terms of BYTE terminals

dWordValue returns [long value] throws UnsupportedEncodingException

b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE

if (byteOrderequals(II))

$value = (long)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 |

getUByteValue($b3) ltlt 16 |

getUByteValue($b4) ltlt 24 )

else

$value = (long)

( getUByteValue($b1) ltlt 24 |

getUByteValue($b2) ltlt 16 |

20

getUByteValue($b3) ltlt 8 |

getUByteValue($b4) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

if (byteOrderequals(II))

$value = (int)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 )

else

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rule used to create the Lexer

BYTE

Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format

7 Related Research

Researchers have developed a number of data description languages to support validation of file

formats EAST (Enhanced ADA Subset) is a data description language developed by the

Consultative Committee for Space Data Systems [2010] The Data Entity Dictionary

Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic

information The EAST description is used to interpret and provide access to information in

binary and text files

DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to

manipulate Java jar files and ELF object files The PADSC data description language can be

used to define the file formats of text and binary files [Fisher amp Gruber 2005] The PADSC

compiler compiles the description into tools that can be used to recognize manipulate and

transform the data into other formats

21

The Data Format Description Language (DFDL) is being developed by a Working Group of the

Open Grid Forum [OGF 2011] The data description language is based on a subset of W3C XML

Schema using ltxsappinfogt annotations to carry the extra information necessary to describe non-

XML physical representations Version 1 of the language specification was published in 2011

and a parser is being implemented

Each of the data description languages described above can be used to define data types and file

structures of binary files However they are based on file structure declarations such as those of

C or Java The binary file format grammar described in this paper most closely resembles DFDL

However the binary file grammar described in this paper is the only data description language

based on formal grammars that is used for creating recognizers for file formats

JHOVE (JSTORHarvard Object Validation Environment) is an extensible system designed to

provide automated and efficient identification and validation of the formats of digital files

JHOVE is a format-specific digital object validation API written in Java JHOVE supports

validation of the following formats AIFF ASCII GIF HTML JPEG JPEG 2000 PDF TIFF

UTF-8 WAVE and XML JHOVE2 is second generation validation environment [California

digital Library 2011]

The research reported in this paper is similar in intent to that of the JHOVE projectsmdashvalidation

of binary file formats As a matter of fact binary file grammars and parsers have been

constructed for the chunk-based file formats AIFF JPEG and WAVE as well as the directory-

based format TIFF However the research reported herein differs from that of the JHOVE

project in that what is sought is a technology for generating validators for binary file formats

from grammars specifying the binary file format

8 Conclusion

The research question addressed by his research is whether it is possible to extend the context-

free grammars used to specify the syntax of programming languages to the specification of

binary file formats and to use these grammars with parsers for validating the file formats of

binary files Amajor family of binary file formats chunk-based was described Then extensions

to the concepts of context-free grammars and attribute grammars were described that enable the

specification of binary file formats Examples of attribute array grammars for two chunk-based

binary file format were then presented Recursive descent parsers for this classes of grammars

was then described Experience in using ANTLR a parser generator for LL() string grammars

in generating parsers for recognizing the formats of binary files was discussed Finally related

research in data description languages for binary file formats was described and the JHOVE

project in creating Java-based tools for validating file formats was described

22

It is concluded that it is possible to extend context-free grammars to the specification of chunk-

based binary file formats Furthermore these grammars can be used with recursive descent

parsers for validating the file formats of chunk-based binary files It remains to be determined

whether these attribute grammars based on binary file (array) grammars are adequate to define

other families of binary file formats

To be a practical technology for generating recognizers (validators) for binary file formats a

parser generator is needed that creates from a binary file grammar a recursive descent parser

(with pushdown stack if needed) ANTLR which accepts LL() grammars as input and

generates a lexer as well as a parser is too heavyweight a parser generator for this task and does

not have the requisite data types built in

A capability to validate binary file formats assumes that the file format of a file is already

known In this research we use a file type identifier based on the UNIX file command and a

magic file that we have created [Underwood 2009] We have samples of the chunk-based file

formats discussed in this report including those referenced in Appendix a We also have file

signature tests (magic tests) for all these file types

We would like to have a parser generator (also called a compilerndashcompiler) that would take as

input an attribute grammar for a binary file format and generate a parser or interpreter that

recognizes the file format or that interprets the file format and displays or plays the contents The

closest parser generator that we have found to having these capabilities is ANTLR

(wwwantlrorg ) ANTLR will generate a top-down recursive descent parser from a context-free

attribute grammar However it works with strings and generates a separate scanner

Appendix A lists the names of about 75 file formats that belong to the family of ―chunk-based

file formats We have specifications for each of file formats and samples of files with these

formats

These file formats have a similar structure We have developed binary file grammars for two of

these binary file formats We should be able to do so for all of them We used these grammars

with the extensions to ANTLR to generate top-down recursive descent parsers for the file

formats defined with these grammars

The significance of success in this research task is that if binary file formats can be specified

with binary file grammars then only one parsing generator is needed to generate the parsers for

verifying that a file conforms to a particular format Similarly the same parsing generator can be

used for conversion of legacy file formats to current or standard formats Finally the same parser

generator can be used for generating viewerplayers for most file formats This would increase

23

the likelihood of preserving and making available into the indefinite future hose digital records

encoded in binary file formats

24

References

[Aho 1968] A V Aho Indexed Grammars ndash An Extension of Context-free Grammars Journal

of the ACM Vol 15 Issue 4 Oct 1968

[Alias-wavefront 1999] Maya File Formats Maya 20 1999

httpcaadarchethzchinfomayamanual

[Amigan 2011] Amigan Software IFF FORM and Chunk Registry

httpamigan1emunetregiffhtml

[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description

May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description

[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats

Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21

[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files

Version 13 Apple Computer Inc January 1989

wwwaesnashvilleorgPDFsAIFF20Specspdf

[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-

webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt

[Apple 2005] Apple Core Audio Format Specification 2005b

httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_

specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101

[Apple 2007] QuickTime File Format Specification

httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref

acehtml

[ASF 2001] ASF format version 10 reconstruction January 2001

httpavifilesourceforgenetasf-10htm

[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January

1997 wwwdcsedacukhomemxrgfx3d3DSspec

[Autodesk 2009] Autodesk DXF Reference January 2009

httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112

25

[Back 2002] G Back (2002) A specification and scripting language for binary data Generative

Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-

77

[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V

22 1979 pp 243-256

[Baum amp Noel 2002]D Baum amp G Noel The Simms Interchange File Format 2002

httpsimtechsourceforgenettechiffhtml

[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New

Horizons Software Inc 1987 httpamigan1emunetregWORDtxt

[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995

ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt

[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010

httpondasourceforgenetnlfnlfFormathtml

[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm amp Onda file

formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml

[California Digital Library 2011] California Digital Library JHOVE2 Userlsquos Guide 2011

httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf

[Caligari 1998] Caligri Caligari trueSpace file format Document Version 6 July

2002 httpwwwcaligaricomdownloadzipscncob65zip

[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information

Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements

and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-

t81pdf

[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description

Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf

[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998

[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt

[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D

Taliesin Inc httpamigan1emunetregDR2Dtxt

26

[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG

decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt

[Dunckley et al 2007] Dunckley M Rankin S Conway E amp Giaretta D (2007) The use of

file description languages for file format identification and validation Proc PV 2007

Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical

Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-

detailsw=50089

[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format

Tech 3285 July 2001 httptechebuchdocstechtech3285pdf

Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml

[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd

May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html

[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495

wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml

[Fisher amp Gruber 2005] Fisher K amp Gruber R (2005) PADS A domain specific language for

processing ad hoc data ACM Conference on Programming language Design and

Implementation ACM Press httpwwwpadsprojorgpaperspldipdf

[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG

1995

[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal

Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September

1997 wwwgnelsondemoncoukzspecquetzalhtml

[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called

Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)

httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt

[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03

httpwwwanim8orcomresourcesan8_formattxt

[Google 2010] WebP RIFF Container 2010

httpcodegooglecomspeedwebpdocsriff_containerhtml

[Haaland] Mike Haaland Flic Files (FLI) Format description

httpwwwmartinreddynetgfxanimFLItxt

27

[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube

Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf

[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice

Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt

[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata

Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1

1995

[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)

August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21

[IBM amp Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data

Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt

wwwtactilemediacominfoMCI_Control_Infohtml

httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf

[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec

11 2008 httpduskblueorgprojtoymidimidiformatpdf

[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10

ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf

[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable

Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C

Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110

[Issar amp Ward 1993] S Issar and W Ward CMUlsquos robust spoken language understanding

system EUOSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf

[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software

November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application

Version 50) wwwjasccomsupportkbarticlespspspecasp

[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image

File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version

22 April 2002

httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf

[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994

httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt

28

[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory

vol 2 pp 127_145 1968

[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical

Systems Theory vol 5 pp 95_96 1971

[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)

North-Holland Publ Co Amsterdam 1971 pp 95-109

[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language

Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml

httpwebcswpiedu~kalcoursescompilersmodule4mysahtml

[LightWave 1996] LightWave 3D Object File Format Oct 16 1996

httpsandboxdeosglightwavehtm

[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWo2] November 9 2001

httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml

[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer

1997 httppaulbourkenetdataformatscinema4dcinema4dpdf

[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and

Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf

[Maishein] Max Maischein PIC(PRO) Format

http14791177212extrafileformatgraphicspicpicprotxt

[Martin 2007] Michael Martin SVAF format description Version 02 January 2007

httpsvafsourceforgenetsvafhtml

[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and

Data Techniques July 29 1992 Revision 1097

httpnetghostnarodrugffvendspecmicriffms_rifftxt

[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New

Multimedia Data Types and Data Techniques Revision 30 April 15 1994

[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE

httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-

usstreamhhstreamaudprop_7wz7asp

[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003

December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx

29

[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-

uslibraryms899421aspx

[Morrison 1985] Jerry Morrison ―EA IFF 85rdquo Standard for Interchange Format Files

Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt

[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17

1986 wwwfine-viewcomjplabsdocilbmtxt

[Morrison 1986b] J Morrison ―SMUS IFF Simple Musical Score Electronic Arts Inc

February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus

[MultimediaWiki 2006] Smush Animation Format August 2006

httpwikimultimediacxindexphptitle=SANChunk_Forma

[MultimeduiaWiki 2008a] Electronic Arts CMV April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_CMV

[MultimediaWiki 2008b] Electronic Arts DCT April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_DCT

[MultimediaWiki 2008c] Electronic Arts MAD April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MAD

[MultimediaWiki 2008d] Electronic Arts MPC April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MPC

[MultimediaWiki 2008e] Electronic Arts VP6 April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_VP6

[MultimediaWiki 2008f] Electronic Arts TGV August 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGV

[MultimediaWiki 2008g] Electronic Arts TGQ November 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ

[MultimediaWiki 2008h] American Laser Games MM January 2008

httpwikimultimediacxindexphptitle=American_Laser_Games_MM

[MultimediaWiki 2009a] Electronic Arts TQI February 2009

httpwikimultimediacxindexphptitle=Electronic_Arts_TQI

30

[MultimediaWiki 2009b] SANM October 2009

httpwikimultimediacxindexphptitle=SANM

[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240

httpcodegooglecompll-parser-

generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42

[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter

written in Python in the open source graphics file convertor UniConvertor

httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer

[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification

January 31 2011 wwwogforgdocumentsGFD174pdf

[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988

wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt

[Parr amp Quong 1995] Parr T J amp Quong R W (1995) ANTLR A predicated- ll(k) parser

generator Software Practice and Experience 25 7 (July) 789-810

httpwwwantlrorgarticle1055550346383antlrpdf

[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific

Languages Pragmatic

[Peters] Christoph Peters The Ultimate 3D model file format (v 20)

httpultimate3dorgU3DModelFileFormatV2htm

[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003

httpxiphorgoggdocrfc3533txt

[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20

wwweblongcomzarfblorb

httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt

[ Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011

wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml

[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)

Format Version 10 2001 wwwlibpngorgpubmngspecCredits

31

[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File

Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008

httpscopenofficeorgexcelfileformatpdf

[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF

[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments

3072 March 2001 wwwietforgrfcrfc3072txt

[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft

26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt

[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and

old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21

[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988

httpamigan1emunetregANIMtxt

[Tamir 2005] Giora Tamir C RIFF Parser June 2005

wwwcodeprojectcomKBfilesriffparseraspx

[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications

In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi

Khosrow-Pour Information Science Publishing pp 268-273 2008

wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf

[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for

File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute

September 2009

[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical

Engineering The University of Queensland 1996

httpiteeuqeduau~cristinastudentsdaviddavidhtml

[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language

understanding system ARPA Human Language Technologies Workshop Plainsboro New

Jersey httpaclldcupenneduHH94H94-1039pdf

[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars

Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)

Springer-Verlag Berlin 1974 pp 9-16

32

[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley

1995 (chapter 10) (Attribute Grammars)

[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010

httpxiphorgvorbisdocVorbis_I_spechtml

[Zappe 1994] H Zappe Oktalyzer 1994

httpwwwwotsitorglistaspsearch=oktampbutton=GO21

33

Appendix A Chunk-based Binary File Formats

This appendix references specifications for about 75 chunk-based binary file formats and

discusses some of those specifications

A1 Electronic Arts Interchange File Format (IFF)

[Amigan 2011]

Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the

specification for the Interchange File Format [Morrison 1985]

A11 Interleaved Bitmap (ILBM)

[Morrison 1986]

This file format was discussed in section 41 of this report

A12 8-bit Sampled Voice (8SVX)

The file format specification for the 9-bit sampled voice file format uses C programming

language data structures to define the format [Hayes and Morrison 1985] The following is a re-

expression of that specification as a context-free binary file grammar

Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata

cksize rarr ULONG

ckdata rarr voice8header bodyck

ckdata rarr voice8header nameck copyrightck bodyck

ckdata rarr voice8header copyrightck nameck bodyck

ckdata rarr voice8header nameck bodyck

ckdata rarr voice8header copyrightck bodyck

ckdata rarr voice8header authorck bodyck

ckdata rarr voice8header annotation bodyck

ckdata rarr voice8header atakck releaseck bodyck

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 6: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

vi

List of Figures

Figure 1 File format (Layout) of a RIFF container for a Webp Image 2

Figure 2 ILBM File hampic Displayed 7

Figure 3 An Attribute Grammar for an ILBM Chunk-based File Format 7

Figure 4 A Parse Tree for the ILBM File hampic 8

Figure 5 Binary File Grammar for RIFF WAVEFORM PCM File Format 10

Figure 6 An ANTLR Grammar for the ILBM File Format 16

Figure 7 A Java Application for Parsing an ILBM File 17

Figure 8 Output of the ILBM Parser Applied to the file BLKIFF 18

Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format 20

1

1 Introduction

11 Background

Automated tools are required for identifying and validating the formats of the huge number of

files ingested into digital data and record archives Automated tools are also needed for

viewingplaying text and binary file formats and for converting legacy and obsolete file formats

to standard or current formats Technologies such as data description languages have emerged to

address these preservation challenges [Dunckley et al 2007]

Context-free grammars have been used to specify the syntax of programming languages

Attribute grammars provide a framework for formally specifying the semantics of a language

based on its context-free grammar and for addressing the mildly context-sensitive features of

programming languages such as agreement of the data types of variables in expressions with data

type declarations Such grammars have parsingtranslation algorithms that can be used for

syntax-checking and interpretation or translation of the languages the grammars define

12 Purpose

The research question addressed by this research is whether it is possible to extend the context-

free grammars used to specify the syntax of programming languages to the specification of

binary file formats and to use these grammars with parsers for validating the file formats of

binary files

13 Scope

The next section discusses the traditional approach to specifying file formats and two of the

major families of binary file formats In section 3 extensions to the concepts of context-free

grammars and attribute grammars are described that enable the specification of binary file

formats In section 4 examples of attribute array grammars for a chunk-based binary file formats

are presented In section 5 recursive descent parsers for these classes of grammars are described

In section 6 experience in using ANTLR a parser generator for LL() string grammars in

generating parsers for recognizing the formats of binary files is discussed In section 7 related

research is described Section 8 summarizes the results

2 Binary File Formats

Traditionally a file (or record) layout was a form showing how fields were positioned in a file

(or record) The fields are named and have as attributes a data type length and sometimes a

constant value Fields are offset at addresses relative to the beginning of a file File formats are

still specified in this manner For instance Figure 1 shows the format layout of a RIFF container

for the VP8 codec of a Webp image [Google 2010]

2

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk starting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

Figure 1 File format (Layout) of a RIFF container for a Webp Image

Many binary file formats are specified using pseudo-regular expressions or pseudo-EBNF

notation EBNF (Extended Backus-Naur Form) is a notation for expressing context-free

grammars in a compact human readable way Pseudo-EBNF is similar in concept to pseudocode

which is a high-level description of a computer algorithm that is intended for human

understanding but that omits details that would be necessary for computer execution The

pseudo-EBNF (or regular expression) specification of a file format is augmented with a natural

language description of the details that are not actually expressible in the EBNF (or regular

expression) notation These details are often the context-sensitive relationships of the size of an

array to its actual length or the relationship of an address pointer to the actual location of the

data pointed to in the file These context-sensitive relationships are not expressible in a context-

free grammar (or EBNF notation) It is a goal of this research to extend the EBNF notation and

concept of a context-free grammar to include the details that are necessary to precisely specify

the file formats of binary file formats that include these context-sensitive features

Binary file formats with a similar file structure are referred to as a family of file formats There

are two families of binary file formats that are readily distinguished the chunk-based and the

directory-based file formats

Chunk-based file formats were created by Electronic Arts and Commodore-Amiga as the

Interchange File Format (IFF) [Morrison 1985] An IFF file itself is one entire IFF chunk A

chunk consists of an ID tag a size and size bytes of data The data may include chunks called

subchunks Subchunks have the same structure as chunks Subchunks can have subchunks

Chunk-based file formats are primarily used as containers for multimedia but have also been

used for word processing files including pictures and other figures File format specifications for

chunk-based formats may use other terms than chunks in describing the format for instance

atoms in QuickTimeMP4 segments in JPEG and ―tagged data representations as in

AutoCAD DXF

3

The Audio Interchange File Format (AIFF) developed by Apple computer in 1988 was based on

Electronic Arts Interchange File Format Apple Core Audio Format (CAF) Applelsquos replacement

format for AIFF remains a chunk-based format [Apple Computer 2005] The Resource

Interchange File Format (RIFF) introduced by IBM and Microsoft [1991] is also based on

Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-Video

Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis WebP a picture format

recently introduced by Google[2010] also uses RIFF as a container

Microsoftlsquos Advanced Systems Format (ASF) [2004] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV) Microsoftlsquos Binary Interchange File Format [Rentz 2008]

(Microsoft Excellsquos File Format) is chunk-based

Another family of binary file formats is directory-based A directory-based file format consists of

one or more directory tables which contain one or more directory entries A directory entry

specifies where the actual data for a type of information is located This is the scheme used in

TIFF files OLE (Microsoft Object Linking and Embedding) files OASIS OpenDocument and

Microsoft Open Office files

3 Context-Free Grammars and Attribute Grammars

Context-free grammars have been widely used to define programming languages A context-free

grammar G is a quadruple ltN Σ S Pgt where

N is a finite set of non-terminal symbols

Σ is a finite set of terminal symbols

S ε N is the start symbol

P is a set of production rules of the form N rarr N Σ

The set of all strings over a vocabulary Σ is denoted by Σ The language generated by a context-

free grammar G is the set L(G) = w w Σ and S w

31 Context-Free Grammars for Arrays of Data Types

An array is a data structure consisting of a collection of elements (values or variables) each

identified by an index An array is stored so that the position of each element can be computed

from its index For example an array of 10 integer variables with indices 0 through 9 may be

4

stored as 10 (4 byte) words at file addresses 0 4 8 hellip36 so that the element with index i has the

address 4 times i

Arrays are used to implement many other data structures such as lists and strings In most

modern computers the internal memory and external memory of storage devices (and files on

those devices) is a one-dimensional array of data types whose indices are their addresses

A data type is a classification of one of various types of data such as floating-point integer

character pointer or Boolean that determines the possible values for that type the operations

that can be performed on values of that type and the way values of that type can be stored Data

types have names such as int16 int32 float char bool ptr We define a binary file to be an array

of values of data of various types

We define a context-free binary file grammar BG as a quintuple ltN D Σ S Pgt where

N is a finite set of non-terminal symbols

D is a set of data types

Σ is a finite set of binary values of data types D called terminals

S ε N is the start symbol

P is a set of production rules of the form N rarr N Σ

Let DataTypes indicate the union of all the values of all datatypes D The set of all binary files is

BinaryFiles = [IndicesrarrDataTypes] where Indices = 1 The language generated by a

context-free array (binary file) grammar AG is the set L(AG) = w w BinaryFiles and S w

By primitive nonterminal is meant a symbol which would be replaced by one or more terminal

symbols were an additional production rule applied In other words a primitive nonterminal is a

nonterminal that appears on the left-hand side of productions which only have one or more

terminal symbols on the right We adopt the notation shown below for primitive nonterminals

ltprimitive_nonterminal_symbol type=datatype_name constant=valuegt

32 Attribute Grammars

Context-free grammars cannot represent context-sensitive aspects of programming languages

such as (1) enforcing the constraint that all variables are declared before they are used or (2)

checking the number of parameters in a function call against the number in the functionlsquos

declaration Context-free grammars have also been used to define the syntax of English and other

natural languages but they cannot define such context-sensitivity as subject verb agreement in

English sentences A number of extensions of context-free grammars have been proposed to

5

address the context-sensitivity of programming languages and natural languages These include

indexed [Aho 1968] recording [Barth 1979] affix [Koster 1971] VW [van Wijngaarden 1974]

and attribute grammars [Knuth 1969 1971]

Context-free grammars also cannot represent the semantics of programming languages Knuth

[1968 1971] proposed an extension of context-free grammars termed attribute grammars that

addresses the semantics as well as the context-sensitivity of programming languages

An attribute grammar GA is a triple ltG A ARgt where G is a context-free grammar for the

language A associates each grammar symbol X isin (N Σ) with a set of attributes and AR

associates each production R isin P with a set of attribute computation rules A(X) where X isin (N

Σ) can be further partitioned into two sets synthesized attributes S(X) and inherited attributes

I(X) AR(R) where R isin P contains rules for computing inherited and synthesized attributes

associated with the symbols in the production R Synthesized attributes are evaluated and passed

up from deeper terminals and non-terminals while inherited attributes are passed down from

upper non-terminals

An L-attributed grammar is an attribute grammar that uses

i) synthesized attributes and

ii) inherited attributes where the inherited attribute depends only on

a) attributes of the other child nodes to its left

b) inherited attributes of P itself

L-attributed grammars allow the attributes to be evaluated in one left-to-right traversal of the

abstract syntax tree As a result attribute evaluation in L-attributed grammars can be

incorporated conveniently in top-down parsing

33 Semantic Grammar

An interpreter for a binary file format that is using a grammar to display or play the contents of a

file or to convert it to another file format needs to produce a semantic representation that is

appropriate to the task Natural language understanding components of dialog systems face a

similar need to produce a semantic representation that is appropriate to the dialogue task To

address this need some developers of dialogue systems make use of an approach called semantic

grammar [Issar and Ward 1993 Ward and Issar 1994] A semantic grammar is a CFG in which

the non-terminals on the left-hand side of rules correspond to semantic entities A parser for a

semantic grammar produces a labeling of the input string with semantic node labels for example

SHOW FLIGHTS ORIGIN DESTINATION DEPART_DATE DEPART_TIME Show me flights from Boston to San Francisco on Tuesday morning

6

A semantic grammar is just what we need to file in the values of a programming language data

structure Semantic grammars are not applicable to generic systems but are limited to specific

domains This is not a problem in specifying or parsing a binary file format since a grammar for

a binary file format is not meant to apply to all binary file formats but just to a single format or

to a family of file formats

4 Grammars for Chunk-based Binary File Formats

The attribute grammar that we present will be a mixture of synthesized semantic rules (attribute

values of the left side of a production are calculated from attribute values of symbols on the right

side) and inherited rules (attribute values of symbols on the right side of a production are

calculated from attribute values of the symbol on the left side)

The field names of the file format spec are represented as nonterminals in the grammar Constant

values in fields are terminals that appear on the right-hand side of productions Nonterminal

fieldnames have data types associated with them

41 InterLeaved BitMap (ILBM)

The author of the Interleaved Bitmap (ILBM) file format recognized that the structure of the file

format could be specified using pseudo EBNF rules and C data structure declarations [Morrison

1986] In this section the ―grammar of the ILBM file format specification is converted to a

context-free binary grammar with synthesized and inherited attributes

Figure 2 shows a display of an ILBM file hampic Figure 3 shows an attribute grammar for its

chunk-based binary file format The attribute grammar is a mixture of synthesized semantic rules

(attribute values of the left side of a production are calculated from attribute values of symbols

on the right side) and inherited rules (attribute values of symbols on the right side of a

production are calculated from attribute values of the symbol on the left side) The attribute

grammar is also an L-attributed grammar Figure 4 shows a parse tree for the file hampic

7

Figure 2 ILBM File hampic Displayed

ltILBMgt rarr ―FORM ltcksize TYPE=UINT32gt ―ILBM ltBMHDgt ltCMAPgt ltCAMGgt ltBODYgt

ILBMcksize = cksizevalue

ltBMHDgt rarr ―BMHD ltcksize TYPE=uint32gt ltBitmapHeadergt

ltBitMapheadergt rarr ltwidth type=UINT32gt

lt height type=UINT16gt

lt xposition type=INT16gt

ltyposition type=INT16gt

ltnplanes type=BYTEgt

ltmasking type=BYTEgt

ltcompression type=BYTEgt

ltreserved type=BYTEgt

lttransparentcolor type=UINT16gt

ltxaspect type=BYTEgt

ltyaspect type=BYTEgt

ltpagewidth type=INT16gt

ltpageheight type=INT16gt

ltCMAPgt -rarr―CMAP ltcksize type=UINT32gt ltcolorgt[n] n=cksizeval3

ltCAMGgt rarr ―CAMG ltcksizetype=UINT32gt ltviewmode type=INT32gt

ltBODYgt rarr ―BODY ltcksize type=UINT32gt ltdata type=BYTEgt[cksize]

ltcolorgt rarr ltred type=BYTEgt ltgreen type= BYTEgt ltblue type=BYTEgt

Figure 3 An Attribute Grammar for an ILBM Chunk-based File Format

8

Figure 4 A Parse Tree for the ILBM File hampic

9

42 RIFF Waveform PCM Audio File Format (WAVE)

The waveform PCM File format is was originally specified in an extended BNF notation [IBM amp

Microsoft 1991] The chunk size was omitted from the EBNF rules Figure 5 shows the EBNF

rules with the cksize inserted To convert this grammar into a binary file format grammar data

types are added for primitive non-terminals

ltWAVE-formgt -gt bdquoRIFF‟ cksize WAVE

ltfmt-ckgt

[ltfact-ckgt]

[ltcue-ckgt]

[ltplaylist-ckgt]

[ltassoc-data-listgt]

ltwave-datagt

ltfmt-ckgt -gt bdquofmt ‟ cksize ltcommon-fieldsgt ltwBitsPerSampleWORDgt

ltcommon-fieldsgt -gt ltwFormatTagWORD value=0x0001gt

ltwChannelsWORDgt

ltdwSamplesPerSecDWORDgt

ltdwAvgBytesPerSecDWORDgt

ltwBlockAlignWORDgt

ltwave-datagt -gt ltdata-ckgt | ltwave-listgt

ltdata-ckgt -gt data cksize ltwave-datagt

ltwave-listgt -gt LIST cksize wavl ltdata-ckgt | ltsilence-ckgt

ltsilence-ckgt -gt slnt cksize ltdwSamplesDWORDgt

ltfact-ckgt -gt factxksize ltdwFileSizeDWORDgt

ltcue-ckgt -gt cue cksize ltdwCuePointsDWORDgt ltcue-pointgt+

ltcue-pointgt -gt ltdwNameDWORDgt

ltdwPositionDWORDgt

ltfccChunkFOURCCgt

ltdwChunkStartDWORDgt

ltdwBlockStartDWORDgt

ltdwSampleOffsetDWORDgt

ltplaylist-ckgt -gt bdquoplst‟ cksize ltdwSegmentsDWORDgt ltplay-segmentgt+

ltplay-segmentgt -gt ltdwNameDWORDgt

ltdwLengthDWORDgt

ltdwLoopsDWORD

ltassoc-data-listgt-gt bdquoLIST‟ cksize adtl ltlabl-ckgtltnote-ckgtltltxt-ckgtltfile-ckgt

ltlabl-ckgt -gt bdquolabl‟ cksize ltdwNameDWORDgt ltdataZSTRgt

ltnote-ckgt -gt bdquonote‟ cksize ltdwNameDWORDgt ltdataZSTRgt

10

ltltxt-ckgt -gt bdquoltxt‟ cksize ltdwNameDWORDgt

ltdwSampleLengthDWORDgt

ltdwPurposeDWORDgt

ltwCountryWORDgt

ltwLanguageWORDgt

ltwDialectWORDgt

ltwCodePageWORDgt

ltdataBYTEgt+

ltfile-ckgt -gt bdquofile‟ cksize ltdwNameDWORDgt ltdwMedTypeDWORDgt

ltfileDataBYTEgt+

Figure 5 Binary File Grammar for RIFF WAVEFORM PCM File Format

5 Recursive Descent Parsers for Binary File Grammars

A recursive descent parser is a top-down parser built from a set of recursive (or non recursive)

procedures where each procedure implements one of the production rules of the grammar A

predictive parser is a recursive descent parser that does not require backtracking Predictive

parsing is possible only for the class of LL(k) grammars which are the context-free grammars

for which there exists some positive integer k that allows a recursive descent parser to decide

which production to use by examining only the next k tokens of input A recursive descent parser

for the binary file format grammars defined in this paper does not require a separate lexer to

identify the data of a file The data types are predicted by the grammar rules

The semantic rules of an L-attributed grammar can be included in the recursive descent parser to

check for context-sensitive aspects of the grammar such as chunk size or image dimensions and

use these values in parsing the chunk data or images The pointers to file addresses that occur in

the file formats are handled by pushing the current address in the file onto a pushdown stack

When the parser exhausts the procedures implementing production rules corresponding to the

pointed to location the topmost address on the pushdown stack is popped and the procedure

corresponding to the nonterminal at that position is executed

Recursive descent parsers with a pushdown stack can be used with binary file format grammars

to create file format recognizers A recognizer (syntax checker or validator) is a parser that reads

a file and generates an error if the file does not conform to the syntax specified by the grammar

11

6 A Parser Generator for Recognizing Binary File Formats

What we want is a parser generator whose input is a binary file attribute grammar for a particular

binary file format and whose generated output is the Java source code of a recursive descent

parser (with a pushdown stack if needed) for the class of binary file formats specified by the

grammar Such a parser generator does not yet exist but there is a widely used parser generator

for attribute grammars for string-based languages

61 ANTLR

ANTLR (ANother Tool for Language Recognition) is a parser generator that uses LL() parsing

[Parr amp Quong 1995 Parr 2007] ANTLR takes as input an attribute grammar that specifies a

language and generates as output source code for a recognizer for that language ANTLR

supports generating code in a number of the programming languages including C Java

JavaScript and Python

ANTLR also generates a lexer from lexical rules in the grammar A lexer is a program that

converts a sequence of characters into tokens A lexer is not needed to parse binary file formats

because the data types predicted by the parser are the tokens of the grammar Furthermore the

capability to recognize LL() grammars is not needed because the binary file format grammars

specified so far do not require lookahead of k symbols

62 Using ANTLR to Generate a Parser for a Binary File Format

It is possible to use ANTLR to test our binary file format grammars in recognizing the chunk-

based and directory-based binary file formats This is accomplished by writing a lexer that treats

each input byte in a file as a character token Then functions are created for each data type in the

binary file grammar that convert the appropriate number of character tokens to binary data types

eg int 16 int32 etc This is effective if someone clumsy and has allowed us to test our

attribute grammars for binary file formats as well as to demonstrate the feasibility of creating a

parser generator for binary file grammars

621 An ANTLR Grammar for an ILBM Chunk-based File Format

Figure 6 shows a specification for the ILBM file format that is input to ANTLR The output of

ANTLR is a lexer and parser for ILBM files

12

grammar iblm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in lexerheader is placed at the top of the lexer file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Grammar Rules used to create the Parser

Start symbol ilbm

iblm returns [boolean result]

FORM ckSize ILBM

Print out ILBM chunksize

Systemoutprintln(ID ILBM Size + $ckSizevalue)

Semantic predicate that ensures that at least on BMHD is found

propertyChunks $propertyChunksfoundBMHD == true

dataChunks

body

$result = true

The bmhd cmap and camg property chunks can occur in any order

At least one BMHD data chunk must be present

propertyChunks returns [boolean foundBMHD]

(bmhd $foundBMHD = true

| cmap

| camg)+

bmhd

BMHD ckSize

Systemoutprintln(ID BMHD Size + $ckSizevalue)

13

bitMapHeader

bitMapHeader

width = uWordValue

height = uWordValue

xPosition = wordValue

yPosition = wordValue

nPanes = uByteValue

masking = uByteValue

compression = uByteValue

reserved1 = uByteValue

transparentColor = uWordValue

xAspect = uByteValue

yAspect = uByteValue

pageWidth = wordValue

pageHeight = wordValue

Print out values in the bitmap structure

Systemoutprintln( Width + $widthvalue)

Systemoutprintln( Height + $heightvalue)

Systemoutprintln( X Position + $xPositionvalue)

Systemoutprintln( Y Position + $yPositionvalue)

Systemoutprintln( Number of Panes + $nPanesvalue)

Systemoutprintln( Masking + $maskingvalue)

Systemoutprintln( Compression + $compressionvalue)

Systemoutprintln( Reserved1 + $reserved1value)

Systemoutprintln( TransParent Color +

$transparentColorvalue)

Systemoutprintln( X Aspest + $xAspectvalue)

Systemoutprintln( Y Aspect + $yAspectvalue)

Systemoutprintln( Page Width + $pageWidthvalue)

Systemoutprintln( Page Height + $pageHeightvalue)

cmap returns [long value]

Initialize the index value i in the cmap function

init int i = 0

CMAP ckSize

Print out chunk size

$value = $ckSizevalue

Systemoutprintln(ID CMAP Size + $ckSizevalue)

(colorRegister

Print out color values

Systemoutprintln( Color[ + i++ + ] +

$colorRegisterredValue +

+ $colorRegistergreenValue +

+ $colorRegisterblueValue +

)

)+

colorRegister returns [int redValue int greenValue int blueValue]

14

red = uByteValue green = uByteValue blue = uByteValue

$redValue = $redvalue

$greenValue = $greenvalue

$blueValue = $bluevalue

camg

CAMG ckSize longValue

Systemoutprintln(ID CAMG Size + $ckSizevalue)

Systemoutprintln( View Modes 0x +

LongtoHexString($longValuevalue))

The crng and ccrt data chunks

dataChunks

crng

| ccrt

crng

CRNG ckSize

Systemoutprintln(ID CRNG Size + $ckSizevalue)

cRange

cRange

pad1 = wordValue $pad1value == 0

rate = wordValue

active = wordValue

low = uByteValue

high = uByteValue

ccrt

CCRT ckSize

Systemoutprintln(ID CCRT Size + $ckSizevalue)

cycleInfo

cycleInfo

direction = wordValue

start = uByteValue

end = uByteValue

seconds = longValue

microseconds = longValue

pad = wordValue $padvalue == 0

body

BODY ckSize

Systemoutprintln(ID BODY Size + $ckSizevalue)

$ckSizevalue gt 0 data

Systemoutprintln( Number of bytes + $datavalue)

$datavalue == $ckSizevalue

defines the data chunk of the body chunk

data returns [long value]

(b+=BYTE)+ $bsize() gt 0

$value = $bsize()

15

The cksize is the value of a long integer

ckSize returns [long value]

longValue

$value = $longValuevalue

defines the longValue data typein terms of BYTE terminals

longValue returns [long value] throws UnsupportedEncodingException

h1 = BYTE h2 = BYTE l1= BYTE l2 = BYTE

$value = (long)

( getUByteValue($h1) ltlt 24 |

getUByteValue($h2) ltlt 16 |

getUByteValue($l1) ltlt 8 |

getUByteValue($l2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uWordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (short)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uByteValue returns [int value]

BYTE

calls the getUByteValue from a BYTE token

16

$value = (int) getUByteValue($BYTE)

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rules used to create the Lexer

FORM FORM

ILBM ILBM

BMHD BMHD

CMAP CMAP

CAMG CAMG

CRNG CRNG

CCRT CCRT

BODY BODY

BYTE

Figure 6 An ANTLR Grammar for the ILBM File Format

622 The Lexer and Parser

When provided the file described in the previous section ANTLR generates the ilbmlexer and

ilbmparser Java programs The Java application shown in Figure 6 uses these programs to

validate the ILBM file BLKIFF It produces the output shown in Figure 7

package orggtrigrammarschunk

import javaioIOException

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRStringStream

import organtlrruntimeCharStream

import organtlrruntimeCommonTokenStream

import organtlrruntimeRecognitionException

17

import organtlrruntimeTokenStream

import gtriorggrammarschunkiblmLexer

import gtriorggrammarschunkiblmParser

public class Test

static String filePath = CUserssl170PicturesIFF-ilbmBLKIFF

public static void main(String[] args) throws RecognitionException IOException

ANTLRFileStream stream = new ANTLRFileStream(filePath ISO-8859-1)

iblmLexer lexer = new iblmLexer(stream)

TokenStream tokenStream = new CommonTokenStream(lexer)

iblmParser parser = new iblmParser(tokenStream)

Systemoutprintln(FILE + filePath)

Systemoutprintln(Successfully parsed = + parseriblm())

Figure 7 A Java Application for Parsing an ILBM File

FILE CUserssl170PicturesIFF-ilbmBLKIFF

ID ILBM Size 2556

ID BMHD Size 20

Width 200

Height 144

X Position 0

Y Position 0

Number of Panes 6

Masking0

Compression1

Reserved10

TransParent Color0

X Aspest3

Y Aspect3

Page Width200

Page Height144

ID CMAP Size 48

Color[0] 0 0 0

Color[1] 0 128 128

Color[2] 0 255 255

Color[3] 255 0 255

Color[4] 0 255 0

Color[5] 255 0 0

Color[6] 0 0 255

Color[7] 255 255 0

Color[8] 128 128 128

Color[9] 192 192 192

Color[10] 128 0 0

Color[11] 0 128 0

Color[12] 128 128 0

Color[13] 0 0 128

Color[14] 128 0 128

18

Color[15] 255 255 255

ID CAMG Size 4

View Modes 0x800

ID BODY Size 2448

Number of bytes 2448

Successfully parsed = true

Figure 8 Output of the ILBM Parser Applied to the file BLKIFF

63 An ANTLR Grammar for the RIFF Waveform PCM File Format

Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR

The output of ANTLR is a lexer and parser for WAVEFOR PCM files

grammar waveform_pcm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text contained in lexerheader is placed at the top of the lexer

file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Default ByteOrder to Little Endian

String byteOrder = II

Rules used to create the Parser

start symbol waveform_pcm

waveform_pcm returns [boolean result]

19

ckId = fourCC $ckIdvalueequals(RIFF)

ckSize = dWordValue

formType = fourCC $formTypevalueequals(WAVE)

formatSubChunk

dataSubChunk

$result = true

fourCC returns [String value]

BYTE BYTE BYTE BYTE $value = $text

formatSubChunk

subChunkId = fourCC $subChunkIdvalueequals(fmt )

ckSize = dWordValue

commonFields

pcmSpecificField

commonFields

formatTag = wordValue Format category

channels = wordValue Number of channels

samplesPerSec = dWordValue Sampling rate

avgBytesPerSec = dWordValue For buffer estimation

blockAlign = wordValue Data block size

pcmSpecificField

bitsPerSample = wordValue Sample size

dataSubChunk

subChunkId = fourCC $subChunkIdvalueequals(data)

ckSize = dWordValue

waveData[$ckSizevalue]

waveData [long size]

(size gt= 0=gt BYTE size--)+

defines the long data type in terms of BYTE terminals

dWordValue returns [long value] throws UnsupportedEncodingException

b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE

if (byteOrderequals(II))

$value = (long)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 |

getUByteValue($b3) ltlt 16 |

getUByteValue($b4) ltlt 24 )

else

$value = (long)

( getUByteValue($b1) ltlt 24 |

getUByteValue($b2) ltlt 16 |

20

getUByteValue($b3) ltlt 8 |

getUByteValue($b4) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

if (byteOrderequals(II))

$value = (int)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 )

else

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rule used to create the Lexer

BYTE

Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format

7 Related Research

Researchers have developed a number of data description languages to support validation of file

formats EAST (Enhanced ADA Subset) is a data description language developed by the

Consultative Committee for Space Data Systems [2010] The Data Entity Dictionary

Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic

information The EAST description is used to interpret and provide access to information in

binary and text files

DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to

manipulate Java jar files and ELF object files The PADSC data description language can be

used to define the file formats of text and binary files [Fisher amp Gruber 2005] The PADSC

compiler compiles the description into tools that can be used to recognize manipulate and

transform the data into other formats

21

The Data Format Description Language (DFDL) is being developed by a Working Group of the

Open Grid Forum [OGF 2011] The data description language is based on a subset of W3C XML

Schema using ltxsappinfogt annotations to carry the extra information necessary to describe non-

XML physical representations Version 1 of the language specification was published in 2011

and a parser is being implemented

Each of the data description languages described above can be used to define data types and file

structures of binary files However they are based on file structure declarations such as those of

C or Java The binary file format grammar described in this paper most closely resembles DFDL

However the binary file grammar described in this paper is the only data description language

based on formal grammars that is used for creating recognizers for file formats

JHOVE (JSTORHarvard Object Validation Environment) is an extensible system designed to

provide automated and efficient identification and validation of the formats of digital files

JHOVE is a format-specific digital object validation API written in Java JHOVE supports

validation of the following formats AIFF ASCII GIF HTML JPEG JPEG 2000 PDF TIFF

UTF-8 WAVE and XML JHOVE2 is second generation validation environment [California

digital Library 2011]

The research reported in this paper is similar in intent to that of the JHOVE projectsmdashvalidation

of binary file formats As a matter of fact binary file grammars and parsers have been

constructed for the chunk-based file formats AIFF JPEG and WAVE as well as the directory-

based format TIFF However the research reported herein differs from that of the JHOVE

project in that what is sought is a technology for generating validators for binary file formats

from grammars specifying the binary file format

8 Conclusion

The research question addressed by his research is whether it is possible to extend the context-

free grammars used to specify the syntax of programming languages to the specification of

binary file formats and to use these grammars with parsers for validating the file formats of

binary files Amajor family of binary file formats chunk-based was described Then extensions

to the concepts of context-free grammars and attribute grammars were described that enable the

specification of binary file formats Examples of attribute array grammars for two chunk-based

binary file format were then presented Recursive descent parsers for this classes of grammars

was then described Experience in using ANTLR a parser generator for LL() string grammars

in generating parsers for recognizing the formats of binary files was discussed Finally related

research in data description languages for binary file formats was described and the JHOVE

project in creating Java-based tools for validating file formats was described

22

It is concluded that it is possible to extend context-free grammars to the specification of chunk-

based binary file formats Furthermore these grammars can be used with recursive descent

parsers for validating the file formats of chunk-based binary files It remains to be determined

whether these attribute grammars based on binary file (array) grammars are adequate to define

other families of binary file formats

To be a practical technology for generating recognizers (validators) for binary file formats a

parser generator is needed that creates from a binary file grammar a recursive descent parser

(with pushdown stack if needed) ANTLR which accepts LL() grammars as input and

generates a lexer as well as a parser is too heavyweight a parser generator for this task and does

not have the requisite data types built in

A capability to validate binary file formats assumes that the file format of a file is already

known In this research we use a file type identifier based on the UNIX file command and a

magic file that we have created [Underwood 2009] We have samples of the chunk-based file

formats discussed in this report including those referenced in Appendix a We also have file

signature tests (magic tests) for all these file types

We would like to have a parser generator (also called a compilerndashcompiler) that would take as

input an attribute grammar for a binary file format and generate a parser or interpreter that

recognizes the file format or that interprets the file format and displays or plays the contents The

closest parser generator that we have found to having these capabilities is ANTLR

(wwwantlrorg ) ANTLR will generate a top-down recursive descent parser from a context-free

attribute grammar However it works with strings and generates a separate scanner

Appendix A lists the names of about 75 file formats that belong to the family of ―chunk-based

file formats We have specifications for each of file formats and samples of files with these

formats

These file formats have a similar structure We have developed binary file grammars for two of

these binary file formats We should be able to do so for all of them We used these grammars

with the extensions to ANTLR to generate top-down recursive descent parsers for the file

formats defined with these grammars

The significance of success in this research task is that if binary file formats can be specified

with binary file grammars then only one parsing generator is needed to generate the parsers for

verifying that a file conforms to a particular format Similarly the same parsing generator can be

used for conversion of legacy file formats to current or standard formats Finally the same parser

generator can be used for generating viewerplayers for most file formats This would increase

23

the likelihood of preserving and making available into the indefinite future hose digital records

encoded in binary file formats

24

References

[Aho 1968] A V Aho Indexed Grammars ndash An Extension of Context-free Grammars Journal

of the ACM Vol 15 Issue 4 Oct 1968

[Alias-wavefront 1999] Maya File Formats Maya 20 1999

httpcaadarchethzchinfomayamanual

[Amigan 2011] Amigan Software IFF FORM and Chunk Registry

httpamigan1emunetregiffhtml

[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description

May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description

[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats

Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21

[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files

Version 13 Apple Computer Inc January 1989

wwwaesnashvilleorgPDFsAIFF20Specspdf

[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-

webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt

[Apple 2005] Apple Core Audio Format Specification 2005b

httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_

specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101

[Apple 2007] QuickTime File Format Specification

httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref

acehtml

[ASF 2001] ASF format version 10 reconstruction January 2001

httpavifilesourceforgenetasf-10htm

[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January

1997 wwwdcsedacukhomemxrgfx3d3DSspec

[Autodesk 2009] Autodesk DXF Reference January 2009

httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112

25

[Back 2002] G Back (2002) A specification and scripting language for binary data Generative

Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-

77

[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V

22 1979 pp 243-256

[Baum amp Noel 2002]D Baum amp G Noel The Simms Interchange File Format 2002

httpsimtechsourceforgenettechiffhtml

[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New

Horizons Software Inc 1987 httpamigan1emunetregWORDtxt

[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995

ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt

[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010

httpondasourceforgenetnlfnlfFormathtml

[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm amp Onda file

formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml

[California Digital Library 2011] California Digital Library JHOVE2 Userlsquos Guide 2011

httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf

[Caligari 1998] Caligri Caligari trueSpace file format Document Version 6 July

2002 httpwwwcaligaricomdownloadzipscncob65zip

[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information

Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements

and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-

t81pdf

[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description

Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf

[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998

[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt

[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D

Taliesin Inc httpamigan1emunetregDR2Dtxt

26

[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG

decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt

[Dunckley et al 2007] Dunckley M Rankin S Conway E amp Giaretta D (2007) The use of

file description languages for file format identification and validation Proc PV 2007

Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical

Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-

detailsw=50089

[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format

Tech 3285 July 2001 httptechebuchdocstechtech3285pdf

Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml

[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd

May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html

[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495

wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml

[Fisher amp Gruber 2005] Fisher K amp Gruber R (2005) PADS A domain specific language for

processing ad hoc data ACM Conference on Programming language Design and

Implementation ACM Press httpwwwpadsprojorgpaperspldipdf

[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG

1995

[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal

Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September

1997 wwwgnelsondemoncoukzspecquetzalhtml

[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called

Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)

httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt

[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03

httpwwwanim8orcomresourcesan8_formattxt

[Google 2010] WebP RIFF Container 2010

httpcodegooglecomspeedwebpdocsriff_containerhtml

[Haaland] Mike Haaland Flic Files (FLI) Format description

httpwwwmartinreddynetgfxanimFLItxt

27

[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube

Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf

[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice

Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt

[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata

Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1

1995

[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)

August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21

[IBM amp Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data

Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt

wwwtactilemediacominfoMCI_Control_Infohtml

httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf

[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec

11 2008 httpduskblueorgprojtoymidimidiformatpdf

[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10

ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf

[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable

Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C

Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110

[Issar amp Ward 1993] S Issar and W Ward CMUlsquos robust spoken language understanding

system EUOSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf

[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software

November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application

Version 50) wwwjasccomsupportkbarticlespspspecasp

[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image

File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version

22 April 2002

httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf

[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994

httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt

28

[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory

vol 2 pp 127_145 1968

[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical

Systems Theory vol 5 pp 95_96 1971

[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)

North-Holland Publ Co Amsterdam 1971 pp 95-109

[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language

Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml

httpwebcswpiedu~kalcoursescompilersmodule4mysahtml

[LightWave 1996] LightWave 3D Object File Format Oct 16 1996

httpsandboxdeosglightwavehtm

[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWo2] November 9 2001

httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml

[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer

1997 httppaulbourkenetdataformatscinema4dcinema4dpdf

[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and

Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf

[Maishein] Max Maischein PIC(PRO) Format

http14791177212extrafileformatgraphicspicpicprotxt

[Martin 2007] Michael Martin SVAF format description Version 02 January 2007

httpsvafsourceforgenetsvafhtml

[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and

Data Techniques July 29 1992 Revision 1097

httpnetghostnarodrugffvendspecmicriffms_rifftxt

[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New

Multimedia Data Types and Data Techniques Revision 30 April 15 1994

[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE

httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-

usstreamhhstreamaudprop_7wz7asp

[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003

December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx

29

[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-

uslibraryms899421aspx

[Morrison 1985] Jerry Morrison ―EA IFF 85rdquo Standard for Interchange Format Files

Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt

[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17

1986 wwwfine-viewcomjplabsdocilbmtxt

[Morrison 1986b] J Morrison ―SMUS IFF Simple Musical Score Electronic Arts Inc

February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus

[MultimediaWiki 2006] Smush Animation Format August 2006

httpwikimultimediacxindexphptitle=SANChunk_Forma

[MultimeduiaWiki 2008a] Electronic Arts CMV April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_CMV

[MultimediaWiki 2008b] Electronic Arts DCT April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_DCT

[MultimediaWiki 2008c] Electronic Arts MAD April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MAD

[MultimediaWiki 2008d] Electronic Arts MPC April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MPC

[MultimediaWiki 2008e] Electronic Arts VP6 April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_VP6

[MultimediaWiki 2008f] Electronic Arts TGV August 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGV

[MultimediaWiki 2008g] Electronic Arts TGQ November 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ

[MultimediaWiki 2008h] American Laser Games MM January 2008

httpwikimultimediacxindexphptitle=American_Laser_Games_MM

[MultimediaWiki 2009a] Electronic Arts TQI February 2009

httpwikimultimediacxindexphptitle=Electronic_Arts_TQI

30

[MultimediaWiki 2009b] SANM October 2009

httpwikimultimediacxindexphptitle=SANM

[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240

httpcodegooglecompll-parser-

generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42

[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter

written in Python in the open source graphics file convertor UniConvertor

httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer

[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification

January 31 2011 wwwogforgdocumentsGFD174pdf

[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988

wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt

[Parr amp Quong 1995] Parr T J amp Quong R W (1995) ANTLR A predicated- ll(k) parser

generator Software Practice and Experience 25 7 (July) 789-810

httpwwwantlrorgarticle1055550346383antlrpdf

[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific

Languages Pragmatic

[Peters] Christoph Peters The Ultimate 3D model file format (v 20)

httpultimate3dorgU3DModelFileFormatV2htm

[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003

httpxiphorgoggdocrfc3533txt

[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20

wwweblongcomzarfblorb

httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt

[ Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011

wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml

[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)

Format Version 10 2001 wwwlibpngorgpubmngspecCredits

31

[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File

Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008

httpscopenofficeorgexcelfileformatpdf

[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF

[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments

3072 March 2001 wwwietforgrfcrfc3072txt

[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft

26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt

[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and

old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21

[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988

httpamigan1emunetregANIMtxt

[Tamir 2005] Giora Tamir C RIFF Parser June 2005

wwwcodeprojectcomKBfilesriffparseraspx

[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications

In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi

Khosrow-Pour Information Science Publishing pp 268-273 2008

wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf

[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for

File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute

September 2009

[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical

Engineering The University of Queensland 1996

httpiteeuqeduau~cristinastudentsdaviddavidhtml

[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language

understanding system ARPA Human Language Technologies Workshop Plainsboro New

Jersey httpaclldcupenneduHH94H94-1039pdf

[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars

Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)

Springer-Verlag Berlin 1974 pp 9-16

32

[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley

1995 (chapter 10) (Attribute Grammars)

[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010

httpxiphorgvorbisdocVorbis_I_spechtml

[Zappe 1994] H Zappe Oktalyzer 1994

httpwwwwotsitorglistaspsearch=oktampbutton=GO21

33

Appendix A Chunk-based Binary File Formats

This appendix references specifications for about 75 chunk-based binary file formats and

discusses some of those specifications

A1 Electronic Arts Interchange File Format (IFF)

[Amigan 2011]

Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the

specification for the Interchange File Format [Morrison 1985]

A11 Interleaved Bitmap (ILBM)

[Morrison 1986]

This file format was discussed in section 41 of this report

A12 8-bit Sampled Voice (8SVX)

The file format specification for the 9-bit sampled voice file format uses C programming

language data structures to define the format [Hayes and Morrison 1985] The following is a re-

expression of that specification as a context-free binary file grammar

Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata

cksize rarr ULONG

ckdata rarr voice8header bodyck

ckdata rarr voice8header nameck copyrightck bodyck

ckdata rarr voice8header copyrightck nameck bodyck

ckdata rarr voice8header nameck bodyck

ckdata rarr voice8header copyrightck bodyck

ckdata rarr voice8header authorck bodyck

ckdata rarr voice8header annotation bodyck

ckdata rarr voice8header atakck releaseck bodyck

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 7: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

1

1 Introduction

11 Background

Automated tools are required for identifying and validating the formats of the huge number of

files ingested into digital data and record archives Automated tools are also needed for

viewingplaying text and binary file formats and for converting legacy and obsolete file formats

to standard or current formats Technologies such as data description languages have emerged to

address these preservation challenges [Dunckley et al 2007]

Context-free grammars have been used to specify the syntax of programming languages

Attribute grammars provide a framework for formally specifying the semantics of a language

based on its context-free grammar and for addressing the mildly context-sensitive features of

programming languages such as agreement of the data types of variables in expressions with data

type declarations Such grammars have parsingtranslation algorithms that can be used for

syntax-checking and interpretation or translation of the languages the grammars define

12 Purpose

The research question addressed by this research is whether it is possible to extend the context-

free grammars used to specify the syntax of programming languages to the specification of

binary file formats and to use these grammars with parsers for validating the file formats of

binary files

13 Scope

The next section discusses the traditional approach to specifying file formats and two of the

major families of binary file formats In section 3 extensions to the concepts of context-free

grammars and attribute grammars are described that enable the specification of binary file

formats In section 4 examples of attribute array grammars for a chunk-based binary file formats

are presented In section 5 recursive descent parsers for these classes of grammars are described

In section 6 experience in using ANTLR a parser generator for LL() string grammars in

generating parsers for recognizing the formats of binary files is discussed In section 7 related

research is described Section 8 summarizes the results

2 Binary File Formats

Traditionally a file (or record) layout was a form showing how fields were positioned in a file

(or record) The fields are named and have as attributes a data type length and sometimes a

constant value Fields are offset at addresses relative to the beginning of a file File formats are

still specified in this manner For instance Figure 1 shows the format layout of a RIFF container

for the VP8 codec of a Webp image [Google 2010]

2

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk starting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

Figure 1 File format (Layout) of a RIFF container for a Webp Image

Many binary file formats are specified using pseudo-regular expressions or pseudo-EBNF

notation EBNF (Extended Backus-Naur Form) is a notation for expressing context-free

grammars in a compact human readable way Pseudo-EBNF is similar in concept to pseudocode

which is a high-level description of a computer algorithm that is intended for human

understanding but that omits details that would be necessary for computer execution The

pseudo-EBNF (or regular expression) specification of a file format is augmented with a natural

language description of the details that are not actually expressible in the EBNF (or regular

expression) notation These details are often the context-sensitive relationships of the size of an

array to its actual length or the relationship of an address pointer to the actual location of the

data pointed to in the file These context-sensitive relationships are not expressible in a context-

free grammar (or EBNF notation) It is a goal of this research to extend the EBNF notation and

concept of a context-free grammar to include the details that are necessary to precisely specify

the file formats of binary file formats that include these context-sensitive features

Binary file formats with a similar file structure are referred to as a family of file formats There

are two families of binary file formats that are readily distinguished the chunk-based and the

directory-based file formats

Chunk-based file formats were created by Electronic Arts and Commodore-Amiga as the

Interchange File Format (IFF) [Morrison 1985] An IFF file itself is one entire IFF chunk A

chunk consists of an ID tag a size and size bytes of data The data may include chunks called

subchunks Subchunks have the same structure as chunks Subchunks can have subchunks

Chunk-based file formats are primarily used as containers for multimedia but have also been

used for word processing files including pictures and other figures File format specifications for

chunk-based formats may use other terms than chunks in describing the format for instance

atoms in QuickTimeMP4 segments in JPEG and ―tagged data representations as in

AutoCAD DXF

3

The Audio Interchange File Format (AIFF) developed by Apple computer in 1988 was based on

Electronic Arts Interchange File Format Apple Core Audio Format (CAF) Applelsquos replacement

format for AIFF remains a chunk-based format [Apple Computer 2005] The Resource

Interchange File Format (RIFF) introduced by IBM and Microsoft [1991] is also based on

Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-Video

Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis WebP a picture format

recently introduced by Google[2010] also uses RIFF as a container

Microsoftlsquos Advanced Systems Format (ASF) [2004] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV) Microsoftlsquos Binary Interchange File Format [Rentz 2008]

(Microsoft Excellsquos File Format) is chunk-based

Another family of binary file formats is directory-based A directory-based file format consists of

one or more directory tables which contain one or more directory entries A directory entry

specifies where the actual data for a type of information is located This is the scheme used in

TIFF files OLE (Microsoft Object Linking and Embedding) files OASIS OpenDocument and

Microsoft Open Office files

3 Context-Free Grammars and Attribute Grammars

Context-free grammars have been widely used to define programming languages A context-free

grammar G is a quadruple ltN Σ S Pgt where

N is a finite set of non-terminal symbols

Σ is a finite set of terminal symbols

S ε N is the start symbol

P is a set of production rules of the form N rarr N Σ

The set of all strings over a vocabulary Σ is denoted by Σ The language generated by a context-

free grammar G is the set L(G) = w w Σ and S w

31 Context-Free Grammars for Arrays of Data Types

An array is a data structure consisting of a collection of elements (values or variables) each

identified by an index An array is stored so that the position of each element can be computed

from its index For example an array of 10 integer variables with indices 0 through 9 may be

4

stored as 10 (4 byte) words at file addresses 0 4 8 hellip36 so that the element with index i has the

address 4 times i

Arrays are used to implement many other data structures such as lists and strings In most

modern computers the internal memory and external memory of storage devices (and files on

those devices) is a one-dimensional array of data types whose indices are their addresses

A data type is a classification of one of various types of data such as floating-point integer

character pointer or Boolean that determines the possible values for that type the operations

that can be performed on values of that type and the way values of that type can be stored Data

types have names such as int16 int32 float char bool ptr We define a binary file to be an array

of values of data of various types

We define a context-free binary file grammar BG as a quintuple ltN D Σ S Pgt where

N is a finite set of non-terminal symbols

D is a set of data types

Σ is a finite set of binary values of data types D called terminals

S ε N is the start symbol

P is a set of production rules of the form N rarr N Σ

Let DataTypes indicate the union of all the values of all datatypes D The set of all binary files is

BinaryFiles = [IndicesrarrDataTypes] where Indices = 1 The language generated by a

context-free array (binary file) grammar AG is the set L(AG) = w w BinaryFiles and S w

By primitive nonterminal is meant a symbol which would be replaced by one or more terminal

symbols were an additional production rule applied In other words a primitive nonterminal is a

nonterminal that appears on the left-hand side of productions which only have one or more

terminal symbols on the right We adopt the notation shown below for primitive nonterminals

ltprimitive_nonterminal_symbol type=datatype_name constant=valuegt

32 Attribute Grammars

Context-free grammars cannot represent context-sensitive aspects of programming languages

such as (1) enforcing the constraint that all variables are declared before they are used or (2)

checking the number of parameters in a function call against the number in the functionlsquos

declaration Context-free grammars have also been used to define the syntax of English and other

natural languages but they cannot define such context-sensitivity as subject verb agreement in

English sentences A number of extensions of context-free grammars have been proposed to

5

address the context-sensitivity of programming languages and natural languages These include

indexed [Aho 1968] recording [Barth 1979] affix [Koster 1971] VW [van Wijngaarden 1974]

and attribute grammars [Knuth 1969 1971]

Context-free grammars also cannot represent the semantics of programming languages Knuth

[1968 1971] proposed an extension of context-free grammars termed attribute grammars that

addresses the semantics as well as the context-sensitivity of programming languages

An attribute grammar GA is a triple ltG A ARgt where G is a context-free grammar for the

language A associates each grammar symbol X isin (N Σ) with a set of attributes and AR

associates each production R isin P with a set of attribute computation rules A(X) where X isin (N

Σ) can be further partitioned into two sets synthesized attributes S(X) and inherited attributes

I(X) AR(R) where R isin P contains rules for computing inherited and synthesized attributes

associated with the symbols in the production R Synthesized attributes are evaluated and passed

up from deeper terminals and non-terminals while inherited attributes are passed down from

upper non-terminals

An L-attributed grammar is an attribute grammar that uses

i) synthesized attributes and

ii) inherited attributes where the inherited attribute depends only on

a) attributes of the other child nodes to its left

b) inherited attributes of P itself

L-attributed grammars allow the attributes to be evaluated in one left-to-right traversal of the

abstract syntax tree As a result attribute evaluation in L-attributed grammars can be

incorporated conveniently in top-down parsing

33 Semantic Grammar

An interpreter for a binary file format that is using a grammar to display or play the contents of a

file or to convert it to another file format needs to produce a semantic representation that is

appropriate to the task Natural language understanding components of dialog systems face a

similar need to produce a semantic representation that is appropriate to the dialogue task To

address this need some developers of dialogue systems make use of an approach called semantic

grammar [Issar and Ward 1993 Ward and Issar 1994] A semantic grammar is a CFG in which

the non-terminals on the left-hand side of rules correspond to semantic entities A parser for a

semantic grammar produces a labeling of the input string with semantic node labels for example

SHOW FLIGHTS ORIGIN DESTINATION DEPART_DATE DEPART_TIME Show me flights from Boston to San Francisco on Tuesday morning

6

A semantic grammar is just what we need to file in the values of a programming language data

structure Semantic grammars are not applicable to generic systems but are limited to specific

domains This is not a problem in specifying or parsing a binary file format since a grammar for

a binary file format is not meant to apply to all binary file formats but just to a single format or

to a family of file formats

4 Grammars for Chunk-based Binary File Formats

The attribute grammar that we present will be a mixture of synthesized semantic rules (attribute

values of the left side of a production are calculated from attribute values of symbols on the right

side) and inherited rules (attribute values of symbols on the right side of a production are

calculated from attribute values of the symbol on the left side)

The field names of the file format spec are represented as nonterminals in the grammar Constant

values in fields are terminals that appear on the right-hand side of productions Nonterminal

fieldnames have data types associated with them

41 InterLeaved BitMap (ILBM)

The author of the Interleaved Bitmap (ILBM) file format recognized that the structure of the file

format could be specified using pseudo EBNF rules and C data structure declarations [Morrison

1986] In this section the ―grammar of the ILBM file format specification is converted to a

context-free binary grammar with synthesized and inherited attributes

Figure 2 shows a display of an ILBM file hampic Figure 3 shows an attribute grammar for its

chunk-based binary file format The attribute grammar is a mixture of synthesized semantic rules

(attribute values of the left side of a production are calculated from attribute values of symbols

on the right side) and inherited rules (attribute values of symbols on the right side of a

production are calculated from attribute values of the symbol on the left side) The attribute

grammar is also an L-attributed grammar Figure 4 shows a parse tree for the file hampic

7

Figure 2 ILBM File hampic Displayed

ltILBMgt rarr ―FORM ltcksize TYPE=UINT32gt ―ILBM ltBMHDgt ltCMAPgt ltCAMGgt ltBODYgt

ILBMcksize = cksizevalue

ltBMHDgt rarr ―BMHD ltcksize TYPE=uint32gt ltBitmapHeadergt

ltBitMapheadergt rarr ltwidth type=UINT32gt

lt height type=UINT16gt

lt xposition type=INT16gt

ltyposition type=INT16gt

ltnplanes type=BYTEgt

ltmasking type=BYTEgt

ltcompression type=BYTEgt

ltreserved type=BYTEgt

lttransparentcolor type=UINT16gt

ltxaspect type=BYTEgt

ltyaspect type=BYTEgt

ltpagewidth type=INT16gt

ltpageheight type=INT16gt

ltCMAPgt -rarr―CMAP ltcksize type=UINT32gt ltcolorgt[n] n=cksizeval3

ltCAMGgt rarr ―CAMG ltcksizetype=UINT32gt ltviewmode type=INT32gt

ltBODYgt rarr ―BODY ltcksize type=UINT32gt ltdata type=BYTEgt[cksize]

ltcolorgt rarr ltred type=BYTEgt ltgreen type= BYTEgt ltblue type=BYTEgt

Figure 3 An Attribute Grammar for an ILBM Chunk-based File Format

8

Figure 4 A Parse Tree for the ILBM File hampic

9

42 RIFF Waveform PCM Audio File Format (WAVE)

The waveform PCM File format is was originally specified in an extended BNF notation [IBM amp

Microsoft 1991] The chunk size was omitted from the EBNF rules Figure 5 shows the EBNF

rules with the cksize inserted To convert this grammar into a binary file format grammar data

types are added for primitive non-terminals

ltWAVE-formgt -gt bdquoRIFF‟ cksize WAVE

ltfmt-ckgt

[ltfact-ckgt]

[ltcue-ckgt]

[ltplaylist-ckgt]

[ltassoc-data-listgt]

ltwave-datagt

ltfmt-ckgt -gt bdquofmt ‟ cksize ltcommon-fieldsgt ltwBitsPerSampleWORDgt

ltcommon-fieldsgt -gt ltwFormatTagWORD value=0x0001gt

ltwChannelsWORDgt

ltdwSamplesPerSecDWORDgt

ltdwAvgBytesPerSecDWORDgt

ltwBlockAlignWORDgt

ltwave-datagt -gt ltdata-ckgt | ltwave-listgt

ltdata-ckgt -gt data cksize ltwave-datagt

ltwave-listgt -gt LIST cksize wavl ltdata-ckgt | ltsilence-ckgt

ltsilence-ckgt -gt slnt cksize ltdwSamplesDWORDgt

ltfact-ckgt -gt factxksize ltdwFileSizeDWORDgt

ltcue-ckgt -gt cue cksize ltdwCuePointsDWORDgt ltcue-pointgt+

ltcue-pointgt -gt ltdwNameDWORDgt

ltdwPositionDWORDgt

ltfccChunkFOURCCgt

ltdwChunkStartDWORDgt

ltdwBlockStartDWORDgt

ltdwSampleOffsetDWORDgt

ltplaylist-ckgt -gt bdquoplst‟ cksize ltdwSegmentsDWORDgt ltplay-segmentgt+

ltplay-segmentgt -gt ltdwNameDWORDgt

ltdwLengthDWORDgt

ltdwLoopsDWORD

ltassoc-data-listgt-gt bdquoLIST‟ cksize adtl ltlabl-ckgtltnote-ckgtltltxt-ckgtltfile-ckgt

ltlabl-ckgt -gt bdquolabl‟ cksize ltdwNameDWORDgt ltdataZSTRgt

ltnote-ckgt -gt bdquonote‟ cksize ltdwNameDWORDgt ltdataZSTRgt

10

ltltxt-ckgt -gt bdquoltxt‟ cksize ltdwNameDWORDgt

ltdwSampleLengthDWORDgt

ltdwPurposeDWORDgt

ltwCountryWORDgt

ltwLanguageWORDgt

ltwDialectWORDgt

ltwCodePageWORDgt

ltdataBYTEgt+

ltfile-ckgt -gt bdquofile‟ cksize ltdwNameDWORDgt ltdwMedTypeDWORDgt

ltfileDataBYTEgt+

Figure 5 Binary File Grammar for RIFF WAVEFORM PCM File Format

5 Recursive Descent Parsers for Binary File Grammars

A recursive descent parser is a top-down parser built from a set of recursive (or non recursive)

procedures where each procedure implements one of the production rules of the grammar A

predictive parser is a recursive descent parser that does not require backtracking Predictive

parsing is possible only for the class of LL(k) grammars which are the context-free grammars

for which there exists some positive integer k that allows a recursive descent parser to decide

which production to use by examining only the next k tokens of input A recursive descent parser

for the binary file format grammars defined in this paper does not require a separate lexer to

identify the data of a file The data types are predicted by the grammar rules

The semantic rules of an L-attributed grammar can be included in the recursive descent parser to

check for context-sensitive aspects of the grammar such as chunk size or image dimensions and

use these values in parsing the chunk data or images The pointers to file addresses that occur in

the file formats are handled by pushing the current address in the file onto a pushdown stack

When the parser exhausts the procedures implementing production rules corresponding to the

pointed to location the topmost address on the pushdown stack is popped and the procedure

corresponding to the nonterminal at that position is executed

Recursive descent parsers with a pushdown stack can be used with binary file format grammars

to create file format recognizers A recognizer (syntax checker or validator) is a parser that reads

a file and generates an error if the file does not conform to the syntax specified by the grammar

11

6 A Parser Generator for Recognizing Binary File Formats

What we want is a parser generator whose input is a binary file attribute grammar for a particular

binary file format and whose generated output is the Java source code of a recursive descent

parser (with a pushdown stack if needed) for the class of binary file formats specified by the

grammar Such a parser generator does not yet exist but there is a widely used parser generator

for attribute grammars for string-based languages

61 ANTLR

ANTLR (ANother Tool for Language Recognition) is a parser generator that uses LL() parsing

[Parr amp Quong 1995 Parr 2007] ANTLR takes as input an attribute grammar that specifies a

language and generates as output source code for a recognizer for that language ANTLR

supports generating code in a number of the programming languages including C Java

JavaScript and Python

ANTLR also generates a lexer from lexical rules in the grammar A lexer is a program that

converts a sequence of characters into tokens A lexer is not needed to parse binary file formats

because the data types predicted by the parser are the tokens of the grammar Furthermore the

capability to recognize LL() grammars is not needed because the binary file format grammars

specified so far do not require lookahead of k symbols

62 Using ANTLR to Generate a Parser for a Binary File Format

It is possible to use ANTLR to test our binary file format grammars in recognizing the chunk-

based and directory-based binary file formats This is accomplished by writing a lexer that treats

each input byte in a file as a character token Then functions are created for each data type in the

binary file grammar that convert the appropriate number of character tokens to binary data types

eg int 16 int32 etc This is effective if someone clumsy and has allowed us to test our

attribute grammars for binary file formats as well as to demonstrate the feasibility of creating a

parser generator for binary file grammars

621 An ANTLR Grammar for an ILBM Chunk-based File Format

Figure 6 shows a specification for the ILBM file format that is input to ANTLR The output of

ANTLR is a lexer and parser for ILBM files

12

grammar iblm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in lexerheader is placed at the top of the lexer file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Grammar Rules used to create the Parser

Start symbol ilbm

iblm returns [boolean result]

FORM ckSize ILBM

Print out ILBM chunksize

Systemoutprintln(ID ILBM Size + $ckSizevalue)

Semantic predicate that ensures that at least on BMHD is found

propertyChunks $propertyChunksfoundBMHD == true

dataChunks

body

$result = true

The bmhd cmap and camg property chunks can occur in any order

At least one BMHD data chunk must be present

propertyChunks returns [boolean foundBMHD]

(bmhd $foundBMHD = true

| cmap

| camg)+

bmhd

BMHD ckSize

Systemoutprintln(ID BMHD Size + $ckSizevalue)

13

bitMapHeader

bitMapHeader

width = uWordValue

height = uWordValue

xPosition = wordValue

yPosition = wordValue

nPanes = uByteValue

masking = uByteValue

compression = uByteValue

reserved1 = uByteValue

transparentColor = uWordValue

xAspect = uByteValue

yAspect = uByteValue

pageWidth = wordValue

pageHeight = wordValue

Print out values in the bitmap structure

Systemoutprintln( Width + $widthvalue)

Systemoutprintln( Height + $heightvalue)

Systemoutprintln( X Position + $xPositionvalue)

Systemoutprintln( Y Position + $yPositionvalue)

Systemoutprintln( Number of Panes + $nPanesvalue)

Systemoutprintln( Masking + $maskingvalue)

Systemoutprintln( Compression + $compressionvalue)

Systemoutprintln( Reserved1 + $reserved1value)

Systemoutprintln( TransParent Color +

$transparentColorvalue)

Systemoutprintln( X Aspest + $xAspectvalue)

Systemoutprintln( Y Aspect + $yAspectvalue)

Systemoutprintln( Page Width + $pageWidthvalue)

Systemoutprintln( Page Height + $pageHeightvalue)

cmap returns [long value]

Initialize the index value i in the cmap function

init int i = 0

CMAP ckSize

Print out chunk size

$value = $ckSizevalue

Systemoutprintln(ID CMAP Size + $ckSizevalue)

(colorRegister

Print out color values

Systemoutprintln( Color[ + i++ + ] +

$colorRegisterredValue +

+ $colorRegistergreenValue +

+ $colorRegisterblueValue +

)

)+

colorRegister returns [int redValue int greenValue int blueValue]

14

red = uByteValue green = uByteValue blue = uByteValue

$redValue = $redvalue

$greenValue = $greenvalue

$blueValue = $bluevalue

camg

CAMG ckSize longValue

Systemoutprintln(ID CAMG Size + $ckSizevalue)

Systemoutprintln( View Modes 0x +

LongtoHexString($longValuevalue))

The crng and ccrt data chunks

dataChunks

crng

| ccrt

crng

CRNG ckSize

Systemoutprintln(ID CRNG Size + $ckSizevalue)

cRange

cRange

pad1 = wordValue $pad1value == 0

rate = wordValue

active = wordValue

low = uByteValue

high = uByteValue

ccrt

CCRT ckSize

Systemoutprintln(ID CCRT Size + $ckSizevalue)

cycleInfo

cycleInfo

direction = wordValue

start = uByteValue

end = uByteValue

seconds = longValue

microseconds = longValue

pad = wordValue $padvalue == 0

body

BODY ckSize

Systemoutprintln(ID BODY Size + $ckSizevalue)

$ckSizevalue gt 0 data

Systemoutprintln( Number of bytes + $datavalue)

$datavalue == $ckSizevalue

defines the data chunk of the body chunk

data returns [long value]

(b+=BYTE)+ $bsize() gt 0

$value = $bsize()

15

The cksize is the value of a long integer

ckSize returns [long value]

longValue

$value = $longValuevalue

defines the longValue data typein terms of BYTE terminals

longValue returns [long value] throws UnsupportedEncodingException

h1 = BYTE h2 = BYTE l1= BYTE l2 = BYTE

$value = (long)

( getUByteValue($h1) ltlt 24 |

getUByteValue($h2) ltlt 16 |

getUByteValue($l1) ltlt 8 |

getUByteValue($l2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uWordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (short)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uByteValue returns [int value]

BYTE

calls the getUByteValue from a BYTE token

16

$value = (int) getUByteValue($BYTE)

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rules used to create the Lexer

FORM FORM

ILBM ILBM

BMHD BMHD

CMAP CMAP

CAMG CAMG

CRNG CRNG

CCRT CCRT

BODY BODY

BYTE

Figure 6 An ANTLR Grammar for the ILBM File Format

622 The Lexer and Parser

When provided the file described in the previous section ANTLR generates the ilbmlexer and

ilbmparser Java programs The Java application shown in Figure 6 uses these programs to

validate the ILBM file BLKIFF It produces the output shown in Figure 7

package orggtrigrammarschunk

import javaioIOException

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRStringStream

import organtlrruntimeCharStream

import organtlrruntimeCommonTokenStream

import organtlrruntimeRecognitionException

17

import organtlrruntimeTokenStream

import gtriorggrammarschunkiblmLexer

import gtriorggrammarschunkiblmParser

public class Test

static String filePath = CUserssl170PicturesIFF-ilbmBLKIFF

public static void main(String[] args) throws RecognitionException IOException

ANTLRFileStream stream = new ANTLRFileStream(filePath ISO-8859-1)

iblmLexer lexer = new iblmLexer(stream)

TokenStream tokenStream = new CommonTokenStream(lexer)

iblmParser parser = new iblmParser(tokenStream)

Systemoutprintln(FILE + filePath)

Systemoutprintln(Successfully parsed = + parseriblm())

Figure 7 A Java Application for Parsing an ILBM File

FILE CUserssl170PicturesIFF-ilbmBLKIFF

ID ILBM Size 2556

ID BMHD Size 20

Width 200

Height 144

X Position 0

Y Position 0

Number of Panes 6

Masking0

Compression1

Reserved10

TransParent Color0

X Aspest3

Y Aspect3

Page Width200

Page Height144

ID CMAP Size 48

Color[0] 0 0 0

Color[1] 0 128 128

Color[2] 0 255 255

Color[3] 255 0 255

Color[4] 0 255 0

Color[5] 255 0 0

Color[6] 0 0 255

Color[7] 255 255 0

Color[8] 128 128 128

Color[9] 192 192 192

Color[10] 128 0 0

Color[11] 0 128 0

Color[12] 128 128 0

Color[13] 0 0 128

Color[14] 128 0 128

18

Color[15] 255 255 255

ID CAMG Size 4

View Modes 0x800

ID BODY Size 2448

Number of bytes 2448

Successfully parsed = true

Figure 8 Output of the ILBM Parser Applied to the file BLKIFF

63 An ANTLR Grammar for the RIFF Waveform PCM File Format

Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR

The output of ANTLR is a lexer and parser for WAVEFOR PCM files

grammar waveform_pcm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text contained in lexerheader is placed at the top of the lexer

file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Default ByteOrder to Little Endian

String byteOrder = II

Rules used to create the Parser

start symbol waveform_pcm

waveform_pcm returns [boolean result]

19

ckId = fourCC $ckIdvalueequals(RIFF)

ckSize = dWordValue

formType = fourCC $formTypevalueequals(WAVE)

formatSubChunk

dataSubChunk

$result = true

fourCC returns [String value]

BYTE BYTE BYTE BYTE $value = $text

formatSubChunk

subChunkId = fourCC $subChunkIdvalueequals(fmt )

ckSize = dWordValue

commonFields

pcmSpecificField

commonFields

formatTag = wordValue Format category

channels = wordValue Number of channels

samplesPerSec = dWordValue Sampling rate

avgBytesPerSec = dWordValue For buffer estimation

blockAlign = wordValue Data block size

pcmSpecificField

bitsPerSample = wordValue Sample size

dataSubChunk

subChunkId = fourCC $subChunkIdvalueequals(data)

ckSize = dWordValue

waveData[$ckSizevalue]

waveData [long size]

(size gt= 0=gt BYTE size--)+

defines the long data type in terms of BYTE terminals

dWordValue returns [long value] throws UnsupportedEncodingException

b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE

if (byteOrderequals(II))

$value = (long)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 |

getUByteValue($b3) ltlt 16 |

getUByteValue($b4) ltlt 24 )

else

$value = (long)

( getUByteValue($b1) ltlt 24 |

getUByteValue($b2) ltlt 16 |

20

getUByteValue($b3) ltlt 8 |

getUByteValue($b4) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

if (byteOrderequals(II))

$value = (int)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 )

else

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rule used to create the Lexer

BYTE

Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format

7 Related Research

Researchers have developed a number of data description languages to support validation of file

formats EAST (Enhanced ADA Subset) is a data description language developed by the

Consultative Committee for Space Data Systems [2010] The Data Entity Dictionary

Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic

information The EAST description is used to interpret and provide access to information in

binary and text files

DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to

manipulate Java jar files and ELF object files The PADSC data description language can be

used to define the file formats of text and binary files [Fisher amp Gruber 2005] The PADSC

compiler compiles the description into tools that can be used to recognize manipulate and

transform the data into other formats

21

The Data Format Description Language (DFDL) is being developed by a Working Group of the

Open Grid Forum [OGF 2011] The data description language is based on a subset of W3C XML

Schema using ltxsappinfogt annotations to carry the extra information necessary to describe non-

XML physical representations Version 1 of the language specification was published in 2011

and a parser is being implemented

Each of the data description languages described above can be used to define data types and file

structures of binary files However they are based on file structure declarations such as those of

C or Java The binary file format grammar described in this paper most closely resembles DFDL

However the binary file grammar described in this paper is the only data description language

based on formal grammars that is used for creating recognizers for file formats

JHOVE (JSTORHarvard Object Validation Environment) is an extensible system designed to

provide automated and efficient identification and validation of the formats of digital files

JHOVE is a format-specific digital object validation API written in Java JHOVE supports

validation of the following formats AIFF ASCII GIF HTML JPEG JPEG 2000 PDF TIFF

UTF-8 WAVE and XML JHOVE2 is second generation validation environment [California

digital Library 2011]

The research reported in this paper is similar in intent to that of the JHOVE projectsmdashvalidation

of binary file formats As a matter of fact binary file grammars and parsers have been

constructed for the chunk-based file formats AIFF JPEG and WAVE as well as the directory-

based format TIFF However the research reported herein differs from that of the JHOVE

project in that what is sought is a technology for generating validators for binary file formats

from grammars specifying the binary file format

8 Conclusion

The research question addressed by his research is whether it is possible to extend the context-

free grammars used to specify the syntax of programming languages to the specification of

binary file formats and to use these grammars with parsers for validating the file formats of

binary files Amajor family of binary file formats chunk-based was described Then extensions

to the concepts of context-free grammars and attribute grammars were described that enable the

specification of binary file formats Examples of attribute array grammars for two chunk-based

binary file format were then presented Recursive descent parsers for this classes of grammars

was then described Experience in using ANTLR a parser generator for LL() string grammars

in generating parsers for recognizing the formats of binary files was discussed Finally related

research in data description languages for binary file formats was described and the JHOVE

project in creating Java-based tools for validating file formats was described

22

It is concluded that it is possible to extend context-free grammars to the specification of chunk-

based binary file formats Furthermore these grammars can be used with recursive descent

parsers for validating the file formats of chunk-based binary files It remains to be determined

whether these attribute grammars based on binary file (array) grammars are adequate to define

other families of binary file formats

To be a practical technology for generating recognizers (validators) for binary file formats a

parser generator is needed that creates from a binary file grammar a recursive descent parser

(with pushdown stack if needed) ANTLR which accepts LL() grammars as input and

generates a lexer as well as a parser is too heavyweight a parser generator for this task and does

not have the requisite data types built in

A capability to validate binary file formats assumes that the file format of a file is already

known In this research we use a file type identifier based on the UNIX file command and a

magic file that we have created [Underwood 2009] We have samples of the chunk-based file

formats discussed in this report including those referenced in Appendix a We also have file

signature tests (magic tests) for all these file types

We would like to have a parser generator (also called a compilerndashcompiler) that would take as

input an attribute grammar for a binary file format and generate a parser or interpreter that

recognizes the file format or that interprets the file format and displays or plays the contents The

closest parser generator that we have found to having these capabilities is ANTLR

(wwwantlrorg ) ANTLR will generate a top-down recursive descent parser from a context-free

attribute grammar However it works with strings and generates a separate scanner

Appendix A lists the names of about 75 file formats that belong to the family of ―chunk-based

file formats We have specifications for each of file formats and samples of files with these

formats

These file formats have a similar structure We have developed binary file grammars for two of

these binary file formats We should be able to do so for all of them We used these grammars

with the extensions to ANTLR to generate top-down recursive descent parsers for the file

formats defined with these grammars

The significance of success in this research task is that if binary file formats can be specified

with binary file grammars then only one parsing generator is needed to generate the parsers for

verifying that a file conforms to a particular format Similarly the same parsing generator can be

used for conversion of legacy file formats to current or standard formats Finally the same parser

generator can be used for generating viewerplayers for most file formats This would increase

23

the likelihood of preserving and making available into the indefinite future hose digital records

encoded in binary file formats

24

References

[Aho 1968] A V Aho Indexed Grammars ndash An Extension of Context-free Grammars Journal

of the ACM Vol 15 Issue 4 Oct 1968

[Alias-wavefront 1999] Maya File Formats Maya 20 1999

httpcaadarchethzchinfomayamanual

[Amigan 2011] Amigan Software IFF FORM and Chunk Registry

httpamigan1emunetregiffhtml

[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description

May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description

[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats

Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21

[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files

Version 13 Apple Computer Inc January 1989

wwwaesnashvilleorgPDFsAIFF20Specspdf

[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-

webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt

[Apple 2005] Apple Core Audio Format Specification 2005b

httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_

specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101

[Apple 2007] QuickTime File Format Specification

httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref

acehtml

[ASF 2001] ASF format version 10 reconstruction January 2001

httpavifilesourceforgenetasf-10htm

[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January

1997 wwwdcsedacukhomemxrgfx3d3DSspec

[Autodesk 2009] Autodesk DXF Reference January 2009

httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112

25

[Back 2002] G Back (2002) A specification and scripting language for binary data Generative

Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-

77

[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V

22 1979 pp 243-256

[Baum amp Noel 2002]D Baum amp G Noel The Simms Interchange File Format 2002

httpsimtechsourceforgenettechiffhtml

[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New

Horizons Software Inc 1987 httpamigan1emunetregWORDtxt

[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995

ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt

[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010

httpondasourceforgenetnlfnlfFormathtml

[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm amp Onda file

formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml

[California Digital Library 2011] California Digital Library JHOVE2 Userlsquos Guide 2011

httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf

[Caligari 1998] Caligri Caligari trueSpace file format Document Version 6 July

2002 httpwwwcaligaricomdownloadzipscncob65zip

[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information

Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements

and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-

t81pdf

[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description

Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf

[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998

[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt

[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D

Taliesin Inc httpamigan1emunetregDR2Dtxt

26

[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG

decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt

[Dunckley et al 2007] Dunckley M Rankin S Conway E amp Giaretta D (2007) The use of

file description languages for file format identification and validation Proc PV 2007

Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical

Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-

detailsw=50089

[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format

Tech 3285 July 2001 httptechebuchdocstechtech3285pdf

Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml

[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd

May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html

[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495

wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml

[Fisher amp Gruber 2005] Fisher K amp Gruber R (2005) PADS A domain specific language for

processing ad hoc data ACM Conference on Programming language Design and

Implementation ACM Press httpwwwpadsprojorgpaperspldipdf

[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG

1995

[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal

Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September

1997 wwwgnelsondemoncoukzspecquetzalhtml

[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called

Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)

httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt

[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03

httpwwwanim8orcomresourcesan8_formattxt

[Google 2010] WebP RIFF Container 2010

httpcodegooglecomspeedwebpdocsriff_containerhtml

[Haaland] Mike Haaland Flic Files (FLI) Format description

httpwwwmartinreddynetgfxanimFLItxt

27

[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube

Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf

[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice

Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt

[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata

Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1

1995

[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)

August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21

[IBM amp Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data

Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt

wwwtactilemediacominfoMCI_Control_Infohtml

httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf

[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec

11 2008 httpduskblueorgprojtoymidimidiformatpdf

[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10

ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf

[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable

Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C

Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110

[Issar amp Ward 1993] S Issar and W Ward CMUlsquos robust spoken language understanding

system EUOSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf

[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software

November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application

Version 50) wwwjasccomsupportkbarticlespspspecasp

[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image

File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version

22 April 2002

httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf

[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994

httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt

28

[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory

vol 2 pp 127_145 1968

[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical

Systems Theory vol 5 pp 95_96 1971

[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)

North-Holland Publ Co Amsterdam 1971 pp 95-109

[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language

Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml

httpwebcswpiedu~kalcoursescompilersmodule4mysahtml

[LightWave 1996] LightWave 3D Object File Format Oct 16 1996

httpsandboxdeosglightwavehtm

[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWo2] November 9 2001

httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml

[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer

1997 httppaulbourkenetdataformatscinema4dcinema4dpdf

[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and

Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf

[Maishein] Max Maischein PIC(PRO) Format

http14791177212extrafileformatgraphicspicpicprotxt

[Martin 2007] Michael Martin SVAF format description Version 02 January 2007

httpsvafsourceforgenetsvafhtml

[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and

Data Techniques July 29 1992 Revision 1097

httpnetghostnarodrugffvendspecmicriffms_rifftxt

[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New

Multimedia Data Types and Data Techniques Revision 30 April 15 1994

[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE

httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-

usstreamhhstreamaudprop_7wz7asp

[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003

December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx

29

[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-

uslibraryms899421aspx

[Morrison 1985] Jerry Morrison ―EA IFF 85rdquo Standard for Interchange Format Files

Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt

[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17

1986 wwwfine-viewcomjplabsdocilbmtxt

[Morrison 1986b] J Morrison ―SMUS IFF Simple Musical Score Electronic Arts Inc

February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus

[MultimediaWiki 2006] Smush Animation Format August 2006

httpwikimultimediacxindexphptitle=SANChunk_Forma

[MultimeduiaWiki 2008a] Electronic Arts CMV April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_CMV

[MultimediaWiki 2008b] Electronic Arts DCT April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_DCT

[MultimediaWiki 2008c] Electronic Arts MAD April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MAD

[MultimediaWiki 2008d] Electronic Arts MPC April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MPC

[MultimediaWiki 2008e] Electronic Arts VP6 April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_VP6

[MultimediaWiki 2008f] Electronic Arts TGV August 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGV

[MultimediaWiki 2008g] Electronic Arts TGQ November 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ

[MultimediaWiki 2008h] American Laser Games MM January 2008

httpwikimultimediacxindexphptitle=American_Laser_Games_MM

[MultimediaWiki 2009a] Electronic Arts TQI February 2009

httpwikimultimediacxindexphptitle=Electronic_Arts_TQI

30

[MultimediaWiki 2009b] SANM October 2009

httpwikimultimediacxindexphptitle=SANM

[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240

httpcodegooglecompll-parser-

generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42

[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter

written in Python in the open source graphics file convertor UniConvertor

httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer

[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification

January 31 2011 wwwogforgdocumentsGFD174pdf

[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988

wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt

[Parr amp Quong 1995] Parr T J amp Quong R W (1995) ANTLR A predicated- ll(k) parser

generator Software Practice and Experience 25 7 (July) 789-810

httpwwwantlrorgarticle1055550346383antlrpdf

[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific

Languages Pragmatic

[Peters] Christoph Peters The Ultimate 3D model file format (v 20)

httpultimate3dorgU3DModelFileFormatV2htm

[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003

httpxiphorgoggdocrfc3533txt

[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20

wwweblongcomzarfblorb

httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt

[ Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011

wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml

[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)

Format Version 10 2001 wwwlibpngorgpubmngspecCredits

31

[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File

Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008

httpscopenofficeorgexcelfileformatpdf

[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF

[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments

3072 March 2001 wwwietforgrfcrfc3072txt

[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft

26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt

[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and

old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21

[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988

httpamigan1emunetregANIMtxt

[Tamir 2005] Giora Tamir C RIFF Parser June 2005

wwwcodeprojectcomKBfilesriffparseraspx

[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications

In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi

Khosrow-Pour Information Science Publishing pp 268-273 2008

wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf

[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for

File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute

September 2009

[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical

Engineering The University of Queensland 1996

httpiteeuqeduau~cristinastudentsdaviddavidhtml

[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language

understanding system ARPA Human Language Technologies Workshop Plainsboro New

Jersey httpaclldcupenneduHH94H94-1039pdf

[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars

Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)

Springer-Verlag Berlin 1974 pp 9-16

32

[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley

1995 (chapter 10) (Attribute Grammars)

[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010

httpxiphorgvorbisdocVorbis_I_spechtml

[Zappe 1994] H Zappe Oktalyzer 1994

httpwwwwotsitorglistaspsearch=oktampbutton=GO21

33

Appendix A Chunk-based Binary File Formats

This appendix references specifications for about 75 chunk-based binary file formats and

discusses some of those specifications

A1 Electronic Arts Interchange File Format (IFF)

[Amigan 2011]

Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the

specification for the Interchange File Format [Morrison 1985]

A11 Interleaved Bitmap (ILBM)

[Morrison 1986]

This file format was discussed in section 41 of this report

A12 8-bit Sampled Voice (8SVX)

The file format specification for the 9-bit sampled voice file format uses C programming

language data structures to define the format [Hayes and Morrison 1985] The following is a re-

expression of that specification as a context-free binary file grammar

Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata

cksize rarr ULONG

ckdata rarr voice8header bodyck

ckdata rarr voice8header nameck copyrightck bodyck

ckdata rarr voice8header copyrightck nameck bodyck

ckdata rarr voice8header nameck bodyck

ckdata rarr voice8header copyrightck bodyck

ckdata rarr voice8header authorck bodyck

ckdata rarr voice8header annotation bodyck

ckdata rarr voice8header atakck releaseck bodyck

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 8: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

2

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk starting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

Figure 1 File format (Layout) of a RIFF container for a Webp Image

Many binary file formats are specified using pseudo-regular expressions or pseudo-EBNF

notation EBNF (Extended Backus-Naur Form) is a notation for expressing context-free

grammars in a compact human readable way Pseudo-EBNF is similar in concept to pseudocode

which is a high-level description of a computer algorithm that is intended for human

understanding but that omits details that would be necessary for computer execution The

pseudo-EBNF (or regular expression) specification of a file format is augmented with a natural

language description of the details that are not actually expressible in the EBNF (or regular

expression) notation These details are often the context-sensitive relationships of the size of an

array to its actual length or the relationship of an address pointer to the actual location of the

data pointed to in the file These context-sensitive relationships are not expressible in a context-

free grammar (or EBNF notation) It is a goal of this research to extend the EBNF notation and

concept of a context-free grammar to include the details that are necessary to precisely specify

the file formats of binary file formats that include these context-sensitive features

Binary file formats with a similar file structure are referred to as a family of file formats There

are two families of binary file formats that are readily distinguished the chunk-based and the

directory-based file formats

Chunk-based file formats were created by Electronic Arts and Commodore-Amiga as the

Interchange File Format (IFF) [Morrison 1985] An IFF file itself is one entire IFF chunk A

chunk consists of an ID tag a size and size bytes of data The data may include chunks called

subchunks Subchunks have the same structure as chunks Subchunks can have subchunks

Chunk-based file formats are primarily used as containers for multimedia but have also been

used for word processing files including pictures and other figures File format specifications for

chunk-based formats may use other terms than chunks in describing the format for instance

atoms in QuickTimeMP4 segments in JPEG and ―tagged data representations as in

AutoCAD DXF

3

The Audio Interchange File Format (AIFF) developed by Apple computer in 1988 was based on

Electronic Arts Interchange File Format Apple Core Audio Format (CAF) Applelsquos replacement

format for AIFF remains a chunk-based format [Apple Computer 2005] The Resource

Interchange File Format (RIFF) introduced by IBM and Microsoft [1991] is also based on

Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-Video

Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis WebP a picture format

recently introduced by Google[2010] also uses RIFF as a container

Microsoftlsquos Advanced Systems Format (ASF) [2004] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV) Microsoftlsquos Binary Interchange File Format [Rentz 2008]

(Microsoft Excellsquos File Format) is chunk-based

Another family of binary file formats is directory-based A directory-based file format consists of

one or more directory tables which contain one or more directory entries A directory entry

specifies where the actual data for a type of information is located This is the scheme used in

TIFF files OLE (Microsoft Object Linking and Embedding) files OASIS OpenDocument and

Microsoft Open Office files

3 Context-Free Grammars and Attribute Grammars

Context-free grammars have been widely used to define programming languages A context-free

grammar G is a quadruple ltN Σ S Pgt where

N is a finite set of non-terminal symbols

Σ is a finite set of terminal symbols

S ε N is the start symbol

P is a set of production rules of the form N rarr N Σ

The set of all strings over a vocabulary Σ is denoted by Σ The language generated by a context-

free grammar G is the set L(G) = w w Σ and S w

31 Context-Free Grammars for Arrays of Data Types

An array is a data structure consisting of a collection of elements (values or variables) each

identified by an index An array is stored so that the position of each element can be computed

from its index For example an array of 10 integer variables with indices 0 through 9 may be

4

stored as 10 (4 byte) words at file addresses 0 4 8 hellip36 so that the element with index i has the

address 4 times i

Arrays are used to implement many other data structures such as lists and strings In most

modern computers the internal memory and external memory of storage devices (and files on

those devices) is a one-dimensional array of data types whose indices are their addresses

A data type is a classification of one of various types of data such as floating-point integer

character pointer or Boolean that determines the possible values for that type the operations

that can be performed on values of that type and the way values of that type can be stored Data

types have names such as int16 int32 float char bool ptr We define a binary file to be an array

of values of data of various types

We define a context-free binary file grammar BG as a quintuple ltN D Σ S Pgt where

N is a finite set of non-terminal symbols

D is a set of data types

Σ is a finite set of binary values of data types D called terminals

S ε N is the start symbol

P is a set of production rules of the form N rarr N Σ

Let DataTypes indicate the union of all the values of all datatypes D The set of all binary files is

BinaryFiles = [IndicesrarrDataTypes] where Indices = 1 The language generated by a

context-free array (binary file) grammar AG is the set L(AG) = w w BinaryFiles and S w

By primitive nonterminal is meant a symbol which would be replaced by one or more terminal

symbols were an additional production rule applied In other words a primitive nonterminal is a

nonterminal that appears on the left-hand side of productions which only have one or more

terminal symbols on the right We adopt the notation shown below for primitive nonterminals

ltprimitive_nonterminal_symbol type=datatype_name constant=valuegt

32 Attribute Grammars

Context-free grammars cannot represent context-sensitive aspects of programming languages

such as (1) enforcing the constraint that all variables are declared before they are used or (2)

checking the number of parameters in a function call against the number in the functionlsquos

declaration Context-free grammars have also been used to define the syntax of English and other

natural languages but they cannot define such context-sensitivity as subject verb agreement in

English sentences A number of extensions of context-free grammars have been proposed to

5

address the context-sensitivity of programming languages and natural languages These include

indexed [Aho 1968] recording [Barth 1979] affix [Koster 1971] VW [van Wijngaarden 1974]

and attribute grammars [Knuth 1969 1971]

Context-free grammars also cannot represent the semantics of programming languages Knuth

[1968 1971] proposed an extension of context-free grammars termed attribute grammars that

addresses the semantics as well as the context-sensitivity of programming languages

An attribute grammar GA is a triple ltG A ARgt where G is a context-free grammar for the

language A associates each grammar symbol X isin (N Σ) with a set of attributes and AR

associates each production R isin P with a set of attribute computation rules A(X) where X isin (N

Σ) can be further partitioned into two sets synthesized attributes S(X) and inherited attributes

I(X) AR(R) where R isin P contains rules for computing inherited and synthesized attributes

associated with the symbols in the production R Synthesized attributes are evaluated and passed

up from deeper terminals and non-terminals while inherited attributes are passed down from

upper non-terminals

An L-attributed grammar is an attribute grammar that uses

i) synthesized attributes and

ii) inherited attributes where the inherited attribute depends only on

a) attributes of the other child nodes to its left

b) inherited attributes of P itself

L-attributed grammars allow the attributes to be evaluated in one left-to-right traversal of the

abstract syntax tree As a result attribute evaluation in L-attributed grammars can be

incorporated conveniently in top-down parsing

33 Semantic Grammar

An interpreter for a binary file format that is using a grammar to display or play the contents of a

file or to convert it to another file format needs to produce a semantic representation that is

appropriate to the task Natural language understanding components of dialog systems face a

similar need to produce a semantic representation that is appropriate to the dialogue task To

address this need some developers of dialogue systems make use of an approach called semantic

grammar [Issar and Ward 1993 Ward and Issar 1994] A semantic grammar is a CFG in which

the non-terminals on the left-hand side of rules correspond to semantic entities A parser for a

semantic grammar produces a labeling of the input string with semantic node labels for example

SHOW FLIGHTS ORIGIN DESTINATION DEPART_DATE DEPART_TIME Show me flights from Boston to San Francisco on Tuesday morning

6

A semantic grammar is just what we need to file in the values of a programming language data

structure Semantic grammars are not applicable to generic systems but are limited to specific

domains This is not a problem in specifying or parsing a binary file format since a grammar for

a binary file format is not meant to apply to all binary file formats but just to a single format or

to a family of file formats

4 Grammars for Chunk-based Binary File Formats

The attribute grammar that we present will be a mixture of synthesized semantic rules (attribute

values of the left side of a production are calculated from attribute values of symbols on the right

side) and inherited rules (attribute values of symbols on the right side of a production are

calculated from attribute values of the symbol on the left side)

The field names of the file format spec are represented as nonterminals in the grammar Constant

values in fields are terminals that appear on the right-hand side of productions Nonterminal

fieldnames have data types associated with them

41 InterLeaved BitMap (ILBM)

The author of the Interleaved Bitmap (ILBM) file format recognized that the structure of the file

format could be specified using pseudo EBNF rules and C data structure declarations [Morrison

1986] In this section the ―grammar of the ILBM file format specification is converted to a

context-free binary grammar with synthesized and inherited attributes

Figure 2 shows a display of an ILBM file hampic Figure 3 shows an attribute grammar for its

chunk-based binary file format The attribute grammar is a mixture of synthesized semantic rules

(attribute values of the left side of a production are calculated from attribute values of symbols

on the right side) and inherited rules (attribute values of symbols on the right side of a

production are calculated from attribute values of the symbol on the left side) The attribute

grammar is also an L-attributed grammar Figure 4 shows a parse tree for the file hampic

7

Figure 2 ILBM File hampic Displayed

ltILBMgt rarr ―FORM ltcksize TYPE=UINT32gt ―ILBM ltBMHDgt ltCMAPgt ltCAMGgt ltBODYgt

ILBMcksize = cksizevalue

ltBMHDgt rarr ―BMHD ltcksize TYPE=uint32gt ltBitmapHeadergt

ltBitMapheadergt rarr ltwidth type=UINT32gt

lt height type=UINT16gt

lt xposition type=INT16gt

ltyposition type=INT16gt

ltnplanes type=BYTEgt

ltmasking type=BYTEgt

ltcompression type=BYTEgt

ltreserved type=BYTEgt

lttransparentcolor type=UINT16gt

ltxaspect type=BYTEgt

ltyaspect type=BYTEgt

ltpagewidth type=INT16gt

ltpageheight type=INT16gt

ltCMAPgt -rarr―CMAP ltcksize type=UINT32gt ltcolorgt[n] n=cksizeval3

ltCAMGgt rarr ―CAMG ltcksizetype=UINT32gt ltviewmode type=INT32gt

ltBODYgt rarr ―BODY ltcksize type=UINT32gt ltdata type=BYTEgt[cksize]

ltcolorgt rarr ltred type=BYTEgt ltgreen type= BYTEgt ltblue type=BYTEgt

Figure 3 An Attribute Grammar for an ILBM Chunk-based File Format

8

Figure 4 A Parse Tree for the ILBM File hampic

9

42 RIFF Waveform PCM Audio File Format (WAVE)

The waveform PCM File format is was originally specified in an extended BNF notation [IBM amp

Microsoft 1991] The chunk size was omitted from the EBNF rules Figure 5 shows the EBNF

rules with the cksize inserted To convert this grammar into a binary file format grammar data

types are added for primitive non-terminals

ltWAVE-formgt -gt bdquoRIFF‟ cksize WAVE

ltfmt-ckgt

[ltfact-ckgt]

[ltcue-ckgt]

[ltplaylist-ckgt]

[ltassoc-data-listgt]

ltwave-datagt

ltfmt-ckgt -gt bdquofmt ‟ cksize ltcommon-fieldsgt ltwBitsPerSampleWORDgt

ltcommon-fieldsgt -gt ltwFormatTagWORD value=0x0001gt

ltwChannelsWORDgt

ltdwSamplesPerSecDWORDgt

ltdwAvgBytesPerSecDWORDgt

ltwBlockAlignWORDgt

ltwave-datagt -gt ltdata-ckgt | ltwave-listgt

ltdata-ckgt -gt data cksize ltwave-datagt

ltwave-listgt -gt LIST cksize wavl ltdata-ckgt | ltsilence-ckgt

ltsilence-ckgt -gt slnt cksize ltdwSamplesDWORDgt

ltfact-ckgt -gt factxksize ltdwFileSizeDWORDgt

ltcue-ckgt -gt cue cksize ltdwCuePointsDWORDgt ltcue-pointgt+

ltcue-pointgt -gt ltdwNameDWORDgt

ltdwPositionDWORDgt

ltfccChunkFOURCCgt

ltdwChunkStartDWORDgt

ltdwBlockStartDWORDgt

ltdwSampleOffsetDWORDgt

ltplaylist-ckgt -gt bdquoplst‟ cksize ltdwSegmentsDWORDgt ltplay-segmentgt+

ltplay-segmentgt -gt ltdwNameDWORDgt

ltdwLengthDWORDgt

ltdwLoopsDWORD

ltassoc-data-listgt-gt bdquoLIST‟ cksize adtl ltlabl-ckgtltnote-ckgtltltxt-ckgtltfile-ckgt

ltlabl-ckgt -gt bdquolabl‟ cksize ltdwNameDWORDgt ltdataZSTRgt

ltnote-ckgt -gt bdquonote‟ cksize ltdwNameDWORDgt ltdataZSTRgt

10

ltltxt-ckgt -gt bdquoltxt‟ cksize ltdwNameDWORDgt

ltdwSampleLengthDWORDgt

ltdwPurposeDWORDgt

ltwCountryWORDgt

ltwLanguageWORDgt

ltwDialectWORDgt

ltwCodePageWORDgt

ltdataBYTEgt+

ltfile-ckgt -gt bdquofile‟ cksize ltdwNameDWORDgt ltdwMedTypeDWORDgt

ltfileDataBYTEgt+

Figure 5 Binary File Grammar for RIFF WAVEFORM PCM File Format

5 Recursive Descent Parsers for Binary File Grammars

A recursive descent parser is a top-down parser built from a set of recursive (or non recursive)

procedures where each procedure implements one of the production rules of the grammar A

predictive parser is a recursive descent parser that does not require backtracking Predictive

parsing is possible only for the class of LL(k) grammars which are the context-free grammars

for which there exists some positive integer k that allows a recursive descent parser to decide

which production to use by examining only the next k tokens of input A recursive descent parser

for the binary file format grammars defined in this paper does not require a separate lexer to

identify the data of a file The data types are predicted by the grammar rules

The semantic rules of an L-attributed grammar can be included in the recursive descent parser to

check for context-sensitive aspects of the grammar such as chunk size or image dimensions and

use these values in parsing the chunk data or images The pointers to file addresses that occur in

the file formats are handled by pushing the current address in the file onto a pushdown stack

When the parser exhausts the procedures implementing production rules corresponding to the

pointed to location the topmost address on the pushdown stack is popped and the procedure

corresponding to the nonterminal at that position is executed

Recursive descent parsers with a pushdown stack can be used with binary file format grammars

to create file format recognizers A recognizer (syntax checker or validator) is a parser that reads

a file and generates an error if the file does not conform to the syntax specified by the grammar

11

6 A Parser Generator for Recognizing Binary File Formats

What we want is a parser generator whose input is a binary file attribute grammar for a particular

binary file format and whose generated output is the Java source code of a recursive descent

parser (with a pushdown stack if needed) for the class of binary file formats specified by the

grammar Such a parser generator does not yet exist but there is a widely used parser generator

for attribute grammars for string-based languages

61 ANTLR

ANTLR (ANother Tool for Language Recognition) is a parser generator that uses LL() parsing

[Parr amp Quong 1995 Parr 2007] ANTLR takes as input an attribute grammar that specifies a

language and generates as output source code for a recognizer for that language ANTLR

supports generating code in a number of the programming languages including C Java

JavaScript and Python

ANTLR also generates a lexer from lexical rules in the grammar A lexer is a program that

converts a sequence of characters into tokens A lexer is not needed to parse binary file formats

because the data types predicted by the parser are the tokens of the grammar Furthermore the

capability to recognize LL() grammars is not needed because the binary file format grammars

specified so far do not require lookahead of k symbols

62 Using ANTLR to Generate a Parser for a Binary File Format

It is possible to use ANTLR to test our binary file format grammars in recognizing the chunk-

based and directory-based binary file formats This is accomplished by writing a lexer that treats

each input byte in a file as a character token Then functions are created for each data type in the

binary file grammar that convert the appropriate number of character tokens to binary data types

eg int 16 int32 etc This is effective if someone clumsy and has allowed us to test our

attribute grammars for binary file formats as well as to demonstrate the feasibility of creating a

parser generator for binary file grammars

621 An ANTLR Grammar for an ILBM Chunk-based File Format

Figure 6 shows a specification for the ILBM file format that is input to ANTLR The output of

ANTLR is a lexer and parser for ILBM files

12

grammar iblm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in lexerheader is placed at the top of the lexer file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Grammar Rules used to create the Parser

Start symbol ilbm

iblm returns [boolean result]

FORM ckSize ILBM

Print out ILBM chunksize

Systemoutprintln(ID ILBM Size + $ckSizevalue)

Semantic predicate that ensures that at least on BMHD is found

propertyChunks $propertyChunksfoundBMHD == true

dataChunks

body

$result = true

The bmhd cmap and camg property chunks can occur in any order

At least one BMHD data chunk must be present

propertyChunks returns [boolean foundBMHD]

(bmhd $foundBMHD = true

| cmap

| camg)+

bmhd

BMHD ckSize

Systemoutprintln(ID BMHD Size + $ckSizevalue)

13

bitMapHeader

bitMapHeader

width = uWordValue

height = uWordValue

xPosition = wordValue

yPosition = wordValue

nPanes = uByteValue

masking = uByteValue

compression = uByteValue

reserved1 = uByteValue

transparentColor = uWordValue

xAspect = uByteValue

yAspect = uByteValue

pageWidth = wordValue

pageHeight = wordValue

Print out values in the bitmap structure

Systemoutprintln( Width + $widthvalue)

Systemoutprintln( Height + $heightvalue)

Systemoutprintln( X Position + $xPositionvalue)

Systemoutprintln( Y Position + $yPositionvalue)

Systemoutprintln( Number of Panes + $nPanesvalue)

Systemoutprintln( Masking + $maskingvalue)

Systemoutprintln( Compression + $compressionvalue)

Systemoutprintln( Reserved1 + $reserved1value)

Systemoutprintln( TransParent Color +

$transparentColorvalue)

Systemoutprintln( X Aspest + $xAspectvalue)

Systemoutprintln( Y Aspect + $yAspectvalue)

Systemoutprintln( Page Width + $pageWidthvalue)

Systemoutprintln( Page Height + $pageHeightvalue)

cmap returns [long value]

Initialize the index value i in the cmap function

init int i = 0

CMAP ckSize

Print out chunk size

$value = $ckSizevalue

Systemoutprintln(ID CMAP Size + $ckSizevalue)

(colorRegister

Print out color values

Systemoutprintln( Color[ + i++ + ] +

$colorRegisterredValue +

+ $colorRegistergreenValue +

+ $colorRegisterblueValue +

)

)+

colorRegister returns [int redValue int greenValue int blueValue]

14

red = uByteValue green = uByteValue blue = uByteValue

$redValue = $redvalue

$greenValue = $greenvalue

$blueValue = $bluevalue

camg

CAMG ckSize longValue

Systemoutprintln(ID CAMG Size + $ckSizevalue)

Systemoutprintln( View Modes 0x +

LongtoHexString($longValuevalue))

The crng and ccrt data chunks

dataChunks

crng

| ccrt

crng

CRNG ckSize

Systemoutprintln(ID CRNG Size + $ckSizevalue)

cRange

cRange

pad1 = wordValue $pad1value == 0

rate = wordValue

active = wordValue

low = uByteValue

high = uByteValue

ccrt

CCRT ckSize

Systemoutprintln(ID CCRT Size + $ckSizevalue)

cycleInfo

cycleInfo

direction = wordValue

start = uByteValue

end = uByteValue

seconds = longValue

microseconds = longValue

pad = wordValue $padvalue == 0

body

BODY ckSize

Systemoutprintln(ID BODY Size + $ckSizevalue)

$ckSizevalue gt 0 data

Systemoutprintln( Number of bytes + $datavalue)

$datavalue == $ckSizevalue

defines the data chunk of the body chunk

data returns [long value]

(b+=BYTE)+ $bsize() gt 0

$value = $bsize()

15

The cksize is the value of a long integer

ckSize returns [long value]

longValue

$value = $longValuevalue

defines the longValue data typein terms of BYTE terminals

longValue returns [long value] throws UnsupportedEncodingException

h1 = BYTE h2 = BYTE l1= BYTE l2 = BYTE

$value = (long)

( getUByteValue($h1) ltlt 24 |

getUByteValue($h2) ltlt 16 |

getUByteValue($l1) ltlt 8 |

getUByteValue($l2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uWordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (short)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uByteValue returns [int value]

BYTE

calls the getUByteValue from a BYTE token

16

$value = (int) getUByteValue($BYTE)

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rules used to create the Lexer

FORM FORM

ILBM ILBM

BMHD BMHD

CMAP CMAP

CAMG CAMG

CRNG CRNG

CCRT CCRT

BODY BODY

BYTE

Figure 6 An ANTLR Grammar for the ILBM File Format

622 The Lexer and Parser

When provided the file described in the previous section ANTLR generates the ilbmlexer and

ilbmparser Java programs The Java application shown in Figure 6 uses these programs to

validate the ILBM file BLKIFF It produces the output shown in Figure 7

package orggtrigrammarschunk

import javaioIOException

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRStringStream

import organtlrruntimeCharStream

import organtlrruntimeCommonTokenStream

import organtlrruntimeRecognitionException

17

import organtlrruntimeTokenStream

import gtriorggrammarschunkiblmLexer

import gtriorggrammarschunkiblmParser

public class Test

static String filePath = CUserssl170PicturesIFF-ilbmBLKIFF

public static void main(String[] args) throws RecognitionException IOException

ANTLRFileStream stream = new ANTLRFileStream(filePath ISO-8859-1)

iblmLexer lexer = new iblmLexer(stream)

TokenStream tokenStream = new CommonTokenStream(lexer)

iblmParser parser = new iblmParser(tokenStream)

Systemoutprintln(FILE + filePath)

Systemoutprintln(Successfully parsed = + parseriblm())

Figure 7 A Java Application for Parsing an ILBM File

FILE CUserssl170PicturesIFF-ilbmBLKIFF

ID ILBM Size 2556

ID BMHD Size 20

Width 200

Height 144

X Position 0

Y Position 0

Number of Panes 6

Masking0

Compression1

Reserved10

TransParent Color0

X Aspest3

Y Aspect3

Page Width200

Page Height144

ID CMAP Size 48

Color[0] 0 0 0

Color[1] 0 128 128

Color[2] 0 255 255

Color[3] 255 0 255

Color[4] 0 255 0

Color[5] 255 0 0

Color[6] 0 0 255

Color[7] 255 255 0

Color[8] 128 128 128

Color[9] 192 192 192

Color[10] 128 0 0

Color[11] 0 128 0

Color[12] 128 128 0

Color[13] 0 0 128

Color[14] 128 0 128

18

Color[15] 255 255 255

ID CAMG Size 4

View Modes 0x800

ID BODY Size 2448

Number of bytes 2448

Successfully parsed = true

Figure 8 Output of the ILBM Parser Applied to the file BLKIFF

63 An ANTLR Grammar for the RIFF Waveform PCM File Format

Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR

The output of ANTLR is a lexer and parser for WAVEFOR PCM files

grammar waveform_pcm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text contained in lexerheader is placed at the top of the lexer

file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Default ByteOrder to Little Endian

String byteOrder = II

Rules used to create the Parser

start symbol waveform_pcm

waveform_pcm returns [boolean result]

19

ckId = fourCC $ckIdvalueequals(RIFF)

ckSize = dWordValue

formType = fourCC $formTypevalueequals(WAVE)

formatSubChunk

dataSubChunk

$result = true

fourCC returns [String value]

BYTE BYTE BYTE BYTE $value = $text

formatSubChunk

subChunkId = fourCC $subChunkIdvalueequals(fmt )

ckSize = dWordValue

commonFields

pcmSpecificField

commonFields

formatTag = wordValue Format category

channels = wordValue Number of channels

samplesPerSec = dWordValue Sampling rate

avgBytesPerSec = dWordValue For buffer estimation

blockAlign = wordValue Data block size

pcmSpecificField

bitsPerSample = wordValue Sample size

dataSubChunk

subChunkId = fourCC $subChunkIdvalueequals(data)

ckSize = dWordValue

waveData[$ckSizevalue]

waveData [long size]

(size gt= 0=gt BYTE size--)+

defines the long data type in terms of BYTE terminals

dWordValue returns [long value] throws UnsupportedEncodingException

b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE

if (byteOrderequals(II))

$value = (long)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 |

getUByteValue($b3) ltlt 16 |

getUByteValue($b4) ltlt 24 )

else

$value = (long)

( getUByteValue($b1) ltlt 24 |

getUByteValue($b2) ltlt 16 |

20

getUByteValue($b3) ltlt 8 |

getUByteValue($b4) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

if (byteOrderequals(II))

$value = (int)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 )

else

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rule used to create the Lexer

BYTE

Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format

7 Related Research

Researchers have developed a number of data description languages to support validation of file

formats EAST (Enhanced ADA Subset) is a data description language developed by the

Consultative Committee for Space Data Systems [2010] The Data Entity Dictionary

Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic

information The EAST description is used to interpret and provide access to information in

binary and text files

DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to

manipulate Java jar files and ELF object files The PADSC data description language can be

used to define the file formats of text and binary files [Fisher amp Gruber 2005] The PADSC

compiler compiles the description into tools that can be used to recognize manipulate and

transform the data into other formats

21

The Data Format Description Language (DFDL) is being developed by a Working Group of the

Open Grid Forum [OGF 2011] The data description language is based on a subset of W3C XML

Schema using ltxsappinfogt annotations to carry the extra information necessary to describe non-

XML physical representations Version 1 of the language specification was published in 2011

and a parser is being implemented

Each of the data description languages described above can be used to define data types and file

structures of binary files However they are based on file structure declarations such as those of

C or Java The binary file format grammar described in this paper most closely resembles DFDL

However the binary file grammar described in this paper is the only data description language

based on formal grammars that is used for creating recognizers for file formats

JHOVE (JSTORHarvard Object Validation Environment) is an extensible system designed to

provide automated and efficient identification and validation of the formats of digital files

JHOVE is a format-specific digital object validation API written in Java JHOVE supports

validation of the following formats AIFF ASCII GIF HTML JPEG JPEG 2000 PDF TIFF

UTF-8 WAVE and XML JHOVE2 is second generation validation environment [California

digital Library 2011]

The research reported in this paper is similar in intent to that of the JHOVE projectsmdashvalidation

of binary file formats As a matter of fact binary file grammars and parsers have been

constructed for the chunk-based file formats AIFF JPEG and WAVE as well as the directory-

based format TIFF However the research reported herein differs from that of the JHOVE

project in that what is sought is a technology for generating validators for binary file formats

from grammars specifying the binary file format

8 Conclusion

The research question addressed by his research is whether it is possible to extend the context-

free grammars used to specify the syntax of programming languages to the specification of

binary file formats and to use these grammars with parsers for validating the file formats of

binary files Amajor family of binary file formats chunk-based was described Then extensions

to the concepts of context-free grammars and attribute grammars were described that enable the

specification of binary file formats Examples of attribute array grammars for two chunk-based

binary file format were then presented Recursive descent parsers for this classes of grammars

was then described Experience in using ANTLR a parser generator for LL() string grammars

in generating parsers for recognizing the formats of binary files was discussed Finally related

research in data description languages for binary file formats was described and the JHOVE

project in creating Java-based tools for validating file formats was described

22

It is concluded that it is possible to extend context-free grammars to the specification of chunk-

based binary file formats Furthermore these grammars can be used with recursive descent

parsers for validating the file formats of chunk-based binary files It remains to be determined

whether these attribute grammars based on binary file (array) grammars are adequate to define

other families of binary file formats

To be a practical technology for generating recognizers (validators) for binary file formats a

parser generator is needed that creates from a binary file grammar a recursive descent parser

(with pushdown stack if needed) ANTLR which accepts LL() grammars as input and

generates a lexer as well as a parser is too heavyweight a parser generator for this task and does

not have the requisite data types built in

A capability to validate binary file formats assumes that the file format of a file is already

known In this research we use a file type identifier based on the UNIX file command and a

magic file that we have created [Underwood 2009] We have samples of the chunk-based file

formats discussed in this report including those referenced in Appendix a We also have file

signature tests (magic tests) for all these file types

We would like to have a parser generator (also called a compilerndashcompiler) that would take as

input an attribute grammar for a binary file format and generate a parser or interpreter that

recognizes the file format or that interprets the file format and displays or plays the contents The

closest parser generator that we have found to having these capabilities is ANTLR

(wwwantlrorg ) ANTLR will generate a top-down recursive descent parser from a context-free

attribute grammar However it works with strings and generates a separate scanner

Appendix A lists the names of about 75 file formats that belong to the family of ―chunk-based

file formats We have specifications for each of file formats and samples of files with these

formats

These file formats have a similar structure We have developed binary file grammars for two of

these binary file formats We should be able to do so for all of them We used these grammars

with the extensions to ANTLR to generate top-down recursive descent parsers for the file

formats defined with these grammars

The significance of success in this research task is that if binary file formats can be specified

with binary file grammars then only one parsing generator is needed to generate the parsers for

verifying that a file conforms to a particular format Similarly the same parsing generator can be

used for conversion of legacy file formats to current or standard formats Finally the same parser

generator can be used for generating viewerplayers for most file formats This would increase

23

the likelihood of preserving and making available into the indefinite future hose digital records

encoded in binary file formats

24

References

[Aho 1968] A V Aho Indexed Grammars ndash An Extension of Context-free Grammars Journal

of the ACM Vol 15 Issue 4 Oct 1968

[Alias-wavefront 1999] Maya File Formats Maya 20 1999

httpcaadarchethzchinfomayamanual

[Amigan 2011] Amigan Software IFF FORM and Chunk Registry

httpamigan1emunetregiffhtml

[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description

May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description

[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats

Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21

[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files

Version 13 Apple Computer Inc January 1989

wwwaesnashvilleorgPDFsAIFF20Specspdf

[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-

webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt

[Apple 2005] Apple Core Audio Format Specification 2005b

httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_

specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101

[Apple 2007] QuickTime File Format Specification

httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref

acehtml

[ASF 2001] ASF format version 10 reconstruction January 2001

httpavifilesourceforgenetasf-10htm

[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January

1997 wwwdcsedacukhomemxrgfx3d3DSspec

[Autodesk 2009] Autodesk DXF Reference January 2009

httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112

25

[Back 2002] G Back (2002) A specification and scripting language for binary data Generative

Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-

77

[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V

22 1979 pp 243-256

[Baum amp Noel 2002]D Baum amp G Noel The Simms Interchange File Format 2002

httpsimtechsourceforgenettechiffhtml

[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New

Horizons Software Inc 1987 httpamigan1emunetregWORDtxt

[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995

ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt

[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010

httpondasourceforgenetnlfnlfFormathtml

[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm amp Onda file

formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml

[California Digital Library 2011] California Digital Library JHOVE2 Userlsquos Guide 2011

httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf

[Caligari 1998] Caligri Caligari trueSpace file format Document Version 6 July

2002 httpwwwcaligaricomdownloadzipscncob65zip

[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information

Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements

and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-

t81pdf

[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description

Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf

[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998

[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt

[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D

Taliesin Inc httpamigan1emunetregDR2Dtxt

26

[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG

decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt

[Dunckley et al 2007] Dunckley M Rankin S Conway E amp Giaretta D (2007) The use of

file description languages for file format identification and validation Proc PV 2007

Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical

Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-

detailsw=50089

[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format

Tech 3285 July 2001 httptechebuchdocstechtech3285pdf

Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml

[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd

May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html

[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495

wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml

[Fisher amp Gruber 2005] Fisher K amp Gruber R (2005) PADS A domain specific language for

processing ad hoc data ACM Conference on Programming language Design and

Implementation ACM Press httpwwwpadsprojorgpaperspldipdf

[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG

1995

[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal

Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September

1997 wwwgnelsondemoncoukzspecquetzalhtml

[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called

Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)

httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt

[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03

httpwwwanim8orcomresourcesan8_formattxt

[Google 2010] WebP RIFF Container 2010

httpcodegooglecomspeedwebpdocsriff_containerhtml

[Haaland] Mike Haaland Flic Files (FLI) Format description

httpwwwmartinreddynetgfxanimFLItxt

27

[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube

Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf

[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice

Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt

[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata

Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1

1995

[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)

August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21

[IBM amp Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data

Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt

wwwtactilemediacominfoMCI_Control_Infohtml

httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf

[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec

11 2008 httpduskblueorgprojtoymidimidiformatpdf

[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10

ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf

[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable

Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C

Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110

[Issar amp Ward 1993] S Issar and W Ward CMUlsquos robust spoken language understanding

system EUOSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf

[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software

November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application

Version 50) wwwjasccomsupportkbarticlespspspecasp

[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image

File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version

22 April 2002

httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf

[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994

httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt

28

[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory

vol 2 pp 127_145 1968

[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical

Systems Theory vol 5 pp 95_96 1971

[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)

North-Holland Publ Co Amsterdam 1971 pp 95-109

[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language

Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml

httpwebcswpiedu~kalcoursescompilersmodule4mysahtml

[LightWave 1996] LightWave 3D Object File Format Oct 16 1996

httpsandboxdeosglightwavehtm

[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWo2] November 9 2001

httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml

[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer

1997 httppaulbourkenetdataformatscinema4dcinema4dpdf

[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and

Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf

[Maishein] Max Maischein PIC(PRO) Format

http14791177212extrafileformatgraphicspicpicprotxt

[Martin 2007] Michael Martin SVAF format description Version 02 January 2007

httpsvafsourceforgenetsvafhtml

[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and

Data Techniques July 29 1992 Revision 1097

httpnetghostnarodrugffvendspecmicriffms_rifftxt

[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New

Multimedia Data Types and Data Techniques Revision 30 April 15 1994

[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE

httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-

usstreamhhstreamaudprop_7wz7asp

[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003

December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx

29

[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-

uslibraryms899421aspx

[Morrison 1985] Jerry Morrison ―EA IFF 85rdquo Standard for Interchange Format Files

Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt

[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17

1986 wwwfine-viewcomjplabsdocilbmtxt

[Morrison 1986b] J Morrison ―SMUS IFF Simple Musical Score Electronic Arts Inc

February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus

[MultimediaWiki 2006] Smush Animation Format August 2006

httpwikimultimediacxindexphptitle=SANChunk_Forma

[MultimeduiaWiki 2008a] Electronic Arts CMV April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_CMV

[MultimediaWiki 2008b] Electronic Arts DCT April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_DCT

[MultimediaWiki 2008c] Electronic Arts MAD April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MAD

[MultimediaWiki 2008d] Electronic Arts MPC April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MPC

[MultimediaWiki 2008e] Electronic Arts VP6 April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_VP6

[MultimediaWiki 2008f] Electronic Arts TGV August 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGV

[MultimediaWiki 2008g] Electronic Arts TGQ November 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ

[MultimediaWiki 2008h] American Laser Games MM January 2008

httpwikimultimediacxindexphptitle=American_Laser_Games_MM

[MultimediaWiki 2009a] Electronic Arts TQI February 2009

httpwikimultimediacxindexphptitle=Electronic_Arts_TQI

30

[MultimediaWiki 2009b] SANM October 2009

httpwikimultimediacxindexphptitle=SANM

[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240

httpcodegooglecompll-parser-

generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42

[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter

written in Python in the open source graphics file convertor UniConvertor

httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer

[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification

January 31 2011 wwwogforgdocumentsGFD174pdf

[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988

wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt

[Parr amp Quong 1995] Parr T J amp Quong R W (1995) ANTLR A predicated- ll(k) parser

generator Software Practice and Experience 25 7 (July) 789-810

httpwwwantlrorgarticle1055550346383antlrpdf

[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific

Languages Pragmatic

[Peters] Christoph Peters The Ultimate 3D model file format (v 20)

httpultimate3dorgU3DModelFileFormatV2htm

[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003

httpxiphorgoggdocrfc3533txt

[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20

wwweblongcomzarfblorb

httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt

[ Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011

wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml

[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)

Format Version 10 2001 wwwlibpngorgpubmngspecCredits

31

[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File

Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008

httpscopenofficeorgexcelfileformatpdf

[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF

[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments

3072 March 2001 wwwietforgrfcrfc3072txt

[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft

26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt

[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and

old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21

[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988

httpamigan1emunetregANIMtxt

[Tamir 2005] Giora Tamir C RIFF Parser June 2005

wwwcodeprojectcomKBfilesriffparseraspx

[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications

In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi

Khosrow-Pour Information Science Publishing pp 268-273 2008

wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf

[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for

File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute

September 2009

[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical

Engineering The University of Queensland 1996

httpiteeuqeduau~cristinastudentsdaviddavidhtml

[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language

understanding system ARPA Human Language Technologies Workshop Plainsboro New

Jersey httpaclldcupenneduHH94H94-1039pdf

[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars

Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)

Springer-Verlag Berlin 1974 pp 9-16

32

[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley

1995 (chapter 10) (Attribute Grammars)

[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010

httpxiphorgvorbisdocVorbis_I_spechtml

[Zappe 1994] H Zappe Oktalyzer 1994

httpwwwwotsitorglistaspsearch=oktampbutton=GO21

33

Appendix A Chunk-based Binary File Formats

This appendix references specifications for about 75 chunk-based binary file formats and

discusses some of those specifications

A1 Electronic Arts Interchange File Format (IFF)

[Amigan 2011]

Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the

specification for the Interchange File Format [Morrison 1985]

A11 Interleaved Bitmap (ILBM)

[Morrison 1986]

This file format was discussed in section 41 of this report

A12 8-bit Sampled Voice (8SVX)

The file format specification for the 9-bit sampled voice file format uses C programming

language data structures to define the format [Hayes and Morrison 1985] The following is a re-

expression of that specification as a context-free binary file grammar

Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata

cksize rarr ULONG

ckdata rarr voice8header bodyck

ckdata rarr voice8header nameck copyrightck bodyck

ckdata rarr voice8header copyrightck nameck bodyck

ckdata rarr voice8header nameck bodyck

ckdata rarr voice8header copyrightck bodyck

ckdata rarr voice8header authorck bodyck

ckdata rarr voice8header annotation bodyck

ckdata rarr voice8header atakck releaseck bodyck

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 9: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

3

The Audio Interchange File Format (AIFF) developed by Apple computer in 1988 was based on

Electronic Arts Interchange File Format Apple Core Audio Format (CAF) Applelsquos replacement

format for AIFF remains a chunk-based format [Apple Computer 2005] The Resource

Interchange File Format (RIFF) introduced by IBM and Microsoft [1991] is also based on

Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-Video

Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis WebP a picture format

recently introduced by Google[2010] also uses RIFF as a container

Microsoftlsquos Advanced Systems Format (ASF) [2004] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV) Microsoftlsquos Binary Interchange File Format [Rentz 2008]

(Microsoft Excellsquos File Format) is chunk-based

Another family of binary file formats is directory-based A directory-based file format consists of

one or more directory tables which contain one or more directory entries A directory entry

specifies where the actual data for a type of information is located This is the scheme used in

TIFF files OLE (Microsoft Object Linking and Embedding) files OASIS OpenDocument and

Microsoft Open Office files

3 Context-Free Grammars and Attribute Grammars

Context-free grammars have been widely used to define programming languages A context-free

grammar G is a quadruple ltN Σ S Pgt where

N is a finite set of non-terminal symbols

Σ is a finite set of terminal symbols

S ε N is the start symbol

P is a set of production rules of the form N rarr N Σ

The set of all strings over a vocabulary Σ is denoted by Σ The language generated by a context-

free grammar G is the set L(G) = w w Σ and S w

31 Context-Free Grammars for Arrays of Data Types

An array is a data structure consisting of a collection of elements (values or variables) each

identified by an index An array is stored so that the position of each element can be computed

from its index For example an array of 10 integer variables with indices 0 through 9 may be

4

stored as 10 (4 byte) words at file addresses 0 4 8 hellip36 so that the element with index i has the

address 4 times i

Arrays are used to implement many other data structures such as lists and strings In most

modern computers the internal memory and external memory of storage devices (and files on

those devices) is a one-dimensional array of data types whose indices are their addresses

A data type is a classification of one of various types of data such as floating-point integer

character pointer or Boolean that determines the possible values for that type the operations

that can be performed on values of that type and the way values of that type can be stored Data

types have names such as int16 int32 float char bool ptr We define a binary file to be an array

of values of data of various types

We define a context-free binary file grammar BG as a quintuple ltN D Σ S Pgt where

N is a finite set of non-terminal symbols

D is a set of data types

Σ is a finite set of binary values of data types D called terminals

S ε N is the start symbol

P is a set of production rules of the form N rarr N Σ

Let DataTypes indicate the union of all the values of all datatypes D The set of all binary files is

BinaryFiles = [IndicesrarrDataTypes] where Indices = 1 The language generated by a

context-free array (binary file) grammar AG is the set L(AG) = w w BinaryFiles and S w

By primitive nonterminal is meant a symbol which would be replaced by one or more terminal

symbols were an additional production rule applied In other words a primitive nonterminal is a

nonterminal that appears on the left-hand side of productions which only have one or more

terminal symbols on the right We adopt the notation shown below for primitive nonterminals

ltprimitive_nonterminal_symbol type=datatype_name constant=valuegt

32 Attribute Grammars

Context-free grammars cannot represent context-sensitive aspects of programming languages

such as (1) enforcing the constraint that all variables are declared before they are used or (2)

checking the number of parameters in a function call against the number in the functionlsquos

declaration Context-free grammars have also been used to define the syntax of English and other

natural languages but they cannot define such context-sensitivity as subject verb agreement in

English sentences A number of extensions of context-free grammars have been proposed to

5

address the context-sensitivity of programming languages and natural languages These include

indexed [Aho 1968] recording [Barth 1979] affix [Koster 1971] VW [van Wijngaarden 1974]

and attribute grammars [Knuth 1969 1971]

Context-free grammars also cannot represent the semantics of programming languages Knuth

[1968 1971] proposed an extension of context-free grammars termed attribute grammars that

addresses the semantics as well as the context-sensitivity of programming languages

An attribute grammar GA is a triple ltG A ARgt where G is a context-free grammar for the

language A associates each grammar symbol X isin (N Σ) with a set of attributes and AR

associates each production R isin P with a set of attribute computation rules A(X) where X isin (N

Σ) can be further partitioned into two sets synthesized attributes S(X) and inherited attributes

I(X) AR(R) where R isin P contains rules for computing inherited and synthesized attributes

associated with the symbols in the production R Synthesized attributes are evaluated and passed

up from deeper terminals and non-terminals while inherited attributes are passed down from

upper non-terminals

An L-attributed grammar is an attribute grammar that uses

i) synthesized attributes and

ii) inherited attributes where the inherited attribute depends only on

a) attributes of the other child nodes to its left

b) inherited attributes of P itself

L-attributed grammars allow the attributes to be evaluated in one left-to-right traversal of the

abstract syntax tree As a result attribute evaluation in L-attributed grammars can be

incorporated conveniently in top-down parsing

33 Semantic Grammar

An interpreter for a binary file format that is using a grammar to display or play the contents of a

file or to convert it to another file format needs to produce a semantic representation that is

appropriate to the task Natural language understanding components of dialog systems face a

similar need to produce a semantic representation that is appropriate to the dialogue task To

address this need some developers of dialogue systems make use of an approach called semantic

grammar [Issar and Ward 1993 Ward and Issar 1994] A semantic grammar is a CFG in which

the non-terminals on the left-hand side of rules correspond to semantic entities A parser for a

semantic grammar produces a labeling of the input string with semantic node labels for example

SHOW FLIGHTS ORIGIN DESTINATION DEPART_DATE DEPART_TIME Show me flights from Boston to San Francisco on Tuesday morning

6

A semantic grammar is just what we need to file in the values of a programming language data

structure Semantic grammars are not applicable to generic systems but are limited to specific

domains This is not a problem in specifying or parsing a binary file format since a grammar for

a binary file format is not meant to apply to all binary file formats but just to a single format or

to a family of file formats

4 Grammars for Chunk-based Binary File Formats

The attribute grammar that we present will be a mixture of synthesized semantic rules (attribute

values of the left side of a production are calculated from attribute values of symbols on the right

side) and inherited rules (attribute values of symbols on the right side of a production are

calculated from attribute values of the symbol on the left side)

The field names of the file format spec are represented as nonterminals in the grammar Constant

values in fields are terminals that appear on the right-hand side of productions Nonterminal

fieldnames have data types associated with them

41 InterLeaved BitMap (ILBM)

The author of the Interleaved Bitmap (ILBM) file format recognized that the structure of the file

format could be specified using pseudo EBNF rules and C data structure declarations [Morrison

1986] In this section the ―grammar of the ILBM file format specification is converted to a

context-free binary grammar with synthesized and inherited attributes

Figure 2 shows a display of an ILBM file hampic Figure 3 shows an attribute grammar for its

chunk-based binary file format The attribute grammar is a mixture of synthesized semantic rules

(attribute values of the left side of a production are calculated from attribute values of symbols

on the right side) and inherited rules (attribute values of symbols on the right side of a

production are calculated from attribute values of the symbol on the left side) The attribute

grammar is also an L-attributed grammar Figure 4 shows a parse tree for the file hampic

7

Figure 2 ILBM File hampic Displayed

ltILBMgt rarr ―FORM ltcksize TYPE=UINT32gt ―ILBM ltBMHDgt ltCMAPgt ltCAMGgt ltBODYgt

ILBMcksize = cksizevalue

ltBMHDgt rarr ―BMHD ltcksize TYPE=uint32gt ltBitmapHeadergt

ltBitMapheadergt rarr ltwidth type=UINT32gt

lt height type=UINT16gt

lt xposition type=INT16gt

ltyposition type=INT16gt

ltnplanes type=BYTEgt

ltmasking type=BYTEgt

ltcompression type=BYTEgt

ltreserved type=BYTEgt

lttransparentcolor type=UINT16gt

ltxaspect type=BYTEgt

ltyaspect type=BYTEgt

ltpagewidth type=INT16gt

ltpageheight type=INT16gt

ltCMAPgt -rarr―CMAP ltcksize type=UINT32gt ltcolorgt[n] n=cksizeval3

ltCAMGgt rarr ―CAMG ltcksizetype=UINT32gt ltviewmode type=INT32gt

ltBODYgt rarr ―BODY ltcksize type=UINT32gt ltdata type=BYTEgt[cksize]

ltcolorgt rarr ltred type=BYTEgt ltgreen type= BYTEgt ltblue type=BYTEgt

Figure 3 An Attribute Grammar for an ILBM Chunk-based File Format

8

Figure 4 A Parse Tree for the ILBM File hampic

9

42 RIFF Waveform PCM Audio File Format (WAVE)

The waveform PCM File format is was originally specified in an extended BNF notation [IBM amp

Microsoft 1991] The chunk size was omitted from the EBNF rules Figure 5 shows the EBNF

rules with the cksize inserted To convert this grammar into a binary file format grammar data

types are added for primitive non-terminals

ltWAVE-formgt -gt bdquoRIFF‟ cksize WAVE

ltfmt-ckgt

[ltfact-ckgt]

[ltcue-ckgt]

[ltplaylist-ckgt]

[ltassoc-data-listgt]

ltwave-datagt

ltfmt-ckgt -gt bdquofmt ‟ cksize ltcommon-fieldsgt ltwBitsPerSampleWORDgt

ltcommon-fieldsgt -gt ltwFormatTagWORD value=0x0001gt

ltwChannelsWORDgt

ltdwSamplesPerSecDWORDgt

ltdwAvgBytesPerSecDWORDgt

ltwBlockAlignWORDgt

ltwave-datagt -gt ltdata-ckgt | ltwave-listgt

ltdata-ckgt -gt data cksize ltwave-datagt

ltwave-listgt -gt LIST cksize wavl ltdata-ckgt | ltsilence-ckgt

ltsilence-ckgt -gt slnt cksize ltdwSamplesDWORDgt

ltfact-ckgt -gt factxksize ltdwFileSizeDWORDgt

ltcue-ckgt -gt cue cksize ltdwCuePointsDWORDgt ltcue-pointgt+

ltcue-pointgt -gt ltdwNameDWORDgt

ltdwPositionDWORDgt

ltfccChunkFOURCCgt

ltdwChunkStartDWORDgt

ltdwBlockStartDWORDgt

ltdwSampleOffsetDWORDgt

ltplaylist-ckgt -gt bdquoplst‟ cksize ltdwSegmentsDWORDgt ltplay-segmentgt+

ltplay-segmentgt -gt ltdwNameDWORDgt

ltdwLengthDWORDgt

ltdwLoopsDWORD

ltassoc-data-listgt-gt bdquoLIST‟ cksize adtl ltlabl-ckgtltnote-ckgtltltxt-ckgtltfile-ckgt

ltlabl-ckgt -gt bdquolabl‟ cksize ltdwNameDWORDgt ltdataZSTRgt

ltnote-ckgt -gt bdquonote‟ cksize ltdwNameDWORDgt ltdataZSTRgt

10

ltltxt-ckgt -gt bdquoltxt‟ cksize ltdwNameDWORDgt

ltdwSampleLengthDWORDgt

ltdwPurposeDWORDgt

ltwCountryWORDgt

ltwLanguageWORDgt

ltwDialectWORDgt

ltwCodePageWORDgt

ltdataBYTEgt+

ltfile-ckgt -gt bdquofile‟ cksize ltdwNameDWORDgt ltdwMedTypeDWORDgt

ltfileDataBYTEgt+

Figure 5 Binary File Grammar for RIFF WAVEFORM PCM File Format

5 Recursive Descent Parsers for Binary File Grammars

A recursive descent parser is a top-down parser built from a set of recursive (or non recursive)

procedures where each procedure implements one of the production rules of the grammar A

predictive parser is a recursive descent parser that does not require backtracking Predictive

parsing is possible only for the class of LL(k) grammars which are the context-free grammars

for which there exists some positive integer k that allows a recursive descent parser to decide

which production to use by examining only the next k tokens of input A recursive descent parser

for the binary file format grammars defined in this paper does not require a separate lexer to

identify the data of a file The data types are predicted by the grammar rules

The semantic rules of an L-attributed grammar can be included in the recursive descent parser to

check for context-sensitive aspects of the grammar such as chunk size or image dimensions and

use these values in parsing the chunk data or images The pointers to file addresses that occur in

the file formats are handled by pushing the current address in the file onto a pushdown stack

When the parser exhausts the procedures implementing production rules corresponding to the

pointed to location the topmost address on the pushdown stack is popped and the procedure

corresponding to the nonterminal at that position is executed

Recursive descent parsers with a pushdown stack can be used with binary file format grammars

to create file format recognizers A recognizer (syntax checker or validator) is a parser that reads

a file and generates an error if the file does not conform to the syntax specified by the grammar

11

6 A Parser Generator for Recognizing Binary File Formats

What we want is a parser generator whose input is a binary file attribute grammar for a particular

binary file format and whose generated output is the Java source code of a recursive descent

parser (with a pushdown stack if needed) for the class of binary file formats specified by the

grammar Such a parser generator does not yet exist but there is a widely used parser generator

for attribute grammars for string-based languages

61 ANTLR

ANTLR (ANother Tool for Language Recognition) is a parser generator that uses LL() parsing

[Parr amp Quong 1995 Parr 2007] ANTLR takes as input an attribute grammar that specifies a

language and generates as output source code for a recognizer for that language ANTLR

supports generating code in a number of the programming languages including C Java

JavaScript and Python

ANTLR also generates a lexer from lexical rules in the grammar A lexer is a program that

converts a sequence of characters into tokens A lexer is not needed to parse binary file formats

because the data types predicted by the parser are the tokens of the grammar Furthermore the

capability to recognize LL() grammars is not needed because the binary file format grammars

specified so far do not require lookahead of k symbols

62 Using ANTLR to Generate a Parser for a Binary File Format

It is possible to use ANTLR to test our binary file format grammars in recognizing the chunk-

based and directory-based binary file formats This is accomplished by writing a lexer that treats

each input byte in a file as a character token Then functions are created for each data type in the

binary file grammar that convert the appropriate number of character tokens to binary data types

eg int 16 int32 etc This is effective if someone clumsy and has allowed us to test our

attribute grammars for binary file formats as well as to demonstrate the feasibility of creating a

parser generator for binary file grammars

621 An ANTLR Grammar for an ILBM Chunk-based File Format

Figure 6 shows a specification for the ILBM file format that is input to ANTLR The output of

ANTLR is a lexer and parser for ILBM files

12

grammar iblm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in lexerheader is placed at the top of the lexer file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Grammar Rules used to create the Parser

Start symbol ilbm

iblm returns [boolean result]

FORM ckSize ILBM

Print out ILBM chunksize

Systemoutprintln(ID ILBM Size + $ckSizevalue)

Semantic predicate that ensures that at least on BMHD is found

propertyChunks $propertyChunksfoundBMHD == true

dataChunks

body

$result = true

The bmhd cmap and camg property chunks can occur in any order

At least one BMHD data chunk must be present

propertyChunks returns [boolean foundBMHD]

(bmhd $foundBMHD = true

| cmap

| camg)+

bmhd

BMHD ckSize

Systemoutprintln(ID BMHD Size + $ckSizevalue)

13

bitMapHeader

bitMapHeader

width = uWordValue

height = uWordValue

xPosition = wordValue

yPosition = wordValue

nPanes = uByteValue

masking = uByteValue

compression = uByteValue

reserved1 = uByteValue

transparentColor = uWordValue

xAspect = uByteValue

yAspect = uByteValue

pageWidth = wordValue

pageHeight = wordValue

Print out values in the bitmap structure

Systemoutprintln( Width + $widthvalue)

Systemoutprintln( Height + $heightvalue)

Systemoutprintln( X Position + $xPositionvalue)

Systemoutprintln( Y Position + $yPositionvalue)

Systemoutprintln( Number of Panes + $nPanesvalue)

Systemoutprintln( Masking + $maskingvalue)

Systemoutprintln( Compression + $compressionvalue)

Systemoutprintln( Reserved1 + $reserved1value)

Systemoutprintln( TransParent Color +

$transparentColorvalue)

Systemoutprintln( X Aspest + $xAspectvalue)

Systemoutprintln( Y Aspect + $yAspectvalue)

Systemoutprintln( Page Width + $pageWidthvalue)

Systemoutprintln( Page Height + $pageHeightvalue)

cmap returns [long value]

Initialize the index value i in the cmap function

init int i = 0

CMAP ckSize

Print out chunk size

$value = $ckSizevalue

Systemoutprintln(ID CMAP Size + $ckSizevalue)

(colorRegister

Print out color values

Systemoutprintln( Color[ + i++ + ] +

$colorRegisterredValue +

+ $colorRegistergreenValue +

+ $colorRegisterblueValue +

)

)+

colorRegister returns [int redValue int greenValue int blueValue]

14

red = uByteValue green = uByteValue blue = uByteValue

$redValue = $redvalue

$greenValue = $greenvalue

$blueValue = $bluevalue

camg

CAMG ckSize longValue

Systemoutprintln(ID CAMG Size + $ckSizevalue)

Systemoutprintln( View Modes 0x +

LongtoHexString($longValuevalue))

The crng and ccrt data chunks

dataChunks

crng

| ccrt

crng

CRNG ckSize

Systemoutprintln(ID CRNG Size + $ckSizevalue)

cRange

cRange

pad1 = wordValue $pad1value == 0

rate = wordValue

active = wordValue

low = uByteValue

high = uByteValue

ccrt

CCRT ckSize

Systemoutprintln(ID CCRT Size + $ckSizevalue)

cycleInfo

cycleInfo

direction = wordValue

start = uByteValue

end = uByteValue

seconds = longValue

microseconds = longValue

pad = wordValue $padvalue == 0

body

BODY ckSize

Systemoutprintln(ID BODY Size + $ckSizevalue)

$ckSizevalue gt 0 data

Systemoutprintln( Number of bytes + $datavalue)

$datavalue == $ckSizevalue

defines the data chunk of the body chunk

data returns [long value]

(b+=BYTE)+ $bsize() gt 0

$value = $bsize()

15

The cksize is the value of a long integer

ckSize returns [long value]

longValue

$value = $longValuevalue

defines the longValue data typein terms of BYTE terminals

longValue returns [long value] throws UnsupportedEncodingException

h1 = BYTE h2 = BYTE l1= BYTE l2 = BYTE

$value = (long)

( getUByteValue($h1) ltlt 24 |

getUByteValue($h2) ltlt 16 |

getUByteValue($l1) ltlt 8 |

getUByteValue($l2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uWordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (short)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uByteValue returns [int value]

BYTE

calls the getUByteValue from a BYTE token

16

$value = (int) getUByteValue($BYTE)

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rules used to create the Lexer

FORM FORM

ILBM ILBM

BMHD BMHD

CMAP CMAP

CAMG CAMG

CRNG CRNG

CCRT CCRT

BODY BODY

BYTE

Figure 6 An ANTLR Grammar for the ILBM File Format

622 The Lexer and Parser

When provided the file described in the previous section ANTLR generates the ilbmlexer and

ilbmparser Java programs The Java application shown in Figure 6 uses these programs to

validate the ILBM file BLKIFF It produces the output shown in Figure 7

package orggtrigrammarschunk

import javaioIOException

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRStringStream

import organtlrruntimeCharStream

import organtlrruntimeCommonTokenStream

import organtlrruntimeRecognitionException

17

import organtlrruntimeTokenStream

import gtriorggrammarschunkiblmLexer

import gtriorggrammarschunkiblmParser

public class Test

static String filePath = CUserssl170PicturesIFF-ilbmBLKIFF

public static void main(String[] args) throws RecognitionException IOException

ANTLRFileStream stream = new ANTLRFileStream(filePath ISO-8859-1)

iblmLexer lexer = new iblmLexer(stream)

TokenStream tokenStream = new CommonTokenStream(lexer)

iblmParser parser = new iblmParser(tokenStream)

Systemoutprintln(FILE + filePath)

Systemoutprintln(Successfully parsed = + parseriblm())

Figure 7 A Java Application for Parsing an ILBM File

FILE CUserssl170PicturesIFF-ilbmBLKIFF

ID ILBM Size 2556

ID BMHD Size 20

Width 200

Height 144

X Position 0

Y Position 0

Number of Panes 6

Masking0

Compression1

Reserved10

TransParent Color0

X Aspest3

Y Aspect3

Page Width200

Page Height144

ID CMAP Size 48

Color[0] 0 0 0

Color[1] 0 128 128

Color[2] 0 255 255

Color[3] 255 0 255

Color[4] 0 255 0

Color[5] 255 0 0

Color[6] 0 0 255

Color[7] 255 255 0

Color[8] 128 128 128

Color[9] 192 192 192

Color[10] 128 0 0

Color[11] 0 128 0

Color[12] 128 128 0

Color[13] 0 0 128

Color[14] 128 0 128

18

Color[15] 255 255 255

ID CAMG Size 4

View Modes 0x800

ID BODY Size 2448

Number of bytes 2448

Successfully parsed = true

Figure 8 Output of the ILBM Parser Applied to the file BLKIFF

63 An ANTLR Grammar for the RIFF Waveform PCM File Format

Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR

The output of ANTLR is a lexer and parser for WAVEFOR PCM files

grammar waveform_pcm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text contained in lexerheader is placed at the top of the lexer

file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Default ByteOrder to Little Endian

String byteOrder = II

Rules used to create the Parser

start symbol waveform_pcm

waveform_pcm returns [boolean result]

19

ckId = fourCC $ckIdvalueequals(RIFF)

ckSize = dWordValue

formType = fourCC $formTypevalueequals(WAVE)

formatSubChunk

dataSubChunk

$result = true

fourCC returns [String value]

BYTE BYTE BYTE BYTE $value = $text

formatSubChunk

subChunkId = fourCC $subChunkIdvalueequals(fmt )

ckSize = dWordValue

commonFields

pcmSpecificField

commonFields

formatTag = wordValue Format category

channels = wordValue Number of channels

samplesPerSec = dWordValue Sampling rate

avgBytesPerSec = dWordValue For buffer estimation

blockAlign = wordValue Data block size

pcmSpecificField

bitsPerSample = wordValue Sample size

dataSubChunk

subChunkId = fourCC $subChunkIdvalueequals(data)

ckSize = dWordValue

waveData[$ckSizevalue]

waveData [long size]

(size gt= 0=gt BYTE size--)+

defines the long data type in terms of BYTE terminals

dWordValue returns [long value] throws UnsupportedEncodingException

b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE

if (byteOrderequals(II))

$value = (long)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 |

getUByteValue($b3) ltlt 16 |

getUByteValue($b4) ltlt 24 )

else

$value = (long)

( getUByteValue($b1) ltlt 24 |

getUByteValue($b2) ltlt 16 |

20

getUByteValue($b3) ltlt 8 |

getUByteValue($b4) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

if (byteOrderequals(II))

$value = (int)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 )

else

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rule used to create the Lexer

BYTE

Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format

7 Related Research

Researchers have developed a number of data description languages to support validation of file

formats EAST (Enhanced ADA Subset) is a data description language developed by the

Consultative Committee for Space Data Systems [2010] The Data Entity Dictionary

Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic

information The EAST description is used to interpret and provide access to information in

binary and text files

DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to

manipulate Java jar files and ELF object files The PADSC data description language can be

used to define the file formats of text and binary files [Fisher amp Gruber 2005] The PADSC

compiler compiles the description into tools that can be used to recognize manipulate and

transform the data into other formats

21

The Data Format Description Language (DFDL) is being developed by a Working Group of the

Open Grid Forum [OGF 2011] The data description language is based on a subset of W3C XML

Schema using ltxsappinfogt annotations to carry the extra information necessary to describe non-

XML physical representations Version 1 of the language specification was published in 2011

and a parser is being implemented

Each of the data description languages described above can be used to define data types and file

structures of binary files However they are based on file structure declarations such as those of

C or Java The binary file format grammar described in this paper most closely resembles DFDL

However the binary file grammar described in this paper is the only data description language

based on formal grammars that is used for creating recognizers for file formats

JHOVE (JSTORHarvard Object Validation Environment) is an extensible system designed to

provide automated and efficient identification and validation of the formats of digital files

JHOVE is a format-specific digital object validation API written in Java JHOVE supports

validation of the following formats AIFF ASCII GIF HTML JPEG JPEG 2000 PDF TIFF

UTF-8 WAVE and XML JHOVE2 is second generation validation environment [California

digital Library 2011]

The research reported in this paper is similar in intent to that of the JHOVE projectsmdashvalidation

of binary file formats As a matter of fact binary file grammars and parsers have been

constructed for the chunk-based file formats AIFF JPEG and WAVE as well as the directory-

based format TIFF However the research reported herein differs from that of the JHOVE

project in that what is sought is a technology for generating validators for binary file formats

from grammars specifying the binary file format

8 Conclusion

The research question addressed by his research is whether it is possible to extend the context-

free grammars used to specify the syntax of programming languages to the specification of

binary file formats and to use these grammars with parsers for validating the file formats of

binary files Amajor family of binary file formats chunk-based was described Then extensions

to the concepts of context-free grammars and attribute grammars were described that enable the

specification of binary file formats Examples of attribute array grammars for two chunk-based

binary file format were then presented Recursive descent parsers for this classes of grammars

was then described Experience in using ANTLR a parser generator for LL() string grammars

in generating parsers for recognizing the formats of binary files was discussed Finally related

research in data description languages for binary file formats was described and the JHOVE

project in creating Java-based tools for validating file formats was described

22

It is concluded that it is possible to extend context-free grammars to the specification of chunk-

based binary file formats Furthermore these grammars can be used with recursive descent

parsers for validating the file formats of chunk-based binary files It remains to be determined

whether these attribute grammars based on binary file (array) grammars are adequate to define

other families of binary file formats

To be a practical technology for generating recognizers (validators) for binary file formats a

parser generator is needed that creates from a binary file grammar a recursive descent parser

(with pushdown stack if needed) ANTLR which accepts LL() grammars as input and

generates a lexer as well as a parser is too heavyweight a parser generator for this task and does

not have the requisite data types built in

A capability to validate binary file formats assumes that the file format of a file is already

known In this research we use a file type identifier based on the UNIX file command and a

magic file that we have created [Underwood 2009] We have samples of the chunk-based file

formats discussed in this report including those referenced in Appendix a We also have file

signature tests (magic tests) for all these file types

We would like to have a parser generator (also called a compilerndashcompiler) that would take as

input an attribute grammar for a binary file format and generate a parser or interpreter that

recognizes the file format or that interprets the file format and displays or plays the contents The

closest parser generator that we have found to having these capabilities is ANTLR

(wwwantlrorg ) ANTLR will generate a top-down recursive descent parser from a context-free

attribute grammar However it works with strings and generates a separate scanner

Appendix A lists the names of about 75 file formats that belong to the family of ―chunk-based

file formats We have specifications for each of file formats and samples of files with these

formats

These file formats have a similar structure We have developed binary file grammars for two of

these binary file formats We should be able to do so for all of them We used these grammars

with the extensions to ANTLR to generate top-down recursive descent parsers for the file

formats defined with these grammars

The significance of success in this research task is that if binary file formats can be specified

with binary file grammars then only one parsing generator is needed to generate the parsers for

verifying that a file conforms to a particular format Similarly the same parsing generator can be

used for conversion of legacy file formats to current or standard formats Finally the same parser

generator can be used for generating viewerplayers for most file formats This would increase

23

the likelihood of preserving and making available into the indefinite future hose digital records

encoded in binary file formats

24

References

[Aho 1968] A V Aho Indexed Grammars ndash An Extension of Context-free Grammars Journal

of the ACM Vol 15 Issue 4 Oct 1968

[Alias-wavefront 1999] Maya File Formats Maya 20 1999

httpcaadarchethzchinfomayamanual

[Amigan 2011] Amigan Software IFF FORM and Chunk Registry

httpamigan1emunetregiffhtml

[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description

May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description

[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats

Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21

[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files

Version 13 Apple Computer Inc January 1989

wwwaesnashvilleorgPDFsAIFF20Specspdf

[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-

webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt

[Apple 2005] Apple Core Audio Format Specification 2005b

httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_

specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101

[Apple 2007] QuickTime File Format Specification

httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref

acehtml

[ASF 2001] ASF format version 10 reconstruction January 2001

httpavifilesourceforgenetasf-10htm

[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January

1997 wwwdcsedacukhomemxrgfx3d3DSspec

[Autodesk 2009] Autodesk DXF Reference January 2009

httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112

25

[Back 2002] G Back (2002) A specification and scripting language for binary data Generative

Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-

77

[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V

22 1979 pp 243-256

[Baum amp Noel 2002]D Baum amp G Noel The Simms Interchange File Format 2002

httpsimtechsourceforgenettechiffhtml

[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New

Horizons Software Inc 1987 httpamigan1emunetregWORDtxt

[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995

ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt

[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010

httpondasourceforgenetnlfnlfFormathtml

[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm amp Onda file

formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml

[California Digital Library 2011] California Digital Library JHOVE2 Userlsquos Guide 2011

httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf

[Caligari 1998] Caligri Caligari trueSpace file format Document Version 6 July

2002 httpwwwcaligaricomdownloadzipscncob65zip

[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information

Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements

and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-

t81pdf

[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description

Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf

[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998

[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt

[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D

Taliesin Inc httpamigan1emunetregDR2Dtxt

26

[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG

decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt

[Dunckley et al 2007] Dunckley M Rankin S Conway E amp Giaretta D (2007) The use of

file description languages for file format identification and validation Proc PV 2007

Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical

Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-

detailsw=50089

[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format

Tech 3285 July 2001 httptechebuchdocstechtech3285pdf

Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml

[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd

May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html

[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495

wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml

[Fisher amp Gruber 2005] Fisher K amp Gruber R (2005) PADS A domain specific language for

processing ad hoc data ACM Conference on Programming language Design and

Implementation ACM Press httpwwwpadsprojorgpaperspldipdf

[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG

1995

[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal

Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September

1997 wwwgnelsondemoncoukzspecquetzalhtml

[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called

Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)

httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt

[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03

httpwwwanim8orcomresourcesan8_formattxt

[Google 2010] WebP RIFF Container 2010

httpcodegooglecomspeedwebpdocsriff_containerhtml

[Haaland] Mike Haaland Flic Files (FLI) Format description

httpwwwmartinreddynetgfxanimFLItxt

27

[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube

Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf

[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice

Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt

[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata

Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1

1995

[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)

August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21

[IBM amp Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data

Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt

wwwtactilemediacominfoMCI_Control_Infohtml

httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf

[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec

11 2008 httpduskblueorgprojtoymidimidiformatpdf

[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10

ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf

[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable

Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C

Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110

[Issar amp Ward 1993] S Issar and W Ward CMUlsquos robust spoken language understanding

system EUOSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf

[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software

November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application

Version 50) wwwjasccomsupportkbarticlespspspecasp

[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image

File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version

22 April 2002

httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf

[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994

httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt

28

[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory

vol 2 pp 127_145 1968

[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical

Systems Theory vol 5 pp 95_96 1971

[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)

North-Holland Publ Co Amsterdam 1971 pp 95-109

[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language

Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml

httpwebcswpiedu~kalcoursescompilersmodule4mysahtml

[LightWave 1996] LightWave 3D Object File Format Oct 16 1996

httpsandboxdeosglightwavehtm

[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWo2] November 9 2001

httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml

[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer

1997 httppaulbourkenetdataformatscinema4dcinema4dpdf

[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and

Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf

[Maishein] Max Maischein PIC(PRO) Format

http14791177212extrafileformatgraphicspicpicprotxt

[Martin 2007] Michael Martin SVAF format description Version 02 January 2007

httpsvafsourceforgenetsvafhtml

[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and

Data Techniques July 29 1992 Revision 1097

httpnetghostnarodrugffvendspecmicriffms_rifftxt

[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New

Multimedia Data Types and Data Techniques Revision 30 April 15 1994

[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE

httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-

usstreamhhstreamaudprop_7wz7asp

[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003

December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx

29

[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-

uslibraryms899421aspx

[Morrison 1985] Jerry Morrison ―EA IFF 85rdquo Standard for Interchange Format Files

Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt

[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17

1986 wwwfine-viewcomjplabsdocilbmtxt

[Morrison 1986b] J Morrison ―SMUS IFF Simple Musical Score Electronic Arts Inc

February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus

[MultimediaWiki 2006] Smush Animation Format August 2006

httpwikimultimediacxindexphptitle=SANChunk_Forma

[MultimeduiaWiki 2008a] Electronic Arts CMV April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_CMV

[MultimediaWiki 2008b] Electronic Arts DCT April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_DCT

[MultimediaWiki 2008c] Electronic Arts MAD April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MAD

[MultimediaWiki 2008d] Electronic Arts MPC April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MPC

[MultimediaWiki 2008e] Electronic Arts VP6 April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_VP6

[MultimediaWiki 2008f] Electronic Arts TGV August 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGV

[MultimediaWiki 2008g] Electronic Arts TGQ November 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ

[MultimediaWiki 2008h] American Laser Games MM January 2008

httpwikimultimediacxindexphptitle=American_Laser_Games_MM

[MultimediaWiki 2009a] Electronic Arts TQI February 2009

httpwikimultimediacxindexphptitle=Electronic_Arts_TQI

30

[MultimediaWiki 2009b] SANM October 2009

httpwikimultimediacxindexphptitle=SANM

[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240

httpcodegooglecompll-parser-

generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42

[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter

written in Python in the open source graphics file convertor UniConvertor

httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer

[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification

January 31 2011 wwwogforgdocumentsGFD174pdf

[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988

wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt

[Parr amp Quong 1995] Parr T J amp Quong R W (1995) ANTLR A predicated- ll(k) parser

generator Software Practice and Experience 25 7 (July) 789-810

httpwwwantlrorgarticle1055550346383antlrpdf

[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific

Languages Pragmatic

[Peters] Christoph Peters The Ultimate 3D model file format (v 20)

httpultimate3dorgU3DModelFileFormatV2htm

[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003

httpxiphorgoggdocrfc3533txt

[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20

wwweblongcomzarfblorb

httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt

[ Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011

wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml

[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)

Format Version 10 2001 wwwlibpngorgpubmngspecCredits

31

[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File

Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008

httpscopenofficeorgexcelfileformatpdf

[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF

[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments

3072 March 2001 wwwietforgrfcrfc3072txt

[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft

26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt

[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and

old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21

[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988

httpamigan1emunetregANIMtxt

[Tamir 2005] Giora Tamir C RIFF Parser June 2005

wwwcodeprojectcomKBfilesriffparseraspx

[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications

In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi

Khosrow-Pour Information Science Publishing pp 268-273 2008

wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf

[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for

File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute

September 2009

[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical

Engineering The University of Queensland 1996

httpiteeuqeduau~cristinastudentsdaviddavidhtml

[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language

understanding system ARPA Human Language Technologies Workshop Plainsboro New

Jersey httpaclldcupenneduHH94H94-1039pdf

[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars

Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)

Springer-Verlag Berlin 1974 pp 9-16

32

[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley

1995 (chapter 10) (Attribute Grammars)

[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010

httpxiphorgvorbisdocVorbis_I_spechtml

[Zappe 1994] H Zappe Oktalyzer 1994

httpwwwwotsitorglistaspsearch=oktampbutton=GO21

33

Appendix A Chunk-based Binary File Formats

This appendix references specifications for about 75 chunk-based binary file formats and

discusses some of those specifications

A1 Electronic Arts Interchange File Format (IFF)

[Amigan 2011]

Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the

specification for the Interchange File Format [Morrison 1985]

A11 Interleaved Bitmap (ILBM)

[Morrison 1986]

This file format was discussed in section 41 of this report

A12 8-bit Sampled Voice (8SVX)

The file format specification for the 9-bit sampled voice file format uses C programming

language data structures to define the format [Hayes and Morrison 1985] The following is a re-

expression of that specification as a context-free binary file grammar

Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata

cksize rarr ULONG

ckdata rarr voice8header bodyck

ckdata rarr voice8header nameck copyrightck bodyck

ckdata rarr voice8header copyrightck nameck bodyck

ckdata rarr voice8header nameck bodyck

ckdata rarr voice8header copyrightck bodyck

ckdata rarr voice8header authorck bodyck

ckdata rarr voice8header annotation bodyck

ckdata rarr voice8header atakck releaseck bodyck

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 10: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

4

stored as 10 (4 byte) words at file addresses 0 4 8 hellip36 so that the element with index i has the

address 4 times i

Arrays are used to implement many other data structures such as lists and strings In most

modern computers the internal memory and external memory of storage devices (and files on

those devices) is a one-dimensional array of data types whose indices are their addresses

A data type is a classification of one of various types of data such as floating-point integer

character pointer or Boolean that determines the possible values for that type the operations

that can be performed on values of that type and the way values of that type can be stored Data

types have names such as int16 int32 float char bool ptr We define a binary file to be an array

of values of data of various types

We define a context-free binary file grammar BG as a quintuple ltN D Σ S Pgt where

N is a finite set of non-terminal symbols

D is a set of data types

Σ is a finite set of binary values of data types D called terminals

S ε N is the start symbol

P is a set of production rules of the form N rarr N Σ

Let DataTypes indicate the union of all the values of all datatypes D The set of all binary files is

BinaryFiles = [IndicesrarrDataTypes] where Indices = 1 The language generated by a

context-free array (binary file) grammar AG is the set L(AG) = w w BinaryFiles and S w

By primitive nonterminal is meant a symbol which would be replaced by one or more terminal

symbols were an additional production rule applied In other words a primitive nonterminal is a

nonterminal that appears on the left-hand side of productions which only have one or more

terminal symbols on the right We adopt the notation shown below for primitive nonterminals

ltprimitive_nonterminal_symbol type=datatype_name constant=valuegt

32 Attribute Grammars

Context-free grammars cannot represent context-sensitive aspects of programming languages

such as (1) enforcing the constraint that all variables are declared before they are used or (2)

checking the number of parameters in a function call against the number in the functionlsquos

declaration Context-free grammars have also been used to define the syntax of English and other

natural languages but they cannot define such context-sensitivity as subject verb agreement in

English sentences A number of extensions of context-free grammars have been proposed to

5

address the context-sensitivity of programming languages and natural languages These include

indexed [Aho 1968] recording [Barth 1979] affix [Koster 1971] VW [van Wijngaarden 1974]

and attribute grammars [Knuth 1969 1971]

Context-free grammars also cannot represent the semantics of programming languages Knuth

[1968 1971] proposed an extension of context-free grammars termed attribute grammars that

addresses the semantics as well as the context-sensitivity of programming languages

An attribute grammar GA is a triple ltG A ARgt where G is a context-free grammar for the

language A associates each grammar symbol X isin (N Σ) with a set of attributes and AR

associates each production R isin P with a set of attribute computation rules A(X) where X isin (N

Σ) can be further partitioned into two sets synthesized attributes S(X) and inherited attributes

I(X) AR(R) where R isin P contains rules for computing inherited and synthesized attributes

associated with the symbols in the production R Synthesized attributes are evaluated and passed

up from deeper terminals and non-terminals while inherited attributes are passed down from

upper non-terminals

An L-attributed grammar is an attribute grammar that uses

i) synthesized attributes and

ii) inherited attributes where the inherited attribute depends only on

a) attributes of the other child nodes to its left

b) inherited attributes of P itself

L-attributed grammars allow the attributes to be evaluated in one left-to-right traversal of the

abstract syntax tree As a result attribute evaluation in L-attributed grammars can be

incorporated conveniently in top-down parsing

33 Semantic Grammar

An interpreter for a binary file format that is using a grammar to display or play the contents of a

file or to convert it to another file format needs to produce a semantic representation that is

appropriate to the task Natural language understanding components of dialog systems face a

similar need to produce a semantic representation that is appropriate to the dialogue task To

address this need some developers of dialogue systems make use of an approach called semantic

grammar [Issar and Ward 1993 Ward and Issar 1994] A semantic grammar is a CFG in which

the non-terminals on the left-hand side of rules correspond to semantic entities A parser for a

semantic grammar produces a labeling of the input string with semantic node labels for example

SHOW FLIGHTS ORIGIN DESTINATION DEPART_DATE DEPART_TIME Show me flights from Boston to San Francisco on Tuesday morning

6

A semantic grammar is just what we need to file in the values of a programming language data

structure Semantic grammars are not applicable to generic systems but are limited to specific

domains This is not a problem in specifying or parsing a binary file format since a grammar for

a binary file format is not meant to apply to all binary file formats but just to a single format or

to a family of file formats

4 Grammars for Chunk-based Binary File Formats

The attribute grammar that we present will be a mixture of synthesized semantic rules (attribute

values of the left side of a production are calculated from attribute values of symbols on the right

side) and inherited rules (attribute values of symbols on the right side of a production are

calculated from attribute values of the symbol on the left side)

The field names of the file format spec are represented as nonterminals in the grammar Constant

values in fields are terminals that appear on the right-hand side of productions Nonterminal

fieldnames have data types associated with them

41 InterLeaved BitMap (ILBM)

The author of the Interleaved Bitmap (ILBM) file format recognized that the structure of the file

format could be specified using pseudo EBNF rules and C data structure declarations [Morrison

1986] In this section the ―grammar of the ILBM file format specification is converted to a

context-free binary grammar with synthesized and inherited attributes

Figure 2 shows a display of an ILBM file hampic Figure 3 shows an attribute grammar for its

chunk-based binary file format The attribute grammar is a mixture of synthesized semantic rules

(attribute values of the left side of a production are calculated from attribute values of symbols

on the right side) and inherited rules (attribute values of symbols on the right side of a

production are calculated from attribute values of the symbol on the left side) The attribute

grammar is also an L-attributed grammar Figure 4 shows a parse tree for the file hampic

7

Figure 2 ILBM File hampic Displayed

ltILBMgt rarr ―FORM ltcksize TYPE=UINT32gt ―ILBM ltBMHDgt ltCMAPgt ltCAMGgt ltBODYgt

ILBMcksize = cksizevalue

ltBMHDgt rarr ―BMHD ltcksize TYPE=uint32gt ltBitmapHeadergt

ltBitMapheadergt rarr ltwidth type=UINT32gt

lt height type=UINT16gt

lt xposition type=INT16gt

ltyposition type=INT16gt

ltnplanes type=BYTEgt

ltmasking type=BYTEgt

ltcompression type=BYTEgt

ltreserved type=BYTEgt

lttransparentcolor type=UINT16gt

ltxaspect type=BYTEgt

ltyaspect type=BYTEgt

ltpagewidth type=INT16gt

ltpageheight type=INT16gt

ltCMAPgt -rarr―CMAP ltcksize type=UINT32gt ltcolorgt[n] n=cksizeval3

ltCAMGgt rarr ―CAMG ltcksizetype=UINT32gt ltviewmode type=INT32gt

ltBODYgt rarr ―BODY ltcksize type=UINT32gt ltdata type=BYTEgt[cksize]

ltcolorgt rarr ltred type=BYTEgt ltgreen type= BYTEgt ltblue type=BYTEgt

Figure 3 An Attribute Grammar for an ILBM Chunk-based File Format

8

Figure 4 A Parse Tree for the ILBM File hampic

9

42 RIFF Waveform PCM Audio File Format (WAVE)

The waveform PCM File format is was originally specified in an extended BNF notation [IBM amp

Microsoft 1991] The chunk size was omitted from the EBNF rules Figure 5 shows the EBNF

rules with the cksize inserted To convert this grammar into a binary file format grammar data

types are added for primitive non-terminals

ltWAVE-formgt -gt bdquoRIFF‟ cksize WAVE

ltfmt-ckgt

[ltfact-ckgt]

[ltcue-ckgt]

[ltplaylist-ckgt]

[ltassoc-data-listgt]

ltwave-datagt

ltfmt-ckgt -gt bdquofmt ‟ cksize ltcommon-fieldsgt ltwBitsPerSampleWORDgt

ltcommon-fieldsgt -gt ltwFormatTagWORD value=0x0001gt

ltwChannelsWORDgt

ltdwSamplesPerSecDWORDgt

ltdwAvgBytesPerSecDWORDgt

ltwBlockAlignWORDgt

ltwave-datagt -gt ltdata-ckgt | ltwave-listgt

ltdata-ckgt -gt data cksize ltwave-datagt

ltwave-listgt -gt LIST cksize wavl ltdata-ckgt | ltsilence-ckgt

ltsilence-ckgt -gt slnt cksize ltdwSamplesDWORDgt

ltfact-ckgt -gt factxksize ltdwFileSizeDWORDgt

ltcue-ckgt -gt cue cksize ltdwCuePointsDWORDgt ltcue-pointgt+

ltcue-pointgt -gt ltdwNameDWORDgt

ltdwPositionDWORDgt

ltfccChunkFOURCCgt

ltdwChunkStartDWORDgt

ltdwBlockStartDWORDgt

ltdwSampleOffsetDWORDgt

ltplaylist-ckgt -gt bdquoplst‟ cksize ltdwSegmentsDWORDgt ltplay-segmentgt+

ltplay-segmentgt -gt ltdwNameDWORDgt

ltdwLengthDWORDgt

ltdwLoopsDWORD

ltassoc-data-listgt-gt bdquoLIST‟ cksize adtl ltlabl-ckgtltnote-ckgtltltxt-ckgtltfile-ckgt

ltlabl-ckgt -gt bdquolabl‟ cksize ltdwNameDWORDgt ltdataZSTRgt

ltnote-ckgt -gt bdquonote‟ cksize ltdwNameDWORDgt ltdataZSTRgt

10

ltltxt-ckgt -gt bdquoltxt‟ cksize ltdwNameDWORDgt

ltdwSampleLengthDWORDgt

ltdwPurposeDWORDgt

ltwCountryWORDgt

ltwLanguageWORDgt

ltwDialectWORDgt

ltwCodePageWORDgt

ltdataBYTEgt+

ltfile-ckgt -gt bdquofile‟ cksize ltdwNameDWORDgt ltdwMedTypeDWORDgt

ltfileDataBYTEgt+

Figure 5 Binary File Grammar for RIFF WAVEFORM PCM File Format

5 Recursive Descent Parsers for Binary File Grammars

A recursive descent parser is a top-down parser built from a set of recursive (or non recursive)

procedures where each procedure implements one of the production rules of the grammar A

predictive parser is a recursive descent parser that does not require backtracking Predictive

parsing is possible only for the class of LL(k) grammars which are the context-free grammars

for which there exists some positive integer k that allows a recursive descent parser to decide

which production to use by examining only the next k tokens of input A recursive descent parser

for the binary file format grammars defined in this paper does not require a separate lexer to

identify the data of a file The data types are predicted by the grammar rules

The semantic rules of an L-attributed grammar can be included in the recursive descent parser to

check for context-sensitive aspects of the grammar such as chunk size or image dimensions and

use these values in parsing the chunk data or images The pointers to file addresses that occur in

the file formats are handled by pushing the current address in the file onto a pushdown stack

When the parser exhausts the procedures implementing production rules corresponding to the

pointed to location the topmost address on the pushdown stack is popped and the procedure

corresponding to the nonterminal at that position is executed

Recursive descent parsers with a pushdown stack can be used with binary file format grammars

to create file format recognizers A recognizer (syntax checker or validator) is a parser that reads

a file and generates an error if the file does not conform to the syntax specified by the grammar

11

6 A Parser Generator for Recognizing Binary File Formats

What we want is a parser generator whose input is a binary file attribute grammar for a particular

binary file format and whose generated output is the Java source code of a recursive descent

parser (with a pushdown stack if needed) for the class of binary file formats specified by the

grammar Such a parser generator does not yet exist but there is a widely used parser generator

for attribute grammars for string-based languages

61 ANTLR

ANTLR (ANother Tool for Language Recognition) is a parser generator that uses LL() parsing

[Parr amp Quong 1995 Parr 2007] ANTLR takes as input an attribute grammar that specifies a

language and generates as output source code for a recognizer for that language ANTLR

supports generating code in a number of the programming languages including C Java

JavaScript and Python

ANTLR also generates a lexer from lexical rules in the grammar A lexer is a program that

converts a sequence of characters into tokens A lexer is not needed to parse binary file formats

because the data types predicted by the parser are the tokens of the grammar Furthermore the

capability to recognize LL() grammars is not needed because the binary file format grammars

specified so far do not require lookahead of k symbols

62 Using ANTLR to Generate a Parser for a Binary File Format

It is possible to use ANTLR to test our binary file format grammars in recognizing the chunk-

based and directory-based binary file formats This is accomplished by writing a lexer that treats

each input byte in a file as a character token Then functions are created for each data type in the

binary file grammar that convert the appropriate number of character tokens to binary data types

eg int 16 int32 etc This is effective if someone clumsy and has allowed us to test our

attribute grammars for binary file formats as well as to demonstrate the feasibility of creating a

parser generator for binary file grammars

621 An ANTLR Grammar for an ILBM Chunk-based File Format

Figure 6 shows a specification for the ILBM file format that is input to ANTLR The output of

ANTLR is a lexer and parser for ILBM files

12

grammar iblm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in lexerheader is placed at the top of the lexer file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Grammar Rules used to create the Parser

Start symbol ilbm

iblm returns [boolean result]

FORM ckSize ILBM

Print out ILBM chunksize

Systemoutprintln(ID ILBM Size + $ckSizevalue)

Semantic predicate that ensures that at least on BMHD is found

propertyChunks $propertyChunksfoundBMHD == true

dataChunks

body

$result = true

The bmhd cmap and camg property chunks can occur in any order

At least one BMHD data chunk must be present

propertyChunks returns [boolean foundBMHD]

(bmhd $foundBMHD = true

| cmap

| camg)+

bmhd

BMHD ckSize

Systemoutprintln(ID BMHD Size + $ckSizevalue)

13

bitMapHeader

bitMapHeader

width = uWordValue

height = uWordValue

xPosition = wordValue

yPosition = wordValue

nPanes = uByteValue

masking = uByteValue

compression = uByteValue

reserved1 = uByteValue

transparentColor = uWordValue

xAspect = uByteValue

yAspect = uByteValue

pageWidth = wordValue

pageHeight = wordValue

Print out values in the bitmap structure

Systemoutprintln( Width + $widthvalue)

Systemoutprintln( Height + $heightvalue)

Systemoutprintln( X Position + $xPositionvalue)

Systemoutprintln( Y Position + $yPositionvalue)

Systemoutprintln( Number of Panes + $nPanesvalue)

Systemoutprintln( Masking + $maskingvalue)

Systemoutprintln( Compression + $compressionvalue)

Systemoutprintln( Reserved1 + $reserved1value)

Systemoutprintln( TransParent Color +

$transparentColorvalue)

Systemoutprintln( X Aspest + $xAspectvalue)

Systemoutprintln( Y Aspect + $yAspectvalue)

Systemoutprintln( Page Width + $pageWidthvalue)

Systemoutprintln( Page Height + $pageHeightvalue)

cmap returns [long value]

Initialize the index value i in the cmap function

init int i = 0

CMAP ckSize

Print out chunk size

$value = $ckSizevalue

Systemoutprintln(ID CMAP Size + $ckSizevalue)

(colorRegister

Print out color values

Systemoutprintln( Color[ + i++ + ] +

$colorRegisterredValue +

+ $colorRegistergreenValue +

+ $colorRegisterblueValue +

)

)+

colorRegister returns [int redValue int greenValue int blueValue]

14

red = uByteValue green = uByteValue blue = uByteValue

$redValue = $redvalue

$greenValue = $greenvalue

$blueValue = $bluevalue

camg

CAMG ckSize longValue

Systemoutprintln(ID CAMG Size + $ckSizevalue)

Systemoutprintln( View Modes 0x +

LongtoHexString($longValuevalue))

The crng and ccrt data chunks

dataChunks

crng

| ccrt

crng

CRNG ckSize

Systemoutprintln(ID CRNG Size + $ckSizevalue)

cRange

cRange

pad1 = wordValue $pad1value == 0

rate = wordValue

active = wordValue

low = uByteValue

high = uByteValue

ccrt

CCRT ckSize

Systemoutprintln(ID CCRT Size + $ckSizevalue)

cycleInfo

cycleInfo

direction = wordValue

start = uByteValue

end = uByteValue

seconds = longValue

microseconds = longValue

pad = wordValue $padvalue == 0

body

BODY ckSize

Systemoutprintln(ID BODY Size + $ckSizevalue)

$ckSizevalue gt 0 data

Systemoutprintln( Number of bytes + $datavalue)

$datavalue == $ckSizevalue

defines the data chunk of the body chunk

data returns [long value]

(b+=BYTE)+ $bsize() gt 0

$value = $bsize()

15

The cksize is the value of a long integer

ckSize returns [long value]

longValue

$value = $longValuevalue

defines the longValue data typein terms of BYTE terminals

longValue returns [long value] throws UnsupportedEncodingException

h1 = BYTE h2 = BYTE l1= BYTE l2 = BYTE

$value = (long)

( getUByteValue($h1) ltlt 24 |

getUByteValue($h2) ltlt 16 |

getUByteValue($l1) ltlt 8 |

getUByteValue($l2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uWordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (short)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uByteValue returns [int value]

BYTE

calls the getUByteValue from a BYTE token

16

$value = (int) getUByteValue($BYTE)

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rules used to create the Lexer

FORM FORM

ILBM ILBM

BMHD BMHD

CMAP CMAP

CAMG CAMG

CRNG CRNG

CCRT CCRT

BODY BODY

BYTE

Figure 6 An ANTLR Grammar for the ILBM File Format

622 The Lexer and Parser

When provided the file described in the previous section ANTLR generates the ilbmlexer and

ilbmparser Java programs The Java application shown in Figure 6 uses these programs to

validate the ILBM file BLKIFF It produces the output shown in Figure 7

package orggtrigrammarschunk

import javaioIOException

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRStringStream

import organtlrruntimeCharStream

import organtlrruntimeCommonTokenStream

import organtlrruntimeRecognitionException

17

import organtlrruntimeTokenStream

import gtriorggrammarschunkiblmLexer

import gtriorggrammarschunkiblmParser

public class Test

static String filePath = CUserssl170PicturesIFF-ilbmBLKIFF

public static void main(String[] args) throws RecognitionException IOException

ANTLRFileStream stream = new ANTLRFileStream(filePath ISO-8859-1)

iblmLexer lexer = new iblmLexer(stream)

TokenStream tokenStream = new CommonTokenStream(lexer)

iblmParser parser = new iblmParser(tokenStream)

Systemoutprintln(FILE + filePath)

Systemoutprintln(Successfully parsed = + parseriblm())

Figure 7 A Java Application for Parsing an ILBM File

FILE CUserssl170PicturesIFF-ilbmBLKIFF

ID ILBM Size 2556

ID BMHD Size 20

Width 200

Height 144

X Position 0

Y Position 0

Number of Panes 6

Masking0

Compression1

Reserved10

TransParent Color0

X Aspest3

Y Aspect3

Page Width200

Page Height144

ID CMAP Size 48

Color[0] 0 0 0

Color[1] 0 128 128

Color[2] 0 255 255

Color[3] 255 0 255

Color[4] 0 255 0

Color[5] 255 0 0

Color[6] 0 0 255

Color[7] 255 255 0

Color[8] 128 128 128

Color[9] 192 192 192

Color[10] 128 0 0

Color[11] 0 128 0

Color[12] 128 128 0

Color[13] 0 0 128

Color[14] 128 0 128

18

Color[15] 255 255 255

ID CAMG Size 4

View Modes 0x800

ID BODY Size 2448

Number of bytes 2448

Successfully parsed = true

Figure 8 Output of the ILBM Parser Applied to the file BLKIFF

63 An ANTLR Grammar for the RIFF Waveform PCM File Format

Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR

The output of ANTLR is a lexer and parser for WAVEFOR PCM files

grammar waveform_pcm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text contained in lexerheader is placed at the top of the lexer

file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Default ByteOrder to Little Endian

String byteOrder = II

Rules used to create the Parser

start symbol waveform_pcm

waveform_pcm returns [boolean result]

19

ckId = fourCC $ckIdvalueequals(RIFF)

ckSize = dWordValue

formType = fourCC $formTypevalueequals(WAVE)

formatSubChunk

dataSubChunk

$result = true

fourCC returns [String value]

BYTE BYTE BYTE BYTE $value = $text

formatSubChunk

subChunkId = fourCC $subChunkIdvalueequals(fmt )

ckSize = dWordValue

commonFields

pcmSpecificField

commonFields

formatTag = wordValue Format category

channels = wordValue Number of channels

samplesPerSec = dWordValue Sampling rate

avgBytesPerSec = dWordValue For buffer estimation

blockAlign = wordValue Data block size

pcmSpecificField

bitsPerSample = wordValue Sample size

dataSubChunk

subChunkId = fourCC $subChunkIdvalueequals(data)

ckSize = dWordValue

waveData[$ckSizevalue]

waveData [long size]

(size gt= 0=gt BYTE size--)+

defines the long data type in terms of BYTE terminals

dWordValue returns [long value] throws UnsupportedEncodingException

b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE

if (byteOrderequals(II))

$value = (long)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 |

getUByteValue($b3) ltlt 16 |

getUByteValue($b4) ltlt 24 )

else

$value = (long)

( getUByteValue($b1) ltlt 24 |

getUByteValue($b2) ltlt 16 |

20

getUByteValue($b3) ltlt 8 |

getUByteValue($b4) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

if (byteOrderequals(II))

$value = (int)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 )

else

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rule used to create the Lexer

BYTE

Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format

7 Related Research

Researchers have developed a number of data description languages to support validation of file

formats EAST (Enhanced ADA Subset) is a data description language developed by the

Consultative Committee for Space Data Systems [2010] The Data Entity Dictionary

Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic

information The EAST description is used to interpret and provide access to information in

binary and text files

DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to

manipulate Java jar files and ELF object files The PADSC data description language can be

used to define the file formats of text and binary files [Fisher amp Gruber 2005] The PADSC

compiler compiles the description into tools that can be used to recognize manipulate and

transform the data into other formats

21

The Data Format Description Language (DFDL) is being developed by a Working Group of the

Open Grid Forum [OGF 2011] The data description language is based on a subset of W3C XML

Schema using ltxsappinfogt annotations to carry the extra information necessary to describe non-

XML physical representations Version 1 of the language specification was published in 2011

and a parser is being implemented

Each of the data description languages described above can be used to define data types and file

structures of binary files However they are based on file structure declarations such as those of

C or Java The binary file format grammar described in this paper most closely resembles DFDL

However the binary file grammar described in this paper is the only data description language

based on formal grammars that is used for creating recognizers for file formats

JHOVE (JSTORHarvard Object Validation Environment) is an extensible system designed to

provide automated and efficient identification and validation of the formats of digital files

JHOVE is a format-specific digital object validation API written in Java JHOVE supports

validation of the following formats AIFF ASCII GIF HTML JPEG JPEG 2000 PDF TIFF

UTF-8 WAVE and XML JHOVE2 is second generation validation environment [California

digital Library 2011]

The research reported in this paper is similar in intent to that of the JHOVE projectsmdashvalidation

of binary file formats As a matter of fact binary file grammars and parsers have been

constructed for the chunk-based file formats AIFF JPEG and WAVE as well as the directory-

based format TIFF However the research reported herein differs from that of the JHOVE

project in that what is sought is a technology for generating validators for binary file formats

from grammars specifying the binary file format

8 Conclusion

The research question addressed by his research is whether it is possible to extend the context-

free grammars used to specify the syntax of programming languages to the specification of

binary file formats and to use these grammars with parsers for validating the file formats of

binary files Amajor family of binary file formats chunk-based was described Then extensions

to the concepts of context-free grammars and attribute grammars were described that enable the

specification of binary file formats Examples of attribute array grammars for two chunk-based

binary file format were then presented Recursive descent parsers for this classes of grammars

was then described Experience in using ANTLR a parser generator for LL() string grammars

in generating parsers for recognizing the formats of binary files was discussed Finally related

research in data description languages for binary file formats was described and the JHOVE

project in creating Java-based tools for validating file formats was described

22

It is concluded that it is possible to extend context-free grammars to the specification of chunk-

based binary file formats Furthermore these grammars can be used with recursive descent

parsers for validating the file formats of chunk-based binary files It remains to be determined

whether these attribute grammars based on binary file (array) grammars are adequate to define

other families of binary file formats

To be a practical technology for generating recognizers (validators) for binary file formats a

parser generator is needed that creates from a binary file grammar a recursive descent parser

(with pushdown stack if needed) ANTLR which accepts LL() grammars as input and

generates a lexer as well as a parser is too heavyweight a parser generator for this task and does

not have the requisite data types built in

A capability to validate binary file formats assumes that the file format of a file is already

known In this research we use a file type identifier based on the UNIX file command and a

magic file that we have created [Underwood 2009] We have samples of the chunk-based file

formats discussed in this report including those referenced in Appendix a We also have file

signature tests (magic tests) for all these file types

We would like to have a parser generator (also called a compilerndashcompiler) that would take as

input an attribute grammar for a binary file format and generate a parser or interpreter that

recognizes the file format or that interprets the file format and displays or plays the contents The

closest parser generator that we have found to having these capabilities is ANTLR

(wwwantlrorg ) ANTLR will generate a top-down recursive descent parser from a context-free

attribute grammar However it works with strings and generates a separate scanner

Appendix A lists the names of about 75 file formats that belong to the family of ―chunk-based

file formats We have specifications for each of file formats and samples of files with these

formats

These file formats have a similar structure We have developed binary file grammars for two of

these binary file formats We should be able to do so for all of them We used these grammars

with the extensions to ANTLR to generate top-down recursive descent parsers for the file

formats defined with these grammars

The significance of success in this research task is that if binary file formats can be specified

with binary file grammars then only one parsing generator is needed to generate the parsers for

verifying that a file conforms to a particular format Similarly the same parsing generator can be

used for conversion of legacy file formats to current or standard formats Finally the same parser

generator can be used for generating viewerplayers for most file formats This would increase

23

the likelihood of preserving and making available into the indefinite future hose digital records

encoded in binary file formats

24

References

[Aho 1968] A V Aho Indexed Grammars ndash An Extension of Context-free Grammars Journal

of the ACM Vol 15 Issue 4 Oct 1968

[Alias-wavefront 1999] Maya File Formats Maya 20 1999

httpcaadarchethzchinfomayamanual

[Amigan 2011] Amigan Software IFF FORM and Chunk Registry

httpamigan1emunetregiffhtml

[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description

May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description

[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats

Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21

[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files

Version 13 Apple Computer Inc January 1989

wwwaesnashvilleorgPDFsAIFF20Specspdf

[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-

webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt

[Apple 2005] Apple Core Audio Format Specification 2005b

httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_

specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101

[Apple 2007] QuickTime File Format Specification

httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref

acehtml

[ASF 2001] ASF format version 10 reconstruction January 2001

httpavifilesourceforgenetasf-10htm

[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January

1997 wwwdcsedacukhomemxrgfx3d3DSspec

[Autodesk 2009] Autodesk DXF Reference January 2009

httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112

25

[Back 2002] G Back (2002) A specification and scripting language for binary data Generative

Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-

77

[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V

22 1979 pp 243-256

[Baum amp Noel 2002]D Baum amp G Noel The Simms Interchange File Format 2002

httpsimtechsourceforgenettechiffhtml

[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New

Horizons Software Inc 1987 httpamigan1emunetregWORDtxt

[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995

ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt

[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010

httpondasourceforgenetnlfnlfFormathtml

[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm amp Onda file

formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml

[California Digital Library 2011] California Digital Library JHOVE2 Userlsquos Guide 2011

httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf

[Caligari 1998] Caligri Caligari trueSpace file format Document Version 6 July

2002 httpwwwcaligaricomdownloadzipscncob65zip

[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information

Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements

and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-

t81pdf

[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description

Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf

[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998

[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt

[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D

Taliesin Inc httpamigan1emunetregDR2Dtxt

26

[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG

decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt

[Dunckley et al 2007] Dunckley M Rankin S Conway E amp Giaretta D (2007) The use of

file description languages for file format identification and validation Proc PV 2007

Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical

Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-

detailsw=50089

[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format

Tech 3285 July 2001 httptechebuchdocstechtech3285pdf

Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml

[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd

May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html

[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495

wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml

[Fisher amp Gruber 2005] Fisher K amp Gruber R (2005) PADS A domain specific language for

processing ad hoc data ACM Conference on Programming language Design and

Implementation ACM Press httpwwwpadsprojorgpaperspldipdf

[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG

1995

[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal

Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September

1997 wwwgnelsondemoncoukzspecquetzalhtml

[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called

Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)

httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt

[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03

httpwwwanim8orcomresourcesan8_formattxt

[Google 2010] WebP RIFF Container 2010

httpcodegooglecomspeedwebpdocsriff_containerhtml

[Haaland] Mike Haaland Flic Files (FLI) Format description

httpwwwmartinreddynetgfxanimFLItxt

27

[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube

Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf

[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice

Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt

[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata

Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1

1995

[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)

August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21

[IBM amp Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data

Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt

wwwtactilemediacominfoMCI_Control_Infohtml

httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf

[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec

11 2008 httpduskblueorgprojtoymidimidiformatpdf

[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10

ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf

[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable

Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C

Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110

[Issar amp Ward 1993] S Issar and W Ward CMUlsquos robust spoken language understanding

system EUOSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf

[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software

November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application

Version 50) wwwjasccomsupportkbarticlespspspecasp

[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image

File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version

22 April 2002

httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf

[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994

httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt

28

[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory

vol 2 pp 127_145 1968

[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical

Systems Theory vol 5 pp 95_96 1971

[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)

North-Holland Publ Co Amsterdam 1971 pp 95-109

[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language

Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml

httpwebcswpiedu~kalcoursescompilersmodule4mysahtml

[LightWave 1996] LightWave 3D Object File Format Oct 16 1996

httpsandboxdeosglightwavehtm

[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWo2] November 9 2001

httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml

[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer

1997 httppaulbourkenetdataformatscinema4dcinema4dpdf

[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and

Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf

[Maishein] Max Maischein PIC(PRO) Format

http14791177212extrafileformatgraphicspicpicprotxt

[Martin 2007] Michael Martin SVAF format description Version 02 January 2007

httpsvafsourceforgenetsvafhtml

[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and

Data Techniques July 29 1992 Revision 1097

httpnetghostnarodrugffvendspecmicriffms_rifftxt

[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New

Multimedia Data Types and Data Techniques Revision 30 April 15 1994

[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE

httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-

usstreamhhstreamaudprop_7wz7asp

[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003

December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx

29

[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-

uslibraryms899421aspx

[Morrison 1985] Jerry Morrison ―EA IFF 85rdquo Standard for Interchange Format Files

Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt

[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17

1986 wwwfine-viewcomjplabsdocilbmtxt

[Morrison 1986b] J Morrison ―SMUS IFF Simple Musical Score Electronic Arts Inc

February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus

[MultimediaWiki 2006] Smush Animation Format August 2006

httpwikimultimediacxindexphptitle=SANChunk_Forma

[MultimeduiaWiki 2008a] Electronic Arts CMV April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_CMV

[MultimediaWiki 2008b] Electronic Arts DCT April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_DCT

[MultimediaWiki 2008c] Electronic Arts MAD April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MAD

[MultimediaWiki 2008d] Electronic Arts MPC April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MPC

[MultimediaWiki 2008e] Electronic Arts VP6 April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_VP6

[MultimediaWiki 2008f] Electronic Arts TGV August 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGV

[MultimediaWiki 2008g] Electronic Arts TGQ November 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ

[MultimediaWiki 2008h] American Laser Games MM January 2008

httpwikimultimediacxindexphptitle=American_Laser_Games_MM

[MultimediaWiki 2009a] Electronic Arts TQI February 2009

httpwikimultimediacxindexphptitle=Electronic_Arts_TQI

30

[MultimediaWiki 2009b] SANM October 2009

httpwikimultimediacxindexphptitle=SANM

[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240

httpcodegooglecompll-parser-

generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42

[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter

written in Python in the open source graphics file convertor UniConvertor

httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer

[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification

January 31 2011 wwwogforgdocumentsGFD174pdf

[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988

wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt

[Parr amp Quong 1995] Parr T J amp Quong R W (1995) ANTLR A predicated- ll(k) parser

generator Software Practice and Experience 25 7 (July) 789-810

httpwwwantlrorgarticle1055550346383antlrpdf

[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific

Languages Pragmatic

[Peters] Christoph Peters The Ultimate 3D model file format (v 20)

httpultimate3dorgU3DModelFileFormatV2htm

[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003

httpxiphorgoggdocrfc3533txt

[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20

wwweblongcomzarfblorb

httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt

[ Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011

wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml

[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)

Format Version 10 2001 wwwlibpngorgpubmngspecCredits

31

[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File

Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008

httpscopenofficeorgexcelfileformatpdf

[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF

[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments

3072 March 2001 wwwietforgrfcrfc3072txt

[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft

26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt

[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and

old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21

[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988

httpamigan1emunetregANIMtxt

[Tamir 2005] Giora Tamir C RIFF Parser June 2005

wwwcodeprojectcomKBfilesriffparseraspx

[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications

In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi

Khosrow-Pour Information Science Publishing pp 268-273 2008

wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf

[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for

File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute

September 2009

[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical

Engineering The University of Queensland 1996

httpiteeuqeduau~cristinastudentsdaviddavidhtml

[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language

understanding system ARPA Human Language Technologies Workshop Plainsboro New

Jersey httpaclldcupenneduHH94H94-1039pdf

[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars

Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)

Springer-Verlag Berlin 1974 pp 9-16

32

[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley

1995 (chapter 10) (Attribute Grammars)

[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010

httpxiphorgvorbisdocVorbis_I_spechtml

[Zappe 1994] H Zappe Oktalyzer 1994

httpwwwwotsitorglistaspsearch=oktampbutton=GO21

33

Appendix A Chunk-based Binary File Formats

This appendix references specifications for about 75 chunk-based binary file formats and

discusses some of those specifications

A1 Electronic Arts Interchange File Format (IFF)

[Amigan 2011]

Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the

specification for the Interchange File Format [Morrison 1985]

A11 Interleaved Bitmap (ILBM)

[Morrison 1986]

This file format was discussed in section 41 of this report

A12 8-bit Sampled Voice (8SVX)

The file format specification for the 9-bit sampled voice file format uses C programming

language data structures to define the format [Hayes and Morrison 1985] The following is a re-

expression of that specification as a context-free binary file grammar

Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata

cksize rarr ULONG

ckdata rarr voice8header bodyck

ckdata rarr voice8header nameck copyrightck bodyck

ckdata rarr voice8header copyrightck nameck bodyck

ckdata rarr voice8header nameck bodyck

ckdata rarr voice8header copyrightck bodyck

ckdata rarr voice8header authorck bodyck

ckdata rarr voice8header annotation bodyck

ckdata rarr voice8header atakck releaseck bodyck

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 11: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

5

address the context-sensitivity of programming languages and natural languages These include

indexed [Aho 1968] recording [Barth 1979] affix [Koster 1971] VW [van Wijngaarden 1974]

and attribute grammars [Knuth 1969 1971]

Context-free grammars also cannot represent the semantics of programming languages Knuth

[1968 1971] proposed an extension of context-free grammars termed attribute grammars that

addresses the semantics as well as the context-sensitivity of programming languages

An attribute grammar GA is a triple ltG A ARgt where G is a context-free grammar for the

language A associates each grammar symbol X isin (N Σ) with a set of attributes and AR

associates each production R isin P with a set of attribute computation rules A(X) where X isin (N

Σ) can be further partitioned into two sets synthesized attributes S(X) and inherited attributes

I(X) AR(R) where R isin P contains rules for computing inherited and synthesized attributes

associated with the symbols in the production R Synthesized attributes are evaluated and passed

up from deeper terminals and non-terminals while inherited attributes are passed down from

upper non-terminals

An L-attributed grammar is an attribute grammar that uses

i) synthesized attributes and

ii) inherited attributes where the inherited attribute depends only on

a) attributes of the other child nodes to its left

b) inherited attributes of P itself

L-attributed grammars allow the attributes to be evaluated in one left-to-right traversal of the

abstract syntax tree As a result attribute evaluation in L-attributed grammars can be

incorporated conveniently in top-down parsing

33 Semantic Grammar

An interpreter for a binary file format that is using a grammar to display or play the contents of a

file or to convert it to another file format needs to produce a semantic representation that is

appropriate to the task Natural language understanding components of dialog systems face a

similar need to produce a semantic representation that is appropriate to the dialogue task To

address this need some developers of dialogue systems make use of an approach called semantic

grammar [Issar and Ward 1993 Ward and Issar 1994] A semantic grammar is a CFG in which

the non-terminals on the left-hand side of rules correspond to semantic entities A parser for a

semantic grammar produces a labeling of the input string with semantic node labels for example

SHOW FLIGHTS ORIGIN DESTINATION DEPART_DATE DEPART_TIME Show me flights from Boston to San Francisco on Tuesday morning

6

A semantic grammar is just what we need to file in the values of a programming language data

structure Semantic grammars are not applicable to generic systems but are limited to specific

domains This is not a problem in specifying or parsing a binary file format since a grammar for

a binary file format is not meant to apply to all binary file formats but just to a single format or

to a family of file formats

4 Grammars for Chunk-based Binary File Formats

The attribute grammar that we present will be a mixture of synthesized semantic rules (attribute

values of the left side of a production are calculated from attribute values of symbols on the right

side) and inherited rules (attribute values of symbols on the right side of a production are

calculated from attribute values of the symbol on the left side)

The field names of the file format spec are represented as nonterminals in the grammar Constant

values in fields are terminals that appear on the right-hand side of productions Nonterminal

fieldnames have data types associated with them

41 InterLeaved BitMap (ILBM)

The author of the Interleaved Bitmap (ILBM) file format recognized that the structure of the file

format could be specified using pseudo EBNF rules and C data structure declarations [Morrison

1986] In this section the ―grammar of the ILBM file format specification is converted to a

context-free binary grammar with synthesized and inherited attributes

Figure 2 shows a display of an ILBM file hampic Figure 3 shows an attribute grammar for its

chunk-based binary file format The attribute grammar is a mixture of synthesized semantic rules

(attribute values of the left side of a production are calculated from attribute values of symbols

on the right side) and inherited rules (attribute values of symbols on the right side of a

production are calculated from attribute values of the symbol on the left side) The attribute

grammar is also an L-attributed grammar Figure 4 shows a parse tree for the file hampic

7

Figure 2 ILBM File hampic Displayed

ltILBMgt rarr ―FORM ltcksize TYPE=UINT32gt ―ILBM ltBMHDgt ltCMAPgt ltCAMGgt ltBODYgt

ILBMcksize = cksizevalue

ltBMHDgt rarr ―BMHD ltcksize TYPE=uint32gt ltBitmapHeadergt

ltBitMapheadergt rarr ltwidth type=UINT32gt

lt height type=UINT16gt

lt xposition type=INT16gt

ltyposition type=INT16gt

ltnplanes type=BYTEgt

ltmasking type=BYTEgt

ltcompression type=BYTEgt

ltreserved type=BYTEgt

lttransparentcolor type=UINT16gt

ltxaspect type=BYTEgt

ltyaspect type=BYTEgt

ltpagewidth type=INT16gt

ltpageheight type=INT16gt

ltCMAPgt -rarr―CMAP ltcksize type=UINT32gt ltcolorgt[n] n=cksizeval3

ltCAMGgt rarr ―CAMG ltcksizetype=UINT32gt ltviewmode type=INT32gt

ltBODYgt rarr ―BODY ltcksize type=UINT32gt ltdata type=BYTEgt[cksize]

ltcolorgt rarr ltred type=BYTEgt ltgreen type= BYTEgt ltblue type=BYTEgt

Figure 3 An Attribute Grammar for an ILBM Chunk-based File Format

8

Figure 4 A Parse Tree for the ILBM File hampic

9

42 RIFF Waveform PCM Audio File Format (WAVE)

The waveform PCM File format is was originally specified in an extended BNF notation [IBM amp

Microsoft 1991] The chunk size was omitted from the EBNF rules Figure 5 shows the EBNF

rules with the cksize inserted To convert this grammar into a binary file format grammar data

types are added for primitive non-terminals

ltWAVE-formgt -gt bdquoRIFF‟ cksize WAVE

ltfmt-ckgt

[ltfact-ckgt]

[ltcue-ckgt]

[ltplaylist-ckgt]

[ltassoc-data-listgt]

ltwave-datagt

ltfmt-ckgt -gt bdquofmt ‟ cksize ltcommon-fieldsgt ltwBitsPerSampleWORDgt

ltcommon-fieldsgt -gt ltwFormatTagWORD value=0x0001gt

ltwChannelsWORDgt

ltdwSamplesPerSecDWORDgt

ltdwAvgBytesPerSecDWORDgt

ltwBlockAlignWORDgt

ltwave-datagt -gt ltdata-ckgt | ltwave-listgt

ltdata-ckgt -gt data cksize ltwave-datagt

ltwave-listgt -gt LIST cksize wavl ltdata-ckgt | ltsilence-ckgt

ltsilence-ckgt -gt slnt cksize ltdwSamplesDWORDgt

ltfact-ckgt -gt factxksize ltdwFileSizeDWORDgt

ltcue-ckgt -gt cue cksize ltdwCuePointsDWORDgt ltcue-pointgt+

ltcue-pointgt -gt ltdwNameDWORDgt

ltdwPositionDWORDgt

ltfccChunkFOURCCgt

ltdwChunkStartDWORDgt

ltdwBlockStartDWORDgt

ltdwSampleOffsetDWORDgt

ltplaylist-ckgt -gt bdquoplst‟ cksize ltdwSegmentsDWORDgt ltplay-segmentgt+

ltplay-segmentgt -gt ltdwNameDWORDgt

ltdwLengthDWORDgt

ltdwLoopsDWORD

ltassoc-data-listgt-gt bdquoLIST‟ cksize adtl ltlabl-ckgtltnote-ckgtltltxt-ckgtltfile-ckgt

ltlabl-ckgt -gt bdquolabl‟ cksize ltdwNameDWORDgt ltdataZSTRgt

ltnote-ckgt -gt bdquonote‟ cksize ltdwNameDWORDgt ltdataZSTRgt

10

ltltxt-ckgt -gt bdquoltxt‟ cksize ltdwNameDWORDgt

ltdwSampleLengthDWORDgt

ltdwPurposeDWORDgt

ltwCountryWORDgt

ltwLanguageWORDgt

ltwDialectWORDgt

ltwCodePageWORDgt

ltdataBYTEgt+

ltfile-ckgt -gt bdquofile‟ cksize ltdwNameDWORDgt ltdwMedTypeDWORDgt

ltfileDataBYTEgt+

Figure 5 Binary File Grammar for RIFF WAVEFORM PCM File Format

5 Recursive Descent Parsers for Binary File Grammars

A recursive descent parser is a top-down parser built from a set of recursive (or non recursive)

procedures where each procedure implements one of the production rules of the grammar A

predictive parser is a recursive descent parser that does not require backtracking Predictive

parsing is possible only for the class of LL(k) grammars which are the context-free grammars

for which there exists some positive integer k that allows a recursive descent parser to decide

which production to use by examining only the next k tokens of input A recursive descent parser

for the binary file format grammars defined in this paper does not require a separate lexer to

identify the data of a file The data types are predicted by the grammar rules

The semantic rules of an L-attributed grammar can be included in the recursive descent parser to

check for context-sensitive aspects of the grammar such as chunk size or image dimensions and

use these values in parsing the chunk data or images The pointers to file addresses that occur in

the file formats are handled by pushing the current address in the file onto a pushdown stack

When the parser exhausts the procedures implementing production rules corresponding to the

pointed to location the topmost address on the pushdown stack is popped and the procedure

corresponding to the nonterminal at that position is executed

Recursive descent parsers with a pushdown stack can be used with binary file format grammars

to create file format recognizers A recognizer (syntax checker or validator) is a parser that reads

a file and generates an error if the file does not conform to the syntax specified by the grammar

11

6 A Parser Generator for Recognizing Binary File Formats

What we want is a parser generator whose input is a binary file attribute grammar for a particular

binary file format and whose generated output is the Java source code of a recursive descent

parser (with a pushdown stack if needed) for the class of binary file formats specified by the

grammar Such a parser generator does not yet exist but there is a widely used parser generator

for attribute grammars for string-based languages

61 ANTLR

ANTLR (ANother Tool for Language Recognition) is a parser generator that uses LL() parsing

[Parr amp Quong 1995 Parr 2007] ANTLR takes as input an attribute grammar that specifies a

language and generates as output source code for a recognizer for that language ANTLR

supports generating code in a number of the programming languages including C Java

JavaScript and Python

ANTLR also generates a lexer from lexical rules in the grammar A lexer is a program that

converts a sequence of characters into tokens A lexer is not needed to parse binary file formats

because the data types predicted by the parser are the tokens of the grammar Furthermore the

capability to recognize LL() grammars is not needed because the binary file format grammars

specified so far do not require lookahead of k symbols

62 Using ANTLR to Generate a Parser for a Binary File Format

It is possible to use ANTLR to test our binary file format grammars in recognizing the chunk-

based and directory-based binary file formats This is accomplished by writing a lexer that treats

each input byte in a file as a character token Then functions are created for each data type in the

binary file grammar that convert the appropriate number of character tokens to binary data types

eg int 16 int32 etc This is effective if someone clumsy and has allowed us to test our

attribute grammars for binary file formats as well as to demonstrate the feasibility of creating a

parser generator for binary file grammars

621 An ANTLR Grammar for an ILBM Chunk-based File Format

Figure 6 shows a specification for the ILBM file format that is input to ANTLR The output of

ANTLR is a lexer and parser for ILBM files

12

grammar iblm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in lexerheader is placed at the top of the lexer file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Grammar Rules used to create the Parser

Start symbol ilbm

iblm returns [boolean result]

FORM ckSize ILBM

Print out ILBM chunksize

Systemoutprintln(ID ILBM Size + $ckSizevalue)

Semantic predicate that ensures that at least on BMHD is found

propertyChunks $propertyChunksfoundBMHD == true

dataChunks

body

$result = true

The bmhd cmap and camg property chunks can occur in any order

At least one BMHD data chunk must be present

propertyChunks returns [boolean foundBMHD]

(bmhd $foundBMHD = true

| cmap

| camg)+

bmhd

BMHD ckSize

Systemoutprintln(ID BMHD Size + $ckSizevalue)

13

bitMapHeader

bitMapHeader

width = uWordValue

height = uWordValue

xPosition = wordValue

yPosition = wordValue

nPanes = uByteValue

masking = uByteValue

compression = uByteValue

reserved1 = uByteValue

transparentColor = uWordValue

xAspect = uByteValue

yAspect = uByteValue

pageWidth = wordValue

pageHeight = wordValue

Print out values in the bitmap structure

Systemoutprintln( Width + $widthvalue)

Systemoutprintln( Height + $heightvalue)

Systemoutprintln( X Position + $xPositionvalue)

Systemoutprintln( Y Position + $yPositionvalue)

Systemoutprintln( Number of Panes + $nPanesvalue)

Systemoutprintln( Masking + $maskingvalue)

Systemoutprintln( Compression + $compressionvalue)

Systemoutprintln( Reserved1 + $reserved1value)

Systemoutprintln( TransParent Color +

$transparentColorvalue)

Systemoutprintln( X Aspest + $xAspectvalue)

Systemoutprintln( Y Aspect + $yAspectvalue)

Systemoutprintln( Page Width + $pageWidthvalue)

Systemoutprintln( Page Height + $pageHeightvalue)

cmap returns [long value]

Initialize the index value i in the cmap function

init int i = 0

CMAP ckSize

Print out chunk size

$value = $ckSizevalue

Systemoutprintln(ID CMAP Size + $ckSizevalue)

(colorRegister

Print out color values

Systemoutprintln( Color[ + i++ + ] +

$colorRegisterredValue +

+ $colorRegistergreenValue +

+ $colorRegisterblueValue +

)

)+

colorRegister returns [int redValue int greenValue int blueValue]

14

red = uByteValue green = uByteValue blue = uByteValue

$redValue = $redvalue

$greenValue = $greenvalue

$blueValue = $bluevalue

camg

CAMG ckSize longValue

Systemoutprintln(ID CAMG Size + $ckSizevalue)

Systemoutprintln( View Modes 0x +

LongtoHexString($longValuevalue))

The crng and ccrt data chunks

dataChunks

crng

| ccrt

crng

CRNG ckSize

Systemoutprintln(ID CRNG Size + $ckSizevalue)

cRange

cRange

pad1 = wordValue $pad1value == 0

rate = wordValue

active = wordValue

low = uByteValue

high = uByteValue

ccrt

CCRT ckSize

Systemoutprintln(ID CCRT Size + $ckSizevalue)

cycleInfo

cycleInfo

direction = wordValue

start = uByteValue

end = uByteValue

seconds = longValue

microseconds = longValue

pad = wordValue $padvalue == 0

body

BODY ckSize

Systemoutprintln(ID BODY Size + $ckSizevalue)

$ckSizevalue gt 0 data

Systemoutprintln( Number of bytes + $datavalue)

$datavalue == $ckSizevalue

defines the data chunk of the body chunk

data returns [long value]

(b+=BYTE)+ $bsize() gt 0

$value = $bsize()

15

The cksize is the value of a long integer

ckSize returns [long value]

longValue

$value = $longValuevalue

defines the longValue data typein terms of BYTE terminals

longValue returns [long value] throws UnsupportedEncodingException

h1 = BYTE h2 = BYTE l1= BYTE l2 = BYTE

$value = (long)

( getUByteValue($h1) ltlt 24 |

getUByteValue($h2) ltlt 16 |

getUByteValue($l1) ltlt 8 |

getUByteValue($l2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uWordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (short)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uByteValue returns [int value]

BYTE

calls the getUByteValue from a BYTE token

16

$value = (int) getUByteValue($BYTE)

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rules used to create the Lexer

FORM FORM

ILBM ILBM

BMHD BMHD

CMAP CMAP

CAMG CAMG

CRNG CRNG

CCRT CCRT

BODY BODY

BYTE

Figure 6 An ANTLR Grammar for the ILBM File Format

622 The Lexer and Parser

When provided the file described in the previous section ANTLR generates the ilbmlexer and

ilbmparser Java programs The Java application shown in Figure 6 uses these programs to

validate the ILBM file BLKIFF It produces the output shown in Figure 7

package orggtrigrammarschunk

import javaioIOException

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRStringStream

import organtlrruntimeCharStream

import organtlrruntimeCommonTokenStream

import organtlrruntimeRecognitionException

17

import organtlrruntimeTokenStream

import gtriorggrammarschunkiblmLexer

import gtriorggrammarschunkiblmParser

public class Test

static String filePath = CUserssl170PicturesIFF-ilbmBLKIFF

public static void main(String[] args) throws RecognitionException IOException

ANTLRFileStream stream = new ANTLRFileStream(filePath ISO-8859-1)

iblmLexer lexer = new iblmLexer(stream)

TokenStream tokenStream = new CommonTokenStream(lexer)

iblmParser parser = new iblmParser(tokenStream)

Systemoutprintln(FILE + filePath)

Systemoutprintln(Successfully parsed = + parseriblm())

Figure 7 A Java Application for Parsing an ILBM File

FILE CUserssl170PicturesIFF-ilbmBLKIFF

ID ILBM Size 2556

ID BMHD Size 20

Width 200

Height 144

X Position 0

Y Position 0

Number of Panes 6

Masking0

Compression1

Reserved10

TransParent Color0

X Aspest3

Y Aspect3

Page Width200

Page Height144

ID CMAP Size 48

Color[0] 0 0 0

Color[1] 0 128 128

Color[2] 0 255 255

Color[3] 255 0 255

Color[4] 0 255 0

Color[5] 255 0 0

Color[6] 0 0 255

Color[7] 255 255 0

Color[8] 128 128 128

Color[9] 192 192 192

Color[10] 128 0 0

Color[11] 0 128 0

Color[12] 128 128 0

Color[13] 0 0 128

Color[14] 128 0 128

18

Color[15] 255 255 255

ID CAMG Size 4

View Modes 0x800

ID BODY Size 2448

Number of bytes 2448

Successfully parsed = true

Figure 8 Output of the ILBM Parser Applied to the file BLKIFF

63 An ANTLR Grammar for the RIFF Waveform PCM File Format

Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR

The output of ANTLR is a lexer and parser for WAVEFOR PCM files

grammar waveform_pcm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text contained in lexerheader is placed at the top of the lexer

file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Default ByteOrder to Little Endian

String byteOrder = II

Rules used to create the Parser

start symbol waveform_pcm

waveform_pcm returns [boolean result]

19

ckId = fourCC $ckIdvalueequals(RIFF)

ckSize = dWordValue

formType = fourCC $formTypevalueequals(WAVE)

formatSubChunk

dataSubChunk

$result = true

fourCC returns [String value]

BYTE BYTE BYTE BYTE $value = $text

formatSubChunk

subChunkId = fourCC $subChunkIdvalueequals(fmt )

ckSize = dWordValue

commonFields

pcmSpecificField

commonFields

formatTag = wordValue Format category

channels = wordValue Number of channels

samplesPerSec = dWordValue Sampling rate

avgBytesPerSec = dWordValue For buffer estimation

blockAlign = wordValue Data block size

pcmSpecificField

bitsPerSample = wordValue Sample size

dataSubChunk

subChunkId = fourCC $subChunkIdvalueequals(data)

ckSize = dWordValue

waveData[$ckSizevalue]

waveData [long size]

(size gt= 0=gt BYTE size--)+

defines the long data type in terms of BYTE terminals

dWordValue returns [long value] throws UnsupportedEncodingException

b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE

if (byteOrderequals(II))

$value = (long)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 |

getUByteValue($b3) ltlt 16 |

getUByteValue($b4) ltlt 24 )

else

$value = (long)

( getUByteValue($b1) ltlt 24 |

getUByteValue($b2) ltlt 16 |

20

getUByteValue($b3) ltlt 8 |

getUByteValue($b4) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

if (byteOrderequals(II))

$value = (int)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 )

else

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rule used to create the Lexer

BYTE

Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format

7 Related Research

Researchers have developed a number of data description languages to support validation of file

formats EAST (Enhanced ADA Subset) is a data description language developed by the

Consultative Committee for Space Data Systems [2010] The Data Entity Dictionary

Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic

information The EAST description is used to interpret and provide access to information in

binary and text files

DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to

manipulate Java jar files and ELF object files The PADSC data description language can be

used to define the file formats of text and binary files [Fisher amp Gruber 2005] The PADSC

compiler compiles the description into tools that can be used to recognize manipulate and

transform the data into other formats

21

The Data Format Description Language (DFDL) is being developed by a Working Group of the

Open Grid Forum [OGF 2011] The data description language is based on a subset of W3C XML

Schema using ltxsappinfogt annotations to carry the extra information necessary to describe non-

XML physical representations Version 1 of the language specification was published in 2011

and a parser is being implemented

Each of the data description languages described above can be used to define data types and file

structures of binary files However they are based on file structure declarations such as those of

C or Java The binary file format grammar described in this paper most closely resembles DFDL

However the binary file grammar described in this paper is the only data description language

based on formal grammars that is used for creating recognizers for file formats

JHOVE (JSTORHarvard Object Validation Environment) is an extensible system designed to

provide automated and efficient identification and validation of the formats of digital files

JHOVE is a format-specific digital object validation API written in Java JHOVE supports

validation of the following formats AIFF ASCII GIF HTML JPEG JPEG 2000 PDF TIFF

UTF-8 WAVE and XML JHOVE2 is second generation validation environment [California

digital Library 2011]

The research reported in this paper is similar in intent to that of the JHOVE projectsmdashvalidation

of binary file formats As a matter of fact binary file grammars and parsers have been

constructed for the chunk-based file formats AIFF JPEG and WAVE as well as the directory-

based format TIFF However the research reported herein differs from that of the JHOVE

project in that what is sought is a technology for generating validators for binary file formats

from grammars specifying the binary file format

8 Conclusion

The research question addressed by his research is whether it is possible to extend the context-

free grammars used to specify the syntax of programming languages to the specification of

binary file formats and to use these grammars with parsers for validating the file formats of

binary files Amajor family of binary file formats chunk-based was described Then extensions

to the concepts of context-free grammars and attribute grammars were described that enable the

specification of binary file formats Examples of attribute array grammars for two chunk-based

binary file format were then presented Recursive descent parsers for this classes of grammars

was then described Experience in using ANTLR a parser generator for LL() string grammars

in generating parsers for recognizing the formats of binary files was discussed Finally related

research in data description languages for binary file formats was described and the JHOVE

project in creating Java-based tools for validating file formats was described

22

It is concluded that it is possible to extend context-free grammars to the specification of chunk-

based binary file formats Furthermore these grammars can be used with recursive descent

parsers for validating the file formats of chunk-based binary files It remains to be determined

whether these attribute grammars based on binary file (array) grammars are adequate to define

other families of binary file formats

To be a practical technology for generating recognizers (validators) for binary file formats a

parser generator is needed that creates from a binary file grammar a recursive descent parser

(with pushdown stack if needed) ANTLR which accepts LL() grammars as input and

generates a lexer as well as a parser is too heavyweight a parser generator for this task and does

not have the requisite data types built in

A capability to validate binary file formats assumes that the file format of a file is already

known In this research we use a file type identifier based on the UNIX file command and a

magic file that we have created [Underwood 2009] We have samples of the chunk-based file

formats discussed in this report including those referenced in Appendix a We also have file

signature tests (magic tests) for all these file types

We would like to have a parser generator (also called a compilerndashcompiler) that would take as

input an attribute grammar for a binary file format and generate a parser or interpreter that

recognizes the file format or that interprets the file format and displays or plays the contents The

closest parser generator that we have found to having these capabilities is ANTLR

(wwwantlrorg ) ANTLR will generate a top-down recursive descent parser from a context-free

attribute grammar However it works with strings and generates a separate scanner

Appendix A lists the names of about 75 file formats that belong to the family of ―chunk-based

file formats We have specifications for each of file formats and samples of files with these

formats

These file formats have a similar structure We have developed binary file grammars for two of

these binary file formats We should be able to do so for all of them We used these grammars

with the extensions to ANTLR to generate top-down recursive descent parsers for the file

formats defined with these grammars

The significance of success in this research task is that if binary file formats can be specified

with binary file grammars then only one parsing generator is needed to generate the parsers for

verifying that a file conforms to a particular format Similarly the same parsing generator can be

used for conversion of legacy file formats to current or standard formats Finally the same parser

generator can be used for generating viewerplayers for most file formats This would increase

23

the likelihood of preserving and making available into the indefinite future hose digital records

encoded in binary file formats

24

References

[Aho 1968] A V Aho Indexed Grammars ndash An Extension of Context-free Grammars Journal

of the ACM Vol 15 Issue 4 Oct 1968

[Alias-wavefront 1999] Maya File Formats Maya 20 1999

httpcaadarchethzchinfomayamanual

[Amigan 2011] Amigan Software IFF FORM and Chunk Registry

httpamigan1emunetregiffhtml

[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description

May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description

[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats

Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21

[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files

Version 13 Apple Computer Inc January 1989

wwwaesnashvilleorgPDFsAIFF20Specspdf

[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-

webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt

[Apple 2005] Apple Core Audio Format Specification 2005b

httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_

specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101

[Apple 2007] QuickTime File Format Specification

httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref

acehtml

[ASF 2001] ASF format version 10 reconstruction January 2001

httpavifilesourceforgenetasf-10htm

[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January

1997 wwwdcsedacukhomemxrgfx3d3DSspec

[Autodesk 2009] Autodesk DXF Reference January 2009

httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112

25

[Back 2002] G Back (2002) A specification and scripting language for binary data Generative

Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-

77

[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V

22 1979 pp 243-256

[Baum amp Noel 2002]D Baum amp G Noel The Simms Interchange File Format 2002

httpsimtechsourceforgenettechiffhtml

[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New

Horizons Software Inc 1987 httpamigan1emunetregWORDtxt

[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995

ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt

[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010

httpondasourceforgenetnlfnlfFormathtml

[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm amp Onda file

formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml

[California Digital Library 2011] California Digital Library JHOVE2 Userlsquos Guide 2011

httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf

[Caligari 1998] Caligri Caligari trueSpace file format Document Version 6 July

2002 httpwwwcaligaricomdownloadzipscncob65zip

[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information

Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements

and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-

t81pdf

[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description

Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf

[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998

[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt

[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D

Taliesin Inc httpamigan1emunetregDR2Dtxt

26

[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG

decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt

[Dunckley et al 2007] Dunckley M Rankin S Conway E amp Giaretta D (2007) The use of

file description languages for file format identification and validation Proc PV 2007

Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical

Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-

detailsw=50089

[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format

Tech 3285 July 2001 httptechebuchdocstechtech3285pdf

Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml

[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd

May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html

[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495

wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml

[Fisher amp Gruber 2005] Fisher K amp Gruber R (2005) PADS A domain specific language for

processing ad hoc data ACM Conference on Programming language Design and

Implementation ACM Press httpwwwpadsprojorgpaperspldipdf

[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG

1995

[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal

Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September

1997 wwwgnelsondemoncoukzspecquetzalhtml

[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called

Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)

httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt

[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03

httpwwwanim8orcomresourcesan8_formattxt

[Google 2010] WebP RIFF Container 2010

httpcodegooglecomspeedwebpdocsriff_containerhtml

[Haaland] Mike Haaland Flic Files (FLI) Format description

httpwwwmartinreddynetgfxanimFLItxt

27

[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube

Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf

[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice

Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt

[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata

Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1

1995

[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)

August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21

[IBM amp Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data

Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt

wwwtactilemediacominfoMCI_Control_Infohtml

httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf

[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec

11 2008 httpduskblueorgprojtoymidimidiformatpdf

[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10

ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf

[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable

Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C

Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110

[Issar amp Ward 1993] S Issar and W Ward CMUlsquos robust spoken language understanding

system EUOSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf

[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software

November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application

Version 50) wwwjasccomsupportkbarticlespspspecasp

[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image

File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version

22 April 2002

httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf

[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994

httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt

28

[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory

vol 2 pp 127_145 1968

[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical

Systems Theory vol 5 pp 95_96 1971

[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)

North-Holland Publ Co Amsterdam 1971 pp 95-109

[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language

Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml

httpwebcswpiedu~kalcoursescompilersmodule4mysahtml

[LightWave 1996] LightWave 3D Object File Format Oct 16 1996

httpsandboxdeosglightwavehtm

[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWo2] November 9 2001

httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml

[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer

1997 httppaulbourkenetdataformatscinema4dcinema4dpdf

[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and

Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf

[Maishein] Max Maischein PIC(PRO) Format

http14791177212extrafileformatgraphicspicpicprotxt

[Martin 2007] Michael Martin SVAF format description Version 02 January 2007

httpsvafsourceforgenetsvafhtml

[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and

Data Techniques July 29 1992 Revision 1097

httpnetghostnarodrugffvendspecmicriffms_rifftxt

[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New

Multimedia Data Types and Data Techniques Revision 30 April 15 1994

[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE

httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-

usstreamhhstreamaudprop_7wz7asp

[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003

December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx

29

[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-

uslibraryms899421aspx

[Morrison 1985] Jerry Morrison ―EA IFF 85rdquo Standard for Interchange Format Files

Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt

[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17

1986 wwwfine-viewcomjplabsdocilbmtxt

[Morrison 1986b] J Morrison ―SMUS IFF Simple Musical Score Electronic Arts Inc

February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus

[MultimediaWiki 2006] Smush Animation Format August 2006

httpwikimultimediacxindexphptitle=SANChunk_Forma

[MultimeduiaWiki 2008a] Electronic Arts CMV April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_CMV

[MultimediaWiki 2008b] Electronic Arts DCT April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_DCT

[MultimediaWiki 2008c] Electronic Arts MAD April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MAD

[MultimediaWiki 2008d] Electronic Arts MPC April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MPC

[MultimediaWiki 2008e] Electronic Arts VP6 April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_VP6

[MultimediaWiki 2008f] Electronic Arts TGV August 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGV

[MultimediaWiki 2008g] Electronic Arts TGQ November 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ

[MultimediaWiki 2008h] American Laser Games MM January 2008

httpwikimultimediacxindexphptitle=American_Laser_Games_MM

[MultimediaWiki 2009a] Electronic Arts TQI February 2009

httpwikimultimediacxindexphptitle=Electronic_Arts_TQI

30

[MultimediaWiki 2009b] SANM October 2009

httpwikimultimediacxindexphptitle=SANM

[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240

httpcodegooglecompll-parser-

generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42

[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter

written in Python in the open source graphics file convertor UniConvertor

httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer

[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification

January 31 2011 wwwogforgdocumentsGFD174pdf

[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988

wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt

[Parr amp Quong 1995] Parr T J amp Quong R W (1995) ANTLR A predicated- ll(k) parser

generator Software Practice and Experience 25 7 (July) 789-810

httpwwwantlrorgarticle1055550346383antlrpdf

[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific

Languages Pragmatic

[Peters] Christoph Peters The Ultimate 3D model file format (v 20)

httpultimate3dorgU3DModelFileFormatV2htm

[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003

httpxiphorgoggdocrfc3533txt

[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20

wwweblongcomzarfblorb

httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt

[ Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011

wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml

[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)

Format Version 10 2001 wwwlibpngorgpubmngspecCredits

31

[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File

Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008

httpscopenofficeorgexcelfileformatpdf

[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF

[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments

3072 March 2001 wwwietforgrfcrfc3072txt

[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft

26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt

[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and

old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21

[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988

httpamigan1emunetregANIMtxt

[Tamir 2005] Giora Tamir C RIFF Parser June 2005

wwwcodeprojectcomKBfilesriffparseraspx

[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications

In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi

Khosrow-Pour Information Science Publishing pp 268-273 2008

wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf

[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for

File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute

September 2009

[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical

Engineering The University of Queensland 1996

httpiteeuqeduau~cristinastudentsdaviddavidhtml

[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language

understanding system ARPA Human Language Technologies Workshop Plainsboro New

Jersey httpaclldcupenneduHH94H94-1039pdf

[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars

Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)

Springer-Verlag Berlin 1974 pp 9-16

32

[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley

1995 (chapter 10) (Attribute Grammars)

[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010

httpxiphorgvorbisdocVorbis_I_spechtml

[Zappe 1994] H Zappe Oktalyzer 1994

httpwwwwotsitorglistaspsearch=oktampbutton=GO21

33

Appendix A Chunk-based Binary File Formats

This appendix references specifications for about 75 chunk-based binary file formats and

discusses some of those specifications

A1 Electronic Arts Interchange File Format (IFF)

[Amigan 2011]

Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the

specification for the Interchange File Format [Morrison 1985]

A11 Interleaved Bitmap (ILBM)

[Morrison 1986]

This file format was discussed in section 41 of this report

A12 8-bit Sampled Voice (8SVX)

The file format specification for the 9-bit sampled voice file format uses C programming

language data structures to define the format [Hayes and Morrison 1985] The following is a re-

expression of that specification as a context-free binary file grammar

Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata

cksize rarr ULONG

ckdata rarr voice8header bodyck

ckdata rarr voice8header nameck copyrightck bodyck

ckdata rarr voice8header copyrightck nameck bodyck

ckdata rarr voice8header nameck bodyck

ckdata rarr voice8header copyrightck bodyck

ckdata rarr voice8header authorck bodyck

ckdata rarr voice8header annotation bodyck

ckdata rarr voice8header atakck releaseck bodyck

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 12: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

6

A semantic grammar is just what we need to file in the values of a programming language data

structure Semantic grammars are not applicable to generic systems but are limited to specific

domains This is not a problem in specifying or parsing a binary file format since a grammar for

a binary file format is not meant to apply to all binary file formats but just to a single format or

to a family of file formats

4 Grammars for Chunk-based Binary File Formats

The attribute grammar that we present will be a mixture of synthesized semantic rules (attribute

values of the left side of a production are calculated from attribute values of symbols on the right

side) and inherited rules (attribute values of symbols on the right side of a production are

calculated from attribute values of the symbol on the left side)

The field names of the file format spec are represented as nonterminals in the grammar Constant

values in fields are terminals that appear on the right-hand side of productions Nonterminal

fieldnames have data types associated with them

41 InterLeaved BitMap (ILBM)

The author of the Interleaved Bitmap (ILBM) file format recognized that the structure of the file

format could be specified using pseudo EBNF rules and C data structure declarations [Morrison

1986] In this section the ―grammar of the ILBM file format specification is converted to a

context-free binary grammar with synthesized and inherited attributes

Figure 2 shows a display of an ILBM file hampic Figure 3 shows an attribute grammar for its

chunk-based binary file format The attribute grammar is a mixture of synthesized semantic rules

(attribute values of the left side of a production are calculated from attribute values of symbols

on the right side) and inherited rules (attribute values of symbols on the right side of a

production are calculated from attribute values of the symbol on the left side) The attribute

grammar is also an L-attributed grammar Figure 4 shows a parse tree for the file hampic

7

Figure 2 ILBM File hampic Displayed

ltILBMgt rarr ―FORM ltcksize TYPE=UINT32gt ―ILBM ltBMHDgt ltCMAPgt ltCAMGgt ltBODYgt

ILBMcksize = cksizevalue

ltBMHDgt rarr ―BMHD ltcksize TYPE=uint32gt ltBitmapHeadergt

ltBitMapheadergt rarr ltwidth type=UINT32gt

lt height type=UINT16gt

lt xposition type=INT16gt

ltyposition type=INT16gt

ltnplanes type=BYTEgt

ltmasking type=BYTEgt

ltcompression type=BYTEgt

ltreserved type=BYTEgt

lttransparentcolor type=UINT16gt

ltxaspect type=BYTEgt

ltyaspect type=BYTEgt

ltpagewidth type=INT16gt

ltpageheight type=INT16gt

ltCMAPgt -rarr―CMAP ltcksize type=UINT32gt ltcolorgt[n] n=cksizeval3

ltCAMGgt rarr ―CAMG ltcksizetype=UINT32gt ltviewmode type=INT32gt

ltBODYgt rarr ―BODY ltcksize type=UINT32gt ltdata type=BYTEgt[cksize]

ltcolorgt rarr ltred type=BYTEgt ltgreen type= BYTEgt ltblue type=BYTEgt

Figure 3 An Attribute Grammar for an ILBM Chunk-based File Format

8

Figure 4 A Parse Tree for the ILBM File hampic

9

42 RIFF Waveform PCM Audio File Format (WAVE)

The waveform PCM File format is was originally specified in an extended BNF notation [IBM amp

Microsoft 1991] The chunk size was omitted from the EBNF rules Figure 5 shows the EBNF

rules with the cksize inserted To convert this grammar into a binary file format grammar data

types are added for primitive non-terminals

ltWAVE-formgt -gt bdquoRIFF‟ cksize WAVE

ltfmt-ckgt

[ltfact-ckgt]

[ltcue-ckgt]

[ltplaylist-ckgt]

[ltassoc-data-listgt]

ltwave-datagt

ltfmt-ckgt -gt bdquofmt ‟ cksize ltcommon-fieldsgt ltwBitsPerSampleWORDgt

ltcommon-fieldsgt -gt ltwFormatTagWORD value=0x0001gt

ltwChannelsWORDgt

ltdwSamplesPerSecDWORDgt

ltdwAvgBytesPerSecDWORDgt

ltwBlockAlignWORDgt

ltwave-datagt -gt ltdata-ckgt | ltwave-listgt

ltdata-ckgt -gt data cksize ltwave-datagt

ltwave-listgt -gt LIST cksize wavl ltdata-ckgt | ltsilence-ckgt

ltsilence-ckgt -gt slnt cksize ltdwSamplesDWORDgt

ltfact-ckgt -gt factxksize ltdwFileSizeDWORDgt

ltcue-ckgt -gt cue cksize ltdwCuePointsDWORDgt ltcue-pointgt+

ltcue-pointgt -gt ltdwNameDWORDgt

ltdwPositionDWORDgt

ltfccChunkFOURCCgt

ltdwChunkStartDWORDgt

ltdwBlockStartDWORDgt

ltdwSampleOffsetDWORDgt

ltplaylist-ckgt -gt bdquoplst‟ cksize ltdwSegmentsDWORDgt ltplay-segmentgt+

ltplay-segmentgt -gt ltdwNameDWORDgt

ltdwLengthDWORDgt

ltdwLoopsDWORD

ltassoc-data-listgt-gt bdquoLIST‟ cksize adtl ltlabl-ckgtltnote-ckgtltltxt-ckgtltfile-ckgt

ltlabl-ckgt -gt bdquolabl‟ cksize ltdwNameDWORDgt ltdataZSTRgt

ltnote-ckgt -gt bdquonote‟ cksize ltdwNameDWORDgt ltdataZSTRgt

10

ltltxt-ckgt -gt bdquoltxt‟ cksize ltdwNameDWORDgt

ltdwSampleLengthDWORDgt

ltdwPurposeDWORDgt

ltwCountryWORDgt

ltwLanguageWORDgt

ltwDialectWORDgt

ltwCodePageWORDgt

ltdataBYTEgt+

ltfile-ckgt -gt bdquofile‟ cksize ltdwNameDWORDgt ltdwMedTypeDWORDgt

ltfileDataBYTEgt+

Figure 5 Binary File Grammar for RIFF WAVEFORM PCM File Format

5 Recursive Descent Parsers for Binary File Grammars

A recursive descent parser is a top-down parser built from a set of recursive (or non recursive)

procedures where each procedure implements one of the production rules of the grammar A

predictive parser is a recursive descent parser that does not require backtracking Predictive

parsing is possible only for the class of LL(k) grammars which are the context-free grammars

for which there exists some positive integer k that allows a recursive descent parser to decide

which production to use by examining only the next k tokens of input A recursive descent parser

for the binary file format grammars defined in this paper does not require a separate lexer to

identify the data of a file The data types are predicted by the grammar rules

The semantic rules of an L-attributed grammar can be included in the recursive descent parser to

check for context-sensitive aspects of the grammar such as chunk size or image dimensions and

use these values in parsing the chunk data or images The pointers to file addresses that occur in

the file formats are handled by pushing the current address in the file onto a pushdown stack

When the parser exhausts the procedures implementing production rules corresponding to the

pointed to location the topmost address on the pushdown stack is popped and the procedure

corresponding to the nonterminal at that position is executed

Recursive descent parsers with a pushdown stack can be used with binary file format grammars

to create file format recognizers A recognizer (syntax checker or validator) is a parser that reads

a file and generates an error if the file does not conform to the syntax specified by the grammar

11

6 A Parser Generator for Recognizing Binary File Formats

What we want is a parser generator whose input is a binary file attribute grammar for a particular

binary file format and whose generated output is the Java source code of a recursive descent

parser (with a pushdown stack if needed) for the class of binary file formats specified by the

grammar Such a parser generator does not yet exist but there is a widely used parser generator

for attribute grammars for string-based languages

61 ANTLR

ANTLR (ANother Tool for Language Recognition) is a parser generator that uses LL() parsing

[Parr amp Quong 1995 Parr 2007] ANTLR takes as input an attribute grammar that specifies a

language and generates as output source code for a recognizer for that language ANTLR

supports generating code in a number of the programming languages including C Java

JavaScript and Python

ANTLR also generates a lexer from lexical rules in the grammar A lexer is a program that

converts a sequence of characters into tokens A lexer is not needed to parse binary file formats

because the data types predicted by the parser are the tokens of the grammar Furthermore the

capability to recognize LL() grammars is not needed because the binary file format grammars

specified so far do not require lookahead of k symbols

62 Using ANTLR to Generate a Parser for a Binary File Format

It is possible to use ANTLR to test our binary file format grammars in recognizing the chunk-

based and directory-based binary file formats This is accomplished by writing a lexer that treats

each input byte in a file as a character token Then functions are created for each data type in the

binary file grammar that convert the appropriate number of character tokens to binary data types

eg int 16 int32 etc This is effective if someone clumsy and has allowed us to test our

attribute grammars for binary file formats as well as to demonstrate the feasibility of creating a

parser generator for binary file grammars

621 An ANTLR Grammar for an ILBM Chunk-based File Format

Figure 6 shows a specification for the ILBM file format that is input to ANTLR The output of

ANTLR is a lexer and parser for ILBM files

12

grammar iblm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in lexerheader is placed at the top of the lexer file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Grammar Rules used to create the Parser

Start symbol ilbm

iblm returns [boolean result]

FORM ckSize ILBM

Print out ILBM chunksize

Systemoutprintln(ID ILBM Size + $ckSizevalue)

Semantic predicate that ensures that at least on BMHD is found

propertyChunks $propertyChunksfoundBMHD == true

dataChunks

body

$result = true

The bmhd cmap and camg property chunks can occur in any order

At least one BMHD data chunk must be present

propertyChunks returns [boolean foundBMHD]

(bmhd $foundBMHD = true

| cmap

| camg)+

bmhd

BMHD ckSize

Systemoutprintln(ID BMHD Size + $ckSizevalue)

13

bitMapHeader

bitMapHeader

width = uWordValue

height = uWordValue

xPosition = wordValue

yPosition = wordValue

nPanes = uByteValue

masking = uByteValue

compression = uByteValue

reserved1 = uByteValue

transparentColor = uWordValue

xAspect = uByteValue

yAspect = uByteValue

pageWidth = wordValue

pageHeight = wordValue

Print out values in the bitmap structure

Systemoutprintln( Width + $widthvalue)

Systemoutprintln( Height + $heightvalue)

Systemoutprintln( X Position + $xPositionvalue)

Systemoutprintln( Y Position + $yPositionvalue)

Systemoutprintln( Number of Panes + $nPanesvalue)

Systemoutprintln( Masking + $maskingvalue)

Systemoutprintln( Compression + $compressionvalue)

Systemoutprintln( Reserved1 + $reserved1value)

Systemoutprintln( TransParent Color +

$transparentColorvalue)

Systemoutprintln( X Aspest + $xAspectvalue)

Systemoutprintln( Y Aspect + $yAspectvalue)

Systemoutprintln( Page Width + $pageWidthvalue)

Systemoutprintln( Page Height + $pageHeightvalue)

cmap returns [long value]

Initialize the index value i in the cmap function

init int i = 0

CMAP ckSize

Print out chunk size

$value = $ckSizevalue

Systemoutprintln(ID CMAP Size + $ckSizevalue)

(colorRegister

Print out color values

Systemoutprintln( Color[ + i++ + ] +

$colorRegisterredValue +

+ $colorRegistergreenValue +

+ $colorRegisterblueValue +

)

)+

colorRegister returns [int redValue int greenValue int blueValue]

14

red = uByteValue green = uByteValue blue = uByteValue

$redValue = $redvalue

$greenValue = $greenvalue

$blueValue = $bluevalue

camg

CAMG ckSize longValue

Systemoutprintln(ID CAMG Size + $ckSizevalue)

Systemoutprintln( View Modes 0x +

LongtoHexString($longValuevalue))

The crng and ccrt data chunks

dataChunks

crng

| ccrt

crng

CRNG ckSize

Systemoutprintln(ID CRNG Size + $ckSizevalue)

cRange

cRange

pad1 = wordValue $pad1value == 0

rate = wordValue

active = wordValue

low = uByteValue

high = uByteValue

ccrt

CCRT ckSize

Systemoutprintln(ID CCRT Size + $ckSizevalue)

cycleInfo

cycleInfo

direction = wordValue

start = uByteValue

end = uByteValue

seconds = longValue

microseconds = longValue

pad = wordValue $padvalue == 0

body

BODY ckSize

Systemoutprintln(ID BODY Size + $ckSizevalue)

$ckSizevalue gt 0 data

Systemoutprintln( Number of bytes + $datavalue)

$datavalue == $ckSizevalue

defines the data chunk of the body chunk

data returns [long value]

(b+=BYTE)+ $bsize() gt 0

$value = $bsize()

15

The cksize is the value of a long integer

ckSize returns [long value]

longValue

$value = $longValuevalue

defines the longValue data typein terms of BYTE terminals

longValue returns [long value] throws UnsupportedEncodingException

h1 = BYTE h2 = BYTE l1= BYTE l2 = BYTE

$value = (long)

( getUByteValue($h1) ltlt 24 |

getUByteValue($h2) ltlt 16 |

getUByteValue($l1) ltlt 8 |

getUByteValue($l2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uWordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (short)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uByteValue returns [int value]

BYTE

calls the getUByteValue from a BYTE token

16

$value = (int) getUByteValue($BYTE)

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rules used to create the Lexer

FORM FORM

ILBM ILBM

BMHD BMHD

CMAP CMAP

CAMG CAMG

CRNG CRNG

CCRT CCRT

BODY BODY

BYTE

Figure 6 An ANTLR Grammar for the ILBM File Format

622 The Lexer and Parser

When provided the file described in the previous section ANTLR generates the ilbmlexer and

ilbmparser Java programs The Java application shown in Figure 6 uses these programs to

validate the ILBM file BLKIFF It produces the output shown in Figure 7

package orggtrigrammarschunk

import javaioIOException

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRStringStream

import organtlrruntimeCharStream

import organtlrruntimeCommonTokenStream

import organtlrruntimeRecognitionException

17

import organtlrruntimeTokenStream

import gtriorggrammarschunkiblmLexer

import gtriorggrammarschunkiblmParser

public class Test

static String filePath = CUserssl170PicturesIFF-ilbmBLKIFF

public static void main(String[] args) throws RecognitionException IOException

ANTLRFileStream stream = new ANTLRFileStream(filePath ISO-8859-1)

iblmLexer lexer = new iblmLexer(stream)

TokenStream tokenStream = new CommonTokenStream(lexer)

iblmParser parser = new iblmParser(tokenStream)

Systemoutprintln(FILE + filePath)

Systemoutprintln(Successfully parsed = + parseriblm())

Figure 7 A Java Application for Parsing an ILBM File

FILE CUserssl170PicturesIFF-ilbmBLKIFF

ID ILBM Size 2556

ID BMHD Size 20

Width 200

Height 144

X Position 0

Y Position 0

Number of Panes 6

Masking0

Compression1

Reserved10

TransParent Color0

X Aspest3

Y Aspect3

Page Width200

Page Height144

ID CMAP Size 48

Color[0] 0 0 0

Color[1] 0 128 128

Color[2] 0 255 255

Color[3] 255 0 255

Color[4] 0 255 0

Color[5] 255 0 0

Color[6] 0 0 255

Color[7] 255 255 0

Color[8] 128 128 128

Color[9] 192 192 192

Color[10] 128 0 0

Color[11] 0 128 0

Color[12] 128 128 0

Color[13] 0 0 128

Color[14] 128 0 128

18

Color[15] 255 255 255

ID CAMG Size 4

View Modes 0x800

ID BODY Size 2448

Number of bytes 2448

Successfully parsed = true

Figure 8 Output of the ILBM Parser Applied to the file BLKIFF

63 An ANTLR Grammar for the RIFF Waveform PCM File Format

Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR

The output of ANTLR is a lexer and parser for WAVEFOR PCM files

grammar waveform_pcm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text contained in lexerheader is placed at the top of the lexer

file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Default ByteOrder to Little Endian

String byteOrder = II

Rules used to create the Parser

start symbol waveform_pcm

waveform_pcm returns [boolean result]

19

ckId = fourCC $ckIdvalueequals(RIFF)

ckSize = dWordValue

formType = fourCC $formTypevalueequals(WAVE)

formatSubChunk

dataSubChunk

$result = true

fourCC returns [String value]

BYTE BYTE BYTE BYTE $value = $text

formatSubChunk

subChunkId = fourCC $subChunkIdvalueequals(fmt )

ckSize = dWordValue

commonFields

pcmSpecificField

commonFields

formatTag = wordValue Format category

channels = wordValue Number of channels

samplesPerSec = dWordValue Sampling rate

avgBytesPerSec = dWordValue For buffer estimation

blockAlign = wordValue Data block size

pcmSpecificField

bitsPerSample = wordValue Sample size

dataSubChunk

subChunkId = fourCC $subChunkIdvalueequals(data)

ckSize = dWordValue

waveData[$ckSizevalue]

waveData [long size]

(size gt= 0=gt BYTE size--)+

defines the long data type in terms of BYTE terminals

dWordValue returns [long value] throws UnsupportedEncodingException

b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE

if (byteOrderequals(II))

$value = (long)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 |

getUByteValue($b3) ltlt 16 |

getUByteValue($b4) ltlt 24 )

else

$value = (long)

( getUByteValue($b1) ltlt 24 |

getUByteValue($b2) ltlt 16 |

20

getUByteValue($b3) ltlt 8 |

getUByteValue($b4) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

if (byteOrderequals(II))

$value = (int)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 )

else

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rule used to create the Lexer

BYTE

Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format

7 Related Research

Researchers have developed a number of data description languages to support validation of file

formats EAST (Enhanced ADA Subset) is a data description language developed by the

Consultative Committee for Space Data Systems [2010] The Data Entity Dictionary

Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic

information The EAST description is used to interpret and provide access to information in

binary and text files

DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to

manipulate Java jar files and ELF object files The PADSC data description language can be

used to define the file formats of text and binary files [Fisher amp Gruber 2005] The PADSC

compiler compiles the description into tools that can be used to recognize manipulate and

transform the data into other formats

21

The Data Format Description Language (DFDL) is being developed by a Working Group of the

Open Grid Forum [OGF 2011] The data description language is based on a subset of W3C XML

Schema using ltxsappinfogt annotations to carry the extra information necessary to describe non-

XML physical representations Version 1 of the language specification was published in 2011

and a parser is being implemented

Each of the data description languages described above can be used to define data types and file

structures of binary files However they are based on file structure declarations such as those of

C or Java The binary file format grammar described in this paper most closely resembles DFDL

However the binary file grammar described in this paper is the only data description language

based on formal grammars that is used for creating recognizers for file formats

JHOVE (JSTORHarvard Object Validation Environment) is an extensible system designed to

provide automated and efficient identification and validation of the formats of digital files

JHOVE is a format-specific digital object validation API written in Java JHOVE supports

validation of the following formats AIFF ASCII GIF HTML JPEG JPEG 2000 PDF TIFF

UTF-8 WAVE and XML JHOVE2 is second generation validation environment [California

digital Library 2011]

The research reported in this paper is similar in intent to that of the JHOVE projectsmdashvalidation

of binary file formats As a matter of fact binary file grammars and parsers have been

constructed for the chunk-based file formats AIFF JPEG and WAVE as well as the directory-

based format TIFF However the research reported herein differs from that of the JHOVE

project in that what is sought is a technology for generating validators for binary file formats

from grammars specifying the binary file format

8 Conclusion

The research question addressed by his research is whether it is possible to extend the context-

free grammars used to specify the syntax of programming languages to the specification of

binary file formats and to use these grammars with parsers for validating the file formats of

binary files Amajor family of binary file formats chunk-based was described Then extensions

to the concepts of context-free grammars and attribute grammars were described that enable the

specification of binary file formats Examples of attribute array grammars for two chunk-based

binary file format were then presented Recursive descent parsers for this classes of grammars

was then described Experience in using ANTLR a parser generator for LL() string grammars

in generating parsers for recognizing the formats of binary files was discussed Finally related

research in data description languages for binary file formats was described and the JHOVE

project in creating Java-based tools for validating file formats was described

22

It is concluded that it is possible to extend context-free grammars to the specification of chunk-

based binary file formats Furthermore these grammars can be used with recursive descent

parsers for validating the file formats of chunk-based binary files It remains to be determined

whether these attribute grammars based on binary file (array) grammars are adequate to define

other families of binary file formats

To be a practical technology for generating recognizers (validators) for binary file formats a

parser generator is needed that creates from a binary file grammar a recursive descent parser

(with pushdown stack if needed) ANTLR which accepts LL() grammars as input and

generates a lexer as well as a parser is too heavyweight a parser generator for this task and does

not have the requisite data types built in

A capability to validate binary file formats assumes that the file format of a file is already

known In this research we use a file type identifier based on the UNIX file command and a

magic file that we have created [Underwood 2009] We have samples of the chunk-based file

formats discussed in this report including those referenced in Appendix a We also have file

signature tests (magic tests) for all these file types

We would like to have a parser generator (also called a compilerndashcompiler) that would take as

input an attribute grammar for a binary file format and generate a parser or interpreter that

recognizes the file format or that interprets the file format and displays or plays the contents The

closest parser generator that we have found to having these capabilities is ANTLR

(wwwantlrorg ) ANTLR will generate a top-down recursive descent parser from a context-free

attribute grammar However it works with strings and generates a separate scanner

Appendix A lists the names of about 75 file formats that belong to the family of ―chunk-based

file formats We have specifications for each of file formats and samples of files with these

formats

These file formats have a similar structure We have developed binary file grammars for two of

these binary file formats We should be able to do so for all of them We used these grammars

with the extensions to ANTLR to generate top-down recursive descent parsers for the file

formats defined with these grammars

The significance of success in this research task is that if binary file formats can be specified

with binary file grammars then only one parsing generator is needed to generate the parsers for

verifying that a file conforms to a particular format Similarly the same parsing generator can be

used for conversion of legacy file formats to current or standard formats Finally the same parser

generator can be used for generating viewerplayers for most file formats This would increase

23

the likelihood of preserving and making available into the indefinite future hose digital records

encoded in binary file formats

24

References

[Aho 1968] A V Aho Indexed Grammars ndash An Extension of Context-free Grammars Journal

of the ACM Vol 15 Issue 4 Oct 1968

[Alias-wavefront 1999] Maya File Formats Maya 20 1999

httpcaadarchethzchinfomayamanual

[Amigan 2011] Amigan Software IFF FORM and Chunk Registry

httpamigan1emunetregiffhtml

[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description

May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description

[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats

Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21

[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files

Version 13 Apple Computer Inc January 1989

wwwaesnashvilleorgPDFsAIFF20Specspdf

[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-

webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt

[Apple 2005] Apple Core Audio Format Specification 2005b

httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_

specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101

[Apple 2007] QuickTime File Format Specification

httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref

acehtml

[ASF 2001] ASF format version 10 reconstruction January 2001

httpavifilesourceforgenetasf-10htm

[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January

1997 wwwdcsedacukhomemxrgfx3d3DSspec

[Autodesk 2009] Autodesk DXF Reference January 2009

httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112

25

[Back 2002] G Back (2002) A specification and scripting language for binary data Generative

Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-

77

[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V

22 1979 pp 243-256

[Baum amp Noel 2002]D Baum amp G Noel The Simms Interchange File Format 2002

httpsimtechsourceforgenettechiffhtml

[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New

Horizons Software Inc 1987 httpamigan1emunetregWORDtxt

[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995

ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt

[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010

httpondasourceforgenetnlfnlfFormathtml

[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm amp Onda file

formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml

[California Digital Library 2011] California Digital Library JHOVE2 Userlsquos Guide 2011

httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf

[Caligari 1998] Caligri Caligari trueSpace file format Document Version 6 July

2002 httpwwwcaligaricomdownloadzipscncob65zip

[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information

Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements

and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-

t81pdf

[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description

Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf

[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998

[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt

[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D

Taliesin Inc httpamigan1emunetregDR2Dtxt

26

[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG

decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt

[Dunckley et al 2007] Dunckley M Rankin S Conway E amp Giaretta D (2007) The use of

file description languages for file format identification and validation Proc PV 2007

Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical

Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-

detailsw=50089

[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format

Tech 3285 July 2001 httptechebuchdocstechtech3285pdf

Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml

[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd

May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html

[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495

wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml

[Fisher amp Gruber 2005] Fisher K amp Gruber R (2005) PADS A domain specific language for

processing ad hoc data ACM Conference on Programming language Design and

Implementation ACM Press httpwwwpadsprojorgpaperspldipdf

[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG

1995

[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal

Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September

1997 wwwgnelsondemoncoukzspecquetzalhtml

[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called

Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)

httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt

[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03

httpwwwanim8orcomresourcesan8_formattxt

[Google 2010] WebP RIFF Container 2010

httpcodegooglecomspeedwebpdocsriff_containerhtml

[Haaland] Mike Haaland Flic Files (FLI) Format description

httpwwwmartinreddynetgfxanimFLItxt

27

[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube

Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf

[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice

Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt

[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata

Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1

1995

[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)

August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21

[IBM amp Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data

Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt

wwwtactilemediacominfoMCI_Control_Infohtml

httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf

[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec

11 2008 httpduskblueorgprojtoymidimidiformatpdf

[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10

ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf

[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable

Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C

Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110

[Issar amp Ward 1993] S Issar and W Ward CMUlsquos robust spoken language understanding

system EUOSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf

[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software

November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application

Version 50) wwwjasccomsupportkbarticlespspspecasp

[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image

File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version

22 April 2002

httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf

[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994

httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt

28

[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory

vol 2 pp 127_145 1968

[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical

Systems Theory vol 5 pp 95_96 1971

[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)

North-Holland Publ Co Amsterdam 1971 pp 95-109

[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language

Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml

httpwebcswpiedu~kalcoursescompilersmodule4mysahtml

[LightWave 1996] LightWave 3D Object File Format Oct 16 1996

httpsandboxdeosglightwavehtm

[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWo2] November 9 2001

httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml

[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer

1997 httppaulbourkenetdataformatscinema4dcinema4dpdf

[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and

Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf

[Maishein] Max Maischein PIC(PRO) Format

http14791177212extrafileformatgraphicspicpicprotxt

[Martin 2007] Michael Martin SVAF format description Version 02 January 2007

httpsvafsourceforgenetsvafhtml

[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and

Data Techniques July 29 1992 Revision 1097

httpnetghostnarodrugffvendspecmicriffms_rifftxt

[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New

Multimedia Data Types and Data Techniques Revision 30 April 15 1994

[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE

httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-

usstreamhhstreamaudprop_7wz7asp

[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003

December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx

29

[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-

uslibraryms899421aspx

[Morrison 1985] Jerry Morrison ―EA IFF 85rdquo Standard for Interchange Format Files

Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt

[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17

1986 wwwfine-viewcomjplabsdocilbmtxt

[Morrison 1986b] J Morrison ―SMUS IFF Simple Musical Score Electronic Arts Inc

February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus

[MultimediaWiki 2006] Smush Animation Format August 2006

httpwikimultimediacxindexphptitle=SANChunk_Forma

[MultimeduiaWiki 2008a] Electronic Arts CMV April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_CMV

[MultimediaWiki 2008b] Electronic Arts DCT April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_DCT

[MultimediaWiki 2008c] Electronic Arts MAD April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MAD

[MultimediaWiki 2008d] Electronic Arts MPC April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MPC

[MultimediaWiki 2008e] Electronic Arts VP6 April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_VP6

[MultimediaWiki 2008f] Electronic Arts TGV August 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGV

[MultimediaWiki 2008g] Electronic Arts TGQ November 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ

[MultimediaWiki 2008h] American Laser Games MM January 2008

httpwikimultimediacxindexphptitle=American_Laser_Games_MM

[MultimediaWiki 2009a] Electronic Arts TQI February 2009

httpwikimultimediacxindexphptitle=Electronic_Arts_TQI

30

[MultimediaWiki 2009b] SANM October 2009

httpwikimultimediacxindexphptitle=SANM

[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240

httpcodegooglecompll-parser-

generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42

[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter

written in Python in the open source graphics file convertor UniConvertor

httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer

[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification

January 31 2011 wwwogforgdocumentsGFD174pdf

[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988

wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt

[Parr amp Quong 1995] Parr T J amp Quong R W (1995) ANTLR A predicated- ll(k) parser

generator Software Practice and Experience 25 7 (July) 789-810

httpwwwantlrorgarticle1055550346383antlrpdf

[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific

Languages Pragmatic

[Peters] Christoph Peters The Ultimate 3D model file format (v 20)

httpultimate3dorgU3DModelFileFormatV2htm

[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003

httpxiphorgoggdocrfc3533txt

[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20

wwweblongcomzarfblorb

httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt

[ Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011

wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml

[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)

Format Version 10 2001 wwwlibpngorgpubmngspecCredits

31

[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File

Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008

httpscopenofficeorgexcelfileformatpdf

[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF

[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments

3072 March 2001 wwwietforgrfcrfc3072txt

[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft

26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt

[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and

old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21

[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988

httpamigan1emunetregANIMtxt

[Tamir 2005] Giora Tamir C RIFF Parser June 2005

wwwcodeprojectcomKBfilesriffparseraspx

[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications

In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi

Khosrow-Pour Information Science Publishing pp 268-273 2008

wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf

[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for

File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute

September 2009

[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical

Engineering The University of Queensland 1996

httpiteeuqeduau~cristinastudentsdaviddavidhtml

[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language

understanding system ARPA Human Language Technologies Workshop Plainsboro New

Jersey httpaclldcupenneduHH94H94-1039pdf

[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars

Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)

Springer-Verlag Berlin 1974 pp 9-16

32

[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley

1995 (chapter 10) (Attribute Grammars)

[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010

httpxiphorgvorbisdocVorbis_I_spechtml

[Zappe 1994] H Zappe Oktalyzer 1994

httpwwwwotsitorglistaspsearch=oktampbutton=GO21

33

Appendix A Chunk-based Binary File Formats

This appendix references specifications for about 75 chunk-based binary file formats and

discusses some of those specifications

A1 Electronic Arts Interchange File Format (IFF)

[Amigan 2011]

Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the

specification for the Interchange File Format [Morrison 1985]

A11 Interleaved Bitmap (ILBM)

[Morrison 1986]

This file format was discussed in section 41 of this report

A12 8-bit Sampled Voice (8SVX)

The file format specification for the 9-bit sampled voice file format uses C programming

language data structures to define the format [Hayes and Morrison 1985] The following is a re-

expression of that specification as a context-free binary file grammar

Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata

cksize rarr ULONG

ckdata rarr voice8header bodyck

ckdata rarr voice8header nameck copyrightck bodyck

ckdata rarr voice8header copyrightck nameck bodyck

ckdata rarr voice8header nameck bodyck

ckdata rarr voice8header copyrightck bodyck

ckdata rarr voice8header authorck bodyck

ckdata rarr voice8header annotation bodyck

ckdata rarr voice8header atakck releaseck bodyck

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 13: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

7

Figure 2 ILBM File hampic Displayed

ltILBMgt rarr ―FORM ltcksize TYPE=UINT32gt ―ILBM ltBMHDgt ltCMAPgt ltCAMGgt ltBODYgt

ILBMcksize = cksizevalue

ltBMHDgt rarr ―BMHD ltcksize TYPE=uint32gt ltBitmapHeadergt

ltBitMapheadergt rarr ltwidth type=UINT32gt

lt height type=UINT16gt

lt xposition type=INT16gt

ltyposition type=INT16gt

ltnplanes type=BYTEgt

ltmasking type=BYTEgt

ltcompression type=BYTEgt

ltreserved type=BYTEgt

lttransparentcolor type=UINT16gt

ltxaspect type=BYTEgt

ltyaspect type=BYTEgt

ltpagewidth type=INT16gt

ltpageheight type=INT16gt

ltCMAPgt -rarr―CMAP ltcksize type=UINT32gt ltcolorgt[n] n=cksizeval3

ltCAMGgt rarr ―CAMG ltcksizetype=UINT32gt ltviewmode type=INT32gt

ltBODYgt rarr ―BODY ltcksize type=UINT32gt ltdata type=BYTEgt[cksize]

ltcolorgt rarr ltred type=BYTEgt ltgreen type= BYTEgt ltblue type=BYTEgt

Figure 3 An Attribute Grammar for an ILBM Chunk-based File Format

8

Figure 4 A Parse Tree for the ILBM File hampic

9

42 RIFF Waveform PCM Audio File Format (WAVE)

The waveform PCM File format is was originally specified in an extended BNF notation [IBM amp

Microsoft 1991] The chunk size was omitted from the EBNF rules Figure 5 shows the EBNF

rules with the cksize inserted To convert this grammar into a binary file format grammar data

types are added for primitive non-terminals

ltWAVE-formgt -gt bdquoRIFF‟ cksize WAVE

ltfmt-ckgt

[ltfact-ckgt]

[ltcue-ckgt]

[ltplaylist-ckgt]

[ltassoc-data-listgt]

ltwave-datagt

ltfmt-ckgt -gt bdquofmt ‟ cksize ltcommon-fieldsgt ltwBitsPerSampleWORDgt

ltcommon-fieldsgt -gt ltwFormatTagWORD value=0x0001gt

ltwChannelsWORDgt

ltdwSamplesPerSecDWORDgt

ltdwAvgBytesPerSecDWORDgt

ltwBlockAlignWORDgt

ltwave-datagt -gt ltdata-ckgt | ltwave-listgt

ltdata-ckgt -gt data cksize ltwave-datagt

ltwave-listgt -gt LIST cksize wavl ltdata-ckgt | ltsilence-ckgt

ltsilence-ckgt -gt slnt cksize ltdwSamplesDWORDgt

ltfact-ckgt -gt factxksize ltdwFileSizeDWORDgt

ltcue-ckgt -gt cue cksize ltdwCuePointsDWORDgt ltcue-pointgt+

ltcue-pointgt -gt ltdwNameDWORDgt

ltdwPositionDWORDgt

ltfccChunkFOURCCgt

ltdwChunkStartDWORDgt

ltdwBlockStartDWORDgt

ltdwSampleOffsetDWORDgt

ltplaylist-ckgt -gt bdquoplst‟ cksize ltdwSegmentsDWORDgt ltplay-segmentgt+

ltplay-segmentgt -gt ltdwNameDWORDgt

ltdwLengthDWORDgt

ltdwLoopsDWORD

ltassoc-data-listgt-gt bdquoLIST‟ cksize adtl ltlabl-ckgtltnote-ckgtltltxt-ckgtltfile-ckgt

ltlabl-ckgt -gt bdquolabl‟ cksize ltdwNameDWORDgt ltdataZSTRgt

ltnote-ckgt -gt bdquonote‟ cksize ltdwNameDWORDgt ltdataZSTRgt

10

ltltxt-ckgt -gt bdquoltxt‟ cksize ltdwNameDWORDgt

ltdwSampleLengthDWORDgt

ltdwPurposeDWORDgt

ltwCountryWORDgt

ltwLanguageWORDgt

ltwDialectWORDgt

ltwCodePageWORDgt

ltdataBYTEgt+

ltfile-ckgt -gt bdquofile‟ cksize ltdwNameDWORDgt ltdwMedTypeDWORDgt

ltfileDataBYTEgt+

Figure 5 Binary File Grammar for RIFF WAVEFORM PCM File Format

5 Recursive Descent Parsers for Binary File Grammars

A recursive descent parser is a top-down parser built from a set of recursive (or non recursive)

procedures where each procedure implements one of the production rules of the grammar A

predictive parser is a recursive descent parser that does not require backtracking Predictive

parsing is possible only for the class of LL(k) grammars which are the context-free grammars

for which there exists some positive integer k that allows a recursive descent parser to decide

which production to use by examining only the next k tokens of input A recursive descent parser

for the binary file format grammars defined in this paper does not require a separate lexer to

identify the data of a file The data types are predicted by the grammar rules

The semantic rules of an L-attributed grammar can be included in the recursive descent parser to

check for context-sensitive aspects of the grammar such as chunk size or image dimensions and

use these values in parsing the chunk data or images The pointers to file addresses that occur in

the file formats are handled by pushing the current address in the file onto a pushdown stack

When the parser exhausts the procedures implementing production rules corresponding to the

pointed to location the topmost address on the pushdown stack is popped and the procedure

corresponding to the nonterminal at that position is executed

Recursive descent parsers with a pushdown stack can be used with binary file format grammars

to create file format recognizers A recognizer (syntax checker or validator) is a parser that reads

a file and generates an error if the file does not conform to the syntax specified by the grammar

11

6 A Parser Generator for Recognizing Binary File Formats

What we want is a parser generator whose input is a binary file attribute grammar for a particular

binary file format and whose generated output is the Java source code of a recursive descent

parser (with a pushdown stack if needed) for the class of binary file formats specified by the

grammar Such a parser generator does not yet exist but there is a widely used parser generator

for attribute grammars for string-based languages

61 ANTLR

ANTLR (ANother Tool for Language Recognition) is a parser generator that uses LL() parsing

[Parr amp Quong 1995 Parr 2007] ANTLR takes as input an attribute grammar that specifies a

language and generates as output source code for a recognizer for that language ANTLR

supports generating code in a number of the programming languages including C Java

JavaScript and Python

ANTLR also generates a lexer from lexical rules in the grammar A lexer is a program that

converts a sequence of characters into tokens A lexer is not needed to parse binary file formats

because the data types predicted by the parser are the tokens of the grammar Furthermore the

capability to recognize LL() grammars is not needed because the binary file format grammars

specified so far do not require lookahead of k symbols

62 Using ANTLR to Generate a Parser for a Binary File Format

It is possible to use ANTLR to test our binary file format grammars in recognizing the chunk-

based and directory-based binary file formats This is accomplished by writing a lexer that treats

each input byte in a file as a character token Then functions are created for each data type in the

binary file grammar that convert the appropriate number of character tokens to binary data types

eg int 16 int32 etc This is effective if someone clumsy and has allowed us to test our

attribute grammars for binary file formats as well as to demonstrate the feasibility of creating a

parser generator for binary file grammars

621 An ANTLR Grammar for an ILBM Chunk-based File Format

Figure 6 shows a specification for the ILBM file format that is input to ANTLR The output of

ANTLR is a lexer and parser for ILBM files

12

grammar iblm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in lexerheader is placed at the top of the lexer file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Grammar Rules used to create the Parser

Start symbol ilbm

iblm returns [boolean result]

FORM ckSize ILBM

Print out ILBM chunksize

Systemoutprintln(ID ILBM Size + $ckSizevalue)

Semantic predicate that ensures that at least on BMHD is found

propertyChunks $propertyChunksfoundBMHD == true

dataChunks

body

$result = true

The bmhd cmap and camg property chunks can occur in any order

At least one BMHD data chunk must be present

propertyChunks returns [boolean foundBMHD]

(bmhd $foundBMHD = true

| cmap

| camg)+

bmhd

BMHD ckSize

Systemoutprintln(ID BMHD Size + $ckSizevalue)

13

bitMapHeader

bitMapHeader

width = uWordValue

height = uWordValue

xPosition = wordValue

yPosition = wordValue

nPanes = uByteValue

masking = uByteValue

compression = uByteValue

reserved1 = uByteValue

transparentColor = uWordValue

xAspect = uByteValue

yAspect = uByteValue

pageWidth = wordValue

pageHeight = wordValue

Print out values in the bitmap structure

Systemoutprintln( Width + $widthvalue)

Systemoutprintln( Height + $heightvalue)

Systemoutprintln( X Position + $xPositionvalue)

Systemoutprintln( Y Position + $yPositionvalue)

Systemoutprintln( Number of Panes + $nPanesvalue)

Systemoutprintln( Masking + $maskingvalue)

Systemoutprintln( Compression + $compressionvalue)

Systemoutprintln( Reserved1 + $reserved1value)

Systemoutprintln( TransParent Color +

$transparentColorvalue)

Systemoutprintln( X Aspest + $xAspectvalue)

Systemoutprintln( Y Aspect + $yAspectvalue)

Systemoutprintln( Page Width + $pageWidthvalue)

Systemoutprintln( Page Height + $pageHeightvalue)

cmap returns [long value]

Initialize the index value i in the cmap function

init int i = 0

CMAP ckSize

Print out chunk size

$value = $ckSizevalue

Systemoutprintln(ID CMAP Size + $ckSizevalue)

(colorRegister

Print out color values

Systemoutprintln( Color[ + i++ + ] +

$colorRegisterredValue +

+ $colorRegistergreenValue +

+ $colorRegisterblueValue +

)

)+

colorRegister returns [int redValue int greenValue int blueValue]

14

red = uByteValue green = uByteValue blue = uByteValue

$redValue = $redvalue

$greenValue = $greenvalue

$blueValue = $bluevalue

camg

CAMG ckSize longValue

Systemoutprintln(ID CAMG Size + $ckSizevalue)

Systemoutprintln( View Modes 0x +

LongtoHexString($longValuevalue))

The crng and ccrt data chunks

dataChunks

crng

| ccrt

crng

CRNG ckSize

Systemoutprintln(ID CRNG Size + $ckSizevalue)

cRange

cRange

pad1 = wordValue $pad1value == 0

rate = wordValue

active = wordValue

low = uByteValue

high = uByteValue

ccrt

CCRT ckSize

Systemoutprintln(ID CCRT Size + $ckSizevalue)

cycleInfo

cycleInfo

direction = wordValue

start = uByteValue

end = uByteValue

seconds = longValue

microseconds = longValue

pad = wordValue $padvalue == 0

body

BODY ckSize

Systemoutprintln(ID BODY Size + $ckSizevalue)

$ckSizevalue gt 0 data

Systemoutprintln( Number of bytes + $datavalue)

$datavalue == $ckSizevalue

defines the data chunk of the body chunk

data returns [long value]

(b+=BYTE)+ $bsize() gt 0

$value = $bsize()

15

The cksize is the value of a long integer

ckSize returns [long value]

longValue

$value = $longValuevalue

defines the longValue data typein terms of BYTE terminals

longValue returns [long value] throws UnsupportedEncodingException

h1 = BYTE h2 = BYTE l1= BYTE l2 = BYTE

$value = (long)

( getUByteValue($h1) ltlt 24 |

getUByteValue($h2) ltlt 16 |

getUByteValue($l1) ltlt 8 |

getUByteValue($l2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uWordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (short)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uByteValue returns [int value]

BYTE

calls the getUByteValue from a BYTE token

16

$value = (int) getUByteValue($BYTE)

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rules used to create the Lexer

FORM FORM

ILBM ILBM

BMHD BMHD

CMAP CMAP

CAMG CAMG

CRNG CRNG

CCRT CCRT

BODY BODY

BYTE

Figure 6 An ANTLR Grammar for the ILBM File Format

622 The Lexer and Parser

When provided the file described in the previous section ANTLR generates the ilbmlexer and

ilbmparser Java programs The Java application shown in Figure 6 uses these programs to

validate the ILBM file BLKIFF It produces the output shown in Figure 7

package orggtrigrammarschunk

import javaioIOException

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRStringStream

import organtlrruntimeCharStream

import organtlrruntimeCommonTokenStream

import organtlrruntimeRecognitionException

17

import organtlrruntimeTokenStream

import gtriorggrammarschunkiblmLexer

import gtriorggrammarschunkiblmParser

public class Test

static String filePath = CUserssl170PicturesIFF-ilbmBLKIFF

public static void main(String[] args) throws RecognitionException IOException

ANTLRFileStream stream = new ANTLRFileStream(filePath ISO-8859-1)

iblmLexer lexer = new iblmLexer(stream)

TokenStream tokenStream = new CommonTokenStream(lexer)

iblmParser parser = new iblmParser(tokenStream)

Systemoutprintln(FILE + filePath)

Systemoutprintln(Successfully parsed = + parseriblm())

Figure 7 A Java Application for Parsing an ILBM File

FILE CUserssl170PicturesIFF-ilbmBLKIFF

ID ILBM Size 2556

ID BMHD Size 20

Width 200

Height 144

X Position 0

Y Position 0

Number of Panes 6

Masking0

Compression1

Reserved10

TransParent Color0

X Aspest3

Y Aspect3

Page Width200

Page Height144

ID CMAP Size 48

Color[0] 0 0 0

Color[1] 0 128 128

Color[2] 0 255 255

Color[3] 255 0 255

Color[4] 0 255 0

Color[5] 255 0 0

Color[6] 0 0 255

Color[7] 255 255 0

Color[8] 128 128 128

Color[9] 192 192 192

Color[10] 128 0 0

Color[11] 0 128 0

Color[12] 128 128 0

Color[13] 0 0 128

Color[14] 128 0 128

18

Color[15] 255 255 255

ID CAMG Size 4

View Modes 0x800

ID BODY Size 2448

Number of bytes 2448

Successfully parsed = true

Figure 8 Output of the ILBM Parser Applied to the file BLKIFF

63 An ANTLR Grammar for the RIFF Waveform PCM File Format

Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR

The output of ANTLR is a lexer and parser for WAVEFOR PCM files

grammar waveform_pcm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text contained in lexerheader is placed at the top of the lexer

file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Default ByteOrder to Little Endian

String byteOrder = II

Rules used to create the Parser

start symbol waveform_pcm

waveform_pcm returns [boolean result]

19

ckId = fourCC $ckIdvalueequals(RIFF)

ckSize = dWordValue

formType = fourCC $formTypevalueequals(WAVE)

formatSubChunk

dataSubChunk

$result = true

fourCC returns [String value]

BYTE BYTE BYTE BYTE $value = $text

formatSubChunk

subChunkId = fourCC $subChunkIdvalueequals(fmt )

ckSize = dWordValue

commonFields

pcmSpecificField

commonFields

formatTag = wordValue Format category

channels = wordValue Number of channels

samplesPerSec = dWordValue Sampling rate

avgBytesPerSec = dWordValue For buffer estimation

blockAlign = wordValue Data block size

pcmSpecificField

bitsPerSample = wordValue Sample size

dataSubChunk

subChunkId = fourCC $subChunkIdvalueequals(data)

ckSize = dWordValue

waveData[$ckSizevalue]

waveData [long size]

(size gt= 0=gt BYTE size--)+

defines the long data type in terms of BYTE terminals

dWordValue returns [long value] throws UnsupportedEncodingException

b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE

if (byteOrderequals(II))

$value = (long)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 |

getUByteValue($b3) ltlt 16 |

getUByteValue($b4) ltlt 24 )

else

$value = (long)

( getUByteValue($b1) ltlt 24 |

getUByteValue($b2) ltlt 16 |

20

getUByteValue($b3) ltlt 8 |

getUByteValue($b4) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

if (byteOrderequals(II))

$value = (int)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 )

else

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rule used to create the Lexer

BYTE

Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format

7 Related Research

Researchers have developed a number of data description languages to support validation of file

formats EAST (Enhanced ADA Subset) is a data description language developed by the

Consultative Committee for Space Data Systems [2010] The Data Entity Dictionary

Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic

information The EAST description is used to interpret and provide access to information in

binary and text files

DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to

manipulate Java jar files and ELF object files The PADSC data description language can be

used to define the file formats of text and binary files [Fisher amp Gruber 2005] The PADSC

compiler compiles the description into tools that can be used to recognize manipulate and

transform the data into other formats

21

The Data Format Description Language (DFDL) is being developed by a Working Group of the

Open Grid Forum [OGF 2011] The data description language is based on a subset of W3C XML

Schema using ltxsappinfogt annotations to carry the extra information necessary to describe non-

XML physical representations Version 1 of the language specification was published in 2011

and a parser is being implemented

Each of the data description languages described above can be used to define data types and file

structures of binary files However they are based on file structure declarations such as those of

C or Java The binary file format grammar described in this paper most closely resembles DFDL

However the binary file grammar described in this paper is the only data description language

based on formal grammars that is used for creating recognizers for file formats

JHOVE (JSTORHarvard Object Validation Environment) is an extensible system designed to

provide automated and efficient identification and validation of the formats of digital files

JHOVE is a format-specific digital object validation API written in Java JHOVE supports

validation of the following formats AIFF ASCII GIF HTML JPEG JPEG 2000 PDF TIFF

UTF-8 WAVE and XML JHOVE2 is second generation validation environment [California

digital Library 2011]

The research reported in this paper is similar in intent to that of the JHOVE projectsmdashvalidation

of binary file formats As a matter of fact binary file grammars and parsers have been

constructed for the chunk-based file formats AIFF JPEG and WAVE as well as the directory-

based format TIFF However the research reported herein differs from that of the JHOVE

project in that what is sought is a technology for generating validators for binary file formats

from grammars specifying the binary file format

8 Conclusion

The research question addressed by his research is whether it is possible to extend the context-

free grammars used to specify the syntax of programming languages to the specification of

binary file formats and to use these grammars with parsers for validating the file formats of

binary files Amajor family of binary file formats chunk-based was described Then extensions

to the concepts of context-free grammars and attribute grammars were described that enable the

specification of binary file formats Examples of attribute array grammars for two chunk-based

binary file format were then presented Recursive descent parsers for this classes of grammars

was then described Experience in using ANTLR a parser generator for LL() string grammars

in generating parsers for recognizing the formats of binary files was discussed Finally related

research in data description languages for binary file formats was described and the JHOVE

project in creating Java-based tools for validating file formats was described

22

It is concluded that it is possible to extend context-free grammars to the specification of chunk-

based binary file formats Furthermore these grammars can be used with recursive descent

parsers for validating the file formats of chunk-based binary files It remains to be determined

whether these attribute grammars based on binary file (array) grammars are adequate to define

other families of binary file formats

To be a practical technology for generating recognizers (validators) for binary file formats a

parser generator is needed that creates from a binary file grammar a recursive descent parser

(with pushdown stack if needed) ANTLR which accepts LL() grammars as input and

generates a lexer as well as a parser is too heavyweight a parser generator for this task and does

not have the requisite data types built in

A capability to validate binary file formats assumes that the file format of a file is already

known In this research we use a file type identifier based on the UNIX file command and a

magic file that we have created [Underwood 2009] We have samples of the chunk-based file

formats discussed in this report including those referenced in Appendix a We also have file

signature tests (magic tests) for all these file types

We would like to have a parser generator (also called a compilerndashcompiler) that would take as

input an attribute grammar for a binary file format and generate a parser or interpreter that

recognizes the file format or that interprets the file format and displays or plays the contents The

closest parser generator that we have found to having these capabilities is ANTLR

(wwwantlrorg ) ANTLR will generate a top-down recursive descent parser from a context-free

attribute grammar However it works with strings and generates a separate scanner

Appendix A lists the names of about 75 file formats that belong to the family of ―chunk-based

file formats We have specifications for each of file formats and samples of files with these

formats

These file formats have a similar structure We have developed binary file grammars for two of

these binary file formats We should be able to do so for all of them We used these grammars

with the extensions to ANTLR to generate top-down recursive descent parsers for the file

formats defined with these grammars

The significance of success in this research task is that if binary file formats can be specified

with binary file grammars then only one parsing generator is needed to generate the parsers for

verifying that a file conforms to a particular format Similarly the same parsing generator can be

used for conversion of legacy file formats to current or standard formats Finally the same parser

generator can be used for generating viewerplayers for most file formats This would increase

23

the likelihood of preserving and making available into the indefinite future hose digital records

encoded in binary file formats

24

References

[Aho 1968] A V Aho Indexed Grammars ndash An Extension of Context-free Grammars Journal

of the ACM Vol 15 Issue 4 Oct 1968

[Alias-wavefront 1999] Maya File Formats Maya 20 1999

httpcaadarchethzchinfomayamanual

[Amigan 2011] Amigan Software IFF FORM and Chunk Registry

httpamigan1emunetregiffhtml

[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description

May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description

[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats

Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21

[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files

Version 13 Apple Computer Inc January 1989

wwwaesnashvilleorgPDFsAIFF20Specspdf

[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-

webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt

[Apple 2005] Apple Core Audio Format Specification 2005b

httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_

specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101

[Apple 2007] QuickTime File Format Specification

httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref

acehtml

[ASF 2001] ASF format version 10 reconstruction January 2001

httpavifilesourceforgenetasf-10htm

[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January

1997 wwwdcsedacukhomemxrgfx3d3DSspec

[Autodesk 2009] Autodesk DXF Reference January 2009

httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112

25

[Back 2002] G Back (2002) A specification and scripting language for binary data Generative

Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-

77

[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V

22 1979 pp 243-256

[Baum amp Noel 2002]D Baum amp G Noel The Simms Interchange File Format 2002

httpsimtechsourceforgenettechiffhtml

[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New

Horizons Software Inc 1987 httpamigan1emunetregWORDtxt

[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995

ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt

[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010

httpondasourceforgenetnlfnlfFormathtml

[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm amp Onda file

formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml

[California Digital Library 2011] California Digital Library JHOVE2 Userlsquos Guide 2011

httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf

[Caligari 1998] Caligri Caligari trueSpace file format Document Version 6 July

2002 httpwwwcaligaricomdownloadzipscncob65zip

[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information

Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements

and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-

t81pdf

[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description

Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf

[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998

[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt

[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D

Taliesin Inc httpamigan1emunetregDR2Dtxt

26

[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG

decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt

[Dunckley et al 2007] Dunckley M Rankin S Conway E amp Giaretta D (2007) The use of

file description languages for file format identification and validation Proc PV 2007

Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical

Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-

detailsw=50089

[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format

Tech 3285 July 2001 httptechebuchdocstechtech3285pdf

Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml

[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd

May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html

[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495

wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml

[Fisher amp Gruber 2005] Fisher K amp Gruber R (2005) PADS A domain specific language for

processing ad hoc data ACM Conference on Programming language Design and

Implementation ACM Press httpwwwpadsprojorgpaperspldipdf

[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG

1995

[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal

Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September

1997 wwwgnelsondemoncoukzspecquetzalhtml

[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called

Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)

httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt

[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03

httpwwwanim8orcomresourcesan8_formattxt

[Google 2010] WebP RIFF Container 2010

httpcodegooglecomspeedwebpdocsriff_containerhtml

[Haaland] Mike Haaland Flic Files (FLI) Format description

httpwwwmartinreddynetgfxanimFLItxt

27

[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube

Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf

[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice

Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt

[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata

Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1

1995

[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)

August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21

[IBM amp Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data

Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt

wwwtactilemediacominfoMCI_Control_Infohtml

httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf

[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec

11 2008 httpduskblueorgprojtoymidimidiformatpdf

[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10

ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf

[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable

Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C

Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110

[Issar amp Ward 1993] S Issar and W Ward CMUlsquos robust spoken language understanding

system EUOSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf

[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software

November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application

Version 50) wwwjasccomsupportkbarticlespspspecasp

[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image

File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version

22 April 2002

httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf

[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994

httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt

28

[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory

vol 2 pp 127_145 1968

[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical

Systems Theory vol 5 pp 95_96 1971

[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)

North-Holland Publ Co Amsterdam 1971 pp 95-109

[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language

Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml

httpwebcswpiedu~kalcoursescompilersmodule4mysahtml

[LightWave 1996] LightWave 3D Object File Format Oct 16 1996

httpsandboxdeosglightwavehtm

[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWo2] November 9 2001

httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml

[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer

1997 httppaulbourkenetdataformatscinema4dcinema4dpdf

[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and

Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf

[Maishein] Max Maischein PIC(PRO) Format

http14791177212extrafileformatgraphicspicpicprotxt

[Martin 2007] Michael Martin SVAF format description Version 02 January 2007

httpsvafsourceforgenetsvafhtml

[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and

Data Techniques July 29 1992 Revision 1097

httpnetghostnarodrugffvendspecmicriffms_rifftxt

[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New

Multimedia Data Types and Data Techniques Revision 30 April 15 1994

[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE

httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-

usstreamhhstreamaudprop_7wz7asp

[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003

December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx

29

[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-

uslibraryms899421aspx

[Morrison 1985] Jerry Morrison ―EA IFF 85rdquo Standard for Interchange Format Files

Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt

[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17

1986 wwwfine-viewcomjplabsdocilbmtxt

[Morrison 1986b] J Morrison ―SMUS IFF Simple Musical Score Electronic Arts Inc

February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus

[MultimediaWiki 2006] Smush Animation Format August 2006

httpwikimultimediacxindexphptitle=SANChunk_Forma

[MultimeduiaWiki 2008a] Electronic Arts CMV April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_CMV

[MultimediaWiki 2008b] Electronic Arts DCT April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_DCT

[MultimediaWiki 2008c] Electronic Arts MAD April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MAD

[MultimediaWiki 2008d] Electronic Arts MPC April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MPC

[MultimediaWiki 2008e] Electronic Arts VP6 April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_VP6

[MultimediaWiki 2008f] Electronic Arts TGV August 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGV

[MultimediaWiki 2008g] Electronic Arts TGQ November 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ

[MultimediaWiki 2008h] American Laser Games MM January 2008

httpwikimultimediacxindexphptitle=American_Laser_Games_MM

[MultimediaWiki 2009a] Electronic Arts TQI February 2009

httpwikimultimediacxindexphptitle=Electronic_Arts_TQI

30

[MultimediaWiki 2009b] SANM October 2009

httpwikimultimediacxindexphptitle=SANM

[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240

httpcodegooglecompll-parser-

generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42

[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter

written in Python in the open source graphics file convertor UniConvertor

httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer

[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification

January 31 2011 wwwogforgdocumentsGFD174pdf

[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988

wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt

[Parr amp Quong 1995] Parr T J amp Quong R W (1995) ANTLR A predicated- ll(k) parser

generator Software Practice and Experience 25 7 (July) 789-810

httpwwwantlrorgarticle1055550346383antlrpdf

[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific

Languages Pragmatic

[Peters] Christoph Peters The Ultimate 3D model file format (v 20)

httpultimate3dorgU3DModelFileFormatV2htm

[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003

httpxiphorgoggdocrfc3533txt

[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20

wwweblongcomzarfblorb

httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt

[ Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011

wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml

[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)

Format Version 10 2001 wwwlibpngorgpubmngspecCredits

31

[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File

Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008

httpscopenofficeorgexcelfileformatpdf

[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF

[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments

3072 March 2001 wwwietforgrfcrfc3072txt

[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft

26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt

[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and

old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21

[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988

httpamigan1emunetregANIMtxt

[Tamir 2005] Giora Tamir C RIFF Parser June 2005

wwwcodeprojectcomKBfilesriffparseraspx

[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications

In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi

Khosrow-Pour Information Science Publishing pp 268-273 2008

wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf

[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for

File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute

September 2009

[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical

Engineering The University of Queensland 1996

httpiteeuqeduau~cristinastudentsdaviddavidhtml

[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language

understanding system ARPA Human Language Technologies Workshop Plainsboro New

Jersey httpaclldcupenneduHH94H94-1039pdf

[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars

Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)

Springer-Verlag Berlin 1974 pp 9-16

32

[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley

1995 (chapter 10) (Attribute Grammars)

[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010

httpxiphorgvorbisdocVorbis_I_spechtml

[Zappe 1994] H Zappe Oktalyzer 1994

httpwwwwotsitorglistaspsearch=oktampbutton=GO21

33

Appendix A Chunk-based Binary File Formats

This appendix references specifications for about 75 chunk-based binary file formats and

discusses some of those specifications

A1 Electronic Arts Interchange File Format (IFF)

[Amigan 2011]

Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the

specification for the Interchange File Format [Morrison 1985]

A11 Interleaved Bitmap (ILBM)

[Morrison 1986]

This file format was discussed in section 41 of this report

A12 8-bit Sampled Voice (8SVX)

The file format specification for the 9-bit sampled voice file format uses C programming

language data structures to define the format [Hayes and Morrison 1985] The following is a re-

expression of that specification as a context-free binary file grammar

Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata

cksize rarr ULONG

ckdata rarr voice8header bodyck

ckdata rarr voice8header nameck copyrightck bodyck

ckdata rarr voice8header copyrightck nameck bodyck

ckdata rarr voice8header nameck bodyck

ckdata rarr voice8header copyrightck bodyck

ckdata rarr voice8header authorck bodyck

ckdata rarr voice8header annotation bodyck

ckdata rarr voice8header atakck releaseck bodyck

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 14: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

8

Figure 4 A Parse Tree for the ILBM File hampic

9

42 RIFF Waveform PCM Audio File Format (WAVE)

The waveform PCM File format is was originally specified in an extended BNF notation [IBM amp

Microsoft 1991] The chunk size was omitted from the EBNF rules Figure 5 shows the EBNF

rules with the cksize inserted To convert this grammar into a binary file format grammar data

types are added for primitive non-terminals

ltWAVE-formgt -gt bdquoRIFF‟ cksize WAVE

ltfmt-ckgt

[ltfact-ckgt]

[ltcue-ckgt]

[ltplaylist-ckgt]

[ltassoc-data-listgt]

ltwave-datagt

ltfmt-ckgt -gt bdquofmt ‟ cksize ltcommon-fieldsgt ltwBitsPerSampleWORDgt

ltcommon-fieldsgt -gt ltwFormatTagWORD value=0x0001gt

ltwChannelsWORDgt

ltdwSamplesPerSecDWORDgt

ltdwAvgBytesPerSecDWORDgt

ltwBlockAlignWORDgt

ltwave-datagt -gt ltdata-ckgt | ltwave-listgt

ltdata-ckgt -gt data cksize ltwave-datagt

ltwave-listgt -gt LIST cksize wavl ltdata-ckgt | ltsilence-ckgt

ltsilence-ckgt -gt slnt cksize ltdwSamplesDWORDgt

ltfact-ckgt -gt factxksize ltdwFileSizeDWORDgt

ltcue-ckgt -gt cue cksize ltdwCuePointsDWORDgt ltcue-pointgt+

ltcue-pointgt -gt ltdwNameDWORDgt

ltdwPositionDWORDgt

ltfccChunkFOURCCgt

ltdwChunkStartDWORDgt

ltdwBlockStartDWORDgt

ltdwSampleOffsetDWORDgt

ltplaylist-ckgt -gt bdquoplst‟ cksize ltdwSegmentsDWORDgt ltplay-segmentgt+

ltplay-segmentgt -gt ltdwNameDWORDgt

ltdwLengthDWORDgt

ltdwLoopsDWORD

ltassoc-data-listgt-gt bdquoLIST‟ cksize adtl ltlabl-ckgtltnote-ckgtltltxt-ckgtltfile-ckgt

ltlabl-ckgt -gt bdquolabl‟ cksize ltdwNameDWORDgt ltdataZSTRgt

ltnote-ckgt -gt bdquonote‟ cksize ltdwNameDWORDgt ltdataZSTRgt

10

ltltxt-ckgt -gt bdquoltxt‟ cksize ltdwNameDWORDgt

ltdwSampleLengthDWORDgt

ltdwPurposeDWORDgt

ltwCountryWORDgt

ltwLanguageWORDgt

ltwDialectWORDgt

ltwCodePageWORDgt

ltdataBYTEgt+

ltfile-ckgt -gt bdquofile‟ cksize ltdwNameDWORDgt ltdwMedTypeDWORDgt

ltfileDataBYTEgt+

Figure 5 Binary File Grammar for RIFF WAVEFORM PCM File Format

5 Recursive Descent Parsers for Binary File Grammars

A recursive descent parser is a top-down parser built from a set of recursive (or non recursive)

procedures where each procedure implements one of the production rules of the grammar A

predictive parser is a recursive descent parser that does not require backtracking Predictive

parsing is possible only for the class of LL(k) grammars which are the context-free grammars

for which there exists some positive integer k that allows a recursive descent parser to decide

which production to use by examining only the next k tokens of input A recursive descent parser

for the binary file format grammars defined in this paper does not require a separate lexer to

identify the data of a file The data types are predicted by the grammar rules

The semantic rules of an L-attributed grammar can be included in the recursive descent parser to

check for context-sensitive aspects of the grammar such as chunk size or image dimensions and

use these values in parsing the chunk data or images The pointers to file addresses that occur in

the file formats are handled by pushing the current address in the file onto a pushdown stack

When the parser exhausts the procedures implementing production rules corresponding to the

pointed to location the topmost address on the pushdown stack is popped and the procedure

corresponding to the nonterminal at that position is executed

Recursive descent parsers with a pushdown stack can be used with binary file format grammars

to create file format recognizers A recognizer (syntax checker or validator) is a parser that reads

a file and generates an error if the file does not conform to the syntax specified by the grammar

11

6 A Parser Generator for Recognizing Binary File Formats

What we want is a parser generator whose input is a binary file attribute grammar for a particular

binary file format and whose generated output is the Java source code of a recursive descent

parser (with a pushdown stack if needed) for the class of binary file formats specified by the

grammar Such a parser generator does not yet exist but there is a widely used parser generator

for attribute grammars for string-based languages

61 ANTLR

ANTLR (ANother Tool for Language Recognition) is a parser generator that uses LL() parsing

[Parr amp Quong 1995 Parr 2007] ANTLR takes as input an attribute grammar that specifies a

language and generates as output source code for a recognizer for that language ANTLR

supports generating code in a number of the programming languages including C Java

JavaScript and Python

ANTLR also generates a lexer from lexical rules in the grammar A lexer is a program that

converts a sequence of characters into tokens A lexer is not needed to parse binary file formats

because the data types predicted by the parser are the tokens of the grammar Furthermore the

capability to recognize LL() grammars is not needed because the binary file format grammars

specified so far do not require lookahead of k symbols

62 Using ANTLR to Generate a Parser for a Binary File Format

It is possible to use ANTLR to test our binary file format grammars in recognizing the chunk-

based and directory-based binary file formats This is accomplished by writing a lexer that treats

each input byte in a file as a character token Then functions are created for each data type in the

binary file grammar that convert the appropriate number of character tokens to binary data types

eg int 16 int32 etc This is effective if someone clumsy and has allowed us to test our

attribute grammars for binary file formats as well as to demonstrate the feasibility of creating a

parser generator for binary file grammars

621 An ANTLR Grammar for an ILBM Chunk-based File Format

Figure 6 shows a specification for the ILBM file format that is input to ANTLR The output of

ANTLR is a lexer and parser for ILBM files

12

grammar iblm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in lexerheader is placed at the top of the lexer file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Grammar Rules used to create the Parser

Start symbol ilbm

iblm returns [boolean result]

FORM ckSize ILBM

Print out ILBM chunksize

Systemoutprintln(ID ILBM Size + $ckSizevalue)

Semantic predicate that ensures that at least on BMHD is found

propertyChunks $propertyChunksfoundBMHD == true

dataChunks

body

$result = true

The bmhd cmap and camg property chunks can occur in any order

At least one BMHD data chunk must be present

propertyChunks returns [boolean foundBMHD]

(bmhd $foundBMHD = true

| cmap

| camg)+

bmhd

BMHD ckSize

Systemoutprintln(ID BMHD Size + $ckSizevalue)

13

bitMapHeader

bitMapHeader

width = uWordValue

height = uWordValue

xPosition = wordValue

yPosition = wordValue

nPanes = uByteValue

masking = uByteValue

compression = uByteValue

reserved1 = uByteValue

transparentColor = uWordValue

xAspect = uByteValue

yAspect = uByteValue

pageWidth = wordValue

pageHeight = wordValue

Print out values in the bitmap structure

Systemoutprintln( Width + $widthvalue)

Systemoutprintln( Height + $heightvalue)

Systemoutprintln( X Position + $xPositionvalue)

Systemoutprintln( Y Position + $yPositionvalue)

Systemoutprintln( Number of Panes + $nPanesvalue)

Systemoutprintln( Masking + $maskingvalue)

Systemoutprintln( Compression + $compressionvalue)

Systemoutprintln( Reserved1 + $reserved1value)

Systemoutprintln( TransParent Color +

$transparentColorvalue)

Systemoutprintln( X Aspest + $xAspectvalue)

Systemoutprintln( Y Aspect + $yAspectvalue)

Systemoutprintln( Page Width + $pageWidthvalue)

Systemoutprintln( Page Height + $pageHeightvalue)

cmap returns [long value]

Initialize the index value i in the cmap function

init int i = 0

CMAP ckSize

Print out chunk size

$value = $ckSizevalue

Systemoutprintln(ID CMAP Size + $ckSizevalue)

(colorRegister

Print out color values

Systemoutprintln( Color[ + i++ + ] +

$colorRegisterredValue +

+ $colorRegistergreenValue +

+ $colorRegisterblueValue +

)

)+

colorRegister returns [int redValue int greenValue int blueValue]

14

red = uByteValue green = uByteValue blue = uByteValue

$redValue = $redvalue

$greenValue = $greenvalue

$blueValue = $bluevalue

camg

CAMG ckSize longValue

Systemoutprintln(ID CAMG Size + $ckSizevalue)

Systemoutprintln( View Modes 0x +

LongtoHexString($longValuevalue))

The crng and ccrt data chunks

dataChunks

crng

| ccrt

crng

CRNG ckSize

Systemoutprintln(ID CRNG Size + $ckSizevalue)

cRange

cRange

pad1 = wordValue $pad1value == 0

rate = wordValue

active = wordValue

low = uByteValue

high = uByteValue

ccrt

CCRT ckSize

Systemoutprintln(ID CCRT Size + $ckSizevalue)

cycleInfo

cycleInfo

direction = wordValue

start = uByteValue

end = uByteValue

seconds = longValue

microseconds = longValue

pad = wordValue $padvalue == 0

body

BODY ckSize

Systemoutprintln(ID BODY Size + $ckSizevalue)

$ckSizevalue gt 0 data

Systemoutprintln( Number of bytes + $datavalue)

$datavalue == $ckSizevalue

defines the data chunk of the body chunk

data returns [long value]

(b+=BYTE)+ $bsize() gt 0

$value = $bsize()

15

The cksize is the value of a long integer

ckSize returns [long value]

longValue

$value = $longValuevalue

defines the longValue data typein terms of BYTE terminals

longValue returns [long value] throws UnsupportedEncodingException

h1 = BYTE h2 = BYTE l1= BYTE l2 = BYTE

$value = (long)

( getUByteValue($h1) ltlt 24 |

getUByteValue($h2) ltlt 16 |

getUByteValue($l1) ltlt 8 |

getUByteValue($l2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uWordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (short)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uByteValue returns [int value]

BYTE

calls the getUByteValue from a BYTE token

16

$value = (int) getUByteValue($BYTE)

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rules used to create the Lexer

FORM FORM

ILBM ILBM

BMHD BMHD

CMAP CMAP

CAMG CAMG

CRNG CRNG

CCRT CCRT

BODY BODY

BYTE

Figure 6 An ANTLR Grammar for the ILBM File Format

622 The Lexer and Parser

When provided the file described in the previous section ANTLR generates the ilbmlexer and

ilbmparser Java programs The Java application shown in Figure 6 uses these programs to

validate the ILBM file BLKIFF It produces the output shown in Figure 7

package orggtrigrammarschunk

import javaioIOException

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRStringStream

import organtlrruntimeCharStream

import organtlrruntimeCommonTokenStream

import organtlrruntimeRecognitionException

17

import organtlrruntimeTokenStream

import gtriorggrammarschunkiblmLexer

import gtriorggrammarschunkiblmParser

public class Test

static String filePath = CUserssl170PicturesIFF-ilbmBLKIFF

public static void main(String[] args) throws RecognitionException IOException

ANTLRFileStream stream = new ANTLRFileStream(filePath ISO-8859-1)

iblmLexer lexer = new iblmLexer(stream)

TokenStream tokenStream = new CommonTokenStream(lexer)

iblmParser parser = new iblmParser(tokenStream)

Systemoutprintln(FILE + filePath)

Systemoutprintln(Successfully parsed = + parseriblm())

Figure 7 A Java Application for Parsing an ILBM File

FILE CUserssl170PicturesIFF-ilbmBLKIFF

ID ILBM Size 2556

ID BMHD Size 20

Width 200

Height 144

X Position 0

Y Position 0

Number of Panes 6

Masking0

Compression1

Reserved10

TransParent Color0

X Aspest3

Y Aspect3

Page Width200

Page Height144

ID CMAP Size 48

Color[0] 0 0 0

Color[1] 0 128 128

Color[2] 0 255 255

Color[3] 255 0 255

Color[4] 0 255 0

Color[5] 255 0 0

Color[6] 0 0 255

Color[7] 255 255 0

Color[8] 128 128 128

Color[9] 192 192 192

Color[10] 128 0 0

Color[11] 0 128 0

Color[12] 128 128 0

Color[13] 0 0 128

Color[14] 128 0 128

18

Color[15] 255 255 255

ID CAMG Size 4

View Modes 0x800

ID BODY Size 2448

Number of bytes 2448

Successfully parsed = true

Figure 8 Output of the ILBM Parser Applied to the file BLKIFF

63 An ANTLR Grammar for the RIFF Waveform PCM File Format

Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR

The output of ANTLR is a lexer and parser for WAVEFOR PCM files

grammar waveform_pcm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text contained in lexerheader is placed at the top of the lexer

file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Default ByteOrder to Little Endian

String byteOrder = II

Rules used to create the Parser

start symbol waveform_pcm

waveform_pcm returns [boolean result]

19

ckId = fourCC $ckIdvalueequals(RIFF)

ckSize = dWordValue

formType = fourCC $formTypevalueequals(WAVE)

formatSubChunk

dataSubChunk

$result = true

fourCC returns [String value]

BYTE BYTE BYTE BYTE $value = $text

formatSubChunk

subChunkId = fourCC $subChunkIdvalueequals(fmt )

ckSize = dWordValue

commonFields

pcmSpecificField

commonFields

formatTag = wordValue Format category

channels = wordValue Number of channels

samplesPerSec = dWordValue Sampling rate

avgBytesPerSec = dWordValue For buffer estimation

blockAlign = wordValue Data block size

pcmSpecificField

bitsPerSample = wordValue Sample size

dataSubChunk

subChunkId = fourCC $subChunkIdvalueequals(data)

ckSize = dWordValue

waveData[$ckSizevalue]

waveData [long size]

(size gt= 0=gt BYTE size--)+

defines the long data type in terms of BYTE terminals

dWordValue returns [long value] throws UnsupportedEncodingException

b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE

if (byteOrderequals(II))

$value = (long)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 |

getUByteValue($b3) ltlt 16 |

getUByteValue($b4) ltlt 24 )

else

$value = (long)

( getUByteValue($b1) ltlt 24 |

getUByteValue($b2) ltlt 16 |

20

getUByteValue($b3) ltlt 8 |

getUByteValue($b4) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

if (byteOrderequals(II))

$value = (int)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 )

else

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rule used to create the Lexer

BYTE

Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format

7 Related Research

Researchers have developed a number of data description languages to support validation of file

formats EAST (Enhanced ADA Subset) is a data description language developed by the

Consultative Committee for Space Data Systems [2010] The Data Entity Dictionary

Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic

information The EAST description is used to interpret and provide access to information in

binary and text files

DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to

manipulate Java jar files and ELF object files The PADSC data description language can be

used to define the file formats of text and binary files [Fisher amp Gruber 2005] The PADSC

compiler compiles the description into tools that can be used to recognize manipulate and

transform the data into other formats

21

The Data Format Description Language (DFDL) is being developed by a Working Group of the

Open Grid Forum [OGF 2011] The data description language is based on a subset of W3C XML

Schema using ltxsappinfogt annotations to carry the extra information necessary to describe non-

XML physical representations Version 1 of the language specification was published in 2011

and a parser is being implemented

Each of the data description languages described above can be used to define data types and file

structures of binary files However they are based on file structure declarations such as those of

C or Java The binary file format grammar described in this paper most closely resembles DFDL

However the binary file grammar described in this paper is the only data description language

based on formal grammars that is used for creating recognizers for file formats

JHOVE (JSTORHarvard Object Validation Environment) is an extensible system designed to

provide automated and efficient identification and validation of the formats of digital files

JHOVE is a format-specific digital object validation API written in Java JHOVE supports

validation of the following formats AIFF ASCII GIF HTML JPEG JPEG 2000 PDF TIFF

UTF-8 WAVE and XML JHOVE2 is second generation validation environment [California

digital Library 2011]

The research reported in this paper is similar in intent to that of the JHOVE projectsmdashvalidation

of binary file formats As a matter of fact binary file grammars and parsers have been

constructed for the chunk-based file formats AIFF JPEG and WAVE as well as the directory-

based format TIFF However the research reported herein differs from that of the JHOVE

project in that what is sought is a technology for generating validators for binary file formats

from grammars specifying the binary file format

8 Conclusion

The research question addressed by his research is whether it is possible to extend the context-

free grammars used to specify the syntax of programming languages to the specification of

binary file formats and to use these grammars with parsers for validating the file formats of

binary files Amajor family of binary file formats chunk-based was described Then extensions

to the concepts of context-free grammars and attribute grammars were described that enable the

specification of binary file formats Examples of attribute array grammars for two chunk-based

binary file format were then presented Recursive descent parsers for this classes of grammars

was then described Experience in using ANTLR a parser generator for LL() string grammars

in generating parsers for recognizing the formats of binary files was discussed Finally related

research in data description languages for binary file formats was described and the JHOVE

project in creating Java-based tools for validating file formats was described

22

It is concluded that it is possible to extend context-free grammars to the specification of chunk-

based binary file formats Furthermore these grammars can be used with recursive descent

parsers for validating the file formats of chunk-based binary files It remains to be determined

whether these attribute grammars based on binary file (array) grammars are adequate to define

other families of binary file formats

To be a practical technology for generating recognizers (validators) for binary file formats a

parser generator is needed that creates from a binary file grammar a recursive descent parser

(with pushdown stack if needed) ANTLR which accepts LL() grammars as input and

generates a lexer as well as a parser is too heavyweight a parser generator for this task and does

not have the requisite data types built in

A capability to validate binary file formats assumes that the file format of a file is already

known In this research we use a file type identifier based on the UNIX file command and a

magic file that we have created [Underwood 2009] We have samples of the chunk-based file

formats discussed in this report including those referenced in Appendix a We also have file

signature tests (magic tests) for all these file types

We would like to have a parser generator (also called a compilerndashcompiler) that would take as

input an attribute grammar for a binary file format and generate a parser or interpreter that

recognizes the file format or that interprets the file format and displays or plays the contents The

closest parser generator that we have found to having these capabilities is ANTLR

(wwwantlrorg ) ANTLR will generate a top-down recursive descent parser from a context-free

attribute grammar However it works with strings and generates a separate scanner

Appendix A lists the names of about 75 file formats that belong to the family of ―chunk-based

file formats We have specifications for each of file formats and samples of files with these

formats

These file formats have a similar structure We have developed binary file grammars for two of

these binary file formats We should be able to do so for all of them We used these grammars

with the extensions to ANTLR to generate top-down recursive descent parsers for the file

formats defined with these grammars

The significance of success in this research task is that if binary file formats can be specified

with binary file grammars then only one parsing generator is needed to generate the parsers for

verifying that a file conforms to a particular format Similarly the same parsing generator can be

used for conversion of legacy file formats to current or standard formats Finally the same parser

generator can be used for generating viewerplayers for most file formats This would increase

23

the likelihood of preserving and making available into the indefinite future hose digital records

encoded in binary file formats

24

References

[Aho 1968] A V Aho Indexed Grammars ndash An Extension of Context-free Grammars Journal

of the ACM Vol 15 Issue 4 Oct 1968

[Alias-wavefront 1999] Maya File Formats Maya 20 1999

httpcaadarchethzchinfomayamanual

[Amigan 2011] Amigan Software IFF FORM and Chunk Registry

httpamigan1emunetregiffhtml

[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description

May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description

[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats

Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21

[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files

Version 13 Apple Computer Inc January 1989

wwwaesnashvilleorgPDFsAIFF20Specspdf

[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-

webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt

[Apple 2005] Apple Core Audio Format Specification 2005b

httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_

specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101

[Apple 2007] QuickTime File Format Specification

httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref

acehtml

[ASF 2001] ASF format version 10 reconstruction January 2001

httpavifilesourceforgenetasf-10htm

[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January

1997 wwwdcsedacukhomemxrgfx3d3DSspec

[Autodesk 2009] Autodesk DXF Reference January 2009

httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112

25

[Back 2002] G Back (2002) A specification and scripting language for binary data Generative

Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-

77

[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V

22 1979 pp 243-256

[Baum amp Noel 2002]D Baum amp G Noel The Simms Interchange File Format 2002

httpsimtechsourceforgenettechiffhtml

[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New

Horizons Software Inc 1987 httpamigan1emunetregWORDtxt

[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995

ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt

[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010

httpondasourceforgenetnlfnlfFormathtml

[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm amp Onda file

formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml

[California Digital Library 2011] California Digital Library JHOVE2 Userlsquos Guide 2011

httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf

[Caligari 1998] Caligri Caligari trueSpace file format Document Version 6 July

2002 httpwwwcaligaricomdownloadzipscncob65zip

[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information

Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements

and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-

t81pdf

[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description

Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf

[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998

[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt

[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D

Taliesin Inc httpamigan1emunetregDR2Dtxt

26

[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG

decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt

[Dunckley et al 2007] Dunckley M Rankin S Conway E amp Giaretta D (2007) The use of

file description languages for file format identification and validation Proc PV 2007

Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical

Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-

detailsw=50089

[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format

Tech 3285 July 2001 httptechebuchdocstechtech3285pdf

Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml

[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd

May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html

[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495

wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml

[Fisher amp Gruber 2005] Fisher K amp Gruber R (2005) PADS A domain specific language for

processing ad hoc data ACM Conference on Programming language Design and

Implementation ACM Press httpwwwpadsprojorgpaperspldipdf

[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG

1995

[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal

Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September

1997 wwwgnelsondemoncoukzspecquetzalhtml

[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called

Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)

httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt

[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03

httpwwwanim8orcomresourcesan8_formattxt

[Google 2010] WebP RIFF Container 2010

httpcodegooglecomspeedwebpdocsriff_containerhtml

[Haaland] Mike Haaland Flic Files (FLI) Format description

httpwwwmartinreddynetgfxanimFLItxt

27

[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube

Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf

[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice

Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt

[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata

Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1

1995

[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)

August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21

[IBM amp Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data

Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt

wwwtactilemediacominfoMCI_Control_Infohtml

httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf

[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec

11 2008 httpduskblueorgprojtoymidimidiformatpdf

[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10

ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf

[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable

Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C

Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110

[Issar amp Ward 1993] S Issar and W Ward CMUlsquos robust spoken language understanding

system EUOSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf

[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software

November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application

Version 50) wwwjasccomsupportkbarticlespspspecasp

[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image

File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version

22 April 2002

httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf

[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994

httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt

28

[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory

vol 2 pp 127_145 1968

[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical

Systems Theory vol 5 pp 95_96 1971

[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)

North-Holland Publ Co Amsterdam 1971 pp 95-109

[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language

Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml

httpwebcswpiedu~kalcoursescompilersmodule4mysahtml

[LightWave 1996] LightWave 3D Object File Format Oct 16 1996

httpsandboxdeosglightwavehtm

[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWo2] November 9 2001

httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml

[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer

1997 httppaulbourkenetdataformatscinema4dcinema4dpdf

[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and

Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf

[Maishein] Max Maischein PIC(PRO) Format

http14791177212extrafileformatgraphicspicpicprotxt

[Martin 2007] Michael Martin SVAF format description Version 02 January 2007

httpsvafsourceforgenetsvafhtml

[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and

Data Techniques July 29 1992 Revision 1097

httpnetghostnarodrugffvendspecmicriffms_rifftxt

[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New

Multimedia Data Types and Data Techniques Revision 30 April 15 1994

[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE

httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-

usstreamhhstreamaudprop_7wz7asp

[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003

December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx

29

[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-

uslibraryms899421aspx

[Morrison 1985] Jerry Morrison ―EA IFF 85rdquo Standard for Interchange Format Files

Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt

[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17

1986 wwwfine-viewcomjplabsdocilbmtxt

[Morrison 1986b] J Morrison ―SMUS IFF Simple Musical Score Electronic Arts Inc

February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus

[MultimediaWiki 2006] Smush Animation Format August 2006

httpwikimultimediacxindexphptitle=SANChunk_Forma

[MultimeduiaWiki 2008a] Electronic Arts CMV April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_CMV

[MultimediaWiki 2008b] Electronic Arts DCT April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_DCT

[MultimediaWiki 2008c] Electronic Arts MAD April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MAD

[MultimediaWiki 2008d] Electronic Arts MPC April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MPC

[MultimediaWiki 2008e] Electronic Arts VP6 April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_VP6

[MultimediaWiki 2008f] Electronic Arts TGV August 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGV

[MultimediaWiki 2008g] Electronic Arts TGQ November 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ

[MultimediaWiki 2008h] American Laser Games MM January 2008

httpwikimultimediacxindexphptitle=American_Laser_Games_MM

[MultimediaWiki 2009a] Electronic Arts TQI February 2009

httpwikimultimediacxindexphptitle=Electronic_Arts_TQI

30

[MultimediaWiki 2009b] SANM October 2009

httpwikimultimediacxindexphptitle=SANM

[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240

httpcodegooglecompll-parser-

generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42

[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter

written in Python in the open source graphics file convertor UniConvertor

httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer

[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification

January 31 2011 wwwogforgdocumentsGFD174pdf

[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988

wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt

[Parr amp Quong 1995] Parr T J amp Quong R W (1995) ANTLR A predicated- ll(k) parser

generator Software Practice and Experience 25 7 (July) 789-810

httpwwwantlrorgarticle1055550346383antlrpdf

[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific

Languages Pragmatic

[Peters] Christoph Peters The Ultimate 3D model file format (v 20)

httpultimate3dorgU3DModelFileFormatV2htm

[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003

httpxiphorgoggdocrfc3533txt

[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20

wwweblongcomzarfblorb

httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt

[ Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011

wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml

[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)

Format Version 10 2001 wwwlibpngorgpubmngspecCredits

31

[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File

Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008

httpscopenofficeorgexcelfileformatpdf

[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF

[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments

3072 March 2001 wwwietforgrfcrfc3072txt

[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft

26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt

[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and

old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21

[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988

httpamigan1emunetregANIMtxt

[Tamir 2005] Giora Tamir C RIFF Parser June 2005

wwwcodeprojectcomKBfilesriffparseraspx

[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications

In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi

Khosrow-Pour Information Science Publishing pp 268-273 2008

wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf

[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for

File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute

September 2009

[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical

Engineering The University of Queensland 1996

httpiteeuqeduau~cristinastudentsdaviddavidhtml

[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language

understanding system ARPA Human Language Technologies Workshop Plainsboro New

Jersey httpaclldcupenneduHH94H94-1039pdf

[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars

Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)

Springer-Verlag Berlin 1974 pp 9-16

32

[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley

1995 (chapter 10) (Attribute Grammars)

[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010

httpxiphorgvorbisdocVorbis_I_spechtml

[Zappe 1994] H Zappe Oktalyzer 1994

httpwwwwotsitorglistaspsearch=oktampbutton=GO21

33

Appendix A Chunk-based Binary File Formats

This appendix references specifications for about 75 chunk-based binary file formats and

discusses some of those specifications

A1 Electronic Arts Interchange File Format (IFF)

[Amigan 2011]

Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the

specification for the Interchange File Format [Morrison 1985]

A11 Interleaved Bitmap (ILBM)

[Morrison 1986]

This file format was discussed in section 41 of this report

A12 8-bit Sampled Voice (8SVX)

The file format specification for the 9-bit sampled voice file format uses C programming

language data structures to define the format [Hayes and Morrison 1985] The following is a re-

expression of that specification as a context-free binary file grammar

Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata

cksize rarr ULONG

ckdata rarr voice8header bodyck

ckdata rarr voice8header nameck copyrightck bodyck

ckdata rarr voice8header copyrightck nameck bodyck

ckdata rarr voice8header nameck bodyck

ckdata rarr voice8header copyrightck bodyck

ckdata rarr voice8header authorck bodyck

ckdata rarr voice8header annotation bodyck

ckdata rarr voice8header atakck releaseck bodyck

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 15: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

9

42 RIFF Waveform PCM Audio File Format (WAVE)

The waveform PCM File format is was originally specified in an extended BNF notation [IBM amp

Microsoft 1991] The chunk size was omitted from the EBNF rules Figure 5 shows the EBNF

rules with the cksize inserted To convert this grammar into a binary file format grammar data

types are added for primitive non-terminals

ltWAVE-formgt -gt bdquoRIFF‟ cksize WAVE

ltfmt-ckgt

[ltfact-ckgt]

[ltcue-ckgt]

[ltplaylist-ckgt]

[ltassoc-data-listgt]

ltwave-datagt

ltfmt-ckgt -gt bdquofmt ‟ cksize ltcommon-fieldsgt ltwBitsPerSampleWORDgt

ltcommon-fieldsgt -gt ltwFormatTagWORD value=0x0001gt

ltwChannelsWORDgt

ltdwSamplesPerSecDWORDgt

ltdwAvgBytesPerSecDWORDgt

ltwBlockAlignWORDgt

ltwave-datagt -gt ltdata-ckgt | ltwave-listgt

ltdata-ckgt -gt data cksize ltwave-datagt

ltwave-listgt -gt LIST cksize wavl ltdata-ckgt | ltsilence-ckgt

ltsilence-ckgt -gt slnt cksize ltdwSamplesDWORDgt

ltfact-ckgt -gt factxksize ltdwFileSizeDWORDgt

ltcue-ckgt -gt cue cksize ltdwCuePointsDWORDgt ltcue-pointgt+

ltcue-pointgt -gt ltdwNameDWORDgt

ltdwPositionDWORDgt

ltfccChunkFOURCCgt

ltdwChunkStartDWORDgt

ltdwBlockStartDWORDgt

ltdwSampleOffsetDWORDgt

ltplaylist-ckgt -gt bdquoplst‟ cksize ltdwSegmentsDWORDgt ltplay-segmentgt+

ltplay-segmentgt -gt ltdwNameDWORDgt

ltdwLengthDWORDgt

ltdwLoopsDWORD

ltassoc-data-listgt-gt bdquoLIST‟ cksize adtl ltlabl-ckgtltnote-ckgtltltxt-ckgtltfile-ckgt

ltlabl-ckgt -gt bdquolabl‟ cksize ltdwNameDWORDgt ltdataZSTRgt

ltnote-ckgt -gt bdquonote‟ cksize ltdwNameDWORDgt ltdataZSTRgt

10

ltltxt-ckgt -gt bdquoltxt‟ cksize ltdwNameDWORDgt

ltdwSampleLengthDWORDgt

ltdwPurposeDWORDgt

ltwCountryWORDgt

ltwLanguageWORDgt

ltwDialectWORDgt

ltwCodePageWORDgt

ltdataBYTEgt+

ltfile-ckgt -gt bdquofile‟ cksize ltdwNameDWORDgt ltdwMedTypeDWORDgt

ltfileDataBYTEgt+

Figure 5 Binary File Grammar for RIFF WAVEFORM PCM File Format

5 Recursive Descent Parsers for Binary File Grammars

A recursive descent parser is a top-down parser built from a set of recursive (or non recursive)

procedures where each procedure implements one of the production rules of the grammar A

predictive parser is a recursive descent parser that does not require backtracking Predictive

parsing is possible only for the class of LL(k) grammars which are the context-free grammars

for which there exists some positive integer k that allows a recursive descent parser to decide

which production to use by examining only the next k tokens of input A recursive descent parser

for the binary file format grammars defined in this paper does not require a separate lexer to

identify the data of a file The data types are predicted by the grammar rules

The semantic rules of an L-attributed grammar can be included in the recursive descent parser to

check for context-sensitive aspects of the grammar such as chunk size or image dimensions and

use these values in parsing the chunk data or images The pointers to file addresses that occur in

the file formats are handled by pushing the current address in the file onto a pushdown stack

When the parser exhausts the procedures implementing production rules corresponding to the

pointed to location the topmost address on the pushdown stack is popped and the procedure

corresponding to the nonterminal at that position is executed

Recursive descent parsers with a pushdown stack can be used with binary file format grammars

to create file format recognizers A recognizer (syntax checker or validator) is a parser that reads

a file and generates an error if the file does not conform to the syntax specified by the grammar

11

6 A Parser Generator for Recognizing Binary File Formats

What we want is a parser generator whose input is a binary file attribute grammar for a particular

binary file format and whose generated output is the Java source code of a recursive descent

parser (with a pushdown stack if needed) for the class of binary file formats specified by the

grammar Such a parser generator does not yet exist but there is a widely used parser generator

for attribute grammars for string-based languages

61 ANTLR

ANTLR (ANother Tool for Language Recognition) is a parser generator that uses LL() parsing

[Parr amp Quong 1995 Parr 2007] ANTLR takes as input an attribute grammar that specifies a

language and generates as output source code for a recognizer for that language ANTLR

supports generating code in a number of the programming languages including C Java

JavaScript and Python

ANTLR also generates a lexer from lexical rules in the grammar A lexer is a program that

converts a sequence of characters into tokens A lexer is not needed to parse binary file formats

because the data types predicted by the parser are the tokens of the grammar Furthermore the

capability to recognize LL() grammars is not needed because the binary file format grammars

specified so far do not require lookahead of k symbols

62 Using ANTLR to Generate a Parser for a Binary File Format

It is possible to use ANTLR to test our binary file format grammars in recognizing the chunk-

based and directory-based binary file formats This is accomplished by writing a lexer that treats

each input byte in a file as a character token Then functions are created for each data type in the

binary file grammar that convert the appropriate number of character tokens to binary data types

eg int 16 int32 etc This is effective if someone clumsy and has allowed us to test our

attribute grammars for binary file formats as well as to demonstrate the feasibility of creating a

parser generator for binary file grammars

621 An ANTLR Grammar for an ILBM Chunk-based File Format

Figure 6 shows a specification for the ILBM file format that is input to ANTLR The output of

ANTLR is a lexer and parser for ILBM files

12

grammar iblm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in lexerheader is placed at the top of the lexer file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Grammar Rules used to create the Parser

Start symbol ilbm

iblm returns [boolean result]

FORM ckSize ILBM

Print out ILBM chunksize

Systemoutprintln(ID ILBM Size + $ckSizevalue)

Semantic predicate that ensures that at least on BMHD is found

propertyChunks $propertyChunksfoundBMHD == true

dataChunks

body

$result = true

The bmhd cmap and camg property chunks can occur in any order

At least one BMHD data chunk must be present

propertyChunks returns [boolean foundBMHD]

(bmhd $foundBMHD = true

| cmap

| camg)+

bmhd

BMHD ckSize

Systemoutprintln(ID BMHD Size + $ckSizevalue)

13

bitMapHeader

bitMapHeader

width = uWordValue

height = uWordValue

xPosition = wordValue

yPosition = wordValue

nPanes = uByteValue

masking = uByteValue

compression = uByteValue

reserved1 = uByteValue

transparentColor = uWordValue

xAspect = uByteValue

yAspect = uByteValue

pageWidth = wordValue

pageHeight = wordValue

Print out values in the bitmap structure

Systemoutprintln( Width + $widthvalue)

Systemoutprintln( Height + $heightvalue)

Systemoutprintln( X Position + $xPositionvalue)

Systemoutprintln( Y Position + $yPositionvalue)

Systemoutprintln( Number of Panes + $nPanesvalue)

Systemoutprintln( Masking + $maskingvalue)

Systemoutprintln( Compression + $compressionvalue)

Systemoutprintln( Reserved1 + $reserved1value)

Systemoutprintln( TransParent Color +

$transparentColorvalue)

Systemoutprintln( X Aspest + $xAspectvalue)

Systemoutprintln( Y Aspect + $yAspectvalue)

Systemoutprintln( Page Width + $pageWidthvalue)

Systemoutprintln( Page Height + $pageHeightvalue)

cmap returns [long value]

Initialize the index value i in the cmap function

init int i = 0

CMAP ckSize

Print out chunk size

$value = $ckSizevalue

Systemoutprintln(ID CMAP Size + $ckSizevalue)

(colorRegister

Print out color values

Systemoutprintln( Color[ + i++ + ] +

$colorRegisterredValue +

+ $colorRegistergreenValue +

+ $colorRegisterblueValue +

)

)+

colorRegister returns [int redValue int greenValue int blueValue]

14

red = uByteValue green = uByteValue blue = uByteValue

$redValue = $redvalue

$greenValue = $greenvalue

$blueValue = $bluevalue

camg

CAMG ckSize longValue

Systemoutprintln(ID CAMG Size + $ckSizevalue)

Systemoutprintln( View Modes 0x +

LongtoHexString($longValuevalue))

The crng and ccrt data chunks

dataChunks

crng

| ccrt

crng

CRNG ckSize

Systemoutprintln(ID CRNG Size + $ckSizevalue)

cRange

cRange

pad1 = wordValue $pad1value == 0

rate = wordValue

active = wordValue

low = uByteValue

high = uByteValue

ccrt

CCRT ckSize

Systemoutprintln(ID CCRT Size + $ckSizevalue)

cycleInfo

cycleInfo

direction = wordValue

start = uByteValue

end = uByteValue

seconds = longValue

microseconds = longValue

pad = wordValue $padvalue == 0

body

BODY ckSize

Systemoutprintln(ID BODY Size + $ckSizevalue)

$ckSizevalue gt 0 data

Systemoutprintln( Number of bytes + $datavalue)

$datavalue == $ckSizevalue

defines the data chunk of the body chunk

data returns [long value]

(b+=BYTE)+ $bsize() gt 0

$value = $bsize()

15

The cksize is the value of a long integer

ckSize returns [long value]

longValue

$value = $longValuevalue

defines the longValue data typein terms of BYTE terminals

longValue returns [long value] throws UnsupportedEncodingException

h1 = BYTE h2 = BYTE l1= BYTE l2 = BYTE

$value = (long)

( getUByteValue($h1) ltlt 24 |

getUByteValue($h2) ltlt 16 |

getUByteValue($l1) ltlt 8 |

getUByteValue($l2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uWordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (short)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uByteValue returns [int value]

BYTE

calls the getUByteValue from a BYTE token

16

$value = (int) getUByteValue($BYTE)

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rules used to create the Lexer

FORM FORM

ILBM ILBM

BMHD BMHD

CMAP CMAP

CAMG CAMG

CRNG CRNG

CCRT CCRT

BODY BODY

BYTE

Figure 6 An ANTLR Grammar for the ILBM File Format

622 The Lexer and Parser

When provided the file described in the previous section ANTLR generates the ilbmlexer and

ilbmparser Java programs The Java application shown in Figure 6 uses these programs to

validate the ILBM file BLKIFF It produces the output shown in Figure 7

package orggtrigrammarschunk

import javaioIOException

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRStringStream

import organtlrruntimeCharStream

import organtlrruntimeCommonTokenStream

import organtlrruntimeRecognitionException

17

import organtlrruntimeTokenStream

import gtriorggrammarschunkiblmLexer

import gtriorggrammarschunkiblmParser

public class Test

static String filePath = CUserssl170PicturesIFF-ilbmBLKIFF

public static void main(String[] args) throws RecognitionException IOException

ANTLRFileStream stream = new ANTLRFileStream(filePath ISO-8859-1)

iblmLexer lexer = new iblmLexer(stream)

TokenStream tokenStream = new CommonTokenStream(lexer)

iblmParser parser = new iblmParser(tokenStream)

Systemoutprintln(FILE + filePath)

Systemoutprintln(Successfully parsed = + parseriblm())

Figure 7 A Java Application for Parsing an ILBM File

FILE CUserssl170PicturesIFF-ilbmBLKIFF

ID ILBM Size 2556

ID BMHD Size 20

Width 200

Height 144

X Position 0

Y Position 0

Number of Panes 6

Masking0

Compression1

Reserved10

TransParent Color0

X Aspest3

Y Aspect3

Page Width200

Page Height144

ID CMAP Size 48

Color[0] 0 0 0

Color[1] 0 128 128

Color[2] 0 255 255

Color[3] 255 0 255

Color[4] 0 255 0

Color[5] 255 0 0

Color[6] 0 0 255

Color[7] 255 255 0

Color[8] 128 128 128

Color[9] 192 192 192

Color[10] 128 0 0

Color[11] 0 128 0

Color[12] 128 128 0

Color[13] 0 0 128

Color[14] 128 0 128

18

Color[15] 255 255 255

ID CAMG Size 4

View Modes 0x800

ID BODY Size 2448

Number of bytes 2448

Successfully parsed = true

Figure 8 Output of the ILBM Parser Applied to the file BLKIFF

63 An ANTLR Grammar for the RIFF Waveform PCM File Format

Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR

The output of ANTLR is a lexer and parser for WAVEFOR PCM files

grammar waveform_pcm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text contained in lexerheader is placed at the top of the lexer

file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Default ByteOrder to Little Endian

String byteOrder = II

Rules used to create the Parser

start symbol waveform_pcm

waveform_pcm returns [boolean result]

19

ckId = fourCC $ckIdvalueequals(RIFF)

ckSize = dWordValue

formType = fourCC $formTypevalueequals(WAVE)

formatSubChunk

dataSubChunk

$result = true

fourCC returns [String value]

BYTE BYTE BYTE BYTE $value = $text

formatSubChunk

subChunkId = fourCC $subChunkIdvalueequals(fmt )

ckSize = dWordValue

commonFields

pcmSpecificField

commonFields

formatTag = wordValue Format category

channels = wordValue Number of channels

samplesPerSec = dWordValue Sampling rate

avgBytesPerSec = dWordValue For buffer estimation

blockAlign = wordValue Data block size

pcmSpecificField

bitsPerSample = wordValue Sample size

dataSubChunk

subChunkId = fourCC $subChunkIdvalueequals(data)

ckSize = dWordValue

waveData[$ckSizevalue]

waveData [long size]

(size gt= 0=gt BYTE size--)+

defines the long data type in terms of BYTE terminals

dWordValue returns [long value] throws UnsupportedEncodingException

b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE

if (byteOrderequals(II))

$value = (long)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 |

getUByteValue($b3) ltlt 16 |

getUByteValue($b4) ltlt 24 )

else

$value = (long)

( getUByteValue($b1) ltlt 24 |

getUByteValue($b2) ltlt 16 |

20

getUByteValue($b3) ltlt 8 |

getUByteValue($b4) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

if (byteOrderequals(II))

$value = (int)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 )

else

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rule used to create the Lexer

BYTE

Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format

7 Related Research

Researchers have developed a number of data description languages to support validation of file

formats EAST (Enhanced ADA Subset) is a data description language developed by the

Consultative Committee for Space Data Systems [2010] The Data Entity Dictionary

Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic

information The EAST description is used to interpret and provide access to information in

binary and text files

DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to

manipulate Java jar files and ELF object files The PADSC data description language can be

used to define the file formats of text and binary files [Fisher amp Gruber 2005] The PADSC

compiler compiles the description into tools that can be used to recognize manipulate and

transform the data into other formats

21

The Data Format Description Language (DFDL) is being developed by a Working Group of the

Open Grid Forum [OGF 2011] The data description language is based on a subset of W3C XML

Schema using ltxsappinfogt annotations to carry the extra information necessary to describe non-

XML physical representations Version 1 of the language specification was published in 2011

and a parser is being implemented

Each of the data description languages described above can be used to define data types and file

structures of binary files However they are based on file structure declarations such as those of

C or Java The binary file format grammar described in this paper most closely resembles DFDL

However the binary file grammar described in this paper is the only data description language

based on formal grammars that is used for creating recognizers for file formats

JHOVE (JSTORHarvard Object Validation Environment) is an extensible system designed to

provide automated and efficient identification and validation of the formats of digital files

JHOVE is a format-specific digital object validation API written in Java JHOVE supports

validation of the following formats AIFF ASCII GIF HTML JPEG JPEG 2000 PDF TIFF

UTF-8 WAVE and XML JHOVE2 is second generation validation environment [California

digital Library 2011]

The research reported in this paper is similar in intent to that of the JHOVE projectsmdashvalidation

of binary file formats As a matter of fact binary file grammars and parsers have been

constructed for the chunk-based file formats AIFF JPEG and WAVE as well as the directory-

based format TIFF However the research reported herein differs from that of the JHOVE

project in that what is sought is a technology for generating validators for binary file formats

from grammars specifying the binary file format

8 Conclusion

The research question addressed by his research is whether it is possible to extend the context-

free grammars used to specify the syntax of programming languages to the specification of

binary file formats and to use these grammars with parsers for validating the file formats of

binary files Amajor family of binary file formats chunk-based was described Then extensions

to the concepts of context-free grammars and attribute grammars were described that enable the

specification of binary file formats Examples of attribute array grammars for two chunk-based

binary file format were then presented Recursive descent parsers for this classes of grammars

was then described Experience in using ANTLR a parser generator for LL() string grammars

in generating parsers for recognizing the formats of binary files was discussed Finally related

research in data description languages for binary file formats was described and the JHOVE

project in creating Java-based tools for validating file formats was described

22

It is concluded that it is possible to extend context-free grammars to the specification of chunk-

based binary file formats Furthermore these grammars can be used with recursive descent

parsers for validating the file formats of chunk-based binary files It remains to be determined

whether these attribute grammars based on binary file (array) grammars are adequate to define

other families of binary file formats

To be a practical technology for generating recognizers (validators) for binary file formats a

parser generator is needed that creates from a binary file grammar a recursive descent parser

(with pushdown stack if needed) ANTLR which accepts LL() grammars as input and

generates a lexer as well as a parser is too heavyweight a parser generator for this task and does

not have the requisite data types built in

A capability to validate binary file formats assumes that the file format of a file is already

known In this research we use a file type identifier based on the UNIX file command and a

magic file that we have created [Underwood 2009] We have samples of the chunk-based file

formats discussed in this report including those referenced in Appendix a We also have file

signature tests (magic tests) for all these file types

We would like to have a parser generator (also called a compilerndashcompiler) that would take as

input an attribute grammar for a binary file format and generate a parser or interpreter that

recognizes the file format or that interprets the file format and displays or plays the contents The

closest parser generator that we have found to having these capabilities is ANTLR

(wwwantlrorg ) ANTLR will generate a top-down recursive descent parser from a context-free

attribute grammar However it works with strings and generates a separate scanner

Appendix A lists the names of about 75 file formats that belong to the family of ―chunk-based

file formats We have specifications for each of file formats and samples of files with these

formats

These file formats have a similar structure We have developed binary file grammars for two of

these binary file formats We should be able to do so for all of them We used these grammars

with the extensions to ANTLR to generate top-down recursive descent parsers for the file

formats defined with these grammars

The significance of success in this research task is that if binary file formats can be specified

with binary file grammars then only one parsing generator is needed to generate the parsers for

verifying that a file conforms to a particular format Similarly the same parsing generator can be

used for conversion of legacy file formats to current or standard formats Finally the same parser

generator can be used for generating viewerplayers for most file formats This would increase

23

the likelihood of preserving and making available into the indefinite future hose digital records

encoded in binary file formats

24

References

[Aho 1968] A V Aho Indexed Grammars ndash An Extension of Context-free Grammars Journal

of the ACM Vol 15 Issue 4 Oct 1968

[Alias-wavefront 1999] Maya File Formats Maya 20 1999

httpcaadarchethzchinfomayamanual

[Amigan 2011] Amigan Software IFF FORM and Chunk Registry

httpamigan1emunetregiffhtml

[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description

May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description

[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats

Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21

[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files

Version 13 Apple Computer Inc January 1989

wwwaesnashvilleorgPDFsAIFF20Specspdf

[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-

webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt

[Apple 2005] Apple Core Audio Format Specification 2005b

httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_

specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101

[Apple 2007] QuickTime File Format Specification

httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref

acehtml

[ASF 2001] ASF format version 10 reconstruction January 2001

httpavifilesourceforgenetasf-10htm

[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January

1997 wwwdcsedacukhomemxrgfx3d3DSspec

[Autodesk 2009] Autodesk DXF Reference January 2009

httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112

25

[Back 2002] G Back (2002) A specification and scripting language for binary data Generative

Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-

77

[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V

22 1979 pp 243-256

[Baum amp Noel 2002]D Baum amp G Noel The Simms Interchange File Format 2002

httpsimtechsourceforgenettechiffhtml

[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New

Horizons Software Inc 1987 httpamigan1emunetregWORDtxt

[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995

ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt

[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010

httpondasourceforgenetnlfnlfFormathtml

[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm amp Onda file

formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml

[California Digital Library 2011] California Digital Library JHOVE2 Userlsquos Guide 2011

httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf

[Caligari 1998] Caligri Caligari trueSpace file format Document Version 6 July

2002 httpwwwcaligaricomdownloadzipscncob65zip

[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information

Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements

and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-

t81pdf

[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description

Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf

[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998

[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt

[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D

Taliesin Inc httpamigan1emunetregDR2Dtxt

26

[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG

decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt

[Dunckley et al 2007] Dunckley M Rankin S Conway E amp Giaretta D (2007) The use of

file description languages for file format identification and validation Proc PV 2007

Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical

Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-

detailsw=50089

[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format

Tech 3285 July 2001 httptechebuchdocstechtech3285pdf

Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml

[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd

May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html

[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495

wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml

[Fisher amp Gruber 2005] Fisher K amp Gruber R (2005) PADS A domain specific language for

processing ad hoc data ACM Conference on Programming language Design and

Implementation ACM Press httpwwwpadsprojorgpaperspldipdf

[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG

1995

[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal

Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September

1997 wwwgnelsondemoncoukzspecquetzalhtml

[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called

Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)

httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt

[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03

httpwwwanim8orcomresourcesan8_formattxt

[Google 2010] WebP RIFF Container 2010

httpcodegooglecomspeedwebpdocsriff_containerhtml

[Haaland] Mike Haaland Flic Files (FLI) Format description

httpwwwmartinreddynetgfxanimFLItxt

27

[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube

Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf

[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice

Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt

[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata

Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1

1995

[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)

August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21

[IBM amp Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data

Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt

wwwtactilemediacominfoMCI_Control_Infohtml

httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf

[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec

11 2008 httpduskblueorgprojtoymidimidiformatpdf

[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10

ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf

[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable

Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C

Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110

[Issar amp Ward 1993] S Issar and W Ward CMUlsquos robust spoken language understanding

system EUOSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf

[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software

November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application

Version 50) wwwjasccomsupportkbarticlespspspecasp

[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image

File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version

22 April 2002

httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf

[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994

httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt

28

[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory

vol 2 pp 127_145 1968

[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical

Systems Theory vol 5 pp 95_96 1971

[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)

North-Holland Publ Co Amsterdam 1971 pp 95-109

[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language

Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml

httpwebcswpiedu~kalcoursescompilersmodule4mysahtml

[LightWave 1996] LightWave 3D Object File Format Oct 16 1996

httpsandboxdeosglightwavehtm

[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWo2] November 9 2001

httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml

[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer

1997 httppaulbourkenetdataformatscinema4dcinema4dpdf

[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and

Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf

[Maishein] Max Maischein PIC(PRO) Format

http14791177212extrafileformatgraphicspicpicprotxt

[Martin 2007] Michael Martin SVAF format description Version 02 January 2007

httpsvafsourceforgenetsvafhtml

[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and

Data Techniques July 29 1992 Revision 1097

httpnetghostnarodrugffvendspecmicriffms_rifftxt

[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New

Multimedia Data Types and Data Techniques Revision 30 April 15 1994

[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE

httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-

usstreamhhstreamaudprop_7wz7asp

[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003

December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx

29

[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-

uslibraryms899421aspx

[Morrison 1985] Jerry Morrison ―EA IFF 85rdquo Standard for Interchange Format Files

Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt

[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17

1986 wwwfine-viewcomjplabsdocilbmtxt

[Morrison 1986b] J Morrison ―SMUS IFF Simple Musical Score Electronic Arts Inc

February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus

[MultimediaWiki 2006] Smush Animation Format August 2006

httpwikimultimediacxindexphptitle=SANChunk_Forma

[MultimeduiaWiki 2008a] Electronic Arts CMV April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_CMV

[MultimediaWiki 2008b] Electronic Arts DCT April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_DCT

[MultimediaWiki 2008c] Electronic Arts MAD April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MAD

[MultimediaWiki 2008d] Electronic Arts MPC April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MPC

[MultimediaWiki 2008e] Electronic Arts VP6 April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_VP6

[MultimediaWiki 2008f] Electronic Arts TGV August 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGV

[MultimediaWiki 2008g] Electronic Arts TGQ November 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ

[MultimediaWiki 2008h] American Laser Games MM January 2008

httpwikimultimediacxindexphptitle=American_Laser_Games_MM

[MultimediaWiki 2009a] Electronic Arts TQI February 2009

httpwikimultimediacxindexphptitle=Electronic_Arts_TQI

30

[MultimediaWiki 2009b] SANM October 2009

httpwikimultimediacxindexphptitle=SANM

[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240

httpcodegooglecompll-parser-

generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42

[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter

written in Python in the open source graphics file convertor UniConvertor

httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer

[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification

January 31 2011 wwwogforgdocumentsGFD174pdf

[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988

wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt

[Parr amp Quong 1995] Parr T J amp Quong R W (1995) ANTLR A predicated- ll(k) parser

generator Software Practice and Experience 25 7 (July) 789-810

httpwwwantlrorgarticle1055550346383antlrpdf

[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific

Languages Pragmatic

[Peters] Christoph Peters The Ultimate 3D model file format (v 20)

httpultimate3dorgU3DModelFileFormatV2htm

[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003

httpxiphorgoggdocrfc3533txt

[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20

wwweblongcomzarfblorb

httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt

[ Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011

wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml

[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)

Format Version 10 2001 wwwlibpngorgpubmngspecCredits

31

[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File

Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008

httpscopenofficeorgexcelfileformatpdf

[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF

[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments

3072 March 2001 wwwietforgrfcrfc3072txt

[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft

26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt

[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and

old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21

[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988

httpamigan1emunetregANIMtxt

[Tamir 2005] Giora Tamir C RIFF Parser June 2005

wwwcodeprojectcomKBfilesriffparseraspx

[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications

In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi

Khosrow-Pour Information Science Publishing pp 268-273 2008

wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf

[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for

File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute

September 2009

[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical

Engineering The University of Queensland 1996

httpiteeuqeduau~cristinastudentsdaviddavidhtml

[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language

understanding system ARPA Human Language Technologies Workshop Plainsboro New

Jersey httpaclldcupenneduHH94H94-1039pdf

[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars

Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)

Springer-Verlag Berlin 1974 pp 9-16

32

[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley

1995 (chapter 10) (Attribute Grammars)

[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010

httpxiphorgvorbisdocVorbis_I_spechtml

[Zappe 1994] H Zappe Oktalyzer 1994

httpwwwwotsitorglistaspsearch=oktampbutton=GO21

33

Appendix A Chunk-based Binary File Formats

This appendix references specifications for about 75 chunk-based binary file formats and

discusses some of those specifications

A1 Electronic Arts Interchange File Format (IFF)

[Amigan 2011]

Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the

specification for the Interchange File Format [Morrison 1985]

A11 Interleaved Bitmap (ILBM)

[Morrison 1986]

This file format was discussed in section 41 of this report

A12 8-bit Sampled Voice (8SVX)

The file format specification for the 9-bit sampled voice file format uses C programming

language data structures to define the format [Hayes and Morrison 1985] The following is a re-

expression of that specification as a context-free binary file grammar

Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata

cksize rarr ULONG

ckdata rarr voice8header bodyck

ckdata rarr voice8header nameck copyrightck bodyck

ckdata rarr voice8header copyrightck nameck bodyck

ckdata rarr voice8header nameck bodyck

ckdata rarr voice8header copyrightck bodyck

ckdata rarr voice8header authorck bodyck

ckdata rarr voice8header annotation bodyck

ckdata rarr voice8header atakck releaseck bodyck

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 16: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

10

ltltxt-ckgt -gt bdquoltxt‟ cksize ltdwNameDWORDgt

ltdwSampleLengthDWORDgt

ltdwPurposeDWORDgt

ltwCountryWORDgt

ltwLanguageWORDgt

ltwDialectWORDgt

ltwCodePageWORDgt

ltdataBYTEgt+

ltfile-ckgt -gt bdquofile‟ cksize ltdwNameDWORDgt ltdwMedTypeDWORDgt

ltfileDataBYTEgt+

Figure 5 Binary File Grammar for RIFF WAVEFORM PCM File Format

5 Recursive Descent Parsers for Binary File Grammars

A recursive descent parser is a top-down parser built from a set of recursive (or non recursive)

procedures where each procedure implements one of the production rules of the grammar A

predictive parser is a recursive descent parser that does not require backtracking Predictive

parsing is possible only for the class of LL(k) grammars which are the context-free grammars

for which there exists some positive integer k that allows a recursive descent parser to decide

which production to use by examining only the next k tokens of input A recursive descent parser

for the binary file format grammars defined in this paper does not require a separate lexer to

identify the data of a file The data types are predicted by the grammar rules

The semantic rules of an L-attributed grammar can be included in the recursive descent parser to

check for context-sensitive aspects of the grammar such as chunk size or image dimensions and

use these values in parsing the chunk data or images The pointers to file addresses that occur in

the file formats are handled by pushing the current address in the file onto a pushdown stack

When the parser exhausts the procedures implementing production rules corresponding to the

pointed to location the topmost address on the pushdown stack is popped and the procedure

corresponding to the nonterminal at that position is executed

Recursive descent parsers with a pushdown stack can be used with binary file format grammars

to create file format recognizers A recognizer (syntax checker or validator) is a parser that reads

a file and generates an error if the file does not conform to the syntax specified by the grammar

11

6 A Parser Generator for Recognizing Binary File Formats

What we want is a parser generator whose input is a binary file attribute grammar for a particular

binary file format and whose generated output is the Java source code of a recursive descent

parser (with a pushdown stack if needed) for the class of binary file formats specified by the

grammar Such a parser generator does not yet exist but there is a widely used parser generator

for attribute grammars for string-based languages

61 ANTLR

ANTLR (ANother Tool for Language Recognition) is a parser generator that uses LL() parsing

[Parr amp Quong 1995 Parr 2007] ANTLR takes as input an attribute grammar that specifies a

language and generates as output source code for a recognizer for that language ANTLR

supports generating code in a number of the programming languages including C Java

JavaScript and Python

ANTLR also generates a lexer from lexical rules in the grammar A lexer is a program that

converts a sequence of characters into tokens A lexer is not needed to parse binary file formats

because the data types predicted by the parser are the tokens of the grammar Furthermore the

capability to recognize LL() grammars is not needed because the binary file format grammars

specified so far do not require lookahead of k symbols

62 Using ANTLR to Generate a Parser for a Binary File Format

It is possible to use ANTLR to test our binary file format grammars in recognizing the chunk-

based and directory-based binary file formats This is accomplished by writing a lexer that treats

each input byte in a file as a character token Then functions are created for each data type in the

binary file grammar that convert the appropriate number of character tokens to binary data types

eg int 16 int32 etc This is effective if someone clumsy and has allowed us to test our

attribute grammars for binary file formats as well as to demonstrate the feasibility of creating a

parser generator for binary file grammars

621 An ANTLR Grammar for an ILBM Chunk-based File Format

Figure 6 shows a specification for the ILBM file format that is input to ANTLR The output of

ANTLR is a lexer and parser for ILBM files

12

grammar iblm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in lexerheader is placed at the top of the lexer file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Grammar Rules used to create the Parser

Start symbol ilbm

iblm returns [boolean result]

FORM ckSize ILBM

Print out ILBM chunksize

Systemoutprintln(ID ILBM Size + $ckSizevalue)

Semantic predicate that ensures that at least on BMHD is found

propertyChunks $propertyChunksfoundBMHD == true

dataChunks

body

$result = true

The bmhd cmap and camg property chunks can occur in any order

At least one BMHD data chunk must be present

propertyChunks returns [boolean foundBMHD]

(bmhd $foundBMHD = true

| cmap

| camg)+

bmhd

BMHD ckSize

Systemoutprintln(ID BMHD Size + $ckSizevalue)

13

bitMapHeader

bitMapHeader

width = uWordValue

height = uWordValue

xPosition = wordValue

yPosition = wordValue

nPanes = uByteValue

masking = uByteValue

compression = uByteValue

reserved1 = uByteValue

transparentColor = uWordValue

xAspect = uByteValue

yAspect = uByteValue

pageWidth = wordValue

pageHeight = wordValue

Print out values in the bitmap structure

Systemoutprintln( Width + $widthvalue)

Systemoutprintln( Height + $heightvalue)

Systemoutprintln( X Position + $xPositionvalue)

Systemoutprintln( Y Position + $yPositionvalue)

Systemoutprintln( Number of Panes + $nPanesvalue)

Systemoutprintln( Masking + $maskingvalue)

Systemoutprintln( Compression + $compressionvalue)

Systemoutprintln( Reserved1 + $reserved1value)

Systemoutprintln( TransParent Color +

$transparentColorvalue)

Systemoutprintln( X Aspest + $xAspectvalue)

Systemoutprintln( Y Aspect + $yAspectvalue)

Systemoutprintln( Page Width + $pageWidthvalue)

Systemoutprintln( Page Height + $pageHeightvalue)

cmap returns [long value]

Initialize the index value i in the cmap function

init int i = 0

CMAP ckSize

Print out chunk size

$value = $ckSizevalue

Systemoutprintln(ID CMAP Size + $ckSizevalue)

(colorRegister

Print out color values

Systemoutprintln( Color[ + i++ + ] +

$colorRegisterredValue +

+ $colorRegistergreenValue +

+ $colorRegisterblueValue +

)

)+

colorRegister returns [int redValue int greenValue int blueValue]

14

red = uByteValue green = uByteValue blue = uByteValue

$redValue = $redvalue

$greenValue = $greenvalue

$blueValue = $bluevalue

camg

CAMG ckSize longValue

Systemoutprintln(ID CAMG Size + $ckSizevalue)

Systemoutprintln( View Modes 0x +

LongtoHexString($longValuevalue))

The crng and ccrt data chunks

dataChunks

crng

| ccrt

crng

CRNG ckSize

Systemoutprintln(ID CRNG Size + $ckSizevalue)

cRange

cRange

pad1 = wordValue $pad1value == 0

rate = wordValue

active = wordValue

low = uByteValue

high = uByteValue

ccrt

CCRT ckSize

Systemoutprintln(ID CCRT Size + $ckSizevalue)

cycleInfo

cycleInfo

direction = wordValue

start = uByteValue

end = uByteValue

seconds = longValue

microseconds = longValue

pad = wordValue $padvalue == 0

body

BODY ckSize

Systemoutprintln(ID BODY Size + $ckSizevalue)

$ckSizevalue gt 0 data

Systemoutprintln( Number of bytes + $datavalue)

$datavalue == $ckSizevalue

defines the data chunk of the body chunk

data returns [long value]

(b+=BYTE)+ $bsize() gt 0

$value = $bsize()

15

The cksize is the value of a long integer

ckSize returns [long value]

longValue

$value = $longValuevalue

defines the longValue data typein terms of BYTE terminals

longValue returns [long value] throws UnsupportedEncodingException

h1 = BYTE h2 = BYTE l1= BYTE l2 = BYTE

$value = (long)

( getUByteValue($h1) ltlt 24 |

getUByteValue($h2) ltlt 16 |

getUByteValue($l1) ltlt 8 |

getUByteValue($l2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uWordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (short)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uByteValue returns [int value]

BYTE

calls the getUByteValue from a BYTE token

16

$value = (int) getUByteValue($BYTE)

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rules used to create the Lexer

FORM FORM

ILBM ILBM

BMHD BMHD

CMAP CMAP

CAMG CAMG

CRNG CRNG

CCRT CCRT

BODY BODY

BYTE

Figure 6 An ANTLR Grammar for the ILBM File Format

622 The Lexer and Parser

When provided the file described in the previous section ANTLR generates the ilbmlexer and

ilbmparser Java programs The Java application shown in Figure 6 uses these programs to

validate the ILBM file BLKIFF It produces the output shown in Figure 7

package orggtrigrammarschunk

import javaioIOException

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRStringStream

import organtlrruntimeCharStream

import organtlrruntimeCommonTokenStream

import organtlrruntimeRecognitionException

17

import organtlrruntimeTokenStream

import gtriorggrammarschunkiblmLexer

import gtriorggrammarschunkiblmParser

public class Test

static String filePath = CUserssl170PicturesIFF-ilbmBLKIFF

public static void main(String[] args) throws RecognitionException IOException

ANTLRFileStream stream = new ANTLRFileStream(filePath ISO-8859-1)

iblmLexer lexer = new iblmLexer(stream)

TokenStream tokenStream = new CommonTokenStream(lexer)

iblmParser parser = new iblmParser(tokenStream)

Systemoutprintln(FILE + filePath)

Systemoutprintln(Successfully parsed = + parseriblm())

Figure 7 A Java Application for Parsing an ILBM File

FILE CUserssl170PicturesIFF-ilbmBLKIFF

ID ILBM Size 2556

ID BMHD Size 20

Width 200

Height 144

X Position 0

Y Position 0

Number of Panes 6

Masking0

Compression1

Reserved10

TransParent Color0

X Aspest3

Y Aspect3

Page Width200

Page Height144

ID CMAP Size 48

Color[0] 0 0 0

Color[1] 0 128 128

Color[2] 0 255 255

Color[3] 255 0 255

Color[4] 0 255 0

Color[5] 255 0 0

Color[6] 0 0 255

Color[7] 255 255 0

Color[8] 128 128 128

Color[9] 192 192 192

Color[10] 128 0 0

Color[11] 0 128 0

Color[12] 128 128 0

Color[13] 0 0 128

Color[14] 128 0 128

18

Color[15] 255 255 255

ID CAMG Size 4

View Modes 0x800

ID BODY Size 2448

Number of bytes 2448

Successfully parsed = true

Figure 8 Output of the ILBM Parser Applied to the file BLKIFF

63 An ANTLR Grammar for the RIFF Waveform PCM File Format

Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR

The output of ANTLR is a lexer and parser for WAVEFOR PCM files

grammar waveform_pcm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text contained in lexerheader is placed at the top of the lexer

file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Default ByteOrder to Little Endian

String byteOrder = II

Rules used to create the Parser

start symbol waveform_pcm

waveform_pcm returns [boolean result]

19

ckId = fourCC $ckIdvalueequals(RIFF)

ckSize = dWordValue

formType = fourCC $formTypevalueequals(WAVE)

formatSubChunk

dataSubChunk

$result = true

fourCC returns [String value]

BYTE BYTE BYTE BYTE $value = $text

formatSubChunk

subChunkId = fourCC $subChunkIdvalueequals(fmt )

ckSize = dWordValue

commonFields

pcmSpecificField

commonFields

formatTag = wordValue Format category

channels = wordValue Number of channels

samplesPerSec = dWordValue Sampling rate

avgBytesPerSec = dWordValue For buffer estimation

blockAlign = wordValue Data block size

pcmSpecificField

bitsPerSample = wordValue Sample size

dataSubChunk

subChunkId = fourCC $subChunkIdvalueequals(data)

ckSize = dWordValue

waveData[$ckSizevalue]

waveData [long size]

(size gt= 0=gt BYTE size--)+

defines the long data type in terms of BYTE terminals

dWordValue returns [long value] throws UnsupportedEncodingException

b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE

if (byteOrderequals(II))

$value = (long)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 |

getUByteValue($b3) ltlt 16 |

getUByteValue($b4) ltlt 24 )

else

$value = (long)

( getUByteValue($b1) ltlt 24 |

getUByteValue($b2) ltlt 16 |

20

getUByteValue($b3) ltlt 8 |

getUByteValue($b4) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

if (byteOrderequals(II))

$value = (int)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 )

else

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rule used to create the Lexer

BYTE

Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format

7 Related Research

Researchers have developed a number of data description languages to support validation of file

formats EAST (Enhanced ADA Subset) is a data description language developed by the

Consultative Committee for Space Data Systems [2010] The Data Entity Dictionary

Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic

information The EAST description is used to interpret and provide access to information in

binary and text files

DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to

manipulate Java jar files and ELF object files The PADSC data description language can be

used to define the file formats of text and binary files [Fisher amp Gruber 2005] The PADSC

compiler compiles the description into tools that can be used to recognize manipulate and

transform the data into other formats

21

The Data Format Description Language (DFDL) is being developed by a Working Group of the

Open Grid Forum [OGF 2011] The data description language is based on a subset of W3C XML

Schema using ltxsappinfogt annotations to carry the extra information necessary to describe non-

XML physical representations Version 1 of the language specification was published in 2011

and a parser is being implemented

Each of the data description languages described above can be used to define data types and file

structures of binary files However they are based on file structure declarations such as those of

C or Java The binary file format grammar described in this paper most closely resembles DFDL

However the binary file grammar described in this paper is the only data description language

based on formal grammars that is used for creating recognizers for file formats

JHOVE (JSTORHarvard Object Validation Environment) is an extensible system designed to

provide automated and efficient identification and validation of the formats of digital files

JHOVE is a format-specific digital object validation API written in Java JHOVE supports

validation of the following formats AIFF ASCII GIF HTML JPEG JPEG 2000 PDF TIFF

UTF-8 WAVE and XML JHOVE2 is second generation validation environment [California

digital Library 2011]

The research reported in this paper is similar in intent to that of the JHOVE projectsmdashvalidation

of binary file formats As a matter of fact binary file grammars and parsers have been

constructed for the chunk-based file formats AIFF JPEG and WAVE as well as the directory-

based format TIFF However the research reported herein differs from that of the JHOVE

project in that what is sought is a technology for generating validators for binary file formats

from grammars specifying the binary file format

8 Conclusion

The research question addressed by his research is whether it is possible to extend the context-

free grammars used to specify the syntax of programming languages to the specification of

binary file formats and to use these grammars with parsers for validating the file formats of

binary files Amajor family of binary file formats chunk-based was described Then extensions

to the concepts of context-free grammars and attribute grammars were described that enable the

specification of binary file formats Examples of attribute array grammars for two chunk-based

binary file format were then presented Recursive descent parsers for this classes of grammars

was then described Experience in using ANTLR a parser generator for LL() string grammars

in generating parsers for recognizing the formats of binary files was discussed Finally related

research in data description languages for binary file formats was described and the JHOVE

project in creating Java-based tools for validating file formats was described

22

It is concluded that it is possible to extend context-free grammars to the specification of chunk-

based binary file formats Furthermore these grammars can be used with recursive descent

parsers for validating the file formats of chunk-based binary files It remains to be determined

whether these attribute grammars based on binary file (array) grammars are adequate to define

other families of binary file formats

To be a practical technology for generating recognizers (validators) for binary file formats a

parser generator is needed that creates from a binary file grammar a recursive descent parser

(with pushdown stack if needed) ANTLR which accepts LL() grammars as input and

generates a lexer as well as a parser is too heavyweight a parser generator for this task and does

not have the requisite data types built in

A capability to validate binary file formats assumes that the file format of a file is already

known In this research we use a file type identifier based on the UNIX file command and a

magic file that we have created [Underwood 2009] We have samples of the chunk-based file

formats discussed in this report including those referenced in Appendix a We also have file

signature tests (magic tests) for all these file types

We would like to have a parser generator (also called a compilerndashcompiler) that would take as

input an attribute grammar for a binary file format and generate a parser or interpreter that

recognizes the file format or that interprets the file format and displays or plays the contents The

closest parser generator that we have found to having these capabilities is ANTLR

(wwwantlrorg ) ANTLR will generate a top-down recursive descent parser from a context-free

attribute grammar However it works with strings and generates a separate scanner

Appendix A lists the names of about 75 file formats that belong to the family of ―chunk-based

file formats We have specifications for each of file formats and samples of files with these

formats

These file formats have a similar structure We have developed binary file grammars for two of

these binary file formats We should be able to do so for all of them We used these grammars

with the extensions to ANTLR to generate top-down recursive descent parsers for the file

formats defined with these grammars

The significance of success in this research task is that if binary file formats can be specified

with binary file grammars then only one parsing generator is needed to generate the parsers for

verifying that a file conforms to a particular format Similarly the same parsing generator can be

used for conversion of legacy file formats to current or standard formats Finally the same parser

generator can be used for generating viewerplayers for most file formats This would increase

23

the likelihood of preserving and making available into the indefinite future hose digital records

encoded in binary file formats

24

References

[Aho 1968] A V Aho Indexed Grammars ndash An Extension of Context-free Grammars Journal

of the ACM Vol 15 Issue 4 Oct 1968

[Alias-wavefront 1999] Maya File Formats Maya 20 1999

httpcaadarchethzchinfomayamanual

[Amigan 2011] Amigan Software IFF FORM and Chunk Registry

httpamigan1emunetregiffhtml

[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description

May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description

[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats

Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21

[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files

Version 13 Apple Computer Inc January 1989

wwwaesnashvilleorgPDFsAIFF20Specspdf

[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-

webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt

[Apple 2005] Apple Core Audio Format Specification 2005b

httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_

specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101

[Apple 2007] QuickTime File Format Specification

httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref

acehtml

[ASF 2001] ASF format version 10 reconstruction January 2001

httpavifilesourceforgenetasf-10htm

[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January

1997 wwwdcsedacukhomemxrgfx3d3DSspec

[Autodesk 2009] Autodesk DXF Reference January 2009

httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112

25

[Back 2002] G Back (2002) A specification and scripting language for binary data Generative

Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-

77

[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V

22 1979 pp 243-256

[Baum amp Noel 2002]D Baum amp G Noel The Simms Interchange File Format 2002

httpsimtechsourceforgenettechiffhtml

[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New

Horizons Software Inc 1987 httpamigan1emunetregWORDtxt

[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995

ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt

[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010

httpondasourceforgenetnlfnlfFormathtml

[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm amp Onda file

formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml

[California Digital Library 2011] California Digital Library JHOVE2 Userlsquos Guide 2011

httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf

[Caligari 1998] Caligri Caligari trueSpace file format Document Version 6 July

2002 httpwwwcaligaricomdownloadzipscncob65zip

[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information

Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements

and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-

t81pdf

[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description

Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf

[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998

[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt

[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D

Taliesin Inc httpamigan1emunetregDR2Dtxt

26

[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG

decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt

[Dunckley et al 2007] Dunckley M Rankin S Conway E amp Giaretta D (2007) The use of

file description languages for file format identification and validation Proc PV 2007

Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical

Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-

detailsw=50089

[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format

Tech 3285 July 2001 httptechebuchdocstechtech3285pdf

Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml

[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd

May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html

[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495

wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml

[Fisher amp Gruber 2005] Fisher K amp Gruber R (2005) PADS A domain specific language for

processing ad hoc data ACM Conference on Programming language Design and

Implementation ACM Press httpwwwpadsprojorgpaperspldipdf

[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG

1995

[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal

Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September

1997 wwwgnelsondemoncoukzspecquetzalhtml

[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called

Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)

httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt

[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03

httpwwwanim8orcomresourcesan8_formattxt

[Google 2010] WebP RIFF Container 2010

httpcodegooglecomspeedwebpdocsriff_containerhtml

[Haaland] Mike Haaland Flic Files (FLI) Format description

httpwwwmartinreddynetgfxanimFLItxt

27

[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube

Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf

[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice

Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt

[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata

Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1

1995

[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)

August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21

[IBM amp Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data

Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt

wwwtactilemediacominfoMCI_Control_Infohtml

httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf

[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec

11 2008 httpduskblueorgprojtoymidimidiformatpdf

[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10

ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf

[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable

Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C

Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110

[Issar amp Ward 1993] S Issar and W Ward CMUlsquos robust spoken language understanding

system EUOSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf

[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software

November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application

Version 50) wwwjasccomsupportkbarticlespspspecasp

[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image

File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version

22 April 2002

httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf

[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994

httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt

28

[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory

vol 2 pp 127_145 1968

[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical

Systems Theory vol 5 pp 95_96 1971

[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)

North-Holland Publ Co Amsterdam 1971 pp 95-109

[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language

Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml

httpwebcswpiedu~kalcoursescompilersmodule4mysahtml

[LightWave 1996] LightWave 3D Object File Format Oct 16 1996

httpsandboxdeosglightwavehtm

[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWo2] November 9 2001

httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml

[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer

1997 httppaulbourkenetdataformatscinema4dcinema4dpdf

[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and

Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf

[Maishein] Max Maischein PIC(PRO) Format

http14791177212extrafileformatgraphicspicpicprotxt

[Martin 2007] Michael Martin SVAF format description Version 02 January 2007

httpsvafsourceforgenetsvafhtml

[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and

Data Techniques July 29 1992 Revision 1097

httpnetghostnarodrugffvendspecmicriffms_rifftxt

[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New

Multimedia Data Types and Data Techniques Revision 30 April 15 1994

[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE

httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-

usstreamhhstreamaudprop_7wz7asp

[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003

December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx

29

[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-

uslibraryms899421aspx

[Morrison 1985] Jerry Morrison ―EA IFF 85rdquo Standard for Interchange Format Files

Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt

[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17

1986 wwwfine-viewcomjplabsdocilbmtxt

[Morrison 1986b] J Morrison ―SMUS IFF Simple Musical Score Electronic Arts Inc

February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus

[MultimediaWiki 2006] Smush Animation Format August 2006

httpwikimultimediacxindexphptitle=SANChunk_Forma

[MultimeduiaWiki 2008a] Electronic Arts CMV April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_CMV

[MultimediaWiki 2008b] Electronic Arts DCT April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_DCT

[MultimediaWiki 2008c] Electronic Arts MAD April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MAD

[MultimediaWiki 2008d] Electronic Arts MPC April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MPC

[MultimediaWiki 2008e] Electronic Arts VP6 April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_VP6

[MultimediaWiki 2008f] Electronic Arts TGV August 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGV

[MultimediaWiki 2008g] Electronic Arts TGQ November 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ

[MultimediaWiki 2008h] American Laser Games MM January 2008

httpwikimultimediacxindexphptitle=American_Laser_Games_MM

[MultimediaWiki 2009a] Electronic Arts TQI February 2009

httpwikimultimediacxindexphptitle=Electronic_Arts_TQI

30

[MultimediaWiki 2009b] SANM October 2009

httpwikimultimediacxindexphptitle=SANM

[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240

httpcodegooglecompll-parser-

generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42

[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter

written in Python in the open source graphics file convertor UniConvertor

httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer

[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification

January 31 2011 wwwogforgdocumentsGFD174pdf

[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988

wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt

[Parr amp Quong 1995] Parr T J amp Quong R W (1995) ANTLR A predicated- ll(k) parser

generator Software Practice and Experience 25 7 (July) 789-810

httpwwwantlrorgarticle1055550346383antlrpdf

[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific

Languages Pragmatic

[Peters] Christoph Peters The Ultimate 3D model file format (v 20)

httpultimate3dorgU3DModelFileFormatV2htm

[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003

httpxiphorgoggdocrfc3533txt

[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20

wwweblongcomzarfblorb

httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt

[ Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011

wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml

[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)

Format Version 10 2001 wwwlibpngorgpubmngspecCredits

31

[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File

Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008

httpscopenofficeorgexcelfileformatpdf

[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF

[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments

3072 March 2001 wwwietforgrfcrfc3072txt

[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft

26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt

[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and

old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21

[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988

httpamigan1emunetregANIMtxt

[Tamir 2005] Giora Tamir C RIFF Parser June 2005

wwwcodeprojectcomKBfilesriffparseraspx

[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications

In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi

Khosrow-Pour Information Science Publishing pp 268-273 2008

wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf

[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for

File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute

September 2009

[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical

Engineering The University of Queensland 1996

httpiteeuqeduau~cristinastudentsdaviddavidhtml

[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language

understanding system ARPA Human Language Technologies Workshop Plainsboro New

Jersey httpaclldcupenneduHH94H94-1039pdf

[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars

Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)

Springer-Verlag Berlin 1974 pp 9-16

32

[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley

1995 (chapter 10) (Attribute Grammars)

[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010

httpxiphorgvorbisdocVorbis_I_spechtml

[Zappe 1994] H Zappe Oktalyzer 1994

httpwwwwotsitorglistaspsearch=oktampbutton=GO21

33

Appendix A Chunk-based Binary File Formats

This appendix references specifications for about 75 chunk-based binary file formats and

discusses some of those specifications

A1 Electronic Arts Interchange File Format (IFF)

[Amigan 2011]

Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the

specification for the Interchange File Format [Morrison 1985]

A11 Interleaved Bitmap (ILBM)

[Morrison 1986]

This file format was discussed in section 41 of this report

A12 8-bit Sampled Voice (8SVX)

The file format specification for the 9-bit sampled voice file format uses C programming

language data structures to define the format [Hayes and Morrison 1985] The following is a re-

expression of that specification as a context-free binary file grammar

Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata

cksize rarr ULONG

ckdata rarr voice8header bodyck

ckdata rarr voice8header nameck copyrightck bodyck

ckdata rarr voice8header copyrightck nameck bodyck

ckdata rarr voice8header nameck bodyck

ckdata rarr voice8header copyrightck bodyck

ckdata rarr voice8header authorck bodyck

ckdata rarr voice8header annotation bodyck

ckdata rarr voice8header atakck releaseck bodyck

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 17: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

11

6 A Parser Generator for Recognizing Binary File Formats

What we want is a parser generator whose input is a binary file attribute grammar for a particular

binary file format and whose generated output is the Java source code of a recursive descent

parser (with a pushdown stack if needed) for the class of binary file formats specified by the

grammar Such a parser generator does not yet exist but there is a widely used parser generator

for attribute grammars for string-based languages

61 ANTLR

ANTLR (ANother Tool for Language Recognition) is a parser generator that uses LL() parsing

[Parr amp Quong 1995 Parr 2007] ANTLR takes as input an attribute grammar that specifies a

language and generates as output source code for a recognizer for that language ANTLR

supports generating code in a number of the programming languages including C Java

JavaScript and Python

ANTLR also generates a lexer from lexical rules in the grammar A lexer is a program that

converts a sequence of characters into tokens A lexer is not needed to parse binary file formats

because the data types predicted by the parser are the tokens of the grammar Furthermore the

capability to recognize LL() grammars is not needed because the binary file format grammars

specified so far do not require lookahead of k symbols

62 Using ANTLR to Generate a Parser for a Binary File Format

It is possible to use ANTLR to test our binary file format grammars in recognizing the chunk-

based and directory-based binary file formats This is accomplished by writing a lexer that treats

each input byte in a file as a character token Then functions are created for each data type in the

binary file grammar that convert the appropriate number of character tokens to binary data types

eg int 16 int32 etc This is effective if someone clumsy and has allowed us to test our

attribute grammars for binary file formats as well as to demonstrate the feasibility of creating a

parser generator for binary file grammars

621 An ANTLR Grammar for an ILBM Chunk-based File Format

Figure 6 shows a specification for the ILBM file format that is input to ANTLR The output of

ANTLR is a lexer and parser for ILBM files

12

grammar iblm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in lexerheader is placed at the top of the lexer file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Grammar Rules used to create the Parser

Start symbol ilbm

iblm returns [boolean result]

FORM ckSize ILBM

Print out ILBM chunksize

Systemoutprintln(ID ILBM Size + $ckSizevalue)

Semantic predicate that ensures that at least on BMHD is found

propertyChunks $propertyChunksfoundBMHD == true

dataChunks

body

$result = true

The bmhd cmap and camg property chunks can occur in any order

At least one BMHD data chunk must be present

propertyChunks returns [boolean foundBMHD]

(bmhd $foundBMHD = true

| cmap

| camg)+

bmhd

BMHD ckSize

Systemoutprintln(ID BMHD Size + $ckSizevalue)

13

bitMapHeader

bitMapHeader

width = uWordValue

height = uWordValue

xPosition = wordValue

yPosition = wordValue

nPanes = uByteValue

masking = uByteValue

compression = uByteValue

reserved1 = uByteValue

transparentColor = uWordValue

xAspect = uByteValue

yAspect = uByteValue

pageWidth = wordValue

pageHeight = wordValue

Print out values in the bitmap structure

Systemoutprintln( Width + $widthvalue)

Systemoutprintln( Height + $heightvalue)

Systemoutprintln( X Position + $xPositionvalue)

Systemoutprintln( Y Position + $yPositionvalue)

Systemoutprintln( Number of Panes + $nPanesvalue)

Systemoutprintln( Masking + $maskingvalue)

Systemoutprintln( Compression + $compressionvalue)

Systemoutprintln( Reserved1 + $reserved1value)

Systemoutprintln( TransParent Color +

$transparentColorvalue)

Systemoutprintln( X Aspest + $xAspectvalue)

Systemoutprintln( Y Aspect + $yAspectvalue)

Systemoutprintln( Page Width + $pageWidthvalue)

Systemoutprintln( Page Height + $pageHeightvalue)

cmap returns [long value]

Initialize the index value i in the cmap function

init int i = 0

CMAP ckSize

Print out chunk size

$value = $ckSizevalue

Systemoutprintln(ID CMAP Size + $ckSizevalue)

(colorRegister

Print out color values

Systemoutprintln( Color[ + i++ + ] +

$colorRegisterredValue +

+ $colorRegistergreenValue +

+ $colorRegisterblueValue +

)

)+

colorRegister returns [int redValue int greenValue int blueValue]

14

red = uByteValue green = uByteValue blue = uByteValue

$redValue = $redvalue

$greenValue = $greenvalue

$blueValue = $bluevalue

camg

CAMG ckSize longValue

Systemoutprintln(ID CAMG Size + $ckSizevalue)

Systemoutprintln( View Modes 0x +

LongtoHexString($longValuevalue))

The crng and ccrt data chunks

dataChunks

crng

| ccrt

crng

CRNG ckSize

Systemoutprintln(ID CRNG Size + $ckSizevalue)

cRange

cRange

pad1 = wordValue $pad1value == 0

rate = wordValue

active = wordValue

low = uByteValue

high = uByteValue

ccrt

CCRT ckSize

Systemoutprintln(ID CCRT Size + $ckSizevalue)

cycleInfo

cycleInfo

direction = wordValue

start = uByteValue

end = uByteValue

seconds = longValue

microseconds = longValue

pad = wordValue $padvalue == 0

body

BODY ckSize

Systemoutprintln(ID BODY Size + $ckSizevalue)

$ckSizevalue gt 0 data

Systemoutprintln( Number of bytes + $datavalue)

$datavalue == $ckSizevalue

defines the data chunk of the body chunk

data returns [long value]

(b+=BYTE)+ $bsize() gt 0

$value = $bsize()

15

The cksize is the value of a long integer

ckSize returns [long value]

longValue

$value = $longValuevalue

defines the longValue data typein terms of BYTE terminals

longValue returns [long value] throws UnsupportedEncodingException

h1 = BYTE h2 = BYTE l1= BYTE l2 = BYTE

$value = (long)

( getUByteValue($h1) ltlt 24 |

getUByteValue($h2) ltlt 16 |

getUByteValue($l1) ltlt 8 |

getUByteValue($l2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uWordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (short)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uByteValue returns [int value]

BYTE

calls the getUByteValue from a BYTE token

16

$value = (int) getUByteValue($BYTE)

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rules used to create the Lexer

FORM FORM

ILBM ILBM

BMHD BMHD

CMAP CMAP

CAMG CAMG

CRNG CRNG

CCRT CCRT

BODY BODY

BYTE

Figure 6 An ANTLR Grammar for the ILBM File Format

622 The Lexer and Parser

When provided the file described in the previous section ANTLR generates the ilbmlexer and

ilbmparser Java programs The Java application shown in Figure 6 uses these programs to

validate the ILBM file BLKIFF It produces the output shown in Figure 7

package orggtrigrammarschunk

import javaioIOException

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRStringStream

import organtlrruntimeCharStream

import organtlrruntimeCommonTokenStream

import organtlrruntimeRecognitionException

17

import organtlrruntimeTokenStream

import gtriorggrammarschunkiblmLexer

import gtriorggrammarschunkiblmParser

public class Test

static String filePath = CUserssl170PicturesIFF-ilbmBLKIFF

public static void main(String[] args) throws RecognitionException IOException

ANTLRFileStream stream = new ANTLRFileStream(filePath ISO-8859-1)

iblmLexer lexer = new iblmLexer(stream)

TokenStream tokenStream = new CommonTokenStream(lexer)

iblmParser parser = new iblmParser(tokenStream)

Systemoutprintln(FILE + filePath)

Systemoutprintln(Successfully parsed = + parseriblm())

Figure 7 A Java Application for Parsing an ILBM File

FILE CUserssl170PicturesIFF-ilbmBLKIFF

ID ILBM Size 2556

ID BMHD Size 20

Width 200

Height 144

X Position 0

Y Position 0

Number of Panes 6

Masking0

Compression1

Reserved10

TransParent Color0

X Aspest3

Y Aspect3

Page Width200

Page Height144

ID CMAP Size 48

Color[0] 0 0 0

Color[1] 0 128 128

Color[2] 0 255 255

Color[3] 255 0 255

Color[4] 0 255 0

Color[5] 255 0 0

Color[6] 0 0 255

Color[7] 255 255 0

Color[8] 128 128 128

Color[9] 192 192 192

Color[10] 128 0 0

Color[11] 0 128 0

Color[12] 128 128 0

Color[13] 0 0 128

Color[14] 128 0 128

18

Color[15] 255 255 255

ID CAMG Size 4

View Modes 0x800

ID BODY Size 2448

Number of bytes 2448

Successfully parsed = true

Figure 8 Output of the ILBM Parser Applied to the file BLKIFF

63 An ANTLR Grammar for the RIFF Waveform PCM File Format

Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR

The output of ANTLR is a lexer and parser for WAVEFOR PCM files

grammar waveform_pcm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text contained in lexerheader is placed at the top of the lexer

file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Default ByteOrder to Little Endian

String byteOrder = II

Rules used to create the Parser

start symbol waveform_pcm

waveform_pcm returns [boolean result]

19

ckId = fourCC $ckIdvalueequals(RIFF)

ckSize = dWordValue

formType = fourCC $formTypevalueequals(WAVE)

formatSubChunk

dataSubChunk

$result = true

fourCC returns [String value]

BYTE BYTE BYTE BYTE $value = $text

formatSubChunk

subChunkId = fourCC $subChunkIdvalueequals(fmt )

ckSize = dWordValue

commonFields

pcmSpecificField

commonFields

formatTag = wordValue Format category

channels = wordValue Number of channels

samplesPerSec = dWordValue Sampling rate

avgBytesPerSec = dWordValue For buffer estimation

blockAlign = wordValue Data block size

pcmSpecificField

bitsPerSample = wordValue Sample size

dataSubChunk

subChunkId = fourCC $subChunkIdvalueequals(data)

ckSize = dWordValue

waveData[$ckSizevalue]

waveData [long size]

(size gt= 0=gt BYTE size--)+

defines the long data type in terms of BYTE terminals

dWordValue returns [long value] throws UnsupportedEncodingException

b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE

if (byteOrderequals(II))

$value = (long)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 |

getUByteValue($b3) ltlt 16 |

getUByteValue($b4) ltlt 24 )

else

$value = (long)

( getUByteValue($b1) ltlt 24 |

getUByteValue($b2) ltlt 16 |

20

getUByteValue($b3) ltlt 8 |

getUByteValue($b4) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

if (byteOrderequals(II))

$value = (int)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 )

else

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rule used to create the Lexer

BYTE

Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format

7 Related Research

Researchers have developed a number of data description languages to support validation of file

formats EAST (Enhanced ADA Subset) is a data description language developed by the

Consultative Committee for Space Data Systems [2010] The Data Entity Dictionary

Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic

information The EAST description is used to interpret and provide access to information in

binary and text files

DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to

manipulate Java jar files and ELF object files The PADSC data description language can be

used to define the file formats of text and binary files [Fisher amp Gruber 2005] The PADSC

compiler compiles the description into tools that can be used to recognize manipulate and

transform the data into other formats

21

The Data Format Description Language (DFDL) is being developed by a Working Group of the

Open Grid Forum [OGF 2011] The data description language is based on a subset of W3C XML

Schema using ltxsappinfogt annotations to carry the extra information necessary to describe non-

XML physical representations Version 1 of the language specification was published in 2011

and a parser is being implemented

Each of the data description languages described above can be used to define data types and file

structures of binary files However they are based on file structure declarations such as those of

C or Java The binary file format grammar described in this paper most closely resembles DFDL

However the binary file grammar described in this paper is the only data description language

based on formal grammars that is used for creating recognizers for file formats

JHOVE (JSTORHarvard Object Validation Environment) is an extensible system designed to

provide automated and efficient identification and validation of the formats of digital files

JHOVE is a format-specific digital object validation API written in Java JHOVE supports

validation of the following formats AIFF ASCII GIF HTML JPEG JPEG 2000 PDF TIFF

UTF-8 WAVE and XML JHOVE2 is second generation validation environment [California

digital Library 2011]

The research reported in this paper is similar in intent to that of the JHOVE projectsmdashvalidation

of binary file formats As a matter of fact binary file grammars and parsers have been

constructed for the chunk-based file formats AIFF JPEG and WAVE as well as the directory-

based format TIFF However the research reported herein differs from that of the JHOVE

project in that what is sought is a technology for generating validators for binary file formats

from grammars specifying the binary file format

8 Conclusion

The research question addressed by his research is whether it is possible to extend the context-

free grammars used to specify the syntax of programming languages to the specification of

binary file formats and to use these grammars with parsers for validating the file formats of

binary files Amajor family of binary file formats chunk-based was described Then extensions

to the concepts of context-free grammars and attribute grammars were described that enable the

specification of binary file formats Examples of attribute array grammars for two chunk-based

binary file format were then presented Recursive descent parsers for this classes of grammars

was then described Experience in using ANTLR a parser generator for LL() string grammars

in generating parsers for recognizing the formats of binary files was discussed Finally related

research in data description languages for binary file formats was described and the JHOVE

project in creating Java-based tools for validating file formats was described

22

It is concluded that it is possible to extend context-free grammars to the specification of chunk-

based binary file formats Furthermore these grammars can be used with recursive descent

parsers for validating the file formats of chunk-based binary files It remains to be determined

whether these attribute grammars based on binary file (array) grammars are adequate to define

other families of binary file formats

To be a practical technology for generating recognizers (validators) for binary file formats a

parser generator is needed that creates from a binary file grammar a recursive descent parser

(with pushdown stack if needed) ANTLR which accepts LL() grammars as input and

generates a lexer as well as a parser is too heavyweight a parser generator for this task and does

not have the requisite data types built in

A capability to validate binary file formats assumes that the file format of a file is already

known In this research we use a file type identifier based on the UNIX file command and a

magic file that we have created [Underwood 2009] We have samples of the chunk-based file

formats discussed in this report including those referenced in Appendix a We also have file

signature tests (magic tests) for all these file types

We would like to have a parser generator (also called a compilerndashcompiler) that would take as

input an attribute grammar for a binary file format and generate a parser or interpreter that

recognizes the file format or that interprets the file format and displays or plays the contents The

closest parser generator that we have found to having these capabilities is ANTLR

(wwwantlrorg ) ANTLR will generate a top-down recursive descent parser from a context-free

attribute grammar However it works with strings and generates a separate scanner

Appendix A lists the names of about 75 file formats that belong to the family of ―chunk-based

file formats We have specifications for each of file formats and samples of files with these

formats

These file formats have a similar structure We have developed binary file grammars for two of

these binary file formats We should be able to do so for all of them We used these grammars

with the extensions to ANTLR to generate top-down recursive descent parsers for the file

formats defined with these grammars

The significance of success in this research task is that if binary file formats can be specified

with binary file grammars then only one parsing generator is needed to generate the parsers for

verifying that a file conforms to a particular format Similarly the same parsing generator can be

used for conversion of legacy file formats to current or standard formats Finally the same parser

generator can be used for generating viewerplayers for most file formats This would increase

23

the likelihood of preserving and making available into the indefinite future hose digital records

encoded in binary file formats

24

References

[Aho 1968] A V Aho Indexed Grammars ndash An Extension of Context-free Grammars Journal

of the ACM Vol 15 Issue 4 Oct 1968

[Alias-wavefront 1999] Maya File Formats Maya 20 1999

httpcaadarchethzchinfomayamanual

[Amigan 2011] Amigan Software IFF FORM and Chunk Registry

httpamigan1emunetregiffhtml

[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description

May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description

[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats

Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21

[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files

Version 13 Apple Computer Inc January 1989

wwwaesnashvilleorgPDFsAIFF20Specspdf

[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-

webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt

[Apple 2005] Apple Core Audio Format Specification 2005b

httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_

specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101

[Apple 2007] QuickTime File Format Specification

httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref

acehtml

[ASF 2001] ASF format version 10 reconstruction January 2001

httpavifilesourceforgenetasf-10htm

[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January

1997 wwwdcsedacukhomemxrgfx3d3DSspec

[Autodesk 2009] Autodesk DXF Reference January 2009

httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112

25

[Back 2002] G Back (2002) A specification and scripting language for binary data Generative

Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-

77

[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V

22 1979 pp 243-256

[Baum amp Noel 2002]D Baum amp G Noel The Simms Interchange File Format 2002

httpsimtechsourceforgenettechiffhtml

[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New

Horizons Software Inc 1987 httpamigan1emunetregWORDtxt

[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995

ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt

[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010

httpondasourceforgenetnlfnlfFormathtml

[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm amp Onda file

formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml

[California Digital Library 2011] California Digital Library JHOVE2 Userlsquos Guide 2011

httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf

[Caligari 1998] Caligri Caligari trueSpace file format Document Version 6 July

2002 httpwwwcaligaricomdownloadzipscncob65zip

[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information

Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements

and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-

t81pdf

[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description

Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf

[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998

[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt

[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D

Taliesin Inc httpamigan1emunetregDR2Dtxt

26

[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG

decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt

[Dunckley et al 2007] Dunckley M Rankin S Conway E amp Giaretta D (2007) The use of

file description languages for file format identification and validation Proc PV 2007

Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical

Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-

detailsw=50089

[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format

Tech 3285 July 2001 httptechebuchdocstechtech3285pdf

Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml

[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd

May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html

[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495

wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml

[Fisher amp Gruber 2005] Fisher K amp Gruber R (2005) PADS A domain specific language for

processing ad hoc data ACM Conference on Programming language Design and

Implementation ACM Press httpwwwpadsprojorgpaperspldipdf

[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG

1995

[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal

Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September

1997 wwwgnelsondemoncoukzspecquetzalhtml

[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called

Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)

httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt

[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03

httpwwwanim8orcomresourcesan8_formattxt

[Google 2010] WebP RIFF Container 2010

httpcodegooglecomspeedwebpdocsriff_containerhtml

[Haaland] Mike Haaland Flic Files (FLI) Format description

httpwwwmartinreddynetgfxanimFLItxt

27

[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube

Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf

[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice

Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt

[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata

Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1

1995

[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)

August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21

[IBM amp Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data

Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt

wwwtactilemediacominfoMCI_Control_Infohtml

httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf

[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec

11 2008 httpduskblueorgprojtoymidimidiformatpdf

[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10

ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf

[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable

Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C

Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110

[Issar amp Ward 1993] S Issar and W Ward CMUlsquos robust spoken language understanding

system EUOSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf

[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software

November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application

Version 50) wwwjasccomsupportkbarticlespspspecasp

[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image

File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version

22 April 2002

httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf

[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994

httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt

28

[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory

vol 2 pp 127_145 1968

[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical

Systems Theory vol 5 pp 95_96 1971

[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)

North-Holland Publ Co Amsterdam 1971 pp 95-109

[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language

Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml

httpwebcswpiedu~kalcoursescompilersmodule4mysahtml

[LightWave 1996] LightWave 3D Object File Format Oct 16 1996

httpsandboxdeosglightwavehtm

[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWo2] November 9 2001

httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml

[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer

1997 httppaulbourkenetdataformatscinema4dcinema4dpdf

[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and

Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf

[Maishein] Max Maischein PIC(PRO) Format

http14791177212extrafileformatgraphicspicpicprotxt

[Martin 2007] Michael Martin SVAF format description Version 02 January 2007

httpsvafsourceforgenetsvafhtml

[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and

Data Techniques July 29 1992 Revision 1097

httpnetghostnarodrugffvendspecmicriffms_rifftxt

[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New

Multimedia Data Types and Data Techniques Revision 30 April 15 1994

[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE

httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-

usstreamhhstreamaudprop_7wz7asp

[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003

December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx

29

[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-

uslibraryms899421aspx

[Morrison 1985] Jerry Morrison ―EA IFF 85rdquo Standard for Interchange Format Files

Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt

[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17

1986 wwwfine-viewcomjplabsdocilbmtxt

[Morrison 1986b] J Morrison ―SMUS IFF Simple Musical Score Electronic Arts Inc

February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus

[MultimediaWiki 2006] Smush Animation Format August 2006

httpwikimultimediacxindexphptitle=SANChunk_Forma

[MultimeduiaWiki 2008a] Electronic Arts CMV April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_CMV

[MultimediaWiki 2008b] Electronic Arts DCT April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_DCT

[MultimediaWiki 2008c] Electronic Arts MAD April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MAD

[MultimediaWiki 2008d] Electronic Arts MPC April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MPC

[MultimediaWiki 2008e] Electronic Arts VP6 April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_VP6

[MultimediaWiki 2008f] Electronic Arts TGV August 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGV

[MultimediaWiki 2008g] Electronic Arts TGQ November 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ

[MultimediaWiki 2008h] American Laser Games MM January 2008

httpwikimultimediacxindexphptitle=American_Laser_Games_MM

[MultimediaWiki 2009a] Electronic Arts TQI February 2009

httpwikimultimediacxindexphptitle=Electronic_Arts_TQI

30

[MultimediaWiki 2009b] SANM October 2009

httpwikimultimediacxindexphptitle=SANM

[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240

httpcodegooglecompll-parser-

generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42

[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter

written in Python in the open source graphics file convertor UniConvertor

httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer

[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification

January 31 2011 wwwogforgdocumentsGFD174pdf

[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988

wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt

[Parr amp Quong 1995] Parr T J amp Quong R W (1995) ANTLR A predicated- ll(k) parser

generator Software Practice and Experience 25 7 (July) 789-810

httpwwwantlrorgarticle1055550346383antlrpdf

[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific

Languages Pragmatic

[Peters] Christoph Peters The Ultimate 3D model file format (v 20)

httpultimate3dorgU3DModelFileFormatV2htm

[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003

httpxiphorgoggdocrfc3533txt

[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20

wwweblongcomzarfblorb

httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt

[ Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011

wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml

[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)

Format Version 10 2001 wwwlibpngorgpubmngspecCredits

31

[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File

Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008

httpscopenofficeorgexcelfileformatpdf

[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF

[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments

3072 March 2001 wwwietforgrfcrfc3072txt

[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft

26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt

[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and

old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21

[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988

httpamigan1emunetregANIMtxt

[Tamir 2005] Giora Tamir C RIFF Parser June 2005

wwwcodeprojectcomKBfilesriffparseraspx

[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications

In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi

Khosrow-Pour Information Science Publishing pp 268-273 2008

wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf

[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for

File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute

September 2009

[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical

Engineering The University of Queensland 1996

httpiteeuqeduau~cristinastudentsdaviddavidhtml

[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language

understanding system ARPA Human Language Technologies Workshop Plainsboro New

Jersey httpaclldcupenneduHH94H94-1039pdf

[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars

Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)

Springer-Verlag Berlin 1974 pp 9-16

32

[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley

1995 (chapter 10) (Attribute Grammars)

[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010

httpxiphorgvorbisdocVorbis_I_spechtml

[Zappe 1994] H Zappe Oktalyzer 1994

httpwwwwotsitorglistaspsearch=oktampbutton=GO21

33

Appendix A Chunk-based Binary File Formats

This appendix references specifications for about 75 chunk-based binary file formats and

discusses some of those specifications

A1 Electronic Arts Interchange File Format (IFF)

[Amigan 2011]

Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the

specification for the Interchange File Format [Morrison 1985]

A11 Interleaved Bitmap (ILBM)

[Morrison 1986]

This file format was discussed in section 41 of this report

A12 8-bit Sampled Voice (8SVX)

The file format specification for the 9-bit sampled voice file format uses C programming

language data structures to define the format [Hayes and Morrison 1985] The following is a re-

expression of that specification as a context-free binary file grammar

Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata

cksize rarr ULONG

ckdata rarr voice8header bodyck

ckdata rarr voice8header nameck copyrightck bodyck

ckdata rarr voice8header copyrightck nameck bodyck

ckdata rarr voice8header nameck bodyck

ckdata rarr voice8header copyrightck bodyck

ckdata rarr voice8header authorck bodyck

ckdata rarr voice8header annotation bodyck

ckdata rarr voice8header atakck releaseck bodyck

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 18: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

12

grammar iblm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in lexerheader is placed at the top of the lexer file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Grammar Rules used to create the Parser

Start symbol ilbm

iblm returns [boolean result]

FORM ckSize ILBM

Print out ILBM chunksize

Systemoutprintln(ID ILBM Size + $ckSizevalue)

Semantic predicate that ensures that at least on BMHD is found

propertyChunks $propertyChunksfoundBMHD == true

dataChunks

body

$result = true

The bmhd cmap and camg property chunks can occur in any order

At least one BMHD data chunk must be present

propertyChunks returns [boolean foundBMHD]

(bmhd $foundBMHD = true

| cmap

| camg)+

bmhd

BMHD ckSize

Systemoutprintln(ID BMHD Size + $ckSizevalue)

13

bitMapHeader

bitMapHeader

width = uWordValue

height = uWordValue

xPosition = wordValue

yPosition = wordValue

nPanes = uByteValue

masking = uByteValue

compression = uByteValue

reserved1 = uByteValue

transparentColor = uWordValue

xAspect = uByteValue

yAspect = uByteValue

pageWidth = wordValue

pageHeight = wordValue

Print out values in the bitmap structure

Systemoutprintln( Width + $widthvalue)

Systemoutprintln( Height + $heightvalue)

Systemoutprintln( X Position + $xPositionvalue)

Systemoutprintln( Y Position + $yPositionvalue)

Systemoutprintln( Number of Panes + $nPanesvalue)

Systemoutprintln( Masking + $maskingvalue)

Systemoutprintln( Compression + $compressionvalue)

Systemoutprintln( Reserved1 + $reserved1value)

Systemoutprintln( TransParent Color +

$transparentColorvalue)

Systemoutprintln( X Aspest + $xAspectvalue)

Systemoutprintln( Y Aspect + $yAspectvalue)

Systemoutprintln( Page Width + $pageWidthvalue)

Systemoutprintln( Page Height + $pageHeightvalue)

cmap returns [long value]

Initialize the index value i in the cmap function

init int i = 0

CMAP ckSize

Print out chunk size

$value = $ckSizevalue

Systemoutprintln(ID CMAP Size + $ckSizevalue)

(colorRegister

Print out color values

Systemoutprintln( Color[ + i++ + ] +

$colorRegisterredValue +

+ $colorRegistergreenValue +

+ $colorRegisterblueValue +

)

)+

colorRegister returns [int redValue int greenValue int blueValue]

14

red = uByteValue green = uByteValue blue = uByteValue

$redValue = $redvalue

$greenValue = $greenvalue

$blueValue = $bluevalue

camg

CAMG ckSize longValue

Systemoutprintln(ID CAMG Size + $ckSizevalue)

Systemoutprintln( View Modes 0x +

LongtoHexString($longValuevalue))

The crng and ccrt data chunks

dataChunks

crng

| ccrt

crng

CRNG ckSize

Systemoutprintln(ID CRNG Size + $ckSizevalue)

cRange

cRange

pad1 = wordValue $pad1value == 0

rate = wordValue

active = wordValue

low = uByteValue

high = uByteValue

ccrt

CCRT ckSize

Systemoutprintln(ID CCRT Size + $ckSizevalue)

cycleInfo

cycleInfo

direction = wordValue

start = uByteValue

end = uByteValue

seconds = longValue

microseconds = longValue

pad = wordValue $padvalue == 0

body

BODY ckSize

Systemoutprintln(ID BODY Size + $ckSizevalue)

$ckSizevalue gt 0 data

Systemoutprintln( Number of bytes + $datavalue)

$datavalue == $ckSizevalue

defines the data chunk of the body chunk

data returns [long value]

(b+=BYTE)+ $bsize() gt 0

$value = $bsize()

15

The cksize is the value of a long integer

ckSize returns [long value]

longValue

$value = $longValuevalue

defines the longValue data typein terms of BYTE terminals

longValue returns [long value] throws UnsupportedEncodingException

h1 = BYTE h2 = BYTE l1= BYTE l2 = BYTE

$value = (long)

( getUByteValue($h1) ltlt 24 |

getUByteValue($h2) ltlt 16 |

getUByteValue($l1) ltlt 8 |

getUByteValue($l2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uWordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (short)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uByteValue returns [int value]

BYTE

calls the getUByteValue from a BYTE token

16

$value = (int) getUByteValue($BYTE)

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rules used to create the Lexer

FORM FORM

ILBM ILBM

BMHD BMHD

CMAP CMAP

CAMG CAMG

CRNG CRNG

CCRT CCRT

BODY BODY

BYTE

Figure 6 An ANTLR Grammar for the ILBM File Format

622 The Lexer and Parser

When provided the file described in the previous section ANTLR generates the ilbmlexer and

ilbmparser Java programs The Java application shown in Figure 6 uses these programs to

validate the ILBM file BLKIFF It produces the output shown in Figure 7

package orggtrigrammarschunk

import javaioIOException

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRStringStream

import organtlrruntimeCharStream

import organtlrruntimeCommonTokenStream

import organtlrruntimeRecognitionException

17

import organtlrruntimeTokenStream

import gtriorggrammarschunkiblmLexer

import gtriorggrammarschunkiblmParser

public class Test

static String filePath = CUserssl170PicturesIFF-ilbmBLKIFF

public static void main(String[] args) throws RecognitionException IOException

ANTLRFileStream stream = new ANTLRFileStream(filePath ISO-8859-1)

iblmLexer lexer = new iblmLexer(stream)

TokenStream tokenStream = new CommonTokenStream(lexer)

iblmParser parser = new iblmParser(tokenStream)

Systemoutprintln(FILE + filePath)

Systemoutprintln(Successfully parsed = + parseriblm())

Figure 7 A Java Application for Parsing an ILBM File

FILE CUserssl170PicturesIFF-ilbmBLKIFF

ID ILBM Size 2556

ID BMHD Size 20

Width 200

Height 144

X Position 0

Y Position 0

Number of Panes 6

Masking0

Compression1

Reserved10

TransParent Color0

X Aspest3

Y Aspect3

Page Width200

Page Height144

ID CMAP Size 48

Color[0] 0 0 0

Color[1] 0 128 128

Color[2] 0 255 255

Color[3] 255 0 255

Color[4] 0 255 0

Color[5] 255 0 0

Color[6] 0 0 255

Color[7] 255 255 0

Color[8] 128 128 128

Color[9] 192 192 192

Color[10] 128 0 0

Color[11] 0 128 0

Color[12] 128 128 0

Color[13] 0 0 128

Color[14] 128 0 128

18

Color[15] 255 255 255

ID CAMG Size 4

View Modes 0x800

ID BODY Size 2448

Number of bytes 2448

Successfully parsed = true

Figure 8 Output of the ILBM Parser Applied to the file BLKIFF

63 An ANTLR Grammar for the RIFF Waveform PCM File Format

Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR

The output of ANTLR is a lexer and parser for WAVEFOR PCM files

grammar waveform_pcm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text contained in lexerheader is placed at the top of the lexer

file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Default ByteOrder to Little Endian

String byteOrder = II

Rules used to create the Parser

start symbol waveform_pcm

waveform_pcm returns [boolean result]

19

ckId = fourCC $ckIdvalueequals(RIFF)

ckSize = dWordValue

formType = fourCC $formTypevalueequals(WAVE)

formatSubChunk

dataSubChunk

$result = true

fourCC returns [String value]

BYTE BYTE BYTE BYTE $value = $text

formatSubChunk

subChunkId = fourCC $subChunkIdvalueequals(fmt )

ckSize = dWordValue

commonFields

pcmSpecificField

commonFields

formatTag = wordValue Format category

channels = wordValue Number of channels

samplesPerSec = dWordValue Sampling rate

avgBytesPerSec = dWordValue For buffer estimation

blockAlign = wordValue Data block size

pcmSpecificField

bitsPerSample = wordValue Sample size

dataSubChunk

subChunkId = fourCC $subChunkIdvalueequals(data)

ckSize = dWordValue

waveData[$ckSizevalue]

waveData [long size]

(size gt= 0=gt BYTE size--)+

defines the long data type in terms of BYTE terminals

dWordValue returns [long value] throws UnsupportedEncodingException

b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE

if (byteOrderequals(II))

$value = (long)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 |

getUByteValue($b3) ltlt 16 |

getUByteValue($b4) ltlt 24 )

else

$value = (long)

( getUByteValue($b1) ltlt 24 |

getUByteValue($b2) ltlt 16 |

20

getUByteValue($b3) ltlt 8 |

getUByteValue($b4) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

if (byteOrderequals(II))

$value = (int)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 )

else

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rule used to create the Lexer

BYTE

Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format

7 Related Research

Researchers have developed a number of data description languages to support validation of file

formats EAST (Enhanced ADA Subset) is a data description language developed by the

Consultative Committee for Space Data Systems [2010] The Data Entity Dictionary

Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic

information The EAST description is used to interpret and provide access to information in

binary and text files

DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to

manipulate Java jar files and ELF object files The PADSC data description language can be

used to define the file formats of text and binary files [Fisher amp Gruber 2005] The PADSC

compiler compiles the description into tools that can be used to recognize manipulate and

transform the data into other formats

21

The Data Format Description Language (DFDL) is being developed by a Working Group of the

Open Grid Forum [OGF 2011] The data description language is based on a subset of W3C XML

Schema using ltxsappinfogt annotations to carry the extra information necessary to describe non-

XML physical representations Version 1 of the language specification was published in 2011

and a parser is being implemented

Each of the data description languages described above can be used to define data types and file

structures of binary files However they are based on file structure declarations such as those of

C or Java The binary file format grammar described in this paper most closely resembles DFDL

However the binary file grammar described in this paper is the only data description language

based on formal grammars that is used for creating recognizers for file formats

JHOVE (JSTORHarvard Object Validation Environment) is an extensible system designed to

provide automated and efficient identification and validation of the formats of digital files

JHOVE is a format-specific digital object validation API written in Java JHOVE supports

validation of the following formats AIFF ASCII GIF HTML JPEG JPEG 2000 PDF TIFF

UTF-8 WAVE and XML JHOVE2 is second generation validation environment [California

digital Library 2011]

The research reported in this paper is similar in intent to that of the JHOVE projectsmdashvalidation

of binary file formats As a matter of fact binary file grammars and parsers have been

constructed for the chunk-based file formats AIFF JPEG and WAVE as well as the directory-

based format TIFF However the research reported herein differs from that of the JHOVE

project in that what is sought is a technology for generating validators for binary file formats

from grammars specifying the binary file format

8 Conclusion

The research question addressed by his research is whether it is possible to extend the context-

free grammars used to specify the syntax of programming languages to the specification of

binary file formats and to use these grammars with parsers for validating the file formats of

binary files Amajor family of binary file formats chunk-based was described Then extensions

to the concepts of context-free grammars and attribute grammars were described that enable the

specification of binary file formats Examples of attribute array grammars for two chunk-based

binary file format were then presented Recursive descent parsers for this classes of grammars

was then described Experience in using ANTLR a parser generator for LL() string grammars

in generating parsers for recognizing the formats of binary files was discussed Finally related

research in data description languages for binary file formats was described and the JHOVE

project in creating Java-based tools for validating file formats was described

22

It is concluded that it is possible to extend context-free grammars to the specification of chunk-

based binary file formats Furthermore these grammars can be used with recursive descent

parsers for validating the file formats of chunk-based binary files It remains to be determined

whether these attribute grammars based on binary file (array) grammars are adequate to define

other families of binary file formats

To be a practical technology for generating recognizers (validators) for binary file formats a

parser generator is needed that creates from a binary file grammar a recursive descent parser

(with pushdown stack if needed) ANTLR which accepts LL() grammars as input and

generates a lexer as well as a parser is too heavyweight a parser generator for this task and does

not have the requisite data types built in

A capability to validate binary file formats assumes that the file format of a file is already

known In this research we use a file type identifier based on the UNIX file command and a

magic file that we have created [Underwood 2009] We have samples of the chunk-based file

formats discussed in this report including those referenced in Appendix a We also have file

signature tests (magic tests) for all these file types

We would like to have a parser generator (also called a compilerndashcompiler) that would take as

input an attribute grammar for a binary file format and generate a parser or interpreter that

recognizes the file format or that interprets the file format and displays or plays the contents The

closest parser generator that we have found to having these capabilities is ANTLR

(wwwantlrorg ) ANTLR will generate a top-down recursive descent parser from a context-free

attribute grammar However it works with strings and generates a separate scanner

Appendix A lists the names of about 75 file formats that belong to the family of ―chunk-based

file formats We have specifications for each of file formats and samples of files with these

formats

These file formats have a similar structure We have developed binary file grammars for two of

these binary file formats We should be able to do so for all of them We used these grammars

with the extensions to ANTLR to generate top-down recursive descent parsers for the file

formats defined with these grammars

The significance of success in this research task is that if binary file formats can be specified

with binary file grammars then only one parsing generator is needed to generate the parsers for

verifying that a file conforms to a particular format Similarly the same parsing generator can be

used for conversion of legacy file formats to current or standard formats Finally the same parser

generator can be used for generating viewerplayers for most file formats This would increase

23

the likelihood of preserving and making available into the indefinite future hose digital records

encoded in binary file formats

24

References

[Aho 1968] A V Aho Indexed Grammars ndash An Extension of Context-free Grammars Journal

of the ACM Vol 15 Issue 4 Oct 1968

[Alias-wavefront 1999] Maya File Formats Maya 20 1999

httpcaadarchethzchinfomayamanual

[Amigan 2011] Amigan Software IFF FORM and Chunk Registry

httpamigan1emunetregiffhtml

[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description

May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description

[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats

Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21

[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files

Version 13 Apple Computer Inc January 1989

wwwaesnashvilleorgPDFsAIFF20Specspdf

[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-

webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt

[Apple 2005] Apple Core Audio Format Specification 2005b

httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_

specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101

[Apple 2007] QuickTime File Format Specification

httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref

acehtml

[ASF 2001] ASF format version 10 reconstruction January 2001

httpavifilesourceforgenetasf-10htm

[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January

1997 wwwdcsedacukhomemxrgfx3d3DSspec

[Autodesk 2009] Autodesk DXF Reference January 2009

httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112

25

[Back 2002] G Back (2002) A specification and scripting language for binary data Generative

Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-

77

[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V

22 1979 pp 243-256

[Baum amp Noel 2002]D Baum amp G Noel The Simms Interchange File Format 2002

httpsimtechsourceforgenettechiffhtml

[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New

Horizons Software Inc 1987 httpamigan1emunetregWORDtxt

[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995

ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt

[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010

httpondasourceforgenetnlfnlfFormathtml

[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm amp Onda file

formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml

[California Digital Library 2011] California Digital Library JHOVE2 Userlsquos Guide 2011

httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf

[Caligari 1998] Caligri Caligari trueSpace file format Document Version 6 July

2002 httpwwwcaligaricomdownloadzipscncob65zip

[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information

Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements

and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-

t81pdf

[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description

Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf

[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998

[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt

[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D

Taliesin Inc httpamigan1emunetregDR2Dtxt

26

[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG

decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt

[Dunckley et al 2007] Dunckley M Rankin S Conway E amp Giaretta D (2007) The use of

file description languages for file format identification and validation Proc PV 2007

Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical

Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-

detailsw=50089

[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format

Tech 3285 July 2001 httptechebuchdocstechtech3285pdf

Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml

[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd

May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html

[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495

wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml

[Fisher amp Gruber 2005] Fisher K amp Gruber R (2005) PADS A domain specific language for

processing ad hoc data ACM Conference on Programming language Design and

Implementation ACM Press httpwwwpadsprojorgpaperspldipdf

[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG

1995

[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal

Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September

1997 wwwgnelsondemoncoukzspecquetzalhtml

[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called

Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)

httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt

[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03

httpwwwanim8orcomresourcesan8_formattxt

[Google 2010] WebP RIFF Container 2010

httpcodegooglecomspeedwebpdocsriff_containerhtml

[Haaland] Mike Haaland Flic Files (FLI) Format description

httpwwwmartinreddynetgfxanimFLItxt

27

[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube

Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf

[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice

Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt

[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata

Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1

1995

[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)

August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21

[IBM amp Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data

Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt

wwwtactilemediacominfoMCI_Control_Infohtml

httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf

[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec

11 2008 httpduskblueorgprojtoymidimidiformatpdf

[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10

ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf

[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable

Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C

Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110

[Issar amp Ward 1993] S Issar and W Ward CMUlsquos robust spoken language understanding

system EUOSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf

[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software

November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application

Version 50) wwwjasccomsupportkbarticlespspspecasp

[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image

File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version

22 April 2002

httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf

[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994

httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt

28

[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory

vol 2 pp 127_145 1968

[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical

Systems Theory vol 5 pp 95_96 1971

[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)

North-Holland Publ Co Amsterdam 1971 pp 95-109

[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language

Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml

httpwebcswpiedu~kalcoursescompilersmodule4mysahtml

[LightWave 1996] LightWave 3D Object File Format Oct 16 1996

httpsandboxdeosglightwavehtm

[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWo2] November 9 2001

httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml

[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer

1997 httppaulbourkenetdataformatscinema4dcinema4dpdf

[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and

Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf

[Maishein] Max Maischein PIC(PRO) Format

http14791177212extrafileformatgraphicspicpicprotxt

[Martin 2007] Michael Martin SVAF format description Version 02 January 2007

httpsvafsourceforgenetsvafhtml

[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and

Data Techniques July 29 1992 Revision 1097

httpnetghostnarodrugffvendspecmicriffms_rifftxt

[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New

Multimedia Data Types and Data Techniques Revision 30 April 15 1994

[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE

httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-

usstreamhhstreamaudprop_7wz7asp

[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003

December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx

29

[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-

uslibraryms899421aspx

[Morrison 1985] Jerry Morrison ―EA IFF 85rdquo Standard for Interchange Format Files

Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt

[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17

1986 wwwfine-viewcomjplabsdocilbmtxt

[Morrison 1986b] J Morrison ―SMUS IFF Simple Musical Score Electronic Arts Inc

February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus

[MultimediaWiki 2006] Smush Animation Format August 2006

httpwikimultimediacxindexphptitle=SANChunk_Forma

[MultimeduiaWiki 2008a] Electronic Arts CMV April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_CMV

[MultimediaWiki 2008b] Electronic Arts DCT April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_DCT

[MultimediaWiki 2008c] Electronic Arts MAD April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MAD

[MultimediaWiki 2008d] Electronic Arts MPC April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MPC

[MultimediaWiki 2008e] Electronic Arts VP6 April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_VP6

[MultimediaWiki 2008f] Electronic Arts TGV August 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGV

[MultimediaWiki 2008g] Electronic Arts TGQ November 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ

[MultimediaWiki 2008h] American Laser Games MM January 2008

httpwikimultimediacxindexphptitle=American_Laser_Games_MM

[MultimediaWiki 2009a] Electronic Arts TQI February 2009

httpwikimultimediacxindexphptitle=Electronic_Arts_TQI

30

[MultimediaWiki 2009b] SANM October 2009

httpwikimultimediacxindexphptitle=SANM

[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240

httpcodegooglecompll-parser-

generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42

[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter

written in Python in the open source graphics file convertor UniConvertor

httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer

[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification

January 31 2011 wwwogforgdocumentsGFD174pdf

[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988

wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt

[Parr amp Quong 1995] Parr T J amp Quong R W (1995) ANTLR A predicated- ll(k) parser

generator Software Practice and Experience 25 7 (July) 789-810

httpwwwantlrorgarticle1055550346383antlrpdf

[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific

Languages Pragmatic

[Peters] Christoph Peters The Ultimate 3D model file format (v 20)

httpultimate3dorgU3DModelFileFormatV2htm

[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003

httpxiphorgoggdocrfc3533txt

[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20

wwweblongcomzarfblorb

httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt

[ Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011

wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml

[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)

Format Version 10 2001 wwwlibpngorgpubmngspecCredits

31

[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File

Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008

httpscopenofficeorgexcelfileformatpdf

[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF

[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments

3072 March 2001 wwwietforgrfcrfc3072txt

[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft

26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt

[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and

old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21

[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988

httpamigan1emunetregANIMtxt

[Tamir 2005] Giora Tamir C RIFF Parser June 2005

wwwcodeprojectcomKBfilesriffparseraspx

[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications

In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi

Khosrow-Pour Information Science Publishing pp 268-273 2008

wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf

[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for

File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute

September 2009

[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical

Engineering The University of Queensland 1996

httpiteeuqeduau~cristinastudentsdaviddavidhtml

[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language

understanding system ARPA Human Language Technologies Workshop Plainsboro New

Jersey httpaclldcupenneduHH94H94-1039pdf

[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars

Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)

Springer-Verlag Berlin 1974 pp 9-16

32

[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley

1995 (chapter 10) (Attribute Grammars)

[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010

httpxiphorgvorbisdocVorbis_I_spechtml

[Zappe 1994] H Zappe Oktalyzer 1994

httpwwwwotsitorglistaspsearch=oktampbutton=GO21

33

Appendix A Chunk-based Binary File Formats

This appendix references specifications for about 75 chunk-based binary file formats and

discusses some of those specifications

A1 Electronic Arts Interchange File Format (IFF)

[Amigan 2011]

Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the

specification for the Interchange File Format [Morrison 1985]

A11 Interleaved Bitmap (ILBM)

[Morrison 1986]

This file format was discussed in section 41 of this report

A12 8-bit Sampled Voice (8SVX)

The file format specification for the 9-bit sampled voice file format uses C programming

language data structures to define the format [Hayes and Morrison 1985] The following is a re-

expression of that specification as a context-free binary file grammar

Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata

cksize rarr ULONG

ckdata rarr voice8header bodyck

ckdata rarr voice8header nameck copyrightck bodyck

ckdata rarr voice8header copyrightck nameck bodyck

ckdata rarr voice8header nameck bodyck

ckdata rarr voice8header copyrightck bodyck

ckdata rarr voice8header authorck bodyck

ckdata rarr voice8header annotation bodyck

ckdata rarr voice8header atakck releaseck bodyck

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 19: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

13

bitMapHeader

bitMapHeader

width = uWordValue

height = uWordValue

xPosition = wordValue

yPosition = wordValue

nPanes = uByteValue

masking = uByteValue

compression = uByteValue

reserved1 = uByteValue

transparentColor = uWordValue

xAspect = uByteValue

yAspect = uByteValue

pageWidth = wordValue

pageHeight = wordValue

Print out values in the bitmap structure

Systemoutprintln( Width + $widthvalue)

Systemoutprintln( Height + $heightvalue)

Systemoutprintln( X Position + $xPositionvalue)

Systemoutprintln( Y Position + $yPositionvalue)

Systemoutprintln( Number of Panes + $nPanesvalue)

Systemoutprintln( Masking + $maskingvalue)

Systemoutprintln( Compression + $compressionvalue)

Systemoutprintln( Reserved1 + $reserved1value)

Systemoutprintln( TransParent Color +

$transparentColorvalue)

Systemoutprintln( X Aspest + $xAspectvalue)

Systemoutprintln( Y Aspect + $yAspectvalue)

Systemoutprintln( Page Width + $pageWidthvalue)

Systemoutprintln( Page Height + $pageHeightvalue)

cmap returns [long value]

Initialize the index value i in the cmap function

init int i = 0

CMAP ckSize

Print out chunk size

$value = $ckSizevalue

Systemoutprintln(ID CMAP Size + $ckSizevalue)

(colorRegister

Print out color values

Systemoutprintln( Color[ + i++ + ] +

$colorRegisterredValue +

+ $colorRegistergreenValue +

+ $colorRegisterblueValue +

)

)+

colorRegister returns [int redValue int greenValue int blueValue]

14

red = uByteValue green = uByteValue blue = uByteValue

$redValue = $redvalue

$greenValue = $greenvalue

$blueValue = $bluevalue

camg

CAMG ckSize longValue

Systemoutprintln(ID CAMG Size + $ckSizevalue)

Systemoutprintln( View Modes 0x +

LongtoHexString($longValuevalue))

The crng and ccrt data chunks

dataChunks

crng

| ccrt

crng

CRNG ckSize

Systemoutprintln(ID CRNG Size + $ckSizevalue)

cRange

cRange

pad1 = wordValue $pad1value == 0

rate = wordValue

active = wordValue

low = uByteValue

high = uByteValue

ccrt

CCRT ckSize

Systemoutprintln(ID CCRT Size + $ckSizevalue)

cycleInfo

cycleInfo

direction = wordValue

start = uByteValue

end = uByteValue

seconds = longValue

microseconds = longValue

pad = wordValue $padvalue == 0

body

BODY ckSize

Systemoutprintln(ID BODY Size + $ckSizevalue)

$ckSizevalue gt 0 data

Systemoutprintln( Number of bytes + $datavalue)

$datavalue == $ckSizevalue

defines the data chunk of the body chunk

data returns [long value]

(b+=BYTE)+ $bsize() gt 0

$value = $bsize()

15

The cksize is the value of a long integer

ckSize returns [long value]

longValue

$value = $longValuevalue

defines the longValue data typein terms of BYTE terminals

longValue returns [long value] throws UnsupportedEncodingException

h1 = BYTE h2 = BYTE l1= BYTE l2 = BYTE

$value = (long)

( getUByteValue($h1) ltlt 24 |

getUByteValue($h2) ltlt 16 |

getUByteValue($l1) ltlt 8 |

getUByteValue($l2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uWordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (short)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uByteValue returns [int value]

BYTE

calls the getUByteValue from a BYTE token

16

$value = (int) getUByteValue($BYTE)

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rules used to create the Lexer

FORM FORM

ILBM ILBM

BMHD BMHD

CMAP CMAP

CAMG CAMG

CRNG CRNG

CCRT CCRT

BODY BODY

BYTE

Figure 6 An ANTLR Grammar for the ILBM File Format

622 The Lexer and Parser

When provided the file described in the previous section ANTLR generates the ilbmlexer and

ilbmparser Java programs The Java application shown in Figure 6 uses these programs to

validate the ILBM file BLKIFF It produces the output shown in Figure 7

package orggtrigrammarschunk

import javaioIOException

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRStringStream

import organtlrruntimeCharStream

import organtlrruntimeCommonTokenStream

import organtlrruntimeRecognitionException

17

import organtlrruntimeTokenStream

import gtriorggrammarschunkiblmLexer

import gtriorggrammarschunkiblmParser

public class Test

static String filePath = CUserssl170PicturesIFF-ilbmBLKIFF

public static void main(String[] args) throws RecognitionException IOException

ANTLRFileStream stream = new ANTLRFileStream(filePath ISO-8859-1)

iblmLexer lexer = new iblmLexer(stream)

TokenStream tokenStream = new CommonTokenStream(lexer)

iblmParser parser = new iblmParser(tokenStream)

Systemoutprintln(FILE + filePath)

Systemoutprintln(Successfully parsed = + parseriblm())

Figure 7 A Java Application for Parsing an ILBM File

FILE CUserssl170PicturesIFF-ilbmBLKIFF

ID ILBM Size 2556

ID BMHD Size 20

Width 200

Height 144

X Position 0

Y Position 0

Number of Panes 6

Masking0

Compression1

Reserved10

TransParent Color0

X Aspest3

Y Aspect3

Page Width200

Page Height144

ID CMAP Size 48

Color[0] 0 0 0

Color[1] 0 128 128

Color[2] 0 255 255

Color[3] 255 0 255

Color[4] 0 255 0

Color[5] 255 0 0

Color[6] 0 0 255

Color[7] 255 255 0

Color[8] 128 128 128

Color[9] 192 192 192

Color[10] 128 0 0

Color[11] 0 128 0

Color[12] 128 128 0

Color[13] 0 0 128

Color[14] 128 0 128

18

Color[15] 255 255 255

ID CAMG Size 4

View Modes 0x800

ID BODY Size 2448

Number of bytes 2448

Successfully parsed = true

Figure 8 Output of the ILBM Parser Applied to the file BLKIFF

63 An ANTLR Grammar for the RIFF Waveform PCM File Format

Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR

The output of ANTLR is a lexer and parser for WAVEFOR PCM files

grammar waveform_pcm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text contained in lexerheader is placed at the top of the lexer

file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Default ByteOrder to Little Endian

String byteOrder = II

Rules used to create the Parser

start symbol waveform_pcm

waveform_pcm returns [boolean result]

19

ckId = fourCC $ckIdvalueequals(RIFF)

ckSize = dWordValue

formType = fourCC $formTypevalueequals(WAVE)

formatSubChunk

dataSubChunk

$result = true

fourCC returns [String value]

BYTE BYTE BYTE BYTE $value = $text

formatSubChunk

subChunkId = fourCC $subChunkIdvalueequals(fmt )

ckSize = dWordValue

commonFields

pcmSpecificField

commonFields

formatTag = wordValue Format category

channels = wordValue Number of channels

samplesPerSec = dWordValue Sampling rate

avgBytesPerSec = dWordValue For buffer estimation

blockAlign = wordValue Data block size

pcmSpecificField

bitsPerSample = wordValue Sample size

dataSubChunk

subChunkId = fourCC $subChunkIdvalueequals(data)

ckSize = dWordValue

waveData[$ckSizevalue]

waveData [long size]

(size gt= 0=gt BYTE size--)+

defines the long data type in terms of BYTE terminals

dWordValue returns [long value] throws UnsupportedEncodingException

b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE

if (byteOrderequals(II))

$value = (long)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 |

getUByteValue($b3) ltlt 16 |

getUByteValue($b4) ltlt 24 )

else

$value = (long)

( getUByteValue($b1) ltlt 24 |

getUByteValue($b2) ltlt 16 |

20

getUByteValue($b3) ltlt 8 |

getUByteValue($b4) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

if (byteOrderequals(II))

$value = (int)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 )

else

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rule used to create the Lexer

BYTE

Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format

7 Related Research

Researchers have developed a number of data description languages to support validation of file

formats EAST (Enhanced ADA Subset) is a data description language developed by the

Consultative Committee for Space Data Systems [2010] The Data Entity Dictionary

Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic

information The EAST description is used to interpret and provide access to information in

binary and text files

DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to

manipulate Java jar files and ELF object files The PADSC data description language can be

used to define the file formats of text and binary files [Fisher amp Gruber 2005] The PADSC

compiler compiles the description into tools that can be used to recognize manipulate and

transform the data into other formats

21

The Data Format Description Language (DFDL) is being developed by a Working Group of the

Open Grid Forum [OGF 2011] The data description language is based on a subset of W3C XML

Schema using ltxsappinfogt annotations to carry the extra information necessary to describe non-

XML physical representations Version 1 of the language specification was published in 2011

and a parser is being implemented

Each of the data description languages described above can be used to define data types and file

structures of binary files However they are based on file structure declarations such as those of

C or Java The binary file format grammar described in this paper most closely resembles DFDL

However the binary file grammar described in this paper is the only data description language

based on formal grammars that is used for creating recognizers for file formats

JHOVE (JSTORHarvard Object Validation Environment) is an extensible system designed to

provide automated and efficient identification and validation of the formats of digital files

JHOVE is a format-specific digital object validation API written in Java JHOVE supports

validation of the following formats AIFF ASCII GIF HTML JPEG JPEG 2000 PDF TIFF

UTF-8 WAVE and XML JHOVE2 is second generation validation environment [California

digital Library 2011]

The research reported in this paper is similar in intent to that of the JHOVE projectsmdashvalidation

of binary file formats As a matter of fact binary file grammars and parsers have been

constructed for the chunk-based file formats AIFF JPEG and WAVE as well as the directory-

based format TIFF However the research reported herein differs from that of the JHOVE

project in that what is sought is a technology for generating validators for binary file formats

from grammars specifying the binary file format

8 Conclusion

The research question addressed by his research is whether it is possible to extend the context-

free grammars used to specify the syntax of programming languages to the specification of

binary file formats and to use these grammars with parsers for validating the file formats of

binary files Amajor family of binary file formats chunk-based was described Then extensions

to the concepts of context-free grammars and attribute grammars were described that enable the

specification of binary file formats Examples of attribute array grammars for two chunk-based

binary file format were then presented Recursive descent parsers for this classes of grammars

was then described Experience in using ANTLR a parser generator for LL() string grammars

in generating parsers for recognizing the formats of binary files was discussed Finally related

research in data description languages for binary file formats was described and the JHOVE

project in creating Java-based tools for validating file formats was described

22

It is concluded that it is possible to extend context-free grammars to the specification of chunk-

based binary file formats Furthermore these grammars can be used with recursive descent

parsers for validating the file formats of chunk-based binary files It remains to be determined

whether these attribute grammars based on binary file (array) grammars are adequate to define

other families of binary file formats

To be a practical technology for generating recognizers (validators) for binary file formats a

parser generator is needed that creates from a binary file grammar a recursive descent parser

(with pushdown stack if needed) ANTLR which accepts LL() grammars as input and

generates a lexer as well as a parser is too heavyweight a parser generator for this task and does

not have the requisite data types built in

A capability to validate binary file formats assumes that the file format of a file is already

known In this research we use a file type identifier based on the UNIX file command and a

magic file that we have created [Underwood 2009] We have samples of the chunk-based file

formats discussed in this report including those referenced in Appendix a We also have file

signature tests (magic tests) for all these file types

We would like to have a parser generator (also called a compilerndashcompiler) that would take as

input an attribute grammar for a binary file format and generate a parser or interpreter that

recognizes the file format or that interprets the file format and displays or plays the contents The

closest parser generator that we have found to having these capabilities is ANTLR

(wwwantlrorg ) ANTLR will generate a top-down recursive descent parser from a context-free

attribute grammar However it works with strings and generates a separate scanner

Appendix A lists the names of about 75 file formats that belong to the family of ―chunk-based

file formats We have specifications for each of file formats and samples of files with these

formats

These file formats have a similar structure We have developed binary file grammars for two of

these binary file formats We should be able to do so for all of them We used these grammars

with the extensions to ANTLR to generate top-down recursive descent parsers for the file

formats defined with these grammars

The significance of success in this research task is that if binary file formats can be specified

with binary file grammars then only one parsing generator is needed to generate the parsers for

verifying that a file conforms to a particular format Similarly the same parsing generator can be

used for conversion of legacy file formats to current or standard formats Finally the same parser

generator can be used for generating viewerplayers for most file formats This would increase

23

the likelihood of preserving and making available into the indefinite future hose digital records

encoded in binary file formats

24

References

[Aho 1968] A V Aho Indexed Grammars ndash An Extension of Context-free Grammars Journal

of the ACM Vol 15 Issue 4 Oct 1968

[Alias-wavefront 1999] Maya File Formats Maya 20 1999

httpcaadarchethzchinfomayamanual

[Amigan 2011] Amigan Software IFF FORM and Chunk Registry

httpamigan1emunetregiffhtml

[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description

May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description

[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats

Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21

[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files

Version 13 Apple Computer Inc January 1989

wwwaesnashvilleorgPDFsAIFF20Specspdf

[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-

webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt

[Apple 2005] Apple Core Audio Format Specification 2005b

httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_

specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101

[Apple 2007] QuickTime File Format Specification

httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref

acehtml

[ASF 2001] ASF format version 10 reconstruction January 2001

httpavifilesourceforgenetasf-10htm

[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January

1997 wwwdcsedacukhomemxrgfx3d3DSspec

[Autodesk 2009] Autodesk DXF Reference January 2009

httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112

25

[Back 2002] G Back (2002) A specification and scripting language for binary data Generative

Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-

77

[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V

22 1979 pp 243-256

[Baum amp Noel 2002]D Baum amp G Noel The Simms Interchange File Format 2002

httpsimtechsourceforgenettechiffhtml

[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New

Horizons Software Inc 1987 httpamigan1emunetregWORDtxt

[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995

ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt

[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010

httpondasourceforgenetnlfnlfFormathtml

[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm amp Onda file

formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml

[California Digital Library 2011] California Digital Library JHOVE2 Userlsquos Guide 2011

httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf

[Caligari 1998] Caligri Caligari trueSpace file format Document Version 6 July

2002 httpwwwcaligaricomdownloadzipscncob65zip

[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information

Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements

and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-

t81pdf

[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description

Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf

[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998

[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt

[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D

Taliesin Inc httpamigan1emunetregDR2Dtxt

26

[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG

decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt

[Dunckley et al 2007] Dunckley M Rankin S Conway E amp Giaretta D (2007) The use of

file description languages for file format identification and validation Proc PV 2007

Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical

Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-

detailsw=50089

[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format

Tech 3285 July 2001 httptechebuchdocstechtech3285pdf

Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml

[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd

May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html

[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495

wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml

[Fisher amp Gruber 2005] Fisher K amp Gruber R (2005) PADS A domain specific language for

processing ad hoc data ACM Conference on Programming language Design and

Implementation ACM Press httpwwwpadsprojorgpaperspldipdf

[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG

1995

[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal

Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September

1997 wwwgnelsondemoncoukzspecquetzalhtml

[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called

Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)

httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt

[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03

httpwwwanim8orcomresourcesan8_formattxt

[Google 2010] WebP RIFF Container 2010

httpcodegooglecomspeedwebpdocsriff_containerhtml

[Haaland] Mike Haaland Flic Files (FLI) Format description

httpwwwmartinreddynetgfxanimFLItxt

27

[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube

Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf

[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice

Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt

[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata

Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1

1995

[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)

August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21

[IBM amp Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data

Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt

wwwtactilemediacominfoMCI_Control_Infohtml

httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf

[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec

11 2008 httpduskblueorgprojtoymidimidiformatpdf

[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10

ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf

[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable

Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C

Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110

[Issar amp Ward 1993] S Issar and W Ward CMUlsquos robust spoken language understanding

system EUOSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf

[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software

November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application

Version 50) wwwjasccomsupportkbarticlespspspecasp

[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image

File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version

22 April 2002

httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf

[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994

httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt

28

[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory

vol 2 pp 127_145 1968

[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical

Systems Theory vol 5 pp 95_96 1971

[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)

North-Holland Publ Co Amsterdam 1971 pp 95-109

[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language

Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml

httpwebcswpiedu~kalcoursescompilersmodule4mysahtml

[LightWave 1996] LightWave 3D Object File Format Oct 16 1996

httpsandboxdeosglightwavehtm

[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWo2] November 9 2001

httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml

[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer

1997 httppaulbourkenetdataformatscinema4dcinema4dpdf

[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and

Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf

[Maishein] Max Maischein PIC(PRO) Format

http14791177212extrafileformatgraphicspicpicprotxt

[Martin 2007] Michael Martin SVAF format description Version 02 January 2007

httpsvafsourceforgenetsvafhtml

[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and

Data Techniques July 29 1992 Revision 1097

httpnetghostnarodrugffvendspecmicriffms_rifftxt

[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New

Multimedia Data Types and Data Techniques Revision 30 April 15 1994

[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE

httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-

usstreamhhstreamaudprop_7wz7asp

[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003

December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx

29

[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-

uslibraryms899421aspx

[Morrison 1985] Jerry Morrison ―EA IFF 85rdquo Standard for Interchange Format Files

Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt

[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17

1986 wwwfine-viewcomjplabsdocilbmtxt

[Morrison 1986b] J Morrison ―SMUS IFF Simple Musical Score Electronic Arts Inc

February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus

[MultimediaWiki 2006] Smush Animation Format August 2006

httpwikimultimediacxindexphptitle=SANChunk_Forma

[MultimeduiaWiki 2008a] Electronic Arts CMV April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_CMV

[MultimediaWiki 2008b] Electronic Arts DCT April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_DCT

[MultimediaWiki 2008c] Electronic Arts MAD April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MAD

[MultimediaWiki 2008d] Electronic Arts MPC April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MPC

[MultimediaWiki 2008e] Electronic Arts VP6 April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_VP6

[MultimediaWiki 2008f] Electronic Arts TGV August 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGV

[MultimediaWiki 2008g] Electronic Arts TGQ November 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ

[MultimediaWiki 2008h] American Laser Games MM January 2008

httpwikimultimediacxindexphptitle=American_Laser_Games_MM

[MultimediaWiki 2009a] Electronic Arts TQI February 2009

httpwikimultimediacxindexphptitle=Electronic_Arts_TQI

30

[MultimediaWiki 2009b] SANM October 2009

httpwikimultimediacxindexphptitle=SANM

[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240

httpcodegooglecompll-parser-

generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42

[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter

written in Python in the open source graphics file convertor UniConvertor

httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer

[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification

January 31 2011 wwwogforgdocumentsGFD174pdf

[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988

wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt

[Parr amp Quong 1995] Parr T J amp Quong R W (1995) ANTLR A predicated- ll(k) parser

generator Software Practice and Experience 25 7 (July) 789-810

httpwwwantlrorgarticle1055550346383antlrpdf

[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific

Languages Pragmatic

[Peters] Christoph Peters The Ultimate 3D model file format (v 20)

httpultimate3dorgU3DModelFileFormatV2htm

[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003

httpxiphorgoggdocrfc3533txt

[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20

wwweblongcomzarfblorb

httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt

[ Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011

wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml

[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)

Format Version 10 2001 wwwlibpngorgpubmngspecCredits

31

[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File

Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008

httpscopenofficeorgexcelfileformatpdf

[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF

[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments

3072 March 2001 wwwietforgrfcrfc3072txt

[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft

26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt

[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and

old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21

[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988

httpamigan1emunetregANIMtxt

[Tamir 2005] Giora Tamir C RIFF Parser June 2005

wwwcodeprojectcomKBfilesriffparseraspx

[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications

In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi

Khosrow-Pour Information Science Publishing pp 268-273 2008

wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf

[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for

File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute

September 2009

[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical

Engineering The University of Queensland 1996

httpiteeuqeduau~cristinastudentsdaviddavidhtml

[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language

understanding system ARPA Human Language Technologies Workshop Plainsboro New

Jersey httpaclldcupenneduHH94H94-1039pdf

[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars

Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)

Springer-Verlag Berlin 1974 pp 9-16

32

[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley

1995 (chapter 10) (Attribute Grammars)

[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010

httpxiphorgvorbisdocVorbis_I_spechtml

[Zappe 1994] H Zappe Oktalyzer 1994

httpwwwwotsitorglistaspsearch=oktampbutton=GO21

33

Appendix A Chunk-based Binary File Formats

This appendix references specifications for about 75 chunk-based binary file formats and

discusses some of those specifications

A1 Electronic Arts Interchange File Format (IFF)

[Amigan 2011]

Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the

specification for the Interchange File Format [Morrison 1985]

A11 Interleaved Bitmap (ILBM)

[Morrison 1986]

This file format was discussed in section 41 of this report

A12 8-bit Sampled Voice (8SVX)

The file format specification for the 9-bit sampled voice file format uses C programming

language data structures to define the format [Hayes and Morrison 1985] The following is a re-

expression of that specification as a context-free binary file grammar

Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata

cksize rarr ULONG

ckdata rarr voice8header bodyck

ckdata rarr voice8header nameck copyrightck bodyck

ckdata rarr voice8header copyrightck nameck bodyck

ckdata rarr voice8header nameck bodyck

ckdata rarr voice8header copyrightck bodyck

ckdata rarr voice8header authorck bodyck

ckdata rarr voice8header annotation bodyck

ckdata rarr voice8header atakck releaseck bodyck

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 20: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

14

red = uByteValue green = uByteValue blue = uByteValue

$redValue = $redvalue

$greenValue = $greenvalue

$blueValue = $bluevalue

camg

CAMG ckSize longValue

Systemoutprintln(ID CAMG Size + $ckSizevalue)

Systemoutprintln( View Modes 0x +

LongtoHexString($longValuevalue))

The crng and ccrt data chunks

dataChunks

crng

| ccrt

crng

CRNG ckSize

Systemoutprintln(ID CRNG Size + $ckSizevalue)

cRange

cRange

pad1 = wordValue $pad1value == 0

rate = wordValue

active = wordValue

low = uByteValue

high = uByteValue

ccrt

CCRT ckSize

Systemoutprintln(ID CCRT Size + $ckSizevalue)

cycleInfo

cycleInfo

direction = wordValue

start = uByteValue

end = uByteValue

seconds = longValue

microseconds = longValue

pad = wordValue $padvalue == 0

body

BODY ckSize

Systemoutprintln(ID BODY Size + $ckSizevalue)

$ckSizevalue gt 0 data

Systemoutprintln( Number of bytes + $datavalue)

$datavalue == $ckSizevalue

defines the data chunk of the body chunk

data returns [long value]

(b+=BYTE)+ $bsize() gt 0

$value = $bsize()

15

The cksize is the value of a long integer

ckSize returns [long value]

longValue

$value = $longValuevalue

defines the longValue data typein terms of BYTE terminals

longValue returns [long value] throws UnsupportedEncodingException

h1 = BYTE h2 = BYTE l1= BYTE l2 = BYTE

$value = (long)

( getUByteValue($h1) ltlt 24 |

getUByteValue($h2) ltlt 16 |

getUByteValue($l1) ltlt 8 |

getUByteValue($l2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uWordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (short)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uByteValue returns [int value]

BYTE

calls the getUByteValue from a BYTE token

16

$value = (int) getUByteValue($BYTE)

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rules used to create the Lexer

FORM FORM

ILBM ILBM

BMHD BMHD

CMAP CMAP

CAMG CAMG

CRNG CRNG

CCRT CCRT

BODY BODY

BYTE

Figure 6 An ANTLR Grammar for the ILBM File Format

622 The Lexer and Parser

When provided the file described in the previous section ANTLR generates the ilbmlexer and

ilbmparser Java programs The Java application shown in Figure 6 uses these programs to

validate the ILBM file BLKIFF It produces the output shown in Figure 7

package orggtrigrammarschunk

import javaioIOException

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRStringStream

import organtlrruntimeCharStream

import organtlrruntimeCommonTokenStream

import organtlrruntimeRecognitionException

17

import organtlrruntimeTokenStream

import gtriorggrammarschunkiblmLexer

import gtriorggrammarschunkiblmParser

public class Test

static String filePath = CUserssl170PicturesIFF-ilbmBLKIFF

public static void main(String[] args) throws RecognitionException IOException

ANTLRFileStream stream = new ANTLRFileStream(filePath ISO-8859-1)

iblmLexer lexer = new iblmLexer(stream)

TokenStream tokenStream = new CommonTokenStream(lexer)

iblmParser parser = new iblmParser(tokenStream)

Systemoutprintln(FILE + filePath)

Systemoutprintln(Successfully parsed = + parseriblm())

Figure 7 A Java Application for Parsing an ILBM File

FILE CUserssl170PicturesIFF-ilbmBLKIFF

ID ILBM Size 2556

ID BMHD Size 20

Width 200

Height 144

X Position 0

Y Position 0

Number of Panes 6

Masking0

Compression1

Reserved10

TransParent Color0

X Aspest3

Y Aspect3

Page Width200

Page Height144

ID CMAP Size 48

Color[0] 0 0 0

Color[1] 0 128 128

Color[2] 0 255 255

Color[3] 255 0 255

Color[4] 0 255 0

Color[5] 255 0 0

Color[6] 0 0 255

Color[7] 255 255 0

Color[8] 128 128 128

Color[9] 192 192 192

Color[10] 128 0 0

Color[11] 0 128 0

Color[12] 128 128 0

Color[13] 0 0 128

Color[14] 128 0 128

18

Color[15] 255 255 255

ID CAMG Size 4

View Modes 0x800

ID BODY Size 2448

Number of bytes 2448

Successfully parsed = true

Figure 8 Output of the ILBM Parser Applied to the file BLKIFF

63 An ANTLR Grammar for the RIFF Waveform PCM File Format

Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR

The output of ANTLR is a lexer and parser for WAVEFOR PCM files

grammar waveform_pcm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text contained in lexerheader is placed at the top of the lexer

file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Default ByteOrder to Little Endian

String byteOrder = II

Rules used to create the Parser

start symbol waveform_pcm

waveform_pcm returns [boolean result]

19

ckId = fourCC $ckIdvalueequals(RIFF)

ckSize = dWordValue

formType = fourCC $formTypevalueequals(WAVE)

formatSubChunk

dataSubChunk

$result = true

fourCC returns [String value]

BYTE BYTE BYTE BYTE $value = $text

formatSubChunk

subChunkId = fourCC $subChunkIdvalueequals(fmt )

ckSize = dWordValue

commonFields

pcmSpecificField

commonFields

formatTag = wordValue Format category

channels = wordValue Number of channels

samplesPerSec = dWordValue Sampling rate

avgBytesPerSec = dWordValue For buffer estimation

blockAlign = wordValue Data block size

pcmSpecificField

bitsPerSample = wordValue Sample size

dataSubChunk

subChunkId = fourCC $subChunkIdvalueequals(data)

ckSize = dWordValue

waveData[$ckSizevalue]

waveData [long size]

(size gt= 0=gt BYTE size--)+

defines the long data type in terms of BYTE terminals

dWordValue returns [long value] throws UnsupportedEncodingException

b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE

if (byteOrderequals(II))

$value = (long)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 |

getUByteValue($b3) ltlt 16 |

getUByteValue($b4) ltlt 24 )

else

$value = (long)

( getUByteValue($b1) ltlt 24 |

getUByteValue($b2) ltlt 16 |

20

getUByteValue($b3) ltlt 8 |

getUByteValue($b4) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

if (byteOrderequals(II))

$value = (int)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 )

else

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rule used to create the Lexer

BYTE

Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format

7 Related Research

Researchers have developed a number of data description languages to support validation of file

formats EAST (Enhanced ADA Subset) is a data description language developed by the

Consultative Committee for Space Data Systems [2010] The Data Entity Dictionary

Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic

information The EAST description is used to interpret and provide access to information in

binary and text files

DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to

manipulate Java jar files and ELF object files The PADSC data description language can be

used to define the file formats of text and binary files [Fisher amp Gruber 2005] The PADSC

compiler compiles the description into tools that can be used to recognize manipulate and

transform the data into other formats

21

The Data Format Description Language (DFDL) is being developed by a Working Group of the

Open Grid Forum [OGF 2011] The data description language is based on a subset of W3C XML

Schema using ltxsappinfogt annotations to carry the extra information necessary to describe non-

XML physical representations Version 1 of the language specification was published in 2011

and a parser is being implemented

Each of the data description languages described above can be used to define data types and file

structures of binary files However they are based on file structure declarations such as those of

C or Java The binary file format grammar described in this paper most closely resembles DFDL

However the binary file grammar described in this paper is the only data description language

based on formal grammars that is used for creating recognizers for file formats

JHOVE (JSTORHarvard Object Validation Environment) is an extensible system designed to

provide automated and efficient identification and validation of the formats of digital files

JHOVE is a format-specific digital object validation API written in Java JHOVE supports

validation of the following formats AIFF ASCII GIF HTML JPEG JPEG 2000 PDF TIFF

UTF-8 WAVE and XML JHOVE2 is second generation validation environment [California

digital Library 2011]

The research reported in this paper is similar in intent to that of the JHOVE projectsmdashvalidation

of binary file formats As a matter of fact binary file grammars and parsers have been

constructed for the chunk-based file formats AIFF JPEG and WAVE as well as the directory-

based format TIFF However the research reported herein differs from that of the JHOVE

project in that what is sought is a technology for generating validators for binary file formats

from grammars specifying the binary file format

8 Conclusion

The research question addressed by his research is whether it is possible to extend the context-

free grammars used to specify the syntax of programming languages to the specification of

binary file formats and to use these grammars with parsers for validating the file formats of

binary files Amajor family of binary file formats chunk-based was described Then extensions

to the concepts of context-free grammars and attribute grammars were described that enable the

specification of binary file formats Examples of attribute array grammars for two chunk-based

binary file format were then presented Recursive descent parsers for this classes of grammars

was then described Experience in using ANTLR a parser generator for LL() string grammars

in generating parsers for recognizing the formats of binary files was discussed Finally related

research in data description languages for binary file formats was described and the JHOVE

project in creating Java-based tools for validating file formats was described

22

It is concluded that it is possible to extend context-free grammars to the specification of chunk-

based binary file formats Furthermore these grammars can be used with recursive descent

parsers for validating the file formats of chunk-based binary files It remains to be determined

whether these attribute grammars based on binary file (array) grammars are adequate to define

other families of binary file formats

To be a practical technology for generating recognizers (validators) for binary file formats a

parser generator is needed that creates from a binary file grammar a recursive descent parser

(with pushdown stack if needed) ANTLR which accepts LL() grammars as input and

generates a lexer as well as a parser is too heavyweight a parser generator for this task and does

not have the requisite data types built in

A capability to validate binary file formats assumes that the file format of a file is already

known In this research we use a file type identifier based on the UNIX file command and a

magic file that we have created [Underwood 2009] We have samples of the chunk-based file

formats discussed in this report including those referenced in Appendix a We also have file

signature tests (magic tests) for all these file types

We would like to have a parser generator (also called a compilerndashcompiler) that would take as

input an attribute grammar for a binary file format and generate a parser or interpreter that

recognizes the file format or that interprets the file format and displays or plays the contents The

closest parser generator that we have found to having these capabilities is ANTLR

(wwwantlrorg ) ANTLR will generate a top-down recursive descent parser from a context-free

attribute grammar However it works with strings and generates a separate scanner

Appendix A lists the names of about 75 file formats that belong to the family of ―chunk-based

file formats We have specifications for each of file formats and samples of files with these

formats

These file formats have a similar structure We have developed binary file grammars for two of

these binary file formats We should be able to do so for all of them We used these grammars

with the extensions to ANTLR to generate top-down recursive descent parsers for the file

formats defined with these grammars

The significance of success in this research task is that if binary file formats can be specified

with binary file grammars then only one parsing generator is needed to generate the parsers for

verifying that a file conforms to a particular format Similarly the same parsing generator can be

used for conversion of legacy file formats to current or standard formats Finally the same parser

generator can be used for generating viewerplayers for most file formats This would increase

23

the likelihood of preserving and making available into the indefinite future hose digital records

encoded in binary file formats

24

References

[Aho 1968] A V Aho Indexed Grammars ndash An Extension of Context-free Grammars Journal

of the ACM Vol 15 Issue 4 Oct 1968

[Alias-wavefront 1999] Maya File Formats Maya 20 1999

httpcaadarchethzchinfomayamanual

[Amigan 2011] Amigan Software IFF FORM and Chunk Registry

httpamigan1emunetregiffhtml

[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description

May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description

[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats

Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21

[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files

Version 13 Apple Computer Inc January 1989

wwwaesnashvilleorgPDFsAIFF20Specspdf

[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-

webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt

[Apple 2005] Apple Core Audio Format Specification 2005b

httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_

specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101

[Apple 2007] QuickTime File Format Specification

httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref

acehtml

[ASF 2001] ASF format version 10 reconstruction January 2001

httpavifilesourceforgenetasf-10htm

[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January

1997 wwwdcsedacukhomemxrgfx3d3DSspec

[Autodesk 2009] Autodesk DXF Reference January 2009

httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112

25

[Back 2002] G Back (2002) A specification and scripting language for binary data Generative

Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-

77

[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V

22 1979 pp 243-256

[Baum amp Noel 2002]D Baum amp G Noel The Simms Interchange File Format 2002

httpsimtechsourceforgenettechiffhtml

[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New

Horizons Software Inc 1987 httpamigan1emunetregWORDtxt

[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995

ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt

[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010

httpondasourceforgenetnlfnlfFormathtml

[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm amp Onda file

formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml

[California Digital Library 2011] California Digital Library JHOVE2 Userlsquos Guide 2011

httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf

[Caligari 1998] Caligri Caligari trueSpace file format Document Version 6 July

2002 httpwwwcaligaricomdownloadzipscncob65zip

[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information

Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements

and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-

t81pdf

[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description

Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf

[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998

[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt

[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D

Taliesin Inc httpamigan1emunetregDR2Dtxt

26

[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG

decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt

[Dunckley et al 2007] Dunckley M Rankin S Conway E amp Giaretta D (2007) The use of

file description languages for file format identification and validation Proc PV 2007

Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical

Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-

detailsw=50089

[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format

Tech 3285 July 2001 httptechebuchdocstechtech3285pdf

Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml

[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd

May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html

[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495

wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml

[Fisher amp Gruber 2005] Fisher K amp Gruber R (2005) PADS A domain specific language for

processing ad hoc data ACM Conference on Programming language Design and

Implementation ACM Press httpwwwpadsprojorgpaperspldipdf

[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG

1995

[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal

Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September

1997 wwwgnelsondemoncoukzspecquetzalhtml

[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called

Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)

httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt

[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03

httpwwwanim8orcomresourcesan8_formattxt

[Google 2010] WebP RIFF Container 2010

httpcodegooglecomspeedwebpdocsriff_containerhtml

[Haaland] Mike Haaland Flic Files (FLI) Format description

httpwwwmartinreddynetgfxanimFLItxt

27

[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube

Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf

[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice

Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt

[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata

Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1

1995

[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)

August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21

[IBM amp Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data

Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt

wwwtactilemediacominfoMCI_Control_Infohtml

httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf

[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec

11 2008 httpduskblueorgprojtoymidimidiformatpdf

[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10

ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf

[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable

Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C

Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110

[Issar amp Ward 1993] S Issar and W Ward CMUlsquos robust spoken language understanding

system EUOSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf

[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software

November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application

Version 50) wwwjasccomsupportkbarticlespspspecasp

[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image

File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version

22 April 2002

httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf

[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994

httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt

28

[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory

vol 2 pp 127_145 1968

[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical

Systems Theory vol 5 pp 95_96 1971

[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)

North-Holland Publ Co Amsterdam 1971 pp 95-109

[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language

Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml

httpwebcswpiedu~kalcoursescompilersmodule4mysahtml

[LightWave 1996] LightWave 3D Object File Format Oct 16 1996

httpsandboxdeosglightwavehtm

[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWo2] November 9 2001

httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml

[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer

1997 httppaulbourkenetdataformatscinema4dcinema4dpdf

[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and

Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf

[Maishein] Max Maischein PIC(PRO) Format

http14791177212extrafileformatgraphicspicpicprotxt

[Martin 2007] Michael Martin SVAF format description Version 02 January 2007

httpsvafsourceforgenetsvafhtml

[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and

Data Techniques July 29 1992 Revision 1097

httpnetghostnarodrugffvendspecmicriffms_rifftxt

[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New

Multimedia Data Types and Data Techniques Revision 30 April 15 1994

[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE

httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-

usstreamhhstreamaudprop_7wz7asp

[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003

December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx

29

[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-

uslibraryms899421aspx

[Morrison 1985] Jerry Morrison ―EA IFF 85rdquo Standard for Interchange Format Files

Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt

[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17

1986 wwwfine-viewcomjplabsdocilbmtxt

[Morrison 1986b] J Morrison ―SMUS IFF Simple Musical Score Electronic Arts Inc

February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus

[MultimediaWiki 2006] Smush Animation Format August 2006

httpwikimultimediacxindexphptitle=SANChunk_Forma

[MultimeduiaWiki 2008a] Electronic Arts CMV April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_CMV

[MultimediaWiki 2008b] Electronic Arts DCT April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_DCT

[MultimediaWiki 2008c] Electronic Arts MAD April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MAD

[MultimediaWiki 2008d] Electronic Arts MPC April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MPC

[MultimediaWiki 2008e] Electronic Arts VP6 April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_VP6

[MultimediaWiki 2008f] Electronic Arts TGV August 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGV

[MultimediaWiki 2008g] Electronic Arts TGQ November 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ

[MultimediaWiki 2008h] American Laser Games MM January 2008

httpwikimultimediacxindexphptitle=American_Laser_Games_MM

[MultimediaWiki 2009a] Electronic Arts TQI February 2009

httpwikimultimediacxindexphptitle=Electronic_Arts_TQI

30

[MultimediaWiki 2009b] SANM October 2009

httpwikimultimediacxindexphptitle=SANM

[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240

httpcodegooglecompll-parser-

generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42

[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter

written in Python in the open source graphics file convertor UniConvertor

httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer

[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification

January 31 2011 wwwogforgdocumentsGFD174pdf

[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988

wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt

[Parr amp Quong 1995] Parr T J amp Quong R W (1995) ANTLR A predicated- ll(k) parser

generator Software Practice and Experience 25 7 (July) 789-810

httpwwwantlrorgarticle1055550346383antlrpdf

[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific

Languages Pragmatic

[Peters] Christoph Peters The Ultimate 3D model file format (v 20)

httpultimate3dorgU3DModelFileFormatV2htm

[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003

httpxiphorgoggdocrfc3533txt

[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20

wwweblongcomzarfblorb

httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt

[ Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011

wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml

[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)

Format Version 10 2001 wwwlibpngorgpubmngspecCredits

31

[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File

Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008

httpscopenofficeorgexcelfileformatpdf

[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF

[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments

3072 March 2001 wwwietforgrfcrfc3072txt

[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft

26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt

[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and

old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21

[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988

httpamigan1emunetregANIMtxt

[Tamir 2005] Giora Tamir C RIFF Parser June 2005

wwwcodeprojectcomKBfilesriffparseraspx

[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications

In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi

Khosrow-Pour Information Science Publishing pp 268-273 2008

wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf

[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for

File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute

September 2009

[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical

Engineering The University of Queensland 1996

httpiteeuqeduau~cristinastudentsdaviddavidhtml

[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language

understanding system ARPA Human Language Technologies Workshop Plainsboro New

Jersey httpaclldcupenneduHH94H94-1039pdf

[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars

Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)

Springer-Verlag Berlin 1974 pp 9-16

32

[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley

1995 (chapter 10) (Attribute Grammars)

[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010

httpxiphorgvorbisdocVorbis_I_spechtml

[Zappe 1994] H Zappe Oktalyzer 1994

httpwwwwotsitorglistaspsearch=oktampbutton=GO21

33

Appendix A Chunk-based Binary File Formats

This appendix references specifications for about 75 chunk-based binary file formats and

discusses some of those specifications

A1 Electronic Arts Interchange File Format (IFF)

[Amigan 2011]

Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the

specification for the Interchange File Format [Morrison 1985]

A11 Interleaved Bitmap (ILBM)

[Morrison 1986]

This file format was discussed in section 41 of this report

A12 8-bit Sampled Voice (8SVX)

The file format specification for the 9-bit sampled voice file format uses C programming

language data structures to define the format [Hayes and Morrison 1985] The following is a re-

expression of that specification as a context-free binary file grammar

Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata

cksize rarr ULONG

ckdata rarr voice8header bodyck

ckdata rarr voice8header nameck copyrightck bodyck

ckdata rarr voice8header copyrightck nameck bodyck

ckdata rarr voice8header nameck bodyck

ckdata rarr voice8header copyrightck bodyck

ckdata rarr voice8header authorck bodyck

ckdata rarr voice8header annotation bodyck

ckdata rarr voice8header atakck releaseck bodyck

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 21: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

15

The cksize is the value of a long integer

ckSize returns [long value]

longValue

$value = $longValuevalue

defines the longValue data typein terms of BYTE terminals

longValue returns [long value] throws UnsupportedEncodingException

h1 = BYTE h2 = BYTE l1= BYTE l2 = BYTE

$value = (long)

( getUByteValue($h1) ltlt 24 |

getUByteValue($h2) ltlt 16 |

getUByteValue($l1) ltlt 8 |

getUByteValue($l2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uWordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

$value = (short)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2))

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the unsigned word data type

uByteValue returns [int value]

BYTE

calls the getUByteValue from a BYTE token

16

$value = (int) getUByteValue($BYTE)

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rules used to create the Lexer

FORM FORM

ILBM ILBM

BMHD BMHD

CMAP CMAP

CAMG CAMG

CRNG CRNG

CCRT CCRT

BODY BODY

BYTE

Figure 6 An ANTLR Grammar for the ILBM File Format

622 The Lexer and Parser

When provided the file described in the previous section ANTLR generates the ilbmlexer and

ilbmparser Java programs The Java application shown in Figure 6 uses these programs to

validate the ILBM file BLKIFF It produces the output shown in Figure 7

package orggtrigrammarschunk

import javaioIOException

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRStringStream

import organtlrruntimeCharStream

import organtlrruntimeCommonTokenStream

import organtlrruntimeRecognitionException

17

import organtlrruntimeTokenStream

import gtriorggrammarschunkiblmLexer

import gtriorggrammarschunkiblmParser

public class Test

static String filePath = CUserssl170PicturesIFF-ilbmBLKIFF

public static void main(String[] args) throws RecognitionException IOException

ANTLRFileStream stream = new ANTLRFileStream(filePath ISO-8859-1)

iblmLexer lexer = new iblmLexer(stream)

TokenStream tokenStream = new CommonTokenStream(lexer)

iblmParser parser = new iblmParser(tokenStream)

Systemoutprintln(FILE + filePath)

Systemoutprintln(Successfully parsed = + parseriblm())

Figure 7 A Java Application for Parsing an ILBM File

FILE CUserssl170PicturesIFF-ilbmBLKIFF

ID ILBM Size 2556

ID BMHD Size 20

Width 200

Height 144

X Position 0

Y Position 0

Number of Panes 6

Masking0

Compression1

Reserved10

TransParent Color0

X Aspest3

Y Aspect3

Page Width200

Page Height144

ID CMAP Size 48

Color[0] 0 0 0

Color[1] 0 128 128

Color[2] 0 255 255

Color[3] 255 0 255

Color[4] 0 255 0

Color[5] 255 0 0

Color[6] 0 0 255

Color[7] 255 255 0

Color[8] 128 128 128

Color[9] 192 192 192

Color[10] 128 0 0

Color[11] 0 128 0

Color[12] 128 128 0

Color[13] 0 0 128

Color[14] 128 0 128

18

Color[15] 255 255 255

ID CAMG Size 4

View Modes 0x800

ID BODY Size 2448

Number of bytes 2448

Successfully parsed = true

Figure 8 Output of the ILBM Parser Applied to the file BLKIFF

63 An ANTLR Grammar for the RIFF Waveform PCM File Format

Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR

The output of ANTLR is a lexer and parser for WAVEFOR PCM files

grammar waveform_pcm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text contained in lexerheader is placed at the top of the lexer

file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Default ByteOrder to Little Endian

String byteOrder = II

Rules used to create the Parser

start symbol waveform_pcm

waveform_pcm returns [boolean result]

19

ckId = fourCC $ckIdvalueequals(RIFF)

ckSize = dWordValue

formType = fourCC $formTypevalueequals(WAVE)

formatSubChunk

dataSubChunk

$result = true

fourCC returns [String value]

BYTE BYTE BYTE BYTE $value = $text

formatSubChunk

subChunkId = fourCC $subChunkIdvalueequals(fmt )

ckSize = dWordValue

commonFields

pcmSpecificField

commonFields

formatTag = wordValue Format category

channels = wordValue Number of channels

samplesPerSec = dWordValue Sampling rate

avgBytesPerSec = dWordValue For buffer estimation

blockAlign = wordValue Data block size

pcmSpecificField

bitsPerSample = wordValue Sample size

dataSubChunk

subChunkId = fourCC $subChunkIdvalueequals(data)

ckSize = dWordValue

waveData[$ckSizevalue]

waveData [long size]

(size gt= 0=gt BYTE size--)+

defines the long data type in terms of BYTE terminals

dWordValue returns [long value] throws UnsupportedEncodingException

b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE

if (byteOrderequals(II))

$value = (long)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 |

getUByteValue($b3) ltlt 16 |

getUByteValue($b4) ltlt 24 )

else

$value = (long)

( getUByteValue($b1) ltlt 24 |

getUByteValue($b2) ltlt 16 |

20

getUByteValue($b3) ltlt 8 |

getUByteValue($b4) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

if (byteOrderequals(II))

$value = (int)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 )

else

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rule used to create the Lexer

BYTE

Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format

7 Related Research

Researchers have developed a number of data description languages to support validation of file

formats EAST (Enhanced ADA Subset) is a data description language developed by the

Consultative Committee for Space Data Systems [2010] The Data Entity Dictionary

Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic

information The EAST description is used to interpret and provide access to information in

binary and text files

DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to

manipulate Java jar files and ELF object files The PADSC data description language can be

used to define the file formats of text and binary files [Fisher amp Gruber 2005] The PADSC

compiler compiles the description into tools that can be used to recognize manipulate and

transform the data into other formats

21

The Data Format Description Language (DFDL) is being developed by a Working Group of the

Open Grid Forum [OGF 2011] The data description language is based on a subset of W3C XML

Schema using ltxsappinfogt annotations to carry the extra information necessary to describe non-

XML physical representations Version 1 of the language specification was published in 2011

and a parser is being implemented

Each of the data description languages described above can be used to define data types and file

structures of binary files However they are based on file structure declarations such as those of

C or Java The binary file format grammar described in this paper most closely resembles DFDL

However the binary file grammar described in this paper is the only data description language

based on formal grammars that is used for creating recognizers for file formats

JHOVE (JSTORHarvard Object Validation Environment) is an extensible system designed to

provide automated and efficient identification and validation of the formats of digital files

JHOVE is a format-specific digital object validation API written in Java JHOVE supports

validation of the following formats AIFF ASCII GIF HTML JPEG JPEG 2000 PDF TIFF

UTF-8 WAVE and XML JHOVE2 is second generation validation environment [California

digital Library 2011]

The research reported in this paper is similar in intent to that of the JHOVE projectsmdashvalidation

of binary file formats As a matter of fact binary file grammars and parsers have been

constructed for the chunk-based file formats AIFF JPEG and WAVE as well as the directory-

based format TIFF However the research reported herein differs from that of the JHOVE

project in that what is sought is a technology for generating validators for binary file formats

from grammars specifying the binary file format

8 Conclusion

The research question addressed by his research is whether it is possible to extend the context-

free grammars used to specify the syntax of programming languages to the specification of

binary file formats and to use these grammars with parsers for validating the file formats of

binary files Amajor family of binary file formats chunk-based was described Then extensions

to the concepts of context-free grammars and attribute grammars were described that enable the

specification of binary file formats Examples of attribute array grammars for two chunk-based

binary file format were then presented Recursive descent parsers for this classes of grammars

was then described Experience in using ANTLR a parser generator for LL() string grammars

in generating parsers for recognizing the formats of binary files was discussed Finally related

research in data description languages for binary file formats was described and the JHOVE

project in creating Java-based tools for validating file formats was described

22

It is concluded that it is possible to extend context-free grammars to the specification of chunk-

based binary file formats Furthermore these grammars can be used with recursive descent

parsers for validating the file formats of chunk-based binary files It remains to be determined

whether these attribute grammars based on binary file (array) grammars are adequate to define

other families of binary file formats

To be a practical technology for generating recognizers (validators) for binary file formats a

parser generator is needed that creates from a binary file grammar a recursive descent parser

(with pushdown stack if needed) ANTLR which accepts LL() grammars as input and

generates a lexer as well as a parser is too heavyweight a parser generator for this task and does

not have the requisite data types built in

A capability to validate binary file formats assumes that the file format of a file is already

known In this research we use a file type identifier based on the UNIX file command and a

magic file that we have created [Underwood 2009] We have samples of the chunk-based file

formats discussed in this report including those referenced in Appendix a We also have file

signature tests (magic tests) for all these file types

We would like to have a parser generator (also called a compilerndashcompiler) that would take as

input an attribute grammar for a binary file format and generate a parser or interpreter that

recognizes the file format or that interprets the file format and displays or plays the contents The

closest parser generator that we have found to having these capabilities is ANTLR

(wwwantlrorg ) ANTLR will generate a top-down recursive descent parser from a context-free

attribute grammar However it works with strings and generates a separate scanner

Appendix A lists the names of about 75 file formats that belong to the family of ―chunk-based

file formats We have specifications for each of file formats and samples of files with these

formats

These file formats have a similar structure We have developed binary file grammars for two of

these binary file formats We should be able to do so for all of them We used these grammars

with the extensions to ANTLR to generate top-down recursive descent parsers for the file

formats defined with these grammars

The significance of success in this research task is that if binary file formats can be specified

with binary file grammars then only one parsing generator is needed to generate the parsers for

verifying that a file conforms to a particular format Similarly the same parsing generator can be

used for conversion of legacy file formats to current or standard formats Finally the same parser

generator can be used for generating viewerplayers for most file formats This would increase

23

the likelihood of preserving and making available into the indefinite future hose digital records

encoded in binary file formats

24

References

[Aho 1968] A V Aho Indexed Grammars ndash An Extension of Context-free Grammars Journal

of the ACM Vol 15 Issue 4 Oct 1968

[Alias-wavefront 1999] Maya File Formats Maya 20 1999

httpcaadarchethzchinfomayamanual

[Amigan 2011] Amigan Software IFF FORM and Chunk Registry

httpamigan1emunetregiffhtml

[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description

May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description

[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats

Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21

[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files

Version 13 Apple Computer Inc January 1989

wwwaesnashvilleorgPDFsAIFF20Specspdf

[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-

webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt

[Apple 2005] Apple Core Audio Format Specification 2005b

httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_

specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101

[Apple 2007] QuickTime File Format Specification

httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref

acehtml

[ASF 2001] ASF format version 10 reconstruction January 2001

httpavifilesourceforgenetasf-10htm

[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January

1997 wwwdcsedacukhomemxrgfx3d3DSspec

[Autodesk 2009] Autodesk DXF Reference January 2009

httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112

25

[Back 2002] G Back (2002) A specification and scripting language for binary data Generative

Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-

77

[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V

22 1979 pp 243-256

[Baum amp Noel 2002]D Baum amp G Noel The Simms Interchange File Format 2002

httpsimtechsourceforgenettechiffhtml

[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New

Horizons Software Inc 1987 httpamigan1emunetregWORDtxt

[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995

ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt

[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010

httpondasourceforgenetnlfnlfFormathtml

[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm amp Onda file

formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml

[California Digital Library 2011] California Digital Library JHOVE2 Userlsquos Guide 2011

httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf

[Caligari 1998] Caligri Caligari trueSpace file format Document Version 6 July

2002 httpwwwcaligaricomdownloadzipscncob65zip

[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information

Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements

and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-

t81pdf

[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description

Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf

[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998

[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt

[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D

Taliesin Inc httpamigan1emunetregDR2Dtxt

26

[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG

decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt

[Dunckley et al 2007] Dunckley M Rankin S Conway E amp Giaretta D (2007) The use of

file description languages for file format identification and validation Proc PV 2007

Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical

Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-

detailsw=50089

[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format

Tech 3285 July 2001 httptechebuchdocstechtech3285pdf

Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml

[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd

May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html

[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495

wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml

[Fisher amp Gruber 2005] Fisher K amp Gruber R (2005) PADS A domain specific language for

processing ad hoc data ACM Conference on Programming language Design and

Implementation ACM Press httpwwwpadsprojorgpaperspldipdf

[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG

1995

[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal

Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September

1997 wwwgnelsondemoncoukzspecquetzalhtml

[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called

Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)

httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt

[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03

httpwwwanim8orcomresourcesan8_formattxt

[Google 2010] WebP RIFF Container 2010

httpcodegooglecomspeedwebpdocsriff_containerhtml

[Haaland] Mike Haaland Flic Files (FLI) Format description

httpwwwmartinreddynetgfxanimFLItxt

27

[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube

Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf

[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice

Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt

[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata

Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1

1995

[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)

August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21

[IBM amp Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data

Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt

wwwtactilemediacominfoMCI_Control_Infohtml

httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf

[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec

11 2008 httpduskblueorgprojtoymidimidiformatpdf

[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10

ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf

[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable

Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C

Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110

[Issar amp Ward 1993] S Issar and W Ward CMUlsquos robust spoken language understanding

system EUOSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf

[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software

November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application

Version 50) wwwjasccomsupportkbarticlespspspecasp

[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image

File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version

22 April 2002

httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf

[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994

httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt

28

[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory

vol 2 pp 127_145 1968

[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical

Systems Theory vol 5 pp 95_96 1971

[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)

North-Holland Publ Co Amsterdam 1971 pp 95-109

[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language

Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml

httpwebcswpiedu~kalcoursescompilersmodule4mysahtml

[LightWave 1996] LightWave 3D Object File Format Oct 16 1996

httpsandboxdeosglightwavehtm

[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWo2] November 9 2001

httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml

[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer

1997 httppaulbourkenetdataformatscinema4dcinema4dpdf

[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and

Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf

[Maishein] Max Maischein PIC(PRO) Format

http14791177212extrafileformatgraphicspicpicprotxt

[Martin 2007] Michael Martin SVAF format description Version 02 January 2007

httpsvafsourceforgenetsvafhtml

[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and

Data Techniques July 29 1992 Revision 1097

httpnetghostnarodrugffvendspecmicriffms_rifftxt

[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New

Multimedia Data Types and Data Techniques Revision 30 April 15 1994

[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE

httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-

usstreamhhstreamaudprop_7wz7asp

[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003

December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx

29

[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-

uslibraryms899421aspx

[Morrison 1985] Jerry Morrison ―EA IFF 85rdquo Standard for Interchange Format Files

Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt

[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17

1986 wwwfine-viewcomjplabsdocilbmtxt

[Morrison 1986b] J Morrison ―SMUS IFF Simple Musical Score Electronic Arts Inc

February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus

[MultimediaWiki 2006] Smush Animation Format August 2006

httpwikimultimediacxindexphptitle=SANChunk_Forma

[MultimeduiaWiki 2008a] Electronic Arts CMV April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_CMV

[MultimediaWiki 2008b] Electronic Arts DCT April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_DCT

[MultimediaWiki 2008c] Electronic Arts MAD April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MAD

[MultimediaWiki 2008d] Electronic Arts MPC April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MPC

[MultimediaWiki 2008e] Electronic Arts VP6 April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_VP6

[MultimediaWiki 2008f] Electronic Arts TGV August 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGV

[MultimediaWiki 2008g] Electronic Arts TGQ November 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ

[MultimediaWiki 2008h] American Laser Games MM January 2008

httpwikimultimediacxindexphptitle=American_Laser_Games_MM

[MultimediaWiki 2009a] Electronic Arts TQI February 2009

httpwikimultimediacxindexphptitle=Electronic_Arts_TQI

30

[MultimediaWiki 2009b] SANM October 2009

httpwikimultimediacxindexphptitle=SANM

[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240

httpcodegooglecompll-parser-

generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42

[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter

written in Python in the open source graphics file convertor UniConvertor

httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer

[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification

January 31 2011 wwwogforgdocumentsGFD174pdf

[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988

wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt

[Parr amp Quong 1995] Parr T J amp Quong R W (1995) ANTLR A predicated- ll(k) parser

generator Software Practice and Experience 25 7 (July) 789-810

httpwwwantlrorgarticle1055550346383antlrpdf

[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific

Languages Pragmatic

[Peters] Christoph Peters The Ultimate 3D model file format (v 20)

httpultimate3dorgU3DModelFileFormatV2htm

[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003

httpxiphorgoggdocrfc3533txt

[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20

wwweblongcomzarfblorb

httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt

[ Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011

wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml

[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)

Format Version 10 2001 wwwlibpngorgpubmngspecCredits

31

[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File

Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008

httpscopenofficeorgexcelfileformatpdf

[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF

[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments

3072 March 2001 wwwietforgrfcrfc3072txt

[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft

26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt

[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and

old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21

[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988

httpamigan1emunetregANIMtxt

[Tamir 2005] Giora Tamir C RIFF Parser June 2005

wwwcodeprojectcomKBfilesriffparseraspx

[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications

In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi

Khosrow-Pour Information Science Publishing pp 268-273 2008

wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf

[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for

File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute

September 2009

[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical

Engineering The University of Queensland 1996

httpiteeuqeduau~cristinastudentsdaviddavidhtml

[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language

understanding system ARPA Human Language Technologies Workshop Plainsboro New

Jersey httpaclldcupenneduHH94H94-1039pdf

[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars

Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)

Springer-Verlag Berlin 1974 pp 9-16

32

[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley

1995 (chapter 10) (Attribute Grammars)

[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010

httpxiphorgvorbisdocVorbis_I_spechtml

[Zappe 1994] H Zappe Oktalyzer 1994

httpwwwwotsitorglistaspsearch=oktampbutton=GO21

33

Appendix A Chunk-based Binary File Formats

This appendix references specifications for about 75 chunk-based binary file formats and

discusses some of those specifications

A1 Electronic Arts Interchange File Format (IFF)

[Amigan 2011]

Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the

specification for the Interchange File Format [Morrison 1985]

A11 Interleaved Bitmap (ILBM)

[Morrison 1986]

This file format was discussed in section 41 of this report

A12 8-bit Sampled Voice (8SVX)

The file format specification for the 9-bit sampled voice file format uses C programming

language data structures to define the format [Hayes and Morrison 1985] The following is a re-

expression of that specification as a context-free binary file grammar

Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata

cksize rarr ULONG

ckdata rarr voice8header bodyck

ckdata rarr voice8header nameck copyrightck bodyck

ckdata rarr voice8header copyrightck nameck bodyck

ckdata rarr voice8header nameck bodyck

ckdata rarr voice8header copyrightck bodyck

ckdata rarr voice8header authorck bodyck

ckdata rarr voice8header annotation bodyck

ckdata rarr voice8header atakck releaseck bodyck

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 22: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

16

$value = (int) getUByteValue($BYTE)

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rules used to create the Lexer

FORM FORM

ILBM ILBM

BMHD BMHD

CMAP CMAP

CAMG CAMG

CRNG CRNG

CCRT CCRT

BODY BODY

BYTE

Figure 6 An ANTLR Grammar for the ILBM File Format

622 The Lexer and Parser

When provided the file described in the previous section ANTLR generates the ilbmlexer and

ilbmparser Java programs The Java application shown in Figure 6 uses these programs to

validate the ILBM file BLKIFF It produces the output shown in Figure 7

package orggtrigrammarschunk

import javaioIOException

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRFileStream

import organtlrruntimeANTLRStringStream

import organtlrruntimeCharStream

import organtlrruntimeCommonTokenStream

import organtlrruntimeRecognitionException

17

import organtlrruntimeTokenStream

import gtriorggrammarschunkiblmLexer

import gtriorggrammarschunkiblmParser

public class Test

static String filePath = CUserssl170PicturesIFF-ilbmBLKIFF

public static void main(String[] args) throws RecognitionException IOException

ANTLRFileStream stream = new ANTLRFileStream(filePath ISO-8859-1)

iblmLexer lexer = new iblmLexer(stream)

TokenStream tokenStream = new CommonTokenStream(lexer)

iblmParser parser = new iblmParser(tokenStream)

Systemoutprintln(FILE + filePath)

Systemoutprintln(Successfully parsed = + parseriblm())

Figure 7 A Java Application for Parsing an ILBM File

FILE CUserssl170PicturesIFF-ilbmBLKIFF

ID ILBM Size 2556

ID BMHD Size 20

Width 200

Height 144

X Position 0

Y Position 0

Number of Panes 6

Masking0

Compression1

Reserved10

TransParent Color0

X Aspest3

Y Aspect3

Page Width200

Page Height144

ID CMAP Size 48

Color[0] 0 0 0

Color[1] 0 128 128

Color[2] 0 255 255

Color[3] 255 0 255

Color[4] 0 255 0

Color[5] 255 0 0

Color[6] 0 0 255

Color[7] 255 255 0

Color[8] 128 128 128

Color[9] 192 192 192

Color[10] 128 0 0

Color[11] 0 128 0

Color[12] 128 128 0

Color[13] 0 0 128

Color[14] 128 0 128

18

Color[15] 255 255 255

ID CAMG Size 4

View Modes 0x800

ID BODY Size 2448

Number of bytes 2448

Successfully parsed = true

Figure 8 Output of the ILBM Parser Applied to the file BLKIFF

63 An ANTLR Grammar for the RIFF Waveform PCM File Format

Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR

The output of ANTLR is a lexer and parser for WAVEFOR PCM files

grammar waveform_pcm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text contained in lexerheader is placed at the top of the lexer

file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Default ByteOrder to Little Endian

String byteOrder = II

Rules used to create the Parser

start symbol waveform_pcm

waveform_pcm returns [boolean result]

19

ckId = fourCC $ckIdvalueequals(RIFF)

ckSize = dWordValue

formType = fourCC $formTypevalueequals(WAVE)

formatSubChunk

dataSubChunk

$result = true

fourCC returns [String value]

BYTE BYTE BYTE BYTE $value = $text

formatSubChunk

subChunkId = fourCC $subChunkIdvalueequals(fmt )

ckSize = dWordValue

commonFields

pcmSpecificField

commonFields

formatTag = wordValue Format category

channels = wordValue Number of channels

samplesPerSec = dWordValue Sampling rate

avgBytesPerSec = dWordValue For buffer estimation

blockAlign = wordValue Data block size

pcmSpecificField

bitsPerSample = wordValue Sample size

dataSubChunk

subChunkId = fourCC $subChunkIdvalueequals(data)

ckSize = dWordValue

waveData[$ckSizevalue]

waveData [long size]

(size gt= 0=gt BYTE size--)+

defines the long data type in terms of BYTE terminals

dWordValue returns [long value] throws UnsupportedEncodingException

b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE

if (byteOrderequals(II))

$value = (long)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 |

getUByteValue($b3) ltlt 16 |

getUByteValue($b4) ltlt 24 )

else

$value = (long)

( getUByteValue($b1) ltlt 24 |

getUByteValue($b2) ltlt 16 |

20

getUByteValue($b3) ltlt 8 |

getUByteValue($b4) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

if (byteOrderequals(II))

$value = (int)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 )

else

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rule used to create the Lexer

BYTE

Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format

7 Related Research

Researchers have developed a number of data description languages to support validation of file

formats EAST (Enhanced ADA Subset) is a data description language developed by the

Consultative Committee for Space Data Systems [2010] The Data Entity Dictionary

Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic

information The EAST description is used to interpret and provide access to information in

binary and text files

DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to

manipulate Java jar files and ELF object files The PADSC data description language can be

used to define the file formats of text and binary files [Fisher amp Gruber 2005] The PADSC

compiler compiles the description into tools that can be used to recognize manipulate and

transform the data into other formats

21

The Data Format Description Language (DFDL) is being developed by a Working Group of the

Open Grid Forum [OGF 2011] The data description language is based on a subset of W3C XML

Schema using ltxsappinfogt annotations to carry the extra information necessary to describe non-

XML physical representations Version 1 of the language specification was published in 2011

and a parser is being implemented

Each of the data description languages described above can be used to define data types and file

structures of binary files However they are based on file structure declarations such as those of

C or Java The binary file format grammar described in this paper most closely resembles DFDL

However the binary file grammar described in this paper is the only data description language

based on formal grammars that is used for creating recognizers for file formats

JHOVE (JSTORHarvard Object Validation Environment) is an extensible system designed to

provide automated and efficient identification and validation of the formats of digital files

JHOVE is a format-specific digital object validation API written in Java JHOVE supports

validation of the following formats AIFF ASCII GIF HTML JPEG JPEG 2000 PDF TIFF

UTF-8 WAVE and XML JHOVE2 is second generation validation environment [California

digital Library 2011]

The research reported in this paper is similar in intent to that of the JHOVE projectsmdashvalidation

of binary file formats As a matter of fact binary file grammars and parsers have been

constructed for the chunk-based file formats AIFF JPEG and WAVE as well as the directory-

based format TIFF However the research reported herein differs from that of the JHOVE

project in that what is sought is a technology for generating validators for binary file formats

from grammars specifying the binary file format

8 Conclusion

The research question addressed by his research is whether it is possible to extend the context-

free grammars used to specify the syntax of programming languages to the specification of

binary file formats and to use these grammars with parsers for validating the file formats of

binary files Amajor family of binary file formats chunk-based was described Then extensions

to the concepts of context-free grammars and attribute grammars were described that enable the

specification of binary file formats Examples of attribute array grammars for two chunk-based

binary file format were then presented Recursive descent parsers for this classes of grammars

was then described Experience in using ANTLR a parser generator for LL() string grammars

in generating parsers for recognizing the formats of binary files was discussed Finally related

research in data description languages for binary file formats was described and the JHOVE

project in creating Java-based tools for validating file formats was described

22

It is concluded that it is possible to extend context-free grammars to the specification of chunk-

based binary file formats Furthermore these grammars can be used with recursive descent

parsers for validating the file formats of chunk-based binary files It remains to be determined

whether these attribute grammars based on binary file (array) grammars are adequate to define

other families of binary file formats

To be a practical technology for generating recognizers (validators) for binary file formats a

parser generator is needed that creates from a binary file grammar a recursive descent parser

(with pushdown stack if needed) ANTLR which accepts LL() grammars as input and

generates a lexer as well as a parser is too heavyweight a parser generator for this task and does

not have the requisite data types built in

A capability to validate binary file formats assumes that the file format of a file is already

known In this research we use a file type identifier based on the UNIX file command and a

magic file that we have created [Underwood 2009] We have samples of the chunk-based file

formats discussed in this report including those referenced in Appendix a We also have file

signature tests (magic tests) for all these file types

We would like to have a parser generator (also called a compilerndashcompiler) that would take as

input an attribute grammar for a binary file format and generate a parser or interpreter that

recognizes the file format or that interprets the file format and displays or plays the contents The

closest parser generator that we have found to having these capabilities is ANTLR

(wwwantlrorg ) ANTLR will generate a top-down recursive descent parser from a context-free

attribute grammar However it works with strings and generates a separate scanner

Appendix A lists the names of about 75 file formats that belong to the family of ―chunk-based

file formats We have specifications for each of file formats and samples of files with these

formats

These file formats have a similar structure We have developed binary file grammars for two of

these binary file formats We should be able to do so for all of them We used these grammars

with the extensions to ANTLR to generate top-down recursive descent parsers for the file

formats defined with these grammars

The significance of success in this research task is that if binary file formats can be specified

with binary file grammars then only one parsing generator is needed to generate the parsers for

verifying that a file conforms to a particular format Similarly the same parsing generator can be

used for conversion of legacy file formats to current or standard formats Finally the same parser

generator can be used for generating viewerplayers for most file formats This would increase

23

the likelihood of preserving and making available into the indefinite future hose digital records

encoded in binary file formats

24

References

[Aho 1968] A V Aho Indexed Grammars ndash An Extension of Context-free Grammars Journal

of the ACM Vol 15 Issue 4 Oct 1968

[Alias-wavefront 1999] Maya File Formats Maya 20 1999

httpcaadarchethzchinfomayamanual

[Amigan 2011] Amigan Software IFF FORM and Chunk Registry

httpamigan1emunetregiffhtml

[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description

May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description

[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats

Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21

[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files

Version 13 Apple Computer Inc January 1989

wwwaesnashvilleorgPDFsAIFF20Specspdf

[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-

webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt

[Apple 2005] Apple Core Audio Format Specification 2005b

httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_

specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101

[Apple 2007] QuickTime File Format Specification

httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref

acehtml

[ASF 2001] ASF format version 10 reconstruction January 2001

httpavifilesourceforgenetasf-10htm

[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January

1997 wwwdcsedacukhomemxrgfx3d3DSspec

[Autodesk 2009] Autodesk DXF Reference January 2009

httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112

25

[Back 2002] G Back (2002) A specification and scripting language for binary data Generative

Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-

77

[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V

22 1979 pp 243-256

[Baum amp Noel 2002]D Baum amp G Noel The Simms Interchange File Format 2002

httpsimtechsourceforgenettechiffhtml

[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New

Horizons Software Inc 1987 httpamigan1emunetregWORDtxt

[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995

ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt

[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010

httpondasourceforgenetnlfnlfFormathtml

[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm amp Onda file

formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml

[California Digital Library 2011] California Digital Library JHOVE2 Userlsquos Guide 2011

httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf

[Caligari 1998] Caligri Caligari trueSpace file format Document Version 6 July

2002 httpwwwcaligaricomdownloadzipscncob65zip

[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information

Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements

and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-

t81pdf

[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description

Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf

[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998

[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt

[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D

Taliesin Inc httpamigan1emunetregDR2Dtxt

26

[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG

decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt

[Dunckley et al 2007] Dunckley M Rankin S Conway E amp Giaretta D (2007) The use of

file description languages for file format identification and validation Proc PV 2007

Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical

Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-

detailsw=50089

[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format

Tech 3285 July 2001 httptechebuchdocstechtech3285pdf

Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml

[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd

May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html

[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495

wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml

[Fisher amp Gruber 2005] Fisher K amp Gruber R (2005) PADS A domain specific language for

processing ad hoc data ACM Conference on Programming language Design and

Implementation ACM Press httpwwwpadsprojorgpaperspldipdf

[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG

1995

[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal

Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September

1997 wwwgnelsondemoncoukzspecquetzalhtml

[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called

Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)

httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt

[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03

httpwwwanim8orcomresourcesan8_formattxt

[Google 2010] WebP RIFF Container 2010

httpcodegooglecomspeedwebpdocsriff_containerhtml

[Haaland] Mike Haaland Flic Files (FLI) Format description

httpwwwmartinreddynetgfxanimFLItxt

27

[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube

Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf

[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice

Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt

[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata

Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1

1995

[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)

August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21

[IBM amp Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data

Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt

wwwtactilemediacominfoMCI_Control_Infohtml

httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf

[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec

11 2008 httpduskblueorgprojtoymidimidiformatpdf

[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10

ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf

[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable

Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C

Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110

[Issar amp Ward 1993] S Issar and W Ward CMUlsquos robust spoken language understanding

system EUOSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf

[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software

November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application

Version 50) wwwjasccomsupportkbarticlespspspecasp

[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image

File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version

22 April 2002

httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf

[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994

httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt

28

[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory

vol 2 pp 127_145 1968

[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical

Systems Theory vol 5 pp 95_96 1971

[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)

North-Holland Publ Co Amsterdam 1971 pp 95-109

[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language

Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml

httpwebcswpiedu~kalcoursescompilersmodule4mysahtml

[LightWave 1996] LightWave 3D Object File Format Oct 16 1996

httpsandboxdeosglightwavehtm

[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWo2] November 9 2001

httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml

[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer

1997 httppaulbourkenetdataformatscinema4dcinema4dpdf

[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and

Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf

[Maishein] Max Maischein PIC(PRO) Format

http14791177212extrafileformatgraphicspicpicprotxt

[Martin 2007] Michael Martin SVAF format description Version 02 January 2007

httpsvafsourceforgenetsvafhtml

[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and

Data Techniques July 29 1992 Revision 1097

httpnetghostnarodrugffvendspecmicriffms_rifftxt

[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New

Multimedia Data Types and Data Techniques Revision 30 April 15 1994

[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE

httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-

usstreamhhstreamaudprop_7wz7asp

[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003

December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx

29

[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-

uslibraryms899421aspx

[Morrison 1985] Jerry Morrison ―EA IFF 85rdquo Standard for Interchange Format Files

Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt

[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17

1986 wwwfine-viewcomjplabsdocilbmtxt

[Morrison 1986b] J Morrison ―SMUS IFF Simple Musical Score Electronic Arts Inc

February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus

[MultimediaWiki 2006] Smush Animation Format August 2006

httpwikimultimediacxindexphptitle=SANChunk_Forma

[MultimeduiaWiki 2008a] Electronic Arts CMV April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_CMV

[MultimediaWiki 2008b] Electronic Arts DCT April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_DCT

[MultimediaWiki 2008c] Electronic Arts MAD April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MAD

[MultimediaWiki 2008d] Electronic Arts MPC April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MPC

[MultimediaWiki 2008e] Electronic Arts VP6 April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_VP6

[MultimediaWiki 2008f] Electronic Arts TGV August 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGV

[MultimediaWiki 2008g] Electronic Arts TGQ November 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ

[MultimediaWiki 2008h] American Laser Games MM January 2008

httpwikimultimediacxindexphptitle=American_Laser_Games_MM

[MultimediaWiki 2009a] Electronic Arts TQI February 2009

httpwikimultimediacxindexphptitle=Electronic_Arts_TQI

30

[MultimediaWiki 2009b] SANM October 2009

httpwikimultimediacxindexphptitle=SANM

[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240

httpcodegooglecompll-parser-

generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42

[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter

written in Python in the open source graphics file convertor UniConvertor

httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer

[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification

January 31 2011 wwwogforgdocumentsGFD174pdf

[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988

wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt

[Parr amp Quong 1995] Parr T J amp Quong R W (1995) ANTLR A predicated- ll(k) parser

generator Software Practice and Experience 25 7 (July) 789-810

httpwwwantlrorgarticle1055550346383antlrpdf

[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific

Languages Pragmatic

[Peters] Christoph Peters The Ultimate 3D model file format (v 20)

httpultimate3dorgU3DModelFileFormatV2htm

[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003

httpxiphorgoggdocrfc3533txt

[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20

wwweblongcomzarfblorb

httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt

[ Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011

wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml

[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)

Format Version 10 2001 wwwlibpngorgpubmngspecCredits

31

[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File

Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008

httpscopenofficeorgexcelfileformatpdf

[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF

[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments

3072 March 2001 wwwietforgrfcrfc3072txt

[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft

26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt

[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and

old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21

[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988

httpamigan1emunetregANIMtxt

[Tamir 2005] Giora Tamir C RIFF Parser June 2005

wwwcodeprojectcomKBfilesriffparseraspx

[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications

In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi

Khosrow-Pour Information Science Publishing pp 268-273 2008

wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf

[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for

File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute

September 2009

[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical

Engineering The University of Queensland 1996

httpiteeuqeduau~cristinastudentsdaviddavidhtml

[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language

understanding system ARPA Human Language Technologies Workshop Plainsboro New

Jersey httpaclldcupenneduHH94H94-1039pdf

[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars

Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)

Springer-Verlag Berlin 1974 pp 9-16

32

[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley

1995 (chapter 10) (Attribute Grammars)

[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010

httpxiphorgvorbisdocVorbis_I_spechtml

[Zappe 1994] H Zappe Oktalyzer 1994

httpwwwwotsitorglistaspsearch=oktampbutton=GO21

33

Appendix A Chunk-based Binary File Formats

This appendix references specifications for about 75 chunk-based binary file formats and

discusses some of those specifications

A1 Electronic Arts Interchange File Format (IFF)

[Amigan 2011]

Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the

specification for the Interchange File Format [Morrison 1985]

A11 Interleaved Bitmap (ILBM)

[Morrison 1986]

This file format was discussed in section 41 of this report

A12 8-bit Sampled Voice (8SVX)

The file format specification for the 9-bit sampled voice file format uses C programming

language data structures to define the format [Hayes and Morrison 1985] The following is a re-

expression of that specification as a context-free binary file grammar

Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata

cksize rarr ULONG

ckdata rarr voice8header bodyck

ckdata rarr voice8header nameck copyrightck bodyck

ckdata rarr voice8header copyrightck nameck bodyck

ckdata rarr voice8header nameck bodyck

ckdata rarr voice8header copyrightck bodyck

ckdata rarr voice8header authorck bodyck

ckdata rarr voice8header annotation bodyck

ckdata rarr voice8header atakck releaseck bodyck

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 23: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

17

import organtlrruntimeTokenStream

import gtriorggrammarschunkiblmLexer

import gtriorggrammarschunkiblmParser

public class Test

static String filePath = CUserssl170PicturesIFF-ilbmBLKIFF

public static void main(String[] args) throws RecognitionException IOException

ANTLRFileStream stream = new ANTLRFileStream(filePath ISO-8859-1)

iblmLexer lexer = new iblmLexer(stream)

TokenStream tokenStream = new CommonTokenStream(lexer)

iblmParser parser = new iblmParser(tokenStream)

Systemoutprintln(FILE + filePath)

Systemoutprintln(Successfully parsed = + parseriblm())

Figure 7 A Java Application for Parsing an ILBM File

FILE CUserssl170PicturesIFF-ilbmBLKIFF

ID ILBM Size 2556

ID BMHD Size 20

Width 200

Height 144

X Position 0

Y Position 0

Number of Panes 6

Masking0

Compression1

Reserved10

TransParent Color0

X Aspest3

Y Aspect3

Page Width200

Page Height144

ID CMAP Size 48

Color[0] 0 0 0

Color[1] 0 128 128

Color[2] 0 255 255

Color[3] 255 0 255

Color[4] 0 255 0

Color[5] 255 0 0

Color[6] 0 0 255

Color[7] 255 255 0

Color[8] 128 128 128

Color[9] 192 192 192

Color[10] 128 0 0

Color[11] 0 128 0

Color[12] 128 128 0

Color[13] 0 0 128

Color[14] 128 0 128

18

Color[15] 255 255 255

ID CAMG Size 4

View Modes 0x800

ID BODY Size 2448

Number of bytes 2448

Successfully parsed = true

Figure 8 Output of the ILBM Parser Applied to the file BLKIFF

63 An ANTLR Grammar for the RIFF Waveform PCM File Format

Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR

The output of ANTLR is a lexer and parser for WAVEFOR PCM files

grammar waveform_pcm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text contained in lexerheader is placed at the top of the lexer

file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Default ByteOrder to Little Endian

String byteOrder = II

Rules used to create the Parser

start symbol waveform_pcm

waveform_pcm returns [boolean result]

19

ckId = fourCC $ckIdvalueequals(RIFF)

ckSize = dWordValue

formType = fourCC $formTypevalueequals(WAVE)

formatSubChunk

dataSubChunk

$result = true

fourCC returns [String value]

BYTE BYTE BYTE BYTE $value = $text

formatSubChunk

subChunkId = fourCC $subChunkIdvalueequals(fmt )

ckSize = dWordValue

commonFields

pcmSpecificField

commonFields

formatTag = wordValue Format category

channels = wordValue Number of channels

samplesPerSec = dWordValue Sampling rate

avgBytesPerSec = dWordValue For buffer estimation

blockAlign = wordValue Data block size

pcmSpecificField

bitsPerSample = wordValue Sample size

dataSubChunk

subChunkId = fourCC $subChunkIdvalueequals(data)

ckSize = dWordValue

waveData[$ckSizevalue]

waveData [long size]

(size gt= 0=gt BYTE size--)+

defines the long data type in terms of BYTE terminals

dWordValue returns [long value] throws UnsupportedEncodingException

b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE

if (byteOrderequals(II))

$value = (long)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 |

getUByteValue($b3) ltlt 16 |

getUByteValue($b4) ltlt 24 )

else

$value = (long)

( getUByteValue($b1) ltlt 24 |

getUByteValue($b2) ltlt 16 |

20

getUByteValue($b3) ltlt 8 |

getUByteValue($b4) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

if (byteOrderequals(II))

$value = (int)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 )

else

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rule used to create the Lexer

BYTE

Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format

7 Related Research

Researchers have developed a number of data description languages to support validation of file

formats EAST (Enhanced ADA Subset) is a data description language developed by the

Consultative Committee for Space Data Systems [2010] The Data Entity Dictionary

Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic

information The EAST description is used to interpret and provide access to information in

binary and text files

DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to

manipulate Java jar files and ELF object files The PADSC data description language can be

used to define the file formats of text and binary files [Fisher amp Gruber 2005] The PADSC

compiler compiles the description into tools that can be used to recognize manipulate and

transform the data into other formats

21

The Data Format Description Language (DFDL) is being developed by a Working Group of the

Open Grid Forum [OGF 2011] The data description language is based on a subset of W3C XML

Schema using ltxsappinfogt annotations to carry the extra information necessary to describe non-

XML physical representations Version 1 of the language specification was published in 2011

and a parser is being implemented

Each of the data description languages described above can be used to define data types and file

structures of binary files However they are based on file structure declarations such as those of

C or Java The binary file format grammar described in this paper most closely resembles DFDL

However the binary file grammar described in this paper is the only data description language

based on formal grammars that is used for creating recognizers for file formats

JHOVE (JSTORHarvard Object Validation Environment) is an extensible system designed to

provide automated and efficient identification and validation of the formats of digital files

JHOVE is a format-specific digital object validation API written in Java JHOVE supports

validation of the following formats AIFF ASCII GIF HTML JPEG JPEG 2000 PDF TIFF

UTF-8 WAVE and XML JHOVE2 is second generation validation environment [California

digital Library 2011]

The research reported in this paper is similar in intent to that of the JHOVE projectsmdashvalidation

of binary file formats As a matter of fact binary file grammars and parsers have been

constructed for the chunk-based file formats AIFF JPEG and WAVE as well as the directory-

based format TIFF However the research reported herein differs from that of the JHOVE

project in that what is sought is a technology for generating validators for binary file formats

from grammars specifying the binary file format

8 Conclusion

The research question addressed by his research is whether it is possible to extend the context-

free grammars used to specify the syntax of programming languages to the specification of

binary file formats and to use these grammars with parsers for validating the file formats of

binary files Amajor family of binary file formats chunk-based was described Then extensions

to the concepts of context-free grammars and attribute grammars were described that enable the

specification of binary file formats Examples of attribute array grammars for two chunk-based

binary file format were then presented Recursive descent parsers for this classes of grammars

was then described Experience in using ANTLR a parser generator for LL() string grammars

in generating parsers for recognizing the formats of binary files was discussed Finally related

research in data description languages for binary file formats was described and the JHOVE

project in creating Java-based tools for validating file formats was described

22

It is concluded that it is possible to extend context-free grammars to the specification of chunk-

based binary file formats Furthermore these grammars can be used with recursive descent

parsers for validating the file formats of chunk-based binary files It remains to be determined

whether these attribute grammars based on binary file (array) grammars are adequate to define

other families of binary file formats

To be a practical technology for generating recognizers (validators) for binary file formats a

parser generator is needed that creates from a binary file grammar a recursive descent parser

(with pushdown stack if needed) ANTLR which accepts LL() grammars as input and

generates a lexer as well as a parser is too heavyweight a parser generator for this task and does

not have the requisite data types built in

A capability to validate binary file formats assumes that the file format of a file is already

known In this research we use a file type identifier based on the UNIX file command and a

magic file that we have created [Underwood 2009] We have samples of the chunk-based file

formats discussed in this report including those referenced in Appendix a We also have file

signature tests (magic tests) for all these file types

We would like to have a parser generator (also called a compilerndashcompiler) that would take as

input an attribute grammar for a binary file format and generate a parser or interpreter that

recognizes the file format or that interprets the file format and displays or plays the contents The

closest parser generator that we have found to having these capabilities is ANTLR

(wwwantlrorg ) ANTLR will generate a top-down recursive descent parser from a context-free

attribute grammar However it works with strings and generates a separate scanner

Appendix A lists the names of about 75 file formats that belong to the family of ―chunk-based

file formats We have specifications for each of file formats and samples of files with these

formats

These file formats have a similar structure We have developed binary file grammars for two of

these binary file formats We should be able to do so for all of them We used these grammars

with the extensions to ANTLR to generate top-down recursive descent parsers for the file

formats defined with these grammars

The significance of success in this research task is that if binary file formats can be specified

with binary file grammars then only one parsing generator is needed to generate the parsers for

verifying that a file conforms to a particular format Similarly the same parsing generator can be

used for conversion of legacy file formats to current or standard formats Finally the same parser

generator can be used for generating viewerplayers for most file formats This would increase

23

the likelihood of preserving and making available into the indefinite future hose digital records

encoded in binary file formats

24

References

[Aho 1968] A V Aho Indexed Grammars ndash An Extension of Context-free Grammars Journal

of the ACM Vol 15 Issue 4 Oct 1968

[Alias-wavefront 1999] Maya File Formats Maya 20 1999

httpcaadarchethzchinfomayamanual

[Amigan 2011] Amigan Software IFF FORM and Chunk Registry

httpamigan1emunetregiffhtml

[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description

May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description

[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats

Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21

[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files

Version 13 Apple Computer Inc January 1989

wwwaesnashvilleorgPDFsAIFF20Specspdf

[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-

webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt

[Apple 2005] Apple Core Audio Format Specification 2005b

httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_

specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101

[Apple 2007] QuickTime File Format Specification

httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref

acehtml

[ASF 2001] ASF format version 10 reconstruction January 2001

httpavifilesourceforgenetasf-10htm

[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January

1997 wwwdcsedacukhomemxrgfx3d3DSspec

[Autodesk 2009] Autodesk DXF Reference January 2009

httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112

25

[Back 2002] G Back (2002) A specification and scripting language for binary data Generative

Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-

77

[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V

22 1979 pp 243-256

[Baum amp Noel 2002]D Baum amp G Noel The Simms Interchange File Format 2002

httpsimtechsourceforgenettechiffhtml

[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New

Horizons Software Inc 1987 httpamigan1emunetregWORDtxt

[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995

ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt

[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010

httpondasourceforgenetnlfnlfFormathtml

[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm amp Onda file

formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml

[California Digital Library 2011] California Digital Library JHOVE2 Userlsquos Guide 2011

httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf

[Caligari 1998] Caligri Caligari trueSpace file format Document Version 6 July

2002 httpwwwcaligaricomdownloadzipscncob65zip

[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information

Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements

and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-

t81pdf

[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description

Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf

[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998

[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt

[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D

Taliesin Inc httpamigan1emunetregDR2Dtxt

26

[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG

decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt

[Dunckley et al 2007] Dunckley M Rankin S Conway E amp Giaretta D (2007) The use of

file description languages for file format identification and validation Proc PV 2007

Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical

Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-

detailsw=50089

[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format

Tech 3285 July 2001 httptechebuchdocstechtech3285pdf

Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml

[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd

May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html

[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495

wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml

[Fisher amp Gruber 2005] Fisher K amp Gruber R (2005) PADS A domain specific language for

processing ad hoc data ACM Conference on Programming language Design and

Implementation ACM Press httpwwwpadsprojorgpaperspldipdf

[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG

1995

[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal

Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September

1997 wwwgnelsondemoncoukzspecquetzalhtml

[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called

Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)

httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt

[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03

httpwwwanim8orcomresourcesan8_formattxt

[Google 2010] WebP RIFF Container 2010

httpcodegooglecomspeedwebpdocsriff_containerhtml

[Haaland] Mike Haaland Flic Files (FLI) Format description

httpwwwmartinreddynetgfxanimFLItxt

27

[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube

Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf

[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice

Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt

[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata

Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1

1995

[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)

August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21

[IBM amp Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data

Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt

wwwtactilemediacominfoMCI_Control_Infohtml

httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf

[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec

11 2008 httpduskblueorgprojtoymidimidiformatpdf

[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10

ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf

[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable

Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C

Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110

[Issar amp Ward 1993] S Issar and W Ward CMUlsquos robust spoken language understanding

system EUOSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf

[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software

November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application

Version 50) wwwjasccomsupportkbarticlespspspecasp

[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image

File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version

22 April 2002

httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf

[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994

httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt

28

[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory

vol 2 pp 127_145 1968

[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical

Systems Theory vol 5 pp 95_96 1971

[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)

North-Holland Publ Co Amsterdam 1971 pp 95-109

[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language

Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml

httpwebcswpiedu~kalcoursescompilersmodule4mysahtml

[LightWave 1996] LightWave 3D Object File Format Oct 16 1996

httpsandboxdeosglightwavehtm

[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWo2] November 9 2001

httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml

[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer

1997 httppaulbourkenetdataformatscinema4dcinema4dpdf

[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and

Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf

[Maishein] Max Maischein PIC(PRO) Format

http14791177212extrafileformatgraphicspicpicprotxt

[Martin 2007] Michael Martin SVAF format description Version 02 January 2007

httpsvafsourceforgenetsvafhtml

[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and

Data Techniques July 29 1992 Revision 1097

httpnetghostnarodrugffvendspecmicriffms_rifftxt

[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New

Multimedia Data Types and Data Techniques Revision 30 April 15 1994

[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE

httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-

usstreamhhstreamaudprop_7wz7asp

[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003

December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx

29

[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-

uslibraryms899421aspx

[Morrison 1985] Jerry Morrison ―EA IFF 85rdquo Standard for Interchange Format Files

Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt

[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17

1986 wwwfine-viewcomjplabsdocilbmtxt

[Morrison 1986b] J Morrison ―SMUS IFF Simple Musical Score Electronic Arts Inc

February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus

[MultimediaWiki 2006] Smush Animation Format August 2006

httpwikimultimediacxindexphptitle=SANChunk_Forma

[MultimeduiaWiki 2008a] Electronic Arts CMV April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_CMV

[MultimediaWiki 2008b] Electronic Arts DCT April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_DCT

[MultimediaWiki 2008c] Electronic Arts MAD April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MAD

[MultimediaWiki 2008d] Electronic Arts MPC April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MPC

[MultimediaWiki 2008e] Electronic Arts VP6 April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_VP6

[MultimediaWiki 2008f] Electronic Arts TGV August 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGV

[MultimediaWiki 2008g] Electronic Arts TGQ November 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ

[MultimediaWiki 2008h] American Laser Games MM January 2008

httpwikimultimediacxindexphptitle=American_Laser_Games_MM

[MultimediaWiki 2009a] Electronic Arts TQI February 2009

httpwikimultimediacxindexphptitle=Electronic_Arts_TQI

30

[MultimediaWiki 2009b] SANM October 2009

httpwikimultimediacxindexphptitle=SANM

[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240

httpcodegooglecompll-parser-

generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42

[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter

written in Python in the open source graphics file convertor UniConvertor

httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer

[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification

January 31 2011 wwwogforgdocumentsGFD174pdf

[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988

wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt

[Parr amp Quong 1995] Parr T J amp Quong R W (1995) ANTLR A predicated- ll(k) parser

generator Software Practice and Experience 25 7 (July) 789-810

httpwwwantlrorgarticle1055550346383antlrpdf

[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific

Languages Pragmatic

[Peters] Christoph Peters The Ultimate 3D model file format (v 20)

httpultimate3dorgU3DModelFileFormatV2htm

[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003

httpxiphorgoggdocrfc3533txt

[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20

wwweblongcomzarfblorb

httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt

[ Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011

wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml

[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)

Format Version 10 2001 wwwlibpngorgpubmngspecCredits

31

[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File

Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008

httpscopenofficeorgexcelfileformatpdf

[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF

[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments

3072 March 2001 wwwietforgrfcrfc3072txt

[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft

26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt

[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and

old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21

[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988

httpamigan1emunetregANIMtxt

[Tamir 2005] Giora Tamir C RIFF Parser June 2005

wwwcodeprojectcomKBfilesriffparseraspx

[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications

In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi

Khosrow-Pour Information Science Publishing pp 268-273 2008

wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf

[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for

File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute

September 2009

[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical

Engineering The University of Queensland 1996

httpiteeuqeduau~cristinastudentsdaviddavidhtml

[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language

understanding system ARPA Human Language Technologies Workshop Plainsboro New

Jersey httpaclldcupenneduHH94H94-1039pdf

[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars

Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)

Springer-Verlag Berlin 1974 pp 9-16

32

[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley

1995 (chapter 10) (Attribute Grammars)

[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010

httpxiphorgvorbisdocVorbis_I_spechtml

[Zappe 1994] H Zappe Oktalyzer 1994

httpwwwwotsitorglistaspsearch=oktampbutton=GO21

33

Appendix A Chunk-based Binary File Formats

This appendix references specifications for about 75 chunk-based binary file formats and

discusses some of those specifications

A1 Electronic Arts Interchange File Format (IFF)

[Amigan 2011]

Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the

specification for the Interchange File Format [Morrison 1985]

A11 Interleaved Bitmap (ILBM)

[Morrison 1986]

This file format was discussed in section 41 of this report

A12 8-bit Sampled Voice (8SVX)

The file format specification for the 9-bit sampled voice file format uses C programming

language data structures to define the format [Hayes and Morrison 1985] The following is a re-

expression of that specification as a context-free binary file grammar

Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata

cksize rarr ULONG

ckdata rarr voice8header bodyck

ckdata rarr voice8header nameck copyrightck bodyck

ckdata rarr voice8header copyrightck nameck bodyck

ckdata rarr voice8header nameck bodyck

ckdata rarr voice8header copyrightck bodyck

ckdata rarr voice8header authorck bodyck

ckdata rarr voice8header annotation bodyck

ckdata rarr voice8header atakck releaseck bodyck

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 24: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

18

Color[15] 255 255 255

ID CAMG Size 4

View Modes 0x800

ID BODY Size 2448

Number of bytes 2448

Successfully parsed = true

Figure 8 Output of the ILBM Parser Applied to the file BLKIFF

63 An ANTLR Grammar for the RIFF Waveform PCM File Format

Figure 9 shows a specification for the WAVEFORM PCM file format that is input to ANTLR

The output of ANTLR is a lexer and parser for WAVEFOR PCM files

grammar waveform_pcm

options

language = Java The programming language of the output parser

TokenLabelType = CommonToken

The text contained in header is placed at the top of the parser file

header

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text contained in lexerheader is placed at the top of the lexer

file

lexerheader

package gtriorggrammarschunk

import javaioUnsupportedEncodingException

The text in the members section is placed inside the parser class

members

getUByteValue inputs the token of a Byte terminal and returns an

integer value

public int getUByteValue(Token tok) throws UnsupportedEncodingException

int intValue = 0

char charValue = tokgetText()charAt(0)

intValue = (int) charValue

intValue = intValue amp 0x00ff

return intValue

Default ByteOrder to Little Endian

String byteOrder = II

Rules used to create the Parser

start symbol waveform_pcm

waveform_pcm returns [boolean result]

19

ckId = fourCC $ckIdvalueequals(RIFF)

ckSize = dWordValue

formType = fourCC $formTypevalueequals(WAVE)

formatSubChunk

dataSubChunk

$result = true

fourCC returns [String value]

BYTE BYTE BYTE BYTE $value = $text

formatSubChunk

subChunkId = fourCC $subChunkIdvalueequals(fmt )

ckSize = dWordValue

commonFields

pcmSpecificField

commonFields

formatTag = wordValue Format category

channels = wordValue Number of channels

samplesPerSec = dWordValue Sampling rate

avgBytesPerSec = dWordValue For buffer estimation

blockAlign = wordValue Data block size

pcmSpecificField

bitsPerSample = wordValue Sample size

dataSubChunk

subChunkId = fourCC $subChunkIdvalueequals(data)

ckSize = dWordValue

waveData[$ckSizevalue]

waveData [long size]

(size gt= 0=gt BYTE size--)+

defines the long data type in terms of BYTE terminals

dWordValue returns [long value] throws UnsupportedEncodingException

b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE

if (byteOrderequals(II))

$value = (long)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 |

getUByteValue($b3) ltlt 16 |

getUByteValue($b4) ltlt 24 )

else

$value = (long)

( getUByteValue($b1) ltlt 24 |

getUByteValue($b2) ltlt 16 |

20

getUByteValue($b3) ltlt 8 |

getUByteValue($b4) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

if (byteOrderequals(II))

$value = (int)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 )

else

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rule used to create the Lexer

BYTE

Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format

7 Related Research

Researchers have developed a number of data description languages to support validation of file

formats EAST (Enhanced ADA Subset) is a data description language developed by the

Consultative Committee for Space Data Systems [2010] The Data Entity Dictionary

Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic

information The EAST description is used to interpret and provide access to information in

binary and text files

DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to

manipulate Java jar files and ELF object files The PADSC data description language can be

used to define the file formats of text and binary files [Fisher amp Gruber 2005] The PADSC

compiler compiles the description into tools that can be used to recognize manipulate and

transform the data into other formats

21

The Data Format Description Language (DFDL) is being developed by a Working Group of the

Open Grid Forum [OGF 2011] The data description language is based on a subset of W3C XML

Schema using ltxsappinfogt annotations to carry the extra information necessary to describe non-

XML physical representations Version 1 of the language specification was published in 2011

and a parser is being implemented

Each of the data description languages described above can be used to define data types and file

structures of binary files However they are based on file structure declarations such as those of

C or Java The binary file format grammar described in this paper most closely resembles DFDL

However the binary file grammar described in this paper is the only data description language

based on formal grammars that is used for creating recognizers for file formats

JHOVE (JSTORHarvard Object Validation Environment) is an extensible system designed to

provide automated and efficient identification and validation of the formats of digital files

JHOVE is a format-specific digital object validation API written in Java JHOVE supports

validation of the following formats AIFF ASCII GIF HTML JPEG JPEG 2000 PDF TIFF

UTF-8 WAVE and XML JHOVE2 is second generation validation environment [California

digital Library 2011]

The research reported in this paper is similar in intent to that of the JHOVE projectsmdashvalidation

of binary file formats As a matter of fact binary file grammars and parsers have been

constructed for the chunk-based file formats AIFF JPEG and WAVE as well as the directory-

based format TIFF However the research reported herein differs from that of the JHOVE

project in that what is sought is a technology for generating validators for binary file formats

from grammars specifying the binary file format

8 Conclusion

The research question addressed by his research is whether it is possible to extend the context-

free grammars used to specify the syntax of programming languages to the specification of

binary file formats and to use these grammars with parsers for validating the file formats of

binary files Amajor family of binary file formats chunk-based was described Then extensions

to the concepts of context-free grammars and attribute grammars were described that enable the

specification of binary file formats Examples of attribute array grammars for two chunk-based

binary file format were then presented Recursive descent parsers for this classes of grammars

was then described Experience in using ANTLR a parser generator for LL() string grammars

in generating parsers for recognizing the formats of binary files was discussed Finally related

research in data description languages for binary file formats was described and the JHOVE

project in creating Java-based tools for validating file formats was described

22

It is concluded that it is possible to extend context-free grammars to the specification of chunk-

based binary file formats Furthermore these grammars can be used with recursive descent

parsers for validating the file formats of chunk-based binary files It remains to be determined

whether these attribute grammars based on binary file (array) grammars are adequate to define

other families of binary file formats

To be a practical technology for generating recognizers (validators) for binary file formats a

parser generator is needed that creates from a binary file grammar a recursive descent parser

(with pushdown stack if needed) ANTLR which accepts LL() grammars as input and

generates a lexer as well as a parser is too heavyweight a parser generator for this task and does

not have the requisite data types built in

A capability to validate binary file formats assumes that the file format of a file is already

known In this research we use a file type identifier based on the UNIX file command and a

magic file that we have created [Underwood 2009] We have samples of the chunk-based file

formats discussed in this report including those referenced in Appendix a We also have file

signature tests (magic tests) for all these file types

We would like to have a parser generator (also called a compilerndashcompiler) that would take as

input an attribute grammar for a binary file format and generate a parser or interpreter that

recognizes the file format or that interprets the file format and displays or plays the contents The

closest parser generator that we have found to having these capabilities is ANTLR

(wwwantlrorg ) ANTLR will generate a top-down recursive descent parser from a context-free

attribute grammar However it works with strings and generates a separate scanner

Appendix A lists the names of about 75 file formats that belong to the family of ―chunk-based

file formats We have specifications for each of file formats and samples of files with these

formats

These file formats have a similar structure We have developed binary file grammars for two of

these binary file formats We should be able to do so for all of them We used these grammars

with the extensions to ANTLR to generate top-down recursive descent parsers for the file

formats defined with these grammars

The significance of success in this research task is that if binary file formats can be specified

with binary file grammars then only one parsing generator is needed to generate the parsers for

verifying that a file conforms to a particular format Similarly the same parsing generator can be

used for conversion of legacy file formats to current or standard formats Finally the same parser

generator can be used for generating viewerplayers for most file formats This would increase

23

the likelihood of preserving and making available into the indefinite future hose digital records

encoded in binary file formats

24

References

[Aho 1968] A V Aho Indexed Grammars ndash An Extension of Context-free Grammars Journal

of the ACM Vol 15 Issue 4 Oct 1968

[Alias-wavefront 1999] Maya File Formats Maya 20 1999

httpcaadarchethzchinfomayamanual

[Amigan 2011] Amigan Software IFF FORM and Chunk Registry

httpamigan1emunetregiffhtml

[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description

May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description

[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats

Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21

[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files

Version 13 Apple Computer Inc January 1989

wwwaesnashvilleorgPDFsAIFF20Specspdf

[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-

webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt

[Apple 2005] Apple Core Audio Format Specification 2005b

httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_

specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101

[Apple 2007] QuickTime File Format Specification

httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref

acehtml

[ASF 2001] ASF format version 10 reconstruction January 2001

httpavifilesourceforgenetasf-10htm

[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January

1997 wwwdcsedacukhomemxrgfx3d3DSspec

[Autodesk 2009] Autodesk DXF Reference January 2009

httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112

25

[Back 2002] G Back (2002) A specification and scripting language for binary data Generative

Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-

77

[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V

22 1979 pp 243-256

[Baum amp Noel 2002]D Baum amp G Noel The Simms Interchange File Format 2002

httpsimtechsourceforgenettechiffhtml

[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New

Horizons Software Inc 1987 httpamigan1emunetregWORDtxt

[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995

ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt

[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010

httpondasourceforgenetnlfnlfFormathtml

[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm amp Onda file

formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml

[California Digital Library 2011] California Digital Library JHOVE2 Userlsquos Guide 2011

httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf

[Caligari 1998] Caligri Caligari trueSpace file format Document Version 6 July

2002 httpwwwcaligaricomdownloadzipscncob65zip

[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information

Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements

and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-

t81pdf

[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description

Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf

[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998

[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt

[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D

Taliesin Inc httpamigan1emunetregDR2Dtxt

26

[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG

decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt

[Dunckley et al 2007] Dunckley M Rankin S Conway E amp Giaretta D (2007) The use of

file description languages for file format identification and validation Proc PV 2007

Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical

Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-

detailsw=50089

[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format

Tech 3285 July 2001 httptechebuchdocstechtech3285pdf

Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml

[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd

May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html

[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495

wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml

[Fisher amp Gruber 2005] Fisher K amp Gruber R (2005) PADS A domain specific language for

processing ad hoc data ACM Conference on Programming language Design and

Implementation ACM Press httpwwwpadsprojorgpaperspldipdf

[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG

1995

[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal

Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September

1997 wwwgnelsondemoncoukzspecquetzalhtml

[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called

Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)

httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt

[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03

httpwwwanim8orcomresourcesan8_formattxt

[Google 2010] WebP RIFF Container 2010

httpcodegooglecomspeedwebpdocsriff_containerhtml

[Haaland] Mike Haaland Flic Files (FLI) Format description

httpwwwmartinreddynetgfxanimFLItxt

27

[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube

Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf

[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice

Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt

[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata

Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1

1995

[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)

August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21

[IBM amp Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data

Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt

wwwtactilemediacominfoMCI_Control_Infohtml

httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf

[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec

11 2008 httpduskblueorgprojtoymidimidiformatpdf

[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10

ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf

[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable

Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C

Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110

[Issar amp Ward 1993] S Issar and W Ward CMUlsquos robust spoken language understanding

system EUOSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf

[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software

November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application

Version 50) wwwjasccomsupportkbarticlespspspecasp

[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image

File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version

22 April 2002

httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf

[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994

httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt

28

[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory

vol 2 pp 127_145 1968

[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical

Systems Theory vol 5 pp 95_96 1971

[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)

North-Holland Publ Co Amsterdam 1971 pp 95-109

[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language

Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml

httpwebcswpiedu~kalcoursescompilersmodule4mysahtml

[LightWave 1996] LightWave 3D Object File Format Oct 16 1996

httpsandboxdeosglightwavehtm

[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWo2] November 9 2001

httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml

[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer

1997 httppaulbourkenetdataformatscinema4dcinema4dpdf

[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and

Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf

[Maishein] Max Maischein PIC(PRO) Format

http14791177212extrafileformatgraphicspicpicprotxt

[Martin 2007] Michael Martin SVAF format description Version 02 January 2007

httpsvafsourceforgenetsvafhtml

[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and

Data Techniques July 29 1992 Revision 1097

httpnetghostnarodrugffvendspecmicriffms_rifftxt

[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New

Multimedia Data Types and Data Techniques Revision 30 April 15 1994

[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE

httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-

usstreamhhstreamaudprop_7wz7asp

[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003

December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx

29

[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-

uslibraryms899421aspx

[Morrison 1985] Jerry Morrison ―EA IFF 85rdquo Standard for Interchange Format Files

Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt

[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17

1986 wwwfine-viewcomjplabsdocilbmtxt

[Morrison 1986b] J Morrison ―SMUS IFF Simple Musical Score Electronic Arts Inc

February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus

[MultimediaWiki 2006] Smush Animation Format August 2006

httpwikimultimediacxindexphptitle=SANChunk_Forma

[MultimeduiaWiki 2008a] Electronic Arts CMV April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_CMV

[MultimediaWiki 2008b] Electronic Arts DCT April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_DCT

[MultimediaWiki 2008c] Electronic Arts MAD April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MAD

[MultimediaWiki 2008d] Electronic Arts MPC April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MPC

[MultimediaWiki 2008e] Electronic Arts VP6 April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_VP6

[MultimediaWiki 2008f] Electronic Arts TGV August 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGV

[MultimediaWiki 2008g] Electronic Arts TGQ November 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ

[MultimediaWiki 2008h] American Laser Games MM January 2008

httpwikimultimediacxindexphptitle=American_Laser_Games_MM

[MultimediaWiki 2009a] Electronic Arts TQI February 2009

httpwikimultimediacxindexphptitle=Electronic_Arts_TQI

30

[MultimediaWiki 2009b] SANM October 2009

httpwikimultimediacxindexphptitle=SANM

[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240

httpcodegooglecompll-parser-

generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42

[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter

written in Python in the open source graphics file convertor UniConvertor

httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer

[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification

January 31 2011 wwwogforgdocumentsGFD174pdf

[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988

wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt

[Parr amp Quong 1995] Parr T J amp Quong R W (1995) ANTLR A predicated- ll(k) parser

generator Software Practice and Experience 25 7 (July) 789-810

httpwwwantlrorgarticle1055550346383antlrpdf

[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific

Languages Pragmatic

[Peters] Christoph Peters The Ultimate 3D model file format (v 20)

httpultimate3dorgU3DModelFileFormatV2htm

[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003

httpxiphorgoggdocrfc3533txt

[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20

wwweblongcomzarfblorb

httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt

[ Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011

wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml

[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)

Format Version 10 2001 wwwlibpngorgpubmngspecCredits

31

[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File

Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008

httpscopenofficeorgexcelfileformatpdf

[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF

[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments

3072 March 2001 wwwietforgrfcrfc3072txt

[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft

26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt

[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and

old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21

[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988

httpamigan1emunetregANIMtxt

[Tamir 2005] Giora Tamir C RIFF Parser June 2005

wwwcodeprojectcomKBfilesriffparseraspx

[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications

In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi

Khosrow-Pour Information Science Publishing pp 268-273 2008

wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf

[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for

File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute

September 2009

[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical

Engineering The University of Queensland 1996

httpiteeuqeduau~cristinastudentsdaviddavidhtml

[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language

understanding system ARPA Human Language Technologies Workshop Plainsboro New

Jersey httpaclldcupenneduHH94H94-1039pdf

[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars

Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)

Springer-Verlag Berlin 1974 pp 9-16

32

[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley

1995 (chapter 10) (Attribute Grammars)

[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010

httpxiphorgvorbisdocVorbis_I_spechtml

[Zappe 1994] H Zappe Oktalyzer 1994

httpwwwwotsitorglistaspsearch=oktampbutton=GO21

33

Appendix A Chunk-based Binary File Formats

This appendix references specifications for about 75 chunk-based binary file formats and

discusses some of those specifications

A1 Electronic Arts Interchange File Format (IFF)

[Amigan 2011]

Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the

specification for the Interchange File Format [Morrison 1985]

A11 Interleaved Bitmap (ILBM)

[Morrison 1986]

This file format was discussed in section 41 of this report

A12 8-bit Sampled Voice (8SVX)

The file format specification for the 9-bit sampled voice file format uses C programming

language data structures to define the format [Hayes and Morrison 1985] The following is a re-

expression of that specification as a context-free binary file grammar

Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata

cksize rarr ULONG

ckdata rarr voice8header bodyck

ckdata rarr voice8header nameck copyrightck bodyck

ckdata rarr voice8header copyrightck nameck bodyck

ckdata rarr voice8header nameck bodyck

ckdata rarr voice8header copyrightck bodyck

ckdata rarr voice8header authorck bodyck

ckdata rarr voice8header annotation bodyck

ckdata rarr voice8header atakck releaseck bodyck

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 25: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

19

ckId = fourCC $ckIdvalueequals(RIFF)

ckSize = dWordValue

formType = fourCC $formTypevalueequals(WAVE)

formatSubChunk

dataSubChunk

$result = true

fourCC returns [String value]

BYTE BYTE BYTE BYTE $value = $text

formatSubChunk

subChunkId = fourCC $subChunkIdvalueequals(fmt )

ckSize = dWordValue

commonFields

pcmSpecificField

commonFields

formatTag = wordValue Format category

channels = wordValue Number of channels

samplesPerSec = dWordValue Sampling rate

avgBytesPerSec = dWordValue For buffer estimation

blockAlign = wordValue Data block size

pcmSpecificField

bitsPerSample = wordValue Sample size

dataSubChunk

subChunkId = fourCC $subChunkIdvalueequals(data)

ckSize = dWordValue

waveData[$ckSizevalue]

waveData [long size]

(size gt= 0=gt BYTE size--)+

defines the long data type in terms of BYTE terminals

dWordValue returns [long value] throws UnsupportedEncodingException

b1 = BYTE b2 = BYTE b3 = BYTE b4 = BYTE

if (byteOrderequals(II))

$value = (long)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 |

getUByteValue($b3) ltlt 16 |

getUByteValue($b4) ltlt 24 )

else

$value = (long)

( getUByteValue($b1) ltlt 24 |

getUByteValue($b2) ltlt 16 |

20

getUByteValue($b3) ltlt 8 |

getUByteValue($b4) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

if (byteOrderequals(II))

$value = (int)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 )

else

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rule used to create the Lexer

BYTE

Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format

7 Related Research

Researchers have developed a number of data description languages to support validation of file

formats EAST (Enhanced ADA Subset) is a data description language developed by the

Consultative Committee for Space Data Systems [2010] The Data Entity Dictionary

Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic

information The EAST description is used to interpret and provide access to information in

binary and text files

DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to

manipulate Java jar files and ELF object files The PADSC data description language can be

used to define the file formats of text and binary files [Fisher amp Gruber 2005] The PADSC

compiler compiles the description into tools that can be used to recognize manipulate and

transform the data into other formats

21

The Data Format Description Language (DFDL) is being developed by a Working Group of the

Open Grid Forum [OGF 2011] The data description language is based on a subset of W3C XML

Schema using ltxsappinfogt annotations to carry the extra information necessary to describe non-

XML physical representations Version 1 of the language specification was published in 2011

and a parser is being implemented

Each of the data description languages described above can be used to define data types and file

structures of binary files However they are based on file structure declarations such as those of

C or Java The binary file format grammar described in this paper most closely resembles DFDL

However the binary file grammar described in this paper is the only data description language

based on formal grammars that is used for creating recognizers for file formats

JHOVE (JSTORHarvard Object Validation Environment) is an extensible system designed to

provide automated and efficient identification and validation of the formats of digital files

JHOVE is a format-specific digital object validation API written in Java JHOVE supports

validation of the following formats AIFF ASCII GIF HTML JPEG JPEG 2000 PDF TIFF

UTF-8 WAVE and XML JHOVE2 is second generation validation environment [California

digital Library 2011]

The research reported in this paper is similar in intent to that of the JHOVE projectsmdashvalidation

of binary file formats As a matter of fact binary file grammars and parsers have been

constructed for the chunk-based file formats AIFF JPEG and WAVE as well as the directory-

based format TIFF However the research reported herein differs from that of the JHOVE

project in that what is sought is a technology for generating validators for binary file formats

from grammars specifying the binary file format

8 Conclusion

The research question addressed by his research is whether it is possible to extend the context-

free grammars used to specify the syntax of programming languages to the specification of

binary file formats and to use these grammars with parsers for validating the file formats of

binary files Amajor family of binary file formats chunk-based was described Then extensions

to the concepts of context-free grammars and attribute grammars were described that enable the

specification of binary file formats Examples of attribute array grammars for two chunk-based

binary file format were then presented Recursive descent parsers for this classes of grammars

was then described Experience in using ANTLR a parser generator for LL() string grammars

in generating parsers for recognizing the formats of binary files was discussed Finally related

research in data description languages for binary file formats was described and the JHOVE

project in creating Java-based tools for validating file formats was described

22

It is concluded that it is possible to extend context-free grammars to the specification of chunk-

based binary file formats Furthermore these grammars can be used with recursive descent

parsers for validating the file formats of chunk-based binary files It remains to be determined

whether these attribute grammars based on binary file (array) grammars are adequate to define

other families of binary file formats

To be a practical technology for generating recognizers (validators) for binary file formats a

parser generator is needed that creates from a binary file grammar a recursive descent parser

(with pushdown stack if needed) ANTLR which accepts LL() grammars as input and

generates a lexer as well as a parser is too heavyweight a parser generator for this task and does

not have the requisite data types built in

A capability to validate binary file formats assumes that the file format of a file is already

known In this research we use a file type identifier based on the UNIX file command and a

magic file that we have created [Underwood 2009] We have samples of the chunk-based file

formats discussed in this report including those referenced in Appendix a We also have file

signature tests (magic tests) for all these file types

We would like to have a parser generator (also called a compilerndashcompiler) that would take as

input an attribute grammar for a binary file format and generate a parser or interpreter that

recognizes the file format or that interprets the file format and displays or plays the contents The

closest parser generator that we have found to having these capabilities is ANTLR

(wwwantlrorg ) ANTLR will generate a top-down recursive descent parser from a context-free

attribute grammar However it works with strings and generates a separate scanner

Appendix A lists the names of about 75 file formats that belong to the family of ―chunk-based

file formats We have specifications for each of file formats and samples of files with these

formats

These file formats have a similar structure We have developed binary file grammars for two of

these binary file formats We should be able to do so for all of them We used these grammars

with the extensions to ANTLR to generate top-down recursive descent parsers for the file

formats defined with these grammars

The significance of success in this research task is that if binary file formats can be specified

with binary file grammars then only one parsing generator is needed to generate the parsers for

verifying that a file conforms to a particular format Similarly the same parsing generator can be

used for conversion of legacy file formats to current or standard formats Finally the same parser

generator can be used for generating viewerplayers for most file formats This would increase

23

the likelihood of preserving and making available into the indefinite future hose digital records

encoded in binary file formats

24

References

[Aho 1968] A V Aho Indexed Grammars ndash An Extension of Context-free Grammars Journal

of the ACM Vol 15 Issue 4 Oct 1968

[Alias-wavefront 1999] Maya File Formats Maya 20 1999

httpcaadarchethzchinfomayamanual

[Amigan 2011] Amigan Software IFF FORM and Chunk Registry

httpamigan1emunetregiffhtml

[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description

May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description

[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats

Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21

[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files

Version 13 Apple Computer Inc January 1989

wwwaesnashvilleorgPDFsAIFF20Specspdf

[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-

webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt

[Apple 2005] Apple Core Audio Format Specification 2005b

httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_

specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101

[Apple 2007] QuickTime File Format Specification

httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref

acehtml

[ASF 2001] ASF format version 10 reconstruction January 2001

httpavifilesourceforgenetasf-10htm

[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January

1997 wwwdcsedacukhomemxrgfx3d3DSspec

[Autodesk 2009] Autodesk DXF Reference January 2009

httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112

25

[Back 2002] G Back (2002) A specification and scripting language for binary data Generative

Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-

77

[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V

22 1979 pp 243-256

[Baum amp Noel 2002]D Baum amp G Noel The Simms Interchange File Format 2002

httpsimtechsourceforgenettechiffhtml

[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New

Horizons Software Inc 1987 httpamigan1emunetregWORDtxt

[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995

ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt

[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010

httpondasourceforgenetnlfnlfFormathtml

[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm amp Onda file

formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml

[California Digital Library 2011] California Digital Library JHOVE2 Userlsquos Guide 2011

httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf

[Caligari 1998] Caligri Caligari trueSpace file format Document Version 6 July

2002 httpwwwcaligaricomdownloadzipscncob65zip

[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information

Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements

and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-

t81pdf

[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description

Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf

[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998

[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt

[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D

Taliesin Inc httpamigan1emunetregDR2Dtxt

26

[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG

decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt

[Dunckley et al 2007] Dunckley M Rankin S Conway E amp Giaretta D (2007) The use of

file description languages for file format identification and validation Proc PV 2007

Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical

Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-

detailsw=50089

[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format

Tech 3285 July 2001 httptechebuchdocstechtech3285pdf

Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml

[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd

May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html

[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495

wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml

[Fisher amp Gruber 2005] Fisher K amp Gruber R (2005) PADS A domain specific language for

processing ad hoc data ACM Conference on Programming language Design and

Implementation ACM Press httpwwwpadsprojorgpaperspldipdf

[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG

1995

[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal

Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September

1997 wwwgnelsondemoncoukzspecquetzalhtml

[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called

Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)

httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt

[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03

httpwwwanim8orcomresourcesan8_formattxt

[Google 2010] WebP RIFF Container 2010

httpcodegooglecomspeedwebpdocsriff_containerhtml

[Haaland] Mike Haaland Flic Files (FLI) Format description

httpwwwmartinreddynetgfxanimFLItxt

27

[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube

Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf

[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice

Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt

[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata

Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1

1995

[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)

August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21

[IBM amp Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data

Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt

wwwtactilemediacominfoMCI_Control_Infohtml

httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf

[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec

11 2008 httpduskblueorgprojtoymidimidiformatpdf

[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10

ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf

[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable

Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C

Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110

[Issar amp Ward 1993] S Issar and W Ward CMUlsquos robust spoken language understanding

system EUOSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf

[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software

November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application

Version 50) wwwjasccomsupportkbarticlespspspecasp

[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image

File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version

22 April 2002

httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf

[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994

httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt

28

[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory

vol 2 pp 127_145 1968

[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical

Systems Theory vol 5 pp 95_96 1971

[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)

North-Holland Publ Co Amsterdam 1971 pp 95-109

[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language

Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml

httpwebcswpiedu~kalcoursescompilersmodule4mysahtml

[LightWave 1996] LightWave 3D Object File Format Oct 16 1996

httpsandboxdeosglightwavehtm

[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWo2] November 9 2001

httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml

[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer

1997 httppaulbourkenetdataformatscinema4dcinema4dpdf

[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and

Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf

[Maishein] Max Maischein PIC(PRO) Format

http14791177212extrafileformatgraphicspicpicprotxt

[Martin 2007] Michael Martin SVAF format description Version 02 January 2007

httpsvafsourceforgenetsvafhtml

[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and

Data Techniques July 29 1992 Revision 1097

httpnetghostnarodrugffvendspecmicriffms_rifftxt

[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New

Multimedia Data Types and Data Techniques Revision 30 April 15 1994

[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE

httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-

usstreamhhstreamaudprop_7wz7asp

[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003

December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx

29

[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-

uslibraryms899421aspx

[Morrison 1985] Jerry Morrison ―EA IFF 85rdquo Standard for Interchange Format Files

Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt

[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17

1986 wwwfine-viewcomjplabsdocilbmtxt

[Morrison 1986b] J Morrison ―SMUS IFF Simple Musical Score Electronic Arts Inc

February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus

[MultimediaWiki 2006] Smush Animation Format August 2006

httpwikimultimediacxindexphptitle=SANChunk_Forma

[MultimeduiaWiki 2008a] Electronic Arts CMV April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_CMV

[MultimediaWiki 2008b] Electronic Arts DCT April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_DCT

[MultimediaWiki 2008c] Electronic Arts MAD April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MAD

[MultimediaWiki 2008d] Electronic Arts MPC April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MPC

[MultimediaWiki 2008e] Electronic Arts VP6 April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_VP6

[MultimediaWiki 2008f] Electronic Arts TGV August 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGV

[MultimediaWiki 2008g] Electronic Arts TGQ November 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ

[MultimediaWiki 2008h] American Laser Games MM January 2008

httpwikimultimediacxindexphptitle=American_Laser_Games_MM

[MultimediaWiki 2009a] Electronic Arts TQI February 2009

httpwikimultimediacxindexphptitle=Electronic_Arts_TQI

30

[MultimediaWiki 2009b] SANM October 2009

httpwikimultimediacxindexphptitle=SANM

[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240

httpcodegooglecompll-parser-

generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42

[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter

written in Python in the open source graphics file convertor UniConvertor

httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer

[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification

January 31 2011 wwwogforgdocumentsGFD174pdf

[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988

wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt

[Parr amp Quong 1995] Parr T J amp Quong R W (1995) ANTLR A predicated- ll(k) parser

generator Software Practice and Experience 25 7 (July) 789-810

httpwwwantlrorgarticle1055550346383antlrpdf

[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific

Languages Pragmatic

[Peters] Christoph Peters The Ultimate 3D model file format (v 20)

httpultimate3dorgU3DModelFileFormatV2htm

[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003

httpxiphorgoggdocrfc3533txt

[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20

wwweblongcomzarfblorb

httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt

[ Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011

wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml

[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)

Format Version 10 2001 wwwlibpngorgpubmngspecCredits

31

[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File

Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008

httpscopenofficeorgexcelfileformatpdf

[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF

[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments

3072 March 2001 wwwietforgrfcrfc3072txt

[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft

26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt

[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and

old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21

[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988

httpamigan1emunetregANIMtxt

[Tamir 2005] Giora Tamir C RIFF Parser June 2005

wwwcodeprojectcomKBfilesriffparseraspx

[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications

In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi

Khosrow-Pour Information Science Publishing pp 268-273 2008

wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf

[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for

File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute

September 2009

[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical

Engineering The University of Queensland 1996

httpiteeuqeduau~cristinastudentsdaviddavidhtml

[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language

understanding system ARPA Human Language Technologies Workshop Plainsboro New

Jersey httpaclldcupenneduHH94H94-1039pdf

[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars

Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)

Springer-Verlag Berlin 1974 pp 9-16

32

[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley

1995 (chapter 10) (Attribute Grammars)

[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010

httpxiphorgvorbisdocVorbis_I_spechtml

[Zappe 1994] H Zappe Oktalyzer 1994

httpwwwwotsitorglistaspsearch=oktampbutton=GO21

33

Appendix A Chunk-based Binary File Formats

This appendix references specifications for about 75 chunk-based binary file formats and

discusses some of those specifications

A1 Electronic Arts Interchange File Format (IFF)

[Amigan 2011]

Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the

specification for the Interchange File Format [Morrison 1985]

A11 Interleaved Bitmap (ILBM)

[Morrison 1986]

This file format was discussed in section 41 of this report

A12 8-bit Sampled Voice (8SVX)

The file format specification for the 9-bit sampled voice file format uses C programming

language data structures to define the format [Hayes and Morrison 1985] The following is a re-

expression of that specification as a context-free binary file grammar

Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata

cksize rarr ULONG

ckdata rarr voice8header bodyck

ckdata rarr voice8header nameck copyrightck bodyck

ckdata rarr voice8header copyrightck nameck bodyck

ckdata rarr voice8header nameck bodyck

ckdata rarr voice8header copyrightck bodyck

ckdata rarr voice8header authorck bodyck

ckdata rarr voice8header annotation bodyck

ckdata rarr voice8header atakck releaseck bodyck

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 26: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

20

getUByteValue($b3) ltlt 8 |

getUByteValue($b4) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

defines the signed word data type

wordValue returns [int value]

b1 = BYTE b2 = BYTE

if (byteOrderequals(II))

$value = (int)

( getUByteValue($b1) |

getUByteValue($b2) ltlt 8 )

else

$value = (int)

( getUByteValue($b1) ltlt 8 |

getUByteValue($b2) )

catch[RecognitionException re]

reportError(re)

recover(inputre)

catch[UnsupportedEncodingException uee]

Systemoutprintln(Unsupported Encoding Exception)

Terminal rule used to create the Lexer

BYTE

Figure 9 ANTLR Grammar for the RIFF WAVEFORM PCM File Format

7 Related Research

Researchers have developed a number of data description languages to support validation of file

formats EAST (Enhanced ADA Subset) is a data description language developed by the

Consultative Committee for Space Data Systems [2010] The Data Entity Dictionary

Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic

information The EAST description is used to interpret and provide access to information in

binary and text files

DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to

manipulate Java jar files and ELF object files The PADSC data description language can be

used to define the file formats of text and binary files [Fisher amp Gruber 2005] The PADSC

compiler compiles the description into tools that can be used to recognize manipulate and

transform the data into other formats

21

The Data Format Description Language (DFDL) is being developed by a Working Group of the

Open Grid Forum [OGF 2011] The data description language is based on a subset of W3C XML

Schema using ltxsappinfogt annotations to carry the extra information necessary to describe non-

XML physical representations Version 1 of the language specification was published in 2011

and a parser is being implemented

Each of the data description languages described above can be used to define data types and file

structures of binary files However they are based on file structure declarations such as those of

C or Java The binary file format grammar described in this paper most closely resembles DFDL

However the binary file grammar described in this paper is the only data description language

based on formal grammars that is used for creating recognizers for file formats

JHOVE (JSTORHarvard Object Validation Environment) is an extensible system designed to

provide automated and efficient identification and validation of the formats of digital files

JHOVE is a format-specific digital object validation API written in Java JHOVE supports

validation of the following formats AIFF ASCII GIF HTML JPEG JPEG 2000 PDF TIFF

UTF-8 WAVE and XML JHOVE2 is second generation validation environment [California

digital Library 2011]

The research reported in this paper is similar in intent to that of the JHOVE projectsmdashvalidation

of binary file formats As a matter of fact binary file grammars and parsers have been

constructed for the chunk-based file formats AIFF JPEG and WAVE as well as the directory-

based format TIFF However the research reported herein differs from that of the JHOVE

project in that what is sought is a technology for generating validators for binary file formats

from grammars specifying the binary file format

8 Conclusion

The research question addressed by his research is whether it is possible to extend the context-

free grammars used to specify the syntax of programming languages to the specification of

binary file formats and to use these grammars with parsers for validating the file formats of

binary files Amajor family of binary file formats chunk-based was described Then extensions

to the concepts of context-free grammars and attribute grammars were described that enable the

specification of binary file formats Examples of attribute array grammars for two chunk-based

binary file format were then presented Recursive descent parsers for this classes of grammars

was then described Experience in using ANTLR a parser generator for LL() string grammars

in generating parsers for recognizing the formats of binary files was discussed Finally related

research in data description languages for binary file formats was described and the JHOVE

project in creating Java-based tools for validating file formats was described

22

It is concluded that it is possible to extend context-free grammars to the specification of chunk-

based binary file formats Furthermore these grammars can be used with recursive descent

parsers for validating the file formats of chunk-based binary files It remains to be determined

whether these attribute grammars based on binary file (array) grammars are adequate to define

other families of binary file formats

To be a practical technology for generating recognizers (validators) for binary file formats a

parser generator is needed that creates from a binary file grammar a recursive descent parser

(with pushdown stack if needed) ANTLR which accepts LL() grammars as input and

generates a lexer as well as a parser is too heavyweight a parser generator for this task and does

not have the requisite data types built in

A capability to validate binary file formats assumes that the file format of a file is already

known In this research we use a file type identifier based on the UNIX file command and a

magic file that we have created [Underwood 2009] We have samples of the chunk-based file

formats discussed in this report including those referenced in Appendix a We also have file

signature tests (magic tests) for all these file types

We would like to have a parser generator (also called a compilerndashcompiler) that would take as

input an attribute grammar for a binary file format and generate a parser or interpreter that

recognizes the file format or that interprets the file format and displays or plays the contents The

closest parser generator that we have found to having these capabilities is ANTLR

(wwwantlrorg ) ANTLR will generate a top-down recursive descent parser from a context-free

attribute grammar However it works with strings and generates a separate scanner

Appendix A lists the names of about 75 file formats that belong to the family of ―chunk-based

file formats We have specifications for each of file formats and samples of files with these

formats

These file formats have a similar structure We have developed binary file grammars for two of

these binary file formats We should be able to do so for all of them We used these grammars

with the extensions to ANTLR to generate top-down recursive descent parsers for the file

formats defined with these grammars

The significance of success in this research task is that if binary file formats can be specified

with binary file grammars then only one parsing generator is needed to generate the parsers for

verifying that a file conforms to a particular format Similarly the same parsing generator can be

used for conversion of legacy file formats to current or standard formats Finally the same parser

generator can be used for generating viewerplayers for most file formats This would increase

23

the likelihood of preserving and making available into the indefinite future hose digital records

encoded in binary file formats

24

References

[Aho 1968] A V Aho Indexed Grammars ndash An Extension of Context-free Grammars Journal

of the ACM Vol 15 Issue 4 Oct 1968

[Alias-wavefront 1999] Maya File Formats Maya 20 1999

httpcaadarchethzchinfomayamanual

[Amigan 2011] Amigan Software IFF FORM and Chunk Registry

httpamigan1emunetregiffhtml

[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description

May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description

[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats

Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21

[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files

Version 13 Apple Computer Inc January 1989

wwwaesnashvilleorgPDFsAIFF20Specspdf

[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-

webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt

[Apple 2005] Apple Core Audio Format Specification 2005b

httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_

specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101

[Apple 2007] QuickTime File Format Specification

httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref

acehtml

[ASF 2001] ASF format version 10 reconstruction January 2001

httpavifilesourceforgenetasf-10htm

[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January

1997 wwwdcsedacukhomemxrgfx3d3DSspec

[Autodesk 2009] Autodesk DXF Reference January 2009

httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112

25

[Back 2002] G Back (2002) A specification and scripting language for binary data Generative

Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-

77

[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V

22 1979 pp 243-256

[Baum amp Noel 2002]D Baum amp G Noel The Simms Interchange File Format 2002

httpsimtechsourceforgenettechiffhtml

[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New

Horizons Software Inc 1987 httpamigan1emunetregWORDtxt

[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995

ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt

[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010

httpondasourceforgenetnlfnlfFormathtml

[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm amp Onda file

formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml

[California Digital Library 2011] California Digital Library JHOVE2 Userlsquos Guide 2011

httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf

[Caligari 1998] Caligri Caligari trueSpace file format Document Version 6 July

2002 httpwwwcaligaricomdownloadzipscncob65zip

[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information

Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements

and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-

t81pdf

[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description

Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf

[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998

[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt

[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D

Taliesin Inc httpamigan1emunetregDR2Dtxt

26

[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG

decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt

[Dunckley et al 2007] Dunckley M Rankin S Conway E amp Giaretta D (2007) The use of

file description languages for file format identification and validation Proc PV 2007

Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical

Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-

detailsw=50089

[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format

Tech 3285 July 2001 httptechebuchdocstechtech3285pdf

Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml

[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd

May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html

[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495

wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml

[Fisher amp Gruber 2005] Fisher K amp Gruber R (2005) PADS A domain specific language for

processing ad hoc data ACM Conference on Programming language Design and

Implementation ACM Press httpwwwpadsprojorgpaperspldipdf

[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG

1995

[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal

Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September

1997 wwwgnelsondemoncoukzspecquetzalhtml

[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called

Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)

httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt

[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03

httpwwwanim8orcomresourcesan8_formattxt

[Google 2010] WebP RIFF Container 2010

httpcodegooglecomspeedwebpdocsriff_containerhtml

[Haaland] Mike Haaland Flic Files (FLI) Format description

httpwwwmartinreddynetgfxanimFLItxt

27

[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube

Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf

[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice

Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt

[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata

Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1

1995

[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)

August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21

[IBM amp Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data

Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt

wwwtactilemediacominfoMCI_Control_Infohtml

httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf

[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec

11 2008 httpduskblueorgprojtoymidimidiformatpdf

[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10

ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf

[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable

Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C

Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110

[Issar amp Ward 1993] S Issar and W Ward CMUlsquos robust spoken language understanding

system EUOSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf

[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software

November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application

Version 50) wwwjasccomsupportkbarticlespspspecasp

[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image

File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version

22 April 2002

httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf

[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994

httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt

28

[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory

vol 2 pp 127_145 1968

[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical

Systems Theory vol 5 pp 95_96 1971

[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)

North-Holland Publ Co Amsterdam 1971 pp 95-109

[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language

Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml

httpwebcswpiedu~kalcoursescompilersmodule4mysahtml

[LightWave 1996] LightWave 3D Object File Format Oct 16 1996

httpsandboxdeosglightwavehtm

[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWo2] November 9 2001

httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml

[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer

1997 httppaulbourkenetdataformatscinema4dcinema4dpdf

[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and

Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf

[Maishein] Max Maischein PIC(PRO) Format

http14791177212extrafileformatgraphicspicpicprotxt

[Martin 2007] Michael Martin SVAF format description Version 02 January 2007

httpsvafsourceforgenetsvafhtml

[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and

Data Techniques July 29 1992 Revision 1097

httpnetghostnarodrugffvendspecmicriffms_rifftxt

[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New

Multimedia Data Types and Data Techniques Revision 30 April 15 1994

[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE

httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-

usstreamhhstreamaudprop_7wz7asp

[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003

December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx

29

[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-

uslibraryms899421aspx

[Morrison 1985] Jerry Morrison ―EA IFF 85rdquo Standard for Interchange Format Files

Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt

[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17

1986 wwwfine-viewcomjplabsdocilbmtxt

[Morrison 1986b] J Morrison ―SMUS IFF Simple Musical Score Electronic Arts Inc

February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus

[MultimediaWiki 2006] Smush Animation Format August 2006

httpwikimultimediacxindexphptitle=SANChunk_Forma

[MultimeduiaWiki 2008a] Electronic Arts CMV April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_CMV

[MultimediaWiki 2008b] Electronic Arts DCT April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_DCT

[MultimediaWiki 2008c] Electronic Arts MAD April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MAD

[MultimediaWiki 2008d] Electronic Arts MPC April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MPC

[MultimediaWiki 2008e] Electronic Arts VP6 April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_VP6

[MultimediaWiki 2008f] Electronic Arts TGV August 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGV

[MultimediaWiki 2008g] Electronic Arts TGQ November 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ

[MultimediaWiki 2008h] American Laser Games MM January 2008

httpwikimultimediacxindexphptitle=American_Laser_Games_MM

[MultimediaWiki 2009a] Electronic Arts TQI February 2009

httpwikimultimediacxindexphptitle=Electronic_Arts_TQI

30

[MultimediaWiki 2009b] SANM October 2009

httpwikimultimediacxindexphptitle=SANM

[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240

httpcodegooglecompll-parser-

generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42

[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter

written in Python in the open source graphics file convertor UniConvertor

httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer

[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification

January 31 2011 wwwogforgdocumentsGFD174pdf

[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988

wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt

[Parr amp Quong 1995] Parr T J amp Quong R W (1995) ANTLR A predicated- ll(k) parser

generator Software Practice and Experience 25 7 (July) 789-810

httpwwwantlrorgarticle1055550346383antlrpdf

[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific

Languages Pragmatic

[Peters] Christoph Peters The Ultimate 3D model file format (v 20)

httpultimate3dorgU3DModelFileFormatV2htm

[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003

httpxiphorgoggdocrfc3533txt

[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20

wwweblongcomzarfblorb

httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt

[ Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011

wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml

[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)

Format Version 10 2001 wwwlibpngorgpubmngspecCredits

31

[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File

Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008

httpscopenofficeorgexcelfileformatpdf

[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF

[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments

3072 March 2001 wwwietforgrfcrfc3072txt

[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft

26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt

[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and

old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21

[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988

httpamigan1emunetregANIMtxt

[Tamir 2005] Giora Tamir C RIFF Parser June 2005

wwwcodeprojectcomKBfilesriffparseraspx

[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications

In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi

Khosrow-Pour Information Science Publishing pp 268-273 2008

wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf

[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for

File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute

September 2009

[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical

Engineering The University of Queensland 1996

httpiteeuqeduau~cristinastudentsdaviddavidhtml

[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language

understanding system ARPA Human Language Technologies Workshop Plainsboro New

Jersey httpaclldcupenneduHH94H94-1039pdf

[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars

Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)

Springer-Verlag Berlin 1974 pp 9-16

32

[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley

1995 (chapter 10) (Attribute Grammars)

[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010

httpxiphorgvorbisdocVorbis_I_spechtml

[Zappe 1994] H Zappe Oktalyzer 1994

httpwwwwotsitorglistaspsearch=oktampbutton=GO21

33

Appendix A Chunk-based Binary File Formats

This appendix references specifications for about 75 chunk-based binary file formats and

discusses some of those specifications

A1 Electronic Arts Interchange File Format (IFF)

[Amigan 2011]

Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the

specification for the Interchange File Format [Morrison 1985]

A11 Interleaved Bitmap (ILBM)

[Morrison 1986]

This file format was discussed in section 41 of this report

A12 8-bit Sampled Voice (8SVX)

The file format specification for the 9-bit sampled voice file format uses C programming

language data structures to define the format [Hayes and Morrison 1985] The following is a re-

expression of that specification as a context-free binary file grammar

Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata

cksize rarr ULONG

ckdata rarr voice8header bodyck

ckdata rarr voice8header nameck copyrightck bodyck

ckdata rarr voice8header copyrightck nameck bodyck

ckdata rarr voice8header nameck bodyck

ckdata rarr voice8header copyrightck bodyck

ckdata rarr voice8header authorck bodyck

ckdata rarr voice8header annotation bodyck

ckdata rarr voice8header atakck releaseck bodyck

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 27: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

21

The Data Format Description Language (DFDL) is being developed by a Working Group of the

Open Grid Forum [OGF 2011] The data description language is based on a subset of W3C XML

Schema using ltxsappinfogt annotations to carry the extra information necessary to describe non-

XML physical representations Version 1 of the language specification was published in 2011

and a parser is being implemented

Each of the data description languages described above can be used to define data types and file

structures of binary files However they are based on file structure declarations such as those of

C or Java The binary file format grammar described in this paper most closely resembles DFDL

However the binary file grammar described in this paper is the only data description language

based on formal grammars that is used for creating recognizers for file formats

JHOVE (JSTORHarvard Object Validation Environment) is an extensible system designed to

provide automated and efficient identification and validation of the formats of digital files

JHOVE is a format-specific digital object validation API written in Java JHOVE supports

validation of the following formats AIFF ASCII GIF HTML JPEG JPEG 2000 PDF TIFF

UTF-8 WAVE and XML JHOVE2 is second generation validation environment [California

digital Library 2011]

The research reported in this paper is similar in intent to that of the JHOVE projectsmdashvalidation

of binary file formats As a matter of fact binary file grammars and parsers have been

constructed for the chunk-based file formats AIFF JPEG and WAVE as well as the directory-

based format TIFF However the research reported herein differs from that of the JHOVE

project in that what is sought is a technology for generating validators for binary file formats

from grammars specifying the binary file format

8 Conclusion

The research question addressed by his research is whether it is possible to extend the context-

free grammars used to specify the syntax of programming languages to the specification of

binary file formats and to use these grammars with parsers for validating the file formats of

binary files Amajor family of binary file formats chunk-based was described Then extensions

to the concepts of context-free grammars and attribute grammars were described that enable the

specification of binary file formats Examples of attribute array grammars for two chunk-based

binary file format were then presented Recursive descent parsers for this classes of grammars

was then described Experience in using ANTLR a parser generator for LL() string grammars

in generating parsers for recognizing the formats of binary files was discussed Finally related

research in data description languages for binary file formats was described and the JHOVE

project in creating Java-based tools for validating file formats was described

22

It is concluded that it is possible to extend context-free grammars to the specification of chunk-

based binary file formats Furthermore these grammars can be used with recursive descent

parsers for validating the file formats of chunk-based binary files It remains to be determined

whether these attribute grammars based on binary file (array) grammars are adequate to define

other families of binary file formats

To be a practical technology for generating recognizers (validators) for binary file formats a

parser generator is needed that creates from a binary file grammar a recursive descent parser

(with pushdown stack if needed) ANTLR which accepts LL() grammars as input and

generates a lexer as well as a parser is too heavyweight a parser generator for this task and does

not have the requisite data types built in

A capability to validate binary file formats assumes that the file format of a file is already

known In this research we use a file type identifier based on the UNIX file command and a

magic file that we have created [Underwood 2009] We have samples of the chunk-based file

formats discussed in this report including those referenced in Appendix a We also have file

signature tests (magic tests) for all these file types

We would like to have a parser generator (also called a compilerndashcompiler) that would take as

input an attribute grammar for a binary file format and generate a parser or interpreter that

recognizes the file format or that interprets the file format and displays or plays the contents The

closest parser generator that we have found to having these capabilities is ANTLR

(wwwantlrorg ) ANTLR will generate a top-down recursive descent parser from a context-free

attribute grammar However it works with strings and generates a separate scanner

Appendix A lists the names of about 75 file formats that belong to the family of ―chunk-based

file formats We have specifications for each of file formats and samples of files with these

formats

These file formats have a similar structure We have developed binary file grammars for two of

these binary file formats We should be able to do so for all of them We used these grammars

with the extensions to ANTLR to generate top-down recursive descent parsers for the file

formats defined with these grammars

The significance of success in this research task is that if binary file formats can be specified

with binary file grammars then only one parsing generator is needed to generate the parsers for

verifying that a file conforms to a particular format Similarly the same parsing generator can be

used for conversion of legacy file formats to current or standard formats Finally the same parser

generator can be used for generating viewerplayers for most file formats This would increase

23

the likelihood of preserving and making available into the indefinite future hose digital records

encoded in binary file formats

24

References

[Aho 1968] A V Aho Indexed Grammars ndash An Extension of Context-free Grammars Journal

of the ACM Vol 15 Issue 4 Oct 1968

[Alias-wavefront 1999] Maya File Formats Maya 20 1999

httpcaadarchethzchinfomayamanual

[Amigan 2011] Amigan Software IFF FORM and Chunk Registry

httpamigan1emunetregiffhtml

[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description

May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description

[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats

Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21

[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files

Version 13 Apple Computer Inc January 1989

wwwaesnashvilleorgPDFsAIFF20Specspdf

[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-

webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt

[Apple 2005] Apple Core Audio Format Specification 2005b

httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_

specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101

[Apple 2007] QuickTime File Format Specification

httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref

acehtml

[ASF 2001] ASF format version 10 reconstruction January 2001

httpavifilesourceforgenetasf-10htm

[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January

1997 wwwdcsedacukhomemxrgfx3d3DSspec

[Autodesk 2009] Autodesk DXF Reference January 2009

httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112

25

[Back 2002] G Back (2002) A specification and scripting language for binary data Generative

Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-

77

[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V

22 1979 pp 243-256

[Baum amp Noel 2002]D Baum amp G Noel The Simms Interchange File Format 2002

httpsimtechsourceforgenettechiffhtml

[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New

Horizons Software Inc 1987 httpamigan1emunetregWORDtxt

[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995

ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt

[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010

httpondasourceforgenetnlfnlfFormathtml

[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm amp Onda file

formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml

[California Digital Library 2011] California Digital Library JHOVE2 Userlsquos Guide 2011

httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf

[Caligari 1998] Caligri Caligari trueSpace file format Document Version 6 July

2002 httpwwwcaligaricomdownloadzipscncob65zip

[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information

Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements

and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-

t81pdf

[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description

Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf

[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998

[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt

[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D

Taliesin Inc httpamigan1emunetregDR2Dtxt

26

[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG

decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt

[Dunckley et al 2007] Dunckley M Rankin S Conway E amp Giaretta D (2007) The use of

file description languages for file format identification and validation Proc PV 2007

Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical

Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-

detailsw=50089

[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format

Tech 3285 July 2001 httptechebuchdocstechtech3285pdf

Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml

[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd

May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html

[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495

wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml

[Fisher amp Gruber 2005] Fisher K amp Gruber R (2005) PADS A domain specific language for

processing ad hoc data ACM Conference on Programming language Design and

Implementation ACM Press httpwwwpadsprojorgpaperspldipdf

[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG

1995

[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal

Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September

1997 wwwgnelsondemoncoukzspecquetzalhtml

[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called

Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)

httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt

[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03

httpwwwanim8orcomresourcesan8_formattxt

[Google 2010] WebP RIFF Container 2010

httpcodegooglecomspeedwebpdocsriff_containerhtml

[Haaland] Mike Haaland Flic Files (FLI) Format description

httpwwwmartinreddynetgfxanimFLItxt

27

[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube

Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf

[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice

Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt

[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata

Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1

1995

[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)

August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21

[IBM amp Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data

Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt

wwwtactilemediacominfoMCI_Control_Infohtml

httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf

[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec

11 2008 httpduskblueorgprojtoymidimidiformatpdf

[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10

ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf

[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable

Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C

Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110

[Issar amp Ward 1993] S Issar and W Ward CMUlsquos robust spoken language understanding

system EUOSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf

[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software

November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application

Version 50) wwwjasccomsupportkbarticlespspspecasp

[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image

File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version

22 April 2002

httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf

[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994

httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt

28

[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory

vol 2 pp 127_145 1968

[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical

Systems Theory vol 5 pp 95_96 1971

[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)

North-Holland Publ Co Amsterdam 1971 pp 95-109

[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language

Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml

httpwebcswpiedu~kalcoursescompilersmodule4mysahtml

[LightWave 1996] LightWave 3D Object File Format Oct 16 1996

httpsandboxdeosglightwavehtm

[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWo2] November 9 2001

httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml

[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer

1997 httppaulbourkenetdataformatscinema4dcinema4dpdf

[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and

Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf

[Maishein] Max Maischein PIC(PRO) Format

http14791177212extrafileformatgraphicspicpicprotxt

[Martin 2007] Michael Martin SVAF format description Version 02 January 2007

httpsvafsourceforgenetsvafhtml

[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and

Data Techniques July 29 1992 Revision 1097

httpnetghostnarodrugffvendspecmicriffms_rifftxt

[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New

Multimedia Data Types and Data Techniques Revision 30 April 15 1994

[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE

httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-

usstreamhhstreamaudprop_7wz7asp

[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003

December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx

29

[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-

uslibraryms899421aspx

[Morrison 1985] Jerry Morrison ―EA IFF 85rdquo Standard for Interchange Format Files

Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt

[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17

1986 wwwfine-viewcomjplabsdocilbmtxt

[Morrison 1986b] J Morrison ―SMUS IFF Simple Musical Score Electronic Arts Inc

February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus

[MultimediaWiki 2006] Smush Animation Format August 2006

httpwikimultimediacxindexphptitle=SANChunk_Forma

[MultimeduiaWiki 2008a] Electronic Arts CMV April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_CMV

[MultimediaWiki 2008b] Electronic Arts DCT April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_DCT

[MultimediaWiki 2008c] Electronic Arts MAD April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MAD

[MultimediaWiki 2008d] Electronic Arts MPC April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MPC

[MultimediaWiki 2008e] Electronic Arts VP6 April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_VP6

[MultimediaWiki 2008f] Electronic Arts TGV August 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGV

[MultimediaWiki 2008g] Electronic Arts TGQ November 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ

[MultimediaWiki 2008h] American Laser Games MM January 2008

httpwikimultimediacxindexphptitle=American_Laser_Games_MM

[MultimediaWiki 2009a] Electronic Arts TQI February 2009

httpwikimultimediacxindexphptitle=Electronic_Arts_TQI

30

[MultimediaWiki 2009b] SANM October 2009

httpwikimultimediacxindexphptitle=SANM

[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240

httpcodegooglecompll-parser-

generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42

[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter

written in Python in the open source graphics file convertor UniConvertor

httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer

[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification

January 31 2011 wwwogforgdocumentsGFD174pdf

[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988

wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt

[Parr amp Quong 1995] Parr T J amp Quong R W (1995) ANTLR A predicated- ll(k) parser

generator Software Practice and Experience 25 7 (July) 789-810

httpwwwantlrorgarticle1055550346383antlrpdf

[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific

Languages Pragmatic

[Peters] Christoph Peters The Ultimate 3D model file format (v 20)

httpultimate3dorgU3DModelFileFormatV2htm

[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003

httpxiphorgoggdocrfc3533txt

[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20

wwweblongcomzarfblorb

httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt

[ Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011

wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml

[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)

Format Version 10 2001 wwwlibpngorgpubmngspecCredits

31

[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File

Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008

httpscopenofficeorgexcelfileformatpdf

[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF

[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments

3072 March 2001 wwwietforgrfcrfc3072txt

[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft

26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt

[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and

old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21

[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988

httpamigan1emunetregANIMtxt

[Tamir 2005] Giora Tamir C RIFF Parser June 2005

wwwcodeprojectcomKBfilesriffparseraspx

[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications

In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi

Khosrow-Pour Information Science Publishing pp 268-273 2008

wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf

[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for

File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute

September 2009

[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical

Engineering The University of Queensland 1996

httpiteeuqeduau~cristinastudentsdaviddavidhtml

[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language

understanding system ARPA Human Language Technologies Workshop Plainsboro New

Jersey httpaclldcupenneduHH94H94-1039pdf

[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars

Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)

Springer-Verlag Berlin 1974 pp 9-16

32

[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley

1995 (chapter 10) (Attribute Grammars)

[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010

httpxiphorgvorbisdocVorbis_I_spechtml

[Zappe 1994] H Zappe Oktalyzer 1994

httpwwwwotsitorglistaspsearch=oktampbutton=GO21

33

Appendix A Chunk-based Binary File Formats

This appendix references specifications for about 75 chunk-based binary file formats and

discusses some of those specifications

A1 Electronic Arts Interchange File Format (IFF)

[Amigan 2011]

Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the

specification for the Interchange File Format [Morrison 1985]

A11 Interleaved Bitmap (ILBM)

[Morrison 1986]

This file format was discussed in section 41 of this report

A12 8-bit Sampled Voice (8SVX)

The file format specification for the 9-bit sampled voice file format uses C programming

language data structures to define the format [Hayes and Morrison 1985] The following is a re-

expression of that specification as a context-free binary file grammar

Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata

cksize rarr ULONG

ckdata rarr voice8header bodyck

ckdata rarr voice8header nameck copyrightck bodyck

ckdata rarr voice8header copyrightck nameck bodyck

ckdata rarr voice8header nameck bodyck

ckdata rarr voice8header copyrightck bodyck

ckdata rarr voice8header authorck bodyck

ckdata rarr voice8header annotation bodyck

ckdata rarr voice8header atakck releaseck bodyck

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 28: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

22

It is concluded that it is possible to extend context-free grammars to the specification of chunk-

based binary file formats Furthermore these grammars can be used with recursive descent

parsers for validating the file formats of chunk-based binary files It remains to be determined

whether these attribute grammars based on binary file (array) grammars are adequate to define

other families of binary file formats

To be a practical technology for generating recognizers (validators) for binary file formats a

parser generator is needed that creates from a binary file grammar a recursive descent parser

(with pushdown stack if needed) ANTLR which accepts LL() grammars as input and

generates a lexer as well as a parser is too heavyweight a parser generator for this task and does

not have the requisite data types built in

A capability to validate binary file formats assumes that the file format of a file is already

known In this research we use a file type identifier based on the UNIX file command and a

magic file that we have created [Underwood 2009] We have samples of the chunk-based file

formats discussed in this report including those referenced in Appendix a We also have file

signature tests (magic tests) for all these file types

We would like to have a parser generator (also called a compilerndashcompiler) that would take as

input an attribute grammar for a binary file format and generate a parser or interpreter that

recognizes the file format or that interprets the file format and displays or plays the contents The

closest parser generator that we have found to having these capabilities is ANTLR

(wwwantlrorg ) ANTLR will generate a top-down recursive descent parser from a context-free

attribute grammar However it works with strings and generates a separate scanner

Appendix A lists the names of about 75 file formats that belong to the family of ―chunk-based

file formats We have specifications for each of file formats and samples of files with these

formats

These file formats have a similar structure We have developed binary file grammars for two of

these binary file formats We should be able to do so for all of them We used these grammars

with the extensions to ANTLR to generate top-down recursive descent parsers for the file

formats defined with these grammars

The significance of success in this research task is that if binary file formats can be specified

with binary file grammars then only one parsing generator is needed to generate the parsers for

verifying that a file conforms to a particular format Similarly the same parsing generator can be

used for conversion of legacy file formats to current or standard formats Finally the same parser

generator can be used for generating viewerplayers for most file formats This would increase

23

the likelihood of preserving and making available into the indefinite future hose digital records

encoded in binary file formats

24

References

[Aho 1968] A V Aho Indexed Grammars ndash An Extension of Context-free Grammars Journal

of the ACM Vol 15 Issue 4 Oct 1968

[Alias-wavefront 1999] Maya File Formats Maya 20 1999

httpcaadarchethzchinfomayamanual

[Amigan 2011] Amigan Software IFF FORM and Chunk Registry

httpamigan1emunetregiffhtml

[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description

May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description

[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats

Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21

[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files

Version 13 Apple Computer Inc January 1989

wwwaesnashvilleorgPDFsAIFF20Specspdf

[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-

webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt

[Apple 2005] Apple Core Audio Format Specification 2005b

httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_

specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101

[Apple 2007] QuickTime File Format Specification

httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref

acehtml

[ASF 2001] ASF format version 10 reconstruction January 2001

httpavifilesourceforgenetasf-10htm

[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January

1997 wwwdcsedacukhomemxrgfx3d3DSspec

[Autodesk 2009] Autodesk DXF Reference January 2009

httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112

25

[Back 2002] G Back (2002) A specification and scripting language for binary data Generative

Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-

77

[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V

22 1979 pp 243-256

[Baum amp Noel 2002]D Baum amp G Noel The Simms Interchange File Format 2002

httpsimtechsourceforgenettechiffhtml

[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New

Horizons Software Inc 1987 httpamigan1emunetregWORDtxt

[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995

ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt

[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010

httpondasourceforgenetnlfnlfFormathtml

[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm amp Onda file

formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml

[California Digital Library 2011] California Digital Library JHOVE2 Userlsquos Guide 2011

httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf

[Caligari 1998] Caligri Caligari trueSpace file format Document Version 6 July

2002 httpwwwcaligaricomdownloadzipscncob65zip

[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information

Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements

and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-

t81pdf

[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description

Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf

[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998

[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt

[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D

Taliesin Inc httpamigan1emunetregDR2Dtxt

26

[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG

decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt

[Dunckley et al 2007] Dunckley M Rankin S Conway E amp Giaretta D (2007) The use of

file description languages for file format identification and validation Proc PV 2007

Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical

Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-

detailsw=50089

[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format

Tech 3285 July 2001 httptechebuchdocstechtech3285pdf

Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml

[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd

May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html

[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495

wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml

[Fisher amp Gruber 2005] Fisher K amp Gruber R (2005) PADS A domain specific language for

processing ad hoc data ACM Conference on Programming language Design and

Implementation ACM Press httpwwwpadsprojorgpaperspldipdf

[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG

1995

[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal

Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September

1997 wwwgnelsondemoncoukzspecquetzalhtml

[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called

Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)

httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt

[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03

httpwwwanim8orcomresourcesan8_formattxt

[Google 2010] WebP RIFF Container 2010

httpcodegooglecomspeedwebpdocsriff_containerhtml

[Haaland] Mike Haaland Flic Files (FLI) Format description

httpwwwmartinreddynetgfxanimFLItxt

27

[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube

Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf

[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice

Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt

[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata

Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1

1995

[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)

August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21

[IBM amp Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data

Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt

wwwtactilemediacominfoMCI_Control_Infohtml

httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf

[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec

11 2008 httpduskblueorgprojtoymidimidiformatpdf

[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10

ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf

[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable

Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C

Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110

[Issar amp Ward 1993] S Issar and W Ward CMUlsquos robust spoken language understanding

system EUOSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf

[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software

November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application

Version 50) wwwjasccomsupportkbarticlespspspecasp

[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image

File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version

22 April 2002

httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf

[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994

httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt

28

[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory

vol 2 pp 127_145 1968

[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical

Systems Theory vol 5 pp 95_96 1971

[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)

North-Holland Publ Co Amsterdam 1971 pp 95-109

[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language

Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml

httpwebcswpiedu~kalcoursescompilersmodule4mysahtml

[LightWave 1996] LightWave 3D Object File Format Oct 16 1996

httpsandboxdeosglightwavehtm

[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWo2] November 9 2001

httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml

[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer

1997 httppaulbourkenetdataformatscinema4dcinema4dpdf

[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and

Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf

[Maishein] Max Maischein PIC(PRO) Format

http14791177212extrafileformatgraphicspicpicprotxt

[Martin 2007] Michael Martin SVAF format description Version 02 January 2007

httpsvafsourceforgenetsvafhtml

[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and

Data Techniques July 29 1992 Revision 1097

httpnetghostnarodrugffvendspecmicriffms_rifftxt

[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New

Multimedia Data Types and Data Techniques Revision 30 April 15 1994

[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE

httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-

usstreamhhstreamaudprop_7wz7asp

[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003

December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx

29

[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-

uslibraryms899421aspx

[Morrison 1985] Jerry Morrison ―EA IFF 85rdquo Standard for Interchange Format Files

Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt

[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17

1986 wwwfine-viewcomjplabsdocilbmtxt

[Morrison 1986b] J Morrison ―SMUS IFF Simple Musical Score Electronic Arts Inc

February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus

[MultimediaWiki 2006] Smush Animation Format August 2006

httpwikimultimediacxindexphptitle=SANChunk_Forma

[MultimeduiaWiki 2008a] Electronic Arts CMV April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_CMV

[MultimediaWiki 2008b] Electronic Arts DCT April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_DCT

[MultimediaWiki 2008c] Electronic Arts MAD April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MAD

[MultimediaWiki 2008d] Electronic Arts MPC April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MPC

[MultimediaWiki 2008e] Electronic Arts VP6 April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_VP6

[MultimediaWiki 2008f] Electronic Arts TGV August 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGV

[MultimediaWiki 2008g] Electronic Arts TGQ November 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ

[MultimediaWiki 2008h] American Laser Games MM January 2008

httpwikimultimediacxindexphptitle=American_Laser_Games_MM

[MultimediaWiki 2009a] Electronic Arts TQI February 2009

httpwikimultimediacxindexphptitle=Electronic_Arts_TQI

30

[MultimediaWiki 2009b] SANM October 2009

httpwikimultimediacxindexphptitle=SANM

[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240

httpcodegooglecompll-parser-

generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42

[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter

written in Python in the open source graphics file convertor UniConvertor

httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer

[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification

January 31 2011 wwwogforgdocumentsGFD174pdf

[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988

wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt

[Parr amp Quong 1995] Parr T J amp Quong R W (1995) ANTLR A predicated- ll(k) parser

generator Software Practice and Experience 25 7 (July) 789-810

httpwwwantlrorgarticle1055550346383antlrpdf

[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific

Languages Pragmatic

[Peters] Christoph Peters The Ultimate 3D model file format (v 20)

httpultimate3dorgU3DModelFileFormatV2htm

[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003

httpxiphorgoggdocrfc3533txt

[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20

wwweblongcomzarfblorb

httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt

[ Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011

wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml

[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)

Format Version 10 2001 wwwlibpngorgpubmngspecCredits

31

[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File

Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008

httpscopenofficeorgexcelfileformatpdf

[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF

[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments

3072 March 2001 wwwietforgrfcrfc3072txt

[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft

26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt

[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and

old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21

[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988

httpamigan1emunetregANIMtxt

[Tamir 2005] Giora Tamir C RIFF Parser June 2005

wwwcodeprojectcomKBfilesriffparseraspx

[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications

In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi

Khosrow-Pour Information Science Publishing pp 268-273 2008

wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf

[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for

File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute

September 2009

[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical

Engineering The University of Queensland 1996

httpiteeuqeduau~cristinastudentsdaviddavidhtml

[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language

understanding system ARPA Human Language Technologies Workshop Plainsboro New

Jersey httpaclldcupenneduHH94H94-1039pdf

[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars

Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)

Springer-Verlag Berlin 1974 pp 9-16

32

[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley

1995 (chapter 10) (Attribute Grammars)

[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010

httpxiphorgvorbisdocVorbis_I_spechtml

[Zappe 1994] H Zappe Oktalyzer 1994

httpwwwwotsitorglistaspsearch=oktampbutton=GO21

33

Appendix A Chunk-based Binary File Formats

This appendix references specifications for about 75 chunk-based binary file formats and

discusses some of those specifications

A1 Electronic Arts Interchange File Format (IFF)

[Amigan 2011]

Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the

specification for the Interchange File Format [Morrison 1985]

A11 Interleaved Bitmap (ILBM)

[Morrison 1986]

This file format was discussed in section 41 of this report

A12 8-bit Sampled Voice (8SVX)

The file format specification for the 9-bit sampled voice file format uses C programming

language data structures to define the format [Hayes and Morrison 1985] The following is a re-

expression of that specification as a context-free binary file grammar

Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata

cksize rarr ULONG

ckdata rarr voice8header bodyck

ckdata rarr voice8header nameck copyrightck bodyck

ckdata rarr voice8header copyrightck nameck bodyck

ckdata rarr voice8header nameck bodyck

ckdata rarr voice8header copyrightck bodyck

ckdata rarr voice8header authorck bodyck

ckdata rarr voice8header annotation bodyck

ckdata rarr voice8header atakck releaseck bodyck

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 29: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

23

the likelihood of preserving and making available into the indefinite future hose digital records

encoded in binary file formats

24

References

[Aho 1968] A V Aho Indexed Grammars ndash An Extension of Context-free Grammars Journal

of the ACM Vol 15 Issue 4 Oct 1968

[Alias-wavefront 1999] Maya File Formats Maya 20 1999

httpcaadarchethzchinfomayamanual

[Amigan 2011] Amigan Software IFF FORM and Chunk Registry

httpamigan1emunetregiffhtml

[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description

May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description

[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats

Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21

[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files

Version 13 Apple Computer Inc January 1989

wwwaesnashvilleorgPDFsAIFF20Specspdf

[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-

webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt

[Apple 2005] Apple Core Audio Format Specification 2005b

httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_

specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101

[Apple 2007] QuickTime File Format Specification

httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref

acehtml

[ASF 2001] ASF format version 10 reconstruction January 2001

httpavifilesourceforgenetasf-10htm

[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January

1997 wwwdcsedacukhomemxrgfx3d3DSspec

[Autodesk 2009] Autodesk DXF Reference January 2009

httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112

25

[Back 2002] G Back (2002) A specification and scripting language for binary data Generative

Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-

77

[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V

22 1979 pp 243-256

[Baum amp Noel 2002]D Baum amp G Noel The Simms Interchange File Format 2002

httpsimtechsourceforgenettechiffhtml

[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New

Horizons Software Inc 1987 httpamigan1emunetregWORDtxt

[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995

ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt

[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010

httpondasourceforgenetnlfnlfFormathtml

[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm amp Onda file

formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml

[California Digital Library 2011] California Digital Library JHOVE2 Userlsquos Guide 2011

httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf

[Caligari 1998] Caligri Caligari trueSpace file format Document Version 6 July

2002 httpwwwcaligaricomdownloadzipscncob65zip

[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information

Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements

and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-

t81pdf

[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description

Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf

[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998

[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt

[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D

Taliesin Inc httpamigan1emunetregDR2Dtxt

26

[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG

decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt

[Dunckley et al 2007] Dunckley M Rankin S Conway E amp Giaretta D (2007) The use of

file description languages for file format identification and validation Proc PV 2007

Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical

Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-

detailsw=50089

[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format

Tech 3285 July 2001 httptechebuchdocstechtech3285pdf

Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml

[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd

May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html

[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495

wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml

[Fisher amp Gruber 2005] Fisher K amp Gruber R (2005) PADS A domain specific language for

processing ad hoc data ACM Conference on Programming language Design and

Implementation ACM Press httpwwwpadsprojorgpaperspldipdf

[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG

1995

[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal

Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September

1997 wwwgnelsondemoncoukzspecquetzalhtml

[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called

Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)

httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt

[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03

httpwwwanim8orcomresourcesan8_formattxt

[Google 2010] WebP RIFF Container 2010

httpcodegooglecomspeedwebpdocsriff_containerhtml

[Haaland] Mike Haaland Flic Files (FLI) Format description

httpwwwmartinreddynetgfxanimFLItxt

27

[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube

Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf

[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice

Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt

[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata

Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1

1995

[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)

August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21

[IBM amp Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data

Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt

wwwtactilemediacominfoMCI_Control_Infohtml

httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf

[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec

11 2008 httpduskblueorgprojtoymidimidiformatpdf

[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10

ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf

[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable

Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C

Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110

[Issar amp Ward 1993] S Issar and W Ward CMUlsquos robust spoken language understanding

system EUOSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf

[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software

November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application

Version 50) wwwjasccomsupportkbarticlespspspecasp

[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image

File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version

22 April 2002

httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf

[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994

httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt

28

[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory

vol 2 pp 127_145 1968

[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical

Systems Theory vol 5 pp 95_96 1971

[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)

North-Holland Publ Co Amsterdam 1971 pp 95-109

[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language

Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml

httpwebcswpiedu~kalcoursescompilersmodule4mysahtml

[LightWave 1996] LightWave 3D Object File Format Oct 16 1996

httpsandboxdeosglightwavehtm

[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWo2] November 9 2001

httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml

[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer

1997 httppaulbourkenetdataformatscinema4dcinema4dpdf

[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and

Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf

[Maishein] Max Maischein PIC(PRO) Format

http14791177212extrafileformatgraphicspicpicprotxt

[Martin 2007] Michael Martin SVAF format description Version 02 January 2007

httpsvafsourceforgenetsvafhtml

[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and

Data Techniques July 29 1992 Revision 1097

httpnetghostnarodrugffvendspecmicriffms_rifftxt

[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New

Multimedia Data Types and Data Techniques Revision 30 April 15 1994

[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE

httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-

usstreamhhstreamaudprop_7wz7asp

[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003

December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx

29

[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-

uslibraryms899421aspx

[Morrison 1985] Jerry Morrison ―EA IFF 85rdquo Standard for Interchange Format Files

Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt

[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17

1986 wwwfine-viewcomjplabsdocilbmtxt

[Morrison 1986b] J Morrison ―SMUS IFF Simple Musical Score Electronic Arts Inc

February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus

[MultimediaWiki 2006] Smush Animation Format August 2006

httpwikimultimediacxindexphptitle=SANChunk_Forma

[MultimeduiaWiki 2008a] Electronic Arts CMV April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_CMV

[MultimediaWiki 2008b] Electronic Arts DCT April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_DCT

[MultimediaWiki 2008c] Electronic Arts MAD April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MAD

[MultimediaWiki 2008d] Electronic Arts MPC April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MPC

[MultimediaWiki 2008e] Electronic Arts VP6 April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_VP6

[MultimediaWiki 2008f] Electronic Arts TGV August 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGV

[MultimediaWiki 2008g] Electronic Arts TGQ November 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ

[MultimediaWiki 2008h] American Laser Games MM January 2008

httpwikimultimediacxindexphptitle=American_Laser_Games_MM

[MultimediaWiki 2009a] Electronic Arts TQI February 2009

httpwikimultimediacxindexphptitle=Electronic_Arts_TQI

30

[MultimediaWiki 2009b] SANM October 2009

httpwikimultimediacxindexphptitle=SANM

[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240

httpcodegooglecompll-parser-

generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42

[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter

written in Python in the open source graphics file convertor UniConvertor

httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer

[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification

January 31 2011 wwwogforgdocumentsGFD174pdf

[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988

wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt

[Parr amp Quong 1995] Parr T J amp Quong R W (1995) ANTLR A predicated- ll(k) parser

generator Software Practice and Experience 25 7 (July) 789-810

httpwwwantlrorgarticle1055550346383antlrpdf

[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific

Languages Pragmatic

[Peters] Christoph Peters The Ultimate 3D model file format (v 20)

httpultimate3dorgU3DModelFileFormatV2htm

[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003

httpxiphorgoggdocrfc3533txt

[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20

wwweblongcomzarfblorb

httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt

[ Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011

wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml

[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)

Format Version 10 2001 wwwlibpngorgpubmngspecCredits

31

[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File

Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008

httpscopenofficeorgexcelfileformatpdf

[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF

[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments

3072 March 2001 wwwietforgrfcrfc3072txt

[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft

26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt

[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and

old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21

[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988

httpamigan1emunetregANIMtxt

[Tamir 2005] Giora Tamir C RIFF Parser June 2005

wwwcodeprojectcomKBfilesriffparseraspx

[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications

In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi

Khosrow-Pour Information Science Publishing pp 268-273 2008

wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf

[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for

File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute

September 2009

[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical

Engineering The University of Queensland 1996

httpiteeuqeduau~cristinastudentsdaviddavidhtml

[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language

understanding system ARPA Human Language Technologies Workshop Plainsboro New

Jersey httpaclldcupenneduHH94H94-1039pdf

[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars

Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)

Springer-Verlag Berlin 1974 pp 9-16

32

[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley

1995 (chapter 10) (Attribute Grammars)

[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010

httpxiphorgvorbisdocVorbis_I_spechtml

[Zappe 1994] H Zappe Oktalyzer 1994

httpwwwwotsitorglistaspsearch=oktampbutton=GO21

33

Appendix A Chunk-based Binary File Formats

This appendix references specifications for about 75 chunk-based binary file formats and

discusses some of those specifications

A1 Electronic Arts Interchange File Format (IFF)

[Amigan 2011]

Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the

specification for the Interchange File Format [Morrison 1985]

A11 Interleaved Bitmap (ILBM)

[Morrison 1986]

This file format was discussed in section 41 of this report

A12 8-bit Sampled Voice (8SVX)

The file format specification for the 9-bit sampled voice file format uses C programming

language data structures to define the format [Hayes and Morrison 1985] The following is a re-

expression of that specification as a context-free binary file grammar

Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata

cksize rarr ULONG

ckdata rarr voice8header bodyck

ckdata rarr voice8header nameck copyrightck bodyck

ckdata rarr voice8header copyrightck nameck bodyck

ckdata rarr voice8header nameck bodyck

ckdata rarr voice8header copyrightck bodyck

ckdata rarr voice8header authorck bodyck

ckdata rarr voice8header annotation bodyck

ckdata rarr voice8header atakck releaseck bodyck

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 30: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

24

References

[Aho 1968] A V Aho Indexed Grammars ndash An Extension of Context-free Grammars Journal

of the ACM Vol 15 Issue 4 Oct 1968

[Alias-wavefront 1999] Maya File Formats Maya 20 1999

httpcaadarchethzchinfomayamanual

[Amigan 2011] Amigan Software IFF FORM and Chunk Registry

httpamigan1emunetregiffhtml

[Anisimovsky 2002a] Valery V Anisimovsky Electronic Arts Audio File Formats Description

May 1 2002 wwwextractorruarticleselectronic_arts_audio_file_formats_description

[Anisimovsky 2002b] Valery V Anisimovsky Electronic Arts New Audio File Formats

Description May 1 2002 wwwwotsitorglistaspsearch=electronic+artsampbutton=GO21

[Apple 1989] Audio Interchange File Format AIFF A Standard for Sampled Sound Files

Version 13 Apple Computer Inc January 1989

wwwaesnashvilleorgPDFsAIFF20Specspdf

[Apple 1991] AIFF-C Apple Computer Inc Draft 082691 httpclassic-

webarchiveorgweb20071219035740httpwwwcnpbagwellcomaiff-ctxt

[Apple 2005] Apple Core Audio Format Specification 2005b

httpdeveloperapplecomlibrarymacdocumentationMusicAudioReferenceCAFSpecCAF_

specCAF_spechtmlapple_refdocuidTP40001862-CH210-TPXREF101

[Apple 2007] QuickTime File Format Specification

httpdeveloperapplecomlibrarymacdocumentationQuickTimeQTFFQTFFPrefaceqtffPref

acehtml

[ASF 2001] ASF format version 10 reconstruction January 2001

httpavifilesourceforgenetasf-10htm

[Autodesk 1997] Autodesk Ltd 3D Studio File Format (3ds) Document Revision 093 - January

1997 wwwdcsedacukhomemxrgfx3d3DSspec

[Autodesk 2009] Autodesk DXF Reference January 2009

httpusaautodeskcomadskservletitemlinkID=10809853ampid=12272454ampsiteID=123112

25

[Back 2002] G Back (2002) A specification and scripting language for binary data Generative

Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-

77

[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V

22 1979 pp 243-256

[Baum amp Noel 2002]D Baum amp G Noel The Simms Interchange File Format 2002

httpsimtechsourceforgenettechiffhtml

[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New

Horizons Software Inc 1987 httpamigan1emunetregWORDtxt

[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995

ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt

[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010

httpondasourceforgenetnlfnlfFormathtml

[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm amp Onda file

formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml

[California Digital Library 2011] California Digital Library JHOVE2 Userlsquos Guide 2011

httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf

[Caligari 1998] Caligri Caligari trueSpace file format Document Version 6 July

2002 httpwwwcaligaricomdownloadzipscncob65zip

[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information

Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements

and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-

t81pdf

[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description

Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf

[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998

[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt

[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D

Taliesin Inc httpamigan1emunetregDR2Dtxt

26

[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG

decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt

[Dunckley et al 2007] Dunckley M Rankin S Conway E amp Giaretta D (2007) The use of

file description languages for file format identification and validation Proc PV 2007

Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical

Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-

detailsw=50089

[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format

Tech 3285 July 2001 httptechebuchdocstechtech3285pdf

Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml

[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd

May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html

[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495

wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml

[Fisher amp Gruber 2005] Fisher K amp Gruber R (2005) PADS A domain specific language for

processing ad hoc data ACM Conference on Programming language Design and

Implementation ACM Press httpwwwpadsprojorgpaperspldipdf

[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG

1995

[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal

Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September

1997 wwwgnelsondemoncoukzspecquetzalhtml

[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called

Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)

httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt

[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03

httpwwwanim8orcomresourcesan8_formattxt

[Google 2010] WebP RIFF Container 2010

httpcodegooglecomspeedwebpdocsriff_containerhtml

[Haaland] Mike Haaland Flic Files (FLI) Format description

httpwwwmartinreddynetgfxanimFLItxt

27

[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube

Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf

[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice

Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt

[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata

Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1

1995

[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)

August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21

[IBM amp Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data

Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt

wwwtactilemediacominfoMCI_Control_Infohtml

httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf

[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec

11 2008 httpduskblueorgprojtoymidimidiformatpdf

[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10

ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf

[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable

Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C

Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110

[Issar amp Ward 1993] S Issar and W Ward CMUlsquos robust spoken language understanding

system EUOSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf

[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software

November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application

Version 50) wwwjasccomsupportkbarticlespspspecasp

[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image

File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version

22 April 2002

httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf

[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994

httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt

28

[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory

vol 2 pp 127_145 1968

[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical

Systems Theory vol 5 pp 95_96 1971

[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)

North-Holland Publ Co Amsterdam 1971 pp 95-109

[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language

Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml

httpwebcswpiedu~kalcoursescompilersmodule4mysahtml

[LightWave 1996] LightWave 3D Object File Format Oct 16 1996

httpsandboxdeosglightwavehtm

[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWo2] November 9 2001

httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml

[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer

1997 httppaulbourkenetdataformatscinema4dcinema4dpdf

[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and

Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf

[Maishein] Max Maischein PIC(PRO) Format

http14791177212extrafileformatgraphicspicpicprotxt

[Martin 2007] Michael Martin SVAF format description Version 02 January 2007

httpsvafsourceforgenetsvafhtml

[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and

Data Techniques July 29 1992 Revision 1097

httpnetghostnarodrugffvendspecmicriffms_rifftxt

[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New

Multimedia Data Types and Data Techniques Revision 30 April 15 1994

[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE

httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-

usstreamhhstreamaudprop_7wz7asp

[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003

December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx

29

[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-

uslibraryms899421aspx

[Morrison 1985] Jerry Morrison ―EA IFF 85rdquo Standard for Interchange Format Files

Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt

[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17

1986 wwwfine-viewcomjplabsdocilbmtxt

[Morrison 1986b] J Morrison ―SMUS IFF Simple Musical Score Electronic Arts Inc

February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus

[MultimediaWiki 2006] Smush Animation Format August 2006

httpwikimultimediacxindexphptitle=SANChunk_Forma

[MultimeduiaWiki 2008a] Electronic Arts CMV April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_CMV

[MultimediaWiki 2008b] Electronic Arts DCT April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_DCT

[MultimediaWiki 2008c] Electronic Arts MAD April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MAD

[MultimediaWiki 2008d] Electronic Arts MPC April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MPC

[MultimediaWiki 2008e] Electronic Arts VP6 April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_VP6

[MultimediaWiki 2008f] Electronic Arts TGV August 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGV

[MultimediaWiki 2008g] Electronic Arts TGQ November 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ

[MultimediaWiki 2008h] American Laser Games MM January 2008

httpwikimultimediacxindexphptitle=American_Laser_Games_MM

[MultimediaWiki 2009a] Electronic Arts TQI February 2009

httpwikimultimediacxindexphptitle=Electronic_Arts_TQI

30

[MultimediaWiki 2009b] SANM October 2009

httpwikimultimediacxindexphptitle=SANM

[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240

httpcodegooglecompll-parser-

generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42

[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter

written in Python in the open source graphics file convertor UniConvertor

httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer

[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification

January 31 2011 wwwogforgdocumentsGFD174pdf

[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988

wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt

[Parr amp Quong 1995] Parr T J amp Quong R W (1995) ANTLR A predicated- ll(k) parser

generator Software Practice and Experience 25 7 (July) 789-810

httpwwwantlrorgarticle1055550346383antlrpdf

[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific

Languages Pragmatic

[Peters] Christoph Peters The Ultimate 3D model file format (v 20)

httpultimate3dorgU3DModelFileFormatV2htm

[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003

httpxiphorgoggdocrfc3533txt

[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20

wwweblongcomzarfblorb

httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt

[ Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011

wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml

[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)

Format Version 10 2001 wwwlibpngorgpubmngspecCredits

31

[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File

Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008

httpscopenofficeorgexcelfileformatpdf

[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF

[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments

3072 March 2001 wwwietforgrfcrfc3072txt

[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft

26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt

[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and

old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21

[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988

httpamigan1emunetregANIMtxt

[Tamir 2005] Giora Tamir C RIFF Parser June 2005

wwwcodeprojectcomKBfilesriffparseraspx

[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications

In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi

Khosrow-Pour Information Science Publishing pp 268-273 2008

wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf

[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for

File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute

September 2009

[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical

Engineering The University of Queensland 1996

httpiteeuqeduau~cristinastudentsdaviddavidhtml

[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language

understanding system ARPA Human Language Technologies Workshop Plainsboro New

Jersey httpaclldcupenneduHH94H94-1039pdf

[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars

Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)

Springer-Verlag Berlin 1974 pp 9-16

32

[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley

1995 (chapter 10) (Attribute Grammars)

[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010

httpxiphorgvorbisdocVorbis_I_spechtml

[Zappe 1994] H Zappe Oktalyzer 1994

httpwwwwotsitorglistaspsearch=oktampbutton=GO21

33

Appendix A Chunk-based Binary File Formats

This appendix references specifications for about 75 chunk-based binary file formats and

discusses some of those specifications

A1 Electronic Arts Interchange File Format (IFF)

[Amigan 2011]

Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the

specification for the Interchange File Format [Morrison 1985]

A11 Interleaved Bitmap (ILBM)

[Morrison 1986]

This file format was discussed in section 41 of this report

A12 8-bit Sampled Voice (8SVX)

The file format specification for the 9-bit sampled voice file format uses C programming

language data structures to define the format [Hayes and Morrison 1985] The following is a re-

expression of that specification as a context-free binary file grammar

Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata

cksize rarr ULONG

ckdata rarr voice8header bodyck

ckdata rarr voice8header nameck copyrightck bodyck

ckdata rarr voice8header copyrightck nameck bodyck

ckdata rarr voice8header nameck bodyck

ckdata rarr voice8header copyrightck bodyck

ckdata rarr voice8header authorck bodyck

ckdata rarr voice8header annotation bodyck

ckdata rarr voice8header atakck releaseck bodyck

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 31: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

25

[Back 2002] G Back (2002) A specification and scripting language for binary data Generative

Programming and Component Engineering Vol 2487 Lecture Notes in Computer Science 66-

77

[Barth 1979] G Barth Fast recognition of context-sensitive structures Journal of Computing V

22 1979 pp 243-256

[Baum amp Noel 2002]D Baum amp G Noel The Simms Interchange File Format 2002

httpsimtechsourceforgenettechiffhtml

[Bayless 1987] James Bayless WORD (word processing FORM used by ProWrite) New

Horizons Software Inc 1987 httpamigan1emunetregWORDtxt

[Beham 1995] H Beham DigiTrekker Module Format (DTM) rev10 1995

ftpftpmodlandcompubdocumentsformat_documentationDigiTrekker20(dtm)txt

[Blank Aspect 2010a] blankaspectorg NLF (Nested List File) format December 2010

httpondasourceforgenetnlfnlfFormathtml

[Blank Aspect 2010b] blankaspectorg Onda lossless audio compression algorithm amp Onda file

formats March 2010 httpondasourceforgenetondaAlgorithmAndFileFormatshtml

[California Digital Library 2011] California Digital Library JHOVE2 Userlsquos Guide 2011

httpsbytebucketorgjhove2mainwikidocumentsJHOVE2-Users-Guide_20110222pdf

[Caligari 1998] Caligri Caligari trueSpace file format Document Version 6 July

2002 httpwwwcaligaricomdownloadzipscncob65zip

[CCITT 1992] CCITT Terminal Equipment And Protocols for Telematic Services Information

Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements

and Guidelines Recommendation T81 September 1992 httpwwww3orgGraphicsJPEGitu-

t81pdf

[CCSDS 2010] Consultative Committee for Space Data Systems (2010) The Data Description

Language EAST Specification httppublicccsdsorgpublicationsarchive644x0b3pdf

[Corel 1998] Corel Corporation CMX File Format Ver 80 31 March 1998

[Crazy Talk] Crazy Talk Animator Pro file Formats wwwmartinreddynetgfxanimFLCtxt

[Cunniff and Orr] Ross Cunniff and John Orr 2D Standard Object Format FORM DR2D

Taliesin Inc httpamigan1emunetregDR2Dtxt

26

[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG

decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt

[Dunckley et al 2007] Dunckley M Rankin S Conway E amp Giaretta D (2007) The use of

file description languages for file format identification and validation Proc PV 2007

Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical

Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-

detailsw=50089

[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format

Tech 3285 July 2001 httptechebuchdocstechtech3285pdf

Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml

[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd

May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html

[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495

wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml

[Fisher amp Gruber 2005] Fisher K amp Gruber R (2005) PADS A domain specific language for

processing ad hoc data ACM Conference on Programming language Design and

Implementation ACM Press httpwwwpadsprojorgpaperspldipdf

[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG

1995

[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal

Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September

1997 wwwgnelsondemoncoukzspecquetzalhtml

[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called

Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)

httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt

[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03

httpwwwanim8orcomresourcesan8_formattxt

[Google 2010] WebP RIFF Container 2010

httpcodegooglecomspeedwebpdocsriff_containerhtml

[Haaland] Mike Haaland Flic Files (FLI) Format description

httpwwwmartinreddynetgfxanimFLItxt

27

[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube

Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf

[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice

Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt

[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata

Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1

1995

[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)

August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21

[IBM amp Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data

Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt

wwwtactilemediacominfoMCI_Control_Infohtml

httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf

[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec

11 2008 httpduskblueorgprojtoymidimidiformatpdf

[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10

ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf

[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable

Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C

Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110

[Issar amp Ward 1993] S Issar and W Ward CMUlsquos robust spoken language understanding

system EUOSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf

[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software

November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application

Version 50) wwwjasccomsupportkbarticlespspspecasp

[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image

File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version

22 April 2002

httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf

[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994

httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt

28

[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory

vol 2 pp 127_145 1968

[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical

Systems Theory vol 5 pp 95_96 1971

[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)

North-Holland Publ Co Amsterdam 1971 pp 95-109

[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language

Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml

httpwebcswpiedu~kalcoursescompilersmodule4mysahtml

[LightWave 1996] LightWave 3D Object File Format Oct 16 1996

httpsandboxdeosglightwavehtm

[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWo2] November 9 2001

httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml

[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer

1997 httppaulbourkenetdataformatscinema4dcinema4dpdf

[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and

Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf

[Maishein] Max Maischein PIC(PRO) Format

http14791177212extrafileformatgraphicspicpicprotxt

[Martin 2007] Michael Martin SVAF format description Version 02 January 2007

httpsvafsourceforgenetsvafhtml

[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and

Data Techniques July 29 1992 Revision 1097

httpnetghostnarodrugffvendspecmicriffms_rifftxt

[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New

Multimedia Data Types and Data Techniques Revision 30 April 15 1994

[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE

httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-

usstreamhhstreamaudprop_7wz7asp

[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003

December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx

29

[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-

uslibraryms899421aspx

[Morrison 1985] Jerry Morrison ―EA IFF 85rdquo Standard for Interchange Format Files

Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt

[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17

1986 wwwfine-viewcomjplabsdocilbmtxt

[Morrison 1986b] J Morrison ―SMUS IFF Simple Musical Score Electronic Arts Inc

February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus

[MultimediaWiki 2006] Smush Animation Format August 2006

httpwikimultimediacxindexphptitle=SANChunk_Forma

[MultimeduiaWiki 2008a] Electronic Arts CMV April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_CMV

[MultimediaWiki 2008b] Electronic Arts DCT April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_DCT

[MultimediaWiki 2008c] Electronic Arts MAD April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MAD

[MultimediaWiki 2008d] Electronic Arts MPC April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MPC

[MultimediaWiki 2008e] Electronic Arts VP6 April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_VP6

[MultimediaWiki 2008f] Electronic Arts TGV August 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGV

[MultimediaWiki 2008g] Electronic Arts TGQ November 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ

[MultimediaWiki 2008h] American Laser Games MM January 2008

httpwikimultimediacxindexphptitle=American_Laser_Games_MM

[MultimediaWiki 2009a] Electronic Arts TQI February 2009

httpwikimultimediacxindexphptitle=Electronic_Arts_TQI

30

[MultimediaWiki 2009b] SANM October 2009

httpwikimultimediacxindexphptitle=SANM

[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240

httpcodegooglecompll-parser-

generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42

[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter

written in Python in the open source graphics file convertor UniConvertor

httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer

[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification

January 31 2011 wwwogforgdocumentsGFD174pdf

[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988

wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt

[Parr amp Quong 1995] Parr T J amp Quong R W (1995) ANTLR A predicated- ll(k) parser

generator Software Practice and Experience 25 7 (July) 789-810

httpwwwantlrorgarticle1055550346383antlrpdf

[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific

Languages Pragmatic

[Peters] Christoph Peters The Ultimate 3D model file format (v 20)

httpultimate3dorgU3DModelFileFormatV2htm

[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003

httpxiphorgoggdocrfc3533txt

[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20

wwweblongcomzarfblorb

httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt

[ Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011

wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml

[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)

Format Version 10 2001 wwwlibpngorgpubmngspecCredits

31

[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File

Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008

httpscopenofficeorgexcelfileformatpdf

[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF

[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments

3072 March 2001 wwwietforgrfcrfc3072txt

[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft

26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt

[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and

old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21

[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988

httpamigan1emunetregANIMtxt

[Tamir 2005] Giora Tamir C RIFF Parser June 2005

wwwcodeprojectcomKBfilesriffparseraspx

[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications

In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi

Khosrow-Pour Information Science Publishing pp 268-273 2008

wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf

[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for

File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute

September 2009

[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical

Engineering The University of Queensland 1996

httpiteeuqeduau~cristinastudentsdaviddavidhtml

[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language

understanding system ARPA Human Language Technologies Workshop Plainsboro New

Jersey httpaclldcupenneduHH94H94-1039pdf

[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars

Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)

Springer-Verlag Berlin 1974 pp 9-16

32

[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley

1995 (chapter 10) (Attribute Grammars)

[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010

httpxiphorgvorbisdocVorbis_I_spechtml

[Zappe 1994] H Zappe Oktalyzer 1994

httpwwwwotsitorglistaspsearch=oktampbutton=GO21

33

Appendix A Chunk-based Binary File Formats

This appendix references specifications for about 75 chunk-based binary file formats and

discusses some of those specifications

A1 Electronic Arts Interchange File Format (IFF)

[Amigan 2011]

Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the

specification for the Interchange File Format [Morrison 1985]

A11 Interleaved Bitmap (ILBM)

[Morrison 1986]

This file format was discussed in section 41 of this report

A12 8-bit Sampled Voice (8SVX)

The file format specification for the 9-bit sampled voice file format uses C programming

language data structures to define the format [Hayes and Morrison 1985] The following is a re-

expression of that specification as a context-free binary file grammar

Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata

cksize rarr ULONG

ckdata rarr voice8header bodyck

ckdata rarr voice8header nameck copyrightck bodyck

ckdata rarr voice8header copyrightck nameck bodyck

ckdata rarr voice8header nameck bodyck

ckdata rarr voice8header copyrightck bodyck

ckdata rarr voice8header authorck bodyck

ckdata rarr voice8header annotation bodyck

ckdata rarr voice8header atakck releaseck bodyck

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 32: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

26

[Cuturicu and Fromme 1999] Cuturicu Cristi and Oliver Fromme CRYXs note about the JPEG

decoding algorithm 1999 wwwcoco3comtextdoc_JPEGtxt

[Dunckley et al 2007] Dunckley M Rankin S Conway E amp Giaretta D (2007) The use of

file description languages for file format identification and validation Proc PV 2007

Conference Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical

Data DLR OberpfaffenhofenMunich Germany httpepubscclrcacukwork-

detailsw=50089

[EBU 2001] European Broadcasting Union Specification of the EBU Broadcast Wave Format

Tech 3285 July 2001 httptechebuchdocstechtech3285pdf

Frans Faase BFF A Grammar for Binary File Formats wwwiwriteiamnlHa_BFFhtml

[Fercoq 1996] Robin Fercoq 3D-Studio Material-Library File Format (mli) Autodesk Ltd

May 1996 wwwmediatelluworkshopgraphic3D_fileformath_3ds2html

[Ferguson 1995] Stuart Ferguson Lightwave 3d Layered Object File Format Lightwave 10495

wwwmediatelluworkshopgraphic3D_fileformatlightwaveh_lwlohtml

[Fisher amp Gruber 2005] Fisher K amp Gruber R (2005) PADS A domain specific language for

processing ad hoc data ACM Conference on Programming language Design and

Implementation ACM Press httpwwwpadsprojorgpaperspldipdf

[Foo 1995] Kenneth Foo AUDIO MANAGER 4 MODULE FORMAT TechnoMaestroRDG

1995

[Frost 1997a] Martin Frost Z-machine Common Save-File Format Standard also called Quetzal

Quetzal Unifies Efficiently The Z-Machine Archive Language version 13b 29th September

1997 wwwgnelsondemoncoukzspecquetzalhtml

[Frost 1997b] Martin Frost Z-machine Common Save-File Format Standard also called

Quetzal Quetzal Unifies Efficiently The Z-Machine Archive Language version 14 (03-Nov-97)

httpifarchiveheanetieif-archiveinfocominterpretersspecificationsavefile_14txt

[Glanville 2003] R S Glanville Format of Anim8or an8 Files - V085 21-Sep-03

httpwwwanim8orcomresourcesan8_formattxt

[Google 2010] WebP RIFF Container 2010

httpcodegooglecomspeedwebpdocsriff_containerhtml

[Haaland] Mike Haaland Flic Files (FLI) Format description

httpwwwmartinreddynetgfxanimFLItxt

27

[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube

Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf

[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice

Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt

[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata

Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1

1995

[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)

August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21

[IBM amp Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data

Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt

wwwtactilemediacominfoMCI_Control_Infohtml

httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf

[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec

11 2008 httpduskblueorgprojtoymidimidiformatpdf

[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10

ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf

[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable

Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C

Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110

[Issar amp Ward 1993] S Issar and W Ward CMUlsquos robust spoken language understanding

system EUOSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf

[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software

November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application

Version 50) wwwjasccomsupportkbarticlespspspecasp

[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image

File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version

22 April 2002

httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf

[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994

httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt

28

[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory

vol 2 pp 127_145 1968

[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical

Systems Theory vol 5 pp 95_96 1971

[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)

North-Holland Publ Co Amsterdam 1971 pp 95-109

[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language

Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml

httpwebcswpiedu~kalcoursescompilersmodule4mysahtml

[LightWave 1996] LightWave 3D Object File Format Oct 16 1996

httpsandboxdeosglightwavehtm

[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWo2] November 9 2001

httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml

[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer

1997 httppaulbourkenetdataformatscinema4dcinema4dpdf

[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and

Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf

[Maishein] Max Maischein PIC(PRO) Format

http14791177212extrafileformatgraphicspicpicprotxt

[Martin 2007] Michael Martin SVAF format description Version 02 January 2007

httpsvafsourceforgenetsvafhtml

[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and

Data Techniques July 29 1992 Revision 1097

httpnetghostnarodrugffvendspecmicriffms_rifftxt

[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New

Multimedia Data Types and Data Techniques Revision 30 April 15 1994

[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE

httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-

usstreamhhstreamaudprop_7wz7asp

[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003

December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx

29

[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-

uslibraryms899421aspx

[Morrison 1985] Jerry Morrison ―EA IFF 85rdquo Standard for Interchange Format Files

Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt

[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17

1986 wwwfine-viewcomjplabsdocilbmtxt

[Morrison 1986b] J Morrison ―SMUS IFF Simple Musical Score Electronic Arts Inc

February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus

[MultimediaWiki 2006] Smush Animation Format August 2006

httpwikimultimediacxindexphptitle=SANChunk_Forma

[MultimeduiaWiki 2008a] Electronic Arts CMV April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_CMV

[MultimediaWiki 2008b] Electronic Arts DCT April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_DCT

[MultimediaWiki 2008c] Electronic Arts MAD April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MAD

[MultimediaWiki 2008d] Electronic Arts MPC April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MPC

[MultimediaWiki 2008e] Electronic Arts VP6 April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_VP6

[MultimediaWiki 2008f] Electronic Arts TGV August 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGV

[MultimediaWiki 2008g] Electronic Arts TGQ November 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ

[MultimediaWiki 2008h] American Laser Games MM January 2008

httpwikimultimediacxindexphptitle=American_Laser_Games_MM

[MultimediaWiki 2009a] Electronic Arts TQI February 2009

httpwikimultimediacxindexphptitle=Electronic_Arts_TQI

30

[MultimediaWiki 2009b] SANM October 2009

httpwikimultimediacxindexphptitle=SANM

[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240

httpcodegooglecompll-parser-

generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42

[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter

written in Python in the open source graphics file convertor UniConvertor

httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer

[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification

January 31 2011 wwwogforgdocumentsGFD174pdf

[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988

wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt

[Parr amp Quong 1995] Parr T J amp Quong R W (1995) ANTLR A predicated- ll(k) parser

generator Software Practice and Experience 25 7 (July) 789-810

httpwwwantlrorgarticle1055550346383antlrpdf

[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific

Languages Pragmatic

[Peters] Christoph Peters The Ultimate 3D model file format (v 20)

httpultimate3dorgU3DModelFileFormatV2htm

[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003

httpxiphorgoggdocrfc3533txt

[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20

wwweblongcomzarfblorb

httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt

[ Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011

wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml

[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)

Format Version 10 2001 wwwlibpngorgpubmngspecCredits

31

[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File

Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008

httpscopenofficeorgexcelfileformatpdf

[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF

[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments

3072 March 2001 wwwietforgrfcrfc3072txt

[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft

26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt

[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and

old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21

[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988

httpamigan1emunetregANIMtxt

[Tamir 2005] Giora Tamir C RIFF Parser June 2005

wwwcodeprojectcomKBfilesriffparseraspx

[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications

In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi

Khosrow-Pour Information Science Publishing pp 268-273 2008

wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf

[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for

File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute

September 2009

[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical

Engineering The University of Queensland 1996

httpiteeuqeduau~cristinastudentsdaviddavidhtml

[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language

understanding system ARPA Human Language Technologies Workshop Plainsboro New

Jersey httpaclldcupenneduHH94H94-1039pdf

[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars

Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)

Springer-Verlag Berlin 1974 pp 9-16

32

[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley

1995 (chapter 10) (Attribute Grammars)

[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010

httpxiphorgvorbisdocVorbis_I_spechtml

[Zappe 1994] H Zappe Oktalyzer 1994

httpwwwwotsitorglistaspsearch=oktampbutton=GO21

33

Appendix A Chunk-based Binary File Formats

This appendix references specifications for about 75 chunk-based binary file formats and

discusses some of those specifications

A1 Electronic Arts Interchange File Format (IFF)

[Amigan 2011]

Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the

specification for the Interchange File Format [Morrison 1985]

A11 Interleaved Bitmap (ILBM)

[Morrison 1986]

This file format was discussed in section 41 of this report

A12 8-bit Sampled Voice (8SVX)

The file format specification for the 9-bit sampled voice file format uses C programming

language data structures to define the format [Hayes and Morrison 1985] The following is a re-

expression of that specification as a context-free binary file grammar

Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata

cksize rarr ULONG

ckdata rarr voice8header bodyck

ckdata rarr voice8header nameck copyrightck bodyck

ckdata rarr voice8header copyrightck nameck bodyck

ckdata rarr voice8header nameck bodyck

ckdata rarr voice8header copyrightck bodyck

ckdata rarr voice8header authorck bodyck

ckdata rarr voice8header annotation bodyck

ckdata rarr voice8header atakck releaseck bodyck

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 33: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

27

[Hamilton 1992] Hamilton Eric JPEG File Interchange Format Version 102 C-Cube

Microsystems Milpitas CA September 1 1992 wwwjpegorgpublicjfifpdf

[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison 8SVX IFF 8-Bit Sampled Voice

Electronic Arts February 7 1985 httpamigan1emunetreg8SVXtxt

[Hopcroft et al 1995] JE Hopcroft R Motwani and JD Ullman Introduction to Automata

Theory Languages and Computation Second Edition Addison Wesley ISBN 0-201-44124-1

1995

[Houghtaling 1996] R James Houghtaling ANI (Windows95 Animated Cursor File Format)

August 1996 wwwwotsitorglistaspsearch=aniampbutton=GO21

[IBM amp Microsoft 1991] IBM and Microsoft Multimedia Programming Interface and Data

Specifications 10 August 1991 wwwkkiij4uorjp~kondowavempidatatxt

wwwtactilemediacominfoMCI_Control_Infohtml

httpwww-mmspecemcgillcadocumentsaudioformatswaveDocsriffmcipdf

[Int MIDI Assoc 2008] The International MIDI Association Standard MIDI-File Format Spec

11 2008 httpduskblueorgprojtoymidimidiformatpdf

[ISO 2000] JPEG 2000 Part I Core Coding System Final Committee Draft Version 10

ISOIEC JTC 1SC 29WG 1 March 2000 httpwwwjpegorgpublicfcd15444-1pdf

[ISO 2003] Information technology mdash Computer graphics and image processing mdash Portable

Network Graphics (PNG) Functional specification ISOIEC 159482003 (E) W3C

Recommendation 10 November 2003 wwww3orgTR2003REC-PNG-20031110

[Issar amp Ward 1993] S Issar and W Ward CMUlsquos robust spoken language understanding

system EUOSPEECH-93 p2147-2150 wwwcscmuedu~airpapersissar_wardpdf

[Jasc 1998] Jasc Software Paint Shop Pro (PSP) File Format Specification Jasc Software

November 1998 (Paint Shop Pro (PSP) File Format Version 30 Paint Shop Pro Application

Version 50) wwwjasccomsupportkbarticlespspspecasp

[JEIDA 2002] Japan Electronic Industry Development Association Digital Still Camera Image

File Format Standard (Exchangeable image file format for Digital Still CameraExif) Version

22 April 2002

httpwwwkodakcomglobalpluginsacrobatenservicedigCamexifStandard2pdf

[Kirvan 1994] S Kirvan Imagine 30 Object Format Impulse Inc May 1994

httpwwwtextfilescomprogrammingFORMATSt3d_doc1txt

28

[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory

vol 2 pp 127_145 1968

[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical

Systems Theory vol 5 pp 95_96 1971

[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)

North-Holland Publ Co Amsterdam 1971 pp 95-109

[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language

Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml

httpwebcswpiedu~kalcoursescompilersmodule4mysahtml

[LightWave 1996] LightWave 3D Object File Format Oct 16 1996

httpsandboxdeosglightwavehtm

[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWo2] November 9 2001

httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml

[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer

1997 httppaulbourkenetdataformatscinema4dcinema4dpdf

[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and

Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf

[Maishein] Max Maischein PIC(PRO) Format

http14791177212extrafileformatgraphicspicpicprotxt

[Martin 2007] Michael Martin SVAF format description Version 02 January 2007

httpsvafsourceforgenetsvafhtml

[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and

Data Techniques July 29 1992 Revision 1097

httpnetghostnarodrugffvendspecmicriffms_rifftxt

[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New

Multimedia Data Types and Data Techniques Revision 30 April 15 1994

[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE

httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-

usstreamhhstreamaudprop_7wz7asp

[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003

December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx

29

[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-

uslibraryms899421aspx

[Morrison 1985] Jerry Morrison ―EA IFF 85rdquo Standard for Interchange Format Files

Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt

[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17

1986 wwwfine-viewcomjplabsdocilbmtxt

[Morrison 1986b] J Morrison ―SMUS IFF Simple Musical Score Electronic Arts Inc

February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus

[MultimediaWiki 2006] Smush Animation Format August 2006

httpwikimultimediacxindexphptitle=SANChunk_Forma

[MultimeduiaWiki 2008a] Electronic Arts CMV April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_CMV

[MultimediaWiki 2008b] Electronic Arts DCT April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_DCT

[MultimediaWiki 2008c] Electronic Arts MAD April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MAD

[MultimediaWiki 2008d] Electronic Arts MPC April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MPC

[MultimediaWiki 2008e] Electronic Arts VP6 April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_VP6

[MultimediaWiki 2008f] Electronic Arts TGV August 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGV

[MultimediaWiki 2008g] Electronic Arts TGQ November 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ

[MultimediaWiki 2008h] American Laser Games MM January 2008

httpwikimultimediacxindexphptitle=American_Laser_Games_MM

[MultimediaWiki 2009a] Electronic Arts TQI February 2009

httpwikimultimediacxindexphptitle=Electronic_Arts_TQI

30

[MultimediaWiki 2009b] SANM October 2009

httpwikimultimediacxindexphptitle=SANM

[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240

httpcodegooglecompll-parser-

generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42

[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter

written in Python in the open source graphics file convertor UniConvertor

httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer

[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification

January 31 2011 wwwogforgdocumentsGFD174pdf

[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988

wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt

[Parr amp Quong 1995] Parr T J amp Quong R W (1995) ANTLR A predicated- ll(k) parser

generator Software Practice and Experience 25 7 (July) 789-810

httpwwwantlrorgarticle1055550346383antlrpdf

[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific

Languages Pragmatic

[Peters] Christoph Peters The Ultimate 3D model file format (v 20)

httpultimate3dorgU3DModelFileFormatV2htm

[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003

httpxiphorgoggdocrfc3533txt

[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20

wwweblongcomzarfblorb

httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt

[ Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011

wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml

[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)

Format Version 10 2001 wwwlibpngorgpubmngspecCredits

31

[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File

Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008

httpscopenofficeorgexcelfileformatpdf

[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF

[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments

3072 March 2001 wwwietforgrfcrfc3072txt

[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft

26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt

[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and

old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21

[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988

httpamigan1emunetregANIMtxt

[Tamir 2005] Giora Tamir C RIFF Parser June 2005

wwwcodeprojectcomKBfilesriffparseraspx

[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications

In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi

Khosrow-Pour Information Science Publishing pp 268-273 2008

wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf

[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for

File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute

September 2009

[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical

Engineering The University of Queensland 1996

httpiteeuqeduau~cristinastudentsdaviddavidhtml

[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language

understanding system ARPA Human Language Technologies Workshop Plainsboro New

Jersey httpaclldcupenneduHH94H94-1039pdf

[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars

Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)

Springer-Verlag Berlin 1974 pp 9-16

32

[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley

1995 (chapter 10) (Attribute Grammars)

[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010

httpxiphorgvorbisdocVorbis_I_spechtml

[Zappe 1994] H Zappe Oktalyzer 1994

httpwwwwotsitorglistaspsearch=oktampbutton=GO21

33

Appendix A Chunk-based Binary File Formats

This appendix references specifications for about 75 chunk-based binary file formats and

discusses some of those specifications

A1 Electronic Arts Interchange File Format (IFF)

[Amigan 2011]

Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the

specification for the Interchange File Format [Morrison 1985]

A11 Interleaved Bitmap (ILBM)

[Morrison 1986]

This file format was discussed in section 41 of this report

A12 8-bit Sampled Voice (8SVX)

The file format specification for the 9-bit sampled voice file format uses C programming

language data structures to define the format [Hayes and Morrison 1985] The following is a re-

expression of that specification as a context-free binary file grammar

Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata

cksize rarr ULONG

ckdata rarr voice8header bodyck

ckdata rarr voice8header nameck copyrightck bodyck

ckdata rarr voice8header copyrightck nameck bodyck

ckdata rarr voice8header nameck bodyck

ckdata rarr voice8header copyrightck bodyck

ckdata rarr voice8header authorck bodyck

ckdata rarr voice8header annotation bodyck

ckdata rarr voice8header atakck releaseck bodyck

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 34: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

28

[Knuth 1969] D E Knuth Semantics of Context-Free Grammars Mathematical Systems Theory

vol 2 pp 127_145 1968

[Knuth 1971] D E Knuth Semantics of Context-Free Grammars (corrections) Mathematical

Systems Theory vol 5 pp 95_96 1971

[Koster 1971] CHA Koster Affix grammars In ALGOL 68 Implementation JEL Peck (ed)

North-Holland Publ Co Amsterdam 1971 pp 95-109

[Lemone 1993] Karen Lemone Design of Compilers techniques of Programming Language

Translation CRC Press 1993 httpwebcswpiedu~kalcoursescompilersprojectindexhtml

httpwebcswpiedu~kalcoursescompilersmodule4mysahtml

[LightWave 1996] LightWave 3D Object File Format Oct 16 1996

httpsandboxdeosglightwavehtm

[Lightwave 2001a] Lightwave Lightwave 3D Object Files (LWo2] November 9 2001

httpmuskratmiddleburyedultcrongoingscivizLightWaveSDKdocsfilefmtslwschtml

[Losch 1997] Philip Losch CINEMA 4Dreg mdash File Format Description MAXON Computer

1997 httppaulbourkenetdataformatscinema4dcinema4dpdf

[Luxology 2006] Luxology ―Appendix H LXO file format Extensions modo 202 Scripting and

Commands 2006 wwwkxcadnetmodoScripting_and_Commandspdf

[Maishein] Max Maischein PIC(PRO) Format

http14791177212extrafileformatgraphicspicpicprotxt

[Martin 2007] Michael Martin SVAF format description Version 02 January 2007

httpsvafsourceforgenetsvafhtml

[Microsoft 1992] Microsoft Multimedia Standards Update New Multimedia Data Types and

Data Techniques July 29 1992 Revision 1097

httpnetghostnarodrugffvendspecmicriffms_rifftxt

[Microsoft 1994] Microsoft Corporation Microsoft Multimedia Standards Update New

Multimedia Data Types and Data Techniques Revision 30 April 15 1994

[Microsoft 2003b] Microsoft Corporation WAVEFORMATEXTENSIBLE

httpmsdnmicrosoftcomlibrarydefaultaspurl=libraryen-

usstreamhhstreamaudprop_7wz7asp

[Microsoft 2004] Microsoft Advanced Systems Format (ASF) Specification Revision 012003

December 2004 wwwmicrosoftcomwindowswindowsmediaforprosformatasfspecaspx

29

[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-

uslibraryms899421aspx

[Morrison 1985] Jerry Morrison ―EA IFF 85rdquo Standard for Interchange Format Files

Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt

[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17

1986 wwwfine-viewcomjplabsdocilbmtxt

[Morrison 1986b] J Morrison ―SMUS IFF Simple Musical Score Electronic Arts Inc

February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus

[MultimediaWiki 2006] Smush Animation Format August 2006

httpwikimultimediacxindexphptitle=SANChunk_Forma

[MultimeduiaWiki 2008a] Electronic Arts CMV April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_CMV

[MultimediaWiki 2008b] Electronic Arts DCT April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_DCT

[MultimediaWiki 2008c] Electronic Arts MAD April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MAD

[MultimediaWiki 2008d] Electronic Arts MPC April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MPC

[MultimediaWiki 2008e] Electronic Arts VP6 April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_VP6

[MultimediaWiki 2008f] Electronic Arts TGV August 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGV

[MultimediaWiki 2008g] Electronic Arts TGQ November 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ

[MultimediaWiki 2008h] American Laser Games MM January 2008

httpwikimultimediacxindexphptitle=American_Laser_Games_MM

[MultimediaWiki 2009a] Electronic Arts TQI February 2009

httpwikimultimediacxindexphptitle=Electronic_Arts_TQI

30

[MultimediaWiki 2009b] SANM October 2009

httpwikimultimediacxindexphptitle=SANM

[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240

httpcodegooglecompll-parser-

generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42

[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter

written in Python in the open source graphics file convertor UniConvertor

httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer

[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification

January 31 2011 wwwogforgdocumentsGFD174pdf

[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988

wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt

[Parr amp Quong 1995] Parr T J amp Quong R W (1995) ANTLR A predicated- ll(k) parser

generator Software Practice and Experience 25 7 (July) 789-810

httpwwwantlrorgarticle1055550346383antlrpdf

[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific

Languages Pragmatic

[Peters] Christoph Peters The Ultimate 3D model file format (v 20)

httpultimate3dorgU3DModelFileFormatV2htm

[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003

httpxiphorgoggdocrfc3533txt

[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20

wwweblongcomzarfblorb

httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt

[ Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011

wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml

[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)

Format Version 10 2001 wwwlibpngorgpubmngspecCredits

31

[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File

Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008

httpscopenofficeorgexcelfileformatpdf

[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF

[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments

3072 March 2001 wwwietforgrfcrfc3072txt

[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft

26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt

[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and

old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21

[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988

httpamigan1emunetregANIMtxt

[Tamir 2005] Giora Tamir C RIFF Parser June 2005

wwwcodeprojectcomKBfilesriffparseraspx

[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications

In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi

Khosrow-Pour Information Science Publishing pp 268-273 2008

wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf

[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for

File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute

September 2009

[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical

Engineering The University of Queensland 1996

httpiteeuqeduau~cristinastudentsdaviddavidhtml

[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language

understanding system ARPA Human Language Technologies Workshop Plainsboro New

Jersey httpaclldcupenneduHH94H94-1039pdf

[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars

Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)

Springer-Verlag Berlin 1974 pp 9-16

32

[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley

1995 (chapter 10) (Attribute Grammars)

[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010

httpxiphorgvorbisdocVorbis_I_spechtml

[Zappe 1994] H Zappe Oktalyzer 1994

httpwwwwotsitorglistaspsearch=oktampbutton=GO21

33

Appendix A Chunk-based Binary File Formats

This appendix references specifications for about 75 chunk-based binary file formats and

discusses some of those specifications

A1 Electronic Arts Interchange File Format (IFF)

[Amigan 2011]

Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the

specification for the Interchange File Format [Morrison 1985]

A11 Interleaved Bitmap (ILBM)

[Morrison 1986]

This file format was discussed in section 41 of this report

A12 8-bit Sampled Voice (8SVX)

The file format specification for the 9-bit sampled voice file format uses C programming

language data structures to define the format [Hayes and Morrison 1985] The following is a re-

expression of that specification as a context-free binary file grammar

Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata

cksize rarr ULONG

ckdata rarr voice8header bodyck

ckdata rarr voice8header nameck copyrightck bodyck

ckdata rarr voice8header copyrightck nameck bodyck

ckdata rarr voice8header nameck bodyck

ckdata rarr voice8header copyrightck bodyck

ckdata rarr voice8header authorck bodyck

ckdata rarr voice8header annotation bodyck

ckdata rarr voice8header atakck releaseck bodyck

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 35: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

29

[Microsoft 2005] Microsoft AVI RIFF File Reference 2005 httpmsdnmicrosoftcomen-

uslibraryms899421aspx

[Morrison 1985] Jerry Morrison ―EA IFF 85rdquo Standard for Interchange Format Files

Electronic Arts January 14 1985 wwwmartinreddynetgfx2dIFFtxt

[Morrison 1986a] Jerry Morrison ILBM IFF Interleaved Bitmap Electronic Arts January 17

1986 wwwfine-viewcomjplabsdocilbmtxt

[Morrison 1986b] J Morrison ―SMUS IFF Simple Musical Score Electronic Arts Inc

February 5 1986 httpgopherproxyorggophermeulienet00textfilescomputerssmus

[MultimediaWiki 2006] Smush Animation Format August 2006

httpwikimultimediacxindexphptitle=SANChunk_Forma

[MultimeduiaWiki 2008a] Electronic Arts CMV April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_CMV

[MultimediaWiki 2008b] Electronic Arts DCT April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_DCT

[MultimediaWiki 2008c] Electronic Arts MAD April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MAD

[MultimediaWiki 2008d] Electronic Arts MPC April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_MPC

[MultimediaWiki 2008e] Electronic Arts VP6 April 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_VP6

[MultimediaWiki 2008f] Electronic Arts TGV August 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGV

[MultimediaWiki 2008g] Electronic Arts TGQ November 2008

httpwikimultimediacxindexphptitle=Electronic_Arts_TGQ

[MultimediaWiki 2008h] American Laser Games MM January 2008

httpwikimultimediacxindexphptitle=American_Laser_Games_MM

[MultimediaWiki 2009a] Electronic Arts TQI February 2009

httpwikimultimediacxindexphptitle=Electronic_Arts_TQI

30

[MultimediaWiki 2009b] SANM October 2009

httpwikimultimediacxindexphptitle=SANM

[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240

httpcodegooglecompll-parser-

generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42

[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter

written in Python in the open source graphics file convertor UniConvertor

httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer

[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification

January 31 2011 wwwogforgdocumentsGFD174pdf

[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988

wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt

[Parr amp Quong 1995] Parr T J amp Quong R W (1995) ANTLR A predicated- ll(k) parser

generator Software Practice and Experience 25 7 (July) 789-810

httpwwwantlrorgarticle1055550346383antlrpdf

[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific

Languages Pragmatic

[Peters] Christoph Peters The Ultimate 3D model file format (v 20)

httpultimate3dorgU3DModelFileFormatV2htm

[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003

httpxiphorgoggdocrfc3533txt

[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20

wwweblongcomzarfblorb

httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt

[ Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011

wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml

[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)

Format Version 10 2001 wwwlibpngorgpubmngspecCredits

31

[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File

Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008

httpscopenofficeorgexcelfileformatpdf

[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF

[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments

3072 March 2001 wwwietforgrfcrfc3072txt

[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft

26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt

[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and

old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21

[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988

httpamigan1emunetregANIMtxt

[Tamir 2005] Giora Tamir C RIFF Parser June 2005

wwwcodeprojectcomKBfilesriffparseraspx

[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications

In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi

Khosrow-Pour Information Science Publishing pp 268-273 2008

wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf

[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for

File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute

September 2009

[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical

Engineering The University of Queensland 1996

httpiteeuqeduau~cristinastudentsdaviddavidhtml

[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language

understanding system ARPA Human Language Technologies Workshop Plainsboro New

Jersey httpaclldcupenneduHH94H94-1039pdf

[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars

Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)

Springer-Verlag Berlin 1974 pp 9-16

32

[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley

1995 (chapter 10) (Attribute Grammars)

[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010

httpxiphorgvorbisdocVorbis_I_spechtml

[Zappe 1994] H Zappe Oktalyzer 1994

httpwwwwotsitorglistaspsearch=oktampbutton=GO21

33

Appendix A Chunk-based Binary File Formats

This appendix references specifications for about 75 chunk-based binary file formats and

discusses some of those specifications

A1 Electronic Arts Interchange File Format (IFF)

[Amigan 2011]

Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the

specification for the Interchange File Format [Morrison 1985]

A11 Interleaved Bitmap (ILBM)

[Morrison 1986]

This file format was discussed in section 41 of this report

A12 8-bit Sampled Voice (8SVX)

The file format specification for the 9-bit sampled voice file format uses C programming

language data structures to define the format [Hayes and Morrison 1985] The following is a re-

expression of that specification as a context-free binary file grammar

Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata

cksize rarr ULONG

ckdata rarr voice8header bodyck

ckdata rarr voice8header nameck copyrightck bodyck

ckdata rarr voice8header copyrightck nameck bodyck

ckdata rarr voice8header nameck bodyck

ckdata rarr voice8header copyrightck bodyck

ckdata rarr voice8header authorck bodyck

ckdata rarr voice8header annotation bodyck

ckdata rarr voice8header atakck releaseck bodyck

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 36: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

30

[MultimediaWiki 2009b] SANM October 2009

httpwikimultimediacxindexphptitle=SANM

[Nolan] M P Nolan An LL(1) parser generator implemented for GaTech CS 3240

httpcodegooglecompll-parser-

generatorsourcebrowsetrunkGeneratorparser3240Mainjavar=42

[Novikov and Fillipov] Igor E Novikov and Valek Fillipov Cdrloaderpy A cdr import filter

written in Python in the open source graphics file convertor UniConvertor

httpsk1projectorgmodulesphpname=Productsampproduct=cdrexplorer

[OGF 2011] OGF DFDL WG Data Format Description Language (DFDL) v10 Specification

January 31 2011 wwwogforgdocumentsGFD174pdf

[Oppenheim 1988] Dave Oppenheim Standard MIDI Files 006 March 1 1988

wwwibiblioorgemusic-linfo-docs-FAQsMIDI-docMIDI-SMFtxt

[Parr amp Quong 1995] Parr T J amp Quong R W (1995) ANTLR A predicated- ll(k) parser

generator Software Practice and Experience 25 7 (July) 789-810

httpwwwantlrorgarticle1055550346383antlrpdf

[Parr 2007] Parr T (2007) The Definitive ANTLR Reference Building Domain-Specific

Languages Pragmatic

[Peters] Christoph Peters The Ultimate 3D model file format (v 20)

httpultimate3dorgU3DModelFileFormatV2htm

[Pfeiffer 2003] S Pfeiffer The Ogg Encapsulation Format Version 0 RFC 3533 May 2003

httpxiphorgoggdocrfc3533txt

[Plotkin] Andrew Plotkin Blorb An IF Resource Collection Format Standard Version 20

wwweblongcomzarfblorb

httpwwwifarchiveorgif-archiveinfocominterpretersspecificationblorb_formattxt

[ Randelshofer 2011] Werner Randelshofer Class RIFFParser January 6 2011

wwwrandelshoferchmultishowjavadocchrandelshofermediariffRIFFParserhtml

[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed) MNG (Multiple-image Network Graphics)

Format Version 10 2001 wwwlibpngorgpubmngspecCredits

31

[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File

Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008

httpscopenofficeorgexcelfileformatpdf

[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF

[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments

3072 March 2001 wwwietforgrfcrfc3072txt

[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft

26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt

[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and

old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21

[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988

httpamigan1emunetregANIMtxt

[Tamir 2005] Giora Tamir C RIFF Parser June 2005

wwwcodeprojectcomKBfilesriffparseraspx

[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications

In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi

Khosrow-Pour Information Science Publishing pp 268-273 2008

wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf

[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for

File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute

September 2009

[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical

Engineering The University of Queensland 1996

httpiteeuqeduau~cristinastudentsdaviddavidhtml

[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language

understanding system ARPA Human Language Technologies Workshop Plainsboro New

Jersey httpaclldcupenneduHH94H94-1039pdf

[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars

Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)

Springer-Verlag Berlin 1974 pp 9-16

32

[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley

1995 (chapter 10) (Attribute Grammars)

[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010

httpxiphorgvorbisdocVorbis_I_spechtml

[Zappe 1994] H Zappe Oktalyzer 1994

httpwwwwotsitorglistaspsearch=oktampbutton=GO21

33

Appendix A Chunk-based Binary File Formats

This appendix references specifications for about 75 chunk-based binary file formats and

discusses some of those specifications

A1 Electronic Arts Interchange File Format (IFF)

[Amigan 2011]

Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the

specification for the Interchange File Format [Morrison 1985]

A11 Interleaved Bitmap (ILBM)

[Morrison 1986]

This file format was discussed in section 41 of this report

A12 8-bit Sampled Voice (8SVX)

The file format specification for the 9-bit sampled voice file format uses C programming

language data structures to define the format [Hayes and Morrison 1985] The following is a re-

expression of that specification as a context-free binary file grammar

Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata

cksize rarr ULONG

ckdata rarr voice8header bodyck

ckdata rarr voice8header nameck copyrightck bodyck

ckdata rarr voice8header copyrightck nameck bodyck

ckdata rarr voice8header nameck bodyck

ckdata rarr voice8header copyrightck bodyck

ckdata rarr voice8header authorck bodyck

ckdata rarr voice8header annotation bodyck

ckdata rarr voice8header atakck releaseck bodyck

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 37: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

31

[Rentz 2008] Daniel Rentz OpenOfficeorgs Documentation of the Microsoft Excel File

Format Excel Versions 2 3 4 5 95 97 2000 XP 2003 OpenOfficeorg April 2008

httpscopenofficeorgexcelfileformatpdf

[REWiki 2009]reWiki IFF httprewikiregengedankendewikiIFF

[RFC 3072] M Wildgrube Structured Data Exchange Format (SDXF) Request for Comments

3072 March 2001 wwwietforgrfcrfc3072txt

[Shaw et al 1985] Steve Shaw Jerry Morrison and Bob Burns FTXT IFF Formatted Text Draft

26 Electronic Arts November 15 1985 httpfortunatynetcomtextfilescomputersftxt

[Soras 1995] L de Soras Official formats of the Graoumf Tracker modules GT2 v1 - v3 and

old GT modules GTK 310895 httpwwwwotsitorglistaspsearch=GTKampbutton=GO21

[Sparta 1988] Sparta Inc ANIM An IFF Format For CEL Animations 4 May 1988

httpamigan1emunetregANIMtxt

[Tamir 2005] Giora Tamir C RIFF Parser June 2005

wwwcodeprojectcomKBfilesriffparseraspx

[Thirunarayan 2008] Krishnaprasad Thirunarayan Attribute Grammars and their Applications

In Encyclopedia of Information Science and Technology Second Edition Editor Mehdi

Khosrow-Pour Information Science Publishing pp 268-273 2008

wwwcswrightedu~tkprasadpapersAttribute-Grammarspdf

[Underwood 2009] W Underwood Extensions of the UNIX file Command and Magic File for

File Type Identification PERPOS TR ITTLCSITD 09-02 Georgia Tech Research Institute

September 2009

[Ung 1996] David Ung Retargetable loader Honours thesis Computer Science and Electrical

Engineering The University of Queensland 1996

httpiteeuqeduau~cristinastudentsdaviddavidhtml

[Ward amp Issar 1994] W Ward ampn S Issar Recent improvements in the CMU spoken language

understanding system ARPA Human Language Technologies Workshop Plainsboro New

Jersey httpaclldcupenneduHH94H94-1039pdf

[van Wijngaarden 1974] A van Wijngaarden The generative power of two-level grammars

Automata Languages and Programming Lecture Notes in Computer Science 14 J Loeckx (e)

Springer-Verlag Berlin 1974 pp 9-16

32

[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley

1995 (chapter 10) (Attribute Grammars)

[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010

httpxiphorgvorbisdocVorbis_I_spechtml

[Zappe 1994] H Zappe Oktalyzer 1994

httpwwwwotsitorglistaspsearch=oktampbutton=GO21

33

Appendix A Chunk-based Binary File Formats

This appendix references specifications for about 75 chunk-based binary file formats and

discusses some of those specifications

A1 Electronic Arts Interchange File Format (IFF)

[Amigan 2011]

Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the

specification for the Interchange File Format [Morrison 1985]

A11 Interleaved Bitmap (ILBM)

[Morrison 1986]

This file format was discussed in section 41 of this report

A12 8-bit Sampled Voice (8SVX)

The file format specification for the 9-bit sampled voice file format uses C programming

language data structures to define the format [Hayes and Morrison 1985] The following is a re-

expression of that specification as a context-free binary file grammar

Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata

cksize rarr ULONG

ckdata rarr voice8header bodyck

ckdata rarr voice8header nameck copyrightck bodyck

ckdata rarr voice8header copyrightck nameck bodyck

ckdata rarr voice8header nameck bodyck

ckdata rarr voice8header copyrightck bodyck

ckdata rarr voice8header authorck bodyck

ckdata rarr voice8header annotation bodyck

ckdata rarr voice8header atakck releaseck bodyck

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 38: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

32

[Wilhelm and Mauer 1995] R Wilhelm and D Maurer Compiler Design Addison-Wesley

1995 (chapter 10) (Attribute Grammars)

[Xiph 2010] XiphOrg Foundation Vorbis I specification February 3 2010

httpxiphorgvorbisdocVorbis_I_spechtml

[Zappe 1994] H Zappe Oktalyzer 1994

httpwwwwotsitorglistaspsearch=oktampbutton=GO21

33

Appendix A Chunk-based Binary File Formats

This appendix references specifications for about 75 chunk-based binary file formats and

discusses some of those specifications

A1 Electronic Arts Interchange File Format (IFF)

[Amigan 2011]

Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the

specification for the Interchange File Format [Morrison 1985]

A11 Interleaved Bitmap (ILBM)

[Morrison 1986]

This file format was discussed in section 41 of this report

A12 8-bit Sampled Voice (8SVX)

The file format specification for the 9-bit sampled voice file format uses C programming

language data structures to define the format [Hayes and Morrison 1985] The following is a re-

expression of that specification as a context-free binary file grammar

Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata

cksize rarr ULONG

ckdata rarr voice8header bodyck

ckdata rarr voice8header nameck copyrightck bodyck

ckdata rarr voice8header copyrightck nameck bodyck

ckdata rarr voice8header nameck bodyck

ckdata rarr voice8header copyrightck bodyck

ckdata rarr voice8header authorck bodyck

ckdata rarr voice8header annotation bodyck

ckdata rarr voice8header atakck releaseck bodyck

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 39: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

33

Appendix A Chunk-based Binary File Formats

This appendix references specifications for about 75 chunk-based binary file formats and

discusses some of those specifications

A1 Electronic Arts Interchange File Format (IFF)

[Amigan 2011]

Electronic Arts and Commodore-Amiga introduced the chunk-based file format in the

specification for the Interchange File Format [Morrison 1985]

A11 Interleaved Bitmap (ILBM)

[Morrison 1986]

This file format was discussed in section 41 of this report

A12 8-bit Sampled Voice (8SVX)

The file format specification for the 9-bit sampled voice file format uses C programming

language data structures to define the format [Hayes and Morrison 1985] The following is a re-

expression of that specification as a context-free binary file grammar

Form_8SVX_chunk rarr ―FORM cksize ―8SVX ckdata

cksize rarr ULONG

ckdata rarr voice8header bodyck

ckdata rarr voice8header nameck copyrightck bodyck

ckdata rarr voice8header copyrightck nameck bodyck

ckdata rarr voice8header nameck bodyck

ckdata rarr voice8header copyrightck bodyck

ckdata rarr voice8header authorck bodyck

ckdata rarr voice8header annotation bodyck

ckdata rarr voice8header atakck releaseck bodyck

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 40: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

34

voice8header rarr ―VHDR cksize oneShotHiSamples repeatHiSamples

samplesPerHiCycle ctOctave sCompression Fixedevolume

oneShotHiSamples rarr ULONG

repeatHiSamples rarr ULONG

samplesPerHiCycle rarr ULONG

samplesPerSec rarr UWORD

ctOctave rarr UBYTE

sCompression rarr UBYTE

Fixed volume rarr UBYTE

nameprop rarr ―NAME cksize CHAR[size]

copyrightck rarr (c) cksize CHAR[size]

authorck rarr ―AUTH cksize CHAR[size]

annotation rarr annotationck annotation

annotation rarr annotationck

annotationck rarr ―ANNO cksize CHAR[size]

atakckrarr ―ATAK cksize EGpoint[size]

releaseck rarr ―RLSE cksize EGpoint[size]

EGpoint rarr duration dest

duration rarr UWORD

dest rarr Fixed

bodyck rarr ―BODY cksize samples[size]

In general the BODY has ctOctave octaves of data The highest frequency octave comes first

comprising the fewest samples oneShotHiSamples + repeatHiSamples Each successive octave

contains twice as many samples as the next higher octave but the same number of cycles The

lowest frequency octave comes last with the most samples 2ctOctave-1 (oneShotHiSamp les +

repeatHiSamples) The number of samples in the BODY chunk is

(20 + + 2

(ctOctave-1) (oneShotHiSamples + repeatHiSamples)

A13 Cel Animation (ANIM)

[Sparta 1988]

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 41: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

35

A14 IFF Formatted Text (FTXT)

[Shaw et al 1985]

A15 IFF Simple Musical Score (SMUS)

[Morrison 1986b]

A16 Planar Bitmap (PBM)

[REWiki 2009]

A2 Standard Midi File Format

Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below

[Int MIDI Assoc 2008] The ―length of header data is 6 Header data consist of three 16-bit

integers

Midi rarr ltheaderchunkgt lttrackchunksgt

ltheaderchunkgt rarr ldquoMThdrdquo ltlength of header datagt ltheader datagt

ltheader datagt rarr ltformatgt ltntrksgt ltdivisiongt

lttrackchunksgt rarr lttrackchunkgt lttrackchunksgt

lttrackchunkgt rarr ldquoMTrkrdquo ltlength of track datagt lttrack datagt

lttrack datagt rarr ltMTrk eventgt+

ltMTrk eventgt rarr ltdelta-timegt lteventgt

lteventgt rarr ltMIDI eventgt | ltsysex eventgt | ltmeta-eventgt

ltMIDI eventgt rarr ltMIDI channel messagegt

ltsysex eventgt rarr 0xF0 ltlengthgt ltbytes to be transmitted after F0gt

ltsysex eventgt rarr 0xF7 ltlengthgt ltall bytes to be transmittedgt

ltmeta-eventgt rarr 0xFF lttypegt ltlengthgt ltbytesgt

A3 Audio Interchange File Format

The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on

Electronic Artslsquo Interchange File Format The file format is specified using C data structure

definitions [Apple 1989] The following recasts the specification as a context-free pseudo EBNF

grammar Apple [1991] also specified a compressed version AIFC of the Audio Interchange file

AudioAIFFfile -gt ldquoFORMrdquo cksize ldquoAIFFrdquo localchunks

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 42: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

36

localhunks -gt CommonChunk SoundDataChunk

CommonChunk -gt ldquoCOMMrdquo cksize numChannels numSampleFrames sampleSize

sampleRate

SoundDataChunk -gt ldquoSSNDrdquo cksize offset blockSize soundData[]

MarkerChunk -gt ldquoMARKrdquo cksize Markers[ ]

Markers -gt marker markers

Marker -gt markerid position markerName

InstrumentChunk -gt ldquoINSTrdquo cksize baseNote

detune

lowNote

highNote

` lowVelocity

` ` highVelocity

gain

sustainLoop

releaseLoop

MIDIDataChunk -gt ldquoMIDIrdquo cksize MIDIData[ ]

AudioRecordingChunk -gt ldquoAESDrdquo cksize AESChannelStatusData[24]

ApplicationSpecificChunk -gt ldquoAPPLrdquo cksize applicationSignature data[ ]

CommentsChunk -gt ldquoCOMTrdquo cksize numcomments comments[ ]

Comments -gt Comment Comments

Comment -gt timeStamp

marker

count

text

The name author copyright and annotation chunks are included in the definition of every EA

IFF 85 file All are text chunks their data portion consists solely of text Each of these chunks is

optional

nameprop -gt ldquoNAMErdquo size CHAR[size]

copyrightck -gt ldquo(c) ldquo size CHAR[size]

authorck -gt ldquoAUTHrdquo size CHAR[size]

annotation -gt annotationck annotation

annotation -gt annotationck

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 43: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

37

annotationck -gt ldquoANNOrdquo size CHAR[size]

A4 Resource Interchange File Formats

The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also

based on Electronic Artslsquo Interchange File Format The Microsoft container formats like Audio-

Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]

[Randelshofer 2011] WebP picture format introduced in (2010) by Google also uses RIFF as a

container

A41 RIFF-Waveform (WAVE)

[IBM amp Microsoft 1991]

This file format was discussed in section 42 of this report

A42 WAVEFORMATEX

[Microsoft 1994]

A43 WAVEFORMATEXTENSIBLE

[Microsoft 2003b]

A44 Broadcast Wave Format

[EBU 2001]

A45 Windows Audio-Video Interleave (AVI)

[Microsoft 2005]

A46 Animated Windows Cursor (ANI)

[Houghtaling 1996]

A47 RIFF MIDIfile (RMID)

[IBM amp Microsoft 1991]

A48 Device-Independent Bitmap (RIFF RDIB)

The RDIB format consists of a Windows 30 DIB enclosed in a RIFF chunk [IBM amp Microsoft

1991]

A49 WebP

WebP is a method of lossy compression for digital images The degree of compression is

adjustable so a user can choose the trade-off between file size and image quality WebP typically

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 44: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

38

achieves an average of 39 more compression than JPEG and JPEG 2000 without loss of image

quality A WebP file consists of VP8 image data and a container based on RIFF The format of a

WebP file is shown below [Google 2010]

0ffset Description

0 RIFF 4-byte tag

4 size of data chunk sarting at offset 8

8 WEBP the form-type signature

12 VP8 4-bytes tag describing the raw video format used

16 size of the raw VP8 image data chunk starting at offset 20

20 the VP8 image data

A5 JPEG

A51 JPEGJFIF

[CCITT 1992] [Hamilton 1992] [Cuturicu and Fromme 1999]

JPEGJFIF file format

- header (2 bytes) $ff $d8 (SOI) (these two identify a JPEGJFIF file)

- for JFIF files an APP0 segment is immediately following the SOI marker see below

- any number of segments (similar to IFF chunks) see below

- trailer (2 bytes) $ff $d9 (EOI)

Segment format

- header (4 bytes)

$ff identifies segment

n type of segment (one byte)

sh sl size of the segment including these two bytes but not

including the $ff and the type byte Note not Intel order

high byte first low byte last

A52 JPEGExif

[JEIDA 2002]

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 45: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

39

A53 JPEG 2000

―Every marker is two bytes long The first byte consists of a single 0xFF byte The second byte

denotes the specific marker and can have any value in the range 0x01 to 0xFE Many of these

markers are already used in ITU-T Rec T81 | ISOIEC 10918-1 and ITU-T Rec T84 | ISOIEC

10918-3 and shall be regarded as reserve unless specifically used A marker segment includes a

marker and associated parameters called marker parameters In every marker segment the first

two bytes after the marker shall be an unsigned big endian integer value that denotes the length

in bytes of the marker parameters (including two bytes of this length parameter but not the two

bytes of the marker itself) [ISO 2000 page 13]

A54 JPEG 2000 Codestream

[ISO 2000] Annex A Codestream syntax

A6 Advanced Systems Format

Microsoftlsquos Advanced Systems Format (ASF) [2001] is also a chunk-based container format

The most common file formats contained within an ASF file are Windows Media Audio (WMA)

and Windows Media Video (WMV)

A61 Windows Media Audio 7-9

Microsoft 2004]

A62 Windows Media Video 7-91

[Microsoft 2004]

A7 Portable Network Graphics

[ISO 159482003]

PNG file -gt PNGSignature headerck paletteck

headerck -gt IHDR cksize Width Height BitDepth ColourType

Compressionmethod Filtermethod Interlacemethod

paletteck -gt PLTE cksize entries

entries -gt entry entries

entry -gt red green blue

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 46: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

40

A71 Multiple-image Network Graphics

[Randers-Pehrson 2001]

A72 JPEG Network Graphics

[Randers-Pehrson 2001]

A8 Binary Interchange File Format

[Rentz 2008]

A9 3D Studio

A91 3D Studio File Format ndash 3ds

[Autodesk 1997]

A92 3D Studio Material Library Files

[Fercoq 1996]

A10 Anim8or

[Glanville 2003]

A11 Animator

[Crazy Talk]

A111 Animator Cel

[Crazy Talk]

A112 Animator Pro Fli

[Haaland]

A113 Animator Pro fLC

A111 Animator Cel

[Crazy Talk]

A114 Animator Pro PIC

[Maischein]

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 47: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

41

A12 Audio Manager Module

[Foo 1995]

A13 Autodesk Drawing Interchange Files (DXF) (Binary)

[Autodesk 2009]

A14 Caligari TrueSpace SceneObject (Binary)

[Caligari 1998]

A15 CorelDRAW Vector Graphics File

(Versions 30 ndash X3)

[Novikov and Fillipov]

A16 Corel Metafile Exchange Image

[Corel 1998]

A17 Music Modules

A171 DigiTrekker Module Format

[Beham 1995]

A172 Graoumf Tracker modules

[Soras 1995]

A173 Oktalyzer

[Zappa 1994]

A18 EA Multimedia Game File Formats

[Anisimovsky 2002a]

A181 Electronic Arts Music (asf)

[Anisimovsky 2002b]

A182 Electronic Arts CMV

[MultimediaWiki 2008a]

A183 Electronic Arts DCT

[MultimediaWiki 2008b]

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 48: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

42

A184 Electronic Arts MAD

[MultimediaWiki 2008c]

A185 Electronic Arts MPC

[MultimediaWiki 2008d]

A186 Electronic Arts VP7

[MultimediaWiki 2008e]

A187 Electronic Arts TGV

[MultimediaWiki 2008f]

A188 Electronic Arts TGQ

[MultimediaWiki 2008g]

A189 Electronic Arts TQI

[MultimediaWiki 2009]

A19 Imagine 3D Data Description (TDDD)

[Kirvan 1994]

A20 Lightwave 3D Objects

A201 LWOB

[LightWave 1996]

A202 LWO2

[Lightwave 2001a]

A203 LWLO

[Ferguson 1995]

A21 Luxology Modo 3D Object (LXOB)

[Luxology 2006]

A22 Paint Shop Pro (PSP)

[Jasc 1998]

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 49: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

43

A23 ProVector 2D Drawing

[Cunniff and Orr]

A24 ProWrite Document

[Bayless 1987]

A25 InteractiveFiction

A251 Blorb Z-machine Resource File

[Plotkin]

A252 Quetzal

[Frost 1997a 1997b]

A26 Apple QuickTime

[Apple 2007]

A27 Maya IFF Bitmap

[Alias-wavefront 1999]

A28 Apple Core Audio Format

[Apple 2005]

A29 American Laser Games (MM)

[MultimediaWiki 2008h]

A30 Smush Animation Format

A301 SAN

[MultimediaWiki 2006]

A302 SNM

[MultimediaWiki 2009b]

A31 Structured Data eXchange Format (SDXF)

[RFC 3072]

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]

Page 50: Attribute Grammars for Validating Chunk-based Binary File ...perpos.gtri.gatech.edu/publications/WP-11-03-Grammars_for_Binary... · The capability to validate and view or play binary

44

A32 Ogg Vorbis

[Pfeiffer 2003] [Xiph 2010]

A33 Simple Vector Array Format

[Martin 2007]

A34 Nested List File

Nested List File (NLF) is a general chunk-based file metaformat for binary data that is used by

Onda as the metaformat for its compressed files [Blank Aspect 2010a 2010b]

A35 Cinema 4D

[Losch 1997]

A36 The Sims Interchange File Format

Jamie Doornbos is the lead programmer of Maxis The Sims [Baum amp Noel 2002]