
TXR(1) TXR Programming Language TXR(1)

1 NAME

TXR − Programming Language (Version 237)

2 SYNOPSIS

txr [ options ] [ script-file [ data-files ... ]]

3 DESCRIPTION

TXR is a general-purpose, multi-paradigm programming language. It comprises two languages integrated
into a single tool: a text scanning and extraction language referred to as the TXR Pattern Language
(sometimes just "TXR"), and a general-purpose dialect of Lisp called TXR Lisp.

TXR can be used for everything from "one liner" data transformation tasks at the command line, to data
scanning and extracting scripts, to full application development in a wide range of areas.

A script written in the TXR Pattern Language, also referred to in this document as a query, specifies a
pattern which matches one or more sources of inputs, such as text files. Patterns can consist of large chunks of
multi-line free-form text, which is matched literally against material in the input sources. Free variables
occurring in the pattern (denoted by the @ symbol) are bound to the pieces of text occurring in the
corresponding positions. Patterns can be arbitrarily complex, and can be broken down into named pattern
functions, which may be mutually recursive.

In addition to embedded variables which implicitly match text, the TXR pattern language supports a
number of directives: for matching text using regular expressions, for continuing a match in another file, for
searching through a file for the place where an entire sub-query matches, for collecting lists, for
combining sub-queries using logical conjunction, disjunction and negation, and numerous others.

Patterns can contain actions which transform data and generate output. These actions can be embedded
anywhere within the pattern matching logic. A common structure for small TXR scripts is to perform a
complete matching session at the top of the script, and then deal with processing and reporting at the
bottom.

The TXR Lisp language can be used from within TXR scripts as an embedded language, or completely
stand-alone. It supports functional, imperative and object-oriented programming, and provides numerous
data types such as symbols, strings, vectors, hash tables with weak reference support, lazy lists, and
arbitrary-precision ("bignum") integers. It has an expressive foreign function interface (FFI) for calling into
libraries and other software components that support C-language-style calls.

TXR Lisp source files as well as individual functions can be optionally compiled for execution on a virtual

machine that is built into TXR. Compiled files execute and load faster, and resist reverse-engineering.

Stand-alone application delivery is possible.

TXR is free software offered under the two-clause BSD license which places almost no restrictions on

redistribution, and allows every conceivable use, of the whole software or any constituent part, royalty-free,

free of charge, and free of any restrictions.

4 ARGUMENTS AND OPTIONS

If TXR is given no arguments, it will enter into an interactive mode. See the INTERACTIVE LISTENER
section for a description of this mode. When TXR enters interactive mode this way, it prints a one-line
banner announcing the program name and version, and one line of help text instructing the user
how to exit.

Utility Commands 2020-04-25 1


Options which don’t take an argument may be combined together. The -v and -q options are mutually
exclusive. Of these two, the one which occurs in the rightmost position in the argument list dominates. The
-c and -f options are also mutually exclusive; if both are specified, it is a fatal error.

-Dvar=value Bind the variable var to the value value prior to processing the query. The name is in scope
over the entire query, so that all occurrences of the variable are substituted and match the equivalent
text. If the value contains commas, these are interpreted as separators, which give rise to a list
value. For instance -Da,b,c creates a list of the strings "a", "b" and "c". (See Collect Directive
below). List variables provide a multiple match. That is to say, if a list variable occurs in a
query, a successful match occurs if any of its values matches the text. If more than one value
matches the text, the first one is taken.

-Dvar Binds the variable var to an empty string value prior to processing the query.

-q Quiet operation during matching. Certain error messages are not reported on the standard error
device (but if the situations occur, they still fail the query). This option does not suppress error
generation during the parsing of the query, only during its execution.

-i If this option is present, then TXR will enter into an interactive interpretation mode after
processing all options, and the input query if one is present. See the INTERACTIVE LISTENER section
for a description of this mode.

-d
--debugger Invoke the interactive TXR debugger. See the DEBUGGER section. Implies --backtrace.

--backtrace Turns on the establishment of backtrace frames for function calls so that a backtrace can be
produced when an unhandled exception occurs, and in other situations. Backtraces are helpful in
identifying the causes of errors, but require extra stack space and slow down execution.

-n
--noninteractive This option affects behavior related to TXR’s *stdin* stream. It also has another, unrelated
effect, on the behavior of the interactive listener; see below.

Normally, if this stream is connected to a terminal device, it is automatically marked as having the
real-time property when TXR starts up (see the functions stream-set-prop and
real-time-stream-p). The -n option suppresses this behavior; the *stdin* stream remains ordinary.

The TXR pattern language reads standard input via a lazy list, created by applying the
lazy-stream-cons function to the *stdin* stream. If that stream is marked real-time, then the lazy
list which is returned by that function has behaviors that are better suited for scanning interactive
input. A more detailed explanation is given under the description of this function.

If the -n option is in effect and TXR enters into the interactive listener, the listener operates in plain
mode. The listener reads buffered lines from the operating system without any character-based
editing features or history navigation. In plain mode, no prompts appear and no terminal control
escape sequences are generated. The only output is the results of evaluation, related diagnostic
messages, and any output generated by the evaluated expressions themselves.

-v Verbose operation. Detailed logging is enabled.

-b This option binds a Lisp global lexical variable (as if by the defparml function) to an object
described by Lisp syntax. It requires an argument of the form sym=value where sym must be,
syntactically, a token denoting a bindable symbol, and value is arbitrary TXR Lisp syntax. The
sym syntax is converted to the symbol it denotes, which is bound as a global lexical variable, if it
is not already a variable. The value syntax is parsed to the Lisp object it denotes. This object is
not subject to evaluation; the object itself is stored into the variable binding denoted by sym. Note
that if sym already exists as a global variable, then it is simply overwritten. If sym is marked
special, then it stays special.

-B If the query is successful, print the variable bindings as a sequence of assignments in shell syntax
that can be eval-ed by a POSIX shell. If the query fails, print the word "false". Evaluation of this
word by the shell has the effect of producing an unsuccessful termination status from the shell’s
eval command.

-l or --lisp-bindings This option implies -B. Print the variable bindings in Lisp syntax instead of shell syntax.

-a num This option implies -B. The decimal integer argument num specifies the maximum number of
array dimensions to use for list-valued variable bindings. The default is 1. Additional dimensions
are expressed using numeric suffixes in the generated variable names. For instance, consider the
three-dimensional list arising out of a triply nested collect:
((("a" "b") ("c" "d")) (("e" "f") ("g" "h"))). Suppose this is bound to a variable V. With -a 1, this will
be reported as:

V_0_0[0]="a"
V_0_1[0]="b"
V_1_0[0]="c"
V_1_1[0]="d"
V_0_0[1]="e"
V_0_1[1]="f"
V_1_0[1]="g"
V_1_1[1]="h"

With -a 2, it comes out as:

V_0[0][0]="a"
V_1[0][0]="b"
V_0[0][1]="c"
V_1[0][1]="d"
V_0[1][0]="e"
V_1[1][0]="f"
V_0[1][1]="g"
V_1[1][1]="h"

The leftmost bracketed index is the most major index. That is to say, the dimension order is:

NAME_m_m+1_..._n[1][2]...[m-1].
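The naming scheme can be modeled outside of TXR. The following Python sketch is not TXR's implementation; it is an illustration that assumes a uniformly nested list and a hypothetical helper named shell_bindings. The first dims index levels become bracketed subscripts and the deeper levels become numeric suffixes, which reproduces the two listings above.

```python
def shell_bindings(name, value, dims):
    """Flatten a nested list into shell-style assignment strings.

    Hypothetical helper for illustration only. The first `dims` index
    levels are written as bracketed subscripts; any deeper index levels
    are written as numeric suffixes on the variable name.
    """
    out = []

    def walk(node, path):
        if isinstance(node, list):
            for i, item in enumerate(node):
                walk(item, path + [i])
        else:
            brackets = "".join(f"[{i}]" for i in path[:dims])
            suffixes = "".join(f"_{i}" for i in path[dims:])
            out.append(f'{name}{suffixes}{brackets}="{node}"')

    walk(value, [])
    return out


nested = [[["a", "b"], ["c", "d"]], [["e", "f"], ["g", "h"]]]

# Model of -a 1: one bracketed dimension, two suffix levels.
for line in shell_bindings("V", nested, 1):
    print(line)

# Model of -a 2: two bracketed dimensions, one suffix level.
for line in shell_bindings("V", nested, 2):
    print(line)
```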



-c query Specifies the query in the form of a command line argument. If this option is used, the
script-file argument is omitted. The first non-option argument, if there is one, now specifies the first
input source rather than a query. Unlike queries read from a file, (non-empty) queries specified as
arguments using -c do not have to properly end in a newline. Internally, TXR adds the missing
newline before parsing the query. Thus -c "@a" is a valid query which matches a line.

Example:

Shell script which uses TXR to read two lines "1" and "2" from standard input, binding them to
variables a and b. Standard input is specified as - and the data comes from shell "here document"
redirection:

code:

#!/bin/sh
txr -B -c "@a
@b" - <<!
1
2
!

output:

a=1
b=2

The @; comment syntax can be used for better formatting:

txr -B -c "@;
@a
@b"

-f script-file Specifies the file from which the query is to be read, instead of the script-file argument.
This is useful in #! ("hash bang") scripts. (See Hash Bang Support below).

-e expression Evaluates a TXR Lisp expression for its side effects, without printing its value. Can be specified
more than once. The script-file argument becomes optional if -e is used at least once. If the
evaluation of every expression evaluated this way terminates normally, and there is no
script-file argument, then TXR terminates with a successful status.

-p expression Just like -e but prints the value of expression using the prinl function.

-P expression Like -p but prints using the pprinl function.

-t expression Like -p but prints using the tprint function.

-C number
--compat=number Requests TXR to behave in a manner that is compatible with the specified version of TXR. This
makes a difference in situations when a release of TXR breaks backward compatibility. If some
version N+1 deliberately introduces a change which is backward incompatible, then -C N can be
used to request the old behavior.

The requested value of N can be too low, in which case TXR will complain and exit with an

unsuccessful termination status. This indicates that TXR refuses to be compatible with such an old

version. Users requiring the behavior of that version will have to install an older version of TXR

which supports that behavior, or even that exact version.

If the option is specified more than once, the behavior is not specified.

Compatibility can also be requested via the TXR_COMPAT environment variable instead of the -C
option.

For more information, see the COMPATIBILITY section.

--gc-delta=number

The number argument to this option must be a decimal integer. It represents a megabyte value,

the "GC delta": one megabyte is 1048576 bytes. The "GC delta" controls an aspect of the garbage

collector behavior. See the gc-set-delta function for a description.

--debug-autoload This option turns on debugging, like --debugger but also requests stepping into the auto-load
processing of TXR Lisp library code. Normally, debugging through the evaluations triggered by
auto-loading is suppressed. Implies --backtrace.

--debug-expansion This option turns on debugging, like --debugger but also requests stepping into the parse-time
macro-expansion of TXR Lisp code embedded in TXR queries. Normally, this is suppressed.
Implies --backtrace.

--help Prints usage summary on standard output, and terminates successfully.

--license Prints the software license. This depends on the software being installed such that the LICENSE
file is in the data directory. Use of TXR implies agreement with the liability disclaimer in the
license.

--version Prints program version on standard output, and terminates successfully.

--args The --args option provides a way to encode multiple arguments as a single argument, which is
useful on some systems which have limitations in their implementation of the "hash bang"
mechanism. For details about its special syntax, see Hash Bang Support below. It is also useful in
stand-alone application deployment. See the section STAND-ALONE APPLICATION SUPPORT, in
which example uses of --args are shown.



--eargs The --eargs option (extended --args) is like --args but must be followed by an argument.
The argument is removed from the argument list and substituted in place of occurrences of {}
among the arguments expanded from the --eargs syntax.

--lisp
--compiled These options influence the treatment of query files which do not have a suffix indicating their
type. The --lisp option causes an unsuffixed file to be treated as Lisp source; and --compiled
causes it to be treated as a compiled file.

Moreover, if --lisp is specified, and an unsuffixed file does not exist, then TXR will add the
".tl" suffix and try the file again; and --compiled will similarly add the ".tlo" suffix and
try opening the file again. In the same situation, if neither --lisp nor --compiled has been
specified, TXR will first try adding the ".txr" suffix. If that fails, then the ".tlo" suffix will
be tried and finally ".tl". Note that --lisp and --compiled influence how the argument of
the -f option is treated, but only if they precede that option.
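The suffix fallback order in the preceding paragraph can be summarized as a small lookup. The Python sketch below is only a model of that paragraph, not TXR's code; the function name candidate_names and the mode strings "lisp" and "compiled" are hypothetical stand-ins for the --lisp and --compiled options.

```python
def candidate_names(name, mode=None):
    """List the suffixed names to try, in order, for an unsuffixed
    script name that could not be opened as given.

    Hypothetical model: `mode` is None when neither option was given,
    "lisp" for --lisp, or "compiled" for --compiled.
    """
    if mode == "lisp":
        return [name + ".tl"]
    if mode == "compiled":
        return [name + ".tlo"]
    # Neither option: ".txr" first, then ".tlo", and finally ".tl".
    return [name + ".txr", name + ".tlo", name + ".tl"]


print(candidate_names("script"))
print(candidate_names("script", "lisp"))
print(candidate_names("script", "compiled"))
```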

--reexec On platforms which support the POSIX exec family of functions, this option causes TXR to
re-execute itself. The re-executed image receives the remaining arguments which follow the
--reexec argument. Note: this option is useful for supporting setuid operation in "hash bang"
scripts. On some platforms, the interpreter designated by a "hash bang" script runs without altered
privilege, even if that interpreter is installed setuid. If the interpreter is executed directly, then
setuid applies to it, but not if it is executed via "hash bang". If the --reexec option is used in
the interpreter command line of such a script, the interpreter will re-execute itself, thereby gaining
the setuid privilege. The re-executed image will then obtain the script name from the arguments
which are passed to it and determine whether that script will run setuid. See the section
SETUID/SETGID OPERATION.

--gc-debug This option enables a behavior which stresses the garbage collector with frequent garbage
collection requests. The purpose is to make it more likely to reproduce certain kinds of bugs. Use of this
option severely degrades the performance of TXR.

--vg-debug If TXR is enabled with Valgrind support, then this option is available. It enables code which uses

the Valgrind API to integrate with the Valgrind debugger, for more accurate tracking of garbage

collected objects. For example, objects which have been reclaimed by the garbage collector are

marked as inaccessible, and marked as uninitialized when they are allocated again.

--dv-regex If this option is used, then regular expressions are all treated using the derivative-based back-end.

The NFA-based regex implementation is disabled. Normally, only regular expressions which

require the intersection and complement operators are handled using the derivative back-end. This

option makes it possible to test that back-end on test cases that it wouldn’t normally receive.

-- Signifies the end of the option list.

- This argument is not interpreted as an option, but treated as a filename argument. After the first
such argument, no more options are recognized. Even if another argument looks like an option, it
is treated as a name. This special argument - means "read from standard input" instead of a file.
The script-file, or any of the data files, may be specified using this option. If two or more
files are specified as -, the behavior is system-dependent. It may be possible to indicate EOF from
the interactive terminal, and then specify more input which is interpreted as the second file, and so
forth.

After the options, the remaining arguments are files. The first file argument specifies the script file, and is
mandatory if the -f option has not been specified, and TXR isn’t operating in interactive mode or
evaluating expressions from the command line via -e or one of the related options. A file argument consisting of
a single - means to read the standard input instead of opening a file.

Specifying standard input as a source with an explicit - argument is unnecessary. If no data source
arguments are present, then TXR scans standard input by default. This was not true in versions of TXR prior to
171; see the COMPATIBILITY section.

TXR begins by reading the script. In the case of the TXR pattern language, the entire query is scanned,

internalized and then begins executing, if it is free of syntax errors. (TXR Lisp is processed differently,

form by form). On the other hand, the pattern language reads data files in a lazy manner. A file isn’t opened

until the query demands material from that file, and then the contents are read on demand, not all at once.

The suffix of the script-file is significant. If the name has no suffix, or if it has a ".txr" suffix, then
it is assumed to be in the TXR pattern language. If it has the ".tl" suffix, then it is assumed to be TXR
Lisp. The --lisp option changes the treatment of unsuffixed script file names, causing them to be
interpreted as TXR Lisp.

If an unsuffixed script file name is specified, and cannot be opened, then TXR will add the ".txr" suffix
and try again. If that fails, it will be tried with the ".tl" suffix, and treated as TXR Lisp. If the --lisp
option has been specified, then TXR tries only the ".tl" suffix.

A TXR Lisp file is processed as if by the load macro: forms from the file are read and evaluated. If the
forms do not terminate the TXR process or throw an exception, and there are no syntax errors, then TXR
terminates successfully after evaluating the last form. If syntax errors are encountered in a form, then TXR
terminates unsuccessfully. TXR Lisp is documented in the section TXR LISP.

If a query file is specified, but no file arguments, it is up to the query to open a file, pipe or standard input
via the @(next) directive prior to attempting to make a match. If a query attempts to match text, but has
run out of files to process, the match fails.

5 STATUS AND ERROR REPORTING

TXR sends errors and verbose logs to the standard error device. The following paragraphs apply when

TXR is run without enabling verbose mode with -v, or the printing of variable bindings with -B or -a.

If the command line arguments are incorrect, TXR issues an error diagnostic and terminates with a failed

status.

If the script-file specifies a query, and the query has a malformed syntax, TXR likewise issues error

diagnostics and terminates with a failed status.

If the query fails due to a mismatch, TXR terminates with a failed status. No diagnostics are issued.

If the query is well-formed, and matches, then TXR issues no diagnostics, and terminates with a successful

status.



In verbose mode (option -v), TXR issues diagnostics on the standard error device even in situations which
are not erroneous.

In bindings-printing mode (options -B or -a), TXR prints the word false if the query fails, and exits
with a failed termination status. If the query succeeds, the variable bindings, if any, are output on standard
output.

If the script-file is TXR Lisp, then it is processed form by form. Each top-level Lisp form is
evaluated after it is read. If any form is syntactically malformed, TXR issues diagnostics and terminates
unsuccessfully. This is somewhat different from how the pattern language is treated: a script in the pattern
language is parsed in its entirety before being executed.

6 BASIC TXR SYNTAX

6.1 Comments

A query may contain comments which are delimited by the sequence @; and extend to the end of the line.
Whitespace can occur between the @ and ;. A comment which begins on a line swallows that entire line, as
well as the newline which terminates it. In essence, the entire comment line disappears. If the comment
follows some material in a line, then it does not consume the newline. Thus, the following two queries are
equivalent:

1. @a@; comment: match whole line against variable @a
   @; this comment disappears entirely
   @b

2. @a
   @b

The comment after the @a does not consume the newline, but the comment which follows does. Without
this intuitive behavior, line comments would give rise to empty lines that must match empty lines in the data,
leading to spurious mismatches.

Instead of the ; character, the # character can be used. This is an obsolescent feature.

6.2 Hash Bang Support

TXR has several features which support use of the "hash bang" convention for creating apparently
stand-alone executable programs.

6.2.1 Basic Hash Bang

Special processing is applied to TXR query or TXR Lisp script files that are specified on the command line
via the -f option or as the first non-option argument. If the first line of such a file begins with the
characters #!, that entire line is consumed and processed specially.

This removal allows TXR queries to be turned into standalone executable programs in the POSIX environment
using the "hash bang" mechanism. Unlike most interpreters, TXR applies special processing to the #!
line, which is described below, in the section Argument Generation with the Null Hack.

Shell session example: create a simple executable program called "hello.txr" and run it. This
assumes TXR is installed in /usr/bin.

$ cat > hello.txr
#!/usr/bin/txr
@(bind a "Hey")
@(output)
Hello, world!
@(end)
$ chmod a+x hello.txr
$ ./hello.txr
Hello, world!

When this plain hash bang line is used, TXR receives the name of the script as an argument. Therefore, it

is not possible to pass additional options to TXR. For instance, if the above script is invoked like this

$ ./hello.txr -B

the -B option isn’t processed by TXR, but treated as an additional argument, just as if txr scriptname
-B had been executed directly.

This behavior is useful if the script author wants not to expose the TXR options to the user of the script.

However, the hash bang line can use the -f option:

#!/usr/bin/txr -f

Now, the name of the script is passed as an argument to the -f option, and TXR will look for more options
after that, so that the resulting program appears to accept TXR options. Now we can run

$ ./hello.txr -B
Hello, world!
a="Hey"

The -B option is honored.

6.2.2 Argument Generation with --args and --eargs

On some operating systems, it is not possible to pass more than one argument through the hash bang
mechanism. That is to say, this will not work.

#!/usr/bin/txr -B -f

To support systems like this, TXR supports the special argument --args, as well as an extended
version, --eargs. With --args, it is possible to encode multiple arguments into one argument. The
--args option must be followed by a separator character, chosen by the programmer. The characters after
that are split into multiple arguments on the separator character. The --args option is then removed from
the argument list and replaced with these arguments, which are processed in its place.

Example:

#!/usr/bin/txr --args:-B:-f

The above has the same behavior as

#!/usr/bin/txr -B -f

on a system which supports multiple arguments in hash bang. The separator character is the colon, and so

the remainder of that argument, -B:-f, is split into the two arguments -B -f.
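The splitting rule can be sketched in a few lines of Python. This is a model of the behavior described above, not TXR's implementation; the function name expand_args is hypothetical, and leaving a bare --args with no separator untouched is an assumption made only to keep the sketch total.

```python
def expand_args(argv):
    """Expand every --args<sep>field<sep>field... argument in argv.

    Hypothetical model: the character right after "--args" is the
    separator, and the rest of the argument is split on it.
    """
    prefix = "--args"
    out = []
    for arg in argv:
        if arg.startswith(prefix) and len(arg) > len(prefix):
            sep = arg[len(prefix)]
            out.extend(arg[len(prefix) + 1:].split(sep))
        else:
            # Assumption: anything else passes through unchanged.
            out.append(arg)
    return out


print(expand_args(["--args:-B:-f", "hello.txr"]))
```

Because the separator is chosen by the programmer, any character that does not occur in the encoded arguments can be used.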



The --eargs mechanism allows an additional flexibility. An --eargs argument must be followed by
one more argument.

After --eargs performs the argument splitting in the same manner as --args, any of the arguments
which it produces which are the two-character sequence {} are replaced with that following argument.
Whether or not the replacement occurs, that following argument is then removed.

Example:

#!/usr/bin/txr --eargs:-B:{}:--foo:42

This has an effect which cannot be replicated in any known implementation of the hash bang mechanism.

Suppose that this hash bang line is placed in a script called script.txr. When this script is invoked
with arguments, as in:

script.txr a b c

then TXR is invoked similarly to:

/usr/bin/txr --eargs:-B:{}:--foo:42 script.txr a b c

Then, when --eargs processing takes place, firstly the argument sequence

-B {} --foo 42

is produced by splitting into four fields using the : character as the separator. Then, within these four
fields, all occurrences of {} are replaced with the following argument script.txr, resulting in:

-B script.txr --foo 42

Furthermore, that script.txr argument is removed from the remaining argument list.

The four arguments are then substituted in place of the original --eargs:-B:{}:--foo:42 syntax.

The resulting TXR invocation is, therefore:

/usr/bin/txr -B script.txr --foo 42 a b c

Thus, --eargs allows some arguments to be encoded into the interpreter script, such that the script name is
inserted anywhere among them, possibly multiple times. Arguments for the interpreter can be encoded, as
well as arguments to be processed by the script.
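The worked example above can be replayed with a short Python model. It is a sketch of the described steps only, not TXR's code; the name expand_eargs is hypothetical, and only the first --eargs argument is expanded, which is an assumption of this sketch.

```python
def expand_eargs(argv):
    """Expand the first --eargs<sep>... argument found in argv.

    Hypothetical model: split on the separator as for --args, replace
    each field equal to "{}" with the argument that follows --eargs,
    then drop that following argument from the list.
    """
    prefix = "--eargs"
    for i, arg in enumerate(argv):
        if arg.startswith(prefix) and len(arg) > len(prefix):
            if i + 1 >= len(argv):
                raise ValueError("--eargs must be followed by an argument")
            sep = arg[len(prefix)]
            following = argv[i + 1]
            fields = arg[len(prefix) + 1:].split(sep)
            fields = [following if f == "{}" else f for f in fields]
            return argv[:i] + fields + argv[i + 2:]
    return list(argv)


argv = ["--eargs:-B:{}:--foo:42", "script.txr", "a", "b", "c"]
print(expand_eargs(argv))
```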

6.2.3 Argument Generation with the Null Hack

The --args and --eargs mechanisms do not solve the following problem: the POSIX env utility is
often exploited for its PATH searching capability, and used to express hash bang scripts in the following
way:

#!/usr/bin/env txr

Here, the env utility searches for the txr program in the directories indicated by the PATH variable,
which liberates the script from having to encode the exact location where the program is installed. However,
if the operating system allows only one argument in the hash bang mechanism, then no arguments can be
passed to the program.



To mitigate this problem, TXR supports a special feature in its hash bang support. If the hash bang #! line
contains a null byte, then text after the null byte, to the end of the line, is split into fields using the space
character as a separator, and these fields are inserted into the command line. This manipulation happens
during command line processing, prior to the execution of the file, which happens after command-line
processing. If this processing is applied to a file that is specified using the -f option, then the arguments which
arise from the special processing are inserted after that option and its argument. If this processing is
applied to the file which is the first non-option argument, then the options are inserted before that argument.
However, care is taken not to process that argument a second time. In either situation, processing of the
command line options continues, and the arguments which are processed next are the ones which were just
inserted. This is true even if the options had been inserted as a result of processing the first non-option
argument, which would ordinarily signal the termination of option processing.

In the following examples, it is assumed that the script is named, and invoked, as
/home/jenny/foo.txr, and is given arguments --bar abc, and that txr resolves to
/usr/bin/txr. The <NUL> code indicates a literal ASCII NUL character, or zero byte.

Basic example:

#!/usr/bin/env txr<NUL>-a 3

Here, env searches for txr, finding it in /usr/bin. Thus, including the executable name, TXR receives
this full argument list:

/usr/bin/txr /home/jenny/foo.txr --bar abc

The first non-option argument is the name of the script. TXR opens the script, and notices that it begins
with a hash bang line. It consumes the hash bang line and finds the null byte inside it, retrieving the
character string after it, which is "-a 3". This is split into the two arguments -a and 3, which are then inserted
into the command line ahead of the script name. The effective command line then becomes:

/usr/bin/txr -a 3 /home/jenny/foo.txr --bar abc

Command line option processing continues, beginning with the -a option. After the option is processed,
/home/jenny/foo.txr is encountered again. This time it is not opened a second time; it signals the
end of option processing, exactly as it would immediately do if it hadn’t triggered the insertion of any
arguments.
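The basic example can be traced with a small Python model of the null hack for the case where the script is the first non-option argument. This is an illustrative sketch, not TXR's implementation; the names null_hack_args and apply_null_hack are hypothetical, and discarding empty fields produced by repeated spaces is an assumption of the sketch.

```python
def null_hack_args(first_line: bytes):
    """Return the arguments encoded after the NUL byte of a #! line."""
    if not first_line.startswith(b"#!"):
        return []
    line = first_line.rstrip(b"\n")
    _head, nul, tail = line.partition(b"\0")
    if not nul:
        return []  # ordinary hash bang line: nothing to insert
    # Assumption: empty fields from consecutive spaces are dropped.
    return [field.decode() for field in tail.split(b" ") if field]


def apply_null_hack(argv, script_index, first_line):
    """Insert the encoded arguments ahead of the script name."""
    extra = null_hack_args(first_line)
    return argv[:script_index] + extra + argv[script_index:]


argv = ["/usr/bin/txr", "/home/jenny/foo.txr", "--bar", "abc"]
print(apply_null_hack(argv, 1, b"#!/usr/bin/env txr\0-a 3\n"))
```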

Advanced example: use env to invoke txr passing options to interpreter and to the script:

#!/usr/bin/env txr<NUL>--eargs:-C:175:{}:--debug

This example shows how --eargs can be used in conjunction with the null hack. When txr begins
executing, it receives the arguments

/usr/bin/txr /home/jenny/foo.txr

The script file is opened, and the arguments delimited by the null character in the hash bang line are

inserted, resulting in the effective command line:

/usr/bin/txr --eargs:-C:175:{}:--debug /home/jenny/foo.txr

Next, --eargs is processed in the ordinary way, transforming the command line into:

/usr/bin/txr -C 175 /home/jenny/foo.txr --debug



The name of the script file is encountered, and signals the end of option processing. Thus txr receives the
-C option, instructing it to emulate some behaviors from version 175, and the /home/jenny/foo.txr
script receives --debug as its argument: it executes with the *args* list containing one element, the
character string "--debug".

The hash bang null hack feature was introduced in TXR 177. Previous versions ignore the hash bang line,
performing no special processing. Where a risk exists that programs which depend on the feature might be
executed by an older version of TXR, care must be taken to detect and handle that situation, either by
means of the txr-version variable, or else by some logic which infers that the processing of the hash
bang line hadn’t been performed.

6.2.4 Passing Options to TXR via Hash Bang Null Hack

It is possible to use the Hash Bang Null Hack, such that the resulting executable program recognizes TXR

options. This is made possible by a special behavior in the processing of the -f option.

For instance, suppose that the effect of the following familiar hash bang line is required:

#!/path/to/txr -f

However, suppose there is also a requirement to use the env utility to find TXR. Furthermore, the
operating system allows only one hash bang argument. Using the Null Hack, this is rewritten as:

#!/usr/bin/env txr<NUL>-f

then if the script is invoked with arguments -i a b c, the command line will ultimately be transformed
into:

/path/to/txr -f /path/to/scriptfile -i a b c

which allows TXR to process the -i option, leaving a, b and c as arguments for the script.

However, note that there is a subtle issue with the -f option that has been inserted via the Null Hack:
namely, this insertion happens after TXR has opened the script file and read the hash bang line from it.
This means that when the inserted -f option is being processed, the script file is already open. A special
behavior occurs. The -f option processing notices that the argument to -f is identical to the path name of
the script file that TXR has already opened for processing. The -f option and its argument are
then skipped.

6.2.5 Hash Bang and Setuid

TXR supports setuid hash bang scripting, even on platforms that do not support setuid and setgid attributes
on hash bang scripts. On such platforms, TXR has to be installed setuid/setgid. See the section
SETUID/SETGID OPERATION. On some platforms, it may also be necessary to use the --reexec
option.

6.3 Whitespace

Outside of directives, whitespace is significant in TXR queries, and represents a pattern match for
whitespace in the input. An extent of text consisting of an undivided mixture of tabs and spaces is a whitespace
token.

Whitespace tokens match a precisely identical piece of whitespace in the input, with one exception: a

whitespace token consisting of precisely one space has a special meaning. It is equivalent to the regular

expression @/[ ]+/: match an extent of one or more spaces (but not tabs!). Multiple consecutive spaces



do not have this meaning.

Thus, the query line "a b" (one space between a and b) matches "a b" with any number of spacesbetween the two letters.

For matching a single space, the syntax @\ can be used (backslash-escaped space).

It is more often necessary to match multiple spaces than to match exactly one space, so this rule simplifies many queries and inconveniences only a few.

In output clauses, string and character literals and quasiliterals, a space token denotes a space.
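The whitespace rule for query text can be illustrated by translating a line of literal text into an ordinary Python regular expression. This is a sketch under stated assumptions: the helper name is hypothetical, the input contains only literal text (no @ syntax), and only tabs and spaces count as whitespace, as in the definition above.

```python
import re


def query_text_to_regex(query_line):
    """Translate literal query text to a Python regex, applying the lone-space rule."""
    parts = []
    # Split the line into whitespace tokens (runs of spaces/tabs) and other runs.
    for token in re.findall(r"[ \t]+|[^ \t]+", query_line):
        if token == " ":
            parts.append("[ ]+")            # a lone space: one or more spaces
        else:
            parts.append(re.escape(token))  # anything else matches itself exactly
    return "".join(parts)


one_space = query_text_to_regex("a b")
two_spaces = query_text_to_regex("a  b")

print(bool(re.fullmatch(one_space, "a     b")))   # True: any number of spaces
print(bool(re.fullmatch(one_space, "a\tb")))      # False: a tab is not a space
print(bool(re.fullmatch(two_spaces, "a   b")))    # False: two spaces match exactly two
```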

6.4 Text

Query material which is not escaped by the special character @ is literal text, which matches input character for character. Text which occurs at the beginning of a line matches the beginning of a line. Text which starts in the middle of a line, other than following a variable, must match exactly at the current position, where the previous match left off. Moreover, if the text is the last element in the line, its match is anchored to the end of the line.

An empty query line matches an empty line in the input. Note that an empty input stream does not contain

any lines, and therefore is not matched by an empty line. An empty line in the input is represented by a

newline character which is either the first character of the file, or follows a previous newline-terminated

line.

Input streams which end without terminating their last line with a newline are tolerated, and are treated as if

they had the terminator.

Text which follows a variable has special semantics, described in the section Variables below.

A query may not leave a line of input partially matched. If any portion of a line of input is matched, it must be entirely matched, otherwise a matching failure results. However, a query may leave unmatched lines. Matching only four lines of a ten line file is not a matching failure. The eof directive can be used to explicitly match the end of a file.

In the following example, the query matches the text, even though the text has an extra line.

code: Four score and seven
      years ago our

data: Four score and seven
      years ago our
      forefathers

In the following example, the query fails to match the text, because the text has extra material on one line

that is not matched:

code: I can carry nearly eighty gigs
      in my head

data: I can carry nearly eighty gigs of data
      in my head

Needless to say, if the text has insufficient material relative to the query, that is a failure also.
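The line-level rules above (every matched line must be matched entirely, trailing unmatched lines are allowed, missing lines are a failure) can be summarized in a small Python model. It is a deliberately simplified sketch: the function name is hypothetical, and it compares lines for exact equality, ignoring the lone-space rule and all @ syntax.

```python
def literal_query_matches(query_lines, data_lines):
    """True if every query line equals the corresponding data line.

    Extra data lines are allowed; extra material within a line is not.
    """
    if len(data_lines) < len(query_lines):
        return False                      # insufficient data is a failure
    return all(q == d for q, d in zip(query_lines, data_lines))


# Extra trailing line in the data: still a match.
print(literal_query_matches(
    ["Four score and seven", "years ago our"],
    ["Four score and seven", "years ago our", "forefathers"]))

# Extra material within a line: matching failure.
print(literal_query_matches(
    ["I can carry nearly eighty gigs", "in my head"],
    ["I can carry nearly eighty gigs of data", "in my head"]))
```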

To match arbitrary material from the current position to the end of a line, the "match any sequence of characters, including empty" regular expression @/.*/ can be used. Example:

code: I can carry nearly eighty gigs@/.*/

data: I can carry nearly eighty gigs of data

In this example, the query matches, since the regular expression matches the string " of data". (See the Regular Expressions section below.)

Another way to do this is:

code: I can carry nearly eighty gigs@(skip)

6.5 Special Characters in Text

Control characters may be embedded directly in a query (with the exception of newline characters). An

alternative to embedding is to use escape syntax. The following escapes are supported:

@\newline

A backslash immediately followed by a newline introduces a physical line break without breaking up the logical line. Material following this sequence continues to be interpreted as a continuation of the previous line, so that indentation can be introduced to show the continuation without appearing in the data.

@\space

A backslash followed by a space encodes a space. This is useful in line continuations when it is necessary for some or all of the leading spaces to be preserved. For instance the two line sequence

abcd@\
  @\ efg

is equivalent to the line

abcd efg

The two spaces before the @\ in the second line are consumed. The spaces after are preserved.

@\a Alert character (ASCII 7, BEL).

@\b Backspace (ASCII 8, BS).

@\t Horizontal tab (ASCII 9, HT).

@\n Line feed (ASCII 10, LF). Serves as abstract newline on POSIX systems.

@\v Vertical tab (ASCII 11, VT).

@\f Form feed (ASCII 12, FF). This character clears the screen on many kinds of terminals, or ejects a page of text from a line printer.

@\r Carriage return (ASCII 13, CR).

@\e Escape (ASCII 27, ESC).

@\x hex-digits

A @\x immediately followed by a sequence of hex digits is interpreted as a hexadecimal numeric character code. For instance @\x41 is the ASCII character A. If a semicolon character immediately follows the hex digits, it is consumed, and characters which follow are not considered part of the hex escape even if they are hex digits.

@\ octal-digits

A @\ immediately followed by a sequence of octal digits (0 through 7) is interpreted as an octal character code. For instance @\010 is character 8, same as @\b. If a semicolon character immediately follows the octal digits, it is consumed, and subsequent characters are not treated as part of the octal escape, even if they are octal digits.

Note that if a newline is embedded into a query line with @\n, this does not split the line into two; it’s embedded into the line and thus cannot match anything. However, @\n may be useful in the @(cat) directive and in @(output).
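The role of the optional semicolon in the numeric escapes is easiest to see by decoding a few strings. The following Python sketch handles only the @\x and octal forms; the function name is hypothetical, and it assumes that the digit sequence is taken as long as possible, which is why the semicolon is needed to end it early.

```python
import re

# "@\" followed by either x + hex digits or octal digits, then an optional ";".
_NUMERIC_ESCAPE = re.compile(r"@\\(?:x([0-9A-Fa-f]+)|([0-7]+));?")


def decode_numeric_escapes(text):
    """Replace @\\x<hex> and @\\<octal> escapes with the characters they denote."""
    def replace(match):
        hex_digits, octal_digits = match.group(1), match.group(2)
        code = int(hex_digits, 16) if hex_digits else int(octal_digits, 8)
        return chr(code)
    return _NUMERIC_ESCAPE.sub(replace, text)


print(decode_numeric_escapes(r"@\x41"))               # A
print(decode_numeric_escapes(r"@\x41;BC"))            # ABC: the semicolon ends the escape
print(decode_numeric_escapes(r"@\010") == "\b")       # True: octal 010 is character 8
```

Without the semicolon, `@\x41BC` would be read as the single character with code 41BC hexadecimal, since B and C are themselves hex digits.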

6.6 Character Handling and International Characters

TXR represents text internally using wide characters, which are used to represent Unicode code points. Script source code, as well as all data sources, are assumed to be in the UTF-8 encoding. In TXR and TXR Lisp source, extended characters can be used directly in comments, literal text, string literals, quasiliterals and regular expressions. Extended characters can also be expressed indirectly using hexadecimal or octal escapes. On some platforms, wide characters may be restricted to 16 bits, so that TXR can only work with characters in the BMP (Basic Multilingual Plane) subset of Unicode.

TXR does not use the localization features of the system library; its handling of extended characters is not affected by environment variables like LANG and LC_CTYPE. The program reads and writes only the UTF-8 encoding.

If TXR encounters invalid bytes in the UTF-8 input, what happens depends on the context in which this occurs. In a query, comments are read without regard for encoding, so invalid encoding bytes in comments are not detected. A comment is simply a sequence of bytes terminated by a newline. In lexical elements which represent text, such as string literals, invalid or unexpected encoding bytes are treated as syntax errors. The scanner issues an error message, then discards a byte and resumes scanning. Certain sequences pass through the scanner without triggering an error, namely some UTF-8 overlong sequences. These are caught when the lexeme is subject to UTF-8 decoding, and treated in the same manner as other UTF-8 data, described in the following paragraph.

Invalid bytes in data are treated as follows. When an invalid byte is encountered in the middle of a multibyte character, or if the input ends in the middle of a multibyte character, or if a character is extracted which is encoded as an overlong form, the UTF-8 decoder returns to the starting byte of the ill-formed multibyte character, and extracts just that byte, mapping it to the Unicode character range U+DC00 through U+DCFF. The decoding resumes afresh at the following byte, expecting that byte to be the start of a UTF-8 code.

Furthermore, because TXR internally uses a null-terminated character representation of strings which easily interoperates with C language interfaces, when a null character is read from a stream, TXR converts it to the code U+DC00. On output, this code converts back to a null byte, as explained in the previous paragraph. By means of this representational trick, TXR can handle textual data containing null bytes.
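This byte-preserving scheme closely resembles the surrogateescape error handler found in Python, so a rough model of it fits in a few lines. It is a sketch, not TXR's decoder: the function names are hypothetical, Python's handler covers only U+DC80 through U+DCFF, and the null byte mapping is added by hand. The encoding direction assumes that every code in the U+DC00 range is written back as the original byte, which the text above states explicitly only for U+DC00.

```python
def decode_preserving_bytes(data):
    """Decode UTF-8, mapping each undecodable byte to U+DC00 plus the byte value."""
    text = data.decode("utf-8", errors="surrogateescape")
    return text.replace("\x00", "\udc00")       # null byte is represented as U+DC00


def encode_preserving_bytes(text):
    """Inverse of decode_preserving_bytes: restore the original bytes."""
    text = text.replace("\udc00", "\x00")
    return text.encode("utf-8", errors="surrogateescape")


samples = [b"a\xffb", b"\xe2\x82", b"\xc0\xaf", b"x\x00y"]
for raw in samples:
    decoded = decode_preserving_bytes(raw)
    print(raw, [hex(ord(ch)) for ch in decoded])
    assert encode_preserving_bytes(decoded) == raw   # the round trip is lossless
```

The second sample is a truncated three-byte character and the third is an overlong encoding; in both cases each byte is preserved individually rather than being discarded.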

6.7 Regular Expression Directives

In place of a piece of text (see section Text above), a regular expression directive may be used, which has

the following syntax:

@/RE/

where the RE part enclosed in slashes represents regular expression syntax (described in the section Regular Expressions below).

Long regular expressions can be broken into multiple lines using a backslash-newline sequence. Whitespace before the sequence or after the sequence is not significant, so the following two are equivalent:

@/reg \
  ular/

@/regular/

There may not be whitespace between the backslash and newline.

Whereas literal text simply represents itself, a regular expression denotes a (potentially infinite) set of texts. The regular expression directive matches the longest piece of text (possibly empty) which belongs to the set denoted by the regular expression. The match is anchored to the current position; thus if the directive is the first element of a line, the match is anchored to the start of a line. If the regular expression directive is the last element of a line, it is anchored to the end of the line also: the regular expression must match the text from the current position to the end of the line.

Even if the regular expression matches the empty string, the match will fail if the input is empty, or has run out of data. For instance suppose the third line of the query is the regular expression @/.*/, but the input is a file which has only two lines. This will fail: the data has no line for the regular expression to match. A line containing no characters is not the same thing as the absence of a line, even though both abstractions imply an absence of characters.

Like text which follows a variable, a regular expression directive which follows a variable has special

semantics, described in the section Variables below.

6.8 Variables

Much of the query syntax consists of arbitrary text, which matches file data character for character. Embedded within the query may be variables and directives which are introduced by a @ character. Two consecutive @@ characters encode a literal @.

A variable matching or substitution directive is written in one of several ways:

@sident
@{bident}
@*sident
@*{bident}
@{bident /regex/}
@{bident (fun [arg ... ])}
@{bident number}

The forms with an * indicate a long match, see Longest Match below. The last three forms with the embedded regexp /regex/ or number or function have special semantics; see Positive Match below.

The identifier t cannot be used as a name; it is a reserved symbol which denotes the value true. An attempt to use the variable @t will result in an exception. The symbol nil can be used where a variable name is required syntactically, but it has special semantics, described in a section below.

A sident is a "simple identifier" form which is not delimited by braces.

A sident consists of any combination of one or more letters, numbers, and underscores. It may not look like a number, so that for instance 123 is not a valid sident, but 12A is valid. Case is sensitive, so that FOO is different from foo, which is different from Foo.

The braces around an identifier can be used when material which follows would otherwise be interpreted as being part of the identifier. When a name is enclosed in braces it is a bident.

The following additional characters may be used as part of bident which are not allowed in a sident:

! $ % & * + - < = > ? \ ~

Moreover, most Unicode characters beyond U+007F may appear in a bident, with certain exceptions. A character may not be used if it is any of the Unicode space characters, a member of the high or low surrogate region, a member of any Unicode private use area, or is one of the two characters U+FFFE or U+FFFF.

The rule still holds that a name cannot look like a number so +123 is not a valid bident but these are valid: a->b, *xyz*, foo-bar.

The syntax @FOO_bar introduces the name FOO_bar, whereas @{FOO}_bar means the variable named "FOO" followed by the text "_bar". There may be whitespace between the @ and the name, or opening brace. Whitespace is also allowed in the interior of the braces. It is not significant.

If a variable has no prior binding, then it specifies a match. The match is determined from some current

position in the data: the character which immediately follows all that has been matched previously. If a

variable occurs at the start of a line, it matches some text at the start of the line. If it occurs at the end of a

line, it matches everything from the current position to the end of the line.

6.9 Negative Match

If a variable is one of the plain forms

@sident
@{bident}
@*sident
@*{bident}

then this is a "negative match". The extent of the matched text (the text bound to the variable) is determined by looking at what follows the variable, and ranges from the current position to some position where the following material finds a match. This is why this is called a "negative match": the spanned text which ends up bound to the variable is that in which the match for the trailing material did not occur.

A variable may be followed by a piece of text, a regular expression directive, a function call, a directive,

another variable, or nothing (i.e. occurs at the end of a line). These cases are described in detail below.

6.9.1 Variable Followed by Nothing

If the variable is followed by nothing, the negative match extends from the current position in the data, to

the end of the line. Example:

code: a b c @FOO

data: a b c defghijk

result: FOO="defghijk"

6.9.2 Variable Followed by Text

For the purposes of determining the negative match, text is defined as a sequence of literal text and regular

expressions, not divided by a directive. So for instance in this example:

@a:@/foo/bcd e@(maybe)f@(end)



the variable @a is considered to be followed by ":@/foo/bcd e".

If a variable is followed by text, then the extent of the negative match is determined by searching for the first occurrence of that text within the line, starting at the current position.

The variable matches everything between the current position and the matching position (not including the matching position). Any whitespace which follows the variable (and is not enclosed inside braces that surround the variable name) is part of the text. For example:

code: a b @FOO e f

data: a b c d e f

result: FOO="c d"

In the above example, the pattern text "a b " matches the data "a b ". So when the @FOO variable is processed, the data being matched is the remaining "c d e f". The text which follows @FOO is " e f". This is found within the data "c d e f" at position 3 (counting from 0). So positions 0-2 ("c d") constitute the matching text which is bound to FOO.

6.9.3 Variable Followed by a Function Call or Directive

If the variable is followed by a function call, or a directive, the extent is determined by scanning the text for

the first position where a match occurs for the entire remainder of the line. (For a description of functions,

see Functions.)

For example:

@foo@(bind a "abc")xyz

Here, foo will match the text from the current position to where "xyz" occurs, even though there is a @(bind) directive. Furthermore, if more material is added after the xyz, it is part of the search. Note the difference between the following two:

@foo@/abc/@(func)

@foo@(func)@/abc/

In the first example, the variable foo matches the text from the current position until the match for the regular expression abc. @(func) is not considered when processing @foo. In the second example, the variable foo matches the text from the current position until the position which matches the function call, followed by a match for the regular expression. The entire sequence @(func)@/abc/ is considered.

6.9.4 Consecutive Variables

If an unbound variable specifies a fixed-width match or a regular expression, then the issue of consecutive

variables does not arise. Such a variable consumes text regardless of any context which follows it.

However, what if an unbound variable with no modifier is followed by another variable? The behavior

depends on the nature of the other variable.

If the other variable is also unbound, and also has no modifier, this is a semantic error which will cause the query to fail. A diagnostic message will be issued, unless operating in quiet mode via -q. The reason is that there is no way to bind two consecutive variables to an extent of text; this is an ambiguous situation, since there is no matching criterion for dividing the text between two variables. (In theory, a repetition of the same variable, like @FOO@FOO, could find a solution by dividing the match extent in half, which would work only in the case when it contains an even number of characters. This behavior seems to have dubious value.)



An unbound variable may be followed by one which is bound. The bound variable is effectively replaced by

the text which it denotes, and the logic proceeds accordingly.

It is possible for a variable to be bound to a regular expression. If x is an unbound variable and y is bound to a regular expression RE, then @x@y means @x@/RE/. A variable v can be bound to a regular expression using, for example, @(bind v #/RE/).

The @* syntax for longest match is available. Example:

code: @FOO:@BAR@FOO

data: xyz:defxyz

result: FOO=xyz, BAR=def

Here, FOO is matched with "xyz", based on the delimiting around the colon. The colon in the pattern then matches the colon in the data, so that BAR is considered for matching against "defxyz". BAR is followed by FOO, which is already bound to "xyz". Thus "xyz" is located in the "defxyz" data following "def", and so BAR is bound to "def".

If an unbound variable is followed by a variable which is bound to a list, or nested list, then each character

string in the list is tried in turn to produce a match. The first match is taken.

An unbound variable may be followed by another unbound variable which specifies a regular expression or function call match. This is a special case called a "double variable match". What happens is that the text is searched using the regular expression or function. If the search fails, then neither variable is bound: it is a matching failure. If the search succeeds, then the first variable is bound to the text which is skipped by the search. The second variable is bound to the text matched by the regular expression or function. Example:

code: @foo@{bar /abc/}

data: xyz@#abc

result: foo="xyz@#", bar="abc"
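The double variable match amounts to a regular expression search from the current position. A minimal Python model, assuming a hypothetical helper name and using Python's own regex dialect (which coincides with TXR's only for simple patterns such as abc), might look like this:

```python
import re


def double_variable_match(line, pos, regex):
    """Return (skipped_text, matched_text, new_position), or None on failure."""
    match = re.compile(regex).search(line, pos)
    if match is None:
        return None                       # neither variable is bound
    skipped = line[pos:match.start()]     # bound to the first variable
    matched = match.group(0)              # bound to the second variable
    return skipped, matched, match.end()


print(double_variable_match("xyz@#abc", 0, "abc"))   # ('xyz@#', 'abc', 8)
print(double_variable_match("xyz@#", 0, "abc"))      # None
```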

6.9.5 Consecutive Variables Via Directive

Two variables can be de facto consecutive in a manner shown in the following example:

@var1@(all)@var2@(end)

This is treated just like the variable followed by directive. No semantic error is identified, even if both variables are unbound. Here, @var2 matches everything at the current position, and so @var1 ends up bound to the empty string.

Example 1: b matches at position 0 and a binds the empty string:

code: @a@(all)@b@(end)

data: abc

result: a="", b="abc"

Example 2: *a specifies longest match (see Longest Match below), and so it takes everything:

code: @*a@(all)@b@(end)

data: abc

result: a="abc", b=""



6.9.6 Longest Match

The closest-match behavior for the negative match can be overridden to longest match behavior. A special syntax is provided for this: an asterisk between the @ and the variable, e.g.:

code: a @*{FOO}cd

data: a b cdcdcdcd

result: FOO="b cdcdcd"

code: a @{FOO}cd

data: a b cdcdcd

result: FOO="b "

In the former example, the match extends to the rightmost occurrence of "cd", and so FOO receives "b cdcdcd". In the latter example, the * syntax isn’t used, and so a leftmost match takes place. The extent covers only the "b ", stopping at the first "cd" occurrence.
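For a variable followed by literal text, the difference between the default and the longest match is the difference between a leftmost and a rightmost substring search. The following Python sketch (a hypothetical helper, limited to plain trailing text with no regular expressions or directives) reproduces the examples in this section and the earlier "a b @FOO e f" example:

```python
def negative_match(data, pos, trailing, longest=False):
    """Return the text bound to a variable at pos that is followed by trailing text.

    Returns None if the trailing text does not occur at or after pos.
    """
    found = data.rfind(trailing, pos) if longest else data.find(trailing, pos)
    if found < 0:
        return None
    return data[pos:found]


print(repr(negative_match("a b c d e f", 4, " e f")))               # 'c d'
print(repr(negative_match("a b cdcdcdcd", 2, "cd", longest=True)))  # 'b cdcdcd'
print(repr(negative_match("a b cdcdcd", 2, "cd")))                  # 'b '
```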

6.10 Positive Match

There are syntactic variants of variable syntax which have an embedded expression enclosed with the variable in braces:

@{bident /regex/}
@{bident (fun [args...])}
@{bident number}
@{bident bident}

These specify a variable binding that is driven by a positive match derived from a regular expression, function or character count, rather than from trailing material (which is regarded as a "negative" match, since the variable is bound to material which is skipped in order to match the trailing material). In the /regex/ form, the match extends over all characters from the current position which match the regular expression regex (see the Regular Expressions section below). In the (fun [args ...]) form, the match extends over characters which are matched by the call to the function, if the call succeeds. Thus @{x (y z w)} is just like @(y z w), except that the region of text skipped over by @(y z w) is also bound to the variable x. See Functions below.

In the number form, the match processes a field of text which consists of the specified number of characters, which must be a non-negative number. If the data line doesn’t have that many characters starting at the current position, the match fails. A match for zero characters produces an empty string. The text which is actually bound to the variable is all text within the specified field, but excluding leading and trailing whitespace. If the field contains only spaces, then an empty string is extracted.
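The fixed-width form behaves like slicing a field out of the line and trimming it. A small Python model follows; the helper name is hypothetical, and it assumes that "whitespace" here means spaces and tabs.

```python
def fixed_width_match(line, pos, width):
    """Return (bound_text, new_position), or None if the line is too short."""
    if width < 0:
        raise ValueError("field width must be non-negative")
    end = pos + width
    if end > len(line):
        return None                       # not enough characters: match fails
    field = line[pos:end]
    return field.strip(" \t"), end        # leading/trailing whitespace is excluded


print(fixed_width_match("  ab  cd", 0, 6))   # ('ab', 6)
print(fixed_width_match("    x", 0, 4))      # ('', 4)
print(fixed_width_match("abc", 0, 5))        # None
```

Note that the position advances by the full field width even though the bound text is trimmed.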

This syntax is processed without consideration of what other syntax follows. A positive match may be

directly followed by an unbound variable.

The @{bident bident} syntax allows the number or regex modifier to come from a variable. The variable must be bound and contain a non-negative integer or regular expression. For example, @{x y} behaves like @{x 3} if y is bound to the integer 3. It is an error if y is unbound.

6.11 Special Symbols nil and t

Just like in the Common Lisp language, the names nil and t are special.



The nil symbol stands for the empty list object, an object which marks the end of a list, and Boolean false. It is synonymous with the syntax () which may be used interchangeably with nil in most constructs.

In TXR Lisp, nil and t cannot be used as variables. When evaluated, they evaluate to themselves.

In the TXR pattern language, nil can be used in the variable binding syntax, but does not create a binding; it has a special meaning. It allows the variable matching syntax to be used to skip material, in ways similar to the skip directive.

The nil symbol is also used as a block name, both in the TXR pattern language and in TXR Lisp. A block named nil is considered to be anonymous.

6.12 Keyword Symbols

Symbols whose names begin with the : character are keyword symbols. These also may not be used as variables, and stand for themselves. Keywords are useful for labeling information and situations.

6.13 Regular Expressions

Regular expressions are a language for specifying sets of character strings. Through the use of pattern matching elements, a regular expression is able to denote an infinite set of texts. TXR contains an original implementation of regular expressions, which supports the following syntax:

. The period is a "wildcard" that matches any character.

[] Character class: matches a single character, from the set specified by special syntax written between the square brackets. This supports basic regexp character class syntax. POSIX notation like [:digit:] is not supported. The regex tokens \s, \d and \w are permitted in character classes, but not their complementing counterparts. These tokens simply contribute their characters to the class. The class [a-zA-Z] means match an uppercase or lowercase letter; the class [0-9a-f] means match a digit or a lowercase letter from a through f; the class [^0-9] means match a non-digit, and so forth. There are no locale-specific behaviors in TXR regular expressions; [A-Z] denotes an ASCII/Unicode range of characters. The class [\d.] means match a digit or the period character. A ] or - can be used within a character class, but must be escaped with a backslash. A ^ in the first position denotes a complemented class, unless it is escaped by backslash. In any other position, it denotes itself. Two backslashes code for one backslash. So for instance [\[\-] means match a [ or - character, [^^] means match any character other than ^, and [\^\\] means match either a ^ or a backslash. Regex operators such as *, + and & appearing in a character class represent ordinary characters. The characters -, ] and ^ occurring outside of a character class are ordinary. Unescaped / characters can appear within a character class. The empty character class [] matches no character at all, and its complement [^] matches any character, and is treated as a synonym for the . (period) wildcard operator.

\s, \w and \d

These regex tokens each match a single character. The \s regex token matches a wide variety of ASCII whitespace characters and Unicode spaces. The \w token matches alphabetic word characters; it is equivalent to the character class [A-Za-z_]. The \d token matches a digit, and is equivalent to [0-9].

\S, \W and \D

These regex tokens are the complemented counterparts of \s, \w and \d. The \S token matches all those characters which \s does not match, \W matches all characters that \w does not match and \D matches nondigits.

empty An empty expression is a regular expression. It represents the set of strings consisting of the empty string; i.e. it matches just the empty string. The empty regex can appear alone as a full regular expression (for instance the TXR syntax @// with nothing between the slashes) and can also be passed as a subexpression to operators, though this may require the use of parentheses to make the empty regex explicit. For example, the expression a| means: match either a, or nothing. The forms * and (*) are syntax errors; though not useful, the correct way to match the empty expression zero or more times is the syntax ()*.

nomatch

The nomatch regular expression represents the empty set: it matches no strings at all, not even the empty string. There is no dedicated syntax to directly express nomatch in the regex language. However, the empty character class [] is equivalent to nomatch, and may be considered to be a notation for it. Other representations of nomatch are possible: for instance, the regex ~.* which is the complement of the regex that denotes the set of all possible strings, and thus denotes the empty set. A nomatch has uses; for instance, it can be used to temporarily "comment out" regular expressions. The regex ([]abc|xyz) is equivalent to (xyz), since the []abc branch cannot match anything. Using [] to "block" a subexpression allows you to leave it in place, then enable it later by removing the "block".

(R) If R is a regular expression, then so is (R). The contents of parentheses denote one regular expression unit, so that for instance in (RE)*, the * operator applies to the entire parenthesized group. The syntax () is valid and equivalent to the empty regular expression.

R? Optionally match the preceding regular expression R.

R* Match the expression R zero or more times. This operator is sometimes called the "Kleene star", or "Kleene closure". The Kleene closure favors the longest match. Roughly speaking, if there are two or more ways in which R1*R2 can match, then that match occurs in which R1* matches the longest possible text.

R+ Match the preceding expression R one or more times. Like R*, this favors the longest possible match: R+ is equivalent to RR*.

R1%R2 Match R1 zero or more times, then match R2. If this match can occur in more than one way, then it occurs such that R1 is matched the fewest number of times, which is opposite from the behavior of R1*R2. Repetitions of R1 terminate at the earliest point in the text where a non-empty match for R2 occurs. Because it favors shorter matches, % is termed a non-greedy operator. If R2 is the empty expression, or equivalent to it, then R1%R2 reduces to R1*. So for instance (R%) is equivalent to (R*), since the missing right operand is interpreted as the empty regex. Note that whereas the expression (R1*R2) is equivalent to (R1*)R2, the expression (R1%R2) is not equivalent to (R1%)R2. Also note that A(XY%Z)B is equivalent to AX(Y%Z)B. This is because the precedence of % is higher than that of catenation on its left side; this rule prevents the given syntax from expressing the XY catenation. The expression may be understood as: A(X(Y%Z))B where the inner parentheses clarify how the syntax surrounding the % operator is being parsed, and the outer parentheses are superfluous. The correct way to assert catenation of XY as the left operand of % is A(XY)%ZB. To specify XY as the left operand, and limit the right operand to just Z, the correct syntax is A((XY)%Z)B. By contrast, the expression A(X%YZ)B is not equivalent to A(X%Y)ZB because the precedence of % is lower than that of catenation on its right side. The operator is effectively "bi-precedential".

~R Match the opposite of the following expression R; that is, match exactly those texts that R does not match. This operator is called complement, or logical not.

R1R2 Two consecutive regular expressions denote catenation: the left expression must match, and then the right.

R1|R2 Match either the expression R1 or R2. This operator is known by a number of names: union, logical or, disjunction, branch, or alternative.

R1&R2 Match both the expression R1 and R2 simultaneously; i.e. the matching text must be one of the texts which are in the intersection of the set of texts matched by R1 and the set matched by R2. This operator is called intersection, logical and, or conjunction.
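The complement and intersection operators are easiest to understand in terms of set membership: a regular expression denotes a set of texts, and a text either belongs to it or does not. Python's re module has no ~ or & operators, so the sketch below only models membership of a whole text using re.fullmatch; the helper names are hypothetical and the patterns are written in Python's dialect, not TXR's.

```python
import re


def in_set(regex, text):
    """True if the entire text belongs to the set denoted by regex."""
    return re.fullmatch(regex, text) is not None


def in_intersection(r1, r2, text):
    """Membership in R1&R2: the text must belong to both sets."""
    return in_set(r1, text) and in_set(r2, text)


def in_complement(regex, text):
    """Membership in ~R: exactly those texts that R does not match."""
    return not in_set(regex, text)


print(in_intersection(".*foo.*", ".*bar.*", "foobar"))   # True: contains both
print(in_intersection(".*foo.*", ".*bar.*", "food"))     # False: no "bar"
print(in_complement(".*foo.*", "bar"))                   # True: does not contain "foo"
print(in_complement(".*", ""))                           # False: ~.* matches nothing
```

The last line mirrors the remark that ~.* is a representation of nomatch: since .* matches every single-line text, including the empty one, its complement has no members.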

Any character which is not a regular expression operator, a backslash escape, or the slash delimiter, denotes a one-position match of that character itself.

Any of the special characters, including the delimiting /, and the backslash, can be escaped with a backslash to suppress its meaning and denote the character itself.

Furthermore, all of the same escapes as are described in the section Special Characters in Text above are supported. The difference is that in regular expressions, the @ character is not required, so for example a tab is coded as \t rather than @\t. Octal and hex character escapes can be optionally terminated by a semicolon, which is useful if the following characters are octal or hex digits not intended to be part of the escape.

Only the above escapes are supported. Unlike in some other regular expression implementations, if a backslash appears before a character which isn’t a regex special character or one of the supported escape sequences, it is an error. This wasn’t true of historic versions of TXR. See the COMPATIBILITY section.

Precedence table, highest to lowest:

Operators          Class          Associativity
(R) []             primary
R? R+ R* R%...     postfix        left-to-right
R1R2               catenation     left-to-right
~R ...%R           unary          right-to-left
R1&R2              intersection   left-to-right
R1|R2              union          left-to-right

The % operator is like a postfix operator with respect to its left operand, but like a unary operator with respect to its right operand. Thus a~b%c~d is a(~(b%(c(~d)))), demonstrating right-to-left associativity, where all of b% may be regarded as a unary operator being applied to c~d. Similarly, a?*+%b means (((a?)*)+)%b, where the trailing %b behaves like a postfix operator.

In TXR, regular expression matches do not span multiple lines. The regex language has no feature for multi-line matching. However, the @(freeform) directive allows the remaining portion of the input to be treated as one string in which line terminators appear as explicit characters. Regular expressions may freely match through this sequence.

It’s possible for a regular expression to match an empty string. For instance, if the next input character is z, facing the regular expression /a?/, there is a zero-character match: the regular expression’s state machine can reach an acceptance state without consuming any characters. Examples:

code: @A@/a?/@/.*/

data: zzzzz

result: A=""

code: @{A /a?/}@B

data: zzzzz

result: A="", B="zzzzz"

code: @*A@/a?/

data: zzzzz

result: A="zzzzz"

In the first example, variable @A is followed by a regular expression which can match an empty string. The expression faces the letter z at position 0 in the data line. A zero-character match occurs there, therefore the variable A takes on the empty string. The @/.*/ regular expression then consumes the line.



Similarly, in the second example, the /a?/ regular expression faces a z, and thus yields an empty string which is bound to A. Variable @B consumes the entire line.

The third example requests the longest match for the variable binding. Thus, a search takes place for the rightmost position where the regular expression matches. The regular expression matches anywhere, including the empty string after the last character, which is the rightmost place. Thus variable A fetches the entire line.
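The search described in the first and third examples can be modelled outside TXR. The following is a minimal Python sketch, not TXR code: it uses Python's re module as a stand-in for the TXR regex engine, and the helper name bind_before_regex is hypothetical. It only illustrates how the leftmost or rightmost position at which a regex matches determines what the preceding variable takes.

```python
import re

def bind_before_regex(line, pattern, longest=False):
    """Return the text a variable would take when followed by `pattern`.

    Shortest match (default): cut at the leftmost position where the
    regex matches. Longest match: cut at the rightmost such position.
    """
    rx = re.compile(pattern)
    # Positions 0..len(line) inclusive: the empty string after the last
    # character is also a place where a zero-character match can occur.
    positions = [i for i in range(len(line) + 1) if rx.match(line, i)]
    if not positions:
        return None  # the regex matches nowhere, so the match fails
    cut = positions[-1] if longest else positions[0]
    return line[:cut]

print(repr(bind_before_regex("zzzzz", r"a?")))                # ''
print(repr(bind_before_regex("zzzzz", r"a?", longest=True)))  # 'zzzzz'
```

Because a? can match zero characters, every position qualifies, so the shortest binding is empty and the longest binding is the whole line.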

For additional information about the advanced regular expression operators, see NOTES ON EXOTIC REGULAR EXPRESSIONS below.

6.14 Compound Expressions

If the @ escape character is followed by an open parenthesis or square bracket, this is taken to be the start of a TXR Lisp compound expression.

The TXR language has the unusual property that its syntactic elements, so-called directives, are Lisp compound expressions. These expressions not only enclose syntax, but expressions which begin with certain symbols de facto behave as tokens in a phrase structure grammar. For instance, the expression @(collect) begins a block which must be terminated by the expression @(end), otherwise there is a syntax error. The collect expression can contain arguments which modify the behavior of the construct, for instance @(collect :gap 0 :vars (a b)). In some ways, this situation might be compared to the HTML language, in which an element such as <a> must be terminated by </a> and can have attributes such as <a href="...">.

Compound expressions contain subexpressions: other compound expressions, or literal objects of various kinds. Among these are: symbols, numbers, string literals, character literals, quasiliterals and regular expressions. These are described in the following sections. Additional kinds of literal objects exist, which are discussed in the TXR LISP section of the manual.

Some examples of compound expressions are:

(banana)

(a b c (d e f))

( a (b (c d) (e ) ))

("apple" #\b #\space 3)

(a #/[a-z]*/ b)

(_ `@file.txt`)

Symbols occurring in a compound expression follow a slightly more permissive lexical syntax than the bident in the syntax @{bident} introduced earlier. The / (slash) character may be part of an identifier, or even constitute an entire identifier. In fact a symbol inside a directive is a lident. This is described in the Symbol Tokens section under TXR LISP. A symbol must not be a number; tokens that look like numbers are treated as numbers and not symbols.

6.15 Character Literals

Character literals are introduced by the #\ syntax, which is either followed by a character name, the letter x followed by hex digits, the letter o followed by octal digits, or a single character. Valid character names are:



nul        linefeed   return
alarm      newline    esc
backspace  vtab       space
tab        page       pnul

For instance #\esc denotes the escape character.

This convention for character literals is similar to that of the Scheme language. Note that #\linefeed and #\newline are the same character. The #\pnul character is specific to TXR and denotes the U+DC00 code in Unicode; the name stands for "pseudo-null", which is related to its special function. For more information about this, see the section "Character Handling and International Characters".
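The four forms of character literal can be summarized with a small lookup. The following is a minimal Python sketch rather than TXR code; the function name char_literal_code is hypothetical. The numeric values for the control-character names are the conventional ASCII codes, which is an assumption, since the text above states only the value of pnul.

```python
# Code points for the character names listed above. The control codes are
# the conventional ASCII values (an assumption); pnul is U+DC00 as stated.
CHAR_NAMES = {
    "nul": 0x00, "alarm": 0x07, "backspace": 0x08, "tab": 0x09,
    "linefeed": 0x0A, "newline": 0x0A, "vtab": 0x0B, "page": 0x0C,
    "return": 0x0D, "esc": 0x1B, "space": 0x20, "pnul": 0xDC00,
}

def char_literal_code(literal):
    # Return the code point denoted by a character literal (model only).
    if not literal.startswith("#\\") or len(literal) < 3:
        raise ValueError("not a character literal: %r" % literal)
    body = literal[2:]
    if len(body) == 1:          # a single character, e.g. #\b or #\x
        return ord(body)
    if body in CHAR_NAMES:      # a character name, e.g. #\esc
        return CHAR_NAMES[body]
    if body[0] == "x":          # the letter x followed by hex digits
        return int(body[1:], 16)
    if body[0] == "o":          # the letter o followed by octal digits
        return int(body[1:], 8)
    raise ValueError("unknown character name: %r" % body)

print(char_literal_code("#\\esc"))   # 27
print(char_literal_code("#\\x41"))   # 65
```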

6.16 String Literals

String literals are delimited by double quotes. A double quote within a string literal is encoded using \" and a backslash is encoded as \\. Backslash escapes like \n and \t are recognized, as are hexadecimal escapes like \xFF or \xabc and octal escapes like \123. Ambiguity between an escape and subsequent text can be resolved by using a trailing semicolon delimiter: "\xabc;d" is a string consisting of the character U+0ABC followed by "d". The semicolon delimiter disappears. To write a literal semicolon immediately after a hex or octal escape, write two semicolons, the first of which will be interpreted as a delimiter. Thus, "\x21;;" represents "!;".
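The role of the semicolon delimiter can be illustrated with a small decoder. This is a minimal Python sketch, not the TXR lexer: the name decode_escapes is hypothetical, only the escapes named above are handled, and it assumes that a hex or octal escape consumes as many digits as follow it, which is what makes the delimiter necessary.

```python
HEX = "0123456789abcdefABCDEF"
OCT = "01234567"
SIMPLE = {"n": "\n", "t": "\t", "\\": "\\", '"': '"'}

def decode_escapes(body):
    """Decode the escapes in a string literal body (the text between quotes)."""
    out, i = [], 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        i += 1                                  # skip the backslash
        if i >= len(body):
            raise ValueError("dangling backslash")
        nxt = body[i]
        if nxt == "x" or nxt in OCT:
            digits, base = (HEX, 16) if nxt == "x" else (OCT, 8)
            if nxt == "x":
                i += 1                          # skip the x
            start = i
            while i < len(body) and body[i] in digits:
                i += 1                          # assumption: digits are taken greedily
            if start == i:
                raise ValueError("escape without digits")
            out.append(chr(int(body[start:i], base)))
            if i < len(body) and body[i] == ";":
                i += 1                          # the delimiter disappears
        elif nxt in SIMPLE:
            out.append(SIMPLE[nxt])
            i += 1
        else:
            raise ValueError("unsupported escape: \\" + nxt)
    return "".join(out)

print(decode_escapes(r"\x21;;"))          # !;
print(decode_escapes(r"\xabc;d") == "\u0abcd")   # True
```

Without the semicolon, the same input would be read as the single character U+ABCD, since d is itself a hex digit.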

If the line ends in the middle of a literal, it is an error, unless the last character is a backslash. This backslash is a special escape which does not denote a character; rather, it indicates that the string literal continues on the next line. The backslash is deleted, along with whitespace which immediately precedes it, as well as leading whitespace in the following line. The escape sequence "\ " (backslash space) can be used to encode a significant space.

Example:

"foo \
bar"

"foo \
\ bar"

"foo\ \
bar"

The first string literal is the string "foobar". The second two are "foo bar".
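The continuation rule can be modelled as a single left-to-right scan. The following is a minimal Python sketch under stated assumptions, not the TXR lexer: the name join_continuations is hypothetical, and only the backslash-newline and backslash-space escapes are interpreted, while any other escape is passed through unchanged. Escaped spaces are flagged so that the trimming of whitespace before a continuation backslash does not remove them.

```python
def join_continuations(body):
    """Model line continuation inside a string literal body."""
    out = []                      # list of (char, came_from_escaped_space)
    i = 0
    while i < len(body):
        ch = body[i]
        nxt = body[i + 1] if i + 1 < len(body) else ""
        if ch == "\\" and nxt == " ":
            out.append((" ", True))           # significant space
            i += 2
        elif ch == "\\" and nxt == "\n":
            # Drop unescaped whitespace immediately before the backslash ...
            while out and out[-1][0] in (" ", "\t") and not out[-1][1]:
                out.pop()
            i += 2
            # ... and leading whitespace on the following line.
            while i < len(body) and body[i] in (" ", "\t"):
                i += 1
        elif ch == "\\" and nxt:
            out.append((ch, False))           # other escapes pass through
            out.append((nxt, False))
            i += 2
        else:
            out.append((ch, False))
            i += 1
    return "".join(c for c, _ in out)

print(join_continuations("foo   \\\n  bar"))      # foobar
print(join_continuations("foo   \\\n  \\ bar"))   # foo bar
print(join_continuations("foo\\  \\\n  bar"))     # foo bar
```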

6.17 Word List Literals

A word list literal (WLL) provides a convenient way to write a list of strings when such a list can be given as whitespace-delimited words.

There are two flavors of the WLL: the regular WLL which begins with #" (hash, double-quote) and the splicing list literal which begins with #*" (hash, star, double-quote).

Both types are terminated by a double quote, which may be escaped as \" in order to include it as a character. All the escaping conventions used in string literals can be used in word literals.

Unlike in string literals, whitespace (tabs and spaces) is not significant in word literals: it separates words. Whitespace may be escaped with a backslash in order to include it as a literal character.



Just like in string literals, an unescaped newline character is not allowed. A newline preceded by a backslash is permitted. Such an escaped newline, together with any leading and trailing unescaped whitespace, is removed and replaced with a single space.

Example:

#"abc def ghi" --> notates ("abc" "def" "ghi")

#"abc def \
ghi" --> notates ("abc" "def" "ghi")

#"abc\ def ghi" --> notates ("abc def" "ghi")

#"abc\ def\ \
\ ghi" --> notates ("abc def " " ghi")

A splicing word literal differs from a word literal in that it does not produce a list of string literals, but rather it produces a sequence of string literals that is merged into the surrounding syntax. Thus, the following two notations are equivalent:

(1 2 3 #*"abc def" 4 5 #"abc def")

(1 2 3 "abc" "def" 4 5 ("abc" "def"))

The regular WLL produced a single list object, but the splicing WLL expanded into multiple string literal objects.
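The word-splitting rules and the difference between the two flavors can be modelled as follows. This is a minimal Python sketch, not TXR code: the name split_words is hypothetical, and only the escapes for space, tab, double quote, backslash and newline are modelled. Python's list unpacking stands in for splicing.

```python
def split_words(body):
    """Split a word list literal body on unescaped spaces and tabs."""
    words, current, in_word = [], [], False

    def flush():
        nonlocal current, in_word
        if in_word:
            words.append("".join(current))
        current, in_word = [], False

    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\":
            nxt = body[i + 1] if i + 1 < len(body) else ""
            if nxt == "\n":
                flush()                  # escaped newline acts as one separator
            elif nxt in (" ", "\t", '"', "\\"):
                current.append(nxt)      # escaped character is part of the word
                in_word = True
            else:
                raise ValueError("escape not modelled in this sketch")
            i += 2
        elif ch in (" ", "\t"):
            flush()                      # unescaped whitespace separates words
            i += 1
        else:
            current.append(ch)
            in_word = True
            i += 1
    flush()
    return words

words = split_words("abc def")
# Analogue of (1 2 3 #*"abc def" 4 5 #"abc def"):
print([1, 2, 3, *words, 4, 5, words])
# [1, 2, 3, 'abc', 'def', 4, 5, ['abc', 'def']]
```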

6.18 String Quasiliterals

Quasiliterals are similar to string literals, except that they may contain variable references denoted by the usual @ syntax. The quasiliteral represents a string formed by substituting the values of those variables into the literal template. If a is bound to "apple" and b to "banana", the quasiliteral `one @a and two @{b}s` represents the string "one apple and two bananas". A backquote escaped by a backslash represents itself. Unlike in directive syntax, two consecutive @ characters do not code for a literal @, but cause a syntax error. The reason for this is that compounding of the @ syntax is meaningful. Instead, there is a \@ escape for encoding a literal @ character. Quasiliterals support the full output variable syntax. Expressions within variable substitutions follow the evaluation rules of TXR Lisp. This hasn’t always been the case: see the COMPATIBILITY section.

Quasiliterals can be split into multiple lines in the same way as ordinary string literals.
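The substitution performed by a quasiliteral can be approximated in a few lines. This is a minimal Python sketch, not TXR's implementation: the name quasi is hypothetical, variable names are assumed to be simple word characters, and only plain and braced variable references plus the \@ and backslash-backquote escapes are handled. The full output variable syntax and embedded Lisp expressions are outside its scope.

```python
import re

# Alternatives, in order: an escaped @ or backquote, @{name}, @name,
# and finally a stray @ that starts none of the above.
_QUASI = re.compile(r"\\([@`])|@\{(\w+)\}|@(\w+)|(@)")

def quasi(template, bindings):
    """Substitute variable values into a quasiliteral body (model only)."""
    def repl(m):
        if m.group(1) is not None:
            return m.group(1)            # \@ yields a literal @
        if m.group(4) is not None:
            raise ValueError("syntax error: stray @")
        return str(bindings[m.group(2) or m.group(3)])
    return _QUASI.sub(repl, template)

print(quasi("one @a and two @{b}s", {"a": "apple", "b": "banana"}))
# one apple and two bananas
```

The braces in @{b}s are what separate the variable name b from the literal s that follows it.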

6.19 Quasiword List Literals

The quasiword list literals (QLL-s) are to quasiliterals what WLL-s are to ordinary literals. (See the above section Word List Literals.)

A QLL combines the convenience of the WLL with the power of quasistrings.

Just as in the case of WLL-s, there are two flavors of the QLL: the regular QLL which begins with #` (hash, backquote) and the splicing QLL which begins with #*` (hash, star, backquote).

Both types are terminated by a backquote, which may be escaped as \` in order to include it as a character. All the escaping conventions used in quasiliterals can be used in QLL.

Unlike in quasiliterals, whitespace (tabs and spaces) is not significant in QLL: it separates words. Whitespace may be escaped with a backslash in order to include it as a literal character.

A newline is not permitted unless escaped. An escaped newline works exactly the same way as it does in word list literals (WLL-s).

Note that the delimiting into words is done before the variable substitution. If the variable a contains spaces, then #`@a` nevertheless expands into a list of one item: the string derived from a.

Examples:

#`abc @a ghi` --> notates (`abc` `@a` `ghi`)

#`abc @d@e@f \
ghi` --> notates (`abc` `@d@e@f` `ghi`)

#`@a\ @b @c` --> notates (`@a @b` `@c`)

A splicing QLL differs from an ordinary QLL in that it does not produce a list of quasiliterals, but rather it produces a sequence of quasiliterals that is merged into the surrounding syntax.
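The point that words are delimited before variables are substituted can be demonstrated with a small model. This is a minimal Python sketch, not TXR code: the name quasi_words is hypothetical, variable names are assumed to be simple word characters, and escaped newlines are not modelled.

```python
import re

# A word is a run of escaped characters and unescaped non-space characters.
_WORD = re.compile(r"(?:\\.|[^\s\\])+")
# Inside a word: an escaped character, @{name}, or @name.
_PART = re.compile(r"\\(.)|@\{(\w+)\}|@(\w+)")

def quasi_words(template, bindings):
    """Model of a QLL body: split into words first, substitute afterwards."""
    def expand(word):
        def repl(m):
            if m.group(1) is not None:
                return m.group(1)        # escaped character, e.g. "\ "
            return str(bindings[m.group(2) or m.group(3)])
        return _PART.sub(repl, word)
    return [expand(w) for w in _WORD.findall(template)]

# The value of a contains spaces, yet the result is still one item.
print(quasi_words("@a", {"a": "x y z"}))                          # ['x y z']
print(quasi_words(r"@a\ @b @c", {"a": "A", "b": "B", "c": "C"}))  # ['A B', 'C']
```

Had the substitution been done first, the spaces inside the value of a would have split it into three words.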

6.20 Numbers

TXR supports integers and floating-point numbers.

An integer constant is made up of digits 0 through 9, optionally preceded by a + or - sign.

Examples:

123
-34
+0
-0
+234483527304983792384729384723234

An integer constant can also be specified in hexadecimal using the prefix #x followed by an optional sign, followed by hexadecimal digits: 0 through 9 and the upper or lower case letters A through F:

#xFF ;; 255
#x-ABC ;; -2748

Similarly, octal numbers are supported with the prefix #o followed by octal digits:

#o777 ;; 511

and binary numbers can be written with a #b prefix:

#b1110 ;; 14

Note that the #b prefix is also used for buffer literals.

A floating-point constant is marked by the inclusion of a decimal point, the exponential "e notation", or both. It is an optional sign, followed by a mantissa consisting of digits, a decimal point, more digits, and then an optional exponential notation consisting of the letter e or E, an optional + or - sign, and then digits indicating the exponent value. In the mantissa, the digits are not entirely optional: at least one digit must either precede or follow the decimal point. That is to say, a decimal point by itself is not a floating-point constant.



Examples:

.123
123.
1E-3
20E40
.9E1
9.E19
-.5
+3E+3
1.E5

Examples which are not floating-point constant tokens:

.      ;; dot token, not a number
123E   ;; the symbol 123E
1.0E-  ;; syntax error: invalid floating point constant
1.0E   ;; syntax error: invalid floating point constant
1.E    ;; syntax error: invalid floating point literal
.e     ;; syntax error: dot token followed by symbol

In TXR there is a special "dotdot" token consisting of two consecutive periods. An integer constant followed immediately by dotdot is recognized as such; it is not treated as a floating constant followed by a dot. That is to say, 123.. does not mean 123. . (floating point 123.0 value followed by dot token). It means 123 .. (integer 123 followed by .. token).

Dialect note: unlike in Common Lisp, 123. is not an integer, but the floating-point number 123.0.
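The token rules for numbers can be checked against the examples above with a small classifier. This is a minimal Python sketch, not the TXR lexer: the name read_number is hypothetical, the regular expressions are an approximation of the prose rules, and allowing a sign after #o and #b is an assumption (the text states it only for #x). It returns None for tokens that are not numeric constants and does not distinguish symbols from syntax errors; the dotdot rule is not modelled.

```python
import re

INTEGER = re.compile(r"[+-]?\d+")
RADIX = re.compile(r"#([xob])([+-]?[0-9A-Fa-f]+)")
# Either a mantissa containing a decimal point with at least one digit
# beside it and an optional exponent, or digits with a mandatory exponent.
FLOAT = re.compile(
    r"[+-]?(?:(?:\d+\.\d*|\.\d+)(?:[eE][+-]?\d+)?|\d+[eE][+-]?\d+)")

def read_number(token):
    """Return an int or float for a numeric token, or None otherwise."""
    if INTEGER.fullmatch(token):
        return int(token)
    m = RADIX.fullmatch(token)
    if m:
        base = {"x": 16, "o": 8, "b": 2}[m.group(1)]
        try:
            return int(m.group(2), base)
        except ValueError:       # a digit not valid in this base
            return None
    if FLOAT.fullmatch(token):
        return float(token)
    return None

for tok in ["123.", ".9E1", "#x-ABC", "#o777", "#b1110", "123E", "."]:
    print(tok, "->", read_number(tok))
```

In this model 123. is read as the float 123.0, in line with the dialect note above.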

6.21 Comments

Comments of the form @; were introduced earlier. Inside compound expressions, another convention for comments exists: Lisp comments, which are introduced by the ; (semicolon) character and span to the end of the line.

Example:

@(foo ; this is a comment
  bar ; this is another comment
  )

This is equivalent to @(foo bar).

7 DIRECTIVES

7.1 Overview

When a TXR Lisp compound expression occurs in TXR preceded by a @, it is a directive.

Directives which are based on certain symbols are, additionally, involved in a phrase-structure syntax which uses Lisp expressions as if they were tokens.

For instance, the directive

@(collect)



not only denotes a compound expression with the collect symbol in its head position, but it also introduces a syntactic phrase which requires a matching @(end) directive. In other words, @(collect) is not only an expression, but serves as a kind of token in a higher level phrase structure grammar.

Effectively, collect is a reserved symbol in the TXR language. A TXR program cannot use this symbol as the name of a pattern function, due to its role in the syntax. The symbol has no reserved role in TXR Lisp.

Usually if this type of directive occurs alone in a line, not preceded or followed by other material, it is involved in a "vertical" (or line oriented) syntax.

If such a directive is embedded in a line (has preceding or trailing material) then it is in a horizontal syntactic and semantic context (character-oriented).

There is an exception: the definition of a horizontal function looks like this:

@(define name (arg))body material@(end)

Yet, this is considered one vertical item, which means that it does not match a line of data. (This is necessary because all horizontal syntax matches something within a line of data, which is undesirable for definitions.)

Many directives exhibit both horizontal and vertical syntax, with different but closely related semantics. A few are vertical only, and some are horizontal only.

A summary of the available directives follows:

@(eof)
Explicitly match the end of file. Fails if unmatched data remains in the input stream.

@(eol)
Explicitly match the end of line. Fails if the current position is not the end of a line. Also fails if no data remains (there is no current line).

@(next)
Continue matching in another file or other data source.

@(block)
Groups together a sequence of directives into a logical named block, which can be explicitly terminated from within using the @(accept) and @(fail) directives. Blocks are described in the section Blocks below.

@(skip)
Treat the remaining query as a subquery unit, and search the lines (or characters) of the input file until that subquery matches somewhere. A skip is also an anonymous block.

@(trailer)
Treat the remaining query or subquery as a match for a trailing context. That is to say, if the remainder matches, the data position is not advanced.



@(freeform)
Treat the remainder of the input as one big string, and apply the following query line to that string. The newline characters (or custom separators) appear explicitly in that string.

@(fuzz)
The fuzz directive, inspired by the patch utility, specifies a partial match for some lines.

@(line) and @(chr)
These directives match a variable or expression against the current line number or character position.

@(name)
Match a variable against the name of the current data source.

@(data)
Match a variable against the remaining data (lazy list of strings).

@(some)
Multiple clauses are each applied to the same input. Succeeds if at least one of the clauses matches the input. The bindings established by earlier successful clauses are visible to the later clauses.

@(all)
Multiple clauses are applied to the same input. Succeeds if and only if each one of the clauses matches. The clauses are applied in sequence, and evaluation stops on the first failure. The bindings established by earlier successful clauses are visible to the later clauses.

@(none)
Multiple clauses are applied to the same input. Succeeds if and only if none of them match. The clauses are applied in sequence, and evaluation stops on the first success. No bindings are ever produced by this construct.

@(maybe)
Multiple clauses are applied to the same input. No failure occurs if none of them match. The bindings established by earlier successful clauses are visible to the later clauses.

@(cases)
Multiple clauses are applied to the same input. Evaluation stops on the first successful clause.

@(require)
The require directive is similar to the do directive in that it evaluates one or more TXR Lisp expressions. If the result of the rightmost expression is nil, then require triggers a match failure. See the TXR LISP section far below.

@(if), @(elif), and @(else)
The if directive with optional elif and else clauses allows one of multiple bodies of pattern matching directives to be conditionally selected by testing the values of Lisp expressions.

@(choose)
Multiple clauses are applied to the same input. The one whose effect persists is the one which maximizes or minimizes the length of a particular variable.



@(empty)
The @(empty) directive matches the empty string. It is useful in certain situations, such as expressing an empty match in a directive that doesn’t accept an empty clause. The @(empty) syntax has another meaning in @(output) clauses, in conjunction with @(repeat).

@(define name (args ...))
Introduces a function. Functions are described in the Functions section below.

@(call expr args*)
Performs function indirection. Evaluates expr, which must produce a symbol that names a pattern function. Then that pattern function is invoked.

@(gather)
Searches text for matches for multiple clauses which may occur in arbitrary order. For convenience, lines of the first clause are treated as separate clauses.

@(collect)
Search the data for multiple matches of a clause. Collect the bindings in the clause into lists, which are output as array variables. The @(collect) directive is line oriented. It works with a multi-line pattern and scans line by line. A similar directive called @(coll) works within one line. A collect is an anonymous block.

@(and)
Separator of clauses for @(some), @(all), @(none), @(maybe) and @(cases). Equivalent to @(or). The choice is stylistic.

@(or)
Separator of clauses for @(some), @(all), @(none), @(maybe) and @(cases). Equivalent to @(and). The choice is stylistic.

@(end)
Required terminator for @(some), @(all), @(none), @(maybe), @(cases), @(if), @(collect), @(coll), @(output), @(repeat), @(rep), @(try), @(block) and @(define).

@(fail)
Terminate the processing of a block, as if it were a failed match. Blocks are described in the section Blocks below.

@(accept)
Terminate the processing of a block, as if it were a successful match. What bindings emerge may depend on the kind of block: collect has special semantics. Blocks are described in the section Blocks below.

@(try)
Indicates the start of a try block, which is related to exception handling, described in the Exceptions section below.



@(catch) and @(finally)
Special clauses within @(try). See Exceptions below.

@(defex) and @(throw)
Define custom exception types; throw an exception. See Exceptions below.

@(assert)
The assert directive requires the following material to match, otherwise it throws an exception. It is useful for catching mistakes or omissions in parts of a query that are sure-fire matches.

@(flatten)
Normalizes a set of specified variables to one-dimensional lists. Those variables which have scalar value are reduced to lists of that value. Those which are lists of lists (to an
