.TH WLEX 1

.SH NAME
wlex \- OCaml lexer generator for large encodings

.SH SYNOPSIS
.B wlex
.I filename.mll
[
.BI \-cf \ class-file
]

.SH DESCRIPTION

.BR wlex (1)
is a lexer generator for OCaml derived from ocamllex.
The lexer architecture of wlex adds an extra layer (classification)
between the lexbuf and the lexer. This layer extracts "character
classes" from the lexbuf, and the lexer itself works with these classes,
not directly with characters.

Usually, the number of classes is small (far fewer than 256), and the
classification may consume more than one byte to produce the next
class. This allows efficient lexing of wide-character encodings
such as UTF-8 (the main motivation for wlex).

Classes form a partition of accepted characters.
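.PP
As a minimal sketch of the classification layer (not the actual wlex
engine; every name below is hypothetical), the following OCaml code
consumes one UTF-8 sequence of one to four bytes and returns a single
class. It assumes well-formed input and uses an illustrative set of
three classes plus eof.
.nf

```ocaml
(* Hypothetical class numbers; 0 is reserved for eof. *)
let cls_eof = 0
let cls_letter = 1
let cls_digit = 2
let cls_other = 3

(* Decode one UTF-8 code point of [s] starting at [pos].
   Returns the code point and the number of bytes consumed.
   Assumes well-formed input; no validation is done. *)
let decode s pos =
  let b i = Char.code s.[pos + i] in
  let cont i = b i land 0x3f in
  let b0 = b 0 in
  if b0 < 0x80 then (b0, 1)
  else if b0 < 0xe0 then (((b0 land 0x1f) lsl 6) lor cont 1, 2)
  else if b0 < 0xf0 then
    (((b0 land 0x0f) lsl 12) lor (cont 1 lsl 6) lor cont 2, 3)
  else
    (((b0 land 0x07) lsl 18) lor (cont 1 lsl 12)
     lor (cont 2 lsl 6) lor cont 3, 4)

(* Map a code point to exactly one class, so that the classes
   partition the accepted characters. The ranges are illustrative:
   ASCII letters and digits, plus Latin letters up to U+024F. *)
let class_of_code c =
  if (c >= 0x41 && c <= 0x5a) || (c >= 0x61 && c <= 0x7a)
     || (c >= 0xc0 && c <= 0x24f && c <> 0xd7 && c <> 0xf7)
  then cls_letter
  else if c >= 0x30 && c <= 0x39 then cls_digit
  else cls_other

(* Return the next class and the position after it. *)
let next_class s pos =
  if pos >= String.length s then (cls_eof, pos)
  else
    let (c, n) = decode s pos in
    (class_of_code c, pos + n)
```

.fi
An automaton driven by next_class sees only four possible inputs,
whatever the number of bytes behind each of them.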

Running
.BR wlex (1)
on the input file
.IR lexer \&.mll
produces OCaml code for a lexical analyzer in the file
.IR lexer \&.ml.
By default, class definitions are also written to the
.IR lexer \&.ml
file. This output can be redirected to
another file with the
.B \-cf
option.

A typical example is to group all the letters into a class "letter",
all the digits into a class "digit", and so on. The regexps in the
lexer specification (the .mll file) are built on classes. For
instance:

let ident = letter (letter | digit)*

In some cases, it is necessary to change the design of the lexer.
For instance, it is a good idea to have a single rule for identifiers
and keywords, with the distinction between them made in the semantic
action of the rule. The alternative, declaring every
character that appears in a keyword as its own class, is a very bad idea.
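.PP
The keyword test done in the semantic action can be sketched as a
table lookup. This is only an illustration: the token type, the
keyword list and the function name are assumptions, not part of wlex.
.nf

```ocaml
(* Hypothetical token type for the example. *)
type token = IF | THEN | ELSE | IDENT of string

(* Keyword table, filled once at start-up. *)
let keywords : (string, token) Hashtbl.t = Hashtbl.create 16

let () =
  List.iter (fun (k, t) -> Hashtbl.add keywords k t)
    [ ("if", IF); ("then", THEN); ("else", ELSE) ]

(* Called from the action of the single identifier rule with the
   matched text: keywords map to their own token, anything else
   becomes an identifier. *)
let ident_or_keyword s =
  try Hashtbl.find keywords s
  with Not_found -> IDENT s
```

.fi
With this design the lexer needs only the "letter" and "digit"
classes, however many keywords the language has.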


During lexing, the classification is handled by an "engine".
Some generic engines are provided (a C implementation for speed,
and an ML implementation if you want pure bytecode), especially to
support UTF-8. Working with such an encoding in ocamllex
would introduce a *lot* of "waiting" states and a *lot* of
duplicated transitions in the automaton; avoiding these is the
motivation for wlex.


Another small modification from ocamllex is that lexer entry points
can take parameters. The mandatory parameters are the lexbuf and the
engine.

.SH LEXER SPECIFICATION

The syntax of
.BR ocamllex (1) 
files is modified as follows:

- before the header, there is a new section which declares classes.
  It starts with the keyword *classes*, followed by class
  declarations. An ident declares a class with this name.
  A literal character 'x' declares a class named char_ff,
  where ff is the hexadecimal code of x.
  A literal string "xyz" is equivalent to 'x' 'y' 'z'.
 
  Classes are assigned sequential numbers, starting with 1.
  Class 0 is predefined as eof.

- entry points accept extra arguments. Example:
  rule token arg1 arg2 = ....

- in a regexp, "_" means any class;
  an ident is interpreted as a regexp or as a class name
 
- in a "[ ... ]" regexp, the dash is forbidden; a literal char 'x' or
  an ident references the corresponding class, which must be declared;
  a string "xyz" is equivalent to 'x' 'y' 'z'
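.PP
The naming and numbering rules for class declarations given in the
first item above can be modelled in a few lines of OCaml. This is a
sketch of the rules as stated here, not the code used by wlex: the
decl type and the function names are hypothetical, and a two-digit
lowercase hexadecimal code is assumed for char_ff.
.nf

```ocaml
(* Hypothetical model of one class declaration. *)
type decl =
  | Ident of string  (* a named class *)
  | Char of char     (* a literal character *)
  | Str of string    (* a literal string *)

(* Class name for a literal character, e.g. char_78 for x.
   Assumes two lowercase hexadecimal digits. *)
let name_of_char c = Printf.sprintf "char_%02x" (Char.code c)

(* A string declares one class per character, in order. *)
let chars_of_string s = List.init (String.length s) (String.get s)

let expand = function
  | Ident n -> [n]
  | Char c -> [name_of_char c]
  | Str s -> List.map name_of_char (chars_of_string s)

(* Class 0 is eof; declared classes are numbered from 1 in
   declaration order. Duplicate declarations are not checked. *)
let number decls =
  let names = List.concat (List.map expand decls) in
  ("eof", 0) :: List.mapi (fun i n -> (n, i + 1)) names
```

.fi
For the declarations letter 'x' "ab", this model yields eof = 0,
letter = 1, char_78 = 2, char_61 = 3 and char_62 = 4.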

.SH SEE ALSO
.BR ocamllex (1),
.BR ocamlyacc (1).
.br
.I The Objective Caml user's manual,
chapter "Lexer and parser generators".
