 |
www.angelfire.com/dragon/letstry
cwave04 at yahoo dot com |

|
|
|
More lexical analysis
The comment-stripping example that we did in the last page is a special
case of lexical analysis. In general, a text file consists of a sequence
of characters. Thus, the statement
if(flag == 0)
b = 67.5;
consists of the following sequence of characters:
i, f, (, f, l, a, g, =,
=, <blank>, 0, ), <newline>,
<tab>, b, =, 6, 7, ., 5, ;
But surely we do not want to think of the code snippet as merely a list of
21 characters. We want to think of the code as made up of the following
10 "high-level" building blocks:
if ( flag
== 0 )
b = 67.5 ;
In particular we want to
ignore white-space characters like <blank> etc. However, we do
need the white-space characters to delimit the other entities, since
if (a == 0) then b = 9
is not the same as
if (a == 0) thenb = 9
collect the different tokens and know which token is
which. For instance in ForTran we have a token then,
which has a special meaning in the language. But in C it is not a
reserved word. So in C the token then will be treated
a variable name (or identifier).
The analysis of a text-file to achieve these two ends is called Lexical
Analysis. We shall consider one simple example of how to recognize
identifiers in a program file.
What is an identifier? It is the name of any variable or function or
any other object defined by the user. Thus in the following code snippet
void updateFile(str,FILE *f)
{
if(!f)
{
fprintf(stderr,"Invalid file\n");
exit(1);
}
else
{
fprintf(f,"%s\n",str);
}
}
we have 6 identifiers :
updateFile
str
f
fprintf
stderr
exit
Most programming language textbooks specify the permissible identifiers
as follows. They first list all the reserved words in the language. Then
they say that any string other than these, starting with an alphbetic
character (a to z or A to Z or _) and followed by any number of alphanumeric
characters is an identifier. Some languages also put a bound on the length
of the identifiers.
We shall see how to extract identifier names from a preogram files using a
finite automaton. Our description will be deliberately informal, so that
the reader is led to draw the transition diagram himself or herself.
Finite automaton to extract identifiers
The input set is of size three. The first input occurs when we read any
alphabetic character (one may use the isalpha() function in
C). The second input corresponds to reading any digit character, the third
input means that the character is neither alphabetic nor numeric.
We shall need two states:
Outside an identifier
After reading the first character of an identifier
We shall not specify the output set. The reader is to fill in that blank with
his or her programming common sense. When the automaton is in state 1, it
continues to remain there until it meets an input of type 1. Then it moves
into state 2. It remains in that state until it gets an input of type 3,
which switches the automaton back to state 1.
Finite automata as Recognizers
The lexical analysis example leads to one concept of great practical and
theoretical importance, viz., language recognition. In finite automata
nomenclature the word language is defined as follows.
Suppose we have a finite (nonempty) set of symbols. This set is called our
alphabet, D. Each symbol is called a letter. Consider the
set D* consisting of all (finite) strings of letters (of whatever length,
including 0 length). The zero-length string is just 'nothing', and is
included in the set for some technical convenience. Each member of D* is
called a word. Here is an example.
If D is the English alphabet (lower case, upper case and numerals) then
here are some members of D*
a b who 456 robo89 45typ MyGOODNESS
By a language, L, we mean a subset of D*. Thus, in the English example,
the set of all grammatically correct words form a language (which is our
familiar English language.)
By language recognition we mean contructing some gadget which will take
an element of D* and tell us whether it belongs to L or not. Finite
automata (and variants thereof) are highly suited to this purpose. We
shall discuss below finite automata for recognizing some languages.
In the comment-stripping example from the last page our alphabet consisted
of all the allowable characters in a C program. The words were code
snippets. Our language consisted of comments ie strings starting with
/*, ending with */, and having in the middle any
nonnegative number letters such that there is no */ in it.
In the identifier extracting example we had again the same alphabet. But
this time our language consisted of words that started with an alphabetic
character, and then had any nonnegative number of alphanumeric
characters.
Consider the similarity between the two examples. In both cases, the
language specification consists of specifying part of the words concretely
(eg, must start with /* or alphabetic letter) and the rest of
it is described as any nonnegative number of some partical type of
letters. Any language that is specified by these two methods can be easily
recognized by a finite automaton. If the reader has followed the last two
examples (comment-stripping and identifier-extracting) well then
constructing a recognizer for any such language should be easy. We take
up this issue below with some rigor.
Details
Let D be our alphabet. The language to be recognized is L, a subset of
D*.
An automaton M for recognizing this language has input set consisting of all
the letters in D alongwith one more special symbol (call it EOF).
The output set is of size three. The outputs are : "yes" , "no" and
"nothing". The last output is just to allow the automaton to refrain from
showing any visible output. We shall say that M recognizes L if for any
input string of the form d EOF, where d is a word in D*, the
automaton produces the output "yes" finally if and only if d is a word
in L. We do not care about the
outputs it produces when it is somewhere halfway through the input string.
It is not true that any language L can be recognized by a finite
automaton. The languages that can be recognized by a finite automaton are
called regular expressions.
May be you already have an idea about what regular
expressions are. They are used widely in editors and search
engines. Here are a few examples:
grep/egrep: In Unix envioronments, one uses the grep or
egrep utility to search inside files. Thus, for instance, to see
if a file named foo.txt has any of the words "romeo" and "Romeo",
we use the command
$ grep [rR]omeo foo.txt
Here the [rR] mean "either r or R". In general, square brackets in a
regular expression denote one of the letters enclosed.
To see if foo.txt contains
any word made up of only the letters "a", "g", "t" and "c"
the command is
$ grep ([agtc])+ foo.txt
The symbol (...)+ in a regular expression means any positive number of
repetition of the thing inside the parantheses, in this case [agtc].
The C identifiers considered above are of the form
[a-z]([0-9a-z])*
Here [a-z] is an abbreviation for [abc...wxyz], and similarly for
[0-9a-z]. The (...)* means any nonnegative number of
repetitions of the thing inside the parantheses.
It should not be too hard for the reader to adapt the argument given for
recognizing C identifiers to recognize any regular expression. We
shall not go into the details here, since more interesting things are
awaiting us in the next page!
|