Formal Languages and Compilation (3rd Edition)

This classroom-tested and clearly written textbook presents a focused guide to the conceptual foundations of compilation, explaining the fundamental principles and algorithms used for defining the syntax of languages and for implementing simple translators.

This significantly updated and expanded third edition has been enhanced with additional coverage of regular expressions, visibly pushdown languages, bottom-up and top-down deterministic parsing algorithms, and new grammar models.

Topics and features:
- Describes the principles and methods used in designing syntax-directed applications such as parsing and regular expression matching
- Covers translations, semantic functions (attribute grammars), and static program analysis by data flow equations
- Introduces an efficient method for string matching and parsing suitable for ambiguous regular expressions (NEW)
- Presents a focus on extended BNF grammars with their general parser and with LR(1) and LL(1) parsers (NEW)
- Introduces a parallel parsing algorithm that exploits multiple processing threads to speed up syntax analysis of large files
- Discusses recent formal models of input-driven automata and languages (NEW)
- Includes extensive use of theoretical models of automata, transducers and formal grammars, and describes all algorithms in pseudocode
- Contains numerous illustrative examples, and supplies a large set of exercises with solutions at an associated website

Advanced undergraduate and graduate students of computer science will find this reader-friendly textbook to be an invaluable guide to the essential concepts of syntax-directed compilation. The fundamental paradigms of language structures are elegantly explained in terms of the underlying theory, without requiring the use of software tools or knowledge of implementation, and through algorithms simple enough to be practiced by paper and pencil.
Year: 2019
Edition: 3rd
Publisher: Springer
Language: English
Pages: 499 / 509
ISBN 13: 978-3-030-04879-2
Series: Texts in Computer Science
File: PDF, 9.35 MB

Texts in Computer Science

Stefano Crespi Reghizzi
Luca Breveglieri
Angelo Morzenti

Formal
Languages
and Compilation
Third Edition

Texts in Computer Science
Series Editors
David Gries, Department of Computer Science, Cornell University, Ithaca, NY,
USA
Orit Hazzan, Faculty of Education in Technology and Science, Technion—Israel
Institute of Technology, Haifa, Israel

More information about this series at http://www.springer.com/series/3191

Stefano Crespi Reghizzi •
Luca Breveglieri • Angelo Morzenti

Formal Languages
and Compilation
Third Edition


Stefano Crespi Reghizzi
Dipartimento di Elettronica, Informazione e
Bioingegneria
Politecnico di Milano
Milan, Italy

Luca Breveglieri
Dipartimento di Elettronica, Informazione e
Bioingegneria
Politecnico di Milano
Milan, Italy

Angelo Morzenti
Dipartimento di Elettronica, Informazione e
Bioingegneria
Politecnico di Milano
Milan, Italy

ISSN 1868-0941
ISSN 1868-095X (electronic)
Texts in Computer Science
ISBN 978-3-030-04878-5
ISBN 978-3-030-04879-2 (eBook)
https://doi.org/10.1007/978-3-030-04879-2
Library of Congress Control Number: 2018965427
1st & 2nd editions: © Springer-Verlag London 2009, 2013
3rd edition: © Springer Nature Switzerland AG 2019
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made. The publisher remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

This third edition updates, enriches, and revises the 2013 printing [1], without altering the original structure of the book. The selection of materials and the presentation style reflect many years of teaching compiler courses and of doing research on formal language theory and formal methods, on compiler and language design, and to a lesser extent on natural language processing. In the turmoil of information technology developments, the subject of the book has kept the same fundamental principles for half a century, while preserving its conceptual importance and practical relevance.
This state of affairs of a topic, which is central to computer science and information processing and is based on established principles, might lead some people to
believe that the corresponding textbooks are by now consolidated, much as the
classical books on calculus or physics. In reality, this is not the case: there exist fine classical books on the mathematical aspects of language and automata theory but, as concerns their application to compiling and data representation, the best books are in essence encyclopedias of algorithms, design methods, and practical tricks used in
compiler design. Indeed, a compiler is a microcosm, and it features many different
aspects ranging from algorithmic wisdom to computer hardware, system software,
and network interfaces. As a consequence, the textbooks have grown in size and
compete with respect to their coverage of the latest developments in programming
languages, processor and network architectures and clever mappings from the
former to the latter.
To put things into order, it is better to separate such complex topics into two parts, which we may call basic or fundamental, and advanced or technological; to some extent they correspond to the two subsystems that make up a compiler: the front-end, specific to the user-language level, and the back-end, specific to the machine level. The
basic part is the subject of this book; it covers the principles and algorithms to be
used for defining the syntax of languages and for implementing simple translators.
It does not include: the specialized know-how needed for various classes of programming languages (procedural, functional, object oriented, etc.), the computer
architecture related aspects, and the optimization methods used to improve the
performance of both the compilation process and the executable code produced.
In other textbooks on compilers, the bias toward practical aspects has reduced
the attention to fundamental concepts. This has prevented their authors from taking advantage of the improvements and simplifications made possible by decades of
both extensive use and theoretical polishing, to avoid the irritating variants and
repetitions that are found in the historical fundamental papers. Moving from these
premises, we decided to present, in a simple minimalist way, the principles and
methods currently used in designing syntax-directed applications such as parsing
and regular expression matching. Thus, we have substantially improved the standard presentation of parsing algorithms for both regular expressions and grammars,
by unifying the concepts and notations used in various approaches, and by
extending method coverage with a reduced definitional apparatus. An example that
expert readers should appreciate is the unification of top-down and bottom-up
parsing algorithms within a new rigorous yet practical framework.
In this way, the reduced effort and space needed to cover classical concepts has
made room for new advanced methods typically not present in similar textbooks.
First, we have introduced in this edition a new efficient algorithm for string
matching using regular expressions, which are increasingly important in such
applications as Web browsing and data filtering. Second, our presentation of
parsing algorithms focuses on the more expressive extended BNF grammars, which
are the de facto standard in language reference manuals. Third, we present a parallel
parsing algorithm that takes advantage of the many processing units of modern
multi-core devices to speed up the syntax analysis of large files. Finally, we mention the addition in this edition of the recent formal models of input-driven automata and languages (also known as the visibly pushdown model), which suit the mark-up languages (such as XML and JSON) and have also been developed in other domains to formally verify critical systems.
The presentation is illustrated by many small yet realistic examples, to ease the
understanding of the theory and the transfer to application. Theoretical models of
automata, transducers, and formal grammars are extensively used, whenever
practical motivations exist. Algorithms are described in pseudo-code to avoid the distracting details of a programming language; yet they are straightforward to convert into executable procedures and, for some, we have made their code available on the Web.
The book is not restricted to syntax. The study of translations, semantic functions (attribute grammars), and static program analysis by data flow equations
enlarges the understanding of the compilation process and covers the essentials of a
syntax-directed translator.
A large collection of problems and solutions is available at the authors’ course
site.
This book should be welcomed by those willing to teach or learn the essential
concepts of syntax-directed compilation, without needing to rely on software tools
and implementations. We believe that learning by doing is not always the best
approach and that an overcommitment to practical work may sometimes obscure
the conceptual foundations. In the case of formal languages, the elegance and
essentiality of the underlying theory allows students to acquire the fundamental
paradigms of language structures, to avoid pitfalls such as ambiguity, and to adequately map structure to meaning. In this field, many if not all relevant algorithms are simple enough to be practiced by paper and pencil. Of course, students should
be encouraged to enroll in a hands-on laboratory and to experiment with syntax-directed
tools (like flex and bison) on realistic cases.
The authors thank their colleagues Alessandro Barenghi and Marcello Bersani,
who have taught compilers to computer engineering students by using this book.
The first author remembers the late Antonio Grasselli, the computer science
pioneer who first fascinated him with the subject of this book, combining linguistic,
mathematical, and technological aspects.
Milan, Italy
April 2019

Stefano Crespi Reghizzi
Luca Breveglieri
Angelo Morzenti

Reference
1. Crespi Reghizzi S, Breveglieri L, Morzenti A (2013) Formal languages and compilation, 2nd
edn. Springer

Contents

1 Introduction
  1.1 Intended Scope and Readership
  1.2 Compiler Parts and Corresponding Concepts

2 Syntax
  2.1 Introduction
    2.1.1 Artificial and Formal Languages
    2.1.2 Language Types
    2.1.3 Chapter Outline
  2.2 Formal Language Theory
    2.2.1 Alphabet and Language
    2.2.2 Language Operations
    2.2.3 Set Operations
    2.2.4 Star and Cross
    2.2.5 Language Quotient
  2.3 Regular Expressions and Languages
    2.3.1 Definition of Regular Expression
    2.3.2 Derivation and Language
    2.3.3 Other Operators
    2.3.4 Closure Properties of Family REG
  2.4 Linguistic Abstraction: Abstract and Concrete Lists
    2.4.1 Lists with Separators and Opening/Closing Marks
    2.4.2 Language Substitution
    2.4.3 Hierarchical or Precedence Lists
  2.5 Context-Free Generative Grammars
    2.5.1 Limits of Regular Languages
    2.5.2 Introduction to Context-Free Grammars
    2.5.3 Conventional Grammar Representations
    2.5.4 Derivation and Language Generation
    2.5.5 Erroneous Grammars and Useless Rules
    2.5.6 Recursion and Language Infinity
    2.5.7 Syntax Tree and Canonical Derivation
    2.5.8 Parenthesis Languages
    2.5.9 Regular Composition of Context-Free Languages
    2.5.10 Ambiguity in Grammars
    2.5.11 Catalogue of Ambiguous Forms and Remedies
    2.5.12 Weak and Structural Equivalence
    2.5.13 Grammar Transformations and Normal Forms
  2.6 Grammars of Regular Languages
    2.6.1 From Regular Expression to Grammar
    2.6.2 Linear and Unilinear Grammars
    2.6.3 Linear Language Equations
  2.7 Comparison of Regular and Context-Free Languages
    2.7.1 Limits of Context-Free Languages
    2.7.2 Closure Properties of Families REG and CF
    2.7.3 Alphabetic Transformations
    2.7.4 Grammars with Regular Expressions
  2.8 More General Grammars and Language Families
    2.8.1 Chomsky Classification of Grammars
    2.8.2 Further Developments and Final Remarks
  References

3 Finite Automata as Regular Language Recognizers
  3.1 Introduction
  3.2 Recognition Algorithms and Automata
    3.2.1 A General Automaton
  3.3 Introduction to Finite Automata
  3.4 Deterministic Finite Automata
    3.4.1 Error State and Total Automaton
    3.4.2 Clean Automaton
    3.4.3 Minimal Automaton
    3.4.4 From Automaton to Grammar
  3.5 Nondeterministic Automata
    3.5.1 Motivation of Nondeterminism
    3.5.2 Nondeterministic Recognizers
    3.5.3 Automata with Spontaneous Moves
    3.5.4 Correspondence Between Automata and Grammars
    3.5.5 Ambiguity of Automata
    3.5.6 Left-Linear Grammars and Automata
  3.6 From Automaton to Regular Expression: BMC Method
  3.7 Elimination of Nondeterminism
    3.7.1 Elimination of Spontaneous Moves
    3.7.2 Construction of Accessible Subsets
  3.8 From Regular Expression to Recognizer
    3.8.1 Thompson Structural Method
    3.8.2 Berry–Sethi Method
  3.9 String Matching of Ambiguous Regular Expressions
    3.9.1 Trees for Ambiguous r.e. and Their Phrases
    3.9.2 Local Recognizer of Linearized Syntax Trees
    3.9.3 The Berry–Sethi Parser
  3.10 Expressions with Complement and Intersection
    3.10.1 Product of Automata
  3.11 Summary of the Relations Between Regular Expressions, Grammars and Automata
  References

4 Pushdown Automata and Parsing
  4.1 Introduction
  4.2 Pushdown Automaton
    4.2.1 From Grammar to Pushdown Automaton
    4.2.2 Definition of Pushdown Automaton
    4.2.3 One Family for Context-Free Languages and Pushdown Automata
    4.2.4 Intersection of Regular and Context-Free Languages
  4.3 Deterministic Pushdown Automata and Languages
    4.3.1 Closure Properties of Deterministic Languages
    4.3.2 Nondeterministic Languages
    4.3.3 Determinism and Language Unambiguity
    4.3.4 Subclasses of Deterministic Pushdown Automata and Languages
  4.4 Syntax Analysis: Top-Down and Bottom-Up Constructions
  4.5 Grammar as Network of Finite Automata
    4.5.1 Syntax Charts
    4.5.2 Derivation for Machine Nets
    4.5.3 Initial and Look-Ahead Characters
  4.6 Bottom-Up Deterministic Analysis
    4.6.1 From Finite Recognizers to Bottom-Up Parser
    4.6.2 Construction of ELR(1) Parsers
    4.6.3 ELR(1) Condition
    4.6.4 Simplifications for BNF Grammars
    4.6.5 Parser Implementation Using a Vector Stack
    4.6.6 Lengthening the Look-Ahead
  4.7 Deterministic Top-Down Parsing
    4.7.1 ELL(1) Condition
    4.7.2 Step-by-Step Derivation of ELL(1) Parsers
    4.7.3 Direct Construction of Top-Down Predictive Parsers
    4.7.4 A Graphical Method for Computing Guide Sets
    4.7.5 Increasing Look-Ahead in Top-Down Parsers
  4.8 Operator-Precedence Grammars and Parallel Parsing
    4.8.1 Floyd Operator-Precedence Grammars and Parsers
    4.8.2 Sequential Operator-Precedence Parser
    4.8.3 Comparisons and Closure Properties
    4.8.4 Parallel Parsing Algorithm
  4.9 Deterministic Language Families: A Comparison
  4.10 Discussion of Parsing Methods
  4.11 A General Parsing Algorithm
    4.11.1 Introductory Presentation
    4.11.2 Earley Algorithm
    4.11.3 Syntax Tree Construction
  4.12 Managing Syntactic Errors and Changes
    4.12.1 Errors
    4.12.2 Incremental Parsing
  References

5 Translation Semantics and Static Analysis
  5.1 Introduction
  5.2 Translation Relation and Function
  5.3 Transliteration
  5.4 Purely Syntactic Translation
    5.4.1 Infix and Polish Notation
    5.4.2 Ambiguity of Source Grammar and Translation
    5.4.3 Translation Grammar and Pushdown Transducer
    5.4.4 Syntax Analysis with Online Translation
    5.4.5 Top-Down Deterministic Translation by Recursive Procedures
    5.4.6 Bottom-Up Deterministic Translation
  5.5 Regular Translation
    5.5.1 Two-Input Automaton
    5.5.2 Translation Function and Finite Transducer
    5.5.3 Closure Properties of Translation
  5.6 Semantic Translation
    5.6.1 Attribute Grammars
    5.6.2 Left and Right Attributes
    5.6.3 Definition of Attribute Grammar
    5.6.4 Dependence Graph and Attribute Evaluation
    5.6.5 One-Sweep Semantic Evaluation
    5.6.6 Other Evaluation Methods
    5.6.7 Combined Syntax and Semantic Analysis
    5.6.8 Typical Applications of Attribute Grammar
  5.7 Static Program Analysis
    5.7.1 A Program as an Automaton
    5.7.2 Liveness Intervals of a Variable
    5.7.3 Reaching Definition
  References

Index

1 Introduction

1.1 Intended Scope and Readership
The information technology revolution was made possible by the invention of electronic digital machines, yet without programming languages their use would have
been restricted to the few people able to write binary machine code. A programming
language as a text contains features coming both from human languages and from
mathematical logic. The translation from a programming language to machine code
is known as compilation.1 Language compilation is a very complex process, which
would be impossible to master without systematic design methods. Such methods
and their theoretical foundations are the subject of this book. They make up a consistent and largely consolidated body of concepts and algorithms, which are applied not just in compilers, but also in other fields. Automata theory is pervasive in all the branches of informatics, as it models situations or phenomena classifiable as discrete-time and discrete-space systems. Formal grammars, on the other hand, originated in linguistic research and, together with automata, they are widely applied in document processing and in data representation and searching, in particular for the Web.
Coming to the prerequisites, the reader should have a good background in programming, though a detailed knowledge of a specific programming language is not
required, because our presentation of algorithms relies on self-explanatory pseudocode. The reader is expected to be familiar with basic mathematical theories and
notations, namely from set theory, algebra and logic. The above prerequisites are
typically met by computer science/engineering or mathematics students with a university education of two or more years.

1. This term may sound strange; it originates in the early approach to the compilation of correspondence tables between a command in the language and a series of machine operations.

The selection of topics and the presentation based on rigorous definitions and
algorithms illustrated by many motivating examples qualifies the book for a university
course, aiming to expose students to the importance of good theories and of efficient
algorithms for designing effective systems. In our experience some fifty hours of
lecture suffice to cover the entire book.
The authors’ long experience in teaching the subject to different audiences brings
out the importance of combining theoretical concepts and examples. Moreover, it
is advisable that the students take advantage of well-known and documented software tools (such as the classical Flex and Bison generators of lexical and syntactic analyzers), to implement and experiment with the main algorithms on realistic case studies.
With regard to the reach and limits, the book covers the essential concepts and methods needed to design simple translators based on the very common syntax-directed paradigm. Such systems analyze a source text, written in a formally specified language, into syntactic units, and translate it into a different target representation.
The translation function is nicely formulated as the composition of simpler functions
keyed to the syntactical units of the source text.
It goes without saying that a real compiler for a programming language includes
other technological aspects and know-how, in particular related to processor and
computer architecture, which are not covered. Such know-how is essential for automatically translating a program into machine instructions and for transforming a
program in order to make the best use of the computational resources of a computer. The study of methods for program transformation and optimization is a more
advanced topic, which follows the present introduction to compiler methods. The
next section outlines the contents of the book.

1.2 Compiler Parts and Corresponding Concepts
There are two external interfaces to a compiler: the source language to be analyzed
and translated and the target language produced by the translator.
Chapter 2 describes the so-called syntactic methods that are universally adopted in
order to provide a rigorous definition of the texts (or character strings) written in the
source language. The methods to be presented are regular expressions and context-free (BNF) grammars. To avoid the pitfall of ambiguous definitions, we examine the
relevant causes of ambiguity in the regular expressions and in the grammars. Useful
transformations from regular expressions to grammars and between grammars of
different forms are covered, which are needed in the following chapters. In particular
we introduce the combination of regular expressions and grammars, which is the
central model for the parsing algorithms. Chapter 2 ends with a brief critical look at more powerful, but rarely applied, models: the Chomsky context-sensitive grammars and the recently proposed conjunctive grammars.
The first task of a compiler is to check the correctness of the source text, that is,
whether the text agrees with the syntactic definition of the source language. In order
to perform such a check, the algorithm scans the source text character by character, and at the end it rejects or accepts the input depending on the result of the analysis.
By a minimalist abstract approach, such recognition algorithms are conveniently
described as mathematical machines or automata, in the tradition of the well-known
Turing machine. Different automaton types correspond to the various families of
formal languages, and the types practically relevant for compilation are considered
in the following two chapters.
Chapter 3 covers finite automata, which are machines with a finite random access
memory. They are the recognizers of the languages defined by regular expressions.
Within compilation, they are used for lexical analysis or scanning, to extract from
the source text the keywords, numbers, identifiers and in general the pieces of text
that encode the lexical units of the language, called lexemes or tokens.
The basic models of finite automata, deterministic and nondeterministic, real-time and with spontaneous moves, are presented along with the transformations between
different models. The algorithms for converting a finite automaton into a regular
expression and conversely are well covered. The other main topics of Chap. 3 are: the
string matching problem for ambiguous regular expressions, and the use of Boolean
operations, intersection and complement, in the regular expressions.
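To make the connection with scanning concrete, here is a minimal hand-coded deterministic finite automaton; the token class (identifiers) and the state names are illustrative assumptions, not examples from the book.

```python
# A deterministic finite automaton (DFA) sketched directly in code.
# Illustrative example: it recognizes identifiers, i.e., a letter or
# underscore followed by letters, digits or underscores -- the kind of
# lexeme a scanner (lexical analyzer) extracts from the source text.
def is_identifier(s: str) -> bool:
    state = "start"                       # the initial state
    for ch in s:
        if state == "start":
            state = "ident" if (ch.isalpha() or ch == "_") else "sink"
        elif state == "ident":
            state = "ident" if (ch.isalnum() or ch == "_") else "sink"
        else:                             # the error (sink) state is a trap
            return False
    return state == "ident"              # "ident" is the only final state
```

The sink state plays the role of the error state of a total automaton: once entered, no input can leave it.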
Chapter 4 is devoted to the recognition problem for the languages defined by
context-free grammars, which is the backbone of syntax-directed translators and
justifies the considerable length of the chapter.
The chapter starts with the model of nondeterministic pushdown automata and
its direct correspondence to context-free grammars. The deterministic version of the
model follows, which defines the leading family of deterministic languages. As a
special case, the input-driven or visibly pushdown automata have been included,
because of their relevance for markup languages such as XML and JSON. A rich
choice of parsing algorithms is presented. First, there is a unified modern presentation
of the deterministic parsers, both bottom-up (or LR (1)) and top-down (or LL (1)),
for context-free grammars extended with regular languages. The third deterministic
parser is the operator-precedence method, which is here presented in the parallel
version suitable for multi-core computers. The last parsing method, due to Earley, is the
most general and accepts any context-free language. Chapter 4 includes comparisons
of the different language families suitable for each parsing method, a discussion of
grammar transformations, and the basic concepts on error handling by parsers.
The ultimate job of a compiler is to translate a source text into another language.
The compiler module responsible for finalizing the verification of the source language
rules and for producing the translation is called the semantic analyzer. It operates on
the structural representation, the syntax tree, produced by the parser.
The formal translation models and the methods used to implement semantic analyzers are illustrated in Chap. 5, where we describe two kinds of transformations.
First, the pure syntactic translations are modeled by transducers that are either finite
automata or pushdown automata extended with an output, and that compute a string.
Second, the semantic translations are performed by functions or methods that operate on the syntax tree of the source text and may produce as output not just a string,
but any sort of data values. Semantic translations are specified by a practical semiformal extension to context-free grammars, which is called attribute grammar. This
approach, by combining the accuracy of formal syntax and the flexibility of programming, conveniently expresses the typical analysis and transformation of syntax
trees that is required by compilers.
To give a concrete idea of compilation, typical simple examples are included:
the type consistency check between the variables declared and used in a programming language, the translation of high-level statements into machine instructions and
semantics-directed parsing.
Of course, compilers do much more than syntax-directed translation. Static program
analysis is an important example: it examines a program to determine, ahead of
execution, some of its properties, or to detect errors missed by syntactic and semantic
analysis. The purpose is to improve the robustness, reliability and efficiency
of the program. An example of error detection is the identification of uninitialized
variables used in arithmetic expressions. An example of code improvement is the
elimination of useless assignment statements.
Chapter 5 closes with an introduction to the static analysis of programs modeled by their control-flow graph, viewed as a finite automaton. Several interesting
problems can be formalized and statically analyzed by a common approach based
on flow equations and their solution by iterative approximations converging to the
least fixed point.
Last, the book comes with an up-to-date essential bibliography at the end of each
chapter and is accompanied by a detailed index.

2 Syntax

2.1 Introduction
2.1.1 Artificial and Formal Languages
Many centuries after the spontaneous emergence of natural language for human communication, mankind has purposively constructed other communication systems and
languages, to be called artificial, intended for very specific tasks. A few artificial languages, like the logical propositions of Aristotle or the music sheet notation of Guido d'Arezzo, are very ancient, but their number has exploded with the invention
of computers. Many artificial languages are intended for man–machine communication, to instruct a programmable machine to do some task: to perform a computation,
to prepare a document, to search a database or the Web, to control a robot, and so on.
Other languages serve as interfaces between devices or between software applications, e.g., PDF is a language that many document and picture processors read and
write.
Any designed language is artificial by definition, but not all artificial languages
are formalized: thus, a programming language like Java is formalized, but Esperanto,
although designed by man, is not so.
But how should we classify the language of genes or DNA, which deals with
natural biological systems though it has been invented by scientists? The structures
of molecular biology are three-dimensional, yet viewing them as RNA sequences has
opened the way to the methods of formal languages.1
1 For a discussion of the role of linguistic methods in molecular biology, see Searls [1].

© Springer Nature Switzerland AG 2019
S. Crespi Reghizzi et al., Formal Languages and Compilation,
Texts in Computer Science, https://doi.org/10.1007/978-3-030-04879-2_2

For a language to be formalized (or formal), the form of its sentences (or syntax)
and their meaning (or semantics) must be precisely and algorithmically defined. In
other words, it should be possible for a computer to check that the sentences are
grammatically correct, and to determine their meaning.
Meaning, however, is a difficult and controversial notion. For our purposes, the
meaning of a sentence can be taken to be the translation into another language, which
is known to the computer or to a well-established software system. For instance,
the meaning of a Java program is its translation into the machine language of the
computer executing the program.
In this book, the term formal language is used in a narrower sense, which excludes
semantics. In the field of syntax, a formal language is a mathematical structure,
defined on top of an alphabet, by means of certain axiomatic rules (formal grammar)
or by using abstract machines such as the famous one due to A. Turing. The notions
and methods of formal language are analogous to those used in number theory and
in logic.
Thus, formal language theory is primarily concerned with the form or syntax of
sentences, not with meaning, at least not directly. A string (or text) is either valid
or illegal, that is, it either belongs to the formal language or it does not. Such a
theory makes a first important step toward the ultimate goal: the study of language
translation and meaning, which will however require additional methods.

2.1.2 Language Types
In this book, a language is a one-dimensional communication medium, made by
sequences of symbolic elements of an alphabet, called terminal characters. Actually,
people often refer to language as other non-textual communication media, which are
more or less formalized by means of rules. Thus, iconic languages focus on road traffic signs or video display icons. Musical language is concerned with sounds, rhythm,
and harmony. Architects and industrial designers of buildings and things are interested in their spatial relations, which they describe as the language of design. Early
child drawings are often considered as sentences of a pictorial language, which can
be partially formalized in accordance with developmental psychology theories. Certainly, the formal approach to syntax to be developed has some interest for nontextual
languages too, which however are out of scope for this book.
Within computer science, the term language applies, as said, to a text made by
a set of characters orderly written from, say, left to right. In addition, the term is
used to refer to other discrete structures, such as graphs, trees, or arrays of pixels
describing a digital picture. Formal language theories have been proposed and used
more or less successfully also for such nontextual languages.2
Reverting to the main stream of textual languages, a frequent request directed to
the specialist is to define and specify an artificial language. The specification may
have several uses: as a language reference manual for future users, as an official

2 Just three examples and their references: graph grammars and languages [2], tree languages [3,4],

and picture (or two-dimensional) languages [5,6].

2.1 Introduction

7

standard definition, or as a contractual document for compiler designers to ensure
consistency of specification and implementation.
It is not an easy task to write a complete and rigorous definition of a language. Of
course, the exhaustive approach, to list all the possible sentences or phrases, is unfeasible because the possibilities are infinitely many, since the length of the sentences is
usually unbounded. As a native language speaker, a programmer is not constrained
by any strict limit on the length of the phrases he writes. The ensuing problem to
represent an infinite number of cases by a finite description can be addressed by an
enumeration procedure, as in logic. When executed, the procedure generates longer
and longer sentences, in an unending process if the language to be modeled is not
finite.
This chapter presents a simple and established manner to express the rules of the
enumeration procedure in the form of rules of a generative grammar (or syntax).

2.1.3 Chapter Outline
The chapter starts with the basic components of language theory: alphabet, string,
and operations, such as concatenation and repetition, on strings and sets of strings.
The first model presented is the regular expression, which defines the family
of regular languages. Then lists are introduced as a fundamental and pervasive
syntactic structure in all kinds of languages. From the exemplification of list variants,
the idea of linguistic abstraction emerges naturally. This is a powerful reasoning
tool for reducing the variety of existing languages to a few paradigms.
After discussing the limits of regular languages, the chapter moves to context-free
grammars. Following the basic definitions, the presentation focuses on
structural properties, namely equivalence, ambiguity, and recursion. Exemplification continues with important linguistic paradigms such as hierarchical lists, parenthesized structures, Polish notations, and operator precedence expressions. Their
combination produces the variety of forms to be found in artificial languages.
Then the classification of some common forms of ambiguity and corresponding
remedies is offered as a practical guide for grammar designers.
Various rule transformations (normal forms) are introduced, which should familiarize the reader with the modifications, often needed by the analysis algorithms
studied in Chap. 4, that adjust a grammar without affecting its language.
Returning to regular languages from the grammar perspective, the chapter shows the greater descriptive capacity of context-free grammars. The comparison
of regular and context-free languages continues by considering the operations that
may cause a language to exit or remain in one or the other family. Alphabetical
transformations anticipate the operations studied in Chap. 5 as translations. A discussion about the unavoidable regularities found in very long strings completes the
theoretical picture.
The last section mentions the Chomsky classification of grammar types and exemplifies context-sensitive grammars, stressing the difficulty of such a rarely used model
and discussing some recent attempts at more expressive grammar models.


2.2 Formal Language Theory
Formal language theory starts from the elementary notions of alphabet, string operations, and aggregate operations on sets of strings. By such operations, complex
languages can be obtained starting from simpler ones.

2.2.1 Alphabet and Language
An alphabet is a finite set of elements called terminal symbols or characters. Let Σ =
{ a1 , a2 , . . . , ak } be an alphabet with k elements, i.e., its cardinality is | Σ | = k. A
string, also called a word, is a sequence, i.e., an ordered set possibly with repetitions,
of characters.
Example 2.1 (binary strings) Let Σ = { a, b } be an alphabet. Some strings are:
a a b a, a a a, a b a a, and b.

A language is a set of strings over a specified alphabet.
Example 2.2 Three sample languages over the same alphabet Σ = { a, b }:
L1 = { a a, a a a }
L2 = { a b a, a a b }
L3 = { a b, b a, a a b b, a b a b, . . . , a a a b b b, . . . }
= set of the strings having as many letters a as letters b



Notice that a formal language viewed as a set has two layers. At the first level, there
is an unordered set of nonelementary elements, the strings. At the second level, each
string is an ordered set of atomic elements, the terminal characters.
Given a language, a string belonging to it is called a sentence or phrase. Thus,
b b a a ∈ L3 is a sentence of L3 , whereas a b b ∉ L3 is an incorrect string.
The cardinality or size of a language is the number of sentences it contains. For
instance, | L2 | = | { a b a, a a b } | = 2. If the cardinality is finite, the language is
called finite, too. Otherwise, there is no finite bound on the number of sentences
and the language is termed infinite. To illustrate, languages L1 and L2 are finite, but
language L3 is infinite.
One can observe that a finite language is essentially a collection of words3 sometimes called a vocabulary. A special finite language is the empty set or language ∅,
which contains no sentence, i.e., | ∅ | = 0. Usually, when a language contains just
one element, the set braces are omitted by writing, e.g., a b b instead of { a b b }.

3 In mathematical writings, the terms word and string are synonymous, while in linguistics a word is a string that has a meaning.


It is convenient to introduce the notation | x |b for the number of characters b
present in a string x. For instance,
| a a b |a = 2        | a b a |a = 2        | b a a |c = 0

The length | x | of a string x is the number of characters it contains, e.g., | a b | = 2
and | a b a a | = 4. Two strings of length h and k, respectively:

x = a1 a2 . . . ah        y = b1 b2 . . . bk

are equal if h = k and ai = bi (1 ≤ i ≤ h). In words, by examining the strings from
left to right, their respective characters coincide. Thus, we obtain:

a b a ≠ b a a        b a a ≠ b a

2.2.1.1 String Operations
In order to manipulate strings, it is convenient to introduce several operations. For
strings:
x = a1 a2 . . . ah        y = b1 b2 . . . bk

concatenation4 is defined as:

x · y = a1 a2 . . . ah b1 b2 . . . bk
The dot ‘·’ may be dropped, writing x y in place of x · y. This operation, essential
for formal languages, plays the role addition has in number theory.
Example 2.3 (string concatenation) For strings:

x = well        y = in        z = formed

we obtain:

x y = wellin        y x = inwell ≠ x y

( x y ) z = wellin · formed = x ( y z ) = well · informed = wellinformed

Concatenation is clearly noncommutative, that is, the identity x y = y x does not hold
in general. The associative property holds:
(xy)z = x(yz)
Associativity permits writing without parentheses the concatenation of three or more
strings. The length of the result is the sum of the lengths of the concatenated strings:
| x y | = | x | + | y |        (2.1)

4 Also termed product in mathematical works.


2.2.1.2 Empty String
It is useful to introduce the concept of empty (or null) string, denoted by the Greek
epsilon ε, as the only string satisfying, for every string x, the identity:
xε = εx = x
From (2.1), it follows that the empty string has length zero: | ε | = 0.
From an algebraic perspective, the empty string ε is the neutral element with
respect to concatenation, because any string is unaffected by concatenating ε to the
left or right.
The empty string should not be confused with the empty set ∅. In fact, the empty
set as a language contains no string, whereas the set { ε } contains one, the empty
string; equivalently, the cardinalities of sets ∅ and { ε } are 0 and 1, respectively. A
language L is said to be nullable if and only if it includes the empty string, i.e., ε ∈ L.

2.2.1.3 Substring
Let string x = u y v be written as the concatenation of three, possibly empty, strings
u, y, and v. Then strings u, y, v are substrings of x. Moreover, string u is a prefix
of x and string v is a suffix of x. A nonempty substring (or prefix or suffix) is called
proper if it does not coincide with string x.
Let x be a nonempty string of length at least k, i.e., | x | ≥ k ≥ 1. The notation
Ini k (x) denotes the prefix u of x having length k, to be termed the initial (of x) of
length k.
Example 2.4 The string x = a a b a c b a contains the following components:
prefixes:    a, a a, a a b, a a b a, a a b a c, a a b a c b, and a a b a c b a
suffixes:    a, b a, c b a, a c b a, b a c b a, a b a c b a, and a a b a c b a
substrings:  all the prefixes and suffixes listed above, and the internal strings such as a, a b, b a, b a c b, …

Notice that the pair b c is not a substring of string x although both letters b and c
occur in x. The initial of length two is Ini 2 ( a a b a c b a ) = a a.
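The components of Example 2.4 can be enumerated mechanically; here is a small Python sketch (the function names are ours, not the book's):

```python
def prefixes(x):
    """Nonempty prefixes of x, shortest first (the last one is x itself)."""
    return [x[:k] for k in range(1, len(x) + 1)]

def suffixes(x):
    """Nonempty suffixes of x, longest first."""
    return [x[k:] for k in range(len(x))]

def substrings(x):
    """All nonempty substrings u such that x = t u v for some t, v."""
    return {x[i:j] for i in range(len(x)) for j in range(i + 1, len(x) + 1)}

x = "aabacba"
print(prefixes(x))            # a, aa, aab, aaba, aabac, aabacb, aabacba
print("bc" in substrings(x))  # False: b and c are never adjacent in x
print(x[:2])                  # Ini_2(x) = "aa"
```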

2.2.1.4 Reversal or Mirror Reflection
The characters of a string are usually read from left to right, but it is sometimes
requested to reverse the order. The reversal or reflection of a string x = a1 a2 . . . ah
is the string xR = ah ah−1 . . . a1 . For instance, it is:

x = roma        xR = amor


The following string identities are immediate:

( xR )R = x        ( x y )R = yR xR        εR = ε
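In Python, reversal is the slice x[::-1], and the identities above can be checked directly; a minimal sketch (the helper name rev is ours):

```python
def rev(x):
    """Reversal x^R of a string x."""
    return x[::-1]

x, y = "roma", "ab"
print(rev(x))                         # amor
assert rev(rev(x)) == x               # (x^R)^R = x
assert rev(x + y) == rev(y) + rev(x)  # (x y)^R = y^R x^R
assert rev("") == ""                  # the empty string is its own reversal
```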

2.2.1.5 Repetition
When a string contains repetitions, it is handy to have an operator expressing them.
The m-th power xm (for an integer m ≥ 1) of a string x is the concatenation of x with
itself for m − 1 times:
xm = x · x · · · · · x    (m times)

By stipulation, the zero power of any string is defined to be the empty string. Thus,
the complete definition is:
x0 = ε        xm = xm−1 · x for m ≥ 1

Here are a few examples:
x = a b         x0 = ε                x1 = x = a b    x2 = ( a b )2 = a b a b
y = a2 = a a    y3 = a2 a2 a2 = a6    ε0 = ε          ε2 = ε

When writing formulas, the string to be repeated must be parenthesized, if it is longer
than one character. Thus, to express the 2-nd power of string a b, i.e., a b a b, one
should write ( a b )2 , not a b2 , which is the string a b b.
Expressed differently, we assume that the power operation takes precedence over
concatenation. Similarly, reversal takes precedence over concatenation, e.g., a bR
returns a b since bR = b, while ( a b )R = b a.
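In Python, the m-th power of a string is repetition with the * operator, and the precedence conventions above can be illustrated directly (a sketch, with names of our choosing):

```python
def power(x, m):
    """m-th power x^m of string x; x^0 is the empty string."""
    return x * m  # Python's repetition operator matches the definition

assert power("ab", 0) == ""                # x^0 = the empty string
assert power("ab", 2) == "abab"            # ( a b )^2
assert "a" + power("b", 2) == "abb"        # a b^2: power binds tighter than concatenation
assert power("a", 2) * 3 == power("a", 6)  # y^3 with y = a^2 gives a^6
```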

2.2.2 Language Operations
It is straightforward to extend an operation, originally defined on a string, to an entire
language: just apply the operation to each one of the language sentences. By means
of this general principle, the string operations previously defined can be revisited,
starting from those that have one argument.
The reversal LR of a language L is the set of the strings that are the reversal of a
sentence of L:


LR = { x | x = yR ∧ y ∈ L }

Here the strings x are specified by the property expressed in the so-called characteristic predicate.
Similarly, the set of the proper prefixes of a language L is:
prefixes (L) = { y | x = y z ∧ x ∈ L ∧ y ≠ ε ∧ z ≠ ε }


Example 2.5 (prefix-free language) In some applications, the loss of one or more
final characters of a language sentence is required to produce an incorrect string.
The motivation is that the compiler is then able to detect the inadvertent truncation
of a sentence.
A language is prefix-free if none of the proper prefixes of its sentences is in
the language, i.e., the set prefixes (L) is disjoint from L. Thus, language L1 =
{ x | x = an bn ∧ n ≥ 1 } is prefix-free since every prefix takes the form an bm ,
with n > m ≥ 0, and does not satisfy the characteristic predicate.
On the other hand, the language L2 = { am bn | m > n ≥ 1 } contains a3 b2 as well as the proper prefix a3 b.
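The prefix-free property can be tested mechanically; the Python sketch below works on finite truncations of the two infinite languages (the bounds are ours), so it illustrates the property rather than proving it:

```python
def is_prefix_free(lang):
    """True if no sentence of lang is a proper prefix of another sentence."""
    return not any(x != y and y.startswith(x) for x in lang for y in lang)

# finite samples: L1 = { a^n b^n | n >= 1 }, L2 = { a^m b^n | m > n >= 1 }
L1 = {"a" * n + "b" * n for n in range(1, 6)}
L2 = {"a" * m + "b" * n for m in range(2, 6) for n in range(1, m)}

print(is_prefix_free(L1))  # True
print(is_prefix_free(L2))  # False: a^3 b is a proper prefix of a^3 b^2
```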
Similarly, the operations on two strings can be extended to two languages, by letting the former and latter argument span the respective language. For instance, the
concatenation of languages L′ and L′′ is defined as:

L′ · L′′ = { x · y | x ∈ L′ ∧ y ∈ L′′ }
From this, extending the m-th power operation to a language is straightforward
(m ≥ 0):
L0 = { ε }        Lm = Lm−1 · L for m ≥ 1

Some special cases follow from the previous definitions:

∅0 = { ε }        L · { ε } = { ε } · L = L        L · ∅ = ∅ · L = ∅

Example 2.6 (language concatenation) Consider languages L1 and L2 :

L1 = { ai | i ≥ 0 ∧ i is even } = { ε, a a, a a a a, . . . }
L2 = { bj a | j ≥ 1 ∧ j is odd } = { b a, b b b a, . . . }

We obtain this concatenation L1 · L2 :

L1 · L2 = { ai · bj a | ( i ≥ 0 ∧ i is even ) ∧ ( j ≥ 1 ∧ j is odd ) }
        = { ε b a, a2 b a, a4 b a, . . . , ε b3 a, a2 b3 a, a4 b3 a, . . . }



A common error when computing a power of a base language is to take the same
string for m times. The result is a different set, included in the power of the base:


{ x | x = ym ∧ y ∈ L } ⊆ Lm        m ≥ 2        (2.2)

Thus, in (2.2) for L = { a, b } and m = 2, the set to the left of ‘⊆’ is { a a, b b }, while
the value of Lm is { a a, a b, b a, b b }.
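The inclusion (2.2) is easy to reproduce for finite languages; a small Python sketch (helper names ours):

```python
def concat(L1, L2):
    """Concatenation of two finite languages."""
    return {x + y for x in L1 for y in L2}

def lang_power(L, m):
    """m-th power of a finite language; L^0 = { empty string }."""
    P = {""}
    for _ in range(m):
        P = concat(P, L)
    return P

L = {"a", "b"}
same_string_twice = {y * 2 for y in L}  # the erroneous reading: the same y repeated
print(sorted(same_string_twice))        # ['aa', 'bb']
print(sorted(lang_power(L, 2)))         # ['aa', 'ab', 'ba', 'bb']
assert same_string_twice < lang_power(L, 2)  # strict inclusion, as in (2.2)
```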


Example 2.7 (strings of finite length) The power operation allows a concise definition
of the strings of length not exceeding some integer k ≥ 0. Consider the alphabet
Σ = { a, b }. For k = 3, the language L:
L = { ε, a, b, a a, a b, b a, b b, a a a, a a b, a b a, a b b, b a a, b a b, b b a, b b b }
= Σ0 ∪ Σ1 ∪ Σ2 ∪ Σ3

can also be defined as:
L = { ε, a, b }3
Notice that the sentences shorter than 3 are obtained by using the empty string ε of
the base language.
By slightly changing the example, the language L = { x | 1 ≤ | x | ≤ 3 } is
defined, with concatenation and power, by the formula:
L = { a, b } · { ε, a, b }2
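Both formulas of Example 2.7 can be evaluated with the same finite-language helpers; a Python sketch (function names ours):

```python
def concat(L1, L2):
    """Concatenation of two finite languages."""
    return {x + y for x in L1 for y in L2}

def lang_power(L, m):
    """m-th power of a finite language; L^0 = { empty string }."""
    P = {""}
    for _ in range(m):
        P = concat(P, L)
    return P

base = {"", "a", "b"}                            # { epsilon, a, b }
L = lang_power(base, 3)                          # strings over {a, b} of length <= 3
L1to3 = concat({"a", "b"}, lang_power(base, 2))  # lengths 1 to 3

print(len(L))      # 15 = 1 + 2 + 4 + 8
print(len(L1to3))  # 14 = 2 + 4 + 8
```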



2.2.3 Set Operations
Since a language is a set, the classical set operations of union ‘∪’, intersection ‘∩’, and
difference ‘\’ apply to languages. The set relations of inclusion ‘⊆’, strict inclusion
‘⊂’, and equality ‘=’ apply as well.
Before introducing the complement of a language, the notion of universal language is needed: it is defined as the set of all the strings, over an alphabet Σ, of any
length including zero. Clearly, the universal language is infinite and can be viewed
as the union of all the powers of the alphabet:
Luniversal = Σ 0 ∪ Σ ∪ Σ 2 ∪ . . .
The complement of a language L over an alphabet Σ, denoted by ¬ L, is the set
difference:
¬ L = Luniversal \ L
that is, the set of the strings over alphabet Σ that are not in L. When the alphabet is
understood, the universal language can be expressed as the complement of the empty
language:
Luniversal = ¬ ∅
Example 2.8 (language complement) The complement of a finite language is always
infinite; for instance, the set of strings of any length except two is:


¬ { a, b }2 = { ε } ∪ { a, b } ∪ { a, b }3 ∪ . . .


On the other hand, the complement of an infinite language may or may not be finite,
as shown on one side by the complement of the universal language, and on the other
side by the complement of the set of the strings of even length over alphabet { a }:


L = { a2n | n ≥ 0 }        ¬ L = { a2n+1 | n ≥ 0 }

Moving to set difference, consider an alphabet Σ = { a, b, c } and languages:
L1 = { x | | x |a = | x |b = | x |c ≥ 0 }
L2 = { x | | x |a = | x |b ∧ | x |c = 1 }
Then the set differences are:
L1 \ L2 = { ε } ∪ { x | | x |a = | x |b = | x |c ≥ 2 }

which is the set of the strings where the letters a, b, and c occur the same number
of times, that number being different from one, and:

L2 \ L1 = { x | | x |a = | x |b ≠ | x |c ∧ | x |c = 1 }

which is the set of the strings that have exactly one c and the same number, different
from one, of occurrences of a and b.


2.2.4 Star and Cross
Most artificial and natural languages include sentences that can be lengthened at will,
so causing the number of sentences in the language to be unbounded. On the other
hand, none of the operations defined so far, with the exception of complement,
allows us to write a finite formula denoting an infinite language. In order to enable the
definition of an infinite language, the next essential development extends the power
operation to the limit.

2.2.4.1 Star
The star5 operation ‘∗’ is defined as the union of all the powers of the base language:

L∗ = ⋃h≥0 Lh = L0 ∪ L1 ∪ L2 ∪ . . . = { ε } ∪ L ∪ L2 ∪ . . .

5 Also known as Kleene star and as reflexive transitive closure by concatenation.


Example 2.9 (language star) For the language L = { a b, b a }, the star is:
L∗ = { ε, a b, b a, a b a b, a b b a, b a a b, b a b a, . . . }
Every nonempty string of the ‘starred’ language L∗ can be segmented into substrings,
which are the sentences of the base language L.
Notice that, if the base language contains at least one nonempty string, the starred
language is infinite.
It may happen that the starred and base languages are identical, as in:

L = { a2n | n ≥ 0 }        L∗ = { a2n | n ≥ 0 } = L

An interesting special case occurs when the base is an alphabet Σ; then the star Σ ∗
contains all the strings6 obtained by concatenating terminal characters. This language
is the same as the previous universal language7 of alphabet Σ.
It is obvious that any formal language is a subset of the universal language over the
same alphabet, and the relation L ⊆ Σ ∗ is often written to say that L is a language
over an alphabet Σ. Some useful properties of star follow:
properties of star                            meaning
L ⊆ L∗                                        monotonicity
if x ∈ L∗ ∧ y ∈ L∗ then x y ∈ L∗              closure by concatenation
( L∗ )∗ = L∗                                  idempotence
( L∗ )R = ( LR )∗                             commutativity with reversal

Example 2.10 (idempotence) The monotonicity property states that any language
is included in its star. But for language L = { a2n | n ≥ 0 }, the equality L∗ = L
follows from the idempotence property and from the fact that L can be equivalently
defined by the starred formula { a a }∗ .
For the empty language ∅ and empty string ε, we have the set identities:
∅∗ = { ε }        { ε }∗ = { ε }
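Since L∗ is infinite whenever L contains a nonempty string, a program can only enumerate it up to a length bound; the Python sketch below (names and truncation bound are ours) also confirms the two identities above:

```python
def concat(L1, L2):
    """Concatenation of two finite languages."""
    return {x + y for x in L1 for y in L2}

def star_up_to(L, max_len):
    """The strings of L* whose length does not exceed max_len."""
    result, frontier = {""}, {""}
    while frontier:
        frontier = {w for w in concat(frontier, L) if len(w) <= max_len}
        frontier -= result   # keep only newly discovered strings
        result |= frontier
    return result

print(sorted(star_up_to({"ab", "ba"}, 4)))
# ['', 'ab', 'abab', 'abba', 'ba', 'baab', 'baba']
assert star_up_to(set(), 5) == {""}  # star of the empty set is { epsilon }
assert star_up_to({""}, 5) == {""}   # star of { epsilon } is { epsilon }
```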

6 The length of a sentence in Σ ∗ is unbounded, yet it may not be considered infinite. A specialized
branch of formal language theory (see Perrin and Pin [7]) is devoted to the so-called infinitary
or omega-languages, which include also sentences of infinite length. They effectively model the
situations when a perpetual system can receive or produce messages of infinite length.
7 Another name for it is free monoid. In algebra, a monoid is a structure provided with an associative
composition law (concatenation) and a neutral element (empty string).


Example 2.11 (identifiers) Many artificial languages assign a name or identifier to
each entity, i.e., variable, file, document, subprogram, and object. A common naming
rule prescribes that an identifier is a string that has the initial character in the set
{ A, B, . . . , Z } and contains any number of letters or digits { 0, 1, . . . , 9 }, e.g.,
LOOP3A2 .
By using the alphabets ΣA and ΣN :
ΣA = { A, B, . . . , Z }

ΣN = { 0, 1, . . . , 9 }

the language of identifiers I ⊆ ( ΣA ∪ ΣN )∗ is:
I = ΣA ( ΣA ∪ ΣN )∗
To introduce a variant, prescribe that the length of the identifiers should not exceed
five. By defining an alphabet Σ = ΣA ∪ ΣN , the language is:


I5 = ΣA ( Σ 0 ∪ Σ 1 ∪ Σ 2 ∪ Σ 3 ∪ Σ 4 )
   = ΣA ( ε ∪ Σ ∪ Σ 2 ∪ Σ 3 ∪ Σ 4 )
The formula expresses the concatenation of language ΣA , the sentences of which
are single characters, with the language constructed as the union of powers. A more
elegant writing is:
I5 = ΣA ( ε ∪ Σ )4
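The languages I and I5 of Example 2.11 translate almost verbatim into the pattern notation of Python's re module; a sketch (the pattern names are ours):

```python
import re

# I = Sigma_A ( Sigma_A | Sigma_N )* : a letter followed by letters or digits
ident = re.compile(r"[A-Z][A-Z0-9]*\Z")
# I5 = Sigma_A ( epsilon | Sigma )^4 : the same, with length capped at five
ident5 = re.compile(r"[A-Z][A-Z0-9]{0,4}\Z")

print(bool(ident.match("LOOP3A2")))   # True
print(bool(ident.match("3LOOP")))     # False: must start with a letter
print(bool(ident5.match("LOOP3")))    # True: length five is allowed
print(bool(ident5.match("LOOP3A2")))  # False: seven characters exceed the cap
```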



2.2.4.2 Cross
A useful though dispensable operator, derived from star, is the cross8 ‘+’:

L+ = ⋃h≥1 Lh = L ∪ L2 ∪ . . .

Cross differs from star because the union excludes the power zero. The following
relations hold:
L+ ⊆ L∗        ε ∈ L+ if and only if ε ∈ L        L+ = L L∗ = L∗ L

Star and cross are called iteration operators.
Example 2.12 (language cross) Two applications of cross to finite sets:


{ a b, b b }+ = { a b, b2 , a b3 , b2 a b, a b a b, b4 , . . . }
{ ε, a a }+ = { ε, a2 , a4 , . . . } = { a2n | n ≥ 0 }

8 Or irreflexive closure by concatenation.




Not surprisingly, a language can usually be defined by various formulas, which differ
by their use of operators.
Example 2.13 (controlling string length) Two ways of defining the strings of four
or more characters:
• concatenating the strings of length 4 with arbitrary strings: Σ 4 · Σ ∗
• constructing the 4-th power of the set of nonempty strings: ( Σ + )4



2.2.5 Language Quotient
Operations like concatenation, star, or union lengthen the strings or increase the
cardinality of the set of strings they operate upon. On the other hand, given two
languages L′ and L′′ , the right quotient operation L′ /R L′′ shortens the sentences of
the first language L′ by cutting a suffix, which is a sentence of the second language
L′′ . The right quotient of L′ with respect to L′′ is defined as:

L = L′ /R L′′ = { y | ∃ z such that y z ∈ L′ ∧ z ∈ L′′ }
Example 2.14 (quotient language) Consider two languages L′ and L′′ :

L′ = { a2n b2n | n > 0 }        L′′ = { b2n+1 | n ≥ 0 }

Their right quotients are:

L′ /R L′′ = { ar bs | ( r ≥ 2 is even ) ∧ ( 1 ≤ s < r is odd ) }
          = { a2 b, a4 b, a4 b3 , . . . }

L′′ /R L′ = ∅



A dual operation is the left quotient L′ /L L′′ that shortens the sentences of language
L′ by cutting a prefix, which is a sentence of language L′′ .
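On finite languages both quotients are easy to compute; the Python sketch below applies them to finite truncations of the languages of Example 2.14 (the truncation bounds and names are ours):

```python
def right_quotient(L1, L2):
    """L1 /R L2 = { y | y z in L1 for some z in L2 }."""
    return {x[:len(x) - len(z)] for x in L1 for z in L2 if x.endswith(z)}

def left_quotient(L1, L2):
    """L1 /L L2 = { y | z y in L1 for some z in L2 }."""
    return {x[len(z):] for x in L1 for z in L2 if x.startswith(z)}

# finite samples of L' = { a^2n b^2n | n > 0 } and L'' = { b^(2n+1) | n >= 0 }
Lp = {"a" * (2 * n) + "b" * (2 * n) for n in range(1, 4)}
Lpp = {"b" * (2 * n + 1) for n in range(3)}

print(sorted(right_quotient(Lp, Lpp), key=len))
# strings a^r b^s with r even and s odd, s < r
print(right_quotient(Lpp, Lp))  # set(): no b-run ends with a sentence of L'
```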
More operations will be introduced later, in order to transform or translate a formal
language by replacing the terminal characters with other characters or strings.

2.3 Regular Expressions and Languages
Theoretical investigation of formal languages has identified various categories of
languages, in a way reminiscent of the classification of numerical domains introduced
much earlier by number theory. Such categories are characterized by mathematical
and algorithmic properties.

18

2 Syntax

The first family of formal languages to be considered is called regular (or rational)
and can be defined by an astonishing number of different approaches. Regular languages have been independently discovered in disparate scientific fields: the study
of input signals driving a sequential circuit9 to a certain state, the lexicon of programming languages modeled by simple grammar rules, and the simplified analysis
of neural behavior. Later such approaches have been complemented by a logical
definition based on a restricted form of predicates.
To introduce the family, the first definition will be algebraic, by using the operations of union, concatenation, and star; then the family will be defined again by means
of certain simple grammar rules; last, Chap. 3 describes the algorithm for recognizing
regular languages in the form of an abstract machine, the finite automaton.10

2.3.1 Definition of Regular Expression
A language over alphabet Σ = { a1 , a2 , . . . , an } is regular if it can be expressed
by applying a finite number of times the operations of concatenation, union, and
star, starting with the unitary languages11 { a1 }, { a2 }, …, { an } or the empty string ε.
More precisely, a regular expression (r.e.) is a string r containing the terminal
characters of the alphabet Σ and the following metasymbols12 :
“ ∪ ” union
“ · ” concatenation
“ ∗ ” star
“ ε ” empty (or null) string
“ ( ) ” parentheses
in accordance with the following rules:
#   rule                              meaning
1   r = ε                             empty (or null) string
2   r = a                             unitary language
3   r = ( s ∪ t )                     union of expressions
4   r = ( s · t ) or r = ( s t )      concatenation of expressions
5   r = ( s )∗                        iteration (star) of an expression

where the symbols s and t are regular (sub)expressions.
For expressivity, the metasymbol cross is allowed in an r.e., since it can be
expressed using rules 4 and 5 as ( s )+ = (s ( s∗ )). For economy, some parentheses

can be dropped by stipulating the following precedence between operators: first apply
star, then concatenation, and last union. Furthermore, the union of three (or more)
terms, e.g., ((r ∪ s) ∪ t) and (r ∪ (s ∪ t)), can be simplified as (r ∪ s ∪ t) because
union is an associative operation. The same remark applies to concatenation.

9 A digital component incorporating a memory.
10 The language family can also be defined by the form of the logical predicates characterizing
language sentences, e.g., as in [8].
11 A unitary language contains one sentence.
12 In order to prevent confusion between terminals and metasymbols, the latter should not be in the
alphabet. If not so, the metasymbols must be suitably recoded to make them distinguishable.
It is customary to write the cup ‘∪’ symbol as a vertical bar ‘|’, called an alternative.
In more theoretical books on formal languages, the metasymbol ‘∅’ (empty set) is allowed in an r.e., which makes it possible to obtain the empty string ‘ε’ from the basic operations through the identity { ε } = ∅∗ . We prefer the more direct and practical presentation shown above.
The rules from 1 to 5 compose the syntax of the r.e., to be formalized later by
means of a grammar (Example 2.31 on p. 33). The meaning or denotation of an r.e.
r is a language Lr over the alphabet Σ, defined by the correspondence shown in
Table 2.1.
Example 2.15 (language of the multiples of 3) Let the alphabet be Σ = { 1 }, where
character ‘1’ may be viewed as a pulse or signal. An r.e. e and the language Le it
denotes are:


e = ( 1 1 1 )∗        Le = { 1ⁿ | n mod 3 = 0 }

Language Le contains the sequences of ‘1’ whose length is a multiple of three. Notice that when the parentheses are dropped the language changes, due to the precedence of star over concatenation:

e1 = 1 1 1∗ = 1 1 ( 1 )∗        Le1 = { 1ⁿ | n ≥ 2 }
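The two denotations can be checked with the regular expression library of most programming languages; a minimal sketch in Python, where the patterns transliterate e and e1:

```python
import re

# e = (111)* : strings of 1s whose length is a multiple of three
e = re.compile(r"(?:111)*")
# e1 = 111*  : star binds only the last 1, so at least two 1s are required
e1 = re.compile(r"111*")

for n in range(7):
    s = "1" * n
    print(n, bool(e.fullmatch(s)), bool(e1.fullmatch(s)))
```

Running the loop shows that e accepts exactly lengths 0, 3, 6, while e1 accepts every length n ≥ 2.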



Example 2.16 (language of integers) Let the alphabet be Σ = { +, −, d }, where
character d denotes any decimal digit 0, 1, …, 9. The expression e:
e = ( + ∪ − ∪ ε ) d d∗ ≡ ( + | − | ε ) d d∗
produces the language Le = { +, −, ε } { d } { d }∗ of the integers with or without
sign, such as +353, −5, 969, and +001.
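The expression transliterates directly into a pattern; a quick sketch in Python, where the character class [0-9] plays the role of d and the optional sign renders the choice ( + | − | ε ):

```python
import re

# ( + | - | ε ) d d*  with d = any decimal digit
integer = re.compile(r"[+-]?[0-9][0-9]*")

for s in ["+353", "-5", "969", "+001"]:
    assert integer.fullmatch(s)
assert not integer.fullmatch("+")     # a sign alone is not an integer
assert not integer.fullmatch("3-5")
```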


Table 2.1 Language Lr denoted by a regular expression r

#   Regular expression r                            Language Lr denoted by r
1   ε                                               { ε }
2   a          where a ∈ Σ                          { a }
3   s ∪ t      or more often s | t                  Ls ∪ Lt
4   s · t      or more often s t                    Ls · Lt   or more often Ls Lt
5   s∗ , s+                                         Ls∗ , Ls+

20

2 Syntax

Actually, the correspondence between an r.e. and its denoted language is so direct
that it is customary to refer to the language Le by the r.e. e itself.
We introduce the established terminology for two language families. A language is regular if it is denoted by a regular expression. The empty language ∅ is considered a regular language as well, even though no r.e. (as defined above) denotes it. The collection of all regular languages is called the family REG of regular languages.
Another simple family of languages is the collection of all the finite languages,
and it is called FIN. Thus, a language is in the family FIN if its cardinality is finite,
as for instance the language of 32-bit binary numbers.
By comparing the families REG and FIN, it is easy to see that every finite language is regular, i.e., FIN ⊆ REG. In fact, a finite language is the union of finitely many strings x1 , x2 , …, xk , each one being the concatenation of finitely many characters, i.e., xi = a1 a2 . . . ani . Then an r.e. producing such a finite language is simply the union of k terms xi , each one concatenating ni characters. Since family REG also includes infinite languages, the inclusion between the two families is proper, i.e., FIN ⊂ REG.
More language families will be later introduced and compared with REG.
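The argument for FIN ⊆ REG is constructive: the union of the (suitably escaped) sentences is itself an r.e. for the finite language. A small sketch in Python, where the never-matching pattern (?!) stands in for the empty language:

```python
import re

# FIN ⊆ REG: a finite language {x1, ..., xk} is denoted by the r.e. x1 | x2 | ... | xk
def finite_language_to_regex(strings):
    if not strings:
        return "(?!)"          # empty language: a pattern that matches nothing
    return "|".join(re.escape(x) for x in sorted(strings))

L = {"ab", "ba", "aab"}
pattern = re.compile(finite_language_to_regex(L))
assert all(pattern.fullmatch(x) for x in L)
assert not pattern.fullmatch("aa")
```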

2.3.2 Derivation and Language
We formalize the mechanism through which a given r.e. e produces the denoted
language. For the moment, we suppose that r.e. e is fully parenthesized (except for
the atomic terms), and we introduce the notion of subexpression (s.e.) in the next
example:
e0 = ( a ∪ ( b b ) )∗ ( c+ ∪ ( a ∪ ( b b ) ) )

where e1 denotes the first factor ( a ∪ ( b b ) )∗ , e2 denotes the second factor ( c+ ∪ ( a ∪ ( b b ) ) ), and s denotes the occurrence of ( a ∪ ( b b ) ) inside e2 .

This r.e. e0 is structured as the concatenation of two parts, e1 and e2 , to be called subexpressions (of e0 ). In general, an s.e. f of an r.e. e is a well-parenthesized substring immediately occurring inside the outermost parentheses. This means that there is no other well-parenthesized substring of e that contains f . In the example, the substring labelled s is not an s.e. of e0 , but it is an s.e. of e2 .
When an r.e. is not fully parenthesized, in order to identify its subexpressions one
has to insert (or to imagine) the missing parentheses, in agreement with the operator
precedence. We recall that three or more terms combined by union do not need to be
pairwise parenthesized, because union is an associative operation. The same remark
applies to three or more concatenated terms.
A union or iteration (star and cross) operator offers different choices for producing
strings. By making a choice, one obtains an r.e. that defines a less general language,
which is included in the original one. We say that an r.e. is a choice of another one
in the following three cases:


1. r.e. ek (with 1 ≤ k ≤ m and m ≥ 2) is a choice of the union ( e1 ∪ · · · ∪ ek ∪ · · · ∪ em )
2. r.e. eᵐ = e · e · · · · · e (m times, with m ≥ 1) is a choice of the star e∗ or of the cross e+
3. the empty string ε is a choice of the star e∗
Let e be an r.e. By substituting a choice into e , a new r.e. e′ can be derived from e . The corresponding relation between two regular expressions e and e′ , called derivation, is defined next.

Definition 2.17 (derivation13 ) We say that an r.e. e derives an r.e. e′ , written e ⇒ e′ , if one of the two conditions below holds true:

1. r.e. e′ is a choice of e
2. r.e. e is the concatenation of m ≥ 2 subexpressions, as follows:

e = e1 . . . ek . . . em

and r.e. e′ is obtained from e by substituting for an s.e., say ek , a choice of ek , say e′k , as shown below:

∃ k, 1 ≤ k ≤ m, such that e′k is a choice of ek ∧ e′ = e1 . . . e′k . . . em

Such a derivation ⇒ is called immediate as it makes exactly one choice. If two or more immediate derivations are applied in series, thus making as many choices altogether, we have a multi-step derivation. We say that an r.e. e0 derives an r.e. en in n ≥ 1 steps, written e0 =⇒ⁿ en , if the following immediate derivations apply:

e0 ⇒ e1        e1 ⇒ e2        . . .        en−1 ⇒ en

The notation e =⇒⁺ e′ says that an r.e. e derives an r.e. e′ in n ≥ 1 steps, without specifying n. If the number of steps is n = 0, then the identity e0 = en holds, and this says that the derivation relation is reflexive. We also write e =⇒∗ e′ if it holds that e =⇒⁺ e′ or e = e′ .
Example 2.18 (immediate and multi-step derivation) Immediate derivations:

a∗ ∪ b+ ⇒ b+        a∗ ∪ b+ ⇒ a∗        ( a∗ ∪ b b )∗ ⇒ ( a∗ ∪ b b ) ( a∗ ∪ b b )

13 Also called implication.


Notice that the substrings of the r.e. considered must be chosen in order from external to internal, if one wants to produce all possible derivations. For instance, it would be unwise, starting from e = ( a∗ ∪ b b )∗ , to choose ( a² ∪ b b )∗ , because a∗ is not an s.e. of e . Although the value 2 is a correct choice for the star, such a premature choice would rule out the derivation of a valid sentence such as a² b b a³ .
Multi-step derivations:

a∗ ∪ b+ ⇒ a∗ ⇒ ε        that is  a∗ ∪ b+ =⇒² ε        or also  a∗ ∪ b+ =⇒⁺ ε
a∗ ∪ b+ ⇒ b+ ⇒ b b b        or also  a∗ ∪ b+ =⇒⁺ b b b

Some expressions produced by derivation from an expression r contain the metasymbols of union, star, and cross. Others just contain terminal characters or the empty string, and possibly redundant parentheses, which can be canceled. The latter expressions compose the language denoted by the r.e. The language defined by a regular expression r is:

Lr = { x ∈ Σ ∗ | r =⇒∗ x }
Two r.e. are equivalent if they define the same language.
The next example shows that different derivation orders may produce the same sentence.
Example 2.19 (derivation orders) Compare the derivations 1 and 2 below, each consisting of a 1st, 2nd and 3rd step:

1 : a∗ ( b ∪ c ∪ d ) f+ ⇒ a a a ( b ∪ c ∪ d ) f+ ⇒ a a a c f+ ⇒ a a a c f
2 : a∗ ( b ∪ c ∪ d ) f+ ⇒ a∗ c f+ ⇒ a a a c f+ ⇒ a a a c f

At line 1, the first step takes the leftmost s.e. ( a∗ ) and chooses a³ = a a a, whereas
at line 2 it takes another s.e. ( b ∪ c ∪ d ) and chooses c. Since such choices are
mutually independent, they can be applied in any order. Thus, by the second step, we
obtain the same r.e. a a a c f + . The third step takes s.e. f + , chooses f , and produces
the sentence a a a c f . Since the last step is independent of the other two, it could be
performed before, after, or between them.


2.3.2.1 Ambiguity of Regular Expressions
The next example conceptually differs from the preceding one with respect to how
different derivations produce the same sentence.
Example 2.20 (ambiguous regular expression) The language over alphabet { a, b }
such that every sentence contains one or more letters a is defined by:
( a ∪ b )∗ a ( a ∪ b )∗


where the compulsory presence of an a is evident. Clearly, every sentence containing
two or more occurrences of a can be obtained by multiple derivations, which differ
with respect to the character identified as the compulsory one. For instance, sentence
a a offers two possibilities:
(a ∪ b)∗ a (a ∪ b )∗ ⇒ (a ∪ b) a (a ∪ b)∗ ⇒ a a (a ∪ b)∗ ⇒ a a ε = a a
(a ∪ b)∗ a (a ∪ b)∗ ⇒ ε a (a ∪ b)∗ ⇒ ε a (a ∪ b) ⇒ ε a a = a a



A sentence, and the r.e. that derives it, is said to be ambiguous if, and only if, it can be
obtained via two structurally different derivations. Thus, sentence a a is ambiguous,
while sentence b a is not ambiguous, because only one set of choices is possible,
corresponding to the derivation:
( a ∪ b )∗ a ( a ∪ b )∗ ⇒ ( a ∪ b ) a ( a ∪ b )∗ ⇒ b a ( a ∪ b )∗ ⇒ b a ε = b a
Ambiguous definitions are a source of trouble in many settings, although they may be useful in certain applications. In general, such definitions should be avoided, even though they may have the advantage of concision over unambiguous ones. We would like to have an algorithm for checking whether an r.e. is ambiguous, but for that we have to wait until Chap. 3.
For the time being, we just state a simple sufficient condition for a sentence and its r.e. to be ambiguous. To this end, given an r.e. f , we number its letters and obtain a numbered r.e., which we denote by f ′ :

f ′ = ( a1 ∪ b2 )∗ a3 ( a4 ∪ b5 )∗

Notice that the numbered r.e. f ′ defines a language over the numbered alphabet { a1 , b2 , a3 , a4 , b5 }, in general larger than the original one.

An r.e. f is ambiguous if the language defined by the numbered r.e. f ′ associated with f contains two distinct strings x and y that become identical when the numbers are erased. For instance, the strings a1 a3 and a3 a4 of the language of f ′ prove the ambiguity of the string a a of the language of f .
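The sufficient condition can be checked by brute force on a bounded portion of the numbered language. A small sketch in Python for the numbered r.e. ( a1 ∪ b2 )∗ a3 ( a4 ∪ b5 )∗ , enumerating numbered strings up to a fixed length and looking for collisions after the numbers are erased:

```python
from itertools import product

def star(symbols, max_len):
    """All strings (as tuples) over 'symbols' of length at most max_len."""
    for n in range(max_len + 1):
        yield from product(symbols, repeat=n)

def erase(numbered):
    """Erase the numbering: 'a1' -> 'a'."""
    return "".join(sym[0] for sym in numbered)

seen, ambiguous = {}, set()
for left in star(["a1", "b2"], 2):       # choices for ( a1 | b2 )*
    for right in star(["a4", "b5"], 2):  # choices for ( a4 | b5 )*
        numbered = left + ("a3",) + right
        plain = erase(numbered)
        if plain in seen and seen[plain] != numbered:
            ambiguous.add(plain)
        seen.setdefault(plain, numbered)

print("aa" in ambiguous, "ba" in ambiguous)   # True False
```

As the text states, a a is ambiguous (both a1 a3 and a3 a4 map to it), while b a has a single numbering.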
The same criterion can be applied to an r.e. e that includes the metasymbol ε. For instance, consider the r.e. e = ( a | ε )+ . Then the numbered r.e. is e′ = ( a1 | ε2 )+ , from which the two distinct strings a1 ε2 and ε2 a1 can be derived, which are mapped to the same ambiguous string a.
Notice that the sufficient condition does not cover cases like the following. The r.e. ( a+ | b )+ b, numbered ( a1+ | b2 )+ b3 , ambiguously derives the string a a b with the unique numbering a1 a1 b3 , yet with two distinct derivations:

( a+ | b )+ b ⇒ ( a+ | b ) ( a+ | b ) b =⇒⁺ a+ a+ b =⇒⁺ a a b

and

( a+ | b )+ b ⇒ ( a+ | b ) b ⇒ a+ b ⇒ a a b


Similarly, the r.e. ( a∗ | b∗ ) c has the ambiguous derivations:

( a∗ | b∗ ) c ⇒ a∗ c ⇒ ε c = c        and        ( a∗ | b∗ ) c ⇒ b∗ c ⇒ ε c = c

which are not detected by our condition.
The concept of ambiguity will be thoroughly studied for grammars.

2.3.3 Other Operators
When regular expressions are used in practice, it is convenient to add to the basic
operators of union, concatenation, and star, the derived operators of power and
cross. Moreover, for better expressivity other derived operators may be practical:
repetition from k ≥ 0 to n > k times: [ a ]ₖⁿ = aᵏ ∪ aᵏ⁺¹ ∪ · · · ∪ aⁿ
option: [ a ] = [ a ]₀¹ = a⁰ ∪ a¹ = ε ∪ a
interval of an ordered set: the short notation ( 0 . . . 9 ) represents any digit in the ordered set { 0, 1, . . . , 9 }; similarly, the notations ( a . . . z ) and ( A . . . Z ) respectively represent any lowercase/uppercase letter
Sometimes other set operations are also used: intersection, set difference, and complement; the regular expressions using such operators are called extended, although
the name is not standard and one has to specify the operators allowed case by case.
Example 2.21 (extended r.e. with intersection) This operator provides a straightforward formulation when a string, to be valid, must obey two or more conditions. To illustrate, let the alphabet be { a, b } and assume a valid string must (i) contain the substring b b and (ii) have an even length. Condition (i) is imposed by the r.e.:

( a | b )∗ b b ( a | b )∗

condition (ii) by the r.e.:

( ( a | b )² )∗

and the language is defined by the r.e. extended with intersection:

( ( a | b )∗ b b ( a | b )∗ ) ∩ ( ( a | b )² )∗

The same language can be defined by a basic r.e. without intersection, but the formula is more complicated. It says that the substring b b can be surrounded by two strings, both of even length or both of odd length:

( ( a | b )² )∗ b b ( ( a | b )² )∗ | ( a | b ) ( ( a | b )² )∗ b b ( a | b ) ( ( a | b )² )∗
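Most practical regex engines have no intersection operator, but the effect is easily emulated by testing the two conditions separately; a minimal sketch in Python:

```python
import re

contains_bb = re.compile(r"[ab]*bb[ab]*")     # condition (i)
even_length = re.compile(r"(?:[ab][ab])*")    # condition (ii)

def valid(s):
    # intersection = both r.e. must match the whole string
    return bool(contains_bb.fullmatch(s)) and bool(even_length.fullmatch(s))

assert valid("abba") and valid("bb") and valid("ababbb")
assert not valid("abb")    # contains bb, but odd length
assert not valid("abab")   # even length, but no bb
```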




Furthermore, it is sometimes simpler to define the sentences of a language ex negativo,
by stating a property they should not have.
Example 2.22 (extended r.e. with complement) Let L be the set of strings over alphabet { a, b } not containing a a as a substring. Its complement is:

¬ L = { x ∈ ( a | b )∗ | x contains substring a a }

easily defined by the r.e. ( a | b )∗ a a ( a | b )∗ , whence an extended r.e. for L is:

L = ¬ ( ( a | b )∗ a a ( a | b )∗ )

We also show a definition of language L by a basic r.e.:

L = ( a b | b )∗ ( a | ε )

which most readers are likely to find less readable.
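Again, complement is easily emulated by negating a match; a quick Python sketch that also cross-checks the basic r.e. against the complemented one:

```python
import re

has_aa = re.compile(r"[ab]*aa[ab]*")     # defines the complement ¬L
basic  = re.compile(r"(?:ab|b)*a?")      # the basic r.e. (a b | b)* (a | ε)

# x is in L iff the ¬L pattern does not match it
for s in ["", "a", "ab", "aba", "bab", "aa", "baab", "abab"]:
    in_L = not has_aa.fullmatch(s)
    assert in_L == bool(basic.fullmatch(s))
```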



Actually, it is no coincidence that the languages in Examples 2.21 and 2.22 also admit an r.e. without intersection or complement. A theoretical result, to be presented in Chap. 3, states that the language produced by any r.e. extended with complement and intersection is regular, and therefore by definition it can be produced by a nonextended r.e. as well.

2.3.4 Closure Properties of Family REG
Let op be an operator to be applied to one or two languages, to produce another
language. A language family is closed under operator op if the language, obtained
by applying op to any languages of the family, belongs to the same family.
Property 2.23 (closure of REG) The family REG of regular languages is closed
under the operators of concatenation, union, and star; therefore, it is closed also
under any derived operator, such as cross.

This closure property descends from the very definitions of r.e. and of family REG (p. 20). In spite of its theoretical connotation, Property 2.23 is very relevant in practice: two regular languages can be combined by using the above operations, at no risk of losing the nice features of family REG. This has an important practical consequence: it permits the compositional design of the algorithms used to check whether an input string is valid for a language. Furthermore, we anticipate that the family REG is also closed under intersection, complement, and reversal, which will be proved later.
From the above property, we derive an alternative definition of family REG.


Property 2.24 (characterization of REG) The family of regular languages, REG, is the smallest language family such that: (i) it contains all the finite languages and (ii) it is closed under concatenation, union, and star.

Proof By contradiction, assume there is a smaller family F, i.e., F ⊂ REG, that meets conditions (i) and (ii). Take any language Le ∈ REG defined by an r.e. e. Language Le is obtained by starting from unitary languages, which are finite, and by applying in order the operators present in e. From the assumption, it follows that Le belongs to family F as well. Then F contains every regular language, and thus it contradicts the strict inclusion F ⊂ REG.

We anticipate that other language families exist which are closed under the same operators of Property 2.23. Chief among them is the family CF of context-free languages, to be introduced soon. From Property 2.24, it follows that these two families are related by a (strict) containment, i.e., REG ⊂ CF.

2.4 Linguistic Abstraction: Abstract and Concrete Lists
Recalling the programming languages we are familiar with, we may observe that, although superficially different in their use of keywords and separators, they are often quite similar at a deeper level. By shifting focus from concrete to abstract syntax, we can reduce the bewildering variety of language constructs to a few essential structures. The verb ‘to abstract’ means:
to consider a concept without thinking of a specific example, WordNet 2.1
By abstracting away from the actual characters that represent a language construct,
we perform a linguistic abstraction. This is a language transformation that replaces
the terminal characters of the concrete language with others taken from an abstract
alphabet. Abstract characters should be simpler and suitable to represent similar
constructs from different artificial languages.14
By this approach, the abstract syntax structures of existing artificial languages are easily described as the composition of a few elementary paradigms, by means of standard language operations: union, iteration, and substitution (later defined). Starting from the abstract language, a concrete or real language is obtained by the reverse transformation, metaphorically also called ‘coating with syntax sugar.’
Factoring a language into its abstract and concrete syntax pays off in several ways.
When studying different languages, it affords much conceptual economy. When
designing compilers, abstraction helps for portability across different languages, if
14 The idea of language abstraction is inspired by the research line in linguistics that aims at discovering the underlying similarities between human languages, disregarding morphological and
syntactic differences.


the compiler functions are designed to process abstract, instead of concrete, language
constructs. Thus, parts of, say, a C compiler can be reused for similar languages, like
Java.
Since the number of abstract paradigms in use is surprisingly small, we will be
able to present most of them in this chapter, starting from the ones conveniently
specified by regular expressions, i.e., the lists.
An abstract list contains an unbounded number of elements e of the same type.
It is defined by r.e. e+ , or by r.e. e∗ if an empty list is permitted. For the time being,
a list element is viewed as a terminal character. In later refinements, however, an
element may be a string from another formal language; for instance, think of a list
of numbers.

2.4.1 Lists with Separators and Opening/Closing Marks
In many real cases, adjacent elements must be divided by a string called separator,
denoted s in abstract syntax. Thus, in a list of numbers, a separator is needed to
delimit the end of a number and the beginning of the next one.
A list with separators is defined by r.e. e ( s e )∗ , saying that the first element e
can be followed by zero or more pairs s e. The equivalent definition ( e s )∗ e differs
by giving evidence to the last element.
In many concrete cases, there is another requirement, intended for legibility or
computer processing: to make the start and end of the list easily recognizable by
respectively prefixing and suffixing some special signs: in the abstract, the initial
character or opening mark i, and the final character or closing mark f .
Lists with separators and opening/closing marks are defined as:
i e ( s e )∗ f
Example 2.25 (some concrete lists) Lists are everywhere in languages, as shown by
typical examples.
instruction block

as in
begin instr1 ; instr2 ; . . . ; instrn end

where instr may stand for assignment, goto, if-statement, write-statement, etc.;
corresponding abstract and concrete terms are:
abstract alphabet        concrete alphabet
i                        begin
e                        instr
s                        ;
f                        end


procedure parameters        as in

procedure WRITE ( par1 , par2 , . . . , parn )

with the correspondence: i = procedure WRITE ‘ ( ’, e = par, s = ‘ , ’ and f = ‘ ) ’.
Furthermore, should an empty parameter list be legal, as in a declaration like procedure WRITE ( ), the r.e. becomes i [ e ( s e )∗ ] f
array definition        as in

array MATRIX [ int1 , int2 , . . . , intn ]

with the correspondence: i = array MATRIX ‘ [ ’, e = int, s = ‘ , ’ and f = ‘ ] ’, where each int is an interval such as 10 … 50
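Concrete lists of this shape map directly onto patterns; a small Python sketch of the paradigm i e ( s e )∗ f instantiated for the parameter list above (the identifier syntax is an assumption for illustration):

```python
import re

# i = 'procedure WRITE (', e = identifier, s = ',', f = ')'
param_list = re.compile(
    r"procedure\s+WRITE\s*\(\s*"      # opening mark i
    r"[A-Za-z]\w*"                    # first element e
    r"(?:\s*,\s*[A-Za-z]\w*)*"        # zero or more pairs s e
    r"\s*\)"                          # closing mark f
)

assert param_list.fullmatch("procedure WRITE ( par1 , par2 , par3 )")
assert not param_list.fullmatch("procedure WRITE ( par1 ,, par2 )")  # adjacent separators
assert not param_list.fullmatch("procedure WRITE ( )")  # empty list needs the optional form
```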



2.4.2 Language Substitution
The above examples have illustrated the mapping from concrete to abstract symbols.
Moreover, language designers find it useful to work by stepwise refinement, as done
in any branch of engineering when a complex system is divided into its components,
atomic or otherwise. To this end, we introduce the new language operation of substitution, which replaces a terminal character of a language called the source, with
a sentence of another language called the target. As always, the source alphabet is
Σ and the source language is L ⊆ Σ ∗ . Consider a sentence of L containing one or
more occurrences of a source character b:
x = a1 a2 . . . an ∈ L

where ai = b for some 1 ≤ i ≤ n

Let Δ be another alphabet, called target, and Lb ⊆ Δ∗ be the image language of b.
The substitution of language Lb for b in string x produces a set of strings:

{ y | y = y1 y2 . . . yn ∧ if ai ≠ b then yi = ai else yi ∈ Lb }

which form a language over the alphabet ( Σ \ { b } ) ∪ Δ. Notice that all the characters other than b are left unchanged.
By the usual approach, we can extend the substitution operation from one string
to the whole source language, by applying it to every source sentence.
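The one-string case is easy to sketch; a small Python function computing the substitution of an image language Lb for character b (Lb is taken finite here, so that the result is enumerable):

```python
def substitute(x, b, Lb):
    """All strings obtained by replacing each occurrence of b in x
    with some sentence of the (finite) image language Lb."""
    result = {""}
    for ch in x:
        images = Lb if ch == b else {ch}   # characters other than b are unchanged
        result = {y + img for y in result for img in images}
    return result

# Replace b with { c, dd } in the source string a b a b:
print(sorted(substitute("abab", "b", {"c", "dd"})))   # ['acac', 'acadd', 'addac', 'addadd']
```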
Example 2.26 (Example 2.25 continued) Resuming the case of a parameter list, the
abstract syntax is:
i e ( s e )∗ f
and the substitutions to be applied are tabulated below:

abstract char.        image language
i                     Li = procedure ⟨ procedure identifier ⟩ ‘ ( ’
e                     Le = ⟨ parameter identifier ⟩
s                     Ls = ‘ , ’
f                     Lf = ‘ ) ’

Thus, the opening mark i is replaced with a string of language Li , where the procedure identifier has to agree with the reference manual of the programming language. Clearly, the target languages of such substitutions depend on the ‘syntax sugar’ of the concrete language intended. Notice that the four above substitutions are independent of one another and can be applied in any order.

Example 2.27 (identifiers with dash) In certain programming languages, long mnemonic identifiers can be constructed by appending alphanumeric strings separated by a dash: thus, string LOOP3-OF-35 is a legal identifier. More precisely, the leftmost word, LOOP3, must start with a letter, while the others may also start with a digit; adjacent dashes, as well as a trailing dash, are forbidden.
As a first approximation, each language sentence is a nonempty list of elements e, which are alphanumeric words, separated by a dash:

e ( ‘ - ’ e )∗

However, the first word must be different from the others and may be taken to be the opening mark i of a possibly empty list:

i ( ‘ - ’ e )∗

By substituting for i the language ( A . . . Z ) ( A . . . Z | 0 . . . 9 )∗ , and for e the language ( A . . . Z | 0 . . . 9 )+ , the final r.e. is obtained.
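The final r.e. transliterates into a pattern; a one-line Python check:

```python
import re

# ( A...Z )( A...Z | 0...9 )*  followed by zero or more dash-separated words
ident = re.compile(r"[A-Z][A-Z0-9]*(?:-[A-Z0-9]+)*")

assert ident.fullmatch("LOOP3-OF-35")
assert not ident.fullmatch("3LOOP")       # must start with a letter
assert not ident.fullmatch("LOOP3--OF")   # adjacent dashes forbidden
assert not ident.fullmatch("LOOP3-")      # trailing dash forbidden
```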
This is an overly simple instance of syntax design by abstraction and stepwise refinement, a method to be further developed now and after the introduction of grammars.
Other language transformations are studied in Chap. 5.

2.4.3 Hierarchical or Precedence Lists
A frequently recurrent construct is a list such that each element is in turn a list of a
different type. The first list is attached to level 1, the second to level 2, and so on.
However, the present abstract paradigm, called hierarchical list, is restricted to lists
having a bounded number of levels. The case of unbounded levels is studied later by
using grammars, under the name of nested structures.
A hierarchical list is also called a list with precedence, because the list at level l ≥ 2 binds its elements more strongly than the list at level l − 1. In other words, the elements at a higher level, i.e., with higher precedence (or priority), must be assembled into a list, and each such list becomes an element at the next lower level, i.e., with lower priority.
Of course, each level may have its own opening/closing marks and separator. Such
delimiters are usually distinct level by level, in order to avoid confusion.
The structure of a hierarchical list with k ≥ 2 levels is:

hierarchy of lists                              level        precedence/priority
list1 = i1 list2 ( s1 list2 )∗ f1               1            lowest/min
list2 = i2 list3 ( s2 list3 )∗ f2               2
. . .                                           . . .
listk = ik ek ( sk ek )∗ fk                     k            highest/max

In this schema, only the highest level may contain atomic elements, ek . But a common
variant permits, at any level l (1 ≤ l < k), atomic elements el to occur side by side
with lists of level l + 1. A few concrete examples follow.
Example 2.28 (two hierarchical lists) Structures in programming languages:
block of print instructions

as in

begin instr1 ; instr2 ; … instrn end
where instr is a print instruction, e.g., WRITE (var1 , var2, …, varn) , i.e., a list (from
Example 2.25); there are two levels (k = 2):
level 1 (low): list of instructions instr separated by semicolon ‘;’, opened by begin and closed by end
level 2 (high): list of variables var separated by comma ‘,’, with i2 = WRITE ‘ ( ’ and f2 = ‘ ) ’

arithmetic expression without parentheses        the precedences of the arithmetic operators determine how many levels the hierarchy has; for instance, the operators ‘×’, ‘÷’ and ‘+’, ‘−’ are layered on two levels (k = 2) and the expression:

3 + 5 × 7 × 4 − . . . − 8 × 2 ÷ 5 + 8 + 3

where each maximal run of numbers n joined by ‘×’ and ‘÷’ is a term, is a two-level list, with neither opening nor closing marks; at level 1, we find a list of terms term = list2 , separated by the signs ‘+’ and ‘−’, i.e., by low precedence operators; at level 2 we see a list of numbers n = e2 , separated by the signs ‘×’ and ‘÷’, i.e., by high precedence operators; one may go further and introduce a level 3 with an exponentiation sign ‘∗∗’ as separator
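A two-level precedence list can be evaluated bottom-up, one level at a time; a compact Python sketch (writing × and ÷ as * and /), which splits on the low precedence separators first, then on the high precedence ones within each term:

```python
import re

def eval_term(term):                 # level 2: n ( {*,/} n )*
    parts = re.split(r"([*/])", term)
    value = float(parts[0])
    for op, number in zip(parts[1::2], parts[2::2]):
        value = value * float(number) if op == "*" else value / float(number)
    return value

def eval_expr(expr):                 # level 1: term ( {+,-} term )*
    parts = re.split(r"([+-])", expr.replace(" ", ""))
    value = eval_term(parts[0])
    for op, term in zip(parts[1::2], parts[2::2]):
        value = value + eval_term(term) if op == "+" else value - eval_term(term)
    return value

print(eval_expr("3 + 5*7*4 - 8*2/5 + 8 + 3"))
```

The high precedence groups 5*7*4 and 8*2/5 are reduced to single values before the level-1 list of terms is combined, yielding 150.8 for the instance above.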



Of course, hierarchical structures are omnipresent in natural languages as well. Think of a list of nouns:

father, mother, son and daughter

Here we may observe a difference with respect to the abstract paradigm: the penultimate element uses a distinct separator, ‘and’, possibly in order to warn the listener of the utterance that the list is approaching its end. Furthermore, the items in the list may be enriched by second-level qualifiers, such as a list of adjectives.
In all sorts of documents and written media, hierarchical lists are extremely common. For instance, a book is a list of chapters, separated by white pages, between a
front and back cover. A chapter is a list of sections, a section is a list of paragraphs,
and so on.

2.5 Context-Free Generative Grammars
We start the study of the family of context-free languages, which plays the central role in compilation. Initially invented by linguists for natural languages in the
1950s, context-free grammars have proved themselves extremely useful in computer
science applications: all existing technical languages have been defined using such
grammars. Moreover, since the early 1960s, efficient algorithms have been found to
analyze, recognize, and translate sentences of context-free languages. This chapter
presents the relevant properties of context-free languages, illustrates their application
by many typical cases, and compares and combines them with regular languages.
Finally, we position the context-free model in the classical hierarchy of grammars and computational models due to Chomsky, and we briefly examine some context-sensitive models.

2.5.1 Limits of Regular Languages
Regular expressions are very practical for describing lists and related paradigms
but fall short of the capacity needed to define other frequently occurring constructs.
A relevant case is that of the block structures (or nested parentheses) present in many technical languages, schematized by:

begin begin begin . . . end begin . . . end . . . end end

where, reading from the left, the first begin . . . end pair to close is the 1st inner block, the next one is the 2nd inner block, and the outermost pair delimits the outer block.

Example 2.29 (simple block structure) Let the alphabet be shortened to { b, e } and consider a quite limited case of nested structure, such that all opening marks precede all closing marks. Clearly, the opening and closing marks must have identical count:

L1 = { bⁿ eⁿ | n ≥ 1 }


We argue that language L1 cannot be defined by a regular expression, but we have to defer the formal proof to a later section. In fact, since each string must have all the letters b to the left of any letter e, either we write an overly general r.e. such as b+ e+ , which unfortunately defines illegal strings like b³ e⁵ , or we write a too restricted r.e. that exhaustively lists a finite sample of strings up to a bounded length. On the other hand, if we comply with the condition that the count of the two letters is the same by writing ( b e )+ , illegal strings like b e b e creep in.
For defining this and other useful languages, regular or not, we move to the formal
model of generative grammars.

2.5.2 Introduction to Context-Free Grammars
A generative grammar or syntax 15 is a set of simple rules that can be repeatedly
applied in order to generate all and only the valid strings.
Example 2.30 (palindromes) The language L, over the alphabet Σ = { a, b }, is first defined, using the reversal operation, as:

L = { u uᴿ | u ∈ Σ ∗ } = { ε, a a, b b, a b b a, b a a b, . . . , a b b b b a, . . . }

It contains the strings having specular symmetry, called palindromes, of even length.
The following grammar G contains three rules:
pal → ε

pal → a pal a

pal → b pal b

The arrow ‘→’ is a metasymbol, exclusively used to separate the left part of a rule
from the right part.
To derive the strings, just replace the symbol pal, called nonterminal, with the
right part of a rule, for instance
pal ⇒ a pal a ⇒ a b pal b a ⇒ a b b pal b b a ⇒ . . .
The derivation process can be chained and terminates when the last string obtained
no longer contains a nonterminal symbol. At that moment, the generation of the
sentence is concluded. We complete the derivation:
a b b pal b b a ⇒ a b b ε b b a = a b b b b a
Incidentally, the language of palindromes is not regular.
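The three rules can be run as a random generator by repeatedly rewriting the nonterminal; a tiny Python sketch, where the letter P stands for pal (the rewriting loop terminates with probability 1, since the rule pal → ε is eventually chosen):

```python
import random

RULES = ["", "aPa", "bPb"]       # pal -> ε | a pal a | b pal b

def derive(rng):
    s = "P"
    while "P" in s:              # replace the nonterminal with a chosen right part
        s = s.replace("P", rng.choice(RULES), 1)
    return s

rng = random.Random(42)
for _ in range(5):
    w = derive(rng)
    assert w == w[::-1] and len(w) % 2 == 0   # an even-length palindrome
    print(repr(w))
```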

15 Sometimes the term grammar has a broader connotation than syntax, as when some rules for computing the meaning of sentences are added to the rules for enumerating them. When necessary, the intended meaning of the term will be made clear.


Next, we enrich the example into a list of palindromes separated by commas, exemplified by sentence ‘a b b a, b b a a b b, a a’. The grammar adds two list-generating
rules to the previous ones:
list → pal ‘ , ’ list        pal → ε
list → pal                   pal → a pal a
                             pal → b pal b

The top left rule says: the concatenation of palindrome, comma, and list produces a
(longer) list; notice that the comma is between quotes, to stress that it is a terminal
symbol like a and b. The other rule says that a list can be made of one palindrome.
Now there are two nonterminal symbols: list and pal. The former is termed axiom
because it defines the intended language. The latter defines certain component substrings, also called constituents of the language, the palindromes.

Example 2.31 (metalanguage of regular expressions) A regular expression that
defines a language over a fixed terminal alphabet, say Σ = { a, b }, is a formula,
i.e., a string over the alphabet Σr.e. = { a, b, ‘ ∪ ’, ‘ ∗ ’, ‘ ε ’, ‘ ( ’, ‘)’ }, where the
quotes around the r.e. metasymbols indicate their use as grammar terminals (concatenation ‘·’ could be added as well). Such formulas in turn are the sentences of the
language Lr.e. defined by the next grammar. Looking back to the definition of r.e. on
p. 18, we write the rules of grammar G r.e. :
1 : expr → ‘ε’
2 : expr → a
3 : expr → b
4 : expr → ‘(’ expr ‘∪’ expr ‘)’
5 : expr → ‘(’ expr expr ‘)’
6 : expr → ‘(’ expr ‘)’ ‘∗’

where numbering is only for reference. A derivation is (quotes are omitted):
expr ⇒₄ ( expr ∪ expr ) ⇒₅ ( ( expr expr ) ∪ expr ) ⇒₂ ( ( a expr ) ∪ expr )
⇒₆ ( ( a ( expr )∗ ) ∪ expr ) ⇒₄ ( ( a ( ( expr ∪ expr ) )∗ ) ∪ expr )
⇒₂ ( ( a ( ( a ∪ expr ) )∗ ) ∪ expr ) ⇒₃ ( ( a ( ( a ∪ b ) )∗ ) ∪ expr )
⇒₃ ( ( a ( ( a ∪ b ) )∗ ) ∪ b ) = e

where the subscript of each arrow is the number of the rule applied.

Since the last string can be interpreted as an r.e. e, it denotes a second language Le
over alphabet Σ, which is the set of the strings starting with letter a, plus string b:
Le = { a, b, a a, a b, a a a, a b a, . . . }
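Membership in Lr.e. is easy to test mechanically: after the first subexpression following ‘(’, the next symbol (‘∪’, ‘)’, or the start of another subexpression) uniquely determines which of rules 4, 5, 6 applies, so a recursive-descent recognizer needs no backtracking. A minimal Python sketch (our own illustration, not taken from the book), with the ASCII star ‘*’ standing in for the metasymbol ‘∗’:

```python
def parse(s):
    """Recognizer for the metalanguage Lr.e. of grammar G r.e.
    Tokens are single characters: a  b  ε  (  )  ∪  *  (blanks ignored)."""
    toks = [c for c in s if not c.isspace()]
    pos = 0

    def expect(c):
        nonlocal pos
        if pos < len(toks) and toks[pos] == c:
            pos += 1
            return True
        return False

    def expr():
        if expect("ε") or expect("a") or expect("b"):   # rules 1, 2, 3
            return True
        if not expect("("):
            return False
        if not expr():                                   # first subexpression
            return False
        if expect("∪"):                                  # rule 4: ( e ∪ e )
            return expr() and expect(")")
        if expect(")"):                                  # rule 6: ( e ) ∗
            return expect("*")
        return expr() and expect(")")                    # rule 5: ( e e )

    return expr() and pos == len(toks)

print(parse("((a((a∪b))*)∪b)"))   # True: the derived sentence e above
print(parse("a∪b"))               # False: rule 4 requires parentheses
```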



A word of caution: this example displays two language levels, since the syntax
defines certain strings to be understood as definitions of other languages. To avoid terminological confusion, we say that the syntax G r.e. stays at the metalinguistic level,
that is, over the linguistic level, or, equivalently, that the syntax is a metagrammar.
To set the two levels apart, it helps to consider the alphabets: at meta-level the
alphabet is Σr.e. = { a, b, ‘ ∪ ’, ‘ ∗ ’, ‘ ε ’, ‘ ( ’, ‘)’ }, whereas the final language
has alphabet Σ = { a, b }, devoid of metasymbols.
An analogy with human language may also clarify the issue. A grammar of Russian
can be written, say, in English. Then it contains both Cyrillic and Latin characters.
Here English is the metalanguage and Russian the final language, which only contains
Cyrillic characters. Another example of language versus metalanguage is provided
by XML, the metanotation used to define a variety of Web document types such as
HTML.
Definition 2.32 (context-free grammar) A context-free (CF) (or type 2 or BNF 16 )
grammar G is defined by four entities:
V: the nonterminal alphabet, a set of symbols termed nonterminals or metasymbols
Σ: the terminal alphabet, a set of symbols termed terminals, disjoint from the preceding set
P: a set of syntactic rules (or productions)
S ∈ V: a particular nonterminal termed axiom

A rule of set P is an ordered pair X → α, with X ∈ V and α ∈ ( V ∪ Σ )∗ . Two or
more rules (n ≥ 2):

X → α1    X → α2    . . .    X → αn

with the same left part X can be concisely grouped in:

X → α1 | α2 | · · · | αn    or    X → α1 ∪ α2 ∪ · · · ∪ αn

We say that the strings α1 , α2 , …, αn are the alternatives of X .
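Definition 2.32 maps directly onto a data structure. A minimal Python sketch (the class and field names are ours), instantiated with the palindrome grammar of Example 2.30:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Grammar:
    """A context-free grammar as the 4-tuple (V, Σ, P, S) of Definition 2.32."""
    V: frozenset        # nonterminal alphabet
    Sigma: frozenset    # terminal alphabet Σ, disjoint from V
    P: tuple            # rules as pairs (X, alpha), alpha a tuple over V ∪ Σ
    S: str              # axiom, a member of V

    def __post_init__(self):
        assert self.V.isdisjoint(self.Sigma), "Σ and V must be disjoint"
        assert self.S in self.V, "the axiom must be a nonterminal"
        assert all(X in self.V and set(alpha) <= self.V | self.Sigma
                   for X, alpha in self.P), "malformed rule"

# The palindrome grammar of Example 2.30: pal → ε | a pal a | b pal b
palindromes = Grammar(
    V=frozenset({"pal"}),
    Sigma=frozenset({"a", "b"}),
    P=(("pal", ()),
       ("pal", ("a", "pal", "a")),
       ("pal", ("b", "pal", "b"))),
    S="pal",
)
```

The checks in `__post_init__` enforce the side conditions of the definition, in particular the disjointness Σ ∩ V = ∅ required below.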



2.5.3 Conventional Grammar Representations
To prevent confusion, the metasymbols ‘→’, ‘|’, ‘∪’, and ε should not be used for
terminal or nonterminal symbols (Example 2.31 is a motivated exception). Moreover, the terminal and nonterminal alphabets must be disjoint, i.e., Σ ∩ V = ∅.

16 Type 2 comes from the Chomsky classification (p. 105). Backus Normal Form, also known as Backus–Naur Form, comes from the names of John Backus and Peter Naur, who pioneered the use of such grammars for defining programming languages.


Table 2.2 Different typographic styles for writing grammar rules

First style:
  Nonterminal: a compound word between angle brackets, e.g., ⟨if sentence⟩
  Terminal: a word written as it is, without any special marks
  Example: ⟨if sentence⟩ → if ⟨cond⟩ then ⟨sentence⟩ else ⟨sentence⟩

Second style:
  Nonterminal: a word written as it is, without any special marks; it may not contain blank spaces, e.g., sentence or sentence_list
  Terminal: a word written in bold, in italic, or quoted, e.g., then, then, or ‘then’
  Example: if_sentence → if cond then sentence else sentence (with the terminals in bold or italic) or if_sentence → ‘if’ cond ‘then’ sentence ‘else’ sentence

Third style:
  Nonterminal: an uppercase Latin letter
  Terminal: a word written as it is, without any special marks
  Example: F → if C then D else D

In professional and scientific practice, different styles are used to represent terminals and nonterminals, as specified in Table 2.2. In the first style, the grammar of
Example 2.30 becomes:
⟨sentence⟩ → ε
⟨sentence⟩ → a ⟨sentence⟩ a
⟨sentence⟩ → b ⟨sentence⟩ b

Alternative rules may be grouped together:

⟨sentence⟩ → ε | a ⟨sentence⟩ a | b ⟨sentence⟩ b
If a technical grammar is large, of the order of a few hundred rules, it should be written with care, to facilitate searching for specific definitions, making changes, and cross-referencing. The nonterminals should be identified by self-explanatory names, and the rules should be divided into sections and numbered for reference. On the other hand, in very simple examples the third style of Table 2.2 is more suitable, that is, short disjoint symbols for terminals and nonterminals.
In this book, for simplicity we often adopt the following style:
• lowercase Latin letters near the beginning of the alphabet { a, b, . . . } for terminal
characters
• uppercase Latin letters { A, B, . . . , Z } for nonterminal symbols
• lowercase Latin letters near the end of the alphabet { . . . , r, s, . . . , z } for strings
over the alphabet Σ, i.e., including only terminals
• lowercase Greek letters { α, β, . . . , ω } for the strings over the combined alphabet
V ∪Σ


Table 2.3 Classification of grammar rule forms

• Terminal: either RP contains only terminals or is the empty string; model → u | ε
• Empty or null: RP is the empty string; model → ε
• Initial or axiomatic: LP is the grammar axiom; model S →
• Recursive: LP occurs in RP; model A → α A β
• Self-nesting or self-embedding: LP occurs in RP between nonempty strings; model A → α A β
• Left-recursive: LP is a prefix of RP; model A → A β
• Right-recursive: LP is a suffix of RP; model A → β A
• Left-right-recursive or two-side recursive: the two cases above together; model A → A β A
• Copy, renaming, or categorization: RP consists of one nonterminal; model A → B
• Identity: LP and RP are identical; model A → A
• Linear: RP contains at most one nonterminal (the rest is terminal); model → u B v | w
• Left-linear or type 3: as the linear case, but the nonterminal is a prefix; model → B v | w
• Right-linear or type 3: as the linear case, but the nonterminal is a suffix; model → u B | w
• Homogeneous normal: RP consists of either n ≥ 2 nonterminals or one terminal; model → A1 . . . An | a
• Chomsky normal or homogeneous of degree 2: RP consists of either two nonterminals or one terminal; model → B C | a
• Greibach normal or real-time: RP is one terminal possibly followed by nonterminals; model → a σ | b
• Operator form: RP consists of two nonterminals separated by one terminal, which is called the operator (more generally, RP is devoid of adjacent nonterminals); model → A a B

Rule Types
In grammar studies, the rules may be classified depending on their form, with the aim
of making the study of language properties more immediate. For future reference, in
Table 2.3 we list some common rule forms along with their technical names. Each
rule form is next schematized, with symbols adhering to the following stipulations:
a and b are terminals; u, v, and w denote strings of terminals and may be empty; A,
B, and C are nonterminals; α and β denote strings containing a mix of terminals and
nonterminals, and may be empty; lastly, σ denotes a string of nonterminals.
The classification of Table 2.3 is based on the form of the right part RP of a rule, except for the recursive forms, whose definition also involves the left part LP. We omit all the rule parts that are irrelevant for the classification. By the Chomsky classification, the left-linear and right-linear forms are also known as type 3 grammars. Most rule
forms listed in Table 2.3 will occur in the book, and the remaining ones are listed for
general reference.
We will see that some of the grammar forms can be forced on any given grammar
and leave the language unchanged. Such forms are called normal.
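Several of the forms in Table 2.3 can be tested mechanically. A minimal Python sketch (the function names are ours), where a rule is a pair (X, α) with α a tuple of terminal and nonterminal symbols:

```python
def is_recursive(rule):
    X, alpha = rule
    return X in alpha                          # LP occurs in RP

def is_left_recursive(rule):
    X, alpha = rule
    return len(alpha) > 0 and alpha[0] == X    # LP is a prefix of RP

def is_right_recursive(rule):
    X, alpha = rule
    return len(alpha) > 0 and alpha[-1] == X   # LP is a suffix of RP

def is_chomsky_normal(rule, nonterminals):
    """RP is either two nonterminals or one terminal."""
    X, alpha = rule
    if len(alpha) == 2:
        return set(alpha) <= nonterminals
    return len(alpha) == 1 and alpha[0] not in nonterminals

V = {"A", "B", "C"}
print(is_left_recursive(("A", ("A", "b"))))    # True:  A → A b
print(is_chomsky_normal(("A", ("B", "C")), V)) # True:  A → B C
```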

2.5.4 Derivation and Language Generation
We reconsider and formalize the notion of string derivation. Let β = δ A η be a string
containing a nonterminal A, where δ and η are any strings, possibly empty. Let A → α
be a rule of grammar G and let γ = δ α η be the string obtained by replacing the
nonterminal A in β with the rule right part α.
The relation between two such strings is called derivation. We say that string β
derives string γ for grammar G, and we write
β=
⇒γ
G

or simply β ⇒ γ when the grammar name is understood. Rule A → α is applied in
such a derivation and string α reduces to nonterminal A.
Now consider a chain of derivations of length n ≥ 0:

β0 ⇒ β1 ⇒ · · · ⇒ βn

which can be shortened to:

β0 ⇒ⁿ βn

If n = 0, for every string β we posit β ⇒⁰ β, that is, the derivation relation is reflexive. To express derivations of any length, we write:

β0 ⇒∗ βn    or    β0 ⇒⁺ βn

if the length of the chain is n ≥ 0 or n ≥ 1, respectively.
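The derivation relation formalized above can be implemented directly. A minimal Python sketch (our own names), where strings over V ∪ Σ are tuples of symbols and a rule is a pair (A, α):

```python
def derive_step(beta, rules, nonterminals):
    """Yield every gamma such that beta ⇒ gamma, i.e., every string
    obtained by replacing one occurrence of a nonterminal A in beta
    with the right part alpha of a rule A → alpha."""
    for i, sym in enumerate(beta):
        if sym in nonterminals:
            for A, alpha in rules:
                if A == sym:
                    yield beta[:i] + tuple(alpha) + beta[i + 1:]

# The palindrome grammar: pal → ε | a pal a | b pal b
rules = [("pal", ()),
         ("pal", ("a", "pal", "a")),
         ("pal", ("b", "pal", "b"))]

# All one-step derivations from the string a pal a:
for gamma in derive_step(("a", "pal", "a"), rules, {"pal"}):
    print(gamma)
# ('a', 'a')  ('a', 'a', 'pal', 'a', 'a')  ('a', 'b', 'pal', 'b', 'a')
```

Chaining `derive_step` until no nonterminal remains enumerates exactly the sentences derivable from the axiom.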
The language generated or defined by a grammar G, starting from nonterminal
A, is the set of the terminal str