
These days I am writing a book on ‘domain-specific language engineering’ for use
in a master’s course at Delft University. The book is about the design and implementation
of domain-specific languages, i.e. the definition of their syntax, static
semantics, and code generators. But it also contains a dose of linguistic reflection,
by studying the phenomenon of defining languages, and of defining languages for
defining languages (which is called meta-modelling these days).
Writing chapters on syntax definition and modeling of languages, takes me
back to my days as a PhD student at the University of Amsterdam. Our quarters
were in the university building at theWatergraafsmeer, which was connected to the
CWI building via a bridge. Since the ASF+SDF group of Paul Klint was divided
over the two locations, meetings required a walk to the other end of the building.
So, I would regularly wander to the CWI part of the building to chat.
[While the third application we learned to use in our Unix course in 1989 was talk, with which
one could synchronously talk with someone else on the internet (the first application was probably
csh and the second email), face-to-face meetings were still the primary mode of communication;
as opposed to the use of IRC to talk to one’s officemate.]
Often I
would look into Jan Heering’s office to say hi, and more often than not would end
up spending the rest of the afternoon discussing research and meta-research.
One of the recurring topics in these conversations was the importance of examples.
Jan was fascinated by the notion of ‘programming by example’, i.e. deriving
a program from a bunch of examples of its expected behaviour, instead of a rigorous
and complete definition for all cases. But the other use of examples was for
validation, a word I didn’t learn until long after writing my thesis.
The culture of the day (and probably location?) was heavily influenced by
mathematics and theoretical computer science. The game was the definition, preferably
algebraically, of the artifacts of interest, and then, possibly proving interesting
properties. The application minded would actually implement stuff. As a language
engineer I was mostly interested in making languages with cool features. The motivation
for these features was often highly abstract. The main test example driving
much of the work on the ASF+SDF MetaEnvironment was creating an interactive
environment for the Pico language (While with variable declarations). The idea
being that once an environment for Pico was realized, creating one for a more realistic
language would be a matter scaling up the Pico definition (mere engineering).
Actually making a language (implementation) and using that to write programs
would be a real test. To be fair, there were specifications of larger languages undertaken,
such as ones of (mini-) ML [8] and Pascal [6]. As a student I had developed a
specification of the syntax and static semantics of the object-oriented programming
language Eiffel [16], but that was so big it was not usable at the Sun workstations
we had at that time.
Time and again, Jan Heering would stress the importance of real examples to
show the relevance of a technique and/or to discover the requirements for a design.
While I thought it was a cool idea, I didn’t have examples. At least not to sell the
design and implementation of SDF2, the syntax definition formalism that turned
out to be the main contribution of my PhD thesis [20].
Design of SDF2
SDF2 is the successor to the syntax definition formalism SDF, originally designed
by Heering, Hendriks, Klint, and Rekers [7], which was the basis for syntax definition
in the MetaEnvironment [10]. SDF was originally designed to be declarative,
high-level, and able to support the definition of real languages. To that end, its
design was actually motivated by a careful investigation of the requirements of the
syntax of programming languages.
The redesign and reimplementation of SDF fitted in the larger scheme of bootstrapping
ASF+SDF, i.e. the definition and implementation of ASF+SDF in itself.
The motivation for that project was in the first place driven by need; the LeLisp language
and Centaur platform in which the first version of the ASF+SDF MetaEnvironment
was implemented was on the verge of becoming extinct. But as a second
system effort, it was also a chance to reconsider the design of the language and
the implementation of ASF+SDF, taking the experience of the first version into account.
Furthermore, bootstrapping is of course the ultimate test for a language for
defining languages.
As a result the redesign of SDF that I undertook was driven mainly by internal
motivations, trying to achieve elegance, efficiency, orthogonality, and simplicity,
while trying to be largely backwards compatible with the existing SDF language.
Thus the syntax of SDF2 was mostly the same as that of SDF, but the underlying
abstract syntax was different; with three main differences.
First, the syntax of the language was organized into a kernel language that provided
the essential expressive power of the language to simplify implementation
and language extensions providing syntactic abstractions (sugar) to make reading
and writing syntax definitions tractable. The normalization (desugaring) of full
SDF2 to Kernel-SDF [17] was using rewrite rules in ASF+SDF. (This design pattern
was inspired by the design of Haskell [12, 9] as a core and extend language,
and since applied in Stratego [3] and WebDSL [22].)
Second, heuristics for disambiguation were replaced with a systematic approach
to disambiguation using disambiguation filters [11]. While this part was
supposed to be the core of my research, it didn’t get very far. However, I developed
a scheme for applying disambiguation filters for associativity and priority
during parser generation [18].
Third, triggered by the work of Salomon and Cormack [14], I redesigned SDF
to fully integrate the definition of lexical syntax and concrete syntax. Instead of
a separate lexical analysis stage and corresponding regular grammar specification
of tokens, context-free grammar productions are used for the definition of lexical
syntax (tokens) as well as the context-free syntax (sentence structure).
What motivated this last change? Wisdom in parsing has it that a separate lexical
analysis phase is indispensable for performance. However, the implementation
of the first version of SDF used GLR parsing to support arbitrary context-free
grammars. To make this useful, (local) ambiguity at the lexical level should be
supported as well. As a result the architecture was quite complex, with a scanner
producing multiple tokens, with potentially unsynchronized token boundaries.
In this context, scannerless parsing produced a considerable simplification of the
implementation. There was no need for a separate language of regular grammars,
no generation of finite automata for lexical analysis, and no difficult interface between
scanner and parser to support lexical ambiguities. Furthermore, lexical disambiguation
heuristics such as longest match and prefer keywords were mostly
right, but could also produce the wrong token list. Scannerless parsing forced a
declarative definition of lexical disambiguation.
SGLR
Thus, scannerless parsing was motivated mostly by arguments
pertaining to the implementation rather than the application of SDF. It turned out
that the use of the Generalized-LR algorithm [15, 13] solves many of the issues
of lexical disambiguation. Since the parsing context dictates the possible ‘tokens’,
many ambiguities at the lexical level are automatically solved.
However, the approach does not solve all lexical ambiguities. Solving these
ambiguities with post-parse filters is not feasible, since the parse forest explodes
with too many ambiguities. Basic lexical ambiguities need to be solved during
parsing. Salomon and Cormack [14] had introduced the notions of follow restrictions
and reject rules to formulate longest match and prefer keywords disambiguation
rules. Their approach to implementing these was to adapt the parser generator
to include these concepts into the parse table. However, the use of arbitrary grammar
symbols for follow restrictions and reject rules was not succesful.
In the SGLR algorithm follow restrictions do work out through a simplification
of the grammar formalism. Instead of considering arbitrary non-terminals to exclude
from following a (lexical) non-terminal, only a character class (or sequence
of character classes) is used as follow restriction. Thus, follow restrictions can be
implemented simply by restricting the follow set of a reduction, avoiding ambiguities
in the parse table. Reject productions are more problematic. By carefully
ordering the order in which reductions are considered, SGLR avoids applying reductions
from states that could end up being rejected. The approach works in
practice, but expression in the parse table would be more satisfactory.
The paper I wrote about SGLR [19] was not accepted for publication. This
was partly due to the prototypical nature of the implementation. Reviewers were
skeptical of a parsing solution without separate scanner. However, the more problematic
issue with the paper was that it did not provide any compelling motivation
for the introduction of scannerless parsing. The paper provides technical considerations
such as integrated uniform grammar formalism, disambiguation by context,
and conservation of lexical structure. The examples used to illustrate these features
are rather lame, e.g. subrange types vs floating point numbers in Pascal. Why introduce
a new technique with worse performance, if it cannot be used for anything
that could not be done with the old techniques?
Language Embedding
It turned out that there was a killer application for scannerless parsing, but I realized
this only much later. In the paper meta-programming with concrete object
syntax [21], a general architecture for embedding an object language into a metalanguage
is described in response to a challenge by Tim Sheard. Of course, the
technique was used in ASF+SDF, but somehow it did not seem like an example
that provided compelling motivation for scannerless parsing. Application to other
combinations of languages made it interesting for a wider audience. The approach
relies crucially on the modularity of SDF and SGLR. Since meta-language and
object-language typically have a different syntax, in particular a different lexical
syntax, language composition is not an option with traditional parsing technogies
(LL,LR) since these are not closed under composition. Due to disambiguation by
context different lexical syntaxes can co-exist without problems in SDF.
Meta-programming with concrete object syntax brought on a host of applications
of language embedding, which I explored with Martin Bravenboer [1]. In
MetaBorg [5] domain-specific languages abstracting over APIs are embedded in a
general-purpose language. In StringBorg [2] so called ‘string languages’ are embedded
to ensure that user input does not lead to injection attacks. AspectJ, an
extension of Java, is a composition of several sub-languages with different lexical
syntax, which we described with a declarative syntax definition in SDF [4] (where
existing implementations had to rely on hacks).
Example-Driven Research
The lesson that I learned from the SGLR experience, was to take examples serious.
Examples are crucial as witnesses to the anomalies of existing approaches and
techniques. Addressing such anomalies provides compelling motivation for a new
approach or technique; this part may be more important than a rigorous description
of the solution. For example, the design of WebDSL, a domain-specific language
for web application, was based on an inductive domain analysis process [22], i.e.
starting with a study of existing best practices in web application development.
The design of WebDSL solves a number of weaknesses in the state-of-the-art. In
advising my PhD students I use the lesson I learned from Jan Heering: provide
good examples.
There is another lesson in the SGLR experience, however. Based on the applications
we developed at the time there was no clear motivation for scannerless
parsing. From that perspective we should have never gone in that direction. SGLR
was developed based on internal motivations such as elegance and simplicity. This
turned to enable a host of applications unforeseen at the time of development. Thus,
there should always be room for research that explores new techniques just because
it seems cool, not necessarily useful.
References
[1] M. Bravenboer. Exercises in Free Syntax. Syntax Definition, Parsing, and
Assimilation of Language Conglomerates. PhD thesis, Utrecht University,
Utrecht, The Netherlands, January 2008.
[2] M. Bravenboer, E. Dolstra, and E. Visser. Preventing injection attacks with
syntax embeddings. A host and guest language independent approach. In
J. Lawall, editor, Generative Programming and Component Engineering
(GPCE 2007), pages 3–12, New York, NY, USA, October 2007. ACM.
[3] M. Bravenboer, K. T. Kalleberg, R. Vermaas, and E. Visser. Stratego/XT 0.17.
A language and toolset for program transformation. Science of Computer Programming,
72(1-2):52–70, June 2008. Special issue on experimental software
and toolkits.
[4] M. Bravenboer, E. Tanter, and E. Visser. Declarative, formal, and extensible
syntax definition for AspectJ. A case for scannerless generalized-lr parsing.
InW. R. Cook, editor, Proceedings of the 21th ACM SIGPLAN Conference on
Object-Oriented Programing, Systems, Languages, and Applications (OOPSLA
2006), pages 209–228, Portland, Oregon, USA, October 2006. ACM
Press.
[5] M. Bravenboer and E. Visser. Concrete syntax for objects. Domain-specific
language embedding and assimilation without restrictions. In D. C. Schmidt,
editor, Proceedings of the 19th ACM SIGPLAN Conference on Object-
Oriented Programing, Systems, Languages, and Applications (OOPSLA
2004), pages 365–383, Vancouver, Canada, October 2004. ACM Press.
[6] A. Van Deursen. The static semantics of pascal. In A. van Deursen, J. Heering,
and P. Klint, editors, Language Prototyping. An Algebraic Specification
Approach, volume 5 of AMAST Series in Computing, chapter 2, pages 31–52.
World Scientific, Singapore, September 1996.
[7] J. Heering, P. R. H. Hendriks, P. Klint, and J. Rekers. The syntax definition
formalism SDF – reference manual. SIGPLAN Notices, 24(11):43–75, 1989.
[8] P. R. H. Hendriks. Typechecking Mini-ML. In J. Bergstra, J. Heering, and
P. Klint, editors, Algebraic Specification, ACM Press Frontier Series, pages
299–337. The ACM Press in co-operation with Addison-Wesley, 1989. Chapter
7.
[9] P. Hudak, S. Peyton Jones, and P.Wadler, editors. Report on the Programming
Language Haskell. A Non-strict, Purely Functional Language. (Version 1.2),
volume 27. May 1992. ACM SIGPLAN Notices.
[10] P. Klint. A meta-environment for generating programming environments.
ACM Transactions on Software Engineering and Methodology, 2(2):176–
201, April 1993.
[11] P. Klint and E. Visser. Using filters for the disambiguation of context-free
grammars. In G. Pighizzini and P. San Pietro, editors, Proc. ASMICS Workshop
on Parsing Theory, pages 1–20, Milano, Italy, October 1994. Tech. Rep.
126–1994, Dipartimento di Scienze dell’Informazione, Universit`a di Milano.
[12] S. L. Peyton Jones. The Implementation of Functional Programming Languages.
Prentice-Hall International Series in Computer Science. Prentice-
Hall, Englewood Cliffs, New Jersey, 1987.
[13] J. Rekers. Parser Generation for Interactive Environments. PhD thesis, University
of Amsterdam, 1992.
[14] D. J. Salomon and G. V. Cormack. Scannerless NSLR(1) parsing of programming
languages. SIGPLAN Notices, 24(7):170–178, 1989.
[15] M. Tomita. Efficient Parsing for Natural Languages. A Fast Algorithm for
Practical Systems. Kluwer Academic Publishers, 1985.
[16] E. Visser. Syntax and static semantics of Eiffel. A case study in algebraic
specification techniques. Unpublished technical report, December 1992.
[17] E. Visser. A family of syntax definition formalisms. In M. G. J. van den Brand
et al., editors, ASF+SDF 1995. A Workshop on Generating Tools from Algebraic
Specifications, pages 89–126. Technical Report P9504, Programming
Research Group, University of Amsterdam, May 1995.
[18] E. Visser. A case study in optimizing parsing schemata by disambiguation filters.
In International Workshop on Parsing Technology (IWPT 1997), pages
210–224, Boston, USA, September 1997. Massachusetts Institute of Technology.
[19] E. Visser. Scannerless generalized-LR parsing. Technical Report P9707,
Programming Research Group, University of Amsterdam, July 1997.
[20] E. Visser. Syntax Definition for Language Prototyping. PhD thesis, University
of Amsterdam, September 1997.
[21] E. Visser. Meta-programming with concrete object syntax. In D. Batory,
C. Consel, and W. Taha, editors, Generative Programming and Component
Engineering (GPCE 2002), volume 2487 of Lecture Notes in Computer Science,
pages 299–315, Pittsburgh, PA, USA, October 2002. Springer-Verlag.
[22] E. Visser. WebDSL: A case study in domain-specific language engineering.
In R. L¨ammel, J. Visser, and J. Saraiva, editors, International Summer
School on Generative and Transformational Techniques in Software Engineering
(GTTSE 2007), volume 5235 of Lecture Notes in Computer Science,
pages 291–373, Heidelberg, October 2008. Springer.