CsParser Repository
Rule-based parsing for C++, C#, Java, ...
Quick Status
  Code functions correctly : no known defects
  Demonstration code       : yes
  Documentation            : yes - needs more
  Test cases               : none, but planned
  Static library           : no, but planned
  Build requires           : C# 7.0
  Planned design changes   : Refactor Rules and Actions into language-specific files
Fig 1. Parser Static Structure
1.0 Concept
Parsing is the process of discovering and classifying the parts of some complex thing. Our interest is in parsing
computer languages, particularly C, C++, Java, and C#. In this context, parsing is a form of syntactic analysis,
which may be based on a formal reduction using a representation like BNF, or on an ad-hoc process.
There are many reasons you may wish to parse source code beyond compiling its text. For example:
- Building code analysis tools
- Searching for content in, or ownership of, code files
- Evaluating code metrics
- Compiling "little embedded languages"
Many parsing tools have been written, including ANTLR, Bison, Lex, and Spirit. There is a long history of successful
use of these tools, so why would we consider writing yet another parser?
Using existing parsers for the fairly small tasks in which we are interested seems like killing flies with a
sledgehammer - too much work and not enough reward. Our goals are to build a facility that is quick to deploy,
can be easily ported to different platforms, and whose parsing model can be built incrementally as we
learn more about the work we are trying to accomplish.
2.0 Design
Fig 1. shows the logical design we've used for this parser. At the bottom is a lexical scanner
composed of a tokenizer and a semi-expression processor. The tokenizer collects words from a character stream, and
the SemiExp processor builds sequences of tokens that hold just the right amount of information to analyze a single
grammatical construct, e.g., a function or class definition.
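The scanner's behavior can be sketched as follows. This is an illustrative sketch, not the repository's actual API: the names Scanner, Tokenize, and SemiExps are assumptions, and a real tokenizer would also split punctuation like parentheses and handle comments and string literals. The key idea shown is that a semi-expression is the run of tokens up to and including a terminator ('{', '}', or ';'):

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch, not the repository's actual API.
// Tokenize splits a source string into words and terminator tokens;
// SemiExps groups tokens up to and including each terminator.
public static class Scanner
{
    static readonly char[] terminators = { '{', '}', ';' };

    public static List<string> Tokenize(string source)
    {
        var tokens = new List<string>();
        var word = "";
        foreach (char c in source)
        {
            if (char.IsWhiteSpace(c) || Array.IndexOf(terminators, c) >= 0)
            {
                if (word.Length > 0) { tokens.Add(word); word = ""; }
                if (!char.IsWhiteSpace(c)) tokens.Add(c.ToString());
            }
            else word += c;
        }
        if (word.Length > 0) tokens.Add(word);
        return tokens;
    }

    // Collect tokens into semi-expressions, each ending at a terminator.
    public static List<List<string>> SemiExps(string source)
    {
        var result = new List<List<string>>();
        var current = new List<string>();
        foreach (string tok in Tokenize(source))
        {
            current.Add(tok);
            if (tok.Length == 1 && Array.IndexOf(terminators, tok[0]) >= 0)
            {
                result.Add(current);
                current = new List<string>();
            }
        }
        if (current.Count > 0) result.Add(current);
        return result;
    }
}
```

For example, `Scanner.SemiExps("void f() { return; }")` yields three semi-expressions, the first ending with "{" - exactly enough context to recognize the start of a function definition.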
In the middle you find the parsing machinery. Parser simply routes incoming semiExps to its contained rules.
Each rule derives from IRule and detects one particular type of grammatical construct, e.g., a function definition.
Since all rules derive from IRule, you can add new derived rules to the parser's rule collection via Liskov
substitution.
Each rule has one or more actions, derived from IAction, that determine what happens when the rule matches an incoming
semiExp. For example, an action may push a node onto the ScopeStack instance in the Repository, indicating the
start of a function scope.
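The rule/action machinery described above can be sketched like this. The interface names IRule and IAction follow the text, but the member signatures, the ARule base class, and the DetectFunction and PushFunctionScope examples are assumptions for illustration, not the repository's actual code:

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch of the rule/action machinery; signatures are assumed.
public interface IAction
{
    void DoAction(List<string> semiExp);
}

public interface IRule
{
    bool Test(List<string> semiExp);   // detect one grammatical construct
    void Add(IAction action);
}

// Base class: holds actions and fires them all when a rule matches.
public abstract class ARule : IRule
{
    readonly List<IAction> actions = new List<IAction>();
    public void Add(IAction action) { actions.Add(action); }
    protected void DoActions(List<string> semiExp)
    {
        foreach (IAction a in actions) a.DoAction(semiExp);
    }
    public abstract bool Test(List<string> semiExp);
}

// Example rule: treat a semiExp that ends with "{" and contains a "("
// as the start of a function scope (a deliberate oversimplification).
public class DetectFunction : ARule
{
    public override bool Test(List<string> semiExp)
    {
        bool match = semiExp.Count > 0
            && semiExp[semiExp.Count - 1] == "{"
            && semiExp.Exists(t => t.Contains("("));
        if (match) DoActions(semiExp);
        return match;
    }
}

// Example action: record the new scope on a stack, standing in for the
// ScopeStack held by the Repository.
public class PushFunctionScope : IAction
{
    public Stack<string> ScopeStack = new Stack<string>();
    public void DoAction(List<string> semiExp) { ScopeStack.Push("function"); }
}

// The parser just routes each semiExp to every registered rule.
public class Parser
{
    readonly List<IRule> rules = new List<IRule>();
    public void Add(IRule rule) { rules.Add(rule); }
    public void Parse(List<string> semiExp)
    {
        foreach (IRule rule in rules) rule.Test(semiExp);
    }
}
```

Note the decoupling this buys: Parser depends only on IRule, and rules depend only on IAction, so new constructs and new analyses can be added without touching the parsing loop.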
Note that the parser doesn't need to know what its rules do, and the rules don't need to know what their
actions do. The parser simply collects semiExps from the scanner and feeds them to its rules. When a rule matches,
it feeds the semiExp to each of its actions. An action usually mutates the state of the Repository,
very often building an Abstract Syntax Tree (AST).
For code analysis, the AST can be quite simple. Since it isn't responsible for generating code, every AST node
can have the same type, used only to build a representation of a source file's static structure - or the possibly
small part of its structure needed for the type of analysis we are executing.
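A single node type for such an analysis AST might look like the sketch below. The field names and the DescendantCount metric are assumptions chosen for illustration, not the repository's actual node layout:

```csharp
using System.Collections.Generic;

// Illustrative sketch: one node type serves every grammatical construct,
// distinguished only by a type string. Field names are assumed.
public class ASTNode
{
    public string Type;        // e.g., "namespace", "class", "function"
    public string Name;
    public int StartLine;
    public List<ASTNode> Children = new List<ASTNode>();

    // A typical metric computed over the finished tree, e.g., for
    // measuring the size or complexity of a scope.
    public int DescendantCount()
    {
        int count = 0;
        foreach (ASTNode child in Children)
            count += 1 + child.DescendantCount();
        return count;
    }
}
```

Because every node is the same type, actions can build the tree without knowing which construct they are recording, and analyses are just walks over Children.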
3.0 Build
All code was built with Visual Studio, Community Edition - 2019, and tested on Windows 10.
4.0 Status
Code in this repository will be stable for a while. I intend, eventually, to simplify the demonstration code
and build process, and to add a regression test harness.
5.0 Resources
BlogParser