CsParser Repository
Rule-based parsing for C++, C#, Java, ...
Quick Status
  Code functions correctly : no known defects
  Demonstration code       : yes
  Documentation            : yes - needs more
  Test cases               : none, but planned
  Static library           : no, but planned
  Build requires           : C# 7.0
  Planned design changes   : Refactor Rules and Actions into language-specific files
Fig 1. Parser Static Structure
1.0 Concept
Parsing is the process of discovering and classifying the parts of some complex thing. Our interest is in parsing
computer languages, particularly C, C++, Java, and C#. In this context, parsing is a form of syntactic analysis,
which may be based on a formal reduction using a representation like BNF, or on an ad-hoc process.
There are many reasons you may wish to parse source code beyond compiling its text. For example:
- Building code analysis tools
- Searching for content in, or ownership of, code files
- Evaluating code metrics
- Compiling "little embedded languages"
Many parsing tools have been written, including ANTLR, Bison, Lex, and Spirit. There is a long history of successful
use of these tools, so why would we consider writing yet another parser?
Using existing parsers for the fairly small tasks in which we are interested seems like killing flies with a
sledgehammer - too much work and not enough reward. Our goals are to build a facility that is quick to deploy,
can be easily ported to different platforms, and whose parsing model can be built incrementally as we
learn more about the work we are trying to accomplish.
2.0 Design
Fig 1. shows the logical design we've used for this parser. At the bottom is a lexical scanner
composed of a tokenizer and a semi-expression processor. The tokenizer collects words from a character stream, and
the SemiExp processor builds sequences of tokens that hold just the right amount of information to analyze a single
grammatical construct, e.g., a function or class definition.
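The scanner's behavior can be sketched as follows. This is an illustrative sketch, not the repository's actual API: the names Scanner, Tokenize, and SemiExps are assumptions, and a real tokenizer would also split punctuation like parentheses and handle comments and string literals. The key idea shown is that a semi-expression is the run of tokens up to and including a terminator ('{', '}', or ';'):

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch, not the repository's actual API.
// Tokenize splits a source string into words and terminator tokens;
// SemiExps groups tokens up to and including each terminator.
public static class Scanner
{
    static readonly char[] terminators = { '{', '}', ';' };

    public static List<string> Tokenize(string source)
    {
        var tokens = new List<string>();
        var word = "";
        foreach (char c in source)
        {
            if (char.IsWhiteSpace(c) || Array.IndexOf(terminators, c) >= 0)
            {
                if (word.Length > 0) { tokens.Add(word); word = ""; }
                if (!char.IsWhiteSpace(c)) tokens.Add(c.ToString());
            }
            else word += c;
        }
        if (word.Length > 0) tokens.Add(word);
        return tokens;
    }

    // Collect tokens into semi-expressions, each ending at a terminator.
    public static List<List<string>> SemiExps(string source)
    {
        var result = new List<List<string>>();
        var current = new List<string>();
        foreach (string tok in Tokenize(source))
        {
            current.Add(tok);
            if (tok.Length == 1 && Array.IndexOf(terminators, tok[0]) >= 0)
            {
                result.Add(current);
                current = new List<string>();
            }
        }
        if (current.Count > 0) result.Add(current);
        return result;
    }
}
```

For example, `Scanner.SemiExps("void f() { return; }")` yields three semi-expressions, the first ending with "{" - exactly enough context to recognize the start of a function definition.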
In the middle you find the parsing machinery. Parser simply routes incoming semiExps to its contained rules.
Each rule derives from IRule and detects one particular type of grammatical construct, e.g., a function definition.
Since all rules derive from IRule, you can add new derived rules to the parser's rule collection via Liskov
substitution.
Each rule has one or more actions, derived from IAction, that determine what happens when the rule matches an incoming
semiExp. For example, an action may push a node onto the ScopeStack instance in the Repository, indicating the
start of a function scope.
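The rule/action machinery described above can be sketched like this. The interface names IRule and IAction follow the text, but the member signatures, the ARule base class, and the DetectFunction and PushFunctionScope examples are assumptions for illustration, not the repository's actual code:

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch of the rule/action machinery; signatures are assumed.
public interface IAction
{
    void DoAction(List<string> semiExp);
}

public interface IRule
{
    bool Test(List<string> semiExp);   // detect one grammatical construct
    void Add(IAction action);
}

// Base class: holds actions and fires them all when a rule matches.
public abstract class ARule : IRule
{
    readonly List<IAction> actions = new List<IAction>();
    public void Add(IAction action) { actions.Add(action); }
    protected void DoActions(List<string> semiExp)
    {
        foreach (IAction a in actions) a.DoAction(semiExp);
    }
    public abstract bool Test(List<string> semiExp);
}

// Example rule: treat a semiExp that ends with "{" and contains a "("
// as the start of a function scope (a deliberate oversimplification).
public class DetectFunction : ARule
{
    public override bool Test(List<string> semiExp)
    {
        bool match = semiExp.Count > 0
            && semiExp[semiExp.Count - 1] == "{"
            && semiExp.Exists(t => t.Contains("("));
        if (match) DoActions(semiExp);
        return match;
    }
}

// Example action: record the new scope on a stack, standing in for the
// ScopeStack held by the Repository.
public class PushFunctionScope : IAction
{
    public Stack<string> ScopeStack = new Stack<string>();
    public void DoAction(List<string> semiExp) { ScopeStack.Push("function"); }
}

// The parser just routes each semiExp to every registered rule.
public class Parser
{
    readonly List<IRule> rules = new List<IRule>();
    public void Add(IRule rule) { rules.Add(rule); }
    public void Parse(List<string> semiExp)
    {
        foreach (IRule rule in rules) rule.Test(semiExp);
    }
}
```

Note the decoupling this buys: Parser depends only on IRule, and rules depend only on IAction, so new constructs and new analyses can be added without touching the parsing loop.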
Note that the parser doesn't need to know what its rules do, and the rules don't need to know what their
actions do. The parser simply collects semiExps from the scanner and feeds them to its rules. When a rule matches,
it feeds the semiExp to each of its actions. An action usually mutates the state of the Repository,
very often building an Abstract Syntax Tree (AST).
For code analysis, the AST can be quite simple. Since it isn't responsible for generating code, every AST node
can have the same type, used only to build a representation of a source file's static structure - or the possibly
small part of its structure needed for the type of analysis we are executing.
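A single node type for such an analysis AST might look like the sketch below. The field names and the DescendantCount metric are assumptions chosen for illustration, not the repository's actual node layout:

```csharp
using System.Collections.Generic;

// Illustrative sketch: one node type serves every grammatical construct,
// distinguished only by a type string. Field names are assumed.
public class ASTNode
{
    public string Type;        // e.g., "namespace", "class", "function"
    public string Name;
    public int StartLine;
    public List<ASTNode> Children = new List<ASTNode>();

    // A typical metric computed over the finished tree, e.g., for
    // measuring the size or complexity of a scope.
    public int DescendantCount()
    {
        int count = 0;
        foreach (ASTNode child in Children)
            count += 1 + child.DescendantCount();
        return count;
    }
}
```

Because every node is the same type, actions can build the tree without knowing which construct they are recording, and analyses are just walks over Children.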
3.0 Build
All code was built with Visual Studio, Community Edition - 2019, and tested on Windows 10.
4.0 Status
Code in this repository will be stable for a while. I intend, eventually, to simplify the demonstration code
and build process, and to add a regression test harness.
5.0 Resources
BlogParser