Fig 2. SemiExpression Output
1.0 Contents
CppLexicalScanner has two packages: Tokenizer and SemiExp.
Tokenizer extracts words from a stream (file or string). The following events end character
collection for the current token:
- whitespace (always a token boundary)
- text-punctuator boundary
- comment boundary
- quoted string boundary
- quoted char boundary
- encountering a character that is specified to be a one-character token
- encountering a pair of characters specified to be a two-character token
It optionally removes both C-style and C++-style comments; removal is the default.
Tokenizer is implemented using the "State Pattern".
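The sketch below illustrates that idea: a minimal state-pattern tokenizer in which each concrete
state collects one kind of token. All names here (ConsumeState, EatWhitespace, EatAlphanum,
EatPunctuator, Toker, getTok) are illustrative assumptions, not the package's documented
interface, and the sketch ignores comments, quoted strings, and two-character tokens.

    #include <cctype>
    #include <istream>
    #include <string>

    // Abstract state: each concrete state knows how to collect one kind of token.
    class ConsumeState {
    public:
      virtual ~ConsumeState() = default;
      virtual std::string eatChars(std::istream& in) = 0;
    };

    // Discards whitespace; yields no token.
    class EatWhitespace : public ConsumeState {
    public:
      std::string eatChars(std::istream& in) override {
        while (in.peek() != EOF && std::isspace(in.peek())) in.get();
        return "";
      }
    };

    // Collects a run of alphanumeric characters and underscores.
    class EatAlphanum : public ConsumeState {
    public:
      std::string eatChars(std::istream& in) override {
        std::string tok;
        while (in.peek() != EOF && (std::isalnum(in.peek()) || in.peek() == '_'))
          tok += static_cast<char>(in.get());
        return tok;
      }
    };

    // Treats any other character as a one-character punctuator token.
    class EatPunctuator : public ConsumeState {
    public:
      std::string eatChars(std::istream& in) override {
        return std::string(1, static_cast<char>(in.get()));
      }
    };

    // Toker peeks at the stream, picks the matching state, and delegates to it.
    class Toker {
    public:
      explicit Toker(std::istream& in) : in_(in) {}
      std::string getTok() {
        while (in_.peek() != EOF) {
          int c = in_.peek();
          ConsumeState* state =
              std::isspace(c)                 ? static_cast<ConsumeState*>(&ws_)
              : (std::isalnum(c) || c == '_') ? static_cast<ConsumeState*>(&word_)
                                              : static_cast<ConsumeState*>(&punct_);
          std::string tok = state->eatChars(in_);
          if (!tok.empty()) return tok;   // whitespace produced no token; keep going
        }
        return "";                        // end of stream
      }
    private:
      std::istream& in_;
      EatWhitespace ws_;
      EatAlphanum word_;
      EatPunctuator punct_;
    };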
SemiExp collects token sequences that are useful for detecting a single grammatical construct.
It ends a collection when a terminator token is encountered: "{", "}", ";", or "\n" when the
collection starts with "#".
SemiExp token sequences help us detect the beginning and ending of program scopes, class and
function definitions, and the occurrence of compiler directives.
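The following sketch shows how such a collection loop might look, reusing the hypothetical
Toker::getTok() from the sketch above. The SemiExp class name and its get()/tokens() members
are assumptions for illustration, and the newline terminator only matters if the tokenizer
returns "\n" as a token, which the simplified Toker above does not.

    #include <string>
    #include <vector>

    class SemiExp {
    public:
      explicit SemiExp(Toker& toker) : toker_(toker) {}

      // Collect tokens until a terminator: "{", "}", ";", or "\n" when the
      // collection starts with "#" (a compiler directive).
      bool get() {
        tokens_.clear();
        std::string tok;
        while (!(tok = toker_.getTok()).empty()) {
          tokens_.push_back(tok);
          bool directive = tokens_.front() == "#";
          if (tok == "{" || tok == "}" || tok == ";" || (directive && tok == "\n"))
            return true;                 // terminator ends the collection
        }
        return !tokens_.empty();         // last partial collection, if any
      }
      const std::vector<std::string>& tokens() const { return tokens_; }
    private:
      Toker& toker_;
      std::vector<std::string> tokens_;
    };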
2.0 Operations
You can see the output of the tokenizer in Fig 1. Its purpose is to remove whitespace and to apply
some intricate rules that decide where to break tokens in punctuation sequences. Fortunately, these
rules are essentially the same for C++, C#, and Java.
The output of SemiExp groups tokens according to the semi-expression termination rules described above.
You can see from the output
that these token collections are just the right size for parsing programming language grammatical
constructs.
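A hypothetical driver, built on the two sketches above, shows how the packages compose; the file
name and member functions are illustrative, not the package's documented API.

    #include <fstream>
    #include <iostream>

    int main() {
      std::ifstream src("SomeSource.cpp");   // any C++, C#, or Java source file
      if (!src) return 1;

      Toker toker(src);
      SemiExp semi(toker);
      while (semi.get()) {                   // one grammatical construct at a time
        for (const auto& tok : semi.tokens())
          std::cout << tok << ' ';
        std::cout << '\n';
      }
    }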
3.0 Build
CppLexicalScanner was built with Visual Studio Community Edition - 2019, and tested on Windows 10.
4.0 Status
Tokenizer and SemiExp are both used in CppParser and CppCodeAnalyzer. I've used these facilities
in my own code-analysis work and have, for several years, assigned projects to Computer Engineering
classes that require their use. Several of my doctoral advisees have also used them as part of their research.
These facilities have no known defects.