Fig 2. SemiExpression Output
1.0 Contents
CppLexicalScanner has two packages: Tokenizer and SemiExp.
Tokenizer extracts words from a stream (file or string). The following events end character
collection for the current token:
- whitespace (always a token boundary)
- text-punctuator boundary
- comment boundary
- quoted string boundary
- quoted char boundary
- encountering a character that is specified to be a one-character token
- encountering a pair of characters specified to be a two-character token
It optionally removes both C-style and C++-style comments; removal is the default.
Tokenizer is implemented using the "State Pattern".
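The sketch below illustrates that idea: a minimal state-pattern tokenizer in which each concrete
state collects one kind of token. All names here (ConsumeState, EatWhitespace, EatAlphanum,
EatPunctuator, Toker, getTok) are illustrative assumptions, not the package's documented
interface, and the sketch ignores comments, quoted strings, and two-character tokens.

    #include <cctype>
    #include <istream>
    #include <string>

    // Abstract state: each concrete state knows how to collect one kind of token.
    class ConsumeState {
    public:
      virtual ~ConsumeState() = default;
      virtual std::string eatChars(std::istream& in) = 0;
    };

    // Discards whitespace; yields no token.
    class EatWhitespace : public ConsumeState {
    public:
      std::string eatChars(std::istream& in) override {
        while (in.peek() != EOF && std::isspace(in.peek())) in.get();
        return "";
      }
    };

    // Collects a run of alphanumeric characters and underscores.
    class EatAlphanum : public ConsumeState {
    public:
      std::string eatChars(std::istream& in) override {
        std::string tok;
        while (in.peek() != EOF && (std::isalnum(in.peek()) || in.peek() == '_'))
          tok += static_cast<char>(in.get());
        return tok;
      }
    };

    // Treats any other character as a one-character punctuator token.
    class EatPunctuator : public ConsumeState {
    public:
      std::string eatChars(std::istream& in) override {
        return std::string(1, static_cast<char>(in.get()));
      }
    };

    // Toker peeks at the stream, picks the matching state, and delegates to it.
    class Toker {
    public:
      explicit Toker(std::istream& in) : in_(in) {}
      std::string getTok() {
        while (in_.peek() != EOF) {
          int c = in_.peek();
          ConsumeState* state =
              std::isspace(c)                 ? static_cast<ConsumeState*>(&ws_)
              : (std::isalnum(c) || c == '_') ? static_cast<ConsumeState*>(&word_)
                                              : static_cast<ConsumeState*>(&punct_);
          std::string tok = state->eatChars(in_);
          if (!tok.empty()) return tok;   // whitespace produced no token; keep going
        }
        return "";                        // end of stream
      }
    private:
      std::istream& in_;
      EatWhitespace ws_;
      EatAlphanum word_;
      EatPunctuator punct_;
    };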
SemiExp collects token sequences that are useful for detecting a single grammatical construct.
It ends a collection when a terminator token is encountered: "{", "}", ";", or "\n" when the
collection starts with "#".
SemiExp token sequences help us detect the beginning and ending of program scopes, class and
function definitions, and the occurrence of compiler directives.
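The following sketch shows how such a collection loop might look, reusing the hypothetical
Toker::getTok() from the sketch above. The SemiExp class name and its get()/tokens() members
are assumptions for illustration, and the newline terminator only matters if the tokenizer
returns "\n" as a token, which the simplified Toker above does not.

    #include <string>
    #include <vector>

    class SemiExp {
    public:
      explicit SemiExp(Toker& toker) : toker_(toker) {}

      // Collect tokens until a terminator: "{", "}", ";", or "\n" when the
      // collection starts with "#" (a compiler directive).
      bool get() {
        tokens_.clear();
        std::string tok;
        while (!(tok = toker_.getTok()).empty()) {
          tokens_.push_back(tok);
          bool directive = tokens_.front() == "#";
          if (tok == "{" || tok == "}" || tok == ";" || (directive && tok == "\n"))
            return true;                 // terminator ends the collection
        }
        return !tokens_.empty();         // last partial collection, if any
      }
      const std::vector<std::string>& tokens() const { return tokens_; }
    private:
      Toker& toker_;
      std::vector<std::string> tokens_;
    };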
2.0 Operations
You can see the output of the tokenizer in Fig 1. Its purpose is to remove whitespace and to apply
some intricate rules that decide where to break tokens in punctuation sequences. Fortunately, these
rules are essentially the same for C++, C#, and Java.
The output of SemiExp groups tokens according to the semi-expression termination rules described above.
You can see from the output
that these token collections are just the right size for parsing programming language grammatical
constructs.
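A hypothetical driver, built on the two sketches above, shows how the packages compose; the file
name and member functions are illustrative, not the package's documented API.

    #include <fstream>
    #include <iostream>

    int main() {
      std::ifstream src("SomeSource.cpp");   // any C++, C#, or Java source file
      if (!src) return 1;

      Toker toker(src);
      SemiExp semi(toker);
      while (semi.get()) {                   // one grammatical construct at a time
        for (const auto& tok : semi.tokens())
          std::cout << tok << ' ';
        std::cout << '\n';
      }
    }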
3.0 Build
CppLexicalScanner was built with Visual Studio Community Edition - 2019, and tested on Windows 10.
4.0 Status
Tokenizer and SemiExp are both used in CppParser and CppCodeAnalyzer. I've used these facilities
in my own code-analysis work and have, for several years, assigned projects to Computer Engineering
classes that require their use. Several of my doctoral advisees have also used them as part of their research.
These facilities have no known defects.