12/03/2022

CppLexicalScanner Repository

Contains Tokenizer and SemiExpression classes

Quick Status
  • Code functions correctly: no known defects
  • Demonstration code: yes
  • Documentation: yes
  • Test cases: no, but planned
  • Static library: no, but planned
  • Build requires: C++17 option
  • Planned design changes: None
Fig 1. Tokenizer Output
Fig 2. SemiExpression Output

1.0 Contents

CppLexicalScanner has two packages: Tokenizer and SemiExp. Tokenizer extracts words, called tokens, from a stream (file or string). Whitespace is always a token boundary, and several other events also end the collection of characters for the current token. The complete set is:
  • Whitespace
  • text-punctuator boundary
  • comment boundary
  • quoted string boundary
  • quoted char boundary
  • encountering a character that is specified to be a one-character token
  • beginning and ending a pair of characters specified to be a two-character token
It optionally removes both C style and C++ style comments. Removal is the default.
Tokenizer is implemented using the "State Pattern".
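The boundary rules above can be sketched with a small state-pattern scanner. This is a minimal illustration, not the repository's Tokenizer: the names Cursor, State, the Eat* classes, kPairs, and tokenize are hypothetical, the sketch reads from a string rather than a stream, and it assumes every punctuator is a one-character token unless it starts one of a few listed two-character tokens.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch of a state-pattern tokenizer; not the repository's API.
struct Cursor {                        // context shared by all states
  const std::string& src;
  std::size_t pos = 0;
  bool atEnd() const { return pos >= src.size(); }
  char peek(std::size_t ahead = 0) const {
    return pos + ahead < src.size() ? src[pos + ahead] : '\0';
  }
  char get() { return src[pos++]; }
};

inline bool isSpace(char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; }
inline bool isWordChar(char c) { return std::isalnum(static_cast<unsigned char>(c)) || c == '_'; }

struct State {                         // each state consumes one lexical unit
  virtual ~State() = default;
  virtual std::string eat(Cursor& c) = 0;   // returns "" when nothing is emitted
};

struct EatWhitespace : State {
  std::string eat(Cursor& c) override {
    while (!c.atEnd() && isSpace(c.peek())) c.get();
    return "";
  }
};

struct EatText : State {               // ends at a text-punctuator boundary
  std::string eat(Cursor& c) override {
    std::string tok;
    while (!c.atEnd() && isWordChar(c.peek())) tok += c.get();
    return tok;
  }
};

struct EatPunctuator : State {         // one- or two-character tokens (assumed set)
  std::string eat(Cursor& c) override {
    static const char* const kPairs[] = {"::", "->", "==", "!=", "<=", ">=",
                                         "&&", "||", "++", "--", "<<", ">>"};
    std::string tok(1, c.get());
    const std::string pair = tok + c.peek();
    for (const char* two : kPairs)
      if (pair == two) { c.get(); return pair; }
    return tok;
  }
};

struct EatCppComment : State {         // runs to the end of the line
  std::string eat(Cursor& c) override {
    std::string tok;
    while (!c.atEnd() && c.peek() != '\n') tok += c.get();
    return tok;
  }
};

struct EatCComment : State {           // runs through the closing "*/"
  std::string eat(Cursor& c) override {
    std::string tok;
    while (!c.atEnd()) {
      tok += c.get();
      if (tok.size() >= 4 && tok.compare(tok.size() - 2, 2, "*/") == 0) break;
    }
    return tok;
  }
};

struct EatQuoted : State {             // quoted string or quoted char
  std::string eat(Cursor& c) override {
    const char quote = c.get();
    std::string tok(1, quote);
    while (!c.atEnd()) {
      const char ch = c.get();
      tok += ch;
      if (ch == '\\' && !c.atEnd()) tok += c.get();   // keep the escaped char
      else if (ch == quote) break;
    }
    return tok;
  }
};

// Select a state from the next two characters, then let it consume input.
std::vector<std::string> tokenize(const std::string& text, bool keepComments = false) {
  EatWhitespace ws; EatText word; EatPunctuator punct;
  EatCppComment lineComment; EatCComment blockComment; EatQuoted quoted;
  Cursor cur{text};
  std::vector<std::string> tokens;
  while (!cur.atEnd()) {
    const char ch = cur.peek(), next = cur.peek(1);
    State* state = &punct;
    if (isSpace(ch)) state = &ws;
    else if (ch == '/' && next == '/') state = &lineComment;
    else if (ch == '/' && next == '*') state = &blockComment;
    else if (ch == '"' || ch == '\'') state = &quoted;
    else if (isWordChar(ch)) state = &word;
    const std::string tok = state->eat(cur);
    const bool isComment = (state == &lineComment || state == &blockComment);
    if (!tok.empty() && (keepComments || !isComment)) tokens.push_back(tok);
  }
  return tokens;
}
```

Calling tokenize(text) drops comments, matching the default described above, while tokenize(text, true) returns each comment as a single token. Because each state owns one consumption rule, adding a new kind of token means adding one class and one line of selection logic.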
SemiExp collects token sequences that are useful for detecting a single grammatical construct. It uses terminator tokens to end its token collection. Those are: "{", "}", ";", and "\n" when its collection starts with "#".
SemiExp token sequences help us detect the beginning and ending of program scopes, class and function definitions, and the occurrence of compiler directives.
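The termination rule can be sketched as a single pass over a token list. This is a minimal illustration, not the repository's SemiExp class: the names Tokens and collectSemiExps are hypothetical, the input is assumed to contain "\n" as its own token, and the sketch drops the newline that ends a directive instead of storing it.

```cpp
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;   // hypothetical alias for this sketch

// Group tokens into semi-expressions. A group ends at "{", "}", or ";",
// or at "\n" when the group's first token is "#".
std::vector<Tokens> collectSemiExps(const Tokens& tokens) {
  std::vector<Tokens> groups;
  Tokens current;
  for (const std::string& tok : tokens) {
    if (tok == "\n") {
      // A newline terminates only a compiler directive; elsewhere it is ignored.
      if (!current.empty() && current.front() == "#") {
        groups.push_back(current);
        current.clear();
      }
      continue;
    }
    current.push_back(tok);
    if (tok == "{" || tok == "}" || tok == ";") {
      groups.push_back(current);
      current.clear();
    }
  }
  if (!current.empty()) groups.push_back(current);   // unterminated trailing tokens
  return groups;
}
```

Given the tokens of a small program with an #include line and a main function, each group holds one directive, one scope opening, one statement, or one scope closing, which is the granularity a parser's rules need.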

2.0 Operations

Fig 1 shows the output of the tokenizer. Its purpose is to remove whitespace and to apply some intricate rules that decide where to break tokens within punctuation sequences. Fortunately, these rules are essentially the same for C++, C#, and Java.
SemiExp groups tokens according to the rules for terminating semi-expressions, cited above. You can see from its output, shown in Fig 2, that these token collections are just the right size for parsing programming language grammatical constructs.

3.0 Build

CppLexicalScanner was built with Visual Studio Community Edition - 2019, and tested on Windows 10.

4.0 Status

Tokenizer and SemiExp are both used in CppParser and CppCodeAnalyzer. I've used these facilities in my own code analysis work, and have, for several years, assigned projects to Computer Engineering classes that require their use. Also, several of my doctoral advisees used them as part of their research activities. These facilities have no known defects.