CppPageValidator

Validates HTML files for structural correctness.

Concept:

CppPageValidator checks HTML files for structural correctness. Given one or more files or directories, it scans every .html / .htm file and reports violations of eight structural rules with precise line and column locations. It is a C++23 port of RsPageValidator. The tool is designed for batch use: validate a single page during development, or sweep an entire static-site tree before publishing. Exit status 0 means all files pass; exit status 1 means at least one violation was found.

Packages:

The project is a CMake workspace composed of four components in a strict linear dependency chain. Each component is a C++23 named module (.ixx file).
Component    Kind                         Role
tokenizer    static library               Raw lexical scanner; produces a Token stream with line/column tracking
lexer        static library               Groups tokens into structured Lexeme values; normalizes tag names to lowercase
validator    static library               Applies the eight structural rules; collects all errors before returning
entry_point  executable (page_validator)  CLI orchestration: parses arguments, walks directories, prints the report
The dependency chain is strictly linear: tokenizer → lexer → validator → entry_point. No external libraries are used; only the C++23 standard library (import std;).

Quick Start:

# 1. Configure and build (Release)
cmake -B build -G "Visual Studio 17 2022"
cmake --build build --config Release

# 2. Validate a single file
build\entry_point\Release\page_validator index.html

# 3. Validate a directory tree, quiet mode (errors only)
build\entry_point\Release\page_validator -r -q .\site

# 4. Validate with pass/fail summary
build\entry_point\Release\page_validator -r -s .\site

Command-Line Options:

Option           Argument           Default     Meaning
<path>...        file or directory  (required)  One or more HTML files or directories to validate
-r, --recursive  (flag)             off         Descend into subdirectories
-q, --quiet      (flag)             off         Print only files with errors; suppress PASS lines
-s, --summary    (flag)             off         Print pass/fail count after all files are processed
-h, --help       (flag)             off         Print help and exit
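The option table can be read as a small hand-written parser. The following is only a sketch under stated assumptions: the Options struct, its field names, and parse_args are hypothetical, and rejecting unknown flags is a guess at the behavior rather than something the project documents.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical option record; field names are illustrative, not taken from the project.
struct Options {
    bool recursive = false;
    bool quiet = false;
    bool summary = false;
    bool help = false;
    std::vector<std::string> paths;
};

// Parses the arguments that follow the program name.
// Assumption: an unknown flag, or a missing <path> without --help, is an error (nullopt).
std::optional<Options> parse_args(const std::vector<std::string>& args) {
    Options opts;
    for (const std::string& arg : args) {
        if (arg == "-r" || arg == "--recursive") opts.recursive = true;
        else if (arg == "-q" || arg == "--quiet") opts.quiet = true;
        else if (arg == "-s" || arg == "--summary") opts.summary = true;
        else if (arg == "-h" || arg == "--help") opts.help = true;
        else if (!arg.empty() && arg[0] == '-') return std::nullopt;  // unknown flag
        else opts.paths.push_back(arg);                               // file or directory
    }
    if (!opts.help && opts.paths.empty()) return std::nullopt;  // <path> is required
    return opts;
}
```

A caller would typically build the vector from argv + 1 through argv + argc and print the help text when the result is empty or help is set.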

Output:

Each file is reported as PASS or FAIL. Failing files list every violation with rule name, line, and column:
PASS  site/index.html
FAIL  site/about.html
      [tag-nesting] 14:3 — </div> does not match open <p>
      [duplicate-id] 22:10 — duplicate id 'header'
PASS  site/contact.html

2 passed, 1 failed
With -q (quiet), PASS lines are suppressed. With -s (summary), the count line ("2 passed, 1 failed" in the sample above) is appended after all files are processed. The two flags may be combined.
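How the two flags shape the report can be sketched as follows. FileResult and render_report are hypothetical names, and the violation lines are passed in preformatted, so this shows only the PASS/FAIL and summary logic, not the project's actual printing code.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical per-file result: the path plus preformatted violation lines.
struct FileResult {
    std::string path;
    std::vector<std::string> violations;  // empty means the file passed
};

// Renders the report text; `quiet` drops PASS lines, `summary` appends the count line.
std::string render_report(const std::vector<FileResult>& results, bool quiet, bool summary) {
    std::string out;
    std::size_t passed = 0;
    std::size_t failed = 0;
    for (const FileResult& r : results) {
        if (r.violations.empty()) {
            ++passed;
            if (!quiet) out += "PASS  " + r.path + "\n";
        } else {
            ++failed;
            out += "FAIL  " + r.path + "\n";
            for (const std::string& v : r.violations) out += "      " + v + "\n";
        }
    }
    if (summary) {
        // Blank separator line, then the counts, matching the sample layout.
        out += "\n" + std::to_string(passed) + " passed, " + std::to_string(failed) + " failed\n";
    }
    return out;
}
```

Building the whole report as a string keeps the formatting logic easy to unit-test; a real tool might stream each file's lines as it goes instead.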

Rules Checked:

Rule           What is checked
doctype        Document begins with <!DOCTYPE html>
root-element   Exactly one <html> element at the top level
head-required  <head> is present and contains a <title>
body-required  <body> element is present
tag-nesting    Every open tag has a matching close tag in the correct order
void-elements  Void elements (br, img, input, hr, meta, link, etc.) have no close tag
attr-quotes    All attribute values are quoted
duplicate-id   No two elements share the same id attribute value
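As an illustration of how a single rule might be expressed, here is a minimal sketch of the void-elements check. It covers only the six element names the table spells out; the remaining names behind "etc." are deliberately not filled in, and both function names are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <string_view>

// Only the six names listed in the rules table; the table's "etc." is left unfilled.
constexpr std::array<std::string_view, 6> kListedVoidElements = {
    "br", "img", "input", "hr", "meta", "link"};

// True when `name` (assumed already lowercased) is one of the listed void elements.
bool is_listed_void_element(std::string_view name) {
    return std::find(kListedVoidElements.begin(), kListedVoidElements.end(), name) !=
           kListedVoidElements.end();
}

// A close tag such as </br> would violate the void-elements rule.
bool close_tag_violates_void_rule(std::string_view close_tag_name) {
    return is_listed_void_element(close_tag_name);
}
```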

Design:

tokenizer - Tokenizer

Scans raw HTML source character by character and emits a flat stream of Token values. Token is a std::variant over:
tok::TagOpen, tok::TagClose, tok::AttrName,
tok::AttrValue, tok::AttrUnquoted, tok::SelfClose,
tok::TagEnd, tok::Text, tok::Comment, tok::Doctype, tok::Eof
Every token carries a (line, col) position for precise error reporting. The tokenizer holds no HTML grammar knowledge — it only recognises <, >, =, ", ', and ! as delimiters. Malformed input is never rejected at this stage.
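A reduced sketch of the idea, assuming a much smaller token set than the real one: it keeps only TagOpen, TagClose, TagEnd, Text, and Eof, and handles attribute-free tags only (attributes, comments, and doctypes are out of scope here). The Pos struct, the field names, and the tokenize function are assumptions for illustration.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <variant>
#include <vector>

struct Pos { std::size_t line = 1; std::size_t col = 1; };

namespace tok {
struct TagOpen  { std::string name; Pos pos; };  // "<name"
struct TagClose { std::string name; Pos pos; };  // "</name"
struct TagEnd   { Pos pos; };                    // ">"
struct Text     { std::string text; Pos pos; };
struct Eof      { Pos pos; };
}  // namespace tok

using Token = std::variant<tok::TagOpen, tok::TagClose, tok::TagEnd, tok::Text, tok::Eof>;

std::vector<Token> tokenize(std::string_view src) {
    std::vector<Token> out;
    Pos pos;
    std::size_t i = 0;
    // Consume one character and update the line/column counters.
    auto advance = [&] {
        if (src[i] == '\n') { ++pos.line; pos.col = 1; } else { ++pos.col; }
        ++i;
    };
    // Consume characters up to (not including) the next character found in `stops`.
    auto take_until = [&](std::string_view stops) {
        std::string s;
        while (i < src.size() && stops.find(src[i]) == std::string_view::npos) {
            s += src[i];
            advance();
        }
        return s;
    };
    while (i < src.size()) {
        const Pos start = pos;  // every token records where it began
        if (src[i] == '<') {
            advance();
            const bool closing = i < src.size() && src[i] == '/';
            if (closing) advance();
            std::string name = take_until("> \t\r\n");
            if (closing) out.push_back(tok::TagClose{std::move(name), start});
            else         out.push_back(tok::TagOpen{std::move(name), start});
        } else if (src[i] == '>') {
            advance();
            out.push_back(tok::TagEnd{start});
        } else {
            out.push_back(tok::Text{take_until("<>"), start});
        }
    }
    out.push_back(tok::Eof{pos});
    return out;
}
```

Like the component described above, the sketch never rejects input; odd sequences simply come out as whatever tokens the delimiters imply.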

lexer - Lexer

Consumes the Token stream and groups tokens into structured Lexeme values. Lexeme is a std::variant over:
lex::OpenTag, lex::SelfClosingTag, lex::CloseTag,
lex::TextNode, lex::CommentNode, lex::DoctypeDecl
OpenTag and SelfClosingTag carry a std::vector<Attr> with key, value, and quoting status used later by the attr-quotes rule. All tag names are normalized to lowercase; whitespace-only text nodes are discarded.
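The two normalization steps and the attribute record could look roughly like this. The helper names and the Attr field names are assumptions, and the sketch handles ASCII only.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <string_view>

// Attribute record carrying the quoting status the attr-quotes rule needs
// (field names are assumed, not taken from the project).
struct Attr {
    std::string key;
    std::string value;
    bool quoted = true;
};

// Lowercases ASCII letters in a tag name, so "DIV" and "div" compare equal later.
std::string normalize_tag_name(std::string_view name) {
    std::string out(name);
    std::transform(out.begin(), out.end(), out.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return out;
}

// True when a text node holds nothing but whitespace; such nodes are discarded.
bool is_whitespace_only(std::string_view text) {
    return std::all_of(text.begin(), text.end(),
                       [](unsigned char c) { return std::isspace(c) != 0; });
}
```

The casts to unsigned char matter: passing a negative char value to std::tolower or std::isspace is undefined behavior.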

validator - Validator

Drives the Lexer and applies all eight rules in a single pass. It maintains an element stack (std::vector<std::pair<std::string, Pos>>) for nesting checks, an std::unordered_set<std::string> for duplicate-ID detection, and boolean flags for the document-level requirements (doctype, html, head, title, body). All violations are collected into a std::vector<ValidationError> before the Report is returned — the validator never aborts early on the first failure. Validator::validate(std::string_view src, const std::filesystem::path& file) is the single public entry point. Report::is_valid() returns true when the error list is empty.
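The stack-and-set bookkeeping can be sketched for two of the rules, tag-nesting and duplicate-id. This is a simplification under stated assumptions: TagEvent stands in for the real Lexeme variant, the check function is hypothetical, the stack is left unchanged after a mismatched close tag (a recovery policy chosen here, not documented by the project), and every message other than the two shown in the sample output is invented.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

struct Pos { std::size_t line; std::size_t col; };
struct ValidationError { std::string rule; Pos pos; std::string message; };

// Hypothetical, simplified event used in place of the real Lexeme variant.
struct TagEvent {
    bool is_close;
    std::string name;
    std::string id;  // empty when the tag has no id attribute
    Pos pos;
};

// Collects every violation instead of stopping at the first one.
std::vector<ValidationError> check(const std::vector<TagEvent>& events) {
    std::vector<ValidationError> errors;
    std::vector<std::pair<std::string, Pos>> stack;  // currently open elements
    std::unordered_set<std::string> seen_ids;

    for (const TagEvent& ev : events) {
        if (!ev.is_close) {
            stack.emplace_back(ev.name, ev.pos);
            // insert().second is false when the id was already present.
            if (!ev.id.empty() && !seen_ids.insert(ev.id).second) {
                errors.push_back({"duplicate-id", ev.pos, "duplicate id '" + ev.id + "'"});
            }
        } else if (stack.empty()) {
            errors.push_back({"tag-nesting", ev.pos, "</" + ev.name + "> has no open tag"});
        } else if (stack.back().first != ev.name) {
            // Assumed recovery: report the mismatch and leave the stack as it is.
            errors.push_back({"tag-nesting", ev.pos,
                              "</" + ev.name + "> does not match open <" +
                                  stack.back().first + ">"});
        } else {
            stack.pop_back();
        }
    }
    // Anything still on the stack was never closed.
    for (const auto& [name, pos] : stack) {
        errors.push_back({"tag-nesting", pos, "<" + name + "> is never closed"});
    }
    return errors;
}
```

Other recovery policies are possible, for example popping until a matching open tag is found; which one the project uses is not stated.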

entry_point - CLI orchestration

Parses command-line flags manually (no external dependency), collects .html / .htm files from the supplied paths, and skips build and VCS directories (build, bin, obj, .git, etc.). For each file it calls Validator::validate, prints the per-file report, and tracks the overall pass/fail count. Exits with status 1 if any file fails.
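The file-collection step might be sketched with std::filesystem as below. The function names are hypothetical, the skip list contains only the four directory names given above (the "etc." is not expanded), and case-insensitive extension matching is an assumption.

```cpp
#include <algorithm>
#include <array>
#include <cctype>
#include <filesystem>
#include <string>
#include <string_view>
#include <vector>

namespace fs = std::filesystem;

// Directory names listed explicitly in the text; its "etc." is not expanded here.
constexpr std::array<std::string_view, 4> kSkippedDirs = {"build", "bin", "obj", ".git"};

bool is_skipped_dir(std::string_view name) {
    return std::find(kSkippedDirs.begin(), kSkippedDirs.end(), name) != kSkippedDirs.end();
}

// True for .html / .htm. Matching case-insensitively is an assumption.
bool has_html_extension(const fs::path& p) {
    std::string ext = p.extension().string();
    std::transform(ext.begin(), ext.end(), ext.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return ext == ".html" || ext == ".htm";
}

// Collects HTML files under `root`; descends into subdirectories only when `recursive` is set.
std::vector<fs::path> collect_html_files(const fs::path& root, bool recursive) {
    std::vector<fs::path> files;
    if (fs::is_regular_file(root)) {
        if (has_html_extension(root)) files.push_back(root);
        return files;
    }
    if (!fs::is_directory(root)) return files;  // missing path: nothing to collect
    for (const fs::directory_entry& entry : fs::directory_iterator(root)) {
        if (entry.is_directory()) {
            if (recursive && !is_skipped_dir(entry.path().filename().string())) {
                std::vector<fs::path> nested = collect_html_files(entry.path(), recursive);
                files.insert(files.end(), nested.begin(), nested.end());
            }
        } else if (entry.is_regular_file() && has_html_extension(entry.path())) {
            files.push_back(entry.path());
        }
    }
    return files;
}
```

The order in which directory_iterator yields entries is unspecified, so a tool that wants stable output would sort the collected paths before validating them.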

Build:

Requires CMake 3.28+ and MSVC 19.38+ (Visual Studio 2022 17.8) or Clang 18+.
# Configure
cmake -B build -G "Visual Studio 17 2022"

# Build Release
cmake --build build --config Release

# Run all unit tests
ctest --test-dir build --build-config Release --output-on-failure
C++23 named modules (export module; / import std;) require a compiler and CMake version that support module scanning. Visual Studio 2022 17.8 or later is the recommended toolchain on Windows.

Testing:

Each component has a test.cpp registered with CTest:
Component  Tests  Coverage
tokenizer  12     tags, attributes, doctypes, comments, position tracking
lexer      11     tag grouping, attribute collection, case normalization
validator  12     valid documents, missing elements, nesting errors, void elements, duplicate IDs, unquoted attributes
# Run all tests
ctest --test-dir build --build-config Release --output-on-failure

External Dependencies:

None. All four components use only the C++23 standard library (import std;). No third-party packages are required.