CsPageValidator

Validates HTML files for structural correctness.

Concept:

CsPageValidator checks HTML files for structural correctness. Given one or more files or directories, it scans every .html / .htm file and reports violations of eight structural rules with precise line and column locations. It is a C# (.NET) port of CppPageValidator. The tool is designed for batch use: validate a single page during development, or sweep an entire static-site tree before publishing. Exit status 0 means all files pass; exit status 1 means at least one violation was found.

Packages:

The project is a .NET solution composed of four packages (three libraries and one executable), plus three companion test projects.

| Package | Kind | Role |
| --- | --- | --- |
| Tokenizer | library | Raw lexical scanner; produces a Token stream with line/column tracking |
| Lexer | library | Groups tokens into structured Lexeme values; normalizes tag names to lowercase |
| Validator | library | Drives the Lexer, applies the eight structural rules, and returns a Report containing all errors |
| EntryPoint | executable (page_validator) | CLI orchestration: parses arguments, walks directories, calls Validator, prints the report |

The dependency chain is strictly linear: Tokenizer -> Lexer -> Validator -> EntryPoint. The four packages use no NuGet packages, only the .NET Base Class Library; the companion test projects additionally reference xUnit.

Quick Start:

# 1. Build (from the CsPageValidator/ solution root)
dotnet build

# 2. Validate a single file
dotnet run --project EntryPoint -- index.html

# 3. Validate a directory tree, quiet mode (errors only)
dotnet run --project EntryPoint -- -r -q ./site

# 4. Validate with pass/fail summary
dotnet run --project EntryPoint -- -r -s ./site

# 5. Or run the built executable directly
EntryPoint/bin/Debug/net10.0/page_validator -r -s ./site

Command-Line Options:

| Option | Argument | Default | Meaning |
| --- | --- | --- | --- |
| <path>... | file or directory | (required) | One or more HTML files or directories to validate |
| -r, --recursive | (flag) | off | Descend into subdirectories |
| -q, --quiet | (flag) | off | Print only files with errors; suppress PASS lines |
| -s, --summary | (flag) | off | Print pass/fail count after all files are processed |
| -h, --help | (flag) | off | Print help and exit |

Output:

Each file is reported as PASS or FAIL. Failing files list every violation with rule name, line, and column:
PASS  site/index.html
FAIL  site/about.html
      [tag-nesting] 14:3 - </div> does not match open <p>
      [duplicate-id] 22:10 - duplicate id 'header'
PASS  site/contact.html

2 passed, 1 failed
With -q (quiet), PASS lines are suppressed. With -s (summary), the count line (the last line of the sample above) is appended. The two flags may be combined.
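
The per-file layout can be reproduced in a few lines of C#. This is a minimal sketch, not the tool's actual printing code: the FormatReport helper and its tuple shape are hypothetical, and only the "[rule] line:col - message" layout is taken from the sample output.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: renders one file's result in the PASS/FAIL layout
// shown in the sample output.
static string FormatReport(
    string file,
    IReadOnlyList<(string Rule, int Line, int Col, string Message)> errors)
{
    var sb = new StringBuilder();
    sb.Append(errors.Count == 0 ? "PASS  " : "FAIL  ").Append(file);
    foreach (var e in errors)
    {
        // Six spaces of indent, then "[rule] line:col - message".
        sb.Append('\n').Append($"      [{e.Rule}] {e.Line}:{e.Col} - {e.Message}");
    }
    return sb.ToString();
}

var errors = new List<(string, int, int, string)>
{
    ("tag-nesting", 14, 3, "</div> does not match open <p>"),
    ("duplicate-id", 22, 10, "duplicate id 'header'"),
};
Console.WriteLine(FormatReport("site/about.html", errors));
```

Run as a top-level program, this prints the FAIL entry for site/about.html exactly as it appears in the sample.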

Rules Checked:

| Rule | What is checked |
| --- | --- |
| doctype | Document begins with <!DOCTYPE html> |
| root-element | Exactly one <html> element at the top level |
| head-required | <head> is present and contains a <title> |
| body-required | <body> element is present |
| tag-nesting | Every open tag has a matching close tag in the correct order |
| void-elements | Void elements (br, img, input, hr, meta, link, etc.) have no close tag |
| attr-quotes | All attribute values are quoted |
| duplicate-id | No two elements share the same id attribute value |
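
As an illustration of how small an individual rule can be, here is a hedged sketch of the void-elements check. The element set below is the void-element list from the HTML Living Standard; the table above ends its own list with "etc.", so the tool's actual set may differ. The IsVoidCloseViolation name is hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Void elements per the HTML Living Standard (an assumption about the
// tool's list, which the README only gives in part).
var voidElements = new HashSet<string>(StringComparer.Ordinal)
{
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
};

// Hypothetical check: a close tag naming a void element is a violation.
bool IsVoidCloseViolation(string closeTagName) =>
    voidElements.Contains(closeTagName.ToLowerInvariant());

Console.WriteLine(IsVoidCloseViolation("br"));   // True
Console.WriteLine(IsVoidCloseViolation("div"));  // False
```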

Design:

Tokenizer

Scans raw HTML source character by character and emits a flat stream of Token values. Token is an abstract record with derived types:
TagOpen, TagClose, AttrName, AttrValue, AttrUnquoted,
SelfClose, TagEnd, Text, Comment, Doctype, Eof
Every token carries a (Line, Col) position for precise error reporting. The tokenizer holds no HTML grammar knowledge - it only recognises <, >, =, ", ', and ! as delimiters. Malformed input is never rejected at this stage.
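
To make the position tracking concrete, the following sketch stamps a (Line, Col) pair on each token it emits. It is not the project's tokenizer: it only recognises tag starts, ScanTagStarts is a hypothetical name, and 1-based lines and columns are an assumption. The record shapes (an abstract Token with derived kinds) follow the description above, but their exact parameters are guesses.

```csharp
using System;
using System.Collections.Generic;

// Walks the source one character at a time, tracking line and column,
// and emits a TagOpen for every '<' that is followed by a letter.
static IEnumerable<Token> ScanTagStarts(string src)
{
    int line = 1, col = 1;   // 1-based positions (assumed)
    for (int i = 0; i < src.Length; i++)
    {
        char c = src[i];
        if (c == '<' && i + 1 < src.Length && char.IsLetter(src[i + 1]))
        {
            int j = i + 1;
            while (j < src.Length && char.IsLetterOrDigit(src[j])) j++;
            yield return new TagOpen(src[(i + 1)..j], line, col);
        }
        // A newline starts a new line; any other character advances the column.
        if (c == '\n') { line++; col = 1; } else { col++; }
    }
    yield return new Eof(line, col);
}

foreach (var t in ScanTagStarts("<p>\n  <b>"))
    Console.WriteLine(t);

// Assumed shape of the hierarchy: the base record carries the position,
// with one derived record per token kind (only two kinds shown).
abstract record Token(int Line, int Col);
sealed record TagOpen(string Name, int Line, int Col) : Token(Line, Col);
sealed record Eof(int Line, int Col) : Token(Line, Col);
```

For the two-line input, the second TagOpen is reported at line 2, column 3, which is the kind of location the validator later attaches to an error.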

Lexer

Consumes the Token stream and groups tokens into structured Lexeme values. Lexeme is an abstract record with derived types:
OpenTag, SelfClosingTag, CloseTag, TextNode, CommentNode, DoctypeDecl
OpenTag and SelfClosingTag carry an IReadOnlyList<Attr> with key, value, and quoting status used later by the attr-quotes rule. All tag names are normalized to lowercase; whitespace-only text nodes are discarded.
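
The two normalization steps (lowercasing tag names, dropping whitespace-only text) can be sketched as follows. The type names Attr, Lexeme, OpenTag, and TextNode come from the description above; the property names, the MakeOpenTag and MakeTextNode helpers, and the omission of source positions are simplifying assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;

// Builds an OpenTag with its name normalized to lowercase.
static OpenTag MakeOpenTag(string rawName, params Attr[] attrs) =>
    new OpenTag(rawName.ToLowerInvariant(), attrs);

// Whitespace-only text is discarded (null here); anything else becomes a TextNode.
static TextNode? MakeTextNode(string text) =>
    string.IsNullOrWhiteSpace(text) ? null : new TextNode(text);

var tag = MakeOpenTag("DIV",
    new Attr("id", "header", Quoted: true),
    new Attr("class", "wide", Quoted: false));

Console.WriteLine(tag.Name);                       // div
Console.WriteLine(tag.Attrs[1].Quoted);            // False
Console.WriteLine(MakeTextNode("  \n ") is null);  // True

// Assumed shapes; only the type names appear in the README.
record Attr(string Key, string Value, bool Quoted);
abstract record Lexeme;
sealed record OpenTag(string Name, IReadOnlyList<Attr> Attrs) : Lexeme;
sealed record TextNode(string Value) : Lexeme;
```

Keeping the quoting status on each Attr is what lets a later attr-quotes check report unquoted values without re-reading the source.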

Validator

Drives the Lexer and applies all eight rules in a single pass. It maintains an element stack (List<(string Name, (int Line, int Col) Pos)>) for nesting checks, a HashSet<string> for duplicate-ID detection, and boolean flags for the document-level requirements (doctype, html, head, title, body). All violations are collected into a list of ValidationError records before the Report is returned - the validator never short-circuits on the first failure. Validator.Validate(string src, string file) is the single public entry point. Report.IsValid returns true when the error list is empty.
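
The element stack and the ID set can be illustrated with a short sketch. It uses the same collection types named above, but the Open and Close helpers, the plain-string error list, and the recovery behaviour after a mismatch (the stack is left unchanged) are assumptions for this example, not the validator's actual code.

```csharp
using System;
using System.Collections.Generic;

var stack = new List<(string Name, (int Line, int Col) Pos)>();
var seenIds = new HashSet<string>();
var errors = new List<string>();

void Open(string name, (int Line, int Col) pos, string? id = null)
{
    stack.Add((name, pos));
    // HashSet.Add returns false when the value is already present.
    if (id is not null && !seenIds.Add(id))
        errors.Add($"[duplicate-id] {pos.Line}:{pos.Col} - duplicate id '{id}'");
}

void Close(string name, (int Line, int Col) pos)
{
    if (stack.Count > 0 && stack[^1].Name == name)
    {
        stack.RemoveAt(stack.Count - 1);   // matched: pop and continue
        return;
    }
    string open = stack.Count > 0 ? $"<{stack[^1].Name}>" : "nothing";
    // Record the error and keep going; there is no short-circuit on failure.
    errors.Add($"[tag-nesting] {pos.Line}:{pos.Col} - </{name}> does not match open {open}");
}

// Simulates: <div id="header"><p></div> ... <span id="header">
Open("div", (13, 1), "header");
Open("p", (13, 20));
Close("div", (14, 3));
Open("span", (22, 10), "header");

errors.ForEach(Console.WriteLine);
```

The sketch prints one tag-nesting error at 14:3 and one duplicate-id error at 22:10, in that order, because both are collected rather than stopping at the first.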

EntryPoint - CLI orchestration

Parses command-line flags manually (no external dependency), collects .html / .htm files from the supplied paths, and skips build and VCS directories (bin, obj, .git, .vs, etc.). For each file it calls Validator.Validate, prints the per-file report, and tracks the overall pass/fail count. Exits with status 1 if any file fails or cannot be read.
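
File collection along these lines might look like the sketch below. CollectFiles and IsHtml are hypothetical names; the skip list contains only the four directories named above (the README says "etc.", so the real list is presumably longer), and case-insensitive matching of extensions and directory names is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Directory names to skip; only the four named in the README are listed.
var skipped = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
    { "bin", "obj", ".git", ".vs" };

bool IsHtml(string path)
{
    string ext = Path.GetExtension(path);
    return ext.Equals(".html", StringComparison.OrdinalIgnoreCase)
        || ext.Equals(".htm", StringComparison.OrdinalIgnoreCase);
}

// Yields HTML files under `path`, descending into subdirectories only on request.
IEnumerable<string> CollectFiles(string path, bool recursive)
{
    if (File.Exists(path))
    {
        if (IsHtml(path)) yield return path;
        yield break;
    }
    if (!Directory.Exists(path)) yield break;

    foreach (string file in Directory.EnumerateFiles(path))
        if (IsHtml(file)) yield return file;

    if (!recursive) yield break;
    foreach (string dir in Directory.EnumerateDirectories(path))
    {
        // Skip build and VCS directories by their final path segment.
        if (skipped.Contains(Path.GetFileName(dir))) continue;
        foreach (string file in CollectFiles(dir, recursive))
            yield return file;
    }
}

foreach (string f in CollectFiles(args.Length > 0 ? args[0] : ".", recursive: true))
    Console.WriteLine(f);
```

Each collected path would then be read and passed to Validator.Validate(src, file); a file is counted as passing when Report.IsValid is true.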

Build:

Requires the .NET 10 SDK (dotnet --version should show 10.x).
# Build all projects (from the CsPageValidator/ solution root)
dotnet build

# Build Release configuration
dotnet build -c Release

# Run all unit tests
dotnet test

# Run with verbose test output
dotnet test --logger "console;verbosity=detailed"

Testing:

Each library has a companion xUnit test project:
| Test Project | Coverage |
| --- | --- |
| Tokenizer.Tests | tags, attributes, doctypes, comments, position tracking |
| Lexer.Tests | tag grouping, attribute collection, case normalization |
| Validator.Tests | valid documents, missing elements, nesting errors, void elements, duplicate IDs, unquoted attributes |

# Run all tests from the solution root
dotnet test

# Run tests for a single project
dotnet test Validator.Tests

External Dependencies:

None at runtime. All four components use only the .NET Base Class Library (System.IO, System.Text.RegularExpressions, and core collections), so no NuGet packages are required to build or run the tool. The test projects are the exception: they reference xUnit.