CsPageValidator

Validates HTML files for structural correctness.

Concept:

CsPageValidator checks HTML files for structural correctness. Given one or more files or directories, it scans every .html / .htm file and reports violations of eight structural rules with precise line and column locations. It is a C# (.NET) port of CppPageValidator.

The tool is designed for batch use: validate a single page during development, or sweep an entire static-site tree before publishing. Exit status 0 means all files pass; exit status 1 means at least one file failed validation or could not be read.

Packages:

The project is a .NET solution composed of four projects (three libraries and one executable) in a strict linear dependency chain, plus three companion test projects.

Package	Kind	Role
Tokenizer	library	Raw lexical scanner; produces a Token stream with line/column tracking
Lexer	library	Groups tokens into structured Lexeme values; normalizes tag names to lowercase
Validator	library	Drives the Lexer, applies the eight structural rules, and returns a Report containing all errors
EntryPoint	executable (page_validator)	CLI orchestration - parses arguments, walks directories, calls Validator, prints the report

The dependency chain is strictly linear: Tokenizer ← Lexer ← Validator ← EntryPoint, where each project depends only on the one to its left. These four projects use no NuGet packages, only the .NET Base Class Library.

Quick Start:

# 1. Build (from the CsPageValidator/ solution root)
dotnet build

# 2. Validate a single file
dotnet run --project EntryPoint -- index.html

# 3. Validate a directory tree, quiet mode (errors only)
dotnet run --project EntryPoint -- -r -q ./site

# 4. Validate with pass/fail summary
dotnet run --project EntryPoint -- -r -s ./site

# 5. Or run the built executable directly
EntryPoint/bin/Debug/net10.0/page_validator -r -s ./site

Command-Line Options:

Option	Argument	Default	Meaning
<path>...	file or directory	(required)	One or more HTML files or directories to validate
-r, --recursive	(flag)	off	Descend into subdirectories
-q, --quiet	(flag)	off	Print only files with errors; suppress PASS lines
-s, --summary	(flag)	off	Print pass/fail count after all files are processed
-h, --help	(flag)	off	Print help and exit
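
The flags in the table above are simple enough to parse by hand. The following is a minimal sketch of one way to do it, not the actual EntryPoint code: the Options record, the CliSketch class, and the choice to reject unknown options with an exception are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

// Usage: parse a sample command line and print what was recognized.
var opts = CliSketch.ParseArgs(new[] { "-r", "--summary", "./site" });
Console.WriteLine($"recursive={opts.Recursive} quiet={opts.Quiet} summary={opts.Summary}");
Console.WriteLine($"paths={string.Join(",", opts.Paths)}");

// Hypothetical option container; the real EntryPoint may store these differently.
public sealed record Options(
    bool Recursive, bool Quiet, bool Summary, bool Help, IReadOnlyList<string> Paths);

public static class CliSketch
{
    public static Options ParseArgs(string[] args)
    {
        bool recursive = false, quiet = false, summary = false, help = false;
        var paths = new List<string>();

        foreach (var arg in args)
        {
            switch (arg)
            {
                case "-r": case "--recursive": recursive = true; break;
                case "-q": case "--quiet":     quiet = true;     break;
                case "-s": case "--summary":   summary = true;   break;
                case "-h": case "--help":      help = true;      break;
                default:
                    // Assumption: anything else starting with '-' is an error.
                    if (arg.StartsWith('-'))
                        throw new ArgumentException($"unknown option: {arg}");
                    paths.Add(arg); // everything else is a file or directory path
                    break;
            }
        }
        return new Options(recursive, quiet, summary, help, paths);
    }
}
```

Because every option is a plain flag with no argument, a single pass with a switch is enough and flags and paths may appear in any order.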

Output:

Each file is reported as PASS or FAIL. Failing files list every violation with rule name, line, and column. The example below was produced with -s, so it ends with the count line:

PASS  site/index.html
FAIL  site/about.html
      [tag-nesting] 14:3 - </div> does not match open <p>
      [duplicate-id] 22:10 - duplicate id 'header'
PASS  site/contact.html

2 passed, 1 failed

With -q (quiet), PASS lines are suppressed. With -s (summary), the count line is appended. Both flags may be combined.
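
The per-file layout shown above can be reproduced with a few lines of string building. This is a sketch under assumptions: the field names on ValidationError (Rule, Line, Col, Message) and the ReportFormatter helper are hypothetical, chosen only to match the printed format.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Usage: one failing file, then one passing file in quiet mode (prints nothing).
var errors = new List<ValidationError>
{
    new("tag-nesting", 14, 3, "</div> does not match open <p>"),
    new("duplicate-id", 22, 10, "duplicate id 'header'"),
};
Console.Write(ReportFormatter.Format("site/about.html", errors, quiet: false));
Console.Write(ReportFormatter.Format("site/index.html", new List<ValidationError>(), quiet: true));

// Assumed shape of an error record; the real field names are not shown in this document.
public sealed record ValidationError(string Rule, int Line, int Col, string Message);

public static class ReportFormatter
{
    public static string Format(string file, IReadOnlyList<ValidationError> errors, bool quiet)
    {
        var sb = new StringBuilder();
        if (errors.Count == 0)
        {
            // Quiet mode suppresses PASS lines entirely.
            if (!quiet) sb.Append("PASS  ").Append(file).Append('\n');
            return sb.ToString();
        }

        sb.Append("FAIL  ").Append(file).Append('\n');
        foreach (var e in errors)
            sb.Append($"      [{e.Rule}] {e.Line}:{e.Col} - {e.Message}\n");
        return sb.ToString();
    }
}
```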

Rules Checked:

Rule	What is checked
doctype	Document begins with <!DOCTYPE html>
root-element	Exactly one <html> element at the top level
head-required	<head> is present and contains a <title>
body-required	<body> element is present
tag-nesting	Every non-void open tag has a matching close tag, and tags close in the reverse of the order in which they were opened
void-elements	Void elements (br, img, input, hr, meta, link, etc.) have no close tag
attr-quotes	All attribute values are quoted
duplicate-id	No two elements share the same id attribute value
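
As an illustration of how one of these rules can be expressed, here is a minimal sketch of a void-elements check. The set below is the void-element list from the HTML Living Standard; the tool's own list is only partly shown in the table ("etc."), so it may differ. The VoidElements class and the error message wording are hypothetical.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Usage: a close tag on a void element is reported, a normal close tag is not.
Console.WriteLine(VoidElements.CheckCloseTag("br", 7, 12) ?? "ok");
Console.WriteLine(VoidElements.CheckCloseTag("div", 8, 1) ?? "ok");

public static class VoidElements
{
    // Void elements per the HTML Living Standard (assumed, not taken from the tool).
    private static readonly HashSet<string> Names = new(StringComparer.Ordinal)
    {
        "area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr",
    };

    // Expects a tag name that has already been lowercased.
    public static bool IsVoid(string lowerCaseName) => Names.Contains(lowerCaseName);

    // Returns an error message if the close tag targets a void element, else null.
    public static string? CheckCloseTag(string name, int line, int col) =>
        IsVoid(name)
            ? $"[void-elements] {line}:{col} - void element <{name}> must not have a close tag"
            : null;
}
```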

Design:

Tokenizer

Scans raw HTML source character by character and emits a flat stream of Token values. Token is an abstract record with derived types:

TagOpen, TagClose, AttrName, AttrValue, AttrUnquoted,
SelfClose, TagEnd, Text, Comment, Doctype, Eof

Every token carries a (Line, Col) position for precise error reporting. The tokenizer holds no HTML grammar knowledge: it recognizes only <, >, =, ", ', and ! as delimiters. Malformed input is never rejected at this stage.
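
The record hierarchy and position tracking described above might look roughly like the sketch below. Only a few of the derived types are shown, and their payload fields (Name, Value) are assumptions, as is the PositionSketch helper. It treats every character other than a line feed as advancing the column by one, which may not match how the real tokenizer handles tabs or carriage returns.

```csharp
using System;

// Usage: find the position of 'c' in a two-line source and attach it to a token.
var (line, col) = PositionSketch.PositionOf("a\nbc", 3);
Token tok = new Text("c", line, col);
Console.WriteLine($"{tok.GetType().Name} at {tok.Line}:{tok.Col}"); // Text at 2:2

// Abstract base carrying the position; one derived record per token kind.
public abstract record Token(int Line, int Col);
public sealed record TagOpen(string Name, int Line, int Col) : Token(Line, Col);
public sealed record AttrValue(string Value, int Line, int Col) : Token(Line, Col);
public sealed record Text(string Value, int Line, int Col) : Token(Line, Col);
public sealed record Eof(int Line, int Col) : Token(Line, Col);

public static class PositionSketch
{
    // Converts a zero-based character offset into a one-based (Line, Col) pair.
    public static (int Line, int Col) PositionOf(string src, int offset)
    {
        int line = 1, col = 1;
        for (int i = 0; i < offset && i < src.Length; i++)
        {
            if (src[i] == '\n') { line++; col = 1; } // new line resets the column
            else col++;
        }
        return (line, col);
    }
}
```

A real scanner would normally update line and column incrementally as it advances rather than recomputing from the start of the source; the helper here only shows the counting rule.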

Lexer

Consumes the Token stream and groups tokens into structured Lexeme values. Lexeme is an abstract record with derived types:

OpenTag, SelfClosingTag, CloseTag, TextNode, CommentNode, DoctypeDecl

OpenTag and SelfClosingTag carry an IReadOnlyList<Attr> with key, value, and quoting status used later by the attr-quotes rule. All tag names are normalized to lowercase; whitespace-only text nodes are discarded.
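
The two normalization steps mentioned above (lowercasing tag names and dropping whitespace-only text) can be sketched as follows. The Attr field names (Key, Value, Quoted), the position fields on Lexeme, and the LexerSketch helper methods are assumptions for illustration; only two of the six derived types are shown.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Usage: tag names are lowercased; whitespace-only text yields no lexeme.
var tag = LexerSketch.MakeOpenTag("DIV", new[] { new Attr("id", "header", Quoted: true) }, 1, 1);
Console.WriteLine(tag.Name);                                      // div
Console.WriteLine(LexerSketch.MakeText("  \n\t", 1, 18) is null); // True

// Assumed shapes; the document only states that attributes carry key, value, and quoting status.
public sealed record Attr(string Key, string Value, bool Quoted);
public abstract record Lexeme(int Line, int Col);
public sealed record OpenTag(string Name, IReadOnlyList<Attr> Attrs, int Line, int Col)
    : Lexeme(Line, Col);
public sealed record TextNode(string Text, int Line, int Col) : Lexeme(Line, Col);

public static class LexerSketch
{
    // Normalizes the tag name so later rules can compare names directly.
    public static OpenTag MakeOpenTag(string rawName, IReadOnlyList<Attr> attrs, int line, int col)
        => new(rawName.ToLowerInvariant(), attrs, line, col);

    // Returns null for whitespace-only text, which the Lexer discards.
    public static TextNode? MakeText(string text, int line, int col)
        => string.IsNullOrWhiteSpace(text) ? null : new TextNode(text, line, col);
}
```

Keeping the Quoted flag on each attribute means the attr-quotes rule can be checked later without re-reading the source.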

Validator

Drives the Lexer and applies all eight rules in a single pass. It maintains an element stack (List<(string Name, (int Line, int Col) Pos)>) for nesting checks, a HashSet<string> for duplicate-ID detection, and boolean flags for the document-level requirements (doctype, html, head, title, body). All violations are collected into a list of ValidationError records before the Report is returned - the validator never short-circuits on the first failure.

Validator.Validate(string src, string file) is the single public entry point. Report.IsValid returns true when the error list is empty.
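
The element stack and the ID set can be sketched in isolation as shown below. This is not the real Validator: the NestingSketch class is hypothetical, errors are kept as plain strings rather than ValidationError records, the "has no open tag" message is invented, and leaving the stack untouched after a mismatch is just one possible recovery strategy.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Usage: a duplicate id followed by a mismatched close tag; both are collected.
var v = new NestingSketch();
v.Open("div", "header", 1, 1);
v.Open("p", "header", 2, 3); // same id as the <div>
v.Close("div", 3, 1);        // <p> is still open
foreach (var e in v.Errors) Console.WriteLine(e);

public sealed class NestingSketch
{
    // Same stack shape the document describes for nesting checks.
    private readonly List<(string Name, (int Line, int Col) Pos)> _stack = new();
    private readonly HashSet<string> _ids = new();

    public List<string> Errors { get; } = new();

    public void Open(string name, string? id, int line, int col)
    {
        _stack.Add((name, (line, col)));
        // HashSet.Add returns false when the value is already present.
        if (id is not null && !_ids.Add(id))
            Errors.Add($"[duplicate-id] {line}:{col} - duplicate id '{id}'");
    }

    public void Close(string name, int line, int col)
    {
        if (_stack.Count == 0)
        {
            Errors.Add($"[tag-nesting] {line}:{col} - </{name}> has no open tag");
            return;
        }
        var top = _stack[^1];
        if (top.Name == name)
            _stack.RemoveAt(_stack.Count - 1);
        else
            // Record the error and keep going; nothing short-circuits.
            Errors.Add($"[tag-nesting] {line}:{col} - </{name}> does not match open <{top.Name}>");
    }
}
```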

EntryPoint - CLI orchestration

Parses command-line flags manually (no external dependency), collects .html / .htm files from the supplied paths, and skips build and VCS directories (bin, obj, .git, .vs, etc.). For each file it calls Validator.Validate, prints the per-file report, and tracks the overall pass/fail count. Exits with status 1 if any file fails or cannot be read.
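
File collection with directory skipping can be done with System.IO alone. The sketch below makes several assumptions: the skip list contains only the four names given above (the full list is not shown), extension matching is case-insensitive, and a path that is neither an HTML file nor a directory yields nothing. FileCollector is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Usage: list every HTML file under the current directory.
foreach (var file in FileCollector.Collect(".", recursive: true))
    Console.WriteLine(file);

public static class FileCollector
{
    // Assumed skip list; the document names these four and then says "etc.".
    private static readonly HashSet<string> Skipped =
        new(StringComparer.OrdinalIgnoreCase) { "bin", "obj", ".git", ".vs" };

    public static bool IsHtml(string path)
    {
        var ext = Path.GetExtension(path);
        return ext.Equals(".html", StringComparison.OrdinalIgnoreCase)
            || ext.Equals(".htm", StringComparison.OrdinalIgnoreCase);
    }

    public static IEnumerable<string> Collect(string path, bool recursive)
    {
        if (File.Exists(path))
        {
            if (IsHtml(path)) yield return path;
            yield break;
        }
        if (!Directory.Exists(path)) yield break;

        // Files directly inside this directory, in a stable order.
        foreach (var f in Directory.EnumerateFiles(path).Where(IsHtml)
                                   .OrderBy(f => f, StringComparer.Ordinal))
            yield return f;

        if (!recursive) yield break;

        foreach (var dir in Directory.EnumerateDirectories(path))
        {
            if (Skipped.Contains(Path.GetFileName(dir))) continue; // build/VCS output
            foreach (var f in Collect(dir, recursive))
                yield return f;
        }
    }
}
```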

Build:

Requires the .NET 10 SDK (dotnet --version should show 10.x).

# Build all projects (from the CsPageValidator/ solution root)
dotnet build

# Build Release configuration
dotnet build -c Release

# Run all unit tests
dotnet test

# Run with verbose test output
dotnet test --logger "console;verbosity=detailed"

Testing:

Each library has a companion xUnit test project:

Test Project	Coverage
Tokenizer.Tests	tags, attributes, doctypes, comments, position tracking
Lexer.Tests	tag grouping, attribute collection, case normalization
Validator.Tests	valid documents, missing elements, nesting errors, void elements, duplicate IDs, unquoted attributes

# Run all tests from the solution root
dotnet test

# Run tests for a single project
dotnet test Validator.Tests

External Dependencies:

None for the tool itself. All four components use only the .NET Base Class Library (System.IO, System.Text.RegularExpressions, and core collections), so no NuGet packages are required to build or run it. The test projects additionally depend on xUnit.