RsPageValidator

Validates HTML files for structural correctness.

Concept:

Given one or more files or directories, RsPageValidator scans every .html / .htm file and reports each violation of its eight structural rules together with the line and column where it occurs.

The tool is designed for batch use: validate a single page during development, or sweep an entire static-site tree before publishing. Exit status 0 means all files pass; exit status 1 means at least one violation was found.

Crates:

The project is a Cargo workspace composed of four crates in a strict linear dependency chain.

Crate	Kind	Role
tokenizer	library	Raw lexical scanner; produces a Token stream with line/column tracking
lexer	library	Groups tokens into structured Lexeme values; normalizes tag names to lowercase
validator	library	Applies the eight structural rules; collects all errors before returning
entry_point	binary (rs_page_validator)	CLI orchestration - parses arguments, walks directories, prints the report

Each crate depends only on the one before it: tokenizer ← lexer ← validator ← entry_point, where A ← B means B depends on A. Only entry_point has an external dependency (clap); the three library crates use the Rust standard library only.

Quick Start:

# Validate a single file
cargo run -- index.html

# Validate a directory tree, quiet mode (errors only)
cargo run -- -r -q ./site

# Validate with pass/fail summary
cargo run -- -r -s ./site

# Run the release build directly (after cargo build --release)
./target/release/rs_page_validator -r ./website

Command-Line Options:

Option	Argument	Default	Meaning
<path>...	file or directory	(required)	One or more HTML files or directories to validate
-r, --recursive	(flag)	off	Descend into subdirectories
-q, --quiet	(flag)	off	Print only files with errors; suppress PASS lines
-s, --summary	(flag)	off	Print pass/fail count after all files are processed
-h, --help	(flag)	off	Print help and exit

Output:

Each file is reported as PASS or FAIL. A failing file lists every violation with its rule name, line, and column. The example below includes the count line that -s adds:

PASS  site/index.html
FAIL  site/about.html
      [tag-nesting] 14:3 — </div> does not match open <p>
      [duplicate-id] 22:10 — duplicate id 'header'
PASS  site/contact.html

2 passed, 1 failed

With -q (quiet), PASS lines are suppressed. With -s (summary), the count line is appended. Both flags may be combined.
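
The interaction of -q, -s, and the exit status can be sketched as a pure function over per-file results. This is a minimal illustration, not the project's code: FileResult, render, and exit_code are hypothetical names, and the per-violation lines are left out for brevity.

```rust
/// Hypothetical per-file outcome; the real report also carries violations.
pub struct FileResult<'a> {
    pub path: &'a str,
    pub passed: bool,
}

/// Renders the PASS/FAIL lines, honoring the quiet and summary flags.
pub fn render(results: &[FileResult], quiet: bool, summary: bool) -> String {
    let mut out = String::new();
    for r in results {
        if r.passed {
            // -q suppresses PASS lines only.
            if !quiet {
                out.push_str(&format!("PASS  {}\n", r.path));
            }
        } else {
            out.push_str(&format!("FAIL  {}\n", r.path));
        }
    }
    if summary {
        // -s appends the count line after a blank separator line.
        let passed = results.iter().filter(|r| r.passed).count();
        let failed = results.len() - passed;
        out.push_str(&format!("\n{} passed, {} failed\n", passed, failed));
    }
    out
}

/// Exit status: 0 when every file passes, 1 otherwise.
pub fn exit_code(results: &[FileResult]) -> i32 {
    if results.iter().all(|r| r.passed) { 0 } else { 1 }
}
```

Keeping the rendering separate from printing makes the flag combinations easy to unit-test without capturing stdout.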

Rules Checked:

Rule	What is checked
doctype	Document begins with <!DOCTYPE html>
root-element	Exactly one <html> element at the top level
head-required	<head> is present and contains a <title>
body-required	<body> element is present
tag-nesting	Every open tag has a matching close tag in the correct order
void-elements	Void elements (br, img, input, hr, meta, link, etc.) have no close tag
attr-quotes	All attribute values are quoted
duplicate-id	No two elements share the same id attribute value
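
As an illustration of how one of these rules might be expressed, the sketch below checks the void-elements rule for a single close tag. The element list is the void-element set from the HTML Living Standard; whether the validator uses exactly this list is an assumption, and is_void and check_close_tag are hypothetical names.

```rust
/// Void elements as listed in the HTML Living Standard.
/// Assumption: the validator's own list may differ.
const VOID_ELEMENTS: &[&str] = &[
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
];

/// Returns true when `name` is a void element (case-insensitive).
pub fn is_void(name: &str) -> bool {
    VOID_ELEMENTS.contains(&name.to_ascii_lowercase().as_str())
}

/// Returns the violated rule name when a close tag targets a void element.
pub fn check_close_tag(name: &str) -> Option<&'static str> {
    if is_void(name) {
        Some("void-elements")
    } else {
        None
    }
}
```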

Design:

tokenizer - Tokenizer

Scans the raw HTML source character by character and emits a flat stream of Token values. Recognized variants include TagOpen, TagClose, AttrName, AttrValue, AttrValueUnquoted, SelfClose, TagEnd, Text, Comment, Doctype, and Eof. Every token carries a (line, col) position for precise error reporting. Malformed input is never rejected at this stage; it becomes an opaque token for later stages to handle.
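
A heavily simplified scanner shows how line/column tracking of this kind could work. It is a sketch under stated assumptions: only five of the variants are modeled, attributes inside tags are not handled, positions are assumed to be 1-based, and the TokenKind/Token split and the tokenize function are hypothetical rather than the crate's actual API.

```rust
/// Hypothetical subset of the token variants named above.
#[derive(Debug, Clone, PartialEq)]
pub enum TokenKind {
    TagOpen(String),  // "<name"
    TagClose(String), // "</name"
    TagEnd,           // ">"
    Text(String),
    Eof,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Token {
    pub kind: TokenKind,
    pub line: usize, // assumed 1-based
    pub col: usize,  // assumed 1-based
}

/// Scans `src` and records the start position of every token.
/// Nothing is rejected: unexpected input simply becomes a token.
pub fn tokenize(src: &str) -> Vec<Token> {
    let chars: Vec<char> = src.chars().collect();
    let (mut i, mut line, mut col) = (0usize, 1usize, 1usize);
    let mut out = Vec::new();

    while i < chars.len() {
        let (start, start_line, start_col) = (i, line, col);
        let kind = if chars[i] == '<' {
            i += 1;
            let closing = i < chars.len() && chars[i] == '/';
            if closing {
                i += 1;
            }
            let name_start = i;
            while i < chars.len() && chars[i].is_ascii_alphanumeric() {
                i += 1;
            }
            let name: String = chars[name_start..i].iter().collect();
            if closing { TokenKind::TagClose(name) } else { TokenKind::TagOpen(name) }
        } else if chars[i] == '>' {
            i += 1;
            TokenKind::TagEnd
        } else {
            while i < chars.len() && chars[i] != '<' && chars[i] != '>' {
                i += 1;
            }
            TokenKind::Text(chars[start..i].iter().collect())
        };
        // Advance the position over everything this token consumed.
        for &c in &chars[start..i] {
            if c == '\n' {
                line += 1;
                col = 1;
            } else {
                col += 1;
            }
        }
        out.push(Token { kind, line: start_line, col: start_col });
    }
    out.push(Token { kind: TokenKind::Eof, line, col });
    out
}
```

Every branch consumes at least one character, so the loop always terminates, even on input such as a lone "<".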

lexer - Lexer

Wraps Tokenizer and groups the flat token stream into structured Lexeme values: OpenTag, SelfClosingTag, CloseTag, TextNode, CommentNode, and DoctypeDecl. Each tag lexeme carries its collected Vec<Attr>, where Attr records the key, value, and a quoted flag used later by the attr-quotes rule. Tag names are normalized to lowercase, and whitespace-only text nodes are discarded.
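
The grouping step might look roughly like the following. The types are illustrative stand-ins, not the crate's definitions: positions, comments, and doctypes are omitted, the lex function is hypothetical, and treating a valueless attribute as quoted is an assumption made only for this sketch.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    TagOpen(String),
    TagClose(String),
    AttrName(String),
    AttrValue(String), // quoted value
    AttrValueUnquoted(String),
    SelfClose, // "/>"
    TagEnd,    // ">"
    Text(String),
}

#[derive(Debug, Clone, PartialEq)]
pub struct Attr {
    pub key: String,
    pub value: String,
    pub quoted: bool,
}

#[derive(Debug, Clone, PartialEq)]
pub enum Lexeme {
    OpenTag { name: String, attrs: Vec<Attr> },
    SelfClosingTag { name: String, attrs: Vec<Attr> },
    CloseTag { name: String },
    TextNode(String),
}

/// Groups a flat token slice into lexemes.
pub fn lex(tokens: &[Token]) -> Vec<Lexeme> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < tokens.len() {
        match &tokens[i] {
            Token::TagOpen(name) => {
                let name = name.to_ascii_lowercase(); // normalize tag name
                let mut attrs = Vec::new();
                let mut self_closing = false;
                i += 1;
                // Collect attributes until the tag is terminated.
                while i < tokens.len() {
                    match &tokens[i] {
                        // Assumption: a valueless attribute counts as quoted.
                        Token::AttrName(k) => attrs.push(Attr {
                            key: k.clone(),
                            value: String::new(),
                            quoted: true,
                        }),
                        Token::AttrValue(v) => {
                            if let Some(a) = attrs.last_mut() {
                                a.value = v.clone();
                                a.quoted = true;
                            }
                        }
                        Token::AttrValueUnquoted(v) => {
                            if let Some(a) = attrs.last_mut() {
                                a.value = v.clone();
                                a.quoted = false;
                            }
                        }
                        Token::SelfClose => {
                            self_closing = true;
                            i += 1;
                            break;
                        }
                        Token::TagEnd => {
                            i += 1;
                            break;
                        }
                        _ => break, // malformed tag: let the outer loop handle it
                    }
                    i += 1;
                }
                out.push(if self_closing {
                    Lexeme::SelfClosingTag { name, attrs }
                } else {
                    Lexeme::OpenTag { name, attrs }
                });
            }
            Token::TagClose(name) => {
                out.push(Lexeme::CloseTag { name: name.to_ascii_lowercase() });
                i += 1;
                // Skip the terminating ">" if present.
                if matches!(tokens.get(i), Some(Token::TagEnd)) {
                    i += 1;
                }
            }
            Token::Text(t) => {
                // Whitespace-only text nodes are discarded.
                if !t.trim().is_empty() {
                    out.push(Lexeme::TextNode(t.clone()));
                }
                i += 1;
            }
            _ => i += 1, // stray token outside a tag: skipped in this sketch
        }
    }
    out
}
```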

validator - Validator

Drives the lexer and applies all eight rules in a single pass. It maintains an element stack for nesting checks, a HashSet for duplicate-ID detection, and boolean flags for the document-level requirements (doctype, html, head, title, body). All violations are collected into a Vec<ValidationError> before the Report is returned; the validator never aborts early.

Validator::validate(src: &str, file: &Path) -> Report is the single public entry point. Report::is_valid() returns true when the error list is empty.
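
The single-pass, collect-everything approach can be sketched for two of the rules (tag-nesting and duplicate-id). This is not the real validator: it consumes a hypothetical Event type instead of driving the lexer, the ValidationError fields and the check function are assumptions, and elements left unclosed at end of input are not reported.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
pub struct ValidationError {
    pub rule: &'static str,
    pub line: usize,
    pub col: usize,
    pub message: String,
}

#[derive(Debug, Default)]
pub struct Report {
    pub errors: Vec<ValidationError>,
}

impl Report {
    /// True when no violations were collected.
    pub fn is_valid(&self) -> bool {
        self.errors.is_empty()
    }
}

/// Hypothetical simplified input standing in for the lexeme stream.
pub enum Event<'a> {
    Open { name: &'a str, id: Option<&'a str>, line: usize, col: usize },
    Close { name: &'a str, line: usize, col: usize },
}

/// Applies both rules in one pass and never returns early.
pub fn check(events: &[Event]) -> Report {
    let mut report = Report::default();
    let mut stack: Vec<&str> = Vec::new(); // currently open elements
    let mut seen_ids: HashSet<&str> = HashSet::new();

    for ev in events {
        match *ev {
            Event::Open { name, id, line, col } => {
                stack.push(name);
                if let Some(id) = id {
                    // insert() returns false when the id was already present.
                    if !seen_ids.insert(id) {
                        report.errors.push(ValidationError {
                            rule: "duplicate-id",
                            line,
                            col,
                            message: format!("duplicate id '{id}'"),
                        });
                    }
                }
            }
            Event::Close { name, line, col } => match stack.pop() {
                Some(open) if open == name => {}
                Some(open) => report.errors.push(ValidationError {
                    rule: "tag-nesting",
                    line,
                    col,
                    message: format!("</{name}> does not match open <{open}>"),
                }),
                None => report.errors.push(ValidationError {
                    rule: "tag-nesting",
                    line,
                    col,
                    message: format!("</{name}> has no matching open tag"),
                }),
            },
        }
    }
    report
}
```

Pushing errors onto the report instead of returning a Result is what lets a single run surface every violation in a file.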

entry_point - CLI orchestration

Parses arguments with clap (derive API), collects .html / .htm files from the supplied paths, and skips build and VCS directories (target, bin, obj, .git, archive, etc.). For each file it calls Validator::validate, prints the per-file report, and tracks the overall pass/fail count. Exits with status 1 if any file fails.
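
File collection with directory skipping could be written with the standard library alone, roughly as below. The function names are hypothetical, the skip list contains only the names mentioned above (the full list is not given here), and matching extensions case-sensitively is an assumption.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Directory names to skip; only the ones named in this README.
const SKIP_DIRS: &[&str] = &["target", "bin", "obj", ".git", "archive"];

/// True for paths ending in .html or .htm (case-sensitive in this sketch).
pub fn is_html(path: &Path) -> bool {
    matches!(
        path.extension().and_then(|e| e.to_str()),
        Some("html") | Some("htm")
    )
}

/// True when the final path component is one of the skipped directory names.
pub fn is_skipped_dir(path: &Path) -> bool {
    path.file_name()
        .and_then(|n| n.to_str())
        .map(|n| SKIP_DIRS.contains(&n))
        .unwrap_or(false)
}

/// Appends every HTML file found at `path` to `out`.
/// Subdirectories are entered only when `recursive` is set.
pub fn collect(path: &Path, recursive: bool, out: &mut Vec<PathBuf>) -> io::Result<()> {
    if path.is_file() {
        if is_html(path) {
            out.push(path.to_path_buf());
        }
        return Ok(());
    }
    for entry in fs::read_dir(path)? {
        let p = entry?.path();
        if p.is_dir() {
            if recursive && !is_skipped_dir(&p) {
                collect(&p, recursive, out)?;
            }
        } else if is_html(&p) {
            out.push(p);
        }
    }
    Ok(())
}
```

Note that fs::read_dir yields entries in a platform-dependent order, so a real tool would likely sort the collected paths before reporting.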

Build:

cargo build
cargo build --release
cargo test

Testing:

Each library crate has its own #[cfg(test)] suite:

Crate	Tests	Coverage
tokenizer	11	tags, attributes, doctypes, comments, position tracking
lexer	6	tag grouping, attribute collection, case normalization
validator	9	valid documents, missing elements, nesting errors, void elements, duplicate IDs, unquoted attributes

# Run all tests
cargo test

# Run with output visible
cargo test -- --show-output

External Dependencies:

Package	Version	Used by	Purpose
clap	4.x	entry_point	Derive-based CLI argument parsing

The tokenizer, lexer, and validator crates use the Rust standard library only.