RsPageValidator

Validates HTML files for structural correctness.

Concept:

RsPageValidator checks HTML files for structural correctness. Given one or more files or directories, it scans every .html / .htm file and reports violations of eight structural rules with precise line and column locations. The tool is designed for batch use: validate a single page during development, or sweep an entire static-site tree before publishing. Exit status 0 means all files pass; exit status 1 means at least one violation was found.

Crates:

The project is a Cargo workspace composed of four crates in a strict linear dependency chain.

| Crate | Kind | Role |
|---|---|---|
| tokenizer | library | Raw lexical scanner; produces a Token stream with line/column tracking |
| lexer | library | Groups tokens into structured Lexeme values; normalizes tag names to lowercase |
| validator | library | Applies the eight structural rules; collects all errors before returning |
| entry_point | binary (rs_page_validator) | CLI orchestration: parses arguments, walks directories, prints the report |

The dependency chain is strictly linear: tokenizer → lexer → validator → entry_point. Only entry_point has an external dependency (clap); the three library crates use the Rust standard library only.

Quick Start:

# Validate a single file
cargo run -- index.html

# Validate a directory tree, quiet mode (errors only)
cargo run -- -r -q ./site

# Validate with pass/fail summary
cargo run -- -r -s ./site

# Run the release build directly
./target/release/rs_page_validator -r ./website

Command-Line Options:

| Option | Argument | Default | Meaning |
|---|---|---|---|
| <path>... | file or directory | (required) | One or more HTML files or directories to validate |
| -r, --recursive | (flag) | off | Descend into subdirectories |
| -q, --quiet | (flag) | off | Print only files with errors; suppress PASS lines |
| -s, --summary | (flag) | off | Print pass/fail count after all files are processed |
| -h, --help | (flag) | off | Print help and exit |

Output:

Each file is reported as PASS or FAIL. Failing files list every violation with rule name, line, and column:
PASS  site/index.html
FAIL  site/about.html
      [tag-nesting] 14:3 — </div> does not match open <p>
      [duplicate-id] 22:10 — duplicate id 'header'
PASS  site/contact.html

2 passed, 1 failed
With -q (quiet), PASS lines are suppressed. With -s (summary), a count line like the last line of the sample is appended after all files are processed. Both flags may be combined.
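
As an illustration of the report layout and the two flags, here is a minimal sketch of a renderer. The `Violation` and `FileResult` types and the `render_report` and `exit_code` functions are hypothetical names introduced for this example; the project's real types may look different.

```rust
/// Hypothetical violation record (illustrative, not the project's type).
struct Violation {
    rule: &'static str,
    line: usize,
    col: usize,
    message: String,
}

/// Hypothetical per-file result (illustrative).
struct FileResult {
    path: String,
    violations: Vec<Violation>,
}

/// Renders the report as text, honoring the quiet and summary flags.
fn render_report(results: &[FileResult], quiet: bool, summary: bool) -> String {
    let mut out = String::new();
    let mut passed = 0usize;
    let mut failed = 0usize;

    for r in results {
        if r.violations.is_empty() {
            passed += 1;
            // Quiet mode suppresses PASS lines only.
            if !quiet {
                out.push_str(&format!("PASS  {}\n", r.path));
            }
        } else {
            failed += 1;
            out.push_str(&format!("FAIL  {}\n", r.path));
            for v in &r.violations {
                // \u{2014} is the dash used in the sample output.
                out.push_str(&format!(
                    "      [{}] {}:{} \u{2014} {}\n",
                    v.rule, v.line, v.col, v.message
                ));
            }
        }
    }

    // Summary mode appends the count line after a blank line.
    if summary {
        out.push_str(&format!("\n{} passed, {} failed\n", passed, failed));
    }
    out
}

/// Exit status: 0 when every file passes, 1 otherwise.
fn exit_code(results: &[FileResult]) -> i32 {
    if results.iter().all(|r| r.violations.is_empty()) {
        0
    } else {
        1
    }
}
```

Building the report as a `String` rather than printing directly keeps the formatting logic easy to test; a binary would print the result and pass `exit_code` to `std::process::exit`.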

Rules Checked:

| Rule | What is checked |
|---|---|
| doctype | Document begins with <!DOCTYPE html> |
| root-element | Exactly one <html> element at the top level |
| head-required | <head> is present and contains a <title> |
| body-required | <body> element is present |
| tag-nesting | Every open tag has a matching close tag in the correct order |
| void-elements | Void elements (br, img, input, hr, meta, link, etc.) have no close tag |
| attr-quotes | All attribute values are quoted |
| duplicate-id | No two elements share the same id attribute value |
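
To make the void-elements rule concrete, the following sketch shows one way such a check could look. The element list is the void-element set from the HTML Living Standard; whether the tool uses exactly this list is an assumption, and `is_void` and `check_close_tag` are hypothetical helper names.

```rust
/// Void elements as listed in the HTML Living Standard.
/// Assumption: the tool's own list may differ.
const VOID_ELEMENTS: &[&str] = &[
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
];

/// Returns true when `name` is a void element (case-insensitive).
fn is_void(name: &str) -> bool {
    VOID_ELEMENTS.iter().any(|v| v.eq_ignore_ascii_case(name))
}

/// Returns a violation message when a close tag names a void element.
fn check_close_tag(name: &str) -> Option<String> {
    if is_void(name) {
        Some(format!(
            "void element <{}> must not have a close tag",
            name.to_ascii_lowercase()
        ))
    } else {
        None
    }
}
```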

Design:

tokenizer - Tokenizer

Scans the raw HTML source character by character and emits a flat stream of Token values. Recognized variants include TagOpen, TagClose, AttrName, AttrValue, AttrValueUnquoted, SelfClose, TagEnd, Text, Comment, Doctype, and Eof. Every token carries a (line, col) position for precise error reporting. Malformed input is never rejected at this stage — it becomes an opaque token for later stages to handle.
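
The position tracking described above can be sketched as follows. This is a deliberately reduced scanner, not the crate's implementation: it handles only attribute-less tags and text, and the `Spanned` wrapper and `tokenize` function are hypothetical names. The variant names follow the list above, but their payloads are assumptions.

```rust
/// A reduced token set; the payloads are illustrative assumptions.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    TagOpen(String),  // "<name"
    TagClose(String), // "</name"
    TagEnd,           // ">"
    Text(String),
    Eof,
}

/// A token plus the 1-based position where it starts (hypothetical wrapper).
#[derive(Debug, Clone, PartialEq)]
struct Spanned {
    token: Token,
    line: usize,
    col: usize,
}

/// Scans `src` character by character, recording where each token begins.
fn tokenize(src: &str) -> Vec<Spanned> {
    let chars: Vec<char> = src.chars().collect();
    let mut out = Vec::new();
    let (mut i, mut line, mut col) = (0usize, 1usize, 1usize);

    while i < chars.len() {
        let (start, start_line, start_col) = (i, line, col);
        let token = if chars[i] == '<' {
            i += 1;
            let closing = i < chars.len() && chars[i] == '/';
            if closing {
                i += 1;
            }
            let name_start = i;
            while i < chars.len() && chars[i].is_ascii_alphanumeric() {
                i += 1;
            }
            let name: String = chars[name_start..i].iter().collect();
            if closing { Token::TagClose(name) } else { Token::TagOpen(name) }
        } else if chars[i] == '>' {
            i += 1;
            Token::TagEnd
        } else {
            // Everything up to the next angle bracket is text.
            while i < chars.len() && chars[i] != '<' && chars[i] != '>' {
                i += 1;
            }
            Token::Text(chars[start..i].iter().collect())
        };

        // Advance the line/column counters over the consumed characters.
        for &c in &chars[start..i] {
            if c == '\n' {
                line += 1;
                col = 1;
            } else {
                col += 1;
            }
        }
        out.push(Spanned { token, line: start_line, col: start_col });
    }

    out.push(Spanned { token: Token::Eof, line, col });
    out
}
```

For the input `"<p>\nhi</p>"`, the closing tag is reported at line 2, column 3, because the newline inside the text token resets the column counter.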

lexer - Lexer

Wraps Tokenizer and groups the flat token stream into structured Lexeme values: OpenTag, SelfClosingTag, CloseTag, TextNode, CommentNode, and DoctypeDecl. Each tag lexeme carries its collected Vec<Attr>, where Attr records the key, value, and a quoted flag used later by the attr-quotes rule. Tag names are normalized to lowercase, and whitespace-only text nodes are discarded.
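
A rough sketch of this grouping step is shown below. It works on a slice of tokens rather than wrapping a live tokenizer, omits comments and doctypes, and treats a valueless attribute as quoted; all three are simplifications assumed for the example. The `group` function and the exact field layout of `Lexeme` and `Attr` are hypothetical.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Token {
    TagOpen(String),
    TagClose(String),
    AttrName(String),
    AttrValue(String),
    AttrValueUnquoted(String),
    SelfClose,
    TagEnd,
    Text(String),
}

#[derive(Debug, Clone, PartialEq)]
struct Attr {
    key: String,
    value: String,
    quoted: bool,
}

#[derive(Debug, Clone, PartialEq)]
enum Lexeme {
    OpenTag { name: String, attrs: Vec<Attr> },
    SelfClosingTag { name: String, attrs: Vec<Attr> },
    CloseTag { name: String },
    TextNode(String),
}

/// Groups a flat token slice into lexemes (illustrative only).
fn group(tokens: &[Token]) -> Vec<Lexeme> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < tokens.len() {
        match &tokens[i] {
            Token::TagOpen(raw) => {
                let name = raw.to_ascii_lowercase();
                let mut attrs: Vec<Attr> = Vec::new();
                let mut self_closing = false;
                i += 1;
                // Collect attribute tokens until the tag is terminated.
                while i < tokens.len() {
                    match &tokens[i] {
                        Token::AttrName(k) => attrs.push(Attr {
                            key: k.clone(),
                            value: String::new(),
                            quoted: true,
                        }),
                        Token::AttrValue(v) | Token::AttrValueUnquoted(v) => {
                            if let Some(a) = attrs.last_mut() {
                                a.value = v.clone();
                                a.quoted = matches!(tokens[i], Token::AttrValue(_));
                            }
                        }
                        Token::SelfClose => self_closing = true,
                        Token::TagEnd => {
                            i += 1;
                            break;
                        }
                        // Malformed tag: leave the token for the outer loop.
                        _ => break,
                    }
                    i += 1;
                }
                out.push(if self_closing {
                    Lexeme::SelfClosingTag { name, attrs }
                } else {
                    Lexeme::OpenTag { name, attrs }
                });
            }
            Token::TagClose(raw) => {
                out.push(Lexeme::CloseTag { name: raw.to_ascii_lowercase() });
                i += 1;
            }
            Token::Text(t) => {
                // Whitespace-only text nodes are discarded.
                if !t.trim().is_empty() {
                    out.push(Lexeme::TextNode(t.clone()));
                }
                i += 1;
            }
            // Stray TagEnd or attribute tokens outside a tag are skipped.
            _ => i += 1,
        }
    }
    out
}
```

Keeping the `quoted` flag on `Attr` lets the lexer stay rule-agnostic: it records what it saw, and a later stage decides whether an unquoted value is a violation.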

validator - Validator

Drives the lexer and applies all eight rules in a single pass. It maintains an element stack for nesting checks, a HashSet for duplicate-ID detection, and boolean flags for the document-level requirements (doctype, html, head, title, body). All violations are collected into a Vec<ValidationError> before the Report is returned — the validator never aborts early. Validator::validate(src: &str, file: &Path) -> Report is the single public entry point. Report::is_valid() returns true when the error list is empty.
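
The element stack and the ID set can be illustrated with a small sketch covering just the tag-nesting and duplicate-id rules. The `Event` enum, the `check` function, the field layout of `ValidationError`, and the recovery step after a mismatch are all assumptions made for this example; only the two message formats shown in the Output section come from the tool itself.

```rust
use std::collections::HashSet;

/// Illustrative error record; the real ValidationError likely carries more
/// (for example the line and column).
#[derive(Debug, PartialEq)]
struct ValidationError {
    rule: &'static str,
    message: String,
}

/// Simplified stand-in for the lexer's tag lexemes (hypothetical).
enum Event<'a> {
    Open { name: &'a str, id: Option<&'a str> },
    Close { name: &'a str },
}

/// Applies the nesting and duplicate-id checks in one pass,
/// collecting every error instead of stopping at the first.
fn check<'a>(events: &[Event<'a>]) -> Vec<ValidationError> {
    let mut stack: Vec<&'a str> = Vec::new();
    let mut seen_ids: HashSet<&'a str> = HashSet::new();
    let mut errors = Vec::new();

    for ev in events {
        match ev {
            Event::Open { name, id } => {
                stack.push(*name);
                if let Some(id) = *id {
                    // insert returns false when the id was already present.
                    if !seen_ids.insert(id) {
                        errors.push(ValidationError {
                            rule: "duplicate-id",
                            message: format!("duplicate id '{}'", id),
                        });
                    }
                }
            }
            Event::Close { name } => match stack.last().copied() {
                Some(open) if open == *name => {
                    stack.pop();
                }
                Some(open) => {
                    errors.push(ValidationError {
                        rule: "tag-nesting",
                        message: format!("</{}> does not match open <{}>", name, open),
                    });
                    // Recovery (an assumption): unwind to the matching open tag.
                    if let Some(pos) = stack.iter().rposition(|n| n == name) {
                        stack.truncate(pos);
                    }
                }
                None => errors.push(ValidationError {
                    rule: "tag-nesting",
                    message: format!("</{}> has no matching open tag", name),
                }),
            },
        }
    }

    // Anything left on the stack was never closed.
    for open in stack {
        errors.push(ValidationError {
            rule: "tag-nesting",
            message: format!("<{}> is never closed", open),
        });
    }
    errors
}
```

Unwinding the stack after a mismatch is one way to avoid a cascade of follow-on errors from a single missing close tag; how the actual validator recovers is not stated here.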

entry_point - CLI orchestration

Parses arguments with clap (derive API), collects .html / .htm files from the supplied paths, and skips build and VCS directories (target, bin, obj, .git, archive, etc.). For each file it calls Validator::validate, prints the per-file report, and tracks the overall pass/fail count. Exits with status 1 if any file fails.
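
The file collection step might look roughly like the sketch below, which uses only the standard library. The `collect_html` and `is_html` names are hypothetical, `SKIP_DIRS` is an illustrative subset of the skipped directories named above, and the exact-case extension match is an assumption.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Illustrative subset of the build and VCS directories to skip.
const SKIP_DIRS: &[&str] = &["target", "bin", "obj", ".git", "archive"];

/// True when the path ends in .html or .htm (exact case; an assumption).
fn is_html(path: &Path) -> bool {
    matches!(
        path.extension().and_then(|e| e.to_str()),
        Some("html") | Some("htm")
    )
}

/// Appends every HTML file under `path` to `out`.
/// Subdirectories are visited only when `recursive` is set.
fn collect_html(path: &Path, recursive: bool, out: &mut Vec<PathBuf>) -> io::Result<()> {
    if path.is_file() {
        if is_html(path) {
            out.push(path.to_path_buf());
        }
        return Ok(());
    }
    for entry in fs::read_dir(path)? {
        let child = entry?.path();
        if child.is_dir() {
            let skip = child
                .file_name()
                .and_then(|n| n.to_str())
                .map(|n| SKIP_DIRS.iter().any(|d| *d == n))
                .unwrap_or(false);
            if recursive && !skip {
                collect_html(&child, recursive, out)?;
            }
        } else if is_html(&child) {
            out.push(child);
        }
    }
    Ok(())
}
```

Because `fs::read_dir` returns entries in platform-dependent order, sorting the collected paths before validation would give a stable report order.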

Build:

cargo build
cargo build --release
cargo test

Testing:

Each crate has its own #[cfg(test)] suite:

| Crate | Tests | Coverage |
|---|---|---|
| tokenizer | 11 | tags, attributes, doctypes, comments, position tracking |
| lexer | 6 | tag grouping, attribute collection, case normalization |
| validator | 9 | valid documents, missing elements, nesting errors, void elements, duplicate IDs, unquoted attributes |

# Run all tests
cargo test

# Run with output visible
cargo test -- --show-output

External Dependencies:

| Package | Version | Used by | Purpose |
|---|---|---|---|
| clap | 4.x | entry_point | Derive-based CLI argument parsing |

The tokenizer, lexer, and validator crates use the Rust standard library only.