PyPageValidator

Validates HTML files for structural correctness.

Concept:

PyPageValidator checks HTML files for structural correctness. Given one or more files or directories, it scans every .html / .htm file and reports violations of eight structural rules with precise line and column locations. It is written for Python 3.10+ and requires no build step. The tool is designed for batch use: validate a single page during development, or sweep an entire static-site tree before publishing. Exit status 0 means all files pass; exit status 1 means at least one file failed validation or could not be read.

Packages:

The project is composed of four packages in a strict linear dependency chain, each with a companion test module.

Package     Kind                            Role
Tokenizer   library                         Raw lexical scanner; produces a Token stream with no grammar knowledge
Lexer       library                         Groups tokens into structured Lexeme values; normalizes tag names to lowercase
Validator   library                         Drives the Lexer, applies the eight structural rules, and returns a Report containing all errors
EntryPoint  executable (page_validator.py)  CLI orchestration: parses arguments, walks directories, calls Validator, prints the report

The dependency chain is strictly linear: Tokenizer -> Lexer -> Validator -> EntryPoint. No package imports another laterally; all coupling flows through EntryPoint.

Quick Start:

# Validate a single file
python EntryPoint/page_validator.py index.html

# Validate a directory tree, quiet mode (errors only)
python EntryPoint/page_validator.py -r -q ./site

# Validate with pass/fail summary
python EntryPoint/page_validator.py -r -s ./site

Command-Line Options:

Option           Argument           Default     Meaning
<path>...        file or directory  (required)  One or more HTML files or directories to validate
-r, --recursive  (flag)             off         Descend into subdirectories
-q, --quiet      (flag)             off         Print only files with errors; suppress PASS lines
-s, --summary    (flag)             off         Print pass/fail count after all files are processed
-h, --help       (flag)             off         Print help and exit
Running with no arguments displays help and exits cleanly.
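
The option table above can be handled without argparse. The following is a minimal sketch of manual flag parsing, not the tool's actual code: the Options class, the FLAGS mapping, and the parse_args name are hypothetical, and rejecting unknown options with ValueError is an assumption. Combined short flags such as -rq are not handled here.

```python
from dataclasses import dataclass, field

@dataclass
class Options:
    """Parsed command line (hypothetical container, not from the project)."""
    paths: list = field(default_factory=list)
    recursive: bool = False
    quiet: bool = False
    summary: bool = False
    help: bool = False

# Every accepted spelling of a flag, mapped to the Options field it sets.
FLAGS = {
    "-r": "recursive", "--recursive": "recursive",
    "-q": "quiet", "--quiet": "quiet",
    "-s": "summary", "--summary": "summary",
    "-h": "help", "--help": "help",
}

def parse_args(argv):
    """Parse argv (without the program name) into Options."""
    opts = Options()
    if not argv:
        # No arguments at all: behave as if --help had been given.
        opts.help = True
        return opts
    for arg in argv:
        if arg in FLAGS:
            setattr(opts, FLAGS[arg], True)
        elif arg.startswith("-") and arg != "-":
            # Assumption: unknown options are an error rather than a path.
            raise ValueError(f"unknown option: {arg}")
        else:
            opts.paths.append(arg)
    return opts
```

For example, parse_args(["-r", "-q", "./site"]) sets recursive and quiet and leaves "./site" as the only path.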

Output:

Each file is reported as PASS or FAIL. Failing files list every violation with rule name, line, and column:
PASS  site/index.html
FAIL  site/about.html
      [tag-nesting] 14:3 — </div> does not match open <p>
      [duplicate-id] 22:10 — duplicate id 'header'
PASS  site/contact.html

2 passed, 1 failed
With -q (quiet), PASS lines are suppressed. With -s (summary), the count line is appended. Both flags may be combined.
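
The interaction of -q and -s can be sketched as a small formatting function. This is an illustration under assumptions, not the tool's code: format_report and the (rule, line, col, message) tuple shape are hypothetical, and the spacing simply copies the sample output above.

```python
def format_report(results, quiet=False, summary=False):
    """Render results as output lines.

    results is a list of (path, errors) pairs; errors is a list of
    (rule, line, col, message) tuples (a hypothetical shape).
    """
    lines = []
    passed = failed = 0
    for path, errors in results:
        if errors:
            failed += 1
            lines.append(f"FAIL  {path}")
            for rule, line, col, message in errors:
                lines.append(f"      [{rule}] {line}:{col} — {message}")
        else:
            passed += 1
            if not quiet:
                # Quiet mode suppresses PASS lines only; FAIL lines always print.
                lines.append(f"PASS  {path}")
    if summary:
        lines.append("")
        lines.append(f"{passed} passed, {failed} failed")
    return lines
```

Passing files are still counted in quiet mode, so the summary line is the same with or without -q.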

Rules Checked:

Rule           What is checked
doctype        Document begins with <!DOCTYPE html>
root-element   Exactly one <html> element at the top level
head-required  <head> is present and contains a <title>
body-required  <body> element is present
tag-nesting    Every open tag has a matching close tag in the correct order
void-elements  Void elements (br, img, input, hr, meta, link, etc.) have no close tag
attr-quotes    All attribute values are quoted
duplicate-id   No two elements share the same id attribute value
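
As an illustration of how one of these rules might be expressed, here is a sketch of the void-elements check. The set below is the void element list from the HTML Living Standard; the list the tool actually uses may differ, and the check_void_close name, the return shape, and the message text are hypothetical.

```python
# Void elements per the HTML Living Standard (assumed to match the tool's list).
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})

def check_void_close(name, pos):
    """Return a violation tuple if </name> closes a void element, else None.

    pos is a (line, column) pair; the (rule, line, col, message) result
    shape is an assumption made for this sketch.
    """
    name = name.lower()
    if name in VOID_ELEMENTS:
        line, col = pos
        message = f"void element <{name}> must not have a close tag"
        return ("void-elements", line, col, message)
    return None
```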

Design:

Tokenizer

Reads raw HTML source and emits a flat stream of Token values. Token is a plain base class with dataclass subtypes:
from dataclasses import dataclass

class Token: pass

@dataclass
class TagOpen(Token): name: str
@dataclass
class TagClose(Token): name: str
@dataclass
class AttrName(Token): name: str
@dataclass
class AttrValue(Token): value: str      # quoted
@dataclass
class AttrUnquoted(Token): value: str   # unquoted
class SelfClose(Token): pass
class TagEnd(Token): pass
@dataclass
class Text(Token): content: str
@dataclass
class Comment(Token): content: str
@dataclass
class Doctype(Token): content: str
class Eof(Token): pass
The tokenizer holds no HTML grammar knowledge — it only recognises <, >, =, ", ', !, /, and - as structurally significant. Malformed input is never rejected at this stage.
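
To make the "no grammar knowledge" idea concrete, here is a reduced scanner sketch that emits the token types listed above. It is not the project's tokenizer: the tokenize, _scan_tag, _until, and _find names are hypothetical, position tracking is omitted, and emitting a TagEnd after a close tag is an assumption.

```python
from dataclasses import dataclass

class Token: pass

@dataclass
class TagOpen(Token): name: str
@dataclass
class TagClose(Token): name: str
@dataclass
class AttrName(Token): name: str
@dataclass
class AttrValue(Token): value: str
@dataclass
class AttrUnquoted(Token): value: str
@dataclass
class Text(Token): content: str
@dataclass
class Comment(Token): content: str
@dataclass
class Doctype(Token): content: str
class SelfClose(Token): pass
class TagEnd(Token): pass
class Eof(Token): pass

def _until(src, i, stops):
    """Index of the first whitespace or stop character at or after i."""
    while i < len(src) and not src[i].isspace() and src[i] not in stops:
        i += 1
    return i

def _find(src, needle, i):
    """Like str.find, but return len(src) when the needle is missing."""
    j = src.find(needle, i)
    return len(src) if j == -1 else j

def _scan_tag(src, i, out):
    """Scan a tag starting just after '<'; return the index just past '>'."""
    closing = src.startswith("/", i)
    if closing:
        i += 1
    j = _until(src, i, "/>")
    out.append(TagClose(src[i:j]) if closing else TagOpen(src[i:j]))
    i = j
    while i < len(src) and src[i] != ">":
        if src[i].isspace():
            i += 1
        elif src[i] == "/":
            out.append(SelfClose())
            i += 1
        elif src[i] == "=":
            i += 1
            if i < len(src) and src[i] in "\"'":
                # Quoted value: runs to the matching quote (or end of input).
                end = _find(src, src[i], i + 1)
                out.append(AttrValue(src[i + 1:end]))
                i = end + 1
            else:
                j = _until(src, i, ">")
                out.append(AttrUnquoted(src[i:j]))
                i = j
        else:
            j = _until(src, i, "=/>")
            out.append(AttrName(src[i:j]))
            i = j
    out.append(TagEnd())
    return i + 1

def tokenize(src):
    """Turn HTML source into a flat token list; malformed input is never rejected."""
    out, i = [], 0
    while i < len(src):
        if src[i] != "<":
            j = _find(src, "<", i)
            out.append(Text(src[i:j]))
            i = j
        elif src.startswith("<!--", i):
            j = _find(src, "-->", i + 4)
            out.append(Comment(src[i + 4:j]))
            i = j + 3
        elif src.startswith("<!", i):
            j = _find(src, ">", i)
            out.append(Doctype(src[i + 2:j].strip()))
            i = j + 1
        else:
            i = _scan_tag(src, i + 1, out)
    out.append(Eof())
    return out
```

Note that the scanner never checks whether a tag name is a real HTML element or whether tags balance; those decisions are left to the later stages.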

Lexer

Consumes the Token stream and groups tokens into structured Lexeme values. Lexeme is a plain base class with dataclass subtypes:
from dataclasses import dataclass

class Lexeme: pass

@dataclass
class OpenTag(Lexeme): name: str; attrs: list["Attr"]; pos: tuple[int, int]
@dataclass
class SelfClosingTag(Lexeme): name: str; attrs: list["Attr"]; pos: tuple[int, int]
@dataclass
class CloseTag(Lexeme): name: str; pos: tuple[int, int]
@dataclass
class TextNode(Lexeme): content: str
@dataclass
class CommentNode(Lexeme): content: str
@dataclass
class DoctypeDecl(Lexeme): content: str
OpenTag and SelfClosingTag carry a list[Attr] with key, value, and quoting status used later by the attr-quotes rule. All tag names are normalized to lowercase; whitespace-only text nodes are discarded.
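
The two normalizations and the Attr record can be sketched as follows. The field names key, value, and quoted follow the description above, but their exact spelling, the use of None for a bare attribute, and the helper names make_open_tag, make_text, and unquoted_attrs are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Attr:
    key: str
    value: Optional[str]  # None for a bare attribute such as "hidden" (assumption)
    quoted: bool          # True when the value was written inside quotes

class Lexeme: pass

@dataclass
class OpenTag(Lexeme):
    name: str
    attrs: list
    pos: tuple

@dataclass
class TextNode(Lexeme):
    content: str

def make_open_tag(raw_name, raw_attrs, pos):
    """Build an OpenTag from (key, value, quoted) triples, lowercasing the tag name."""
    attrs = [Attr(key, value, quoted) for key, value, quoted in raw_attrs]
    return OpenTag(raw_name.lower(), attrs, pos)

def make_text(content):
    """Return a TextNode, or None when the text is whitespace only."""
    return None if content.strip() == "" else TextNode(content)

def unquoted_attrs(tag):
    """Keys of attributes that have a value but no quotes (input to attr-quotes)."""
    return [a.key for a in tag.attrs if a.value is not None and not a.quoted]
```

Recording the quoting status on each Attr lets the attr-quotes rule run later without access to the raw source.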

Validator

Drives the Lexer and applies all eight rules in a single pass. It maintains an element stack (list[tuple[str, tuple[int, int]]]) for nesting checks, a set for duplicate-ID detection, and boolean flags for the document-level requirements (doctype, html, head, title, body). All violations are collected into a list of ValidationError instances before the Report is returned — the validator never short-circuits on the first failure. Validator.validate(src, file) is the single public entry point. Report.is_valid returns True when the error list is empty.
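
The element stack and the ID set can be illustrated with a reduced single-pass check covering only tag-nesting and duplicate-id. This is a sketch under assumptions: the event tuples, the check_nesting_and_ids name, the message wording, and the recovery strategy after a mismatch are not taken from the project, and void or self-closing tags are assumed to be filtered out before this point.

```python
def check_nesting_and_ids(events):
    """Collect tag-nesting and duplicate-id violations in one pass.

    events is a sequence of (kind, name, pos, elem_id) tuples, where kind is
    "open" or "close", pos is (line, column), and elem_id is the id attribute
    value or None (a hypothetical input shape).
    """
    stack = []        # (name, pos) of every element that is still open
    seen_ids = set()  # id values encountered so far
    errors = []       # (rule, pos, message); never cut short on first failure
    for kind, name, pos, elem_id in events:
        if kind == "open":
            if elem_id is not None:
                if elem_id in seen_ids:
                    errors.append(("duplicate-id", pos, f"duplicate id '{elem_id}'"))
                seen_ids.add(elem_id)
            stack.append((name, pos))
        elif not stack:
            errors.append(("tag-nesting", pos, f"</{name}> has no matching open tag"))
        elif stack[-1][0] == name:
            stack.pop()
        else:
            top = stack[-1][0]
            errors.append(("tag-nesting", pos, f"</{name}> does not match open <{top}>"))
            # Recovery (an assumption): unwind to the nearest matching open tag
            # so that one mistake does not produce a cascade of errors.
            names = [n for n, _ in stack]
            if name in names:
                idx = len(names) - 1 - names[::-1].index(name)
                del stack[idx:]
    for name, pos in stack:
        errors.append(("tag-nesting", pos, f"<{name}> is never closed"))
    return errors
```

Because errors are appended rather than raised, a single call reports every violation found in the document.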

EntryPoint (page_validator.py)

Uses a standard main(argv) function with an if __name__ == '__main__' guard. Parses command-line flags manually (no external dependency), collects .html / .htm files from the supplied paths, and skips build and VCS directories. For each file it calls Validator.validate, prints the per-file report, and tracks the overall pass/fail count. Exits with status 1 if any file fails or cannot be read.
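
File collection can be sketched with os.walk. The directory names in SKIP_DIRS are assumptions (the text above only says build and VCS directories are skipped), collect_files is a hypothetical name, and accepting an explicitly named file regardless of its extension is also an assumption.

```python
import os

HTML_SUFFIXES = (".html", ".htm")

# Assumed skip list; the tool's actual set of directory names may differ.
SKIP_DIRS = {".git", ".hg", ".svn", "node_modules", "build", "dist", "__pycache__"}

def collect_files(paths, recursive=False):
    """Return the HTML files named by paths, in a stable order."""
    found = []
    for path in paths:
        if os.path.isfile(path):
            # Assumption: a file given explicitly is accepted as-is.
            found.append(path)
        elif os.path.isdir(path):
            for root, dirs, files in os.walk(path):
                if recursive:
                    # Prune skipped directories in place so os.walk never enters them.
                    dirs[:] = sorted(d for d in dirs if d not in SKIP_DIRS)
                else:
                    # Without -r, only the top-level directory is scanned.
                    dirs[:] = []
                for name in sorted(files):
                    if name.lower().endswith(HTML_SUFFIXES):
                        found.append(os.path.join(root, name))
    return found
```

Assigning to dirs[:] rather than rebinding dirs matters here: os.walk only honours pruning when the list it handed out is modified in place.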

Running the Tool:

No build step is required. Python 3.10+ must be installed.
# Verify Python version
python --version    # should show 3.10 or later

# From the PyPageValidator/ root:
python EntryPoint/page_validator.py index.html
python EntryPoint/page_validator.py -r -q -s ./site

Testing:

Each package has a test_*.py using the standard unittest framework:

Test module                  Coverage
Tokenizer/test_tokenizer.py  tags, attributes, doctypes, comments, self-closing elements
Lexer/test_lexer.py          tag grouping, attribute collection, case normalization
Validator/test_validator.py  valid documents, missing elements, nesting errors, void elements, duplicate IDs, unquoted attributes

# Run one component's tests
python -m unittest Tokenizer/test_tokenizer.py
python -m unittest Lexer/test_lexer.py
python -m unittest Validator/test_validator.py

# Run all tests via discovery (from PyPageValidator/)
python -m unittest discover -s . -p "test_*.py"

External Dependencies:

None. All components use the Python 3.10+ standard library only. No pip packages are required.