05/19/2022

Rust Bite - Regular Expressions

little language for specifying text fragments with metadata

"Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems."
- Jamie Zawinski

A regular expression is a string-based pattern that matches other strings or parts of strings. Patterns are written in a little language, summarized in Section 2. A program in that language is compiled by a regular expression compiler, and the result is used to search text for matches. Rust's standard library does not include a regular expression library, but there is a widely used regex crate on crates.io that we will use in this Bite. The compiler, regex::Regex, expects syntax similar to that used in the string processing language Perl. Regex has a number of methods, discussed in Section 3, that support finding matches and retrieving text fragments.

1.0  Introduction

Example

```rust
use regex::Regex;

fn re_check(re: &str, text: &str) -> bool {
    /* panics if re is invalid */
    let re: Regex = Regex::new(re).unwrap();
    re.is_match(text)
}

fn show_re_test(re: &str, text: &str) {
    if re_check(re, text) {
        print!("\n  RegEx: {:?} matches text: {:?}", re, text)
    } else {
        print!("\n  RegEx: {:?} does not match text: {:?}", re, text)
    }
}

fn main() {
    /* a pattern can also be compiled and used directly */
    let re = Regex::new(r"^\d{4}-\d{2}-\d{2}$").unwrap();
    assert!(re.is_match("2014-01-01"));

    let re = "abc";
    let text1 = "123abc987";
    show_re_test(re, text1);
    let text2 = "123000987";
    show_re_test(re, text2);

    let re = r"([a-z]+)|([A-Z][A-Z])";
    let text3 = "???abc";
    show_re_test(re, text3);
    let text4 = "???A@@@";
    show_re_test(re, text4);
    let text4 = "???AK@@@";
    show_re_test(re, text4);
    let text5 = r"123";
    show_re_test(re, text5);
}
```

Result

```
  RegEx: "abc" matches text: "123abc987"
  RegEx: "abc" does not match text: "123000987"
  RegEx: "([a-z]+)|([A-Z][A-Z])" matches text: "???abc"
  RegEx: "([a-z]+)|([A-Z][A-Z])" does not match text: "???A@@@"
  RegEx: "([a-z]+)|([A-Z][A-Z])" matches text: "???AK@@@"
  RegEx: "([a-z]+)|([A-Z][A-Z])" does not match text: "123"
```
The code used in this example is discussed in the next two sections.

2.0  Regular Expression Syntax Summary

A short list of syntax, extracted from the regex crate documentation, is provided in Table 1. This should be all that is needed for most applications.

Table 1. - Regular Expression Syntax

Syntax        Meaning
.             any character except newline
\d            digit
\D            not digit
\s            white space
\S            not white space
\w            word character
\W            not word character
[xyz]         character class matching either x, y, or z
[^xyz]        character class matching any char except x, y, or z
[b-q]         character class matching any char in b-q range inclusive
[0-7&&[^4]]   matches any char in range 0-7 except 4
xy            concatenation - x followed by y
x|y           x or y
x*            zero or more of x
x+            one or more of x
x?            zero or one of x
x{n}          n repetitions of x
x{m,n}        at least m but no more than n repetitions of x
^             beginning of text
$             end of text
(...)         capture group

Table 2. - Examples

Example     Action
[A-Z]       matches C in "a Capital letter"
Rust        matches any string containing "Rust"
abc{3}      matches "this is an example - abccc -"
a(bc)*      matches a, abc, abcbc, ...
fn|struct   matches "fn funct()", "struct X {}"
RE Tutorial - Factory Mind, listed in the References, has many executable examples. A more complete list of syntax is also linked there.

3.0  Regex

The regex crate, version 1.4.5, available from crates.io, is used exclusively in this Bite. Its primary struct, regex::Regex, is a processor that compiles a regular expression into a state machine. Using that it can:
  • check if a regular expression matches a given text string, as shown above
  • return capture groups that describe possibly multiple matches in a given string
  • split text at each match of the pattern
  • replace text
Searching text with a Regex instance is guaranteed to be linear in the size of the text. Compiling regular expressions requires a non-trivial amount of time, so repeated invocations should not recompile unless the pattern changes.
Partial Declarations from regex crate

```rust
#[derive(Copy, Clone, Debug, Eq, PartialEq)]
pub struct Match<'t> {
    text: &'t str,
    start: usize,
    end: usize,
}

pub struct Regex(Exec);

/* compiles regular expression, result can be used repeatedly */
pub fn new(re: &str) -> Result<Regex, Error>

/* cheapest way to detect a match */
pub fn is_match(&self, text: &str) -> bool

/* returns start and end of first match if it exists */
pub fn find<'t>(&self, text: &'t str) -> Option<Match<'t>>

/* returns iterator for successive non-overlapping matches */
pub fn find_iter<'r, 't>(&'r self, text: &'t str) -> Matches<'r, 't>

/* returns capture groups for first match in text */
pub fn captures<'t>(&self, text: &'t str) -> Option<Captures<'t>>

/* returns iterator over capture groups of all non-overlapping matches */
pub fn captures_iter<'r, 't>(
    &'r self,
    text: &'t str,
) -> CaptureMatches<'r, 't>

/* returns iterator over substrings of text delimited by matches */
pub fn split<'r, 't>(&'r self, text: &'t str) -> Split<'r, 't>

/* replaces first match with replacement */
pub fn replace<'t, R: Replacer>(
    &self,
    text: &'t str,
    rep: R,
) -> Cow<'t, str>

/* replaces all non-overlapping matches in text with replacement */
pub fn replace_all<'t, R: Replacer>(
    &self,
    text: &'t str,
    rep: R,
) -> Cow<'t, str>
```

The preceding block shows the structure and methods of regex::Regex. The next block gives examples of their use; Section 2.0 summarizes regular expression pattern syntax.
Example Use

```rust
use regex::{Captures, Match, Regex};

/* reports whether pattern matched text */
fn check(pattern: &str, text: &str, pred: bool) {
    if pred {
        print!("\n  pattern: {:?} matches text: {:?}", pattern, text);
    } else {
        print!("\n  pattern: {:?} !matches text: {:?}", pattern, text);
    }
}

/* shows half-open byte range of an optional match */
fn range(pattern: &str, text: &str, mat: &Option<Match>) {
    if let Some(mt) = mat {
        print!("\n  find pattern {:?} in text {:?}:", pattern, text);
        print!(" match in [{}, {})", mt.start(), mt.end());
    } else {
        print!("\n  no match");
    }
}

/* shows half-open byte range of one match from an iterator */
fn range_iter(pattern: &str, text: &str, mat: Match) {
    print!("\n  find pattern {:?} in text {:?}:", pattern, text);
    print!(" match in [{}, {})", mat.start(), mat.end());
}

fn test_match() {
    print!("\n  -- test_match --");
    let pattern = r"[a-q]{3,4}$";
    let re = Regex::new(pattern).unwrap();
    for text in ["12cde", "12cdefg", "12cd", "12cd3e", "12cds"] {
        let pred = re.is_match(text);
        check(pattern, text, pred);
    }
}

fn test_find() {
    print!("\n  -- test_find --");
    let pattern = r"abc";
    let re = Regex::new(pattern).unwrap();
    let text = "123abc456";
    let op: Option<Match> = re.find(text);
    range(pattern, text, &op);
}

fn test_find_iter() {
    print!("\n  -- test_find_iter --");
    let pattern = r"abc";
    let re = Regex::new(pattern).unwrap();
    let text = "123abc456abc789";
    let matches = re.find_iter(text);
    for mat in matches {
        range_iter(pattern, text, mat);
    }
}

fn test_captures() {
    print!("\n  -- test_captures --");
    let text = "123abc456def789";
    /* five explicit groups, one per three-character chunk */
    let pattern = "\
        ([a-z]{3}|[0-9]{3})\
        ([a-z]{3}|[0-9]{3})\
        ([a-z]{3}|[0-9]{3})\
        ([a-z]{3}|[0-9]{3})\
        ([a-z]{3}|[0-9]{3})\
    ";
    // These don't work as you might expect.
    // Capture doesn't work well with repetitions:
    // a repeated group keeps only its last repetition.
    // let pattern = r"([a-z]{3}|[0-9]{3}){5}";
    // let pattern = r"(([a-z]{3})([0-9]{3}))+";
    // let pattern = r"((?:\d+)+)+";
    let re = Regex::new(pattern).unwrap();
    let captures: Option<Captures> = re.captures(text);
    print!("\n  captures: {:?}", captures);
    let caps = captures.unwrap();
    for i in 0..caps.len() {
        print!("\n  captures[{}] = {:?}", i, &caps.get(i));
        let cap = &caps.get(i).unwrap();
        print!(
            "\n  cap = {:?}, {}, {}",
            cap.as_str(), cap.start(), cap.end()
        );
    }
}

fn main() {
    test_match();
    test_find();
    test_find_iter();
    test_captures();
}
```

Output

```
  -- test_match --
  pattern: "[a-q]{3,4}$" matches text: "12cde"
  pattern: "[a-q]{3,4}$" matches text: "12cdefg"
  pattern: "[a-q]{3,4}$" !matches text: "12cd"
  pattern: "[a-q]{3,4}$" !matches text: "12cd3e"
  pattern: "[a-q]{3,4}$" !matches text: "12cds"
  -- test_find --
  find pattern "abc" in text "123abc456": match in [3, 6)
  -- test_find_iter --
  find pattern "abc" in text "123abc456abc789": match in [3, 6)
  find pattern "abc" in text "123abc456abc789": match in [9, 12)
  -- test_captures --
  captures: Some(Captures({0: Some("123abc456def789"), 1: Some("123"), 2: Some("abc"), 3: Some("456"), 4: Some("def"), 5: Some("789")}))
  captures[0] = Some(Match { text: "123abc456def789", start: 0, end: 15 })
  cap = "123abc456def789", 0, 15
  captures[1] = Some(Match { text: "123abc456def789", start: 0, end: 3 })
  cap = "123", 0, 3
  captures[2] = Some(Match { text: "123abc456def789", start: 3, end: 6 })
  cap = "abc", 3, 6
  captures[3] = Some(Match { text: "123abc456def789", start: 6, end: 9 })
  cap = "456", 6, 9
  captures[4] = Some(Match { text: "123abc456def789", start: 9, end: 12 })
  cap = "def", 9, 12
  captures[5] = Some(Match { text: "123abc456def789", start: 12, end: 15 })
  cap = "789", 12, 15
```

4.0  Epilogue:

For many applications all you need is Regex::is_match(&self, text: &str) -> bool. It returns true as soon as it finds the first match, or false if there are no matches. That is efficient and flexible. Make sure that you compile the pattern only once, using Regex::new(pattern), and recompile only if the pattern changes.

5.0  References:

Link                              Description
Regular Expression Syntax         Nice organization of regular expression pattern language syntax.
regex::Regex                      Documentation for Regex and its methods.
Crate regex                       Regex crate documentation.
Jeff Atwood's Blog                Establishes the arena, provides advice and several very good links.
Regular expression - Wikipedia    Quite extensive discussion of theory, syntax, and semantics.
RE Cheat Sheet - Dave Child       Nice compact summary.
RE Tutorial - Factory Mind        Clear and fairly brief.