10/31/2024
Help: RegEx

finding substrings in files and strings

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
- Jamie Zawinski

Synopsis:

RegEx builds state machines that efficiently find text specified by a "little language". Each expression written in this Regular Expression language specifies one search state machine.

1.0 Regular Expression Specifiers

Regular Expression Specifiers (RegExSpec) are strings built from pattern matching specifiers. They are used for searching, validating, and manipulating text. RegExs are supported by Python, JavaScript, Java, and C#. Rust supports them through the regex crate.

1.1 Metacharacters

Metacharacters are characters with special meanings in the context of a regular expression.

Metacharacter   Name            Meaning
.               period          matches any character except a newline
*               star            matches zero or more of the preceding character or group
+               plus            matches one or more of the preceding character or group
?               question mark   matches zero or one instance of the preceding character or group
^               caret           matches the start of the test string, so the character that follows it must be the first
$               dollar          matches the end of the test string, so the character that precedes it must be the last
\               backslash       escapes a metacharacter so it is treated as an ordinary character
|               alternation     represents "either or", e.g., cat|dog matches either cat or dog
The code below contains two functions that use Regular Expression metacharacters. Their output follows the code.
use regex::Regex;

/// Returns true if test_str matches reg_ex_str; false on no match or invalid pattern.
fn test_reg_ex(test_str: &str, reg_ex_str: &str) -> bool {
  /*
    Regex::new(reg_ex_str) returns a Regex state machine wrapped in
    Ok if reg_ex_str is valid, otherwise Err
  */
  if let Ok(pattern) = Regex::new(reg_ex_str) {
    /* valid pattern: true if test_str contains a match, else false */
    pattern.is_match(test_str)
  }
  else {
    false  /* not Ok, invalid reg_ex_str */
  }
}

/// Prints whether test_str matches reg_ex_str.
fn show_match_op(test_str: &str, reg_ex_str: &str) {
  let m = test_reg_ex(test_str, reg_ex_str);
  if m {
    println!("{} matches RegEx", test_str);
  }
  else {
    println!("{} did not match RegEx", test_str);
  }
}
Test literal string matching
RegEx string: Rust|rust|Language|language
rust matches RegEx
Language matches RegEx
foo did not match RegEx

Test metacharacters matching
RegEx string: R.s+
Rust matches RegEx
Rbst matches RegEx
Rcsttt matches RegEx
Rctttt did not match RegEx
rctttt did not match RegEx

1.2 Metastrings

Metastrings are sequences of characters with special meanings in the context of a regular expression.

Metastring   Meaning
\d           matches any digit
\D           matches any non-digit character
\w           matches any word character, i.e., [a-zA-Z0-9_] for ASCII text
\W           matches any non-word character, i.e., [^a-zA-Z0-9_] for ASCII text
\b           matches a word boundary, i.e., a transition from a word character to a non-word character or vice versa
\B           matches a position that is not a word boundary
\s           matches any white-space character, i.e., [ \t\r\n\f] for ASCII text
\S           matches any non-white-space character, i.e., [^ \t\r\n\f] for ASCII text
{n}          matches exactly n occurrences of the preceding character or group, e.g., \d{3} matches three adjacent digits

1.3 Character Classes

Character classes match any single character from a specified set. The table also lists the capture, anchor, and escape specifiers that are commonly combined with them.

Specifier   Meaning
[rst]       matches "r", "s", or "t"
[b-y]       matches any lower case letter between "b" and "y", including first and last
[^A-Z]      matches any character that is not an upper case ASCII letter
()          captures a specific part of the test string, e.g., (\d{3}) captures the first three adjacent digits
^           caret matches the start of the test string, so the character that follows it must be the first
$           dollar matches the end of the test string, so the character that precedes it must be the last
\           backslash escapes a metacharacter so it is treated as an ordinary character
The code example below illustrates the use of capture ( ) and quantification {n} to extract three pieces of information from a string.
use regex::Regex;

/// Prints capture groups 1 through 3 of the first match of reg_ex_str in test_str.
fn show_capture(test_str: &str, reg_ex_str: &str) {
  match Regex::new(reg_ex_str) {
    Ok(re) => {
      /* captures returns None if test_str has no match */
      if let Some(caps) = re.captures(test_str) {
        /* group 0 is the whole match; groups 1 to 3 are the parenthesized parts */
        for i in 1..=3 {
          if let Some(group) = caps.get(i) {
            println!("Group {}: {}", i, group.as_str());
          }
        }
      }
    }
    Err(_e) => {
      println!("Invalid RegExStr");
    }
  }
}
Test capture
RegEx string: (\d{3})-(\d{3})-(\d{4})   
test_str: 012-345-6789
Group 1: 012
Group 2: 345
Group 3: 6789

2.0 Regular Expression State Machines

The function regex::Regex::new(RegExSpec) compiles the string RegExSpec and, if it is valid, creates a state machine that efficiently executes search, validation, and parsing of strings. The state machine does not run until it is given a test string to match, for example in Regex::is_match(test_string).
Regex Method
    Explanation

pub fn new(regex_spec: &str) -> Result<Regex, Error>
    Compiles the regex_spec string into a state machine but does not execute it. Returns Ok(Regex) if the state machine builds without errors, otherwise Err(Error).

pub fn is_match(&self, test_string: &str) -> bool
    Executes the state machine to see if test_string matches regex_spec.

pub fn find(&self, test_string: &str) -> Option<Match>
    Searches test_string for the first match of regex_spec. Returns Some(Match), where Match contains the start and end indices of the match, or None if there are no matches.

pub fn find_iter(&self, test_string: &str) -> Matches
    Matches is an iterator over all non-overlapping matches in test_string.

References:

Link                                Comments
Crate regex                         Crate documentation for regex 1.1
regex tutorial from rust-cookbook   Several examples of regex applications
RegEx Notes - Ray Toal              Contains most of the content to be summarized here