01/25/2024
RustBites - Strings

Rust Strings

literals, Strings, formats, iteration


Synopsis:

Rust strings are encoded as UTF-8. Literal strings of type &str and owned strings of type String are subject to copy and move semantics, respectively. That has some implications addressed in this post.
The two main string types in Rust, String and str, are unlike the std::string of C++ or the string type of C#. Rust strings hold UTF-8 encoded characters with sizes from 1 to 4 bytes. That allows instances to hold non-ASCII characters like Greek and Arabic letters. This is powerful, but comes with costs:
  • Rust strings cannot be indexed by character position
  • Accessing the character at a known position takes linear time
  • Converting between Rust strings and those of the platform is more complex than in some other languages.
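The first two costs can be seen in a minimal sketch. The helper nth_char is a hypothetical name introduced here for illustration; it is not part of the standard library.

```rust
/// Returns the n-th character (not byte) of `s`, if there is one.
/// Hypothetical helper: it walks the string from the start, so it is linear time.
fn nth_char(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

fn main() {
    // Three 2-byte Greek letters, a space, and three 1-byte ASCII letters.
    let s = "\u{3b1}\u{3b2}\u{3b3} abc";

    // let c = s[1]; // does not compile: str cannot be indexed by an integer

    println!("bytes: {}, chars: {}", s.len(), s.chars().count());
    println!("char 1: {:?}", nth_char(s, 1));
    println!("char 9: {:?}", nth_char(s, 9));
}
```

The byte length and the character count differ as soon as the string holds a non-ASCII character, which is why a position has to be found by scanning.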

1. str - literal string

The primitive type str, documented in std::str, represents string slices. A string literal has type &'static str: a reference, which is Copy, to constant text stored in static memory.
let s = "a string";
The literal "a string" is a string slice of type &str. It is a contiguous sequence of UTF-8 encoded characters, compiled into static memory along with the code that uses it.
 

Table 1. - Selected str member functions:

member function | description
pub const fn as_bytes(&self) -> &[u8] | Converts a string slice to a byte slice
pub fn bytes(&self) -> Bytes<'_> | An iterator over bytes of a string slice
pub fn chars(&self) -> Chars<'_> | Returns iterator over chars of a string slice
pub fn char_indices(&self) -> CharIndices<'_> | Returns iterator over chars of a string slice, and their byte positions
pub fn contains<'a, P>(&'a self, pat: P) -> bool | Returns true if pattern P matches a sub-slice of this string slice
pub fn find<'a, P>(&'a self, pat: P) -> Option<usize> | Returns byte index of the first character of this string slice that matches pattern P. Returns None if the pattern doesn't match.
pub fn is_ascii(&self) -> bool | Checks if all characters in this string are within ASCII range
pub fn is_char_boundary(&self, index: usize) -> bool | Checks that the index-th byte is the first byte of a UTF-8 code point sequence or the end of the string
pub const fn is_empty(&self) -> bool | Returns true if self has length of zero bytes
pub const fn len(&self) -> usize | This length is in bytes, not chars or graphemes
pub fn lines(&self) -> Lines<'_> | An iterator over the lines of a string, as string slices
pub fn make_ascii_lowercase(&mut self) | Converts this string to its ASCII lower case equivalent in-place
pub fn make_ascii_uppercase(&mut self) | Converts this string to its ASCII upper case equivalent in-place
pub fn parse<F>(&self) -> Result<F, <F as FromStr>::Err> | Parses this string slice into another type
pub fn repeat(&self, n: usize) -> String | Creates a new String by repeating a string n times
pub fn replace<'a, P>(&'a self, from: P, to: &str) -> String | Replaces all matches of pattern P with another string
pub fn split<'a, P>(&'a self, pat: P) -> Split<'a, P> | An iterator over substrings of this string slice, separated by characters matched by a pattern
pub fn trim(&self) -> &str | Returns a string slice with leading and trailing whitespace removed
More methods ... primitive type str
Basic str operations: str Demonstration

    // Basic str demo
    fn main() {
        let s = "a literal string";

        print!("\n -- chars --\n ");
        for ch in s.chars() {
            print!("{} ", ch);
        }

        print!("\n -- char_indices --");
        for item in s.char_indices() {
            print!("\n {:?} ", item);
        }

        print!("\n -- find --");
        let ch = 't';
        if let Some(indx) = s.find(ch) {
            print!("\n found \'{}\' at index {} in {:?}", ch, indx, s);
        } else {
            print!("\n did not find \'{}\' in {:?}", ch, s);
        }

        print!("\n -- demonstrate copy, t = s --");
        let t = s;
        let addr_t = &t;
        let addr_s = &s;
        print!("\n address of s = {:p}", addr_s);
        print!("\n address of t = {:p}", addr_t);
    }

Output

     -- chars --
     a   l i t e r a l   s t r i n g
     -- char_indices --
     (0, 'a')
     (1, ' ')
     (2, 'l')
     (3, 'i')
     (4, 't')
     (5, 'e')
     (6, 'r')
     (7, 'a')
     (8, 'l')
     (9, ' ')
     (10, 's')
     (11, 't')
     (12, 'r')
     (13, 'i')
     (14, 'n')
     (15, 'g')
     -- find --
     found 't' at index 4 in "a literal string"
     -- demonstrate copy, t = s --
     address of s = 0x7ffffd0b7180
     address of t = 0x7ffffd0b7458
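A few more of the str methods from Table 1 can be combined in a short sketch. The function parse_numbers and its input format are assumptions made for this illustration only.

```rust
/// Splits a comma-separated line, trims each field, and parses it as i32.
/// Hypothetical helper: fields that fail to parse are skipped.
fn parse_numbers(line: &str) -> Vec<i32> {
    line.split(',')
        .filter_map(|field| field.trim().parse::<i32>().ok())
        .collect()
}

fn main() {
    let line = " 10, 20 ,x, 30 ";
    println!("{:?}", parse_numbers(line));
    println!("{:?}", line.trim());
    println!("{}", "ab".repeat(3));
    println!("{}", "a-b-c".replace('-', "+"));

    // "h", then a 2-byte e-acute, then "llo": find returns a byte index.
    let word = "h\u{e9}llo";
    println!("{:?}", word.find('l'));
    println!("{}", word.is_char_boundary(2));
}
```

Note that find reports byte index 3 for the first 'l', not character index 2, because the e-acute occupies two bytes.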

2. String

The std::string module provides String, the main owned, growable Rust string type.
let s = String::from("a string");
The literal "a string" is a string slice of type &str. It is a contiguous sequence of UTF-8 encoded characters, loaded into some location in static memory, as shown in Figure 1, below. String::from copies those bytes into a heap allocation owned by s.
The reference type &str satisfies the Copy trait (str itself is unsized and is always used behind a reference). The statement let s = "some string contents"; copies a reference to the literal string into s's location. The statement let t = s copies the s reference to t.
Figure 1. Str Copy
Each instance of the String type consists of a control block (pointer, length, and capacity), held in the stack for a local variable, that points to its character data in the heap. See RustBites_Data for details. The String type moves instead of copying. The statement:
let t:String = s;
results in transfer of ownership of s's character resources to t. That invalidates s, as shown in Figure 2.
Figure 2. String Move
Figure 3. String Clone
String satisfies the Clone trait. So, you can explicitly invoke its clone() method. The statement:
let t:String = s.clone();
results in copying s's character resources to t. So s remains valid, as shown in Figure 3.
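The difference between move and clone can be checked by comparing heap addresses. This is a minimal sketch; heap_addr is a hypothetical helper introduced here, built on the standard as_ptr method.

```rust
/// Returns the address of the first byte of the character data.
/// Hypothetical helper for this illustration.
fn heap_addr(s: &str) -> *const u8 {
    s.as_ptr()
}

fn main() {
    let s = String::from("a string");
    let before = heap_addr(&s);

    let c = s.clone(); // deep copy: new heap allocation, s stays valid
    let t = s;         // move: t takes over s's heap allocation
    // println!("{}", s); // does not compile: s was moved

    println!("moved value shares the heap buffer:  {}", heap_addr(&t) == before);
    println!("cloned value shares the heap buffer: {}", heap_addr(&c) == before);

    let lit = "a literal";
    let lit2 = lit; // &str is Copy: both references remain usable
    println!("{} / {}", lit, lit2);
}
```

The moved String reports the same heap address as the original, while the clone reports a different one.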

Table 2. - String member functions:

member function | description
new() -> String | Creates a new empty String
from(s: &str) -> String | Creates a String from a string slice
as_str(&self) -> &str | Returns a string slice
push_str(&mut self, s: &str) | Appends the chars of s
push(&mut self, ch: char) | Appends ch
remove(&mut self, n: usize) -> char | Removes and returns the char at byte index n
insert(&mut self, n: usize, ch: char) | Inserts ch at byte index n
insert_str(&mut self, n: usize, s: &str) | Inserts the contents of s at byte index n
len(&self) -> usize | Returns length of string in bytes, not chars! They are the same only for ASCII characters.
is_empty(&self) -> bool | Returns true if len() == 0, else false
clear(&mut self) | Removes all bytes
from_utf8(vec: Vec<u8>) -> Result<String, FromUtf8Error> | Converts a vector of bytes to a String. Returns an error if the bytes are not valid UTF-8
into_bytes(self) -> Vec<u8> | Converts to a Vec of bytes
as_bytes(&self) -> &[u8] | Returns a byte slice
is_char_boundary(&self, n: usize) -> bool | Is this byte the start of a new UTF-8 character?
More methods ... std::string::String
String Examples: demo_string

    use core::fmt::Debug;

    /*-------------------------------------------------
       Show slice as stack of rows with span elements
       in row
       - nice illustration of Iterator methods
    */
    fn show_fold<T: Debug>(t: &[T], span: usize) {
        let times = 1 + t.len() / span;
        let iter = t.iter();
        print!("\n ");
        for i in 0..times {
            for bt in iter.clone().skip(i * span).take(span) {
                print!("{:5?} ", bt);
            }
            if i < times - 1 {
                print!("\n ");
            }
        }
    }

    // Returns the compiler's name for type T
    fn get_type<T>(_t: &T) -> &str {
        std::any::type_name::<T>()
    }

    fn show_type_value<T: Debug>(msg: &str, t: &T) {
        print!("\n {} type is: {}, value: {:?}", msg, get_type::<T>(t), t);
    }

    fn main() {
        print!("\n -- demo_string --");
        let s1: String = String::from("a test string");
        show_type_value("s1 - ", &s1);

        print!("\n -- iterating through String characters --");
        let iter = s1.chars();
        print!("\n ");
        for ch in iter {
            print!("{} ", ch);
        }

        print!("\n -- extracting bytes --");
        let s1_bytes = s1.as_bytes();
        print!("\n bytes are:");
        show_fold(&s1_bytes, 5);
        // This works too, will wrap in []
        // print!("\n bytes are: {:?}", b"a test string");

        print!("\n -- extracting a slice --");
        let slc = &s1[0..6];
        show_type_value("&s1[0..6]", &slc);

        print!("\n -- demonstrate move --");
        print!("\n executing statement: let s2 = s1;");
        print!("\n address of s1 = {:p}", &s1);
        print!("\n address of s1.as_bytes()[0] = {:p}", &s1.as_bytes()[0]);
        let s2 = s1;
        print!("\n address of s2 = {:p}", &s2);
        print!("\n address of s2.as_bytes()[0] = {:p}", &s2.as_bytes()[0]);
        print!("\n new control block, original start of heap alloc");
    }

Output:

     -- demo_string --
     s1 -  type is: alloc::string::String, value: "a test string"
     -- iterating through String characters --
     a   t e s t   s t r i n g
     -- extracting bytes --
     bytes are:
        97    32   116   101   115
       116    32   115   116   114
       105   110   103
     -- extracting a slice --
     &s1[0..6] type is: &str, value: "a test"
     -- demonstrate move --
     executing statement: let s2 = s1;
     address of s1 = 0x7fff7eadf3e8
     address of s1.as_bytes()[0] = 0x55b3a9a96b40
     address of s2 = 0x7fff7eadf680
     address of s2.as_bytes()[0] = 0x55b3a9a96b40
     new control block, original start of heap alloc
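The editing methods in Table 2 can be exercised in a few lines. This is a sketch only; build is a hypothetical function and the strings are arbitrary. All positions passed to insert, insert_str, and remove are byte indices.

```rust
/// Builds a String step by step with methods from Table 2.
/// Hypothetical function for this illustration.
fn build() -> String {
    let mut s = String::new();
    s.push_str("hllo");     // "hllo"
    s.insert(1, 'e');       // "hello"
    s.push('!');            // "hello!"
    s.insert_str(0, ">> "); // ">> hello!"
    let first = s.remove(0); // removes and returns the first char
    assert_eq!(first, '>');
    s                        // "> hello!"
}

fn main() {
    let s = build();
    println!("{} (len {} bytes, empty: {})", s, s.len(), s.is_empty());

    // 0xCE 0xB1 is the UTF-8 encoding of the Greek letter alpha.
    println!("{:?}", String::from_utf8(vec![0xCE, 0xB1]));
    // 0xFF can never appear in valid UTF-8, so this returns an error.
    println!("{}", String::from_utf8(vec![0xFF]).is_err());

    let bytes: Vec<u8> = s.into_bytes(); // consumes s
    println!("{:?}", bytes);
}
```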

3. String Formats

Rust provides a useful set of formatting facilities in std::fmt, both for console display with print! and println! and for building formatted strings with the format! macro. There is a little language associated with the formatting process that is well described in the std::fmt reference. Using that and an extensive set of format specifiers, also presented in the docs, you can present well-organized information on the console, instead of a lot of raw data.
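A small sketch of the formatting language, using only specifiers documented in std::fmt. The function row and its column widths are assumptions chosen for this illustration.

```rust
/// Formats one table row: name left-aligned in 8 columns,
/// value right-aligned in 8 columns with 2 decimal places.
/// Hypothetical helper for this illustration.
fn row(name: &str, value: f64) -> String {
    format!("{:<8}|{:>8.2}", name, value)
}

fn main() {
    println!("{}", row("width", 2.5));
    println!("{}", row("height", 12.0));

    // {:?} is Debug output, {:#x} is hex with a 0x prefix,
    // {:08b} is binary zero-padded to 8 digits.
    println!("{:?} {:#x} {:08b}", "dbg", 255, 5);
}
```

format! returns a String, so the same specifiers work for building strings as for printing them.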

4. Iterating over Strings

Since String and &str hold UTF-8 encoded characters, their items have sizes that vary from 1 to 4 bytes. So their iterators have to find character boundaries.

Table 3. UTF-8 character boundaries

char size | indicator
1 byte, e.g. ASCII | byte starts with bit 0
2 bytes | first byte starts with bits 110
3 bytes | first byte starts with bits 1110
4 bytes | first byte starts with bits 11110
not first byte | byte starts with bits 10
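The bit patterns in the table above can be tested directly by counting the leading one bits of each byte. This is a sketch under stated assumptions: ByteKind and classify are hypothetical names, and the check looks only at the prefix bits, so it does not reject every byte that full UTF-8 validation would.

```rust
/// Role of a byte within a UTF-8 sequence, judged by its leading bits only.
#[derive(Debug, PartialEq)]
enum ByteKind {
    Single,       // 0xxxxxxx
    Continuation, // 10xxxxxx
    Lead2,        // 110xxxxx
    Lead3,        // 1110xxxx
    Lead4,        // 11110xxx
    Invalid,      // five or more leading one bits
}

/// Hypothetical helper: classifies a byte by its count of leading one bits.
fn classify(b: u8) -> ByteKind {
    match b.leading_ones() {
        0 => ByteKind::Single,
        1 => ByteKind::Continuation,
        2 => ByteKind::Lead2,
        3 => ByteKind::Lead3,
        4 => ByteKind::Lead4,
        _ => ByteKind::Invalid,
    }
}

fn main() {
    // 'a' (1 byte), alpha (2 bytes), euro sign (3 bytes), an emoji (4 bytes)
    let s = "a\u{3b1}\u{20ac}\u{1f600}";
    for b in s.bytes() {
        println!("{:08b} {:?}", b, classify(b));
    }
}
```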
For that reason, instances of std::string::String and primitive str provide iterators:
  • chars(&self) -> Chars<'_>
    Chars<'_> implements next(&mut self) -> Option<char>
  • char_indices(&self) -> CharIndices<'_>
    CharIndices<'_> implements next(&mut self) -> Option<(usize, char)>
  • bytes(&self) -> Bytes<'_>
    Bytes<'_> implements next(&mut self) -> Option<u8>
The type char is not what String and str hold. They hold UTF-8 encoded bytes, while a char is always 4 bytes, large enough to hold any character a String or str can contain. So, a Vec<char> would be up to four times larger than a String with the same logical contents.
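The following sketch shows the iterators side by side and compares storage sizes. The helper char_layout is a hypothetical name introduced for this illustration.

```rust
use std::mem::size_of;

/// Pairs each char with its starting byte index and its UTF-8 length.
/// Hypothetical helper for this illustration.
fn char_layout(s: &str) -> Vec<(usize, char, usize)> {
    s.char_indices().map(|(i, c)| (i, c, c.len_utf8())).collect()
}

fn main() {
    // 'a' (1 byte), alpha (2 bytes), euro sign (3 bytes)
    let s = "a\u{3b1}\u{20ac}";

    for (i, c, n) in char_layout(s) {
        println!("byte index {}: {:?} uses {} byte(s)", i, c, n);
    }
    for b in s.bytes() {
        print!("{:#04x} ", b);
    }
    println!();

    // The same logical contents stored as chars take 4 bytes per item.
    let v: Vec<char> = s.chars().collect();
    println!(
        "str bytes: {}, Vec<char> bytes: {}",
        s.len(),
        v.len() * size_of::<char>()
    );
}
```

Here the string slice occupies 6 bytes, while the three chars occupy 12.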

5. Other String Types

The Rust libraries std::borrow, std::ffi (foreign function interface), and std::path provide other string-related types:
String Type | Description
std::borrow::Cow | Standard smart pointer implementing clone-on-write; Cow<str> holds either a borrowed &str or an owned String.
std::ffi::OsString | Owned mutable wrapper for platform-native strings, used to make platform API calls and interoperate with "C" code.
std::ffi::OsStr | Borrowed form of OsString; &OsStr is to OsString as &str is to String
std::path::PathBuf | Owned mutable filesystem path, adds methods for interacting with the filesystem
std::path::Path | Borrowed form of PathBuf; &Path is to PathBuf as &str is to String
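A brief sketch of how these types relate to String and &str. The function no_spaces and the file names are assumptions made for this illustration.

```rust
use std::borrow::Cow;
use std::ffi::{OsStr, OsString};
use std::path::{Path, PathBuf};

/// Replaces spaces with underscores, allocating only when a space is present.
/// Hypothetical helper for this illustration.
fn no_spaces(s: &str) -> Cow<'_, str> {
    if s.contains(' ') {
        Cow::Owned(s.replace(' ', "_"))
    } else {
        Cow::Borrowed(s)
    }
}

fn main() {
    // PathBuf owns its path; Path is the borrowed view.
    let mut path = PathBuf::from("logs");
    path.push("app.txt");
    let p: &Path = path.as_path();
    println!("{:?} has extension {:?}", p, p.extension());

    // OsString owns platform text; OsStr is the borrowed view.
    let os: OsString = OsString::from("report");
    let os_ref: &OsStr = os.as_os_str();
    // Platform strings need not be valid UTF-8, so to_str returns an Option.
    println!("{:?}", os_ref.to_str());

    println!("{} {}", no_spaces("abc"), no_spaces("a b c"));
}
```

The Cow return type lets the caller treat both cases as a string slice while the copy is made only when the text actually changes.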