Unlocking the Power of Strings in Rust

In this article, we’ll take a closer look at how Rust handles text, from different ways to store and manipulate it to more advanced features like special formatting and memory efficiency.

Strings in Rust are a bit more sophisticated compared to other programming languages. As mentioned in previous lessons, strings in Rust are typically represented by either the &str type or the String type. However, there are some significant differences between these two types. But before we move on to discuss them, let’s take a moment to understand a bit more about how Rust handles strings.

Strings in Rust are UTF-8 encoded. I have written a Medium article on character encoding where you can learn how UTF-8 encoding works, and you can take a look at it here. In a nutshell, a string is a sequence of characters such as A, 9, -, and even emoji characters like 😁. ASCII characters, such as English letters and digits, take only a single byte (8 bits) of storage, while other characters take 2, 3, or 4 bytes.

  • 1 byte = A, z, 1, .
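  • 2 bytes = é, ñ, ø, ß, π
  • 3 bytes = €, ₹, 中
  • 4 bytes = 𝄞, 😀, 💡

Some emoji, such as 🧑‍💻, are sequences of several code points joined together, so they take more than 4 bytes in total.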
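We can check these sizes ourselves with the len_utf8 method on char. The following is a minimal sketch; the helper name utf8_len is just an illustration, and the non-ASCII characters are written as Unicode escapes (é, €, 😀) so that the source file's own encoding cannot affect the result.

```rust
/// Number of bytes needed to encode a single `char` in UTF-8.
/// (`utf8_len` is a hypothetical helper name used for this sketch.)
fn utf8_len(c: char) -> usize {
    c.len_utf8()
}

fn main() {
    // 'A', 'é' (U+00E9), '€' (U+20AC), '😀' (U+1F600)
    for c in ['A', '\u{e9}', '\u{20ac}', '\u{1f600}'] {
        println!("{} -> {} byte(s)", c, utf8_len(c));
    }
}
```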

&str type

Before we understand what &str is, let’s first take a look at the str type. str is a valid data type in Rust, and it is commonly referred to as a string slice. It represents a fixed-length, non-growable sequence of UTF-8 bytes. Unlike String, str is not a container that manages its own buffer. It is the string data itself, and we always reach it through some kind of pointer, most commonly the borrowed view &str, which is why it is called a slice. The data can live in various places, such as in the program’s binary (for string literals) or on the heap (when owned by types like String or Box<str>). A str can never grow or shrink, and through a shared reference (&str) its contents cannot be changed at all.

When we declare a string literal (using double quotes), like "Hello, world!", that text gets stored directly in the compiled binary, and at runtime, it is accessed from the read-only memory section of the binary. If you inspect the compiled binary using a hex viewer, you’ll see the text exactly as it was declared. However, there is a limitation with this type: Rust requires us to work with types whose sizes are known at compile time (sized types). For example, u8 and i8 have a fixed size of 8 bits, and this size doesn’t change. We spoke about this in the Slices section in the Data Types lesson.

This isn’t the case with str. Although any particular string has a definite length, str itself is an unsized type: different strings have different lengths, and the type carries no length information of its own. As a result, you will rarely see str used on its own. Instead, Rust encourages us to use &str, which is a reference to raw string data (which may belong to a string literal, a String, or a Box<str>), and &str has a fixed size. The &str type consists of two components: a pointer to the first byte of the string (with a size of usize) and a length representing the number of bytes (also usize). Since &str has a fixed size (2 × usize), it is efficient to work with directly. The actual string data referenced by &str may reside in the binary (for string literals) or in heap memory (for slices of String).
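The pointer-plus-length layout can be observed with std::mem::size_of. This is a small sketch rather than anything authoritative; the helper name str_ref_size is made up for the example.

```rust
use std::mem::size_of;

/// Size in bytes of a `&str` (pointer + length).
/// (`str_ref_size` is a hypothetical helper name.)
fn str_ref_size() -> usize {
    size_of::<&str>()
}

fn main() {
    println!("usize: {} bytes", size_of::<usize>());
    println!("&str:  {} bytes", str_ref_size());

    let s: &str = "Hello, world!";
    // `len()` reads the length stored next to the pointer;
    // `as_ptr()` exposes the pointer to the first byte.
    println!("len: {}, ptr: {:p}", s.len(), s.as_ptr());
}
```

On a 64-bit target, usize is 8 bytes, so we would expect a &str to be 16 bytes.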

If you want to learn more about slices and their characteristics, please refer to the Slices section in the Data Types lesson.

fn main() {
    let s: &str = "Hello, world!";
    println!("s: {}", s);
}

// s: Hello, world!

When we declare a variable and assign it a string literal, the raw string data is stored in the program’s binary, and a reference to it is returned in the form of the &str data type. In the example above, the variable s gets the &str data type.

The actual type of s here is &'static str, where 'static is a lifetime parameter in Rust, which we will cover in an upcoming lesson. When we declare a string using a string literal, it has this type because the string is stored in the program’s binary and has the lifetime of the entire program, which is what 'static means.

In the previous lesson, we learned about the Slice data type. A slice contains a pointer to the first element in a collection and the length of the elements it represents. The same applies to &str, as it also holds a pointer and a length. This is why &str is called a string slice.

fn main() {
    let s: &str = "Hello, world!";

    let ss: &str = &s[0..5];
    println!("ss: {}", ss);
}


// ss: Hello

In the example above, we have a string literal s with the type &str. When we create a slice of it using the familiar &[..] slice expression, we also get the &str type back. So, &str is indeed a slice of a string. The 0..5 range in &s[0..5] creates a slice that starts at byte index 0 of s and runs up to, but not including, byte index 5. Remember, these are byte indices, not character indices.

fn main() {
    let mut s: &str = "Hello, world!";
    println!("[before] s: {}", s);

    s = "Hello new, world!";
    println!("[after] s: {}", s);
}
$ cargo run
[before] s: Hello, world!
[after] s: Hello new, world!

Since &str is an immutable reference to a string, either stored in the program’s binary or on the heap, we cannot mutate the underlying data through it. When it comes to string literals, they are stored in the program’s binary and cannot be mutated at runtime. However, we can update the value of a variable. In the example above, the variable s was initially pointing to the string "Hello, world!" stored in the binary, but later it pointed to "Hello new, world!" also stored in the binary. The underlying raw data did not change.

Let’s explore another limitation with strings in Rust. Unlike other languages, Rust does not allow direct indexing on a string. While we can access an element in an array or a slice using the array[index] expression, this is not possible with strings in Rust—for example, s[0] is not allowed. The reason is that, since string characters are UTF-8 encoded, they may take more than 1 byte to store their values. If we try to access a byte in a string using string[index], the byte at that index might not fully represent a character.

fn main() {
    let s: &str = "Hello";

    for c in s.chars() {
        println!("char: {}", c); // c has `char` type
    }


    let char_at_1: Option<char> = s.chars().nth(1);

    match char_at_1 {
        Some(c) => println!("\nChar at 1: {}", c),
        None => println!("There is no character at 1"),
    }
}
$ cargo run
char: H
char: e
char: l
char: l
char: o

Char at 1: e

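This is also why we can’t run a standard for loop on a string directly: a string could be walked byte by byte or character by character, so Rust makes us choose explicitly. As shown in the program above, we use the chars method, which returns an iterator that yields complete characters (char values), no matter how many bytes each one occupies. We can also use the .nth(index) method on the Chars iterator, which returns the character at the specified position wrapped in the Option enum. Note that .nth() has to walk the string from the start, so it is not a constant-time lookup like array indexing.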
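To see the difference between bytes and characters, we can compare .len() with .chars().count() on a string that contains a multi-byte character. This is a minimal sketch; the helper byte_and_char_counts is a made-up name, and é is written as the escape \u{e9}.

```rust
/// Returns (byte length, character count) for a string slice.
/// (`byte_and_char_counts` is a hypothetical helper name.)
fn byte_and_char_counts(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    let s = "h\u{e9}llo"; // "héllo"
    let (bytes, chars) = byte_and_char_counts(s);
    println!("bytes: {}, chars: {}", bytes, chars);

    // char_indices pairs each character with its starting byte index
    for (i, c) in s.char_indices() {
        println!("byte {}: {}", i, c);
    }
}
```

The char_indices method is handy when we need a byte index that is safe to slice at.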

fn main() {
    let s: &str = "é";

    let fb: &str = &s[0..1];
    let sb: &str = &s[1..2];

    println!("fb: {}", fb);
    println!("sb: {}", sb);
}

In the example above, we have a string literal s that contains a single character, é, which takes two bytes to store. The variable fb is meant to reference the first byte of this string, and sb the second byte. The program compiles, but when we run it, it panics at the first slice expression, before anything is printed.

$ cargo run
   Compiling hello_world v0.1.0 (/Users/thatisuday/rust/hello_world)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.16s
     Running `target/debug/hello_world`
thread 'main' panicked at src/main.rs:4:22:
byte index 1 is not a char boundary; it is inside 'é' (bytes 0..2) of `é`
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

This happens because we are trying to cut the string in the middle of a character. The panic message tells us that byte index 1 is not a character boundary: it falls inside the multi-byte character é, which occupies the byte range 0..2 (bytes 0 and 1).
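If we are not sure whether a byte range falls on character boundaries, we can avoid the panic by using the .get() method, which returns an Option instead, or by checking .is_char_boundary() first. The sketch below assumes a small helper named try_slice, which is not part of the standard library.

```rust
/// Non-panicking slice: `None` if the range is out of bounds
/// or does not fall on character boundaries.
/// (`try_slice` is a hypothetical helper name.)
fn try_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}

fn main() {
    let s = "\u{e9}"; // "é", two bytes in UTF-8

    println!("0..1 -> {:?}", try_slice(s, 0, 1)); // cuts through the character
    println!("0..2 -> {:?}", try_slice(s, 0, 2)); // the whole character
    println!("boundary at 1? {}", s.is_char_boundary(1));
}
```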

fn main() {
    let s: &str = "é";
    let s_bytes: &[u8] = s.as_bytes();

    let fb: u8 = s_bytes[0];
    let sb: u8 = s_bytes[1];

    println!("fb: {} / {:#X} / {}", fb, fb, fb as char);
    println!("sb: {} / {:#X} / {}", sb, sb, sb as char);
}

If we still want to read bytes from a string, we can use the .as_bytes() method, which returns a slice of u8 values. Since u8 represents 8 bits, which is exactly one byte, this makes perfect sense. If we print each byte of this string independently as a character, we get some interesting results.

$ cargo run
fb: 195 / 0xC3 / Ã
sb: 169 / 0xA9 / ©

The byte sequence for the character é is [0xC3, 0xA9]. However, if we convert each byte on its own using as char, Rust treats the values 0xC3 and 0xA9 as Unicode code points (U+00C3 and U+00A9), whose characters are Ã and ©.

In UTF-8, only code points up to U+007F (decimal 127) are encoded as a single byte. A byte with a value above 0x7F is never a complete character on its own; it is always part of a 2, 3, or 4 byte sequence.
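To turn bytes back into text correctly, the bytes have to be decoded together as UTF-8 rather than one at a time. A minimal sketch using std::str::from_utf8 follows; the decode helper is a made-up name for the example.

```rust
/// Decodes a byte slice as UTF-8, returning `None` if it is not valid.
/// (`decode` is a hypothetical helper name.)
fn decode(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}

fn main() {
    println!("{:?}", decode(&[0xC3, 0xA9])); // both bytes together
    println!("{:?}", decode(&[0xC3])); // the first byte alone is not valid UTF-8
}
```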

String type

The String data type, unlike string literals (&str), stores its content on the heap. It is both growable and mutable, meaning you can append or remove parts of the string as needed. A String owns its data, unlike &str, which is a borrowed reference to string data. Behind the scenes, String manages its memory automatically. Like a Vec<T>, a String can hold more memory on the heap than it currently requires, allowing it to grow efficiently without reallocating on every append. When appended text no longer fits, the buffer is reallocated with a larger capacity.

// Simplified view; the real standard-library definitions differ in detail
struct String {
    vec: Vec<u8>, // stores the string data as a vector of bytes
}

struct Vec<T> {
    ptr: *mut T,     // pointer to the heap-allocated buffer
    len: usize,      // current length of the vector (number of elements)
    capacity: usize, // total capacity of the vector (allocated space)
}

As you can see above, String is a wrapper around Vec<u8> with additional functionality for handling UTF-8 text data. As a result, the String type contains a pointer to a region of memory on the heap, a length representing the current number of bytes in the string, and a capacity representing the total allocated memory.

fn main() {
    let mut text = String::new();
    println!(
        "[before] text: `{}`, len: {}, cap: {}",
        text,
        text.len(),
        text.capacity()
    );

    text.push_str("Hello, world!");
    println!(
        "[after] text: `{}`, len: {}, cap: {}",
        text,
        text.len(),
        text.capacity()
    );
}
$ cargo run
[before] text: ``, len: 0, cap: 0
[after] text: `Hello, world!`, len: 13, cap: 16

In the example above, we declare a String by calling the String::new() associated function, which creates an empty string. No heap memory is allocated at this point, which is why the capacity is 0; the allocation happens when we first add data. Since this is a growable and mutable type of string, we can append more data to it using the .push_str() method, which accepts a &str to append to the original string. As you can also see in the program, we call the .len() method to get the length of the string, which returns the size of the string in bytes, and the .capacity() method to return the current amount of memory allocated for it. The exact capacity chosen when the string grows is an implementation detail and may differ between Rust versions; the only guarantee is that it is at least the length.
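If we know roughly how much text a string will hold, we can reserve the memory up front with String::with_capacity, so that appending does not need to reallocate. This is only a sketch; the build_greeting helper and the value 32 are arbitrary choices for illustration.

```rust
/// Builds a greeting in a buffer that is reserved up front.
/// (`build_greeting` and the capacity of 32 are illustrative choices.)
fn build_greeting() -> String {
    let mut text = String::with_capacity(32);
    text.push_str("Hello");
    text.push_str(", world!");
    text
}

fn main() {
    let text = build_greeting();
    println!(
        "text: `{}`, len: {}, cap: {}",
        text,
        text.len(),
        text.capacity()
    );
}
```

Here the length is 13 bytes, while the capacity is at least the 32 bytes we asked for.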

If you’d like to initialize a string with an initial value, you can use the String::from() associated function, which takes a string slice as an argument.

fn main() {
    let mut text = String::from("Hello");
    println!("[before] text: `{}`", text);

    text.push_str(", world!");
    println!("[after] text: `{}`", text);
}
$ cargo run
[before] text: `Hello`
[after] text: `Hello, world!`

In the example above, we provided a string literal, which has the &str type, to the String::from() associated function. It initializes a string on the heap, and we made it mutable by using mut. We can also use a slice of another String type.

fn main() {
    let message = String::from("Good Morning, everyone!");
    let portion: &str = &message[12..];

    let mut text = String::from("Hello");
    text.push_str(portion);
    println!("text: `{}`", text);
}

// text: `Hello, everyone!`

In the example above, we have a String value stored in the message variable. The portion variable is a string slice (&str) that references part of this String, starting from byte index 12 and continuing until the end of the string. Unlike a string literal, portion is a slice referencing heap-allocated data from the message string. When we call .push_str(portion), the bytes referenced by the portion slice are copied into the text string.

Here are some interesting methods that we can use on &str and String types:

  • len: Returns the number of bytes in the string.
  • is_empty: Returns true if the string has a length of 0.
  • chars: Returns an iterator over the characters of the string.
  • contains: Checks if a string contains a substring.
  • starts_with: Returns true if the string starts with the specified substring.
  • ends_with: Returns true if the string ends with the specified substring.
  • find: Searches for a substring and returns the byte index of the first occurrence, wrapped in an Option.
  • split: Splits the string based on a delimiter and returns an iterator.
  • replace: Replaces all matches of a pattern and returns the result as a new String.
  • to_lowercase: Returns a lowercase copy of the string.
  • to_uppercase: Returns an uppercase copy of the string.
  • trim: Returns a slice of the string with leading and trailing whitespace removed.
  • lines: Returns an iterator over the lines of the string.
  • as_bytes: Returns the string’s bytes as a byte slice (&[u8]).
  • repeat: Repeats the string n times and returns the concatenated result.
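Here is a short sketch that combines a few of these methods. The parse_pair helper and the key=value input format are invented purely for the example.

```rust
/// Trims a line of `key=value` text and splits it at the first `=`.
/// (`parse_pair` is a hypothetical helper name.)
fn parse_pair(line: &str) -> Option<(&str, &str)> {
    let line = line.trim();
    let idx = line.find('=')?; // byte index of the first '='
    Some((&line[..idx], &line[idx + 1..]))
}

fn main() {
    let text = "  name=Rust  ";
    println!("{:?}", parse_pair(text));
    println!("{}", text.trim().to_uppercase());
    println!("{}", "ab".repeat(3));
    println!("{}", "a-b-c".replace('-', "+"));
    println!("{:?}", "a,b,c".split(',').collect::<Vec<&str>>());
    println!("{}", text.trim().starts_with("name"));
}
```

Slicing at idx and idx + 1 is safe here because = is a single-byte ASCII character, so both indices fall on character boundaries.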

As we learned about the limitations of str, we can’t really use it directly in our program. However, we can use the Box pointer (we will talk about this in an upcoming lesson) to wrap the str value and store it on the heap. Why would we do that? First of all, this gives us a way to store string data on the heap that can never grow or shrink. A String can be resized whenever it is declared mutable, while a Box<str> owns a buffer whose length is fixed once it is created. Additionally, str does not have a capacity attribute like String, which can reserve more memory than what the string requires. This is why Box<str> provides a more compact and memory-efficient way to store a string, where the memory allocated is exactly equal to the string’s length.

fn main() {
    let text: String = String::from("Hello, world!");

    let text_str: Box<str> = text.into_boxed_str();

    println!("Heap-allocated str: {}", text_str);
}

// Heap-allocated str: Hello, world!

Rust provides the .into_boxed_str() method on the String type to create a Box<str> directly. Here, text_str is a Box<str>, which owns fixed-length string data on the heap. The conversion consumes text, so the original String can no longer be used afterwards. During this conversion, if text has more capacity than its length, the excess capacity is dropped, and the string is stored in a more compact form.
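The saving can also be seen in the size of the handle itself: a Box<str> only needs a pointer and a length, while a String additionally tracks its capacity. The following sketch prints both sizes; the shrink helper is a made-up name, and the exact numbers depend on the target platform.

```rust
use std::mem::size_of;

/// Converts a `String` into a `Box<str>`, dropping any spare capacity.
/// (`shrink` is a hypothetical helper name.)
fn shrink(text: String) -> Box<str> {
    text.into_boxed_str()
}

fn main() {
    let mut text = String::with_capacity(64);
    text.push_str("Hello, world!");
    println!("String:   len {}, cap {}", text.len(), text.capacity());

    let boxed = shrink(text);
    println!("Box<str>: len {}", boxed.len());

    println!("size_of String:   {} bytes", size_of::<String>());
    println!("size_of Box<str>: {} bytes", size_of::<Box<str>>());
}
```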

Difference between &str and &String

fn main() {
    let text: String = String::from("Hello, world!");
    let text_ref: &String = &text;
}

Let’s say we have a variable text of type String. We can take a reference to it using &text, which has the type &String. Then we might wonder, what’s the difference between &String and &str? Aren’t they the same?

Though they look similar, they are somewhat different. A &str is a fat pointer that carries both a pointer to the string bytes and a length. A &String, on the other hand, is an ordinary reference to the String value itself, and it is the String that holds the pointer to the heap data, the length, and the capacity. This means a &String always refers to one entire String, while a &str can refer to either a small portion or the entire string, stored either in the binary or on the heap. Therefore, &str is more flexible than &String.
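We can make this difference visible by comparing the sizes of the two reference types. This is a small sketch; ref_sizes is a made-up helper name.

```rust
use std::mem::size_of;

/// Returns (size of `&String`, size of `&str`) in bytes.
/// (`ref_sizes` is a hypothetical helper name.)
fn ref_sizes() -> (usize, usize) {
    (size_of::<&String>(), size_of::<&str>())
}

fn main() {
    let (string_ref, str_ref) = ref_sizes();
    println!("&String: {} bytes", string_ref); // one pointer
    println!("&str:    {} bytes", str_ref); // pointer + length

    let text = String::from("Hello, world!");
    let whole: &str = text.as_str(); // view of the entire String
    let part: &str = &text[..5]; // view of a portion only
    println!("whole: {}, part: {}", whole, part);
}
```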

fn print_str(s: &str) {
    println!("Printing &str: {}", s);
}

fn main() {
    let text: String = String::from("Hello, world!");
    let text_ref: &String = &text;

    // print a string literal
    print_str("Hello, literal!");

    // print a `String` slice
    let text_slice: &str = &text[..];
    print_str(text_slice);

    // print `&String` with Deref coercion
    print_str(text_ref);
}
$ cargo run
Printing &str: Hello, literal!
Printing &str: Hello, world!
Printing &str: Hello, world!

However, Rust implements the Deref trait for String, which can automatically convert &String to &str. We will talk about this trait in an upcoming lesson. But for now, you can simply pass &String to a function that takes &str, and Rust will handle converting &String to &str. As a matter of fact, the String methods we saw above are actually implemented on the str type. For example, when we call the .len() method on a String, it is actually called on the &str value derived from it. If this doesn’t make complete sense, that’s fine for now.

The opposite is not possible: a function that strictly takes &String cannot accept a &str.

String concatenation

In many languages, the + operator is used to concatenate two strings into a new string. In Rust, it works in much the same way, but with some constraints.

fn main() {
    let firstname: String = "John".to_string();
    let lastname: &str = "Doe";

    let fullname: String = firstname + " " + lastname;
    println!("fullname: {}", fullname);

    // Error: borrow of moved value: `firstname`
    // println!("firstname: {}", firstname); // <-- uncomment

    // Error: cannot add `&str` to `&str`; string concatenation requires an owned `String` on the left
    // let shortname = "Mr. " + lastname; // <-- uncomment
}

// fullname: John Doe

In the example above, the firstname variable contains a string of type String. We used the .to_string() method on a string literal (which is available on &str) to convert from &str to a String value stored on the heap. This is a common method to create a String easily. The lastname variable represents a &str value.

The rule for the + operator (string concatenation) requires that the first operand must be a String and the subsequent operands must be &str values. Rust enforces this rule for performance reasons and efficient memory management: the + operator takes ownership of the String on the left, appends the bytes of the &str on the right to its existing buffer, and returns that same String, so no brand-new string has to be built from scratch. What move means will be explained in more detail in a later lesson, but here, move means that the original String value is no longer accessible after the operation.

In the program above, the firstname value is moved into fullname due to the + operation. This is why we would get an error if we tried to access firstname after the move. However, this is not the case with &str values. Since &str represents references (and not owned values), they do not move. In the context of string concatenation, their underlying data is copied and concatenated with the firstname’s data, forming the fullname.
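When both sides are String values, we can borrow the right-hand side with &, since a &String is accepted where a &str is expected. The sketch below also shows one way to keep the inputs usable by starting from a fresh owned copy; the full_name helper is a made-up name.

```rust
/// Joins two names with a space without consuming either input.
/// (`full_name` is a hypothetical helper name.)
fn full_name(first: &str, last: &str) -> String {
    // to_string() creates a new owned String for `+` to consume
    first.to_string() + " " + last
}

fn main() {
    let firstname = String::from("John");
    let lastname = String::from("Doe");

    // The left operand is an owned String; each right operand is borrowed.
    let greeting = String::from("Hi, ") + &firstname + " " + &lastname;
    println!("{}", greeting);

    println!("{}", full_name(&firstname, &lastname));

    // Nothing was moved out of firstname or lastname.
    println!("still usable: {} {}", firstname, lastname);
}
```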

fn main() {
    let firstname: String = "John".to_string();
    let lastname: &str = "Doe";

    let fullname: String = format!("{} {}", firstname, lastname);
    println!("fullname: {}", fullname);

    println!("firstname: {}", firstname);
}
$ cargo run
fullname: John Doe
firstname: John

If you do not want your strings to be moved due to the concatenation operation, you can use the format! macro, which comes from Rust’s standard library and is a part of the prelude. It’s almost exactly like the println! macro, but instead of printing text to STDOUT, it returns a String value.

Multiline and Raw strings

fn main() {
    let text = "Hello,
    world!";

    println!("text: {}", text);
}
$ cargo run
text: Hello,
    world!

Rust by default supports spanning strings over multiple lines without introducing anything extra. Therefore, a string declared in Rust is multiline by default. In the above example, we have declared a string literal text, and some part of this string is written on the next line. However, if you notice the output, Rust includes all spaces, newline characters, and tab characters between where the line break occurred and where the next character began.

fn main() {
    let text = "Hello, \
    world!";

    println!("text: {}", text);
}
$ cargo run
text: Hello, world!

If you would like to break a string onto the next line without including characters introduced after the line break, you can use the line continuation character \. All whitespace, including the newline after this character, is ignored as shown above.

fn main() {
    let text = "Hello, \nworld!";

    println!("text: {}", text);
}
$ cargo run
text: Hello,
world!

If you would like to introduce a newline in a string without splitting the string in the code over multiple lines, you can use the \n (newline or line break) character, as shown above. However, what if you actually wanted your output to display \n instead of a new line? This is where raw string literals come in.

fn main() {
    let text = r"Hello, \nworld!";

    println!("text: {}", text);
}
$ cargo run
text: Hello, \nworld!

A string literal with the r prefix is called a raw string literal. Placing r before a string literal tells Rust not to process escape sequences, so the string is kept exactly as it is written.

fn main() {
    // Syntax Error: expected SEMICOLON
    let text = r"Hello, "world!"";

    println!("text: {}", text);
}

But what if a string contains " (double quote) characters? Even with a raw string literal, as shown above, it’s not valid syntax because the Rust compiler assumes the string has ended after it sees the second double quote. To fix this, we use the # characters before and after the string quotes. Here, the # is used as a delimiter to mark the beginning and the end of the string data.

fn main() {
    let text = r#"Hello, "world!""#;

    println!("text: {}", text);
}
$ cargo run
text: Hello, "world!"

In the example above, we used a single # character before and after the double quotes in which we have written a string value. With this, Rust knows where the string starts and where it ends, allowing us to write as many double quote characters as we want. If your raw string contains a double quote followed by # characters, then you can use more # characters as the delimiter, enough that the closing sequence (a double quote followed by that many # characters) never appears inside the string itself, as shown below.

fn main() {
    let text = r##"Hello, "#world!""##;

    println!("text: {}", text);
}
$ cargo run
text: Hello, "#world!"

Byte literals

So far, we have seen the r prefix, which turns a string literal into a raw string when placed before it. There are other such prefixes that are used with strings in Rust, and they are interesting too.

fn main() {
    let byte_char: u8 = b'A';
    let bytes_str: &[u8; 6] = b"Hello\n";
    let bytes_str_raw: &[u8; 7] = br"Hello\n";

    println!("byte_A: {}", byte_char);
    println!("bytes_str: {:?}", bytes_str);
    println!("bytes_str_raw: {:?}", bytes_str_raw);
}
$ cargo run
byte_A: 65
bytes_str: [72, 101, 108, 108, 111, 10]
bytes_str_raw: [72, 101, 108, 108, 111, 92, 110]

The prefix b, when placed in front of a character literal (''), produces a u8 value instead of a char. Therefore, it is called a byte literal. However, this only works for ASCII characters since characters outside the ASCII table take more than one byte and therefore can’t be used with it. If you place the same prefix b before a string literal, it returns a reference to an array of u8 values, which is nothing but a reference to where the string’s bytes are stored. Therefore, it is called a byte string literal.

If we take a look at the bytes_str data, we see that the last byte is 10, which is the Line Feed character in the ASCII table. This character corresponds to \n in the string literal since it is rendered as a newline. However, if we want the raw bytes of the string, we use the br prefix, which stands for a raw byte string literal. In bytes_str_raw, we get 92 and 110 instead, which correspond to the \ and n characters in the ASCII table.
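A byte string literal holds the same bytes that .as_bytes() would give us for the equivalent regular string. The following sketch compares the two; same_bytes is a made-up helper name.

```rust
/// True if the byte string and the text contain exactly the same bytes.
/// (`same_bytes` is a hypothetical helper name.)
fn same_bytes(bytes: &[u8], text: &str) -> bool {
    bytes == text.as_bytes()
}

fn main() {
    // The escape \n is processed in both literals.
    println!("{}", same_bytes(b"Hello\n", "Hello\n"));
    // The raw byte string keeps the backslash, so it matches "Hello\\n".
    println!("{}", same_bytes(br"Hello\n", "Hello\\n"));
    // A byte literal is just the character's ASCII value as a u8.
    println!("{}", b'A' == 65);
}
```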

String formatting

I think we are already quite familiar with the println! macro, which we use to print text to STDOUT. We also learned in this lesson that the format! macro works like println!, but instead returns a string, making it very useful for string concatenation and substitution. In this section, let’s uncover more about their capabilities.

fn main() {
    let firstname = "John";
    let lastname = "Doe";

    println!("Fullname: {} {}", firstname, lastname);
    println!("Fullname of {1}: {1} {0}", lastname, firstname);
    println!("Fullname: {fname} {lname}", lname = lastname, fname = firstname);
}
$ cargo run
Fullname: John Doe
Fullname of John: John Doe
Fullname: John Doe

If we take a look at the first println!() statement, it’s a very ordinary one. We have two {} placeholders in the format string, and after the string literal, we passed two values to be substituted in order of their appearance. However, if you’d like to specify the order yourself, you can refer to the values by their index inside the placeholders, such as {0} or {1}, as we did in the second println!() statement. If you don’t want to worry about their index and would rather refer to them by a custom alias, you can do that using the alias = value expression, like we did in the third println!() statement. In this case, the order of the named values doesn’t matter.

fn main() {
    let x: i32 = 8;
    let y: f64 = 3.14159265359;
    let z: i32 = 127;

    println!("x: {x:.2}"); // x variable taken as value
    println!("x: {num:+}", num = x); // num alias taken as value
    println!("x: {0:o}", x); // 0th index taken as value
    println!("x: {:#o}", x); // first value taken

    println!("y: {:.2}", y);

    println!("z: {:b}", z);
    println!("z: {:#b}", z);
    println!("z: {:x}", z);
    println!("z: {:X}", z);
    println!("z: {:#x}", z);
}
$ cargo run
x: 8
x: +8
x: 10
x: 0o10
y: 3.14
z: 1111111
z: 0b1111111
z: 7f
z: 7F
z: 0x7f

We can also format integers and floating-point numbers using these format specifiers:

  • To format a value with a specific precision (number of digits after the decimal point), we use the :.N notation, where N is the number of digits. If the value is an integer, the precision is ignored.
  • The :+ notation means to always display the sign of the value.
  • The :b, :o, and :x notation is used to represent a value as binary, octal, and hexadecimal, respectively. You can use :X to print hexadecimal letters in uppercase.
  • The :# notation is used for alternative formatting. It usually prints values with a prefix, such as 0b, 0o, or 0x for binary, octal, and hexadecimal formats, respectively.

#[derive(Debug)]
struct Person {
    firstname: String,
    lastname: String,
    age: u8,
}

fn main() {
    let person = Person {
        firstname: "John".to_string(),
        lastname: "Doe".to_string(),
        age: 36,
    };

    // Error: `Person` doesn't implement `std::fmt::Display`
    // println!("person: {}", person); // <-- uncomment

    println!("person: {:?}", person);
    println!("person (pretty): {:#?}", person);
}
$ cargo run
person: Person { firstname: "John", lastname: "Doe", age: 36 }
person (pretty): Person {
    firstname: "John",
    lastname: "Doe",
    age: 36,
}

A custom struct like Person in the above example doesn’t know how to represent itself as a string. This is why, when we try to print it using the {} format specifier in the println! macro, we get an error. All types that can be printed this way must implement the Display trait, which requires those types to implement a method named fmt. This method writes the text representation of the value into a formatter, and that text is what replaces the placeholder in the println!() format string. So whenever Rust encounters {} in the format string, it calls the fmt method of that value’s Display implementation.

The {:?} format specifier is called the debug formatter. It is used primarily for debugging purposes, as it prints a value in a form meant for developers rather than in a user-friendly format. For a type to display itself using {:?}, it must implement the Debug trait, which also requires a fmt method that writes out a representation of the type’s value.

In the above case, we simply derived the Debug trait on the Person struct, which provides a default implementation of the fmt method. Using {:#?} prints the struct’s value in a more “pretty” or readable format due to the #.
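For completeness, here is a minimal sketch of what a hand-written Display implementation for Person could look like, which would make the {} specifier work. The output format "John Doe (36)" and the sample_person helper are arbitrary choices for this illustration.

```rust
use std::fmt;

struct Person {
    firstname: String,
    lastname: String,
    age: u8,
}

impl fmt::Display for Person {
    // Writes the user-facing text into the formatter `f`.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{} {} ({})", self.firstname, self.lastname, self.age)
    }
}

/// Builds the same person used in the example above.
/// (`sample_person` is a hypothetical helper name.)
fn sample_person() -> Person {
    Person {
        firstname: "John".to_string(),
        lastname: "Doe".to_string(),
        age: 36,
    }
}

fn main() {
    let person = sample_person();
    println!("person: {}", person);
}
```

Any type that implements Display also gets a .to_string() method automatically, so person.to_string() would return the same text as a String.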

Don’t worry if you don’t yet know what a struct or a trait is, or what the Display and Debug traits do. We will cover these topics in the traits lesson.
