Strings are defined between double quotes "..." and not single quotes, unlike JavaScript. Go source code is UTF-8, so string literals are UTF-8 encoded by default, which makes more sense in the 21st century.
As UTF-8 is a superset of the ASCII character set, you don’t need to worry about encoding in most cases. But to understand how UTF-8 encoding works, you should definitely visit my article on Character Encoding.
Let’s write a simple program. To declare an empty variable of string type, use the string type name (string is a predeclared type, not a keyword). Check out earlier tutorials on how to declare a variable.
To find the length of a string, you can use the len function. len is a built-in function, hence you don’t need to import it from any package.
💡 len is not exclusive to strings: it also reports the length of arrays, slices, maps and channels. We will learn more about Go’s built-in functions in upcoming tutorials.
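Here is a minimal sketch of such a program. The variable name s and the value Hello World follow the text; the rest is just one way to lay it out.

```go
package main

import "fmt"

func main() {
	var s string // declared with the zero value, the empty string ""
	s = "Hello World"

	fmt.Println(len(s)) // 11
}
```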
In the above program, len(s) will print 11 to the console as the string s has 11 characters, including the space character.
All characters in the string Hello World are valid ASCII characters, hence we expect each character to occupy only one byte in memory (as ASCII characters in UTF-8 occupy 8 bits or 1 byte).
Let’s verify that using a for loop on the string s.
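A sketch of that loop could look like this, assuming we simply print s[i] for every index with Println:

```go
package main

import "fmt"

func main() {
	s := "Hello World"

	// Index the string one position at a time.
	for i := 0; i < len(s); i++ {
		fmt.Println(s[i])
	}
	// Prints, one value per line:
	// 72 101 108 108 111 32 87 111 114 108 100
}
```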
Whoa! I guess you were expecting s[i] to be a letter of the string s, where i is the index of the character in the string, starting from 0. Then what is this? Well, these are the decimal values of the ASCII/UTF-8 characters in the Hello World string (see the table at http://www.asciichart.com).
In Go, a string is in effect a read-only slice of bytes. For now, imagine a slice is like a simple array. We will learn about slices in upcoming lessons.
In the above example, we are iterating over a slice of bytes (values of type uint8). Hence s[i] prints the decimal value of the byte held by the character. But to see individual characters, you can use the %c format verb in the Printf statement. You can also use the %v verb to see the byte value and %T to see the data type of the value.
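As a sketch, the same loop with Printf and the three verbs might look like this (the exact layout of the format string is an assumption):

```go
package main

import "fmt"

func main() {
	s := "Hello World"

	for i := 0; i < len(s); i++ {
		// %c: character for this byte value, %v: the value itself, %T: its type
		fmt.Printf("%c %v %T\n", s[i], s[i], s[i])
	}
	// First line printed: H 72 uint8
}
```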
So you can see each letter shows a decimal number of type uint8, which occupies 8 bits or 1 byte of memory.
As we know (read the Wikipedia page), a UTF-8 character can occupy from 1 byte (ASCII compatible) to 4 bytes in memory. Hence in Go, all characters are represented in the int32 (size of 4 bytes) data type. A code unit is the smallest group of bits an encoding uses as one unit. UTF-8 uses 8 bits and UTF-16 uses 16 bits for a code unit, which means UTF-8 needs a minimum of 8 bits or 1 byte to represent a character.
A code point is the numerical value that identifies a character, and it is represented by one or more code units depending on the encoding. As UTF-8 is compatible with ASCII, all ASCII characters are represented in a single byte (8 bits), hence UTF-8 needs only 1 code unit to represent them.
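To make the difference between code points and code units concrete, here is a small sketch using utf8.RuneLen from the standard unicode/utf8 package, which reports how many bytes UTF-8 needs for a given code point. The sample characters are my own picks for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	// Each call returns the number of UTF-8 code units (bytes) for one code point.
	fmt.Println(utf8.RuneLen('o')) // 1 (U+006F, plain ASCII)
	fmt.Println(utf8.RuneLen('õ')) // 2 (U+00F5)
	fmt.Println(utf8.RuneLen('€')) // 3 (U+20AC)
	fmt.Println(utf8.RuneLen('😀')) // 4 (U+1F600)
}
```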
But the biggest question is, if all characters are represented in int32, then why are we getting the uint8 type in the above example? As said earlier, in Go, a string is a read-only slice of bytes. When we use the len function on a string, it returns the length of that slice, that is, the number of bytes.
When we index the string in a for loop, we walk over the slice one byte, or one code unit, at a time. So far, all our characters were in the ASCII character set, so each byte was a complete character: a code unit was, in fact, a code point.
Hence %c in the Printf statement could print a valid character from that byte value. But as we know, a UTF-8 code point can be represented by a series of one or more bytes (max 4 bytes). What will happen in the for loop we saw earlier if we introduce non-ASCII characters?
Let’s replace o in Hello with õ (LATIN SMALL LETTER O WITH TILDE, http://www.utf8-chartable.de), which has the Unicode code point U+00F5 and is represented by 2 code units (2 bytes), c3 b5 (hexadecimal representation). So instead of 6f for the character o, we should expect c3 b5 for the character õ.
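A sketch of that experiment, assuming we print the length first and then every byte once with %c and once with %x:

```go
package main

import "fmt"

func main() {
	s := "Hellõ World"

	fmt.Println(len(s)) // 12

	// Print each byte as a character.
	for i := 0; i < len(s); i++ {
		fmt.Printf("%c ", s[i])
	}
	fmt.Println() // H e l l Ã µ   W o r l d

	// Print each byte in hexadecimal.
	for i := 0; i < len(s); i++ {
		fmt.Printf("%x ", s[i])
	}
	fmt.Println() // 48 65 6c 6c c3 b5 20 57 6f 72 6c 64
}
```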
From the above result, we got c3 b5 instead of 6f, but the characters of Hellõ World did not get printed very well. We also see that len(s) returns 12, because len counts the number of bytes in a string, not the number of characters.
Indexing a string (as we did in the for loop) accesses individual bytes, not characters. The %c verb then treats each byte value as a code point of its own, so c3 (decimal 195) is printed as Ã (U+00C3) and b5 (decimal 181) is printed as µ (U+00B5).
To avoid the above chaos, Go introduces the data type rune (a synonym of code point), which is an alias of int32. I told you (but have not proved yet) that Go represents a character (code point) in the int32 data type.
💡 An interesting answer on why rune is int32 and not uint32 (as a character code point value cannot be negative, while the int32 data type can hold both negative and positive values) is here.
So, instead of a slice of bytes, we need to convert the string into a slice of runes.
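A minimal sketch of that conversion, assuming the same string Hellõ World and a variable named r for the slice of runes:

```go
package main

import "fmt"

func main() {
	s := "Hellõ World"

	// Type conversion decodes the UTF-8 bytes into code points.
	r := []rune(s)

	fmt.Println(len(r)) // 11

	for i := 0; i < len(r); i++ {
		fmt.Printf("%x ", r[i])
	}
	fmt.Println() // 48 65 6c 6c f5 20 57 6f 72 6c 64
}
```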
We converted a string into a slice of runes using type conversion. Observe f5 in the above result instead of c3 b5.
This happened because, while converting the string s to a slice of runes, the bytes c3 b5 got decoded into f5: c3 b5 collectively represent the character õ, and the code point of õ in the Unicode table is f5 (hence the code point representation U+00F5) or decimal 245.
Also, we got the length 11, which is the correct character count, because there are 11 runes in the slice (or 11 code points or 11 characters). And we also proved that a code point or a character in Go is represented by the int32 data type.
Using a for loop on a string
If you use range within a for loop, range will return the rune and the byte index at which that character starts.
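Here is a sketch of such a loop. The helper runeStarts is hypothetical, added only to collect the byte indexes that range reports so they are easy to see in one line.

```go
package main

import "fmt"

// runeStarts returns the byte index at which each rune of s begins.
// It is an illustrative helper, not part of any library.
func runeStarts(s string) []int {
	var starts []int
	for i := range s {
		starts = append(starts, i)
	}
	return starts
}

func main() {
	s := "Hellõ World"

	// range decodes one rune per iteration and yields its starting byte index.
	for i, r := range s {
		fmt.Printf("%d:%c\n", i, r)
	}

	fmt.Println(runeStarts(s)) // [0 1 2 3 4 6 7 8 9 10 11]
}
```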
In the above program, we lost index 5 because the byte at index 5 is the second code unit of the õ character. If you don’t need the index value, you can ignore it by using _ (the blank identifier) instead.
What is a rune?
A string is a slice of bytes or uint8 integers, simple as that. When we use a for loop with range, we get a rune for each character, because range decodes the UTF-8 bytes of the string into code points and a code point is held in the rune data type.
In Go, a character can be written between single quotes, AKA a rune literal. Hence, any valid Unicode character within single quotes (') is a rune and its type is int32.
The above program will print f5 245 int32, which are the hexadecimal value, the decimal value and the data type of the code point of õ in the Unicode table.
Strings are immutable
As seen from the earlier definition, a string is a read-only slice of bytes. Hence, if we try to replace any byte in it, the compiler will throw an error.
The above program will not compile and the compiler will throw the error cannot assign to s[0], as the string s is a read-only slice of bytes.
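To make this concrete, here is a sketch in which the rejected assignment is left commented out so that the rest still runs. The workaround of copying into a byte slice is my own addition for illustration, not something the compiler requires.

```go
package main

import "fmt"

func main() {
	s := "Hello World"

	// s[0] = 'M' // uncommenting this line fails to compile: cannot assign to s[0]

	// Assumed workaround: copy the bytes, change the copy, build a new string.
	b := []byte(s)
	b[0] = 'M'

	fmt.Println(string(b)) // Mello World
	fmt.Println(s)         // Hello World (the original is untouched)
}
```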
However, you can create a string from a slice of bytes, and not only from a string literal. But once the conversion from slice to string is done, you cannot modify the string, as explained in the above example.
var1 := []uint8{72, 101, 108, 108, 111} // [72 101 108 108 111]
var2 := string(var1) // Hello
💡 Remember, byte is an alias for uint8 and rune is an alias for int32. Hence, you can use them interchangeably.
String literals using backtick
Instead of double quotes, we can also use the backtick (`) character to represent a string in Go. In double quotes (") you need to escape newlines, tabs and other characters that do not need to be escaped in backticks.
If you put a line break in a backtick string, it is kept as a '\n' character, see https://golang.org/ref/spec#String_literals
💡 The value of a raw string literal is the string composed of the uninterpreted (implicitly UTF-8-encoded) characters between the backticks; in particular, backslashes have no special meaning and the string may contain newlines. Carriage return characters (\r) inside raw string literals are discarded from the raw string value. - GoLang documentation
Let’s see a small example
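The following is a sketch with made-up text: a raw string that contains a line break, a tab, double quotes and a written-out \n, next to an interpreted string for comparison.

```go
package main

import "fmt"

func main() {
	// Raw string literal: everything between the backticks is kept as typed.
	raw := `Hello,
	"World"!\n
How are you?`
	fmt.Println(raw)
	// The line breaks, the tab and the double quotes appear as typed,
	// and the two characters \ and n are printed literally.

	// Interpreted string literal: escape sequences are processed.
	interpreted := "Hello,\n\t\"World\"!"
	fmt.Println(interpreted)
}
```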
We can see that the original formatting of the string, with its newline, tab and double quotes, persisted in the output. The escape sequence \n was printed as written instead of producing a newline, while an actual carriage return character (\r) in the source would be discarded.
Character comparison
A character written in single quotes in Go is a rune, and runes can be compared because they represent Unicode code points (int32 values). Hence a character with a higher code point value is greater than a character with a lower one.
Let’s see a very simple example.
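A minimal sketch of that comparison:

```go
package main

import "fmt"

func main() {
	fmt.Println('a', 'b')  // 97 98
	fmt.Println('b' > 'a') // true
}
```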
Since the int32 value of b is greater than that of a, the expression 'b' > 'a' will be true. Let’s see another example.
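One comparison that may be surprising at first is sketched below; the characters are my own picks for illustration. Uppercase ASCII letters have smaller code points than lowercase ones, and õ sorts after every ASCII letter.

```go
package main

import "fmt"

func main() {
	fmt.Println('A', 'a')  // 65 97
	fmt.Println('A' < 'a') // true
	fmt.Println('õ' > 'z') // true (245 > 122)
}
```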
Since we know that characters are nothing but int32 values internally, we can do all sorts of comparisons with them. For example, a for loop can iterate over a range between two character values.
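A sketch of such a loop, assuming we walk from 'a' to 'e':

```go
package main

import "fmt"

func main() {
	// c is a rune, so it can be compared and incremented like any int32.
	for c := 'a'; c <= 'e'; c++ {
		fmt.Printf("%c ", c)
	}
	fmt.Println() // a b c d e
}
```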
This was a basic introduction to strings in Go, but there are many utility functions provided by the strings package that can be used to perform all sorts of operations on strings like join, replace, search, etc. The strings package is a part of Go’s standard library.
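As a small taste, here is a sketch using a few functions from the strings package. The helper shout is hypothetical and only combines two of those calls.

```go
package main

import (
	"fmt"
	"strings"
)

// shout joins the words with a space and upper-cases the result.
// It is an illustrative helper, not part of the strings package.
func shout(words ...string) string {
	return strings.ToUpper(strings.Join(words, " "))
}

func main() {
	fmt.Println(strings.Join([]string{"Hello", "World"}, " ")) // Hello World
	fmt.Println(strings.Replace("Hello World", "o", "õ", 1))   // Hellõ World
	fmt.Println(strings.Contains("Hello World", "World"))      // true
	fmt.Println(strings.Index("Hello World", "World"))         // 6
	fmt.Println(shout("Hello", "World"))                       // HELLO WORLD
}
```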