UTF-8 is an efficient encoding of Unicode
character-strings that recognizes the fact that the majority of
text-based communications are in ASCII, and it therefore optimizes the
encoding of these characters. Most code translates directly to UTF-8
with no changes at all, but because UTF-8 is a variable-length
multi-byte
encoding you cannot calculate the number of characters from the number
of bytes. Also, there is a small performance hit for working in
UTF-8, probably about 5%, but this is more than offset by its
advantages.
The GLib UTF-8 string functions
are declared in <glib/gunicode.h>.
If you look through this header file you will soon realize that a lot
of extra work is required when working with UTF-8 strings. By
comparison, the
use of UTF-8 strings in XFC is completely transparent because XFC
provides a
standard string compatible UTF-8 string class,
called String, which does the extra work for you. The only string
functions you will ever need to use in an XFC application are those
defined by the Xfc::String class. It's that easy.
Xfc::String provides a
comprehensive API which is declared in
<xfc/utfstring.hh>. All the familiar member functions defined
by std::string are available, as well as convenient
wrappers for all the GLib UTF-8 string functions. You can use an
Xfc::String just as you would use a std::string, but there are a
few important
differences that you need to be aware of.
String is implemented using an
internal std::string as a byte array. This allows construction from a
std::string and simple conversion to a
std::string with the method:
const std::string& str();
str() returns a const reference to
the internal std::string, allowing
the user to pass a String to functions that expect a std::string.
String's std::string-like methods use the
corresponding std::string names, but the meaning of two of the argument
types is different. In a std::string function 'pos' refers to a
byte position within the string and 'n' refers to the number of
bytes. In an Xfc::String method, 'char_pos' refers to a
character
position within the String, 'byte_pos' refers to a byte
position
within the String, 'n_chars' refers to the number of characters
and 'n_bytes' refers to the number of bytes. A special value, npos, can
be passed as the n_bytes or n_chars argument to imply all the remaining
bytes or characters, just as it does in a std::string.
Internally, methods
that take an n_chars argument have to parse the input string or
character array for the number of valid UTF-8 characters, and this takes
time. Therefore you can improve efficiency by using methods that don't
need to know the number of characters. Another efficiency measure is in
the
implementation of the substring search methods. The find(), rfind(),
find_first_of(), find_last_of(), find_first_not_of() and
find_last_not_of()
methods take the byte position from which to start their search and
return the byte position of the first element found or npos
if
unsuccessful. This is the same as with a std::string.
For example, the
find() methods in Xfc::String are:
size_t find(const char *s, size_t byte_pos, size_t n_chars) const;
size_t find(const String& str, size_t byte_pos = 0) const;
size_t find(const char *s, size_t byte_pos = 0) const;
size_t find(char c, size_t byte_pos = 0) const;
size_t find(gunichar c, size_t byte_pos = 0) const;
A 'byte_pos' of zero implies the
beginning of
the string, which is where you usually start searching from. The return
value is
then passed back to the search method as the byte_pos for the
next search, and so on until you are done.
For example, here is a simple forward search:
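#include <iostream>
String s = "This is a string";
// Start the first search at byte 0 so a match at the very beginning
// of the string is not skipped.
for (size_t i = s.find("is"); i != String::npos; i = s.find("is", i + 1))
{
    std::cout << i << std::endl;
}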
which could also be written like this:
#include <iostream>
String s = "This is a string";
size_t i = s.find("is");
while (i != String::npos)
{
    std::cout << i << std::endl;
    // Resume one byte past the last match.
    i = s.find("is", i + 1);
}
The output is 2 and 5, the byte positions of the two occurrences of
"is". Remember, after one search you have to
increment the byte index 'i' by one before the next search, to move
along the string. If you did not, the loop would never end and would
print 2 forever.
You can convert from a character offset within a String to an integer
byte index by calling:
size_t index(size_t
char_pos) const;
String also provides methods that convert a constant pointer to a
position within a String to an integer character offset, and an integer
character offset within a String to a constant pointer to a position
within the string. Both are declared in <xfc/utfstring.hh>.
As with std::string, the size()
method returns the
number of bytes in a String. To get the number of UTF-8
characters
in a String you must instead call:
size_t length() const;
For a std::string, size() and length() return the same value.
Unlike std::string, an Xfc::String understands the concept of being
null.
This simplifies passing a String to a function that accepts a
C-string and assigning a C-string to an Xfc::String. A null
string
is constructed by passing a null pointer, as in the following call:
String s(0);
but you would never do this; there is no point. What you would do is
something like this:
String s = gtk_some_function_that_returns_a_c_string();
If
gtk_some_function_that_returns_a_c_string()
returns a null pointer, the Xfc::String will be null and the null()
method will return true.
When you want to pass a C-string to some function, you call the
following
method:
const char* c_str() const { return null() ? (char*)0 : string_.c_str(); }
As you can see, c_str() is an
inline function that
returns a null pointer if the string is null, otherwise it
calls the internal std::string's c_str() function.
The index operator can be called
to return the
UTF-8 character at a given position in a String, as a G::Unichar:
G::Unichar operator[](size_t char_pos) const;
The 'char_pos'
argument is a character position within the String. The G::Unichar
character is returned by value, not by reference. G::Unichar is a
convenient gunichar wrapper class and is declared in
<xfc/glib/unichar.hh>.
Another useful method is format()
which lets you
do inline sprintf-style text formatting:
static String format(const char *message_format, ...);
Calling
format() is equivalent
to formatting a temporary character array and then calling
String::assign().
You can convert one or more characters in a String from lower case to
upper case, and vice versa, by calling:
String upper();
String upper(size_t char_pos, size_t n_bytes = npos);
String lower();
String lower(size_t char_pos, size_t n_bytes = npos);
The upper() and lower() methods return a new String correctly converted
to UTF-8 upper or lower case.
You can check the validity of the UTF-8 characters in a String by
calling one of the following methods:
bool validate(size_t& byte_pos) const;
bool validate(const_pointer *end = 0) const;
Both methods return true if the String
is
a valid UTF-8 string. After returning, the 'byte_pos' and 'end'
arguments point to the first invalid byte, or the end of the string.
A word about iterators. String defines its own iterators that know how
to iterate over UTF-8 characters in a forward direction (iterator) or
reverse direction (reverse_iterator). These iterators are used just
like
std::string iterators, but note that std::string iterators step through
bytes, not characters, so they can't be used to walk the characters of
a UTF-8 string.
String defines its own standard I/O stream operators so you can use a
String with any stream using the >> and << operators. There
are also equality operators so you can compare two strings, or a string
and a character array, using == and !=.
The String class is declared in <xfc/utfstring.hh>
and exports many more methods than discussed here. Most XFC class
methods
take a String argument as a reference and return a String by value. For
efficiency when passing string literals, all methods that take a String
argument are overloaded to accept a 'const char *' argument as well.