INTRODUCTION

This is a library for converting Unicode strings to numbers. Standard
functions like strtoul and strtod do this for numbers written in the
usual Western number system using the Indo-Arabic numerals, but they do
not handle other number systems. The main functions take as input a
UTF-32 Unicode string and compute the corresponding unsigned integer.
Internal computation is done using arbitrary precision arithmetic, so
there is no limit on the size of the integer that can be converted.

INSTALLATION

For installation of the C library, see the file INSTALL for details. In
general, you should be able to do:

    ./configure
    make
    (su)
    make install

For installation of the Tcl library, see README_TCL. To use the Tcl
library, from tclsh or wish execute the command:

    load libuninum.so

If the library is in the right place (a directory in which Tcl knows to
look for it) this command will succeed silently. You can then try it out
by giving the command:

    source test.tcl

The image TclTestOutput.jpg shows what the result should look like.

USAGE OF THE C API

(A) Converting Unicode strings to numbers

Although there are a variety of additional features, the basic use of
the library is very simple, perhaps simpler than using strtoul(3). Here
is the minimal code needed to convert a UTF-32 string to an unsigned
integer. We assume that str is a wchar_t * containing a null-terminated
UTF-32 string. You will also need the appropriate includes, which are
discussed in the more elaborate example below.

    union ns_rval val;
    unsigned long myint;

    StringToInt(&val,str,NS_TYPE_ULONG,NS_ANY);
    if(0 == uninum_err) myint = val.u;

This call to StringToInt attempts to convert the string str and, if
successful, places the result in val.u. It sets the flag uninum_err to a
non-zero value if an error occurs. The argument NS_ANY tells StringToInt
to attempt to determine the number system itself. If it is unable to do
so, uninum_err will be set to NS_ERROR_NUMBER_SYSTEM_UNKNOWN.
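
To make the idea of determining the number system from the string itself
more concrete, here is a minimal sketch. It is not the library's
detection code, which covers far more systems. It assumes only that the
ten digits of a few place-based systems occupy contiguous Unicode ranges.
The names sketch_guess, enum sketch_ns and the SK_ constants are
hypothetical and exist only in this example.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical identifiers for this sketch; not the constants in nsdefs.h. */
enum sketch_ns { SK_UNKNOWN, SK_WESTERN, SK_DEVANAGARI, SK_GURMUKHI,
                 SK_THAI, SK_LAO };

struct digit_block {
    wchar_t zero;          /* code point of the digit zero */
    enum sketch_ns ns;     /* system whose digits start there */
};

static const struct digit_block blocks[] = {
    { 0x0030, SK_WESTERN },
    { 0x0966, SK_DEVANAGARI },
    { 0x0A66, SK_GURMUKHI },
    { 0x0E50, SK_THAI },
    { 0x0ED0, SK_LAO },
};

/*
 * Returns the system all of whose digits make up s, or SK_UNKNOWN if s is
 * empty, contains a character outside the listed ranges, or mixes systems.
 */
enum sketch_ns sketch_guess(const wchar_t *s)
{
    enum sketch_ns found = SK_UNKNOWN;

    if (s == NULL || *s == L'\0') return SK_UNKNOWN;
    for (; *s != L'\0'; s++) {
        enum sketch_ns here = SK_UNKNOWN;
        size_t i;

        for (i = 0; i < sizeof blocks / sizeof blocks[0]; i++) {
            if (*s >= blocks[i].zero && *s <= blocks[i].zero + 9) {
                here = blocks[i].ns;
                break;
            }
        }
        if (here == SK_UNKNOWN) return SK_UNKNOWN;                 /* not a known digit */
        if (found != SK_UNKNOWN && here != found) return SK_UNKNOWN; /* mixed systems */
        found = here;
    }
    return found;
}
```

A table of ranges keeps the sketch short. Systems that are not place-based
(Roman, Chinese, the alphabetic systems) cannot be recognized this way and
would need their own character sets.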

The value of the string is returned in one of three forms. One option is
a string of ASCII characters containing the decimal representation of
the integer using the Indo-Arabic digits. This option has the virtue of
avoiding any possibility of overflow or truncation. The second is to
obtain the value as an unsigned long integer. If you are going to do
internal calculations, this is probably the most convenient option, but
some numbers (in fact, infinitely many) will not fit into an unsigned
long integer. The library guarantees that no overflow or truncation will
occur; if the number will not fit, it sets an error flag and returns 0.
The third option is to obtain the value as a GNU MP object of type
mpz_t. This is useful if you are going to do further arbitrary precision
calculation.

The library assumes that the input is in UTF-32 Unicode, with two
exceptions. The writing systems for Klingon and Tengwar are not formally
recognized by the Unicode Consortium; for these we assume the encodings
registered with the ConScript Unicode Registry. The encodings for
Egyptian hieroglyphics and Sinhala are the proposed Unicode encodings,
which are not yet (as of version 5.0) official.

The basic interface to the library is the function StringToInt.

    void StringToInt (union ns_rval *ReturnValue,
                      wchar_t *s,
                      short ReturnType,
                      int NumberSystem);

The first argument is a pointer to a union of a string, an unsigned long,
and an mpz_t:

    union ns_rval {
        char *s;
        unsigned long u;
        mpz_t m;
    };

This is used to store the "return" value. The second argument is the
UTF-32 string that you wish to convert. The third argument indicates
whether the return value should be a string, an unsigned long integer,
or an object of type mpz_t. The fourth argument specifies the number
system expected, e.g. NS_CHINESE.

The constants specifying number systems are defined in nsdefs.h. Note
that the mention of a number system in this file does not guarantee that
it is available. A few constants are defined for future use.
You can find out which number systems are actually available by using
the -l flag to numconv or by inspecting the source code.

If a string is returned, it is your job to free it. If an object of type
mpz_t is returned, it is your job to release it when you are done with
it by calling mpz_clear (a function provided by GNU MP).

StringToInt is an interface to a set of functions that each handle a
single writing system, e.g. ArabicToInt, DevanagariToInt, etc. These
functions have the same calling conventions except that they take no
number system argument.

The function WesternToInt assumes that the base is 10. The function
WesternGeneralToInt takes an additional argument specifying a base in
the range [2,36]. It expects strings without base specifiers such as
"0x" for hex. It overlaps in function with strtoul(3). Its most likely
use is in cases in which you want to deal with numbers too large to fit
into an unsigned long integer.

The auxiliary function

    int StringToNumberSystem (char *);

returns a number specifier corresponding to a name such as "Chinese", or
NS_UNKNOWN if it does not recognize the name. The inverse function

    char *NumberSystemToString (int);

is also provided.

The function:

    char *ListNumberSystems(int,int);

is a generator that enumerates the number systems known to the library.
Each time it is called with a non-zero first argument it returns another
number system name. Calling it with a first argument of zero resets it
to the beginning of the list. If the second argument is 0 it generates
the specific number system names that can be used in either direction.
If the second argument is 1 it lists the cover terms (such as "Chinese")
that can only be used when converting strings to numbers.
For example, the following line will print the list of supported
specific number systems on the standard output:

    while ((ds = ListNumberSystems(1,0)) != NULL) printf("%s\n",ds);

To list both the specific number systems and the cover terms, we would
write:

    while ((ds = ListNumberSystems(1,0)) != NULL) printf("%s\n",ds);
    ListNumberSystems(0,0); /* Reset */
    while ((ds = ListNumberSystems(1,1)) != NULL) printf("%s\n",ds);

In almost all cases, it is possible to determine the number system from
a single string. The auxiliary function:

    int GuessNumberSystem(wchar_t *);

returns a number system identifier corresponding to the number system of
the string it is passed. It returns NS_UNKNOWN if it does not recognize
the number system and NS_ALLZERO if the string consists entirely of
zeroes. The number system of such a string cannot be determined
unambiguously, since several number systems that previously lacked a
zero have added one recently and sometimes use the same glyph and
codepoint. However, it is desirable to distinguish this case from
NS_UNKNOWN for two reasons. First, the value of such a string is
determinate, namely 0. Second, if you know that all of the data you are
dealing with is in the same number system, it is sensible to adopt
different strategies in dealing with the two cases. If the first item
returns NS_UNKNOWN, you might as well abandon processing, as you are not
going to be able to deal with it. If the first item returns NS_ALLZERO,
you can expect to determine the number system from subsequent items,
most of which will probably not consist entirely of zeroes.

(B) Converting numbers to Unicode strings

The function that performs this conversion is:

    wchar_t *IntToString(union ns_rval *n,
                         int NumberSystemNumber,
                         short InputType);

where NumberSystemNumber is a numerical number system specifier defined
in nsdefs.h and InputType specifies whether the number passed is an
unsigned long, an ASCII decimal string, or an mpz_t object.
The number to convert is passed in the union to which n points. For
example:

    char *foureightyfive="485";
    union ns_rval number;
    wchar_t *numstr;

    number.s = foureightyfive;
    numstr = IntToString(&number,NS_MONGOLIAN,NS_TYPE_STRING);

will result in the Mongolian for 485 (U+1814 U+1818 U+1815) being
written into freshly allocated storage pointed at by numstr. It is the
caller's responsibility to free this storage when the string is no
longer of use.

Output strings may be divided into groups. Grouping is controlled by
several variables defined in uninum.h.

    wchar_t Uninum_Output_Group_Separator;
    int Uninum_Output_General_Group_Size;
    int Uninum_Output_First_Group_Size;

The reason for the distinction between the two group size variables is
that in some number systems it is customary to use a group size of two
except for the lowest group, which is of size three. This is typical in
India. For example, in India the number 123,456,789 will traditionally
be delimited 12,34,56,789. This pattern will be obtained with the
settings:

    Uninum_Output_General_Group_Size = 2;
    Uninum_Output_First_Group_Size = 3;

The usual Western pattern will be obtained with the settings:

    Uninum_Output_General_Group_Size = 3;
    Uninum_Output_First_Group_Size = 3;

To prevent delimitation, set

    Uninum_Output_General_Group_Size = 0;

The output base for Western_Lower and Western_Upper is determined by the
variable

    int Uninum_Output_Base;

The input base for the same number systems is determined by the variable

    int Uninum_Input_Base;

Some number systems can only represent numbers up to a certain limit. If
you try to convert a number beyond the limit of representation,
IntToString will set uninum_err to NS_ERROR_RANGE. You can find out the
maximum value representable by a number system by calling the function
MaximumValue with the number system code as argument. It returns a
malloc-ed string containing the limit, if any, in decimal, or the word
"unlimited" if there is no limit.
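
The effect of the two group size settings described above can be
illustrated with a small stand-alone sketch. This is not the library's
code. The function group_digits is hypothetical and works on a plain
ASCII digit string rather than on the library's output, and treating a
non-positive first group size as "no grouping" is an assumption of the
sketch.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of libuninum.
 * Copies digits into out, inserting sep so that the lowest group has
 * first_size digits and every higher group has general_size digits.
 * A general_size (or first_size) of 0 or less disables grouping.
 * Returns 0 on success, -1 if out is too small.
 */
int group_digits(const char *digits, int first_size, int general_size,
                 char sep, char *out, size_t out_size)
{
    size_t len = strlen(digits);
    size_t o = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        size_t remaining = len - i;  /* digits left, including this one */
        int boundary = 0;

        if (general_size > 0 && first_size > 0 && i > 0) {
            if (remaining == (size_t)first_size)
                boundary = 1;        /* start of the lowest group */
            else if (remaining > (size_t)first_size &&
                     (remaining - (size_t)first_size) % (size_t)general_size == 0)
                boundary = 1;        /* start of a higher group */
        }
        if (boundary) {
            if (o + 1 >= out_size) return -1;
            out[o++] = sep;
        }
        if (o + 1 >= out_size) return -1;
        out[o++] = digits[i];
    }
    if (o >= out_size) return -1;
    out[o] = '\0';
    return 0;
}
```

Called with "123456789", a first group size of 3 and a general group size
of 2, the sketch produces 12,34,56,789. With both sizes set to 3 it
produces 123,456,789.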

Roman numerals may be generated either using M for all thousands or
using the units with a superscript bar. M is the default. This behaviour
is controlled by the variable:

    int Uninum_Generate_Roman_With_Bar_P;

In the program numconv, the flag -m triggers use of the superscript bar.
In the GUI NumberConverter, set the variable Generate_Roman_With_Bar_P
in your initialization file.

(C) Errors

The variable:

    int uninum_err;

is used to report errors. It is set to zero at the beginning of every
call, so you need not do it yourself. A non-zero value indicates an
error. The errors defined are:

    NS_ERROR_OKAY
        No error occurred.

    NS_ERROR_BADCHARACTER
        The string contains a character that it should not. The first
        character that was not recognized is placed in the variable
        uninum_badchar.

    NS_ERROR_DOESNOTFIT
        The number represented by the string does not fit into an
        unsigned long integer.

    NS_ERROR_NUMBER_SYSTEM_UNKNOWN
        The writing system is not recognized.

    NS_ERROR_BADBASE
        WesternGeneralToInt has been called with a base outside the
        valid range of [2,36].

    NS_ERROR_NOTCONSISTENTWITHBASE
        WesternGeneralToInt has been applied to a string that contains a
        character not possible in the specified base. (For example, if
        the specified base is 8, neither 8 nor 9 nor any of the letters
        can validly appear in the string.)

    NS_ERROR_OUTOFMEMORY
        It was not possible to allocate sufficient memory when
        converting numbers to strings.

    NS_ERROR_RANGE
        The number to be converted is too large to be represented in the
        specified number system.

Three other ancillary functions are provided.

    wchar_t *NormalizeChineseNumbers (wchar_t *s);

Replaces simplified and variant Chinese numerals with their standard,
traditional counterparts, which are the only ones understood by
ChineseToInt. This function may reallocate storage since some such
replacements increase the number of characters in the string.
It is called automatically when ChineseToInt is called via StringToInt.

    wchar_t *StripSeparators (wchar_t *s, wchar_t separator);

Returns a string from which the "thousands" separator specified in its
second argument has been stripped. Since most non-Western writing
systems rarely or never use such separators, it is not called
automatically, but you may find it useful.

    char *uninum_version(void);

Returns the library version. This string should not be freed.

The number systems supported are:

Indic
    Balinese
    Bengali
    Burmese
    Devanagari
    Gujarati
    Gurmukhi
    Kannada
    Kharoshthi
    Khmer
    Lao
    Limbu
    Malayalam
    New_Tai_Lue
    Oriya
    Sinhala
    Tamil
    Tamil_Place
    Tamil_Traditional
    Telugu
    Thai
    Tibetan

Miscellaneous Place
    Arabic_Western
    Common_Braille (Arabic, Dutch, English, German, Greek, Hebrew,
        Italian, Japanese, Korean, languages of India, Quebec,
        Vietnamese)
    Ewellic_Decimal
    Ewellic_Hexadecimal
    French_Braille (French, Czech)
    Klingon
    Mongolian
    Nko
    Osmanya
    Perso_Arabic
    Russian_Braille
    Tengwar

Alphabetic systems
    Arabic_Alphabetic
    Armenian_Alphabetic
    Cyrillic_Alphabetic
    Glagolitic_Alphabetic
    Greek_Alphabetic_Upper
    Greek_Alphabetic_Lower
    Hebrew
    Hebrew_Early
    Hebrew_Late
    Mxedruli
    Xucuri_Lower
    Xucuri_Upper

European non-alphabetic
    Hex
    Old_Italic
    Roman
    Western

Additive-Repetitive
    Egyptian
    Old_Persian
    Phoenician

Ethiopic

Aegean

Sinitic
    Chinese_Regular_Traditional
    Chinese_Mandarin_Regular_Traditional
    Chinese_Regular_Place
    Chinese_Regular_Simplified
    Chinese_Mandarin_Regular_Simplified
    Chinese_Legal_Traditional
    Chinese_Mandarin_Legal_Traditional
    Chinese_Legal_Simplified
    Chinese_Mandarin_Legal_Simplified
    Chinese_Suzhou
    Chinese_Counting_Rod_Early
    Chinese_Counting_Rod_Early_No_Zero
    Chinese_Counting_Rod_Late
    Chinese_Counting_Rod_Late_No_Zero
    Japanese_Regular_Simplified
    Japanese_Regular_Traditional
    Japanese_Regular_Place
    Japanese_Legal_Simplified
    Japanese_Legal_Traditional
    Japanese_Western_Mix

"regular" means that the characters used are those used for writing
numbers in most contexts. "legal" means that the characters used are
those used on legal documents, banknotes, and so forth.

In the case of Chinese proper, "traditional" means that the characters
used are those still in use outside of China and Singapore and used in
those countries as well prior to the 1960s. "simplified" means that the
characters used are those resulting from the reforms carried out in the
People's Republic of China and adopted in Singapore.

In the case of Japanese, "simplified" means that the characters used are
those currently in general use. "traditional" means that the characters
used are those used prior to the reforms of the 1950s.

The distinction between Chinese_Mandarin and Chinese is that
Chinese_Mandarin reflects the use of the morpheme liang "both" in place
of er "two" preceding the powers of ten greater than 1. This usage is
also found in the Chaozhou (Teowchow) dialect, where liang is pronounced
no. Mandarin_Regular_Simplified and Mandarin_Regular_Traditional also
use zeroes in certain positions where other styles do not. The
principles that appear to govern the use of zero in current Mandarin
numbers, and which are implemented here, are:

    (a) if the coefficient of a power of ten is zero, zero is written;
    (b) all sequences of two or more zeroes are reduced to a single zero;
    (c) leading and trailing zeroes are omitted.

Japanese differs from Chinese and Chinese_Mandarin in two respects.
First, the division between traditional and simplified characters is
slightly different. Second, Japanese omits leading 1s in contexts in
which Chinese does not.

In its historical form the Chinese number system is not place-based. The
two systems designated "place" are place-based systems isomorphic to the
Western system in which the digits are the Chinese digits rather than
the Indo-Arabic digits.

The Suzhou system, which is used primarily in markets, differs from the
usual Chinese number systems both in that it is place-based and in that
it uses a distinct set of numerals.
It is incorrectly named "Hangzhou" in Unicode documentation.

The distinction between the Early and Late versions of the Chinese
Counting Rod system reflects the fact that the orientation of the
symbols used for even- and odd-numbered powers of ten was switched
during the Han dynasty. The "No_Zero" variants omit zero where it is
omissible without ambiguity, namely where both neighbors are non-zero
and non-null. (In this case the fact that there is an implicit zero can
be discerned from the fact that its neighbors have the same
orientation.)

Japanese_Western_Mix denotes the system sometimes seen in Japan in which
components of numbers less than 10,000 are written the Western way. For
example, 25,000,000 may be written "2,500MAN", where MAN designates the
character U+4E07 "10,000".

Some logically possible combinations of features are not provided. For
example, the only place-based systems use the simplified regular digits.
There is no reason in principle that traditional or legal digits could
not be used in a place-based manner, but to my knowledge this does not
occur. If you are aware of the use of feature combinations that I have
not provided, please let me know.

----

The following program illustrates the use of the library. You may also
find it useful to study the source for numconv.c, which provides a
command-line interface to the library.

------------------------------------------------------------------------------------
/* Adjust the uninum header paths to match your installation */
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <gmp.h>
#include <uninum/uninum.h>
#include <uninum/nsdefs.h>

/* Create two UTF-32 strings */
wchar_t *s1=L"\x0A67\x0A69\x0A68";  /* Gurmukhi */
wchar_t *s2=L"\x0ED5\x0ED7\x0ED6";  /* Lao */

int main(int ac, char **av) {
  int ns;
  /* This is where the "return" value will be stored */
  union ns_rval val;

  /* So that we can check whether it has changed */
  uninum_err = 0;

  /* We already know what number system this should be */
  StringToInt(&val,           /* pointer to return receiver */
              s1,             /* the string to convert */
              NS_TYPE_STRING, /* flag requesting result as an ASCII string */
              NS_GURMUKHI);   /* number system */

  /* The string is in the s member of union val; it is ours to free */
  if(!uninum_err) {
    printf("%s\n",val.s);
    free(val.s);
  }

  /* Pretend we don't know what number system s2 is in */
  ns=GuessNumberSystem(s2);
  printf("The second number system is: %s\n",NumberSystemToString(ns));
  if(ns == NS_UNKNOWN) exit(2);

  /* So that we can check whether it has changed */
  uninum_err = 0;

  StringToInt(&val,
              s2,
              NS_TYPE_ULONG,  /* flag requesting result as an unsigned long int */
              ns);            /* number system value obtained from GuessNumberSystem */

  /* Unsigned long is in u member of union val */
  if(!uninum_err) printf("%lu\n",val.u);
  exit(0);
}
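
As a complement to the program above, here is a minimal sketch of how a
place-based digit string, such as the Gurmukhi and Lao examples, might be
converted to an unsigned long while reporting, rather than truncating, a
value that does not fit. It is not the library's implementation, which
computes in arbitrary precision. The function digits_to_ulong, its
zero_codepoint parameter and its return codes are hypothetical.

```c
#include <limits.h>
#include <stddef.h>
#include <wchar.h>

/*
 * Hypothetical helper, not part of libuninum.
 * Converts a null-terminated string of decimal digits whose zero digit is
 * at zero_codepoint (e.g. 0x0ED0 for Lao, 0x0A66 for Gurmukhi).
 * Returns 0 and stores the value in *out on success.
 * Returns -1 for an empty string or a character outside the digit range,
 * and -2 if the value would not fit in an unsigned long.
 */
int digits_to_ulong(const wchar_t *s, wchar_t zero_codepoint,
                    unsigned long *out)
{
    unsigned long value = 0;

    if (s == NULL || *s == L'\0') return -1;
    for (; *s != L'\0'; s++) {
        unsigned long digit;

        if (*s < zero_codepoint || *s > zero_codepoint + 9) return -1;
        digit = (unsigned long)(*s - zero_codepoint);
        /* value * 10 + digit would exceed ULONG_MAX: report, do not wrap */
        if (value > (ULONG_MAX - digit) / 10) return -2;
        value = value * 10 + digit;
    }
    *out = value;
    return 0;
}
```

With the strings from the program above, s1 converts to 132 when
zero_codepoint is 0x0A66 and s2 converts to 576 when it is 0x0ED0. The
overflow test is done before the multiplication, so the arithmetic itself
never wraps.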