INTRODUCTION

This is a library for converting Unicode strings to numbers. Standard
functions like strtoul and strtod do this for numbers written in the usual
Western number system using the Indo-Arabic numerals, but they do not handle
other number systems. The main functions take as input a UTF-32 Unicode
string and compute the corresponding unsigned integer. Internal computation
is done using arbitrary precision arithmetic, so there is no limit on the size
of the integer that can be converted.

INSTALLATION

For installation of the C library, see the file INSTALL for details.
In general, you should be able to do:

./configure
make
(su)
make install

For installation of the Tcl library, see README_TCL.

To use the Tcl library, from tclsh or wish
execute the command:

load libuninum.so

If the library is in the right place (a directory in
which Tcl knows to look for it) this command will
succeed silently. You can then try it out by
giving the command:

source test.tcl

The image TclTestOutput.jpg shows what the result should
look like. 

USAGE OF THE C API

(A) Converting Unicode strings to numbers

Although there are a variety of additional features, the basic use of the
library is very simple, perhaps simpler than using strtoul(3). Here is
the minimal code needed to convert a UTF-32 string to an unsigned integer.
We assume that str is a wchar_t * containing a null-terminated UTF-32
string. You will also need the appropriate includes, which are discussed
in the more elaborate example below.

union ns_rval val;
unsigned long myint;

StringToInt(&val,str,NS_TYPE_ULONG,NS_ANY);
if(0 == uninum_err) myint = val.u;

This call to StringToInt attempts to convert the string str
and if successful places the result in val.u. It sets the
flag uninum_err to a non-zero value if an error occurs.
The argument NS_ANY tells StringToInt to attempt to determine
the number system itself. If it is unable to do so, uninum_err
will be set to NS_ERROR_NUMBER_SYSTEM_UNKNOWN.

The value of the string is returned in one of three forms.
One option is a string of ASCII characters containing the decimal
representation of the integer using the Indo-Arabic digits. This option has
the virtue of avoiding any possibility of overflow or truncation. The second
is to obtain the value as an unsigned long integer. If you are going
to do internal calculations, this is probably the most convenient option,
but some numbers (in fact, infinitely many) will not fit into an unsigned
long integer. The library guarantees that no overflow or truncation will occur;
if the number will not fit, it sets an error flag and returns 0.
The third option is to obtain the value as a GNU MP object of type mpz_t.
This is useful if you are going to do further arbitrary precision calculation.
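
The no-overflow guarantee for the unsigned long form can be illustrated with
a small standalone sketch. It does not use the library: the function name
ascii_to_ulong_checked and its err parameter are hypothetical, and the input
is assumed to be an ASCII decimal string rather than UTF-32.

```c
#include <limits.h>

/* Hypothetical helper, not part of libuninum: convert an ASCII decimal
   string to unsigned long. On overflow, an empty string, or a non-digit
   character, set *err to 1 and return 0; otherwise set *err to 0. */
unsigned long ascii_to_ulong_checked(const char *s, int *err) {
  unsigned long value = 0;

  *err = 0;
  if (*s == '\0') { *err = 1; return 0; }
  for (; *s != '\0'; s++) {
    unsigned long digit;

    if (*s < '0' || *s > '9') { *err = 1; return 0; }
    digit = (unsigned long)(*s - '0');
    /* Refuse the step if value * 10 + digit would exceed ULONG_MAX */
    if (value > (ULONG_MAX - digit) / 10) { *err = 1; return 0; }
    value = value * 10 + digit;
  }
  return value;
}
```

The check is made before each multiplication, so the accumulator never wraps
around; a caller sees either the exact value or the error flag with 0.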

The library assumes that the input is in UTF-32 Unicode, with two exceptions.
The writing systems for Klingon and Tengwar are not formally recognized by
the Unicode Consortium; for these we assume the encodings registered with the
ConScript Unicode Registry. The encodings for Egyptian hieroglyphs and Sinhala
are the proposed Unicode encodings, which are not yet (as of version 5.0) official.

The basic interface to the library is the function StringToInt.

void StringToInt (union ns_rval *ReturnValue, wchar_t *s, short ReturnType, int NumberSystem);

The first argument is a pointer to a union of a string, an unsigned long,
and a GNU MP integer of type mpz_t:

union ns_rval {
  char *s;
  unsigned long u;
  mpz_t m;
};

This is used to store the "return" value.

The second argument is the UTF-32 string that you wish to convert. The third argument
indicates whether the return value should be a string, an unsigned long integer,
or an object of type mpz_t. The fourth argument specifies the number system
expected, e.g. NS_CHINESE. The constants specifying number systems are
defined in nsdefs.h. Note that the mention of a number system in this file
does not guarantee that it is available. A few constants are defined for future use.
You can find out which number systems are actually available by using the
-l flag to numconv or by inspecting the source code.

If a string is returned, it is your job to free it.
If an object of type mpz_t is returned, when you are done with it
it is your job to remove it by calling mpz_clear (a function provided by
GNU MP).

StringToInt is an interface to a set of functions that each handle a single
writing system, e.g. ArabicToInt, DevanagariToInt, etc. These functions have the
same calling conventions except for the fact that they take no number
system argument.

The function WesternToInt assumes that the base is 10. The function
WesternGeneralToInt takes an additional argument specifying a base in the range [2,36].
It expects strings without base specifiers such as "0x" for hex. It overlaps
in function with strtoul(3). Its most likely use is in cases in which you want
to deal with numbers too large to fit into an unsigned long integer.
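
The base handling described above can be sketched in plain C as follows.
This is not the library's implementation: the function parse_in_base and the
SK_ error codes are hypothetical, the input is ASCII, and the result is held
in an unsigned long rather than in arbitrary precision.

```c
#include <limits.h>

/* Hypothetical error codes for this sketch only */
enum { SK_OK = 0, SK_BADBASE, SK_NOTCONSISTENT, SK_DOESNOTFIT };

/* Value of an ASCII digit or letter, or -1 if it is neither */
static int digit_value(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'z') return c - 'a' + 10;
  if (c >= 'A' && c <= 'Z') return c - 'A' + 10;
  return -1;
}

/* Convert s, written in the given base with no "0x"-style prefix.
   On failure, store an error code in *err and return 0. */
unsigned long parse_in_base(const char *s, int base, int *err) {
  unsigned long value = 0;

  if (base < 2 || base > 36) { *err = SK_BADBASE; return 0; }
  *err = SK_OK;
  for (; *s != '\0'; s++) {
    int d = digit_value(*s);

    /* A digit must exist and be smaller than the base */
    if (d < 0 || d >= base) { *err = SK_NOTCONSISTENT; return 0; }
    if (value > (ULONG_MAX - (unsigned long)d) / (unsigned long)base) {
      *err = SK_DOESNOTFIT;
      return 0;
    }
    value = value * (unsigned long)base + (unsigned long)d;
  }
  return value;
}
```

With this sketch, "ff" in base 16 yields 255, while "19" in base 8 is
rejected because 9 is not a possible digit in that base.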

The auxiliary function

int StringToNumberSystem (char *);

returns a number specifier corresponding to a name such as "Chinese", or
NS_UNKNOWN if it does not recognize the name. The inverse function

char *NumberSystemToString (int);

is also provided.

The function:

char *ListNumberSystems(int,int);

is a generator that enumerates the number systems known to the library. Each time it
is called with a non-zero first argument it returns another number system name, and it
returns a null pointer when the list is exhausted. Calling it with a first argument of
zero resets it to the beginning of the list. If the
second argument is 0 it generates the specific number system names that can be
used in either direction. If the second argument is 1 it lists the cover terms
(such as "Chinese") that can only be used when converting strings to numbers.

For example, the following line will print the list of supported specific number
systems on the standard output:

    while (ds = ListNumberSystems(1,0)) printf("%s\n",ds);


To list both the specific number systems and the cover terms, we would write:

    while (ds = ListNumberSystems(1,0)) printf("%s\n",ds);
    ListNumberSystems(0,0); /* Reset */
    while (ds = ListNumberSystems(1,1)) printf("%s\n",ds);
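
For readers unfamiliar with this style of generator, here is a minimal
standalone sketch of one way such a function could be written. It is not the
library's code: the name list_names and the two short tables are
hypothetical, and the use of a single shared position is an assumption made
to match the reset shown between the two loops above.

```c
#include <stddef.h>

/* Illustrative data only; these are not the library's actual tables */
static const char *specific_names[] = { "Gurmukhi", "Lao", "Mongolian" };
static const char *cover_names[]    = { "Chinese" };

/* Hypothetical generator. A non-zero `next` returns the next name from the
   list chosen by `which` (0 = specific names, 1 = cover terms), or NULL
   once that list is exhausted. A zero `next` rewinds and returns NULL. */
const char *list_names(int next, int which) {
  static size_t pos = 0;   /* position kept between calls */
  const char **list = which ? cover_names : specific_names;
  size_t count = which ? sizeof cover_names / sizeof cover_names[0]
                       : sizeof specific_names / sizeof specific_names[0];

  if (!next) { pos = 0; return NULL; }
  if (pos >= count) return NULL;
  return list[pos++];
}
```

Because the position lives in a static variable, a generator of this kind is
not reentrant, and forgetting the reset call leaves it at the end of the list.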

In almost all cases, it is possible to determine the number system from a single
string. The auxiliary function:
 
int GuessNumberSystem(wchar_t *); 

returns a number system identifier corresponding to the number system of the string
it is passed. It returns NS_UNKNOWN if it does not recognize the number system
and NS_ALLZERO if the string consists entirely of zeroes. The number system of
such a string cannot be determined unambiguously since several number systems
previously lacking a zero have added one recently and sometimes use the same
glyph and codepoint. However, it is desirable to distinguish this case from
NS_UNKNOWN for two reasons. First, the value of such a string is determinate,
namely 0. Second, if you know that all of the data you are dealing with is in the
same number system, it is sensible to adopt different strategies in dealing
with the two cases. If the first item returns NS_UNKNOWN, you might as well
abandon processing as you are not going to be able to deal with it. If the first
item returns NS_ALLZERO, you can expect to determine the number system
from subsequent items, most of which will most likely not consist entirely
of zeroes.
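
The strategy just described can be sketched as follows. The code is a toy and
does not call the library: guess_system, guess_batch, and the SK_ codes are
hypothetical, only three digit blocks are known to it, and as a
simplification it treats every all-zero string as ambiguous.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical result codes for this sketch only */
enum { SK_UNKNOWN = -1, SK_ALLZERO = 0, SK_WESTERN, SK_GURMUKHI, SK_LAO };

/* Zero code points of three decimal digit blocks */
static const struct { wchar_t zero; int sys_id; } blocks[] = {
  { 0x0030, SK_WESTERN }, { 0x0A66, SK_GURMUKHI }, { 0x0ED0, SK_LAO }
};

/* Toy guesser: every character must be a digit of one and the same block */
int guess_system(const wchar_t *s) {
  int found = SK_UNKNOWN, all_zero = 1;

  if (*s == L'\0') return SK_UNKNOWN;
  for (; *s != L'\0'; s++) {
    int sys = SK_UNKNOWN;
    size_t i;

    for (i = 0; i < sizeof blocks / sizeof blocks[0]; i++) {
      if (*s >= blocks[i].zero && *s <= blocks[i].zero + 9) {
        sys = blocks[i].sys_id;
        if (*s != blocks[i].zero) all_zero = 0;
      }
    }
    if (sys == SK_UNKNOWN) return SK_UNKNOWN;
    if (found != SK_UNKNOWN && sys != found) return SK_UNKNOWN;
    found = sys;
  }
  return all_zero ? SK_ALLZERO : found;
}

/* Skip all-zero items and return the first determinate guess; give up
   as soon as one item is not recognized at all. */
int guess_batch(const wchar_t *const *items, size_t n) {
  size_t i;

  for (i = 0; i < n; i++) {
    int g = guess_system(items[i]);

    if (g == SK_UNKNOWN) return SK_UNKNOWN;
    if (g != SK_ALLZERO) return g;
  }
  return SK_ALLZERO;
}
```

A batch whose first item is "0" and whose second item is written in Lao
digits is therefore classified from the second item.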


(B) Converting numbers to Unicode strings

The function that performs this conversion is:

wchar_t *IntToString(union ns_rval *n, int NumberSystemNumber, short InputType);

where NumberSystemNumber is a numerical number system specifier defined
in nsdefs.h and InputType specifies whether the number passed is an unsigned
long, an ASCII decimal string, or an mpz_t object. The number to convert
is passed in the union to which n points. For example:

char *foureightyfive="485";
union ns_rval number;
wchar_t *numstr;

number.s = foureightyfive;
numstr = IntToString(&number,NS_MONGOLIAN,NS_TYPE_STRING);

will result in the Mongolian for 485 (U+1814 U+1818 U+1815)
being written into freshly allocated storage pointed at by numstr.
It is the caller's responsibility to free this storage when
the string is no longer of use.
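
For a place-based system such as Mongolian, whose ten digits occupy
consecutive code points starting at U+1810, the mapping itself is simple. The
following standalone sketch shows the idea; digits_to_script is a
hypothetical helper, not a library function, and it handles only systems
that work this way.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not part of libuninum: map each ASCII digit to the
   digit of the same value in a block of ten consecutive code points that
   starts at `zero`. Returns malloc-ed storage that the caller must free,
   or NULL on a non-digit character or allocation failure. */
wchar_t *digits_to_script(const char *ascii, wchar_t zero) {
  size_t i, len = strlen(ascii);
  wchar_t *out = malloc((len + 1) * sizeof *out);

  if (out == NULL) return NULL;
  for (i = 0; i < len; i++) {
    if (ascii[i] < '0' || ascii[i] > '9') { free(out); return NULL; }
    out[i] = (wchar_t)(zero + (ascii[i] - '0'));
  }
  out[len] = L'\0';
  return out;
}
```

Calling digits_to_script("485", 0x1810) produces the same three code points
as the example above: U+1814 U+1818 U+1815.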

Output strings may be divided into groups. String delimitation is
controlled by several variables defined in uninum.h.

wchar_t Uninum_Output_Group_Separator;
int Uninum_Output_General_Group_Size;
int Uninum_Output_First_Group_Size;

The reason for the distinction between the two group size variables is
that in some number systems it is customary to use a group size of two
except for the lowest group which is of size three. This is typical in India.
For example, in India the number 123,456,789 will traditionally be delimited
12,34,56,789.

This pattern will be obtained with the settings:
Uninum_Output_General_Group_Size = 2;
Uninum_Output_First_Group_Size   = 3;

The usual Western pattern will be obtained with the settings:
Uninum_Output_General_Group_Size = 3;
Uninum_Output_First_Group_Size   = 3;

To prevent delimitation, set 
Uninum_Output_General_Group_Size = 0;
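
The effect of the two group sizes can be seen in the following standalone
sketch. It is not the library's code: group_digits is a hypothetical
function, it works on ASCII rather than wchar_t, and treating a first group
size of zero or less as "no grouping" is an assumption.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a malloc-ed copy of `digits` with `sep`
   inserted so that the lowest group has `first` digits and every higher
   group has `general` digits. The caller frees the result. */
char *group_digits(const char *digits, int general, int first, char sep) {
  size_t len = strlen(digits), i, j = 0;
  char *out = malloc(2 * len + 1);   /* room for a separator per digit */

  if (out == NULL) return NULL;
  for (i = 0; i < len; i++) {
    size_t rem = len - 1 - i;        /* digits still to be written */

    out[j++] = digits[i];
    /* A separator follows this digit when what remains is the lowest
       group plus a whole number of general-sized groups */
    if (general > 0 && first > 0 && rem >= (size_t)first &&
        (rem - (size_t)first) % (size_t)general == 0)
      out[j++] = sep;
  }
  out[j] = '\0';
  return out;
}
```

With a general size of 2 and a first size of 3, "123456789" becomes
"12,34,56,789"; with both sizes set to 3 it becomes "123,456,789".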

The output base for Western_Lower and Western_Upper is determined by the variable
int Uninum_Output_Base;

The input base for the same number systems is determined by the variable
int Uninum_Input_Base;

Some number systems can only represent numbers up to a certain limit.
If you try to convert a number beyond the limit of representation,
IntToString will set uninum_err to NS_ERROR_RANGE. You can find out
what the maximum value representable by a number system is by
calling the function MaximumValue with the number system code as
argument. It returns a malloc-ed string containing the limit,
if any, in decimal, or the word "unlimited" if there is no limit.

Roman numerals may be generated either using M for all thousands or using
the units with a superscript bar. M is the default. This behaviour is controlled
by the variable:
int Uninum_Generate_Roman_With_Bar_P;

In the program numconv, the flag -m triggers use of the superscript bar.
In the GUI NumberConverter, set the variable Generate_Roman_With_Bar_P
in your initialization file.

(C) Errors

The variable:

int uninum_err;

is used to report errors. It is set to zero at the beginning of every
call so you need not do it yourself. A non-zero value indicates an error.
The errors defined are:

NS_ERROR_OKAY
	No error occurred.

NS_ERROR_BADCHARACTER
	indicates that the string contains a character that it should not.
	The first character that was not recognized is placed in the
	variable uninum_badchar.

NS_ERROR_DOESNOTFIT
	indicates that the number represented by the string does not fit into an
	unsigned long integer.

NS_ERROR_NUMBER_SYSTEM_UNKNOWN
	indicates that the writing system is not recognized.

NS_ERROR_BADBASE
	WesternGeneralToInt has been called with a base outside the
	valid range of [2,36].

NS_ERROR_NOTCONSISTENTWITHBASE
	WesternGeneralToInt has been applied to a string that
	contains a character not possible in the specified base.
	(For example, if the specified base is 8, neither 8 nor 9 nor
	any of the letters can validly appear in the string.)

NS_ERROR_OUTOFMEMORY
	Indicates that it was not possible to allocate sufficient memory
	when converting numbers to strings. 

NS_ERROR_RANGE
	Indicates that the number to be converted is too large to be
	represented in the specified number system.


Three other ancillary functions are provided. 

wchar_t *NormalizeChineseNumbers (wchar_t *s);

Replaces simplified and variant Chinese numerals with their standard, traditional
counterparts, which are the only ones understood by ChineseToInt. This function
may reallocate storage since some such replacements increase the number of
characters in the string. It is called automatically when ChineseToInt is
called via StringToInt.

wchar_t *StripSeparators (wchar_t *s, wchar_t separator);

Returns a string from which the "thousands" separator specified in its
second argument has been stripped. Since most non-Western writing systems
rarely or never use such separators, it is not called automatically,
but you may find it useful.

char *uninum_version(void);

Returns the library version. This string should not be freed.

The number systems supported are:


Indic
	Balinese
	Bengali
	Burmese
	Devanagari
	Gujarati
	Gurmukhi
	Kannada
	Kharoshthi
	Khmer
	Lao
	Limbu
	Malayalam
	New_Tai_Lue
	Oriya
	Sinhala
	Tamil
		Tamil_Place
		Tamil_Traditional
	Telugu
	Thai
	Tibetan


Miscellaneous Place
	Arabic_Western
	Common_Braille	(Arabic, Dutch, English, German, Greek, Hebrew, Italian, Japanese, Korean, languages of India, Quebec, Vietnamese)
	Ewellic_Decimal
	Ewellic_Hexadecimal
	French_Braille	(French, Czech)
	Klingon
	Mongolian
	Nko
	Osmanya
	Perso_Arabic
	Russian_Braille
	Tengwar

Alphabetic systems
	Arabic_Alphabetic
	Armenian_Alphabetic
	Cyrillic_Alphabetic
	Glagolitic_Alphabetic
	Greek_Alphabetic_Upper
	Greek_Alphabetic_Lower
	Hebrew
		Hebrew_Early
		Hebrew_Late
	Mxedruli
	Xucuri_Lower
	Xucuri_Upper

European non-alphabetic
	Hex
	Old_Italic
	Roman
	Western

Additive-Repetitive
	Egyptian
	Old_Persian
	Phoenician

Ethiopic
Aegean

Sinitic
	Chinese_Regular_Traditional
	Chinese_Mandarin_Regular_Traditional
	Chinese_Regular_Place
	Chinese_Regular_Simplified
	Chinese_Mandarin_Regular_Simplified
	Chinese_Legal_Traditional
	Chinese_Mandarin_Legal_Traditional
	Chinese_Legal_Simplified
	Chinese_Mandarin_Legal_Simplified
	Chinese_Suzhou
	Chinese_Counting_Rod_Early
	Chinese_Counting_Rod_Early_No_Zero
	Chinese_Counting_Rod_Late
	Chinese_Counting_Rod_Late_No_Zero
	Japanese_Regular_Simplified
	Japanese_Regular_Traditional
	Japanese_Regular_Place
	Japanese_Legal_Simplified
	Japanese_Legal_Traditional
	Japanese_Western_Mix

"regular" means that the characters used are those used for writing numbers
in most contexts. "legal" means that the characters used are those used
on legal documents, banknotes, and so forth.

In the case of Chinese proper, "traditional" means that the characters used
are those still in use outside of China and Singapore and used in those
countries as well prior to the 1960s. "simplified" means that the characters
used are those resulting from the reforms carried out in the People's Republic
of China and adopted in Singapore.

In the case of Japanese, "simplified" means that the characters used
are those currently in general use. "traditional" means that the characters
used are those used prior to the reforms of the 1950s.

The distinction between Chinese_Mandarin and Chinese is that Chinese_Mandarin
reflects the use of the morpheme liang "both" in place of er "two"
preceding the powers of ten greater than 1. This usage is also found
in the Chaozhou (Teochew) dialect, where liang is pronounced no.
Mandarin_Regular_Simplified and Mandarin_Regular_Traditional also
use zeroes in certain positions where other styles do not.

The principles that appear to govern the use of zero in current
Mandarin numbers, and which are implemented here, are:
(a) if the coefficient of a power of ten is zero, zero is written;
(b) all sequences of two or more zeroes are reduced to a single zero;
(c) leading and trailing zeroes are omitted.
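
The combined effect of rules (a) to (c) can be shown on a plain decimal digit
string. The sketch below is only an illustration and is not how the library
generates Chinese text: zero_skeleton is a hypothetical function that keeps
the non-zero coefficients and the zeroes that would still be written, and it
ignores the characters for the powers of ten themselves. The input is assumed
to be a positive number.

```c
#include <stddef.h>

/* Hypothetical helper: copy into `out` the non-zero digits of `digits`
   together with the zeroes that survive rules (a) to (c). `out` must have
   room for strlen(digits) + 1 characters. */
void zero_skeleton(const char *digits, char *out) {
  size_t j = 0;

  /* (c) leading zeroes are omitted */
  while (*digits == '0') digits++;
  for (; *digits != '\0'; digits++) {
    if (*digits != '0')
      out[j++] = *digits;             /* non-zero coefficient */
    else if (j > 0 && out[j - 1] != '0')
      out[j++] = '0';                 /* (a) zero written, (b) runs collapse */
  }
  /* (c) a trailing zero is omitted */
  if (j > 0 && out[j - 1] == '0') j--;
  out[j] = '\0';
}
```

For example, 1005 and 1050 both reduce to the skeleton "105", that is, a
single written zero between the two non-zero coefficients, while 1500 reduces
to "15" with no zero written at all.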

Japanese differs from Chinese and Chinese_Mandarin in two respects.
First, the division between traditional and simplified characters is slightly
different. Second, Japanese omits leading 1s in contexts in which Chinese
does not.

In its historical form the Chinese number system is not place-based.
The two systems designated "place" are place-based systems isomorphic
to the Western system in which the digits are the Chinese digits
rather than the Indo-Arabic digits.

The Suzhou system, which is used primarily in markets, differs
from the usual Chinese number systems both in that it is place-based
and in that it uses a distinct set of numerals.
It is incorrectly named "Hangzhou" in Unicode documentation.

The distinction between the Early and Late versions of the Chinese Counting
Rod system reflects the fact that the orientation of the symbols used for even-
and odd-numbered powers of ten was switched during the Han dynasty. The
"No_Zero" variants omit zero where it is omissible without ambiguity,
namely where both neighbors are non-zero and non-null. (In this case
the fact that there is an implicit zero can be discerned from the fact
that its neighbors have the same orientation.)
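
The omissibility condition can be written down directly. In the following
standalone sketch, zero_is_omissible is a hypothetical function operating on
an ASCII decimal string; it is not taken from the library.

```c
#include <string.h>

/* Hypothetical helper: return 1 if the zero at index i of `digits` may be
   omitted in a "No_Zero" style, that is, if it has a neighbor on each side
   and neither neighbor is itself a zero. Return 0 otherwise. */
int zero_is_omissible(const char *digits, size_t i) {
  size_t len = strlen(digits);

  if (i >= len || digits[i] != '0') return 0;
  if (i == 0 || i + 1 >= len) return 0;   /* a neighbor is null */
  return digits[i - 1] != '0' && digits[i + 1] != '0';
}
```

In 101 the middle zero is omissible: its neighbors stand two powers of ten
apart and so share an orientation, which reveals the gap. In 1001 neither
zero is omissible, since each has a zero as a neighbor.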

Japanese_Western_Mix denotes the system sometimes seen in Japan
in which components of numbers less than 10,000 are written the Western
way. For example, 25,000,000 may be written "2,500MAN", where MAN designates
the character U+4E07 "10,000".

Some logically possible combinations of features are not provided. For example,
the only place-based systems use the simplified regular digits.
There is no reason in principle that traditional or legal digits could
not be used in a place-based manner, but to my knowledge this does not occur.
If you are aware of the use of feature combinations that I have not
provided, please let me know.


----
The following program illustrates the use of the library. You may also
find it useful to study the source for numconv.c, which provides a command-line
interface to the library.

------------------------------------------------------------------------------------
#include <stdlib.h>
#include <stdio.h>
#include <wchar.h>
#include <gmp.h>

#include <uninum/unicode.h>
#include <uninum/nsdefs.h>
#include <uninum/uninum.h>

/* Create two UTF-32 strings */
wchar_t *s1=L"\x0A67\x0A69\x0A68"; /* Gurmukhi */
wchar_t *s2=L"\x0ED5\x0ED7\x0ED6"; /* Lao */

int
main(int ac, char **av) {
  int ns;

  /* This is where the "return" value will be stored */
  union ns_rval val;

  /* So that we can check whether it has changed */
  uninum_err = 0;

  /* We already know what number system this should be */
  StringToInt(&val,		/* pointer to return receiver */
	     s1,		/* the string to convert */
	     NS_TYPE_STRING,	/* flag requesting result as an ascii string */
	     NS_GURMUKHI);	/* number system */

  /* The string is in the s member of union val; it is our job to free it */
  if(!uninum_err) {
    printf("%s\n",val.s);
    free(val.s);
  }

  /* Pretend we don't know what number system s2 is in */
  ns=GuessNumberSystem(s2);
  printf("The second number system is: %s\n",NumberSystemToString(ns));
  if(ns == NS_UNKNOWN) exit(2);

  /* So that we can check whether it has changed */
  uninum_err = 0;
  StringToInt(&val,
	     s2,
	     NS_TYPE_ULONG,	/* flag requesting result as an unsigned long int */
	     ns);		/* number system value obtained from GuessNumberSystem */

  /* Unsigned long is in u member of union val; %lu matches unsigned long */
  if(!uninum_err) printf("%lu\n",val.u);

  exit(0);
}