Topics: Code Pages, TVCodePage Author: Salvador E. Tropea Status: Complete Revision: $Revision: 1.1 $ 1. INTRODUCTION 2. CODE PAGES IN TURBO VISION 3. FROM THE POINT OF VIEW OF THE USER 4. SUPPORTED ENCODINGS 5. FROM THE POINT OF VIEW OF THE PROGRAMMER 5.1 START UP AND CHANGES ON THE FLY 5.2 OTHER USEFUL INFORMATION 1. INTRODUCTION Most applications, Turbo Vision applications included, uses 8 bits characters. It gives 256 possible combinations, but: what exactly means each one? it depends on the system. To avoid problems some standars exists. DOS uses some IBM/ANSI encodings like code page 437, 850, 866, etc., Windows uses another set of code pages created by Microsoft (code pages 1250, 1251, etc.) and most UNIX systems uses enconding described in the ISO 8859 standard. In this way a document is: 1) Compact, each character/letter needs one byte. 2) Exchangeable, you can use documents from other system just knowing which encoding is used. Linux users should also read the Linux driver documentation, it have a lot of information about the problems found on Linux systems. 2. CODE PAGES IN TURBO VISION TV uses code pages for four things. Three of them are really important for users so we will focus only these three. The first is the "Application code page". This code page indicates how the application is encoded. An example of "application" stuff is the text you load in TV editor. The second is the "Screen code page". This is how the screen is encoded. That's usually the same used for the application, but not always. An interesting case is Linux where the screen can have any arbitrary encoding. And the third is the "Input code page". That's how is encoded the text from the keyboard. That's usually the same as the screen. An interesting case is Win32 console API, it supports a different code page for input, but I never saw a system configured like this. Note that the application code page is how things are encoded internally, input code pages is the encoding of the input and screen code page is the encoding of the output. The center is the application. If any of the other two code pages doesn't match a translation table is created. So you could have a system where the accented characters from the keyboard are encoded in code page 850 (DOS), the application in 1252 (Windows) and the output in ISO 8859-1 (POSIX). All will work ok. 3. FROM THE POINT OF VIEW OF THE USER Usually user doesn't have to mess with it. Applications that offer big flexibility, code page translations, etc. should do it using friendly dialogs. An example is SETEdit text editor. But even in this case the user have to know the concepts explained in the first sections of this doc. A special case where users could need to do some configuration is when using Linux. A long explanation about it is found in the Linux driver documentation. In this documentation Linux users will find a list of supported code pages. Please refer to the index. 4. SUPPORTED ENCODINGS That's a list of supported encodings and the number used for each encoding. This number is what you need to use for configuration variables like AppCP. Name ID PC 437 ASCII ext. 437 PC 737 Greek 737 PC 775 DOS Baltic Rim 775 PC 850 Latin 1 850 PC 852 Latin 2 852 PC 855 Russian 2 855 PC 857 Turkish 857 PC 860 Portuguese 860 PC 861 Icelandic 861 PC 863 French 863 PC 865 Nordic 865 PC 866 Russian 866 PC 869 Greek 2 869 CP 1250 Win Latin 2 1250 CP 1251 Win Russian 1251 CP 1252 Win Latin 1 1252 CP 1253 Win Greek 1253 CP 1254 Win Turkish 1254 CP 1257 Win Baltic 1257 Mac Cyr. CP 10007 10007 ISO 8859-1 Latin 1 88791 ISO 8859-2 Latin 2 88792 ISO 8859-3 Latin 3 88593 ISO 8859-4 Latin 4 88594 ISO 8859-5 Russian 88595 ISO 8859-7 Greek 88597 ISO 8859-9 88599 ISO Latin 1 (Linux) 885901 ISO Latin 1u(Linux) 885911 ISO 8859-14 885914 ISO 8859-15 Icelan. 885915 ISO Latin 2 (Linux) 885920 ISO Latin 2u(Linux) 885921 ISO Latin 2 (Sun) 885922 ISO Latin 2+Euro (Linux) 885923 KOI-8r (Russian) 100000 KOI-8 with CRL/NMSU 100001 Mac OS Ukrainian 100072 Osnovnoj Variant Russian 885951 Alternativnyj Variant RU 885952 U-code Russian 885953 Mazovia (polish) 1000000 ISO 5427 ISO-IR-37 KOI-7 3604494 ECMA-Cyr.ISO-IR-111 17891342 JUS_I.B1.003-SERB ISOIR146 21364750 JUS_I.B1.003-MAC ISO-IR-147 21430286 Cyrillic ISO-IR-153 22216718 5. FROM THE POINT OF VIEW OF THE PROGRAMMER The program can force the encodings using the TVMainConfigFile::Add mechanism even before the application object is created. Read the configuration file documentation for more information. The fourth encoding involved in the process is the code page used for the secondary font. But this encoding doesn't really affect TV. You can change the encodings at any time using the TVCodePage::SetCodePage member. To use the TVCodePage class use the Uses_TVCodePage request. The source file for it is codepage.cc. Read the header for more information. This class also provides members to replace some ctype.h functions. They are: toUpper, toLower, toLowerTable, isAlpha, isAlNum, isLower and isUpper. Note these members uses information from the current application code page. If the value is properly set this information is usually much better than the information you can get from C locales system. Also have in mind that a TV application could be using a very different application encoding, if things are properly configured it will be transparent, but in this case the ctype.h functions will fail. 5.1 START UP AND CHANGES ON THE FLY The library itself is encoded in the 437 code page. That's only for historic reasons (Borland code was for DOS and that's the most common case on DOS). The encoding used at run time is selected by the current driver. It can be forced from the configuration file or the application as already explained. If the library determines the needed code page is different than 437 all the internal information is recoded. Applications that uses characters outside the ASCII range should provide a call back to be called when a recode is needed. You can see an example in the demo program. Here is a basic explanation on how it works: 1) You set a callback *before* creating the application object using TVCodePage::SetCallBack(call_back_function) A call back prototype could be: void cpCallBack(ushort *map) The SetCallBack function returns the previous call back you should save it. 2) This call back will be called when an application code page is done. The map argument is needed to be passed to the TVCodePage members that does the remapping job. They are RemapChar (for a single char), RemapString (for an ASCIIZ string) and RemapNString (for a generic buffer). After remapping all the needed stuff you should call the previous call back. Note it can be a NULL pointer. When you have dynamic text that needs to be recoded, that's text in TView objects that are already created and for some reason doesn't use a static buffer to hold the special characters, you have to handle a special broadcast called cmUpdateCodePage in the handleEvent member. In this case the infoPtr event field contains a pointer to the needed map. 5.2 OTHER USEFUL INFORMATION You can also create your own code pages providing an unicode table. The library currently supports encondings that covers languages using the latin, cyrillic and greek alphabets. If you think you can help with other alphabets, enhance the currently supported or add/fix code pages information please contact us. Note the library doesn't support a lot of important details needed for some languages like: bidirectional writing, vertical writings, glyphs composition, variable spacing, encodings that use more than 256 characters, etc. An internal 16 bits encoding is used for the code pages, you don't have to mess with it but routines to convert from and to Unicode are provided. TVCodePage also provides a member to convert from any of the known code pages to another. Is important to understand that the remapping done when input, application and output doesn't match takes some CPU, but it is currently implemented in a very fast way so the overhead shouldn't be even meassurable. What we currently do is to create one to one tables when the code pages changes. That's all, no searches are done during the drawing. That's quite different to what Linux kernel does, in this case a table converts input values to Unicode and then they are searched in some sort of hash to be converted into the screen encoding. We still using a code page internally, not Unicode nor our internal code, so we don't have to do such a slow thing. The tables uses a simple algorithm to find similar symbols when a value can't be represented in the target code page, this value is put in the table, so even in this case we don't need to search anything. You can know if the remapping is enabled and do it using OnTheFlyRemapNeeded(), OnTheFlyRemap(uchar val), OnTheFlyRemapInpNeeded and OnTheFlyInpRemap. Another important detail is how the keyboard works, note that sometimes the keyboard is just a QWERTY keyboard that somehow generates the other alphabet. In this case a key usually means more than one symbol and users spect that Alt+Key is mapped to *both* things for objects like the menu. This is supported, by I need help and testing. Currently the support is only for DOS code page 737 and Linux KOI8-R. This functionality was introduced in version 2.0.0 of our port. The Borland library didn't have it nor any serious equivalent. It just provided a virtual member to do some stuff at start up.