|< C & Win32 programming 8 | Main | ANSI, Multibyte, Unicode and Localization 2 >| Site Index | Download |
MODULE G

CHARACTER SETS

ANSI, Multibyte, Unicode and Localization 1

What is in this Module?

  1. Programming Abilities

  2. A Story of Character Sets

  3. Single-byte Character Sets

  4. Double-byte Character Sets

  5. Unicode

  6. Windows Issue

  7. Surrogates

  8. Windows Data Types for Strings

  9. Standard C Functions

  10. wsprintf() Function

  11. Conventions for Function Prototypes – Win32 Programming

  12. Generic Prototype

  13. ANSI Prototype

  14. Unicode Prototype

  15. Character and Unicode strings Conversion

 

 

My Training Period: xx hours. This Module is provided for information only, for readers who are interested in localization.  Some of the information is also needed to understand conventions used in Windows programming.  ANSI C and C++ have been superseded by the ISO/IEC standards.  Program examples, if any, were compiled using Visual C++ .NET 2003.

 

The programming abilities for this session:

A Story of Character Sets

Single-byte Character Sets

  • A single-byte character set is a mapping of 256 individual characters to their identifying code values.  The code values 0x20 (decimal 32) through 0x7E (decimal 126) represent standardized displayable characters, but the characters represented by the remaining codes vary among character sets.

  • The ASCII character set covers the range 0x00 through 0x7F.

  • The ANSI character set is used in the window manager (User) and graphics device interface (GDI), but the Microsoft MS-DOS file allocation table (FAT) file system uses the original equipment manufacturer (OEM) character set.

  • Variations on the character sets, called code pages, include different special characters, typically customized for a language or group of languages.  The OEM code page 437 is generally used in the United States.

  • Applications can use Unicode to avoid the inconsistencies of varied code pages and as an aid in developing easily localized applications.

  • For Windows programming, an application can use the GetACP() function to retrieve the ANSI code-page identifier for the system or use the GetOEMCP() function to retrieve the OEM code-page identifier.

  • The OemToChar() and OemToCharBuff() functions allow an application to convert a character or string from the OEM code page to either the ANSI code page or Unicode.

  • To convert in the other direction, you can use either the CharToOem() or CharToOemBuff() function.  In addition, an application can use the MultiByteToWideChar() and WideCharToMultiByte() functions to map single-byte character set (SBCS) strings to Unicode and map Unicode strings to SBCS strings.

  • The GetCPInfo() function fills a CPINFO structure with information that includes the size, in bytes, of the largest character in the code page and the default character used when a character code is entered that has no corresponding entry in the code page.

Double-byte Character Sets

  • The double-byte character set (DBCS) is called an expanded 8-bit character set because its smallest unit is a byte.

  • Some characters in a DBCS have a single byte code value and some have a double byte code value.  A DBCS can be thought of as the ANSI character set for some Asian versions of Microsoft Windows (particularly the Japanese versions).  Functions on the Japanese versions of Windows accept DBCS strings for the ANSI versions of the functions.

  • However, unlike the handling of Unicode, DBCS character handling requires detailed changes in the character-processing algorithms throughout an application's source code.

  • To help identify double-byte character sets, an application can use the IsDBCSLeadByte() function to determine whether a given character is the first byte in a 2-byte character.

  • In addition, an application can use the MultiByteToWideChar() and WideCharToMultiByte() functions to map DBCS strings to Unicode and map Unicode strings to DBCS strings.
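The lead-byte test above is what forces DBCS-aware code to walk a string character by character instead of byte by byte. The following is a minimal portable sketch of that walk, not the Win32 implementation: is_cp932_lead_byte() is a hypothetical stand-in for IsDBCSLeadByte() that hard-codes the Shift-JIS (code page 932) lead-byte ranges 0x81-0x9F and 0xE0-0xFC, and dbcs_char_count() is likewise an assumed helper name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for IsDBCSLeadByte(), hard-coded for code page 932
   (Shift-JIS), whose lead bytes are 0x81-0x9F and 0xE0-0xFC. */
static int is_cp932_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Counts characters (not bytes) in a null-terminated DBCS string. */
static size_t dbcs_char_count(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t count = 0;

    assert(s != NULL);

    while (*p != '\0') {
        /* A lead byte followed by a non-null byte forms one 2-byte character;
           anything else is treated as a single-byte character. */
        if (is_cp932_lead_byte(*p) && p[1] != '\0')
            p += 2;
        else
            p += 1;
        count++;
    }
    return count;
}
```

For example, the four bytes 'A', 0x82, 0xA0, 'B' hold three characters, because 0x82 is a lead byte and 0x82 0xA0 is one 2-byte character.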

 

Unicode

Windows Issue

Surrogates

Windows Data Types for Strings

  1. A generic version that can be compiled for either ANSI or Unicode.

  2. An ANSI version.

  3. A Unicode version.

// Generic types
#ifdef UNICODE
    typedef wchar_t TCHAR;
#else
    typedef char TCHAR;
#endif
typedef TCHAR *LPTSTR, *LPTCH;

// 8-bit character specific
typedef char CHAR;
typedef CHAR *LPSTR, *LPCH;

// Unicode specific (wide characters)
typedef wchar_t WCHAR;
typedef WCHAR *LPWSTR, *LPWCH;

Standard C Functions

#define _UNICODE

 

#include <tchar.h>

#include <wchar.h>

wsprintf() Function

Format specification | ANSI version | Unicode version
c | CHAR | WCHAR
C | WCHAR | CHAR
hc, hC | CHAR | CHAR
hs, hS | LPSTR | LPSTR
lc, lC | WCHAR | WCHAR
ls, lS | LPWSTR | LPWSTR
s | LPSTR | LPWSTR
S | LPWSTR | LPSTR

 

Table 1:  wsprintf() format specification.

Item

Description

Function

wsprintf().

Use

Formats and stores a series of characters and values in a buffer.  The function has both ANSI and Unicode versions.

Prototype

int wsprintf( LPTSTR lpOut, LPCTSTR lpFmt, ... );

Parameters

lpOut - [out] Pointer to a buffer to receive the formatted output.  The maximum size of the buffer is 1024 bytes.

lpFmt - [in] Pointer to a null-terminated string that contains the format-control specifications.  In addition to ordinary ASCII characters, a format specification for each argument appears in this string.

... - [in] Specifies one or more optional arguments.  The number and type of argument parameters depend on the corresponding format-control specifications in the lpFmt parameter.

Return value

If the function succeeds, the return value is the number of characters stored in the output buffer, not counting the terminating null character.

If the function fails, the return value is less than the length of the expected output.  To get extended error information, call GetLastError().

Include file

<windows.h>

 

Table 2:  wsprintf() function information.

 

  • Using this function incorrectly can compromise the security of your application.  The string returned in lpOut is not guaranteed to be null-terminated.

  • Also, avoid the %s format.  It can lead to a buffer overrun.  If an access violation occurs, the result is a denial of service against your application.  In the worst case, an attacker can inject executable code.

  • Consider using one of the following alternative functions: StringCbPrintf(), StringCbPrintfEx(), StringCbVPrintf(), StringCbVPrintfEx(), StringCchPrintf(), StringCchPrintfEx(), StringCchVPrintf(), or StringCchVPrintfEx().

  • The format-control string contains format specifications that determine the output format for the arguments following the lpFmt parameter.

  • Format specifications, discussed below, always begin with a percent sign (%).  If a percent sign is followed by a character that has no meaning as a format field, the character is not formatted (for example, %% produces a single percent-sign character).

  • The format-control string is read from left to right.  When the first format specification (if any) is encountered, it causes the value of the first argument after the format-control string to be converted and copied to the output buffer according to the format specification.

A format specification has the following form:

%[-][#][0][width][.precision]type

Field | Meaning
- | Pad the output with blanks or zeros to the right to fill the field width, justifying output to the left.  If this field is omitted, the output is padded to the left, justifying it to the right.
# | Prefix hexadecimal values with 0x (lowercase) or 0X (uppercase).
0 | Pad the output value with zeros to fill the field width.  If this field is omitted, the output value is padded with blank spaces.
width | Copy the specified minimum number of characters to the output buffer.  The width field is a nonnegative integer.  The width specification never causes a value to be truncated; if the number of characters in the output value is greater than the specified width, or if the width field is not present, all characters of the value are printed, subject to the precision specification.
.precision | For numbers, copy the specified minimum number of digits to the output buffer.  If the number of digits in the argument is less than the specified precision, the output value is padded on the left with zeros.  The value is not truncated when the number of digits exceeds the specified precision.  If the specified precision is 0 or omitted entirely, or if the period (.) appears without a number following it, the precision is set to 1.  For strings, copy the specified maximum number of characters to the output buffer.
type | Output the corresponding argument as a character, a string, or a number.  This field can be any of the following values:

  1. c - Single character.  This value is interpreted as type WCHAR if the calling application defines Unicode and as type CHAR otherwise.

  2. C - Single character.  This value is interpreted as type CHAR if the calling application defines Unicode and as type WCHAR otherwise.

  3. d - Signed decimal integer.  This value is equivalent to i.

  4. hc, hC - Single character.  The wsprintf() function ignores character arguments with a numeric value of zero.  This value is always interpreted as type CHAR, even when the calling application defines Unicode.

  5. hd - Signed short integer argument.

  6. hs, hS - String.  This value is always interpreted as type LPSTR, even when the calling application defines Unicode.

  7. hu - Unsigned short integer.

  8. i - Signed decimal integer.  This value is equivalent to d.

  9. lc, lC - Single character.  The wsprintf() function ignores character arguments with a numeric value of zero.  This value is always interpreted as type WCHAR, even when the calling application does not define Unicode.

  10. ld - Long signed integer.  This value is equivalent to li.

  11. li - Long signed integer.  This value is equivalent to ld.

  12. ls, lS - String.  This value is always interpreted as type LPWSTR, even when the calling application does not define Unicode.  This value is equivalent to ws.

  13. lu - Long unsigned integer.

  14. lx, lX - Long unsigned hexadecimal integer in lowercase or uppercase.

  15. p - Windows 2000/XP: Pointer.  The address is printed using hexadecimal.

  16. s - String.  This value is interpreted as type LPWSTR when the calling application defines Unicode and as type LPSTR otherwise.

  17. S - String.  This value is interpreted as type LPSTR when the calling application defines Unicode and as type LPWSTR otherwise.

  18. u - Unsigned integer argument.

  19. x, X - Unsigned hexadecimal integer in lowercase or uppercase.

 

Table 3:  Optional and required field for wsprintf().
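For the integer and string cases, the -, #, 0, width and .precision fields in Table 3 follow the same rules as the standard C printf family, so they can be tried out without Windows. The sketch below deliberately swaps in the C99 snprintf() for wsprintf(); that substitution is an assumption made for portability and bounded output, and format_demo() is a hypothetical helper name. It avoids a precision of 0, where the two functions may differ.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: formats one integer and one string using the fields
   from Table 3.  Returns the snprintf() result, that is, the number of
   characters the full output needs, not counting the terminating null. */
static int format_demo(char *out, size_t cap, int n, const char *s)
{
    assert(out != NULL && s != NULL);

    return snprintf(out, cap,
                    "[%-6d]"    /* -          : left-justify in a 6-character field */
                    "[%06d]"    /* 0 + width  : pad with zeros to 6 characters      */
                    "[%#x]"     /* #          : prefix hexadecimal output with 0x   */
                    "[%.3d]"    /* .precision : at least 3 digits for a number      */
                    "[%.2s]"    /* .precision : at most 2 characters of a string    */
                    "[100%%]",  /* %%         : a literal percent sign              */
                    n, n, (unsigned)n, n, s);
}
```

With n = 42 and s = "abc", the buffer receives [42    ][000042][0x2a][042][ab][100%].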

Conventions for Function Prototypes – Win32 Programming

Generic Prototype

BOOL SetWindowText(HWND hwnd, LPCTSTR lpText);

#ifdef UNICODE

#define SetWindowText SetWindowTextW

#else

#define SetWindowText SetWindowTextA

#endif // !UNICODE

ANSI Prototype

BOOL SetWindowTextA(HWND hwnd, LPCSTR lpText);

Unicode Prototype

BOOL SetWindowTextW(HWND hwnd, LPCWSTR lpText);
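The three prototypes above work because the generic name is only a macro that expands to the A or W function at compile time. The following portable sketch imitates that mechanism without the Windows headers. All names in it (TitleLengthA, TitleLengthW, TitleLength, DEMO_TEXT, DEMO_TCHAR, demo_title_length) are hypothetical; DEMO_TEXT and DEMO_TCHAR are assumed analogs of the real TEXT() macro and TCHAR type.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical pair of functions that follow the Win32 A/W naming convention. */
static size_t TitleLengthA(const char *text)
{
    assert(text != NULL);
    return strlen(text);
}

static size_t TitleLengthW(const wchar_t *text)
{
    assert(text != NULL);
    return wcslen(text);
}

/* Generic names, chosen the same way the Windows headers choose between
   SetWindowTextA and SetWindowTextW. */
#ifdef UNICODE
#define TitleLength   TitleLengthW
#define DEMO_TEXT(s)  L##s
typedef wchar_t DEMO_TCHAR;
#else
#define TitleLength   TitleLengthA
#define DEMO_TEXT(s)  s
typedef char DEMO_TCHAR;
#endif

/* Source written once against the generic names compiles either way. */
static size_t demo_title_length(void)
{
    const DEMO_TCHAR *title = DEMO_TEXT("Module G");
    return TitleLength(title);
}
```

Compiling the same file with and without UNICODE defined changes which function is called and which character type is used, while the calling code stays unchanged.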

Character and Unicode strings Conversion

Item

Description

Function

MultiByteToWideChar().

Use

Maps a character string to a wide-character (Unicode) string.  The character string mapped by this function is not necessarily from a multibyte character set.

Prototype

int MultiByteToWideChar(

  UINT CodePage,                // code page

  DWORD dwFlags,             // character-type options

  LPCSTR lpMultiByteStr,    // string to map

  int cbMultiByte,                   // number of bytes in string

  LPWSTR lpWideCharStr,  // wide-character buffer

  int cchWideChar);              // size of buffer

Parameters

CodePage - [in] Specifies the code page to be used to perform the conversion.  This parameter can be given the value of any code page that is installed or available in the system.  You can also specify one of the values listed below:

  1. CP_ACP  - ANSI code page.

  2. CP_MACCP  - Macintosh code page.

  3. CP_OEMCP  - OEM code page.

  4. CP_SYMBOL  - Windows 2000/XP: Symbol code page (42).

  5. CP_THREAD_ACP  - Windows 2000/XP: The current thread's ANSI code page.

  6. CP_UTF7 - Windows 98/Me, Windows NT 4.0 and later: Translate using UTF-7.

  7. CP_UTF8 - Windows 98/Me, Windows NT 4.0 and later: Translate using UTF-8.

 

dwFlags - [in] Indicates whether to translate to pre-composed or composite wide characters (if a composite form exists), whether to use glyph characters in place of control characters, and how to deal with invalid characters.  You can specify a combination of the following flag constants.

  1. MB_PRECOMPOSED  - Always use pre-composed characters, that is, characters in which a base character and a non-spacing character have a single character value.  This is the default translation option.  Cannot be used with MB_COMPOSITE.

  2. MB_COMPOSITE  - Always use composite characters, that is, characters in which a base character and a non-spacing character have different character values.  Cannot be used with MB_PRECOMPOSED.

  3. MB_ERR_INVALID_CHARS  - If the function encounters an invalid input character, it fails and GetLastError() returns ERROR_NO_UNICODE_TRANSLATION.

  4. MB_USEGLYPHCHARS  - Use glyph characters instead of control characters.

 

A composite character consists of a base character and a non-spacing character, each having different character values.  A pre-composed character has a single character value for a base/non-spacing character combination.  In the character è, the e is the base character and the accent grave mark is the non-spacing character.

The function's default behavior is to translate to the pre-composed form.  If a pre-composed form does not exist, the function attempts to translate to a composite form.  The flags MB_PRECOMPOSED and MB_COMPOSITE are mutually exclusive.  The MB_USEGLYPHCHARS and MB_ERR_INVALID_CHARS flags can be set regardless of the state of the other flags.

For the following code pages, dwFlags must be zero; otherwise the function fails with ERROR_INVALID_FLAGS:

  50220, 50221, 50222, 50225, 50227, 50229, 52936, 54936, 57002 through 57011, 65000 (UTF-7), 65001 (UTF-8), 42 (Symbol)

 

Windows XP and later: MB_ERR_INVALID_CHARS is the only dwFlags value supported by Code page 65001 (UTF-8).

lpMultiByteStr - [in] Points to the character string to be converted.

cbMultiByte - [in] Specifies the size in bytes of the string pointed to by the lpMultiByteStr parameter, or it can be -1 if the string is null terminated.  Note that if cbMultiByte is 0, the function fails.

If this parameter is -1, the function processes the entire input string including the null terminator.  The resulting wide character string therefore has a null terminator, and the returned length includes the null terminator.

If this parameter is a positive integer, the function processes exactly the specified number of bytes.  If the given length does not include a null terminator then the resulting wide character string will not be null terminated, and the returned length does not include a null terminator.

lpWideCharStr - [out] Points to a buffer that receives the translated string.

cchWideChar - [in] Specifies the size, in wide characters, of the buffer pointed to by the lpWideCharStr parameter.  If this value is zero, the function returns the required buffer size, in wide characters, and makes no use of the lpWideCharStr buffer.

Return value

If the function succeeds and cchWideChar is nonzero, the return value is the number of wide characters written to the buffer pointed to by lpWideCharStr.

If the function succeeds and cchWideChar is zero, the return value is the required size, in wide characters, for a buffer that can receive the translated string.

If dwFlags is zero, the input string is UTF-8, and it contains invalid characters, the function reports ERROR_NO_UNICODE_TRANSLATION.

If the function fails, the return value is zero.  To get extended error information, call GetLastError(), which may return one of the following error codes:

  1. ERROR_INSUFFICIENT_BUFFER

  2. ERROR_INVALID_FLAGS

  3. ERROR_INVALID_PARAMETER

  4. ERROR_NO_UNICODE_TRANSLATION

Include file

<windows.h>

 

Table 4:  MultiByteToWideChar() information.

An invalid character in the source string is either:

  1. A character that is not the default character in the source string but translates to the default character when MB_ERR_INVALID_CHARS is not set, or

  2. For DBCS strings, a character that has a lead byte but no valid trailing byte.

When an invalid character is found and MB_ERR_INVALID_CHARS is set, the function returns 0 and GetLastError() returns ERROR_NO_UNICODE_TRANSLATION.

Item

Description

Function

WideCharToMultiByte().

Use

Maps a wide-character string to a new character string.  The new character string is not necessarily from a multibyte character set.

Prototype

int WideCharToMultiByte(

  UINT CodePage,                      // code page

  DWORD dwFlags,                   // performance and mapping flags

  LPCWSTR lpWideCharStr,    // wide-character string

  int cchWideChar,                     // number of chars in string.

  LPSTR lpMultiByteStr,            // buffer for new string

  int cbMultiByte,                         // size of buffer

  LPCSTR lpDefaultChar,         // default for unmappable chars

  LPBOOL lpUsedDefaultChar);  // set when default char used

Parameters

CodePage - [in] Specifies the code page used to perform the conversion.  This parameter can be given the value of any code page that is installed or available in the system.  For a list of code pages, check MSDN documentation for Code Page Identifiers provided at the end of this Module.  You can also specify one of the following values:

  1. CP_ACP - ANSI code page.

  2. CP_MACCP - Macintosh code page.

  3. CP_OEMCP - OEM code page.

  4. CP_SYMBOL - Windows 2000/XP: Symbol code page (42).

  5. CP_THREAD_ACP - Windows 2000/XP: Current thread's ANSI code page.

  6. CP_UTF7 - Windows 98/Me, Windows NT 4.0 and later: Translate using UTF-7.  When this is set, lpDefaultChar and lpUsedDefaultChar must be NULL.

  7. CP_UTF8 - Windows 98/Me, Windows NT 4.0 and later: Translate using UTF-8. When this is set, dwFlags must be zero and both lpDefaultChar and lpUsedDefaultChar must be NULL.

 

dwFlags - [in] Specifies the handling of unmapped characters.  The function performs more quickly when none of these flags is set.  The following flag constants are defined.

 

  1. WC_NO_BEST_FIT_CHARS - Windows 98/Me and Windows 2000/XP: Any Unicode characters that do not translate directly to multibyte equivalents are translated to the default character (see the lpDefaultChar parameter).  In other words, if translating from Unicode to multibyte and back to Unicode again does not yield the exact same Unicode character, the default character is used.  This flag can be used by itself or in combination with the other dwFlags options.

  2. WC_COMPOSITECHECK - Convert composite characters to pre-composed characters.

  3. WC_DISCARDNS - Discard non-spacing characters during conversion.

  4. WC_SEPCHARS - Generate separate characters during conversion.  This is the default conversion behavior.

  5. WC_DEFAULTCHAR - Replace exceptions with the default character during conversion.

 

When WC_COMPOSITECHECK is specified, the function converts composite characters to pre-composed characters.  A composite character consists of a base character and a non-spacing character, each having different character values.  A pre-composed character has a single character value for a base/non-spacing character combination.  In the character è, the e is the base character, and the accent grave mark is the non-spacing character.

When an application specifies WC_COMPOSITECHECK, it can use the last three flags in this list (WC_DISCARDNS, WC_SEPCHARS, and WC_DEFAULTCHAR) to customize the conversion to pre-composed characters.  These flags determine the function's behavior when there is no pre-composed mapping for a base/non-spacing character combination in a wide-character string.  These last three flags can only be used if the WC_COMPOSITECHECK flag is set.  The function's default behavior is to generate separate characters (WC_SEPCHARS) for unmapped composite characters.

For the following code pages, dwFlags must be zero; otherwise the function fails with ERROR_INVALID_FLAGS:

  50220, 50221, 50222, 50225, 50227, 50229, 52936, 54936, 57002 through 57011, 65000 (UTF-7), 65001 (UTF-8), 42 (Symbol)

 

lpWideCharStr - [in] Points to the wide-character string to be converted.

cchWideChar - [in] Specifies the number of wide characters in the string pointed to by the lpWideCharStr parameter.  If this value is -1, the string is assumed to be null-terminated and the length is calculated automatically.  The length will include the null-terminator.  Note that if cchWideChar is zero the function fails.

lpMultiByteStr - [out] Points to the buffer to receive the translated string.

cbMultiByte - [in] Specifies the size, in bytes, of the buffer pointed to by the lpMultiByteStr parameter. If this value is zero, the function returns the number of bytes required for the buffer.  In this case, the lpMultiByteStr buffer is not used.

lpDefaultChar - [in] Points to the character used if a wide character cannot be represented in the specified code page.  If this parameter is NULL, a system default value is used. The function is faster when both lpDefaultChar and lpUsedDefaultChar are NULL. For the code pages mentioned in dwFlags, lpDefaultChar must be NULL; otherwise the function fails with ERROR_INVALID_PARAMETER.

lpUsedDefaultChar - [out] Points to a flag that indicates whether a default character was used.  The function sets the flag to TRUE if one or more wide characters in the source string cannot be represented in the specified code page.  Otherwise, the flag is set to FALSE.  This parameter may be NULL.  The function is faster when both lpDefaultChar and lpUsedDefaultChar are NULL.  For the code pages mentioned in dwFlags, lpUsedDefaultChar must be NULL; otherwise the function fails with ERROR_INVALID_PARAMETER.

Return value

If the function succeeds and cbMultiByte is nonzero, the return value is the number of bytes written to the buffer pointed to by lpMultiByteStr.  The number includes the byte for the null terminator when the input length includes it (for example, when cchWideChar is -1).

If the function succeeds and cbMultiByte is zero, the return value is the required size, in bytes, for a buffer that can receive the translated string.

If the function fails, the return value is zero.  To get extended error information, call GetLastError(), which may return one of the following error codes:

  1. ERROR_INSUFFICIENT_BUFFER

  2. ERROR_INVALID_FLAGS

  3. ERROR_INVALID_PARAMETER

Include file

<windows.h>

 

Table 5:  WideCharToMultiByte() information.

 

lpDefaultChar | lpUsedDefaultChar | Result
NULL | NULL | No default checking.  This is the most efficient way to use this function.
non-NULL | NULL | Uses the specified default character, but does not set lpUsedDefaultChar.
NULL | non-NULL | Uses the system default character and sets lpUsedDefaultChar if necessary.
non-NULL | non-NULL | Uses the specified default character and sets lpUsedDefaultChar if necessary.

 

Table 6:  lpDefaultChar and lpUsedDefaultChar combination behaviors.

 

 

 

 

 

 

 

 

 

 

 

 

 

Further reading and digging:

 

  1. Check the best selling C, C++ and Windows books at Amazon.com.

  2. Microsoft Visual C++, online MSDN.

  3. MSDN Library.

  4. For more program examples, refer to Windows Users & Groups Win32 programming (Implementation).

  5. The structure, enum, union and typedef story can be found in C/C++ struct, enum, union & typedef.

  6. For the Unicode and character set reference that contains functions, structures, macros and constants: Unicode and character set reference (MSDN).

  7. The notation used in MSDN is Hungarian Notation rather than CamelCase; it is discussed in Windows programming notations.

  8. Windows data type information is in Windows data types used in Win32 programming.