MODULE G1

CHARACTER SETS

ANSI, Multibyte, Unicode and Localization 2

 

 

 

What is in this Module?

  1. Locales and Code Pages Stories

  2. Locale Category

  3. Other Locale-Dependent functions

  4. setlocale() function

  5. Standard C++ Library for Locale

  6. <locale>

  7. <locale> Members

  8. Functions

  9. Classes

 

 

Locales and Code Pages Stories

  • A locale provides the local conventions and language for a particular geographical region.  A given language may be spoken in more than one Country/Region; for example, Portuguese is spoken in Brazil as well as in Portugal.

  • Conversely, a Country/Region may have more than one official language.  For example, Canada has two: English and French.  Thus, Canada has two distinct locales: Canadian-English and Canadian-French.  Some locale-dependent categories include the formatting of dates and the display format for monetary values.

  • The language determines the text and data formatting conventions, while the Country/Region determines the national conventions.

  • Every language has a unique mapping, represented by code pages, which includes characters other than those in the alphabet (such as punctuation marks and numbers).  A code page is a character set and is related to the language.

  • As such, a locale is a unique combination of language, Country/Region, and code page.  The locale and code page setting can be changed at run time by calling the setlocale() function.

  • Different languages may use different code pages.  For example, the ANSI code page 1252 is used for English and most European languages, and the ANSI code page 932 is used for Japanese Kanji.  Virtually all code pages share the ASCII character set for the lowest 128 characters (0x00 to 0x7F).

  • Any single-byte code page can be represented in a table (with 256 entries) as a mapping of byte values to characters (including numbers and punctuation marks), or glyphs.

  • Any multibyte code page can also be represented as a very large table (with 64K entries) of double-byte values to characters.  In practice, however, it is usually represented as a table for the first 256 (single-byte) characters and as ranges for the double-byte values.

  • The C run-time library has two types of internal code pages: locale (setlocale() function) and multibyte (_setmbcp() function).

  • You can change the current code page during program execution.  Also, the run-time library may obtain and use the value of the operating system code page.  In Windows 2000, the operating system code page is the "system default ANSI" code page.  This code page is constant for the duration of the program's execution.

  • When the locale code page changes, the behavior of the locale-dependent set of functions changes to that dictated by the chosen code page.

  • By default, all locale-dependent functions begin execution with a locale code page unique to the "C" locale.  You can change the internal locale code page (as well as other locale-specific properties) by calling the setlocale() function.

  • For example, a call to setlocale(LC_ALL, "") will set the locale to that indicated by the operating system user locale.

  • Similarly, when the multibyte code page changes, the behavior of the multibyte functions changes to that dictated by the chosen code page.

  • By default, all multibyte functions begin execution with a multibyte code page corresponding to the operating system's default code page.  You can change the internal multibyte code page by calling the _setmbcp() function.

  • The C run-time function setlocale() sets, changes, or queries some or all of the current program's locale information.  The _wsetlocale() routine is a wide-character version of setlocale(); the arguments and return values of _wsetlocale() are wide-character strings.

  • To use setlocale(), you must include the locale.h header file as shown below.  setlocale() is part of Standard C (ISO/IEC).  _wsetlocale() is declared in locale.h or wchar.h, and _setmbcp() requires the mbctype.h header file.

#include <locale.h>

#include <mbctype.h>
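The startup behavior described above can be checked with a short sketch.  It uses only the Standard C setlocale() call; the helper names is_c_locale() and report_locale_switch() are made up for this illustration, and the native locale name that gets printed depends entirely on the host system.

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if the program is currently running in the "C" locale. */
int is_c_locale(void)
{
    /* A NULL locale argument queries the setting without changing it. */
    const char *current = setlocale(LC_ALL, NULL);

    return current != NULL && strcmp(current, "C") == 0;
}

/* Report the startup locale, then switch to the user's native locale. */
void report_locale_switch(void)
{
    const char *native;

    printf("Starts in the \"C\" locale: %s\n", is_c_locale() ? "yes" : "no");

    /* An empty string asks for the locale configured in the operating system. */
    native = setlocale(LC_ALL, "");
    if (native == NULL)
        printf("The native locale is not available; nothing was changed.\n");
    else
        printf("Native locale: %s\n", native);
}
```

Calling report_locale_switch() at the top of main() is enough to see which locale the operating system hands to the program.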

Locale Category

| Locale category | Parts of program affected |
|---|---|
| LC_ALL | All locale-specific behavior (all categories). |
| LC_COLLATE | Behavior of strcoll() and strxfrm() functions. |
| LC_CTYPE | Behavior of character-handling functions (except isdigit(), isxdigit(), mbstowcs(), and mbtowc(), which are unaffected). |
| LC_MAX | Same as LC_TIME. |
| LC_MIN | Same as LC_ALL. |
| LC_MONETARY | Monetary formatting information returned by the localeconv() function. |
| LC_NUMERIC | Decimal-point character for formatted output routines (for example, printf()), data conversion routines, and non-monetary formatting information returned by the localeconv() function. |
| LC_TIME | Behavior of strftime() function. |

 

Table 7:  Locale category.
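One way to see the categories of Table 7 in action is to query each of them in turn.  The following is a minimal sketch that uses only the five Standard C categories (LC_MIN and LC_MAX are left out); describe_categories() is a hypothetical helper name, not a library function.

```c
#include <locale.h>
#include <stdio.h>

/* Write "LC_COLLATE=...;LC_CTYPE=...;..." for the five standard categories
   into out.  Returns 0 on success, -1 if a query fails or out is too small. */
int describe_categories(char *out, size_t size)
{
    static const struct { int id; const char *name; } cats[] = {
        { LC_COLLATE,  "LC_COLLATE"  },
        { LC_CTYPE,    "LC_CTYPE"    },
        { LC_MONETARY, "LC_MONETARY" },
        { LC_NUMERIC,  "LC_NUMERIC"  },
        { LC_TIME,     "LC_TIME"     }
    };
    size_t used = 0;
    size_t i;

    for (i = 0; i < sizeof cats / sizeof cats[0]; i++) {
        /* NULL means "query only"; the locale is not modified. */
        const char *value = setlocale(cats[i].id, NULL);
        int n;

        if (value == NULL)
            return -1;
        n = snprintf(out + used, size - used, "%s%s=%s",
                     i ? ";" : "", cats[i].name, value);
        if (n < 0 || (size_t)n >= size - used)
            return -1;              /* output would have been truncated */
        used += (size_t)n;
    }
    return 0;
}
```

Because every category is queried separately, the helper keeps working after a call such as setlocale(LC_MONETARY, ...) has made the categories differ from one another.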

 

Other Locale-Dependent functions

| Function/constant | Use | setlocale() category dependence |
|---|---|---|
| atof(), atoi(), atol() | Convert a character string to a floating-point, integer, or long integer value, respectively. | LC_NUMERIC |
| is, isw routines | Test a given integer for a particular condition.  The functions include: isalnum(), iswalnum(), isalpha(), iswalpha(), __isascii(), iswascii(), _isatty(), iscntrl(), iswcntrl(), __iscsym(), __iscsymf(), isdigit(), iswdigit(), isgraph(), iswgraph(), isleadbyte(), islower(), iswlower(), isprint(), iswprint(), ispunct(), iswpunct(), isspace(), iswspace(), isupper(), iswupper(), iswctype(), isxdigit(), iswxdigit(). | LC_CTYPE |
| isleadbyte() | Test for lead byte. | LC_CTYPE |
| localeconv() | Read appropriate values for formatting numeric quantities. | LC_MONETARY, LC_NUMERIC |
| MB_CUR_MAX | Maximum length in bytes of any multibyte character in the current locale (macro defined in stdlib.h). | LC_CTYPE |
| _mbccpy() | Copy one multibyte character. | LC_CTYPE [1] |
| _mbclen() | Return length, in bytes, of given multibyte character. | LC_CTYPE [1] |
| mblen() | Validate and return number of bytes in multibyte character. | LC_CTYPE [1] |
| _mbstrlen() | For multibyte-character strings: validate each character in string; return string length. | LC_CTYPE [1] |
| mbstowcs() | Convert sequence of multibyte characters to corresponding sequence of wide characters. | LC_CTYPE [1] |
| mbtowc() | Convert multibyte character to corresponding wide character. | LC_CTYPE [1] |
| printf() | Write formatted output. | LC_NUMERIC (determines radix character output) |
| scanf() | Read formatted input. | LC_NUMERIC (determines radix character recognition) |
| setlocale(), _wsetlocale() | Select locale for program. | Not applicable |
| strcoll(), wcscoll() | Compare characters of two strings. | LC_COLLATE |
| _stricmp(), _wcsicmp(), _mbsicmp() | Compare two strings without regard to case. | LC_CTYPE [1] |
| _stricoll(), _wcsicoll() | Compare characters of two strings (case insensitive). | LC_COLLATE |
| _strncoll(), _wcsncoll() | Compare first n characters of two strings. | LC_COLLATE |
| _strnicmp(), _wcsnicmp(), _mbsnicmp() | Compare first n characters of two strings without regard to case. | LC_CTYPE [1] |
| _strnicoll(), _wcsnicoll() | Compare first n characters of two strings (case insensitive). | LC_COLLATE |
| strftime(), wcsftime() | Format date and time value according to supplied format argument. | LC_TIME |
| _strlwr() | Convert, in place, each uppercase letter in given string to lowercase. | LC_CTYPE |
| strtod(), wcstod(), strtol(), wcstol(), strtoul(), wcstoul() | Convert character string to double, long, or unsigned long value. | LC_NUMERIC (determines radix character recognition) |
| _strupr() | Convert, in place, each lowercase letter in string to uppercase. | LC_CTYPE |
| strxfrm(), wcsxfrm() | Transform string into collated form according to locale. | LC_COLLATE |
| tolower(), towlower() | Convert given character to corresponding lowercase character. | LC_CTYPE |
| toupper(), towupper() | Convert given character to corresponding uppercase character. | LC_CTYPE |
| wcstombs() | Convert sequence of wide characters to corresponding sequence of multibyte characters. | LC_CTYPE |
| wctomb() | Convert wide character to corresponding multibyte character. | LC_CTYPE |
| _wtoi(), _wtol() | Convert wide-character string to int or long. | LC_NUMERIC |

Table 8:  Locale-dependent functions.

[1] For multibyte routines, the multibyte code page must be equivalent to the locale set with setlocale().  Calling _setmbcp() with an argument of _MB_CP_LOCALE makes the multibyte code page the same as the setlocale() code page.
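As an illustration of one row of the table, the sketch below sorts strings with strcoll(), so the resulting order follows whatever LC_COLLATE locale is in effect.  The helper names are invented for this example, and only Standard C calls are used.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* qsort() comparator: order two strings by the LC_COLLATE rules in effect. */
static int compare_by_collation(const void *a, const void *b)
{
    const char *const *left = a;
    const char *const *right = b;

    return strcoll(*left, *right);
}

/* Sort an array of string pointers under the named LC_COLLATE locale.
   Returns 0 on success, or -1 if the locale name is rejected, in which
   case the array is left untouched. */
int sort_with_locale(const char **items, size_t count, const char *locale_name)
{
    if (setlocale(LC_COLLATE, locale_name) == NULL)
        return -1;
    qsort(items, count, sizeof items[0], compare_by_collation);
    return 0;
}
```

With the "C" locale, strcoll() typically compares by character code, so "Apple" sorts before "banana" because uppercase letters come first in ASCII; a language-specific locale may interleave the cases differently.  Note that the helper leaves LC_COLLATE set to the requested locale.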

setlocale() function

Item

Description

Function

setlocale(), _wsetlocale().

Use

To change or query some or all of the current program locale information.

Prototype

char *setlocale(int category, const char *locale);

wchar_t *_wsetlocale(int category, const wchar_t *locale);

Parameters

category - Category affected by locale.

locale - Locale name.

Return value

If a valid locale and category are given, returns a pointer to the string associated with the specified locale and category.  If the locale or category is invalid, returns a null pointer and the current locale settings of the program are not changed.  For example, the call:

setlocale(LC_ALL, "English");

Sets all categories, returning only the string English_USA.1252.  If all categories are not explicitly set by a call to setlocale(), the function returns a string indicating the current setting of each of the categories, separated by semicolons.  If the locale argument is a null pointer, setlocale() returns a pointer to the string associated with the category of the program's locale; the program's current locale setting is not changed.

The null pointer is a special directive that tells setlocale() to query rather than set the international environment.  For example, the sequence of calls:

// Set all categories and return "English_USA.1252"

setlocale( LC_ALL, "English" );

// Set only the LC_MONETARY category and return "French_France.1252"

setlocale( LC_MONETARY, "French" );

setlocale( LC_ALL, NULL );

Will return the single string:

LC_COLLATE=English_USA.1252;LC_CTYPE=English_USA.1252;LC_MONETARY=French_France.1252;LC_NUMERIC=English_USA.1252;LC_TIME=English_USA.1252

Which is the string associated with the LC_ALL category.  You can use the string pointer returned by setlocale() in subsequent calls to restore that part of the program's locale information, assuming that your program does not alter the pointer or the string.  Later calls to setlocale() overwrite the string; you can use _strdup() to save a specific locale string.

Include file

<locale.h> for setlocale(); <locale.h> or <wchar.h> for _wsetlocale().

 

Table 9:  setlocale(), _wsetlocale() information.
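The advice about saving the returned string before it is overwritten could look like the sketch below.  It copies the string with malloc() rather than the Microsoft-specific _strdup() so that it stays within Standard C; save_locale() and parse_double_c() are hypothetical helper names.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Return a heap copy of the current locale string for `category`, or NULL
   on failure.  The caller releases it with free(). */
char *save_locale(int category)
{
    const char *current = setlocale(category, NULL);   /* query only */
    char *copy;

    if (current == NULL)
        return NULL;
    copy = malloc(strlen(current) + 1);
    if (copy != NULL)
        strcpy(copy, current);
    return copy;
}

/* Parse a number using "C" rules ('.' as the radix character), then put
   the previous LC_NUMERIC setting back. */
double parse_double_c(const char *text)
{
    char *saved = save_locale(LC_NUMERIC);
    double value;

    setlocale(LC_NUMERIC, "C");
    value = strtod(text, NULL);

    if (saved != NULL) {
        setlocale(LC_NUMERIC, saved);   /* restore the earlier setting */
        free(saved);
    }
    return value;
}
```

The copy matters because the pointer returned by the first setlocale() call may refer to storage that the second call reuses.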

 

  • locale refers to the locality (country/region and language) for which you can customize certain aspects of your program.

  • Some locale-dependent categories include the formatting of dates and the display format for monetary values.

  • If you set locale to the default string for a language with multiple forms supported on your computer, you should check the setlocale() return value to see which language is in effect.

  • For example, using "chinese" could result in a return value of chinese-simplified or chinese-traditional.

  • _wsetlocale() is a wide-character version of setlocale().  The locale argument and return value of _wsetlocale() are wide-character strings.  _wsetlocale() and setlocale() behave identically otherwise.

  • The category argument specifies the parts of a program's locale information that are affected.  The macros used for category and the parts of the program they affect are as follows:

| Category macro | Parts of program affected |
|---|---|
| LC_ALL | All categories, as listed below. |
| LC_COLLATE | The strcoll(), _stricoll(), wcscoll(), _wcsicoll(), strxfrm(), _strncoll(), _strnicoll(), _wcsncoll(), _wcsnicoll(), and wcsxfrm() functions. |
| LC_CTYPE | The character-handling functions except isdigit(), isxdigit(), mbstowcs() and mbtowc(), which are unaffected. |
| LC_MONETARY | Monetary-formatting information returned by the localeconv() function. |
| LC_NUMERIC | Decimal-point character for the formatted output routines (such as printf()), for the data-conversion routines, and for the non-monetary-formatting information returned by localeconv().  In addition to the decimal-point character, LC_NUMERIC also sets the thousands separator and the grouping control string returned by localeconv(). |
| LC_TIME | The strftime() and wcsftime() functions. |

 

Table 10:  category macros.
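The LC_NUMERIC fields mentioned in the table can be inspected through localeconv().  This is a small sketch using Standard C only; show_numeric_conventions() is a name chosen for the illustration.

```c
#include <locale.h>
#include <stdio.h>

/* Select the named LC_NUMERIC locale and print the numeric conventions
   that localeconv() reports for it.  Returns 0 on success, -1 if the
   locale name is rejected. */
int show_numeric_conventions(const char *locale_name)
{
    const struct lconv *conv;

    if (setlocale(LC_NUMERIC, locale_name) == NULL)
        return -1;

    conv = localeconv();
    printf("decimal point:       \"%s\"\n", conv->decimal_point);
    printf("thousands separator: \"%s\"\n", conv->thousands_sep);
    /* grouping is a string of small integers; the first is the size of
       the group nearest the decimal point (0 when no grouping is defined). */
    printf("first group size:    %d\n", conv->grouping[0]);
    return 0;
}
```

In the "C" locale the decimal point is "." and both the thousands separator and the grouping string are empty, which is why printf() never inserts digit separators by default.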

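At program startup, the equivalent of the following call is executed:

setlocale(LC_ALL, "C");

The locale argument takes the following form:

locale :: "lang[_country_region[.code_page]]" | ".code_page" | "" | NULL

Passing NULL as the locale argument queries the current setting without changing it:

setlocale(LC_ALL, NULL);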
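The locale-string pattern above can be assembled from its parts.  The sketch below is only an illustration of the pattern as written here: build_locale_name() and has_text() are hypothetical helpers, and they do not check whether the run-time library actually supports the resulting name.

```c
#include <stdio.h>

/* Return 1 if s is a non-empty string. */
static int has_text(const char *s)
{
    return s != NULL && s[0] != '\0';
}

/* Compose "lang", "lang_country", "lang_country.code_page" or ".code_page"
   into out.  Returns 0 on success, -1 if the combination does not fit the
   pattern or out is too small. */
int build_locale_name(char *out, size_t size, const char *lang,
                      const char *country, const char *code_page)
{
    int n;

    if (has_text(lang) && has_text(country) && has_text(code_page))
        n = snprintf(out, size, "%s_%s.%s", lang, country, code_page);
    else if (has_text(lang) && has_text(country))
        n = snprintf(out, size, "%s_%s", lang, country);
    else if (has_text(lang) && !has_text(code_page))
        n = snprintf(out, size, "%s", lang);
    else if (!has_text(lang) && !has_text(country) && has_text(code_page))
        n = snprintf(out, size, ".%s", code_page);
    else
        return -1;      /* combination not covered by the pattern */

    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

For example, the parts "French", "Canada" and "1252" produce "French_Canada.1252", which can then be passed to setlocale().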

Examples

setlocale( LC_ALL, "" );

Sets the locale to the default, which is the user-default ANSI code page obtained from the operating system.

setlocale( LC_ALL, ".OCP" );

Explicitly sets the locale to the current OEM code page obtained from the operating system.

setlocale( LC_ALL, ".ACP" );

Sets the locale to the ANSI code page obtained from the operating system.

setlocale( LC_ALL, "[lang_ctry]" );

Sets the locale to the language and country/region indicated, using the default code page obtained from the host operating system.

setlocale( LC_ALL, "[lang_ctry.cp]" );

Sets the locale to the language, country/region, and code page indicated in the [lang_ctry.cp] string. You can use various combinations of language, country/region, and code page.  For example:

 

setlocale( LC_ALL, "French_Canada.1252" );

// Set code page to French Canada ANSI default

setlocale( LC_ALL, "French_Canada.ACP" );

// Set code page to French Canada OEM default

setlocale( LC_ALL, "French_Canada.OCP" );

setlocale( LC_ALL, "[lang]" );

Sets the locale to the language indicated, using the default country/region for the language specified, and the user-default ANSI code page for that country/region as obtained from the host operating system.  For example, the following two calls to setlocale() are functionally equivalent:

 

setlocale( LC_ALL, "English" );

setlocale( LC_ALL, "English_United States.1252" );

setlocale( LC_ALL, "[.code_page]" );

Sets the code page to the value indicated, using the default country/region and language (as defined by the host operating system) for the specified code page. The category must be either LC_ALL or LC_CTYPE to effect a change of code page.  For example, if the default country/region and language of the host operating system are "United States" and "English," the following two calls to setlocale() are functionally equivalent:

 

setlocale( LC_ALL, ".1252" );

setlocale( LC_ALL, "English_United States.1252");

 

Table 11:  LC_ALL category examples.
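Because any of the names in Table 11 may be unavailable on a given machine, it can help to try several candidates and check each return value.  The following is a minimal sketch; first_available_locale() is a hypothetical helper, not part of the run-time library.

```c
#include <locale.h>
#include <stddef.h>

/* Try each candidate name in order for `category`.  Returns the first
   candidate the run-time library accepts (the locale stays set to it),
   or NULL if every candidate is rejected. */
const char *first_available_locale(int category,
                                   const char *const *candidates,
                                   size_t count)
{
    size_t i;

    for (i = 0; i < count; i++) {
        /* setlocale() returns NULL and changes nothing when the name is invalid. */
        if (setlocale(category, candidates[i]) != NULL)
            return candidates[i];
    }
    return NULL;
}
```

A candidate list such as "French_Canada.1252", "French_Canada", "French", "C" moves from the most specific request to the least specific one; putting "C" last gives a fallback that every Standard C implementation accepts.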

/* Sets the current locale to "italian" and then "french" using the setlocale()
   function, then returns to the "C" locale, printing the long date each time. */
#include <stdio.h>
#include <locale.h>
#include <time.h>

int main(void)
{
    time_t ltime;
    struct tm *testime;
    char locstr[100];

    /* get the current time once; it is formatted three times below */
    time(&ltime);
    testime = gmtime(&ltime);

    /* set the locale to Italian */
    setlocale(LC_ALL, "italian");

    /* %#x is the long date representation, appropriate to the current locale
       (the # flag is a Microsoft extension to strftime()) */
    if(!strftime(locstr, sizeof(locstr), "%#x", testime))
        printf("strftime failed!\n");
    else
        printf("In Italian locale, strftime returns \"%s\"\n", locstr);

    /* set the locale to French */
    setlocale(LC_ALL, "french");

    if(!strftime(locstr, sizeof(locstr), "%#x", testime))
        printf("strftime failed!\n");
    else
        printf("In French locale, strftime returns \"%s\"\n", locstr);

    /* set the locale back to the default "C" environment */
    setlocale(LC_ALL, "C");

    printf("Back to default...\n");
    if(!strftime(locstr, sizeof(locstr), "%#x", testime))
        printf("strftime failed!\n");
    else
        printf("In 'C' locale, strftime returns \"%s\"\n", locstr);

    return 0;
}

A sample output:

 

In Italian locale, strftime returns "sabato 18 giugno 2005"

In French locale, strftime returns "samedi 18 juin 2005"

Back to default...

In 'C' locale, strftime returns "Saturday, June 18, 2005"

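The %#x format used in the program is specific to the Microsoft run-time library.  As a rough portable counterpart, the sketch below spells out a long date with the Standard C specifiers %A, %B, %d and %Y, which take their weekday and month names from the current LC_TIME locale.  format_long_date() is a hypothetical helper, and the fixed field order is an assumption of this example rather than something the locale chooses.

```c
#include <time.h>

/* Format a calendar date as "weekday, month day, year" using the LC_TIME
   names currently in effect.  Returns the number of characters written,
   or 0 on failure. */
size_t format_long_date(char *out, size_t size, int year, int month, int day)
{
    struct tm when = {0};

    when.tm_year  = year - 1900;   /* years since 1900 */
    when.tm_mon   = month - 1;     /* months are numbered 0 to 11 */
    when.tm_mday  = day;
    when.tm_hour  = 12;            /* midday keeps clear of DST changes */
    when.tm_isdst = -1;            /* let the library decide about DST */

    /* mktime() normalizes the structure and fills in tm_wday. */
    if (mktime(&when) == (time_t)-1)
        return 0;

    return strftime(out, size, "%A, %B %d, %Y", &when);
}
```

In the "C" locale, format_long_date(buf, sizeof buf, 2005, 6, 18) produces "Saturday, June 18, 2005", matching the last line of the sample output; after a successful setlocale(LC_TIME, ...) call the same code prints the names of that locale instead.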

 

 

Standard C++ Library for Locale

<locale>

#include <locale>

<locale> Members

 

Functions

| Function | Description |
|---|---|
| has_facet() | Tests if a particular facet is stored in a specified locale. |
| isalnum() | Tests whether an element in a locale is an alphabetic or a numeric character. |
| isalpha() | Tests whether an element in a locale is an alphabetic character. |
| iscntrl() | Tests whether an element in a locale is a control character. |
| isdigit() | Tests whether an element in a locale is a numeric character. |
| isgraph() | Tests whether an element in a locale is an alphanumeric or punctuation character. |
| islower() | Tests whether an element in a locale is lower case. |
| isprint() | Tests whether an element in a locale is a printable character. |
| ispunct() | Tests whether an element in a locale is a punctuation character. |
| isspace() | Tests whether an element in a locale is a whitespace character. |
| isupper() | Tests whether an element in a locale is upper case. |
| isxdigit() | Tests whether an element in a locale is a character used to represent a hexadecimal number. |
| tolower() | Converts a character to lower case. |
| toupper() | Converts a character to upper case. |
| use_facet() | Returns a reference to a facet of a specified type stored in a locale. |

 

Table 12:  Standard C++ <locale> member functions.

 

Classes

| Class | Description |
|---|---|
| codecvt | A template class that provides a facet used to convert between internal and external character encodings. |
| codecvt_base | A base class for the codecvt class that is used to define an enumeration type referred to as result, used as the return type for the facet member functions to indicate the result of a conversion. |
| codecvt_byname | A derived template class that describes an object that can serve as a codecvt facet of a given locale, enabling the retrieval of information specific to a cultural area concerning conversions. |
| collate | A template class that provides a facet that handles string sorting conventions. |
| collate_byname | A derived template class that describes an object that can serve as a collate facet of a given locale, enabling the retrieval of information specific to a cultural area concerning string sorting conventions. |
| ctype | A template class that provides a facet that is used to classify characters, convert between uppercase and lowercase, and convert between the native character set and the set used by the locale. |
| ctype<char> | A class that is an explicit specialization of template class ctype<CharType> to type char, describing an object that can serve as a locale facet to characterize various properties of a character of type char. |
| ctype_base | A base class for the ctype class that is used to define enumeration types used to classify or test characters either individually or within entire ranges. |
| ctype_byname | A derived template class that describes an object that can serve as a ctype facet of a given locale, enabling the classification of characters and conversion of characters between case and native and locale specified character sets. |
| locale | A class that describes a locale object that encapsulates culture-specific information as a set of facets that collectively define a specific localized environment. |
| messages | A template class that describes an object that can serve as a locale facet to retrieve localized messages from a catalog of internationalized messages for a given locale. |
| messages_base | A base class that describes an int type for the catalog of messages. |
| messages_byname | A derived template class that describes an object that can serve as a messages facet of a given locale, enabling the retrieval of localized messages. |
| money_base | A base class for the moneypunct class that defines the enumeration and structure used to describe the layout of a monetary field. |
| money_get | A template class that describes an object that can serve as a locale facet to control conversions of sequences of type CharType to monetary values. |
| money_put | A template class that describes an object that can serve as a locale facet to control conversions of monetary values to sequences of type CharType. |
| moneypunct | A template class that describes an object that can serve as a locale facet to describe the sequences of type CharType used to represent a monetary input field or a monetary output field. |
| moneypunct_byname | A derived template class that describes an object that can serve as a moneypunct facet of a given locale, enabling the formatting of monetary input or output fields. |
| num_get | A template class that describes an object that can serve as a locale facet to control conversions of sequences of type CharType to numeric values. |
| num_put | A template class that describes an object that can serve as a locale facet to control conversions of numeric values to sequences of type CharType. |
| numpunct | A template class that describes an object that can serve as a locale facet to describe the sequences of type CharType used to represent information about the formatting and punctuation of numeric and Boolean expressions. |
| numpunct_byname | A derived template class that describes an object that can serve as a numpunct facet of a given locale, enabling the formatting and punctuation of numeric and Boolean expressions. |
| time_base | A class that serves as a base class for facets of template class time_get, defining just the enumerated type dateorder and several constants of this type. |
| time_get | A template class that describes an object that can serve as a locale facet to control conversions of sequences of type CharType to time values. |
| time_get_byname | A derived template class that describes an object that can serve as a locale facet of type time_get<CharType, InputIterator>. |
| time_put | A template class that describes an object that can serve as a locale facet to control conversions of time values to sequences of type CharType. |
| time_put_byname | A derived template class that describes an object that can serve as a locale facet of type time_put<CharType, OutputIterator>. |

 

Table 13:  Standard C++ <locale> classes.

 

 

 

 

 

 

 

 

 

 

 

 
