Locale Support

A locale is a named group of properties that defines culture-specific conventions. Each locale is made up of one or more categories. Each category controls the behavior of a set of features. For example, the LC_MONETARY category contains information about how monetary values are formatted in some specific territory. The ISO and IEEE (POSIX) standards bodies have stated that a locale should include information such as the ordering of date components, the formatting of numbers, and the language preferred for message text.

PostgreSQL makes use of the locale-processing facilities provided by the host operating system. When you log in to your operating system, you are automatically assigned a locale. On a Linux host (and most Unix hosts), you can find your current locale using the locale command:

$ locale LANG=en_US LC_CTYPE="en_US" LC_NUMERIC="en_US" LC_TIME="en_US" LC_COLLATE="en_US" LC_MONETARY="en_US" LC_MESSAGES="en_US" LC_PAPER="en_US" LC_NAME="en_US" LC_ADDRESS="en_US" LC_TELEPHONE="en_US" LC_MEASUREMENT="en_US" LC_IDENTIFICATION="en_US" LC_ALL=

You can see that I am using a locale named en_US. Locale names are composed of multiple parts. The first component identifies a language. In my case, the language is en, meaning English. The second (optional) component identifies a country, region, or territory where the language is used. I am in the U.S., so my country code is set to US. You can think of en_US as meaning "English as spoken in the U.S.", as opposed to en_AU, which means "English as spoken in Australia." The third component of a locale name is an optional codeset. I'll talk more about codesets later in this chapter. Finally, a locale name may include modifiers, such as "@euro" to indicate that the locale uses the Euro for currency values.

Language IDs are usually two characters long, written in lowercase, and chosen from the ISO 639 list of country codes. Territories are usually two characters long, written in uppercase, and chosen from the ISO 3166 standard.

The POSIX (and ISO) standards define two special locales named C and POSIX. The C and POSIX locales are defined so that they can be used in many different locations.

Table 22.1 shows a few locale names taken from my Linux host.

Table 22.1. Sample Locale Names

Locale Name

Language

Region

Codeset

Modifier

sv_FI

Swedish

Finland

sv_FI@euro

Swedish

Finland

Euro is used in this locale

sv_FI.utf8

Swedish

Finland

UTF-8

sv_FI.utf8@euro

Swedish

Finland

UTF-8

Euro is used in this locale

sv_SE

Swedish

Sweden

v_SE.utf8

Swedish

Sweden

UTF-8

en_AU

English

Australia

en_AU.utf8

English

Australia

UTF-8

en_IE

English

Ireland

en_IE@euro

English

Ireland

Euro is used in this locale

en_IE.utf8

English

Ireland

UTF-8

en_IE.utf8@euro

English

Ireland

UTF-8

Euro is used in this locale

My Fedora Linux system defines 560 locales. Each locale is broken down into a set of categories. Most locale implementations define (at least) the categories shown in Table 22.2. Some environments define additional categories.

Table 22.2. Locale Information Categories

Category

Influences

Used By

LC_MESSAGES

Message formatting and message language

Client/Server

LC_MONETARY

Monetary value formatting

Server

LC_NUMERIC

Numeric value formatting

Server

LC_TIME

Date and time formatting

Not used

LC_CTYPE

Character classifications (uppercase, punctuation, and so on)

Server_

LC_COLLATE

Collating order for string values

Cluster

LC_ALL

All of the above

See all of the above

 

Enabling Locale Support

Starting with PostgreSQL version 7.3, locale support is automatically included in every build. If you're building a version of PostgreSQL older than 7.3, you must specify the --enable-locale option when you configure the source code.

If you want to see messages in a language other than English, you should enable NLS (National Language Support). To enable NLS support, specify --enable-nls when you configure the PostgreSQL source code:

$ ./configure --enable-nls

You choose a locale by setting one or more runtime configuration parameters. Like most configuration parameters, you can define the locale-related parameters in the postgresql.conf configuration file and you can modify them at runtime with the SET command. The locale-related configuration parameters are named LC_MESSAGES, LC_MONETARY, LC_NUMERIC, and LC_TIME. If you don't specify a value for a locale parameter, PostgreSQL looks for the corresponding environment variable.

If you specify a locale using environment variables, there are three levels of environment variables that you can define. At the bottom level, you can set the LANG environment variable to the locale that you want to use. For example, if you want all features to run in a French context unless overridden, set LANG=fr_FR. You can created a mixed by defining LC_MESSAGES, LC_MONETARY, LC_NUMERIC, LC_CTYPE, and/or LC_COLLATE. The LC_xxx environment variables override LANG. If you are working with a database that stores French values (monetary units perhaps), for example, you may still want to see PostgreSQL messages in English. In this case, you would set LANG=fr_FR and LC_MESSAGES=en_US. At the top level, LC_ALL overrides any other locale-related environment variables: If you want everything to run in German (as spoken in Germany), set LC_ALL=de_DE. Remember that PostgreSQL configuration parameters (specified in postgresql.conf or modified with a SET command) take precedence over environment variables.

Effects of Locale Support

Let's see what happens when you change locales.

The first category in Table 22.2, LC_MESSAGES, determines which language PostgreSQL uses when it displays a message. I've been running with LC_MESSAGES set to en_US when I run psql, so I see messages displayed in English:

$ psql -d movies Welcome to psql, the PostgreSQL interactive terminal. Type: copyright for distribution terms h for help with SQL commands ? for help on internal slash commands g or terminate with semicolon to execute query q to quit movies=#

Let's try setting LC_MESSAGES to fr_CA (French as spoken in Canada):

$ LC_MESSAGES=fr_CA psql -d movies Bienvenu à psql, l'interface interactif de PostgreSQL. Tapez: copyright pour l'information de copyright h pour l'aide-mémoire sur les commandes SQL ? pour l'aide-mémoire sur les commandes internes g ou point-virgule pour exécuter une requête q pour quitter movies=#

Voilà! The client messages are now in French.

It's important to remember that the client (psql) and server do not share an environment. In other words, if you set LC_MESSAGES=fr_CA before running psql, you're only affecting the psql clientmessages produced by the client are displayed in French, but messages produced by the server are produced in the server's locale. To see this behavior, try the following:

$ LC_MESSAGES=fr_CA psql -d movies Bienvenu à psql, l'interface interactif de PostgreSQL. Tapez: copyright pour l'information de copyright h pour l'aide-mémoire sur les commandes SQL ? pour l'aide-mémoire sur les commandes internes g ou point-virgule pour exécuter une requête q pour quitter movies=# SELECT junk ; ERROR: column "junk" does not exist

Notice that the greeting printed by the psql client appears in French, but the error message (which is printed by the server) appears in English. To change the server's LC_MESSAGES locale, define the LC_MESSAGES (or more typically, lc_messages) parameter in your server's postgresql.conf file. You can also change the server's LC_MESSAGES locale in the middle of a session using the SET command:

movies=# SELECT junk ; ERROR: column "junk" does not exist movies=# SET lc_messages = 'fr_CA'; SET movies=# SELECT junk ; ERREUR: la colonne ?junk? n'existe pas

When you execute a SET command, you are changing a parameter in your sessionany new sessions will inherit the parameter from the postgresql.conf file. You can only SET LC_MESSAGES if you are a superuser in the cluster that you're connected to.

Some locale properties affect the server, some affect the client, and a few are stored with the database cluster itself (refer to Table 22.2). The LC_MESSAGES category affects both the client and server because each can produce message text.

Now, let's try a few of the other categories.

The server uses the LC_MONETARY category to control the way in which monetary values are formatted. I've modified the customers table in my database to include a balance column (using the MONEY data type). Here is the new column, shown in the en_US locale:

movies=# SELECT * FROM customers; id | customer_name | phone | birth_date | balance ----+----------------------+----------+------------+------------ 4 | Wonderland, Alice N. | 555-1122 | 1969-03-05 | 1 | Jones, Henry | 555-1212 | 1970-10-10 | $10.00 2 | Rubin, William | 555-2211 | 1972-07-10 | $1,000.00 3 | Panky, Henry | 555-1221 | 1968-01-21 | $10,000.00 (4 rows)

Next, I'll use the SET command to change LC_MONETARY to fr_FR (French as spoken in France).

movies=# SET LC_MONETARY = 'fr_FR'; SET

Now, when I query the customers table, the monetary values are formatted using the fr_FR locale:

movies=# SELECT * FROM customers; id | customer_name | phone | birth_date | balance ----+----------------------+----------+------------+------------ 4 | Wonderland, Alice N. | 555-1122 | 1969-03-05 | 1 | Jones, Henry | 555-1212 | 1970-10-10 | EUR10,00 2 | Rubin, William | 555-2211 | 1972-07-10 | EUR1 000,00 3 | Panky, Henry | 555-1221 | 1968-01-21 | EUR10 000,00 (4 rows)

Notice that MONEY values are now formatted using French preferences.

The LC_NUMERIC category determines which characters will be used for grouping, the currency symbol, positive and negative signs, and the decimal point. Currently, LC_NUMERIC is used only by the TO_CHAR() function. The LC_NUMERIC category affects the only server.

PostgreSQL currently does not use the LC_TIME category (each date/time value can include an explicit time zone).

LC_CTYPE is consulted whenever PostgreSQL needs to categorize a character. The server locale determines which characters are considered uppercase, lowercase, numeric, punctuation, and so on. The most obvious uses of LC_CTYPE are the LOWER(), UPPER(), and INITCAP() string functions. LC_CTYPE is also consulted when the server evaluates regular expressions and the LIKE operator.

LC_COLLATE affects the result of an ORDER BY clause that sorts by a string value. LC_COLLATE also affects how an index that covers a string value is built. Setting LC_COLLATE ensures that strings are ordered properly for your locale.

Let's look at an example. Create two new database clusters and insert the same values into each one. The first database uses the French locale for collating:

$ PGDATA=/usr/local/locale_FR LC_COLLATE=fr_FR initdb ... Success. You can now start the database server using: postmaster -D /usr/local/locale_FR or pg_ctl -D /usr/local/locale_FR -l logfile start $ PGDATA=/usr/local/locale_FR pg_ctl start postmaster successfully started $ PGDATA=/usr/local/locale_FR createdb french_locale CREATE DATABASE $ PGDATA=/usr/local/locale_FR psql -q -d french_locale french_locale=# CREATE TABLE sort_test ( pkey char ); CREATE TABLE french_locale=# INSERT INTO sort_test VALUES ('a'); INSERT french_locale=# INSERT INTO sort_test VALUES ('ä'); INSERT french_locale=# INSERT INTO sort_test VALUES ('b'); INSERT french_locale=# SELECT * FROM sort_test; pkey ------ a ä b (3 rows) french_locale=# q

Now, repeat this procedure but set LC_COLLATE=en_US before creating the database cluster:

$ PGDATA=/usr/local/locale_EN LC_COLLATE=en_US initdb ... Success. You can now start the database server using: postmaster -D /usr/local/locale_EN or pg_ctl -D /usr/local/locale_EN -l logfile start $ PGDATA=/usr/local/locale_EN pg_ctl start postmaster successfully started $ PGDATA=/usr/local/locale_EN createdb english_locale CREATE DATABASE $ PGDATA=/usr/local/locale_EN psql -q -d locale_test english_locale=# CREATE TABLE sort_test ( pkey char ); CREATE TABLE english_locale=# INSERT INTO sort_test VALUES ('a'); INSERT english_locale=# INSERT INTO sort_test VALUES ('ä'); INSERT english_locale=# INSERT INTO sort_test VALUES ('b'); INSERT english_locale=# SELECT * FROM sort_test; pkey ------ a b ä (3 rows) locale_test=# q

Notice that the collation sequence has, in fact, changed. With LC_COLLATE set to fr_FR, you see a,ä,b. With LC_COLLATE set to en_US, the ORDER BY clause returns a,b,ä.

The LC_COLLATE and LC_CTYPE categories are only honored when you run the initdb command. Imagine what would happen if you were trying to alphabetize a long list of customer names, but the collation rules changed every few minutes. You'd end up with quite a messeach portion of the final list would be built with a different ordering. If you could change the collating sequence each time you started a client application, indexes would not be built reliably.

PostgreSQL Locale Summary

To summarize:

Категории