Locale Support
A locale is a named group of properties that defines culture-specific conventions. Each locale is made up of one or more categories. Each category controls the behavior of a set of features. For example, the LC_MONETARY category contains information about how monetary values are formatted in some specific territory. The ISO and IEEE (POSIX) standards bodies have stated that a locale should include information such as the ordering of date components, the formatting of numbers, and the language preferred for message text.
PostgreSQL makes use of the locale-processing facilities provided by the host operating system. When you log in to your operating system, you are automatically assigned a locale. On a Linux host (and most Unix hosts), you can find your current locale using the locale command:
$ locale LANG=en_US LC_CTYPE="en_US" LC_NUMERIC="en_US" LC_TIME="en_US" LC_COLLATE="en_US" LC_MONETARY="en_US" LC_MESSAGES="en_US" LC_PAPER="en_US" LC_NAME="en_US" LC_ADDRESS="en_US" LC_TELEPHONE="en_US" LC_MEASUREMENT="en_US" LC_IDENTIFICATION="en_US" LC_ALL=
You can see that I am using a locale named en_US. Locale names are composed of multiple parts. The first component identifies a language. In my case, the language is en, meaning English. The second (optional) component identifies a country, region, or territory where the language is used. I am in the U.S., so my country code is set to US. You can think of en_US as meaning "English as spoken in the U.S.", as opposed to en_AU, which means "English as spoken in Australia." The third component of a locale name is an optional codeset. I'll talk more about codesets later in this chapter. Finally, a locale name may include modifiers, such as "@euro" to indicate that the locale uses the Euro for currency values.
Language IDs are usually two characters long, written in lowercase, and chosen from the ISO 639 list of country codes. Territories are usually two characters long, written in uppercase, and chosen from the ISO 3166 standard.
The POSIX (and ISO) standards define two special locales named C and POSIX. The C and POSIX locales are defined so that they can be used in many different locations.
Table 22.1 shows a few locale names taken from my Linux host.
Locale Name |
Language |
Region |
Codeset |
Modifier |
---|---|---|---|---|
sv_FI |
Swedish |
Finland |
||
sv_FI@euro |
Swedish |
Finland |
Euro is used in this locale |
|
sv_FI.utf8 |
Swedish |
Finland |
UTF-8 |
|
sv_FI.utf8@euro |
Swedish |
Finland |
UTF-8 |
Euro is used in this locale |
sv_SE |
Swedish |
Sweden |
||
v_SE.utf8 |
Swedish |
Sweden |
UTF-8 |
|
en_AU |
English |
Australia |
||
en_AU.utf8 |
English |
Australia |
UTF-8 |
|
en_IE |
English |
Ireland |
||
en_IE@euro |
English |
Ireland |
Euro is used in this locale |
|
en_IE.utf8 |
English |
Ireland |
UTF-8 |
|
en_IE.utf8@euro |
English |
Ireland |
UTF-8 |
Euro is used in this locale |
My Fedora Linux system defines 560 locales. Each locale is broken down into a set of categories. Most locale implementations define (at least) the categories shown in Table 22.2. Some environments define additional categories.
Category |
Influences |
Used By |
---|---|---|
LC_MESSAGES |
Message formatting and message language |
Client/Server |
LC_MONETARY |
Monetary value formatting |
Server |
LC_NUMERIC |
Numeric value formatting |
Server |
LC_TIME |
Date and time formatting |
Not used |
LC_CTYPE |
Character classifications (uppercase, punctuation, and so on) |
Server_ |
LC_COLLATE |
Collating order for string values |
Cluster |
LC_ALL |
All of the above |
See all of the above |
Enabling Locale Support
Starting with PostgreSQL version 7.3, locale support is automatically included in every build. If you're building a version of PostgreSQL older than 7.3, you must specify the --enable-locale option when you configure the source code.
If you want to see messages in a language other than English, you should enable NLS (National Language Support). To enable NLS support, specify --enable-nls when you configure the PostgreSQL source code:
$ ./configure --enable-nls
You choose a locale by setting one or more runtime configuration parameters. Like most configuration parameters, you can define the locale-related parameters in the postgresql.conf configuration file and you can modify them at runtime with the SET command. The locale-related configuration parameters are named LC_MESSAGES, LC_MONETARY, LC_NUMERIC, and LC_TIME. If you don't specify a value for a locale parameter, PostgreSQL looks for the corresponding environment variable.
If you specify a locale using environment variables, there are three levels of environment variables that you can define. At the bottom level, you can set the LANG environment variable to the locale that you want to use. For example, if you want all features to run in a French context unless overridden, set LANG=fr_FR. You can created a mixed by defining LC_MESSAGES, LC_MONETARY, LC_NUMERIC, LC_CTYPE, and/or LC_COLLATE. The LC_xxx environment variables override LANG. If you are working with a database that stores French values (monetary units perhaps), for example, you may still want to see PostgreSQL messages in English. In this case, you would set LANG=fr_FR and LC_MESSAGES=en_US. At the top level, LC_ALL overrides any other locale-related environment variables: If you want everything to run in German (as spoken in Germany), set LC_ALL=de_DE. Remember that PostgreSQL configuration parameters (specified in postgresql.conf or modified with a SET command) take precedence over environment variables.
Effects of Locale Support
Let's see what happens when you change locales.
The first category in Table 22.2, LC_MESSAGES, determines which language PostgreSQL uses when it displays a message. I've been running with LC_MESSAGES set to en_US when I run psql, so I see messages displayed in English:
$ psql -d movies Welcome to psql, the PostgreSQL interactive terminal. Type: copyright for distribution terms h for help with SQL commands ? for help on internal slash commands g or terminate with semicolon to execute query q to quit movies=#
Let's try setting LC_MESSAGES to fr_CA (French as spoken in Canada):
$ LC_MESSAGES=fr_CA psql -d movies Bienvenu à psql, l'interface interactif de PostgreSQL. Tapez: copyright pour l'information de copyright h pour l'aide-mémoire sur les commandes SQL ? pour l'aide-mémoire sur les commandes internes g ou point-virgule pour exécuter une requête q pour quitter movies=#
Voilà! The client messages are now in French.
It's important to remember that the client (psql) and server do not share an environment. In other words, if you set LC_MESSAGES=fr_CA before running psql, you're only affecting the psql clientmessages produced by the client are displayed in French, but messages produced by the server are produced in the server's locale. To see this behavior, try the following:
$ LC_MESSAGES=fr_CA psql -d movies Bienvenu à psql, l'interface interactif de PostgreSQL. Tapez: copyright pour l'information de copyright h pour l'aide-mémoire sur les commandes SQL ? pour l'aide-mémoire sur les commandes internes g ou point-virgule pour exécuter une requête q pour quitter movies=# SELECT junk ; ERROR: column "junk" does not exist
Notice that the greeting printed by the psql client appears in French, but the error message (which is printed by the server) appears in English. To change the server's LC_MESSAGES locale, define the LC_MESSAGES (or more typically, lc_messages) parameter in your server's postgresql.conf file. You can also change the server's LC_MESSAGES locale in the middle of a session using the SET command:
movies=# SELECT junk ; ERROR: column "junk" does not exist movies=# SET lc_messages = 'fr_CA'; SET movies=# SELECT junk ; ERREUR: la colonne ?junk? n'existe pas
When you execute a SET command, you are changing a parameter in your sessionany new sessions will inherit the parameter from the postgresql.conf file. You can only SET LC_MESSAGES if you are a superuser in the cluster that you're connected to.
Some locale properties affect the server, some affect the client, and a few are stored with the database cluster itself (refer to Table 22.2). The LC_MESSAGES category affects both the client and server because each can produce message text.
Now, let's try a few of the other categories.
The server uses the LC_MONETARY category to control the way in which monetary values are formatted. I've modified the customers table in my database to include a balance column (using the MONEY data type). Here is the new column, shown in the en_US locale:
movies=# SELECT * FROM customers; id | customer_name | phone | birth_date | balance ----+----------------------+----------+------------+------------ 4 | Wonderland, Alice N. | 555-1122 | 1969-03-05 | 1 | Jones, Henry | 555-1212 | 1970-10-10 | $10.00 2 | Rubin, William | 555-2211 | 1972-07-10 | $1,000.00 3 | Panky, Henry | 555-1221 | 1968-01-21 | $10,000.00 (4 rows)
Next, I'll use the SET command to change LC_MONETARY to fr_FR (French as spoken in France).
movies=# SET LC_MONETARY = 'fr_FR'; SET
Now, when I query the customers table, the monetary values are formatted using the fr_FR locale:
movies=# SELECT * FROM customers; id | customer_name | phone | birth_date | balance ----+----------------------+----------+------------+------------ 4 | Wonderland, Alice N. | 555-1122 | 1969-03-05 | 1 | Jones, Henry | 555-1212 | 1970-10-10 | EUR10,00 2 | Rubin, William | 555-2211 | 1972-07-10 | EUR1 000,00 3 | Panky, Henry | 555-1221 | 1968-01-21 | EUR10 000,00 (4 rows)
Notice that MONEY values are now formatted using French preferences.
The LC_NUMERIC category determines which characters will be used for grouping, the currency symbol, positive and negative signs, and the decimal point. Currently, LC_NUMERIC is used only by the TO_CHAR() function. The LC_NUMERIC category affects the only server.
PostgreSQL currently does not use the LC_TIME category (each date/time value can include an explicit time zone).
LC_CTYPE is consulted whenever PostgreSQL needs to categorize a character. The server locale determines which characters are considered uppercase, lowercase, numeric, punctuation, and so on. The most obvious uses of LC_CTYPE are the LOWER(), UPPER(), and INITCAP() string functions. LC_CTYPE is also consulted when the server evaluates regular expressions and the LIKE operator.
LC_COLLATE affects the result of an ORDER BY clause that sorts by a string value. LC_COLLATE also affects how an index that covers a string value is built. Setting LC_COLLATE ensures that strings are ordered properly for your locale.
Let's look at an example. Create two new database clusters and insert the same values into each one. The first database uses the French locale for collating:
$ PGDATA=/usr/local/locale_FR LC_COLLATE=fr_FR initdb ... Success. You can now start the database server using: postmaster -D /usr/local/locale_FR or pg_ctl -D /usr/local/locale_FR -l logfile start $ PGDATA=/usr/local/locale_FR pg_ctl start postmaster successfully started $ PGDATA=/usr/local/locale_FR createdb french_locale CREATE DATABASE $ PGDATA=/usr/local/locale_FR psql -q -d french_locale french_locale=# CREATE TABLE sort_test ( pkey char ); CREATE TABLE french_locale=# INSERT INTO sort_test VALUES ('a'); INSERT french_locale=# INSERT INTO sort_test VALUES ('ä'); INSERT french_locale=# INSERT INTO sort_test VALUES ('b'); INSERT french_locale=# SELECT * FROM sort_test; pkey ------ a ä b (3 rows) french_locale=# q
Now, repeat this procedure but set LC_COLLATE=en_US before creating the database cluster:
$ PGDATA=/usr/local/locale_EN LC_COLLATE=en_US initdb ... Success. You can now start the database server using: postmaster -D /usr/local/locale_EN or pg_ctl -D /usr/local/locale_EN -l logfile start $ PGDATA=/usr/local/locale_EN pg_ctl start postmaster successfully started $ PGDATA=/usr/local/locale_EN createdb english_locale CREATE DATABASE $ PGDATA=/usr/local/locale_EN psql -q -d locale_test english_locale=# CREATE TABLE sort_test ( pkey char ); CREATE TABLE english_locale=# INSERT INTO sort_test VALUES ('a'); INSERT english_locale=# INSERT INTO sort_test VALUES ('ä'); INSERT english_locale=# INSERT INTO sort_test VALUES ('b'); INSERT english_locale=# SELECT * FROM sort_test; pkey ------ a b ä (3 rows) locale_test=# q
Notice that the collation sequence has, in fact, changed. With LC_COLLATE set to fr_FR, you see a,ä,b. With LC_COLLATE set to en_US, the ORDER BY clause returns a,b,ä.
The LC_COLLATE and LC_CTYPE categories are only honored when you run the initdb command. Imagine what would happen if you were trying to alphabetize a long list of customer names, but the collation rules changed every few minutes. You'd end up with quite a messeach portion of the final list would be built with a different ordering. If you could change the collating sequence each time you started a client application, indexes would not be built reliably.
PostgreSQL Locale Summary
To summarize:
- LC_COLLATE determines the order in which string values are sorted (to satisfy an ORDER BY clause or to create an index). LC_COLLATE is consulted when you create a database cluster using the initdb command. initdb records the value of LC_COLLATE in the $PGDATA/global/pg_control file and you can't change it later (run the pg_controldata program to see which locale was in effect when you created a cluster). You can't specify a different collating sequence for each database in a cluster. You can't specify a per-table or per-column collating sequence. If you need a per-column collating sequence, you can use a function (such as the one you can find at http://www.fi.muni.cz/~adelton/l10n/postgresql-nls-string) to convert the column values to a directly sortable form.
- LC_CTYPE is consulted when the server needs to classify a character as uppercase, lowercase, printable, whitespace, and so on. Like LC_COLLATE, LC_CTYPE is recorded in $PGDATA/global/pg_control when you create a database cluster. You can't change a cluster's LC_CTYPE once you've created the cluster. The value of LC_CTYPE is inherited by every database within the cluster.
- LC_MESSAGES determines which language is used when a message is produced by the server or by the client. Each client (such as psql) has it's own copy of LC_MESSAGESmessages generated by the client appear in the language determined by that client's copy of LC_MESSAGES. The client's copy of LC_MESSAGES is typically defined by an environment variable. The client's copy of LC_MESSAGES is not propagated to the server (at least not by the psql clientit certainly could be sent to the server by other client applications). The server's copy of LC_MESSAGES is inherited from the postmaster. The postmaster typically finds LC_MESSAGES by looking in the postgresql.conf configuration file, but if it's not defined there, the postmaster will look for LC_MESSAGES, LC_ALL, or LANG environment variables as described earlier. If you're a cluster superuser, you can change a server's copy of LC_MESSAGES with the SET command. LC_MESSAGES is a per-session settingeach time you start a new server, LC_MESSAGES is inherited from the postmaster.
- LC_NUMERIC determines which characters the server uses when it converts numbers to (and from) character form. In particular, LC_NUMERIC defines the decimal point character, the grouping character, and the rules for using the grouping character. LC_NUMERIC is not (typically) used by PostgreSQL clients because they don't need to convert numeric values to/from character form. Each server session maintains its own copy of LC_NUMERIC. The initial value of LC_NUMERIC is inherited from the postmaster. The postmaster typically finds LC_NUMERIC by looking in the postgresql.conf configuration file, but if it's not defined there, the postmaster will look for LC_NUMERIC, LC_ALL, or LANG environment variables as described earlier. You can change the LC_NUMERIC value for your session with the SET command.
- LC_MONETARY is consulted when the server converts monetary values to and from character form. LC_MONETARY defines the local currency symbol ($), the international currency symbol (USD), and positive/negative sign conventions. LC_MONETARY also defines the decimal point character, grouping character, and grouping rules used to convert monetary values (LC_NUMERIC defines the decimal point character, grouping character, and grouping rules used to convert numeric values that are not considered monetary values).
- LC_TIME is not (currently) used by PostgreSQL.