SAS 9.1.3 Language Reference: Concepts, Third Edition, Volumes 1 and 2

Formats

Definition of a Format

A format is an instruction that SAS uses to write data values. You use formats to control the written appearance of data values, or, in some cases, to group data values together for analysis. For example, the WORDS22. format, which converts numeric values to their equivalent in words, writes the numeric value 692 as six hundred ninety-two.

Syntax of a Format

SAS formats have the following form:

<$>format<w>.<d>

Here is an explanation of the syntax:

$

indicates a character format; its absence indicates a numeric format.

format

names the format. The format is a SAS format or a user-defined format that was previously defined with the VALUE statement in PROC FORMAT. For more information on user-defined formats, see "The FORMAT Procedure" in Base SAS Procedures Guide.

w

specifies the format width, which for most formats is the number of columns in the output data.

d

specifies an optional decimal scaling factor in the numeric formats.

Formats always contain a period (.) as a part of the name. If you omit the w and the d values from the format, SAS uses default values. The d value that you specify with a format tells SAS to display that many decimal places, regardless of how many decimal places are in the data. Formats never change or truncate the internally stored data values.

For example, in DOLLAR10.2, the w value of 10 specifies a maximum of 10 columns for the value. The d value of 2 specifies that two of these columns are for the decimal part of the value, which leaves eight columns for all the remaining characters in the value. This includes the decimal point, the remaining numeric value, a minus sign if the value is negative, the dollar sign, and commas, if any.
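The width and decimal arithmetic described above can be tried in a short DATA step. This is a minimal sketch; the variable name amount and its value are chosen only for illustration.

```sas
data _null_;
   amount = 1234.5;

   /* w=10, d=2: the value is written as $1,234.50, right-aligned in 10 columns */
   put amount dollar10.2;

   /* The format changes only the written appearance; the stored value is unchanged */
   put amount=;
run;
```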

If the format width is too narrow to represent a value, then SAS tries to squeeze the value into the space available. Character formats truncate values on the right. Numeric formats sometimes revert to the BESTw.d format. SAS prints asterisks if you do not specify an adequate width. In the following example, the result is x=**.

x=123; put x=2.;
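The same statements can be run as a complete DATA step. The sketch below wraps them in DATA _NULL_ and adds a character value to show truncation on the right; the variable c and its value are illustrative assumptions.

```sas
data _null_;
   x = 123;
   put x= 2.;     /* width 2 cannot hold 123; the text above gives x=** for this case */
   put x= 3.;     /* width 3 is adequate: x=123 */

   c = 'abcdef';
   put c $3.;     /* character formats truncate on the right: abc */
run;
```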

If you use an incompatible format, such as using a numeric format to write character values, first SAS attempts to use an analogous format of the other type. If this is not feasible, then an error message that describes the problem appears in the SAS log.

Ways to Specify Formats

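You can use formats in the following ways: in a PUT statement; with the PUT, PUTC, or PUTN functions; with the %SYSFUNC macro function; in a FORMAT statement in a DATA step or a PROC step; and in an ATTRIB statement in a DATA step or a PROC step.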

Permanent versus Temporary Association

When you specify a format in a PUT statement, SAS uses the format to write data values during the DATA step but does not permanently associate the format with a variable. To permanently associate a format with a variable, use a FORMAT statement or an ATTRIB statement in a DATA step. SAS permanently associates a format with the variable by modifying the descriptor information in the SAS data set.

Using a FORMAT statement or an ATTRIB statement in a PROC step associates a format with a variable for that PROC step, as well as for any output data sets that the procedure creates that contain formatted variables. For more information on using formats in SAS procedures, see Base SAS Procedures Guide.
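The difference between temporary and permanent association can be seen in a small program. This is a sketch under stated assumptions: the data set name work.sales and the variable amount are hypothetical.

```sas
data work.sales;                 /* hypothetical data set name */
   amount = 1234.5;
   put amount dollar10.2;        /* temporary: applies to this PUT statement only */
   format amount comma10.2;      /* permanent: stored in the data set descriptor */
run;

proc contents data=work.sales;   /* the variable list shows COMMA10.2 for amount */
run;

proc print data=work.sales;
   format amount dollar12.2;     /* applies to this PROC step only */
run;
```

After the PROC PRINT step ends, the descriptor of work.sales still carries COMMA10.2, because the FORMAT statement in the PROC step does not modify the input data set.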

Informats

Definition of an Informat

An informat is an instruction that SAS uses to read data values into a variable. For example, the following value contains a dollar sign and commas:

$1,000,000

To remove the dollar sign ($) and commas (,) before storing the numeric value 1000000 in a variable, read this value with the COMMA11. informat.

Unless you explicitly define a variable first, SAS uses the informat to determine whether the variable is numeric or character. SAS also uses the informat to determine the length of character variables.
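As a minimal sketch of the COMMA11. case above, the following DATA step reads the value from an in-stream data line. The variable name amount is an assumption made for illustration.

```sas
data _null_;
   /* COMMA11. strips the dollar sign and commas; amount becomes a numeric variable */
   input amount comma11.;
   put amount=;                  /* amount=1000000 */
   datalines;
$1,000,000
;
run;
```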

Syntax of an Informat

SAS informats have the following form:

<$>informat<w>.<d>

Here is an explanation of the syntax:

$

indicates a character informat; its absence indicates a numeric informat.

informat

names the informat. The informat is a SAS informat or a user-defined informat that was previously defined with the INVALUE statement in PROC FORMAT. For more information on user-defined informats, see "The FORMAT Procedure" in Base SAS Procedures Guide.

w

specifies the informat width, which for most informats is the number of columns in the input data.

d

specifies an optional decimal scaling factor in the numeric informats. SAS divides the input data by 10 to the power of d.

Note  

Even though SAS can read up to 31 decimal places when you specify some numeric informats, floating-point numbers with more than 12 decimal places might lose precision due to the limitations of the eight-byte floating point representation used by most computers.

Informats always contain a period (.) as a part of the name. If you omit the w and the d values from the informat, SAS uses default values. If the data contains decimal points, SAS ignores the d value and reads the number of decimal places that are actually in the input data.
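The effect of the d value on input can be illustrated with two data lines, one without a decimal point and one with. This is a small sketch; the variable name x and the data values are illustrative.

```sas
data _null_;
   input x 5.2;
   put x=;
   datalines;
12345
 12.5
;
run;
```

The first line has no decimal point, so SAS divides 12345 by 10 to the power of 2 and stores 123.45. The second line contains a decimal point, so the d value is ignored and SAS stores 12.5.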

If the informat width is too narrow to read all the columns in the input data, you may get unexpected results. The problem frequently occurs with the date and time informats. You must adjust the width of the informat to include blanks or special characters between the day, month, year, or time. For more information about date and time values, see Chapter 8, "Dates, Times, and Intervals," on page 127.

When a problem occurs with an informat, SAS writes a note to the SAS log and assigns a missing value to the variable. Problems occur if you use an incompatible informat, such as a numeric informat to read character data, or if you specify the width of a date and time informat that causes SAS to read a special character in the last column.

Ways to Specify Informats

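You can specify informats in the following ways: in an INPUT statement; with the INPUT, INPUTC, and INPUTN functions; in an INFORMAT statement in a DATA step or a PROC step; and in an ATTRIB statement in a DATA step or a PROC step.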

Permanent versus Temporary Association

When you specify an informat in an INPUT statement, SAS uses the informat to read input data values during that DATA step. SAS, however, does not permanently associate the informat with the variable. To permanently associate an informat with a variable, use an INFORMAT statement or an ATTRIB statement. SAS permanently associates an informat with the variable by modifying the descriptor information in the SAS data set.

User-Defined Formats and Informats

In addition to the formats and informats that are supplied with Base SAS software, you can create your own formats and informats. In Base SAS software, PROC FORMAT allows you to create your own formats and informats for both character and numeric variables.

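When you execute a SAS program that uses user-defined formats or informats, these formats and informats should be available. The two ways to make these formats and informats available are to create permanent, not temporary, formats or informats with PROC FORMAT, or to store the source code that creates the formats and informats (the PROC FORMAT step) with the SAS program that uses them.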

To create permanent SAS formats and informats, see "The FORMAT Procedure" in Base SAS Procedures Guide.

If you execute a program that cannot locate a user-defined format or informat, the result depends on the setting of the FMTERR system option. If the user-defined format or informat is not found, then these system options produce these results:

| System option | Results |
| --- | --- |
| FMTERR | SAS produces an error that causes the current DATA or PROC step to stop. |
| NOFMTERR | SAS continues processing and substitutes a default format, usually the BESTw. or $w. format. |

Although using NOFMTERR enables SAS to process a variable, you lose the information that the user-defined format or informat supplies.

To avoid problems, make sure that your program has access to all of the user-defined formats and informats that are used in the program.
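One way to keep a user-defined format available is to store the PROC FORMAT step with the program that uses it. The sketch below does that; the format name agegrp, its ranges, and its labels are hypothetical and serve only as an illustration.

```sas
/* Define a temporary user-defined numeric format (hypothetical name and ranges) */
proc format;
   value agegrp
      low -< 13 = 'Child'
      13  -< 20 = 'Teen'
      20 - high = 'Adult';
run;

data _null_;
   age = 15;
   put age agegrp.;    /* writes Teen */
run;
```

Because the format is defined in the same program, the DATA step finds it regardless of the FMTERR setting. If the PROC FORMAT step were missing and FMTERR were in effect, the DATA step would stop with an error, as the table above describes.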

Byte Ordering for Integer Binary Data on Big Endian and Little Endian Platforms

Definitions

Integer values for binary integer data are typically stored in one of three sizes: one-byte, two-byte, or four-byte. The ordering of the bytes for the integer varies depending on the platform (operating environment) on which the integers were produced.

The ordering of bytes differs between the "big endian" and "little endian" platforms. These colloquial terms are used to describe byte ordering for IBM mainframes (big endian) and for Intel-based platforms (little endian). In SAS, the following platforms are considered big endian: AIX, HP-UX, IBM mainframe, Macintosh, and Solaris. The following platforms are considered little endian: OpenVMS Alpha, Digital UNIX, Intel ABI, and Windows.

How Bytes Are Ordered

On big endian platforms, the value 1 is stored in binary and is represented here in hexadecimal notation. One byte is stored as 01, two bytes as 00 01, and four bytes as 00 00 00 01. On little endian platforms, the value 1 is stored in one byte as 01 (the same as big endian), in two bytes as 01 00, and in four bytes as 01 00 00 00.

If an integer is negative, the "two's complement" representation is used. The high-order bit of the most significant byte of the integer will be set on. For example, -2 would be represented in one, two, and four bytes on big endian platforms as FE, FF FE, and FF FF FF FE respectively. On little endian platforms, the representation would be FE, FE FF, and FE FF FF FF. These representations result from the output of the integer binary value -2 expressed in hexadecimal representation.
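The byte layouts above can be reproduced on any platform by writing a value with a format whose byte order is fixed and then displaying the bytes with $HEX. This is a minimal sketch: S370FIB always writes big endian, IBR always writes little endian, and the variable names be and le are illustrative.

```sas
data _null_;
   length be le $4;

   be = put(put(1, s370fib2.), $hex4.);    /* big endian two-byte 1 */
   le = put(put(1, ibr2.),     $hex4.);    /* little endian two-byte 1 */
   put be= le=;                            /* be=0001 le=0100 */

   be = put(put(-2, s370fib2.), $hex4.);   /* two's complement, big endian */
   le = put(put(-2, ibr2.),     $hex4.);   /* two's complement, little endian */
   put be= le=;                            /* be=FFFE le=FEFF */
run;
```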

Writing Data Generated on Big Endian or Little Endian Platforms

SAS can read signed and unsigned integers regardless of whether they were generated on a big endian or a little endian system. Likewise, SAS can write signed and unsigned integers in both big endian and little endian format. The length of these integers can be up to eight bytes.
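Reading works the same way in reverse. As a sketch, the INPUT function below interprets the same bytes, given as hexadecimal literals, with different informats; the variable names are illustrative.

```sas
data _null_;
   big      = input('0102'x, s370fib2.);    /* bytes 01 02 read as big endian: 258 */
   little   = input('0102'x, ibr2.);        /* same bytes read as little endian: 513 */
   signed   = input('FFFE'x, s370fib2.);    /* signed big endian: -2 */
   unsigned = input('FFFE'x, s370fpib2.);   /* unsigned big endian: 65534 */
   put big= little= signed= unsigned=;
run;
```

The same byte pattern yields different numbers depending on the byte order and sign assumption, which is why the informat must match the platform that produced the data.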

The following table shows which format to use for various combinations of platforms. In the Sign? column, "no" indicates that the number is unsigned and cannot be negative. "Yes" indicates that the number can be either negative or positive.

Table 4.1: SAS Formats and Byte Ordering

| Data created for | Data written by | Sign? | Format/Informat |
| --- | --- | --- | --- |
| big endian | big endian | yes | IB or S370FIB |
| big endian | big endian | no | PIB, S370FPIB, S370FIBU |
| big endian | little endian | yes | S370FIB |
| big endian | little endian | no | S370FPIB |
| little endian | big endian | yes | IBR |
| little endian | big endian | no | PIBR |
| little endian | little endian | yes | IB or IBR |
| little endian | little endian | no | PIB or PIBR |
| big endian | either | yes | S370FIB |
| big endian | either | no | S370FPIB |
| little endian | either | yes | IBR |
| little endian | either | no | PIBR |

Integer Binary Notation and Different Programming Languages

The following table compares integer binary notation according to programming language.

Table 4.2: Integer Binary Notation and Programming Languages

| Language | 2 Bytes | 4 Bytes |
| --- | --- | --- |
| SAS | IB2., IBR2., PIB2., PIBR2., S370FIB2., S370FIBU2., S370FPIB2. | IB4., IBR4., PIB4., PIBR4., S370FIB4., S370FIBU4., S370FPIB4. |
| PL/I | FIXED BIN(15) | FIXED BIN(31) |
| FORTRAN | INTEGER*2 | INTEGER*4 |
| COBOL | COMP PIC 9(4) | COMP PIC 9(8) |
| IBM assembler | H | F |
| C | short | long |

Working with Packed Decimal and Zoned Decimal Data

Definitions

Packed decimal

specifies a method of encoding decimal numbers by using each byte to represent two decimal digits. Packed decimal representation stores decimal data with exact precision. The fractional part of the number is determined by the informat or format because there is no separate mantissa and exponent.

An advantage of using packed decimal data is that exact precision can be maintained. However, computations involving decimal data might become inexact due to the lack of native instructions.

Zoned decimal

specifies a method of encoding decimal numbers in which each digit requires one byte of storage. The last byte contains the number's sign as well as the last digit. Zoned decimal data produces a printable representation.

Nibble

specifies 1/2 of a byte.

Packed Decimal Data

A packed decimal representation stores decimal digits in each "nibble" of a byte. Each byte has two nibbles, and each nibble is indicated by a hexadecimal digit. For example, the value 15 is stored in two nibbles, using the hexadecimal digits 1 and 5.

The sign indication is dependent on your operating environment. On IBM mainframes, the sign is indicated by the last nibble. With formats, C indicates a positive value, and D indicates a negative value. With informats, A, C, E, and F indicate positive values, and B and D indicate negative values. Any other nibble is invalid for signed packed decimal data. In all other operating environments, the sign is indicated in its own byte. If the high-order bit is 1, then the number is negative. Otherwise, it is positive.
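The nibble layout and the IBM mainframe sign nibble can be inspected by writing a value with a packed decimal format and displaying the result with $HEX. This is a sketch; S370FPD is used because its layout does not depend on the operating environment, and the variable names are illustrative.

```sas
data _null_;
   length pos neg unsgn $4;

   pos   = put(put( 15, s370fpd2.), $hex4.);   /* digits 0 1 5, sign nibble C */
   neg   = put(put(-15, s370fpd2.), $hex4.);   /* digits 0 1 5, sign nibble D */
   unsgn = put(put( 15, pk2.),      $hex4.);   /* unsigned packed: no sign nibble */

   put pos= neg= unsgn=;                       /* pos=015C neg=015D unsgn=0015 */
run;
```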

The following applies to packed decimal data representation:

Zoned Decimal Data

The following applies to zoned decimal data representation:

Packed Julian Dates

The following applies to packed Julian dates:

Platforms Supporting Packed Decimal and Zoned Decimal Data

Some platforms have native instructions to support packed and zoned decimal data, while others must use software to emulate the computations. For example, the IBM mainframe has an Add Pack instruction to add packed decimal data, but the Intel-based platforms have no such instruction and must convert the decimal data into some other format.

Languages Supporting Packed Decimal and Zoned Decimal Data

Several different languages support packed decimal and zoned decimal data. The following table shows how COBOL picture clauses correspond to SAS formats and informats.

| IBM VS COBOL II clauses | Corresponding S370Fxxx formats/informats |
| --- | --- |
| PIC S9(X) PACKED-DECIMAL | S370FPDw. |
| PIC 9(X) PACKED-DECIMAL | S370FPDUw. |
| PIC S9(W) DISPLAY | S370FZDw. |
| PIC 9(W) DISPLAY | S370FZDUw. |
| PIC S9(W) DISPLAY SIGN LEADING | S370FZDLw. |
| PIC S9(W) DISPLAY SIGN LEADING SEPARATE | S370FZDSw. |
| PIC S9(W) DISPLAY SIGN TRAILING SEPARATE | S370FZDTw. |

For the packed decimal representation listed above, X indicates the number of digits represented, and W is the number of bytes. For PIC S9(X) PACKED-DECIMAL, W is ceil((X+1)/2). For PIC 9(X) PACKED-DECIMAL, W is ceil(X/2). For example, PIC S9(5) PACKED-DECIMAL represents five digits. With the sign included, six nibbles are needed, so W is ceil((5+1)/2) = 3 bytes.
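The width calculation can be tabulated with the CEIL function. This is a small sketch; the variable names x, w_signed, and w_unsigned are chosen for illustration.

```sas
data _null_;
   do x = 1 to 7;
      w_signed   = ceil((x + 1) / 2);   /* PIC S9(X) PACKED-DECIMAL: digits plus sign nibble */
      w_unsigned = ceil(x / 2);         /* PIC 9(X) PACKED-DECIMAL: digits only */
      put x= w_signed= w_unsigned=;
   end;
run;
```

For x=5 the step writes w_signed=3 and w_unsigned=3, which matches the PIC S9(5) example above.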

Note that you can substitute COMP-3 for PACKED-DECIMAL.

In IBM assembly language, the P directive indicates packed decimal, and the Z directive indicates zoned decimal. The following shows an excerpt from an assembly language listing, showing the offset, the value, and the DC statement:

offset    value (in hex)   inst   label   directive
+000000   00001C           2      PEX1    DC PL3'1'
+000003   00001D           3      PEX2    DC PL3'-1'
+000006   F0F0C1           4      ZEX1    DC ZL3'1'
+000009   F0F0D1           5      ZEX2    DC ZL3'-1'
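The hexadecimal values in the listing can be reproduced in SAS with the S370FPD and S370FZD formats at a width of three bytes. This is a sketch; the variable names simply mirror the assembler labels.

```sas
data _null_;
   length pex1 pex2 zex1 zex2 $6;

   pex1 = put(put( 1, s370fpd3.), $hex6.);   /* packed +1: 00001C */
   pex2 = put(put(-1, s370fpd3.), $hex6.);   /* packed -1: 00001D */
   zex1 = put(put( 1, s370fzd3.), $hex6.);   /* zoned  +1: F0F0C1 */
   zex2 = put(put(-1, s370fzd3.), $hex6.);   /* zoned  -1: F0F0D1 */

   put pex1= pex2= zex1= zex2=;
run;
```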

In PL/I, the FIXED DECIMAL attribute is used in conjunction with packed decimal data. You must use the PICTURE specification to represent zoned decimal data. There is no standardized representation of decimal data for the FORTRAN or the C languages.

Summary of Packed Decimal and Zoned Decimal Formats and Informats

SAS uses a group of formats and informats to handle packed and zoned decimal data. The following table lists the type of data representation for these formats and informats. Note that the formats and informats that begin with S370 refer to IBM mainframe representation.

| Format | Type of data representation | Corresponding informat | Comments |
| --- | --- | --- | --- |
| PD | Packed decimal | PD | Local signed packed decimal |
| PK | Packed decimal | PK | Unsigned packed decimal; not specific to your operating environment |
| ZD | Zoned decimal | ZD | Local zoned decimal |
| none | Zoned decimal | ZDB | Translates EBCDIC blank (hex 40) to EBCDIC zero (hex F0), then corresponds to the informat as zoned decimal |
| none | Zoned decimal | ZDV | Non-IBM zoned decimal representation |
| S370FPD | Packed decimal | S370FPD | Last nibble C (positive) or D (negative) |
| S370FPDU | Packed decimal | S370FPDU | Last nibble always F (positive) |
| S370FZD | Zoned decimal | S370FZD | Last byte contains sign in upper nibble: C (positive) or D (negative) |
| S370FZDU | Zoned decimal | S370FZDU | Unsigned; sign nibble always F |
| S370FZDL | Zoned decimal | S370FZDL | Sign nibble in first byte in informat; separate leading sign byte of hex C0 (positive) or D0 (negative) in format |
| S370FZDS | Zoned decimal | S370FZDS | Leading sign of - (hex 60) or + (hex 4E) |
| S370FZDT | Zoned decimal | S370FZDT | Trailing sign of - (hex 60) or + (hex 4E) |
| PDJULI | Packed decimal | PDJULI | Julian date in packed representation - IBM computation |
| PDJULG | Packed decimal | PDJULG | Julian date in packed representation - Gregorian computation |
| none | Packed decimal | RMFDUR | Input layout is: mmsstttF |
| none | Packed decimal | SHRSTAMP | Input layout is: yyyydddFhhmmssth, where yyyydddF is the packed Julian date; yyyy is a 0-based year from 1900 |
| none | Packed decimal | SMFSTAMP | Input layout is: xxxxxxxxyyyydddF, where yyyydddF is the packed Julian date; yyyy is a 0-based year from 1900 |
| none | Packed decimal | PDTIME | Input layout is: 0hhmmssF |
| none | Packed decimal | RMFSTAMP | Input layout is: 0hhmmssFyyyydddF, where yyyydddF is the packed Julian date; yyyy is a 0-based year from 1900 |

Data Conversions and Encodings

An encoding maps each character in a character set to a unique numeric representation, resulting in a table of all code points. A single character can have different numeric representations in different encodings. For example, the ASCII encoding for the dollar symbol $ is hex 24. The Danish EBCDIC encoding for the dollar symbol $ is hex 67. In order for a version of SAS that normally uses ASCII to properly interpret a data set that is encoded in Danish EBCDIC, the data must be transcoded.

Transcoding is the process of moving data from one encoding to another. When transcoding the ASCII dollar sign to the Danish EBCDIC dollar sign, the hex representation for the character is converted from the value 24 to a 67.

If you want to know what the encoding of a particular SAS data set is, for SAS 9 and above follow these steps:

  1. Locate the data set with SAS Explorer.

  2. Right-click the data set.

  3. Select Properties from the menu.

  4. Click the Details tab.

  5. The encoding of the data set is listed, along with other information.
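The same information can also be obtained in a program. As a sketch, assuming SAS 9 or later and using the SASHELP.CLASS sample data set as a stand-in for your own data set, PROC CONTENTS reports the encoding among the data set attributes:

```sas
/* The attribute listing in the output includes an Encoding entry */
proc contents data=sashelp.class;
run;
```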

Some situations where data might commonly be transcoded are:

For more information on SAS features designed to handle data conversions from different encodings or operating environments, see SAS National Language Support (NLS): User's Guide.
