Working with Data BLOBs and Text

Understanding the Challenges Presented by Binary and Character Large Object (BLOB) Processing

Because the majority of SQL servers are installed to support business data processing, columns in SQL tables are most often used to store such things as names; addresses; units of sales or inventory; money amounts due, sold, or paid; dates and times; and so on. Columns that hold data of these data types require a small amount of storage space-a few bytes for INTEGER, MONEY, and DATETIME values and perhaps 50-100 bytes each for (CHAR or VARCHAR) names, addresses, and product descriptions.

Due to the (typically) minimal data storage requirements of individual columns, DBMS products have been optimized to work with short rows (in terms of byte length), even though each table's rows may have dozens of columns. The SQL server assumes that rows will occupy at most a few thousand bytes when managing disk storage space, and indexes table data for speedy retrieval. Similarly, applications written to work with SQL database data assume that they will be able to retrieve a large number of rows of data into available (free) memory at the workstation or into temporary storage on a local hard drive through the workstations memory buffer. As you saw when you worked with Visual Basic applications and the DBLIB in Tips 391-425, row-at-a-time processing techniques for handling query results are relatively easy to understand and use.

Over the past few years, character-oriented business applications have been replaced by programs that take advantage of the Windows graphical user interface. The requirement on applications not only to display high-resolution graphics and animations but also to play sound files and video clips has caused the average size of "business" data items to grow exponentially. Consequently, with multimedia applications now an integral part of employee training programs, sales presentations, statistical analysis at company board meetings, and customer support services, business users need the DBMS to manage these new, large-size data type items along with other (more traditional and smaller-size) character and numeric data.

The challenge for the DBMS is that a single high-resolution graphic image may require hundreds of thousands of bytes of storage, a short audio file will occupy millions of bytes disk space, and video clips typically take up several megabytes of storage. Therefore, placing just one of these "new" (multimedia) data type items in a single column of one row in a table may require more storage space than an entire table that contains thousands of rows with dozens of columns of numeric and character data per row. While the disk space required to store a video clip, for example, is the same whether the clip is stored in a separate file on disk or in a column in a table row, using normal DBMS methods to work with exceptionally large rows may cause a few problems.

For example, when a user creates and populates a static cursor with rows that have a video clip column, 10 rows of data could easily take up 100MB of storage space (or server memory). Although 100MB per user is no problem if you have only a few users, in a more typical environment with 100 or more active connections, user queries might easily consume hundreds of gigabytes of storage on the server. Moreover, after waiting for the server to copy (perhaps) hundreds of megabytes of data from one disk location to another, the user will be able to work with the data on a workstation with only an inordinately large amount of memory available to buffer the data retrieved by a FETCH statement on a row-at-a-time basis.

Initial DBMS attempts to support binary large objects, or BLOBs (as items of data type TEXT, NTEXT, and IMAGE are often called), involved storing BLOB data outside the database in individual external files. Instead of the BLOB data itself, the DBMS stored the name of the BLOB file in a CHARACTER (or VARCHAR) data type column. When an application needed BLOB data, it would retrieve the name of the BLOB file from the DBMS and then communicate with the operating system to open and work with the file's contents.

Unfortunately, this approach required that the programmer understand both the DBMS and the file system interfaces. Furthermore, because application programs were responsible for managing the contents of the external BLOB files, users could not, for example, ask the DBMS to compare two BLOB items (in a search condition). Moreover, the DBMS could not provide even a simple BLOB text search capability.

Today, most commercial DBMS products store and manage at least two types of BLOB data within the database (vs. in external files). MS-SQL Server, for example, supports TEXT, NTEXT, and IMAGE data type columns. Oracle, meanwhile, supports these same BLOB data items but calls its data types for these CLOB, NCLOB, and BLOB.

Understanding MS SQL Server BLOB(TEXT, NTEXT, and IMAGE) Processing

MS-SQL Server lets you store up to 2GB of data as a single item in a column. A TEXT column, as its name implies, is used to store character string data such as that stored in columns of data type CHAR and VARCHAR. Therefore, use a column of data type TEXT when you have a character string longer than 8,000 characters—which is the maximum number of characters a column of CHAR or VARCHAR can hold. Similarly, an IMAGE column lets you store binary data that exceeds the 8,000-byte maximum length imposed on columns of data type BINARY and VARBINARY. Typical items stored in columns of data type IMAGE are MS-Word documents, MS-Excel spreadsheets, bitmaps (BMP files), Graphics Interchange Format (GIF) pictures, and Joint Photographic Experts Group (JPEG or JPG) images.

If the data in a TEXT or IMAGE column is less than or equal to 8,000 bytes in length, you can use standard INSERT, SELECT, and UPDATE statements to work with the column. For example, to display the contents of a NOTES column of data type TEXT in the EMPLOYEES table, execute the SELECT statement

SELECT notes FROM EMPLOYEES

if the NOTES column has up to 8,000 characters. Bear in mind that although you can use a SELECT statement to show the contents of an IMAGE column, online query tools (such as MS-Query Analyzer) will display column's binary code as 1s and 0s (and not as a graphics image, sound file, or video clip). To interpret and convert the binary 1s and 0s into a document or image, you will need to send the column's contents to an application such as MS-Word, MS-Excel, or WinJPEG.

To optimize storage space usage and data retrieval performance, the MS-SQL Server, like other DBMS products, organizes table data in 8K pages. If the DBMS were to store BLOB items (TEXT, NTEXT, and IMAGE data) intermixed with data from the row's other columns, the BLOB data would make it impossible for the DBMS to fit rows of data into 8K pages. After all, a single BLOB data item may require hundreds of pages of storage. To keep tables optimized, the MS-SQL Server stores any BLOB data that is larger than 256 bytes in a separate area on disk. The server then places a pointer to the first page of BLOB data into the TEXT, NTEXT, or IMAGE column in the table.

As mentioned in Tip 462, "Understanding the Challenges Presented by Binary and Character Large Object (BLOB) Processing," most application programs that work with database data cannot hold an entire BLOB in a memory buffer at once. Instead, the programs must process the BLOB in sections. However, the APIs that the programs use to retrieve and store table data work on a row-at-a-time basis. To get around this problem, you use the Active Data Object (ADO) APPENDCHUNK and GETCHUNK methods, which let you store and retrieve data in BLOB columns a piece-at-a-time. (Tip 601, "Displaying Image Data Stored within an SQL Table," discusses how to retrieve image data from a table and write it to a disk file using the GETCHUNK method.)

One final issue to address in handling BLOB data is what to do about transaction logging. When logging work performed in a transaction, the DBMS normally maintains a "before" and "after" image of all modified data. Since each BLOB data item can be very large, making multiple "before" and "after" copies of it as it is modified can cause unacceptable delays in waiting for the DBMS to copy the data, and may fill up the transaction log while doing so. As such, DBMS products either do not support the logging of BLOB data or let you control the logging by providing a stored procedure (or other utility) you can use to turn it on and off. MS-SQL Server, for example, lets you use the WRITETEXT and UPDATETEXT statements to modify the contents of a BLOB without logging "before" and "after" images of the BLOB in the transaction log.

Using an INSERT or UPDATE Statement to Place Data into a BLOB Data Type Column

As you learned in Tip 462, "Understanding the Challenges Presented by Binary and Character Large Object (BLOB) Processing," and Tip 463, "Understanding MS-SQL Server BLOB (TEXT, NTEXT, and IMAGE) Processing," the MS-SQL Server lets you define table columns that can hold up to 2GB of binary or character string data. These columns of data type TEXT, NTEXT, and IMAGE are often called BLOBs (or bulk data, on MS-SQL Server). You can use an INSERT statement to place data into a BLOB.

For example, if you have a CUSTOMERS table created by

CREATE TABLE customers (cust_ID INTEGER, f_name VARCHAR(15), l_name VARCHAR(20), notes TEXT, remarks TEXT, photo IMAGE)

you can use an INSERT statement such as

INSERT INTO customers VALUES (1, 'Konrad', 'King',' 'Notes on the customer', 'No Photo as yet', '010')

to place data into the CUSTOMER table's BLOB columns (NOTES, REMARKS, PHOTO).

Similarly, you can use an UPDATE statement such as

UPDATE customers SET remarks = 'The PHOTO column has no photo. It has only example data used to show the binary conversion of character string data placed into an IMAGE column. WHERE cust_ID = 1

to change the contents of a BLOB column, just as you would change the contents of a column of any other data type.

Bear in mind that the DBMS logs all work performed by INSERT and UPDATE statements in the database transaction log. Since the amount of data in a BLOB column can be very large (up to 2GB in length), you may want to use a WRITETEXT statement (see Tip 465, "Using the Transact-SQL WRITETEXT Statement to Place Data into a TEXT, NTEXT, or IMAGE Column") or UPDATETEXT statement (see Tip 466, "Using the Transact-SQL UPDATE-TEXT Statement to Change the Contents of a TEXT, NTEXT, or IMAGE Column") in place of INSERT and UPDATE statements when working with BLOB data. Work performed by WRITETEXT and UPDATEXT statements is (by default) not written to the transaction log.

  Note 

As mentioned by the text of the "remarks" placed into the REMARKS column of the current example, if you store a character string in a column of data type IMAGE, the server will treat the character string as a binary value. For example, if you execute the SELECT statement

SELECT photo FROM customers

to display the result of placing the character string 010 into the IMAGE data type column PHOTO in the current example, the MS-SQL Server will display the column's contents as

photo ----------------------------- 0x303130

which is the hexadecimal representation of (0 [ASCII value 48] 1 [ASCII value 49], and 0 [ASCII value 48]). Obviously, you want to store character strings in TEXT columns and reserve binary columns for binary data such as MS-Word documents, MS-Excel Spreadsheets, graphics images, sound files, and video clips.

Using the Transact SQL WRITETEXT Statement to Place Data into a TEXT, NTEXT, or IMAGE Column

When you use an UPDATE statement to change the contents of a BLOB (TEXT, NTEXT, or IMAGE data type) column, the DBMS must write "before" and "after" copies of the column's contents into the transaction log. To avoid the overhead of logging changes to BLOB data, execute a Transact-SQL WRITETEXT instead of an UPDATE statement.

The WRITETEXT statement uses the syntax

WRITETEXT

where:

Use WRITETEXT to replace the current contents of a BLOB (TEXT, NTEXT, or IMAGE data type) column. If you want to modify (vs. replace) the contents of a BLOB, execute a Transact-SQL UPDATETEXT statement (which you will learn about in Tip 466, "Using the Transact-SQL UPDATETEXT Statement to Change the Contents of a TEXT, NTEXT, or IMAGE Column") instead.

In order for the DBMS to execute a WRITETEXT statement successfully, the system needs a valid pointer to the first 8K page that the DBMS has allocated to hold its BLOB data. If the target column contains a non-NULL BLOB, then the text pointer value is already valid. Otherwise, set the value of the BLOB's text pointer by executing an INSERT statement or an UPDATE statement to write either a zero-length character string or a portion of the BLOB data to the target column.

For example, to place a character string into the REMARKS column of the row for the customer with CUST_ID 1 in the CUSTOMERS table created in Tip 464, "Using an INSERT or UPDATE Statement to Place Data into a BLOB Data Type Column," use a statement batch similar to the following:

UPDATE customers SET remarks = '' WHERE cust_ID = 1 DECLARE @text_pointer BINARY(16) SELECT @text_pointer = TEXTPTR(remarks) FROM customers WHERE cust_ID = 1 WRITETEXT customers.remarks @text_pointer 'These are the remarks to write into the REMARKS column'

Using the Transact SQL UPDATETEXT Statement to Change the Contents of a TEXT, NTEXT, or IMAGE Column

As mentioned in Tip 465 "Using the Transact-SQL WRITETEXT Statement to Place Data into a TEXT, NTEXT, or IMAGE Column," both the Transact-SQL WRITETEXT and UPDATETEXT statements tell the DBMS to modify the contents of a TEXT, NTEXT, or IMAGE column (a BLOB) without writing "before" and "after" copies of the column's contents into the transaction log. The WRITETEXT statement can be used only to replace one BLOB with another. An UPDATETEXT statement, meanwhile, is more flexible in that it not only lets you replace one BLOB with another, but it also lets you modify (vs. overwrite) a BLOB, delete some or all of the BLOB's data, and add additional data to a BLOB.

The syntax of the UPDATETEXT statement is

UPDATETEXT {NULL|} {NULL | } { | { }}

where:

In order for the DBMS to execute an UPDATETEXT statement successfully, the system needs a valid pointer to the first page of the BLOB data to be updated. If the target column is not NULL, the text pointer to its BLOB data is already valid. Conversely, if the BLOB data column is NULL, you must set the value of the BLOB's text pointer by executing an INSERT statement or an UPDATE statement to write either a zero-length character string or a portion of the BLOB data to the target column.

For example, to insert a character string at the beginning of the current BLOB in the REMARKS column for customer 1, use a statement batch similar to:

DECLARE @dest_text_pointer BINARY(16) SELECT @dest_text_pointer = TEXTPTR(remarks) FROM customers WHERE cust_ID = 1 IF @dest_text_pointer IS NULL BEGIN UPDATE customers SET remarks = '' WHERE cust_ID = 1 SELECT @dest_text_pointer = TEXTPTR(remarks) FROM customers WHERE cust_ID = 1 END IF @dest_text_pointer IS NOT NULL UPDATETEXT customers.remarks @dest_text_pointer 0 0 ' This is an additional remark.'

Using the READTEXT() Function to Read a Portion (or ALL) of the Data in a TEXT, NTEXT, or IMAGE Column

The Transact-SQL READTEXT() function is to a BLOB (TEXT, NTEXT, or IMAGE data) as the SUBSTRING() function is to a character string. While the SUBSTRING() function returns a portion of the characters in a character string, the READTEXT() function returns a specified number of bytes from a BLOB.

If you want to display the first @@TEXTSIZE bytes in a BLOB, you can use a SELECT statement that lists the BLOB column in its SELECT clause. (You will learn how to set the number of bytes in the TEXTSIZE option in Tip 471, "Understanding the TEXTSIZE Option and the @@TEXTSIZE() Function.") However, if you want to read some number of bytes other than the @@TEXTSIZE or starting with other than the first byte in the BLOB, use a READTEXT instead of a SELECT statement.

The syntax of the READTEXT statement is

READTEXT [HOLDLOCK]

where:

Therefore, to display 21 bytes of data starting with the 15th byte in the REMARKS column of the CUSTOMERS table row for CUST_ID 1, execute the statement batch:

DECLARE @text_pointer BINARY(16) SELECT @text_pointer = TEXTPTR(remarks) FROM customers WHERE cust_ID = 1 IF @text_pointer IS NOT NULL READTEXT customers.remarks @text_pointer 14 21

Using the MS SQL Server TEXTVALID() Function Determine If a Text Pointer Is Valid

As you saw in Tip 464, "Using an INSERT or UPDATE Statement to Place Data into a BLOB Data Type Column," and Tip 467, "Using the READTEXT() Function to Read a Portion (or ALL) of the Data in a TEXT, NTEXT, or IMAGE Column," the WRITETEXT, UPDATE-TEXT, and READTEXT statements require that you provide the address of the first 8K page of the target TEXT, NTEXT, or IMAGE data. The TEXTVALID() function lets you check to see if the TEXTPTR() function will return a valid text pointer to the BLOB data for a particular column in a row of the target table.

The TEXTVALID() function uses the syntax

EXTVALID('',) where:

TEXVALID() will return an INTEGER value of 1 if the for is valid, or an INTEGER value of 0 if the is invalid. As such, the UPDATE statement

UPDATE customers SET remarks = '' WHERE TEXTVALID('customers.remarks', TEXTPTR(remarks)} <> 1

uses the TEXTVALID() function to check the validity of the text pointer to the REMARKS BLOB for each row in the CUSTOMERS table and initializes the text pointer for those rows with a NULL value in the REMARKS column.

  Note 

When you INSERT a row with a NULL value in a TEXT, NTEXT, or IMAGE (BLOB) data type column, the DBMS does not preallocate an 8K page to hold the first 8K of BLOB data. As a result, if the BLOB column value is NULL, the text pointer to its first 8K storage page is invalid (or NULL). As soon as you place a value (even a zero-length string) into the BLOB column, the DBMS allocates an 8K page and initializes the value of the text pointer that points to it. The text pointer for a BLOB data item remains valid until a user or application deletes the row in which the BLOB is stored.

Using the PATINDEX() Function to Return the Location of the First Occurrence of a Character String in a BLOB

The PATINDEX() function lets you search the contents of a TEXT or NTEXT column for a character string. If PATINDEX() finds the character string in the column's data, it returns the byte offset (that is, the location) of the first occurrence of the first letter in the character string. Conversely, if the function does not find the pattern of characters in the column, it returns 0. PATINDEX()returns a NULL value only if the either the TEXT or the NTEXT column is NULL, or if you tell the function to search for a NULL value.

The syntax of a PATINDEX() function call is that shown here:

PATINDEX('',)

where:

For example to get a list of all customers who complained about the "chicken" products, use the PATINDEX() function in the WHERE clause of a query, such as:

SELECT cust_ID, f_name + ' ' + l_name 'Customer Name', complaints FROM customers WHERE PATINDEX('%chicken%', complaints) >= 1

The percent (%) characters that surround the (the character string to find in the BLOB) tell the DBMS to treat as "matching" any number of characters that precede or follow the . If you only want to display those rows that start with the , omit the leading percent (%) character, as in:

SELECT cust_ID, f_name + ' ' + l_name 'Customer Name', complaints FROM customers WHERE PATINDEX('chicken%', complaints) >= 1

To display only those rows with a COMPLAINTS column that ends in the , omit the trailing percent (%) character, as in:

SELECT cust_ID, f_name + ' ' + l_name 'Customer Name', complaints FROM customers WHERE PATINDEX('%chicken', complaints) >= 1

Using the DATALENGTH() Function to Return the Number of Bytes in a BLOB

The DATALENGTH() function, syntax

DATALENGTH()

will return the number of bytes in an of any data type as an INTEGER value. If the has a NULL value, the DATALENGTH() function will return NULL.

DATALENGTH() lets you determine the size of the BLOB in a TEXT, NTEXT, or IMAGE column. For example, to display the number of bytes in the LOGO and PR_INFO columns (data type IMAGE and TEXT, respectively) in the PUB_INFO table of the PUBS database, execute a query such as

USE pubs SET TEXTSIZE 45 SELECT DATALENGTH(logo) 'Logo Size', DATALENGTH(pr_info) 'PR Info Size', pr_info 'Public Relations Info Text' FROM pub_info

and MS-SQL Server will display a results similar to that shown in Figure 470.1.

Figure 470.1: The MS-SQL Server Query Analyzer window displaying query results that use the DATALENGTH() function to show the byte length of an IMAGE and a TEXT column

  Note 

The TEXTSIZE option (which you will learn about in Tip 471, "Understanding the TEXTSIZE Option and the @@TEXTSIZE() Function") sets the maximum number of characters the SELECT statement will return when it is told to display the contents of a TEXT, NTEXT, or IMAGE column.

Understanding the TEXTSIZE Option and the @@TEXTSIZE() Function

When using an interactive program such as MS-SQL Server's Query Analyzer, you may want to prevent a TEXT column's data from "wrapping" around from one line to the next so that each row in the query's results table fits on a single line on the screen. Or, suppose that you write an application that works with data in a table that has a column of data type TEXT, and you are interested in only the first 250 characters in the column. In both cases, you can execute a SET statement to specify the number of bytes that a SELECT statement (executed either interactively or by an application program) will return from a column of data type TEXT, NTEXT, or IMAGE (for example, a column that contains a BLOB).

The SET statement's TEXTSIZE option lets you tell the MS-SQL Server the number of bytes of data that it is to return when told to display a BLOB. Each time a user logs in, the server sets the connection's initial TEXTSIZE to 64,512-meaning that a SELECT statement on a BLOB will return, at most, 64,512 bytes of data. To tell the server to return only the first 20 bytes of data instead, execute the SET statement:

SET TEXTSIZE 20

You can use either a PRINT or a SELECT statement to display the value of the TEXTSIZE option for a session. For example, the PRINT statement

PRINT @@TEXTSIZE

will display the number 64512-if you have not yet use the SET statement to change the initial value of the connection's TEXTSIZE option.

  Note 

When setting the value of TEXTSIZE to specify the number of characters you want a SELECT statement to display from a column of data type TEXT or NTEXT, bear in mind that each NTEXT (or UNICODE) character occupies 2 bytes of storage. As such, if you set the TEXTSIZE to 20, for example, a SELECT statement will display the first 20 characters in a column of data type TEXT and only the first 10 characters of in a column of data type NTEXT.

Категории