Hack 76. Compress to Avoid LOBs

Most databases have a limit as to the size of a text field. You can still put a few more bytes in once you've reached this limit.

Large Object Blocks (LOBs) are considered the standard way to deal with large text fields, binary data, pictures, and so on. However, a LOB is difficult to work with, and can be implemented differently on each database system you are using. You can instead stick with your varchar types, and use compression techniques. This approach is also useful when you don't want to increase the length of an attribute, yet you have something bigger to store in that field than its capacity allows.

A LOB can allow either large binary or text data to be stored in a database. Often you can query character LOBs using SQL text functions (e.g., LIKE). However, there are a number of differences between engines when it comes to LOBs. For instance:

MySQL has binary LOBs of varying capacities, with names such as BLOB, LONGBLOB, MEDIUMBLOB, and TINYBLOB. It also has text LOBs; for instance, TEXT, LONGTEXT, MEDIUMTEXT, and TINYTEXT.

Oracle has binary LOBs with varying capacities, including BLOB and LONG RAW. Character LOBs include CLOB and LONG.

PostgreSQL needs to have binary LOBs generated using CREATE TYPE. Character LOBs are called TEXT.

SQL Server has the IMAGE data type, or VARBINARY(MAX) for binary and a VARCHAR(MAX) for characters.

If you intend to use LOBs in your applications, you must read both the database system documentation and the database interface routines carefully, as in many cases you must treat LOBs differently from normal columns.

Suppose you currently have a table structure as follows:

CREATE TABLE bigtable ( Author varchar(20), Title varchar(200), Abstract varchar(200), primary key (author,title) ); INSERT INTO bigtable VALUES ('Gordon Russell','SQL Hacks','A really nice book');

It worked well for a while, but now you want to store a title that is slightly more than 300 characters long. The title is:

The amazing tale of the long winded story of the hard to describe tribulations of the followers of the following of the pursuers of the people who came before and the people who came after who followed and followed until the following skills were exhausted and all the following people said to the followers the end.

You could extend the title entry for the table by doing this:

alter table bigtable modify title varchar(400);

This will change abstract to allow up to 400 characters. During the ALTER, all the existing abstracts will be converted to the new maximum size without loss of information. This can take quite a long time.

In some database systems, you might find that the maximum varchar size is smaller than the data you want to store in the field. For instance, in MySQL 3 varchar was a maximum of 255 characters (although in version 4 it is 65,535 characters). You could change the field to a different type that allows more storage space, such as TEXT or CLOB.

Or you could use compression.

If you were to gzip the title shown earlier, it would reduce from 316 to 172 characters. You could store the compressed version of the title in the original varchar(200). This approach is not without its problems, as compressed data is binary data and will appear in the database as nonsense. Also, you cannot query it directly (for instance, with LIKE) and it will become meaningful only after it is queried out of the database and uncompressed. However, it will get you out of a tight spot where you cannot change the schema sufficiently to support the required size.

The following Perl code adds the long title to bigtable using compression techniques. It then queries the data back out just to demonstrate that the process does not lose data. varchar automatically strips off trailing spaces. A trailing space may be essential to allow the decompression to work, so the program adds a ; character to the compressed data. This also forms a useful marker so that a title without a ; at the end is printed directly, and one with a ; is decompressed before printing:

#!/usr/bin/perl use strict; use DBI; use Compress::Zlib; # sub dbOpen { my ($user,$pass) = @_; my $dbh=DBI->connect( 'dbi:mysql:dbname', $user, $pass ); return $dbh; } my $dbh = dbOpen('username','password'); sub getval{ my ($dbh) = @_; my $sql = "SELECT author,title,abstract from bigtable"; my $sth = $dbh->prepare($sql); $sth->execute( ); while ( my $row = $sth->fetchrow_arrayref( )) { if ($row->[1] =~ m/;$/) { chop($row->[1]); $row->[1] = uncompress($row->[1]); } print "$row->[0], $row->[1], $row->[2] "; } } my $str = <do("INSERT INTO bigtable values ('Jim',?,'A Big story')",{},$cstr); getval($dbh);

You can see that the titles which are not compressed are quite readable, but the one which is compressed is full of unprintable characters, ending with a ;:

mysql> select * from bigtable; +----------------+-------------------+--------------------+ | Author | Title | abstract | +----------------+-------------------+--------------------+ | Gordon Russell | SQL Hacks | A really nice book | | Jim | xmXj98*-11fha...; | Big story | +----------------+-------------------+--------------------+ 2 rows in set (0.00 sec)

Категории