Maximizing Performance and Scalability with IBM WebSphere

The IBM Power4 processor family consists of some of the industry's most powerful single-die CPUs. In the following sections, you'll take a closer look at the fundamentals of the Power4 processor.

Platform Overview

IBM isn't one to go without a formidable RISC-based 64-bit processor platform. The Power4 platform is IBM's competitor to the likes of the UltraSPARC and the Alpha processors from HP/Compaq.

Essentially, the latest member of the Power4 family is the 64-bit PowerPC 970. This is IBM's mainstream PowerPC-based 64-bit processor and one that's now found in many of the more recent server offerings from IBM.

I should note that IBM also produces x86-based servers that compete with the likes of HP/Compaq and Dell. These systems are aimed at the Windows and Linux markets. However, I'll focus on the PowerPC systems for this chapter.

Note  

Linux will also run on the PowerPC processors, and IBM has vocally backed Linux for the past year and a half.

The birth of the PowerPC platform is one of those great stories of the Information Technology (IT) industry. In the early 1990s, IBM set out to build a lower-cost processor architecture, PowerPC, based on the Power platform found in its popular RS/6000 servers.

To make a long story short, IBM teamed up with Apple, which was looking for RISC-based processors to incorporate into Macintosh systems, and Motorola, which had a long-term affiliation with Apple and a reputation for building high-quality processors.

By the mid-to-late 1990s, the consortium, known as AIM (for Apple, IBM, and Motorola), had produced quality 32-bit and 64-bit processor designs, including the 32-bit 603 and 604 models and the 64-bit 620.

The processors that IBM supplies nowadays for its Unix servers extend from the well-proven 604e 32-bit processor up to the flagship 64-bit Power4+ processor. In between, there are a number of processors, including the 64-bit Power3-II, Power4, RS64-II, and RS64-IV.

As you'll see shortly, the IBM and Sun SPARC processor families closely resemble one another in terms of their clock rates and features. Although Sun and IBM definitely take different views on processor design and on what constitutes good performance, there's an interesting line-up of comparable features across the vertically tiered processor market.

Platform Architecture

The processors I'll focus on are those available in today's IBM Unix server market. Some of these processors are a number of years old but still perform well.

Specifically, they are the PowerPC 604e, the PowerPC 750 family, the RS64-III and RS64-IV, the Power4, and the PowerPC 970.

Before looking at each processor in more detail, you'll see an overview of the processors at a high level and the systems in which they come.

Note  

IBM builds WebSphere. As you can imagine, there are some benefits to using IBM hardware when it comes to running WebSphere-based applications. These benefits extend to several features found at the operating system level (in other words, a tighter integration of the JVM and the AIX operating system) and in the IBM JVM itself.

IBM PowerPC 604e

The IBM 604e processor is one of IBM's most successful processors. It's a 32-bit CPU for the lower-end server market and for high-end workstations. The 604e is similar in target market to the lower-end UltraSPARC II processors, with the main difference being its 32-bit implementation. The 604e supports 4GB of memory per CPU and has 32KB level 1 data and instruction caches.

The processor operates at 250, 300, 333, 350, and 375MHz. These days, you can only find the 604e on IBM systems such as the RS/6000 B50 (a rack-mounted Telco/utility server) and the RS/6000 150, which is a small business server.

Although the 604e is a quality processor, it isn't the processor of choice for a production WebSphere-based environment with the typical needs of an online application.

As you'll see, the 64-bit Power4 processors are a big step ahead in terms of performance and scalability.

IBM PowerPC 750

The PowerPC 750 is where IBM's Power architecture began to be seen as a formidable player in the high-end RISC market. There were several models of the 750; the most notable, and the ones I'll discuss, are the PowerPC 750, the 750Cxe, and the 750FX. You'll look at each of these in further detail.

PowerPC 750

The PowerPC 750 was a 32-bit RISC processor that IBM released in the late 1990s. Building on the success of the Power-based CPU architecture, the 750 provided high performance. Even though it was a 32-bit processor, it performed well alongside the 64-bit CPUs of the time.

The PowerPC 750 comes with a 32-bit address bus (used for memory addressing and pointers, in other words 32-bit data types) and a 64-bit data bus for moving data. It also comes with independent on-CPU 32KB instruction and 32KB data level 1 caches as well as an expandable level 2 cache supporting Static Random Access Memory (SRAM) expansion in 1MB increments.

The PowerPC 750 can execute up to six instructions per clock cycle for some data types, while most other instructions take only a single clock cycle to execute.

Like most of the PowerPC/Power-based processors, the PowerPC 750 can support both big and little endian modes, allowing multiple manufacturers such as Apple, Motorola, and IBM to use the same common processor platform.
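To make the endian distinction concrete, here's a minimal sketch in Python. It isn't tied to any PowerPC toolchain; it simply shows how the same 32-bit value is laid out in memory under each byte order. The sample value and the helper name as_hex are arbitrary choices for illustration.

```python
import struct

VALUE = 0x0A0B0C0D  # arbitrary 32-bit sample value (an assumption for illustration)

# ">" selects big-endian layout: most significant byte stored first.
big_endian = struct.pack(">I", VALUE)
# "<" selects little-endian layout: least significant byte stored first.
little_endian = struct.pack("<I", VALUE)


def as_hex(data: bytes) -> str:
    """Render a byte string as space-separated hex pairs."""
    return " ".join(f"{byte:02x}" for byte in data)


print("big endian:   ", as_hex(big_endian))
print("little endian:", as_hex(little_endian))
```

A processor that can switch between the two modes can share memory images and data formats with software written for either convention.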

Apart from the 32-bit factor, the only major limitation of the PowerPC 750 is that, like UltraSPARC II processors, it only supports 4GB of memory per CPU. Overall, the mainstream PowerPC 750 supports speeds of 233MHz to 500MHz.
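The 4GB ceiling is what you'd expect from 32-bit addressing. As a quick check, the following sketch computes how many bytes a 32-bit address can reach, assuming byte-addressable memory and 1GB taken as 1024^3 bytes.

```python
ADDRESS_BITS = 32  # width of the PowerPC 750 address bus, per the text

# Each distinct address selects one byte (byte-addressable memory assumed).
addressable_bytes = 2 ** ADDRESS_BITS
addressable_gb = addressable_bytes // (1024 ** 3)

print(f"{addressable_bytes} bytes = {addressable_gb}GB")
```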

The system bus clock speed starts at 66MHz for the lower-end 233MHz versions of the PowerPC 750 and reaches 100MHz in most versions. Some PowerPC 750 models sit in between, with a system bus speed of 83MHz.

PowerPC 750Cxe

At a high level, the PowerPC 750Cxe is similar in architecture to the "stock" PowerPC 750 model. It has the same amount of level 1 cache but moves away from the PowerPC 750's design of having a large, off-processor level 2 cache. The PowerPC 750Cxe's on-processor design supports up to 256KB of level 2 cache.

Processor or clock speeds are nearly double that of the "stock" PowerPC 750 processor; the PowerPC 750Cxe supports up to 700MHz with the entry-level models coming in at 400MHz. System bus speeds for the 750Cxe start at a higher rate than that of the "stock" PowerPC 750, commencing at 100MHz with the highest being 133MHz.

PowerPC 750FX

The PowerPC 750FX is, like the PowerPC 750Cxe, similar to the "stock" PowerPC 750 processor. The key differences for this processor are the additional 256KB of level 2 cache, taking level 2 up to 512KB, and the ability of the processor to scale to 1GHz and beyond.

IBM made other internal upgrades to this processor, such as widening various internal buses, including the bus to the level 1 data cache. The stock PowerPC 750 processor operated a 64-bit bus to this particular cache; the 750FX has been upgraded to support 256 bits.

IBM RS64-III and RS64-IV

The RS64 processor family is a 64-bit RISC platform that was originally designed for the AS/400 and RS/6000 systems from IBM. Both processors, although originally produced a few years ago, continue to provide high-performing options for various models of IBM systems such as the p660 and p680.

I won't present the RS64 range of processors in great detail because of its limited use in the WebSphere application server market. However, as a high-level overview, the processors range from speeds of 262MHz (for the RS64-III) to more than 750MHz (for the RS64-IV).

Both processors support fairly large independent level 1 data and instruction caches of 128KB and off-CPU level 2 caches of up to 8MB in size for the RS64-III and 16MB for the RS64-IV.

IBM Power4

Industry analysts once considered the IBM Power4 processor architecture an ambitious project: the Power4 processor boasts 170 million transistors, compared to the 52 million transistors of the Pentium 4 processor. Nevertheless, IBM came through and released the Power4 core architecture.

Essentially, the Power4 core is a dual-core CPU with three levels of cache (a level 1, a level 2, and a level 3 cache). Because of all the added complexity within the Power4 core, IBM added a number of other components to the processor architecture to facilitate the additional data flow.

The twin processor cores (which hold the standard processor-type components such as issue and execution units and the instruction and data level 1 caches) operate together yet share a level 2 cache and a component known as the fabric controller (discussed shortly).

A key point is that the Power4 level 2 cache memory operates at the same speed as the clock rate. Therefore, with a standard processor clock rate of 1.3GHz, the level 2 cache is being accessed at the same cycle rate as the core. The two cores in the Power4 can communicate with the level 2 shared cache memory at more than 100GB per second.
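To get a feel for what that bandwidth figure means per clock tick, the following back-of-the-envelope sketch divides the quoted bandwidth by the clock rate. It treats "more than 100GB per second" as a lower bound of exactly 100GB/s, takes 1GB as 10^9 bytes, and assumes the bandwidth is split evenly between the two cores; all three are simplifying assumptions on my part rather than figures from IBM.

```python
CLOCK_HZ = 1_300_000_000                    # 1.3GHz core and level 2 cache clock
L2_BANDWIDTH_BYTES_PER_SEC = 100 * 10**9    # lower bound; 1GB = 10^9 bytes (assumption)
CORES = 2                                   # dual-core Power4 chip

# Bytes the shared level 2 cache can move per clock cycle, in total.
bytes_per_cycle = L2_BANDWIDTH_BYTES_PER_SEC / CLOCK_HZ
# Even split between cores (assumption; real traffic is rarely symmetric).
bytes_per_cycle_per_core = bytes_per_cycle / CORES

print(f"total:    {bytes_per_cycle:.1f} bytes per cycle")
print(f"per core: {bytes_per_cycle_per_core:.1f} bytes per cycle")
```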

The fabric controller is essentially a switch that's responsible for switching and managing the data flow around the processor. It also supports the processor-to-processor communications for SMP-based multiprocessor systems.

The fabric controller also interconnects the processing engines to the rest of the system via what's known as a GX controller. Although the fabric controller switches the traffic and data around the processor itself, the GX controller is responsible for the messaging and controlling of data in and out of the processor.

Another component worth mentioning is the level 3 cache. The level 3 cache is a whopping 32MB in size per Power4 processor. Because level 3 cache memory is shared across processor modules and managed by the level 3 memory controller on each module, adding processor modules increases both the amount of level 3 cache available to the modules and the available bandwidth to it.

When you add CPU modules to build larger multiprocessor systems, the modules connect via a ring topology and are controlled by the intermodule bus controller. This design allows for the sharing of level 3 cache memory.

The Power4 provides 64KB of level 1 instruction cache and 32KB of level 1 data cache. The level 2 cache is 1.5MB in size and is shared between both cores on a dual-core Power4 chip.

Multiple CPU modules can be interconnected to form 8-way, 16-way, 24-way, and 32-way configurations.

Each processor core can support up to 16GB of system memory, with the memory-to-processor data transfer rate peaking at 205GB/s with the 1.3GHz cores. Compared with the 800MHz front-side buses (FSBs) of the high-end Intel Pentium 4 processors, the Power4 currently operates at a lower 433MHz between system memory and processor cores. Although the bus clock rate is lower in the Power4 than it is with the high-end Intel processors, the transfer rate is still comparable.

The Power4 processor comes in two primary versions (as opposed to the PowerPC/Power4 derivatives discussed next). The first Power4 version is touted by IBM as being optimized for data-intensive applications (for example, scientific, financial modeling, and so on); it's the Power4 HPC processor.

This model of Power4 processor runs at 1.3GHz with only one core instead of the much-touted dual-core design. The level 2 cache is therefore used by a single core only, essentially doubling the level 2 cache available to that core.

The second version of the Power4 is the Power4 standard. This model of Power4 is primarily what I covered previously. It comes with the standard dual-core design and shared level 2 cache and is the more general-purpose processor (if such a processor can be called that!).

The high-end IBM pSeries 690 server can operate both these Power4 processor models.

From a core-processing perspective, each core (either of the two cores in the Power4 standard, or the single core in the Power4 HPC) can execute a multiply/add instruction each cycle, or four floating-point instructions per clock cycle. In summary, each core is capable of executing up to eight instructions per clock cycle. The Power4 has a long 12-stage pipeline, which helps it compete with the deep Intel and AMD pipelines (up to 20 stages for the Pentium 4).
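As a rough illustration of what those per-cycle figures imply, the following sketch multiplies them by the 1.3GHz clock rate mentioned earlier. These are theoretical peaks only; the function name peak_per_second is my own, and real workloads (WebSphere included) will sustain far less than the peak.

```python
CLOCK_HZ = 1_300_000_000        # 1.3GHz Power4 core clock, per the text
MAX_INSTRUCTIONS_PER_CYCLE = 8  # per core, per the text
FP_OPS_PER_CYCLE = 4            # floating-point instructions per cycle per core
CORES_PER_CHIP = 2              # Power4 standard; the Power4 HPC has one core


def peak_per_second(per_cycle: int, clock_hz: int, cores: int = 1) -> int:
    """Theoretical peak operations per second (an upper bound, never sustained)."""
    return per_cycle * clock_hz * cores


core_instr = peak_per_second(MAX_INSTRUCTIONS_PER_CYCLE, CLOCK_HZ)
core_fp = peak_per_second(FP_OPS_PER_CYCLE, CLOCK_HZ)
chip_fp = peak_per_second(FP_OPS_PER_CYCLE, CLOCK_HZ, CORES_PER_CHIP)

print(f"peak instructions per core: {core_instr / 1e9:.1f} billion/s")
print(f"peak floating point per core: {core_fp / 1e9:.1f} billion/s")
print(f"peak floating point per dual-core chip: {chip_fp / 1e9:.1f} billion/s")
```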

The Power4 is considered to be the most powerful single processor in the IT industry today.

IBM PowerPC 970

The PowerPC 970 is essentially a trimmed-down version of the Power4 processor architecture.

The PowerPC 970 is a 64-bit 1.4GHz to 1.8GHz superscalar processor that offers most of the features of the Power4 processor architecture, with a few modifications.

The main difference is the change from the Power4's dual-core design back to a single-core design. The resulting clock speed of the single core is higher; however, with the implementation of the PowerPC 970 on a smaller die, the processor has also done away with the level 3 cache. Also gone is the complex fabric controller found in the Power4 flagship processor.

The PowerPC 970 is essentially IBM's much-anticipated mainstream version of the Power4 processor. The target markets for the PowerPC 970 are entry-level servers and high-end workstations. The addition of many new Single Instruction/Multiple Data (SIMD) instructions makes the PowerPC 970 popular for workstation-based computing as well as server-based computing.

The make-up of the PowerPC 970 includes a 64KB level 1 instruction cache, a 32KB level 1 data cache, and a 512KB level 2 cache.

From an internal perspective, the PowerPC 970 boasts similar features to those of the UltraSPARC III. The impressive feat is that the PowerPC 970 incorporates a 900MHz processor bus (somewhat equivalent to the Intel/AMD FSB), which delivers peak transfer rates of more than 6.5GB per second.

The PowerPC 970 also includes a number of graphics-oriented instruction sets, none of which makes any difference to the operational performance of a WebSphere-based environment.

The PowerPC 970 does, however, come close to the Pentium 4's super-pipelining capabilities, boasting a 16-stage pipeline for integer instructions and scaling up to a 25-stage pipeline for SIMD instructions.

Comparison Chart: IBM PowerPC Processors

From a WebSphere application server perspective, the PowerPC processors offer good performance. And, as with the other three vendors you've looked at (AMD, Sun, and Intel), the differences in performance between the older and newer model processors truly show how Moore's law is working.

Note  

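Moore's law is popularly paraphrased as stating that processor performance will double every 18 months; the original observation concerned the number of transistors on a chip.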
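To see what an 18-month doubling period implies over a few processor generations, here's a small sketch. The function name projected_speedup is hypothetical, and the result is a rule-of-thumb projection rather than a measured figure for any of the processors in this chapter.

```python
def projected_speedup(months: float, doubling_period_months: float = 18.0) -> float:
    """Relative performance after `months`, assuming one doubling per period."""
    return 2 ** (months / doubling_period_months)


# Projected relative performance at a few intervals.
for months in (12, 18, 36, 54):
    print(f"after {months} months: x{projected_speedup(months):.2f}")
```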

The following summarizes the IBM processors:

Selecting Your PowerPC Platform

Now that you've seen the more common IBM-based processors at a high level, you'll get an overview of where each of them would fit in different-sized environments.

Again, as in the previous SPARC and Intel/AMD sections, these recommendations are based on approximate sizes. You should ensure proper capacity planning so that the processor performance and characteristics match those you require for your WebSphere environment.

So, Which CPU?

Based on the CPUs you've seen so far, the following sections offer some guidelines and recommendations on where you could use the IBM-based processors and servers within a WebSphere environment.

Small Production Environments

For smaller WebSphere production environments where either a single WebSphere application server or two or more small servers are required, there are several choices.

First, I recommend the Power4 processor for all WebSphere-based needs unless there's a compelling reason to go with an older processor such as a PowerPC 604e or PowerPC 750, or even an RS64.

Based on this recommendation, consider the IBM p630-based servers. These servers support up to a four-way configuration with up to 32GB of system memory. These systems are capable of Dynamic Logical Partitioning (DLPAR) and Logical Partitioning (LPAR) with up to four partitions, thus allowing you to build a WebSphere cluster using internally partitioned components within a single frame. The p630 also supports hot-swappable internal drives and up to six Peripheral Component Interconnect (PCI) slots.

One processor I didn't discuss in detail in the previous sections is the Power3 processor. If budgets are constrained and IBM is your vendor of choice, an RS/6000 server with a Power3-II processor costs less.

Check with your local IBM vendor representative or reseller about the support and service availability of these server models and CPUs.

Medium Production Environments

A medium-sized WebSphere environment will typically host between 100 and 500 concurrent users and require a high degree of availability and redundancy.

As you saw with the Sun SPARC platforms, you can take two paths here. First, it's possible to take the servers listed in the "Small Production Environments" section and horizontally scale them (in other words, use more smaller servers) to a point where, instead of having two p630 servers, you have four p630 servers. This would provide a high degree of processing power (up to 16 Power4 CPUs) and up to 128GB of memory between all the systems.
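The arithmetic behind that horizontal-scaling example is simple enough to sketch. The following assumes every server is fully populated; the ServerModel class and aggregate_capacity function are illustrative names of my own, and only the p630 limits come from this chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerModel:
    name: str
    max_cpus: int
    max_memory_gb: int


# Limits quoted in the text for a fully configured p630.
P630 = ServerModel("p630", max_cpus=4, max_memory_gb=32)


def aggregate_capacity(model: ServerModel, count: int) -> tuple[int, int]:
    """Total CPUs and memory (GB) across `count` fully populated servers."""
    if count < 0:
        raise ValueError("count must be non-negative")
    return model.max_cpus * count, model.max_memory_gb * count


cpus, memory_gb = aggregate_capacity(P630, 4)
print(f"4 x {P630.name}: {cpus} CPUs, {memory_gb}GB of memory")
```

Keep in mind that the totals are spread across separate servers, so no single JVM can use more than one server's share of CPUs or memory.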

However, as I touched on previously, you must be confident that your WebSphere application requirements from a processing and memory point of view don't exceed your lower-end servers.

Remember the earlier discussion on the ratio of operating system threads to JVM threads? As a guide, one CPU per running Java application JVM is a safe starting point. I've been involved in environments where the ratio of CPUs to JVMs is higher, in the vicinity of three CPUs per JVM. Typically this is caused by applications that are memory hungry, where there's a constant need for garbage collection. Remember, like most of these guidelines, your mileage is going to vary depending on the structure and characteristics of your application. It'll also vary depending on your platform of choice, such as Power4, Intel, SPARC, and so on. Based on a rule of one JVM per CPU, a fully featured IBM p630 server should operate, at maximum, with four WebSphere-based applications.
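If you want to apply this rule of thumb during capacity planning, a sketch such as the following may help. The function name max_jvms and its cpus_per_jvm parameter are hypothetical; the default of one CPU per JVM is the starting point described above, and three CPUs per JVM reflects the memory-hungry case.

```python
def max_jvms(cpus: int, cpus_per_jvm: int = 1) -> int:
    """Upper bound on application JVMs for a server, given a CPUs-per-JVM rule."""
    if cpus < 0:
        raise ValueError("cpus must be non-negative")
    if cpus_per_jvm < 1:
        raise ValueError("cpus_per_jvm must be at least 1")
    # Whole JVMs only: leftover CPUs don't justify another JVM under this rule.
    return cpus // cpus_per_jvm


print("four-way p630, 1 CPU per JVM: ", max_jvms(4))
print("four-way p630, 3 CPUs per JVM:", max_jvms(4, cpus_per_jvm=3))
print("16-way server, 3 CPUs per JVM:", max_jvms(16, cpus_per_jvm=3))
```

Treat the result as a ceiling to validate with load testing, not as a target.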

Note  

This ratio of JVMs to CPUs isn't a WebSphere-specific requirement guide. It's driven more by how the JVM operates. And, depending on your JVM vendor, this will also vary!

If your pure processor power, number of active JVMs, or amount of memory required exceeds something like a p630 or another smaller, lower-end server, the next models up from IBM will support up to 8-way and 16-way processing and up to 256GB of memory.

The 16-way processor systems lean toward the larger WebSphere environments, especially if you're operating multiple nodes.

Two key systems that IBM produces that are well suited to medium-sized WebSphere environments are the p650 and p655. The p650 is a rack-mounted system that supports up to eight-way Power4 processor capability and 64GB of memory. Like other IBM systems operating the Power4 range, the p650 and p655 support DLPAR/LPAR partitioning, so, again, you could deploy one or two of these and partition the servers to meet your requirements. The p655 is a chassis-based server that's eight-way and supports 32GB of memory. This server also supports both Power4 HPC and Power4 standard processor modules.

Large Production Environments

For large Power4-based WebSphere environments operating anywhere from 500 to 10,000 concurrent sessions, there are two primary choices: the p670 and the p690. The p670 has up to 16-way processor support and 256GB of memory, and the p690 has up to 32-way processor support and 512GB of memory.

Again, like other IBM Power4-based systems, these two servers come with all the partitioning capabilities to be able to split them down to multiple nodes per chassis.
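The sizing guidance across the small, medium, and large sections can be summarized as a simple lookup. The following sketch is only a starting point for capacity planning: the cut-off of fewer than 100 users for "small" is my inference from the medium range, the decision to place exactly 500 users in the medium tier is arbitrary, and the names recommend_tier and CANDIDATE_SERVERS are hypothetical.

```python
# Candidate servers per environment size, as named in this chapter.
CANDIDATE_SERVERS = {
    "small": ["p630"],
    "medium": ["p650", "p655"],
    "large": ["p670", "p690"],
}


def recommend_tier(concurrent_users: int) -> str:
    """Map a concurrent user count to an environment size (approximate guide)."""
    if concurrent_users < 0:
        raise ValueError("concurrent_users must be non-negative")
    if concurrent_users < 100:       # assumed threshold, not stated in the text
        return "small"
    if concurrent_users <= 500:      # medium: 100 to 500 concurrent users
        return "medium"
    if concurrent_users <= 10_000:   # large: 500 to 10,000 concurrent sessions
        return "large"
    return "beyond single-chassis guidance"


for users in (50, 300, 2000):
    tier = recommend_tier(users)
    print(users, tier, CANDIDATE_SERVERS.get(tier, []))
```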

Where Are the Large IBM Servers?

A question I regularly get asked is, "Why does IBM only support up to 32-way in a single chassis?" IBM has a slightly different architecture plan than that of Sun: In order to obtain more processing power per physical server (in other words, more than 32-way), IBM has a number of advanced clustering solutions that allow you to cluster your servers.

However, at the end of the day, considering WebSphere as your platform engine, I don't recommend architecting your WebSphere or J2EE applications to require more than 32 JVMs in a single server.

In future chapters, I'll cover running multiple groups of servers or partitions to keep your maximum CPU count per server (not per chassis), domain, or partition down to 24 to 28 CPUs.

For WebSphere to be able to support this many active JVMs is an immense task. The better option is to split your servers into common tiers such as a utility tier, a batch tier, and an Enterprise Integration Layer (EIL) tier.

Using this model, you could purchase a pair of IBM p690s and partition them like so:

Partition A may be the utility tier, Partition B the batch tier, and Partition C the EIL tier. Then you'd mirror this configuration to the second p690 for frame redundancy.

Should you find that the 24 CPUs of processing power for partitions A and B aren't enough, purchase an additional p690 or lower-end p670 and add it to the WebSphere cluster associated with the first two partitions (in other words, the utility and batch tiers).

I'll cover this method of WebSphere platform architecting in greater detail in Chapter 5.

Platform Summary

As you've seen, the IBM PowerPC and Power4 range of processors are high quality and high performing. Without a doubt, the Power4 processor is one of the market leaders in terms of its performance and scalability capabilities. Remember, when considering Power4 processors, each Power4 chip effectively has a dual core.

Although the second core doesn't deliver a 100-percent performance improvement over a single-core Power4 processor, it does provide a large performance advantage, given that the dual cores are cross-connected via high-speed buses, greatly reducing the latency between processor cores. This is unlike other SMP-based processor architectures, where latency can sometimes, but not always, be an issue for large-scale SMP performance.
