Networking Concepts and Technology: A Designer's Resource


In 1994, Netscape Communications proposed SSL 1.0 and shipped the first products with SSL 2.0. SSL 3.0 was introduced to address the limitations of SSL 2.0 in cryptographic security and functionality. Transport Layer Security (TLS) was created as an open standard to prevent any one company from controlling this important technology. However, even though Netscape was granted a patent for SSL, SSL is now the de facto standard for secured Web transactions.

This section provides a brief overview of the SSL protocol, and it then describes strategies for deploying SSL processing in the design of data center network architectures.

SSL Protocol Overview

The basic operation of SSL includes the following phases:

  1. Initial Full Handshake. The client and server authenticate each other (client authentication is optional), exchange keys, negotiate preferred cryptographic algorithms (such as RSA or 3DES), and perform a CPU-intensive public key computation. This full handshake can occur again during the life of a client-server communication if the session information is not cached or reused and needs to be regenerated. The Handshake Layer description below gives more detail.

  2. Data Transfer Phase (Bulk Encryption). Once the session is established, data is authenticated and encrypted using keys derived from the master secret.

A typical Web request can span many HTTP requests, requiring that each HTTP session establish an individual SSL session. The resulting performance cost might not be justified by the marginal incremental security benefit. Hence, a technique called SSL resumption can be exploited to save the session information for a particular client connection that has already been authenticated at least once.
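The effect of SSL resumption can be illustrated with a small model. The following is a minimal sketch, not an SSL implementation: the `SessionCache` class and its counters are hypothetical names introduced here, and random bytes stand in for the result of the public key exchange. It only shows how a cached session lets later connections skip the expensive full handshake.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class SessionCache:
    """Toy server-side cache mapping session IDs to master secrets."""
    sessions: dict = field(default_factory=dict)
    full_handshakes: int = 0
    resumed_handshakes: int = 0

    def handshake(self, session_id=None):
        """Return (session_id, master_secret), resuming when possible."""
        if session_id is not None and session_id in self.sessions:
            # Abbreviated handshake: reuse the cached master secret and
            # skip the CPU-intensive public key operation.
            self.resumed_handshakes += 1
            return session_id, self.sessions[session_id]
        # Full handshake: random bytes stand in for the key exchange result.
        self.full_handshakes += 1
        new_id = secrets.token_bytes(32)
        master_secret = secrets.token_bytes(48)
        self.sessions[new_id] = master_secret
        return new_id, master_secret


cache = SessionCache()
sid, secret = cache.handshake()        # first connection: full handshake
for _ in range(9):                     # nine further connections resume it
    sid, resumed_secret = cache.handshake(sid)
    assert resumed_secret == secret
print(cache.full_handshakes, cache.resumed_handshakes)
```

In this model, ten connections from the same client cost one full handshake and nine abbreviated ones.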

SSL is composed of two sublayers:

  • Record Layer. This layer operates in two directions:

    • Downstream. The record layer receives clear messages from the handshake layer. The record layer fragments the messages, optionally compresses them, applies a Message Authentication Code (MAC), encrypts them, and encapsulates them in records before sending them downstream to the TCP layer.

    • Upstream. The record layer receives data from the TCP layer, decapsulates and decrypts the records, runs a MAC verification, decompresses and reassembles the messages, and then sends them to higher layers.

  • Handshake Layer. This layer exchanges messages between client and server in order to exchange public keys, negotiate and advertise capabilities, and agree on:

    • SSL version

    • Cryptographic algorithm

    • Cipher suite

    The cipher suite specifies the key exchange method, the data transfer cipher, and the message digest used for the Message Authentication Code (MAC). SSL 3.0 supports a variety of key exchange algorithms.
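The two directions of the record layer described above can be sketched in a few lines. This is an illustration under stated assumptions, not the SSL record format: the record header is omitted, HMAC-SHA1 is used as in TLS (SSL 3.0 uses a related but different MAC construction), and `_keystream_xor` is a placeholder cipher invented for this sketch that must never be used for real security. The function names `send` and `receive` are also hypothetical.

```python
import hashlib
import hmac
import zlib

MAX_FRAGMENT = 2 ** 14  # plaintext fragments carry at most 16,384 bytes


def _keystream_xor(key: bytes, seq: int, data: bytes) -> bytes:
    """Placeholder cipher (NOT secure): XOR with a SHA-256-derived keystream."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        seed = key + seq.to_bytes(8, "big") + counter.to_bytes(4, "big")
        stream.extend(hashlib.sha256(seed).digest())
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def send(message: bytes, enc_key: bytes, mac_key: bytes) -> list:
    """Downstream path: fragment, compress, MAC, encrypt."""
    records = []
    for seq, start in enumerate(range(0, len(message), MAX_FRAGMENT)):
        fragment = message[start:start + MAX_FRAGMENT]
        compressed = zlib.compress(fragment)
        mac_input = seq.to_bytes(8, "big") + compressed
        tag = hmac.new(mac_key, mac_input, hashlib.sha1).digest()
        records.append(_keystream_xor(enc_key, seq, compressed + tag))
    return records


def receive(records: list, enc_key: bytes, mac_key: bytes) -> bytes:
    """Upstream path: decrypt, verify MAC, decompress, reassemble."""
    fragments = []
    for seq, record in enumerate(records):
        plain = _keystream_xor(enc_key, seq, record)
        compressed, tag = plain[:-20], plain[-20:]  # SHA-1 tags are 20 bytes
        mac_input = seq.to_bytes(8, "big") + compressed
        expected = hmac.new(mac_key, mac_input, hashlib.sha1).digest()
        if not hmac.compare_digest(tag, expected):
            raise ValueError("MAC verification failed")
        fragments.append(zlib.decompress(compressed))
    return b"".join(fragments)


# Round trip: a 56,000-byte message is split into four records.
msg = b"GET /index.html HTTP/1.0\r\n\r\n" * 2000
records = send(msg, b"enc-key", b"mac-key")
assert receive(records, b"enc-key", b"mac-key") == msg
```

Including the record sequence number in the MAC input means that a reordered or replayed record fails verification, which is why the receiver checks the MAC before decompressing.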

FIGURE 4-15 illustrates a condensed overview of the SSL protocol exchanges.

Figure 4-15. High-Level Condensed Protocol Overview

Once the first set of messages is successfully completed, an encrypted communication channel is established.

The following sections describe the differences between using a pure software solution and an SSL accelerator appliance in terms of packet processing and throughput.

We will not be discussing SSL in depth. The purpose of this section is to describe the different network architectural deployment scenarios you can apply to SSL processing. The following sections describe various approaches to scaling SSL processing capabilities from a network architecture perspective.

SSL Acceleration Deployment Considerations

One of the fundamental limitations of SSL is performance. When SSL is added to a Web server, performance drops dramatically because of the strain on the CPU caused by the mathematical computations and the number of sessions that constantly need to be set up. There are three common SSL approaches:

  • Software-SSL Libraries. This approach uses the bundled SSL libraries and offers the most cost-effective option for processing SSL transactions.

  • Crypto Accelerator Board. This approach can offer a massive improvement in performance for SSL processing for certain types of SSL traffic. "Conclusions Drawn from the Tests" suggests when best to use the Sun™ Crypto Accelerator 1000 board, for example.

  • SSL Accelerator Appliance. This solution might have a high initial cost, but it proves to be very effective and manageable for large-scale SSL Web server farms. "Conclusions Drawn from the Tests" suggests when best to deploy an appliance such as Netscaler or ArrayNetworks.

There are several deployment options for SSL acceleration. This section describes where it makes sense to deploy different SSL acceleration options. It is important to consider certain characteristics, including:

  • The level or degree of security

  • The number of client SSL transactions

  • The volume of bulk encrypted data to be transferred in the secure channel

  • Cost

  • The number of horizontally scaled SSL Web servers

Software-SSL Libraries Packet Flow

FIGURE 4-16 shows the packet flow for a software-based approach to SSL processing. Although the path seems direct, the SSL processing is bottlenecked by the millions of CPU cycles consumed in processing cryptographic algorithms such as RSA and 3DES.

Figure 4-16. Packet Flow for Software-based Approach to SSL Processing

The Crypto Accelerator Board Packet Flow

FIGURE 4-17 shows the packet flow using a PCI accelerator card for SSL acceleration. In this case, the incoming encrypted packet reaches the SSL libraries. The SSL libraries maintain various session information and security associations, but the mathematical computations are offloaded to the PCI accelerator card, which contains an ASIC that can compute the cryptographic algorithms in very few clock cycles. However, there is an overhead in transferring data to the card, as the PCI bus must first be arbitrated and traversed. Note that in the case of small data transfers, the overhead of PCI transfers might outweigh the benefit of the cryptographic acceleration offered by the card. Further, it is important to make sure the PCI slot used is a 64-bit, 66-MHz slot. Using a 32-bit slot could reduce performance.
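The trade-off between PCI transfer overhead and faster computation can be written as a simple break-even model. This is a sketch under assumed, linear costs: the function names are hypothetical, and the per-byte costs and bus overhead used in the example are invented for illustration, not measurements of any card.

```python
def software_time_ns(size_bytes, sw_ns_per_byte):
    """Time to process a message entirely on the host CPU."""
    return size_bytes * sw_ns_per_byte


def offload_time_ns(size_bytes, hw_ns_per_byte, bus_overhead_ns):
    """Time to process a message on the card, including a fixed PCI cost."""
    return bus_overhead_ns + size_bytes * hw_ns_per_byte


def break_even_bytes(sw_ns_per_byte, hw_ns_per_byte, bus_overhead_ns):
    """Message size above which offloading beats software processing."""
    if hw_ns_per_byte >= sw_ns_per_byte:
        raise ValueError("card must be faster per byte than software")
    return bus_overhead_ns / (sw_ns_per_byte - hw_ns_per_byte)


# Illustrative values only: 50 ns/byte in software, 10 ns/byte on the
# card, and 20,000 ns of fixed bus overhead per transfer.
threshold = break_even_bytes(50, 10, 20_000)
print(threshold)
```

Below the threshold the fixed bus cost dominates and software is faster; above it, the card's lower per-byte cost wins.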

Figure 4-17. PCI Accelerator Card Approach to SSL Processing (Partial Offload)

SSL Accelerator Appliance Packet Flow

FIGURE 4-18 illustrates how a typical SSL accelerator appliance can be exploited to reduce load on servers by offloading front-end client SSL processing. Commercial SSL accelerators at the time of this writing are all PC-based boxes with PCI accelerator cards. The operating systems and network protocol stack are optimized for SSL processing. The major benefit to the backend servers is that CPU cycles are freed up by not having to process thousands of client SSL transactions. The accelerator can either offload all SSL processing and forward cleartext to the server or terminate all client SSL connections and maintain only one SSL connection to the target server, depending on the customer's requirements.

Figure 4-18. SSL Appliance Offloads Frontend Client SSL Processing

SSL Performance Tests

To gain a better understanding of the trade-offs of the three approaches to SSL acceleration, we ran various tests using the Sun Crypto Accelerator 1000 board, Netscaler 9000 SSL Accelerator appliance, and ArrayNetworks SSL accelerator appliance.

Due to limited time and resources, we selected tests that enabled us to compare key attributes among approaches. In the first test, we compared raw SSL processing differences between SSL libraries and an appliance.

Test 1: SSL Software Libraries versus SSL Accelerator Appliance (Netscaler 9000)

In this test, we looked closely at CPU utilization and network traffic with a software solution. The load completely pinned the CPU, and it took over two minutes to complete 100 SSL transactions.

We then looked at CPU utilization and network traffic using an SSL appliance with the exact same server and client used in the first example. With this setup, it took under one second to complete 100 SSL transactions.

The main reason the SSL appliance is so much faster is that the appliance maintains a few long-lived SSL connections to the server. Hence the server is less burdened with repeating cryptographic computations and with setting up and tearing down SSL sessions, all of which are CPU intensive. The appliance terminates the SSL session between the client and the appliance and then reuses the SSL connection at the backend with the servers.

Figure 4-19. SSL Test Setup with No Offload

Test 1 (A) Software-SSL Libraries

We used an industry-standard benchmark load generator on the client to generate SSL traffic. Tests 1 (A) and 1 (B) both requested the same 100-megabyte file from the server.

One hundred requests were sent to the SSL Web server at a concurrency of 10 requests.
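The request pattern used here, a fixed number of requests with a bounded number in flight, can be sketched as follows. This is not the benchmark tool used in the tests; `run_load` is a hypothetical helper, and the `fetch` callable is a stand-in that the reader would replace with a real HTTPS request.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(fetch, total_requests=100, concurrency=10):
    """Call fetch() total_requests times with at most `concurrency` in flight."""
    if total_requests < 1 or concurrency < 1:
        raise ValueError("total_requests and concurrency must be positive")

    def timed_call(_):
        start = time.perf_counter()
        fetch()
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, range(total_requests)))
    wall_seconds = time.perf_counter() - wall_start
    return {
        "requests": len(latencies),
        "wall_seconds": wall_seconds,
        "mean_latency_seconds": sum(latencies) / len(latencies),
    }


# Stand-in workload: each "request" simply sleeps for one millisecond.
result = run_load(lambda: time.sleep(0.001), total_requests=100, concurrency=10)
print(result["requests"], round(result["wall_seconds"], 3))
```

To exercise a real server, `fetch` could wrap a call such as `urllib.request.urlopen(url).read()` against a URL of the reader's choosing. Each such call opens a new connection, so every request pays for a fresh SSL handshake unless sessions are reused.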

Test 1 (B) SSL Accelerator Appliance

We used the same load generator on the client, this time sending the SSL traffic through the Netscaler 9000 SSL Accelerator device. Both tests ran on the same 100-megabyte server file.

The performance gains using the SSL offload device were significant. Some of the key reasons include:

  • Hardware SSL implementation, including hardware coprocessor for mathematically intensive computations of cryptographic algorithms.

  • Reuse of backend SSL tunnel. Keeping one SSL tunnel alive and reusing it offloads most of the SSL work from the server.

We ran the benchmark load generator on the client (deepak2). The client points to the VIP on the Netscaler, which terminates one side of the SSL connection. The Netscaler then reuses the backend SSL connection. This arrangement is also more secure because the client is unaware of the backend servers and hence can do less damage. The command was:

#abc -n 100 -c 10 -v 4 http://129.146.138.52:443/100m.file1 >./netscaler100mfel1n100c10.softwareonly

612 packets were transferred to complete 100 SSL handshakes in less than one second!

Test 2: Sun Crypto Accelerator 1000 Board

In this test set, we leveraged performance tests of the Sun Crypto Accelerator 1000 board done by the Performance Availability and Engineering group. The test setup consisted of a Sun Fire™ 6800 with eight 900-MHz UltraSPARC™ III processors and a single Sun Crypto Accelerator 1000 board. FIGURE 4-20 shows that throughput with the software approach increases linearly as the number of processors increases, whereas throughput with the Sun Crypto Accelerator 1000 board stays nearly constant at 500 Mbit/sec. The tests show that the accelerator board delivers the most benefit when the minimum message size exceeds 1000 bytes. If the messages are too small, the benefit of the card acceleration does not outweigh the overhead of diverting SSL computations from the CPU to the board and back.

Figure 4-20. Throughput Increases Linearly with More Processors

Test 3: SSL Software Libraries versus SSL Accelerator Appliance (Array Networks)

In this set of tests, we performed more detailed tests to better understand not only the value of the SSL appliance, but also the impact of threads and file size. FIGURE 4-21 shows the basic test setup for the SSL software test, where a Sun Enterprise™ 450 server used as the client was sufficient to saturate the Sun Blade™ server.

Figure 4-21. SSL Test Setup for SSL Software Libraries

FIGURE 4-22 shows the SSL appliance tests. More client capacity was required to saturate the servers, so we added two Sun Fire 3800 servers to the Enterprise 450 server. The reason for this was that the SSL appliance terminated the SSL connection, performed all SSL processing, and maintained very few socket connections to the backend servers, thereby reducing the load on the servers.

Figure 4-22. SSL Test Setup for an SSL Accelerator Appliance

FIGURE 4-23 suggests that there is a sweet spot for the number of threads used by the client load generator. After a certain point, performance drops. This suggests that SSL processing in the software-only approach benefits from increased threads only up to a certain point. These are initial tests and not comprehensive by any means. Our intent is to show that this is one potentially important configuration consideration, which might be beyond the scope of pure design.

Figure 4-23. Effect of Number of Threads on SSL Performance

FIGURE 4-24 shows the impact of file size on SSL performance. Note that these are SSL-encrypted bulk files. The SSL appliance dramatically increases SSL throughput for large files. However, the number of transactions decreases in direct proportion to the file size. The link was a 1-gigabit pipe, which can support 125 MByte/sec throughput. The results show that the limiting factor actually is not the network pipe.
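The 125 MByte/sec figure, and the way the transaction ceiling falls as file size grows, follow from simple arithmetic. The sketch below computes the upper bound that the link alone would impose. The file sizes are illustrative values chosen here, not the sizes used in the tests, and the calculation ignores protocol overhead.

```python
def link_capacity_bytes_per_sec(link_bits_per_sec):
    """Convert a link rate in bits per second to bytes per second."""
    return link_bits_per_sec / 8


def max_transactions_per_sec(link_bits_per_sec, file_size_bytes):
    """Upper bound on whole-file transfers per second over the link."""
    return link_capacity_bytes_per_sec(link_bits_per_sec) / file_size_bytes


GIGABIT = 1_000_000_000  # 1 Gbit/sec

# Illustrative file sizes: 10 KB, 1 MB, and 10 MB.
for size in (10_000, 1_000_000, 10_000_000):
    print(size, max_transactions_per_sec(GIGABIT, size))
```

Measured transaction rates well below these ceilings indicate that something other than the network, such as SSL processing, is the bottleneck.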

Figure 4-24. Effect of File Size on SSL Performance

Conclusions Drawn from the Tests

The software solution is best used in situations that require relatively low SSL transaction throughput, which is typical for sign-on and credit card Web-based transactions, where only certain aspects require SSL encryption.

The PCI accelerator card dramatically increases performance at relatively low cost. The PCI card also offers true end-to-end security and is often desirable for highly secure environments.

The accelerator appliance can be installed in an existing infrastructure and can offer very good performance. The servers do not need to be modified, and only one device must be managed for SSL acceleration. Another benefit is that the appliance exploits the fact that not every server will be loaded with SSL at the same time. Hence, from a utilization standpoint, an appliance is more economical.
