7.3 Load-splitting and load-sharing techniques

One of the ways you can scale up device performance is by spreading the traffic load between multiple devices. There are several ways to achieve this; the following list is not meant to be exhaustive:

The key to this problem is to split traffic so that each device deals with a fair proportion of the overall load. The first issue to address is whether to do this on a per-packet basis or on a per-flow basis. The per-packet mechanism (e.g., round robin) is likely to give fairer resource allocation but is also more likely to cause problems with retransmissions (if end systems cannot resequence packets); it may also be harder to debug and to generate accounting data for. If the load is to be spread among firewalls, state synchronization may rely on flows being handled deterministically (the flow, or session, goes in via one interface and returns through that same interface, rather than sessions being asymmetrical).
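
To make the distinction concrete, here is a minimal Python sketch (purely illustrative; the device names and hash choice are assumptions, not a description of any particular product) contrasting per-packet round robin with per-flow assignment, where sorting the endpoints keeps both directions of a session on the same device:

import itertools
import zlib

devices = ["fw-1", "fw-2", "fw-3"]   # hypothetical pool of parallel devices
rr = itertools.cycle(devices)

def per_packet(_packet):
    # Per-packet round robin: successive packets may leave via different devices.
    return next(rr)

def per_flow(src_ip, dst_ip, src_port, dst_port, proto):
    # Per-flow assignment: sort the endpoints so that both directions of a
    # session hash identically, keeping the whole flow on one device.
    a, b = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    key = f"{a}-{b}/{proto}".encode()
    return devices[zlib.crc32(key) % len(devices)]

print(per_packet(None), per_packet(None))               # likely two different devices
print(per_flow("10.0.0.1", "192.0.2.7", 33000, 80, 6))  # stable for the session
print(per_flow("192.0.2.7", "10.0.0.1", 80, 33000, 6))  # return traffic, same device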

7.3.1 An introduction to server load-sharing algorithms

Load-sharing server clusters, fronted by a proxy server, are a common technique on internetworks, since they offer both resilience and scalable performance. Server clusters can be application specific (e.g., a DNS load sharer) or generic IP application servers. A number of algorithms are available for distributing new sessions fairly among the servers in a group. Load can be distributed nonintelligently or intelligently, based on cost, delay, server status, and other factors. It is useful to consider two environments for server load sharing: the case where the proxy server is local to the servers and the case where the proxy server is remotely located, as follows:

The two topologies are illustrated in Figure 7.8. There are basically two approaches we can use to monitor session load on server pool members: nonintrusive and intrusive. Nonintrusive algorithms use simple heuristic techniques to distribute requests. Intrusive algorithms require protocol interaction between each member of the server group and the proxy server, so that real-time status information can be passed between them.

Figure 7.8: (a) Local proxy configuration. (b) Remote proxy configuration.

Local load-sharing algorithms

With local proxy servers the selected algorithm should ideally use precise knowledge of real-time resource availability and system load for each host in the server pool, so that the host with the most unused capacity can be chosen. However, this is not so easy to achieve in practice. Some common nonintrusive algorithms are as follows:
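
As an illustration, two typical nonintrusive selectors, round robin and least connections, might be expressed roughly as in the following Python sketch (server names and bookkeeping are invented); both rely only on information the proxy itself already holds:

import itertools

servers = ["srv-a", "srv-b", "srv-c"]    # hypothetical local server pool
active = {s: 0 for s in servers}         # sessions the proxy has assigned so far
rr = itertools.cycle(servers)

def round_robin():
    # Hand out servers in strict rotation, ignoring load entirely.
    return next(rr)

def least_connections():
    # Pick the server with the fewest sessions currently assigned by the proxy.
    # Still nonintrusive: no status is reported by the servers themselves.
    return min(active, key=active.get)

def assign(selector):
    s = selector()
    active[s] += 1
    return s

for _ in range(3):
    print(round_robin(), assign(least_connections))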

Distributed load-sharing algorithms

When the proxy and the server cluster are geographically separated over large distances, the cost of reaching individual servers can vary markedly. In this case the load-sharing algorithm could use cost information to select servers based on proximity to the client. The algorithm would need to access routing tables for metric and reachability data from protocols such as OSPF to build a true picture of the cost of accessing each server in the group. The following algorithms are nonintrusive:

Intrusive load-sharing algorithms

All of the previous algorithms determine the capacity of servers using heuristic techniques. In practice it is necessary to use closely coupled techniques for accurate remote prediction and status monitoring of system capacity. This class of algorithm is called intrusive, since work is required on the part of the host system. Intrusive algorithms fall into the following broad categories:

Although active algorithms are a little more complex to implement and deploy, they offer a better solution for accurate load distribution, at the price of consuming additional resources on the server pool.
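
As a rough sketch of the active, intrusive style, assume a hypothetical agent on each pool member that periodically pushes a load report to the proxy; the JSON message format and field names below are invented purely for illustration:

import json
import time

# Proxy-side view of the pool, updated from agent reports.
pool_status = {}   # server name -> (reported load, timestamp of last report)

def handle_agent_report(message: str):
    # Process a status report pushed by a server-resident agent.
    report = json.loads(message)
    pool_status[report["server"]] = (report["load"], time.time())

def pick_server(max_age=10.0):
    # Choose the least-loaded server with a sufficiently fresh report;
    # stale entries are ignored so a silent server drops out of rotation.
    now = time.time()
    fresh = {s: load for s, (load, ts) in pool_status.items() if now - ts < max_age}
    return min(fresh, key=fresh.get) if fresh else None

handle_agent_report('{"server": "srv-a", "load": 72}')
handle_agent_report('{"server": "srv-b", "load": 35}')
print(pick_server())   # srv-b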

Dead host detection

One of the problems with server clusters is detecting host availability and dynamically adjusting for this as part of the normal load-sharing operations. In a naive implementation, where this functionality is not included, client sessions could be directed to unreachable host addresses (e.g., where the server is down). Many heuristic approaches are available, including the following:

Note that although ping is commonly used, it is potentially misleading if used to assess response times, since a host may turn pings around very quickly even if other resources on the system (such as disk I/O) are overloaded.
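
For this reason a TCP-level probe against the actual service port is often a more useful liveness test than ping. A minimal sketch follows (host names and ports are placeholders; this is not tied to any particular product):

import socket

def tcp_alive(host: str, port: int, timeout: float = 2.0) -> bool:
    # Return True if a TCP connection to host:port completes within timeout.
    # Unlike ping, this at least confirms the service is accepting connections,
    # although it still says nothing about disk or application-level health.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

pool = [("srv-a.example.com", 80), ("srv-b.example.com", 80)]
available = [(h, p) for h, p in pool if tcp_alive(h, p)]
print(available)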

TCP and UDP session use

By convention TCP/IP servers use TCP and UDP ports 0 to 1,023 to listen for incoming connections. Clients typically select source ports in the range of 1,024 to 65,535. However, this convention is not always adhered to; some clients use source port numbers in the 0 to 1,023 range, and some applications listen on ports in the 1,024 to 65,535 range.

7.3.2 DNS load sharing

A DNS load balancer can be used to improve both performance and network reliability by presenting a cluster of servers to the network under a single host name. The system acts as a proxy server, intercepting name resolution requests for a particular service and then cycling through a table of alternate IP addresses, distributing requests to servers based on their availability. This functionality only benefits users of DNS; it is not a general-purpose load-sharing system. DNS load balancers can be broadly divided into two groups, as follows:

The advantages of DNS proxies are that they are protocol transparent, simple to deploy, and offer improved performance and reliability. However, these advantages may be negated where name servers and Web browsers cache the IP address returned by the DNS server (commonly done for local performance reasons). These systems may also ignore the DNS Time-To-Live (TTL) value, defeating the possibility of dynamically adjusting load per session. When deploying DNS load balancers, careful attention should be paid to the design, since a DNS load balancer may become a single point of failure. A load balancer advertises its name and address to other DNS servers on the Internet, which cache the information. If the load balancer dies, remote clients may not be able to reach any of the servers in the cluster behind it until the cache in their local DNS server expires. The default timeout for DNS entries in many UNIX implementations is five minutes (although some servers may be configured to retain DNS information for more than 40 days). Since this functionality is increasingly being integrated into general-purpose load-sharing systems, switches, and firewalls, these devices can often be clustered to provide appropriate levels of fault tolerance. The interested reader is directed to [24–26] for further details about the operation of DNS. Reference [27] documents DNS support issues for load balancing.
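
The basic round-robin behaviour of the simpler DNS load sharers can be modeled in a few lines; the sketch below simply rotates the list of addresses returned for a host name (the addresses and TTL are invented, and a real implementation would of course sit behind the DNS protocol itself):

from collections import deque

class RoundRobinDNS:
    # Toy resolver front end: each query for a name returns the address list
    # rotated by one, so successive clients favour different servers.
    def __init__(self, zone):
        self.zone = {name: deque(addrs) for name, addrs in zone.items()}

    def resolve(self, name, ttl=60):
        addrs = self.zone[name]
        addrs.rotate(-1)
        # The TTL must be honoured by clients for the rotation to have effect.
        return list(addrs), ttl

dns = RoundRobinDNS({"www.example.com": ["192.0.2.10", "192.0.2.11", "192.0.2.12"]})
print(dns.resolve("www.example.com"))
print(dns.resolve("www.example.com"))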

7.3.3 HTTP load sharing

HTTP load-sharing proxies (sometimes referred to as logical HTTP servers or HTTP redirectors) can be used as a method for distributing requests for Web services. This functionality is limited to Web traffic only. As with DNS, load balancing is implemented as a proxy service, and this has the advantage of being simple to implement and transparent. From the user's perspective there is only one Web server, whereas in reality there may be several servers, often located physically at different sites for additional resilience. Note that Web servers do not necessarily have to be on a particular interface on the proxy (e.g., if this functionality is implemented in a firewall, some Web servers may be behind the firewall, others may be accessible via the Internet).

Redirection means that the client browser effectively has two HTTP sessions (although the client is generally unaware of this). The client initially makes an HTTP request to the proxy. The proxy recognizes the destination URL as a logical server and selects the most appropriate server (based on its load-balancing algorithm). The proxy then uses the URL redirection feature of HTTP to inform the client, redirecting it to a specific IP address. In practice, this functionality can be negated if the user caches URLs locally as bookmarks, since the HTTP proxy will then be bypassed. HTTP load sharing may be implemented in standalone devices or integrated into general-purpose load balancers, such as ArrowPoint's content-aware switches, Cisco's LocalDirector, and Checkpoint's Firewall-1.
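
The redirection step can be sketched as follows; the HTTP 302 response is the standard redirection mechanism referred to above, but the back-end host name and the selection function are placeholders:

def pick_backend():
    # Placeholder for whatever load-balancing algorithm the proxy uses.
    return "https://www3.example.com"

def redirect_response(path: str) -> bytes:
    # Build a minimal HTTP 302 response that sends the client to the selected
    # back-end server; the client then opens a second HTTP session to it.
    location = pick_backend() + path
    return (
        "HTTP/1.1 302 Found\r\n"
        f"Location: {location}\r\n"
        "Content-Length: 0\r\n"
        "Connection: close\r\n\r\n"
    ).encode()

print(redirect_response("/index.html").decode())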

7.3.4 FTP load sharing

FTP load-sharing proxies (sometimes referred to as logical FTP servers) can be used as a method for distributing requests for FTP services. This functionality is limited to FTP traffic only. As with HTTP, load sharing is implemented as a proxy service, and this has the advantage of being simple to implement and transparent. From the user's perspective there is only one FTP server, whereas in reality there may be several servers, often located physically at different sites for additional resilience. As with HTTP load sharing, FTP servers do not necessarily have to be on a particular interface on the proxy. However, in this case, unlike HTTP load sharing, there is no explicit redirection; all server sessions are handled through the proxy.

The client initially makes an FTP request to the proxy. The proxy recognizes the destination IP address or name as a logical server and selects the most appropriate server (based on its load-balancing algorithm). The proxy then performs network address translation on the session so that the client is unaware of the true destination server. Address translation is performed in both directions (so that even FTP back connections are handled properly). FTP load sharing may be implemented in standalone devices or integrated into general-purpose load balancers, such as ArrowPoint's content-aware switches, Cisco's LocalDirector, and Checkpoint's Firewall-1.

7.3.5 ARP load sharing

Proxy ARP can be used to provide resilient paths to services and to spread load. For example, assume we have a client connected to LAN-A. The client has an IP address of 140.4.3.1 and a mask of 255.255.0.0. There are two routers connected to LAN-A, and both remotely connect to LAN-B, which attaches to a server. The server uses IP address 140.4.2.1 and a mask of 255.255.255.0. Both routers run proxy ARP. When the client wishes to connect to the server, it believes that the server is locally attached (since it uses a natural class B mask) and therefore ARPs locally. Both routers receive the ARP request and respond with their own MAC addresses. The client caches the MAC address from the most recent ARP response and sends the packet addressed to the server's IP address and one of the routers' MAC addresses. The relevant router forwards the packet on to LAN-B. This approach is simple and transparent but does have some drawbacks. If one of the routers dies, then all client sessions using that router will eventually time out. Clients will then have to generate a new ARP request, which should be responded to by the working router, and all further sessions will be directed to this single router. If the failed router recovers, it will not be used until sessions are broken or new clients come online. ARP timers are quite long, and it may take considerable time for the load to become evenly distributed between the routers again. For further information on basic ARP and proxy ARP operations, the interested reader is referred to [28].

7.3.6 NAT load sharing

Network Address Translation (NAT) dynamically changes source and destination IP addresses (and possibly ports) as IP packets traverse the NAT system. NAT is typically employed for security applications or to convert private IP addresses to registered IP addresses (NAT operations are described in [29]). Since this is a critical function, NAT systems are often incorporated in firewalls or perimeter routers placed at strategic locations on the network. In a client/server environment, this gives a NAT system the opportunity to distribute sessions among different servers by dynamically changing the destination IP address to match one of several servers in a hunt group. For complete transparency, packets returned from the servers must traverse the NAT device so that addresses are flipped back to their original forms and forwarded to the client. Since NAT operates at the IP layer, it is not tied to a specific service.

An enhanced version of NAT called the Load Share Network Address Translator (LSNAT) is described in [RFC2391]. LSNAT extends NAT to offer load sharing across a pool of servers, instead of directing requests to a single server. LSNAT uses real-time load-sharing algorithms to distribute load transparently across a server group, enabling servers to scale in large internetwork environments. Unlike traditional NAT, LSNAT devices initiate translations on inbound sessions by binding each session, represented by the tuple (client address, client TCP/UDP port, virtual server address, server TCP/UDP port), to one of the servers in the cluster, with selection based on a real-time load-sharing algorithm. A virtual server address is a globally unique IP address that identifies a physical server or a group of servers providing similar functionality. In essence, when a client sends a request to a server (using the virtual address), the LSNAT system transparently redirects the request to one of the hosts in the server pool, chosen by the load-sharing algorithm. Multiple sessions may be initiated from the same client, and each session could be directed to a different host, based on the load across the server pool hosts at the time. If load sharing is desired for just a few specific services, the LSNAT configuration can be defined to restrict load sharing to just those services.
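
A rough model of this binding behaviour is sketched below: each new session, keyed on the (client address, client port, virtual address, virtual port) tuple, is bound to a pool member, and subsequent packets of that session reuse the binding. The addresses are illustrative, and round robin stands in for a real-time load-sharing algorithm:

import itertools

virtual_server = ("203.0.113.10", 80)          # advertised virtual address
pool = ["10.1.1.1", "10.1.1.2", "10.1.1.3"]    # real servers behind the translator
next_server = itertools.cycle(pool)            # stand-in for a real-time algorithm

bindings = {}   # (client ip, client port, virtual ip, virtual port) -> real server

def translate(client_ip, client_port, dst_ip, dst_port):
    # Return the real destination for this packet, creating a binding for new
    # sessions and reusing it for subsequent packets of the same session.
    key = (client_ip, client_port, dst_ip, dst_port)
    if key not in bindings:
        bindings[key] = next(next_server)
    return bindings[key], dst_port

print(translate("198.51.100.7", 40001, *virtual_server))  # new session
print(translate("198.51.100.7", 40001, *virtual_server))  # same session, same server
print(translate("198.51.100.7", 40002, *virtual_server))  # new session, next server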

To check whether a server is up, NAT-based load-balancing solutions need to sacrifice an actual client request, so a server outage is typically detected only when one of these real client requests times out. NAT devices often map affinity (stickiness) only on the client's IP address, not at the port level. This means that once a client has contacted a server, all traffic from that client intended for other applications is forwarded to the same server. This drastically restricts configuration flexibility, in many cases rendering the sticky capability unusable in the real world. There are some other limitations worth noting, as follows:

NAT operations introduce additional overhead and latency, as well as queuing delay in the load balancer. Traffic is also typically asymmetrical, with much higher volumes passing downstream to the client. These factors limit the potential scalability of NAT solutions, unless the NAT systems can be clustered (e.g., by grouping firewalls or routers running NAT using protocols such as VRRP or HSRP). Note that some NAT systems mandate that the NAT device be installed as a kind of bridge and will not permit bridges of any other kind to be installed in parallel. All traffic must pass through the NAT devices, whether it is required to be NATed or not. This forces the server group to be located on a private segment behind the NAT system and restricts the flexibility of NAT to include remote WAN servers in the hunt group. It may compromise resilience unless the NAT systems can be clustered. For further information, the interested reader is referred to [30].

7.3.7 IP load balancers

IP load balancers are generic devices that sit in front of a group of IP servers (e.g., a Web farm), sharing the load between these servers. Broadly speaking, load balancers can be divided into two groups, as follows:

Assessing server status is actually quite hard to achieve with any degree of accuracy. For example, CPU utilization statistics on UNIX servers can be misleading; under some conditions these values can be artificially high. As we saw earlier, measuring response time by simply pinging servers may produce inaccurate results, since a server's CPU and network adapter may return fast responses even though the disk subsystem is overloaded. Some of the more sophisticated products in this class create a load index from several key criteria (response time, throughput, number of concurrent connections, etc.). Instead of simply using ping, some of these products may actually request content to assess real-life application performance, and they may include the ability to run customized scripts as part of the status-testing algorithm. On high-end load balancers, servers can be brought in and out of service dynamically, and content can be moved around (using FTP) as demand increases. Because of their strong fault tolerance, IP or DNS load balancers complement proxy caches. Among the vendors with products in this space are Alteon Networks, ArrowPoint, Foundry, Checkpoint, Cisco, Bright Tiger Technologies, Coyote Point Systems, F5 Labs, Hydraweb Technologies, and Resonate.
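
Such a load index might be computed along the following lines; the normalization limits and weights here are entirely illustrative and would be tuned per deployment:

def load_index(response_ms, throughput_mbps, connections,
               max_ms=500.0, max_mbps=100.0, max_conns=1000):
    # Combine several normalized criteria into a single 0..1 index
    # (higher means more heavily loaded). Weights are illustrative only.
    r = min(response_ms / max_ms, 1.0)
    t = min(throughput_mbps / max_mbps, 1.0)
    c = min(connections / max_conns, 1.0)
    return 0.5 * r + 0.2 * t + 0.3 * c

servers = {
    "srv-a": load_index(120, 40, 300),
    "srv-b": load_index(300, 80, 700),
}
print(min(servers, key=servers.get))   # direct new sessions to the least-loaded server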

7.3.8 VRRP and HSRP

The Virtual Router Redundancy Protocol (VRRP) and the Hot Standby Router Protocol (HSRP) are both relatively dumb protocols that rely on the default gateway feature configured on end systems. The basic idea is that devices (typically routers) within a cluster offer themselves behind a virtual IP address. While these protocols are essentially designed to provide a hot standby device, they allow multiple virtual IP groups to be configured, so traffic can be shared dynamically among several devices while resilience is offered concurrently. Figure 7.9 illustrates a topology where VRRP is used between two routers to provide both load sharing and resilience for client/server access.

Figure 7.9: VRRP configuration with load sharing and resilience for a server farm.

In this configuration, both routers run VRRP on all interfaces, and on both LAN interfaces both routers will simultaneously participate in two VRRP groups. Half of the end systems install a default route to the IP address of Router-1 (194.34.4.1); the remaining end systems install a default route to the IP address of Router-2 (194.34.4.2). Router-1 is elected as the master for VRID 1, with a priority of 254, with Router-2 providing backup. In the same way Router-2 is elected master for VRID 2, with Router-1 providing backup. If either master should fail, the backup will automatically kick in. For example, if Router-1 dies, then Router-2 will operate as master for virtual IP addresses 194.34.4.2 and 194.34.4.1 until Router-1 resumes activity. On the server-side LAN we are doing effectively the same thing. Note that VRID-1 and VRID-2 are used on both LANs but are effectively operating as different groups.

This configuration enables load sharing through the routers, while also providing full redundancy and transparency. If more horsepower is required, then more than two routers could be configured in parallel.
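
The election behaviour on the client-side LAN can be modeled abstractly as follows (this is only a sketch of which router answers for which virtual address, not a VRRP implementation; the backup priority value is an assumption, while the master priority of 254 and the addresses follow the example above):

# Per-VRID priorities on the client-side LAN, following the example above.
priorities = {
    "VRID-1": {"Router-1": 254, "Router-2": 100},   # Router-1 masters 194.34.4.1
    "VRID-2": {"Router-2": 254, "Router-1": 100},   # Router-2 masters 194.34.4.2
}
virtual_ip = {"VRID-1": "194.34.4.1", "VRID-2": "194.34.4.2"}

def masters(alive):
    # Return which live router answers for each virtual address:
    # the highest-priority router still alive wins each VRID.
    result = {}
    for vrid, prio in priorities.items():
        candidates = {r: p for r, p in prio.items() if r in alive}
        if candidates:
            result[virtual_ip[vrid]] = max(candidates, key=candidates.get)
    return result

print(masters({"Router-1", "Router-2"}))  # each router masters its own VRID
print(masters({"Router-2"}))              # Router-2 takes over 194.34.4.1 as well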

7.3.9 Port hashing techniques

This technique may be implemented in switches or routers as a hash function that controls the flow destination based on flow criteria. This is generally a proprietary feature, and the output of the function may be deterministic or nondeterministic (i.e., either the designer can predict which physical interface will be used to output a flow based on its input, or load balancing is entirely dynamic). Assume, for the sake of simplicity, that our hash function simply takes two addresses and outputs an interface ID using an XOR function, using bit positions to select an interface. In this case we assume the switch has 16 interfaces, so we use a 4-bit mask to determine which port to use. Let us assume that an IP packet arrives with source address 195.66.3.6 and destination address 140.10.128.3:

11000011 01000010 00000011 00000110 = 195.66.3.6    // Class C address
10001100 00001010 10000000 00000011 = 140.10.128.3  // Class B address
XXXXXXXX XXXXXXXX XXXXXXXX XXXX1111 = 0.0.0.15      // Interface mask
-----------------------------------   logical XOR
XXXXXXXX XXXXXXXX XXXXXXXX XXXX0101 = 5             // The interface!

Note that I am not recommending this particular algorithm, just illustrating the concept. For a switch with fewer active interfaces than in our example (say, only four), the function could be configured (or could reconfigure itself dynamically) to adjust the length of the bit mask and, if necessary, wrap around if the number of active ports is not a power of 2. For more granular flow classification, the algorithm could hash the IP source and destination addresses and the destination TCP or UDP port to generate an output interface. This would enable differentiation of applications such as FTP, SMTP, and Telnet.
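
Expressed in code, the toy hash above might look like this (with the same caveat that it is illustrative only):

import ipaddress

def output_interface(src: str, dst: str, active_ports: int = 16) -> int:
    # XOR the two IP addresses and mask the result down to the number of
    # active interfaces (assumed here to be a power of 2, as in the text).
    s = int(ipaddress.IPv4Address(src))
    d = int(ipaddress.IPv4Address(dst))
    return (s ^ d) & (active_ports - 1)

print(output_interface("195.66.3.6", "140.10.128.3"))   # 5, as in the worked example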

If this algorithm is deterministic (as in our example), we simply need to cable up the network topology consistently to be sure that flows are sent and received down the same path (this may be important for applications such as stateful firewalls); in effect we are demultiplexing flows through a physical infrastructure and then aggregating them on the other side. In Figure 7.10, we use this concept to scale firewall performance using high-speed switches (capable of operating at up to wire speed). The physical wiring in this case would typically be 10/100 Ethernet. The basic idea is that the bottleneck device (typically a device doing a lot of work, in this case a firewall) should be encapsulated (in the topological sense) inside a number of higher-speed devices. The encapsulated device sees only a proportion of the overall traffic, since the switch presorts flows and allocates them to specific physical interfaces. This enables devices such as firewalls to be scaled as many times as there are available interfaces at the switch. In Figure 7.10, a session (f1) is identified on Switch-1 and hashed to a specific interface. Physical cabling ensures that this interface feeds into Firewall-2, which simply outputs into Switch-2. The return flow is directed down the same physical path (this means that the hash function used on Switch-2 must be consistent, and the cabling must reflect the output of the function to ensure that the return path is consistent).

Figure 7.10: Deterministic flow hashing using two monster switches.

The main design restrictions are associated with the hash function and the physical topology. The hash function must have sufficient granularity to pick out application flows and so enable a reasonable load distribution. On the other hand, the function needs to be fairly simple to avoid too much processing overhead per packet. There is therefore a trade-off between making this function fast and making it sufficiently fair. The other main issue is that the physical topology may need to be symmetrical and should behave consistently throughout. If a physical interface or cable breaks on either side of the encapsulated device, the flow send and return paths could become inconsistent. This may not be a problem if the encapsulated device does not need to maintain flow state information (e.g., a router), but it may be unacceptable to a more sophisticated device such as a firewall (e.g., if stateful inspection is being used). In this case the switch could implement a simple polling feature to test available paths and force interfaces to forward consistently on both sides, or some form of state synchronization could be used between the switches to ensure that the hash function output is consistent for all flows.

Since this is a flow-based function, we are not talking about true load balancing. The actual load imposed on each encapsulated device will depend on the applications used and the number of flows. In a large network with reasonably diverse flows the loading should be evenly distributed. If one dominant source is present, loading may be less evenly spread. Note also that some commercial solutions use modulo-based hash functions; this will lead to uneven distribution depending upon the number of switches. If the switch runs into performance issues (assume the external paths in the topology are Gigabit Ethernet) it is possible to scale the switch design, as illustrated in Figure 7.11.

Figure 7.11: Deterministic flow hashing.

Distributed hashing techniques

Some implementations take a more intelligent approach to traffic flow distribution by clustering devices and using hashing techniques to allocate flows among group members. In general there are two methods employed, as follows:

Figure 7.12: Distributing flows using hashing techniques. Here, four flows (A, B, C, and D) are fed into the cluster. (a) An autocratic model, where a master (M1) is elected and performs flow distribution at the ingress port. (b) A democratic model, where each node is aware of the flow states being handled by other members of the cluster.

At first it may seem that the autocratic method adds little value, since we need to process and classify all incoming traffic at the master in order to decide how to distribute it. However, consider a device such as a firewall or VPN termination node. These devices perform significant packet processing functions (particularly if high-speed encryption is required for thousands of concurrent sessions). For this class of device, simply classifying and bouncing incoming traffic between cluster nodes is a negligible overhead in terms of this overall processing requirement.

In either model it is important that the algorithm handle cluster node introduction and removal dynamically. This is a fundamental requirement, since we need to deal gracefully with a cluster member malfunction (including failure of a master), so as not to cause major disruption to the whole group. For applications such as VPN, stateful firewalls, or transaction processing, it is also highly desirable that flows be switched transparently in the event of a failure, so that sessions remain alive regardless. For these applications it may also be desirable that flows be handled symmetrically, so that send and return flows for each individual application session are routed through the same cluster node. We have also assumed so far that nodes in the cluster have equivalent resources (processor speed, memory size, etc.). With either model, more sophisticated algorithms may allow nodes to be mixed, with the flow allocation reflecting any differences in processing capability. Note that in both designs a single switch is used on either side of the cluster; in practice, this could be deployed as multiple switches for scalability and high availability.
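
One simple way to meet the graceful join-and-leave requirement is a consistent hashing arrangement, in which removing a node reassigns only the flows that node owned. The sketch below is one possible approach, not a description of any particular product:

import bisect
import hashlib

class FlowRing:
    # Map flow identifiers onto cluster nodes via a hash ring, so that
    # adding or removing one node only moves the flows it owned.
    def __init__(self, nodes, replicas=64):
        self.replicas = replicas
        self.ring = []                      # sorted list of (point, node)
        for n in nodes:
            self.add_node(n)

    def _point(self, key: str) -> int:
        return int(hashlib.sha1(key.encode()).hexdigest(), 16)

    def add_node(self, node: str):
        for i in range(self.replicas):
            bisect.insort(self.ring, (self._point(f"{node}#{i}"), node))

    def remove_node(self, node: str):
        self.ring = [(p, n) for p, n in self.ring if n != node]

    def node_for_flow(self, flow_id: str) -> str:
        point = self._point(flow_id)
        idx = bisect.bisect(self.ring, (point, "")) % len(self.ring)
        return self.ring[idx][1]

ring = FlowRing(["node-1", "node-2", "node-3"])
flows = ["A", "B", "C", "D"]
before = {f: ring.node_for_flow(f) for f in flows}
ring.remove_node("node-2")                  # simulate a cluster member failing
after = {f: ring.node_for_flow(f) for f in flows}
print(before, after, sep="\n")              # only the failed node's flows move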
