Data Protection and Information Lifecycle Management
All kinds of networks can support remote copy, including Fibre Channel, Gigabit Ethernet, and a variety of wide-area networks. The choice of network depends on the topology, the distance that needs to be covered, the amount of data to be moved, and the amount of money that can be spent on the system. Block-level storage applications are especially sensitive to latency and throughput. Distance drives network latency, and the amount of data to be moved defines the throughput requirement. Remote copy can easily fail if the round-trip delay is too great or if there is not enough bandwidth to move all the data. It is safe to say that more bandwidth and lower network latency are always better.

Tip: Network latency and storage latency are similar. In both cases, latency refers to the time it takes to send and receive data. What is different is the cause of the latency. For storage devices, the root cause of latency is the mechanical properties of the device and media. For networks, latency is a function of the electrical and software properties of the network connection. Both must be taken into account when designing remote copy systems.
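To get a feel for the round-trip delays involved, propagation delay in fiber can be estimated at roughly 5 microseconds per kilometer, one way. The following minimal Python sketch uses that common approximation (not a measured value) to show how quickly distance adds up:

```python
# Estimate round-trip propagation delay over fiber. Light travels
# through fiber at roughly 200,000 km/s, or about 5 microseconds per
# kilometer; switches, routers, and protocol processing add more.

FIBER_US_PER_KM = 5.0  # approximate one-way delay per kilometer

def round_trip_us(distance_km: float) -> float:
    """Out to the remote site and back with the acknowledgment."""
    return 2 * distance_km * FIBER_US_PER_KM

for km in (10, 100, 1000, 4000):
    print(f"{km:>5} km: {round_trip_us(km):>9,.0f} microseconds round trip")
```

Even before any equipment delays are counted, a 4,000-kilometer link imposes tens of milliseconds of round-trip delay on every acknowledged I/O.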
Bandwidth
Bandwidth needs depend on the remote copy application, which differs from vendor to vendor. A good rule of thumb is to look at the underlying storage architecture and see what its bandwidth requirements are. If the storage infrastructure consists of 2-gigabit Fibre Channel, and the data transfer is operating at full rate, remote copy may need 2 gigabits per second of bandwidth. The cost of network connections increases as more bandwidth is needed and distances increase. A 1-gigabit connection within a local-area network, using Fibre Channel or Gigabit Ethernet, costs much less than a 1-gigabit connection from a long-distance carrier. In practice, most storage applications do not run at full bandwidth, and most individual data transfers are smaller than the maximum the link allows. Because more bandwidth costs more money, it is important to measure the actual bandwidth the application is using before buying expensive network services or components. Matching storage bandwidth needs with network connections is a key part of remote copy design. Table 4-1 lists some storage connections and corresponding network connections.
The connection matching assumes that the applications using the storage need the full bandwidth of the connection. This would be the case in a SAN architecture or with an external disk array requiring high bandwidth. In most instances, full-speed network connections will not be necessary, because the applications using the storage will not use all the available link speed. There are two methods of getting the bandwidth needed from a network connection. The simplest method is to obtain a connection with enough bandwidth to handle the application's throughput. This can be costly, and in many areas high-bandwidth connections are not available. The other method is to aggregate several connections at the switch level to provide the required bandwidth. DWDM optical switching products do this well, combining several high-speed network connections into one fiber optic link. This method can also be used with long-distance WAN connections such as T-1/E-1 and T-3/E-3 leased lines. Though their bandwidth is small by storage standards (1.544 megabits per second and 44.736 megabits per second, respectively), combining several connections can provide sufficient bandwidth for many remote copy applications.
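As an illustration of that matching exercise, the sketch below compares a measured application transfer rate against common link speeds. The measured rate is a hypothetical stand-in for what monitoring tools would actually report; the link rates are the standard ones for each technology:

```python
# Compare a measured storage transfer rate against candidate network
# links. Speeds are in megabits per second; the measured rate is a
# hypothetical stand-in for what monitoring tools would report.

LINKS_MBPS = {
    "T-1": 1.544,
    "T-3": 44.736,
    "OC-3": 155.52,
    "OC-12": 622.08,
    "Gigabit Ethernet": 1000.0,
    "2Gb Fibre Channel": 2000.0,
}

measured_peak_mbps = 310.0  # hypothetical measured application peak

for name, speed in LINKS_MBPS.items():
    verdict = "sufficient" if speed >= measured_peak_mbps else "too slow"
    print(f"{name:>18}: {speed:>8.2f} Mb/s -> {verdict}")
```

In this hypothetical case, an OC-12 circuit would suffice, and paying for a full gigabit link would be wasted money.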
The Causes of Network Latency

Network latency, the time it takes for a packet or frame to get from the source to the destination, depends on many factors. These include the distance the data must travel, the number of switches and routers it must pass through, congestion on shared links, and the processing overhead of the network protocols involved.
It is important to take these factors into account. Keep in mind the following:
Because all storage was originally local, storage software works under the assumption that a long delay means the resource is unavailable. This is in sharp contrast to network applications, which assume that network latency will happen and are willing to wait longer periods of time. With remote copy, problems occur when the application must wait for acknowledgment of a frame. To move on to the next I/O, the application needs to know that the last one was successful. When the application has to wait a long time for the last I/O to respond, it will assume that the storage is no longer available and will fail in some fashion. Long delays caused by network latency result in slow acknowledgment of the whole transaction and possible failure of the application.

Vendors of remote copy applications employ a variety of tricks to overcome network latency problems. All these techniques are designed to make it appear that the I/O was completed. Storage applications have also adapted to environments in which network latency is more of a problem, such as SANs. By queuing I/Os, retrying before abandoning, and caching, applications have become more tolerant of delays in completing storage transactions.
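The "retry before abandoning" behavior can be sketched as a simple loop. Everything here is hypothetical: send_io stands in for the real transport call, and the timeout and retry counts are illustrative defaults:

```python
import time

# Sketch of "retry before abandoning": the I/O is retried a few times,
# each with its own timeout, before the application is told the remote
# storage is unavailable. send_io is a hypothetical transport call that
# returns True when the remote acknowledgment arrives in time.

def remote_write(send_io, payload, timeout_s=2.0, retries=3):
    for attempt in range(1, retries + 1):
        if send_io(payload, timeout_s):
            return True               # remote array acknowledged the I/O
        time.sleep(0.1 * attempt)     # brief backoff before the next try
    return False                      # caller rolls back and raises an error
```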
Distances

Distances for remote copy are thought of in the traditional networking manner: network connections come in local, metropolitan, and wide-area (long-distance) varieties.

Local remote copy is performed within the confines of the LAN or SAN, in the same building or campus. It can assume very fast connections, usually Fibre Channel or Gigabit Ethernet, and few switches and routers to pass through.

The options are greater for metropolitan areas. The MAN distance is defined as being within the local area of a city or region, usually less than 100 kilometers. Intercity connections that are close together are also considered metropolitan. Direct fiber optic links can be leased or built and then used directly by Fibre Channel and Gigabit Ethernet. Native Fibre Channel can be used when the distance is less than 10 kilometers, and Gigabit Ethernet when the distance is less than 40 kilometers. High-capacity fiber optic connections, such as SONET OC-48 and OC-96 circuits, are also available within large metropolitan areas.

Long-distance remote copy can use any of the data communications connections available for long-haul transmissions, including T-1/E-1 and T-3/E-3 circuits and leased fiber optic lines (dark fiber). Remote copy at these distances, however, poses some difficulties for the system architect. The network latency caused by distance alone is significant and can create problems for remote copy applications. Another difficulty is the type of network connection available. The distances are too long for direct Gigabit Ethernet, Fibre Channel, and SONET protocols. A single fiber optic link is usually not available, so packets have to be routed through the networks of a telecommunications carrier, adding more delay. To operate over long distances, a remote copy application needs to be tuned very carefully.
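To see why distance alone is significant, consider the ceiling it places on serialized, acknowledged I/Os. A back-of-the-envelope sketch, again assuming roughly 5 microseconds per kilometer of fiber and ignoring all device and protocol overhead:

```python
# Rough upper bound on serialized, acknowledged I/Os per second imposed
# by distance alone. Real numbers will be worse once device latency and
# protocol processing are added.

def max_serial_iops(distance_km: float) -> float:
    rtt_s = 2 * distance_km * 5e-6  # round trip, propagation only
    return 1.0 / rtt_s

for label, km in (("campus", 1), ("metro", 100), ("long haul", 4000)):
    print(f"{label:>9} ({km:>4} km): at most "
          f"{max_serial_iops(km):>10,.0f} I/Os per second per stream")
```

At campus distances the ceiling is effectively invisible; at 4,000 kilometers it drops to a few dozen serialized I/Os per second, which is why long-haul designs need special handling.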
Synchronous and Asynchronous Remote Copy

Problems with timely acknowledgment of remote copy I/O have led to two different ways of implementing remote copy: synchronous remote copy and asynchronous remote copy.

The synchronous form of remote copy has the host wait for acknowledgment from both the primary disk array and the remote array. This is the more secure method of remote copy. The application is assured that all I/Os have been completed on both arrays and that an exact copy of the data and logs exists. If the I/O to the primary array fails, the remote array can be rolled back to the same state as the primary and an error produced for the application. If the remote copy fails, the remote copy software can resend the I/O while the application waits for a response. If the response does not come in a reasonable amount of time, the I/O can be rolled back on the primary and an error code generated. Synchronous remote copy assumes that sufficient bandwidth exists on the network link to the remote array to perform I/Os normally. When the primary data path is using 1 gigabit per second of bandwidth, and the remote array is serviced by an OC-24 network link, there will be sufficient bandwidth for synchronous remote copy. When that link is shared by several applications, there may be times when the I/Os to the remote array cannot be completed in the allotted time and the connection times out. Depending on the applications, the host may be able to wait for the packet to be resent, or an error may be generated.

With asynchronous remote copy, the remote copy software (whether it is housed in an appliance or a storage device, or is host based) acknowledges the I/O as soon as the primary storage completes it. The I/O is then sent to the remote array and managed independently of the primary I/O. The host does not have to wait for acknowledgment from the remote array to continue (Figure 4-5).

Figure 4-5. Synchronous and asynchronous remote copy
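The difference between the two modes comes down to which acknowledgments the host waits for before issuing the next I/O. A schematic sketch, in which write_primary, write_remote, and enqueue_remote are hypothetical stand-ins for the real array and transport operations:

```python
# Schematic contrast of the two acknowledgment models. write_primary,
# write_remote, and enqueue_remote are hypothetical stand-ins for the
# real array and transport operations.

def synchronous_write(io, write_primary, write_remote):
    ok_primary = write_primary(io)
    ok_remote = write_remote(io)   # host blocks until the remote ACK arrives
    if not (ok_primary and ok_remote):
        raise IOError("write failed; roll back to a consistent state")

def asynchronous_write(io, write_primary, enqueue_remote):
    if not write_primary(io):
        raise IOError("primary write failed")
    enqueue_remote(io)             # remote copy proceeds in the background
    # the host continues immediately without waiting for the remote ACK
```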
Even when the I/O to the remote array does not fail, waiting for the acknowledgment can drag down the host's performance. Network latency, retries, and other delays can cause the host to spend time waiting instead of processing data. With asynchronous remote copy, there is no waiting. This is a vitally important characteristic when the network link to the remote array is slow or spans a very long distance.

Asynchronous remote copy has also allowed for less costly implementations. Slower connections mean more network latency and retries, but with asynchronous remote copy these have less effect on the overall performance of the host. Lower-bandwidth connections can be used, which cost much less on a recurring basis.

There is a downside to this approach. The host has no way of knowing whether the remote copy actually occurred correctly, or at all. In the event of problems on the remote network link, the remote array could fall out of sync with the primary array. To mitigate this, remote copy applications often have a facility for resyncing the data. That is a time-consuming process that has to happen offline, causing downtime in the overall system. With this form of remote copy, the state of the remote array cannot be verified at all times by the host.

It should be noted that the steps involved in remote copy are not always sequential. Some implementations write the I/O to the primary and remote arrays at the same time. What is important is that with synchronous remote copy, the host has to wait for both acknowledgments before continuing with the next I/O, regardless of the order in which they arrive. With asynchronous remote copy, only the acknowledgment from the primary disk array is necessary before the next I/O can begin.
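The resync facility mentioned above is typically built on a record of which regions changed while the link was down, so that only those regions need to be re-sent. A minimal sketch of the idea; real products use specialized bitmaps and logs rather than a simple set:

```python
# Minimal dirty-region tracking for resynchronization. While the remote
# link is down, changed block numbers are recorded; on reconnect, only
# those blocks are re-sent instead of recopying the whole volume.
# read_block and send_block are hypothetical I/O callables.

class DirtyTracker:
    def __init__(self):
        self.dirty = set()

    def record_write(self, block_no: int):
        self.dirty.add(block_no)

    def resync(self, read_block, send_block):
        for block_no in sorted(self.dirty):
            send_block(block_no, read_block(block_no))
        self.dirty.clear()
```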
Bunkering

For some organizations, asynchronous remote copy does not afford the level of protection needed for critical applications. This is true in the financial-services industry, for example. Synchronous remote copy is instead used over a short distance, allowing for a high-bandwidth connection such as direct Fibre Channel or Gigabit Ethernet. Metropolitan-area network connections such as SONET are also used to get high bandwidth over short distances. When the need exists for long-distance but high-performance remote copy, a different architecture is needed; otherwise, costs will be high and system performance less than desired.

One such architecture is called bunkering. With bunkering, a hardened facility (the bunker) housing only storage and networking equipment is maintained a short distance away. A separate facility that contains not only storage but also application servers is kept at a much greater distance. Data is copied, using synchronous remote copy, to the arrays in the bunker, where it is available over a high-bandwidth local or MAN connection. The bunker storage then acts as a staging ground for asynchronous remote copy over a longer distance but slower link. From here, data is copied over the long distance using standard data communications links (Figure 4-6). Copies of the data are kept on the primary array, the bunkered array, and the remote array.

Figure 4-6. Bunkering
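The two-stage flow can be summarized in a few lines: a synchronous stage in the host's I/O path, and an asynchronous forwarding stage run by the bunker. All function names here are hypothetical stand-ins:

```python
from collections import deque

# Two-stage bunkering flow. The host waits only for the primary and
# bunker acknowledgments (the synchronous stage); the bunker forwards
# to the far site on its own schedule over the slower long-haul link.
# write_primary, write_bunker, and send_long_haul are hypothetical.

def host_write(io, write_primary, write_bunker, staging: deque):
    if not (write_primary(io) and write_bunker(io)):
        raise IOError("primary or bunker write failed")
    staging.append(io)  # models the bunker's staging area

def bunker_forwarder(staging: deque, send_long_haul):
    while staging:
        send_long_haul(staging.popleft())  # asynchronous stage
```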
Bunkering solves several problems with long-haul remote copy: cost, performance, and link failure. Because the primary storage has already copied its I/O over to the bunker storage, the application is not affected by the slower, less costly long-distance connection. If the long-distance connection should fail, the copy of the data in the bunker still protects the data. When the connection is brought back online, the bunker storage and long-distance storage can synchronize without disturbing the primary storage.

Bunkering provides other advantages over direct remote copy. By making three copies of the data instead of only two, the data is made safer. In the event of a regional disaster that destroys both the primary data center and the bunker, the third copy of the data is far enough away to remain unharmed. The far site can also be an operating data center to which employees can travel, allowing the company to return to normal operations sooner. By staging the data, it is also possible to run backups at one or more of the remote facilities; backups can be performed at almost any time without disrupting applications. Also, because bunkering involves three facilities, a network connection can be established between the primary and long-distance facilities, allowing remote copy to continue in the event of a major disruption at the bunker. With traditional remote copy, a disruption in the network link or remote facility leaves no options for continuing remote copy. Bunkering provides an alternative for businesses that have very high availability and data protection requirements.
Cost Considerations
Remote copy can be a very expensive method of data protection, especially over long distances. When designing remote copy systems, it is important to keep these cost factors in mind:

- Network connections. Recurring charges rise with both bandwidth and distance, and long-haul, high-bandwidth circuits are often the most expensive component of the design.
- Duplicate storage hardware. Every remote site needs arrays large enough to hold a full copy of the protected data.
- Facilities. Remote and bunkered sites must be equipped, powered, and maintained; bunkering requires three facilities instead of two.
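One way to frame the network tradeoff is to compare recurring link charges over the planning horizon of the system. The monthly prices below are placeholders, not real carrier tariffs:

```python
# Compare cumulative recurring charges for two link options over a
# planning horizon. All prices are hypothetical placeholders; use real
# carrier quotes in practice.

MONTHS = 36
options_per_month = {
    "high-bandwidth link (synchronous-capable)": 25_000,  # hypothetical $
    "lower-bandwidth link (asynchronous)": 5_000,         # hypothetical $
}

for name, monthly in options_per_month.items():
    print(f"{name}: ${monthly * MONTHS:,} over {MONTHS} months")
```

Over a multiyear horizon, the recurring charges usually dwarf the one-time equipment costs, which is why asynchronous remote copy over slower links is so attractive.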