Network guru Mel Beckman explains the new world of converged network architectures, and how IBM's Flex System Fabric gives you extreme flexibility in deploying virtual machines throughout your enterprise.
In the traditional virtualized data center, physical host servers interconnect with each other and with storage devices over separate networks: Ethernet for application data communications, Fibre Channel (FC) for legacy storage area networks, and iSCSI for next-generation storage. This worked well in an environment where every change was managed by a skilled network engineer, coordinated to ensure that modifications to the network didn't disrupt other applications. But in the modern private cloud world, traditional networking doesn't fly. Multiple network technologies don't scale, and running several separate networks makes administration increasingly burdensome.
Perhaps the biggest problem with the traditional model is that it compartmentalizes resources, creating silos that are difficult to bridge. If you have an application running in a rack on one side of the data center and want to move a single VM to another rack on the other side, you have to create new physical interconnections to permit VMs on both sides to access the same data VLANs and storage SANs. That's a major, and often error-prone, operation. As your data center grows – possibly into multiple data centers in separate buildings – maintaining VM mobility becomes increasingly complex, requiring multiple switching tiers and Layer 3 IP routing.
The solution to this dilemma is the converged network, which uses a single networking technology for all communications, flattening the topology to eliminate multiple switching tiers and Layer 3 IP routing. Today's most popular converged network rides on a version of Ethernet, letting you employ a common set of vendor-neutral devices across your data center, interconnecting them however you need, at speeds of up to 100Gbps.
The Ethernet-converged network transports everything in Ethernet frames. Application data and iSCSI already ride on Ethernet; Fibre Channel traffic is carried by the Fibre Channel over Ethernet (FCoE) protocol. Thus you can connect a VM to any VLAN and SAN, no matter where it resides in your enterprise, as long as you have adequate bandwidth between locations.
With all three kinds of data mixed together across a common switched infrastructure, you can imagine that there might be issues with prioritization and packet loss. Traditional Ethernet is known as a “best effort” transport, which means that when an Ethernet link is congested, packets can be dropped. That's not a problem for most application communications, which depend on higher-layer protocols, such as TCP, for error detection and recovery. But for iSCSI and FCoE storage interconnects, dropped packets are tantamount to a drive failure. They simply can't be allowed. What is needed is a new kind of Ethernet, called lossless Ethernet, that can't lose packets.
Understanding how lossless Ethernet works is the key to building and deploying a robust converged network. Flex System Fabric is IBM's solution for deploying lossless Ethernet with wide-area interconnectivity. Once you know the terminology and internals of lossless Ethernet, you'll be well positioned to select the right Flex System Fabrics for your private cloud. First, though, let's look at how Ethernet got to be “lossy” in the first place.
Loss in Space
Ethernet in its native form has serious flaws that only become apparent at the worst possible time: when an Ethernet link approaches saturation. A typical data center Ethernet SAN infrastructure consists of multiple switches interconnected via redundant 1- or 10-gigabit links, so that the failure of any one link doesn't partition the network. But redundant links, left on their own, create loops that amplify broadcast traffic to pathological levels, which would render the network unusable if not checked. The solution is to have the Ethernet switching devices, which are in reality just Ethernet bridges, cooperate to map out a loop-free topology. The protocol for doing this, Spanning Tree Protocol (STP), is standard in every enterprise network. It determines a single, non-looping path between any two devices, blocking any paths that would form a loop. One downside of this is that STP doesn't exploit the extra bandwidth available through redundant paths. But it has worse problems.
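Before digging into those problems, it helps to see how small the core idea is. The sketch below (Python, with an invented four-switch topology) derives the kind of loop-free tree STP arrives at once it converges: elect the lowest-ID bridge as root, keep the tree links, and block the rest. Real STP computes this distributedly via BPDU exchange rather than in one place.

```python
# Toy illustration of the spanning-tree idea: given a redundant switch
# mesh, keep a loop-free subset of links and block the remainder.
from collections import deque

links = {  # hypothetical redundant topology: switch -> neighbors
    "sw1": ["sw2", "sw3"],
    "sw2": ["sw1", "sw3", "sw4"],
    "sw3": ["sw1", "sw2", "sw4"],
    "sw4": ["sw2", "sw3"],
}

root = min(links)            # STP elects the bridge with the lowest ID
active, blocked = set(), set()
visited, queue = {root}, deque([root])
while queue:
    sw = queue.popleft()
    for nbr in links[sw]:
        edge = frozenset((sw, nbr))
        if nbr not in visited:
            visited.add(nbr)
            active.add(edge)   # forwarding link on the tree
            queue.append(nbr)
        elif edge not in active:
            blocked.add(edge)  # redundant link: blocked to break the loop

print("forwarding:", sorted(map(sorted, active)))
print("blocked:   ", sorted(map(sorted, blocked)))
```

Note that for four switches and five links, two links end up blocked: capacity you paid for but cannot use.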
STP's single path can't be optimal for all devices in the network. Some devices are forced to take “the long way around” to communicate, which can saturate parts of the network even though plenty of network capacity still exists. When saturation occurs, Ethernet's broadcast-based Layer 2 transport drops packets. As noted earlier, this is no big deal for data sessions: higher-level protocols such as TCP will retransmit the lost packets, ensuring guaranteed delivery. But iSCSI and FCoE can't tolerate even a single packet loss, because they have no retransmission mechanism. Applications and operating systems have grown accustomed to disk writes working perfectly. Any failure is considered a catastrophic event that crashes the application and, for VMs that reside on a SAN, the virtualized operating system itself.
Packet loss due to congestion is bad, but theoretically you can solve that by just pouring on more capacity, in the form of higher-bandwidth links. Unfortunately, STP has more problems. Switches running STP communicate with each other using Bridge Protocol Data Units (BPDU) broadcasts. When an STP-based Ethernet topology change occurs, switches broadcast BPDU packets that trigger a “recalculation” of the loop-free topology. This recalculation is intensive, and can take several seconds to converge on a new solution, since all the switches in the network must cooperate to derive a new loop-free configuration. This adds significant traffic to the network, so that any network links already carrying moderate to high traffic rates are at risk of saturation, which can result in the loss of BPDU packets essential to STP re-converging on a stable network topology. The result is often a constantly shifting topology, creating radical performance changes in the network. And dropped packets. Many, many dropped packets.
STP failure was a major factor in one of the worst health-care IT disasters in history. In 2002, the Beth Israel Deaconess Medical Center in Boston, MA experienced an STP loop that intermittently saturated the multi-building campus network for hours at a time. In their attempts to eliminate that problem, Beth Israel’s IT staff inadvertently exceeded STP’s seven-bridge hop limit, creating a network that would never re-converge. Ultimately, Beth Israel was forced to close its Emergency Room due to the loss of critical IT database and application access. The incident was only resolved after several days of intense analysis.
STP reconvergence problems are dangerous for all types of traffic. You can only guard against them by careful, conservative network design and exhaustive analysis of every proposed change in the light of STP’s limitations. Ethernet as a SAN fabric has other problems beyond packet loss and STP-induced outages.
One of those problems is traffic prioritization. In a data center network, you typically want to give SAN traffic priority over application traffic. While modern data center switches can prioritize traffic, the flow control mechanisms that operate across multiple switches (needed when a host moves data faster than a switch or an end device can process it under its current workload) don't honor those priorities. This can create problems, such as when a less-important SAN process that carries high network priority, like disk-to-disk backup, degrades mission-critical application performance.
Coping with Loss
Virtualization increased SAN utilization, bringing the lossy disadvantage of ordinary Ethernet to light several years ago. To combat lossiness, the IEEE standardized several mechanisms, collectively called Data Center Bridging (DCB), that convert Ethernet into a loss-free transport enabling a reliable, robust Ethernet SAN fabric. As with many things in life, IBM has its own trademarked term for this technology: Converged Enhanced Ethernet (CEE). For consistency with IBM's documentation, we'll use that acronym in the remainder of this document. Below are the five key elements of CEE:
Priority-based Flow Control (PFC). PFC, defined by the IEEE 802.1Qbb standard, provides link-level flow throttling to protect against packet loss when a link becomes congested. When a device sends packets faster than the receiving device on an Ethernet link can accept them, the intervening switch's default behavior is to buffer the packets. When the switch runs out of available buffer space to hold incoming packets, it drops additional incoming packets without notifying the sender.
Traditional Ethernet has a link-level flow control mechanism, the PAUSE control frame, defined by IEEE standard 802.3x. A congested receiver can send a PAUSE request to a sender when its buffer is close to full, causing the sender to stop transmitting on the link until the receiver again has buffer space available. The disadvantage of Ethernet PAUSE is that it operates on the entire link, which is likely carrying multiple traffic flows in the form of separate VLANs.
Some low-priority flows, such as a TCP file transfer, can recover from dropped packets within the TCP protocol, but others, such as iSCSI and FCoE, will crash ungracefully. PFC lets you establish multiple class of service (CoS) levels, with a new flow control command, PFC PAUSE, that pauses individual priority classes without stopping all traffic on a link. This gives you fine-grained control over traffic loads across a link, preventing congestion before it causes packet loss due to buffer exhaustion.
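A minimal sketch contrasting classic 802.3x PAUSE with per-priority PFC PAUSE; the buffer occupancies, threshold, and priority numbering are invented for the example (priority 3 is a common, though not universal, assignment for FCoE traffic).

```python
# Contrast 802.3x PAUSE (whole link) with PFC PAUSE (per priority).
HIGH_WATER = 80  # pause a priority when its buffer is this % full

buffers = {0: 35, 3: 91, 4: 15}  # hypothetical occupancy per 802.1p priority

def pause_802_3x(buffers):
    # Classic PAUSE: if any queue is congested, the whole link stops.
    if any(v > HIGH_WATER for v in buffers.values()):
        return "PAUSE entire link"
    return "no action"

def pause_pfc(buffers):
    # PFC: pause only the congested priorities; others keep flowing.
    congested = [p for p, v in buffers.items() if v > HIGH_WATER]
    return f"PFC PAUSE priorities {congested}" if congested else "no action"

print(pause_802_3x(buffers))  # -> PAUSE entire link
print(pause_pfc(buffers))     # -> PFC PAUSE priorities [3]
```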
Enhanced Transmission Selection (ETS). Where PFC aims to prevent packet-dropping congestion, ETS is a traffic engineering tool that lets you allocate bandwidth slices to pre-assigned traffic classes. ETS is defined in the IEEE 802.1Qaz standard. It maps each frame's priority to a Priority Group ID (PGID). One PGID value, PGID 15, is designated as the high-priority group and is served ahead of the bandwidth percentages. The other groups each receive a specified percentage of the remaining bandwidth on the link. Once allocated, a PGID can only use bandwidth up to its percentage ceiling.
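To make the ceilings concrete, here is a rough sketch of dividing a 10 Gbps link; the group names, percentages, and strict-priority demand are all invented for illustration.

```python
# Hedged sketch of ETS-style bandwidth ceilings on a 10 Gbps link.
LINK_GBPS = 10.0

strict_demand_gbps = 2.0  # hypothetical current load in the PGID 15 group

groups = {          # PGID group -> percent of remaining bandwidth
    "storage":     60,
    "application": 30,
    "management":  10,
}

remaining = LINK_GBPS - strict_demand_gbps
print(f"strict (PGID 15): served first, currently {strict_demand_gbps} Gbps")
for name, pct in groups.items():
    # Once allocated, a group may use bandwidth only up to its ceiling.
    print(f"{name}: ceiling {remaining * pct / 100:.1f} Gbps")
```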
Data Center Bridging Exchange (DCBX). The DCBX protocol, defined in the IEEE 802.1Qaz standard, lets two DCB peers exchange configuration information. DCBX packages parameters into vendor-specific Organizationally Specific Type-Length-Value (TLV) groups, exchanged via the Link Layer Discovery Protocol (LLDP). DCBX supports two types of parameters, Administered and Operational. Administered parameters are configuration settings; Operational parameters reflect the current state of administered parameters, can change through exchanges with the peer, and exist only for administered parameters the peer is allowed to change. DCBX lets vendors add proprietary features to a DCB infrastructure without causing compatibility problems with devices from other vendors.
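The TLV container itself is simple. Below is a sketch of packing an LLDP organizationally specific TLV (type 127: a 16-bit header holding a 7-bit type and 9-bit length, followed by a 3-byte OUI and a subtype). The subtype and one-byte payload are hypothetical stand-ins for a real DCBX parameter block.

```python
# Sketch of the Type-Length-Value container DCBX parameters ride in.
import struct

def org_specific_tlv(oui: bytes, subtype: int, payload: bytes) -> bytes:
    body = oui + bytes([subtype]) + payload
    header = (127 << 9) | len(body)        # type 127, length in bytes
    return struct.pack("!H", header) + body

IEEE_8021_OUI = b"\x00\x80\xc2"            # OUI used by IEEE 802.1 TLVs
tlv = org_specific_tlv(IEEE_8021_OUI, subtype=0x09, payload=b"\x01")
print(tlv.hex())                           # fe050080c20901
```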
Congestion Notification. IEEE 802.1Qau provides the Congestion Notification method for end-to-end flow control. Congestion Notification lets a congested endpoint signal an ingress port to slow its delivery rate. When the congestion ends, the endpoint tells the ingress port that it can increase its transmission rate again. This heads off dropped packets in iSCSI and FCoE by temporarily reducing speed before congestion can cause switch buffer overflows and consequent dropped packets.
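Conceptually, the sending side behaves like a rate limiter under feedback. The sketch below shows the general back-off-and-recover shape; the constants and severity scaling are invented, and 802.1Qau's actual QCN algorithm derives its feedback value from switch buffer occupancy.

```python
# Sketch of a Congestion Notification "reaction point": a sender-side
# rate limiter that backs off when congestion messages arrive and creeps
# back up once they stop.
class ReactionPoint:
    def __init__(self, line_rate_gbps: float = 10.0):
        self.line_rate = line_rate_gbps
        self.rate = line_rate_gbps

    def on_congestion_notification(self, severity: float) -> None:
        """Multiplicative decrease, scaled by signaled severity (0..1)."""
        self.rate = max(0.5, self.rate * (1.0 - severity / 2))

    def on_quiet_interval(self) -> None:
        """Additive recovery once notifications stop arriving."""
        self.rate = min(self.line_rate, self.rate + 1.0)

rp = ReactionPoint()
rp.on_congestion_notification(0.8)          # heavy congestion upstream
print(f"throttled to {rp.rate:.1f} Gbps")   # 6.0
rp.on_quiet_interval()
print(f"recovering at {rp.rate:.1f} Gbps")  # 7.0
```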
Shortest Path Bridging. Normally, STP finds only a loop-free path, not an efficient one. The IEEE 802.1aq enhancement (which builds on the Multiple Spanning Tree variant, MSTP) uses a link-state protocol, IS-IS, rather than broadcast BPDUs, to determine the shortest paths between endpoints across the Ethernet network. Because link-state changes are communicated between neighboring devices rather than broadcast to all switches as BPDUs are, the traffic impact of topology changes is much smaller. This has the combined advantage of greatly reducing packet loss during tree recalculations and converging more quickly on a shortest-path solution.
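The shortest-path calculation at the heart of a link-state protocol such as IS-IS is Dijkstra's algorithm run over the topology database. A minimal sketch, with invented switch names and link costs:

```python
# Dijkstra's algorithm over a link-state topology database.
import heapq

topology = {  # switch -> {neighbor: link cost}
    "sw1": {"sw2": 1, "sw3": 4},
    "sw2": {"sw1": 1, "sw3": 1, "sw4": 5},
    "sw3": {"sw1": 4, "sw2": 1, "sw4": 1},
    "sw4": {"sw2": 5, "sw3": 1},
}

def shortest_paths(source: str) -> dict:
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry; a shorter path was found already
        for nbr, cost in topology[node].items():
            if d + cost < dist.get(nbr, float("inf")):
                dist[nbr] = d + cost
                heapq.heappush(heap, (d + cost, nbr))
    return dist

print(shortest_paths("sw1"))  # {'sw1': 0, 'sw2': 1, 'sw3': 2, 'sw4': 3}
```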
CEE is a great step forward in network convergence, but not the ultimate solution (if there can even be one). It’s worth noting, though, that IBM’s Flex System Fabric, the set of products that provide PureFlex networking, sports only a single device with CEE support: the CN4093 10Gb Converged Scalable Switch (which can aggregate up to four 10Gb ports to create 40Gb links). If you’re building a converged network using all-IBM components, this device is your primary building block. Note also that CEE is disabled by default on the CN4093; be sure to turn it on!
But nothing says you have to use IBM switching infrastructure with PureFlex. Products from vendors such as Brocade, Juniper, and NEC also support DCB, the equivalent convergence suite to CEE. And these also support a valuable feature that IBM’s products don’t yet support: multiple paths.
Exploiting Multiple Paths
STP creates a single loop-free path in a network of switches having multiple redundant links. This wastes network capacity by preventing redundant links from being used. You can skate around this limitation by using the STP variant MSTP to assign VLANs to specific links, which lets you load-balance VLANs, after a fashion, to exploit redundant paths. MSTP gets messy, though, as the network grows, and requires diligent traffic engineering to avoid link saturation.
A key replacement technology for STP and MSTP is TRILL (Transparent Interconnection of Lots of Links). TRILL, developed by STP's original inventor, Radia Perlman, arose directly from the aforementioned Beth Israel disaster. The new protocol, although not yet fully through the Internet Engineering Task Force (IETF) standards process, is solid enough to be deployed in Ethernet SAN environments, and several vendors are exploiting its capabilities in current converged network switches.
TRILL matches three of CEE's benefits. First, it routes traffic along the shortest Layer 2 path between two nodes, a vast improvement over STP's non-optimal path calculations (TRILL uses the same IS-IS protocol as CEE). Second, TRILL's link-state topology convergence performs as well as CEE's link-state algorithm. Third, TRILL's maximum network diameter (the bridge “hop count”) is much higher than STP's low limit of seven, enabling larger networks without risk of inadvertently exceeding the protocol's intrinsic hop limit.
But TRILL has one huge advantage over CEE: it exploits all available paths, including redundant ones, spreading traffic across all available backbone capacity and reducing link congestion. When combined with other DCB enhancements, such as PFC and ETS, TRILL creates a true lossless Ethernet fabric that the entire data center can exploit.
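One common way a switch spreads flows across equal-cost paths is to hash each flow's identifiers to select a next hop, so a given flow stays on one path (preserving packet order) while the population of flows spreads across all paths. A sketch, with invented flow tuples and next-hop names:

```python
# Equal-cost multipath (ECMP) selection by flow hashing: each flow maps
# deterministically to one of several equal-cost next hops.
import hashlib

equal_cost_next_hops = ["rbridge-a", "rbridge-b", "rbridge-c"]

def pick_path(src_mac: str, dst_mac: str, vlan: int) -> str:
    key = f"{src_mac}|{dst_mac}|{vlan}".encode()
    digest = hashlib.sha256(key).digest()
    return equal_cost_next_hops[digest[0] % len(equal_cost_next_hops)]

# Three hypothetical VM flows to the same destination spread across paths.
for vm in ("00:0c:29:aa:00:01", "00:0c:29:aa:00:02", "00:0c:29:aa:00:03"):
    print(vm, "->", pick_path(vm, "00:0c:29:ff:00:09", vlan=100))
```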
To appreciate the advantages of a DCB-based Ethernet SAN fabric, it helps to examine Ethernet’s STP shortcomings more closely.
Figure 1 shows a traditional Ethernet network containing the redundant paths needed for resilience and protection from single points of failure. STP has disabled all redundant interfaces on each switch, leaving but one loop-free path. Packets can't traverse the disabled redundant switch interfaces, forcing them to travel through more devices than necessary. This inevitably slows performance and leads to congestion as traffic rates increase.
Figure 1. Traditional Ethernet with redundant links disabled, resulting in congestion
Consider now a TRILL-based Ethernet (Figure 2). TRILL keeps redundant links active and routes traffic using the shortest path for each VM. TRILL supports multiple equal-cost paths between endpoints, so simultaneous user demands on the network are spread across the entire network infrastructure.
Figure 2. TRILL-enabled Ethernet spreading bandwidth across multiple redundant links
Plan Your Strategy
Lossless Ethernet is a boon for application, iSCSI, and FCoE transport, enabling convergence to a single transport medium supporting what were previously independent networks. For legacy FC, a new class of network interface replaces the traditional FC Host Bus Adapter. Called the Converged Network Adapter (CNA), these interfaces integrate the FC protocol with 10 Gbps Ethernet technology, letting you migrate legacy FC to a new DCB-based FCoE fabric. IBM's PureFlex compute nodes support CNAs, letting you mix and match traffic to available bandwidth throughout your private cloud.
You should aim to move from lossy Ethernet to a lossless form as soon as possible. VM workloads are increasing the stress on legacy networks, and in any event you won't be able to easily scale your legacy, lossy Ethernet infrastructure. You can go lossless today in PureFlex with IBM's CEE-enabled CN4093 10Gb Converged Scalable Switch. Or you can go third-party to achieve a TRILL-enabled infrastructure as soon as possible. It's only a matter of time before inevitable traffic growth reaches standard Ethernet's tipping point, at which time it will be too late to fix the problem inexpensively.