Mel Beckman discusses workload management and explains how IBM’s Power Systems were engineered to provide high-quality workload management tools.
If memory, CPU, network, and storage capacity were infinite, and users were infinitely patient, IT technologists would never have to worry about managing workloads to fit available capacity and performance needs. Alas, there never seems to be quite enough physical compute resources or user patience. Left to their own devices, the applications we manage would gradually slow down to a stodgy pace, at which point user patience would approach zero.
The solution to the problem of finite resources and seemingly unlimited user demand is workload management: the explicit allocation of resources, on a priority basis, to applications in order to meet specific performance and reliability objectives. IBM’s Power Systems were engineered from the ground up to provide high-quality workload management tools. Some of these tools are specific to each OS platform (i, AIX, and Linux); others are integrated into PowerVM, the hardware-based hypervisor that now underpins all Power-based machines. Depending on your environment, you might be able to exploit several—or all—of the available workload management tools. To use these capabilities, however, you must know what tools exist, and the pros and cons of each.
Objectives, Planning, and Instrumentation
Any workload management process must start with objectives. There’s no single “optimum” workload management methodology, and thus no single set of objectives for every environment. You might have specific performance objectives for high-priority applications, or a cost objective for energy consumption, or an uptime objective of 99.999 percent. Objectives are usually related to the importance of each application you administer. For example, email, customer relationship management (CRM), and materials requirements planning (MRP) are typically mission-critical applications, whereas business intelligence and accounting are less performance-sensitive. To begin workload management planning, identify all the objectives you want to meet and rank your applications in the order you want them to perform.
Don’t be concerned if the objectives you want aren’t realized in your current system. One benefit of workload management is that it improves the performance of existing runtime environments while reducing costs. It’s quite possible that unbalanced resource utilization, unused islands of capacity, unnecessary resource contention, or a combination of all three pathologies hampers your current system performance. However, you want to be able to measure the results of any workload management process so that you can demonstrate improvements. To do that, you must have good instrumentation in place.
To track gross performance improvements across a system, you need to monitor only a few key metrics: CPU, memory, and network utilization; I/O operations per second (IOPS); and application response times. Setting up instrumentation for these isn’t complex, but it’s outside the scope of this article. You can set up basic ad-hoc monitoring for each workload using the tools enumerated in Table 1.
Table 1: Tools for Monitoring Resources in PowerVM
Or you can monitor them more conveniently in IBM’s Systems Director (see “Meet Your Data Center Sidekick: IBM Systems Director 6.3”), which provides historical graphing (Figure 1) and alerting. You can learn about these instrumentation techniques in the IBM Redbook “IBM PowerVM Virtualization Managing and Monitoring.”
Figure 1: Example Systems Director Performance Measurement Graph
Once you have a monitoring scheme up and running, you should run it across all major production periods (e.g., a day, a week, a month, month-end) to establish performance baselines against which you can measure your workload management changes.
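If you want a quick, scriptable starting point before committing to a full monitoring product, a minimal sampler along the following lines can feed your baseline. This Python sketch shells out to vmstat (present on both AIX and Linux) once a minute and appends the idle-CPU figure to a CSV file you can graph later; the sampling interval, sample count, and output filename are arbitrary choices you’d adjust to your environment.

import csv
import subprocess
import time
from datetime import datetime

# Ad-hoc CPU sampler: run vmstat, locate the idle-CPU ("id") column by
# name so the sketch tolerates the slightly different vmstat layouts on
# AIX and Linux, and log one sample per interval to a CSV baseline file.
def sample_cpu_idle() -> float:
    out = subprocess.run(["vmstat", "1", "2"], capture_output=True,
                         text=True, check=True).stdout
    lines = out.strip().splitlines()
    header = next(l.split() for l in lines if "id" in l.split())
    return float(lines[-1].split()[header.index("id")])

def collect_baseline(path="cpu_baseline.csv", interval_s=60, samples=60):
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        for _ in range(samples):
            writer.writerow([datetime.now().isoformat(), sample_cpu_idle()])
            f.flush()
            time.sleep(interval_s)

if __name__ == "__main__":
    collect_baseline()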
The essence of workload management is moving workloads and resources around to marry up needy (and entitled) applications with the resources necessary to meet your objectives. For example, if you have an objective of sub-second response time on an eCommerce application, and that objective isn’t being met, you can add the appropriate resource—CPU, memory, or network capacity—to open whatever bottleneck is constraining performance. The instrumentation described in the previous section will let you identify the resource that requires relief.
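To make the bottleneck hunt concrete, here’s a toy Python sketch of the comparison logic: given baseline and current utilization figures for each resource (expressed here as percentages of measured capacity), it flags the metric that has drifted furthest from its baseline. The metric names and numbers are purely illustrative.

# Toy bottleneck finder: the resource whose utilization has climbed the
# most above its baseline is the first candidate for relief.
def find_bottleneck(baseline: dict, current: dict) -> str:
    drift = {metric: current[metric] - baseline[metric] for metric in baseline}
    return max(drift, key=drift.get)

baseline = {"cpu": 45.0, "memory": 60.0, "network": 20.0, "iops": 30.0}
current = {"cpu": 92.0, "memory": 63.0, "network": 24.0, "iops": 35.0}
print(find_bottleneck(baseline, current))  # -> "cpu"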
Ideally, you could use one workload management toolset to accomplish all your goals, but the state of the art hasn’t quite arrived there. For now, you’ll use a mix of tools, some specific to the platform you’re running—AIX, i, or Linux—and some cross-platform tools. Let’s look at workload management at the platform level first. It’s easiest to understand the big picture in light of the history of each platform converging on the Power Systems architecture, so we’ll look at them in that order.
The earliest workload management tools originated with the i platform, in the form of logical partitions. An LPAR is a virtual machine (VM) dedicated to a single OS, which in the beginning was just OS/400 (as IBM i was then known). Originally, the LPAR hypervisor ran in a dedicated “primary” OS/400 partition, which managed subsidiary LPARs using OS/400 commands. Beginning with Power5 and i5/OS, the hypervisor (today called PowerVM) became independent from all OSs, running in its own dedicated, protected memory. Over time, IBM added support for Linux LPARs, and ultimately extended the concept to AIX. Today, IBM eschews the term LPAR, preferring the more generic virtual server, but for our discussion I’ll stick to the LPAR nomenclature. (An LPAR is a VM. Why create yet another term?)
The word “partition” means “part of a whole,” but that’s not quite how LPARs work in the Power world: as you’ll see, LPARs can borrow and share resources rather than strictly dividing them. Originally, you managed LPARs from within IBM i, but on modern Power systems (Power6 and Power7) you manage them from the Hardware Management Console (HMC), the Integrated Virtualization Manager (IVM), or the Systems Director Management Console (SDMC).
All of PowerVM’s management interfaces let you slice out a dedicated portion of CPU cycles for each LPAR. Two or more LPARs can also share CPU, which is useful for related application components such as a web server and a database server. The former approach guarantees a specific performance level, whereas the latter optimizes utilization. You choose the approach that works best for your objectives.
But wait, there’s more! Instead of specifically configuring LPARs to share CPUs, you can assign LPARs to a CPU pool. In this scenario, you allocate a CPU “entitlement” to each LPAR, which it’s guaranteed to have available when needed. If an LPAR goes over its entitlement, it can “borrow” unused CPU cycles from the pool. Any CPU cycles an LPAR doesn’t need are “donated” to the pool for others to use.
PowerVM doles out CPU capacity in virtual processors, which can be allocated in tenths of a physical processor—a capability called micropartitioning. You allocate a number of virtual processors to each LPAR, which dictates the degree of symmetric multiprocessing the OS sees in hardware, and you set the micropartition capacity, in tenths of a processor, across all the virtual processors. For a single virtual processor, the micropartition capacity can range from 0.1 to 1.0. For three virtual processors, the capacity can range from 0.3 to 3.0. The number of virtual processors in a micropartition isn’t necessarily related to the number of physical processors in the machine: a single physical processor can support up to 10 virtual processors. PowerVM manages this “overbooking” of physical processors by using wait time (e.g., when a processor is blocked awaiting I/O completion) to spread unused CPU cycles among virtual processors.
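Those rules are easy to check mechanically. Here’s a minimal Python sketch that validates a proposed micropartition configuration against the constraints just described: 0.1 to 1.0 processing units per virtual processor, and at most 10 virtual processors per physical processor. It’s a paper check only; PowerVM enforces its own limits at configuration time.

# Paper check of a micropartition configuration against the rules above.
def check_micropartition(virtual_procs: int, capacity: float,
                         physical_procs: int) -> None:
    if not (0.1 * virtual_procs <= capacity <= 1.0 * virtual_procs):
        raise ValueError("capacity out of range for this many virtual processors")
    if virtual_procs > 10 * physical_procs:
        raise ValueError("too many virtual processors for the physical processors")

# Three virtual processors with 1.5 processing units on a 2-way machine
# passes: 1.5 falls between 0.3 and 3.0, and 3 VPs is well under 20.
check_micropartition(virtual_procs=3, capacity=1.5, physical_procs=2)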
The shared processor pool consists of all physical processors not already assigned to a CPU-dedicated LPAR, so if a dedicated LPAR shuts down, its CPU capacity is automatically donated to the pool. PowerVM manages sharing automatically: until CPU capacity is saturated, each workload gets dedicated processing power, so you pay no penalty for putting workloads in a pool until the pool is full, at which point PowerVM carefully manages the penalty using the borrow/donate procedure described earlier. In any event, every LPAR can always use its minimum entitled CPU capacity.
In addition to setting a minimum entitlement for each LPAR, you can also set an upper limit, called a cap, on CPU usage. For example, you could allocate a 1.5-CPU entitlement to an LPAR with a 3.5-CPU cap. Capping prevents a runaway LPAR from consuming all the available CPU resources. Even if you don’t cap an LPAR, you can give it a “share weight,” which represents the workload’s pecking order in the sharing process. The higher the weight, the higher priority the LPAR has when competing for spare CPU cycles.
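To illustrate the pecking-order idea, here’s a toy Python sketch that divides spare pool capacity among competing uncapped LPARs in proportion to their share weights. Real PowerVM dispatching is far more sophisticated, and the LPAR names and weights here are made up.

# Toy model: spare processing units go to uncapped LPARs in proportion
# to share weight, so higher-weight workloads win more of the surplus.
def share_spare(spare_units: float, weights: dict) -> dict:
    total = sum(weights.values())
    return {lpar: round(spare_units * w / total, 2)
            for lpar, w in weights.items()}

print(share_spare(2.0, {"prod_db": 192, "web": 128, "test": 32}))
# {'prod_db': 1.09, 'web': 0.73, 'test': 0.18}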
How do you use LPAR CPU controls in the real world? Suppose you have a machine with four LPARs monitored by IBM Systems Director (Figure 2).
Figure 2: Systems Director LPAR overview showing CPU utilization spike
Suddenly, Systems Director issues an alert that the average CPU for one of the partitions (the LPAR running IBM i, the blue line in Figure 2) has spiked. If that LPAR is running from a shared pool, it will borrow unused CPU cycles from other LPARs (up to any CPU cap specified) to satisfy the demand—you don’t need to do anything. The borrowed cycles let the IBM i LPAR retain its performance level. However, if the IBM i LPAR isn’t in a shared pool, its CPU usage will be capped at its original fixed allocation and performance will suffer. This is where manual workload management comes into play. You can dynamically increase the CPU allocation to the IBM i LPAR to reduce contention, either by assigning unallocated physical processors or by moving processor allocations (in tenths of a physical processor) out of one or more shared pools. By moving CPU resources this way, you can give priority applications the compute power they need to meet your objectives.
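If you find yourself making that kind of adjustment often, it’s scriptable. The Python sketch below drives the HMC’s chhwres command (the HMC CLI for dynamic LPAR resource changes) over ssh; the HMC hostname, managed-system name, and LPAR name are hypothetical placeholders, and it assumes ssh key access as the hscroot user.

import subprocess

# Dynamically add processing units to an LPAR by running chhwres on the
# HMC over ssh. All names below are placeholders for your environment.
def add_proc_units(hmc: str, system: str, lpar: str, units: float) -> None:
    cmd = f"chhwres -r proc -m {system} -o a -p {lpar} --procunits {units}"
    subprocess.run(["ssh", f"hscroot@{hmc}", cmd], check=True)

add_proc_units("hmc01", "Server-8233-E8B-1", "IBMI_PROD", 0.5)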
CPU allocations are the primary, but not the only, means for work management in LPARs. There are several other knobs and dials, such as simultaneous multi-threading, automatic sparing, and on-demand capacity, that add even more flexibility to LPAR management.
If you run AIX workloads, you have two other workload management tools at your disposal: Workload Partitions (WPARs) and Active Memory Expansion (AME). I won’t go into great detail about these mechanisms here, but you can read about them in two IBM resources: the IBM publication “Exploiting IBM AIX Workload Partitions” and the IBM DeveloperWorks Wiki “IBM Active Memory Expansion.”
AIX WPARs are a software-based form of virtualization commonly called “container” virtualization. Rather than virtualizing the physical processor, containerization shares a single copy of an OS—in this case AIX—among several workloads, with each workload getting a self-contained environment that looks much like a dedicated OS instance. Thus, WPARs let you subdivide an existing AIX LPAR into smaller units for more precise AIX workload management. An AIX OS component called the Workload Manager (WLM) distributes CPU and memory resources according to policies you configure. The WLM runs in a primary WPAR called the System WPAR, whereas individual application workloads run in Application WPARs. All WPARs share the core system services, such as inetd and cron, and a single file system, so WPAR isn’t a fully transparent virtualization model.
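As a quick taste of how lightweight WPARs are to work with, the sketch below runs the standard AIX WPAR commands—mkwpar, startwpar, and wparexec—from Python on the hosting AIX LPAR itself. The WPAR names and application path are illustrative.

import subprocess

# Create and boot a system WPAR, then run one command inside a
# throwaway application WPAR. Run this on the hosting AIX LPAR.
def make_system_wpar(name: str) -> None:
    subprocess.run(["mkwpar", "-n", name], check=True)   # create it
    subprocess.run(["startwpar", name], check=True)      # boot it

def run_app_wpar(name: str, command: str) -> None:
    # wparexec wraps a single command in a temporary application WPAR
    subprocess.run(["wparexec", "-n", name, command], check=True)

make_system_wpar("testenv")
run_app_wpar("batchjob", "/usr/local/bin/nightly_batch")  # placeholder path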
WPARs are a lightweight way to gain AIX-specific workload management capabilities when you must support several AIX environments, such as production and test systems. One slick feature of WPARs is WPAR Partition Mobility, which lets you move an Application WPAR from one AIX LPAR to another. The destination LPAR can even be on a different physical Power Systems host. Because WPAR isolation isn’t perfect, however, it’s not well suited to multi-tenancy environments where workloads must be totally insulated from each other.
AME provides a means for enlarging the apparent physical memory allocated to an AIX LPAR by internally compressing unused memory regions. With AME, a 16GB memory partition could be made to look and operate like a 24GB partition. You might be asking, “Hey, doesn’t PowerVM memory virtualization already do this?” The answer is yes, it does, but AME takes advantage of seldom-used, highly compressible memory areas by exploiting internal knowledge the OS has about memory usage and contents, whereas PowerVM memory virtualization considers only usage, swapping least recently used memory out to disk. For example, AME can identify large areas of memory containing only zeros and “compress” them down to practically nothing. AME compression is completely transparent and requires no management beyond initial configuration. Theoretically, it can improve memory utilization by up to 100 percent, although 50 percent is a more realistic expectation.
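The arithmetic behind that example is simple: AME is configured with a memory expansion factor, which is just the target (expanded) memory divided by the true physical memory. A few lines of Python make the 16GB-to-24GB example concrete.

# AME expansion factor = expanded (apparent) memory / true physical memory.
def expansion_factor(true_gb: float, expanded_gb: float) -> float:
    return expanded_gb / true_gb

print(expansion_factor(16, 24))  # 1.5, i.e., a 50 percent improvement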
Active Memory Sharing
PowerVM has its own way to optimize memory utilization among LPARs: Active Memory Sharing (AMS), a feature available on Power6 and Power7 platforms. Normally, each LPAR has its own dedicated slice of physical memory, and PowerVM manages that memory using memory virtualization within each LPAR, paging out less-used memory to disk when an LPAR’s memory allocation is overcommitted. This paging hurts performance, but it’s better than letting applications crash due to memory exhaustion. If paging within an LPAR becomes too much of a burden, you can manually add more physical memory using PowerVM’s dynamic memory reconfiguration feature, which lets you take memory from one LPAR and allocate it to another. But that’s a manual process and, as such, one that systems administrators like to avoid.
To work around this problem, it’s common to over-allocate physical memory to an LPAR, delaying the onset of severe paging as much as possible. But this practice can lock a considerable amount of unused physical memory into one LPAR, making it unavailable to other LPARs. AMS lets LPARs share a common pool of physical memory, giving you the best of both worlds: physical memory that’s available for future needs but not locked in to a single LPAR.
AMS not only lets LPARs share physical memory, it increases the number of LPARs a given Power system can support. For example, a machine with 48GB of available physical memory could support only six 8GB dedicated LPARs but could easily support sixteen 8GB AMS LPARs, assuming that not all the LPARs require all their memory simultaneously. You can read all the details of AMS in the IBM Redbook “IBM PowerVM Virtualization Active Memory Sharing.”
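Here’s the sizing arithmetic from that example worked out in a few lines of Python. The numbers are illustrative; the key constraint is that the combined working sets of the AMS LPARs must fit within physical memory.

# Worked sizing example: 48GB physical memory, 8GB logical per LPAR.
physical_gb = 48
logical_gb_per_lpar = 8

dedicated_lpars = physical_gb // logical_gb_per_lpar        # 6 LPARs
ams_lpars = 16
overcommit = ams_lpars * logical_gb_per_lpar / physical_gb  # ~2.67x
avg_working_set_cap = physical_gb / ams_lpars               # 3GB per LPAR
print(dedicated_lpars, round(overcommit, 2), avg_working_set_cap)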
A recent enhancement to AMS is Active Memory Deduplication (AMD), which works at the physical memory page level to detect and remove duplicate memory pages in AMS configurations. LPARs sharing a deduped page access the same physical page in memory, eliminating the redundant memory usage and freeing physical memory for reuse.
Duplicate pages are more common than you might think. For example, read-only machine code on LPARs running the same OS will often have many duplicates, and its contents will never change, making it an excellent candidate for memory deduplication. Dedup isn’t limited to read-only pages, either. Large areas of zeroed memory are common in most OSs—especially those that take care to clear memory for security purposes—and all those zeroed pages can be reduced to a single page using AMD. When one LPAR writes to that page, AMD invokes the PowerVM copy-on-write (CoW) process to give the writer its own copy of the deduped physical page. The copy becomes a regular page, accessible only to the LPAR that modified it.
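To see how dedup and copy-on-write interact, here’s a toy Python model: identical pages collapse to a single stored copy, and a write forces a private copy for the writer. Real AMD operates on physical pages inside PowerVM; this only illustrates the mechanism.

import hashlib

# Toy dedup store: pages maps content digests to one stored copy; refs
# maps each (LPAR, page number) to the digest it currently points at.
class DedupStore:
    def __init__(self):
        self.pages = {}
        self.refs = {}

    def map_page(self, lpar, page_no, data: bytes):
        digest = hashlib.sha256(data).hexdigest()
        self.pages.setdefault(digest, data)   # duplicates share one copy
        self.refs[(lpar, page_no)] = digest

    def write_page(self, lpar, page_no, data: bytes):
        # Copy-on-write: the writer's reference moves to a private copy.
        self.map_page(lpar, page_no, data)

store = DedupStore()
zero = bytes(4096)
store.map_page("lpar1", 0, zero)
store.map_page("lpar2", 0, zero)            # deduped: one stored page
print(len(store.pages))                     # 1
store.write_page("lpar2", 0, b"x" * 4096)   # CoW gives lpar2 a new page
print(len(store.pages))                     # 2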
AMS with AMD is a great way to dramatically improve memory utilization in a Power system, but there’s another benefit: AMD can stave off virtual memory paging in several LPARs at once. This is a huge win for the entire machine because it also reduces disk IOPS.
Although AMS runs on both Power6 and Power7 hardware, AMD works only with Power7. You can read more about AMD in the IBM Redpaper “Power Systems Memory Deduplication.”
Suspend/Resume and Partition Mobility
Another powerful PowerVM workload management tool is LPAR Suspend/Resume, which lets you stop execution of a lower-priority LPAR to return CPU and memory resources to higher-priority partitions (Figure 3).
Figure 3: Performing an LPAR Suspend operation
The suspended LPAR’s active state and memory are written to disk. When you later resume the LPAR, the CPU and memory it requires must be available, either in unallocated resources or, if the LPAR is part of a CPU or AMS pool, in the pool inventory. PowerVM restores memory contents and the state of the LPAR and then picks up operation from the point where the LPAR was suspended. Chapter 2 in IBM’s publication “IBM PowerVM Virtualization Introduction and Configuration” describes Suspend/Resume in more detail.
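From the HMC command line, suspend and resume are one-liners via the chlparstate command. The Python sketch below wraps them over ssh; the HMC, managed-system, and LPAR names are hypothetical placeholders.

import subprocess

# Suspend or resume an LPAR by running chlparstate on the HMC over ssh.
def lpar_state(hmc: str, system: str, lpar: str, op: str) -> None:
    cmd = f"chlparstate -m {system} -o {op} -p {lpar}"
    subprocess.run(["ssh", f"hscroot@{hmc}", cmd], check=True)

lpar_state("hmc01", "Server-8233-E8B-1", "batch_lpar", "suspend")
# ...later, when the higher-priority work has finished:
lpar_state("hmc01", "Server-8233-E8B-1", "batch_lpar", "resume")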
Suspend/Resume comes with a couple of caveats. First, any external systems communicating with applications running in the LPAR must be able to tolerate the LPAR’s failure to respond. This communication could be through shared disk storage or across network connections. Any communicating partner should be configured either to wait for the LPAR to return to service or to continue operating without it. Second, TCP/IP sessions in progress within the LPAR’s applications are likely to end abnormally, so the applications should be configured to automatically restore any connections they need. This includes 5250 emulation sessions (although for the most part the default behavior for 5250 emulation is to retry connections and restore them when possible).
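As a simple illustration of the “automatically restore connections” advice, here’s a Python sketch of a client that reconnects with exponential backoff, which rides out a suspended peer without operator intervention. The hostname and port are placeholders.

import socket
import time

# Reconnect with exponential backoff: keep trying while the peer LPAR
# is suspended, without hammering the network.
def connect_with_retry(host: str, port: int, max_delay: float = 300.0):
    delay = 1.0
    while True:
        try:
            return socket.create_connection((host, port), timeout=10)
        except OSError:
            time.sleep(delay)
            delay = min(delay * 2, max_delay)

conn = connect_with_retry("erp.example.com", 8471)  # placeholder endpoint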
You don’t have to resume a suspended LPAR on the same physical host it was suspended from. A PowerVM capability called Partition Mobility lets you migrate the suspended LPAR to any Power system sharing a SAN with the original one, provided the destination machine has adequate CPU, memory, and network resources. You can also choose to shut down the suspended LPAR, which deletes the stored state and sets the LPAR to a powered-off condition. Note, however, that this is a rather drastic action equivalent to an unexpected power loss with no graceful shutdown of the OS.
Live Partition Mobility
PowerVM has one final workload management wonder in its kit: Live Partition Mobility (LPM). Similar to Partition Mobility, LPM lets you suspend an LPAR and restore it on a different physical machine. However, it does so transparently, with no downtime and no loss of connectivity. Interactive users might experience a few seconds’ delay during the live migration, but otherwise performance is unaffected. LPM is useful for re-balancing workloads between two or more machines, for moving workloads off a machine to accommodate maintenance downtime, or for rearranging workloads to accommodate policy changes such as security isolation for compliance purposes.
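Driven from the HMC CLI, a live migration is a validate step followed by a migrate step, both via the migrlpar command. The Python sketch below wraps the two calls over ssh; machine and LPAR names are hypothetical placeholders.

import subprocess

# Validate, then perform, a live migration by running migrlpar on the
# HMC over ssh. The -o v operation validates; -o m migrates.
def live_migrate(hmc: str, source: str, target: str, lpar: str) -> None:
    ssh = ["ssh", f"hscroot@{hmc}"]
    args = f"-m {source} -t {target} -p {lpar}"
    subprocess.run(ssh + [f"migrlpar -o v {args}"], check=True)  # validate
    subprocess.run(ssh + [f"migrlpar -o m {args}"], check=True)  # migrate

live_migrate("hmc01", "Server-8233-E8B-1", "Server-8233-E8B-2", "web_lpar")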
The terminology taxonomy gets more complex with LPM. First, IBM renamed Partition Mobility to Suspended Partition Mobility (SPM). IBM calls the actual process of moving a live partition Active Partition Mobility (APM), although the PowerVM feature is still (somewhat confusingly) called LPM. That change was made to accommodate an LPM capability that IBM calls Inactive Partition Mobility, which transfers an LPAR that’s powered down from one system to another.
LPM was originally released with support only for AIX and Linux LPARs. But in early 2012, IBM shipped IBM i 7.1 Technology Refresh 4 (PTF group SF99707), which adds LPM support for IBM i LPARs. You can read the details of LPM in the IBM Redbook “IBM PowerVM Live Partition Mobility.”
The Massive Workload Management Toolbox
PowerVM’s entire suite of workload management features—LPARs, WPARs, AME, AMS, AMD, Suspend/Resume, Partition Mobility, and LPM—constitutes a truly comprehensive toolbox for keeping your Power Systems in line with your performance objectives. Not all these features are available on all Power hardware platforms, or with all versions of PowerVM. For the latest information on workload management feature system requirements, refer to Chapter 1 in IBM’s Redbook “IBM PowerVM Virtualization Managing and Monitoring.” This tome also discusses such workload management considerations as network sharing, storage management, and automation of workload management processes. Now that you know about the PowerVM workload management tools at your disposal, and how they can help you shuffle resources to meet almost any performance objective, you’re well equipped to start putting these tools to work in your shop.
Editor's Note: This article originally appeared in the October 2012 digital issue of POWER IT Pro.