Q. When it comes to disaster recovery (DR) planning and testing, is there a one-size-fits-all approach? Or do approaches vary greatly depending on the size of the organization, the business objectives, and the operating systems in place? Do DR approaches vary for each of the major IBM POWER operating systems—IBM i, AIX, and Linux?
A. Every company should have a disaster recovery plan and be able to answer the following questions: Is our plan current, complete, and comprehensive? Has the plan been tested within the past 12 months? The DR plan itself will vary by company, as each organization has its own set of recovery objectives: How quickly do we need to be able to recover? How much data can we afford to lose? Which of our systems are most critical for recovery?
Larger companies tend to have more complex DR plans because there is more to coordinate, including systems and staff across multiple locations. Smaller companies may have to work within a smaller DR budget or need a higher degree of outside assistance with DR planning because they don’t have in-house expertise. All organizations want to ensure that company executives support the DR plan and understand what it can deliver.
Good DR planning methods do not necessarily vary when working with different operating systems. Good planning is good planning. However, the backup and recovery technologies used in the plan will often vary because some are better suited to support particular operating systems than others. Knowledge of how to recover specific operating systems or applications also matters significantly to the likelihood of a successful recovery. If you’re comparing service providers, be sure to ask if they have experience with your operating system, such as IBM i, AIX, Linux, or Windows.
Q. When talking about DR, are there any emerging technologies (say, from IBM i, AIX, or Linux), trends, or resources that are changing today's options or preferred strategies? These days, it seems as if organizations are more at risk of downtime of any kind because their systems simply need to be available for so many parts of their businesses 24x7. How is DR evolving to meet these needs and ensure companies are up and running even in the midst of major disasters?
A. Because technology supports nearly every aspect of company operations today, organizations need to be able to recover much more quickly than ever before, or risk significant damage to the business. We’re seeing many more companies with recovery time objectives of four to six hours rather than days. DR must continuously evolve to meet the growing demands of organizations, and there are several emerging technologies that we’re excited to explore in the near future. One of these technologies is IBM Power Systems Live Partition Mobility (LPM), which lets you move a running partition from one Power Systems server to another without shutting it down. LPM is available on Power Systems for IBM i, AIX, and Linux.
Companies are also beginning to realize the importance of being able to recover the systems they are using to access the data, not just the system that holds the data. To address this, companies are looking for ways to replicate all of their supporting systems: file servers, business intelligence applications, web servers, everything.
We also see many companies replicating their email in real time (or taking it to the cloud) in case disaster strikes because email has become such an important tool for communicating with customers, internally, and with key partners.
Q. What about compliance and security issues when it comes to meeting backup and recovery objectives? In other words, what are some of the best practices for conducting audits of major operating systems including IBM i, AIX, and Linux?
A. The most crucial piece of advice I would give in terms of conducting and maintaining a good audit is to test your DR plan regularly. This contributes to audit requirements in a few ways. First, testing includes making sure all of your critical systems are covered by your DR plan. Companies add and remove systems throughout the year, so DR testing provides an opportunity to validate the accuracy and completeness of your system inventory. Testing also validates recoverability: You can attest to your ability to recover, not just hope that you can. Finally, compliance means that you can address the company’s stated objectives, whether set internally or influenced by external regulations. Testing confirms not only recoverability, but also your ability to get back online within your stated recovery time and recovery point objectives.
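As a rough illustration of the two checks described above (inventory coverage, and recovery within stated objectives), here is a minimal Python sketch. Everything in it is assumed for the example: the system names, the TestResult record, and the four-hour and one-hour objectives are hypothetical and do not come from any particular DR tool.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class TestResult:
    """Outcome of one recovery test (hypothetical record layout)."""
    system: str
    recovery_time: timedelta  # measured time to bring the system back online
    data_loss: timedelta      # age of the newest data that could be restored


def untested_systems(inventory, results):
    """Return systems in the current inventory that have no test result."""
    tested = {r.system for r in results}
    return sorted(set(inventory) - tested)


def missed_objectives(results, rto, rpo):
    """Return tested systems that exceeded the recovery time or point objective."""
    return sorted(
        r.system for r in results
        if r.recovery_time > rto or r.data_loss > rpo
    )


# Hypothetical inventory and test results, for illustration only.
inventory = ["erp-ibmi", "web-aix", "bi-linux"]
results = [
    TestResult("erp-ibmi", timedelta(hours=3), timedelta(minutes=30)),
    TestResult("web-aix", timedelta(hours=7), timedelta(minutes=10)),
]

print(untested_systems(inventory, results))
print(missed_objectives(results, rto=timedelta(hours=4), rpo=timedelta(hours=1)))
```

In this made-up data set, one system was added to the inventory but never tested, and one tested system took longer to recover than the stated recovery time objective. Those are the two kinds of gaps a regular test is meant to surface.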
Q. Testing seems to be one of the most talked about elements, and it remains a thorn that just won't go away. How important is it for DR plans to be tested? What are some of the best practices for testing a DR plan? And one last question here: How often should plans be tested?
A. It is absolutely critical for companies to test their DR plans at least once a year—more often if there are significant changes in infrastructure, changes to the business, or if the company has very short recovery objectives. If companies invest in a comprehensive DR plan but it sits on the shelf untested, they are squandering their investment because they have no idea if the plan still works. We’ve heard companies say they tested the plan five years ago, but since then they’ve added 33 servers to their environments—none of which have been tested.
Best practices for testing should include the following: Test your plan regularly. From a security standpoint, establish who has the authority to initiate the recovery, whether for IBM Power Systems, Windows systems, or others. Make sure the recovery team is in the best position to help you get back online in terms of their location relative to the recovery site, their expertise, and their understanding of their role in the recovery. In addition, do both passive and active DR testing. Walk through your plans to make sure they are complete and still make sense, but also perform actual recoveries.
Here is one final “gotcha” we commonly see that trips up many companies: Establish a process for reviewing the backup logs from your tape management/backup software (BRMS, BackupExec, ArcServe, TSM, etc.) so you can address any errors or missed files quickly and know your data is being backed up. Then, email the reports to an offsite location. In the event of a real disaster, these tape management reports are not going to be available if they reside on one of the systems that is down. If companies can’t get to the logs, they don’t know which tapes to recall or where the data they need to recover is located.
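A log review like this can be partly automated. The Python sketch below is only an illustration under stated assumptions: the log lines and the ERROR/FAILED/SKIPPED keywords are invented for the example and do not reflect the actual output format of BRMS, BackupExec, ArcServe, or TSM. Delivering the report to an offsite mailbox is left out because it depends on your mail setup.

```python
import re

# Assumed keywords that mark a problem line; real products use their own
# message IDs and formats, so adjust this pattern to match your logs.
PROBLEM = re.compile(r"\b(ERROR|FAILED|SKIPPED)\b", re.IGNORECASE)


def problem_lines(log_text):
    """Return the log lines that mention an error, failure, or skipped object."""
    return [line.strip() for line in log_text.splitlines() if PROBLEM.search(line)]


def build_report(system, log_text):
    """Build a short plain-text summary suitable for sending offsite."""
    problems = problem_lines(log_text)
    status = "OK" if not problems else f"{len(problems)} problem(s)"
    lines = [f"Backup report for {system}: {status}"]
    lines.extend(f"  {p}" for p in problems)
    return "\n".join(lines)


# Hypothetical log excerpt, for illustration only.
sample_log = """\
01:00 saved LIBA to volume T00123
01:05 ERROR object LIBB/FILE1 locked, not saved
01:09 skipped /tmp/scratch
01:10 backup complete
"""

print(build_report("erp-ibmi", sample_log))
```

The report keeps the offending lines verbatim, so whoever reads it offsite can see which objects were missed and on which volume the rest of the data landed.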