In addition to having servers that are fast and extremely modular, imagine having systems that anticipate and solve pressing problems: safeguarding workloads to prevent small errors from becoming more serious; automating many of the tasks that database administrators historically have had to execute manually; and ensuring enterprise-level system availability.
Superior reliability, availability and serviceability (RAS) features such as these can help you rest easy. X6 servers have three categories of exceptional RAS features: Memory Page Retire, Consumed Error Recovery, and Upward Integration Modules. Read a brief overview of each feature below and watch System x Senior Technical Staff Member Randy Kolvick provide an easy-to-understand description of the functions in the video, “RAS Features: System x X6 Servers.”
MEMORY PAGE RETIRE
What the RAS feature does: Memory Page Retire protects your X6 server against unnecessary repair actions and server restarts.
This RAS feature works much the way disks have handled bad sectors for a number of years: an errant sector is marked as “bad” so that it isn’t necessary to replace the disk. Memory Page Retire now handles server memory in much the same way.
Memory Page Retire is included in a patented algorithm that is smart enough to determine the instances in which different hardware and software features should be used to prevent repair actions and reboots. Thanks to the algorithm, the system determines whether reporting an error is the appropriate action or whether the error should be self-corrected and the server should be allowed to continue to run. Let’s say the system hits a correctable memory error threshold. Rather than issuing a predictive failure alert or taking other server actions, Memory Page Retire automatically instructs the operating system or hypervisor to mark the memory page as “bad” and move the data to another memory page. The system then automatically takes the bad memory page offline, and the other page is used in its place. Because X6 systems support massive amounts of memory (terabytes of it), losing a several-kilobyte memory page is inconsequential.
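To make the threshold idea concrete, here is a minimal simulation of the flow described above. It is only a sketch, not the patented algorithm or any real firmware interface: the `MemoryPool` class, the `report_correctable_error` method, and the threshold of 3 are all hypothetical names and values chosen for illustration.

```python
from collections import Counter

ERROR_THRESHOLD = 3  # hypothetical value; the real threshold is not given in this post


class MemoryPool:
    """Toy model of a pool of memory pages (illustrative only)."""

    def __init__(self, num_pages):
        self.free = list(range(num_pages))  # pages available for use
        self.in_use = {}                    # page number -> data stored there
        self.retired = set()                # pages marked "bad" and taken offline
        self.errors = Counter()             # correctable-error count per page

    def allocate(self, data):
        """Store data on the next free page and return that page number."""
        page = self.free.pop(0)
        self.in_use[page] = data
        return page

    def report_correctable_error(self, page):
        """Record one correctable error; return the page now holding the data."""
        if page not in self.in_use:
            raise KeyError(f"page {page} is not in use")
        self.errors[page] += 1
        if self.errors[page] < ERROR_THRESHOLD:
            return page                     # below threshold: keep running as-is
        # Threshold reached: move the data first, then retire the bad page.
        new_page = self.allocate(self.in_use[page])
        del self.in_use[page]
        self.retired.add(page)              # never returned to the free list
        return new_page


pool = MemoryPool(num_pages=4)
page = pool.allocate("payload")
for _ in range(ERROR_THRESHOLD):
    page = pool.report_correctable_error(page)
print(page, pool.retired)
```

The point of the sketch is the ordering: the data is relocated before the page is retired, so the workload never notices, and the retired page is simply never handed out again.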
Memory Page Retire also prevents correctable errors from becoming more serious over time, decreasing the likelihood of a system reboot.
CONSUMED ERROR RECOVERY
What the RAS feature does: It keeps the server running in the event of an error and avoids impacting all the other work in process; bad memory can be corrected without a server restart.
Consumed Error Recovery automatically takes charge when an uncorrectable error makes it all the way back to the CPU and into a running application. Although servers run hundreds of processes at the same time, Consumed Error Recovery stops and corrects only the process that is affected by the error. That particular application is then able to “retry,” or restart the process as needed. Other work in process is unaffected, so you can carry out the maintenance you’ve scheduled exactly when you’ve scheduled it!
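The containment behavior described above can be sketched as a small simulation. This is not Intel’s Machine Check Architecture or any operating system’s handler; the `Process` class and `handle_consumed_error` function are hypothetical and exist only to show that a single affected process is restarted while the rest keep running.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Toy process record (illustrative only)."""
    name: str
    pages: set = field(default_factory=set)  # memory pages the process uses
    restarts: int = 0


def handle_consumed_error(processes, bad_page):
    """Restart only the processes that consumed data from bad_page.

    Returns the names of the affected processes; all others are untouched.
    """
    affected = []
    for proc in processes:
        if bad_page in proc.pages:
            proc.pages.discard(bad_page)  # stop using the bad memory
            proc.restarts += 1            # the application retries or restarts
            affected.append(proc.name)
    return affected


procs = [
    Process("database", {10, 11}),
    Process("web", {20}),
    Process("batch", {30, 31}),
]
print(handle_consumed_error(procs, bad_page=20))
```

Only the process that touched the bad page pays the cost of a restart; the other entries in the list are never modified.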
While Consumed Error Recovery is a feature provided by Intel as part of its Machine Check Architecture, it is noteworthy that the firmware, the operating systems, and the hypervisors all have to coordinate their participation in the Machine Check Architecture features for them to function properly. The System x team has worked with major vendors to provide enhancements and integrate all of the necessary software. In partnering with these vendors, the team added value above and beyond the Intel features and ensured that all of the features function in the manner intended. This additional effort alleviates deployment pain and unnecessary surprises in the data center.
UPWARD INTEGRATION MODULES (UIMs)
What the Upward Integration Modules do: UIMs enable two types of actions to be completed: 1) automating the movement of a workload from a server that issues a predictive failure alert to a trusted server in the cluster and 2) facilitating “Rolling Firmware Updates.”
Upward Integration Modules (UIMs) are plug-in software modules specific to tools such as VMware® vCenter and others. The first action that UIMs drive is automating the process of moving a workload from a server that issues a predictive failure alert onto a trusted server in the cluster. The system automatically determines that the affected critical workload can be migrated to a server that hasn’t had any errors. The server from which the workload was migrated can then be put into maintenance mode. Scheduled maintenance can be executed, and then that server can be brought back up and running later. Had the workload been left untouched and the error allowed to increase in severity over time, it could have become an uncorrectable error and ultimately resulted in an outage.
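As a rough illustration of that first action, the sketch below models a cluster and reacts to a predictive failure alert. It does not use the UIM or vCenter APIs; the `Host` class, the `on_predictive_failure_alert` function, and the “least loaded error-free host” selection rule are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Toy cluster member (illustrative only)."""
    name: str
    workloads: list = field(default_factory=list)
    error_count: int = 0
    maintenance: bool = False


def on_predictive_failure_alert(cluster, failing):
    """Move workloads off the failing host onto a trusted host.

    A host is treated as trusted here if it has logged no errors and is not
    in maintenance mode (an assumption of this sketch). Returns the target
    host, or None if no trusted host is available.
    """
    trusted = [
        h for h in cluster
        if h is not failing and not h.maintenance and h.error_count == 0
    ]
    if not trusted:
        return None  # nowhere safe to go; leave the workloads in place
    target = min(trusted, key=lambda h: len(h.workloads))  # least loaded
    target.workloads.extend(failing.workloads)
    failing.workloads.clear()
    failing.maintenance = True  # ready for scheduled maintenance
    return target


a = Host("a", ["db"], error_count=5)
b = Host("b", ["web", "cache"])
c = Host("c", ["app"])
target = on_predictive_failure_alert([a, b, c], failing=a)
print(target.name, target.workloads, a.maintenance)
```

Once maintenance is complete, clearing the `maintenance` flag would return the host to the pool of candidates.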
The same feature can be used for a second purpose known as “Rolling Firmware Updates” in which a workload is migrated off of the server so that a firmware update can be downloaded. Firmware updates are completed in an update service pack: a group of firmware is tested together to make sure it all works as a unified group. The update is then applied to the server and the server is restarted in order for the update to take effect. Then the workload can be moved back.
Once the first server is done, groups of servers then can be addressed in the same way. The process is automated so an administrator does not have to take each respective action individually as was required in the past.
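The rolling sequence described above (migrate off, update, restart, move back, then on to the next server) can be sketched as a simple loop. This is an illustrative simulation only, not the UIM implementation: the `rolling_firmware_update` function, the dictionary layout, and the choice of the first peer as a temporary home are all hypothetical.

```python
def rolling_firmware_update(servers, new_version):
    """Update servers one at a time so workloads always have a place to run.

    servers maps a server name to {"firmware": str, "workloads": list}.
    Returns a log of the steps taken.
    """
    if len(servers) < 2:
        raise ValueError("need at least two servers to migrate workloads")
    log = []
    for name, server in servers.items():
        # Pick a temporary home; the first other server is an assumption.
        peer_name = next(n for n in servers if n != name)
        peer = servers[peer_name]

        # 1) Migrate the workloads off the server being updated.
        moved = list(server["workloads"])
        peer["workloads"].extend(moved)
        server["workloads"].clear()
        log.append(f"{name}: migrated {len(moved)} workload(s) to {peer_name}")

        # 2) Apply the update; the restart is implied in this simulation.
        server["firmware"] = new_version
        log.append(f"{name}: updated to {new_version} and restarted")

        # 3) Move the same workloads back.
        for workload in moved:
            peer["workloads"].remove(workload)
            server["workloads"].append(workload)
        log.append(f"{name}: workloads moved back")
    return log


cluster = {
    "s1": {"firmware": "1.0", "workloads": ["db"]},
    "s2": {"firmware": "1.0", "workloads": ["web"]},
}
for line in rolling_firmware_update(cluster, "1.1"):
    print(line)
```

Because only one server is out of service at any moment, the workloads stay available throughout, and the administrator starts the loop once rather than performing each step by hand.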
Special thanks go to Randy Kolvick, senior technical staff member (STSM), Lenovo, for offering expertise and background for this blog post and for providing the opportunity to interview him for the video “RAS Features: System x X6 Servers.”
About the Author
As a member of the seasoned product marketing team that launched the System x X6 family of servers, Kathy Holoman has a history in technology marketing that spans three decades. Prior to joining the Lenovo Enterprise Product Group, she held high-impact global technology marketing roles in four Fortune 100 companies (IBM, HP, EDS, and GE) and one Global 500 company, Schneider Electric. Kathy’s ability to “translate a technology conversation into tangible business value” across diverse marketing media and several languages has been recognized not only with corporate honors such as the IBM Global Leadership Award, but also with industry awards for advertising, global events marketing, and graphic design. Read more on LinkedIn or follow @techiewahoo on Twitter.