New X-Ray Power & Metrics Part 1

With rising energy costs around the globe and an increased focus on sustainability, organizations are looking to understand the power load and energy consumption of their IT infrastructure in increasing detail. Examining or reporting an IT system’s power load or energy consumption, whether for compliance purposes or to measure carbon emissions, can be a complex task, requiring IT teams to collect and analyze data from a range of technology platforms and potentially across different locations.

Nutanix’s software-defined architecture facilitates analysis of power load and energy consumption because, as a single platform for storage and compute, it consolidates most of the data center footprint behind a single interface. Compute and storage are the major consumers of data center power, and with both under one platform, configurable and measurable through one interface, it’s much easier to run test scenarios and analyze the results.

The Nutanix X-Ray™ benchmarking platform greatly simplifies testing and measuring the IT infrastructure with various preloaded and configurable test scenarios across a wide range of workloads. X-Ray reports a variety of performance metrics for each of these test scenarios to give the user an in-depth view of their IT infrastructure’s capabilities for easy comparative analysis.

The latest release of X-Ray, version 4.4, now includes power and energy measurements along with performance benchmarking so that IT teams can analyze energy usage data for the user-selected workload profile. Data is power (pun intended) – and having accurate data on a system under load has a number of potential benefits. By improving the visibility of power & energy metrics alongside performance data, Nutanix can help IT teams be more mindful of their resource consumption, positioning the IT team to help build a more sustainable approach to their IT infrastructure operations and decisions.

Example of Nutanix X-Ray report “Detailed Results” graphs for Cluster CPU Usage

Example of Nutanix X-Ray report “Detailed Results” graphs for Cluster Power Usage

What? Why?

X-Ray has been updated so that it now provides power & energy consumption metrics to compare alongside other performance metrics. Many customers have reported dramatic reductions in power draw or energy use following their deployment of the Nutanix Cloud Platform (NCP). Increasingly, though, power draw is an area of focus ahead of deployment, both for cost budgeting and because of the growing focus on sustainability. And whilst we can never be 100% sure of how a configuration will behave when it's deployed in a data center, we can back up the theory and our expectations with real data obtained using the power & energy metrics provided by X-Ray.

When a user runs a test, X-Ray measures the performance of the VMs it creates, and now it also polls the cluster’s nodes for power consumption data. At the end of the test, X-Ray aggregates all this information so the user can see the power and performance data side-by-side.

Example of Nutanix X-Ray report “Results Summary” showing the new energy metric widget (bottom left) alongside other performance metrics

For more information and tips on reducing your organization’s consumption of IT resources and energy, check out these blogs on Nutanix.dev.

Education is Key

There are five areas that should be properly understood before attempting to interpret the power and energy metrics in test results:

1. Power vs Energy

Power metrics (W / kW) and energy metrics (Wh / kWh) are not as straightforward to interpret as performance metrics, and it is important to understand whether a higher or lower result is desirable for a given test. Understanding the difference between power and energy is the first step.

Simply put, within a data center context:

Energy is the application of power over time.
An IT system converts energy into work that delivers value.
A significant by-product of IT systems’ energy use is heat.

We’re not going to detail these fundamental concepts of physics here when there are many great explanations online. Here are two good examples to get you started if you want to read deeper:

And, for more information on data center efficiency check out this blog: Digging into Data Center efficiency, PUE and the impact of HCI on nutanix.dev.

These concepts matter because almost all IT equipment, be it servers, switches, firewalls, PDUs or anything else, has a sensor that reports its power draw at a point in time. Some equipment also reports energy usage, but even then the figure is usually just calculated by aggregating point-in-time power measurements. Those measurements, collected over a period of time, can be used to calculate the mean power draw or the energy used during that period, usually in kWh (kilowatt-hours).
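To make the relationship between point-in-time power readings and energy concrete, here is a minimal Python sketch. It is not how X-Ray itself performs the calculation; it simply assumes a list of timestamps (in seconds) and matching power readings (in watts) and integrates them with the trapezoidal rule. The function names `energy_wh` and `mean_power_w` are made up for this illustration.

```python
from typing import Sequence


def energy_wh(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Estimate energy (Wh) from point-in-time power samples.

    Uses the trapezoidal rule: each pair of neighbouring samples
    contributes (average power) x (elapsed time).
    """
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power readings must have the same length")
    if len(timestamps_s) < 2:
        raise ValueError("at least two samples are needed")

    joules = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        if dt <= 0:
            raise ValueError("timestamps must be strictly increasing")
        joules += 0.5 * (power_w[i] + power_w[i - 1]) * dt

    return joules / 3600.0  # 1 Wh = 3600 J


def mean_power_w(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Mean power (W) over the sampled period: energy divided by duration."""
    duration_h = (timestamps_s[-1] - timestamps_s[0]) / 3600.0
    return energy_wh(timestamps_s, power_w) / duration_h


# Hypothetical readings: three samples taken 30 minutes apart.
times = [0, 1800, 3600]
watts = [400, 600, 400]
print(energy_wh(times, watts))     # 500.0 (Wh over the hour)
print(mean_power_w(times, watts))  # 500.0 (W)
```

Note that the peak reading (600 W) and the mean (500 W) differ, which is exactly why power and energy figures need to be read differently.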

X-Ray uses hypervisor-level tools to take a reading from the physical hardware sensors via the BMC and then analyzes those measurements. X-Ray supports AHV and ESXi hypervisors for all the tests and the reported metrics, including the power and energy data. One should note that X-Ray also supports non-Nutanix systems if the users want to run tests on other HCI platforms and 3-tier infrastructure (tests run on 3-tier infrastructure do not include power used by storage arrays or their supporting infrastructure).

Important: X-Ray uses the power metric that the hardware reports via the hypervisor. While we have no reason to doubt the validity of these metrics and do not expect significant variation between hardware vendors, we have not tested every combination of hardware vendor, platform and hypervisor version. If possible, it's always worth comparing PDU (Power Distribution Unit) level metrics as part of the overall analysis. It's also worthwhile reviewing this document on the support portal if you need to get into the fine details.

2. Expected Max Load vs Budgeting

Once you understand the difference between power and energy, you can then properly understand the two possible use cases for the new metrics in X-Ray:

Budgetary energy estimations over a system’s lifetime: This is the more typical use case as it allows you to estimate the energy consumption of a platform more realistically over a long period of time under the expected workload. Users should be mindful that at certain times of day the platform may consume more or less energy depending on the workload, so further modeling (including understanding idle times) is required to create an accurate estimation for a long period of time.
Expected maximum power load: Using the results to predict the maximum power draw and then using that to decide how many nodes to put in a rack or plug into a PSU is not a recommended use case. For this use case, hardware documentation and power advisory tools should be consulted by properly experienced team members. Those with the relevant experience may want to use X-Ray results for validation purposes but should use the information cautiously.

Important: Incorrectly designing a rack or power equipment can result in downtime due to the hardware attempting to draw too much power and “tripping” PDUs. If in doubt, engage the experts!
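For the budgetary use case described above, the extra modeling can start very simply. The following Python sketch assumes you have already measured a mean power figure for each phase of a typical day (for example, a busy phase taken from an X-Ray run and an idle phase measured separately) and scales that up to a year. The function name and every number in it are hypothetical, and a real estimate would need a more detailed workload profile.

```python
from typing import Sequence, Tuple


def annual_energy_kwh(daily_profile: Sequence[Tuple[float, float]]) -> float:
    """Estimate yearly energy (kWh) from a simple daily load profile.

    daily_profile is a list of (hours_per_day, mean_power_w) pairs that
    together must cover exactly 24 hours.
    """
    total_hours = sum(hours for hours, _ in daily_profile)
    if abs(total_hours - 24) > 1e-9:
        raise ValueError("the profile must cover exactly 24 hours")

    daily_wh = sum(hours * watts for hours, watts in daily_profile)
    return daily_wh * 365 / 1000  # Wh per day -> kWh per year


# Hypothetical cluster: 10 busy hours at 2000 W, 14 quieter hours at 1200 W.
profile = [(10, 2000), (14, 1200)]
print(annual_energy_kwh(profile))  # 13432.0
```

This kind of estimate is only suitable for budgeting. It says nothing about the maximum power draw, so it should not be used for rack or power design.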

3. Fixed work vs. fixed time test scenarios

It's important to understand what a test is doing in order to correctly interpret power and energy metrics, particularly the difference between these two kinds of test scenarios:

Fixed work: e.g. Complete a set number of transactions or process a set number of documents with no fixed time frame.
Fixed time: e.g. Run a workload simulation for a number of desktop VMs for a set period of time, i.e. do as much work as possible in a set time frame.

This is important because properly interpreting results can be slightly counterintuitive.

For a fixed work scenario, the test that uses the least energy to complete the fixed task is the most energy efficient. Sometimes a test result may show a higher max power draw but because the work is completed more quickly the system is overall more energy efficient (for that workload at least) than a test that took longer with a lower power draw.

The big “gotcha” is where a fixed time test scenario may result in greater energy consumption or a higher max power draw but may also have done more work during the fixed period.

For example, a performance test set to run a workload for 20 minutes (such as a login VSI test) might show a greater maximum power consumption on a newer hypervisor version than an older version. On the face of it, you may conclude that the older version is superior from an efficiency perspective because at peak it draws less power. But that may not be the case as the newer version could be more efficient and actually complete more work for the fixed period that the test was running. It all depends on the type of scenario being run.

It’s only once you also consider the “work done” metric (e.g. VSI login score) that you can properly understand the energy efficiency of one test platform vs another. We’ll look at more test scenarios to help better understand this in the next part of this blog.
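One way to put this into practice is to normalize energy by the work done, so that fixed time and fixed work results can be compared on the same basis. The Python sketch below is an illustration only: the `RunResult` class, the run names and all the figures are invented, and "work done" stands for whichever metric the test reports (a score, a transaction count and so on).

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    name: str
    energy_wh: float  # total energy used during the test
    work_done: float  # work metric reported by the test (units depend on the test)

    @property
    def wh_per_unit(self) -> float:
        """Energy used per unit of work; lower is more efficient."""
        if self.work_done <= 0:
            raise ValueError("work_done must be positive")
        return self.energy_wh / self.work_done


def more_efficient(a: RunResult, b: RunResult) -> RunResult:
    """Return the run that used less energy per unit of work."""
    return a if a.wh_per_unit <= b.wh_per_unit else b


# Hypothetical 20-minute fixed time runs. The newer version draws more
# power (600 Wh vs 500 Wh) but also completes more work in the same time.
older = RunResult("older", energy_wh=500.0, work_done=1000.0)
newer = RunResult("newer", energy_wh=600.0, work_done=1500.0)

print(older.wh_per_unit)                  # 0.5
print(newer.wh_per_unit)                  # 0.4
print(more_efficient(older, newer).name)  # newer
```

In this made-up example the newer version uses more energy in total but less energy per unit of work, which is the "gotcha" described above.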

4. Result consistency

As with any kind of benchmarking, multiple tests should be run to weed out erroneous results. The same applies to power & energy metrics, so it is advisable to run repeat tests to validate the consistency of the metrics being analyzed.

When comparing any test results, always ensure that the same workloads and tests are being run across multiple runs. This best practice becomes critical when you are comparing the results across different software and hardware configurations.
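A quick way to check consistency across repeat runs is to look at how much the energy figure varies relative to its mean. The Python sketch below uses the coefficient of variation (sample standard deviation divided by the mean). The 5% threshold is an arbitrary assumption for illustration, not a Nutanix recommendation, and the energy figures are invented.

```python
import statistics
from typing import Sequence


def coefficient_of_variation(values: Sequence[float]) -> float:
    """Sample standard deviation divided by the mean (dimensionless)."""
    if len(values) < 2:
        raise ValueError("at least two runs are needed")
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("mean must be non-zero")
    return statistics.stdev(values) / mean


def is_consistent(values: Sequence[float], threshold: float = 0.05) -> bool:
    """True if run-to-run variation is within the chosen threshold."""
    return coefficient_of_variation(values) <= threshold


# Hypothetical energy results (Wh) from three repeats of the same test.
steady_runs = [500.0, 505.0, 495.0]
noisy_runs = [500.0, 650.0, 480.0]

print(is_consistent(steady_runs))  # True
print(is_consistent(noisy_runs))   # False
```

If a set of runs fails a check like this, it is worth investigating the outlier before drawing any conclusions from the averages.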

5. Hardware / Hypervisor settings

Be aware that BIOS settings differ between hardware vendors and that hypervisor settings differ too, and both can influence performance and energy consumption. Not all of these technologies ship with a common set of base settings, so understanding the differences can be important if you are getting unexpected results between platforms / configurations. If possible, make a note of these configurations and retain them to review along with X-Ray results.

Conclusion

Understanding the above points is key to correctly interpreting power and energy use metrics and avoiding erroneous conclusions. Additionally, if you are interested in data center efficiency as a whole, then it's also worth taking a look at this blog post: Digging into data center efficiency & PUE on www.nutanix.dev. Now that the background and theory are covered, it's time to look at some examples in the next part of this blog coming soon!

© 2024 Nutanix, Inc. All rights reserved. Nutanix, the Nutanix logo and all Nutanix product, feature and service names mentioned herein are registered trademarks or trademarks of Nutanix, Inc. in the United States and other countries. Other brand names mentioned herein are for identification purposes only and may be the trademarks of their respective holder(s). This post may contain links to external websites that are not part of Nutanix.com. Nutanix does not control these sites and disclaims all responsibility for the content or accuracy of any external site. Our decision to link to an external site should not be considered an endorsement of any content on such a site. Certain information contained in this post may relate to or be based on studies, publications, surveys and other data obtained from third-party sources and our own internal estimates and research. While we believe these third-party studies, publications, surveys and other data are reliable as of the date of this post, they have not independently verified, and we make no representation as to the adequacy, fairness, accuracy, or completeness of any information obtained from third-party sources.
