Importance of AI Data Storage Performance

How MLPerf Storage benchmark helps AI and ML developers compare performance of different data storage technologies.

By Tom Mangan | November 22, 2024

Developers of artificial intelligence and machine learning (AI/ML) applications now have a better way to assess the performance of data storage technologies. It’s an important step forward because data-storage bottlenecks often plague AI/ML model training.

AI/ML apps fetch immense datasets from storage devices and feed them into accelerators like graphics processing units (GPUs). Slow or inefficient storage can cause training tie-ups that make AI/ML apps more expensive and less useful. These issues underscore the importance of updated data-storage benchmarks from MLCommons that help application developers circumvent storage bottlenecks and optimize AI/ML training performance.

MLCommons’ MLPerf Storage v1.0 is a suite of benchmarks comparing how well data storage solutions from more than a dozen vendors perform across a broad spectrum of AI/ML training scenarios. 

“Storage is kind of like Baskin-Robbins, except there might be more than 31 flavors,” said David Kanter, co-founder and board member of MLCommons and Head of MLPerf, in an interview with The Forecast.

Object storage, edge storage, direct-attached storage, network-attached storage and many more varieties have distinct performance characteristics. When training models process trillions of data points, subtle differences between storage options can have significant effects on performance.

Nutanix contributed to the MLPerf Storage project, and its Nutanix Unified Storage significantly outperformed offerings from other participating vendors.

Kanter explained why these performance results help AI/ML developers and IT teams that manage infrastructure. He likened this new era of AI/ML to the power of supercomputing moving into the mainstream.

How MLPerf Storage Evolved

MLPerf is a collection of industry-standard benchmarks for measuring ML systems. It was developed by MLCommons, a consortium of AI leaders from academia, research labs and industry. MLPerf is widely used by hardware makers, software developers and cloud service providers to understand the performance of ML and AI applications.

“From day one, we always knew that storage matters for machine learning, especially on the training side,” Kanter told The Forecast.

He noted that companies doing large-scale AI/ML projects were reporting significant storage bottlenecks because they had to ingest massive volumes of training data.

One of Kanter’s collaborators at MLCommons was Debojyoti “Debo” Dutta, vice president of engineering at Nutanix. 

“We started brainstorming and the idea emerged that we should look at the performance you need from a storage system to support AI training,” Kanter said. A year later, MLCommons’ MLPerf Storage was born.

Dutta’s role at Nutanix, whose hybrid multicloud IT infrastructure software helps enterprises modernize and scale their digital environments, placed him at the heart of the kinds of decisions corporate IT teams must make when building and training AI/ML models. Which storage should remain on-premises? How can they control costs when storing data in the public cloud? What kind of software will do the best job of optimizing storage in hybrid environments?

Everything in AI/ML is downstream of data: success depends on amassing enough information, and enough compute power, to train models that automate complicated human activities like drafting sentences or finding specific objects in photos and video streams.

“The data, and where it lives, is the necessary ingredient to keep all the compute humming,” Kanter said.

Because storage comes in so many flavors, AI/ML developers need reliable metrics comparing how various storage technologies perform in certain kinds of training tasks. AI/ML training is iterative, analyzing one batch of data after another, identifying errors and using advanced math to correct them in each iteration. 

“You're doing a lot of data loading,” Kanter said. “We wanted to measure how good a given storage system is at loading data out of storage and into compute.”
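To make that idea concrete, here is a rough Python sketch of what measuring data loading looks like. It is not the MLPerf Storage tooling itself (the actual benchmark builds on the open-source DLIO data-loading tool); the directory path and batch size below are placeholders chosen purely for illustration:

import os, time

# Minimal sketch (not the MLPerf Storage tool): stream files from a
# directory the way a training job would, and report how fast the
# storage delivers them. The path and batch size are hypothetical.
DATA_DIR = "/mnt/training_data"   # hypothetical dataset location
BATCH_SIZE = 32                   # samples read per training step

files = sorted(os.listdir(DATA_DIR))
start = time.perf_counter()
bytes_read, samples = 0, 0

for i in range(0, len(files), BATCH_SIZE):
    batch = files[i:i + BATCH_SIZE]
    for name in batch:
        with open(os.path.join(DATA_DIR, name), "rb") as f:
            bytes_read += len(f.read())   # load one sample from storage
    samples += len(batch)

elapsed = time.perf_counter() - start
print(f"{samples / elapsed:.1f} samples/s, "
      f"{bytes_read / elapsed / 1e9:.2f} GB/s from storage")

A script like this only captures raw read speed; the benchmark goes further by tying that speed to whether the accelerators downstream are kept busy.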

MLPerf Storage also needed reliable insights on the GPU accelerators that process training data. The developers wanted to know, for instance, how storage solutions worked when feeding data to 4,096 GPUs. But they didn’t want to buy that many GPUs, which would’ve cost a small fortune. Their solution: a software tool that simulates the performance of multiple NVIDIA A100 and H100 GPUs. 

“It's a really fantastic tool from an engineering standpoint,” Kanter said.  
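The concept behind that emulation can be illustrated with a short sketch: stand in for each GPU with a fixed per-batch “compute” delay, then check how much time the emulated accelerator spends working versus waiting on storage. The timings below are invented for illustration and are not MLPerf Storage’s actual parameters:

import time

# Illustrative sketch of accelerator emulation. A fixed sleep stands in
# for GPU compute; a second sleep stands in for storage I/O. All numbers
# are made up for illustration.
COMPUTE_TIME_S = 0.05   # hypothetical time one accelerator spends per batch
NUM_BATCHES = 200

def load_batch():
    # Placeholder for reading one batch from the storage system under test.
    time.sleep(0.01)

busy, waiting = 0.0, 0.0
for _ in range(NUM_BATCHES):
    t0 = time.perf_counter()
    load_batch()                       # storage must deliver the next batch
    waiting += time.perf_counter() - t0
    time.sleep(COMPUTE_TIME_S)         # emulated GPU "computes" on the batch
    busy += COMPUTE_TIME_S

utilization = busy / (busy + waiting)
print(f"Emulated accelerator utilization: {utilization:.1%}")

In the benchmark, a storage system only earns a reported result if it keeps the emulated accelerators above a required utilization threshold, so faster storage translates directly into less idle “GPU” time.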

MLPerf Storage users can compare a storage system’s performance on three advanced training models:

  • 3D U-Net, for training based on segmenting three-dimensional medical images

  • ResNet-50, a neural network that’s 50 layers deep and specializes in image classification

  • CosmoFlow, a deep-learning model that predicts cosmological parameters describing the universe from three-dimensional simulation data

It also captures several variables for each vendor. For instance, the benchmark can compare the throughput of Nutanix Unified Storage clusters feeding emulated A100 versus H100 GPUs in each of the training models. The v1.0 benchmark also measured solutions from other leading storage vendors.

MLPerf Storage helps IT professionals and AI developers strike the right balance between storage and compute.

“As with any benchmark, we're trying to produce something that's fair and generalizable,” Kanter added. “It's not going to match your exact, specific use case. For every system, there's a right balance. You need a certain amount of storage to go with a certain amount of compute.”
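A back-of-envelope calculation shows what “a certain amount of storage to go with a certain amount of compute” means in practice. Every number below is a hypothetical placeholder, chosen only to illustrate the arithmetic rather than to reflect any measured MLPerf Storage result:

# Back-of-envelope sketch of matching storage to compute.
# All figures are hypothetical, not measured benchmark results.
samples_per_sec_per_accel = 100   # assumed training speed of one accelerator
avg_sample_size_mb = 2.0          # assumed average size of one training sample
num_accelerators = 64             # assumed cluster size

required_throughput_gbs = (
    samples_per_sec_per_accel * avg_sample_size_mb * num_accelerators / 1000
)
print(f"Sustained read throughput needed: ~{required_throughput_gbs:.1f} GB/s")
# With these assumptions: 100 * 2 MB * 64 = 12,800 MB/s, roughly 12.8 GB/s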

MLPerf and the Rise of Supercomputing

MLPerf can evaluate AI/ML training on a scale that’s essentially supercomputing. 

“When it comes to even standard, classic commercial AI, like building a recommendation system or a large language model, the computers that we're using are supercomputers by almost any metric,” Kanter said. 

“We've had systems with 10,000 or 11,000 accelerators — some of the largest supercomputers out there. It's very impressive from a raw computational standpoint. But the thing to keep in mind is if we get all that compute together, we also need to bring along the storage to feed it.”

He noted that computing at this level can help scientists understand molecular interactions much faster than was possible with lower-powered computers. For instance, scientists delving into how a single molecule would react to another molecule could do what’s called an exact simulation. 

“That's very computationally expensive,” Kanter said.

With AI/ML, scientists can predict molecular reactions with a simulator that requires far less computing power, he added. The old exact simulations might take days or weeks. 

“An AI model, once it's trained, can make predictions in minutes or seconds or hours,” Kanter said.

These simulations would pay off in the development of pharmaceuticals, for instance, where scientists design medicines at the molecular level. A drugmaker’s scientific team could use AI/ML to predict likely drug interactions of a new medication in a fraction of the time required with old technologies.   

As AI/ML systems get more sophisticated, they’ll be asked to do things like operate autonomous vehicles, pulling in real-time 2D and 3D data from cameras, lidar and other sensors while potentially communicating with traffic signals and other vehicles and synthesizing current weather and road-maintenance data.

“You have to combine all of this together and analyze it,” Kanter said, to make sure the car knows what to do when there’s a deer standing next to the road, for instance.  

In years to come, data centers will pull in data from dramatically expanding pools of people and applications. 

“We need the storage that can support that in many more places than we did before, which I think for the storage world is very exciting,” Kanter said.

“The variety of approaches is really astounding,” he added. “And that was all something we had to take into account when we built the benchmark to be as inclusive as possible.”

Editor’s note: Dive deeper with the blog post Nutanix Unified Storage Takes the Lead in MLPerf Storage v1.0 Benchmark. Learn more about Nutanix Enterprise AI capabilities in this blog post and video coverage of its release in May 2024.

Tom Mangan is a contributing writer. He is a veteran B2B technology writer and editor, specializing in cloud computing and digital transformation. Contact him on his website or LinkedIn.

Jason Lopez and Ken Kaplan contributed to this story.

MLPerf is a registered trademark of MLCommons.

© 2024 Nutanix, Inc. All rights reserved. For additional information and important legal disclaimers, please go here.
