Exascale Computing Resilience Discussion at Structure: Data

My Exascale panel discussion at the Structure: Data conference has already hit the news. Matthew Ingram from GigaOM has written up a summary of one of the key topics of the discussion: resiliency in a system with a million nodes, hundreds of millions of cores, and billions of threads.

Speaking at GigaOM’s Structure:Data conference, Los Alamos HPC deputy division leader Gary Grider said that the exascale computer has so many parts, that some element will constantly be failing. “It wouldn’t be worth building if it didn’t stay working for more than a minute,” Grider said. “Resilience is absolutely a must. The way you get answers to science is you run problems on these things for six months or more. If the machine is going to die every few minutes, that’s going to be tough sledding. We’ve got to figure out how to deal with resilience in a pretty fundamental way between now and then.”

It was a fun discussion, and I got a lot of good comments from the audience. I’d also like to thank Garth Gibson from Panasas, whose insightful comments during the talk helped give Exascale a rare I/O perspective.

We’ll be posting the video as soon as it’s available. Read the Full Story.
