Atmospheric particle showers crash supercomputers: what can be done about it


The Cray-1 supercomputer, the fastest in the world in the 1970s, does not look like a supercomputer. It looks like a variation on the amusement-park ride in which riders stand against a wall, get strapped in, and are spun around. It is surrounded by a round bench that hides its power supplies, which makes it resemble a donut, if only a donut hole could produce valuable insights about nuclear weapons.

After Seymour Cray built the first of these computers, he gave Los Alamos National Laboratory six months of free use. During that half year something interesting happened: the computer suffered 152 unexplained memory errors. Only later did researchers find out that neutrons produced by cosmic rays can collide with parts of the processor and corrupt the data stored in the computer. The higher your altitude and the bigger your computers, the more this problem affects you. Los Alamos, which sits 2.2 km above sea level and houses some of the most advanced computers in the world, became a prime target.


Seymour Cray, the creator of the supercomputer, next to his brainchild, the Cray-1

Since then the world has changed, and so have computers. The cosmos has stayed the same. Los Alamos therefore had to adapt, and its engineers began to account for cosmic particles in both hardware and software. “This is not a problem to be solved,” explained Nathan Debardeleben from the high-performance computing design group. “This is a problem that we are able to contain.”

For modern computers, starting with the Q supercomputer, this is a serious matter. Q, installed in 2003 and much faster than the Cray-1, was designed for calculations related to the United States' nuclear stockpile. But it crashed more often than expected, and those were the first failures that made Los Alamos scientists seriously concerned about cosmic rays from deep space. The rays collide with atoms in the atmosphere and break up into cascades of smaller particles. "They literally form a kind of shower falling directly on us," says Sean Blanchard, another member of the group. Some of these “drops” turn out to be neutrons, and that is very bad.

"They can flip a bit in the computer's memory," says Debardeleben, "from 0 to 1, or from 1 to 0." For a home computer this is a trifle. But Los Alamos has huge number crunchers. Even Q, a machine from the start of the century, resembled rows of supermarket shelves. Today the laboratory has computer rooms the size of a football field, and all the computers in a room can work on the same task. Just as more rain falls on a football field than on a garden plot, more cosmic-ray particles hit a supercomputer than hit your laptop.


In Los Alamos, neutron detectors are placed throughout the supercomputer center

After Q, the engineers truly understood that neutrons are not such neutral particles after all, so now they try to anticipate problems. Before installing new equipment, engineers run something like a cosmic stress test: they place the electronics in a neutron beam, which carries far more neutrons than atmospheric showers do, and watch what happens. “We take parts, irradiate them, and make them run until they fail,” Blanchard explains. Soon they will place neutron detectors inside the supercomputer center to measure the strength of the “storms.” If you know how many neutrons have flown in, and you know how they affect the operation of computer components, “you can predict the lifetime of your electronics,” says Susan Novichki, a physicist from the space and applied sciences laboratory.
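
The prediction described above can be sketched as a back-of-the-envelope calculation. A common simplification is that the upset rate is roughly the neutron flux multiplied by a per-bit sensitivity (a "cross-section," the kind of number a beam test yields) and by the number of bits exposed. The function names and every numeric value below are illustrative assumptions, not Los Alamos measurements.

```python
def expected_upsets(flux_per_cm2_s, cross_section_cm2_per_bit, n_bits, seconds):
    """Expected number of neutron-induced bit flips over a time interval."""
    return flux_per_cm2_s * cross_section_cm2_per_bit * n_bits * seconds


def mean_time_between_upsets(flux_per_cm2_s, cross_section_cm2_per_bit, n_bits):
    """Mean time in seconds between upsets for a given amount of memory."""
    rate_per_second = flux_per_cm2_s * cross_section_cm2_per_bit * n_bits
    return 1.0 / rate_per_second


# Hypothetical values chosen only to show the scaling, not real data.
FLUX = 0.01                       # neutrons per cm^2 per second (assumed)
SIGMA = 1e-14                     # cm^2 per bit, as a beam test might report (assumed)
LAPTOP_BITS = 8 * 2**30 * 8       # 8 GiB of memory, in bits
CLUSTER_BITS = LAPTOP_BITS * 10_000  # a machine with 10,000 such nodes

if __name__ == "__main__":
    for name, bits in [("laptop", LAPTOP_BITS), ("cluster", CLUSTER_BITS)]:
        hours = mean_time_between_upsets(FLUX, SIGMA, bits) / 3600
        print(f"{name}: one upset every {hours:.4g} hours on average")
```

With the same flux and the same parts, a machine with 10,000 times more memory sees upsets 10,000 times more often, which is the football-field effect in numbers.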

Usually, supercomputers are smart enough to understand that something has gone wrong: they feel a flipped bit just as you would feel someone pulling out one of your hairs. [Translator's note: the author of the original article is a woman.] In this case the system usually just reports the error and fixes it. But sometimes, says Blanchard, the computer is more pessimistic. “I have an error, too many bits have flipped,” he says, speaking for the computer. “I can’t fix it, but I wanted to let you know.”
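
The article does not say which error-correcting scheme the machines use. As a minimal sketch of the behavior Blanchard describes, here is a textbook extended Hamming code on a 4-bit value: it silently repairs one flipped bit and reports, without being able to repair, two flipped bits. Real memory protects much wider words; the function names here are made up for illustration.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit codeword (a list of 0/1 values)."""
    code = [0] * 8  # index 0: overall parity; indexes 1, 2, 4: Hamming parity
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    for p in (1, 2, 4):
        # Parity bit p covers every position whose index has bit p set.
        code[p] = sum(code[i] for i in range(1, 8) if i & p and i != p) % 2
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i  # XOR of set positions is 0 for a valid codeword
    overall_odd = sum(code) % 2 == 1
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # One bit flipped; the syndrome is its index (0 = the overall parity bit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Parity looks even but the syndrome is nonzero: two bits flipped.
        return None, "uncorrectable"
    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return nibble, status


def flip(code, *positions):
    """Simulate neutron strikes by flipping the bits at the given positions."""
    out = list(code)
    for p in positions:
        out[p] ^= 1
    return out


if __name__ == "__main__":
    word = encode(0b1011)
    print(decode(word))               # no strike
    print(decode(flip(word, 5)))      # one flipped bit
    print(decode(flip(word, 2, 6)))   # two flipped bits
```

The third case is the pessimistic computer: it knows the word is damaged but cannot tell which bits to restore.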

When this happens at Los Alamos, people deliberately stop all the computers. It is like falling on purpose while skiing down a mountain, because that hurts less than trying to stay upright. But in this case there is no need to go back to the top and start over: the engineers set up “checkpoints” along the way to the answer. They work like save points in games: if you die, you do not start from the beginning but from the last point at which your progress was saved. Supercomputers have a similar save system.
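
The save-point idea can be sketched in a few lines. This is a toy, single-process illustration under assumed names (`run`, `save_checkpoint`, a JSON file as storage); real supercomputer checkpointing coordinates many nodes and far larger state.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write to a temporary file, then rename, so a crash never leaves half a file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh one if nothing was saved yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run(path, n_steps, checkpoint_every, fail_at=None):
    """Sum 0..n_steps-1, saving progress periodically; optionally simulate a crash."""
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("uncorrectable memory error (simulated)")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]


def demo():
    """Crash at step 750, then restart; returns (step resumed from, final total)."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "ckpt.json")
        try:
            run(path, 1000, 100, fail_at=750)
        except RuntimeError:
            pass  # the job "crashed"; the checkpoint file survives
        resumed_from = load_checkpoint(path)["step"]
        return resumed_from, run(path, 1000, 100)


if __name__ == "__main__":
    print(demo())
```

Only the work done since the last checkpoint is lost, so the checkpoint interval trades time spent saving against time spent redoing steps.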

The real problem is “silent data corruption.” This is when bits flip and no one notices, and the answer you think is right may actually be a dream inspired by neutrons. That is why proactive work is so important: the team knows what to expect and how often, and keeps an eye on it. With that knowledge, the team hopes to turn silent errors into loudly screaming ones. And if something does slip through the defenses, a living person may still catch it. At Los Alamos they usually do not say “Here is your answer!” until a person has checked that the results make sense.
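
The article does not describe how the team makes silent errors loud. One generic technique, shown here purely as an assumption-laden sketch, is redundant execution: compute the same result more than once and refuse to report it if the copies disagree. All names below are hypothetical, and the "bit flip" is injected by hand.

```python
class SilentCorruptionDetected(RuntimeError):
    """Raised when redundant runs of the same computation disagree."""


def run_redundantly(compute, *args, copies=2):
    """Run compute several times and raise if the results differ."""
    results = [compute(*args) for _ in range(copies)]
    if any(r != results[0] for r in results[1:]):
        raise SilentCorruptionDetected(f"results disagree: {results}")
    return results[0]


def make_flaky_sum(corrupt_on_call):
    """Build a sum function that flips one bit of its result on the given call.

    Calls are numbered from 1, so corrupt_on_call=0 never corrupts anything.
    """
    calls = {"n": 0}

    def flaky_sum(values):
        calls["n"] += 1
        total = sum(values)
        if calls["n"] == corrupt_on_call:
            total ^= 1 << 3  # simulated neutron-induced bit flip
        return total

    return flaky_sum


def is_trustworthy(compute, *args):
    """True if redundant runs agree, False if corruption was detected."""
    try:
        run_redundantly(compute, *args)
        return True
    except SilentCorruptionDetected:
        return False


if __name__ == "__main__":
    print(is_trustworthy(make_flaky_sum(0), [1, 2, 3]))  # clean runs
    print(is_trustworthy(make_flaky_sum(2), [1, 2, 3]))  # second run corrupted
```

The cost is obvious: every protected computation runs at least twice, which is one reason a human sanity check on the final answer remains part of the process.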

Human review is needed in particular because Los Alamos does critical research on topics that affect many other people. “The laboratory, and the Department of Energy as a whole, studies climate change, new drugs, epidemiology, the spread of disease, fire modeling, materials science, and the fragility of metals,” explains Blanchard. And, as he adds after this list, the reason Los Alamos exists is nuclear weapons, which were created by people, some of them at this very laboratory. “We are a nuclear weapons research laboratory,” says Blanchard. “Our job is to manage the stockpile. We must ensure that it is safe and works as it should, and does not work when it should not.”

Because nuclear weapons testing is banned, the only legitimate way to stop worrying and learn to maintain a stockpile of bombs is to simulate on a supercomputer what happens inside them. That is how a laboratory concerned with radiation on Earth ends up having to worry about radiation from space. Because whatever work supercomputers do in the future, one thing is clear: “Every year they are becoming a bigger target,” Blanchard says.

Source: https://habr.com/ru/post/414835/

