You have surely been frustrated at some point because your computer was not working properly. On a laptop or desktop, it is usually easy to find the problem and fix it. Now the most powerful computer in the world, which bears the name Frontier, has performance issues, and although its operators know what the problem is, they have not been able to fix it yet.
Frontier is a supercomputer developed to perform advanced work that requires high computing power. One of its main characteristics is that it can deliver more than 1 exaFLOPS of performance. That is many thousands of times more than a home computer can offer.
The largest and most expensive computer in the world crashes all the time
Currently this supercomputer works, but given its computing capacity, it does not work well. Making such a beast of a system work properly is really complicated. Keep in mind that it has thousands of components and a very complex interconnection system. It is not like a personal computer, which is easy to assemble and repair.
To give an idea of the scale, Frontier has 9,472 AMD EPYC 7A53 processors. Each of these processors has 64 cores and operates at a frequency of 2.0 GHz. They are complemented by a total of 37,888 Radeon Instinct MI250X accelerator cards.
Each node is composed of one AMD CPU, four AMD graphics cards, each with 128 GiB of HBM2e memory, and 512 GB of DDR4 RAM. In addition, each node has 4 TB of NVMe storage.
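As a rough cross-check of these figures, the per-node numbers can be multiplied out. The short Python sketch below assumes one CPU per node (so 9,472 nodes), as the node description suggests; the constant and function names are illustrative only.

```python
# Back-of-the-envelope totals from the per-node figures quoted in the article.
# Assumption: one CPU per node, so the node count equals the CPU count.
NODES = 9472
CORES_PER_CPU = 64
GPUS_PER_NODE = 4
HBM_PER_GPU_GIB = 128
DDR4_PER_NODE_GB = 512
NVME_PER_NODE_TB = 4


def frontier_totals():
    """Return system-wide totals derived from the per-node figures."""
    return {
        "cpu_cores": NODES * CORES_PER_CPU,
        "gpus": NODES * GPUS_PER_NODE,
        "hbm_gib": NODES * GPUS_PER_NODE * HBM_PER_GPU_GIB,
        "ddr4_gb": NODES * DDR4_PER_NODE_GB,
        "nvme_tb": NODES * NVME_PER_NODE_TB,
    }


if __name__ == "__main__":
    for name, value in frontier_totals().items():
        print(f"{name}: {value:,}")
```

Under that assumption the GPU count works out to 37,888, which matches the number of accelerator cards quoted for the whole machine.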
Apparently this system has a serious operating problem that has to do with the Instinct MI250X. The Slingshot interconnect used for this system would also be causing problems when operating under high loads.
Justin Whitt, program director of the Oak Ridge Leadership Computing Facility, explained:
“These are mostly scaling issues as well as application scope, so the issues we’ve encountered are mostly related to running very, very large jobs using the whole system ... and run all the gear in concert to do it.”
But that would not be the only issue affecting performance. Whitt indicates that the AMD products themselves would not be the problem, and that their involvement would be a “coincidence”. It should also be noted that performance issues of this kind are not unusual in systems of this type. When such systems are built, it usually takes time, and the fixing of various problems, before everything works correctly.
Expensive and difficult to assemble
We must bear in mind that this type of system has thousands of connections. Getting the whole system to work properly is not easy, and many adjustments have to be made. In addition, many applications are not ready for this type of system.
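To see why scale alone makes reliability hard, consider a simplified model that is not taken from the Frontier team: if a system needs all of its components to work, and each one fails independently at a constant rate, the expected time between system-level failures is roughly the per-component figure divided by the number of components. The component counts and the per-component value in this Python sketch are hypothetical, chosen only for illustration.

```python
def system_mtbf_hours(component_mtbf_hours: float, n_components: int) -> float:
    """Mean time between failures of a system that fails when any part fails.

    Assumes independent components with identical, constant failure rates
    (exponential lifetimes), so the individual failure rates simply add up.
    """
    if n_components <= 0:
        raise ValueError("n_components must be positive")
    return component_mtbf_hours / n_components


if __name__ == "__main__":
    # Hypothetical value: each part fails on average once every 500,000 hours.
    component_mtbf = 500_000.0
    for n in (1, 1_000, 50_000):
        hours = system_mtbf_hours(component_mtbf, n)
        print(f"{n:>6} components -> system MTBF of about {hours:,.1f} h")
```

With these made-up numbers, a part that is individually very reliable still yields a system-level failure roughly every ten hours once 50,000 of them must all work together.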
One might think that, since this system is not working properly, it would be better to build several smaller systems instead. The reality is that it is “easier” to tune the software than to create several smaller systems. Often the data obtained in one stage is needed for the next stage of processing, so the work cannot simply be split across separate machines.
It must be said that this type of equipment is used for complex scientific studies in fields such as astrophysics or biomedicine. It is also used for climate predictions and several types of advanced simulations. These are tasks that would take a normal computer, or a set of them, years to complete, but that a supercomputer can do in much less time.