HPE recently completed its acquisition of Cray, three years after doing the same with SGI. Due to this consolidation, the top merchant supercomputer vendors (as opposed to government entities building their own systems) now include server OEMs Dell, HPE and IBM in North America, Fujitsu in Japan and Atos Bull in Europe. Inspur, Lenovo and Sugon lead a growing group of supercomputer-focused OEM server vendors in China.
The challenge for all these supercomputer vendors is that public cloud vendors are also targeting high-performance computing (HPC) and supercomputing markets. Public cloud providers are changing the market’s demand for functionality. This competition will challenge branded server OEMs’ ability to push upmarket toward higher-end customers with traditional on-prem HPC server clusters.
Alibaba Cloud, AWS and Azure are already deploying HPC- and supercomputing-worthy infrastructure and services.
I expect the public cloud giants' HPC- and supercomputing-focused deployments to improve continuously over time. As predictions go, that's fairly tame; it's what they do.
Therefore, I believe it will be increasingly hard for HPC infrastructure vendors to sell clusters directly to end customers as customers opt to simply configure a supercomputer out of available public cloud instance types and networking options.
Public vs. Private Infrastructure
All the same arguments heard for the past decade about private infrastructure vs. public cloud are now surfacing in HPC market positioning.
Set aside all of the tired tropes about security, availability, latency, etc. Public clouds provide as good or better infrastructure service and support than most IT departments can manage on their own, and they have been doing that for years.
For public cloud-based HPC and supercomputing services to be successful, they cannot:
- Run significantly slower than private infrastructure
- Require refactoring and rewriting decades-old applications
Alibaba Cloud, AWS and Azure have all recently deployed new HPC and supercomputing instance types and sizes implementing high-speed Ethernet networking (Azure also offers high-end InfiniBand networking) and shared-memory clustering capabilities that enable customers to meet both of the above requirements.
- Alibaba Cloud recently deployed two processor-only instance types and one GPU-accelerated instance type.
- AWS deployed specific sizes of six processor-only instance types, plus one GPU-accelerated instance type.
- Azure has deployed three processor-only instance types and two GPU-accelerated instance types.
GCP asks HPC customers to use Preemptible instance types, which requires refactoring of existing HPC applications. In addition, Preemptible instance types have many other restrictions, including lack of any Service Level Agreements (SLAs). GCP’s Cloud TPU Pods cannot be programmed with traditional supercomputing software development tools.
Another important point of tension between using private and public infrastructure is “data gravity”.
Data gravity is somewhat similar to real gravity. The more massive a celestial object is (planet, star, galaxy or whatever), the more it influences objects in its vicinity. From a spaceflight perspective, getting spaceships out of Earth’s “gravity well” is very expensive. It’s another order of magnitude of expensive to send spacecraft out of the Sun’s gravity well; humanity has only done that twice (Voyagers 1 and 2).
Data gravity urges IT customers to include data transfer and storage costs when considering the total cost of migrating applications from on-prem infrastructure to public cloud infrastructure. In practical use, the key data gravity consideration for most applications is simple: Does an application send a lot of data back out of the cloud?
Sending a lot of data into a public cloud may take time, but most clouds do not charge data ingress fees. Moving data within a public cloud can run up data transfer charges. Sending a lot of data out of a cloud most certainly will run up data transfer expenses.
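That asymmetry is easy to put in rough numbers. The sketch below uses assumed, illustrative rates (real rates vary by provider, region and volume tier, so check your provider's rate card); it only demonstrates why egress, not ingress, dominates the data gravity calculation.

```python
# Hypothetical illustration of the data gravity cost asymmetry.
# Both rates below are assumptions for illustration only, not any
# provider's actual pricing.

INGRESS_PER_GB = 0.00   # most clouds do not charge for data ingress
EGRESS_PER_GB = 0.09    # assumed egress rate, USD per GB

def transfer_cost(gb_in: float, gb_out: float) -> float:
    """Rough cost of moving a dataset into and back out of a cloud."""
    return gb_in * INGRESS_PER_GB + gb_out * EGRESS_PER_GB

# Moving 50 TB of simulation inputs in is free; pulling the same
# 50 TB of results back out is not.
print(transfer_cost(gb_in=50_000, gb_out=0))   # 0.0
print(transfer_cost(gb_in=0, gb_out=50_000))   # 4500.0
```

At these assumed rates, a workload that keeps its results in the cloud pays nothing for data movement, while one that repatriates them pays thousands of dollars per run.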
A different kind of data gravity is defined by government and military security. While public clouds are more than competitive with private infrastructure for commercial-grade security, there are datasets that must be “air gapped” (not connected to the public internet) and must not leave the facility or organization that created the dataset.
Customers using public cloud HPC resources most likely do not want to:
- Require huge datasets to be transferred out of the cloud
- Violate sovereign state or similar secrets
The combination of shared-memory architecture and data gravity points to several areas where public cloud-based HPC and supercomputing clusters have distinct advantages in the short term:
- Offload HPC application development to a public cloud, which may increase the productive utilization rate of overburdened private clusters
- Simulate HPC and supercomputer cluster configurations for a given application to scope private infrastructure purchases
- Run simulations dependent on small initial datasets or systems of equations
- Summarize insights gained from analyzing massive datasets copied to or already in the cloud, such as the results of thousands of simulation runs
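The scoping use case above is, at bottom, a break-even calculation: rent cloud node-hours until the measured utilization justifies buying. The sketch below is a back-of-envelope model with assumed figures (the node-hour price, per-node cost and depreciation window are illustrative, not quotes); substitute real pricing and measured utilization before drawing conclusions.

```python
# Back-of-envelope sketch for scoping a private cluster purchase
# against renting equivalent cloud capacity. All figures are
# assumptions for illustration only.

CLOUD_NODE_HOUR = 3.00       # assumed on-demand price per node-hour, USD
ONPREM_NODE_CAPEX = 25_000   # assumed acquisition cost per node, USD
AMORTIZE_YEARS = 4           # assumed depreciation window

def breakeven_utilization(cloud_node_hour: float,
                          onprem_node_capex: float,
                          amortize_years: int) -> float:
    """Fraction of hours a node must stay busy over the amortization
    window before owning it costs less than renting it."""
    total_hours = amortize_years * 365 * 24
    return onprem_node_capex / (cloud_node_hour * total_hours)

util = breakeven_utilization(CLOUD_NODE_HOUR, ONPREM_NODE_CAPEX, AMORTIZE_YEARS)
print(f"break-even utilization: {util:.1%}")  # ~23.8% at these assumptions
```

Under these assumed numbers, a node busy less than roughly a quarter of the time is cheaper to rent, which is exactly the regime where offloading development work or one-off simulation campaigns to a public cloud makes sense.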
For the most part, disruptive technological change starts at the low end of markets (despite a handful of Apple and Tesla counter-examples). Public cloud HPC and supercomputing capabilities will further disrupt on-prem server and data center infrastructure sales. It's just a matter of how quickly this happens.
This article was originally written for Forbes.