
The rise of accelerated computing for applications such as AI and ML over the last several years has led to new innovations in compute, networking, and rack infrastructure. Accelerated computing generally refers to servers that are equipped with coprocessors such as GPUs and other custom accelerators. These accelerated servers are deployed as systems, combining a low-latency networking fabric with enhanced thermal management to accommodate their higher power envelopes.

Today, data centers account for approximately 2% of global energy usage. While the latest accelerated servers can consume up to 6 kW each, which may seem counterintuitive from a sustainability perspective, accelerated systems are actually more energy efficient than general-purpose servers when matched to the right mix of workloads. The advent of generative AI has significantly raised compute and network demands, given that these language models consist of billions of parameters. Accelerators can help train these large language models within a practical timeframe.
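
As a rough illustration of the efficiency argument, the sketch below compares the energy needed to complete one fixed training job on a single accelerated server versus a fleet of general-purpose servers. All of the power figures, server counts, and runtimes are hypothetical assumptions chosen for the example, not measured benchmarks.

```python
# Back-of-the-envelope comparison: energy to finish one fixed training job.
# All numbers below are illustrative assumptions, not measured benchmarks.

def job_energy_kwh(power_kw: float, hours: float, servers: int = 1) -> float:
    """Total energy consumed by `servers` machines running `hours` at `power_kw` each."""
    return power_kw * hours * servers

# Assumption: one accelerated server (~6 kW) finishes the job in 10 hours.
accelerated = job_energy_kwh(power_kw=6.0, hours=10)

# Assumption: the same job needs 40 general-purpose servers (~0.5 kW each) for 100 hours.
general_purpose = job_energy_kwh(power_kw=0.5, hours=100, servers=40)

print(f"Accelerated:     {accelerated:,.0f} kWh")      # 60 kWh
print(f"General-purpose: {general_purpose:,.0f} kWh")  # 2,000 kWh
```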

Deployment of these AI language models usually consists of two distinct stages: training and inference.

  • In AI training, data is fed into the model, so the model learns about the type of data to be analyzed. AI training is generally more infrastructure intensive, consisting of one to thousands of interconnected servers with multiple accelerators (such as GPUs and custom coprocessors) per server. We classify accelerators for training as “high-end” and examples include NVIDIA H100, Intel Gaudi2, AMD MI250, or custom processors such as the Google TPU.
  • In AI inference, the trained model is used to make predictions based on live data. AI inference servers may be equipped with discrete accelerators (such as GPUs, FPGAs, or custom processors) or embedded accelerators in the CPU. We classify accelerators for inference as “low-end,” and examples include the NVIDIA T4 or L40S (a minimal sketch of this classification follows the list). In some cases, AI inference servers are classified as general-purpose servers because of the lack of discrete accelerators.
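
To make the “high-end”/“low-end” split concrete, here is a minimal sketch of how the examples above could be encoded. The mapping simply restates the accelerators named in this post and is illustrative rather than an exhaustive or official taxonomy.

```python
# Illustrative mapping of the accelerator examples named above to the
# "high-end" (training) and "low-end" (inference) classes used in this post.
ACCELERATOR_CLASS = {
    "NVIDIA H100": "high-end (training)",
    "Intel Gaudi2": "high-end (training)",
    "AMD MI250": "high-end (training)",
    "Google TPU": "high-end (training)",
    "NVIDIA T4": "low-end (inference)",
    "NVIDIA L40S": "low-end (inference)",
}

def classify_server(accelerators: list[str]) -> str:
    """Classify a server by its discrete accelerators; none means general-purpose."""
    if not accelerators:
        return "general-purpose"
    classes = {ACCELERATOR_CLASS.get(a, "unknown") for a in accelerators}
    return ", ".join(sorted(classes))

print(classify_server(["NVIDIA H100"]))  # high-end (training)
print(classify_server([]))               # general-purpose
```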

 

Server Usage: Training vs. Inference?

A common question is how much of the infrastructure, typically measured by the number of servers, is deployed for training as opposed to inference, and what the adoption rate of each type of platform is. This is a question we have been investigating and debating, and the following factors complicate the analysis.

  • NVIDIA’s recent GPU offerings based on the A100 Ampere and H100 Hopper platforms are intended to support both training and inference. These platforms typically consist of a large array of multi-GPU servers that are interconnected and well-suited for training large language models. However, any excess capacity not used for training can be utilized towards inference workloads. While inference workloads typically do not require a large array of servers (although inference applications are increasing in size), inference applications can be deployed for multiple tenants through virtualization.
  • The latest CPUs from Intel and AMD have embedded accelerators on the CPU that are optimized for inference applications. Thus, a monolithic architecture without discrete accelerators is ideal as capacity can be shared by both traditional and inference workloads.
  • The chip vendors also sell GPUs and other accelerators not as systems but as PCI Express add-in cards. One or several of these accelerator add-in cards can be installed by the end-user after the sale of the system.

Given that different workloads (training, inference, and traditional) can be shared on one type of system, and that end-users can reconfigure systems with discrete accelerators, it becomes less meaningful to delineate the market purely by workload type. Instead, we segment the market into three distinct server platform types, as defined in Figure 1.

Figure 1: Server Platform Types (Dell'Oro Group)

We expect each of these platform types to have a different growth trajectory. Growth of general-purpose servers is slowing, with a 5-year CAGR of less than 5%, given increasing CPU core counts and the use of virtualization. On the other hand, accelerated systems are forecast to grow at a 5-year CAGR of approximately 40%. By 2027, we project accelerated systems will account for approximately 16% of all server shipments, with the mix of accelerator types shown in Figure 2.
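
To see what those growth rates compound to, the short sketch below projects five years of shipments under the stated CAGRs. The starting unit volumes are placeholder assumptions chosen purely for illustration (and to show how a roughly 40% CAGR can arrive at roughly a 16% shipment share); only the growth rates and the end-state share come from the forecast above.

```python
# Compound the 5-year growth rates cited above. Starting unit volumes are
# placeholder assumptions chosen for illustration; only the CAGRs and the
# ~16% end-state share come from the forecast discussed in the text.
def project(start_units: float, cagr: float, years: int = 5) -> float:
    """Units shipped after compounding `cagr` annual growth for `years`."""
    return start_units * (1 + cagr) ** years

general_purpose = project(start_units=100.0, cagr=0.05)  # <5% CAGR upper bound
accelerated     = project(start_units=4.5,   cagr=0.40)  # ~40% CAGR

share = accelerated / (accelerated + general_purpose)
print(f"Accelerated share of shipments after 5 years: {share:.0%}")  # ~16%
```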

Looking ahead we expect continued innovation and new architectures to support the growth of AI. More specialized systems and processors will be developed that will enable more efficient and sustainable computing. We also expect the vendor landscape to be more diversified, with compelling solutions from the vendors and cloud service providers to optimize the performance of each workload.

To get additional insights and outlook for servers and components such as accelerators and CPUs for the data center market, please check out our Data Center Capex report and Data Center IT Semiconductor and Components report.


Artificial intelligence (AI) is currently having a profound impact on the data center industry. This impact can be attributed to OpenAI’s launch of ChatGPT in late 2022, which rapidly gained popularity for its remarkable ability to provide sophisticated and human-like responses to queries. As a result, generative AI, a subset of AI technology, became the focal point of discussions across industry events, earnings presentations, and vendor ecosystems in the first half of 2023. The excitement is warranted, as generative AI has already driven tens of billions of dollars in investment and is forecast to continue to lift Data Center Capex to over $500 billion by 2027. However, the significant expansion of computing power required to train and deploy the large language models (LLMs) behind generative AI applications will require architectural changes in data centers.

While the hardware required to support such AI applications is new to many, there is a segment of the data center industry that has already been deploying such infrastructure for years. This segment is often known as the high-performance computing (HPC) or supercomputing industry. Historically, this market segment has primarily been supported by governments and higher education to deploy some of the world’s most complex and sophisticated computer systems.

What is new with generative AI is the proliferation of AI applications, and the infrastructure to support them, into the much wider enterprise and service provider markets. Learning from the HPC industry gives us an idea of what that infrastructure may start to look like.

Figure 1: AI Hardware Implications

 

AI Infrastructure Needs More Power and Liquid Cooling

To summarize the implications shown in Figure 1, AI workloads will require more computing power and higher networking speeds. This will lead to higher rack power densities, which has significant implications for Data Center Physical Infrastructure (DCPI). For facility power infrastructure, also referred to as grey space, architectural changes are expected to be limited. AI workloads should increase demand for backup power (UPS) and power distribution to the IT rack (Cabinet PDU and Busway), but they won’t mandate any significant technology changes. Where AI infrastructure will have a transformational impact on DCPI is in a data center’s white space.
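
To illustrate the jump in rack power density, the sketch below compares a traditional rack with a rack of accelerated servers. The server counts, per-server power draws, and overhead allowance are hypothetical assumptions for the example, not measured configurations.

```python
# Illustrative rack power density comparison. Server counts and per-server
# power draws are hypothetical assumptions, not measured configurations.
def rack_density_kw(servers_per_rack: int, kw_per_server: float, overhead_kw: float = 1.0) -> float:
    """Approximate rack power density: IT load plus a small allowance for switches/fans."""
    return servers_per_rack * kw_per_server + overhead_kw

traditional = rack_density_kw(servers_per_rack=20, kw_per_server=0.5)  # ~11 kW/rack
ai_training = rack_density_kw(servers_per_rack=8,  kw_per_server=6.0)  # ~49 kW/rack

print(f"Traditional rack: ~{traditional:.0f} kW")
print(f"AI training rack: ~{ai_training:.0f} kW")
```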

First, due to the substantial power consumption of AI IT hardware, there is a need for higher power-rated rack PDUs. At these power ratings, the costs associated with potential failures or inefficiencies can be high. This is expected to push end users toward the adoption of intelligent rack PDUs, with the ability to remotely monitor and manage power consumption and environmental factors. These rack PDUs cost significantly more than basic rack PDUs, which don’t give an end user the ability to monitor or manage their rack power distribution.

Even more transformative for data center architectures is the necessity of liquid cooling to manage the higher heat loads produced by next-generation CPUs and GPUs running AI workloads. Adoption of liquid cooling, both direct liquid cooling and immersion cooling, has been growing in the wider data center industry, and it is expected to accelerate alongside the deployment of AI infrastructure. However, given the historically long runway associated with adopting liquid cooling, we anticipate that the influence of generative AI on liquid cooling will be limited in the near term. It remains possible to deploy the current generation of IT infrastructure with air cooling, but at the expense of hardware utilization and efficiency.

To address this challenge, some end-users are retrofitting their existing facilities with closed-loop, air-assisted liquid cooling systems. Such infrastructure can be a version of a rear door heat exchanger (RDHx) or direct liquid cooling that uses a liquid to capture the heat generated within the rack or server and reject it at the rear, directing it into a hot aisle. This design allows data center operators to leverage some advantages of liquid cooling without significant investments to redesign a facility. However, to achieve the desired efficiency of AI hardware at scale, purpose-built liquid-cooled facilities will be required. We expect the current interest in liquid cooling to start materializing in deployments by 2025, with liquid cooling revenues forecast to approach $2 billion by 2027.
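
For a sense of what capturing rack heat with liquid involves, the sketch below applies the basic sensible-heat relation (Q = ṁ · cp · ΔT) to estimate the water flow a direct liquid cooling loop would need for a given rack heat load. The 50 kW rack load and 10°C coolant temperature rise are illustrative assumptions, not vendor specifications.

```python
# Estimate coolant flow needed to carry away a rack's heat load using the
# sensible-heat relation Q = m_dot * c_p * dT. The rack load and temperature
# rise are illustrative assumptions, not vendor specifications.
WATER_CP_J_PER_KG_K = 4186.0    # specific heat of water
WATER_DENSITY_KG_PER_L = 1.0    # ~1 kg per liter

def required_flow_lpm(heat_load_kw: float, delta_t_c: float) -> float:
    """Liters per minute of water needed to absorb `heat_load_kw` with a `delta_t_c` rise."""
    kg_per_s = (heat_load_kw * 1000.0) / (WATER_CP_J_PER_KG_K * delta_t_c)
    return kg_per_s / WATER_DENSITY_KG_PER_L * 60.0

# Assumption: a 50 kW AI rack with a 10 C coolant temperature rise.
print(f"~{required_flow_lpm(heat_load_kw=50, delta_t_c=10):.0f} L/min")  # ~72 L/min
```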

Power Availability May Disrupt the AI Hype

Plans to incorporate AI workloads in future data center construction are already materializing. This was the primary reason for our recent upward revision to our Data Center Physical Infrastructure market 5-year outlook, with revenue now forecast to grow at a 10% CAGR through 2027. But despite all the prospective market growth AI workloads are expected to generate for the data center industry, there are some notable factors that could slow that growth. At the top of that list is power availability. The Covid-19 pandemic accelerated the pace of digitalization, spurring a wave of new data center construction. However, as that demand materialized, supply chains struggled to keep up, resulting in data center physical infrastructure lead times beyond a year at their peak. Now, as supply chain constraints are easing, DCPI vendors are working through elevated backlogs and starting to reduce lead times.

Yet demand for AI workloads is forming another wave of growth for the data center industry. This double-shot of growth has generated a discrepancy between the growing energy needs of the data center industry and the pace at which utilities can supply power to the desired locations. Consequently, data center service providers are exploring a “Bring Your Own Power” model as a potential solution. While the feasibility of this model is still being determined, data center providers are thirsty for an innovative approach to support their long-term growth strategies, with the surge in AI workloads being a central driver.

As the need for more DCPI is balanced against available power, one thing is clear: AI is ushering in a new era for DCPI. In this era, DCPI will not only play a critical role in enabling data center growth, but will also help define performance and cost and drive progress toward sustainability. This is a distinct shift from the historical role DCPI played, particularly compared to the industry nearly a decade ago when DCPI was almost an afterthought.

With this tidal wave of AI growth quickly approaching, it’s critical to address DCPI requirements within your AI strategy. Failing to do so might result in AI IT hardware with nowhere to get plugged in.


Recently, the U.S. Department of Energy (DOE) announced $40 million in funding to accelerate innovation in data center thermal management technologies intended to help reduce carbon emissions. The funding was awarded to 15 projects, ranging in value from $1.2 million to $5 million, at universities, labs, and enterprises in the data center and aerospace industries. It will likely take one to three years for these projects to start impacting the data center industry, but following the money can provide an early indication of the potential direction of future data center thermal management technologies.

Before we assess what we can learn from the selected projects, it’s important to understand why the DOE is investing in the development of next-generation data center thermal management. It’s simple: thermal management can consume up to 40% of a data center’s overall energy use, second only to compute. This energy is consumed in the process of capturing the heat generated by the IT infrastructure and rejecting it into the atmosphere. The data center industry has been optimizing today’s air-based thermal management infrastructure, but with processor TDPs rising (think CPUs and GPUs generating more heat), liquid is likely required to achieve the performance and efficiency standards desired by data center operators and regulators in the near future. The type of liquid and how it is applied to thermal management has divided the data center industry on the best path forward. The COOLERCHIPS program provides a unique lens into the developments happening behind the scenes that may impact the direction of future liquid cooling technologies.
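
To put that 40% figure in perspective, the sketch below shows how the cooling share of facility energy changes total consumption for the same IT load. The absolute IT load and the 10% share assumed for a heavily liquid-cooled facility are placeholder assumptions; only the “up to 40%” figure comes from the text above, and the model ignores other facility loads.

```python
# Illustrate how the cooling share of facility energy changes total consumption
# for the same IT load. The model assumes facility energy = IT + cooling and
# ignores other facility loads; the IT load and the 10% "liquid-cooled" share
# are placeholder assumptions.
def facility_energy_mwh(it_energy_mwh: float, cooling_share: float) -> float:
    """Total facility energy if cooling accounts for `cooling_share` of that total."""
    return it_energy_mwh / (1.0 - cooling_share)

it_load = 1000.0  # MWh of IT (compute) energy, placeholder

air_cooled    = facility_energy_mwh(it_load, cooling_share=0.40)  # ~1,667 MWh
liquid_cooled = facility_energy_mwh(it_load, cooling_share=0.10)  # ~1,111 MWh

savings = air_cooled - liquid_cooled
print(f"Savings: {savings:,.0f} MWh (~{savings / air_cooled:.0%} of total)")
```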

To assess the impact of the COOLERCHIPS project awards, each of the 15 projects was segmented by the type of thermal management technology and the heat capture medium. The projects were placed into the following categories (Not all projects could be applied to both segments):

  1. Funding by Technology (n=14)
      1. Direct Liquid Cooling: A cold plate attached to a CPU with a CDU managing the secondary fluid loop.
      2. Immersion Cooling: A server immersed in fluid within a tank or chassis.
      3. Software: Software tools used to design, model, and evaluate different data center thermal management technologies.
  2. Funding by Heat Capture Medium (n=10)
      1. Single-phase: A fluid that always remains a liquid in the heat capture and transfer process of data center liquid cooling.
      2. Two-phase: A fluid that boils during the heat capture process to produce a vapor that transfers the heat to a heat exchanger, where it condenses back into a fluid.
      3. Hybrid: Combined use of a single-phase or two-phase fluid and air to capture and transfer heat in the thermal management process.

The results for funding by technology showed that direct liquid cooling accounted for 73% of awarded funds, immersion cooling 14%, and software 13%. This isn’t particularly surprising, given that direct liquid cooling is more mature than immersion cooling in the data center industry, primarily due to the use of direct liquid cooling in high-performance computing. Additionally, the planning, design, and operational changes involved in implementing immersion cooling have proved to be a bigger hurdle for some end-users than originally anticipated. Despite only receiving 14% of the awarded funds, there is still significant maturation occurring with immersion cooling as the installed base grows among a variety of end users. Lastly, software rounded out the projects with 13% of the awarded funds. This was a welcome addition, as software plays a critical role in thermal management design and evaluation when comparing the use of different thermal management technologies in different scenarios. Environmental factors such as temperature and humidity, computational workloads, and the prioritization of sustainability can all influence which technology choice is best. Software must be utilized to align an end user’s priorities with the technology best suited to reach those goals.
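
The share-of-funding figures above come down to grouping award dollars by category; a minimal sketch of that kind of aggregation is shown below. The project names and dollar amounts are made up for illustration and are not the actual COOLERCHIPS awards.

```python
# Minimal sketch of the aggregation behind a "share of funding by technology"
# breakdown. Project names and dollar amounts are made up for illustration;
# they are not the actual COOLERCHIPS awards.
from collections import defaultdict

awards = [
    {"project": "Project A", "technology": "Direct Liquid Cooling", "usd_m": 4.8},
    {"project": "Project B", "technology": "Direct Liquid Cooling", "usd_m": 3.1},
    {"project": "Project C", "technology": "Immersion Cooling",     "usd_m": 1.5},
    {"project": "Project D", "technology": "Software",              "usd_m": 1.4},
]

totals = defaultdict(float)
for award in awards:
    totals[award["technology"]] += award["usd_m"]

grand_total = sum(totals.values())
for tech, usd_m in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{tech:25s} {usd_m / grand_total:5.0%}")
```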

Share of Funding ($) By Technology

The results for funding by heat capture medium showed that two-phase solutions accounted for 48% of the awarded funds, hybrid solutions 38%, and single-phase solutions 14%. It was surprising to see two-phase solutions garner the most funding, since the significant majority of data center liquid cooling today uses single-phase solutions. Between 3M announcing its exit from PFAS manufacturing by 2025 and evolving European F-Gas regulations, certain manufactured fluids have been under pressure. But that may be the very reason two-phase solutions were awarded such funding: they may play a critical role in thermal management as CPU and GPU roadmaps reach and surpass 500-watt TDPs in the coming years. Inversely, it’s possible that the level of maturity in single-phase solutions is what limited their awards to only 14% of funding. Furthermore, single-phase solutions aren’t always a 100% heat capture solution, so it makes sense that hybrid solutions, which combine air and liquid technologies to capture 100% of the heat, received more funding. After all, end users are increasingly interested in holistic solutions when it comes to the never-ending cycle of deploying more computing power.

Based on these results, one could conclude that data center thermal management is headed toward two-phase direct liquid cooling solutions in the medium to long term. However, it’s important to remember the maturity that is already emerging in single-phase liquid cooling solutions. This maturity is what has driven the data center liquid cooling market to account for $329 million in 2022, and it is forecast to reach $1.7 billion by 2027. The liquid cooling technologies and fluids will most certainly evolve over the coming years, with investments from the DOE, among many others, helping shape that direction. But most importantly, the COOLERCHIPS project awards aren’t about closing doors to solutions that already exist; they are about opening doors to new technologies that give us more choices for efficient, reliable, and sustainable thermal management solutions in the future.


2023 witnessed a remarkable resurgence of the OFC conference following the pandemic. The event drew a significant turnout, and the atmosphere was buzzing with enthusiasm and energy. The level of excitement was matched by the abundance of groundbreaking announcements and product launches. Given my particular interest in the data center switch market, I will center my observations in this blog on the most pertinent highlights regarding data center networking.

The Bandwidth and Scale of AI Clusters Will Skyrocket Over the Next Few Years

It’s always interesting to hear from different vendors about their expectations for AI networks, but it’s particularly fascinating when Cloud Service Providers (SPs) discuss their plans and predictions for the projected growth of their AI workloads. This is because such workloads are expected to exert significant pressure on the bandwidth and scale of Cloud SPs’ networks, making the topic all the more compelling. At OFC this year, Meta shared its expectations of what its AI clusters may look like in 2025 and beyond. Two key takeaways from Meta’s predictions:

  • The size and network bandwidth of AI clusters are expected to increase drastically in the future: Meta expects the size of its AI clusters to grow from 256 accelerators today to 4,000 accelerators per cluster by 2025. Additionally, the amount of network bandwidth per accelerator is expected to grow from 200 Gbps to more than 1 Tbps, a phenomenal increase in just about three years. In summary, not only is the size of the cluster growing, but the amount of compute network bandwidth per accelerator is also skyrocketing (see the quick aggregate calculation after this list).
  • The expected growth in the size of AI clusters and compute network capacity will have significant implications on how accelerators are currently connected: Meta showcased the current and potential future state of the cluster fabric. A chart presented by Meta proposes flattening the network by embedding optics directly in every accelerator in the rack, rather than connecting through a network switch. This tremendous increase in the number of optics, combined with the increase in network speeds, exacerbates the power consumption issues that Cloud SPs have already been battling. We also believe that AI networks may require a different class of network switches purpose-built and designed for AI workloads.
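
The scale of that jump is easier to appreciate in aggregate terms. The sketch below simply multiplies the cluster size and per-accelerator bandwidth figures cited above into total compute-fabric bandwidth per cluster; it is a back-of-the-envelope aggregation, not a statement about achievable bisection bandwidth or fabric design.

```python
# Aggregate compute-fabric bandwidth implied by the cluster figures cited above.
def cluster_bandwidth_tbps(accelerators: int, gbps_per_accelerator: float) -> float:
    """Total fabric bandwidth in Tbps, assuming every accelerator runs at full rate."""
    return accelerators * gbps_per_accelerator / 1000.0

today  = cluster_bandwidth_tbps(accelerators=256,  gbps_per_accelerator=200)   # ~51 Tbps
future = cluster_bandwidth_tbps(accelerators=4000, gbps_per_accelerator=1000)  # ~4,000 Tbps

print(f"Today: ~{today:,.0f} Tbps per cluster")
print(f"2025+: ~{future:,.0f} Tbps per cluster ({future / today:.0f}x)")
```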


Pluggable Optics vs. Co-packaged Optics (CPOs) vs. Linear Drive Pluggable Optics (LPOs)

Pluggable optics will be responsible for an increasing portion of the power consumption at a system level (more than 50% of the switch system power at 51.2 Tbps and beyond) and, as mentioned above, this issue will only get exacerbated as Cloud SPs build their next-generation AI networks. CPOs have emerged as an alternative technology that promises to reduce power and cost compared to pluggable optics. Below are some updates about the state of the CPO market:

  • Cloud SPs are still on track to experiment with CPOs: Despite rumors that Cloud SPs are canceling their plans to deploy CPOs due to budget cuts, it appears that they are still on track to experiment with this technology. At OFC 2023, Meta reiterated its plans to consider CPOs in order to reduce power consumption from 20 pJ/bit to less than 5 pJ/bit using Direct Drive CPOs, which eliminate the digital signal processors (DSPs); a quick conversion of those figures to watts follows this list. It is still unclear, however, where exactly in the network Meta plans to implement CPOs, or whether they will primarily be used for compute interconnect.
  • The ecosystem is making progress in developing CPOs but a lot remains to be done: There were several exciting demonstrations and product announcements at OFC 2023. For example, Broadcom showcased a prototype of its Tomahawk 5-based 51.2 Tbps “Bailly” CPO system, along with a fully functional Tomahawk 4-based 25.6 Tbps “Humboldt” CPO system that was announced in September 2022. Additionally, Cisco presented the power savings achieved with its CPO switch populated with CPO silicon photonic-based optical tiles driving 64x400G FR4, as compared to a conventional 32-port 2x400G 1RU switch. During our discussions with the OIF, we were provided with an update on the various standardization efforts taking place, including the standardization of the socket that the CPO module will go into. Our conversations with major players and stakeholders made it clear that significant progress has been made in the right direction. However, there is still much work to be done to reach the final destination, particularly in addressing serviceability, manufacturability, and testability issues that remain unsolved. Our CPO forecast published in our 5-year Data Center Forecast report January 2023 edition takes into consideration all of these challenges.
  • LPOs present another alternative to explore: Andy Bechtolsheim of Arista has suggested LPOs as another alternative that may address some of the challenges of CPOs. The idea behind LPOs is to remove the DSP from pluggable optics, as the DSP drives about half of the power consumption and a large portion of the cost of 400 Gbps pluggable optics. By removing the DSP, LPOs would be able to reduce optic power by 50% and system power by up to 25%, as Andy portrayed in a chart presented at the conference.
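
Because figures like 20 pJ/bit or 5 pJ/bit can be hard to relate to a switch's power budget, the sketch below converts energy-per-bit into optics power at a 51.2 Tbps switching capacity. It is a straight energy-per-bit times bit-rate conversion under the assumption of full throughput, not a complete system power model.

```python
# Convert energy-per-bit figures to optics power at full 51.2 Tbps throughput.
# This is a simple energy-per-bit * bit-rate conversion, not a full system
# power model (it ignores idle power, lane overheads, etc.).
def optics_power_watts(pj_per_bit: float, throughput_tbps: float) -> float:
    """Power in watts = energy per bit (J) * bits per second."""
    return pj_per_bit * 1e-12 * throughput_tbps * 1e12

conventional = optics_power_watts(pj_per_bit=20, throughput_tbps=51.2)  # ~1,024 W
direct_drive = optics_power_watts(pj_per_bit=5,  throughput_tbps=51.2)  # ~256 W

print(f"20 pJ/bit at 51.2 Tbps: ~{conventional:,.0f} W")
print(f" 5 pJ/bit at 51.2 Tbps: ~{direct_drive:,.0f} W")
```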


Additionally, other materials for electro-optic modulation (EOM) are being explored, which may offer even greater savings compared to silicon photonics. Although silicon photonics is a proven high-volume technology, it suffers from high drive voltage and insertion loss, so exploring new materials such as TFLN may help lower power consumption. However, we would like to note that while LPOs have the potential to achieve power savings similar to CPOs, they put more stress on the electrical part of the switch system and require a high-performance switch SerDes and careful motherboard signal integrity design. We expect 2023 to be busy with measurement and testing activities for LPO products.

 

800 Gbps Pluggable Optics are Ready for Production Volume and 1.6 Tbps Optics are already in the Making

While we are excited about the aforementioned futuristic technologies that may take a few more years to mature, we are equally thrilled about the products on display at OFC that will contribute to market growth in the near future, such as the 800 Gbps optical pluggable transceivers, which were widely represented at the event this year. The timing was perfect, as it is aligned with the availability of 51.2 Tbps chips from various vendors, including Broadcom and Marvell. While 800 Gbps optics started shipping in 2022, more suppliers are currently sampling, and volume production is expected to ramp up by the end of this year, as indicated in our 5-year Data Center Forecast report, January 2023 edition. In addition, several 1.6 Tbps optical components and transceivers based on 200G per lambda were also introduced at OFC 2023, but we do not expect to see substantial volumes in the market before 2025/2026.

If you would like to hear more about our findings, please contact us at dgsales@delloro.com


Before diving into our data center predictions for 2023, I would first like to recap some of the key trends that I highlighted in the 2022 predictions blog.

Growth in the prior year was shaped by supply chain constraints, which had disrupted data center deployment plans for the previous two years. As we had anticipated, the supply constraints started to ease in the second half of 2022, as vendors optimized their sourcing strategies and as global demand for electronic components subsided. We had also predicted that data center capex for the Top 4 US Cloud SPs—Amazon, Google, Meta, and Microsoft—would grow by over 30% in 2022. Indeed, the Top 4 are on track to increase data center spending by 32% in 2022 (according to the 3Q22 data center quarterly report), as they expanded their global footprint, deployed new AI infrastructure, and added compute capacity. On the technology front, I had expected next-generation servers, high-speed server-to-network connectivity, and new AI deployments to gain traction in 2022. While we saw significant deployments of new AI infrastructure (mostly from the hyperscalers) and of 100 and 200 Gbps server ports last year, shipments of next-generation servers based on Intel’s Sapphire Rapids processor have been limited. Despite initial challenges, these upcoming server platforms will be part of the cornerstone for new data center architectures for years to come.

The market conditions will be dramatically different in 2023 compared to the prior year, as supply chains normalize, and demand softens with mounting economic uncertainties. We anticipate the market to maintain near-term growth fueled by backlogged shipments and the current cloud expansion cycle before decelerating through most of 2023. We identify some key trends below that will shape 2023.

Hyperscale Capex Digestion on the Horizon

After data center capex growth exceeded 30 percent in 2022, we anticipate the Top 4 US Cloud SPs will trim data center capex to single-digit growth in 2023, according to our Data Center IT Capex report. Increased demand and supply chain delays have prolonged the current expansion cycle. During the last two years, the Top 4 Cloud SPs have also added more new data centers than in any prior period as they seek to deliver more services globally to meet performance and regulatory requirements. As the current expansion cycle winds down, some of the major Cloud SPs are likely to enter a period of slower growth this year. However, the slowdown is expected to be brief, as the Cloud SPs will follow their typical cadence by returning to another growth cycle.

Chinese Cloud Market Continues to See Headwinds

Data center spending for the Top 3 China-based Cloud SPs—Alibaba, Baidu, and Tencent—contracted last year. That market faced a range of challenges, from heightened government regulation, COVID-related lockdowns, and overcapacity to declining demand for cloud services and a slowing economy. Furthermore, Chinese data center equipment vendors need to tackle the challenge of sourcing high-end processors amid mounting US chip export restrictions. Despite these persistent factors, we do expect a slight rebound in 2023 after a prolonged slowdown in this sector. Furthermore, the cloud market in China is still in its nascent growth stages, and there will be long-term growth opportunities on the horizon.

Softening Enterprise Demand

Enterprise IT spending has historically been sensitive to economic uncertainties. Looking ahead to 2023, we project data center capex to grow by single digits, as mounting economic uncertainties and the rising cost of capital could cause enterprises to slow capital purchases, and cause more enterprises to shift to the cloud. Sales cycles in certain verticals are lengthening as firms reevaluate their IT investment strategies in light of recent developments. However, despite the near-term headwinds, enterprises continue to undergo digital transformation initiatives, while building out their hybrid cloud infrastructures.

New Server Platforms Ready for Launch

We anticipate deployments of new server platforms based on Intel’s Sapphire Rapids and AMD’s Genoa to materialize this year after some setbacks encountered in the prior year. These new server platforms will feature the latest in server interconnect technology, such as PCIe 5, DDR5, and, more importantly, CXL. The CXL standard provides a coherent interface between servers and devices such as memory, enabling memory to be pooled and shared across servers within the rack and improving resource utilization. This architecture could further advance the disaggregation of various rack functions, such as accelerated computing and storage. Most of the hyperscalers and server OEM vendors have announced plans to roll out new servers based on Sapphire Rapids and Genoa this year.
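
As a loose illustration of why a coherent, rack-level memory fabric improves utilization, the toy model below compares unmet memory demand when DRAM is fixed in each server versus drawn from a shared pool of the same total capacity. The demands and capacities are made-up numbers, and the model ignores latency, bandwidth, and topology entirely.

```python
# Toy model of why pooling memory across servers (as CXL enables) can improve
# utilization. Demands and capacities are made-up numbers; latency, bandwidth,
# and topology are ignored entirely.
server_demand_gb = [100, 700, 300, 900]  # per-server memory actually needed
per_server_gb = 512                      # DRAM fixed in each server (no pooling)

# Without pooling: each server is capped at its local DRAM.
unmet_local = sum(max(0, demand - per_server_gb) for demand in server_demand_gb)

# With pooling: the same total DRAM acts as one shared pool for the rack.
pool_gb = per_server_gb * len(server_demand_gb)
unmet_pooled = max(0, sum(server_demand_gb) - pool_gb)

print(f"Unmet demand, fixed per-server DRAM: {unmet_local} GB")   # 576 GB
print(f"Unmet demand, pooled DRAM:           {unmet_pooled} GB")  # 0 GB
```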

Edge Computing Use Cases are Materializing

There are several compelling edge computing use cases on the near horizon. Multi-Access Edge Computing (MEC) is one such compelling opportunity that will enable latency-sensitive applications such as smart factories, augmented/virtual reality, and multi-player cloud gaming. Commercial off-the-shelf (COTS) hardware, such as standard servers, is being applied to the virtualization of various network functions, such as radio access networks and broadband access. In our recently published Telecom Server report, we project revenue for these edge applications will increase by over 60 percent over the next five years.

Let’s Not Forget About Server Connectivity

Server connectivity will also need to evolve continuously so that it does not become the bottleneck between the server and the rest of the network. Today, server ports of 100 Gbps and 200 Gbps have reached mainstream adoption in hyperscale data centers, with 25 Gbps serving most general-purpose workloads. Smart NICs, or data processing units (DPUs), are specialized Ethernet adapters that offload various network and storage functions from the host CPU and can process network traffic with minimal packet loss. While these devices have mostly been deployed by the hyperscalers, Smart NIC revenue growth in the rest of the market could surpass 50 percent in 2023, according to the recent edition of the Ethernet Adapter and Smart NIC report. We could see more mainstream deployment this year as compelling enterprise solutions based on VMware’s Project Monterey begin to ship, and as the industry comes together to bring more open solutions to end-users.