
 

Data centers are the backbone of our digital lives, enabling the real-time processing and aggregation of data and transactions, as well as the seamless delivery of applications to both enterprises and their end customers. Data centers have been able to grow to support ever-increasing volumes of data and transaction processing thanks in large part to software-based automation and virtualization, which allow enterprises and hyperscalers alike to adapt quickly to changing workload volumes and physical infrastructure limitations.

Although data centers have already grown and innovated at a phenomenal pace, with their principles now being integrated into service provider networks, data centers of all sizes are about to undergo a further significant expansion as they are tasked with processing blockchain, Bitcoin, IoT, gigabit broadband, and 5G workloads. In our latest forecast, published earlier this month, we expect worldwide data center capex to reach $350 billion by 2026, representing a five-year projected growth rate of 10%. We also forecast that hyperscale cloud providers will double their data center spending over the next five years.

Additionally, enterprises are becoming smarter about how to balance and incorporate their private, public, and on-premises clouds for the most efficient processing of workloads and application requests. Much like highly resilient service provider networks, enterprises are realizing that distributing workload processing allows them to scale faster and with more redundancy. Despite the general trend toward migrating to the cloud, enterprises will continue to invest in on-premises infrastructure to handle workloads that involve sensitive data, as well as applications that are highly latency-sensitive.

As application requests, change orders, equipment configuration changes, and other general troubleshooting and maintenance requests continue to increase, anticipating and managing the necessary changes in multi-cloud environments becomes exceedingly difficult. Throw in the need to quickly identify and troubleshoot network faults at the physical layer and you have a recipe for a maintenance nightmare and, more importantly, substantial revenue loss due to the cascading impact of fragmented networks that are only peripherally integrated.

Although automation and machine learning tools have been available for some time, they are often designed to automate application delivery within a single cloud environment, not across multiple clouds and multiple network layers. Automating IT processes across both physical and virtual environments, and across the underlying network infrastructure, compute, and storage resources, has been a challenge for some time. Each layer has its own distinct set of issues and requirements.

New network rollouts or service changes that result in network configuration changes are typically very labor-intensive and frequently yield faults in the early stages of deployment that require significant labor to resolve.

Similarly, configuration changes sometimes result in redundant or mismatched operations due to the manual entry of these changes. Without a holistic approach to automation, there is no way to verify or prevent the introduction of conflicting network configurations.

Finally—and this is just as true of service provider networks as it is of large enterprises and hyperscale cloud providers—detecting network faults is often a time-consuming process, principally because network faults are often handled passively until they are located and resolved manually. Traditional alarm reporting followed by manual troubleshooting must give way to proactive and automatic network monitoring that quickly detects network faults and uses machine learning to rectify them without any manual intervention whatsoever.

 

Automating a Data Center’s Full Life Cycle

As the size and complexity of data centers continue to increase, and as workload and application changes accelerate, the impact on the underlying network infrastructure can be difficult to predict. Various organizations, both within and outside the enterprise, have different requirements that must all somehow be funneled into a common platform to prevent conflicting changes anywhere from the application delivery layer down to the network infrastructure. These organizations can also have drastically different timeframes for the expected completion of changes, largely due to siloed management of different portions of the data center, as well as the different diagnostic and troubleshooting tools in use by the network operations and IT infrastructure teams.

In addition to pushing on their equipment vendor and systems integrator partners to deliver platforms that solve these challenges, large enterprises also want platforms that give them the ability to automate the entire lifecycle of their networks. These platforms use AI and machine learning to build a thorough and evolving view of underlying network infrastructure to allow enterprises to:

    • Support automatic network planning and capacity upgrades by modeling how the addition of workloads will impact current and future server requirements as well as the need to add switching and routing capacity to support application delivery.
    • Implement network changes automatically, reducing the need for manual intervention and thereby reducing the possibility of errors.
    • Continuously provide detailed network monitoring at all layers, with proactive fault detection, location, and resolution that limits manual intervention.
    • Simplify the service and application provisioning process by providing a common interface that then translates requests into desired network changes.

Ultimately, one of the key goals of these platforms is to create a closed loop between network management, control, and analysis capabilities so that changes in upper-layer services and applications can automatically drive defined changes in the underlying network infrastructure. For this to become a reality in increasingly complex data center network environments, these platforms must provide some critical functions, including:

    • Providing a unified data model and data lakes across multiple cloud environments and multi-vendor ecosystems
      • This has been a long-standing goal of large enterprises and telecommunications service providers. Ending the swivel-chair approach to network management and delivering error-free network changes with minimal manual intervention are key functions of any data center automation platform.
    • Service orchestration across multiple, complex service flows
      • This function has also been highly sought after by large enterprises and service providers alike. For service providers, SDN overlays were intended to introduce these functions and capabilities into their networks. Deployments have yielded mixed, but generally favorable, results. Nevertheless, the principles of SDN continue to proliferate into other areas of the network, largely due to the desire to streamline and automate the service provisioning process. The same can be said for large enterprises and data center providers.
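To make the closed-loop concept above concrete, the following is a minimal, hypothetical sketch of the observe-analyze-act cycle such a platform might run. The telemetry fields, the 80% utilization policy, and the push_config() call are illustrative assumptions, not any specific vendor's implementation.

```python
# Hypothetical closed-loop automation cycle: observe -> analyze -> plan -> act.
# Telemetry fields, the utilization threshold, and push_config() are illustrative only.

UTILIZATION_LIMIT = 0.80  # assumed policy: act when a link exceeds 80% utilization


def collect_telemetry():
    """Stand-in for streaming telemetry gathered from switches and routers."""
    return [
        {"device": "leaf-01", "link": "uplink-1", "utilization": 0.91},
        {"device": "leaf-02", "link": "uplink-1", "utilization": 0.42},
    ]


def analyze(samples):
    """Flag links that violate the utilization policy."""
    return [s for s in samples if s["utilization"] > UTILIZATION_LIMIT]


def plan_remediation(violation):
    """Translate the intent ('keep links below 80%') into a candidate config change."""
    return {
        "device": violation["device"],
        "change": f"shift ECMP weight away from {violation['link']}",
    }


def push_config(change):
    """Stand-in for a validated, automated configuration push via the platform."""
    print(f"Applying to {change['device']}: {change['change']}")


if __name__ == "__main__":
    # One pass of the loop; a real platform would run this continuously.
    for violation in analyze(collect_telemetry()):
        push_config(plan_remediation(violation))
```

In practice, the analyze and plan steps are where AI and machine learning models replace static thresholds and hand-written rules, but the loop structure stays the same.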

Although these platforms are intended to serve as a common interface across multiple business units and network layers, their design and deployment can be modular and gradual. If a large enterprise wants to migrate to a more automated model, it can do so at a pace suited to the organization's needs. Automation can be introduced first at the network infrastructure layer and then extended to the application layer. Over time, with AI and machine learning tools aggregating performance data across both layers, correlations between application delivery changes and their impact on network infrastructure can be determined more quickly. Ultimately, service and network lifecycle management can be simplified and expanded to cover hybrid cloud or multi-vendor environments.

We believe that these holistic platforms, which bridge the worlds of telecommunications service providers and large enterprise data centers, will play a key role in helping automate data center application delivery by providing a common window into the application delivery network as well as the underlying network infrastructure. The result will be more efficient use of network resources, a reduction in the time required to make manual configuration changes to the network, a reduction in the programming load on IT departments, and strict compliance with SLA guarantees to key end customers and application provider partners.


 

As pandemic-related headwinds started to ease, we were optimistic about a return to higher growth in data center infrastructure spending in 2021. The Cloud was entering an expansion cycle and demand signals in the Enterprise were gaining momentum. While data center capex grew 9% in 2021, in line with our prior projections, growth was driven mainly by the higher cost of data center equipment rather than by unit volume. Server unit growth, which was flat for the year, was constrained by component shortages and long lead times. Deliveries of networking and physical infrastructure equipment are also facing a mounting backlog. Furthermore, higher supply chain costs, stemming from increased commodity, expediting, and logistics expenses, led to higher system prices. Our 2022 outlook is more optimistic, with projected data center capex growth of 17%, accompanied by double-digit growth in server unit shipments. We identify the following key trends that could shape the dynamics of data center capex in 2022.

Hyperscale Cloud on Expansion Cycle

The Top 4 Cloud service providers—Amazon, Google, Meta (formerly Facebook), and Microsoft—are expected to increase data center capex by over 30% in 2022. Investments will go toward the replacement of aged servers, increased deployment of accelerated computing, and servers for new data centers in more than 30 regions scheduled to launch in 2022. Furthermore, infrastructure planned last year but not deployed due to extended equipment lead times has created an additional growth tailwind as those deliveries are fulfilled in 2022.

Supply Chain Stabilizing

Generally, the major Cloud service providers have weathered this tough supply chain climate better than the rest of the market, given their strong visibility into demand, which allows them to proactively increase inventory levels of crucial components and build redundancy into their supply chains. On the other hand, data center capex growth among Tier 2 and Tier 3 Cloud service providers and the Enterprise has been supply-constrained. There is some consensus that supply chain disruptions are starting to stabilize and could ease by the second half of 2022. Lead times for servers could improve sooner than those for other data center equipment, such as networking, given servers' relatively larger scale and lower product mix.

Metaverse Could Drive Opportunities In AI Infrastructure

Some of the major Cloud service providers, such as Apple, Meta, Microsoft, and Tencent, have announced plans to enrich their metaverse offerings for both enterprise and consumer applications. This would require increased investment in new infrastructure, such as servers with accelerated co-processors, low-latency networking, and enhanced thermal management solutions. Chip manufacturers and major Cloud service providers will be developing specialized processors for AI applications, and the ecosystem would need to evolve to enable the community of AI application developers to broaden the reach of AI into enterprises. AI infrastructure is costly and will be a major capex driver. For instance, we estimate that the cost of AI infrastructure is largely responsible for Meta’s plans to increase capex by approximately 60% this year.

New Server Architectures On The Horizon

Intel is releasing a new processor platform, Sapphire Rapids, later this year. Sapphire Rapids will feature the latest in server interconnect technologies, such as PCIe 5, DDR5, and, most importantly, CXL. These new high-speed interfaces could alleviate system bandwidth constraints, enabling more processor cores and memory to be packaged into a single server. CXL would enable memory sharing between the CPU and other co-processors within the server and rack, allowing data-intensive applications such as AI to access memory more efficiently and at lower latencies. AMD and ARM will also incorporate these new interfaces into their processor platforms. We expect these enhancements could kick off a multi-year journey of new server architecture developments.
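As a rough illustration of why these interfaces matter, the back-of-the-envelope calculation below compares the theoretical per-direction bandwidth of a PCIe Gen4 x16 link with a Gen5 x16 link, the kind of link a CXL device would ride on. The figures ignore protocol and payload overhead and are for illustration only.

```python
# Back-of-the-envelope PCIe bandwidth comparison (per direction, x16 link).
# Real-world throughput is lower once protocol and payload overheads are included.

def pcie_gbps_per_direction(transfer_rate_gt_s, lanes=16, encoding=128 / 130):
    """Raw transfer rate (GT/s) x lanes x 128b/130b encoding efficiency."""
    return transfer_rate_gt_s * lanes * encoding

gen4 = pcie_gbps_per_direction(16.0)   # PCIe Gen4: 16 GT/s per lane
gen5 = pcie_gbps_per_direction(32.0)   # PCIe Gen5: 32 GT/s per lane

print(f"PCIe Gen4 x16: ~{gen4:.0f} Gb/s (~{gen4 / 8:.0f} GB/s) per direction")
print(f"PCIe Gen5 x16: ~{gen5:.0f} Gb/s (~{gen5 / 8:.0f} GB/s) per direction")
# Roughly 252 Gb/s (~32 GB/s) versus 504 Gb/s (~63 GB/s): a doubling of raw
# bandwidth that helps explain the headroom for more cores, memory, and accelerators.
```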

Let’s Not Forget About Server Connectivity

Last but not least on this list, server connectivity will also need to evolve continuously so that the link between the server and the rest of the network does not become a bottleneck. The hyperscale Cloud service providers have been deploying in production the latest generation of network interface cards (NICs) based on 56 Gbps PAM-4 SerDes, delivering up to 100 Gbps for general-purpose workloads and up to 200 Gbps for advanced workloads such as AI. The Enterprise is fully embracing 25 Gbps NICs, and we anticipate the number of 25 Gbps ports to overtake that of 10 Gbps later this year. Smart NICs, or data processing units (DPUs), are being deployed by the major Cloud service providers across their infrastructure to improve server utilization and to accelerate latency-sensitive applications such as AI. Outside of the hyperscalers, Smart NIC adoption is still in its nascent stage. However, given that most network adapter vendors have a Smart NIC solution available in the market, enterprises potentially have a wide range of choices to fit their applications and budget.
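The NIC speeds above follow directly from the SerDes lane math. The short sketch below shows how 56 Gbps PAM-4 lanes, each yielding roughly 50 Gbps of usable line rate once encoding and FEC overhead are accounted for, combine into the 100 Gbps and 200 Gbps ports mentioned here; the overhead figure is treated as a fixed approximation.

```python
# Approximate NIC port speed from SerDes lane count.
# A 56 Gbps PAM-4 SerDes yields roughly 50 Gbps of usable line rate per lane
# after encoding/FEC overhead (treated here as a fixed approximation).

USABLE_GBPS_PER_PAM4_LANE = 50

def port_speed_gbps(lanes):
    return lanes * USABLE_GBPS_PER_PAM4_LANE

print(port_speed_gbps(2))  # 100 Gbps NIC: two lanes, general-purpose Cloud workloads
print(port_speed_gbps(4))  # 200 Gbps NIC: four lanes, advanced workloads such as AI
```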

 


 

A New Year always marks a great time to look back and reflect on the previous year and to predict what it means for the coming year. It is an especially exciting time for the Data Center Physical Infrastructure research at Dell’Oro Group, with the program’s first publication of the Q3 2021 Report. While we did not make any predictions for data center physical infrastructure in 2021, we can certainly recap the year before looking at our 2022 predictions.

For the data center physical infrastructure market, 2021 can be split into two major themes. During the first half of 2021, the market rebounded strongly, growing 17.7% to $10 billion after a pandemic-induced dip in 2020. Year-over-year comparisons were favorable, but it was cloud service provider investment and rebounding enterprise spending in North America and EMEA that drove the market past 2019 levels. However, the story changed in the second half of 2021. New COVID-19 variants, Delta and Omicron, reared their ugly heads, while supply chains began to break down, leading to a lack of availability of components and products, raw material price increases, and labor and logistical issues. We forecast that this slowed data center physical infrastructure growth to 4.7%, with the market reaching $11.4 billion in revenues during the second half of 2021. Data center physical infrastructure vendors entered 2022 with record backlogs, but questions remain about how much of that backlog they will be able to deliver as demand continues to outpace supply. While these supply chain issues will likely persist throughout 2022, what else does the data center physical infrastructure market have in store for us?

1. Plans to Reach Long-Term Data Center Sustainability Goals Begin to Materialize

As the global COVID-19 pandemic accelerated digital adoption and growth throughout 2021, it also cast a large shadow on the growing climate impact of data center growth. It’s no wonder sustainability quickly became one of the most common buzzwords in the industry. The data center industry responded by aggressively expanding sustainability commitments, which were previously tied largely to 100% renewable energy offset credits. Renewable energy goals transitioned from 100% renewable energy offsets to 100% renewable energy consumption. Data center water usage also came under fire, with Microsoft notably pledging to cut water usage 95% by 2024 and become water positive by 2030. But by far the most common goal set by data center owners and operators was to become carbon neutral, or in some cases even carbon negative, by 2030. Critics were quick to point out the difficult path to achieving those goals, with details on how they would be reached remaining sparse. 2022 will bring more clarity on some of the technologies that will help enable progress toward those goals. Data center physical infrastructure, specifically, will play a big role in a number of areas:

    • Backup power connects to the grid – A large portion of data center physical infrastructure is dedicated to providing clean, uninterruptible power to IT infrastructure even during a utility power outage, through the use of UPS systems, batteries, and generators. Those systems largely sit idle when utility power is available. That is beginning to change, spurred by the adoption of lithium-ion batteries, which are creating new energy storage use cases at data center facilities. This technology, commonly referred to as grid-interactive UPS, will enable those idle assets to become revenue-generating or cost-saving through peak shaving, frequency regulation, and other grid participation activities, in addition to supporting better integration of renewable energy (a simplified peak-shaving sketch follows this list). Microsoft and Eaton have publicly collaborated on grid-interactive UPS, recently releasing a white paper on the subject. We predict major strides in grid-interactive UPS systems in 2022, with details and an ecosystem forming around early pilots to support execution of larger-scale rollouts.
    • Fuel cells replace generators – Okay, this isn’t happening in 2022. But the recent announcement that Vertiv, Equinix, and other utility, fuel cell, and research partners are working on a proof-of-concept (POC) fuel cell use case for data centers, funded by the Clean Hydrogen Partnership, sure does create some excitement. Vertiv has committed to providing a 100 kW fuel cell module with an integrated UPS by 2023. Here’s hoping we get updates throughout the year on how fuel cell technology can be applied to data centers and on what timeline.
    • Data center heat re-use bubbles up to the top of sustainability priorities – Data centers consume a lot of power and, in turn, generate a lot of heat. Today, air-based thermal management systems capture that heat and reject it into the atmosphere. However, there is a significant opportunity to re-use that heat, with district heating and urban farming as commonly cited examples. The difficulty in scaling data center heat re-use is that today’s thermal management designs and infrastructure largely don’t support it. In 2022, we predict that to change, with heat re-use capability being designed into new products and data center architectures. To take full advantage of heat re-use, data center owners and ecosystem vendors will turn to liquids, which transfer energy up to ten times more efficiently than air.
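To illustrate the peak-shaving idea mentioned in the first bullet, here is a minimal, hypothetical control rule: when facility demand exceeds a contracted threshold, the UPS battery discharges to cover the excess, and it recharges during off-peak periods while always preserving a backup reserve. The thresholds, battery sizing, and charging limit are simplifying assumptions, not a description of any vendor's product.

```python
# Hypothetical peak-shaving logic for a grid-interactive UPS battery.
# Thresholds, battery sizing, and the 100 kW charge limit are illustrative only.

PEAK_THRESHOLD_KW = 900      # assumed contracted demand limit with the utility
BATTERY_CAPACITY_KWH = 500   # assumed usable lithium-ion capacity
MIN_RESERVE_KWH = 200        # always keep enough energy for the backup role

def peak_shave(load_kw, battery_kwh, interval_h=0.25):
    """Return (grid_draw_kw, new_battery_kwh) for one 15-minute interval."""
    if load_kw > PEAK_THRESHOLD_KW and battery_kwh > MIN_RESERVE_KWH:
        # Discharge only the excess above the threshold, without dipping
        # below the reserve needed for the UPS's primary backup duty.
        discharge_kw = min(load_kw - PEAK_THRESHOLD_KW,
                           (battery_kwh - MIN_RESERVE_KWH) / interval_h)
        return load_kw - discharge_kw, battery_kwh - discharge_kw * interval_h
    # Off-peak: recharge toward full capacity, capped at an assumed 100 kW.
    charge_kw = min(100, (BATTERY_CAPACITY_KWH - battery_kwh) / interval_h)
    return load_kw + charge_kw, battery_kwh + charge_kw * interval_h

grid_kw, soc = peak_shave(load_kw=1050, battery_kwh=450)
print(f"Grid draw held to {grid_kw:.0f} kW; battery at {soc:.1f} kWh")
```

Frequency regulation and other grid services follow the same pattern, with the decision rule driven by a grid operator signal rather than the facility's own load.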

2. Liquid Cooling Adoption Momentum Continues as POC Deployments Proliferate and Early Adopters Begin Larger Rollouts

Traditionally, the data center industry has been conservative in adopting new physical infrastructure technologies. Interested in bringing liquids into my IT space, let alone into the IT rack? Absolutely not. However, as Moore’s Law has struggled to keep pace, data center rack densities have started to rise. In the high-performance computing (HPC) space, air cooling simply wasn’t an option anymore as HPC rack densities surpassed 20 kW, 50 kW, and in some cases even 100 kW. This trend formed the foundation of today’s liquid cooling market, which includes both direct liquid cooling (pumping liquid to cold plates attached directly to CPUs, GPUs, and memory) and immersion cooling (submerging an entire rack of servers in a liquid-filled tank).
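A rough heat balance shows why air runs out of headroom at these densities. Using Q = m_dot x c_p x delta_T with textbook fluid properties, the sketch below compares the coolant flow needed to remove 50 kW from a rack with air versus water, assuming a 10 °C temperature rise for both; the load and temperature rise are assumptions used only for illustration.

```python
# Rough heat-balance comparison: coolant flow needed to remove 50 kW at a 10 C rise.
# Q = m_dot * c_p * delta_T, using textbook properties; for illustration only.

HEAT_LOAD_W = 50_000     # assumed rack heat load (50 kW)
DELTA_T_C = 10           # assumed coolant temperature rise

AIR_CP = 1005            # specific heat of air, J/(kg*K)
AIR_DENSITY = 1.2        # kg/m^3
WATER_CP = 4186          # specific heat of water, J/(kg*K)
WATER_DENSITY = 1000     # kg/m^3

air_kg_s = HEAT_LOAD_W / (AIR_CP * DELTA_T_C)
water_kg_s = HEAT_LOAD_W / (WATER_CP * DELTA_T_C)

air_m3_s = air_kg_s / AIR_DENSITY
water_l_s = water_kg_s / WATER_DENSITY * 1000

print(f"Air:   ~{air_m3_s:.1f} m^3/s of airflow (~{air_m3_s * 2119:.0f} CFM)")
print(f"Water: ~{water_l_s:.1f} L/s of flow (~{water_l_s * 60:.0f} L/min)")
# Moving thousands of CFM through a single rack quickly becomes impractical,
# while the equivalent water loop is roughly a liter per second.
```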

Liquid cooling market revenue growth accelerated in 2021, rising an estimated 64.3% from 2020 to $113M. Another 25% of growth is forecast for 2022, with the market expected to reach $141M despite constrained supply chains. This growth is expected to be driven by proliferating POCs from cloud, colocation, and telco service providers, in addition to large enterprises dipping their toes in. For early adopters, larger-scale rollouts of liquid cooling technology are forecast to begin, supported by increased awareness of and comfort in operating liquid-cooled data centers. With momentum continuing to build, an inflection point for liquid cooling adoption appears near.

3. Supply Chain Resiliency and Integrated Solutions Drive Mergers, Acquisitions, and Partnerships

Supply chain discussions are creeping into nearly every conversation these days, so we can’t have 2022 predictions without assessing what impact they might have on the year. First, we do believe supply chain issues will persist throughout 2022, and potentially into 2023. However, we predict their lasting impact on the year will be from the mergers, acquisitions, and partnerships they drive.

Supply chain disruptions have become commonplace over the past three years. Since the onset of US-China trade war tensions, data center physical infrastructure vendors have been localizing supply chains in-region, for-region. The pandemic has only added more unpredictability to global supply chains, exposing further weaknesses. To address these weaknesses, we predict a flurry of mergers and acquisitions. We believe these acquisitions will be focused on supply chain resiliency, establishing and growing manufacturing footprints in select regions, while also supporting the delivery of holistic data center solutions at the rack, row, pod, or building level. Checking multiple of these boxes makes any potential acquisition quite appetizing in 2022.

At the beginning of next year, we’ll circle back and see how we did on our predictions. In the meantime, stay connected with the data center physical infrastructure program for the latest updates.


The Nvidia GTC Fall 2021 virtual event I attended last week highlighted some exciting developments in the field of AI and machine learning, most notably in new applications for the metaverse. A metaverse is a digital universe created by the convergence of the real world and a virtual world built from virtual reality, augmented reality, and other 3D visual projections.

Several leading Cloud service providers recently laid out their visions of the metaverse. Facebook, which changed its name to Meta to reflect its focus on the metaverse, envisions people working, traveling, and socializing in virtual worlds. Microsoft already offers holograms and mixed reality on its Microsoft Mesh platform and announced plans to bring holograms and virtual avatars to Microsoft Teams next year. Tencent recently shared its metaverse plan to leverage its strengths in multiplayer gaming on its social media platform.

In order to recreate an accurate virtual representation of the real world, massive amounts of AI training data would need to be acquired, captured, and processed. This would stretch the limits of the compute infrastructure. During GTC, Nvidia highlighted various solutions in three areas that could help pave the way for the proliferation of the metaverse in the near future:

  • Compute Architecture: During the Q&A session, I asked Nvidia CEO Jensen Huang how the data center would need to evolve to meet the needs of the metaverse. Jensen emphasized that computer vision, graphics, and physics simulation would need to converge in a coherent architecture and be scaled out to millions of people. In a sense, this would be a new type of computer, a fusion of various disciplines with the data center as the new unit of computing. In my view, such an architecture would be composed of a large cluster of accelerated servers with multiple GPUs within a network of tightly coupled, general-purpose servers. The servers would run applications and store massive amounts of data. Memory-coherent interfaces, such as CXL, NVLink, or their future iterations, offered on x86- and ARM-based platforms, would enable memory sharing across racks and pods. These interfaces would also improve connectivity between CPUs and GPUs, reducing system bottlenecks.
  • Network Architecture: As the unit of computing continues to scale, new network architectures will need to be developed. During GTC, Nvidia introduced Quantum-2, a networking solution composed of 400 Gbps InfiniBand and the Bluefield-3 DPU (data processing unit) Smart NIC. This combination will enable high-throughput, low-latency networking in the dense, tightly coupled clusters of up to one million nodes needed for metaverse applications. 400 Gbps is the fastest server access speed available today, and it could double to 800 Gbps within several years. The ARM processor in the Bluefield DPU can directly access the network interface, bypassing the CPU and benefiting time-sensitive AI workloads. Furthermore, we can expect these scaled-out computing clusters to be shared across multiple users. With a Smart NIC such as the Bluefield DPU, isolation can be provided among users, thereby enhancing security.
  • Omniverse: The compute and network infrastructure can only be effectively utilized with a solid software development platform and ecosystem in place. Nvidia’s Omniverse provides the platform that enables developers and enterprises to create and connect virtual worlds for various use cases. During GTC, Jensen described how the Omniverse could be used to build a digital twin of an automotive factory, with the manufacturing process simulated and optimized by AI before the twin serves as the blueprint for the physical construction. Potential applications range from education to healthcare, retail, and beyond.

We are still in the initial developmental stages of the metaverse; the technology building blocks and ecosystem are still coming together. Furthermore, as we have seen recently with certain social media platforms and the gaming industry, new regulations could emerge to reset the boundaries between the real and virtual worlds. Nevertheless, I believe that the metaverse has the potential to unlock new use cases for both consumers and enterprises and to drive investments in data center infrastructure in the Cloud and Enterprise. To access the full Data Center Capex report, please contact us at dgsales@delloro.com.


Dell’Oro Group projects that spending on accelerated compute servers targeted at artificial intelligence (AI) workloads will grow at a double-digit rate over the next five years, outpacing other data center infrastructure. An accelerated compute server, equipped with accelerators such as GPUs, FPGAs, or custom ASICs, can generally handle AI workloads with much greater efficiency than a general-purpose server without accelerators. Numerically speaking, these servers still represent only a fraction of Cloud service providers’ overall server footprint. Yet, at ten or more times the cost of a general-purpose server, accelerated compute servers are becoming a substantial portion of data center capex.

Tier 1 Cloud service providers are increasing their spending on new infrastructure tailored for AI workloads. On Facebook’s 3Q21 earnings call, the company announced plans to increase capex by more than 50% in 2022. Investments will be driven by AI and machine learning to improve ranking and recommendations across Facebook’s platform. In the longer term, as the company shifts its business model toward the metaverse, capex investments will be driven by video and compute-intensive applications such as AR and VR. At the same time, Tier 1 Cloud service providers such as Amazon, Google, and Microsoft also aim to increase spending on AI-focused infrastructure to enable their enterprise customers to deploy applications with enhanced intelligence and automation.

It has been a year since my last blog on AI data center infrastructure. Since that time, new architectures and solutions have emerged that could pave the way for the further proliferation of AI in the data center. Following are three innovations I’ll be watching closely:

New CPU Architectures

Intel is scheduled to launch its next-generation Sapphire Rapids processor next year. With its AMX (Advanced Matrix Extensions) instruction set, Sapphire Rapids is optimized for AI and ML workloads. CXL, which will be offered with Sapphire Rapids for the first time, will establish a memory-coherent, high-speed link over the PCIe Gen 5 interface between the host CPU and accelerators. This, in turn, will reduce system bottlenecks by enabling lower latencies and more efficient sharing of resources across devices. AMD will likely follow on the heels of Intel and offer CXL on EPYC Genoa. For ARM, competing coherent interfaces will also be offered, such as CCIX with Ampere’s Altra processor and NVLink on Nvidia’s upcoming Grace processor.

Faster Networks and Server Connectivity

AI applications are bandwidth-hungry. For this reason, the fastest networks available would need to be deployed to connect host servers to accelerated servers, facilitating the movement of large volumes of unstructured data and training models (a) between the host CPU and accelerators, and (b) among accelerators in a high-performance computing cluster. Some Tier 1 Cloud service providers are deploying 400 Gbps Ethernet networks and beyond. The network interface card (NIC) must also evolve to ensure that server connectivity is not a bottleneck as data sets become larger. 100 Gbps NICs have been the standard server access speed for most accelerated compute servers. Most recently, however, 200 Gbps NICs are increasingly being used for these high-end workloads, especially by Tier 1 Cloud service providers. Some vendors have added an additional layer of performance by integrating accelerated compute servers with Smart NICs, or Data Processing Units (DPUs). For instance, Nvidia’s DGX system can be configured with two Bluefield-2 DPUs to facilitate packet processing of large datasets and provide multi-tenant isolation.
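To put these speeds in perspective, the sketch below estimates how long it would take to move a hypothetical 10 TB training dataset at different link rates, ignoring protocol overhead, congestion, and storage bottlenecks; the dataset size is an arbitrary assumption for illustration.

```python
# Ideal transfer time for a hypothetical 10 TB training dataset at various NIC speeds.
# Ignores protocol overhead, congestion, and storage limits; illustration only.

DATASET_TB = 10  # assumed dataset size

def transfer_minutes(link_gbps, dataset_tb=DATASET_TB):
    bits = dataset_tb * 8 * 1000**4          # TB -> bits (decimal units)
    return bits / (link_gbps * 1e9) / 60     # seconds -> minutes

for speed_gbps in (100, 200, 400):
    print(f"{speed_gbps} Gbps link: ~{transfer_minutes(speed_gbps):.1f} minutes")
# Roughly 13.3, 6.7, and 3.3 minutes, respectively: each step up in NIC speed
# directly shortens the time accelerators spend waiting on data.
```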

Rack Infrastructure

Accelerated compute servers, generally equipped with four or more GPUs, tend to be power-hungry. For example, an Nvidia DGX system with eight A100 GPUs has a maximum system power usage rated at 6.5 kW. Extra consideration is needed to ensure efficient thermal management. Today, air-based thermal management infrastructure is predominantly used. However, as rack power densities rise to support accelerated computing hardware, the efficiency limits of air cooling are being reached. Novel liquid-based thermal management solutions, including immersion cooling, are under development to further enhance the thermal efficiency of accelerated compute servers.
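A quick calculation shows why these systems push racks past typical air-cooled design points. The per-system figure is the 6.5 kW maximum cited above; the rack counts and the roughly 15 kW air-cooled comfort zone are assumptions used only for illustration.

```python
# Rack power density from accelerated compute servers (illustrative assumptions).
# 6.5 kW is the DGX A100 maximum cited above; ~15 kW per rack is used here as a
# commonly cited air-cooled design point, not a hard limit.

SYSTEM_MAX_KW = 6.5
AIR_COOLED_DESIGN_POINT_KW = 15

for systems_per_rack in (1, 2, 4):
    rack_kw = systems_per_rack * SYSTEM_MAX_KW
    status = "within" if rack_kw <= AIR_COOLED_DESIGN_POINT_KW else "beyond"
    print(f"{systems_per_rack} system(s): {rack_kw:.1f} kW per rack "
          f"({status} a ~{AIR_COOLED_DESIGN_POINT_KW} kW air-cooled design point)")
```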

These technology trends will continue to evolve and drive the commercialization of specialized hardware for AI applications. Please stay tuned for more updates from the upcoming Data Center Capex reports.