
Why Liquid Cooling Is Becoming the Default for AI‑Scale Data Centers

August 31, 2025

Big compute now runs hot. AI training clusters and high‑performance workloads have pushed servers and racks to power levels that traditional air systems struggle to handle. Fans are louder. Aisle temperatures are higher. Floor tiles and raised floors, once standard, are turning into bottlenecks. That is why liquid cooling has shifted from a niche solution to a mainstream option, not just for supercomputers but for commercial data centers in cities and industrial parks.

This story explains the physics, the hardware, and the facility changes behind that shift. It shows the trade‑offs between cold plates, rear‑door heat exchangers, and immersion tanks. It digs into coolant chemistry, corrosion, and leak management. It also looks at efficiency, waste heat recovery, and what it takes to retrofit a live room without drama. The goal: give you a practical, clear guide to what liquid cooling involves and why it matters now.

The heat problem no fan can solve

Modern accelerators are compact heaters. A single high‑end GPU routinely draws several hundred watts under load. Newer modules approach 1 kilowatt per device. Put eight to twelve of them on a board and the heat climbs fast. In dense racks, total draw of 80–150 kW is becoming common, with designs pushing beyond 200 kW. Air cooling has limits at these densities because air is a poor heat carrier. It has low thermal capacity and low conductivity compared to liquids.

Three basic physics points explain why air reaches a wall:

  • Heat flux at the chip: Dies and interposers can see 50–100 W/cm² hotspots. Moving that energy through heat spreaders and heatsinks into air requires big temperature differences and aggressive airflow. That adds fan power, noise, and vibration.
  • Airflow path complexity: To cool a rack at 100 kW with air, you must push a huge volume through heatsinks and out the rear. Every bend, grill, cable cutout, and obstruction adds back pressure. Precision matters; any bypass or recirculation destroys performance.
  • Practical limits: Even with hot‑aisle/cold‑aisle containment, in‑row coolers, and optimized plenum design, most air‑only rooms hit a ceiling around 20–40 kW per rack. Rear‑door heat exchangers extend that, but not infinitely. Many facilities still cannot deliver the airflow or chilled water needed for the next tier.

Liquids carry far more heat per unit volume and enable smaller temperature lifts for the same thermal duty. That is why the new generation of server designs embraces cold plates and immersion tanks. The chips stay within safe limits with less fan energy and tighter control.
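
To put rough numbers on that difference, here is a minimal Python sketch comparing the volumetric flow needed to move 100 kW with air versus water at the same temperature rise. The property values and the 10 K rise are illustrative assumptions, not design figures.

```python
# Back-of-the-envelope comparison: volumetric flow needed to move
# 100 kW of heat with air versus water at the same 10 K rise.
# Q = rho * V_dot * cp * dT  ->  V_dot = Q / (rho * cp * dT)

Q_W = 100_000        # rack heat load, watts
DT_K = 10            # allowed coolant temperature rise, kelvin

fluids = {
    #          density kg/m^3, specific heat J/(kg*K)
    "air":    (1.2,   1005),
    "water":  (997,   4180),
}

for name, (rho, cp) in fluids.items():
    v_dot = Q_W / (rho * cp * DT_K)   # m^3/s
    print(f"{name:>5}: {v_dot:.4f} m^3/s "
          f"({v_dot * 1000:.1f} L/s, {v_dot * 2118.88:.0f} CFM)")

# Roughly 8.3 m^3/s of air (~17,600 CFM) versus about 2.4 L/s of
# water -- a factor of roughly 3,500 by volume for the same duty.
```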

How liquid cooling works in practice

“Liquid cooling” covers several approaches. Each moves heat from chips to water‑cooled coils or to a secondary fluid that can be pumped to a heat rejection system. The three most common options today are direct‑to‑chip cold plates, rear‑door heat exchangers, and immersion.

Direct‑to‑chip cold plates

Cold plates bolt directly on top of GPUs, CPUs, and sometimes memory modules. Inside each plate, microchannels or pin fins guide coolant across the hottest areas. A pump drives coolant through a manifold that feeds every plate on a board and then returns it to a Coolant Distribution Unit (CDU). The CDU contains heat exchangers, filters, flow meters, and controls. It connects to the facility water loop.

Key design details:

  • Warm‑water operation: Many cold‑plate systems run on 25–45°C (77–113°F) water. This often enables “free cooling” with dry coolers instead of energy‑intensive chillers, depending on climate.
  • Leak‑resistant plumbing: Quick‑disconnects with self‑sealing valves allow service without draining the loop. Hoses route along sidewalls and secure into blind‑mate frames to minimize human error.
  • Bypass and balancing: Flow restrictors or smart valves ensure each plate sees the right flow. Without balance, early branches hog coolant while tail plates starve.
  • Hybrid cooling: Cold plates remove most of the heat at the chip. Small fans still cool VRMs, NICs, and drives. This “air assist” typically handles 10–20% of server heat.
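
As a rough illustration of the flow and balancing numbers above, here is a minimal sizing sketch. The server configuration, heat split, and allowed temperature rise are assumptions for the sake of arithmetic, not vendor figures.

```python
# Minimal flow-sizing sketch for a cold-plate manifold, assuming a
# hypothetical 8-GPU server drawing ~6.5 kW, with cold plates capturing
# ~85% of the heat and small "air assist" fans handling the rest.
# Per-branch flow follows from Q = rho * V_dot * cp * dT.

RHO = 997                    # kg/m^3, warm water (a glycol blend differs slightly)
CP = 4180                    # J/(kg*K)

server_heat_w = 6500
liquid_fraction = 0.85       # share of server heat captured by the plates
plates = 8                   # one plate per GPU in this sketch
dt_plate_k = 6               # allowed temperature rise across each plate

liquid_heat_w = server_heat_w * liquid_fraction
heat_per_plate_w = liquid_heat_w / plates

flow_per_plate_m3s = heat_per_plate_w / (RHO * CP * dt_plate_k)
flow_per_plate_lpm = flow_per_plate_m3s * 1000 * 60

print(f"heat per plate  : {heat_per_plate_w:.0f} W")
print(f"flow per plate  : {flow_per_plate_lpm:.2f} L/min")
print(f"server manifold : {flow_per_plate_lpm * plates:.1f} L/min")
print(f"air-assist load : {server_heat_w * (1 - liquid_fraction):.0f} W")
```

Balancing hardware exists precisely so that each plate actually receives something close to that per‑plate figure once many servers share a manifold.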

Rear‑door heat exchangers

Rear‑door units place a water‑cooled coil at the back of the rack. Server fans push warm exhaust air through the coil, which captures much of the heat before air returns to the room. This approach is attractive for retrofits because it avoids touching server internals. It can support 40–80 kW per rack under good conditions and works well with a decent chilled water plant.

Limitations include dependency on server fans (still power‑hungry at high loads) and the fact that hot spots inside the server remain air‑cooled. Even so, rear‑door units are a common bridge for facilities that need density now without a full re‑architecture.
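
As a sanity check on those per‑rack figures, a simple effectiveness estimate shows how capture capacity falls out of airflow and water temperature. The effectiveness, airflow, and temperatures below are illustrative assumptions; real coils come with vendor performance curves.

```python
# Quick estimate of heat captured by a rear-door coil, using the
# effectiveness relation Q = eps * C_min * (T_hot_in - T_cold_in),
# with the air side taken as the limiting (C_min) stream.

RHO_AIR, CP_AIR = 1.2, 1005      # kg/m^3, J/(kg*K)

airflow_m3s = 4.0                # exhaust pushed through the door by server fans
t_exhaust_c = 45.0               # server exhaust temperature
t_water_in_c = 20.0              # facility water supply to the coil
effectiveness = 0.7              # assumed coil effectiveness

c_air = RHO_AIR * airflow_m3s * CP_AIR                        # W/K
q_captured_kw = effectiveness * c_air * (t_exhaust_c - t_water_in_c) / 1000

print(f"heat captured by the door: ~{q_captured_kw:.0f} kW")
# ~84 kW here, in line with the 40-80 kW per rack range quoted above;
# warmer water or weaker airflow pulls the number down quickly.
```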

Immersion cooling

Immersion places entire servers into a tank of dielectric fluid. Two variants exist:

  • Single‑phase immersion: Servers sit in a bath of non‑conductive oil or synthetic hydrocarbon. Pumps move the fluid through an external heat exchanger. It is mechanically simple, widely supported by server vendors, and has good safety characteristics.
  • Two‑phase immersion: The fluid boils at a low temperature on hot components. Vapor rises and condenses on a coil, releasing heat. Boiling provides very high heat transfer. However, fluid selection is critical, and supply chains and environmental constraints require careful vetting.

Immersion achieves very high density and excellent temperature uniformity. It can reduce or eliminate system fans, which lowers noise and power draw. Trade‑offs include compatibility (some parts or adhesives may swell in certain fluids), service workflows (lifting servers out of tanks), and facility changes (tanks have different floor loads and footprint). Operators must also plan for fluid management end‑to‑end, from storage and filtration to spill response and recycling.

Choosing coolants and materials

Coolant choice affects safety, efficiency, and maintenance. The two big categories are water‑based coolants and dielectric fluids for immersion or leak‑tolerant designs.

Water and water‑glycol blends

For cold plates and rear‑door coils, water is the standard because it is cheap, effective, and well understood. Blends incorporate corrosion inhibitors and biocides, and sometimes glycol for freeze protection. Good practice includes:

  • Material compatibility: Avoid galvanic couples like aluminum and copper in the same loop without careful control. Stainless steel, copper alloys, and suitable plastics reduce risk.
  • Water quality: Filtered, deionized fill is normal. Conductivity must be kept within ranges recommended by equipment vendors. Periodic testing and top‑off procedures matter.
  • Chemical control: Maintain inhibitor concentrations. Replace cartridges or perform bulk treatment on schedule to prevent corrosion and biological growth.

Dielectric fluids

Dielectric fluids allow electronics to operate in direct contact with coolant. For single‑phase systems, fluids are often synthetic hydrocarbons or esters with low viscosity and good oxidation stability. Two‑phase systems use engineered fluids designed to boil at 40–60°C under atmospheric pressure.

Consider these factors:

  • Thermal performance: Thermal conductivity and specific heat determine how well the fluid carries heat. Lower viscosity aids pumping, but extremely low viscosity can increase leak spread.
  • Materials: Test elastomers, plastics, and adhesives used in connectors, sockets, and cable insulation. Swelling or embrittlement causes long‑term failures. Vendors publish compatibility charts, but field pilots are important.
  • Environmental profile: Check global warming potential (GWP), persistence, and toxicity. Some fluorinated fluids face regulatory pressure. Ensure supplier longevity, recycling programs, and clear safety data sheets.
  • Handling and cleanliness: Keep fluids dry and clean. Water contamination reduces dielectric strength; particulate contamination erodes pumps or clogs microchannels. Plan for filtration, degassing, and sampling.

Facility design: from CDU to cooling tower

Moving heat off a chip is only half the job. The other half is rejecting that heat from the building. Most liquid‑cooled facilities use a multi‑loop approach for safety and control.

Three loops, one objective

  • IT loop: The closed loop that touches IT gear. It is clean, filtered, and chemically controlled. Pressures are modest. This loop connects to servers, tanks, or rear‑door units.
  • Facility secondary loop: Transfers heat from the IT loop via a plate heat exchanger, often inside the CDU. This loop may run at higher pressures and larger pipe sizes. It aggregates heat from many racks.
  • Heat rejection loop: Moves heat outdoors. Options include dry coolers, adiabatic coolers, cooling towers, and chillers. Choice depends on climate, water policy, and heat‑reuse plans.

Warm‑water and free cooling

A strength of liquid systems is their ability to run at warmer temperatures. If the IT loop exits at 35–45°C, a dry cooler can reject heat in many climates for a large fraction of the year. That bypasses chillers and raises overall efficiency. Controls modulate pump speed and valve positions to maintain chip inlet setpoints with minimal energy.
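
A small sketch of that temperature cascade, with assumed approach temperatures for the CDU heat exchanger and the dry cooler, shows where the free-cooling threshold lands.

```python
# Temperature-cascade sketch: work backwards from the chip-inlet
# setpoint to find the warmest ambient at which a dry cooler alone
# can do the job. Setpoint and approach values are assumptions.

chip_inlet_setpoint_c = 40.0   # IT-loop supply temperature to the cold plates
cdu_hx_approach_c = 3.0        # IT loop vs facility loop across the CDU
dry_cooler_approach_c = 8.0    # facility loop vs outdoor air

# The facility loop must reach the CDU this much colder than the setpoint:
facility_supply_c = chip_inlet_setpoint_c - cdu_hx_approach_c
# The dry cooler can deliver that supply temperature when ambient is below:
free_cooling_limit_c = facility_supply_c - dry_cooler_approach_c

print(f"facility loop supply needed : {facility_supply_c:.0f} C")
print(f"free cooling possible below : {free_cooling_limit_c:.0f} C ambient")
# With these assumptions, chiller-less operation covers every hour
# cooler than about 29 C; warmer setpoints push that limit higher.
```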

Condensation and dew point

Warm pipes in cool rooms are safe. Cool pipes in humid rooms are not. Any cold surface below the dew point will collect condensation, which is unacceptable around electronics. Designs typically keep coolant above room dew point or insulate lines. Sensors and interlocks prevent startup conditions that could create condensation on manifolds or connectors.
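
One way to make that check concrete is the Magnus dew-point approximation; the room conditions and safety margin below are assumptions.

```python
# Dew-point check sketch: keep coolant supply above the room dew point
# plus a margin so manifolds and connectors never sweat. The Magnus
# coefficients a and b are the standard approximation constants.

import math

def dew_point_c(temp_c: float, rh_percent: float) -> float:
    """Approximate dew point (C) from dry-bulb temperature and relative humidity."""
    a, b = 17.62, 243.12
    gamma = math.log(rh_percent / 100.0) + a * temp_c / (b + temp_c)
    return b * gamma / (a - gamma)

room_temp_c = 24.0
room_rh_percent = 55.0
coolant_supply_c = 30.0
margin_c = 2.0                 # assumed safety margin

dp = dew_point_c(room_temp_c, room_rh_percent)
print(f"room dew point: {dp:.1f} C")
if coolant_supply_c < dp + margin_c:
    print("WARNING: condensation risk -- raise supply temperature or insulate lines")
else:
    print("OK: coolant stays above dew point with margin")
```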

Standards and guidance

Industry groups publish best practices for liquid‑cooled data centers. You will find guidance on allowable temperatures, materials, and maintainability in widely used documents. Vendors also supply validated installation guides for specific servers and racks. Following these reduces surprises and supports warranty coverage.

Reliability and safety

Liquid and electronics can co‑exist safely. Several practices make that true in production environments:

  • Leak prevention: Use high‑quality quick disconnects rated for the temperature and pressure. Blind‑mate manifolds reduce handling error. Design with drip trays and isolation valves in every branch.
  • Leak detection: Install sensor cables under manifolds and along hose runs. The CDU can alarm and close valves automatically. For immersion, secondary containment and fluid level sensors are standard.
  • Pressure management: Keep loop pressures low and stable. Include expansion tanks. Avoid water hammer by using soft‑start pumps and controlled valve actuation.
  • Service playbooks: Train staff on connect/disconnect procedures, purge steps, and spill response. Practice on mock‑ups before touching production racks.
  • Materials discipline: Stick to approved hoses and O‑rings. A single incompatible elastomer can be the source of a persistent micro‑leak months later.
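
One way to make the leak-detection and isolation points above concrete is a simple zone map; the zone names, valve IDs, and paging targets below are hypothetical, and a real system would call the BMS or DCIM rather than print.

```python
# Sketch of a leak-zone response map: a tripped sensor resolves to a
# specific isolation valve and a specific person to page.

from dataclasses import dataclass

@dataclass
class LeakZone:
    sensor_cable: str
    isolation_valve: str
    oncall: str

ZONES = {
    "row-A-rack-07-manifold": LeakZone("LD-A07", "ISO-A07-RETURN", "mech-team-pager"),
    "cdu-2-drip-tray":        LeakZone("LD-CDU2", "ISO-CDU2-SUPPLY", "facilities-pager"),
}

def on_leak_trip(zone_name: str) -> None:
    zone = ZONES.get(zone_name)
    if zone is None:
        print(f"unknown zone {zone_name}: close row header valves, page facilities")
        return
    print(f"[{zone.sensor_cable}] leak detected")
    print(f"  -> close valve {zone.isolation_valve}")
    print(f"  -> page {zone.oncall}")

on_leak_trip("row-A-rack-07-manifold")
```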

Fire safety does not fundamentally change. Water‑based loops do not add fire load. Dielectric immersion fluids are generally low‑volatility and self‑extinguishing or hard to ignite, but always verify fluid flash points and follow local codes. For any design, ensure ventilation accommodates outgassing during maintenance and that lifting operations around tanks are planned with proper rigging and ergonomics in mind.

Efficiency and TCO: the numbers that matter

Liquid cooling earns its keep in three ways: it enables higher density, it reduces energy used to move heat, and it opens heat‑reuse opportunities.

Density and space

Cold plates and immersion allow more compute per rack without driving aisle temperatures out of range. Higher density shortens cables, reduces floorspace, and simplifies power distribution in new builds. In retrofits, it can keep you in the same building longer instead of needing a new hall for every expansion.

Energy use

Fan power rises steeply with airflow (roughly with its cube for a fixed flow path); pumps are usually far more efficient movers of heat. Warm‑water loops reduce or eliminate chiller run time. The combined result pushes a facility’s power usage effectiveness (PUE) closer to 1.1 or better in some climates. Even modest savings compound at scale and over time.
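
A minimal PUE comparison, using assumed overhead fractions rather than measured values, illustrates the scale of the effect.

```python
# Minimal PUE sketch comparing an illustrative air-cooled room with a
# warm-water liquid-cooled one. PUE = total facility power / IT power.

it_load_kw = 2000

# Cooling and distribution overhead as a fraction of IT load (assumed):
overhead = {
    "air-cooled (chillers, CRAHs, fans)": 0.45,
    "liquid-cooled (pumps, dry coolers)": 0.12,
}

for name, frac in overhead.items():
    pue = (it_load_kw * (1 + frac)) / it_load_kw
    print(f"{name:<36} PUE ~ {pue:.2f}")

# A gap of ~0.33 in PUE on a 2 MW IT load is roughly 660 kW of avoided
# facility power at full load, before any heat-reuse credit.
```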

Water use

Air‑cooled designs often rely on evaporative cooling. That costs water, especially during hot seasons. Warm‑water liquid systems can shift to dry coolers or hybrid setups that use less water. For regions with tight water policy, this is as important as energy use.

Cost considerations

Up‑front costs include CDUs, manifolds, plates or tanks, and facility piping. However, the avoided cost of oversized air handlers, ducting, and chillers balances the ledger. Over a system’s life, lower energy and the ability to deploy denser racks often outweigh the initial premium. The exact break‑even depends on climate, utility rates, and how aggressively you plan to scale compute.
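
For the break-even question, a simple payback sketch helps frame that first conversation; every input below is an assumption to be replaced with your own capex delta, tariffs, and load profile.

```python
# Back-of-the-envelope payback estimate for a liquid-cooling retrofit.

capex_premium_usd = 1_200_000      # extra cost of CDUs, manifolds, plates, piping
it_load_kw = 2000
pue_air, pue_liquid = 1.45, 1.12   # assumed annual averages
hours_per_year = 8760
price_usd_per_kwh = 0.09
utilization = 0.7                  # average IT load factor

avoided_kw = it_load_kw * utilization * (pue_air - pue_liquid)
annual_savings_usd = avoided_kw * hours_per_year * price_usd_per_kwh
payback_years = capex_premium_usd / annual_savings_usd

print(f"avoided facility power : {avoided_kw:.0f} kW average")
print(f"annual energy savings  : ${annual_savings_usd:,.0f}")
print(f"simple payback         : {payback_years:.1f} years")
# With these assumptions the premium pays back in roughly three years;
# denser racks and avoided building expansion are not even counted here.
```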

Heat reuse: turning waste into warmth

Heat leaving a data center at 30–50°C is not waste if you can use it. Several strategies make reuse practical:

  • District heating: Export warm water to nearby homes or buildings via a heat network. A booster heat pump lifts temperatures to 60–80°C when needed. This is most common in cold climates and in campuses with cooperative planning.
  • Industrial or agricultural loads: Greenhouses, aquaculture, and low‑temperature industrial processes can use warm water with minimal extra lift. These applications can be built adjacent to the data center, cutting transmission losses.
  • On‑site space heating: Offices and warehouses tied to the facility can tap into the secondary loop during winter, directly offsetting gas or electric heating.

Liquid cooling makes heat reuse simpler because it concentrates heat at higher temperatures than room air. The piping interface is straightforward, and metering the exported energy is easy. That said, heat reuse is a project in its own right. It needs contracts, thermal storage planning, and seasonal strategies. But when it works, it adds a revenue stream or reduces local energy bills while improving sustainability metrics.
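
To gauge the boost step, a Carnot-based estimate, with an assumed fraction of ideal performance, shows roughly how much electricity it takes to lift exported heat to district-heating temperatures.

```python
# Rough heat-pump sizing for exporting data center heat to a heat
# network. Carnot COP is the theoretical ceiling; a practical machine
# reaches some fraction of it. All figures are assumptions.

source_c = 45.0        # warm water leaving the facility loop
supply_c = 70.0        # temperature the heat network wants
carnot_fraction = 0.5  # assumed fraction of ideal performance

t_hot_k = supply_c + 273.15
t_cold_k = source_c + 273.15

cop_carnot = t_hot_k / (t_hot_k - t_cold_k)
cop_real = cop_carnot * carnot_fraction

heat_exported_kw = 1000                       # thermal power delivered at 70 C
electric_input_kw = heat_exported_kw / cop_real

print(f"Carnot COP    : {cop_carnot:.1f}")
print(f"practical COP : {cop_real:.1f}")
print(f"electricity to deliver 1 MW of 70 C heat: ~{electric_input_kw:.0f} kW")
# Warmer liquid loops shrink the lift, which is one reason warm-water
# designs pair well with heat reuse.
```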

Retrofitting without breaking the room

Many operators need to fit liquid cooling into existing spaces. A staged approach keeps risk low:

  • Start with one row: Select a row with good access to mechanical rooms. Install a CDU and rear‑door heat exchangers or cold‑plate servers in a controlled pilot. Validate procedures, water treatment, and monitoring.
  • Mechanical tie‑in: Route secondary loop piping overhead if underfloor space is crowded. Use isolation valves and quick couplers to segment sections for maintenance. Label everything clearly.
  • Power and floor load check: Verify structural load for tanks or high‑density racks. Assess branch circuit capacity and busbars before moving in heavier, denser gear.
  • Operational updates: Update change control, incident response, and EOPs to include liquid handling. Train on spill kits and lockout/tagout for pumps and valves.

Rear‑door units are a popular first step for brownfield sites. They offer immediate gains with minimal server changes. As comfort grows, cold‑plate gear can be added rack by rack, using the same CDU infrastructure. Immersion tanks usually require more planning but can coexist with air‑cooled rows if the mechanical room can supply separate headers.

Monitoring and control

Liquid systems shine when you can see what is happening inside them. Instrumentation is essential:

  • Sensors on every branch: Flow meters and temperature sensors at rack inlets and outlets verify that each branch gets the right flow. Differential pressure sensors help diagnose fouling or kinks.
  • CDU telemetry: Monitor pump speeds, valve positions, filter differential pressure, and heat exchanger approach temperatures. Alerts should roll up to the data center infrastructure management (DCIM) platform.
  • Leak alarms: Map leak detection to actionable zones. A tripped sensor should resolve to a specific isolation valve to close and a specific person to page. Test alarms regularly.
  • Digital twinning: Simple hydraulic models help plan expansions and set expectations for temperature rise at different loads. They also spot where balancing valves are needed.
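
As an example of turning that telemetry into something actionable, the sketch below compares the heat implied by each branch’s flow and temperature rise against the rack’s reported IT power; the sensor values and threshold are assumptions.

```python
# Per-branch sanity check: a large mismatch between hydraulic heat
# (Q = rho * V_dot * cp * dT) and reported IT power points at bypass,
# fouling, a stuck balancing valve, or a bad sensor.

RHO, CP = 997, 4180            # water properties, illustrative

def branch_heat_kw(flow_lpm: float, t_in_c: float, t_out_c: float) -> float:
    """Heat picked up by one rack branch, in kW."""
    flow_m3s = flow_lpm / 1000 / 60
    return RHO * flow_m3s * CP * (t_out_c - t_in_c) / 1000

branches = [
    # (rack, flow L/min, inlet C, outlet C, reported IT power kW)
    ("A07", 110.0, 32.0, 44.0, 92.0),
    ("A08",  60.0, 32.0, 44.0, 90.0),   # low flow: possibly a starved branch
]

for rack, flow, t_in, t_out, it_kw in branches:
    q = branch_heat_kw(flow, t_in, t_out)
    deviation = abs(q - it_kw) / it_kw
    status = "OK" if deviation < 0.15 else "CHECK BALANCE"
    print(f"rack {rack}: hydraulic {q:.0f} kW vs IT {it_kw:.0f} kW -> {status}")
```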

Controls aim to minimize energy while maintaining safe inlet temperatures. In practice, that means coordinated pump control, CDU valve modulation, and adaptive setpoints that respond to weather forecasts and grid signals. Systems with smart valves on cold plates can even allocate extra cooling to the hottest chips during short training bursts, smoothing thermal spikes.
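
A toy version of that control pattern, with assumed gains, limits, and approach temperatures, might look like the sketch below; real CDUs run vendor-tuned loops, and this only illustrates the idea.

```python
# Toy control sketch: nudge pump speed toward a chip-inlet setpoint and
# relax the setpoint when outdoor conditions leave free-cooling headroom.

def next_pump_speed(speed_pct: float, inlet_c: float, setpoint_c: float,
                    gain: float = 4.0) -> float:
    """Simple proportional step: run faster when the inlet is above setpoint."""
    error = inlet_c - setpoint_c
    return max(30.0, min(100.0, speed_pct + gain * error))

def adaptive_setpoint(base_setpoint_c: float, outdoor_c: float) -> float:
    """Allow a slightly warmer inlet when the dry cooler has plenty of headroom."""
    headroom = base_setpoint_c - (outdoor_c + 8.0)   # 8 C dry-cooler approach, assumed
    return base_setpoint_c + min(3.0, max(0.0, headroom * 0.5))

setpoint = adaptive_setpoint(base_setpoint_c=40.0, outdoor_c=18.0)
speed = next_pump_speed(speed_pct=60.0, inlet_c=41.5, setpoint_c=setpoint)
print(f"setpoint {setpoint:.1f} C, pump speed -> {speed:.0f}%")
```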

Human factors and serviceability

The success of any cooling strategy depends on people using it. For liquid systems, service‑friendly design features reduce errors and downtime:

  • Clear labeling and color coding: Separate supply and return identifiers. Put flow arrows on manifolds. Use distinct colors per loop.
  • Reachability: Place quick disconnects and filters where a technician can use two hands and proper lighting without stretching over live gear.
  • Dry‑break connectors: Choose fittings with low spill volume. Verify that the mating force suits your service procedures; a fitting that is too stiff to seat invites misalignment.
  • Modular replacements: Design server trays or GPU boards to be swapped as modules, minimizing time spent connecting and purging lines.

For immersion, plan for handling. That includes cranes or gantries if tanks are deep, drip trays and mats, and storage for wet parts. Provide PPE and hand‑wash stations. Good ergonomics and hygiene prevent minor issues from turning into major incidents.

What comes next

Liquid cooling will keep evolving with compute. Several trends are visible already:

  • Warmer water everywhere: Vendors are raising allowable inlet temperatures to expand chiller‑less operating hours. Expect more designs comfortable with 45–50°C inlets.
  • Microchannel innovation: Additive manufacturing and advanced machining produce cold plates that target individual chiplets and memory stacks precisely, improving performance at lower flow rates.
  • Standardized interfaces: Open hardware groups are defining blind‑mate liquid connectors, manifold geometries, and telemetry schemas. That will make multi‑vendor racks easier to integrate.
  • Liquid to memory and storage: As memory bandwidth and power grow, liquid cooling will extend beyond CPUs and GPUs. Expect DIMM and HBM modules with integrated plate options.
  • Cleaner fluid chemistries: Suppliers are developing low‑GWP alternatives and recycling programs for immersion fluids. End‑of‑life plans will be part of standard procurements.
  • Heat as a product: More data centers will sell or donate heat to neighbors. Contracts and metering will mature, making it a repeatable pattern rather than a custom project.

The broader picture is simple: the compute roadmap points to higher power density and tighter performance per rack. Liquid cooling is one of the few tools that scales with that reality without making rooms unlivable or bills unsustainable.

Real‑world checklist for getting started

If you are evaluating liquid cooling for a specific site, this short checklist can speed up your first workshop:

  • Target rack density and chip inlet temperature (for the next three years, not just today)
  • Available mechanical capacity and footprint for CDUs or tanks
  • Water policy, including make‑up water and discharge limits; preferred rejection method (dry, adiabatic, tower)
  • Compatibility with your current server roadmap and vendor certifications
  • Leak detection zones, alarm routing, and isolation valve plan
  • Fluid selection criteria, including environmental and recycling commitments
  • Maintenance windows, spares, and training plans for technicians
  • Heat reuse opportunities within 1 km of the site and stakeholders to contact

Treat the first deployment as a learning platform. A well‑instrumented pilot teaches more than a dozen slide decks. Once the team sees clean installs, predictable temperature control, and lower noise, adoption tends to accelerate on its own.

Summary:

  • Air cooling struggles at AI‑era densities; liquid moves more heat with less energy
  • Direct‑to‑chip cold plates, rear‑door coils, and immersion are the main choices
  • Coolant chemistry and materials compatibility determine long‑term reliability
  • Multi‑loop designs separate clean IT water from facility and heat rejection loops
  • Reliability comes from leak prevention, detection, pressure control, and training
  • Liquid systems improve efficiency, support higher density, and reduce water use
  • Heat reuse turns a waste stream into local heating or industrial value
  • Retrofits can start with one row; rear‑door units often bridge to cold plates
  • Instrumentation and smart controls keep temperatures stable and energy low
  • Next steps include warmer inlets, microchannel plates, standards, and greener fluids
