<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[FPX AI]]></title><description><![CDATA[THE AI MARKETPLACE]]></description><link>https://research.fpx.world</link><image><url>https://substackcdn.com/image/fetch/$s_!r7LP!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64bee4f7-7b89-4b4a-899a-a67df8e6b9ce_468x468.png</url><title>FPX AI</title><link>https://research.fpx.world</link></image><generator>Substack</generator><lastBuildDate>Fri, 08 May 2026 11:32:43 GMT</lastBuildDate><atom:link href="https://research.fpx.world/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[FPX AI]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[fpxai@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[fpxai@substack.com]]></itunes:email><itunes:name><![CDATA[FPX AI]]></itunes:name></itunes:owner><itunes:author><![CDATA[FPX AI]]></itunes:author><googleplay:owner><![CDATA[fpxai@substack.com]]></googleplay:owner><googleplay:email><![CDATA[fpxai@substack.com]]></googleplay:email><googleplay:author><![CDATA[FPX AI]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Delivered Megawatts: The Ultimate Colocation Buying Guide for the AI Era]]></title><description><![CDATA[Why AI colocation is no longer a rack-leasing business and what operators, buyers, sellers, and investors need to do now.]]></description><link>https://research.fpx.world/p/delivered-megawatts-the-ultimate</link><guid isPermaLink="false">https://research.fpx.world/p/delivered-megawatts-the-ultimate</guid><dc:creator><![CDATA[FPX AI]]></dc:creator><pubDate>Tue, 05 May 2026 18:17:43 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/056a891f-0be8-4d5b-bfc8-da9aa3bc5fad_1680x720.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p style="text-align: justify;">Is this market constrained, or just misunderstood? Everyone keeps quoting megawatts, but nobody agrees on what they actually mean. A &#8220;20 MW site&#8221; can mean land with a utility conversation, power at the property line, a future substation, powered shell, or a commissioned facility that can take AI servers and keep them running. Those are not the same product. The market is not short on announcements. It is short on <strong>server-ready power</strong>: substations, transformers, switchgear, UPS, generators, cooling, network, controls, permits, commissioning, and a buyer that can actually be underwritten.</p><p style="text-align: justify;">That is why oversupply, undersupply, cancellations, and price spikes can all be true at once. They are talking about different kinds of megawatts. The next decade will not be won by the firms with the most land. It will be won by the firms with the highest probability of delivered megawatts. Capacity is no longer just sold. 
It is allocated.</p><p style="text-align: justify;">First, we get back to first principles: what a data center actually is. Then we explain why AI breaks the old colocation model. Then we separate fake capacity from real capacity. Then we look at the bottlenecks that actually matter: electrical architecture, cooling, power procurement, credit, tariffs, local politics, and behind-the-meter generation. Finally, we lay out what operators, buyers, and investors should do now. The old question was: <strong>how much space can I lease?</strong> The new question is: <strong>how many AI megawatts can I actually operate, by what date, under what power architecture, with what cooling design, in what jurisdiction, with what credit support?</strong></p><pre><code>Check out the FPX Colocation Marketplace &#8594; marketplace.fpx.world/colocation
Where we allocate the only thing that matters: megawatts that can actually run your workload.</code></pre><div><hr></div><h1>1. A data center cannot just be treated like a real estate project</h1><p style="text-align: justify;">A data center is not real estate. It is a tightly coupled machine that converts electricity into compute and heat, and the only question that matters is whether that machine works under load and under failure. The building is just a shell. The product is the system inside it.</p><p style="text-align: justify;">Start with the <strong>power path</strong>, because if this breaks, nothing else matters. Power enters at the <strong>utility interconnection</strong> or onsite generation. It flows into the <strong>substation</strong> (the electrical front door that receives high voltage power and steps it down). Then into <strong>transformers</strong> (devices that convert voltage levels; if these are not secured, the project is not real). Then into <strong>switchgear</strong> (the protection and control layer that isolates faults and routes power safely). Then into the <strong>UPS (Uninterruptible Power Supply)</strong>, which stabilizes power and bridges outages. Then into <strong>generators</strong> or backup systems that carry load during extended failures. From there, power is distributed via <strong>PDUs (Power Distribution Units)</strong>, <strong>RPPs (Remote Power Panels)</strong>, and <strong>busway (overhead power rails)</strong> into the rack, where <strong>rack PDUs</strong> and <strong>power shelves</strong> finally feed GPUs and CPUs. Every step has lead times, failure modes, and constraints. If a seller cannot walk this chain cleanly, they do not have capacity.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!WDfv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b4ce354-e108-4e83-9829-f3f97dbbe815_2800x1640.heic" width="1456" height="853" alt="" loading="lazy"></figure></div>
<p style="text-align: justify;">Now mirror that with the cooling path, because every watt of compute becomes a watt of heat. Legacy systems used <strong>CRACs and CRAHs (air cooling units)</strong>, which worked at low densities. AI breaks this. At high density, you move to <strong>rear door heat exchangers (cooling at the back of the rack)</strong> and <strong>direct to chip liquid cooling (coolant flowing directly over processors)</strong>. The core device here is the <strong>CDU (Coolant Distribution Unit)</strong>, which regulates flow, pressure, and temperature between facility water and IT cooling loops. If the site cannot specify temperatures, flow rates, CDU ownership, and failover behavior, it cannot support AI workloads. Cooling is no longer a feature. It is industrial thermal infrastructure.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!pxuC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60598191-4947-4a80-b972-bc1c6d1cd005_2800x1240.heic" width="1456" height="645" alt="" loading="lazy"></figure></div><p style="text-align: justify;">Then come the other systems most people ignore until they fail. The <strong>network layer</strong> includes carrier fiber, cloud on ramps, and internal high speed switching fabric that determines latency and throughput. The <strong>control layer</strong> includes <strong>BMS (Building Management System)</strong> and <strong>EPMS (Electrical Power Monitoring System)</strong>, which monitor and automate the plant. The <strong>life safety layer</strong> includes fire suppression, security, and operational procedures that keep the facility insurable and operational. These are not add ons. They are part of the machine.</p><p style="text-align: justify;">The key is that all of these systems must work together under stress. Redundancy is often misunderstood. &#8220;A plus B&#8221; power means nothing if one side cannot actually carry the full load during failure. &#8220;Tier III&#8221; means nothing if you cannot see the one line diagram and the failover case. Commissioning proof matters more than marketing labels.</p><p style="text-align: justify;">This is why AI changes everything. Legacy colocation could survive on vagueness because a 5 to 15 kW rack gave margin for error. AI removes that margin. Today&#8217;s racks are already around 100 to 150 kW and moving higher, which turns the data center into an industrial system, not a flexible office environment. At these densities, power quality, thermal stability, protection systems, and failure handling are no longer secondary. They are the entire product.</p>
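<p style="text-align: justify;">To make the thermal side concrete, here is a rough flow-rate estimate for a direct to chip loop. It is a minimal sketch under assumed numbers (a 120 kW rack, a water-like coolant, a 10 degree rise across the loop), not a design calculation for any specific site.</p><pre><code># Rough sizing sketch: how much coolant flow a dense rack needs.
# Assumed values for illustration only -- not a site specification.

rack_power_w = 120_000    # assumed 120 kW rack, all heat captured by the liquid loop
delta_t_k = 10.0          # assumed coolant temperature rise across the rack, in kelvin
specific_heat = 4186.0    # J/(kg*K), water-like coolant
density = 1000.0          # kg/m^3, water-like coolant

mass_flow = rack_power_w / (specific_heat * delta_t_k)    # kg/s
volume_flow_lpm = mass_flow / density * 1000.0 * 60.0     # liters per minute

print(f"mass flow: {mass_flow:.1f} kg/s")
print(f"volume flow: {volume_flow_lpm:.0f} L/min per rack")
# Roughly 2.9 kg/s, about 170 L/min for a single rack. Multiply by a full row
# and the CDU and facility water loop start to look like process plant.</code></pre>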
<p style="text-align: justify;">If you are buying AI capacity, you are not buying racks or space. You are buying a machine that must deliver power cleanly, remove heat precisely, operate continuously, and survive failure conditions. If the operator cannot explain exactly how that machine behaves under load and under failure, they are not selling you capacity. They are selling you a building.</p><div><hr></div><h1>2. Delivered megawatts are the new currency</h1><p style="text-align: justify;">The market keeps quoting megawatts like they are interchangeable.</p><p>They are not.</p><p style="text-align: justify;">Most of what gets called &#8220;capacity&#8221; today is noise. Press releases, land options, utility conversations, tax incentives, future substations. None of that runs a single GPU. The gap between what is announced and what is actually usable is where deals break, timelines slip, and buyers get burned.</p><p>There are only four types of megawatts in this market:</p><p style="text-align: justify;"><strong>Announced</strong>: marketing. Looks big. Means nothing.<br><strong>Optioned</strong>: land and early utility positioning. Still nothing you can deploy on.<br><strong>Committed</strong>: real work has started: studies, permits, some equipment strategy, maybe a tenant. Still not usable.<br><strong>Delivered</strong>: energized, cooled, commissioned, contracted, and able to take servers now.</p><p>Only the last one matters.</p><p>Everything else is pipeline risk.</p><p style="text-align: justify;">AI buyers do not need optionality. They need megawatts that can actually take racks, at density, on time. Investors should not underwrite how many megawatts exist on paper. They should underwrite the probability that those megawatts make it through power, equipment, cooling, permitting, and credit into something real. Operators who blur that line are going to get exposed fast.</p><p style="text-align: justify;">This is where the market splits. The top tier controls substations, transformers, switchgear, cooling systems, and procurement pipelines. They know where their equipment is coming from, when it lands, and how it gets commissioned. The bottom tier controls land, decks, and excuses. That bottom tier is about to get wiped out.</p><p style="text-align: justify;">The phrase &#8220;AI-ready&#8221; is already collapsing. It will only mean two things going forward: either the operator can prove the site supports the actual rack density, cooling method, electrical architecture, and failover condition on a real timeline, or they cannot. There is no middle ground.</p><p>A rendering is not a megawatt.<br>A utility conversation is not a megawatt.<br>A tax incentive is not a megawatt.<br>A future substation is not a megawatt.</p><p>A delivered megawatt is a megawatt.</p><p style="text-align: justify;">And here is the part most people are still missing: even &#8220;delivered&#8221; is not enough if it is designed for the wrong future.</p><p style="text-align: justify;">The industry keeps framing AI colocation as a cooling problem. That is outdated. The next bottleneck is electrical architecture. Today&#8217;s racks are already pushing past 100 kW and heading toward 300 kW and beyond. At megawatt scale, the legacy 54 VDC architecture breaks. The physics are not negotiable. At 1 MW, you are dealing with roughly 18.5 kA of current at 54 VDC versus about 1.25 kA at 800 VDC. Same rack, completely different system behavior.</p>
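<p style="text-align: justify;">The arithmetic behind those numbers is nothing more exotic than I = P / V. A quick check of the figures quoted above:</p><pre><code># Current needed to deliver 1 MW at two distribution voltages: I = P / V.
power_w = 1_000_000

for voltage in (54, 800):
    current_ka = power_w / voltage / 1000
    print(f"{voltage} VDC: {current_ka:.2f} kA")

# 54 VDC:  ~18.52 kA
# 800 VDC:  ~1.25 kA
# The same megawatt, roughly 15x less current at the higher voltage.</code></pre>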
<p style="text-align: justify;">That is why the shift to higher voltage distribution is not optional. It is the only scalable path. Without it, you end up stuffing racks with power shelves, copper, and complexity just to keep the system alive. That does not scale economically or physically. The next generation of real capacity will be built around centralized conversion and high voltage distribution, not incremental patches on legacy designs.</p><p style="text-align: justify;">Cooling is scaling in parallel. At megawatt levels, heat rejection becomes a fluid dynamics problem, not a facilities checkbox. You are talking about massive flow rates, CDU scaling, and thermal systems that behave more like industrial plants than data halls. At the same time, protection becomes critical. Higher density means higher fault energy, higher arc flash risk, and tighter operating margins. One failure can take out entire rows of equipment.</p><p style="text-align: justify;">This is why most &#8220;capacity&#8221; today is mispriced. It is being valued as if it can support the next generation of hardware when it cannot. A site designed for current density without a path to future electrical architecture is already partially obsolete.</p><p style="text-align: justify;">FPX view: the market is not just filtering fake megawatts from real ones. It is about to filter current-ready megawatts from future-proof megawatts. The winners will not be the ones who can announce capacity. They will be the ones who can deliver it at the densities the next hardware cycle demands, safely, efficiently, and on time.</p><div><hr></div><h1>3. The &#8220;50% canceled&#8221; headline is the wrong fight</h1><p style="text-align: justify;">The market is now arguing about whether half of U.S. data center capacity is being delayed or canceled.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!Z31t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6ea85de-51db-4abe-b120-ed1d80fb2188_1186x492.png" width="1186" height="492" alt="" loading="lazy"></figure></div><p>That is the wrong argument.</p><p>The market is not canceling half of AI data center capacity.</p><p>It is canceling the illusion that every announced megawatt was real.</p>
<p style="text-align: justify;">Yes, projects are being delayed. Some are being killed. Tax incentives are being pulled back. Local politics are getting harder. Utility tariffs are tightening. Power equipment is still constrained. And in FPX&#8217;s Marketplace, pricing for U.S. sites has risen roughly 15% on average.</p><p>But that is not a demand-collapse story.</p><p>It is a sorting cycle.</p><p style="text-align: justify;">The weak pipeline is getting exposed: land-banked campuses, speculative press-release megawatts, undercapitalized developers, sites without a real power path, and buyers without the credit to secure capacity.</p><p style="text-align: justify;">The real pipeline is not disappearing. It is being repriced, delayed, preleased, or allocated to stronger buyers.</p><p style="text-align: justify;">That is the part the headline misses. In this market, capacity does not simply go to whoever wants it. It goes to whoever can make the provider, utility, lender, and operator believe the load is real.</p><p>That favors hyperscalers and large enterprises.</p><p>It hurts neoclouds, smaller enterprises, and speculative AI infrastructure buyers.</p><p style="text-align: justify;">They may have demand. They may even be willing to pay. But if they cannot support the credit package, infrastructure commitment, ramp schedule, and utilization story, they get pushed out.</p><p>So yes, more projects will be canceled.</p><p>But no, half of real AI capacity is not going away.</p><p>The better read is this:</p><p><strong>Announced capacity is getting cut. Bankable capacity is getting scarce.</strong></p><p style="text-align: justify;">That is why waiting for the &#8220;shakeout&#8221; is dangerous. More supply should arrive into 2027, but the best of it will not show up as clean, open inventory. It will already be spoken for, repriced, or reserved for buyers with stronger credit and earlier commitments.</p><p style="text-align: justify;">FPX view: the market is not oversupplied. It is finally learning the difference between a rendering and a delivered megawatt.</p><div><hr></div><h1>4. The next bottleneck is electrical architecture</h1><p>The industry keeps saying AI colocation is a cooling problem.</p><p>That is outdated.</p><p style="text-align: justify;">Cooling matters, but cooling is no longer the limiting factor. The real constraint now is electrical architecture, and most of the market is not prepared for what that actually means.</p><p style="text-align: justify;">AI racks are scaling faster than the infrastructure underneath them. A 100 kW rack is already a different system. A 300 kW rack is not an extension, it is a break. A 1 MW rack does not fit inside legacy assumptions at all. At that level, you are not just delivering power, you are managing current at a scale that makes traditional designs inefficient, expensive, and in some cases physically impractical.</p><p>This is where physics shows up.</p><p style="text-align: justify;">At low voltage, current explodes. A megawatt rack at legacy voltage levels implies massive current, massive copper, massive losses, and massive heat inside the power delivery system itself. That is why the industry has been stuffing racks with power shelves, busbar, and workarounds. But that approach does not scale. It gets heavier, hotter, more complex, and more failure-prone with every step up in density.</p><p>The only real solution is to change the architecture.</p>
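<p style="text-align: justify;">A rough illustration of the stakes, assuming the same conductor carries the same 1 MW at both voltage levels (a simplification for illustration, not a distribution design):</p><pre><code># Resistive loss scales with the square of current: P_loss = I^2 * R.
# Same assumed conductor resistance at both voltages -- illustrative only.

power_w = 1_000_000
resistance_ohm = 0.0001   # assumed distribution path resistance

for voltage in (54, 800):
    current_a = power_w / voltage
    loss_kw = current_a ** 2 * resistance_ohm / 1000
    print(f"{voltage} VDC: {current_a:,.0f} A, {loss_kw:.1f} kW lost in the same conductor")

# 54 VDC:  ~18,519 A, ~34.3 kW of loss
# 800 VDC:  ~1,250 A,  ~0.2 kW of loss
# Raising voltage ~15x cuts resistive loss by ~220x for the same copper, which is
# why the alternative is ever more copper, power shelves, and heat inside the rack.</code></pre>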
<p style="text-align: justify;">Higher voltage distribution fixes the problem at the root. Increase voltage, current drops. When current drops, everything else gets easier. Less copper. Lower losses. Less heat. Cleaner routing. More scalable delivery. That is why the shift toward high voltage distribution is inevitable, not optional. This is not a design preference. It is a constraint imposed by physics.</p><p>And this is where most &#8220;AI-ready&#8221; capacity breaks.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!Hlzt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F699cee58-d644-4f0c-afe8-72e3821d53d8_2800x1640.png" width="1456" height="853" alt="" loading="lazy"></figure></div><p style="text-align: justify;">A facility designed around today&#8217;s rack density without a clear path to higher voltage distribution is already capped. It may run current deployments. It may look fine on a spec sheet. But it will struggle or fail to support the next generation of hardware without major rework. That is not a theoretical risk. That is already in the design cycle.</p><p style="text-align: justify;">At the same time, cooling is scaling into an industrial problem. At megawatt density, heat rejection is no longer about airflow optimization. It is about fluid systems operating at high flow rates, high pressure, and tight tolerances. CDU capacity, facility water loops, and heat rejection systems start to look like process engineering, not facility management. And as density rises, protection becomes critical. Higher power means higher fault energy. Arc flash risk increases. A single failure event can destroy entire rows of equipment. This is not just harder engineering. It is a different risk profile.</p><p>Now layer in the part the market is still getting wrong.</p><p>AI colocation is not one business. It is splitting into two.</p><p style="text-align: justify;">Training and inference are diverging, and that divergence directly impacts how electrical and thermal infrastructure should be built.</p><p style="text-align: justify;">Training is power first. It wants the largest, densest, most expandable megawatt blocks available. It will move to where power exists, even if that means remote campuses, hybrid grid and onsite generation, and massive internal power and cooling plants. Electrical architecture is the core constraint here. If the site cannot scale power delivery efficiently, it is not a serious training site.</p><p style="text-align: justify;">Inference is latency first. It wants proximity to users, dense fiber, cloud adjacency, and predictable network performance. 
It is distributed, regional, and network-constrained rather than purely power-constrained. But it still requires high density and often liquid cooling. The difference is not air versus liquid. The difference is economics and placement.</p><p>Same AI umbrella. Completely different infrastructure problem.</p><p>This is where the market is about to make a major mistake.</p><p style="text-align: justify;">It is still underwriting sites as if one &#8220;AI-ready&#8221; design can serve both. It cannot. Training sites need power architecture that scales into megawatt racks, centralized conversion, and massive thermal systems. Inference sites need network-dense metros with enough power density to run efficiently, but optimized for latency and distribution.</p><p>If you use the same framework for both, you are not doing analysis. You are doing marketing.</p><p>This is the real implication of the electrical bottleneck.</p><p style="text-align: justify;">It is not just about voltage levels or copper. It is about whether a site can evolve with the hardware and whether it is even built for the right workload category.</p><p style="text-align: justify;">Operators need to design for the next two hardware cycles, not just the current tenant. Buyers need to ask whether the site can migrate into future power architectures without breaking. Investors need to stop underwriting demand in aggregate and start underwriting whether specific assets can stay relevant as density increases.</p><p>Most sites cannot.</p><p>That is the gap.</p><p>And that gap is where the next generation of winners will come from.</p><div><hr></div><h1>6. AI colocation is splitting into two different markets</h1><p>The phrase &#8220;AI data center&#8221; is becoming too lazy to be useful.</p><p style="text-align: justify;">Training and inference are not the same business. They do not want the same sites, the same power profile, the same cooling design, the same network, or the same underwriting model. 
Treating them as one category is how buyers pick the wrong site, operators build the wrong product, and investors misprice the asset.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!Ejcr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff514f7fc-c69b-4a32-a73f-aecb5e4ef1f4_2800x1640.png" width="1456" height="853" alt="" loading="lazy"></figure></div><p style="text-align: justify;">Training is the brutal one. It is power first, density first, liquid cooled, campus scale infrastructure. Training wants huge contiguous megawatt blocks, cheap and expandable energy, deep internal networking, high voltage power delivery, massive thermal plants, and a site that can absorb future rack densities without falling apart. It is less sensitive to being close to the end user. If the power is real, the fiber is good enough, the operator is credible, and the county will allow it, training can move.</p><p style="text-align: justify;">Inference is different. Inference is not one thing either. Even inside inference, the workload is splitting between <strong>prefill</strong> and <strong>decode</strong>. Prefill is the front end of the model response: processing the prompt, context, documents, tools, and memory. It is compute heavy and can often tolerate more distance. Decode is the token generation loop: the sequential output users actually wait on. It is latency sensitive, concurrency sensitive, and tied much more directly to user experience. That means inference infrastructure will fragment. Some of it will look like smaller training style clusters. Some of it will live in regional metros. Some of it will be optimized around fiber depth, cloud adjacency, jitter, and fast deployment rather than maximum campus scale.</p><p style="text-align: justify;">That is the real estate implication. Training sites look like industrial power projects. Inference sites look like distributed digital infrastructure. Training is about delivered megawatts in one place. Inference is about putting enough compute in the right places.</p><p style="text-align: justify;">Cooling is not a standalone category anymore. It is a function of workload type. 
Serious training workloads are moving toward dense liquid cooled environments because the rack densities force it, but &#8220;liquid cooled&#8221; is already becoming as vague as &#8220;AI ready.&#8221; It can mean rear door heat exchangers, direct to chip cooling, facility water to the hall but not the rack, customer owned CDUs, operator owned CDUs, or no clear ownership at all. The question is not whether the site is &#8220;liquid ready.&#8221; The question is whether it can support the exact cooling topology required by the hardware: supply and return temperatures, flow rates, pressure, CDU capacity, coolant quality, maintenance responsibility, leak response, commissioning schedule, and the failover condition. A plus B is not redundancy if either side cannot carry the load when the other side fails.</p><p style="text-align: justify;">This is where a lot of AI colo disappointment will show up. Buyers will sign for &#8220;liquid cooled capacity&#8221; and later discover that the facility water loop, CDU boundary, thermal envelope, or operating liability does not support the deployment they thought they bought. The next generation of disputes will not be about square footage. It will be about cooling scope, rack density, delay liability, and who is responsible when the system fails commissioning. For training, cooling is industrial thermal infrastructure. For inference, the picture is more fragmented: some deployments will use lighter liquid cooling, some will adapt to air cooled or hybrid environments, and some inference specific accelerators will make existing colo more usable. Either way, the cooling design has to match the workload, not the marketing deck.</p><p style="text-align: justify;">Inference will be more disruptive because it has to fit into the real world faster. Not every inference deployment can wait for a custom 100 MW campus. A lot of it will need to adapt to available colo, metro locations, air cooled or lighter liquid cooled designs, and inference specific accelerators that are built to improve performance per watt, latency, and deployment flexibility. Groq style architectures, decode optimized chips, and other inference specific systems matter because they change the site selection problem. They make more existing colo usable. They make smaller footprints more valuable. They turn inference into a speed and placement game, not just a raw megawatt game.</p><p>That is the market split.</p><p style="text-align: justify;">For training, you are often betting on future projects. If you need a 50 MW, 100 MW, or 110 MW plus campus, you are not really buying today&#8217;s inventory. You are underwriting development risk. You are betting on the operator, the utility, the equipment queue, the interconnection timeline, the cooling design, the generator strategy, the financing, the county, the tariff, the permits, and the local politics. The lease is only one piece of the risk.</p><p style="text-align: justify;">For inference, the opportunity is different. The winners will be the operators who can turn existing or near term capacity into deployable inference capacity quickly. 
That means enough density, good network, clean operations, credible power, and a design that matches the actual workload instead of pretending every AI deployment needs the same industrial campus.</p><p style="text-align: justify;">This is the next major underwriting mistake in the market: using one &#8220;AI ready&#8221; framework for everything.</p><p>Training wants power scale.<br>Inference wants placement.<br>Training wants dense liquid cooled campuses.<br>Inference wants latency, availability, and hardware flexibility.<br>Training is a future project bet.<br>Inference is a deployment speed bet.</p><p>Both are valuable. They are just not the same asset.</p><p style="text-align: justify;">FPX view: the market is going to stop valuing &#8220;AI capacity&#8221; as one generic category. It will split between training capacity and inference capacity, and the diligence will look completely different. If a seller says a site can do both, the next question is simple: for which hardware, at what density, under what cooling design, with what network profile, by what date?</p><p>If they cannot answer that, they are not selling AI capacity.</p><p>They are selling a slogan.</p><div><hr></div><h1>7. The county matters as much as the cabinet</h1><p>This is where committing to larger sites, especially for training workloads, gets dangerous.</p><p style="text-align: justify;">If you are taking 2 MW of inference capacity in an existing facility, you are diligencing an operating asset. If you are taking 10 MW, 50 MW, or 100 MW plus for training, you are often underwriting a project that does not fully exist yet. That means you are not just betting on power and cooling. You are betting on the operator, the utility, the permitting path, the substation schedule, the equipment queue, the financing, the tariff, the water strategy, the generator plan, and the county.</p><p>The county is not background noise.</p><p>The county is part of the capacity stack.</p><p style="text-align: justify;">For years, data center site selection was mostly a power, fiber, land, tax, and latency exercise. Find cheap power. Find enough land. Find fiber. Get incentives. Build. That world is gone. AI campuses are too large, too visible, too power hungry, and too politically sensitive to slip quietly through local approval.</p><p style="text-align: justify;">A 100 MW campus is not a real estate project to the community. It is a power project. It is a water project. It is a noise project. It is a tax project. It is a road project. It is a diesel generator project. It is a transmission project. It is a question every resident understands: what do we give up, what do we get back, and who pays when the grid needs upgrading?</p><p>That is why the best site on paper can still die in county politics.</p><p>A cheap site in a hostile county is not cheap.</p><p>A fast utility path in a county that will fight every permit is not fast.</p><p>A tax incentive that becomes a political target is not bankable.</p><p>A power allocation that depends on ratepayers absorbing grid costs is not secure.</p><p style="text-align: justify;">A project that has not explained noise, water, generators, transmission, jobs, and local benefit is not de-risked. It is exposed.</p><p style="text-align: justify;">This matters most for larger training workloads because those buyers are often committing before the site is fully delivered. The bigger the block, the more you are buying future execution rather than present inventory. 
At 10 MW, you can sometimes still find capacity inside a known operating environment. At 50 MW, you are usually into expansion risk. At 100 MW plus, you are underwriting a development thesis.</p><p>That changes the diligence.</p><p>You are no longer asking, &#8220;Does this provider have a hall?&#8221;</p><p>You are asking, &#8220;Will this entire project come into existence?&#8221;</p><p style="text-align: justify;">That means the county has to be evaluated like a counterparty. Not formally, but practically. Is the local government supportive? Have similar projects been approved? Are residents already organizing against data centers? Is water politically sensitive? Are diesel generators controversial? Are there noise setbacks? Is the tax abatement vulnerable? Will transmission upgrades trigger opposition? Are utility bills rising? Does the project create enough local benefit to survive scrutiny?</p><p>Most buyers do not ask these questions early enough.</p><p>They ask after the deal is signed.</p><p>That is too late.</p><p style="text-align: justify;">For operators, the lesson is simple: community strategy is now development strategy. You cannot treat it as PR. Noise modeling, water sourcing, generator permitting, visual screening, road impact, emergency response, grid cost allocation, tax contribution, and permanent local benefit need to be built into the project before the opposition forms. Once the local narrative becomes &#8220;data center extracts power and gives nothing back,&#8221; the project is already wounded.</p><p style="text-align: justify;">For buyers, the lesson is even sharper: do not confuse a provider&#8217;s confidence with project certainty. Every developer is confident before a permit hearing. You need evidence. Show me the zoning path. Show me the county record. Show me the utility position. Show me the generator permitting plan. Show me the water plan. Show me the community engagement. Show me what happens if the incentive changes. Show me the delay remedies if local approval slips.</p><p style="text-align: justify;">For investors, this is now core underwriting. A county that welcomes the project deserves a lower risk premium. A county that is politically fragile deserves a higher one. A project with secured local alignment is worth more than a project with better land economics and a weaker approval path.</p><p>The market keeps acting like the hardest part is finding the megawatts.</p><p>That is only half true.</p><p>The harder part is making sure the megawatts survive contact with the real world.</p><p style="text-align: justify;">FPX view: for large training deployments, the county matters more than the cabinet because the cabinet is replaceable and the county is not. A rack vendor can slip and recover. A transformer can be resequenced. A cooling design can be modified. But if the county turns against the project, your 100 MW commitment can become a press release, a lawsuit, or a two year delay.</p><p>At small scale, you buy capacity.</p><p>At large scale, you underwrite permission.</p><div><hr></div><h1>8. Water is both over-politicized and under-analyzed</h1><p style="text-align: justify;">Water is now one of the easiest ways to kill a data center project politically, and one of the easiest issues to misunderstand. The headlines sound terrifying because raw gallon numbers are large. 
But large numbers without context do not tell you whether a site is reckless, efficient, reusable, dry cooled, wet cooled, drawing potable water, using reclaimed wastewater, operating in a water rich region, or building in a constrained basin.</p><p style="text-align: justify;">That is the point <a href="https://newsletter.semianalysis.com/p/from-tokens-to-burgers-a-water-footprint">SemiAnalysis made with the xAI Colossus 2 comparison</a>. Its estimate put the campus at roughly 346 million gallons of blue water per year versus roughly 147 million gallons for an average In N Out store. Whether you like the framing or not, the lesson is useful: water numbers are easy to weaponize when nobody explains what they mean.</p><p style="text-align: justify;">The real question is not, &#8220;does the data center use water?&#8221; Of course it does. The real question is: <strong>what water, from where, under what cooling design, in what climate, with what reuse strategy, and with what local impact?</strong></p><p>A dry cooled site is not the same as an evaporative cooled site.<br>A site using reclaimed wastewater is not the same as a site pulling potable water.<br>A site in a water rich region is not the same as a site in a stressed basin.<br>A low density enterprise facility is not the same as a dense AI training campus.<br>A cooling design built for inference is not the same as one built for training.</p><p style="text-align: justify;">This is where operators lose the room. They either dismiss water concerns as emotional, or they show up with engineering language nobody in the county meeting understands. Both are mistakes. Once &#8220;data center water use&#8221; becomes the headline, the project is already on defense.</p><p style="text-align: justify;">The operators that win will make water boring early. They will explain the source, the expected use, the consumptive portion, the cooling method, the seasonal profile, the reuse plan, and the local tradeoff before opponents define it for them. They will show why the scary headline number is not the full story, but they will not pretend the issue is fake.</p><p style="text-align: justify;">FPX view: water is often over politicized, but it is not fake risk. The best operators will not be the ones who use the least water in every case. They will be the ones who can design intelligently, quantify clearly, use better sources, communicate with constituents, and earn enough trust that water does not become the reason the project dies. A permit is not permission. Local trust is part of the capacity stack.</p><div><hr></div><h1>9. Incentives, tariffs, and utilities are part of the site</h1><p style="text-align: justify;">Tax incentives helped build the U.S. data center map. Now they are becoming less reliable.</p><p style="text-align: justify;">The old bargain was simple: bring a large capital project, get tax relief, create some jobs, expand the local tax base, and everyone moves on. AI changes that bargain. A 100 MW data center is not a quiet commercial project. It is a visible load on the grid, a political issue, and in some counties, a ratepayer fight waiting to happen.</p><p style="text-align: justify;">The question local governments are asking is blunt: if a data center takes huge power, receives tax breaks, creates relatively few permanent jobs, and requires grid upgrades, what exactly does the community get back?</p><p>That question is not going away.</p><p style="text-align: justify;">Some incentives will survive. Some will be rewritten. 
Some will come with energy efficiency rules, job requirements, local investment commitments, grid cost protections, or clawbacks. Some will disappear completely. NCSL says 38 states currently offer dedicated data center tax incentives, but states are already reassessing them, adding conditions, or considering repeal. Maine is the clean example: the state rejected a broader moratorium, but still moved to restrict certain data center projects from business development incentive programs.</p><p>The underwriting lesson is simple: <strong>treat incentives as upside, not base case</strong>.</p><p style="text-align: justify;">If the project only works because a tax exemption survives untouched for ten years, the project is not robust. It is politically fragile.</p><p style="text-align: justify;">The same is true for power tariffs. A <strong>large load tariff</strong> is the utility pricing and contract structure for very large customers like data centers. It can include upfront study payments, minimum contract terms, load ramp schedules, exit fees, security deposits, financial guarantees, and obligations to pay for grid upgrades. These are not footnotes. They decide whether the site actually works. Utility Dive reports 77 large load tariffs pending or in place across 36 states, with 29 approved in 2025 alone, compared with just 14 approvals from 2018 through 2024.</p><p>That means the tariff is part of the site.</p><p>The interconnection agreement is part of the site.</p><p>The utility&#8217;s posture is part of the site.</p><p>The regulator is part of the site.</p><p style="text-align: justify;">A cheap headline power price means nothing if the tariff forces a rigid ramp, punishes load volatility, requires expensive upgrades, or creates stranded cost exposure. A buyer can win a low $/kWh and still lose the project.</p><p style="text-align: justify;">This is why state selection is no longer just &#8220;cheap power good, expensive power bad.&#8221; Every market has a different constraint set. Texas has land, gas, fiber, and speed, but ERCOT congestion and transmission risk are real. Virginia has the deepest data center ecosystem in the world, but also rising load pressure, local opposition, and ratepayer scrutiny. The Midwest and interior markets are becoming more attractive because they can offer larger power blocks and more room to build, but those markets still need to be underwritten county by county, utility by utility, tariff by tariff.</p><p>The winners will not just pick the cheapest state.</p><p style="text-align: justify;">They will pick the jurisdictions with the highest probability of delivered AI megawatts after accounting for interconnection, tariff structure, incentive durability, gas access, water, fiber, labor, zoning, and politics.</p><p style="text-align: justify;">For operators, stop treating tax treatment and tariffs as external details. They are part of the product.</p><p>For buyers, diligence the tariff before you fall in love with the power price.</p><p style="text-align: justify;">For investors, underwrite this like infrastructure. A site with a higher headline cost but a cleaner tariff, stronger utility relationship, and more durable political support may be worth more than a cheaper site that can collapse in approval, regulation, or rate design.</p><p style="text-align: justify;">FPX view: incentives are no longer free money, and tariffs are no longer background paperwork. They are core capacity risk. 
If you do not understand the tax, tariff, utility, and regulatory stack, you do not understand the site.</p><div><hr></div><h1>10. Behind the meter is not magic. It is a power plant decision.</h1><p>Behind the meter power is one of the most important trends in AI infrastructure.</p><p>It is also one of the easiest to oversell.</p><p style="text-align: justify;"><strong>Behind the meter</strong>, or BTM, means some or all generation sits on the customer side of the utility meter. That can mean gas turbines, gas engines, fuel cells, batteries, or a hybrid microgrid. It can be used as bridge power while waiting for grid interconnection, supplemental power to reduce grid dependence, prime power for the data center, or islanded power where the site can operate largely on its own.</p><p style="text-align: justify;">The appeal is obvious: grid interconnection is slow, AI demand is immediate, and waiting three to seven years for utility upgrades can kill the business case. BTM can move faster. That is why bring your own power and onsite generation are becoming normal in large scale planning.</p><p>But faster does not mean easier.</p><p>Grid connected power is simpler. BTM power is faster and harder.</p><p style="text-align: justify;">A grid connected site still needs transformers, switchgear, UPS, generators, PDUs, cooling, and controls. A BTM site needs all of that plus fuel supply, generation equipment, step up transformers where needed, microgrid controls, protection studies, emissions systems, air permits, acoustic treatment, operations staff, maintenance contracts, and a plant management model that looks more like a utility asset than a colo hall.</p><p style="text-align: justify;">That is the part the market underestimates. BTM is not a shortcut around complexity. It changes the complexity.</p><p style="text-align: justify;">The critical path can move from utility interconnection to gas supply. Or turbine procurement. Or air permitting. Or emissions controls. Or noise. Or local opposition. Or the question of who actually operates the plant at 3 a.m. when the data center is under load and a unit trips.</p><p>A buyer or investor should ask the same questions every time:</p><p style="text-align: justify;">Where is the fuel coming from?<br>Is the gas supply firm or interruptible?<br>Who operates the plant?<br>What emissions permits are required?<br>Is the site in a nonattainment area?<br>What noise limits apply?<br>What happens when the grid connection arrives?<br>Is this bridge power, backup power, prime power, or permanent islanded power?<br>What is the real cost per delivered AI megawatt after fuel, maintenance, redundancy, emissions control, and capex?</p><p style="text-align: justify;">That last question matters most. BTM can look attractive when compared with waiting for the grid. It can look much less attractive when fully loaded with fuel, equipment, maintenance, redundancy, permitting, emissions treatment, and operational risk.</p><p>BTM is not cheap power.</p><p>BTM is speed, control, and complexity.</p><p style="text-align: justify;">Use it when the cost of waiting for grid power is greater than the cost and risk of operating your own power system. Do not use it because it sounds sophisticated in a fundraising deck.</p><p style="text-align: justify;">For operators, a credible BTM strategy can be a massive advantage. It can turn stranded land into deliverable capacity and create a bridge to permanent grid service. 
But only if the operator can actually execute the power plant layer.</p><p style="text-align: justify;">For buyers, BTM diligence has to go beyond &#8220;is there power?&#8221; The real question is whether that power is permitted, fueled, protected, dispatchable, financeable, and operable through failures.</p><p style="text-align: justify;">For investors, BTM should change the underwriting model. You are no longer backing only a data center. You are backing a data center plus an energy asset. That can be valuable, but it deserves a different risk premium.</p><p style="text-align: justify;">FPX view: behind the meter is not a slogan. It is an operating model. The market will reward serious BTM strategies and punish superficial ones. The operators who can combine grid strategy, onsite generation, fuel, controls, permitting, and data center operations will move faster than the grid. Everyone else is just adding another failure point.</p><div><hr></div><h1>11. Credit is now part of the product</h1><p style="text-align: justify;">This is the most underpriced shift in AI colocation: price no longer wins by itself. Providers are not just asking, <strong>will this customer pay rent?</strong> They are asking, <strong>will this customer strand my megawatts?</strong></p><p style="text-align: justify;">Stranded capacity is what happens when a provider commits scarce power, shell, liquid cooling, equipment, engineering time, utility queue position, and lender capacity to a buyer that fails to ramp. GPUs slip. Customers slip. Utilization slips. The hall sits half empty. The provider loses time, lender confidence, and the chance to allocate those megawatts to a stronger counterparty.</p><p style="text-align: justify;">That is why weak credit can pay more and still lose. A hyperscaler, large enterprise, or sponsor backed neocloud with a credible ramp will beat a higher priced buyer that cannot prove durability. In this market, capacity is allocated to whoever can carry it, not whoever wants it most.</p><p>The winning buyer now needs two packages:</p><p style="text-align: justify;"><strong>Credit package:</strong> financials, sponsor support, guarantees, deposits, letters of credit, insurance, customer contracts, offtake, prepayment, and balance sheet clarity.</p><p style="text-align: justify;"><strong>Execution package:</strong> GPU delivery, energization date, cooling commissioning, network turn up, customer ramp, utilization plan, staffing, and operating readiness.</p><p style="text-align: justify;">This is also where FPX adds leverage. The market is not one pool of capacity. Operators have different risk appetites. Some only want hyperscalers or investment grade enterprise credit. Some will work with sponsor backed neoclouds if the deposit, guarantee, offtake, or prepayment structure is right. Some will take more execution risk if the workload, timeline, and site economics fit.</p><p style="text-align: justify;">The job is not just to find capacity. It is to match the buyer&#8217;s workload, credit profile, ramp schedule, cooling needs, and risk tolerance to the operators most likely to actually accept the deal. A 2 MW inference deployment, a 20 MW neocloud ramp, and a 100 MW training campus should not be shopped the same way.</p><p style="text-align: justify;">FPX view: credit is now a feature of capacity. The right site is not the cheapest $/kW.
It is the site where technical design, commercial structure, operator risk appetite, and delivery probability all line up.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nCjX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F318521a4-d2cc-4f07-be96-cc68fac2eaf3_2800x1640.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nCjX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F318521a4-d2cc-4f07-be96-cc68fac2eaf3_2800x1640.png 424w, https://substackcdn.com/image/fetch/$s_!nCjX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F318521a4-d2cc-4f07-be96-cc68fac2eaf3_2800x1640.png 848w, https://substackcdn.com/image/fetch/$s_!nCjX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F318521a4-d2cc-4f07-be96-cc68fac2eaf3_2800x1640.png 1272w, https://substackcdn.com/image/fetch/$s_!nCjX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F318521a4-d2cc-4f07-be96-cc68fac2eaf3_2800x1640.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nCjX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F318521a4-d2cc-4f07-be96-cc68fac2eaf3_2800x1640.png" width="1456" height="853" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/318521a4-d2cc-4f07-be96-cc68fac2eaf3_2800x1640.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:853,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:304220,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://research.fpx.world/i/196350204?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F318521a4-d2cc-4f07-be96-cc68fac2eaf3_2800x1640.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nCjX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F318521a4-d2cc-4f07-be96-cc68fac2eaf3_2800x1640.png 424w, https://substackcdn.com/image/fetch/$s_!nCjX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F318521a4-d2cc-4f07-be96-cc68fac2eaf3_2800x1640.png 848w, https://substackcdn.com/image/fetch/$s_!nCjX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F318521a4-d2cc-4f07-be96-cc68fac2eaf3_2800x1640.png 1272w, https://substackcdn.com/image/fetch/$s_!nCjX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F318521a4-d2cc-4f07-be96-cc68fac2eaf3_2800x1640.png 1456w" sizes="100vw" loading="lazy"></picture><div 
class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h1>12. What the market should do now</h1><p>The playbook is simple: stop selling <strong>AI-ready</strong> and start proving <strong>AI-deliverable</strong>.</p><p style="text-align: justify;">For builders and operators, the product is no longer the hall. It is certainty. Show the one-line. Show the utility position. Show the substation path. Show transformer and switchgear status. Show generator strategy. Show the cooling topology. Show who owns the CDU. Show the failover case. Show the commissioning plan. Show the delay remedies. 
Show what density works today, what density works after upgrades, and what density this site will never support.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TMMF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa740b600-9a47-4d78-9ac6-853ccd246f57_2800x1640.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TMMF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa740b600-9a47-4d78-9ac6-853ccd246f57_2800x1640.png 424w, https://substackcdn.com/image/fetch/$s_!TMMF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa740b600-9a47-4d78-9ac6-853ccd246f57_2800x1640.png 848w, https://substackcdn.com/image/fetch/$s_!TMMF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa740b600-9a47-4d78-9ac6-853ccd246f57_2800x1640.png 1272w, https://substackcdn.com/image/fetch/$s_!TMMF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa740b600-9a47-4d78-9ac6-853ccd246f57_2800x1640.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TMMF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa740b600-9a47-4d78-9ac6-853ccd246f57_2800x1640.png" width="1456" height="853" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a740b600-9a47-4d78-9ac6-853ccd246f57_2800x1640.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:853,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:435414,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://research.fpx.world/i/196350204?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa740b600-9a47-4d78-9ac6-853ccd246f57_2800x1640.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TMMF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa740b600-9a47-4d78-9ac6-853ccd246f57_2800x1640.png 424w, https://substackcdn.com/image/fetch/$s_!TMMF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa740b600-9a47-4d78-9ac6-853ccd246f57_2800x1640.png 848w, https://substackcdn.com/image/fetch/$s_!TMMF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa740b600-9a47-4d78-9ac6-853ccd246f57_2800x1640.png 1272w, https://substackcdn.com/image/fetch/$s_!TMMF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa740b600-9a47-4d78-9ac6-853ccd246f57_2800x1640.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div 
class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p style="text-align: justify;"></p><p style="text-align: justify;">The operator that wins is not the one with the biggest headline MW number. It is the one that turns ambiguity into an underwritable product. Power, cooling, equipment, tariffs, permits, local politics, and tenant credit all need to be packaged as one answer: <strong>this capacity will exist, this is when it will exist, this is what it can run, and this is what happens if something slips.</strong></p><p style="text-align: justify;">For buyers, the rule is harsher: stop shopping for racks. Buy probability of delivered AI megawatts. The first question is not &#8220;what is the price per kW?&#8221; The first question is: <strong>can this site actually run my workload, at my density, on my timeline, with my cooling method, under my tariff, in this jurisdiction?</strong> If the provider cannot answer that clearly, the price does not matter.</p><p style="text-align: justify;">That means buyers need to diligence the site like infrastructure. Where is the power coming from? Is the utility commitment real? What equipment is ordered? What permits remain? What tariff applies? What happens if one side fails? Who owns the liquid cooling boundary? What network paths exist? What local opposition exists? What expansion rights exist? Can the site support the next hardware cycle, or only the current one?</p><p style="text-align: justify;">For neoclouds and fast scaling AI companies, demand alone is not enough. Bring a credit package and an execution package. Bring deposits, guarantees, offtake, customer contracts, sponsor support, GPU delivery schedules, network plans, cooling requirements, and a credible ramp. In this market, credibility travels faster than price. Weak buyers pay more, wait longer, get worse terms, and still lose.</p><p style="text-align: justify;">For investors, stop underwriting data centers like shells. Underwrite them like infrastructure assets. A deliverable substation, a clean tariff, a supportive county, secured long lead equipment, credible cooling, and a financeable tenant are worth more than cheap land in a famous market. 
The premium belongs to assets with high conversion probability from announced megawatts to delivered megawatts.</p><p style="text-align: justify;">The market is tight now and the best 2027 capacity will not arrive as clean open inventory. It will be preleased, repriced, reserved, or allocated to buyers with stronger credit and earlier commitments. Waiting for the market to &#8220;loosen&#8221; may feel conservative. It may actually be how you end up with the leftovers.</p><p style="text-align: justify;"><a href="https://marketplace.fpx.world/colocation">FPX view</a>: the winners will be the groups that move from marketing to proof. Operators need to prove delivery. Buyers need to prove credit and execution. Investors need to prove the asset can survive power, permitting, equipment, politics, and tenant risk. Everything else is noise.</p><div><hr></div><p style="text-align: justify;">For teams that do not want to waste months chasing the wrong capacity, FPX can do this work upfront. Through the <a href="https://marketplace.fpx.world/colocation">FPX Colocation Marketplace</a>, we source and filter sites based on the factors that actually decide whether a deployment works: workload type, size, density, timeline, cooling requirements, power path, utility position, tariff exposure, legislation, local politics, operator risk appetite, and credit fit.</p><p style="text-align: justify;">A 2 MW inference deployment, a 20 MW neocloud ramp, and a 100 MW training campus should not be sourced the same way. The right site is not just the site with available power. It is the site where the workload, infrastructure, operator, jurisdiction, timeline, and commercial structure line up.</p><p style="text-align: justify;">That is what FPX is built to find. Not theoretical megawatts. Not brochure capacity. Sites that can actually support the deployment you are trying to build.</p><div><hr></div><h1>13. The FPX view: the market is not in a bubble. It is in a sorting cycle.</h1><p style="text-align: justify;">The AI colocation market is not breaking because some projects are getting canceled. It is breaking because too many people called undeveloped land &#8220;capacity.&#8221; Demand is real. Delays are real. Cancellations are real. Pricing pressure is real. The shortage is real. All of those can be true at once because the market is finally separating announced megawatts from delivered megawatts. The next wave of distress will not come from lack of AI demand. It will come from overlevered projects that confused planned power with usable power, confused tenant interest with financeable credit, and confused a rendering with an operating asset.</p><p>Here are the predictions that matter:</p><p style="text-align: justify;"><strong>1. &#8220;AI ready&#8221; becomes a red flag.</strong><br>Serious buyers will stop accepting vague claims. They will ask for one-line diagrams, failover cases, equipment status, cooling specs, commissioning proof, utility position, and delay remedies. If the operator cannot prove it, the capacity is not real.</p><p style="text-align: justify;"><strong>2. The market splits into fake megawatts and financeable megawatts.</strong><br>Fake megawatts get delayed, renamed, sold, or canceled. Financeable megawatts get preleased, repriced, and allocated to stronger buyers before they ever show up as open inventory.</p><p style="text-align: justify;"><strong>3.
Training and inference become different infrastructure markets.</strong><br>Training wants massive power campuses, dense liquid cooling, and operators that can deliver 50 MW to 100 MW plus projects. Inference wants placement, latency, fiber, speed, and hardware flexibility. The market will eventually value them differently.</p><p style="text-align: justify;"><strong>4. Credit beats price.</strong><br>The highest bidder will not always win. The buyer that reduces stranded capacity risk will. Weak credit will pay more, wait longer, get worse terms, and still lose to a buyer the operator, lender, and utility can actually underwrite.</p><p style="text-align: justify;"><strong>5. The county becomes part of the capacity stack.</strong><br>Power, land, and fiber do not matter if the project dies in local politics. Noise, water, generators, tax abatements, transmission, ratepayer impact, and local benefit will decide which large campuses actually get built.</p><p style="text-align: justify;"><strong>6. Electrical architecture becomes the next competitive frontier.</strong><br>Cooling gets the headlines, but high density power delivery will decide which sites stay relevant. Sites that cannot migrate toward the next hardware cycle will be capped earlier than investors expect.</p><p style="text-align: justify;"><strong>7. Behind the meter separates serious operators from tourists.</strong><br>Everyone will pitch onsite power. Fewer will execute it. Behind the meter is not cheap power. It is fuel, permits, emissions, controls, operations, redundancy, and power plant risk wrapped into a data center strategy.</p><p style="text-align: justify;"><strong>8. Tax incentives become less bankable.</strong><br>States and counties will demand more from large load projects. Pro formas that only work because incentives survive untouched for ten years will break. Incentives should be upside, not the base case.</p><p style="text-align: justify;"><strong>9. 2027 will not bring the relief buyers expect.</strong><br>More capacity should arrive, but the best capacity will already be spoken for, repriced, or reserved for buyers with stronger credit and earlier commitments. Waiting for the market to loosen may be how buyers end up with leftovers.</p><p style="text-align: justify;"><strong>10. The best assets will look boring.</strong><br>The winners will not always be the biggest announcements or the cheapest land. They will be the sites where power, permitting, equipment, cooling, tariff structure, operator capability, tenant credit, and local politics all line up.</p><p style="text-align: justify;">The mandate is simple. Builders need to stop selling future capacity as if it is commissioned capacity. Operators need to turn power, cooling, controls, tariffs, and liability boundaries into something buyers can actually underwrite. Buyers need to secure capacity now, but only real capacity. Investors need to underwrite AI data centers like infrastructure assets, not shells with power.</p><p style="text-align: justify;">The old colo market sold space.</p><p style="text-align: justify;">The new AI infrastructure market allocates delivered megawatts.</p><p style="text-align: justify;">FPX exists for that market. Through the <a href="https://marketplace.fpx.world/colocation">FPX Colocation Marketplace</a>, we source sites based on workload type, size, density, timeline, cooling requirements, power path, tariff exposure, legislation, local politics, operator risk appetite, and credit fit. Not brochure capacity. 
Not theoretical megawatts. Capacity that has a real path to servers.</p><p style="text-align: justify;">That is the game now. The winners will be the ones who understand it before everyone else is forced to.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://research.fpx.world/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">This Substack is reader-supported. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Beyond Power: The Helium & Tungsten Bottlenecks]]></title><description><![CDATA[The Iran war exposed a wartime logistics shock. China's export controls exposed a structural one.]]></description><link>https://research.fpx.world/p/boil-off-why-the-helium-trade-is</link><guid isPermaLink="false">https://research.fpx.world/p/boil-off-why-the-helium-trade-is</guid><dc:creator><![CDATA[FPX AI]]></dc:creator><pubDate>Mon, 06 Apr 2026 20:41:46 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/efef7b92-e05b-4db6-97ce-2fe2458513df_1536x768.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div><hr></div><p><em>This analysis is for informational and educational purposes only and does not constitute investment, legal, or financial advice. FPX AI has no positions in any securities mentioned. Do your own research. Geopolitics remain fluid.</em></p><div><hr></div><h2>The Setup</h2><p>The market keeps making the same mistake.</p><p>It sees a geopolitical shock, finds the scariest input in the stack, and immediately extrapolates to system-wide AI paralysis. This time the input is helium.</p><p>That is directionally right and analytically lazy.</p><p>Qatar produced roughly 63 million cubic meters of helium in 2025 out of approximately 190 million globally - about one-third of world supply (<a href="https://pubs.usgs.gov/periodicals/mcs2026/mcs2026-helium.pdf">USGS Mineral Commodity Summaries 2026</a>). On March 2, 2026, QatarEnergy halted production at Ras Laffan Industrial City - the world&#8217;s largest LNG export hub - following Iranian drone and missile strikes. The Strait of Hormuz has been severely restricted, with Iran permitting limited transit only for vessels with no US or Israeli links (<a href="https://www.reuters.com/business/energy/japanese-owned-lng-tanker-crosses-strait-hormuz-2026-04-03/">Reuters</a>). Qatar&#8217;s full helium output went offline immediately; Reuters reports Qatar&#8217;s helium output is expected to fall about 14% as a result of the physical damage, though the initial shock removed the full one-third from the market for weeks.</p><p>On March 17, Airgas - a subsidiary of Air Liquide (AI.PA) and one of the largest US packaged gas distributors - declared force majeure on helium shipments to US customers, effective 12:01 a.m. Eastern. 
Letters reviewed by <a href="https://www.bloomberg.com/news/articles/2026-03-24/airgas-curtails-helium-orders-after-qatar-lng-field-damaged">Bloomberg</a> show customer deliveries capped at up to half of normal monthly volumes plus a $13.50 per hundred cubic feet surcharge. Spot helium prices have more than doubled since the conflict began, with some markets reporting surges of 70-100% (<a href="https://www.cnbc.com/2026/03/19/the-iran-war-is-threatening-supply-helium-what-it-means-for-markets.html">CNBC</a>). Hundreds of specialized cryogenic ISO containers, each worth approximately $1 million, are stranded in the Middle East. Helium logistics are time-sensitive: <a href="https://www.reuters.com/business/energy/helium-prices-soar-qatar-lng-halt-exposes-fragile-supply-chain-2026-03-12/">Reuters</a> says liquid helium generally needs to reach end users within about 45 days, so stranded inventory is not a buffer so much as a wasting asset. We are literally losing feedstock to the sky.</p><p>This is a real physical-layer shock.</p><p>But the market&#8217;s next leap is where the analysis breaks. The myth is that a helium disruption automatically means immediate AI infrastructure failure. That is not the right layer. Helium is a real upstream bottleneck in chipmaking and select cryogenic systems. It is not the working fluid of mainstream AI rack liquid cooling - NVIDIA and Schneider both describe modern AI liquid-cooling systems as water- or glycol-based closed loops. The first-order risk is not that every AI cluster suddenly goes dark. The first-order risk is that an already concentrated semiconductor materials chain gets tighter, more expensive, and more selective in who gets priority.</p><p>That distinction matters.</p><p>Because if you trade this as &#8220;AI stops,&#8221; you miss the actual structure of the shock. And the actual structure is far more instructive than the headline.</p><blockquote><p><strong>Our house view is straightforward: the supply shock is real. The panic is selective. The market is correct on scarcity and wrong on propagation.</strong> The names with weak inventory, weak recycling, and weak procurement leverage should worry. The largest memory players look buffered for now. The risk is duration, not day one. The market is pricing the gross supply loss faster than it is pricing the buffering mechanisms.</p></blockquote><p>That is the setup.</p><p>The rest of this piece explains why helium became strategic, where the actual bottlenecks are, which parts of the semiconductor stack are most exposed, why the broad panic is overstated for Samsung (KRX: 005930), SK Hynix (KRX: 000660), TSMC (TSM), and Micron (MU), and which signals will tell you whether this remains a pricing event or turns into a real production event.</p><p>As we covered in Parts 1 and 2 of this series - on memory and networking - the market fixates on the components it can see: GPUs, FLOPs, megawatts. The real constraints live in the physical layer surrounding the chip. Helium is as physical as it gets. You cannot synthesize it. For many critical cryogenic and semiconductor uses, substitution is limited or impractical. You can only extract it from the earth, liquefy it at -269 C, and deliver it before it boils away.</p><div><hr></div><h2>Start From Physics, Not Headlines</h2><p>Helium looks simple. Atomic number two. The second lightest element in the universe. But industrially, it is one of the hardest substances on Earth to handle. 
Understanding why requires starting from the physics - the foundational principle of every FPX analysis.</p><p>Most commodities can be stockpiled, rerouted, blended, or substituted. Helium does not cooperate. It is produced as a byproduct of natural-gas processing. It must be purified to extremely high levels for semiconductor use. It must be liquefied at ultra-low temperatures. It must move in specialized cryogenic containers. And it cannot sit still for long because boil-off destroys inventory value over time. The shock is not just &#8220;less helium.&#8221; The shock is &#8220;less helium in a system that was never built for graceful delay.&#8221;</p><p>Once you map the full physics of helium supply, one conclusion becomes unavoidable:</p><p><strong>Helium is not scarce because it is rare. It is scarce because controlling it sits at the edge of engineering limits.</strong></p><p>That is the right starting point.</p><h3>The Cryogenic Paradox</h3><p>Most gases cool when you expand them through a valve. This is the basic principle behind every refrigerator - the Joule-Thomson effect. Helium breaks this rule. At standard industrial temperatures, expanding helium through a valve makes it <em>hotter</em>, not colder. To push helium down to its liquefaction point at -269 C - four degrees above absolute zero - you need specialized turboexpander equipment that only a handful of companies manufacture globally. These are not commoditized machines. Lead times for critical liquefaction and purification components can stretch well beyond a year.</p><h3>The Purity Gauntlet</h3><p>Semiconductor-grade helium requires ultra-high purity - USGS defines US Grade-A helium as 99.997% or greater, and leading-edge processes use grades up to 99.9999% (six nines). Raw helium extracted from natural gas wells starts at roughly 0.04-0.5% concentration. Getting to semiconductor grade means concentrating the gas over 1,000x and then purifying it through multiple stages spanning a wide temperature range, including high-temperature getter systems that chemically trap impurities down to parts-per-billion. The specialized equipment and materials required for these purification stages come from a small number of suppliers with long lead times.</p><p>This is classic FPX territory: the relevant bottleneck is not gross supply. It is qualified supply at the required performance layer. A semiconductor fab does not need generic industrial gas. It needs ultra-high-purity helium delivered reliably into tightly tuned processes. Even if total global supply is only partially impaired, the usable portion for high-performance semiconductor applications can tighten faster than the headline volume loss suggests.</p><h3>The Storage Problem</h3><p>Helium is far harder to buffer than most commodities. The <a href="https://www.semiconductors.org/wp-content/uploads/2023/03/SIA-Comments-to-USGS-Request-for-Comment-on-Helium-Supply-Risks-3_16_23.pdf">SIA</a> says helium cannot be readily stockpiled. Liquid helium constantly absorbs ambient heat. Even in the best cryogenic containers, boil-off runs 0.1-1% per month. <a href="https://www.reuters.com/business/energy/helium-prices-soar-qatar-lng-halt-exposes-fragile-supply-chain-2026-03-12/">Reuters</a> reports that liquid helium generally needs to reach end users within about 45 days. Some underground storage exists - USGS notes cavern storage in Texas - but this is not comparable to a strategic petroleum reserve in scale or accessibility.</p><p>And the US no longer has a federal buffer. 
The Bureau of Land Management completed the sale of the Federal Helium System in June 2024. <a href="https://pubs.usgs.gov/periodicals/mcs2026/mcs2026-helium.pdf">USGS</a> now lists the government stockpile as none.</p><p>This is why hundreds of stranded containers in the Middle East are not a &#8220;buffer&#8221; - they are a wasting asset evaporating into the atmosphere. For helium, logistics are not a supporting function. They are part of the product. If containers are stranded, helium is not merely delayed. It is physically degrading. This is one of the rare markets where logistics failure becomes literal product destruction.</p><h3>What Helium Actually Does in a Fab</h3><p>This is where the most common misunderstanding lives. The popular narrative has framed helium as &#8220;the coolant that keeps AI data centers alive.&#8221; That is wrong. NVIDIA and Schneider both describe modern AI liquid-cooling systems as water- or glycol-based loops. Helium does not cool your GPU rack.</p><p>The right framing: helium is one of the upstream gases that determines whether the chips feeding those data centers can be manufactured without yield loss. It operates at the wafer fabrication layer, not the deployment layer. That is still a serious bottleneck. It is just a different layer of the stack.</p><p>In a modern fab, helium performs four critical functions where substitutes are limited, lower-performance, or hard to qualify:</p><p><strong>Backside wafer cooling.</strong> During plasma etching, helium is injected between the wafer and the electrostatic chuck to dissipate heat uniformly. Without it, thermal gradients warp the wafer and kill yield at advanced nodes.</p><p><strong>Purge gas.</strong> Helium&#8217;s chemical inertness makes it a primary choice for displacing contaminants in process chambers where a single particle can ruin an entire wafer at 3nm. The <a href="https://www.semiconductors.org/wp-content/uploads/2023/03/SIA-Comments-to-USGS-Request-for-Comment-on-Helium-Supply-Risks-3_16_23.pdf">SIA</a> notes that many helium uses in semiconductor manufacturing lack viable substitutes.</p><p><strong>Photolithography and EUV processes.</strong> Helium is used in photolithography environments for thermal management and atmosphere control. The <a href="https://www.semiconductors.org/wp-content/uploads/2023/03/SIA-Comments-to-USGS-Request-for-Comment-on-Helium-Supply-Risks-3_16_23.pdf">SIA</a> identifies photolithography as one of the critical semiconductor applications for helium.</p><p><strong>Leak detection.</strong> Helium&#8217;s tiny atomic radius allows it to penetrate openings that no other test gas can reach, enabling micro-leak detection in vacuum systems and gas pipelines across the fab.</p><p><a href="https://pubs.usgs.gov/periodicals/mcs2026/mcs2026-helium.pdf">USGS</a> reports that helium has no substitute in cryogenic applications below -429 F (-257 C). A semiconductor devices professor at South Korea&#8217;s Sangmyung University confirmed to <a href="https://www.trendforce.com/news/2026/03/23/news-under-qatars-shadow-helium-crunch-hits-south-korea-harder-putting-samsung-sk-hynix-tsmc-in-spotlight/">TechNews</a> that there is currently no viable alternative for cooling wafers in semiconductor production. The combination of thermal conductivity, chemical inertness, and atomic size is unique among all elements.</p><p>USGS reports that controlled atmospheres, fiber optics, and semiconductors together accounted for 17% of US helium sales in 2025. 
That exposure could rise as advanced-node capacity expands. Helium is embedded in photolithography and other wafer-fab steps, so as fabs push to smaller nodes, fab-level helium exposure can rise. The <a href="https://www.semiconductors.org/wp-content/uploads/2023/03/SIA-Comments-to-USGS-Request-for-Comment-on-Helium-Supply-Risks-3_16_23.pdf">Semiconductor Industry Association</a> cautioned in 2023 that substantial helium disruption would significantly affect US and global semiconductor manufacturing.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PKkJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4008740-f7ca-483e-8b04-7d485c2eafaf_2800x1400.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PKkJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4008740-f7ca-483e-8b04-7d485c2eafaf_2800x1400.heic 424w, https://substackcdn.com/image/fetch/$s_!PKkJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4008740-f7ca-483e-8b04-7d485c2eafaf_2800x1400.heic 848w, https://substackcdn.com/image/fetch/$s_!PKkJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4008740-f7ca-483e-8b04-7d485c2eafaf_2800x1400.heic 1272w, https://substackcdn.com/image/fetch/$s_!PKkJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4008740-f7ca-483e-8b04-7d485c2eafaf_2800x1400.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PKkJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4008740-f7ca-483e-8b04-7d485c2eafaf_2800x1400.heic" width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a4008740-f7ca-483e-8b04-7d485c2eafaf_2800x1400.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:194279,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://research.fpx.world/i/192999427?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4008740-f7ca-483e-8b04-7d485c2eafaf_2800x1400.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PKkJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4008740-f7ca-483e-8b04-7d485c2eafaf_2800x1400.heic 424w, https://substackcdn.com/image/fetch/$s_!PKkJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4008740-f7ca-483e-8b04-7d485c2eafaf_2800x1400.heic 848w, 
https://substackcdn.com/image/fetch/$s_!PKkJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4008740-f7ca-483e-8b04-7d485c2eafaf_2800x1400.heic 1272w, https://substackcdn.com/image/fetch/$s_!PKkJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4008740-f7ca-483e-8b04-7d485c2eafaf_2800x1400.heic 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Helium operates at the wafer fabrication layer - upstream of the data center, not inside it. Substitution is limited or impractical across critical fab functions. Source: FPX AI, SIA</figcaption></figure></div><p></p><div><hr></div><h2>The Supply Chain: Too Few, Too Specialized, Too Slow</h2><p>The entire global helium supply chain runs through a small number of highly specialized facilities. <a href="https://newsroom.lamresearch.com/Using-Less-Helium-Is-Good-Business?blog=true">Lam Research</a>(LRCX) reports that fewer than 20 helium refineries existed globally as of 2021. The exact count depends on whether you are counting refineries, liquefiers, or fully integrated plants. The number is not the point. The point is that the system is too concentrated, too specialized, and too slow to replace on a quarterly timeline.</p><p>Pre-crisis, the supply picture looked roughly like this: the United States produced approximately 81 million cubic meters annually (roughly 43% of global output), Qatar produced approximately 63 million cubic meters (roughly 33%), Algeria contributed meaningful volumes, and Russia - which has been expanding production amid its ongoing war in Ukraine - rounds out the top tier (<a href="https://pubs.usgs.gov/periodicals/mcs2026/mcs2026-helium.pdf">USGS</a>).</p><p>Even the more optimistic counterarguments do not dispute the core physics. They dispute whether standalone helium projects - dry-gas plants not tied to LNG infrastructure - can come online faster than the 3-6 years required for LNG-integrated facilities. 
Even if that is true, it is not useful for an immediate wartime shock.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!B7_O!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ea29638-4247-4e6d-a74e-370a9226c5df_2800x1240.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!B7_O!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ea29638-4247-4e6d-a74e-370a9226c5df_2800x1240.heic 424w, https://substackcdn.com/image/fetch/$s_!B7_O!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ea29638-4247-4e6d-a74e-370a9226c5df_2800x1240.heic 848w, https://substackcdn.com/image/fetch/$s_!B7_O!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ea29638-4247-4e6d-a74e-370a9226c5df_2800x1240.heic 1272w, https://substackcdn.com/image/fetch/$s_!B7_O!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ea29638-4247-4e6d-a74e-370a9226c5df_2800x1240.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!B7_O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ea29638-4247-4e6d-a74e-370a9226c5df_2800x1240.heic" width="1456" height="645" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8ea29638-4247-4e6d-a74e-370a9226c5df_2800x1240.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:645,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:168312,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://research.fpx.world/i/192999427?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ea29638-4247-4e6d-a74e-370a9226c5df_2800x1240.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!B7_O!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ea29638-4247-4e6d-a74e-370a9226c5df_2800x1240.heic 424w, https://substackcdn.com/image/fetch/$s_!B7_O!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ea29638-4247-4e6d-a74e-370a9226c5df_2800x1240.heic 848w, https://substackcdn.com/image/fetch/$s_!B7_O!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ea29638-4247-4e6d-a74e-370a9226c5df_2800x1240.heic 1272w, https://substackcdn.com/image/fetch/$s_!B7_O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ea29638-4247-4e6d-a74e-370a9226c5df_2800x1240.heic 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex 
pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: USGS Mineral Commodity Summaries 2025, Phil Kornbluth / Kornbluth Helium Consulting, Scientific American. Qatar's ~5.2M m3/month went offline on March 2; Reuters reports the lasting output loss at about 14%</figcaption></figure></div><h3>The Net Shortage Is Not 33%</h3><p>The critical detail most analysts miss: the pre-crisis helium market was actually oversupplied by roughly 15%. Phil Kornbluth, the most cited independent helium consultant in the world, told <a href="https://www.scientificamerican.com/article/the-iran-war-disrupts-global-helium-supply-and-artificial-intelligence-chip/">Scientific American</a> that with a 30% loss of global capacity offset by a recent 15% supply overhang, the net effective shortage is approximately 15%. Some market observers cited by the <a href="https://www.ft.com/content/cda276a7-f610-4c0b-b984-52419f62e8ee">Financial Times</a> argue the eventual shortfall could settle closer to 10-15% of demand rather than a full one-third once inventories and alternate supply are factored in.</p><p>That is plausible, not settled. But it is the difference between a crisis and a manageable disruption. The deeper point: the market is pricing the gross supply loss faster than it is modeling the buffering mechanisms.</p><h3>The Logistics Chokepoint</h3><p>Helium distribution depends on roughly 2,000 specialized cryogenic ISO containers globally (<a href="https://www.scientificamerican.com/article/the-iran-war-disrupts-global-helium-supply-and-artificial-intelligence-chip/">Scientific American</a>). These are not interchangeable with standard LNG or industrial gas containers. Each costs approximately $1 million. Hundreds are now stuck in the Middle East - in Qatar, on cargo ships, or in transit through restricted shipping lanes.</p><p>Container transit from the Persian Gulf to South Korea normally takes about one month. Helium that shipped from Qatar before the war started is still arriving. The real shortage at the fab level has not fully hit yet. As Kornbluth told <a href="https://fortune.com/2026/03/21/iran-war-helium-shortage-qatar-chip-supply-chains-ai-boom/">Fortune</a>: the shortage is a few weeks out. 
It is a sunny day on the beach, but the tsunami is visible on the horizon.</p><div><hr></div><h2>The Timeline: From Ras Laffan to Force Majeure</h2><p><strong>February 28, 2026.</strong> US-Israeli airstrikes on Iran begin. Iran retaliates with drone and missile strikes across the Gulf.</p><p><strong>March 2.</strong> QatarEnergy halts LNG production at Ras Laffan following Iranian drone strikes on operational facilities. Helium extraction ceases simultaneously. Iran declared the Strait of Hormuz closed to U.S.- and Israeli-linked shipping, severely restricting commercial transit while still allowing limited passage for some neutral vessels.</p><p><strong>March 4.</strong> QatarEnergy declares force majeure on LNG and associated product contracts, including helium. Approximately 5.2 million cubic meters of monthly helium supply goes offline. Gasworld convenes emergency webinar of industry experts. (<a href="https://cen.acs.org/business/specialty-chemicals/Iran-war-threatens-global-helium/104/web/2026/03">C&amp;EN</a>)</p><p><strong>March 12.</strong> Deutsche Bank notes the market has shifted from oversupplied to undersupplied. Bank of America estimates spot prices have already surged approximately 40%. (<a href="https://www.cnbc.com/2026/03/19/the-iran-war-is-threatening-supply-helium-what-it-means-for-markets.html">CNBC</a>)</p><p><strong>March 17.</strong> Airgas declares force majeure on helium shipments to US customers, effective 12:01 a.m. Eastern. Letters reviewed by <a href="https://www.bloomberg.com/news/articles/2026-03-24/airgas-curtails-helium-orders-after-qatar-lng-field-damaged">Bloomberg</a> confirm 50% delivery caps and $13.50/Mcf surcharges. Healthcare customers are prioritized over industrial buyers.</p><p><strong>March 18-19.</strong> Iranian missiles strike Ras Laffan Industrial City directly, causing three fires and wiping out approximately 17% of Qatar&#8217;s LNG export capacity. QatarEnergy CEO Saad Al-Kaabi says repairs could take three to five years. (<a href="https://www.aljazeera.com/economy/2026/3/26/helium-hitch-why-us-israel-war-on-iran-could-cause-mri-scan-delays">Al Jazeera</a>, <a href="https://fortune.com/2026/03/21/iran-war-helium-shortage-qatar-chip-supply-chains-ai-boom/">Fortune</a>)</p><p><strong>March 25-31.</strong> Spot helium prices reported at 70-100% above pre-crisis levels. Samsung and SK Hynix shares sell off on headlines. PGMEA prices up 40-50% separately on oil price pass-through. (<a href="https://www.cnbc.com/2026/03/19/the-iran-war-is-threatening-supply-helium-what-it-means-for-markets.html">CNBC</a>, <a href="https://www.trendforce.com/news/2026/03/27/news-iran-conflict-reportedly-drives-50-helium-spot-price-surge-samsung-sk-hynix-on-high-alert/">TrendForce</a>)</p><div><hr></div><h2>What the Market Is Getting Right</h2><p>To be clear, the panic is not fake. The market is right about three things.</p><p><strong>The price shock is real.</strong> Airgas declaring force majeure and capping deliveries is not a theoretical risk. It is an operating fact. <a href="https://www.bloomberg.com/news/articles/2026-03-24/airgas-curtails-helium-orders-after-qatar-lng-field-damaged">Bloomberg reviewed the letters</a>. The surcharges are live. This is happening now.</p><p><strong>The logistics problem is real.</strong> If expensive cryogenic containers are stranded and product is boiling off, that is not a sentiment issue. That is a physical loss mechanism. 
<a href="https://www.scientificamerican.com/article/the-iran-war-disrupts-global-helium-supply-and-artificial-intelligence-chip/">Scientific American</a> reports that the industry relies on roughly 2,000 containers, many of which are now stuck in Qatar or on cargo ships. The initial pinch will feel worse until those tanks are repositioned.</p><p><strong>Smaller and less-protected buyers are genuinely exposed.</strong> They do not need a global shutdown to get hurt. They only need tighter allocations and higher prices. <a href="https://www.reuters.com/world/asia-pacific/helium-shortage-has-started-impacting-tech-supply-chains-execs-say-2026-03-26/">Reuters</a> reports that some production in the global tech supply chain is already being affected, and prolonged shortages could force slower output or product prioritization.</p><p>So this is not a call to dismiss the shock. It is a call to route the shock correctly.</p><div><hr></div><h2>What the Market Is Overstating</h2><p>The consensus narrative went: Iran war closes Hormuz, helium supply collapses, chip fabs shut down, AI buildout stalls, sell everything.</p><p>That narrative is wrong in three specific ways.</p><h3>Myth 1: &#8220;30% of supply vanished, so it&#8217;s a 30% shortage&#8221;</h3><p>The pre-crisis helium market was oversupplied. Kornbluth estimates a net shortage of approximately 15%, not 30% (<a href="https://www.scientificamerican.com/article/the-iran-war-disrupts-global-helium-supply-and-artificial-intelligence-chip/">Scientific American</a>). <a href="https://www.ft.com/content/cda276a7-f610-4c0b-b984-52419f62e8ee">Financial Times</a> reporting suggests the eventual shortfall could settle at 10-15% once inventories and alternate supply are factored in. Significant, but not catastrophic.</p><h3>Myth 2: &#8220;Fabs have one week of inventory&#8221;</h3><p>This was true pre-pandemic. It is no longer true. After the 2022 neon crisis - when Russian supply disruptions threatened EUV lithography gases - every major chipmaker rebuilt safety stocks. <a href="https://www.reuters.com/business/energy/helium-stocks-south-koreas-chipmakers-last-until-june-sources-say-2026-03-31/">Reuters</a> reports that Samsung and SK Hynix hold roughly four to six months of helium inventory. <a href="https://www.digitimes.com/news/a20260401VL207/qatar-inventory-samsung-sk-hynix-production.html">Digitimes</a> confirms South Korean chipmakers have enough to sustain production through at least June. TSMC said it does not anticipate significant impact and maintains multi-source contracts (<a href="https://www.trendforce.com/news/2026/03/23/news-under-qatars-shadow-helium-crunch-hits-south-korea-harder-putting-samsung-sk-hynix-tsmc-in-spotlight/">TrendForce</a>). Micron appears relatively insulated given its US manufacturing footprint. The one-week figure applies to smaller, unhedged fabs - not the companies the market is selling off.</p><h3>Myth 3: &#8220;Helium is the coolant keeping AI data centers alive&#8221;</h3><p>This is the most important correction. Helium does not cool GPU racks. Mainstream AI liquid cooling is water- or glycol-based. Helium operates upstream - at the wafer fabrication layer, not the deployment layer. The right framing: helium determines whether the chips can be manufactured without yield loss. That is a serious bottleneck. 
It is just a different one than the market is pricing.</p><p>The right chain is: Iran war &#8594; Hormuz disruption &#8594; Qatar helium outage &#8594; industrial gas rationing &#8594; tighter fab input conditions &#8594; selective semiconductor pressure.</p><p>Not: Iran war &#8594; no helium &#8594; AI compute stops tomorrow. That assumption is outdated.</p><div><hr></div><h2>Helium Is the New Neon, but Not in the Way the Market Thinks</h2><p>The clean historical analogy is neon in 2022.</p><p>Back then, the market also jumped from specialty-gas risk to broad semiconductor catastrophe. Ukraine supplied approximately 50% of the world&#8217;s semiconductor-grade neon. What actually happened was more nuanced. Shortages were real. Procurement mattered. Inventories mattered. But the strongest operators adapted faster than the headlines implied. No major fab shut down. Production continued.</p><p>Helium is arguably worse than neon in one respect: the logistics are even uglier. You cannot store helium indefinitely. You cannot ship it through contested waterways without physical degradation. The boil-off clock runs whether you are at war or not.</p><p>But the portfolio logic is similar. You should not ask, &#8220;Is helium a problem?&#8221; It is. You should ask:</p><p><strong>Who is short helium resilience?</strong></p><p>That is the investable question.</p><p></p>
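<p>A closing sanity check on the supply math used throughout this piece, as a minimal sketch: it simply mirrors the netting quoted from Kornbluth (a roughly 30% gross capacity loss against a roughly 15% pre-crisis overhang) and assumes both figures are expressed against the same demand base. The function is illustrative, not a published model:</p><pre><code>
# Back-of-envelope netting of the helium shock, mirroring the figures quoted above.
# Assumption: both percentages are expressed as a share of pre-crisis demand.

def net_shortage(gross_loss_share: float, oversupply_share: float) -> float:
    """Net effective shortage after the pre-crisis surplus absorbs part of the loss."""
    return max(gross_loss_share - oversupply_share, 0.0)

# ~30% of capacity offline, offset by a ~15% supply overhang.
print(f"Net effective shortage: {net_shortage(0.30, 0.15):.0%}")   # Net effective shortage: 15%
</code></pre>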
      <p>
          <a href="https://research.fpx.world/p/boil-off-why-the-helium-trade-is">
              Read more
          </a>
      </p>
]]></content:encoded></item><item><title><![CDATA[The Fake AI Copper Debate: Mispricing the Physical Layer]]></title><description><![CDATA[Why &#8220;de&#8209;coppering the rack&#8221; won&#8217;t de&#8209;copper the data center &#8212; and why optics still wins anyway]]></description><link>https://research.fpx.world/p/the-fake-ai-copper-debate-mispricing</link><guid isPermaLink="false">https://research.fpx.world/p/the-fake-ai-copper-debate-mispricing</guid><dc:creator><![CDATA[FPX AI]]></dc:creator><pubDate>Fri, 20 Feb 2026 20:43:32 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/9362dffd-1ab6-4a06-8f05-a2d725245358_1536x768.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>The market is stuck in a fake argument about the physical layer of AI.</strong></p><p>If you listen to the current chatter around data center infrastructure, you are being fed a binary that doesn&#8217;t actually exist in the real world:</p><ul><li><p><strong>Myth #1: AI requires &#8220;hundreds of thousands of tons&#8221; of copper inside the data hall.</strong> The most&#8209;shared extreme numbers are quite literally bad math&#8212;a game of telephone where &#8220;busbar copper per MW&#8221; got extrapolated into &#8220;whole-facility copper&#8221; and confused across kg/tons/pounds.</p></li><li><p><strong>Myth #2: &#8220;Going all&#8209;fiber&#8221; is a death sentence for copper.</strong> The reality? You can&#8217;t model the physical constraints of an AI training cluster if you don&#8217;t understand the physics of how they are powered.</p></li></ul><p>Both of these myths miss the actual constraint: AI data centers are no longer just server farms. They are turning into power plants. The variables that actually dictate the buildout are power density, redundancy, and grid interconnection, not what material is wrapped inside a fraction of the networking cables.</p><pre><code><code>Disclaimer: This is not investment advice. It&#8217;s an analytical framework
+ a public-market watchlist for understanding how &#8220;robotics workloads&#8221;
could re-route compute spend across the stack. Do your own work /
consult a licensed professional before acting.</code></code></pre><p>And for skeptics: <strong>the real bear case for copper isn&#8217;t fiber. It&#8217;s aluminum substitution in busway / conductors</strong>. But that threat is capped by physics: aluminum needs <strong>more cross&#8209;section (space)</strong> for the same conductivity, and space is precisely what high&#8209;density racks don&#8217;t have. </p><div><hr></div><h2>The &#8220;200 tons&#8221; story is a category error (and why it got misread)</h2><p>The viral narrative about copper tonnage is a masterclass in what happens when financial analysts try to read engineering manuals.</p><p>Here is the actual context. <a href="https://developer.nvidia.com/blog/nvidia-800-v-hvdc-architecture-will-power-the-next-generation-of-ai-factories/">NVIDIA</a> published a technical post explaining why next-generation AI data centers must shift to high-voltage power architectures. To prove their point, they looked at legacy 54-volt systems. Because lower voltage requires massive amounts of physical metal to safely carry high power, pushing one megawatt of compute through a legacy rack requires roughly 200 kilograms of solid copper busbars. A busbar is simply the thick metal strip that conducts electricity inside the cabinet.</p><p>NVIDIA&#8217;s 800V/HVDC post put a <strong>real</strong> number on a <strong>very specific</strong> subsystem:</p><ul><li><p>In a legacy <strong>54V</strong> architecture, a <strong>1 MW rack</strong> can require <strong>~200 kg of copper busbar</strong>.</p></li><li><p>Scale that <em>busbar-only</em> subsystem to a <strong>1 GW</strong> buildout and you get <strong>~200,000 kg (200 metric tons)</strong> of copper for rack busbars. (<a href="https://developer.nvidia.com/blog/nvidia-800-v-hvdc-architecture-will-power-the-next-generation-of-ai-factories/">NVIDIA Developer</a>)</p></li></ul><p>That number then got misinterpreted in both directions:</p><ul><li><p>Some people (wrongly) treated <strong>200 tons</strong> as <em>the whole facility</em>.</p></li><li><p>Others ran with an early, obviously wrong &#8220;hundreds of thousands of tons&#8221; framing which got publicly challenged and then stabilized into the correct interpretation (busbars, not the entire campus).</p></li></ul><p><strong>The key takeaway:</strong><br>The &#8220;200 tons&#8221; number is useful, but only as a warning shot about <strong>low-voltage distribution hitting a wall</strong> and not as a way to model total copper demand.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AJNp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaaa4be9-201c-45de-a60d-f52b09ac8389_2751x1427.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AJNp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaaa4be9-201c-45de-a60d-f52b09ac8389_2751x1427.png 424w, https://substackcdn.com/image/fetch/$s_!AJNp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaaa4be9-201c-45de-a60d-f52b09ac8389_2751x1427.png 848w, 
https://substackcdn.com/image/fetch/$s_!AJNp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaaa4be9-201c-45de-a60d-f52b09ac8389_2751x1427.png 1272w, https://substackcdn.com/image/fetch/$s_!AJNp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaaa4be9-201c-45de-a60d-f52b09ac8389_2751x1427.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AJNp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaaa4be9-201c-45de-a60d-f52b09ac8389_2751x1427.png" width="1456" height="755" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/faaa4be9-201c-45de-a60d-f52b09ac8389_2751x1427.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:755,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:6460786,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://research.fpx.world/i/187813475?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaaa4be9-201c-45de-a60d-f52b09ac8389_2751x1427.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AJNp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaaa4be9-201c-45de-a60d-f52b09ac8389_2751x1427.png 424w, https://substackcdn.com/image/fetch/$s_!AJNp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaaa4be9-201c-45de-a60d-f52b09ac8389_2751x1427.png 848w, https://substackcdn.com/image/fetch/$s_!AJNp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaaa4be9-201c-45de-a60d-f52b09ac8389_2751x1427.png 1272w, https://substackcdn.com/image/fetch/$s_!AJNp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaaa4be9-201c-45de-a60d-f52b09ac8389_2751x1427.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>The number that matters: <strong>~30&#8211;40 tons of copper per MW</strong> is the floor</h2><p>If you want a modeling input that actually works in the real world, you need to stop debating internal rack components. The only metric you should anchor on is <strong>facility-wide copper intensity per megawatt.</strong></p><p>Right now, the cleanest framework comes from S&amp;P Global. They baseline standard data centers at roughly 30 to 40 metric tons of copper for every megawatt of IT capacity. To be clear, that is a Day Zero construction metric. It strictly measures the initial build and entirely excludes the copper required for future lifecycle refits.</p><p>That baseline is not arbitrary. It is a structural floor dictated by the single most important variable that financial models chronically underweight.</p><h3>Redundancy is the multiplier</h3><p>Data centers are not like conventional factories. A factory can tolerate downtime; an AI training cluster cannot.</p><p>So builders don&#8217;t design for &#8220;N.&#8221; They design for <strong>N+1 / 2N</strong> everything:</p><ul><li><p>transformers</p></li><li><p>switchgear</p></li><li><p>busway / cabling</p></li><li><p>UPS + generation</p></li><li><p>cooling backbones</p></li></ul><p><strong>Redundancy is the absolute multiplier</strong> Data centers are not factories. A factory can tolerate a few hours of downtime. A billion-dollar AI training cluster simply cannot.</p><p>Builders never design for a baseline capacity. They design for N plus one, or even double the capacity, across the entire site. That means duplicating the transformers, the switchgear, the heavy cabling, the backup generation, and the cooling backbones.</p><p>This is exactly why theoretical models fail. They calculate the bare minimum copper needed to run a facility and stop there. Real-world copper intensity is always drastically higher because redundancy forces you to buy the most copper-dense equipment on the campus twice.</p><h3>Reality check: the range is wide, but the floor is sticky</h3><p>When you look across the industry, the estimates for copper intensity land in the exact same zip code. The variance you see in the numbers comes down to three specific choices: redundancy, rack density, and whether the model measures just the initial build or the entire lifecycle.</p><p>Look at the actual deployments. A standard Microsoft facility in Chicago maps out to roughly 27 tons of copper per megawatt. High-redundancy AI training clusters in Asia push closer to 47 tons per megawatt. Meanwhile, stripped-down crypto mining sites drop down to 21 tons. 
If you start factoring in lifecycle refits and upgrades over time, those estimates can easily shoot past 60 tons.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9CFQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F498caf53-77c9-496d-8a4b-9d40a41dcb1d_1882x1108.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9CFQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F498caf53-77c9-496d-8a4b-9d40a41dcb1d_1882x1108.png 424w, https://substackcdn.com/image/fetch/$s_!9CFQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F498caf53-77c9-496d-8a4b-9d40a41dcb1d_1882x1108.png 848w, https://substackcdn.com/image/fetch/$s_!9CFQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F498caf53-77c9-496d-8a4b-9d40a41dcb1d_1882x1108.png 1272w, https://substackcdn.com/image/fetch/$s_!9CFQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F498caf53-77c9-496d-8a4b-9d40a41dcb1d_1882x1108.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9CFQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F498caf53-77c9-496d-8a4b-9d40a41dcb1d_1882x1108.png" width="1456" height="857" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/498caf53-77c9-496d-8a4b-9d40a41dcb1d_1882x1108.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:857,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:381943,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://research.fpx.world/i/187813475?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F498caf53-77c9-496d-8a4b-9d40a41dcb1d_1882x1108.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9CFQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F498caf53-77c9-496d-8a4b-9d40a41dcb1d_1882x1108.png 424w, https://substackcdn.com/image/fetch/$s_!9CFQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F498caf53-77c9-496d-8a4b-9d40a41dcb1d_1882x1108.png 848w, https://substackcdn.com/image/fetch/$s_!9CFQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F498caf53-77c9-496d-8a4b-9d40a41dcb1d_1882x1108.png 1272w, https://substackcdn.com/image/fetch/$s_!9CFQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F498caf53-77c9-496d-8a4b-9d40a41dcb1d_1882x1108.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div 
class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p><strong>Bottom line:</strong> if you&#8217;re underwriting AI data center copper demand, <strong>30&#8211;40 t/MW is not aggressive</strong>. It&#8217;s the base case.</p><div><hr></div><h2>The &#8220;fiber delta&#8221; is real but it&#8217;s just not the main event</h2><p>Here&#8217;s the cleanest way to frame the copper-vs-fiber question:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eQ0S!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F395a1796-4b78-4d4d-9f35-56372ac883ed_2008x1452.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eQ0S!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F395a1796-4b78-4d4d-9f35-56372ac883ed_2008x1452.png 424w, https://substackcdn.com/image/fetch/$s_!eQ0S!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F395a1796-4b78-4d4d-9f35-56372ac883ed_2008x1452.png 848w, https://substackcdn.com/image/fetch/$s_!eQ0S!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F395a1796-4b78-4d4d-9f35-56372ac883ed_2008x1452.png 1272w, https://substackcdn.com/image/fetch/$s_!eQ0S!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F395a1796-4b78-4d4d-9f35-56372ac883ed_2008x1452.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eQ0S!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F395a1796-4b78-4d4d-9f35-56372ac883ed_2008x1452.png" width="1200" height="867.8571428571429" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/395a1796-4b78-4d4d-9f35-56372ac883ed_2008x1452.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1053,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:816696,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://research.fpx.world/i/187813475?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F395a1796-4b78-4d4d-9f35-56372ac883ed_2008x1452.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eQ0S!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F395a1796-4b78-4d4d-9f35-56372ac883ed_2008x1452.png 424w, https://substackcdn.com/image/fetch/$s_!eQ0S!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F395a1796-4b78-4d4d-9f35-56372ac883ed_2008x1452.png 848w, https://substackcdn.com/image/fetch/$s_!eQ0S!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F395a1796-4b78-4d4d-9f35-56372ac883ed_2008x1452.png 1272w, https://substackcdn.com/image/fetch/$s_!eQ0S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F395a1796-4b78-4d4d-9f35-56372ac883ed_2008x1452.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>Fiber wins bandwidth. Copper carries amperage.</h3><p>The one place copper actually loses inside the data hall is the interconnect cabling between racks. The industry is undeniably shifting toward fiber here. But you have to look at the actual magnitude. Going all-in on fiber reduces total copper intensity by roughly four to five tons per megawatt. 
Against a baseline of 30 to 40 tons, you are looking at a ten to fifteen percent haircut. That is a minor efficiency gain, not an extinction event.</p><p>The nuance here is that fiber only substitutes data interconnects. It does absolutely nothing to replace the heavy, amp-carrying copper in the power chain. Yes, fiber is taking market share, but it is taking share from a sliver of the buildout that does not move the needle. The structural driver of this entire trade is power delivery.</p><p>This is exactly what the market misunderstood about NVIDIA. Their warning was never that copper is disappearing. Their point was that legacy 54 volt power distribution physically breaks down as racks approach megawatt scale. That is why they are forcing the industry toward 800 volt direct current architectures. That viral busbar number was a symptom of the true underlying constraints in high-density computing. The bottlenecks are current, physical space, and heat.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Zg76!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8afa11-b9b4-41ef-a0a6-70b74429c0c3_1944x1103.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Zg76!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8afa11-b9b4-41ef-a0a6-70b74429c0c3_1944x1103.png 424w, https://substackcdn.com/image/fetch/$s_!Zg76!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8afa11-b9b4-41ef-a0a6-70b74429c0c3_1944x1103.png 848w, https://substackcdn.com/image/fetch/$s_!Zg76!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8afa11-b9b4-41ef-a0a6-70b74429c0c3_1944x1103.png 1272w, https://substackcdn.com/image/fetch/$s_!Zg76!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8afa11-b9b4-41ef-a0a6-70b74429c0c3_1944x1103.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Zg76!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8afa11-b9b4-41ef-a0a6-70b74429c0c3_1944x1103.png" width="1944" height="1103" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd8afa11-b9b4-41ef-a0a6-70b74429c0c3_1944x1103.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1103,&quot;width&quot;:1944,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:337276,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://research.fpx.world/i/187813475?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F178fd005-25a5-4a8d-9281-d6e25f7719d8_1944x1144.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!Zg76!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8afa11-b9b4-41ef-a0a6-70b74429c0c3_1944x1103.png 424w, https://substackcdn.com/image/fetch/$s_!Zg76!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8afa11-b9b4-41ef-a0a6-70b74429c0c3_1944x1103.png 848w, https://substackcdn.com/image/fetch/$s_!Zg76!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8afa11-b9b4-41ef-a0a6-70b74429c0c3_1944x1103.png 1272w, https://substackcdn.com/image/fetch/$s_!Zg76!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8afa11-b9b4-41ef-a0a6-70b74429c0c3_1944x1103.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>NVIDIA&#8217;s point isn&#8217;t &#8220;copper will disappear&#8221;&#8212;it&#8217;s that <strong>54V distribution hits physical limits</strong> as racks approach MW scale, which is why they&#8217;re pushing <strong>800 VDC</strong> architectures for next&#8209;gen &#8220;AI factories.&#8221; The busbar number is a symptom of the underlying constraint: <strong>current, space, and heat</strong> at extreme rack densities.</p><div><hr></div><h2>The hidden copper bull case is &#8220;outside the fence&#8221;</h2><p>If you are looking for the true variant view in this trade, stop looking inside the data hall. 
The real edge is in the infrastructure that connects these hyperscale campuses to physical reality.</p><p>To model this correctly, you have to separate demand into two distinct buckets:</p><ul><li><p>The copper consumed strictly inside the data center ecosystem.</p></li><li><p>The massive amount of copper required outside the facility to support it, specifically for power generation and grid expansion.</p></li></ul><h3>The critical nuance: Aluminum feeds the site, but the substation is copper</h3><p>Because AI campuses require massive high-voltage interconnects, the immediate transmission lines often skew heavily toward aluminum. Utilities use aluminum conductors broadly across long distances to save weight and cost.</p><p>This is exactly where amateur models overclaim copper demand. If you want an accurate framework, use these two rules:</p><ul><li><p>Never assume the incoming high-voltage feeder cable is copper.</p></li><li><p>Always assume the substation, the step-down transformers, the grounding grid, and the entire heavy equipment stack are massively copper-intensive.</p></li></ul><p>S&amp;P puts hard numbers on where copper shows up in the grid stack:</p><ul><li><p>&#8220;Typical underground transmission lines&#8221; can use <strong>~19,500 kg copper per km</strong> (and distribution ~<strong>3,700 kg/km</strong>) &#8212; depending on design. (<a href="https://www.spglobal.com/en/research-insights/special-reports/copper-in-the-age-of-ai">S&amp;P Global</a>)</p></li><li><p>A <strong>2,500 kVA transformer</strong> can contain <strong>&gt;1 metric ton of copper</strong>, ~<strong>30% of its mass</strong>. (<a href="https://www.spglobal.com/en/research-insights/special-reports/copper-in-the-age-of-ai">S&amp;P Global</a>)</p></li></ul><h3>The number that should change your model</h3><p>When you aggregate the power infrastructure required to support these data centers, the forecast hits one million metric tons of copper per year by 2040. That volume splits cleanly down the middle: half a million tons for new renewable generation, and half a million tons for transmission and distribution, specifically heavy underground lines.</p><p>This is your outside-the-fence alpha. Even if the data hall itself becomes hyperefficient and replaces internal cables with fiber, the mandatory grid expansion acts as a massive, parallel call on copper.</p>
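<p>To make the outside-the-fence accounting concrete, here is a minimal sketch that applies the per-unit grid-stack figures quoted above. The route lengths and transformer count are hypothetical placeholders, not project data; swap in real interconnection numbers to use it:</p><pre><code>
# Rough "outside the fence" copper tally for a single campus interconnection.
# Per-unit intensities follow the grid-stack figures cited above; quantities
# below are hypothetical placeholders, not estimates for any real project.

UNDERGROUND_TRANSMISSION_KG_PER_KM = 19_500   # typical underground transmission line
DISTRIBUTION_KG_PER_KM = 3_700                # typical distribution line
TRANSFORMER_COPPER_TONNES = 1.0               # ~1 t per 2,500 kVA unit (~30% of its mass)

def grid_side_copper_tonnes(transmission_km, distribution_km, transformer_count):
    """Copper embedded in the grid infrastructure feeding the site, in metric tons."""
    cable_kg = (transmission_km * UNDERGROUND_TRANSMISSION_KG_PER_KM
                + distribution_km * DISTRIBUTION_KG_PER_KM)
    return cable_kg / 1_000 + transformer_count * TRANSFORMER_COPPER_TONNES

# Illustration: 10 km underground transmission, 5 km distribution, 40 transformers.
print(f"{grid_side_copper_tonnes(10, 5, 40):.0f} t of copper before a single rack is cabled")
</code></pre>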
<div><hr></div><h2>On-site power isn&#8217;t just a grid workaround. It&#8217;s a copper multiplier</h2><p>The old model was simple: build a facility and plug it into the grid. The new model for AI factories is entirely different. Because interconnection queues are stretching into the next decade, builder behavior is shifting from &#8220;plug into the grid&#8221; to &#8220;bring your own stability&#8221;:</p><ul><li><p>behind-the-meter generation</p></li><li><p>diesel backup at scale</p></li><li><p>microgrids</p></li><li><p><strong>BESS layered on UPS</strong></p></li></ul><p>Grid operators are now explicitly designing rule sets that <em>encourage</em> large loads to bring capacity. For example, Reuters describes PJM proposals that include &#8220;bring&#8209;your&#8209;own&#8209;generation&#8221; style options and &#8220;connect&#8209;and&#8209;manage&#8221; frameworks to accommodate the data&#8209;center load wave.<br><br>Here is the realization that breaks most models. When an operator goes off-grid or builds behind the meter to bypass utility bottlenecks, they do not escape the copper thesis. They actually double down on it. You are effectively forcing a software company to build a local power plant. Furthermore, because megawatt-class racks create violent electrical transients, operators have to install entirely new hardware layers like supercapacitors and battery banks just to condition the power and keep the GPUs from crashing. Bypassing the grid does not kill copper demand. It multiplies it.</p><h3>The LFP turbocharger (the part most models miss)</h3><p>Stationary storage is converging on Lithium Iron Phosphate (<strong>LFP</strong>) for safety and cost, and copper intensity <strong>varies materially by chemistry</strong>: LFP cells require nearly double the copper per kilowatt-hour of traditional high-energy cells.</p><p>This matters because it means:</p><ul><li><p>A data center adding &#8220;just&#8221; hundreds of MWh of backup/peak-shaving can add <strong>hundreds of tonnes</strong> of copper demand <em>quickly</em>, and</p></li><li><p>The chemistry mix can swing that number meaningfully.</p></li></ul><p>This is the hidden multiplier. Because LFP has a lower energy density, you need significantly more physical cells to reach your target capacity. Longer duration storage requires more cells, which means a massive increase in copper anode foil, internal busbars, and heavy system cabling.</p><p>The math here is undeniable. The few tons of copper per megawatt you save by switching to fiber optics are entirely real, but the battery energy storage system in the parking lot eats those savings alive.</p><p><strong>Net:</strong> fiber savings are real, but <strong>BESS can eat the savings</strong>.</p><div><hr></div><h2>The first crunch is equipment, not metal</h2><p>If you want the <em>operational</em> constraint that shows up before commodity tonnage does, it&#8217;s this: <strong>long&#8209;lead power equipment</strong>.</p><p>Reuters reporting highlights how grid equipment shortages are already structural. For example, <strong>generation step&#8209;up transformer delivery times averaged ~143 weeks in Q2 2025</strong>, and the industry has been responding with major factory investments because demand (renewables + data centers + electrification) is outrunning capacity.</p><p>This matters for investors because it changes the &#8220;copper trade&#8221; from a pure commodity view to an <strong>embedded&#8209;copper capex bottleneck</strong> view: transformers, switchgear, and HV gear can gate commissioning even if copper cathode is available. Separate Reuters coverage also points to <strong>transformer supply shortfalls</strong> as demand surges.</p><p><strong>Translation:</strong> the risk isn&#8217;t &#8220;running out of copper wire.&#8221; The risk is <strong>waiting on copper&#8209;intensive hardware</strong> while GPUs depreciate in a warehouse.</p><h2><strong>The Sovereign Supply Shock: Copper is now a national security asset</strong></h2><p>Stop looking at the 2040 macro forecasts. The physical copper market is already breaking down today. 
The market assumes supply deficits are a distant, corporate problem. The reality is that the geopolitical supply chain fractured in early 2026, and governments are now actively panicking.</p><h3><strong>The smelting engine is starving</strong> </h3><p>Consensus assumes China has an unbreakable monopoly on processing. The truth is much worse: they overbuilt their smelting capacity so aggressively that they literally ran out of rock.</p><p>Global mine supply cannot feed the furnaces. We have officially entered the &#8220;zero processing fee era.&#8221; Because raw ore is so scarce, the treatment and refining charges (TC/RCs) that smelters rely on have completely collapsed into negative territory. Smelters are now effectively paying to process rock. The bleeding got so bad that China&#8217;s top smelters were forced into an emergency pact to slash their primary production capacity by over 10% in 2026. The bottleneck is no longer a Chinese monopoly. It is a global engine starving for raw material.</p><h3><strong>The sovereign scramble</strong> </h3><p>Western governments have realized the math is broken. In late 2025, the United States officially added copper to the USGS Critical Minerals List. But they aren&#8217;t just writing policy&#8212;they are deploying capital.</p><p>The US government is now backing billion-dollar investment vehicles to actively bypass open markets and buy direct stakes in mega-mines across the Democratic Republic of Congo and Zambia. Copper is no longer trading purely on corporate demand. It is trading as a sovereign security asset.</p><p><strong>The foundation is cracking</strong> While nations scramble to secure new rock, the legacy foundation is decaying. Massive, aging mega-mines in Chile and Indonesia are fighting viciously depleting ore grades. Operators are pouring billions of dollars into capital expenditures just to watch their total output fall.</p><p>When you combine a starving Chinese smelting sector, aggressive US sovereign stockpiling, and collapsing legacy mine output, the supply deficit is not a 2040 model projection. It is a 2026 reality.</p><h2>Supply: why this can actually become a crunch (and why price is the clearing mechanism)</h2><p>If the AI buildout were the only demand shock, the copper market might survive. The problem is that AI is colliding directly with a global grid that is already tapped out by broader electrification.</p><p>The baseline math on the supply side is a structural gut punch:</p><ul><li><p>Global copper demand is scaling from <strong>28 million metric tons today to roughly 42 million metric tons by 2040.</strong></p></li><li><p>Without an unprecedented expansion in mining, we are staring down a massive shortfall of <strong>10 million metric tons over that same timeframe.</strong></p></li><li><p><strong>Primary mined supply is explicitly set to peak around 2030.</strong></p></li><li><p><strong>New discoveries are structurally paralyzed</strong>. Taking a mine from discovery to production takes well over a decade, suffocated by <strong>permitting, litigation, and environmental opposition.</strong></p></li></ul><p>Furthermore, the supply chain is not just geologically constrained. It is geopolitically bottlenecked. <strong>Just six countries control two-thirds of global mining production</strong>. 
The processing layer is even more vulnerable, with China holding forty percent of global smelting capacity and absorbing two-thirds of all mined concentrate imports.</p><h3><strong>What the crunch actually looks like in practice</strong> </h3><p>When the market breaks, it will fracture along two distinct fault lines:</p><ol><li><p><strong>The Commodity Crunch:</strong> Physical market tightness in raw cathodes, concentrates, and scrap.</p></li><li><p><strong>The Equipment Crunch:</strong> Severe bottlenecks in copper-heavy hardware like transformers, switchgear, and heavy busways.</p></li></ol><p>In the AI era, the equipment bottleneck hits first. Copper is not just a raw input. It is the core ingredient embedded in the long-lead electrical hardware that dictates whether a facility can actually turn on. You can secure all the GPUs in the world, but if you are stuck in a queue for a copper-dense transformer, your data center is just a very expensive warehouse.</p><p><strong>The Capex Reality Check</strong> We need to stop treating copper as a negligible line item. Look at a standard 230-megawatt greenfield AI campus. Assuming a heavy-redundancy baseline of 44 tons per megawatt, that single facility requires 10,000 tons of copper.</p><p>With copper decisively pushing past $11,500 per ton in late 2025, you are looking at over $115 million in raw copper costs for a single $3 billion data center. When you try to force a fast-twitch AI demand shock through a slow-twitch, highly concentrated mining supply chain, there is only one way the market clears. The price has to go up.</p><h3>So will price go up?</h3><p>No one can predict a near-term price target from a single piece of research, but the structural math is undefeated.</p><p>The market is facing an immediate, fast-twitch demand shock from AI data centers and global power grids. It is colliding with a slow-twitch supply chain constrained by 17-year mine development timelines and heavily bottlenecked processing infrastructure.</p><p>You cannot force an immediate, generational demand shock through a geologically fixed supply limit without breaking something. There is no magic technological fix for a physical shortage. Price is the only variable left to clear the board.</p><div><hr></div><h2>The real bear case: aluminum substitution (and why it&#8217;s bounded)</h2><p>If copper gets expensive enough, the industry engineers around it. That&#8217;s always true.</p><p>But here&#8217;s the nuance that matters for AI data centers:</p><ul><li><p>Aluminum is cheaper and lighter, but it has <strong>lower conductivity</strong>, so you need <strong>~1.6&#215;</strong> the cross-sectional area for the same performance.</p></li><li><p>Utilities often use aluminum conductors, especially where space/weight economics dominate. </p></li><li><p>Inside data centers, S&amp;P notes copper is preferred over aluminum for power distribution because data centers are space-constrained and have elevated fire/heat considerations.</p></li><li><p>Aluminum substitution runs into <strong>space and heat dissipation barriers</strong>. 
</p></li></ul><p>Even where aluminum is technically feasible, connectors/termination practices, heat rise, and real estate in dense busway runs become the practical constraints especially in high&#8209;density halls.<strong><br><br>The right conclusion isn&#8217;t &#8220;aluminum can&#8217;t happen.&#8221;</strong><br>It&#8217;s: aluminum can cap the upside in certain subsystems (busway), but <strong>it doesn&#8217;t break the thesis</strong> because the MW buildout (and the copper fortress outside the fence) is still marching forward.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PTjK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a35da22-58e6-4bc7-b24a-19754487a8f0_2016x1506.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PTjK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a35da22-58e6-4bc7-b24a-19754487a8f0_2016x1506.png 424w, https://substackcdn.com/image/fetch/$s_!PTjK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a35da22-58e6-4bc7-b24a-19754487a8f0_2016x1506.png 848w, https://substackcdn.com/image/fetch/$s_!PTjK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a35da22-58e6-4bc7-b24a-19754487a8f0_2016x1506.png 1272w, https://substackcdn.com/image/fetch/$s_!PTjK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a35da22-58e6-4bc7-b24a-19754487a8f0_2016x1506.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PTjK!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a35da22-58e6-4bc7-b24a-19754487a8f0_2016x1506.png" width="1200" height="896.7032967032967" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3a35da22-58e6-4bc7-b24a-19754487a8f0_2016x1506.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1088,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:837339,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://research.fpx.world/i/187813475?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a35da22-58e6-4bc7-b24a-19754487a8f0_2016x1506.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PTjK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a35da22-58e6-4bc7-b24a-19754487a8f0_2016x1506.png 424w, https://substackcdn.com/image/fetch/$s_!PTjK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a35da22-58e6-4bc7-b24a-19754487a8f0_2016x1506.png 848w, 
<div><hr></div><h2>The Optics Trade: Fiber wins the volume game</h2><p>Now the second half of the trade: <strong>optics</strong>.</p><h3>The optical paradox</h3><p>When you look at fiber substitution through the lens of a copper model, it looks like a minor headwind where you shave roughly four to five tons of metal off the per-megawatt baseline. But when you look at it through the lens of an optics model, the math goes completely parabolic.</p><ul><li><p>The sheer volume of high-speed optical links inside an AI factory does not scale linearly. It compounds multiplicatively across four distinct vectors:</p><ul><li><p>GPU count</p></li><li><p>cluster size</p></li><li><p>east&#8211;west bandwidth per GPU</p></li><li><p>network topology complexity</p></li></ul></li></ul><p>This creates a dynamic where <strong>both sides of the trade can be wildly bullish at the exact same time</strong>. 
If you take only one thing away from this structural shift, it is this:</p><ul><li><p><strong>Copper is the tonnage story (power infrastructure).</strong></p></li><li><p><strong>Optics is the volume story (links and upgrades).</strong></p></li></ul><h3>Copper is mostly Day&#8209;0 CapEx; optics behaves more like recurring spend</h3><p>This is the nuance investors miss.</p><p>Copper gets installed once (or at least infrequently) as part of facility capex.<br>Optics gets refreshed as link speeds step up &#8212; and that cadence is accelerating.</p><p>LightCounting expects <strong>massive deployments of 800G transceivers in 2025&#8211;2026</strong>, with <strong>1.6T and 3.2T &#8220;soon after.&#8221;</strong> <br><br>Coherent&#8217;s market view similarly frames a rapid shift toward 800G+ and beyond this cycle. </p><p>So a &#8220;more optical&#8221; data center isn&#8217;t just a one-time capex shift &#8212; it&#8217;s a <strong>faster upgrade loop</strong> across the photonics supply chain.</p><h3>CPO timing: inevitable, but not instantaneous</h3><p>Co-packaged optics (CPO) is the logical endgame for power/thermal efficiency at extreme bandwidth, but timelines matter.</p><p>Reuters reported Jensen Huang saying NVIDIA will use co-packaged optics in networking switch chips through 2026, but that it&#8217;s <strong>not reliable enough yet for flagship GPUs</strong>, with mass adoption potentially <strong>2028+</strong>. (<a href="https://www.reuters.com/technology/nvidia-ceo-says-power-saving-optical-chip-tech-will-need-wait-wider-use-2025-03-19/?utm_source=chatgpt.com">Reuters</a>)</p><p>That matches the practical deployment path most builders see:</p><ul><li><p><strong>Today:</strong> copper dominates ultra-short reach; fiber dominates rack-to-rack / row-to-row at high speeds</p></li><li><p><strong>Mid term:</strong> more optics density (more 800G/1.6T), plus active copper where it still makes sense</p></li><li><p><strong>Later:</strong> CPO spreads as reliability and manufacturing mature &#8212; pulling more of the optics value chain into packaging/assembly rather than pluggable modules</p></li></ul><p></p><h2>CPO isn&#8217;t for &#8220;adjacent cabinets&#8221; &#8212; it&#8217;s for switch power/thermal at extreme radix</h2><p>A common pushback (often seen on social media) is: &#8220;Why would you need CPO just to connect adjacent cabinets&#8212;use copper.&#8221; That&#8217;s directionally right <strong>for very short reach</strong>.</p><p>But CPO&#8217;s economic target isn&#8217;t &#8220;one short cable.&#8221; It&#8217;s the <strong>power/thermal cost of pluggables and long electrical paths</strong> once you&#8217;re pushing 800G&#8594;1.6T class bandwidth across <strong>high&#8209;radix switches</strong> and large fabrics. 
That&#8217;s why NVIDIA&#8217;s public messaging has been: <strong>switch chips first (through 2026), GPUs later</strong>, with broader adoption potentially <strong>2028+</strong> as reliability/manufacturing mature.</p><p><strong>Net:</strong> Copper remains the rational choice where geometry makes reach trivial; CPO shows up where <strong>power per bit and density</strong> dominate.</p><h2>The CPU comeback: agentic workloads pull CPUs (and networking) back into the frame</h2><p><a href="https://research.fpx.world/p/beyond-the-gpu-the-2026-cpu-bottleneck?lli=1">The most underpriced second-order effect in the &#8220;AI factories&#8221; narrative may be <strong>CPU demand inside the stack</strong>; we have already published a dedicated article on this</a>, but since then NVIDIA and Meta have made it explicit with a multiyear deal that includes not only GPUs (Blackwell/Rubin) but also NVIDIA <strong>Grace and future Vera CPUs</strong>, with Grace positioned for broader data processing tasks and AI&#8209;agent style workloads.</p><p>If CPU cycles become the limiter for certain inference/agentic pipelines (data prep, retrieval, orchestration, safety layers), it strengthens the case for <strong>tunable CPU:GPU ratios</strong> over time, i.e., potential <strong>CPU disaggregation</strong> rather than permanently &#8220;fixed&#8221; GPU&#8209;centric rack designs.</p><div><hr></div><h2>The actual &#8220;so what&#8221; (for builders and investors)</h2><h3>If you&#8217;re building neocloud / AI campuses</h3><ul><li><p>The risk is not &#8220;running out of copper wire.&#8221;</p></li><li><p>The risk is getting stuck in the queue for <strong>copper-heavy power hardware</strong> and grid interconnect approvals.</p></li></ul><p>If you do not secure your transformers and switchgear early, your GPUs will simply depreciate in a dark warehouse. Power access is the actual product you are building.</p><h3>If you&#8217;re modeling copper demand</h3><p>Delete the "200 tons per gigawatt" facility assumption from your spreadsheet immediately. It is a fundamental misread of a busbar subsystem. Your baseline must start at 30 to 40 tons of copper per megawatt of IT capacity for the inside-the-fence build. If you assume aggressive fiber substitution, shave four to five tons off that number. Then, aggressively model the outside-the-fence allocation. The grid and power generation infrastructure required to support these sites acts as a massive, parallel copper multiplier.</p><h3>If you&#8217;re modeling optics demand</h3><p>Stop worrying about copper substitution. Optics demand does not grow because copper dies. Optics demand explodes because the sheer volume of physical links multiplies alongside GPU cluster sizes, spine-leaf network complexity, and the relentless upgrade cadence from 800G to 1.6T and beyond.</p>
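<p>A minimal sketch of those two models, with every constant treated as an assumption taken from the rules of thumb above (30&#8211;40 t/MW baseline, roughly 44 t/MW for heavy redundancy, roughly $11,500 per ton, 4&#8211;5 t/MW shaved for aggressive fiber substitution) rather than as engineering estimates:</p><pre><code># Rough demand model for the rules of thumb above. Every constant here is an
# assumption taken from this article, not an engineering estimate.

def copper_tons_inside_the_fence(it_megawatts: float,
                                 tons_per_mw: float = 35.0,
                                 fiber_substitution_tons_per_mw: float = 0.0) -> float:
    """Inside-the-fence copper for an AI campus.
    tons_per_mw: 30-40 t/MW baseline; ~44 t/MW for a heavy-redundancy design.
    fiber_substitution_tons_per_mw: shave ~4-5 t/MW for aggressive fiber use."""
    return it_megawatts * (tons_per_mw - fiber_substitution_tons_per_mw)

def copper_cost_usd(tons: float, usd_per_ton: float = 11_500.0) -> float:
    return tons * usd_per_ton

def optical_links_rough_count(gpus: int, links_per_gpu: float = 8.0) -> float:
    """Crude link-count proxy for the optics side of the trade.
    links_per_gpu is an assumption; real counts depend on topology, radix,
    and per-GPU east-west bandwidth, and they step up with each speed bump."""
    return gpus * links_per_gpu

# Example: the 230 MW heavy-redundancy campus discussed earlier.
tons = copper_tons_inside_the_fence(230, tons_per_mw=44)
print(f"inside-the-fence copper: {tons:,.0f} t, "
      f"~${copper_cost_usd(tons) / 1e6:,.0f}M at $11,500/t")
# Outside-the-fence grid and generation copper is a separate, parallel multiplier.
</code></pre><p>The constants will move; the shape of the model is the point: the copper line scales with delivered IT megawatts, while the optics line scales with link count and refresh cadence.</p>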
      <p>
          <a href="https://research.fpx.world/p/the-fake-ai-copper-debate-mispricing">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[The NeoCloud Guide to Robotics: Inside the Next Compute Gold Rush]]></title><description><![CDATA[How neoclouds adapt to robotics and why the winners will look more like &#8220;world infrastructure&#8221; than &#8220;GPU rental&#8221;]]></description><link>https://research.fpx.world/p/the-neocloud-guide-to-robotics-inside</link><guid isPermaLink="false">https://research.fpx.world/p/the-neocloud-guide-to-robotics-inside</guid><dc:creator><![CDATA[FPX AI]]></dc:creator><pubDate>Thu, 05 Feb 2026 18:47:09 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/4a84de9b-4fec-4bd9-af9e-e332e7383053_1456x816.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>The thesis in 3 bullets</h3><ol><li><p><strong>The unit shift:</strong> In LLM land, cloud yield is <a href="https://docs.nvidia.com/nim/benchmarking/llm/latest/metrics.html">tokens/sec</a>. In robotics, the dominant unit becomes <strong><a href="https://toyotaresearchinstitute.github.io/lbm1/">high&#8209;quality experience hours</a></strong><a href="https://toyotaresearchinstitute.github.io/lbm1/"> (real + synthetic).</a> Neoclouds win by maximizing <strong>Experience Yield</strong>, not peak TFLOPS.</p></li><li><p><strong>The infrastructure split:</strong> Robotics creates <em>two</em> giant cloud shapes:</p><ul><li><p><strong><a href="https://www.nvidia.com/en-us/data-center/nvlink/">Scale&#8209;up brain training</a></strong><a href="https://www.nvidia.com/en-us/data-center/nvlink/"> </a>(world models / foundation policy backbones) &#8594; rack&#8209;scale fabrics and high&#8209;bandwidth memory.</p></li><li><p><strong>Scale&#8209;out experience production</strong> (simulation + synthetic data generation + regression) &#8594; throughput clusters and brutal storage/network economics.<br>Neoclouds that only buy &#8220;the biggest rack&#8221; will fail on margin.</p></li></ul></li><li><p><strong>The moat:</strong> The winning robotics cloud isn&#8217;t the one with the most GPUs; it&#8217;s the one that solves <strong>(a) <a href="https://www.mckinsey.com/industries/automotive-and-assembly/our-insights/development-in-the-mobility-technology-ecosystem-how-can-5g-help">the Optical Tax</a></strong><a href="https://www.mckinsey.com/industries/automotive-and-assembly/our-insights/development-in-the-mobility-technology-ecosystem-how-can-5g-help"> (data movement costs of thick sensor logs) </a>and <strong>(b) Asset Ops</strong> (versioning the physical world as reliably as code).</p></li></ol><pre><code>Disclaimer: This is not investment advice. It&#8217;s an analytical framework
+ a public-market watchlist for understanding how &#8220;robotics workloads&#8221;
could re-route compute spend across the stack. Do your own work /
consult a licensed professional before acting.</code></pre><div><hr></div><h2>1) Why robotics is the next stress test for neoclouds</h2><p>Neoclouds were born in an era where the dominant workload was simple to describe: <strong>train or serve large models</strong>. The customer asked for more GPU memory, faster interconnect, more tokens, lower $/token. The data looked like text and embeddings; it moved cheaply, and it compressed well.</p><p>Robotics breaks that mental model.</p><p>Robots are &#8220;AI systems&#8221; that must survive physics, latency, safety constraints, regulatory constraints, and messy real&#8209;world distribution shifts. A robotics company&#8217;s competitive advantage isn&#8217;t just model weights &#8212; it&#8217;s the <strong>closed loop</strong>: data &#8594; learning &#8594; evaluation &#8594; deployment &#8594; more data. And that closed loop runs on <em>thick</em> telemetry: multi&#8209;camera video, lidar/radar streams, IMU, joint states, tactile signals, audio, maps, event logs, and operator feedback.</p><p>This isn&#8217;t theoretical scale. Industrial robotics continues to compound. The latest World Robotics statistics (via the International Federation of Robotics) reported <strong><a href="https://ifr.org/ifr-press-releases/global-robot-demand-in-factories-doubles-over-10-years?utm_source=chatgpt.com">542,000 industrial robots installed in 2024</a></strong><a href="https://ifr.org/ifr-press-releases/global-robot-demand-in-factories-doubles-over-10-years?utm_source=chatgpt.com">, and forecasts suggesting the </a><strong><a href="https://ifr.org/ifr-press-releases/global-robot-demand-in-factories-doubles-over-10-years?utm_source=chatgpt.com">700,000 annual mark will be surpassed by 2028</a></strong><a href="https://ifr.org/ifr-press-releases/global-robot-demand-in-factories-doubles-over-10-years?utm_source=chatgpt.com">.</a></p><p>So the question isn&#8217;t &#8220;will robotics use compute?&#8221; It&#8217;s: <strong>what kind of compute, where, and packaged into what product?</strong></p><p><a href="https://www.mckinsey.com/capabilities/tech-and-ai/our-insights/the-evolution-of-neoclouds-and-their-next-moves?">McKinsey&#8217;s critique of neoclouds is the right framing: BMaaS economics are fragile; long&#8209;term viability requires moving &#8220;up the stack&#8221; into AI&#8209;native services.</a> Robotics is one of the cleanest &#8220;up&#8209;the&#8209;stack&#8221; opportunities because the infra primitives alone are not enough &#8212; the workflow plumbing is where the pain lives.</p><p>But there&#8217;s a trap: <strong>neoclouds shouldn&#8217;t try to be a robotics software company.</strong> The win is orchestration &#8212; being the place where robotics stacks run 10&#215; faster and cheaper than generic cloud.</p><p>Hold that thought. 
First we need a shared mental map of robotics workloads.</p><div><hr></div><h2>2) Robotics workloads 101 &#8212; a quick mental map</h2><p>Most robotics stacks can be understood as <strong>three loops</strong> with different latency and compute constraints:</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vmJk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89ac6b71-ec09-45c7-b559-9dd29868acc0_1758x1396.png"><img src="https://substackcdn.com/image/fetch/$s_!vmJk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89ac6b71-ec09-45c7-b559-9dd29868acc0_1758x1396.png" alt="" loading="lazy" /></a></figure></div>
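<p>As a toy illustration of how those loops turn into placement decisions, here is a minimal sketch; the latency thresholds are illustrative assumptions, and the sections below spell out the real constraints:</p><pre><code># Toy placement rule for the three loops below. The thresholds are
# illustrative assumptions, not hard limits from any robotics stack.

def placement_for(latency_budget_ms: float) -> str:
    """Map a workload's latency budget to the loop (and hardware tier) it lives in."""
    if latency_budget_ms >= 60_000:
        return "Loop 3: cloud / on-prem clusters (training, sim, fleet analytics)"
    if latency_budget_ms >= 50:
        return "Loop 2: edge gateways / regional POPs (assist, teleop relay)"
    return "Loop 1: on-robot compute (reflex, control, safety interlocks)"

print(placement_for(10))         # reflex / control
print(placement_for(200))        # assist / supervision
print(placement_for(3_600_000))  # improve / learn
</code></pre>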
<h3>Loop 1: Reflex / Control (milliseconds, on&#8209;robot)</h3><ul><li><p>Motor control, stabilization, safety interlocks</p></li><li><p>Perception at sensor rate (vision, lidar, tactile preprocessing)</p></li><li><p>Local state estimation and fast collision avoidance</p></li></ul><p><strong>Constraint:</strong> deterministic latency and safety &#8594; <strong>edge compute</strong>.</p><h3>Loop 2: Assist / Supervision (tens to hundreds of ms, edge + nearby compute)</h3><ul><li><p>Higher&#8209;level perception fusion</p></li><li><p>Local planning with bigger context</p></li><li><p>Teleoperation and human&#8209;in&#8209;the&#8209;loop interventions</p></li><li><p>Fleet &#8220;assist&#8221; services: map updates, localized reasoning, anomaly detection</p></li></ul><p><strong>Constraint:</strong> network jitter becomes the bottleneck before FLOPS. This is where &#8220;regional edge POPs&#8221; matter.</p><h3>Loop 3: Improve / Learn (minutes to days, mostly cloud + some on&#8209;prem)</h3><ul><li><p>Training policy models, vision backbones, VLA/VLM adapters</p></li><li><p>Reinforcement learning (RL), offline RL, imitation learning</p></li><li><p>Simulation, synthetic data generation (SDG), domain randomization</p></li><li><p>Regression testing and evaluation harnesses</p></li><li><p>Dataset curation, labeling, provenance, compliance</p></li></ul><p><strong>Constraint:</strong> throughput, reproducibility, cost &#8594; the domain of clouds, but not &#8220;generic&#8221; clouds.</p><p>If you internalize these three loops, the edge/cloud split becomes obvious:<br><strong>robots do actions at the edge; clouds manufacture improvement.</strong></p><p>That&#8217;s why the right metaphor for neoclouds in robotics is not &#8220;robot factories&#8221; (sounds like hardware). 
It&#8217;s <strong>Experience Refineries</strong>: facilities that take raw telemetry ore and refine it into policy gold.</p><div><hr></div><h2>3) Types of robots &#8212; and why &#8220;robotics&#8221; isn&#8217;t one workload</h2><p>Robotics isn&#8217;t a single market; it&#8217;s a set of workload families with different &#8220;experience physics.&#8221;</p><h3>A) Industrial arms (ABB (ABB), FANUC (6954.T), etc.)</h3><ul><li><p>Structured environments, repetitive tasks</p></li><li><p>High value on <strong>reliability</strong>, calibration, and deterministic behavior</p></li><li><p>Data is often proprietary and deployment environments are sensitive (IP, safety)</p></li></ul><p><strong>Compute shape:</strong> more <strong>regression testing</strong>, digital twins, PLC integration; less open&#8209;world reasoning (for now).</p><h3>B) Mobile robots in warehouses / logistics</h3><ul><li><p>AMRs, picking, sorting, inventory</p></li><li><p>Rich perception (cameras + depth), but environments still semi&#8209;structured</p></li><li><p>Heavy focus on fleet management, uptime, and map/scene updates</p></li></ul><p><strong>Compute shape:</strong> lots of <strong>fleet ops + incident replay + simulation of edge cases</strong>.</p><h3>C) Drones / aerial systems</h3><ul><li><p>Tight power budgets</p></li><li><p>Intermittent connectivity</p></li><li><p>Strong need for on&#8209;device autonomy; cloud used post&#8209;flight for analytics + training</p></li></ul><p><strong>Compute shape:</strong> edge-first inference, cloud for after&#8209;action learning and simulation.</p><h3>D) Autonomous vehicles / robotaxis (Alphabet (GOOG/GOOGL) subsidiary Waymo as an existence proof)</h3><ul><li><p>Extreme safety requirements</p></li><li><p>Massive logged datasets + rigorous simulation requirements</p></li><li><p>Closed-loop evaluation is the product</p></li></ul><p>Waymo has explicitly studied scaling laws in autonomous driving for motion forecasting and planning, reporting predictable gains as compute/data scale. <br>That matters because it signals that &#8220;bigger models + more compute&#8221; isn&#8217;t only an LLM story &#8212; but the evaluation loop is far stricter.</p><h3>E) Humanoids (Tesla (TSLA), plus many private players)</h3><p>Humanoids are the most compelling &#8220;generalist robotics&#8221; narrative &#8212; but they are also the most operationally messy. They&#8217;re currently defined by:</p><ul><li><p>multi&#8209;modal perception,</p></li><li><p>dexterous manipulation,</p></li><li><p>and <em>early-stage reliance on teleop / human-in-the-loop</em>.</p></li></ul><p>This is important because <strong>teleoperation is a near&#8209;term cloud workload</strong> even before autonomy is &#8220;solved.&#8221; More on that later.</p><h3>F) Soft robotics / medical / bio&#8209;inspired systems</h3><ul><li><p>High-fidelity physics, contact dynamics, deformables</p></li><li><p>Simulation becomes harder and often more CPU&#8209;bound or specialized</p></li></ul><p><strong>Compute shape:</strong> pushes more &#8220;physics correctness per dollar,&#8221; not just rendering.</p><p>The takeaway: robotics workloads fragment by robot type. 
The neocloud product must be modular &#8212; not &#8220;one GPU SKU fits all.&#8221;</p><div><hr></div><h2>4) Training, RL, synthetic data &#8212; where the real cloud spend lives</h2><p>This is where robotics starts to look less like &#8220;inference at scale&#8221; and more like &#8220;manufacturing at scale.&#8221;</p><h3>4.1 Training is no longer just &#8220;pretrain + finetune&#8221;</h3><p>Robotics training is increasingly a blend of:</p><ul><li><p><strong>Foundation backbones</strong> (vision, VLM/VLA, world models)</p></li><li><p><strong>Behavior cloning / imitation learning</strong> (from demos, teleop, logs)</p></li><li><p><strong>Offline RL</strong> (learning from logged behavior)</p></li><li><p><strong>Online RL</strong> (learning from simulation and, carefully, real fleet)</p></li><li><p><strong>Tool&#8209;using agents</strong> (planning, reasoning, memory, retrieval, constraints)</p></li></ul><p>This is why &#8220;world models&#8221; matter: they compress physical interaction into learnable structure. And why robotics teams talk about &#8220;foundation models&#8221; differently: the model must predict outcomes in the world, not just tokens.</p><h3>4.2 Synthetic data generation (SDG) is the new pretraining corpus</h3><p>For a lot of robotics categories, you will never get enough real-world rare events:</p><ul><li><p>corner cases in warehouses,</p></li><li><p>unusual lighting/weather,</p></li><li><p>rare contact events,</p></li><li><p>edge-case human behaviors,</p></li><li><p>safety-critical near-misses.</p></li></ul><p>Simulation and SDG are how teams buy coverage.</p><p>You can see SDG tooling converging from multiple ecosystems:</p><ul><li><p><strong><a href="https://docs.unity3d.com/Packages/com.unity.perception%401.0/manual/index.html?">Unity Software (U)</a></strong> ships a Perception toolkit for generating large-scale synthetic datasets for computer vision training and validation.</p></li><li><p><strong><a href="https://www.unrealengine.com/en-US/spotlights/training-robots-from-animated-films-to-virtual-environments?">Epic Games</a></strong><a href="https://www.unrealengine.com/en-US/spotlights/training-robots-from-animated-films-to-virtual-environments?">&#8217;</a> Unreal Engine ecosystem is used for photorealistic virtual environments; companies like Duality discussed adopting USD as a standard format for large environment datasets and real-time simulation workflows.</p></li><li><p><a href="https://deepmind.google/blog/open-sourcing-mujoco/?">Physics engines like </a><strong><a href="https://deepmind.google/blog/open-sourcing-mujoco/?">MuJoCo</a></strong><a href="https://deepmind.google/blog/open-sourcing-mujoco/?"> </a>have been open-sourced by Google DeepMind and are widely used for control/RL research.</p></li><li><p><strong><a href="https://gazebosim.org/api/sim/9/physics.html?">Gazebo</a></strong> supports choosing physics engines at runtime via an abstraction layer &#8212; useful for experimentation and for matching simulation fidelity to task requirements.</p></li></ul><p>This diversity matters for neoclouds: the cloud must run <em>all of it</em>, not just one vendor&#8217;s stack.</p><h3>4.3 Log-based synthetic simulation becomes a first-class primitive</h3><p>A lot of autonomy teams don&#8217;t generate worlds from scratch &#8212; they start from reality and perturb it.</p><p><a href="https://www.appliedintuition.com/blog/enhancing-autonomous-vehicle-development-the-role-of-log-based-synthetic-simulation?">Applied Intuition (private) </a>describes log-based 
synthetic simulation as using real-world data within controlled simulation environments to improve validation and safety. <br>This pattern generalizes beyond AVs: &#8220;log &#8594; replay &#8594; perturb &#8594; evaluate&#8221; is becoming a standard robotics engineering workflow.</p><p>The implication for neoclouds: <strong>storage + ingest + replay pipelines</strong> become as important as GPUs.</p><div><hr></div><h2>5) What&#8217;s the &#8220;inference equivalent&#8221; for robotics?</h2><p>If you come from LLM infrastructure, &#8220;inference&#8221; means: serve tokens; optimize latency and throughput; measure tokens/sec and $/1M tokens.</p><p>Robotics &#8220;inference&#8221; is not a single thing. It&#8217;s a hierarchy:</p><h3>5.1 Act vs Think: the hybrid autonomy model</h3><p>A durable architecture pattern is:</p><ul><li><p><strong>Small, fast model (&#8220;act&#8221;)</strong> runs on the robot: low-latency control and policy execution.</p></li><li><p><strong>Large, slow model (&#8220;think&#8221;)</strong> runs off-robot (often cloud or nearby edge): planning, long-horizon reasoning, tool use, memory, scenario analysis.</p></li><li><p><strong>An agentic system</strong> ties them together: policy constraints, safety checks, fallback behaviors, and the ability to ask for help (teleop).</p></li></ul><p>This is not hype &#8212; it&#8217;s forced by physics and reliability. If your robot needs 10&#8211;20 ms control loops, cloud round trips don&#8217;t fit. But cloud is still valuable as:</p><ul><li><p>long-term memory,</p></li><li><p>fleet-level learning,</p></li><li><p>and heavy reasoning/analysis that can tolerate latency.</p></li></ul><h3>5.2 The teleoperation bridge (the underrated near-term workload)</h3><p>Before policies are fully autonomous, many robots &#8212; especially humanoids &#8212; rely on:</p><ul><li><p>VR/AR operator interfaces,</p></li><li><p>multi-camera video streaming,</p></li><li><p>low-latency relay servers,</p></li><li><p>and continuous recording for later learning.</p></li></ul><p>This creates a &#8220;robotics inference&#8221; workload that looks like:</p><ul><li><p><strong>real-time video ingest + transcode + relay</strong> (edge POPs),</p></li><li><p><strong>operator session management</strong> (security, audit logs),</p></li><li><p><strong>labeling hooks</strong> (marking moments of intervention),</p></li><li><p>and <strong>policy improvement pipelines</strong> that turn teleop into training data.</p></li></ul><p>This is immediate revenue <em>before</em> the &#8220;world model&#8221; is perfect.</p><h3>5.3 The important nuance: cloud doesn&#8217;t replace edge &#8212; it weaponizes it</h3><p>Robots with good edge processors still need cloud because:</p><ul><li><p>you need to train and continuously improve,</p></li><li><p>you need fleet analytics and regression,</p></li><li><p>you need simulation and SDG,</p></li><li><p>and you need to manage the digital twin / asset graph.</p></li></ul><p>Edge compute reduces bandwidth and improves autonomy &#8212; but it increases the value of cloud <strong>as an improvement engine</strong>.</p><div><hr></div><h2>6) Hardware: what neoclouds actually need &#8212; and what to stop buying by reflex</h2><p>This is where the robotics workload map turns into a build plan.</p><h3>6.1 The core idea: robotics forces a &#8220;two-cloud&#8221; hardware strategy</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!5eoJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed5a661d-cb2c-4952-8b61-a7891e1031e0_1758x1480.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5eoJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed5a661d-cb2c-4952-8b61-a7891e1031e0_1758x1480.png 424w, https://substackcdn.com/image/fetch/$s_!5eoJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed5a661d-cb2c-4952-8b61-a7891e1031e0_1758x1480.png 848w, https://substackcdn.com/image/fetch/$s_!5eoJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed5a661d-cb2c-4952-8b61-a7891e1031e0_1758x1480.png 1272w, https://substackcdn.com/image/fetch/$s_!5eoJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed5a661d-cb2c-4952-8b61-a7891e1031e0_1758x1480.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5eoJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed5a661d-cb2c-4952-8b61-a7891e1031e0_1758x1480.png" width="1456" height="1226" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ed5a661d-cb2c-4952-8b61-a7891e1031e0_1758x1480.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1226,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:329436,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://research.fpx.world/i/186542639?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed5a661d-cb2c-4952-8b61-a7891e1031e0_1758x1480.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5eoJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed5a661d-cb2c-4952-8b61-a7891e1031e0_1758x1480.png 424w, https://substackcdn.com/image/fetch/$s_!5eoJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed5a661d-cb2c-4952-8b61-a7891e1031e0_1758x1480.png 848w, https://substackcdn.com/image/fetch/$s_!5eoJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed5a661d-cb2c-4952-8b61-a7891e1031e0_1758x1480.png 1272w, https://substackcdn.com/image/fetch/$s_!5eoJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed5a661d-cb2c-4952-8b61-a7891e1031e0_1758x1480.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" 
stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>Neoclouds that only optimized for &#8220;big model training&#8221; will miss the largest volume workload in robotics: <strong>experience production</strong>.</p><p>You need two primary hardware domains:</p><h4>Domain A: Brain training (scale&#8209;up + high&#8209;bandwidth)</h4><p>These are your &#8220;world model / backbone&#8221; clusters: large models, large activations, big collective ops, high bandwidth.</p><p>Rack-scale systems like NVIDIA&#8217;s Vera Rubin NVL72 are explicitly positioned as a rack&#8209;scale AI supercomputer unifying <strong>72 GPUs and 36 CPUs</strong> with high-speed interconnects. <br>You can treat &#8220;NVL72-class&#8221; as a proxy for the scale&#8209;up direction of travel: dense accelerators + fast fabrics.</p><p><strong>But here&#8217;s the key:</strong> most robotics companies will not run these 24/7 at full utilization. Their demand will be bursty: big training runs, then long stretches of evaluation and sim.</p><h4>Domain B: Experience production (scale&#8209;out throughput)</h4><p>This is where most &#8220;robot hours&#8221; are minted:</p><ul><li><p>simulation rollouts,</p></li><li><p>rendering,</p></li><li><p>SDG,</p></li><li><p>regression,</p></li><li><p>log replay and perturbation.</p></li></ul><p>This workload often doesn&#8217;t need NVLink-scale fabrics. It wants:</p><ul><li><p>cheap GPU throughput per dollar,</p></li><li><p>strong video/graphics pipelines,</p></li><li><p>lots of CPU,</p></li><li><p>fast local SSD,</p></li><li><p>and a storage backend that doesn&#8217;t collapse under asset churn.</p></li></ul><p><a href="https://www.nvidia.com/en-us/data-center/l40s/?">Mid-tier &#8220;universal&#8221; GPUs like the NVIDIA L40S are marketed for both AI and graphics/video workloads, including 3D graphics, rendering, and video. </a><br>Whether it&#8217;s L40S specifically or an equivalent class, neoclouds need a <strong>render/sim farm tier</strong> that is not priced like frontier LLM training.</p><h3>6.2 NVLink isn&#8217;t &#8220;overkill&#8221; &#8212; it&#8217;s just not the default answer for sim farms</h3><p>A fair critique is: &#8220;Simulation is parallel. 
Why pay the NVLink tax?&#8221;</p><p>The honest answer:</p><ul><li><p><strong>Most</strong> sim farms are embarrassingly parallel &#8594; scale&#8209;out wins.</p></li><li><p>NVLink-scale becomes valuable when you&#8217;re synchronizing multi&#8209;agent environments tightly, training giant world models, or doing massive multi&#8209;modal joint training where GPU&#8211;GPU bandwidth becomes a limiter.</p></li></ul><p>So the clean position is:</p><ul><li><p><strong>NVL72-class</strong> is the &#8220;brain trainer.&#8221;</p></li><li><p><strong>Throughput clusters</strong> are the &#8220;experience factory.&#8221;</p></li></ul><p>Neoclouds need both.</p><h3>6.3 The robotics &#8220;Optical Tax&#8221; &#8212; and the nuance that makes it real</h3><p>Robotics fleets generate <strong>thick telemetry</strong>. If you assume:</p><ul><li><p>multiple cameras,</p></li><li><p>high frame rates,</p></li><li><p>and continuous recording,</p></li></ul><p>you can easily land in &#8220;petabytes per day&#8221; territory.</p><p>But there&#8217;s a nuance worth stating explicitly to sound credible:</p><ul><li><p><strong>Inside a data center</strong>, terabit-scale networking is manageable. The issue is not switch port capacity.</p></li><li><p>The cost lives in the <strong>WAN transit bill</strong>, <strong>ingest architecture</strong>, and the operational friction of moving, storing, and curating multi&#8209;modal logs at scale.</p></li></ul><p>This is why edge preprocessing is not just a technical preference &#8212; it&#8217;s a solvency requirement.</p><p>And it connects directly to optical networking trends. Companies like Marvell (MRVL) highlight 800G coherent pluggable optical modules (ZR/ZR+) for multi-site AI training and data center interconnect, enabling geographically distributed clusters. <br>Meanwhile, NVIDIA (NVDA) leadership has publicly signaled that co-packaged optics are promising but may take until <strong>2028 or beyond</strong> for broad adoption, citing reliability constraints.</p><p><strong>Translation for robotics neoclouds:</strong><br>You don&#8217;t &#8220;wait for optics&#8221; to solve your data plane. 
You design for bandwidth scarcity now by pushing:</p><ul><li><p>compression,</p></li><li><p>event-triggered logging,</p></li><li><p>on-device embeddings,</p></li><li><p>and selective upload.</p></li></ul><p>Cloud value shifts from &#8220;store everything&#8221; to &#8220;refine the right things.&#8221;</p><h3>6.4 Edge hardware is already good enough to change cloud economics</h3><p>Two examples (not as endorsements &#8212; as proof that edge compute is real):</p><ul><li><p><a href="https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/?">NVIDIA Jetson Orin Nano modules deliver up to <strong>67 TOPS</strong> with <strong>7W&#8211;25W</strong> power options.</a></p></li><li><p><a href="https://www.qualcomm.com/internet-of-things/products/robotics-rb6-platform?">Qualcomm (QCOM)</a> Robotics RB6 advertises <strong>70&#8211;200 TOPS (INT8)</strong> at low power.</p></li></ul><p>As edge gets stronger, the cloud&#8217;s role becomes less about &#8220;real-time autonomy&#8221; and more about:</p><ul><li><p>training,</p></li><li><p>simulation,</p></li><li><p>evaluation,</p></li><li><p>fleet analytics,</p></li><li><p>and long-horizon coordination.</p></li></ul><h3>6.5 A practical hardware map: workload &#8594; where it runs &#8594; what it needs</h3><p>Below is a &#8220;useful, not perfect&#8221; mapping. Treat ratios as <strong>rules of thumb</strong>, not laws.</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!z4co!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba366f1-3c20-4c71-9ccb-f06988441caa_1746x1154.png"><img src="https://substackcdn.com/image/fetch/$s_!z4co!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba366f1-3c20-4c71-9ccb-f06988441caa_1746x1154.png" alt="" loading="lazy" /></a></figure></div>
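<p>For teams that want to encode this mapping in capacity-planning tooling, here is the same map sketched as plain data; the groupings follow the sections below, while the field names and values are illustrative assumptions:</p><pre><code># The workload-to-hardware map below, expressed as plain data.
# Groupings follow this article; field names and values are illustrative.

HARDWARE_MAP = {
    "loop_1_reflex": {
        "runs_on": "on-robot SoCs / embedded GPU-NPUs + microcontrollers",
        "optimized_for": ["latency", "determinism", "power"],
        "key_hardware": ["edge accelerators", "fast local memory", "sensor IO"],
    },
    "loop_2_assist_teleop": {
        "runs_on": "edge gateways + regional POPs (+ on-prem micro-cloud racks)",
        "optimized_for": ["jitter", "video relay", "security"],
        "key_hardware": ["NICs", "video encode/decode", "moderate GPUs", "HA CPUs"],
    },
    "loop_3_learn": {
        "world_model_training": ["rack-scale accelerators", "high-bandwidth fabrics", "DPUs/NICs"],
        "sim_and_sdg": ["throughput GPUs", "CPU-rich nodes", "high-IOPS local SSD"],
        "eval_and_regression": ["many CPU cores", "fast object storage", "reproducible runtimes"],
        "data_plane": ["network + storage first", "CPU", "optional GPUs for embeddings"],
    },
}
</code></pre>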
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5ba366f1-3c20-4c71-9ccb-f06988441caa_1746x1154.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:962,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:299425,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://research.fpx.world/i/186542639?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba366f1-3c20-4c71-9ccb-f06988441caa_1746x1154.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!z4co!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba366f1-3c20-4c71-9ccb-f06988441caa_1746x1154.png 424w, https://substackcdn.com/image/fetch/$s_!z4co!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba366f1-3c20-4c71-9ccb-f06988441caa_1746x1154.png 848w, https://substackcdn.com/image/fetch/$s_!z4co!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba366f1-3c20-4c71-9ccb-f06988441caa_1746x1154.png 1272w, https://substackcdn.com/image/fetch/$s_!z4co!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ba366f1-3c20-4c71-9ccb-f06988441caa_1746x1154.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h4>Loop 1 (on-robot reflex)</h4><ul><li><p><strong>Runs on:</strong> SoCs / embedded GPU/NPUs + microcontrollers</p></li><li><p><strong>Optimized for:</strong> latency, determinism, power</p></li><li><p><strong>Key hardware:</strong> edge accelerators; fast local memory; sensor IO</p></li></ul><h4>Loop 2 (assist + teleop)</h4><ul><li><p><strong>Runs on:</strong> edge gateways + regional POPs + 
sometimes on-prem &#8220;micro-cloud&#8221; racks</p></li><li><p><strong>Optimized for:</strong> jitter, video relay, security</p></li><li><p><strong>Key hardware:</strong> NICs, video encode/decode accelerators, moderate GPUs for transcode and perception assist, high-availability CPU</p></li></ul><h4>Loop 3 (learn/improve)</h4><p>Split into sub-factories:</p><ol><li><p><strong>World model / backbone training</strong></p><ul><li><p><strong>Runs on:</strong> cloud + sometimes on-prem (large players)</p></li><li><p><strong>Hardware:</strong> rack-scale accelerators + high-bandwidth fabrics; strong networking; DPUs/NICs</p></li></ul></li><li><p><strong>Sim + SDG farms</strong></p><ul><li><p><strong>Runs on:</strong> cloud (best case), sometimes on-prem for proprietary environments</p></li><li><p><strong>Hardware:</strong> throughput GPUs (often not NVLink-heavy), CPU-rich nodes, high-IOPS local SSD, and a storage tier that can sustain asset churn</p></li></ul></li><li><p><strong>Evaluation + regression</strong></p><ul><li><p><strong>Runs on:</strong> cloud, often CPU-dominant</p></li><li><p><strong>Hardware:</strong> many CPU cores, fast object storage access, reproducible container runtimes</p></li></ul></li><li><p><strong>Data plane (ingest, ETL, indexing)</strong></p><ul><li><p><strong>Runs on:</strong> wherever data lands (cloud regions + edge POPs)</p></li><li><p><strong>Hardware:</strong> network + storage first; CPU; GPUs optional (embeddings, video preprocessing)</p></li></ul></li></ol><h3>6.6 The overlooked cost center: energy and cooling</h3><p>Robotics workloads amplify infrastructure constraints. Simulation farms are compute-heavy and storage-heavy; &#8220;move everything&#8221; data strategies hit both cost and power walls.</p><p><a href="https://www.iea.org/reports/energy-and-ai/energy-demand-from-ai?">The IEA projects global data center</a> electricity consumption could <strong>double to ~945 TWh by 2030</strong> in a base case. <br>That&#8217;s not a robotics-specific number, but it&#8217;s the macro constraint that makes robotics data plane efficiency a competitive advantage, not a moral footnote.</p><div><hr></div><h2>7) The neocloud robotics software stack &#8212; don&#8217;t build everything, orchestrate the ecosystem</h2><p>Here&#8217;s the highest-conviction point for neoclouds:</p><blockquote><p><strong>Robotics isn&#8217;t missing GPUs. 
It&#8217;s missing production-grade workflow plumbing.</strong><br>Neoclouds should not try to write the physics engine; they should make the robotics ecosystem run 10&#215; better on their infrastructure than on generic cloud.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QmPt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F151865be-c5c4-45b8-95fb-f11ccf7cf24d_1676x1462.png"><img src="https://substackcdn.com/image/fetch/$s_!QmPt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F151865be-c5c4-45b8-95fb-f11ccf7cf24d_1676x1462.png" alt="" loading="lazy" /></a></figure></div><p>The evidence is in the scars. AWS RoboMaker (Amazon (AMZN)) is a cautionary example: AWS announced it would discontinue <a href="https://boto3.amazonaws.com/v1/documentation/api/1.35.44/reference/services/robomaker/client/update_simulation_application.html?">RoboMaker support on </a><strong><a href="https://boto3.amazonaws.com/v1/documentation/api/1.35.44/reference/services/robomaker/client/update_simulation_application.html?">September 10, 2025</a></strong><a href="https://boto3.amazonaws.com/v1/documentation/api/1.35.44/reference/services/robomaker/client/update_simulation_application.html?">. 
</a><br>You can interpret that many ways, but one lesson is clear: building a full robotics application layer as a cloud provider is hard &#8212; and not always strategic.</p><p>So what should neoclouds do instead?</p><h3>7.1 SimOps: the &#8220;Kubernetes for simulation factories&#8221;</h3><p>What customers need:</p><ul><li><p>orchestration of massive numbers of sim jobs (burst + drain),</p></li><li><p>deterministic runs and replayability,</p></li><li><p>cost-aware scheduling,</p></li><li><p>fast startup times (container images + environment assets),</p></li><li><p>cross-engine support (Unity, Unreal, Isaac, MuJoCo, Gazebo, proprietary).</p></li></ul><p>Neocloud SimOps value-add:</p><ul><li><p><strong>asset-aware scheduling:</strong> schedule jobs where the assets already live (cache locality)</p></li><li><p><strong>time-to-first-frame optimization:</strong> pre-warm images, pin common scenes, build &#8220;sim base images&#8221;</p></li><li><p><strong>license-aware scheduling:</strong> for commercial engines</p></li></ul><h3>7.2 The Robotics Data Plane: &#8220;thick telemetry&#8221; ingest, refinement, and retrieval</h3><p>This is the core product opportunity &#8212; the Experience Refinery.</p><p>What customers need:</p><ul><li><p>ingestion endpoints close to fleets,</p></li><li><p>compression + filtering pipelines,</p></li><li><p>object storage with predictable performance,</p></li><li><p>indexing of multi-modal logs,</p></li><li><p>dataset lineage and provenance,</p></li><li><p>&#8220;search by behavior&#8221; (find similar failure modes),</p></li><li><p>and the ability to turn logs into training-ready shards.</p></li></ul><p>Neocloud value-add:</p><ul><li><p><strong>edge POP ingest + direct peering</strong>: lower transit costs and jitter</p></li><li><p><strong>GPU-optional preprocessing</strong>: embeddings, segmentation, event detection</p></li><li><p><strong>policy-aware logging hooks</strong>: &#8220;log what matters&#8221; and tag with policy version, environment, and operator actions</p></li></ul><h3>7.3 Asset Ops: &#8220;Git for 3D worlds&#8221; (the boring moat)</h3><p>Robotics simulation doesn&#8217;t scale without asset hygiene:</p><ul><li><p>meshes, textures, materials,</p></li><li><p>environment graphs,</p></li><li><p>robot URDFs,</p></li><li><p>sensor models,</p></li><li><p>physics parameters,</p></li><li><p>and every change must be versioned, validated, and reproducible.</p></li></ul><p>This is where OpenUSD becomes relevant as a common language.</p><p><a href="https://www.apple.com/newsroom/2023/08/pixar-adobe-apple-autodesk-and-nvidia-form-alliance-for-openusd/?">Apple (AAPL), Pixar, Adobe (ADBE), Autodesk (ADSK), and NVIDIA (NVDA) formed the Alliance for OpenUSD to promote standardization and interoperability of 3D tools and data.</a></p><p>Neocloud Asset Ops value-add:</p><ul><li><p>high-IOPS tiers for asset registries,</p></li><li><p>automated asset validation and compatibility checks,</p></li><li><p>diffing/versioning for 3D worlds,</p></li><li><p>promotion pipelines (dev &#8594; staging &#8594; prod),</p></li><li><p>and &#8220;asset provenance&#8221; tied to training runs and evaluations.</p></li></ul><p>This is where infrastructure becomes sticky. Switching your GPU provider is easy. 
Switching your <strong>world versioning system</strong> is painful.</p><h3>7.4 PolicyOps + EvalOps: the production line for autonomy</h3><p>Robotics companies don&#8217;t just train models &#8212; they ship policies into physical systems.</p><p>What customers need:</p><ul><li><p>policy registries,</p></li><li><p>reproducible builds,</p></li><li><p>evaluation harnesses (open-loop and closed-loop),</p></li><li><p>safety gates,</p></li><li><p>rollback tooling,</p></li><li><p>A/B testing for fleets,</p></li><li><p>and post-deploy monitoring tied back to training data.</p></li></ul><p>This is where &#8220;Experience Yield&#8221; becomes measurable.</p><p><strong>A usable definition (the one that will age well):</strong></p><blockquote><p>Experience Yield is the amount of real + synthetic experience that <em>moves your evaluation curve per dollar and per day</em>, weighted by coverage and downstream impact.</p></blockquote><p>Neoclouds can offer this as a dashboarded KPI, not a vibe.</p><h3>7.5 FleetOps integrations: speak the ecosystem&#8217;s language</h3><p>To be credible, neoclouds have to integrate with the existing robotics substrate:</p><ul><li><p>ROS 2 middleware variants, QoS complexity, mixed networks</p></li></ul><p>ROS 2 supports non&#8209;DDS middleware implementations like Zenoh, and documentation exists for installing rmw_zenoh. <br>This matters because robotics fleets often run over unreliable networks, and communication layers are evolving.</p><p>Also, &#8220;cloud robotics&#8221; frameworks exist in the wild: SAP (SAP) maintained an adaptation of the open-source Google Cloud Robotics platform to provide infrastructure for building and running robotics solutions (Cloud Robotics Core). <br>You don&#8217;t need to copy it &#8212; you need to be compatible with these patterns.</p><h3>7.6 Partnership strategy: the &#8220;render farm&#8221; analogy</h3><p>This is how you avoid the build-vs-partner trap:</p><ul><li><p>Don&#8217;t build Isaac Sim. Make Isaac-based pipelines run cheaper and faster on your infra.</p></li><li><p>Don&#8217;t build Applied Intuition. Make log replay and synthetic perturbation pipelines run at higher throughput with lower storage + transit costs.</p></li><li><p>Don&#8217;t build Unity or Unreal. 
Build the scheduling, caching, and storage tiers that make SDG industrial.</p></li></ul><p>Your role is to be the <strong>best substrate</strong> for the best tools.</p><div><hr></div><h2>8) The counter-argument: &#8220;why wouldn&#8217;t robotics companies just do on-prem?&#8221;</h2><p>They often will &#8212; especially the biggest players with the most proprietary data.</p><p>So the neocloud thesis must be strong enough to survive that.</p><h3>8.1 Elasticity is the obvious advantage &#8212; but not the only one</h3><p>The simplest &#8220;cloud win&#8221; is bursty simulation:</p><ul><li><p>You might need 5,000 GPUs for 6 hours of regression testing after a major change, then 200 GPUs for the rest of the week.</p></li><li><p>Owning for the peak is wasteful; renting for bursts is rational.</p></li></ul><p>This is classic cloud math &#8212; but robotics intensifies it because sim/regression patterns are spikier than steady inference.</p><h3>8.2 Hardware diversity: neoclouds as the &#8220;compute junk drawer&#8221; (in a good way)</h3><p>Robotics orgs need a weird mix:</p><ul><li><p>frontier accelerators occasionally,</p></li><li><p>mid-tier GPUs constantly,</p></li><li><p>CPU-heavy boxes for evaluation,</p></li><li><p>high-IOPS storage nodes for asset ops,</p></li><li><p>and network-heavy POPs for teleop/ingest.</p></li></ul><p>Most companies don&#8217;t want to own that entire long tail. Neoclouds can.</p><h3>8.3 A more subtle win: vendor-neutral orchestration</h3><p>If the ecosystem continues to fragment (NVIDIA stack, open-source stack, custom accelerators), robotics companies will value <strong>portability</strong> and <strong>multi-engine pipelines</strong>.</p><p>Neoclouds that become &#8220;the portable factory floor&#8221; &#8212; where workflows can run across GPU types and simulation engines &#8212; can be strategically valuable even when some training stays on-prem.</p><h3>8.4 Data gravity isn&#8217;t a death sentence &#8212; it&#8217;s a product opportunity</h3><p>If &#8220;moving data&#8221; is too expensive, neoclouds can:</p><ul><li><p>build ingest POPs near customers,</p></li><li><p>offer private connectivity/peering,</p></li><li><p>ship &#8220;edge refinery&#8221; appliances (managed ingest + preprocessing),</p></li><li><p>and only move refined experience, not raw ore.</p></li></ul><p>That&#8217;s how you compete with on-prem: <strong>meet the data where it is, then refine.</strong></p><div><hr></div><h2>9) Where the real alpha is for neoclouds (and how to package it)</h2><p>Robotics infrastructure becomes valuable when it is sold as a <strong>closed-loop productivity system</strong>, not compute.</p><h3>9.1 Productize Experience Yield</h3><p>In LLMs, the KPI is tokens/sec. 
In robotics, the KPI should be something like:</p><ul><li><p><strong>Experience Yield per $</strong></p></li><li><p><strong>Coverage per week</strong> (how many environments, lighting conditions, contact states, failure modes)</p></li><li><p><strong>Eval uplift per 1,000 hours of experience</strong></p></li><li><p><strong>Time-to-regression</strong> (how fast you can validate a new policy safely)</p></li></ul><p>Neoclouds can host the dashboards, the evaluation harnesses, and the provenance graph that ties it all together.</p><h3>9.2 Monetize the &#8220;factory stages,&#8221; not just the machines</h3><p>A durable robotics neocloud will have multiple revenue lines:</p><ol><li><p><strong>Sim Farm compute</strong> (throughput GPUs + CPU-heavy nodes)</p></li><li><p><strong>Asset Ops storage tiers</strong> (high-IOPS + object storage + caching)</p></li><li><p><strong>Telemetry ingest + preprocessing</strong> (edge POP services)</p></li><li><p><strong>Evaluation-as-a-service</strong> (standardized harnesses, reproducible runs)</p></li><li><p><strong>Teleop relay infrastructure</strong> (low-latency streaming + compliance logging)</p></li><li><p><strong>Managed orchestration</strong> (workflows across engines/vendors)</p></li></ol><p>This is how you escape fragile BMaaS economics.</p><h3>9.3 The best &#8220;credible big number&#8221; isn&#8217;t GPU TAM &#8212; it&#8217;s workflow TAM</h3><p>Market sizing varies wildly depending on definitions, but the direction is clear: &#8220;cloud robotics&#8221; is non-trivial and growing. Grand View Research estimates the global cloud robotics market at <strong>$7.83B in 2024</strong>, projecting <strong>$55.68B by 2033</strong>. <br>Treat this as an order-of-magnitude signal (definitions differ across reports), but it supports the idea that robotics cloud spend is big enough for specialization to matter.</p><div><hr></div><h2>10) Risks, reality checks, and what would falsify this thesis</h2><p>High conviction doesn&#8217;t mean pretending there are no counterforces.</p><h3>10.1 Platform wars and vendor gravity</h3><p>NVIDIA&#8217;s ecosystem is powerful &#8212; and it&#8217;s natural that many robotics workflows cluster around it. Isaac Sim is positioned as an open-source framework, but licensing is nuanced: the GitHub repository is under Apache 2.0, while building/using may require additional components (Omniverse Kit SDK, assets) under other terms. <br><a href="https://docs.isaacsim.omniverse.nvidia.com/5.0.0/common/licenses-isaac-sim.html?">A neocloud that bets on a single vendor stack risks becoming a reseller.</a></p><p><strong>Aging-well move:</strong> be <em>compatible</em>, not <em>captured</em>.</p><h3>10.2 The energy and grid constraint is real</h3><p>Data center power is increasingly a strategic bottleneck, and AI is a major driver. The IEA&#8217;s base case projects data center electricity consumption rising to ~945 TWh by 2030. <br>Robotics workloads are not exempt; sim farms can be power-hungry. Carbon-aware scheduling and power&#8209;aware siting aren&#8217;t virtue signals &#8212; they&#8217;re cost strategy.</p><h3>10.3 Security and safety are existential</h3><p>Robotics systems have physical consequences. 
That raises the bar for:</p><ul><li><p>secure model distribution,</p></li><li><p>secure logging,</p></li><li><p>tamper-resistant audit trails,</p></li><li><p>and safe rollback mechanisms.</p></li></ul><p>Neoclouds that can&#8217;t credibly do &#8220;secure-by-default&#8221; won&#8217;t win enterprise robotics.</p><h3>10.4 What would falsify the neocloud robotics thesis?</h3><p>Here are the real falsifiers (and how neoclouds hedge):</p><ol><li><p><strong>If robotics companies converge on fully on-prem training + sim</strong> because data never leaves secure facilities.<br>&#8594; Neoclouds pivot to managed private clusters, colo, and edge refinery appliances.</p></li><li><p><strong>If edge autonomy becomes so strong that cloud improvement loops shrink</strong> (less central training, more local adaptation).<br>&#8594; Neoclouds focus on fleet coordination, safety certification pipelines, and evaluation-as-a-service.</p></li><li><p><strong>If simulation tooling consolidates into one vertically integrated platform</strong> that bundles infra.<br>&#8594; Neoclouds must become the best substrate for that platform <em>or</em> the neutral alternative for everyone who doesn&#8217;t want lock-in.</p></li></ol><div><hr></div><h2>The robotics neocloud that wins</h2><p>A robotics neocloud that wins over the next decade will not be defined by the newest GPU it can buy first. It will be defined by:</p><ul><li><p><strong>Experience Yield</strong> as the north-star KPI</p></li><li><p>A <strong>two-tier compute strategy</strong>: rack-scale brain training + throughput sim farms</p></li><li><p>A <strong>robot-native data plane</strong> that makes thick telemetry economically tractable</p></li><li><p><strong>Asset Ops</strong> that version the world like code (OpenUSD and beyond)</p></li><li><p><strong>Orchestration</strong>, not platform hubris: partner with the best engines and make them run best on your cloud</p></li><li><p>A credible path through <strong>on-prem gravity</strong> via hybrid deployment, private connectivity, and edge refineries</p></li></ul><p>Robotics is where &#8220;AI infrastructure&#8221; stops being about chat and starts being about production. If neoclouds embrace that shift &#8212; from GPUs to factories, from tokens to experience &#8212; they don&#8217;t just survive the next workload transition. They become the infrastructure layer for physical autonomy.</p><p></p><p></p><h2>Paid Section: Investor &amp; Neocloud Alpha</h2><p><em>Disclaimer: This is not investment advice. It&#8217;s an analytical framework + a public-market watchlist for understanding how &#8220;robotics workloads&#8221; could re-route compute spend across the stack. Do your own work / consult a licensed professional before acting.</em></p>
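<p>Before the watchlist, one illustrative sketch. For readers who want the Experience Yield idea from sections 7.4 and 9.1 in executable form, here is a minimal Python example of how a dashboard might score a batch of experience. The field names, weights, and numbers are assumptions for illustration, not a standard formula.</p><pre><code>"""Illustrative Experience Yield scoring for a batch of experience.

Eval uplift per dollar and per day, weighted by coverage breadth.
All field names, weights, and numbers are assumptions for the sketch,
not a standard industry formula.
"""

from dataclasses import dataclass


@dataclass
class ExperienceBatch:
    hours: float        # real + synthetic experience hours in the batch
    cost_usd: float     # compute + storage + ingest cost attributed to the batch
    days: float         # wall-clock days to collect, refine, and train on it
    eval_uplift: float  # change in the tracked eval metric attributed to the batch
    coverage: float     # 0..1 share of tracked environments / failure modes touched


def experience_yield(batch: ExperienceBatch, coverage_weight: float = 0.5) -> float:
    """Eval uplift per dollar and per day, scaled by how broad the batch was."""
    if not (batch.cost_usd > 0 and batch.days > 0):
        raise ValueError("cost_usd and days must be positive")
    # Blend raw uplift with coverage so a batch that only re-covers familiar
    # environments does not score as well as one that broadens coverage.
    weighted = batch.eval_uplift * ((1 - coverage_weight) + coverage_weight * batch.coverage)
    return weighted / (batch.cost_usd * batch.days)


if __name__ == "__main__":
    sim_batch = ExperienceBatch(hours=20_000, cost_usd=45_000, days=3,
                                eval_uplift=0.012, coverage=0.70)
    fleet_batch = ExperienceBatch(hours=1_500, cost_usd=60_000, days=14,
                                  eval_uplift=0.015, coverage=0.35)
    for name, b in (("sim", sim_batch), ("fleet", fleet_batch)):
        print(f"{name}: yield = {experience_yield(b):.3e} uplift per (dollar * day); "
              f"uplift per 1k hours = {1000 * b.eval_uplift / b.hours:.4f}")
</code></pre><p>The design choice worth noting: the coverage weighting keeps a batch that only re-covers familiar environments from scoring as well as one that moves the evaluation curve across new failure modes.</p>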
      <p>
          <a href="https://research.fpx.world/p/the-neocloud-guide-to-robotics-inside">
              Read more
          </a>
      </p>
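<p>A closing sketch of the asset-aware, cost-aware scheduling idea from section 7.1, assuming one simple placement rule: run the job where the spot price plus the cost of pulling uncached assets is lowest. The pool names, prices, and asset sizes are hypothetical.</p><pre><code>"""Toy asset-aware, cost-aware placement for simulation jobs.

Scores each candidate pool by the spot price of the run plus the cost of
pulling whatever scene assets are not already cached there, then picks the
cheapest effective option. Pools, prices, and numbers are invented.
"""

from dataclasses import dataclass, field


@dataclass
class Pool:
    name: str
    price_per_gpu_hr: float
    cached_assets: set = field(default_factory=set)


@dataclass
class SimJob:
    name: str
    assets: set                  # scene / robot / sensor asset identifiers
    gpu_hours: float
    asset_gb: float              # data volume to pull if nothing is cached
    egress_per_gb: float = 0.05  # assumed transfer cost, dollars per GB


def effective_cost(job: SimJob, pool: Pool) -> float:
    """Compute cost plus transfer cost for assets missing from the pool cache."""
    cached = job.assets.intersection(pool.cached_assets)
    hit_rate = len(cached) / max(len(job.assets), 1)
    missing_gb = job.asset_gb * (1 - hit_rate)
    return job.gpu_hours * pool.price_per_gpu_hr + missing_gb * job.egress_per_gb


def place(job: SimJob, pools: list) -> Pool:
    """Pick the pool with the lowest effective cost for this job."""
    return min(pools, key=lambda p: effective_cost(job, p))


if __name__ == "__main__":
    pools = [
        Pool("burst-pool-a", price_per_gpu_hr=1.10,
             cached_assets={"warehouse_v3", "gripper_urdf"}),
        Pool("steady-pool-b", price_per_gpu_hr=0.85),
    ]
    job = SimJob("regression-sweep-42", {"warehouse_v3", "gripper_urdf"},
                 gpu_hours=400, asset_gb=5_000)
    for p in pools:
        print(f"{p.name}: effective cost ~ ${effective_cost(job, p):,.0f}")
    print("placed on:", place(job, pools).name)
</code></pre><p>Even in this toy version, cache locality beats a cheaper spot price once the scene library gets into the terabytes, which is exactly the time-to-first-frame and egress argument made above.</p>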
   ]]></content:encoded></item><item><title><![CDATA[Beyond the GPU: The 2026 CPU Bottleneck No One Is Pricing In]]></title><description><![CDATA[As AI shifts from $/Token to $/Task, tomorrow&#8217;s agentic and robotic workloads are turning CPUs into the new constraint. We map the demand shift and the CPU supply chain to unlock the opportunities.]]></description><link>https://research.fpx.world/p/beyond-the-gpu-the-2026-cpu-bottleneck</link><guid isPermaLink="false">https://research.fpx.world/p/beyond-the-gpu-the-2026-cpu-bottleneck</guid><dc:creator><![CDATA[FPX AI]]></dc:creator><pubDate>Thu, 29 Jan 2026 21:48:18 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/1c5239d9-d95e-474d-8f24-20013be7bc91_983x689.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A quick vibe check on where the market is heading: <strong>NVIDIA ($NVDA) is no longer just selling &#8220;the GPU.&#8221;</strong> With the Rubin platform, NVIDIA is now shipping the <strong>matching CPU (Vera)</strong> and crucially positioning it as infrastructure that can show up as <em>standalone control-plane</em> capacity inside modern AI data centers. And <strong><a href="https://www.coreweave.com/news/nvidia-and-coreweave-strengthen-collaboration-to-accelerate-buildout-of-ai-factories?">CoreWeave ($CRWV)</a></strong><a href="https://www.coreweave.com/news/nvidia-and-coreweave-strengthen-collaboration-to-accelerate-buildout-of-ai-factories?"> is among the first public examples waving the flag here:</a> they&#8217;ve announced they&#8217;ll be an early adopter of NVIDIA&#8217;s CPU + storage platforms in their fleet.</p><p>This means<strong> <a href="https://nvidianews.nvidia.com/news/rubin-platform-ai-supercomputer">$NVDA has officially entered the CPU room with $INTC and $AMD.</a></strong></p><p>That matters because the market is <em>still</em> pricing &#8220;intelligence&#8221; like it&#8217;s a single forward pass: <em>How many tokens per second can you generate?</em> That was the right KPI for chatbots. It&#8217;s the wrong KPI for agents.</p><p>What&#8217;s changing in 2026 isn&#8217;t that models suddenly got mystical. It&#8217;s that we&#8217;re moving from <strong>Generation (GPUs)</strong> to <strong>Orchestration (CPUs + networking + memory + isolation)</strong>&#8212;and the bottleneck shifts from <em>how fast the model can speak</em> to <em>how much infrastructure you can afford to keep thinking running</em>.</p><div><hr></div><h3>FPX AI CPU Marketplace</h3><p>If you&#8217;re looking to <strong>buy or rent CPU-only servers</strong> for orchestration, rollouts, tooling,<br>or control-plane capacity, reach out to us.<br>We can source and deliver validated configs with known lead times.<br><br>Explore : <a href="https://marketplace.fpx.world/cpus-for-sale">https://marketplace.fpx.world/cpus-for-sale</a><br><br>Full marketplace : <a href="https://marketplace.fpx.world/">https://marketplace.fpx.world/ </a></p><div><hr></div><h3>Tokens/sec was a chatbot metric. Agents live and die by $/Completed Task.</h3><p>In the chatbot world, the unit of work is simple:</p><p>A user prompt &#8594; one completion &#8594; done.</p><p>That&#8217;s why tokens/sec and $/token felt like gravity.</p><p>Agents break that mental model. The new unit of work is:</p><p>A user goal &#8594; a loop that plans &#8594; acts &#8594; checks &#8594; retries &#8594; logs &#8594; learns &#8594; repeats.</p><p>So the KPI flips. 
The KPI becomes:</p><p><strong>cost per successful task</strong> (not cost per token).</p><p>Because in the agent world, the &#8220;answer&#8221; is often the smallest part of the job. The job is the messy reality around the answer: tool calls, network requests, sandboxes, retries, verification passes, state updates, and audit trails.</p><p>This is the part most &#8220;GPU = AI&#8221; narratives quietly skip. GPUs generate proposals. <strong>CPUs operationalize them.</strong></p><div><hr></div><h3>The passive vs. the recursive: why the &#8220;Idle Tax&#8221; turns into a Work Tax</h3><p>Traditional software is responsive. It waits. It wakes up when a human clicks something.</p><p>Agentic software is recursive. It lives inside a loop.</p><p>A chatbot is idle when you are.</p><p>An agent is not.</p><p>Even &#8220;boring&#8221; enterprise agent behavior is basically continuous orchestration:</p><p>It polls inboxes, watches dashboards, tails logs, checks regressions, reconciles mismatches, retries flaky jobs, gathers evidence, runs experiments to increase certainty, and escalates only when confidence drops below a threshold.</p><p>That&#8217;s why the <strong>Idle Tax</strong> becomes a <strong>Work Tax</strong>: the system consumes cycles while nobody is watching because the whole point is that it keeps going.</p><p>And that continuous &#8220;going&#8221; is overwhelmingly <strong>CPU + I/O + network + safety</strong>, not token generation.</p><div><hr></div><h3>Here&#8217;s the uncomfortable measurement: tool processing can dominate end&#8209;to&#8209;end latency.</h3><p>If you want one empirical datapoint that snaps this into focus, it&#8217;s this: profiling work on agentic frameworks shows that <strong>CPU-side tool processing can consume the majority of total latency</strong> in representative agent workloads&#8212;<em>in one study, <a href="https://arxiv.org/pdf/2511.00739">up to ~90.6% of total latency</a></em>&#8212;and can also be a large chunk of dynamic energy at scale.</p><p>Another 2026 system study looking at agentic inference traces from a large cloud provider finds <strong>tool execution dominates tail latency</strong>&#8212;<a href="https://arxiv.org/html/2601.12967v1">roughly </a><strong><a href="https://arxiv.org/html/2601.12967v1">30&#8211;80%</a></strong><a href="https://arxiv.org/html/2601.12967v1"> in their measured &#8220;first-token response&#8221;</a> path, with individual tool calls sometimes exceeding LLM prefill time.</p><p>That&#8217;s the hidden tax: once you move from &#8220;talk&#8221; to &#8220;do,&#8221; your wall-clock time gets eaten by everything <em>around</em> the model.</p><div><hr></div><h3>&#8220;But won&#8217;t the GPU run the agent loop?&#8221; Why GPUs still hate branching, waiting, and bureaucracy.</h3><p>A skeptical hardware architect will (correctly) point out that GPUs are getting more flexible: CUDA graphs, more conditional logic, tighter scheduling, better kernels&#8212;sure.</p><p>But here&#8217;s the fundamental problem: <strong>GPUs are throughput machines.</strong> They&#8217;re built to do the same kind of work across a lot of lanes in parallel.</p><p>Agent loops are the opposite:</p><p>They&#8217;re full of if/else branches, divergent code paths, serialization points, and&#8212;most importantly&#8212;<strong>waiting</strong> (on APIs, on web pages, on filesystem I/O, on permission checks, on sandbox spin&#8209;up).</p><p>This is where the economics get brutal:</p><p>Using a $30,000 GPU to wait for a weird third&#8209;party API response 
is capital inefficiency bordering on comedy.</p><p>So even if you <em>can</em> run more control flow on the GPU, you generally <em>shouldn&#8217;t</em>&#8212;because <strong>serialization is the enemy of parallelism</strong>. When your agent is blocked on an HTTP &#8220;200 OK,&#8221; you&#8217;re effectively stalling thousands of CUDA cores.</p><p>That&#8217;s why the CPU&#8217;s role doesn&#8217;t vanish. It evolves into something more specific:</p><p><strong>The GPU stays the throughput engine. The CPU becomes the latency absorber.</strong></p><h3><a href="https://marketplace.fpx.world/cpus-for-sale">A selfish FPX CPU Marketplace plug</a></h3><p>This is where procurement stops looking like &#8220;pick a CPU brand&#8221; and starts looking like systems design. If the agent loop is dominated by branching, waiting, tool calls, and I/O, then what matters is the throughput envelope around the model: single-thread responsiveness, memory bandwidth, NVMe IOPS, and NIC capacity&#8212;plus the ability to scale rollout workers without wasting expensive compute. That&#8217;s why <a href="https://marketplace.fpx.world/cpus-for-sale">FPX&#8217;s CPU marketplace</a> is organized by workload roles, not marketing names: rollout nodes, control-plane nodes, and high-memory sim nodes, with clear spec bands (CPU class, RAM tier, NVMe layout, NIC tier) and lead-time-aware alternates. If you&#8217;re looking to buy or rent CPU-only servers, reach out&#8212;and if you need GPU, storage, or networking too, the broader FPX HPC marketplace covers those builds as well.</p><div><hr></div><h3>The hardware reality check: what &#8220;orchestration&#8221; looks like in a Rubin&#8209;era rack</h3><p>It&#8217;s worth grounding this in what $NVDA is literally selling.</p><p>NVIDIA&#8217;s <strong><a href="https://www.nvidia.com/en-us/data-center/vera-rubin-nvl72/">Vera Rubin NVL72</a></strong> is positioned as a rack-scale &#8220;AI factory&#8221; building block: <strong>72 Rubin GPUs + 36 Vera CPUs</strong>, tied together with NVLink and a lot of networking hardware.</p><p>And the specs read like a confession that orchestration is now first-class:</p><ul><li><p><a href="https://www.nvidia.com/en-us/data-center/vera-cpu/">Vera CPU</a>: <strong>88 custom Arm cores</strong> (&#8220;Olympus&#8221;), designed for orchestration workloads, with <strong>up to 1.5TB of LPDDR5X</strong> per CPU via SOCAMM, and coherent CPU&#8596;GPU communication via <strong>NVLink&#8209;C2C</strong>.</p></li><li><p>The rack includes <strong>BlueField&#8209;4 DPUs</strong> and <strong>ConnectX&#8209;9 SuperNICs</strong>&#8212;because the &#8220;agent loop&#8221; isn&#8217;t just compute; it&#8217;s <strong>east&#8209;west traffic, security enforcement, and data movement</strong>.</p></li><li><p>NVIDIA markets the whole thing as a platform for &#8220;AI reasoning,&#8221; not just training/inference throughput&#8212;i.e., long-horizon loops that benefit from coherent orchestration and massive memory plumbing.</p></li></ul><p>And here&#8217;s the nuance that makes the strategic tension interesting: even while pushing Vera, NVIDIA&#8217;s own <strong>DGX Rubin NVL8</strong> spec still lists <strong>dual Intel Xeon 6776P host CPUs</strong>&#8212;a reminder that the transition from &#8220;general purpose host&#8221; to &#8220;specialized control plane&#8221; is real, but not instantaneous.</p><p>So yes, $NVDA is pushing vertical integration. 
But even $NVDA is living in a mixed world.</p><div><hr></div><h3>The examples get sharper when you tie them to hardware, not software vibes</h3><p>A lot of agent talk stays abstract. The easiest way to make this real is to take three common agent loops and translate them into &#8220;what the silicon is actually doing.&#8221;</p><p><strong>DevOps / remediation agents</strong> (the &#8220;commit &#8594; fail &#8594; fix &#8594; retry&#8221; loop) aren&#8217;t compute-heavy in the GPU sense; they&#8217;re <em>reality-heavy</em>. They clone repos, install dependencies, run unit tests, parse stack traces, spin sandboxes, and repeat. Most of that is CPU time (and disk/network time). The GPU proposes an edit; the CPU pays the bill for testing it.</p><p><strong>Security/SOC triage agents</strong> spend their lives in parsing, filtering, correlating, and routing. That&#8217;s branchy text processing and huge volumes of I/O. Again: the GPU helps you propose the remediation playbook; the CPU does the grunt work of log wrangling, SIEM queries, and rule-engine execution.</p><p><strong>Procurement / market intel agents</strong> are basically schedulers with opinions. Scraping supplier sites, normalizing messy HTML/PDFs, matching SKUs, diffing changes, and retrying flaky endpoints is less &#8220;matrix multiplication&#8221; and more &#8220;branch prediction + single-thread performance + networking.&#8221;</p><p>So when you say &#8220;web scraping is not a GPU problem,&#8221; you can make it more precise:</p><p>It&#8217;s often a <strong>branch-prediction and single-thread latency</strong> problem. Parsing ugly real-world documents rewards high-frequency cores, strong caches, and fast I/O&#8212;not bigger tensor cores.</p><div><hr></div><h3>The &#8220;Hidden Thinking&#8221; tax: test&#8209;time compute turns inference into search, and search turns memory into the battlefield</h3><p>We&#8217;re entering test-time compute economics. Modern reasoning stacks increasingly behave like search:</p><p>They branch, sample, verify, score, and select&#8212;especially as you push toward higher reliability.</p><p>That changes the shape of workloads:</p><p>Inference stops being &#8220;one pass,&#8221; and starts being &#8220;a tree of attempts.&#8221;</p><p>Even if the GPU does the heavy math, the CPU increasingly becomes the thing that:</p><p>Schedules branches, manages retries, routes tool calls, stores intermediate state, enforces budgets/timeouts, merges candidate solutions, and logs traces for improvement.</p><p>Now add the overflow problem: <strong>context windows and reasoning traces blow up the KV cache.</strong></p><p>When KV cache can&#8217;t live comfortably on-device, systems start leaning on host memory tiers. 
One concrete industry signal: <strong><a href="https://blog.vllm.ai/2026/01/08/kv-offloading-connector.html">vLLM has explicitly discussed CPU&#8209;memory KV cache offloading</a></strong> as a lever&#8212;offloading to CPU DRAM and optimizing host&#8596;device transfer to keep throughput up.</p><p>That&#8217;s the &#8220;GPU engine, CPU fuel line&#8221; metaphor made literal: when GPU memory becomes the scarce tier, the CPU memory subsystem becomes the staging ground that decides whether the GPU is fed or starved.</p><p>And this is where the missing 2026 keyword belongs in the story:</p><p><strong><a href="https://www.penguinsolutions.com/en-us/resources/blog/what-is-cxl-memory-expansion">CXL (Compute Express Link)</a></strong> is the industry&#8217;s practical answer to &#8220;stranded memory.&#8221; It&#8217;s how hyperscalers plan to pool and expand memory beyond what&#8217;s soldered to a single board, so orchestration workloads (and huge contexts) don&#8217;t get hard-capped by local DRAM slots.</p><p>The punchline is that agent economics make memory fungibility valuable. CXL is how you buy fungibility.</p><div><hr></div><h3>Autonomy creates bureaucracy. Bureaucracy creates CPU load.</h3><p>The moment you let an AI system do anything &#8220;real&#8221;&#8212;</p><p>execute code, browse the web, access internal tools, write to a repo, touch production data&#8212;</p><p>you wrap it in isolation and policy: containers or microVMs, network controls, resource monitoring, audit logs, guardrails, and permission layers.</p><p>That overhead is not a rounding error. It&#8217;s the operational state machine around &#8220;doing work safely.&#8221;</p><p><a href="https://aws.amazon.com/blogs/machine-learning/introducing-the-amazon-bedrock-agentcore-code-interpreter/?">AWS ($AMZN) is pretty direct about what it takes to run code interpreter-like agent workloads securely: </a>you need managed environments, isolation boundaries, and a bunch of infrastructure around execution&#8212;not just a model endpoint.</p><p>And that overhead is mostly CPU-side orchestration. The GPU doesn&#8217;t run your audit trail. The GPU doesn&#8217;t enforce your egress policy. The GPU doesn&#8217;t spin up your sandbox.</p><p>Agents don&#8217;t remove enterprise bureaucracy. They automate it.</p><p>So the bureaucracy loop starts running at machine speed.</p><div><hr></div><h3>The networking tax: east&#8209;west traffic becomes the silent limiter</h3><p>There&#8217;s another bottleneck you can&#8217;t dodge once you go rack-scale: <strong>east&#8209;west traffic</strong> inside the data center.</p><p>An agent &#8220;thinking&#8221; in production looks like this:</p><p>Query a vector DB &#8594; call a tool &#8594; update a log &#8594; fetch another document &#8594; run a verifier &#8594; write back state &#8594; repeat.</p><p>That&#8217;s a lot of internal movement, not just &#8220;compute.&#8221;</p><p>Which is why DPUs, SmartNICs, and high-end NICs show up as first-class orchestration components. In Rubin-era designs, you can see NVIDIA bundling <strong>BlueField DPUs</strong> and <strong>ConnectX networking</strong> directly into the platform story.</p><p>And it&#8217;s not just $NVDA. 
<a href="https://www.reuters.com/business/broadcom-launches-new-tomahawk-ultra-networking-chip-ai-battle-against-nvidia-2025-07-15/">Broadcom ($AVGO)</a> has been explicit about AI data-center &#8220;traffic controller&#8221; silicon and is pushing higher-performance Ethernet-scale solutions aimed at AI workloads.</p><p>So when we say &#8220;CPUs are the orchestration layer,&#8221; we should be honest: it&#8217;s really <strong>CPU + NIC + DPU</strong> as the orchestration layer. The CPU increasingly offloads parts of the networking &#8220;tax&#8221; to the DPU so the whole rack doesn&#8217;t drown in packets and policy checks.</p><div><hr></div><h3>Where RL joins in: agents become policies, not prompts&#8212;and that multiplies infrastructure demand</h3><p>Once agents become the interface, reinforcement learning (RL) becomes the factory that turns them into workers.</p><p>And RL doesn&#8217;t just &#8220;make the model better.&#8221; It changes the workload shape:</p><p>It introduces exploration, multiple rollouts, evaluation, reward computation, and selection among alternatives. That means <strong>more attempts per task</strong>, not fewer.</p><p>DeepSeek&#8217;s R1 work is a concrete example people cite here: it reports that reasoning behaviors can be incentivized with RL, with emergent patterns like self-verification and self-reflection&#8212;exactly the behaviors that expand the inference graph into longer, more <a href="https://arxiv.org/pdf/2501.12948">iterative loops</a>.</p><p>And in agentic RL, &#8220;environment steps&#8221; aren&#8217;t toy gridworld frames. They can be headless browsers, containerized code runners, staging databases, API sandboxes, or CI pipelines.</p><p>That&#8217;s where CPUs get hammered: environment orchestration, reward evaluation, parallel sandboxes, state resets, trajectory storage&#8212;this is CPU/I/O scaling pressure.</p><p>RL is a multiplier on the orchestration thesis.</p><div><hr></div><h3>The shortage signal isn&#8217;t &#8220;empty shelves.&#8221; It&#8217;s price steps, allocation warfare, and lead times.</h3><p>If you&#8217;re looking for a 2026 shortage, don&#8217;t imagine a consumer GPU shelf in 2021. 
Modern shortages show up as:</p><ul><li><p>selective price hikes,</p></li><li><p>prioritization of higher-margin SKUs,</p></li><li><p>supply reserved for hyperscalers and anchor customers,</p></li><li><p>lead times that quietly stretch until they become planning constraints.</p></li></ul><p>On the client side, reporting in late 2025 pointed to <strong>price increases on older Intel CPUs</strong> (notably Raptor Lake-era parts) on the order of <a href="https://www.tomshardware.com/pc-components/cpus/intel-reportedly-raising-prices-on-ever-popular-raptor-lake-chips-outdated-cpus-to-get-over-10-percent-price-hike-due-to-disinterest-in-ai-processors">~10%+&#8212;with some markets seeing larger moves.</a></p><p>On the server side, Intel ($INTC) has publicly discussed a <strong>CPU supply shortage</strong> impacting data center/server availability, with expectations that the tightest point would be in <strong><a href="https://www.networkworld.com/article/4122172/intel-wrestling-with-cpu-supply-shortage.html">Q1 2026</a></strong><a href="https://www.networkworld.com/article/4122172/intel-wrestling-with-cpu-supply-shortage.html"> and improvement would follow later in the year.</a></p><p>And the memory side is not quietly behaving like a commodity anymore either: Reuters has described dynamics where AI server memory demand is pushing pricing up materially, with tightness expected to ripple through the supply chain.</p><p>That&#8217;s what an oncoming structural crunch looks like early: not &#8220;no inventory,&#8221; but <strong>repricing + rationing</strong>.</p><div><hr></div><h3>Why 2026 is a collision year: demand shifts just as supply hits hard physical limits</h3><p>The 2026 argument gets stronger when you frame it as a collision between:</p><ol><li><p>a structural demand shift (agents + test-time compute + RL loops), and</p></li><li><p>a supply system already operating near limits.</p></li></ol><p>On the supply side, multiple signals stack:</p><ul><li><p>TSMC ($TSM; 2330.TW) leadership has described leading-edge capacity as <a href="https://www.reuters.com/technology/tsmc-set-report-strong-profit-stock-pressured-by-trump-comments-2024-07-18/?">&#8220;very, very tight,&#8221; with constraints extending into 2026.</a></p></li><li><p>Analysts cited in EE Times estimate that <strong>wafer demand at 5nm and below <a href="https://www.eetimes.com/tsmc-will-struggle-to-meet-ai-demand-for-years-analysts-say/?">could exceed capacity by ~25&#8211;30% in 2026</a></strong>, implying persistent tightness at the nodes that matter most for flagship silicon.</p></li><li><p>Reuters reported TSMC guiding to 2026 capex of roughly <strong><a href="https://www.reuters.com/world/china/tsmc-likely-post-fourth-quarter-profit-leap-driven-by-ai-boom-2026-01-14/?">$52&#8211;$56B</a></strong><a href="https://www.reuters.com/world/china/tsmc-likely-post-fourth-quarter-profit-leap-driven-by-ai-boom-2026-01-14/?"> with expectations of strong 2026 growth</a>&#8212;important, but still not instant capacity (fabs don&#8217;t show up on quarterly timelines).</p></li></ul><p>Now layer in the &#8220;this is how physical reality works&#8221; details:</p><p>ASML&#8217;s ($ASML)<a href="https://www.reuters.com/technology/asml-ships-first-high-na-lithography-system-intel-statement-2023-12-21/"> High&#8209;NA EUV tools cost </a><strong><a href="https://www.reuters.com/technology/asml-ships-first-high-na-lithography-system-intel-statement-2023-12-21/">hundreds of millions of dollars per unit</a></strong> and ship in hundreds of 
crates; Reuters noted they&#8217;re expected to be used in commercial manufacturing starting <strong>2026 or 2027</strong>.</p><p>In other words, even when the industry spends aggressively, the constraint is time.</p><div><hr></div><h3>The reverse supply chain matters because the bottlenecks are not &#8220;the CPU vendor.&#8221; They&#8217;re the missing stage.</h3><p>This is where your deep-dive framing really earns its keep. If you trace a finished server node backward, you quickly realize the choke points are specific stages where:</p><ul><li><p>only a few suppliers dominate,</p></li><li><p>capacity takes years to expand,</p></li><li><p>yields make output non-linear,</p></li><li><p>and downstream assembly can&#8217;t proceed if one input is missing.</p></li></ul><p>So let&#8217;s walk it backwards&#8212;more conversationally&#8212;but without losing the technical spine.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YbOX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F305fced8-a193-48c9-bf3d-3d0e9181718a_2752x1536.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YbOX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F305fced8-a193-48c9-bf3d-3d0e9181718a_2752x1536.heic 424w, https://substackcdn.com/image/fetch/$s_!YbOX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F305fced8-a193-48c9-bf3d-3d0e9181718a_2752x1536.heic 848w, https://substackcdn.com/image/fetch/$s_!YbOX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F305fced8-a193-48c9-bf3d-3d0e9181718a_2752x1536.heic 1272w, https://substackcdn.com/image/fetch/$s_!YbOX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F305fced8-a193-48c9-bf3d-3d0e9181718a_2752x1536.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YbOX!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F305fced8-a193-48c9-bf3d-3d0e9181718a_2752x1536.heic" width="1200" height="670.054945054945" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/305fced8-a193-48c9-bf3d-3d0e9181718a_2752x1536.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:547377,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://research.fpx.world/i/186046313?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F305fced8-a193-48c9-bf3d-3d0e9181718a_2752x1536.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!YbOX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F305fced8-a193-48c9-bf3d-3d0e9181718a_2752x1536.heic 424w, https://substackcdn.com/image/fetch/$s_!YbOX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F305fced8-a193-48c9-bf3d-3d0e9181718a_2752x1536.heic 848w, https://substackcdn.com/image/fetch/$s_!YbOX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F305fced8-a193-48c9-bf3d-3d0e9181718a_2752x1536.heic 1272w, https://substackcdn.com/image/fetch/$s_!YbOX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F305fced8-a193-48c9-bf3d-3d0e9181718a_2752x1536.heic 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p><strong>Stage 1: CPU installed on the motherboard (system integration)</strong><br>This is where <strong>NVIDIA ($NVDA)</strong>, <strong>AMD ($AMD)</strong>, and <strong>Intel ($INTC)</strong> collide with the real constraints of OEMs and integrators like <strong>Dell ($DELL)</strong>, <strong>HPE ($HPE)</strong>, and <strong>Super Micro ($SMCI)</strong>. At 2026 power levels, the binding constraint is often not core count, but whether the platform can deliver stable power, remove heat, and keep I/O stable under continuous load. Vera is built around a systems-first narrative and is designed to behave like a control plane for AI factories. Venice is designed to slide into the broad x86 server ecosystem with minimal friction. Diamond Rapids signals a platform-level bandwidth push that increases board complexity and routing density.</p><p><strong>Stage 2: Memory subsystem and high-speed I/O (the &#8220;library and highways&#8221;)</strong><br>Agentic workloads do not just compute, they hold state and move it constantly. They pull context, call tools, write logs, store intermediate artifacts, and keep environments alive while the loop runs. 
Memory bandwidth and I/O stop being secondary specs and become the throttle. Vera&#8217;s differentiation is its LPDDR5X approach through SOCAMM-style designs, prioritizing bandwidth per watt and tight coupling to the GPU factory model. Venice stays anchored in conventional DIMM ecosystems for maximum platform flexibility. Diamond Rapids pushes toward more aggregate bandwidth through platform choices that increase routing and power delivery complexity. This is also where CXL becomes the practical survival mechanism for &#8220;stranded memory,&#8221; enabling pooled expansion as context windows and reasoning traces grow.</p><p><strong>Stage 3: CPU package substrate (ABF)</strong><br>This is the multilayer wiring structure under the chip that makes high-speed signaling and high-current power delivery possible. Even flawless silicon cannot ship without sufficient high-end substrate capacity. The concentration here is the entire story. <strong>Ajinomoto (2802.T)</strong> is widely cited as holding near-total share in the insulation film used for high-performance ABF substrates. Substrate manufacturers such as <strong>Ibiden (4062.T)</strong>, <strong>Shinko (6967.T)</strong>, <strong>Unimicron (3037.TW)</strong>, and <strong>AT&amp;S (ATS.VI)</strong> cannot route signals without that material. Differentiation shows up as pressure profiles. Vera may emphasize a more uniform control-plane design, but it still needs a premium substrate to sustain coherent high-bandwidth signaling. Venice and Diamond Rapids mechanically increase substrate layer counts and routing density as they scale memory channels and I/O.</p><p><strong>Stage 4: Advanced packaging capacity (assembly and integration queues)</strong><br>Packaging is not just &#8220;put the die in a box.&#8221; It is the integration of dies or tiles, tight thermals, warpage control, and high yield under complex constraints. Capacity here is shared infrastructure, which is why it becomes a schedule setter. Foundries and OSATs such as <strong>TSMC ($TSM)</strong>, <strong>Intel ($INTC)</strong>, <strong>ASE ($ASX)</strong>, and <strong>Amkor ($AMKR)</strong> become arbiters of volume when demand spikes. The exposure differs. Vera&#8217;s value proposition is tightly coupled to the broader Rubin platform cadence, which means packaging queues that gate accelerators can indirectly gate Vera-based system shipments. Venice and Diamond Rapids retain more ability to ship into non-AI server configurations, which can soften the shock when packaging bottlenecks bind.</p><p></p><p><strong>Stage 5: The platform adjacency bottleneck (HBM and accelerator readiness)</strong><br>AI servers do not ship as CPUs alone. They ship as platforms that require accelerator memory stacks, and that makes HBM a system-level limiter. If HBM4 is constrained, racks do not deploy, regardless of CPU availability. This creates asymmetric exposure. Vera is most directly tied to HBM availability because the Rubin narrative depends on coherent CPU to GPU memory movement and rack-scale determinism. Venice and Diamond Rapids can ship outside this dependency chain, but for flagship AI clusters the accelerator complex still sets the deployment pace.</p><p><strong>Stage 6: Foundry FEOL transistor formation (FinFET to GAA transition)</strong><br>This is where the 2026 story turns into a yield ramp story. Venice is tied to TSMC&#8217;s first-generation GAA-class process direction. Diamond Rapids is tied to Intel&#8217;s 18A RibbonFET roadmap. 
Both face nonlinear output as learning curves mature. This is where a small yield wobble can become a big supply shock.</p><p><strong>Stage 7: Backside power delivery (where applicable)</strong><br>Backside power approaches separate power delivery from frontside signal routing and can improve performance per watt, but they add process complexity and yield risk. Intel&#8217;s 18A strategy is closely associated with this style of shift, which makes it a key part of Diamond Rapids&#8217; manufacturing risk profile.</p><p><strong>Stage 8: EUV lithography (printing the smallest patterns)</strong><br>EUV throughput and uptime become structural constraints because they define how quickly critical layers can be patterned at scale. <strong>ASML ($ASML)</strong> sits at the center. The tools are scarce, expensive, and operationally delicate, which is why EUV capacity is never &#8220;just add money.&#8221;</p><p><strong>Stage 9: High-NA EUV enablement (the next lens system)</strong><br>High-NA reduces multi-patterning but introduces ecosystem readiness constraints and extreme tool scarcity. Early adoption dynamics matter here because tool positioning can influence successor-node readiness and competitiveness.</p><p><strong>Stage 10: Masks and metrology (defect control and inspection)</strong><br>As nodes shrink and complexity rises, defect detection and mask quality become gating. <strong>KLA ($KLAC)</strong> is central in inspection and metrology. Mask makers like <strong>Photronics ($PLAB)</strong> and Japanese incumbents such as <strong>Toppan (7911.T)</strong>, <strong>Dai Nippon Printing (7912.T)</strong>, and <strong>HOYA (7741.T)</strong> sit in the critical path when mask counts rise and defect tolerance collapses.</p><p><strong>Stage 11: EDA, IP, and signoff models (blueprints and building-code approval)</strong><br>At the GAA era, modeling accuracy is yield. EDA vendors <strong>Synopsys ($SNPS)</strong>, <strong>Cadence ($CDNS)</strong>, and <strong>Siemens (SIEGY)</strong> define tapeout confidence. IP ecosystems like <strong>Arm ($ARM)</strong> matter especially for Vera&#8217;s compatibility story and for any platform pushing new coherency and system-level integration.</p><p><strong>Stage 12: Wafers, chemicals, gases, and refined minerals (the purity stack)</strong><br>This is the &#8220;clean kitchen&#8221; layer that determines whether leading-edge manufacturing is even possible. Wafers from <strong>Shin-Etsu (4063.T)</strong> and <strong>SUMCO (3436.T)</strong>, process materials from <strong>Entegris ($ENTG)</strong> and <strong>DuPont ($DD)</strong>, and gases from <strong>Linde ($LIN)</strong> and <strong>Air Liquide (AI.PA)</strong> all become more sensitive inputs as the industry moves into first-generation GAA ramps.</p><p>The takeaway is simple. &#8220;CPU shortage&#8221; is rarely a single company failing to ship. It is a shared pipeline where one missing stage can gate everything above it. Shortages emerge in phases. Allocation shifts first, then lead times stretch, then cascades form when one constrained input blocks multiple downstream assemblies. In 2026, Vera, Venice, and Diamond Rapids do not just compete on architecture. They collide on shared chokepoints, and their exposure differs depending on how tightly each strategy is coupled to the broader AI platform stack.</p><div><hr></div><h3>The 2026 platform clash: integration vs flexibility</h3><p>This is where the competitive narrative gets fun, because the strategy split is clean:</p>
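<p>Before that comparison, the shared-pipeline takeaway from the stage walk-through above can be reduced to a toy calculation: shipments are set by the weakest stage, and a yield wobble only matters when it makes that stage the binding constraint. The stage names follow the walk-through; the capacities are invented.</p><pre><code>"""Toy model of the shared-pipeline point: shipments are gated by the weakest
stage, and a yield wobble only matters when it makes that stage the binding
constraint. Stage names follow the walk-through above; capacities are invented.
"""


def shippable_units(stage_capacity: dict) -> tuple:
    """Return (units, gating_stage): output equals the minimum stage capacity."""
    gating_stage = min(stage_capacity, key=stage_capacity.get)
    return stage_capacity[gating_stage], gating_stage


if __name__ == "__main__":
    # Hypothetical per-quarter capacities, expressed in "CPU platform" units.
    stages = {
        "wafers_and_materials": 140_000,
        "euv_litho_slots": 120_000,
        "feol_gaa_yielded_die": 100_000,
        "abf_substrates": 95_000,
        "advanced_packaging": 105_000,
        "hbm_for_companion_gpus": 90_000,
        "system_integration": 130_000,
    }
    units, gate = shippable_units(stages)
    print(f"baseline: {units:,} units, gated by {gate}")

    # A 20% yield hit at the GAA stage: it only changes shipments because it
    # pushes that stage below the previous binding constraint (HBM here).
    stages["feol_gaa_yielded_die"] = int(stages["feol_gaa_yielded_die"] * 0.80)
    units, gate = shippable_units(stages)
    print(f"after the GAA yield hit: {units:,} units, gated by {gate}")
</code></pre>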
      <p>
          <a href="https://research.fpx.world/p/beyond-the-gpu-the-2026-cpu-bottleneck">
              Read more
          </a>
      </p>
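<p>A closing sketch of the dollars-per-completed-task framing from the top of the piece: a toy cost model for an agent loop where tool execution, sandboxes, retries, and logging sit alongside token generation. All prices, rates, and field names are invented for illustration.</p><pre><code>"""Illustrative dollars-per-completed-task accounting for an agent loop.

Token generation is only one line item; CPU-side tool execution, sandboxes,
and overhead add a tax that dollars-per-token never showed, and the retry
rate sets the denominator. All prices, rates, and field names are invented.
"""

from dataclasses import dataclass


@dataclass
class AgentWorkload:
    attempts: int                   # task attempts, including retries
    success_rate: float             # fraction of attempts that complete the goal
    tokens_per_attempt: int
    price_per_1k_tokens: float      # GPU-side generation price
    cpu_seconds_per_attempt: float  # tool calls, parsing, sandboxes, verification
    cpu_price_per_hour: float
    overhead_per_attempt: float     # storage, egress, logging, isolation


def cost_per_successful_task(w: AgentWorkload) -> float:
    gpu_cost = w.attempts * w.tokens_per_attempt / 1000 * w.price_per_1k_tokens
    cpu_cost = w.attempts * w.cpu_seconds_per_attempt / 3600 * w.cpu_price_per_hour
    other = w.attempts * w.overhead_per_attempt
    successes = max(int(w.attempts * w.success_rate), 1)
    return (gpu_cost + cpu_cost + other) / successes


if __name__ == "__main__":
    w = AgentWorkload(
        attempts=1_000,
        success_rate=0.62,
        tokens_per_attempt=12_000,
        price_per_1k_tokens=0.004,
        cpu_seconds_per_attempt=240,   # tool execution dominating wall-clock time
        cpu_price_per_hour=0.30,
        overhead_per_attempt=0.01,
    )
    print(f"dollars per successful task: {cost_per_successful_task(w):.3f}")
    print(f"dollars per 1k tokens (the old KPI): {w.price_per_1k_tokens:.4f}")
</code></pre><p>The KPI flip lives in the denominator: you divide by successful tasks, not attempts, so retry rate and verification overhead show up directly in the unit economics.</p>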
   ]]></content:encoded></item><item><title><![CDATA[The Rubin Protocol : Supply Chain, Bottlenecks, and the Real Winners of the AI Buildout]]></title><description><![CDATA[Since Moore&#8217;s Law is dead for real production workloads NVIDIA adapted with the Rubin platform by co-innovating across 6 chips. We break down the supply chain, bottlenecks, and opportunity map.]]></description><link>https://research.fpx.world/p/the-rubin-protocol-supply-chain-bottlenecks</link><guid isPermaLink="false">https://research.fpx.world/p/the-rubin-protocol-supply-chain-bottlenecks</guid><dc:creator><![CDATA[FPX AI]]></dc:creator><pubDate>Mon, 19 Jan 2026 19:58:19 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/650f4660-8392-462c-a52e-b2dcb93730cd_1300x914.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>For most of computing history, progress was easy to model. Moore&#8217;s Law delivered more performance, costs fell, and systems improved almost automatically. That framework no longer holds for real AI production workloads. The limiting factor is no longer transistor density. It is everything that happens around the transistor.</p><p>Vera Rubin is NVIDIA&#8217;s response to that shift. Instead of relying on a node shrink to rescue AI economics, NVIDIA rebuilt the entire machine at once. Six chips were co designed as a single factory to eliminate idle time across compute, orchestration, memory tiers, scale up communication, scale out networking, and system level determinism. The outcome is not simply higher performance. It is a platform whose success is governed by manufacturing physics, supply chains, and integration discipline rather than raw FLOPs.</p><p>This piece walks through that reality step by step. We explain what each of the six chips actually does in plain terms and why it exists. Then we follow the system upstream through the production stack. From advanced packaging and HBM4 base dies to substrates, connectors, optics, cooling systems, and finally the raw materials themselves, we map the processes and companies that determine what ships on time, what slips, and where leverage quietly accumulates.</p><pre><code>Disclaimer: This analysis is for informational and educational purposes only and does not constitute investment, legal, or financial advice. The companies and technologies discussed are referenced solely for technical and industry analysis. Readers should conduct their own independent research and consult appropriate professionals before making any investment decisions.</code></pre><p>At CES 2026, <strong>Jensen Huang</strong> revealed details on the Vera Rubin (Blackwell Replacement) Platform and the one thing that was made clear: relying on Moore&#8217;s law to get exponential progress gains to save AI economics was no longer viable. The math had broken:</p><ul><li><p>Rubin GPUs deliver <strong>~5&#215; inference performance</strong> over Blackwell</p></li><li><p>With only <strong>~1.6&#215; more transistors</strong></p></li></ul><p>That gain didn&#8217;t come from a node shrink.<br>It came from breaking another, more sacred rule.</p><blockquote><p><strong>&#8220;Never redesign the whole system at once.&#8221;</strong></p></blockquote><p><strong>NVIDIA</strong> ignored it. In the age of AI the &#8220;safe choice&#8221; is now the risky option.</p><h3><strong>The Six Redesigned Chips</strong> </h3><p>They rebuilt the AI factory as a single machine &#8212; in one synchronized generation. 
Rubin is not &#8220;a GPU launch.&#8221; It&#8217;s a <strong>six&#8209;chip platform</strong>:</p><ol><li><p><strong>Vera CPU</strong></p></li><li><p><strong>Rubin GPU</strong></p></li><li><p><strong>NVLink 6 switch</strong></p></li><li><p><strong>ConnectX&#8209;9 SuperNIC</strong></p></li><li><p><strong>BlueField&#8209;4 DPU</strong></p></li><li><p><strong>Spectrum&#8209;6 / Spectrum&#8209;X Ethernet with co&#8209;packaged optics</strong></p></li></ol><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!9Q5f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff70717d3-20d0-492e-b31e-9aee2532a4f6_2076x540.png" alt=""></figure></div><h2><a href="https://rubin.fpx.world/">The Six Chips</a> &#8212; What Each One Actually Does (In Plain Terms)</h2><p>Think of Rubin not as a GPU upgrade, but as a <strong>machine where no part is allowed to wait</strong> and every chip exists to eliminate a different kind of stall.</p><div><hr></div><h3>1) <strong>Rubin GPU &#8212; The Thinker</strong></h3><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!LqBE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25487902-5c17-46f5-893a-857126f6ea18_3237x1557.png" alt=""></figure></div><p><strong>What it does:</strong><br>The Rubin GPU does the <em>thinking</em> &#8212; matrix math, attention, token generation.</p><p><strong>What changed:</strong><br>It&#8217;s not built to be &#8220;good at all math.&#8221; It&#8217;s built to be <strong>exceptionally good at the math AI actually uses</strong> today: low-precision inference and training (NVFP4).</p><p><strong>Why that matters:</strong><br>Modern AI isn&#8217;t limited by raw arithmetic. It&#8217;s limited by <strong>how fast data can be moved in and out of the GPU</strong>.<br>By lowering precision and redesigning the execution engine, Rubin gets far more <em>useful work</em> per transistor and per watt.</p><p><strong>Intuition:</strong><br>Rubin isn&#8217;t a faster brain &#8212; it&#8217;s a brain that wastes less time thinking about irrelevant details.</p>
srcset="https://substackcdn.com/image/fetch/$s_!DfPt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2baf0d7-64dc-4dc0-998b-6197cd9cbc54_3253x1580.png 424w, https://substackcdn.com/image/fetch/$s_!DfPt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2baf0d7-64dc-4dc0-998b-6197cd9cbc54_3253x1580.png 848w, https://substackcdn.com/image/fetch/$s_!DfPt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2baf0d7-64dc-4dc0-998b-6197cd9cbc54_3253x1580.png 1272w, https://substackcdn.com/image/fetch/$s_!DfPt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2baf0d7-64dc-4dc0-998b-6197cd9cbc54_3253x1580.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>What it does:</strong><br>The Vera CPU doesn&#8217;t &#8220;think.&#8221; It <strong>coordinates.</strong></p><p>It schedules work, manages memory addresses, launches kernels, and ensures GPUs are never idle.</p><p><strong>What changed:</strong><br>Unlike traditional CPUs designed for browsers, databases, and operating systems, Vera is purpose-built to <strong>feed accelerators</strong>.</p><p><strong>Why that matters:</strong><br>In large AI systems, GPUs don&#8217;t stall because they&#8217;re slow &#8212; they stall because the CPU can&#8217;t keep up with orchestration.</p><p><strong>Intuition:</strong><br>If the GPU is the engine, the Vera CPU is the pit crew &#8212; invisible when it&#8217;s good, catastrophic when it&#8217;s not.</p><div><hr></div><h3>3) <strong>NVLink 6 &#8212; The Internal Nervous System</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rRLo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8106bec8-e408-43c9-a7e6-f1bdf7a4d480_3266x1581.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rRLo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8106bec8-e408-43c9-a7e6-f1bdf7a4d480_3266x1581.png 424w, https://substackcdn.com/image/fetch/$s_!rRLo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8106bec8-e408-43c9-a7e6-f1bdf7a4d480_3266x1581.png 848w, https://substackcdn.com/image/fetch/$s_!rRLo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8106bec8-e408-43c9-a7e6-f1bdf7a4d480_3266x1581.png 1272w, https://substackcdn.com/image/fetch/$s_!rRLo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8106bec8-e408-43c9-a7e6-f1bdf7a4d480_3266x1581.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rRLo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8106bec8-e408-43c9-a7e6-f1bdf7a4d480_3266x1581.png" width="3266" height="1581" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8106bec8-e408-43c9-a7e6-f1bdf7a4d480_3266x1581.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1581,&quot;width&quot;:3266,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2287303,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://research.fpx.world/i/184256961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7fcb00c-c409-4e67-98e3-39ac3c68e803_3281x1594.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rRLo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8106bec8-e408-43c9-a7e6-f1bdf7a4d480_3266x1581.png 424w, https://substackcdn.com/image/fetch/$s_!rRLo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8106bec8-e408-43c9-a7e6-f1bdf7a4d480_3266x1581.png 848w, https://substackcdn.com/image/fetch/$s_!rRLo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8106bec8-e408-43c9-a7e6-f1bdf7a4d480_3266x1581.png 1272w, https://substackcdn.com/image/fetch/$s_!rRLo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8106bec8-e408-43c9-a7e6-f1bdf7a4d480_3266x1581.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 
7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>What it does:</strong><br>NVLink connects GPUs to each other <strong>as if they were one device</strong>.</p><p><strong>What changed:</strong><br>NVLink 6 is fast enough that dozens of GPUs can share work without constantly stopping to synchronize.</p><p><strong>Why that matters:</strong><br>Large models &#8212; especially Mixture-of-Experts &#8212; require GPUs to constantly exchange partial results.<br>If that exchange is slow, everything slows.</p><p><strong>Intuition:</strong><br>NVLink 6 removes the &#8220;waiting room&#8221; between GPUs. No queues, no traffic jams.</p><div><hr></div><h3>4) <strong>ConnectX-9 SuperNIC &#8212; The Exit Ramp</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7IkE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad2b27f-0f15-47aa-992a-9227560c9281_3261x1568.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7IkE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad2b27f-0f15-47aa-992a-9227560c9281_3261x1568.png 424w, https://substackcdn.com/image/fetch/$s_!7IkE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad2b27f-0f15-47aa-992a-9227560c9281_3261x1568.png 848w, https://substackcdn.com/image/fetch/$s_!7IkE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad2b27f-0f15-47aa-992a-9227560c9281_3261x1568.png 1272w, https://substackcdn.com/image/fetch/$s_!7IkE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad2b27f-0f15-47aa-992a-9227560c9281_3261x1568.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7IkE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad2b27f-0f15-47aa-992a-9227560c9281_3261x1568.png" width="3261" height="1568" 
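<p><em>A deliberately rough sketch of why that exchange matters, with made-up model dimensions (none of these numbers come from NVIDIA or any specific model): even modest assumptions turn expert routing into a constant, heavy stream of GPU-to-GPU traffic.</em></p><pre><code># Toy estimate of the cross-GPU "expert exchange" traffic in a Mixture-of-Experts
# layer. Every number below is an assumption chosen for illustration; real models
# and real routing differ.

HIDDEN_DIM      = 8192      # activation width per token (assumed)
BYTES_PER_VALUE = 2         # BF16 activations (assumed)
MOE_LAYERS      = 60        # MoE layers that route tokens off-GPU (assumed)
TOKENS_PER_SEC  = 50_000    # tokens processed per GPU per second (assumed)

# Each routed token ships its activation to a remote expert and gets the result
# shipped back: roughly 2 transfers of HIDDEN_DIM values per MoE layer.
bytes_per_token = 2 * HIDDEN_DIM * BYTES_PER_VALUE * MOE_LAYERS
traffic_gb_s = bytes_per_token * TOKENS_PER_SEC / 1e9

print(f"~{bytes_per_token / 1e6:.1f} MB of expert exchange per token")
print(f"~{traffic_gb_s:.0f} GB/s of sustained GPU-to-GPU traffic per GPU")
</code></pre>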
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8ad2b27f-0f15-47aa-992a-9227560c9281_3261x1568.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1568,&quot;width&quot;:3261,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2150727,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://research.fpx.world/i/184256961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcaa67791-05c3-4091-9c0b-1fb99d4af3a1_3281x1594.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7IkE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad2b27f-0f15-47aa-992a-9227560c9281_3261x1568.png 424w, https://substackcdn.com/image/fetch/$s_!7IkE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad2b27f-0f15-47aa-992a-9227560c9281_3261x1568.png 848w, https://substackcdn.com/image/fetch/$s_!7IkE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad2b27f-0f15-47aa-992a-9227560c9281_3261x1568.png 1272w, https://substackcdn.com/image/fetch/$s_!7IkE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad2b27f-0f15-47aa-992a-9227560c9281_3261x1568.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>What it does:</strong><br>ConnectX-9 moves data <strong>between racks and clusters</strong>.</p><p><strong>What changed:</strong><br>It can write data directly into GPU memory (RDMA) without waking up the CPU.</p><p><strong>Why that matters:</strong><br>Cross-rack communication is where most large clusters lose performance. 
Every unnecessary hop adds latency and idle time.</p><p><strong>Intuition:</strong><br>ConnectX-9 is the highway on-ramp that lets GPUs talk to the outside world without stopping at toll booths.</p><div><hr></div><h3>5) <strong>BlueField-4 DPU &#8212; The Memory That Lets AI Think Longer</strong></h3><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!ZDsQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F669c5b06-98a8-40c6-82c2-c378c57ec4a4_3226x1401.png" alt=""></figure></div><p><strong>This is the most important chip people misunderstand.</strong></p><p><strong>What it does:</strong><br>BlueField-4 manages <strong>context memory</strong> &#8212; the long-term working memory of AI models.</p><p><strong>What changed:</strong><br>Reasoning and agentic models need to remember <em>a lot</em>: conversation history, tool outputs, intermediate steps.<br>That memory lives in the <strong>KV cache</strong> &#8212; and it grows fast.</p><p>Instead of letting the GPU&#8217;s HBM fill up with old context, BlueField-4 provides a <strong>dedicated memory tier</strong>:</p><ul><li><p>Massive &#8220;far memory&#8221; for KV cache</p></li><li><p>GPU-adjacent, low-latency, secure</p></li></ul><p><strong>Why that matters:</strong><br>HBM should be used for <strong>active thinking</strong>, not hoarding old memories.</p><p><strong>Intuition:</strong><br>BlueField-4 is the difference between:</p><ul><li><p>a human trying to think while holding every past conversation in their head</p></li><li><p>and one who can write notes instantly and recall them without effort</p></li></ul><p>This is what makes long-horizon reasoning and agents economically viable.</p>
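<p><em>To make &#8220;it grows fast&#8221; concrete, here is a rough KV-cache estimate. The model dimensions are assumptions for illustration, not any real model&#8217;s configuration; the 288 GB of HBM is the per-GPU figure quoted later in this piece.</em></p><pre><code># Rough KV-cache arithmetic: why "remembering" eats GPU memory so fast.
# All model dimensions below are assumptions for illustration only.

LAYERS        = 96          # transformer layers (assumed)
KV_HEADS      = 8           # grouped-query KV heads (assumed)
HEAD_DIM      = 128         # dimension per head (assumed)
BYTES_PER_VAL = 2           # FP16/BF16 cache entries (assumed)

# Each token stores a key vector and a value vector in every layer.
kv_bytes_per_token = LAYERS * KV_HEADS * HEAD_DIM * 2 * BYTES_PER_VAL

context_tokens   = 128_000        # one long-running agent session (assumed)
sessions_per_gpu = 64             # concurrent sessions pinned to one GPU (assumed)
hbm_per_gpu_gb   = 288            # per-GPU HBM4 figure quoted in this article

cache_gb = kv_bytes_per_token * context_tokens * sessions_per_gpu / 1e9
print(f"{kv_bytes_per_token / 1e3:.0f} KB of KV cache per token")
print(f"{cache_gb:,.0f} GB of cache for {sessions_per_gpu} long sessions "
      f"vs {hbm_per_gpu_gb} GB of HBM")
</code></pre><p><em>On those assumed numbers the cache alone would be roughly ten times the GPU&#8217;s HBM, which is the economic argument for a dedicated far-memory tier.</em></p>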
<div><hr></div><h3>6) <strong>Spectrum-X Ethernet &#8212; The Referee</strong></h3><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!_CHO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e82116c-4a24-4b6d-b58e-339e0111fdd2_3252x1475.png" alt=""></figure></div><p><strong>What it does:</strong><br>Spectrum-X keeps the network <strong>predictable under chaos</strong>.</p><p><strong>What changed:</strong><br>Traditional Ethernet drops packets when congested &#8212; fine for web traffic, disastrous for AI training and inference.</p><p>Spectrum-X uses telemetry from the NICs and DPUs to manage congestion <strong>before</strong> it happens.</p><p><strong>Why that matters:</strong><br>When packets stall, GPUs stall. When GPUs stall, the entire factory loses money.</p><p><strong>Intuition:</strong><br>Spectrum-X doesn&#8217;t make the network faster &#8212; it makes it <strong>reliable when everything is talking at once</strong>.</p><div><hr></div><h3>The System-Level Insight (Why All Six Had to Be Built Together)</h3><p>Each chip removes a different bottleneck:</p><ul><li><p>GPU &#8594; compute stall</p></li><li><p>CPU &#8594; orchestration stall</p></li><li><p>NVLink &#8594; synchronization stall</p></li><li><p>ConnectX &#8594; cluster stall</p></li><li><p>BlueField &#8594; memory/context stall</p></li><li><p>Spectrum &#8594; network stall</p></li></ul><p><strong>If you fix only one, another becomes the limit.</strong></p><p>Here&#8217;s the trap NVIDIA has effectively avoided: once you&#8217;re training and serving frontier models at scale, the GPU stops being &#8220;the computer&#8221; and becomes <strong>one stage in a pipeline</strong>, and pipelines don&#8217;t speed up when only one stage gets faster. MoE makes this visible (less compute per token, more expert routing), but it&#8217;s not a MoE story; it&#8217;s the trajectory of <em>all</em> future AI and robotics workloads: more sparsity, more modularity, more agents, more retrieval, more multimodal streams, more long-horizon state, tighter real&#8209;time loops, which means the scarce resource shifts from raw FLOPs to <strong>movement and coordination of state</strong> (bandwidth, latency, memory hierarchy, synchronization, determinism). So a 5&#215; faster GPU stapled onto a 2&#215; faster fabric doesn&#8217;t produce 5&#215; progress&#8212;it produces a machine that <strong>waits 3&#215; more efficiently</strong>, burning capital and watts to idle.</p>
<p>Rubin&#8217;s core logic is the only sustainable escape hatch: stop measuring chips and start measuring <strong>utilization</strong> by co-designing the whole path from compute &#8594; memory tiers &#8594; scale-up &#8594; scale-out &#8594; orchestration, so the bottleneck can&#8217;t simply migrate to &#8220;the interface.&#8221; That&#8217;s what makes it future&#8209;proof: it&#8217;s not optimized for one model trend, it&#8217;s optimized for the invariant of the next decade&#8212;<strong>workloads will keep changing, but the cost of waiting will keep compounding.</strong><br><br>That&#8217;s why Rubin works &#8212; and why Moore&#8217;s Law alone can&#8217;t deliver these gains anymore.<br><br><strong><a href="https://rubin.fpx.world/">IF YOU WANT A VISUAL SUPPLY CHAIN BREAKDOWN CHECK THIS LINK OUT. WE HAVE THE SUPPLY CHAIN + ALL COMPANIES (WITH TICKER IF PUBLIC) ON HERE. </a></strong></p><h1>The Supply Chain Breakdown</h1><h2>From silica to a Vera Rubin NVL72 rack &#8212; how six &#8220;chips&#8221; become an AI factory</h2><p>Before you can buy or build a <strong>Vera Rubin NVL72</strong> rack, you have to build something much stranger: a global relay race that starts with <strong>silica rock</strong> and ends with a <strong>liquid&#8209;cooled, rack&#8209;scale supercomputer</strong>.</p><p>NVIDIA&#8217;s own &#8220;ground truth&#8221; specs are a useful anchor: <strong>Vera Rubin NVL72</strong> is a rack system with <strong>72 Rubin GPUs</strong> and <strong>36 Vera CPUs</strong>, delivering <strong>20.7 TB of HBM4</strong> and <strong>54 TB of LPDDR5X</strong>, plus a rack&#8209;scale NVLink domain (<strong>260 TB/s</strong>) and scale&#8209;out bandwidth (<strong>115 TB/s</strong>). <br><br>Per Rubin GPU, NVIDIA publishes <strong>288 GB HBM4</strong>, <strong>22 TB/s memory bandwidth</strong>, <strong>3.6 TB/s NVLink</strong>, and a key clue: <strong>&#8220;Total NVIDIA + HBM4 chips: 12.&#8221;</strong></p><p>That &#8220;12 chips&#8221; line is the tell: what looks like &#8220;one GPU&#8221; is actually a <strong>dense multi&#8209;chip package</strong> where memory and advanced packaging are as important as the compute die itself. And that&#8217;s why this supply chain reads like a story.</p>
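<p><em>A quick sanity check (our arithmetic, using only per-device figures NVIDIA publishes and that appear in this piece, including the 1.5 TB of LPDDR5X per Vera CPU and 1.6 TB/s of scale-out bandwidth per GPU covered below): the rack totals are simply the per-GPU and per-CPU numbers multiplied out.</em></p><pre><code># Sanity check: the NVL72 rack numbers are the per-device numbers rolled up.
# The inputs are NVIDIA-published figures quoted in this article; the roll-up
# itself is our arithmetic.

GPUS_PER_RACK = 72
CPUS_PER_RACK = 36

HBM4_PER_GPU_GB       = 288     # per Rubin GPU
NVLINK_PER_GPU_TB_S   = 3.6     # per Rubin GPU
LPDDR5X_PER_CPU_TB    = 1.5     # per Vera CPU (SOCAMM)
SCALEOUT_PER_GPU_TB_S = 1.6     # per GPU via ConnectX-9

print(f"HBM4 per rack:    {GPUS_PER_RACK * HBM4_PER_GPU_GB / 1000:.1f} TB")    # 20.7 TB
print(f"LPDDR5X per rack: {CPUS_PER_RACK * LPDDR5X_PER_CPU_TB:.0f} TB")        # 54 TB
print(f"NVLink domain:    {GPUS_PER_RACK * NVLINK_PER_GPU_TB_S:.1f} TB/s")     # ~260 TB/s
print(f"Scale-out fabric: {GPUS_PER_RACK * SCALEOUT_PER_GPU_TB_S:.1f} TB/s")   # ~115 TB/s
</code></pre>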
<p>Below is the story structure we&#8217;ll keep repeating for each major component:</p><ol><li><p><strong>Atoms &#8594; Ingredients</strong> (silica, copper, fluorine chemistry, ultra&#8209;pure gases, ultrapure water)</p></li><li><p><strong>Blueprints</strong> (EDA + IP + verification)</p></li><li><p><strong>Printing</strong> (foundry wafer fabrication)</p></li><li><p><strong>Grading</strong> (wafer test: keep only known&#8209;good die)</p></li><li><p><strong>Boxing</strong> (packaging: substrates, bumps, underfill, 2.5D/3D integration, test)</p></li><li><p><strong>Turning parts into products</strong> (PCBs, connectors, optics, cooling hardware, system integration)</p></li></ol><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!hFOm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6da259a-0a58-493c-a2e0-820e2a54993c_2816x1536.png" alt=""></figure></div><p>And along the way we&#8217;ll call out the <strong>industrial &#8220;monopolies&#8221; / chokepoints</strong> investors and operators should care about.</p><div><hr></div><h2>Chapter 0 &#8212; The shared foundation: how &#8220;sand&#8221; becomes chip&#8209;grade reality</h2><h3>0.1 The silica chain: from quartz to 300mm wafers (the &#8220;canvas&#8221;)</h3><p>It starts with <strong>silica (SiO&#8322;)</strong> &#8212; but not beach sand.
Semiconductor supply chains lean on <strong>high&#8209;purity quartz</strong> because impurity levels that don&#8217;t matter in construction can destroy chip yields.</p><p>A simplified chain:</p><ul><li><p><strong>Quartz / silica ore</strong> &#8594; high&#8209;purity feedstock</p><ul><li><p>Examples: <strong>Imerys</strong> (EPA: NK), <strong>Sibelco</strong> (private), <strong>The Quartz Corp</strong> (private)</p></li></ul></li><li><p><strong>Silica + carbon + electricity</strong> &#8594; <strong>metallurgical&#8209;grade silicon (MG&#8209;Si)</strong></p><ul><li><p>Examples: <strong>Elkem</strong> (OSE: ELK), <strong>Ferroglobe</strong> (NASDAQ: GSM)</p></li></ul></li><li><p>MG&#8209;Si &#8594; <strong>semiconductor&#8209;grade polysilicon</strong> (purified to extreme levels)</p><ul><li><p>Examples: <strong>Wacker Chemie</strong> (XETRA: WCH), <strong>Tokuyama</strong> (TSE: 4043)</p></li></ul></li><li><p>Polysilicon &#8594; <strong>300mm prime wafers</strong> (the wafer is the &#8220;blank canvas&#8221;)</p><ul><li><p>A small club of suppliers matters here. TSMC itself lists a set of major raw wafer suppliers and notes the top group accounts for the vast majority of global raw wafer supply.</p></li><li><p>Names you&#8217;ll see repeatedly: <strong>Shin&#8209;Etsu Chemical</strong> (TSE: 4063), <strong>SUMCO</strong> (TSE: 3436), <strong>GlobalWafers</strong> (TWSE: 6488), <strong>Siltronic</strong> (XETRA: WAF), <strong>SK siltron</strong> (subsidiary of <strong>SK Inc.</strong> (KRX: 034730)), <strong>Formosa Sumco Technology</strong> (TWSE: 3532).</p></li></ul></li></ul><p><strong>Why this step must exist:</strong> if the wafer (the canvas) has defects, <em>every</em> layer printed later inherits that problem. On modern nodes, a tiny defect can wipe out a large die, and Rubin&#8209;class silicon tends to use <strong>large die area</strong> and <strong>advanced packaging</strong>, which multiplies the cost of bad yield.</p><p><strong>Output:</strong> ultra&#8209;flat, ultra&#8209;pure 300mm wafers ready for foundries and memory fabs.</p><div><hr></div><h3>0.2 The &#8220;bloodstream&#8221;: ultra&#8209;pure gases, chemicals, and ultrapure water</h3><p>Chipmaking is not just silicon &#8212; it&#8217;s chemistry logistics.</p><ul><li><p>Industrial gas majors deliver <strong>ultra&#8209;high&#8209;purity (UHP) nitrogen/argon/oxygen/hydrogen</strong> and specialty gases:</p><ul><li><p><strong>Linde</strong> (NYSE: LIN), <strong>Air Liquide</strong> (EPA: AI), <strong>Air Products</strong> (NYSE: APD)</p></li></ul></li><li><p>Water becomes <strong>ultrapure water (UPW)</strong> (fabs consume it constantly):</p><ul><li><p><strong>Veolia</strong> (EPA: VIE), <strong>Ecolab</strong> (NYSE: ECL), <strong>Kurita Water Industries</strong> (TSE: 6370)</p></li></ul></li></ul><p><strong>Why it must exist:</strong> if gas purity drops or UPW supply hiccups, fabs don&#8217;t &#8220;slow down&#8221; &#8212; they often stop, because contamination kills yield.</p><p><strong>Output:</strong> the continuous consumable stream that makes high&#8209;yield manufacturing possible.</p><div><hr></div><h3>0.3 The stencils: masks, resists, and the &#8220;photography&#8221; step</h3><p>You can&#8217;t build a chip without repeatedly &#8220;printing&#8221; patterns onto wafers.</p><p>Key roles:</p><ul><li><p><strong>Photomask blanks</strong> (ultra&#8209;flat glass with demanding defect specs)</p><ul><li><p>Examples: <strong>HOYA</strong> (TSE: 7741), <strong>AGC</strong> (TSE: 
5201)</p></li></ul></li><li><p><strong>Mask writing / photomasks (reticles)</strong></p><ul><li><p>Examples: <strong>TOPPAN</strong> (TSE: 7911), <strong>Dai Nippon Printing</strong> (TSE: 7912), <strong>Photronics</strong> (NASDAQ: PLAB)</p></li></ul></li><li><p><strong>Photoresists + developers</strong></p><ul><li><p>Examples: <strong>Tokyo Ohka Kogyo (TOK)</strong> (TSE: 4186), <strong>Shin&#8209;Etsu</strong> (TSE: 4063), <strong>DuPont</strong> (NYSE: DD), <strong>Merck KGaA / EMD Electronics</strong> (XETRA: MRK)</p></li></ul></li></ul><p><strong>Why it must exist:</strong> lithography is basically <strong>nano&#8209;scale photography</strong> &#8212; resist is the light&#8209;sensitive film, and masks are the negatives.</p><p><strong>Output:</strong> wafer layers that can be etched/deposited into transistors and interconnect.</p><div><hr></div><h3>0.4 The machine tools: the factories that build the factories</h3><p>Even if NVIDIA is the &#8220;architect,&#8221; <strong>toolmakers</strong> determine what&#8217;s physically manufacturable.</p><ul><li><p>Lithography: <strong>ASML</strong> (NASDAQ: ASML) is the critical name in EUV. ASML itself states it is <strong>the only company that makes EUV lithography technology</strong>.</p></li><li><p>Deposition/etch: <strong>Applied Materials</strong> (NASDAQ: AMAT), <strong>Lam Research</strong> (NASDAQ: LRCX), <strong>Tokyo Electron</strong> (TSE: 8035)</p></li><li><p>Inspection/metrology: <strong>KLA</strong> (NASDAQ: KLAC)</p></li></ul><p><strong>A fun (and sobering) fact:</strong> EUV is so extreme that ASML describes it as using <strong>13.5 nm</strong> wavelength light &#8212; almost x&#8209;ray range. <br>ASML also explains EUV light generation as a <strong>tin&#8209;droplet plasma</strong> process &#8212; a laser hits tiny droplets of tin to create EUV light.<br>IBM notes these EUV tools are shipped in pieces (think jumbo jets), and contain <strong>over 100,000 parts</strong>.</p><p><strong>Why it must exist:</strong> at these nodes, you don&#8217;t &#8220;buy a machine,&#8221; you buy a decade of physics and supply chain integration.</p><p><strong>Output:</strong> the capability to manufacture leading&#8209;edge logic, advanced DRAM layers, and high&#8209;density interconnect.</p><div><hr></div><h3>0.5 The invisible chokepoint: EDA software (blueprints you can actually build)</h3><p>Before any wafer exists, Rubin&#8209;class silicon must be designed, verified, and &#8220;signed off&#8221; with industrial EDA toolchains:</p><ul><li><p><strong>Synopsys</strong> (NASDAQ: SNPS)</p></li><li><p><strong>Cadence</strong> (NASDAQ: CDNS)</p></li><li><p><strong>Siemens EDA</strong> (part of <strong>Siemens AG</strong>, XETRA: SIE)</p></li></ul><p>A major industry shift: Synopsys completed its acquisition of <strong>Ansys</strong> in July 2025, expanding the &#8220;silicon to systems&#8221; simulation stack under one roof.</p><p><strong>Why it must exist:</strong> modern chips are too complex to &#8220;eyeball.&#8221; EDA is the compiler <em>and</em> the building inspector.</p><p><strong>Output:</strong> tapeout&#8209;ready design databases that can be turned into masks and then silicon.</p><div><hr></div><h1>The six core Rubin&#8209;platform chips</h1><p>NVIDIA frames Rubin as <strong>six new chips</strong> working as one system: <strong>Vera CPU, Rubin GPU, NVLink 6 switch, ConnectX&#8209;9, BlueField&#8209;4 DPU, and Spectrum&#8209;6 Ethernet</strong>.</p><p>We&#8217;ll follow each from &#8220;atoms &#8594; 
product.&#8221;</p><div><hr></div><h2>1) Rubin GPU: the execution engine (and the packaging + HBM supply chain magnet)</h2><h3>What it is (NVIDIA ground truth)</h3><p>Per Rubin GPU in NVL72, NVIDIA publishes:</p><ul><li><p><strong>288 GB HBM4</strong></p></li><li><p><strong>22 TB/s memory bandwidth</strong></p></li><li><p><strong>3.6 TB/s NVLink per GPU</strong></p></li><li><p><strong>&#8220;Total NVIDIA + HBM4 chips: 12&#8221;</strong></p></li></ul><p>That last line is why Rubin is a supply&#8209;chain story: you&#8217;re not just buying a die &#8212; you&#8217;re buying a <strong>multi&#8209;chip package + stacked memory skyscrapers + the ability to assemble them at yield</strong>.</p><div><hr></div><h3>The Rubin GPU manufacturing story</h3><h4>Step 1 &#8212; Blueprint (NVIDIA + EDA ecosystem)</h4><ul><li><p><strong>NVIDIA</strong> <strong>(NASDAQ: NVDA)</strong> defines the GPU architecture, power/thermal envelope, NVLink behavior, memory interface, and packaging targets.</p></li><li><p><strong>EDA stack:</strong> <strong>Synopsys (SNPS)</strong>, <strong>Cadence (CDNS)</strong>, <strong>Siemens (SIE)</strong>. (These are the design compilers and signoff inspectors.)</p></li></ul><p><strong>Why it matters:</strong> design choices determine whether the chip is &#8220;just hard&#8221; or &#8220;physically manufacturable at yield.&#8221;</p><div><hr></div><h4>Step 2 &#8212; Printing compute silicon (foundry + EUV toolchain)</h4><p>NVIDIA doesn&#8217;t typically publish the exact foundry node for each die on marketing pages. Industry reporting and the broader ecosystem point strongly to <strong>TSMC</strong> leadership&#8209;node manufacturing for Rubin&#8209;generation silicon &#8212; but treat node specifics as &#8220;reported,&#8221; not NVIDIA&#8209;confirmed unless NVIDIA explicitly states it.</p><p><strong>Company cast (the &#8220;print shop&#8221;):</strong></p><ul><li><p>Foundry: <strong>TSMC</strong> <strong>(NYSE: TSM; TWSE: 2330)</strong></p></li><li><p>Lithography (EUV monopoly): <strong>ASML</strong> <strong>(NASDAQ: ASML)</strong></p></li><li><p>Deposition/etch: <strong>AMAT</strong>, <strong>LRCX</strong>, <strong>TEL</strong></p></li><li><p>Inspection: <strong>KLA</strong></p></li></ul><p><strong>Why it matters:</strong> wafer starts on advanced nodes are capacity&#8209;constrained by:</p><ul><li><p>EUV tool availability (ASML is sole supplier)</p></li><li><p>ramp yield physics</p></li><li><p>tool install/qualification time</p></li></ul><p><strong>Output:</strong> wafers containing Rubin compute die(s).</p><div><hr></div><h4>Step 3 &#8212; Grading (wafer test: keep only the A&#8209;students)</h4><p>Before expensive packaging, wafer probe identifies <strong>known&#8209;good die</strong>:</p><ul><li><p>Test systems: <strong>Advantest</strong> <strong>(TSE: 6857),</strong> <strong>Teradyne</strong> <strong>(NASDAQ: TER)</strong></p></li><li><p>Probe cards: <strong>FormFactor</strong> <strong>(NASDAQ: FORM)</strong></p></li></ul><p><strong>Why it matters:</strong> in advanced packaging, <strong>a single bad die can scrap an entire assembled module</strong>. 
So test throughput becomes a &#8220;hidden&#8221; bottleneck.</p><p><strong>Output:</strong> known&#8209;good die maps.</p><div><hr></div><h4>Step 4 &#8212; HBM4: the memory skyscraper (and the new &#8220;base die&#8221; trap)</h4><p><strong><a href="https://research.fpx.world/p/part-1-beyond-power-the-ai-memory">We already have a dedicated piece on this but</a></strong> HBM4 is the &#8220;skyscraper next to the GPU.&#8221; But the part many models miss is: HBM4 increasingly includes a <strong>logic base die</strong> (a foundry&#8209;made logic layer), and this base die can be <strong>customer&#8209;specific</strong> &#8212; reducing interchangeability between memory suppliers. Reuters has described the move toward customer&#8209;specific logic dies in next&#8209;gen HBM as a factor that tightens supply flexibility.</p><p><strong>HBM4 company cast:</strong></p><ul><li><p>Memory makers (DRAM stacks):</p><ul><li><p><strong>SK hynix</strong> <strong>(KRX: 000660)</strong></p></li><li><p><strong>Samsung Electronics</strong> <strong>(KRX: 005930)</strong></p></li><li><p><strong>Micron</strong> <strong>(NASDAQ: MU)</strong></p></li></ul></li><li><p>Foundry involvement (base die): frequently <strong>TSMC</strong> (TSM) and/or <strong>Samsung Foundry</strong> (005930), depending on supplier approach.</p></li></ul><p><strong>Concrete &#8220;base die&#8221; breadcrumbs from the suppliers themselves:</strong></p><ul><li><p><strong>SK hynix</strong> has announced it plans to adopt <strong>TSMC&#8217;s logic process</strong> for the <strong>HBM4 base die</strong>.</p></li><li><p><strong>Samsung</strong> has described its HBM4 as using a <strong>4nm logic base die</strong>.</p></li><li><p><strong>Micron</strong> has stated it is working with <strong>TSMC</strong> on <strong>HBM4E base logic die</strong> development.</p></li></ul><p><strong>Why this step must exist:</strong> bandwidth and power constraints push memory closer to compute, and stacking (HBM) is how you get &#8220;warehouse&#8209;scale bandwidth&#8221; without burning impossible power.</p><p><strong>Output:</strong> qualified HBM4 stacks ready for integration into the GPU package.</p><div><hr></div><h4>Step 5 &#8212; Advanced packaging: where Rubin often becomes supply&#8209;constrained</h4><p>Rubin&#8209;class GPUs don&#8217;t just get &#8220;packaged,&#8221; they get <strong>assembled into a 2.5D integrated module</strong> (compute die + multiple HBM stacks) that behaves like a mini&#8209;motherboard inside the package.</p><p>Two major chokepoints here:</p><ol><li><p><strong>CoWoS&#8209;class integration capacity</strong></p><ul><li><p>TSMC&#8217;s annual report describes <strong>CoWoS&#8209;L</strong> (an RDL&#8209;based CoWoS variant) and notes that <strong>CoWoS&#8209;L entered volume production in 2024</strong>.</p></li></ul></li><li><p><strong>Substrate ecosystem (ABF + substrate makers)</strong></p><ul><li><p><strong>Ajinomoto Build&#8209;up Film (ABF)</strong> is the dielectric &#8220;plywood&#8221; used in high&#8209;end substrates. 
Ajinomoto itself has claimed extremely high share (&#8220;near 100%&#8221;) in ABF for package substrates.</p></li><li><p>Substrate manufacturers (where ABF turns into dense wiring):</p><ul><li><p><strong>Ibiden</strong> (TSE: 4062)</p></li><li><p><strong>Shinko Electric</strong> (TSE: 6967)</p></li><li><p><strong>Unimicron</strong> (TWSE: 3037)</p></li><li><p><strong>AT&amp;S</strong> (VIE: ATS)</p></li><li><p>(often also <strong>Kinsus</strong> (TWSE: 3189), <strong>Samsung Electro&#8209;Mechanics</strong> (KRX: 009150) in the broader ecosystem)</p></li></ul></li></ul></li></ol><p><strong>Why it matters:</strong> even if you have enough GPU dies and enough HBM stacks, you can still be blocked by:</p><ul><li><p>CoWoS capacity</p></li><li><p>substrate yield/availability</p></li><li><p>underfill/encapsulation materials (reliability)</p></li><li><p>packaging test throughput</p></li></ul><p><strong>Output:</strong> a finished Rubin GPU package that can be mounted into server trays.</p><div><hr></div><h3>Why Rubin is &#8220;the bottleneck item&#8221; in plain English</h3><p>Rubin&#8217;s supply isn&#8217;t governed by one factory. It&#8217;s governed by the <strong>minimum</strong> of many constrained pipelines:</p><ul><li><p>EUV/advanced node capacity (ASML + foundry)</p></li><li><p>HBM4 supply (memory fabs) + <strong>logic base die</strong> coupling to foundry capacity</p></li><li><p>CoWoS&#8209;class packaging capacity (TSMC)</p></li><li><p>ABF/substrate availability (Ajinomoto + substrate makers)</p></li></ul><p>This is why GPU &#8220;demand&#8221; often shows up first as <strong>HBM allocation</strong> and <strong>advanced packaging capex</strong>, not as &#8220;more silicon wafers.&#8221;</p><div><hr></div><h2>2) Vera CPU: the orchestration engine (and the SOCAMM memory story)</h2><h3>What it is (NVIDIA disclosures)</h3><p>NVIDIA describes Vera CPU as:</p><ul><li><p><strong>88 custom Olympus CPU cores / 176 threads</strong>, Arm&#8209;compatible</p></li><li><p><strong>1.8 TB/s NVLink&#8209;C2C</strong> enabling coherent CPU&#8209;GPU memory</p></li><li><p>Up to <strong>1.5 TB LPDDR5X</strong> via <strong>SOCAMM</strong> modules, delivering up to <strong>1.2 TB/s</strong> memory bandwidth</p></li></ul><p>This matters because Vera isn&#8217;t &#8220;just a host CPU.&#8221; NVIDIA positions it as a high&#8209;bandwidth data movement engine that keeps GPUs utilized at rack scale.</p><div><hr></div><h3>The Vera CPU manufacturing story</h3><h4>Step 1 &#8212; Blueprint (NVIDIA + Arm ecosystem + EDA)</h4><ul><li><p>Architect + integrator: <strong>NVIDIA</strong> (NVDA)</p></li><li><p>ISA/software compatibility: <strong>Arm Holdings</strong> (NASDAQ: ARM) ecosystem (Vera is Arm&#8209;compatible per NVIDIA&#8217;s description).</p></li><li><p>EDA: <strong>SNPS / CDNS / SIE</strong></p></li></ul><p><strong>Why it matters:</strong> CPUs are &#8220;control&#8209;heavy&#8221; silicon; performance is driven by microarchitecture <em>and</em> memory subsystem design. 
Vera&#8217;s defining feature is its <strong>memory bandwidth + coherency</strong> story.</p><p><strong>Output:</strong> tapeout&#8209;ready Vera design.</p><div><hr></div><h4>Step 2 &#8212; Printing (foundry)</h4><p>Same global machine tool stack applies:</p><ul><li><p>Foundry: often assumed to be <strong>TSMC</strong> for this generation, but treat foundry/node specifics as not officially disclosed unless NVIDIA says so on record.</p></li></ul><p><strong>Output:</strong> CPU wafers.</p><div><hr></div><h4>Step 3 &#8212; Grading (test)</h4><p>Same test ecosystem: <strong>Advantest / Teradyne / FormFactor</strong>.</p><div><hr></div><h4>Step 4 &#8212; Packaging (large CPU packages are substrate&#8209;hungry)</h4><p>Vera packages are large, high&#8209;pin&#8209;count, power&#8209;dense devices. They lean heavily on:</p><ul><li><p>ABF film ecosystem (<strong>Ajinomoto</strong>)</p></li><li><p>High&#8209;density substrates (<strong>Ibiden, Shinko, Unimicron, AT&amp;S</strong>)</p></li><li><p>OSAT/foundry packaging services (often <strong>ASE</strong> (NYSE: ASX / TWSE: 3711), <strong>Amkor</strong> (NASDAQ: AMKR), and foundry packaging for leading&#8209;edge designs)</p></li></ul><p><strong>Output:</strong> packaged Vera CPUs ready for system integration.</p><div><hr></div><h2>3) BlueField&#8209;4 DPU: the infrastructure processor (the &#8220;control plane in silicon&#8221;)</h2><h3>What it is (NVIDIA disclosures)</h3><p>NVIDIA positions BlueField&#8209;4 as part of the six&#8209;chip Rubin platform lineup. <br>In the NVL72 architecture NVIDIA describes, BlueField&#8209;4 plays a rack&#8209;scale role: secure and accelerate networking, storage, and infrastructure services.</p><p>NVIDIA also describes BlueField&#8209;4 in an &#8220;Inference Context Memory Storage (ICMS)&#8221; context as delivering:</p><ul><li><p><strong>800 Gb/s networking</strong></p></li><li><p>A <strong>64&#8209;core Grace CPU</strong> component</p></li><li><p><strong>high&#8209;bandwidth LPDDR memory</strong></p></li><li><p>line&#8209;rate data integrity/encryption features</p></li></ul><p><em>(For investors/operators: that is NVIDIA explicitly tying a DPU product to both compute and memory requirements &#8212; it&#8217;s not a tiny sidecar.)</em></p><div><hr></div><h3>The BlueField&#8209;4 manufacturing story (high&#8209;level)</h3><p>BlueField looks like a &#8220;network card,&#8221; but supply chain&#8209;wise it&#8217;s closer to: <strong>advanced SoC + packaging + board assembly</strong>.</p><ul><li><p><strong>Blueprint:</strong> NVIDIA (NVDA) + EDA (SNPS/CDNS/SIE)</p></li><li><p><strong>Printing:</strong> leading foundry manufacturing (likely TSMC for Rubin&#8209;generation silicon; treat node per chip as not always publicly itemized)</p></li><li><p><strong>Grading:</strong> Advantest/Teradyne/FormFactor</p></li><li><p><strong>Packaging:</strong> ABF + substrate + OSAT assembly</p></li><li><p><strong>Board:</strong> high&#8209;speed PCB laminates, connectors, VRMs, clocks; then EMS/ODM assembly</p></li></ul><p><strong>Output:</strong> BlueField&#8209;4 modules/cards integrated into Rubin compute trays as the &#8220;infrastructure processor.&#8221;</p><div><hr></div><h2>4) ConnectX&#8209;9 SuperNIC: the endpoint that keeps scale&#8209;out from melting down</h2><h3>What it is (NVIDIA disclosures)</h3><p>NVIDIA states in its Rubin platform write&#8209;up that <strong>each compute tray includes four ConnectX&#8209;9 SuperNIC boards</strong>, delivering <strong>1.6 TB/s per Rubin GPU</strong> for scale&#8209;out 
networking. <br>ConnectX&#8209;9 is listed as one of the six Rubin platform chips.</p><div><hr></div><h3>The ConnectX&#8209;9 manufacturing story</h3><h4>Step 1 &#8212; Blueprint (SerDes, PCIe, congestion control logic)</h4><p>NVIDIA designs the ASIC. High&#8209;speed IO chips are design&#8209;heavy: their &#8220;secret sauce&#8221; includes SerDes, congestion control behaviors, and offloads &#8212; all of which drive <strong>link reliability</strong> and <strong>latency under load</strong>.</p><p>EDA: SNPS/CDNS/SIE.</p><div><hr></div><h4>Step 2 &#8212; Printing (foundry + EUV/DUV stack)</h4><p>ConnectX&#8209;9 is a high&#8209;performance networking ASIC, likely manufactured on advanced logic nodes.</p><p>The upstream chokepoint is the same: lithography tools (ASML), etch/deposition (AMAT/LRCX/TEL), inspection (KLA).</p><div><hr></div><h4>Step 3 &#8212; Grading (test)</h4><p>Test throughput matters because NIC demand scales with GPUs.</p><div><hr></div><h4>Step 4 &#8212; Packaging + &#8220;board branch&#8221;</h4><p>ConnectX&#8209;9 becomes real supply chain complexity when it turns into a <strong>board</strong>:</p><ul><li><p><strong>High&#8209;speed PCB materials</strong>: low&#8209;loss laminates</p><ul><li><p>Examples: <strong>Panasonic</strong> (TSE: 6752), <strong>Rogers</strong> (NYSE: ROG)</p></li></ul></li><li><p><strong>PCB fabrication</strong> (controlled impedance, multi&#8209;layer):</p><ul><li><p><strong>TTM Technologies</strong> (NASDAQ: TTMI), <strong>Unimicron</strong> (TWSE: 3037), <strong>Zhen Ding Tech</strong> (TWSE: 4958)</p></li></ul></li><li><p><strong>Connectors / cages</strong> (OSFP/QSFP ecosystems):</p><ul><li><p><strong>TE Connectivity</strong> (NYSE: TEL), <strong>Amphenol</strong> (NYSE: APH), <strong>Molex</strong> (private), <strong>Samtec</strong> (private)</p></li></ul></li><li><p><strong>Power delivery</strong> (VRMs):</p><ul><li><p><strong>Infineon</strong> (XETRA: IFX), <strong>Texas Instruments</strong> (NASDAQ: TXN), <strong>Analog Devices</strong> (NASDAQ: ADI), <strong>Monolithic Power Systems</strong> (NASDAQ: MPWR), <strong>Renesas</strong> (TSE: 6723), <strong>onsemi</strong> (NASDAQ: ON)</p></li></ul></li><li><p><strong>EMS/ODM assembly</strong>:</p><ul><li><p><strong>Hon Hai / Foxconn</strong> (TWSE: 2317), <strong>Jabil</strong> (NYSE: JBL), <strong>Flex</strong> (NASDAQ: FLEX), <strong>Celestica</strong> (NYSE: CLS)</p></li></ul></li></ul><p><strong>Output:</strong> SuperNIC boards integrated into compute trays.</p><div><hr></div><h2>5) NVLink 6 Switch ASIC: the &#8220;scale&#8209;up spine&#8221; inside NVL72</h2><h3>What it is (NVIDIA disclosures)</h3><p>NVIDIA describes the NVL72 all&#8209;to&#8209;all topology as using <strong>36 NVLink 6 switches</strong>, with <strong>each switch tray incorporating four NVLink 6 switch chips</strong>. 
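</p><p>A quick counting sketch using only the figures stated here (the 72&#8209;GPU domain is implied by the NVL72 name; treat this as illustrative arithmetic, not a bill of materials):</p><pre><code># Counting sketch from the stated NVL72 figures (illustrative, not a bill of materials)
gpus_per_rack         = 72     # implied by the "NVL72" naming
nvlink6_switch_chips  = 36     # assuming the "36 NVLink 6 switches" are switch chips
chips_per_switch_tray = 4      # four NVLink 6 switch chips per switch tray
scaleout_tbs_per_gpu  = 1.6    # 1.6 TB/s of scale-out bandwidth per Rubin GPU via ConnectX-9

switch_trays      = nvlink6_switch_chips / chips_per_switch_tray   # 9 switch trays per rack
rack_scaleout_tbs = gpus_per_rack * scaleout_tbs_per_gpu           # 115.2 TB/s of scale-out per rack

print(f"NVLink 6 switch trays per NVL72 rack: {switch_trays:.0f}")
print(f"Aggregate scale-out bandwidth: {rack_scaleout_tbs:.1f} TB/s per rack")
</code></pre><p>Counts like these are why NIC, switch, and optics volumes scale in lockstep with GPU shipments. 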
<br>NVLink 6 is part of the six&#8209;chip Rubin platform list.</p><p>This is the scale&#8209;up fabric: it is not &#8220;Ethernet switching,&#8221; it&#8217;s the internal GPU&#8209;to&#8209;GPU communication domain that makes rack&#8209;scale training behave like one machine.</p><div><hr></div><h3>The NVLink 6 manufacturing story</h3><p>NVLink switch silicon is a high&#8209;bandwidth, high&#8209;power ASIC &#8212; which means it shares the same underlying constraints as GPUs and NICs:</p><ul><li><p><strong>Blueprint:</strong> NVIDIA + EDA</p></li><li><p><strong>Printing:</strong> foundry wafer fab (advanced node)</p></li><li><p><strong>Grading:</strong> wafer test</p></li><li><p><strong>Packaging:</strong> high&#8209;end substrates, large packages, liquid&#8209;cooling integration at tray level</p></li><li><p><strong>Board:</strong> extreme signal integrity (high&#8209;speed PCB materials, connectors)</p></li></ul><p><strong>Output:</strong> hot&#8209;swappable NVLink switch trays that create the NVL72 NVLink domain.</p><div><hr></div><h2>6) Spectrum&#8209;6 + Spectrum&#8209;X Ethernet Photonics (co&#8209;packaged optics): the &#8220;future scale&#8209;out&#8221; story</h2><p>This one is worth treating as a <em>different species</em> of supply chain because it blends:</p><ul><li><p><strong>switch ASIC manufacturing</strong> (electronics)</p></li><li><p><strong>silicon photonics</strong> (optics on silicon)</p></li><li><p><strong>laser supply chains</strong> (III&#8209;V lasers and optical subassemblies)</p></li><li><p><strong>fiber/connector precision assembly</strong></p></li><li><p><strong>advanced co&#8209;packaging / test</strong></p></li></ul><h3>What it is (NVIDIA disclosures)</h3><p>NVIDIA describes <strong>Spectrum&#8209;X Ethernet Photonics</strong> as co&#8209;packaged optics for Ethernet scale&#8209;out, and says the <strong>SN6800</strong> switch delivers <strong>409.6 Tb/s</strong> across <strong>512 ports of 800G</strong> (or <strong>2,048 ports of 200G</strong>) and is <strong>coming in 2H 2026</strong>.</p><p>So you&#8217;re not just building a switch &#8212; you&#8217;re building a switch whose &#8220;optical transceivers&#8221; are no longer pluggable boxes. 
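</p><p>The port math behind that 409.6 Tb/s figure is worth checking explicitly (a small sketch from the stated SN6800 numbers; the engine count at the end is a rough inference, not an NVIDIA disclosure):</p><pre><code># Port math for the stated SN6800 figures (engine count is a rough estimate, not disclosed)
ports_800g     = 512
ports_200g     = 2048
engine_dir_tbs = 1.6      # each COUPE-based optical engine: 1.6 Tb/s in each direction

aggregate_tbs_800g = ports_800g * 0.8     # 512 x 800G  = 409.6 Tb/s
aggregate_tbs_200g = ports_200g * 0.2     # 2048 x 200G = 409.6 Tb/s (same total, more endpoints)
engines_rough      = aggregate_tbs_800g / engine_dir_tbs   # ~256 optical engines, order of magnitude

print(f"{aggregate_tbs_800g:.1f} Tb/s = {aggregate_tbs_200g:.1f} Tb/s aggregate; roughly {engines_rough:.0f} optical engines")
</code></pre><p>In that framing, the optical engines sit next to the switch silicon rather than in front&#8209;panel pluggables. 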
They&#8217;re integrated into the same module system.</p><div><hr></div><h3>The co&#8209;packaged optics manufacturing story</h3><h4>Step 1 &#8212; The electronics &#8220;brain&#8221; (switch ASIC)</h4><p>This branch looks like classic chipmaking:</p><ul><li><p>Blueprint: NVIDIA + EDA</p></li><li><p>Printing: foundry (TSMC is a key ecosystem partner in NVIDIA&#8217;s co&#8209;packaged optics collaboration story)</p></li><li><p>Grading: wafer test</p></li><li><p>Packaging: advanced package integration</p></li></ul><div><hr></div><h4>Step 2 &#8212; The photonics &#8220;mouth and ears&#8221; (optical engines)</h4><p>NVIDIA describes <strong>COUPE&#8209;based optical engines</strong> in its co&#8209;packaged optics discussion, and gives a concrete throughput framing:</p><ul><li><p>each optical engine supports <strong>1.6 Tb/s transmit</strong> and <strong>1.6 Tb/s receive</strong>, operating on <strong>eight 200 Gb/s lanes</strong> in each direction</p></li></ul><div><hr></div><h4>Step 3 &#8212; The light plant (ELS: external laser source)</h4><p>Instead of embedding lasers everywhere, NVIDIA describes using an <strong>External Laser Source (ELS)</strong> approach:</p><ul><li><p>each ELS module contains <strong>eight lasers</strong></p></li><li><p>centralizing lasers reduces total lasers (NVIDIA describes reduction by a factor of four)</p></li></ul><p><strong>Key laser ecosystem names NVIDIA calls out in this collaboration context:</strong></p><ul><li><p><strong>Lumentum</strong> (NASDAQ: LITE)</p></li><li><p><strong>Coherent</strong> (NYSE: COHR)</p></li><li><p><strong>Sumitomo Electric</strong> (TSE: 5802)</p></li></ul><div><hr></div><h4>Step 4 &#8212; The &#8220;plumbing&#8221;: fiber, micro&#8209;optics, connectors</h4><p>NVIDIA&#8217;s partner list and narrative ties CPO to fiber/connector specialists, including:</p><ul><li><p><strong>Corning</strong> (NYSE: GLW) (fiber ecosystem)</p></li><li><p><strong>Browave</strong> (TPEX: 3163)</p></li><li><p><strong>SENKO</strong> (private)</p></li><li><p><strong>TFC Communication</strong> (noted by NVIDIA as part of the ecosystem)</p></li></ul><div><hr></div><h4>Step 5 &#8212; Co&#8209;packaging + test + system integration (where yield is won or lost)</h4><p>This is where the supply chain becomes &#8220;assembly and metrology heavy.&#8221;</p><p>NVIDIA explicitly names <strong>SPIL</strong> for packaging/test roles in the CPO supply chain context, and also points to system assembly players:</p><ul><li><p>Packaging/test: <strong>SPIL</strong> (Siliconware Precision Industries; part of <strong>ASE Technology Holding</strong> <strong>(NYSE: ASX / TWSE: 3711)</strong>)</p></li><li><p>System integration: <strong>Hon Hai / Foxconn</strong> (TWSE: 2317), <strong>Fabrinet</strong> (NYSE: FN)</p></li></ul><p>NVIDIA also notes manufacturing details like the use of <strong>solder reflow</strong> to attach optical engines to substrates in the production flow it describes.</p><p><strong>Output:</strong> a co&#8209;packaged switch system where electrical paths shrink, and optics become part of the module &#8212; but the assembly/test demands rise sharply.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YI7S!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F081da5b5-314e-44c4-9454-2b20fbc909fc_2816x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source 
type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YI7S!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F081da5b5-314e-44c4-9454-2b20fbc909fc_2816x1536.png 424w, https://substackcdn.com/image/fetch/$s_!YI7S!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F081da5b5-314e-44c4-9454-2b20fbc909fc_2816x1536.png 848w, https://substackcdn.com/image/fetch/$s_!YI7S!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F081da5b5-314e-44c4-9454-2b20fbc909fc_2816x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!YI7S!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F081da5b5-314e-44c4-9454-2b20fbc909fc_2816x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YI7S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F081da5b5-314e-44c4-9454-2b20fbc909fc_2816x1536.png" width="1456" height="794" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/081da5b5-314e-44c4-9454-2b20fbc909fc_2816x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:794,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:7768096,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://research.fpx.world/i/184256961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F081da5b5-314e-44c4-9454-2b20fbc909fc_2816x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YI7S!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F081da5b5-314e-44c4-9454-2b20fbc909fc_2816x1536.png 424w, https://substackcdn.com/image/fetch/$s_!YI7S!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F081da5b5-314e-44c4-9454-2b20fbc909fc_2816x1536.png 848w, https://substackcdn.com/image/fetch/$s_!YI7S!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F081da5b5-314e-44c4-9454-2b20fbc909fc_2816x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!YI7S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F081da5b5-314e-44c4-9454-2b20fbc909fc_2816x1536.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 
4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><div><hr></div><h1>The &#8220;supporting cast&#8221; that becomes headline constraints</h1><h2>LPDDR5X SOCAMM: why this memory is different (and why it can bottleneck)</h2><p>LPDDR5X in servers isn&#8217;t &#8220;phone memory in a rack.&#8221; NVIDIA explicitly frames <strong>SOCAMM LPDDR5X</strong> as a serviceable, power&#8209;efficient memory subsystem enabling up to <strong>1.5 TB</strong> per Vera CPU with high bandwidth.</p><p><strong>Why SOCAMM is different from standard server DDR:</strong></p><ul><li><p>It&#8217;s a module form factor designed around LPDDR behavior (power efficiency, density) and serviceability (replaceable modules rather than soldered packages). NVIDIA specifically highlights serviceability and fault isolation benefits.</p></li><li><p>The ecosystem is newer and more specialized than commodity RDIMMs.</p></li></ul><p><strong>Who builds SOCAMM modules (and why investors care):</strong></p><ul><li><p>Memory makers: <strong>Micron (MU)</strong>, <strong>Samsung (005930)</strong>, <strong>SK hynix (000660)</strong> are the prime candidates.</p></li><li><p>Micron says it is in <strong>volume production</strong> of <strong>SOCAMM</strong>, developed with NVIDIA, and positions it as a high&#8209;bandwidth, low&#8209;power, small form&#8209;factor memory module.</p></li><li><p>Micron also announced <strong>SOCAMM2</strong> sampling (up to <strong>192GB</strong>) as the ecosystem evolves.</p></li></ul><p><strong>Where SOCAMM bottlenecks show up:</strong></p><ul><li><p>DRAM die supply is necessary but not sufficient &#8212; <strong>module assembly, PCB/connector tolerances, and qualification</strong> become the gating items.</p></li><li><p>Because SOCAMM is central to Vera&#8217;s &#8220;coherent memory pool&#8221; concept (LPDDR5X + HBM4), shortfalls can degrade system ship volume even if GPUs are available.</p></li></ul><div><hr></div><h2>SSDs (E1.S NVMe): &#8220;boring&#8221; until you try to qualify them at NVL72 scale</h2><p>SSDs are typically more multi&#8209;sourced than GPUs/HBM, but the choke point becomes <strong>qualification + form factor + firmware + QoS</strong>.</p><p><strong>Where SSDs show up in NVIDIA AI racks</strong><br>NVIDIA DGX documentation describes using <strong>E1.S NVMe</strong> drives as local storage/cache in rack systems (configurations like multiple E1.S drives in RAID0 are common in this class of platform docs).</p><p><strong>Why E1.S matters</strong><br>E1.S is a standardized &#8220;ruler&#8221; form factor under SNIA&#8217;s SFF work (operators care because hot&#8209;swap serviceability becomes part of uptime).</p><p><strong>Two concrete &#8220;vendor qualification&#8221; signals (GB200 generation, same rack 
class)</strong></p><ul><li><p><strong>Micron</strong> has positioned its <strong>9550</strong> PCIe Gen5 data center SSDs for AI workloads and notes inclusion in NVIDIA ecosystem vendor recommendations for NVL72&#8209;class systems.</p></li><li><p><strong>Western Digital</strong> has stated its <strong>DC SN861 E1.S</strong> is certified to support NVIDIA <strong>GB200 NVL72</strong>.</p></li></ul><p><strong>Why this matters for Rubin</strong><br>Even if SSDs are &#8220;available,&#8221; the rack builder/operator needs:</p><ul><li><p>firmware compatibility</p></li><li><p>thermal behavior under rack conditions</p></li><li><p>latency/QoS behavior</p></li><li><p>security/telemetry compliance</p></li></ul><p><strong>Output:</strong> qualified E1.S SSD SKUs that can ship at rack scale without &#8220;mystery latency&#8221; incidents.</p><p></p><p>In the next (paid) section we break down the bottlenecks: which companies are set to benefit the most from the increase in demand, and which companies that were previously set to grow could be adversely affected by these shifts.</p><p></p>
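<p>To recap the thread running through the free section: Rubin ships at the rate of its slowest pipeline. A minimal sketch of that &#8220;minimum of constrained pipelines&#8221; idea (every capacity figure below is an invented placeholder, purely to show the shape of the calculation):</p><pre><code># Minimal sketch of the "minimum of constrained pipelines" framing from the Rubin section.
# Every capacity figure below is an invented placeholder, not an estimate of real supply.
pipeline_capacity_per_quarter = {
    "advanced-node wafers (EUV)":        1_000_000,   # GPU-die equivalents
    "HBM4 stacks + base die":              850_000,
    "CoWoS-class packaging":               700_000,
    "ABF substrates":                      900_000,
    "SOCAMM / SSD / NIC qualification":    950_000,
}

shippable_packages = min(pipeline_capacity_per_quarter.values())
bottleneck = min(pipeline_capacity_per_quarter, key=pipeline_capacity_per_quarter.get)

print(f"Shippable Rubin packages: {shippable_packages:,} (gated by: {bottleneck})")
</code></pre><p>Whichever line item is smallest sets shipments; every other pipeline above it is stranded capacity, which is why &#8220;demand&#8221; shows up first as HBM allocation and packaging capex rather than as more wafer starts.</p>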
      <p>
          <a href="https://research.fpx.world/p/the-rubin-protocol-supply-chain-bottlenecks">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[NVIDIA’s Christmas Eve Gift: Groq and the New Physics of Inference]]></title><description><![CDATA[How splitting the supply chain unlocks the next decade of reasoning models, brownfield capacity, Supply-Chain Physics, and token economics.]]></description><link>https://research.fpx.world/p/nvidias-christmas-eve-gift-groq-and</link><guid isPermaLink="false">https://research.fpx.world/p/nvidias-christmas-eve-gift-groq-and</guid><dc:creator><![CDATA[FPX AI]]></dc:creator><pubDate>Thu, 25 Dec 2025 18:03:51 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/d6036704-fa3a-4ecf-9430-6ed7faabbc80_2816x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>1. Executive Summary: The Christmas Eve Paradigm Shift</h2><p>On December 24, 2025, NVIDIA struck what reads like a licensing deal&#8212;but behaves like an acquisition. Groq announced a &#8220;non&#8209;exclusive&#8221; license of its inference chip technology to NVIDIA, while Groq founder Jonathan Ross and key Groq executives/engineers move to NVIDIA; Groq&#8217;s cloud business continues operating and Groq stays &#8220;independent&#8221; under a new CEO. </p><p>If you strip away the legal wrapper, the strategic message is blunt:</p><p><strong>NVIDIA is conceding&#8212;in public&#8212;that training and inference have bifurcated so hard that one architecture cannot economically dominate both.</strong> The GPU remains the throughput king for training. But the next margin pool is inference (tokens), and inference is being reshaped by (1) reasoning/System&#8209;2 models that &#8220;think&#8221; longer at test time and (2) supply chain physics&#8212;HBM + CoWoS are becoming a tollbooth on how fast the world can scale GPU supply. </p><p>This report is designed to be an example on how you can leverage the FPX &#8220;Bottlenecks Beyond Power&#8221; framework to understand movements in the market such as this one better :</p><ul><li><p><strong><a href="https://research.fpx.world/p/part-1-beyond-power-the-ai-memory">Part 1 (Memory)</a></strong> explains <em>why</em> HBM + CoWoS is the hard governor on GPU scaling (the Memory Wall + packaging bottleneck).</p></li><li><p><strong><a href="https://research.fpx.world/p/part-2-beyond-power-the-networking">Part 2 (Networking)</a></strong><a href="https://research.fpx.world/p/part-2-beyond-power-the-networking"> </a>explains <em>why</em> the next bottleneck becomes connectivity (copper &#8594; optics &#8594; CPO) as &#8220;AI factories&#8221; scale. 
</p></li></ul><p><strong>This report</strong> explains how the NVIDIA&#8596;Groq move changes <strong>what gets built next</strong> (and therefore who wins in memory, optics, networking, power, and &#8220;capacity as a product&#8221;).</p><p><strong>In the Paid section we break down the supply chain implications</strong> and what this really means not only for the <strong>Memory and Networking players</strong> in the space but also how you can <strong>refine your strategy if you are a Colocation Operator, buying or selling Powered Land or are an Investor</strong> in the space.</p><h3>The core thesis</h3><p>Groq isn&#8217;t valuable because it was about to &#8220;kill NVIDIA.&#8221; Groq is valuable because it is <strong>an orthogonal path to inference capacity</strong>:</p><ul><li><p><strong>HBM&#8209;less, SRAM&#8209;centric inference silicon</strong> (reduces dependence on the most constrained layer of the GPU supply chain).</p></li><li><p><strong>Compiler&#8209;scheduled determinism</strong> (reduces tail latency/jitter&#8212;crucial for agentic and reasoning workflows).</p></li><li><p><strong>A second manufacturing lane</strong> (GlobalFoundries now; Samsung 4nm next-gen planned) that can ramp without waiting for CoWoS slots. (<a href="https://groq.com/newsroom/groq-raises-640m-to-meet-soaring-demand-for-fast-ai-inference">Groq</a>)</p></li></ul><p>In other words:</p><blockquote><p><strong>NVIDIA didn&#8217;t buy &#8220;a chip.&#8221; NVIDIA bought a second factory door that doesn&#8217;t pass through the CoWoS/HBM tollbooth.</strong></p></blockquote><p>And the second&#8209;order consequence is the real story: once NVIDIA has two doors, it can segment customers, price discriminate, and keep control of the inference annuity&#8212;even as hyperscalers push harder to escape CUDA.</p><div><hr></div><h2>2. The Strategic Context: Why ~$20B? Why Now?</h2><p>The reported ~$20B figure (a ~3x premium to Groq&#8217;s last <strong>$6.9B</strong> September 2025 valuation) underscores that this is not a revenue multiple story. It is a <strong>Cost of Goods Sold (COGS)</strong> story.</p><h3>2.1 The transition from Training to Inference is a transition from CapEx to COGS</h3><p>A clean way to frame the macro shift:</p><ul><li><p><strong>Training is CapEx</strong>: build the model once (or periodically).</p></li><li><p><strong>Inference becomes COGS</strong>: every user query, agent loop, tool call, and &#8220;reasoning&#8221; step is a recurring cost line item.</p></li></ul><p>As AI products become &#8220;always on,&#8221; CFOs stop asking &#8220;How fast can we train?&#8221; and start asking <strong>&#8220;What is our cost per useful token delivered?&#8221;</strong></p><p>That is why NVIDIA&#8217;s inference posture matters more now than in 2021&#8211;2023. This is the phase where <strong>unit economics become strategy</strong>.</p><h3>2.2 The real &#8220;Why Now?&#8221; is the DeepSeek / o1 factor: reasoning models are token multipliers</h3><p>The industry hype cycle moved from &#8220;chat&#8221; to &#8220;reasoning.&#8221;</p><p>OpenAI&#8217;s o1 series popularized the idea that models <strong>spend more time thinking before responding</strong>, scaling performance with test&#8209;time compute. (<a href="https://openai.com/index/introducing-openai-o1-preview/">OpenAI</a>)<br>OpenAI&#8217;s own developer docs are explicit that reasoning models &#8220;think before they answer&#8221; and can consume substantial &#8220;reasoning tokens&#8221; as part of producing a response. 
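</p><p>The COGS effect is easiest to see with a toy example (the price and token counts below are made up for illustration; they are not OpenAI, NVIDIA, or Groq figures):</p><pre><code># Toy example of the reasoning-token multiplier (all prices and counts are illustrative)
price_per_million_tokens = 5.00     # assumed blended $/1M output tokens
visible_answer_tokens    = 500      # what the user actually sees
reasoning_tokens         = 4_000    # hidden chain-of-thought the model spends first

billed_tokens   = visible_answer_tokens + reasoning_tokens
cost_per_query  = billed_tokens / 1_000_000 * price_per_million_tokens
effective_price = cost_per_query / visible_answer_tokens * 1_000_000

print(f"Cost per query: ${cost_per_query:.4f}")
print(f"Effective $/1M visible tokens: ${effective_price:.2f} (9x the sticker price in this example)")
</code></pre><p>That multiplier is exactly what the research community means by inference&#8209;time scaling. 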
<br>DeepSeek&#8217;s R1 paper directly characterizes o1&#8209;style reasoning as <strong>&#8220;inference&#8209;time scaling&#8221; by increasing Chain&#8209;of&#8209;Thought length</strong>.</p><p><strong>What that means in infrastructure terms:</strong><br>Reasoning models don&#8217;t just generate the visible answer tokens. They generate large internal token sequences (reasoning tokens / thought tokens), which pushes you into a world where <strong>Time&#8209;to&#8209;Last&#8209;Token (TTLT)</strong> becomes the user experience bottleneck, not just Time&#8209;to&#8209;First&#8209;Token.</p><blockquote><p><strong>Strategic insight:</strong> NVIDIA didn&#8217;t just &#8220;buy inference.&#8221;<br>NVIDIA bought a path to make <strong>System&#8209;2 reasoning feel like System&#8209;1 latency.</strong></p></blockquote><p>If Groq collapses the &#8220;thinking loop&#8221; latency enough, it doesn&#8217;t reduce compute spend. It <strong>creates demand</strong> (Jevons paradox): cheaper inference becomes more inference, more agents, more loops, more tool use, more tokens.</p><h3>2.3 The hyperscaler threat is no longer &#8220;chips.&#8221; It&#8217;s &#8220;escape velocity.&#8221;</h3><p>Google is the cleanest case study. It has:</p><ul><li><p><strong>Custom silicon</strong> explicitly positioned for inference (e.g., TPU &#8220;Ironwood&#8221; described as designed for the &#8220;age of inference&#8221;). (<a href="https://blog.google/products/google-cloud/ironwood-tpu-age-of-inference/?utm_source=chatgpt.com">blog.google</a>)</p></li><li><p><strong>A push to expand TPUs beyond internal use:</strong> <em><a href="https://research.fpx.world/p/googles-tpu-supply-chain-playbook">As FPX first reported in September</a></em>, Google has moved to sell TPUs directly into customer data centers, not just via Google Cloud&#8212;marking a strategic escalation from internal acceleration to full-stack infrastructure competition.</p></li><li><p><strong>A direct attack on <a href="https://research.fpx.world/p/googles-tpu-supply-chain-playbook">NVIDIA&#8217;s software moat</a>:</strong> <em>As FPX reported earlier this year</em>, Google launched TorchTPU to make TPUs first-class PyTorch targets&#8212;explicitly reducing CUDA switching costs and partnering with Meta to accelerate ecosystem adoption. </p></li></ul><p>This is exactly the &#8220;build vs buy&#8221; shift that shrinks NVIDIA&#8217;s default TAM if left unaddressed.</p><h3>2.4 NVIDIA isn&#8217;t mainly playing defense against Groq. It&#8217;s playing offense with Groq.</h3><p>Groq alone had a scaling problem: not physics, but <strong>distribution</strong>.</p><p>Inference hardware doesn&#8217;t win by being clever; it wins by being easy to buy, easy to deploy, and easy to program. NVIDIA already owns those channels: the software stack, the OEM/ODM ecosystem, the cluster networking narrative, and the default developer workflow.</p><p>So the offensive version of this deal is:</p><ul><li><p>Groq brings a new inference architecture + second supply chain lane.</p></li><li><p>NVIDIA wraps it in the world&#8217;s dominant AI software ecosystem and global go&#8209;to&#8209;market.</p></li></ul><p>That combination is more dangerous than Groq as an independent upstart.</p><div><hr></div><h2>3. Architectural Deep Dive: The Physics of the LPU vs. GPU</h2><p>This deal only makes sense if you treat it as buying a <strong>different execution philosophy</strong>. 
<strong>Determinism</strong> is the new requirement for agentic workflows.</p><h3>3.1 Determinism: why &#8220;compiler-first&#8221; matters more than raw TOPS</h3><p>GPUs are dynamically scheduled machines. That&#8217;s great for variable workloads. But inference graphs&#8212;especially serving fixed model architectures&#8212;are <em>known ahead of time</em>.</p><p>Groq positions its architecture around deterministic execution and compiler scheduling, explicitly reducing the need for complex runtime scheduling and enabling tightly controlled data movement and timing. </p><p><strong>Why that matters now (Agentic AI): the Straggler Effect becomes the tax.</strong><br>In a multi-agent workflow (planner, executor, retriever, verifier, tool&#8209;caller), your end&#8209;to&#8209;end latency is limited by the slowest sub&#8209;call. If one &#8220;agent&#8221; stalls, the orchestration graph stalls.</p><p>Determinism isn&#8217;t just &#8220;faster.&#8221; It&#8217;s <strong>synchronization</strong>. It&#8217;s predictable TTLT. That&#8217;s an infrastructure primitive for agent swarms.</p><h3>3.2 SRAM vs. HBM: Groq is a bet on breaking the Memory Wall by changing the memory hierarchy</h3><p>In Part 1, we framed HBM as the &#8220;cutting board&#8221;: incredibly fast, physically close to the compute, but scarce and packaging&#8209;constrained. (<a href="https://research.fpx.world/p/part-1-beyond-power-the-ai-memory">research.fpx.world</a>)</p><p>Groq was among the upstarts that <strong>do not use external high&#8209;bandwidth memory chips</strong>, instead relying on on&#8209;chip SRAM&#8212;which speeds interactions but limits the model size that can be served.</p><p>Groq&#8217;s own materials emphasize &#8220;massive on&#8209;chip memory&#8221; and a software-first approach that makes the hardware behave predictably. 
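</p><p>The capacity trade&#8209;off is easy to put rough numbers on (the per&#8209;chip SRAM figure below is an assumption for illustration, not a disclosed Groq spec):</p><pre><code>import math

# Rough sharding math for SRAM-first inference.
# The per-chip SRAM figure is an assumption for illustration, not a Groq spec.
model_params_b   = 70       # e.g. a 70B-parameter model
bytes_per_param  = 1        # served at 8-bit precision (assumption)
sram_gb_per_chip = 0.25     # assumed usable on-chip SRAM per accelerator

model_gb     = model_params_b * bytes_per_param            # ~70 GB of weights
chips_needed = math.ceil(model_gb / sram_gb_per_chip)      # ~280 chips just to hold the weights

print(f"Weights: ~{model_gb} GB, sharded across ~{chips_needed} SRAM-first chips")
</code></pre><p>Holding a frontier&#8209;sized model entirely in SRAM therefore means fanning it out across hundreds of chips, which is why the interconnect story matters as much as the silicon itself.</p>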
<p><strong>Translation:</strong> Groq is &#8220;HBM-negative&#8221; per accelerator.<br>But it can be &#8220;network-positive&#8221; in the aggregate, because sharding and chip-to-chip fabric become central as models scale.</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/bfd369e1-9601-46f1-bc94-f3811e146c92_2816x1536.png" alt="" loading="lazy"></figure></div><p></p><h3>3.3 The elephant in the room: Cerebras is the &#8220;SRAM physics&#8221; competitor this deal validates</h3><p>Cerebras is one of the primary rivals in the SRAM-centric, HBM&#8209;less inference approach. </p><p>Cerebras&#8217; wafer-scale pitch is even more explicit: the WSE&#8209;3 announcement highlights <strong>44GB of on-chip SRAM</strong> and extreme memory bandwidth&#8212;essentially turning &#8220;SRAM first&#8221; into an entire wafer-scale system design. </p><p><strong>The &#8220;Validation Trap&#8221;:</strong><br>This deal is a market-level validation of the <strong>SRAM-over-HBM</strong> inference thesis. But it also collapses the narrative: the alternative physics is no longer an alternative <em>to NVIDIA</em>&#8212;it is now being absorbed <em>into NVIDIA&#8217;s orbit</em>.</p><p>For competitors, that&#8217;s brutal. They can be right on architecture and still lose on ecosystem.</p><h3>3.4 Networking: Groq&#8217;s determinism is a networking strategy, not just a compute strategy</h3><p>Part 2 made the point that the network is the &#8220;nervous system,&#8221; not a sideshow. (<a href="https://research.fpx.world/p/part-2-beyond-power-the-networking">research.fpx.world</a>)</p><p>Groq&#8217;s scaling story inherently relies on chip-to-chip connectivity. Public reporting on Groq&#8217;s systems describes LPU racks stitched together with fast interconnect, including fiber-optic links in some scaling narratives. (<a href="https://platform.openai.com/docs/guides/reasoning">OpenAI Platform</a>)</p><p>Meanwhile NVIDIA is already driving deeper optical integration (CPO) and explicitly pulling major photonics suppliers into its ecosystem (as we covered in Part 2). 
(<a href="https://research.fpx.world/p/part-2-beyond-power-the-networking">research.fpx.world</a>)</p><p><strong>Second-order implication:</strong><br>If inference becomes cheaper and more distributed (more sites, more &#8220;inference hubs&#8221;), you get:</p><ul><li><p><strong>More front-end network + DCI optics</strong> (more regions, more replication, more east-west).</p></li><li><p><strong>A faster push toward CPO and integrated optics</strong> inside large AI fabrics (power + density pressure).</p></li></ul><p>This is why the &#8220;Groq + NVIDIA networking&#8221; combination is strategically meaningful. Groq&#8217;s deterministic execution model is a natural fit for future fabrics that behave more like a &#8220;virtual wafer&#8221; (your Google piece calls this out explicitly as well). (<a href="https://research.fpx.world/p/breaking-down-googles-plan-to-double">research.fpx.world</a>)</p><h3>3.5 Prefill vs Decode: the hybrid future is already visible in NVIDIA&#8217;s own roadmap</h3><p>Here&#8217;s the key missing link that makes your &#8220;Speculative Decoding card&#8221; thesis feel inevitable:</p><p><strong>NVIDIA itself is increasingly describing inference as two different workloads:</strong></p><ul><li><p><strong>Context/Prefill</strong> (compute-intensive)</p></li><li><p><strong>Decode/Generation</strong> (memory-bound)</p></li></ul><p>This &#8220;disaggregated serving&#8221; framing shows up in reporting on NVIDIA&#8217;s recent disclosures, including the idea that splitting prefill and decode across different GPU pools can increase throughput. </p><p>And the roadmap implication is explicit: &#8220;Rubin CPX&#8221; is discussed as designed for &#8220;massive-context inference,&#8221; paired with full Rubin GPUs for generation. </p><p><strong>This is the opening for Groq.</strong><br>If decode is the memory/jitter bottleneck and Groq is an HBM&#8209;less, deterministic decode engine, then Groq technology can become the &#8220;decode organ&#8221; inside a broader NVIDIA inference system&#8212;even if it never ships as a standalone Groq-branded product.</p><div><hr></div><h2>4. Manufacturing and Supply Chain: The &#8220;Second Source&#8221; Strategy</h2><p>This is the most underappreciated strategic payload in the entire deal.</p><h3>4.1 The CoWoS + HBM bottleneck is still the governor for GPU supply</h3><p>In Part 1, we laid out why HBM is scarce and why GPU delivery schedules are gated by <strong>HBM stacks, CoWoS-class advanced packaging, and substrate constraints</strong>&#8212;not just wafers. (<a href="https://research.fpx.world/p/part-1-beyond-power-the-ai-memory">research.fpx.world</a>)</p><p>If inference growth continues to be served primarily by HBM-rich GPUs, then inference expansion inherits the same bottlenecks as training.</p><h3>4.2 Groq opens a second lane: inference capacity that is structurally less dependent on HBM and CoWoS</h3><p>Groq&#8217;s approach avoids external HBM by using SRAM on-chip.  Groq&#8217;s own funding release states it planned to deploy <strong>108,000 LPUs</strong> manufactured by <strong>GlobalFoundries</strong> by end of Q1 2025&#8212;showing a concrete deployment plan outside the &#8220;TSMC CoWoS + HBM&#8221; lane. 
(<a href="https://groq.com/newsroom/groq-raises-640m-to-meet-soaring-demand-for-fast-ai-inference">Groq</a>)<br>Most surces close to Groq heard directly from them that Groq contracted Samsung Foundry to manufacture next-gen <strong>4nm LPUs</strong>.</p><p>That&#8217;s the orthogonal supply chain in one sentence: <strong>more inference capacity without consuming more CoWoS slots.</strong></p><h3>4.3 The visual mental model: NVIDIA just opened a second factory door</h3><p>If you&#8217;re building a &#8220;physics-first&#8221; infrastructure map, draw this fork:</p><pre><code><code>                 (CONSTRAINED LANE)
TSMC CoWoS + HBM stacks  &#8594;  GPU (training + some inference)  &#8594;  ultra-dense racks
          |
          |      (UNLOCKED / MORE COMMODITY LANE)
          &#9492;&#8594;  SRAM-centric inference silicon (HBM-less)  &#8594;  standard packaging  &#8594;  inference pods
</code></code></pre><p>Or even more simply:</p><ul><li><p><strong>Red lane (congested):</strong> HBM &#8594; CoWoS &#8594; GPUs &#8594; training factories</p></li><li><p><strong>Green lane (flowing):</strong> SRAM &#8594; standard packaging &#8594; inference engines</p></li></ul><blockquote><p>&#8220;NVIDIA has effectively opened a second factory door that doesn&#8217;t lead through the CoWoS tollbooth.&#8221;</p></blockquote><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/16e3776d-5940-4cfb-838c-38be33d41ef7_2816x1536.png" alt="" loading="lazy"></figure></div><p></p><h3>4.4 Second-order infrastructure effects of this supply chain fork</h3><p>If NVIDIA can serve more inference demand through a less constrained lane:</p><ul><li><p><strong>HBM stays scarce&#8212;but the incremental HBM-per-inference-dollar ratio falls.</strong></p></li><li><p><strong>TSMC CoWoS capacity can be prioritized for the highest-margin training systems</strong> (and premium inference SKUs that still need HBM).</p></li><li><p><strong>Inference deployments can fragment geographically</strong> (more colo, more enterprise, more sovereign deployments), because you&#8217;re no longer waiting on the most constrained packaging supply chain to ship boxes.</p></li></ul><p>That last point is where Part 2 and Part 1 intersect: fragmentation increases optics/DCI and &#8220;deployable power&#8221; demand even if single-rack density is lower.</p><div><hr></div><h2>5. Economic Implications: The Tokenomics of 2026</h2><h3>5.1 The &#8220;Option A / Option B / Option C&#8221; segmentation becomes NVIDIA&#8217;s pricing weapon</h3><p>This segmentation framing holds, and the deal supercharges it:</p><ul><li><p><strong>Option A:</strong> NVIDIA GPUs (premium training + premium inference)</p></li><li><p><strong>Option B:</strong> NVIDIA&#8209;owned inference alternative (Groq-derived)</p></li><li><p><strong>Option C:</strong> build from scratch (hyperscaler ASIC program)</p></li></ul><p>The real power is that NVIDIA can now <strong>price discriminate</strong> across inference segments:</p><ul><li><p>High-margin, &#8220;must-have&#8221; customers stay on the GPU stack.</p></li><li><p>Price-sensitive, latency-sensitive inference gets routed to an NVIDIA inference line that competes directly with hyperscaler ASIC economics&#8212;without forcing customers to leave the NVIDIA ecosystem.</p></li><li><p>This is a defensive moat against hyperscaler internal silicon. 
NVIDIA can now say: &#8220;Why build your own TPU when our 'Option B' is cheaper, faster, and already runs your software?&#8221;</p></li></ul><p>This isn&#8217;t just revenue capture. It&#8217;s <strong>switching-cost engineering</strong>.</p><h3>5.2 The Chain-of-Thought multiplier: why this is an inference deal, not a chip deal</h3><p>OpenAI says reasoning models &#8220;think before they answer,&#8221; producing a long internal chain of thought; and OpenAI&#8217;s docs treat &#8220;reasoning tokens&#8221; as a first-class part of the token budget. (<a href="https://platform.openai.com/docs/guides/reasoning">OpenAI Platform</a>)<br>DeepSeek explicitly frames o1 as &#8220;inference-time scaling&#8221; by lengthening Chain-of-Thought.</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/4a6e0bef-dde5-4a9c-8f7c-a0e8f3679a9d_2816x1536.png" alt="" loading="lazy"></figure></div><p></p><p><strong>Infrastructure translation:</strong></p><ul><li><p>More reasoning tokens &#8594; more decode work &#8594; more TTLT pain &#8594; more value in deterministic, fast token generation.</p></li></ul><p>This is why Groq&#8217;s architecture is suddenly &#8220;worth $20B&#8221; (if CNBC&#8217;s number is accurate). It is the first credible path to keep reasoning UX from feeling broken at scale.</p><h4>Callout: The Jevons Paradox of Inference</h4><p>If NVIDIA/Groq lowers the cost and latency of inference by 10&#215;, we won&#8217;t spend less on compute.<br>We will run <strong>more</strong> reasoning loops, more tool calls, more multi-agent plans, and more retrieval.<br><strong>Cheaper tokens create more tokens.</strong></p><h3>5.3 Memory market implications (Link back to Part 1)</h3><p>HBM is still the premium profit pool, and training remains HBM-hungry. But this acquisition creates a medium-term question: <strong>does inference growth continue to monetize through HBM content?</strong></p><p>Groq avoids external HBM via SRAM. 
<br><a href="https://research.fpx.world/p/part-1-beyond-power-the-ai-memory">Part 1 explains why HBM supply is constrained and why CoWoS and substrates are the bottleneck.</a> </p><p><strong>Second-order view:</strong></p><ul><li><p>If inference shifts to HBM-less engines, HBM demand doesn&#8217;t collapse&#8212;it reallocates toward training and top-end GPUs.</p></li><li><p>But the slope of &#8220;HBM dollars per incremental inference dollar&#8221; could flatten over time. That&#8217;s the subtle risk.</p></li></ul><h3>5.4 Networking + optics implications (Link back to Part 2)</h3><p><a href="https://research.fpx.world/p/part-2-beyond-power-the-networking">Part 2 shows how the value pools move as we go from copper/DACs toward optics and co-packaged optics, and explicitly notes NVIDIA&#8217;s CPO ecosystem partners. </a></p><p>Groq-style inference scaling likely increases:</p><ul><li><p><strong>Endpoint count</strong> (more chips, more nodes, more sites).</p></li><li><p><strong>Front-end bandwidth demand</strong> (more inference hubs, more DCI).</p></li><li><p><strong>Pressure to reduce network power</strong> (which accelerates optics integration).</p></li></ul><p>So the likely net is <strong>volume support</strong> for networking and optics, with a <strong>form-factor transition</strong>: a long-run migration from pluggables toward more integrated photonics in back-end fabrics (while DCI remains a separate growth vector).</p><div><hr></div><h2>6. Software Ecosystem: CUDA meets Compiler&#8209;First Design</h2><p>Groq&#8217;s problem was never performance.<br>It was <strong>where developers pay the switching cost</strong>.</p><p>Inference buyers do not want a new compiler, a new kernel language, or a new deployment workflow. They want <strong>endpoints, latency guarantees, and uptime</strong>. NVIDIA understands this&#8212;and that&#8217;s why the Groq deal is fundamentally a <strong>software play</strong>.</p><h3>6.1 From CUDA Lock-In to Runtime Lock-In</h3><p>CUDA was NVIDIA&#8217;s original moat, but inference changes the terrain. Most inference workloads in 2026 are:</p><ul><li><p>containerized,</p></li><li><p>API-driven,</p></li><li><p>deployed via operators and orchestration layers.</p></li></ul><p>That shifts lock-in <strong>away from kernels</strong> and <strong>toward the runtime</strong>.</p><p>NVIDIA&#8217;s move is to make <strong>NIM (Inference Microservices)</strong> the control plane for inference. Once a model is deployed as a NIM endpoint, the developer no longer targets a chip. 
They target <strong>a service contract</strong>.</p><p>At that point, hardware becomes an implementation detail.</p><h3>6.2 Groq Becomes a Backend, Not a Platform</h3><p>NVIDIA does not need to &#8220;port CUDA to Groq.&#8221;<br>It simply needs to <strong>hide Groq behind NIM</strong>.</p><p>Under this model:</p><ul><li><p>the developer writes to the NIM API,</p></li><li><p>NVIDIA owns batching, scheduling, telemetry, and failover,</p></li><li><p>and the runtime dynamically selects the silicon.</p></li></ul><p>GPU for training and memory-heavy prefill.<br>Groq-style engines for deterministic, low-jitter decode.</p><p>The user never &#8220;chooses Groq.&#8221;<br>They choose <strong>latency and cost profiles</strong>.</p><p>That is the strategic inversion.</p><h3>6.3 Disaggregated Inference Becomes a Software Routing Problem</h3><p>Inference is already bifurcating:</p><ul><li><p><strong>Prefill</strong>: memory-heavy, throughput-oriented.</p></li><li><p><strong>Decode</strong>: latency-sensitive, jitter-sensitive, dominated by reasoning loops.</p></li></ul><p>NVIDIA&#8217;s software stack is the natural place to broker that split.</p><p>Once this logic lives in NIM:</p><ul><li><p>disaggregation becomes the default,</p></li><li><p>hardware specialization becomes invisible,</p></li><li><p>and performance gains accrue without ecosystem fragmentation.</p></li></ul><p>Groq&#8217;s determinism is no longer an alternative worldview.<br>It becomes <strong>an internal acceleration mode</strong>.</p><h3>6.4 The Real Strategic Outcome</h3><p>This is how NVIDIA neutralizes alternatives without fighting them head-on.</p><p>Instead of competing with new programming models, NVIDIA:</p><ul><li><p>absorbs the best architecture,</p></li><li><p>hides it behind the dominant runtime,</p></li><li><p>and turns &#8220;hardware choice&#8221; into a private scheduling decision.</p></li></ul><p>The moat shifts <strong>upward</strong>:</p><ul><li><p>from CUDA &#8594; to the inference control plane.</p></li></ul><p>Developers won&#8217;t optimize for GPUs or LPUs.<br>They&#8217;ll optimize for <strong>NVIDIA&#8217;s inference APIs</strong>.</p><p>And NVIDIA will decide what silicon runs underneath.</p><p>That is the real lock-in.</p><div><hr></div><h2>7. Geopolitical and Antitrust Considerations</h2><h3>7.1 The deal structure is a regulatory strategy (but the economics are the story)</h3><p>This is part of the broader trend: big tech pays large sums to take technology + talent while stopping short of a formal acquisition&#8212;an approach that has drawn scrutiny but has often survived. </p><p>For our purposes, that structure matters less than the operational outcome: <strong>Ross + key engineers move, the architecture moves, NVIDIA controls the roadmap.</strong></p><h3>7.2 Cerebras: the only remaining &#8220;pure play&#8221; in SRAM-first inference is now in a tighter box</h3><p>Cerebras is Groq&#8217;s main rival in this HBM-less, SRAM-centric approach. <br>Cerebras&#8217; wafer-scale WSE&#8209;3 specs underline how serious SRAM-first designs can be. </p><p><strong>Market implication:</strong> the competitive set compresses into:</p><ul><li><p>Hyperscalers (internal chips + software)</p></li><li><p>NVIDIA (GPUs + now an absorbed alternative architecture)</p></li><li><p>A shrinking set of independents</p></li></ul><div><hr></div><h2>8. 
The Hybrid AI Factory of 2026</h2><p>The &#8220;One GPU to Rule Them All&#8221; era is ending&#8212;not because GPUs are weak, but because inference economics and supply-chain physics demand specialization.</p><p>The most defensible base case is not immediate dislocation. It&#8217;s <strong>a bifurcated architecture roadmap</strong>:</p><ul><li><p><strong>Training cortex:</strong> HBM-heavy, CoWoS-gated GPU factories (Blackwell &#8594; Rubin &#8594; beyond).</p></li><li><p><strong>Inference organs:</strong> deterministic, SRAM-centric engines for decode-heavy, low-latency reasoning workflows.</p></li><li><p><strong>Optical nervous system:</strong> a fabric roadmap that keeps scaling while power becomes the binding constraint (<a href="https://research.fpx.world/p/part-2-beyond-power-the-networking">Part 2</a>). </p></li></ul><p>And the reason this &#8220;ages well&#8221; is simple: <strong>innovation cycles are shortening</strong>. Models iterate faster than hardware. Any architecture that can shift performance via compiler/runtime changes&#8212;without needing a brand-new silicon generation to keep up&#8212;wins more often in an era where the workload changes every quarter.</p><div><hr></div><h3>Actionable Insertions: The Signals to Watch (falsifiable)</h3><p>You want signals that clearly validate (or falsify) the &#8220;Option 3 integration&#8221; thesis. Here are two clean ones:</p><h4>Signal #1 &#8212; The &#8220;Jetson Pivot&#8221;</h4><p><strong>Watch for NVIDIA to re-brand or refresh its Jetson/edge/robotics line with Groq-derived IP first.</strong></p><p><strong>Why it&#8217;s the safest integration point:</strong><br>Edge and robotics customers already accept heterogeneity, power constraints, and specialized acceleration. It&#8217;s the lowest-risk place to ship &#8220;non-GPU silicon&#8221; without spooking core datacenter GPU buyers.</p><p><strong>What would count as confirmation:</strong><br>A Jetson-class platform where NVIDIA explicitly markets <em>deterministic token generation / low jitter inference</em> as a core feature, and/or a Jetson module that contains a new inference accelerator block that looks architecturally Groq-like (compiler-scheduled, SRAM-centric).</p><h4>Signal #2 &#8212; The &#8220;Speculative Decoding&#8221; Hybrid Card</h4><p><strong>Watch for a dual-slot card (or tightly coupled server tray) that pairs:</strong></p><ul><li><p>a GPU optimized for <strong>prefill/context</strong> (huge memory footprint)</p></li><li><p>with a Groq-like engine optimized for <strong>decode/generation</strong> (huge speed + determinism)</p></li></ul><p><strong>Why this is the holy grail:</strong><br>The GPU holds the massive model and handles the compute-heavy prefill, while the deterministic engine drafts tokens quickly (and handles the latency-critical generation loop).</p><p><strong>Why it&#8217;s plausible now:</strong><br>NVIDIA is already openly treating inference as two phases (prefill vs decode) and discussing disaggregated serving as a throughput win; Rubin CPX being positioned for context/prefill is the roadmap breadcrumb. 
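</p><p>A toy latency model makes the prefill/decode pairing concrete. All throughput numbers below are placeholders chosen to show the shape of the math, not NVIDIA or Groq figures.</p><pre><code># Toy model of disaggregated serving: prefill and decode on different silicon.
# Rates and the KV-cache handoff penalty are assumptions for illustration.

def request_latency(prompt_tokens, output_tokens, prefill_tps, decode_tps, handoff_s=0.0):
    prefill_s = prompt_tokens / prefill_tps    # memory- and compute-heavy context phase
    decode_s = output_tokens / decode_tps      # latency-critical generation loop
    return prefill_s + handoff_s + decode_s

# One device handling both phases (assumed rates):
monolithic = request_latency(20_000, 2_000, prefill_tps=40_000, decode_tps=80)

# Context-optimized part for prefill, deterministic engine for decode (assumed rates):
split = request_latency(20_000, 2_000, prefill_tps=40_000, decode_tps=500, handoff_s=0.2)

print(round(monolithic, 1), "s monolithic")     # 25.5 s
print(round(split, 1), "s disaggregated")       # 4.7 s
</code></pre><p>Under assumptions like these, almost all of the end-to-end win comes from the decode side, which is why the decode silicon in such a pairing is the part to watch.</p><p>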
</p><p><strong>What would count as confirmation:</strong><br>Any NVIDIA product announcement that explicitly pairs &#8220;context chips&#8221; with &#8220;decode chips&#8221; in one SKU or one standard rack design&#8212;especially if decode silicon is not just &#8220;more GPU,&#8221; but a different architecture.</p><div><hr></div><p>In the next section, we move past architecture and strategy and into the <strong>physical reality of scaling inference</strong>. We break down Groq&#8217;s deployment model at the level that actually matters for investors and operators: <strong>the bill of materials, the manufacturing lanes, the geopolitical hedge, and the new bottlenecks that emerge once you route around HBM and CoWoS</strong>. This is where the narrative shifts from &#8220;faster inference&#8221; to <strong>industrial procurement, brownfield data-center unlocks, cabling density, and who gets paid as inference capacity scales in the real world</strong>.</p><div><hr></div><h1>Part 9 &#8212; Supply Chain: NVIDIA&#8217;s Second Factory Door</h1><p><em>(Why Groq turns inference into an industrial procurement problem)</em></p><div><hr></div><h2>9.1 The Product Isn&#8217;t a Chip. It&#8217;s an Orderable Rack.</h2><p>The fastest way to misunderstand Groq is to argue about TOPS, latency charts, or benchmark deltas.</p><p>The correct way to understand Groq&#8212;and why NVIDIA effectively acquired it&#8212;is to look at <strong>what can be ordered, assembled, shipped, and deployed at scale</strong>.</p><p>Groq&#8217;s unit of deployment is not an exotic supercomputer. It is a <strong>standard server platform</strong>:</p><ul><li><p>A <strong>4U server chassis</strong></p></li><li><p><strong>8 accelerator cards</strong></p></li><li><p>Conventional CPUs</p></li><li><p>Conventional DDR memory</p></li><li><p>Conventional NVMe storage</p></li><li><p>Conventional power supplies</p></li><li><p>A very unconventional amount of cabling</p></li></ul>
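<p>Because the unit of deployment is a standard, orderable server, capacity planning collapses into multiplication. A rough sketch of that procurement math: the 8-cards-per-server figure comes from the list above, while servers-per-rack and per-server wall power are assumptions for illustration.</p><pre><code># Back-of-envelope procurement math for a rack-based inference deployment.
# 8 cards per 4U server is from the parts list above; the rest is assumed.

import math

CARDS_PER_SERVER = 8
SERVERS_PER_RACK = 9        # assumption: nine 4U servers per rack
WATTS_PER_SERVER = 3_500    # assumption: rough wall power per populated server

def bill_of_quantities(target_cards):
    servers = math.ceil(target_cards / CARDS_PER_SERVER)
    racks = math.ceil(servers / SERVERS_PER_RACK)
    power_kw = servers * WATTS_PER_SERVER / 1_000
    return {"servers": servers, "racks": racks, "site_power_kw": round(power_kw, 1)}

print(bill_of_quantities(4_096))
# {'servers': 512, 'racks': 57, 'site_power_kw': 1792.0} under these assumptions
</code></pre><p>The specific numbers matter less than what is on the list: every line item maps to conventional server supply chains rather than to HBM and CoWoS allocations.</p>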
      <p>
          <a href="https://research.fpx.world/p/nvidias-christmas-eve-gift-groq-and">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Part 2 : Beyond Power : The Networking Bottleneck Starting to take Shape]]></title><description><![CDATA[Copper, optics, and where the real bottlenecks and value pools are]]></description><link>https://research.fpx.world/p/part-2-beyond-power-the-networking</link><guid isPermaLink="false">https://research.fpx.world/p/part-2-beyond-power-the-networking</guid><dc:creator><![CDATA[FPX AI]]></dc:creator><pubDate>Tue, 09 Dec 2025 18:12:10 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/c7422fa1-2619-4e57-acb2-2aa565dab7aa_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you watch the AI hype cycle, it looks like GPUs and TPUs are doing all the work. In reality, they&#8217;re just the <strong>muscles</strong>. The <strong>nervous system</strong> that makes an AI factory useful is the network: a stack of copper traces, cables, fibers, and optics that moves bits from chip to chip, rack to rack, datacenter to datacenter, and eventually out to users.</p><p>That network is no sideshow. Data&#8209;center Ethernet switching is already a <strong>low&#8209;$20B annual market</strong>, with forecasts showing it rising toward the <strong>mid&#8209;$30Bs by 2028</strong> as hyperscalers expand AI back&#8209;end and front&#8209;end fabrics, growing high&#8209;single digits annually, while the optical transceiver market is already in the <strong>low&#8209;teens billions</strong> with mid&#8209;teens CAGR, driven heavily by AI datacenter upgrades. Analysts expect shipments of 400G/800G datacom optical modules to climb from about <strong>$9 billion in 2024 toward $16 billion by 2026</strong>, with 800G modules and soon 1.6T driving the next wave. That&#8217;s before you count the backbone optics that tie regions and continents together.</p><p>So this isn&#8217;t just a &#8220;supporting cast.&#8221; It&#8217;s a fast&#8209;growing capital line item and a potential bottleneck for every large&#8209;scale AI deployment.</p><pre><code>FPX Services: This report details the critical bottlenecks in AI scaling. If you are an investor or operator looking to navigate these supply chains, model TCO, or secure hardware capacity, FPX provides the execution layer to turn these insights into action. <a href="https://www.fpx.world/consulting-services">Explore our consulting services here.</a></code></pre><pre><code>As always this is Not investment advice; for informational purposes only. 
</code></pre><p>The clearest way to understand it is to follow a single bit as it travels:</p><ol><li><p>From <strong>one GPU/TPU to another inside a server</strong></p></li><li><p>Across <strong>servers inside a rack</strong></p></li><li><p>Across <strong>racks inside a datacenter</strong> (leaf&#8211;spine)</p></li><li><p>Between <strong>buildings on a campus</strong></p></li><li><p>Between <strong>datacenters across a metro (DCI)</strong></p></li><li><p>Across <strong>metro, long&#8209;haul, and subsea fiber</strong> back to other regions and, ultimately, users</p></li></ol><p>At every step you can ask three questions:</p><ul><li><p>Is this <strong>electrons in copper</strong>, or <strong>photons in glass</strong>?</p></li><li><p>Which <strong>components</strong> and <strong>materials</strong> are in play?</p></li><li><p>Who in the value chain actually makes money here?</p></li></ul><p>Let&#8217;s go ring by ring.</p><pre><code>Paid Subscribers get a TL;DR version at the end of the report if you only want the key takeaways.</code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OKTK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1acd7ef1-008c-44b0-b91c-de6b42bb5bd4_2752x1536.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OKTK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1acd7ef1-008c-44b0-b91c-de6b42bb5bd4_2752x1536.heic 424w, https://substackcdn.com/image/fetch/$s_!OKTK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1acd7ef1-008c-44b0-b91c-de6b42bb5bd4_2752x1536.heic 848w, https://substackcdn.com/image/fetch/$s_!OKTK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1acd7ef1-008c-44b0-b91c-de6b42bb5bd4_2752x1536.heic 1272w, https://substackcdn.com/image/fetch/$s_!OKTK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1acd7ef1-008c-44b0-b91c-de6b42bb5bd4_2752x1536.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OKTK!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1acd7ef1-008c-44b0-b91c-de6b42bb5bd4_2752x1536.heic" width="1200" height="670.054945054945" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1acd7ef1-008c-44b0-b91c-de6b42bb5bd4_2752x1536.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:556445,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://research.fpx.world/i/180152159?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1acd7ef1-008c-44b0-b91c-de6b42bb5bd4_2752x1536.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!OKTK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1acd7ef1-008c-44b0-b91c-de6b42bb5bd4_2752x1536.heic 424w, https://substackcdn.com/image/fetch/$s_!OKTK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1acd7ef1-008c-44b0-b91c-de6b42bb5bd4_2752x1536.heic 848w, https://substackcdn.com/image/fetch/$s_!OKTK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1acd7ef1-008c-44b0-b91c-de6b42bb5bd4_2752x1536.heic 1272w, https://substackcdn.com/image/fetch/$s_!OKTK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1acd7ef1-008c-44b0-b91c-de6b42bb5bd4_2752x1536.heic 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Fig 1: The Bit Journey: GPU &#8594; User</figcaption></figure></div><p></p><div><hr></div><h2>1. Inside the server: why everything is copper at short range</h2><p>Start with something concrete: a rack&#8209;scale AI system like <strong>Nvidia&#8217;s GB300 NVL72</strong>, which ties <strong>72 Blackwell Ultra GPUs and 36 Grace CPUs</strong> into a single liquid&#8209;cooled rack. Inside that rack, each GPU exposes <strong>18 NVLink 5 connections</strong> running at 100 GB/s each, for <strong>1.8 TB/s of bidirectional bandwidth per GPU</strong> and <strong>130 TB/s of NVLink bandwidth across the system</strong>. 
Google&#8217;s <strong>TPU v5p</strong> pods wire together <strong>8,960 chips</strong> with <strong>4.8 Tb/s of inter&#8209;chip interconnect (ICI) per chip</strong> in a 3D torus; the newer <strong>Ironwood (TPU v7)</strong> superpods scale to <strong>9,216 chips</strong> with <strong>9.6 Tb/s per chip</strong> of ICI and <strong>~1.77 PB of HBM3E</strong> addressable memory.</p><p>All of that is happening over <strong>copper</strong>.</p><p>Inside a package, between GPU compute dies and stacks of HBM, the information moves through <strong>microscopic copper interconnects in silicon and package substrates</strong>. You have:</p><ul><li><p>Thin copper traces in the chip&#8217;s metal layers</p></li><li><p>Through&#8209;silicon vias (TSVs) and microbumps connecting dies to HBM</p></li><li><p>Organic substrates or silicon interposers that fan signals out to the package edge</p></li></ul><p>Once you leave the package, you land on a <strong>PCB (printed circuit board)</strong>: a sandwich of fiberglass/epoxy and copper that routes signals between GPUs, CPUs, NICs, and switch ASICs. At this point, everything is still just a <strong>voltage pattern on a copper trace</strong>.</p><p>The key enabling block here is the <strong>SerDes</strong> (serializer/deserializer) built into every high&#8209;speed chip. A GPU doesn&#8217;t push a 1.8 TB/s firehose on a single wire; it spreads that bandwidth across many <strong>high&#8209;speed serial lanes</strong>. NVLink 5, for example, runs at <strong>100 gigabytes per second per link</strong>, and each Blackwell GPU supports 18 such links to reach 1.8 TB/s. Those lanes are amplified, equalized, and shaped to survive a few centimeters of PCB and maybe a connector or two.</p><p>Why is everything copper here?</p><p>Because at <strong>centimeter&#8209;scale</strong>, copper is unbeatable on <strong>cost, latency, and simplicity</strong>. The signal hasn&#8217;t had time to attenuate much; you can route dense, multi&#8209;terabit connections with nothing more exotic than carefully designed copper traces and connectors. Lasers would make this more complex, not less. The bottleneck becomes <strong>SerDes design and signal integrity engineering</strong>, not the medium.</p><p>From an investor&#8217;s perspective, this short&#8209;reach domain belongs to:</p><ul><li><p><strong>GPU/TPU and CPU vendors</strong>, who control the SerDes and package design</p></li><li><p><strong>Switch/NIC/DPU vendors</strong>, who expose hundreds of SerDes lanes per chip</p></li><li><p>A handful of PCB and substrate suppliers that can handle the signal&#8209;integrity demands of 112&#8211;224 Gb/s per lane</p></li></ul><p>None of this looks like &#8220;optics,&#8221; but it directly <strong>drives demand for optics</strong>: the more high&#8209;speed lanes a GPU or switch exposes, the more off&#8209;rack bandwidth you need to move data out of that system.</p><h3><strong>What Companies are working in this space</strong></h3><p>Inside this &#8220;all-copper&#8221; domain, a relatively small set of companies quietly control most of the value chain. At the top sit the compute vendors &#8211; <strong>NVIDIA (NVDA)</strong>, <strong>AMD (AMD)</strong> and the hyperscaler silicon teams inside <strong>Alphabet (GOOGL)</strong>, <strong>Amazon (AMZN)</strong> and <strong>Microsoft (MSFT)</strong> &#8211; who decide how many SerDes lanes exist, at what speeds, and with which protocols (NVLink, Infinity Fabric, PCIe, CXL, TPU ICI, etc.). 
Every extra lane and every bump in per-lane speed multiplies pressure downstream: more I/O pins, more substrate layers, more PCB complexity, more connectors, more signal-conditioning silicon. Right next to them in the stack are the HBM memory suppliers &#8211; <strong>SK hynix (000660.KS)</strong>, <strong>Samsung Electronics (005930.KS)</strong>, and <strong>Micron (MU)</strong> &#8211; whose stacked DRAM sits on the same package and effectively defines how wide and how hot that on-package copper fabric must be; HBM capacity and yields are now as constraining to AI throughput as the GPU dies themselves.</p><p>Underneath that, you have the &#8220;infrastructure of copper&#8221; that makes these short-reach electrons possible. Advanced packaging and assembly are handled by foundry/OSAT players like <strong>TSMC (TSM)</strong>, <strong>Intel (INTC)</strong>, <strong>ASE Technology (ASX)</strong> and <strong>Amkor (AMKR)</strong>, which turn bare dies plus HBM stacks into huge 2.5D packages and must supply enough CoWoS-class capacity to keep GPU releases on schedule. Those packages sit on IC substrates built with <strong>Ajinomoto (2802.T)</strong>&#8217;s ABF film (essentially a near-monopoly insulator for high-end substrates) and manufactured by houses like <strong>Ibiden (4062.T)</strong>, <strong>Shinko (6967.T)</strong>, <strong>Unimicron (3037.TW)</strong> and <strong>AT&amp;S (ATS.VI)</strong>. From there, server and accelerator PCBs come from fabricators such as <strong>TTM Technologies (TTMI)</strong> and Unimicron, populated with high-speed connectors and internal copper assemblies from <strong>TE Connectivity (TEL)</strong>, <strong>Amphenol (APH)</strong> and <strong>Molex (private)</strong>, and stitched together electrically by retimer/clocking and SerDes-adjacent silicon from <strong>Broadcom (AVGO)</strong>, <strong>Marvell (MRVL)</strong>, <strong>Credo (CRDO)</strong>, <strong>Astera Labs (ALAB)</strong>, <strong>Analog Devices (ADI)</strong> and <strong>Texas Instruments (TXN)</strong>. None of these companies sell &#8220;optics,&#8221; but collectively they define how far you can push electrons on a board or inside a chassis, how many lanes you can route, and how much power/latency that costs. In other words: the more bandwidth you squeeze out of this copper stack, the more inevitable it becomes that the <em>next</em> hop &#8211; off the board and out of the rack &#8211; must be solved with fiber and optics.</p><div><hr></div><h2>2. Inside the rack: harnesses, DACs, and active copper</h2><p>Now zoom out a bit. A rack like the NVL72 or a TPU superpod chassis is basically a <strong>small city of copper</strong>.</p><p>Within that rack:</p><ul><li><p>GPUs connect to each other and to <strong>NVSwitch</strong> or similar fabrics over <strong>short copper links</strong>.</p></li><li><p>Servers or trays connect to the <strong>top&#8209;of&#8209;rack (ToR) switch</strong> using <strong>copper cables</strong> up to a few meters long.</p></li><li><p>In TPU pods, chips inside a cube or block are wired over PCB and internal copper harnesses before traffic is ever turned into light.</p></li></ul><p>At this distance scale&#8212;<strong>roughly 0.5&#8211;7 meters</strong>&#8212;the industry splits copper into two regimes:</p><p><strong>Passive DAC (Direct Attach Copper).</strong><br>For very short runs (0&#8211;2 or 3 m), you can get away with simple <strong>twinax copper cables</strong> with no active electronics in the ends. 
The SerDes in the GPU or NIC pushes a PAM4 signal directly down the wire, and the receiver reconstructs it. This is the cheapest, lowest&#8209;latency option.</p><p><strong>Active Electrical Cables (AEC/LACC).</strong><br>Once you push toward 3&#8211;7 m at 100&#8211;200 Gb/s per lane, copper behaves like a severe low&#8209;pass filter&#8212;loss and reflections shred your eye diagram. To stay with copper instead of jumping to optics, vendors embed <strong>DSPs or linear amplifiers in the cable ends</strong>. Those chips equalize and reshape the signal so it can survive longer runs. Companies like Credo built their entire business around these <strong>Active Electrical Cables</strong>, and you saw them show up early in large AI systems such as Intel&#8217;s Gaudi platforms and early GPU clusters.</p><p>You can think of AECs as a <strong>bridge technology</strong>: they buy hyperscalers a few more meters of copper at each speed node so they don&#8217;t have to flood racks with optics too early. The value is extremely tangible: a rack full of AOCs (active optical cables) can burn almost as much power as the switches they connect; a rack using AECs and short DACs buys back real watts without sacrificing speed.</p><p>From a materials perspective, you&#8217;re still dealing with:</p><ul><li><p>Copper twinax cables</p></li><li><p>High&#8209;frequency connectors</p></li><li><p>Small DSP/retimer chips in the cable ends</p></li></ul><p>All of these live in a brutally cost&#8209;sensitive environment. But as line rates climb (112 &#8594; 224 Gb/s per lane), the willingness to pay for <strong>smarter copper</strong> increases. That&#8217;s why AEC and PAM4 PHY vendors are so leveraged to AI networking: they sit exactly where &#8220;just run more DACs&#8221; stops working.</p><h3><strong>What Companies are working in this space</strong></h3><p>At this distance scale you can also start to see a very clear stack of specialized vendors. At the physical layer, connector and cable OEMs like <strong>TE Connectivity (TEL)</strong>, <strong>Amphenol (APH)</strong>, <strong>Molex</strong> (private), <strong>Luxshare Precision (002475.SZ)</strong> and <strong>Foxconn Interconnect Technology (6088.HK)</strong> manufacture the twinax assemblies, high-speed connectors, and internal harnesses that physically bind GPUs, switches, and trays together. Their job is brutally practical: hit tight insertion-loss, crosstalk, and thermal constraints at the lowest possible cost per port, while scaling to millions of identical links. Because hyperscalers want predictable impedance and easy field replacement, these vendors effectively industrialize &#8220;copper as a modular building block&#8221; inside every rack.</p><p>Layered on top of that are the silicon players that turn dumb copper into active copper. <strong>Credo (CRDO)</strong>, <strong>Marvell (MRVL)</strong> and <strong>Broadcom (AVGO)</strong> ship the PAM4 DSPs and retimer PHYs that live in the ends of <strong>AECs</strong> and high-speed copper links; they&#8217;re the reason a 224 Gb/s lane can survive several meters of lossy twinax instead of dying after a meter or two. System OEMs like <strong>NVIDIA (NVDA)</strong>, <strong>Cisco (CSCO)</strong>, <strong>Arista (ANET)</strong> and <strong>Juniper (JNPR)</strong> then qualify and brand end-to-end cable SKUs for their GPU and switch platforms, deciding where to use cheap passive DACs, where to pay up for AECs, and where to flip to optics. 
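</p><p>That decision logic can be written down directly. The reach bands below follow the regimes described above; the per-port power figures are rough assumptions, included only to show why the choice matters at cluster scale.</p><pre><code># Simplified "which medium for this link" rule of thumb inside the rack.
# Reach bands follow the regimes above; per-port watts are assumptions.

def pick_medium(reach_m):
    if reach_m &lt;= 3:
        return ("passive DAC", 0.1)                    # bare twinax, near-zero power
    if reach_m &lt;= 7:
        return ("active electrical cable (AEC)", 5.0)  # DSP in the cable ends
    return ("optics (AOC / transceiver)", 14.0)        # lasers plus DSP, most watts

for reach in (1, 2.5, 5, 10, 100):
    medium, watts = pick_medium(reach)
    print(reach, "m:", medium, "~", watts, "W per port")
</code></pre><p>Multiply the per-port deltas by tens of thousands of links per cluster and the rack-level power argument above falls straight out of the arithmetic.</p><p>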
In other words, the economics of &#8220;how far can we push copper before we&#8217;re forced into fiber&#8221; flow through this whole chain&#8212;cable OEMs monetize volume, DSP vendors monetize every extra dB of reach, and the system vendors arbitrage the mix to minimize watts and dollars per terabit inside the rack.</p><div><hr></div><h2>3. Inside the datacenter: leaf&#8211;spine over fiber</h2><p>Take one more step back. Inside a modern AI datacenter, racks don&#8217;t talk to each other directly; they talk through a <strong>fabric of switches</strong>&#8212;typically some form of <strong>Clos/leaf&#8211;spine topology</strong>.</p><p>At the bottom tier, you have a <strong>ToR or leaf switch</strong> in each rack. Above that, multiple <strong>spine switches</strong> connect every leaf. Any rack can reach any other in <strong>two hops</strong>: leaf &#8594; spine &#8594; leaf. For very large sites, a third <strong>super&#8209;spine</strong> layer ties clusters of spines together.</p><p>Logically, this gives you:</p><ul><li><p>Predictable latency</p></li><li><p>High bisection bandwidth</p></li><li><p>A structured place to scale from one AI cluster to many</p></li></ul><p>Physically, it forces you off copper and onto <strong>fiber</strong>.</p><p>Inside the row, and certainly between rows, you&#8217;re usually looking at <strong>10&#8211;500 meters</strong> of reach. At 400G and especially 800G per port, copper is simply not economical at those distances&#8212;the power you&#8217;d spend on equalization and retimers would be insane. So racks connect up to leaf/spine switches using <strong>optical transceivers</strong> and <strong>fiber</strong>.</p><p>In 2024&#8211;2025, mainstream hyperscale fabrics are largely:</p><ul><li><p><strong>400G</strong> ports built from 4&#215;100G PAM4 lanes</p></li><li><p><strong>800G</strong> ports built from 8&#215;100G or 4&#215;200G PAM4 lanes</p></li><li><p>Transitioning toward <strong>1.6 Tb/s</strong> ports on next&#8209;gen switch silicon and optics roadmaps</p></li></ul><p>Those ports are exposed through <strong>pluggable modules</strong> (QSFP&#8209;DD or OSFP) that contain:</p><ul><li><p>A <strong>PAM4 DSP</strong></p></li><li><p>One or more <strong>lasers</strong> (DFB/EML for single&#8209;mode, VCSEL for multimode)</p></li><li><p><strong>Modulators, photodiodes, TIAs</strong></p></li><li><p>A fiber connector (often MPO or LC)</p></li></ul><p>For distances within a building, most hyperscalers standardize on <strong>single&#8209;mode fiber (SMF)</strong> and <strong>IM&#8209;DD optics</strong>:</p><ul><li><p><strong>DR / DR4</strong> for ~500 m single&#8209;mode</p></li><li><p><strong>FR / FR4</strong> for ~2 km single&#8209;mode</p></li><li><p>Multimode SR is still used in some environments, but the shift to SMF is well underway because it simplifies plant and supports higher speeds over longer runs.</p></li></ul><p>Under the hood, switch ASICs are now in the <strong>100 Tb/s class</strong>. Broadcom&#8217;s Tomahawk 5 and 6 families, for example, offer up to <strong>102.4 Tb/s</strong> per chip&#8212;enough for 128&#215;800G ports or similar combinations&#8212;while Nvidia&#8217;s Quantum&#8209;X InfiniBand and Spectrum&#8209;X Ethernet switches deliver hundreds of 800G ports for AI fabrics.</p><p>All of this is still powered by <strong>SerDes</strong>, but now the copper is very short: a few centimeters from the switch ASIC to the optical module. 
Nearly all of the distance is carried in <strong>glass</strong>.</p><p>This is where the optics market really scales: millions of 400/800G modules per year, low&#8209;teens&#8209;billion&#8209;dollar revenue globally and growing mid&#8209;teens annually, with AI clusters as a key demand driver for early adoption of 800G and 1.6T.</p><h3><strong>What Companies are working in this space</strong></h3><p>Inside this leaf&#8211;spine layer, what you&#8217;re really looking at is a <strong>small group of companies that decide how &#8220;AI bandwidth&#8221; is packaged and sold</strong>. On the system side, <strong>Arista Networks (ANET)</strong> and <strong>NVIDIA (NVDA)</strong> are effectively the dual keystones of high-performance fabrics: Arista dominates &#8220;pure Ethernet&#8221; at the top end, selling stateless Clos building blocks plus EOS and automation that let hyperscalers stitch 100+ Tb/s ASICs into predictable fabrics; NVIDIA, meanwhile, owns the <strong>specialized AI fabric stack</strong> with <strong>Quantum-X InfiniBand</strong> and <strong>Spectrum-X Ethernet</strong>, bundling switches, NICs/SuperNICs, and software (SHARP offload, congestion control, collectives libraries) into a vertically integrated offering tuned for all-reduce and model-parallel traffic. <strong>Cisco (CSCO)</strong> still anchors a huge portion of the datacenter switch TAM, combining its <strong>Nexus/Silicon One</strong> platforms with a captive optics strategy (via Acacia and Luxtera) to sell end-to-end IP/optical solutions to both cloud and enterprise, and now <strong>HPE (HPE)</strong> has effectively absorbed <strong>Juniper&#8217;s (ex-JNPR)</strong> data-center routing/switching franchise, giving it a credible spine/leaf story tied into HPE&#8217;s server and storage lines. In parallel, <strong>Celestica (CLS)</strong> and <strong>Accton (2345.TW)</strong> are the ODM muscle behind many &#8220;white-box&#8221; deployments &#8211; they take merchant silicon (Broadcom, Marvell, sometimes NVIDIA), wrap it into 1U/2U pizza boxes, and ship them by the tens of thousands into AI back-end networks where brand matters less than cost, power, and time-to-rack.</p><p>From a first-principles perspective, these vendors collectively answer three questions for the operator: </p><p><strong>(1) Who controls the forwarding behavior and failure modes of the fabric?</strong> (Arista&#8217;s EOS, NVIDIA&#8217;s Infiniband stack, Cisco/HPE NOSes.) <br><strong>(2) Who integrates the switch ASIC, optics, and mechanics into something that survives at 20&#8211;30 kW per rack?</strong> (Arista/Cisco/HPE on the branded side; Celestica/Accton on the ODM side.) <br><strong>(3) Who sets the price-per-terabit curve as port speeds ratchet from 400G &#8594; 800G &#8594; 1.6T?</strong> Here, ANET and NVDA effectively define the top-end &#8220;AI fabric SKU&#8221; economics, while CSCO/HPE protect share in more general DC fabrics, and CLS/Accton compress margins from below by giving hyperscalers cost-optimized white-box gear. <br>All of them are tightly coupled to the same underlying constraints (Broadcom/Marvell switch ASIC roadmaps, 400/800G optics availability, rack power envelopes), but their business models sit at different layers of the stack: <strong>Arista/NVIDIA/Cisco/HPE monetize software, ecosystem, and support on top of merchant or custom silicon</strong>, whereas <strong>Celestica/Accton monetize volume manufacturing and integration efficiency</strong>. 
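</p><p>One way to see why those 100 Tb/s-class ASICs matter so much: in an idealized non-blocking two-tier leaf&#8211;spine fabric, host-facing port count scales with the square of the switch radix. A sketch (textbook case, ignoring oversubscription, rail-optimized layouts, and cabling limits):</p><pre><code># Idealized non-blocking two-tier leaf-spine sizing from switch radix.
# Real fabrics add oversubscription and multi-plane designs; this is the textbook case.

def two_tier_capacity(ports_per_switch, port_gbps):
    down_per_leaf = ports_per_switch // 2     # half of each leaf faces hosts
    spines = ports_per_switch // 2            # one uplink from every leaf to every spine
    leaves = ports_per_switch                 # each spine port serves one leaf
    host_ports = leaves * down_per_leaf       # radix squared, divided by two
    host_bw_tbs = host_ports * port_gbps / 1_000
    return leaves, spines, host_ports, host_bw_tbs

# A 102.4 Tb/s ASIC exposed as 128 x 800G ports:
print(two_tier_capacity(128, 800))   # (128, 64, 8192, 6553.6)
</code></pre><p>Roughly 8,000 host-facing 800G ports from a single two-tier build is why each jump in switch radix and port speed resets how large one AI cluster can practically be.</p><p>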
As port speeds climb and optics consume a larger fraction of switch BOM, the players that can co-design <strong>fabric topology + NOS + optics strategy</strong> (ANET, NVDA, CSCO, HPE) will have disproportionate leverage over how big an AI cluster can practically get and what each incremental terabit of leaf&#8211;spine capacity costs.</p><div><hr></div><h2>4. Co&#8209;packaged optics: shrinking the electrical domain</h2><p>At 800G, and especially <strong>1.6 Tb/s per port</strong>, even the few centimeters of PCB between switch ASIC and pluggable optics start to hurt. You can throw more DSP at it, but you pay in <strong>power and latency</strong>. Nvidia&#8217;s own internal modeling for traditional architectures shows that a 400,000&#8209;GPU AI &#8220;factory&#8221; could burn on the order of <strong>70+ megawatts just in optics and switching</strong> if you stick with classic pluggables.</p><p>The industry&#8217;s answer is <strong>co&#8209;packaged optics (CPO)</strong> and <strong>silicon photonics</strong>:</p><ul><li><p>Move the optical engines <strong>onto the same package as the switch ASIC</strong>.</p></li><li><p>Shorten the electrical path from centimeters of PCB to millimeters of interposer.</p></li><li><p>Lower electrical loss, which lets you cut transmit power and DSP complexity.</p></li></ul><p>At GTC 2025, Nvidia announced Quantum and Spectrum switches with <strong>embedded photonics and CPO</strong>, claiming up to <strong>1.6 Tb/s per port</strong> and more than <strong>50% reduction in network power</strong> compared to traditional pluggable transceivers at scale. STMicroelectronics, in collaboration with AWS, has separately announced photonics chips for AI datacenter transceivers, and market researchers like LightCounting expect the optical transceiver market to grow from around <strong>$7 billion in 2024 to $24 billion by 2030</strong>, driven by these kinds of deployments.</p><p>For investors, CPO is important because it <strong>reshuffles the value chain</strong>:</p><ul><li><p>More value moves from module vendors to <strong>switch/NIC vendors + photonics engine suppliers</strong>.</p></li><li><p>Packaging and silicon&#8209;photonics process know&#8209;how become major moats.</p></li><li><p>Power per bit&#8212;the ultimate scarce resource in AI datacenters&#8212;becomes the central selling point.</p></li></ul><h3><strong>What Companies are working in this space</strong></h3><p>Co-packaged optics pulls a very different cast of characters into the center of the networking story. At the system level, <strong>NVIDIA (NVDA)</strong>, <strong>Broadcom (AVGO)</strong>, and <strong>Cisco (CSCO)</strong> are no longer just selling boxes with slots for other people&#8217;s modules &#8212; they are designing <strong>electro-optical machines</strong> where the switch ASIC, photonics engines, and package are co-designed as a single organism. 
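</p><p>The power claim above is easy to rough out. Every input in the sketch below is an assumption chosen to show the shape of the math, not NVIDIA&#8217;s internal model:</p><pre><code># Rough optics power model for a very large GPU fleet (all inputs assumed).

GPUS = 400_000
OPTICAL_PORTS_PER_GPU = 8     # assumed: NIC plus leaf/spine hops attributed per GPU
WATTS_PER_PLUGGABLE = 15.0    # assumed: DSP-based 800G pluggable
CPO_POWER_REDUCTION = 0.5     # the "more than 50%" reduction cited above

pluggable_mw = GPUS * OPTICAL_PORTS_PER_GPU * WATTS_PER_PLUGGABLE / 1e6
cpo_mw = pluggable_mw * (1 - CPO_POWER_REDUCTION)

print(round(pluggable_mw, 1), "MW in pluggable optics")   # 48.0 MW under these assumptions
print(round(cpo_mw, 1), "MW with co-packaged optics")     # 24.0 MW
</code></pre><p>Even with conservative inputs the delta is tens of megawatts, which is why CPO is pitched as a power story first and a bandwidth story second.</p><p>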
NVIDIA&#8217;s <strong>Spectrum-X Photonics</strong> and <strong>Quantum-X Photonics</strong> families take this to its logical extreme: 1.6 Tb/s-per-port switches that integrate silicon-photonic engines on the same package, eliminating faceplate pluggables and, by NVIDIA&#8217;s own numbers, cutting network power by roughly <strong>3.5&#215;</strong> and boosting resiliency by an order of magnitude for &#8220;million-GPU AI factories.&#8221; <strong>Broadcom (AVGO)</strong> plays the merchant analog: its <strong>Tomahawk 6 &#8220;Davisson&#8221; 102.4 Tb/s Ethernet switch</strong> co-packages 16 &#215; 6.4 Tb/s optical engines around the ASIC, inheriting experience from its earlier <strong>Tomahawk 5 &#8220;Bailly&#8221; 51.2 Tb/s CPO switch</strong> and using TSMC&#8217;s COUPE photonic engine to drive 64 &#215; 1.6 TbE ports at 200 G/lane. <strong>Cisco (CSCO)</strong>, for its part, has demonstrated full CPO routers using <strong>Silicon One</strong> ASICs with co-packaged silicon-photonics tiles, and publicly commits to CPO as a way to collapse SerDes count and cut per-bit power while keeping an <strong>open, multi-vendor ecosystem</strong> around optics rather than locking everything to a single in-house module line. </p><p>Underneath those system brands sits a new &#8220;CPO stack&#8221; of enablers that looks very different from the pluggable world. <strong>TSMC (TSM)</strong> moves from being &#8220;just the switch fab&#8221; to being a <strong>co-architect of the photonic engine and package</strong>, co-developing compact silicon-photonics platforms and advanced packaging flows for both NVIDIA and Broadcom. Classic optics players like <strong>Coherent (COHR)</strong> and <strong>Lumentum (LITE)</strong>, and fiber/cabling specialists like <strong>Corning (GLW)</strong>, are explicitly named as partners in NVIDIA&#8217;s CPO announcement, because their lasers, modulators, and fiber-attach know-how now get baked into the switch package rather than sold as standalone transceivers. On the photonics-chip side, <strong>STMicroelectronics (STM)</strong> is a good example of how the gravity is shifting: it is co-developing an AI-focused silicon-photonics interconnect chip with <strong>Amazon/AWS (AMZN)</strong> that targets <strong>800 Gb/s and 1.6 Tb/s</strong> optical links, with volume production in Crolles and planned deployment in AWS datacenters and a leading transceiver vendor&#8217;s products starting 2025. Layered on top of this are private startups like <strong>Ayar Labs</strong>, <strong>Lightmatter</strong>, and <strong>Celestial AI</strong> pushing chip-to-chip and in-package optical I/O &#8212; they don&#8217;t show up in market-share tables yet, but they&#8217;re the R&amp;D tip of the spear that could eventually move optics directly onto GPU/CPU packages. Analysts like LightCounting expect this entire transition &#8212; via both <strong>Linear-Drive Pluggables (LPO)</strong> and <strong>CPO</strong> &#8212; to more than <strong>double the share of silicon-photonics-based transceivers from ~30% in the mid-2020s to ~60% by 2030</strong>, effectively shifting a big slice of optics revenue away from &#8220;boxy&#8221; pluggables and toward vendors that own <strong>SiPho IP, advanced packaging, and tight co-design with switch silicon</strong>.</p><div><hr></div><h2>5. 
Across buildings: campus and multi&#8209;hall networks</h2><p>Once you&#8217;re beyond a single hall, you hit the <strong>campus layer</strong>: connecting multiple buildings or datacenter rooms, typically <strong>hundreds of meters to tens of kilometers apart</strong>.</p><p>Physically, you&#8217;re still on <strong>single&#8209;mode fiber</strong>, often laid in underground ducts between facilities. Logically, you&#8217;re either:</p><ul><li><p>Extending your leaf&#8211;spine fabric across buildings, or</p></li><li><p>Creating a higher&#8209;level <strong>aggregation/transport layer</strong> that connects multiple independent fabrics.</p></li></ul><p>Here the optical toolbox widens:</p><ul><li><p>Simple <strong>LR (10 km)</strong> or <strong>ER (40 km)</strong> client optics are enough if you own dark fiber and just need point&#8209;to&#8209;point links.</p></li><li><p>If you need to share fiber among many services, you may adopt <strong>CWDM/DWDM</strong> and even <strong>coherent &#8220;lite&#8221; pluggables</strong> to squeeze many wavelengths onto a single pair of fibers.</p></li></ul><p>Large AI campuses&#8212;think three or more big halls full of GPUs&#8212;often end up blending <strong>packet fabrics</strong> with <strong>optical circuit switching</strong>. Google&#8217;s &#8220;AI Hypercomputer&#8221; architecture for TPU v5p and Ironwood, for instance, uses electrical packet networks for many flows but relies on an <strong>optical circuit switch (OCS)</strong> layer to dynamically rewire high&#8209;bandwidth connections between TPU pods over fiber.(<a href="https://cloud.google.com/blog/products/ai-machine-learning/introducing-cloud-tpu-v5p-and-ai-hypercomputer?utm_source=chatgpt.com">Google Cloud</a>)</p><p>At this layer, the same optical vendors show up, but modules tend to be tuned for <strong>longer reach and higher robustness</strong>; campus links are a critical failure domain boundary, so operators pay for reliability and manageability.</p><h3><strong>What Companies are working in this space</strong></h3><p>Zooming out, the campus layer pulls in a different mix of suppliers than the inside-rack world. Packet vendors like <strong>Cisco (CSCO)</strong>, <strong>Arista (ANET)</strong>, <strong>Juniper (JNPR)</strong> and <strong>NVIDIA (NVDA)</strong> effectively &#8220;own&#8221; the logical fabric between halls: they sell the Ethernet/InfiniBand ports that stitch leaf&#8211;spine domains across buildings, and they decide which LR/ER pluggables get qualified and how much overbuild you carry at this failure-domain boundary. Hanging off those ports is a datacom-optics industry that has quietly turned into a hyperscale lever: <strong>InnoLight</strong>, <strong>Coherent Corp. (COHR)</strong> and <strong>Eoptolink (300502.SZ)</strong> are now the largest suppliers of high-speed datacom modules, in a datacom optical component market that Cignal AI expects to exceed <strong>~$12B by 2026</strong>, driven by 400/800G ramps. LightCounting pegs <strong>InnoLight</strong>&#8217;s 2024 revenue at <strong>&gt;US$3.3B</strong> (up <strong>114%</strong> YoY) and <strong>Eoptolink</strong> at <strong>~US$1.2B</strong> (up <strong>175%</strong>), almost entirely from Ethernet transceivers, while quarterly transceiver sales across leading vendors have already cleared <strong>US$3B/quarter</strong>. 
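</p><p>For the campus links described above, the wavelength math decides whether you pull more fiber or multiplex harder. A sketch with generic round numbers (channel counts and per-wavelength rates are not tied to any specific line system):</p><pre><code># How much inter-building traffic one campus fiber pair carries once DWDM is added.
# Channel plan and per-wavelength rate are generic round numbers, not a product spec.

import math

def fiber_pairs_needed(cross_building_tbps, channels_per_pair=64, gbps_per_channel=400):
    pair_capacity_tbps = channels_per_pair * gbps_per_channel / 1_000   # 25.6 Tb/s here
    pairs = math.ceil(cross_building_tbps / pair_capacity_tbps)
    return pair_capacity_tbps, pairs

capacity, pairs = fiber_pairs_needed(cross_building_tbps=500)
print(capacity, "Tb/s per DWDM fiber pair")                     # 25.6
print(pairs, "fiber pairs for 500 Tb/s between buildings")      # 20
</code></pre><p>That trade, wavelengths versus kilometers of new glass, is where the module vendors above and the fiber and cable vendors below split the campus spend.</p><p>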
Around them, <strong>Lumentum (LITE)</strong>, <strong>Applied Optoelectronics (AAOI)</strong> and <strong>Accelink (300308.SZ)</strong> supply both finished modules and the InP lasers/EMLs, drivers and receivers that let campus links stretch from a few hundred meters to 10&#8211;40 km without falling over.</p><p>Under the optics, you&#8217;re literally paying for glass and wavelength plumbing. Campus and multi-hall builds pull on a <strong>fiber-cable market </strong>worth<strong>~$75B</strong>, growing at roughly <strong>10% CAGR</strong> through the next decade, where vendors like <strong>Corning (GLW)</strong>, <strong>CommScope (COMM)</strong>, <strong>Prysmian (PRY.MI)</strong>, <strong>Fujikura (5803.T)</strong>, <strong>Furukawa Electric (5801.T)</strong> and <strong>Sumitomo Electric (5802.T)</strong> dominate high-quality single-mode cable, ducts and connectivity hardware. Yet the space is structurally fragmented: the top 10 fiber-optic cable competitors control only <strong>~14.23%</strong> of the market, with <strong>Corning ~4.26%</strong>, <strong>CommScope ~2.16%</strong>, <strong>Prysmian ~1.51%</strong>, and the rest spread across dozens of regional players. On top of that fiber, <strong>Ciena (CIEN)</strong>, <strong>Nokia (NOK)</strong> (plus ex-Infinera assets), <strong>Cisco (CSCO)</strong>, <strong>ZTE (0763.HK/000063.SZ)</strong> and <strong>Huawei</strong> sell CWDM/DWDM systems and ROADMs, while a new OCS tier emerges as a choke-point: <strong>Lumentum (LITE)</strong>, <strong>Coherent (COHR)</strong> and <strong>HUBER+SUHNER / POLATIS (HUBN.SW)</strong> are turning optical circuit switches into standard campus gear, with Cignal AI now forecasting the external <strong>OCS market to exceed US$2.5B by 2029</strong> as deployments spread beyond Google into broader AI campuses. First-principles, that&#8217;s the economic story of the campus layer: routers and pluggables concentrate spend per port, fiber vendors monetize kilometers of glass, and a small but fast-growing OCS niche taxes every incremental hall you want to rewire on demand.</p><div><hr></div><h2>6. Metro DCI: datacenter &#8596; datacenter across a city</h2><p>Beyond a campus, you get into <strong>Data Center Interconnect (DCI)</strong>: linking whole sites across a metro region, say 20&#8211;100+ km apart.</p><p>This used to require <strong>dedicated DWDM transport shelves</strong>. Now it&#8217;s increasingly done with <strong>coherent pluggable optics</strong> like <strong>400ZR</strong> and <strong>ZR+</strong>:</p><ul><li><p><strong>400ZR</strong>: 400 Gb/s coherent optics, engineered for <strong>80 km class point&#8209;to&#8209;point DCI</strong> over DWDM fiber, directly from a router/switch port.</p></li><li><p><strong>ZR+ / OpenZR+</strong>: Similar coherent engines with higher power and flexible forward error correction, designed to span <strong>hundreds of kilometers</strong> with amplifiers and ROADMs.(<a href="https://www.hyc-system.com/news/Industry/6162.html?utm_source=chatgpt.com">HYC System</a>)</p></li></ul><p>What makes coherent different from IM&#8209;DD is that the transceiver no longer just measures the <strong>brightness</strong> of light; it reconstructs its <strong>phase and polarization</strong>, and demodulates complex <strong>QAM constellations</strong> (e.g., 16&#8209;QAM, 64&#8209;QAM) with heavy DSP. 
That lets you pack much more information into each symbol and survive traversing ROADM networks.</p><p>For hyperscalers, coherent pluggables are economically elegant because:</p><ul><li><p>They collapse DCI into the <strong>switch/router line card</strong>.</p></li><li><p>They avoid separate DWDM systems at each end.</p></li><li><p>They align with the normal lifecycle of IP equipment.</p></li></ul><p>For the supply chain, they&#8217;re a sweet spot: higher ASPs than simple DR/FR modules, but still high&#8209;volume and tied closely to AI and cloud expansion.</p><h3><strong>What Companies are working in this space</strong></h3><p>Metro DCI also has a very tight industrial stack. At the top, <strong>Cisco (CSCO)</strong> (via <strong>Acacia</strong>), <strong>Marvell (MRVL)</strong> and <strong>Ciena (CIEN)</strong> effectively define what &#8220;400ZR / ZR+&#8221; looks like in the wild. LightCounting notes that <strong>Cisco/Acacia and Marvell dominate shipments of pluggable coherent 400ZR/ZR+ modules</strong>, with Ciena as the main third supplier. Cignal AI and others then show the consequence of that dominance: in 2024, <strong>400ZR and ZR+ accounted for the bulk of all WDM bandwidth deployed</strong>, and <strong>most of those modules were shipped by Marvell, Acacia/Cisco and Ciena</strong>. From a port-level view, we&#8217;ve already crossed the tipping point&#8212;Cisco&#8217;s own CiscoLive material (citing Cignal AI) says <strong>&gt;70% of coherent ports are now pluggable coherent optics</strong>, with Cisco/Acacia positioned as &#8220;pioneer and market leader&#8221; for these pluggables. Put differently: for a metro DCI link, the &#8220;DWDM shelf&#8221; has collapsed into a <strong>QSFP-DD/OSFP module</strong> supplied by a tiny handful of vendors, and every AI rack you light up in a second site is now implicitly paying a Cisco/Acacia, Marvell, or Ciena tax on its coherent ports.</p><p>Underneath the modules sit the <strong>optical line-system vendors</strong> that actually move photons across the city. Dell&#8217;Oro&#8217;s 2024 numbers put <strong>Huawei at ~33%</strong> of global optical transport equipment revenue, with <strong>Ciena at 19%</strong>, and a combined <strong>Nokia + Infinera at ~19%</strong> as well. Reuters and other coverage of Nokia&#8217;s <strong>US$2.3B acquisition of Infinera (INFN)</strong> frame the merged company as the <strong>#2 optical networking vendor with ~20% share</strong>, just behind Huawei, and explicitly call out the deal as a way to sell more DCI gear into AI-driven data-center builds at <strong>Amazon (AMZN)</strong>, <strong>Alphabet (GOOGL)</strong> and <strong>Microsoft (MSFT)</strong>. Cignal AI&#8217;s &#8220;pluggable world&#8221; thesis closes the loop: <strong>coherent pluggable optics accounted for all of the telecom bandwidth growth in 2024</strong>, with embedded optics actually shrinking, and that growth is increasingly metro-focused as IP-over-DWDM becomes the default DCI architecture. First-principles, the metro DCI stack is now highly concentrated: three Western vendors (<strong>CSCO</strong>, <strong>MRVL</strong>, <strong>CIEN</strong>) control most coherent pluggable volume, and a handful of line-system suppliers (<strong>Huawei, Ciena, Nokia+Infinera, ZTE</strong>) control most of the ROADM/amplifier footprint that those pluggables ride over.</p><p></p><h2>7. 
Metro, long&#8209;haul, subsea: the global backbone</h2><p>Once you leave the metro, you&#8217;re in the world of <strong>carrier&#8209;class optical transport</strong>. Distances stretch from <strong>hundreds to thousands of kilometers</strong> along terrestrial routes, then <strong>thousands to tens of thousands</strong> for subsea systems.</p><p>The physical medium is still <strong>single&#8209;mode fiber</strong>, but the system around it is very different:</p><ul><li><p><strong>Coherent transponders and muxponders</strong> aggregate many 100/400/800G client signals into fewer 400/800+G wavelengths.</p></li><li><p><strong>DWDM line systems</strong> stack dozens of wavelengths onto each fiber pair.</p></li><li><p><strong>EDFAs and Raman amplifiers</strong> boost the optical signal every 60&#8211;80 km or so.</p></li><li><p><strong>ROADMs (Reconfigurable Optical Add&#8209;Drop Multiplexers)</strong> dynamically switch wavelengths between directions and destinations at intermediate nodes.(<a href="https://www.hyc-system.com/news/Industry/6162.html?utm_source=chatgpt.com">HYC System</a>)</p></li></ul><p>Vendors like <strong>Ciena, Nokia, Cisco, Infinera, Huawei, and ZTE</strong> dominate this layer, using large coherent DSPs and advanced photonics to keep up with capacity demand.</p><p>AI workloads matter here because:</p><ul><li><p>Training traffic is increasingly <strong>multi&#8209;region</strong>: checkpoints, models, and datasets replicate between continents.</p></li><li><p>Inference traffic can be served from multiple regions for latency or resiliency reasons.</p></li><li><p>Hyperscale AI deployments become anchor tenants for new long&#8209;haul and subsea builds.</p></li></ul><p>This is where the <strong>optics supply chain meets power and geopolitics</strong>. Countries courting hyperscale datacenters (like Malaysia&#8217;s recent push to become a regional DC hub) are also grappling with the <strong>power and water</strong> footprint of these facilities, plus the backbone capacity they require.</p><h3><strong>What Companies are working in this space</strong></h3><p>At the backbone layer, you&#8217;ve basically left the world of &#8220;ports&#8221; and entered the world of <strong>wavelength factories and industrial projects</strong>, and the cast of companies changes accordingly. On the terrestrial side, <strong>Huawei (private)</strong>, <strong>Ciena (CIEN)</strong>, <strong>Nokia (NOK)</strong>, <strong>ZTE (000063.SZ / 0763.HK)</strong>, <strong>FiberHome (600498.SS)</strong>, and <strong>Cisco (CSCO)</strong> are the ones selling complete DWDM line systems, ROADMs, amplifiers, and coherent transponders to carriers and, increasingly, directly to cloud providers. Dell&#8217;Oro&#8217;s 2024 numbers have <strong>Huawei at ~33% global optical transport share and Ciena at ~19%</strong>, with the top five being Huawei, Ciena, Nokia, ZTE and Infinera. By 2Q25, the <strong>top six by revenue share</strong> are Huawei, Ciena, Nokia, ZTE, FiberHome, and Cisco, and direct cloud-provider purchases of WDM systems are growing <strong>~60% year-on-year</strong> as hyperscalers start buying long-haul/metro gear themselves to build AI backbones. Cignal AI pegs the 2Q25 optical transport market at <strong>$3.8B, +9% YoY</strong>, and explicitly attributes the rebound to <strong>AI-driven backbone builds</strong> in North America. 
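</p><p>To connect those backbone builds to AI workloads, it helps to translate replication traffic into wavelengths and hours. The sketch below uses assumed checkpoint and dataset sizes, per-wavelength rates, and utilization; it is illustrative only.</p><pre><code># How long multi-region replication takes over coherent wavelengths (all inputs assumed).

def replication_hours(data_tb, wavelengths, gbps_per_wavelength=400, utilization=0.8):
    usable_gbps = wavelengths * gbps_per_wavelength * utilization
    seconds = data_tb * 8_000 / usable_gbps       # 1 TB = 8,000 gigabits
    return seconds / 3_600

print(round(replication_hours(50, wavelengths=1), 2), "h: 50 TB checkpoint on one 400G wave")
print(round(replication_hours(5_000, wavelengths=8), 2), "h: 5 PB dataset on eight waves")
</code></pre><p>The pattern is the usual one: every new training region or resiliency requirement turns directly into wavelength and route purchases on this layer.</p><p>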
In other words, at this layer <strong>Ciena, Nokia+Infinera, and Cisco</strong> are the main Western ways to own AI backbone spend, while <strong>Huawei/ZTE/FiberHome</strong> dominate China and parts of the Global South; the structural risk is that a handful of vendors now sit between AI clusters and the rest of the world&#8217;s connectivity.</p><p>Underneath those system OEMs is the <strong>coherent &#8220;modem stack&#8221;</strong> &#8211; the DSPs and optical engines that actually turn 100/400/800G Ethernet or OTN signals into a single complex optical waveform that can survive 1,000+ km of dispersion and noise. Here, there are effectively <strong>two merchant gravity wells</strong> plus a few vertically integrated giants. <strong>Cisco (CSCO)</strong>, via its <strong>Acacia</strong> unit, and <strong>Marvell (MRVL)</strong> together dominate merchant <strong>400ZR/ZR+ coherent pluggables</strong>; Cignal AI explicitly calls them established leaders in 400G pluggable optics, with Marvell and Acacia responsible for the bulk of 400ZR volume. <strong>Ciena (CIEN)</strong> historically shipped its <strong>WaveLogic</strong> coherent DSPs only inside its own chassis, but as of 3Q24 it also became a significant supplier of 400G pluggables, crossing over into the router/pluggable ecosystem. <strong>Nokia (NOK)</strong> has its own PSE coherent DSP family, and via the 2025 acquisition of Infinera it inherits an additional coherent/PIC stack. <strong>Huawei (private)</strong>, meanwhile, continues to ship very high-speed embedded coherent ports &#8211; including 800G/1.2T+ class line cards &#8211; primarily into Chinese and aligned markets. From a first-principles standpoint, these companies are all tackling the same Shannon-defined problem: how many bits/Hz you can squeeze into spectrum before SNR kills you. Whoever can deliver <strong>higher-order QAM at lower W/bit</strong> without killing reach wins disproportionate value in long-haul AI replication and multi-region training, because every extra bit of capacity on an existing fiber pair defers hundreds of millions of dollars in new cable or route builds.</p><p>Once you cross an ocean, the business model shifts again from &#8220;selling cards and DSPs&#8221; to <strong>designing, laying, and maintaining entire cable systems</strong>&#8212;an area so concentrated it&#8217;s essentially a cartel of industrial specialists. <strong>Alcatel Submarine Networks / ASN (state-linked, France)</strong>, <strong>SubCom (private, US)</strong>, and <strong>NEC (6701.T)</strong> together account for almost all new submarine cable construction by cable length since 2017; Carnegie&#8217;s 2024 analysis estimates that from <strong>2020&#8211;2024 ASN built ~34% of new subsea systems, SubCom ~19%, and HMN Tech ~10%</strong>, with HMN&#8217;s share of <em>planned</em> systems dropping to ~4% as Western and Quad governments actively discourage Chinese-built systems. TeleGeography and CSIS note that <strong>well over 95% of intercontinental traffic rides these undersea cables</strong>, and that roughly four firms manufacture and install almost all of them. Around that core sit &#8220;picks and shovels&#8221; vendors like <strong>Corning (GLW)</strong> (long-haul fiber), <strong>Prysmian (PRY.MI / PRYMY)</strong> and <strong>Nexans (NEX.PA)</strong> (cable), and a finite fleet of cable-lay ships that governments are now subsidizing or flagging as strategic assets. 
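</p><p>The Shannon framing above can be made concrete. An ideal channel carries B &#215; log2(1 + SNR) bits per second, and each QAM symbol carries log2(M) bits, so higher-order constellations buy relatively few extra bits for a lot of extra SNR. A sketch (idealized, ignoring FEC overhead and fiber nonlinearity):</p><pre><code># Idealized Shannon/QAM arithmetic behind "more bits/Hz until SNR kills you".
# Ignores FEC overhead, polarization details, and nonlinear penalties.

import math

def shannon_gbps(bandwidth_ghz, snr_db):
    snr_linear = 10 ** (snr_db / 10)
    return bandwidth_ghz * math.log2(1 + snr_linear)   # Gb/s in that bandwidth

for snr_db in (10, 15, 20):
    print(snr_db, "dB SNR:", round(shannon_gbps(64, snr_db), 1), "Gb/s in a 64 GHz slot")

print("16-QAM:", math.log2(16), "bits per symbol")   # 4.0
print("64-QAM:", math.log2(64), "bits per symbol")   # 6.0
</code></pre><p>Going from 16-QAM to 64-QAM adds only two bits per symbol but demands several dB more SNR at the same error rate, which is exactly the W/bit and reach battle described above.</p><p>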
For AI investors, the implication is that <strong>global AI scaling ultimately runs through a tiny number of optical-transport and subsea vendors</strong>, and the real bottlenecks can be as mundane&#8212;and as hard to accelerate&#8212;as <strong>fiber draw towers, repeater factories, marine crews, and shipyard capacity</strong>, not just 800G DSP roadmaps.</p><div><hr></div><h2>The copper stack: electrons, SerDes, and DSP</h2><p>Zoom back in and look just at <strong>copper</strong> and the components built around it.</p><p>At the chip level, copper is all about <strong>SerDes and packaging</strong>:</p><ul><li><p>Every GPU, TPU, CPU, NIC, and switch has dozens to hundreds of <strong>high&#8209;speed SerDes lanes</strong>.</p></li><li><p>These are designed at bleeding&#8209;edge process nodes (5 nm and below) and are often as complex as small processors, implementing PAM4 modulation, sophisticated equalization, and forward error correction.</p></li><li><p>Fifth&#8209;generation NVLink, for example, uses 18 links per Blackwell GPU at 100 GB/s each to deliver <strong>1.8 TB/s</strong>, more than <strong>14&#215; PCIe Gen5 bandwidth</strong> from a single GPU.(<a href="https://www.amax.com/fifth-generation-nvidia-nvlink/?utm_source=chatgpt.com">AMAX Engineering</a>)</p></li></ul><p>On PCBs and backplanes, you see:</p><ul><li><p>High&#8209;layer&#8209;count boards using low&#8209;loss dielectrics</p></li><li><p>Precisely controlled trace geometries to maintain impedance and limit reflections</p></li><li><p>Connectors with carefully engineered pin fields to control crosstalk</p></li></ul><p>Then there&#8217;s the <strong>signal&#8209;conditioning silicon</strong> that makes copper competitive at modern speeds:</p><ul><li><p><strong>Retimers</strong> that clean up and re&#8209;launch signals over board traces or cables</p></li><li><p><strong>Gearboxes</strong> that convert between different lane widths and speeds</p></li><li><p><strong>PAM4 DSPs</strong> inside AECs that extend copper reach</p></li></ul><p>Companies like <strong>Credo, Marvell, Broadcom, MaxLinear</strong> and others live here. 
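</p><p>As a quick sanity check of the rack-scale copper numbers quoted above (treating the commonly cited NVLink and PCIe figures as approximations), this is the scale of traffic those signal-conditioning vendors are paid to keep on copper:</p><pre><code># Sanity-checking the rack-scale copper numbers quoted above.
# Figures are the commonly cited ones; treat them as approximations.

nvlink_links_per_gpu = 18        # fifth-generation NVLink
gb_s_per_link = 100              # GB/s per link, both directions combined
nvlink_gb_s = nvlink_links_per_gpu * gb_s_per_link    # 1,800 GB/s = 1.8 TB/s

pcie_gen5_x16_gb_s = 128         # ~64 GB/s each way for an x16 Gen5 slot

print(f"NVLink 5 per GPU : {nvlink_gb_s / 1000:.1f} TB/s")
print(f"PCIe Gen5 x16    : {pcie_gen5_x16_gb_s} GB/s")
print(f"ratio            : {nvlink_gb_s / pcie_gen5_x16_gb_s:.1f}x")   # roughly 14x, as in the text
</code></pre><p>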
They sell into line cards, NICs, and active cables&#8212;products whose unit volumes rise sharply with each big AI cluster rollout and whose content per port grows with every speed bump.</p><p>The economic pattern is simple:</p><ul><li><p>AI pushes per&#8209;lane speeds from 25 &#8594; 50 &#8594; 100 &#8594; 200 Gb/s.</p></li><li><p>Copper wants to die at each step; DSP vendors keep resurrecting it.</p></li><li><p>As long as rack&#8209;level distances stay in the <strong>single&#8209;digit meters</strong>, there&#8217;s a strong incentive to do that resurrection instead of replacing every cable with fiber.</p></li></ul><div><hr></div><h2>The optics stack: lasers, silicon photonics, and coherent DSP</h2><p>On the optics side, the &#8220;headline&#8221; products are the <strong>pluggable transceivers</strong>, but the story goes deeper.</p><p>A <strong>400G or 800G IM&#8209;DD module</strong> for a datacenter port is a tiny system:</p><ul><li><p>A <strong>PAM4 DSP</strong> takes electrical lanes from the switch/NIC, pre&#8209;distorts and equalizes them, and may implement FEC.</p></li><li><p><strong>Laser sources</strong> (DFB/EML for single&#8209;mode, VCSEL for multimode) generate light at specific wavelengths.</p></li><li><p><strong>Modulators</strong> encode the PAM4 signal onto light intensity.</p></li><li><p><strong>Photodiodes and TIAs</strong> on the receive side convert incoming light back into electrical current and then voltage.</p></li><li><p>A small controller monitors temperature, power, and alarms.</p></li></ul><p>For <strong>coherent modules</strong> (400ZR/ZR+, line&#8209;side 800G and beyond), add:</p><ul><li><p>A <strong>local oscillator laser</strong> to beat against the incoming signal</p></li><li><p><strong>I/Q modulators</strong> and coherent receivers that can detect phase and polarization</p></li><li><p>A much heavier <strong>coherent DSP</strong> that demodulates QAM constellations and handles polarization rotation, chromatic dispersion, and non&#8209;linearities</p></li></ul><p>Behind these modules sits a <strong>materials and process ecosystem</strong>:</p><ul><li><p>III&#8209;V semiconductors like <strong>InP and GaAs</strong> for lasers</p></li><li><p><strong>Silicon photonics</strong> processes for modulators and waveguides</p></li><li><p>Advanced packaging to co&#8209;locate photonics dies, DSP dies, and fiber arrays</p></li><li><p>Fiber and cabling from <strong>Corning, Prysmian, Sumitomo, Fujikura, Furukawa/OFS, YOFC</strong>, and others that actually carry the light.(<a href="https://www.fortunebusinessinsights.com/optical-transceiver-market-108985?utm_source=chatgpt.com">Fortune Business Insights</a>)</p></li></ul><p>The numbers here are large and accelerating. 
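</p><p>To see where those numbers come from, here is a rough sizing of the optics bill for a single AI cluster; every count, price, and wattage below is an assumption for illustration, not vendor guidance:</p><pre><code># Rough sizing of the optics bill for a single AI cluster.
# Every count, price, and wattage is an assumption for illustration only.

gpus = 16_000
ports_per_gpu = 1                # one 800G back-end port per GPU (assumed)
network_tiers = 3                # leaf, spine, core in a fat-tree-style fabric (assumed)
optics_per_gpu = ports_per_gpu * 2 * network_tiers   # two module ends per tier hop
# Real ratios depend on oversubscription and rail design; this is deliberately crude.

modules = gpus * optics_per_gpu
usd_per_module = 900             # blended 800G module price (assumed)
watts_per_module = 15            # typical 800G pluggable draw (assumed)

print(f"800G modules : {modules:,}")
print(f"optics capex : ${modules * usd_per_module / 1e6:.0f}M")
print(f"optics power : {modules * watts_per_module / 1e6:.2f} MW")
</code></pre><p>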
The global optical transceiver market is estimated around <strong>$12&#8211;15 billion in 2024</strong>, with projections out to <strong>$35&#8211;40+ billion by early 2030s</strong> at mid&#8209;teens annual growth, and AI&#8209;cluster&#8209;specific optics already exceeding <strong>$4 billion</strong> in annual sales.(<a href="https://www.fortunebusinessinsights.com/optical-transceiver-market-108985?utm_source=chatgpt.com">Fortune Business Insights</a>) High&#8209;speed datacom optics (400/800G) alone are about <strong>$9 billion in 2024</strong>, expected to approach <strong>$12 billion by 2026</strong>.(<a href="https://cignal.ai/2025/01/over-20-million-400g-800g-datacom-optical-module-shipments-expected-for-2024/?utm_source=chatgpt.com">Cignal AI</a>)</p><p>As CPO takes hold, some of that value shifts from discrete module vendors to:</p><ul><li><p><strong>Switch/NIC silicon vendors</strong> integrating optics on&#8209;package</p></li><li><p><strong>Photonics IP vendors</strong> and fabs</p></li><li><p>Packaging houses capable of marrying advanced CMOS with photonics at scale</p></li></ul><p>This is where Nvidia, Broadcom, Marvell, Intel, Cisco/Acacia, Ciena, and an emerging wave of silicon&#8209;photonics specialists are all trying to stake out long&#8209;term moats.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!s7ld!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ff6e4d7-d930-4fdc-b891-54f4ec5e4d13_2752x1536.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!s7ld!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ff6e4d7-d930-4fdc-b891-54f4ec5e4d13_2752x1536.heic 424w, https://substackcdn.com/image/fetch/$s_!s7ld!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ff6e4d7-d930-4fdc-b891-54f4ec5e4d13_2752x1536.heic 848w, https://substackcdn.com/image/fetch/$s_!s7ld!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ff6e4d7-d930-4fdc-b891-54f4ec5e4d13_2752x1536.heic 1272w, https://substackcdn.com/image/fetch/$s_!s7ld!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ff6e4d7-d930-4fdc-b891-54f4ec5e4d13_2752x1536.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!s7ld!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ff6e4d7-d930-4fdc-b891-54f4ec5e4d13_2752x1536.heic" width="1200" height="670.054945054945" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9ff6e4d7-d930-4fdc-b891-54f4ec5e4d13_2752x1536.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:333149,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://research.fpx.world/i/180152159?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ff6e4d7-d930-4fdc-b891-54f4ec5e4d13_2752x1536.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!s7ld!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ff6e4d7-d930-4fdc-b891-54f4ec5e4d13_2752x1536.heic 424w, https://substackcdn.com/image/fetch/$s_!s7ld!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ff6e4d7-d930-4fdc-b891-54f4ec5e4d13_2752x1536.heic 848w, https://substackcdn.com/image/fetch/$s_!s7ld!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ff6e4d7-d930-4fdc-b891-54f4ec5e4d13_2752x1536.heic 1272w, https://substackcdn.com/image/fetch/$s_!s7ld!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ff6e4d7-d930-4fdc-b891-54f4ec5e4d13_2752x1536.heic 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Fig2: Inside Datacom Transceiver</figcaption></figure></div><p></p><div><hr></div><h2>Why networking and optics are a real scaling constraint</h2><p>Pull back and you see why this entire stack is both <strong>a growth engine and a bottleneck</strong>.</p><p>On the compute side:</p><ul><li><p>A single <strong>GB300 NVL72</strong> rack can deliver 
<strong>130 TB/s of NVLink bandwidth</strong> and <strong>tens of exaFLOPs of FP4 inference</strong> when clustered as Microsoft just did with <strong>4,608 GB300 GPUs</strong> in a supercomputer&#8209;scale Azure deployment.(<a href="https://www.nvidia.com/en-us/data-center/nvlink/?utm_source=chatgpt.com">NVIDIA</a>)</p></li><li><p>A single <strong>Ironwood TPU superpod</strong> can deliver <strong>42.5 exaFLOPs of FP8</strong> across <strong>9,216 chips</strong>, linked at <strong>9.6 Tb/s each</strong>, with <strong>1.77 PB</strong> of shared, directly addressable <strong>HBM3E</strong>.</p></li></ul><p>On the network side, the world is trying to keep up by:</p><ul><li><p>Pushing switch ASICs from 25.6 &#8594; 51.2 &#8594; 102.4 Tb/s and on to 200+ Tb/s</p></li><li><p>Pushing per&#8209;port optics from 100G &#8594; 400G &#8594; 800G &#8594; 1.6T</p></li><li><p>Adopting CPO and silicon photonics to cut <strong>watts per bit</strong></p></li><li><p>Deploying ever more 400ZR/ZR+ coherent optics for DCI and beyond </p></li></ul><p>That&#8217;s why companies like Marvell now talk about the <strong>data center semiconductor TAM&#8212;including switching and interconnect&#8212;heading toward nearly $100 billion by 2028</strong>, with custom AI chips and interconnect silicon capturing a significant slice.</p><p>At the same time, regulators and local governments are waking up to the <strong>power and water footprint</strong> of these AI &#8220;factories.&#8221; Regions like Johor in Malaysia are courting multi&#8209;gigawatt datacenter clusters to chase AI growth while simultaneously confronting grid strain and sustainability concerns. Networking and optics sit right in that tension: they are essential to scaling AI, but they cost real power and money.</p><p>From an investor&#8217;s view, the thesis is straightforward:</p><ul><li><p>Every time GPU/TPU performance doubles, <strong>networking and optics must catch up</strong> or the extra FLOPs are wasted.</p></li><li><p>That means more SerDes, more AECs, more optical modules, more coherent pluggables, and more backbone capacity.</p></li><li><p>The market for the chips and optics that move bits is already tens of billions of dollars, growing faster than the underlying datacenter market and tightly correlated with AI capex.</p></li></ul><p>The mental model you want to carry around is very simple:</p><ul><li><p><strong>Electrons for short distances</strong>, where copper is cheap and photons are overkill.</p></li><li><p><strong>Photons for long distances</strong>, where copper dies and fiber looks almost lossless.</p></li><li><p>A stack of <strong>SerDes, DSPs, lasers, fibers, and switches</strong> that hand off a bit from one regime to the other, over and over, from GPU die to user device.</p></li></ul><p>Everything else&#8212;product cycles, vendor line&#8209;ups, valuations&#8212;is just the story of who controls which part of that journey, and how much rent they can charge per bit.</p><div><hr></div><pre><code><strong>Turning Insight into Execution</strong> Understanding the physics of failure is just the first step; executing around it is where the edge is found. 
<strong>Today</strong>, FPX helps operators and investors navigate these choke points through our <strong>Consulting &amp; Procurement Services</strong>&#8212;securing allocations of scarce compute, modeling TCO for liquid-cooled retrofits, and mapping supply chains to avoid geopolitical risk.</code></pre><pre><code>But we believe the industry needs more than just new orders; it needs liquidity. We are actively inviting partners to join our <strong>&#8220;Infrastructure Circularity&#8221; initiatives</strong>. As hyperscalers aggressively migrate to 1.6T and Blackwell, massive volumes of high-performance gear (400G optics, H100 clusters, and surplus copper) risk becoming &#8220;stranded assets.&#8221; We are building the clearinghouse to recertify and recirculate this capacity&#8212;turning one player&#8217;s decommissioned waste into another&#8217;s &#8220;Inference-Ready&#8221; treasure. Whether you are an operator sitting on surplus inventory or a builder struggling to source critical spares, <strong><a href="https://www.fpx.world/consulting-services">reach out to the FPX team</a></strong>. Let&#8217;s turn supply chain inefficiencies into deployment velocity.</code></pre><div><hr></div><p>Everything above gives you the <em>physics and intuition</em> of the AI networking stack.<br>You now understand <strong>how a bit moves</strong> inside an AI system &#8212; from copper traces in a GPU package all the way out to subsea cables.</p><p>But knowing <em>how</em> the bit moves is only half the story.</p><p>The part that actually determines whether an AI datacenter scales, ships, or collapses under its own ambition lives in the <strong>paid section</strong>.</p><p>That&#8217;s where we stop describing the nervous system &#8212;<br>and start diagnosing <strong>its failures, choke points, and the companies that profit or suffer at each one.</strong></p><p>In the paid section we cover:</p><ol><li><p><strong>Bottlenecks at every distance scale</strong><br>Exactly where things fail: CoWoS/HBM capacity, PCB loss and retimers, DAC/AEC limits, 400/800G optics power walls, CPO readiness, metro/DCI spectrum, and subsea capacity.</p></li><li><p><strong>Full supply chain, down to materials</strong><br>Which companies control ABF, substrates, laminates, twinax, DSPs, lasers, SiPh, coherent engines, fiber, cable-lay ships &#8212; and where the true chokepoints are.</p></li><li><p><strong>Power + capex math that actually constrains clusters</strong><br>How many MW the network burns, how much of a rack/switch is optics, when pluggables stop making economic sense, and what 1.6T/3.2T really imply for budgets.</p></li><li><p><strong>Winners, losers, and shifting moats</strong><br>Public tickers and key privates that benefit from each bottleneck, who&#8217;s structurally at risk, and where the margin pools move as we go from DACs &#8594; 800G &#8594; CPO.</p></li><li><p><strong>Breakthroughs required to avoid hitting a wall</strong><br>What must go right in packaging, SerDes, optics, coherent DSP, fiber, and subsea for AI scaling to continue through 2030 &#8212; plus where there&#8217;s room for new entrants.</p></li></ol><p><strong>Above we had = physics + architecture.</strong></p><p><strong>Below we will have = bottlenecks + supply chain + investable edge.</strong></p>
      <p>
          <a href="https://research.fpx.world/p/part-2-beyond-power-the-networking">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Breaking down Google's plan to Double AI Compute Every Six Months]]></title><description><![CDATA[A first&#8209;principles teardown of Google&#8217;s Hypercomputer&#8212;chips, power, networking, memory, and models&#8212;and what actually has to be deleted, rebuilt, and scavenged to make the 6&#8209;month doubling curve real.]]></description><link>https://research.fpx.world/p/breaking-down-googles-plan-to-double</link><guid isPermaLink="false">https://research.fpx.world/p/breaking-down-googles-plan-to-double</guid><dc:creator><![CDATA[FPX AI]]></dc:creator><pubDate>Mon, 24 Nov 2025 21:01:45 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/0fa21ac1-9daa-47cc-a0dd-cbe7fa29b113_1232x928.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Google&#8217;s AI infrastructure chief, Amin Vahdat, recently told employees that demand for AI services now requires <strong>doubling Google&#8217;s compute capacity every six months</strong> &#8211; aiming for a <strong>1,000&#215; increase in 5 years</strong>. This goal, presented at a November all-hands meeting, underscores an unprecedented scaling challenge. This rate of exponential growth far outpaces the historical 2&#215; every ~24 months of Moore&#8217;s Law. A few months ago we wrote an article breaking down the <a href="https://research.fpx.world/p/googles-tpu-supply-chain-playbook">supply chain that goes into manufacturing the TPUs and mapping out the bottlenecks.</a></p><p>This time we take that to the next level. For Google to achieve the impossible, it must take a page out of Elon&#8217;s playbook and:</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://research.fpx.world/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">This Substack is reader-supported. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><ol><li><p><strong>Deconstruct to Physics:</strong> Strip the data center of its racks, cables, and vendors. What remains are the only hard limits: the flow of electrons, the speed of photons, and the rejection of heat. If the laws of thermodynamics allow it, it is possible. Everything else is just legacy.</p></li><li><p><strong>Rebuild to Economics:</strong> We calculate the &#8220;Idiot Index&#8221; of compute. By comparing the spot price of raw energy and silicon wafers to the current market price of inference, we expose a massive pricing disconnect. This gap isn&#8217;t value&#8212;it is structural inefficiency. Eliminating it is the only way to make the unit economics of a 1,000&#215; scale-up viable.</p></li></ol><p>Mapping the physics of this 1,000&#215; scale-up uncovers distinct opportunities across the value chain. For our readers at Google, we hope you enjoy this independent analysis and welcome any feedback on our assumptions. 
For data center operators and power developers, this breakdown separates short-term speculation from structural reality, offering a blueprint to align land and energy assets with the liquid-cooled, gigawatt-scale architectures of the future. Finally, for investors, the opportunity extends well beyond the obvious chip names; we aim to highlight the emerging, critical ecosystems&#8212;from photonics to thermal management&#8212;that must scale alongside the GPU to make this roadmap possible. As always, this is not investment advice.</p><pre><code><strong>Supply to FPX.</strong> Got <strong>spare GPUs/compute</strong>, <strong>liquid&#8209;ready colo</strong>, or <strong>powered land with interconnect</strong>? List it on FPX&#8212;the AI infrastructure marketplace for secondary hardware, colocation, and powered land.</code></pre><h2><strong>Table of Contents</strong></h2><p><strong>1. Silicon: Breaking the 2.5-Year Chip Cycle</strong><br><em>Chiplets, SPAD (prefill vs decode), Lego-style TPUs, and how Google shrinks hardware iteration time.</em></p><p><strong>2. Power: Turning Data Centers Into Power Plants</strong><br><em>SMRs, geothermal, boneyard turbines, grid arbitrage, and why power must become part of the design, not an input.</em></p><p><strong>3. Networking: From Packet Cops to Virtual Wafers</strong><br><em>Optical circuit switching, Google&#8217;s Huygens-class time sync, and designing a fabric that behaves like one giant chip.</em></p><p><strong>4. Memory: Surviving the HBM Bottleneck</strong><br><em>HBM scarcity, CXL Petabyte Shelves, Zombie tiers, FP4, and architecting around a finite memory supply.</em></p><p><strong>5. Models: When Intelligence Meets Physics</strong><br><em>DeepSeek-style sparsity, Titans memory, inference-time reasoning (o1/R1), world models, and the next era of model&#8211;infrastructure co-design.</em></p><h3><strong>The 6-Month Doubling Mandate: A &#8220;Wartime&#8221; Mobilization</strong></h3><p>At Google&#8217;s November all-hands, Infrastructure Chief Amin Vahdat did not present a forecast; he issued a mobilization order. His slide on &#8220;AI compute demand&#8221; laid out a vertical ascent: <strong>&#8220;Now we must double every 6 months&#8230; the next 1,000&#215; in 4&#8211;5 years.&#8221;</strong></p><p>This is not an aspirational target. It is the calculated minimum velocity required to survive.</p><p>The cost of missing this target is already visible on Google&#8217;s balance sheet. CEO Sundar Pichai frankly admitted that the rollout of <strong>Veo</strong>, Google&#8217;s state-of-the-art video generation model, was throttled not by code, but by physics: <em>&#8220;We just couldn&#8217;t [give it to more people] because we are at a compute constraint.&#8221;</em> The financial impact is immediate&#8212;a <strong>$155 billion cloud contract backlog</strong> sits waiting for the silicon to serve it. In this high-stakes environment, capacity is no longer just infrastructure; it is the ceiling on revenue.</p><p>This only works if silicon, data center, networking, DeepMind and power teams behave like one product org, not five silos. The &#8216;delete the part&#8217; algorithm applies to org charts too.</p><h3>The Game Plan: Beyond Brute Force</h3><p>Vahdat was explicit: &#8220;Our job is to build this infrastructure, but not to outspend the competition.&#8221; Achieving a <strong>1,000&#215; gain</strong> by ~2029 through spending alone is mathematically impossible. 
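</p><p>The arithmetic makes the point quickly. A minimal sketch, using placeholder baselines rather than Google's actual fleet numbers:</p><pre><code># The arithmetic behind "double every six months", and why spending alone cannot carry it.
# Baseline figures are placeholders, not Google's actual fleet numbers.

doublings_per_year = 2
years = 5
growth = 2 ** (doublings_per_year * years)      # 2^10 = 1,024x over five years

baseline_power_mw = 1_500        # assumed AI fleet power today
baseline_capex_b = 30            # assumed annual AI infrastructure capex today, $B

# If perf/W and dollars-per-FLOP stayed flat, the same growth would require:
flat_power_gw = baseline_power_mw * growth / 1_000
flat_capex_b = baseline_capex_b * growth

print(f"compute growth        : {growth:,}x")
print(f"power at flat perf/W  : {flat_power_gw:,.0f} GW")
print(f"capex at flat $/FLOP  : ${flat_capex_b:,.0f}B per year")
# ~1,500 GW is a sizeable fraction of the world's average electricity load,
# which is why efficiency, not the checkbook, has to carry most of the curve.
</code></pre><p>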
It requires a fundamental re-architecture of the computing stack.</p><p>Google&#8217;s strategy has evolved from simple &#8220;scaling&#8221; to building a unified <strong>&#8220;AI Hypercomputer.&#8221;</strong> This approach attacks the problem from <strong>five distinct vectors</strong>:</p><h4><strong>1. Silicon Specialization (The Physics of Time)</strong></h4><p>Google is ending the era of the monolithic, general-purpose chip. The new <strong>TPU v7 &#8220;Ironwood&#8221;</strong> leverages <strong>&#8220;Active Bridge&#8221; chiplets</strong> to break the 30-month design cycle, allowing Google to swap compute tiles annually while keeping the I/O base stable. By splitting silicon into <strong>Reader Pods</strong> (dense compute for prefill) and <strong>Writer Pods</strong> (dense memory for decode), they align the hardware to the specific physics of the workload, achieving <strong>3&#215; the throughput per watt</strong> over generic GPUs.</p><h4><strong>2. Thermodynamic Coupling (The Physics of Power)</strong></h4><p>You cannot plug 1,000&#215; more chips into a standard grid. Google is moving from &#8220;consuming&#8221; power to &#8220;coupling&#8221; with it. This means bypassing the grid via <strong>Nuclear SMRs</strong> and <strong>Geothermal</strong> sources, and using <strong>Liquid-to-Liquid cooling</strong> to feed TPU waste heat directly into reactor feedwater systems. By deleting the chiller plant and sourcing power behind the meter, they turn the data center into a thermodynamic co-generator rather than a parasitic load.</p><h4><strong>3. The &#8220;Virtual Wafer&#8221; Network (The Physics of Bandwidth)</strong></h4><p>Scaling to 100,000 chips fails if the network is a bottleneck. Google is deploying Optical Circuit Switches (OCS)&#8212;mirrors that route light instead of electricity&#8212;combined with a <strong>Huygens&#8209;class time&#8209;synchronization stack</strong> (nanosecond&#8209;grade clock sync that gives every NIC and TPU the same notion of &#8216;now&#8217;) to create a &#8220;scheduled&#8221; network. By deleting reactive electrical switches and power-gating electronics during compute cycles, they create a fabric that behaves like a single giant chip (a &#8220;Virtual Wafer&#8221;) spanning the entire campus.</p><h4><strong>4. Memory Disaggregation (The Physics of Capacity)</strong></h4><p>HBM is the most expensive real estate on Earth, and it is sold out. Google is breaking the &#8220;private backpack&#8221; model where memory is trapped on individual chips. Through <strong>CXL &#8220;Petabyte Shelves&#8221;</strong> and <strong>&#8220;Zombie&#8221; Tiers</strong> (recertified storage), they allow TPUs to borrow capacity instantly from a shared pool. Simultaneously, they are using <strong>Synthetic Data</strong> to enable <strong>FP4 (4-bit) training</strong>, effectively quadrupling the capacity of every HBM stack in the fleet without buying new silicon.</p><h4><strong>5. Model Co-Design (The Physics of Intelligence)</strong></h4><p>Hardware alone cannot bridge the gap. Learning from the efficiency of <strong>DeepSeek</strong> and the constraints of physics, Google DeepMind is rewriting the model architecture itself. 
This includes adopting <strong>Multi-Head Latent Attention (MLA)</strong> to slash memory usage, <strong>Titans</strong> architecture for long-term neural memory (replacing context window bloat), and <strong>System 2 &#8220;Cortex&#8221; logic</strong> that trades time for parameters. The goal is to escape the &#8220;Transformer Monoculture&#8221; and build models that inherently require fewer joules per thought.</p><pre><code><strong>FPX: The AI Infrastructure Marketplace.</strong> We run a <strong>secondary hardware marketplace</strong> (recertified accelerators, DRAM, SSD/HDD), place it into <strong>liquid&#8209;ready colocation</strong>, and bundle <strong>powered land</strong> with interconnect so you can scale now. When supply chains stall, <strong>we get creative and fix bottlenecks</strong>.</code></pre><p>We&#8217;ll start with the part that looks the most familiar from the outside &#8212; the chips &#8212; and then follow the constraints outward into power, networks, memory, and finally the models themselves.</p><p></p><h2>1) Silicon: Matching Hardware to the Speed of Intelligence</h2><h4>1) Breaking It Down to the Physical Constraints</h4><p>If you ignore the branding and SKUs, Google&#8217;s AI hypercomputer is bounded by four hard things:</p><ul><li><p>Time &#8211; how fast you can change silicon.</p></li><li><p>Calculation &#8211; how much energy each useful operation burns.</p></li><li><p>Bandwidth and distance &#8211; how far bits have to travel, and through what medium.</p></li><li><p>Intelligence &#8211; how many bits you really need to represent the world and keep state.</p></li></ul><p>TPUs, Axion, Titanium, Apollo, Firefly, SparseCore &#8211; these are not random product names; they&#8217;re successive attempts to align the machine with those four constraints. The question isn&#8217;t &#8220;are they doing enough?&#8221; but &#8220;what else can be deleted?&#8221;</p><h4><em>The Physics of Time: silicon moves in years, models move in months</em></h4><p>The first hard limit is temporal. Leading&#8209;edge chips still move on a roughly 2&#8211;3&#8209;year design/fab cycle. Large models, attention mechanisms, agent architectures and serving patterns are turning over in 6&#8211;12 months. That mismatch is why chips end up bloated: because you can&#8217;t know exactly what Gemini&#8209;4 or some agentic successor will look like, you overbuild &#8220;just in case.&#8221;</p><p>The way out, from a physics point of view, is to stop treating a TPU as a single static object. You freeze the &#8220;slow physics&#8221; parts and you spin the &#8220;fast physics&#8221; parts.</p><p>Slow physics: I/O PHYs, high&#8209;speed SerDes, HBM interfaces, power delivery, security islands &#8211; analog, timing&#8209;critical, painful to re&#8209;verify. Fast physics: systolic arrays, sparsity engines, precision formats, routing logic &#8211; the math and the dataflow.</p><p>The Active Bridge / Lego Pod metaphor is useful here. 
Instead of one mega&#8209;die, you build:</p><ul><li><p>a <strong>long&#8209;lived base tile</strong> for I/O + HBM</p></li><li><p>one or more <strong>Math Tiles</strong> that you can respin yearly</p></li><li><p>and a <strong>bridge</strong> that makes them behave like one chip</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TeFi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcd10fee-08bd-4541-9394-3ff4cbbacdf0_2816x1536.heic"><img src="https://substackcdn.com/image/fetch/$s_!TeFi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcd10fee-08bd-4541-9394-3ff4cbbacdf0_2816x1536.heic" width="1456" height="794" alt="" loading="lazy"></a></figure></div><p>Once you have that, the &#8220;chip&#8221; becomes a configuration problem. A training or prefill&#8209;heavy pod might be four Math Tiles and two HBM tiles bolted to a bridge. A decode&#8209;heavy pod might be one Math Tile and eight HBM tiles. Same tiles, same manufacturing stack, completely different physics profile.</p><p>SPAD (Specialized Prefill And Decode) is the physics-aligned insight that an LLM isn&#8217;t one workload, but two. Prefill is compute-bound: square matrix multiplies that want wall-to-wall MXUs and dense FLOPs. Decode is memory-bound: KV lookups and skinny matmuls that sit idle unless you feed them bandwidth and HBM. Traditional GPUs try to be &#8220;ambidextrous&#8221; and serve both phases, which means they&#8217;re great at neither. SPAD flips that: build one kind of silicon for prefill (Readers), another for decode (Writers), and wire the fabric so each token is handled by the hardware whose physics actually fits it.</p><p>Ironwood is already a step in this direction: it&#8217;s explicitly an inference&#8209;first TPU with more aggressive perf/W, heavy matrix units, and the expectation that it will be replaced more rapidly than the underlying data center fabric. 
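</p><p>A minimal sketch of the roofline argument behind that split, using generic dense-transformer math and assumed sizes rather than any specific model:</p><pre><code># Why prefill and decode want different silicon: arithmetic intensity.
# Generic dense-transformer math with assumed sizes, not any specific model.

params = 70e9                    # 70B parameters (assumed)
bytes_per_param = 2              # BF16 weights
prompt_tokens = 4_096            # tokens processed together during prefill
flops_per_token = 2 * params     # rough dense forward-pass cost

# Prefill: one pass over the weights serves the whole prompt.
prefill_intensity = (flops_per_token * prompt_tokens) / (params * bytes_per_param)

# Decode: the same pass over the weights yields a single new token.
decode_intensity = flops_per_token / (params * bytes_per_param)

print(f"prefill : {prefill_intensity:,.0f} FLOPs per byte of weights read")
print(f"decode  : {decode_intensity:,.0f} FLOPs per byte of weights read")
# Accelerators offer a few hundred FLOPs per byte of HBM bandwidth, so the same
# chip is compute-bound in prefill and badly memory-bound in decode.
</code></pre><p>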
SPAD&#8209;style &#8220;Reader/Writer&#8221; specialization is just the logical endpoint of that trend.</p><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rQ-G!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc44c3988-f806-4470-80e5-cce5eed70d12_2816x1536.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rQ-G!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc44c3988-f806-4470-80e5-cce5eed70d12_2816x1536.heic 424w, https://substackcdn.com/image/fetch/$s_!rQ-G!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc44c3988-f806-4470-80e5-cce5eed70d12_2816x1536.heic 848w, https://substackcdn.com/image/fetch/$s_!rQ-G!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc44c3988-f806-4470-80e5-cce5eed70d12_2816x1536.heic 1272w, https://substackcdn.com/image/fetch/$s_!rQ-G!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc44c3988-f806-4470-80e5-cce5eed70d12_2816x1536.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rQ-G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc44c3988-f806-4470-80e5-cce5eed70d12_2816x1536.heic" width="1456" height="794" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c44c3988-f806-4470-80e5-cce5eed70d12_2816x1536.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:794,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:594481,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://research.fpx.world/i/179678133?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc44c3988-f806-4470-80e5-cce5eed70d12_2816x1536.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rQ-G!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc44c3988-f806-4470-80e5-cce5eed70d12_2816x1536.heic 424w, https://substackcdn.com/image/fetch/$s_!rQ-G!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc44c3988-f806-4470-80e5-cce5eed70d12_2816x1536.heic 848w, https://substackcdn.com/image/fetch/$s_!rQ-G!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc44c3988-f806-4470-80e5-cce5eed70d12_2816x1536.heic 1272w, https://substackcdn.com/image/fetch/$s_!rQ-G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc44c3988-f806-4470-80e5-cce5eed70d12_2816x1536.heic 1456w" sizes="100vw" loading="lazy"></picture><div 
class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>Time isn&#8217;t just in the design tools; it&#8217;s also in the logistics. The speed of light is not your enemy here; the speed of FedEx is. That&#8217;s why Google&#8217;s new hardware hub in Taipei matters: it collapses the loop between TSMC, packaging, and Google&#8217;s own engineers. The extreme version of this is a &#8220;zero&#8209;mile fab&#8221;: a Google&#8209;only test lab bolted onto the fab and packaging line, where wafers can be probed by Google&#8217;s own validation rigs hours after they come out of the ovens, not weeks later when they&#8217;ve cleared customs and been shipped across the Pacific. In a war of exponential curves, shrinking the iteration loop from three weeks to three hours is its own kind of physics.</p><h4><em>The Physics of Calculation: prefill vs decode, and Ironwood as SPAD v0</em></h4><p>Next constraint: the cost of math.</p><p>LLMs have two very different phases from a physics viewpoint:</p><ul><li><p><strong>Prefill</strong> &#8211; ingest the prompt. Big, square matrix multiplies; high arithmetic intensity; compute&#8209;bound.</p></li><li><p><strong>Decode</strong> &#8211; generate tokens. KV cache lookups in HBM; skinny matmuls per step; memory&#8209;bound.</p></li></ul><p>Classic GPUs and early accelerators are ambidextrous by design: one chip is supposed to handle training, prefill, decode, recsys, you name it. That means in prefill you&#8217;re starved on FLOPs, and in decode those FLOPs mostly sit idle waiting for memory.</p><p>Google already did the optimization Step&#8209;1 once here. TPUs deleted a lot of general&#8209;purpose junk &#8211; huge caches, branch predictors, wide scalar ALUs &#8211; and rebuilt around <strong>systolic arrays and scratchpads</strong>. They added <strong>SparseCore</strong> as a tiny on&#8209;die dataflow engine for embeddings and routing, so the main arrays don&#8217;t have to waste cycles on those patterns. From a physics perspective that&#8217;s: delete speculative hardware, push the intelligence into the compiler (XLA), and only put transistors where math actually happens.</p><p>SPAD is the next delete: stop pretending prefill and decode belong on the same die. 
You want a <strong>Reader</strong> that is basically wall&#8209;to&#8209;wall matrix units with just enough memory to keep them full, and a <strong>Writer</strong> that is mainly HBM capacity and bandwidth with little compute sprinkled near each stack.</p><p>Ironwood already leans heavily into the &#8220;Reader&#8221; role &#8211; an inference&#8209;first TPU with beefed&#8209;up systolic arrays and perf/W tuned for serving. The architectural ideal we&#8217;re talking about is just making that split explicit at the package level. With chiplets, you don&#8217;t need separate product lines; you vary the ratio of Math Tiles to HBM tiles per pod. One configuration looks like a prefill cannon; another looks like a KV cache farm.</p><p>And then there&#8217;s routing. Not every token needs to see every layer. DeepMind&#8217;s mixture&#8209;of&#8209;experts and mixture&#8209;of&#8209;depths work is exactly about that: easy tokens exit early; hard ones go deep. SparseCore is the right place to embody that physically. Instead of quietly sitting in the corner accelerating embeddings, it becomes a <strong>brainstem</strong>: a small organ that decides which tokens go where and which never touch the big MXUs at all. Every token that early&#8209;exits is a pile of FLOPs and joules you never spend.</p><h4><em>The Physics of Bandwidth and Distance: from packet cops to a virtual wafer</em></h4><p>Bandwidth isn&#8217;t really about &#8220;how many terabits&#8221; your spec sheet says. It&#8217;s about how far bits travel, what medium they travel through, and how much thinking the network has to do about them.</p><p>Electrons in copper are slow, hot, and lossy. Every long trace and every SerDes hop costs you energy per bit and nanoseconds of latency. Packet switches exist to make per&#8209;packet decisions because the network was designed assuming traffic is random. That&#8217;s why a big Broadcom switch chip happily burns half a kilowatt just parsing headers and juggling queues. It&#8217;s all reactive.</p><p>AI workloads aren&#8217;t random. Training collectives are literally graphs; we know the all&#8209;reduce steps before we launch the job. Even inference isn&#8217;t truly chaotic once <strong>continuous batching</strong> gets involved. Systems like vLLM take a swarm of incoming user prompts, buffer them for a few milliseconds, and pack them into dense, regular batches. From the network&#8217;s point of view, it suddenly looks a lot like training: large, predictable bursts of tensors.</p><p>This is where Google&#8217;s optics are quietly radical. <strong>Apollo</strong> replaces a whole spine layer of packet switches with <strong>optical circuit switches</strong> &#8211; MEMS mirrors and glass. The mirrors don&#8217;t make decisions; they just sit at whatever angle the control plane told them to assume. Combine that with <strong>Huygens&#8209;class time&#8209;synchronization</strong> (NICs all marching in nanosecond lockstep) and AI&#8209;first NICs, and you get a fabric that can be scheduled rather than policed.</p><p>In that world, you don&#8217;t route packets; you timetable bursts. The compiler/runtime knows when gradients are going to fly or when a batch of prompts is going to be scattered; it can instruct the OCS to re&#8209;wire the graph a few milliseconds in advance. During compute phases, the links can go mostly dark or be repurposed for checkpointing. 
During comm phases, all links are hot, and no one is stuck in a buffer because collisions simply aren&#8217;t allowed by construction.</p><p>This is where the economic bridge starts to show through. The Idiot Index of a standard switch is high because it creates heat to make decisions. Apollo creates almost no heat and makes no decisions. It just follows orders. By moving the &#8220;intelligence&#8221; (routing) to the compiler and the &#8220;labor&#8221; (switching) to mirrors, Google has effectively driven the marginal cost of moving a byte toward zero. The optical core and the fiber don&#8217;t care if endpoints are speaking 100 G, 400 G, or 1.6 T &#8211; they will happily reflect whatever hits them. You stop ripping out the nervous system every time a new chip doubles its SerDes rate; you just plug faster lasers into the same glass.</p><p>From a physics standpoint, that&#8217;s the closest you can get to a <strong>virtual wafer</strong>: tens of thousands of chips and a couple of petabytes of HBM talking to each other as if they were a single coherent device, because most of the &#8220;distance&#8221; is traveled at the speed of light in a passive medium.</p><h4><em>The Physics of Intelligence: synthetic data arbitrage and stateful silicon</em></h4><p>The last constraint isn&#8217;t in the metal; it&#8217;s in the bits we push through it.</p><p>HBM is already the tightest resource in the system: capacity, bandwidth, and energy per access all bite. One lever is precision. Hardware has raced ahead to support 8&#8209;bit and even 4&#8209;bit floating&#8209;point formats for both training and inference. The catch is that the internet is a mess. Training at 4 bits on raw, noisy web text is like doing surgery with oven mitts on: technically possible, but you wouldn&#8217;t trust the result.</p><p>The sensible first&#8209;principles play is what you could call <strong>synthetic data arbitrage</strong>. Instead of trying to make ever more heroic quantization schemes, you change the data so low precision is actually safe. Gemini&#8209;class models are already good enough to rewrite ugly, inconsistent web pages into structured, textbook&#8209;like knowledge. If you use them to clean, summarize, and normalize your pretraining corpus, you can manufacture a dataset with:</p><ul><li><p>fewer pathological outliers</p></li><li><p>more consistent distributions</p></li><li><p>less adversarial garbage</p></li></ul><p>That&#8217;s a corpus where 4&#8209;bit statistics make more sense. If you then design your training curricula and architectures around that reality, you can push more of the model into FP4/INT4 without collapse. Every bit you delete halves your memory needs for that part of the network. That&#8217;s not free &#8211; it costs cycles upfront to synthesize the data &#8211; but it&#8217;s a capital trade you make once for a fleet&#8209;wide gain.</p><p><strong>The 4&#8209;Bit Direction: Using Fewer Bits, More Often</strong><br>Today, most frontier models still <em>train</em> in BF16 or FP8 and only use 4&#8209;bit formats (FP4/NF4) for parts of the stack or for inference. Pushing everything to 4 bits overnight would blow up optimization. But the direction of travel is clear: every tensor that can safely drop from 16&#8209;bit &#8594; 8&#8209;bit &#8594; 4&#8209;bit frees scarce HBM capacity and bandwidth. 
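</p><p>A minimal sketch of what each step down the ladder buys, assuming a generic model and serving setup (all sizes below are illustrative, not Google's):</p><pre><code># What the precision ladder buys in HBM terms. All sizes are illustrative assumptions.

params = 400e9                      # 400B-parameter model
layers, kv_heads, head_dim = 96, 16, 128
context, sequences = 128_000, 32    # long-context serving

def footprint_gib(weight_bits, kv_bits):
    weights = params * weight_bits / 8
    # KV cache: K and V per layer, per token, per KV head.
    kv = 2 * layers * kv_heads * head_dim * context * sequences * kv_bits / 8
    return weights / 2**30, kv / 2**30

for wb, kb in [(16, 16), (8, 8), (4, 8), (4, 4)]:
    w, k = footprint_gib(wb, kb)
    print(f"W{wb}/KV{kb}: weights {w:8,.0f} GiB   kv cache {k:8,.0f} GiB")
</code></pre><p>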
NVIDIA&#8217;s Blackwell and the next TPU generations are being built with native FP4 support for exactly this reason: they expect a growing fraction of weights, KV caches, and optimizer states to live at 4 bits, at least during inference and later stages of training. Google doesn&#8217;t need &#8220;full FP4 training&#8221; on day one to win &#8212; it needs a roadmap that steadily expands the share of the model that can tolerate 4&#8209;bit without collapsing.</p><p><strong>Synthetic Data as a Low&#8209;Bit Enabler</strong><br>Raw web text is numerically ugly: outliers, adversarial junk, wild distribution shifts. That&#8217;s exactly what makes extreme quantization brittle. The real value of Gemini&#8209;class synthetic data is not just &#8220;more tokens,&#8221; it&#8217;s <em>better&#8209;conditioned</em> tokens. If Google uses its strongest models to rewrite the internet into textbook&#8209;like corpora &#8212; consistent style, fewer outliers, clearer supervision signals &#8212; it can safely push more of the training and inference pipeline into FP8 and FP4. Clean data doesn&#8217;t magically make 4&#8209;bit trivial, but it widens the stability margin for quantization&#8209;aware training and mixed&#8209;precision regimes. In practice, that means every year a larger slice of the model can drop to 4&#8209;bit, turning the same fixed HBM budget into more usable parameters and longer contexts.</p><p>There&#8217;s a second &#8220;intelligence&#8221; constraint emerging that&#8217;s newer: <strong>state</strong>. Google&#8217;s launch of agent&#8209;first tooling like Antigravity changes the workload from &#8220;stateless two&#8209;second chats&#8221; to &#8220;stateful four&#8209;hour work sessions.&#8221; Current TPUs are basically amnesiacs: they serve a request, flush most of the interesting state from on&#8209;chip memory, and load fresh context from HBM next time. That&#8217;s fine for single prompts; it&#8217;s brutal for long&#8209;lived agents that need a large, evolving working set.</p><p>The physics fix there is different: you need <strong>stateful silicon</strong>. Think of a &#8220;Cortex Tile&#8221;: a chiplet that is mostly SRAM &#8211; static RAM on&#8209;die &#8211; rather than HBM. SRAM is expensive in area but ~100&#215; faster and lower&#8209;energy per access than DRAM. You don&#8217;t deploy it everywhere; you buy a specific rack of Cortex TPUs that exist primarily to hold agent state in SRAM for hours at a time, while more generic compute tiles rotate through to do the heavy lifting. Instead of constantly re&#8209;hydrating context from cold storage and HBM, your agents live in a warm, electrically&#8209;near memory pool.</p><p>From a physics perspective, that&#8217;s just another specialization: HBM tiles for bulk parameters, SRAM tiles for hot, agentic working sets. From an economic perspective, you&#8217;re reserving your most exotic silicon for the most valuable, long&#8209;running workloads rather than wasting fleet&#8209;wide capacity on state that 95% of users don&#8217;t need.</p><h3>Rebuilding to Economics: lowering the Idiot Index</h3><p>Once you&#8217;ve stripped everything down to physics, you can start putting the dollars and watts back in and ask how dumb the current setup really is. That&#8217;s where the &#8220;Idiot Index&#8221; is useful: it&#8217;s the ratio between what you pay per token today and what you would pay if you were perfectly aligned with energy and wafer costs.</p><p>Chiplets and Lego Pods are a direct attack on that index. 
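</p><p>Before walking through how, here is a crude version of the index itself; every input is an illustrative assumption, and this sketch prices only the energy floor, not silicon or memory:</p><pre><code># A crude "Idiot Index" for inference: market price per token versus the energy floor.
# All inputs are illustrative assumptions; real accounting adds silicon, HBM, and networking.

joules_per_token = 0.5              # end-to-end energy per generated token (assumed)
usd_per_kwh = 0.06                  # industrial electricity price (assumed)
market_usd_per_mtok = 3.00          # market price per million output tokens (assumed)

kwh_per_mtok = joules_per_token * 1e6 / 3.6e6     # 3.6 MJ in a kWh
energy_floor = kwh_per_mtok * usd_per_kwh

print(f"energy floor : ${energy_floor:.4f} per million tokens")
print(f"market price : ${market_usd_per_mtok:.2f} per million tokens")
print(f"idiot index  : {market_usd_per_mtok / energy_floor:,.0f}x")
# The gap is not all margin: depreciation, memory, networking, and idle time live
# inside it. The point of the exercise is to see how much room there is to squeeze.
</code></pre><p>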
By splitting TPUs into long&#8209;lived plumbing and fast&#8209;moving math, Google reduces both capital risk and wasted silicon. You&#8217;re no longer betting billions on a single mega&#8209;die that may or may not age well. You&#8217;re betting on an I/O + HBM base that you&#8217;ll reuse across multiple math generations, and much smaller Math Tiles you can afford to be aggressive with. When models change, you don&#8217;t throw away your entire design; you respin the piece that enforces the new math.</p><blockquote><p><strong>Free the fast silicon.</strong> FPX backfills everything that isn&#8217;t MXU&#8209;hot: <strong>Shelf Bricks</strong> (CXL DRAM pools) and <strong>Checkpoint/Context Pods</strong> (recertified SSD/HDD). Keep HBM for prefill and math&#8212;push KV, logs, and checkpoints into cheaper tiers.</p></blockquote><p>SPAD&#8209;style specialization does the same thing in &#8220;compute space.&#8221; An ambidextrous chip spends a lot of its life as dead weight: training logic sitting idle during inference, or big MXUs sulking during decode. A Reader/Writer split implemented via tile ratios means each pod spends more of its energy doing the kind of work it&#8217;s physically good at. Even if the wafer cost never budges, the <strong>tokens per watt and tokens per dollar</strong> go up because you&#8217;ve stopped paying for transistors on vacation.</p><p>Axion and Titanium are the economic reflection of what TPUs did architecturally. Instead of paying Intel or AMD for huge, general&#8209;purpose CPUs that spend their lives shuttling buffers and handling IRQs, Google runs its own Arm host and offloads networking and storage into dedicated controllers. The host becomes a thin control plane, not the star of the show. That&#8217;s vendor margin erased from the BOM and host power reclaimed for accelerators. The physics is simple &#8211; don&#8217;t burn energy on work the TPU or NIC can do more efficiently &#8211; and so are the economics.</p><p>The networking story is where the Idiot Index really collapses. Traditional AI clusters are on a treadmill: every bump in line&#8209;rate forces a new generation of copper and switch ASICs, plus the labor to rip and replace them. You are, in effect, paying constantly to move heat around inside boxes that make decisions. Apollo flips that: the core is glass and mirrors that never learn, never think, never age out of a spec. You move &#8220;intelligence&#8221; up into XLA, Pathways and the batcher, and let mirrors be stupid. The result is that the <strong>marginal cost of moving another byte across the fabric is dominated by endpoint energy, not by the spine.</strong> The network becomes like a 400 V bus bar or a concrete slab: a long&#8209;lived asset you amortize over many chip generations.</p><p>Synthetic data arbitrage and low&#8209;bit formats attack the two most expensive invisible line items in an AI box: HBM and joules per DRAM access. If a year of work on data cleaning and training recipes lets you reliably run large swaths of your models at 4 bits instead of 16, you effectively double your usable capacity and cut memory opex per token in half. 
The work you did once in the data pipeline compounds across every pod you deploy.</p><p>The honest way to think about FP4 is not &#8216;we will train everything in 4&#8209;bit next year,&#8217; but &#8216;every year, another slice of the model safely moves to 4&#8209;bit.&#8217; The winner is the lab that moves the largest share of its workload down the precision ladder without losing quality.</p><p>Stateful silicon for agents looks expensive on paper &#8211; SRAM&#8209;heavy tiles are not cheap &#8211; but it is the right kind of expense: tightly targeted and physics&#8209;aligned. If you know that a small fraction of sessions (say, a trading agent, a code&#8209;review copilot, an operations &#8220;AI SRE&#8221;) dominate value and have long, rich state, it is economically saner to buy a few racks of Cortex&#8209;style tiles to house those brains than to force the entire fleet to reload their world from cold storage on every call.</p><p>Zoom out, and a pattern emerges. The moves that look &#8220;hardware&#8209;nerdy&#8221; in isolation &#8211; chiplets and active bridges, SPAD&#8209;like pods, SparseCore routing, Apollo optics, Huygens&#8209;style time&#8209;sync schedules , Axion/Titanium hosts, synthetic low&#8209;bit data, SRAM&#8209;heavy cortex racks &#8211; are all the same move repeated: align the machine with the underlying physics so ruthlessly that anything wasteful stands out as an accounting error. Once you do that, the economics start to bend. The Idiot Index comes down, and Amin&#8217;s &#8220;1,000&#215; in roughly five years at similar cost and power&#8221; stops sounding like bravado and starts looking like a reasonable target for a company willing to delete everything that doesn&#8217;t serve the electrons, the photons, and the heat.</p><p></p><h3><strong>What This Means for Data Center Operators</strong></h3><h4><em>How SPAD, virtual wafers, and task-specific silicon reshape facilities</em></h4><p>As Google leans into physics-aligned architectures&#8212;splitting prefill and decode workloads, designing SPAD-style Reader/Writer pods, and treating thousands of TPUs as a single &#8220;virtual wafer&#8221;&#8212;the data center stops being a generic compute warehouse and becomes a set of <strong>specialized organs</strong>. Training and prefill jobs want remote, high-density campuses built for liquid cooling and massive optical backbones. Decode and real-time inference want smaller metro sites sitting close to IXPs, with more HBM and network bandwidth than raw compute. Long-lived agent workloads introduce a third category: <strong>stateful halls</strong> with SRAM-heavy &#8220;cortex&#8221; racks that hold multi-hour context in fast memory while compute tiles rotate around them. Operators who recognize this shift early can pre-position themselves by designing <strong>three distinct facility types</strong>&#8212;compute factories, metro inference hubs, and state hubs&#8212;each optimized for the physics of its workload and compatible with chiplet-style TPUs, optical fabrics, and high rack densities. Doing so makes the operator a drop-in extension of Google&#8217;s hypercomputer rather than a retrofit compromise.</p><pre><code><strong>FPX for Operators.</strong> Bring <strong>liquid&#8209;ready, high&#8209;density suites</strong> and <strong>fiber&#8209;rich halls</strong>; we place <strong>secondary hardware</strong> and <strong>memory shelves</strong> to create SPAD&#8209;aligned Reader/Writer/Cortex zones. 
New revenue from existing space.</code></pre><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/81c892c5-f386-46e8-a95e-dbed256e85e0_1024x1024.heic" alt=""></figure></div>
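<p>One way to make the &#8220;three facility types&#8221; split concrete is to treat it as a checklist an operator can score existing sites against. The sketch below is a hypothetical taxonomy; the field names and numbers are ours, not a Google or FPX specification.</p><pre><code># Hypothetical checklist: the three facility types described above, expressed
# as rough design targets an operator could score an existing site against.
# Field names and numbers are illustrative, not a Google or FPX specification.

FACILITY_TYPES = {
    "compute_factory": {          # training / prefill campuses
        "rack_density_kw": 80,
        "cooling": "direct-to-chip liquid",
        "siting": "remote, next to cheap firm power",
        "network": "massive optical backbone between pods",
    },
    "metro_inference_hub": {      # decode / real-time serving
        "rack_density_kw": 40,
        "cooling": "liquid-ready",
        "siting": "metro, close to IXPs and users",
        "network": "rich peering, low tail latency",
    },
    "state_hub": {                # long-lived agent sessions
        "rack_density_kw": 50,
        "cooling": "liquid-ready",
        "siting": "ultra-reliable, secure campus",
        "network": "memory-fabric rows for SRAM/CXL cortex racks",
    },
}

for kind, spec in FACILITY_TYPES.items():
    print(f"{kind}: {spec['rack_density_kw']} kW racks, {spec['cooling']}, {spec['siting']}")</code></pre>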
tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h4><em><strong>How Operators Gain an Edge (and How to Prepare Now)</strong></em></h4><p>Operators who prepare for this split win by becoming <strong>TPU-ready before TPU demand arrives</strong>. That means:</p><ul><li><p>Adopting <strong>liquid cooling</strong>, <strong>400 V busways</strong>, and <strong>fiber-dense hall design</strong> as defaults&#8212;not upgrades.</p></li><li><p>Building white-space that can host <strong>Reader/Writer pod ratios</strong> (compute-heavy vs. HBM-heavy racks).</p></li><li><p>Designing campuses around <strong>long-lived optics and power infrastructure</strong>, not around switch ASIC refresh cycles.</p></li><li><p>Marketing real estate not as &#8220;capacity,&#8221; but as <strong>specialized tiles</strong> Google or any hyperscaler can snap into their virtual wafer.</p></li></ul><p>Being early here turns the operator into an <em>AI-grade infrastructure partner</em>, not just a landlord&#8212;making them relevant for multi-cycle deployments instead of single-generation GPU scrambles.</p><pre><code><strong>Fabric&#8209;ready shells sell.</strong> List <strong>metro inference suites</strong>, <strong>memory/shelf rows</strong>, and <strong>training&#8209;grade halls</strong> on FPX. We match your envelope (space + fiber + MW) to live AI demand.</code></pre><h3><strong>What This Means for Investors</strong></h3><h4><em>How to price powered land, colo shells, and optical-ready campuses in a SPAD world</em></h4><p>When prefill, decode, and agent workloads split into different physical requirements, not all megawatts or square feet are equal anymore. The winning assets will be those aligned with the physics: large powered land near cheap generation for prefill/training pods; metro-edge shells with excellent peering for decode; and ultra-reliable, ultra-secure campuses that can house stateful SRAM-heavy &#8220;cortex&#8221; racks. Investors should understand that <strong>the value migrates from servers to the envelope</strong>&#8212;the power, fiber, cooling, and zoning that support a decade of TPU evolution. Facilities built around optical fabrics (like Google&#8217;s Apollo-style architectures) become long-lived utilities, while racks and GPU generations turn over rapidly. 
This creates asymmetric upside for owners of &#8220;future-proof&#8221; land and infrastructure.</p><h3><strong>How Investors Gain an Edge (and What to Do Now)</strong></h3><ul><li><p><strong>Prioritize powered land</strong> near hydro, nuclear, or major substations&#8212;perfect for prefill/training factories.</p></li><li><p><strong>Accumulate metro-edge colos</strong> with rich peering for decode and agent workloads (these become the new latency-critical frontier).</p></li><li><p><strong>Back operators modernizing to AI-grade spec:</strong> liquid cooling, 50&#8211;80 kW racks, optical-first design, 400 V distribution.</p></li><li><p><strong>Favor long-lived infrastructure plays</strong> (glass, power, entitlements) over single-generation hardware exposure.</p></li></ul><p>The thesis is simple: Google&#8217;s architectural shift creates a <strong>structural demand frontier</strong> for AI-ready land, optics-ready shells, and specialized facilities. Investors who position ahead of this curve aren&#8217;t just riding the GPU boom&#8212;they&#8217;re buying the foundational real estate of the next compute era.</p><pre><code>Own stranded assets? FPX packages them into AI&#8209;grade supply. FPX converts <strong>retired accelerators</strong>, <strong>media</strong>, <strong>brownfield MWs</strong>, and <strong>shells</strong> into standardized SKUs tenants actually buy. If you control the atoms, we&#8217;ll clear the path to AI demand.</code></pre><h2><br>2) <strong>Power &amp; Thermodynamics: When Energy Becomes the Hard Limit</strong></h2><p>Power is the bottleneck that doesn&#8217;t care how clever your model is. At today&#8217;s ~30,900&#8239;TWh of annual electricity use, the world runs at an average of about 3.5&#8239;TW. If the industry truly built out ~300&#8239;GW of new AI datacenter load every year, we&#8217;d blow past <em>all</em> current global generation in just over a decade. That&#8217;s before EVs, heat pumps, or industrial electrification even show up. You don&#8217;t solve that by &#8220;plugging in more TPUs.&#8221; The physics is brutal: you take high&#8209;grade electrical energy, turn it into low&#8209;grade heat in the chip, then burn more electricity on chillers and pumps to throw that heat into the air. In thermodynamics, the portion of energy that can actually do useful work is called <strong>exergy</strong>; current AI infrastructure wastes a huge amount of it. To get 1,000&#215; more compute without 1,000&#215; more emissions and blackouts, you have to stop treating power and cooling as line items, and start treating them as a <em>coupled thermodynamic system</em> you can engineer.</p><h3>Phase 1 &#8211; Deconstruct to Physics: Energy Density &amp; Heat Rejection</h3><p>At the physical level, every watt that goes into a TPU comes out as heat. A typical hyperscale data center still follows the same pattern: the grid feeds a substation; the substation feeds power distribution units (PDUs); PDUs feed racks; racks dump heat into air or water; then a chiller plant spends another ~20&#8211;40% of the IT load&#8217;s power turning that hot water back into cold water. The industry uses a metric called <strong>PUE</strong> (Power Usage Effectiveness: total facility power divided by IT power). A &#8220;good&#8221; PUE today is ~1.2. From a physics lens, that&#8217;s still an Idiot Index: you built a second power plant whose only job is to undo what the first one did.</p><p>That waste shows up inside the chips too. 
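</p><p>Before going inside the chip, it helps to put this section&#8217;s facility-level numbers into one back-of-envelope calculation; the world totals, build-out rate, and PUE are the round figures quoted above, and the 100 MW campus is an illustrative assumption.</p><pre><code># Back-of-envelope using the round numbers quoted above.

world_twh_per_year = 30_900    # annual global electricity use
hours_per_year = 8_760
world_avg_tw = world_twh_per_year / hours_per_year / 1_000
print(f"World average electrical load: ~{world_avg_tw:.1f} TW")

ai_buildout_gw_per_year = 300  # the hypothetical AI build-out rate
years_to_match = world_avg_tw * 1_000 / ai_buildout_gw_per_year
print(f"At {ai_buildout_gw_per_year} GW/yr, AI alone matches all of today's "
      f"generation in ~{years_to_match:.0f} years")

# PUE = total facility power / IT power. Even a "good" 1.2 means a second,
# smaller power plant per campus whose only job is rejecting heat.
it_load_mw = 100               # illustrative campus
pue = 1.2
overhead_mw = it_load_mw * (pue - 1)
print(f"PUE {pue}: {overhead_mw:.0f} MW of cooling/distribution overhead "
      f"per {it_load_mw} MW of IT load")</code></pre><p>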
Because of <strong>process variation</strong> (random manufacturing differences between dies), some TPUs are &#8220;golden&#8221; (low&#8209;leakage, high&#8209;yield), others are dogs. The safe thing is to set voltage and frequency for the worst chip in the fleet&#8212;say 0.8&#8211;0.85&#8239;V&#8212;and run every part there. Most of your silicon is over&#8209;volted relative to its actual physical limit, burning extra dynamic power just so the worst 1% doesn&#8217;t glitch. You&#8217;re paying for randomness in the fab as if it were a law of nature.</p><p>The network stack leaks energy in the same way. Even when no useful packets are flowing, lasers, <strong>SerDes</strong> (serializer/deserializers that encode bits onto high&#8209;speed links), and DSPs sit there sipping watts to stay synchronized and ready. Yet AI traffic is not random. Training jobs &#8220;breathe&#8221;&#8212;hundreds of milliseconds of pure compute, then brief all&#8209;reduce bursts to exchange gradients. Inference stops looking like Brownian motion once you add <strong>continuous batching</strong>: the serving stack buffers user queries for a few milliseconds and fires them as dense, predictable bursts instead of tiny dribbles. The fabric sees trains, not cars.</p><p>On top of that, the grid itself is slow and lumpy. Building a new 500&#8239;MW substation and its transmission lines is a 5&#8211;10&#8209;year permitting fight in most jurisdictions. Nuclear small modular reactors (SMRs) like Kairos&#8217;s Hermes&#8209;2 will bring Google tens of megawatts by around 2030 and up to ~500&#8239;MW by 2035&#8212;but that&#8217;s a 2030s answer, not a 2027 patch. Even geothermal, where Google and Fervo&#8217;s &#8220;Project Red&#8221; is already delivering 24/7 carbon&#8209;free power in Nevada, scales in hundreds of megawatts, not gigawatts overnight. Put differently: the <em>natural</em> timescale of the grid is years; the AI build&#8209;out is moving in quarters. That mismatch is the true physical constraint.</p><pre><code><strong>FPX Power Envelopes: 50&#8211;500&#8239;MW you can actually use.</strong>
We package <strong>powered land + interconnects + permits</strong>&#8212;often with <strong>temporary generation</strong> (refurb aero&#8209;turbines), <strong>shared industrial cooling</strong>, or <strong>geothermal tie&#8209;ins</strong>&#8212;so your AI campus lands on a realistic timeline. We also broker <strong>interconnection queue positions</strong> and <strong>substation piggybacks</strong> where industrial feeders are under&#8209;utilized. Pair that with <strong>training&#8209;as&#8209;flexible load</strong> and you get lower blended power costs <em>and</em> faster approvals.</code></pre><h3>Phase 2 &#8211; Rebuild to Economics: From Consumption to Coupling</h3><p>Once you admit the grid&#8217;s timescales and thermodynamics, the first principle move is clear: stop &#8220;consuming&#8221; power in whatever form the grid hands you; <strong>couple</strong> your compute directly to the sources, pipes, and waste streams where physics is already on your side. The job is to turn power from a bill into a design variable.</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/3e892ca3-aac1-4264-b541-a4574c0afa9a_1920x1080.heic" alt=""></figure></div>
srcset="https://substackcdn.com/image/fetch/$s_!d8NC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e892ca3-aac1-4264-b541-a4574c0afa9a_1920x1080.heic 424w, https://substackcdn.com/image/fetch/$s_!d8NC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e892ca3-aac1-4264-b541-a4574c0afa9a_1920x1080.heic 848w, https://substackcdn.com/image/fetch/$s_!d8NC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e892ca3-aac1-4264-b541-a4574c0afa9a_1920x1080.heic 1272w, https://substackcdn.com/image/fetch/$s_!d8NC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e892ca3-aac1-4264-b541-a4574c0afa9a_1920x1080.heic 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h4>2.1 Co&#8209;locate with the atoms: nuclear, geothermal, and thermal reuse</h4><p>Google&#8217;s 500&#8239;MW deal with Kairos Power is usually framed as &#8220;buying green electrons,&#8221; but the deeper move is <strong>thermal integration</strong>. SMRs are essentially steady high&#8209;temperature heat engines: they already have massive cooling loops and &#8220;ultimate heat sinks&#8221; designed to dump gigawatts of heat into rivers or towers. If you stick a data center right next to that, you don&#8217;t need a fully separate chiller plant. You can drive <strong>absorption chillers</strong>&#8212;devices that use heat, not electricity, to make cold water&#8212;off the reactor&#8217;s waste heat, and share cooling towers and water infrastructure instead of duplicating them.</p><p>Push the coupling one step further and you get what you might call the <strong>liquid&#8209;to&#8209;liquid loop</strong>. Nuclear steam cycles spend a lot of energy pre&#8209;heating feedwater from ambient to near boiling before it enters the reactor. TPU coolant loops run at 50&#8211;70&#8239;&#176;C. 
Instead of cooling that back to ambient and throwing it away, you can use it as a pre&#8209;heater for the plant&#8217;s feedwater. The AI farm becomes a cogeneration unit: its &#8220;waste&#8221; heat raises the temperature of the water that will be boiled by the reactor anyway. You haven&#8217;t literally violated conservation of energy&#8212;there&#8217;s still an ultimate heat sink somewhere&#8212;but you&#8217;ve effectively <em>shrunk</em> the stand&#8209;alone cooling plant on the data&#8209;center side and improved the power plant&#8217;s thermal efficiency on the other. One set of pipes does double duty.</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/2d64dc81-354e-4085-a2f0-f4ecb845e009_1024x1024.heic" alt=""></figure></div>
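<p>A rough, assumption-laden sketch of what the liquid-to-liquid loop buys; the IT load and temperatures below are illustrative, not a plant design.</p><pre><code># Rough thermodynamic sketch (assumed numbers, not a plant design): how much
# feedwater a liquid-cooled AI campus could pre-heat with its own waste heat.

CP_WATER = 4186.0        # J/(kg*K), specific heat of water

it_load_w = 200e6        # 200 MW of IT load; essentially all of it leaves as heat
feedwater_in_c = 20.0    # ambient intake temperature
feedwater_out_c = 55.0   # pre-heated to just below the 60 C coolant return

delta_t = feedwater_out_c - feedwater_in_c
feedwater_kg_per_s = it_load_w / (CP_WATER * delta_t)

print(f"~{feedwater_kg_per_s:,.0f} kg/s of feedwater lifted from "
      f"{feedwater_in_c:.0f} C to {feedwater_out_c:.0f} C")
# Roughly 1,400 kg/s: heat the plant no longer has to supply itself, and heat
# the data center no longer pays chillers to reject.</code></pre>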
<p></p><p>Because SMRs take the rest of the decade to arrive, the 2020s bridge is <strong>geothermal</strong>. Google&#8217;s early work with Fervo in Nevada and follow&#8209;on clean&#8209;transition tariffs show how <strong>enhanced geothermal systems</strong> (deep drilled wells that use oil&#8209;and&#8209;gas&#8209;style techniques to reach hot rock) provide 24/7 carbon&#8209;free power today. From an engineering perspective, an EGS plant looks a lot like a data center already: closed&#8209;loop pipes, big pumps, and heat exchangers. Putting AI pods on the same site lets you share those loops, run higher rack densities (because you have process&#8209;grade cooling anyway), and guarantee firm power without waiting for a new reactor license.</p><p>The same logic extends to other industrial sites. LNG terminals and some chemical plants have <strong>excess cold</strong> (they&#8217;re literally boiling cold liquids back into gas); refineries, steel mills, and paper plants have <strong>excess low&#8209;grade heat</strong>.
Data centers can be the thermal sponge in either direction:</p><ul><li><p>Next to &#8220;cold&#8221; plants, direct&#8209;to&#8209;chip loops can be cooled via liquid&#8209;to&#8209;liquid exchangers, shrinking or deleting most of the data center&#8217;s own chiller plant.</p></li><li><p>Next to &#8220;hot&#8221; plants or SMRs, AI exhaust heat can be sold as process heat or feedwater pre&#8209;heat, lowering both sides&#8217; effective energy cost.</p></li></ul><p>In all cases you&#8217;re using the same joule twice&#8212;once for computation, once for heat work&#8212;instead of paying for two separate energy systems.</p><h4>2.2 Bootstrap power fast: boneyard turbines, zombie peakers, and flared gas</h4><p>Nuclear and geothermal are great, but they are slow. Google&#8217;s 1,000&#215; target is not going to wait for every SMR to clear the NRC. So the next class of moves is ugly but fast: <strong>reuse metal that already exists</strong>.</p><p>One obvious pool is retired aircraft engines. Widebody fleets are being scrapped faster than their turbofans wear out. Gas&#8209;turbine specialists already convert engines like GE&#8217;s CF6 into stationary 40&#8211;50&#8239;MW peaking units; the physics is done, the engines are sitting in boneyards, and the lead times are a fraction of a new utility&#8209;scale turbine. The gating factor becomes local air permits and politics, not global turbine backlogs. The &#8220;build fast; iterate later&#8221; move here is: <em>don&#8217;t sit in the four&#8209;year queue for brand&#8209;new gas turbines; buy the boneyard, refurbish, and drop 50&#8239;MW &#8220;jet&#8209;gens&#8221; behind the meter at AI campuses.</em> FPX&#8217;s role is obvious&#8212;source and procure those used engines, match them to brownfield sites, and coordinate OEM&#8209;backed refurb programs so the reliability looks like aviation, not DIY.</p><p>The same idea applies at plant scale. There are &#8220;zombie&#8221; peaker plants all over the world: gas and even coal facilities that still have 250&#8211;500&#8239;MW substations, cooling water rights, and industrial zoning, but can&#8217;t make money in normal capacity markets. Their wires and permits are worth more than their boilers. Rolling them up into an &#8220;AI power fund&#8221; lets you buy those interconnects and switchyards at distressed prices, then <strong>repower</strong> the generation (with gas&#8209;only, high&#8209;efficiency turbines, used aero engines, or eventually SMRs) while you drop modular TPU pods on the existing pads. You&#8217;re not building new substations; you&#8217;re compressing more compute into substations that somebody else already paid for.</p><p>Then there&#8217;s the dirty bridge: <strong>flared gas</strong>. In North Dakota and West Texas, operators literally burn surplus natural gas at the wellhead because there&#8217;s no pipeline capacity; 100% of that chemical energy turns into heat and light in the sky. Thermodynamically, the Idiot Index of flaring is infinite. If you park containerized AI pods and small turbines on the pad, you&#8217;re still emitting CO&#8322;, but you are at least getting useful compute out of energy that was otherwise pure waste. This is not the end state&#8212;you eventually want those sites replaced by geothermal or nuclear&#8212;but as a bridge, &#8220;flare&#8209;to&#8209;compute&#8221; is strictly less bad than flare&#8209;to&#8209;nothing, especially if it&#8217;s paired with a clear sunset plan and offsets.</p><p>None of these moves are easy.
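</p><p>For a sense of scale on the flare-to-compute bridge, an order-of-magnitude sketch; the flared volume and turbine efficiency are assumptions that vary widely from site to site.</p><pre><code># Order-of-magnitude sketch of flare-to-compute. The flared volume, gas
# quality, and turbine efficiency are assumptions and vary site by site.

MJ_PER_SCF = 1.05            # ~1,000 BTU per standard cubic foot of gas
SECONDS_PER_DAY = 86_400

flared_mmscf_per_day = 10    # one gathering area's flared volume (assumed)
turbine_efficiency = 0.35    # simple-cycle aeroderivative, roughly

thermal_w = flared_mmscf_per_day * 1e6 * MJ_PER_SCF * 1e6 / SECONDS_PER_DAY
electric_mw = thermal_w * turbine_efficiency / 1e6

print(f"~{thermal_w / 1e6:.0f} MW of heat currently going up the flare stack")
print(f"~{electric_mw:.0f} MW of behind-the-meter compute if captured instead")</code></pre><p>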
Zombie peakers and aero&#8209;turbines run into air permits and local resistance; flare&#8209;to&#8209;compute offends ESG purists even when it&#8217;s thermodynamically better than flaring; SMRs fight national politics. But physics doesn&#8217;t care about press releases, and neither does a 300&#8239;GW/year build&#8209;out.</p><h4>2.3 Arbitrage the grid itself: queues, substations, and FPX&#8209;style power envelopes</h4><p>A subtler bottleneck is that a lot of power is stuck in <strong>paperwork</strong>. Interconnection queues in the U.S. and Europe are clogged with solar farms, hydrogen projects, crypto mines, and generic industrial loads that may never be built. Brownfield factories sit under 100&#8239;MW feeders while their actual load shrank to 20&#8239;MW years ago. From a physics view, those are stranded rights to move electrons, not just stranded assets.</p><p>This is where a <strong>power exchange</strong> becomes interesting. Instead of only trading compute capacity, the market starts trading <strong>grid positions</strong> and <strong>latent substation headroom</strong>. A stalled 80&#8239;MW solar project in Ohio might be three years from financing; its developer sits on a valuable queue slot but no capital. FPX can broker a lease or sale of that queue position to Google (or any AI player), who drops in containerized training pods for 5&#8211;7 years while their own permanent campus is being built. When the solar farm is finally ready, the AI pods move on but the substation upgrades and legal work remain. Similarly, FPX can identify industrial sites with under&#8209;used feeders, assemble &#8220;substation piggyback&#8221; deals where an AI tenant shares capacity and time&#8209;slices with the host, and package those as ready&#8209;to&#8209;go 50&#8211;200&#8239;MW chunks.</p><p>This is exactly the direction Meta is moving with <strong>Atem Energy</strong>, its new subsidiary that has applied for authority to trade wholesale power and capacity. Meta&#8217;s play is clear: become its own power trader so it can arbitrage prices and secure flexible supply for 2&#8239;GW&#8209;class AI campuses, rather than relying on utilities alone. That&#8217;s a signal that hyperscalers no longer see electricity as a simple pass&#8209;through; they&#8217;re willing to carry trading risk on their own balance sheets if it buys them certainty. FPX doesn&#8217;t have a formal exchange product today, but it already does the hard part: <strong>sourcing and procuring used power infrastructure</strong>&#8212;generators, turbines, substations, and distressed datacenter shells. The natural next step is to surface those as standardized &#8220;power envelopes&#8221;: brownfield land + interconnect + sometimes temporary generation, sold as bundles AI companies can simply plug into.</p><pre><code><strong>FPX &#8220;Grid Desk.&#8221;</strong> We surface <strong>queue slots</strong>, <strong>latent substation headroom</strong>, and <strong>brownfield feeders</strong> as tradable envelopes. You bring the pods; we bring the electrons and paperwork you can stand on.</code></pre><h4>2.4 Make AI a grid asset, not just a load</h4><p>Not all power solutions are on the supply side. AI&#8217;s weird superpower is that <strong>training is elastic</strong>: you care that the model is done this week, not that every gradient step ran at exactly 2:34&#8239;p.m.
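</p><p>A toy sketch of that elasticity, assuming a simple price signal; the thresholds and names are illustrative, not any real cluster scheduler&#8217;s API.</p><pre><code># Toy sketch of "training as flexible load": pause or throttle a checkpointed
# training job based on a grid price signal. Thresholds and names are
# illustrative, not any real cluster scheduler's API.

CHEAP_PRICE = 30.0       # $/MWh: below this, soak up surplus (midday solar)
EXPENSIVE_PRICE = 120.0  # $/MWh: above this, give the megawatts back

def desired_state(price_usd_per_mwh, deadline_slack_hours):
    """Decide what the training job should do for the next interval."""
    if deadline_slack_hours == 0:
        return "RUN"                   # out of slack: just finish the model
    if price_usd_per_mwh > EXPENSIVE_PRICE:
        return "CHECKPOINT_AND_PAUSE"  # evening peak: shed hundreds of MW
    if price_usd_per_mwh > CHEAP_PRICE:
        return "RUN_AT_REDUCED_CLOCKS" # shoulder hours: throttle, do not stop
    return "RUN"                       # cheap or surplus power: train hard

for price, slack in [(12.0, 60), (80.0, 60), (180.0, 60), (180.0, 0)]:
    print(f"{price:>6.0f} $/MWh, {slack:>2} h slack: {desired_state(price, slack)}")</code></pre><p>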
In grid language, training is a &#8220;flexible load.&#8221;</p><p>If Google exposes that flexibility to the grid operator, AI becomes a kind of virtual power plant. When there&#8217;s too much solar at noon, Borg and the batcher spin up training jobs and prefill&#8209;heavy workloads; when the sun sets and the grid tightens, they checkpoint and throttle down, freeing hundreds of megawatts without any human noticing. Structured as demand&#8209;response or frequency&#8209;regulation products, that flexibility is something utilities pay for. The net effect is lowering Google&#8217;s average power price and making regulators <em>eager</em> to approve new AI campuses because they come with built&#8209;in controllability instead of just more peak demand.</p><p>At the micro level, the same principle applies on die. Google already has tools to hunt &#8220;mercurial cores&#8221; and silent data corruption; that telemetry can be reused to create <strong>software&#8209;defined voltage and frequency</strong> per chip. Golden TPUs can be safely undervolted toward their individual physics limits, shaving 10&#8211;20% dynamic power; weaker dies can be fenced to low&#8209;priority, low&#8209;clock jobs. Across 100,000 chips, that&#8217;s a free power plant&#8217;s worth of savings with no new hardware&#8212;just a willingness to treat V/f as a software knob instead of a fixed spec.</p><p>And don&#8217;t forget the network. Optical circuit switches (OCS) built with MEMS mirrors typically reconfigure in <strong>milliseconds</strong>, not microseconds&#8212;far too slow for per&#8209;packet routing, but perfectly fine for the 50&#8211;300&#8239;ms &#8220;breaths&#8221; of an AI job once continuous batching has packed the workload into trains. Time&#8209;sync scheduling means the NICs know exactly when each breath arrives; the mirrors can rewire a few tens of milliseconds beforehand. In that world, you can <strong>power&#8209;gate the hungry SerDes and DSP electronics</strong> for most of the cycle, keeping lasers in low&#8209;power idle or burst&#8209;mode, brightening only when the schedule says &#8220;burst now.&#8221; Across millions of links, turning off the electronics during the 70&#8211;80% of time when nothing useful is flowing is another huge chunk of &#8220;invisible&#8221; power reclaimed.</p><h4>2.5 Go where the photons are: training in space, inference on Earth</h4><p>Finally, there&#8217;s the move that feels like sci&#8209;fi but is already on the roadmap: <strong>space&#8209;based compute</strong>. Google&#8217;s Project Suncatcher is a research moonshot to put TPUs on solar&#8209;powered satellites in dawn&#8211;dusk orbits, talking over free&#8209;space optical links. In orbit you get near&#8209;continuous solar, several times the energy yield per square meter of panel compared with many locations on Earth, and a 3&#8239;K cosmic background as your heat sink. Latency and radiation make it impractical for user&#8209;facing inference, but for long&#8209;running training loops it&#8217;s plausible on a 10&#8209;year horizon if Starship&#8209;class launch really delivers hundreds of tons to orbit cheaply.</p><p>The physics split is neat: <strong>inference</strong> stays on Earth, close to users, data, and regulation; <strong>training</strong> migrates to wherever energy density and heat rejection are best&#8212;eventually that might be space.
Google plans to launch small Suncatcher prototypes around 2027 to test TPUs in radiation and optical cross&#8209;links; any commercial version is likely a mid&#8209;2030s story at best. But the direction is consistent with everything else: follow the photons and the cooling, not the legacy substations.</p><h3>Pulling it together</h3><p>All of these moves&#8212;nuclear co&#8209;location, geothermal bridges, boneyard turbines, zombie peaker roll&#8209;ups, flared&#8209;gas pods, queue and substation arbitrage, industrial symbiosis, AI&#8209;as&#8209;flexible&#8209;load, orbital training constellations&#8212;are variations on the same theme: <strong>stop treating power as an exogenous constraint and start designing the AI stack around the physics of energy.</strong></p><p>For Google, that means Amin Vahdat&#8217;s 1,000&#215; target can&#8217;t just be a story about better TPUs and smarter compilers. It has to be a story about where the atoms, pipes, queues, and photons are&#8212;and about partnering with firms like FPX that are willing to do the unglamorous work of scavenging turbines, brownfield substations, interconnect slots, and stranded wells. For FPX, it&#8217;s the opportunity to position itself as the <em>Atem of infrastructure</em>: not a power trader, but the specialist that finds, assembles, and procures the weird, messy assets&#8212;old plants, queue positions, used generators, and eventually orbital power slots&#8212;that will quietly decide who actually gets to build the next 10&#8239;GW of AI.</p><h4><strong>For investors</strong></h4><p>The power section is basically a filter for what will actually be scarce and valuable over the next decade. It says: stop thinking in terms of &#8220;more data centers&#8221; and start thinking in terms of <em>where exergy lives</em>. The assets that benefit from 1,000&#215; AI aren&#8217;t generic shells; they&#8217;re powered dirt near stranded or baseload energy (geothermal fields, old peakers, big substations), plus the brownfield sites that can be quickly repowered with used turbines and containerized pods. You want to own the stuff that works across three hardware cycles: high&#8209;capacity interconnects, cooling rights, industrial zoning, and substations that can host multiple generations of SMR/geo/jet&#8209;gen behind them. Management teams that talk fluently about liquid&#8209;to&#8209;liquid loops, direct&#8209;to&#8209;chip cooling, queue arbitrage, and AI as a flexible load are telling you they understand where the game is going. Those still selling &#8220;10&#8211;15&#8239;kW air&#8209;cooled colo&#8221; are, politely, on the wrong side of history.</p><h4><strong>For data center operators</strong></h4><p>The message is: design like a power plant, not like a server hotel. The winners will be the ones who show up early at the atoms&#8212;at SMR and geothermal sites, at zombie peakers, at LNG terminals and industrial clusters&#8212;and offer to be the thermal and electrical &#8220;organ&#8221; that soaks up waste heat or excess cold. That means building campuses that assume 50&#8211;100&#8239;kW racks, liquid cooling as default, and explicit tie&#8209;ins to industrial loops where TPU exhaust can pre&#8209;heat feedwater or feed district heating, instead of dumping everything into the sky. It also means getting comfortable with temporary and modular generation: refurbished aero&#8209;turbines, leased gas engines, even flare&#8209;gas pods as bridges while permanent baseload comes online.
The operators who learn to work with specialists that can source used turbines, distressed substations, and interconnect slots will be able to offer hyperscalers something far more compelling than &#8220;space and power&#8221;&#8212;they&#8217;ll be offering <em>time</em>: megawatts you can actually use in the next 18&#8211;36 months instead of in 2031.</p><h4><strong>For Colos/Developers</strong></h4><p>The opportunity sits at the edge between all of this heavy infrastructure and the end customers. You probably won&#8217;t own an SMR or drill a geothermal field, but you <em>can</em> be the flexible envelope that hyperscalers and AI labs plug into while they wait for those big projects to mature. That means positioning specific sites as &#8220;AI&#8209;grade&#8221;: already wired for high&#8209;density racks, liquid&#8209;ready, with strong peering and the ability to piggyback on underused industrial feeders or rolled&#8209;up brownfield plants. It means being open to weird power structures&#8212;time&#8209;of&#8209;day pricing, sharing feeders with local industry, selling waste heat to municipalities&#8212;and to short&#8209;to&#8209;medium&#8209;term deals where containerized GPUs/TPUs land on your pads for 3&#8211;7 years and then move on. The colos that lean into this, and work with firms like FPX to find and procure unusual power infrastructure instead of waiting for pristine greenfield, become indispensable: they&#8217;re the glue layer that turns stranded megawatts and stalled projects into live, revenue&#8209;generating AI capacity.</p><p></p><p>This is the perfect next step. We&#8217;ve covered 1. <strong>Chips (Time/Calculation)</strong> and 2. <strong>Power (Energy/Thermodynamics)</strong>.</p><p>Now we tackle <strong>Networking (Distance/Bandwidth)</strong>.</p><div><hr></div><h2>3) The Physics of Bandwidth: From Packet Cops to Virtual Wafers</h2><p>The constraint in networking isn&#8217;t the speed of light; it&#8217;s how often you stop light to think about it. Every time a photon becomes an electron and passes through a switch ASIC, you pay in power (O&#8209;E&#8209;O conversion) and latency (indecision). Treat a data center as a bunch of servers, and this is just &#8220;networking gear.&#8221; Treat it as a <strong>Virtual Wafer</strong>&#8212;one giant computer&#8212;and the network <em>is</em> the computer. The &#8220;Idiot Index&#8221; of the legacy design is how many times you turn light into heat and back into light just to move a tensor from Chip A to Chip B.</p><p>Google&#8217;s 1,000&#215; roadmap on the network side is really about deleting decision points. Apollo&#8217;s optical circuit switches (OCS) already rip out big layers of electrical spine switches; <strong>Google&#8217;s Huygens&#8209;style time&#8209;sync stack</strong> gives you tens&#8209;of&#8209;nanoseconds clock alignment so collectives can be scheduled instead of guessed. 
The next questions are: what can you delete at the rack and pod level, and how do you make sure the fabric respects the SPAD split&#8212;<strong>Readers vs Writers, prefill vs decode</strong>&#8212;instead of fighting it?</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/8ec4db48-cec7-40f5-9c0a-6c53cd385a40_3350x1738.png" alt=""></figure></div>
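<p>To put rough numbers on that version of the Idiot Index, here is a crude sketch of energy per moved gigabyte as a function of how many O-E-O hops the path contains; the per-hop picojoule figures are ballpark assumptions, not measured numbers.</p><pre><code># Crude sketch: the cost of moving a tensor across the fabric is dominated by
# how many times the bits go optical-electrical-optical, not by the length of
# the glass. Per-hop energies are rough assumptions, not measured figures.

PJ_PER_BIT_ENDPOINT = 5.0       # NIC/TPU SerDes plus laser, at each end
PJ_PER_BIT_PER_OEO_HOP = 15.0   # each electrical switch traversal on the path

def joules_per_gigabyte(oeo_hops):
    bits = 8e9
    pj_per_bit = 2 * PJ_PER_BIT_ENDPOINT + oeo_hops * PJ_PER_BIT_PER_OEO_HOP
    return bits * pj_per_bit * 1e-12

for hops in [0, 1, 3, 5]:       # 0 = all-optical path (OCS / passive WDM)
    print(f"{hops} O-E-O hops: {joules_per_gigabyte(hops):.2f} J per GB moved")</code></pre>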
<h3>Phase 1 &#8211; Deconstruct to Physics: The SerDes Tax &amp; The Reactive Trap</h3><p>Two physical realities dominate the cost of moving bits today:</p><ol><li><p><strong>The SerDes tax.</strong> SerDes (serializer/deserializers) and their DSP front&#8209;ends encode wide, slow on&#8209;chip data into multi&#8209;GHz streams on copper. At 400&#8211;800&#8239;G and beyond, those blocks are chewing up on the order of 30% of the I/O power budget on many modern parts&#8212;more and more of the chip&#8217;s thermal envelope is spent fighting signal loss and jitter rather than doing model math.</p></li><li><p><strong>The reactive trap.</strong> Traditional switches exist to manage randomness: they read headers, juggle queues, and make per&#8209;packet decisions because internet traffic is chaotic. But AI training traffic is <em>not</em> chaotic. It&#8217;s a sequence of all&#8209;reduce collectives the compiler knows about in advance. Even inference becomes structured once you apply continuous batching: user requests get buffered into 5&#8211;50&#8239;ms &#8220;trains&#8221; so the hardware sees predictable bursts, not white noise.
Using fully reactive packet switches for this is like putting stop signs in the middle of a railway.</p></li></ol><p>Even after Apollo deletes the electrical spine, you&#8217;re still paying for:</p><ul><li><p><strong>ToR switches</strong> that treat each rack as a mini&#8209;internet.</p></li><li><p><strong>Short&#8209;reach copper</strong> between TPUs/NICs and transceivers, which forces additional SerDes and retimers right where the energy per bit is already worst.</p></li></ul><p>Overlay that with workload physics and the SPAD split:</p><ul><li><p><strong>Scale&#8209;out training (&#8220;the symphony&#8221;)</strong> &#8211; deterministic bursts: 10k chips compute for ~300&#8239;ms, then scream at each other for ~50&#8239;ms, then go quiet.</p></li><li><p><strong>Scale&#8209;out inference (&#8220;the factory&#8221;)</strong> &#8211; messy at the user edge, but internally decomposes into:</p><ul><li><p><strong>Reader flows (prefill)</strong> &#8211; big, matmul&#8209;heavy, compute&#8209;bound.</p></li><li><p><strong>Writer flows (decode)</strong> &#8211; small, KV&#8209;cache&#8209;heavy, memory&#8209;bound and tail&#8209;latency sensitive.</p></li></ul></li></ul><p>If the network ignores that structure and uses the same Clos/ToR logic everywhere, you&#8217;ve effectively thrown away half of what SPAD and batching bought you. The job now is to delete as much of that generic machinery as physics will allow.</p><h3>Phase 2 &#8211; Rebuild to Economics: Color, Air, Analog, and Petabyte Shelves</h3><h4>3.1 The Rainbow Bus (Passive WDM) &#8211; Routing by Color, Not Silicon</h4><p>The Rainbow Bus idea asks: if you already know which node you&#8217;re sending to, why decode headers at all? <strong>Arrayed Waveguide Grating Routers (AWGRs)</strong> are passive photonic devices that route light by wavelength: &#8220;red&#8221; exits port 1, &#8220;blue&#8221; exits port 2, etc. Combine them with fast tunable lasers and you get a <strong>wavelength&#8209;routed fabric</strong>: the TPU doesn&#8217;t send a packet &#8220;to chip #50,&#8221; it just emits on &#955;&#8325;&#8320; and the glass prism sends it to the right place. No switch ASIC, no O&#8209;E&#8209;O, essentially zero incremental power to route.</p><p>Reality check:</p><ul><li><p>AWGRs are mature enough for telecom and have been prototyped for data center networks. The physics is sound, but <strong>crosstalk, temperature sensitivity, and wavelength management</strong> make it hard to scale them to thousands of ports without heroic engineering.</p></li><li><p>Fast, stable tunable lasers exist in research and early products, often using microcombs or integrated photonics, but they&#8217;re still expensive and tricky to manufacture at hyperscale.</p></li></ul><p>Where it makes sense <strong>soon</strong> is not as a planet&#8209;scale &#8220;Rainbow spine,&#8221; but as a <strong>pod&#8209;scale delete</strong>:</p><ul><li><p>Use AWGRs inside a pod or rack&#8209;group to replace ToR switches and some SerDes: 32&#8211;64 nodes can be interconnected via a passive wavelength fabric. Training and prefill traffic&#8212;static, compiler&#8209;known patterns&#8212;are a perfect match.</p></li><li><p>The compiler (XLA/Pathways) assigns wavelengths deterministically: rank 0 sends gradients on &#955;&#8320;, rank 1 on &#955;&#8321;, and so on. The fabric becomes a static &#8220;color map&#8221; rather than a programmable router.</p></li></ul><p>In SPAD terms, the Rainbow Bus is most valuable for <strong>Reader and training pods</strong>. 
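</p><p>A minimal sketch of what &#8220;routing by color&#8221; means in software, assuming a small pod and the standard cyclic AWGR routing property; the port count and channel spacing are made up for illustration.</p><pre><code># Minimal sketch of wavelength-routed ("Rainbow Bus") addressing inside one
# pod. The cyclic rule below is the textbook AWGR routing property; the port
# count and channel spacing are illustrative, not a real device spec.

N_PORTS = 32          # pod size (assumed)
BASE_NM = 1530.0      # start of the band (assumed)
SPACING_NM = 0.8      # roughly a 100 GHz grid (assumed)

def wavelength_for(src, dst):
    """Choose the laser channel so the passive AWGR delivers src to dst."""
    channel = (dst - src) % N_PORTS   # cyclic routing: no headers, no ASIC
    return BASE_NM + channel * SPACING_NM

# A compiler-known collective becomes a static color map: in a ring all-reduce
# every rank only talks to (rank + 1) % N_PORTS, so each TPU sits on one fixed
# wavelength for the whole job.
print(f"rank 0 to rank 1:  {wavelength_for(0, 1):.1f} nm")
print(f"rank 0 to rank 17: {wavelength_for(0, 17):.1f} nm")
print(f"rank 5 to rank 4:  {wavelength_for(5, 4):.1f} nm")</code></pre><p>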
It lets you treat a pod as a fully connected clique without paying a kilowatt of ToR silicon. For Writers and latency&#8209;sensitive decode, the routing problem is different; static wavelengths are less compelling there. So: <strong>keep Rainbow at pod/rack scale, wired into training/prefill</strong>, and don&#8217;t pretend it can replace Apollo&#8217;s OCS core in the campus anytime soon.</p><h4>3.2 The Breathing Fabric (SerDes Power&#8209;Gating)</h4><p>Google already knows networks &#8216;breathe&#8217;: long compute phases, short communication bursts. <strong>Its Huygens&#8209;class time&#8209;sync stack</strong> gives every NIC and TPU a shared notion of &#8216;now&#8217;; Apollo&#8217;s mirrors reconfigure in milliseconds before a collective starts. That&#8217;s enough to turn the network into a <strong>scheduled organ</strong> rather than a static utility.</p><p>The obvious first principle move is: stop powering lungs that aren&#8217;t inhaling.</p><ul><li><p>You can&#8217;t hard&#8209;off standard WDM lasers without dealing with wavelength drift and relock time, but you <em>can</em> power&#8209;gate the <strong>hungry SerDes and DSP electronics</strong> for long stretches.</p></li><li><p>The scheduler knows when all&#8209;reduces are coming, and in inference land, the batcher knows when big prefill waves will hit. In the gaps, NICs can put high&#8209;speed I/O blocks into deep sleep, only waking in time to reacquire clocks and align CRCs.</p></li></ul><p>Across hundreds of thousands of links, that&#8217;s not rounding error; it&#8217;s megawatts. And it&#8217;s purely a <strong>software + firmware change on top of existing optics</strong>. This is low&#8209;hanging fruit for Amin&#8217;s 4&#8211;5&#8209;year window: fully compatible with Apollo and Falcon, and complementary to everything else.</p><h4>3.3 SPAD&#8209;Aligned Networks: Reader Pods, Writer Pods, and the Cortex</h4><p>Where the previous drafts were too generic, this is where we tie networking directly to SPAD.</p><ul><li><p><strong>Reader Pods (Prefill).</strong> These are Ironwood/Trillium&#8209;heavy clusters optimized for matmuls: lots of compute, enough HBM to hold weights, and very high bandwidth for one&#8209;shot prompt ingestion. Their outbound traffic is mostly <strong>compact state</strong> (KV/cache summaries) headed to Writers. They benefit from pod&#8209;local Rainbow Bus or similar static fabrics and from Apollo&#8209;style scheduled optics when they ship state across the campus.</p></li><li><p><strong>Writer Pods (Decode).</strong> These look like &#8220;HBM with a brain&#8221;: lots of memory and KV cache, modest compute. Their network priority is <strong>low&#8209;tail&#8209;latency, many&#8209;to&#8209;one links</strong> to memory tiers and KV/state stores, not massive bisection bandwidth. Inside a Writer pod, the &#8220;network&#8221; should look more like a <strong>CXL/photonic memory fabric</strong> than like Ethernet; the main job is to keep KV cache and agent state electrically close, not to route arbitrary RPCs.</p></li><li><p><strong>Cortex Tiles (State).</strong> For long&#8209;lived agents, you want a small number of SRAM&#8209;heavy &#8220;cortex&#8221; pods where context and world&#8209;model live essentially permanently. 
Networking&#8217;s job is to keep Reader/Writer compute gravitating toward the cortex that holds a session&#8217;s state, instead of reloading context from cold storage on each turn.</p></li></ul><p>Topology&#8209;aware scheduling is the glue: the orchestrator places a user&#8217;s session on a specific Writer + Cortex neighborhood and keeps it there, minimizing cross&#8209;campus hops and avoiding &#8220;KV ping&#8209;pong&#8221; across pods. That&#8217;s a simple software policy, but it demands <strong>explicitly SPAD&#8209;aware fabric design</strong>, not a generic L3 mesh.</p><h4>3.4 The Air&#8209;Gap (Indoor FSO) &#8211; When Fiber Runs Out</h4><p>Bringing Project Taara indoors is exactly the kind of off&#8209;script move Amin would appreciate: delete cable bundles, beam bits through air. Research prototypes like FireFly and OWCell show that <strong>rack&#8209;to&#8209;rack free&#8209;space optics (FSO)</strong> in a data center is possible: steerable laser &#8220;eyes&#8221; on racks hitting ceiling mirrors, reconfigurable in software.</p><p>But reality bites:</p><ul><li><p>Line&#8209;of&#8209;sight can be blocked by people, lifts, new racks.</p></li><li><p>Dust, smoke, and refractive turbulence affect reliability.</p></li><li><p>Aligning and maintaining thousands of beams in a hot, vibrating hall is non&#8209;trivial.</p></li></ul><p>So the right framing is: <strong>FSO is a scalpel, not a backbone</strong>.</p><ul><li><p>Use it as a &#8220;break glass&#8221; overlay where you literally can&#8217;t pull more fiber (heritage buildings, constrained conduits, brownfield retrofits).</p></li><li><p>Use it to temporarily augment bandwidth between hot pods while permanent optical fibers are being added.</p></li></ul><p>For Amin&#8217;s 1,000&#215; plan, FSO is a niche tool. It&#8217;s clever and occasionally necessary, but the main line should be more glass and smarter fabrics, not turning every row into a room full of Taara turrets.</p><h4>3.5 The Petabyte Shelf (Optical CXL) &#8211; Fixing the Memory Wall for Writers</h4><p>The Petabyte Shelf is the least speculative idea here and the most aligned with what the ecosystem is already building. Today, HBM is bolted to the accelerator. If one chip runs out of memory, it fails, even if its neighbor has tens of gigabytes free. 
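</p><p>A toy example, with made&#8209;up capacities, of how much that siloing wastes in practice:</p><pre><code># Illustration only: stranded HBM in per-chip silos vs. a pooled view.
hbm_per_chip_gb = 80
pod = [30, 45, 100, 20, 55, 60, 25, 95]   # GB each chip's shard actually needs

# Siloed world: any shard bigger than its local stack fails outright,
# while the leftover gigabytes on its neighbours sit idle.
failures = [need for need in pod if need > hbm_per_chip_gb]
stranded = sum(max(hbm_per_chip_gb - need, 0) for need in pod)

# Pooled world: the pod only fails if total demand exceeds total supply.
total_need, total_hbm = sum(pod), hbm_per_chip_gb * len(pod)

print(f"siloed: {len(failures)} shard(s) fail while {stranded} GB sit idle")
print(f"pooled: need {total_need} GB of {total_hbm} GB available")
</code></pre><p>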
<strong>CXL (Compute Express Link)</strong> and related standards exist precisely to turn memory into a pooled resource, and photonic I/O vendors like Ayar Labs and Celestial AI are explicitly targeting <strong>optical memory fabrics</strong> that detach DRAM from compute.</p><p>The architecture looks like this:</p><ul><li><p>A rack (or short row) of <strong>memory sleds</strong>: DRAM/NVRAM shelves attached via CXL or a custom photonic protocol.</p></li><li><p>Ironwood/Trillium tiles with <strong>optical I/O chiplets</strong> that talk load/store semantics to those shelves&#8212;&#8220;read 4&#8239;MB from slot X&#8221;&#8212;instead of slinging giant KV blobs over IP.</p></li><li><p>A controller layer that manages allocation, QoS, and basic coherency.</p></li></ul><p>SPAD&#8209;wise, this is the missing organ:</p><ul><li><p>Readers keep most of their weights and temporary activations in local HBM but can spill rare big layers or long contexts to the shelf.</p></li><li><p>Writers treat the shelf as their <em>primary</em> KV/agent state store, pulling the hot working set into local SRAM/DRAM and leaving the rest in pooled memory.</p></li><li><p>Cortex tiles effectively live in the shelf: long&#8209;lived agent state is just a pinned region of this pooled RAM.</p></li></ul><p>Feasibility is high:</p><ul><li><p>CXL memory pooling is already shipping in CPU systems; hyperscalers are deploying it for databases and in&#8209;memory analytics.</p></li><li><p>Celestial AI and Ayar Labs both report hyperscaler engagements to build photonic fabrics for disaggregated memory and accelerator I/O.</p></li></ul><p>For decode and agent workloads, this is the <strong>most direct path to a 10&#215;&#8211;100&#215; effective memory increase</strong> without 10&#215;&#8211;100&#215; more HBM stacks and power.</p><h4>3.6 The Analog Sum &#8211; Where to Park It (For Now)</h4><p>Analog optical computing&#8212;doing math with interference, not transistors&#8212;is very real in the lab. Groups and startups have shown optical matrix&#8209;vector multiplies, convolutions, even pieces of backprop, and you can absolutely build an optical adder tree that performs a reduce&#8209;sum across a few dozen inputs &#8220;for free&#8221; in the optical domain.</p><p>The problems are precision and scale:</p><ul><li><p>Gradient sums need ~8&#8211;16 effective bits of accuracy over wide dynamic ranges; analog optics adds noise, drift, and calibration overhead.</p></li><li><p>Integrating large optical mesh networks into real TPUs and routing gradients through them without massive engineering risk is a long project, not a 2&#8209;year rollout.</p></li></ul><p>The sensible compromise is:</p><ul><li><p>Treat analog sum as a <strong>near&#8209;chip or rack&#8209;level accelerator</strong>: use optics to pre&#8209;aggregate gradients from a handful of neighbors, then feed the result into a digital all&#8209;reduce tree. That shrinks data volume and I/O energy without betting training convergence on a fully analog fabric.</p></li><li><p>Keep it in the <strong>Phase 3/R&amp;D bucket</strong> for Amin&#8217;s plan. 
It&#8217;s aligned with physics, and DeepMind&#8209;style algorithmic robustness might make it viable sooner than people expect, but it&#8217;s not something you count on for 2029 capacity.</p></li></ul><h3>The Profound Bit: What Google Should Actually Do</h3><p>If you strip away the sci&#8209;fi and keep only what the physics and timelines support, the networking playbook that maximizes value for Google looks like this:</p><ul><li><p><strong>In training:</strong></p><ul><li><p>Double down on Apollo + Huygens as the &#8220;optical scheduler&#8221; for pods and campuses.</p></li><li><p>Push co&#8209;packaged optics and short&#8209;reach photonics to delete as much SerDes tax as possible.</p></li><li><p>Experiment with Rainbow&#8209;style AWGR fabrics <em>inside</em> pods to delete ToRs and let the compiler assign wavelengths.</p></li><li><p>Add breathing&#8209;fabric power&#8209;gating to SerDes and DSPs to reclaim idle megawatts.</p></li></ul></li><li><p><strong>In inference:</strong></p><ul><li><p>Make SPAD real at the network level: physically distinct Reader, Writer, and Cortex pods, with topology&#8209;aware placement of agent sessions.</p></li><li><p>Build <strong>Petabyte Shelves</strong>&#8212;CXL/photonic memory fabrics&#8212;that take KV and context off local HBM and turn them into pooled assets.</p></li><li><p>Use Apollo + batching for long&#8209;haul state moves; use memory fabrics, not IP meshes, for most Writer and Cortex traffic.</p></li></ul></li><li><p><strong>In research:</strong></p><ul><li><p>Treat Rainbow Bus at campus scale, Air&#8209;Gap fabrics, and full analog all&#8209;reduce as high&#8209;upside experiments, not prerequisites. Fund them through DeepMind and the hardware research org as 2030s accelerants, not 2020s dependencies.</p></li></ul></li></ul><p>That way, networking stops being a static cost and becomes another place where first&#8209;principles design can buy you <strong>orders of magnitude</strong>. Not by buying more 800&#8239;G ports&#8212;but by deleting the parts (ToRs, unnecessary SerDes, generic meshes) that no longer make sense once you accept that an AI data center is not a collection of servers. It&#8217;s a Virtual Wafer, and the job of the network is to make that lie as close to true as physics allows.</p><h3>What this means for investors: follow the glass, not the cops</h3><p>If you take the Virtual Wafer idea seriously, the center of gravity in networking shifts away from &#8220;smart packet cops&#8221; and toward <strong>photons, packaging, and memory fabrics</strong>.</p><p>The legacy trade was: buy the switch ASIC vendors and assume complexity scales with bandwidth. But in a spine built around Apollo&#8209;style OCS and Huygens scheduling, the whole point is to <em>delete</em> routing intelligence from the middle of the fabric. The value is migrating up into the compiler/runtime (Pathways, XLA, vLLM&#8209;style batching) and down into the optics and materials that make a glass core viable for 10&#8211;15 years. The risk, from an investor&#8217;s lens, is being long on companies whose only differentiation is &#8220;smarter packet inspection in a Clos spine&#8221; as hyperscalers quietly replace those spines with passive mirrors and wavelength fabrics.</p><p>The upside is in the supply chain that makes Virtual Wafers and Petabyte Shelves real. 
That means <strong>optical engines and CPO</strong>, not just pluggables; <strong>photonics assemblers and test houses</strong> (the &#8220;TSMC of the network&#8221;); <strong>electro&#8209;optic bridge silicon</strong> for CXL and memory pooling; and the materials and fiber vendors whose volume explodes if every large campus needs tens of thousands of strands between buildings instead of a few hundred. Your mental rotation is: from &#8220;ports per switch&#8221; to <strong>watts per bit</strong>, from &#8220;L3 features&#8221; to <strong>how early can we turn electrons into light and never turn them back until we hit HBM or DRAM</strong>. The companies that win that game, even if they&#8217;re small today, are the ones that will quietly sit under every Reader pod, every Petabyte Shelf, every campus&#8209;scale Virtual Wafer.</p><h3>What this means for data center operators: design for fiber gravity and SPAD zones</h3><p>For operators, the big shift is that <strong>topology and conduit</strong> become as critical as megawatts and floor loading. If your mental model is still &#8220;roomful of identical halls with standard 4&#8243; duct banks between them,&#8221; you&#8217;re not building for Virtual Wafers. A 100k&#8209;chip training cluster spread over two or three buildings wants <em>absurd</em> fiber density and very clean, low&#8209;loss paths: straight&#8209;shot duct banks, room for multiple high&#8209;count cables, and the physical plant to support OCS nodes and optical patching at campus scale. You&#8217;re not just sizing transformers; you&#8217;re engineering <strong>fiber highways</strong>.</p><p>SPAD also implies you stop treating every white space the same. Reader pods (training + prefill) want <strong>very high rack densities, liquid cooling, short hop latency into the optical core</strong> and maybe Rainbow&#8209;style pod fabrics. Writer pods and Petabyte Shelves want <strong>memory density, CXL backplanes, and clean short&#8209;reach optics</strong> more than insane kW/rack. Cortex/state pods want ultra&#8209;reliable power, low&#8209;latency fabrics, and tight coupling to storage. The operator who can walk into a Google/Anthropic/Cohere RFP and say, &#8220;Here&#8217;s our Training Zone spec, here&#8217;s our Inference/Writer Zone spec, here&#8217;s our Memory/Shelf pod spec, and here&#8217;s the duct bank between them&#8221; is playing a different game than the one still selling &#8220;up to 10&#8239;kW per rack, chilled water available&#8221;.</p><p>This is also where campus layout becomes a moat. If you have brownfield campuses with existing high&#8209;capacity duct, rights&#8209;of&#8209;way for new fiber, good line&#8209;of&#8209;sight between buildings (for the odd FSO overlay), and the physical volume to host OCS/patch rooms, you can credibly market yourself as <strong>Virtual&#8209;Wafer&#8209;ready</strong>. If you don&#8217;t, the cheapest thing you can do today is overbuild conduit and risers everywhere you still can; the most expensive thing you can do is assume &#8220;a couple of 864&#8209;count bundles between halls&#8221; will be enough when Reader/Writer clustering and Petabyte Shelves really show up.</p><h3>What this means for colos &amp; developers: sell fabric&#8209;ready shells, not just space &amp; power</h3><p>For colocation providers and developers, the networking deep dive basically says: <strong>&#8220;space and power&#8221; is table stakes; the product now is </strong><em><strong>fabric readiness</strong></em>. 
Your best tenants over the next decade will be AI shops that aren&#8217;t quite big enough to build their own Jupiter+Apollo clone, but want something that <em>rhymes</em> with Google&#8217;s architecture.</p><p>That means three things. First, your <strong>MMR and campus interconnect</strong> story has to level up. It&#8217;s not just &#8220;here are the IXPs and waves we can sell you&#8221;; it&#8217;s &#8220;here is a pre&#8209;engineered optical mesh across our buildings, with dark fiber or wavelength services you can treat as your own Virtual Wafer core&#8221;. If you can offer OCS&#8209;friendly topologies or even managed optical fabrics between suites&#8212;&#8220;here&#8217;s a 4&#8209;hall mesh with guaranteed latency and loss characteristics, ready for your Reader/Writer split&#8221;&#8212;you&#8217;ll win training and stateful inference workloads that a vanilla colo never sees.</p><p>Second, you can start to <strong>productize SPAD in real estate terms</strong>. Instead of generic 2&#8239;MW halls, you market:</p><ul><li><p>&#8220;Training Suites&#8221;: liquid&#8209;ready, high floor loading, dense power distribution, great cross&#8209;connect into the campus glass core.</p></li><li><p>&#8220;Inference / Edge Suites&#8221;: more modest power, but rich metro connectivity and short paths to MMRs and end&#8209;user networks.</p></li><li><p>&#8220;Memory / Shelf Suites&#8221;: optimized for Petabyte Shelves and CXL fabrics, with lots of rack positions, moderate power, and very clean, short&#8209;reach optical paths to adjacent compute suites.</p></li></ul><p>Third, you can differentiate by being the <strong>neutral aggregator of weird infra</strong> that makes these fabrics possible. Most AI tenants don&#8217;t want to negotiate for extra duct banks, exotic fiber types, shared Petabyte shelves, or FSO links across roofs with landlords and cities; they want someone who shows up with a &#8220;cluster&#8209;ready envelope.&#8221; This is where working with FPX&#8209;type partners helps: you show up not just with powered land, but with <strong>pre&#8209;sourced optical plant, duct, and even shared memory/caching tiers</strong> they can plug into. In a world where hyperscalers are turning their own DCs into Virtual Wafers, the edge for colos is to offer the same <em>pattern</em>&#8212;glass core, SPAD&#8209;aware zoning, pooled memory&#8212;without the tenant having to reinvent Jupiter in a leased hall.</p><p>This is the fourth and final &#8220;Physics&#8221; deep dive. We have covered Chips, Power, and Networking. Now we tackle <strong>Memory</strong>.</p><p>The &#8220;First Principles&#8221; hook here is: <strong>Memory is the only thing that matters.</strong></p><p>Compute is cheap ($10^{-12}$ Joules). Moving data from memory to compute costs $100\times$ more energy. The &#8220;Memory Wall&#8221; is not a metaphor; it is a thermodynamic tax.</p><h2>4) The Physics of Memory: HBM Is the New Oil (and We&#8217;re Wasting It)</h2><p>We went deep on this in our earlier piece, <em><a href="https://research.fpx.world/p/part-1-beyond-power-the-ai-memory">Beyond Power: The AI Memory Crisis</a></em> &#8211; arguing that the real constraint on hyperscale AI isn&#8217;t how many GPUs you can buy, but how many useful bytes you can keep close to them and at what cost. That piece mapped the DRAM/HBM supply chain, the CoWoS bottleneck, and why memory has become the hidden governor of AI scale. 
Here, we take that a step further and apply a first&#8209;principles lens specifically to Google&#8217;s world: HBM scarcity, SPAD (prefill vs decode), agents, and how a company at Google&#8217;s scale can architect around a memory system it does not fully control.</p><p>If you strip Google&#8217;s AI stack down to physics, one thing jumps out: <strong>compute is no longer the limit &#8212; memory is</strong>. A TPU or GPU can do a low&#8209;precision MAC for a fraction of a picojoule; the expensive part is hauling the operands in and out of memory. Every rung you climb down the hierarchy, from SRAM to HBM to DRAM to SSD to HDD, adds orders of magnitude in energy and latency. In a large LLM, most of the joules are spent <em>moving</em> bits, not thinking with them.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2cQw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba6df2f1-0a06-40b9-b13b-6ed9c99e7180_2816x1504.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2cQw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba6df2f1-0a06-40b9-b13b-6ed9c99e7180_2816x1504.png 424w, https://substackcdn.com/image/fetch/$s_!2cQw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba6df2f1-0a06-40b9-b13b-6ed9c99e7180_2816x1504.png 848w, https://substackcdn.com/image/fetch/$s_!2cQw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba6df2f1-0a06-40b9-b13b-6ed9c99e7180_2816x1504.png 1272w, https://substackcdn.com/image/fetch/$s_!2cQw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba6df2f1-0a06-40b9-b13b-6ed9c99e7180_2816x1504.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2cQw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba6df2f1-0a06-40b9-b13b-6ed9c99e7180_2816x1504.png" width="1456" height="778" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ba6df2f1-0a06-40b9-b13b-6ed9c99e7180_2816x1504.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:778,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:5665061,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://research.fpx.world/i/179678133?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba6df2f1-0a06-40b9-b13b-6ed9c99e7180_2816x1504.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2cQw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba6df2f1-0a06-40b9-b13b-6ed9c99e7180_2816x1504.png 424w, 
https://substackcdn.com/image/fetch/$s_!2cQw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba6df2f1-0a06-40b9-b13b-6ed9c99e7180_2816x1504.png 848w, https://substackcdn.com/image/fetch/$s_!2cQw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba6df2f1-0a06-40b9-b13b-6ed9c99e7180_2816x1504.png 1272w, https://substackcdn.com/image/fetch/$s_!2cQw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba6df2f1-0a06-40b9-b13b-6ed9c99e7180_2816x1504.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Now add the supply chain: the only memory fast enough to keep up with frontier chips is <strong>HBM</strong> (high&#8209;bandwidth memory), and that market is tiny and stressed. SK hynix has become the HBM kingpin, Samsung has most of the remaining share, and Micron,  after largely sitting out the early HBM3 cycle and focusing on HBM3E &#8212; is only now ramping into contention. Packaging (TSVs, 3D stacking, 2.5D interposers) is the real choke point, and all three vendors have broadcast the same message: <em>HBM capacity is effectively sold out into the mid&#8209;2020s</em>. One of the three skipping a full generation (HBM3) is precisely why the shortage feels so acute.</p><p>HBM is now the <strong>most expensive real estate in the data center on a per&#8209;bit basis</strong>, yet the way we use it looks like a landlord who rents skyscrapers to people who only occupy half a floor. We solder 80&#8211;192&#8239;GB of HBM to every accelerator and then strand a huge fraction because the workload doesn&#8217;t perfectly fill the silo. That&#8217;s the memory Idiot Index: the scarcest, hardest&#8209;to&#8209;scale resource is also the least efficiently used.</p><p>If Google applies a genuine first&#8209;principles framework here &#8212; question <em>every</em> assumption instead of importing GPU culture &#8212; it has to behave as if <strong>HBM supply never really catches up</strong>. 
That means:</p><ul><li><p>Treat HBM as a cache, not a comfort blanket.</p></li><li><p>Design memory around the <em>actual</em> structure of LLM workloads: <strong>prefill vs decode vs long&#8209;running agents</strong>.</p></li><li><p>Push everything that doesn&#8217;t absolutely need HBM into DRAM, SSD, HDD, and older silicon that FPX can scavenge.</p></li></ul><h3>Phase 1 &#8211; Deconstruct to Physics: Prefill, Decode, Agents vs HBM</h3><p>Start with what the model actually does.</p><p><strong>Prefill (Readers)</strong></p><ul><li><p>Reads the full prompt and context.</p></li><li><p>Giant matmuls and dense attention; access patterns are wide and relatively predictable.</p></li><li><p>Extremely sensitive to HBM bandwidth and locality.</p></li><li><p>This is where HBM earns its keep.</p></li></ul><p><strong>Decode (Writers)</strong></p><ul><li><p>Generates tokens one (or a few) at a time.</p></li><li><p>Dominated by KV cache and attention over the past tokens.</p></li><li><p>Mostly memory&#8209;bound; compute units wait on KV and embeddings.</p></li><li><p>Needs capacity and steady bandwidth more than bleeding&#8209;edge FLOPs.</p></li></ul><p><strong>Agents (Cortex)</strong></p><ul><li><p>Long&#8209;running workflows: coding agents, planners, research assistants.</p></li><li><p>Need consistent, structured <strong>state</strong> over minutes or hours.</p></li><li><p>Working set is relatively small but hit constantly; the rest is long&#8209;term memory.</p></li></ul><p>Now overlay the physical tiers:</p><ul><li><p><strong>SRAM / registers</strong> &#8212; nanoseconds, tiny energy, minuscule capacity, very expensive area.</p></li><li><p><strong>HBM</strong> &#8212; ultra&#8209;fast, low latency, horrifically expensive and supply&#8209;constrained.</p></li><li><p><strong>DRAM / CXL shelves</strong> &#8212; larger, slower, still decent energy/bit; expandable.</p></li><li><p><strong>SSD / HDD</strong> &#8212; massive, slow, cheap; lots of used capacity in the world.</p></li></ul><p>And the key constraint: HBM output can&#8217;t be scaled at will. Micron sitting out much of HBM3 and only going big on HBM3E means one entire slice of potential supply simply wasn&#8217;t there when AI demand took off. SK hynix and Samsung are already maxing their TSV/stacking capacity. There is no &#8220;we&#8217;ll just get another 3&#215; HBM by 2027&#8221; button.</p><p>So the first&#8209;principles conclusion is:</p><blockquote><p><strong>You cannot scale prefill, decode, and agents by just slapping more HBM on each chip.</strong><br>You have to change <em>how</em> each stage uses HBM and push everything else into tiers FPX can actually source at scale.</p></blockquote><h3>Phase 2 &#8211; Rebuild to Economics: SPAD&#8209;Aligned Memory in a Scarce&#8209;HBM World</h3><p>Now we rebuild the memory hierarchy with two constraints in mind:</p><ol><li><p><strong>HBM is a fixed, cartel&#8209;constrained resource.</strong></p></li><li><p><strong>Prefill, decode, and agents have totally different physics.</strong></p></li></ol><h4>4.1 Prefill (Readers): HBM&#8209;only, at 4 bits whenever possible</h4><p>Prefill is the only stage that truly deserves HBM by default. That means:</p><ul><li><p><strong>HBM holds only hot weights and minimal activations.</strong></p><ul><li><p>No KV caches. No agent history. 
No long prompts.</p></li><li><p>If a byte isn&#8217;t repeatedly touched in microseconds, it gets evicted to DRAM/SSD.</p></li></ul></li><li><p><strong>Virtual HBM with FP4/INT4.</strong></p><ul><li><p>Prefill is mostly linear algebra in well&#8209;behaved layers. This is the easiest place to go aggressively low&#8209;precision.</p></li><li><p>Make FP4 / INT4 the <em>default</em> for Reader weights and activations; reserve FP8/BF16 only for the genuinely sensitive layers.</p></li><li><p>For the tensors that can move from 16&#8209;bit to 4&#8209;bit, you get up to a 4&#215; shrink in footprint. In practice, you might see something more like 2&#8211;3&#215; effective capacity once you blend in FP8/BF16 layers, metadata, and uncompressed tensors &#8212; but even that is the difference between &#8216;HBM as a hard wall&#8217; and &#8216;HBM as a tight but manageable budget.</p></li></ul></li><li><p><strong>Pod&#8209;level HBM pooling.</strong></p><ul><li><p>Stop thinking &#8220;1 chip = 1 silo.&#8221; Treat the prefill pod as a <em>shared HBM pool</em>.</p></li><li><p>The compiler/runtime should pack layers and shards across devices so you never have one 80&#8239;GB HBM stack 30% full and another spilling.</p></li></ul></li></ul><p>The subtle but important bit: this locks in a <em>design target</em> for DeepMind and Gemini. &#8220;Train so that the prefill path runs at 4 bits&#8221; is not just a modeling trick; it&#8217;s a hard requirement to keep the HBM budget survivable.</p><p>Our contention is that better&#8209;curated, Gemini&#8209;cleaned corpora are what make 4&#8209;bit training practical at scale; the data pipeline becomes part of the hardware strategy.</p><p>FPX doesn&#8217;t touch HBM directly, but by helping Google push everything else into cheaper tiers, it gives Amin room to be ruthless here: HBM is only for this narrow prefill hot path.</p><h4>4.2 Decode (Writers): stop using HBM as KV landfill</h4><p>Decode is where current systems quietly torch HBM on the wrong work: KV cache, bloated contexts, scratch state.</p><p>For Writers:</p><ul><li><p><strong>Default assumption: KV and long context do </strong><em><strong>not</strong></em><strong> belong in HBM.</strong></p><ul><li><p>Keep the <em>immediate</em> KV needed for the next few tokens on HBM.</p></li><li><p>Everything else &#8212; older KV blocks, extended context windows, multi&#8209;agent scratchpads &#8212; gets pushed into DRAM via <strong>CXL Petabyte Shelves</strong>.</p></li></ul></li><li><p><strong>Use CXL shelves as &#8220;decode lungs&#8221;, not fake HBM.</strong></p><ul><li><p>A decode node might have 64&#8211;96&#8239;GB HBM for weights and hottest KV, and borrow 512&#8239;GB&#8211;1&#8239;TB of pooled DRAM from a CXL shelf.</p></li><li><p>CXL DRAM is ~100&#8211;200&#8239;ns slower than local DRAM, but for older KV/less frequently accessed context it&#8217;s still orders of magnitude better than going to SSD.</p></li></ul></li><li><p><strong>Compress and summarize KV aggressively.</strong></p><ul><li><p>Blockwise quantization, sparse KV, dynamic truncation, and summarization can slash the footprint.</p></li><li><p>The less KV you store per token, the less HBM/DRAM you need per 1&#8239;M tokens served.</p></li></ul></li></ul><p>Here FPX can directly help:</p><ul><li><p>FPX can work with distressed operators, OEM buyback programs, and decommissioned fleets to acquire <strong>memory&#8209;heavy servers</strong> (lots of DIMMs, older CPUs).</p></li><li><p>FPX tests and bins the DRAM, and bundles them as <strong>CXL 
shelf bricks</strong>&#8212;pre&#8209;packaged DRAM pools Google can drop behind decode pods.</p></li></ul><p>That&#8217;s a clean division of labor: Google solves the software and scheduling; FPX solves, &#8220;Where do we get a few petabytes of cheap DRAM in the next 18 months?&#8221;</p><h4>4.3 Agents (Cortex): SRAM for thought, junk for long&#8209;term memory</h4><p>Agents are where &#8220;reload everything from disk every turn&#8221; becomes untenable.</p><p>First&#8209;principles fix:</p><ul><li><p><strong>Introduce Cortex tiles &#8212; SRAM&#8209;heavy chiplets per pod.</strong></p><ul><li><p>Reserve a small number of tiles with tens/hundreds of MB of SRAM and some HBM for agent working sets.</p></li><li><p>High&#8209;value agents get pinned to a Cortex tile for the duration of their job; their active plan, stack, and short&#8209;term memory never leaves silicon.</p></li></ul></li><li><p><strong>Tier agent memory by actual access frequency:</strong></p><ul><li><p><strong>SRAM:</strong> active thought loop, recent messages, immediate code.</p></li><li><p><strong>HBM/DRAM:</strong> last N steps of history, hot tools and docs.</p></li><li><p><strong>SSD/HDD (Zombie Tier):</strong> logs, old context, rarely used references.</p></li></ul></li></ul><p>Again, this forces modeling and infra choices: agents must be designed to keep their &#8220;mind&#8221; small: compressed latent representations and summaries that fit in limited SRAM, not 10&#8239;GB JSON blobs.</p><p>FPX&#8217;s part is to make the bottom of this pyramid feel infinite and cheap:</p><ul><li><p><strong>Zombie SSD/HDD clusters</strong> right next to agent pods, built from recertified drives and retired JBODs.</p></li><li><p>Cheap capacity for all the stuff agents might need twice a month but which should never be on HBM or even DRAM.</p></li></ul><h4>4.4 The Zombie Tier: where FPX lives</h4><p>The Zombie Tier is where FPX can provide immediate, non&#8209;theoretical value:</p><ul><li><p><strong>Checkpoint Pods.</strong></p><ul><li><p>Training checkpoints are huge sequential writes with rare reads.</p></li><li><p>FPX can aggregate retired enterprise HDDs and SSDs (via direct buys, OEM recert programs, and structured buybacks), wipe and test them, then sell them back as &#8220;Checkpoint Pods&#8221; with clear SLAs: cheap, sequential, redundant.</p></li></ul></li><li><p><strong>Cold Context Pods.</strong></p><ul><li><p>Old agent histories, low&#8209;frequency RAG corpora, archived models.</p></li><li><p>Same underlying hardware, different durability/availability profile.</p></li></ul></li></ul><p>Instead of Google burning fresh NVMe at $X/TB on workloads that don&#8217;t need it, FPX feeds in Zombie capacity at a fraction of the capex and keeps that spend focused on HBM/DRAM where the whole world is truly constrained.</p><pre><code><strong>FPX Zombie Tiers: Cheap bytes that free scarce HBM.</strong>
Checkpoints want <strong>sequential I/O</strong> and redundancy, not fresh NVMe. Agent histories want <strong>cheap depth</strong>, not sub&#8209;100&#8239;ns latency. FPX aggregates <strong>recertified SSD/HDD</strong> into <strong>Checkpoint Pods</strong> and <strong>Cold Context Pods</strong> with clear SLAs, so HBM and DRAM stay reserved for prefill/compute and hot KV. Result: <strong>more usable tokens per joule</strong> and <strong>fewer wasted HBM silos</strong>.</code></pre><h4>4.5 &#8220;Zombie GPUs&#8221;: only for long&#8209;tail inference, not as cache bricks</h4><p>Your pushback is exactly right: using older GPUs as glorified HBM cache bricks is thermodynamically dumb. An A100 idling just to keep its HBM online still burns serious power; a DRAM DIMM or CXL shelf is far more efficient if all you want is capacity.</p><p>So the first&#8209;principles rule should be:</p><blockquote><p><strong>Never run a 300&#8239;W GPU just to emulate a 5&#8239;W DRAM stick.</strong></p></blockquote><p>Where &#8220;Zombie GPUs&#8221; <em>do</em> make sense is in <strong>long&#8209;tail inference</strong>, where you still need both compute and memory bandwidth, but <em>not</em> the latest perf/W:</p><ul><li><p>Serving small or mid&#8209;sized expert models, older Gemini generations, or internal tools where latency isn&#8217;t hypersensitive.</p></li><li><p>Batchy, background workloads like email summarization, internal analytics, or low&#8209;priority agents.</p></li></ul><p>These are jobs where you&#8217;d otherwise be tempted to &#8220;waste&#8221; H100s or newest TPUs. Instead, FPX can:</p><ul><li><p>Scoop up A100/H100/older TPU fleets from neoclouds and failed AI startups.</p></li><li><p>Stand them up as <strong>Long&#8209;Tail Inference Clusters</strong> in power&#8209;cheap sites.</p></li><li><p>Let Google and others offload non&#8209;critical workloads there so the newest HBM is saved for frontier pretraining and high&#8209;value inference.</p></li></ul><p>Framed that way, Zombie GPUs are not &#8220;memory nodes&#8221;; they&#8217;re a way to <strong>avoid burning fresh HBM</strong> on tasks that don&#8217;t need it. 
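</p><p>As a sanity check, here is that division of labor as a toy placement rule; the tier names, flags, and routing logic are invented for illustration and are not an FPX or Google scheduler:</p><pre><code># Illustrative placement rule: bytes-only work never lands on an accelerator.
FRONTIER   = "frontier_tpu"     # newest TPUs/GPUs with scarce HBM
LONG_TAIL  = "zombie_gpu"       # recertified A100-class accelerators
SHELF      = "cxl_dram_shelf"   # pooled DRAM behind decode pods
COLD_STORE = "ssd_hdd_pod"      # recertified drives for checkpoints and logs

def place(job):
    """job: dict with 'bytes_only', 'hot', 'latency_sensitive', 'frontier_quality'."""
    if job["bytes_only"]:
        # never run a 300 W accelerator to emulate a 5 W DIMM
        return SHELF if job["hot"] else COLD_STORE
    if job["latency_sensitive"] or job["frontier_quality"]:
        return FRONTIER
    # batchy, tolerant inference: summaries, analytics, low-priority agents
    return LONG_TAIL

print(place({"bytes_only": True, "hot": False}))   # ssd_hdd_pod
print(place({"bytes_only": False, "latency_sensitive": False,
             "frontier_quality": False}))          # zombie_gpu
</code></pre><p>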
Anything that really is &#8220;just cache&#8221; should be pushed down into DRAM/SSD via shelves and Zombie Pods.</p><pre><code><strong>FPX Long&#8209;Tail Clusters.</strong> We stand up retired accelerators in power&#8209;cheap sites for <strong>throughput LLMs, batchy internal jobs, and non&#8209;critical inference</strong>&#8212;so frontier TPUs stay on frontier work.</code></pre><h3>The FPX Layer: Turning Scrap Silicon into Usable Memory</h3><p>Across all of this, FPX&#8217;s role is pretty clear:</p><ul><li><p><strong>Used SSD/HDD sourcing and aggregation</strong></p><ul><li><p>Work directly with hyperscalers, colo providers, and OEM recert programs to set up <strong>buyback channels</strong>.</p></li><li><p>Turn that into standardized Checkpoint/Context Pod SKUs.</p></li></ul></li><li><p><strong>Used DRAM and servers for CXL shelves</strong></p><ul><li><p>Acquire memory&#8209;dense servers from distressed operators and refresh cycles.</p></li><li><p>Test, bin, and ship them as &#8220;Shelf Bricks&#8221; tailored for Google&#8209;style decode pods.</p></li></ul></li><li><p><strong>Zombie GPU clusters for long&#8209;tail inference</strong></p><ul><li><p>Aggregate old accelerators and sell them explicitly as &#8220;Good enough inference for X $/token&#8221; capacity, not as generic GPU hours.</p></li></ul></li><li><p><strong>Partner with OEMs up the chain</strong></p><ul><li><p>Cooperate with drive and memory vendors on structured buyback/refurbish programs so they get ongoing margin and FPX gets predictable supply.</p></li></ul></li></ul><p>If Google wants a first&#8209;principles answer to the memory bottleneck, it looks something like this:</p><ul><li><p>Assume <strong>HBM will stay scarce and expensive</strong>.</p></li><li><p>Design SPAD architectures &#8212; Reader, Writer, Cortex &#8212; to <strong>need as little of it as possible</strong>.</p></li><li><p>Push everything else down into DRAM and junkyard silicon.</p></li><li><p>Let FPX handle the ugly work of turning the world&#8217;s retired drives, DIMMs, and GPUs into clean, productized Zombie tiers.</p></li></ul><p>That&#8217;s how you stop treating HBM like a commodity you&#8217;ll always be able to buy more of, and start treating it like what it really is: the limiting reagent in the 1,000&#215; AI experiment.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!p3vz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9dbd26-f34f-47b7-b132-8aad9f0eca00_2816x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!p3vz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9dbd26-f34f-47b7-b132-8aad9f0eca00_2816x1536.png 424w, https://substackcdn.com/image/fetch/$s_!p3vz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9dbd26-f34f-47b7-b132-8aad9f0eca00_2816x1536.png 848w, https://substackcdn.com/image/fetch/$s_!p3vz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9dbd26-f34f-47b7-b132-8aad9f0eca00_2816x1536.png 1272w, 
https://substackcdn.com/image/fetch/$s_!p3vz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9dbd26-f34f-47b7-b132-8aad9f0eca00_2816x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!p3vz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9dbd26-f34f-47b7-b132-8aad9f0eca00_2816x1536.png" width="1456" height="794" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2f9dbd26-f34f-47b7-b132-8aad9f0eca00_2816x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:794,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:5758524,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://research.fpx.world/i/179678133?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9dbd26-f34f-47b7-b132-8aad9f0eca00_2816x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!p3vz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9dbd26-f34f-47b7-b132-8aad9f0eca00_2816x1536.png 424w, https://substackcdn.com/image/fetch/$s_!p3vz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9dbd26-f34f-47b7-b132-8aad9f0eca00_2816x1536.png 848w, https://substackcdn.com/image/fetch/$s_!p3vz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9dbd26-f34f-47b7-b132-8aad9f0eca00_2816x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!p3vz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9dbd26-f34f-47b7-b132-8aad9f0eca00_2816x1536.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h3>What this means for investors</h3><p>For investors, the memory story says: <strong>HBM is the new oil, but the real upside is in everything that makes us need less of it.</strong> The obvious plays (SK hynix, Samsung, Micron, HBM packaging/CoWoS houses) are already priced as if they are the only game in town&#8212;and they <em>are</em> the bottleneck&#8212;but the asymmetric opportunities are around them: CXL and memory&#8209;pooling silicon (Astera&#8209;style), photonic/copper interconnects that make Petabyte Shelves viable, and the testing/recertification ecosystem that can turn &#8220;used&#8221; SSDs/DRAM into Zombie Tiers with credible SLAs. The winners aren&#8217;t just the ones selling fresh HBM; they&#8217;re the ones selling <strong>more useful bytes per HBM bit</strong>&#8212;through FP4, pooling, compression, and &#8220;good enough&#8221; capacity from refurbished hardware. FPX AI sits squarely in that second category: if it becomes the default aggregator and seller of &#8220;AI&#8209;grade used memory and storage,&#8221; it captures a structural margin from everyone who is still buying brand&#8209;new flash and DRAM for checkpointing and cold data.</p><h3>What this means for data center operators</h3><p>For data center operators, the takeaway is that <strong>memory topology becomes as important as power and cooling</strong>. A campus that wants to host serious AI can&#8217;t just offer &#8220;GPU cages&#8221;; it needs <strong>SPAD&#8209;aware memory zones</strong>: high&#8209;bandwidth, low&#8209;latency adjacency between Reader pods and their HBM&#8209;first racks; separate decode/Writer rows with space and power for CXL Petabyte Shelves; and colder storage rooms (lower power density, lots of rack positions) for Zombie Tiers built from HDD/SSD. That means planning for more east&#8211;west connectivity between compute and memory racks, designing &#8220;shelf rows&#8221; and &#8220;checkpoint rows&#8221; explicitly, and being willing to let a partner like FPX drop in pre&#8209;assembled shelf bricks and checkpoint pods sourced from the secondary market. The operator that can say, &#8220;We have dedicated space, power, and fiber for your pooled DRAM and cheap checkpoint capacity, not just for your TPUs,&#8221; will win the tenants who understand that HBM is scarce and everything else has to move closer and get cheaper.</p><h3>What this means for powered&#8209;land and infrastructure owners</h3><p>For powered&#8209;land owners&#8212;people sitting on substations, stranded power, old industrial sites&#8212;the memory lens opens up a new product: <strong>&#8220;memory campuses&#8221; rather than just &#8220;compute campuses.&#8221;</strong> Petabyte Shelves and Zombie storage don&#8217;t need the same pristine, ultra&#8209;dense power and cooling as frontier HBM clusters; they need <em>lots</em> of reasonably priced MWs, floor space, and enough connectivity to sit one or two network hops from the main AI site. That makes brownfield sites with decent grid hooks and cheap land&#8212;old industrial parks, retired factories, secondary metros&#8212;ideal for DRAM shelves and checkpoint pods. 
FPX can sit in the middle: taking those powered shells, filling them with refurbished DRAM/SSD/HDD capacity and CXL&#8209;ready nodes, and then presenting them to hyperscalers and neoclouds as <strong>&#8220;off&#8209;site memory extensions&#8221;</strong> that free up premium metro real estate (and HBM budget) for Readers, Writers, and Cortex tiles. In other words: powered land that can&#8217;t justify a 100&#8239;kW/rack GPU build can still be extremely valuable&#8212;if it brands itself as the cheap, deep memory layer under someone else&#8217;s scarce HBM.</p><h2>5)  Model Evolution: Turning Physics into Intelligence</h2><p>Everything we&#8217;ve laid out so far &#8212; TPUs, power, networking, memory &#8212; buys Google <em>raw</em> 1,000&#215; compute. Whether that turns into 1,000&#215; more <strong>intelligence</strong> depends entirely on what sits on top of it: the models.</p><p>The last 18 months have quietly changed the rules here.</p><ul><li><p><strong>DeepSeek</strong> showed that standard Transformers are physically wasteful: their MoE + MLA stack cuts KV memory and active compute massively versus dense baselines while matching or beating quality.</p></li><li><p><strong>DeepSeek&#8209;R1</strong> and <strong>OpenAI o1</strong> showed that <em>thinking longer at inference</em> can matter more than making the base model bigger: chain&#8209;of&#8209;thought and test&#8209;time compute become primary scaling axes.</p></li><li><p><strong>Google&#8217;s Titans</strong> work showed how to break the context window trap with learned neural memory &#8212; separating short&#8209;term attention from long&#8209;term, updateable memory at test time.</p></li></ul><p>All three are really saying the same thing in different ways:</p><blockquote><p>The old &#8220;just add parameters and pre&#8209;training FLOPs&#8221; era is over.<br>The new frontier is <strong>physics&#8209;aware, inference&#8209;time&#8209;aware, memory&#8209;aware models</strong>.</p></blockquote><p>So let&#8217;s run the same two&#8209;step playbook we used on hardware.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NTtN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fbf2e67-bd1c-4ecf-8dae-4871f82554dc_2816x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NTtN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fbf2e67-bd1c-4ecf-8dae-4871f82554dc_2816x1536.png 424w, https://substackcdn.com/image/fetch/$s_!NTtN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fbf2e67-bd1c-4ecf-8dae-4871f82554dc_2816x1536.png 848w, https://substackcdn.com/image/fetch/$s_!NTtN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fbf2e67-bd1c-4ecf-8dae-4871f82554dc_2816x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!NTtN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fbf2e67-bd1c-4ecf-8dae-4871f82554dc_2816x1536.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!NTtN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fbf2e67-bd1c-4ecf-8dae-4871f82554dc_2816x1536.png" width="1456" height="794" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2fbf2e67-bd1c-4ecf-8dae-4871f82554dc_2816x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:794,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:5666623,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://research.fpx.world/i/179678133?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fbf2e67-bd1c-4ecf-8dae-4871f82554dc_2816x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NTtN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fbf2e67-bd1c-4ecf-8dae-4871f82554dc_2816x1536.png 424w, https://substackcdn.com/image/fetch/$s_!NTtN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fbf2e67-bd1c-4ecf-8dae-4871f82554dc_2816x1536.png 848w, https://substackcdn.com/image/fetch/$s_!NTtN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fbf2e67-bd1c-4ecf-8dae-4871f82554dc_2816x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!NTtN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fbf2e67-bd1c-4ecf-8dae-4871f82554dc_2816x1536.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h3>Phase 1 &#8211; Deconstruct to Physics: Where Models Fight the Machine</h3><p>If you strip a standard LLM + agent stack down to what it 
physically does, the misalignments with Google&#8217;s hardware are obvious:</p><ol><li><p><strong>Uniform depth &amp; dense activation</strong><br>Every token goes through the full depth of the network, and almost every parameter fires. Easy tokens (&#8220;the&#8221;, boilerplate) pay the same compute and memory cost as hard reasoning steps. That&#8217;s maximum FLOPs, maximum HBM churn, maximum SerDes activity &#8212; the opposite of what we want on a power&#8209; and bandwidth&#8209;constrained TPU fleet.</p></li><li><p><strong>KV cache and context as the only memory</strong><br>Classic Transformers stuff <em>everything</em> into the context window: recent conversation, long&#8209;term history, scratchpad, task hints. KV cache grows linearly with sequence length and heads; by the time you&#8217;re playing with 128k+ tokens, you&#8217;re essentially using HBM as a giant circular buffer. That&#8217;s exactly the resource we&#8217;ve already established is the scarcest in the stack.</p></li><li><p><strong>Answer time tied to token count</strong><br>Autoregressive generation is intrinsically serial: to produce T tokens, you pay at least T forward passes. Great for flexibility, terrible for long reasoning chains or huge code edits, and a bad match for Apollo&#8217;s scheduled fabric.</p></li><li><p><strong>No notion of &#8220;thinking time&#8221; as a separate budget</strong><br>Traditional scaling laws (Chinchilla etc.) focused almost entirely on <em>training</em> compute. R1 and o1 changed that: they show smooth gains from adding <strong>inference&#8209;time compute</strong> &#8212; longer chains of thought &#8212; even at fixed model size. But most of today&#8217;s infrastructure and models treat inference compute as a fixed cost rather than a policy decision.</p></li><li><p><strong>Agents as thin chat wrappers</strong><br>Most &#8220;agents&#8221; today are just loops around an LLM: send full context in, get text out, repeat. All the state lives in tokens and external tools. There&#8217;s no compact world model, no dedicated controller, no learned notion of state &#8212; which means massive repeated parsing and KV rebuilds.</p></li></ol><p>Contrast that with what DeepSeek and Titans are actually doing:</p><ul><li><p><strong>DeepSeek&#8209;V2/V3</strong>: MoE + MLA + mixture&#8209;of&#8209;depth &#8594; far fewer active experts per token, compressed K/V so KV cache is tiny, and conditional depth so not every token pays for every layer.</p></li><li><p><strong>DeepSeek&#8209;R1 / OpenAI o1</strong>: explicit <em>test&#8209;time compute</em> scaling; more internal reasoning steps improves performance without touching the base parameter count.</p></li><li><p><strong>Titans</strong>: short&#8209;term attention + neural long&#8209;term memory + persistent memory &#8594; context window becomes a <em>front cache</em>, not the whole story.</p></li></ul><p>All three are directly attacking the same physical constraints we&#8217;ve been talking about:</p><ul><li><p>HBM and CXL shelves are scarce &#8594; compress KV, use sparsity, separate long&#8209;term memory.</p></li><li><p>Network bandwidth is constrained and scheduled &#8594; avoid dense all&#8209;to&#8209;all every token.</p></li><li><p>Power ceilings are real &#8594; don&#8217;t fire all experts and all layers on every step.</p></li><li><p>Memory hierarchy is layered &#8594; stop using context windows as the only &#8220;memory.&#8221;</p></li></ul><p>That&#8217;s Phase 1. 
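</p><p>Before moving on, it is worth putting rough numbers on point 2 above. A back&#8209;of&#8209;envelope KV&#8209;cache calculation, using a made&#8209;up dense&#8209;70B&#8209;style shape rather than any specific Gemini configuration:</p><pre><code># Back-of-envelope KV-cache arithmetic; all shapes here are illustrative.
def kv_cache_gb(seq_len, n_layers=80, n_kv_heads=8, head_dim=128,
                bytes_per_elem=2, batch=1):
    # 2 tensors (K and V) per layer, per token, per KV head, at 2 bytes each
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return batch * seq_len * per_token / 1e9

for tokens in (8_000, 128_000, 1_000_000):
    print(f"{tokens:>9} tokens -> {kv_cache_gb(tokens):6.1f} GB of KV per sequence")
</code></pre><p>Even under these assumed shapes, 128k tokens of KV is already a meaningful slice of an 80&#8211;192&#8239;GB HBM stack for a <em>single</em> sequence, which is what using HBM as a giant circular buffer means in practice.</p><p>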
Now the fun part is Phase 2: how Google can rebuild models to <em>fit</em> the stack it actually has.</p><h3>Phase 2 &#8211; Rebuild to Economics: Optimization Breakthroughs That Align with the Stack</h3><p>Here are the key moves that matter and age well &#8212; the ones that should guide DeepMind&#8217;s roadmap if it wants to be directionally correct for the next decade.</p><h4>1. Make DeepSeek&#8209;style sparsity and MLA the default</h4><p>Core idea: <strong>pay only for the hard parts of the sequence, and move less KV.</strong></p><ul><li><p><strong>Mixture&#8209;of&#8209;Experts (MoE)</strong>: activate only a handful of experts per token, and learn routing that respects TPU pod boundaries and Virtual Wafer locality. Experts become &#8220;organs&#8221; pinned to specific slices of the fabric, not amorphous weights scattered everywhere.</p></li><li><p><strong>Mixture&#8209;of&#8209;Depth</strong>: easy tokens exit early; only difficult regions see full depth. That shrinks average FLOPs/token and cuts HBM traffic.</p></li><li><p><strong>MLA or MLA&#8209;like attention</strong>: compress keys/values into a low&#8209;rank latent space so KV cache shrinks dramatically and attention becomes more compute&#8209;bound than bandwidth&#8209;bound.</p></li></ul><p>This is exactly what DeepSeek has already proven under sanctions; Google should treat it as <strong>table stakes</strong>, not a curiosity. New Gemini/Titans families should have to justify any departure from &#8220;sparse + low&#8209;rank KV&#8221; the way a chip designer has to justify adding a big new block on die.</p><h4>2. Treat Titans Memory as the answer to context, not bigger windows</h4><h4>Core idea: <strong>context windows are L1 cache, not memory.</strong></h4><p>Titans already sketches the right structure:</p><ul><li><p><strong>Short&#8209;term attention</strong> for immediate local context.</p></li><li><p><strong>Neural long&#8209;term memory</strong> that learns what to store/retrieve across turns and tasks.</p></li><li><p><strong>Persistent memory</strong> encoding more stable knowledge.</p></li></ul><p>Map that onto the hardware we&#8217;ve designed:</p><ul><li><p>Short&#8209;term attention &#8596; HBM + on&#8209;chip SRAM (Readers/Writers/Cortex).</p></li><li><p>Long&#8209;term memory &#8596; DRAM CXL shelves that FPX helps Google deploy around decode/agent pods.</p></li><li><p>Persistent memory &#8596; SSD/HDD tiers (including Zombie storage) plus external knowledge stores.</p></li></ul><p>Design principle:</p><blockquote><p>New models must be built assuming<br><em>&#8220;long&#8209;term memory lives in Titans, not in the context window.&#8221;</em></p></blockquote><p>That means training models to write compact latent summaries into Titans Memory and learn to retrieve them &#8212; rather than stuffing everything into a sliding 1M&#8209;token window and hoping more HBM shows up from SK hynix.</p><h4>3. 
Embrace thinking&#8209;time arbitrage with Cortex tiles</h4><p>Core idea: <strong>System&#8239;1 vs System&#8239;2 should be a </strong><em><strong>hardware</strong></em><strong> concept, not just a metaphor.</strong></p><p>R1 and o1 show that for hard tasks, it&#8217;s better to keep the model size fixed and <strong>spend more compute at inference</strong> on chain&#8209;of&#8209;thought reasoning.</p><p>That matches the <strong>Cortex tile</strong> concept perfectly:</p><ul><li><p>We dedicate small, SRAM&#8209;heavy, stateful TPUs as &#8220;thinking cores&#8221; where the model can unroll long chains of thought without thrashing HBM or the network.</p></li><li><p>System&#8239;1 traffic (most queries) never hits these tiles: it goes through small, cheap models on Zombie GPUs or low&#8209;power TPUs with tight latency budgets.</p></li><li><p>System&#8239;2 traffic (hard, high&#8209;value queries) is explicitly routed to Cortex tiles, with a configurable thinking&#8209;time budget: &#8220;you can use up to N seconds and M joules to think before answering.&#8221;</p></li></ul><p>In this world, <strong>inference&#8209;time compute becomes a dial</strong> the orchestrator can turn, backed by physical resources that make long reasoning efficient rather than catastrophic for HBM and power. That&#8217;s exactly what o1/R1 discovered in software; Google has the chance to give it a proper home in hardware.</p><h4>4. Add fixed&#8209;depth logic modules: diffusion&#8209;style refinement where it helps</h4><p>Core idea: <strong>turn some long outputs into O(1)&#8209;depth problems.</strong></p><p>Autoregressive LLMs are inherently O(T) in depth for T tokens. For certain problems &#8212; structured plans, proofs, code edits, config diffs &#8212; we can do better by:</p><ul><li><p>Letting a base model propose a full candidate solution.</p></li><li><p>Running a <strong>diffusion/flow&#8209;style module</strong> that refines that candidate in a small, fixed number of steps, independent of length.</p></li><li><p>Verifying or scoring the result with another pass if needed.</p></li></ul><p>This is a good fit for <strong>Reader TPUs and Apollo</strong>: big, regular matmuls over long sequences in a fixed, schedulable number of steps. It doesn&#8217;t replace LLMs; it gives Google a way to offload certain &#8220;long but structured&#8221; tasks from the serial, attention&#8209;heavy path onto a fixed&#8209;depth refinement engine.</p><p>The implementation details will change, but the principle will age well:</p><blockquote><p><strong>Wherever you can turn &#8220;token by token&#8221; into &#8220;refine a full candidate in K steps,&#8221;<br>do it. It aligns with scheduled optics, power&#8209;gating, and high FLOP/s TPUs.</strong></p></blockquote><h4>5. Use world models for agents, not chats in a loop</h4><p>Core idea: <strong>agents should simulate next state, not re&#8209;parse their entire life every turn.</strong></p><p>Right now, many &#8220;agents&#8221; are just LLMs in a tool loop: stuff history into context, get a response, repeat. 
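</p><p>A caricature of that pattern, as a runnable Python sketch (the call_llm stub stands in for any hosted model API and is an illustration, not a real interface): the whole transcript is re-sent every turn, so prompt length, parsing work, and the KV cache that must be rebuilt all grow with the life of the agent.</p><pre><code># Minimal sketch of the "thin chat wrapper" agent pattern described above.
# call_llm is a stand-in for a hosted model call; the point is the token growth.

def call_llm(prompt: str) -> str:
    # Placeholder: a real call would re-process (and re-cache) every prompt token.
    return f"ack:{len(prompt)}"

def chat_loop_agent(observations: list[str]) -> list[str]:
    history: list[str] = []
    replies: list[str] = []
    for obs in observations:
        history.append(obs)
        prompt = "\n".join(history)   # full transcript, every single turn
        reply = call_llm(prompt)      # prefill cost grows with every prior turn
        history.append(reply)
        replies.append(reply)
        print(f"turn {len(replies)}: prompt is {len(prompt):7d} chars")
    return replies

chat_loop_agent([f"observation {i}: " + "x" * 2_000 for i in range(5)])
</code></pre><p>Because the re-processed prompt grows linearly per turn, cumulative prefill work grows roughly quadratically over the life of the agent, which is the repeated parsing and KV rebuilding called out in point 5 of the list above.</p><p>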
World models give you a cleaner, more physical alternative:</p><ul><li><p>Maintain a compact latent <strong>world state</strong> for the agent: beliefs, goals, working memory.</p></li><li><p>At each step, predict the <strong>next state</strong> given an action and new observations, rather than recomputing everything from raw text.</p></li><li><p>Use LLMs as tools &#8212; to interpret inputs, generate text, or solve subproblems &#8212; not as the main state machine.</p></li></ul><p>In our stack:</p><ul><li><p>The world model and state live on <strong>Cortex tiles + Titans Memory</strong>.</p></li><li><p>The LLM (Gemini/Titans) is a <em>callable expert</em>, not the agent&#8217;s core.</p></li><li><p>Petabyte Shelves and Zombie storage hold long&#8209;term logs, documents, and knowledge; the world model keeps a compressed working set in fast memory.</p></li></ul><p>That gives Google a path to <strong>long&#8209;running, low&#8209;power agents</strong> that measure progress in &#8220;episodes per joule,&#8221; not just &#8220;tokens per dollar&#8221; &#8212; an increasingly important metric as compute and power budgets tighten.</p><h4>6. Co&#8209;Design or Lose: Why This Matters Uniquely for Google</h4><p>DeepSeek&#8217;s constraints were geopolitical: no top&#8209;end GPUs, hard FLOP ceilings. Their reaction &#8212; MLA, MoE, training and inference&#8209;time efficiency &#8212; is what you get when physics and policy are hard walls.</p><p>Google&#8217;s constraints are different but just as real:</p><ul><li><p>HBM comes from a duopoly and CoWoS capacity, not from wishes.</p></li><li><p>Power and cooling are gated by nuclear/geothermal projects, brownfield deals, and FPX&#8209;style scavenging of generators and substations.</p></li><li><p>Networks are bounded by what Apollo/Firefly/Huygens and photonics can realistically deliver.</p></li><li><p>Memory beyond HBM depends on DRAM shelves, used SSD/HDD pools, and clever Titans&#8209;style architectures.</p></li></ul><p>The advantage &#8212; and the obligation &#8212; is that <strong>Google controls the whole stack</strong>: TPUs, Axion, Apollo, Jupiter, Kairos/Fervo power, Titans Memory, Gemini/Titans models, agents, and (through partners like FPX) even the used hardware and powered land beneath it.</p><p>If it treats models as &#8220;just another customer of compute,&#8221; it will keep getting out&#8209;maneuvered by labs that <em>have</em> to be more efficient. If it treats <strong>model evolution as a co&#8209;design problem with chips, power, network, and memory</strong>, it can still be the place where physics bends in its favor.</p><p>The optimization breakthroughs are clear:</p><ul><li><p><strong>Phase 1 (Deconstruct)</strong>: admit that vanilla Transformers and naive agents are fighting the hardware &#8212; wasting HBM, abusing context windows, and ignoring inference&#8209;time compute as a controllable axis.</p></li><li><p><strong>Phase 2 (Rebuild)</strong>: make DeepSeek&#8209;style sparsity/MLA, Titans&#8209;style memory, thinking&#8209;time arbitrage on Cortex tiles, fixed&#8209;depth refinement modules, and world&#8209;model&#8209;based agents the <em>default design assumptions</em> for the next generation of Gemini and beyond.</p></li></ul><p>If Google gets that right, Amin&#8217;s &#8220;double serving capacity every ~6 months&#8221; target stops being a pure infrastructure race and becomes what it should be: a <strong>joint optimization of physics and algorithms</strong>. 
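</p><p>To make the Phase 2 moves concrete, especially thinking-time arbitrage (move 3) and world-model agents (move 5), here is an illustrative Python skeleton. The class names, thresholds, and budget parameter are assumptions for the sketch, not an FPX or Google interface: the agent keeps a compact state, updates it cheaply each step, and only escalates to an expensive System 2 call when its own uncertainty justifies spending the thinking budget.</p><pre><code># Illustrative skeleton: a stateful agent with a compact world state and an
# explicit thinking-time budget for System-2 escalation. Names and thresholds
# are assumptions for this sketch, not a production API.
from dataclasses import dataclass, field

@dataclass
class WorldState:
    beliefs: dict = field(default_factory=dict)  # compact latent summary, not raw transcripts
    uncertainty: float = 1.0

def cheap_update(state: WorldState, observation: str) -> WorldState:
    # System 1: small model or learned state transition; fast, low power.
    state.beliefs["last_obs"] = observation[:64]
    state.uncertainty *= 0.8
    return state

def deep_reasoning(state: WorldState, budget_s: float) -> WorldState:
    # System 2: long chain-of-thought on a Cortex-style tile, bounded by budget_s.
    state.beliefs["plan"] = f"plan refined with {budget_s:.1f}s of thinking"
    state.uncertainty = 0.1
    return state

def agent_step(state: WorldState, observation: str, budget_s: float = 5.0) -> WorldState:
    state = cheap_update(state, observation)
    if state.uncertainty > 0.5:      # escalate only when it is worth the joules
        state = deep_reasoning(state, budget_s)
    return state

state = WorldState()
for obs in ["ticket opened", "logs attached", "user replied"]:
    state = agent_step(state, obs)
    print(round(state.uncertainty, 2), state.beliefs.get("plan"))
</code></pre><p>The toy logic is beside the point; the interface is the point. Thinking time becomes an explicit, billable argument the orchestrator sets per request, which is what treating inference-time compute as a dial looks like in code.</p><p>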
The 1,000&#215; story then isn&#8217;t &#8220;1,000&#215; more FLOPs&#8221;; it&#8217;s &#8220;1,000&#215; more useful reasoning per joule, per byte of HBM, per meter of fiber&#8221; &#8212; and that&#8217;s the only version that will actually scale.</p><h3>What this means for investors</h3><p>The big shift is that &#8220;buy GPUs&#8221; is no longer the whole trade. If DeepSeek, Qwen, o1, Titans, etc. are directionally right, <strong>value migrates from raw flops to everything that lets you use flops selectively</strong>:</p><ul><li><p><strong>Routing and orchestration</strong> &#8211; Systems that can steer traffic between System&#8239;1 and System&#8239;2 lanes, between Reader/Writer/Cortex models, and across different hardware tiers. That&#8217;s not just pretty dashboards; it&#8217;s deep stack software that decides <em>which</em> model runs <em>where</em> and <em>how much</em> &#8220;thinking time&#8221; it gets.</p></li><li><p><strong>Memory&#8209; and sparsity&#8209;aware model toolchains</strong> &#8211; Compilers, runtime libraries, and training stacks that make MLA, MoE, Mixture&#8209;of&#8209;Depths, Titans&#8209;style memory and quantization &#8220;just work.&#8221; Whoever owns the toolchain that squeezes 2&#8211;5&#215; more tokens per joule out of the same silicon will quietly capture huge economic rent.</p></li><li><p><strong>Reasoning&#8209;friendly infrastructure</strong> &#8211; Anything that makes long chain&#8209;of&#8209;thought economically viable: SRAM&#8209;rich accelerators, CXL memory appliances, ultra&#8209;low&#8209;latency interconnect, and scheduling systems that can sell &#8220;thinking time&#8221; as a product.</p></li></ul><p>The mental model: <strong>throughput LLMs</strong> (cheap, small, System&#8239;1) look like a commoditizing business over time; <strong>reasoning LLMs</strong> (System&#8239;2, agents, world models) look more like bespoke high&#8209;margin services that sit on top of the same hardware. The interesting equity and venture exposure is in the glue: companies and platforms that can (a) route work between the two, (b) compress model demand into whatever hardware/memory is actually available, and (c) expose that as predictable, billable SKUs.</p><p>FPX is naturally on that &#8220;glue&#8221; side: we&#8217;re already mapping where the power, HBM and DRAM <em>physically</em> are and which workloads fit where. The next step is pairing that with model&#8209;aware economics: flag which sites are ideal for cheap System&#8239;1 throughput (Zombie GPU farms, DRAM&#8209;heavy shelves) and which are candidates for high&#8209;value System&#8239;2 clusters (low&#8209;latency, good cooling, close to Titans&#8209;style memory).</p><h3>What this means for data center operators</h3><p>If model evolution goes the way we&#8217;ve described, your customer&#8217;s fleet won&#8217;t be &#8220;a bunch of identical LLM servers&#8221; for very long. You&#8217;re going to see <strong>three physically different classes of AI load</strong>:</p><ol><li><p><strong>Readers / bulk prefill</strong> &#8211; dense, high&#8209;FLOP, bandwidth&#8209;hungry, but latency&#8209;tolerant. These want big, liquid&#8209;cooled racks, strong optical backbones, and tight integration with training clusters.</p></li><li><p><strong>Writers / decode &amp; memory&#8209;heavy inference</strong> &#8211; relatively modest compute but huge appetite for HBM, DRAM shelves, and local storage. 
These want well&#8209;connected, memory&#8209;centric pods with room for CXL appliances and shelves.</p></li><li><p><strong>Cortex / agents &amp; reasoning</strong> &#8211; smaller models that run for longer, need SRAM&#8209; or DRAM&#8209;rich sockets, and sit next to long&#8209;term memory. These want lower&#8209;density but very &#8220;clean&#8221; latency and strong connectivity to both compute and storage.</p></li></ol><p>Practically, that means:</p><ul><li><p>Stop designing homogeneous AI halls. Start carving out <strong>Reader, Writer, and Cortex zones</strong> with different power densities, cooling provisions, and interconnect assumptions.</p></li><li><p>Expect <strong>System&#8239;2 SLAs</strong>: tenants will ask not just for MW and racks, but &#8220;X MW of low&#8209;latency reasoning capacity with Y seconds of guaranteed thinking time per request.&#8221; That has implications for how you reserve headroom, design your internal fabrics, and write your contracts.</p></li><li><p>Treat <strong>near&#8209;memory space</strong> as a first&#8209;class product. If Titans&#8209;style memory and world&#8209;model&#8209;based agents land, tenants will value racks that are physically close (fiber&#8209;wise and latency&#8209;wise) to DRAM shelves and SSD/HDD tiers as much as they value proximity to GPU cages.</p></li></ul><p>Operators that can walk into a hyperscaler or big AI tenant and say, &#8220;Here is a floorplan and power profile specifically tailored for your Reader/Writer/Cortex split, with room for reasoning pods and memory shelves&#8221; will out&#8209;compete those still selling generic &#8220;AI&#8209;ready&#8221; white space.</p><p>FPX can help you pre&#8209;design that: we already see how workloads and chips are evolving; the next step is working with you to tag each building, hall, and row as System&#8239;1&#8209;optimised, System&#8239;2&#8209;optimised, memory&#8209;optimised, or some blend.</p><h3>What this means for colo providers and developers</h3><p>For colo and powered&#8209;land developers, the model story tells you <strong>which boxes you should be trying to host where</strong>.</p><ul><li><p><strong>System&#8239;1 / throughput LLMs</strong> &#8211; Small and mid&#8209;sized models doing summarization, classification, basic chat. These can run on older accelerators, refurbished gear, and DRAM&#8209;heavy nodes, provided they have decent networking. This is where you monetize Zombie GPUs and &#8220;good enough&#8221; power/cooling at scale.</p></li><li><p><strong>System&#8239;2 / reasoning and agents</strong> &#8211; High&#8209;value, low&#8209;volume workloads that want longer thinking time, tight latency envelopes, and proximity to long&#8209;term memory. These will justify premium pricing on a smaller number of racks, especially in metros close to customers or data sources.</p></li><li><p><strong>Memory and state planes</strong> &#8211; Titans&#8209;like long&#8209;term memory, Petabyte shelves, checkpoint and log tiers. 
These don&#8217;t need 100&#8239;kW/rack, but they do need a lot of space, power at a sane price, and good fiber to the compute sites.</p></li></ul><p>What you can do with that:</p><ul><li><p>Start positioning sites as <strong>profiles</strong>, not just locations:</p><ul><li><p>&#8220;This campus is a System&#8239;1 farm: great for cheap, large&#8209;scale inference on smaller models.&#8221;</p></li><li><p>&#8220;This hall is a Cortex zone: low&#8209;latency, agent&#8209;friendly, near a shared memory shelf.&#8221;</p></li><li><p>&#8220;This building is a memory campus: ideal for Titans&#8209;style state and world&#8209;model storage, tethered to compute a few ms away.&#8221;</p></li></ul></li><li><p>Work with FPX to <strong>bundle physical characteristics with model expectations</strong>: power density, cooling type, fiber routes, and even used hardware (SSDs/HDDs/DRAM) mapped to the right class of models. That lets you pitch: &#8220;You don&#8217;t just get 10&#8239;MW and 200 racks; you get a ready&#8209;made environment for decode pods and agent controllers aligned with your model roadmap.&#8221;</p></li><li><p>Think in terms of <strong>&#8220;model envelopes&#8221;</strong>, not just &#8220;AI capacity&#8221;: when a tenant says &#8220;we&#8217;re rolling out o1&#8209;style reasoning&#8221; or &#8220;we&#8217;re going heavier on DeepSeek&#8209;like MoE + MLA,&#8221; you can respond with a concrete offer: which of your sites can actually host that style of workload efficiently, and what the economics look like.</p></li></ul><p>If chips were the story of 2023 and power was the story of 2024, <strong>models + infra co&#8209;design will be the story of the second half of the decade</strong>. Investors who understand that will look for businesses that sit at that interface. Operators and developers who understand it will stop selling generic &#8220;GPU space&#8221; and start selling <strong>places for specific kinds of intelligence to live</strong>. FPX&#8217;s role is to sit in the middle: translate model roadmaps into concrete requirements on land, power, cooling, memory, and used hardware&#8212;and then help you source, design, and fill the right kind of space for what&#8217;s actually coming.</p><p></p><h1><strong>Climbing the Kardashev Curve: Google&#8217;s Next Decade</strong></h1><p>To make all of this concrete, it helps to think in overlapping horizons. <strong>Horizon&#8239;1 (2025&#8211;2029)</strong> is the &#8220;no&#8209;excuses&#8221; window: everything here is just hard engineering, not science fiction. 
Chiplets and Active Bridges, SPAD&#8209;style Reader/Writer ratios, Apollo&#8209;class OCS with Huygens&#8209;grade time&#8209;sync and SerDes power&#8209;gating, CXL Petabyte Shelves, zombie HDD/SSD for checkpoints, zombie GPUs for long&#8209;tail inference, DeepSeek&#8209;style sparsity and MLA, Titans&#8209;style learned memory, and System&#8239;2 reasoning on SRAM&#8209;heavy Cortex tiles &#8212; all of that is achievable with today&#8217;s physics and supply chains if Google is willing to question every assumption and delete anything that doesn&#8217;t serve electrons, photons, or heat.</p><p><strong>Horizon&#8239;2 (the 2030s)</strong> is where the stack starts to look like infrastructure, not improvisation: SMR campuses and enhanced geothermal co&#8209;designed with liquid&#8209;to&#8209;liquid loops, large&#8209;scale Rainbow/photonic fabrics and bigger CXL memory planes spanning entire sites, agents built around explicit world models instead of chat loops, early Suncatcher&#8209;style training testbeds in orbit or at extreme stranded&#8209;power sites. <strong>Horizon&#8239;3</strong> is the deep R&amp;D frontier: analog in&#8209;network compute where the fabric does math, cryogenic or superconducting pods, bio&#8209;hybrid controllers, and full orbital training constellations. The point of laying it out this way isn&#8217;t to push the hard stuff into the future; it&#8217;s the opposite. The only way to hit Amin&#8217;s 1,000&#215; and then 1,000,000&#215; is to compress the cycle time between these horizons &#8212; to treat SMRs, Rainbow fabrics, Titans memory, System&#8239;2 agents, and even orbital power as iterative engineering programs, not distant dreams. The future stack is already visible in the physics; the only real question for Google is how fast it&#8217;s willing to move to meet it.</p><h3>Recalibrating the Machine, Resetting the Horizon</h3><p>Google is at an inflection point. Amin&#8217;s 1,000&#215; mandate isn&#8217;t just a capacity target; it&#8217;s a test of whether the company can realign itself around physics fast enough to matter in the next era of AI. The constraints are now brutally clear: HBM is cartel&#8209;limited, power is grid&#8209;limited, bandwidth is SerDes&#8209;limited, and models are still mostly pretending those limits don&#8217;t exist. Closing that gap is how Google stops being &#8220;compute constrained&#8221; and starts compounding again.</p><p>The good news is that no one on Earth is better positioned to do this than Google. It already owns the full stack: TPUs, Axion, Titanium, Apollo, Huygens&#8209;class time&#8209;sync, Jupiter fabrics, Kairos/Fervo power bets, Titans memory, Gemini and DeepMind, plus a global fleet of data centers. The bad news is that this only works if those pieces stop behaving like five separate empires and start acting like one product. The same first&#8209;principles test that we applied to chips, power, networks, memory, and models has to be applied to org charts and incentives: if it doesn&#8217;t serve electrons, photons, or heat then delete it. If it doesn&#8217;t lower the Idiot Index per token we redesign it.</p><p>For everyone outside Google, this isn&#8217;t a closed&#8209;door story; it&#8217;s a giant &#8220;help wanted&#8221; sign. Data center operators, powered&#8209;land owners, and developers can turn stranded megawatts, brownfield substations, and awkward shells into training factories, decode hubs, and memory campuses that plug directly into Google&#8217;s virtual wafers. 
Neoclouds and colo providers can stop selling generic &#8220;GPU space&#8221; and start selling Reader/Writer/Cortex zones that mirror Google&#8217;s SPAD split. Supply&#8209;chain and infra players: from photonics and CXL shelves to boneyard turbines, zombie HDD/SSD fleets, and used GPU pools, can become the specialized organs that feed Google&#8217;s hypercomputer instead of fighting it.</p><p>That&#8217;s where FPX sits in this picture: as the glue layer between the physics and the market. Our job is to find and assemble the weird, messy assets like the stranded power, the distressed data centers, the recertified drives, the retired accelerators, the duct banks and queue positions and expose them as clean, standardized envelopes Google (and others) can snap into their 1,000&#215; roadmap. Not investment advice &#8212; but if this works, the upside won&#8217;t just accrue to a single chip vendor; it will flow through everyone who helps turn scrap, lagging infra, and overlooked land into real, scheduled compute.</p><p>The future stack is already visible in the constraints: chiplets and SPAD ratios in Horizon&#8239;1, SMR&#8209; and Rainbow&#8209;based campuses in Horizon&#8239;2, analog fabrics and orbital training in Horizon&#8239;3. The only real question is how aggressively Google wants to pull those horizons forward. If it leans into first&#8209;principles co&#8209;design, silicon with models, power with cooling, networks with memory, agents with state the 1,000&#215; target stops being a fire&#8209;drill and becomes the on&#8209;ramp to the next three orders of magnitude. The future isn&#8217;t &#8220;later&#8221; anymore; the physics is here now. The ecosystem that helps Google bend to it like operators, neoclouds, colo builders, suppliers, and scrappy aggregators like FPX get to help write what comes after Moore&#8217;s Law.</p><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://research.fpx.world/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">This Substack is reader-supported. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Part 1 : Beyond Power: The AI Memory Bottlenecks Investors Are Missing]]></title><description><![CDATA[This series will map the non&#8209;power bottlenecks&#8212;what they are, who controls them, and how to invest around them.]]></description><link>https://research.fpx.world/p/part-1-beyond-power-the-ai-memory</link><guid isPermaLink="false">https://research.fpx.world/p/part-1-beyond-power-the-ai-memory</guid><dc:creator><![CDATA[FPX AI]]></dc:creator><pubDate>Wed, 29 Oct 2025 22:12:26 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/3d4d118a-5697-49d3-b503-031c150e8c7f_1232x928.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Everyone&#8217;s staring at substations and megawatts. 
<strong>Sam Altman flew for memory.</strong><br>In <strong>October 2025</strong> he was in Seoul&#8212;not to buy compute&#8212;but to <strong>pre&#8209;allocate DRAM/HBM</strong> from <strong>Samsung</strong> and <strong>SK hynix</strong> for OpenAI&#8217;s &#8220;Stargate&#8221; build&#8209;out. South Korean officials talked up <strong>900,000 wafers in 2029</strong>; OpenAI framed it as <strong>targeting 900,000 DRAM wafer starts per month</strong>. Quibble over cadence if you like; the move is the message: <strong>the scarce layer sets the pace.</strong> </p><p><strong>This series maps the non&#8209;power bottlenecks&#8212;what they are, who controls them, and how to invest around them.</strong> Power caps tell you <strong>where</strong> a data center can exist. <strong>Memory and packaging decide whether those racks do useful work.</strong> As GPU fleets scale into the multi&#8209;gigawatt era, the constraint shifts upstream&#8212;<strong>HBM stacks, CoWoS&#8209;class packaging, ABF substrates, and the flash+disk that feed them</strong></p><p>In this first report, we take a ground&#8209;up look at:</p><ul><li><p><strong>What the three memory tiers actually are</strong>&#8212;how they differ, and what they do.</p></li><li><p><strong>How each is manufactured</strong>, from raw materials to advanced packaging.</p></li><li><p><strong>Where the next shortages will emerge</strong> as GPU deployments accelerate.</p></li><li><p><strong>Who stands to win</strong> from the widening imbalance between compute and memory capacity.</p></li></ul><h2><strong>The Three Tiers of AI Memory</strong></h2><p>To understand AI memory systems, imagine a busy Restuarant from the Chef&#8217;s (GPU&#8217;s) perspective</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JsUW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F639dac25-c35c-44f7-9c64-a1c8f7fa5e47_1418x1412.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JsUW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F639dac25-c35c-44f7-9c64-a1c8f7fa5e47_1418x1412.png 424w, https://substackcdn.com/image/fetch/$s_!JsUW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F639dac25-c35c-44f7-9c64-a1c8f7fa5e47_1418x1412.png 848w, https://substackcdn.com/image/fetch/$s_!JsUW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F639dac25-c35c-44f7-9c64-a1c8f7fa5e47_1418x1412.png 1272w, https://substackcdn.com/image/fetch/$s_!JsUW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F639dac25-c35c-44f7-9c64-a1c8f7fa5e47_1418x1412.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JsUW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F639dac25-c35c-44f7-9c64-a1c8f7fa5e47_1418x1412.png" width="1418" height="1412" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/639dac25-c35c-44f7-9c64-a1c8f7fa5e47_1418x1412.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1412,&quot;width&quot;:1418,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:179509,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://research.fpx.world/i/176580599?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F639dac25-c35c-44f7-9c64-a1c8f7fa5e47_1418x1412.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JsUW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F639dac25-c35c-44f7-9c64-a1c8f7fa5e47_1418x1412.png 424w, https://substackcdn.com/image/fetch/$s_!JsUW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F639dac25-c35c-44f7-9c64-a1c8f7fa5e47_1418x1412.png 848w, https://substackcdn.com/image/fetch/$s_!JsUW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F639dac25-c35c-44f7-9c64-a1c8f7fa5e47_1418x1412.png 1272w, https://substackcdn.com/image/fetch/$s_!JsUW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F639dac25-c35c-44f7-9c64-a1c8f7fa5e47_1418x1412.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Memory Hierarchy</figcaption></figure></div><h3><strong>1. HBM &#8212; The Chef&#8217;s Cutting Board</strong></h3><p>Imagine the chef&#8217;s cutting board: it&#8217;s where every motion happens&#8212;precise, hot, and instantaneous. Space is limited, but every millimeter counts. 
That&#8217;s <strong>High-Bandwidth Memory (HBM)</strong>.</p><p>HBM sits <strong>directly beside the GPU core</strong>, bonded through a <strong>silicon interposer</strong> using <strong>advanced packaging technologies</strong> like <strong>TSMC&#8217;s CoWoS</strong> or <strong>Samsung&#8217;s I-Cube</strong>. Each stack is made of <strong>multiple DRAM layers</strong> vertically linked by <strong>through-silicon vias (TSVs)</strong>&#8212;essentially microscopic elevators for electrons.</p><ul><li><p>GPUs compute at petaflop scales, but electrons can only move so fast.</p></li><li><p>The further data travels, the longer the latency.</p></li><li><p>Traditional DRAM, sitting inches away on a motherboard, becomes the bottleneck.</p></li></ul><p>By moving memory <em>on-package</em>, HBM cuts the distance from inches to millimeters, increasing bandwidth from gigabytes to <strong>terabytes per second</strong>&#8212;the difference between slicing vegetables at the counter versus running to the walk-in fridge every time.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!x6E8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aba3d93-34e9-4d19-b59a-810a0bfda45f_1536x1768.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!x6E8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aba3d93-34e9-4d19-b59a-810a0bfda45f_1536x1768.png 424w, https://substackcdn.com/image/fetch/$s_!x6E8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aba3d93-34e9-4d19-b59a-810a0bfda45f_1536x1768.png 848w, https://substackcdn.com/image/fetch/$s_!x6E8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aba3d93-34e9-4d19-b59a-810a0bfda45f_1536x1768.png 1272w, https://substackcdn.com/image/fetch/$s_!x6E8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aba3d93-34e9-4d19-b59a-810a0bfda45f_1536x1768.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!x6E8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aba3d93-34e9-4d19-b59a-810a0bfda45f_1536x1768.png" width="1456" height="1676" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6aba3d93-34e9-4d19-b59a-810a0bfda45f_1536x1768.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1676,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:268738,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://research.fpx.world/i/176580599?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aba3d93-34e9-4d19-b59a-810a0bfda45f_1536x1768.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!x6E8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aba3d93-34e9-4d19-b59a-810a0bfda45f_1536x1768.png 424w, https://substackcdn.com/image/fetch/$s_!x6E8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aba3d93-34e9-4d19-b59a-810a0bfda45f_1536x1768.png 848w, https://substackcdn.com/image/fetch/$s_!x6E8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aba3d93-34e9-4d19-b59a-810a0bfda45f_1536x1768.png 1272w, https://substackcdn.com/image/fetch/$s_!x6E8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aba3d93-34e9-4d19-b59a-810a0bfda45f_1536x1768.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NmR_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0649a016-525e-4721-8ba4-2f2ae3a2acbe_1594x916.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NmR_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0649a016-525e-4721-8ba4-2f2ae3a2acbe_1594x916.png 424w, https://substackcdn.com/image/fetch/$s_!NmR_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0649a016-525e-4721-8ba4-2f2ae3a2acbe_1594x916.png 848w, https://substackcdn.com/image/fetch/$s_!NmR_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0649a016-525e-4721-8ba4-2f2ae3a2acbe_1594x916.png 1272w, 
https://substackcdn.com/image/fetch/$s_!NmR_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0649a016-525e-4721-8ba4-2f2ae3a2acbe_1594x916.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NmR_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0649a016-525e-4721-8ba4-2f2ae3a2acbe_1594x916.png" width="1456" height="837" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0649a016-525e-4721-8ba4-2f2ae3a2acbe_1594x916.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:837,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;High-Bandwidth Memory (HBM) - Semiconductor Engineering&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="High-Bandwidth Memory (HBM) - Semiconductor Engineering" title="High-Bandwidth Memory (HBM) - Semiconductor Engineering" srcset="https://substackcdn.com/image/fetch/$s_!NmR_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0649a016-525e-4721-8ba4-2f2ae3a2acbe_1594x916.png 424w, https://substackcdn.com/image/fetch/$s_!NmR_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0649a016-525e-4721-8ba4-2f2ae3a2acbe_1594x916.png 848w, https://substackcdn.com/image/fetch/$s_!NmR_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0649a016-525e-4721-8ba4-2f2ae3a2acbe_1594x916.png 1272w, https://substackcdn.com/image/fetch/$s_!NmR_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0649a016-525e-4721-8ba4-2f2ae3a2acbe_1594x916.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" 
x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">HBM Architecture</figcaption></figure></div><p>But HBM is scarce because:</p><ul><li><p><strong>Capacity is tiny</strong>&#8212;hundreds of GB per GPU versus terabytes on disks.</p></li><li><p><strong>Manufacturing is complex</strong>&#8212;stacks are 3D bonded, yields are low.</p></li><li><p><strong>Packaging is limited</strong>&#8212;CoWoS and ABF substrates are in chronic shortage.</p></li></ul><p>This is why <strong>HBM is already sold out globally</strong>, and why <strong>NVIDIA&#8217;s delivery schedule</strong> is gated not by wafer output, but by how many HBM stacks and CoWoS packages can be assembled each month.</p><p><strong>Key suppliers:</strong></p><ul><li><p><strong>Memory:</strong> SK hynix, Micron, Samsung</p></li><li><p><strong>Packaging:</strong> TSMC (CoWoS), Amkor (U.S. capacity 2028+), Samsung (I-Cube)</p></li><li><p><strong>Materials:</strong> Ajinomoto (ABF film), Ibiden / Unimicron / Shinko / Nan Ya PCB (substrates)</p></li></ul><p><strong>Investor takeaway:</strong> Every incremental GPU shipment consumes HBM stacks, ABF substrates, and CoWoS slots&#8212;each a <strong>toll booth</strong> in the AI supply chain.</p><p></p><div><hr></div><h3><strong>2. SSD &#8212; The Prep Counter and Fridge</strong></h3><p>If HBM is the cutting board, <strong>SSDs (Solid-State Drives)</strong> are the <strong>prep counter and refrigerator</strong>. They don&#8217;t touch the fire, but they make sure the chef never runs out of ingredients mid-service.</p><p>SSDs are built from <strong>NAND flash memory</strong>&#8212;non-volatile cells that retain data even when power is off&#8212;stacked vertically into hundreds of layers and paired with a <strong>controller chip (NVMe)</strong> that directs traffic.</p><p>Just as a restaurant keeps both quick-access prep bins and deep freezers, AI systems use different kinds of SSDs depending on where the data sits. <strong>NVMe drives</strong> are the high-speed ones&#8212;the prep bins right beside the kitchen&#8212;connected over PCIe lanes and able to move data at tens of gigabytes per second. <strong>SATA drives</strong> are the slower, older kind&#8212;the back-room fridge&#8212;limited by an interface built for hard drives. And even within NVMe, the <strong>flash type</strong> changes the economics: <strong>TLC SSDs</strong> (fast, durable) live <em>inside servers</em> feeding GPUs, while <strong>QLC SSDs</strong> (denser, cheaper) live <em>outside them</em>, storing the &#8220;warm&#8221; data that doesn&#8217;t change often but must stay nearby. 
NVMe handles the speed; TLC versus QLC balances <strong>performance against cost</strong>.</p><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!q9-5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2433360b-9af5-443d-b862-ba9d71e37cc6_720x414.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!q9-5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2433360b-9af5-443d-b862-ba9d71e37cc6_720x414.jpeg 424w, https://substackcdn.com/image/fetch/$s_!q9-5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2433360b-9af5-443d-b862-ba9d71e37cc6_720x414.jpeg 848w, https://substackcdn.com/image/fetch/$s_!q9-5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2433360b-9af5-443d-b862-ba9d71e37cc6_720x414.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!q9-5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2433360b-9af5-443d-b862-ba9d71e37cc6_720x414.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!q9-5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2433360b-9af5-443d-b862-ba9d71e37cc6_720x414.jpeg" width="720" height="414" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2433360b-9af5-443d-b862-ba9d71e37cc6_720x414.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:414,&quot;width&quot;:720,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;ssd-architecture&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="ssd-architecture" title="ssd-architecture" srcset="https://substackcdn.com/image/fetch/$s_!q9-5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2433360b-9af5-443d-b862-ba9d71e37cc6_720x414.jpeg 424w, https://substackcdn.com/image/fetch/$s_!q9-5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2433360b-9af5-443d-b862-ba9d71e37cc6_720x414.jpeg 848w, https://substackcdn.com/image/fetch/$s_!q9-5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2433360b-9af5-443d-b862-ba9d71e37cc6_720x414.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!q9-5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2433360b-9af5-443d-b862-ba9d71e37cc6_720x414.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Architecture of a Solid State Drive</figcaption></figure></div><p>From first principles:<br>HBM gives speed but almost no persistence. SSDs provide persistence with speed that&#8217;s &#8220;good enough&#8221;&#8212;read times measured in <strong>microseconds, not nanoseconds</strong>, but still orders of magnitude faster than HDDs.</p><p><strong>Upstream complexity:</strong><br>SSDs rely on a deep, global manufacturing chain:</p><ul><li><p><strong>NAND producers:</strong> Samsung, SK hynix / Solidigm, Kioxia / Western Digital, Micron</p></li><li><p><strong>Controller designers:</strong> Phison, Silicon Motion, Marvell</p></li><li><p><strong>Tool makers:</strong> Lam Research, Tokyo Electron, Applied Materials</p></li><li><p><strong>Process gases:</strong> NF&#8323; and WF&#8326; (from Air Products, Kanto Denka, Merck/EMD)</p></li></ul><p><strong>Investor takeaway:</strong> When HBM is constrained, workloads <strong>spill into SSD</strong>. Every checkpoint, RAG index, and offload increases demand for enterprise NAND and controllers&#8212;turning memory shortages into NAND pricing tailwinds.</p><div><hr></div><h3><strong>3. HDD &#8212; The Warehouse Out Back</strong></h3><p>Behind the kitchen lies the warehouse&#8212;the <strong>Hard-Disk Drive (HDD)</strong>. It&#8217;s not glamorous, but it&#8217;s indispensable. Bulk ingredients, cleaning supplies, everything the restaurant relies on for the next week&#8212;it all lives there.</p><p>In AI, HDDs are the <strong>cold, high-capacity tier</strong>. They use <strong>spinning magnetic platters</strong> to store petabytes of data cheaply. Every LLM training run begins with massive datasets&#8212;text, code, images&#8212;that live here before being pre-processed and staged on SSDs.</p><p>From first principles:</p><ul><li><p>HDDs trade speed for capacity. 
Their latency is measured in <strong>milliseconds</strong>, but they offer <strong>tens of terabytes per drive</strong> at the lowest cost per bit.</p></li><li><p>This cost advantage is crucial: keeping an AI dataset in NAND or HBM would be economically impossible.</p></li></ul><p>An HDD&#8217;s supply chain is a masterpiece of specialization:</p><ul><li><p><strong>Platters:</strong> glass or aluminum&#8212;<strong>HOYA</strong> dominates glass substrates.</p></li><li><p><strong>Media:</strong> cobalt or FePt magnetic coatings from <strong>Resonac / Showa Denko</strong>.</p></li><li><p><strong>Heads &amp; suspensions:</strong> engineered by <strong>TDK</strong> and <strong>NHK Spring</strong>.</p></li><li><p><strong>Motors:</strong> precision fluid-dynamic bearings by <strong>Nidec</strong> (~80% global share).</p></li></ul><p><strong>Why it matters:</strong><br>Every additional trillion tokens of training data or terabytes of user-generated content lands on these disks. As models retrain on fresh data, <strong>HDD demand compounds</strong>. Tight supply in <strong>nearline 30&#8211;32 TB drives</strong> has already led to longer lead times and firmer pricing.</p><p><strong>Investor takeaway:</strong> HDDs quietly define the economics of AI storage. When HDD supply tightens, hyperscalers push some cold workloads into <strong>QLC SSD tiers</strong>, reinforcing NAND cycles and linking all three memory markets together.</p><h3><strong>Where the Weights Live: Training vs Inference</strong></h3><ul><li><p><strong>Training:</strong> Raw data sits on HDD with an SSD cache; tokenized shards live on <strong>NVMe SSDs</strong> to feed GPUs; NVMe also stores frequent <strong>checkpoints (weights + optimizer)</strong> and acts as a <strong>spillover cache</strong> during runs.</p></li><li><p><strong>Inference:</strong> Weights execute in <strong>HBM</strong> (loaded from DRAM/NVMe if not already resident); as context grows the <strong>KV cache spills HBM &#8594; DRAM &#8594; SSD</strong>; <strong>RAG/vector indexes</strong> live on <strong>NVMe SSDs</strong> for low-latency fetches.</p></li></ul><p>Everything above is the surface view&#8212;the &#8220;menu&#8221; of how AI memory works.<br>But the real investment edge lies <strong>beneath the packaging</strong>, inside the factories, chemical plants, and mineral supply lines that feed these components.</p><p>In the next section, we <strong>peel back every layer</strong>: from <strong>copper foils and fluorine gases</strong> that enable NAND etch processes, to the <strong>rare-earth magnets</strong> and <strong>platinum-group metals</strong> behind HDDs, to the <strong>ABF films and glass substrates</strong> that make HBM packaging even possible.</p><p>We&#8217;ll map the <strong>entire value chain for all three memory tiers&#8212;HBM, SSD, and HDD&#8212;down to the mineral level</strong>, highlighting:</p><ul><li><p><strong>Every chokepoint</strong> that could stall AI capacity build-outs,</p></li><li><p>The <strong>companies positioned to capture outsized margins</strong> as shortages intensify, and</p></li><li><p>The <strong>signals investors should watch</strong> before the broader market catches on.</p></li></ul><p>If you want to understand where the next trillion-dollar wave of AI infrastructure profits will come from, this is where it starts.<br>The free overview shows <em>what</em> memory does.<br>The full analysis shows <strong>who controls it, where it&#8217;s breaking, and how to profit before everyone else
notices.</strong></p><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!malv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb28731a4-cc40-46ac-894b-2b637bb92777_2664x866.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!malv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb28731a4-cc40-46ac-894b-2b637bb92777_2664x866.png 424w, https://substackcdn.com/image/fetch/$s_!malv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb28731a4-cc40-46ac-894b-2b637bb92777_2664x866.png 848w, https://substackcdn.com/image/fetch/$s_!malv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb28731a4-cc40-46ac-894b-2b637bb92777_2664x866.png 1272w, https://substackcdn.com/image/fetch/$s_!malv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb28731a4-cc40-46ac-894b-2b637bb92777_2664x866.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!malv!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb28731a4-cc40-46ac-894b-2b637bb92777_2664x866.png" width="1200" height="389.83516483516485" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b28731a4-cc40-46ac-894b-2b637bb92777_2664x866.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:473,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:912088,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://research.fpx.world/i/176580599?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb28731a4-cc40-46ac-894b-2b637bb92777_2664x866.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!malv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb28731a4-cc40-46ac-894b-2b637bb92777_2664x866.png 424w, https://substackcdn.com/image/fetch/$s_!malv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb28731a4-cc40-46ac-894b-2b637bb92777_2664x866.png 848w, https://substackcdn.com/image/fetch/$s_!malv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb28731a4-cc40-46ac-894b-2b637bb92777_2664x866.png 1272w, https://substackcdn.com/image/fetch/$s_!malv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb28731a4-cc40-46ac-894b-2b637bb92777_2664x866.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button 
tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Memory Tier Comparison</figcaption></figure></div><p></p>
      <p>
          <a href="https://research.fpx.world/p/part-1-beyond-power-the-ai-memory">
              Read more
          </a>
      </p>
]]></content:encoded></item><item><title><![CDATA[The Right Way to Run a NeoCloud: A Deep-Dive Strategy Report ]]></title><description><![CDATA[Introduction]]></description><link>https://research.fpx.world/p/the-right-way-to-run-a-neocloud-a</link><guid isPermaLink="false">https://research.fpx.world/p/the-right-way-to-run-a-neocloud-a</guid><dc:creator><![CDATA[FPX AI]]></dc:creator><pubDate>Wed, 08 Oct 2025 18:28:26 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/001886d6-8e37-4fd4-b56e-71ccdb92169d_1232x928.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>Introduction</strong></h2><p>The market&#8217;s full of noise &#8212; everyone&#8217;s calling a top in AI infrastructure, but the real story is simpler: not all data centers are built equal. We&#8217;ve already seen the <a href="https://research.fpx.world/p/the-bifurcation-of-the-ai-cloud-compute">bifurcation in the market</a> and the next cycle will make that brutally clear. The datacenter operators that win will win big, and the ones that win already look different &#8212; they already have their power locked and expandable, they build where latency and liquidity intersect, and they scale only when utilization justifies it. The rest are chasing headlines, not economics.</p><p>In this report, we cut through the hype. We break down what separates operators that actually convert megawatts into margin from those that just burn cash and capex. You&#8217;ll see why the best neoclouds &#8212; the CoreWeaves and Nebiuses of the world &#8212; are winning for reasons most investors still miss.</p><p>If you&#8217;re a data center operator, this is your playbook. If you&#8217;re an investor, it&#8217;s your diligence checklist. And if you&#8217;re renting compute, use it as your filter. <br><br>The best time to act is when the market&#8217;s scared &#8212; when capital retreats, discipline compounds. The ones who play this next phase smart will be the ones everyone calls &#8220;lucky&#8221; five years from now.</p><p></p><h2><strong>Four pillars of success</strong> </h2>
      <p>
          <a href="https://research.fpx.world/p/the-right-way-to-run-a-neocloud-a">
              Read more
          </a>
      </p>
]]></content:encoded></item><item><title><![CDATA[Google's TPU Supply Chain Playbook: The Underestimated Threat to Nvidia’s AI Dominance]]></title><description><![CDATA[Investors track AMD, Amazon, and Microsoft as Nvidia&#8217;s rivals&#8212;but the most mispriced threat could be Google&#8217;s TPUs.]]></description><link>https://research.fpx.world/p/googles-tpu-supply-chain-playbook</link><guid isPermaLink="false">https://research.fpx.world/p/googles-tpu-supply-chain-playbook</guid><dc:creator><![CDATA[FPX AI]]></dc:creator><pubDate>Sat, 13 Sep 2025 20:12:09 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/552d8eac-3ee7-405c-bbd2-1e5d6c83d1fb_1232x928.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><p>When the debate turns to <strong>&#8220;Who can dent Nvidia&#8217;s lead?&#8221;</strong> the same names roll off every tongue: AMD&#8217;s Instinct, Amazon&#8217;s Trainium, Microsoft&#8217;s homegrown silicon, or the newer specialized hardware manufacturers. Google rarely makes the list&#8212;because most assume TPUs are just an internal tool, locked away inside Google&#8217;s own data centers. That assumption is outdated. For a decade, Alphabet has been building Tensor Processing Units that already train Gemini, run at production scale, and power Google&#8217;s core revenue engines&#8212;Search, Ads, YouTube, and Cloud. The real question isn&#8217;t whether Google <em>can</em> build competitive silicon&#8212;it&#8217;s whether it will commercialize TPUs aggressively. If it does, the balance of power in AI hardware could shift far sooner than the market expects.</p><p>Nvidia&#8217;s $4T ascent sharpened the calculus. In the last year, Google has <strong>tested external demand</strong>&#8212;quiet pilots and capacity offers with top labs and <strong>neoclouds</strong> starved of GPUs. Names like <strong>OpenAI, Apple, and Fluidstack/CoreWeave</strong> keep surfacing in industry chatter for a reason: everyone wants <strong>credible non-Nvidia supply</strong>. But here&#8217;s the real tension that matters to investors and operators: <strong>Can Google crack CUDA&#8217;s moat and ship TPUs at Nvidia-class scale&#8212;hardware, tooling, and reliability&#8212;outside its own walls?</strong> That&#8217;s the battle line.</p><p><strong>What we&#8217;ll do in this piece:</strong></p><ul><li><p>Explain <strong>why Google is likely to step out</strong> with TPUs now&#8212;and what signals to watch in its partner playbook.</p></li></ul><ul><li><p>Show <strong>how TPUs differ from Nvidia&#8217;s Blackwell stack</strong> (and where they&#8217;re converging): networking, scale-out, training vs.
inference, power, and <strong>TCO</strong>&#8212;with clean comparison tables.</p></li></ul><ul><li><p>Get practical: the <strong>moves datacenter operators, investors, and colocation providers</strong> can make if Google scales up&#8212;where to place bets, how to prepare fleets, and how to price risk.</p></li></ul><ul><li><p>Then, we&#8217;ll <strong>open the box</strong>: break down a TPU by component, name the suppliers, and map who stands to benefit <strong>if Alphabet turns the TPU dial to &#8220;mass.&#8221;</strong></p><p></p></li></ul><h2>Why Google Can&#8217;t Afford to Sit Out the Chip Wars?</h2><p><strong>Because the market just told them.</strong> Nvidia is worth <strong>$4T</strong> doing a <em>subset</em> of what Alphabet does&#8212;principally silicon, platform software, and a partner cloud footprint&#8212;while Alphabet already operates the <strong>full stack</strong>: chips, compilers, global data centers, and revenue-critical AI products at planetary scale. The surprise isn&#8217;t <em>why</em> Google would sell TPUs; it&#8217;s <strong>why they wouldn&#8217;t</strong> unlock that value.</p><p><strong>Because the hardware is converging.</strong> Google&#8217;s latest <strong>TPUs (v6/v7)</strong> and Nvidia&#8217;s <strong>Blackwell</strong> land on the same endgame: <strong>chiplet MCMs + massive HBM + low-precision math (FP8/FP4) + dedicated sparsity + high-radix fabrics</strong>. When physics pushes everyone to similar building blocks, the battleground shifts to <strong>scale-out efficiency, availability, and cost per token</strong>. That&#8217;s Google&#8217;s home field: multi-pod TPU meshes, optical switching, tightly engineered power/cooling, and inference-first silicon tuned for <strong>price/performance at fleet scale</strong>.</p><p><strong>Because the timing is perfect.</strong> GPUs are supply-constrained and pricey, while <strong>inference spend is set to surpass training</strong>. Externalizing TPUs does three things at once: <strong>cuts Google&#8217;s own COGS</strong>, offers customers a credible <strong>second source</strong> beside Nvidia, and starts eroding the <strong>CUDA lock-in</strong> via PyTorch-XLA and JAX. If Google can match reliability and developer ergonomics outside its walls, even a <strong>single-digit share</strong> of today&#8217;s Nvidia-sized pie is a multi-billion-dollar business&#8212;with strategic leverage far beyond mere chip sales.</p><h2>Google&#8217;s Masterplan: Scaled Hardware and Software Readiness</h2><p><strong>Google isn&#8217;t diving in unprepared &#8211; it&#8217;s been methodically scaling up its hardware and software for this moment.</strong> On the hardware side, Google is now on its <strong>6th generation TPU (code-named &#8220;Trillium&#8221;) and about to launch the 7th (&#8220;Ironwood&#8221;)</strong>. Each generation has dramatically increased performance and scale. Google made its <strong>TPU v6 pods widely available on Google Cloud in late 2024</strong>, and demand was immediately high. Its latest TPU chips can be arrayed into <strong>&#8220;supercomputer&#8221; pods delivering up to 42.5 exaFLOPs of aggregate compute</strong> &#8211; an astronomical figure &#8211; thanks to Google&#8217;s advanced interconnects and clustering technology. In practical terms, a TPU pod can link <strong>up to 9,216 TPU chips into one tightly-synced machine</strong>. This far outscales what&#8217;s practical with most GPU setups (even Nvidia&#8217;s largest DGX SuperPods top out at a few hundred GPUs in tight coherence).</p>
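<p>As a rough check on those pod-scale numbers, here is a minimal back-of-the-envelope sketch in Python. The per-chip inputs (about 4.6 PFLOPS of FP8 and 192 GB of HBM) are the approximate Ironwood figures quoted later in this piece, not official pod specifications:</p><pre><code># Back-of-the-envelope pod math (illustrative only; per-chip figures are the
# approximate Ironwood numbers cited elsewhere in this piece).
chips_per_pod = 9216            # largest pod configuration quoted for Ironwood
fp8_pflops_per_chip = 4.6       # ~PFLOPS of FP8 per chip (approximate)
hbm_gb_per_chip = 192           # HBM capacity per chip in GB

pod_exaflops = chips_per_pod * fp8_pflops_per_chip / 1000   # PFLOPS -> exaFLOPS
pod_hbm_pb = chips_per_pod * hbm_gb_per_chip / 1e6          # GB -> PB (decimal)

print(f"Aggregate FP8 compute: ~{pod_exaflops:.1f} exaFLOPS")   # ~42.4 exaFLOPS
print(f"Aggregate HBM:         ~{pod_hbm_pb:.2f} PB")           # ~1.77 PB
</code></pre><p>Both outputs land in the same range as the pod figures cited later in this piece (&#8776;42.5&#8239;EF FP8 and &#8776;1.77&#8239;PB of shared HBM for a 9,216-chip pod), a useful sanity check when vendors mix per-chip and per-pod numbers.</p>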
<p>Google also <strong>outfitted its TPUs with enormous high-bandwidth memory</strong>: for example, the upcoming <strong>TPU v7 &#8220;Ironwood&#8221; will have 192&#8239;GB of HBM on each chip</strong> &#8211; compared to 80&#8239;GB on Nvidia&#8217;s flagship H100 GPU. That memory, running at <strong>over 7&#8239;TB/s bandwidth</strong> per chip, lets TPUs handle gigantic models and datasets smoothly. Google even developed <strong>optical switching networks</strong> to connect TPUs, <strong>slashing communication power costs</strong> and latency at the cluster scale. In short, Google has quietly built some of the most sophisticated AI supercomputers on the planet &#8211; and is now prepping them for external customers.</p><p></p><h2>Testing the Waters with Key Partners</h2><p>What makes Google&#8217;s TPU push credible isn&#8217;t just specs on a slide deck &#8212; it&#8217;s the fact that <strong>real customers are now paying to use them</strong>. Over the past year, Google has quietly notched a series of external deals that validate TPUs as a viable alternative to Nvidia&#8217;s GPUs. Taken together, they show a pattern: Big Tech hedging, AI labs chasing lower inference costs, and startups eager for a non-Nvidia option.</p><p><strong>OpenAI,</strong> the most GPU-hungry company on the planet, began <strong>leasing TPU capacity through Google Cloud</strong> in <a href="https://www.reuters.com/business/openai-turns-googles-ai-chips-power-its-products-information-reports-2025-06-27/?utm_source=chatgpt.com">mid-2025</a>, mainly for testing. While OpenAI stressed it wasn&#8217;t abandoning Nvidia, this was still pretty important for Google: Nvidia&#8217;s largest single customer was experimenting with a competitor&#8217;s chips. The driver was simple economics. Inference serving at OpenAI costs billions annually, and TPUs offered a way to shave meaningful dollars off that bill.
Reports put the deal at a scale large enough to register &#8212; though notably, Google wasn&#8217;t offering OpenAI its very top-end Ironwood pods, a sign of cautious rollout.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!D5xw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e477a58-980c-45f0-a3f5-43946580114d_1080x1080.png" alt=""><figcaption class="image-caption">Google&#8217;s TPU Partnerships</figcaption></figure></div><p></p><p><strong>Meta</strong> is another whale. In August 2025, Meta signed a <strong>six-year, $10 billion agreement with Google Cloud</strong> to host its AI workloads. On paper, the deal is about reducing Meta&#8217;s capital expenditures &#8212; renting capacity instead of building it all in-house. But strategically, it&#8217;s hard to imagine Meta not at least <em>evaluating</em> TPUs within that environment. Meta&#8217;s open-source LLaMA models run on PyTorch, which now compiles to TPUs via XLA. If even a slice of that deal migrates to TPUs, it would mark the first time a hyperscaler-scale peer adopted Google&#8217;s silicon. Even if Meta ends up sticking to Nvidia GPUs inside Google Cloud, Google wins either way: it gets the revenue, and it has the chance to keep pitching TPUs.</p><p><strong>Apple</strong> sits in a different category. It designs world&#8209;class device chips, but trains server LLMs in the cloud. Apple&#8217;s own <a href="http://machinelearning.apple.com/papers/apple_intelligence_foundation_language_models_tech_report_2025.pdf">2025 technical report</a> confirms it trained <strong>Apple Intelligence server models</strong> on Cloud TPU clusters&#8212;e.g., <strong>8,192 v5p chips</strong>&#8212;a strong signal that Google&#8217;s software/hardware stack now meets Apple&#8217;s bar. Apple won&#8217;t trumpet reliance on a rival cloud, but the fact it used TPUs speaks for itself. <br> <br><strong>Anthropic</strong> is another marquee name validating TPUs. In November 2023, Anthropic and Google said Anthropic would deploy Cloud <strong>TPU v5e at scale</strong>, initially for inference, while also highlighting training economics and MultiSlice (multi&#8209;pod) scaling for larger model training. <strong>Google continued to feature Anthropic as a flagship Cloud customer in 2025</strong>.
In parallel, Anthropic has been <strong><a href="https://job-boards.greenhouse.io/anthropic/jobs/4720576008">hiring TPU Kernel Engineers</a></strong>&#8212;a clear signal that it intends to keep optimizing on TPUs. Independent analysis shows Anthropic is multi&#8209;sourcing compute: even as it expands on AWS Trainium, it&#8217;s not giving up on TPUs or Nvidia GPUs, balancing cost, availability, and performance as demand spikes.</p><p>Among startups, <strong>Safe Superintelligence (SSI)</strong> is the most strategically important. Co-founded by <strong>Ilya Sutskever</strong> &#8212; the OpenAI co-founder and chief scientist who once famously pressed Jensen Huang to sell GPUs for training AI models when they were still thought of as &#8220;gaming cards&#8221; &#8212; SSI has been using <a href="https://techcrunch.com/2025/04/09/ilya-sutskever-taps-google-cloud-to-power-his-ai-startups-research/">Google&#8217;s TPUs</a> for its research since April 2025. In its most recent round, SSI raised $2 billion at a $32 billion valuation, making it one of the most highly valued AI startups pre-product. That combination matters: Sutskever has a track record of spotting hardware inflection points early, and his decision to bet on TPUs despite the glut of available capital is a signal in itself. If one of the most influential figures in AI is willing to tie his new lab&#8217;s compute destiny to Google&#8217;s chips, investors and operators should pay attention. <br> <br><strong>Cohere</strong>, the Canadian LLM company behind the Command family of models, has also been building on TPUs. The choice isn&#8217;t accidental &#8212; several ex-Googlers sit in its technical leadership, including co-founder Aidan Gomez, who co-authored the original &#8220;Attention Is All You Need&#8221; Transformer paper while at Google Brain. Cohere has already moved part of its training pipeline onto Google Cloud TPUs, citing cost and throughput gains as models scale. And for Cohere, it&#8217;s a hedge &#8212; access to high-end compute without competing head-to-head for scarce Nvidia GPUs. While the immediate dollar value may only be in the tens of millions, the significance lies in Cohere&#8217;s stature as a top-tier LLM lab: its use of TPUs validates the hardware as production-grade for frontier-model training, not just inference experiments. <br> </p><p><br>Google has also seeded TPUs <strong>deeply into academia</strong>, getting students familiar with its ecosystem.
Through the TPU Research Cloud (TRC), thousands of researchers have received free access to pods of Cloud TPUs, enabling open-source milestones like EleutherAI&#8217;s GPT-J (6B) training run on a v3-256 pod. TRC spotlights show mainstream PyTorch projects, such as the timm vision library, relying on TPUs&#8212;proof that PyTorch-XLA is production-ready outside Google. At the entry level, Colab and Kaggle TPUs let students prototype on the same accelerators that power Google&#8217;s data centers, creating habits that follow them into startups and labs. And with Google&#8217;s $1 billion university initiative announced in 2025, offering cloud credits and advanced tooling, TPU literacy is being institutionalized at scale. Strategically, this academic pipeline matters: CUDA became dominant because a generation of researchers learned it by default; Google is now trying to ensure the next wave of AI talent grows up just as comfortable with JAX and PyTorch-on-TPU, creating latent demand that will spill into industry. </p><h4><br><strong>On the Neocloud Front: Fluidstack + TeraWulf, the first public TPU host site</strong></h4><p>The clearest signal that Google will place TPUs <strong>outside its own campuses</strong> came in mid-August 2025, when <a href="https://investors.terawulf.com/news-events/press-releases/detail/112/terawulf-signs-200-mw-10-year-ai-hosting-agreements-with">TeraWulf</a> announced it would host <strong>200&#8239;MW</strong> (expandable to <strong>360&#8239;MW</strong>) for Fluidstack at <strong>Lake Mariner (NY)</strong>, backed by Google&#8217;s <strong>$1.8B</strong> financial backstop and warrants equal to <strong>~8%</strong> pro forma equity. On <strong>Aug 18</strong>, TeraWulf disclosed Google <strong>increased the backstop to ~$3.2B</strong> and its stake to <strong>~14%</strong> via <a href="https://investors.terawulf.com/news-events/press-releases/detail/114/terawulf-announces-fluidstack-expansion-with-160-mw-cb-5">additional warrants</a>.</p><p><strong>How the <a href="https://www.datacenterdynamics.com/en/news/google-offers-its-tpus-to-ai-cloud-providers-report/">structure</a> works.</strong> The TPUs will be physically deployed at <strong>TeraWulf&#8217;s Lake Mariner</strong> site, operated by <strong>Fluidstack</strong>, which leases the power and space. Google acts as (1) <strong>anchor backer</strong>&#8212;guaranteeing financing against Fluidstack&#8217;s long-term leases&#8212;and (2) <strong>strategic investor</strong>&#8212;taking equity exposure to the site operator (TeraWulf). This lets Google place TPUs in a <strong>third-party colo</strong> without owning the real estate, while still controlling access to the chips. Google has also reportedly approached <strong>CoreWeave</strong> and <strong>Crusoe</strong>.</p><p><strong>Essentially</strong>, Google is using selective partnerships to validate TPUs in the wild &#8212; inference at OpenAI, AI research at Apple and SSI, training at Cohere, hosting at Fluidstack, large-scale capacity at Meta. Each deal acts as a real-world stress test of both hardware and software. Every bug fixed for OpenAI&#8217;s ChatGPT pipeline, every PyTorch operator patched for Cohere, every data-center reliability issue surfaced by Fluidstack &#8212; it all strengthens the TPU offering.</p><p>That&#8217;s why the rollout looks slow and cautious: it&#8217;s not just about renting compute; it&#8217;s about <strong>hardening the ecosystem</strong>.
By the time Google scales TPU access broadly, it will have the credibility of saying: <em>these chips already run workloads from OpenAI, Apple, Meta, Cohere, and SSI.</em> That kind of validation will be necessary for Google to win broad adoption.</p><h2><strong>From Neo-Cloud Distress to Google&#8217;s Advantage</strong></h2><p><strong>Neoclouds thrived when GPUs were scarce, but scarcity isn&#8217;t a business model.</strong> Many financed growth at high rates, signed long-term power leases, and sold short-term contracts at markups that only made sense in a crisis. Now, as <strong>Nvidia ramps supply</strong> and <strong>hyperscalers build aggressively</strong>, those same commitments are turning into liabilities.</p><p><strong>For Google, this dislocation is an entry point.</strong> Unlike GPU-only operators, it controls the silicon and the software; unlike traditional REITs, it has the balance sheet to underwrite power. By stepping into stressed structures&#8212;through <strong>lease guarantees</strong> (as with Fluidstack/TeraWulf), <strong>warrants</strong> in site operators, or <strong>utilization floors</strong> that de-risk financing&#8212;Google can expand TPU footprint at a discount. For landlords, a TPU pod underwritten by Google lowers cost of capital and reopens financing channels. For Google, it&#8217;s a way to grow globally without pouring billions into its own concrete.</p><p><strong>Even private capital that typically shows up only in distressed cycles is paying attention.</strong> If TPU deployments re-underwrite stranded megawatts into financeable infrastructure, the pool of partners isn&#8217;t limited to operators alone. Recent <strong>multi-year, multi-billion AI-infra contracts</strong> (e.g., Microsoft&#8211;Nebius) illustrate how anchor deals can re-rate digital assets and reopen capital markets.</p><p><strong>Crucially, in the paid section later in this report we will explore how Google, neo-clouds, and datacenter operators might jointly win by aligning around TPU deployments</strong>, but first it&#8217;s worth exploring how TPUs actually stack up against the incumbents. Benchmarks and software support are where reputations are made or broken. To understand Google&#8217;s chances of breaking into a CUDA-dominated market, we need to compare TPU performance and developer tooling head-to-head with Nvidia&#8217;s Blackwell GPUs and AMD&#8217;s Instinct line.</p><p><strong>TPU Architecture and the Convergence with Blackwell</strong></p><p>Google&#8217;s latest TPUs and NVIDIA&#8217;s newest GPUs have converged on a surprisingly similar physics&#8209;driven design philosophy. Both have abandoned the idea of a giant monolithic die in favor of chiplet packaging, surrounding compute cores with vast stacks of high-bandwidth memory. <br><br>With <strong>TPU v7 &#8220;Ironwood,&#8221;</strong> Google packages <strong>two compute dies plus eight HBM stacks (&#8776;192&#8239;GB total, ~7.2&#8211;7.4&#8239;TB/s)</strong>; Google quotes <strong>~4.6 PFLOPS FP8</strong> per chip and pods of <strong>256 or 9,216 chips</strong>. The Register pegs per&#8209;chip power in the <strong>~700&#8239;W&#8211;1&#8239;kW</strong> class.<br><br><strong>NVIDIA Blackwell (B200/GB200)</strong> follows a similar dual&#8209;die design (&#8776;208&#8239;B transistors on TSMC 4NP) and pushes FP8/FP4 math via Transformer Engine v2.
NVIDIA markets <strong>NVL72</strong> as a <strong>single 72&#8209;GPU NVLink domain</strong>; <strong>NVLink Switch</strong> can extend that fabric up to <strong>576 GPUs</strong> before you step out to InfiniBand/Ethernet. NVIDIA continues to publish the most concrete <em>system</em> numbers (e.g., <strong>DGX B200: 72 PFLOPS training FP8; 144 PFLOPS inference FP4 for 8 GPUs</strong>).<br><br>The physics of modern AI left both companies with no choice but to build toward the same frontier.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!H5E0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44f9a24d-2efa-4089-82f5-bc21459c6ae4_1000x800.png" alt=""><figcaption class="image-caption">TPUv6 performance was approaching that of the B100, and TPUv7 is expected to be even more competitive. Credits: <a href="https://epoch.ai/data">Epoch</a></figcaption></figure></div><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!QQwH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F315860f1-f568-435e-82ed-6c1389b164b7_1000x800.png" alt=""><figcaption class="image-caption">Credits: <a href="https://epoch.ai/data">Epoch</a></figcaption></figure></div><p>Where the divergence shows is in scale. NVIDIA&#8217;s NVLink interconnect tops out at around seventy-two GPUs in a tightly coupled configuration before handing off to InfiniBand or Ethernet, which adds latency and complexity. Google built its TPU pods to scale to thousands of chips natively, with a custom interconnect and optical switching fabric. The result is that Google can train trillion-parameter models across nine thousand TPUs in a single pod with relatively little overhead. For workloads that truly need that level of synchronization, TPU remains the cleaner path.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!fxNU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cb7e085-2f97-4875-8ca3-0a904e95f8ab_1935x1022.png" alt=""><figcaption class="image-caption">Head-to-head comparison of the flagship hardware</figcaption></figure></div><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!SyCX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe979c39d-08ea-4afe-9864-405543f43039_1872x990.png" alt=""><figcaption class="image-caption">* Per-GPU FP4 figure is commonly cited by integrators/analysts summarizing NVIDIA guidance; NVIDIA publishes system-level numbers (e.g., <strong>DGX&#8239;B200: 72&#8239;PF training / 144&#8239;PF inference</strong> for 8&#215; B200). &#8224; MI355X &#8220;74&#8239;PF&#8221; figures are <strong>node/platform</strong> (not per chip) in vendor/press materials. Treat per-chip FLOPs carefully until AMD publishes a <a href="https://www.hpcwire.com/2024/10/15/on-paper-amds-new-mi355x-makes-mi325x-look-pedestrian/">canonical table</a>.</figcaption></figure></div><h6><strong>Sources: TPU v6e <a href="https://cloud.google.com/tpu/docs/v6e">spec table</a> (HBM/BW/ICI/TFLOPs). <br>Ironwood memory/BW/FP8 per chip, 256/9,216-chip pods, 2&#215; perf/W claims. <br><a href="https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/">Blackwell</a> MCM + NVLink5 + NVL72 domain. <br>AMD MI350/MI355 memory/BW/FP6/FP4; <a href="https://www.amd.com/en/products/accelerators/instinct/mi350/mi350x.html">platform scaling</a></strong></h6><p></p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!RYKp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F824554c2-2c47-4501-a2b7-3757aec57775_624x351.png" alt=""><figcaption class="image-caption">Source: <a href="https://blog.google/products/google-cloud/ironwood-tpu-age-of-inference/">Google</a></figcaption></figure></div>
<h2><strong>Software, Ecosystem, and the CUDA Moat (drop-in replacement)</strong></h2><p>Hardware doesn&#8217;t win adoption&#8212;<strong>developer experience</strong> does. <strong>NVIDIA</strong> still sets the pace with <strong>CUDA + cuDNN + NCCL</strong> for training and <strong>TensorRT-LLM + Triton Inference Server</strong> for serving, which makes &#8220;runs on day one&#8221; the default for most teams. <strong>Google</strong> has closed much of the practical gap for mainstream LLM/RAG/recsys: <strong>PyTorch on TPU via OpenXLA/PJRT (PyTorch/XLA)</strong>, <strong>JAX</strong> for research, <strong>vLLM on TPU</strong> for high-throughput inference, and widespread access through the <strong>TPU Research Cloud</strong>. <strong>AMD</strong>&#8217;s <strong>ROCm</strong> has improved (PyTorch CI, broader kernels), but it&#8217;s still less turnkey than CUDA or Google&#8217;s current TPU toolchain. <strong>Net:</strong> NVIDIA remains the broadest, lowest-friction path; Google is now a credible choice when <strong>efficiency/scale</strong> drive the decision; AMD is catching up, with momentum but a usability delta. That said, AMD now works well for smaller workloads and has a TCO and performance edge over NVIDIA for certain models.</p>
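<p>To make the &#8220;drop-in&#8221; framing concrete, here is a minimal, illustrative PyTorch/XLA sketch. It is not taken from Google&#8217;s documentation, and the toy model and tensor sizes are placeholders; the point is simply that an existing PyTorch training loop targets a TPU by swapping the device handle and adding an explicit graph sync:</p><pre><code># Illustrative only: a toy PyTorch training step retargeted from CUDA to TPU
# via torch_xla (PyTorch/XLA on OpenXLA/PJRT). Model and sizes are placeholders.
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()                      # TPU device, analogous to torch.device("cuda")
model = torch.nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for step in range(10):
    x = torch.randn(8, 1024, device=device)   # placeholder batch
    y = torch.randn(8, 1024, device=device)
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    xm.mark_step()                            # materialize the lazily traced XLA graph
</code></pre><p>The loop body is unchanged from a CUDA version; the XLA device handle and the <code>mark_step()</code> sync are the only TPU-specific lines, which is the practical meaning of the PyTorch-on-TPU support described above.</p>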
<p><em>Note:</em> <strong>NVIDIA Triton Inference Server</strong> (for deployment) is different from <strong>OpenAI Triton</strong> (a kernel DSL); the name collision confuses newcomers.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!11PG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0d53416-21fb-4330-83a7-a2e35e8ebe4d_1834x1160.png" alt=""><figcaption class="image-caption">Software Framework Comparison</figcaption></figure></div><h2><strong>Who Wins Where: Training, Inference, Model Shifts, and the Edge</strong></h2><p><strong>Training at frontier scale.</strong> In practice, there are two proven paths today for <em>finished</em> frontier-scale pretraining runs: <strong>NVIDIA Hopper-class clusters (H100/H200)</strong> and <strong>Google
TPUs</strong>. Meta&#8217;s Llama&#8239;3.1 405B and similar large releases have been trained on NVIDIA&#8217;s platform, while Google&#8217;s Gemini&#8239;2.5 family is trained on TPUs. NVIDIA&#8217;s newest <strong>Blackwell</strong> parts (B200/GB200) are ramping, with strong results in <strong>MLPerf Training</strong> and major <em>inference</em> milestones, but we haven&#8217;t seen a tier-one lab publicly release a flagship model trained end-to-end on Blackwell <strong>yet</strong>. Expect that to change as NVL72 deployments spread.</p><h4><strong>Why TPUs remain credible for the very largest jobs.</strong></h4><p>Google designed pods as <em>native scale-up machines</em>: <a href="https://blog.google/products/google-cloud/ironwood-tpu-age-of-inference/">Ironwood (TPU&#8239;v7)</a> composes <strong>256-chip</strong> and <strong>9,216-chip</strong> pods with an optical fabric and ICI links, yielding <strong>&#8776;42.5&#8239;EF FP8</strong> and <strong>&#8776;1.77&#8239;PB</strong> of shared HBM in one tightly synchronized domain. That system design keeps collective-op overhead unusually low for trillion-parameter training. NVIDIA&#8217;s <strong>NVL72</strong> is a 72-GPU NVLink domain per rack; <strong>NVLink Switch</strong> can extend further (NVIDIA cites fabric scales up to <strong>576 GPUs</strong>), but most real-world builds step out to InfiniBand/Ethernet beyond that boundary. Translation: <strong>TPUs</strong> still offer the cleanest single-system image at extreme node counts, while <strong>NVIDIA</strong> dominates breadth and availability.</p><h4><strong>Inference at scale.</strong></h4><p>Two stories are unfolding. First, Google&#8217;s <strong>Ironwood</strong> is <strong>inference-first</strong> silicon: <strong>FP8</strong> throughput, <strong>192&#8239;GB HBM3e</strong> per chip, and a Google-reported <strong>~2&#215; perf/W gain over Trillium</strong>. It exists to cut <strong>cost per million tokens</strong> in cloud serving while keeping long contexts resident. Second, NVIDIA keeps setting public <strong>tokens-per-second</strong> records with <a href="https://www.nvidia.com/en-us/data-center/gb200-nvl72">Blackwell</a> software stacks (TensorRT-LLM, Transformer Engine), which matters for low-latency, high-throughput serving. In other words, TPUs press the <strong>TCO</strong> angle; NVIDIA presses <strong>time-to-result</strong> and ecosystem.</p><h4><strong>Model architecture shifts (who adapts fastest).</strong></h4><p>When researchers add a new attention variant or operator, the winner is whoever gets kernels and graph compilers updated first and broadly. That&#8217;s still <strong>NVIDIA</strong>: CUDA + cuDNN + NCCL + TensorRT/Triton and a decade of kernel IP make &#8220;day-one&#8221; support likely. <strong>Google</strong> has closed much of the gap for mainstream ops through <strong>PyTorch/XLA (PJRT)</strong> and <strong>JAX</strong>, and the pace is visible in recent PyTorch/XLA releases (including <strong>vLLM</strong> support). <strong>AMD</strong>&#8217;s <strong>ROCm</strong> is improving (PyTorch CI, ROCm&#8239;7), but remains less turnkey for novel ops.</p>
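<p>A small, hedged illustration of that &#8220;day-one kernels&#8221; dynamic: recent PyTorch releases expose fused attention behind a single API and let you prefer a fast backend and fall back when it isn&#8217;t available. Which backends exist (and for which shapes and dtypes) is an assumption that varies by vendor stack and PyTorch version; the example assumes a CUDA-capable device.</p><pre><code># Hedged sketch: one attention call, an explicit fast-kernel preference,
# and a portable fallback. Backend availability is stack-dependent.
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

try:
    # Prefer a fused FlashAttention-style kernel where the backend provides one.
    with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
except RuntimeError:
    # Reference math path: correct everywhere, just slower.
    with sdpa_kernel(SDPBackend.MATH):
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(out.shape)
</code></pre><p>The point is not the snippet itself but who keeps the fast branch populated as architectures change, which is exactly the race described above.</p>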
<p>Net: if you are betting on <em>rapid</em> model innovation, <strong>NVIDIA</strong> minimizes integration risk; if your workload is standard LLM/RAG/recsys and you care most about cost/token, <strong>TPU</strong> is credible; <strong>AMD</strong> is viable <a href="https://www.amd.com/en/blogs/2025/enabling-the-future-of-ai-introducing-amd-rocm-7-and-the-amd-developer-cloud.html">where its memory profile fits</a>, with diligence on software.</p><h4><strong>Memory-bound single-device work.</strong></h4><p>If your bottleneck is &#8220;fit it all on one device&#8221; (huge embeddings, very long context windows, large MoE experts), <strong>AMD&#8217;s MI355X</strong> is the capacity leader at <strong>288&#8239;GB HBM3e</strong> and <strong>8&#8239;TB/s</strong> per GPU, with rack-level disclosures up to <strong>128&#8239;GPUs</strong> using <strong>Pollara</strong> NICs. This can simplify sharding and improve tail latency&#8212;<em>provided</em> the kernels you need are mature on ROCm.</p><h2><strong>What about AWS Trainium, Tesla, and Meta&#8217;s chips?</strong></h2><p><strong>AWS Trainium 2</strong> is scaling (Rainier-class superclusters; Anthropic names AWS its primary training partner) but lacks a publicly finished frontier model so far; <strong>Tesla/xAI</strong> is focusing its own silicon (the AI5 and AI6 chips) on inference while leaning heavily on NVIDIA GPUs for training in its data centers; <strong>Meta MTIA</strong> is live for recommendation <em>inference</em> and testing a training part, with NVIDIA still carrying Llama-class training.</p><p><strong>Edge and &#8220;physical AI.&#8221;</strong> At the <em>edge</em>, the center of gravity is <strong>NVIDIA Jetson</strong>, now evolving from Orin to <strong>Jetson AGX Thor</strong> (Blackwell-based), with early adopters across robotics and industrials; <strong>Apple&#8217;s Neural Engine</strong> anchors on-device consumer AI, powering on-device Apple Intelligence models with server models handling heavier requests; and <strong>Google Edge TPU</strong> plays targeted roles.<br></p><p><strong>Performance robotics</strong> (mobile manipulators, AMRs, humanoids) overwhelmingly standardizes on <strong>NVIDIA Jetson today and Jetson AGX Thor next</strong> because it brings the <em>same CUDA/Isaac stack as the datacenter</em> into 30&#8211;100&#8239;W modules, handles multi-sensor fusion and real-time control, and now adds Blackwell-class transformer throughput for vision-language policies. <strong>Consumer on-device AI</strong> is led by <strong>Apple&#8217;s Neural Engine</strong> on A/M-series chips (private, low-latency experiences like Apple Intelligence), with <strong>Qualcomm and MediaTek NPUs</strong> saturating Android phones and new Copilot+ PCs&#8212;huge in volume even when single-device models are modest. The third segment&#8212;<strong>industrial/IoT edge</strong> (cameras, gateways, retail/telemetry)&#8212;leans on <strong>Google&#8217;s Edge TPU/Coral and similar low-power ASICs</strong> for always-on, quantized inference where watts and BOM dominate; they&#8217;re superb at sustained detection/classification but not meant for long-context LLMs. 
Read this as segmentation, not competition: <strong>Jetson/Thor</strong> wins where real-time performance and CUDA compatibility matter; <strong>Apple/phone/PC NPUs</strong> win ubiquitous, privacy-preserving experiences; and <strong>Edge TPU-class parts</strong> win cost- and power-constrained deployments. Together they expand the total edge TAM&#8212;and they also <em>pull</em> cloud demand (training and fleet-scale orchestration), which is why the datacenter race and the edge race are tightly coupled.</p><h2><strong>Looking Ahead</strong></h2><p>Chiplets, HBM and low-precision math are locked in; the real separation will be <strong>scale, software, and unit economics</strong>. <br> <br>In the next few years, <strong>inference will overtake training</strong> as the spending center, and buyers will normalize on <strong>cost per million tokens at a latency SLA</strong>. If Google plays its cards right, <strong>TPUs</strong> should become the credible <strong>second source</strong>: expect at least one marquee external win where <strong>Ironwood pods</strong> beat comparable GPU fleets on cost per token while meeting SLOs. <strong>NVIDIA</strong> should still own the &#8220;day-one&#8221; path&#8212;its ecosystem ships new kernels fastest&#8212;and even if share redistributes, a <strong>bigger pie</strong> plus the Blackwell ramp keeps revenue growing. <strong>AMD</strong> is well on its way to fixing its memory and networking shortcomings and should capture <strong>memory-bound niches</strong> (288&#8239;GB HBM per GPU) as context windows expand&#8212;provided ROCm is locked down for those workloads. <strong>Specialist silicon</strong> settles into home turf: <strong>Trainium2</strong> for in-AWS cost/control, <strong>MTIA</strong> for recsys (training to follow), <strong>Tesla AI5/AI6</strong> for low-latency autonomy/robotics. Cerebras, Groq and SambaNova are also worth keeping an eye on. <br> <br>The <strong>edge</strong> splits three ways&#8212;<strong>Jetson/Thor</strong> for performance robotics, <strong>phone/PC NPUs</strong> for private on-device AI, <strong>low-power ASICs</strong> (Edge TPU class) for industrial/IoT&#8212;and each <strong>pulls more cloud training and orchestration</strong>. New KPIs&#8212;<strong>tokens/joule</strong>, <strong>tokens/rack</strong>, <strong>cost per million tokens @ P95</strong>&#8212;replace FLOPs in procurement (a back-of-the-envelope sketch of that math follows below). And because <strong>networks and cooling</strong> decide who can keep collectives tight at scale, optical-heavy fabrics and liquid cooling become kingmakers. Bottom line: this is a <strong>portfolio market</strong>&#8212;<strong>NVIDIA</strong> remains the broad on-ramp, <strong>Google</strong> wins scale-up and cost per token, <strong>AMD</strong> wins per-device memory&#8212;so <strong>everyone can grow even as share shifts.</strong></p><p>Meanwhile, <strong>Nvidia isn&#8217;t complacent</strong> &#8211; it will likely respond by offering more integrated solutions (e.g., Nvidia may bundle software and even cloud-like offerings via partners, as it&#8217;s doing with DGX Cloud). It&#8217;s also possible Nvidia might cut prices or offer leasing models if it feels pressure. But Nvidia also has to be careful not to undermine its own massive margins. Google, in contrast, could even treat TPU offerings as a loss leader to grow cloud market share (since Google&#8217;s overall business benefits if you come to its cloud).</p>
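<p>Before the wrap-up, here is the back-of-the-envelope version of the cost-per-million-tokens and tokens-per-joule KPIs referenced above. Every input below (throughput, power, electricity price, capex, utilization) is an illustrative assumption, not a measured figure for any specific accelerator.</p><pre><code># Hedged sketch: cost per million tokens from throughput, power, and capex.
# All inputs are illustrative assumptions; substitute your own measured numbers.

tokens_per_sec     = 12_000      # sustained decode throughput per server (assumed)
server_power_kw    = 10.0        # wall power incl. cooling overhead (assumed)
price_per_kwh      = 0.08        # delivered electricity price, $/kWh (assumed)
server_capex       = 300_000.0   # fully loaded server cost, $ (assumed)
amortization_years = 4           # straight-line amortization window (assumed)
utilization        = 0.60        # share of wall-clock time serving traffic (assumed)

seconds_per_year = 365 * 24 * 3600
tokens_per_year  = tokens_per_sec * seconds_per_year * utilization

energy_cost_year = server_power_kw * 24 * 365 * price_per_kwh
capex_cost_year  = server_capex / amortization_years

cost_per_mtok    = (energy_cost_year + capex_cost_year) / (tokens_per_year / 1e6)
tokens_per_joule = tokens_per_sec / (server_power_kw * 1000)

print(f"cost per million tokens: ${cost_per_mtok:.3f}")
print(f"tokens per joule: {tokens_per_joule:.2f}")
</code></pre><p>The value of the exercise is the sensitivity, not the absolute number: throughput and utilization move the result as much as the electricity price does, which is why perf/W, scheduler quality, and delivered power all show up in the same procurement conversation.</p>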
<p>In the end, the <strong>AI hardware race</strong> is entering a new phase. For years it was all about GPU vs GPU (Nvidia vs AMD vs Intel&#8217;s fledgling attempts). Now it&#8217;s shifting to <strong>GPU vs specialized accelerators vs </strong><em><strong>hyperscaler-designed chips</strong></em>. Google&#8217;s entry instantly makes the fight more interesting. And unlike a startup, Google has deep pockets to fund this and a huge internal use case guaranteeing that TPUs won&#8217;t go unused.</p><p> <br><strong>In the free section above, we explored</strong> Google&#8217;s motivations, technology, and strategy in taking on Nvidia, and how TPUs compare to current and next-gen GPUs. <br><br><strong>In the subscribers-only section to follow, we will delve into the often-overlooked side of this equation: <br><br>1) The TPU Anchor Structures</strong> &#8212; For neoclouds looking to expand or under pressure, and for financiers looking for stability, this section lays out clear deal templates to de-risk growth. Think <strong>anchor agreements, guaranteed offtakes, and the ability to land long-term deals</strong> that align smaller players with larger operators and spread risk beyond a single supplier. The playbook shows how fragile, spot-based GPU income can be converted into steady TPU-backed cashflows&#8212;contracts that lenders and investors can actually underwrite.<br><br>2) <strong>The TPU Supply Chain Winners</strong> &#8212; In this deep dive, we map the companies (public and private) that stand to gain if Google meaningfully scales TPU production&#8212;and the broader set of suppliers riding the AI accelerator wave across all hyperscalers. From fabs and HBM vendors to substrate makers, optics providers, and liquid-cooling OEMs, we&#8217;ll show you who is already winning orders, where bottlenecks lie, and which sectors are best positioned for durable growth. For investors, this is a forward-looking tracker of the stocks and categories most levered to TPU and GPU demand. For neoclouds and operators, it&#8217;s a practical view of which partners are emerging as critical&#8212;and how to align site specs to capture anchor status.</p>
      <p>
          <a href="https://research.fpx.world/p/googles-tpu-supply-chain-playbook">
              Read more
          </a>
      </p>
]]></content:encoded></item><item><title><![CDATA[How GPT‑OSS Is Silently Reshaping Inference Optimization and Model Serving Sector— and How Neoclouds Can Capitalize]]></title><description><![CDATA[Introduction]]></description><link>https://research.fpx.world/p/how-gptoss-is-silently-reshaping</link><guid isPermaLink="false">https://research.fpx.world/p/how-gptoss-is-silently-reshaping</guid><dc:creator><![CDATA[FPX AI]]></dc:creator><pubDate>Mon, 25 Aug 2025 17:49:45 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/eb5ce11c-ca36-4305-a52a-f7571240d617_1232x928.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1><strong>Introduction</strong></h1><p>OpenAI recently released its first open-weight models since GPT-2, called <strong>GPT-OSS</strong>. Headlines fixated on benchmarks, but the quiet story most people missed is how this release <strong>restructures how LLMs are served to the end user</strong>. It didn&#8217;t just drop weights&#8212;it packaged <strong>the right kernels</strong> (tuned for all major GPU families, so users can run an optimized model out of the box on their hardware). They also provided ready-made <strong>integration hooks</strong> into the <em>common inference engines people already use</em>. With GPT-OSS pre-tuned for mainstream runtimes, the old advantage of &#8216;we run models faster&#8217; (whether for a few users or internally) evaporates. The baseline is fast; what matters now is scale and orchestration.</p><p>From first principles, optimal LLM serving has <strong>two levels of optimization</strong>:</p><p><strong>1. Inside a server (the engine):</strong> this is where the <em>weights</em> of the model actually live. Think of them like the recipe book the AI uses. Optimizing here means making sure that recipe can be read quickly and efficiently&#8212;no wasted steps. GPT-OSS largely solved this by shipping the weights already organized for the most common machines, so everyone starts with a &#8220;tuned engine.&#8221;</p><p><strong>2. Between servers (the fleet):</strong> once you outgrow a single machine, the challenge shifts to how you spread that recipe across many kitchens without slowing down. This is where you decide <em>where the memory of past conversations (the KV cache) sits, how often you can reuse it, and which machines should handle which parts of the work</em>. That&#8217;s what NVIDIA&#8217;s Dynamo standardizes: it makes sure the kitchens talk to each other smoothly, so long conversations and tool calls don&#8217;t grind to a halt.</p><p>Once the box is standard, the economics migrate to the port: <strong>state locality</strong> (where the KV cache lives and how often you can reuse it) and <strong>capacity placement</strong> (which GPUs, on which interconnects, in which regions). <br><br>This is the terrain where <strong>inference optimization platforms</strong> like Together, Fireworks, Baseten, and peers have operated, blending custom kernels with proprietary schedulers. But if model providers now ship models <strong>with</strong> those optimizations&#8212;and frameworks like <strong>Dynamo</strong> make multi-node choreography baseline&#8212;does the &#8220;optimization&#8221; layer <strong>commoditize</strong>? Or does the moat re-form higher up, around how well you manage memory across sessions (global KV) and how reliable the agent experience feels? And precisely where do <strong>Neoclouds</strong> and investors/PE funds capitalize on these changes?</p><p><em>Hold that thought. 
The rest of this piece maps where the moats are reforming&#8212;and how operators and investors can position before the market prices it in.</em></p><div><hr></div><h2><strong>1) Inside the box: how GPT&#8209;OSS standardizes single&#8209;node optimization (and turns it into a commodity)</strong></h2><p><strong>What GPT&#8209;OSS actually ships for the </strong><em><strong>node</strong></em><strong>:</strong> not just open weights, but a serving recipe that <strong>makes one server fast by default</strong>. Two models&#8212;<strong>gpt&#8209;oss&#8209;20B</strong> and <strong>gpt&#8209;oss&#8209;120B</strong>&#8212;arrive with tool-use formatting (Harmony), <strong>pre-tuned kernels</strong>, and drop-in paths through <strong>vLLM</strong> and <strong>Transformers</strong>. In practice that means: 20B runs on <strong>~16 GB</strong> GPUs (including most gaming GPUs, for the edge); 120B fits on a single <strong>80 GB</strong> H100 (a common datacenter-class card); both support long contexts without bespoke engineering.</p><p><strong>Why it&#8217;s fast on one box (simple version):</strong></p><p>&#183; <strong>Less math per token by design (sparse MoE).</strong> Each token activates <strong>4 experts</strong> out of 32 (20B) or 128 (120B), so you get the same model capacity but only a fraction of the compute per step.</p><p>&#183; <strong>Smaller, cheaper weight movement (MXFP4 on MoE).</strong> MoE weights are stored in a custom 4-bit format (MXFP4) that packs multiple values into a byte, cutting memory traffic dramatically while keeping accuracy. That&#8217;s how these models fit in modest VRAM footprints.</p><p>&#183; <strong>Attention matches the fast lane:</strong> the model&#8217;s attention pattern was chosen to line up with today&#8217;s fastest GPU code, so long conversations stay stable and you keep using the optimized kernels instead of slow fallbacks.</p><p>&#183; <strong>The kernels are </strong><em><strong>delivered</strong></em><strong> to you.</strong> In Transformers/vLLM you flip a single config flag and the runtime fetches and loads the right kernels for your GPU (e.g., FlashAttention-class paths). If a specific kernel isn&#8217;t available, it picks a compatible fast alternative, turning what used to be bespoke CUDA work into a one-line option (a minimal serving sketch appears at the end of this piece).</p><p>&#183; <strong>The I/O format keeps you on the fast path.</strong> The models are trained for <strong>Harmony</strong> (channels for analysis/final and structured tool calls). Using the provided chat template avoids formatting mismatches that would push runtimes onto slower kernels.</p><p>GPT-OSS doesn&#8217;t drop off raw ingredients and leave you to prep; it delivers a meal kit with everything pre-chopped, portioned, and matched to your stove. Turn on the burner, and dinner is ready at restaurant speed&#8212;no kitchen hacks required.<br><br><strong>The standard it sets and why intra&#8209;node optimizations commoditize:</strong><br>With GPT-OSS, the recipe for single-node speed is now out in the open. Mixture-of-Experts layouts, MXFP4 packing schemes, and attention patterns are not just described in papers&#8212;they&#8217;re baked into the model and maintained upstream by <strong>Hugging Face, vLLM, and NVIDIA</strong>. The tuned kernels are auto-pulled at install. That means one-off advantages like <em>&#8220;we wrote a slightly faster matmul&#8221;</em> don&#8217;t translate into lasting differentiation anymore. 
As soon as a kernel lands in the open-source stack, everyone has it.</p><p>For companies whose entire pitch rests on squeezing more tokens per second out of a single GPU, that&#8217;s a flashing red light. The ground they stand on is being standardized. The only real edge left at the node level is narrow and short-lived: earliest support for a brand-new silicon generation (e.g., Blackwell&#8217;s native MXFP4 tensor cores), or handling extreme constraints (ultra-low VRAM, mobile devices). For everyone else, <strong>peak tokens/sec per box is becoming table stakes</strong>&#8212;a baseline you get by running the blessed stack.</p><p>The constructive takeaway: if your company lives purely in the single-node optimization niche, it&#8217;s time to move up the stack. Future moats will form around <strong>fleet-level orchestration, global KV/state management, and agent runtime quality</strong>, not around shaving a few microseconds off matmuls. Otherwise, as GPT-OSS and Dynamo raise the floor, the optimization business becomes a race to zero.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HuF8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff86ea1c2-c977-4cd5-9ec0-a3f3d7267b1c_1456x1086.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HuF8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff86ea1c2-c977-4cd5-9ec0-a3f3d7267b1c_1456x1086.png 424w, https://substackcdn.com/image/fetch/$s_!HuF8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff86ea1c2-c977-4cd5-9ec0-a3f3d7267b1c_1456x1086.png 848w, https://substackcdn.com/image/fetch/$s_!HuF8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff86ea1c2-c977-4cd5-9ec0-a3f3d7267b1c_1456x1086.png 1272w, https://substackcdn.com/image/fetch/$s_!HuF8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff86ea1c2-c977-4cd5-9ec0-a3f3d7267b1c_1456x1086.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HuF8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff86ea1c2-c977-4cd5-9ec0-a3f3d7267b1c_1456x1086.png" width="1456" height="1086" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f86ea1c2-c977-4cd5-9ec0-a3f3d7267b1c_1456x1086.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1086,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HuF8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff86ea1c2-c977-4cd5-9ec0-a3f3d7267b1c_1456x1086.png 424w, 
https://substackcdn.com/image/fetch/$s_!HuF8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff86ea1c2-c977-4cd5-9ec0-a3f3d7267b1c_1456x1086.png 848w, https://substackcdn.com/image/fetch/$s_!HuF8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff86ea1c2-c977-4cd5-9ec0-a3f3d7267b1c_1456x1086.png 1272w, https://substackcdn.com/image/fetch/$s_!HuF8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff86ea1c2-c977-4cd5-9ec0-a3f3d7267b1c_1456x1086.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure: GPT-OSS architecture (Source: <a href="https://newsletter.languagemodels.co/p/the-illustrated-gpt-oss">Jay Alammar</a>)</figcaption></figure></div><div><hr></div><h2><strong>2) Across the fleet: Dynamo, the autopilot for inter&#8209;node optimization</strong></h2><p>If GPT&#8209;OSS standardized the <strong>engine</strong>, <strong>NVIDIA Dynamo</strong> is the <strong>autopilot</strong> that flies the fleet. Think of it as the control tower above your runtime (vLLM, TensorRT-LLM, SGLang, PyTorch) that quietly handles the messy <em>between-servers</em> work so your service stays fast and predictable.</p><p><strong>What Dynamo automates:</strong></p><p>&#183; <strong>Prefill vs. decode become separate stations.</strong> Long prompts go to heavy-duty prep cooks, while token-by-token generation is handled by line cooks trained for speed. They work in parallel, instead of tripping over each other.</p><p>&#183; <strong>Conversations stay near their memory.</strong> Sessions are pinned close to where their KV cache lives, so context doesn&#8217;t need to be rebuilt every turn. Time-to-first-token stays flat even as histories grow.</p><p>&#183; <strong>Hot vs. cold state is managed intelligently.</strong> Active memory lives on fast HBM, while colder chunks are pushed to cheaper RAM or disk. 
The right pieces are always within arm&#8217;s reach.</p><p>&#183; <strong>Traffic is steered by the fabric, not by chance.</strong> Requests are batched when it helps, split when it doesn&#8217;t, and routed according to NVLink or InfiniBand realities&#8212;avoiding tail-latency pileups.</p><p>Net effect: Dynamo takes a decade of hand-rolled inter-node tricks&#8212;custom batchers, ad-hoc prefill pools, cache pinning/migration&#8212;and <strong>makes them the default</strong>. It lifts the baseline for <em>cluster</em> performance the same way GPT&#8209;OSS lifted the baseline for <em>single-server</em> performance.</p><p><strong>Market impact<br></strong>When fleet behavior is standardized, the old &#8220;we invented a better scheduler&#8221; pitch loses altitude. Differentiation shifts to two places:</p><ul><li><p><strong>Up-stack:</strong> turning primitives into <strong>products</strong>&#8212;global KV as a service (policy, isolation, observability), agent-ready APIs (tool loops, retries, structured outputs), and developer ergonomics.</p></li><li><p><strong>Down-stack:</strong> <strong>scale and placement</strong>&#8212;who actually has next&#8209;gen GPUs, NVLink/IB topologies, pre-warmed pools, and sovereign regions, turning &#8220;we have GPUs&#8221; into contractual SLAs on latency, reliability, and compliance.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kvWh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b002597-32fd-4fd0-8d38-363e995b5af9_1336x757.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kvWh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b002597-32fd-4fd0-8d38-363e995b5af9_1336x757.png 424w, https://substackcdn.com/image/fetch/$s_!kvWh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b002597-32fd-4fd0-8d38-363e995b5af9_1336x757.png 848w, https://substackcdn.com/image/fetch/$s_!kvWh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b002597-32fd-4fd0-8d38-363e995b5af9_1336x757.png 1272w, https://substackcdn.com/image/fetch/$s_!kvWh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b002597-32fd-4fd0-8d38-363e995b5af9_1336x757.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kvWh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b002597-32fd-4fd0-8d38-363e995b5af9_1336x757.png" width="1336" height="757" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8b002597-32fd-4fd0-8d38-363e995b5af9_1336x757.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:757,&quot;width&quot;:1336,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kvWh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b002597-32fd-4fd0-8d38-363e995b5af9_1336x757.png 424w, https://substackcdn.com/image/fetch/$s_!kvWh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b002597-32fd-4fd0-8d38-363e995b5af9_1336x757.png 848w, https://substackcdn.com/image/fetch/$s_!kvWh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b002597-32fd-4fd0-8d38-363e995b5af9_1336x757.png 1272w, https://substackcdn.com/image/fetch/$s_!kvWh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b002597-32fd-4fd0-8d38-363e995b5af9_1336x757.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure: Working of Dynamo (Source: <a href="https://developer.nvidia.com/dynamo">NVIDIA</a>)</figcaption></figure></div><p><strong>3) Where this leaves Baseten, Fireworks, Together (and the Next-Wave Platforms)</strong></p><p>These teams didn&#8217;t get here by accident. They&#8217;ve lived in both worlds&#8212;<strong>inside the server</strong> (kernels, memory layouts, quantization) and <strong>across the fleet</strong> (batching, routing, cache reuse). <strong>Dynamo</strong> doesn&#8217;t erase that expertise; it <strong>repoints</strong> it. 
When the fleet&#8217;s &#8220;autopilot&#8221; becomes standard, the pitch shifts from <em>&#8220;our scheduler is smarter&#8221;</em> to <em>&#8220;our service turns state into speed&#8221;</em>. In practice that means packaging <strong>global KV</strong> (the model&#8217;s short term&#8209; memory) as a product with policy, isolation, and observability; presenting <strong>agent ready&#8209; APIs</strong> that make tool calls reliable; and turning <strong>goodput SLAs</strong> (flat Time to first token (TTFT) under long prompts, stable latency during tool loops) into the headline&#8212;because that&#8217;s what customers actually feel.</p><p>Each platform already has a lane:</p><p>&#183; <strong>Baseten</strong> wins on product surface and enterprise posture&#8212;Truss-style packaging, safe rollouts, autoscaling, and the ability to <em>&#8220;run in my VPC.&#8221;</em> It appeals to teams that want compliance and integration, not infrastructure tinkering.</p><p>&#183; <strong>Fireworks</strong> leans into the runtime and agent layer&#8212;OpenAI-compatible endpoints, structured outputs, and a latency/cost story developers can adopt in an afternoon. OpenRouter usage shows Fireworks already leading in <strong>tool-calling success (96% vs peers in the 90&#8211;93% range)</strong>, validating that this focus is translating into reliability customers notice.</p><p>&#183; <strong>Together</strong> plays the silicon-first card&#8212;early adoption of the newest GPUs, FlashAttention pedigree, and relentless speed on distributed training. It positions them as the open-model factory for ambitious runs, making them the most natural partner for large-scale infra.</p><p>None of that collides with Dynamo; if anything, Dynamo <strong>raises their floor</strong> so they can spend more calories on the parts customers touch&#8212;and less on re&#8209;implementing the same orchestration primitives.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TFHc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfea515f-d919-45f3-8f5c-30df89c7ec12_1432x823.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TFHc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfea515f-d919-45f3-8f5c-30df89c7ec12_1432x823.png 424w, https://substackcdn.com/image/fetch/$s_!TFHc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfea515f-d919-45f3-8f5c-30df89c7ec12_1432x823.png 848w, https://substackcdn.com/image/fetch/$s_!TFHc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfea515f-d919-45f3-8f5c-30df89c7ec12_1432x823.png 1272w, https://substackcdn.com/image/fetch/$s_!TFHc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfea515f-d919-45f3-8f5c-30df89c7ec12_1432x823.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TFHc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfea515f-d919-45f3-8f5c-30df89c7ec12_1432x823.png" width="1432" height="823" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bfea515f-d919-45f3-8f5c-30df89c7ec12_1432x823.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:823,&quot;width&quot;:1432,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TFHc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfea515f-d919-45f3-8f5c-30df89c7ec12_1432x823.png 424w, https://substackcdn.com/image/fetch/$s_!TFHc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfea515f-d919-45f3-8f5c-30df89c7ec12_1432x823.png 848w, https://substackcdn.com/image/fetch/$s_!TFHc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfea515f-d919-45f3-8f5c-30df89c7ec12_1432x823.png 1272w, https://substackcdn.com/image/fetch/$s_!TFHc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfea515f-d919-45f3-8f5c-30df89c7ec12_1432x823.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure: Fireworks is the best for tool calling (Source: OpenRouter)</figcaption></figure></div><p></p><p><strong>But there is still an unavoidable differentiator that infrastructure companies need to take into account.</strong> Standardization raises the floor, but it doesn&#8217;t erase the ceiling&#8212;there are key moves neoclouds and inference platforms can make to turn commodity capacity into defensible advantage. 
In the next sections, we explore what those moves are and recommend specific partnerships and strategies that can take these businesses to the next level&#8212;<strong>and why the boards of these companies should be thinking about this now, before the market prices it in.<br></strong></p>
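<p>To make the &#8220;tuned engine by default&#8221; point from Section 1 concrete, here is a minimal sketch of serving an open-weight model through vLLM, the kind of blessed-stack path described above. The model ID, sampling settings, and prompt are illustrative assumptions; hardware requirements and flags vary by release.</p><pre><code># Hedged sketch: offline generation with vLLM's Python API.
# Model ID and settings are illustrative; check the model card for requirements.
from vllm import LLM, SamplingParams

llm = LLM(model="openai/gpt-oss-20b")   # assumed repo id; any vLLM-supported model works
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = ["Explain, in two sentences, why KV-cache placement matters for serving."]
outputs = llm.generate(prompts, params)

for out in outputs:
    print(out.outputs[0].text)
</code></pre><p>The same model can also be exposed through vLLM&#8217;s OpenAI-compatible server; the differentiation argued for above lives a layer higher, in fleet routing and KV/state management, not in this call.</p>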
      <p>
          <a href="https://research.fpx.world/p/how-gptoss-is-silently-reshaping">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[What Happens to Datacenters When Smaller Models Start Solving Bigger Problems]]></title><description><![CDATA[An analysis of how the evolution towards more efficient, smaller AI models will impact datacenter infrastructure, compute demand, and the broader AI hardware ecosystem.]]></description><link>https://research.fpx.world/p/what-happens-to-datacenters-when</link><guid isPermaLink="false">https://research.fpx.world/p/what-happens-to-datacenters-when</guid><dc:creator><![CDATA[FPX AI]]></dc:creator><pubDate>Fri, 22 Aug 2025 13:19:37 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/c3fb0245-bdf8-426c-9eff-7dbc7cc72255_852x769.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A few days ago, an investor who finances large-scale datacenters asked us a simple question:</p><blockquote><p><em>"I saw something about a new model that runs on a laptop but gets state-of-the-art results on reasoning problems and puzzles. Does this change anything for us?"</em></p></blockquote><p>That question gets to the heart of what this note is about: <strong>What does the future of AI workloads look like, and what does it mean for the datacenter ecosystem?</strong></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://research.fpx.world/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">This Substack is reader-supported. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>If models continue getting smaller, faster to train, and easier to run, does demand for large-scale infrastructure fall off? 
Or does <strong>Jevons paradox</strong> kick in&#8212;where making compute cheaper and more efficient only drives usage even higher?</p><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kTb0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83346095-1497-46e6-9c20-731391e1cec6_989x611.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kTb0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83346095-1497-46e6-9c20-731391e1cec6_989x611.png 424w, https://substackcdn.com/image/fetch/$s_!kTb0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83346095-1497-46e6-9c20-731391e1cec6_989x611.png 848w, https://substackcdn.com/image/fetch/$s_!kTb0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83346095-1497-46e6-9c20-731391e1cec6_989x611.png 1272w, https://substackcdn.com/image/fetch/$s_!kTb0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83346095-1497-46e6-9c20-731391e1cec6_989x611.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kTb0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83346095-1497-46e6-9c20-731391e1cec6_989x611.png" width="989" height="611" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/83346095-1497-46e6-9c20-731391e1cec6_989x611.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:611,&quot;width&quot;:989,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;AI Compute Jevons Paradox Chart&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="AI Compute Jevons Paradox Chart" title="AI Compute Jevons Paradox Chart" srcset="https://substackcdn.com/image/fetch/$s_!kTb0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83346095-1497-46e6-9c20-731391e1cec6_989x611.png 424w, https://substackcdn.com/image/fetch/$s_!kTb0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83346095-1497-46e6-9c20-731391e1cec6_989x611.png 848w, https://substackcdn.com/image/fetch/$s_!kTb0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83346095-1497-46e6-9c20-731391e1cec6_989x611.png 1272w, https://substackcdn.com/image/fetch/$s_!kTb0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83346095-1497-46e6-9c20-731391e1cec6_989x611.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" 
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Fig:</strong> As AI gets cheaper, total usage (and compute demand) explodes exponentially<br><strong>Source:</strong> FPX AI</em></p><p>The AI industry has been built on a simple premise: bigger models yield better results. For years, the race has been to scale up: more parameters, more data, more compute. However, a paradigm shift is underway. Smaller, more efficient models are beginning to solve problems that were once the exclusive domain of their larger counterparts. This evolution raises a critical question: <strong>What happens to the massive datacenter infrastructure built to support these computational giants when the future belongs to the efficient and compact?</strong></p><p>This transformation isn't just about model architecture. It's about fundamentally reimagining how we approach AI infrastructure, resource allocation, and the economics of artificial intelligence. As we stand at this inflection point, understanding the implications for datacenters, cloud providers, and the broader AI ecosystem becomes paramount.</p><h2><strong>The Hierarchical Reasoning Model: Rethinking Where the Compute Goes</strong></h2><p>Let's start with the model that sparked the question. The model in question is the <strong>Hierarchical Reasoning Model (HRM)</strong>, introduced in a recent paper by <a href="https://arxiv.org/pdf/2506.21734">Wang et al. (2024)</a>. It is modelled after the human brain, it's a compact, 27-million-parameter model&#8212;small enough to run on a laptop. But in a series of symbolic reasoning tasks that typically baffle even massive LLMs, HRM matches or beats them&#8212;<strong>without any pretraining, and using just a few hundred examples per task.</strong></p><h3><strong>So what makes it different?</strong></h3><p>HRM abandons the typical transformer approach of processing everything in a single pass. Instead, it's built around a simple but powerful principle: <strong>not all reasoning should happen at the same speed or level of abstraction</strong>. 
It splits computation into two loops:</p><ul><li><p>A <strong>slow, high-level module</strong> that plans and sets the context</p></li><li><p>A <strong>fast, low-level module</strong> that iterates on subproblems and refines solutions</p></li></ul><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/12c4fd59-68af-4705-9d13-2238edd1796f_910x617.png" width="910" height="617" alt="HRM Brain Architecture Diagram" title="HRM Brain Architecture Diagram"></figure></div><p><em><strong>Fig:</strong> HRM is inspired by hierarchical processing and temporal separation in the brain. It has two recurrent networks operating at different timescales to collaboratively solve tasks.<br><strong>Source:</strong> <a href="https://arxiv.org/pdf/2506.21734">Hierarchical Reasoning Model</a></em></p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/2f98eaab-ab8a-434a-b34e-2caf901780d7_891x360.png" width="891" height="360" alt="HRM Performance Comparison Chart" title="HRM Performance Comparison Chart"></figure></div><p><em><strong>Fig:</strong> The HRM (~27M parameters) outperforms state-of-the-art chain-of-thought models on tough benchmarks like ARC-AGI, Sudoku-Extreme, and Maze-Hard&#8212;where those models failed.
It was trained from scratch and solved tasks directly, without using chain-of-thought reasoning.<br><strong>Source:</strong> <a href="https://arxiv.org/pdf/2506.21734">Hierarchical Reasoning Model</a></em></p><p>These two modules interact in a loop, feeding results back and forth until the model "converges" on an answer. This mimics how humans often solve hard problems&#8212;trying something, stepping back, adjusting the plan, and trying again.</p><p>What's elegant here is that the model isn't bigger; it's deeper in time. <strong>Instead of stacking more layers, it loops smarter.</strong> This allows HRM to "think" for longer on hard problems and less on easy ones, without blowing up memory or compute (see the illustrative sketch below).</p><h3><strong>Hardware Implications</strong></h3><p>From first principles, this matters because it <strong>reshapes how we measure model capability</strong>. Instead of more parameters or more tokens, performance can come from <strong>more internal steps</strong>, more efficiently executed. That has <strong>serious implications for hardware</strong>.</p><p>The workloads look different: less like giant matrix multiplications (which GPUs are good at), more like <strong>recurrent, latency-sensitive programs</strong>. <strong>Hardware optimized for streaming, tight memory access, or fast feedback cycles may suddenly have an edge.</strong></p><p><strong>HRM is far from general-purpose. It doesn't replace LLMs or handle open-ended language tasks. But it shows that reasoning&#8212;long considered the domain of massive models&#8212;can be re-architected. And if that happens at scale, the entire shape of compute demand could shift.</strong></p><p><em>The bottom line is that algorithmic innovation won't stop, and neither will the demand for AI workloads; the winners will be the companies that can act fast across the stack.</em></p><h2><strong>The Future of AI Workloads Is Hybrid&#8212;and Smaller Than You Think</strong></h2><p>If training drove the last wave of AI infrastructure buildout, <strong>inference will drive the next.</strong> And inference&#8212;especially for robotics and real-world AI applications&#8212;is increasingly moving toward <a href="https://research.nvidia.com/labs/lpr/slm-agents/">smaller, more capable models</a> that don't need hyperscale clusters to operate.</p><p>We're seeing a quiet shift. <strong>Small models are getting more sophisticated</strong>&#8212;architecturally deeper, more adaptive, more reasoning-capable&#8212;and they can increasingly run on edge devices, laptops, or compact datacenter instances. That makes them natural candidates for <strong>Physical AI</strong>, a term we use to describe the entire class of embodied agents, sensors, robots, and autonomous machines interacting with the physical world.</p><p><strong>By the end of the decade, inference related to Physical AI will likely dominate total compute usage, simply because these systems will be running continuously and everywhere&#8212;from home robots to warehouse automation to autonomous industrial systems.</strong></p><p>Yet while models are shrinking in size, <strong>scaling laws aren't going anywhere</strong>. There will still be value in pushing large, centralized models to new heights. What's changing is <strong>how and where</strong> those models are used.</p>
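<p><em>To make the "loops smarter, not bigger" idea concrete, here is a minimal, illustrative sketch of an HRM-style two-timescale loop in Python. It is not the paper's implementation: the module sizes, random weight matrices, update rule, and stopping threshold below are placeholder assumptions chosen only to show the control flow of a slow planner wrapped around a fast refiner.</em></p><pre><code># Illustrative sketch of an HRM-style two-timescale loop (not the paper's code).
# A slow "planner" state is revised once per cycle; a fast "worker" state is
# refined several times per cycle. Iteration stops when the plan stabilizes,
# so harder inputs get more internal steps than easy ones.
import numpy as np

rng = np.random.default_rng(0)
D = 64                                             # hidden size (arbitrary here)
W_slow = rng.normal(scale=0.1, size=(D, 2 * D))    # slow-module weights (stand-ins)
W_fast = rng.normal(scale=0.1, size=(D, 2 * D))    # fast-module weights (stand-ins)

def step(W, state, context):
    # One recurrent update: new_state = tanh(W @ [state; context])
    return np.tanh(W @ np.concatenate([state, context]))

def hrm_like_inference(x, k_fast=8, max_cycles=16, tol=1e-3):
    z_slow = np.zeros(D)                           # high-level "plan" state
    z_fast = np.zeros(D)                           # low-level "working" state
    for cycle in range(max_cycles):
        for _ in range(k_fast):                    # fast loop under the current plan
            z_fast = step(W_fast, z_fast, z_slow + x)
        new_slow = step(W_slow, z_slow, z_fast)    # slow loop revises the plan
        delta = float(np.linalg.norm(new_slow - z_slow))
        z_slow = new_slow
        if delta &lt; tol:                          # converged: stop spending compute
            break
    return z_slow, cycle + 1                       # answer embedding + cycles used

answer, cycles_used = hrm_like_inference(rng.normal(size=D))
print(f"stopped after {cycles_used} slow cycles")
</code></pre><p><em>The convergence check is the part that matters for hardware: easy inputs exit after a few cycles while hard inputs keep looping, so compute per query becomes variable, recurrent, and latency-sensitive rather than one fixed block of large matrix multiplications.</em></p>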
<p>We see <strong>three clear archetypes emerging</strong> in Physical AI deployment:</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/d675ba56-f4d2-401b-b32c-f82fc0f370c2_2084x486.png" width="2084" height="486" alt=""></figure></div><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/42a9d230-7219-4321-b070-a21423046825_1597x759.jpeg" width="1597" height="759" alt="Hybrid AI Deployment Models" title="Hybrid AI Deployment Models"></figure></div><p><em><strong>Fig:</strong> Three deployment models for AI workloads: Cloud Only, Local Only, and Hybrid approaches, each with their own advantages and use cases.<br><strong>Source:</strong> FPX AI</em></p><h2><strong>What This Means for the Infrastructure Market</strong></h2><p>Serving these hybrid and cloud-first models is where <strong>Neoclouds have the edge</strong>.</p><p>Gone is the model of a single hyperscaler tenant consuming 80% of your datacenter. The new era demands <strong>multi-client, multi-hardware flexibility</strong>&#8212;supporting dozens of AI-native customers, each with their own architecture preferences, latency needs, and usage spikes.</p><h4><strong>The operators who thrive will be the ones who:</strong></h4><ul><li><p>Design for <strong>shorter, bursty inference jobs</strong>, not just long training runs</p></li><li><p>Offer <strong>hardware optionality</strong>, not just NVIDIA GPUs&#8212;some clients will want AMD, FPGAs, CPUs, or future inference-specific ASICs</p></li><li><p>Build <strong>low-latency, urban-proximate datacenters</strong> near major metros, where Physical AI agents live and act</p></li></ul><p>That means <strong>infrastructure is becoming a demand-shaping problem</strong>, not just a supply problem. You're no longer building the biggest possible box&#8212;you're building the <strong>right box in the right place, with the right routing intelligence.</strong></p><p>This is where firms like ours are focused. We help colocation providers identify the best inference-grade locations, advise Neoclouds on how to think about next-gen data center design, and help them build teams that can support not just large GPU clusters, but also hybrid workloads, robotic deployments, and edge-cloud architectures.</p><p>The future of AI workloads isn't just about model size. It's about distribution, specialization, and responsiveness. And the datacenter strategy that wins will be the one designed to match.</p><h2><strong>The AI Infrastructure Shift: How to Win if You Build, Fund, or Operate It</strong></h2><h3><strong>For Datacenter Operators: Verticalize, Specialize, and Move Fast</strong></h3><p>If you do not already have a large-scale training client, or the scale of large neoclouds like CoreWeave or Crusoe, chasing large-scale training clients is a losing game.
The better path is to <strong>go vertical</strong>: pick a high-value niche&#8212;robotics, vision QA, healthcare AI&#8212;and build around it. That means low-latency colocation sites near major metros, ideally on diverse fiber paths for minimal RTT. If you're offering bare metal, make sure it's production-ready: full observability, Slurm or Kubernetes orchestration, flexible and transparent SLAs.</p><p>Differentiate with real benchmarks&#8212;training vs. inference, single-node vs. multi-node&#8212;and support hardware isolation (MIG, SR-IOV, DPUs) to ensure multi-tenant stability. Add compliance features buyers care about&#8212;HIPAA, PCI, data residency&#8212;and be ready to answer questions in $/task, not just $/GPU.</p><p>Your edge won't come from raw silicon&#8212;it'll come from the <strong>software stack and services</strong> you wrap around it. Either build or partner to deliver tooling that solves real user problems. Talk to buyers about their workloads and help them pick the right hardware for their task, not just the most expensive. Use marketplaces to monetize excess capacity, and staff teams who understand the space from first principles&#8212;because in an environment where new chips and architectures emerge constantly, the first operators who adapt will win the margin.</p><p><strong>FPX Consulting</strong> works with Neoclouds to guide you through exactly what buyers are looking for&#8212;from hardware procurement and colo expansion to identifying the right sites and building out your stack. We'd love to help.</p><h3><strong>For Colocation &amp; Power Developers: Designing Datacenter Portfolios That Sell</strong></h3><p>The most valuable asset in today's datacenter market isn't just power&#8212;it's <strong>power that's actually deliverable today</strong>. For colocation operators, site selection is no longer just about future power expansion or land area&#8212;it's about how quickly you can stand up compute that generates revenue. The highest premiums today go to sites that offer immediate, energizable power, diverse long-haul fiber access, water rights that future-proof cooling, and proximity to a Tier 1 or 2 metro for easy access to talent.</p><p>Think of your portfolio in tiers. Your <strong>Tier A sites</strong> should be metro-adjacent with energization timelines under 12 months. These are your go-to inference hubs: 15&#8211;50 MW sites that serve latency-sensitive clients. Your <strong>Tier B sites</strong> should have substation pads poured, transmission interconnects in motion, and a clear path to 50&#8211;200 MW over 24&#8211;36 months. These campuses become your long-term scale plays&#8212;especially if they're near robotics, biotech, or AI manufacturing hubs.</p><p>Great sites combine strong shells and physical security with dual utility feeds or ring bus potential, true fiber diversity (not shared ROW), access to reclaimed or permitted water, and zoning that enables fast-track development. Great operators match this with conversion-ready buildouts&#8212;flexible power distribution, busways, and cooling that allow rapid swings between training and inference workloads.</p><p>As more colocation supply comes online, standing out will require more than just MW and square footage. Partnering with operators or platforms that offer true specialization and IP&#8212;like Colovore, who deliver high power and cooling density per rack and support diverse hardware types beyond just NVIDIA&#8212;can give your portfolio an edge.
Tenants are becoming more sophisticated, and differentiated technical capabilities will drive faster absorption and longer-term value.</p><p>At <strong>FPX</strong>, we help operators and investors design and build high-performance portfolios across Tier A and Tier B assets. We also assess, validate, and help market existing sites&#8212;whether to Hyperscalers, Neoclouds, or specialized buyers looking for GPU-ready infrastructure. If you're siting a new facility or repositioning an old one, we'd love to help.</p><h3><strong>For Investors: Finding the Edge in an Evolving AI Infra Market</strong></h3><p>If you're an investor financing datacenters or colocation projects, the next decade of returns won't come from chasing hyperscaler training deals&#8212;they'll come from backing operators who know how to monetize <strong>specialized, low-latency, hybrid inference</strong>. The winners will be teams that think from first principles, operate metro-adjacent sites with real power and fiber, and deploy workload-driven infrastructure, not just racks of GPUs.</p><p>The biggest opportunities right now are often hidden in distress&#8212;stranded campuses with undeliverable power, failed single-tenant plays, or sites stalled by long-lead substation delays&#8212;that can be converted into multi-tenant inference hubs with the right upgrades. Your portfolio should combine <strong>Tier A revenue sites</strong> (energization &lt;12 months, 200&#8211;500kW pods, heterogeneous-ready) with <strong>Tier B growth assets</strong> (substation pads poured, water secured, GIAs in motion).</p><p>Red flags are everywhere: fake fiber diversity, "paper megawatts," and operators that can't benchmark $/task or latency.</p><p><strong>FPX Consulting</strong> works directly with investors to source off-market deals, conduct deep power and fiber due diligence, and help structure portfolios that reflect where AI infrastructure is actually headed&#8212;not where it used to be. We'd love to help.</p>]]></content:encoded></item><item><title><![CDATA[The Bifurcation of the AI Cloud Compute Market]]></title><description><![CDATA[This report examines how the GPU-centric cloud market is diverging into distinct tiers of providers, each capturing value in different ways.
We analyze global supply/demand trends]]></description><link>https://research.fpx.world/p/the-bifurcation-of-the-ai-cloud-compute</link><guid isPermaLink="false">https://research.fpx.world/p/the-bifurcation-of-the-ai-cloud-compute</guid><dc:creator><![CDATA[FPX AI]]></dc:creator><pubDate>Fri, 22 Aug 2025 01:31:16 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/83187465-fec3-4a62-a182-e6ba7d501804_926x771.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Introduction and Key Findings</h2><p>AI compute demand is exploding, straining global data center capacity to its limits. Occupancy rates for third-party data centers sit at record highs, and overall power demand from AI workloads is surging (~97% CAGR for AI datacenter capacity 2022&#8211;2026). This report examines how the GPU-centric cloud market &#8211; the backbone of AI &#8211; is diverging into distinct tiers of providers, each capturing value (or struggling to) in different ways. We define four main categories of GPU compute sellers, analyze global supply/demand trends, and highlight a growing divide in fortunes:</p><ul><li><p><strong>Hyperscalers</strong> (AWS, Google, Microsoft, Oracle) enjoy massive internal AI workloads and booming enterprise demand, investing unprecedented sums (over $190 billion in AI infrastructure capex in 2024 and projected to spend over $325 billion in 2025) to maintain dominance.</p></li><li><p><strong>"NeoClouds"</strong> (specialized GPU cloud upstarts like CoreWeave, Crusoe, Lancium) are growing at triple-digit rates (400-750% YoY), leveraging better price-performance and fast provisioning to win overflow demand from hyperscalers and cost-conscious AI labs.</p></li><li><p><strong>Marketplaces</strong> (Runpod, SFCompute, Compute Exchange, Vast.ai, etc.) aggregate spare GPUs from various owners, offering bargain prices &#8211; but they rarely land large enterprise deals and face a glut of supply <a href="https://fpx.world/market-education/4">(H100 rental prices crashed from $8/hr to ~$2/hr in late 2024 amid oversupply)</a>.</p></li><li><p><strong>Bare-Metal GPU Datacenters</strong> (colos or enterprises with GPUs but no cloud software stack) often sit on underutilized hardware. Without robust orchestration platforms or large sales teams, these operators struggle to monetize their GPUs, resorting to reselling capacity via third-party marketplaces.</p></li></ul><p>AI is taking over the datacenter. AI workloads now account for ~13% of global datacenter demand and are on track to hit 28% by 2027, doubling in just two years. Total demand (&#8776;62 GW today) is set to grow 50%+ by 2027, outpacing supply and keeping markets tight. However, only a few winners will capitalize on this wave of AI demand. The companies that spot these trends early and act fast will have major upside.</p><p>Our analysis finds a bifurcated market: Hyperscalers and some NeoClouds like CoreWeave have been able to capture outsized growth (fed by the AI boom and their agility), while unspecialized hosters fall behind with very low utilization. Enterprises themselves are increasingly "verticalizing" AI compute, with firms like XTX (quant trading) building huge private GPU clusters (25,000+ GPUs) to avoid cloud constraints. The result is a rapidly evolving landscape with new opportunities &#8211; and risks &#8211; for investors and buyers to navigate.</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/1ad23c94-be75-4fbc-9991-3a831aa41997_836x692.jpeg" width="836" height="692" alt="GPU Cloud Market Matrix - Four Types of Sellers" title="GPU Cloud Market Matrix - Four Types of Sellers"></figure></div><p><em>Figure: The GPU Cloud Market Matrix - Four types of sellers competing for AI compute demand</em></p><h2>1. GPU Cloud Ecosystem: Four Types of Sellers</h2><h3><em><strong>a. Hyperscalers: The AI Superpowers</strong></em></h3><p>Amazon, Microsoft, Google, and Oracle dominate global AI compute with <strong>millions of GPUs</strong>, custom silicon (TPUs, Trainium), and <strong>nearly $200B in AI infra capex in 2024</strong>. Microsoft alone bought <strong>~485,000 GPUs</strong>, targeting <strong>1.8M live by year-end</strong>; Meta is at <strong>~600,000 GPUs</strong>.</p><p>But they aren't just powering external clients&#8212;they <strong>consume much of this compute themselves</strong>, training frontier models (like GPT-4-class systems), running recommendation engines, and deploying generative AI across their own products.</p><p>Public cloud AI services like <strong>Azure OpenAI</strong> and <strong>Google Vertex AI</strong> are growing fast (<strong>~20% YoY</strong>), but internal workloads remain a major driver of hyperscaler demand.</p><p>Their scale, global reach, and full-stack platforms make them the <strong>default AI backbone</strong> for enterprises.
Yet cracks are emerging: <strong>slow provisioning, vendor lock-in, and opaque pricing</strong> are pushing some users toward more nimble alternatives.</p><h3><em><strong>b. NeoClouds: Specialized GPU Clouds for AI/HPC</strong></em></h3><p>CoreWeave, Crusoe, Lambda Labs, and Lancium represent a fast-rising class of specialized GPU cloud providers built for AI and HPC. They deploy <strong>state-of-the-art Nvidia GPUs</strong> (B200s/B300s, H200s, etc.) with <strong>robust software stacks</strong> (Kubernetes, container orchestration, ML frameworks) that rival Big Cloud&#8212;but at <strong>50-70% lower prices</strong> and with <strong>faster provisioning</strong>. CoreWeave alone scaled from <strong>~53,000 to 250,000+ GPUs in 2024</strong>, hitting <strong>$1.9B in revenue</strong> (up <strong>730% YoY</strong>) and growing <strong>~5&#215; again in Q1 2025</strong>. A $7.5B Nvidia pre-buy gave it early access to H200s and let it capture hyperscaler overflow.</p><p>NeoClouds win by pairing <strong>hardware at scale</strong> with a <strong>developer-first experience</strong>: intuitive UX, fast onboarding, first-mover access to SOTA hardware, and real-time support via Slack/Discord.</p><p><strong>But not all survive.</strong> Those lacking a software stack or support layer <strong>are struggling</strong>&#8212;<strong>GPUs alone aren't enough</strong>. With hardware costs high and capital tight, <strong>the gap between winners and distressed players is widening fast.</strong></p><p>The Neoclouds lack the hyperscalers' economies of scale in data center operation &#8212; their unit costs for hardware, power, and networking are somewhat higher &#8212; so margins are thin. NeoClouds run on razor-thin economics: higher unit costs and heavy debt mean they survive only if utilization stays &gt;50% and revenue isn't dominated by a single client. Yet investors still assign multi-billion valuations because agile players are grabbing hyperscaler-sized wins&#8212;witness Crusoe's 200 MW, 100k-GPU Texas campus pre-leased to a Fortune 100 customer. The eventual winners will be those that <strong>lock in abundant power at the right locations and a diversified customer mix</strong>; <strong>over-leveraged, single-tenant NeoClouds risk a hard fall.</strong></p><h3><em><strong>c. Marketplaces: GPU Compute Brokers</strong></em></h3><p>Platforms like <strong>Vast.ai, RunPod, and SFCompute</strong> aggregate spare GPU capacity&#8212;from miners, hobbyists, and small data centers&#8212;and rent it out, usually to smaller buyers or individual researchers (often <strong>70-80% cheaper</strong> than hyperscalers). Some try acting as an external salesforce to land larger clients, but those deals are harder to come by.</p><p>They thrived during the 2023 GPU crunch but mostly serve <strong>small, price-sensitive users</strong>, not enterprises; fragmented supply and weak SLAs make them <strong>ill-suited for mission-critical AI</strong>. <a href="https://fpx.world/market-education/6">Prices have since collapsed (e.g. </a><strong><a href="https://fpx.world/market-education/6">H100s &lt; $2/hr</a></strong><a href="https://fpx.world/market-education/6">)</a>, leaving many sites recycling the same excess inventory while marketplaces compete for the same customers.
A lack of clear SLAs, support, and transparency about where the hardware actually sits makes it hard for larger buyers to trust most marketplaces.</p><p>The marketplaces that survive will be those that bundle easy-to-use software, pooled purchasing power, and community support&#8212;effectively turning thousands of micro-buyers into one "virtual enterprise" big enough to matter.</p><h3><em><strong>d. Bare-Metal GPU Datacenters (No Platform)</strong></em></h3><p>Some colos, telcos, and ex-crypto miners crammed racks with GPUs but never built a cloud-grade software layer. Selling raw "bare-metal" compute&#8212;no orchestration, support, or developer tools&#8212;forces customers to bring their own stack, so enterprise and larger AI buyers look elsewhere. Many of these operators now dump surplus GPUs onto marketplaces, but rock-bottom prices and sporadic demand rarely deliver the revenue or utilization they need. Worse, several run in facilities that fall short of Tier III reliability, further limiting enterprise appeal. Unless they add a turnkey software layer or partner with a NeoCloud, they'll remain low-margin "arms suppliers" in an AI boom that increasingly rewards full-service platforms.</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/6c8c2ca5-8ebe-4bc8-8ed2-a23e4efc900e_2060x464.png" width="2060" height="464" alt=""></figure></div><h2>2. Macro Trends Shaping the Datacenter Market</h2><p>The AI infrastructure market is undergoing rapid transformation&#8212;driven not just by raw demand, but by <em>where</em> and <em>how</em> that demand manifests. Below are the seven defining trends reshaping the landscape:</p><h3><em><strong>a. Inference Tsunami</strong></em></h3><p>While training gets the headlines, the real compute demand is in inference. Once a model is deployed, <strong>over 90% of its lifetime compute is spent on test-time use</strong>&#8212;whether it's powering chatbots, search ranking, ad targeting, or robotics. With the rise of <a href="https://fpx.world/market-education/8">autonomous agents</a>, copilots, and smart devices, inference FLOPs are projected to <strong><a href="https://fpx.world/market-education/9">outpace training by 15-20&#215; within a few years</a></strong>. The result: clouds optimized for <strong>low-latency inference</strong>, high memory bandwidth, and fast autoscaling will win the next wave of spend.</p><h3><em><strong>b. Latency-Critical Workloads Are the Fastest-Growing Segment</strong></em></h3><p>AI inference requires millisecond response times. That makes <strong>location critical</strong>&#8212;power alone isn't enough. Texas, for example, has cheap power but many sites sit <strong>15-20 ms from major metros</strong>, too slow for real-time AI. <strong>Fiber-rich, metro-adjacent data centers</strong> (e.g., Northern Virginia, Santa Clara, North Jersey, North Carolina, Ohio) now command premiums. The new standard is: <strong>power + latency + connectivity</strong>.</p><h3><em><strong>c. 
Enterprise Self-Build Surge</strong></em></h3><p>NeoClouds expected large enterprise demand, but many top AI buyers are going <strong>direct-to-hardware</strong>. Firms like XTX Markets and JPMorgan are building their own private GPU clusters for <strong>security, control, and cost</strong>. Owning customized hardware gives enterprises more bargaining power against the hyperscalers, future-proofs them against macroeconomic swings, and keeps sensitive IP off other companies' servers. This trend is stripping away the most stable demand from NeoClouds, who now face a tougher fight for customers.</p><h3><em><strong>d. One-Buyer Risk</strong></em></h3><p>A single anchor tenant can drive massive short-term growth&#8212;but also existential risk. Some NeoClouds have already been <strong>crippled by the exit of a major client</strong>, left holding financed GPUs and long-term leases. Investors and lenders now demand <strong>customer diversification and take-or-pay commitments</strong> to de-risk the model.</p><h3><em><strong>e. Cloud is Still a Buyer's Market</strong></em></h3><p><a href="https://fpx.world/market-education/6">GPU prices have collapsed&#8212;</a><strong><a href="https://fpx.world/market-education/6">H100s dropped from &gt;$8/hr to &lt;$2/hr</a></strong><a href="https://fpx.world/market-education/6"> in 12 months</a>&#8212;as supply caught up with demand. Buyers have choices. In this crowded landscape, only clouds with <strong>real differentiation</strong>&#8212;sovereign compliance, better UX, integrated MLOps, or unbeatable economics&#8212;will stand out.</p><h3><em><strong>f. The AI Site Playbook: Five Traits That Signal Long-Term Value</strong></em></h3><p>A truly premium AI datacenter must have all five:</p><ol><li><p><strong>&gt;100 MW of reliable power</strong></p></li><li><p><strong>Dual, long-haul fiber connectivity</strong></p></li><li><p><strong>Abundant water or advanced cooling systems</strong></p></li><li><p><strong>&lt;15 ms latency to a major metro</strong></p></li><li><p><strong>Permitting/tax clarity</strong></p></li></ol><p>Fewer than <strong>5% of advertised global "datacenter sites"</strong> check all five boxes. Hyperscalers and smart infra funds are buying them now&#8212;before prices spike.</p><h3><em><strong>g. Sovereign AI is Inflating Regional Demand</strong></em></h3><p>Governments are now major compute buyers. Europe has launched a <strong>&#8364;20 billion "AI Gigafactory" plan</strong>; Gulf states and India are buying <strong>hundreds of thousands of GPUs</strong> to build national AI capacity. These sovereign projects often pay a premium, <strong>distorting regional markets and squeezing commercial buyers</strong> in those zones.</p><h2>3. Market Bifurcation: Winners and Strugglers</h2><p>The AI compute boom isn't lifting all boats&#8212;it's separating winners from the rest.</p><h3><em><strong>a. Hyperscalers: Dominant and Self-Sustaining</strong></em></h3><p>Hyperscalers (AWS, Azure, Google, Meta) remain the backbone of AI compute, powered by <strong>insatiable internal workloads</strong> and growing enterprise demand. Over <strong>80% of global AI capacity</strong> still runs through them. Their scale enables <strong>custom chips, aggressive pricing</strong>, and the ability to absorb excess GPU supply internally. They're not just renting compute&#8212;they're <strong>training the biggest models in-house</strong>, justifying massive capex no one else can match.
The rich are getting richer.</p><h3><em><strong>b. NeoClouds: Fast-Growing but High-Risk</strong></em></h3><p>NeoClouds (CoreWeave, Crusoe, Lancium) filled gaps the hyperscalers left&#8212;faster provisioning, lower prices, better UX. They've grown <strong>10&#215; in revenue</strong>, winning startups and overflow from even Meta and OpenAI. But they're heavily debt-financed and <strong>vulnerable to one-buyer risk</strong>. Many bet on big enterprise demand that never came, while hyperscalers are now pushing back with lower prices and reserved capacity. NeoClouds are winning&#8212;for now&#8212;but running hot with <strong>thin margins and no room for error</strong>.</p><h3><em><strong>c. Marketplaces: Clearing Houses, Not Cloud Platforms</strong></em></h3><p>GPU marketplaces (Vast.ai, RunPod) serve the long tail&#8212;<strong>hobbyists and indie devs</strong>, not Fortune 500s. Prices fell from <strong>$8/hr to &lt;$2/hr</strong>, signaling oversupply. Marketplaces are <strong>useful for absorbing idle capacity</strong>, but lack trust, support, and SLAs to win large buyers. Without differentiation or scale, they're stuck in <strong>margin compression</strong> and buyer churn.</p><h3><em><strong>d. Bare-Metal Datacenters: Stranded Without Software</strong></em></h3><p>Owning GPUs isn't enough&#8212;<strong>AI buyers want full-stack solutions</strong>. Many bare-metal datacenters lack orchestration, developer tools, or proper Tier III facilities. They're forced to dump capacity on marketplaces, earning little. Some may survive by <strong>leasing wholesale to hyperscalers or partnering with NeoClouds</strong>, but most face underutilization and consolidation risk.</p><p>The bottom line is that the market is bifurcating. Winners deliver speed, support, software, and strategic positioning. Strugglers are left with idle hardware and no value-add. 
The new currency is <strong>integration and differentiation</strong>&#8212;not just raw compute.</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/9b24bf48-1bea-4adc-84af-76fbf1361a31_1189x890.png" width="1189" height="890" alt="Market Leaders and Emerging Players in the GPU Cloud Ecosystem" title="Market Leaders and Emerging Players in the GPU Cloud Ecosystem"></figure></div><p><em>Figure: Market positioning - Leaders driving growth vs. players facing challenges in the evolving GPU cloud landscape</em></p><h2>4. Our Strategic Recommendations</h2><p>Here's our advice for NeoClouds, enterprises, colo operators, and investors navigating this shifting market:</p><h3><em><strong>a. NeoClouds: Specialize and Stabilize</strong></em></h3><p>Competing on price alone won't cut it. The winners will specialize by owning vertical use cases&#8212;biotech, quant finance, media rendering, robotics, etc.&#8212;and offering <strong>tailored software, SLAs, and developer tooling</strong>.</p><p><strong>Avoid the one-buyer trap</strong>: the market has evolved, and the risk this strategy poses is simply too high. Build a diverse client base of smaller, consistent users with the right support stack, orchestration layer, and onboarding experience. If you're missing these, partner or license to get there faster.
We would love to work with you to refine this strategy and find buyers or partners.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iL8N!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379ee8b6-3907-4916-ba91-d39824adbc3d_1338x989.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iL8N!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379ee8b6-3907-4916-ba91-d39824adbc3d_1338x989.png 424w, https://substackcdn.com/image/fetch/$s_!iL8N!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379ee8b6-3907-4916-ba91-d39824adbc3d_1338x989.png 848w, https://substackcdn.com/image/fetch/$s_!iL8N!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379ee8b6-3907-4916-ba91-d39824adbc3d_1338x989.png 1272w, https://substackcdn.com/image/fetch/$s_!iL8N!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379ee8b6-3907-4916-ba91-d39824adbc3d_1338x989.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iL8N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379ee8b6-3907-4916-ba91-d39824adbc3d_1338x989.png" width="1338" height="989" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/379ee8b6-3907-4916-ba91-d39824adbc3d_1338x989.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:989,&quot;width&quot;:1338,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Tower of Differentiation for NeoClouds&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Tower of Differentiation for NeoClouds" title="Tower of Differentiation for NeoClouds" srcset="https://substackcdn.com/image/fetch/$s_!iL8N!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379ee8b6-3907-4916-ba91-d39824adbc3d_1338x989.png 424w, https://substackcdn.com/image/fetch/$s_!iL8N!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379ee8b6-3907-4916-ba91-d39824adbc3d_1338x989.png 848w, https://substackcdn.com/image/fetch/$s_!iL8N!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379ee8b6-3907-4916-ba91-d39824adbc3d_1338x989.png 1272w, https://substackcdn.com/image/fetch/$s_!iL8N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379ee8b6-3907-4916-ba91-d39824adbc3d_1338x989.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
<p><em>Figure: Tower of Differentiation - Strategic layers NeoClouds must build to compete effectively</em></p><h3><em><strong>b. Enterprise Buyers: Own vs Rent, But Optimize Either Way</strong></em></h3><p>If you're running 24/7 inference or training massive models, <strong>building your own cluster</strong> might make sense. We're seeing firms like XTX and JPMorgan go this route to <strong>control costs, preserve full privacy of IP, hedge against GPU rental prices rising again, and cut latency</strong>.</p><p>Not ready to own yet? You can still <strong>negotiate better contracts, pursue hybrid setups, or consider managed dedicated clusters</strong>. We help AI companies evaluate TCO, select sites, and match with the right providers&#8212;without overpaying for suboptimal infra. We can even match you with companies that offer datacenter building as a service&#8212;so you don't overpay or overbuild.</p>
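<p>To make the own-vs-rent math concrete, here is a minimal back-of-the-envelope sketch of the kind of TCO comparison described above. Every input (cluster size, capex per GPU, power price, rental rate, utilization) is a hypothetical assumption for illustration, not a quote.</p><pre><code># Illustrative own-vs-rent sketch for a GPU cluster; all inputs are hypothetical.

GPUS              = 1024      # cluster size
CAPEX_PER_GPU     = 40_000    # $ per GPU, servers and networking amortized in
DEPREC_YEARS      = 4         # straight-line depreciation horizon
POWER_KW_PER_GPU  = 1.2       # IT load plus cooling overhead per GPU
POWER_PRICE_KWH   = 0.07      # $ per kWh, all-in
OPEX_PER_GPU_YR   = 3_000     # colo space, staff, maintenance per GPU-year
RENTAL_PER_GPU_HR = 2.50      # $ per GPU-hour on a rented cloud
UTILIZATION       = 0.85      # fraction of hours the fleet is actually busy
HOURS_YR          = 8760

def own_cost_per_gpu_hour():
    deprec = CAPEX_PER_GPU / DEPREC_YEARS
    power = POWER_KW_PER_GPU * POWER_PRICE_KWH * HOURS_YR
    yearly = deprec + power + OPEX_PER_GPU_YR
    # Owned capacity costs the same whether busy or idle, so divide by useful hours.
    return yearly / (HOURS_YR * UTILIZATION)

def rent_cost_per_gpu_hour():
    # Renting is roughly pay-per-use, so utilization drops out of the unit cost.
    return RENTAL_PER_GPU_HR

own, rent = own_cost_per_gpu_hour(), rent_cost_per_gpu_hour()
print(f"own : ${own:.2f} per useful GPU-hour")
print(f"rent: ${rent:.2f} per GPU-hour")
print("owning wins" if rent > own else "renting wins",
      f"at {UTILIZATION:.0%} utilization for a {GPUS}-GPU cluster")
</code></pre><p>Under these assumptions owning wins at 85% utilization, but rerun the same numbers at roughly 50% utilization and the owned cluster costs more per useful GPU-hour than renting, which is why we pressure-test utilization and demand channels before anyone commits capex.</p>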
<h3><em><strong>c. Colo Operators: Add Value or Get Bypassed</strong></em></h3><p>Hardware alone isn't enough&#8212;<strong>enterprises want more than rackspace</strong>. Without orchestration or developer tools, your GPUs will stay idle or end up listed on marketplaces with no pricing power. Instead, consider layering managed services or aligning with NeoCloud players to become part of their ecosystem. If your site has <strong>Tier III+ infra, fiber and water access, abundant power, and close proximity to a city</strong>, there's real opportunity. If not, we can help assess and reposition.</p><h3><em><strong>d. Investors: Follow Utilization, Not Hype</strong></em></h3><p>Distressed GPU owners, stranded mining sites, and underutilized colo assets are <strong>the biggest arbitrage opportunity today</strong>. But location quality (power, fiber, water, latency) is everything. If you're holding infrastructure or capital and unsure how to turn it into AI yield&#8212;<strong>we'll show you who's buying, how to convert, and where to avoid mistakes</strong>. Our deal flow and visibility can give you a head start.</p><h2>5. Under-Reported Insights &amp; Forward Outlook</h2><p>As the AI infrastructure market matures, the surface story of growth and hype gives way to deeper, more structural realities. Below, we highlight the most important&#8212;yet often overlooked&#8212;forces shaping the future of NeoClouds and datacenter investment. For firms navigating these shifts, understanding where value is created (and destroyed) is essential.</p><h3><em><strong>a. NeoCloud Economics: 90/10 Market</strong></em></h3><p>The majority of NeoCloud providers are built on high-debt, high-utilization assumptions. If they fail to hit scale, their models collapse. <strong>Expect most upstarts to fail or consolidate&#8212;only the top 10% will build sustainable businesses.</strong> Understanding utilization dynamics and customer concentration early is key to avoiding stranded assets.</p>
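<p>A minimal sketch of why the 90/10 split falls out of the utilization math. Assume a provider finances its GPUs with debt and sells capacity at a market hourly rate; every figure below is a hypothetical assumption, but the shape of the result is the point.</p><pre><code># Illustrative breakeven-utilization sketch for a debt-financed NeoCloud.
# All figures are hypothetical assumptions, not market data.

CAPEX_PER_GPU   = 40_000   # $ financed per GPU
DEBT_RATE       = 0.10     # annual interest on that capex
DEPREC_YEARS    = 4        # economic life of the hardware
CASH_OPEX_HR    = 0.45     # power, colo, staff per GPU-hour
MARKET_PRICE_HR = 2.25     # what the market will actually pay per GPU-hour
HOURS_YR        = 8760

def breakeven_utilization():
    # Fixed cost per GPU-year: straight-line depreciation plus debt service.
    fixed_yr = CAPEX_PER_GPU / DEPREC_YEARS + CAPEX_PER_GPU * DEBT_RATE
    # Each sold hour contributes the price minus the variable cash cost.
    margin_hr = MARKET_PRICE_HR - CASH_OPEX_HR
    return fixed_yr / (margin_hr * HOURS_YR)

u = breakeven_utilization()
print(f"breakeven utilization: {u:.0%} of every hour in the year")
</code></pre><p>With these assumptions a provider has to keep roughly 89% of the fleet sold around the clock just to cover fixed costs; cut the market price by 10-15% and breakeven moves past 100% utilization, which is the mechanism behind the 90/10 outcome above.</p>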
<h3><em><strong>b. Site Scarcity, Not Power Scarcity</strong></em></h3><p>Power is available&#8212;but <strong>Tier III+ sites with 100+ MW, water, dual fiber, and low latency to metro hubs are scarce.</strong> Those are what everyone is chasing. Markets that many still consider highly appealing, such as Texas, are beginning to saturate, while new opportunities exist in overlooked regions with the right fundamentals. We've helped investors and builders identify these locations before the crowd. Site selection is where edge is made&#8212;or lost.</p><h3><em><strong>c. Sovereign AI: Government-Backed Cloud is Here</strong></em></h3><p>From the UAE's 5 GW cluster to Europe's &#8364;20B AI gigafactories, sovereign-backed infrastructure is reshaping the landscape. These clouds have demand locked in and are playing the long game&#8212;<strong>making them harder to compete with, but valuable to partner with or emulate.</strong> Understanding which regions are overbuilt and which will attract sovereign tailwinds is a growing strategic advantage.</p><h3><em><strong>d. Inference Will Eclipse Training</strong></em></h3><p><a href="https://fpx.world/market-education/9">Production inference is scaling faster than training</a>&#8212;<strong>with some estimates suggesting a 15-20&#215; delta by 2027.</strong> Hardware optimized for inference, low-latency sites near cities, and edge AI will matter more than brute-force GPU clusters in the long run. Those who can support inference workloads at scale or supply hyperscalers with power&#8212;without overbuilding&#8212;will lead in unit economics.</p><h3><em><strong>e. Value is Moving Up the Stack</strong></em></h3><p>Raw compute is becoming commoditized. Clouds that layer software, vertical tools, or managed AI services are pulling ahead. <strong>The margin isn't in the silicon&#8212;it's in what you do with it.</strong></p><h3><em><strong>f. Consolidation is Inevitable</strong></em></h3><p>The market is bifurcating&#8212;<strong>hyperscalers will buy distressed clouds, colos will partner or vanish, and only the most adaptive NeoClouds will survive.</strong> Investors need to underwrite not just hardware&#8212;but resilience and differentiation. We've seen early signs of both value traps and breakout winners&#8212;and can help make sense of the field.</p><p>The race to build AI infrastructure is still early&#8212;but increasingly competitive. Those who move with precision&#8212;securing the right sites, utilization models, and demand channels&#8212;will capture durable value. Those who don't will be left with idle racks and expensive mistakes. If you're building, investing, or scaling in this market and want an edge&#8212;on where to go, who to trust, or how to build&#8212;we're always happy to share what we're seeing on the ground.</p>]]></content:encoded></item></channel></rss>