Abstract
Measuring a credible NVIDIA second source
We measured an Intel Arc Pro B70 (32 GB, Battlemage) head-to-head against an NVIDIA RTX 3090 (24 GB, Ampere) on the ColabHive distributed inference/training platform, across six real AI workloads — LLM serving, image diffusion, and fine-tuning — with device-attributed throughput and end-to-end energy measurement. Every number comes from a controlled run of an identical workload on a known, dedicated GPU, with energy read at the device (Intel xe sysfs counters, NVIDIA NVML), so the comparison avoids the CPU/GPU contamination that plagues aggregate fleet telemetry.
On the workloads that dominate today's AI spend — LLM serving, LoRA and 4-bit QLoRA fine-tuning, and Stable Diffusion XL — the B70 delivers 95–112% of the 3090's throughput while drawing 56–64% of its power, for ~1.45–2.0× better performance-per-watt. Where applicable we also report the Arc Pro B50 (16 GB, 70 W) as a third, lower-power efficiency data point. Diffusion is an outright Intel win; QLoRA — once given the right SYCL device selector — is ~7% faster. The remaining gaps are in the software stack and are closing release-over-release; this is a value-tier second-source case, not a claim against the frontier H100/B200 silicon that dominates the data-center segment.
1 · Motivation
Why a second source is the thesis
The AI compute market is the most concentrated in modern computing. NVIDIA shipped an estimated 98% (3.76 M of 3.85 M units) of data-center GPUs in 2023 (TechInsights, via HPCwire). The dependency is reinforced by CUDA lock-in, which is contractual as well as technical: the CUDA EULA prohibits reverse-engineering or translating CUDA-SDK output to target a non-NVIDIA platform, and the ZLUDA CUDA-on-other-hardware project was taken down at AMD's request in August 2024. Supply and price compound the problem: Blackwell-generation accelerators have been reported sold out ~12 months ahead, at quoted prices of $30,000–40,000 per GPU.
For an infrastructure layer whose mission is to give developers, startups, and researchers —
particularly in LATAM — access to affordable, available, power-efficient
compute, single-vendor dependency is the central strategic risk. A credible second source is not
optional; it is the thesis. Intel's Arc Pro B-series ("Battlemage") and the
Project Battlematrix initiative (up to 8× Arc Pro B60 →
192 GB VRAM, targeting 70 B+ models) aim squarely at the AI-workstation / inference-server
niche, with a containerized software stack (llm-scaler: vLLM-XPU, ComfyUI, SGLang). The
open questions for a production operator are empirical: How close is real throughput? What is the
actual energy and cost profile? Where does the software ecosystem still hurt? This paper answers
them with measured data from production-grade hardware.
Scaling to Project Battlematrix — 8×B60 → 192 GB, 70 B+ on-node
The per-card economics in this paper are the unit of a larger story. Intel's Project Battlematrix packs 8× Arc Pro B60 into 192 GB of aggregate VRAM in a single workstation/server chassis — exactly the capacity envelope a 70 B-class model at full bf16 needs (~140 GB of weights plus KV cache, fit on-node with no cross-host sharding). That is a class of model that today forces a multi-GPU NVIDIA box (typically 2–4× A100/H100-class cards) at a multiple of the power draw. The measured single-card facts below are the building blocks of that claim: a B70 already serves a 14 B at full bf16 on one card (§5), and the per-card vLLM-XPU efficiency we measured (95–97% of 3090 LLM throughput at ~1.5× the tokens/joule) is the unit economics that, if it composes, scales into a 70 B-class node at a fraction of the NVIDIA multi-GPU power budget.
2 · Methodology
Device-attributed, isolated, energy-instrumented
ColabHive already serves Intel Arc GPUs in production via an inference-ipex image built
on Intel's own intel/llm-scaler-vllm:0.14.0-b8.3.1 base (vLLM-XPU,
torch 2.10.0+xpu, compute-runtime 26.09). For this study the platform's energy
telemetry was extended and wired end-to-end. A core methodological commitment is
device-attributed, isolated, energy-instrumented measurement. Three pitfalls we
explicitly avoided:
- CPU-vs-GPU contamination. On a GPU node, much inference actually executes on the CPU (specialists, tools, fallbacks), so aggregate telemetry mixes devices. Every number in the results comes from a controlled run on a known, dedicated GPU, verified GPU-resident via vLLM /metrics counters, GPU utilization, and energy draw.
- Throughput ground-truth. LLM throughput is taken from vLLM's own vllm:generation_tokens_total counter (delta over the window), not client estimates, with output length pinned via ignore_eos so both vendors process exactly identical work.
- Energy attribution. Intel energy is the xe driver's sysfs energy1_input accumulating counter (µJ, monotonic, exact by construction); NVIDIA energy is the exact NVML nvmlDeviceGetTotalEnergyConsumption counter where available (QLoRA), else the integral of nvidia-smi power.draw at 200 ms. The window is bracketed by workload-emitted MEASURE_START/END timestamps, so warmup and model-load are excluded.
Both vendors run with identical max_model_len, batch, and step
counts. The two headline inference cells (LLM serving, SDXL) carry K=3 repeats with
±1 sample-std (n=3); the training cells are paired point estimates. Crucially, K=3
controls run-to-run noise on the same two physical cards — it does not control
silicon-lottery / board-partner / thermal variation (N=1 hardware per vendor).
| Role | GPU | VRAM | Host CPU |
|---|---|---|---|
| Intel | Arc Pro B70 (0xe223, BMG-G31) | 32 GB GDDR6-ECC | i7-10700 (8c) |
| NVIDIA | RTX 3090 (GA102) | 24 GB GDDR6X | Xeon E5-2680 v4 (28c) |
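The measurement window described in the three bullets above reduces to a few lines of arithmetic. The following is a minimal sketch, not the platform's actual collector: the function names are hypothetical, the metrics text and counter readings are synthetic, and reading the real sysfs file or NVML counter is deliberately left out. It shows the three ingredients: a token-counter delta parsed from Prometheus-style /metrics text, energy from an accumulating µJ counter, and the trapezoidal integral of sampled power for the case where no exact counter is available.

```python
from typing import Iterable, Tuple

NAME = "vllm:generation_tokens_total"


def counter_total(metrics_text: str, name: str) -> float:
    """Sum every sample of one counter in well-formed Prometheus text output."""
    total = 0.0
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE lines
            continue
        if "{" in line:
            metric, rest = line.split("{", 1)
            rest = rest.rsplit("}", 1)[-1]  # text after the label set
        else:
            metric, _, rest = line.partition(" ")
        if metric == name:
            total += float(rest.split()[0])
    return total


def joules_from_counter(start_uj: int, end_uj: int) -> float:
    """Energy over a window from an accumulating microjoule counter."""
    return (end_uj - start_uj) / 1e6


def joules_from_power_samples(samples: Iterable[Tuple[float, float]]) -> float:
    """Trapezoidal integral of (t_seconds, watts) samples."""
    pts = sorted(samples)
    return sum(0.5 * (p0 + p1) * (t1 - t0)
               for (t0, p0), (t1, p1) in zip(pts, pts[1:]))


def window_metrics(tokens: float, joules: float, seconds: float) -> dict:
    """Throughput, efficiency, and average power for one measured window."""
    return {
        "tok_per_s": tokens / seconds,
        "tok_per_j": tokens / joules,
        "avg_w": joules / seconds,
    }


# Synthetic readings taken at MEASURE_START and MEASURE_END (not real data).
before = 'vllm:generation_tokens_total{model_name="demo"} 5000.0\n'
after = ('# TYPE vllm:generation_tokens_total counter\n'
         'vllm:generation_tokens_total{model_name="demo"} 6000.0\n')
tokens = counter_total(after, NAME) - counter_total(before, NAME)
joules = joules_from_counter(1_000_000_000, 3_000_000_000)
m = window_metrics(tokens, joules, seconds=10.0)
print(m)
```

The exact-counter path needs only two reads per window, which is why it is preferred where available; the sampled-power path depends on the sampling interval and is the source of the instrument asymmetry noted in Appendix D.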
3 · Results
Summary matrix
Across the six benchmarked workloads, the B70 reaches parity-or-better on compute-bound modern AI while structurally drawing less power on every workload measured (44–80% of the 3090's draw):
| Workload | Type | B70 throughput vs 3090 | B70 perf/watt vs 3090 | Verdict |
|---|---|---|---|---|
| SDXL image generation | inference | 112% (n=3) | 2.00× | Intel wins on both |
| LoRA fine-tune (LLM 7B) | training | ~103% | 1.64× | ~parity + efficiency |
| LLM serving (7B, vLLM) | inference | 95–97% (n=3) | 1.46–1.65× | near-parity + efficiency |
| SDXL UNet LoRA | training | 57% | 1.29× | slower, still better perf/watt |
| TabNet (deep-tabular) | training | 46% | 0.56× | 3090 wins (Intel weak spot) |
| QLoRA 4-bit (LLM 7B) | training | 107% | 1.89× | works (selector fix) + faster |
LLM serving is the one headline cell where the B70 is slightly behind, but at 1.46–1.65× the tokens/joule. Two things stand out in the K=3 data. First, the B70 is markedly more consistent run-to-run (std < 1%) than the 3090, whose single-GPU throughput varied 5–9% across back-to-back sweeps. Second, the 3090's higher variance is partly mild thermal droop under sustained load, which cuts against us as much as for us: the C=1 near-parity could reflect the 3090 underperforming as much as the B70 reaching it, so we lean on the cleaner high-concurrency cells (C=8/16, a steady 95–97%) for the parity read and treat C=1 as noisy. Under load the 3090 drew ~343 W against the B70's ~219 W (~64%).
Perf/watt is a structural property of the part, not a one-off. The B70 wins the
efficiency axis on six of seven workloads, peaking at 2.00× on SDXL inference and 1.89× on
QLoRA. Only TabNet inverts (0.56×), for reasons we own plainly below. Because perf/watt is now
collected fleet-wide by the node runtime (NVML on NVIDIA, Xe sysfs on Intel, into
node_power_samples), this is platform telemetry going forward — not a single bench.
The energy story is the clearest part of the case. SDXL costs the B70 half the joules per image (50.0%), QLoRA half the joules per step (52.8%), LoRA ~61%, and an LLM token at C=16 ~69%. These are not rounding-error margins; on SDXL the K=3 ±1 std ranges are cleanly separated (B70 1,621 ± 6 J/image vs 3090 3,240 ± 37 J/image), so the efficiency win is repeatable, not a single-run artifact.
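Because performance per watt is units of work per joule, the perf/watt ratios in the summary matrix follow directly from the energy-per-unit cells. A minimal sketch that recomputes them from the figures reported in this paper (the dictionary layout and function name are ours, for illustration only):

```python
# Energy per unit of work in joules, (RTX 3090, Arc Pro B70), as reported here.
ENERGY_J = {
    "SDXL image": (3240.0, 1621.0),
    "QLoRA-NF4 step": (459.0, 242.3),
    "LoRA-fp16 step": (180.8, 110.1),
    "SDXL UNet-LoRA step": (208.0, 161.0),
}

# LLM serving at C=16 is reported as tokens per joule, so invert it
# to joules per token before applying the same formula.
LLM_TOK_PER_J = (1.723, 2.511)
ENERGY_J["LLM token (C=16)"] = (1 / LLM_TOK_PER_J[0], 1 / LLM_TOK_PER_J[1])


def perf_per_watt_ratio(j_3090: float, j_b70: float) -> float:
    """B70 perf/watt relative to the 3090: inverse ratio of joules per unit."""
    return j_3090 / j_b70


for name, (j_3090, j_b70) in ENERGY_J.items():
    ratio = perf_per_watt_ratio(j_3090, j_b70)
    print(f"{name:22s} perf/watt {ratio:.2f}x   joules vs 3090 {100 / ratio:.1f}%")
```

The same identity explains why a card can be slower and still win on efficiency: UNet-LoRA runs at 57% of the 3090's throughput but at 44% of its power, so each step still costs fewer joules.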
Read together, Figures 2–4 are the heart of the paper: on the workloads that dominate affordable-tier AI spend the B70 is at or near throughput parity (SDXL 112%, QLoRA 107%, LoRA 103%, LLM 95–97%) while spending far fewer joules to get there. The two red bars — UNet-LoRA at 57% and TabNet at 46% — are real and are addressed honestly in their own sections; they are raw-eager / small-operator patterns where a mature CUDA stack still pulls ahead.
Power is where the silicon advantage is unambiguous. Across every workload the B70 drew 44–80% of the 3090's board power — from 152 W vs 346 W on UNet-LoRA (44%) to 219 W vs 343 W on LLM serving (64%). Lower power is the lever that turns near-parity throughput into a decisive perf/watt and operating-cost win, and it is a property of the part that no amount of engine maturity on the NVIDIA side can erode.
4 · The QLoRA reversal
A "hardware limitation" that was one line of config
An earlier pass reported 4-bit QLoRA as failing on Battlemage. On re-investigation that was a
misdiagnosis, and the correction is one of the more instructive results in this study.
The failure is not a missing Triton-XPU kernel. bitsandbytes 0.49.2 dispatches 4-bit
quantization through a native custom op (torch.ops.bitsandbytes.quantize_4bit), and under
the platform's device-pinning convention ONEAPI_DEVICE_SELECTOR=level_zero:0 that op throws
a SYCL No device of requested type available — bnb's separately-compiled SYCL queue
rejects the Level-Zero selector even though torch.xpu sees the GPU fine.
Switching to ONEAPI_DEVICE_SELECTOR=*:gpu (any-backend GPU) lets the kernel resolve a device,
and 4-bit QLoRA then trains end-to-end on the B70, ~7% faster than the 3090 and
1.89× more energy-efficient. It is genuinely 4-bit: the quantized 7B occupies 5.45 GiB
on both vendors (bf16 would be ~15 GiB), and the ~2.1× slowdown vs LoRA-fp16 is
consistent with the expected 4-bit dequant tax. A result we had shipped as a hardware limitation turned out to
be a one-line configuration issue.
| QLoRA-NF4 (Qwen2.5-7B, r=16) | RTX 3090 | Arc Pro B70 |
|---|---|---|
| Throughput | 0.889 steps/s · 911 tok/s | 0.949 steps/s · 972 tok/s |
| Energy / step | 459.0 J | 242.3 J |
| Avg power | ~408 W | ~230 W |
| VRAM (4-bit weights) | 5.45 GiB | 5.45 GiB |
The methodological lesson is one we own: this updates our prior so that we now treat a single
negative-for-Intel result as provisional pending a configuration audit. By that bar the
TabNet weak spot is not yet audited and should be read as "unaudited, plausibly improvable"
rather than a settled silicon limit. The remaining QLoRA work is purely platform-integration: wiring the
*:gpu selector into the Intel training-launch path for bnb-4bit workloads.
This finding is significant enough to name as a contribution of this work, not bury as a footnote. The conventional wisdom is wrong. The public consensus in 2026 — repeated across forums, issue threads, and "does it run on Intel?" guides — is that "bitsandbytes / 4-bit QLoRA does not run on Intel Arc." Meanwhile bitsandbytes' own release notes officially list XPU support. Both cannot be fully true, and the gap is where operators get stuck and conclude (as we initially did) that the hardware can't do it.
The trap is a documentation collision. Intel's own multi-GPU guidance recommends
pinning devices with ONEAPI_DEVICE_SELECTOR=level_zero:N — and that is exactly the
selector under which bitsandbytes' separately-compiled SYCL 4-bit op throws
No device of requested type available. An operator who follows Intel's multi-GPU docs to the
letter and installs the XPU-supporting bitsandbytes hits a hard failure that looks like a missing
kernel, when in fact torch.xpu sees the GPU fine and only bnb's SYCL queue rejects the
Level-Zero selector. Two pieces of recommended configuration, each correct in isolation, combine into that failure.
Our root-cause and fix. Set ONEAPI_DEVICE_SELECTOR=*:gpu (any-backend GPU,
so bnb's kernel queue can resolve a device) and pin the specific card with ZE_AFFINITY_MASK=<idx>
instead of the Level-Zero selector. With that one change, 4-bit QLoRA trains end-to-end on the
B70 (~7% faster than the 3090 at 1.89× the efficiency, genuinely 4-bit at 5.45 GiB), and
we confirmed it generalizes across the B-series — the identical fix
(*:gpu + ZE_AFFINITY_MASK=1) works on the B50. Because the fix is
selector-level and card-agnostic, it unblocks 4-bit QLoRA on the entire Arc B-series —
including Intel's own 8-card Project Battlematrix topology, where per-card pinning is mandatory and
the level_zero:N trap is therefore likely to surface for anyone following the standard
multi-GPU docs.
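For a launcher written in Python, the fix above amounts to building the child-process environment with the two variables set. The sketch below is illustrative only: arc_bnb_env and the train_qlora.py script name are hypothetical, not the node-runtime implementation, and it assumes (as is usual for these variables) that the SYCL and Level-Zero runtimes read them at initialization, so they must be in place before the training process starts.

```python
import os
from typing import Mapping, Optional


def arc_bnb_env(card_index: int,
                base_env: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of the environment with the selector fix applied."""
    env = dict(os.environ if base_env is None else base_env)
    # Any-backend GPU selector, replacing a level_zero:N pin if one was set.
    env["ONEAPI_DEVICE_SELECTOR"] = "*:gpu"
    # Pin the physical card by affinity mask instead (0-based index).
    env["ZE_AFFINITY_MASK"] = str(card_index)
    return env


if __name__ == "__main__":
    # Hypothetical launch; train_qlora.py is a placeholder script name.
    # import subprocess
    # subprocess.run(["python", "train_qlora.py"], env=arc_bnb_env(1), check=True)
    print(arc_bnb_env(1, {"ONEAPI_DEVICE_SELECTOR": "level_zero:1"}))
```

Returning a fresh dictionary rather than mutating os.environ keeps the change scoped to the bnb-4bit job, so other workloads on the same node can keep their existing pinning.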
It is in production. The fix is not a notebook hack — it is wired into a production training launcher (node-runtime 0.10.121), so every Arc training job ColabHive dispatches uses the correct selector automatically. And we are giving it back: we are reporting this upstream to bitsandbytes (the failure signature + the selector root-cause) and to Intel's IPEX / llm-scaler documentation. A ready-to-file bug-report draft is in Appendix C.
Bottom line: we don't just consume the Intel stack — we fix and contribute back to it. The single most-cited "Arc can't do QLoRA" limitation is, in our hands, a solved, productionized, upstream-reported one-line configuration fix that scales to the full B-series and to Battlematrix.
4.10 · INT8 inference on Battlemage
The spec promise vs the realizable win
The economics below lean on Battlemage's published INT8 TOPS/W advantage (B70 1.60 vs 3090 0.81). A spec is a promise, not a measurement — so we tried to cash it into delivered tokens/s on the B70. The answer has two halves: the INT8-compute path is blocked by a missing kernel, but the deployment-standard 4-bit path not only works, it flips the LLM-serving verdict to a B70 win.
(a) W8A8 INT8: blocked by a missing kernel. A compressed-tensors W8A8 model (which would exercise the 367 INT8 TOPS) fails
at engine init on vLLM-XPU:
File .../quantization/kernels/scaled_mm/__init__.py, line 55, in choose_scaled_mm_linear_kernel
for kernel in _POSSIBLE_KERNELS[current_platform._enum]:
KeyError: <PlatformEnum.XPU: 4>
vLLM has no INT8 scaled-mm kernel registered for the XPU platform — the dispatch table
carries CUDA/ROCm/CPU/TPU entries but not XPU. So the B70's published INT8 TOPS are not yet realizable
through vLLM's INT8 serving path: the economic INT8-TOPS edge is a genuine spec advantage that today's software
stack cannot cash in. It is a concrete, reportable gap — analogous to the QLoRA selector issue but deeper
(a missing kernel registration, not a config), and squarely on the closing-trajectory list.
(b) AWQ-4bit — works, and the B70 leads the 3090. The precision operators actually
deploy on 24–32 GB cards is 4-bit weight-only (AWQ/GPTQ). On vLLM-XPU this routes through the IPEX
weight-only path, gated behind a deprecation guard that we bypass with one flag
(--allow-deprecated-quantization — a third minor ecosystem unlock we document). Serving
identical AWQ Qwen2.5-7B, coherent output on both vendors:
| AWQ-4bit, Qwen2.5-7B | 3090 tok/s | B70 tok/s | B70 % | 3090 tok/J | B70 tok/J |
|---|---|---|---|---|---|
| C=1 | 40.8 | 48.9 | 120% | 0.255 | 0.325 |
| C=8 | 302.9 | 381.8 | 126% | 1.913 | 2.351 |
| C=16 | 598.0 | 709.5 | 119% | 3.506 | 4.179 |
Quantization flips the LLM-serving verdict. At bf16 (§3) the B70 trails the 3090 at 95–97%; at AWQ-4bit, the realistic operating point, the B70 leads by 19–26% on throughput and 1.19–1.27× on tokens/joule, while also cutting its own latency (p50 2.70 s vs 3.65 s bf16) and power (~170 W vs ~219 W). Each vendor runs its own AWQ kernel (IPEX weight-only on XPU vs Marlin on CUDA), so this is a real-world "what each card actually serves" comparison rather than a same-kernel one, and it lands in the B70's favor.
5 · VRAM headroom
A model the 3090 cannot hold at full precision
At full bf16 precision the 32 GB B70 holds mid-size models that do not fit a
24 GB 3090. We frame this honestly: it is a simplicity / headroom advantage, not an
absolute capability gap — a quantized (AWQ/GPTQ/fp8) 14 B fits a 3090 fine and is the standard
way to serve one on 24 GB. The narrow, measured claim: at the identical no-quantization bf16 config
used everywhere else in this paper, loading Qwen2.5-14B-Instruct has the 3090
fail with torch.OutOfMemoryError (23.49 GiB used, cannot allocate the
next 270 MiB) while the B70 loads and serves a coherent completion at
31.1 GiB used.
The operational value is simplicity: on the B70 you serve or fp16-finetune a ~13–14 B model on a single card with no quantization pipeline, no offload, no second GPU. So the defensible framing is not "NVIDIA can't run a 14 B" (it can, quantized) but "Arc Pro gives full-precision headroom for a whole class of mid-size models that a 24 GB card forces you to quantize or shard."
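The headroom argument can be checked with back-of-envelope arithmetic: bf16 stores two bytes per parameter, so the weights alone set a lower bound on the VRAM a model needs. The sketch below uses nominal parameter counts (14 B and 70 B, an assumption; real checkpoints differ slightly) and treats the quoted card capacities as GiB (also an assumption). It covers weights only, so it is a necessary condition for fitting, not a sufficient one: KV cache, activations, and runtime allocations come on top, which is why the measured B70 figure is 31.1 GiB rather than the weights-only number.

```python
GIB = 1024 ** 3


def bf16_weights_gib(n_params: float) -> float:
    """Weights-only footprint at 2 bytes per parameter (bf16)."""
    return n_params * 2 / GIB


def weights_fit(n_params: float, vram_gib: float) -> bool:
    """Lower-bound check: do the bf16 weights alone fit in VRAM?"""
    return bf16_weights_gib(n_params) < vram_gib


MODELS = {"14 B": 14e9, "70 B": 70e9}  # nominal sizes, for illustration
CARDS = {"RTX 3090": 24, "Arc Pro B70": 32, "8x Arc Pro B60": 192}

for model, n in MODELS.items():
    for card, vram in CARDS.items():
        verdict = "weights fit" if weights_fit(n, vram) else "weights do not fit"
        print(f"{model} bf16 = {bf16_weights_gib(n):6.1f} GiB on {card} ({vram} GiB): {verdict}")
```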
4.9 · The Arc Pro B50
Efficiency at 70 W — a third data point
We also ran everything that fits on the node's second Intel card, the Arc Pro B50
(16 GB, 70 W TBP, no external power connector) — same device selectors
(level_zero:1; *:gpu + ZE_AFFINITY_MASK=1 for QLoRA,
confirming the §4 fix is general rather than B70-specific) and the same card-attributed Xe energy
method. The B50 is the efficiency / power floor of the three parts.
| Workload | RTX 3090 (350 W) | Arc B70 (32 GB) | Arc B50 (16 GB, 70 W) |
|---|---|---|---|
| SDXL / image | 8.21 s · 3,240 J · ~394 W | 7.33 s · 1,621 J · ~221 W | 18.74 s · 1,291 J · ~69 W |
| QLoRA-NF4 / step | 0.889 st/s · 459 J · ~408 W | 0.949 st/s · 242 J · ~230 W | 0.373 st/s · 187 J · ~70 W |
| TabNet | 16,852 rows/s · ~108 W | 7,739 rows/s · ~87 W | 3,700 rows/s · ~39 W |
| LLM serving (7B, bf16) | ✓ | ✓ | ✗ OOM (16 GB) |
| LoRA-fp16 (7B) | ✓ | ✓ | ✗ OOM (16 GB) |
(1) Slow but astonishingly efficient. The B50 is 2.3–4.5× slower than the 3090, yet draws only 39–70 W and posts the lowest energy-per-unit-of-work of all three parts on the workloads it can run — SDXL at 1,291 J/image (0.40× the 3090, below even the B70) and QLoRA at 187 J/step (0.41× the 3090). For power- or density-constrained deployments (many cards per chassis, no external power connector), that is the entire pitch.
(2) 16 GB is the binding constraint. A 7 B model at bf16 runs out of memory on the B50 for both serving and LoRA-fp16, so it must be quantized to fit; QLoRA-4bit (5.45 GiB) runs comfortably. So the B50 is a quantized-inference and QLoRA card, and the B70's 32 GB is precisely what buys the full-precision headroom of §5.
6 · Economics
Cheaper VRAM, decisively better perf-per-watt
On $/GB VRAM, the headline claim for Intel Arc Pro is "~2.5–3× VRAM per dollar" against a new 3090 ($1,499). But the 3090 is a 2020 part that is now bought used (~$700–1,050, ≈ $36.5/GB at the midpoint); against that realistic comparator Arc Pro's $/GB is ~0.6–0.8× the 3090's, i.e. roughly 1.2–1.7× the VRAM per dollar, not 3×. We price the B70 itself here (~$949), not only the cheaper B60/B50, an omission for which earlier drafts were fairly criticized.
Where Intel's lead is robust regardless of price basis is the per-watt economics: VRAM per watt (1.75–3.3×) and INT8 TOPS/W (1.2–3.0×) are structural efficiency the used market cannot erode. The honest economic case: modestly cheaper VRAM than a used 3090, much cheaper than a new one, decisively better perf/watt — and a brand-new part (warranty, current drivers, support) versus a depreciating six-year-old card.
Sticker price is the wrong unit. What an operator actually pays is cost per unit of work over the life of the card — hardware amortization plus the electricity to do the work — and that is where the measured perf/watt of the results converts directly into dollars.
Interactive TCO calculator
All-in 3-year cost per unit of work
Every number is computed from the measured §3 throughput and power. This is an illustrative TCO model; assumptions are in Appendix A.
$/unit = price_card / (units_per_3yr × util)
+ energy_per_unit_kWh × price_kWh, with units_per_3yr = throughput × 26,280 h.
All-in cost per unit of work
All-in $/unit = hardware ÷ (units produced over life at utilization U) + energy/unit × $/kWh. The hardware term shrinks with utilization (a card amortized over more work is cheaper per unit); the energy term does not. Both headline workloads are computed from the measured throughput and power.
LLM serving (Qwen-7B, C=16) — $ / million tokens @ US $0.15, 100% utilization:
| Card | HW $/M-tok | Energy $/M-tok | All-in $/M-tok |
|---|---|---|---|
| Arc B70 | $0.0181 | $0.0166 | $0.0347 |
| 3090 (used) | — | — | $0.0401 |
| 3090 (new) | — | — | $0.0514 |
The B70's win on LLM serving over the used 3090 is purely energy: it is slightly behind on raw throughput, so it does not win on the hardware term; it wins because each token costs ~31% fewer joules (§3). That makes the advantage utilization-gated: at very low utilization the hardware term dominates and the cheaper-sticker used 3090 can edge ahead; as utilization rises the energy term dominates and the B70 pulls away.
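The LLM table can be recomputed from the Appendix A inputs. The sketch below is one way to do it, with hypothetical function names. It takes the energy term from the measured tokens/joule at C=16 (rather than from the rounded average power), which is the choice that reproduces the table's $0.0166 for the B70; prices and throughputs are the ones stated in this paper.

```python
HOURS_LIFE = 26_280  # 3 years of wall-clock hours, per Appendix A


def all_in_cost(capital, units_per_hour, kwh_per_unit, price_kwh,
                util=1.0, hours_life=HOURS_LIFE):
    """Return (hardware, energy, all-in) cost per unit of work in dollars."""
    hardware = capital / (units_per_hour * hours_life * util)
    energy = kwh_per_unit * price_kwh
    return hardware, energy, hardware + energy


def llm_inputs(tok_per_s, tok_per_j):
    """Convert measured tok/s and tok/J to M-tok/hour and kWh per M-tok."""
    m_tok_per_hour = tok_per_s * 3600 / 1e6
    kwh_per_m_tok = (1e6 / tok_per_j) / 3.6e6
    return m_tok_per_hour, kwh_per_m_tok


# (capital $, tok/s at C=16, tok/J at C=16) from Appendices A and B.
CARDS = {
    "Arc B70": (949, 555.3, 2.511),
    "3090 (used)": (875, 582.0, 1.723),
    "3090 (new)": (1499, 582.0, 1.723),
}

RESULTS = {}
for card, (capital, tok_s, tok_j) in CARDS.items():
    rate, kwh = llm_inputs(tok_s, tok_j)
    RESULTS[card] = all_in_cost(capital, rate, kwh, price_kwh=0.15)
    hw, en, total = RESULTS[card]
    print(f"{card:12s} HW ${hw:.4f}  energy ${en:.4f}  all-in ${total:.4f} per M-tok")
```

Passing a lower util shows the utilization gating directly: only the hardware term grows, so the used 3090's cheaper sticker price matters more as the card sits idle.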
SDXL image generation — $ / 1,000 images @ US $0.15, 100% utilization:
| Card | All-in $/1,000 img | vs 3090-used | vs 3090-new |
|---|---|---|---|
| Arc B50 | $0.123 | — | — |
| Arc B70 | $0.141 | 33% cheaper | 47% cheaper |
| 3090 (used) | $0.211 | — | — |
| 3090 (new) | $0.265 | — | — |
SDXL is different in kind: the B70 wins on both the hardware-per-image term (it is faster per image) and the energy term (half the joules). When a card is cheaper on every component of the cost, there is no crossover.
Energy & carbon at fixed throughput
Flip the question: hold output constant and ask what it costs in power and carbon. This is the number a data-center operator or an ESG-minded investor cares about.
| Fixed throughput | RTX 3090 | Arc B70 | Arc B50 |
|---|---|---|---|
| 1 M SDXL images / day | 900 kWh/day · 328.5 MWh/yr | 450 kWh/day · 164.4 MWh/yr (−50%, −66 t CO₂/yr) | 359 kWh/day (−60%) |
| 1 B tokens / day (LLM) | 161 kWh/day | 111 kWh/day (−31%) | OOM (16 GB) |
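The fixed-throughput rows are a unit conversion from the measured energy per unit of work. A minimal sketch under the paper's own assumptions (0.4 kg CO₂/kWh grid intensity from Appendix A; function names are ours):

```python
J_PER_KWH = 3.6e6
GRID_KG_CO2_PER_KWH = 0.4  # Appendix A assumption


def kwh_per_day(joules_per_unit: float, units_per_day: float) -> float:
    """Daily energy needed to produce a fixed amount of work."""
    return joules_per_unit * units_per_day / J_PER_KWH


def tonnes_co2_per_year(kwh_day: float) -> float:
    """Annual emissions at the assumed grid intensity."""
    return kwh_day * 365 * GRID_KG_CO2_PER_KWH / 1000


# 1 M SDXL images per day, from the measured J/image.
sdxl_3090 = kwh_per_day(3240, 1e6)
sdxl_b70 = kwh_per_day(1621, 1e6)
sdxl_b50 = kwh_per_day(1291, 1e6)
saved_t = tonnes_co2_per_year(sdxl_3090) - tonnes_co2_per_year(sdxl_b70)

# 1 B tokens per day, from the measured tok/J at C=16.
llm_3090 = kwh_per_day(1 / 1.723, 1e9)
llm_b70 = kwh_per_day(1 / 2.511, 1e9)

print(f"SDXL: 3090 {sdxl_3090:.0f}, B70 {sdxl_b70:.0f}, B50 {sdxl_b50:.0f} kWh/day")
print(f"SDXL B70 saving: {saved_t:.0f} t CO2/yr")
print(f"LLM:  3090 {llm_3090:.0f}, B70 {llm_b70:.0f} kWh/day")
```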
Rack / chassis density
Power, not slots, is the binding constraint in a modern rack. Sizing by total board power (TBP — 3090 350 W / B70 230 W / B50 70 W) into a fixed 10 kW rack shows what you can actually deploy per kilowatt:
| Per 10 kW rack | RTX 3090 | Arc B70 | Arc B50 |
|---|---|---|---|
| Cards | 28 | 43 | 142 |
| Aggregate VRAM (GB) | 672 | 1,376 (2.05×) | 2,272 (3.38×) |
| SDXL throughput (img/min) | 204 | 353 (1.73×) | 454 (2.23×) |
| LLM throughput (tok/s) | 16,296 | 23,878 (1.47×) | OOM (16 GB) |
| VRAM per kW (GB/kW) | 68.6 | 139.1 | 228.6 |
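The density table follows from dividing the rack's power budget by each card's TBP and rounding down. The sketch below reproduces it from the per-card figures in this paper; it budgets GPU board power only and ignores host, cooling, and networking overhead, which is the same simplification the table makes.

```python
RACK_W = 10_000

# Per-card TBP, VRAM, and measured throughput as reported in this paper.
CARDS = {
    "RTX 3090": {"tbp_w": 350, "vram_gb": 24, "sdxl_img_min": 7.3, "llm_tok_s": 582.0},
    "Arc B70": {"tbp_w": 230, "vram_gb": 32, "sdxl_img_min": 8.2, "llm_tok_s": 555.3},
    "Arc B50": {"tbp_w": 70, "vram_gb": 16, "sdxl_img_min": 3.2, "llm_tok_s": None},
}


def rack_totals(card: dict, rack_w: int = RACK_W) -> dict:
    """Cards that fit the power budget and the aggregates they provide."""
    n = rack_w // card["tbp_w"]  # whole cards only
    llm = card["llm_tok_s"]
    return {
        "cards": n,
        "vram_gb": n * card["vram_gb"],
        "sdxl_img_min": n * card["sdxl_img_min"],
        "llm_tok_s": None if llm is None else n * llm,  # None = OOM at bf16
    }


RACKS = {name: rack_totals(card) for name, card in CARDS.items()}
for name, r in RACKS.items():
    print(name, r)
```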
7 · Honest limitations
Where the case is weakest, stated plainly
The case is deliberately calibrated. Beyond TabNet, classical gradient boosting (XGBoost/CatBoost) has no
Intel-GPU backend and stays on CPU — a library reality, not a B70 deficiency. The hardware sample is
N=1 per vendor: K=3 controls run-to-run noise on the same two physical cards but not
silicon-lottery, board-partner, or thermal variation, so a second device and a second model size (~1.5 B)
are the next steps. There is also an engine-version skew confound — Intel runs vLLM-XPU 0.14.x
on torch 2.10+xpu while NVIDIA runs stock vLLM (cu121 lineage) — so near-parity reflects
silicon plus engine. And the economics are strongest against new-NVIDIA pricing and only modest
against the used-3090 comparator. The remaining gaps (no FlashAttention/xformers on XPU, no CUDA-graph
equivalent, out-of-tree Triton-XPU) are software-ecosystem ones with a clear, fast-moving upstream
trajectory — not silicon limitations.
Bottom line: for the affordable inference-and-finetune tier ColabHive serves — not the frontier H100/B200 segment — Intel Arc Pro is a credible, power- and cost-efficient NVIDIA alternative exactly where modern-AI value concentrates for that tier (LLM + diffusion inference, LoRA/QLoRA fine-tuning). Deploy LLM serving and SDXL inference on Arc Pro first; run LoRA and 4-bit QLoRA fine-tuning there too; keep deep-tabular/TabNet on NVIDIA and classical GBDT on CPU. The remaining gaps are in the software stack and are closing release-over-release.
Appendices
TCO model, raw data, reproduction & upstream report
Appendix A — The TCO model (formula, assumptions, sensitivity). Everything in the economics section is computed from this one model; a reviewer can recompute every cell. For a card producing throughput T units/hour at average power P watts:
all_in_$/unit(U) = capital / (T × hours_life × U) ← hardware term
+ (P / 1000 / T) × price_kWh ← energy term
where units_over_life = T × hours_life × U
energy_per_unit_kWh = (P watts / 1000) / (T units/hour)
Assumptions: hours_life = 26,280 h (3-yr straight-line).
price_kWh ∈ {US 0.15, AR 0.10, DE 0.30}. Grid carbon = 0.4 kg CO₂/kWh.
Capital: 3090-used $875, 3090-new $1,499, B70 $949, B50 $349.
T, P from the measured cells (LLM C=16: 3090 582 tok/s @343 W, B70 555.3 @219 W;
SDXL: 3090 7.3 img/min @394 W, B70 8.2 @221 W, B50 3.2 @69 W).
Crossover (LLM vs 3090-used, B70 wins energy but not HW):
U* = ΔHW(U=1) / (Δenergy_per_unit_kWh × price_kWh) = 0.0429 / price_kWh
→ US 0.15 → 28.6% | DE 0.30 → 14.3% | AR 0.10 → 43%
SDXL: B70 wins both terms → no positive-U crossover → unconditional win.
Sensitivity. The energy term, and with it the entire B70 LLM-serving advantage, scales linearly with electricity price; cheaper power (Argentina) pushes the LLM crossover up to 43% util, expensive power (Germany) drops it to 14.3%. Utilization U only scales the hardware term, never energy, so every result is most favorable to the cheaper-energy card (Arc) at high utilization. A production fleet lives at high U, the regime where Arc wins. For SDXL the sign of the result is insensitive to both inputs (the B70 always wins); only the magnitude changes.
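The crossover utilization can be derived by setting the two all-in costs equal. The B70 is cheaper than the used 3090 when HW_B70/U + E_B70 × price < HW_3090/U + E_3090 × price, which rearranges to U > ΔHW / (ΔE × price), with ΔHW the hardware-term gap at U = 1 and ΔE the energy gap in kWh per unit. A minimal sketch using the C=16 LLM cells and measured tokens/joule (function names are hypothetical):

```python
HOURS_LIFE = 26_280  # 3 years, per Appendix A


def hw_per_m_tok(capital: float, tok_per_s: float) -> float:
    """Hardware cost per million tokens at 100% utilization."""
    return capital / (tok_per_s * 3600 * HOURS_LIFE / 1e6)


def kwh_per_m_tok(tok_per_j: float) -> float:
    """Energy per million tokens from measured tokens per joule."""
    return (1e6 / tok_per_j) / 3.6e6


def crossover_util(price_kwh: float) -> float:
    """Utilization above which the B70 beats the used 3090 on all-in cost."""
    d_hw = hw_per_m_tok(949, 555.3) - hw_per_m_tok(875, 582.0)  # B70 costs more
    d_e = kwh_per_m_tok(1.723) - kwh_per_m_tok(2.511)           # B70 uses less
    return d_hw / (d_e * price_kwh)


for region, price in {"US": 0.15, "DE": 0.30, "AR": 0.10}.items():
    print(f"{region} @ ${price:.2f}/kWh: crossover at {100 * crossover_util(price):.1f}% utilization")
```

The constant 0.0429 in the crossover line above is the ratio ΔHW/ΔE from this calculation, so the crossover halves each time the electricity price doubles.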
Appendix B — Consolidated raw-data table. Every measured cell, so a reviewer can recompute J/unit, perf/watt, and the economics independently. All cells measured 2026-06-21, bf16/eager, single dedicated GPU. "—" = not run / n/a; "OOM" = did not fit in VRAM.
| Workload (unit) | Metric | RTX 3090 | Arc B70 | Arc B50 |
|---|---|---|---|---|
| LLM serving C=1 (tok/s) | throughput | 37.8 ± 3.4 | 36.6 ± 0.1 | — |
| LLM serving C=8 (tok/s) | throughput | 296.7 ± 6.9 | 287.0 ± 0.5 | — |
| LLM serving C=16 (tok/s) | throughput | 582.0 ± 19.6 | 555.3 ± 2.1 | OOM |
| LLM serving C=16 | tok/J | 1.723 | 2.511 | — |
| LLM serving (under load) | avg power (W) | ~343 | ~219 | — |
| SDXL (per image) | latency (s) | 8.21 ± 0.09 | 7.33 ± 0.03 | 18.74 |
| SDXL (per image) | throughput (img/min) | 7.3 | 8.2 | 3.2 |
| SDXL (per image) | energy/image (J) | 3,240 ± 37 | 1,621 ± 6 | 1,291 |
| SDXL | avg power (W) | ~394 | ~221 | ~69 |
| LoRA-fp16 7B (per step) | steps/s · tok/s | 1.90 · 1,944 | 1.96 · 2,007 | OOM |
| LoRA-fp16 7B | energy/step (J) | 180.8 | 110.1 | OOM |
| LoRA-fp16 7B | avg power (W) | ~347 | ~222 | OOM |
| QLoRA-NF4 7B (per step) | steps/s · tok/s | 0.889 · 911 | 0.949 · 972 | 0.373 |
| QLoRA-NF4 7B | energy/step (J) | 459.0 | 242.3 | 187 |
| QLoRA-NF4 7B | avg power (W) | ~408 | ~230 | ~70 |
| QLoRA-NF4 7B | VRAM 4-bit (GiB) | 5.45 | 5.45 | 5.45 |
| SDXL UNet-LoRA (per step) | throughput (steps/s) | 1.625 | 0.930 | — |
| SDXL UNet-LoRA | energy/step (J) | 208 | 161 | — |
| SDXL UNet-LoRA | avg power (W) | ~346 | ~152 | — |
| TabNet 16k×64 (rows/s) | throughput | 16,852 | 7,739 | 3,700 |
| TabNet | avg power (W) | ~108 | ~87 | ~39 |
| TabNet | rows/J | 159 | 89 | — |
| AWQ-4bit C=1 (tok/s) | throughput | 40.8 | 48.9 | — |
| AWQ-4bit C=8 (tok/s) | throughput | 302.9 | 381.8 | — |
| AWQ-4bit C=16 (tok/s) | throughput | 598.0 | 709.5 | — |
| Qwen2.5-14B bf16 | weight load | OOM (24 GB) | loads (31.1 GiB) | — |
Energy method per cell: Intel = Xe sysfs exact counter (all cells); NVIDIA = NVML exact counter for QLoRA,
nvidia-smi power.draw @200 ms integral for LLM/SDXL (asymmetry flagged in Appendix D).
Appendix C — Reproduction & upstream report. Reproduce by
pinning a single dedicated GPU per side, draining other models off it, bf16/eager throughout, identical
prompts/inputs/lengths; LLM throughput from vLLM's vllm:generation_tokens_total delta over a
MEASURE_START/END-bracketed window (ignore_eos to fix output length); energy from the
exact accumulating counters where possible. The QLoRA cell requires the selector fix. The ready-to-file
bug-report draft:
Title: 4-bit QLoRA fails on Intel Arc B-series under
ONEAPI_DEVICE_SELECTOR=level_zero:N — SYCL No device of requested type available
(bnb's 4-bit op rejects the Level-Zero selector that Intel's own multi-GPU docs recommend).
Components: bitsandbytes (XPU 4-bit
custom op) · Intel IPEX / llm-scaler multi-GPU documentation.
Environment: Arc Pro B70 (0xe223, 32 GB)
and B50 (0xe212, 16 GB); torch 2.10.0+xpu; bitsandbytes 0.49.2;
compute-runtime 26.05+; xe driver; Ubuntu 24.04 / kernel 6.17; image on
intel/llm-scaler-vllm:0.14.0-b8.3.1 base.
Failure signature: With
ONEAPI_DEVICE_SELECTOR=level_zero:0 (per Intel's multi-GPU pinning guidance), loading an
NF4-quantized model and starting a QLoRA step makes bitsandbytes' native 4-bit op
(torch.ops.bitsandbytes.quantize_4bit) throw a SYCL No device of requested type
available. Note torch.xpu.is_available() is True and
get_device_properties() enumerates the GPU correctly — only bnb's separately-compiled SYCL
kernel queue rejects the device.
Root cause: bnb's 4-bit SYCL queue does not resolve
a device when the process is scoped to the Level-Zero backend selector. torch.xpu and bnb's SYCL
runtime resolve devices through different paths; the Level-Zero-only scoping that satisfies torch starves
bnb's queue.
Fix (one line of configuration): use the any-backend GPU selector and pin the card by affinity mask instead of by Level-Zero index:
# ❌ fails for bnb 4-bit: ONEAPI_DEVICE_SELECTOR=level_zero:1
# ✅ works:
export ONEAPI_DEVICE_SELECTOR='*:gpu'
export ZE_AFFINITY_MASK=1 # pin to the desired card (0-based)
With this, 4-bit QLoRA trains end-to-end (verified genuinely 4-bit: NF4 7B = 5.45 GiB).
Affected scope: the entire Arc B-series (reproduced
on both B70 and B50 with the identical fix). Critically, it affects Project Battlematrix
(8×B60) and any multi-card Arc deployment, because per-card pinning is mandatory there and the
standard level_zero:N guidance is exactly what triggers the failure.
Requested doc change (Intel): note that workloads
using bitsandbytes 4-bit must pin with ONEAPI_DEVICE_SELECTOR='*:gpu' +
ZE_AFFINITY_MASK=<idx>, not level_zero:<idx>.
Requested fix (bitsandbytes): make the 4-bit SYCL kernel
queue resolve a device under the Level-Zero backend selector (or emit an actionable error pointing at the
selector rather than the opaque No device of requested type available).
Appendix D — Methodology, versions, provenance. Intel inference
inference-ipex:v0.7.24 ← intel/llm-scaler-vllm:0.14.0-b8.3.1
(torch 2.10.0+xpu, vLLM-XPU, compute-runtime 26.09); Intel training
training-ipex:v0.2.0 (llm-scaler base + peft, trl,
bitsandbytes 0.49.2, pytorch-tabnet 4.1.0); NVIDIA inference-vllm:v2.0.8,
inference-generative:v3.0.0, training-transformers-cu121:v1.0.5. Host Ubuntu 24.04,
kernel 6.17, xe driver, compute-runtime 26.05+, Resizable BAR on. Energy: Intel = xe
sysfs energy1_input (µJ, exact, per-card by PCI id); NVIDIA = exact NVML
nvmlDeviceGetTotalEnergyConsumption for QLoRA, else the nvidia-smi power.draw @200 ms
integral for LLM/SDXL — the two NVIDIA methods are not the same instrument, so the LLM/SDXL
perf/watt margins carry a small extra uncertainty the QLoRA cell does not. Now collected fleet-wide by
node-runtime 0.10.119 → node_power_samples. Known confounds: engine-version skew (Intel
vLLM-XPU 0.14.x vs NVIDIA stock vLLM cu121) and host-CPU mismatch (i7-10700 8c vs Xeon E5-2680 v4 28c, a live
confound for the host-sync-bound TabNet cell). Intel does not publish dense BF16 TFLOPS for Arc Pro; the
compute-economics table uses published INT8 TOPS only. Market-share, pricing, CUDA-EULA, and spec figures are
from public sources (TechInsights/HPCwire, NVIDIA CUDA EULA, Intel datasheets/newsroom).
Appendix E — Sources (public). External market, pricing, licensing, and hardware-spec figures in §1 and §6 are from public sources; street prices and shipping specs are volatile and were current at the 2026-06-21 measurement date.
- NVIDIA data-center GPU market share (~98%, 3.76M of 3.85M units, 2023) — TechInsights, reported via HPCwire.
- NVIDIA CUDA End User License Agreement, §1.2 "Limitations" (item 8, on translating CUDA output to non-NVIDIA platforms) — NVIDIA.
- ZLUDA (CUDA-on-non-NVIDIA) project takedown at AMD's request, August 2024 — project repository and press coverage.
- Blackwell supply ("sold out ~12 months ahead," Oct 2024) and per-GPU pricing ($30,000–40,000, Jensen Huang) — NVIDIA management remarks and press.
- Intel Arc Pro B-series (Battlemage), Project Battlematrix (8×B60 → 192 GB), and the llm-scaler software stack — Intel datasheets and newsroom.
- bitsandbytes Arc / XPU 4-bit support — bitsandbytes release notes.
- Native torch.xpu support timeline (PyTorch 2.5 onward; Battlemage maturing 2.6–2.7) and IPEX standalone EOL (~March 2026) — PyTorch and Intel documentation.
- GPU specifications (VRAM, memory bandwidth, TBP, INT8 TOPS) for the RTX 3090 and Arc Pro B70/B60/B50 — vendor spec pages.
Measured 2026-06-21 on production hardware (Intel Arc Pro B70 vs NVIDIA RTX 3090)