Technical Paper • June 2026

Reducing NVIDIA Dependency with Intel Arc

For the affordable inference-and-finetune tier, an Intel Arc Pro B70 (Battlemage) delivers 95–112% of an RTX 3090's throughput at 56–64% of its power — ~1.45–2.0× better performance-per-watt — making it a credible, power- and cost-efficient second source exactly where modern-AI value concentrates.

Measured 2026-06-21

Engineer José Luis Minich

6 workloads • 9 figures • K=3 repeats

Working draft v2

95–112% of 3090 throughput • 56–64% of 3090 power • 1.45–2.0× better perf / watt • 32 GB B70 VRAM (vs 24 GB) • 70 W B50 efficiency floor

Abstract

Measuring a credible NVIDIA second source

We measured an Intel Arc Pro B70 (32 GB, Battlemage) head-to-head against an NVIDIA RTX 3090 (24 GB, Ampere) on the ColabHive distributed inference/training platform, across six real AI workloads — LLM serving, image diffusion, and fine-tuning — with device-attributed throughput and end-to-end energy measurement. Every number comes from a controlled run of an identical workload on a known, dedicated GPU, with energy read at the device (Intel xe sysfs counters, NVIDIA NVML), so the comparison avoids the CPU/GPU contamination that plagues aggregate fleet telemetry.

On the workloads that dominate today's AI spend — LLM serving, LoRA and 4-bit QLoRA fine-tuning, and Stable Diffusion XL — the B70 delivers 95–112% of the 3090's throughput while drawing 56–64% of its power, for ~1.45–2.0× better performance-per-watt. Where applicable we also report the Arc Pro B50 (16 GB, 70 W) as a third, lower-power efficiency data point. Diffusion is an outright Intel win; QLoRA — once given the right SYCL device selector — is ~7% faster. The remaining gaps are in the software stack and are closing release-over-release; this is a value-tier second-source case, not a claim against the frontier H100/B200 silicon that dominates the data-center segment.

1 · Motivation

Why a second source is the thesis

The AI compute market is the most concentrated in modern computing. NVIDIA shipped an estimated 98% (3.76 M of 3.85 M units) of data-center GPUs in 2023 (TechInsights, via HPCwire). The dependency is reinforced by CUDA lock-in, which is contractual as well as technical: the CUDA EULA prohibits reverse-engineering or translating CUDA-SDK output to target a non-NVIDIA platform, and the ZLUDA CUDA-on-other-hardware project was taken down at AMD's request in August 2024. Supply and price compound the problem: Blackwell-generation accelerators have been reported sold out ~12 months ahead, at quoted prices of $30,000–40,000 per GPU.

For an infrastructure layer whose mission is to give developers, startups, and researchers — particularly in LATAM — access to affordable, available, power-efficient compute, single-vendor dependency is the central strategic risk. A credible second source is not optional; it is the thesis. Intel's Arc Pro B-series ("Battlemage") and the Project Battlematrix initiative (up to 8× Arc Pro B60 → 192 GB VRAM, targeting 70 B+ models) aim squarely at the AI-workstation / inference-server niche, with a containerized software stack (llm-scaler: vLLM-XPU, ComfyUI, SGLang). The open questions for a production operator are empirical: How close is real throughput? What is the actual energy and cost profile? Where does the software ecosystem still hurt? This paper answers them with measured data from production-grade hardware.

Scaling to Project Battlematrix — 8×B60 → 192 GB, 70 B+ on-node

The per-card economics in this paper are the unit of a larger story. Intel's Project Battlematrix packs 8× Arc Pro B60 into 192 GB of aggregate VRAM in a single workstation/server chassis — exactly the capacity envelope a 70 B-class model at full bf16 needs (~140 GB of weights plus KV cache, fit on-node with no cross-host sharding). That is a class of model that today forces a multi-GPU NVIDIA box (typically 2–4× A100/H100-class cards) at a multiple of the power draw. The measured single-card facts below are the building blocks of that claim: a B70 already serves a 14 B at full bf16 on one card (§5), and the per-card vLLM-XPU efficiency we measured (95–97% of 3090 LLM throughput at ~1.5× the tokens/joule) is the unit economics that, if it composes, scales into a 70 B-class node at a fraction of the NVIDIA multi-GPU power budget.

Honest caveat, and a concrete ask. We have not measured tensor-parallel (TP) scaling or inter-card bandwidth on a Battlematrix-class topology. Multi-card TP efficiency on Arc is unproven in our hands, and we observe a real multi-card TP fault on dual-B70 in the current vLLM-XPU stack (TP > 1 does not yet run clean across two Arc cards in our environment). The "8×B60 serves 70 B on-node" claim is therefore a forward projection from solid single-card data, not a measured result. Bottom line: the single-card unit economics are proven and favorable, and the only thing standing between them and a power-efficient on-node 70 B is multi-card TP validation. A Battlematrix validation unit in our lab would let us close that gap jointly: measure TP scaling, fix the TP path the way we fixed QLoRA, and report the results back with the same rigor.

2 · Methodology

Device-attributed, isolated, energy-instrumented

ColabHive already serves Intel Arc GPUs in production via an inference-ipex image built on Intel's own intel/llm-scaler-vllm:0.14.0-b8.3.1 base (vLLM-XPU, torch 2.10.0+xpu, compute-runtime 26.09). For this study the platform's energy telemetry was extended and wired end-to-end. The core methodological commitment is device-attributed, isolated, energy-instrumented measurement. We explicitly avoided three pitfalls:

CPU-vs-GPU contamination. On a GPU node, much inference actually executes on the CPU (specialists, tools, fallbacks), so aggregate telemetry mixes devices. Every number in the results comes from a controlled run on a known, dedicated GPU, verified GPU-resident via vLLM /metrics counters, GPU utilization, and energy draw.
Throughput ground-truth. LLM throughput is taken from vLLM's own vllm:generation_tokens_total counter (delta over the window), not client estimates, with output length pinned via ignore_eos so both vendors process exactly identical work.
Energy attribution. Intel energy is the xe driver's sysfs energy1_input accumulating counter (µJ, monotonic, exact by construction); NVIDIA energy is the exact NVML nvmlDeviceGetTotalEnergyConsumption counter where available (QLoRA), else the integral of nvidia-smi power.draw at 200 ms. The window is bracketed by workload-emitted MEASURE_START/END timestamps, so warmup and model-load are excluded.
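The counter arithmetic behind these three rules can be sketched as follows. This is a minimal illustration, not the platform's measurement harness: the function names are hypothetical, and only the units (an accumulating µJ counter on Intel, power samples every 200 ms as the NVIDIA fallback, a token counter delta for throughput) are taken from the text above.

```python
from typing import Sequence


def energy_from_counter_j(start_uj: int, end_uj: int) -> float:
    """Energy over the window from an accumulating microjoule counter
    (e.g. energy1_input), read at MEASURE_START and MEASURE_END."""
    if end_uj < start_uj:
        raise ValueError("counter went backwards; the window is not valid")
    return (end_uj - start_uj) / 1e6  # µJ -> J


def energy_from_power_j(samples_w: Sequence[float], dt_s: float = 0.2) -> float:
    """Trapezoidal integral of evenly spaced power samples (W), the fallback
    when no exact energy counter is available."""
    if len(samples_w) < 2:
        raise ValueError("need at least two samples to integrate")
    return sum((a + b) / 2.0 * dt_s for a, b in zip(samples_w, samples_w[1:]))


def throughput_tok_s(tokens_start: float, tokens_end: float, window_s: float) -> float:
    """Throughput from the delta of a generation-token counter over the window."""
    return (tokens_end - tokens_start) / window_s


if __name__ == "__main__":
    # Illustrative inputs only: a flat 219 W for 10 s, sampled every 200 ms.
    print(energy_from_power_j([219.0] * 51))  # joules over the window
```

Dividing the token delta by the energy for the same window gives tokens per joule, which is the quantity the perf/watt ratios in the results compare.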

Fairness controls. Identical model, identical prompts/inputs, bf16 + eager on both vendors (neither got fp8 or a compiled fast-path the other lacked), identical max_model_len, batch, and step counts. The two headline inference cells (LLM serving, SDXL) carry K=3 repeats with ±1 sample-std (n=3); the training cells are paired point estimates. Crucially, K=3 controls run-to-run noise on the same two physical cards; it does not control silicon-lottery, board-partner, or thermal variation (N=1 hardware per vendor).

Role	GPU	VRAM	Host CPU
Intel	Arc Pro B70 (0xe223, BMG-G31)	32 GB GDDR6-ECC	i7-10700 (8c)
NVIDIA	RTX 3090 (GA102)	24 GB GDDR6X	Xeon E5-2680 v4 (28c)

3 · Results

Summary matrix

Across the six benchmarked workloads, the B70 reaches parity-or-better on compute-bound modern AI while structurally drawing less power on every workload measured (44–80% of the 3090's draw):

Workload	Type	B70 throughput vs 3090	B70 perf/watt vs 3090	Verdict
SDXL image generation	inference	112% (n=3)	2.00×	Intel wins on both
LoRA fine-tune (LLM 7B)	training	~103%	1.64×	~parity + efficiency
LLM serving (7B, vLLM)	inference	95–97% (n=3)	1.46–1.65×	near-parity + efficiency
SDXL UNet LoRA	training	57%	1.29×	slower, still better perf/watt
TabNet (deep-tabular)	training	46%	0.56×	3090 wins (Intel weak spot)
QLoRA 4-bit (LLM 7B)	training	107%	1.89×	works (selector fix) + faster

Figure legend: RTX 3090 (NVIDIA, Ampere) • Arc Pro B70 (Intel, Battlemage) • Arc Pro B50 (Intel, 70 W floor)

Figure 1 — LLM serving throughput (tok/s). Qwen2.5-7B-Instruct on vLLM (bf16, eager, 4096 ctx), grouped by concurrency. Error bars are ±1 sample-std over K=3 repeats. The B70 holds 95–97% of the 3090's throughput while being markedly more consistent run-to-run (std < 1% vs 5–9% on the 3090); at C=1 the 3090's wide CI (37.8 ± 3.4) fully contains the B70's 36.6. B50 omitted — 7B bf16 OOMs its 16 GB (see §4.9).

LLM serving is the one cell where the B70 is slightly behind — but at 1.46–1.65× the tokens/joule. Two things stand out in the K=3 data. First, the B70 is markedly more consistent run-to-run (std < 1%) than the 3090, whose single-GPU throughput varied 5–9% across back-to-back sweeps. Second, the 3090's higher variance is partly mild thermal droop under sustained load — which cuts against us as much as for us: the C=1 near-parity could reflect the 3090 underperforming as much as the B70 reaching it, so we lean on the cleaner high-concurrency cells (C=8/16, a steady 95–97%) for the parity read and treat C=1 as noisy. Under load the 3090 drew ~343 W against the B70's ~219 W (~64%).

Figure 2 — Performance-per-watt, Intel ÷ 3090 (ratio). One bar group per workload (blue = B70, teal = B50); the dashed line marks the 1.0 parity baseline. Above 1.0 is an Intel efficiency win. The B70 beats the 3090 on every workload except TabNet (up to 2.00× on SDXL); the B50 is even more efficient on the three workloads it can run (2.51× SDXL, 2.45× QLoRA) — it has no LoRA/LLM bars because those OOM its 16 GB.

Perf/watt is a structural property of the part, not a one-off. The B70 wins the efficiency axis on five of the six workloads in the summary matrix, peaking at 2.00× on SDXL inference and 1.89× on QLoRA. Only TabNet inverts (0.56×), for reasons we own plainly below. Because perf/watt is now collected fleet-wide by the node runtime (NVML on NVIDIA, Xe sysfs on Intel, into node_power_samples), this is platform telemetry going forward, not a single bench.

Figure 3 — Energy per unit of work, B70 as % of 3090 (lower is better). The 100% line is the 3090 baseline. Error bars (±, propagated from K=3 where available) show the SDXL/image margin is comfortably separated from parity. The B70 uses roughly half the energy per SDXL image and per QLoRA step, and ~60–69% per LoRA step and per LLM token.

The energy story is the clearest part of the case. SDXL costs the B70 half the joules per image (50.0%), QLoRA half the joules per step (52.8%), LoRA ~61%, and an LLM token at C=16 ~69%. These are not rounding-error margins; on SDXL the K=3 confidence intervals are cleanly separated (B70 1,621 ± 6 J/image vs 3090 3,240 ± 37 J/image), so the efficiency win is repeatable, not a single-run artifact.

Figure 4 — Throughput as % of 3090 (higher is better). The 100% line is parity (blue = B70, teal = B50). The B70 is at or above parity on the modern-AI core (SDXL, QLoRA, LoRA, LLM), and behind on raw-eager UNet-LoRA training and TabNet. The B50 trades throughput for efficiency — 22–44% of the 3090 on the three workloads it runs (no LoRA/LLM bars, which OOM its 16 GB).

Read together, Figures 2–4 are the heart of the paper: on the workloads that dominate affordable-tier AI spend the B70 is at or near throughput parity (SDXL 112%, QLoRA 107%, LoRA 103%, LLM 95–97%) while spending far fewer joules to get there. The two red bars — UNet-LoRA at 57% and TabNet at 46% — are real and are addressed honestly in their own sections; they are raw-eager / small-operator patterns where a mature CUDA stack still pulls ahead.

Figure 5 — Average board power under load (W). Grouped 3090 vs B70 vs B50, per workload. The B70 draws 44–80% of the 3090's draw on every workload; the B50 sits far lower still — 39–70 W on the SDXL/QLoRA/TabNet workloads it can run (no LLM/LoRA bars). This is the structural efficiency advantage that underpins every perf/watt figure above.

Power is where the silicon advantage is unambiguous. Across every workload the B70 drew 44–80% of the 3090's board power — from 152 W vs 346 W on UNet-LoRA (44%) to 219 W vs 343 W on LLM serving (64%). Lower power is the lever that turns near-parity throughput into a decisive perf/watt and operating-cost win, and it is a property of the part that no amount of engine maturity on the NVIDIA side can erode.

4 · The QLoRA reversal

A "hardware limitation" that was one line of config

An earlier pass reported 4-bit QLoRA as failing on Battlemage. On re-investigation that was a misdiagnosis, and the correction is one of the more instructive results in this study. The failure is not a missing Triton-XPU kernel. bitsandbytes 0.49.2 dispatches 4-bit quantization through a native custom op (torch.ops.bitsandbytes.quantize_4bit), and under the platform's device-pinning convention ONEAPI_DEVICE_SELECTOR=level_zero:0 that op throws a SYCL No device of requested type available — bnb's separately-compiled SYCL queue rejects the Level-Zero selector even though torch.xpu sees the GPU fine.

The fix. Setting ONEAPI_DEVICE_SELECTOR=*:gpu (any-backend GPU) lets the kernel resolve a device, and 4-bit QLoRA then trains end-to-end on the B70, ~7% faster than the 3090 and 1.89× more energy-efficient. It is genuinely 4-bit: the quantized 7B occupies 5.45 GiB on both vendors (bf16 would be ~15 GiB), and the ~2.2× slowdown vs LoRA-fp16 is the expected 4-bit dequant tax. A result we had shipped as a hardware limitation turned out to be a one-line configuration issue.

QLoRA-NF4 (Qwen2.5-7B, r=16)	RTX 3090	Arc Pro B70
Throughput	0.889 steps/s · 911 tok/s	0.949 steps/s · 972 tok/s
Energy / step	459.0 J	242.3 J
Avg power	~408 W	~230 W
VRAM (4-bit weights)	5.45 GiB	5.45 GiB
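The headline ratios quoted for QLoRA follow directly from the two measured columns in this table. A minimal sketch of that arithmetic (the dictionary layout and function names are illustrative, the numbers are the table's):

```python
# Measured cells from the QLoRA-NF4 table above.
RTX_3090 = {"steps_per_s": 0.889, "joules_per_step": 459.0}
ARC_B70 = {"steps_per_s": 0.949, "joules_per_step": 242.3}


def throughput_ratio(card: dict, base: dict) -> float:
    return card["steps_per_s"] / base["steps_per_s"]


def energy_ratio(card: dict, base: dict) -> float:
    """Energy per step as a fraction of the baseline (lower is better)."""
    return card["joules_per_step"] / base["joules_per_step"]


def perf_per_watt_ratio(card: dict, base: dict) -> float:
    """Steps/s per watt equals steps per joule, so this is the inverse energy ratio."""
    return base["joules_per_step"] / card["joules_per_step"]


def implied_power_w(card: dict) -> float:
    """Average power implied by throughput and energy per step (J/step x step/s)."""
    return card["steps_per_s"] * card["joules_per_step"]


if __name__ == "__main__":
    print(round(throughput_ratio(ARC_B70, RTX_3090), 3))     # ~7% faster
    print(round(energy_ratio(ARC_B70, RTX_3090), 3))         # about half the joules per step
    print(round(perf_per_watt_ratio(ARC_B70, RTX_3090), 2))  # perf/watt advantage
```

The implied power values (about 408 W and 230 W) match the table's "Avg power" row, which is a useful consistency check on the measured cells.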

The methodological lesson is one we own: this updates our prior so that we now treat a single negative-for-Intel result as provisional pending a configuration audit. By that bar the TabNet weak spot is not yet audited and should be read as "unaudited, plausibly improvable" rather than a settled silicon limit. The remaining QLoRA work is purely platform-integration: wiring the *:gpu selector into the Intel training-launch path for bnb-4bit workloads.

⭐ Contribution — unblocking 4-bit QLoRA across the Arc B-series

This finding is significant enough to name as a contribution of this work, not bury as a footnote. The conventional wisdom is wrong. The public consensus in 2026 — repeated across forums, issue threads, and "does it run on Intel?" guides — is that "bitsandbytes / 4-bit QLoRA does not run on Intel Arc." Meanwhile bitsandbytes' own release notes officially list XPU support. Both cannot be fully true, and the gap is where operators get stuck and conclude (as we initially did) that the hardware can't do it.

The trap is a documentation collision. Intel's own multi-GPU guidance recommends pinning devices with ONEAPI_DEVICE_SELECTOR=level_zero:N, and that is exactly the selector under which bitsandbytes' separately-compiled SYCL 4-bit op throws No device of requested type available. An operator who follows Intel's multi-GPU docs to the letter and installs the XPU-supporting bitsandbytes hits a hard failure that looks like a missing kernel, when in fact torch.xpu sees the GPU fine and only bnb's SYCL queue rejects the Level-Zero selector. Two pieces of configuration that are each correct in isolation combine into the failure.

Our root-cause and fix. Set ONEAPI_DEVICE_SELECTOR=*:gpu (any-backend GPU, so bnb's kernel queue can resolve a device) and pin the specific card with ZE_AFFINITY_MASK=<idx> instead of the Level-Zero selector. With that one change, 4-bit QLoRA trains end-to-end on the B70 (~7% faster than the 3090 at 1.89× the efficiency, genuinely 4-bit at 5.45 GiB), and we confirmed it generalizes across the B-series — the identical fix (*:gpu + ZE_AFFINITY_MASK=1) works on the B50. Because the fix is selector-level and card-agnostic, it unblocks 4-bit QLoRA on the entire Arc B-series — including Intel's own 8-card Project Battlematrix topology, where per-card pinning is mandatory and the level_zero:N trap is therefore likely to surface for anyone following the standard multi-GPU docs.
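In a launcher, the selector change amounts to building the child process environment differently. The sketch below is one hedged way to express it; the helper name arc_bnb_env and the training script name in the comment are hypothetical, and this is not the node-runtime implementation. Only the two environment variables and their values come from the text above.

```python
import os
from typing import Dict, Mapping, Optional


def arc_bnb_env(card_index: int, base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Environment for a bitsandbytes 4-bit job pinned to one Arc card.

    Uses the any-backend GPU selector so the bnb SYCL queue can resolve a
    device, and pins the card with ZE_AFFINITY_MASK instead of level_zero:N.
    """
    if card_index < 0:
        raise ValueError("card_index must be non-negative")
    env = dict(os.environ if base is None else base)
    env["ONEAPI_DEVICE_SELECTOR"] = "*:gpu"    # replaces any level_zero:N value
    env["ZE_AFFINITY_MASK"] = str(card_index)  # per-card pinning
    return env


if __name__ == "__main__":
    env = arc_bnb_env(1, base={"ONEAPI_DEVICE_SELECTOR": "level_zero:1"})
    print(env["ONEAPI_DEVICE_SELECTOR"], env["ZE_AFFINITY_MASK"])
    # A launcher would then start the job with this environment, for example:
    # subprocess.run(["python", "train_qlora.py"], env=arc_bnb_env(1), check=True)
```

Passing the environment to a fresh subprocess, rather than mutating os.environ inside an already running interpreter, is the safer assumption here, since device selectors are typically read when the runtime initializes.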

It is in production. The fix is not a notebook hack — it is wired into a production training launcher (node-runtime 0.10.121), so every Arc training job ColabHive dispatches uses the correct selector automatically. And we are giving it back: we are reporting this upstream to bitsandbytes (the failure signature + the selector root-cause) and to Intel's IPEX / llm-scaler documentation. A ready-to-file bug-report draft is in Appendix C.

Bottom line: we don't just consume the Intel stack — we fix and contribute back to it. The single most-cited "Arc can't do QLoRA" limitation is, in our hands, a solved, productionized, upstream-reported one-line configuration fix that scales to the full B-series and to Battlematrix.

4.10 · INT8 inference on Battlemage

The spec promise vs the realizable win

The economics below lean on Battlemage's published INT8 TOPS/W advantage (B70 1.60 vs 3090 0.81). A spec is a promise, not a measurement — so we tried to cash it into delivered tokens/s on the B70. The answer has two halves: the INT8-compute path is blocked by a missing kernel, but the deployment-standard 4-bit path not only works, it flips the LLM-serving verdict to a B70 win.

(a) True INT8 (W8A8): blocked by a missing vLLM-XPU kernel. Serving an INT8 compressed-tensors W8A8 model (which would exercise the 367 INT8 TOPS) fails at engine init on vLLM-XPU:

File .../quantization/kernels/scaled_mm/__init__.py, line 55, in choose_scaled_mm_linear_kernel
    for kernel in _POSSIBLE_KERNELS[current_platform._enum]:
KeyError: <PlatformEnum.XPU: 4>

vLLM has no INT8 scaled-mm kernel registered for the XPU platform — the dispatch table carries CUDA/ROCm/CPU/TPU entries but not XPU. So the B70's published INT8 TOPS are not yet realizable through vLLM's INT8 serving path: the economic INT8-TOPS edge is a genuine spec advantage that today's software stack cannot cash in. It is a concrete, reportable gap — analogous to the QLoRA selector issue but deeper (a missing kernel registration, not a config), and squarely on the closing-trajectory list.

(b) AWQ-4bit — works, and the B70 leads the 3090. The precision operators actually deploy on 24–32 GB cards is 4-bit weight-only (AWQ/GPTQ). On vLLM-XPU this routes through the IPEX weight-only path, gated behind a deprecation guard that we bypass with one flag (--allow-deprecated-quantization — a third minor ecosystem unlock we document). Serving identical AWQ Qwen2.5-7B, coherent output on both vendors:

Figure 9 — AWQ-4bit LLM throughput (tok/s). AWQ-4bit Qwen2.5-7B — quantization flips bf16 near-parity into a clean B70 win (+19–26%). Grouped by concurrency (green = 3090, blue = B70). At C=1 40.8 → 48.9, C=8 302.9 → 381.8, C=16 598.0 → 709.5 tok/s — the B70 leads at every concurrency.

AWQ-4bit, Qwen2.5-7B	3090 tok/s	B70 tok/s	B70 %	3090 tok/J	B70 tok/J
C=1	40.8	48.9	120%	0.255	0.325
C=8	302.9	381.8	126%	1.913	2.351
C=16	598.0	709.5	119%	3.506	4.179
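The percentage column and the efficiency advantage can be recomputed from the four measured columns of this table. A short sketch of that arithmetic (the tuple layout and function names are illustrative):

```python
# concurrency -> (3090 tok/s, B70 tok/s, 3090 tok/J, B70 tok/J), from the table above
AWQ_ROWS = {
    1: (40.8, 48.9, 0.255, 0.325),
    8: (302.9, 381.8, 1.913, 2.351),
    16: (598.0, 709.5, 3.506, 4.179),
}


def throughput_pct(row: tuple) -> float:
    """B70 throughput as a percentage of the 3090's."""
    return 100.0 * row[1] / row[0]


def efficiency_ratio(row: tuple) -> float:
    """B70 tokens/joule divided by 3090 tokens/joule."""
    return row[3] / row[2]


def implied_b70_power_w(row: tuple) -> float:
    """(tok/s) / (tok/J) = J/s = W."""
    return row[1] / row[3]


if __name__ == "__main__":
    for c, row in AWQ_ROWS.items():
        print(c, round(throughput_pct(row)), round(efficiency_ratio(row), 2))
```

On these cells the tokens/joule advantage ranges from about 1.19× at C=16 to about 1.27× at C=1, and the implied B70 power at C=16 comes out near 170 W.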

Quantization flips the LLM-serving verdict. At bf16 (§3) the B70 trails the 3090 at 95–97%; at AWQ-4bit, the realistic operating point, the B70 leads by 19–26% on throughput and by 1.19–1.27× on tokens/joule (1.19× at C=16), while also cutting its own latency (p50 2.70 s vs 3.65 s at bf16) and power (~170 W vs ~219 W). Each vendor runs its own AWQ kernel (IPEX weight-only on XPU vs Marlin on CUDA), so this is a real-world "what each card actually serves" comparison rather than a same-kernel one, and it lands in the B70's favor.

Bottom line: the B70's INT8-TOPS spec edge is not yet cashable (vLLM has no XPU INT8 scaled-mm kernel, a fixable gap reported in Appendix C), but the realizable quantized win is already decisive: at the 4-bit precision people actually deploy, the B70 beats the RTX 3090 on LLM serving (+19–26%) and efficiency (at least 1.19×), turning the lone bf16 near-parity loss into a clean win at the operating point that matters.

5 · VRAM headroom

A model the 3090 cannot hold at full precision

Figure 8 — VRAM capacity & full-precision headroom. Horizontal bars show board VRAM (B50 16 GB, 3090 24 GB, B70 32 GB). The dashed line at 29.5 GB marks Qwen2.5-14B at bf16: both the B50 and the 3090 OOM before reaching it, while only the B70 loads and serves it (31.1 GiB used).

At full bf16 precision the 32 GB B70 holds mid-size models that do not fit a 24 GB 3090. We frame this honestly: it is a simplicity / headroom advantage, not an absolute capability gap — a quantized (AWQ/GPTQ/fp8) 14 B fits a 3090 fine and is the standard way to serve one on 24 GB. The narrow, measured claim: at the identical no-quantization bf16 config used everywhere else in this paper, loading Qwen2.5-14B-Instruct has the 3090 fail with torch.OutOfMemoryError (23.49 GiB used, cannot allocate the next 270 MiB) while the B70 loads and serves a coherent completion at 31.1 GiB used.

The operational value is simplicity: on the B70 you serve or fp16-finetune a ~13–14 B model on a single card with no quantization pipeline, no offload, no second GPU. So the defensible framing is not "NVIDIA can't run a 14 B" (it can, quantized) but "Arc Pro gives full-precision headroom for a whole class of mid-size models that a 24 GB card forces you to quantize or shard."
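The headroom argument rests on simple weight-size arithmetic. The sketch below is a rough estimate under stated assumptions: it counts weights only (no KV cache, activations, or allocator overhead), the function names are illustrative, and passing the check is a necessary condition for fitting, not a sufficient one.

```python
def weight_footprint_gb(n_params: float, bits_per_param: int) -> float:
    """Weights-only footprint in decimal GB: parameters x bits / 8."""
    return n_params * bits_per_param / 8 / 1e9


def weights_fit(weights_gb: float, vram_gb: float) -> bool:
    """Necessary (not sufficient) condition: the weights alone must fit."""
    return weights_gb < vram_gb


if __name__ == "__main__":
    # A 70 B model at bf16, the ~140 GB figure used in the Battlematrix discussion.
    print(weight_footprint_gb(70e9, 16))
    # The 29.5 GB bf16 line from Figure 8 against the three boards.
    for name, vram in (("B50", 16), ("RTX 3090", 24), ("B70", 32)):
        print(name, weights_fit(29.5, vram))
```

The same function shows why quantization closes the gap on a 24 GB card: at 4 bits per parameter the weights shrink to a quarter of their bf16 size.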

4.9 · The Arc Pro B50

Efficiency at 70 W — a third data point

We also ran everything that fits on the node's second Intel card, the Arc Pro B50 (16 GB, 70 W TBP, no external power connector) — same device selectors (level_zero:1; *:gpu + ZE_AFFINITY_MASK=1 for QLoRA, confirming the §4 fix is general rather than B70-specific) and the same card-attributed Xe energy method. The B50 is the efficiency / power floor of the three parts.

Workload	RTX 3090 (350 W)	Arc B70 (32 GB)	Arc B50 (16 GB, 70 W)
SDXL / image	8.21 s · 3,240 J · ~394 W	7.33 s · 1,621 J · ~221 W	18.74 s · 1,291 J · ~69 W
QLoRA-NF4 / step	0.889 st/s · 459 J · ~408 W	0.949 st/s · 242 J · ~230 W	0.373 st/s · 187 J · ~70 W
TabNet	16,852 rows/s · ~108 W	7,739 rows/s · ~87 W	3,700 rows/s · ~39 W
LLM serving (7B, bf16)	✓	✓	✗ OOM (16 GB)
LoRA-fp16 (7B)	✓	✓	✗ OOM (16 GB)

(1) Slow but astonishingly efficient. The B50 is 2.3–4.5× slower than the 3090, yet draws only 39–70 W and posts the lowest energy-per-unit-of-work of all three parts on the workloads it can run — SDXL at 1,291 J/image (0.40× the 3090, below even the B70) and QLoRA at 187 J/step (0.41× the 3090). For power- or density-constrained deployments (many cards per chassis, no external power connector), that is the entire pitch.

(2) 16 GB is the binding constraint. A 7 B model at bf16 out-of-memories the B50 for both serving and LoRA-fp16 — it must be quantized to fit; QLoRA-4bit (5.45 GiB) runs comfortably. So the B50 is a quantized-inference and QLoRA card, and the B70's 32 GB is precisely what buys the full-precision headroom of §5.

Two complementary Intel parts. The B50 is the efficiency floor (the lowest joules-per-unit-of-work in the study, at 39–70 W and no external power) for quantized inference and QLoRA. The B70 is the full-precision capacity part: 32 GB of bf16 headroom and near-3090 throughput. Together they span the second-source range: pick the B50 where power and density bind, the B70 where precision and model size do.

6 · Economics

Cheaper VRAM, decisively better perf-per-watt

Figure 6 — VRAM cost ($/GB, lower is better). The Arc Pro line is ~0.35–0.47× the $/GB of a new 3090 ($62.46/GB) but only ~0.81× vs a used 3090 (~$36.5/GB, shown for context) — the honest comparator for a 2020 part.

On $/GB VRAM, the headline figure for Intel Arc Pro is "~2.5–3× VRAM per dollar" against a new 3090 ($1,499). But the 3090 is a 2020 part that is now bought used (~$700–1,050 ≈ $36.5/GB); against that realistic comparator the Arc Pro $/GB is only ~0.6–0.8× the 3090's, i.e. roughly 1.2–1.7× the VRAM per dollar, not 3×. We price the B70 itself here (~$949), not only the cheaper B60/B50; earlier drafts were fairly criticized for omitting it.
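The $/GB comparison is a one-line division per card. The sketch below recomputes it from prices and capacities stated in this paper (the used-3090 price of $875 and the B50 price of $349 are the values from the TCO assumptions; the data layout is illustrative).

```python
# card -> (price in USD, VRAM in GB), using prices stated in this paper
CARDS = {
    "3090 (new)": (1499, 24),
    "3090 (used)": (875, 24),
    "Arc Pro B70": (949, 32),
    "Arc Pro B50": (349, 16),
}


def usd_per_gb(card: str) -> float:
    price, vram_gb = CARDS[card]
    return price / vram_gb


def relative_usd_per_gb(card: str, baseline: str) -> float:
    """Below 1.0 means the card's VRAM is cheaper per GB than the baseline's."""
    return usd_per_gb(card) / usd_per_gb(baseline)


if __name__ == "__main__":
    for name in CARDS:
        print(f"{name}: ${usd_per_gb(name):.2f}/GB")
    print(round(relative_usd_per_gb("Arc Pro B70", "3090 (used)"), 2))
```

Against the used 3090 the B70 lands at about 0.81× and the B50 at about 0.60×, which is the ~0.6–0.8× range quoted above.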

Figure 7 — Per-watt economics (higher is better). INT8 TOPS/W and GB VRAM/100W across the Arc Pro line and the 3090. The 3090 is last on both axes — Intel's lead on energy economics is robust regardless of price basis, because the used market can erode price but not watts.

Where Intel's lead is robust regardless of price basis is the per-watt economics: VRAM per watt (1.75–3.3×) and INT8 TOPS/W (1.2–3.0×) are structural efficiency the used market cannot erode. The honest economic case: modestly cheaper VRAM than a used 3090, much cheaper than a new one, decisively better perf/watt — and a brand-new part (warranty, current drivers, support) versus a depreciating six-year-old card.

Sticker price is the wrong unit. What an operator actually pays is cost per unit of work over the life of the card — hardware amortization plus the electricity to do the work — and that is where the measured perf/watt of the results converts directly into dollars.

TCO assumptions (stated once, used throughout). 3-year straight-line hardware amortization = 26,280 operating hours. Electricity: US $0.15 / Argentina $0.10 / Germany $0.30 per kWh. Grid carbon intensity ≈ 0.4 kg CO₂/kWh. Hardware capital: used 3090 $875, B70 $949, B50 $349 (new 3090 $1,499 shown where it sharpens the contrast). Throughput and watts are the measured cells. Full formula and a sensitivity note are in Appendix A.

Interactive TCO calculator

All-in 3-year cost per unit of work, recomputed from the measured §3 throughput and power. Controls: electricity price (shown at $0.15/kWh), utilization (shown at 70%), and workload. Cards compared: RTX 3090 (used, baseline), Arc Pro B70, Arc Pro B50. Illustrative TCO model:

$/unit = price_card / (units_per_3yr × util)
        + energy_per_unit_kWh × price_kWh

with units_per_3yr = throughput × 26,280 h. Assumptions in Appendix A.

All-in cost per unit of work

All-in $/unit = hardware ÷ (units produced over life at utilization U) + energy/unit × $/kWh. The hardware term shrinks with utilization (a card amortized over more work is cheaper per unit); the energy term does not. Both headline workloads are computed from the measured throughput and power.

LLM serving (Qwen-7B, C=16) — $ / million tokens @ US $0.15, 100% utilization:

Card	HW $/M-tok	Energy $/M-tok	All-in $/M-tok
Arc B70	$0.0181	$0.0166	$0.0347
3090 (used)	—	—	$0.0401
3090 (new)	—	—	$0.0514

The B70's win on LLM serving is purely energy: it is slightly behind on raw throughput, so it does not win on the hardware term; it wins because each token costs ~31% fewer joules (~69% of the 3090's energy per token at C=16). That makes the advantage utilization-gated: at very low utilization the hardware term dominates and the cheaper-sticker used 3090 can edge ahead; as utilization rises the energy term dominates and the B70 pulls away.

Crossover utilization U* = $0.0429 / price_kWh. The B70 beats a used 3090 on $/token above U* = 28.6% utilization at US $0.15/kWh, 14.3% at Germany $0.30, and 43% at Argentina $0.10. The more expensive your power and the busier your fleet, the more decisively Arc wins LLM serving — and a production inference fleet runs well above 28.6%.
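The crossover follows from setting the two all-in costs equal. With hardware terms h (at 100% utilization) and energy per unit e, the B70 is cheaper when (h_B70 − h_3090) / U < (e_3090 − e_B70) × price, so U* = k / price with k = Δh / Δe. A minimal sketch, using the paper's constant k = 0.0429 as the default (function names are illustrative):

```python
def crossover_constant(hw_gap_at_full_util: float, energy_gap_kwh_per_unit: float) -> float:
    """k = (extra hardware $ per unit at U=1) / (kWh per unit saved)."""
    if energy_gap_kwh_per_unit <= 0:
        raise ValueError("the cheaper-energy card must save energy per unit")
    return hw_gap_at_full_util / energy_gap_kwh_per_unit


def crossover_utilization(price_kwh: float, k: float = 0.0429) -> float:
    """Utilization above which the lower-energy card is cheaper per unit."""
    return k / price_kwh


if __name__ == "__main__":
    for region, price in (("US", 0.15), ("Germany", 0.30), ("Argentina", 0.10)):
        print(region, f"{crossover_utilization(price):.1%}")
```

Because price sits in the denominator, doubling the electricity price halves the utilization needed for the B70 to win.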

SDXL image generation — $ / 1,000 images @ US $0.15, 100% utilization:

Card	All-in $/1,000 img	vs 3090-used	vs 3090-new
Arc B50	$0.123	—	—
Arc B70	$0.141	33% cheaper	47% cheaper
3090 (used)	$0.211	—	—
3090 (new)	$0.265	—	—

SDXL is different in kind: the B70 wins on both the hardware-per-image term (it is faster per image) and the energy term (half the joules). When a card is cheaper on every component of the cost, there is no crossover.
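The SDXL cells can be recomputed from the all-in formula with the throughput and power values listed in Appendix A. A minimal sketch (the function and dictionary names are illustrative; inputs are the paper's):

```python
HOURS_LIFE = 26_280  # 3-year straight-line amortization


def all_in_cost_per_unit(capital: float, units_per_hour: float, power_w: float,
                         price_kwh: float, utilization: float = 1.0) -> float:
    """Hardware term plus energy term, in $ per unit of work."""
    hardware = capital / (units_per_hour * HOURS_LIFE * utilization)
    energy = (power_w / 1000.0) / units_per_hour * price_kwh
    return hardware + energy


# card -> (capital $, images per hour, average watts), SDXL cells from Appendix A
SDXL = {
    "B50": (349, 3.2 * 60, 69),
    "B70": (949, 8.2 * 60, 221),
    "3090 (used)": (875, 7.3 * 60, 394),
    "3090 (new)": (1499, 7.3 * 60, 394),
}


def usd_per_1000_images(card: str, price_kwh: float = 0.15, utilization: float = 1.0) -> float:
    capital, img_per_hour, watts = SDXL[card]
    return 1000.0 * all_in_cost_per_unit(capital, img_per_hour, watts, price_kwh, utilization)


if __name__ == "__main__":
    for card in SDXL:
        print(card, round(usd_per_1000_images(card), 3))
```

Sweeping the utilization argument shows the no-crossover property: the B70's hardware term and energy term are both lower than the used 3090's, so lowering utilization scales the gap but never reverses it.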

Bottom line: under our measured throughput/power and the stated TCO assumptions, SDXL is cheaper on Arc at every utilization and every electricity price. In our model there is no operating point (no fleet, no country, no duty cycle) at which a 3090 generates SDXL images more cheaply than a B70, let alone a B50. The higher sticker on a B70 is repaid by the energy bill, not by marketing: on LLM serving it pays back above ~29% utilization at US prices, i.e. on any real fleet, and on SDXL it is cheaper at every operating point in our TCO model.

Energy & carbon at fixed throughput

Flip the question: hold output constant and ask what it costs in power and carbon. This is the number a data-center operator or an ESG-minded investor cares about.

Fixed throughput	RTX 3090	Arc B70	Arc B50
1 M SDXL images / day	900 kWh/day · 328.5 MWh/yr	450 kWh/day · 164.4 MWh/yr (−50%, −66 t CO₂/yr)	359 kWh/day (−60%)
1 B tokens / day (LLM)	161 kWh/day	111 kWh/day (−31%)	OOM (16 GB)
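The SDXL row is a unit conversion from the measured joules per image. A short sketch of that arithmetic, using the 0.4 kg CO₂/kWh grid intensity from the TCO assumptions (function names are illustrative):

```python
J_PER_KWH = 3.6e6
GRID_KG_CO2_PER_KWH = 0.4  # from the TCO assumptions

# Measured SDXL energy per image (J)
J_PER_IMAGE = {"3090": 3240.0, "B70": 1621.0, "B50": 1291.0}


def kwh_per_day(joules_per_unit: float, units_per_day: float) -> float:
    return joules_per_unit * units_per_day / J_PER_KWH


def mwh_per_year(kwh_day: float) -> float:
    return kwh_day * 365 / 1000.0


def co2_tonnes_saved_per_year(kwh_day_base: float, kwh_day_card: float) -> float:
    """kWh saved per year x kg CO2/kWh, converted to tonnes."""
    return (kwh_day_base - kwh_day_card) * 365 * GRID_KG_CO2_PER_KWH / 1000.0


if __name__ == "__main__":
    base = kwh_per_day(J_PER_IMAGE["3090"], 1_000_000)
    b70 = kwh_per_day(J_PER_IMAGE["B70"], 1_000_000)
    print(round(base), round(b70), round(co2_tonnes_saved_per_year(base, b70)))
```

The same conversion applies to the LLM row once energy per token is substituted for energy per image.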

Bottom line: at any fixed output, Arc cuts the energy bill by 31% (LLM, B70) to 60% (SDXL, B50), and the carbon with it. A diffusion-heavy slice on B70s halves its electricity and sheds ~66 t CO₂/yr per million images per day of capacity. As an illustrative ceiling, an SDXL fleet that draws 6.1 TWh/yr on 3090s would draw 3.05 TWh/yr on B70s, a saving roughly equal to the continuous output of a ~350 MW power plant. At fleet scale the diffusion savings alone are measured in power plants, not percentages.

Rack / chassis density

Power, not slots, is the binding constraint in a modern rack. Sizing by total board power (TBP — 3090 350 W / B70 230 W / B50 70 W) into a fixed 10 kW rack shows what you can actually deploy per kilowatt:

Per 10 kW rack	RTX 3090	Arc B70	Arc B50
Cards	28	43	142
Aggregate VRAM (GB)	672	1,376 (2.05×)	2,272 (3.38×)
SDXL throughput (img/min)	204	353 (1.73×)	454 (2.23×)
LLM throughput (tok/s)	16,296	23,878 (1.47×)	OOM (16 GB)
VRAM per kW (GB/kW)	68.6	139.1	228.6
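The rack rows follow from integer card counts under a power budget, scaled linearly. A minimal sketch using the TBP, VRAM, and per-card throughput figures stated in this paper (the layout and function names are illustrative):

```python
RACK_BUDGET_W = 10_000

# card -> (TBP in W, VRAM in GB, SDXL img/min per card, LLM tok/s per card or None if OOM)
CARDS = {
    "RTX 3090": (350, 24, 7.3, 582.0),
    "Arc B70": (230, 32, 8.2, 555.3),
    "Arc B50": (70, 16, 3.2, None),
}


def cards_per_rack(card: str, budget_w: int = RACK_BUDGET_W) -> int:
    """Whole cards that fit in the power budget (no PSU/cooling overhead)."""
    return budget_w // CARDS[card][0]


def rack_totals(card: str) -> dict:
    tbp, vram, img_min, tok_s = CARDS[card]
    n = cards_per_rack(card)
    return {
        "cards": n,
        "vram_gb": n * vram,
        "sdxl_img_min": n * img_min,
        "llm_tok_s": None if tok_s is None else n * tok_s,
    }


if __name__ == "__main__":
    for name in CARDS:
        print(name, rack_totals(name))
```

Linear scaling is the stated simplification here; a real deployment would subtract PSU and cooling overhead from the budget before dividing.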

Bottom line: the no-external-power 70 W B50 is the density play. In a power-bound rack it delivers 3.3× the VRAM and 2.2× the SDXL throughput per kilowatt of a 3090, and enough B50s to match one used 3090's image output (~2.3 cards, from the measured 18.74 s vs 8.21 s per image) cost less (~$800 vs $875) and draw under half the power (~160 W vs ~394 W). Every density and energy figure here scales cards linearly and ignores the ~10–20% PSU/cooling overhead a real rack pays, which penalizes the higher-wattage 3090 fleet more, so these figures understate the Arc advantage.

7 · Honest limitations

Where the case is weakest, stated plainly

The TabNet weak spot. The B70 runs TabNet deep-tabular at ~46% of the 3090's throughput (the 3090 is ~2.2× faster, and 1.8× more efficient). TabNet is many small operators with frequent host-device synchronization; the XPU does not saturate (~87 W indicates a starved GPU) and per-kernel/eager overhead dominates, which is exactly the pattern where a mature CUDA stack pulls ahead. Per the QLoRA lesson, this cell is unaudited and may be partly host-CPU (the Intel host is an 8-core i7 vs a 28-core Xeon).

The case is deliberately calibrated. Beyond TabNet, classical gradient boosting (XGBoost/CatBoost) has no Intel-GPU backend and stays on CPU — a library reality, not a B70 deficiency. The hardware sample is N=1 per vendor: K=3 controls run-to-run noise on the same two physical cards but not silicon-lottery, board-partner, or thermal variation, so a second device and a second model size (~1.5 B) are the next steps. There is also an engine-version skew confound — Intel runs vLLM-XPU 0.14.x on torch 2.10+xpu while NVIDIA runs stock vLLM (cu121 lineage) — so near-parity reflects silicon plus engine. And the economics are strongest against new-NVIDIA pricing and only modest against the used-3090 comparator. The remaining gaps (no FlashAttention/xformers on XPU, no CUDA-graph equivalent, out-of-tree Triton-XPU) are software-ecosystem ones with a clear, fast-moving upstream trajectory — not silicon limitations.

Bottom line: for the affordable inference-and-finetune tier ColabHive serves — not the frontier H100/B200 segment — Intel Arc Pro is a credible, power- and cost-efficient NVIDIA alternative exactly where modern-AI value concentrates for that tier (LLM + diffusion inference, LoRA/QLoRA fine-tuning). Deploy LLM serving and SDXL inference on Arc Pro first; run LoRA and 4-bit QLoRA fine-tuning there too; keep deep-tabular/TabNet on NVIDIA and classical GBDT on CPU. The remaining gaps are in the software stack and are closing release-over-release.

Keywords

Intel Arc · Battlemage · QLoRA · perf-per-watt · vLLM · NVIDIA dependency · LATAM · energy-aware · SDXL · VRAM economics

Appendices

TCO model, raw data, reproduction & upstream report

Appendix A — The TCO model (formula, assumptions, sensitivity). Everything in the economics section is computed from this one model; a reviewer can recompute every cell. For a card producing throughput T units/hour at average power P watts:

all_in_$/unit(U) = capital / (T × hours_life × U)      ← hardware term
                 + (P / 1000 / T) × price_kWh          ← energy term

where  U                    = utilization (fraction of hours_life under load, 0 < U ≤ 1)
       units_over_life      = T × hours_life × U
       energy_per_unit_kWh  = (P watts / 1000) / (T units/hour)

Assumptions: hours_life = 26,280 h (3-yr straight-line).
  price_kWh ∈ {US 0.15, AR 0.10, DE 0.30}. Grid carbon = 0.4 kg CO₂/kWh.
  Capital: 3090-used $875, 3090-new $1,499, B70 $949, B50 $349.
  T, P from the measured cells (LLM C=16: 3090 582 tok/s @343 W, B70 555.3 @219 W;
  SDXL: 3090 7.3 img/min @394 W, B70 8.2 @221 W, B50 3.2 @69 W).

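Crossover (LLM vs the used 3090; B70 wins energy but not HW):
  U* = ΔHW_constant / price_kWh = 0.0429 / price_kWh
       where ΔHW_constant = (B70 − 3090 hardware $/unit at U=1) / (3090 − B70 energy_per_unit_kWh)
     → US 0.15 → 28.6%  |  DE 0.30 → 14.3%  |  AR 0.10 → 43%
  B70 is cheaper all-in for U > U*.
  SDXL: B70 wins both terms → no positive-U crossover → unconditional win.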

Sensitivity. The energy term, and with it the entire B70 LLM-serving advantage, scales linearly with electricity price: cheaper power (Argentina) pushes the LLM crossover up to 43% utilization, while expensive power (Germany) drops it to 14.3%. Utilization U scales only the hardware term, never the energy term, so every result is most favorable to the cheaper-energy card (Arc) at high utilization. A production fleet lives at high U, the regime where Arc wins. For SDXL the sign is insensitive (the B70 always wins); only the magnitude changes.
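The model and its crossover can be recomputed in a few lines. The following is a minimal sketch, not the code used to produce the paper's tables: the class and function names are illustrative assumptions, and the comparator is assumed to be the used 3090 ($875), since that is the only listed price at which the B70 loses the hardware term. Energy per token is taken from the tok/J row of Appendix B; with those inputs the constant comes out at about 0.0429, whereas the rounded ~343 W / ~219 W averages give roughly 0.040.

```python
from dataclasses import dataclass
from typing import Optional

HOURS_LIFE = 26_280        # 3-year straight-line life (Appendix A)
JOULES_PER_KWH = 3.6e6


@dataclass(frozen=True)
class Card:
    """One (card, workload) cell. Field names are illustrative, not the paper's."""
    name: str
    capital_usd: float
    units_per_hour: float    # throughput T
    joules_per_unit: float   # measured energy per unit of work (equals P / T)

    def hardware_term(self, utilization: float) -> float:
        # capital / (T x hours_life x U)
        return self.capital_usd / (self.units_per_hour * HOURS_LIFE * utilization)

    def energy_term(self, price_kwh: float) -> float:
        # energy_per_unit_kWh x price_kWh
        return self.joules_per_unit / JOULES_PER_KWH * price_kwh

    def all_in(self, utilization: float, price_kwh: float) -> float:
        return self.hardware_term(utilization) + self.energy_term(price_kwh)


def crossover_utilization(efficient: Card, cheap: Card,
                          price_kwh: float) -> Optional[float]:
    """Utilization above which `efficient` is cheaper all-in than `cheap`.

    Returns None when there is no positive crossover, i.e. one card wins
    (or ties) both the hardware term and the energy term.
    """
    d_hw = efficient.hardware_term(1.0) - cheap.hardware_term(1.0)
    d_energy = cheap.energy_term(price_kwh) - efficient.energy_term(price_kwh)
    if d_hw <= 0 or d_energy <= 0:
        return None
    return d_hw / d_energy


# LLM serving, C=16: tok/s converted to tok/hour, tok/J inverted to J/tok.
RTX3090_USED_LLM = Card("3090-used", 875, 582.0 * 3600, 1 / 1.723)
B70_LLM = Card("B70", 949, 555.3 * 3600, 1 / 2.511)

# SDXL: img/min converted to img/hour, energy per image in joules.
RTX3090_USED_SDXL = Card("3090-used", 875, 7.3 * 60, 3240.0)
B70_SDXL = Card("B70", 949, 8.2 * 60, 1621.0)

if __name__ == "__main__":
    for region, price in (("US", 0.15), ("AR", 0.10), ("DE", 0.30)):
        u_star = crossover_utilization(B70_LLM, RTX3090_USED_LLM, price)
        print(f"LLM crossover, {region} @ ${price}/kWh: {u_star:.1%}")
    print("SDXL crossover:",
          crossover_utilization(B70_SDXL, RTX3090_USED_SDXL, 0.15))
```

Keeping the hardware and energy terms as separate methods mirrors the two lines of the formula, which makes it easy to see that U appears only in the first.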

Appendix B — Consolidated raw-data table. Every measured cell, so a reviewer can recompute J/unit, perf/watt, and the economics independently. All cells measured 2026-06-21, bf16/eager, single dedicated GPU. "—" = not run / n/a; "OOM" = did not fit in VRAM.

Workload (unit)	Metric	RTX 3090	Arc B70	Arc B50
LLM serving C=1 (tok/s)	throughput	37.8 ± 3.4	36.6 ± 0.1	—
LLM serving C=8 (tok/s)	throughput	296.7 ± 6.9	287.0 ± 0.5	—
LLM serving C=16 (tok/s)	throughput	582.0 ± 19.6	555.3 ± 2.1	OOM
LLM serving C=16	tok/J	1.723	2.511	—
LLM serving (under load)	avg power (W)	~343	~219	—
SDXL (per image)	latency (s)	8.21 ± 0.09	7.33 ± 0.03	18.74
SDXL (per image)	throughput (img/min)	7.3	8.2	3.2
SDXL (per image)	energy/image (J)	3,240 ± 37	1,621 ± 6	1,291
SDXL	avg power (W)	~394	~221	~69
LoRA-fp16 7B (per step)	steps/s · tok/s	1.90 · 1,944	1.96 · 2,007	OOM
LoRA-fp16 7B	energy/step (J)	180.8	110.1	OOM
LoRA-fp16 7B	avg power (W)	~347	~222	OOM
QLoRA-NF4 7B (per step)	steps/s · tok/s	0.889 · 911	0.949 · 972	0.373
QLoRA-NF4 7B	energy/step (J)	459.0	242.3	187
QLoRA-NF4 7B	avg power (W)	~408	~230	~70
QLoRA-NF4 7B	VRAM 4-bit (GiB)	5.45	5.45	5.45
SDXL UNet-LoRA (per step)	throughput (steps/s)	1.625	0.930	—
SDXL UNet-LoRA	energy/step (J)	208	161	—
SDXL UNet-LoRA	avg power (W)	~346	~152	—
TabNet 16k×64 (rows/s)	throughput	16,852	7,739	3,700
TabNet	avg power (W)	~108	~87	~39
TabNet	rows/J	159	89	—
AWQ-4bit C=1 (tok/s)	throughput	40.8	48.9	—
AWQ-4bit C=8 (tok/s)	throughput	302.9	381.8	—
AWQ-4bit C=16 (tok/s)	throughput	598.0	709.5	—
Qwen2.5-14B bf16	weight load	OOM (24 GB)	loads (31.1 GiB)	—

Energy method per cell: Intel = xe sysfs exact counter (all cells); NVIDIA = NVML exact counter for QLoRA, nvidia-smi power.draw @200 ms integral for LLM/SDXL (asymmetry flagged in Appendix D).
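As a worked check of the table, the perf-per-watt gains can be recomputed directly from the energy-per-unit rows. This is a small sketch with illustrative names (not the paper's analysis code); it relies only on the identity that work per joule is the inverse of joules per unit, so the B70's gain over the 3090 is the ratio of the two energy figures.

```python
# (RTX 3090, Arc B70) energy per unit of work in joules, copied from Appendix B.
ENERGY_PER_UNIT_J = {
    "SDXL image": (3240.0, 1621.0),
    "LoRA-fp16 7B step": (180.8, 110.1),
    "QLoRA-NF4 7B step": (459.0, 242.3),
}
# (RTX 3090, Arc B70) tokens per joule for LLM serving at C=16.
LLM_C16_TOK_PER_J = (1.723, 2.511)


def perf_per_watt_gain(j_per_unit_3090: float, j_per_unit_b70: float) -> float:
    """B70 work-per-joule divided by 3090 work-per-joule."""
    return j_per_unit_3090 / j_per_unit_b70


def implied_avg_power_w(j_per_unit: float, units_per_s: float) -> float:
    """Average power implied by an energy-per-unit and a throughput cell."""
    return j_per_unit * units_per_s


GAINS = {name: perf_per_watt_gain(*pair)
         for name, pair in ENERGY_PER_UNIT_J.items()}
GAINS["LLM serving C=16"] = LLM_C16_TOK_PER_J[1] / LLM_C16_TOK_PER_J[0]

if __name__ == "__main__":
    for name, gain in sorted(GAINS.items(), key=lambda kv: kv[1]):
        print(f"{name:20s} {gain:.2f}x")
    # Cross-check one power cell: QLoRA on the 3090, 459.0 J/step at 0.889 steps/s.
    print(f"QLoRA 3090 implied power: {implied_avg_power_w(459.0, 0.889):.0f} W")
```

On these four cells the gains span the quoted 1.45–2.0× range, with LLM serving at the low end and SDXL at the high end.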

Appendix C — Reproduction & upstream report. Reproduce by pinning a single dedicated GPU per side, draining other models off it, bf16/eager throughout, identical prompts/inputs/lengths; LLM throughput from vLLM's vllm:generation_tokens_total delta over a MEASURE_START/END-bracketed window (ignore_eos to fix output length); energy from the exact accumulating counters where possible. The QLoRA cell requires the selector fix. The ready-to-file bug-report draft:

Ready-to-file bug report — bitsandbytes 4-bit QLoRA on Intel Arc B-series

Title: 4-bit QLoRA fails on Intel Arc B-series under ONEAPI_DEVICE_SELECTOR=level_zero:N — SYCL No device of requested type available (bnb's 4-bit op rejects the Level-Zero selector that Intel's own multi-GPU docs recommend).

Components: bitsandbytes (XPU 4-bit custom op) · Intel IPEX / llm-scaler multi-GPU documentation.

Environment: Arc Pro B70 (0xe223, 32 GB) and B50 (0xe212, 16 GB); torch 2.10.0+xpu; bitsandbytes 0.49.2; compute-runtime 26.05+; xe driver; Ubuntu 24.04 / kernel 6.17; image on intel/llm-scaler-vllm:0.14.0-b8.3.1 base.

Failure signature: With ONEAPI_DEVICE_SELECTOR=level_zero:0 (per Intel's multi-GPU pinning guidance), loading an NF4-quantized model and starting a QLoRA step makes bitsandbytes' native 4-bit op (torch.ops.bitsandbytes.quantize_4bit) throw the SYCL error "No device of requested type available". Note that torch.xpu.is_available() is True and get_device_properties() enumerates the GPU correctly; only bnb's separately-compiled SYCL kernel queue rejects the device.

Root cause: bnb's 4-bit SYCL queue does not resolve a device when the process is scoped to the Level-Zero backend selector. torch.xpu and bnb's SYCL runtime resolve devices through different paths; the Level-Zero-only scoping that satisfies torch starves bnb's queue.

Fix (one line of configuration): use the any-backend GPU selector and pin the card by affinity mask instead of by Level-Zero index:

# ❌ fails for bnb 4-bit:  ONEAPI_DEVICE_SELECTOR=level_zero:1
# ✅ works:
export ONEAPI_DEVICE_SELECTOR='*:gpu'
export ZE_AFFINITY_MASK=1        # pin to the desired card (0-based)

With this, 4-bit QLoRA trains end-to-end (verified genuinely 4-bit: NF4 7B = 5.45 GiB).

Affected scope: the entire Arc B-series (reproduced on both B70 and B50 with the identical fix). Critically, it affects Project Battlematrix (8×B60) and any multi-card Arc deployment, because per-card pinning is mandatory there and the standard level_zero:N guidance is exactly what triggers the failure.
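For multi-card hosts, the same two variables can be applied per worker process. The helper below is a hypothetical sketch (the function name and the launch command in the usage note are assumptions, not part of ColabHive or any Intel tooling); it only builds an environment dictionary using the selector and affinity values described in the fix above.

```python
import os
from typing import Dict, Mapping, Optional


def pinned_xpu_env(card_index: int,
                   base_env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment that pins a process to one Arc card.

    Uses the any-backend GPU selector plus the affinity mask, rather than
    level_zero:<idx>, which is the combination the bug report found to work
    with bitsandbytes 4-bit.
    """
    if card_index < 0:
        raise ValueError("card_index must be a 0-based, non-negative index")
    env = dict(os.environ if base_env is None else base_env)
    env["ONEAPI_DEVICE_SELECTOR"] = "*:gpu"     # overrides any level_zero:N value
    env["ZE_AFFINITY_MASK"] = str(card_index)   # 0-based card index
    return env
```

A launcher would pass the result to each child, for example `subprocess.Popen(["python", "train_qlora.py"], env=pinned_xpu_env(1))`, where `train_qlora.py` is a placeholder script name. Setting the variables in the child's environment, rather than inside the training script, is the conservative choice because they need to be in place before the SYCL runtime initializes.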

Requested doc change (Intel): note that workloads using bitsandbytes 4-bit must pin with ONEAPI_DEVICE_SELECTOR='*:gpu' + ZE_AFFINITY_MASK=<idx>, not level_zero:<idx>.

Requested fix (bitsandbytes): make the 4-bit SYCL kernel queue resolve a device under the Level-Zero backend selector, or emit an actionable error that points at the selector rather than the opaque "No device of requested type available".
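For the throughput side of the reproduction recipe in Appendix C, the counter-delta calculation can be sketched as follows. This is an illustrative sketch, not the harness used for the measurements: the function names are assumptions, and the metrics text is passed in as a string (scraped once at MEASURE_START and once at MEASURE_END) so that no particular HTTP client or endpoint layout is implied.

```python
def read_counter(metrics_text: str,
                 name: str = "vllm:generation_tokens_total") -> float:
    """Sum one counter across all label sets in Prometheus text exposition."""
    total = 0.0
    found = False
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if not (line.startswith(name + "{") or line.startswith(name + " ")):
            continue  # a different metric, or one that merely shares the prefix
        # The sample value is the first token after the label block, if any.
        tail = line[line.rfind("}") + 1:] if "}" in line else line[len(name):]
        total += float(tail.split()[0])
        found = True
    if not found:
        raise ValueError(f"counter {name!r} not found in metrics text")
    return total


def tokens_per_second(start_text: str, end_text: str, elapsed_s: float) -> float:
    """Throughput over the measurement window, from the counter delta."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return (read_counter(end_text) - read_counter(start_text)) / elapsed_s
```

Working from a counter delta rather than client-side token counts keeps the number tied to what the engine actually generated inside the bracketed window.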

Appendix D — Methodology, versions, provenance.

Images: Intel inference is inference-ipex:v0.7.24, built on the intel/llm-scaler-vllm:0.14.0-b8.3.1 base (torch 2.10.0+xpu, vLLM-XPU, compute-runtime 26.09). Intel training is training-ipex:v0.2.0 (llm-scaler base + peft, trl, bitsandbytes 0.49.2, pytorch-tabnet 4.1.0). NVIDIA uses inference-vllm:v2.0.8, inference-generative:v3.0.0, and training-transformers-cu121:v1.0.5.

Host: Ubuntu 24.04, kernel 6.17, xe driver, compute-runtime 26.05+, Resizable BAR on.

Energy: Intel = xe sysfs energy1_input (µJ, exact, per-card by PCI id); NVIDIA = exact NVML nvmlDeviceGetTotalEnergyConsumption for QLoRA, else the nvidia-smi power.draw @200 ms integral for LLM/SDXL. The two NVIDIA methods are not the same instrument, so the LLM/SDXL perf/watt margins carry a small extra uncertainty that the QLoRA cell does not. These readings are now collected fleet-wide by node-runtime 0.10.119 → node_power_samples.

Known confounds: engine-version skew (Intel vLLM-XPU 0.14.x vs NVIDIA stock vLLM cu121) and host-CPU mismatch (i7-10700 8c vs Xeon E5-2680 v4 28c, a live confound for the host-sync-bound TabNet cell).

Provenance: Intel does not publish dense BF16 TFLOPS for Arc Pro; the compute-economics table uses published INT8 TOPS only. Market-share, pricing, CUDA-EULA, and spec figures are from public sources (TechInsights/HPCwire, NVIDIA CUDA EULA, Intel datasheets/newsroom).
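The difference between the two energy instruments can be made concrete with a short sketch. The function names are illustrative assumptions, and the readings are passed in as plain numbers so that no sysfs path or NVML binding is implied: one function takes the difference of an accumulating microjoule counter (the energy1_input style), the other integrates sampled power over fixed 200 ms intervals with the trapezoidal rule (the power.draw style).

```python
from typing import Sequence


def energy_from_counter_j(start_uj: int, end_uj: int) -> float:
    """Energy in joules from two readings of an accumulating microjoule counter."""
    if end_uj < start_uj:
        raise ValueError("counter went backwards; readings are not comparable")
    return (end_uj - start_uj) / 1e6


def energy_from_power_samples_j(power_w: Sequence[float],
                                interval_s: float = 0.2) -> float:
    """Energy in joules from evenly spaced power samples (trapezoidal rule)."""
    if len(power_w) < 2:
        raise ValueError("need at least two samples to integrate")
    return sum((a + b) / 2.0 * interval_s
               for a, b in zip(power_w, power_w[1:]))
```

The counter delta is exact up to the counter's own resolution, while the sampled integral can miss any power excursion shorter than the sampling interval, which is one plausible source of the extra uncertainty noted for the LLM/SDXL cells.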

Appendix E — Sources (public). External market, pricing, licensing, and hardware-spec figures in §2 and §5 are from public sources; street prices and shipping specs are volatile and were current at the 2026-06-21 measurement date.

NVIDIA data-center GPU market share (~98%, 3.76M of 3.85M units, 2023) — TechInsights, reported via HPCwire.
NVIDIA CUDA End User License Agreement, §1.2 "Limitations" (item 8, on translating CUDA output to non-NVIDIA platforms) — NVIDIA.
ZLUDA (CUDA-on-non-NVIDIA) project takedown at AMD's request, August 2024 — project repository and press coverage.
Blackwell supply ("sold out ~12 months ahead," Oct 2024) and per-GPU pricing ($30,000–40,000, Jensen Huang) — NVIDIA management remarks and press.
Intel Arc Pro B-series (Battlemage), Project Battlematrix (8×B60 → 192 GB), and the llm-scaler software stack — Intel datasheets and newsroom.
bitsandbytes Arc / XPU 4-bit support — bitsandbytes release notes.
Native torch.xpu support timeline (PyTorch 2.5 onward; Battlemage maturing 2.6–2.7) and IPEX standalone EOL (~March 2026) — PyTorch and Intel documentation.
GPU specifications (VRAM, memory bandwidth, TBP, INT8 TOPS) for the RTX 3090 and Arc Pro B70/B60/B50 — vendor spec pages.

Full benchmarking study (~8 pages) covering the measurement methodology, the six-workload results with per-cell tables, full-precision VRAM headroom, an honest economics comparison, an ecosystem-maturity assessment, and a deployment roadmap — with every number and confound stated plainly.

Measured 2026-06-21 on production hardware (Intel Arc Pro B70 vs NVIDIA RTX 3090)