\documentclass[11pt]{article}

\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage[letterpaper,margin=1in]{geometry}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{booktabs}
\usepackage{array}
\usepackage{xcolor}
\usepackage{tikz}
\usepackage{pgfplots}
\usepackage{pgfplotstable}
\pgfplotsset{compat=1.16}
\usetikzlibrary{patterns}
\usepackage{fancyvrb}
\usepackage[most]{tcolorbox}
\usepackage[colorlinks=true,linkcolor=black,urlcolor=blue,citecolor=blue]{hyperref}
\usepackage{microtype}
\usepackage[htt]{hyphenat}  % allow long \texttt{} identifiers to break across lines
\usepackage{caption}
\captionsetup{font=small,labelfont=bf,skip=4pt}
\sloppy

% --- Brand / chart colors ---------------------------------------------------
\definecolor{nvgreen}{RGB}{118,185,0}    % RTX 3090
\definecolor{b70blue}{RGB}{0,150,220}    % Arc Pro B70
\definecolor{b50teal}{RGB}{0,180,160}    % Arc Pro B50
\definecolor{wingreen}{RGB}{34,139,34}   % "B70 wins" bars
\definecolor{losered}{RGB}{200,40,40}    % "B70 loses" bars
\definecolor{nv3090used}{RGB}{86,135,0}  % used 3090 shade
\definecolor{baselinegray}{RGB}{90,90,90}
\definecolor{boxbg}{RGB}{240,247,252}
\definecolor{boxframe}{RGB}{0,120,180}

% Column types for compact tables
\newcolumntype{L}[1]{>{\raggedright\arraybackslash}p{#1}}
\newcolumntype{C}[1]{>{\centering\arraybackslash}p{#1}}

% Shared pgfplots style for the in-column bar charts
\pgfplotsset{
  paperbar/.style={
    width=0.8\linewidth,
    height=6.4cm,
    ybar,
    bar width=10pt,
    ymajorgrids=true,
    grid style={gray!25},
    tick align=outside,
    tick label style={font=\footnotesize},
    label style={font=\footnotesize},
    legend style={font=\footnotesize,draw=none,fill=none},
    title style={font=\small\bfseries},
    axis line style={gray!60},
    enlarge x limits=0.18,
  },
}

\setlength{\parindent}{0pt}
\setlength{\parskip}{5pt}

\title{\textbf{Reducing NVIDIA Dependency with Intel Arc}\\[3pt]
\large The performance-per-watt second source for the affordable AI tier ---
measured performance, energy efficiency, and total cost of ownership of Intel
Arc Pro (Battlemage) for modern AI workloads on the ColabHive platform}

\author{Engineer Jos\'e Luis Minich}

\date{June 2026 \\[2pt] \small Working draft (v2 --- K=3 repeats on the
LLM-serving, SDXL, and LoRA/QLoRA headline cells; QLoRA-on-Battlemage now
working --- see \S8)}

\begin{document}
\maketitle
\begin{abstract}
\noindent
We benchmark an \textbf{Intel Arc Pro B70 (32\,GB, Battlemage)} head-to-head
against an \textbf{NVIDIA RTX 3090 (24\,GB, Ampere)} on the ColabHive
distributed inference/training platform, across six real AI workloads (LLM
serving, Stable Diffusion XL, and LoRA/QLoRA/TabNet fine-tuning) with
\emph{device-attributed} throughput and end-to-end energy instrumentation. On
the compute-bound workloads that dominate today's AI spend, the B70 delivers
\textbf{95--112\% of the 3090's throughput while drawing 56--64\% of its
power}, for \textbf{$\sim$1.45--2.0$\times$ better performance-per-watt}.
Diffusion is an Intel win outright ($\sim$12\% more throughput at half the
energy per image); 4-bit QLoRA --- initially mis-reported as broken --- in fact
\emph{runs} on Battlemage and is $\sim$7\% faster than the 3090 once given the
correct SYCL device selector, a fix we productionize and report upstream as a
contribution of this work. At 4-bit AWQ --- the precision operators actually
deploy --- the B70 \emph{leads} the 3090 by 19--26\% on LLM serving, flipping
the bf16 near-parity into a clean win. The honest weak spot is small-operator
workloads (TabNet: the 3090 is $\sim$2.2$\times$ faster). Economically, Intel
Arc Pro is $\sim$0.4$\times$ the \$/GB-VRAM of a \emph{new} 3090 and
$\sim$0.6--0.8$\times$ that of a \emph{used} one, with a structural per-watt
efficiency advantage the used market cannot erode that compounds into a
total-cost-of-ownership win on any real fleet. Where applicable we also report the \textbf{Arc Pro B50 (16\,GB, 70\,W)} as a third, lower-power efficiency data point. We present nine figures and a
full TCO model, and a candid assessment of the remaining, fast-closing
software-ecosystem gaps. This is a \emph{value-tier second-source} case --- not
a claim against frontier H100/B200 silicon.
\end{abstract}
\vspace{1.2em}

\section{Executive summary}

\begin{quote}
\textbf{Thesis.} For the affordable inference-and-finetune tier where modern-AI
value concentrates, Intel Arc Pro delivers NVIDIA-class throughput at
$\sim$1.5--2$\times$ the performance-per-watt and materially cheaper VRAM ---
the efficiency, value, and compute-sovereignty second source, not a
frontier-NVIDIA killer.
\end{quote}

We measured an \textbf{Intel Arc Pro B70 (32 GB, Battlemage)} head-to-head
against an \textbf{NVIDIA RTX 3090 (24 GB, Ampere)} on the ColabHive distributed
inference/training platform, across six real AI workloads (LLM serving, image
diffusion, and fine-tuning), with \textbf{device-attributed throughput and
end-to-end energy measurement}. The findings:

\begin{itemize}
\item \textbf{Parity-or-better on compute-bound modern AI.} On the workloads that
dominate today's AI spend --- LLM serving, \textbf{LoRA \emph{and} 4-bit QLoRA}
fine-tuning, and Stable Diffusion XL --- the B70 delivers \textbf{95--112\,\% of
the RTX 3090's throughput} while drawing \textbf{56--64\,\% of its power}, for
\textbf{$\sim$1.45--2.0$\times$ better performance-per-watt}. (LLM serving is the
one cell where the B70 is slightly \emph{behind} at bf16 --- 95--97\,\% --- but
at $\sim$1.5$\times$ the efficiency. And \textbf{quantization flips even that}:
at \textbf{4-bit AWQ, the precision operators actually deploy, the B70
\emph{leads} the 3090 by 19--26\,\%} on LLM serving, \S4.10. The two inference
cells carry K=3 repeats; the training cells are paired point estimates.)
\item \textbf{Diffusion is an Intel win outright.} SDXL image generation delivers
\textbf{$\sim$12\,\% more throughput} ($\approx$11\,\% lower latency per image) on
the B70 \emph{and} uses \textbf{half the energy per image} (2.00$\times$ img/J),
repeatably across K=3 runs ($\pm$1 sample-std).
\item \textbf{Structurally lower power --- the headline finding.} Across
\emph{every} workload measured, the B70 drew \textbf{44--80\,\% of the 3090's
power} --- there is no workload, fast or slow, on which it consumes more.
Perf/watt is not a one-off measurement artifact; it is a consistent, large-margin result across every workload we measured, and it is the axis on which the entire economic case (\S5) turns.
For an operator paying the power bill, that is the number that compounds.
\item \textbf{Production-integrated, not a lab demo.} This is not a one-off
benchmark on a borrowed card. Both vendors are served in production on the same
ColabHive orchestrator + per-node runtime; the energy telemetry that produced
every number in \S4 is now cabled fleet-wide (vendor-agnostic
\texttt{node\_power\_samples}, node-runtime 0.10.119), and the QLoRA-on-Arc fix
(\S4.7) is wired into a production training launcher (node-runtime 0.10.121).
The Arc stack ships, serves, and self-reports today.
\item \textbf{Cheaper VRAM --- strongly so vs a \emph{new} 3090, modestly so vs a
\emph{used} one.} Against the 3090's \$1,499 launch price, Intel Arc Pro is
\textbf{$\sim$0.4$\times$ the \$/GB of VRAM}; against a used 3090
($\sim$\$700--1,050, the realistic comparator for a 2020 part) the advantage
narrows to \textbf{$\sim$0.6--0.8$\times$} (\S5 now prices the B70 itself, not
only the cheaper B60/B50). Either way the B70 buys more VRAM \emph{per watt}, and
that headroom is \emph{enabling}: at full bf16 precision the 32 GB B70 serves a
14 B model that out-of-memories on the 24 GB 3090 (\S4.8) --- though a
\emph{quantized} 14 B does fit a 3090, so this is a
full-precision-\textbf{simplicity} advantage, not an absolute one.
\item \textbf{Honest limits.} The B70 is meaningfully slower on small-operator
workloads (TabNet deep-tabular: 3090 $\sim$2.2$\times$ faster). Classical
gradient-boosting (XGBoost/CatBoost) has no Intel-GPU backend and stays on CPU.
(4-bit QLoRA was \emph{initially} reported as non-functional; on re-investigation
it \textbf{runs on Battlemage} --- and is $\sim$7\,\% faster than the 3090 ---
once bitsandbytes is given the right SYCL device selector, \S4.7. It was a
one-line configuration issue, not a kernel gap.) The remaining limits are
software-ecosystem gaps with a clear, fast-moving upstream trajectory --- not
silicon limitations.
\end{itemize}

\textbf{Bottom line:} for the \textbf{affordable inference-and-finetune tier}
that ColabHive serves --- \emph{not} the frontier data-center segment that
H100/B200 dominate --- Intel Arc Pro is a credible, power- and cost-efficient
NVIDIA alternative \emph{exactly where modern-AI value concentrates for that
tier} (LLM + diffusion inference, LoRA/QLoRA fine-tuning). The headline economics
are strongest against new-NVIDIA pricing and more modest against the used market;
the perf/watt and full-precision-VRAM advantages hold regardless. The remaining
gaps are in the software stack and are closing release-over-release.

% ===================== FIGURE 1 =====================
\begin{figure}[t]
\centering
\begin{tikzpicture}
\begin{axis}[
  paperbar,
  bar width=9pt,
  title={LLM serving throughput (tok/s)},
  ylabel={tokens / second},
  symbolic x coords={C=1,C=8,C=16},
  xtick=data,
  ymin=0,
  enlarge x limits=0.28,
  legend pos=north west,
  nodes near coords align={vertical},
  error bars/y dir=both,
  error bars/y explicit,
]
\addplot[fill=nvgreen,draw=nvgreen!60!black,error bars/.cd,error bar style={black}]
  coordinates {(C=1,37.8) +- (0,3.4) (C=8,296.7) +- (0,6.9) (C=16,582.0) +- (0,19.6)};
\addplot[fill=b70blue,draw=b70blue!60!black,error bars/.cd,error bar style={black}]
  coordinates {(C=1,36.6) +- (0,0.1) (C=8,287.0) +- (0,0.5) (C=16,555.3) +- (0,2.1)};
\legend{RTX 3090, Arc Pro B70}
\end{axis}
\end{tikzpicture}
\caption{LLM serving throughput (Qwen2.5-7B, vLLM bf16/eager), K=3 mean $\pm$1
std. B70 holds 95--97\% of the 3090. B50 omitted: 7B bf16 OOMs its 16\,GB.}
\label{fig:f1}
\end{figure}

\section{Background \& motivation}

\subsection{The NVIDIA dependency}

The AI compute market is the most concentrated in modern computing. NVIDIA
shipped an estimated \textbf{98\,\% (3.76 M of 3.85 M units) of data-center GPUs
in 2023} (TechInsights, via HPCwire). The dependency is reinforced by
\textbf{CUDA lock-in}, which is contractual as well as technical: the
\textbf{CUDA EULA (\S1.2 ``Limitations,'' item 8) prohibits reverse-engineering
or translating the output of CUDA SDK elements to target a non-NVIDIA platform},
and the ZLUDA CUDA-on-other-hardware project was taken down at AMD's request in
August 2024. Supply and price compound the problem: Blackwell-generation
accelerators have been reported \textbf{sold out $\sim$12 months ahead} (NVIDIA
management, Oct 2024), with Jensen Huang quoting \textbf{\$30,000--40,000 per
GPU} (he later noted NVIDIA sells full systems, not bare chips, so per-GPU
pricing varies).

For an infrastructure layer whose mission is to give developers, startups, and
researchers --- particularly in LATAM --- access to \textbf{affordable,
available, power-efficient} compute, single-vendor dependency is the central
strategic risk. A credible second source is not optional; it is the thesis.

\subsection{Why Intel Arc}

Intel's Arc Pro B-series (``Battlemage'') and the \textbf{Project Battlematrix}
initiative (up to \textbf{8$\times$ Arc Pro B60 $\rightarrow$ 192 GB VRAM},
Intel's stated target \textbf{70 B+ parameter models}) target precisely the
AI-workstation / inference-server niche, with a containerized software stack
(\texttt{llm-scaler}: vLLM-XPU, ComfyUI, SGLang). The open questions for a
production operator are empirical: \emph{How close is real throughput? What is the
actual energy and cost profile? Where does the software ecosystem still hurt?}
This paper answers them with measured data from production-grade hardware.

\subsection{Scaling to Project Battlematrix (8$\times$B60 $\rightarrow$ 192 GB,
70 B+ on-node)}

The per-card economics in this paper are the \emph{unit} of a larger story.
Intel's \textbf{Project Battlematrix} packs \textbf{8$\times$ Arc Pro B60 into
192 GB of aggregate VRAM} in a single workstation/server chassis. That capacity
envelope is exactly what a \textbf{70 B-class model at full bf16} needs:
$\sim$140 GB of weights plus KV cache fit \textbf{on-node}, with no cross-host
sharding --- a class of model that today forces a multi-GPU NVIDIA box (typically
2--4$\times$ A100/H100-class cards) at a multiple of the power draw. The measured
single-card facts in \S4 are the building blocks of that claim: a B70 already
serves a 14 B at full bf16 on one card (\S4.8), and the per-card vLLM-XPU
efficiency we measured (95--97\,\% of 3090 LLM throughput at $\sim$1.5$\times$ the
tokens/joule, \S4.2) is the unit economics that, \emph{if it composes}, scales
into a 70 B-class node at a fraction of the NVIDIA multi-GPU power budget.

\textbf{Honest caveat --- and a concrete ask.} We have \textbf{not} measured
tensor-parallel (TP) scaling or inter-card bandwidth on Battlematrix-class
topology. Multi-card TP efficiency on Arc is genuinely unproven in our hands, and
we observe a \textbf{real multi-card TP fault on dual-B70 in the current
vLLM-XPU} stack (TP$>$1 does not yet run clean across two Arc cards in our
environment, \S6/\S8). So the ``8$\times$B60 serves 70 B on-node'' claim is a
\emph{forward projection from solid single-card data}, not a measured result ---
and the gap between the two is precisely the joint-validation work that a
Battlematrix unit in our lab would let us close. \textbf{Bottom line: the
single-card unit economics are proven and favorable; the only thing standing
between them and a power-efficient on-node 70 B is multi-card TP validation ---
a Battlematrix validation unit would let us close that gap jointly: measure TP scaling, fix the TP path the way we fixed QLoRA, and report the results back with the same rigor.}

\section{Platform \& methodology}

\subsection{ColabHive}

ColabHive is a distributed inference/training platform (orchestrator + per-node
runtime + vendor-aware dispatch). It already serves Intel Arc GPUs in production
via an \texttt{inference-ipex} image built on \textbf{Intel's own
\texttt{intel/llm-scaler-vllm:0.14.0-b8.3.1}} base (vLLM-XPU,
\texttt{torch 2.10.0+xpu}, compute-runtime 26.09). The platform's energy
telemetry (described below) was extended for this study and is now cabled
end-to-end.

\subsection{Hardware under test}

\begin{table}[h]
\centering
\small
\begin{tabular}{@{}L{1.8cm} L{1.6cm} L{4.6cm} L{2.8cm} L{3.4cm}@{}}
\toprule
\textbf{Role} & \textbf{Node} & \textbf{GPU} & \textbf{VRAM} & \textbf{Host CPU} \\
\midrule
Intel & IAC001 & \textbf{Arc Pro B70} (\texttt{0xe223}, BMG-G31) &
\textbf{32 GB} GDDR6-ECC & i7-10700 (8c) \\
Intel (2nd) & IAC001 & Arc Pro B50 (\texttt{0xe212}) & 16 GB & --- \\
NVIDIA & AUR001 & RTX 3090 (GA102) & 24 GB GDDR6X & Xeon E5-2680 v4 (28c) \\
\bottomrule
\end{tabular}
\caption{Hardware under test. The B70's 32\,GB was confirmed (32{,}656\,MB) via
\texttt{torch.xpu}.}
\label{tab:hw}
\end{table}

Host: Ubuntu 24.04, kernel 6.17, \textbf{\texttt{xe}} driver, compute-runtime
26.05+, Resizable BAR enabled. All GPU benchmarks pin a \textbf{single dedicated
GPU} per side; other models were drained off the target GPU to avoid contention.

\subsection{How we measured (and how we avoided measuring the wrong thing)}

A core methodological commitment of this study is \textbf{device-attributed,
isolated, energy-instrumented} measurement. Three pitfalls we explicitly avoided:

\begin{enumerate}
\item \textbf{CPU-vs-GPU contamination.} On a GPU node, much inference actually
executes on the CPU (specialists, tools, fallbacks). Aggregate platform telemetry
therefore mixes devices and models and is \emph{not} a valid GPU comparison.
Every number in \S4 comes from a \textbf{controlled run of an identical workload
on a known, dedicated GPU}, verified to be GPU-resident (vLLM \texttt{/metrics}
generation counters, GPU utilization, energy draw).
\item \textbf{Throughput ground-truth.} LLM throughput is taken from vLLM's own
\texttt{vllm:generation\_tokens\_total} counter (delta over the measured window),
not client-side estimates. Output length is pinned with \texttt{ignore\_eos} so
both vendors process \emph{exactly} identical work.
\item \textbf{Energy attribution.} Energy is measured at the device:
\begin{itemize}
\item \textbf{Intel:} the \texttt{xe} driver's sysfs accumulating energy counter
\texttt{.../hwmon/.../energy1\_input} (microjoules, monotonic) --- discovered
per-card by PCI device id, exact by construction.
\item \textbf{NVIDIA:} where available, the exact NVML accumulating energy counter
\texttt{nvmlDeviceGetTotalEnergyConsumption} (mJ, monotonic) --- the symmetric
analog to Intel's Xe counter --- is the \textbf{primary} figure (used for QLoRA,
\S4.7). For the LLM-serving and SDXL cells, energy is the integral of
\texttt{nvidia-smi power.draw} at 200 ms; trapezoidal integration of instantaneous
samples can bias slightly versus the exact counter --- a measurement asymmetry we
flag in \S8.
\item The energy window is bracketed by \texttt{MEASURE\_START}/\texttt{MEASURE\_END}
timestamps emitted by the workload itself (host and container share the kernel
clock), so energy is attributed to the \emph{measured phase only} (warmup and
model-load excluded).
\end{itemize}
\end{enumerate}

\textbf{Fairness controls:} identical model, identical prompts/inputs,
\textbf{bf16 + eager} on both vendors (no vendor got fp8 or a compiled fast-path
the other lacked), identical \texttt{max\_model\_len}, batch, and step counts.

\textbf{Production cabling:} the same energy sources are now collected
automatically by the node runtime (\texttt{gpu\_manager.collect\_power\_energy()}
$\rightarrow$ heartbeat $\rightarrow$ \texttt{node\_power\_samples} table; NVML on
NVIDIA, Xe sysfs on Intel, RAPL best-effort on CPU; vendor-agnostic, shipped in
node-runtime 0.10.119). Validated live across the fleet (both vendors reporting).
So perf/watt is no longer a one-off measurement --- it is platform telemetry going
forward.

\textbf{Caveat (stated up front):} the two headline \textbf{inference} cells (LLM
serving \S4.2, SDXL \S4.3) are \textbf{K=3 repeats with $\pm$1 sample-std (n=3)}.
The training cells (LoRA \S4.4, QLoRA \S4.7) are paired measured runs reported as
\textbf{point estimates} (repeated, but we do not claim tight CIs on the short
50-step benchmark). SDXL-UNet-LoRA (\S4.5) and TabNet (\S4.6) are single runs.
\textbf{Important sampling caveat:} ``K=3'' controls \emph{run-to-run} noise on
the \emph{same two physical cards} --- it does \textbf{not} control
silicon-lottery / board-partner / thermal variation (N=1 hardware per vendor); a
second device and a second model size are the next steps (\S8).

\section{Results}

All figures measured 2026-06-21. ``B70 \%'' = B70 throughput $\div$ 3090
throughput. ``perf/watt'' = B70 $\div$ 3090 on the workload's efficiency metric
(tokens/J, img/J, steps$\cdot$rows per J).

\subsection{Summary matrix}

\begin{table}[h]
\centering
\small
\begin{tabular}{@{}L{3.6cm} L{1.8cm} L{2.8cm} L{3.0cm}@{}}
\toprule
\textbf{Workload} & \textbf{Type} & \textbf{B70 thr.\ vs 3090} &
\textbf{B70 perf/W vs 3090} \\
\midrule
\textbf{SDXL image gen.} & infer. & \textbf{112\,\%} (n=3) &
\textbf{2.00$\times$} \\
\textbf{LoRA FT (LLM 7B)} & train & \textbf{$\sim$103\,\%} &
\textbf{1.64$\times$} \\
\textbf{LLM serving (vLLM)} & infer. & \textbf{95--97\,\%} (n=3) &
\textbf{1.46--1.65$\times$} \\
\textbf{SDXL UNet LoRA} & train & 57\,\% & \textbf{1.29$\times$} \\
\textbf{TabNet (deep-tab.)} & train & 46\,\% & 0.56$\times$ \\
\textbf{QLoRA 4-bit (7B)} & train & \textbf{107\,\%} & \textbf{1.89$\times$} \\
\bottomrule
\end{tabular}
\caption{Summary matrix. Verdicts: SDXL --- Intel wins both; LoRA ---
$\sim$parity + efficiency; LLM --- near-parity + efficiency; UNet-LoRA --- slower
but better perf/W; TabNet --- 3090 win (Intel weak spot); QLoRA --- Intel win
(selector fix) + faster.}
\label{tab:summary}
\end{table}

\textbf{Structural finding:} the B70 drew \textbf{44--80\,\% of the 3090's power}
on \emph{every} workload that ran on both.

% ===================== FIGURE 2 =====================
\begin{figure}[t]
\centering
\begin{tikzpicture}
\begin{axis}[
  paperbar,
  title={Perf/watt relative to 3090 (B70 and B50)},
  ylabel={ratio (1.0 = parity)},
  symbolic x coords={SDXL infer,QLoRA,LoRA,LLM c=1,LLM c=16,UNet-LoRA,TabNet},
  xtick=data,
  x tick label style={rotate=40,anchor=east,font=\scriptsize},
  ymin=0,ymax=2.8,
  bar width=8pt,
  enlarge x limits=0.10,
  legend pos=north east,
]
\addplot[fill=b70blue,draw=b70blue!60!black] coordinates
  {(SDXL infer,2.00) (QLoRA,1.89) (LoRA,1.64) (LLM c=1,1.65) (LLM c=16,1.46) (UNet-LoRA,1.29) (TabNet,0.56)};
\addplot[fill=b50teal,draw=b50teal!60!black] coordinates
  {(SDXL infer,2.51) (QLoRA,2.45) (TabNet,0.60)};
\legend{Arc Pro B70, Arc Pro B50}
\draw[baselinegray,dashed,thick] (axis cs:SDXL infer,1.0) -- (axis cs:TabNet,1.0);
\end{axis}
\end{tikzpicture}
\caption{Performance-per-watt relative to the RTX 3090 (1.0 = parity). Above the
dashed line, the Intel part does more work per joule. The B50 (70\,W card)
posts the best perf/watt of all three on the workloads it can run; it has no
LoRA/LLM bar (those OOM its 16\,GB).}
\label{fig:f2}
\end{figure}

\subsection{LLM serving --- Qwen2.5-7B-Instruct, vLLM (bf16, eager, 4096 ctx, 128
output tokens) --- K=3}

Throughput is the engine-side \texttt{vllm:generation\_tokens\_total} counter
$\div$ wall time; mean $\pm$ sample-std over \textbf{3 repeats} per concurrency.
Figure~\ref{fig:f1} plots the grouped throughput; Table~\ref{tab:llm} gives the
full numbers including energy.

\begin{table}[h]
\centering
\small
\begin{tabular}{@{}C{1.8cm} C{2.8cm} C{2.8cm} C{1.6cm} C{1.8cm}@{}}
\toprule
\textbf{Conc.} & \textbf{3090 tok/s} & \textbf{B70 tok/s} &
\textbf{B70 \%} & \textbf{B70 p/W} \\
\midrule
1 & 37.8 $\pm$ 3.4 & \textbf{36.6 $\pm$ 0.1} & 97\,\% & \textbf{1.65$\times$} \\
8 & 296.7 $\pm$ 6.9 & \textbf{287.0 $\pm$ 0.5} & 97\,\% & \textbf{1.53$\times$} \\
16 & 582.0 $\pm$ 19.6 & \textbf{555.3 $\pm$ 2.1} & 95\,\% & \textbf{1.46$\times$} \\
\bottomrule
\end{tabular}
\caption{LLM serving, K=3 mean $\pm$1 std. 3090 tok/J: 0.110/0.876/1.723; B70
tok/J: 0.182/1.344/2.511.}
\label{tab:llm}
\end{table}

Under load: \textbf{3090 $\approx$ 343 W, B70 $\approx$ 219 W ($\sim$64\,\%)}.
Across the sweep the \textbf{B70 holds 95--97\,\% of the 3090's throughput at
1.46--1.65$\times$ the tokens/joule}, with B70 p50 latency $\sim$5--10\,\% higher
at the top of the range. Two things stand out in the K=3 data: (1) the
\textbf{B70 is markedly more consistent run-to-run} (std $<$ 1\,\%) than the 3090,
whose single-GPU throughput varied 5--9\,\% across back-to-back sweeps (a mild
thermal droop at C=1, where its CI 37.8 $\pm$ 3.4 fully contains the B70's 36.6);
and (2) the 3090's higher variance is partly mild thermal droop under
back-to-back load --- which cuts \emph{against} us as much as for us: the C=1
``near-parity'' could reflect the \emph{3090 underperforming} as much as the B70
reaching it, so we lean on the higher-concurrency cells (C=8/16, a clean
95--97\,\%) for the parity read and treat C=1 as noisy. The 32 GB B70 on a mature
vLLM-XPU stack reaches \textbf{near-parity with the 3090} (within 3--5\,\%,
slightly behind) --- well ahead of the weaker LLM positioning the smaller,
bandwidth-limited B-series cards (e.g.\ the B50/B60) show in early consumer
reviews.

% ===================== FIGURE 3 =====================
\begin{figure}[t]
\centering
\begin{tikzpicture}
\begin{axis}[
  paperbar,
  title={Energy per unit work: B70 as \% of 3090},
  ylabel={\% of 3090 (lower better)},
  symbolic x coords={SDXL/img,QLoRA/step,LoRA/step,LLM/token},
  xtick=data,
  x tick label style={rotate=30,anchor=east,font=\scriptsize},
  ymin=0,ymax=115,
  bar width=14pt,
  nodes near coords,
  nodes near coords style={font=\scriptsize},
  every node near coord/.append style={/pgf/number format/precision=1,/pgf/number format/fixed},
]
\addplot[fill=b70blue,draw=b70blue!60!black] coordinates
  {(SDXL/img,50.0) (QLoRA/step,52.8) (LoRA/step,60.8) (LLM/token,68.5)};
\draw[baselinegray,dashed,thick] (axis cs:SDXL/img,100) -- (axis cs:LLM/token,100);
\end{axis}
\end{tikzpicture}
\caption{B70 energy per unit of work as a fraction of the 3090's. The B70 uses
roughly half to two-thirds the energy.}
\label{fig:f3}
\end{figure}

\subsection{SDXL image generation --- 1024$\times$1024, 30 steps (8 images/run
$\times$ 3 runs) --- K=3}

\begin{table}[h]
\centering
\small
\begin{tabular}{@{}L{4.0cm} C{3.4cm} C{3.4cm}@{}}
\toprule
& \textbf{3090 (n=3)} & \textbf{B70 (n=3)} \\
\midrule
Latency / image & 8.21 $\pm$ 0.09 s & \textbf{7.33 $\pm$ 0.03 s} \\
Throughput & 7.3 img/min & \textbf{8.2 img/min} \\
Energy / image & 3,240 $\pm$ 37 J & \textbf{1,621 $\pm$ 6 J} \\
Avg power & $\sim$394 W & $\sim$221 W \\
\bottomrule
\end{tabular}
\caption{SDXL inference, K=3 mean $\pm$1 std.}
\label{tab:sdxl}
\end{table}

\textbf{The B70 delivers $\sim$12\,\% more throughput ($\approx$11\,\% lower
latency per image) and uses 2.00$\times$ less energy per image}, repeatably
across 3 runs (B70 7.33 $\pm$ 0.03 vs 3090 8.21 $\pm$ 0.09 s/image; $\pm$1
sample-std, n=3, clearly separated) --- the flagship result is not a single-run
artifact. Diffusion is compute-bound with large, regular kernels --- Battlemage's
FP16 throughput shines here. (Independent SDXL reviews generally place Arc
\emph{behind} comparable NVIDIA on raw throughput while ahead on perf/\$ and
perf/W; the B70's outright throughput win on this optimized serving path is
therefore a notable, platform-specific result.)

% ===================== FIGURE 4 =====================
\begin{figure}[t]
\centering
\begin{tikzpicture}
\begin{axis}[
  paperbar,
  title={Throughput as \% of 3090 (B70 and B50)},
  ylabel={\% of 3090},
  symbolic x coords={SDXL,QLoRA,LoRA,LLM c=8,LLM c=16,UNet-LoRA,TabNet},
  xtick=data,
  x tick label style={rotate=40,anchor=east,font=\scriptsize},
  ymin=0,ymax=130,
  bar width=8pt,
  enlarge x limits=0.10,
  legend pos=north east,
]
\addplot[fill=b70blue,draw=b70blue!60!black] coordinates
  {(SDXL,112) (QLoRA,107) (LoRA,103) (LLM c=8,97) (LLM c=16,95) (UNet-LoRA,57) (TabNet,46)};
\addplot[fill=b50teal,draw=b50teal!60!black] coordinates
  {(SDXL,43.8) (QLoRA,42.0) (TabNet,22.0)};
\legend{Arc Pro B70, Arc Pro B50}
\draw[baselinegray,dashed,thick] (axis cs:SDXL,100) -- (axis cs:TabNet,100);
\end{axis}
\end{tikzpicture}
\caption{Throughput as a percentage of the RTX 3090. The B70 wins diffusion and
QLoRA and trails on small-operator workloads; the slower, low-power B50 runs at
$\sim$22--44\,\% of the 3090 (no LoRA/LLM bar --- those OOM its 16\,GB).}
\label{fig:f4}
\end{figure}

\subsection{LoRA fine-tuning --- Qwen2.5-7B, r=16 (q/k/v/o), seq 512, batch 2, 50
steps (bf16)}

\begin{table}[h]
\centering
\small
\begin{tabular}{@{}L{4.6cm} C{3.6cm} C{3.6cm}@{}}
\toprule
& \textbf{3090} & \textbf{B70} \\
\midrule
Throughput & 1.90 st/s, 1,944 tok/s &
\textbf{1.96 st/s, 2,007 tok/s} \\
Energy / step & 180.8 J & \textbf{110.1 J} \\
Avg power & $\sim$347 W & $\sim$222 W \\
\bottomrule
\end{tabular}
\caption{LoRA fine-tuning (point estimates from repeated runs).}
\label{tab:lora}
\end{table}

\textbf{LoRA training: the B70 is $\sim$3\,\% faster with 1.64$\times$ better
perf/watt.} These are point estimates from repeated runs; we do \emph{not} claim
a tight confidence interval here --- run-to-run scatter on this short 50-step
benchmark is a few percent on both sides, so the honest read is ``roughly parity
on throughput, $\sim$1.6$\times$ the efficiency.'' LoRA-fp16 is a reliable
LLM-finetuning path on Intel (see \S6).

\subsection{SDXL UNet LoRA fine-tuning --- 1024px latents, batch 2, 30 steps
(bf16, eager)}

\begin{table}[h]
\centering
\small
\begin{tabular}{@{}L{4.6cm} C{3.6cm} C{3.6cm}@{}}
\toprule
& \textbf{3090} & \textbf{B70} \\
\midrule
Throughput & \textbf{1.625 steps/s} & 0.930 steps/s \\
Energy / step & 208 J & \textbf{161 J} \\
Avg power & $\sim$346 W & $\sim$152 W \\
\bottomrule
\end{tabular}
\caption{SDXL UNet-LoRA fine-tuning (single run).}
\label{tab:unetlora}
\end{table}

Here the 3090 is \textbf{1.75$\times$ faster} (its CUDA backward path on raw eager
UNet training is more optimized). In operator terms that is a real cost --- a
UNet-LoRA job runs at $\sim$57\,\% of the work-rate, i.e.\ takes $\sim$1.75$\times$
longer (fewer jobs/hour/card) --- so \textbf{perf/watt (1.29$\times$, drawing only
44\,\% of the power) is the consolation here, not a throughput win.} (Contrast
with SDXL \emph{inference} in \S4.3, where the optimized serving path let the B70
win throughput outright --- inference and raw-eager training are different compute
patterns.)

% ===================== FIGURE 5 =====================
\begin{figure}[t]
\centering
\begin{tikzpicture}
\begin{axis}[
  paperbar,
  title={Average board power under load (W)},
  ylabel={Watts},
  symbolic x coords={LLM,SDXL,LoRA,QLoRA,UNet-LoRA,TabNet},
  xtick=data,
  x tick label style={rotate=35,anchor=east,font=\scriptsize},
  ymin=0,ymax=460,
  bar width=6pt,
  legend pos=north east,
  enlarge x limits=0.12,
]
\addplot[fill=nvgreen,draw=nvgreen!60!black] coordinates
  {(LLM,343) (SDXL,394) (LoRA,347) (QLoRA,408) (UNet-LoRA,346) (TabNet,108)};
\addplot[fill=b70blue,draw=b70blue!60!black] coordinates
  {(LLM,219) (SDXL,221) (LoRA,222) (QLoRA,230) (UNet-LoRA,152) (TabNet,87)};
\addplot[fill=b50teal,draw=b50teal!60!black] coordinates
  {(SDXL,69) (QLoRA,70) (TabNet,39)};
\legend{RTX 3090, Arc Pro B70, Arc Pro B50}
\end{axis}
\end{tikzpicture}
\caption{Average board power. The B70 draws 44--80\% of the 3090's power on every
workload measured; the 70\,W-class B50 draws only 39--70\,W (no LoRA/LLM bar ---
those OOM its 16\,GB) --- a consistent efficiency advantage of the Intel parts across our measured workloads.}
\label{fig:f5}
\end{figure}

\subsection{TabNet (deep-tabular) --- 16k$\times$64 synthetic, 20 epochs --- the
honest weak spot}

\begin{table}[h]
\centering
\small
\begin{tabular}{@{}L{4.6cm} C{3.6cm} C{3.6cm}@{}}
\toprule
& \textbf{3090} & \textbf{B70} \\
\midrule
Throughput & \textbf{16,852 rows/s} & 7,739 rows/s \\
Avg power & $\sim$108 W & $\sim$87 W \\
Efficiency & \textbf{159 rows/J} & 89 rows/J \\
\bottomrule
\end{tabular}
\caption{TabNet deep-tabular (single run) --- the honest weak spot.}
\label{tab:tabnet}
\end{table}

\textbf{The 3090 wins both axes (2.2$\times$ throughput, 1.8$\times$ perf/watt).}
TabNet is many small operators with frequent host-device synchronization; the XPU
does not saturate (87 W indicates the GPU is starved), and per-kernel/eager
overhead dominates --- exactly the pattern where a mature CUDA stack pulls ahead.
This is a real limitation for small-op workloads, stated plainly.

\subsection{QLoRA 4-bit --- runs on Battlemage with a device-selector fix (and
beats the 3090)}

An earlier pass reported 4-bit QLoRA as \emph{failing} on Battlemage; on
re-investigation that was a \textbf{misdiagnosis}. The failure is \textbf{not} a
missing Triton-XPU kernel. bitsandbytes 0.49.2 dispatches 4-bit quantization
through a native custom op (\texttt{torch.ops.bitsandbytes.quantize\_4bit}), and
under the platform's device-pinning convention
\texttt{ONEAPI\_DEVICE\_SELECTOR=level\_zero:0} that op throws a SYCL
\textbf{\texttt{No device of requested type available}} --- bnb's own kernel queue
rejects the Level-Zero device selector (torch.xpu sees the GPU fine; bnb's
separately-compiled SYCL queue does not). Setting
\textbf{\texttt{ONEAPI\_DEVICE\_SELECTOR=*:gpu}} (any-backend GPU) lets the kernel
resolve a device, and \textbf{4-bit QLoRA then trains end-to-end on the B70}. It
is genuinely 4-bit: the quantized 7B occupies \textbf{5.45 GiB on both vendors}
(bf16 would be $\sim$15 GiB), and the $\sim$2.2$\times$ slowdown vs LoRA-fp16
(\S4.4) is exactly the expected 4-bit dequant tax.

\begin{table}[h]
\centering
\small
\begin{tabular}{@{}L{4.6cm} C{3.6cm} C{3.6cm}@{}}
\toprule
\textbf{QLoRA-NF4} & \textbf{3090} & \textbf{B70} \\
\midrule
Throughput & 0.889 st/s, 911 tok/s &
\textbf{0.949 st/s, 972 tok/s} \\
Energy / step & 459.0 J & \textbf{242.3 J} \\
Avg power & $\sim$408 W & $\sim$230 W \\
VRAM (4-bit) & 5.45 GiB & 5.45 GiB \\
\bottomrule
\end{tabular}
\caption{QLoRA-NF4 (Qwen2.5-7B, r=16, B=2, S=512, 50 steps); exact energy
counters both sides.}
\label{tab:qlora}
\end{table}

\textbf{The B70 is $\sim$7\,\% faster at QLoRA and 1.89$\times$ more
energy-efficient} --- the same pattern as LoRA-fp16, measured in one paired
session with exact energy counters (Xe sysfs on Intel, NVML total-energy on
NVIDIA). This removes a headline limitation: the reliable LLM-finetuning path on
Intel is \textbf{both LoRA-fp16 and 4-bit QLoRA}. The remaining work is purely a
platform-integration task --- wiring the \texttt{*:gpu} selector into the Intel
training-launch path for bnb-4bit workloads (the node runtime currently pins
\texttt{level\_zero:<idx>}); see \S6.

\textbf{Methodological note (we own this):} a result we shipped as a
\emph{hardware limitation} turned out to be a one-line configuration issue. That
updates our prior --- we now treat a single negative-for-Intel result as
\textbf{provisional pending a configuration audit}. By that bar the TabNet weak
spot (\S4.6) is \emph{not yet} audited and should be read as ``unaudited,
plausibly improvable'' rather than a settled silicon limit.

% ===================== BOXED CONTRIBUTION =====================
\begin{tcolorbox}[breakable,colback=boxbg,colframe=boxframe,boxrule=1pt,
  arc=2pt,left=8pt,right=8pt,top=6pt,bottom=6pt,
  title={\textbf{Contribution: unblocking 4-bit QLoRA across the Arc B-series}}]

This finding is significant enough to name as a \textbf{contribution of this
work}, not bury as a \S4.7 footnote.

\textbf{The conventional wisdom is wrong.} The public consensus in 2026 ---
repeated across forums, issue threads, and ``does it run on Intel?'' guides ---
is that \textbf{``bitsandbytes / 4-bit QLoRA does not run on Intel Arc.''}
Meanwhile, bitsandbytes' own release notes \textbf{officially list XPU support}.
Both cannot be fully true, and the gap between them is where operators get stuck
and conclude (as we initially did) that the hardware can't do it.

\textbf{The trap is a documentation collision.} Intel's \emph{own} multi-GPU
guidance recommends pinning devices with
\textbf{\texttt{ONEAPI\_DEVICE\_SELECTOR=level\_zero:N}} --- and that is exactly
the selector under which bitsandbytes' separately-compiled SYCL 4-bit op throws
\textbf{\texttt{No device of requested type available}}. So an operator who
follows Intel's multi-GPU docs \emph{to the letter} and installs the
XPU-supporting bitsandbytes will hit a hard failure that \emph{looks} like a
missing kernel --- when in fact \texttt{torch.xpu} sees the GPU fine and only
bnb's SYCL queue rejects the Level-Zero selector. Two correct-in-isolation pieces
of Intel-recommended configuration combine into a failure mode that looks like a missing kernel.

\textbf{Our root-cause and fix.} Set
\textbf{\texttt{ONEAPI\_DEVICE\_SELECTOR=*:gpu}} (any-backend GPU, so bnb's
kernel queue can resolve a device) and pin the specific card with
\textbf{\texttt{ZE\_AFFINITY\_MASK=<idx>}} instead of the Level-Zero selector.
With that one change, \textbf{4-bit QLoRA trains end-to-end on the B70}
($\sim$7\,\% faster than the 3090 at 1.89$\times$ the efficiency, genuinely
4-bit at 5.45 GiB --- \S4.7), and we confirmed it \textbf{generalizes across the
B-series}: the identical fix (\texttt{*:gpu} + \texttt{ZE\_AFFINITY\_MASK=1})
works on the B50 (\S4.9). Because the fix is selector-level and card-agnostic, it
\textbf{unblocks 4-bit QLoRA on the entire Arc B-series --- including Intel's own
8-card Project Battlematrix topology}, where per-card pinning is mandatory and
the \texttt{level\_zero:N} trap is therefore likely to surface for anyone
following the standard multi-GPU docs.

\textbf{It is in production.} The fix is not a notebook hack --- it is wired into
a production training launcher (\textbf{node-runtime 0.10.121}), so every Arc
training job ColabHive dispatches uses the correct selector automatically.

\textbf{We are giving it back.} We are reporting this upstream to
\textbf{bitsandbytes} (the failure signature + the selector root-cause) and to
\textbf{Intel's IPEX / llm-scaler documentation} (so the multi-GPU pinning
guidance and the bnb 4-bit path stop colliding). A ready-to-file bug-report draft
is in \textbf{Appendix C}.

\textbf{Bottom line: we don't just consume the Intel stack --- we fix and
contribute back to it. The single most-cited ``Arc can't do QLoRA'' limitation
is, in our hands, a solved, productionized, upstream-reported one-line
configuration fix that scales to the full B-series and to Battlematrix.}
\end{tcolorbox}

\subsection{VRAM headroom --- a model the 3090 cannot hold at \emph{full
precision}}

At full \textbf{bf16} precision the 32 GB B70 holds mid-size models that do not
fit a 24 GB 3090. This is a \textbf{simplicity/headroom} advantage, \emph{not} an
absolute capability gap --- a quantized (AWQ/GPTQ/fp8) 14 B fits a 3090 fine and
is the standard way to serve one on 24 GB. The narrow, honest claim: at the
identical \textbf{no-quantization bf16} config used everywhere else in this paper,
loading \textbf{Qwen2.5-14B-Instruct} has both vendors attempt identical work
(Figure~\ref{fig:f8}).

\begin{table}[h]
\centering
\small
\begin{tabular}{@{}L{2.8cm} L{5.6cm} L{5.6cm}@{}}
\toprule
& \textbf{RTX 3090 (24 GB)} & \textbf{Arc Pro B70 (32 GB)} \\
\midrule
bf16 weight load &
\textbf{fails} --- CUDA OOM (23.49 / 23.56 GiB used, cannot allocate next
270 MiB) & \textbf{loads \& serves} (31.1 GiB used) \\
Real completion? & no & \textbf{yes} (coherent text) \\
\bottomrule
\end{tabular}
\caption{Qwen2.5-14B bf16 weight load: 3090 out-of-memories; B70 loads and
serves.}
\label{tab:vram}
\end{table}

The operational value is \textbf{simplicity}: on the B70 you serve or
fp16-finetune a $\sim$13--14 B model on a single card with no quantization
pipeline, no offload, no second GPU. On a 3090 the same model needs a
quantization step --- acceptable for inference, more involved for full-precision
finetuning. So the defensible framing is \emph{not} ``NVIDIA can't run a 14 B''
(it can, quantized) but ``\textbf{Arc Pro gives full-precision headroom} for a
whole class of mid-size models that a 24 GB card forces you to quantize or
shard.''

\subsection{The Arc Pro B50 --- efficiency at 70 W (a third data point)}

We also ran everything that fits on the node's second Intel card, the
\textbf{Arc Pro B50 (16 GB, 70 W TBP, no external power)} --- same selectors
(\texttt{level\_zero:1}; \texttt{*:gpu}+\texttt{ZE\_AFFINITY\_MASK=1} for QLoRA,
confirming the \S4.7 fix is general, not B70-specific), same energy method
(card-2 Xe counter). The B50 is the \textbf{efficiency/power floor} of the three
parts:

\begin{table}[h]
\centering
\small
\begin{tabular}{@{}L{3.4cm} L{3.7cm} L{3.5cm} L{3.8cm}@{}}
\toprule
\textbf{Workload} & \textbf{RTX 3090 (350 W)} & \textbf{Arc B70 (32 GB)} &
\textbf{Arc B50 (16 GB, 70 W)} \\
\midrule
SDXL / image &
8.21 s $\cdot$ 3,240 J $\cdot$ $\sim$394 W &
7.33 s $\cdot$ 1,621 J $\cdot$ $\sim$221 W &
18.74 s $\cdot$ \textbf{1,291 J} $\cdot$ \textbf{$\sim$69 W} \\
QLoRA-NF4 / step &
0.889 st/s $\cdot$ 459 J $\cdot$ $\sim$408 W &
0.949 st/s $\cdot$ 242 J $\cdot$ $\sim$230 W &
0.373 st/s $\cdot$ \textbf{187 J} $\cdot$ \textbf{$\sim$70 W} \\
TabNet &
16,852 rows/s $\cdot$ $\sim$108 W &
7,739 rows/s $\cdot$ $\sim$87 W &
3,700 rows/s $\cdot$ \textbf{$\sim$39 W} \\
LLM serving (7B, bf16) & yes & yes & \textbf{OOM} (16 GB) \\
LoRA-fp16 (7B) & yes & yes & \textbf{OOM} (16 GB) \\
\bottomrule
\end{tabular}
\caption{Three-way comparison including the Arc Pro B50. ``yes'' = runs; ``OOM''
= out-of-memory on the 16\,GB B50 at bf16.}
\label{tab:b50}
\end{table}

Two takeaways. \textbf{(1) Slow but astonishingly efficient.} The B50 is
2.3--4.5$\times$ slower than the 3090, yet draws only \textbf{39--70 W} and posts
the \textbf{lowest energy-per-unit-of-work of all three parts} on the workloads
it can run --- SDXL at \textbf{1,291 J/image (0.40$\times$ the 3090, below even
the B70)} and QLoRA at \textbf{187 J/step (0.41$\times$ the 3090)}. For power- or
density-constrained deployments (many cards per chassis, no external power
connector), that is the entire pitch. \textbf{(2) 16 GB is the binding
constraint.} A 7 B model at bf16 \textbf{out-of-memories} the B50 for both
serving and LoRA-fp16 --- it must be quantized to fit; QLoRA-4bit (5.45 GiB)
runs comfortably. So the B50 is a \emph{quantized-inference and QLoRA} card, and
the \textbf{B70's 32 GB is precisely what buys full-precision headroom} (\S4.8).
The two Intel parts are complementary: B50 for efficiency-per-watt at the low
end, B70 for full-precision capacity and near-3090 throughput.

\subsection{INT8 inference on Battlemage --- connecting the spec to throughput}

The economics in \S5 lean on Battlemage's published \textbf{INT8 TOPS/W}
advantage (B70 1.60 vs 3090 0.81). A spec is a promise, not a measurement --- so
we tried to cash it into delivered tokens/s on the B70. The answer has two halves:
the \emph{INT8-compute} path is blocked by a missing kernel, but the
\emph{deployment-standard} 4-bit path not only works, it \textbf{flips the
LLM-serving verdict to a B70 win}.

\textbf{(a) True INT8 (W8A8) --- blocked by a missing vLLM-XPU kernel.} Serving an
INT8 \texttt{compressed-tensors} W8A8 model (which would exercise the 367 INT8
TOPS) fails at engine init on vLLM-XPU:

\begin{Verbatim}[fontsize=\footnotesize,frame=single,framesep=4pt]
File .../quantization/kernels/scaled_mm/__init__.py, line 55,
    in choose_scaled_mm_linear_kernel
  for kernel in _POSSIBLE_KERNELS[current_platform._enum]:
KeyError: <PlatformEnum.XPU: 4>
\end{Verbatim}

vLLM has \textbf{no INT8 scaled-mm kernel registered for the XPU platform} --- the
dispatch table carries CUDA/ROCm/CPU/TPU entries but not XPU. So the B70's
published INT8 TOPS are \textbf{not yet realizable through vLLM's INT8 serving
path}: the economic INT8-TOPS edge (\S5) is a genuine spec advantage that today's
software stack cannot cash in. It is a concrete, reportable gap --- analogous to
the QLoRA selector issue but deeper (a missing kernel registration, not a config),
and squarely on \S6's closing trajectory.

\textbf{(b) AWQ-4bit --- works, and the B70 \emph{leads} the 3090.} The precision
operators actually deploy on 24--32 GB cards is 4-bit weight-only (AWQ/GPTQ). On
vLLM-XPU this routes through the IPEX weight-only path, gated behind a deprecation
guard that we bypass with one flag (\texttt{--allow-deprecated-quantization} --- a
third minor ecosystem unlock we document). Serving identical AWQ Qwen2.5-7B,
coherent output on both vendors:

\begin{table}[h]
\centering
\small
\begin{tabular}{@{}L{3.4cm} C{2.0cm} C{2.0cm} C{1.4cm} C{2.0cm} C{2.0cm}@{}}
\toprule
\textbf{AWQ-4bit, Qwen2.5-7B} & \textbf{3090 tok/s} & \textbf{B70 tok/s} &
\textbf{B70 \%} & \textbf{3090 tok/J} & \textbf{B70 tok/J} \\
\midrule
C=1  & 40.8  & \textbf{48.9}  & \textbf{120\,\%} & 0.255 & 0.325 \\
C=8  & 302.9 & \textbf{381.8} & \textbf{126\,\%} & 1.913 & 2.351 \\
C=16 & 598.0 & \textbf{709.5} & \textbf{119\,\%} & 3.506 & \textbf{4.179} \\
\bottomrule
\end{tabular}
\caption{AWQ-4bit LLM serving (Qwen2.5-7B). At the precision operators actually
deploy, the B70 leads the 3090 by 19--26\,\% on throughput and 1.19$\times$ on
tokens/joule.}
\label{tab:awq}
\end{table}

\textbf{Quantization flips the LLM-serving verdict.} At bf16 (\S4.2) the B70
trails the 3090 at 95--97\,\%; at AWQ-4bit --- the realistic operating point ---
the B70 \textbf{leads by 19--26\,\% on throughput and 1.19$\times$ on
tokens/joule}, while also cutting its own latency (p50 2.70 s vs 3.65 s bf16) and
power ($\sim$170 W vs $\sim$219 W). Each vendor runs its own AWQ kernel (IPEX
weight-only on XPU vs Marlin on CUDA), so this is a real-world ``what each card
actually serves'' comparison rather than a same-kernel one --- and it lands in
the B70's favor (Figure~\ref{fig:f9}).

% ===================== FIGURE 9 =====================
% AWQ figure lives in S4.10 (before S5) but is the 9th figure by identity;
% pin its number so float ordering does not relabel it.
\begin{figure}[t]
\setcounter{figure}{8}
\centering
\begin{tikzpicture}
\begin{axis}[
  paperbar,
  bar width=10pt,
  title={AWQ-4bit LLM serving throughput (tok/s)},
  ylabel={tokens / second},
  symbolic x coords={C=1,C=8,C=16},
  xtick=data,
  ymin=0,
  enlarge x limits=0.28,
  legend pos=north west,
  nodes near coords,
  nodes near coords style={font=\scriptsize},
  every node near coord/.append style={/pgf/number format/precision=1,/pgf/number format/fixed},
]
\addplot[fill=nvgreen,draw=nvgreen!60!black] coordinates
  {(C=1,40.8) (C=8,302.9) (C=16,598.0)};
\addplot[fill=b70blue,draw=b70blue!60!black] coordinates
  {(C=1,48.9) (C=8,381.8) (C=16,709.5)};
\legend{RTX 3090, Arc Pro B70}
\end{axis}
\end{tikzpicture}
\caption{AWQ-4bit LLM serving (Qwen2.5-7B): quantization flips the bf16
near-parity (95--97\%, Fig 1) into a clean B70 win (+19--26\%) at the precision
operators deploy.}
\label{fig:f9}
\end{figure}

\textbf{Bottom line:} the B70's INT8-\emph{TOPS} spec edge is not yet cashable
(vLLM has no XPU INT8 scaled-mm kernel --- a fixable gap, reported in Appendix C),
but the \emph{realizable} quantized win is already decisive: at the 4-bit
precision people actually deploy, the B70 \textbf{beats} the RTX 3090 on LLM
serving (+19--26\,\%) and efficiency (1.19$\times$), turning \S4.2's lone
near-parity loss into a clean win at the operating point that matters.

\section{Economics}

VRAM capacity and INT8 inference throughput per dollar and per watt are the
metrics that matter for democratizing AI access. (Intel does not publish dense
BF16 TFLOPS for the Arc Pro line, so INT8 TOPS --- which \emph{is} published ---
is used for the compute-economics comparison; raw BF16/\$ would be
apples-to-oranges and is omitted rather than estimated.)

\begin{table}[h]
\centering
\small
\begin{tabular}{@{}L{3.2cm} C{2.2cm} C{2.2cm} C{2.2cm} C{3.0cm}@{}}
\toprule
\textbf{Metric} & \textbf{B70} & \textbf{B60} & \textbf{B50} & \textbf{3090} \\
\midrule
Price USD & \textbf{$\sim$949} & $\sim$599 & 349 & 1499 new \\
\$/GB VRAM & 29.66 & 24.96 & 21.81 & 62.46 / 36.5 \\
GB/100W & 13.9 & 12.0 & 22.9 & 6.86 \\
INT8 TOPS/W & 1.60 & 0.99 & 2.43 & 0.81 \\
INT8 TOPS/\$ & 0.39 & 0.33 & 0.49 & 0.19 / 0.33 \\
\bottomrule
\end{tabular}
\caption{Economics. 3090 cells show new\,/\,used. INT8 TOPS (dense): B70 367,
B60 197, B50 170, 3090 285.}
\label{tab:econ}
\end{table}

The \textbf{B70 --- the part actually benchmarked in \S4 --- is now priced in this
table}; earlier drafts listed only the cheaper B60/B50, which was a fair
criticism. Spec context:
B70 32 GB GDDR6, 608 GB/s, $\sim$230 W TBP; B60 456 GB/s, 120--200 W; B50
224 GB/s, 70 W (no external power); RTX 3090 24 GB GDDR6X, 936 GB/s, 350 W.

% ===================== FIGURE 6 =====================
\begin{figure}[t]
\setcounter{figure}{5}
\centering
\begin{tikzpicture}
\begin{axis}[
  paperbar,
  title={VRAM cost (\$ per GB)},
  ylabel={\$ / GB VRAM},
  symbolic x coords={B70,B60,B50,3090 used,3090 new},
  xtick=data,
  x tick label style={rotate=30,anchor=east,font=\scriptsize},
  ymin=0,ymax=70,
  bar width=14pt,
  nodes near coords,
  nodes near coords style={font=\scriptsize},
  enlarge x limits=0.14,
]
\addplot[fill=b70blue,draw=b70blue!60!black] coordinates
  {(B70,29.66) (B60,24.96) (B50,21.81)};
\addplot[fill=nv3090used,draw=nv3090used!60!black] coordinates
  {(3090 used,36.46)};
\addplot[fill=nvgreen,draw=nvgreen!60!black] coordinates
  {(3090 new,62.46)};
\end{axis}
\end{tikzpicture}
\caption{VRAM cost per GB. Intel is $\sim$2.5$\times$ cheaper than a NEW 3090,
$\sim$1.2--1.7$\times$ cheaper than a USED one.}
\label{fig:f6}
\end{figure}

% ===================== FIGURE 7 =====================
\begin{figure}[t]
\setcounter{figure}{6}
\centering
\begin{tikzpicture}
\begin{axis}[
  paperbar,
  title={Per-watt economics (price-independent)},
  ylabel={metric value},
  symbolic x coords={B70,B60,B50,3090},
  xtick=data,
  ymin=0,ymax=25,
  bar width=11pt,
  legend pos=north east,
  enlarge x limits=0.22,
]
\addplot[fill=wingreen,draw=wingreen!60!black] coordinates
  {(B70,1.60) (B60,0.99) (B50,2.43) (3090,0.81)};
\addplot[fill=b70blue,draw=b70blue!60!black] coordinates
  {(B70,13.9) (B60,12.0) (B50,22.9) (3090,6.86)};
\legend{INT8 TOPS/W, GB-VRAM/100W}
\end{axis}
\end{tikzpicture}
\caption{Structural efficiency that the used-market price cannot erode: INT8 TOPS
per watt and GB of VRAM per 100W.}
\label{fig:f7}
\end{figure}

% ===================== FIGURE 8 =====================
\begin{figure}[t]
\setcounter{figure}{7}
\centering
\begin{tikzpicture}
\begin{axis}[
  width=0.8\linewidth,
  height=5.0cm,
  xbar,
  bar width=16pt,
  xmin=0,xmax=37,
  ymin=0.4,ymax=3.6,
  xlabel={VRAM capacity (GB)},
  ytick={1,2,3},
  yticklabels={Arc Pro B50, RTX 3090, Arc Pro B70},
  ymajorgrids=false,
  xmajorgrids=true,
  grid style={gray!25},
  tick label style={font=\footnotesize},
  label style={font=\footnotesize},
  nodes near coords,
  nodes near coords style={font=\scriptsize},
  axis line style={gray!60},
]
\addplot[fill=b50teal,draw=b50teal!60!black] coordinates {(16,1)};
\addplot[fill=nvgreen,draw=nvgreen!60!black] coordinates {(24,2)};
\addplot[fill=b70blue,draw=b70blue!60!black] coordinates {(32,3)};
\draw[losered,dashed,very thick] (axis cs:29.5,0.5) -- (axis cs:29.5,3.5);
\node[losered,font=\scriptsize\bfseries,anchor=south,rotate=90]
  at (axis cs:29.5,2) {Qwen2.5-14B @ bf16};
\node[losered,font=\scriptsize\bfseries,anchor=west] at (axis cs:16.5,1) {OOM};
\node[white,font=\scriptsize\bfseries,anchor=east] at (axis cs:23.6,2) {OOM};
\node[wingreen,font=\scriptsize\bfseries,anchor=west] at (axis cs:32.4,3) {fits};
\end{axis}
\end{tikzpicture}
\caption{A 14B model at full bf16 precision ($\sim$29.5 GB) fits the 32GB B70 and
out-of-memories both the 24GB 3090 and the 16GB B50 (B70 used 31.1 GB).}
\label{fig:f8}
\end{figure}

\textbf{Reading the advantage honestly.} On \textbf{\$/GB VRAM}, Intel Arc Pro is
\textbf{0.35--0.47$\times$ a \emph{new} 3090} (\$1,499) --- the headline
``$\sim$2.5--3$\times$ VRAM per dollar.'' But the 3090 is a 2020 part bought
\emph{used} ($\sim$\$700--1,050 $\approx$ \$36.5/GB); against that realistic
comparator the gap shrinks to \textbf{$\sim$0.6--0.8$\times$}, i.e.\ roughly
\textbf{1.2--1.7$\times$ the VRAM per dollar}, not 3$\times$. Where Intel's lead is
\textbf{robust regardless of price basis} is \textbf{VRAM per watt}
(1.75--3.3$\times$) and \textbf{INT8 TOPS/W} (1.2--3.0$\times$) --- structural
efficiency the used market cannot erode (Figure~\ref{fig:f7}). The honest
economic case: \emph{modestly} cheaper VRAM than a used 3090, \emph{much} cheaper
than a new one, decisively better perf/watt --- and a brand-new part (warranty,
current drivers, support) versus a depreciating six-year-old card.

Sticker price is the wrong unit. What an operator actually pays is \textbf{cost
per unit of work} over the life of the card --- hardware amortization \emph{plus}
the electricity to do the work --- and that is where the measured perf/watt of \S4
converts directly into dollars. The rest of \S5 builds that
total-cost-of-ownership (TCO) case from the measured data.

\begin{tcolorbox}[colback=boxbg,colframe=boxframe!70,boxrule=0.6pt,arc=2pt,
  left=7pt,right=7pt,top=5pt,bottom=5pt]
\textbf{TCO assumptions (stated once, used throughout \S5).} 3-year
straight-line hardware amortization = \textbf{26,280 operating hours}.
Electricity: \textbf{US \$0.15 / Argentina \$0.10 / Germany \$0.30 per kWh}. Grid
carbon intensity $\approx$ \textbf{0.4 kg CO\textsubscript{2} / kWh}. Hardware
capital: \textbf{used 3090 \$875}, \textbf{B70 \$949}, \textbf{B50 \$349} (new
3090 \$1,499 shown where it sharpens the contrast). Throughput and watts are the
\S4 measured cells. The full formula and a sensitivity note are in
\textbf{Appendix A}.
\end{tcolorbox}

\subsection{All-in cost per unit of work}

All-in \$/unit = \textbf{hardware $\div$ (units produced over life at utilization
\emph{U})} + \textbf{energy/unit $\times$ \$/kWh}. The hardware term shrinks with
utilization (a card amortized over more work is cheaper per unit); the energy term
does not. Both headline workloads below are computed from the \S4 throughput and
power.

\textbf{LLM serving (Qwen-7B, concurrency C=16) --- \$ / million tokens @ US
\$0.15, 100\,\% utilization:}

\begin{table}[h]
\centering
\small
\begin{tabular}{@{}L{3.6cm} C{2.6cm} C{2.6cm} C{2.8cm}@{}}
\toprule
\textbf{Card} & \textbf{HW \$/M-tok} & \textbf{Energy \$/M-tok} &
\textbf{All-in \$/M-tok} \\
\midrule
\textbf{Arc B70} & \$0.0181 & \$0.0166 & \textbf{\$0.0347} \\
3090 (used) & --- & --- & \textbf{\$0.0401} \\
3090 (new) & --- & --- & \textbf{\$0.0514} \\
\bottomrule
\end{tabular}
\caption{All-in \$/million-token, LLM serving at C=16, US \$0.15/kWh, 100\,\%
utilization.}
\label{tab:tco-llm}
\end{table}

The B70's win on LLM serving is \textbf{purely energy} --- it is slightly
\emph{behind} on raw throughput (\S4.2), so it does not win on the hardware term;
it wins because each token costs $\sim$half the joules. That makes the
LLM-serving advantage \textbf{utilization-gated}: at very low utilization the
hardware term dominates and the cheaper-sticker used 3090 can edge ahead; as
utilization rises the energy term dominates and the B70 pulls away. The crossover
is exact:

\begin{tcolorbox}[colback=boxbg,colframe=boxframe!70,boxrule=0.6pt,arc=2pt,
  left=7pt,right=7pt,top=5pt,bottom=5pt]
\textbf{Crossover utilization \emph{U}* = \$0.0429 / price\_kWh.} The B70 beats a
used 3090 on \$/token above \textbf{U* = 28.6\,\% utilization at US \$0.15/kWh},
\textbf{14.3\,\% at Germany \$0.30}, and \textbf{43\,\% at Argentina \$0.10}. The
more expensive your power and the busier your fleet, the more decisively Arc wins
LLM serving --- and a production inference fleet runs well above 28.6\,\%.
\end{tcolorbox}

\textbf{SDXL image generation --- \$ / 1,000 images @ US \$0.15, 100\,\%
utilization:}

\begin{table}[h]
\centering
\small
\begin{tabular}{@{}L{3.2cm} C{3.0cm} C{2.6cm} C{2.6cm}@{}}
\toprule
\textbf{Card} & \textbf{All-in \$/1,000 img} & \textbf{vs 3090-used} &
\textbf{vs 3090-new} \\
\midrule
\textbf{Arc B50} & \textbf{\$0.123} & --- & --- \\
\textbf{Arc B70} & \textbf{\$0.141} & \textbf{33\,\% cheaper} &
\textbf{47\,\% cheaper} \\
3090 (used) & \textbf{\$0.211} & --- & --- \\
3090 (new) & \textbf{\$0.265} & --- & --- \\
\bottomrule
\end{tabular}
\caption{All-in \$/1,000 SDXL images, US \$0.15/kWh, 100\,\% utilization.}
\label{tab:tco-sdxl}
\end{table}

SDXL is different in kind: the B70 wins on \textbf{both} the hardware-per-image
term (it is \emph{faster} per image, \S4.3) \textbf{and} the energy term (half the
joules). When a card is cheaper on every component of the cost, there is
\textbf{no crossover} ---

\begin{tcolorbox}[colback=boxbg,colframe=boxframe!70,boxrule=0.6pt,arc=2pt,
  left=7pt,right=7pt,top=5pt,bottom=5pt]
\textbf{Under our measured throughput/power and the stated TCO assumptions, SDXL is cheaper on Arc at every utilization and every electricity price.} In our model there is no operating point --- no fleet, no country, no duty
cycle --- at which a 3090 generates SDXL images more cheaply than a B70 (let alone
a B50). The B50 is the cheapest image-factory of all four parts.
\end{tcolorbox}

\textbf{Bottom line: the higher sticker on a B70 is repaid by the energy bill, not
by marketing. On LLM serving it pays back above $\sim$29\,\% utilization (US) ---
i.e.\ on any real fleet --- and on SDXL it is cheaper than a 3090 at every operating point in our TCO model.}

\subsection{Energy \& carbon at fixed throughput}

Flip the question: hold \emph{output} constant and ask what it costs in power and
carbon. This is the number a data-center operator or an ESG-minded investor cares
about.

\begin{table}[h]
\centering
\small
\begin{tabular}{@{}L{3.6cm} L{2.8cm} L{4.0cm} L{2.8cm}@{}}
\toprule
\textbf{Fixed throughput} & \textbf{RTX 3090} & \textbf{Arc B70} &
\textbf{Arc B50} \\
\midrule
\textbf{1 M SDXL img / day} & 900 kWh/day, 328.5 MWh/yr &
\textbf{450 kWh/day, 164.4 MWh/yr ($-$50\,\%, $-$66 t CO\textsubscript{2}/yr)} &
\textbf{359 kWh/day ($-$60\,\%)} \\
\textbf{1 B tokens / day (LLM)} & 161 kWh/day &
\textbf{111 kWh/day ($-$31\,\%)} & OOM (16 GB) \\
\bottomrule
\end{tabular}
\caption{Energy and carbon at fixed output. Arc cuts the energy bill 31\,\% (LLM)
to 60\,\% (SDXL).}
\label{tab:energy}
\end{table}

The diffusion number is the dramatic one: \textbf{a diffusion-heavy slice of an AI
fleet halves its electricity and sheds $\sim$66 t CO\textsubscript{2}/yr per
million-image-per-day of capacity} simply by being Arc-class rather than
3090-class. Tie this to ColabHive's companion \textbf{6.1 TWh / 10-million-GPU}
scaling vision: a diffusion-heavy slice of that fleet halves on Arc-class
efficiency. As an \emph{illustrative ceiling} --- not a forecast --- if the entire
6.1 TWh were SDXL-equivalent work running on 3090s, the same work on B70s would
draw \textbf{3.05 TWh/yr}, a saving of \textbf{$-$3.05 TWh/yr $\approx$ the
continuous output of a $\sim$350 MW power plant}.

\textbf{Bottom line: at any fixed output, Arc cuts the energy bill 31\,\% (LLM) to
60\,\% (SDXL) and the carbon with it --- and at fleet scale the diffusion savings
alone are measured in power-plants, not percentages.}

\subsection{Rack / chassis density}

Power, not slots, is the binding constraint in a modern rack. Sizing by
\textbf{total board power (TBP --- 3090 350 W / B70 230 W / B50 70 W)} into a fixed
\textbf{10 kW rack} shows what you can actually deploy per kilowatt:

\begin{table}[h]
\centering
\small
\begin{tabular}{@{}L{3.6cm} C{2.4cm} C{2.8cm} C{2.8cm}@{}}
\toprule
\textbf{Per 10 kW rack} & \textbf{RTX 3090} & \textbf{Arc B70} &
\textbf{Arc B50} \\
\midrule
Cards & 28 & \textbf{43} & \textbf{142} \\
Aggregate VRAM (GB) & 672 & \textbf{1,376 (2.05$\times$)} &
\textbf{2,272 (3.38$\times$)} \\
SDXL throughput (img/min) & 204 & \textbf{353 (1.73$\times$)} &
\textbf{454 (2.23$\times$)} \\
LLM throughput (tok/s) & 16,296 & \textbf{23,878 (1.47$\times$)} & OOM (16 GB) \\
\textbf{VRAM per kW (GB/kW)} & 68.6 & 139.1 & \textbf{228.6} \\
\bottomrule
\end{tabular}
\caption{Rack / chassis density into a fixed 10 kW power budget, sized by total
board power.}
\label{tab:rack}
\end{table}

Per watt, the B50 packs \textbf{3.3$\times$ the VRAM and 2.2$\times$ the SDXL
throughput} of a 3090; the B70 packs \textbf{2$\times$ the VRAM and 1.5$\times$
the LLM throughput}. Put the other way --- \textbf{value-tier sizing to match one
used 3090's SDXL throughput} takes only \textbf{0.89 of a B70 (\$845, 205 W)} or
\textbf{2.28 B50s (\$796, 160 W, 36.5 GB)}. The B50 path matches the 3090's image
throughput for \textbf{less capital, half the power, and more VRAM} --- and on a
card with \textbf{no external power connector at all}.

\textbf{Bottom line: the no-external-power 70 W B50 is the density play --- in a
power-bound rack it delivers 3.3$\times$ the VRAM and 2.2$\times$ the SDXL
throughput per kilowatt of a 3090, and matches a used 3090's image output for less
money and half the power.}

\begin{tcolorbox}[colback=boxbg,colframe=boxframe!70,boxrule=0.6pt,arc=2pt,
  left=7pt,right=7pt,top=5pt,bottom=5pt]
\textbf{Conservative note.} All three subsections scale cards \emph{linearly} and
ignore the $\sim$10--20\,\% PSU/cooling overhead a real rack pays on top of card
TBP. Because that overhead scales with card wattage, it penalizes the
higher-wattage 3090 fleet \emph{more} than the lower-wattage Arc fleet --- so every
density and energy figure here \textbf{understates} the Arc advantage rather than
inflating it.
\end{tcolorbox}

\section{Honest ecosystem-maturity assessment}

The hardware case is strong; intellectual honesty about the software stack is what
makes it credible.

\textbf{Where Intel XPU lags today:}
\begin{itemize}
\item \textbf{4-bit QLoRA} initially appeared to fail, but the cause was a
\textbf{SYCL device-selector mismatch} in bitsandbytes' 4-bit custom op under
\texttt{ONEAPI\_DEVICE\_SELECTOR=level\_zero:0}, \emph{not} a missing Triton-XPU
kernel; with \texttt{ONEAPI\_DEVICE\_SELECTOR=*:gpu} it trains end-to-end and
beats the 3090 (\S4.7). Both \textbf{LoRA-fp16 and 4-bit QLoRA} are now working
Intel finetuning paths (\S4.4, \S4.7). The open item is wiring the selector into
the platform's Intel training-launch path.
\item \textbf{No FlashAttention / no xformers} on XPU (CUDA-only upstream);
attention falls back to \texttt{torch\_sdpa} / oneDNN paths.
\item \textbf{Small-operator overhead} (TabNet, raw-eager training): the XPU
under-saturates and per-kernel/eager cost dominates (\S4.6).
\item \textbf{Classical gradient boosting (XGBoost, CatBoost) has no Intel-GPU
backend} --- these stay on CPU on Intel (CUDA-only / experimental SYCL). Not a B70
deficiency; a library reality.
\item vLLM-XPU has quantization gaps (no XPU INT8 scaled-mm kernel, \S4.10) and no
CUDA-graph equivalent; Triton-XPU is still out-of-tree.
\item \textbf{Multi-card tensor-parallel (TP$>$1) is unvalidated on Arc}, and we
observe a \textbf{real TP fault on dual-B70 in the current vLLM-XPU} (TP$>$1 does
not run clean across two Arc cards in our environment). This is the single most
important gap for the on-node 70 B / Battlematrix story (\S2.3) --- single-card
efficiency is proven, multi-card composition is not.
\end{itemize}

\textbf{The trajectory is fast and favorable:}
\begin{itemize}
\item \textbf{Native \texttt{torch.xpu}} has been in PyTorch since 2.5 (Arc
A-Series + Data Center Max named first), with \textbf{Arc B-Series (Battlemage)
support maturing through 2.6--2.7}; the part we benchmarked runs
\texttt{torch 2.10.0+xpu}.
\item \textbf{bitsandbytes 0.48} added official Arc B-Series support (the 4-bit
path is the immature piece, not bnb-on-Arc per se).
\item \textbf{IPEX is folding into upstream PyTorch} (the standalone package
reaches EOL $\sim$March 2026) --- i.e.\ the ``you need a special extension'' era is
ending.
\item \textbf{Triton-XPU is being upstreamed} (relevant to other quantization
kernels, though --- per \S4.7 --- \emph{not} the gating dependency for the
bitsandbytes 4-bit QLoRA path, which already works on Battlemage with the correct
SYCL device selector).
\end{itemize}

\textbf{A concrete integration finding from this work:} a training image built
naively on \texttt{ubuntu:24.04} + a hand-installed compute-runtime
\textbf{segfaulted} on \texttt{torch.xpu} device enumeration, despite a correct
torch version. The fix was to build the training image on \textbf{Intel's own
\texttt{llm-scaler-vllm} runtime base} (the same coherent oneAPI/runtime stack
used for inference). The lesson --- \emph{use Intel's curated runtime, don't
assemble your own} --- is itself a useful data point for anyone standing up Arc
training, and it confirmed the operator's hypothesis that ``the key is in the
llm-scaler.''

\section{Conclusion \& roadmap}

Measured on production hardware with device-attributed energy, \textbf{Intel Arc
Pro (Battlemage) is a practical, power- and cost-efficient alternative to the RTX
3090 for the affordable-tier modern-AI workloads} --- LLM serving, LoRA/QLoRA
fine-tuning, and Stable Diffusion --- delivering \textbf{95--112\,\% of the
throughput at 56--64\,\% of the power ($\sim$1.45--2.0$\times$ perf/watt)}, at
\textbf{$\sim$0.4$\times$ the \$/GB of VRAM versus a new 3090 ($\sim$0.6--0.8$\times$
versus a used one)}. It is genuinely \emph{better} on diffusion and on QLoRA,
$\sim$parity on LoRA and (slightly behind, 95--97\,\%) on LLM serving, and
decisively better on perf/watt across the board. It is honestly weaker on
small-operator workloads (TabNet --- a loss that is, per the \S4.7 note,
\emph{unaudited}) and has a defined, closing set of software-ecosystem gaps
(FlashAttention, classical-ML-on-GPU). This is a \textbf{value-tier second-source
case} --- not a claim against frontier H100/B200 silicon.

\textbf{Recommended deployment patterns (today):}
\begin{itemize}
\item \textbf{LLM serving and SDXL inference on Arc Pro} --- best perf/watt and
economics; deploy first.
\item \textbf{LoRA-fp16 and 4-bit QLoRA fine-tuning on Arc Pro} --- both faster
and more energy-efficient than the 3090; the production LLM-customization paths
(QLoRA needs the \texttt{*:gpu} selector wired into the training launcher --- a
small integration task).
\item \textbf{Keep deep-tabular/TabNet on NVIDIA} for now; classical GBDT stays on
CPU on either platform.
\end{itemize}

\textbf{Expansion path} (per Intel's proposal to broaden the engagement): scale
Arc Pro capacity for the inference + LoRA tier, add \textbf{Xeon} for the
CPU/classical tier (where it is vendor-neutral anyway), and evaluate
\textbf{Crescent Island} for the next-generation inference fleet as the XPU
software stack closes its remaining gaps. The combination directly serves the
original mission: \textbf{affordable, scalable, power-efficient AI compute for the
LATAM developer and research community.}

\section{Appendices --- TCO model, raw data, reproduction \& upstream report,
methodology}

\subsection*{Appendix A --- The TCO model (formula, assumptions, sensitivity)}

Everything in \S5.1--5.3 is computed from this one model; a reviewer can recompute
every cell from it.

\textbf{All-in cost per unit of work.} For a card producing throughput \emph{T}
units/hour at average power \emph{P} watts:

\begin{Verbatim}[fontsize=\footnotesize,frame=single,framesep=4pt]
all_in_$/unit(U) = capital / (T x hours_life x U)        <- hardware term
                 + (P / 1000 / T) x price_kWh            <- energy term

where:
  units_over_life     = T x hours_life x U
  energy_per_unit_kWh = (P watts / 1000) / (T units/hour)
\end{Verbatim}

\textbf{Assumptions (identical to the \S5 box):} \texttt{hours\_life = 26,280 h}
(3-yr straight-line). \texttt{price\_kWh} $\in$ \{US 0.15, AR 0.10, DE 0.30\}.
Grid carbon \texttt{= 0.4 kg CO\textsubscript{2}/kWh}. Capital:
\texttt{3090-used \$875}, \texttt{3090-new \$1,499}, \texttt{B70 \$949},
\texttt{B50 \$349}. \emph{T} and \emph{P} are the \S4 measured cells (LLM serving
C=16: 3090 582 tok/s @ 343 W, B70 555.3 tok/s @ 219 W; SDXL: 3090 7.3 img/min @
394 W, B70 8.2 img/min @ 221 W, B50 3.2 img/min @ 69 W).

\textbf{Crossover utilization.} Setting
\texttt{all\_in\_B70(U) = all\_in\_3090used(U)} and solving for \emph{U} on the
LLM cell (where the B70 wins on energy but not hardware) gives the closed form
used in \S5.1:

\begin{Verbatim}[fontsize=\footnotesize,frame=single,framesep=4pt]
U* = dHW_constant / price_kWh = 0.0429 / price_kWh
  -> US 0.15 -> 28.6 %   |   DE 0.30 -> 14.3 %   |   AR 0.10 -> 43 %
\end{Verbatim}

For SDXL the B70 wins \textbf{both} terms, so the equation has no positive-\emph{U}
crossover $\rightarrow$ unconditional win (\S5.1).

\textbf{Sensitivity (the two knobs that move the answer):}
\begin{itemize}
\item \textbf{\$/kWh.} The energy term --- and the \emph{entire} B70 LLM-serving
advantage --- scales linearly with electricity price. Cheaper power (Argentina)
pushes the LLM crossover up to 43\,\% util and shrinks the margin; expensive power
(Germany) drops it to 14.3\,\% and widens it. SDXL is insensitive in sign (always
wins) and only varies in \emph{magnitude}.
\item \textbf{Utilization \emph{U}.} \emph{U} only scales the hardware term, never
the energy term. So every result here is \textbf{most favorable to the
cheaper-energy card (Arc) at high utilization} and most favorable to the
cheaper-sticker card (used 3090) at low utilization. A production fleet lives at
high \emph{U}, which is the regime where Arc wins.
\end{itemize}

\subsection*{Appendix B --- Consolidated raw-data table (every measured cell)}

So a reviewer can recompute J/unit, perf/watt, and the \S5 economics
independently. All cells measured 2026-06-21, bf16/eager, single dedicated GPU.
``---'' = not run / not applicable; ``OOM'' = did not fit in VRAM.

\begin{table}[h]
\centering
\small
\begin{tabular}{@{}L{4.2cm} L{2.6cm} C{2.4cm} C{2.4cm} C{1.8cm}@{}}
\toprule
\textbf{Workload (unit)} & \textbf{Metric} & \textbf{RTX 3090} & \textbf{Arc B70}
& \textbf{Arc B50} \\
\midrule
LLM serving C=1 (tok/s) & throughput & 37.8 $\pm$ 3.4 & 36.6 $\pm$ 0.1 & --- \\
LLM serving C=8 (tok/s) & throughput & 296.7 $\pm$ 6.9 & 287.0 $\pm$ 0.5 & --- \\
LLM serving C=16 (tok/s) & throughput & 582.0 $\pm$ 19.6 & 555.3 $\pm$ 2.1 & OOM \\
LLM serving C=16 & tok/J & 1.723 & 2.511 & --- \\
LLM serving (load) & avg power (W) & $\sim$343 & $\sim$219 & --- \\
SDXL (per image) & latency (s) & 8.21 $\pm$ 0.09 & 7.33 $\pm$ 0.03 & 18.74 \\
SDXL (per image) & throughput (img/min) & 7.3 & 8.2 & 3.2 \\
SDXL (per image) & energy/image (J) & 3,240 $\pm$ 37 & 1,621 $\pm$ 6 & 1,291 \\
SDXL & avg power (W) & $\sim$394 & $\sim$221 & $\sim$69 \\
LoRA-fp16 7B (step) & thr.\ (st/s, tok/s) & 1.90, 1,944 & 1.96, 2,007 & OOM \\
LoRA-fp16 7B & energy/step (J) & 180.8 & 110.1 & OOM \\
LoRA-fp16 7B & avg power (W) & $\sim$347 & $\sim$222 & OOM \\
QLoRA-NF4 7B (step) & thr.\ (st/s, tok/s) & 0.889, 911 & 0.949, 972 & 0.373 \\
QLoRA-NF4 7B & energy/step (J) & 459.0 & 242.3 & 187 \\
QLoRA-NF4 7B & avg power (W) & $\sim$408 & $\sim$230 & $\sim$70 \\
QLoRA-NF4 7B & VRAM 4-bit (GiB) & 5.45 & 5.45 & 5.45 \\
SDXL UNet-LoRA (step) & throughput (st/s) & 1.625 & 0.930 & --- \\
SDXL UNet-LoRA & energy/step (J) & 208 & 161 & --- \\
SDXL UNet-LoRA & avg power (W) & $\sim$346 & $\sim$152 & --- \\
TabNet 16k$\times$64 (rows/s) & throughput & 16,852 & 7,739 & 3,700 \\
TabNet & avg power (W) & $\sim$108 & $\sim$87 & $\sim$39 \\
TabNet & rows/J & 159 & 89 & --- \\
Qwen2.5-14B bf16 & weight load & OOM (24 GB) & loads (31.1 GiB) & --- \\
AWQ-4bit C=1 (tok/s) & throughput & 40.8 & 48.9 & --- \\
AWQ-4bit C=8 (tok/s) & throughput & 302.9 & 381.8 & --- \\
AWQ-4bit C=16 (tok/s) & throughput & 598.0 & 709.5 & --- \\
\bottomrule
\end{tabular}
\caption{Consolidated raw data, every measured cell.}
\label{tab:rawdata}
\end{table}

\emph{Energy method per cell: Intel = Xe sysfs exact counter (all cells); NVIDIA =
NVML exact counter for QLoRA, \texttt{nvidia-smi power.draw} @200 ms integral for
LLM/SDXL (asymmetry flagged in Appendix D).}

\subsection*{Appendix C --- Reproduction \& upstream report}

\textbf{C.1 Reproducing the headline cells.} Pin a single dedicated GPU per side,
drain other models off it, bf16/eager throughout, identical
prompts/inputs/lengths. LLM throughput from vLLM's
\texttt{vllm:generation\_tokens\_total} delta over a \texttt{MEASURE\_START/END}-%
bracketed window (\texttt{ignore\_eos} to fix output length). Energy from the
exact accumulating counters where possible (Xe \texttt{energy1\_input}, NVML
\texttt{nvmlDeviceGetTotalEnergyConsumption}). Software versions in Appendix D.
The QLoRA cell requires the selector fix below.

\textbf{C.2 Ready-to-file bug report --- bitsandbytes 4-bit QLoRA on Intel Arc
B-series (device-selector trap).}

\begin{tcolorbox}[breakable,colback=boxbg,colframe=boxframe!70,boxrule=0.6pt,
  arc=2pt,left=7pt,right=7pt,top=5pt,bottom=5pt]
\textbf{Title:} 4-bit QLoRA fails on Intel Arc B-series under
\texttt{ONEAPI\_DEVICE\_SELECTOR=level\_zero:N} --- SYCL \texttt{No device of
requested type available} (bnb's 4-bit op rejects the Level-Zero selector that
Intel's own multi-GPU docs recommend)

\textbf{Components:} \texttt{bitsandbytes} (XPU 4-bit custom op) $\cdot$ Intel IPEX
/ \texttt{llm-scaler} multi-GPU documentation

\textbf{Environment:} Arc Pro B70 (\texttt{0xe223}, BMG-G31, 32 GB) and Arc Pro
B50 (\texttt{0xe212}, 16 GB); \texttt{torch 2.10.0+xpu};
\texttt{bitsandbytes 0.49.2}; compute-runtime 26.05+; \texttt{xe} driver; Ubuntu
24.04 / kernel 6.17; image built on \texttt{intel/llm-scaler-vllm:0.14.0-b8.3.1}
base.

\textbf{Failure signature:} With \texttt{ONEAPI\_DEVICE\_SELECTOR=level\_zero:0}
(per Intel's multi-GPU pinning guidance), loading an NF4-quantized model and
starting a QLoRA step makes bitsandbytes' native 4-bit op
(\texttt{torch.ops.bitsandbytes.quantize\_4bit}) throw a SYCL \texttt{No device of
requested type available}. Note \texttt{torch.xpu.is\_available()} is
\texttt{True} and \texttt{torch.xpu.get\_device\_properties()} enumerates the GPU
correctly --- only bitsandbytes' separately-compiled SYCL kernel queue rejects the
device.

\textbf{Root cause:} bitsandbytes' 4-bit SYCL queue does not resolve a device when
the process is scoped to the \textbf{Level-Zero} backend selector.
\texttt{torch.xpu} and bnb's SYCL runtime resolve devices through different paths;
the Level-Zero-only scoping that satisfies torch starves bnb's queue.

\textbf{Fix (one line of configuration):} Use the any-backend GPU selector and pin
the specific card by affinity mask instead of by Level-Zero index:
\begin{Verbatim}[fontsize=\footnotesize,frame=single,framesep=4pt]
# FAILS for bnb 4-bit:  ONEAPI_DEVICE_SELECTOR=level_zero:1
# WORKS:
export ONEAPI_DEVICE_SELECTOR='*:gpu'
export ZE_AFFINITY_MASK=1        # pin to the desired card (0-based)
\end{Verbatim}
With this, 4-bit QLoRA trains end-to-end (verified genuinely 4-bit: NF4 7B =
5.45 GiB).

\textbf{Affected scope:} the \textbf{entire Arc B-series} (reproduced on
\textbf{both B70 and B50} with the identical fix). Critically, it affects
\textbf{Project Battlematrix (8$\times$B60)} and any multi-card Arc deployment,
because per-card pinning is mandatory there and the standard
\texttt{level\_zero:N} guidance is exactly what triggers the failure --- so following the standard multi-GPU docs is exactly what surfaces this failure.

\textbf{Requested doc change (Intel):} in the IPEX / llm-scaler multi-GPU pinning
guidance, note that workloads using bitsandbytes 4-bit must pin with
\texttt{ONEAPI\_DEVICE\_SELECTOR='*:gpu'} + \texttt{ZE\_AFFINITY\_MASK=<idx>}, not
\texttt{level\_zero:<idx>}.

\textbf{Requested fix (bitsandbytes):} make the 4-bit SYCL kernel queue resolve a
device under the Level-Zero backend selector (or emit an actionable error pointing
at the selector rather than the opaque \texttt{No device of requested type
available}).
\end{tcolorbox}

\subsection*{Appendix D --- Methodology, versions, reproducibility (provenance)}

\textbf{Software versions (as benchmarked):}
\begin{itemize}
\item Intel inference: \texttt{inference-ipex:v0.7.24} $\leftarrow$
\texttt{intel/llm-scaler-vllm:0.14.0-b8.3.1} (\texttt{torch 2.10.0+xpu}, vLLM-XPU,
compute-runtime 26.09).
\item Intel training: \texttt{training-ipex:v0.2.0} (llm-scaler base +
\texttt{peft}, \texttt{trl}, \texttt{bitsandbytes 0.49.2},
\texttt{pytorch-tabnet 4.1.0}; PIP\_CONSTRAINT pins \texttt{torch==2.10.0+xpu}).
\item NVIDIA: \texttt{inference-vllm:v2.0.8}, \texttt{inference-generative:v3.0.0},
\texttt{training-transformers-cu121:v1.0.5}, \texttt{training-pytorch-cu121:v1.0.1}
(TabNet baseline required runtime \texttt{numpy<2} + \texttt{pytorch-tabnet==4.1.0}
due to a numpy-2/torch-2.1.2 incompatibility in the older image).
\item Host: Ubuntu 24.04, kernel 6.17, \texttt{xe} driver, compute-runtime 26.05+,
Resizable BAR on.
\end{itemize}

\textbf{Energy method (and its asymmetry, stated plainly):} Intel = \texttt{xe}
sysfs \texttt{energy1\_input} ($\mu$J, exact accumulating counter, per-card by PCI
id). NVIDIA = the exact NVML \texttt{nvmlDeviceGetTotalEnergyConsumption} counter
where used (QLoRA), else the integral of \texttt{nvidia-smi power.draw} @200 ms
(LLM, SDXL). \textbf{The two NVIDIA methods are not the same instrument} --- 200 ms
sampling can miss sub-sample transients --- so the LLM/SDXL perf/watt margins carry
a small extra uncertainty the QLoRA cell (exact counter, both sides) does not.
Window bracketed by workload-emitted \texttt{MEASURE\_START/END}. Now collected
fleet-wide by node-runtime 0.10.119 $\rightarrow$ \texttt{node\_power\_samples}.

\textbf{Workloads:} Qwen2.5-7B-Instruct (LLM, vLLM and LoRA/QLoRA); Stable
Diffusion XL base 1.0 (inference + UNet-LoRA training); synthetic tabular
(TabNet). bf16/eager throughout; identical prompts/inputs/lengths cross-vendor.

\textbf{Data provenance / open items:}
\begin{itemize}
\item \textbf{K=3 ($\pm$1 sample-std, n=3)} covers the two inference cells --- LLM
serving (\S4.2) and SDXL (\S4.3). LoRA (\S4.4) and QLoRA (\S4.7) are paired
measured runs reported as \textbf{point estimates} (repeated, no tight-CI claim).
SDXL-UNet-LoRA (\S4.5) and TabNet (\S4.6) are single runs. \textbf{Sampling
caveat:} all ``K=3'' is \emph{same-device} repetition --- \textbf{N=1 hardware per
vendor} --- so silicon-lottery / board-partner / thermal variation is
uncontrolled; a second physical pair and a second model size ($\sim$1.5 B) are the
next steps. The 14 B VRAM-headroom case (\S4.8) is measured.
\item \textbf{Known confounds (not yet eliminated):} (a) \emph{engine-version
skew} --- Intel runs vLLM-XPU 0.14.x on \texttt{torch 2.10+xpu}; NVIDIA runs stock
vLLM (\texttt{inference-vllm:v2.0.8}, CUDA cu121 lineage). bf16/eager is pinned for
fair fast-paths, but they are different vLLM forks/versions, so near-parity
reflects silicon \emph{plus} engine, not silicon alone. (b) \emph{host-CPU
mismatch} --- Intel host i7-10700 (8c) vs NVIDIA Xeon E5-2680 v4 (28c); irrelevant
for GPU-resident kernels but a live confound for the host-sync-bound
\textbf{TabNet} cell specifically (its 2.2$\times$ loss may be partly host-CPU, and
is unaudited).
\item \textbf{QLoRA-4bit now runs on the B70} (\S4.7) --- 0.949 steps/s $\cdot$
242.3 J/step, $\sim$7\,\% faster and 1.89$\times$ more efficient than the 3090,
verified genuinely 4-bit (5.45 GiB on both vendors). The earlier ``fails'' result
was a \textbf{SYCL device-selector mismatch} (\texttt{level\_zero:0} $\rightarrow$
bnb's 4-bit op throws ``No device''; \texttt{*:gpu} fixes it), \textbf{not} a
Triton-XPU gap as first hypothesized.
\item Intel \textbf{does not publish dense BF16 TFLOPS} for Arc Pro B-series; the
compute-economics table uses published INT8 TOPS only.
\item A separate platform improvement was identified (not yet shipped): the node
runtime reads Intel VRAM via \texttt{xpu-smi} (which reports ``No device'' on the
B70) and falls back to accounting; reading \texttt{torch.xpu.mem\_get\_info()}
would give live, accurate Intel VRAM and improve placement.
\item Market-share, pricing, CUDA-EULA, and spec figures in \S2 and \S5 are from
public sources (TechInsights/HPCwire, NVIDIA CUDA EULA, Intel datasheets/newsroom,
vendor spec pages); street prices are volatile and flagged as such.
\end{itemize}

\subsection*{Appendix E --- Sources (public)}
External market, pricing, licensing, and hardware-spec figures in \S2 and \S5 are from public sources; street prices and shipping specs are volatile and were current at the 2026-06-21 measurement date.
\begin{enumerate}\itemsep2pt
  \item NVIDIA data-center GPU market share ($\sim$98\%, 3.76M of 3.85M units, 2023) --- TechInsights, reported via HPCwire.
  \item NVIDIA CUDA End User License Agreement, \S1.2 ``Limitations'' (item 8, on translating CUDA output to non-NVIDIA platforms) --- NVIDIA.
  \item ZLUDA (CUDA-on-non-NVIDIA) project takedown at AMD's request, August 2024 --- project repository and press coverage.
  \item Blackwell supply (``sold out $\sim$12 months ahead,'' Oct 2024) and per-GPU pricing (\$30,000--40,000, Jensen Huang) --- NVIDIA management remarks and press.
  \item Intel Arc Pro B-series (Battlemage), Project Battlematrix (8$\times$B60 $\to$ 192 GB), and the llm-scaler software stack --- Intel datasheets and newsroom.
  \item bitsandbytes Arc / XPU 4-bit support --- bitsandbytes release notes.
  \item Native torch.xpu support timeline (PyTorch 2.5 onward; Battlemage maturing 2.6--2.7) and IPEX standalone EOL ($\sim$March 2026) --- PyTorch and Intel documentation.
  \item GPU specifications (VRAM, memory bandwidth, TBP, INT8 TOPS) for the RTX 3090 and Arc Pro B70/B60/B50 --- vendor spec pages.
\end{enumerate}

\end{document}