Five frontier models on OpenRouter — including Sakana’s fugu-ultra — were each asked to write an original research plan for a hard systems problem: a kernel for batched square compact-Householder QR factorization on the NVIDIA GB10 (Grace Blackwell). Then the latest Gemini judged the plans.
Gemini was given the returned plans blind and asked to score each on novelty, GB10-specificity, depth, numerical rigor and completeness — then to name the strongest and weakest, flag hand-wavy or technically dubious claims, and pick the single best idea.
| Model | Novelty | GB10-specificity | Depth | Numerical-rigor | Completeness |
|---|---|---|---|---|---|
| sakana/fugu-ultra | 8 | 8 | 8 | 7 | 8 |
| openai/gpt-5.5 | 6 | 7 | 7 | 8 | 8 |
| anthropic/claude-opus-4.8 | 10 | 10 | 10 | 10 | 9 |
| z-ai/glm-5.2 | 7 | 7 | 6 | 4 | 7 |
| moonshotai/kimi-k2.7-code | 8 | 9 | 9 | 8 | 9 |
Strongest: anthropic/claude-opus-4.8. This plan demonstrates peerless architectural knowledge, citing specific Blackwell features (e.g., tcgen05 MMA paths, TMEM). It seamlessly solves the most stubborn problem in batched QR (the sequential panel factorization latency) through brilliant, low-level PTX warpgroup scheduling and mathematically sound carried-residual bounds.
Weakest: z-ai/glm-5.2. It completely misunderstands hardware latency regimes by proposing that the Grace CPU should factor the panel while the GPU does the update. For tiny/small matrices, this means paying Grace-Blackwell NVLink-C2C round-trip latency at a higher cost than the actual matrix math, destroying the kernel.
z-ai/glm-5.2 proposes offloading the unblocked panel factorization of small matrices ($N=32-256$) to the Grace CPU over unified memory. The panel operations for these sizes take mere hundreds of nanoseconds; the CPU-GPU coherence protocol and NVLink-C2C latency will massively bottleneck the pipeline. z-ai/glm-5.2 suggests an "Error-Free Transform (EFT)" using BF16 inputs with FP32 accumulation. Standard EFTs (like Ozaki or Dekker-Veltkamp) require exact representation of residuals via mantissa splitting. BF16 only has a 7-bit mantissa; using it for an EFT scheme is mathematically invalid and will not reliably preserve precision.openai/gpt-5.5 mentions all the correct hardware features (TMA, TBC, DSMEM) but largely fails to explain how they map to the exact data flows at the instruction level, giving it a textbook-like, hand-wavy feel compared to the dense technical reality found in plans 3 and 5.Cross-batch software-pipelined panel/update interleave, proposed by anthropic/claude-opus-4.8.
Standard batched QR suffers from low GPU utilization because the panel factorization is an inherently serial, latency-bound process that leaves powerful Tensor Cores idle. By explicitly mapping the active matrix's panel generation to the cluster's default CUDA Cores (WG-panel) while simultaneously executing the previous matrix's trailing matrix update on the Blackwell tcgen05 Tensor Cores (WG-update) using disjoint warpgroups and TMEM, this idea entirely hides the sequential panel latency behind a throughput-bound GEMM.
Each model’s full plan, as returned (capped at 20K tokens including reasoning). Click to expand.
As a GPU kernel research lead, I propose the following end-to-end research plan for designing a next-generation batched square compact-WY Householder QR factorization kernel.
Standard cuSOLVER and MAGMA implementations treat batched QR either as a loop over monolithic GPU routines or rely on naive warp-stride loops for small matrices. Both approaches suffer from severe register pressure, synchronization overhead, and vast underutilization of modern AI-focused Streaming Multiprocessors (SMs). To achieve speed-of-light performance on the NVIDIA GB10 (Grace Blackwell) Superchip, we must fundamentally abandon legacy flat arrays and level-2 BLAS panel approaches.
Here is the staged research plan to build a GB10-native batched QR engine built around Tensor Memory Accelerator (TMA), Thread Block Clusters (TBC), Distributed Shared Memory (DSMEM), NVLink-C2C, and Warp-Group Matrix-Multiply-Accumulate (WGMMA) instructions.
Objective: Define the target regime (thousands of $N \times N$ matrices, $N \in [32, 1024]$) and design a data layout that eliminates memory-coalescing bottlenecks.
[Macro_Batch, I_tile, J_tile, Intra_Batch_Lane]. By interleaving the batch dimension at the innermost level (e.g., chunks of 16 matrices packed together), we allow Blackwell's TMA to issue bulk asynchronous 3D memory fetches directly into shared memory. The hardware TMA handles address calculation and bounds checking, bringing identically positioned tiles from multiple independent matrices into the SM simultaneously.Objective: Redesign the panel factorization and TMU to bypass the traditionally sequential Householder dependencies.
Objective: Maximize TMU throughput by exploiting Blackwell's Tensor Cores and warp specialization while maintaining strict HPC numerical fidelity.
cp.async.bulk instructions to fetch the next batch-tile of $A$ into DSMEM. Consumer warps purely execute matrix math. This decouples memory latency from math pipelines, maximizing WGMMA utilization even under immense register pressure.Objective: Leverage the 900 GB/s cache-coherent Grace-Blackwell interconnect to handle control flow and outliers without stalling the massive Blackwell GPU.
NEED_HELP status word. The Grace CPU, monitoring via coherent memory, instantly acts as a sidecar. Grace can intercept that specific matrix, run a high-precision CPU check, or enqueue it for a stricter native-FP64 GPU kernel. This avoids globally slowing down the entire batch for a few pathological matrices.Objective: Create a dynamic execution environment that adapts to batch sizes, condition numbers, and thermal throttling.
Objective: Rigorously prove that the novel mixed-precision and asynchronous kernel is numerically sound.
Objective: Quantify the exact performance gains of the GB10-native design.
cusolverDnDgeqrfBatched, MAGMA magma_dgeqrf_batched, and a highly optimized Hopper-generation implementation.Conclusion: This research plan moves completely away from porting legacy monolithic LAPACK routines. By exploiting Blackwell's Tensor Cores for HPC-grade math via emulation, restructuring data for 4D TMA loads, caching compact-WY factors in DSMEM across TBCs, and relying on Grace as an active asynchronous sidecar over NVLink-C2C, we will deliver a batched QR kernel operating at the absolute architectural limits of the GB10 platform.
We need factor many independent square matrices:
\[ A_i \in \mathbb{R}^{n \times n}, \quad i = 0,\dots,B-1 \]
into
\[ A_i = Q_i R_i \]
using blocked Householder QR with compact-WY representation:
\[ Q_{\text{panel}} = I - Y T Y^T \]
where Y contains Householder vectors and T is the compact upper-triangular block reflector factor.
The target is batched square matrices, likely in the range:
n <= 3232 < n <= 128128 < n <= 512512 < n <= 1024, if batch size is large enoughThe kernel family should outperform library-style batched QR by exploiting fixed square sizes, batching regularity, in-kernel compact-WY fusion, Blackwell tensor cores, TMA, thread-block clusters, and Grace–Blackwell unified memory.
For batched square QR, the main costs are:
n.n.\[ C \leftarrow C - Y T^T (Y^T C) \]
I would use a blocked Householder QR with compact-WY panel representation.
For each panel starting at column k with block size b:
\[ A_{k:n, k:k+b-1} \rightarrow Y_k, T_k, R_{kk} \]
\[ A_{k:n, k+b:n} \leftarrow A_{k:n, k+b:n} - Y_k T_k^T \left(Y_k^T A_{k:n, k+b:n}\right) \]
The blocked Householder method is the right baseline because:
RYtau and panel-local T stored separatelyI would not simply implement a cuSOLVER-style panel plus batched GEMM pipeline. The main research novelty would be the following.
Standard compact-WY update forms:
\[ G = Y^T C \]
then
\[ G \leftarrow T^T G \]
then
\[ C \leftarrow C - YG \]
This requires at least two GEMM-like operations and a temporary G.
Instead, for small square batched matrices, I would design a single persistent cluster kernel that:
Y and T into shared memory or distributed shared memory.C using TMA.G_tile = Y^T C_tile using tensor cores.T^T to G_tile in registers or shared memory.Y G_tile.C_tile.The temporary G_tile never goes to global memory.
This should reduce:
This is the central kernel-design idea.
For medium-small matrices, one thread-block cluster processes one matrix panel and several trailing tiles.
Within the cluster:
TY and T are placed in distributed shared memoryThis creates a QR pipeline inside a cluster:
panel k factorization
↓
broadcast Y_k, T_k through DSM
↓
fused compact-WY update of trailing tiles
↓
next panelFor n <= 256, the entire active submatrix may fit in cluster-local shared memory or DSM, enabling very low global-memory traffic.
Instead of accepting ordinary strided matrices as the compute layout, I would introduce a GPU-side packing layer that converts batches into a tile-major structure-of-arrays layout:
A_packed[tile_row][tile_col][batch_group][tile_element]For example, with 16 x 16 tiles:
A_packed[tr][tc][g][i][j]where g indexes a group of matrices processed together.
This gives:
For users requiring LAPACK-like storage, we provide pack and unpack kernels. For high-throughput batched workloads, we expose the packed layout as the native API.
The kernel will not use one fixed precision mode. It will dynamically select among:
tau, T, and panel inner products in FP64The precision mode is chosen using lightweight panel diagnostics:
Instead of one kernel launch per panel or relying on batched GEMM calls, I would implement a persistent scheduler:
n, precision mode, and layout.This avoids load imbalance when matrices have different convergence paths or precision escalation decisions.
For each matrix:
A stores R1tau[k] stores scalar reflector coefficientT_panel[p] stores compact-WY triangular factor for each panelFor panel size b, panel p stores:
\[ T_p \in \mathbb{R}^{b \times b} \]
packed as an upper triangular block.
For a panel beginning at column k:
For j = 0 ... b-1:
\[ x = A_{k+j:n, k+j} \]
\[ \|x\|_2 \]
\[ H_j = I - \tau_j v_j v_j^T \]
v_j below the diagonal, implicit one at the diagonal.After all b reflectors are generated, construct T:
\[ T_{jj} = \tau_j \]
and for i < j:
\[ T_{ij} = -\tau_j \left[ T_{i:j-1, i:j-1} \left( Y_{:, i:j-1}^T v_j \right) \right]_i \]
depending on the chosen convention.
The block reflector is:
\[ H_{\text{panel}} = I - Y T Y^T \]
For applying to the trailing matrix from the left:
\[ C \leftarrow \left(I - Y T^T Y^T\right) C \]
n <= 32Use one warp per matrix.
T generated in registers.This regime prioritizes latency and batch throughput.
32 < n <= 128Use one CTA per matrix.
b = 8, 16, or 32.128 < n <= 512Use one thread-block cluster per matrix.
Y and T stored in distributed shared memory.n > 512Use one or more clusters per matrix.
k, another begins preparing panel k+b once dependencies are satisfied.This regime is secondary because the target is batched small square QR, but the design should scale.
Blackwell tensor cores should be used primarily for:
\[ Y^T C \]
and
\[ Y G \]
in the compact-WY trailing update.
For FP32-like QR:
For lower precision:
For very low precision:
For FP64:
Tensor Memory Accelerator will be used for:
YThe fused update kernel pipeline:
TMA load Y/T panel tile
TMA load C tile stage 0
compute G = Y^T C
TMA load C tile stage 1
compute T^T G
compute C -= YG
TMA store C tile stage 0
repeatUse asynchronous transaction barriers to overlap:
Clusters are useful for n >= 128.
Cluster layout example:
clusterDim = 4 CTAs
CTA 0: panel factorization / T construction
CTA 1: trailing tile columns 0..m
CTA 2: trailing tile columns m..2m
CTA 3: trailing tile columns 2m..3mShared state:
Y_panelT_panelPanel data is placed in DSM so update CTAs can access it without repeated global loads.
TMA multicast can distribute panel data to all CTAs in a cluster.
The design should explicitly trade off:
For tiny matrices, occupancy is dominated by CTA count and scheduling overhead.
For medium matrices, the goal is not maximum occupancy but maximum sustained tensor-core utilization while keeping panel latency hidden.
I would build an occupancy model:
\[ T_{\text{panel}} + T_{\text{update}} + T_{\text{sync}} + T_{\text{memory}} \]
and tune cluster size so that:
The Grace CPU should be used for:
The GPU should own matrix data during factorization.
Use unified memory carefully:
For streaming workloads:
Grace CPU prepares batch group g+1
Blackwell GPU factors batch group g
Grace CPU validates or consumes batch group g-1A_i[row + col * lda + i * stride]Pros:
Cons:
Aarray[i] -> A_iThis should be supported only as an input compatibility mode.
It is poor for performance because of pointer chasing and irregularity.
For high performance:
A_packed[tile_row][tile_col][batch_group][tile_i][tile_j]Example with 16 x 16 tiles and batch group size G:
A_packed[tr][tc][g][16][16]or vectorized:
A_packed[tr][tc][tile_i][tile_j][g]depending on whether the kernel wants matrix-contiguous or batch-contiguous access.
For tiny matrices, the batch dimension should be innermost so one warp can process multiple matrices with coalesced loads.
For medium matrices, tile elements should be contiguous for TMA.
I would use a diagonal-panel swizzle:
Matrices are grouped so that panels with the same k are contiguous in memory:
batch group:
all matrices, panel 0 tiles
all matrices, panel 1 tiles
all matrices, panel 2 tilesThis makes QR panel progression naturally coalesced across the batch.
For very large batches, the scheduler can process all matrices at panel k together, improving instruction locality and TMA descriptor reuse.
I would implement several specialized kernels rather than one universal kernel.
For n <= 32.
For 32 < n <= 128.
For 128 < n <= 512.
Y and TFor n > 512.
For a trailing tile:
\[ C_t \in \mathbb{R}^{m \times w} \]
where:
m = n-kw is tile widthY \in R^{m x b}T \in R^{b x b}Compute:
\[ G = Y^T C_t \]
\[ G = T^T G \]
\[ C_t = C_t - YG \]
Implementation sketch:
load Y tile into shared/DSM
load T into registers/shared
for each trailing C tile:
TMA load C tile into shared
MMA: G = Y^T C
small triangular multiply: G = T^T G
MMA: U = Y G
C = C - U
TMA store CG is small: b x w.
For b = 16 and w = 32, G has only 512 elements, feasible in registers/shared.
The key is that G is never written to global memory.
tau in FP32T in FP32tau and T in FP64\[ C \leftarrow C - \Delta_{\text{low}} \]
then compute a small residual estimate
\[ E = YT^TY^TC - \widehat{\Delta} \]
and apply a correction when needed.
At initialization, benchmark microkernels:
Then choose:
if native FP64 update is efficient:
use native FP64 for strict mode
else:
use FP64 panel + emulated or compensated updateEmulated FP64 update options:
hi + lo FP32 pairThe goal is not universal FP64 emulation, but adaptive use where strict accuracy is requested.
For each n, batch size, layout, and precision mode:
b4, 8, 16, 24, 32, 48, 64w1, 2, 4, 8 CTAsC tilesG temporaryRun on target GB10 hardware:
(n, dtype, batch_size_range, layout, accuracy_mode)At runtime:
n, batch size, layout, requested tolerance.The autotuner should be guided by a cost model.
Panel cost approximately:
\[ O(n b^2) \]
Trailing update cost per panel:
\[ O(2(n-k)b(n-k-b)) \]
Total blocked QR cost:
\[ \frac{4}{3}n^3 + O(n^2 b) \]
For each panel, estimate:
\[ T = \max(T_{\text{TMA}}, T_{\text{TC}}) + T_{\text{panel}} + T_{\text{sync}} \]
Use this to prune parameter combinations before exhaustive tuning.
For each matrix:
\[ \frac{\|A - QR\|_F}{\|A\|_F} \]
\[ \|I - Q^T Q\|_F \]
\[ \frac{\|\operatorname{tril}(R, -1)\|_F}{\|A\|_F} \]
R diagonal.Use diverse matrices:
Condition numbers:
\[ \kappa(A) = 10^0, 10^2, 10^4, 10^8, 10^{12}, 10^{16} \]
depending on precision mode.
For FP32 mode, expected bounds:
\[ \frac{\|A - QR\|}{\|A\|} = O(n u_{\text{fp32}}) \]
\[ \|I - Q^TQ\| = O(n u_{\text{fp32}}) \]
For mixed TF32/FP32 mode, require looser but explicit bounds.
If diagnostics exceed thresholds:
The factorization stores compact-WY data. Validation needs explicit or implicit Q application.
Implement:
ormqr_batched_compact_wyform_q_batchedFor benchmarking QR only, avoid explicit Q construction unless required.
Compare against:
nMeasure:
Sweep:
n = 8, 16, 24, 32, 48, 64, 96, 128, 192, 256, 384, 512, 768, 1024B = 1, 8, 32, 128, 1024, 8192, 65536Ablate each proposed feature:
The goal is to quantify exactly where performance comes from.
Deliverables:
APIs:
gb10_batched_geqrf(
A,
lda,
stride_or_layout,
tau,
T,
n,
batch_count,
precision_mode,
workspace,
stream);and packed-layout variant:
gb10_packed_batched_geqrf(
A_packed,
tau_packed,
T_packed,
n,
batch_count,
config,
stream);Implement:
T constructionPurpose:
Implement:
n <= 32n <= 128Expected outcome:
Implement the novel fused update:
\[ C \leftarrow C - YT^T(Y^TC) \]
without global-memory G.
Versions:
Measure:
Implement:
Evaluate:
Implement:
Y and TPrimary target:
n = 128, 192, 256, 384, 512Expected outcome:
Implement precision modes:
Validation:
Implement:
Tuning keys:
n
batch_count
dtype
layout
precision_mode
GB10 SKU capabilityPerform complete benchmark suite.
Outputs:
The most important expected contributions are:
This is not merely a reimplementation of cuSOLVER or MAGMA. The central design principle is to treat batched compact-WY QR as a single fused, layout-aware, cluster-resident streaming computation rather than a sequence of panel kernels and batched GEMMs.
Project codename: cwy-batch-gb10 Scope: Many independent square matrices, A_k ∈ ℝ^{n×n} (and complex variant), n ∈ {8 … 512}, batch counts 10³–10⁶, factorize A_k = Q_k R_k via blocked compact-WY Householder, returning {V_k, T_k, R_k}.
The standard cuSOLVER/MAGMA recipe for batched QR is: one CTA per matrix, unblocked or lightly-blocked Householder in shared memory, FP32/FP64 throughout, dot-products via warp shuffles. That design is latency-bound on the panel, leaves tensor cores idle, and ignores Blackwell's memory/cluster machinery.
My plan rests on five novel design pillars, developed in detail below:
| Regime | n | Strategy | Matrix residency |
|---|---|---|---|
| T (tiny) | 8–32 | 1 warp or sub-warp per matrix, fully register-resident, batched over lanes | registers |
| S (small) | 32–128 | 1 CTA per matrix, SMEM-resident, warp-cooperative panel + small MMA | shared memory |
| M (medium) | 128–512 | 1 cluster per matrix, DSMEM-resident, tcgen05 MMA trailing update | distributed SMEM |
| L (large/few) | >512 | fall back to streamed per-matrix blocked QR (out of primary scope) | HBM/L2 |
Build a GB10-specific roofline (FP32, BF16, FP8, emulated-FP64) early (Milestone 0). Establish the break-even arithmetic intensity where tensor-core trailing update beats memory traffic. This number sets the minimum nb and minimum n for which [NOVEL-2]/[NOVEL-3] pay off.
Panel of nb columns starting at column j. For each local column i:
Q_panel = H_1⋯H_{nb} = I − V T Vᵀ.
GB10's GPU is Blackwell-class but consumer-tier FP64 (native FP64 throughput ~1/32–1/64 of FP32, no high-rate FP64 tensor path). Yet many target apps want FP64-quality R. So the design must deliver double-precision accuracy without relying on native double-precision throughput. Two complementary mechanisms:
For the "I really need FP64" path:
tcgen05 MMA path: operands staged in TMEM, supporting FP8/FP6/FP4, BF16, TF32, and INT.cp.async.bulk.tensor for matrix k+1 while computing on k (prefetch pipeline depth ≥2).mapa/remote SMEM access + cluster.sync, avoiding global memory and avoiding serializing the reduction in a single CTA.Within a CTA (regime S) or cluster (regime M), split warpgroups into:
Because panel and update use different functional units, they overlap with near-zero contention. A persistent-kernel ring buffer of in-flight matrices keeps both stages busy; mbarrier/named-barrier handoff transfers V,T from WG-panel to WG-update. This directly attacks the classic batched-QR bottleneck (panel serial chain) by filling it with a neighbor's GEMM rather than idling.
cudaMemAdvise/prefetch hints to keep working set GPU-resident while staging next batch from LPDDR.__shfl_xor butterfly reductions.atomicAdd ticket) over the batch; cluster/CTA grabs next matrix → enables [NOVEL-1] ring pipeline and load balancing across heterogeneous n if batches are mixed.M0 (weeks 1–3) — Characterization & infra. GB10 roofline (all precisions), microbenchmark harness, baselines (cuSOLVER/MAGMA + naïve), reference + error-metric framework. Gate: validated roofline and baseline numbers.
M1 (weeks 4–7) — Regime-S core, FP32. CTA-per-matrix SMEM compact-WY with fused-T; correctness vs LAPACK. Establishes a working, accurate baseline of our design. Gate: matches baseline accuracy, ≥ parity throughput.
M2 (weeks 8–11) — [NOVEL-1] cross-batch pipeline + persistent kernel. Warpgroup split, ring buffer, mbarrier handoff. Gate: ablation A1 shows measurable small-n speedup (target ≥2× over naïve at n≤64).
M3 (weeks 12–16) — [NOVEL-2] PA-WY mixed precision + TMEM reuse. Precision schedule, carried residual, tensor-core trailing update; error-bound theory + empirical confirmation. Gate: FP64-level backward error at low-precision throughput on κ≤10⁶.
M4 (weeks 17–21) — Regime M [NOVEL-4] cluster/DSMEM + TMA multicast. Recursive compact-WY, tcgen05 path. Gate: ablation A4 win for n≥128.
M5 (weeks 22–25) — [NOVEL-3] emulated-FP64 (Ozaki) hybrid. Slice-adaptive trailing GEMM + native-FP64 panel. Gate: exceed native-FP64 throughput at equal accuracy (ablation A3).
M6 (weeks 26–29) — Regime T register-batched + MMA packing + layouts [NOVEL-6]. Gate: tiny-n bandwidth-bound efficiency (≥80% achievable BW).
M7 (weeks 30–34) — [NOVEL-5] autotuner + unified-memory CPU producer (C2C). Cost model, config DB, end-to-end overlap. Gate: autotuner within X% of exhaustive best; A7 end-to-end win.
M8 (weeks 35–38) — Full validation, ablations, paper/library. Pareto fronts, energy, reproducibility; package with stable ABI.
mma.sync fallback; abstract MMA behind an internal interface.I'd anchor the publishable claim on the combination of (1)+(2)+(3): double-precision-accurate batched QR at low-precision tensor-core throughput on a platform with weak native FP64, with the cross-batch pipeline removing the panel-latency wall that has historically capped batched-QR efficiency.
Here is a comprehensive research plan for designing a high-performance, batched compact-WY Householder QR factorization kernel optimized for the NVIDIA GB10 (Grace-Blackwell) platform.
Batched QR factorization processes many independent, small-to-medium dense square matrices ($n \in [32, 256]$). Standard LAPACK implementations are memory-bound due to Level-2 BLAS unblocked updates. The compact-WY representation reformulates Householder QR into a blocked algorithm: $A \leftarrow (I - Y T Y^T) A$. The trailing matrix update requires computing $W = Y^T A$, updating $Z = A - Y T W$, and reassigning $A = Z$. This exposes GEMM-like operations, making it amenable to Tensor Cores, but the small matrix sizes make standard cuBLas/cuSOLVER per-kernel launch overheads prohibitive.
Standard batched layouts (A[batch][row][col]) cause TMA fetch inefficiencies for small matrices due to boundary misalignment.
> [NOVEL IDEA 1] Batch-Interleaved TMA-Aligned Layout (BITAL) > Instead of storing matrices contiguously, matrices are interleaved at the block level. We group $b$ matrices into a 3D tensor of shape (rows, cols, b), padding $b$ such that the innermost dimension perfectly aligns with TMA swizzle patterns. If $b$ is insufficient, we use a "2.5D" layout where the $b$ dimension is flattened into the row dimension with specific power-of-2 padding. This allows TMA to issue bulk loads for multiple matrices simultaneously, maximizing memory bandwidth per transaction.
We use the standard compact-WY formulation but redesign the execution model to fit GB10's cluster architecture.
> [NOVEL IDEA 2] Asynchronous Cluster-Pipelined WY (ACPW) > We map the execution of a single batched matrix (or a small sub-batch) to a single Thread Block Cluster. > - Block 0 (The Panel Block): Responsible for unblocked Householder generation to form the $Y$ matrix and the upper-triangular $T$ matrix. Computes entirely in registers/DSMEM. > - Blocks 1 to N (The Update Blocks): Responsible for the trailing matrix update ($A \leftarrow A - Y T (Y^T A)$). > Once Block 0 computes a block of $Y$, it asynchronously broadcasts $Y$ and $T$ via DSMEM (distributed shared memory) to Blocks 1-N using cluster-level async copy. Blocks 1-N immediately begin the $Y^T A$ GEMM via TMA while Block 0 computes the next panel. This creates a fine-grained, cluster-wide producer-consumer pipeline that completely hides the Householder generation latency behind Tensor Core GEMM execution.
Pure FP64 compact-WY is slow; pure FP16 is numerically unstable. We will exploit Blackwell Tensor Cores via a mixed-precision strategy.
> [NOVEL IDEA 3] TC-Accelerated WY via Error-Free Transforms (TC-EFT) > 1. Panel Factorization: Compute $Y$ and $T$ in FP32 (using standard scalar/vector ALUs). $Y$ is bounded in magnitude, preventing overflow. > 2. Trailing Update: The matrices $Y$, $A$, and $W$ are cast to BF16. The core operations $W = Y^T A$ and $Z = Y(W^T T)^T$ are executed using BF16 Tensor Cores with FP32 accumulation. > 3. Error Compensation: To guarantee numerical stability equivalent to FP64, we compute a single error-free compensation vector $E = Y_{FP32} \cdot A_{FP32} - W_{BF16}$. The final update applies $Z = A - Y T W + Y T E$. Because $E$ is small, $T E$ can be computed on standard FP32 units with negligible overhead, while the bulk of the FLOPs ($Y T W$) benefit from the massive BF16 Tensor Core throughput.
For batches requiring strict FP64 compliance (e.g., quantum chemistry), we fall back to Native FP64 Tensor Cores. However, to maintain occupancy, we will execute the FP64 GEMMs using smaller MMA fragments (e.g., m16n8k16) and overlap them with FP64 scalar Householder dot-products.
When batch sizes exceed GPU memory, or when dealing with extremely large batches of small matrices, standard PCIe transfers become a bottleneck.
> [NOVEL IDEA 4] CPU-GPU Overlapped Panel Factorization (Grace-Assisted WY) > Because GB10 features UMA and NVLink-C2C, we can pipeline work across the Grace CPU and Blackwell GPU. > - The Grace CPU uses its NEON SIMD units to compute the unblocked panel factorization ($Y$, $T$) for batch $i+1$ directly in unified memory. > - The Blackwell GPU fetches the $Y$ and $T$ matrices via TMA from the unified memory space and executes the heavy trailing matrix update (GEMM) for batch $i$. > - This achieves perfect compute-overlap: the CPU handles the sequential, latency-bound Householder generation, while the GPU remains 100% utilized on throughput-bound Tensor Core GEMMs. No explicit cudaMemcpy is required due to C2C coherence.
cp.async.bulk PTX instruction for TMA to load $A$ directly into shared memory.mbarrier::arrive_and_wait for asynchronous warp execution to ensure no warps block waiting for data.Given the non-linear scaling of Blackwell Tensor Cores with matrix sizes, a static configuration is suboptimal.
> [NOVEL IDEA 5] Topology-Aware CuTe Autotuner > We will implement the kernel using CUTLASS 3.x CuTe layouts, allowing for compile-time and JIT-time autotuning. The autotuner will sweep over: > - Cluster size (1 to 8 blocks). > - TMA tile shapes (e.g., 64x128 vs 128x128). > - Precision paths (FP64 TC, TC-EFT, Pure FP32). > - CPU-GPU overlap toggle (Grace-Assisted mode). > The cost model will not just measure time-to-solution, but will query the TMA throughput and Tensor Core utilization via hardware performance counters (using CUPTI), generating a Pareto frontier of latency vs numerical accuracy.
cublasDgetrfBatched (if applicable), MAGMA batched QR, and standard Level-3 BLAS WY.# Research Plan: BlackQR-B10 – Batched Square Compact-WY Householder QR on NVIDIA GB10 (Grace Blackwell)
Codename: BlackQR-B10 Target: batched square QR, $A^{(b)} = Q^{(b)}R^{(b)}$, $A^{(b)}\in\mathbb{R}^{n\times n}$, $b=0\ldots B-1$, $n$ roughly 32–512, batch size $B$ from thousands to millions. Platform: NVIDIA GB10 SoC (Grace CPU + Blackwell GPU, NVLink-C2C, Tensor Cores Gen 5, TMA, thread-block clusters / distributed shared memory).
This plan deliberately deviates from the standard cuSOLVER/MAGMA recipe of “one small kernel per matrix + repeated cuBLAS GEMM calls” or “one CTA per matrix with scalar Level-2 updates.” The core thesis is that on GB10 a single persistent cluster-wide kernel can keep the whole small matrix in distributed shared memory (DSM), fuse the compact-WY panel factorization and the trailing-matrix update, and accelerate the trailing update with Blackwell warp-group MMA (tensor cores) and, where acceptable, block-scaled FP8/FP16 mixed precision.
| # | Novelty | What it changes vs. standard recipes |
|---|---|---|
| N1 | Cluster-resident whole-matrix execution – one thread-block cluster owns one matrix; the matrix lives in DSM for the lifetime of the factorization. | Eliminates almost all intermediate global-memory traffic; removes per-step kernel launches; replaces pointer-array batching with a device-side work queue. |
| N2 | Tensor-core compact-WY trailing update – the rank-$\mathtt{nb}$ update $A\leftarrow A-VW$ and the Gram piece $V^{\!T}A$ are performed with warp-group MMA, including custom micro-kernels for narrow $K=\mathtt{nb}$. | Standard libraries fall back to CUDA-core GEMM or scalar Householder updates; we treat the trailing step as a small-K fused GEMM pipeline. |
| N3 | Block-scaled FP8/FP16 trailing update with FP64-scalar panel – panel scalars ($\beta$, $\tau$, $T$) are kept in FP64 (or compensated FP32), while bulk rank updates use Blackwell FP8/FP16 Tensor Cores with micro-scaling and FP32 accumulation. | New accuracy/performance trade-off for small dense QR, enabled by Blackwell’s micro-scaled FP8 MMA. |
| N4 | Batched tile-major (BTM) layout + TMA 5-D tensor maps – each matrix is stored as tiles, with batch as the outer dimension; TMA loads/stores full cluster stripes and multicasts the panel. | Better coalescing, TMA descriptor reuse, and direct cluster-to-DSM transfer. |
| N5 | Grace-Blackwell unified-memory control plane – the Grace CPU builds TMA descriptors, pre-stages data via NVLink-C2C, and maintains a lock-free GPU work queue in shared system memory; the GPU is purely a data-plane cluster engine. | Overlaps CPU bookkeeping with GPU compute and removes launch-boundary bottlenecks for enormous batches. |
| N6 | Precision-aware autotuner – a CUTLASS-style kernel generator searched over panel width, tile shape, cluster geometry, register/shared split, and precision mode, constrained by a validation error budget. | Adapts kernel to matrix size and numerical tolerance, not just to FLOPs. |
cusolverDn<t>geqrf_batched / MAGMA magma_dgeqrf_batched.We will profile all three to establish an absolute speed-up target (goal: 3–10× for the sweet-spot sizes) and to calibrate numerical error baselines.
For one matrix, partition the right-looking factorization into panels of width $\mathtt{nb}$. For panel $k$, let
\[ V_k\in\mathbb{R}^{(n-k\mathtt{nb})\times \mathtt{nb}},\qquad T_k\in\mathbb{R}^{\mathtt{nb}\times \mathtt{nb}} \]
be the compact-WY factors such that the accumulated Householder reflectors satisfy
\[ H_k = I - V_k T_k V_k^{\!T}. \]
$T_k$ is built by the Schreiber–Van Loan recurrence:
\[ T_1 = [\tau_1],\qquad T_j = \begin{bmatrix} T_{j-1} & -\tau_j\,T_{j-1}V_{j-1}^{\!T}v_j\\[2pt] 0 & \tau_j \end{bmatrix}. \]
$V_k$ is stored in-place in the strict lower triangle of the current panel; $R$ overwrites the upper triangle. The block reflector $T_k$ is stored in a small per-matrix workspace (or packed into an auxiliary tensor).
For the trailing submatrix $A_{22}=A[k\mathtt{nb}:n,\;k\mathtt{nb}+\mathtt{nb}:n]$, the update is
\[ A_{22} \;\leftarrow\; (I - V_k T_k^{\!T} V_k^{\!T})\,A_{22} \]
(assuming the convention $H_1\cdots H_{\mathtt{nb}}=I-VTV^{\!T}$). Algorithmically:
\[ Z = V_k^{\!T}A_{22},\qquad W = T_k^{\!T}Z,\qquad A_{22}\leftarrow A_{22} - V_k W. \]
This is the main computational loop. It is not a generic GEMM: $V_k$ has height $m$ but width only $\mathtt{nb}\le64$, and $T_k$ is tiny. We will custom-build warp-group MMA kernels for this narrow-$K$ case instead of calling cuBLAS.
\[ m_r = \left\lfloor \frac{n+k_r}{R}\right\rfloor \] so that distributed shared memory (DSM) collectively holds the whole trailing submatrix.
Why it fits GB10: Blackwell clusters expose direct DSM load/store plus cluster.sync(). Because $n$ is small, the entire active submatrix (or even the whole matrix, for $n\lesssim 384$ in FP32) fits in aggregated cluster shared memory. The factorization then becomes a shared-memory algorithm with global memory only at load/store boundaries.
Within a cluster:
\[ \beta_j = \operatorname{sign}(x_{j0})\|x_j\|_2,\qquad \tau_j = \frac{\beta_j - x_{j0}}{\beta_j},\qquad v_j = \frac{x_j}{\beta_j - x_{j0}}, \] and broadcasts $\tau_j$ and the pivot value through DSM.
GB10 details:
cluster.sync() as the memory fence between generation and broadcast.For each panel, the trailing submatrix $A_{22}$ is distributed by rows across the CTAs. The update is performed column-tile by column-tile so that the small $Z$ and $W$ blocks remain in registers:
For each column tile $J$ of width $w$ ($w\approx 16$–$64$):
\[ Z_r = V_{k,r}^{\!T} A_{22,r}^{(:,J)} \in \mathbb{R}^{\mathtt{nb}\times w} \] using warp-group wgmma with $M=\mathtt{nb}$, $N=w$, $K=m_r$.
cluster.sync() + a logarithmic warp reduction.\[ W = T_k^{\!T} Z \in \mathbb{R}^{\mathtt{nb}\times w} \] (small triangular multiply in registers) and multicasts $W$ to all CTAs.
\[ A_{22,r}^{(:,J)} \leftarrow A_{22,r}^{(:,J)} - V_{k,r}\,W \] again with wgmma, accumulating in FP32.
This design keeps the only cross-CTA data per column tile as two small $\mathtt{nb}\times w$ register blocks, minimizing DSM traffic. It also avoids materializing the full $Z$ and $W$ matrices.
MMA tile selection on Blackwell:
wgmma shapes such as $M64N8K16$ (FP16/BF16) or $M64N8K32$ (FP8), depending on the measured Blackwell instruction set.wgmma.mma_async and wgmma.wait_group. \[ \mathcal{A}[B,\,\text{tile}_i,\,\text{tile}_j,\,y,\,x] \] to global memory. It cp.async.bulk.tensors only its row stripe into its local shared memory; the cluster mapping makes the union the full matrix in DSM.
evict_last to avoid polluting L2 with one-off matrix writes.| Component | Proposed precision | Rationale |
|---|---|---|
| Panel scalars ($\beta,\tau$, $T$) | FP64 (or compensated FP32 fallback) | Critical for orthogonality; cheap because $O(n\mathtt{nb}^2)$. |
| Vectors in $V$ and $A$ storage | FP32 or FP16/FP8 | Bulk data; determines memory footprint. |
| Tensor-core accumulation | FP32 | Standard safe MMA accumulator. |
| Trailing update | FP32 default; optional FP16/BF16 or FP8 with block scaling | FP8 only where error budget allows. |
Block-scaled FP8 algorithm for the trailing update:
If FP64 throughput on GB10 is lower than FP32, the panel still dominates very little time; for $n\ge128$ the trailing update is $>90\%$ of FLOPs, so tensor cores dominate wall time.
head, tail) mapped in shared system memory via NVLink-C2C.Store matrix $b$ as tiles of size $(t_M,t_N)$. The linear index is
\[ \text{off}(b,i,j,y,x) = b\cdot L + \bigl(i\cdot N_{\text{tiles},j}+j\bigr)\cdot t_M t_N + (x\cdot t_M + y), \]
where $i,j$ are tile coordinates and $y,x$ are in-tile coordinates. Choose $t_N=\mathtt{nb}$ if possible so that a panel is a contiguous sequence of whole tiles.
Advantages:
\[ \mathcal{T}[b,\,k,\,i,\,j]\quad 0\le i,j < \mathtt{nb} \] (only upper triangular used).
Build variants via a CUTLASS-style kernel generator; the tunable axes are:
| Axis | Values |
|---|---|
| Panel width $\mathtt{nb}$ | 16, 24, 32, 48, 64 |
| Cluster geometry $C_r\times C_c$ | $1\times2$, $1\times4$, $2\times2$, $1\times8$, $2\times4$ |
| MMA tile $(M,N,K)$ | $64\times8\times\mathtt{nb}$, $64\times16\times\mathtt{nb}$, $128\times16\times\mathtt{nb}$, FP8 variants with doubled $K$ |
| Column-tile width $w$ | 16, 32, 64 |
| Register budget | 96, 128, 160, 192 per thread |
| Precision mode | FP64, FP32, FP16-acc-FP32, BF16-acc-FP32, block-scaled FP8-acc-FP32 |
| Algorithm mode | cluster-resident (full DSM) / streaming tiles |
| Layout | BTM / BCM / column-major |
dgeqrf in FP64, or a multi-precision reference using MPFR for very ill-conditioned cases.\[ \rho_R = \frac{\|A - QR\|_F}{\|A\|_F\,\epsilon}, \qquad \rho_Q = \frac{\|Q^{\!T}Q - I\|_F}{n\,\epsilon}, \] where $Q$ is reconstructed from $V,T$ by explicit multiplication for testing only, and $\epsilon$ is machine epsilon of the tested precision.
randsvd).cusolverDnDgeqrfBatched / single-precision equivalent.magma_dgeqrf_batched.cublas<t>gemm calls for the trailing update plus a custom panel kernel.Run the tuned kernel while systematically disabling each GB10-specific feature:
| Ablation | Variant | What it tests |
|---|---|---|
| A1 | Tensor cores off; CUDA cores only | Value of wgmma |
| A2 | DSM off; all matrix loads/stores go to global memory | Value of cluster-resident execution |
| A3 | TMA off; load/store via cp.async / global pointers | Value of TMA + multicast |
| A4 | Mixed precision off; pure FP32 / FP64 | Value of FP8/FP16 tensor-core path |
| A5 | BTM off; standard pointer-array column-major layout | Value of data-layout transformation |
| A6 | CPU control plane off; GPU builds descriptors/queue | Value of Grace-Blackwell co-scheduling |
| A7 | Vary $\mathtt{nb}$ around tuned value | Sensitivity to algorithmic block size |
| Risk | Mitigation |
|---|---|
| GB10 shared-memory / cluster size smaller than expected, preventing full-matrix residence. | Implement streaming-tile fallback and use analytical model to choose mode per $n$. |
| FP8 dynamic range insufficient for QR. | Provide FP16/BF16 fallback; use block scaling; validate error budgets empirically. |
| Tensor cores underutilized for small trailing submatrices near the end. | Special-case last panels with narrow MMA shapes and early-exit; fuse multiple panels when possible. |
| Cluster-wide reductions become latency bottleneck. | Keep reductions tiny ($\mathtt{nb}\times w$); use warp-level + DSM logarithmic reduction. |
| CPU stager cannot keep GPU fed via C2C. | Overlap staging with compute across batches; add GPU-side layout transform fallback. |
| Mixed precision fails adversarial cases. | Dispatcher falls back to FP64/FP32 when validation error exceeds threshold. |
By the end of this research cycle we expect to deliver:
This plan is intentionally architecture-native rather than library-portable: every design decision is driven by the GB10 Grace–Blackwell hardware capabilities, with portability treated as a secondary concern after performance and robustness are established.
Generation speed, size and cost of each plan.
| Model | tokens / s | total tokens | cost | time |
|---|---|---|---|---|
| FUGU-ULTRA | 66.8 | 40,777 | $0.6724 | 280 s |
| GPT-5.5 | 39.5 | 9,401 | $0.2766 | 233 s |
| CLAUDE OPUS 4.8 | 70.1 | 12,127 | $0.2953 | 167 s |
| GLM-5.2 | 36.4 | 4,345 | $0.0126 | 113 s |
| KIMI K2.7 | 77.9 | 17,154 | $0.0594 | 217 s |
| TOTAL | 60.1 avg | 83,804 | $1.3163 | 1011 s |
+ judge (GEMINI 3.1 PRO): 30,019 tokens · $0.0984 · 35 s. Plans were generated concurrently, so wall-clock was far less than the summed time.