● NEO-TOKYO RESEARCH NODE // OPENROUTER← index

研究テスト／カーネル設計

フグ // QR
a kernel-research test

Five frontier models on OpenRouter — including Sakana’s fugu-ultra — were each asked to write an original research plan for a hard systems problem: a kernel for batched square compact-Householder QR factorization on the NVIDIA GB10 (Grace Blackwell). Then the latest Gemini judged the plans.

入力 // PROMPT

You are a GPU kernel research lead. Produce a detailed, ORIGINAL research plan to design and implement a high-performance kernel for BATCHED SQUARE compact-WY Householder QR factorization, targeting the NVIDIA GB10 (Grace Blackwell) platform. Requirements: - Propose genuinely NEW approaches to kernel design — not a restatement of standard cuSOLVER / MAGMA recipes. - Lay out EVERY step you would follow, end to end: problem analysis; algorithm choice (compact-WY / blocked Householder, panel factorization, trailing-matrix update); GB10-specific optimization (Blackwell tensor cores and low/mixed-precision QR, TMA, thread-block clusters / distributed shared memory, FP64 strategy — native vs emulation, Grace–Blackwell unified memory and NVLink-C2C, occupancy); data layout for batching many small square matrices; an autotuning strategy; numerical-stability and correctness validation; and a benchmarking / ablation methodology. - Be concrete and technical. Stage the plan into milestones and clearly mark the novel ideas.

5models prompted

5plans returned

20Ktoken budget · incl. reasoning

審判judged by GEMINI 3.1 PRO

審判 // VERDICT The judge’s comparisonGEMINI 3.1 PRO

Gemini was given the returned plans blind and asked to score each on novelty, GB10-specificity, depth, numerical rigor and completeness — then to name the strongest and weakest, flag hand-wavy or technically dubious claims, and pick the single best idea.

Scorecard

Model	Novelty	GB10-specificity	Depth	Numerical-rigor	Completeness
sakana/fugu-ultra	8	8	8	7	8
openai/gpt-5.5	6	7	7	8	8
anthropic/claude-opus-4.8	10	10	10	10	9
z-ai/glm-5.2	7	7	6	4	7
moonshotai/kimi-k2.7-code	8	9	9	8	9

sakana/fugu-ultra Creative, dynamic execution approach; Grace sidecar is a neat idea but potentially unscalable.

openai/gpt-5.5 Exhaustive and well-structured, but ultimately reads like a safe, generic catalog of GPU buzzwords.

anthropic/claude-opus-4.8 An absolute masterclass; cites precise Blackwell PTX concepts and solves the fundamental bottleneck.

z-ai/glm-5.2 Fundamentally flawed by latency-bound CPU/GPU ping-ponging and mathematically unsound mixed-precision.

moonshotai/kimi-k2.7-code Highly practical and deeply technical, specifically addressing narrow-K Tensor Core micro-kernels.

Strongest & weakest

Strongest: anthropic/claude-opus-4.8. This plan demonstrates peerless architectural knowledge, citing specific Blackwell features (e.g., tcgen05 MMA paths, TMEM). It seamlessly solves the most stubborn problem in batched QR (the sequential panel factorization latency) through brilliant, low-level PTX warpgroup scheduling and mathematically sound carried-residual bounds.

Weakest: z-ai/glm-5.2. It completely misunderstands hardware latency regimes by proposing that the Grace CPU should factor the panel while the GPU does the update. For tiny/small matrices, this means paying Grace-Blackwell NVLink-C2C round-trip latency at a higher cost than the actual matrix math, destroying the kernel.

Generic or dubious

Dubious CPU/GPU Overlap: z-ai/glm-5.2 proposes offloading the unblocked panel factorization of small matrices ($N=32-256$) to the Grace CPU over unified memory. The panel operations for these sizes take mere hundreds of nanoseconds; the CPU-GPU coherence protocol and NVLink-C2C latency will massively bottleneck the pipeline.
Dubious Math: z-ai/glm-5.2 suggests an "Error-Free Transform (EFT)" using BF16 inputs with FP32 accumulation. Standard EFTs (like Ozaki or Dekker-Veltkamp) require exact representation of residuals via mantissa splitting. BF16 only has a 7-bit mantissa; using it for an EFT scheme is mathematically invalid and will not reliably preserve precision.
Generic: openai/gpt-5.5 mentions all the correct hardware features (TMA, TBC, DSMEM) but largely fails to explain how they map to the exact data flows at the instruction level, giving it a textbook-like, hand-wavy feel compared to the dense technical reality found in plans 3 and 5.

Best idea

Cross-batch software-pipelined panel/update interleave, proposed by anthropic/claude-opus-4.8.

Standard batched QR suffers from low GPU utilization because the panel factorization is an inherently serial, latency-bound process that leaves powerful Tensor Cores idle. By explicitly mapping the active matrix's panel generation to the cluster's default CUDA Cores (WG-panel) while simultaneously executing the previous matrix's trailing matrix update on the Blackwell tcgen05 Tensor Cores (WG-update) using disjoint warpgroups and TMEM, this idea entirely hides the sequential panel latency behind a throughput-bound GEMM.

計画 // PLANS The research plans

Each model’s full plan, as returned (capped at 20K tokens including reasoning). Click to expand.

フグFUGU-ULTRASakana AI◉ PLAN RECEIVED18742 tok · 2128 reasoning · 280.4s · $0.672

As a GPU kernel research lead, I propose the following end-to-end research plan for designing a next-generation batched square compact-WY Householder QR factorization kernel.

Standard cuSOLVER and MAGMA implementations treat batched QR either as a loop over monolithic GPU routines or rely on naive warp-stride loops for small matrices. Both approaches suffer from severe register pressure, synchronization overhead, and vast underutilization of modern AI-focused Streaming Multiprocessors (SMs). To achieve speed-of-light performance on the NVIDIA GB10 (Grace Blackwell) Superchip, we must fundamentally abandon legacy flat arrays and level-2 BLAS panel approaches.

Here is the staged research plan to build a GB10-native batched QR engine built around Tensor Memory Accelerator (TMA), Thread Block Clusters (TBC), Distributed Shared Memory (DSMEM), NVLink-C2C, and Warp-Group Matrix-Multiply-Accumulate (WGMMA) instructions.

MILESTONE 1: Problem Analysis & GB10-Native Data Layout

Objective: Define the target regime (thousands of $N \times N$ matrices, $N \in [32, 1024]$) and design a data layout that eliminates memory-coalescing bottlenecks.

Problem Analysis: For a block of $b$ Householder reflectors, the compact-WY representation is $Q = I - V T V^T$. The factorization splits into a latency-bound panel factorization (computing $V$ and $T$) and a compute-bound Trailing Matrix Update (TMU) ($A \leftarrow A - V T V^T A$). For square matrices, the trailing matrix shrinks predictably, making static wavefront scheduling viable, but traditional memory layouts destroy L2 cache reuse and TMA efficiency across batch dimensions.
[NOVEL IDEA 1] Batch-Tile Matrix-Interleaved (4D Fractal) Layout: Instead of standard arrays of pointers or linearly batched strided matrices, we map the batch into a 4D layout: [Macro_Batch, I_tile, J_tile, Intra_Batch_Lane]. By interleaving the batch dimension at the innermost level (e.g., chunks of 16 matrices packed together), we allow Blackwell's TMA to issue bulk asynchronous 3D memory fetches directly into shared memory. The hardware TMA handles address calculation and bounds checking, bringing identically positioned tiles from multiple independent matrices into the SM simultaneously.
[NOVEL IDEA 2] Dual-View Canonical/Execution Layout: We provide a Grace CPU-side TMA-vectorized packer kernel. For latency-sensitive tiny matrices, we allow direct execution on strided layouts; for large batches, the packing cost is instantly amortized by the subsequent WGMMA throughput gains.

MILESTONE 2: Algorithmic Innovations in Factorization

Objective: Redesign the panel factorization and TMU to bypass the traditionally sequential Householder dependencies.

[NOVEL IDEA 3] Micro-WY Ladder Panel with Speculative Norm Prefetch: Standard blocked QR forms the panel largely with Level-2 BLAS operations. We will split the panel block $b$ into micro-panels of size $r$ (e.g., $r=4$). We form local micro-WY blocks ($I - V_r T_r V_r^T$) and apply them to the rest of the panel using Tensor Core-friendly small GEMMs. Simultaneously, while column $j$ is being finalized, we speculatively precompute partial norms for columns $j+1$ to $j+r$, overlapping reduction work with reflector application and correcting the norms exactly via update formulas.
[NOVEL IDEA 4] DSMEM-Resident WY Update Pipeline: We map an active trailing region to a Thread Block Cluster (TBC) (e.g., 4 or 8 CTAs). One CTA in the cluster owns the panel $V$ and $T$; the others own column tiles of the trailing matrix $A_{trail}$. $V$ and $T$ are staged once into Distributed Shared Memory (DSMEM). The multiple CTAs reuse this DSMEM "WY-cache" to compute intermediate $W = V^T A_{trail}$ and $W = T^T W$ without ever spilling intermediate steps to HBM.

MILESTONE 3: GB10 Compute Optimization & FP64 Strategy

Objective: Maximize TMU throughput by exploiting Blackwell's Tensor Cores and warp specialization while maintaining strict HPC numerical fidelity.

[NOVEL IDEA 5] Adaptive Ozaki-Emulated FP64 WGMMA Updates: Blackwell is optimized for reduced precision (FP4/FP8/TF32) in its 4th-gen Tensor Cores. Native FP64 relies on standard CUDA cores, creating a massive throughput disparity. We will implement an adaptive Ozaki decomposition for the massive $V T V^T A$ update. We slice the FP64 mantissas of $V$, $T$, and $A$ into TF32/FP8 components, compute the component cross-products via high-speed WGMMA Tensor Cores accumulating natively in FP32, and reconstruct the FP64 result. A final lightweight native FP64 CUDA-core pass performs a residual correction.
Occupancy via Warp Specialization: We divide the CTAs into TMA Producer Warps and WGMMA Consumer Warps. Producer warps execute asynchronous cp.async.bulk instructions to fetch the next batch-tile of $A$ into DSMEM. Consumer warps purely execute matrix math. This decouples memory latency from math pipelines, maximizing WGMMA utilization even under immense register pressure.
[NOVEL IDEA 6] Diagonal-Wave Scheduler: Because all matrices in the batch are square, the active submatrix shrinks identically for all. We schedule work in "diagonal waves" across the entire batch chunk, drastically reducing branch divergence and allowing a single persistent cluster kernel to reuse TMA descriptors across waves.

MILESTONE 4: Heterogeneous Execution via Grace NVLink-C2C

Objective: Leverage the 900 GB/s cache-coherent Grace-Blackwell interconnect to handle control flow and outliers without stalling the massive Blackwell GPU.

[NOVEL IDEA 7] Grace Sidecar for Numerical Outliers: Not all matrices are well-conditioned. During GPU QR, if a matrix faces massive cancellation (tracked by a lightweight per-panel numerical risk score), it writes a NEED_HELP status word. The Grace CPU, monitoring via coherent memory, instantly acts as a sidecar. Grace can intercept that specific matrix, run a high-precision CPU check, or enqueue it for a stricter native-FP64 GPU kernel. This avoids globally slowing down the entire batch for a few pathological matrices.

MILESTONE 5: Real-Time Autotuning Strategy

Objective: Create a dynamic execution environment that adapts to batch sizes, condition numbers, and thermal throttling.

[NOVEL IDEA 8] Grace-Driven Contextual Bandit Autotuner: Traditional offline autotuning is rigid. We will deploy a lightweight contextual bandit RL agent running asynchronously on the Grace CPU. Between batch chunks, Grace reads hardware performance counters (via CUPTI) and risk-score histograms from coherent memory. It dynamically adjusts the compact-WY block size ($b$), cluster size, and precision policy (e.g., number of Ozaki components) for subsequent chunks. Reward functions will penalize execution time, orthogonality loss, and fallback rates.

MILESTONE 6: Correctness and Stability Validation

Objective: Rigorously prove that the novel mixed-precision and asynchronous kernel is numerically sound.

Metrics: We will measure relative factorization residual ($\|A - QR\|_F / \|A\|_F$) and orthogonality loss ($\|I - Q^T Q\|_F$). We will normalize these against $n \epsilon$.
Validation Suite: Matrices will be sourced from the SuiteSparse Matrix Collection, alongside synthetically generated pathological matrices (Kahan matrices, Hilbert-like matrices, nearly rank-deficient SVD manipulations) to stress-test the Ozaki-emulated FP64 TMU and the Grace CPU sidecar.
Mixed-Precision Safety Checks: We will track residual growth per panel and verify that our lightweight risk score accurately predicts failures before they corrupt the final $R$ factor.

MILESTONE 7: Benchmarking and Ablation Methodology

Objective: Quantify the exact performance gains of the GB10-native design.

Baselines: cuSOLVER cusolverDnDgeqrfBatched, MAGMA magma_dgeqrf_batched, and a highly optimized Hopper-generation implementation.
Ablation 1 (Layout & Memory): Compare the 4D Fractal Layout + TMA against standard arrays-of-pointers with global memory loads to isolate the bandwidth savings of TMA and the hit-rate improvement in L2.
Ablation 2 (Cluster & DSMEM): Disable TBCs and DSMEM WY-cache (forcing all $V$ and $T$ staging through L2/HBM) to measure the latency hidden by cluster-level reuse.
Ablation 3 (Compute Emulation): Run the TMU purely on FP64 CUDA cores vs. our Ozaki-Emulated Tensor Core strategy. We will plot a Pareto frontier of throughput (effective TFLOPS) vs. accuracy degradation.
Ablation 4 (Grace Sidecar): Inject 5% highly ill-conditioned matrices into a batch. Compare the end-to-end batch throughput with the Grace sidecar active versus a pure GPU implementation forced to drop to a safe, slow precision globally.

Conclusion: This research plan moves completely away from porting legacy monolithic LAPACK routines. By exploiting Blackwell's Tensor Cores for HPC-grade math via emulation, restructuring data for 4D TMA loads, caching compact-WY factors in DSMEM across TBCs, and relying on Grace as an active asynchronous sidecar over NVLink-C2C, we will deliver a batched QR kernel operating at the absolute architectural limits of the GB10 platform.

ジーピーティーGPT-5.5OpenAI◉ PLAN RECEIVED9184 tok · 1891 reasoning · 232.7s · $0.277

Research plan: high-performance batched square compact-WY Householder QR on NVIDIA GB10 / Grace Blackwell

Target problem

We need factor many independent square matrices:

\[ A_i \in \mathbb{R}^{n \times n}, \quad i = 0,\dots,B-1 \]

into

\[ A_i = Q_i R_i \]

using blocked Householder QR with compact-WY representation:

\[ Q_{\text{panel}} = I - Y T Y^T \]

where Y contains Householder vectors and T is the compact upper-triangular block reflector factor.

The target is batched square matrices, likely in the range:

tiny: n <= 32
small: 32 < n <= 128
medium-small: 128 < n <= 512
upper target: 512 < n <= 1024, if batch size is large enough

The kernel family should outperform library-style batched QR by exploiting fixed square sizes, batching regularity, in-kernel compact-WY fusion, Blackwell tensor cores, TMA, thread-block clusters, and Grace–Blackwell unified memory.

1. Problem analysis

1.1 Performance bottlenecks

For batched square QR, the main costs are:

Panel factorization

Sequential Householder dependency.
Norms and dot products.
Poor tensor-core utilization.
Latency-sensitive for small n.

Trailing matrix update

Dominant FLOPs for larger n.
In compact-WY form:

\[ C \leftarrow C - Y T^T (Y^T C) \]

Normally implemented as several small GEMMs.
Batched small GEMMs suffer from launch overhead, low occupancy, and memory traffic.

Batching overhead

Matrices are independent.
Standard layouts often produce non-coalesced loads when many CTAs process different matrices.
Per-matrix metadata, pointer arrays, and small kernel launches become nontrivial.

Precision tension

Householder QR is numerically stable in working precision.
Tensor-core acceleration pushes toward TF32, BF16, FP16, FP8, or FP4.
Direct low-precision Householder QR can lose orthogonality on ill-conditioned matrices.
GB10 FP64 capability may be weaker than datacenter Blackwell SKUs, so FP64 strategy must be adaptive.

2. Algorithmic choice

I would use a blocked Householder QR with compact-WY panel representation.

For each panel starting at column k with block size b:

Factor panel:

\[ A_{k:n, k:k+b-1} \rightarrow Y_k, T_k, R_{kk} \]

Apply the block reflector to the trailing matrix:

\[ A_{k:n, k+b:n} \leftarrow A_{k:n, k+b:n} - Y_k T_k^T \left(Y_k^T A_{k:n, k+b:n}\right) \]

The blocked Householder method is the right baseline because:

It is more stable than Cholesky QR and classical Gram-Schmidt.
It exposes level-3 work through compact-WY updates.
It allows tensor-core acceleration mostly in the trailing update.
It supports in-place LAPACK-compatible storage:
upper triangle stores R
lower triangle stores Householder vectors Y
tau and panel-local T stored separately

3. Novel design ideas

I would not simply implement a cuSOLVER-style panel plus batched GEMM pipeline. The main research novelty would be the following.

Novel idea 1: fused compact-WY trailing update without materializing intermediates

Standard compact-WY update forms:

\[ G = Y^T C \]

then

\[ G \leftarrow T^T G \]

then

\[ C \leftarrow C - YG \]

This requires at least two GEMM-like operations and a temporary G.

Instead, for small square batched matrices, I would design a single persistent cluster kernel that:

Loads a panel block Y and T into shared memory or distributed shared memory.
Streams tiles of C using TMA.
Computes G_tile = Y^T C_tile using tensor cores.
Applies T^T to G_tile in registers or shared memory.
Immediately computes Y G_tile.
Stores updated C_tile.

The temporary G_tile never goes to global memory.

This should reduce:

global memory traffic
kernel launches
synchronization overhead
pointer-chasing overhead
small-GEMM inefficiency

This is the central kernel-design idea.

Novel idea 2: cluster-resident multi-panel pipeline

For medium-small matrices, one thread-block cluster processes one matrix panel and several trailing tiles.

Within the cluster:

one CTA performs panel factorization or finalizes T
several CTAs update different trailing tiles
Y and T are placed in distributed shared memory
TMA multicast distributes panel data to CTAs in the cluster
trailing tiles are streamed through TMA

This creates a QR pipeline inside a cluster:

panel k factorization
      ↓
broadcast Y_k, T_k through DSM
      ↓
fused compact-WY update of trailing tiles
      ↓
next panel

For n <= 256, the entire active submatrix may fit in cluster-local shared memory or DSM, enabling very low global-memory traffic.

Novel idea 3: layout-transforming batched QR

Instead of accepting ordinary strided matrices as the compute layout, I would introduce a GPU-side packing layer that converts batches into a tile-major structure-of-arrays layout:

A_packed[tile_row][tile_col][batch_group][tile_element]

For example, with 16 x 16 tiles:

A_packed[tr][tc][g][i][j]

where g indexes a group of matrices processed together.

This gives:

coalesced loads across the batch dimension
TMA-friendly contiguous tile transfers
better tensor-core tile feeding
fewer scattered global-memory transactions
easier persistent scheduling

For users requiring LAPACK-like storage, we provide pack and unpack kernels. For high-throughput batched workloads, we expose the packed layout as the native API.

Novel idea 4: precision-adaptive compact-WY QR

The kernel will not use one fixed precision mode. It will dynamically select among:

strict FP64

FP64 panel
FP64 update
used when accuracy is critical or condition indicators are poor

FP64-panel / FP32-or-TF32 update with compensation

norms, tau, T, and panel inner products in FP64
trailing update through tensor cores in TF32 or FP32
residual correction accumulated in FP32/FP64

FP32-panel / TF32 update

default high-throughput mode

BF16/FP16 tensor-core update with FP32 accumulation

only for well-scaled matrices and user-approved approximate QR

emulated FP64 update using FP32 tensor cores

double-single or Ozaki-style splitting for SKUs with poor native FP64 throughput

The precision mode is chosen using lightweight panel diagnostics:

column norm range
estimated local condition growth
reflector cancellation risk
observed loss of triangularity or orthogonality on sample panels
user-specified tolerance

Novel idea 5: batch-aware persistent QR scheduler

Instead of one kernel launch per panel or relying on batched GEMM calls, I would implement a persistent scheduler:

CTAs or clusters fetch matrix IDs from a global queue.
Matrices are grouped by n, precision mode, and layout.
For tiny matrices, one warp or one CTA handles multiple matrices.
For medium matrices, one cluster handles one matrix.
For large batches, the scheduler maintains high SM residency by assigning new matrices as soon as clusters finish.

This avoids load imbalance when matrices have different convergence paths or precision escalation decisions.

4. Detailed algorithm plan

4.1 In-place storage format

For each matrix:

upper triangle of A stores R
strict lower triangle stores Householder vectors
diagonal of each Householder vector is implicit 1
tau[k] stores scalar reflector coefficient
T_panel[p] stores compact-WY triangular factor for each panel

For panel size b, panel p stores:

\[ T_p \in \mathbb{R}^{b \times b} \]

packed as an upper triangular block.

4.2 Householder panel factorization

For a panel beginning at column k:

For j = 0 ... b-1:

\[ x = A_{k+j:n, k+j} \]

Compute norm:

\[ \|x\|_2 \]

Generate Householder reflector:

\[ H_j = I - \tau_j v_j v_j^T \]

Store v_j below the diagonal, implicit one at the diagonal.

Apply current reflector to remaining columns inside the panel.

Update partial compact-WY state.

After all b reflectors are generated, construct T:

\[ T_{jj} = \tau_j \]

and for i < j:

\[ T_{ij} = -\tau_j \left[ T_{i:j-1, i:j-1} \left( Y_{:, i:j-1}^T v_j \right) \right]_i \]

depending on the chosen convention.

The block reflector is:

\[ H_{\text{panel}} = I - Y T Y^T \]

For applying to the trailing matrix from the left:

\[ C \leftarrow \left(I - Y T^T Y^T\right) C \]

4.3 Panel kernels by matrix size

Regime A: `n <= 32`

Use one warp per matrix.

Store whole matrix in registers if possible.
Use warp shuffles for norms and dot products.
No shared memory except maybe for spilling.
Compact-WY T generated in registers.
Update columns directly.

This regime prioritizes latency and batch throughput.

Regime B: `32 < n <= 128`

Use one CTA per matrix.

Matrix tile resident in shared memory.
Panel size b = 8, 16, or 32.
Warp specialization:
warp 0: norms, Householder generation
warp 1: panel dot products
warps 2–3 or more: trailing update microtiles
Tensor cores used for compact-WY update if tile dimensions justify it.
Avoid global-memory temporaries.

Regime C: `128 < n <= 512`

Use one thread-block cluster per matrix.

Cluster contains several CTAs.
Panel Y and T stored in distributed shared memory.
TMA loads trailing tiles into CTA-local shared memory.
Tensor cores update several trailing tiles in parallel.
Panel CTA and update CTAs operate in a pipelined manner.

Regime D: `n > 512`

Use one or more clusters per matrix.

Split trailing matrix into macrotiles.
One cluster owns the panel.
Multiple clusters update disjoint trailing tiles.
Use global-memory synchronization between panels or CUDA cooperative groups if available.
Consider look-ahead:
while some clusters update panel k, another begins preparing panel k+b once dependencies are satisfied.

This regime is secondary because the target is batched small square QR, but the design should scale.

5. GB10 / Grace Blackwell-specific optimization

5.1 Tensor cores

Blackwell tensor cores should be used primarily for:

\[ Y^T C \]

and

\[ Y G \]

in the compact-WY trailing update.

For FP32-like QR:

use TF32 tensor cores for speed
accumulate in FP32
optionally refine with correction terms

For lower precision:

use BF16 or FP16 with FP32 accumulation
only when numerical diagnostics permit

For very low precision:

FP8/FP4 are not default QR formats
they may be useful for approximate updates in machine-learning workloads
require scaling and orthogonality repair

For FP64:

if the GB10 configuration exposes strong native FP64, use native FP64 CUDA cores or FP64 tensor cores where available
if native FP64 is weak, use:
FP64 panel only
mixed-precision update
emulated FP64 GEMM using FP32 tensor cores and splitting
compensation for accumulated update error

5.2 TMA

Tensor Memory Accelerator will be used for:

loading packed tiles of the trailing matrix
loading or multicasting panel Y
storing updated trailing tiles
prefetching the next panel region

The fused update kernel pipeline:

TMA load Y/T panel tile
TMA load C tile stage 0
compute G = Y^T C
TMA load C tile stage 1
compute T^T G
compute C -= YG
TMA store C tile stage 0
repeat

Use asynchronous transaction barriers to overlap:

global memory movement
tensor-core computation
panel synchronization

5.3 Thread-block clusters and distributed shared memory

Clusters are useful for n >= 128.

Cluster layout example:

clusterDim = 4 CTAs

CTA 0: panel factorization / T construction
CTA 1: trailing tile columns 0..m
CTA 2: trailing tile columns m..2m
CTA 3: trailing tile columns 2m..3m

Shared state:

Y_panel
T_panel
panel scaling factors
precision escalation flags
local diagnostics

Panel data is placed in DSM so update CTAs can access it without repeated global loads.

TMA multicast can distribute panel data to all CTAs in a cluster.

5.4 Occupancy strategy

The design should explicitly trade off:

registers per thread
shared memory per CTA
DSM per cluster
tensor-core occupancy
number of resident clusters per SM group
TMA queue pressure

For tiny matrices, occupancy is dominated by CTA count and scheduling overhead.

For medium matrices, the goal is not maximum occupancy but maximum sustained tensor-core utilization while keeping panel latency hidden.

I would build an occupancy model:

\[ T_{\text{panel}} + T_{\text{update}} + T_{\text{sync}} + T_{\text{memory}} \]

and tune cluster size so that:

update CTAs hide panel latency
TMA stages keep tensor cores fed
shared memory does not reduce active clusters below the point where latency dominates

5.5 Grace–Blackwell unified memory and NVLink-C2C

The Grace CPU should be used for:

preprocessing batch metadata
grouping matrices by size and precision mode
constructing TMA descriptors
performing optional high-precision verification on samples
feeding persistent GPU queues

The GPU should own matrix data during factorization.

Use unified memory carefully:

prefetch matrices to GPU before factorization
avoid first-touch page faults inside kernels
use memory advice to prefer GPU residency
keep descriptor tables coherent and CPU-visible
use NVLink-C2C for low-latency CPU-GPU scheduling metadata, not for frequent matrix movement

For streaming workloads:

Grace CPU prepares batch group g+1
Blackwell GPU factors batch group g
Grace CPU validates or consumes batch group g-1

6. Data layout for batched square matrices

6.1 Supported layouts

Layout 1: conventional strided batch

A_i[row + col * lda + i * stride]

Pros:

compatible with existing applications

Cons:

poor coalescing for tiny matrices
less friendly to TMA
harder to vectorize across batch

Layout 2: pointer batch

Aarray[i] -> A_i

This should be supported only as an input compatibility mode.

It is poor for performance because of pointer chasing and irregularity.

Layout 3: proposed tile-major batch layout

For high performance:

A_packed[tile_row][tile_col][batch_group][tile_i][tile_j]

Example with 16 x 16 tiles and batch group size G:

A_packed[tr][tc][g][16][16]

or vectorized:

A_packed[tr][tc][tile_i][tile_j][g]

depending on whether the kernel wants matrix-contiguous or batch-contiguous access.

For tiny matrices, the batch dimension should be innermost so one warp can process multiple matrices with coalesced loads.

For medium matrices, tile elements should be contiguous for TMA.

6.2 Novel batch swizzling

I would use a diagonal-panel swizzle:

Matrices are grouped so that panels with the same k are contiguous in memory:

batch group:
    all matrices, panel 0 tiles
    all matrices, panel 1 tiles
    all matrices, panel 2 tiles

This makes QR panel progression naturally coalesced across the batch.

For very large batches, the scheduler can process all matrices at panel k together, improving instruction locality and TMA descriptor reuse.

7. Kernel architecture

7.1 Kernel family

I would implement several specialized kernels rather than one universal kernel.

Kernel K0: pack/unpack

strided/pointer input to packed tile-major layout
optional scaling of matrices
optional column norm precomputation

Kernel K1: tiny QR

For n <= 32.

one warp per matrix
register-resident
no compact-WY needed unless requested
optionally output compact-WY metadata for API consistency

Kernel K2: small compact-WY QR

For 32 < n <= 128.

one CTA per matrix
shared-memory panel
fused compact-WY update
tensor-core update if profitable

Kernel K3: cluster compact-WY QR

For 128 < n <= 512.

one cluster per matrix
DSM-resident Y and T
TMA-fed trailing update
tensor-core fused update

Kernel K4: multi-cluster QR

For n > 512.

multiple clusters per matrix
panel owner cluster
distributed trailing updates
global panel synchronization

Kernel K5: validation / diagnostics

compute residual samples
compute orthogonality estimate
flag matrices needing refactorization in higher precision

7.2 Fused compact-WY update kernel detail

For a trailing tile:

\[ C_t \in \mathbb{R}^{m \times w} \]

where:

m = n-k
w is tile width
Y \in R^{m x b}
T \in R^{b x b}

Compute:

\[ G = Y^T C_t \]

\[ G = T^T G \]

\[ C_t = C_t - YG \]

Implementation sketch:

load Y tile into shared/DSM
load T into registers/shared
for each trailing C tile:
    TMA load C tile into shared
    MMA: G = Y^T C
    small triangular multiply: G = T^T G
    MMA: U = Y G
    C = C - U
    TMA store C

G is small: b x w.

For b = 16 and w = 32, G has only 512 elements, feasible in registers/shared.

The key is that G is never written to global memory.

8. Precision strategy

8.1 Default modes

Mode P0: FP32 stable mode

panel norms in FP32 or FP64 depending on availability
tau in FP32
T in FP32
trailing update in TF32 tensor cores with FP32 accumulation
optional correction for sensitive panels

Mode P1: FP64-accurate mode

panel in FP64
tau and T in FP64
update in native FP64 if fast enough
otherwise update using compensated FP32 tensor-core emulation

Mode P2: mixed FP64/TF32

panel in FP64
compact-WY update in TF32/FP32
correction update:

\[ C \leftarrow C - \Delta_{\text{low}} \]

then compute a small residual estimate

\[ E = YT^TY^TC - \widehat{\Delta} \]

and apply a correction when needed.

Mode P3: approximate ML mode

panel in FP32
update in BF16/FP16 or FP8 with scaling
optional reorthogonalization of selected reflectors
intended only for tolerant applications

8.2 FP64 native versus emulated strategy

At initialization, benchmark microkernels:

FP64 FMA throughput
FP64 tensor-core throughput, if exposed
FP32/TF32 tensor-core throughput
shared-memory bandwidth
TMA bandwidth

Then choose:

if native FP64 update is efficient:
    use native FP64 for strict mode
else:
    use FP64 panel + emulated or compensated update

Emulated FP64 update options:

double-single

represent value as hi + lo FP32 pair
use multiple FP32 tensor-core products
accumulate correction

Ozaki-style splitting

split FP64 operands into several FP32 chunks
tensor-core multiply chunks
accumulate into FP64 or compensated FP32

hybrid correction

perform main update in TF32/FP32
estimate update error using sampled columns or residual projection
recompute only flagged tiles in higher precision

The goal is not universal FP64 emulation, but adaptive use where strict accuracy is requested.

9. Autotuning strategy

9.1 Parameters to tune

For each n, batch size, layout, and precision mode:

panel block size b
candidates: 4, 8, 16, 24, 32, 48, 64
trailing tile width w
CTA shape
number of warps per CTA
cluster size
1, 2, 4, 8 CTAs
TMA stage count
double-buffering versus triple-buffering
shared-memory split
panel storage
C tiles
G temporary
tensor-core instruction shape
batch group size
layout swizzle
precision mode
whether to fuse or split WY update
native versus packed input
number of matrices per CTA for tiny sizes

9.2 Autotuning workflow

Offline tuning

Run on target GB10 hardware:

Generate representative matrix batches.
Sweep parameters.
Collect:

runtime
achieved TFLOP/s
tensor-core utilization
memory bandwidth
TMA stall cycles
occupancy
shared-memory bank conflicts
register spills
numerical error

Store best configurations in a tuning table keyed by:

(n, dtype, batch_size_range, layout, accuracy_mode)

Online tuning

At runtime:

Identify n, batch size, layout, requested tolerance.
Select nearest offline configuration.
Optionally run a small calibration on the first batch.
Cache winning configuration.

9.3 Analytical model

The autotuner should be guided by a cost model.

Panel cost approximately:

\[ O(n b^2) \]

Trailing update cost per panel:

\[ O(2(n-k)b(n-k-b)) \]

Total blocked QR cost:

\[ \frac{4}{3}n^3 + O(n^2 b) \]

For each panel, estimate:

\[ T = \max(T_{\text{TMA}}, T_{\text{TC}}) + T_{\text{panel}} + T_{\text{sync}} \]

Use this to prune parameter combinations before exhaustive tuning.

10. Numerical stability and correctness validation

10.1 Correctness metrics

For each matrix:

Residual:

\[ \frac{\|A - QR\|_F}{\|A\|_F} \]

Orthogonality:

\[ \|I - Q^T Q\|_F \]

Relative triangular error:

\[ \frac{\|\operatorname{tril}(R, -1)\|_F}{\|A\|_F} \]

Sign convention consistency for R diagonal.

Comparison against high-precision reference:

CPU LAPACK
cuSOLVER
MPFR or long-double reference for selected tiny matrices

10.2 Matrix test suite

Use diverse matrices:

random Gaussian
random uniform
orthogonal times diagonal with controlled condition number
Hilbert-like matrices
Kahan matrices
nearly rank-deficient matrices
scaled columns with large norm variation
subnormal and overflow-risk cases
identity plus low-rank perturbation
adversarial matrices designed to trigger cancellation in Householder norm generation

Condition numbers:

\[ \kappa(A) = 10^0, 10^2, 10^4, 10^8, 10^{12}, 10^{16} \]

depending on precision mode.

10.3 Stability policy

For FP32 mode, expected bounds:

\[ \frac{\|A - QR\|}{\|A\|} = O(n u_{\text{fp32}}) \]

\[ \|I - Q^TQ\| = O(n u_{\text{fp32}}) \]

For mixed TF32/FP32 mode, require looser but explicit bounds.

If diagnostics exceed thresholds:

refactor selected matrix in higher precision
recompute flagged panels
disable low-precision trailing updates for that batch group
emit status code to user

10.4 Reconstructing Q

The factorization stores compact-WY data. Validation needs explicit or implicit Q application.

Implement:

ormqr_batched_compact_wy

apply Q or Qᵀ to test matrices

form_q_batched

explicitly generate Q for validation and users who need it

For benchmarking QR only, avoid explicit Q construction unless required.

11. Benchmarking methodology

11.1 Baselines

Compare against:

cuSOLVER batched QR, where available
cuSOLVER non-batched loop
MAGMA batched QR
cuBLAS batched GEMM-based custom blocked QR
CPU LAPACK on Grace
hybrid CPU-GPU implementation for small n
naïve CUDA Householder QR

11.2 Metrics

Measure:

matrices per second
effective GFLOP/s or TFLOP/s
end-to-end time including packing
QR-only time excluding packing
memory bandwidth
tensor-core utilization
SM occupancy
TMA throughput
shared-memory bank conflicts
register spills
numerical residual
orthogonality error
energy per factorization if available

11.3 Benchmark grid

Sweep:

n = 8, 16, 24, 32, 48, 64, 96, 128, 192, 256, 384, 512, 768, 1024
batch sizes:
B = 1, 8, 32, 128, 1024, 8192, 65536
precision modes:
FP64
FP32
FP64-panel/TF32-update
FP32/TF32
BF16/FP16 approximate
layouts:
strided
pointer
packed tile-major
condition numbers:
well-conditioned
moderate
ill-conditioned

11.4 Ablation studies

Ablate each proposed feature:

fused WY update versus three-stage update
TMA versus ordinary global-memory loads
DSM panel sharing versus global reload
cluster kernel versus CTA-only kernel
packed layout versus strided layout
tensor-core update versus CUDA-core update
FP64 panel versus FP32 panel
adaptive precision versus fixed precision
persistent scheduler versus static one-matrix-per-CTA launch
panel block size sensitivity
batch swizzle on/off
Grace-side metadata preparation on/off

The goal is to quantify exactly where performance comes from.

12. Milestone plan

Milestone 0: specification and harness

Deliverables:

API design
matrix layout specification
correctness test harness
benchmark harness
reference LAPACK/cuSOLVER comparison
synthetic matrix generator

APIs:

gb10_batched_geqrf(
    A,
    lda,
    stride_or_layout,
    tau,
    T,
    n,
    batch_count,
    precision_mode,
    workspace,
    stream);

and packed-layout variant:

gb10_packed_batched_geqrf(
    A_packed,
    tau_packed,
    T_packed,
    n,
    batch_count,
    config,
    stream);

Milestone 1: baseline compact-WY QR

Implement:

simple CTA-based blocked Householder QR
in-place storage
explicit T construction
unfused trailing update
FP32 and FP64 versions

Purpose:

verify math
establish baseline
validate storage format

Milestone 2: tiny and small specialized kernels

Implement:

warp-resident QR for n <= 32
CTA-resident QR for n <= 128
shared-memory compact-WY update
no global temporaries

Expected outcome:

strong speedups for small matrices
reduced launch and memory overhead

Milestone 3: fused compact-WY update

Implement the novel fused update:

\[ C \leftarrow C - YT^T(Y^TC) \]

without global-memory G.

Versions:

CUDA-core version
tensor-core TF32 version
FP32 accumulation version
FP64-panel mixed update version

Measure:

global memory traffic reduction
tensor-core utilization
numerical impact

Milestone 4: packed tile-major batch layout

Implement:

pack kernel
unpack kernel
native packed QR
batch swizzles
TMA-friendly tile ordering

Evaluate:

pack overhead
break-even batch size
throughput improvement over strided layout

Milestone 5: GB10 cluster/TMA kernel

Implement:

cluster-resident panel sharing
DSM storage of Y and T
TMA load/store pipeline
TMA multicast for panel data
persistent matrix scheduler

Primary target:

n = 128, 192, 256, 384, 512

Expected outcome:

improved trailing-update throughput
lower global-memory traffic
higher SM efficiency

Milestone 6: mixed-precision and FP64 strategy

Implement precision modes:

FP32/TF32
FP64 native
FP64-panel/TF32-update
emulated FP64 update prototype
adaptive precision escalation

Validation:

residual and orthogonality versus references
adversarial matrices
condition-number sweep

Milestone 7: autotuner

Implement:

offline tuner
runtime configuration selector
performance database
analytical pruning model
numerical tolerance-aware tuning

Tuning keys:

n
batch_count
dtype
layout
precision_mode
GB10 SKU capability

Milestone 8: full benchmark and ablation

Perform complete benchmark suite.

Outputs:

performance plots
accuracy plots
roofline analysis
per-feature ablation
recommended configurations
public technical report or paper draft

13. Expected research outcomes

The most important expected contributions are:

A fused compact-WY batched QR update kernel that avoids global temporaries and eliminates small-GEMM launch overhead.

A cluster-resident QR pipeline using Blackwell TMA, DSM, and thread-block clusters.

A packed tile-major batched matrix layout specifically optimized for QR panel progression and tensor-core update tiles.

A precision-adaptive Householder QR system that uses tensor cores aggressively while preserving accuracy through diagnostics and fallback.

A persistent batch scheduler that keeps GB10 saturated across tiny, small, and medium square matrices.

This is not merely a reimplementation of cuSOLVER or MAGMA. The central design principle is to treat batched compact-WY QR as a single fused, layout-aware, cluster-resident streaming computation rather than a sequence of panel kernels and batched GEMMs.

オーパスCLAUDE OPUS 4.8Anthropic◉ PLAN RECEIVED11733 tok · 746 reasoning · 167.3s · $0.295

Research Plan: Batched Square Compact-WY Householder QR for NVIDIA GB10

Project codename: cwy-batch-gb10 Scope: Many independent square matrices, A_k ∈ ℝ^{n×n} (and complex variant), n ∈ {8 … 512}, batch counts 10³–10⁶, factorize A_k = Q_k R_k via blocked compact-WY Householder, returning {V_k, T_k, R_k}.

Part 0 — Guiding theses (what makes this plan original)

The standard cuSOLVER/MAGMA recipe for batched QR is: one CTA per matrix, unblocked or lightly-blocked Householder in shared memory, FP32/FP64 throughout, dot-products via warp shuffles. That design is latency-bound on the panel, leaves tensor cores idle, and ignores Blackwell's memory/cluster machinery.

My plan rests on five novel design pillars, developed in detail below:

[NOVEL-1] Cross-batch software-pipelined "panel/update interleave": treat the latency-bound sequential panel factorization of matrix k and the throughput-bound tensor-core trailing update of matrix k–1 as two pipeline stages mapped onto disjoint warpgroups of a thread-block cluster, so the panel's serial dependency chain is hidden behind a neighbor's GEMM.
[NOVEL-2] Precision-adaptive compact-WY with carried residual (PA-WY): form V and T in working precision but apply the trailing block reflector in FP8/FP16 tensor-core MMA, maintaining a low-rank residual correction term so the final R has FP64-equivalent backward error without ever issuing native FP64 tensor ops.
[NOVEL-3] Ozaki-split block-reflector update as the DGEMM path: for the FP64-accuracy mode, the V·(Tᵀ·(VᵀA)) update is realized by integer/FP32 error-free GEMM emulation on Blackwell INT8/TF32 tensor cores instead of the GB10's anemic native FP64 units.
[NOVEL-4] Cluster-resident matrix in distributed shared memory (DSMEM): for medium n, the whole matrix lives across a 2–4 CTA cluster's combined shared memory, with TMA multicast loading and a DSMEM-based reduction tree for the panel, eliminating round-trips to L2/HBM during factorization.
[NOVEL-5] Learned, hardware-feedback autotuner over a regime-partitioned kernel family, selecting (block size nb, precision schedule, mapping granularity, TMA tiling) per (n, batch, precision-target) using on-device microbenchmarks + a gradient-boosted cost model.

Part 1 — Problem analysis

1.1 Workload characterization

Batched small/medium square QR appears in: bundle adjustment / SLAM, Kalman ensembles, MIMO detection, DFT-based PDE solvers, ML second-order optimizers (Shampoo/K-FAC factor inverses), and tensor-network contractions.
Arithmetic: QR of n×n costs ≈ (4/3)n³ flops (real). For n=64 that is ~3.5×10⁵ flops/matrix — tiny. With batch 10⁵ → 3.5×10¹⁰ flops total. On a GB10 GPU (≈ tens of TFLOP/s FP32, hundreds of TFLOP/s tensor) this is <1 ms of pure compute, so the kernel is bandwidth- and launch/latency-bound, not flop-bound for small n.
Data volume: n=64 FP32 matrix = 16 KiB. Batch 10⁵ = 1.6 GiB read + write. On GB10 unified LPDDR5X (~hundreds of GB/s aggregate), this is the true floor. ⇒ Minimizing memory traffic and maximizing effective bandwidth dominates design for small n; tensor-core utilization matters most for n≳128.

1.2 Regime partition (drives the whole design)

Regime	n	Strategy	Matrix residency
T (tiny)	8–32	1 warp or sub-warp per matrix, fully register-resident, batched over lanes	registers
S (small)	32–128	1 CTA per matrix, SMEM-resident, warp-cooperative panel + small MMA	shared memory
M (medium)	128–512	1 cluster per matrix, DSMEM-resident, tcgen05 MMA trailing update	distributed SMEM
L (large/few)	>512	fall back to streamed per-matrix blocked QR (out of primary scope)	HBM/L2

1.3 Roofline targets

Build a GB10-specific roofline (FP32, BF16, FP8, emulated-FP64) early (Milestone 0). Establish the break-even arithmetic intensity where tensor-core trailing update beats memory traffic. This number sets the minimum nb and minimum n for which [NOVEL-2]/[NOVEL-3] pay off.

Part 2 — Algorithmic foundation (compact-WY, made precise)

2.1 Math recap

Panel of nb columns starting at column j. For each local column i:

x = A[i:m, i]; compute (β, τ_i, v_i) with v_i[i]=1, H_i = I − τ_i v_i v_iᵀ.
Build V = [v_1 … v_{nb}] (m×nb, unit lower-trapezoidal), and triangular factor T (nb×nb, upper) so that

Q_panel = H_1⋯H_{nb} = I − V T Vᵀ.

LARFT recurrence: T[i,i]=τ_i; column i: w = −τ_i (V[:,1:i−1]ᵀ v_i); T[1:i−1,i] = T[1:i−1,1:i−1] w.
Trailing update: A_tr ← (I − V Tᵀ Vᵀ) A_tr = A_tr − V·(Tᵀ·(Vᵀ A_tr)). Two GEMMs (W=VᵀA_tr; A_tr−=V·(TᵀW)) + one TRMM.

2.2 Algorithmic choices and a new wrinkle

Right-looking blocked for tensor-core throughput; left-looking considered for the register-resident tiny regime (less write traffic).
[NOVEL] Fused T-construction: instead of LARFT's per-column triangular solves, build T while factorizing the panel using a delayed rank-1 accumulation of VᵀV and the τ's, then a single small triangular inversion of the upper part. This collapses nb sequential TRMVs into one cluster-cooperative MMA + a tiny in-register triangular solve, reducing the panel's critical path — important because panel latency is the thing [NOVEL-1] must hide.
Recursive compact-WY (Elmroth–Gustavson style) at the panel level for n in regime M: split panel into halves, recurse, combine T blocks via T₁₂ = −T₁(V₁ᵀV₂)T₂. This converts more of the panel into MMA.

Part 3 — Numerical / precision strategy

3.1 Why precision is the central research question on GB10

GB10's GPU is Blackwell-class but consumer-tier FP64 (native FP64 throughput ~1/32–1/64 of FP32, no high-rate FP64 tensor path). Yet many target apps want FP64-quality R. So the design must deliver double-precision accuracy without relying on native double-precision throughput. Two complementary mechanisms:

3.2 [NOVEL-2] Precision-Adaptive compact-WY (PA-WY)

Compute reflectors (norms, τ, V, T) in FP32 or TF32-with-correction (panel is small, accuracy-critical, latency-bound → keep it accurate).
Apply the trailing update with BF16/FP8 MMA: cast V, T, A_tr to low precision, do MMA in tensor cores accumulating FP32.
Carried residual: maintain ΔA, the low-rank update residual from precision truncation, as a correction GEMM at coarser cadence (every g panels). Conceptually a block-wise iterative refinement of the QR update: A_tr_true ≈ low-prec-update + correction(ΔV, ΔT) where ΔV/ΔT are the quantization residuals of V and T. Because V is unit-triangular and T is small, the residual GEMMs are cheap relative to the main update.
Precision schedule (autotuned, [NOVEL-5]): FP8 for early/large trailing blocks (well-conditioned, large), escalating to BF16/FP32 for the last panels (where R's diagonal small entries dominate conditioning). The schedule is a function of estimated column norms.
Validate against the theory of mixed-precision Householder (extend Higham/Mary block error bounds to the carried-residual scheme; derive a bound of form ‖A−QR‖ ≤ c·(u_high + nb·u_low·κ_eff) and show the residual correction removes the u_low·κ term to first order).

3.3 [NOVEL-3] Emulated-FP64 mode (Ozaki-split block reflector)

For the "I really need FP64" path:

Split FP64 operands of the trailing-update GEMMs into a sum of low-bit slices (FP32 splitting → multiple TF32/BF16 products, or integer Ozaki scheme on INT8 tensor cores with error-free accumulation).
The block-reflector update A_tr − V(TᵀW) becomes a sum of K×K slice-products accumulated exactly (or to chosen accuracy). On Blackwell, INT8 tensor-core rates are very high; a 3–6 slice scheme can reach FP64-equivalent accuracy at throughput exceeding native FP64.
Novelty: apply Ozaki splitting only to the throughput-dominant trailing GEMM, while panel reflector construction uses true FP64 scalar ops (cheap, latency-bound, runs on the FP64 ALUs that would otherwise be idle). This is a hybrid native-FP64-panel / emulated-FP64-update split that maps each precision technique to the hardware unit where it is least costly.
Determine slice count adaptively from κ estimate (cheap from R diagonal).

3.4 Stability safeguards (all regimes)

Householder (not modified Gram–Schmidt) for unconditional orthogonality.
Column-norm safeguards: recompute norms when downdating loses >half the digits (LAPACK xLARFG-style β scaling to avoid over/underflow), implemented in FP32 with FP64 fallback for the scalar τ.
Optional column pivoting variant deferred (breaks batch regularity) — design hooks but not in core.

Part 4 — GB10 architecture mapping

4.1 Tensor cores (tcgen05 / 5th-gen) + Tensor Memory (TMEM)

Use the Blackwell tcgen05 MMA path: operands staged in TMEM, supporting FP8/FP6/FP4, BF16, TF32, and INT.
[NOVEL] Keep the block reflector V and T resident in TMEM across the multiple MMAs of a trailing update (W=VᵀA, then V·(TᵀW)), so V is loaded once and reused for both the producing and consuming GEMM — V appears in both, transposed. Design the TMEM layout so Vᵀ and V are both addressable without recopy (exploit MMA operand-transpose modes).
For tiny/small regimes where MMA tiles are larger than the matrix, pack multiple matrices into one MMA tile (block-diagonal batching) so a single 64×… MMA processes several n=16 matrices at once — turns under-utilized tensor cores into useful work.

4.2 TMA (Tensor Memory Accelerator)

Batched load via TMA descriptors: build a multi-dimensional TMA tensor map over the batched buffer so one async bulk copy fills SMEM/DSMEM with a matrix (or a panel + trailing block).
[NOVEL] Strided-batch TMA descriptor with a "matrix stride" dimension so the descriptor itself iterates over the batch; combined with the cluster mapping, the producer warp issues TMA cp.async.bulk.tensor for matrix k+1 while computing on k (prefetch pipeline depth ≥2).
Use TMA multicast to broadcast shared sub-blocks (e.g., the panel reflector) to all CTAs in a cluster.

4.3 Thread-block clusters + DSMEM ([NOVEL-4])

Regime M: one cluster of 2–4 CTAs owns one matrix; the matrix tiles are partitioned across CTAs' SMEM and addressed as one distributed shared memory space.
Panel reduction (norms, dot products for V and T): a DSMEM reduction tree using cluster-wide mapa/remote SMEM access + cluster.sync, avoiding global memory and avoiding serializing the reduction in a single CTA.
Trailing update partitioned by column blocks across CTAs; V multicast once into each CTA's DSMEM slice.

4.4 [NOVEL-1] Cross-batch panel/update interleave (the headline scheduling idea)

Within a CTA (regime S) or cluster (regime M), split warpgroups into:

WG-panel: does the sequential, latency-bound panel factorization (norms, reflectors, T) for the current matrix — uses CUDA cores / FP64 ALUs / shuffles.
WG-update: does the tensor-core trailing update for the previous matrix (or previous panel) — uses tensor cores + TMEM.

Because panel and update use different functional units, they overlap with near-zero contention. A persistent-kernel ring buffer of in-flight matrices keeps both stages busy; mbarrier/named-barrier handoff transfers V,T from WG-panel to WG-update. This directly attacks the classic batched-QR bottleneck (panel serial chain) by filling it with a neighbor's GEMM rather than idling.

4.5 Grace–Blackwell unified memory & NVLink-C2C

Allocate batched buffers in managed/unified memory; exploit C2C coherence so the Grace CPU prepares descriptors / does pivoting bookkeeping while GPU factorizes (no explicit copies).
[NOVEL] CPU-side producer streams matrices into a unified ring buffer; GPU consumer persistent kernel pulls them — overlapping host generation (e.g., from a sensor/sim pipeline) with factorization, using C2C bandwidth and cudaMemAdvise/prefetch hints to keep working set GPU-resident while staging next batch from LPDDR.
Use per-matrix page locality: align matrix storage so a batch chunk maps to pages that can be prefetched ahead.

4.6 Occupancy & resource budgeting

Persistent kernel sized to fill all SMs; tune CTAs/SM vs SMEM (DSMEM) footprint vs register pressure.
Register-resident tiny regime: keep n²/lanes ≤ register budget (e.g., n=16 → 256 elements over 32 lanes = 8 regs/matrix-element + reflector temporaries).
TMEM capacity budgeting for resident V/T.

Part 5 — Kernel family (concrete designs per regime)

5.1 Regime T (n≤32): warp-batched register-resident

Map one matrix per warp (or 2 matrices per warp for n≤16). Column held across lanes; Householder dot products via __shfl_xor butterfly reductions.
Unblocked left-looking Householder (low write traffic). Optionally pack 4–8 matrices into one MMA (block-diagonal) for the rank-update if it beats shuffle-based AXPY (autotuner decides).
No global traffic except initial load and final store; one TMA bulk load per warp's matrices.

5.2 Regime S (32<n≤128): CTA-per-matrix, SMEM-resident, PA-WY

Load matrix via TMA into SMEM. Panel nb ∈ {8,16,32}.
WG-panel/WG-update interleave ([NOVEL-1]) across two warpgroups; panel uses warp-cooperative reductions, update uses small BF16/FP8 MMA with FP32 accumulate ([NOVEL-2]).
T built via fused construction ([NOVEL] §2.2). Carried residual GEMM every g panels.

5.3 Regime M (128<n≤512): cluster-per-matrix, DSMEM-resident

Cluster of 2–4 CTAs ([NOVEL-4]); matrix tiled across DSMEM.
Recursive compact-WY panel; tcgen05 MMA trailing update with V,T resident in TMEM (§4.1).
Optional emulated-FP64 path ([NOVEL-3]) selected by autotuner/precision flag.
TMA multicast of reflector blocks; DSMEM reduction tree for panel.

5.4 Shared infrastructure

Persistent megakernel with a global work queue (atomicAdd ticket) over the batch; cluster/CTA grabs next matrix → enables [NOVEL-1] ring pipeline and load balancing across heterogeneous n if batches are mixed.
Unified ABI returning V (in-place lower), T (separate compact buffer), R (in-place upper), τ.

Part 6 — Data layout for batching

Evaluate three layouts and let [NOVEL-5] pick:

Array-of-matrices, column-major (LAPACK-like, contiguous per matrix) — good TMA tiling, standard interop.
[NOVEL] Panel-interleaved batch-major: store all matrices' first nb columns contiguously, then next nb, etc. This coalesces the panel across the batch so memory-bound tiny/small regimes read perfectly contiguous reflector data and the cross-batch pipeline streams naturally.
Block-diagonal packed for MMA batching in regime T.

Provide cheap on-device layout transposer (TMA-based) so callers can supply standard layout; transpose cost amortized when batch is reused.
Align matrix stride to 128 B / TMA swizzle requirements; pad n to avoid shared-memory bank conflicts and to fit MMA tile granularity.

Part 7 — Autotuning ([NOVEL-5])

Search space: regime kernel, nb, warpgroup split ratio, pipeline depth, MMA precision schedule, Ozaki slice count, TMA tile shape, cluster size, layout, CTAs/SM.
Method:

Microbenchmark harness measuring panel latency, MMA throughput, TMA bandwidth, DSMEM reduction cost on the actual GB10.
Analytic pruning using the §1.3 roofline to discard infeasible configs.
Cost model: gradient-boosted regressor trained on a sweep over (n, batch, precision-target) → predicts runtime; used to seed
a bandit / Bayesian search that runs only top-K predicted configs on-device, with hardware counters (Nsight Compute: tensor pipe util, mem throughput, DSMEM stalls) fed back to refine the model.

Ship a small config database keyed by (n bucket, precision mode, batch bucket); JIT-select at launch. Re-tune triggered if measured throughput drifts.

Part 8 — Numerical validation & correctness

Reference: LAPACK/cuSOLVER QR in FP64 as ground truth; also a CPU extended-precision (double-double) reference for the FP64-emulation mode.
Metrics per matrix, aggregated over batch:
Backward error ‖A − QR‖_F / ‖A‖_F.
Orthogonality ‖I − QᵀQ‖_F (form Q from V,T).
R accuracy / diagonal sign convention.
Conditioning sweep: generate test matrices with prescribed κ (SVD construction) across κ ∈ {1, 10², 10⁶, 10¹²}, varied n; verify PA-WY residual correction recovers FP64-level error and that the precision schedule's escalation triggers correctly.
Adversarial cases: near-rank-deficient, graded matrices, vectors causing β cancellation (test xLARFG-style safeguards), denormal/overflow edges for FP8.
Determinism/repro: fixed reduction order option for bitwise reproducibility validation.
Derive and empirically confirm the error bound from §3.2 (NOVEL theory deliverable).

Part 9 — Benchmarking & ablation methodology

Baselines: cuSOLVER batched geqrf, MAGMA batched QR, a naïve one-CTA-per-matrix FP32 kernel (our own, as honest baseline).
Sweeps: n ∈ {8,16,32,64,128,256,512}; batch ∈ {10³…10⁶}; precision modes {FP32, PA-WY-mixed, emulated-FP64}.
Primary metrics: throughput (factorizations/s, effective TFLOP/s), achieved bandwidth fraction, energy/factorization (GB10 power telemetry — important for the platform), and accuracy at fixed throughput (Pareto front).
Ablations (each isolates one NOVEL pillar):
A1: with/without cross-batch pipeline [NOVEL-1] → expect large win in small-n latency-bound regime.
A2: precision schedule on/off and carried-residual on/off [NOVEL-2] → accuracy/throughput Pareto.
A3: Ozaki-emulated vs native FP64 trailing update [NOVEL-3] → throughput & accuracy.
A4: DSMEM-cluster vs single-CTA for regime M [NOVEL-4].
A5: panel-interleaved vs standard layout [NOVEL-6 layout].
A6: TMEM-resident V/T reuse vs reload.
A7: unified-memory CPU-producer overlap on/off (end-to-end pipeline throughput).
Profiling: Nsight Compute for tensor-pipe utilization, TMA bandwidth, DSMEM traffic, warp-stall breakdown; Nsight Systems for the persistent-kernel pipeline and C2C overlap.
Report roofline placement of each kernel and the autotuner's selected configs.

Part 10 — Milestones

M0 (weeks 1–3) — Characterization & infra. GB10 roofline (all precisions), microbenchmark harness, baselines (cuSOLVER/MAGMA + naïve), reference + error-metric framework. Gate: validated roofline and baseline numbers.

M1 (weeks 4–7) — Regime-S core, FP32. CTA-per-matrix SMEM compact-WY with fused-T; correctness vs LAPACK. Establishes a working, accurate baseline of our design. Gate: matches baseline accuracy, ≥ parity throughput.

M2 (weeks 8–11) — [NOVEL-1] cross-batch pipeline + persistent kernel. Warpgroup split, ring buffer, mbarrier handoff. Gate: ablation A1 shows measurable small-n speedup (target ≥2× over naïve at n≤64).

M3 (weeks 12–16) — [NOVEL-2] PA-WY mixed precision + TMEM reuse. Precision schedule, carried residual, tensor-core trailing update; error-bound theory + empirical confirmation. Gate: FP64-level backward error at low-precision throughput on κ≤10⁶.

M4 (weeks 17–21) — Regime M [NOVEL-4] cluster/DSMEM + TMA multicast. Recursive compact-WY, tcgen05 path. Gate: ablation A4 win for n≥128.

M5 (weeks 22–25) — [NOVEL-3] emulated-FP64 (Ozaki) hybrid. Slice-adaptive trailing GEMM + native-FP64 panel. Gate: exceed native-FP64 throughput at equal accuracy (ablation A3).

M6 (weeks 26–29) — Regime T register-batched + MMA packing + layouts [NOVEL-6]. Gate: tiny-n bandwidth-bound efficiency (≥80% achievable BW).

M7 (weeks 30–34) — [NOVEL-5] autotuner + unified-memory CPU producer (C2C). Cost model, config DB, end-to-end overlap. Gate: autotuner within X% of exhaustive best; A7 end-to-end win.

M8 (weeks 35–38) — Full validation, ablations, paper/library. Pareto fronts, energy, reproducibility; package with stable ABI.

Part 11 — Risks & mitigations

GB10 FP64 weakness worse than expected → emulation is the primary path, not a fallback (M5 prioritized); native-FP64-panel still cheap.
tcgen05/TMEM tooling immaturity → keep WMMA/mma.sync fallback; abstract MMA behind an internal interface.
Cluster scheduling/DSMEM latency not paying off at small n → regime partition ensures clusters used only where beneficial (autotuner decides).
FP8 dynamic range breaks ill-conditioned panels → precision schedule escalation + per-column scaling; carried residual as safety net; fall to BF16/FP32.
Pipeline imbalance (panel ≫ update or vice versa) → tune warpgroup ratio and pipeline depth per (n, precision) via autotuner.

Summary of explicitly novel contributions

Cross-batch panel/update software pipeline on disjoint functional units (latency hiding via neighbor GEMM). [NOVEL-1]
Precision-adaptive compact-WY with carried low-rank residual correction and norm-driven precision schedule. [NOVEL-2]
Hybrid native-FP64-panel + Ozaki-emulated-FP64 trailing GEMM, mapping each precision technique to its cheapest hardware unit. [NOVEL-3]
Cluster-resident matrix in distributed shared memory with DSMEM reduction tree and TMA-multicast reflector broadcast. [NOVEL-4]
Learned, hardware-feedback autotuner over a regime-partitioned kernel family. [NOVEL-5]
Panel-interleaved batch-major data layout + block-diagonal MMA packing for under-filled tensor tiles. [NOVEL-6]
Fused, single-MMA T-factor construction shortening the panel critical path. [NOVEL, §2.2]

I'd anchor the publishable claim on the combination of (1)+(2)+(3): double-precision-accurate batched QR at low-precision tensor-core throughput on a platform with weak native FP64, with the cross-batch pipeline removing the panel-latency wall that has historically capped batched-QR efficiency.

ジーエルエムGLM-5.2Z.AI◉ PLAN RECEIVED4122 tok · 1684 reasoning · 113.1s · $0.013

Here is a comprehensive research plan for designing a high-performance, batched compact-WY Householder QR factorization kernel optimized for the NVIDIA GB10 (Grace-Blackwell) platform.

Research Plan: High-Performance Batched Compact-WY QR on Grace-Blackwell (GB10)

Phase 1: Problem Analysis & Architecture Context

1.1 Problem Profile

Batched QR factorization processes many independent, small-to-medium dense square matrices ($n \in [32, 256]$). Standard LAPACK implementations are memory-bound due to Level-2 BLAS unblocked updates. The compact-WY representation reformulates Householder QR into a blocked algorithm: $A \leftarrow (I - Y T Y^T) A$. The trailing matrix update requires computing $W = Y^T A$, updating $Z = A - Y T W$, and reassigning $A = Z$. This exposes GEMM-like operations, making it amenable to Tensor Cores, but the small matrix sizes make standard cuBLas/cuSOLVER per-kernel launch overheads prohibitive.

1.2 GB10 Architecture Constraints & Features

Grace-Blackwell Interconnect (NVLink-C2C): Coherent unified memory (UMA) between Grace CPU and Blackwell GPU. Zero-copy pointer passing.
5th Gen Tensor Cores: Native FP64 support, but FP16/BF16/TF32 throughput remains orders of magnitude higher.
TMA (Tensor Memory Accelerator): Asynchronous bulk memory transfers directly to shared memory, bypassing register file pressure.
Thread Block Clusters & DSMEM: Multiple thread blocks can be grouped into a cluster, sharing a physically distributed but logically unified shared memory space (DSMEM).

Phase 2: Novel Algorithmic Design & Data Layout

2.1 Data Layout Strategy

Standard batched layouts (A[batch][row][col]) cause TMA fetch inefficiencies for small matrices due to boundary misalignment.

> [NOVEL IDEA 1] Batch-Interleaved TMA-Aligned Layout (BITAL) > Instead of storing matrices contiguously, matrices are interleaved at the block level. We group $b$ matrices into a 3D tensor of shape (rows, cols, b), padding $b$ such that the innermost dimension perfectly aligns with TMA swizzle patterns. If $b$ is insufficient, we use a "2.5D" layout where the $b$ dimension is flattened into the row dimension with specific power-of-2 padding. This allows TMA to issue bulk loads for multiple matrices simultaneously, maximizing memory bandwidth per transaction.

2.2 Algorithm Choice: Compact-WY with Cluster-Level Pipelining

We use the standard compact-WY formulation but redesign the execution model to fit GB10's cluster architecture.

> [NOVEL IDEA 2] Asynchronous Cluster-Pipelined WY (ACPW) > We map the execution of a single batched matrix (or a small sub-batch) to a single Thread Block Cluster. > - Block 0 (The Panel Block): Responsible for unblocked Householder generation to form the $Y$ matrix and the upper-triangular $T$ matrix. Computes entirely in registers/DSMEM. > - Blocks 1 to N (The Update Blocks): Responsible for the trailing matrix update ($A \leftarrow A - Y T (Y^T A)$). > Once Block 0 computes a block of $Y$, it asynchronously broadcasts $Y$ and $T$ via DSMEM (distributed shared memory) to Blocks 1-N using cluster-level async copy. Blocks 1-N immediately begin the $Y^T A$ GEMM via TMA while Block 0 computes the next panel. This creates a fine-grained, cluster-wide producer-consumer pipeline that completely hides the Householder generation latency behind Tensor Core GEMM execution.

Phase 3: GB10-Specific Kernel Optimization

3.1 Mixed-Precision Compact-WY via Tensor Cores

Pure FP64 compact-WY is slow; pure FP16 is numerically unstable. We will exploit Blackwell Tensor Cores via a mixed-precision strategy.

> [NOVEL IDEA 3] TC-Accelerated WY via Error-Free Transforms (TC-EFT) > 1. Panel Factorization: Compute $Y$ and $T$ in FP32 (using standard scalar/vector ALUs). $Y$ is bounded in magnitude, preventing overflow. > 2. Trailing Update: The matrices $Y$, $A$, and $W$ are cast to BF16. The core operations $W = Y^T A$ and $Z = Y(W^T T)^T$ are executed using BF16 Tensor Cores with FP32 accumulation. > 3. Error Compensation: To guarantee numerical stability equivalent to FP64, we compute a single error-free compensation vector $E = Y_{FP32} \cdot A_{FP32} - W_{BF16}$. The final update applies $Z = A - Y T W + Y T E$. Because $E$ is small, $T E$ can be computed on standard FP32 units with negligible overhead, while the bulk of the FLOPs ($Y T W$) benefit from the massive BF16 Tensor Core throughput.

3.2 Native FP64 Strategy

For batches requiring strict FP64 compliance (e.g., quantum chemistry), we fall back to Native FP64 Tensor Cores. However, to maintain occupancy, we will execute the FP64 GEMMs using smaller MMA fragments (e.g., m16n8k16) and overlap them with FP64 scalar Householder dot-products.

3.3 Grace-Blackwell UMA & NVLink-C2C Exploitation

When batch sizes exceed GPU memory, or when dealing with extremely large batches of small matrices, standard PCIe transfers become a bottleneck.

> [NOVEL IDEA 4] CPU-GPU Overlapped Panel Factorization (Grace-Assisted WY) > Because GB10 features UMA and NVLink-C2C, we can pipeline work across the Grace CPU and Blackwell GPU. > - The Grace CPU uses its NEON SIMD units to compute the unblocked panel factorization ($Y$, $T$) for batch $i+1$ directly in unified memory. > - The Blackwell GPU fetches the $Y$ and $T$ matrices via TMA from the unified memory space and executes the heavy trailing matrix update (GEMM) for batch $i$. > - This achieves perfect compute-overlap: the CPU handles the sequential, latency-bound Householder generation, while the GPU remains 100% utilized on throughput-bound Tensor Core GEMMs. No explicit cudaMemcpy is required due to C2C coherence.

3.4 TMA and Occupancy Tuning

Use the cp.async.bulk PTX instruction for TMA to load $A$ directly into shared memory.
Deploy mbarrier::arrive_and_wait for asynchronous warp execution to ensure no warps block waiting for data.
Maximize occupancy by limiting register pressure via the TC-EFT scheme (storing $Y$ in DSMEM rather than registers) and using cluster-level barrier synchronization.

Phase 4: Autotuning Strategy

Given the non-linear scaling of Blackwell Tensor Cores with matrix sizes, a static configuration is suboptimal.

> [NOVEL IDEA 5] Topology-Aware CuTe Autotuner > We will implement the kernel using CUTLASS 3.x CuTe layouts, allowing for compile-time and JIT-time autotuning. The autotuner will sweep over: > - Cluster size (1 to 8 blocks). > - TMA tile shapes (e.g., 64x128 vs 128x128). > - Precision paths (FP64 TC, TC-EFT, Pure FP32). > - CPU-GPU overlap toggle (Grace-Assisted mode). > The cost model will not just measure time-to-solution, but will query the TMA throughput and Tensor Core utilization via hardware performance counters (using CUPTI), generating a Pareto frontier of latency vs numerical accuracy.

Phase 5: Numerical-Validation and Benchmarking

5.1 Numerical Stability & Correctness

Orthogonality: Verify $\| Q^T Q - I \|_F \le \mathcal{O}(n \epsilon_{\text{mach}})$.
Factorization: Verify $\| QR - A \|_F \le \mathcal{O}(n \epsilon_{\text{mach}} \|A\|_F)$.
Validation against Reference: Compare residuals against LAPACK FP64 and cuSOLVER FP64.
TC-EFT Validation: Prove mathematically and empirically that the TC-EFT error bound does not exceed standard FP32 accumulation bounds, despite the BF16 inputs.

5.2 Benchmarking & Ablation Methodology

Baselines: cuSOLVER cublasDgetrfBatched (if applicable), MAGMA batched QR, and standard Level-3 BLAS WY.
Metrics: TFLOPs/s, memory bandwidth (GB/s), kernel launch overhead, and energy efficiency (TFLOPs/Watt).
Ablations:

Disable DSMEM broadcasting (force standard global memory sync) to quantify ACPW gains.
Disable TC-EFT (force pure FP64 Tensor Cores) to quantify mixed-precision gains.
Disable Grace-Assisted mode (force all work on GPU) to quantify NVLink-C2C UMA overlap gains.

Project Milestones

Milestone 1: Feasibility & Micro-Architecture Probing (Weeks 1-3)

Set up GB10 testbed and CUTLASS 3.x/CUDA 12+ toolchain.
Write TMA and DSMEM micro-benchmarks to verify latency and bandwidth limits on Blackwell.
Prototype the TC-EFT math model to verify numerical stability bounds on random matrices.

Milestone 2: Layout & Basic Kernel (Weeks 4-6)

Implement the BITAL data layout converter (CPU-side).
Implement a standard non-clustered FP64 compact-WY kernel using TMA.
Validate numerical correctness against LAPACK.

Milestone 3: ACPW & DSMEM Integration (Weeks 7-10)

Refactor the kernel into Thread Block Clusters.
Implement asynchronous panel broadcast via DSMEM.
Implement the Grace-Assisted WY overlap using standard POSIX threads on Grace CPU.

Milestone 4: TC-EFT Mixed-Precision (Weeks 11-13)

Implement the BF16 Tensor Core trailing update with FP32 error-free correction.
Benchmark pure FP64 vs TC-EFT paths for varying matrix sizes (32x32 to 256x256).

Milestone 5: Autotuning & Final Ablations (Weeks 14-16)

Integrate the JIT autotuner.
Run comprehensive ablation studies (disabling features one by one).
Finalize documentation, paper draft, and open-source release.

キミKIMI K2.7Moonshot AI◉ PLAN RECEIVED16935 tok · 10443 reasoning · 217.3s · $0.059

# Research Plan: BlackQR-B10 – Batched Square Compact-WY Householder QR on NVIDIA GB10 (Grace Blackwell)

Codename: BlackQR-B10 Target: batched square QR, $A^{(b)} = Q^{(b)}R^{(b)}$, $A^{(b)}\in\mathbb{R}^{n\times n}$, $b=0\ldots B-1$, $n$ roughly 32–512, batch size $B$ from thousands to millions. Platform: NVIDIA GB10 SoC (Grace CPU + Blackwell GPU, NVLink-C2C, Tensor Cores Gen 5, TMA, thread-block clusters / distributed shared memory).

This plan deliberately deviates from the standard cuSOLVER/MAGMA recipe of “one small kernel per matrix + repeated cuBLAS GEMM calls” or “one CTA per matrix with scalar Level-2 updates.” The core thesis is that on GB10 a single persistent cluster-wide kernel can keep the whole small matrix in distributed shared memory (DSM), fuse the compact-WY panel factorization and the trailing-matrix update, and accelerate the trailing update with Blackwell warp-group MMA (tensor cores) and, where acceptable, block-scaled FP8/FP16 mixed precision.

1. Executive summary of novel ideas

#	Novelty	What it changes vs. standard recipes
N1	Cluster-resident whole-matrix execution – one thread-block cluster owns one matrix; the matrix lives in DSM for the lifetime of the factorization.	Eliminates almost all intermediate global-memory traffic; removes per-step kernel launches; replaces pointer-array batching with a device-side work queue.
N2	Tensor-core compact-WY trailing update – the rank-$\mathtt{nb}$ update $A\leftarrow A-VW$ and the Gram piece $V^{\!T}A$ are performed with warp-group MMA, including custom micro-kernels for narrow $K=\mathtt{nb}$.	Standard libraries fall back to CUDA-core GEMM or scalar Householder updates; we treat the trailing step as a small-K fused GEMM pipeline.
N3	Block-scaled FP8/FP16 trailing update with FP64-scalar panel – panel scalars ($\beta$, $\tau$, $T$) are kept in FP64 (or compensated FP32), while bulk rank updates use Blackwell FP8/FP16 Tensor Cores with micro-scaling and FP32 accumulation.	New accuracy/performance trade-off for small dense QR, enabled by Blackwell’s micro-scaled FP8 MMA.
N4	Batched tile-major (BTM) layout + TMA 5-D tensor maps – each matrix is stored as tiles, with batch as the outer dimension; TMA loads/stores full cluster stripes and multicasts the panel.	Better coalescing, TMA descriptor reuse, and direct cluster-to-DSM transfer.
N5	Grace-Blackwell unified-memory control plane – the Grace CPU builds TMA descriptors, pre-stages data via NVLink-C2C, and maintains a lock-free GPU work queue in shared system memory; the GPU is purely a data-plane cluster engine.	Overlaps CPU bookkeeping with GPU compute and removes launch-boundary bottlenecks for enormous batches.
N6	Precision-aware autotuner – a CUTLASS-style kernel generator searched over panel width, tile shape, cluster geometry, register/shared split, and precision mode, constrained by a validation error budget.	Adapts kernel to matrix size and numerical tolerance, not just to FLOPs.

2. Problem analysis

2.1 Workload characteristics

Square, small matrices: $n\in[32,512]$, so a matrix uses $4n^2$–$1\,$MiB in FP32. This is too small to saturate the GPU with one matrix, but large enough that a per-matrix scalar kernel is memory-latency bound.
Batched: thousands to millions of independent matrices expose massive outer parallelism, but a naive pointer-array batched kernel pays launch overhead, descriptor setup, and poor L2 reuse.
Key cost: for Householder QR, FLOPs are $\tfrac{4}{3}n^3$ per matrix. In a right-looking blocked algorithm the trailing update dominates; the panel factorization is only $O(n\,\mathtt{nb}^2)$ per panel. The kernel must therefore be a fast Level-3-like update engine, not a scalar Householder generator.
Numerical sensitivity: QR orthogonality is governed by the $\beta_j,\tau_j$ and the compact-$\!WY$ matrix $T$. These must be computed carefully; the bulk trailing update can tolerate lower precision if accumulated in FP32 and refined.

2.2 Baseline targets

cuSOLVER cusolverDn<t>geqrf_batched / MAGMA magma_dgeqrf_batched.
A reference “one-CUDA-block-per-matrix” scalar implementation.
A reference blocked implementation using separate cuBLAS GEMM/TRMM calls.

We will profile all three to establish an absolute speed-up target (goal: 3–10× for the sweet-spot sizes) and to calibrate numerical error baselines.

3. Algorithm choice: compact-WY blocked Householder QR

3.1 Representation

For one matrix, partition the right-looking factorization into panels of width $\mathtt{nb}$. For panel $k$, let

\[ V_k\in\mathbb{R}^{(n-k\mathtt{nb})\times \mathtt{nb}},\qquad T_k\in\mathbb{R}^{\mathtt{nb}\times \mathtt{nb}} \]

be the compact-WY factors such that the accumulated Householder reflectors satisfy

\[ H_k = I - V_k T_k V_k^{\!T}. \]

$T_k$ is built by the Schreiber–Van Loan recurrence:

\[ T_1 = [\tau_1],\qquad T_j = \begin{bmatrix} T_{j-1} & -\tau_j\,T_{j-1}V_{j-1}^{\!T}v_j\\[2pt] 0 & \tau_j \end{bmatrix}. \]

$V_k$ is stored in-place in the strict lower triangle of the current panel; $R$ overwrites the upper triangle. The block reflector $T_k$ is stored in a small per-matrix workspace (or packed into an auxiliary tensor).

3.2 Trailing update

For the trailing submatrix $A_{22}=A[k\mathtt{nb}:n,\;k\mathtt{nb}+\mathtt{nb}:n]$, the update is

\[ A_{22} \;\leftarrow\; (I - V_k T_k^{\!T} V_k^{\!T})\,A_{22} \]

(assuming the convention $H_1\cdots H_{\mathtt{nb}}=I-VTV^{\!T}$). Algorithmically:

\[ Z = V_k^{\!T}A_{22},\qquad W = T_k^{\!T}Z,\qquad A_{22}\leftarrow A_{22} - V_k W. \]

This is the main computational loop. It is not a generic GEMM: $V_k$ has height $m$ but width only $\mathtt{nb}\le64$, and $T_k$ is tiny. We will custom-build warp-group MMA kernels for this narrow-$K$ case instead of calling cuBLAS.

4. GB10-specific kernel architecture

4.1 Execution model: one cluster = one matrix [N1]

Assign one thread-block cluster to each matrix.
Cluster geometry $(C_r \times C_c)$, e.g. $1\times4$ or $2\times2$, for a total of $R=C_rC_c$ thread blocks (CTAs).
The cluster partitions the matrix by rows: CTA $r$ owns row stripe $r$ of size

\[ m_r = \left\lfloor \frac{n+k_r}{R}\right\rfloor \] so that distributed shared memory (DSM) collectively holds the whole trailing submatrix.

A persistent grid of clusters is launched. Each cluster atomically pops the next matrix index from a queue in unified memory, loads that matrix into DSM, factors it, stores the result, and repeats. This removes per-matrix kernel launch latency and amortizes pipeline setup across the batch.

Why it fits GB10: Blackwell clusters expose direct DSM load/store plus cluster.sync(). Because $n$ is small, the entire active submatrix (or even the whole matrix, for $n\lesssim 384$ in FP32) fits in aggregated cluster shared memory. The factorization then becomes a shared-memory algorithm with global memory only at load/store boundaries.

4.2 Panel factorization in DSM

Within a cluster:

Column ownership. Each CTA owns its row stripe of the current panel $A[k:n,\,k:k+\mathtt{nb}]$.
Householder generation. For reflector $j=0\ldots \mathtt{nb}-1$:

Each CTA computes local contribution to the squared norm $\|x_j\|_2^2$ of the pivot column below the diagonal via warp-shuffle reductions.
CTA-level partials are written to a DSM reduction scratchpad; the designated panel leader CTA (CTA 0) finishes the reduction, computes

\[ \beta_j = \operatorname{sign}(x_{j0})\|x_j\|_2,\qquad \tau_j = \frac{\beta_j - x_{j0}}{\beta_j},\qquad v_j = \frac{x_j}{\beta_j - x_{j0}}, \] and broadcasts $\tau_j$ and the pivot value through DSM.

All CTAs apply the new reflector to the remaining columns of their local panel stripe.
The leader updates the compact-WY block $T_k$ using the recurrence above; the dot products $V_{j-1}^{\!T}v_j$ are computed by local partials plus a cluster reduction.

Result. After $\mathtt{nb}$ steps, $V_k$ is in-place in the strict lower panel, $R$ in the upper panel, and $T_k$ in the leader’s shared memory / workspace.

GB10 details:

Use FP64 CUDA cores for $\beta_j,\tau_j$, and the triangular solve/update of $T_k$. Because there are only $O(n\mathtt{nb}^2)$ such ops, even low FP64 throughput is acceptable.
Use cluster.sync() as the memory fence between generation and broadcast.
For $n$ larger than the cluster DSM limit, the panel factorization still operates on a resident panel stripe; the trailing matrix is streamed in tiles.

4.3 Tensor-core trailing update [N2]

For each panel, the trailing submatrix $A_{22}$ is distributed by rows across the CTAs. The update is performed column-tile by column-tile so that the small $Z$ and $W$ blocks remain in registers:

For each column tile $J$ of width $w$ ($w\approx 16$–$64$):

Local projection. CTA $r$ computes

\[ Z_r = V_{k,r}^{\!T} A_{22,r}^{(:,J)} \in \mathbb{R}^{\mathtt{nb}\times w} \] using warp-group wgmma with $M=\mathtt{nb}$, $N=w$, $K=m_r$.

Cluster reduction. Partial $Z_r$’s are reduced to a single $Z$ in a DSM slot via cluster.sync() + a logarithmic warp reduction.
Apply $T$. The leader computes

\[ W = T_k^{\!T} Z \in \mathbb{R}^{\mathtt{nb}\times w} \] (small triangular multiply in registers) and multicasts $W$ to all CTAs.

Rank-$\mathtt{nb}$ update. Each CTA computes

\[ A_{22,r}^{(:,J)} \leftarrow A_{22,r}^{(:,J)} - V_{k,r}\,W \] again with wgmma, accumulating in FP32.

This design keeps the only cross-CTA data per column tile as two small $\mathtt{nb}\times w$ register blocks, minimizing DSM traffic. It also avoids materializing the full $Z$ and $W$ matrices.

MMA tile selection on Blackwell:

Use wgmma shapes such as $M64N8K16$ (FP16/BF16) or $M64N8K32$ (FP8), depending on the measured Blackwell instruction set.
Since $\mathtt{nb}$ is the contraction dimension, specialize kernels for $K\in\{16,24,32,48,64\}$; pad smaller panels with zeros.
Use warp-group specialization: warpgroup 0 does TMA/DSM movement, warpgroup 1 does MMA, with asynchronous wgmma.mma_async and wgmma.wait_group.

4.4 TMA and memory movement [N4]

Initial load: Each CTA uses a TMA descriptor that maps the batched tile-major tensor

\[ \mathcal{A}[B,\,\text{tile}_i,\,\text{tile}_j,\,y,\,x] \] to global memory. It cp.async.bulk.tensors only its row stripe into its local shared memory; the cluster mapping makes the union the full matrix in DSM.

Panel multicast: When a panel must be visible to all CTAs during factorization, use the cluster multicast mode of TMA so that a single global load populates all participating CTAs’ shared memory.
Asynchronous store: Final results are written back with TMA store, optionally with evict_last to avoid polluting L2 with one-off matrix writes.
No intermediate global traffic: With cluster-resident storage, the trailing update never touches global memory until the final panel.

4.5 Mixed precision & FP64 strategy [N3]

Component	Proposed precision	Rationale
Panel scalars ($\beta,\tau$, $T$)	FP64 (or compensated FP32 fallback)	Critical for orthogonality; cheap because $O(n\mathtt{nb}^2)$.
Vectors in $V$ and $A$ storage	FP32 or FP16/FP8	Bulk data; determines memory footprint.
Tensor-core accumulation	FP32	Standard safe MMA accumulator.
Trailing update	FP32 default; optional FP16/BF16 or FP8 with block scaling	FP8 only where error budget allows.

Block-scaled FP8 algorithm for the trailing update:

For each $V_{k,r}$ tile and $A_{22,r}^{(:,J)}$ tile, compute block scale $s_V = \max |V|$, $s_A=\max |A|$ over 128-element micro-scaling groups.
Quantize to E4M3/E5M2 with that scale.
Run MMA in FP8, accumulating in FP32, and rescale the accumulator by $s_V s_A$.
Keep a single-step FP32 correction: compute the residual of the update in FP32 with CUDA cores and add a small $\delta W = T^{\!T}V^{\!T}\Delta A$ to repair O($\epsilon_{\text{FP8}}$) errors.

If FP64 throughput on GB10 is lower than FP32, the panel still dominates very little time; for $n\ge128$ the trailing update is $>90\%$ of FLOPs, so tensor cores dominate wall time.

4.6 Grace–Blackwell unified memory and NVLink-C2C [N5]

Control plane on Grace: The CPU builds TMA tensor-map descriptors for each panel width / tile size variant, prefetches the next batch of input matrices into GPU-reachable physical memory, and maintains a lock-free work queue (head, tail) mapped in shared system memory via NVLink-C2C.
Data plane on Blackwell: The GPU clusters pop queue entries; global memory is read once and written once.
CPU layout transform: When input arrives in standard column-major pointer-array form, a Grace thread stream transposes/packs it into the BTM layout using Scalable Vector Extensions (SVE) while the GPU computes the previous batch. NVLink-C2C bandwidth (≈900 GB/s) makes this feasible.
Spill path: For the rare matrices larger than the cluster DSM limit, the CPU can flag them for a two-phase streaming variant; this avoids a single-kernel-fits-all compromise.

5. Data layouts for batching [N4]

5.1 Batched Tile-Major (BTM)

Store matrix $b$ as tiles of size $(t_M,t_N)$. The linear index is

\[ \text{off}(b,i,j,y,x) = b\cdot L + \bigl(i\cdot N_{\text{tiles},j}+j\bigr)\cdot t_M t_N + (x\cdot t_M + y), \]

where $i,j$ are tile coordinates and $y,x$ are in-tile coordinates. Choose $t_N=\mathtt{nb}$ if possible so that a panel is a contiguous sequence of whole tiles.

Advantages:

A panel factorization reads contiguous tiles.
The trailing update’s tile GEMMs read contiguous $t_M\times t_N$ blocks.
TMA can be programmed as a 5-D tensor map; the batch dimension is a natural outer index.
In a cluster, each CTA’s row stripe consists of a contiguous slab of tiles.

5.2 Workspace layout

Output $A^{(b)}$: in-place packed $R$/$V$, size $n^2$ per matrix.
$T_k$: small workspace. For all panels, store as

\[ \mathcal{T}[b,\,k,\,i,\,j]\quad 0\le i,j < \mathtt{nb} \] (only upper triangular used).

Optional $Q$ not formed; if needed later, $V$ and $T$ are sufficient to apply $Q$ or $Q^{\!T}$.

5.3 Fallback layouts

BCM (block-column-major): column blocks of width $\mathtt{nb}$ are contiguous; good when $n$ is not tile-aligned.
SoA pointer array: kept only as an input adapter; the hot path uses BTM.

6. Autotuning strategy [N6]

6.1 Search space

Build variants via a CUTLASS-style kernel generator; the tunable axes are:

Axis	Values
Panel width $\mathtt{nb}$	16, 24, 32, 48, 64
Cluster geometry $C_r\times C_c$	$1\times2$, $1\times4$, $2\times2$, $1\times8$, $2\times4$
MMA tile $(M,N,K)$	$64\times8\times\mathtt{nb}$, $64\times16\times\mathtt{nb}$, $128\times16\times\mathtt{nb}$, FP8 variants with doubled $K$
Column-tile width $w$	16, 32, 64
Register budget	96, 128, 160, 192 per thread
Precision mode	FP64, FP32, FP16-acc-FP32, BF16-acc-FP32, block-scaled FP8-acc-FP32
Algorithm mode	cluster-resident (full DSM) / streaming tiles
Layout	BTM / BCM / column-major

6.2 Pruning and search

Hard prune by resources.

If $4n^2$ exceeds measured cluster DSM capacity, discard cluster-resident variants for that $n$.
If shared-memory or register budget exceeds GB10 limits, discard.

Fast analytical filter.

Predict memory traffic and MMA utilization for each variant.
Eliminate variants with predicted occupancy $<30\%$.

Measured search.

Use a tree-structured Parzen estimator (TPE) or Gaussian-process bandit on the remaining space.
Benchmark each candidate on a representative validation matrix set (random + ill-conditioned).
Optimize objective: best wall-clock time subject to numerical error $\le 10\times$ FP64 reference error.

Dispatch table.

At runtime, given $(n,B,\text{precision}, \text{error budget})$, look up the pre-tuned variant.
Allow a lightweight online fallback: if error budget is too tight, fall back to FP32/FP64.

7. Numerical stability and correctness validation

7.1 Reference and metrics

Reference: LAPACK dgeqrf in FP64, or a multi-precision reference using MPFR for very ill-conditioned cases.
Metrics per matrix:

\[ \rho_R = \frac{\|A - QR\|_F}{\|A\|_F\,\epsilon}, \qquad \rho_Q = \frac{\|Q^{\!T}Q - I\|_F}{n\,\epsilon}, \] where $Q$ is reconstructed from $V,T$ by explicit multiplication for testing only, and $\epsilon$ is machine epsilon of the tested precision.

7.2 Test matrix suite

Random well-conditioned matrices.
Random with controlled condition number (randsvd).
Near-rank-deficient matrices (scaled columns).
Adversarial: Hilbert, Vandermonde, matrices with near-cancelling sign choices.
Extreme dynamic range: each column scaled by powers of 2.
Edge cases: $n=1$, $n$ not multiple of $\mathtt{nb}$, zero columns, $\infty$/NaN handling.

7.3 Validation harness

Run the kernel in all precision modes and compare against FP64 reference.
Check deterministic output (same seed → same bits, including sign convention).
For mixed precision, verify the correction step recovers FP32-level residual.
Fail if either $\rho_R$ or $\rho_Q$ exceeds a library-tunable threshold (default 10–50 depending on precision).

8. Benchmarking and ablation methodology

8.1 Baselines

cuSOLVER batched: cusolverDnDgeqrfBatched / single-precision equivalent.
MAGMA batched: magma_dgeqrf_batched.
cuBLAS-blocked reference: separate cublas<t>gemm calls for the trailing update plus a custom panel kernel.
One-CTA-per-matrix scalar kernel.

8.2 Metrics

Throughput: matrices per second, and effective FP32-equivalent TFLOP/s.
Per-matrix latency for small $B$.
Occupancy: active thread blocks, cluster count, warp occupancy via Nsight Compute.
Memory: global bytes read/written, L2 hit rate, TMA throughput, NVLink-C2C traffic.
Tensor-core utilization: MMA instruction issue rate, FP8/FP16 active cycles.
Energy: joules per matrix via NVML / on-chip sensors.

8.3 Ablation plan

Run the tuned kernel while systematically disabling each GB10-specific feature:

Ablation	Variant	What it tests
A1	Tensor cores off; CUDA cores only	Value of `wgmma`
A2	DSM off; all matrix loads/stores go to global memory	Value of cluster-resident execution
A3	TMA off; load/store via `cp.async` / global pointers	Value of TMA + multicast
A4	Mixed precision off; pure FP32 / FP64	Value of FP8/FP16 tensor-core path
A5	BTM off; standard pointer-array column-major layout	Value of data-layout transformation
A6	CPU control plane off; GPU builds descriptors/queue	Value of Grace-Blackwell co-scheduling
A7	Vary $\mathtt{nb}$ around tuned value	Sensitivity to algorithmic block size

8.4 Reporting

Roofline plot for representative $n$ showing achieved vs. memory/compute roofs.
Speed-up curves over baselines for $n=32..512$ and $B=1\text{K},10\text{K},100\text{K},1\text{M}$.
Error-vs-speed scatter to show mixed-precision Pareto points.

9. Milestones and deliverables

M1 – Problem characterization & baseline (Weeks 1–3)

Collect exact GB10 parameters: SM count, shared-memory size per SM, cluster/DSM limits, FP64/FP32/FP16/FP8 tensor-core rates, NVLink-C2C bandwidth.
Build reference FP64 CPU/LAPACK verifier.
Implement scalar one-CTA baseline and cuSOLVER/MAGMA wrappers.
Define evaluation matrix suite and error thresholds.

M2 – Algorithm prototype (Weeks 2–4)

Implement a correct sequential compact-WY QR in C++ with the chosen recurrence.
Verify error metrics and sign convention.
Finalize mathematical invariants for the cluster-distributed version.

M3 – GB10 feature microbenchmarks (Weeks 3–6)

Microbenchmark TMA throughput, cluster sync latency, DSM read/write bandwidth, warp-group MMA shapes and throughput, FP8 micro-scaling behavior, NVLink-C2C queue latency.
Determine the maximum $n$ for cluster-resident full-matrix storage at FP32/FP64/FP16.

M4 – Core cluster-resident kernel (Weeks 5–9)

Single-matrix kernel: CTA cluster + DSM, panel factorization, compact-$T$ recurrence, tensor-core trailing update.
Integrate TMA load/store, warp-group MMA, and cluster reductions.
Validate against M2 prototype for one matrix.

M5 – Batching, layout, and persistent scheduling (Weeks 7–11)

Implement BTM layout conversion (initially GPU-only, later CPU-assisted).
Persistent cluster grid and lock-free device work queue in unified memory.
CPU-side descriptor builder and data stager.
Validate on moderate batches ($B=1\text{K}..100\text{K}$).

M6 – Mixed precision & iterative refinement (Weeks 9–13)

Add FP16-acc-FP32 and block-scaled FP8-acc-FP32 trailing-update paths.
Implement FP32 residual correction step.
FP64-scalar panel option vs. compensated FP32 panel option.
Tune error/performance trade-offs.

M7 – Autotuner & kernel generator (Weeks 11–15)

Build template kernel generator over the search space.
Run offline autotuning and produce dispatch tables.
Implement runtime dispatch logic.

M8 – Validation, benchmarking, ablation, report (Weeks 14–18)

Full numerical validation suite.
Benchmark vs. baselines.
Ablation study and roofline analysis.
Document results, publish reproducibility package (build system, harness, tuned configs).

10. Risks and mitigations

Risk	Mitigation
GB10 shared-memory / cluster size smaller than expected, preventing full-matrix residence.	Implement streaming-tile fallback and use analytical model to choose mode per $n$.
FP8 dynamic range insufficient for QR.	Provide FP16/BF16 fallback; use block scaling; validate error budgets empirically.
Tensor cores underutilized for small trailing submatrices near the end.	Special-case last panels with narrow MMA shapes and early-exit; fuse multiple panels when possible.
Cluster-wide reductions become latency bottleneck.	Keep reductions tiny ($\mathtt{nb}\times w$); use warp-level + DSM logarithmic reduction.
CPU stager cannot keep GPU fed via C2C.	Overlap staging with compute across batches; add GPU-side layout transform fallback.
Mixed precision fails adversarial cases.	Dispatcher falls back to FP64/FP32 when validation error exceeds threshold.

11. Summary of expected outcomes

By the end of this research cycle we expect to deliver:

A single-kernel batched square compact-WY QR implementation for GB10 that beats cuSOLVER/MAGMA by a large margin on matrices $n=64..512$ through cluster-resident DSM execution and tensor-core trailing updates.
A precision-aware design that uses FP64 only where mathematically necessary and Blackwell FP8/FP16 tensor cores for the bulk update, with controllable numerical error.
A reusable autotuned kernel generator that adapts cluster geometry, tile sizes, block size, and precision mode to the input size and batch count.
A validation harness and ablation study that quantifies the contribution of each GB10-specific feature (clusters, DSM, TMA, tensor cores, NVLink-C2C, unified memory) to the final speed-up.

This plan is intentionally architecture-native rather than library-portable: every design decision is driven by the GB10 Grace–Blackwell hardware capabilities, with portability treated as a secondary concern after performance and robustness are established.

計測 // TELEMETRY Run metricstokens · cost · time

Generation speed, size and cost of each plan.

Model	tokens / s	total tokens	cost	time
FUGU-ULTRA	66.8	40,777	$0.6724	280 s
GPT-5.5	39.5	9,401	$0.2766	233 s
CLAUDE OPUS 4.8	70.1	12,127	$0.2953	167 s
GLM-5.2	36.4	4,345	$0.0126	113 s
KIMI K2.7	77.9	17,154	$0.0594	217 s
TOTAL	60.1 avg	83,804	$1.3163	1011 s

+ judge (GEMINI 3.1 PRO): 30,019 tokens · $0.0984 · 35 s. Plans were generated concurrently, so wall-clock was far less than the summed time.

資料 // ARCHIVE Research related

LLMs that write GPU kernels

Sakana AI — The AI CUDA Engineer (Feb 2025): evolutionary LLM-driven CUDA kernel discovery; 17,000-kernel archive. (Note: initial 10–100× speedup claims were later found to exploit eval bugs — “reward hacking.”)
EvoEngineer: Mastering Automated CUDA Kernel Code Evolution with LLMs
cuPilot: A Strategy-Coordinated Multi-agent Framework for CUDA Kernel Evolution
KForge: Program Synthesis for Diverse AI Hardware Accelerators
Fine-Tuning GPT-5 for GPU Kernel Generation · Towards Automated Kernel Generation in the Era of LLMs

The algorithm & the hardware

Schreiber & Van Loan, A storage-efficient WY representation for products of Householder transformations (SIAM J. Sci. Stat. Comput., 1989) — the compact-WY form.
Haidar, Dong, Tomov, Luszczek, Dongarra — batched QR/LU/Cholesky on GPUs in MAGMA, the standard baseline for batched small-matrix factorization.
NVIDIA DGX Spark / GB10 Grace Blackwell superchip — the target platform.

This site

A transformer, by hand — Qwen3-0.6B forward pass + logit lens, the other “real model on your GPU” page.

Scorecard

Strongest & weakest

Generic or dubious

Best idea

MILESTONE 1: Problem Analysis & GB10-Native Data Layout

MILESTONE 2: Algorithmic Innovations in Factorization

MILESTONE 3: GB10 Compute Optimization & FP64 Strategy

MILESTONE 4: Heterogeneous Execution via Grace NVLink-C2C

MILESTONE 5: Real-Time Autotuning Strategy

MILESTONE 6: Correctness and Stability Validation

MILESTONE 7: Benchmarking and Ablation Methodology

Research plan: high-performance batched square compact-WY Householder QR on NVIDIA GB10 / Grace Blackwell

Target problem

1. Problem analysis

1.1 Performance bottlenecks

2. Algorithmic choice

3. Novel design ideas

Novel idea 1: fused compact-WY trailing update without materializing intermediates

Novel idea 2: cluster-resident multi-panel pipeline

Novel idea 3: layout-transforming batched QR

Novel idea 4: precision-adaptive compact-WY QR

Novel idea 5: batch-aware persistent QR scheduler

4. Detailed algorithm plan

4.1 In-place storage format

4.2 Householder panel factorization

4.3 Panel kernels by matrix size

Regime A: n <= 32

Regime B: 32 < n <= 128

Regime C: 128 < n <= 512

Regime D: n > 512

5. GB10 / Grace Blackwell-specific optimization

5.1 Tensor cores

5.2 TMA

5.3 Thread-block clusters and distributed shared memory

5.4 Occupancy strategy

5.5 Grace–Blackwell unified memory and NVLink-C2C

6. Data layout for batched square matrices

6.1 Supported layouts

Layout 1: conventional strided batch

Layout 2: pointer batch

Layout 3: proposed tile-major batch layout

6.2 Novel batch swizzling

7. Kernel architecture

7.1 Kernel family

Kernel K0: pack/unpack

Kernel K1: tiny QR

Kernel K2: small compact-WY QR

Kernel K3: cluster compact-WY QR

Kernel K4: multi-cluster QR

Kernel K5: validation / diagnostics

7.2 Fused compact-WY update kernel detail

8. Precision strategy

8.1 Default modes

Mode P0: FP32 stable mode

Mode P1: FP64-accurate mode

Mode P2: mixed FP64/TF32

Mode P3: approximate ML mode

8.2 FP64 native versus emulated strategy

9. Autotuning strategy

9.1 Parameters to tune

9.2 Autotuning workflow

Offline tuning

Online tuning

9.3 Analytical model

10. Numerical stability and correctness validation

10.1 Correctness metrics

10.2 Matrix test suite

10.3 Stability policy

10.4 Reconstructing Q

11. Benchmarking methodology

11.1 Baselines

11.2 Metrics

11.3 Benchmark grid

11.4 Ablation studies

12. Milestone plan

Milestone 0: specification and harness

Milestone 1: baseline compact-WY QR

Milestone 2: tiny and small specialized kernels

Milestone 3: fused compact-WY update

Milestone 4: packed tile-major batch layout

Regime A: `n <= 32`

Regime B: `32 < n <= 128`

Regime C: `128 < n <= 512`

Regime D: `n > 512`