Open-source transpiler for CUDA Tile (13.1) migration
https://github.com/RightNow-AI/RightNow-Tile.git
CUDA SIMT to cuTile Python Transpiler
Transform your CUDA kernels for NVIDIA Blackwell GPUs
Live Demo • Quick Start • Features • Patterns • Discord

RightNow Tile is a production-grade transpiler that converts traditional CUDA SIMT (Single Instruction, Multiple Threads) kernels into cuTile Python code, NVIDIA's new tile-based programming model optimized for Blackwell GPUs (compute capability 10.x+).

Part of the RightNow AI ecosystem: a code editor built for GPU kernel development.
NVIDIA's cuTile represents a paradigm shift in GPU programming:
| Traditional CUDA | cuTile |
|---|---|
| Thread-centric programming | Tile-centric programming |
| Manual memory coalescing | Automatic tile-based loads |
| Complex index calculations | Declarative tile operations |
| Low-level synchronization | High-level tile semantics |
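The difference in the table can be sketched in plain Python (a conceptual illustration only, not the cuTile API): the same vector add written thread-style, one index at a time, versus tile-style, operating on whole blocks at once.

```python
# Conceptual sketch (plain Python, not cuTile): thread-centric vs tile-centric
# views of the same elementwise add.

def add_thread_style(a, b, n):
    # SIMT view: each "thread" computes one element from its global index.
    c = [0.0] * n
    for idx in range(n):  # stand-in for blockIdx.x * blockDim.x + threadIdx.x
        if idx < n:
            c[idx] = a[idx] + b[idx]
    return c

def add_tile_style(a, b, n, tile=4):
    # Tile view: each program instance loads a whole tile and operates on it.
    c = [0.0] * n
    for pid in range((n + tile - 1) // tile):  # one "block" per tile
        lo, hi = pid * tile, min((pid + 1) * tile, n)
        a_t, b_t = a[lo:hi], b[lo:hi]                  # tile loads
        c[lo:hi] = [x + y for x, y in zip(a_t, b_t)]   # tile compute + store
    return c
```

The tile version has no per-element index arithmetic or bounds check in the inner body; that is the shape of code the transpiler emits.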
```shell
# Clone the repository
git clone https://github.com/RightNow-AI/RightNow-Tile.git
cd RightNow-Tile

# Install dependencies
npm install

# Start development server
npm run dev
```

Open http://localhost:3000 and start transpiling!
Automatically identifies 18 computational patterns with 60+ variant-specific optimizations:
```
┌─────────────┐      ┌────────────────┐      ┌──────────────┐
│  Your CUDA  │ ───▶ │ Pattern Match  │ ───▶ │  Optimized   │
│   Kernel    │      │  + Analysis    │      │ cuTile Code  │
└─────────────┘      └────────────────┘      └──────────────┘
```
```
CUDA Source
     │
     ▼
┌───────────────┐
│ 1. Extractor  │  Parse kernel signatures, parameters, memory accesses
└───────┬───────┘
        ▼
┌───────────────┐
│ 2. Parser     │  Recognize 150+ CUDA intrinsics & index patterns
└───────┬───────┘
        ▼
┌───────────────┐
│ 3. Semantic   │  Detect reductions, dependencies, race conditions
└───────┬───────┘
        ▼
┌───────────────┐
│ 4. Memory     │  Analyze coalescing, bank conflicts, access patterns
└───────┬───────┘
        ▼
┌───────────────┐
│ 5. Pattern    │  Match against 18 patterns with confidence scoring
└───────┬───────┘
        ▼
┌───────────────┐
│ 6. IR Build   │  Generate intermediate representation with config
└───────┬───────┘
        ▼
┌───────────────┐
│ 7. Optimize   │  Select optimal tile sizes & configurations
└───────┬───────┘
        ▼
┌───────────────┐
│ 8. CodeGen    │  Apply variant-specific templates
└───────┬───────┘
        ▼
┌───────────────┐
│ 9. Validate   │  Verify correctness & generate diagnostics
└───────────────┘
        │
        ▼
  cuTile Python
```
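Stage 5 (pattern matching with confidence scoring) can be pictured as weighted signature matching. The sketch below is hypothetical: the feature names, weights, and scoring rule are illustrative, not the project's actual matcher.

```python
# Hypothetical sketch of confidence scoring: each pattern has a weighted
# signature of code features; confidence is the weighted fraction matched.

def score_pattern(features, signature):
    """Return the weighted fraction of a signature found in the features."""
    total = sum(signature.values())
    hit = sum(w for feat, w in signature.items() if feat in features)
    return hit / total if total else 0.0

# Illustrative signatures (not the real ones in lib/patterns/matchers/).
SIGNATURES = {
    "gemm": {"nested_loop": 2, "mad_accumulate": 3, "2d_index": 2, "shared_tile": 1},
    "elementwise": {"1d_index": 2, "bounds_check": 1, "single_assign": 2},
}

def match(features, signatures):
    """Pick the best-scoring archetype and its confidence."""
    best = max(signatures, key=lambda name: score_pattern(features, signatures[name]))
    return best, score_pattern(features, signatures[best])
```

For example, a kernel exhibiting nested loops, multiply-accumulate, and 2D indexing would score 7/8 against the GEMM signature and be matched as `gemm`.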
| Pattern | Variants | Use Cases | Confidence |
|---|---|---|---|
| GEMM | naive, tiled, register_blocked | Matrix multiplication, deep learning | High |
| Reduction | tree, warp_shuffle, multi_block, segmented | Sum, max, min, dot product | High |
| Scan | inclusive, exclusive, segmented | Prefix sum, stream compaction | High |
| Stencil | 1d_3pt, 1d_5pt, 2d_5pt, 2d_9pt, 3d | Image processing, PDE solvers | High |
| Elementwise | simple, vectorized | Point-wise operations | High |
| Pattern | Variants | Use Cases | Confidence |
|---|---|---|---|
| Attention | flash_attention, flash_attention_v2, multi_head, causal, cross | Transformer models | High |
| Normalization | layernorm, rmsnorm, batchnorm, groupnorm, instancenorm | Neural network layers | High |
| Convolution | conv1d, conv2d, conv3d, depthwise, grouped, winograd, im2col | CNNs, signal processing | High |
| Pooling | max_pool_2d, avg_pool_2d, global_avg, global_max, adaptive | Feature downsampling | High |
| Embedding | lookup, embedding_bag, positional | NLP, recommender systems | Medium |
| Pattern | Variants | Use Cases | Confidence |
|---|---|---|---|
| RoPE | standard, neox, cached | Rotary position embeddings | High |
| KV Cache | append, paged, prefix, gqa | LLM inference optimization | High |
| Quantization | int8, int4, fp8, dequantize | Model compression | Medium |
| Fused | matmul_activation, matmul_bias_activation, layernorm_residual | Kernel fusion | Medium |
| Pattern | Variants | Use Cases | Confidence |
|---|---|---|---|
| FFT | radix2, radix4, radix8, inverse, real | Signal processing | High |
| Sparse | spmv_csr, spmv_csr_warp, spmv_coo, spmv_ell, spmm, sddmm | Sparse matrix operations | Medium |
| Histogram | atomic, privatized, multipass, weighted, 2d | Data distribution, statistics | Medium |
| Sorting | bitonic, bitonic_shared, radix, merge | Parallel sorting | Medium |
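As a concrete illustration of one entry above, the Reduction pattern's `tree` variant combines pairs in log2(n) rounds, the way a tiled GPU reduction does across blocks. A minimal pure-Python model (not cuTile code):

```python
# Pure-Python model of a tree reduction: halve the array each round by
# combining adjacent pairs, carrying an odd leftover element forward.

def tree_reduce(values, op):
    data = list(values)
    while len(data) > 1:
        nxt = [op(data[i], data[i + 1]) for i in range(0, len(data) - 1, 2)]
        if len(data) % 2:        # odd element has no partner this round
            nxt.append(data[-1])
        data = nxt
    return data[0]
```

The same structure serves sum, max, min, and dot product (reduce over elementwise products), which is why they share one archetype.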
Input: CUDA SIMT Kernel
```cuda
__global__ void vectorAdd(float* a, float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}
```
Output: cuTile Python
```python
import cuda_tile as ct
import cupy

TILE_SIZE = 256

@ct.kernel
def vector_add(a, b, c, n: ct.Constant[int], tile_size: ct.Constant[int]):
    """
    Elementwise kernel - auto-transpiled from CUDA
    Original: vectorAdd
    Confidence: 100%
    """
    pid = ct.bid(0)
    # Load input tiles
    a_tile = ct.load(a, index=(pid,), shape=(tile_size,))
    b_tile = ct.load(b, index=(pid,), shape=(tile_size,))
    # Compute
    result = a_tile + b_tile
    # Store result
    ct.store(c, index=(pid,), tile=result)

def launch_vector_add(a, b, c):
    """Launch the vector_add kernel"""
    n = a.shape[0]
    grid = (ct.cdiv(n, TILE_SIZE), 1, 1)
    stream = cupy.cuda.get_current_stream()
    ct.launch(stream, grid, vector_add, (a, b, c, n, TILE_SIZE))
```
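The launch wrapper sizes the grid with `ct.cdiv`, ceiling division, so one block covers each tile including a possibly partial last tile. In plain Python:

```python
def cdiv(n, d):
    """Ceiling division: number of tiles of size d needed to cover n elements."""
    return (n + d - 1) // d

TILE_SIZE = 256
n = 1000
grid = (cdiv(n, TILE_SIZE), 1, 1)  # 4 blocks cover 1000 elements (3 full + 1 partial)
```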
Input: Flash Attention CUDA Kernel
```cuda
__global__ void flash_attention_kernel(
    float* Q, float* K, float* V, float* O,
    int seq_len, int head_dim, float scale
) {
    // Complex multi-phase attention implementation
    // with online softmax and tiled matrix multiply
    ...
}
```
Output: cuTile Python (Flash Attention)
```python
import cuda_tile as ct
import cupy

BLOCK_Q = 64
BLOCK_KV = 64

@ct.kernel
def flash_attention(
    Q, K, V, O,
    seq_len_q: ct.Constant[int],
    seq_len_kv: ct.Constant[int],
    head_dim: ct.Constant[int],
    scale: ct.Constant[float],
    block_q: ct.Constant[int],
    block_kv: ct.Constant[int]
):
    """
    Flash Attention kernel - auto-transpiled from CUDA
    Confidence: 95%
    Variant: flash_attention_v2
    """
    block_q_idx = ct.bid(0)
    head_idx = ct.bid(1)

    # Initialize output accumulator and softmax stats
    acc = ct.zeros((block_q, head_dim), dtype=ct.float32)
    m_i = ct.full((block_q,), float('-inf'), dtype=ct.float32)
    l_i = ct.zeros((block_q,), dtype=ct.float32)

    # Load Q tile (stays in registers)
    q_tile = ct.load(Q, index=(head_idx, block_q_idx), shape=(block_q, head_dim))

    # Iterate over K,V blocks with online softmax
    for block_kv_idx in range(0, ct.cdiv(seq_len_kv, block_kv)):
        k_tile = ct.load(K, index=(head_idx, block_kv_idx), shape=(block_kv, head_dim))
        v_tile = ct.load(V, index=(head_idx, block_kv_idx), shape=(block_kv, head_dim))

        # QK^T with scaling
        qk = ct.tile_matmul(q_tile, ct.transpose(k_tile)) * scale

        # Online softmax update
        m_ij = ct.reduce(qk, op=ct.max, axis=1)
        m_new = ct.maximum(m_i, m_ij)
        alpha = ct.exp(m_i - m_new)
        acc = acc * alpha[:, None]
        l_i = l_i * alpha
        p = ct.exp(qk - m_new[:, None])
        l_ij = ct.reduce(p, op=ct.sum, axis=1)
        l_i = l_i + l_ij

        # Accumulate output
        acc = acc + ct.tile_matmul(p, v_tile)
        m_i = m_new

    # Normalize and store
    out = acc / l_i[:, None]
    ct.store(O, index=(head_idx, block_q_idx), tile=out)
```
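The correctness of the online-softmax update above (the `m_i` / `l_i` / `acc` rescaling) can be checked against the naive two-pass softmax in plain Python. This is a scalar reference sketch, not cuTile code, scoring a single query row against blocked keys:

```python
import math

def softmax_weighted(scores, values):
    # Reference: two-pass softmax-weighted sum over all scores at once.
    m = max(scores)
    e = [math.exp(s - m) for s in scores]
    return sum(ei * v for ei, v in zip(e, values)) / sum(e)

def online_softmax_weighted(scores, values, block=2):
    # One-pass blocked version, mirroring the kernel's per-block update:
    # rescale old accumulator/denominator by exp(m_old - m_new), then add
    # the current block's contributions.
    m_i, l_i, acc = float("-inf"), 0.0, 0.0
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        m_new = max(m_i, max(s_blk))
        alpha = math.exp(m_i - m_new)          # rescale factor (0.0 on first block)
        acc, l_i = acc * alpha, l_i * alpha
        p = [math.exp(s - m_new) for s in s_blk]
        l_i += sum(p)
        acc += sum(pi * v for pi, v in zip(p, v_blk))
        m_i = m_new
    return acc / l_i
```

Both functions compute the same quantity regardless of block size, which is exactly why the kernel never needs the full score row in memory at once.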
Use the transpiler programmatically:
```typescript
import { transpile } from './lib/transpiler';

const result = await transpile(cudaCode);

// Access results
result.tileCode             // Generated cuTile Python code
result.pattern.archetype    // Detected pattern (e.g., 'attention', 'gemm')
result.pattern.confidence   // Confidence score (0-1)
result.pattern.variant      // Specific variant (e.g., 'flash_attention_v2')
result.validation.isValid   // Validation status
result.diagnostics          // Warnings and suggestions
result.memoryAnalysis       // Memory access analysis
result.semanticAnalysis     // Semantic analysis results
```
```shell
curl -X POST http://localhost:3000/api/transpile \
  -H "Content-Type: application/json" \
  -d '{"code": "__global__ void add(float* a, float* b, float* c, int n) { ... }"}'
```
```
rightnow-tile/
├── app/
│   ├── api/transpile/            # REST API endpoint
│   ├── components/               # React components
│   │   ├── ScientificVisualization.tsx
│   │   ├── ThemeProvider.tsx
│   │   └── ThemeToggle.tsx
│   ├── page.tsx                  # Main UI
│   └── globals.css               # Styling
├── lib/
│   ├── ast/                      # AST extraction & semantic analysis
│   │   ├── extractor.ts          # Kernel parsing
│   │   ├── semantic-analyzer.ts
│   │   ├── memory-analyzer.ts
│   │   ├── phase-analyzer.ts     # Multi-phase kernel detection
│   │   └── types.ts              # 18 archetypes, 60+ variants
│   ├── parser/
│   │   └── intrinsics.ts         # 150+ CUDA intrinsics
│   ├── patterns/                 # Pattern matchers (18 patterns)
│   │   └── matchers/
│   │       ├── attention.ts      # Flash Attention, MHA
│   │       ├── fused.ts          # Fused kernels
│   │       ├── fft.ts            # FFT variants
│   │       ├── gemm.ts           # Matrix multiply
│   │       ├── reduction.ts      # Reductions
│   │       ├── scan.ts           # Prefix sums
│   │       ├── stencil.ts        # Stencil patterns
│   │       ├── sparse.ts         # Sparse matrix ops
│   │       ├── histogram.ts      # Histogram
│   │       ├── convolution.ts    # CNN convolutions
│   │       ├── sorting.ts        # Sorting algorithms
│   │       ├── pooling.ts        # Pooling layers
│   │       ├── normalization.ts  # Norm layers
│   │       ├── embedding.ts      # Embeddings
│   │       ├── rope.ts           # Rotary embeddings
│   │       ├── kvcache.ts        # KV cache ops
│   │       ├── quantization.ts   # Quantization
│   │       └── elementwise.ts
│   ├── ir/                       # Intermediate representation
│   │   ├── builder.ts            # 11 specialized IR types
│   │   ├── optimizer.ts
│   │   └── types.ts
│   ├── codegen/                  # Code generation
│   │   ├── generator.ts          # Routes to all 18 archetypes
│   │   └── templates/            # 14 template files
│   │       ├── attention.ts
│   │       ├── fused.ts
│   │       ├── sparse.ts
│   │       ├── histogram.ts
│   │       ├── convolution.ts
│   │       ├── sorting.ts
│   │       ├── pooling.ts
│   │       ├── normalization.ts
│   │       ├── embedding.ts
│   │       ├── rope.ts
│   │       ├── kvcache.ts
│   │       ├── quantization.ts
│   │       ├── reduction.ts
│   │       └── stencil.ts
│   ├── validation/               # Validation & diagnostics
│   └── transpiler.ts             # Main entry point
├── docs/                         # Documentation
└── public/                       # Static assets
```
```shell
# Build for production
npm run build

# Start production server
npm start
```

Deploy to Vercel, AWS, or any Node.js hosting platform.
We welcome contributions! Here's how to get started:
1. Create a feature branch (`git checkout -b feature/amazing-feature`)
2. Commit your changes (`git commit -m 'Add amazing feature'`)
3. Push the branch (`git push origin feature/amazing-feature`)

```shell
# Run development server
npm run dev

# Type checking
npx tsc --noEmit

# Build
npm run build
```
This project is licensed under the MIT License; see the LICENSE file for details.

RightNow AI · GPU Kernel Code Editor

Live Demo •
cuTile Docs •
Discord •
Issues

Made with ♥ by RightNow AI