RightNow Tile

CUDA SIMT to cuTile Python Transpiler
Transform your CUDA kernels for NVIDIA Blackwell GPUs

License: MIT · Next.js 16 · TypeScript · cuTile · Discord

Live Demo • Quick Start • Features • Patterns • Discord


What is RightNow Tile?

RightNow Tile is a production-grade transpiler that converts traditional CUDA SIMT (Single Instruction, Multiple Threads) kernels into cuTile Python code: NVIDIA's new tile-based programming model (CUDA 13.1), optimized for Blackwell GPUs (compute capability 10.x+).

Part of the RightNow AI ecosystem, a code editor built for GPU kernel development.


Why cuTile?

NVIDIA's cuTile represents a paradigm shift in GPU programming:

| Traditional CUDA | cuTile |
| --- | --- |
| Thread-centric programming | Tile-centric programming |
| Manual memory coalescing | Automatic tile-based loads |
| Complex index calculations | Declarative tile operations |
| Low-level synchronization | High-level tile semantics |

RightNow Tile bridges the gap: take your existing CUDA kernels and transform them for next-gen hardware.
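
To make the contrast concrete: instead of computing a per-thread index and bounds-checking it, a cuTile kernel addresses whole tiles. A minimal sketch in the same dialect as the generated examples below (full kernels appear in the Example section):

import cuda_tile as ct

@ct.kernel
def scale(x, y, tile_size: ct.Constant[int]):
    # No threadIdx arithmetic, no bounds check, no manual coalescing:
    # each block loads, transforms, and stores one whole tile.
    pid = ct.bid(0)
    t = ct.load(x, index=(pid,), shape=(tile_size,))
    ct.store(y, index=(pid,), tile=t * 2.0)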


Quick Start

# Clone the repository
git clone https://github.com/RightNow-AI/RightNow-Tile.git
cd RightNow-Tile

# Install dependencies
npm install

# Start development server
npm run dev

Open http://localhost:3000 and start transpiling!


Features

Intelligent Pattern Detection

Automatically identifies 18 computational patterns with 60+ variant-specific optimizations:

┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   Your CUDA     │ ──► │  Pattern Match   │ ──► │  Optimized      │
│   Kernel        │     │  + Analysis      │     │  cuTile Code    │
└─────────────────┘     └──────────────────┘     └─────────────────┘

9-Stage Transpilation Pipeline

CUDA Source
    │
    ▼
┌──────────────┐
│ 1. Extractor │  Parse kernel signatures, parameters, memory accesses
└──────┬───────┘
       ▼
┌──────────────┐
│ 2. Parser    │  Recognize 150+ CUDA intrinsics & index patterns
└──────┬───────┘
       ▼
┌──────────────┐
│ 3. Semantic  │  Detect reductions, dependencies, race conditions
└──────┬───────┘
       ▼
┌──────────────┐
│ 4. Memory    │  Analyze coalescing, bank conflicts, access patterns
└──────┬───────┘
       ▼
┌──────────────┐
│ 5. Pattern   │  Match against 18 patterns with confidence scoring
└──────┬───────┘
       ▼
┌──────────────┐
│ 6. IR Build  │  Generate intermediate representation with config
└──────┬───────┘
       ▼
┌──────────────┐
│ 7. Optimize  │  Select optimal tile sizes & configurations
└──────┬───────┘
       ▼
┌──────────────┐
│ 8. CodeGen   │  Apply variant-specific templates
└──────┬───────┘
       ▼
┌──────────────┐
│ 9. Validate  │  Verify correctness & generate diagnostics
└──────┬───────┘
       │
       ▼
  cuTile Python

Modern Developer Experience

  • Monaco Editor โ€” VS Code-quality editing with syntax highlighting
  • Real-time Transpilation โ€” See results instantly
  • Dark/Light Themes โ€” Easy on the eyes
  • Expandable Output โ€” Full-screen code view
  • One-Click Copy โ€” Get your code ready to deploy

Supported Patterns

Core Compute Patterns

| Pattern | Variants | Use Cases | Confidence |
| --- | --- | --- | --- |
| GEMM | naive, tiled, register_blocked | Matrix multiplication, deep learning | High |
| Reduction | tree, warp_shuffle, multi_block, segmented | Sum, max, min, dot product | High |
| Scan | inclusive, exclusive, segmented | Prefix sum, stream compaction | High |
| Stencil | 1d_3pt, 1d_5pt, 2d_5pt, 2d_9pt, 3d | Image processing, PDE solvers | High |
| Elementwise | simple, vectorized | Point-wise operations | High |
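
For instance, the Reduction pattern's multi_block variant produces one partial result per block. A minimal sketch in the same cuTile dialect as the generated examples below; the scalar reduce/store semantics here are assumptions, not verified cuTile API:

import cuda_tile as ct

TILE_SIZE = 256

@ct.kernel
def block_sum(x, partials, tile_size: ct.Constant[int]):
    # Each block reduces one tile; a second pass (or a host-side
    # sum over `partials`) finishes the reduction.
    pid = ct.bid(0)
    tile = ct.load(x, index=(pid,), shape=(tile_size,))
    s = ct.reduce(tile, op=ct.sum, axis=0)   # per-tile partial sum
    ct.store(partials, index=(pid,), tile=s)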

ML/Deep Learning Patterns

| Pattern | Variants | Use Cases | Confidence |
| --- | --- | --- | --- |
| Attention | flash_attention, flash_attention_v2, multi_head, causal, cross | Transformer models | High |
| Normalization | layernorm, rmsnorm, batchnorm, groupnorm, instancenorm | Neural network layers | High |
| Convolution | conv1d, conv2d, conv3d, depthwise, grouped, winograd, im2col | CNNs, signal processing | High |
| Pooling | max_pool_2d, avg_pool_2d, global_avg, global_max, adaptive | Feature downsampling | High |
| Embedding | lookup, embedding_bag, positional | NLP, recommender systems | Medium |

LLM/Transformer-Specific Patterns

| Pattern | Variants | Use Cases | Confidence |
| --- | --- | --- | --- |
| RoPE | standard, neox, cached | Rotary position embeddings | High |
| KV Cache | append, paged, prefix, gqa | LLM inference optimization | High |
| Quantization | int8, int4, fp8, dequantize | Model compression | Medium |
| Fused | matmul_activation, matmul_bias_activation, layernorm_residual | Kernel fusion | Medium |

Specialized Patterns

| Pattern | Variants | Use Cases | Confidence |
| --- | --- | --- | --- |
| FFT | radix2, radix4, radix8, inverse, real | Signal processing | High |
| Sparse | spmv_csr, spmv_csr_warp, spmv_coo, spmv_ell, spmm, sddmm | Sparse matrix operations | Medium |
| Histogram | atomic, privatized, multipass, weighted, 2d | Data distribution, statistics | Medium |
| Sorting | bitonic, bitonic_shared, radix, merge | Parallel sorting | Medium |

Example

Input: CUDA SIMT Kernel

__global__ void vectorAdd(float* a, float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

Output: cuTile Python

import cuda_tile as ct
import cupy

TILE_SIZE = 256

@ct.kernel
def vector_add(a, b, c, n: ct.Constant[int], tile_size: ct.Constant[int]):
    """
    Elementwise kernel - auto-transpiled from CUDA
    Original: vectorAdd
    Confidence: 100%
    """
    pid = ct.bid(0)

    # Load input tiles
    a_tile = ct.load(a, index=(pid,), shape=(tile_size,))
    b_tile = ct.load(b, index=(pid,), shape=(tile_size,))

    # Compute
    result = a_tile + b_tile

    # Store result
    ct.store(c, index=(pid,), tile=result)


def launch_vector_add(a, b, c):
    """Launch the vector_add kernel"""
    n = a.shape[0]
    grid = (ct.cdiv(n, TILE_SIZE), 1, 1)
    stream = cupy.cuda.get_current_stream()
    ct.launch(stream, grid, vector_add, (a, b, c, n, TILE_SIZE))
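
Calling the generated launcher from Python is then straightforward. A usage sketch, assuming the generated code above is importable and the arrays are device-resident CuPy arrays:

import cupy as cp

n = 1 << 20
a = cp.random.rand(n, dtype=cp.float32)
b = cp.random.rand(n, dtype=cp.float32)
c = cp.empty_like(a)

launch_vector_add(a, b, c)   # defined in the generated code above
cp.cuda.get_current_stream().synchronize()
print(c[:4], (a + b)[:4])    # should match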

Flash Attention Example

Input: Flash Attention CUDA Kernel

__global__ void flash_attention_kernel(
    float* Q, float* K, float* V, float* O,
    int seq_len, int head_dim, float scale
) {
    // Complex multi-phase attention implementation
    // with online softmax and tiled matrix multiply
    ...
}

Output: cuTile Python (Flash Attention)

import cuda_tile as ct
import cupy

BLOCK_Q = 64
BLOCK_KV = 64

@ct.kernel
def flash_attention(
    Q, K, V, O,
    seq_len_q: ct.Constant[int],
    seq_len_kv: ct.Constant[int],
    head_dim: ct.Constant[int],
    scale: ct.Constant[float],
    block_q: ct.Constant[int],
    block_kv: ct.Constant[int]
):
    """
    Flash Attention kernel - auto-transpiled from CUDA
    Confidence: 95%
    Variant: flash_attention_v2
    """
    block_q_idx = ct.bid(0)
    head_idx = ct.bid(1)

    # Initialize output accumulator and softmax stats
    acc = ct.zeros((block_q, head_dim), dtype=ct.float32)
    m_i = ct.full((block_q,), float('-inf'), dtype=ct.float32)
    l_i = ct.zeros((block_q,), dtype=ct.float32)

    # Load Q tile (stays in registers)
    q_tile = ct.load(Q, index=(head_idx, block_q_idx), shape=(block_q, head_dim))

    # Iterate over K,V blocks with online softmax
    for block_kv_idx in range(0, ct.cdiv(seq_len_kv, block_kv)):
        k_tile = ct.load(K, index=(head_idx, block_kv_idx), shape=(block_kv, head_dim))
        v_tile = ct.load(V, index=(head_idx, block_kv_idx), shape=(block_kv, head_dim))

        # QK^T with scaling
        qk = ct.tile_matmul(q_tile, ct.transpose(k_tile)) * scale

        # Online softmax update
        m_ij = ct.reduce(qk, op=ct.max, axis=1)
        m_new = ct.maximum(m_i, m_ij)
        alpha = ct.exp(m_i - m_new)
        acc = acc * alpha[:, None]
        l_i = l_i * alpha

        p = ct.exp(qk - m_new[:, None])
        l_ij = ct.reduce(p, op=ct.sum, axis=1)
        l_i = l_i + l_ij

        # Accumulate output
        acc = acc + ct.tile_matmul(p, v_tile)
        m_i = m_new

    # Normalize and store
    out = acc / l_i[:, None]
    ct.store(O, index=(head_idx, block_q_idx), tile=out)
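
The generated file would pair this kernel with a launcher, as with vector_add above. A hedged sketch following the same pattern; the (heads, seq, dim) tensor layout and the argument order are assumptions:

def launch_flash_attention(Q, K, V, O, scale):
    # One block per (query tile, head); mirrors launch_vector_add above.
    num_heads, seq_len_q, head_dim = Q.shape
    seq_len_kv = K.shape[1]
    grid = (ct.cdiv(seq_len_q, BLOCK_Q), num_heads, 1)
    stream = cupy.cuda.get_current_stream()
    ct.launch(stream, grid, flash_attention,
              (Q, K, V, O, seq_len_q, seq_len_kv, head_dim, scale,
               BLOCK_Q, BLOCK_KV))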


API Usage

Use the transpiler programmatically:

import { transpile } from './lib/transpiler';

const result = await transpile(cudaCode);

// Access results
result.tileCode              // Generated cuTile Python code
result.pattern.archetype     // Detected pattern (e.g., 'attention', 'gemm')
result.pattern.confidence    // Confidence score (0-1)
result.pattern.variant       // Specific variant (e.g., 'flash_attention_v2')
result.validation.isValid    // Validation status
result.diagnostics           // Warnings and suggestions
result.memoryAnalysis        // Memory access analysis
result.semanticAnalysis      // Semantic analysis results

REST API

curl -X POST http://localhost:3000/api/transpile \
  -H "Content-Type: application/json" \
  -d '{"code": "__global__ void add(float* a, float* b, float* c, int n) { ... }"}'
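
From Python, the same endpoint can be called with requests. A sketch assuming the JSON response mirrors the programmatic result object shown above:

import requests

cuda_code = """
__global__ void add(float* a, float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) c[idx] = a[idx] + b[idx];
}
"""

resp = requests.post(
    "http://localhost:3000/api/transpile",
    json={"code": cuda_code},
    timeout=30,
)
resp.raise_for_status()
result = resp.json()

# Field names assumed to match the programmatic API shown above
print(result["pattern"]["archetype"], result["pattern"]["confidence"])
print(result["tileCode"])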


Project Structure

rightnow-tile/
├── app/
│   ├── api/transpile/        # REST API endpoint
│   ├── components/           # React components
│   │   ├── ScientificVisualization.tsx
│   │   ├── ThemeProvider.tsx
│   │   └── ThemeToggle.tsx
│   ├── page.tsx              # Main UI
│   └── globals.css           # Styling
├── lib/
│   ├── ast/                  # AST extraction & semantic analysis
│   │   ├── extractor.ts      # Kernel parsing
│   │   ├── semantic-analyzer.ts
│   │   ├── memory-analyzer.ts
│   │   ├── phase-analyzer.ts # Multi-phase kernel detection
│   │   └── types.ts          # 18 archetypes, 60+ variants
│   ├── parser/
│   │   └── intrinsics.ts     # 150+ CUDA intrinsics
│   ├── patterns/             # Pattern matchers (18 patterns)
│   │   └── matchers/
│   │       ├── attention.ts  # Flash Attention, MHA
│   │       ├── fused.ts      # Fused kernels
│   │       ├── fft.ts        # FFT variants
│   │       ├── gemm.ts       # Matrix multiply
│   │       ├── reduction.ts  # Reductions
│   │       ├── scan.ts       # Prefix sums
│   │       ├── stencil.ts    # Stencil patterns
│   │       ├── sparse.ts     # Sparse matrix ops
│   │       ├── histogram.ts  # Histogram
│   │       ├── convolution.ts # CNN convolutions
│   │       ├── sorting.ts    # Sorting algorithms
│   │       ├── pooling.ts    # Pooling layers
│   │       ├── normalization.ts # Norm layers
│   │       ├── embedding.ts  # Embeddings
│   │       ├── rope.ts       # Rotary embeddings
│   │       ├── kvcache.ts    # KV cache ops
│   │       ├── quantization.ts # Quantization
│   │       └── elementwise.ts
│   ├── ir/                   # Intermediate representation
│   │   ├── builder.ts        # 11 specialized IR types
│   │   ├── optimizer.ts
│   │   └── types.ts
│   ├── codegen/              # Code generation
│   │   ├── generator.ts      # Routes to all 18 archetypes
│   │   └── templates/        # 14 template files
│   │       ├── attention.ts
│   │       ├── fused.ts
│   │       ├── sparse.ts
│   │       ├── histogram.ts
│   │       ├── convolution.ts
│   │       ├── sorting.ts
│   │       ├── pooling.ts
│   │       ├── normalization.ts
│   │       ├── embedding.ts
│   │       ├── rope.ts
│   │       ├── kvcache.ts
│   │       ├── quantization.ts
│   │       ├── reduction.ts
│   │       └── stencil.ts
│   ├── validation/           # Validation & diagnostics
│   └── transpiler.ts         # Main entry point
├── docs/                     # Documentation
└── public/                   # Static assets


Tech Stack

Next.js 16 · TypeScript · Tailwind CSS · Monaco Editor

Requirements

  • Node.js 18+
  • npm or yarn
  • For running generated code: NVIDIA Blackwell GPU (compute capability 10.x+)

Production Deployment

# Build for production
npm run build

# Start production server
npm start

Deploy to Vercel, AWS, or any Node.js hosting platform.


Contributing

We welcome contributions! Here's how to get started:

  • Fork the repository
  • Create your feature branch (git checkout -b feature/amazing-feature)
  • Commit your changes (git commit -m 'Add amazing feature')
  • Push to the branch (git push origin feature/amazing-feature)
  • Open a Pull Request

Development

# Run development server
npm run dev

# Type checking
npx tsc --noEmit

# Build
npm run build


Roadmap

  • Support for 18 CUDA patterns with 60+ variants
  • Flash Attention and Transformer-specific patterns
  • LLM inference patterns (RoPE, KV Cache, Quantization)
  • Comprehensive convolution support (Winograd, im2col)
  • Batch transpilation for multiple kernels
  • Performance benchmarking comparisons
  • VS Code extension integration
  • CLI tool for CI/CD pipelines
  • CUDA to Triton transpilation

License

This project is licensed under the MIT License; see the LICENSE file for details.


Links

RightNow AI · GPU Kernel Code Editor

Live Demo • cuTile Docs • Discord • Issues


Made with ♥ by RightNow AI