
xlite-dev / LeetCUDA

📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉


AI Architecture Analysis

This repository is indexed by RepoMind. By analyzing xlite-dev/LeetCUDA in our AI interface, you can instantly generate complete architecture diagrams, visualize control flows, and perform automated security audits across the entire codebase.

Our Agentic Context Augmented Generation (Agentic CAG) engine loads full source files into context, avoiding the fragmentation of traditional RAG systems. Ask questions about the architecture, dependencies, or specific features to see it in action.

Embed this Badge

Showcase RepoMind's analysis directly in your repository's README.

[![Analyzed by RepoMind](https://img.shields.io/badge/Analyzed%20by-RepoMind-4F46E5?style=for-the-badge)](https://repomind-ai.vercel.app/repo/xlite-dev/LeetCUDA)

Repository Summary (README)

<div align="center"> <p align="center"> <h2>📚 LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners 🐑</h2> <img src='https://github.com/user-attachments/assets/b2578723-b7a7-4d8f-bcd1-5008947b808a' width="360" height="56" > <a href="https://hellogithub.com/repository/98348655a96640ca8ddcbc298edc901d" target="_blank"><img src="https://api.hellogithub.com/v1/widgets/recommend.svg?rid=98348655a96640ca8ddcbc298edc901d&claim_uid=ofSCbzTmdeQk3FD&theme=dark" alt="Featured|HelloGitHub" style="width: 250px; height: 54px;" width="250" height="54" /></a> </p> <div align='center'> <img src=https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg > <img src=https://img.shields.io/badge/Language-CUDA-brightgreen.svg > <img src=https://img.shields.io/github/forks/xlite-dev/LeetCUDA.svg?style=dark > <img src=https://img.shields.io/github/stars/xlite-dev/LeetCUDA.svg?style=dark > <img src=https://img.shields.io/badge/License-GPLv3.0-turquoise.svg > </div> </div>

📚 LeetCUDA covers Tensor/CUDA Cores, TF32/F16/BF16/FP8, 📖200+ CUDA Kernels🔥 with PyTorch bindings, 📖100+ LLM/CUDA🔥 blogs, 📖HGEMM⚡️ that reaches 98%~100% of cuBLAS TFLOPS, and 📖flash-attn⚡️ built on Tensor Cores with pure MMA PTX. ♥️ Please consider leaving a ⭐️ Star to support me, bro ~ ♥️

<div align="center"> <p align="center"> <a href="#contribute">🔥🔥 PR Welcome: Add Your Kernel to LeetCUDA! Let's make it Awesome together! 🎉🎉</a> <br> <a href=https://github.com/xlite-dev/LeetCUDA/graphs/contributors > <img src=https://opencollective.com/leetcuda/contributors.svg height=40px > </a> </p> </div>

©️Citations🎉🎉

@misc{LeetCUDA@2025,
  title={LeetCUDA: A Modern CUDA Learn Notes with PyTorch for Beginners},
  url={https://github.com/xlite-dev/LeetCUDA.git},
  note={Open-source software available at https://github.com/xlite-dev/LeetCUDA.git},
  author={DefTruth and Many Others},
  year={2025}
}

📖 News 🔥🔥

<div id="news"></div>
  • [2026/02] Cache-DiT 🎉v1.2.1 is released. Major updates include: Ring Attention w/ batched P2P, USP (hybrid Ring and Ulysses), hybrid 2D and 3D parallelism (💥USP + TP), and reduced VAE-P communication overhead.
<div align='center'> <img src=https://github.com/vipshop/cache-dit/raw/main/assets/cache-dit-logo-v2.png height=320px> </div> <div align='center'> <img height="320px" alt="image" src="https://github.com/user-attachments/assets/ed30185b-2e11-4293-832f-43e9003d6ad9" /> </div>
  • [2024/12] ⚡️HGEMM is released! It shows how to write HGEMM from scratch using Tensor Cores with the WMMA, MMA and CuTe APIs, achieving peak🎉 performance.

📖 Contents

<div id="contents"></div>

📖 HGEMM Benchmark 🎉🎉

<div id="HGEMM-bench"></div>

Currently, on NVIDIA L20, RTX 4090 and RTX 3080 Laptop, the HGEMM (WMMA/MMA/CuTe) kernels in this repo (blue🔵) achieve 98%~100% of the performance of cuBLAS's default Tensor Cores algorithm (orange🟠). Please check the toy-hgemm library⚡️⚡️ or the HGEMM⚡️⚡️ repo for more details.

(figure: toy-hgemm-library benchmark)

| 📚Feature | 📚Feature | 📚Feature | 📚Feature |
|:---:|:---:|:---:|:---:|
| ✔️CUDA/Tensor Cores | ✔️Loop over K | ✔️Tile Block (BMxBK) | ✔️Tile Threads (T 8x8) |
| ✔️WMMA (m16n16k16) | ✔️MMA (m16n8k16) | ✔️Pack LDST (128 bits) | ✔️SMEM Padding |
| ✔️Copy Async | ✔️Tile MMAs | ✔️Tile Warps | ✔️Multi Stages (2~4) |
| ✔️Register Double Buffers | ✔️Block Swizzle | ✔️Warp Swizzle | ✔️SMEM Swizzle (CuTe/MMA) |
| ✔️Collective Store (Shfl) | ✔️Layout NN | ✔️Layout TN | ✔️SGEMM FP32/TF32 |
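To make the feature table above concrete, here is a minimal WMMA (m16n16k16) HGEMM sketch — an illustrative baseline only, not the tuned kernels from this repo — assuming row-major FP16 A/B/C, dimensions divisible by 16, one warp per 16x16 output tile, and compute capability >= 7.0. The tuned kernels add the remaining features (block/warp tiling, cp.async multi-stage pipelines, swizzling, collective stores, etc.):

```c++
// Minimal, illustrative WMMA (m16n16k16) HGEMM — a sketch, not this repo's tuned code.
// Assumes row-major A (MxK), B (KxN), C (MxN), with M, N, K multiples of 16,
// FP16 accumulation, and no shared-memory staging, pipelining, or swizzling.
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void hgemm_wmma_naive_kernel(const half* A, const half* B, half* C,
                                        int M, int N, int K) {
  // Each 32-thread block is a single warp computing one 16x16 tile of C.
  const int warp_m = blockIdx.y;  // tile row index
  const int warp_n = blockIdx.x;  // tile col index

  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, half> c_frag;
  wmma::fill_fragment(c_frag, __float2half(0.0f));

  // "Loop over K" in steps of the MMA atom (16).
  for (int k = 0; k < K; k += 16) {
    wmma::load_matrix_sync(a_frag, A + warp_m * 16 * K + k, K);
    wmma::load_matrix_sync(b_frag, B + k * N + warp_n * 16, N);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
  }
  wmma::store_matrix_sync(C + warp_m * 16 * N + warp_n * 16, c_frag, N,
                          wmma::mem_row_major);
}

// Launch sketch: one warp (32 threads) per block, one block per 16x16 tile of C.
// dim3 grid(N / 16, M / 16);
// hgemm_wmma_naive_kernel<<<grid, 32>>>(A, B, C, M, N, K);
```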

📖 FA2-MMA Benchmark 🎉🎉

<div id="fa-mma-bench"></div>

I have also implemented FlashAttention-2 using pure MMA PTX instructions, which supports features such as Multi-Stages, Tile MMA, Tile Warp, Shared KV SMEM, Fully Shared QKV SMEM, Prefetch Q s2r, Prefetch K/V g2s, QKV Fine-grained Tiling, Collective Store, etc. Please refer to flash-attn⚡️⚡️ for more details.

(figure: flash-attn-mma benchmark)

| 📚Feature | 📚Feature | 📚Feature | 📚Feature |
|:---:|:---:|:---:|:---:|
| ✔️Tensor Cores | ✔️Loop over N/D | ✔️Tile Block (Br, Bc) | ✔️MMA (m16n8k16) |
| ✔️Pack LDST (128 bits) | ✔️SMEM Swizzle/Padding | ✔️Copy Async | ✔️Tile MMAs |
| ✔️Tile Warps | ✔️Multi Stages (1/2) | ✔️Collective Store (Shfl) | ✔️Split KV/Q |
| ✔️Shared QKV SMEM | ✔️Prefetch Q s2r | ✔️Prefetch KV g2s | ✔️QKV Fine-grained Tiling |

Currently, for small-scale attention (B <= 4, H <= 48, SeqLen <= 8192, D <= 64), it can run faster than FA2/SDPA on some devices. For example, on NVIDIA RTX 3080 Laptop, the 📚 Split Q + Fully Shared QKV SMEM method can reach 55 TFLOPS (D=64), almost ~1.5x 🎉 faster than FA2. On NVIDIA L20, the 🤖ffpa-attn method can reach 104 TFLOPS (D=512), almost ~1.8x 🎉 faster than SDPA (EFFICIENT ATTENTION). However, for large-scale attention, a performance gap remains. Stay tuned for updates ~ (MMA Acc F16/F32, softmax Acc F32 vs FA2 MMA/softmax Acc F32; 👇Benchmark)

| Algorithm | (B,H,N,D) | RTX 3080 Laptop | L20 | RTX 4090 |
|:---|:---:|:---:|:---:|:---:|
| FlashAttention-2 | (1,8,8192,64) | 37 TFLOPS | 100 TFLOPS | 145 TFLOPS |
| share-qkv+stage2 | (1,8,8192,64) | 55 TFLOPS | 99 TFLOPS | 221 TFLOPS |
| FlashAttention-2 | (1,48,8192,64) | 37 TFLOPS | 109 TFLOPS | 163 TFLOPS |
| share-qkv+stage2 | (1,48,8192,64) | 48 TFLOPS | 107 TFLOPS | 224 TFLOPS |
| SDPA (EFFICIENT ATTENTION) | (1,48,8192,512) | 16 TFLOPS | 58 TFLOPS | 85 TFLOPS |
| 🤖ffpa-attn | (1,48,8192,512) | 39 TFLOPS | 104 TFLOPS | 200 TFLOPS |

Precision errors vs FA2/SDPA: max < ~1e-3, min ~0.0, mean < ~1e-5.
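As a sanity check on these numbers (my own arithmetic, assuming the common non-causal convention that counts only the Q@K^T and P@V matmuls and ignores the softmax), the per-forward FLOP count is

$$
\text{FLOPs} \approx 4 \cdot B \cdot H \cdot N^2 \cdot D
= 4 \cdot 1 \cdot 8 \cdot 8192^2 \cdot 64 \approx 1.37 \times 10^{11},
$$

so the 55 TFLOPS entry for (1,8,8192,64) corresponds to roughly $1.37 \times 10^{11} / (55 \times 10^{12}) \approx 2.5$ ms per forward pass.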

Both the Split KV and Split Q implementations are provided in flash-attn⚡️⚡️ for performance comparison. The Split KV method, which splits all of Q, K and V across MMAs (Warps), is slower than the Split Q method, which splits only Q across MMAs (Warps) while keeping K and V accessible to all MMAs (Warps).

  • 📚 Split KV (Basic, FlashAttention-1)
<div id="mma-split-kv"></div>
// Split QKV across MMA(Warps) using naive matmul MMA&Warp tiling policy.
// case: The layout of 8 MMA(2x4)  [after] kWarpTileSeqLenQxkWarpTileSeqLenK(2x2) -> 32x2,32x2=64x64:
// |  [64,64]  |    warp_KV 0    |    warp_KV 1    |    warp_KV 2    |    warp_KV 3    |
// | warp_QP 0 |-- MMA 0,MMA 0 --|-- MMA 2,MMA 2 --|-- MMA 4,MMA 4 --|-- MMA 6,MMA 6 --|
// | warp_QP 0 |-- MMA 0,MMA 0 --|-- MMA 2,MMA 2 --|-- MMA 4,MMA 4 --|-- MMA 6,MMA 6 --|
// | warp_QP 1 |-- MMA 1,MMA 1 --|-- MMA 3,MMA 3 --|-- MMA 5,MMA 5 --|-- MMA 7,MMA 7 --|
// | warp_QP 1 |-- MMA 1,MMA 1 --|-- MMA 3,MMA 3 --|-- MMA 5,MMA 5 --|-- MMA 7,MMA 7 --|
__global__ void // Q, K, V, O -> [B, H, N, D]
flash_attn_mma_stages_split_kv_kernel(half* Q, half* K, half* V, half* O, ...);
  • 📚 Split Q (Faster, FlashAttention-2)
<div id="mma-split-q"></div>
// Split Q across MMA(Warps) and keep access KV for all MMA(Warps),
// in order to reduce the comm between warps via smem and warp shuffle.
// case: MMA = m16n8k16, Br=16x4=64, Bc=8x8=64, layout: 4 warps
// |   64x64   |      warp_KV 0       |
// | warp_QP 0 | MMA 0 ... MMA 0 (x8) |
// | warp_QP 1 | MMA 1 ... MMA 1 (x8) |
// | warp_QP 2 | MMA 2 ... MMA 2 (x8) |
// | warp_QP 3 | MMA 3 ... MMA 3 (x8) |
__global__ void // Q, K, V, O -> [B, H, N, D]
flash_attn_mma_stages_split_q_kernel(half* Q, half* K, half* V, half* O, ...);
  • 📚 Split Q + Shared KV SMEM (1/2 SRAM vs FA2)
<div id="mma-share-kv"></div>
// K, V shared the same shared memory, improve block occupancy.
__global__ void // Q, K, V, O -> [B, H, N, D]
flash_attn_mma_stages_split_q_shared_kv_kernel(half* Q, half* K, half* V, half* O, ...);
  • 📚 Split Q + Fully Shared QKV SMEM (1/4 SRAM vs FA2)
<div id="mma-share-qkv"></div>
// Q, K, V fully shared the same shared memory and prefetch Q s2r, improve block occupancy
// and reduce Q SMEM IO-Access.
__global__ void // Q, K, V, O -> [B, H, N, D]
flash_attn_mma_stages_split_q_shared_qkv_kernel(half* Q, half* K, half* V, half* O, ...);
  • 📚 Split Q + QK Fine-grained Tiling (O(16xd) SRAM vs FA2 O(4xBrxd) SRAM, Headdim -> 1024)
<div id="mma-tiling-qk"></div>
// Fine-grained tiling at the MMA level for Q@K^T results in a constant SRAM usage of
// 64 * kMmaAtomK for Q and K. For V, the SRAM complexity is O(kMmaAtomK * d), leading to
// an overall SRAM complexity of O(kMmaAtomK * d). Consequently, this approach allows us to
// extend D (head dimension) up to 1024.
__global__ void // Q, K, V, O -> [B, H, N, D]
flash_attn_mma_stages_split_q_tiling_qk_kernel(half* Q, half* K, half* V, half* O, ...);
  • 📚 Split Q + Fully QKV Fine-grained Tiling (O(2xBrx16)~O(1) SRAM vs FA2 O(4xBrxd) SRAM)
<div id="mma-tiling-qkv"></div>
// Fine-grained tiling at the MMA level for all Q@K^T and P@V results in a constant SRAM usage of
// Br * 16 or Bc * 16 for Q, K, V, leading to an overall SRAM complexity of O(Br * 16). Consequently,
// this approach allows us to run faster than SDPA w or w/o MMA Acc F32.
__global__ void // Q, K, V, O -> [B, H, N, D]
flash_attn_mma_stages_split_q_tiling_qkv_kernel(half* Q, half* K, half* V, half* O, ...);

💡NOTE: 📚Split Q + Fully QKV Fine-grained Tiling has been refactored into 🤖ffpa-attn.
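To make the SRAM comparison above concrete (my own back-of-the-envelope arithmetic, assuming Br = Bc = 64, kMmaAtomK = 16, FP16 elements, and d = 512): an FA2-style scheme that stages full Q/K/V tiles needs about 4·Br·d halves, while QK fine-grained tiling keeps only 64·kMmaAtomK halves each for Q and K plus an O(kMmaAtomK·d) slice of V:

$$
\underbrace{4 \cdot 64 \cdot 512 \cdot 2\,\text{B}}_{\text{FA2-style tiles}} = 256\,\text{KiB}
\qquad \text{vs.} \qquad
\underbrace{2 \cdot (64 \cdot 16) \cdot 2\,\text{B} + (16 \cdot 512) \cdot 2\,\text{B}}_{\text{QK fine-grained tiling}} = 20\,\text{KiB},
$$

which is why the head dimension can be pushed up to 1024 within a single SM's shared-memory budget.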

📖 200+ CUDA Kernels 🔥🔥 (Easy -> Hard++) (©️back👆🏻)

<div id="cuda-kernel"></div>

The kernels listed here will guide you through a step-by-step progression, ranging from easy to very challenging topics. The workflow for each topic will be as follows: custom CUDA kernel implementation -> PyTorch Python bindings -> Run tests. 👉TIPS: * = Tensor Cores (WMMA, MMA, CuTe), otherwise, CUDA Cores; / = not supported; ✔️ = supported; = TODO. Contents are listed as follows:

📚 Easy and 📚 Medium sections cover operations such as element-wise, mat_trans, warp/block reduce, nms, relu, gelu, swish, layer-norm, rms-norm, online-softmax, dot-prod and embedding, along with basic usage of FP32, FP16, BF16 and FP8. 📚 Hard, 📚 Hard+ and 📚 Hard++ sections delve into advanced topics, primarily focusing on operations like sgemv, sgemm, hgemv, hgemm and flash-attention. These sections also provide numerous kernels implemented with Tensor Cores using pure MMA PTX.
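As a concrete illustration of the kernel -> PyTorch bindings -> tests workflow, here is a hypothetical sketch — the relu_f32 name and file layout are illustrative, not this repo's actual sources:

```c++
// relu_f32_ext.cu — a minimal, illustrative PyTorch CUDA extension (not from this repo).
// It can be built with torch.utils.cpp_extension.load(name="relu_f32_ext", sources=["relu_f32_ext.cu"]).
#include <torch/extension.h>

__global__ void relu_f32_kernel(const float* x, float* y, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // one element per thread
  if (i < n) y[i] = fmaxf(x[i], 0.0f);
}

torch::Tensor relu_f32(torch::Tensor x) {
  TORCH_CHECK(x.is_cuda() && x.dtype() == torch::kFloat32, "expected a CUDA float32 tensor");
  auto xc = x.contiguous();
  auto y = torch::empty_like(xc);
  const int n = static_cast<int>(xc.numel());
  const int threads = 256, blocks = (n + threads - 1) / threads;
  relu_f32_kernel<<<blocks, threads>>>(xc.data_ptr<float>(), y.data_ptr<float>(), n);
  return y;
}

// On the Python side: load the extension, compare against torch.relu(x) with
// torch.allclose(...) for correctness, and benchmark with torch.cuda.Event timings.
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("relu_f32", &relu_f32, "ReLU forward for fp32 (CUDA)");
}
```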

📚 Easy ⭐️ & Medium ⭐️⭐️ (©️back👆🏻)

<div id="cuda-kernel-easy-medium"></div>
| 📖 CUDA Kernel | 📖 Elem DType | 📖 Acc DType | 📖 Docs | 📖 Level |
|:---|:---:|:---:|:---:|:---:|
| ✔️ elementwise_f32 | f32 | / | link | ⭐️ |
| ✔️ elementwise_f32x4 | f32 | / | link | ⭐️ |
| ✔️ elementwise_f16 | f16 | / | link | ⭐️ |
| ✔️ elementwise_f16x2 | f16 | / | link | ⭐️ |
| ✔️ elementwise_f16x8 | f16 | / | link | ⭐️ |
| ✔️ elementwise_f16x8_pack | f16 | / | link | ⭐️⭐️ |
| ✔️ histogram_i32 | i32 | / | link | ⭐️ |
| ✔️ histogram_i32x4 | i32 | / | link | ⭐️ |
| ✔️ sigmoid_f32 | f32 | / | link | ⭐️ |
| ✔️ sigmoid_f32x4 | f32 | / | link | ⭐️ |
| ✔️ sigmoid_f16 | f16 | / | link | ⭐️ |
| ✔️ sigmoid_f16x2 | f16 | / | link | ⭐️ |
| ✔️ sigmoid_f16x8 | f16 | / | link | ⭐️ |
| ✔️ sigmoid_f16x8_pack | f16 | / | link | ⭐️⭐️ |
| ✔️ relu_f32 | f32 | / | link | ⭐️ |
| ✔️ relu_f32x4 | f32 | / | link | ⭐️ |
| ✔️ relu_f16 | f16 | / | link | ⭐️ |
| ✔️ relu_f16x2 | f16 | / | link | ⭐️ |
| ✔️ relu_f16x8 | f16 | / | link | ⭐️ |
| ✔️ relu_f16x8_pack | f16 | / | link | ⭐️⭐️ |
| ✔️ elu_f32 | f32 | / | link | ⭐️ |
| ✔️ elu_f32x4 | f32 | / | link | ⭐️ |
| ✔️ elu_f16 | f16 | / | link | ⭐️ |
| ✔️ elu_f16x2 | f16 | / | link | ⭐️ |
| ✔️ elu_f16x8 | f16 | / | link | ⭐️ |
| ✔️ elu_f16x8_pack | f16 | / | link | ⭐️⭐️ |
| ✔️ gelu_f32 | f32 | / | link | ⭐️ |
| ✔️ gelu_f32x4 | f32 | / | link | ⭐️ |
| ✔️ gelu_f16 | f16 | / | link | ⭐️ |
| ✔️ gelu_f16x2 | f16 | / | link | ⭐️ |
| ✔️ gelu_f16x8 | f16 | / | link | ⭐️ |
| ✔️ gelu_f16x8_pack | f16 | / | link | ⭐️⭐️ |
| ✔️ swish_f32 | f32 | / | link | ⭐️ |
| ✔️ swish_f32x4 | f32 | / | link | ⭐️ |
| ✔️ swish_f16 | f16 | / | link | ⭐️ |
| ✔️ swish_f16x2 | f16 | / | link | ⭐️ |
| ✔️ swish_f16x8 | f16 | / | link | ⭐️ |
| ✔️ swish_f16x8_pack | f16 | / | link | ⭐️⭐️ |
| ✔️ hardswish_f32 | f32 | / | link | ⭐️ |
| ✔️ hardswish_f32x4 | f32 | / | link | ⭐️ |
| ✔️ hardswish_f16 | f16 | / | link | ⭐️ |
| ✔️ hardswish_f16x2 | f16 | / | link | ⭐️ |
| ✔️ hardswish_f16x8 | f16 | / | link | ⭐️ |
| ✔️ hardswish_f16x8_pack | f16 | / | link | ⭐️⭐️ |
| ✔️ hardshrink_f32 | f32 | / | link | ⭐️ |
| ✔️ hardshrink_f32x4 | f32 | / | link | ⭐️ |
| ✔️ hardshrink_f16 | f16 | / | link | ⭐️ |
| ✔️ hardshrink_f16x2 | f16 | / | link | ⭐️ |
| ✔️ hardshrink_f16x8 | f16 | / | link | ⭐️ |
| ✔️ hardshrink_f16x8_pack | f16 | / | link | ⭐️⭐️ |
| ✔️ embedding_f32 | f32 | / | link | ⭐️ |
| ✔️ embedding_f32x4 | f32 | / | link | ⭐️ |
| ✔️ embedding_f32x4_pack | f32 | / | link | ⭐️ |
| ✔️ embedding_f16 | f16 | / | link | ⭐️ |
| ✔️ embedding_f16x2 | f16 | / | link | ⭐️ |
| ✔️ embedding_f16x8 | f16 | / | link | ⭐️ |
| ✔️ embedding_f16x8_pack | f16 | / | link | ⭐️⭐️ |
| ✔️ mat_trans_f32_col2row{2d} | f32 | / | link | ⭐️ |
| ✔️ mat_trans_f32_row2col{2d} | f32 | / | link | ⭐️ |
| ✔️ mat_trans_f32_diagonal2d | f32 | / | link | ⭐️⭐️ |
| ✔️ mat_trans_f32x4_col2row{2d} | f32 | / | link | ⭐️⭐️ |
| ✔️ mat_trans_f32x4_row2col{2d} | f32 | / | link | ⭐️⭐️ |
| ✔️ mat_trans_cute | f32 | / | link | ⭐️⭐️ |
| ✔️ warp_reduce_{all} | all | all | link | ⭐️⭐️ |
| ✔️ block_all_reduce_f32_f32 | f32 | f32 | link | ⭐️⭐️ |
| ✔️ block_all_reduce_f32x4_f32 | f32 | f32 | link | ⭐️⭐️ |
| ✔️ block_all_reduce_f16_f16 | f16 | f16 | link | ⭐️⭐️ |
| ✔️ block_all_reduce_f16_f32 | f16 | f32 | link | ⭐️⭐️ |
| ✔️ block_all_reduce_f16x2_f16 | f16 | f16 | link | ⭐️⭐️ |
| ✔️ block_all_reduce_f16x2_f32 | f16 | f32 | link | ⭐️⭐️ |
| ✔️ block_all_reduce_f16x8_pack_f16 | f16 | f16 | link | ⭐️⭐️ |
| ✔️ block_all_reduce_f16x8_pack_f32 | f16 | f32 | link | ⭐️⭐️ |
| ✔️ block_all_reduce_bf16_bf16 | bf16 | bf16 | link | ⭐️⭐️ |
| ✔️ block_all_reduce_bf16_f32 | bf16 | f32 | link | ⭐️⭐️ |
| ✔️ block_all_reduce_bf16x2_bf16 | bf16 | bf16 | link | ⭐️⭐️ |
| ✔️ block_all_reduce_bf16x2_f32 | bf16 | f32 | link | ⭐️⭐️ |
| ✔️ block_all_reduce_bf16x8_pack_bf16 | bf16 | bf16 | link | ⭐️⭐️ |
| ✔️ block_all_reduce_bf16x8_pack_f32 | bf16 | f32 | link | ⭐️⭐️ |
| ✔️ block_all_reduce_fp8_e4m3_f16 | fp8_e4m3 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ block_all_reduce_fp8_e5m2_f16 | fp8_e5m2 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ block_all_reduce_fp8_e4m3x16_pack_f16 | fp8_e4m3 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ block_all_reduce_fp8_e5m2x16_pack_f16 | fp8_e5m2 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ block_all_reduce_i8_i32 | i8 | i32 | link | ⭐️⭐️ |
| ✔️ block_all_reduce_i8x16_pack_i32 | i8 | i32 | link | ⭐️⭐️ |
| ✔️ dot_product_f32 | f32 | f32 | link | ⭐️⭐️ |
| ✔️ dot_product_f32x4 | f32 | f32 | link | ⭐️⭐️ |
| ✔️ dot_product_f16_f32 | f16 | f32 | link | ⭐️⭐️ |
| ✔️ dot_product_f16x2_f32 | f16 | f32 | link | ⭐️⭐️ |
| ✔️ dot_product_f16x8_pack_f32 | f16 | f32 | link | ⭐️⭐️ |
| ✔️ softmax_f32_per_tok | f32 | f32 | link | ⭐️⭐️ |
| ✔️ softmax_f32x4_per_tok | f32 | f32 | link | ⭐️⭐️ |
| ✔️ safe_softmax_f32_per_tok | f32 | f32 | link | ⭐️⭐️ |
| ✔️ safe_softmax_f32x4_per_tok | f32 | f32 | link | ⭐️⭐️ |
| ✔️ safe_softmax_f16_f32_per_tok | f16 | f32 | link | ⭐️⭐️ |
| ✔️ safe_softmax_f16x2_f32_per_tok | f16 | f32 | link | ⭐️⭐️ |
| ✔️ safe_softmax_f16x8_pack_f32_per_tok | f16 | f32 | link | ⭐️⭐️ |
| ✔️ online_safe_softmax_f32_per_token | f32 | f32 | link | ⭐️⭐️ |
| ✔️ online_safe_softmax_f32x4_pack_per_tok | f32 | f32 | link | ⭐️⭐️ |
| ✔️ rope_f32 | f32 | f32 | link | ⭐️⭐️ |
| ✔️ rope_f32x4_pack | f32 | f32 | link | ⭐️⭐️ |
| ✔️ layer_norm_f32 | f32 | f32 | link | ⭐️⭐️ |
| ✔️ layer_norm_f32x4 | f32 | f32 | link | ⭐️⭐️ |
| ✔️ layer_norm_f16_f16 | f16 | f16 | link | ⭐️⭐️ |
| ✔️ layer_norm_f16x2_f16 | f16 | f16 | link | ⭐️⭐️ |
| ✔️ layer_norm_f16x8_f16 | f16 | f16 | link | ⭐️⭐️ |
| ✔️ layer_norm_f16x8_pack_f16 | f16 | f16 | link | ⭐️⭐️ |
| ✔️ layer_norm_f16x8_pack_f32 | f16 | f32 | link | ⭐️⭐️ |
| ✔️ layer_norm_f16_f32 | f16 | f32 | link | ⭐️⭐️ |
| ✔️ rms_norm_f32 | f32 | f32 | link | ⭐️⭐️ |
| ✔️ rms_norm_f32x4 | f32 | f32 | link | ⭐️⭐️ |
| ✔️ rms_norm_f16_f16 | f16 | f16 | link | ⭐️⭐️ |
| ✔️ rms_norm_f16x2_f16 | f16 | f16 | link | ⭐️⭐️ |
| ✔️ rms_norm_f16x8_f16 | f16 | f16 | link | ⭐️⭐️ |
| ✔️ rms_norm_f16x8_f32 | f16 | f32 | link | ⭐️⭐️ |
| ✔️ rms_norm_f16x8_pack_f16 | f16 | f16 | link | ⭐️⭐️ |
| ✔️ rms_norm_f16x8_pack_f32 | f16 | f32 | link | ⭐️⭐️ |
| ✔️ rms_norm_f16_f32 | f16 | f32 | link | ⭐️⭐️ |
| ✔️ nms_f32 | f32 | / | link | ⭐️⭐️ |
| ✔️ merge_attn_states | f16/bf16/f32 | f32 | link | ⭐️⭐️ |
| ✔️ notes v1 (deprecated) | f32 | f32 | / | ⭐️⭐️ |
| ✔️ How to use nsys/ncu (timeline/ptx/sass) | / | / | link | ⭐️⭐️ |
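Many of the reduce/softmax/norm kernels above are built on the same warp-shuffle reduction primitive. Here is a minimal sketch of that pattern (illustrative only, not the repo's exact code), assuming blockDim.x is a multiple of 32:

```c++
// Illustrative warp/block all-reduce (sum, fp32) — a sketch of the pattern behind
// the warp_reduce_* / block_all_reduce_* kernels listed above.
#include <cuda_runtime.h>

__device__ __forceinline__ float warp_reduce_sum_f32(float val) {
  // Butterfly reduction across the 32 lanes of a warp using register shuffles.
  #pragma unroll
  for (int mask = 16; mask > 0; mask >>= 1)
    val += __shfl_xor_sync(0xffffffff, val, mask);
  return val;
}

__global__ void block_all_reduce_sum_f32_kernel(const float* x, float* y, int n) {
  __shared__ float smem[32];                         // one partial sum per warp
  const int tid  = blockIdx.x * blockDim.x + threadIdx.x;
  const int warp = threadIdx.x / 32, lane = threadIdx.x % 32;
  float sum = (tid < n) ? x[tid] : 0.0f;
  sum = warp_reduce_sum_f32(sum);                    // intra-warp reduction
  if (lane == 0) smem[warp] = sum;                   // warp leaders publish partials
  __syncthreads();
  // The first warp reduces the per-warp partials, then lane 0 accumulates globally.
  if (warp == 0) {
    sum = (lane < blockDim.x / 32) ? smem[lane] : 0.0f;
    sum = warp_reduce_sum_f32(sum);
    if (lane == 0) atomicAdd(y, sum);
  }
}
```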

📚 Hard ⭐⭐⭐️ (©️back👆🏻)

<div id="cuda-kernel-hard"></div>
| 📖 CUDA Kernel | 📖 Elem DType | 📖 Acc DType | 📖 Docs | 📖 Level |
|:---|:---:|:---:|:---:|:---:|
| ✔️ sgemv_k32_f32 | f32 | f32 | link | ⭐️⭐️⭐️ |
| ✔️ sgemv_k128_f32x4 | f32 | f32 | link | ⭐️⭐️⭐️ |
| ✔️ sgemv_k16_f32 | f32 | f32 | link | ⭐️⭐️⭐️ |
| ✔️ hgemv_k32_f16 | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemv_k128_f16x4 | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemv_k16_f16 | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ sgemm_naive_f32 | f32 | f32 | link | ⭐️⭐️ |
| ✔️ sgemm_sliced_k_f32 | f32 | f32 | link | ⭐️⭐️⭐️ |
| ✔️ sgemm_t_8x8_sliced_k_f32x4 | f32 | f32 | link | ⭐️⭐️⭐️ |
| ✔️ sgemm_t_8x8_sliced_k...bcf | f32 | f32 | link | ⭐️⭐️⭐️ |
| ✔️ sgemm_t_8x8_sliced_k...dbuf | f32 | f32 | link | ⭐️⭐️⭐️ |
| ✔️ sgemm_t_8x8_sliced_k16...dbuf | f32 | f32 | link | ⭐️⭐️⭐️ |
| ✔️ sgemm_t_8x8_sliced_k16...async | f32 | f32 | link | ⭐️⭐️⭐️ |
| ✔️ sgemm_wmma_m16n16k8...stages* | tf32 | f32 | link | ⭐️⭐️⭐️ |
| ✔️ sgemm_wmma_m16n16k8...swizzle* | tf32 | f32 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_naive_f16 | f16 | f16 | link | ⭐️⭐️ |
| ✔️ hgemm_sliced_k_f16 | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_t_8x8_sliced_k_f16x4 | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_t_8x8_sliced_k_f16x4_pack | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_t_8x8_sliced_k_f16x8_pack | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_t_8x8_sliced_k...dbuf | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_t_8/16x8...k16/32...dbuf | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_t_8/16x8...k16/32...async | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_wmma_m16n16k16...naive* | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_wmma_m16n16k16...mma4x2* | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_wmma_m16n16k16...mma4x4* | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_wmma_m16n16k16...dbuf* | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_wmma_m32n8k16....dbuf* | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_wmma_m16n16k16...stages* | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_wmma_m16n16k16...swizzle* | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_mma_m16n8k16...naive* | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_mma_m16n8k16...mma2x4* | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_mma_m16n8k16...stages* | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_mma_m16n8k16...swizzle* | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_mma_m16n8k16...swizzle{smem}* | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_mma_m16n8k16...swizzle{tn}{smem}* | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_mma_stages_swizzle{smem}...cute* | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_mma_cublas* | f16 | f16 | link | ⭐️⭐️ |

📚 Hard+ ⭐️⭐️⭐️⭐️ & Hard++ ⭐️⭐️⭐️⭐️⭐️ (©️back👆🏻)

  • 📚 FlashAttention-2 MMA (MMA Acc F32/F16, swizzle, QKV smem share, fine-grained tiling, etc.🎉)
<div id="cuda-kernel-hard-plus"></div>
| 📖 CUDA Kernel | 📖 Elem DType | 📖 Acc DType | 📖 Docs | 📖 Level |
|:---|:---:|:---:|:---:|:---:|
| ✔️ flash_attn_cute(naive) | f16 | f32 | link | ⭐️⭐️⭐️ |
| ✔️ How to implement MMA smem swizzle* | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ flash_attn_mma_stages_split_kv* | f16 | f16 | link | ⭐️⭐️⭐️⭐️ |
| ✔️ flash_attn_mma_stages_split_q* | f16 | f16 | link | ⭐️⭐️⭐️⭐️ |
| ✔️ flash_attn_mma_stages...shared_kv* | f16 | f16 | link | ⭐️⭐️⭐️⭐️ |
| ✔️ flash_attn_mma_stages...shared_qkv* | f16 | f16 | link | ⭐️⭐️⭐️⭐️ |
| ✔️ flash_attn_mma_stages...tiling_qk* | f16 | f16 | link | ⭐️⭐️⭐️⭐️ |
| ✔️ flash_attn_mma_stages...tiling_qkv* | f16 | f16 | link | ⭐️⭐️⭐️⭐️ |
| ✔️ flash_attn_mma_stages...shared_kv{f32}* | f16 | f32 | link | ⭐️⭐️⭐️⭐️ |
| ✔️ flash_attn_mma_stages...shared_qkv{f32}* | f16 | f32 | link | ⭐️⭐️⭐️⭐️ |
| ✔️ flash_attn_mma_stages...tiling_qk{f32}* | f16 | f32 | link | ⭐️⭐️⭐️⭐️ |
| ✔️ flash_attn_mma_stages...tiling_qkv{f32}* | f16 | f32 | link | ⭐️⭐️⭐️⭐️ |
| ✔️ flash_attn_mma...shared_kv{f32}{rr}* | f16 | f32 | link | ⭐️⭐️⭐️⭐️ |
| ✔️ flash_attn_mma...shared_qkv{f32}{rr}* | f16 | f32 | link | ⭐️⭐️⭐️⭐️ |
| ✔️ flash_attn_mma...shared_kv_swizzle{q}* | f16 | f16 | link | ⭐️⭐️⭐️⭐️ |
| ✔️ flash_attn_mma...shared_kv_swizzle{qk}* | f16 | f16 | link | ⭐️⭐️⭐️⭐️ |
| ✔️ flash_attn_mma...shared_kv_swizzle{qkv}* | f16 | f16 | link | ⭐️⭐️⭐️⭐️ |
| ✔️ flash_attn_mma...shared_qkv_swizzle{q}* | f16 | f16 | link | ⭐️⭐️⭐️⭐️ |
| ✔️ flash_attn_mma...shared_qkv_swizzle{qk}* | f16 | f16 | link | ⭐️⭐️⭐️⭐️ |
| ✔️ flash_attn_mma...shared_qkv_swizzle{qkv}* | f16 | f16 | link | ⭐️⭐️⭐️⭐️ |
| ✔️ flash_attn_mma...tiling_qk_swizzle{q}* | f16 | f16 | link | ⭐️⭐️⭐️⭐️ |
| ✔️ flash_attn_mma...tiling_qk_swizzle{qk}* | f16 | f16 | link | ⭐️⭐️⭐️⭐️ |
| ✔️ flash_attn_mma...tiling_qk_swizzle{qkv}* | f16 | f16 | link | ⭐️⭐️⭐️⭐️ |
| ✔️ flash_attn_mma...tiling_qkv_swizzle{q}* | f16 | f16 | link | ⭐️⭐️⭐️⭐️ |
| ✔️ flash_attn_mma...tiling_qkv_swizzle{qk}* | f16 | f16 | link | ⭐️⭐️⭐️⭐️ |
| ✔️ flash_attn_mma...tiling_qkv_swizzle{qkv}* | f16 | f16 | link | ⭐️⭐️⭐️⭐️ |
| ✔️ flash_attn...tiling_qkv_swizzle{q}{f32}* | f16 | f32 | link | ⭐️⭐️⭐️⭐️ |
| ✔️ flash_attn...tiling_qkv_swizzle{qk}{f32}* | f16 | f32 | link | ⭐️⭐️⭐️⭐️ |
| ✔️ flash_attn...tiling_qkv_swizzle{qkv}{f32}* | f16 | f32 | link | ⭐️⭐️⭐️⭐️ |

💡NOTE: rr means reduced register usage (for d > 128); f32 means the MMA accumulates in FP32, otherwise FP16 (the softmax Acc dtype is always FP32 for high precision); swizzle currently only supports SMEM swizzle for MMA.
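For intuition, a generic XOR-based SMEM swizzle (an assumed, illustrative pattern — not copied from this repo's kernels) permutes which 16-byte chunk of a row a thread touches based on the row index, so that ldmatrix/MMA accesses stay bank-conflict free without SMEM padding:

```c++
// Illustrative only: view each 64-half (128 B) SMEM row as 8 chunks of 8 halves (16 B)
// and XOR the chunk index with the row index (period 8). Different rows then map the
// same logical column to different bank groups, avoiding conflicts without padding.
__device__ __forceinline__ int swizzle_smem_offset(int row, int col) {
  int chunk = col >> 3;              // which 16-byte chunk within the row
  int lane  = col & 7;               // element index within that chunk
  int swz   = chunk ^ (row & 7);     // XOR swizzle pattern
  return row * 64 + swz * 8 + lane;  // assumes a 64-half row pitch
}
```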

  • 📚 FFPA Attention MMA (1.8x~3x🎉 faster than SDPA EA for D > 256, where FA2 is not supported)
| 📖 CUDA Kernel | 📖 Elem DType | 📖 Acc DType | 📖 Docs | 📖 Level |
|:---|:---:|:---:|:---:|:---:|
| ✔️ ffpa_mma_stages_split_q_L1_F16F16F16 | f16 | f16 | link | ⭐️⭐️⭐️⭐️ |
| ✔️ ffpa_mma_stages_split_q_L1_F16F16F32 | f16 | f32 | link | ⭐️⭐️⭐️⭐️ |
| ✔️ ffpa_mma_stages_split_q_L1_mixed_acc | f16 | QK f32, PV f16 | link | ⭐️⭐️⭐️⭐️ |
| ⚠️ ffpa_mma_stages_split_q_L2_F16F16F16 | f16 | f16 | link | ⭐️⭐️⭐️⭐️ |
| ⚠️ ffpa_mma_stages_split_q_L2_F16F16F32 | f16 | f32 | link | ⭐️⭐️⭐️⭐️ |
| ⚠️ ffpa_mma_stages_split_q_L2_mixed_acc | f16 | QK f32, PV f16 | link | ⭐️⭐️⭐️⭐️ |
| ⚠️ ffpa_mma_stages_split_q_L3_F16F16F16 | f16 | f16 | link | ⭐️⭐️⭐️⭐️ |
| ⚠️ ffpa_mma_stages_split_q_L3_F16F16F32 | f16 | f32 | link | ⭐️⭐️⭐️⭐️ |
| ⚠️ ffpa_mma_stages_split_q_L3_mixed_acc | f16 | QK f32, PV f16 | link | ⭐️⭐️⭐️⭐️ |

💡NOTE: 🤖ffpa-attn: 📚FFPA - Yet another Faster Flash Prefill Attention with O(1)🎉SRAM complexity for headdim > 256, 1.8x~3x🎉faster than SDPA EA: 📈L20 ~1.9x↑🎉, 📈 A30 ~1.8x↑🎉, 📈3080 ~2.9x↑🎉, 📈4090 ~2.1x↑🎉.

📚 Triton Kernel (OpenAI Triton) ⭐️⭐️⭐️ (©️back👆🏻)

<div id="triton-kernel"></div>
| 📖 Triton Kernel | 📖 Elem DType | 📖 Acc DType | 📖 Docs | 📖 Level |
|:---|:---:|:---:|:---:|:---:|
| ✔️ triton_vector_add_kernel | all | all | link | ⭐️⭐️ |
| ✔️ triton_fused_softmax (multi-stages) | f16/bf16/f32 | f32 | link | ⭐️⭐️⭐️ |
| ✔️ triton_fused_layer_norm (forward-pass) | f16/bf16/f32 | f32 | link | ⭐️⭐️⭐️ |
| ✔️ triton_fused_layer_norm (backward-pass) | f16/bf16/f32 | f32 | link | ⭐️⭐️⭐️ |
| ✔️ triton_merge_attn_states_kernel (w/ CUDA) | f16/bf16/f32 | f32 | link | ⭐️⭐️⭐️ |

📚 CUTLASS/CuTe Kernel ⭐️⭐️⭐️ (©️back👆🏻)

<div id="cutlass-kernel"></div>
| 📖 CUTLASS/CuTe Kernel | 📖 Elem DType | 📖 Acc DType | 📖 Docs | 📖 Level |
|:---|:---:|:---:|:---:|:---:|
| ✔️ mat_transpose_cute | f32 | / | link | ⭐️⭐️ |
| ✔️ flash_attn_cute(naive) | f16 | f32 | link | ⭐️⭐️⭐️ |
| ✔️ hgemv_f16_cute_kernel | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemv_f16x8_cute_kernel | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemv_tensor_core_cute_kernel | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_mma_stages_swizzle{smem}...cute* | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ ws_hgemm_naive_cute_kernel | f16 | f16 | link | ⭐️⭐️⭐️ |

📖 100+ High-Performance Computing & Distributed Systems Tech Blogs

<div id="my-blogs-part-1"></div>

📚 HPC & Distributed Systems: My Personal Tech Column (©️back👆🏻)

📖 类型-标题📖 作者📖 推荐
[Diffusion推理]📖简短的2025年总结,写在Cache-DiT v1.2.1之际@DefTruth⭐️⭐️
[Diffusion推理]📖CacheDiT支持Z-Image分布式推理和缓存加速​​@DefTruth⭐️⭐️
[Diffusion推理]📖cache-dit支持FLUX.2分布式推理和Cache@DefTruth⭐️⭐️
[Diffusion推理]📖Cache加速-FoCa公式理解记录@DefTruth⭐️⭐️⭐
[Diffusion推理]📖cache-dit: BlockAdapter支持HunyuanImage-2.1 Cache加速!@DefTruth⭐️⭐️⭐
[Diffusion推理]📖cache-dit + Qwen-Image-Lightning 实现 3.5 steps 推理!@DefTruth⭐️⭐️⭐
[Diffusion推理]📖cache-dit: Wan2.2-MoE 2.4x 推理加速!@DefTruth⭐️⭐️⭐
[Diffusion推理]📖cache-dit: Qwen-Image-Edit 2x 无损加速!@DefTruth⭐️⭐️⭐
[Diffusion推理]📖cache-dit: Qwen-Image 1.5x 无损加速!@DefTruth⭐️⭐️⭐
[Diffusion推理]📖Cache加速-TaylorSeer算法简析@DefTruth⭐️⭐️⭐
[Diffusion推理]📖DiT推理加速综述: Caching@DefTruth⭐️⭐️⭐
[Triton编程][基础]📖Triton极简入门: Triton Vector Add@DefTruth⭐️⭐️⭐
[Triton编程][基础]📖Triton Fused Softmax Kernel详解: 从Python源码到PTX@DefTruth⭐️⭐️⭐
[Triton编程][基础]📖vLLM Triton Merge Attention States Kernel详解@DefTruth⭐️⭐️⭐
[Triton编程][进阶]📖vLLM Prefix Prefill Triton Kernel图解@DefTruth⭐️⭐️⭐️
[张量/序列并行]📖序列并行: BPT、Ring-Attention及Striped-Attention笔记@DefTruth⭐️⭐️⭐
[vLLM实践][算子]📖vLLM算子开发流程:”保姆级“详细记录@DefTruth⭐️⭐️⭐
[vLLM实践][万字]📖vLLM + DeepSeek-R1 671B 多机部署及修Bug笔记@DefTruth⭐️⭐️⭐
[Attention优化]📖FFPA(Split-D): FA2无限HeadDim扩展,2x↑🎉 vs SDPA EA@DefTruth⭐️⭐️⭐️
[CUDA基础][开篇]📖LeetCUDA: v3.0 大升级-面试刷题不迷路@DefTruth⭐️⭐️⭐⭐️
[分布式训推][张量/序列并行]📖图解DeepSpeed-Ulysses&Megatron-LM TP/SP@DefTruth⭐️⭐️
[VLM推理优化][InternVL系列]📖InternLM2/.../InternVL1.5系列笔记: 核心点解析@DefTruth⭐️⭐️
[LLM推理优化][TensorRT-LLM][5w字]📖TensorRT-LLM部署调优-指北@DefTruth⭐️⭐️⭐️
[LLM推理优化][KV Cache优化]📖GQA/YOCO/CLA/MLKV: 层内和层间KV Cache共享@DefTruth⭐️⭐️
[LLM推理优化][Prefill优化][万字]📖图解vLLM Automatic Prefix Caching: TTFT优化@DefTruth⭐️⭐️⭐️
[LLM推理优化][Attention优化]📖图解:从Online-Softmax到FlashAttention V1/V2/V3@DefTruth⭐️⭐️⭐️
[LLM推理优化][Decoding优化]📖原理&图解FlashDecoding/FlashDecoding++@DefTruth⭐️⭐️
[VLM推理优化][LLaVA系列]📖CLIP/LLaVA/LLaVA1.5/VILA笔记: 核心点解析@DefTruth⭐️⭐️
[LLM推理优化][Attention优化][万字]📖TensorRT MHA/Myelin vs FlashAttention-2@DefTruth⭐️⭐️⭐️
[LLM推理优化][PTX汇编]📖CUDA 12 PTX汇编: PRMT指令详解-通用模式@DefTruth⭐️
[LLM推理优化][PTX汇编]📖CUDA 12 PTX汇编: LOP3指令详解@DefTruth⭐️
[LLM推理优化][CUDA][3w字]📖高频面试题汇总-大模型手撕CUDA@DefTruth⭐️⭐️⭐️
[LLM推理优化][Weight Only]📖WINT8/4-(00): 通俗易懂讲解-快速反量化算法@DefTruth⭐️⭐️
[LLM推理优化][Weight Only]📖WINT8/4-(01): PRMT指令详解及FT源码解析@DefTruth⭐️⭐️
[LLM推理优化][Weight Only]📖WINT8/4-(02): 快速反量化之INT8转BF16@DefTruth⭐️⭐️
[LLM推理优化][Weight Only]📖WINT8/4-(03): LOP3指令详解及INT4转FP16/BF16@DefTruth⭐️⭐️
[LLM推理优化][LLM Infra整理]📖100+篇: 大模型推理各方向新发展整理@DefTruth⭐️⭐️
[LLM推理优化][LLM Infra整理]📖30+篇: LLM推理论文集-500页PDF@DefTruth⭐️⭐️
[LLM推理优化][LLM Infra整理]📖FlashDecoding++: 比FlashDecoding还要快!@DefTruth⭐️
[LLM推理优化][LLM Infra整理]📖TensorRT-LLM开源,TensorRT 9.1也来了@DefTruth⭐️
[LLM推理优化][LLM Infra整理]📖20+篇: LLM推理论文集-300页PDF@DefTruth⭐️⭐️
[LLM推理优化][LLM Infra整理]📖PagedAttention论文新鲜出炉@DefTruth⭐️
[推理部署][CV/NLP]📖FastDeploy三行代码搞定150+ CV、NLP模型部署@DefTruth⭐️
[推理部署][CV]📖如何在lite.ai.toolkit(3.6k+ stars)中增加您的模型?@DefTruth⭐️⭐️
[推理部署][CV]📖美团 YOLOv6 ORT/MNN/TNN/NCNN C++推理部署@DefTruth⭐️⭐️
[推理部署][ONNX]📖ONNX推理加速技术文档-杂记@DefTruth⭐️
[推理部署][TensorFlow]📖Mac源码编译TensorFlow C++指北@DefTruth⭐️
[推理部署][CV]📖1Mb!头部姿态估计: FSANet,一个小而美的模型(C++)@DefTruth⭐️
[推理部署][CV]📖opencv+ffmpeg编译打包全解指南@DefTruth⭐️⭐️
[推理部署][CV]📖RobustVideoMatting视频抠图静态ONNX模型转换@DefTruth⭐️
[推理部署][CV]📖190Kb!SSRNet年龄检测详细解读(含C++工程)@DefTruth⭐️
[推理部署][CV]📖MGMatting(CVPR2021)人像抠图C++应用记录@DefTruth⭐️
[推理部署][CV]📖超准确人脸检测(带关键点)YOLO5Face C++工程详细记录@DefTruth⭐️⭐️
[推理部署][ORT]📖解决: ONNXRuntime(Python) GPU 部署配置记录@DefTruth⭐️
[推理部署][CV]📖记录SCRFD(CVPR2021)人脸检测C++工程化(含docker镜像)@DefTruth⭐️⭐️
[推理部署][NCNN]📖野路子:记录一个解决onnx转ncnn时op不支持的trick@DefTruth⭐️
[推理部署][CV]📖升级版NanoDet-Plus MNN/TNN/NCNN/ORT C++工程记录@DefTruth⭐️⭐️
[推理部署][CV]📖超轻量级NanoDet MNN/TNN/NCNN/ORT C++工程记录@DefTruth⭐️
[推理部署][CV]📖详细记录MGMatting之MNN、TNN和ORT C++移植@DefTruth⭐️⭐️
[推理部署][CV]📖YOLOX NCNN/MNN/TNN/ONNXRuntime C++工程简记@DefTruth⭐️
[推理部署][TNN]📖手动修改YoloX的tnnproto记录-TNN@DefTruth⭐️
[推理部署][ORT]📖全网最详细 ONNXRuntime C++/Java/Python 资料!@DefTruth⭐️
[推理部署][CV]📖RobustVideoMatting: C++工程化记录-实现篇@DefTruth⭐️⭐️
[推理部署][CV]📖RobustVideoMatting: C++工程化记录-应用篇@DefTruth⭐️⭐️
[推理部署][ORT]📖ONNXRuntime C++ CMake 工程分析及编译@DefTruth⭐️⭐️
[推理部署][ORT]📖如何使用ORT C++ API处理NCHW和NHWC输入?@DefTruth⭐️
[推理部署][TNN]📖tnn-convert搭建简记-YOLOP转TNN@DefTruth⭐️
[推理部署][CV]📖YOLOP ONNXRuntime C++工程化记录@DefTruth⭐️⭐️
[推理部署][NCNN]📖超有用NCNN参考资料整理@DefTruth⭐️
[推理部署][MNN]📖超有用MNN参考资料整理@DefTruth⭐️
[推理部署][TNN]📖超有用TNN参考资料整理@DefTruth⭐️
[推理部署][ONNX]📖超有用ONNX参考资料整理@DefTruth⭐️
[推理部署][ONNX]📖超有用ONNX模型结构参考资料整理@DefTruth⭐️
[推理部署][OpenCV-DNN]📖超有用OpenCV-DNN参考资料整理@DefTruth⭐️
[推理部署][Tensorflow]📖超有用Tensorflow C++工程化知识点@DefTruth⭐️
[推理部署][模型转换]📖深度学习模型转换资料整理@DefTruth⭐️
[技术随笔][C++][CMake]📖超有用CMake参考资料整理@DefTruth⭐️⭐️
[技术随笔][C++][3W字]📖静态链接和静态库实践指北-原理篇@DefTruth⭐️⭐️⭐️
[技术随笔][C++]📖Mac下C++内存检查指北(Valgrind VS Asan)@DefTruth⭐️
[技术随笔][CV]📖torchlm: 人脸关键点检测库@DefTruth⭐️⭐️
[技术随笔][ML]📖《统计学习方法-李航: 笔记-从原理到实现-基于R》@DefTruth⭐️⭐️
[技术随笔][Git]📖如何优雅地git clone和git submodule?@DefTruth⭐️
[技术随笔][3D]📖人脸重建3D参考资料整理@DefTruth⭐️
[技术随笔][3D]📖BlendShapes参考资料整理@DefTruth⭐️
[技术随笔][3D]📖从源码安装Pytorch3D详细记录及学习资料@DefTruth⭐️
[技术随笔][ML]📖200页:《统计学习方法:李航》笔记 -从原理到实现@DefTruth⭐️⭐️

📚 Recommended HPC & Distributed Systems Tech Blogs (©️back👆🏻)

<div id="other-blogs"></div>

💡Note: This subsection collects some articles I particularly like. PRs recommending more great articles are welcome!

📖 类型-标题📖 作者📖 推荐
[cute系列详解][入门]📖cutlass cute 101@朱小霖⭐️⭐️⭐️
[cute系列详解][入门]📖CUTLASS 2.x & CUTLASS 3.x Intro 学习笔记@BBuf⭐️⭐️⭐️
[cute系列详解][入门]📖写给大家看的 CuTe 教程:tiled copy@竹熙佳处⭐️⭐️⭐️
[cute系列详解][入门]📖写给大家看的 CuTe 教程:tiled mma@竹熙佳处⭐️⭐️⭐️
[cute系列详解][入门]📖写给大家看的 CuTe 教程:Layout Compose & Inverse@竹熙佳处⭐️⭐️⭐️
[cute系列详解][入门]📖写给大家看的 CuTe 教程: Layout Product & Divide@竹熙佳处⭐️⭐️⭐️
[cute系列详解][入门]📖写给大家看的 CuTe 教程:TMA Copy@竹熙佳处⭐️⭐️⭐️
[cute系列详解][入门]📖写给进阶开发的 CuTe 笔记:permutationMNK 参数@竹熙佳处⭐️⭐️⭐️
[cute系列详解][Layout]📖cute 之 Layout@reed⭐️⭐️⭐️
[cute系列详解][Layout]📖cute Layout 的代数和几何解释@reed⭐️⭐️⭐️
[cute系列详解][Tensor]📖cute 之 Tensor@reed⭐️⭐️⭐️
[cute系列详解][MMA]📖cute 之 MMA抽象@reed⭐️⭐️⭐️
[cute系列详解][Copy]📖cute 之 Copy抽象@reed⭐️⭐️⭐️
[cute系列详解][Swizzle]📖cute 之 Swizzle@reed⭐️⭐️⭐️
[cute系列详解][Swizzle]📖cute Swizzle细谈@进击的Killua⭐️⭐️⭐️
[cute系列详解][Swizzle]📖cutlass swizzle机制解析(一)@Titus⭐️⭐️⭐️
[cute系列详解][Swizzle]📖cutlass swizzle机制解析(二)@Titus⭐️⭐️⭐️
[cute系列详解][Swizzle]📖CUDA避免smem bank conflict的swizzle机制解析@frankshi⭐️⭐️⭐️
[cute系列详解][Swizzle]📖布局代数实战:Swizzle自动推导@melonedo⭐️⭐️⭐️
[cute系列详解][GEMM]📖cute 之 简单GEMM实现@reed⭐️⭐️⭐️
[cute系列详解][GEMM]📖cute 之 GEMM流水线@reed⭐️⭐️⭐️
[cute系列详解][GEMM]📖cute 之 高效GEMM实现@reed⭐️⭐️⭐️
[cute系列详解][GEMM]📖GEMM流水线: single/multi-stage、pipeline@Titus⭐️⭐️⭐️
[cute系列详解][GEMM]📖GEMM细节分析(一): ldmatrix的选择@Anonymous⭐️⭐️⭐️
[cute系列详解][GEMM]📖GEMM细节分析(二): TiledCopy与cp.async@Anonymous⭐️⭐️⭐️
[cute系列详解][GEMM]📖GEMM细节分析(三): Swizzle<B,M,S>参数取值@Anonymous⭐️⭐️⭐️
[cute系列详解][实践]📖Hopper Mixed GEMM的CUTLASS实现笔记@BBuf⭐️⭐️⭐️
[cute系列详解][实践]📖CUTLASS CuTe实战(一): 基础@进击的Killua⭐️⭐️⭐️
[cute系列详解][实践]📖CUTLASS CuTe实战(二): 应用@进击的Killua⭐️⭐️⭐️
[cute系列详解][实践]📖FlashAttention fp8实现(ada架构)@shengying.wei⭐️⭐️⭐️
[cute系列详解][实践]📖FlashAttention 笔记: tiny-flash-attention解读@shengying.wei⭐️⭐️⭐️
[cute系列详解][实践]📖使用cutlass cute复现flash attention@66RING⭐️⭐️⭐️
[cutlass教程][入门]📖cutlass 基本认知@JoeNomad⭐️⭐️⭐️
[cutlass教程][入门]📖cutlass 软件架构@JoeNomad⭐️⭐️⭐️
[cutlass教程][入门]📖CUTLASS 基础介绍@进击的Killua⭐️⭐️⭐️
[cutlass教程][入门]📖乱谈CUTLASS GTC2020 SLIDES@zzk again⭐️⭐️⭐️
[cutlass教程][深入]📖cutlass block swizzle 和 tile iterator@JoeNomad⭐️⭐️⭐️
[cutlass教程][深入]📖cutlass bank conflict free的smem layout@JoeNomad⭐️⭐️⭐️
[cutlass教程][深入]📖cutlass 多级流水线@JoeNomad⭐️⭐️⭐️
[GPU指令集架构][精解]📖NVidia GPU指令集架构-前言@reed⭐️⭐️⭐️
[GPU指令集架构][精解]📖NVidia GPU指令集架构-寄存器@reed⭐️⭐️⭐️
[GPU指令集架构][精解]📖NVidia GPU指令集架构-Load和Cache@reed⭐️⭐️⭐️
[GPU指令集架构][精解]📖NVidia GPU指令集架构-浮点运算@reed⭐️⭐️⭐️
[GPU指令集架构][精解]📖NVidia GPU指令集架构-整数运算@reed⭐️⭐️⭐️
[GPU指令集架构][精解]📖NVidia GPU指令集架构-比特和逻辑操作@reed⭐️⭐️⭐️
[GPU指令集架构][精解]📖NVidia GPU指令集架构-Warp级和Uniform操作@reed⭐️⭐️⭐️
[CUDA优化][入门]📖CUDA 入门的正确姿势:how-to-optimize-gemm@白牛⭐️⭐️⭐️
[CUDA优化][入门]📖CUDA(一):CUDA 编程基础@紫气东来⭐️⭐️⭐️
[CUDA优化][入门]📖CUDA(二):GPU的内存体系及其优化指南@紫气东来⭐️⭐️⭐️
[CUDA优化][实践]📖CUDA(三):通用矩阵乘法:从入门到熟练@紫气东来⭐️⭐️⭐️
[CUDA优化][实践]📖ops(1):LayerNorm 算子的 CUDA 实现与优化@紫气东来⭐️⭐️⭐️
[CUDA优化][实践]📖ops(2):SoftMax算子的 CUDA 实现@紫气东来⭐️⭐️⭐️
[CUDA优化][实践]📖ops(3):Cross Entropy 的 CUDA 实现@紫气东来⭐️⭐️⭐️
[CUDA优化][实践]📖ops(4):AdamW 优化器的 CUDA 实现@紫气东来⭐️⭐️⭐️
[CUDA优化][实践]📖ops(5):激活函数与残差连接的 CUDA 实现@紫气东来⭐️⭐️⭐️
[CUDA优化][实践]📖ops(6):embedding 层与 LM head 层的 CUDA 实现@紫气东来⭐️⭐️⭐️
[CUDA优化][实践]📖ops(7):self-attention 的 CUDA 实现及优化 (上)@紫气东来⭐️⭐️⭐️
[CUDA优化][实践]📖ops(8):self-attention 的 CUDA 实现及优化 (下)@紫气东来⭐️⭐️⭐️
[CUDA优化][实践]📖CUDA(四):使用 CUDA 实现 Transformer 结构@紫气东来⭐️⭐️⭐️
[CUDA优化][Copy]📖Async Copy及Memory Barrier指令的功能与实现@Frank Wang⭐️⭐️⭐️
[CUDA优化][GEMV]📖深入浅出GPU优化系列:gemv优化@有了琦琦的棍子⭐️⭐️⭐️
[CUDA优化][实践]📖CUDA element-wise 算子详解@懒蚂蚁呀不嘿⭐️⭐️⭐️
[CUDA优化][实践]📖CUDA transpose 算子详解@懒蚂蚁呀不嘿⭐️⭐️⭐️
[CUDA优化][实践]📖CUDA reduce 算子详解@懒蚂蚁呀不嘿⭐️⭐️⭐️
[CUDA优化][实践]📖CUDA GEMM 算子详解@懒蚂蚁呀不嘿⭐️⭐️⭐️
[Tensor Cores]📖Nvidia Tensor Core初探@木子知⭐️⭐️⭐️
[Tensor Cores]📖Nvidia Tensor Core-WMMA API编程入门@木子知⭐️⭐️⭐️
[Tensor Cores]📖Nvidia Tensor Core-MMA PTX编程入门@木子知⭐️⭐️⭐️
[Tensor Cores]📖CUDA Ampere Tensor Core HGEMM 矩阵乘法优化@nicholaswilde⭐️⭐️⭐️
[GPU通信架构][精解]📖NVIDIA GPGPU(四)- 通信架构@Bruce⭐️⭐️⭐️
[torch.compile][原理]📖Torch.compile流程解析: 介绍@StarCap⭐️⭐️⭐️
[torch.compile][原理]📖Torch.compile流程解析: TorchDynamo@StarCap⭐️⭐️⭐️
[torch.compile][原理]📖Torch.compile流程解析: AOTAutograd@StarCap⭐️⭐️⭐️
[torch.compile][原理]📖Torch.compile流程解析: TorchInductor@StarCap⭐️⭐️⭐️
[torch.compile][原理]📖Torch.compile流程解析: 算子融合@StarCap⭐️⭐️⭐️
[torch.compile][实践]📖Torch.compile使用指南@jhang⭐️⭐️⭐️
[torch.compile][实践]📖Torch.compile详细示例解析教程@Bbuf⭐️⭐️⭐️
[torch.compile][原理]📖一文搞懂TorchDynamo原理@吾乃阿尔法⭐️⭐️⭐️
[torch.compile][原理]📖理解torch.compile基本原理和使用方式@俯仰⭐️⭐️⭐️

©️License (©️back👆🏻)

<div id="License"></div>

GNU General Public License v3.0

🎉Contribute (©️back👆🏻)

<div id="contribute"></div>

How to contribute? Star this repo or check 🌤🌤CONTRIBUTE🎉🎉.

<div align='center'> <a href="https://star-history.com/#xlite-dev/LeetCUDA&Date"> <picture> <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=xlite-dev/LeetCUDA&type=Date&theme=dark" /> <source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/svg?repos=xlite-dev/LeetCUDA&type=Date" /> <img width=400 height=300 alt="Star History Chart" src="https://api.star-history.com/svg?repos=xlite-dev/LeetCUDA&type=Date" /> </picture> </a> </div>

📖 References (©️back👆🏻)

<div id="ref"></div>