Please turn JavaScript on

hgpu

Want to stay in touch with the latest updates from Hgpu? That's easy! Just subscribe clicking the Follow button below, choose topics or keywords for filtering if you want to, and we send the news to your inbox, to your phone via push notifications or we put them on your personal page here on Specificfeeds.

Reading your RSS feed has never been easier!

Website title: High performance computing on graphics processing units | hgpu.org

Is this your feed? Claim it!

Publisher:  Unclaimed!
Message frequency:  0.42 / day

Message History

LLM-based Triton kernel generation has attracted significant interest, yet a fundamental empirical question remains unanswered: where does this capability break down, and why? We present KernelBenchX, a benchmark designed to answer this question through category-aware evaluation of correctness and hardware efficiency across 176 tasks in 15 categories. Our systematic compariso...


Read full story

Why does full-pipeline FP4 training of large language models often diverge, even when forward activations and activation gradients remain stable? We address this question through a controlled study of MXFP4 quantization in transformer training, progressively enabling FP4 across forward propagation (Fprop), activation gradients (Dgrad), and weight gradients (Wgrad) while holdi...


Read full story

Large language models show promise for automated CUDA programming, however even the strongest coding models (e.g., Claude-Opus-4.6) may still fall short of expert-level, architecture-aware optimization. We introduce CUDAHercules, a benchmark that evaluates generated CUDA against end-to-end human-expert SOTA systems. It spans single kernels, module-level operators, full applic...


Read full story

GPUs have become essential in modern high performance computing, but programming them correctly remains a significant challenge. This difficulty arises from subtle concurrency bugs that result from the explicit management of synchronization primitives and data movement across intricate hierarchies of memory and parallel threads. At the same time, the ability to control these ...


Read full story

Debugging CUDA programs has long been challenging because failures often arise from subtle interactions among hardware behavior, compiler decisions, memory hierarchy, and asynchronous execution. More importantly, with the rapid expansion of GPU usage across scientific computing, machine learning, graphics, and systems workloads, CUDA debugging has become more challenging than...


Read full story