Nvidia just shipped cuda-oxide, an experimental compiler that lets you write GPU kernels in idiomatic Rust and compile them straight to PTX. No C++, no DSL, no foreign language bindings. Just Rust, running on your GPU.
This is v0.1.0 alpha, so it's not production-ready. But it's an official Nvidia project, and that changes the calculus on whether Rust is a serious contender for GPU systems code.
What cuda-oxide Actually Is
cuda-oxide is a custom rustc codegen backend. That distinction matters. It's not a wrapper around CUDA C++, not a macro system that generates C, and not a domain-specific language bolted onto Rust. You write standard Rust, the compiler generates PTX (Nvidia's low-level parallel thread execution assembly), and that PTX runs on your GPU. The compilation flow uses cargo oxide run to build and execute kernels.
The quick-start example is a vector addition kernel. You annotate a Rust module with #[cuda_module], mark individual functions with #[kernel], and the procedural macros generate two things: a device artifact embedded directly into your host binary, and a typed kernels::load function on the host side. If you need more control, lower-level APIs like load_kernel_module and cuda_launch! let you load sidecar artifacts or write custom launch logic.
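Here's a minimal sketch of that shape. The #[cuda_module] and #[kernel] attributes and the generated kernels::load function are named in the docs; the import path, thread-index helper, kernel signature, and host-side code are illustrative assumptions, not confirmed v0.1.0 API.

```rust
// Sketch only: #[cuda_module], #[kernel], and kernels::load come from the
// docs; thread_index() and everything else here is an assumption.
#[cuda_module]
mod kernels {
    use cuda_oxide::prelude::*; // assumed prelude path

    // Each #[kernel] function becomes a PTX entry point; the macro embeds
    // the compiled device artifact directly into the host binary.
    #[kernel]
    pub fn vector_add(a: &[f32], b: &[f32], out: &mut [f32]) {
        let i = thread_index(); // hypothetical global-thread-index helper
        if i < out.len() {
            out[i] = a[i] + b[i];
        }
    }
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // kernels::load is the typed loader the macro generates on the host side.
    let _module = kernels::load()?;
    // Launch details are speculative; the docs mention cuda_launch! for
    // custom launch logic when the generated wrappers aren't enough.
    Ok(())
}
```

Per the docs, cargo oxide run would compile both halves and execute the result.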
"cuda-oxide is an experimental Rust-to-CUDA compiler that lets you write (SIMT) GPU kernels in safe(ish), idiomatic Rust. It compiles standard Rust code directly to PTX โ no DSLs, no foreign language bindings, just Rust."
The Safety Angle
The docs describe the safety guarantees as "safe(ish)" and that's an honest qualifier. CUDA programming involves inherently unsafe territory: device memory, host-device transfers, thread synchronization. cuda-oxide doesn't pretend otherwise.
What it provides is a set of memory abstractions that apply Rust's ownership model to GPU code. DisjointSlice handles mutable GPU memory access in a way that respects the borrow checker. DeviceBuffer manages host-to-device data transfer. The goal is to catch the class of bugs that plague CUDA C++ development: data races, use-after-free on device memory, invalid synchronization. Whether those guarantees hold up across all the edge cases in real SIMT code is still an open question at alpha. The docs assume you already know Rust ownership, traits, generics, and async patterns.
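To make that concrete, here is a rough illustration of how those two types might compose. DisjointSlice and DeviceBuffer are the names from the docs; every constructor and method below is an assumption about what an ownership-respecting GPU buffer API could look like.

```rust
use cuda_oxide::memory::{DeviceBuffer, DisjointSlice}; // assumed paths

// Illustrative only: these methods are guesses at the API surface, shown
// to demonstrate how ownership rules could prevent device-memory races.
fn prepare(host_a: &[f32], host_b: &[f32]) -> Result<(), Box<dyn std::error::Error>> {
    // DeviceBuffer owns device memory and handles the host-to-device copy.
    let a = DeviceBuffer::from_slice(host_a)?;
    let b = DeviceBuffer::from_slice(host_b)?;
    let mut out = DeviceBuffer::<f32>::zeroed(host_a.len())?;

    // A DisjointSlice-style split hands out non-overlapping mutable views,
    // so two concurrent kernels provably can't write the same region:
    let (lo, hi): (DisjointSlice<f32>, DisjointSlice<f32>) =
        out.split_at_mut(host_a.len() / 2);

    // ...launch one kernel against `lo` and another against `hi`; the
    // borrow checker rejects a second mutable borrow of either half...
    let _ = (a, b, lo, hi);
    Ok(())
}
```

The appeal is that the same aliasing rules that prevent data races between host threads would apply to device memory.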
Async GPU Programming
The async story is genuinely interesting. cuda-oxide lets you compose GPU work as lazy DeviceOperation graphs, schedule them across CUDA stream pools, and await results with .await. If you've used Rust's async/await for network I/O, the mental model carries over directly to GPU work scheduling.
"Compose GPU work as lazy DeviceOperation graphs. Schedule across stream pools. Await results with .await"
That's a real ergonomics win. Managing CUDA streams in C++ is verbose and error-prone. Getting that scheduling through standard Rust async primitives could simplify a lot of systems-level GPU code.
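A sketch of what that could look like in practice. DeviceOperation, stream pools, and awaiting results come straight from the docs' description; the module paths, builder methods, StreamPool type, and kernel names are hypothetical.

```rust
use cuda_oxide::async_exec::{DeviceOperation, StreamPool}; // assumed paths
use cuda_oxide::memory::DeviceBuffer;

// Hypothetical pipeline: DeviceOperation graphs are lazy, so nothing
// touches the GPU until the pool schedules the graph and we await it.
async fn pipeline(
    pool: &StreamPool,
    input: DeviceBuffer<f32>,
) -> Result<DeviceBuffer<f32>, Box<dyn std::error::Error>> {
    // Compose a lazy graph: normalize, then reduce (kernel names invented).
    let graph = DeviceOperation::new(kernels::normalize, &input)
        .then(kernels::reduce);

    // schedule() picks a CUDA stream from the pool; .await suspends this
    // task until the device work finishes, exactly like awaiting a socket.
    let result = pool.schedule(graph).await?;
    Ok(result)
}
```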
How It Compares to the Alternatives
The existing options for GPU programming in Rust don't fill this exact niche. cuda-sys gives you raw FFI bindings to the CUDA runtime, which means you're essentially writing CUDA C++ in Rust syntax. wgpu is graphics-first and uses WGSL, not Rust. Python approaches like CuPy or Numba are high-level, but you give up control over memory layout and kernel dispatch.
cuda-oxide sits between raw CUDA C++ (maximum control, maximum pain) and Python bindings (minimal boilerplate, minimal control). The pitch is that you get close-to-metal performance without writing a line of C++, and the Rust type system catches bugs at compile time instead of during a GPU kernel crash at 3 a.m.
For AI tooling specifically, the most relevant use case is inference servers and local LLM runners. Projects like llama.cpp rely on hand-written CUDA kernels for performance-critical paths. If cuda-oxide matures, that's a path to writing those kernels in Rust without sacrificing throughput or spending weeks fighting C++ build systems.
What's Not Ready Yet
The docs are explicit about the alpha status: expect bugs, expect incomplete features, expect API breakage. Several important questions don't have public answers yet:
- Performance overhead vs. equivalent C++ CUDA kernels (no benchmarks published)
- Which GPU architectures are supported (Ada Lovelace, Hopper, older generations)
- Full support for GPU primitives like shared memory and atomics
- Multi-GPU and distributed execution
- The precise boundaries of the "safe(ish)" memory guarantees
None of that should stop you from experimenting. But it means you shouldn't rewrite your production inference stack on top of it today.
Bottom Line
cuda-oxide is the most interesting thing to happen to Rust GPU programming in a while. It's alpha software with real gaps, but it's an official Nvidia project targeting a genuine pain point: writing GPU kernels without C++. If you're building inference servers, local model runners, or any systems-level GPU code in Rust, this is worth watching closely. Star the repo, run the examples, and check back when v0.2 lands.
Sources
- cuda-oxide official documentation (NVLabs)
- wgpu: Rust graphics API
- CuPy: NumPy-compatible GPU arrays in Python
- Numba: Python JIT compiler with CUDA support
- llama.cpp: Local LLM inference with CUDA kernels