As a High-Performance / CUDA Engineer, you will focus on optimizing the performance of our AI model processing pipeline. You will develop and tune GPU-accelerated code to ensure low-latency, high-throughput inference and training for large language models.
This role involves implementing efficient parallel algorithms and leveraging advanced frameworks (e.g., NVIDIA Triton, vLLM, SGLang, TensorRT) as well as writing custom CUDA kernels to maximize throughput. You will work closely with our ML researchers and MLOps team to integrate these optimizations into our product, pushing the boundaries of what our platform can achieve in speed and scalability.
Requirements:
• 4+ years of experience in performance-critical software development (HPC, GPU computing, or similar domains).
• Expertise in C/C++ and NVIDIA CUDA programming for GPU acceleration.
• Familiarity with deep learning frameworks (PyTorch, TensorFlow) and model optimization techniques (experience with LLM inference frameworks like vLLM or NVIDIA Triton is a plus).
• Strong understanding of parallel algorithms, GPU architectures, and performance profiling/tuning tools.
• Excellent problem-solving skills and the ability to work in a fast-paced, collaborative startup environment.