Researcher: Inference

HQ - San Francisco, CA

Cartesia

About Cartesia

Our mission is to build the next generation of AI: ubiquitous, interactive intelligence that runs wherever you are. Today, not even the best models can continuously process and reason over a year-long stream of audio, video and text—1B text tokens, 10B audio tokens and 1T video tokens—let alone do this on-device.

We're pioneering the model architectures that will make this possible. Our founding team met as PhDs at the Stanford AI Lab, where we invented State Space Models (SSMs), a new primitive for training efficient, large-scale foundation models. Our team pairs deep expertise in model innovation and systems engineering with a design-minded product engineering team to build and ship cutting-edge models and experiences.

We're funded by leading investors at Index Ventures and Lightspeed Venture Partners, along with Factory, Conviction, A Star, General Catalyst, SV Angel, Databricks and others. We're fortunate to have the support of many amazing advisors and 90+ angels across many industries, including the world's foremost experts in AI.

The Role

• Conduct cutting-edge research to improve the efficiency, scalability, and robustness of inference for state-of-the-art AI models across various modalities, including audio, text, and vision.

• Design and optimize inference pipelines to balance performance, latency, and resource utilization in diverse deployment environments, from edge devices to cloud systems.

• Develop and implement novel techniques for efficient model execution, including quantization, pruning, sparsity, distillation, and hardware-aware optimizations.

• Explore speculative decoding methods, caching strategies, and other advanced techniques to reduce latency and computational overhead during inference (a toy sketch of speculative decoding follows this list).

• Investigate trade-offs between model quality and inference efficiency, designing architectures and workflows that meet real-world application requirements.

• Prototype and refine methods for stateful inference, streaming inference, and task-specific conditioning to enable new capabilities and use cases.

• Collaborate closely with cross-functional teams to ensure inference research seamlessly integrates into production systems and applications.
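
As a flavor of the work, here is a minimal sketch of greedy speculative decoding, the technique named in the list above: a cheap draft model proposes k tokens and a larger target model verifies them in a single forward pass. Everything here (the ToyLM class, vocabulary size, shapes) is an illustrative stand-in, not Cartesia's actual models or code.

```python
# Toy sketch of greedy speculative decoding; all names/shapes are stand-ins.
import torch
import torch.nn as nn

VOCAB = 100

class ToyLM(nn.Module):
    """Markov-style stand-in LM: next-token logits depend only on the current token."""
    def __init__(self, seed: int):
        super().__init__()
        torch.manual_seed(seed)
        self.table = nn.Embedding(VOCAB, VOCAB)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # ids: (seq,) -> logits: (seq, VOCAB); row i predicts token i+1.
        return self.table(ids)

@torch.no_grad()
def speculative_step(target: nn.Module, draft: nn.Module,
                     ids: torch.Tensor, k: int = 4) -> torch.Tensor:
    # 1) The cheap draft model proposes k tokens, one at a time.
    proposal = ids
    for _ in range(k):
        nxt = draft(proposal)[-1].argmax()
        proposal = torch.cat([proposal, nxt.view(1)])
    # 2) The target model scores the whole proposal in ONE forward pass.
    verified = target(proposal).argmax(-1)  # verified[i] predicts proposal[i+1]
    # 3) Accept the longest prefix of draft tokens the target agrees with,
    #    then take one extra token from the target itself.
    n_ctx, n_acc = ids.numel(), 0
    for i in range(k):
        if verified[n_ctx - 1 + i] == proposal[n_ctx + i]:
            n_acc += 1
        else:
            break
    return torch.cat([proposal[:n_ctx + n_acc],
                      verified[n_ctx - 1 + n_acc].view(1)])

ids = torch.tensor([1, 2, 3])
target, draft = ToyLM(0), ToyLM(0)           # identical toys -> every draft token accepted
print(speculative_step(target, draft, ids))  # 3 context + 4 accepted + 1 bonus = 8 tokens
```

Because the two toys share weights in this demo, all k draft tokens are accepted; in practice the acceptance rate, and hence the speedup, depends on how closely the draft model tracks the target.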

What We’re Looking For

• Deep expertise in optimizing inference for machine learning models, with a strong understanding of techniques such as speculative decoding, model compression, low-precision computation, and hardware-specific tuning (a minimal quantization example follows this list).

• Strong programming skills in Python, with experience in frameworks like PyTorch, TensorFlow, or ONNX, and familiarity with inference deployment tools such as TensorRT or TVM.

• Knowledge of hardware architectures and accelerators, including GPUs, TPUs, and edge devices, and their impact on inference performance.

• Experience in designing and evaluating scalable, low-latency inference pipelines for production systems.

• A solid understanding of the trade-offs between model accuracy, latency, and computational efficiency in deployment scenarios.

• Strong problem-solving skills and a passion for exploring innovative techniques to push the boundaries of real-time and resource-constrained inference.
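
For concreteness, here is a minimal example of one low-precision technique named above, post-training dynamic quantization, using PyTorch's standard torch.ao.quantization.quantize_dynamic API. The two-layer model is a throwaway placeholder, not anything from Cartesia's stack.

```python
# Minimal sketch of post-training dynamic quantization on a placeholder model.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()

# Convert all Linear modules' weights to int8; activations are quantized
# dynamically at run time, so no calibration data is needed.
qmodel = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(4, 512)
with torch.no_grad():
    ref = model(x)
    out = qmodel(x)

# Quantization error vs. the fp32 reference; a real evaluation would also
# measure latency and model size on the deployment target.
print("max |fp32 - int8| diff:", (ref - out).abs().max().item())
```

Dynamic quantization trades a small accuracy drift for lower memory traffic; static quantization, pruning, and distillation involve the same kind of accuracy-versus-efficiency measurement.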

Nice-to-Haves

• Experience with speculative decoding and other emerging techniques for improving inference performance.

• Familiarity with stateful or streaming inference techniques (illustrated in the sketch after this list).

• Background in designing hybrid architectures or task-specific models optimized for inference.

• Early-stage startup experience or a track record of developing and deploying efficient inference systems in fast-paced R&D environments.
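
To illustrate the stateful/streaming idea from the list above: a model with a fixed-size recurrent state (a GRU below, standing in for an SSM-style model) can consume a long stream chunk by chunk, emitting output as input arrives, with results identical to an offline full-sequence pass. This is a toy sketch, not Cartesia's architecture.

```python
# Minimal sketch of stateful streaming inference with a recurrent stand-in model.
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.GRU(input_size=8, hidden_size=16, batch_first=True).eval()

stream = torch.randn(1, 12, 8)  # a "long" input, arriving over time

with torch.no_grad():
    # Offline reference: one pass over the whole sequence.
    full_out, _ = rnn(stream)

    # Streaming: feed 4-step chunks, carrying the hidden state h between calls.
    h = None
    chunks = []
    for t in range(0, stream.size(1), 4):
        out, h = rnn(stream[:, t:t + 4], h)
        chunks.append(out)  # emit results as each chunk arrives
    stream_out = torch.cat(chunks, dim=1)

# Chunked and offline outputs agree, so per-chunk latency and constant-size
# state come at no cost in output quality.
print(torch.allclose(full_out, stream_out, atol=1e-6))
```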

Our culture

🏢 We’re an in-person team based out of San Francisco. We love being in the office, hanging out together and learning from each other every day.

🚢 We ship fast. All of our work is novel and cutting-edge, and execution speed is paramount. We have a high bar, and we don’t sacrifice quality or design along the way.

🤝 We support each other. We have an open and inclusive culture that’s focused on giving everyone the resources they need to succeed.

Our perks

🍽 Lunch, dinner and snacks at the office.

🏥 Fully covered medical, dental, and vision insurance for employees.

🏦 401(k).

✈️ Relocation and immigration support.

🦖 Your own personal Yoshi.
