Research Engineer

Cloudglue

Date listed: 1 week ago
Employment Type: Full time
Remote: Yes
Found on: YCombinator Startups

Keywords: remote, python, nlp, milvus, ml, pytorch, weaviate, inference

Cloudglue - Video Understanding Infrastructure

Cloudglue is a Y Combinator-backed startup building developer APIs that turn video and audio into structured, searchable data. We handle the hard infrastructure - transcription, visual analysis, search, extraction - so developers can build on top of video without managing ML pipelines themselves.

We process millions of minutes of video for customers building search, analytics, and automation products. The research problems are real: How do you retrieve the right 10 seconds from 10,000 hours of video? How do you extract structured facts from noisy, multimodal content? How do you reason across visual and spoken information at scale?

Our team has shipped large-scale systems at Snapchat and Amazon, with work presented at NeurIPS, ICCV, CVPR, KubeCon, and DEF CON. We’re a small, technical team where researchers ship code and engineers read papers.

The Role

We’re looking for a research engineer to work on the core multimodal retrieval and video reasoning systems that power Cloudglue. This is a 50/50 research and engineering role - you’ll design novel approaches to hard retrieval and understanding problems, and you’ll ship them into production where real customers depend on them.

You’ll work across:

Multimodal retrieval - finding relevant moments across visual, audio, and text signals in large video collections
Structured extraction - pulling entities, facts, and relationships from video content
Video reasoning - understanding temporal, causal, and semantic relationships across long-form content
Evaluation and benchmarking - designing metrics and datasets to measure real-world system quality

This is not a pure research role. You’ll be expected to take ideas from paper to prototype to production. But it’s also not a pure engineering role - we need someone with genuine research depth who can identify the right problems to work on and design novel solutions.

What You’ll Do

Multimodal retrieval: Design and improve retrieval systems that search across video, audio, and text - including embedding models, re-ranking, and hierarchical search strategies.
Video understanding: Build systems that extract structured information from video - temporal segmentation, entity extraction, scene understanding, and content summarization.
Model fine-tuning & integration: Fine-tune and adapt vision and language models (LoRA/PEFT, full fine-tuning) for production use cases. Evaluate open-source and proprietary models and orchestrate them in serving pipelines.
Experiment and ship: Run experiments, analyze results rigorously, and turn successful research into production systems that handle real-world video at scale.
Collaborate: Work directly with founders and infrastructure engineers. Short feedback loops, no layers of process.

What We’re Looking For

Required

MS or PhD in computer science, machine learning, or a related field
Research experience in one or more of: multimodal learning, information retrieval, computer vision, NLP, or video understanding
Strong implementation skills in Python and PyTorch (or equivalent)
Ability to independently drive research from idea to experiment to working system

Nice to Have

First-author publication at a top venue (NeurIPS, CVPR, ICCV, ECCV, ACL, EMNLP, SIGIR, ISMIR, ICASSP, or similar)
Experience with video or multimodal foundation models (CLIP, LLaVA, Qwen3-VL, etc.)
Experience with retrieval systems, embedding models, or ranking/re-ranking pipelines
Experience deploying ML systems in production
Familiarity with vector databases (Milvus, Weaviate) or search infrastructure
Experience with model fine-tuning techniques (PEFT methods such as LoRA and QLoRA) and training infrastructure (Ray, Kubeflow, or similar)
Experience with ML inference serving (vLLM, TensorRT, Triton, or similar)

Why Cloudglue?

Video is the largest and most underutilized data source on the internet. Most software still can’t meaningfully search or reason over it. The research problems here - multimodal retrieval, temporal reasoning, structured extraction from noisy real-world content - are genuinely unsolved and directly tied to the product.

If you want to work on: