Senior Research Data Engineer
Fully remote
About The Position
Kaiko’s Multimodal Large Language Model (MLLM) is trained on domain-specific, high-complexity medical data. To reach clinical-grade performance, we’ll need to ramp up our data efforts to manage massive scale, ensure consistent quality, and tightly control data relevance and integrity.
As a Senior Research Data Engineer, you will design and implement our data‑sourcing, synthetic‑generation, and curation pipelines. High‑quality datasets are the fuel for frontier‑scale language models, and you will play a pivotal role in producing them.
You will build high‑throughput data pipelines that:
- Ingest multi‑modal data at petabyte scale.
- Generate large volumes of synthetic data.
- Filter & rate content by topic, quality, and policy compliance.
You will work closely with ML researchers and help steer the development of our state‑of‑the‑art foundation models.
Profile
- Excellent programming skills in Python and deep experience with distributed frameworks such as Ray or Spark.
- Proven track record designing & operating large‑scale data pipelines and running data‑quality experiments.
- Experience building or integrating synthetic‑data pipelines for LLMs.
- Deep familiarity with lakehouse paradigms (Delta, Iceberg) and columnar formats (Parquet, ORC).
- Experience with core data‑processing primitives (hashing, deduplication, chunking, etc.) and their scalability and performance trade‑offs.
- Strong communication skills and the ability to present experimental results and technical concepts clearly and concisely.
Nice To Have:
- Hands‑on production experience orchestrating complex DAGs in Dagster (preferred) or similar workflow engines.
- Expertise in data‑quality & validation frameworks and monitoring/observability tooling.
RECRUITMENT PROCESS
NO WHITEBOARDS, NO RIDDLES
We take a partnership approach and focus on getting to know each other as well as possible.
01. CV REVIEW
First look at whether we are a good match (1-7 days).
02. COMBINED TECHNICAL & HR INTERVIEW
A deep dive into your experience and your theoretical and practical skills (1.5 hours).
03. OFFER
Say yes and welcome aboard!