Data Engineer - Healthcare and AI/LLM

Boston, MA

RA Capital Management

RA Capital Management funds and builds healthcare and life science companies and has a growing interest in climate and sustainability technologies.


RA Capital Management is a multi-stage investment manager dedicated to evidence-based investing in public and private healthcare, life sciences, and planetary health companies. For more than 20 years, RA Capital has created and funded innovative companies, from private seed rounds to public follow-on financings, allowing management teams to drive value creation from inception through commercialization and beyond. RA Capital’s knowledge engine is guided by our TechAtlas internal research division and by RA Ventures (RAven), RA Capital’s healthcare incubator, which offers entrepreneurs and innovators a collaborative and comprehensive platform to explore the novel and the re-imagined. RA Capital has more than 150 employees and over $10 billion in assets under management.

About the Team

RA Capital’s Data Engineering team is responsible for ensuring high-quality, reliable, and accessible data throughout the organization. We emphasize data integrity, compliance, and usability to support strategic decision-making across RA Capital. Our team oversees the complete data lifecycle—partnering with internal stakeholders and external vendors—to build scalable data infrastructure that fuels a data-driven culture.

About the Role

We are seeking a skilled Data Engineer with healthcare data experience and a strong interest in AI/LLM-powered data access to join our Data Engineering team. This role is pivotal in designing and maintaining robust data pipelines—with a focus on healthcare datasets like claims, provider data, and patient records—and extending that data accessibility through AI-driven solutions.

The ideal candidate will possess deep technical knowledge in data engineering, experience with healthcare data standards, and a working understanding of large language model (LLM) systems and the Model Context Protocol (MCP). You’ll help bridge structured enterprise data with AI interfaces that power self-service and natural language query workflows.

Responsibilities

  • Design, build, and optimize end-to-end enterprise data pipelines for ingesting and integrating healthcare vendor data, especially claims data.
  • Develop and maintain robust ETL processes and data integrations between data warehouses (e.g., Databricks) and downstream applications.
  • Write production-level Python and SQL code to standardize, reconcile, and match healthcare data, applying NLP and ML techniques when needed.
  • Develop scalable data models in Databricks to support efficient reporting and analytics across clinical, financial, and operational datasets.
  • Implement rigorous data quality controls and validation checks to ensure data accuracy and compliance with healthcare standards (e.g., HIPAA).
  • Collaborate with external healthcare data vendors to define delivery specifications and transformation logic.
  • Partner with internal IT, analytics, and business stakeholders to align data efforts with organizational objectives.
  • Work closely with AI/ML engineers and product teams to support LLM-based data access layers above Hasura or similar GraphQL engines.
  • Contribute to the integration and evaluation of Model Context Protocol (MCP) in real-world applications, enabling scalable, secure, and interpretable LLM usage.
  • Document data architectures, pipelines, workflows, and processes for both technical and non-technical audiences.
  • Provide Tier 1 support for monitoring data flows and resolving pipeline or integration issues.
  • Ensure ongoing compliance with data governance and security standards.

Key Skills & Experience

  • 1–2+ years working with healthcare data, including claims, structured and unstructured EMR/EHR, provider, and payer data. Familiarity with healthcare ontologies (ICD, CPT, NPI, etc.) strongly preferred.
  • Expertise in building scalable ETL/ELT pipelines and data integration workflows.
  • Strong skills in Python, SQL, and Spark. Experience with Java is a plus.
  • Hands-on experience with Databricks; familiarity with AWS (S3, EC2, EBS) preferred.
  • Strong understanding of data validation, quality assurance, and compliance practices in a healthcare setting.
  • Exposure to LLM applications and AI-driven data interfaces, particularly in structured enterprise data environments.
  • Familiarity with Model Context Protocol (MCP) and how it supports contextual integrity, auditability, and chain-of-thought in AI/LLM-based data access.
  • Proven ability to manage external data vendors and collaborate on schema, format, and delivery improvements.
  • Ability to clearly convey technical details to non-technical stakeholders and align data projects with business needs.

Key Requirements

  • Master’s degree or higher from a top Computer Science or Data Science program.
  • 1–2+ years of experience in data engineering, software development, and managing production-grade pipelines in a healthcare environment (preferred).
  • Must be based in the Boston area.
  • Ability to work a hybrid schedule in our Boston office.
  • Must be authorized to work in the United States.

RA Capital is an equal opportunity employer. We are committed to equal employment opportunity regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, or veteran status. If you require an accommodation during the interview process, please reach out to careers@racap.com for assistance.


Tags: Architecture AWS Computer Science Databricks Data governance Data pipelines Data quality EC2 ELT Engineering ETL GraphQL Java LLMs Machine Learning NLP Pipelines Python Research Security Spark SQL

Region: North America
Country: United States
