Data Engineer - Healthcare and AI/LLM
Boston, MA
RA Capital Management
RA Capital Management funds and builds healthcare and life science companies and has a growing interest in climate and sustainability technologies.
About the Team
RA Capital’s Data Engineering team is responsible for ensuring high-quality, reliable, and accessible data throughout the organization. We emphasize data integrity, compliance, and usability to support strategic decision-making across RA Capital. Our team oversees the complete data lifecycle—partnering with internal stakeholders and external vendors—to build scalable data infrastructure that fuels a data-driven culture.
About the Role
We are seeking a skilled Data Engineer with healthcare data experience and a strong interest in AI/LLM-powered data access to join our Data Engineering team. This role is pivotal in designing and maintaining robust data pipelines—with a focus on healthcare datasets like claims, provider data, and patient records—and extending that data accessibility through AI-driven solutions.
The ideal candidate will possess deep technical knowledge in data engineering, experience with healthcare data standards, and a working understanding of large language model (LLM) systems and the Model Context Protocol (MCP). You’ll help bridge structured enterprise data with AI interfaces that power self-service and natural language query workflows.
Responsibilities
- Design, build, and optimize end-to-end enterprise data pipelines for ingesting and integrating healthcare vendor data, especially claims data.
- Develop and maintain robust ETL processes and data integrations between data warehouses (e.g., Databricks) and downstream applications.
- Write production-level Python and SQL code to standardize, reconcile, and match healthcare data, applying NLP and ML techniques when needed.
- Develop scalable data models in Databricks to support efficient reporting and analytics across clinical, financial, and operational datasets.
- Implement rigorous data quality controls and validation checks to ensure data accuracy and compliance with healthcare standards (e.g., HIPAA).
- Collaborate with external healthcare data vendors to define delivery specifications and transformation logic.
- Partner with internal IT, analytics, and business stakeholders to align data efforts with organizational objectives.
- Work closely with AI/ML engineers and product teams to support LLM-based data access layers built on top of Hasura or similar GraphQL engines.
- Contribute to the integration and evaluation of Model Context Protocol (MCP) in real-world applications, enabling scalable, secure, and interpretable LLM usage.
- Document data architectures, pipelines, workflows, and processes for both technical and non-technical audiences.
- Provide Tier 1 support for monitoring data flows and resolving pipeline or integration issues.
- Ensure ongoing compliance with data governance and security standards.
Key Skills & Experience
- 1–2+ years working with healthcare data, including claims, structured and unstructured EMR/EHR, provider, and payer data. Familiarity with healthcare ontologies (ICD, CPT, NPI, etc.) strongly preferred.
- Expertise in building scalable ETL/ELT pipelines and data integration workflows.
- Strong skills in Python, SQL, and Spark. Experience with Java is a plus.
- Hands-on experience with Databricks; familiarity with AWS (S3, EC2, EBS) preferred.
- Strong understanding of data validation, quality assurance, and compliance practices in a healthcare setting.
- Exposure to LLM applications and AI-driven data interfaces, particularly in structured enterprise data environments.
- Familiarity with Model Context Protocol (MCP) and how it supports contextual integrity, auditability, and chain-of-thought in AI/LLM-based data access.
- Proven ability to manage external data vendors and collaborate on schema, format, and delivery improvements.
- Ability to clearly convey technical details to non-technical stakeholders and align data projects with business needs.
Key Requirements
- Master’s degree or higher from a top Computer Science or Data Science program.
- 1–2+ years of experience in data engineering, software development, and managing production-grade pipelines, preferably in a healthcare environment.
- Must be based in the Boston area.
- Ability to work a hybrid schedule in our Boston office.
- Must be authorized to work in the United States.