Data Engineer

Bangalore - Carina

Full Time Mid-level / Intermediate USD 41K - 77K *

Red Hat

Red Hat is the world’s leading provider of enterprise open source solutions, including high-performing Linux, cloud, container, and Kubernetes technologies.

Posted 1 month ago

Data Engineer: ROAD

Job description

Job Summary

Do you like the vibe of a brand-new team where you can influence how you will work? Are you passionate about data and making an impact? Are you attracted to the idea of a virtual assistant that is more than just a wrapper on top of somebody else’s large language model technology? We are building a team to develop and deliver Red Hat’s own flavor of large language model to power our product-supporting virtual assistants, and we are looking for people like you.

Become an AI data engineer on a team that will play a critical role in infusing Red Hat products with generative AI. In this role, you will collaborate with data scientists and software engineers to build and maintain high-leverage data sets containing Red Hat’s unique experience and expertise, to be used for fine-tuning large language models and creating high-quality retrieval-augmented generation (RAG) databases. Your data sets will influence, and be influenced by, Red Hat infrastructure built from the ground up to support the fine-tuning process. Beyond building the data sets, you will also participate in the design, deployment, configuration, and optimization of the infrastructure around the data. The ideal candidate will therefore be excited about the data, the science, and the code.

You will contribute directly to Red Hat’s rapidly growing AI/ML (artificial intelligence and machine learning) family of products. This is a high-profile project, and the position is critical to its future success. The role offers a lot of flexibility and therefore requires a highly motivated, self-driven individual.

Primary Job Responsibilities

Design the systems, integrations, and processes required to achieve the best fine-tuning results, including selection and integration of data sources, data pre-processing, and subsequent quality evaluation.
Design, build, and maintain scalable data pipelines for extracting, transforming, and loading (ETL) data from internal Red Hat systems into the LLM training process
Develop and optimize databases to ensure efficient data storage and retrieval.
Design and develop a data warehousing solution to support large-scale data storage.
Use Python for data manipulation, automation, and analysis. Ensure that high-quality data is used as input for model fine-tuning and RAG building.
Contribute to the entire stack, from active participation in the fine-tuning process to the implementation and ongoing optimization of the designed systems
Collaborate with other team members (data scientists, software engineers) and with other teams to deliver and maintain a best-in-class solution.
Work in a fast-paced, agile, globally distributed environment of talented engineers

Required Skills

Bachelor’s degree in Computer Science, Data Science, or a related field
2-4 years of work experience in data engineering, preferably in AI/ML contexts
Extensive, advanced experience with Python development.
Strong understanding of LLM architectures, training processes, and data requirements
Experience with RAG systems, knowledge base construction, and vector databases
Familiarity with embedding techniques, similarity search algorithms, and information retrieval concepts
Hands-on experience with data cleaning, tagging, and annotation processes (both manual and automated)
Strong self-motivation, problem-solving, and organizational skills.
Collaborative attitude and willingness to share ideas openly.
Excellent English written and verbal communication skills.
Ability to quickly learn and use new tools and technologies

Preferred skills

Experience with AI and machine learning platforms, tools, and frameworks, such as TensorFlow, PyTorch, llama.cpp, and Kubeflow.
Familiarity with LLM sampling parameters such as temperature, top-k, and repeat penalty, and with metrics and methodologies for evaluating LLM output.
Experience with various vector store technologies and their applications in AI
Experience with Cloud Native Technologies and Platforms (e.g. Kubernetes)
Understanding of data lakehouse concepts and architectures
Experience with agile development, CI/CD systems and DevOps methodology
Experience with big data storage formats and technologies, such as Parquet, Avro, and S3.

About Red Hat

Red Hat is the world’s leading provider of enterprise open source software solutions, using a community-powered approach to deliver high-performing Linux, cloud, container, and Kubernetes technologies. Spread across 40+ countries, our associates work flexibly across work environments, from in-office, to office-flex, to fully remote, depending on the requirements of their role. Red Hatters are encouraged to bring their best ideas, no matter their title or tenure. We're a leader in open source because of our open and inclusive environment. We hire creative, passionate people ready to contribute their ideas, help solve complex problems, and make an impact.

Diversity, Equity & Inclusion at Red Hat
Red Hat’s culture is built on the open source principles of transparency, collaboration, and inclusion, where the best ideas can come from anywhere and anyone. When this is realized, it empowers people from diverse backgrounds, perspectives, and experiences to come together to share ideas, challenge the status quo, and drive innovation. Our aspiration is that everyone experiences this culture with equal opportunity and access, and that all voices are not only heard but also celebrated. We hope you will join our celebration, and we welcome and encourage applicants from all the beautiful dimensions of diversity that compose our global village.

Equal Opportunity Policy (EEO)
Red Hat is proud to be an equal opportunity workplace and an affirmative action employer. We review applications for employment without regard to their race, color, religion, sex, sexual orientation, gender identity, national origin, ancestry, citizenship, age, veteran status, genetic information, physical or mental disability, medical condition, marital status, or any other basis prohibited by law.