Lead Software Engineer, Applied AI Engineering

Pune, Maharashtra, India

Icertis

Icertis is the global leader in AI-powered contract intelligence. The Icertis platform revolutionizes contract lifecycle management, equipping customers with powerful AI automation and insights to grow revenue, control costs, mitigate risk, and...



We are seeking a highly skilled and motivated Data Engineer & ML/GenAI Specialist to join our growing Machine Learning team. In this unique role, you will be responsible for building and maintaining the critical data infrastructure that fuels our machine learning and Generative AI initiatives. You will leverage your expertise in data engineering, machine learning, and search indexing to manage the entire lifecycle of our data, from extraction and annotation to indexing and building data pipelines for our models. Your work will directly contribute to the development and deployment of innovative AI solutions. Your main objective is to index millions 

  • Data Indexing and Search: 
    • Design, implement, and maintain a robust indexing system for millions of business records (PDFs, DOCX files, and database metadata) using tools like Elasticsearch/OpenSearch/Solr.
    • Develop and optimize search functionalities to enable efficient retrieval of relevant data for machine learning pipelines.
    • Implement strategies for handling different data formats, including text extraction (OCR if necessary), cleaning, and normalization.
    • Integrate metadata from databases with document content to enrich the index.
    • Perform data deduplication and anonymization.
  • Data Annotation and Labeling Strategy: 
    • Define data annotation guidelines and workflows for various machine learning tasks.
    • Evaluate and select appropriate data annotation tools and platforms.
    • If required, set up a small crowdsourcing group for data annotation and labelling.
    • Oversee the quality of annotated data and implement mechanisms for quality control.
  • Data Pipeline Development: 
    • Build and maintain data pipelines to extract, transform, and load data from the index to support machine learning model training and inference.
    • Develop APIs or interfaces to allow ML engineers to access and query the indexed data.
  • Machine Learning and Generative AI Support: 
    • Collaborate closely with machine learning engineers to understand their data requirements and provide tailored datasets.
    • Assist in feature engineering by extracting relevant features from indexed data for model consumption.
    • Explore and implement methods for generating synthetic data to augment training datasets or address privacy concerns.
    • Stay up-to-date with the latest advancements in ML and Generative AI to identify opportunities for improving data processes.
  • Data Management and Governance: 
    • Ensure data quality, consistency, and integrity throughout the data lifecycle.
    • Implement data versioning and lineage tracking for reproducibility.
    • Adhere to data governance policies and best practices.
  • Education: Bachelor's or Master's degree in Computer Science, Data Science, Engineering, or a related field.
  • Technical Skills: 
    • Programming: Strong proficiency in Python.
    • Machine Learning: Solid understanding of machine learning concepts, algorithms, and evaluation metrics. Familiarity with Generative AI models (e.g., GANs, VAEs) is highly desirable. Experience with Natural Language Processing (NLP) techniques.
    • Search Indexing: Hands-on experience with Elasticsearch, Solr, OpenSearch, or other relevant search indexing technologies.
    • SQL: Proficient in SQL for database querying and manipulation.
    • Data Extraction: Experience with libraries like Apache Tika, datasets, PyPDF2, textract, or similar for extracting text and metadata from various document formats.
    • Cloud Platforms (Preferred): Familiarity with the Azure platform and its data services. Knowledge of AWS/GCP is an added advantage.
    • Data Annotation Tools (Preferred): Experience with data annotation tools and techniques is a plus.
  • Experience: 
    • 5+ years of experience in data engineering, with a focus on building and maintaining data pipelines for machine learning.
    • Proven experience in indexing large datasets and developing search functionalities.
    • Experience with data annotation and labelling processes is highly desirable.
  • Soft Skills: 
    • Excellent problem-solving and analytical skills.
    • Strong communication and collaboration skills, with the ability to work effectively in a team environment.   
    • Self-motivated and able to work independently, managing multiple priorities and deadlines.   
    • A passion for learning and staying at the forefront of data and AI technologies.

With unmatched technology and category-defining innovation, Icertis pushes the boundaries of what’s possible with contract lifecycle management (CLM). The AI-powered, analyst-validated Icertis Contract Intelligence (ICI) platform turns contracts from static documents into strategic advantage by structuring and connecting the critical contract information that defines how an organization runs. Today, the world’s most iconic brands and disruptive innovators trust Icertis to fully realize the intent of their combined 10 million contracts worth more than $1 trillion, in 40+ languages and 93 countries.





Tags: Amazon Textract, APIs, AWS, Azure, Computer Science, Data governance, Data management, Data pipelines, Data quality, Elasticsearch, Engineering, Feature engineering, GANs, GCP, Generative AI, Machine Learning, Model training, NLP, OCR, OpenSearch, Pipelines, Privacy, Python, SQL

Perks/benefits: Career development

Region: Asia/Pacific
Country: India
