Data Engineer - WEN product

Bangalore North, India

Wadhwani Foundation

Wadhwani Foundation is a non-profit organization focused on accelerating job growth and paving the way for millions to earn family-sustaining wages. Join us!




The Role Context:  

 

We are seeking a highly motivated and detail-oriented individual to join our team as a Data Engineer with experience in designing, constructing, and maintaining the architecture and infrastructure necessary for data generation, storage, and processing. 

Key Responsibilities: 

  1. Data Architecture Design: Design, develop, and maintain scalable data pipelines and infrastructure for ingesting, processing, storing, and analyzing large volumes of data efficiently. This involves understanding business requirements and translating them into technical solutions. 

  2. Data Integration: Integrate data from various sources such as databases, APIs, streaming platforms, and third-party systems. Ensure data is collected reliably and efficiently, maintaining data quality and integrity throughout the process in line with the relevant ministry/government data standards. 

  3. Data Modeling: Design and implement data models to organize and structure data for efficient storage and retrieval, applying techniques such as dimensional modeling, normalization, and denormalization depending on the specific requirements of the project. 

  4. Data Pipeline Development/ETL (Extract, Transform, Load): Develop data pipeline/ETL processes to extract data from source systems, transform it into the desired format, and load it into the target data systems. This involves writing scripts or using ETL tools to automate the process and ensure data accuracy and consistency; a minimal pipeline sketch follows this list. 

  5. Data Quality and Governance: Implement data quality checks and data governance policies to ensure data accuracy, consistency, and compliance with regulations. Design and track data lineage, support data stewardship and metadata management, and build a business glossary. 

  6. Data Lakes and Warehousing: Design and maintain data lakes and data warehouses to store and manage structured data from relational databases, semi-structured data such as JSON or XML, and unstructured data such as text documents, images, and videos at any scale. Integrate these stores with big data processing frameworks such as Apache Hadoop, Apache Spark, and Apache Flink, as well as with machine learning and data visualization tools.   

  7. Data Security: Implement security practices, technologies, and policies designed to protect data from unauthorized access, alteration, or destruction throughout its lifecycle. This includes access controls, encryption, data masking and anonymization, data loss prevention, and compliance with regulatory requirements such as the DPDP Act and GDPR.  

  8. Database Management: Administer and optimize databases, both relational and NoSQL, to manage large volumes of data effectively.  

  9. Data Migration: Plan and execute data migration projects to transfer data between systems while ensuring data consistency and minimal downtime. 

  10. Performance Optimization: Optimize data pipelines and queries for performance and scalability. Identify and resolve bottlenecks, tune database configurations, and implement caching and indexing strategies to improve data processing speed and efficiency. 

  11. Collaboration: Collaborate with data scientists, analysts, and other stakeholders to understand their data requirements and provide them with access to the necessary data resources. Work closely with IT operations teams to deploy and maintain data infrastructure in production environments. 

  12. Documentation and Reporting: Document data models, data pipelines/ETL processes, and system configurations. Create documentation and provide training to other team members to ensure the sustainability and maintainability of data systems. 

  13. Continuous Learning: Stay updated with the latest technologies and trends in data engineering and related fields. Participate in training programs, attend conferences, and engage with the data engineering community to enhance skills and knowledge. 
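
As a rough illustration of the pipeline and data quality responsibilities above, the sketch below shows a minimal extract-transform-load step in Python using pandas and SQLAlchemy, with simple completeness and duplication checks. The connection strings, table names, columns, and thresholds are hypothetical placeholders; a production pipeline would typically run under an orchestrator such as Apache Airflow.

```python
# Minimal ETL sketch with basic data quality checks.
# Connection strings, table/column names, and thresholds are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

SOURCE_DB = "postgresql://user:password@source-host/app_db"        # hypothetical
TARGET_DW = "postgresql://user:password@warehouse-host/analytics"  # hypothetical

def extract() -> pd.DataFrame:
    """Pull the latest batch of records from the source system."""
    engine = create_engine(SOURCE_DB)
    return pd.read_sql(
        "SELECT * FROM registrations WHERE updated_at >= CURRENT_DATE", engine
    )

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize column names and types before loading."""
    df = df.rename(columns=str.lower)
    df["registered_on"] = pd.to_datetime(df["registered_on"], errors="coerce")
    return df

def validate(df: pd.DataFrame) -> None:
    """Fail fast if completeness or duplication checks are violated."""
    null_rate = df["registration_id"].isna().mean()
    dup_rate = df.duplicated(subset=["registration_id"]).mean()
    if null_rate > 0.01 or dup_rate > 0.0:
        raise ValueError(
            f"Data quality check failed: nulls={null_rate:.2%}, dups={dup_rate:.2%}"
        )

def load(df: pd.DataFrame) -> None:
    """Append the cleaned batch into a warehouse staging table."""
    engine = create_engine(TARGET_DW)
    df.to_sql("stg_registrations", engine, if_exists="append", index=False)

if __name__ == "__main__":
    batch = transform(extract())
    validate(batch)
    load(batch)
```

Keeping extract, transform, validate, and load as separate functions makes each stage easy to test in isolation and to schedule as individual tasks in an orchestrator.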

Desired Skills/ Competencies 

  1. Education: A Bachelor's or Master's degree in Computer Science, Software Engineering, Data Science, or an equivalent field, with 5 to 10 years of relevant experience. 

  2. Database Management: Strong expertise in working with databases, such as SQL databases (e.g., MySQL, PostgreSQL) and NoSQL databases (e.g., MongoDB, Cassandra). 

  3. Big Data Technologies: Familiarity with big data technologies, such as Apache Hadoop, Spark, and related ecosystem components, for processing and analyzing large-scale datasets. 

  4. ETL Tools: Experience with ETL tools (e.g., Apache NiFi, Apache Airflow, Talend Open Studio, Pentaho, IBM InfoSphere) for designing and orchestrating data workflows. 

  5. Data Modeling and Warehousing: Knowledge of data modeling techniques and experience with data warehousing solutions (e.g., Amazon Redshift, Google BigQuery, Snowflake). 

  6. Data Governance and Security: Understanding of data governance principles and best practices for ensuring data quality and security. 

  7. Cloud Computing: Experience with cloud platforms (e.g., AWS, Azure, Google Cloud) and their data services for scalable and cost-effective data storage and processing. 

  8. Streaming Data Processing: Familiarity with real-time data processing frameworks (e.g., Apache Kafka, Apache Flink) for handling streaming data; a consumer sketch follows this list. 

  9. Programming: Familiarity with Python programming and prompt engineering. 
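
As a small example of the streaming skills listed above, the snippet below consumes JSON events from Apache Kafka using the kafka-python client; the topic name and broker address are hypothetical placeholders.

```python
# Minimal streaming consumer sketch using kafka-python.
# Topic name and broker address are hypothetical placeholders.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "enrollment-events",                    # hypothetical topic
    bootstrap_servers="localhost:9092",     # hypothetical broker
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # In a real pipeline each record would be validated and written
    # to a data lake or warehouse rather than printed.
    print(event.get("event_type"), event.get("timestamp"))
```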

 

KPIs: 

  1. Data Pipeline Efficiency: Measure the efficiency of data pipelines in terms of data processing time, throughput, and resource utilization. KPIs could include average time to process data, data ingestion rates, and pipeline latency. 

  2. Data Quality Metrics: Track data quality metrics such as completeness, accuracy, consistency, and timeliness of data. KPIs could include data error rates, missing values, data duplication rates, and data validation failures. 

  3. System Uptime and Availability: Monitor the uptime and availability of data infrastructure, including databases, data warehouses, and data processing systems. KPIs could include system uptime percentage, mean time between failures (MTBF), and mean time to repair (MTTR); a worked example follows this list. 

  4. Data Storage Efficiency: Measure the efficiency of data storage systems in terms of storage utilization, data compression rates, and data retention policies. KPIs could include storage utilization rates, data compression ratios, and data storage costs per unit. 

  5. Data Security and Compliance: Track adherence to data security policies and regulatory requirements such as the DPDP Act and GDPR. KPIs could include the number of security incidents, time to resolve audit findings, and the percentage of sensitive data covered by encryption and access controls.
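
To make the reliability KPIs above concrete, the sketch below computes MTBF, MTTR, and availability from a hypothetical incident log; the downtime figures are illustrative only.

```python
# Illustrative KPI arithmetic: MTBF, MTTR, and availability.
# Downtime hours below are made-up example figures.
incidents = [
    {"downtime_hours": 1.5},
    {"downtime_hours": 0.5},
    {"downtime_hours": 2.0},
]
period_hours = 30 * 24                      # one month of wall-clock time
total_downtime = sum(i["downtime_hours"] for i in incidents)
total_uptime = period_hours - total_downtime

mtbf = total_uptime / len(incidents)        # mean time between failures
mttr = total_downtime / len(incidents)      # mean time to repair
availability = total_uptime / period_hours  # uptime percentage

print(f"MTBF: {mtbf:.1f} h, MTTR: {mttr:.1f} h, availability: {availability:.2%}")
```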






Perks/benefits: Career development, Conferences

Region: Asia/Pacific
Country: India
