Data Engineer - WGDT
New Delhi, India
Wadhwani Foundation
Wadhwani Foundation is a non-profit organization focused on accelerating job growth and paving the way for millions to earn family-sustaining wages. Join us!
The Role Context:
This is an exciting opportunity to join a dynamic and growing organization
working at the forefront of technology trends and developments in the social
impact sector. The Wadhwani Center for Government Digital Transformation (WGDT)
works with government ministries and state departments in India with a mission
of “Enabling digital transformation to enhance the impact of government policy,
initiatives and programs”.
We are seeking a highly motivated and detail-oriented individual to join our
team as a Data Engineer, with experience in designing, constructing, and
maintaining the architecture and infrastructure needed for data generation,
storage, and processing, and in contributing to the successful implementation
of digital government policies and programs. You will play a key role in
developing robust, scalable, and efficient systems that manage large volumes of
data, make it accessible for analysis and decision-making, drive innovation,
and optimize operations across government ministries and state departments in
India.
Key Responsibilities:
a. Data Architecture Design: Design, develop, and maintain scalable data pipelines and
infrastructure for ingesting, processing, storing, and analyzing large volumes
of data efficiently. This involves understanding business requirements and
translating them into technical solutions.
b. Data Integration: Integrate data from various sources such as databases, APIs, streaming
platforms, and third-party systems. Ensure data is collected reliably and
efficiently, maintaining data quality and integrity throughout the process as
per ministry and government data standards.
c. Data Modeling: Design and implement data models to organize and structure data for
efficient storage and retrieval, using techniques such as dimensional modeling,
normalization, and denormalization depending on the specific requirements of
the project.
d. Data Pipeline Development/ETL (Extract, Transform, Load): Develop data pipelines/ETL
processes to extract data from source systems, transform it into the desired
format, and load it into target data systems. This involves writing scripts or
using ETL tools to automate the process and ensure data accuracy and
consistency (a minimal sketch follows this list).
e. Data Quality and Governance: Implement data quality checks and data governance policies to
ensure data accuracy, consistency, and compliance with regulations. Design and
track data lineage, and support data stewardship, metadata management, the
business glossary, etc.
f. Data Lakes and Warehousing: Design and maintain data lakes and data warehouses to store and
manage structured data from relational databases, semi-structured data such as
JSON or XML, and unstructured data such as text documents, images, and videos
at any scale. Integrate with big data processing frameworks such as Apache
Hadoop, Apache Spark, and Apache Flink, as well as with machine learning and
data visualization tools.
g. Data Security: Implement security practices, technologies, and policies designed to
protect data from unauthorized access, alteration, or destruction throughout
its lifecycle. This includes data access controls, encryption, data masking and
anonymization, data loss prevention, and compliance with regulatory
requirements such as the DPDP Act and GDPR.
h. Database Management: Administer and optimize databases, both relational and NoSQL, to manage
large volumes of data effectively.
i. Data Migration: Plan and execute data migration projects to transfer data between
systems while ensuring data consistency and minimal downtime.
j. Performance Optimization: Optimize data pipelines and queries for performance and scalability.
Identify and resolve bottlenecks, tune database configurations, and implement
caching and indexing strategies to improve data processing speed and
efficiency.
k. Collaboration: Collaborate with data scientists, analysts, and other stakeholders to
understand their data requirements and provide them with access to the
necessary data resources. Work closely with IT operations teams to deploy and
maintain data infrastructure in production environments.
l. Documentation and Reporting: Document work including data models, data pipelines/ETL
processes, and system configurations. Create documentation and provide training
to other team members to ensure the sustainability and maintainability of data
systems.
m. Continuous Learning: Stay updated with the latest technologies and trends in data
engineering and related fields. Participate in training programs, attend
conferences, and engage with the data engineering community to enhance skills
and knowledge.
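To make the pipeline-development and data-quality responsibilities (items d and e) concrete, here is a minimal, hypothetical Python sketch. It assumes a CSV extract as the source and uses SQLite to stand in for the target warehouse; the file, table, column names, and validation rules are illustrative assumptions only, not part of the role definition.

```python
# etl_sketch.py -- hypothetical illustration of responsibilities (d) and (e):
# extract from a CSV source, apply basic transforms and quality checks, and
# load validated rows into a target table (SQLite stands in for a warehouse).
import csv
import sqlite3
from datetime import datetime, timezone


def extract(source_path: str) -> list[dict]:
    """Read raw records from a CSV extract of a source system."""
    with open(source_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def transform(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Normalise fields and split rows into valid records and quality rejects."""
    valid, rejected = [], []
    for row in rows:
        scheme_id = (row.get("scheme_id") or "").strip()
        try:
            amount = float(row.get("disbursed_amount", ""))
        except (TypeError, ValueError):
            amount = None
        if scheme_id and amount is not None and amount >= 0:
            valid.append({
                "scheme_id": scheme_id,
                "disbursed_amount": amount,
                "loaded_at": datetime.now(timezone.utc).isoformat(),
            })
        else:
            rejected.append(row)  # kept aside for data-quality reporting
    return valid, rejected


def load(rows: list[dict], db_path: str = "warehouse.db") -> None:
    """Load validated rows into the target table."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS scheme_disbursements "
        "(scheme_id TEXT, disbursed_amount REAL, loaded_at TEXT)"
    )
    con.executemany(
        "INSERT INTO scheme_disbursements VALUES "
        "(:scheme_id, :disbursed_amount, :loaded_at)",
        rows,
    )
    con.commit()
    con.close()


if __name__ == "__main__":
    raw = extract("source_extract.csv")  # hypothetical source file
    clean, rejects = transform(raw)
    load(clean)
    print(f"loaded={len(clean)} rejected={len(rejects)}")
```

In practice the same extract/transform/load structure would be orchestrated by a tool such as Apache Airflow or NiFi rather than run as a standalone script; the sketch only shows the shape of the work.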
Desired Skills/Competencies:
- Education: A Bachelor's or Master's degree in Computer
Science, Software Engineering, Data Science, or equivalent with at least 5
years of experience.
- Database Management: Strong expertise in working with databases,
such as SQL databases (e.g., MySQL, PostgreSQL) and NoSQL databases (e.g.,
MongoDB, Cassandra).
- Big Data Technologies: Familiarity with big data technologies, such
as Apache Hadoop, Spark, and related ecosystem components, for processing
and analyzing large-scale datasets.
- ETL Tools: Experience with ETL tools (e.g., Apache NiFi,
Apache Airflow, Talend Open Studio, Pentaho, IBM InfoSphere) for designing
and orchestrating data workflows.
- Data Modeling and Warehousing: Knowledge of data modeling techniques and experience with data warehousing
solutions (e.g., Amazon Redshift, Google BigQuery, Snowflake).
- Data Governance and Security: Understanding of data governance principles and best practices for
ensuring data quality and security.
- Cloud Computing: Experience with cloud platforms (e.g., AWS,
Azure, Google Cloud) and their data services for scalable and
cost-effective data storage and processing.
- Streaming Data Processing: Familiarity with real-time data processing frameworks (e.g., Apache Kafka,
Apache Flink) for handling streaming data (a brief consumer sketch follows this list).
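Purely as an illustration of the streaming familiarity listed above, here is a minimal sketch assuming the kafka-python client is installed and a broker is reachable at localhost:9092; the topic name and event fields are invented for this sketch.

```python
# streaming_sketch.py -- hypothetical illustration of consuming a real-time
# event stream with kafka-python (assumed available). Topic and field names
# are invented for this sketch and are not part of the role description.
import json

from kafka import KafkaConsumer  # pip install kafka-python (assumption)

consumer = KafkaConsumer(
    "beneficiary-events",                # hypothetical topic
    bootstrap_servers="localhost:9092",  # assumed broker address
    group_id="wgdt-data-eng-sketch",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # A real pipeline would validate, enrich, and persist the event to a lake
    # or warehouse here; this sketch just prints a couple of fields.
    print(event.get("scheme_id"), event.get("event_type"))
```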
KPIs:
- Data Pipeline Efficiency: Measure
the efficiency of data pipelines in terms of data processing time,
throughput, and resource utilization. KPIs could include average time to
process data, data ingestion rates, and pipeline latency (a brief sketch of
computing a few such KPIs follows this list).
- Data Quality Metrics: Track data quality metrics such as
completeness, accuracy, consistency, and timeliness of data. KPIs could
include data error rates, missing values, data duplication rates, and data
validation failures.
- System Uptime and Availability: Monitor the uptime and availability of data infrastructure, including
databases, data warehouses, and data processing systems. KPIs could
include system uptime percentage, mean time between failures (MTBF), and
mean time to repair (MTTR).
- Data Storage Efficiency: Measure the efficiency of data storage systems in terms of storage
utilization, data compression rates, and data retention policies. KPIs
could include storage utilization rates, data compression ratios, and data
storage costs per unit.
- Data Security and Compliance: Track
adherence to data security policies and regulatory compliance requirements
such as DPDP, GDPR, HIPAA, or PCI DSS. KPIs could include security
incident rates, data access permissions, and compliance audit findings.
- Data Processing Performance: Monitor the performance of data processing tasks such as ETL (Extract,
Transform, Load) processes, data transformations, and data aggregations.
KPIs could include data processing time, CPU usage, and memory
consumption.
- Scalability and Performance Tuning: Measure the scalability and performance of data systems under varying
workloads and data volumes. KPIs could include scalability benchmarks,
system response times under load, and performance improvements achieved
through tuning.
- Resource Utilization and Cost Optimization: Track
resource utilization and costs associated with data infrastructure,
including compute resources, storage, and network bandwidth. KPIs could
include cost per data unit processed, cost per query, and cost savings
achieved through optimization.
- Incident Response and Resolution: Monitor the response time and resolution time for data-related incidents
and issues. KPIs could include incident response time, time to diagnose
and resolve issues, and customer satisfaction ratings for support
services.
- Documentation
and Knowledge Sharing: Measure the
quality and completeness of documentation for data infrastructure, data
pipelines, and data processes. KPIs could include documentation coverage,
documentation update frequency, and knowledge sharing activities such as
internal training sessions or knowledge base contributions.
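As a simple, hypothetical illustration of how a few of these KPIs might be computed from per-run pipeline statistics, consider the sketch below; the record structure, field names, and sample numbers are assumptions made for illustration only.

```python
# kpi_sketch.py -- hypothetical illustration of computing average processing
# time, error rate, and duplication rate from per-run pipeline statistics.
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class PipelineRun:
    started_at: datetime
    finished_at: datetime
    rows_in: int          # rows ingested in this run
    rows_rejected: int    # rows failing validation (data errors)
    rows_duplicated: int  # duplicate rows detected


def compute_kpis(runs: list[PipelineRun]) -> dict[str, float]:
    """Aggregate run statistics into simple pipeline and data-quality KPIs."""
    total_in = sum(r.rows_in for r in runs)
    total_seconds = sum((r.finished_at - r.started_at).total_seconds() for r in runs)
    return {
        "avg_processing_minutes": total_seconds / len(runs) / 60,
        "error_rate": sum(r.rows_rejected for r in runs) / total_in,
        "duplication_rate": sum(r.rows_duplicated for r in runs) / total_in,
    }


if __name__ == "__main__":
    now = datetime.now()
    sample = [
        PipelineRun(now - timedelta(minutes=42), now,
                    rows_in=100_000, rows_rejected=120, rows_duplicated=35),
        PipelineRun(now - timedelta(minutes=35), now,
                    rows_in=80_000, rows_rejected=40, rows_duplicated=10),
    ]
    print(compute_kpis(sample))
```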
Years of experience of the current role holder: New Position
Ideal years of experience: 3 – 5 years
Career progression for this role: CTO WGDT (Head of Incubation Centre)
*******************************************************************************
Our Culture:
WF is a global not-for-profit that works like a start-up, at a fast-moving,
dynamic pace where change is the only constant and flexibility is the key to
success.
Three mantras that we practice across job roles, levels, functions, programs,
and initiatives are Quality, Speed, and Scale, in that order.
We are an ambitious and
inclusive organization, where everyone is encouraged to contribute and
ideate. We are intensely and insanely focused on driving excellence in
everything we do.
We want individuals with a drive for excellence and the passion to do whatever
it takes to deliver world-class outcomes to our beneficiaries. We set our own
standards, often more rigorous than what our beneficiaries demand, and we want
individuals who love it this way.
We have a creative and highly energetic environment, one in which we look to
each other to innovate new solutions not only for our beneficiaries but for
ourselves too. Individuals who are open to collaborating with a borderless
mentality, often going beyond hierarchy and siloed definitions of functional
KRAs, will thrive in our environment.
This is a workplace where expertise is shared with colleagues around the
globe. Individuals uncomfortable with change, constant innovation, and short
learning cycles, and those looking for stability and orderly working days, may
not find WF to be the right place for them.
Finally, we want individuals who want to do greater good for society,
leveraging their areas of expertise, skills, and experience.
The Foundation is an equal-opportunity organization with no bias based on
gender, race, colour, ethnicity, country, language, age, or any other dimension
that comes in the way of progress.
Join us and be a part of us!