Data Analyst

Maharashtra, Pune, India


We are seeking a skilled Git Dev Data Indexing & Scraping Automation Engineer to design, build, and maintain data pipelines that scrape, index, and automate the aggregation of development data from Git repositories. The ideal candidate has hands-on experience in web scraping, data indexing, and process automation, with a solid understanding of Git ecosystems.

Who we are

Payatu is a GPTW-certified company where we strive to create a culture of excellence, growth, and innovation that empowers our employees to reach new heights in their careers. We are young, passionate folks driven by the latest innovative technologies in IoT, AI/ML, Blockchain, and many other advanced fields. We are on a mission to make the cyber world safe for every organization, product, and individual.

What will your responsibilities be?

Develop and optimize web scraping tools to extract data from Git repositories (GitHub, GitLab, Bitbucket, etc.).

Handle rate limits, CAPTCHAs, and anti-scraping mechanisms effectively.

Design and implement efficient data indexing strategies for large-scale Git data.

Build scalable data pipelines to process and store repository metadata, commits, issues, pull requests, etc.

Automate scraping and data refresh cycles using schedulers like cron, Airflow, or Celery.

Integrate version control systems with data management solutions for real-time updates.

Ensure data accuracy, integrity, and compliance with API usage policies and legal guidelines.

Implement logging, monitoring, and alert systems for data pipelines.

You are a perfect technical fit if:
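As one illustration of the rate-limit handling called out above, here is a minimal sketch in Python using only the standard library. The status codes and `X-RateLimit-*` header names follow GitHub's documented REST API; the retry policy and function names are assumptions for illustration, not a prescribed implementation.

```python
import time
import urllib.error
import urllib.request

def rate_limit_wait(status, headers, now):
    """Return seconds to sleep before retrying, or None when not rate-limited.

    GitHub signals primary rate limiting with HTTP 403/429 plus
    X-RateLimit-Remaining: 0 and an X-RateLimit-Reset epoch timestamp.
    """
    if status in (403, 429) and headers.get("X-RateLimit-Remaining") == "0":
        reset = float(headers.get("X-RateLimit-Reset", now + 60))
        return max(reset - now, 1.0)  # sleep at least 1s, until the window resets
    return None

def fetch(url):
    """Fetch a GitHub REST endpoint, backing off until the rate limit resets."""
    while True:
        req = urllib.request.Request(
            url, headers={"Accept": "application/vnd.github+json"}
        )
        try:
            with urllib.request.urlopen(req) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            delay = rate_limit_wait(err.code, err.headers, time.time())
            if delay is None:
                raise  # not a rate-limit error; propagate
            time.sleep(delay)
```

In practice a scraper would also send an `Authorization` token to raise the rate-limit ceiling, but the back-off logic stays the same.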

Strong experience with Python and scraping frameworks (Scrapy, BeautifulSoup, Selenium, etc.).

Knowledge of Git APIs (GitHub, GitLab, Bitbucket) and OAuth authentication.

Experience with databases (SQL, NoSQL) and indexing tools (Elasticsearch, Solr).

Proficiency in task automation tools (Airflow, Celery, or equivalent).

Familiarity with cloud platforms (AWS, GCP, or Azure) and containerization (Docker, Kubernetes).

Solid understanding of data privacy laws and ethical scraping practices.

Experience with GraphQL (especially GitHub's GraphQL API).

Background in data analytics, machine learning, or natural language processing (NLP).

Knowledge of DevOps practices and CI/CD pipelines.
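Since GitHub's GraphQL API is called out explicitly, here is a minimal sketch of the kind of query this work involves. The field names (`repository`, `stargazerCount`, `issues`, `pullRequests`) follow GitHub's published GraphQL schema; the helper function and its use are hypothetical, and real requests would also need an `Authorization` header.

```python
import json

# A query against GitHub's GraphQL endpoint (https://api.github.com/graphql).
# Field names follow GitHub's published schema; the rest is illustrative.
REPO_QUERY = """
query($owner: String!, $name: String!) {
  repository(owner: $owner, name: $name) {
    stargazerCount
    issues(states: OPEN) { totalCount }
    pullRequests(states: OPEN) { totalCount }
  }
}
"""

def build_payload(owner, name):
    """Build the JSON request body GitHub's GraphQL endpoint expects:
    a "query" string plus a "variables" object."""
    return json.dumps(
        {"query": REPO_QUERY, "variables": {"owner": owner, "name": name}}
    )
```

One GraphQL request like this replaces several REST calls, which matters when working under per-hour rate limits.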


Your everyday work will look like

Write and maintain web scrapers to pull data from GitHub, GitLab, Bitbucket, etc.

Handle API rate limits, authentication, and data pagination.

Monitor scraping jobs and troubleshoot failures or data inconsistencies.

Structure raw data into indexed formats for faster querying and analysis.

Optimize databases (SQL/NoSQL) and indexing tools (Elasticsearch, Solr) for large-scale Git data.

Perform routine data cleaning to maintain data accuracy and integrity.

Set up cron jobs or use workflow managers like Airflow to automate data collection and updates.

Implement version control for scraping scripts and data pipelines.

Continuously improve automation to reduce manual intervention.

Track scraping performance and server health using monitoring tools.

Manage logs and error reports, and handle exceptions in real time.

Regularly update scrapers to adapt to changes in Git APIs or website structures.

Collaborate with data analysts, engineers, and product teams to understand data needs.

Share daily/weekly progress reports and insights from indexed data.

Participate in code reviews and brainstorming sessions to optimize workflows.

Ensure compliance with Git platform terms and data privacy regulations.

Implement security protocols for handling sensitive or private repository data.
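The pagination work above often reduces to following the `rel="next"` URL that GitHub's REST API returns in the `Link` response header. A minimal parser sketch (the header format follows GitHub's REST documentation; the function name is hypothetical):

```python
def next_page(link_header):
    """Extract the rel="next" URL from a GitHub-style Link response header.

    Example header value:
        <https://api.github.com/...?page=2>; rel="next",
        <https://api.github.com/...?page=5>; rel="last"
    Returns None when there is no next page (or no header at all).
    """
    if not link_header:
        return None
    for part in link_header.split(","):
        url, _, rel = part.partition(";")
        if 'rel="next"' in rel:
            return url.strip().strip("<>")
    return None
```

A scraping loop would call `next_page()` on each response and stop when it returns `None`, which is how GitHub signals the final page.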



Perks/benefits: Career development Startup environment

