Data Engineer Intern - Web Crawling
United States
Sayari
Get instant access to public records, financial intelligence and structured business information on over 455 million companies worldwide.
About Sayari: Sayari is the counterparty and supply chain risk intelligence provider trusted by government agencies, multinational corporations, and financial institutions. Its intuitive network analysis platform surfaces hidden risk through integrated corporate ownership, supply chain, trade transaction and risk intelligence data from over 250 jurisdictions. Sayari is headquartered in Washington, D.C., and its solutions are used by thousands of frontline analysts in over 35 countries.
Our company culture is defined by a dedication to our mission of using open data to enhance visibility into global commercial and financial networks, a passion for finding novel approaches to complex problems, and an understanding that diverse perspectives create optimal outcomes. We embrace cross-team collaboration, encourage training and learning opportunities, and reward initiative and innovation. If you like working with supportive, high-performing, and curious teams, Sayari is the place for you.
Internship Description: Sayari is looking for a Data Engineer Intern specializing in web crawling to join its Data Engineering team! Sayari has developed a robust web crawling project that collects hundreds of millions of documents every year from a diverse set of sources around the world. These documents serve as source records for Sayari’s flagship graph product, a global network of corporate and trade entities and relationships. As a member of Sayari's data team, your primary objective will be to maintain and improve Sayari’s web crawling framework, with an emphasis on scalability and reliability. You will work with our Product and Software Engineering teams to ensure our crawling deployment meets product requirements and integrates efficiently with our ETL pipelines.
This is a remote, paid internship with an expected workload of 20-30 hours per week.
Job Responsibilities:
- Investigate and implement web crawlers for new sources
- Maintain and improve existing crawling infrastructure
- Improve metrics and reporting for web crawling (see the sketch after this list)
- Help improve and maintain ETL processes
- Contribute to development and design of Sayari’s data product
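As a purely illustrative sketch of the metrics-and-reporting work (the library choice, metric name, and label below are assumptions, not Sayari's actual stack), a per-source crawl counter exposed via prometheus_client might look like:

```python
from prometheus_client import Counter, start_http_server

# Hypothetical metric: documents fetched, labeled by crawl source.
DOCS_FETCHED = Counter(
    "crawler_documents_fetched_total",
    "Documents fetched by the crawler, labeled by source",
    ["source"],
)


def record_fetch(source: str) -> None:
    """Increment the counter for one fetched document."""
    DOCS_FETCHED.labels(source=source).inc()


if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for a Prometheus scrape
    record_fetch("example_registry")
```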
Required Skills & Experience:
- Experience with Python
- Experience managing web crawling at scale with any framework; Scrapy is a plus (see the sketch after this list)
- Experience working with Kubernetes
- Experience working collaboratively with git
- Experience with selectors such as XPath, CSS, and JMESPath
- Experience with browser developer tools (Chrome/Firefox)
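Purely as an illustration of the crawling and selector skills above (not Sayari's actual codebase), here is a minimal Scrapy spider sketch; the site, URL, and selectors are hypothetical:

```python
import scrapy


class RegistrySpider(scrapy.Spider):
    """Hypothetical spider: scrapes a fictional corporate registry."""

    name = "corporate_registry"
    start_urls = ["https://example.com/companies"]  # placeholder source

    def parse(self, response):
        # CSS selector to iterate over result rows
        for row in response.css("table.results tr.company"):
            yield {
                # XPath expression relative to the current row
                "name": row.xpath("./td[1]/text()").get(),
                # Equivalent extraction via a CSS selector
                "registration_id": row.css("td.reg-id::text").get(),
            }
        # Follow pagination, if the page exposes a "next" link
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

When the source is a JSON API rather than HTML, JMESPath fills the same role: `jmespath.search("results[].name", payload)` would pull every company name from a hypothetical `payload` dict.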
Desired Skills & Experience:
- Experience with Apache projects such as Spark, Avro, NiFi, and Airflow (see the Airflow sketch after this list)
- Experience with datastores such as Postgres and/or RocksDB
- Experience working on a cloud platform like GCP, AWS, or Azure
- Working knowledge of API frameworks, primarily REST
- Understanding of or interest in knowledge graphs
- Experience with *nix environments
- Experience with reverse engineering
- Experience bypassing anti-crawling measures
- Experience with JavaScript
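As a rough sketch of how crawl output might hand off to an ETL pipeline using the Apache tooling listed above (assuming Airflow 2.x; the DAG id, schedule, and task are invented for illustration):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def load_crawled_documents() -> None:
    """Placeholder task: parse raw crawl output and load it into a datastore."""
    # A real pipeline might read Avro files and upsert into Postgres here.


with DAG(
    dag_id="crawl_etl_example",  # invented name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="load_documents",
        python_callable=load_crawled_documents,
    )
```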
Perks/benefits: Career development