Data Engineering Intern

United States

Sayari

Get instant access to public records, financial intelligence, and structured business information on over 455 million companies worldwide.


About Sayari:

Sayari is the counterparty and supply chain risk intelligence provider trusted by government agencies, multinational corporations, and financial institutions. Its intuitive network analysis platform surfaces hidden risk through integrated corporate ownership, supply chain, trade transaction, and risk intelligence data from over 250 jurisdictions. Sayari is headquartered in Washington, D.C., and its solutions are used by thousands of frontline analysts in over 35 countries.

Our company culture is defined by a dedication to our mission of using open data to enhance visibility into global commercial and financial networks, a passion for finding novel approaches to complex problems, and an understanding that diverse perspectives create optimal outcomes. We embrace cross-team collaboration, encourage training and learning opportunities, and reward initiative and innovation. If you like working with supportive, high-performing, and curious teams, Sayari is the place for you.

Internship Description:

Sayari is looking for an intern to join its Data Engineering team! Sayari's flagship product, Sayari Graph, provides instant access to structured business information from billions of corporate, legal, and trade records. As a member of Sayari's data team, you will work with our Product and Software Engineering teams to collect data from around the globe, maintain existing ETL pipelines, and develop new pipelines that power Sayari Graph.
Our application tier is built primarily in TypeScript, running in Kubernetes, and backed by Postgres, Cassandra, Elasticsearch, and Memgraph. Our data ingest tier runs on Spark, processing terabytes of data collected from hundreds of data sources. The platform lets users explore a large knowledge graph sourced from hundreds of millions of structured and unstructured records spanning over 200 countries and 30 languages. As part of this team, you'll have the chance to contribute to our growing library of open-source work, including Trellis, our WebGL-powered network visualization library.
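To give a concrete flavor of that ingest tier, here is a minimal sketch of the kind of Spark job it might run: reading one bulk source and projecting it into a shared schema. Everything here is illustrative; the paths, column names, and schema are hypothetical and not taken from Sayari's codebase.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// A minimal, hypothetical sketch of an ingest-tier Spark job:
// read raw records from one bulk source and standardize a few fields.
object IngestSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ingest-sketch")
      .getOrCreate()

    // Hypothetical bulk source; real pipelines read from hundreds of sources.
    val raw = spark.read.json("s3://example-bucket/raw/source-a/*.jsonl")

    // Project source-specific fields into a shared schema.
    val standardized = raw.select(
      col("company_name").as("name"),
      upper(col("country_code")).as("jurisdiction"),
      to_date(col("registered"), "yyyy-MM-dd").as("registration_date")
    )

    standardized.write.mode("overwrite")
      .parquet("s3://example-bucket/standardized/source-a")

    spark.stop()
  }
}
```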
This is a paid, remote internship with an expected commitment of 20 to 30 hours per week.

Job Responsibilities:

  • Write and deploy crawling scripts to collect source data from the web
  • Write and run data transformers in Scala Spark to standardize bulk data sets (a sketch of this kind of job follows this list)
  • Write and run modules in Python to parse entity references and relationships from source data
  • Diagnose and fix bugs reported by internal and external users
  • Analyze and report on internal datasets to answer questions and inform feature work
  • Collaborate within and across engineering teams using basic agile principles
  • Give and receive feedback through code reviews
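As referenced in the transformer bullet above, the sketch below shows what a Scala Spark transformer that standardizes a bulk dataset and emits entity and relationship rows might look like. It is illustrative only: every path, column name, and schema is hypothetical, and the posting names Python for the parsing modules; Scala is used here just to keep the examples in one language.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// A minimal, hypothetical sketch of a transformer that turns standardized
// records into entity and relationship rows for a knowledge graph.
object TransformerSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("transformer-sketch")
      .getOrCreate()

    // Hypothetical output of an earlier standardization stage.
    val records = spark.read.parquet("s3://example-bucket/standardized/source-a")

    // Entity references: one row per distinct company.
    val entities = records
      .select(col("name"), col("jurisdiction"))
      .dropDuplicates("name", "jurisdiction")

    // Relationships: shareholder -> company ownership edges, where present.
    val relationships = records
      .filter(col("shareholder_name").isNotNull)
      .select(
        col("shareholder_name").as("source"),
        col("name").as("target"),
        lit("owns").as("relationship_type")
      )

    entities.write.mode("overwrite").parquet("s3://example-bucket/graph/entities")
    relationships.write.mode("overwrite").parquet("s3://example-bucket/graph/relationships")

    spark.stop()
  }
}
```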

Required Skills & Experience:

  • Experience with Python and/or a JVM language (e.g., Scala)
  • Experience working collaboratively with git

Desired Skills & Experience:

  • Experience with Apache Spark and Apache Airflow
  • Experience working on a cloud platform like GCP, AWS, or Azure
  • Understanding of or interest in knowledge graphs
Category: Engineering Jobs

Tags: Agile Airflow AWS Azure Cassandra Elasticsearch Engineering ETL GCP Git Kubernetes Open Source Pipelines PostgreSQL Python Scala Spark Transformers TypeScript

Region: North America
Country: United States
