Machine Learning Engineer - Data Scraping
São Paulo, SP
Tractian
Meet the most comprehensive system on the market and avoid failures, reduce costs, and increase your industry's productivity! Condition monitoring sensors, predictive maintenance software, and asset management.
Data Science at TRACTIAN
The Data Science team at TRACTIAN focuses on extracting valuable insights from vast amounts of industrial data. Using advanced statistical methods, algorithms, and data visualization techniques, this team transforms raw data into actionable intelligence that drives decision-making across engineering, product development, and operational strategies. The team constantly works on optimizing prediction models, identifying trends, and providing data-driven solutions that directly enhance the company’s operational efficiency and the quality of its products.
What you'll do
We’re looking for software and data engineers to join our newly established Data Gathering and Labeling (DGL) team. In this role, you'll be critical to building Tractian's comprehensive and diverse datasets, from industrial equipment documentation to sensor data like vibration and temperature. Your work will directly power new features in our platform and enhance our competitive advantage through richer and more reliable data resources.
Responsibilities
- Design and maintain robust data collection pipelines from a wide range of sources, including websites, documents, APIs, and raw sensor data
- Extract and structure information from unstructured or semi-structured formats into clean, standardized schemas
- Handle real-world data challenges like pagination, rate limits, CAPTCHAs, noise, missing values, and inconsistent formatting
- Clean, filter, and validate raw data to ensure high quality, consistency, and usability across our systems
- Develop small tools and utilities to support and automate data collection workflows
- Support the creation and maintenance of labeling pipelines for ML applications
- Collaborate with engineering and product teams to optimize data storage and access patterns
- Document data sources, collection methodologies, and processing procedures for reproducibility
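As a rough illustration of the pagination and rate-limit handling mentioned above, here is a minimal sketch. Everything in it is hypothetical and not taken from Tractian's codebase: the `RateLimited` exception, the `collect_all` helper, and the cursor-based fetcher interface are assumptions made for the example. The fetcher is injected so the same loop could wrap a website, an API, or a document store.

```python
import time
from typing import Callable, Iterator, List, Optional, Tuple


class RateLimited(Exception):
    """Hypothetical signal that the source asked us to slow down (e.g. HTTP 429)."""


# Assumed fetcher interface: takes a cursor (None for the first page) and
# returns (records on that page, cursor of the next page or None when done).
Fetcher = Callable[[Optional[str]], Tuple[List[dict], Optional[str]]]


def collect_all(
    fetch_page: Fetcher,
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Walk every page, retrying with exponential backoff when rate limited."""
    cursor: Optional[str] = None
    while True:
        for attempt in range(max_retries + 1):
            try:
                records, next_cursor = fetch_page(cursor)
                break
            except RateLimited:
                if attempt == max_retries:
                    raise  # give up after the last allowed retry
                # Wait base_delay, then 2x, 4x, ... before trying again.
                sleep(base_delay * 2 ** attempt)
        yield from records
        if next_cursor is None:
            return
        cursor = next_cursor
```

Passing `sleep` as a parameter keeps the loop easy to test without real waiting; in practice the fetcher would wrap an HTTP client and raise `RateLimited` when the source signals throttling.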
Requirements
- 0–2 years of experience in software development, data engineering, or related fields
- Degree in Computer Science, Computer Engineering, Information Systems, or equivalent technical background
- Understanding of HTML, CSS selectors, and how web pages are structured
- Strong problem-solving skills and an eye for detail
- Ability to work in a fast-paced environment and manage shifting priorities
Technical Skills
- Proficiency in Python, especially for data manipulation and automation
- Experience (academic or professional) with data extraction using tools like `requests`, `BeautifulSoup`, or similar
- Familiarity with REST APIs and the HTTP protocol
- Experience with data cleaning techniques such as:
  - Handling missing or inconsistent values
  - Removing duplicates and outliers
  - Standardizing formats (e.g., dates, units, text normalization)
  - Validating data against schemas or expected ranges
- (Optional) Exposure to browser automation tools like Selenium or Playwright
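To make the data cleaning techniques listed above concrete, here is a small sketch using only the Python standard library. The field names (`sensor_id`, `date`, `temp`, `unit`), the accepted date formats, and the plausible temperature range are all assumptions chosen for illustration, not a description of Tractian's actual schemas.

```python
from datetime import datetime
from typing import Iterable, List, Optional

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")  # assumed input formats
TEMP_RANGE_C = (-40.0, 150.0)            # assumed plausible range, in Celsius


def parse_date(raw: str) -> Optional[str]:
    """Standardize a date string to ISO 8601, or return None if unparseable."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def to_celsius(value: float, unit: str) -> float:
    """Standardize temperature units (only C and F are handled here)."""
    unit = unit.strip().upper()
    if unit == "C":
        return value
    if unit == "F":
        return (value - 32.0) * 5.0 / 9.0
    raise ValueError(f"unsupported unit: {unit!r}")


def clean_readings(rows: Iterable[dict]) -> List[dict]:
    """Drop incomplete, out-of-range, and duplicate readings."""
    seen = set()
    cleaned = []
    for row in rows:
        date = parse_date(str(row.get("date", "")))
        value, unit = row.get("temp"), row.get("unit")
        if date is None or value is None or unit is None:
            continue  # missing or unparseable values
        try:
            temp_c = round(to_celsius(float(value), str(unit)), 2)
        except (TypeError, ValueError):
            continue  # inconsistent value or unknown unit
        if not TEMP_RANGE_C[0] <= temp_c <= TEMP_RANGE_C[1]:
            continue  # outside the expected range
        key = (row.get("sensor_id"), date, temp_c)
        if key in seen:
            continue  # duplicate after normalization
        seen.add(key)
        cleaned.append({"sensor_id": row.get("sensor_id"), "date": date, "temp_c": temp_c})
    return cleaned
```

Deduplicating after normalization matters here: the same reading reported as `100` C on `2024-01-05` and as `212` F on `05/01/2024` only collapses into one record once dates and units share a format.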
Nice to Have
- Experience with web scraping libraries/frameworks like Scrapy, Playwright, or Selenium
- Familiarity with proxy usage, headless browsers, or CAPTCHA bypass techniques
- Understanding of database systems (SQL or NoSQL)
- Exposure to rapid prototyping tools like Streamlit
- Previous experience working with or around industrial equipment or maintenance systems
Categories:
Engineering Jobs
Machine Learning Jobs
Tags: APIs Computer Science Data visualization Engineering Industrial Machine Learning NoSQL Pipelines Playwright Prototyping Python Selenium SQL Statistics Streamlit
Region:
South America
Country:
Brazil