Department of Informatics and Analytics, AI & Data Science Internship (MLOps)
Massachusetts, United States - Remote
Dana-Farber Cancer Institute
Dana-Farber Cancer Institute in Boston, MA, is a world leader in adult and pediatric cancer treatment and research. Our oncologists and cancer researchers practice and develop some of the most advanced cancer treatments in the world.
The Department of Informatics and Analytics at Dana-Farber Cancer Institute seeks a motivated and talented Artificial Intelligence and Data Engineering Intern for our expanding AI & data team. Successful candidates will have the opportunity to contribute to high-impact healthcare projects at the cutting edge of cancer research and clinical care.
Project Objective
AI Studio serves as Dana-Farber’s centralized platform for AI model development and deployment, primarily utilizing Databricks. As the platform's user base grows, the AI Studio team requires effective methods to 1) track expenses and 2) monitor account usage.
Most activities are conducted on Databricks, which automatically logs data. By leveraging Databricks' dashboarding capabilities, we can create visualizations to monitor cost and usage effectively.
Requirements:
1. Team usage/cost
a. AI Studio users are generally organized into teams; for example, the admins of AI Studio are the “AI in Production” team. We aim to track which teams are utilizing the most compute resources and incurring the highest costs.
2. Project usage/cost
a. Users can also be grouped by projects, which may include members from the same or different teams. Dashboards should reflect usage and costs at the project level.
3. Dashboard automation
a. The dashboards constructed for teams/projects should be flexible enough to apply to a variety of teams and projects, while still effectively tracking costs and usage.
b. The intern will develop a script that generates new dashboards for projects or teams. The script's parameters should include the project/team name, users, and relevant tags.
c. While Databricks logging tracks most key parameters, some dimensions, such as projects, require finer granularity than the logs provide. Tags can be used to supply this additional information.
4. Stretch goal: Model forecasting
a. In addition to tracking current costs, we want to estimate future costs. If time permits, the intern may develop and implement a cost forecasting model.
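The team-level, project-level, and automation requirements above can be illustrated with a small sketch. It assumes usage records have already been pulled from the Databricks logs into plain Python dictionaries. The `DashboardSpec` class, the field names (`user`, `cost_usd`, `tags`), and the sample values are all hypothetical and are not part of AI Studio or of any Databricks API.

```python
from dataclasses import dataclass, field


@dataclass
class DashboardSpec:
    """Hypothetical parameters for one team or project dashboard."""
    name: str
    users: set = field(default_factory=set)
    tags: dict = field(default_factory=dict)

    def matches(self, record: dict) -> bool:
        """A record counts toward this dashboard if its user is listed,
        or if it carries every tag key/value pair named in the spec."""
        if record["user"] in self.users:
            return True
        record_tags = record.get("tags", {})
        return bool(self.tags) and all(
            record_tags.get(key) == value for key, value in self.tags.items()
        )


def total_cost(spec: DashboardSpec, records: list) -> float:
    """Sum the cost of every record that matches the spec."""
    return sum(r["cost_usd"] for r in records if spec.matches(r))


def rank_by_cost(specs: list, records: list) -> list:
    """Return (name, total cost) pairs, highest cost first."""
    totals = {spec.name: total_cost(spec, records) for spec in specs}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up usage records standing in for rows queried from the logs.
records = [
    {"user": "alice", "cost_usd": 12.5, "tags": {"project": "demo-a"}},
    {"user": "bob", "cost_usd": 30.0, "tags": {"project": "demo-b"}},
    {"user": "carol", "cost_usd": 7.5, "tags": {"project": "demo-a"}},
]

# A team is defined by its users; a project is defined by a tag.
team = DashboardSpec(name="example-team", users={"alice", "bob"})
project = DashboardSpec(name="demo-a", tags={"project": "demo-a"})

ranking = rank_by_cost([team, project], records)
print(ranking)
```

Keeping the name, users, and tags in a single parameter object means the same code path can serve both team and project dashboards, which is one way to meet the flexibility goal in requirement 3.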
Approach:
1. Prerequisites:
a. Ensure all workspace items are appropriately tagged by users.
b. Onboard the intern to Databricks.
c. Provide the intern with access to the systems access table in Unity Catalog.
d. The intern should have proficient knowledge of SQL and Python.
2. Data Handling
a. The intern will query logging data from the systems access table and user data from tracking tables.
b. The intern will use SQL or Python to construct data tables.
3. Development
a. The intern will define the minimum set of parameters needed to automate the process.
b. The intern will use the defined parameters to create two scripts: one to automate team assessments and one to automate project assessments.
4. Stretch
a. Once the team and project assessments are automated, the intern may choose to explore potential forecasting models.
b. If a reasonable model is identified, the intern can use Databricks notebooks to train and test it.
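For the stretch step, one possible baseline to try before exploring more capable models is a straight-line trend fitted by ordinary least squares to past monthly totals. The sketch below uses only the Python standard library. The function names and cost figures are made up for illustration, and the posting does not prescribe any particular forecasting method.

```python
def fit_linear_trend(costs):
    """Ordinary least squares fit of cost = intercept + slope * month_index."""
    n = len(costs)
    if n < 2:
        raise ValueError("need at least two observations to fit a trend")
    xs = range(n)  # month indices 0, 1, ..., n - 1
    x_mean = sum(xs) / n
    y_mean = sum(costs) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, costs))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


def forecast(costs, periods_ahead):
    """Extrapolate the fitted line for the next `periods_ahead` months."""
    intercept, slope = fit_linear_trend(costs)
    n = len(costs)
    return [intercept + slope * (n + i) for i in range(periods_ahead)]


# Hypothetical monthly cost totals in dollars.
monthly_costs = [100.0, 120.0, 140.0, 160.0]
next_two = forecast(monthly_costs, 2)
print(next_two)
```

A baseline like this mainly gives any later model something to beat. It ignores seasonality and sudden changes in the user base, both of which may matter as the platform grows.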
The Basics
- This role is intended to be part-time (<20 hours per week) from May until June, then full-time (40 hours per week) from June 2024 until mid-to-late August. It may continue part-time after August if the candidate is being considered for full-time employment.
- We are currently seeking to fill 1 internship position in MLOps.
- This position will be based in the New England area (MA, NH, VT, ME, RI, CT).
- Some flexibility in the overall timing is possible, but you should expect it to be as close to the stated format as possible.
- Internship projects will be scoped according to mentorship expertise, present Informatics and Analytics needs, and the applicant’s unique skills and interests. Fundamentally, our team sits at the intersection of research and the clinic, and you can expect to work on a project that will have an impact in both areas to some degree.
- The position is open to remote workers in the New England area (MA, NH, VT, ME, RI, CT).
- Candidates are expected to use their own equipment, including a camera, microphone, and headset, with high-speed internet access. Candidates will be given access to a virtual machine to execute CPU/GPU-intensive tasks.
Requirements
Qualification
- Demonstrated programming experience with Python and the most commonly used libraries for data science / machine learning / AI.
- Must have completed or be currently enrolled in a Master's or PhD program.
- Excellent problem-solving skills and a record of successfully taking on new challenges.
- Ability to work independently and communicate effectively within an interdisciplinary environment that includes data scientists, clinicians, researchers, engineers, scientists, and business leaders.
- Experience with one or more of the following is preferred: SQL, large and complex datasets, CNNs, transformers, RNNs, linear programming, cloud services, source version control, command line.
- Familiarity with one or more core methodologies and techniques, such as generative AI, hypothesis testing, classification, regression, clustering, feature allocation, deep learning, time-series analysis, natural language processing, feature selection/engineering, recommendation systems.
- Experience with healthcare data is a big plus.
Benefits
40 hours per week
$30 Per hour
Perks/benefits: Career development