Data Engineering Intern
Location: ARTPARK @ IISc, Bangalore
Type: Full-time
Overview
ARTPARK’s One-Health team tackles interconnected challenges in human, animal, and environmental health through collaborative and interdisciplinary efforts.
Working with city, state, and national governments, we support data-driven public health responses to endemic, epidemic, and climate-related threats through innovative solutions leveraging statistical and AI/ML-based approaches.
In this role, you will build data and ML pipelines and APIs; scope, obtain, clean, and standardise data for various use cases across projects; and publish the data for the community to use. The role requires commuting to the office 5 days a week. The minimum duration of the internship is 6 months.
Responsibilities
Integrate and structure data from diverse sources into a coherent, harmonised format ready for use by advanced computational models.
Develop and automate a robust, scalable data and ETL pipeline using cutting-edge technologies to ensure smooth data flow, reliability, and real-time processing.
Work with data analysts and computational epidemiologists to design and deploy simple, accessible, and scalable data access mechanisms and policies while ensuring strict data governance that complies with relevant laws and policies.
Catalogue and document exhaustively all data acquired from various sources, and maintain a repository of the standards and processes applied to the data.
Streamline the data flow so that computational and simulation modellers can easily access and utilise the data in their models without manual intervention.
Manage different types of data, including complex spatiotemporal datasets, semi-structured and unstructured data, climate data, and image datasets.
Apply state-of-the-art data standardisation techniques, leveraging AI and machine learning, including large language models (LLMs), to convert unstructured and semi-structured data into clean, usable formats for production-grade models.
Requirements
Open to students pursuing a Bachelor's degree in computer science, engineering, mathematics, or a related quantitative scientific discipline.
Demonstrable experience in developing and implementing ETL pipelines and RESTful APIs.
Expertise in Data Engineering and Automation: Proven experience designing and implementing robust data pipelines using tools like AWS cloud services and Python. Prior experience working on and maintaining open-source stacks is highly desirable.
Expertise in Database Management and Data Modelling: Deep knowledge of database management, schema design, and data modelling. Working closely with the computational epidemiology team, you will design databases and structures that align with their requirements, ensuring the data is well-organised and ready for analysis.
Prior experience with AI and Machine Learning Integration is desirable but not required.
ARTPARK @ IISc: Innovation factory for next-gen robotics & AI
ARTPARK is India's leading deep-tech venture builder and incubator focused on robotics, connected autonomous systems, and AI. Leveraging our unique facilities and ecosystems, we strive to provide meaningful support to very early-stage startups building research-based deep-tech products. We are a nonprofit organization created by the Indian Institute of Science (IISc, Bengaluru) with support from the Department of Science & Technology (Government of India) and the Government of Karnataka.