Job Description
We are looking for an experienced distributed deep learning engineer to drive cutting-edge decentralized AI and machine learning projects. The ideal candidate will play a key role in building solutions that apply advanced distributed computing techniques to complex problems in AI and ML.
Key Responsibilities
- Design and implement large-scale model training using distributed deep learning frameworks such as PyTorch, TensorFlow, and Ray (a minimal training sketch follows this list).
- Manage and optimize model training and inference processes to ensure high performance and efficiency.
- Containerize deep learning applications using Docker and orchestrate them using Kubernetes and Kubeflow.
- Deploy and manage deep learning workloads on major cloud platforms including AWS, Google Cloud, and Azure.
- Apply model compression and inference acceleration techniques, such as quantization, to optimize performance (see the quantization sketch after this list).
- Implement unified stream and batch data inference for real-time and offline processing.
- Collaborate with cross-functional teams to develop and execute technical strategies for distributed computing and deep learning solutions.
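To give candidates a concrete sense of the distributed training work described above, here is a minimal sketch of data-parallel training with PyTorch's DistributedDataParallel. The model, dataset, hyperparameters, and launch command are illustrative assumptions, not a description of our production stack.

```python
# Minimal sketch of distributed data-parallel training with PyTorch.
# Assumes launch via `torchrun --nproc_per_node=N train.py`; torchrun
# sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker process.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    dist.init_process_group(backend="gloo")  # use "nccl" on GPU nodes
    rank = dist.get_rank()

    # Toy regression data; a real job would load a sharded dataset.
    dataset = TensorDataset(torch.randn(1024, 16), torch.randn(1024, 1))
    sampler = DistributedSampler(dataset)  # gives each rank a distinct shard
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    model = DDP(nn.Linear(16, 1))  # gradients are all-reduced across ranks
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle shard assignment each epoch
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()  # triggers the cross-rank gradient all-reduce
            optimizer.step()
        if rank == 0:
            print(f"epoch {epoch}: loss {loss.item():.4f}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

In practice, work like this runs on GPU nodes with the NCCL backend and is launched through an orchestration layer such as Kubernetes or Kubeflow rather than by hand.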
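Similarly, the model compression responsibility above could involve techniques such as post-training quantization. The following is a minimal sketch using PyTorch's dynamic quantization API; the placeholder model and input shapes are assumptions for illustration only.

```python
# Minimal sketch of post-training dynamic quantization in PyTorch,
# one common model-compression technique; the model is a placeholder.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))

# Convert Linear weights to int8; activations are quantized on the fly.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 256)
print(quantized(x).shape)  # same interface, smaller and often faster on CPU
```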
Job Requirements
- Extensive experience in deep learning frameworks (PyTorch, TensorFlow, etc.) and model training/optimization.
- Strong expertise in containerization (Docker) and orchestration (Kubernetes, Kubeflow).
- Proven experience with cloud computing platforms (AWS, Google Cloud, Azure).
- Experience in CUDA programming and multi-GPU communication optimization is a plus.
- Knowledge of stream and batch data processing techniques.
- Ability to work collaboratively in a team environment and contribute to technical strategy development.
- Strong problem-solving skills and ability to work on cutting-edge AI/ML projects.
Preferred Qualifications
- Experience with Ray or other distributed computing frameworks.
- Background in decentralized AI/ML systems.
- Publications or contributions to open-source projects in relevant fields.