Job Description
This position requires a highly skilled professional to manage and maintain enterprise-level IT infrastructure, ensuring continuous system availability and optimal performance. The ideal candidate will be responsible for designing, deploying, and operating scalable cloud-native solutions, with a focus on Kubernetes-based environments. You will play a critical role in monitoring system health, proactively identifying and resolving potential issues, and implementing robust incident response protocols to minimize downtime. The role also involves collaborating with cross-functional teams to align infrastructure strategies with business objectives and technical requirements.
Key Responsibilities
- Ensure 24/7 availability of critical infrastructure through proactive monitoring, maintenance, and troubleshooting of servers, networks, and storage systems.
- Optimize system performance and scalability by analyzing bottlenecks, tuning configurations, and implementing automation tools for resource management.
- Respond to incidents promptly, conduct root-cause analysis, and document solutions to prevent recurrence while maintaining SLA compliance.
- Deploy and manage Kubernetes clusters, including container orchestration, node provisioning, and integration with CI/CD pipelines.
- Implement security best practices and compliance standards to protect infrastructure assets and ensure data integrity.
- Collaborate with developers and DevOps teams to design scalable architectures and troubleshoot application-level issues.
- Monitor system metrics and logs to identify performance trends, optimize resource allocation, and improve overall system reliability.
- Stay updated on emerging technologies and industry trends to recommend infrastructure improvements and innovations.
- Document technical processes, configurations, and incident resolutions to ensure knowledge sharing and operational continuity.
- Perform regular system audits and capacity planning to anticipate future needs and ensure infrastructure readiness.
Job Requirements
- At least 5 years of proven experience in infrastructure management, system administration, DevOps, or a related field.
- Expertise in Kubernetes cluster deployment, configuration, and operation, including familiarity with container tooling such as Docker (container runtime) and Helm (package management).
- Strong understanding of cloud platforms (AWS, Azure, GCP) and hybrid cloud environments for infrastructure scalability.
- Proficiency in scripting languages (Python, Bash, PowerShell) and automation frameworks for system maintenance tasks.
- Knowledge of network protocols, DNS management, and security practices (firewalls, encryption, IAM) to ensure infrastructure resilience.
- Ability to analyze system performance metrics and implement solutions for latency reduction and resource optimization.
- Experience with monitoring tools (Prometheus, Grafana, ELK stack) for real-time system health tracking and incident detection.
- Excellent problem-solving skills and analytical mindset to diagnose complex technical issues and develop preventive measures.
- Strong communication abilities to collaborate with stakeholders, document technical processes, and present solutions effectively.
- Certifications such as Certified Kubernetes Administrator (CKA), AWS Certified Solutions Architect, or CompTIA Security+ are preferred.
- Ability to work in fast-paced environments with strong attention to detail and organizational skills.
- Experience with CI/CD pipelines and infrastructure-as-code (IaC) practices for automated deployment and configuration management.
- Understanding of disaster recovery strategies and business continuity planning for infrastructure resilience.
- Knowledge of containerization technologies and microservices architecture for scalable cloud solutions.
- Ability to design and implement secure, high-performance infrastructure solutions that meet enterprise requirements.