Job Description
Key Responsibilities
- Define and drive the reliability roadmap including SLOs, error budgets, capacity planning, and cost/performance optimization
- Establish platform standards for progressive delivery, safe rollbacks, and change management
- Enhance observability through OpenTelemetry (metrics, logs, and traces) and implement actionable alerting
- Oversee incident management programs, including on-call rotations, root cause analysis, and postmortems, to ensure continuous improvement
- Develop policies for secrets and key management (Vault/HSM/KMS) and infrastructure hardening
- Standardize blockchain node/RPC operations (setup, upgrades, failover) and integrate them into service workflows
- Lead team recruitment, mentorship, and development while collaborating with backend, infrastructure, security, and product teams
Job Requirements
- 5+ years of DevOps/SRE experience, including 2+ years operating blockchain or other mission-critical infrastructure
- Deep expertise with Kubernetes, automation frameworks (Terraform/Helm/Ansible), and CI/CD pipelines
- Proven track record of delivering production-grade reliability for large-scale microservices
- Hands-on experience with blockchain node operations (Ethereum, Solana, Bitcoin, or similar)
- Strong foundation in observability, incident response, and system hardening
- Excellent communication skills; English proficiency preferred
Benefits
- Team building activities
- Comprehensive health checkups
- Year-end bonuses
- Professional development opportunities
- Flexible work arrangements


