Job Description
This position is responsible for operating and maintaining the company's business systems and ensuring their stability and efficiency. The candidate will work closely with business teams, maintaining efficient communication and building collaborative relationships. The role covers middleware management, improving the reliability of core components and platforms, and developing the operation and maintenance platform into a standardized service system. The candidate will investigate online issues, handle emergencies, and analyze incidents to drive optimization, and will continuously improve business quality through SLA management, disaster recovery, fault drills, monitoring, and capacity planning. The candidate will also design and optimize the server architecture to support efficient and reliable business iteration.
Key Responsibilities
- Responsible for operating and maintaining the company's business systems, improving business stability and engineering efficiency, communicating efficiently with business teams, and building good working relationships.
- Responsible for middleware operation and maintenance, improving the service-oriented capabilities and stability of core components and platforms.
- Responsible for the planning, construction, and development of the operation and maintenance platform, establishing and improving a standardized operation and maintenance service system.
- Responsible for investigating major online issues, handling emergency incidents, and carrying out post-incident analysis and optimization.
- Continuously drive improvements in business quality, covering SLAs, multi-active disaster recovery, fault drills, monitoring and alerting, and capacity management.
- Responsible for the high-availability design and performance optimization of the business server architecture to ensure efficient and reliable business iteration.
Job Requirements
- Proficiency in system operation and maintenance, with a strong focus on business stability and engineering efficiency.
- Experience in middleware management, capable of enhancing the service-oriented capabilities and stability of core components and platforms.
- Knowledge of operation and maintenance platform development, including planning, construction, and standardization of service systems.
- Ability to investigate online issues, handle emergencies, and conduct post-incident analysis for optimization.
- Strong skills in business quality improvement, covering SLA management, disaster recovery strategies, fault drills, monitoring and alerting, and capacity planning.
- Expertise in high-availability design and performance optimization of server architecture to ensure reliable business iteration.
- Excellent communication and collaboration skills to work effectively with cross-functional teams and stakeholders.
- Strong problem-solving and analytical skills for addressing complex operational challenges.
- Ability to prioritize tasks and manage multiple responsibilities simultaneously in a dynamic environment.
- Strong understanding of IT service management frameworks and industry best practices.
- Experience with cloud computing platforms and automation tools for efficient operations.
- Knowledge of security protocols and compliance standards to ensure system integrity and data protection.
- Ability to document processes and provide clear reports on system performance and improvements.
- Proficiency in scripting languages (e.g., Python, Bash) for automation and troubleshooting.
- Experience with monitoring and alerting tools (e.g., Prometheus, Grafana) for real-time system oversight.
- Knowledge of capacity planning methodologies to ensure scalable and sustainable operations.
- Ability to lead and coordinate with teams to implement disaster recovery and business continuity plans.
- Strong attention to detail and commitment to maintaining high service standards.
- Experience with DevOps practices to streamline development and operations workflows.
- Ability to adapt to evolving technologies and continuously improve operational processes.