We are seeking a Senior Site Reliability Engineer to join our team. In this role, you will design, implement, and manage our cloud infrastructure on AWS. You will apply your Kubernetes expertise to keep our systems performant and reliable, and you will advance automation and SRE best practices across the organization.
Responsibilities:
- Architect, implement, and manage highly available, scalable, and secure cloud infrastructure on AWS
- Design and implement robust Kubernetes clusters to support critical applications
- Develop comprehensive monitoring solutions using Grafana to identify and resolve system issues proactively
- Lead incident response efforts, conducting thorough root cause analysis and implementing preventive measures
- Provide on-call support to maintain system uptime and health
- Drive the adoption of SRE best practices, including capacity planning, performance tuning, and automation
- Collaborate with development teams to optimize system performance and reliability