2024-7120_Site Reliability Engineer - Level III
The Senior Site Reliability Engineer (SRE) has a strong technical background in multiple engineering disciplines. This position will interface closely with a tight-knit team of engineers across a broad range of technical areas that provide managed services to internal global customers. The ideal candidate will have a balance of breadth across engineering disciplines, depth and expertise in select areas, and practical real-life experience solving very complex problems while maintaining patience and professionalism in the most critical moments.
The incumbent operates independently on complex assignments involving the analysis of both business and regulatory requirements, as well as the analysis of individual technical implementations, to maximize benefit to the business. This role requires in-depth knowledge of Active Directory, Federation, Linux, Storage, VMware ESX, Windows Server, and related system technologies. As a technical subject matter expert, you will mentor system engineers, review existing solutions, and implement new ones to meet business objectives. The incumbent will act as an internal team escalation point for systems requests and issues. Travel of up to 15% may be required based on project needs.
Job Responsibilities
- Design & deploy a robust monitoring/alerting strategy, define & implement self-healing capabilities, create & update automated runbooks/playbooks, and triage & resolve production incidents.
- Serve as Subject Matter Expert on SolarWinds NPM, SAM, and NCM, and grow and maintain the SolarWinds application infrastructure.
- Work with the Network/Systems/Applications teams to troubleshoot and understand system faults and application performance issues, and design monitoring capabilities that can detect and auto-correct them.
- Build monitoring, alerting and dashboarding solutions that improve the visibility into our applications' performance and infrastructure metrics and keep operational workload stable.
- Use automation to streamline the monitoring of applications and services using scripting and tools.
- Good knowledge of Splunk, New Relic, Datadog, Pingdom, AppDynamics, and other monitoring tools.
- Track issues and business requests, and conduct research on broad-based solutions and new features that meet customer needs.
- Monitor application performance and usage with APM and other monitoring tools to isolate the fault domain and identify the root cause of performance issues.
- Facilitate blameless Incident Retrospectives to understand root causes, communicate learnings, determine remediation and make continuous enhancements to monitoring.
- Identify, evaluate, and recommend monitoring tools and diagnostic techniques. Assess gaps in as-is monitoring tool capabilities and recommend tools to augment or replace them.
- Monitor the support incident queue, investigating and resolving logged 3rd level technical support incidents.
- Mentor and assist eNOC Support staff with incident diagnosis.