Tech Lead Site Reliability Engineer, Edge - USDS

Site Reliability Engineering combines software and system engineering with system operations to build and run large-scale, massively distributed infrastructure. Our Edge SREs ensure infrastructure services are reliable, fault-tolerant, efficiently scalable and cost-effective. We dive deep into the stack, including network, hardware, OS, and applications, to quickly resolve complex functional and performance issues.

As an Edge Site Reliability Engineer, you will have the opportunity to manage a variety of complex systems at scale, including systems that administer hyperscale datacenters and public cloud, global content distribution networks (CDNs) and load balancers that handle Tbps of traffic. You will also have the opportunity to collaborate with various teams to translate business needs into concrete action items and improvements in system design or procedures.

In order to enhance collaboration and cross-functional partnerships, among other things, our organization currently follows a hybrid work schedule that requires employees to work in the office 3 days a week, or as directed by their manager/department. We regularly review our hybrid work model, and the specific requirements may change at any time.

Responsibilities:
- Build data pipelines, tools, automations, visualizations and monitors to facilitate the operation and optimization of edge services.
- Perform data monitoring and alerting, data quality assurance and anomaly detection.
- Document team processes and policies, including methods of engagement and SLOs.
- Analyze, design and implement solutions at the system level to remove bottlenecks and improve edge service performance.
- Implement monitoring and alerting to improve issue detection and response.
- Work in a fast-paced environment. Participate in technical operations and rotations in response to performance and reliability issues.
