Site Reliability Engineer, Recommendation Infrastructure - USDS
About the team
The Product Engineering team monitors and maintains the availability of TikTok, including services such as video playback, content discovery/recommendations, live streaming, and customer service feedback.
To enhance collaboration and cross-functional partnerships, among other things, our organization currently follows a hybrid work schedule that requires employees to work in the office 3 days a week, or as directed by their manager/department. We regularly review our hybrid work model, and the specific requirements may change at any time.
Responsibilities:
• Engage in and improve the whole lifecycle of Recommendation systems — from system design consulting through to launch reviews, deployment, operation and refinement
• Deliver tools/software to improve the reliability and scalability of services, automate operations and improve R&D efficiency
• Build and maintain the availability of large-scale services deployed across global data centers
• Plan, manage and optimize cloud resource utilization, ensuring the SLAs of large-scale clusters are met
• Measure and monitor availability, latency and overall service health
• Practice sustainable incident response and postmortems