Site Reliability Engineer, CapCut - USDS

Team Intro:
CapCut is an all-in-one video editing app that empowers creators to express themselves and transform videos into creative masterpieces. In addition to its basic features, such as video editing, text, stickers, filters, colors and music, CapCut offers free advanced features, including keyframe animation, smooth slow-motion effects, chroma key, Picture-in-Picture (PIP), and stabilization to help you capture and snip moments.

Site Reliability Engineering (SRE) at TikTok combines software and systems engineering to build and run large-scale, massively distributed, and fault-tolerant systems. In our team, you'll have the opportunity to manage the complex challenges of scale, while using expertise in coding, algorithms, complexity analysis, and large-scale system design. We embrace a culture of diversity, intellectual curiosity, openness, and problem-solving. We encourage close collaboration while promoting self-direction.

To enhance collaboration and cross-functional partnerships, among other things, our organization currently follows a hybrid work schedule that requires employees to work in the office 3 days a week, or as directed by their manager/department. We regularly review our hybrid work model, and the specific requirements may change at any time.

Responsibilities:
- Develop and maintain automation procedures to maximize system efficiency and minimize human intervention.
- Work closely with software engineering teams to design, deploy, and operate system components, ensuring that systems are functionally robust.
- Ensure system scalability to handle growth in web traffic and data.
- Implement monitoring tools and set up metrics to keep track of system health and performance.
- Participate in on-call rotations, assist with incident management, and diagnose, resolve, and prevent production issues.
- Conduct performance tests to find and address system bottlenecks.
- Collaborate with teams across the organization to define Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs).
- Practice sustainable user support, incident response, and blameless postmortems.