Software Engineering Manager, Site Reliability, Cloud Incident Response
- linkCopy link
- emailEmail a friend
Minimum qualifications:
- Bachelor's degree or equivalent practical experience.
- 8 years of experience with software development in one or more programming languages (e.g., Python, C, C++, Java, JavaScript).
- 3 years of experience in a technical leadership role; overseeing projects, with 2 years of experience in a people management, supervision/team leadership role.
- Experience with cloud services, telemetry systems and incident response.
Preferred qualifications:
- Master's degree or PhD in Computer Science, or a related technical field.
- Experience as a cloud customer.
About the job
Site Reliability Engineering (SRE) combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SRE ensures that Google Cloud's services—both our internally critical and our externally-visible systems—have reliability, uptime appropriate to customer's needs and a fast rate of improvement. Additionally SRE’s will keep an ever-watchful eye on our systems capacity and performance.
The Cloud Incident Response Team supports the responders, tooling, and outcomes for Google Cloud Platform (GCP) major incidents. The team collaborates across GCP products, customer facing teams, and a wide range of stakeholders, where you will be helping to coordinate, mitigate, or resolve issues across all of GCP.
Google Cloud accelerates every organization’s ability to digitally transform its business and industry. We deliver enterprise-grade solutions that leverage Google’s cutting-edge technology, and tools that help developers build more sustainably. Customers in more than 200 countries and territories turn to Google Cloud as their trusted partner to enable growth and solve their most critical business problems.
Responsibilities
- Participate in on-call rotation supporting Critical Incident Response for GCP.
- Focus on high-quality customer outcomes and continuous collaboration across GCP teams.
- Create Incident Management at Google (IMAG) training and processes for incident management life-cycle and partnering with Cloud SRE Uber Tech Leads, and Cloud Support leadership team.
- Build systems and tooling to support the team, improve visibility, detection of issues, communications to customers, stakeholders, and customer facing teams.
- Define and escalate risks in Cloud, reduce incident probabilities with strategic and pragmatic approaches as needed.
Information collected and processed as part of your Google Careers profile, and any job applications you choose to submit is subject to Google's Applicant and Candidate Privacy Policy.
Google is proud to be an equal opportunity and affirmative action employer. We are committed to building a workforce that is representative of the users we serve, creating a culture of belonging, and providing an equal employment opportunity regardless of race, creed, color, religion, gender, sexual orientation, gender identity/expression, national origin, disability, age, genetic information, veteran status, marital status, pregnancy or related condition (including breastfeeding), expecting or parents-to-be, criminal histories consistent with legal requirements, or any other basis protected by law. See also Google's EEO Policy, Know your rights: workplace discrimination is illegal, Belonging at Google, and How we hire.
If you have a need that requires accommodation, please let us know by completing our Accommodations for Applicants form.
Google is a global company and, in order to facilitate efficient collaboration and communication globally, English proficiency is a requirement for all roles unless stated otherwise in the job posting.
To all recruitment agencies: Google does not accept agency resumes. Please do not forward resumes to our jobs alias, Google employees, or any other organization location. Google is not responsible for any fees related to unsolicited resumes.