System Hardware Reliability Manager, AI Infrastructure

Google, Taipei, Taiwan

Google welcomes people with disabilities.

Minimum qualifications:

Bachelor's degree in Electrical Engineering, Mechanical Engineering, Reliability Engineering, Materials Science, or a related technical discipline, or equivalent practical experience.
10 years of experience in manufacturing.
8 years of experience in people management.

Preferred qualifications:

Experience with large-scale data center infrastructure, high-density compute/server topologies, or power/cooling sub-systems.
Demonstrated experience in performing risk mitigation during early design phases using predictive modeling or reliability simulations before design lockdown.
Experience designing and executing accelerated life testing (ALT, HALT) and manufacturing detection profiles tailored to data center environmental profiles.
Deep expertise in structured problem-solving methodologies (e.g., 8D, FMEA, FTA) and physical failure analysis for complex electronic assemblies or server-grade hardware.
Strong background in data analysis tools (e.g., JMP, SQL, Python/R) for life-data analysis, Weibull modeling, and predicting fleet-wide failure rates.

About the job

Be part of a team that pushes boundaries, developing custom silicon solutions that power the future of Google's direct-to-consumer products. You'll contribute to the innovation behind products loved by millions worldwide. Your expertise will shape the next generation of hardware experiences, delivering unparalleled performance, efficiency, and integration.

In this role, you will lead the team responsible for building reliability into our products from early architecture through global deployment. You will shift our focus from reactive troubleshooting to scalable strategy, partnering with Design teams and APAC manufacturers to define specifications and mitigate hardware risks before they hit production. Ultimately, you will own the technical strategy for NPI reliability frameworks, drive systemic root-cause failure analysis, and oversee the health of our active global fleet to ensure our infrastructure remains highly resilient.

The AI and Infrastructure team is redefining what’s possible. We empower Google customers with breakthrough capabilities and insights by delivering AI and Infrastructure at unparalleled scale, efficiency, reliability and velocity. Our customers include Googlers, Google Cloud customers, and billions of Google users worldwide.

We're the driving team behind Google's groundbreaking innovations, empowering the development of our AI models, delivering unparalleled computing power to global services, and providing the essential platforms that enable developers to build the future. From software to hardware, our teams are shaping the future of world-leading hyperscale computing, with key teams working on the development of our TPUs, Vertex AI for Google Cloud, Google Global Networking, Data Center operations, systems research, and much more.

Responsibilities

Coach, mentor, and scale a Reliability Engineering team across planning, validation, and fleet failure analysis, optimizing resource allocation to navigate evolving data center complexities at a fast pace.
Oversee manufacturing stability to ensure intrinsic product reliability across all verticals at APAC contract manufacturer locations, proactively identifying workflow opportunities to better support dynamic business needs.
Drive Design for Reliability (DfR) methodologies and DFMEAs from the initial concept phase, formalizing a lessons-learned pipeline to directly shape design rules for next-generation ML hardware.
Lead high-priority investigations for complex, intermittent field reliability failures, guiding internal teams, OEMs, and external laboratories through advanced failure analysis techniques to validate conclusions and enforce strict remediation standards.
Utilize statistical tools, physics-of-failure models, and internal reliability data to predict product life performance, feed back application stress, enable early detection, and define comprehensive end-of-life strategies.

Information collected and processed as part of your Google Careers profile, and any job applications you choose to submit, is subject to Google's Applicant and Candidate Privacy Policy.

Google is proud to be an equal opportunity and affirmative action employer. We are committed to building a workforce that is representative of the users we serve, creating a culture of belonging, and providing an equal employment opportunity regardless of race, creed, color, religion, gender, sexual orientation, gender identity/expression, national origin, disability, age, genetic information, veteran status, marital status, pregnancy or related condition (including breastfeeding), expecting or parents-to-be, criminal histories consistent with legal requirements, or any other basis protected by law. See also Google's EEO Policy, Know your rights: workplace discrimination is illegal, Belonging at Google, and How we hire.

If you have a need that requires accommodation, please let us know by completing our Accommodations for Applicants form.

Google is a global company and, in order to facilitate efficient collaboration and communication globally, English proficiency is a requirement for all roles unless stated otherwise in the job posting.

To all recruitment agencies: Google does not accept agency resumes. Please do not forward resumes to our jobs alias, Google employees, or any other organization location. Google is not responsible for any fees related to unsolicited resumes.
