Cloud Hardware Development Engineer, Cloud AI/ML/storage server teams

Amazon.com Inc

Cupertino, CA

JOB DETAILS

SKILLS

Amazon Web Services (AWS), Artificial Intelligence (AI), Automation, Cloud Computing, Computer Engineering, Computer Firmware, Computer Servers, Customer Experience, Debugging Skills, Design Verification, Diving, Ecosystems, Electricity, GPU (Graphics Processing Unit), Hardware Components, Hardware Development, Identify Issues, Kernel Programming, Manufacturing, Manufacturing Engineering, Needs Assessment, Network Operations Center, Onboarding, Operations Processes, Original Design Manufacturer (ODM), Problem Solving Skills, Process Improvement, Product Management, Product/Service Launch, Quality Monitoring, Reliability Engineering, Requirements Management, Resolve Customer Issues, Root Cause Analysis, Server Hardware, Software Design, Supply Chain Management, System Architecture, Systems Reliability, Team Player, Technical Leadership, Technical/Engineering Design, Telemetry, Testability, Testing, Vehicle Fleets, Verification Plans, x86 Processors

LOCATION

Cupertino, CA

POSTED

30+ days ago

As a Cloud Hardware Development Engineer, you will be an end-to-end owner of storage and/or accelerator (AI/ML/GPU) server platforms - from New Product Introduction (NPI) through fleet health in production. You own the full lifecycle: design, development, qualification, launch, and ongoing operational excellence of servers running at scale in the AWS fleet.

You will work closely with internal customers to understand their technical needs and business goals, leveraging your experience with server design and the knowledge of various teams to architect solutions we deploy at scale. To deliver your products, you will work with an interdisciplinary team of component, firmware, power, mechanical, electrical, test, qualification, and manufacturing engineers, and lead our ODM (design and manufacturing) partners to bring these servers to the data center. After launch, you own the fleet - monitoring quality, driving reliability improvements, and ensuring servers continue to meet customer requirements throughout their operational life.

This role demands deep technical curiosity and the willingness to jump in and personally solve the hardest problems. When a complex system failure occurs - whether during NPI qualification or in a production fleet of hundreds of thousands of servers - you roll up your sleeves, dive into the details across hardware, firmware, software, and physical layers, and drive to root cause. You don’t wait for someone else to figure it out.

You will own end-to-end system reliability - proactively identifying deficiencies and driving toward zero-touch operations where automation detects, diagnoses, and resolves issues before customer impact. You will decompose complex server system problems (testability, reliability, diagnostics) into deliverable tasks and features, leading delivery yourself and through others in parallel.

This is a fast-paced, intellectually challenging position. You’ll work with thought leaders in multiple technology areas, hold high standards for yourself and everyone you work with, and constantly look for ways to improve your products’ performance, quality, and cost. We’re changing an industry, and we want individuals who are ready for this challenge and want to reach beyond what is possible today.

Key job responsibilities

NPI - New Product Introduction

Own the end-to-end NPI lifecycle for storage and/or accelerator (AI/ML/GPU) server platforms - from architecture definition through design, qualification, manufacturing ramp, and launch
Lead technical solutions for complex server and rack system architectural challenges
Work with ODM/manufacturing partners to develop, validate, and manufacture server products at scale
Develop functional specifications, design verification plans, and test procedures
Drive qualification and readiness milestones, ensuring new platforms meet performance, reliability, and cost targets before fleet deployment
Identify and resolve technical risks early in the development cycle - don’t let problems reach production

Fleet Health, Diagnostics & Automation

Own fleet health for the server platforms you launch - reliability doesn’t end at ship
Design and implement predictive failure detection systems using telemetry, sensor data, error trending, and log correlation to identify hardware issues before they cause customer impact
Drive toward zero-touch operations - help build detection, diagnosis, and remediation of faults without human intervention
Debug complex system failures in time-sensitive settings - personally diving deep when the problem demands it
Perform root cause analysis correlating across firmware, kernel, driver, thermal, power, and physical layers

Systems Design & Technical Depth

Apply expertise across hardware, software, system design, x86 architecture, processes, and operations (compute, storage, network, GPU)
Design and implement solutions to address system-level issues at large scale
Decompose complex server system problems (testability, reliability, diagnostics) into deliverable tasks and features
Collaborate with hardware, software, manufacturing, supply chain, and product management teams

Cross-Team Collaboration

Work closely with internal customers to ensure new server hardware meets data path and control path requirements
Identify potential problems early when onboarding new servers into customer ecosystems
Collaborate across Hardware Engineering, component, firmware, test, qualification, and integration teams
Partner with datacenter operations to close the loop between field failures and design improvements

A day in the life

Your day-to-day responsibilities include interfacing with internal and external customers to understand product requirements and facilitate system development on top of your server designs. You will learn the operational challenges facing our existing fleet, with the goal of improving the current customer experience and developing improved systems for future designs. You will work directly with vendors and ODM (manufacturing) partners to scale your product. Some days you’re reviewing a new platform design with your ODM; other days you’re deep in logs and telemetry data chasing a failure mode across the fleet. You thrive on that range.

About the Company

Amazon.com Inc

At Amazon, we don’t wait for the next big idea to present itself. We envision the shape of impossible things and then we boldly make them reality. So far, this mindset has helped us achieve some incredible things. Let’s build new systems, challenge the status quo, and design the world we want to live in. We believe the work you do here will be the best work of your life.

Wherever you are in your career exploration, Amazon likely has an opportunity for you. Our research scientists and engineers shape the future of natural language understanding with Alexa. Fulfillment center associates around the globe send customer orders from our warehouses to doorsteps. Product managers set feature requirements, strategy, and marketing messages for brand new customer experiences. And as we grow, we’ll add jobs that haven’t been invented yet.

It’s Always Day 1
At Amazon, it’s always “Day 1.” Now, what does this mean and why does it matter? It means that our approach remains the same as it was on Amazon’s very first day – to make smart, fast decisions, stay nimble, invent, and stay focused on delighting our customers. In our 2016 shareholder letter, Amazon CEO Jeff Bezos shared his thoughts on how to keep up a Day 1 company mindset. “Staying in Day 1 requires you to experiment patiently, accept failures, plant seeds, protect saplings, and double down when you see customer delight,” he wrote. “A customer-obsessed culture best creates the conditions where all of that can happen.”

Our Leadership Principles
Our Leadership Principles help us keep a Day 1 mentality. They aren’t just a pretty inspirational wall hanging. Amazonians use them, every day, whether they’re discussing ideas for new projects, deciding on the best solution for a customer’s problem, or interviewing candidates. To read through our Leadership Principles from Customer Obsession to Bias for Action, visit https://www.amazon.jobs/principles

COMPANY SIZE

10,000 employees or more

INDUSTRY

Retail

FOUNDED

1994

WEBSITE

http://Amazon.com/militaryroles
