AI Hardware Systems Engineer, Annapurna Labs, Trainium Machine Learning Fleet Operations

Amazon.com Inc

Austin, TX

JOB DETAILS
SKILLS
Amazon Web Services (AWS), Artificial Intelligence (AI), Automation Systems, Bash Scripting, Cloud Computing, Computer Engineering, Customer Experience, Customer Support/Service, Data Analysis, Data Visualization, Debugging Skills, Diving, GPU (Graphics Processing Unit), Incident Response, Machine Learning, Machine Tool, Machining Operations, Problem Solving Skills, Product Lifecycle, Python Programming/Scripting Language, Reporting Dashboards, Scalable System Development, Scripting (Scripting Languages), Server Hardware, Software Debugging, Software Design, Software Development, Software Engineering, System Operations, System Test, Systems Engineering, Testing, Trend Analysis, Vehicle Fleets
LOCATION
Austin, TX
POSTED
30+ days ago

Annapurna Labs designs silicon and software that accelerates innovation. Customers choose us to create cloud solutions that solve challenges that were unimaginable a short time ago, even yesterday. Our custom chips, accelerators, and software stacks enable us to take on technical challenges that have never been seen before, and deliver results that help our customers change the world.

In Annapurna Labs, we are at the forefront of hardware/software co-design not just in Amazon Web Services (AWS) but across the industry.

The Machine Learning Acceleration Fleet Operations Team is looking for candidates interested in diving deep into our fleet of ML servers deployed around the world. We are seeking an engineer who is comfortable debugging emergent problems in GPU and server hardware, writing scripts in languages such as Python or Bash, running large-scale experiments on a fleet of complex hardware, developing data infrastructure and analyzing trends, and developing automation software to scale operations.

Our team has end-to-end ownership of some of the most advanced server hardware in the world. We drive technical debug efforts and write truly massive-scale autonomous software to monitor, optimize, and remediate machine learning hardware. Come join us!

Key job responsibilities:

• Member of a team responsible for system remediation, operational excellence, and customer experience on bleeding-edge ML products
• Utilize data to root-cause hardware failures and identify live trends on the most complex systems in AWS
• Implement and improve system-level testing across the product lifecycle
• Develop software which can be maintained, improved upon, documented, tested, and reused
• Dive deep on issues at the intersection of hardware and software

A day in the life:

As a Platform Development Engineer, you are the dedicated owner of an ML server platform in our fleet. Your mission is to maximize its health, sellability, and customer experience.

You start each day with eyes on the fleet, reviewing dashboards to identify trends and triaging emergent issues, then partnering with hardware and software engineering teams to debug, investigate, and translate findings into permanent fixes. You own the end-to-end testing story and manage tradeoffs between coverage and velocity. You direct new automations, tooling, and data infrastructure to scale your operations. You manage software deployments, debug issues with them, and run status meetings to align all platform stakeholders on how the product is performing.

About the team:

The MLA Fleet Operations team was formed to maintain an exceptionally high quality bar for our fleet of advanced machine learning accelerators and server products. We perfect the customer experience by developing scalable software for rapid incident response and data visualization, and by diving deep into hardware issues as they arise.

About the Company

Amazon.com Inc

At Amazon, we don’t wait for the next big idea to present itself. We envision the shape of impossible things and then we boldly make them reality. So far, this mindset has helped us achieve some incredible things. Let’s build new systems, challenge the status quo, and design the world we want to live in. We believe the work you do here will be the best work of your life.

Wherever you are in your career exploration, Amazon likely has an opportunity for you. Our research scientists and engineers shape the future of natural language understanding with Alexa. Fulfillment center associates around the globe send customer orders from our warehouses to doorsteps. Product managers set feature requirements, strategy, and marketing messages for brand new customer experiences. And as we grow, we’ll add jobs that haven’t been invented yet.

It’s Always Day 1
At Amazon, it’s always “Day 1.” Now, what does this mean and why does it matter? It means that our approach remains the same as it was on Amazon’s very first day – to make smart, fast decisions, stay nimble, invent, and stay focused on delighting our customers. In our 2016 shareholder letter, Amazon CEO Jeff Bezos shared his thoughts on how to keep up a Day 1 company mindset. “Staying in Day 1 requires you to experiment patiently, accept failures, plant seeds, protect saplings, and double down when you see customer delight,” he wrote. “A customer-obsessed culture best creates the conditions where all of that can happen.”

Our Leadership Principles
Our Leadership Principles help us keep a Day 1 mentality. They aren’t just a pretty inspirational wall hanging. Amazonians use them, every day, whether they’re discussing ideas for new projects, deciding on the best solution for a customer’s problem, or interviewing candidates. To read through our Leadership Principles from Customer Obsession to Bias for Action, visit https://www.amazon.jobs/principles
COMPANY SIZE
10,000 employees or more
INDUSTRY
Retail
FOUNDED
1994
WEBSITE
http://Amazon.com/militaryroles