Site Reliability Engineer II

Alibaba Cloud US LLC

Bellevue, WA

JOB DETAILS
SALARY
$144,000–$172,800 Per Year
JOB TYPE
Full-time, Employee
SKILLS
Algorithms, Architectural Services, Automation, CPU (Central Processing Unit), Cloud Architecture, Cloud Computing, Computer Science, Computer Security, Cross-Functional, Cryptography, Data Management, Database Administration, Database Clustering, Database Design, Database Management Software/Systems (DBMS), Database Programming, Disaster Recovery, Distributed Computing, Failover, Fleet Management, Go Programming Language (Golang), High Availability, Identity Data Management, Incident Management, Industry Standards, Information Science, Input/Output, Java, Large-Scale Systems, Load Balancing, Machine Tool, Memory Leaks, Metadata, Metrics, Microservices, Network Architecture/Engineering, NoSQL, Preventative Maintenance, Project Development, Prototyping, Python Programming/Scripting Language, Reliability Engineering, Research & Development (R&D), Research Skills, Risk Analysis, Root Cause Analysis, SQL Databases, Security Attacks, Security Auditing, Security Design, Software Patches, Stress Testing, System Validation, Systems Administration/Management, Systems Engineering, Telemetry, Test Automation, Virtualization, Work From Home
LOCATION
Bellevue, WA
POSTED
17 days ago

Platform Stability & High Availability: Conduct health checks, risk assessments, and preventive maintenance for database platform components. Design and implement HA solutions (e.g., automated fault recovery, adaptive disaster resilience) and cloud-native technologies. Optimize network architecture and Kubernetes (K8s) cluster operations for database services.

Operational Tooling & Automation: Develop platforms/tools for large-scale distributed systems management, including automated deployment, monitoring, and diagnostics. Enhance observability through metrics, logging, tracing, and alerting systems (e.g., Prometheus, Grafana, OpenTelemetry).

Incident Management & Optimization: Resolve live-site issues, including performance bottlenecks, capacity scaling, and security threats. Collaborate with product teams to refine architectures, reduce latency, and improve availability.

Cross-Functional Collaboration: Drive standardization of control-plane components (e.g., microservice frameworks, metadata services) across database engines.


1. Research and Development of Database Platform Infrastructure

Systems & Products: The employee will design and support Database-as-a-Service (DBaaS) platforms. This includes cloud-native database engines (such as PolarDB, RDS, or similar distributed SQL/NoSQL databases) and their control-plane orchestration systems.

Research Areas: Conduct research on Distributed Consensus Protocols (e.g., Paxos, Raft) to ensure data consistency and high availability. Research Adaptive Disaster Resilience algorithms to automate failover across multi-region cloud architectures.

Process: Lead the end-to-end lifecycle of high-availability solutions, from architectural design and prototyping to automated stress testing and chaos engineering to validate system robustness under extreme failure modes.

2. Large-Scale Distributed Systems Management & Tooling

Equipment & Systems: Work extensively with Kubernetes (K8s) orchestration, focusing on Custom Resource Definitions (CRDs) and Operators to manage stateful database workloads.

Tools & Technologies: Develop and maintain internal automation platforms using languages such as Go (Golang), Java, or Python. Utilize Prometheus, Grafana, and OpenTelemetry to build advanced observability frameworks that provide real-time telemetry and predictive diagnostics for thousands of database nodes.

Specific Projects: Development of an automated Database Fleet Management System that handles seamless patching, scaling, and migration of large-scale distributed clusters without service interruption.

3. Network Architecture and Cloud-Native Optimization

Technical Focus: Optimize the networking stack within virtualized environments (e.g., Service Mesh, VPC configurations, Load Balancers) to minimize tail latency and maximize throughput for database traffic.

Industry Application: These duties are situated within the Cloud Computing and Information Technology Services industry, specifically focusing on Infrastructure-as-Software and Large-Scale Data Management.

4. Incident Management and Security Performance

Process: Implement a systematic approach to Root Cause Analysis (RCA) for complex live-site incidents involving performance bottlenecks, such as CPU saturation, I/O wait times, or memory leaks in distributed environments.

Security: Design and implement automated security auditing tools to ensure database components comply with industry standards (e.g., encryption at rest/in transit, identity and access management).

Telecommuting may be permitted. When not telecommuting, must report to worksite.


Requirements:

  • Bachelor’s degree or foreign degree equivalent in Computer Science, Information Science, or related field.
  • 2 years of experience as a Site Reliability Engineer II or in any related occupation, job title, or position.


Worksite Address:

205 108th Ave NE, Suite 400, Bellevue, WA, 98004


About the Company

Alibaba Cloud US LLC