Senior Software Development Engineer - AI-Native Caching and Memory Systems

Beijing ByteDance Technology Co Ltd

San Jose, CA

JOB DETAILS
SKILLS
Algorithms, Analysis Skills, Artificial Intelligence (AI), C Programming Language, C++ Programming Language, Caching, Computer Programming, Computer Science, Computer Storage Hardware, Concurrency, Cost Modeling, Data Recovery, Database Programming, Database Technology, Debugging Skills, Disaster Recovery, Distributed Computing, Distributed Databases, Ecosystems, Go Programming Language (Golang), Incident Response, Java, Kernel Programming, Large-Scale Systems, Linux Kernel, Linux Operating System, Memory Hardware, Multi-tier Architecture, Multithreaded Programming, Performance Analysis, Performance Management, Performance Tuning/Optimization, Predictive Modeling, Programming Languages, Project/Program Management, Python Programming/Scripting Language, Redis, Service Level Agreement (SLA), Software Development, Software Engineering, Solid State Drive (SSD), eCommerce
LOCATION
San Jose, CA
POSTED
30+ days ago

About the Team

Join ByteDance's Redis Family team, where we build and operate AI-native distributed KV caching and memory systems powering ByteDance's global infrastructure. Beyond traditional caching, we are evolving toward a unified Memory Infrastructure Layer that supports high-performance Redis-compatible KV systems, persistent and tiered storage engines, LLM KV cache acceleration infrastructure, and AI-aware memory services.

Our systems serve mission-critical scenarios at massive scale - recommendation, search, ads, e-commerce, messaging, live streaming, and emerging AI-native applications - with strict requirements on availability, latency, throughput, global deployment, and cost efficiency.

Responsibilities

  • Design and develop next-generation Redis Family core systems, including distributed KV caching, persistent memory storage, LLM KV cache infrastructure, and AI-aware memory services.
  • Build planet-scale reliability by leading or contributing to HA architecture, failure isolation, multi-AZ/multi-region disaster recovery, and large-scale stability engineering for always-on business workloads.
  • Architect and optimize multi-tier memory systems (in-memory / SSD / shared storage), reducing read/write amplification and improving tail latency under extreme concurrency.
  • Build a production-grade ecosystem, including automated orchestration (provisioning, scaling, placement, scheduling) and observability and operations tooling (tracing, profiling, incident-response runbooks).
  • Implement and evolve capabilities such as Bulkload, backup and restore, point-in-time recovery, tiered storage, and integration with upstream and downstream data systems to enrich the surrounding data ecosystem.
  • Research emerging hardware and technologies, and evaluate and land production improvements based on ZNS SSD, io_uring, RDMA/CXL, and "AI+DB" approaches.

Minimum Qualifications

  • BS or a higher degree in Computer Science or related fields, or equivalent practical experience.
  • Proficiency in one or more programming languages (C, C++, Java, Go, Python, Rust) with strong coding skills in a Linux environment.
  • Solid fundamentals in distributed systems, database/storage principles, networking, and multi-threaded programming; strong debugging and performance analysis skills (profiling, tracing, flame graphs, lock contention, tail latency).
  • Hands-on experience building or operating large-scale distributed systems (high QPS, high concurrency, strict SLO/SLA), with proven ability to improve stability, performance, and cost.
  • Clear and logical thinking, coupled with a product-oriented mindset, self-driven initiative, and strong project management skills.

Preferred Qualifications

  • 3+ years in database internals/storage engine/cache system development, or equivalent large-scale infrastructure experience.
  • Familiarity with or contributions to systems such as Redis, Tair, MemoryDB, RocksDB, pika, TiDB, etc.
  • Strong knowledge of distributed consensus algorithms, with experience in database kernel development.
  • Experience with Linux kernel-level performance tuning, networking stack optimization, or I/O subsystem optimization.
  • Familiarity with RDMA, CXL, ZNS SSD, or modern storage hardware.
  • Interest or experience in applying AI techniques to database systems (e.g., cost modeling, workload prediction, auto-tuning).
  • Experience building or optimizing LLM inference KV cache, memory reuse strategies, or distributed attention state management.
  • Familiarity with mem0-style memory abstraction, RAG memory systems, or session/branch-based contextual storage.

About the Company

Beijing ByteDance Technology Co Ltd