Platform Support Architect

DataDirect Networks Inc

San Francisco, CA (remote)

JOB DETAILS

SALARY

$175,000–$200,000 Per Year

SKILLS

Artificial Intelligence (AI), Automotive Automation, Benchmarking, Best Practices, Biology, Blueprints, Broadband, Cloud Computing, Communication Skills, Continuous Deployment/Delivery, Continuous Integration, Data Management, Data Storage, Database Design, Debugging Skills, Distributed Objects, Docker, Elasticsearch, Enterprise Architecture, Environmental Issues, Ethernet, Field Trials, Financial Services, GPU (Graphics Processing Unit), Government, HIPAA (Health Insurance Portability and Accountability Act), Healthcare, Identify Issues, Leadership, Licensing, Linux Operating System, Load Balancing, Manufacturing, Market Share, Metrics, Network Attached Storage (NAS), Network Design, Network Operations Center, Network Routers, Network Switching, OEM (Original Equipment Manufacturer), Performance Analysis, Problem Solving Skills, Process Improvement, Product Engineering, Product Management, Product/Service Launch, Production Support, Production Systems, Regulations, Reporting Dashboards, Resource Management, Set Goals, Storage Area Network (SAN), Storage Software, Switched Fabric, Technical Support, Telemetry, Testing, Topology, Use Cases

LOCATION

San Francisco, CA

POSTED

22 days ago

Overview

This is an incredible opportunity to be part of a company that has been at the forefront of AI and high-performance data storage innovation for over two decades. DataDirect Networks (DDN) is a global market leader renowned for powering many of the world's most demanding AI data centers, in industries ranging from life sciences and healthcare to financial services, autonomous cars, government, academia, research and manufacturing.

"DDN's A3I solutions are transforming the landscape of AI infrastructure." - IDC

"The real differentiator is DDN. I never hesitate to recommend DDN. DDN is the de facto name for AI Storage in high performance environments" - Marc Hamilton, VP, Solutions Architecture & Engineering | NVIDIA

DDN is the global leader in AI and multi-cloud data management at scale. Our cutting-edge data intelligence platform is designed to accelerate AI workloads, enabling organizations to extract maximum value from their data. With a proven track record of performance, reliability, and scalability, DDN empowers businesses to tackle the most challenging AI and data-intensive workloads with confidence.

Our success is driven by our unwavering commitment to innovation, customer-centricity, and a team of passionate professionals who bring their expertise and dedication to every project. This is a chance to make a significant impact at a company that is shaping the future of AI and data management.

Our commitment to innovation, customer success, and market leadership makes this an exciting and rewarding role for a driven professional looking to make a lasting impact in the world of AI and data storage.

Job Description

DDN is expanding our Enterprise and Sovereign AI solution offerings, such as HyperPOD, a turnkey NVIDIA AI Data Platform built on DDN Infinia storage, NVIDIA AI Enterprise (NVAIE), and Supermicro reference hardware, and optimized for inference and RAG workloads. Our support organization has deep storage expertise (Infinia, EXAScaler); we are now hiring an AI platform specialist to lead supportability and enablement for the AI side of the stack: NVIDIA AI Enterprise services (NIMs, NeMo, Triton, GPU Operator, licensing), vector databases (initially Milvus), RAG/agentic workflows, and the high‑performance storage and networking fabric that underpins them.

You will be a trusted technical advisor within Support and across OEM and NVIDIA partner teams, combining the mindset of a solutions architect (architecture, reference patterns, PoCs, reusable assets) with that of an L3 support engineer. You'll help DDN and our partners operate AI Data solutions as a cohesive AI platform, not just a collection of components.

Key Responsibilities

Platform support

Act as the primary NVIDIA AI Enterprise and vector database solutions expert for HyperPOD customer environments, bringing deep knowledge of NVAIE services (e.g., NIMs, NeMo, Triton, TensorRT/TensorRT‑LLM, GPU Operator, licensing/NLS) and vector databases (e.g., Milvus) to guide diagnosis, optimization, and solution design.
Own complex end‑to‑end triage across GPU, NVAIE services, vector DB, Kubernetes, Docker, high‑speed networking, and Infinia storage, distinguishing product defects from environmental and integration issues.
Diagnose and resolve performance bottlenecks in RAG and agentic AI workflows, from model selection and prompt/RAG configuration through to vector search, GPU utilization, and data access patterns.
Collect and interpret logs and telemetry across Linux, containers, Kubernetes, GPU stack, vector DB, and storage/networking; build minimal repros and high‑quality defect reports for escalation to NVIDIA, vector‑DB vendors, OEMs, and internal engineering.

Runbooks, diagnostics, and supportability

Author and maintain support triage runbooks and checklists for HyperPOD covering NVAIE services, Milvus/vector DB, GPU stack, Docker, Kubernetes resources, and their interaction with Infinia and the network fabric.
Define and validate unified diagnostics bundles that capture the right logs/configs/metrics from all relevant layers (Infinia, GPUs, NVAIE, Milvus, Kubernetes, network) to enable fast problem isolation and high‑signal escalations.
Collaborate with observability and tools teams to shape Prometheus/Grafana/ELK/NetQ or equivalent dashboards that surface both platform health and RAG/service‑level metrics (e.g., TTFT, retrieval latency, error rates, throughput).

Enablement, PoCs, and reusable assets

Build hands‑on labs and PoCs that mirror customer RAG and agentic AI use cases on HyperPOD, validating supportability and capturing "known good" configurations and troubleshooting patterns.
Develop reusable technical assets - implementation guides, best‑practice playbooks, tuning checklists, example architectures - to accelerate time‑to‑value for customers, PS, and Support.

Design feedback, readiness, and cross‑functional leadership

Provide structured feedback from early field cases and PoCs into Product Management and Engineering on stack compatibility, upgrade order, rollback constraints, and observability needs for NVAIE, Milvus/cuVS, Infinia, and networking.
Collaborate closely with NVIDIA solutions architects, OEM architects, PS, and Support Innovation to align reference architectures and best practices with real‑world support experience.

Required Experience & Skills

Technical

5+ years in Linux‑based infrastructure roles (SRE, MLOps, platform engineering, or L2/L3 support) supporting production systems; 8+ years total technical experience preferred.
Strong hands‑on experience with containers and Kubernetes (Docker/containerd, Helm, Operators; debugging pods, DaemonSets, CSI, CNI, and ingress/load balancers).
Demonstrated experience operating GPU‑accelerated workloads in production:
NVIDIA GPUs, drivers, CUDA concepts, GPU utilization/perf triage
NVIDIA GPU Operator and Kubernetes‑based GPU lifecycle management
Familiarity with DGX / HGX or similar GPU cluster platforms.
Practical experience with AI storage and networking for HPC/AI clusters:
High‑performance storage systems (e.g., EXAScaler/Lustre, GPFS, Ceph, distributed object storage, enterprise NAS/SAN).
RDMA‑accelerated and/or high‑speed Ethernet/InfiniBand networking, including fabrics, switch topologies, and large‑scale deployments.
Hybrid cloud or cloud‑adjacent patterns (Kubernetes CSI, cloud‑native fabrics, data locality).
Experience with one or more vector databases (Milvus, Qdrant, Pinecone, pgVector, OpenSearch/Elasticsearch vectors, etc.), including schema design, ingestion, and operations.
Solid understanding of RAG and Generative AI workflows: embeddings, retrieval, reranking, prompt design, context management, and how these interplay with vector search and GPU inference at scale.
Familiarity with NVIDIA AI Enterprise components and toolchain, for example:
NVIDIA NIM inference microservices
NVIDIA NeMo framework / NeMo Retriever / NeMo Curator
Triton Inference Server, TensorRT / TensorRT‑LLM, CUDA libraries
NVIDIA blueprints for enterprise RAG and agentic AI.
Experience designing, operating, or supporting MLOps / GenAI pipelines: CI/CD for models, deployment strategies, canarying/rollback, GPU resource management, monitoring and alerting for AI services.
Strong diagnostic skills across Linux, containers, Kubernetes, GPUs, storage, and networking; able to quickly narrow fault domains and propose experiments or configuration changes.

Support, architecture, and stakeholder skills

Track record of building reusable technical assets (runbooks, KBs, implementation guides, benchmarks, PoC templates) that improve support readiness and partner/customer success.
Excellent communication skills, capable of clearly explaining complex AI platform topics to both engineers and executive stakeholders, internally and with partners.

Preferred Qualifications

Prior experience with scale‑out storage in GPU/AI environments.
Direct experience crafting and operating RDMA‑accelerated HPC/AI clusters at scale, including spine‑leaf or fat‑tree network designs and large switch/router deployments.
Hands‑on work with NVIDIA reference blueprints (Enterprise RAG, VSS, AIQ, industry‑specific blueprints) or similar enterprise AI architectures.
Familiarity with AI observability and responsible AI practices (guardrails, monitoring for drift/toxicity, basic understanding of regulatory considerations like GDPR/HIPAA in the context of AI systems).
Experience with observability stacks (Prometheus, Grafana, Loki/ELK, NetQ, etc.) tuned for AI workloads, including service‑level dashboards and SLOs.

What Success Looks Like in This Role

Within 6-12 months, a successful AI Data Platform Solutions Architect will have:

Become the go‑to internal expert for "how this AI and networking stack actually works in production" across Support, PS, Product, and NPI for HyperPOD.
Driven the speed and quality of solution‑level support for NVAIE, vector DB, and AI‑workflow issues through high‑quality diagnostics, architecture insight, and well‑defined "golden stack" patterns.
Established clear, repeatable triage and escalation patterns for AI‑side incidents that L1/L2 storage engineers can follow with confidence.

Salary Range for this role: $175,000 - $200,000

DDN

DDN has a very strong orientation towards these four characteristics, and any successful employee will demonstrate them:

Self-Starter - Takes independent action to identify and solve problems. Seeks out relevant information needed to make decisions. Gets involved with new initiatives.

Success/Achievement Orientation - Delivers quality results consistently. Targets, achieves (or exceeds) measurable results. Sets challenging goals, focuses on critical priorities, and is accountable.

Problem Solving - Recognizes problems and responds with a systematic assessment that identifies and addresses the cause of the issue. Practical, realistic, and resourceful.

Innovative - Builds and improves key business processes that enhance the effectiveness of DDN. Generates new ideas, challenges the status quo, and solves problems creatively.

DataDirect Networks, Inc. is an Equal Opportunity/Affirmative Action employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, gender, gender identity, gender expression, transgender, sex stereotyping, sexual orientation, national origin, disability, protected Veteran Status, or any other characteristic protected by applicable federal, state, or local law.
