Data Engineer
Artech LLC
Richmond, VA
JOB DETAILS
SKILLS
Amazon EMR, Amazon Simple Storage Service (S3), Amazon Web Services (AWS), Apache Avro, Apache Spark, Application Programming Interface (API), Change Data Capture (CDC), Cloud Computing, Continuous Deployment/Delivery, Continuous Integration, Data Analysis, Data Collection, Data Management, Data Pipelines, Data Processing, Data Quality, Data Science, Data Storage, Extract Transform and Load (ETL), Git, GitHub, Google Cloud Platform (GCP), HDFS (Hadoop Distributed File System), High Reliability, Jenkins, Messaging Middleware, Microsoft Azure, Performance Tuning/Optimization, SQL (Structured Query Language), Scalable System Development, Structured Data
LOCATION
Richmond, VA
POSTED
6 days ago
Job Title: Data Engineer – Spark & Real-Time Data Processing
Location: Richmond, VA
Duration: 6 Months
Role Overview
We are seeking an experienced Data Engineer with strong expertise in Apache Spark–based ETL pipelines and real-time data processing. The ideal candidate will design, build, and optimize scalable data platforms that support batch and streaming workloads, enabling analytics, reporting, and data-driven decision-making across the organization.
Key Responsibilities
Data Engineering & ETL Development:
Design, develop, and maintain Spark-based ETL pipelines for large-scale batch data processing.
Build reusable, fault-tolerant data frameworks for ingesting, transforming, and loading structured and semi-structured data.
Optimize Spark jobs for performance, scalability, and cost efficiency.
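As a rough illustration of the batch ETL work above, here is a minimal PySpark sketch; the bucket paths and column names (order_id, order_ts, amount) are hypothetical placeholders, not details of this role.

```python
# Minimal PySpark batch ETL sketch: read raw JSON, cleanse/type it, write
# partitioned Parquet. Paths and column names are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders-batch-etl").getOrCreate()

raw = spark.read.json("s3://example-bucket/raw/orders/")   # hypothetical source

curated = (
    raw.dropDuplicates(["order_id"])
    .withColumn("order_ts", F.to_timestamp("order_ts"))
    .withColumn("amount", F.col("amount").cast("double"))
    .filter(F.col("amount") > 0)
    .withColumn("order_date", F.to_date("order_ts"))
)

(curated.write
 .mode("overwrite")
 .partitionBy("order_date")
 .parquet("s3://example-bucket/curated/orders/"))          # hypothetical target
```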
Real-Time & Streaming Data Processing:
Develop real-time and near real-time data pipelines using technologies such as Spark Structured Streaming, Kafka, or Kinesis.
Process high-volume event streams with low latency and high reliability.
Implement windowing, watermarking, and stateful stream processing patterns.
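One way to picture the windowing and watermarking patterns mentioned above is this minimal Spark Structured Streaming sketch reading from Kafka; the broker, topic, event schema, and checkpoint path are assumptions, and it presumes the Spark Kafka connector package is available.

```python
# Minimal Structured Streaming sketch: consume JSON events from Kafka, tolerate
# late data with a watermark, and aggregate over tumbling windows.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("events-stream").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_ts", TimestampType()),
    StructField("amount", DoubleType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
    .option("subscribe", "events")                       # hypothetical topic
    .load()
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

windowed = (
    events
    .withWatermark("event_ts", "10 minutes")             # accept events up to 10 min late
    .groupBy(F.window("event_ts", "5 minutes"))          # 5-minute tumbling windows
    .agg(F.sum("amount").alias("total_amount"))
)

(windowed.writeStream
 .outputMode("update")
 .format("console")
 .option("checkpointLocation", "/tmp/checkpoints/events")  # hypothetical path
 .start()
 .awaitTermination())
```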
Data Platforms & Storage:
Integrate data from multiple sources including APIs, databases, logs, and message queues.
Design and manage data storage solutions using data lakes and lakehouse architectures (S3, ADLS, HDFS, Delta Lake, Iceberg).
Ensure data quality, consistency, and schema evolution across pipelines.
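For the schema evolution point above, one common approach (sketched here with Delta Lake, one of the formats the posting lists) is an additive mergeSchema write; the paths are placeholders and the snippet assumes the delta-spark package is installed.

```python
# Minimal Delta Lake sketch: append new data and let additive schema changes
# (new columns) merge into the target table instead of failing the write.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("customers-lakehouse")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

incoming = spark.read.parquet("s3://example-bucket/staging/customers/")   # hypothetical source

(incoming.write
 .format("delta")
 .mode("append")
 .option("mergeSchema", "true")   # allow additive schema evolution
 .save("s3://example-bucket/lake/customers/"))                            # hypothetical target
```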
Cloud & Infrastructure:
Deploy and manage data pipelines on cloud platforms (AWS, Azure, or GCP).
Work with managed Spark platforms such as Databricks, EMR, or Synapse.
Implement CI/CD pipelines for data workflows using Git, Jenkins, or GitHub Actions.
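On the CI/CD point above, the pipeline definition itself is usually YAML (GitHub Actions, Jenkins), but the piece a data engineer typically writes is the test suite the pipeline runs; a minimal pytest sketch with a hypothetical transformation is shown below.

```python
# Minimal pytest sketch of the kind of unit test a CI pipeline could run on each
# commit. The transformation and column names are illustrative assumptions.
import pytest
from pyspark.sql import SparkSession, functions as F


def add_order_date(df):
    """Hypothetical transformation under test: derive order_date from order_ts."""
    return df.withColumn("order_date", F.to_date("order_ts"))


@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[1]").appName("ci-tests").getOrCreate()


def test_add_order_date(spark):
    df = spark.createDataFrame(
        [("o1", "2024-01-15 10:30:00")], ["order_id", "order_ts"]
    ).withColumn("order_ts", F.to_timestamp("order_ts"))
    result = add_order_date(df)
    assert result.select("order_date").first()[0].isoformat() == "2024-01-15"
```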
Monitoring, Reliability & Security:
Implement logging, monitoring, and alerting for batch and streaming pipelines.
Troubleshoot data failures, performance bottlenecks, and production incidents.
Ensure data security, access control, and compliance with enterprise standards.
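A minimal sketch of the logging and basic data-quality checking described above, with hypothetical paths; in practice the log output and metrics would feed whatever monitoring and alerting stack the team uses.

```python
# Wrap a batch step with structured logging and a row-count check so that empty
# or failed runs surface as errors that alerting can pick up.
import logging
from pyspark.sql import SparkSession

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("orders-etl")

spark = SparkSession.builder.appName("orders-etl-monitored").getOrCreate()

try:
    df = spark.read.parquet("s3://example-bucket/curated/orders/")   # hypothetical path
    row_count = df.count()
    log.info("Loaded curated orders: %d rows", row_count)
    if row_count == 0:
        raise ValueError("Curated orders output is empty")
except Exception:
    log.exception("Batch step failed")
    raise
```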
Collaboration & Documentation:
Collaborate with data scientists, analysts, and product teams to understand data requirements.
Document data models, pipelines, and operational procedures.
Participate in code reviews and contribute to data engineering best practices.
Required Skills & Qualifications
Core Technical Skills:
Strong experience with Apache Spark (Spark SQL, DataFrames, Structured Streaming)
Proficiency in Python (PySpark) and/or Scala
Experience building batch and streaming ETL pipelines
Strong SQL skills for data transformation and analysis
Streaming & Messaging:
Hands-on experience with Kafka, Kinesis, Pub/Sub, or similar streaming platforms
Understanding of event-driven architectures and stream processing concepts
Data Storage & Formats:
Experience with data lake / lakehouse architectures
Familiarity with Parquet, Avro, ORC, Delta Lake, Iceberg
Cloud & DevOps:
Experience with AWS, Azure, or GCP
Knowledge of containerization and orchestration (Docker, Kubernetes – nice to have)
Experience with workflow orchestration tools (Airflow, Dagster, or Prefect)
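As a reference point for the orchestration bullet above, here is a minimal Airflow DAG sketch (assuming Airflow 2.4+ and its TaskFlow API); the task bodies are stubs and the IDs, schedule, and paths are illustrative.

```python
# Minimal Airflow TaskFlow sketch: a daily extract -> transform -> load chain.
# In practice the tasks would trigger Spark jobs (e.g., on Databricks or EMR).
from datetime import datetime

from airflow.decorators import dag, task


@dag(dag_id="orders_daily", schedule="@daily",
     start_date=datetime(2024, 1, 1), catchup=False)
def orders_daily():
    @task
    def extract() -> str:
        return "s3://example-bucket/raw/orders/"   # hypothetical source path

    @task
    def transform(path: str) -> str:
        return path.replace("raw", "curated")      # stand-in for real transform logic

    @task
    def load(path: str) -> None:
        print(f"Published curated data at {path}")

    load(transform(extract()))


orders_daily()
```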
Nice-to-Have Skills
Experience with real-time analytics and low-latency systems
Knowledge of CDC (Change Data Capture) tools such as Debezium (see the sketch after this list)
Exposure to ML data pipelines and feature stores
Experience working in high-volume, regulated environments
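For the CDC nice-to-have, here is a minimal sketch of consuming Debezium change events from Kafka with Spark Structured Streaming; the broker, topic, and table schema are assumptions, while the op/after fields follow the standard Debezium JSON envelope.

```python
# Read Debezium change events (JSON envelope) from Kafka and keep inserts/updates.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("cdc-consumer").getOrCreate()

after_schema = StructType([
    StructField("id", StringType()),
    StructField("status", StringType()),
])
envelope_schema = StructType([
    StructField("payload", StructType([
        StructField("op", StringType()),    # c = create, u = update, d = delete
        StructField("after", after_schema),
    ])),
])

changes = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
    .option("subscribe", "dbserver1.public.orders")      # hypothetical Debezium topic
    .load()
    .select(F.from_json(F.col("value").cast("string"), envelope_schema).alias("m"))
    .select(F.col("m.payload.op").alias("op"), F.col("m.payload.after").alias("after"))
    .filter(F.col("op").isin("c", "u"))   # deletes would need separate handling
    .select("op", "after.*")
)

changes.writeStream.format("console").start().awaitTermination()
```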