Armeta KZ

DevOps Engineer (AI Infrastructure)

Salary: not specified

Astana
Full-time employment
Full working day
3 to 6 years of experience

Docker
Kubernetes
Python
DevOps

Armeta Inc. is developing advanced AI-driven systems that transform how large-scale engineering and construction projects are evaluated and approved. Our technology automates complex, compliance-heavy processes, ensuring accuracy and trustworthiness.

We are building a high-performance, on-premise computing platform to power our complex multi-agent, data, and backend systems, and we are looking for a DevOps engineer to build and manage this critical infrastructure.

Key Responsibilities

Design, build, and maintain our high-availability on-premise infrastructure, which runs on Kubernetes and bare-metal hardware (including supercomputers and NVIDIA DGX systems).
Develop and manage robust CI/CD pipelines (e.g., GitLab CI, Jenkins) for automated building, testing, and deployment of all services.
Manage the deployment, scaling, and operation of our core technology stack, including:
Backend microservices (FastAPI);
AI multi-agent systems and LLM-serving platforms;
Distributed compute clusters (specifically Ray);
Object storage systems (specifically MinIO).
Implement and manage comprehensive monitoring, logging, and alerting solutions (e.g., Prometheus, Grafana, ELK/Loki) to ensure system health and performance.
Manage NVIDIA DGX hardware, including GPU drivers, CUDA, and high-performance networking (e.g., InfiniBand).
Automate infrastructure provisioning and configuration management using IaC tools (e.g., Ansible, Terraform).
Work closely with AI and Backend teams to ensure a smooth, reliable path from research and development to production.
Implement and maintain on-premise security best practices, including network policies, access control, and vulnerability management.

Qualifications

Expert-level knowledge of Kubernetes (K8s) and the container ecosystem (Docker).
Proven experience managing on-premise, bare-metal server environments. Experience with public cloud (AWS, GCP) is a plus, but on-premise expertise is essential.
Strong experience with CI/CD tools (e.g., GitLab CI, Jenkins, GitHub Actions).
Strong experience with Infrastructure as Code (IaC) tools (especially Ansible, Terraform).
5+ years of hands-on experience in DevOps, SRE, or a similar role.
Deep understanding of networking principles (TCP/IP, load balancing, firewalls, VPCs).
Proficiency in scripting and automation (e.g., Python, Bash).
Experience with monitoring and logging stacks (e.g., Prometheus, Grafana).

Preferred Qualifications (Bonus Points)

Strong experience with MLOps tools and platforms (e.g., Kubeflow, MLflow, Seldon Core, KServe).
Hands-on experience with NVIDIA GPU management, CUDA, and the NVIDIA GPU Operator for K8s.
Direct experience deploying and managing Ray clusters.
Direct experience deploying and managing MinIO clusters.
Experience with high-performance networking (e.g., InfiniBand).
Experience with distributed storage systems (e.g., Ceph).
