Armeta KZ

DevOps Engineer (AI Infrastructure)

Not specified
  • Astana
  • Full-time employment
  • Full day
  • 3 to 6 years of experience
  • Docker
  • Kubernetes
  • Python
  • DevOps

Armeta Inc. is developing advanced AI-driven systems that transform how large-scale engineering and construction projects are evaluated and approved. Our technology automates complex, compliance-heavy processes, ensuring accuracy and trustworthiness.

We are building a high-performance, on-premise computing platform to power our multi-agent, data, and backend systems, and we are looking for a DevOps engineer to build and manage this critical infrastructure.

Key Responsibilities

  • Design, build, and maintain our high-availability on-premise infrastructure, which runs on Kubernetes and bare-metal servers (including supercomputers and NVIDIA DGX systems).
  • Develop and manage robust CI/CD pipelines (e.g., GitLab CI, Jenkins) for automated building, testing, and deployment of all services.
  • Manage the deployment, scaling, and operation of our core technology stack, including:
      • Backend microservices (FastAPI);
      • AI multi-agent systems and LLM-serving platforms;
      • Distributed compute clusters (specifically Ray);
      • Object storage systems (specifically MinIO).
  • Implement and manage comprehensive monitoring, logging, and alerting solutions (e.g., Prometheus, Grafana, ELK/Loki) to ensure system health and performance.
  • Manage NVIDIA DGX hardware, including GPU drivers, CUDA, and high-performance networking (e.g., InfiniBand).
  • Automate infrastructure provisioning and configuration management using IaC tools (e.g., Ansible, Terraform).
  • Work closely with AI and Backend teams to ensure a smooth, reliable path from research and development to production.
  • Implement and maintain on-premise security best practices, including network policies, access control, and vulnerability management.

Qualifications

  • Expert-level knowledge of Kubernetes (K8s) and the container ecosystem (Docker).
  • Proven experience managing on-premise, bare-metal server environments. Experience with public cloud (AWS, GCP) is a plus, but on-premise expertise is essential.
  • Strong experience with CI/CD tools (e.g., GitLab CI, Jenkins, GitHub Actions).
  • Strong experience with Infrastructure as Code (IaC) tools (especially Ansible, Terraform).
  • 5+ years of hands-on experience in DevOps, SRE, or a similar role.
  • Deep understanding of networking principles (TCP/IP, load balancing, firewalls, VPCs).
  • Proficiency in scripting and automation (e.g., Python, Bash).
  • Experience with monitoring and logging stacks (e.g., Prometheus, Grafana).

Preferred Qualifications (Bonus Points)

  • Strong experience with MLOps tools and platforms (e.g., Kubeflow, MLflow, Seldon Core, KServe).
  • Hands-on experience with NVIDIA GPU management, CUDA, and the NVIDIA GPU Operator for K8s.
  • Direct experience deploying and managing Ray clusters.
  • Direct experience deploying and managing MinIO clusters.
  • Experience with high-performance networking (e.g., InfiniBand).
  • Experience with distributed storage systems (e.g., Ceph).