Keydev

Site Reliability Engineer (SRE)

Salary: not specified
  • Minsk
  • Full-time
  • Remote work
  • 3 to 6 years of experience

Keydev is on the lookout for a Site Reliability Engineer (Application Support Team) to join our Infrastructure and Operations Department.

Main responsibilities:

  • Maintain and enhance monitoring and logging infrastructure.
  • Improve observability processes and implement predictive failure analysis.
  • Optimize alerting systems: reduce noise, fine-tune critical metrics.
  • Define key monitoring parameters and enhance visibility.
  • Support and improve both cloud-based and on-premise environments.
  • Automate processes and configuration management using Infrastructure as Code (IaC) principles.
  • Train and mentor 24/7 App Support staff.
  • Develop runbooks, documentation, and troubleshooting guides.
  • Analyze incidents, identify patterns, and drive proactive monitoring improvements.
  • Establish and support the Monitoring & Diagnostics group within App Support.
  • Develop intelligent troubleshooting instructions for faster incident resolution.
  • Optimize existing monitoring by reducing unnecessary alerts and adding meaningful metrics.
  • Enhance reliability through structured incident management and post-mortem analysis.
  • Implement GitOps best practices for managing infrastructure and configuration.

Requirements:

  • Advanced Linux user with strong command-line and diagnostic skills.
  • 4+ years of experience as an SRE/Monitoring Engineer.
  • Strong understanding of monitoring, logging, and observability in production environments.
  • Experience optimizing alerting systems and implementing predictive analytics.
  • Hands-on experience managing both cloud and on-premise solutions.
  • Automation skills using Python or Go.
  • Proficiency with configuration management and infrastructure-as-code tools (Ansible, Terraform).
  • Solid grasp of networking principles and protocols.
  • Understanding of information security principles.
  • Experience with CI/CD pipelines (GitLab, Jenkins).
  • Familiarity with container orchestration platforms (Kubernetes, Rancher).
  • Experience documenting workflows and training support teams.
  • Ability to create intelligent troubleshooting instructions.
  • Skills in incident analysis and pattern recognition.

Nice to Have:

  • Experience working with high-load systems.
  • Deep understanding of APM tools (New Relic, Datadog, etc.).
  • Database and message queue performance tuning.
  • Advanced knowledge of ML-driven monitoring and predictive analysis.
  • Experience with automated incident response (self-healing systems).

Soft Skills:

  • Responsibility, initiative, and strong analytical thinking.
  • Ability to collaborate effectively within a team.
  • Focus on automation and process improvement.
  • Strong documentation and knowledge-sharing skills.
  • Capability to diagnose complex incidents and provide actionable insights.

Benefits:

  • Take advantage of 25 paid calendar vacation days to explore, relax, and unwind.
  • Join us for exciting corporate events that foster team spirit and fun!
  • Indulge in a variety of snacks available in the office.

We will tell you more about all the benefits at the interview :)

This is a prospective position: the role is planned but has not been created yet.