Keydev

Site Reliability Engineer (SRE)

Not specified

Keydev is on the lookout for a Site Reliability Engineer (Application Support Team) to join our Infrastructure and Operations Department.

Main responsibilities:

Maintain and enhance monitoring and logging infrastructure.
Improve observability processes and implement predictive failure analysis.
Optimize alerting systems: reduce noise, fine-tune critical metrics.
Define key monitoring parameters and enhance visibility.
Support and improve both cloud-based and on-premises environments.
Automate processes and configuration management using Infrastructure as Code (IaC) principles.
Train and mentor 24/7 App Support staff.
Develop runbooks, documentation, and troubleshooting guides.
Analyze incidents, identify patterns, and drive proactive monitoring improvements.
Establish and support the Monitoring & Diagnostics group within App Support.
Develop intelligent troubleshooting instructions for faster incident resolution.
Optimize existing monitoring by reducing unnecessary alerts and adding meaningful metrics.
Enhance reliability through structured incident management and post-mortem analysis.
Implement GitOps best practices for managing infrastructure and configuration.

Requirements:

Advanced Linux user with strong command-line and diagnostic skills.
4+ years of experience as an SRE/Monitoring Engineer.
Strong understanding of monitoring, logging, and observability in production environments.
Experience optimizing alerting systems and implementing predictive analytics.
Hands-on experience managing both cloud and on-premises solutions.
Automation skills using Python or Go.
Proficiency with configuration management tools (Ansible, Terraform).
Solid grasp of networking principles and protocols.
Understanding of information security principles.
Experience with CI/CD pipelines (GitLab, Jenkins).
Familiarity with orchestrators (Kubernetes, Rancher).
Experience documenting workflows and training support teams.
Ability to create intelligent troubleshooting instructions.
Skills in incident analysis and pattern recognition.

Nice to Have:

Soft Skills:

Benefits:

We will tell you more about all the benefits at the interview :)

This is a prospective position: it is planned but has not yet been opened.