Keydev is on the lookout for a Site Reliability Engineer (Application Support Team) to join our Infrastructure and Operations Department.
Main responsibilities:
- Maintain and enhance monitoring and logging infrastructure.
- Improve observability processes and implement predictive failure analysis.
- Optimize alerting systems: reduce noise, fine-tune critical metrics.
- Define key monitoring parameters and enhance visibility.
- Support and improve both cloud-based and on-premise environments.
- Automate processes and configuration management using Infrastructure as Code (IaC) principles.
- Train and mentor 24/7 App Support staff.
- Develop runbooks, documentation, and troubleshooting guides.
- Analyze incidents, identify patterns, and drive proactive monitoring improvements.
- Establish and support the Monitoring & Diagnostics group within App Support.
- Develop intelligent troubleshooting instructions for faster incident resolution.
- Optimize existing monitoring by reducing unnecessary alerts and adding meaningful metrics.
- Enhance reliability through structured incident management and post-mortem analysis.
- Implement GitOps best practices for managing infrastructure and configuration.
Requirements:
- Advanced Linux user with strong command-line and diagnostic skills.
- 4+ years of experience as an SRE/Monitoring Engineer.
- Strong understanding of monitoring, logging, and observability in production environments.
- Experience optimizing alerting systems and implementing predictive analytics.
- Hands-on experience managing both cloud and on-premise solutions.
- Automation skills using Python or Go.
- Proficiency with configuration management and infrastructure provisioning tools (Ansible, Terraform).
- Solid grasp of networking principles and protocols.
- Understanding of information security principles.
- Experience with CI/CD pipelines (GitLab CI, Jenkins).
- Familiarity with container orchestration platforms (Kubernetes, Rancher).
- Experience documenting workflows and training support teams.
- Ability to create intelligent troubleshooting instructions.
- Skills in incident analysis and pattern recognition.
Nice to Have:
- Experience working with high-load systems.
- Deep understanding of APM tools (New Relic, Datadog, etc.).
- Database and message queue performance tuning.
- Advanced knowledge of ML-driven monitoring and predictive analysis.
- Experience with automated incident response (self-healing systems).
Soft Skills:
- Responsibility, initiative, and strong analytical thinking.
- Ability to collaborate effectively within a team.
- Focus on automation and process improvement.
- Strong documentation and knowledge-sharing skills.
- Capability to diagnose complex incidents and provide actionable insights.
Benefits:
- Take advantage of 25 calendar days of paid vacation to explore, relax, and unwind.
- Join us for exciting corporate events that foster team spirit and fun!
- Indulge in a variety of snacks available in the office.
We will tell you more about all the benefits during the interview :)
This is a prospective position that we plan to open.