Our client is looking for a Senior Site Reliability Engineer to provide and improve the observability and reliability of the products, platforms, and applications used across our company and partners. We conduct advanced troubleshooting, contribute to architectural designs, and ultimately improve service reliability through engineering and close collaboration with other technical and support teams.
Key responsibilities
- Maintain, troubleshoot, diagnose, and resolve performance and reliability issues affecting the observability and monitoring infrastructure.
- Identify and contribute to solutions for reducing service outages, reducing alert noise, and improving monitoring.
- Work closely with other teams to design and develop solutions that deliver value to them. This requires a working understanding of all the proprietary products.
- Build tooling to automate operations and reduce toil. This may include application and systems deployment, capacity planning, and automatic failure remediation.
- Collaborate with our technical support teams to optimize the availability, reliability, and performance of the production services.
- Perform proactive investigative work to track down potential issues before they appear and drive these to resolution with the necessary product teams.
- Work closely with other teams to ensure effective resolution of incident calls and effective communication.
- Participate in incident reviews to improve alerts for detection and potential proactive mitigation.
- Log, investigate, and track technical issues raised by internal and external clients. Ownership, ability to learn, and problem-solving ability are key.
- Work independently, and collaboratively within a team when required. This requires strong time management, prioritization, communication, and collaboration skills.
- Participate in the standby or on-call rotation.
Qualifications
Who we're looking for:
- Enjoy engineering work and have fun doing it
- Strive for excellence, thrive under pressure
- Believe in teamwork and a shared vision
- Demonstrate effective communication
- Take a proactive approach to everything
- Hold yourself accountable
- Turn novel ideas into reality
Required:
- Relevant degree or post-secondary education (e.g. BSc IT / IS)
- Good understanding of network infrastructure, IT system monitoring, DevOps, CI/CD, and SRE concepts
- Proven analytical skills and experience with analytics and logging products such as Elastic Stack, Grafana, or Application Insights
- Experience with at least one of AWS or Azure
- Practical knowledge of scripting in Linux shell or PowerShell
- Practical knowledge of at least one programming language such as C#, Perl, Python, or a comparable language
- Basic to intermediate SQL query knowledge
- Good problem-solving ability
- Good spoken and written English for communicating and collaborating with diverse colleagues
Preferred:
- 3+ years of relevant experience in supporting a distributed production environment
- Experience in supporting monitoring infrastructure
- Experience with building and maintaining data pipelines with Logstash
- Strong understanding of, and experience with, DevOps and SRE concepts
- Experience with developing automation with Ansible and Octopus
- Experience with requirements analysis, system analysis and design, and web application development
- Experience with Linux and open-source products
- Proven experience with cloud technologies: AWS, Azure, Elastic Cloud