Responsibilities:
- Responsible for monitoring and maintaining production systems, including Ethereum validators, blockchain nodes, AVSs, and other applications. This involves setting up monitoring tools, troubleshooting issues, performing regular maintenance to ensure optimal performance, and implementing custom tooling where required.
- In the event of an incident or outage, the SRE will be responsible for quickly identifying the root cause of the issue and implementing a fix to restore service. This may require working outside of normal business hours to respond to incidents in a timely manner.
- Work intensively with container orchestration technologies and continuously optimize infrastructure costs.
- Responsible for documenting processes, procedures, post-incident reports, and best practices related to running our services in production. This documentation will help ensure consistency and quality across the team, and will also serve as a reference for future team members.
- Collaborate closely with other members of the team to ensure that all production services are running smoothly and that any issues, especially those affecting Ethereum validators, are addressed quickly. This may include participating in on-call rotations, attending team meetings, and working on cross-functional projects with other teams.
- Responsible for automating as many tasks as possible to reduce the manual work required to manage infrastructure. This includes scripting, developing tools, and setting up automation with Terraform and CI/CD to streamline processes.
In this role, you should have:
- Infrastructure as Code (IaC) experience on any cloud platform, preferably AWS and GCP.
- Proficiency with the Linux operating system and command-line tools.
- Skills in programming languages such as Python, Golang, or Bash.
- Experience with CI/CD pipelines and automation frameworks, preferably ArgoCD.
- Proficiency with containerization technologies such as Docker (with Docker Compose) and Kubernetes.
- Experience working with Helm charts.
- Experience designing and implementing systems with high availability, reliability, security, and cost optimization in mind.