mattevans.cloud | me@mattevans.cloud | 704-207-6585
Site Reliability Engineer with a strong belief in K.I.S.S., judicious
use of AI to improve productivity, and a love of automation,
observability, and working collaboratively to achieve the impossible.
Skills
site reliability engineering
programming
devops
databases
containers
linux systems administration
webservers
project management
system architecture & design
hybrid-clouds
SRE: Grafana, Loki, Prometheus / VictoriaMetrics,
Datadog, New Relic
DevOps: Puppet, Ansible, Terraform, Jenkins, ArgoCD,
Kubernetes, Docker
Cloud: AWS, Azure, AWS CDK, Hybrid (VMware, Proxmox,
Hyper-V)
Programming: Bash, Python, Go, GitHub Actions
Databases: Postgres, Aurora Postgres, MongoDB,
SQLite
Linux: RHEL & Derivatives, Debian &
Derivatives
Experience
Director of Site Reliability Engineering, Cyware
2023 to Present, Charlotte, NC (remote)
- Hired to separate the SRE function from DevOps while remaining very
hands-on.
- Led team of 5 Senior Site Reliability Engineers, reporting to Senior
VP of Engineering.
- Evaluated several observability platforms and settled on continuing
with Grafana stack due to highest ROI.
- Wrote numerous custom API solutions in Python to ingest critical
metrics from Cyware core applications into Prometheus and to automate
mundane tasks, such as EC2 disk expansions.
- Deployed ‘masterless Puppet’ GitOps-based state configuration for
legacy EC2 platform, to stabilize and prevent recurring issues due to
inconsistent configurations.
- Wrote a custom solution (Python + Go) to collect metrics on all
alerts, enabling data-driven decision making.
- Wrote logic and Grafana dashboard to track ‘all-customer’ and
‘per-customer’ uptime and SLA metrics.
- Built out a follow-the-sun on-call rotation in Opsgenie, introduced
blameless postmortems, and established the first set of SLIs, which led
to achievable SLOs and SLAs.
- In under a year, reduced average monthly alerts from over 1,000 to
fewer than 50, and raised uptime from 95% to 99.98% across all
customers.
- Partnered with DevOps to champion a “Next-Generation” Kubernetes-based
GitOps platform, rolled out in January 2024.
- Responsible for disaster recovery and cloud-related BCP architecture
and SOPs.
- Championed and deployed hybrid-cloud for non-production workloads,
decreasing monthly AWS spend by over 50%.
- Finally, championed a “document-all-the-things!” approach, to
prevent accumulation of tribal knowledge, prevent errors and
inconsistencies, and improve operational time to resolution.
Technologies used: Python, Bash, Terraform, AWS,
Azure, Grafana, Loki, LogQL, Prometheus, PromQL, ArgoCD, Puppet,
Kubernetes, Helm, Docker
Director of Site Reliability Engineering, Prometheus Alternative Investments
2020 to 2023, Charlotte, NC (remote)
- Joined this mobile-first startup, which unfortunately lost funding
in early 2023 and closed shop.
- Led team of 4 employees and 6 contractors in Pakistan.
- Led the development and execution of the cloud strategy, resulting
in a 400% reduction in AWS costs while improving reliability from below
90% to above 99.99%, as measured by BetterStack.
- With no defined DevOps function, combined DevOps & SRE into one
team, greatly improving collaboration from developer workstation through
production deployment.
- Re-architected AWS platform using AWS Fargate and MongoDB Cloud to
reduce costs and complexity, while allowing platform to scale rapidly to
meet MAU target.
- Worked in partnership with a Datadog Partner to quickly deploy a full
RUM and observability stack while transferring operational knowledge to
my team, saving both time and money versus the go-it-alone
approach.
- Established SLIs, SLOs, and resultant SLAs.
Technologies used: AWS, CDK, Terraform, Fargate,
Datadog, BetterStack, Opsgenie, GitHub Actions
Director of SRE & CISO, Alpha Theory (SaaS) / Centerbook Partners (Hedge Fund)
2011 to 2020, Charlotte, NC (remote)
- As a member of the executive team, participated in organizational
planning and direction, interviewing of all new hires, and ensuring
security had a voice at the highest levels of the organization.
- Managed team of 5 employees and 4 outsourced contractors.
- Led the organization from startup, through midlife, and finally to a
parent organization with $200 billion under management, while acting as
interim CTO to the wholly owned hedge fund subsidiary, Centerbook
Partners ($2B AUM).
- Led the move from on-prem to an AWS/Azure hybrid infrastructure,
allowing the SaaS application to use ephemeral virtual machines and
cutting the run time of various daily jobs by several orders of
magnitude.
- As CISO, led annual black-box penetration testing of infrastructure
and application, and owned the vulnerability management program and
overall security posture across the entire organization.
Technologies used: AWS, Azure, VMware, pfSense,
Cisco, Datacenter, Megaport, Puppet, Bash, Jenkins
Other Noteworthy Employers
2023, Wells Fargo
2009 to 2011, Honda Aircraft
2007 to 2009, IBM
2005 to 2007, AIG (United Guaranty)
Awards & Recognition
- Won bid to deploy first public WiFi in Center-City Park in Downtown
Greensboro, NC.
- Honored by Honda Aircraft CEO Michimasa Fujino for an ingenious
internet-based streaming broadcast system for the Oshkosh airshow, saving
the company several million over a traditional satellite live-broadcast
provider.
Projects
ipcheck.sh
(Python to Golang learning project)
Certifications
AWS Solutions Architect, Associate
Certified Kubernetes Administrator (CKA)
Certified Kubernetes Security Specialist (CKS)