Daniel McDonough

Summary

Senior infrastructure / site reliability engineer with deep experience modernizing legacy and hybrid environments through Infrastructure as Code, cloud-native identity, and Git-based workflows. Focused on reducing blast radius, removing failure-prone patterns, and enabling safer change.

Technical Skills

Cloud: GCP (GKE, Cloud Run, IAM, Workload Identity) | AWS (EKS, ECS, VPC)
Infrastructure: Terraform, Terragrunt, Ansible, GitHub Actions, ArgoCD
Operating Systems: Linux (RHEL, Ubuntu, Debian), Windows Server
Observability & Security: Datadog, CloudHealth, PagerDuty, Splunk
Datastores: Postgres, Redis, Elasticsearch
Languages & Tools: Bash, Python, Git, Docker

Experience

Site Reliability Engineer (Infrastructure / Platform)

MagMutual Insurance Company, Atlanta, GA | Apr. 2024 – Present

Joined a largely manual environment with no formal SLOs, incident metrics, or standardized infrastructure practices. Focused on building core infrastructure, identity, and change-management foundations to make the platform safer, more repeatable, and less failure-prone.

Designed and implemented a fully isolated GCP layout for dev, QA, UAT, and production across 12 projects using Terraform and Terragrunt, removing shared project risk and long-standing configuration drift.
Deployed and maintained 10+ GKE clusters supporting internal tooling and application workloads, including infrastructure automation with AWX and ArgoCD.
Migrated DNS and perimeter security from manual processes to Terraform-managed GitOps pipelines using GitHub Actions, reducing change risk and improving rollback safety.
Replaced long-lived cloud credentials with Workload Identity Federation and least-privilege IAM in a HIPAA-regulated environment, closing several high-risk access patterns.
Built and maintained Terraform-managed site-to-site VPN and DNS peering connectivity between on-premises systems and GCP.
Created reusable Ansible automation for Linux systems joined to Active Directory using Kerberos and GSSAPI, reducing access issues and manual server configuration across ~150 hosts.
Identified and corrected a latent access failure caused by unsafe filesystem permissions on a long-running production VM, preventing a permanent OS Login lockout during a planned migration.

Senior DevOps Engineer

Red Boundary Research, Charleston, SC | Oct. 2022 – Jan. 2024

Led infrastructure and operations for a small security startup, owning AWS architecture decisions, CI/CD strategy, and observability for its endpoint agent product.

Built and maintained infrastructure using Terraform, AWS VPCs, ECS, and related services to model diverse traffic routing and failure scenarios.
Integrated Datadog for service monitoring and Elasticsearch for centralized log aggregation and analysis.
Introduced CI/CD workflows to replace manual deployments, improving release reliability and reducing operational friction.

Senior DevOps Engineer

The Weather Company (IBM), Atlanta, GA | Oct. 2017 – Feb. 2021

Sole operations engineer supporting analytics and data science teams across production and QA AWS environments.

Reduced annual AWS spend by $355K+ through EMR spot instance migration, a reserved instance strategy for the Qlik Sense cluster, and systematic right-sizing informed by CloudHealth and Trusted Advisor analysis.
Deployed and operated Kubernetes workloads on EKS using Terraform and Helm, supporting large-scale analytics pipelines with centralized logging via Elasticsearch.
Diagnosed and recovered a production Cassandra cluster failure using custom recovery scripts.

Education

Master of Business Administration (M.B.A.), Coastal Carolina University | 2014
Bachelor of Science in Chemistry, Coastal Carolina University | 2013

Projects

Home Kubernetes cluster (Terraform/Helm)
Multi-network Kubernetes via Headscale
IoT automation (Home Assistant, pigeon loft with ESPHome)