Ops Roadmap for Learn.co

Published on Thursday, March 22, 2018

Outline

  1. Review current setup
  2. Share upcoming challenges/priorities
  3. Share roadmap + new concepts/tools

Priorities

  1. Ensure new collaborative ventures are successful
  2. Support our team as we grow (make ops more automated + manageable)

Requirements

  • Move to AWS for hosting
  • A high degree of infrastructure and environment automation and orchestration (Terraform)
  • Scaling
  • Security
  • Lower maintenance costs as our team grows

Current Setup

  • Hosted on DigitalOcean (DO)
  • Self-hosted services:
    • Postgres
    • Redis
    • Elasticsearch
    • Memcached
    • Pushstream
  • Our virtual servers are on a private network in a DO region

Pain points

  • Communication between services is not automated (no robust tooling available)
  • Our servers are “pets not cattle”
  • High maintenance costs
    • Lots of outages
    • Infrastructure is not self-healing (no robust tooling available)
  • Low security
  • Noisy alerts (Nagios)
  • Relying on brittle deployment and provisioning processes (Chef)
  • Our virtual servers are on shared machines, so they are vulnerable to leaks and attacks

Roadmap

Security

Guiding principle: Principle of Least Privilege (limit surface area / attack vectors)

More on AWS Virtual Private Cloud

  • Public and Private subnets
  • Services that don’t need to be exposed to internet (redis, etc.) will live in private subnet
  • NAT Gateway rules to manage traffic
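
A minimal sketch of the public/private split described above, assuming a hypothetical 10.0.0.0/16 VPC range and a hypothetical service inventory (neither comes from our actual setup). It only illustrates the "private by default" placement rule; it does not create any AWS resources.

```python
import ipaddress

# Hypothetical VPC address range; the real range is not stated in this post.
VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")

# Carve the VPC into /24 blocks and use the first two for this sketch.
blocks = list(VPC_CIDR.subnets(new_prefix=24))
SUBNETS = {
    "public": blocks[0],   # reachable from the internet
    "private": blocks[1],  # no inbound internet traffic; outbound goes through the NAT gateway
}

# Hypothetical inventory: True means the service must accept internet traffic.
SERVICES = {
    "web": True,
    "redis": False,
    "postgres": False,
    "elasticsearch": False,
}


def place(service: str) -> str:
    """Return the subnet tier for a service (least privilege: private unless marked public)."""
    return "public" if SERVICES.get(service, False) else "private"


if __name__ == "__main__":
    for name in SERVICES:
        tier = place(name)
        print(f"{name:15s} -> {tier:8s} {SUBNETS[tier]}")
```

Unknown services fall back to the private subnet, so exposing something to the internet has to be an explicit decision.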

Scaling

All about automation

  • Managed services instead of self-hosting
  • Migrate DNS from Dyn to AWS Route 53
  • Terraform for “Infrastructure as Code” orchestration and automation
  • Packer for automated AMI builds (images for Amazon instances)

  • Additional things we’re thinking about:
    • Deployments
    • Alerting
    • Monitoring
    • Logs
    • Containerization / Kubernetes (way down the road)

More about Terraform

Infrastructure as code: automates your environment to match your config file (declarative code)

  • Source controlled code
  • Reduces documentation (self-documenting system)
  • Support for multiple cloud providers
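
To make "declarative" concrete, here is a rough sketch of the idea in Python: compare a desired configuration with the current environment and derive the actions needed to make them match. This is an illustration of the concept only, not Terraform's actual implementation; the resource names and attributes are hypothetical.

```python
from typing import Dict, List, Tuple

Resource = Dict[str, object]


def plan(desired: Dict[str, Resource], current: Dict[str, Resource]) -> List[Tuple[str, str]]:
    """Return the (action, resource name) steps that would make `current` match `desired`."""
    actions: List[Tuple[str, str]] = []
    for name in sorted(desired):
        if name not in current:
            actions.append(("create", name))
        elif desired[name] != current[name]:
            actions.append(("update", name))
    for name in sorted(current):
        if name not in desired:
            actions.append(("delete", name))
    return actions


def apply(desired: Dict[str, Resource]) -> Dict[str, Resource]:
    """Pretend to carry out the plan: the environment ends up equal to the config."""
    return {name: dict(attrs) for name, attrs in desired.items()}


if __name__ == "__main__":
    # Hypothetical config (what we want) and environment (what exists today).
    desired = {"redis": {"size": "small"}, "worker": {"count": 3}}
    current = {"redis": {"size": "large"}, "legacy_cron": {"count": 1}}

    print(plan(desired, current))
    # Applying and planning again yields no further changes.
    print(plan(desired, apply(desired)))
```

The config describes the end state rather than the steps to reach it, which is why it can double as documentation and live in source control.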

Next steps

  • Port Redis (tested)
  • Port workers and SQS/Rabbit (spike in progress)
