Scale and Resiliency at Area 1

Area 1 Security is responsible for delivering email and DNS services to many Forbes Global 2000 companies. This is a tremendous responsibility because email and DNS allow these companies to continue operations; any disruption would impact employees immediately. If email or DNS services stop, then contracts don’t arrive, invoices don’t get paid and meetings can’t be hosted.

Here are some insights on how the Area 1 Security engineering team has built a fully resilient and scalable set of products.

Scale and resilience can mean a lot of things. For us, it means having a system that can handle thousands of potentially disruptive events, continue to function and deliver email to our customers. Our cloud provider partners are great, but we also stay vigilant about all the possible problems that can occur. We make sure to consider the storage devices, compute devices, databases, and all the networking in between. We started with monitoring and alerting on all the things we could think of.

We continually ask ourselves what would happen if any of the following occurred, whether briefly or for an extended period of time:

  • S3 becomes unavailable
  • BigQuery becomes unavailable
  • A storage device on a single node dies
  • An entire AWS or GCP region becomes unavailable
  • A compute node is replaced without notification

I could go on. These are real problems we could face at any time without warning, and we have already encountered some of them. Our alerting and monitoring today give us visibility into even the smallest misbehaving components, long before the cloud providers deliver status updates. This enables us to watch the system closely for any issue that could affect customers and to take mitigating action.
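As a rough illustration of component-level alerting, here is a minimal Python sketch that raises an alert once a component fails several health checks in a row. The class name, the threshold and the component label are all hypothetical, chosen for the example; this is not a description of Area 1's actual tooling.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ProbeMonitor:
    """Hypothetical monitor: alerts after N consecutive failed checks."""
    failure_threshold: int = 3  # assumed value, tune per component
    _failures: Dict[str, int] = field(default_factory=dict)
    alerts: List[str] = field(default_factory=list)

    def record(self, component: str, healthy: bool) -> None:
        if healthy:
            # Any successful check resets the failure streak.
            self._failures[component] = 0
            return
        count = self._failures.get(component, 0) + 1
        self._failures[component] = count
        # Alert exactly once, when the threshold is first reached.
        if count == self.failure_threshold:
            self.alerts.append(f"{component}: {count} consecutive failed checks")


monitor = ProbeMonitor(failure_threshold=2)
for ok in (True, False, False, False):
    monitor.record("object-storage", ok)
print(monitor.alerts)
```

Requiring a streak of failures rather than a single one is one common way to avoid paging on a momentary blip while still catching a component that is genuinely down.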

The second resilience component we focus on is the level of service we are delivering to customers. The breadth of Area 1’s phishing detections is constantly increasing and improving to stay ahead of bad actors’ evolving techniques. We have instrumented our system to measure the processing times for each module involved in processing every email.
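To make the idea of per-module timing concrete, here is a small Python sketch. The module names ("parse", "link_scan") and the toy processing steps are invented for illustration; the only point is the pattern of wrapping each stage in a timer and recording the duration under that stage's name.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-memory store: module name -> list of durations in seconds.
timings = defaultdict(list)


@contextmanager
def timed(module_name):
    """Record how long the wrapped block takes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[module_name].append(time.perf_counter() - start)


def process_email(raw):
    # Each stage below stands in for a real detection module.
    with timed("parse"):
        headers = raw.split("\n\n", 1)[0]
    with timed("link_scan"):
        has_link = "http" in raw
    return {"header_bytes": len(headers), "has_link": has_link}


result = process_email("Subject: hi\n\nSee http://example.com")
```

In a real deployment the durations would presumably be shipped to a metrics system rather than kept in a dictionary, but the wrapping pattern stays the same.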

This allows us to detect modules that are taking an unusually long time or consuming an unusual amount of processing capacity. We can also look at historical changes over days, weeks and months to determine whether processing times are trending in the wrong direction. This vigilance in monitoring helps us deliver the fastest email service possible.
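One simple way to spot a negative trend is to compare a recent window of measurements against the history before it. The sketch below assumes one median processing time per day, in milliseconds; the function name, the seven-day window and the 1.2x tolerance are illustrative assumptions, not values from our system.

```python
from statistics import median


def is_regressing(daily_p50_ms, recent_days=7, tolerance=1.2):
    """Return True if the recent median exceeds the earlier baseline
    by more than the given tolerance factor."""
    if len(daily_p50_ms) <= recent_days:
        return False  # not enough history to form a baseline
    baseline = median(daily_p50_ms[:-recent_days])
    recent = median(daily_p50_ms[-recent_days:])
    return recent > baseline * tolerance


# Three stable weeks followed by a slower week.
history = [100] * 21 + [130] * 7
slower = is_regressing(history)
```

Medians are used here instead of means so that a single slow day does not trigger a false alarm; a production system might prefer percentiles or a proper statistical test.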

Scaling the system has been top of mind from Area 1’s very first deployment in 2014 to today. This has meant architecting the system so that computing resources and regions can be added worldwide. We constantly evaluate which components need to live in each region and which can reside centrally. Knowing what will happen if a region cannot reach a centrally located resource, whether intermittently or for an extended period of time, is a primary component of our planning.
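A common pattern for tolerating an unreachable central resource is for each region to keep the last value it fetched successfully and to keep serving it for a bounded time. The Python sketch below is a generic illustration of that pattern under assumed names (RegionalCache, fetch_central, max_stale_seconds); it is not a description of how Area 1's regions are actually built.

```python
import time
from typing import Callable, Optional


class RegionalCache:
    """Serve the last known-good central value during an outage,
    but refuse to serve it once it is older than max_stale_seconds."""

    def __init__(self, fetch_central: Callable[[], dict],
                 max_stale_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch_central
        self._max_stale = max_stale_seconds
        self._clock = clock
        self._value: Optional[dict] = None
        self._fetched_at: Optional[float] = None

    def get(self) -> dict:
        try:
            self._value = self._fetch()
            self._fetched_at = self._clock()
        except ConnectionError:
            if self._value is None:
                raise  # nothing cached yet, so there is nothing to fall back on
            age = self._clock() - self._fetched_at
            if age > self._max_stale:
                raise RuntimeError(f"cached value is {age:.0f}s old; too stale to serve")
        return self._value
```

The staleness limit is the interesting design choice: a short limit fails fast during long outages, while a long one keeps the region running on older data. Which is right depends on the resource in question.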

Scaling can be difficult without first addressing resilience and alerting. Things may work great when you have a few dozen computing resources and associated services, but if you don’t have visibility into their operations and are not prepared to manage the inevitable problems, scaling will only increase the frequency of bad events. If everyone is worried that an alarm will go off at any moment (though it always can), the engineering staff will live on edge and start to lose productivity due to the sheer anxiety of the situation. And that wouldn’t be fun for anyone!

If you're interested in learning more about our engineering team and how we operate, contact us here or feel free to connect with me directly at [email protected] or LinkedIn.

