Observability Engineer, Automation & Reliability Engineering (ARE)

Location: Silver Spring, Maryland - Remote
Category: Network and Cloud
Employment Type: Contract
Job ID: 14524
Date Added: 06/14/2022

Observability Engineer, Automation & Reliability Engineering (ARE)
About the job

  • Automation & Reliability Engineering (ARE) combines software and systems engineering to build and run large-scale, distributed, fault-tolerant systems. ARE ensures that Discovery’s services, both our internally critical and our externally visible systems, have reliability and uptime appropriate to users’ needs and a fast rate of improvement. Additionally, Observability Engineers (OEs) keep a watchful eye on our systems’ capacity and performance. Much of our engineering focuses on optimizing existing systems, building infrastructure, and eliminating work through automation. An OE is a practitioner and advocate of good monitoring practices and configuration management within GT&O, and so should be a great communicator and enthusiastic champion of Technology Operations. The core purpose of the role is to ensure that our applications, platforms, and infrastructure are effectively monitored for availability, performance, and functionality, and that alerts driven by our monitoring systems are accurate and actionable.

    On the ARE team, you’ll have the opportunity to manage the complex challenges of scale that are unique to Discovery, while using your expertise in observability, monitoring, and system design.


  • Design, roadmap, and administer the tools used to discover and monitor Discovery’s applications, services, platforms, and infrastructure.

  • Build monitoring systems that assist in infrastructure and application event detection and alert remediation.

  • Ensure all relevant infrastructure and services are properly covered within our monitoring and alerting systems in a manner consistent with our standards; collect the right metrics at the right frequency and ensure the data is readily available for effective alerting, reporting, and analysis.

  • Define business and operations success metrics, and establish a departmental process model for benchmarking, standardization, and process improvement.

  • Collaborate with a cross-functional team of developers, operations staff, engineers, and architects to understand complex application architectures and implement an effective top-down monitoring strategy for holistic service visibility.

  • Participate in strategy and planning discussions for the redesign and implementation of monitoring environments, modernizing them in line with the latest technology trends.

  • Leverage performance counters to diagnose and troubleshoot infrastructure problems.

  • Create and maintain documentation for monitoring requirements, processes, and implementation.

  • Assist in the deployment, organization, and management of standard operating procedures.

  • Perform other duties as needed.