Job Summary

The Reliability Engineer will be responsible for ensuring our systems are best-in-class for stability, observability, and enterprise scale in the healthcare services space. This role will be part of a key growth initiative within Walgreens and will partner with the application development teams to ensure the services perform to client expectations.

You can expect to collaborate with technology and business leaders to understand and define the service levels needed for our solutions, ensuring that the appropriate metrics for meeting SLAs are defined, measured, and met, while continually evolving to adapt to the changing needs of the business.

We are looking for someone with an engineering mindset who will:

  • Partner with DevOps teams to drive observability into our technology stacks and define metrics so that we can improve our stability and meet customer needs.
  • Collaborate with engineering and DevOps so that our services are scalable and resilient.
  • Create feedback for continual improvement and evolution of non-functional requirements.
  • Bring knowledge of DevOps tooling and partner with development teams to implement toolsets and train teams.

Areas of focus:

  • Availability – ensure max uptime, identify changes needed to weed out failures.
  • Latency – measure against SLAs, identify bottlenecks, create NFRs, and recommend changes to address them.
  • Efficiency – fast, frequent deployment with little to no impact on customer base (processes like canary and blue/green deployments).
  • Change Management – instill resilience and robustness in new updates and features. Clear identification and tracking of changes and ability to measure impact and revert when needed.
  • Issue resolution – resolve production issues, perform root cause analysis, feedback loop to other teams for fixes and improvements.
  • Capacity Planning – leverage data to analyze trends and plan for future state capabilities.
  • Metrics and measurements – identify service levels and the metrics needed to verify that they are met.
  • Observability – work with teams to ensure solutions are supportable, and implement and manage the supporting tools.
  • Compliance controls and adherence – ensure HITRUST controls are implemented and followed.
  • Business continuity – backup, DR, resiliency.

Job Responsibilities:

  • Works with business and technology leaders to define appropriate service level objectives and service level indicators in partnership with product and engineering teams.
  • Analyzes production system operations using tools such as monitoring, capacity analysis, and outage root cause analysis to identify and drive changes that ensure continuous improvement in system stability and performance.
  • Performs real-time troubleshooting and repair of mission-critical application and platform components using critical knowledge of Azure PaaS components and application architecture.
  • Facilitates blameless post-mortems and provides feedback to product development and engineering teams for fixes and prioritization of improvements.
  • Measures and optimizes system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating to continually improve.
  • Ensures timely and effective reporting, tracking, follow-up, and communication of problems to internal and external clients, technical resources, and executives.
  • Manages day-to-day issues, including health checks of applications and processes, working closely with end users, development staff, and infrastructure teams to prioritize, resolve, and/or mitigate outages.
  • Defines and implements business continuity strategies and ensures information controls are in place across all environments.
  • Identifies opportunities to automate repeated manual tasks and develops tools and automation to improve the efficiency of the platform and infrastructure, minimize downtime, implement self-healing patterns, and achieve human-free operations.
  • Designs, develops, and drives troubleshooting and mitigation tools as part of driving the self-healing agenda.
  • Provides primary operational support and engineering for the cloud-based ecosystem.
  • Works across product, engineering, and support teams to understand the deployment lifecycle.
  • Defines business and application SLIs/SLAs and creates appropriate dashboards and thresholds for monitoring and alerting to support service level achievement.
  • Defines and drives adoption of best-in-class monitoring frameworks, tools, dashboards, and automation to proactively detect, alert on, and self-heal production anomalies.
  • Partners with development, testing, and DevOps teams to improve service reliability and scalability through influencing architecture, design, and testing processes and establishing comprehensive release procedures.
  • Participates in system design consulting, platform management, and capacity planning.
  • Works with product and engineering teams to identify and prioritize non-functional requirements around resiliency, security, and availability, which will help ensure the achievement of platform and solution SLAs.
  • Implements and advocates applicable HITRUST controls across all tools, platforms, applications, and support processes.
  • Contributes to architectural strategy and roadmap to improve scalability, availability, performance, and security.
  • Implements continuous process improvement, including but not limited to policy, procedures, and production monitoring and alerting.

