Senior Lead Site Reliability Engineer
Company: JPMorganChase
Location: Plano
Posted on: April 1, 2026
Job Description:
Guide and shape the future of technology at a globally recognized
firm, driven by pride in ownership. As a Senior Lead Site
Reliability Engineer at JPMorgan Chase within the Infrastructure &
Production Management sector of Consumer & Community Banking, you
are the non-functional requirement owner and champion for the
applications in your remit. You are a key influencer in your team’s
strategic planning, driving continual improvement in customer
experience, resiliency, security, scalability, monitoring,
instrumentation, and automation of the software in your area. You
act in a blameless, data-driven manner and navigate difficult
situations with composure and tact.

Job responsibilities:
- Demonstrates expertise in site reliability
principles and an understanding of the fine balance between
features, efficiency, and stability
- Effectively negotiates with peers and executive partners to ensure
optimal outcomes for all
- Drives the adoption of site reliability practices throughout the
organization
- Ensures your teams follow site reliability best practices and can
demonstrate this empirically through stability and reliability
metrics
- Drives a culture of continual improvement and solicits real-time
feedback to improve the customer’s experience
- Ensures your team collaborates with other teams within your
group’s specialization and avoids duplication of work where possible
- Follows blameless, data-driven post-mortem strategies and conducts
regular team debriefs to enable learning from both successes and
mistakes
- Provides personalized coaching for entry-level to mid-level team
members
- Ensures your team documents and shares their knowledge and
innovations via internal forums, communities of practice, guilds,
and conferences

Required qualifications, capabilities, and skills:
- Formal training or
certification in software engineering concepts and 5 years of
applied experience, plus 2 years leading technologists to manage
and solve complex technical items within your domain.
- Advanced proficiency in SRE culture and principles, with a track
record of implementing SRE practices across application and platform
teams while avoiding common pitfalls.
- Strong observability fundamentals: define and measure SLIs, set
and manage SLOs and error budgets, build actionable alerting and
dashboards; hands-on experience with Dynatrace and Splunk.
- Proven resiliency engineering: capacity planning, failure mode
analysis, fault-tolerant design (circuit breakers, retries,
bulkheads), disaster recovery strategies, and running game days.
- Proficiency in at least one programming language (e.g., Python,
Java Spring Boot, .NET) to build production-grade automation and
tooling; deeper coding skills are a plus but not a hard requirement.
- Proficiency in CI/CD and Infrastructure as Code (e.g., Jenkins,
GitLab, Terraform), including pipeline design, environment
promotion, and secrets/artifact management.
- Experience with containers and orchestration (e.g., Docker,
Kubernetes, ECS), including image hardening, Helm, and operational
runbooks.
- Ability to troubleshoot common networking technologies and issues
(TCP/IP, DNS, HTTP, proxies, load balancers, TLS, routing,
VPCs/subnets, firewalls).
- Demonstrated proficiency operating cloud-scale, distributed
systems within a technical discipline (e.g., cloud platforms), with
experience at firmwide or similarly large scale.
- Ability to influence team culture by championing innovation and
change; experience mentoring and leading technologists (including
hiring, developing, and recognizing talent) as an individual
contributor.
- Automation mindset focused on reducing toil (target ~25% of time),
building self-service capabilities, and codifying operational
procedures into code.

Preferred qualifications, capabilities, and skills:
- Experience in banking/financial services
and familiarity with risk and control expectations in regulated
environments.
- AWS experience; AWS Certified Solutions Architect (Associate or
Professional) preferred.
- Advanced observability ecosystem knowledge beyond Dynatrace/Splunk
(e.g., OpenTelemetry, Prometheus, Grafana, ELK).
- Experience scaling SRE practices across multiple teams/platforms,
including playbooks, SRE onboarding, and maturity assessments.
- Exposure to payments concepts and platforms (e.g., ISO 20022,
SWIFT, real-time payments) with willingness to learn; not required
for the role.
- Experience with chaos engineering tools (e.g., Gremlin, Litmus,
Chaos Mesh) and integrating resilience tests into CI/CD pipelines.
- Proven cloud cost/performance optimization in production
(autoscaling, caching, capacity management, and efficiency tuning).
Keywords: JPMorganChase, Tyler, Senior Lead Site Reliability Engineer, IT / Software / Systems, Plano, Texas