Oracle Principal Site Reliability Engineer in Seattle, Washington
Principal Site Reliability Engineer
Are you comfortable attempting a problem that has never been solved before?
Are you someone who thinks about how you can make things better?
Are you hands-on and driven to excel, and do you thrive on challenging, high-scale problems?
We are a newly formed group within Oracle working on solving some really hard problems in the areas of chatbots, mobile, cognitive services, analytics and AR. We are like a start-up inside a large company, with a big charter and a lot of creative freedom. We have assembled some of the smartest people in the industry and are growing this team.
As a Principal SRE, you'll define how to use the latest technologies to measure and optimize operational efficiency. You will be responsible for the infrastructure and reliability of PaaS services including chatbots, mobile, cognitive services, analytics and AR. You will work with a team pushing the boundaries of a scalable, self-healing, autonomous platform built on Kubernetes, Docker, Prometheus, and Grafana.
We are looking for someone who is passionate about:
. Owning end-to-end availability, reliability, and performance of our PaaS services on Oracle public cloud
. Identifying opportunities for automation, and designing and developing the required tools
. Designing and implementing processes for rolling out software and security updates to deployments with zero downtime
. Building and maintaining our platform and automation frameworks to ensure maximum up-time and predictability while preventing outages and service interruptions or degradation
. Analyzing system failures and developing rapid response processes to ensure such failures do not recur
. Working cross-functionally with product development, product management, program management and cloud infra operations teams
. Partnering with micro-service development teams to provide the infrastructure and services required to enable innovation and ensure the highest level of quality and service
. Predicting and providing notice of potential system vulnerabilities for current and future solutions and implementations; providing specific recommendations and guidance to address such vulnerabilities
. Developing and managing processes and metrics that ensure maximum reliability and up-time for our customers
. Analyzing, building and maintaining all automation tools and processes to ensure the highest standards of reliability and robustness
. Fully understanding our customers' service needs and ensuring we meet those needs
. Participating in 24x7 site reliability rotations and escalation workflows
. 3 years of experience in site reliability and technical operations, with experience building large and geographically dispersed infrastructure supporting business-critical cloud services
. 3 years of hands-on experience in one or more scripting languages (PowerShell, Python, Perl, Bash) and programming languages (Java, C#)
. Strong technical background with an ability to troubleshoot issues impacting large-scale service architectures and application stacks
. Familiarity with large-scale system monitoring and alerting frameworks
. Expertise utilizing cloud infrastructure such as Oracle Cloud, Azure, AWS, OpenStack, GCP
. Experience creating effective resource plans that ensure a high level of performance
. Experience developing repeatable processes and metrics that maximize uptime, reliability, and predictability
. Experience managing complex deployments
. Experience with Agile and DevOps methodologies
. Effective verbal and written communication and interpersonal skills, including interfacing with customers on a professional and cooperative level
. Able to develop and maintain strong relationships with Oracle customers
. BS degree in Computer Science or related degree or equivalent experience
. Knowledge of chatbots, mobile, and cognitive services is a plus
No matter your role on our team, you'll find yourself in an exciting and challenging environment where every person is empowered to show initiative, be outspoken, and be proactive, not reactive. Oracle is dedicated to the continual growth and development of its staff, striving constantly to strengthen our expertise as well as develop new skills. Our team is spread all around the world on four continents. We provide a full range of opportunities and challenges to apply your skills and grow your career in this new and exciting arena.
Detailed Description and Job Requirements
Solve complex problems related to infrastructure cloud services and build automation to prevent problem recurrence. Design, write, and deploy software to improve the availability, scalability, and efficiency of Oracle products and services. Develop designs, architectures, standards, and methods for large-scale distributed systems. Facilitate service capacity planning and demand forecasting, software performance analysis, and system tuning.
Work with the Site Reliability Engineering (SRE) team on the shared full stack ownership of a collection of services and/or technology areas. Understand the end-to-end configuration, technical dependencies, and overall behavioral characteristics of production services. Responsible for the design and delivery of the mission-critical stack, with a focus on security, resiliency, scale, and performance. Authority for end-to-end performance and operability. Partner with development teams in defining and implementing improvements in service architecture. Articulate technical characteristics of services and technology areas and guide development teams to engineer and add premier capabilities to the Oracle Cloud service portfolio. Understand and communicate the scale, capacity, security, performance attributes, and requirements of the service and technology stack. Demonstrate a clear understanding of automation and orchestration principles. Act as the ultimate escalation point for complex or critical issues that have not yet been documented as Standard Operating Procedures (SOPs). Utilize a deep understanding of service topologies and their dependencies to troubleshoot issues and define mitigations. Understand and explain the effect of product architecture decisions on distributed systems. Professional curiosity and a desire to develop a deep understanding of services and technologies.
A BS or MS in Computer Science, or equivalent. Identifies and implements complex solutions, drawing on knowledge of server hardware and software configuration, networking, standard internet services, scripting languages, cloud computing patterns, technology security and compliance. Understanding of load balancing technologies and experience with development in programming languages, databases and big data stores, and container technologies. Work involves defining and documenting the technical architecture of complex and highly scalable products. A minimum of 8 years of experience running large-scale, customer-facing web services.
Oracle is an Equal Employment Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability, protected veteran status, or any other characteristic protected by law.
Job: Product Development
Other Locations: US-CA,California-Redwood City
Job Type: Regular Employee Hire