Oracle Principal Service Reliability Engineer in Pleasanton, California
Principal Service Reliability Engineer
Oracle, the world leader in Enterprise Cloud, is hiring the best and brightest technologists in the industry as we continue to add customer-centric, world-class, leading edge, secure, hyper-scale based solutions throughout all levels of the cloud stack. Oracle’s cloud eco-system is the only complete business cloud platform on the planet, with market leading and business transforming solutions spanning SaaS, DaaS, PaaS and IaaS. Oracle’s Cloud applications, such as Enterprise Resource Management, Customer Relationship Management, Human Capital Management, and Supply Chain Management are used by thousands of customers across the globe and are the broadest, most innovative in the industry, providing businesses with adaptive intelligence, standardized business processes and competitive advantage at low cost.
As part of market leading ERP Cloud, Oracle Expenses Cloud offers a broad suite of modules and capabilities designed to provide a complete, end-to-end solution for digital expense management, giving employees easy data entry options, and financial managers detailed spend information and policy-driven control. Online and mobile, along with spreadsheet entry options, automate travel entry details and approvals, reducing administrative headaches while still capturing essential data for effective cost management.
The Fusion Expenses team is looking for passionate, innovative, high-caliber, team-oriented superstars who want to be a major part of a transformative revolution in the development of modern, cloud-based business applications. We are seeking highly capable, best in the world developers, architects and technical leaders at the very top of the industry in terms of skills, capabilities and proven delivery; who seek out and implement imaginative and strategic, yet practical, solutions; people who calmly take measured and necessary risks while putting customers first.
Key Tasks and Responsibilities
Service Ownership – You will be part of the SRE team, whose mission is the shared full stack ownership of a collection of services, with our Service Development and Operations SRE partners.
Ownership Scope – You will understand the end-to-end configuration, technical dependencies, and overall behavioral characteristics of the production services you own. In partnership with your Service Development and Operations SRE partners, you will have the responsibility to ensure that services are designed and delivered to be mission critical with focus on monitoring, telemetry, security, resiliency, scale, and performance.
Service Design – You will partner with the SRE Architect, Service Development and Operations SRE teams in defining and implementing improvements in service architecture, both current and future.
You will be an expert at articulating the technical characteristics of your services and the dependencies between services, and will guide Service Development teams to engineer and add SRE capabilities to the Oracle SaaS/ERP service portfolio.
You will participate in feature design reviews to ensure Monitoring, Telemetry, Reliability, Automation, and Runtime Debuggability are represented as first-class, design-time priorities.
You will provide technical leadership in defining software engineering patterns, practices, and coding standards focused on increasing reliability and resilience of Oracle SaaS/ERP services. You will deliver code artifacts (reusable components, plug-ins, blueprints, sample code, scripts and tooling, etc.) to streamline adoption by Service Development.
Operations Engineering – You will understand and be able to communicate the scale, capacity, security, performance attributes and requirements of the services you own. You are a Subject Matter Expert, able to understand and communicate every characteristic of your service stack, such as:
Degradation and behavior under load of the services and their dependencies.
End-to-end tuning needs, optimizing resource utilization, as load patterns fluctuate.
Instrumentation and metrics that clearly describe the service behaviors.
Scaling requirements and patterns.
Resiliency and recoverability, ensuring that backup / restore and disaster recovery capabilities are implemented, tested and maintained.
Technical Expert – You are the ultimate escalation point for complex or critical issues that have not yet been documented as SOPs for Level 1 staff. You will usually get called in during major incidents as an SME, when the source of a problem is unclear. You will have the deep understanding of service topology and service dependencies required to troubleshoot issues and define mitigations.
Incident Response – You will be the primary author of technical content for both customer and internal communication used throughout the incident response process, e.g. postmortem/root cause analysis, end-to-end repair item definition, fixes in production.
Automation – You will have a clear understanding of automation and orchestration principles, and will be eager to automate, wherever and whenever the possibility arises, while simultaneously eliminating technical debt. Automation must be part of your DNA.
Prevention – Using data-driven incident findings, you will work on solutions that ultimately prevent the incident or problem from recurring, and on interim solutions that resolve the problem more quickly if it does recur.
Skills and Qualifications
Minimum of 5 years of software development, with demonstrated knowledge of professional software engineering best practices for the full software development life cycle, including coding standards, code reviews, source control, build and release processes, continuous deployment, and test suite development and maintenance.
Experience deploying and running large scale online systems built on Cloud platforms such as Oracle Cloud, AWS, Azure, Google Cloud Platform, and/or OpenStack.
Experience designing and implementing solutions for platform and application layer telemetry, monitoring, scalability, performance and reliability.
Experience coordinating resources across diverse teams to restore service and maintain SLAs; ITIL certification is preferred.
Excellent written and verbal technical communications with technical and non-technical peers, customers, and at times, executive leadership.
Proven success in contributing in a collaborative, team-oriented environment, with the ability to establish and nurture relationships between multiple teams and navigate dependencies.
3 years of experience:
Working in systems and network administration, application security, DevOps and/or Site Reliability Engineering.
Hands-on with web protocols and Linux/Unix tools and architecture, from kernel to shell, file systems, and client-server protocols.
Maintaining, analyzing, and troubleshooting large-scale distributed services.
Building automated tools in Python, Java, GoLang, and/or Ruby.
Experience with monitoring and alerting using technologies like Prometheus, Sensu, Nagios, Kafka, Wavefront, BigPanda, DataDog, and/or PagerDuty.
Experience designing, implementing, and deploying Docker, Kubernetes, and serverless (Lambda) workloads.
Experience with Oracle Linux, Red Hat Linux, Ubuntu, CentOS, CoreOS, and/or Amazon Linux.
Experience with one or more orchestration and deployment tools, e.g. CloudFormation, Terraform, Ansible, Packer, and/or Chef.
Experience with one or more CI tools: Jenkins, TeamCity, Bamboo, Artifactory.
Experience with configuration management systems such as Ansible, Chef, or Puppet.
Experience with Agile software development practices.
Knowledge of testing methodologies, the testing pyramid (e.g., unit, integration, UI, E2E), testing frameworks, and testing automation tools like QTP, OATS, and Selenium.
Self-driven to keep moving things forward even in the face of ambiguity and imperfect knowledge (resilient to hazards of “analysis paralysis”).
BS in Computer Science or related field and 5 years relevant experience.
Detailed Description and Job Requirements
Analyze, design, develop, troubleshoot and debug software programs for commercial or end user applications. Write code, complete programming and perform testing and debugging of applications.
As a member of the software engineering division, you will analyze and integrate external customer specifications. Specify, design and implement modest changes to existing software architecture. Build new products and development tools. Build and execute unit tests and unit test plans. Review integration and regression test plans created by QA. Communicate with QA and porting engineering to discuss major changes to functionality.
Work is non-routine and very complex, involving the application of advanced technical/business skills in area of specialization. Leading contributor individually and as a team member, providing direction and mentoring to others. BS or MS degree or equivalent experience relevant to functional area. 7 years of software engineering or related experience.
Oracle is an Equal Employment Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability, protected veteran status, or any other characteristic protected by law.
Job: Product Development
Location: US-CA,California-Redwood City
Other Locations: US-CA,California-Pleasanton
Job Type: Regular Employee Hire