Senior Site Reliability Engineer
7 days ago
Groupon is a marketplace where customers discover new experiences and services every day and local businesses thrive. To date we have worked with over a million merchant partners worldwide, connecting over 16 million customers with deals across various categories. In a world often dominated by e-commerce giants, we stand out as one of the few platforms uniquely committed to helping local businesses succeed on a performance basis.

Groupon is on a radical journey to transform our business with relentless pursuit of results. Even with thousands of employees spread across multiple continents, we still maintain a culture that inspires innovation, rewards risk-taking and celebrates success. The impact here can be immediate due to our scale and the speed of our transformation.

We're a "best of both worlds" kind of company. We're big enough to have the resources and scale, but small enough that a single person has a surprising amount of autonomy and can make a meaningful impact.

We are looking for a **Production Service Support Engineer (Incident Management)** to join our team to support and optimize the process, implementation, and operational support of internal systems that span business-side and engineering departments.

**Role Overview**:
Are you ready to leverage your expertise and take ownership of ensuring the performance and reliability of mission-critical services? As a Production Service Support Engineer, you will be at the forefront of incident management, creating innovative solutions and improving the reliability of our globally dispersed services. You will be responsible for coordinating and leading responses to critical incidents while also contributing to long-term improvements in system stability. This role offers a unique opportunity to make a significant impact by driving initiatives in a fast-paced, customer-focused environment.

**Key Responsibilities**:
- Leverage Site Reliability Engineering best practices and the ITIL Solutions Architecture framework to devise incident management strategies.
- Act as Incident Commander and change manager, overseeing technical resources responsible for identifying, triaging, documenting, investigating, mitigating, and recovering from site/service-impacting incidents across 300+ globally dispersed services.
- Lead and facilitate Post Mortems and Problem Management practices to ensure lessons learned are applied and improve incident response processes.
- Dedicate project time to work on engaging projects that improve overall production stability.
- Serve as a member of the Incident Management team, providing primary on-call support during weekends (one every 10 weeks), and supporting Monday-Friday shifts.
- Own impacting events until resolution: coordinating with Subject Matter Experts, triaging tasks, creating documentation and action items, and conducting Post Mortems.

**Qualifications**:
- 6+ years administering Linux system environments, with experience in root cause analysis for site-impacting issues.
- 4+ years of experience creating unique Splunk or Kibana search queries for identifying, resolving, and preventing incidents and outages.
- 6+ years of experience developing policies and procedures to improve production stability.
- Strong communication, consulting, and collaboration skills interfacing with senior leadership teams.
- Experience with one or more programming languages (Python, Ruby, Java) is a plus.
- A BS, MS, or PhD in Computer Science or related fields is a plus.
- Experience in designing and creating tools to manage the site and services is a plus.

**What We Offer**:
- The opportunity to influence the direction of incident management practices and reliability across a vast global network.
- A collaborative and growth-focused work culture that values your expertise and contributions.
- The chance to make a meaningful impact on the customer experience by improving system stability.
- Opportunities for professional growth and leadership development in a fast-paced, high-impact role.
-
Site Reliability Engineer
4 days ago
Lima, Perú Careers at SunDevs Full-time **Job description**: As a Site Reliability Engineer at SunDevs, you will collaborate with other senior software engineers and Platform Engineers to design and develop highly available, scalable, secure, and maintainable cloud systems and platforms to solve major challenges. You will provide advice and guidance to our engineers...
-
Senior Site Reliability Engineer
1 day ago
Lima Metropolitan Area, Perú OpenLoop Full-time OpenLoop is looking for a Senior Site Reliability Engineer to join our team in Lima, Peru. About the Role: Cross-Functional Collaboration: Partner with engineering teams to improve system reliability and deployment practices. Engage with teams on SRE guidelines and best practices for automation and infrastructure. Work with security teams to implement secure,...
-
Principal Site Reliability Engineer
7 days ago
Lima, Perú Groupon Full-time Groupon is a marketplace where customers discover new experiences and services every day and local businesses thrive. To date we have worked with over a million merchant partners worldwide, connecting over 16 million customers with deals across various categories. In a world often dominated by e-commerce giants, we stand out as one of the few platforms...
-
Senior Site Reliability Engineer, Americas
2 weeks ago
Lima, Perú Canonical - Jobs Full-time **Site Reliability Engineer**: To become a member of this team, you need to be a software engineer fluent in Python, you need a genuine interest in the full open source infrastructure stack from metal to containers, and you need the ability to work in a high-pressure operations environment with mission-critical services for global brand-name customers. As a...
-
Systems Reliability Engineer
2 days ago
Lima, Perú Scotiabank Full-time Hello! We congratulate you and value your interest in continuing to grow within the Scotiabank Group. We are looking for talent who will bring their knowledge and experience to the position and, above all, OPTIMISM. **Purpose**: As a member of the Global Systems Reliability team, the Global System Reliability Engineer (SRE) will work in collaboration...
-
Site Reliability Engineer
4 days ago
Lima Metropolitan Area, Perú Nearsure Full-time Explore the Nearsure experience. Join our close-knit LATAM remote team: Connect through fun activities like coffee breaks, tech talks, and games with your teammates and management. Say goodbye to micromanagement: We champion autonomy, open communication, and respect for diversity as our core values. Your well-being matters: Our People Care team is here from day...
-
Site Reliability Engineer
4 days ago
Lima, Perú Willis Towers Watson Full-time **The Role** We are a group of passionate engineers who have built the largest private Medicare marketplace in the United States. We focus on the continuous improvement of our systems and culture. We improve and maintain a platform that provides the best possible experience to shop for insurance plans, and allows our insurance carriers to be confident...
-
Reliability Mechanical Engineer
2 weeks ago
Lima, Perú Hunt Consolidated, Inc. Full-time **ROLES AND RESPONSIBILITIES**: - Monitoring and calculation of reliability KPIs (RAM, MTBF, etc.). - Analyze predictive alerts from machine learning software (for Rotating and Mechanical assets). - Identify threats and opportunities for Plant production and manage them in the MTO (mitigate Threats and Opportunities) process. - Analyze data and perform reliability...
-
Senior Systems Reliability Engineer
1 week ago
Lima, Perú Scotiabank Full-time Requisition ID: 227737 Thank you for your interest in being part of Scotiabank Perú; we appreciate your application. We are looking for talented people who want to grow and achieve our organization's objectives. We wish you much success in this process! **Senior Systems Reliability Engineer** - Business Line: Operations &...
-
Senior Site Reliability Engineer
1 week ago
Lima, Perú Fusemachines Full-time **About Fusemachines** Fusemachines is a leading AI strategy, talent, and education services provider. Founded by Sameer Maskey Ph.D., Adjunct Associate Professor at Columbia University, Fusemachines has a core mission of democratizing AI. With a presence in 4 countries (Nepal, the United States, Canada, and the Dominican Republic) and more than 250...