Chief / Senior Manager - Site Reliability Engineering - Application (Production Support), Navi Mumbai

1 Nos.
135813
Full Time
12.0 Year(s) To 14.0 Year(s)
30.00 LPA TO 32.00 LPA
IT Software- Application Programming / Maintenance
Banking/Financial Services
BCA/BCS - Computers
Job Description:
  • Be responsible for production support and release management for the assigned applications - SRE C1 - Elastic Stack (ELK), Application Performance Management (APM), and Disaster Recovery (DR).
  • Should possess excellent troubleshooting and analytical skills.
  • This senior leadership role requires strong technical expertise, strategic thinking, and proven experience in managing mission-critical systems at scale.

 

Elastic Stack (ELK) Cluster Lead
  • Architect, deploy, and optimize ELK clusters for enterprise observability.
  • Ensure log ingestion, parsing, and visualization meet compliance and performance standards.
  • Drive automation for scaling, resilience, and performance tuning.

Application Performance Management (APM) Cluster Lead
  • Define and implement APM strategy across critical applications.
  • Lead deployment and integration of APM tools (Dynatrace, AppDynamics, New Relic, Datadog, etc.).
  • Establish KPIs, SLAs, and proactive monitoring frameworks to ensure application reliability.
  • Design synthetic monitoring for critical business journeys and key metrics.

Disaster Recovery (DR) Oversight
  • Own DR strategy, planning, and execution for enterprise applications.
  • Conduct regular DR drills, audits, and compliance checks.
  • Align DR processes with business continuity and regulatory requirements.
  • Ensure robust replication between primary and secondary sites.
  • Oversee daily backups.
  • Ensure all disaster recovery processes and documentation meet the obligations mandated by regulators.
  • Provide comprehensive audit reports for DR/DC environments.
  • Lead the command structure when a disruptive event occurs and direct the recovery teams, such as network, database, and application.
  • Coordinate the dissemination of critical information to senior management and external stakeholders.
  • Conduct thorough evaluation of incidents to determine failures and issues.

SRE Practices
  • Champion SRE principles: reliability, scalability, automation, and continuous improvement.
  • Monitor error budgets, SLIs, SLOs, and SLAs for critical systems.
  • Drive incident management, root cause analysis, and long-term remediation.

Company Profile

A leading Non-Banking Financial Company (NBFC) that caters to the growing needs of an aspirational India, serving both individual and business clients. Incorporated

Apply Now

  • Interested candidates are requested to apply for this job.
  • Recruiters will evaluate your candidature and will get in touch with you.
