Service Level Agreements (SLAs)
This Service Level Agreement (SLA) outlines the relationship between the CIRRUS team, which provides the on-premise cloud infrastructure, and its recognized users: UCAR employees, visitors, and external collaborators authorized to use the on-premise cloud resources.
NSF NCAR | CISL operates compute, storage, and network hardware in robust data centers at multiple organizational facilities. The on-premise cloud lets users run approved use cases on those highly available, organizationally supported compute resources. This includes access to routable network space and the UCAR Domain Name System (DNS). These resources supplement computing needs that are not met by the HPC offering, the public cloud, or locally available resources.
Audience: Service Technical Staff, System Administrators, On & Off Site Personnel, and Authorized Affiliates
Recognized Customers: On & Off Site Personnel, and Authorized Affiliates
Important
Availability: The service is designed to operate 24/7. However, support is currently limited to business hours only.
Response Level and Service Definitions
Definitions
Response Times
Important
There is currently no after-hours support. All issues occurring after business hours will be triaged at the start of the next workday.
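As a rough illustration of the triage rule above, the following Python sketch computes when an issue would first be looked at, using the business hours listed under Contact Information (08:00 - 17:00, Monday - Friday). The function name `next_triage_time` is hypothetical, and the sketch assumes timestamps are already in the team's local time zone and ignores holidays; neither assumption comes from this SLA.

```python
from datetime import datetime, time, timedelta

# Business hours as listed under Contact Information: 08:00-17:00, Monday-Friday.
OPEN = time(8, 0)
CLOSE = time(17, 0)


def next_triage_time(reported: datetime) -> datetime:
    """Return when an issue reported at `reported` would first be triaged.

    Assumes naive datetimes already in the team's local time zone and
    ignores holidays (both are simplifications, not stated policy).
    """
    day = reported.date()
    # During business hours on a weekday: triage can start right away.
    if reported.weekday() < 5 and OPEN <= reported.time() < CLOSE:
        return reported
    # After closing time, or on a weekend: move to the following day.
    if reported.weekday() >= 5 or reported.time() >= CLOSE:
        day += timedelta(days=1)
    # Skip Saturday (5) and Sunday (6).
    while day.weekday() >= 5:
        day += timedelta(days=1)
    return datetime.combine(day, OPEN)
```

For example, under these assumptions an issue reported on a Friday evening would be triaged at 08:00 the following Monday, while one reported at 06:00 on a Tuesday would be triaged at 08:00 the same day.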
User Applications Support Policy
CIRRUS provides and maintains the underlying Kubernetes infrastructure, but application owners are responsible for their applications deployed on the platform.
CIRRUS Team Responsibilities:
- Maintain Kubernetes cluster infrastructure and core platform services
- Ensure infrastructure availability and performance
- Provide consultation on best practices for application design and deployment
- Offer guidance on infrastructure features (persistent volume replication, storage options, etc.)
Application Owner Responsibilities:
- Design, develop, deploy, and maintain their applications
- Ensure applications are built to be resilient and can handle infrastructure maintenance windows
- Implement high availability patterns (database replication, persistent volume mirroring, etc.)
- Monitor and troubleshoot application-specific issues
- Follow Infrastructure as Code (IaC) practices for reproducible deployments
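One way an application owner might combine the high availability and IaC items above is to keep a Kubernetes PodDisruptionBudget in source control alongside the application. The Python sketch below builds such a manifest as a plain dictionary and prints it as JSON. The helper name, the `my-app` label, and the `minAvailable` value are placeholders, and this SLA does not state whether CIRRUS maintenance drains honor disruption budgets, so treat this as an illustration rather than platform guidance.

```python
import json


def pod_disruption_budget(name: str, app_label: str, min_available: int) -> dict:
    """Build a policy/v1 PodDisruptionBudget manifest as a plain dict."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            # Keep at least this many matching pods running during
            # voluntary disruptions such as a node drain.
            "minAvailable": min_available,
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


if __name__ == "__main__":
    # "my-app" is a placeholder name and label, not a real CIRRUS workload.
    print(json.dumps(pod_disruption_budget("my-app-pdb", "my-app", 1), indent=2))
```

A budget like this only helps if the application runs more than one replica; with a single replica, `minAvailable: 1` would block voluntary evictions instead of protecting availability.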
Support Prioritization:
CIRRUS team support for user applications is provided on a consultation basis as time allows. Application support requests are prioritized according to the severity levels defined above, but will always be secondary to infrastructure-level issues affecting core platform services.
Infrastructure Maintenance: The CIRRUS team reserves the right to perform maintenance on infrastructure nodes at any time. Applications should be designed to tolerate node restarts, rescheduling, and temporary service disruptions without data loss or extended downtime.
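On the application side, tolerating a temporary disruption often comes down to retrying calls to a dependency that is briefly unavailable while its pod is rescheduled. The following is a minimal Python sketch of retry with exponential backoff and jitter. The function name, the default values, and the choice to retry only on `ConnectionError` are illustrative assumptions, not CIRRUS requirements.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` until it succeeds or `attempts` is exhausted."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            # Out of attempts: surface the last error to the caller.
            if attempt == attempts - 1:
                raise
            # Exponential backoff capped at max_delay, with random jitter
            # so that many clients do not retry at the same instant.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
    raise ValueError("attempts must be at least 1")
```

The `sleep` parameter is injectable so the behavior can be tested without real waiting. Retries only address availability; avoiding data loss still depends on the storage and replication choices described elsewhere in this document.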
Backup & Disaster Recovery Policy
CIRRUS follows Infrastructure as Code (IaC) practices. All applications deployed on the on-premise cloud are defined in code repositories and can be redeployed as needed.
- Application Backups: Applications themselves are not backed up individually; they are redeployed via Argo CD from source-controlled templates.
- Argo CD: Argo projects are backed up after changes, enabling project restoration in case of data loss.
- Container Images (Harbor): Images stored in Harbor are backed up to object storage and can be restored from there.
Persistent Volume Backups
Persistent Volumes (PVs) in CIRRUS can be replicated across sites to improve resiliency.
To request PV replication for your application, please create a ticket.
Change Management
All changes must be submitted via a Jira ticket. For more information on this process, please see the create tickets documentation.
Tickets are reviewed and prioritized by the CIRRUS Product Owner.
- Critical and Urgent tickets will be addressed based on SLA response times.
- Regular requests are reviewed during the team's bi-weekly planning sessions.
Contact Information
Business Hours: 08:00 - 17:00 MST, Monday - Friday
Primary Contact: Nick Cote
Secondary Contact: Submit a Jira Request
Off Hours Contact: Nick Cote and/or Jira Request
Monitoring & Reporting
For observability, the CIRRUS infrastructure leverages:
- Prometheus for metrics collection
- Grafana for visualization and dashboards
- Loki for centralized log aggregation
These tools work together to detect, surface, and alert the CIRRUS team to any operational issues within the platform.