1. Purpose
The purpose of this Disaster Recovery Plan (DRP) is to define Synoptix’s procedures for restoring IT systems, data, and operations after a disruptive event (data-center outage, major hardware failure, ransomware/data corruption, environmental event, etc.). The DRP focuses on technical recovery steps, verification, and returning services to a secure, operational state in line with the Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs) defined by Synoptix.
2. Scope
This DRP covers Synoptix-managed infrastructure and services that support customer-facing applications and critical internal systems, including:
- Synoptix Cloud production systems and databases (when customers host with Synoptix)
- Configuration and application stacks required for service delivery (API endpoints, integration services, middleware)
- Backup repositories and off-site/third-party backup media
- Supporting infrastructure (DNS, VPN, authentication/AD, jump boxes, monitoring)
- Recovery actions that require coordination with third-party providers (cloud/data-center vendors, off-site backup storage vendors, couriers)
It does not cover customer-managed environments unless Synoptix has explicit contractual responsibility for recovery.
Note: Customer contracts may specify different RTOs/RPOs; always check the applicable DPA/SLA.
5. DR Activation & Escalation
5.1 Activation Criteria
Activate the DR plan when one or more of the following is true:
- Production services unavailable and outage expected to exceed SLA thresholds.
- Confirmed data loss or corruption affecting production databases.
- Primary cloud region or data center unreachable for extended periods.
- Confirmed ransomware or severe compromise that impacts availability or data integrity.
- Any event the Incident Commander and Executive Sponsor determine requires DR activation.
5.2 Activation Procedure
- Incident detected (internal log review, customer report, or staff observation).
- IRT Lead performs initial impact assessment and recommends DR activation.
- Executive Sponsor authorizes formal DR activation.
- Incident Commander opens War Room channel, creates DR incident ticket, and notifies key stakeholders per Appendix A contact list.
- Notify customers as required (see Communication section). Note: Synoptix typically notifies affected customers within 48 hours of confirmation per the Security Incident Response Program; DR communications may be faster for availability incidents.
6. Recovery Strategies & Playbooks
For each major scenario below, the runbook steps are: Detect → Activate DR → Contain → Recover → Validate → Communicate → Lessons Learned.
6.1 Cloud Region / Data-Center Outage (Primary Region Failure)
Detect: Outage reported by cloud provider or internal failure metrics; services unreachable.
Immediate (0–1 hour):
- Incident Commander verifies impact and scope, notifies Exec Sponsor and BCC.
- Lock down any administrative changes to avoid compounding the outage.
Contain / Failover (1–4 hours):
- If automatic multi-region failover exists: initiate failover (DNS or provider failover).
- If manual: provision standby infrastructure in secondary region from golden images; restore latest available snapshot to DR region.
- Update DNS TTLs and records as required; coordinate with the DNS provider (see the DNS cutover sketch after this list).
- Retrieve encrypted keys and credentials from KMS (dual control if required) for restored services.
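A minimal sketch of the DNS cutover step, assuming the public records are hosted in AWS Route 53 (the plan does not name a DNS provider); the hosted-zone ID, record name, and DR endpoint below are placeholders to be replaced from Appendix A.

```python
# Minimal sketch: repoint the production hostname at the DR endpoint.
# Assumes AWS Route 53; the zone ID, record name, and DR target are placeholders.
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0EXAMPLE"        # placeholder hosted zone
RECORD_NAME = "api.example.com."    # placeholder production record
DR_ENDPOINT = "dr-lb.example.com"   # placeholder DR load balancer

response = route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Comment": "DR failover: repoint production record to DR region",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": RECORD_NAME,
                "Type": "CNAME",
                "TTL": 60,  # short TTL so failback changes propagate quickly
                "ResourceRecords": [{"Value": DR_ENDPOINT}],
            },
        }],
    },
)
print("Change status:", response["ChangeInfo"]["Status"])
```

Record the change ID and timestamp in the DR incident ticket; the same call pointed back at the primary endpoint reverses the cutover during failback.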
Recovery & Validation (4–24 hours):
- Run application smoke tests and data-integrity checks (checksum comparisons, test transactions); see the validation sketch below.
- Monitor logs and metrics for anomalous behavior.
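A minimal sketch of the smoke-test and checksum comparison, assuming the restored service exposes an HTTP health endpoint and that a known-good digest was recorded in the backup manifest; the URL, file path, and digest are placeholders.

```python
# Minimal sketch: HTTP smoke test plus a file-level checksum comparison.
# The health URL, export file, and expected digest are placeholders.
import hashlib
import requests

HEALTH_URL = "https://dr.example.com/healthz"       # placeholder DR endpoint
EXPORT_FILE = "/restore/validation/export.csv"      # placeholder restored export
EXPECTED_SHA256 = "replace-with-known-good-digest"  # from the backup manifest

# 1. Smoke test: the restored service answers and reports healthy.
resp = requests.get(HEALTH_URL, timeout=10)
assert resp.status_code == 200, f"Health check failed: HTTP {resp.status_code}"

# 2. Data-integrity check: compare a restored export against its recorded digest.
sha256 = hashlib.sha256()
with open(EXPORT_FILE, "rb") as fh:
    for chunk in iter(lambda: fh.read(1 << 20), b""):
        sha256.update(chunk)

if sha256.hexdigest() != EXPECTED_SHA256:
    raise SystemExit("Checksum mismatch: escalate to the Infrastructure/DBA Lead")
print("Smoke test and checksum comparison passed")
```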
Communicate: Send initial customer notification (impact scope and expected cadence) and follow-ups every 4–8 hours.
Failback: Once primary region is available and validated, plan failback: resynchronize data, test, schedule cutover during maintenance window, and update customers.
6.2 Ransomware / Data Corruption
Detect: Unexpected file encryption, a ransom note, abnormal file behavior, or corruption observed in production databases.
Immediate (0–1 hour):
- Isolate infected systems; disable network access to prevent spread.
- Preserve forensic evidence (system images, volatile memory) per IR playbook. Coordinate with InfoSec before restoration.
- Suspend backup rotations temporarily to protect backups from being overwritten.
Contain & Clean (1–24 hours):
- Identify a clean backup snapshot taken before the compromise (see the snapshot-selection sketch after this list).
- Rebuild affected machines from clean images.
- Rotate credentials (especially admin and service accounts) and revoke tokens.
- Engage third-party incident response consultants if severity warrants.
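A minimal sketch of selecting the restore candidate, assuming database snapshots live in AWS RDS (the plan does not name a platform); the instance identifier and compromise timestamp are placeholders taken from the IR timeline.

```python
# Minimal sketch: pick the newest available snapshot taken before the confirmed
# compromise time. Assumes AWS RDS; identifiers and timestamps are placeholders.
from datetime import datetime, timezone

import boto3

rds = boto3.client("rds")

DB_INSTANCE = "prod-db"  # placeholder instance identifier
COMPROMISE_TIME = datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc)  # from IR timeline

snapshots = rds.describe_db_snapshots(DBInstanceIdentifier=DB_INSTANCE)["DBSnapshots"]
clean = [
    s for s in snapshots
    if s["Status"] == "available" and s["SnapshotCreateTime"] < COMPROMISE_TIME
]

if not clean:
    raise SystemExit("No pre-compromise snapshot available: escalate to the Exec Sponsor")

best = max(clean, key=lambda s: s["SnapshotCreateTime"])
print("Restore candidate:", best["DBSnapshotIdentifier"], best["SnapshotCreateTime"])
```

Record the chosen snapshot identifier and timestamp in the DR incident ticket and in the Appendix C backup inventory before starting the restore.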
Recovery & Validation (24–72 hours):
- Restore databases from pre-compromise backups to DR environment; validate integrity.
- Perform extended monitoring for persistence or reinfection.
Communicate: Notify affected customers per IR Program (within 48 hours of confirmation), include guidance (password rotations, log review). Coordinate with Legal.
Post-Recovery: Strengthen the immutable/air-gapped backup strategy and increase backup retention and segregation (one hardening option is sketched below).
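One hardening option, sketched under the assumption that backups land in an S3 bucket created with Object Lock enabled (the plan does not name a backup platform); the bucket name and retention window are placeholders.

```python
# Minimal sketch: apply a default compliance-mode retention so new backup
# objects cannot be overwritten or deleted during the retention window.
# Assumes an S3 bucket created with Object Lock enabled; names are placeholders.
import boto3

s3 = boto3.client("s3")

s3.put_object_lock_configuration(
    Bucket="synoptix-backups-example",  # placeholder backup bucket
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 30}},
    },
)
print("Default 30-day compliance retention applied to backup bucket")
```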
6.3 Database Failure / Corruption (Non-malicious)
Detect: Failed restores, integrity check failures, abnormal transaction logs.
Immediate (0–2 hours):
- Stop services that may write to the affected database to prevent further corruption.
- Identify most recent consistent backup/snapshot.
Recovery (2–8 hours):
- Restore the DB to a staging environment; run consistency checks and sample queries to confirm correctness (see the sketch below).
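A minimal sketch of the staging consistency checks, assuming a PostgreSQL database reachable from the DR environment; the connection details, table names, and queries are placeholders to be replaced with the application's own checks.

```python
# Minimal sketch: basic consistency checks against the restored staging database.
# Assumes PostgreSQL via psycopg2; connection details and checks are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="staging-db.dr.example.com",   # placeholder staging instance
    dbname="appdb",
    user="dr_restore",
    password="retrieved-from-kms",      # never hard-code real credentials
)

CHECKS = [
    # (description, query, predicate the result must satisfy)
    ("orders table populated",
     "SELECT count(*) FROM orders",
     lambda n: n > 0),
    ("no orphaned order lines",
     "SELECT count(*) FROM order_lines ol "
     "LEFT JOIN orders o ON o.id = ol.order_id WHERE o.id IS NULL",
     lambda n: n == 0),
]

with conn, conn.cursor() as cur:
    for label, query, passes in CHECKS:
        cur.execute(query)
        value = cur.fetchone()[0]
        print(f"{'OK' if passes(value) else 'FAIL'}: {label} (value={value})")
```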
Validation (8–24 hours):
- Run application-level tests; compare reconciliation totals with customer data if applicable.
Communicate: Target customer notification within 24–48 hours if their data or service is impacted.
6.4 Application Stack Failure (Code / Config)
Detect: A deployment causes failures or elevated error rates.
Immediate (0–2 hours):
- Roll back to the last known-good deployment (version-controlled artifacts); see the rollback sketch after this list.
- Disable any faulty feature flags.
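A minimal sketch of the rollback step, under the assumption that the application stack runs on Kubernetes and is managed with kubectl (the plan does not name a platform); the namespace and deployment name are placeholders.

```python
# Minimal sketch: roll a Kubernetes Deployment back to its previous revision and
# wait for the rollout to settle. Assumes kubectl access to the cluster; the
# namespace and deployment name are placeholders.
import subprocess

NAMESPACE = "production"      # placeholder namespace
DEPLOYMENT = "synoptix-api"   # placeholder deployment name

subprocess.run(
    ["kubectl", "rollout", "undo", f"deployment/{DEPLOYMENT}", "-n", NAMESPACE],
    check=True,
)
subprocess.run(
    ["kubectl", "rollout", "status", f"deployment/{DEPLOYMENT}", "-n", NAMESPACE,
     "--timeout=300s"],
    check=True,
)
print("Rollback applied; run end-to-end tests before closing the incident")
```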
Recovery (2–8 hours):
- Re-deploy stable artifact to production or DR environment.
- Fix pipeline or configuration issue, test, and re-deploy.
Validation: End-to-end tests and sanity checks.
6.5 Hardware Failure (Host / Storage)
Detect: Host hardware errors, degraded RAID arrays, storage controller faults.
Immediate (0–2 hours):
- Failover to redundant hardware (if HA exists) or provision replacement hosts.
- If a physical disk has failed, request a replacement through the provider or hardware vendor.
Recovery (2–24 hours):
- Rebuild arrays from parity or restore from backups; validate data integrity.
15. Appendices
Appendix A — Contacts & Escalation Template
(Populate with live contacts; store sensitive contacts in KMS)
- Executive Sponsor (CTO/CEO): David Andersen | 801-815-2877 | dandersen@synoptixsoftware.com
- Incident Commander / IRT Lead: Dan Weatbrook | 801-918-1676 | dweatbrook@synoptixsoftware.com
- Business Continuity Coordinator: Robby Hilder | 801-554-1416 | rhilder@synoptixsoftware.com
- Support Lead: Pete Alberico | 801-201-3202 | support@synoptixsoftware.com
- Infrastructure/DBA Lead: Denver Campbell | 801-608-4880 | dcampbell@synoptixsoftware.com | infra@synoptix.com
- Legal: Mike Black | 801-898-0341 | legal@synoptix.com | mblack@mbmlawyers.com
- Primary Cloud Provider Rep: company | rep name | emergency phone | account id
- Backup Vendor: name | phone | chain-of-custody contact
- DNS Provider / Registrar: name | phone | admin login contact
- Third-Party IR Consultant: name | phone | email
Appendix B — Quick DR Activation Checklist (for Incident Commander)
- Confirm detection & impact scope.
- Recommend DR activation; obtain Exec Sponsor authorization.
- Open War Room channel and DR incident ticket.
- Identify critical systems and owners.
- Assign recovery teams (Infra/DBA, DevOps, Support, Communications).
- Verify availability of clean backup snapshot (identify timestamp).
- Retrieve required keys from KMS (dual control if required); see the decryption sketch after this checklist.
- Start restore/failover; log all actions/times.
- Execute validation tests; report results to Exec Sponsor.
- Communicate status to customers per cadence.
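A minimal sketch of the key-retrieval step, assuming credentials are wrapped with AWS KMS (the plan does not name a key-management platform); the ciphertext path is a placeholder, and any dual-control approval happens outside this script.

```python
# Minimal sketch: decrypt a wrapped credential bundle with AWS KMS so restored
# services can be brought up. The ciphertext path is a placeholder; dual-control
# approval is handled outside this script.
import boto3

kms = boto3.client("kms")

with open("/secure/dr/service-credentials.enc", "rb") as fh:  # placeholder path
    ciphertext = fh.read()

plaintext = kms.decrypt(CiphertextBlob=ciphertext)["Plaintext"]

# Hand the decrypted material directly to the restore tooling; do not write it
# to disk or paste it into the incident ticket.
print("Decrypted credential bundle:", len(plaintext), "bytes")
```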
Appendix C — Backup Inventory Template (example columns)
- Media ID | Backup Type (full/incremental) | Creation Timestamp | Retention | Location | Encrypted (Y/N) | Owner | Restoration Notes
Appendix D — Restore Runbook (DBA)
- Identify latest clean backup snapshot (pre-incident snapshot).
- Provision staging instance in DR region.
- Restore DB snapshot to staging.
- Run DB consistency and checksum validation.
- Run application-level smoke tests against restored DB.
- If validated, schedule promotion to production or re-route traffic to DR endpoint.
- Document start/end times, validation results, and issues.
Appendix E — Test Report Template
- Test name | Date | Systems included | Objective | Start time | End time | RTO met? (Y/N) | Issues found | Action items & owners | Sign-off