BCDR Policy
- Introduction and Purpose
- Scope
- Alignment with Best Practices and Standards
- Roles and Responsibilities
- Recovery Time Objective (RTO) and Recovery Point Objective (RPO)
- Backup and Restoration Strategy
- Incident Response and Recovery Procedures
- General Incident Response Framework
- Scenario-Specific Procedures
- External Dependencies and Third-Party Integrations
- Communication and Escalation Plan
- Backup Data Security and Testing
- Maintenance, Audit, and Continuous Improvement
- Conclusion
Introduction and Purpose
This Business Continuity and Disaster Recovery (BCDR) policy outlines how Humanics Global Advisors (HGA) will prepare for, respond to, and recover from disruptive events affecting the Humanics Global Advisors Digital Platform. The purpose is to ensure that critical platform operations can continue or be restored quickly during and after incidents, thereby minimizing downtime and data loss, protecting our clients and reputation, and maintaining compliance with industry standards[1]. This policy is aligned with international best practices, including ISO 22301 (Business Continuity Management) and NIST SP 800-34 (IT contingency planning guidelines)[2][3]. By adhering to these standards, HGA’s continuity program will follow a structured framework for risk assessment, response planning, and continual improvement.
Scope
This policy covers all BCDR activities for the Humanics Global Advisors Digital Platform, including its cloud infrastructure on DigitalOcean, associated databases, application servers, and all integrated third-party services (such as AWS S3 for storage and Stripe API for payments). It addresses a range of potential disruptions – from natural disasters to cyber-attacks – that could impact platform availability or data integrity. All HGA team members involved in platform development, maintenance, and support (especially the System Manager and Business Developer) are within scope of this policy and must be familiar with their roles herein.
Objectives: The primary objectives of this BCDR policy are to:
- Protect Life and Safety: Ensure the well-being of personnel during any crisis (this may include following broader HGA emergency procedures for physical safety, though the focus here is on digital operations).
- Minimize Downtime & Data Loss: Restore critical digital platform functions within acceptable timeframes to reduce financial loss and service disruption[1].
- Recover Critical Functions: Guarantee that essential platform features (e.g. user access, core database, payment processing) are prioritized for rapid recovery after an incident.
- Compliance and Accountability: Meet or exceed relevant regulatory, contractual, and standards-based requirements for business continuity (e.g. ISO 22301’s BCMS requirements and NIST SP 800-34 guidelines)[2].
- Continuous Operations: Maintain a basic level of service or provide workarounds so that HGA can continue business operations, even in a degraded mode, until full restoration.
Alignment with Best Practices and Standards
HGA’s BCDR approach is informed by ISO 22301 and NIST SP 800-34 recommendations. ISO 22301 provides a framework for establishing a robust Business Continuity Management System (BCMS), including performing Business Impact Analysis (BIA), setting continuity objectives, and periodically auditing the plan. NIST SP 800-34 offers guidance on developing IT contingency plans and detailed recovery procedures. By aligning with these, HGA ensures its policy reflects internationally recognized best practices. Key principles adopted include conducting risk assessments and BIA to identify critical functions, defining clear recovery strategies, documenting step-by-step response procedures, and training and testing regularly[4][5]. Compliance with these standards enhances our resilience and demonstrates due diligence in continuity planning[2].
Roles and Responsibilities
Effective continuity and recovery depend on clearly defined roles. The following teams/individuals have specific responsibilities under this policy:
- System Manager (BCDR Lead): The System Manager is the primary owner of the BCDR plan. Responsibilities: maintaining the Digital Platform’s infrastructure, performing daily backups, monitoring system health and security alerts, and executing technical recovery procedures during an incident. The System Manager (or designee) will activate the disaster recovery procedures when triggers are met, coordinate IT staff in restoring services, and communicate technical status updates to the Business Developer and leadership. This role also ensures that all continuity mechanisms (backups, failovers, scripts) are in place and routinely tested.
- Business Developer (Continuity Coordinator): The Business Developer coordinates continuity from the business side. Responsibilities: assessing the business impact of outages, deciding on workarounds for interrupted services (e.g. manual processes for critical tasks), and leading communications to stakeholders (clients, partners, and internal leadership) about service disruptions and recovery timelines. The Business Developer works closely with the System Manager to understand technical issues and to formulate messages (e.g., posting notices on the platform or sending client advisories). They also assist in prioritizing which business functions are restored first based on client and operational needs.
- Executive Management (CEO/CTO or Directors): Provide oversight and decision-making for major disruptions. They must be informed of any major incident promptly. They authorize activation of emergency measures (e.g. public incident disclosure, significant expenditures for recovery, or engagement of external disaster recovery services). Executives also ensure adequate resources are allocated for BCDR (personnel, budget) and give final approval to the BCDR policy and updates. In a severe crisis, Executive Management may lead high-level communications (press or client briefings) and liaison with regulators or authorities if needed.
- Development Team & IT Support: Assist the System Manager in technical recovery actions. This includes restoring application code from repositories, redeploying applications, applying necessary patches or configurations, and verifying system integrity post-recovery. They follow the System Manager’s directions in an emergency and help troubleshoot issues during restoration. Additionally, development/IT staff contribute to testing the BCDR plan (such as participating in disaster drills or backup restoration tests).
- All HGA Staff: All employees should be aware of basic emergency communication protocols. In a platform outage or incident, non-IT staff should know how to report issues through the proper channels and where to get status updates (e.g., an emergency portal or designated point of contact). Staff may be called upon to perform manual workaround procedures (for example, using offline tools or alternate processes) if automated systems are down. Employees must also comply with any interim controls or security measures the BCDR team implements during a disruption.
The chain of command for incident response is: System Manager → Business Developer → Executive Management. The System Manager has the authority to declare an IT incident and initiate response; the Business Developer ensures business impacts are managed; Executive Management is engaged for major decisions or if an incident exceeds predefined severity or duration thresholds.
Recovery Time Objective (RTO) and Recovery Point Objective (RPO)
Recovery Time Objective (RTO) and Recovery Point Objective (RPO) are key parameters that define our recovery goals for the digital platform. In alignment with ISO 22301 terminology, RTO is the maximum acceptable downtime – the target time within which services must be restored after a disruption. RPO is the maximum acceptable amount of data loss measured in time – essentially, how far back in time we might have to recover data, defining the tolerance for data loss[6][7]. In simple terms, RTO is about time to recover, while RPO is about data currency (the age of data that could be lost).
For the HGA Digital Platform, the following RTO and RPO are recommended for critical functions:
- RTO: 4 hours for the core platform services. This means that after a major disruption (e.g. server failure or cyberattack), the goal is to have the platform’s critical functionalities back online within four hours. This limit is set to ensure HGA can continue providing key services to users in a timely manner, minimizing operational and reputational impact. (Less critical supporting systems may have a longer RTO, but all essential user-facing components share the 4-hour target).
- RPO: 24 hours for all production data. This corresponds to a daily backup frequency, meaning the platform should not lose more than one day’s worth of data in the worst-case scenario of a major data loss incident. By aiming for an RPO of 24 hours, we acknowledge that nightly backups are in place; if a restoration is needed, we would restore from the last daily backup and at most a day’s data would need re-entry or reconciliation.
These targets were determined based on business impact analysis and client service expectations. An outage of more than 4 hours is considered significantly disruptive to HGA’s operations and clients, hence the aggressive RTO. A shorter RPO (e.g., near zero data loss) is desirable and we will continuously improve toward that, but given current infrastructure (daily full backups), 24 hours is the practicable commitment. Note: Meeting these objectives requires that backup and recovery procedures are effective; if monitoring or tests indicate we cannot achieve RTO/RPO targets, the policy will be revisited.
(Figure: Illustration of RTO vs RPO – RTO looks forward, defining the maximum allowable downtime, while RPO looks backward, defining the maximum data-loss window. Shorter RTOs demand more resources; RPO dictates backup frequency.)[8][9][10]
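For illustration, these targets can be expressed as concrete, testable thresholds. The following is a minimal sketch (not a description of current tooling) of how a check script might compare the age of the newest backup against the 24-hour RPO, and a measured restore time from a DR test against the 4-hour RTO; the input values are assumed to come from backup logs and test records.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

RTO = timedelta(hours=4)    # maximum acceptable downtime for core platform services
RPO = timedelta(hours=24)   # maximum acceptable data-loss window (daily backups)

def rpo_met(last_backup_time: datetime, now: Optional[datetime] = None) -> bool:
    """True if the newest backup is recent enough to satisfy the RPO."""
    now = now or datetime.now(timezone.utc)
    return (now - last_backup_time) <= RPO

def rto_met(measured_restore_time: timedelta) -> bool:
    """True if the most recent DR test restored service within the RTO."""
    return measured_restore_time <= RTO

if __name__ == "__main__":
    # Example figures; in practice these come from backup logs and DR test reports.
    last_backup = datetime.now(timezone.utc) - timedelta(hours=20)
    drill_duration = timedelta(hours=3, minutes=10)
    print("RPO met:", rpo_met(last_backup))
    print("RTO met:", rto_met(drill_duration))
```

If either check fails, this policy requires the targets, or the underlying backup and recovery arrangements, to be revisited.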
Backup and Restoration Strategy
Regular data backups and a solid restoration process are the backbone of our disaster recovery capability. HGA will perform daily backups of the Digital Platform’s critical data and systems, leveraging our existing DigitalOcean cloud infrastructure and secure offsite storage:
- Scope of Backups: All essential components of the platform are included in the daily backup routine. This includes application code (if not already in source control), system configurations, and most importantly the production database and any user-uploaded files or data. For components hosted on DigitalOcean (e.g., Droplet virtual machines, managed databases), we use DigitalOcean’s snapshot/backup features to capture entire system images. DigitalOcean backups create disk images of Droplets, and can be scheduled at daily intervals to enable reverting to a previous state if needed[11]. Database backups (SQL dumps or volume snapshots) are taken daily as well.
- Backup Storage: Backups are stored offsite and securely to ensure they remain available even if the primary hosting environment is compromised. DigitalOcean’s automated backups are retained in their offsite storage by default, and HGA additionally copies backup data to AWS S3 (or another secure cloud storage service) for redundancy. Offsite storage in AWS S3 is encrypted and access-controlled. This dual-backup approach (DigitalOcean backups plus replication to AWS S3) protects against worst-case scenarios such as a region-wide cloud outage or a security breach in one environment. Storing backup data on a separate platform (AWS) ensures we can still retrieve data if either provider (DigitalOcean or AWS) faces an issue affecting stored backups.
- Frequency and Retention: Backups of the core database and server snapshots are taken every 24 hours (typically during low-usage hours overnight) to meet the 24h RPO. The retention policy is to keep at least 7 daily backups and 4 weekly backups at minimum. This provides the ability to recover from an incident that is noticed with delay (for example, if data corruption went undetected for a few days, we could restore from an older point). Weekly backups may be archived for several months for compliance or investigative needs. Backup schedules and retention settings are documented and configured in an automated fashion where possible (leveraging DigitalOcean’s backup schedule and lifecycle management on S3).
- Integrity and Encryption: All backup files are encrypted at rest and in transit. Where possible, full disk snapshots are encrypted by the cloud provider; database dump files and any sensitive data backups are encrypted using strong encryption (e.g., AES-256) before storage. Backup integrity is verified by periodic test restores (described in Testing below) and by automated checksum comparisons if supported. Access to backup repositories is restricted to authorized personnel (System Manager and limited IT staff) with multi-factor authentication to prevent unauthorized access or tampering.
- Restoration Procedures: A detailed restoration procedure is documented, covering multiple scenarios:
- Full server restore: If a Droplet (VM) fails or is corrupted, the System Manager can restore from the latest DigitalOcean snapshot to create a new VM instance[11]. After launching the backup image, any incremental data (e.g., transactions since the last backup) will be recovered from logs or the separate database backups if available. The system will be configured with the last known good configuration (application environment variables, SSL certificates, etc.) which are stored securely in our config management.
- Database restore: If the database is compromised or data is corrupted, the latest daily DB backup (or a point-in-time snapshot) will be loaded onto a fresh database instance. The process includes stopping the application (to prevent new writes), restoring the backup, verifying data integrity, and then repointing the application to the restored database. Any transactions that occurred after the backup and were lost will be identified via application logs or other means and communicated to the Business Developer for potential manual re-entry or reconciliation with users.
- File storage restore: If user-uploaded files or documents stored in AWS S3 are accidentally deleted or corrupted, S3 versioning (if enabled) can retrieve previous versions. If S3 itself suffers an outage, then once service is restored, any files uploaded in the interim via offline means will be reconciled. In parallel, if the platform uses local storage that is backed up, those files can be restored from the offsite backup archive to a new storage location, and the application reconfigured to use that location until S3 service returns.
- Configuration restore: Important configuration (e.g., environment variables, API keys, TLS certificates) are backed up in secure configuration files or password managers. After a system rebuild, the System Manager will reapply these configurations. We also maintain infrastructure-as-code scripts (for example, provisioning scripts or Docker containers) that can rapidly re-deploy the application environment on a new server, which speeds up recovery and ensures consistency.
- Verification: After any restoration, the System Manager and development team will verify the system’s functionality and data consistency. This includes running basic sanity tests on the platform (login, key transactions) and checking data counts or checksums against pre-incident metrics, to confirm that recovery is complete. Only after verification will the system be declared fully operational and handed back to normal operations.
By having daily offsite backups and a practiced restoration process, HGA ensures data integrity and availability can be restored when needed[12]. This daily backup regimen and rapid restore capability underpin our ability to achieve the stated RTO and RPO.
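For illustration, the offsite copy step of the daily backup routine might look like the following minimal sketch. It assumes a database dump file has already been produced, and that a dedicated S3 bucket exists; the bucket name, key prefix, and file path shown are placeholders rather than the live configuration.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path

import boto3  # AWS SDK for Python

BUCKET = "hga-platform-backups"   # placeholder bucket name
PREFIX = "db/daily/"              # placeholder key prefix for daily database dumps

def sha256sum(path: Path) -> str:
    """Compute a SHA-256 checksum so the copy can be verified after upload."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

def upload_daily_backup(dump_file: Path) -> str:
    """Copy a local database dump to S3 with server-side encryption (AES-256)."""
    s3 = boto3.client("s3")
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    key = f"{PREFIX}{stamp}/{dump_file.name}"
    s3.upload_file(
        str(dump_file),
        BUCKET,
        key,
        ExtraArgs={
            "ServerSideEncryption": "AES256",              # encrypt at rest in S3
            "Metadata": {"sha256": sha256sum(dump_file)},  # checksum for later verification
        },
    )
    return key

if __name__ == "__main__":
    print("Uploaded:", upload_daily_backup(Path("/var/backups/hga_platform.sql.gz")))
```

Retention (7 daily and 4 weekly copies) would be enforced separately, for example through S3 lifecycle rules, and the stored checksum supports the integrity verification and test restores described above.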
Incident Response and Recovery Procedures
This section describes procedures for detecting, responding to, and recovering from disruptions, categorized by the type of incident. We address natural disasters, cyber-attacks, hardware/software failures, and data corruption scenarios. In all cases, our approach follows the general phases of Preparation → Detection → Containment → Recovery → Continuity → Post-Incident Review[13][14]:
General Incident Response Framework
- Detection & Alerting: Early detection is critical. The System Manager is responsible for monitoring alerts from various sources – e.g., server monitoring (CPU, memory, ping/uptime checks), security systems (intrusion detection, firewall logs), and third-party status feeds (AWS, Stripe status pages). If a threshold is breached (such as the site being down or unusual traffic indicating an attack), alerts will automatically page the System Manager (and backup on-call staff) via email/SMS. Users or employees may also report issues via a designated support channel. Once an incident is detected or reported, the System Manager initiates the incident response, logging the start time and initial details. (A minimal uptime-check sketch appears at the end of this framework.)
- Assessment & Classification: The System Manager quickly assesses the nature and severity of the incident. This includes identifying affected systems, potential cause (if immediately known), and initial impact (e.g., “database server not reachable,” or “ransomware encryption detected on file store”). Incidents are classified into severity levels (e.g., Critical – total platform outage or major data loss; High – key functionality impaired; Moderate/Low – minor issues with workarounds). The severity dictates the escalation and communication steps (see Communication section for details). The System Manager also consults runbooks or checklists corresponding to the incident type.
- Containment: For incidents like cyber-attacks or malware, containment is the first priority to prevent further damage. The System Manager will isolate affected components (e.g., taking a server offline if it’s compromised, or revoking access keys if an API is being misused). In case of malware/ransomware, the infected machine will be disconnected from the network, and backups will be safeguarded (to ensure they are not infected). For network attacks (e.g., DDoS), containment may involve engaging Cloudflare or similar DDoS protection or temporarily throttling traffic. Containment actions are taken as per our Incident Response Plan playbooks, aiming to limit the scope and impact of the disruption.
- Investigation: Concurrently or after immediate containment, the System Manager (and any security/IT support) investigates the root cause. For hardware or software failures, this might mean checking logs, error messages, or cloud provider status (e.g., to see if DigitalOcean has an outage). For security incidents, it involves reviewing security logs, identifying unauthorized access, etc. While full root cause analysis might happen post-recovery, an initial understanding guides the recovery strategy (e.g., if a database crashed due to corruption vs. an attacker deleted data – the response differs). If needed, third-party experts or vendor support (DigitalOcean support, security consultants) will be engaged at this stage to assist.
- Recovery Actions: Once containment is achieved (or for some failures, containment is simply removing the faulty component), the team proceeds to restore services. Recovery steps for specific scenarios are detailed below, but generally the System Manager will decide whether to fail over to alternate resources or restore from backups:
- If the primary server is down (hardware issue), spin up a new instance (possibly in an alternate data center if a region-wide issue) and restore backups to it.
- If data is corrupted or deleted, initiate database/file restore from the most recent backup.
- If an application deployment caused the outage (software bug), roll back to a previous stable version (the code repository and CI/CD pipeline should support quick rollback).
- If a third-party service is down (e.g., Stripe outage), implement temporary workarounds or enable an alternate provider if available (for instance, switch to a backup payment gateway if one is configured, or queue transactions for later processing).
- Document every action taken and any changes made during recovery (for audit and for potential back-out if something fails).
- Continuity of Operations: During the incident, the Business Developer ensures that HGA’s critical business operations continue to the extent possible. For example, if the platform is completely down for users, HGA staff might manually serve client needs: e.g., using spreadsheets to record transactions or taking orders/requests via phone/email. If only a subset of functionality is down (say, payment processing), the Business Developer may arrange alternate methods (such as instructing customers to use a direct bank transfer temporarily, or simply recording the intent to pay and charging later when Stripe is back). The goal is to provide an acceptable level of service manually or via fallback procedures until the system is restored. These manual workaround procedures are documented for key processes so staff can execute them during prolonged outages. The Business Developer coordinates these workarounds and ensures all staff are aware of them when needed.
- Resolution & Restoration: Once technical recovery steps are completed (e.g., systems are rebuilt, data restored), the System Manager performs tests to confirm that all services are operational and data integrity is intact. For instance, after a database restore, query some recent records to ensure they are correct; after a server rebuild, attempt a full user transaction on the platform. Any discrepancies or missing data are addressed (if feasible) – e.g., re-entering data from paper records if minor, or at least logging what was lost. When fully satisfied, the System Manager declares the incident “resolved” from an IT perspective.
- Post-Incident Activities: After restoration, deactivation steps include cleaning up any temporary fixes (e.g., turning off maintenance pages, re-enabling normal user access), and ensuring all backup systems or logs used are properly secured again[15]. The System Manager notifies stakeholders that systems are back to normal. A “lessons learned” review meeting is scheduled within a few days with the involved team (System Manager, Business Developer, executives, and any others) to analyze the incident. They will review the cause, what went well or poorly, and update the BCDR plan accordingly[16][17]. Additionally, any new security patches or measures identified during the incident (for example, if it was a cyber-attack, steps to prevent recurrence) will be implemented as part of the follow-up.
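The detection step referenced above depends on automated probes. The following is a minimal uptime-check sketch, assuming an SMTP relay is available for paging; the health-check URL, mail host, and contact addresses are placeholders.

```python
import smtplib
from email.message import EmailMessage

import requests  # third-party HTTP client

PLATFORM_URL = "https://platform.example.org/healthz"   # placeholder health endpoint
ONCALL = ["system.manager@example.org"]                 # placeholder on-call contacts
MAIL_HOST = "smtp.example.org"                          # placeholder SMTP relay

def platform_is_up(timeout: int = 10) -> bool:
    """True if the health endpoint answers with HTTP 200 within the timeout."""
    try:
        return requests.get(PLATFORM_URL, timeout=timeout).status_code == 200
    except requests.RequestException:
        return False

def page_oncall(subject: str, body: str) -> None:
    """Send an alert email to the on-call contacts."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = "alerts@example.org"
    msg["To"] = ", ".join(ONCALL)
    msg.set_content(body)
    with smtplib.SMTP(MAIL_HOST) as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    if not platform_is_up():
        page_oncall(
            "[ALERT] HGA Digital Platform unreachable",
            "Automated uptime check failed. Begin incident response per the BCDR policy.",
        )
```

In practice such a probe would run on a schedule from a host outside the primary environment, so that it remains available when the platform itself is down.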
Scenario-Specific Procedures
- Natural Disasters (e.g., Data Center Outage, Regional Disaster):
If a natural disaster (flood, earthquake, fire) impacts the data center hosting our DigitalOcean servers (or a widespread cloud outage occurs in that region), HGA will execute its Disaster Recovery Plan (DRP) to relocate or restore operations to an alternate location[18]. DigitalOcean has multiple regions; we maintain the capability to redeploy the platform to an alternate region if the primary region is down. Steps:
– System Manager will communicate with DigitalOcean to confirm the scope of the outage and expected duration.
– If the outage is likely to exceed our RTO (4 hours), the System Manager initiates a restoration from backups in a different region or cloud. Using infrastructure-as-code and saved images, new Droplet instances will be launched in a healthy region, and the latest backups will be restored onto them.
– DNS failover: Update DNS records (or leverage a failover DNS service) to point users to the new instance once it’s ready. Our DNS Time-to-Live (TTL) is kept reasonably low (e.g., 5 minutes) to allow quick propagation of changes.
– Verify the platform in the new region is functioning correctly, then announce to stakeholders that services have been restored (with possibly slightly reduced performance if the region is farther for some users – but continuity is achieved).
– When the original region/data center is back, evaluate whether to revert to it or continue in the new region. Future architecture may consider an active-active or active-passive multi-region deployment to make this failover seamless, as per industry best practices for cloud resilience[19][20]. For now, it is a manual but prepared failover.
– Natural disasters could also affect HGA’s physical offices or personnel. While that is beyond the scope of this digital platform policy, it is noted that if key staff are unavailable, secondary personnel are cross-trained to take over crucial BCDR tasks. Also, if a disaster impacts local internet access, key staff are able to work remotely from safe locations to execute the recovery.
- Cyber-Attacks (e.g., hacking, ransomware, DDoS):
In case of a cybersecurity incident targeting the platform:
– Data Breach/Hack: If unauthorized access or data breach is detected (through logs or an anomaly), the System Manager will immediately contain it by, for example, disabling affected user accounts or admin credentials, revoking stolen API keys, or temporarily taking the system offline to prevent further data exfiltration. A forensic investigation will commence (preserve logs, possibly engage a security consultant). Meanwhile, focus on restoring secure operations: patch exploited vulnerabilities (e.g., apply security fixes if an unpatched flaw was used), change all passwords/keys that were compromised, and restore any defaced or compromised components from clean backups. Before bringing the system fully back online, ensure the threat is eradicated (malware removed, backdoors closed). Communicate with users as needed (if personal data was breached, follow breach notification laws/policies in coordination with HGA legal/exec team). Post-incident, enhance security measures (e.g., stricter firewall rules, 2FA enforcement, additional monitoring) based on lessons learned.
– Ransomware/Malware: If ransomware encrypts data or malware corrupts systems, isolation is critical: disconnect infected systems. Do not pay ransom – instead rely on backups. Wipe the affected server or storage, rebuild from scratch using clean OS images, then restore data from the latest backup that is malware-free. We ensure offline or offsite backups are available so they are not hit by the ransomware (backups stored in AWS S3 are logically separate and versioned). Once restored, verify no malware persists (run antivirus/malware scans). Additionally, analyze entry point of ransomware (phishing? RDP? etc.) and plug that gap immediately.
– DDoS (Distributed Denial of Service) attack: If the platform is targeted by a DDoS attack causing service slowdown or outage, engage our DDoS protection measures. Where configured, HGA uses Cloudflare (or a similar content delivery network) in front of the platform to absorb and filter unwanted traffic at the network edge. If such protection is not already in place, the System Manager will work with DigitalOcean to apply rate limiting or temporarily scale up resources to handle the load. Communication with the ISP or the cloud provider’s abuse team may help trace and mitigate the attack. The priority is to restore normal access for legitimate users – possibly by implementing IP blocking rules or a challenge (captcha) for suspect traffic. After the immediate attack subsides, consider long-term solutions (a traffic-scrubbing service, improved network capacity, etc.).
– Critical Vulnerability (zero-day) exploitation: If a global security flaw is announced (e.g., a critical zero-day affecting our software stack), the System Manager treats this as an imminent threat incident. Even if not yet attacked, apply emergency patches or workarounds as recommended by vendors or security community. In some cases, taking the platform offline briefly for patching is preferable to staying online and vulnerable. This proactive incident response can prevent a disaster.
Throughout any cyber-attack scenario, close communication with the Business Developer is maintained so that external messaging to clients is handled sensitively (for instance, informing users of an outage without causing undue alarm, and later providing incident reports as needed). Legal counsel might be involved if the incident has regulatory implications (like data breach reporting).
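As one illustration of the log-based detection mentioned in the breach scenario above, the sketch below counts repeated failed logins per source IP. The log path and line format are assumptions made for the example, not a description of the platform's actual logging.

```python
import re
from collections import Counter
from pathlib import Path

AUTH_LOG = Path("/var/log/hga/auth.log")   # placeholder log location
FAILED_LOGIN = re.compile(r"FAILED LOGIN .* from (?P<ip>\d+\.\d+\.\d+\.\d+)")  # assumed format
THRESHOLD = 20                             # failures per IP before raising an alert

def suspicious_sources(log_path: Path = AUTH_LOG) -> dict:
    """Return source IPs whose failed-login count meets or exceeds the threshold."""
    counts = Counter()
    with log_path.open() as handle:
        for line in handle:
            match = FAILED_LOGIN.search(line)
            if match:
                counts[match.group("ip")] += 1
    return {ip: n for ip, n in counts.items() if n >= THRESHOLD}

if __name__ == "__main__":
    offenders = suspicious_sources()
    if offenders:
        # In a real deployment this would feed the alerting channel described above
        # and could justify temporary IP blocking as a containment step.
        print("Possible brute-force activity:", offenders)
```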
- Hardware or Software Failures:
Not all outages are due to disasters or attacks – a simple hardware failure or software bug can also cause downtime. Our approach:
– Server Hardware Failure: (In cloud context, this could be the underlying host failure for a Droplet). DigitalOcean’s infrastructure generally handles host failures by migrating Droplets, but if our VM becomes unavailable, the System Manager will quickly provision a replacement server (we keep automated deployment scripts and configuration backups for this reason). Using the latest backup or a snapshot image, the new server can be brought to the last known good state swiftly. Because this is effectively a “restore to new hardware,” it’s covered by our backup restoration procedures. We also consider adding high-availability configurations for critical components (e.g., a secondary database instance that can take over if primary fails). In the current setup, manual failover is used: the RTO of 4 hours accounts for time to spin up and restore on new hardware.
– Software/Application Failure (bug or crash): If a deployment of new code causes the platform to crash or malfunction, the immediate step is to rollback to the previous stable version. Our version control and CI/CD pipelines maintain the ability to redeploy an earlier build quickly. The development team will be alerted (if not already the ones who discovered the issue) and they will either hotfix the bug or rollback. If the platform is down due to this issue, an outage notification may be posted, but since this is usually within our control, we expect to fix it within RTO. If a critical bug in third-party software (e.g., the database engine) causes a crash, we will follow vendor guidance – possibly applying patches or migrating to a stable version.
– Database Crash or Errors: If the database software crashes due to internal errors (not data corruption), attempt a restart of the service. Investigate error logs; if it’s a known issue (like hitting a resource limit), allocate more resources or configuration tuning and restart. If the database won’t restart cleanly, treat it as potential corruption – then proceed with a restore to a new instance (after safely taking a snapshot of the current state for forensic analysis).
– Network or Connectivity Issues: If the issue is network-related (the server cannot be reached due to a network outage or misconfiguration), the System Manager checks both HGA’s network (if any on-premises components exist) and DigitalOcean’s network status. Routing issues often resolve within a short time; if not, moving the server to a different network or using a backup VPN may be needed. For example, if an internet service problem prevents staff from accessing the cloud, staff can relocate to a backup internet source or use cellular networks to reach it. If the platform is multi-tier and internal networking fails (e.g., the app server cannot reach the database due to a firewall issue), the problem is corrected by reconfiguring network rules or switching to a backup network path (some DigitalOcean deployments allow private networking between components; ensure those links are up).
- Data Corruption or Data Loss:
If data becomes corrupted (e.g., due to software error, storage issue, or accidental deletion) the focus is on preserving data integrity and recovering lost data:
– The System Manager will first stop any processes that might continue to alter data (to avoid propagating corruption). For instance, if an automated process is writing bad data (say, a bug writing incorrect entries), halt that process.
– Identify the scope of corruption: which datasets or tables are affected and when the corruption likely started. This is where having multiple backup points is valuable. The System Manager might restore a backup to a separate environment to compare data and pinpoint what is wrong (a minimal comparison sketch follows this list).
– Once scope is known, decide recovery strategy: It could be full restore from the last known good backup (losing changes since then, which aligns with RPO tolerance) or a targeted repair if possible. For example, if one table is corrupted but others are fine, we might restore that one table from backup rather than the whole database. This reduces data loss.
– After restore, reconcile any missing data. Using application logs or user input, try to re-enter or recover transactions made after the backup. In some cases, if only a small number of records are lost, the Business Developer may contact those users to inform them and manually re-input data or ask them to resubmit if feasible. If the data loss is larger, a broader communication might be needed (e.g., “transactions made between 8–10AM were lost and need to be redone”).
– Going forward, implement measures to detect corruption early. This might include turning on database checksums, running routine data validations, and ensuring our monitoring can flag unusual data patterns. If the corruption came from a software bug, fix that bug and possibly add application-level controls to prevent such erroneous data operations.
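The scoping comparison referenced above can be partly automated. The sketch below compares per-table row counts between the production database and a backup restored to a separate environment; it assumes a PostgreSQL-compatible database and the psycopg2 client, and the connection strings and table names are placeholders.

```python
import psycopg2  # PostgreSQL client; assumes a PostgreSQL-compatible database

PRODUCTION_DSN = "postgresql://readonly@prod-db.internal/hga"    # placeholder
RESTORED_DSN = "postgresql://readonly@staging-db.internal/hga"   # placeholder
TABLES = ["users", "transactions", "documents"]                  # placeholder table names

def row_counts(dsn: str) -> dict:
    """Collect per-table row counts from one database."""
    counts = {}
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        for table in TABLES:
            cur.execute(f"SELECT COUNT(*) FROM {table}")  # fixed table names only
            counts[table] = cur.fetchone()[0]
    return counts

if __name__ == "__main__":
    prod, restored = row_counts(PRODUCTION_DSN), row_counts(RESTORED_DSN)
    for table in TABLES:
        delta = prod[table] - restored[table]
        status = "match" if delta == 0 else f"differs by {delta:+d} rows"
        print(f"{table}: production={prod[table]} restored={restored[table]} ({status})")
```

Row counts are only a first-pass signal; checksums or targeted queries on recently changed records would follow before deciding between a full restore and a targeted repair.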
Throughout any of these scenarios, continuity of operations is a parallel concern: the Business Developer and other team members will activate any necessary manual workarounds or alternative procedures to keep the business running. For example, if the platform is down due to a disaster, staff can use a pre-prepared Excel template to log client interactions or transactions and later import them. If Stripe payments cannot be processed for a period, HGA might record pending charges and notify clients that payments will be processed once systems are back up (or provide an alternate payment method if downtime is extended). The goal is that HGA can still deliver essential services, or at least clear communication to clients, even if the normal digital platform is impaired.
External Dependencies and Third-Party Integrations
The HGA Digital Platform relies on several third-party services, which must themselves be considered in our BCDR planning. Key dependencies include Amazon Web Services (AWS) S3, the Stripe Payments API, and potentially other services (e.g., email/SMS gateways and mapping services, as listed in the technical specifications). We document these dependencies, assess their continuity provisions, and outline strategies to handle their failures:
- AWS S3 (Data Storage): The platform uses AWS S3 to store certain data (for example, user-uploaded documents, static media, or backup files). AWS S3 is a highly durable and available storage service, but in the rare event of an S3 outage or data unavailability, the platform components that rely on S3 could be affected (e.g., images not loading, or inability to fetch certain content). HGA’s plan for S3 issues:
- We enable versioning and cross-region replication for critical S3 buckets if feasible, so data is replicated to another AWS region. If the primary S3 region has an outage, we can repoint the application to the replicated bucket in the alternate region.
- For short-term S3 outages, the platform will be designed to fail gracefully: e.g., if a file cannot be retrieved, the application will not crash but show a placeholder or a message. This ensures partial functionality continues.
- If S3 downtime is extended, the System Manager can retrieve the needed files from the latest backup (since we back up S3 data periodically to DigitalOcean or another storage) and serve them from a temporary local storage or an alternate cloud storage (like DigitalOcean Spaces or an S3 bucket in another region) by reconfiguring the app. This is a manual switch but can restore access to critical files.
- Once AWS resolves the issue, we switch back to normal S3 operations and reconcile any changes (e.g., if any new files were uploaded to the alternate storage during the outage, they will be copied to the main S3 to keep records complete).
- Stripe API (Payments): Stripe is used to process online payments on the platform. While Stripe’s uptime is generally high, an outage or integration issue could prevent transactions (subscriptions, donations, etc.) from completing. Our continuity approach for Stripe:
- The platform is built such that a Stripe outage does not affect overall availability for non-payment functionalities – users can still log in, access content, etc., even if payments are temporarily unavailable[21]. If Stripe calls fail, the application will catch the errors and inform users of a payment processing issue rather than crashing. (A minimal sketch of this error-handling pattern follows this list of dependencies.)
- In case of a prolonged Stripe outage, HGA can manually manage payment processes. For example, the platform could queue any transactions that were attempted (storing the necessary details securely) and the Business Developer will inform users that their payment will be processed once the system is back online. If urgent, HGA might offer an alternate payment method (like an invoice or bank transfer outside the platform) as a stopgap. As seen in another continuity plan, even if Stripe is down, paying users can continue to use the site and subscriptions can be handled manually by staff until Stripe is restored[21].
- If Stripe itself has a disaster recovery event (unlikely, but if Stripe informs of data loss on their side), HGA would work with Stripe support to reconcile transactions. We keep our own transaction logs so we know which payments were completed or pending, to cross-verify with Stripe’s records once available.
- HGA will evaluate the feasibility of integrating a secondary payment provider as a backup. This isn’t implemented yet, but the policy leaves room for future addition: e.g., having PayPal or a secondary gateway that can be enabled if Stripe is down could provide immediate redundancy for critical payment flows.
- Other Third-Party APIs/Services: The platform may integrate with other services (examples: email delivery services like SendGrid, mapping APIs, analytics services, etc.). For each such dependency:
- We maintain a registry of third-party services in the technical documentation, noting their purpose, how critical they are, and whether we have alternative solutions.
- We assess the BCDR posture of each vendor. For critical ones, we ensure they have SLA or uptime guarantees and that we have support contacts. For example, if we use an email API and it goes down, we have a backup method (maybe an SMTP relay or even Gmail for urgent communications).
- If a third-party outage affects our platform’s functionality, our incident response will include checking that service’s status and possibly contacting their support. Meanwhile, we implement workarounds: e.g., if the email service is down, the System Manager can route emails through a backup service or queue them to send later; if a mapping API is down, perhaps the feature is non-critical and we simply hide/disable map features temporarily.
- For any third-party that is vital (marked critical in our BIA), we either have an alternative provider in mind or a contingency plan. For those deemed non-critical, we accept that those features will be unavailable until the vendor fixes it, which is communicated to users if noticeable.
- Shared Infrastructure Considerations: It’s noted that cloud providers operate on a shared responsibility model for continuity. For instance, DigitalOcean ensures infrastructure-level continuity (power, networking, etc.), while HGA is responsible for app-level continuity like backups and failovers. Similarly, AWS maintains S3 service durability, but we plan for unlikely failures. Recognizing this, we don’t duplicate efforts but rather complement vendor resilience with our own measures (like additional backups or alternative services).
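The payment-degradation behaviour described for Stripe above might be implemented along the following lines. This is a minimal sketch using the official Stripe Python SDK (whose exact error classes vary by version); the API key, queue location, and the later retry job are placeholders and assumptions for the example.

```python
import json
import time
from pathlib import Path

import stripe  # official Stripe Python SDK; error class names vary slightly by version

stripe.api_key = "sk_live_placeholder"   # placeholder; the real key lives in config management
PENDING_QUEUE = Path("/var/spool/hga/pending_payments.jsonl")   # placeholder local queue

def charge_or_queue(amount_cents: int, currency: str, customer_id: str) -> str:
    """Attempt a Stripe charge; if Stripe is unreachable, queue it for later processing."""
    try:
        intent = stripe.PaymentIntent.create(
            amount=amount_cents, currency=currency, customer=customer_id
        )
        return f"processed:{intent.id}"
    except stripe.error.APIConnectionError:
        # Stripe outage or network issue: record the attempt so the Business Developer
        # can inform the user and the payment can be processed once service returns.
        PENDING_QUEUE.parent.mkdir(parents=True, exist_ok=True)
        record = {
            "amount": amount_cents,
            "currency": currency,
            "customer": customer_id,
            "queued_at": time.time(),
        }
        with PENDING_QUEUE.open("a") as queue:
            queue.write(json.dumps(record) + "\n")
        return "queued"
```

A scheduled job (not shown) would drain the queue once Stripe reports healthy again, and each queued record would be reconciled against Stripe's transaction records as described above.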
By documenting these dependencies and our responses, we ensure that an outage in any linked service does not catch us off-guard. Additionally, as part of vendor management, HGA will: (a) Monitor third-party service status (we subscribe to status alerts or RSS feeds for AWS and Stripe, etc., to get immediate notification of incidents on their side), and (b) Periodically review third-party SLA and BCDR documentation to stay aware of their reliability and any actions we should take (for example, some services might schedule downtime that we need to plan around).
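Status monitoring of these vendors can also be scripted. The sketch below polls status endpoints on a schedule; the URLs are placeholders, and the JSON shape assumed here (a Statuspage-style `status.description` field) must be adjusted to whatever format each vendor actually publishes.

```python
import requests

# Placeholder endpoints: substitute each vendor's documented status API or feed.
STATUS_ENDPOINTS = {
    "aws-s3": "https://status.example-aws.test/api/v2/status.json",
    "stripe": "https://status.example-stripe.test/api/v2/status.json",
}

def check_vendor_status() -> dict:
    """Poll each dependency's status endpoint and summarise its reported state."""
    summary = {}
    for name, url in STATUS_ENDPOINTS.items():
        try:
            payload = requests.get(url, timeout=10).json()
            # Assumed response shape; adapt the parsing per vendor.
            summary[name] = payload.get("status", {}).get("description", "unknown")
        except (requests.RequestException, ValueError):
            summary[name] = "unreachable"
    return summary

if __name__ == "__main__":
    for vendor, state in check_vendor_status().items():
        print(f"{vendor}: {state}")
```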
Communication and Escalation Plan
Clear communication and timely escalation are crucial during any disruptive event. This section describes how HGA will communicate internally and externally, the hierarchy of contacts, and the timeline for escalation during a crisis. The goals of the communication plan are to keep all stakeholders informed with accurate information, manage expectations, and provide instructions or workarounds as needed.
Incident Notification and Internal Escalation:
– Immediate Alerts (0 – 15 minutes): When an incident is detected, the System Manager (or whoever discovers it) will immediately notify key internal stakeholders. An incident alert will be sent via the designated urgent channel (e.g., a Slack #incident room, SMS text, or phone call) to the System Manager, Business Developer, and backup technical staff. This message will contain a brief description of the issue (e.g., “Database server unresponsive, investigating”) and the initial severity assessment. The System Manager begins the response, and the Business Developer is aware and can begin preparing business continuity actions if needed.
– Engaging Response Team (within 30 minutes): If the incident is confirmed as real and not a false alarm, the Incident Response Team is formally engaged. For HGA, this typically includes the System Manager (lead), available IT/Dev team members, and the Business Developer. Each knows their role: technical containment/recovery for IT, and impact assessment/communication for Business side. At this point, if it’s a Critical incident (total outage or major breach), the Executive Management (e.g., CEO or CTO) is also informed by the Business Developer or System Manager. They may join the response coordination, especially to make any high-level decisions or to authorize major steps (like issuing a broad client notification or spending on external help).
– Contacting Vendors (within 1 hour if needed): If the incident involves third-party services or infrastructure (e.g., a cloud provider issue or Stripe outage), the System Manager will reach out to those vendors’ support teams within the first hour, after internal actions are underway. Concurrently, we track those vendors’ public status updates. Escalating to vendor support early ensures we are in their queue if assistance or information is needed.
– Internal Status Updates (ongoing): The System Manager will provide updates to the internal team at regular intervals (e.g., every 30 minutes for critical incidents, or as significant milestones are reached for lower-severity issues). These updates cover what has been done and what the current ETA is for resolution. Internally, a communication log is kept (timestamped updates in the incident channel or a shared document) to maintain a common operational picture.
External Communication:
– Customer/Client Communications: If the incident impacts end-users or clients (e.g., platform downtime, data breach), HGA will communicate proactively. The Business Developer leads drafting of these messages, with input from technical staff for accuracy and from execs for tone/approval. We will use multiple channels as appropriate: an alert banner on the platform, an email to clients, or a status page update. The communication will be honest about the issue without unnecessary technical jargon, providing reassurance that we are addressing it and (if possible) an estimated time to restore service. For example, “We are currently experiencing an unexpected outage on the HGA Digital Platform. Our team is working to restore access, and we expect to be back online by approximately 2 PM UTC. We apologize for the inconvenience.” If a workaround exists (like alternate payment method), mention that. Regular updates should follow if the downtime is prolonged (say updates every 1–2 hours or whenever new info is available). Once resolved, a final update is sent indicating everything is back to normal.
– Escalation to Media/Regulators: In certain scenarios (major data breach, or outage exceeding a day affecting contractual obligations), HGA’s Executive Management will handle external stakeholder communication beyond customers. This could include notifying regulatory bodies if required by law (e.g., data protection authorities in a breach of personal data), or preparing press statements if the incident draws public attention. The BCDR policy mandates that any such communications be done in consultation with legal counsel and in line with HGA’s crisis communication plan.
– Third-Party Notification: If our outage or issue could affect partners or suppliers, we inform them as needed. For example, if HGA provides data or services to another partner system, we’d alert them of our downtime. Similarly, if the issue is caused by a partner (like Stripe outage), we might coordinate messaging with them (ensuring we relay correct info to our users that matches what Stripe is saying).
Escalation Hierarchy & Contact Information:
Below is a Table of Responsibilities and Escalation detailing key contacts, their roles, and how/when to escalate issues to them:
| Role / Team | Name (Primary Contact) | Contact Info | Responsibilities | Escalation Path & Timeline |
|---|---|---|---|---|
| System Manager (BCDR Lead) | [Name] (Primary On-Call) | [Phone] / [Email] | – Monitor platform & detect incidents<br>– Lead technical response (containment, recovery)<br>– Liaise with vendor support (DO, AWS, etc.)<br>– Provide internal tech updates | Initial responder – receives alerts first. Initiates response immediately. If issue not resolved in 30 min, escalates to Business Developer and informs Executive. |
| Business Developer (Continuity Coordinator) | [Name] | [Phone] / [Email] | – Assess business impact<br>– Coordinate manual workarounds for continuity<br>– Draft and send communications to clients<br>– Update management on business implications | Notified at incident start by System Manager. Joins response within first 15–30 min. If customer-facing impact, begins external comms within 1 hour. Escalates to Executive if outage >1 hr or client SLA breached. |
| Executive Management (CEO/CTO) | [Name] | [Phone] / [Email] | – Strategic decisions (e.g., declare disaster, authorize major expenditures)<br>– Public/press communications if needed<br>– Approve customer communication content (if high-severity)<br>– Interface with regulators or authorities | Engaged for Critical incidents or if criteria met (e.g., data breach, extended downtime >4 hrs). Updated by Business Developer at least hourly during a major incident. Joins incident calls as needed to guide and approve actions. |
| Development & IT Support Team | [Team Lead Name] | [Phone] / [Email] | – Implement recovery steps under System Manager direction (restore backups, fix code, etc.)<br>– Provide specialized expertise (DB admin, security analysis, etc.)<br>– Assist in testing after recovery | Activated as needed – System Manager pages relevant team members immediately when an incident is confirmed (within 15 min). Expected to respond and be hands-on within 30 min. If the primary developer is unavailable, escalate to the backup team member immediately. |
| Client Support/Account Mgmt | [Name] | [Phone] / [Email] | – Field inquiries from clients during incident<br>– Provide clients with status updates (using info from Business Developer)<br>– Log any client-reported issues for tech team<br>– Reassure clients and manage expectations | During an incident – alerted by Business Developer when external notification goes out. They respond to incoming client calls/emails using approved messaging. If clients report new information (e.g., “I see X issue”), they pass it to the System Manager. Escalate any critical client concern directly to Business Developer/Executive immediately. |
(Note: Specific names and contact details to be maintained in an Annex and updated as personnel change. Primary and secondary contacts are designated for each key role.)
This escalation chart ensures that no time is lost in engaging the right people. It delineates who must be contacted and by whom, within what timeframe. For instance, a moderate issue might involve only System Manager and Dev Team; a critical one brings in everyone up to the CEO quickly. If a primary contact is unreachable for 10 minutes, the backup person for that role is contacted (e.g., if the System Manager is on leave, the designated deputy steps in).
Communication Tools & Protocols:
– Internally, a dedicated incident conference call or virtual war room will be opened for coordination if the situation is complex. This ensures real-time collaboration.
– We maintain an updated contact list (phone numbers, personal emails, etc.) in a secure but accessible location (e.g., an encrypted file stored locally with key team members) for use if normal systems (like corporate email) are down.
– We have a pre-prepared template for incident reports to clients, which can be quickly customized for the situation to speed up messaging (an illustrative template sketch follows this list).
– For extended outages, HGA will use a “dark site” or external status page hosted independently of our infrastructure to post updates (ensuring clients can get information even if our main site is down)[22]. For example, a status page on a separate cloud or a social media channel can serve this role.
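As an illustration of the pre-prepared client notification template mentioned above, a simple parameterised message (using Python's string.Template) could be kept alongside the contact list; the wording mirrors the example message in the Customer/Client Communications section and is indicative only.

```python
from string import Template

CLIENT_NOTICE = Template(
    "We are currently experiencing $issue on the HGA Digital Platform. "
    "Our team is working to restore access, and we expect to be back online "
    "by approximately $eta. $workaround We apologize for the inconvenience."
)

if __name__ == "__main__":
    message = CLIENT_NOTICE.substitute(
        issue="an unexpected outage",
        eta="2 PM UTC",
        workaround="In the meantime, payments can be arranged through your account manager.",
    )
    print(message)
```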
Escalation Timeline Summary:
– 0-15 min: Incident identified; System Manager on case; initial containment; notify Business Dev.
– 15-30 min: Incident confirmed as major -> Business Dev notifies Execs; core team assembled; internal alert to staff if needed (“tech issue ongoing, team working”).
– 30-60 min: If user-facing outage, initial client communication prepared and sent (aim to notify clients within 1 hour of major disruption). If resolved quickly (<1 hr), a post-incident note can suffice.
– Every 60 min: For ongoing incidents, update clients (status page or email) at least hourly, even if there is little new information, to reassure them that work is continuing. Internally, escalate to higher management if something new arises (e.g., approval needed for a drastic measure).
– 4 hours: If still down, executive management considers declaring “disaster” formally, possibly invoking more drastic continuity actions (like moving to alternate site if not already) and perhaps informing any regulators or stakeholders per obligations.
– End of incident: within 24-48 hours after resolution, send a follow-up to clients summarizing what happened (if it was noticeable to them) and steps taken to prevent future issues – showing transparency and commitment to reliability.
This communication plan aligns with best practices that emphasize effective stakeholder messaging during outages[23][24] and ensures that issues are escalated promptly to avoid delays in decision-making or customer notification.
Backup Data Security and Testing
Having backups is not enough – we must ensure backup data is secure and that restoration processes work when needed. HGA implements strict measures for secure storage of backups and conducts regular testing to validate our BCDR capabilities:
- Secure Storage of Backups: All backup files (database dumps, system snapshots, etc.) are stored in encrypted form. For instance, database backups are encrypted with a strong key before uploading to AWS S3, and S3 itself encrypts data at rest. Access to these backups is limited to the System Manager and one other designated administrator for redundancy. Credentials for backup storage (AWS keys, etc.) are kept in a password manager accessible to authorized personnel only. Additionally, we segregate backup storage from the main environment – e.g., backups in S3 are under a separate account or with strict IAM policies – so that even if the main platform is compromised, attackers cannot easily purge our backups. After each backup operation, an automated verification is done (simple checksum comparison or restore test in a sandbox) to confirm the backup file is not corrupted and is complete.
- Retention and Versioning: As mentioned, multiple versions of backups are kept. This means even if an attacker tried to alter or delete recent backups (in the worst case of gaining access), we have older restore points offline. We also periodically create immutable backups (write-once media or Object Lock on S3) that cannot be deleted until a set date, to defend against malicious or accidental deletion.
- Periodic Restoration Testing: Testing the backups by actually restoring is the only way to be confident in our recovery. HGA will perform routine DR tests: at least quarterly, the System Manager will simulate a recovery scenario. This involves taking the latest backup and attempting to restore it to a staging environment. For example, spin up a new droplet, restore yesterday’s database backup to it, and see if the application can run against it. Or retrieve files from the S3 backup archive and open them to ensure they are intact. These tests confirm that backups are viable and that staff know the restoration steps. We document the time it takes and any issues encountered. Regular testing through simulations and drills helps evaluate the plan’s effectiveness and identifies weaknesses, allowing us to refine our procedures and ensure staff are prepared[25]. After each test, results are reviewed and any failures or delays are addressed (e.g., if a backup was found to be incomplete, we fix the backup process; if the restore took too long, we look for optimization or adjust RTO expectations).
- Annual BCDR Drill: At least once a year, a more comprehensive BCDR exercise is conducted. This may involve a full-scale simulation where an incident (like a database corruption or data center outage) is feigned and the team must go through the motions of declaring the incident, communicating, restoring from backups, etc. without actually impacting production. This functional exercise ensures all parts of the plan (technical and communications) work together[26][14]. We may also conduct surprise drills (with management approval) to test response times and see how the team handles an unexpected scenario, then provide training as needed if gaps show up.
- Real-Time Monitoring of Backups: We leverage monitoring scripts to ensure backups occur as scheduled and to alert if any backup fails or if the backup data size deviates unexpectedly (which might indicate a problem). The System Manager checks backup logs daily. Any anomalies trigger immediate investigation. This proactive monitoring, along with periodic test restores, gives confidence that when we need the backups, they will function. (A minimal check of this kind is sketched after this list.)
- Audit and Access Reviews: On a scheduled basis (e.g., semi-annually), an audit is performed on who has access to backup systems and keys. Any staff changes are updated (removing access for former employees, etc.). We also review if backup procedures follow defined policy (for example, verifying that backups are indeed happening daily, and that at least X versions are retained). This is documented in an audit log. Compliance with this backup policy is part of HGA’s internal IT controls and is reported to management. If we were to pursue certifications in the future (like ISO 27001 or others), these logs and tests would help demonstrate our continuity controls in action.
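The backup monitoring described in the "Real-Time Monitoring of Backups" item could be implemented along these lines. The sketch reuses the placeholder bucket from the earlier upload example; the freshness and size thresholds are illustrative values, not the configured ones.

```python
from datetime import datetime, timedelta, timezone

import boto3  # AWS SDK for Python

BUCKET = "hga-platform-backups"   # placeholder, matching the earlier upload sketch
PREFIX = "db/daily/"
MAX_AGE = timedelta(hours=26)     # daily schedule plus a small grace period
MIN_RATIO = 0.5                   # alert if the newest backup shrinks by more than half

def latest_backups(count: int = 2) -> list:
    """Return the most recent backup objects under the daily prefix."""
    s3 = boto3.client("s3")
    objects = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", [])
    return sorted(objects, key=lambda obj: obj["LastModified"], reverse=True)[:count]

def backup_problems() -> list:
    """Check backup freshness and size; return human-readable issues, if any."""
    issues = []
    recent = latest_backups()
    if not recent:
        return ["no backups found under the expected prefix"]
    newest = recent[0]
    age = datetime.now(timezone.utc) - newest["LastModified"]
    if age > MAX_AGE:
        issues.append(f"newest backup is {age} old (allowed: {MAX_AGE})")
    if len(recent) == 2 and newest["Size"] < recent[1]["Size"] * MIN_RATIO:
        issues.append("newest backup is much smaller than the previous one")
    return issues

if __name__ == "__main__":
    for problem in backup_problems():
        print("BACKUP ALERT:", problem)   # in practice this would page the System Manager
```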
In summary, secure and tested backups ensure that HGA’s data is safe and recoverable. We treat backups as a lifeline: protected rigorously and regularly proven effective.
Maintenance, Audit, and Continuous Improvement
Business continuity is not a one-time setup – it requires ongoing maintenance, review, and adaptation to new threats. HGA commits to the following practices to keep the BCDR policy effective and up-to-date:
- Regular Plan Reviews: This BCDR policy and the associated detailed recovery procedures will be reviewed at least annually, or whenever there are significant changes to the platform or business. The review will involve the System Manager, Business Developer, and Executive sponsor. They will verify that all information (like contact details, system architecture, critical processes) is current. Any changes (e.g., new dependencies, different RTO requirements due to business growth) will be updated in the policy. We align this with ISO 22301’s requirement for continual improvement of the BCMS, ensuring the plan stays relevant as HGA evolves[5].
- Change Management: Any time a major system change is introduced (such as a new platform module, a new third-party integration, or an infrastructure change), we assess its impact on BCDR. For example, if we adopt a new payment gateway in addition to Stripe, we update the dependencies section and add recovery steps for it. If we move to a different cloud provider or deploy an additional environment, the BCDR strategy is adjusted accordingly, and new recovery procedures are documented and tested. In this way, the continuity plan is woven into our deployment processes rather than maintained separately.
- Real-Time Threat Monitoring: As part of risk management, HGA employs real-time monitoring tools for early threat detection. This includes:
- Security Monitoring: Firewalls and intrusion detection systems (IDS) on our servers to alert on suspicious activities (e.g., repeated failed logins, unusual network traffic). We also keep system and application logs streaming to a secure log management system; these logs are monitored for defined triggers (like errors or security events).
- Uptime and Performance Monitoring: External services ping the platform and key API endpoints periodically. If the platform does not respond, or responds with errors, an immediate alert goes to the on-call personnel (see the sketch after this list for a minimal illustration). We also monitor performance metrics that could indicate problems (high error rates, slow response times) to catch issues before they escalate into full outages.
- Threat Intelligence and Patching: The System Manager subscribes to security bulletins relevant to our technology stack (for instance, mailing lists or RSS feeds for vulnerabilities in the programming framework we use, or alerts from DigitalOcean/AWS about incidents). We maintain an inventory of systems and promptly apply critical patches – ideally in a staging environment then to production in a controlled manner. For zero-day threats, as mentioned, we have procedures to take emergency actions to mitigate risk until a patch is applied.
- Automated Alerts to Management: For certain severe alerts (like a successful intrusion detection or a system going offline), not only the technical team but also a member of management is notified right away. This ensures leadership is aware of significant threats or downtime in real time, contributing to faster decision-making and resource mobilization if needed.
- Compliance and Documentation: Every incident, whether a near-miss or a full disruption, is documented in an Incident Report that includes a timeline of events, actions taken, impact, and lessons learned. These reports are reviewed in management meetings, and at least annually in aggregate, to identify patterns or recurring issues. Additionally, any compliance requirements related to continuity and incident handling (for instance, if we become subject to SOC 2 or specific industry regulations) are tracked and integrated into this plan. If auditors need evidence of BCDR activities, we can provide backup logs, test results, training records, and incident reports as proof of our active continuity program.
- Training and Awareness: Key staff (System Manager, Business Developer, IT support) will receive annual training on their BCDR roles and the latest procedures. This can include tabletop exercises where we walk through a hypothetical scenario. We also include general staff in awareness training so they know, for example, how to find out if the system is down and what interim steps to take. Training might involve small drills (like an unannounced simulation during a workday) to keep everyone sharp. Leveraging modern techniques (possibly even gamified drills as suggested by industry experts[27]) can improve engagement.
- Audit of Third-Party Preparedness: Part of maintaining our plan is ensuring our vendors remain reliable. We periodically check that third-party services we use still meet our needs. If a vendor’s SLA or performance degrades, we consider if we need to enhance our contingencies or even switch services. Also, if vendors offer new continuity features (e.g., DigitalOcean introduces easier failover features, or Stripe offers a new redundancy option), we evaluate and adopt beneficial ones.
- Metrics and KPIs: We track metrics such as: number of incidents per quarter, average downtime per incident, success rate of meeting RTO/RPO in drills or real events, number of backup failures, etc. Key Performance Indicators may include the percentage of incidents resolved within RTO, or whether data loss stayed within RPO during an event. We also measure backup success rate (e.g., “backups completed successfully 98% of days, with failures promptly fixed”). These metrics are reviewed to gauge the effectiveness of our BCDR efforts and are reported to leadership. They help highlight if the continuity strategy is improving or if adjustments are needed (for example, if we consistently miss RTO in tests, we may need more resources or a different strategy).
- Continuous Improvement: After each incident or test, action items are captured. For example, a test might reveal that team members were unclear on who to contact – leading us to improve the contact table or training. Or an incident might show a gap in monitoring – prompting us to add a new alert or upgrade a tool. The BCDR plan is a living document; we version-control it and maintain a change log. Improvements, once identified, are implemented promptly (not just at the annual review). This aligns with the principle of continuous improvement in ISO 22301 – learning from each experience to strengthen our resilience[5].
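As a concrete illustration of the uptime monitoring and automated alerting described above, a minimal in-house health check, run on a schedule from a host independent of the platform, might look like the sketch below. The endpoint URLs and alert webhook are hypothetical placeholders; HGA’s primary monitoring relies on external services, and this is only an illustrative supplement.

```python
"""Illustrative scheduled health-check sketch (hypothetical URLs throughout)."""
import requests

ENDPOINTS = [
    "https://platform.example.com/healthz",      # hypothetical health endpoint
    "https://platform.example.com/api/status",   # hypothetical API status endpoint
]
ALERT_WEBHOOK = "https://hooks.example.com/oncall"  # hypothetical on-call alert channel
TIMEOUT_SECONDS = 10
SLOW_THRESHOLD_SECONDS = 5


def check_endpoint(url: str):
    """Return a problem description for the URL, or None if it looks healthy."""
    try:
        response = requests.get(url, timeout=TIMEOUT_SECONDS)
    except requests.RequestException as exc:
        return f"{url} unreachable: {exc}"
    if response.status_code >= 500:
        return f"{url} returned HTTP {response.status_code}"
    if response.elapsed.total_seconds() > SLOW_THRESHOLD_SECONDS:
        return f"{url} responded slowly ({response.elapsed.total_seconds():.1f}s)"
    return None


def main() -> None:
    problems = [p for p in (check_endpoint(url) for url in ENDPOINTS) if p]
    if problems:
        # Notify on-call staff immediately; severe or platform-wide failures would
        # also be escalated to management per this policy's escalation rules.
        requests.post(
            ALERT_WEBHOOK,
            json={"text": "Platform alert: " + "; ".join(problems)},
            timeout=TIMEOUT_SECONDS,
        )
    else:
        print("All monitored endpoints healthy.")


if __name__ == "__main__":
    main()
```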
Finally, executive management involvement ensures that BCDR remains a priority. Management receives regular updates on continuity readiness (e.g., an annual BCDR report summarizing all tests, incidents, improvements) and signs off on necessary investments (like additional backup infrastructure or training programs). With leadership commitment and a culture that values preparedness, HGA’s BCDR program will continue to mature over time[28].
Conclusion
This Business Continuity and Disaster Recovery Policy provides a comprehensive framework to safeguard the Humanics Global Advisors Digital Platform against disruptions. By aligning with ISO 22301 and NIST SP 800-34 standards, clearly defining RTO/RPO targets, implementing robust daily backups on DigitalOcean with offsite copies, and detailing step-by-step response and recovery procedures, HGA is well-prepared to handle emergencies ranging from natural disasters to cyber-attacks. Roles and responsibilities – from the System Manager’s technical lead to the Business Developer’s communications role – are assigned to ensure a coordinated response with effective communication to all stakeholders. Through secure backup storage, regular testing, and continuous monitoring for threats, we maintain a high state of readiness.
This policy document shall be approved by HGA executive management and disseminated to all relevant team members. It will be reviewed and tested regularly to remain effective and up-to-date[5], and it will serve as a trusted guide in the event of any crisis affecting the digital platform. By following this BCDR policy, Humanics Global Advisors demonstrates its commitment to resilience, ensuring that even in adverse situations, we can continue our mission with minimal interruption and uphold the trust of our clients and partners.