Local cache verification for D2C

x360Recover

Written By Tami Sutcliffe (Super Administrator)

Updated at July 27th, 2022

What is local cache verification?  

Local cache verification is a process that checks all data on a protected system against the contents of the local cache repository. Any missing data is then flagged for inclusion in the next backup.

  • Because verification of very large protected systems can take some time (potentially hours, in extreme cases), verification runs asynchronously from backups.
  • If a new backup is scheduled to occur while verification is running, verification is paused while the backup is performed and then resumed afterward.  
  • Verification runs at the lowest CPU and IO priority allowed by the operating system, so it should never interfere with the performance of your protected systems.

How does local cache verification work?

Local cache verification efficiently compares (a) the content index of the local cache repository with (b) the change block hash table, which is used in every backup to identify the blocks and data that have already been sent to the backup server.

A single CPU core is capable of comparing about 1TB per minute, so the total verification process requires very little time to complete on almost any system.
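The comparison above can be pictured with the minimal Python sketch below. It assumes, for illustration only, that the change block hash table is a list of per-block hashes indexed by block number and that the cache content index is a set of hashes already stored in the cache. The agent's real data formats are not described in this article, and `find_missing_blocks` is a hypothetical name.

```python
from typing import Iterable, List, Set


def find_missing_blocks(block_hashes: Iterable[bytes], cache_index: Set[bytes]) -> List[int]:
    """Return the block numbers whose hash is not present in the cache index.

    Hypothetical sketch: block_hashes[i] is the hash recorded for block i in the
    change block hash table; cache_index holds the hashes already in the local cache.
    """
    missing = []
    for block_number, block_hash in enumerate(block_hashes):
        # Set membership is O(1) on average, so the whole pass is a single
        # linear scan over the hash table.
        if block_hash not in cache_index:
            missing.append(block_number)
    return missing


# Usage: block 1 has no matching entry in the cache index.
hashes = [b"aa", b"bb", b"cc"]
print(find_missing_blocks(hashes, {b"aa", b"cc"}))  # [1]
```

In this sketch only hashes are compared and no block data is read from disk, which is one way a comparison like this can run quickly on a single core.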

How does "self-healing" work?

If missing blocks are found (i.e., data on the protected system's hard drive that is not already in the local cache), a bitmap of the missing blocks is created and saved. The x360Recover agent processes this bitmap during the next backup cycle, and the missing blocks are written to the local cache repository to ‘self-heal’ the missing data.
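A missing-block bitmap can be sketched as one bit per block. The minimal Python example below assumes a simple layout (bit i set means block i is missing, least significant bit first within each byte). The actual file format, file name, and location used by the agent are not documented in this article, and both function names are hypothetical.

```python
from typing import Iterable, List


def build_missing_bitmap(missing_blocks: Iterable[int], total_blocks: int) -> bytearray:
    """Pack missing block numbers into a bitmap (1 bit per block)."""
    bitmap = bytearray((total_blocks + 7) // 8)
    for block in missing_blocks:
        if not 0 <= block < total_blocks:
            raise ValueError(f"block {block} out of range")
        bitmap[block // 8] |= 1 << (block % 8)  # set the bit for this block
    return bitmap


def blocks_from_bitmap(bitmap: bytes, total_blocks: int) -> List[int]:
    """Unpack the bitmap back into a sorted list of missing block numbers."""
    return [b for b in range(total_blocks) if bitmap[b // 8] & (1 << (b % 8))]


bitmap = build_missing_bitmap([1, 9], total_blocks=12)
print(bytes(bitmap))                   # b'\x02\x02'
print(blocks_from_bitmap(bitmap, 12))  # [1, 9]
```

A bitmap keeps the saved file small (one byte covers eight blocks) and is cheap to merge into a backup's list of candidate blocks.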

Why might data be missing? 

Many factors can cause missing blocks, including intermittent write errors to the local cache, or backups taken while the local cache device was detached or unavailable.

What happens next?

 If any blocks of data are found to be missing on any volume during verification:

  • A missing block bitmap file will be generated and saved in the agent directory. 
  • During the next backup, this bitmap will be included in the list of blocks that ‘might have changed’ to ensure that missing blocks are sent to the cache during the backup. 
  • If more than 1% of blocks are found to be missing from the cache, the local cache status is considered Failed. Cache status is sent to the backup server at the end of the verification process.
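The two rules above (the 1% failure threshold and merging missing blocks into the next backup's "might have changed" list) can be sketched in Python as follows. The function names and the "Passed" label are hypothetical; only the 1% threshold and the "Failed" outcome come from this article.

```python
from typing import Set


def verification_status(missing_count: int, total_blocks: int) -> str:
    """'Failed' when more than 1% of blocks are missing from the cache."""
    if total_blocks <= 0:
        raise ValueError("total_blocks must be positive")
    # Integer comparison avoids floating-point rounding at exactly 1%.
    return "Failed" if missing_count * 100 > total_blocks else "Passed"


def blocks_to_check(changed_blocks: Set[int], missing_blocks: Set[int]) -> Set[int]:
    """The 'might have changed' set for the next backup: normal change
    tracking plus every block flagged in the missing-block bitmap."""
    return changed_blocks | missing_blocks


print(verification_status(10, 1000))            # Passed (exactly 1% is not more than 1%)
print(verification_status(11, 1000))            # Failed
print(sorted(blocks_to_check({3, 4}, {4, 7})))  # [3, 4, 7]
```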

Together, the local cache verification and self-healing process:

  • confirms that all data necessary for a recovery is held within the cache
  • repairs any missing block data
  • scores the test results and reports them to the backup server

How to configure the agent

Local cache verification adds several new configuration parameters to the agent for enabling and managing local cache verification checks.

Note: By default, verification jobs are run once every 24 hours, preferably not during business hours.

Configuration parameters:

Local cache verification frequency

NOTE: The following configuration information is provided for reference only. We recommend that you accept the default behavior of local cache verification (performed once every 24 hours.)

  • By default, local cache verification is performed once every 24 hours.  
  • Verification is triggered at the end of a backup cycle. If no verification has been performed in more than 24 hours (and no verification job is currently running), a new verification job will be started.

Local cache verification frequency can be configured by setting:

LOCAL_CACHE_VERIFY_FREQUENCY_HOURS=<Hours>

This metric defaults to 24 hours if nothing else is specified and cannot be set to less than 1 hour or to more than 720 hours.
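The trigger rule above can be expressed as a small decision function. This is a minimal sketch, assuming the check runs at the end of each backup cycle as described; the function name is hypothetical, and treating a never-verified system as immediately due is an assumption not stated in this article.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_start_verification(now: datetime, last_verification: Optional[datetime],
                              job_running: bool, frequency_hours: int = 24) -> bool:
    """Evaluated at the end of each backup cycle."""
    # LOCAL_CACHE_VERIFY_FREQUENCY_HOURS is documented as 1 to 720 hours.
    if not 1 <= frequency_hours <= 720:
        raise ValueError("LOCAL_CACHE_VERIFY_FREQUENCY_HOURS must be between 1 and 720")
    if job_running:
        return False  # never start a second job while one is still running
    if last_verification is None:
        return True   # assumption: a system that has never been verified is due now
    return now - last_verification > timedelta(hours=frequency_hours)


now = datetime(2022, 7, 27, 7, 0)
print(should_start_verification(now, now - timedelta(hours=25), job_running=False))  # True
print(should_start_verification(now, now - timedelta(hours=25), job_running=True))   # False
print(should_start_verification(now, now - timedelta(hours=12), job_running=False))  # False
```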

Local cache verification and business hours

By default, a new local cache verification job can be started during the assigned customer's business hours.  

The agent determines business hours in one of two ways:

  1. If the agent is configured with an advanced backup policy job schedule on the server, the business hours defined within that job determine when business hours are in effect for the agent

  2. If the agent is configured with a 'classic' backup policy job schedule on the server (and no explicit business hours schedule exists), business hours are defined internally as M-F from 8:00 AM to 6:00 PM
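The two ways of determining business hours can be sketched as below. This is an illustration only: an advanced policy is modeled as a single daily window on a set of weekdays, real schedules may be more complex, and treating 6:00 PM as the exclusive end of the window is an assumption. `BusinessHours` and `is_business_hours` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class BusinessHours:
    start: time
    end: time
    weekdays: FrozenSet[int] = frozenset(range(5))  # Monday=0 ... Friday=4


# Fallback used with a 'classic' backup policy: M-F, 8:00 AM to 6:00 PM.
CLASSIC_DEFAULT = BusinessHours(time(8, 0), time(18, 0))


def is_business_hours(moment: datetime, policy: Optional[BusinessHours] = None) -> bool:
    """Use the advanced policy's hours when defined, else the classic default."""
    hours = policy or CLASSIC_DEFAULT
    return moment.weekday() in hours.weekdays and hours.start <= moment.time() < hours.end


print(is_business_hours(datetime(2022, 7, 27, 9, 30)))  # True  (Wednesday)
print(is_business_hours(datetime(2022, 7, 30, 9, 30)))  # False (Saturday)
print(is_business_hours(datetime(2022, 7, 27, 18, 0)))  # False (window has ended)
```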

LOCAL_CACHE_VERIFY_ALLOW_DURING_BUSINESS_HOURS: Default True

By default, this setting allows a new verification job to start during the current ‘Business Hours’ defined by the assigned backup schedule.

Verification runs at the lowest priority allowed by Windows and runs asynchronously from backups, so it is not expected to have any noticeable impact on system performance. In the unlikely event that verification must be blocked during business hours, this setting can be overridden in the agent configuration file.

LOCAL_CACHE_VERIFY_ALLOW_SELF_HEALING: Default True

When enabled, any ‘missing’ blocks discovered during verification will be flagged for the agent to incorporate during the next backup.

This option exists for troubleshooting purposes only.

To ensure your local cache data is kept up to date and reliable, this option should not be disabled.

Exclude local cache verification during business hours

Verification is an extremely low-priority background job and is not expected to have a noticeable performance impact on any system. However, if you find it necessary to explicitly exclude verification during business hours, a configuration option is available for this. Verification must still be allowed to run periodically, so if no time outside business hours is available, a verification job will eventually be forced to start.

Only new verification jobs are blocked from starting during business hours.

Verification blocking during business hours

Blocking verification during business hours is honored only for a minimum delay period. Once that period expires, the block is overridden and a verification is forced. This minimum delay defaults to 27 hours.

Notes: 

  • With the default verification frequency (24 hours) and the default minimum blocking delay (27 hours), blocking verification during business hours means that up to 51 hours may pass between the last verification attempt and the next verification run.
  • If verification fails, the self-healing feature should repair the missing blocks on the next backup. However, the status of the local cache displayed on the vault UI will not be updated until the next verification cycle completes.
  • Once a verification job is started, it will run to completion regardless of business hours. 
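The blocking and override behavior can be sketched as one decision for starting a new job. This is an interpretation, not the agent's code: it assumes the minimum delay counts from the moment verification becomes due, which matches the 24 + 27 = 51-hour ceiling described above. The function name is hypothetical, and jobs that are already running are not covered because they always run to completion.

```python
from datetime import datetime, timedelta


def may_start_verification(now: datetime, last_verification: datetime,
                           in_business_hours: bool,
                           allow_during_business_hours: bool = True,
                           frequency_hours: int = 24,
                           min_delay_hours: int = 27) -> bool:
    """Decide whether a new verification job may start."""
    elapsed = now - last_verification
    if elapsed <= timedelta(hours=frequency_hours):
        return False  # not due yet
    if allow_during_business_hours or not in_business_hours:
        return True
    # Blocked by business hours: honor the block only until the minimum delay
    # beyond the normal frequency has passed, then force a verification.
    return elapsed > timedelta(hours=frequency_hours + min_delay_hours)


last = datetime(2022, 7, 25, 17, 0)
# 30 hours later, during business hours, with blocking enabled: still blocked.
print(may_start_verification(last + timedelta(hours=30), last, True,
                             allow_during_business_hours=False))  # False
# 52 hours later: the 51-hour ceiling has passed, so verification is forced.
print(may_start_verification(last + timedelta(hours=52), last, True,
                             allow_during_business_hours=False))  # True
```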

Example:

  • A verification job is triggered after the 7:00 AM backup, before business hours begin.
  • When the 8:00 AM backup occurs, verification is paused while the backup is performed. 
  • When the backup is completed, the previously running verification job is resumed even though it is now within the business hours time window.

Local cache business hours handling can be configured by setting:

LOCAL_CACHE_VERIFY_ALLOW_DURING_BUSINESS_HOURS=<True|False>

This metric defaults to True if not specified.

LOCAL_CACHE_VERIFY_BUSINESS_HOURS_MIN_DELAY=<Hours>

This metric defaults to 27 and can be set to any integer value >= 1.

Note: We recommend keeping the default agent settings for most partners. 

Currently, these settings can only be set manually in the agent aristos.cfg configuration file on the protected system. 
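When editing aristos.cfg by hand, a read-only check such as the Python sketch below can catch out-of-range values before the agent reads them. It assumes, based only on the KEY=VALUE examples in this article, that settings are stored one per line; skipping blank lines and `#` comments and accepting True/False in any letter case are further assumptions. The agent's own parser may differ, and both function names are hypothetical.

```python
from typing import Dict, Union


def parse_cfg(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines. Skipping blank lines and '#' comments is an assumption."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def validate_verification_settings(settings: Dict[str, str]) -> Dict[str, Union[int, bool]]:
    """Apply the documented defaults and reject values outside the documented ranges."""
    freq = int(settings.get("LOCAL_CACHE_VERIFY_FREQUENCY_HOURS", "24"))
    if not 1 <= freq <= 720:
        raise ValueError("LOCAL_CACHE_VERIFY_FREQUENCY_HOURS must be between 1 and 720")
    delay = int(settings.get("LOCAL_CACHE_VERIFY_BUSINESS_HOURS_MIN_DELAY", "27"))
    if delay < 1:
        raise ValueError("LOCAL_CACHE_VERIFY_BUSINESS_HOURS_MIN_DELAY must be >= 1")
    allow = settings.get("LOCAL_CACHE_VERIFY_ALLOW_DURING_BUSINESS_HOURS", "True")
    if allow.lower() not in ("true", "false"):  # case-insensitivity is an assumption
        raise ValueError("LOCAL_CACHE_VERIFY_ALLOW_DURING_BUSINESS_HOURS must be True or False")
    return {
        "frequency_hours": freq,
        "min_delay_hours": delay,
        "allow_during_business_hours": allow.lower() == "true",
    }


sample = """
# hand-edited example
LOCAL_CACHE_VERIFY_FREQUENCY_HOURS=48
LOCAL_CACHE_VERIFY_ALLOW_DURING_BUSINESS_HOURS=False
"""
print(validate_verification_settings(parse_cfg(sample)))
# {'frequency_hours': 48, 'min_delay_hours': 27, 'allow_during_business_hours': False}
```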

A future enhancement to agent orchestration will add these parameters to the UI for remote management where necessary.


What is local cache trimming?

Local cache trimming identifies ‘older’ blocks and purges them from the local cache. This prevents the local cache repository from consuming all available storage space.

During the verification process, the time stamp on every block in the local cache on the protected system is updated. This time stamp determines which blocks are considered older than others.

Protected systems may have different backup schedules and different verification times, and a single local cache can support multiple protected systems. Because of this complexity, it is not safe to delete blocks simply because they have not been seen for a few days. Local cache trimming is therefore hard-coded never to remove block data less than fourteen (14) days old, regardless of any other factor.

IMPORTANT NOTE: If the local cache device is too small to store at least fourteen (14) days of block data, the device will fill to capacity and trimming WILL NOT remove the data. Under the cache verification algorithm, data in the cache cannot be deleted until it is at least 14 days old. Therefore, you must always provide a local cache storage device large enough to store at least the most recent backup plus 14 days of incremental backups.
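The sizing rule above can be turned into a rough estimate. This minimal Python sketch covers a single protected system (sum the results for systems that share a cache). Dividing by 0.85 so that the default 15% free-space target can also be met without touching protected blocks is an assumption added for illustration, not a documented requirement, and the function name is hypothetical.

```python
def minimum_cache_size_gb(recent_backup_gb: float, daily_incremental_gb: float,
                          retention_days: int = 14,
                          free_space_fraction: float = 0.15) -> float:
    """Rough lower bound for the local cache device size, in GB.

    Data that must fit: the most recent backup plus `retention_days` of
    incremental data (blocks younger than 14 days are never trimmed).
    Dividing by (1 - free_space_fraction) also leaves room for the default
    15% free-space target (assumption).
    """
    if not 0 <= free_space_fraction < 1:
        raise ValueError("free_space_fraction must be in [0, 1)")
    protected = recent_backup_gb + retention_days * daily_incremental_gb
    return protected / (1 - free_space_fraction)


# Example: a 500 GB backup with roughly 10 GB of changed data per day.
print(round(minimum_cache_size_gb(500, 10)))  # 753
```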

Local cache trimming is enabled by default in agent 2.33 and newer.


Local cache trimming controls

You can specify how local cache trimming is performed, and you can also define the thresholds at which data will be pruned from the local cache.

  • Local cache trimming is checked at the end of every backup cycle. 
  • If the storage used by the local cache is sufficient to trigger any of the configured trimming mechanisms, then a cleanup will be performed. This will reduce the used storage below the configured threshold.
  • Blocks stored within the local cache are indexed by date and are removed one by one, from oldest to newest, until either (a) the configured threshold is achieved or (b) no blocks older than fourteen (14) days remain.
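The oldest-first loop described above can be sketched as follows. This Python example only selects blocks and does not delete anything. `CachedBlock`, `trim`, and the `threshold_met` callback are hypothetical; the callback lets a free-space rule or a maximum-size rule plug into the same loop.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, List

MIN_TRIM_AGE = timedelta(days=14)  # hard-coded floor described in this article


@dataclass
class CachedBlock:
    block_id: str
    size_bytes: int
    last_seen: datetime  # timestamp refreshed during verification


def trim(blocks: List[CachedBlock], now: datetime,
         threshold_met: Callable[[int], bool]) -> List[CachedBlock]:
    """Select blocks oldest-first until threshold_met(bytes_freed) is True or
    only blocks younger than 14 days remain. Returns the selected blocks."""
    removed: List[CachedBlock] = []
    freed = 0
    for block in sorted(blocks, key=lambda b: b.last_seen):
        if threshold_met(freed):
            break
        if now - block.last_seen < MIN_TRIM_AGE:
            break  # this and every later block are protected by the 14-day floor
        removed.append(block)
        freed += block.size_bytes
    return removed


now = datetime(2022, 7, 27)
blocks = [CachedBlock("a", 100, now - timedelta(days=30)),
          CachedBlock("b", 100, now - timedelta(days=20)),
          CachedBlock("c", 100, now - timedelta(days=3))]
# 250 bytes are requested, but block "c" is only 3 days old and cannot be trimmed.
print([b.block_id for b in trim(blocks, now, lambda freed: freed >= 250)])  # ['a', 'b']
```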

NOTE:  The following configuration information is provided for reference only.  We recommend using the default behavior of local cache trimming at this time (which will maintain 15% free space on the local cache device.)

Planned enhancements to local cache trimming: Management of trimming controls will be included in a future release of x360Recover, with additional settings added to the agent configuration feature on the Protected System Details page.


Define percentage of free space remaining

The default method of local cache cleanup is based on the free space remaining on the local cache path.

  • By default, local cache trimming attempts to maintain a minimum of 15% free space on the target device. 
  • If less than 15% of the target device is free, blocks will be removed from the local cache until sufficient free space is achieved.

The percentage of free space remaining can be configured with this setting in aristos.cfg:

LOCAL_CACHE_MIN_FREE_SPACE_PERCENTAGE=<percentage> 

This metric cannot be disabled. Valid values range from 1 to 95 percent.
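The free-space rule can be expressed as "how many bytes must trimming release?" The Python sketch below uses a hypothetical function name and validates the documented 1 to 95 percent range. On a real system, the standard library's `shutil.disk_usage(path)` returns the total, used, and free bytes for the local cache path, which could supply the inputs.

```python
def bytes_to_free(total_bytes: int, free_bytes: int, min_free_pct: int = 15) -> int:
    """Bytes that trimming must release to restore min_free_pct% free space."""
    if not 1 <= min_free_pct <= 95:
        raise ValueError("LOCAL_CACHE_MIN_FREE_SPACE_PERCENTAGE must be between 1 and 95")
    required_free = total_bytes * min_free_pct // 100
    return max(0, required_free - free_bytes)


GB = 10**9
# A 1,000 GB device with 100 GB free is below the 15% target by 50 GB.
print(bytes_to_free(1000 * GB, 100 * GB) // GB)  # 50
print(bytes_to_free(1000 * GB, 400 * GB))        # 0
```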


Define a maximum cache size

  • This option is not enabled by default.  

You can specify the maximum amount of storage space, in GB, that the local cache is allowed to consume.

If total storage size for the local cache exceeds your specified amount, blocks will be purged from the local cache until the used space is equal to 95% of your specified limit.

NOTE: If the specified maximum storage amount is found to be insufficient to store at least fourteen (14) days of block data, then blocks less than fourteen (14) days old will not be purged, even if the device runs out of space.

Maximum cache size can be configured by setting:

LOCAL_CACHE_LIMIT_GB=<GB>

This metric has no default value and is not enabled by default.
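The maximum-size rule (purge down to 95% of the limit once it is exceeded) can be sketched as below. The article does not say whether "GB" means 10^9 or 2^30 bytes; this example assumes 10^9. The 14-day floor, which can stop trimming before the target is reached, is not modeled, and the function name is hypothetical.

```python
GB = 10**9  # assumption: the article does not define GB as 10**9 or 2**30 bytes


def purge_amount(used_bytes: int, limit_gb: int) -> int:
    """Bytes to purge when the cache exceeds LOCAL_CACHE_LIMIT_GB.

    Once the limit is exceeded, trimming works down to 95% of the limit.
    """
    if limit_gb <= 0:
        raise ValueError("LOCAL_CACHE_LIMIT_GB must be positive")
    limit_bytes = limit_gb * GB
    if used_bytes <= limit_bytes:
        return 0
    target = limit_bytes * 95 // 100
    return used_bytes - target


print(purge_amount(520 * GB, 500) // GB)  # 45
print(purge_amount(480 * GB, 500))        # 0
```

Trimming to 95% rather than exactly to the limit leaves headroom, so the next backup does not immediately push the cache over the limit again.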


Define a maximum age

  • This option is not enabled by default. 

Similar to setting a backup retention policy to retain a certain number of days of backups, this option aims to retain only block data that is not older than a specified number of days.  

  • At the end of each backup cycle, all blocks within the cache which are older than the specified age will be purged.

Note: This value cannot be set to less than the minimum trimming age + 1 day (Fifteen (15) days by default.)  If this value is found to be set to less than the minimum value, it will automatically be removed from the configuration and disabled.

Maximum age can be configured by setting:

LOCAL_CACHE_TRIMMING_MAXIMUM_AGE_DAYS=<days>

This metric has no default value and is not enabled by default.
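The validation rule for the maximum age (the value must be at least the 14-day minimum trimming age plus one day) and the purge test can be sketched as follows. This sketch only returns `None` for a disabled setting and does not edit the configuration file. Both function names are hypothetical.

```python
from typing import Optional

MIN_TRIM_AGE_DAYS = 14  # hard-coded floor for local cache trimming


def effective_max_age(configured_days: Optional[int]) -> Optional[int]:
    """Return the maximum age in force, or None if the option is disabled."""
    if configured_days is None:
        return None  # not configured: option disabled
    if configured_days < MIN_TRIM_AGE_DAYS + 1:
        return None  # below 15 days: removed from the configuration and disabled
    return configured_days


def is_expired(block_age_days: float, max_age_days: int) -> bool:
    """Blocks older than the configured maximum age are purged after each backup."""
    return block_age_days > max_age_days


print(effective_max_age(10))  # None
print(effective_max_age(30))  # 30
print(is_expired(31, 30))     # True
```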


Local cache reporting and monitoring

Scan results from local cache verification are sent to the backup server for reporting. 

The results of the latest local cache verification are displayed on the Protected Systems page in the Status column.

Additional details about the local cache verification are displayed on the Protected System Details page.


What do the status icons mean?

Green indicates that local cache testing passed successfully
Yellow indicates that local cache testing has passed successfully in the past, but the most recent successful test is more than 48 hours (and no more than 72 hours) older than the scheduled testing interval. (Beyond 72 hours, the status is elevated to red.)
Red indicates that the most recent local cache test failed. More than 1% of total data blocks on the protected system were not found in the local cache. Missing blocks should be pushed into the cache on the next backup and the next local cache verification test should then be successful.
Gray indicates that (a) local cache is not enabled for this system, or (b) local cache verification testing has not yet been performed.
  • Direct-to-Cloud endpoints are expected to have local cache enabled. 
  • Non-Direct-to-Cloud endpoints are not expected to have local cache enabled and will not show the gray status icon.
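The color rules above can be summarized as a function of the testing interval and the time since the last successful test. This is a sketch, not the server's logic. It assumes that a failed most recent run takes precedence (as in the Red description) and that the boundaries are strict "more than" comparisons. The function name is hypothetical.

```python
from typing import Optional


def cache_status(cache_enabled: bool, hours_since_success: Optional[float],
                 last_run_failed: bool, interval_hours: int = 24) -> str:
    """Map verification results to a status color (boundaries are assumptions)."""
    if not cache_enabled:
        return "gray"    # local cache not enabled for this system
    if last_run_failed:
        return "red"     # more than 1% of blocks missing on the most recent run
    if hours_since_success is None:
        return "gray"    # verification has not been performed yet
    if hours_since_success > interval_hours + 72:
        return "red"     # last success too old: elevated to failed
    if hours_since_success > interval_hours + 48:
        return "yellow"  # last success 48-72 hours past the interval
    return "green"


for hours in (10, 80, 100):
    print(hours, cache_status(True, hours, last_run_failed=False))
# 10 green
# 80 yellow
# 100 red
```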

Troubleshooting

Local cache is disabled

The status icon will be gray if you have not yet enabled local cache for your Direct-to-Cloud (D2C) endpoints (or if local cache verification has not yet been run). 

  • D2C is intended to leverage local cache. This allows you to perform rapid recovery and virtualization using Recovery Center and provide the full 'no-hardware BDR experience' to protected systems. If you have not yet configured local cache for your endpoints, refer to this knowledge base article for assistance: Local cache for D2C

If you have already configured local cache and are receiving a gray icon, verify that the agent installed on the endpoint is version 2.31.877 or higher.
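When checking agent versions in bulk (for example, from an exported list), compare the version parts as numbers rather than as text. This minimal Python sketch assumes purely numeric dotted version strings, and the function names are hypothetical.

```python
MIN_AGENT_VERSION = "2.31.877"


def version_tuple(version: str) -> tuple:
    """'2.31.877' -> (2, 31, 877); assumes purely numeric dotted versions."""
    return tuple(int(part) for part in version.split("."))


def supports_verification(agent_version: str) -> bool:
    return version_tuple(agent_version) >= version_tuple(MIN_AGENT_VERSION)


print(supports_verification("2.31.877"))  # True
print(supports_verification("2.31.90"))   # False
print(supports_verification("2.33.1"))    # True
```

A plain string comparison would wrongly rank "2.31.90" above "2.31.877", because the character "9" sorts after "8".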

Local cache warning: The most recent successful test is 48-72 hours older than the configured verification testing frequency.

A yellow status icon indicates that (a) at least one successful local cache verification test has been performed in the past, but (b) the most recent successful test is 48-72 hours older than the configured verification testing frequency. 

Example: The default testing interval is every 24 hours. A yellow icon indicates that the most recent successful test is at least 72 hours old (but no more than 96 hours old).

There may be no more recent test attempt, or the most recent attempt may have failed. If the most recent successful test is more than (72 + testing frequency) hours old, the status is elevated to failed.

Local cache failed: More than 1% of the total data blocks were found to be missing from the cache during the verification run, and/or the last successful test is more than (72 + testing interval) hours old.

A red icon indicates that the local cache is considered to be in a failed state. 

If more than 1% of total data blocks were found to be missing from the cache during the verification run, and/or the last successful verification is more than (72 + testing interval) hours old, the local cache is considered to be in a failed state.

  • Check to make sure the local cache storage device is accessible and not out of storage space.  
  • Contact Axcient Support for additional assistance in troubleshooting local cache failures.

 


 SUPPORT | 720-204-4500 | 800-352-0248
