HP D2D Best Practices for VTL, NAS and Replication implementations (EH985- - Page 55

Seeding and why it is required - appliance state not running


  • Amount of data in each backup
  • Data change per backup (deduplication ratio)
  • Number of D2D systems replicating
  • Number of concurrent replication jobs from each source
  • Number of concurrent replication jobs to each target
As a general rule of thumb, however, a minimum bandwidth of 2 Mb/s per replication job should be allowed.
For example, if a replication target is capable of accepting 8 concurrent replication jobs (HP D2D4112) and
there are enough concurrently running source jobs to reach that maximum, the WAN link needs to provide
16 Mb/s (8 jobs × 2 Mb/s) for replication to run correctly at maximum efficiency. Below this threshold,
replication jobs will begin to pause and restart due to link contention. Note that this minimum value does
not ensure that replication will meet the performance requirements of the replication solution; considerably
more bandwidth may be required to deliver optimal performance.
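The rule of thumb above can be sketched as a small calculation. This is an illustrative sketch only: the function names are hypothetical, and the only figures taken from the text are the 2 Mb/s per-job minimum and the 8-job D2D4112 example.

```python
# Rule of thumb from the text: allow at least 2 Mb/s per concurrent replication job.
MIN_MBPS_PER_JOB = 2


def min_wan_bandwidth_mbps(concurrent_jobs, per_job_mbps=MIN_MBPS_PER_JOB):
    """Return the minimum WAN bandwidth (Mb/s) for the given number of concurrent jobs."""
    if concurrent_jobs < 0:
        raise ValueError("concurrent_jobs must be non-negative")
    return concurrent_jobs * per_job_mbps


def link_is_sufficient(link_mbps, concurrent_jobs):
    """True if the link meets the rule-of-thumb minimum (not a performance guarantee)."""
    return link_mbps >= min_wan_bandwidth_mbps(concurrent_jobs)


# Target accepting 8 concurrent replication jobs, as in the example above.
print(min_wan_bandwidth_mbps(8))   # 16
print(link_is_sufficient(10, 8))   # False
```

A link that passes this check only meets the stated minimum; it says nothing about whether replication finishes within a required window.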
Seeding and why it is required
One of the benefits of deduplication is the ability to identify unique data, which then enables us to replicate
between a source and a target D2D, only transferring the unique data identified. This process only requires low
bandwidth WAN links, which is a great advantage to the customer because it delivers automated disaster
recovery in a very cost-effective manner.
However, prior to being able to replicate only unique data between source and target D2D, we must first ensure
that each site has the same hash codes or "bulk data" loaded on it. This can be thought of as the reference data
against which future backups are compared to see if the hash codes exist already on either source or target. The
process of getting the same bulk data or reference data loaded on the D2D source and D2D target is known as
"seeding".
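To make the role of the reference data concrete, the following minimal sketch shows hash-based replication in general terms. It is not the D2D implementation: the fixed 4-byte chunks, the SHA-256 hash, and the in-memory dictionary standing in for the target store are all assumptions chosen for illustration.

```python
import hashlib

CHUNK_SIZE = 4  # tiny illustrative chunk size (assumption); real systems use far larger chunks


def chunk_hashes(data, chunk_size=CHUNK_SIZE):
    """Split data into fixed-size chunks and return (hash, chunk) pairs."""
    return [
        (hashlib.sha256(data[i:i + chunk_size]).hexdigest(), data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


def replicate(data, target_store):
    """Send only chunks whose hash the target lacks; return the bytes transferred."""
    sent = 0
    for digest, chunk in chunk_hashes(data):
        if digest not in target_store:
            target_store[digest] = chunk  # unique data: must cross the link
            sent += len(chunk)
    return sent


target = {}                          # hypothetical, initially unseeded target
first_backup = b"AAAABBBBCCCCDDDD"
second_backup = b"AAAABBBBCCCCEEEE"  # one chunk differs from the first backup

seeding_bytes = replicate(first_backup, target)  # unseeded: every chunk is sent
steady_bytes = replicate(second_backup, target)  # seeded: only the new chunk is sent
print(seeding_bytes, steady_bytes)               # 16 4
```

The first call plays the part of seeding: nothing matches, so all of the data is transferred. Once the target holds the reference chunks, the second call transfers only the chunk it has not seen before.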
Seeding is generally a one-time operation which must take place before steady-state, low-bandwidth
replication can commence. Seeding can take place in a number of ways:
  • Over the WAN link, although this can take some time for large volumes of data
  • Using co-location, where two devices are physically in the same location and can use a GbE replication
    link for seeding. After seeding is complete, one unit is physically shipped to its permanent destination.
  • Using a form of removable media (physical tape or portable USB disks) to "ship data" between sites.
Once seeding is complete there will typically be a 90+% hit rate, meaning most of the hash codes are already
loaded on the source and target and only the unique data will be transferred during replication.
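As a rough illustration of what a hit rate means for the link, the sketch below treats the hit rate as the fraction of a backup (by size) that the target already holds. That is a simplification: it ignores the overhead of exchanging hash codes, and the 500 GB backup size is a hypothetical figure.

```python
def replicated_gb(backup_gb, hit_rate):
    """Estimate data sent over the link when hit_rate of the backup is already on the target."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return backup_gb * (1.0 - hit_rate)


# Hypothetical 500 GB backup, before seeding (0% hits) and after (90% hits).
for rate in (0.0, 0.90):
    print(f"hit rate {rate:.0%}: {replicated_gb(500, rate):.1f} GB to transfer")
```

Under these assumptions a 90% hit rate cuts the transfer for a 500 GB backup to about 50 GB.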
It is good practice to allow for seeding time in your D2D deployment plan, as seeding can be very
time-consuming or manually intensive.
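When planning, a first-order estimate of seeding time can help in choosing between the methods listed above. The sketch below is a simple line-rate calculation, not a D2D sizing tool: the 2 TB seed volume is hypothetical, decimal units are assumed (1 GB = 8000 Mb), and real throughput will be lower than raw link speed because of protocol overhead and disk activity.

```python
def transfer_hours(data_gb, link_mbps, efficiency=1.0):
    """Hours needed to move data_gb over a link of link_mbps (1 GB = 8000 Mb)."""
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_mbps must be positive and efficiency in (0, 1]")
    megabits = data_gb * 8000
    return megabits / (link_mbps * efficiency) / 3600


# Hypothetical 2 TB (2000 GB) seed over two kinds of link.
for name, mbps in [("16 Mb/s WAN", 16), ("GbE co-location", 1000)]:
    print(f"{name}: {transfer_hours(2000, mbps):.1f} h")
```

With these figures the WAN transfer takes roughly 278 hours (more than 11 days), against about 4.4 hours at GbE line rate, which is why co-location or removable media is worth considering for large initial data sets.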
During the seeding process it is recommended that no other operations, such as further backups or tape
copies, take place on the source D2D. It is also important to ensure that the D2D has no failed disks and
that RAID parity initialization is complete, because either condition will degrade performance.
When seeding over fast networks (co-located D2D devices), expect the performance when replicating a
cartridge or file to be similar to that of the original backup. If, however, many replication jobs are
running to a single target appliance from several source appliances, performance will be reduced due to the
amount of disk activity required on the target system.
Replication models and seeding
The diagrams in Replication usage models starting on page 49 indicate the different replication models
supported by HP D2D Backup Systems; the complexity of the replication model has a direct influence on which
seeding process is best. For example, an Active-Passive replication model can easily use co-location to quickly
seed the target device, whereas co-location may not be the best seeding method to use with a 50:1 many-to-one
replication model.