Introduction
============

This document describes a collection of device-mapper targets that
between them implement thin-provisioning and snapshots.

The main highlight of this implementation, compared to the previous
implementation of snapshots, is that it allows many virtual devices to
be stored on the same data volume.  This simplifies administration and
allows the sharing of data between volumes, thus reducing disk usage.

Another significant feature is support for an arbitrary depth of
recursive snapshots (snapshots of snapshots of snapshots ...).  The
previous implementation of snapshots did this by chaining together
lookup tables, and so performance was O(depth).  This new
implementation uses a single data structure to avoid this degradation
with depth.  Fragmentation may still be an issue, however, in some
scenarios.

Metadata is stored on a separate device from data, giving the
administrator some freedom, for example to:

- Improve metadata resilience by storing metadata on a mirrored volume
  but data on a non-mirrored one.

- Improve performance by storing the metadata on SSD.

Status
======

These targets are considered safe for production use.  But different
use cases will have different performance characteristics, for example
due to fragmentation of the data volume.

If you find this software is not performing as expected please mail
dm-devel@redhat.com with details and we'll try our best to improve
things for you.

Userspace tools for checking and repairing the metadata have been fully
developed and are available as 'thin_check' and 'thin_repair'.  The
name of the package that provides these utilities varies by
distribution (on a Red Hat distribution it is named
'device-mapper-persistent-data').

Cookbook
========

This section describes some quick recipes for using thin provisioning.
They use the dmsetup program to control the device-mapper driver
directly.  End users will be advised to use a higher-level volume
manager such as LVM2 once support has been added.

Pool device
-----------

The pool device ties together the metadata volume and the data volume.
It maps I/O linearly to the data volume and updates the metadata via
two mechanisms:

- Function calls from the thin targets

- Device-mapper 'messages' from userspace which control the creation of
  new virtual devices amongst other things.

Setting up a fresh pool device
------------------------------

Setting up a pool device requires a valid metadata device, and a
data device.  If you do not have an existing metadata device you can
make one by zeroing the first 4k to indicate empty metadata.

    dd if=/dev/zero of=$metadata_dev bs=4096 count=1

The amount of metadata you need will vary according to how many blocks
are shared between thin devices (i.e. through snapshots).  If you have
less sharing than average you'll need a larger-than-average metadata
device.

As a guide, we suggest you calculate the number of bytes to use in the
metadata device as 48 * $data_dev_size / $data_block_size but round it
up to 2MB if the answer is smaller.  If you're creating large numbers
of snapshots which are recording large amounts of change, you may find
you need to increase this.
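As a purely illustrative worked example of the guideline above (the
1TiB size and the helper variables below are assumptions for the sake
of the example, not part of any interface): a 1TiB data device carved
into 64KiB blocks contains 16777216 blocks, so the guide suggests
48 * 16777216 = 805306368 bytes (768MB) of metadata.  The same
arithmetic can be scripted along these lines:

    data_dev_size=$(blockdev --getsize64 "$data_dev")  # data device size in bytes
    data_block_size=65536                               # chosen block size in bytes (128 sectors)
    meta_bytes=$(( 48 * data_dev_size / data_block_size ))
    # Round up to the suggested 2MB minimum.
    min_bytes=$(( 2 * 1024 * 1024 ))
    [ "$meta_bytes" -lt "$min_bytes" ] && meta_bytes=$min_bytes
    echo "suggested metadata device size: $meta_bytes bytes"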
The largest size supported is 16GB: if the device is larger, a warning
will be issued and the excess space will not be used.

Reloading a pool table
----------------------

You may reload a pool's table; indeed, this is how the pool is resized
if it runs out of space.  (N.B. While specifying a different metadata
device when reloading is not forbidden at the moment, things will go
wrong if it does not route I/O to exactly the same on-disk location as
previously.)
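A minimal sketch of such a resize, assuming the data volume has already
been grown to 41943040 sectors (20GiB) and that all other parameters
(in particular the metadata device, per the caveat above) are exactly
those used when the pool was first created:

    dmsetup suspend pool
    dmsetup reload pool --table "0 41943040 thin-pool $metadata_dev \
        $data_dev $data_block_size $low_water_mark"
    dmsetup resume pool

The newly loaded table only takes effect when the device is resumed.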
Using an existing pool device
-----------------------------

    dmsetup create pool \
        --table "0 20971520 thin-pool $metadata_dev $data_dev \
        $data_block_size $low_water_mark"

$data_block_size gives the smallest unit of disk space that can be
allocated at a time, expressed in units of 512-byte sectors.
$data_block_size must be between 128 (64KB) and 2097152 (1GB) and a
multiple of 128 (64KB).  $data_block_size cannot be changed after the
thin-pool is created.  People primarily interested in thin provisioning
may want to use a value such as 1024 (512KB).  People doing lots of
snapshotting may want a smaller value such as 128 (64KB).  If you are
not zeroing newly-allocated data, a larger $data_block_size in the
region of 256000 (128MB) is suggested.

$low_water_mark is expressed in blocks of size $data_block_size.  If
free space on the data device drops below this level then a dm event
will be triggered which a userspace daemon should catch, allowing it to
extend the pool device.  Only one such event will be sent.

No special event is triggered if a just-resumed device's free space is
below the low water mark.  However, resuming a device always triggers
an event; a userspace daemon should verify that free space exceeds the
low water mark when handling this event.

A low water mark for the metadata device is maintained in the kernel
and will trigger a dm event if free space on the metadata device drops
below it.
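To make these units concrete, here is a purely illustrative table (the
device paths and numbers are examples, not recommendations): a 10GiB
pool (20971520 sectors) with 512KB blocks ($data_block_size = 1024) and
a low water mark of 2048 blocks raises a dm event once less than
2048 * 512KB = 1GB of unallocated data space remains:

    dmsetup create pool \
        --table "0 20971520 thin-pool /dev/mapper/meta /dev/mapper/data 1024 2048"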
Updating on-disk metadata
-------------------------

On-disk metadata is committed every time a FLUSH or FUA bio is written.
If no such requests are made then commits will occur every second.  This
means the thin-provisioning target behaves like a physical disk that has
a volatile write cache.  If power is lost you may lose some recent
writes.  The metadata should always be consistent in spite of any crash.

If data space is exhausted the pool will either error or queue IO
according to the configuration (see: error_if_no_space).  If metadata
space is exhausted or a metadata operation fails, the pool will error IO
until the pool is taken offline and repair is performed to 1) fix any
potential inconsistencies and 2) clear the flag that imposes repair.
Once the pool's metadata device is repaired it may be resized, which
will allow the pool to return to normal operation.  Note that if a pool
is flagged as needing repair, the pool's data and metadata devices
cannot be resized until repair is performed.  It should also be noted
that when the pool's metadata space is exhausted the current metadata
transaction is aborted.  Given that the pool will cache IO whose
completion may have already been acknowledged to upper IO layers
(e.g. filesystem) it is strongly suggested that consistency checks
(e.g. fsck) be performed on those layers when repair of the pool is
required.

Thin provisioning
-----------------

i) Creating a new thinly-provisioned volume.

  To create a new thinly-provisioned volume you must send a message to
  an active pool device, /dev/mapper/pool in this example.

    dmsetup message /dev/mapper/pool 0 "create_thin 0"

  Here '0' is an identifier for the volume, a 24-bit number.  It's up
  to the caller to allocate and manage these identifiers.  If the
  identifier is already in use, the message will fail with -EEXIST.

ii) Using a thinly-provisioned volume.

  Thinly-provisioned volumes are activated using the 'thin' target:

    dmsetup create thin --table "0 2097152 thin /dev/mapper/pool 0"

  The last parameter is the identifier for the thinp device.

Internal snapshots
------------------

i) Creating an internal snapshot.

  Snapshots are created with another message to the pool.

  N.B. If the origin device that you wish to snapshot is active, you
  must suspend it before creating the snapshot to avoid corruption.
  This is NOT enforced at the moment, so please be careful!

    dmsetup suspend /dev/mapper/thin
    dmsetup message /dev/mapper/pool 0 "create_snap 1 0"
    dmsetup resume /dev/mapper/thin

  Here '1' is the identifier for the volume, a 24-bit number.  '0' is
  the identifier for the origin device.

ii) Using an internal snapshot.

  Once created, the user doesn't have to worry about any connection
  between the origin and the snapshot.  Indeed the snapshot is no
  different from any other thinly-provisioned device and can be
  snapshotted itself via the same method.  It's perfectly legal to
  have only one of them active, and there's no ordering requirement on
  activating or removing them both.  (This differs from conventional
  device-mapper snapshots.)

  Activate it exactly the same way as any other thinly-provisioned
  volume:

    dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 1"

External snapshots
------------------

You can use an external _read only_ device as an origin for a
thinly-provisioned volume.  Any read to an unprovisioned area of the
thin device will be passed through to the origin.  Writes trigger
the allocation of new blocks as usual.

One use case for this is VM hosts that want to run guests on
thinly-provisioned volumes but have the base image on another device
(possibly shared between many VMs).

You must not write to the origin device if you use this technique!
Of course, you may write to the thin device and take internal snapshots
of the thin volume.

i) Creating a snapshot of an external device

  This is the same as creating a thin device.
  You don't mention the origin at this stage.

    dmsetup message /dev/mapper/pool 0 "create_thin 0"

ii) Using a snapshot of an external device.

  Append an extra parameter to the thin target specifying the origin:

    dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 0 /dev/image"

  N.B. All descendants (internal snapshots) of this snapshot require
  the same extra origin parameter.

Deactivation
------------

All devices using a pool must be deactivated before the pool itself
can be.

    dmsetup remove thin
    dmsetup remove snap
    dmsetup remove pool

Reference
=========

'thin-pool' target
------------------

i) Constructor

    thin-pool <metadata dev> <data dev> <data block size (sectors)> \
              <low water mark (blocks)> [<number of feature args> [<arg>]*]

  Optional feature arguments:

    skip_block_zeroing: Skip the zeroing of newly-provisioned blocks.

    ignore_discard: Disable discard support.

    no_discard_passdown: Don't pass discards down to the underlying
                         data device, but just remove the mapping.

    read_only: Don't allow any changes to be made to the pool
               metadata.  This mode is only available after the
               thin-pool has been created and first used in full
               read/write mode.  It cannot be specified on initial
               thin-pool creation.

    error_if_no_space: Error IOs, instead of queueing, if no space.

  Data block size must be between 64KB (128 sectors) and 1GB
  (2097152 sectors) inclusive.
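As an illustration of the feature-argument syntax (the parameter values
are placeholders, not recommendations): a pool created with block
zeroing skipped and discard passdown disabled passes the argument count
'2' followed by the two feature names:

    dmsetup create pool \
        --table "0 20971520 thin-pool $metadata_dev $data_dev 128 32768 \
        2 skip_block_zeroing no_discard_passdown"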
ii) Status

    <transaction id> <used metadata blocks>/<total metadata blocks>
    <used data blocks>/<total data blocks> <held metadata root>
    ro|rw|out_of_data_space [no_]discard_passdown [error|queue]_if_no_space
    needs_check|- metadata_low_watermark

  transaction id:
    A 64-bit number used by userspace to help synchronise with metadata
    from volume managers.

  used data blocks / total data blocks
    If the number of free blocks drops below the pool's low water mark a
    dm event will be sent to userspace.  This event is edge-triggered and
    it will occur only once after each resume so volume manager writers
    should register for the event and then check the target's status.

  held metadata root:
    The location, in blocks, of the metadata root that has been
    'held' for userspace read access.  '-' indicates there is no
    held root.

  discard_passdown|no_discard_passdown
    Whether or not discards are actually being passed down to the
    underlying device.  If this is enabled when loading the table,
    it can get disabled if the underlying device doesn't support it.

  ro|rw|out_of_data_space
    If the pool encounters certain types of device failures it will
    drop into a read-only metadata mode in which no changes to
    the pool metadata (like allocating new blocks) are permitted.

    In serious cases where even a read-only mode is deemed unsafe
    no further I/O will be permitted and the status will just
    contain the string 'Fail'.  The userspace recovery tools
    should then be used.

  error_if_no_space|queue_if_no_space
    If the pool runs out of data or metadata space, the pool will
    either queue or error the IO destined to the data device.  The
    default is to queue the IO until more space is added or the
    'no_space_timeout' expires.  The 'no_space_timeout' dm-thin-pool
    module parameter can be used to change this timeout -- it
    defaults to 60 seconds but may be disabled using a value of 0.

  needs_check
    A metadata operation has failed, resulting in the needs_check
    flag being set in the metadata's superblock.  The metadata
    device must be deactivated and checked/repaired before the
    thin-pool can be made fully operational again.  '-' indicates
    needs_check is not set.

  metadata_low_watermark:
    Value of metadata low watermark in blocks.  The kernel sets this
    value internally but userspace needs to know this value to
    determine if an event was caused by crossing this threshold.

iii) Messages

  create_thin <dev id>
    Create a new thinly-provisioned device.
    <dev id> is an arbitrary unique 24-bit identifier chosen by
    the caller.

  create_snap <dev id> <origin id>
    Create a new snapshot of another thinly-provisioned device.
    <dev id> is an arbitrary unique 24-bit identifier chosen by
    the caller.
    <origin id> is the identifier of the thinly-provisioned device
    of which the new device will be a snapshot.

  delete <dev id>
    Deletes a thin device.  Irreversible.

  set_transaction_id <current id> <new id>
    Userland volume managers, such as LVM, need a way to
    synchronise their external metadata with the internal metadata of
    the pool target.  The thin-pool target offers to store an
    arbitrary 64-bit transaction id and return it on the target's
    status line.  To avoid races you must provide what you think
    the current transaction id is when you change it with this
    compare-and-swap message.

  reserve_metadata_snap
    Reserve a copy of the data mapping btree for use by userland.
    This allows userland to inspect the mappings as they were when
    this message was executed.  Use the pool's status command to
    get the root block associated with the metadata snapshot.

  release_metadata_snap
    Release a previously reserved copy of the data mapping btree.

'thin' target
-------------

i) Constructor

    thin <pool dev> <dev id> [<external origin dev>]

  pool dev:
    the thin-pool device, e.g. /dev/mapper/my_pool or 253:0

  dev id:
    the internal device identifier of the device to be
    activated.

  external origin dev:
    an optional block device outside the pool to be treated as a
    read-only snapshot origin: reads to unprovisioned areas of the
    thin target will be mapped to this device.

The pool doesn't store any size against the thin devices.  If you
load a thin target that is smaller than you've been using previously,
then you'll have no access to blocks mapped beyond the end.  If you
load a target that is bigger than before, then extra blocks will be
provisioned as and when needed.

ii) Status

    <nr mapped sectors> <highest mapped sector>

  If the pool has encountered device errors and failed, the status
  will just contain the string 'Fail'.  The userspace recovery
  tools should then be used.

  In the case where <nr mapped sectors> is 0, there is no highest
  mapped sector and the value of <highest mapped sector> is
  unspecified.
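For orientation, querying the thin device from the cookbook with

    dmsetup status thin

might print something like the following (the mapping counts here are
invented purely for illustration; the leading start/length/target
fields are added by dmsetup itself):

    0 2097152 thin 1843200 2097151

i.e. 1843200 of the device's 2097152 sectors are mapped and the highest
mapped sector is 2097151, the last sector of the device.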