aboutsummaryrefslogtreecommitdiff
path: root/Documentation
diff options
context:
space:
mode:
authorLinus Torvalds2015-04-15 16:39:15 -0700
committerLinus Torvalds2015-04-15 16:39:15 -0700
commiteea3a00264cf243a28e4331566ce67b86059339d (patch)
tree487f16389e0dfa32e9caa7604d1274a7dcda8f04 /Documentation
parente7c82412433a8039616c7314533a0a1c025d99bf (diff)
parente693d73c20ffdb06840c9378f367bad849ac0d5d (diff)
Merge branch 'akpm' (patches from Andrew)
Merge second patchbomb from Andrew Morton: - the rest of MM - various misc bits - add ability to run /sbin/reboot at reboot time - printk/vsprintf changes - fiddle with seq_printf() return value * akpm: (114 commits) parisc: remove use of seq_printf return value lru_cache: remove use of seq_printf return value tracing: remove use of seq_printf return value cgroup: remove use of seq_printf return value proc: remove use of seq_printf return value s390: remove use of seq_printf return value cris fasttimer: remove use of seq_printf return value cris: remove use of seq_printf return value openrisc: remove use of seq_printf return value ARM: plat-pxa: remove use of seq_printf return value nios2: cpuinfo: remove use of seq_printf return value microblaze: mb: remove use of seq_printf return value ipc: remove use of seq_printf return value rtc: remove use of seq_printf return value power: wakeup: remove use of seq_printf return value x86: mtrr: if: remove use of seq_printf return value linux/bitmap.h: improve BITMAP_{LAST,FIRST}_WORD_MASK MAINTAINERS: CREDITS: remove Stefano Brivio from B43 .mailmap: add Ricardo Ribalda CREDITS: add Ricardo Ribalda Delgado ...
Diffstat (limited to 'Documentation')
-rw-r--r--Documentation/ABI/obsolete/sysfs-block-zram119
-rw-r--r--Documentation/ABI/testing/sysfs-block-zram25
-rw-r--r--Documentation/blockdev/zram.txt87
-rw-r--r--Documentation/filesystems/Locking8
-rw-r--r--Documentation/printk-formats.txt49
-rw-r--r--Documentation/sysctl/vm.txt11
-rw-r--r--Documentation/vm/hugetlbpage.txt55
-rw-r--r--Documentation/vm/unevictable-lru.txt12
-rw-r--r--Documentation/vm/zsmalloc.txt70
9 files changed, 393 insertions, 43 deletions
diff --git a/Documentation/ABI/obsolete/sysfs-block-zram b/Documentation/ABI/obsolete/sysfs-block-zram
new file mode 100644
index 000000000000..720ea92cfb2e
--- /dev/null
+++ b/Documentation/ABI/obsolete/sysfs-block-zram
@@ -0,0 +1,119 @@
+What: /sys/block/zram<id>/num_reads
+Date: August 2015
+Contact: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
+Description:
+ The num_reads file is read-only and specifies the number of
+ reads (failed or successful) done on this device.
+ Now accessible via zram<id>/stat node.
+
+What: /sys/block/zram<id>/num_writes
+Date: August 2015
+Contact: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
+Description:
+ The num_writes file is read-only and specifies the number of
+ writes (failed or successful) done on this device.
+ Now accessible via zram<id>/stat node.
+
+What: /sys/block/zram<id>/invalid_io
+Date: August 2015
+Contact: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
+Description:
+ The invalid_io file is read-only and specifies the number of
+ non-page-size-aligned I/O requests issued to this device.
+ Now accessible via zram<id>/io_stat node.
+
+What: /sys/block/zram<id>/failed_reads
+Date: August 2015
+Contact: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
+Description:
+ The failed_reads file is read-only and specifies the number of
+ failed reads happened on this device.
+ Now accessible via zram<id>/io_stat node.
+
+What: /sys/block/zram<id>/failed_writes
+Date: August 2015
+Contact: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
+Description:
+ The failed_writes file is read-only and specifies the number of
+ failed writes happened on this device.
+ Now accessible via zram<id>/io_stat node.
+
+What: /sys/block/zram<id>/notify_free
+Date: August 2015
+Contact: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
+Description:
+ The notify_free file is read-only. Depending on device usage
+ scenario it may account a) the number of pages freed because
+ of swap slot free notifications or b) the number of pages freed
+ because of REQ_DISCARD requests sent by bio. The former ones
+ are sent to a swap block device when a swap slot is freed, which
+ implies that this disk is being used as a swap disk. The latter
+ ones are sent by filesystem mounted with discard option,
+ whenever some data blocks are getting discarded.
+ Now accessible via zram<id>/io_stat node.
+
+What: /sys/block/zram<id>/zero_pages
+Date: August 2015
+Contact: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
+Description:
+ The zero_pages file is read-only and specifies number of zero
+ filled pages written to this disk. No memory is allocated for
+ such pages.
+ Now accessible via zram<id>/mm_stat node.
+
+What: /sys/block/zram<id>/orig_data_size
+Date: August 2015
+Contact: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
+Description:
+ The orig_data_size file is read-only and specifies uncompressed
+ size of data stored in this disk. This excludes zero-filled
+ pages (zero_pages) since no memory is allocated for them.
+ Unit: bytes
+ Now accessible via zram<id>/mm_stat node.
+
+What: /sys/block/zram<id>/compr_data_size
+Date: August 2015
+Contact: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
+Description:
+ The compr_data_size file is read-only and specifies compressed
+ size of data stored in this disk. So, compression ratio can be
+ calculated using orig_data_size and this statistic.
+ Unit: bytes
+ Now accessible via zram<id>/mm_stat node.
+
+What: /sys/block/zram<id>/mem_used_total
+Date: August 2015
+Contact: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
+Description:
+ The mem_used_total file is read-only and specifies the amount
+ of memory, including allocator fragmentation and metadata
+ overhead, allocated for this disk. So, allocator space
+ efficiency can be calculated using compr_data_size and this
+ statistic.
+ Unit: bytes
+ Now accessible via zram<id>/mm_stat node.
+
+What: /sys/block/zram<id>/mem_used_max
+Date: August 2015
+Contact: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
+Description:
+ The mem_used_max file is read/write and specifies the amount
+ of maximum memory zram have consumed to store compressed data.
+ For resetting the value, you should write "0". Otherwise,
+ you could see -EINVAL.
+ Unit: bytes
+ Downgraded to write-only node: so it's possible to set new
+ value only; its current value is stored in zram<id>/mm_stat
+ node.
+
+What: /sys/block/zram<id>/mem_limit
+Date: August 2015
+Contact: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
+Description:
+ The mem_limit file is read/write and specifies the maximum
+ amount of memory ZRAM can use to store the compressed data.
+ The limit could be changed in run time and "0" means disable
+ the limit. No limit is the initial state. Unit: bytes
+ Downgraded to write-only node: so it's possible to set new
+ value only; its current value is stored in zram<id>/mm_stat
+ node.
diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram
index a6148eaf91e5..2e69e83bf510 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -141,3 +141,28 @@ Description:
amount of memory ZRAM can use to store the compressed data. The
limit could be changed in run time and "0" means disable the
limit. No limit is the initial state. Unit: bytes
+
+What: /sys/block/zram<id>/compact
+Date: August 2015
+Contact: Minchan Kim <minchan@kernel.org>
+Description:
+ The compact file is write-only and trigger compaction for
+ allocator zrm uses. The allocator moves some objects so that
+ it could free fragment space.
+
+What: /sys/block/zram<id>/io_stat
+Date: August 2015
+Contact: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
+Description:
+ The io_stat file is read-only and accumulates device's I/O
+ statistics not accounted by block layer. For example,
+ failed_reads, failed_writes, etc. File format is similar to
+ block layer statistics file format.
+
+What: /sys/block/zram<id>/mm_stat
+Date: August 2015
+Contact: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
+Description:
+ The mm_stat file is read-only and represents device's mm
+ statistics (orig_data_size, compr_data_size, etc.) in a format
+ similar to block layer statistics file format.
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index 7fcf9c6592ec..48a183e29988 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -98,20 +98,79 @@ size of the disk when not in use so a huge zram is wasteful.
mount /dev/zram1 /tmp
7) Stats:
- Per-device statistics are exported as various nodes under
- /sys/block/zram<id>/
- disksize
- num_reads
- num_writes
- failed_reads
- failed_writes
- invalid_io
- notify_free
- zero_pages
- orig_data_size
- compr_data_size
- mem_used_total
- mem_used_max
+Per-device statistics are exported as various nodes under /sys/block/zram<id>/
+
+A brief description of exported device attritbutes. For more details please
+read Documentation/ABI/testing/sysfs-block-zram.
+
+Name access description
+---- ------ -----------
+disksize RW show and set the device's disk size
+initstate RO shows the initialization state of the device
+reset WO trigger device reset
+num_reads RO the number of reads
+failed_reads RO the number of failed reads
+num_write RO the number of writes
+failed_writes RO the number of failed writes
+invalid_io RO the number of non-page-size-aligned I/O requests
+max_comp_streams RW the number of possible concurrent compress operations
+comp_algorithm RW show and change the compression algorithm
+notify_free RO the number of notifications to free pages (either
+ slot free notifications or REQ_DISCARD requests)
+zero_pages RO the number of zero filled pages written to this disk
+orig_data_size RO uncompressed size of data stored in this disk
+compr_data_size RO compressed size of data stored in this disk
+mem_used_total RO the amount of memory allocated for this disk
+mem_used_max RW the maximum amount memory zram have consumed to
+ store compressed data
+mem_limit RW the maximum amount of memory ZRAM can use to store
+ the compressed data
+num_migrated RO the number of objects migrated migrated by compaction
+
+
+WARNING
+=======
+per-stat sysfs attributes are considered to be deprecated.
+The basic strategy is:
+-- the existing RW nodes will be downgraded to WO nodes (in linux 4.11)
+-- deprecated RO sysfs nodes will eventually be removed (in linux 4.11)
+
+The list of deprecated attributes can be found here:
+Documentation/ABI/obsolete/sysfs-block-zram
+
+Basically, every attribute that has its own read accessible sysfs node
+(e.g. num_reads) *AND* is accessible via one of the stat files (zram<id>/stat
+or zram<id>/io_stat or zram<id>/mm_stat) is considered to be deprecated.
+
+User space is advised to use the following files to read the device statistics.
+
+File /sys/block/zram<id>/stat
+
+Represents block layer statistics. Read Documentation/block/stat.txt for
+details.
+
+File /sys/block/zram<id>/io_stat
+
+The stat file represents device's I/O statistics not accounted by block
+layer and, thus, not available in zram<id>/stat file. It consists of a
+single line of text and contains the following stats separated by
+whitespace:
+ failed_reads
+ failed_writes
+ invalid_io
+ notify_free
+
+File /sys/block/zram<id>/mm_stat
+
+The stat file represents device's mm statistics. It consists of a single
+line of text and contains the following stats separated by whitespace:
+ orig_data_size
+ compr_data_size
+ mem_used_total
+ mem_limit
+ mem_used_max
+ zero_pages
+ num_migrated
8) Deactivate:
swapoff /dev/zram0
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index c3cd6279e92e..7c3f187d48bf 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -523,6 +523,7 @@ prototypes:
void (*close)(struct vm_area_struct*);
int (*fault)(struct vm_area_struct*, struct vm_fault *);
int (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *);
+ int (*pfn_mkwrite)(struct vm_area_struct *, struct vm_fault *);
int (*access)(struct vm_area_struct *, unsigned long, void*, int, int);
locking rules:
@@ -532,6 +533,7 @@ close: yes
fault: yes can return with page locked
map_pages: yes
page_mkwrite: yes can return with page locked
+pfn_mkwrite: yes
access: yes
->fault() is called when a previously not present pte is about
@@ -558,6 +560,12 @@ the page has been truncated, the filesystem should not look up a new page
like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which
will cause the VM to retry the fault.
+ ->pfn_mkwrite() is the same as page_mkwrite but when the pte is
+VM_PFNMAP or VM_MIXEDMAP with a page-less entry. Expected return is
+VM_FAULT_NOPAGE. Or one of the VM_FAULT_ERROR types. The default behavior
+after this call is to make the pte read-write, unless pfn_mkwrite returns
+an error.
+
->access() is called when get_user_pages() fails in
access_process_vm(), typically used to debug a process through
/proc/pid/mem or ptrace. This function is needed only for
diff --git a/Documentation/printk-formats.txt b/Documentation/printk-formats.txt
index 5a615c14f75d..cb6a596072bb 100644
--- a/Documentation/printk-formats.txt
+++ b/Documentation/printk-formats.txt
@@ -8,6 +8,21 @@ If variable is of Type, use printk format specifier:
unsigned long long %llu or %llx
size_t %zu or %zx
ssize_t %zd or %zx
+ s32 %d or %x
+ u32 %u or %x
+ s64 %lld or %llx
+ u64 %llu or %llx
+
+If <type> is dependent on a config option for its size (e.g., sector_t,
+blkcnt_t) or is architecture-dependent for its size (e.g., tcflag_t), use a
+format specifier of its largest possible type and explicitly cast to it.
+Example:
+
+ printk("test: sector number/total blocks: %llu/%llu\n",
+ (unsigned long long)sector, (unsigned long long)blockcount);
+
+Reminder: sizeof() result is of type size_t.
+
Raw pointer value SHOULD be printed with %p. The kernel supports
the following extended format specifiers for pointer types:
@@ -54,6 +69,7 @@ Struct Resources:
For printing struct resources. The 'R' and 'r' specifiers result in a
printed resource with ('R') or without ('r') a decoded flags member.
+ Passed by reference.
Physical addresses types phys_addr_t:
@@ -132,6 +148,8 @@ MAC/FDDI addresses:
specifier to use reversed byte order suitable for visual interpretation
of Bluetooth addresses which are in the little endian order.
+ Passed by reference.
+
IPv4 addresses:
%pI4 1.2.3.4
@@ -146,6 +164,8 @@ IPv4 addresses:
host, network, big or little endian order addresses respectively. Where
no specifier is provided the default network/big endian order is used.
+ Passed by reference.
+
IPv6 addresses:
%pI6 0001:0002:0003:0004:0005:0006:0007:0008
@@ -160,6 +180,8 @@ IPv6 addresses:
print a compressed IPv6 address as described by
http://tools.ietf.org/html/rfc5952
+ Passed by reference.
+
IPv4/IPv6 addresses (generic, with port, flowinfo, scope):
%pIS 1.2.3.4 or 0001:0002:0003:0004:0005:0006:0007:0008
@@ -186,6 +208,8 @@ IPv4/IPv6 addresses (generic, with port, flowinfo, scope):
specifiers can be used as well and are ignored in case of an IPv6
address.
+ Passed by reference.
+
Further examples:
%pISfc 1.2.3.4 or [1:2:3:4:5:6:7:8]/123456789
@@ -207,6 +231,8 @@ UUID/GUID addresses:
Where no additional specifiers are used the default little endian
order with lower case hex characters will be printed.
+ Passed by reference.
+
dentry names:
%pd{,2,3,4}
%pD{,2,3,4}
@@ -216,6 +242,8 @@ dentry names:
equivalent of %s dentry->d_name.name we used to use, %pd<n> prints
n last components. %pD does the same thing for struct file.
+ Passed by reference.
+
struct va_format:
%pV
@@ -231,23 +259,20 @@ struct va_format:
Do not use this feature without some mechanism to verify the
correctness of the format string and va_list arguments.
-u64 SHOULD be printed with %llu/%llx:
-
- printk("%llu", u64_var);
+ Passed by reference.
-s64 SHOULD be printed with %lld/%llx:
+struct clk:
- printk("%lld", s64_var);
+ %pC pll1
+ %pCn pll1
+ %pCr 1560000000
-If <type> is dependent on a config option for its size (e.g., sector_t,
-blkcnt_t) or is architecture-dependent for its size (e.g., tcflag_t), use a
-format specifier of its largest possible type and explicitly cast to it.
-Example:
+ For printing struct clk structures. '%pC' and '%pCn' print the name
+ (Common Clock Framework) or address (legacy clock framework) of the
+ structure; '%pCr' prints the current clock rate.
- printk("test: sector number/total blocks: %llu/%llu\n",
- (unsigned long long)sector, (unsigned long long)blockcount);
+ Passed by reference.
-Reminder: sizeof() result is of type size_t.
Thank you for your cooperation and attention.
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 902b4574acfb..9832ec52f859 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -21,6 +21,7 @@ Currently, these files are in /proc/sys/vm:
- admin_reserve_kbytes
- block_dump
- compact_memory
+- compact_unevictable_allowed
- dirty_background_bytes
- dirty_background_ratio
- dirty_bytes
@@ -106,6 +107,16 @@ huge pages although processes will also directly compact memory as required.
==============================================================
+compact_unevictable_allowed
+
+Available only when CONFIG_COMPACTION is set. When set to 1, compaction is
+allowed to examine the unevictable lru (mlocked pages) for pages to compact.
+This should be used on systems where stalls for minor page faults are an
+acceptable trade for large contiguous free memory. Set to 0 to prevent
+compaction from moving pages that are unevictable. Default value is 1.
+
+==============================================================
+
dirty_background_bytes
Contains the amount of dirty memory at which the background kernel
diff --git a/Documentation/vm/hugetlbpage.txt b/Documentation/vm/hugetlbpage.txt
index f2d3a100fe38..030977fb8d2d 100644
--- a/Documentation/vm/hugetlbpage.txt
+++ b/Documentation/vm/hugetlbpage.txt
@@ -267,21 +267,34 @@ call, then it is required that system administrator mount a file system of
type hugetlbfs:
mount -t hugetlbfs \
- -o uid=<value>,gid=<value>,mode=<value>,size=<value>,nr_inodes=<value> \
- none /mnt/huge
+ -o uid=<value>,gid=<value>,mode=<value>,pagesize=<value>,size=<value>,\
+ min_size=<value>,nr_inodes=<value> none /mnt/huge
This command mounts a (pseudo) filesystem of type hugetlbfs on the directory
/mnt/huge. Any files created on /mnt/huge uses huge pages. The uid and gid
options sets the owner and group of the root of the file system. By default
the uid and gid of the current process are taken. The mode option sets the
mode of root of file system to value & 01777. This value is given in octal.
-By default the value 0755 is picked. The size option sets the maximum value of
-memory (huge pages) allowed for that filesystem (/mnt/huge). The size is
-rounded down to HPAGE_SIZE. The option nr_inodes sets the maximum number of
-inodes that /mnt/huge can use. If the size or nr_inodes option is not
-provided on command line then no limits are set. For size and nr_inodes
-options, you can use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo. For
-example, size=2K has the same meaning as size=2048.
+By default the value 0755 is picked. If the paltform supports multiple huge
+page sizes, the pagesize option can be used to specify the huge page size and
+associated pool. pagesize is specified in bytes. If pagesize is not specified
+the paltform's default huge page size and associated pool will be used. The
+size option sets the maximum value of memory (huge pages) allowed for that
+filesystem (/mnt/huge). The size option can be specified in bytes, or as a
+percentage of the specified huge page pool (nr_hugepages). The size is
+rounded down to HPAGE_SIZE boundary. The min_size option sets the minimum
+value of memory (huge pages) allowed for the filesystem. min_size can be
+specified in the same way as size, either bytes or a percentage of the
+huge page pool. At mount time, the number of huge pages specified by
+min_size are reserved for use by the filesystem. If there are not enough
+free huge pages available, the mount will fail. As huge pages are allocated
+to the filesystem and freed, the reserve count is adjusted so that the sum
+of allocated and reserved huge pages is always at least min_size. The option
+nr_inodes sets the maximum number of inodes that /mnt/huge can use. If the
+size, min_size or nr_inodes option is not provided on command line then
+no limits are set. For pagesize, size, min_size and nr_inodes options, you
+can use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo. For example, size=2K
+has the same meaning as size=2048.
While read system calls are supported on files that reside on hugetlb
file systems, write system calls are not.
@@ -289,15 +302,23 @@ file systems, write system calls are not.
Regular chown, chgrp, and chmod commands (with right permissions) could be
used to change the file attributes on hugetlbfs.
-Also, it is important to note that no such mount command is required if the
+Also, it is important to note that no such mount command is required if
applications are going to use only shmat/shmget system calls or mmap with
-MAP_HUGETLB. Users who wish to use hugetlb page via shared memory segment
-should be a member of a supplementary group and system admin needs to
-configure that gid into /proc/sys/vm/hugetlb_shm_group. It is possible for
-same or different applications to use any combination of mmaps and shm*
-calls, though the mount of filesystem will be required for using mmap calls
-without MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see
-map_hugetlb.c.
+MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see map_hugetlb
+below.
+
+Users who wish to use hugetlb memory via shared memory segment should be a
+member of a supplementary group and system admin needs to configure that gid
+into /proc/sys/vm/hugetlb_shm_group. It is possible for same or different
+applications to use any combination of mmaps and shm* calls, though the mount of
+filesystem will be required for using mmap calls without MAP_HUGETLB.
+
+Syscalls that operate on memory backed by hugetlb pages only have their lengths
+aligned to the native page size of the processor; they will normally fail with
+errno set to EINVAL or exclude hugetlb pages that extend beyond the length if
+not hugepage aligned. For example, munmap(2) will fail if memory is backed by
+a hugetlb page and the length is smaller than the hugepage size.
+
Examples
========
diff --git a/Documentation/vm/unevictable-lru.txt b/Documentation/vm/unevictable-lru.txt
index 86cb4624fc5a..3be0bfc4738d 100644
--- a/Documentation/vm/unevictable-lru.txt
+++ b/Documentation/vm/unevictable-lru.txt
@@ -22,6 +22,7 @@ CONTENTS
- Filtering special vmas.
- munlock()/munlockall() system call handling.
- Migrating mlocked pages.
+ - Compacting mlocked pages.
- mmap(MAP_LOCKED) system call handling.
- munmap()/exit()/exec() system call handling.
- try_to_unmap().
@@ -450,6 +451,17 @@ list because of a race between munlock and migration, page migration uses the
putback_lru_page() function to add migrated pages back to the LRU.
+COMPACTING MLOCKED PAGES
+------------------------
+
+The unevictable LRU can be scanned for compactable regions and the default
+behavior is to do so. /proc/sys/vm/compact_unevictable_allowed controls
+this behavior (see Documentation/sysctl/vm.txt). Once scanning of the
+unevictable LRU is enabled, the work of compaction is mostly handled by
+the page migration code and the same work flow as described in MIGRATING
+MLOCKED PAGES will apply.
+
+
mmap(MAP_LOCKED) SYSTEM CALL HANDLING
-------------------------------------
diff --git a/Documentation/vm/zsmalloc.txt b/Documentation/vm/zsmalloc.txt
new file mode 100644
index 000000000000..64ed63c4f69d
--- /dev/null
+++ b/Documentation/vm/zsmalloc.txt
@@ -0,0 +1,70 @@
+zsmalloc
+--------
+
+This allocator is designed for use with zram. Thus, the allocator is
+supposed to work well under low memory conditions. In particular, it
+never attempts higher order page allocation which is very likely to
+fail under memory pressure. On the other hand, if we just use single
+(0-order) pages, it would suffer from very high fragmentation --
+any object of size PAGE_SIZE/2 or larger would occupy an entire page.
+This was one of the major issues with its predecessor (xvmalloc).
+
+To overcome these issues, zsmalloc allocates a bunch of 0-order pages
+and links them together using various 'struct page' fields. These linked
+pages act as a single higher-order page i.e. an object can span 0-order
+page boundaries. The code refers to these linked pages as a single entity
+called zspage.
+
+For simplicity, zsmalloc can only allocate objects of size up to PAGE_SIZE
+since this satisfies the requirements of all its current users (in the
+worst case, page is incompressible and is thus stored "as-is" i.e. in
+uncompressed form). For allocation requests larger than this size, failure
+is returned (see zs_malloc).
+
+Additionally, zs_malloc() does not return a dereferenceable pointer.
+Instead, it returns an opaque handle (unsigned long) which encodes actual
+location of the allocated object. The reason for this indirection is that
+zsmalloc does not keep zspages permanently mapped since that would cause
+issues on 32-bit systems where the VA region for kernel space mappings
+is very small. So, before using the allocating memory, the object has to
+be mapped using zs_map_object() to get a usable pointer and subsequently
+unmapped using zs_unmap_object().
+
+stat
+----
+
+With CONFIG_ZSMALLOC_STAT, we could see zsmalloc internal information via
+/sys/kernel/debug/zsmalloc/<user name>. Here is a sample of stat output:
+
+# cat /sys/kernel/debug/zsmalloc/zram0/classes
+
+ class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage
+ ..
+ ..
+ 9 176 0 1 186 129 8 4
+ 10 192 1 0 2880 2872 135 3
+ 11 208 0 1 819 795 42 2
+ 12 224 0 1 219 159 12 4
+ ..
+ ..
+
+
+class: index
+size: object size zspage stores
+almost_empty: the number of ZS_ALMOST_EMPTY zspages(see below)
+almost_full: the number of ZS_ALMOST_FULL zspages(see below)
+obj_allocated: the number of objects allocated
+obj_used: the number of objects allocated to the user
+pages_used: the number of pages allocated for the class
+pages_per_zspage: the number of 0-order pages to make a zspage
+
+We assign a zspage to ZS_ALMOST_EMPTY fullness group when:
+ n <= N / f, where
+n = number of allocated objects
+N = total number of objects zspage can store
+f = fullness_threshold_frac(ie, 4 at the moment)
+
+Similarly, we assign zspage to:
+ ZS_ALMOST_FULL when n > N / f
+ ZS_EMPTY when n == 0
+ ZS_FULL when n == N