diff options
author | Oscar Salvador | 2021-05-04 18:39:42 -0700 |
---|---|---|
committer | Linus Torvalds | 2021-05-05 11:27:26 -0700 |
commit | a08a2ae3461383c2d50d0997dcc6cd1dd1fefb08 (patch) | |
tree | 1378519ee9afca95a301f8e4afca6dbd6f9981a4 /drivers | |
parent | f9901144e48f6a7ba186249add705d10e74738ec (diff) |
mm,memory_hotplug: allocate memmap from the added memory range
Physical memory hotadd has to allocate a memmap (struct page array) for
the newly added memory section. Currently, alloc_pages_node() is used
for those allocations.
This has some disadvantages:
a) an existing memory is consumed for that purpose
(eg: ~2MB per 128MB memory section on x86_64)
This can even lead to extreme cases where system goes OOM because
the physically hotplugged memory depletes the available memory before
it is onlined.
b) if the whole node is movable then we have off-node struct pages
which has performance drawbacks.
c) It might be there are no PMD_ALIGNED chunks so memmap array gets
populated with base pages.
This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.
Vmemap page tables can map arbitrary memory. That means that we can
reserve a part of the physically hotadded memory to back vmemmap page
tables. This implementation uses the beginning of the hotplugged memory
for that purpose.
There are some non-obviously things to consider though.
Vmemmap pages are allocated/freed during the memory hotplug events
(add_memory_resource(), try_remove_memory()) when the memory is
added/removed. This means that the reserved physical range is not
online although it is used. The most obvious side effect is that
pfn_to_online_page() returns NULL for those pfns. The current design
expects that this should be OK as the hotplugged memory is considered a
garbage until it is onlined. For example hibernation wouldn't save the
content of those vmmemmaps into the image so it wouldn't be restored on
resume but this should be OK as there no real content to recover anyway
while metadata is reachable from other data structures (e.g. vmemmap
page tables).
The reserved space is therefore (de)initialized during the {on,off}line
events (mhp_{de}init_memmap_on_memory). That is done by extracting page
allocator independent initialization from the regular onlining path.
The primary reason to handle the reserved space outside of
{on,off}line_pages is to make each initialization specific to the
purpose rather than special case them in a single function.
As per above, the functions that are introduced are:
- mhp_init_memmap_on_memory:
Initializes vmemmap pages by calling move_pfn_range_to_zone(), calls
kasan_add_zero_shadow(), and onlines as many sections as vmemmap pages
fully span.
- mhp_deinit_memmap_on_memory:
Offlines as many sections as vmemmap pages fully span, removes the
range from zhe zone by remove_pfn_range_from_zone(), and calls
kasan_remove_zero_shadow() for the range.
The new function memory_block_online() calls mhp_init_memmap_on_memory()
before doing the actual online_pages(). Should online_pages() fail, we
clean up by calling mhp_deinit_memmap_on_memory(). Adjusting of
present_pages is done at the end once we know that online_pages()
succedeed.
On offline, memory_block_offline() needs to unaccount vmemmap pages from
present_pages() before calling offline_pages(). This is necessary because
offline_pages() tears down some structures based on the fact whether the
node or the zone become empty. If offline_pages() fails, we account back
vmemmap pages. If it succeeds, we call mhp_deinit_memmap_on_memory().
Hot-remove:
We need to be careful when removing memory, as adding and
removing memory needs to be done with the same granularity.
To check that this assumption is not violated, we check the
memory range we want to remove and if a) any memory block has
vmemmap pages and b) the range spans more than a single memory
block, we scream out loud and refuse to proceed.
If all is good and the range was using memmap on memory (aka vmemmap pages),
we construct an altmap structure so free_hugepage_table does the right
thing and calls vmem_altmap_free instead of free_pagetable.
Link: https://lkml.kernel.org/r/20210421102701.25051-5-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Diffstat (limited to 'drivers')
-rw-r--r-- | drivers/base/memory.c | 72 |
1 files changed, 66 insertions, 6 deletions
diff --git a/drivers/base/memory.c b/drivers/base/memory.c index f209925a5d4e..b31b3af5c490 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -173,16 +173,73 @@ static int memory_block_online(struct memory_block *mem) { unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr); unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block; + unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages; + struct zone *zone; + int ret; + + zone = zone_for_pfn_range(mem->online_type, mem->nid, start_pfn, nr_pages); + + /* + * Although vmemmap pages have a different lifecycle than the pages + * they describe (they remain until the memory is unplugged), doing + * their initialization and accounting at memory onlining/offlining + * stage helps to keep accounting easier to follow - e.g vmemmaps + * belong to the same zone as the memory they backed. + */ + if (nr_vmemmap_pages) { + ret = mhp_init_memmap_on_memory(start_pfn, nr_vmemmap_pages, zone); + if (ret) + return ret; + } + + ret = online_pages(start_pfn + nr_vmemmap_pages, + nr_pages - nr_vmemmap_pages, zone); + if (ret) { + if (nr_vmemmap_pages) + mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages); + return ret; + } + + /* + * Account once onlining succeeded. If the zone was unpopulated, it is + * now already properly populated. + */ + if (nr_vmemmap_pages) + adjust_present_page_count(zone, nr_vmemmap_pages); - return online_pages(start_pfn, nr_pages, mem->online_type, mem->nid); + return ret; } static int memory_block_offline(struct memory_block *mem) { unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr); unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block; + unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages; + struct zone *zone; + int ret; + + zone = page_zone(pfn_to_page(start_pfn)); + + /* + * Unaccount before offlining, such that unpopulated zone and kthreads + * can properly be torn down in offline_pages(). + */ + if (nr_vmemmap_pages) + adjust_present_page_count(zone, -nr_vmemmap_pages); - return offline_pages(start_pfn, nr_pages); + ret = offline_pages(start_pfn + nr_vmemmap_pages, + nr_pages - nr_vmemmap_pages); + if (ret) { + /* offline_pages() failed. Account back. */ + if (nr_vmemmap_pages) + adjust_present_page_count(zone, nr_vmemmap_pages); + return ret; + } + + if (nr_vmemmap_pages) + mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages); + + return ret; } /* @@ -576,7 +633,8 @@ int register_memory(struct memory_block *memory) return ret; } -static int init_memory_block(unsigned long block_id, unsigned long state) +static int init_memory_block(unsigned long block_id, unsigned long state, + unsigned long nr_vmemmap_pages) { struct memory_block *mem; int ret = 0; @@ -593,6 +651,7 @@ static int init_memory_block(unsigned long block_id, unsigned long state) mem->start_section_nr = block_id * sections_per_block; mem->state = state; mem->nid = NUMA_NO_NODE; + mem->nr_vmemmap_pages = nr_vmemmap_pages; ret = register_memory(mem); @@ -612,7 +671,7 @@ static int add_memory_block(unsigned long base_section_nr) if (section_count == 0) return 0; return init_memory_block(memory_block_id(base_section_nr), - MEM_ONLINE); + MEM_ONLINE, 0); } static void unregister_memory(struct memory_block *memory) @@ -634,7 +693,8 @@ static void unregister_memory(struct memory_block *memory) * * Called under device_hotplug_lock. */ -int create_memory_block_devices(unsigned long start, unsigned long size) +int create_memory_block_devices(unsigned long start, unsigned long size, + unsigned long vmemmap_pages) { const unsigned long start_block_id = pfn_to_block_id(PFN_DOWN(start)); unsigned long end_block_id = pfn_to_block_id(PFN_DOWN(start + size)); @@ -647,7 +707,7 @@ int create_memory_block_devices(unsigned long start, unsigned long size) return -EINVAL; for (block_id = start_block_id; block_id != end_block_id; block_id++) { - ret = init_memory_block(block_id, MEM_OFFLINE); + ret = init_memory_block(block_id, MEM_OFFLINE, vmemmap_pages); if (ret) break; } |