aboutsummaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2019-05-14mm/hmm: add helpers to test if mm is still alive or notJérôme Glisse
The device driver can have kernel thread or worker doing work against a process mm and it is useful for those to test wether the mm is dead or alive to avoid doing useless work. Add an helper to test that so that driver can bail out early if a process is dying. Note that the helper does not perform any lock synchronization and thus is just a hint ie a process might be dying but the helper might still return the process as alive. All HMM functions are safe to use in that case as HMM internal properly protect itself with lock. If driver use this helper with non HMM functions it should ascertain that it is safe to do so. Link: http://lkml.kernel.org/r/20190403193318.16478-11-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm/hmm: mirror hugetlbfs (snapshoting, faulting and DMA mapping)Jérôme Glisse
HMM mirror is a device driver helpers to mirror range of virtual address. It means that the process jobs running on the device can access the same virtual address as the CPU threads of that process. This patch adds support for hugetlbfs mapping (ie range of virtual address that are mmap of a hugetlbfs). [rcampbell@nvidia.com: fix initial PFN for hugetlbfs pages] Link: http://lkml.kernel.org/r/20190419233536.8080-1-rcampbell@nvidia.com Link: http://lkml.kernel.org/r/20190403193318.16478-9-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm/hmm: add default fault flags to avoid the need to pre-fill pfns arraysJérôme Glisse
The HMM mirror API can be use in two fashions. The first one where the HMM user coalesce multiple page faults into one request and set flags per pfns for of those faults. The second one where the HMM user want to pre-fault a range with specific flags. For the latter one it is a waste to have the user pre-fill the pfn arrays with a default flags value. This patch adds a default flags value allowing user to set them for a range without having to pre-fill the pfn array. Link: http://lkml.kernel.org/r/20190403193318.16478-8-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm/hmm: improve driver API to work and wait over a rangeJérôme Glisse
A common use case for HMM mirror is user trying to mirror a range and before they could program the hardware it get invalidated by some core mm event. Instead of having user re-try right away to mirror the range provide a completion mechanism for them to wait for any active invalidation affecting the range. This also changes how hmm_range_snapshot() and hmm_range_fault() works by not relying on vma so that we can drop the mmap_sem when waiting and lookup the vma again on retry. Link: http://lkml.kernel.org/r/20190403193318.16478-7-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm/hmm: improve and rename hmm_vma_fault() to hmm_range_fault()Jérôme Glisse
Minor optimization around hmm_pte_need_fault(). Rename for consistency between code, comments and documentation. Also improves the comments on all the possible returns values. Improve the function by returning the number of populated entries in pfns array. Link: http://lkml.kernel.org/r/20190403193318.16478-6-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm/hmm: improve and rename hmm_vma_get_pfns() to hmm_range_snapshot()Jérôme Glisse
Rename for consistency between code, comments and documentation. Also improves the comments on all the possible returns values. Improve the function by returning the number of populated entries in pfns array. Link: http://lkml.kernel.org/r/20190403193318.16478-5-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm/hmm: use reference counting for HMM structJérôme Glisse
Every time I read the code to check that the HMM structure does not vanish before it should thanks to the many lock protecting its removal i get a headache. Switch to reference counting instead it is much easier to follow and harder to break. This also remove some code that is no longer needed with refcounting. Link: http://lkml.kernel.org/r/20190403193318.16478-3-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14hugetlb: use same fault hash key for shared and private mappingsMike Kravetz
hugetlb uses a fault mutex hash table to prevent page faults of the same pages concurrently. The key for shared and private mappings is different. Shared keys off address_space and file index. Private keys off mm and virtual address. Consider a private mappings of a populated hugetlbfs file. A fault will map the page from the file and if needed do a COW to map a writable page. Hugetlbfs hole punch uses the fault mutex to prevent mappings of file pages. It uses the address_space file index key. However, private mappings will use a different key and could race with this code to map the file page. This causes problems (BUG) for the page cache remove code as it expects the page to be unmapped. A sample stack is: page dumped because: VM_BUG_ON_PAGE(page_mapped(page)) kernel BUG at mm/filemap.c:169! ... RIP: 0010:unaccount_page_cache_page+0x1b8/0x200 ... Call Trace: __delete_from_page_cache+0x39/0x220 delete_from_page_cache+0x45/0x70 remove_inode_hugepages+0x13c/0x380 ? __add_to_page_cache_locked+0x162/0x380 hugetlbfs_fallocate+0x403/0x540 ? _cond_resched+0x15/0x30 ? __inode_security_revalidate+0x5d/0x70 ? selinux_file_permission+0x100/0x130 vfs_fallocate+0x13f/0x270 ksys_fallocate+0x3c/0x80 __x64_sys_fallocate+0x1a/0x20 do_syscall_64+0x5b/0x180 entry_SYSCALL_64_after_hwframe+0x44/0xa9 There seems to be another potential COW issue/race with this approach of different private and shared keys as noted in commit 8382d914ebf7 ("mm, hugetlb: improve page-fault scalability"). Since every hugetlb mapping (even anon and private) is actually a file mapping, just use the address_space index key for all mappings. This results in potentially more hash collisions. However, this should not be the common case. Link: http://lkml.kernel.org/r/20190328234704.27083-3-mike.kravetz@oracle.com Link: http://lkml.kernel.org/r/20190412165235.t4sscoujczfhuiyt@linux-r8p5 Fixes: b5cec28d36f5 ("hugetlbfs: truncate_hugepages() takes a range of pages") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14include/linux/balloon_compaction.h: drop unused function stubsDavid Hildenbrand
These are leftovers from the pre-"general non-lru movable page" era. Link: http://lkml.kernel.org/r/20190329122649.28404-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mukesh Ojha <mojha@codeaurora.org> Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Pankaj Gupta <pagupta@redhat.com> Acked-by: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm/vmscan: drop may_writepage and classzone_idx from direct reclaim begin ↵Yafang Shao
template There are three tracepoints using this template, which are mm_vmscan_direct_reclaim_begin, mm_vmscan_memcg_reclaim_begin, mm_vmscan_memcg_softlimit_reclaim_begin. Regarding mm_vmscan_direct_reclaim_begin, sc.may_writepage is !laptop_mode, that's a static setting, and reclaim_idx is derived from gfp_mask which is already show in this tracepoint. Regarding mm_vmscan_memcg_reclaim_begin, may_writepage is !laptop_mode too, and reclaim_idx is (MAX_NR_ZONES-1), which are both static value. mm_vmscan_memcg_softlimit_reclaim_begin is the same with mm_vmscan_memcg_reclaim_begin. So we can drop them all. Link: http://lkml.kernel.org/r/1553736322-32235-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: introduce put_user_page*(), placeholder versionsJohn Hubbard
A discussion of the overall problem is below. As mentioned in patch 0001, the steps are to fix the problem are: 1) Provide put_user_page*() routines, intended to be used for releasing pages that were pinned via get_user_pages*(). 2) Convert all of the call sites for get_user_pages*(), to invoke put_user_page*(), instead of put_page(). This involves dozens of call sites, and will take some time. 3) After (2) is complete, use get_user_pages*() and put_user_page*() to implement tracking of these pages. This tracking will be separate from the existing struct page refcounting. 4) Use the tracking and identification of these pages, to implement special handling (especially in writeback paths) when the pages are backed by a filesystem. Overview ======== Some kernel components (file systems, device drivers) need to access memory that is specified via process virtual address. For a long time, the API to achieve that was get_user_pages ("GUP") and its variations. However, GUP has critical limitations that have been overlooked; in particular, GUP does not interact correctly with filesystems in all situations. That means that file-backed memory + GUP is a recipe for potential problems, some of which have already occurred in the field. GUP was first introduced for Direct IO (O_DIRECT), allowing filesystem code to get the struct page behind a virtual address and to let storage hardware perform a direct copy to or from that page. This is a short-lived access pattern, and as such, the window for a concurrent writeback of GUP'd page was small enough that there were not (we think) any reported problems. Also, userspace was expected to understand and accept that Direct IO was not synchronized with memory-mapped access to that data, nor with any process address space changes such as munmap(), mremap(), etc. Over the years, more GUP uses have appeared (virtualization, device drivers, RDMA) that can keep the pages they get via GUP for a long period of time (seconds, minutes, hours, days, ...). This long-term pinning makes an underlying design problem more obvious. In fact, there are a number of key problems inherent to GUP: Interactions with file systems ============================== File systems expect to be able to write back data, both to reclaim pages, and for data integrity. Allowing other hardware (NICs, GPUs, etc) to gain write access to the file memory pages means that such hardware can dirty the pages, without the filesystem being aware. This can, in some cases (depending on filesystem, filesystem options, block device, block device options, and other variables), lead to data corruption, and also to kernel bugs of the form: kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899! backtrace: ext4_writepage __writepage write_cache_pages ext4_writepages do_writepages __writeback_single_inode writeback_sb_inodes __writeback_inodes_wb wb_writeback wb_workfn process_one_work worker_thread kthread ret_from_fork ...which is due to the file system asserting that there are still buffer heads attached: ({ \ BUG_ON(!PagePrivate(page)); \ ((struct buffer_head *)page_private(page)); \ }) Dave Chinner's description of this is very clear: "The fundamental issue is that ->page_mkwrite must be called on every write access to a clean file backed page, not just the first one. How long the GUP reference lasts is irrelevant, if the page is clean and you need to dirty it, you must call ->page_mkwrite before it is marked writeable and dirtied. Every. Time." This is just one symptom of the larger design problem: real filesystems that actually write to a backing device, do not actually support get_user_pages() being called on their pages, and letting hardware write directly to those pages--even though that pattern has been going on since about 2005 or so. Long term GUP ============= Long term GUP is an issue when FOLL_WRITE is specified to GUP (so, a writeable mapping is created), and the pages are file-backed. That can lead to filesystem corruption. What happens is that when a file-backed page is being written back, it is first mapped read-only in all of the CPU page tables; the file system then assumes that nobody can write to the page, and that the page content is therefore stable. Unfortunately, the GUP callers generally do not monitor changes to the CPU pages tables; they instead assume that the following pattern is safe (it's not): get_user_pages() Hardware can keep a reference to those pages for a very long time, and write to it at any time. Because "hardware" here means "devices that are not a CPU", this activity occurs without any interaction with the kernel's file system code. for each page set_page_dirty put_page() In fact, the GUP documentation even recommends that pattern. Anyway, the file system assumes that the page is stable (nothing is writing to the page), and that is a problem: stable page content is necessary for many filesystem actions during writeback, such as checksum, encryption, RAID striping, etc. Furthermore, filesystem features like COW (copy on write) or snapshot also rely on being able to use a new page for as memory for that memory range inside the file. Corruption during write back is clearly possible here. To solve that, one idea is to identify pages that have active GUP, so that we can use a bounce page to write stable data to the filesystem. The filesystem would work on the bounce page, while any of the active GUP might write to the original page. This would avoid the stable page violation problem, but note that it is only part of the overall solution, because other problems remain. Other filesystem features that need to replace the page with a new one can be inhibited for pages that are GUP-pinned. This will, however, alter and limit some of those filesystem features. The only fix for that would be to require GUP users to monitor and respond to CPU page table updates. Subsystems such as ODP and HMM do this, for example. This aspect of the problem is still under discussion. Direct IO ========= Direct IO can cause corruption, if userspace does Direct-IO that writes to a range of virtual addresses that are mmap'd to a file. The pages written to are file-backed pages that can be under write back, while the Direct IO is taking place. Here, Direct IO races with a write back: it calls GUP before page_mkclean() has replaced the CPU pte with a read-only entry. The race window is pretty small, which is probably why years have gone by before we noticed this problem: Direct IO is generally very quick, and tends to finish up before the filesystem gets around to do anything with the page contents. However, it's still a real problem. The solution is to never let GUP return pages that are under write back, but instead, force GUP to take a write fault on those pages. That way, GUP will properly synchronize with the active write back. This does not change the required GUP behavior, it just avoids that race. Details ======= Introduces put_user_page(), which simply calls put_page(). This provides a way to update all get_user_pages*() callers, so that they call put_user_page(), instead of put_page(). Also introduces put_user_pages(), and a few dirty/locked variations, as a replacement for release_pages(), and also as a replacement for open-coded loops that release multiple pages. These may be used for subsequent performance improvements, via batching of pages to be released. This is the first step of fixing a problem (also described in [1] and [2]) with interactions between get_user_pages ("gup") and filesystems. Problem description: let's start with a bug report. Below, is what happens sometimes, under memory pressure, when a driver pins some pages via gup, and then marks those pages dirty, and releases them. Note that the gup documentation actually recommends that pattern. The problem is that the filesystem may do a writeback while the pages were gup-pinned, and then the filesystem believes that the pages are clean. So, when the driver later marks the pages as dirty, that conflicts with the filesystem's page tracking and results in a BUG(), like this one that I experienced: kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899! backtrace: ext4_writepage __writepage write_cache_pages ext4_writepages do_writepages __writeback_single_inode writeback_sb_inodes __writeback_inodes_wb wb_writeback wb_workfn process_one_work worker_thread kthread ret_from_fork ...which is due to the file system asserting that there are still buffer heads attached: ({ \ BUG_ON(!PagePrivate(page)); \ ((struct buffer_head *)page_private(page)); \ }) Dave Chinner's description of this is very clear: "The fundamental issue is that ->page_mkwrite must be called on every write access to a clean file backed page, not just the first one. How long the GUP reference lasts is irrelevant, if the page is clean and you need to dirty it, you must call ->page_mkwrite before it is marked writeable and dirtied. Every. Time." This is just one symptom of the larger design problem: real filesystems that actually write to a backing device, do not actually support get_user_pages() being called on their pages, and letting hardware write directly to those pages--even though that pattern has been going on since about 2005 or so. The steps are to fix it are: 1) (This patch): provide put_user_page*() routines, intended to be used for releasing pages that were pinned via get_user_pages*(). 2) Convert all of the call sites for get_user_pages*(), to invoke put_user_page*(), instead of put_page(). This involves dozens of call sites, and will take some time. 3) After (2) is complete, use get_user_pages*() and put_user_page*() to implement tracking of these pages. This tracking will be separate from the existing struct page refcounting. 4) Use the tracking and identification of these pages, to implement special handling (especially in writeback paths) when the pages are backed by a filesystem. [1] https://lwn.net/Articles/774411/ : "DMA and get_user_pages()" [2] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()" Link: http://lkml.kernel.org/r/20190327023632.13307-2-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> [docs] Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Christoph Lameter <cl@linux.com> Tested-by: Ira Weiny <ira.weiny@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14hugetlb: allow to free gigantic pages regardless of the configurationAlexandre Ghiti
On systems without CONTIG_ALLOC activated but that support gigantic pages, boottime reserved gigantic pages can not be freed at all. This patch simply enables the possibility to hand back those pages to memory allocator. Link: http://lkml.kernel.org/r/20190327063626.18421-5-alex@ghiti.fr Signed-off-by: Alexandre Ghiti <alex@ghiti.fr> Acked-by: David S. Miller <davem@davemloft.net> [sparc] Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andy Lutomirsky <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rich Felker <dalias@libc.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will.deacon@arm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: simplify MEMORY_ISOLATION && COMPACTION || CMA into CONTIG_ALLOCAlexandre Ghiti
This condition allows to define alloc_contig_range, so simplify it into a more accurate naming. Link: http://lkml.kernel.org/r/20190327063626.18421-4-alex@ghiti.fr Signed-off-by: Alexandre Ghiti <alex@ghiti.fr> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andy Lutomirsky <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rich Felker <dalias@libc.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: memcontrol: quarantine the mem_cgroup_[node_]nr_lru_pages() APIJohannes Weiner
Only memcg_numa_stat_show() uses those wrappers and the lru bitmasks, group them together. Link: http://lkml.kernel.org/r/20190228163020.24100-7-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: memcontrol: push down mem_cgroup_node_nr_lru_pages()Johannes Weiner
mem_cgroup_node_nr_lru_pages() is just a convenience wrapper around lruvec_page_state() that takes bitmasks of lru indexes and aggregates the counts for those. Replace callsites where the bitmask is simple enough with direct lruvec_page_state() calls. This removes the last extern user of mem_cgroup_node_nr_lru_pages(), so make that function private again, too. Link: http://lkml.kernel.org/r/20190228163020.24100-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: memcontrol: replace zone summing with lruvec_page_state()Johannes Weiner
Instead of adding up the zone counters, use lruvec_page_state() to get the node state directly. This is a bit cheaper and more stream-lined. Link: http://lkml.kernel.org/r/20190228163020.24100-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: memcontrol: track LRU counts in the vmstats arrayJohannes Weiner
Patch series "mm: memcontrol: clean up the LRU counts tracking". The memcg LRU stats usage is currently a bit messy. Memcg has private per-zone counters because reclaim needs zone granularity sometimes, but we also have plenty of users that need to awkwardly sum them up to node or memcg granularity. Meanwhile the canonical per-memcg vmstats do not track the LRU counts (NR_INACTIVE_ANON etc.) as you'd expect. This series enables LRU count tracking in the per-memcg vmstats array such that lruvec_page_state() and memcg_page_state() work on the enum node_stat_item items for the LRU counters. Then it converts all the callers that don't specifically need per-zone numbers over to that. This patch (of 6): The memcg code currently maintains private per-zone breakdowns of the LRU counters. This is necessary for reclaim decisions which are still zone-based, but there are a variety of users of these counters that only want the aggregate per-lruvec or per-memcg LRU counts, and they need to painfully sum up the zone counters on each request for that. These would be better served using the memcg vmstats arrays, which track VM statistics at the desired scope already. They just don't have the LRU counts right now. So to kick off the conversion, begin tracking LRU counts in those. Link: http://lkml.kernel.org/r/20190228163020.24100-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm/vmscan: add tracepoints for node reclaimYafang Shao
The page alloc fast path it may perform node reclaim, which may cause a latency spike. We should add tracepoint for this event, and also measure the latency it causes. So bellow two tracepoints are introduced, mm_vmscan_node_reclaim_begin mm_vmscan_node_reclaim_end Link: http://lkml.kernel.org/r/1551421452-5385-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: <shaoyafang@didiglobal.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm, compaction: some tracepoints should be defined only when ↵Yafang Shao
CONFIG_COMPACTION is set Only mm_compaction_isolate_{free, migrate}pages may be used when CONFIG_COMPACTION is not set. All others are used only when CONFIG_COMPACTION is set. After this change, if CONFIG_COMPACTION is not set, the tracepoints that only work when CONFIG_COMPACTION is set will not be exposed to userspace. Without this change, they will always be exposed in debugfs whether CONFIG_COMPACTION is set or not. This is an improvement. Link: http://lkml.kernel.org/r/1552440403-11780-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: compaction: show gfp flag names in try_to_compact_pages tracepointYafang Shao
Showing the gfp flag names instead of the gfp_mask makes trace more convenient. Link: http://lkml.kernel.org/r/1552527998-13162-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm/gup: change GUP fast to use flags rather than a write 'bool'Ira Weiny
To facilitate additional options to get_user_pages_fast() change the singular write parameter to be gup_flags. This patch does not change any functionality. New functionality will follow in subsequent patches. Some of the get_user_pages_fast() call sites were unchanged because they already passed FOLL_WRITE or 0 for the write parameter. NOTE: It was suggested to change the ordering of the get_user_pages_fast() arguments to ensure that callers were converted. This breaks the current GUP call site convention of having the returned pages be the final parameter. So the suggestion was rejected. Link: http://lkml.kernel.org/r/20190328084422.29911-4-ira.weiny@intel.com Link: http://lkml.kernel.org/r/20190317183438.2057-4-ira.weiny@intel.com Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Mike Marshall <hubcap@omnibond.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Hogan <jhogan@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Rich Felker <dalias@libc.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm/gup: replace get_user_pages_longterm() with FOLL_LONGTERMIra Weiny
Pach series "Add FOLL_LONGTERM to GUP fast and use it". HFI1, qib, and mthca, use get_user_pages_fast() due to its performance advantages. These pages can be held for a significant time. But get_user_pages_fast() does not protect against mapping FS DAX pages. Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast() which retains the performance while also adding the FS DAX checks. XDP has also shown interest in using this functionality.[1] In addition we change get_user_pages() to use the new FOLL_LONGTERM flag and remove the specialized get_user_pages_longterm call. [1] https://lkml.org/lkml/2019/3/19/939 "longterm" is a relative thing and at this point is probably a misnomer. This is really flagging a pin which is going to be given to hardware and can't move. I've thought of a couple of alternative names but I think we have to settle on if we are going to use FL_LAYOUT or something else to solve the "longterm" problem. Then I think we can change the flag to a better name. Secondly, it depends on how often you are registering memory. I have spoken with some RDMA users who consider MR in the performance path... For the overall application performance. I don't have the numbers as the tests for HFI1 were done a long time ago. But there was a significant advantage. Some of which is probably due to the fact that you don't have to hold mmap_sem. Finally, architecturally I think it would be good for everyone to use *_fast. There are patches submitted to the RDMA list which would allow the use of *_fast (they reworking the use of mmap_sem) and as soon as they are accepted I'll submit a patch to convert the RDMA core as well. Also to this point others are looking to use *_fast. As an aside, Jasons pointed out in my previous submission that *_fast and *_unlocked look very much the same. I agree and I think further cleanup will be coming. But I'm focused on getting the final solution for DAX at the moment. This patch (of 7): This patch starts a series which aims to support FOLL_LONGTERM in get_user_pages_fast(). Some callers who would like to do a longterm (user controlled pin) of pages with the fast variant of GUP for performance purposes. Rather than have a separate get_user_pages_longterm() call, introduce FOLL_LONGTERM and change the longterm callers to use it. This patch does not change any functionality. In the short term "longterm" or user controlled pins are unsafe for Filesystems and FS DAX in particular has been blocked. However, callers of get_user_pages_fast() were not "protected". FOLL_LONGTERM can _only_ be supported with get_user_pages[_fast]() as it requires vmas to determine if DAX is in use. NOTE: In merging with the CMA changes we opt to change the get_user_pages() call in check_and_migrate_cma_pages() to a call of __get_user_pages_locked() on the newly migrated pages. This makes the code read better in that we are calling __get_user_pages_locked() on the pages before and after a potential migration. As a side affect some of the interfaces are cleaned up but this is not the primary purpose of the series. In review[1] it was asked: <quote> > This I don't get - if you do lock down long term mappings performance > of the actual get_user_pages call shouldn't matter to start with. > > What do I miss? A couple of points. First "longterm" is a relative thing and at this point is probably a misnomer. This is really flagging a pin which is going to be given to hardware and can't move. I've thought of a couple of alternative names but I think we have to settle on if we are going to use FL_LAYOUT or something else to solve the "longterm" problem. Then I think we can change the flag to a better name. Second, It depends on how often you are registering memory. I have spoken with some RDMA users who consider MR in the performance path... For the overall application performance. I don't have the numbers as the tests for HFI1 were done a long time ago. But there was a significant advantage. Some of which is probably due to the fact that you don't have to hold mmap_sem. Finally, architecturally I think it would be good for everyone to use *_fast. There are patches submitted to the RDMA list which would allow the use of *_fast (they reworking the use of mmap_sem) and as soon as they are accepted I'll submit a patch to convert the RDMA core as well. Also to this point others are looking to use *_fast. As an asside, Jasons pointed out in my previous submission that *_fast and *_unlocked look very much the same. I agree and I think further cleanup will be coming. But I'm focused on getting the final solution for DAX at the moment. </quote> [1] https://lore.kernel.org/lkml/20190220180255.GA12020@iweiny-DESK2.sc.intel.com/T/#md6abad2569f3bf6c1f03686c8097ab6563e94965 [ira.weiny@intel.com: v3] Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com Link: http://lkml.kernel.org/r/20190317183438.2057-2-ira.weiny@intel.com Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Rich Felker <dalias@libc.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: James Hogan <jhogan@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Mike Marshall <hubcap@omnibond.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: move nr_deactivate accounting to shrink_active_list()Kirill Tkhai
We know which LRU is not active. [chris@chrisdown.name: fix build on !CONFIG_MEMCG] Link: http://lkml.kernel.org/r/20190322150513.GA22021@chrisdown.name Link: http://lkml.kernel.org/r/155290128498.31489.18250485448913338607.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Chris Down <chris@chrisdown.name> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: move recent_rotated pages calculation to shrink_inactive_list()Kirill Tkhai
Patch series "mm: Generalize putback functions"] putback_inactive_pages() and move_active_pages_to_lru() are almost similar, so this patchset merges them ina single function. This patch (of 4): The patch moves the calculation from putback_inactive_pages() to shrink_inactive_list(). This makes putback_inactive_pages() looking more similar to move_active_pages_to_lru(). To do that, we account activated pages in reclaim_stat::nr_activate. Since a page may change its LRU type from anon to file cache inside shrink_page_list() (see ClearPageSwapBacked()), we have to account pages for the both types. So, nr_activate becomes an array. Previously we used nr_activate to account PGACTIVATE events, but now we account them into pgactivate variable (since they are about number of pages in general, not about sum of hpage_nr_pages). Link: http://lkml.kernel.org/r/155290127956.31489.3393586616054413298.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: page cache: store only head pages in i_pagesMatthew Wilcox
Transparent Huge Pages are currently stored in i_pages as pointers to consecutive subpages. This patch changes that to storing consecutive pointers to the head page in preparation for storing huge pages more efficiently in i_pages. Large parts of this are "inspired" by Kirill's patch https://lore.kernel.org/lkml/20170126115819.58875-2-kirill.shutemov@linux.intel.com/ [willy@infradead.org: fix swapcache pages] Link: http://lkml.kernel.org/r/20190324155441.GF10344@bombadil.infradead.org [kirill@shutemov.name: hugetlb stores pages in page cache differently] Link: http://lkml.kernel.org/r/20190404134553.vuvhgmghlkiw2hgl@kshutemo-mobl1 Link: http://lkml.kernel.org/r/20190307153051.18815-1-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Acked-by: Jan Kara <jack@suse.cz> Reviewed-by: Kirill Shutemov <kirill@shutemov.name> Reviewed-and-tested-by: Song Liu <songliubraving@fb.com> Tested-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Tested-by: Qian Cai <cai@lca.pw> Cc: Hugh Dickins <hughd@google.com> Cc: Song Liu <liu.song.a23@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14userfaultfd/sysctl: add vm.unprivileged_userfaultfdPeter Xu
Userfaultfd can be misued to make it easier to exploit existing use-after-free (and similar) bugs that might otherwise only make a short window or race condition available. By using userfaultfd to stall a kernel thread, a malicious program can keep some state that it wrote, stable for an extended period, which it can then access using an existing exploit. While it doesn't cause the exploit itself, and while it's not the only thing that can stall a kernel thread when accessing a memory location, it's one of the few that never needs privilege. We can add a flag, allowing userfaultfd to be restricted, so that in general it won't be useable by arbitrary user programs, but in environments that require userfaultfd it can be turned back on. Add a global sysctl knob "vm.unprivileged_userfaultfd" to control whether userfaultfd is allowed by unprivileged users. When this is set to zero, only privileged users (root user, or users with the CAP_SYS_PTRACE capability) will be able to use the userfaultfd syscalls. Andrea said: : The only difference between the bpf sysctl and the userfaultfd sysctl : this way is that the bpf sysctl adds the CAP_SYS_ADMIN capability : requirement, while userfaultfd adds the CAP_SYS_PTRACE requirement, : because the userfaultfd monitor is more likely to need CAP_SYS_PTRACE : already if it's doing other kind of tracking on processes runtime, in : addition of userfaultfd. In other words both syscalls works only for : root, when the two sysctl are opt-in set to 1. [dgilbert@redhat.com: changelog additions] [akpm@linux-foundation.org: documentation tweak, per Mike] Link: http://lkml.kernel.org/r/20190319030722.12441-2-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Suggested-by: Andrea Arcangeli <aarcange@redhat.com> Suggested-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Maxime Coquelin <maxime.coquelin@redhat.com> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Martin Cracauer <cracauer@cons.org> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mel Gorman <mgorman@suse.de> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14include/trace/events/vmscan.h: drop zone id from kswapd tracepointsYafang Shao
It is not clear how the zone id is useful in kswapd tracepoints and the id itself is not really easy to process because it depends on the configuration (available zones). Let's drop the id for now. If somebody really needs that information then the zone name should be used instead. [mhocko@suse.com: new changelog] Link: http://lkml.kernel.org/r/1552451813-10833-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm: remove stale comment from page structTobin C. Harding
We now use the slab_list list_head instead of the lru list_head. This comment has become stale. Remove stale comment from page struct slab_list list_head. Link: http://lkml.kernel.org/r/20190402230545.2929-8-tobin@kernel.org Signed-off-by: Tobin C. Harding <tobin@kernel.org> Acked-by: Christoph Lameter <cl@linux.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14list: add function list_rotate_to_front()Tobin C. Harding
Patch series "mm: Use slab_list list_head instead of lru", v5. Currently the slab allocators (ab)use the struct page 'lru' list_head. We have a list head for slab allocators to use, 'slab_list'. During v2 it was noted by Christoph that the SLOB allocator was reaching into a list_head, this version adds 2 patches to the front of the set to fix that. Clean up all three allocators by using the 'slab_list' list_head instead of overloading the 'lru' list_head. This patch (of 7): Currently if we wish to rotate a list until a specific item is at the front of the list we can call list_move_tail(head, list). Note that the arguments are the reverse way to the usual use of list_move_tail(list, head). This is a hack, it depends on the developer knowing how the list_head operates internally which violates the layer of abstraction offered by the list_head. Also, it is not intuitive so the next developer to come along must study list.h in order to fully understand what is meant by the call, while this is 'good for' the developer it makes reading the code harder. We should have an function appropriately named that does this if there are users for it intree. By grep'ing the tree for list_move_tail() and list_tail() and attempting to guess the argument order from the names it seems there is only one place currently in the tree that does this - the slob allocatator. Add function list_rotate_to_front() to rotate a list until the specified item is at the front of the list. Link: http://lkml.kernel.org/r/20190402230545.2929-2-tobin@kernel.org Signed-off-by: Tobin C. Harding <tobin@kernel.org> Reviewed-by: Christoph Lameter <cl@linux.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm/huge_memory: fix vmf_insert_pfn_{pmd, pud}() crash, handle unaligned ↵Dan Williams
addresses Starting with c6f3c5ee40c1 ("mm/huge_memory.c: fix modifying of page protection by insert_pfn_pmd()") vmf_insert_pfn_pmd() internally calls pmdp_set_access_flags(). That helper enforces a pmd aligned @address argument via VM_BUG_ON() assertion. Update the implementation to take a 'struct vm_fault' argument directly and apply the address alignment fixup internally to fix crash signatures like: kernel BUG at arch/x86/mm/pgtable.c:515! invalid opcode: 0000 [#1] SMP NOPTI CPU: 51 PID: 43713 Comm: java Tainted: G OE 4.19.35 #1 [..] RIP: 0010:pmdp_set_access_flags+0x48/0x50 [..] Call Trace: vmf_insert_pfn_pmd+0x198/0x350 dax_iomap_fault+0xe82/0x1190 ext4_dax_huge_fault+0x103/0x1f0 ? __switch_to_asm+0x40/0x70 __handle_mm_fault+0x3f6/0x1370 ? __switch_to_asm+0x34/0x70 ? __switch_to_asm+0x40/0x70 handle_mm_fault+0xda/0x200 __do_page_fault+0x249/0x4f0 do_page_fault+0x32/0x110 ? page_fault+0x8/0x30 page_fault+0x1e/0x30 Link: http://lkml.kernel.org/r/155741946350.372037.11148198430068238140.stgit@dwillia2-desk3.amr.corp.intel.com Fixes: c6f3c5ee40c1 ("mm/huge_memory.c: fix modifying of page protection by insert_pfn_pmd()") Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reported-by: Piotr Balcer <piotr.balcer@intel.com> Tested-by: Yan Ma <yan.ma@intel.com> Tested-by: Pankaj Gupta <pagupta@redhat.com> Reviewed-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Chandan Rajendra <chandan@linux.ibm.com> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-13Merge tag 'iommu-updates-v5.2' of ↵Linus Torvalds
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/joro/iommu Pull IOMMU updates from Joerg Roedel: - ATS support for ARM-SMMU-v3. - AUX domain support in the IOMMU-API and the Intel VT-d driver. This adds support for multiple DMA address spaces per (PCI-)device. The use-case is to multiplex devices between host and KVM guests in a more flexible way than supported by SR-IOV. - the rest are smaller cleanups and fixes, two of which needed to be reverted after testing in linux-next. * tag 'iommu-updates-v5.2' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (45 commits) Revert "iommu/amd: Flush not present cache in iommu_map_page" Revert "iommu/amd: Remove the leftover of bypass support" iommu/vt-d: Fix leak in intel_pasid_alloc_table on error path iommu/vt-d: Make kernel parameter igfx_off work with vIOMMU iommu/vt-d: Set intel_iommu_gfx_mapped correctly iommu/amd: Flush not present cache in iommu_map_page iommu/vt-d: Cleanup: no spaces at the start of a line iommu/vt-d: Don't request page request irq under dmar_global_lock iommu/vt-d: Use struct_size() helper iommu/mediatek: Fix leaked of_node references iommu/amd: Remove amd_iommu_pd_list iommu/arm-smmu: Log CBFRSYNRA register on context fault iommu/arm-smmu-v3: Don't disable SMMU in kdump kernel iommu/arm-smmu-v3: Disable tagged pointers iommu/arm-smmu-v3: Add support for PCI ATS iommu/arm-smmu-v3: Link domains and devices iommu/arm-smmu-v3: Add a master->domain pointer iommu/arm-smmu-v3: Store SteamIDs in master iommu/arm-smmu-v3: Rename arm_smmu_master_data to arm_smmu_master ACPI/IORT: Check ATS capability in root complex nodes ...
2019-05-12Merge tag 'upstream-5.2-rc1' of ↵Linus Torvalds
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/rw/ubifs Pull UBI/UBIFS updates from Richard Weinberger: - fscrypt framework usage updates - One huge fix for xattr unlink - Cleanup of fscrypt ifdefs - Fix for our new UBIFS auth feature * tag 'upstream-5.2-rc1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/rw/ubifs: ubi: wl: Fix uninitialized variable ubifs: Drop unnecessary setting of zbr->znode ubifs: Remove ifdefs around CONFIG_UBIFS_ATIME_SUPPORT ubifs: Remove #ifdef around CONFIG_FS_ENCRYPTION ubifs: Limit number of xattrs per inode ubifs: orphan: Handle xattrs like files ubifs: journal: Handle xattrs like files ubifs: find.c: replace swap function with built-in one ubifs: Do not skip hash checking in data nodes ubifs: work around high stack usage with clang ubifs: remove unused function __ubifs_shash_final ubifs: remove unnecessary #ifdef around fscrypt_ioctl_get_policy() ubifs: remove unnecessary calls to set up directory key
2019-05-12Merge tag 'mtd/for-5.2' of ↵Linus Torvalds
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mtd/linux Pull MTD updates from Richard Weinberger: "MTD core changes: - New AFS partition parser - Update MAINTAINERS entry - Use of fall-throughs markers NAND core changes: - Support having the bad block markers in either the first, second or last page of a block. The combination of all three location is now possible. - Constification of NAND_OP_PARSER(_PATTERN) elements. - Generic NAND DT bindings changed to yaml format (can be used to check the proposed bindings. First platform to be fully supported: sunxi. - Stopped using several legacy hooks. - Preparation to use the generic NAND layer with the addition of several helpers and the removal of the struct nand_chip from generic functions. - Kconfig cleanup to prepare the introduction of external ECC engines support. - Fallthrough comments. - Introduction of the SPI-mem dirmap API for SPI-NAND devices. Raw NAND controller drivers changes: - nandsim: - Switch to ->exec-op(). - meson: - Misc cleanups and fixes. - New OOB layout. - Sunxi: - A23/A33 NAND DMA support. - Ingenic: - Full reorganization and cleanup. - Clear separation between NAND controller and ECC engine. - Support JZ4740 an JZ4725B. - Denali: - Clear controller/chip separation. - ->exec_op() migration. - Various cleanups. - fsl_elbc: - Enable software ECC support. - Atmel: - Sam9x60 support. - GPMI: - Introduce the GPMI_IS_MXS() macro. - Various trivial/spelling/coding style fixes. SPI NOR core changes: - Print all JEDEC ID bytes on error - Fix comment of spi_nor_find_best_erase_type() - Add region locking flags for s25fl512s SPI NOR controller drivers changes: - intel-spi: - Avoid crossing 4K address boundary on read/write - Add support for Intel Comet Lake SPI serial flash" * tag 'mtd/for-5.2' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mtd/linux: (120 commits) mtd: part: fix incorrect format specifier for an unsigned long long mtd: lpddr_cmds: Mark expected switch fall-through mtd: phram: Mark expected switch fall-throughs mtd: cfi_cmdset_0002: Mark expected switch fall-throughs mtd: cfi_util: mark expected switch fall-throughs MAINTAINERS: MTD Git repository is hosted on kernel.org MAINTAINERS: Update jffs2 entry mtd: afs: add v2 partition parsing mtd: afs: factor the IIS read into partition parser mtd: afs: factor footer parsing into the v1 part parsing mtd: factor out v1 partition parsing mtd: afs: simplify partition detection mtd: afs: simplify partition parsing mtd: partitions: Add OF support to AFS partitions mtd: partitions: Add AFS partitions DT bindings mtd: afs: Move AFS partition parser to parsers subdir mtd: maps: Make uclinux_ram_map static mtd: maps: Allow MTD_PHYSMAP with MTD_RAM MAINTAINERS: Add myself as MTD maintainer MAINTAINERS: Remove my name from the MTD and NAND entries ...
2019-05-12Merge tag 'tag-chrome-platform-for-v5.2' of ↵Linus Torvalds
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux Pull chrome platform updates from Benson Leung: "CrOS EC: - Add EC host command support using rpmsg - Add new CrOS USB PD logging driver - Transfer spi messages at high priority - Add support to trace CrOS EC commands - Minor fixes and cleanups in protocol and debugfs Wilco EC: - Standardize Wilco EC mailbox interface - Add h1_gpio status to debugfs" * tag 'tag-chrome-platform-for-v5.2' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux: platform/chrome: cros_ec_proto: Add trace event to trace EC commands platform/chrome: cros_ec_debugfs: Use cros_ec_cmd_xfer_status helper platform/chrome: cros_ec: Add EC host command support using rpmsg platform/chrome: wilco_ec: Add h1_gpio status to debugfs platform/chrome: wilco_ec: Standardize mailbox interface platform/chrome: cros_ec_proto: check for NULL transfer function platform/chrome: Add CrOS USB PD logging driver platform/chrome: cros_ec_spi: Transfer messages at high priority platform/chrome: cros_ec_debugfs: no need to check return value of debugfs_create functions platform/chrome: cros_ec_debugfs: Remove dev_warn when console log is not supported
2019-05-11Merge tag 'gpio-v5.2-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio Pull gpio updates from Linus Walleij: "This is the bulk of the GPIO changes for the v5.2 kernel cycle. A bit later than usual because I was ironing out my own mistakes. I'm holding some stuff back for the next kernel as a result, and this should be a healthy and well tested batch. Core changes: - The gpiolib MMIO driver has been enhanced to handle two direction registers, i.e. one register to set lines as input and one register to set lines as output. It turns out some silicon engineer thinks the ability to configure a line as input and output at the same time makes sense, this can be debated but includes a lot of analog electronics reasoning, and the registers are there and need to be handled consistently. Unsurprisingly, we enforce the lines to be either inputs or outputs in such schemes. - Send in the proper argument value to .set_config() dispatched to the pin control subsystem. Nobody used it before, now someone does, so fix it to work as expected. - The ACPI gpiolib portions can now handle pin bias setting (pull up or pull down). This has been in the ACPI spec for years and we finally have it properly integrated with Linux GPIOs. It was based on an observation from Andy Schevchenko that Thomas Petazzoni's changes to the core for biasing the PCA950x GPIO expander actually happen to fit hand-in-glove with what the ACPI core needed. Such nice synergies happen sometimes. New drivers: - A new driver for the Mellanox BlueField GPIO controller. This is using 64bit MMIO registers and can configure lines as inputs and outputs at the same time and after improving the MMIO library we handle it just fine. Interesting. - A new IXP4xx proper gpiochip driver with hierarchical interrupts should be coming in from the ARM SoC tree as well. Driver enhancements: - The PCA053x driver handles the CAT9554 GPIO expander. - The PCA053x driver handles the NXP PCAL6416 GPIO expander. - Wake-up support on PCA053x GPIO lines. - OMAP now does a nice asynchronous IRQ handling on wake-ups by letting everything wake up on edges, and this makes runtime PM work as expected too. Misc: - Several cleanups such as devres fixes. - Get rid of some languager comstructs that cause problems when compiling with LLVMs clang. - Documentation review and update" * tag 'gpio-v5.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: (85 commits) gpio: Update documentation docs: gpio: convert docs to ReST and rename to *.rst gpio: sch: Remove write-only core_base gpio: pxa: Make two symbols static gpiolib: acpi: Respect pin bias setting gpiolib: acpi: Add acpi_gpio_update_gpiod_lookup_flags() helper gpiolib: acpi: Set pin value, based on bias, more accurately gpiolib: acpi: Change type of dflags gpiolib: Introduce GPIO_LOOKUP_FLAGS_DEFAULT gpiolib: Make use of enum gpio_lookup_flags consistent gpiolib: Indent entry values of enum gpio_lookup_flags gpio: pca953x: add support for pca6416 dt-bindings: gpio: pca953x: document the nxp,pca6416 gpio: pca953x: add pcal6416 to the of_device_id table gpio: gpio-omap: Remove conditional pm_runtime handling for GPIO interrupts gpio: gpio-omap: configure edge detection for level IRQs for idle wakeup tracing: stop making gpio tracing configurable gpio: pca953x: Configure wake-up path when wake-up is enabled gpio: of: Optimize quirk checks gpio: mmio: Drop bgpio_dir_inverted ...
2019-05-11Merge tag 'vfio-v5.2-rc1' of git://github.com/awilliam/linux-vfioLinus Torvalds
Pull VFIO updates from Alex Williamson: - Improve dev_printk() usage (Bjorn Helgaas) - Fix issue with blocking in !TASK_RUNNING state while waiting for userspace to release devices (Farhan Ali) - Fix error path cleanup in nvlink setup (Greg Kurz) - mdev-core cleanups and fixes in preparation for more use cases (Parav Pandit) - Cornelia has volunteered as an official vfio reviewer (Cornelia Huck) * tag 'vfio-v5.2-rc1' of git://github.com/awilliam/linux-vfio: vfio: Add Cornelia Huck as reviewer vfio/mdev: Avoid inline get and put parent helpers vfio/mdev: Fix aborting mdev child device removal if one fails vfio/mdev: Follow correct remove sequence vfio/mdev: Avoid masking error code to EBUSY vfio/mdev: Drop redundant extern for exported symbols vfio/mdev: Removed unused kref vfio/mdev: Avoid release parent reference during error path vfio-pci/nvlink2: Fix potential VMA leak vfio: Fix WARNING "do not call blocking ops when !TASK_RUNNING" vfio: Use dev_printk() when possible
2019-05-10Merge tag 'platform-drivers-x86-v5.2-1' of ↵Linus Torvalds
git://git.infradead.org/linux-platform-drivers-x86 Pull x86 platform driver updates from Andy Shevchenko: "Gathered pile of patches for Platform Drivers x86. No surprises and no merge conflicts. Business as usual. Summary: - New driver of power button for Basin Cove PMIC. - ASUS WMI driver has got a Fn lock mode switch support. - Resolve a never end story with non working Wi-Fi on newer Lenovo Ideapad computers. Now the black list is replaced with white list. - New facility to debug S0ix failures on Intel Atom platforms. The Intel PMC and accompanying drivers are cleaned up. - Mellanox got a new TmFifo driver. Besides tachometer sensor and watchdog are enabled on Mellanox platforms. - The information of embedded controller is now recognized on new Thinkpads. Bluetooth driver on Thinkpads is blacklisted for some models. - Touchscreen DMI driver extended to support 'jumper ezpad 6 pro b' and Myria MY8307 2-in-1. - Additionally few small fixes here and there for WMI and ACPI laptop drivers. - The following is an automated git shortlog grouped by driver: - alienware-wmi: - printing the wrong error code - fix kfree on potentially uninitialized pointer - asus-wmi: - Add fn-lock mode switch support - dell-laptop: - fix rfkill functionality - dell-rbtn: - Add missing #include - ideapad-laptop: - Remove no_hw_rfkill_list - intel_pmc_core: - Allow to dump debug registers on S0ix failure - Convert to a platform_driver - Mark local function static - intel_pmc_ipc: - Don't map non-used optional resources - Apply same width for offset definitions - Use BIT() macro - adding error handling - intel_punit_ipc: - Revert "Fix resource ioremap warning" - mlx-platform: - Add mlx-wdt platform driver activation - Add support for tachometer speed register - Add TmFifo driver for Mellanox BlueField Soc - sony-laptop: - Fix unintentional fall-through - thinkpad_acpi: - cleanup for Thinkpad ACPI led - Mark expected switch fall-throughs - fix spelling mistake "capabilites" -> "capabilities" - Read EC information on newer models - Disable Bluetooth for some machines - touchscreen_dmi: - Add info for 'jumper ezpad 6 pro b' touchscreen - Add info for Myria MY8307 2-in-1" * tag 'platform-drivers-x86-v5.2-1' of git://git.infradead.org/linux-platform-drivers-x86: (26 commits) platform/x86: Add support for Basin Cove power button platform/x86: asus-wmi: Add fn-lock mode switch support platform/x86: ideapad-laptop: Remove no_hw_rfkill_list platform/x86: touchscreen_dmi: Add info for 'jumper ezpad 6 pro b' touchscreen platform/x86: thinkpad_acpi: cleanup for Thinkpad ACPI led platform/x86: thinkpad_acpi: Mark expected switch fall-throughs platform/x86: sony-laptop: Fix unintentional fall-through platform/x86: alienware-wmi: printing the wrong error code platform/x86: intel_pmc_core: Allow to dump debug registers on S0ix failure platform/x86: intel_pmc_core: Convert to a platform_driver platform/x86: mlx-platform: Add mlx-wdt platform driver activation platform/x86: mlx-platform: Add support for tachometer speed register platform/mellanox: Add TmFifo driver for Mellanox BlueField Soc platform/x86: thinkpad_acpi: fix spelling mistake "capabilites" -> "capabilities" platform/x86: intel_punit_ipc: Revert "Fix resource ioremap warning" platform/x86: intel_pmc_ipc: Don't map non-used optional resources platform/x86: intel_pmc_ipc: Apply same width for offset definitions platform/x86: intel_pmc_ipc: Use BIT() macro platform/x86: alienware-wmi: fix kfree on potentially uninitialized pointer platform/x86: dell-laptop: fix rfkill functionality ...
2019-05-10Merge tag 'fbdev-v5.2' of git://github.com/bzolnier/linuxLinus Torvalds
Pull fbdev updates from Bartlomiej Zolnierkiewicz: "Four small fixes for fb core, updates for udlfb, sm712fb, macfb and atafb drivers. Redundant code removals from amba-clcd and atmel_lcdfb drivers. Minor fixes/cleanups for other fb drivers Detailed summary: - fix regression in fbcon logo handling on 'quiet' boots (Andreas Schwab) - fix divide-by-zero error in fb_var_to_videomode() (Shile Zhang) - fix 'WARNING in __alloc_pages_nodemask' bug (Jiufei Xue) - list all PCI memory BARs as conflicting apertures (Gerd Hoffmann) - update udlfb driver: fix sleeping inside spinlock, add mutex around rendering calls and remove redundant code (Mikulas Patocka) - update sm712fb driver: fix SM720 support related issues (Yifeng Li) - update macfb driver: fix DAFB colour table pointer initialization and remove redundant code (Finn Thain) - update atafb driver: fix kexec support, use dev_*() calls instead of printk() and remove obsolete module support (Geert Uytterhoeven) - add support to mxsfb driver for skipping display initialization for flicker-free display takeover from bootloader (Melchior Franz) - remove Versatile and Nomadik board families support from amba-clcd driver as they are handled by DRM driver nowadays (Linus Walleij) - remove no longer needed AVR and platform_data support from atmel_lcdfb driver (Alexandre Belloni) - misc fixes (Colin Ian King, Julia Lawall, Gustavo A. R. Silva, Aditya Pakki, Kangjie Lu, YueHaibing) - misc cleanups (Enrico Weigelt, Kefeng Wang)" * tag 'fbdev-v5.2' of git://github.com/bzolnier/linux: (38 commits) video: fbdev: Use dev_get_drvdata() fbcon: Don't reset logo_shown when logo is currently shown video: fbdev: atmel_lcdfb: remove set but not used variable 'pdata' video: fbdev: mxsfb: remove set but not used variable 'line_count' video: fbdev: pvr2fb: remove set but not used variable 'size' fbdev: fix WARNING in __alloc_pages_nodemask bug video: amba-clcd: Decomission Versatile and Nomadik fbdev: sm712fb: fix memory frequency by avoiding a switch/case fallthrough fbdev: fix divide error in fb_var_to_videomode fbdev: sm712fb: use 1024x768 by default on non-MIPS, fix garbled display fbdev: sm712fb: fix support for 1024x768-16 mode fbdev: sm712fb: fix crashes and garbled display during DPMS modesetting fbdev: sm712fb: fix crashes during framebuffer writes by correctly mapping VRAM fbdev: sm712fb: fix boot screen glitch when sm712fb replaces VGA fbdev: sm712fb: fix VRAM detection, don't set SR70/71/74/75 fbdev: sm712fb: fix brightness control on reboot, don't set SR30 fbdev: sm712fb: fix white screen of death on reboot, don't set CR3B-CR3F video: imsttfb: fix potential NULL pointer dereferences video: hgafb: fix potential NULL pointer dereference fbdev: list all pci memory bars as conflicting apertures ...
2019-05-10Merge tag 'pwm/for-5.2-rc1' of ↵Linus Torvalds
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm Pull pwm updates from Thierry Reding: "Nothing out of the ordinary this cycle. The bulk of this is a collection of fixes for existing drivers and some cleanups. There's one new driver for i.MX SoCs and addition of support for some new variants to existing drivers" * tag 'pwm/for-5.2-rc1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm: pwm: meson: Add clock source configuration for Meson G12A dt-bindings: pwm: Update bindings for the Meson G12A Family pwm: samsung: Don't uses devm_*() functions in ->request() pwm: Clear chip_data in pwm_put() pwm: Add i.MX TPM PWM driver support dt-bindings: pwm: Add i.MX TPM PWM binding pwm: imx27: Use devm_platform_ioremap_resource() to simplify code pwm: meson: Use the spin-lock only to protect register modifications pwm: meson: Don't disable PWM when setting duty repeatedly pwm: meson: Consider 128 a valid pre-divider pwm: sysfs: fix typo "its" -> "it's" pwm: tiehrpwm: Enable compilation for ARCH_K3 dt-bindings: pwm: tiehrpwm: Add TI AM654 SoC specific compatible pwm: tiehrpwm: Update shadow register for disabling PWMs pwm: img: Turn final 'else if' into 'else' in img_pwm_config pwm: Fix deadlock warning when removing PWM device
2019-05-10Merge tag 'mailbox-v5.2' of ↵Linus Torvalds
git://git.linaro.org/landing-teams/working/fujitsu/integration Pull mailbox updates from Jassi Brar: - New driver: Armada 37xx mailbox controller - Misc: Use devm_ api for imx and platform_get_irq for stm32 * tag 'mailbox-v5.2' of git://git.linaro.org/landing-teams/working/fujitsu/integration: mailbox: Add support for Armada 37xx rWTM mailbox dt-bindings: mailbox: Document armada-3700-rwtm-mailbox binding mailbox: stm32-ipcc: check invalid irq mailbox: imx: use devm_platform_ioremap_resource() to simplify code
2019-05-10Merge tag 'powerpc-5.2-1' of ↵Linus Torvalds
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "Slightly delayed due to the issue with printk() calling probe_kernel_read() interacting with our new user access prevention stuff, but all fixed now. The only out-of-area changes are the addition of a cpuhp_state, small additions to Documentation and MAINTAINERS updates. Highlights: - Support for Kernel Userspace Access/Execution Prevention (like SMAP/SMEP/PAN/PXN) on some 64-bit and 32-bit CPUs. This prevents the kernel from accidentally accessing userspace outside copy_to/from_user(), or ever executing userspace. - KASAN support on 32-bit. - Rework of where we map the kernel, vmalloc, etc. on 64-bit hash to use the same address ranges we use with the Radix MMU. - A rewrite into C of large parts of our idle handling code for 64-bit Book3S (ie. power8 & power9). - A fast path entry for syscalls on 32-bit CPUs, for a 12-17% speedup in the null_syscall benchmark. - On 64-bit bare metal we have support for recovering from errors with the time base (our clocksource), however if that fails currently we hang in __delay() and never crash. We now have support for detecting that case and short circuiting __delay() so we at least panic() and reboot. - Add support for optionally enabling the DAWR on Power9, which had to be disabled by default due to a hardware erratum. This has the effect of enabling hardware breakpoints for GDB, the downside is a badly behaved program could crash the machine by pointing the DAWR at cache inhibited memory. This is opt-in obviously. - xmon, our crash handler, gets support for a read only mode where operations that could change memory or otherwise disturb the system are disabled. Plus many clean-ups, reworks and minor fixes etc. Thanks to: Christophe Leroy, Akshay Adiga, Alastair D'Silva, Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Anton Blanchard, Ben Hutchings, Bo YU, Breno Leitao, Cédric Le Goater, Christopher M. Riedl, Christoph Hellwig, Colin Ian King, David Gibson, Ganesh Goudar, Gautham R. Shenoy, George Spelvin, Greg Kroah-Hartman, Greg Kurz, Horia Geantă, Jagadeesh Pagadala, Joel Stanley, Joe Perches, Julia Lawall, Laurentiu Tudor, Laurent Vivier, Lukas Bulwahn, Madhavan Srinivasan, Mahesh Salgaonkar, Mathieu Malaterre, Michael Neuling, Mukesh Ojha, Nathan Fontenot, Nathan Lynch, Nicholas Piggin, Nick Desaulniers, Oliver O'Halloran, Peng Hao, Qian Cai, Ravi Bangoria, Rick Lindsley, Russell Currey, Sachin Sant, Stewart Smith, Sukadev Bhattiprolu, Thomas Huth, Tobin C. Harding, Tyrel Datwyler, Valentin Schneider, Wei Yongjun, Wen Yang, YueHaibing" * tag 'powerpc-5.2-1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (205 commits) powerpc/64s: Use early_mmu_has_feature() in set_kuap() powerpc/book3s/64: check for NULL pointer in pgd_alloc() powerpc/mm: Fix hugetlb page initialization ocxl: Fix return value check in afu_ioctl() powerpc/mm: fix section mismatch for setup_kup() powerpc/mm: fix redundant inclusion of pgtable-frag.o in Makefile powerpc/mm: Fix makefile for KASAN powerpc/kasan: add missing/lost Makefile selftests/powerpc: Add a signal fuzzer selftest powerpc/booke64: set RI in default MSR ocxl: Provide global MMIO accessors for external drivers ocxl: move event_fd handling to frontend ocxl: afu_irq only deals with IRQ IDs, not offsets ocxl: Allow external drivers to use OpenCAPI contexts ocxl: Create a clear delineation between ocxl backend & frontend ocxl: Don't pass pci_dev around ocxl: Split pci.c ocxl: Remove some unused exported symbols ocxl: Remove superfluous 'extern' from headers ocxl: read_pasid never returns an error, so make it void ...
2019-05-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: "Several bug fixes, many are quick merge-window regression cures: - When NLM_F_EXCL is not set, allow same fib rule insertion. From Hangbin Liu. - Several cures in sja1105 DSA driver (while loop exit condition fix, return of negative u8, etc.) from Vladimir Oltean. - Handle tx/rx delays in realtek PHY driver properly, from Serge Semin. - Double free in cls_matchall, from Pieter Jansen van Vuuren. - Disable SIOCSHWTSTAMP in macvlan/vlan containers, from Hangbin Liu. - Endainness fixes in aqc111, from Oliver Neukum. - Handle errors in packet_init properly, from Haibing Yue. - Various W=1 warning fixes in kTLS, from Jakub Kicinski" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (34 commits) nfp: add missing kdoc net/tls: handle errors from padding_length() net/tls: remove set but not used variables docs/btf: fix the missing section marks nfp: bpf: fix static check error through tightening shift amount adjustment selftests: bpf: initialize bpf_object pointers where needed packet: Fix error path in packet_init net/tcp: use deferred jump label for TCP acked data hook net: aquantia: fix undefined devm_hwmon_device_register_with_info reference aqc111: fix double endianness swap on BE aqc111: fix writing to the phy on BE aqc111: fix endianness issue in aqc111_change_mtu vlan: disable SIOCSHWTSTAMP in container macvlan: disable SIOCSHWTSTAMP in container tipc: fix hanging clients using poll with EPOLLOUT flag tuntap: synchronize through tfiles array instead of tun->numqueues tuntap: fix dividing by zero in ebpf queue selection dwmac4_prog_mtl_tx_algorithms() missing write operation ptp_qoriq: fix NULL access if ptp dt node missing net/sched: avoid double free on matchall reoffload ...
2019-05-09Merge tag 'clk-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux Pull clk framework updates from Stephen Boyd: "We have a couple new features and changes in the core clk framework this time around because we've finally gotten around to fixing some long standing issues. There's still work to do though, so this pull request is largely laying down the foundation for all the driver changes to come in the next merge window. The first problem we're alleviating is how parents of clks are specified. With the new method, we should see lots of drivers migrate away from the current design of string comparisons on the entire clk tree to a more direct method where they can use clk_hw pointers or more localized names specified in DT or via clkdev. This should reduce our reliance on string comparisons for all the topology description logic that we've been using for years and hopefully speed some things up while avoiding problems we have with generating clk names. Beyond that we also got rid of the CLK_IS_BASIC flag because it wasn't really helping anyone and we introduced big-endian versions of the basic clk types so that we can get rid of clk_{readl,writel}(). Both of these are things that driver developers have tried to use over the years that I typically bat away during code reviews because they're not useful. It's great to see these two things go away so maintainers can save time not worrying about these things. On the driver side we got the usual collection of new SoC support and non-critical fixes and updates to existing code. The big topics that stand out are the new driver support for Mediatek MT8183 and MT8516 SoCs, Amlogic Meson8b and G12a SoCs, and the SiFive FU540 SoC. The other patches in the driver pile are mostly fixes for things that are being used for the first time or additions for clks that couldn't be tested before because there wasn't a consumer driver that exercised them. Details are below and also in the sub-maintainer tags. Core: - Remove clk_readl() and introduce BE versions of basic clk types - Rewrite how clk parents can be specified to allow DT/clkdev lookups - Removal of the CLK_IS_BASIC clk flag - Framework documentation updates and fixes New Drivers: - Support for STM32F769 - AT91 sam9x60 PMC support - SiFive FU540 PRCI and PLL support - Qualcomm QCS404 CDSP clk support - Qualcomm QCS404 Turing clk support - Mediatek MT8183 clock support - Mediatek MT8516 clock support - Milbeaut M10V clk controller support - Support for Cirrus Logic Lochnagar clks Updates: - Rework AT91 sckc DT bindings - Fix slow RC oscillator issue on sama5d3 - Mark UFS clk as critical on Hi-Silicon hi3660 SoCs - Various static analysis fixes/finds and const markings - Video Engine (ECLK) support on Aspeed SoCs - Xilinx ZynqMP Versal platform support - Convert Xilinx ZynqMP driver to be struct oriented - Fixes for Rockchip rk3328 and rk3288 SoCs - Sub-type for Rockchip SoCs where mux and divider aren't a single register - Remove SNVS clock from i.MX7UPL clock driver and bindings - Improve i.MX5 clock driver for i.MX50 support - Addition of ADC clock definition for Exynos 5410 SoC (Odroid XU) - Export a new clock for the MBUS controller on the A13 - Allwinner H6 fixes to support a finer clocking of the video and VPU engines - Add g12a support in the Amlogic axg audio clock controller - Add missing PCI USB clock on Rensas RZ/N1 - Add Z2 (Cortex-A53) clocks on Rensas R-Car E3 and RZ/G2E - A new helper DIV64_U64_ROUND_CLOSEST() in <linux/math64.h> - VPU and Video Decoder clocks on Amlogic Meson8b - Finally remove the wrong ABP Meson8b clock id - Add Video Decoder, PCIe PLL, and CPU Clocks on Amlogic G12A - Re-expose SAR_ADC_SEL and CTS_OSCIN on Amlogic G12A AO clock controller - Un-expose some Amlogic AXG-Audio input clocks IDs" * tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: (172 commits) clk: Cache core in clk_fetch_parent_index() without names clk: imx: correct pfdv2 gate_bit/vld_bit operations clk: sifive: add a driver for the SiFive FU540 PRCI IP block clk: analogbits: add Wide-Range PLL library clk: imx: clk-pllv3: mark expected switch fall-throughs clk: imx8mq: Add dsi_ipg_div clk: imx: pllv4: add fractional-N pll support clk: sunxi-ng: Use the correct style for SPDX License Identifier clk: sprd: Use the correct style for SPDX License Identifier clk: renesas: Use the correct style for SPDX License Identifier clk: qcom: Use the correct style for SPDX License Identifier clk: davinci: Use the correct style for SPDX License Identifier clk: actions: Use the correct style for SPDX License Identifier clk: imx: keep uart clock on during system boot clk: imx: correct i.MX7D AV PLL num/denom offset dt-bindings: clk: add documentation for the SiFive PRCI driver clk: stm32mp1: Add ddrperfm clock clk: Remove CLK_IS_BASIC clk flag clock: milbeaut: Add Milbeaut M10V clock controller dt-bindings: clock: milbeaut: add Milbeaut clock description ...
2019-05-09Merge tag 'rtc-5.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux Pull RTC updates from Alexandre Belloni: "A huge series from me this cycle. I went through many drivers to set the date and time range supported by the RTC which helps solving HW limitation when the time comes (as early as next year for some). This time, I focused on drivers using .set_mms and .set_mmss64, allowing me to remove those callbacks. About a third of the patches got reviews, I actually own the RTCs and I tested another third and the remaining one are unlikely to cause any issues. Other than that, a single new driver and the usual fixes here and there. Summary: Subsystem: - set_mmss and set_mmss64 rtc_ops removal - Fix timestamp value for RTC_TIMESTAMP_BEGIN_1900 - Use SPDX identifier for the core - validate upper bound of tm->tm_year New driver: - Aspeed BMC SoC RTC Drivers: - abx80x: use rtc_add_group - ds3232: nvram support - pcf85063: add alarm, nvram, offset correction and microcrystal rv8263 support - x1205: add of_match_table - Use set_time instead of set_mms/set_mmss64 for: ab3100, coh901331, digicolor, ds1672, ds2404, ep93xx, imxdi, jz4740, lpc32xx, mc13xxx, mxc, pcap, stmp3xxx, test, wm831x, xgene. - Set RTC range for: ab3100, at91sam9, coh901331, da9063, digicolor, dm355evm, ds1672, ds2404, ep39xx, goldfish, imxdi, jz4740, lpc32xx, mc13xxx, mv, mxc, omap, pcap, pcf85063, pcf85363, ps3, sh, stmp3xxx, sun4v, tegra, wm831x, xgene. - Switch to rtc_time64_to_tm/rtc_tm_to_time64 for the driver that properly set the RTC range. - Use dev_get_drvdata instead of multiple indirections" * tag 'rtc-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (177 commits) rtc: snvs: Use __maybe_unused instead of #if CONFIG_PM_SLEEP rtc: imxdi: remove unused variable rtc: drop set_mms and set_mmss64 rtc: pcap: convert to SPDX identifier rtc: pcap: use .set_time rtc: pcap: switch to rtc_time64_to_tm/rtc_tm_to_time64 rtc: pcap: set range rtc: digicolor: convert to SPDX identifier rtc: digicolor: use .set_time rtc: digicolor: set range rtc: digicolor: fix possible race condition rtc: jz4740: convert to SPDX identifier rtc: jz4740: rework invalid time detection rtc: jz4740: use dev_pm_set_wake_irq() to simplify code rtc: jz4740: use .set_time rtc: jz4740: remove useless check rtc: jz4740: switch to rtc_time64_to_tm/rtc_tm_to_time64 rtc: jz4740: set range rtc: 88pm860x: prevent use-after-free on device remove rtc: Use dev_get_drvdata() ...
2019-05-09Merge branch 'i2c/for-5.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux Pull i2c updates from Wolfram Sang: - API for late atomic transfers (e.g. to shut down via PMIC). We have a seperate callback now which is called under clearly defined conditions. In-kernel users are converted, too. - new driver for the AMD PCIe MP2 I2C controller - large refactoring for at91 and bcm-iproc (both gain slave support due to this) - and a good share of various driver improvements anf fixes * 'i2c/for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (57 commits) dt-bindings: i2c: riic: document r7s9210 support i2c: imx-lpi2c: Use __maybe_unused instead of #if CONFIG_PM_SLEEP i2c-piix4: Add Hygon Dhyana SMBus support i2c: core: apply 'is_suspended' check for SMBus, too i2c: core: ratelimit 'transfer when suspended' errors i2c: iproc: Change driver to use 'BIT' macro i2c: riic: Add Runtime PM support i2c: mux: demux-pinctrl: use struct_size() in devm_kzalloc() i2c: mux: pca954x: allow management of device idle state via sysfs i2c: mux: pca9541: remove support for unused platform data i2c: mux: pca954x: remove support for unused platform data dt-bindings: i2c: i2c-mtk: add support for MT8516 i2c: axxia: use auto cmd for last message i2c: gpio: flag atomic capability if possible i2c: algo: bit: add flag to whitelist atomic transfers i2c: stu300: use xfer_atomic callback to bail out early i2c: ocores: enable atomic xfers i2c: ocores: refactor setup for polling i2c: tegra-bpmp: convert to use new atomic callbacks i2c: omap: Add the master_xfer_atomic hook ...
2019-05-09Merge tag 'nfs-for-5.2-1' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds
Pull NFS client updates from Anna Schumaker: "Highlights include: Stable bugfixes: - Fall back to MDS if no deviceid is found rather than aborting # v4.11+ - NFS4: Fix v4.0 client state corruption when mount Features: - Much improved handling of soft mounts with NFS v4.0: - Reduce risk of false positive timeouts - Faster failover of reads and writes after a timeout - Added a "softerr" mount option to return ETIMEDOUT instead of EIO to the application after a timeout - Increase number of xprtrdma backchannel requests - Add additional xprtrdma tracepoints - Improved send completion batching for xprtrdma Other bugfixes and cleanups: - Return -EINVAL when NFS v4.2 is passed an invalid dedup mode - Reduce usage of GFP_ATOMIC pages in SUNRPC - Various minor NFS over RDMA cleanups and bugfixes - Use the correct container namespace for upcalls - Don't share superblocks between user namespaces - Various other container fixes - Make nfs_match_client() killable to prevent soft lockups - Don't mark all open state for recovery when handling recallable state revoked flag" * tag 'nfs-for-5.2-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (69 commits) SUNRPC: Rebalance a kref in auth_gss.c NFS: Fix a double unlock from nfs_match,get_client nfs: pass the correct prototype to read_cache_page NFSv4: don't mark all open state for recovery when handling recallable state revoked flag SUNRPC: Fix an error code in gss_alloc_msg() SUNRPC: task should be exit if encode return EKEYEXPIRED more times NFS4: Fix v4.0 client state corruption when mount PNFS fallback to MDS if no deviceid found NFS: make nfs_match_client killable lockd: Store the lockd client credential in struct nlm_host NFS: When mounting, don't share filesystems between different user namespaces NFS: Convert NFSv2 to use the container user namespace NFSv4: Convert the NFS client idmapper to use the container user namespace NFS: Convert NFSv3 to use the container user namespace SUNRPC: Use namespace of listening daemon in the client AUTH_GSS upcall SUNRPC: Use the client user namespace when encoding creds NFS: Store the credential of the mount process in the nfs_server SUNRPC: Cache cred of process creating the rpc_client xprtrdma: Remove stale comment xprtrdma: Update comments that reference ib_drain_qp ...
2019-05-09Merge branch 'for-5.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "This includes Roman's cgroup2 freezer implementation. It's a separate machanism from cgroup1 freezer. Instead of blocking user tasks in arbitrary uninterruptible sleeps, the new implementation extends jobctl stop - frozen tasks are trapped in jobctl stop until thawed and can be killed and ptraced. Lots of thanks to Oleg for sheperding the effort. Other than that, there are a few trivial changes" * 'for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: never call do_group_exit() with task->frozen bit set kernel: cgroup: fix misuse of %x cgroup: get rid of cgroup_freezer_frozen_exit() cgroup: prevent spurious transition into non-frozen state cgroup: Remove unused cgrp variable cgroup: document cgroup v2 freezer interface cgroup: add tracing points for cgroup v2 freezer cgroup: make TRACE_CGROUP_PATH irq-safe kselftests: cgroup: add freezer controller self-tests kselftests: cgroup: don't fail on cg_kill_all() error in cg_destroy() cgroup: cgroup v2 freezer cgroup: protect cgroup->nr_(dying_)descendants by css_set_lock cgroup: implement __cgroup_task_count() helper cgroup: rename freezer.c into legacy_freezer.c cgroup: remove extra cgroup_migrate_finish() call
2019-05-09Merge branch 'next-integrity' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull intgrity updates from James Morris: "This contains just three patches, the remainder were either included in other pull requests (eg. audit, lockdown) or will be upstreamed via other subsystems (eg. kselftests, Power). Included here is one bug fix, one documentation update, and extending the x86 IMA arch policy rules to coordinate the different kernel module signature verification methods" * 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: doc/kernel-parameters.txt: Deprecate ima_appraise_tcb x86/ima: add missing include x86/ima: require signed kernel modules
2019-05-09net/tcp: use deferred jump label for TCP acked data hookJakub Kicinski
User space can flip the clean_acked_data_enabled static branch on and off with TLS offload when CONFIG_TLS_DEVICE is enabled. jump_label.h suggests we use the delayed version in this case. Deferred branches now also don't take the branch mutex on decrement, so we avoid potential locking issues. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-09Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds
Pull rdma updates from Jason Gunthorpe: "This has been a smaller cycle than normal. One new driver was accepted, which is unusual, and at least one more driver remains in review on the list. Summary: - Driver fixes for hns, hfi1, nes, rxe, i40iw, mlx5, cxgb4, vmw_pvrdma - Many patches from MatthewW converting radix tree and IDR users to use xarray - Introduction of tracepoints to the MAD layer - Build large SGLs at the start for DMA mapping and get the driver to split them - Generally clean SGL handling code throughout the subsystem - Support for restricting RDMA devices to net namespaces for containers - Progress to remove object allocation boilerplate code from drivers - Change in how the mlx5 driver shows representor ports linked to VFs - mlx5 uapi feature to access the on chip SW ICM memory - Add a new driver for 'EFA'. This is HW that supports user space packet processing through QPs in Amazon's cloud" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (186 commits) RDMA/ipoib: Allow user space differentiate between valid dev_port IB/core, ipoib: Do not overreact to SM LID change event RDMA/device: Don't fire uevent before device is fully initialized lib/scatterlist: Remove leftover from sg_page_iter comment RDMA/efa: Add driver to Kconfig/Makefile RDMA/efa: Add the efa module RDMA/efa: Add EFA verbs implementation RDMA/efa: Add common command handlers RDMA/efa: Implement functions that submit and complete admin commands RDMA/efa: Add the ABI definitions RDMA/efa: Add the com service API definitions RDMA/efa: Add the efa_com.h file RDMA/efa: Add the efa.h header file RDMA/efa: Add EFA device definitions RDMA: Add EFA related definitions RDMA/umem: Remove hugetlb flag RDMA/bnxt_re: Use core helpers to get aligned DMA address RDMA/i40iw: Use core helpers to get aligned DMA address within a supported page size RDMA/verbs: Add a DMA iterator to return aligned contiguous memory blocks RDMA/umem: Add API to find best driver supported page size in an MR ...