linux.git - Linux kernel

Age	Commit message (Collapse)	Author
2020-12-16	NFS/pNFS: Fix a typo in ff_layout_resend_pnfs_read()	Trond Myklebust
	Don't bump the index twice. Fixes: 563c53e73b8b ("NFS: Fix flexfiles read failover") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-16	pNFS/flexfiles: Avoid spurious layout returns in ff_layout_choose_ds_for_read	Trond Myklebust
	The callers of ff_layout_choose_ds_for_read() should decide whether or not they want to return the layout on error. Sometimes, we may just want to retry from the beginning. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-16	NFSv4/pnfs: Add tracing for the deviceid cache	Trond Myklebust
	Add tracepoints to allow debugging of the deviceid cache. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-16	fs/lockd: convert comma to semicolon	Zheng Yongjun
	Replace a comma between expression statements by a semicolon. Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-16	NFSv4.2: fix error return on memory allocation failure	Colin Ian King
	Currently when an alloc_page fails the error return is not set in variable err and a garbage initialized value is returned. Fix this by setting err to -ENOMEM before taking the error return path. Addresses-Coverity: ("Uninitialized scalar variable") Fixes: a1f26739ccdc ("NFSv4.2: improve page handling for GETXATTR") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-15	Merge tag 'nfs-rdma-for-5.11-1' of ↵	Trond Myklebust
	git://git.linux-nfs.org/projects/anna/linux-nfs into linux-next NFSoRDmA Client updates for Linux 5.11 Cleanups and improvements: - Remove use of raw kernel memory addresses in tracepoints - Replace dprintk() call sites in ERR_CHUNK path - Trace unmap sync calls - Optimize MR DMA-unmapping Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14	NFSv4.2/pnfs: Don't use READ_PLUS with pNFS yet	Trond Myklebust
	We have no way of tracking server READ_PLUS support in pNFS for now, so just disable it. Reported-by: "Mkrtchyan, Tigran" <tigran.mkrtchyan@desy.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14	NFSv4.2: Deal with potential READ_PLUS data extent buffer overflow	Trond Myklebust
	If the server returns more data than we have buffer space for, then we need to truncate and exit early. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14	NFSv4.2: Don't error when exiting early on a READ_PLUS buffer overflow	Trond Myklebust
	Expanding the READ_PLUS extents can cause the read buffer to overflow. If it does, then don't error, but just exit early. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14	NFSv4.2: Handle hole lengths that exceed the READ_PLUS read buffer	Trond Myklebust
	If a hole extends beyond the READ_PLUS read buffer, then we want to fill just the remaining buffer with zeros. Also ignore eof... Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14	NFSv4.2: decode_read_plus_hole() needs to check the extent offset	Trond Myklebust
	The server is allowed to return a hole extent with an offset that starts before the offset supplied in the READ_PLUS argument. Ensure that we support that case too. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14	NFSv4.2: decode_read_plus_data() must skip padding after data segment	Trond Myklebust
	All XDR opaque object sizes are 32-bit aligned, and a data segment is no exception. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14	NFSv4.2: Ensure we always reset the result->count in decode_read_plus()	Trond Myklebust
	Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14	SUNRPC: When expanding the buffer, we may need grow the sparse pages	Trond Myklebust
	If we're shifting the page data to the right, and this happens to be a sparse page array, then we may need to allocate new pages in order to receive the data. Reported-by: "Mkrtchyan, Tigran" <tigran.mkrtchyan@desy.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14	SUNRPC: Cleanup - constify a number of xdr_buf helpers	Trond Myklebust
	There are a number of xdr helpers for struct xdr_buf that do not change the structure itself. Mark those as taking const pointers for documentation purposes. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14	SUNRPC: Clean up open coded setting of the xdr_stream 'nwords' field	Trond Myklebust
	Move the setting of the xdr_stream 'nwords' field into the helpers that reset the xdr_stream cursor. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14	SUNRPC: _copy_to/from_pages() now check for zero length	Trond Myklebust
	Clean up callers of _copy_to/from_pages() that still check for a zero length. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14	SUNRPC: Cleanup xdr_shrink_bufhead()	Trond Myklebust
	Clean up xdr_shrink_bufhead() to use the new helpers instead of doing its own thing. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14	SUNRPC: Fix xdr_expand_hole()	Trond Myklebust
	We do want to try to grow the buffer if possible, but if that attempt fails, we still want to move the data and truncate the XDR message. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14	SUNRPC: Fixes for xdr_align_data()	Trond Myklebust
	The main use case right now for xdr_align_data() is to shift the page data to the left, and in practice shrink the total XDR data buffer. This patch ensures that we fix up the accounting for the buffer length as we shift that data around. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14	SUNRPC: _shift_data_left/right_pages should check the shift length	Trond Myklebust
	Exit early if the shift is zero. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14	NFSv4.1: use BITS_PER_LONG macro in nfs4session.h	Geliang Tang
	Use the existing BITS_PER_LONG macro instead of calculating the value. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14	xprtrdma: Fix XDRBUF_SPARSE_PAGES support	Chuck Lever
	Olga K. observed that rpcrdma_marsh_req() allocates sparse pages only when it has determined that a Reply chunk is necessary. There are plenty of cases where no Reply chunk is needed, but the XDRBUF_SPARSE_PAGES flag is set. The result would be a crash in rpcrdma_inline_fixup() when it tries to copy parts of the received Reply into a missing page. To avoid crashing, handle sparse page allocation up front. Until XATTR support was added, this issue did not appear often because the only SPARSE_PAGES consumer always expected a reply large enough to always require a Reply chunk. Reported-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14	NFSv4.2: improve page handling for GETXATTR	Frank van der Linden
	XDRBUF_SPARSE_PAGES can cause problems for the RDMA transport, and it's easy enough to allocate enough pages for the request up front, so do that. Also, since we've allocated the pages anyway, use the full page aligned length for the receive buffer. This will allow caching of valid replies that are too large for the caller, but that still fit in the allocated pages. Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14	sunrpc: fix xs_read_xdr_buf for partial pages receive	Dan Aloni
	When receiving pages data, return value 'ret' when positive includes `buf->page_base`, so we should subtract that before it is used for changing `offset` and comparing against `want`. This was discovered on the very rare cases where the server returned a chunk of bytes that when added to the already received amount of bytes for the pages happened to match the current `recv.len`, for example on this case: buf->page_base : 258356 actually received from socket: 1740 ret : 260096 want : 260096 In this case neither of the two 'if ... goto out' trigger, and we continue to tail parsing. Worth to mention that the ensuing EMSGSIZE from the continued execution of `xs_read_xdr_buf` may be observed by an application due to 4 superfluous bytes being added to the pages data. Fixes: 277e4ab7d530 ("SUNRPC: Simplify TCP receive code by switching to using iterators") Signed-off-by: Dan Aloni <dan@kernelim.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-10	NFSv4.2: Fix up the get/listxattr calls to rpc_prepare_reply_pages()	Trond Myklebust
	Ensure that both getxattr and listxattr page array are correctly aligned, and that getxattr correctly accounts for the page padding word. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02	NFS: switch nfsiod to be an UNBOUND workqueue.	NeilBrown
	nfsiod is currently a concurrency-managed workqueue (CMWQ). This means that workitems scheduled to nfsiod on a given CPU are queued behind all other work items queued on any CMWQ on the same CPU. This can introduce unexpected latency. Occaionally nfsiod can even cause excessive latency. If the work item to complete a CLOSE request calls the final iput() on an inode, the address_space of that inode will be dismantled. This takes time proportional to the number of in-memory pages, which on a large host working on large files (e.g.. 5TB), can be a large number of pages resulting in a noticable number of seconds. We can avoid these latency problems by switching nfsiod to WQ_UNBOUND. This causes each concurrent work item to gets a dedicated thread which can be scheduled to an idle CPU. There is precedent for this as several other filesystems use WQ_UNBOUND workqueue for handling various async events. Signed-off-by: NeilBrown <neilb@suse.de> Fixes: ada609ee2ac2 ("workqueue: use WQ_MEM_RECLAIM instead of WQ_RESCUER") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02	lockd: don't use interval-based rebinding over TCP	Calum Mackay
	NLM uses an interval-based rebinding, i.e. it clears the transport's binding under certain conditions if more than 60 seconds have elapsed since the connection was last bound. This rebinding is not necessary for an autobind RPC client over a connection-oriented protocol like TCP. It can also cause problems: it is possible for nlm_bind_host() to clear XPRT_BOUND whilst a connection worker is in the middle of trying to reconnect, after it had already been checked in xprt_connect(). When the connection worker notices that XPRT_BOUND has been cleared under it, in xs_tcp_finish_connecting(), that results in: xs_tcp_setup_socket: connect returned unhandled error -107 Worse, it's possible that the two can get into lockstep, resulting in the same behaviour repeated indefinitely, with the above error every 300 seconds, without ever recovering, and the connection never being established. This has been seen in practice, with a large number of NLM client tasks, following a server restart. The existing callers of nlm_bind_host & nlm_rebind_host should not need to force the rebind, for TCP, so restrict the interval-based rebinding to UDP only. For TCP, we will still rebind when needed, e.g. on timeout, and connection error (including closure), since connection-related errors on an existing connection, ECONNREFUSED when trying to connect, and rpc_check_timeout(), already unconditionally clear XPRT_BOUND. To avoid having to add the fix, and explanation, to both nlm_bind_host() and nlm_rebind_host(), remove the duplicate code from the former, and have it call the latter. Drop the dprintk, which adds no value over a trace. Signed-off-by: Calum Mackay <calum.mackay@oracle.com> Fixes: 35f5a422ce1a ("SUNRPC: new interface to force an RPC rebind") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02	net: sunrpc: Fix 'snprintf' return value check in 'do_xprt_debugfs'	Fedor Tokarev
	'snprintf' returns the number of characters which would have been written if enough space had been available, excluding the terminating null byte. Thus, the return value of 'sizeof(buf)' means that the last character has been dropped. Signed-off-by: Fedor Tokarev <ftokarev@gmail.com> Fixes: 2f34b8bfae19 ("SUNRPC: add links for all client xprts to debugfs") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02	NFSv4: Refactor to use user namespaces for nfs4idmap	Sargun Dhillon
	In several patches work has been done to enable NFSv4 to use user namespaces: 58002399da65: NFSv4: Convert the NFS client idmapper to use the container user namespace 3b7eb5e35d0f: NFS: When mounting, don't share filesystems between different user namespaces Unfortunately, the userspace APIs were only such that the userspace facing side of the filesystem (superblock s_user_ns) could be set to a non init user namespace. This furthers the fs_context related refactoring, and piggybacks on top of that logic, so the superblock user namespace, and the NFS user namespace are the same. Users can still use rpc.idmapd if they choose to, but there are complexities with user namespaces and request-key that have yet to be addresssed. Eventually, we will need to at least: * Come up with an upcall mechanism that can be triggered inside of the container, or safely triggered outside, with the requisite context to do the right mapping. * Handle whatever refactoring needs to be done in net/sunrpc. Signed-off-by: Sargun Dhillon <sargun@sargun.me> Tested-by: Alban Crequy <alban.crequy@gmail.com> Fixes: 62a55d088cd8 ("NFS: Additional refactoring for fs_context conversion") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02	NFS: NFSv2/NFSv3: Use cred from fs_context during mount	Sargun Dhillon
	There was refactoring done to use the fs_context for mounting done in: 62a55d088cd87: NFS: Additional refactoring for fs_context conversion This made it so that the net_ns is fetched from the fs_context (the netns that fsopen is called in). This change also makes it so that the credential fetched during fsopen is used as well as the net_ns. NFS has already had a number of changes to prepare it for user namespaces: 1a58e8a0e5c1: NFS: Store the credential of the mount process in the nfs_server 264d948ce7d0: NFS: Convert NFSv3 to use the container user namespace c207db2f5da5: NFS: Convert NFSv2 to use the container user namespace Previously, different credentials could be used for creation of the fs_context versus creation of the nfs_server, as FSCONFIG_CMD_CREATE did the actual credential check, and that's where current_creds() were fetched. This meant that the user namespace which fsopen was called in could be a non-init user namespace. This still requires that the user that calls FSCONFIG_CMD_CREATE has CAP_SYS_ADMIN in the init user ns. This roughly allows a privileged user to mount on behalf of an unprivileged usernamespace, by forking off and calling fsopen in the unprivileged user namespace. It can then pass back that fsfd to the privileged process which can configure the NFS mount, and then it can call FSCONFIG_CMD_CREATE before switching back into the mount namespace of the container, and finish up the mounting process and call fsmount and move_mount. Signed-off-by: Sargun Dhillon <sargun@sargun.me> Tested-by: Alban Crequy <alban.crequy@gmail.com> Fixes: 62a55d088cd8 ("NFS: Additional refactoring for fs_context conversion") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02	NFSv4: Fix a pNFS layout related use-after-free race when freeing the inode	Trond Myklebust
	When returning the layout in nfs4_evict_inode(), we need to ensure that the layout is actually done being freed before we can proceed to free the inode itself. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02	NFSv4: Fix open coded xdr_stream_remaining()	Trond Myklebust
	Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02	SUNRPC: Fix open coded xdr_stream_remaining()	Trond Myklebust
	Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02	SUNRPC: Fix up xdr_set_page()	Trond Myklebust
	While we always want to align to the next page and/or the beginning of the tail buffer when we call xdr_set_next_page(), the functions xdr_align_data() and xdr_expand_hole() really want to align to the next object in that next page or tail. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02	SUNRPC: Clean up the handling of page padding in rpc_prepare_reply_pages()	Trond Myklebust
	rpc_prepare_reply_pages() currently expects the 'hdrsize' argument to contain the length of the data that we expect to want placed in the head kvec plus a count of 1 word of padding that is placed after the page data. This is very confusing when trying to read the code, and sometimes leads to callers adding an arbitrary value of '1' just in order to satisfy the requirement (whether or not the page data actually needs such padding). This patch aims to clarify the code by changing the 'hdrsize' argument to remove that 1 word of padding. This means we need to subtract the padding from all the existing callers. Fixes: 02ef04e432ba ("NFS: Account for XDR pad of buf->pages") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02	SUNRPC: Fix up xdr_read_pages() to take arbitrary object lengths	Trond Myklebust
	Fix up xdr_read_pages() so that it can handle object lengths that are larger than the page length, by simply aligning to the next object in the buffer tail. The function will continue to return the length of the truncate object data that actually fit into the pages. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02	SUNRPC: Clean up helpers xdr_set_iov() and xdr_set_page_base()	Trond Myklebust
	Allow xdr_set_iov() to set a base so that we can use it to set the cursor to a specific position in the kvec buffer. If the new base overflows the kvec/pages buffer in either xdr_set_iov() or xdr_set_page_base(), then truncate it so that we point to the end of the buffer. Finally, change both function to return the number of bytes remaining to read in their buffers. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02	SUNRPC: Fix up typo in xdr_init_decode()	Trond Myklebust
	We already know that the head buffer and page are empty, so if there is any data, it is in the tail. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02	NFSv4: Fix the alignment of page data in the getdeviceinfo reply	Trond Myklebust
	We can fit the device_addr4 opaque data padding in the pages. Fixes: cf500bac8fd4 ("SUNRPC: Introduce rpc_prepare_reply_pages()") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02	pNFS: Clean up open coded xdr string decoding	Trond Myklebust
	Use the existing xdr_stream_decode_string_dup() to safely decode into kmalloced strings. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02	SUNRPC: Fix up open coded kmemdup_nul()	Trond Myklebust
	Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02	pNFS/flexfiles: Fix up layoutstats reporting for non-TCP transports	Trond Myklebust
	Ensure that we report the correct netid when using UDP or RDMA transports to the DSes. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02	NFSv4/pNFS: Store the transport type in struct nfs4_pnfs_ds_addr	Trond Myklebust
	We want to enable RDMA and UDP as valid transport methods if a GETDEVICEINFO call specifies it. Do so by adding a parser for the netid that translates it to an appropriate argument for the RPC transport layer. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02	pNFS: Add helpers for allocation/free of struct nfs4_pnfs_ds_addr	Trond Myklebust
	Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02	NFSv4/pNFS: Use connections to a DS that are all of the same protocol family	Trond Myklebust
	If the pNFS metadata server advertises multiple addresses for the same data server, we should try to connect to just one protocol family and transport type on the assumption that homogeneity will improve performance. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02	SUNRPC: Remove unused function xprt_load_transport()	Trond Myklebust
	Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02	NFS: Switch mount code to use xprt_find_transport_ident()	Trond Myklebust
	Switch the mount code to use xprt_find_transport_ident() and to check the results before allowing the mount to proceed. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02	SUNRPC: Add a helper to return the transport identifier given a netid	Trond Myklebust
	Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02	SUNRPC: Close a race with transport setup and module put	Trond Myklebust
	After we've looked up the transport module, we need to ensure it can't go away until we've finished running the transport setup code. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>