aboutsummaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2019-12-09NFSD generalize nfsd4_compound_state flag namesOlga Kornievskaia
Allow for sid_flag field non-stateid use. Signed-off-by: Andy Adamson <andros@netapp.com>
2019-12-09NFSD check stateids against copy stateidsOlga Kornievskaia
Incoming stateid (used by a READ) could be a saved copy stateid. Using the provided stateid, look it up in the list of copy_notify stateids. If found, use the parent's stateid and parent's clid to look up the parent's stid to do the appropriate checks. Update the copy notify timestamp (cpntf_time) with current time this making it 'active' so that laundromat thread will not delete copy notify state. Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
2019-12-09NFSD add COPY_NOTIFY operationOlga Kornievskaia
Introducing the COPY_NOTIFY operation. Create a new unique stateid that will keep track of the copy state and the upcoming READs that will use that stateid. Each associated parent stateid has a list of copy notify stateids. A copy notify structure makes a copy of the parent stateid and a clientid and will use it to look up the parent stateid during the READ request (suggested by Trond Myklebust <trond.myklebust@hammerspace.com>). At nfs4_put_stid() time, we walk the list of the associated copy notify stateids and delete them. Laundromat thread will traverse globally stored copy notify stateid in idr and notice if any haven't been referenced in the lease period, if so, it'll remove them. Return single netaddr to advertise to the copy. Suggested-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Andy Adamson <andros@netapp.com>
2019-12-09NFSD COPY_NOTIFY xdrOlga Kornievskaia
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
2019-12-09NFSD add ca_source_server<> to COPYOlga Kornievskaia
Decode the ca_source_server list that's sent but only use the first one. Presence of non-zero list indicates an "inter" copy. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
2019-12-09NFSD fill-in netloc4 structureOlga Kornievskaia
nfs.4 defines nfs42_netaddr structure that represents netloc4. Populate needed fields from the sockaddr structure. This will be used by flexfiles and 4.2 inter copy Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
2019-12-08Merge tag '5.5-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull cifs fixes from Steve French: "Nine cifs/smb3 fixes: - one fix for stable (oops during oplock break) - two timestamp fixes including important one for updating mtime at close to avoid stale metadata caching issue on dirty files (also improves perf by using SMB2_CLOSE_FLAG_POSTQUERY_ATTRIB over the wire) - two fixes for "modefromsid" mount option for file create (now allows mode bits to be set more atomically and accurately on create by adding "sd_context" on create when modefromsid specified on mount) - two fixes for multichannel found in testing this week against different servers - two small cleanup patches" * tag '5.5-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6: smb3: improve check for when we send the security descriptor context on create smb3: fix mode passed in on create for modetosid mount option cifs: fix possible uninitialized access and race on iface_list cifs: Fix lookup of SMB connections on multichannel smb3: query attributes on file close smb3: remove unused flag passed into close functions cifs: remove redundant assignment to pointer pneg_ctxt fs: cifs: Fix atime update check vs mtime CIFS: Fix NULL-pointer dereference in smb2_push_mandatory_locks
2019-12-08Merge branch 'work.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs cleanups from Al Viro: "No common topic, just three cleanups". * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: make __d_alloc() static fs/namespace: add __user to open_tree and move_mount syscalls fs/fnctl: fix missing __user in fcntl_rw_hint()
2019-12-07Merge tag 'iomap-5.5-merge-14' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull iomap fixes from Darrick Wong: "Fix a race condition and a use-after-free error: - Fix a UAF when reporting writeback errors - Fix a race condition when handling page uptodate on fragmented file with blocksize < pagesize" * tag 'iomap-5.5-merge-14' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: iomap: stop using ioend after it's been freed in iomap_finish_ioend() iomap: fix sub-page uptodate handling
2019-12-07Merge tag 'xfs-5.5-merge-17' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs fixes from Darrick Wong: "Fix a couple of resource management errors and a hang: - fix a crash in the log setup code when log mounting fails - fix a hang when allocating space on the realtime device - fix a block leak when freeing space on the realtime device" * tag 'xfs-5.5-merge-17' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: fix mount failure crash on invalid iclog memory access xfs: don't check for AG deadlock for realtime files in bunmapi xfs: fix realtime file data space leak
2019-12-07Merge tag 'for-linus-5.5-ofs1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux Pull orangefs update from Mike Marshall: "orangefs: posix open permission checking... Orangefs has no open, and orangefs checks file permissions on each file access. Posix requires that file permissions be checked on open and nowhere else. Orangefs-through-the-kernel needs to seem posix compliant. The VFS opens files, even if the filesystem provides no method. We can see if a file was successfully opened for read and or for write by looking at file->f_mode. When writes are flowing from the page cache, file is no longer available. We can trust the VFS to have checked file->f_mode before writing to the page cache. The mode of a file might change between when it is opened and IO commences, or it might be created with an arbitrary mode. We'll make sure we don't hit EACCES during the IO stage by using UID 0" [ This is "posixish", but not a great solution in the long run, since a proper secure network server shouldn't really trust the client like this. But proper and secure POSIX behavior requires an open method and a resulting cookie for IO of some kind, or similar. - Linus ] * tag 'for-linus-5.5-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux: orangefs: posix open permission checking...
2019-12-07Merge tag 'nfsd-5.5' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull nfsd updates from Bruce Fields: "This is a relatively quiet cycle for nfsd, mainly various bugfixes. Possibly most interesting is Trond's fixes for some callback races that were due to my incomplete understanding of rpc client shutdown. Unfortunately at the last minute I've started noticing a new intermittent failure to send callbacks. As the logic seems basically correct, I'm leaving Trond's patches in for now, and hope to find a fix in the next week so I don't have to revert those patches" * tag 'nfsd-5.5' of git://linux-nfs.org/~bfields/linux: (24 commits) nfsd: depend on CRYPTO_MD5 for legacy client tracking NFSD fixing possible null pointer derefering in copy offload nfsd: check for EBUSY from vfs_rmdir/vfs_unink. nfsd: Ensure CLONE persists data and metadata changes to the target file SUNRPC: Fix backchannel latency metrics nfsd: restore NFSv3 ACL support nfsd: v4 support requires CRYPTO_SHA256 nfsd: Fix cld_net->cn_tfm initialization lockd: remove __KERNEL__ ifdefs sunrpc: remove __KERNEL__ ifdefs race in exportfs_decode_fh() nfsd: Drop LIST_HEAD where the variable it declares is never used. nfsd: document callback_wq serialization of callback code nfsd: mark cb path down on unknown errors nfsd: Fix races between nfsd4_cb_release() and nfsd4_shutdown_callback() nfsd: minor 4.1 callback cleanup SUNRPC: Fix svcauth_gss_proxy_init() SUNRPC: Trace gssproxy upcall results sunrpc: fix crash when cache_head become valid before update nfsd: remove private bin2hex implementation ...
2019-12-07Merge tag 'nfs-for-5.5-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client updates from Trond Myklebust: "Highlights include: Features: - NFSv4.2 now supports cross device offloaded copy (i.e. offloaded copy of a file from one source server to a different target server). - New RDMA tracepoints for debugging congestion control and Local Invalidate WRs. Bugfixes and cleanups - Drop the NFSv4.1 session slot if nfs4_delegreturn_prepare waits for layoutreturn - Handle bad/dead sessions correctly in nfs41_sequence_process() - Various bugfixes to the delegation return operation. - Various bugfixes pertaining to delegations that have been revoked. - Cleanups to the NFS timespec code to avoid unnecessary conversions between timespec and timespec64. - Fix unstable RDMA connections after a reconnect - Close race between waking an RDMA sender and posting a receive - Wake pending RDMA tasks if connection fails - Fix MR list corruption, and clean up MR usage - Fix another RPCSEC_GSS issue with MIC buffer space" * tag 'nfs-for-5.5-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (79 commits) SUNRPC: Capture completion of all RPC tasks SUNRPC: Fix another issue with MIC buffer space NFS4: Trace lock reclaims NFS4: Trace state recovery operation NFSv4.2 fix memory leak in nfs42_ssc_open NFSv4.2 fix kfree in __nfs42_copy_file_range NFS: remove duplicated include from nfs4file.c NFSv4: Make _nfs42_proc_copy_notify() static NFS: Fallocate should use the nfs4_fattr_bitmap NFS: Return -ETXTBSY when attempting to write to a swapfile fs: nfs: sysfs: Remove NULL check before kfree NFS: remove unneeded semicolon NFSv4: add declaration of current_stateid NFSv4.x: Drop the slot if nfs4_delegreturn_prepare waits for layoutreturn NFSv4.x: Handle bad/dead sessions correctly in nfs41_sequence_process() nfsv4: Move NFSPROC4_CLNT_COPY_NOTIFY to end of list SUNRPC: Avoid RPC delays when exiting suspend NFS: Add a tracepoint in nfs_fh_to_dentry() NFSv4: Don't retry the GETATTR on old stateid in nfs4_delegreturn_done() NFSv4: Handle NFS4ERR_OLD_STATEID in delegreturn ...
2019-12-07smb3: improve check for when we send the security descriptor context on createSteve French
We had cases in the previous patch where we were sending the security descriptor context on SMB3 open (file create) in cases when we hadn't mounted with with "modefromsid" mount option. Add check for that mount flag before calling ad_sd_context in open init. Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2019-12-07pipe: don't use 'pipe_wait() for basic pipe IOLinus Torvalds
pipe_wait() may be simple, but since it relies on the pipe lock, it means that we have to do the wakeup while holding the lock. That's unfortunate, because the very first thing the waked entity will want to do is to get the pipe lock for itself. So get rid of the pipe_wait() usage by simply releasing the pipe lock, doing the wakeup (if required) and then using wait_event_interruptible() to wait on the right condition instead. wait_event_interruptible() handles races on its own by comparing the wakeup condition before and after adding itself to the wait queue, so you can use an optimistic unlocked condition for it. Cc: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-07pipe: remove 'waiting_writers' merging logicLinus Torvalds
This code is ancient, and goes back to when we only had a single page for the pipe buffers. The exact history is hidden in the mists of time (ie "before git", and in fact predates the BK repository too). At that long-ago point in time, it actually helped to try to merge big back-and-forth pipe reads and writes, and not limit pipe reads to the single pipe buffer in length just because that was all we had at a time. However, since then we've expanded the pipe buffers to multiple pages, and this logic really doesn't seem to make sense. And a lot of it is somewhat questionable (ie "hmm, the user asked for a non-blocking read, but we see that there's a writer pending, so let's wait anyway to get the extra data that the writer will have"). But more importantly, it makes the "go to sleep" logic much less obvious, and considering the wakeup issues we've had, I want to make for less of those kinds of things. Cc: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-07pipe: fix and clarify pipe read wakeup logicLinus Torvalds
This is the read side version of the previous commit: it simplifies the logic to only wake up waiting writers when necessary, and makes sure to use a synchronous wakeup. This time not so much for GNU make jobserver reasons (that pipe never fills up), but simply to get the writer going quickly again. A bit less verbose commentary this time, if only because I assume that the write side commentary isn't going to be ignored if you touch this code. Cc: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-07pipe: fix and clarify pipe write wakeup logicLinus Torvalds
The pipe rework ends up having been extra painful, partly becaused of actual bugs with ordering and caching of the pipe state, but also because of subtle performance issues. In particular, the pipe rework caused the kernel build to inexplicably slow down. The reason turns out to be that the GNU make jobserver (which limits the parallelism of the build) uses a pipe to implement a "token" system: a parallel submake will read a character from the pipe to get the job token before starting a new job, and will write a character back to the pipe when it is done. The overall job limit is thus easily controlled by just writing the appropriate number of initial token characters into the pipe. But to work well, that really means that the old behavior of write wakeups being synchronous (WF_SYNC) is very important - when the pipe writer wakes up a reader, we want the reader to actually get scheduled immediately. Otherwise you lose the parallelism of the build. The pipe rework lost that synchronous wakeup on write, and we had clearly all forgotten the reasons and rules for it. This rewrites the pipe write wakeup logic to do the required Wsync wakeups, but also clarifies the logic and avoids extraneous wakeups. It also ends up addign a number of comments about what oit does and why, so that we hopefully don't end up forgetting about this next time we change this code. Cc: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-07pipe: fix poll/select race introduced by the pipe reworkLinus Torvalds
The kernel wait queues have a basic rule to them: you add yourself to the wait-queue first, and then you check the things that you're going to wait on. That avoids the races with the event you're waiting for. The same goes for poll/select logic: the "poll_wait()" goes first, and then you check the things you're polling for. Of course, if you use locking, the ordering doesn't matter since the lock will serialize with anything that changes the state you're looking at. That's not the case here, though. So move the poll_wait() first in pipe_poll(), before you start looking at the pipe state. Fixes: 8cefc107ca54 ("pipe: Use head and tail pointers for the ring, not cursor and length") Cc: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-07nfsd: depend on CRYPTO_MD5 for legacy client trackingPatrick Steinhardt
The legacy client tracking infrastructure of nfsd makes use of MD5 to derive a client's recovery directory name. As the nfsd module doesn't declare any dependency on CRYPTO_MD5, though, it may fail to allocate the hash if the kernel was compiled without it. As a result, generation of client recovery directories will fail with the following error: NFSD: unable to generate recoverydir name The explicit dependency on CRYPTO_MD5 was removed as redundant back in 6aaa67b5f3b9 (NFSD: Remove redundant "select" clauses in fs/Kconfig 2008-02-11) as it was already implicitly selected via RPCSEC_GSS_KRB5. This broke when RPCSEC_GSS_KRB5 was made optional for NFSv4 in commit df486a25900f (NFS: Fix the selection of security flavours in Kconfig) at a later point. Fix the issue by adding back an explicit dependency on CRYPTO_MD5. Fixes: df486a25900f (NFS: Fix the selection of security flavours in Kconfig) Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-12-07NFSD fixing possible null pointer derefering in copy offloadOlga Kornievskaia
Static checker revealed possible error path leading to possible NULL pointer dereferencing. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: e0639dc5805a: ("NFSD introduce async copy feature") Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-12-06pipe: Fix iteration end check in fuse_dev_splice_write()David Howells
Fix the iteration end check in fuse_dev_splice_write(). The iterator position can only be compared with == or != since wrappage may be involved. Fixes: 8cefc107ca54 ("pipe: Use head and tail pointers for the ring, not cursor and length") Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-06pipe: fix incorrect caching of pipe state over pipe_wait()Linus Torvalds
Similarly to commit 8f868d68d335 ("pipe: Fix missing mask update after pipe_wait()") this fixes a case where the pipe rewrite ended up caching the pipe state incorrectly over a pipe lock drop event. It wasn't quite as obvious, because you needed to splice data from a pipe to a file, which is a fairly unusual operation, but it's completely wrong. Make sure we load the pipe head/tail/size information only after we've waited for there to be data in the pipe. While in that file, also make one of the splice helper functions use the canonical arghument order for pipe_empty(). That's syntactic - pipe emptiness is just that head and tail are equal, and thus mixing up head and tail doesn't really matter. It's still wrong, though. Reported-by: David Sterba <dsterba@suse.cz> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-06smb3: fix mode passed in on create for modetosid mount optionSteve French
When using the special SID to store the mode bits in an ACE (See http://technet.microsoft.com/en-us/library/hh509017(v=ws.10).aspx) which is enabled with mount parm "modefromsid" we were not passing in the mode via SMB3 create (although chmod was enabled). SMB3 create allows a security descriptor context to be passed in (which is more atomic and thus preferable to setting the mode bits after create via a setinfo). This patch enables setting the mode bits on create when using modefromsid mount option. In addition it fixes an endian error in the definition of the Control field flags in the SMB3 security descriptor. It also makes the ACE type of the special SID better match the documentation (and behavior of servers which use this to store mode bits in SMB3 ACLs). Signed-off-by: Steve French <stfrench@microsoft.com> Acked-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2019-12-06Merge tag 'for-linus-20191205' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull more block and io_uring updates from Jens Axboe: "I wasn't expecting this to be so big, and if I was, I would have used separate branches for this. Going forward I'll be doing separate branches for the current tree, just like for the next kernel version tree. In any case, this contains: - Series from Christoph that fixes an inherent race condition with zoned devices and revalidation. - null_blk zone size fix (Damien) - Fix for a regression in this merge window that caused busy spins by sending empty disk uevents (Eric) - Fix for a regression in this merge window for bfq stats (Hou) - Fix for io_uring creds allocation failure handling (me) - io_uring -ERESTARTSYS send/recvmsg fix (me) - Series that fixes the need for applications to retain state across async request punts for io_uring. This one is a bit larger than I would have hoped, but I think it's important we get this fixed for 5.5. - connect(2) improvement for io_uring, handling EINPROGRESS instead of having applications needing to poll for it (me) - Have io_uring use a hash for poll requests instead of an rbtree. This turned out to work much better in practice, so I think we should make the switch now. For some workloads, even with a fair amount of cancellations, the insertion sort is just too expensive. (me) - Various little io_uring fixes (me, Jackie, Pavel, LimingWu) - Fix for brd unaligned IO, and a warning for the future (Ming) - Fix for a bio integrity data leak (Justin) - bvec_iter_advance() improvement (Pavel) - Xen blkback page unmap fix (SeongJae) The major items in here are all well tested, and on the liburing side we continue to add regression and feature test cases. We're up to 50 topic cases now, each with anywhere from 1 to more than 10 cases in each" * tag 'for-linus-20191205' of git://git.kernel.dk/linux-block: (33 commits) block: fix memleak of bio integrity data io_uring: fix a typo in a comment bfq-iosched: Ensure bio->bi_blkg is valid before using it io_uring: hook all linked requests via link_list io_uring: fix error handling in io_queue_link_head io_uring: use hash table for poll command lookups io-wq: clear node->next on list deletion io_uring: ensure deferred timeouts copy necessary data io_uring: allow IO_SQE_* flags on IORING_OP_TIMEOUT null_blk: remove unused variable warning on !CONFIG_BLK_DEV_ZONED brd: warn on un-aligned buffer brd: remove max_hw_sectors queue limit xen/blkback: Avoid unmapping unmapped grant pages io_uring: handle connect -EINPROGRESS like -EAGAIN block: set the zone size in blk_revalidate_disk_zones atomically block: don't handle bio based drivers in blk_revalidate_disk_zones block: allocate the zone bitmaps lazily block: replace seq_zones_bitmap with conv_zones_bitmap block: simplify blkdev_nr_zones block: remove the empty line at the end of blk-zoned.c ...
2019-12-06Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull vfs d_inode/d_flags memory ordering fixes from Al Viro: "Fallout from tree-wide audit for ->d_inode/->d_flags barriers use. Basically, the problem is that negative pinned dentries require careful treatment - unless ->d_lock is locked or parent is held at least shared, another thread can make them positive right under us. Most of the uses turned out to be safe - the main surprises as far as filesystems are concerned were - race in dget_parent() fastpath, that might end up with the caller observing the returned dentry _negative_, due to insufficient barriers. It is positive in memory, but we could end up seeing the wrong value of ->d_inode in CPU cache. Fixed. - manual checks that result of lookup_one_len_unlocked() is positive (and rejection of negatives). Again, insufficient barriers (we might end up with inconsistent observed values of ->d_inode and ->d_flags). Fixed by switching to a new primitive that does the checks itself and returns ERR_PTR(-ENOENT) instead of a negative dentry. That way we get rid of boilerplate converting negatives into ERR_PTR(-ENOENT) in the callers and have a single place to deal with the barrier-related mess - inside fs/namei.c rather than in every caller out there. The guts of pathname resolution *do* need to be careful - the race found by Ritesh is real, as well as several similar races. Fortunately, it turns out that we can take care of that with fairly local changes in there. The tree-wide audit had not been fun, and I hate the idea of repeating it. I think the right approach would be to annotate the places where we are _not_ guaranteed ->d_inode/->d_flags stability and have sparse catch regressions. But I'm still not sure what would be the least invasive way of doing that and it's clearly the next cycle fodder" * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs/namei.c: fix missing barriers when checking positivity fix dget_parent() fastpath race new helper: lookup_positive_unlocked() fs/namei.c: pull positivity check into follow_managed()
2019-12-05Merge branch 'next.autofs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull autofs updates from Al Viro: "autofs misuses checks for ->d_subdirs emptiness; the cursors are in the same lists, resulting in false negatives. It's not needed anyway, since autofs maintains counter in struct autofs_info, containing 0 for removed ones, 1 for live symlinks and 1 + number of children for live directories, which is precisely what we need for those checks. This series switches to use of that counter and untangles the crap around its uses (it needs not be atomic and there's a bunch of completely pointless "defensive" checks). This fell out of dcache_readdir work; the main point is to get rid of ->d_subdirs abuses in there. I've more followup cleanups, but I hadn't run those by Ian yet, so they can go next cycle" * 'next.autofs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: autofs: don't bother with atomics for ino->count autofs_dir_rmdir(): check ino->count for deciding whether it's empty... autofs: get rid of pointless checks around ->count handling autofs_clear_leaf_automount_flags(): use ino->count instead of ->d_subdirs
2019-12-05Merge branch 'pipe-rework' (patches from David Howells)Linus Torvalds
Merge two fixes for the pipe rework from David Howells: "Here are a couple of patches to fix bugs syzbot found in the pipe changes: - An assertion check will sometimes trip when polling a pipe because the ring size and indices used are approximate and may be being changed simultaneously. An equivalent approximate calculation was done previously, but without the assertion check, so I've just dropped the check. To make it accurate, the pipe mutex would need to be taken or the spin lock could be used - but usage of the spinlock would need to be rolled out into splice, iov_iter and other places for that. - The index mask and the max_usage values cannot be cached across pipe_wait() as F_SETPIPE_SZ could have been called during the wait. This can cause pipe_write() to break" * pipe-rework: pipe: Fix missing mask update after pipe_wait() pipe: Remove assertion from pipe_poll()
2019-12-05pipe: Fix missing mask update after pipe_wait()David Howells
Fix pipe_write() to not cache the ring index mask and max_usage as their values are invalidated by calling pipe_wait() because the latter function drops the pipe lock, thereby allowing F_SETPIPE_SZ change them. Without this, pipe_write() may subsequently miscalculate the array indices and pipe fullness, leading to an oops like the following: BUG: KASAN: slab-out-of-bounds in pipe_write+0xc25/0xe10 fs/pipe.c:481 Write of size 8 at addr ffff8880771167a8 by task syz-executor.3/7987 ... CPU: 1 PID: 7987 Comm: syz-executor.3 Not tainted 5.4.0-rc2-syzkaller #0 ... Call Trace: pipe_write+0xc25/0xe10 fs/pipe.c:481 call_write_iter include/linux/fs.h:1895 [inline] new_sync_write+0x3fd/0x7e0 fs/read_write.c:483 __vfs_write+0x94/0x110 fs/read_write.c:496 vfs_write+0x18a/0x520 fs/read_write.c:558 ksys_write+0x105/0x220 fs/read_write.c:611 __do_sys_write fs/read_write.c:623 [inline] __se_sys_write fs/read_write.c:620 [inline] __x64_sys_write+0x6e/0xb0 fs/read_write.c:620 do_syscall_64+0xca/0x5d0 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe This is not a problem for pipe_read() as the mask is recalculated on each pass of the loop, after pipe_wait() has been called. Fixes: 8cefc107ca54 ("pipe: Use head and tail pointers for the ring, not cursor and length") Reported-by: syzbot+838eb0878ffd51f27c41@syzkaller.appspotmail.com Signed-off-by: David Howells <dhowells@redhat.com> Cc: Eric Biggers <ebiggers@kernel.org> [ Changed it to use a temporary variable 'mask' to avoid long lines -Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-05pipe: Remove assertion from pipe_poll()David Howells
An assertion check was added to pipe_poll() to make sure that the ring occupancy isn't seen to overflow the ring size. However, since no locks are held when the three values are read, it is possible for F_SETPIPE_SZ to intervene and muck up the calculation, thereby causing the oops. Fix this by simply removing the assertion and accepting that the calculation might be approximate. Note that the previous code also had a similar issue, though there was no assertion check, since the occupancy counter and the ring size were not read with a lock held, so it's possible that the poll check might have malfunctioned then too. Also wake up all the waiters so that they can reissue their checks if there was a competing read or write. Fixes: 8cefc107ca54 ("pipe: Use head and tail pointers for the ring, not cursor and length") Reported-by: syzbot+d37abaade33a934f16f2@syzkaller.appspotmail.com Signed-off-by: David Howells <dhowells@redhat.com> cc: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-05Merge tag 'gfs2-for-5.5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull GFS2 updates from Andreas Gruenbacher: "Bob's extensive filesystem withdrawal and recovery testing: - don't write log headers after file system withdraw - clean up iopen glock mess in gfs2_create_inode - close timing window with GLF_INVALIDATE_IN_PROGRESS - abort gfs2_freeze if io error is seen - don't loop forever in gfs2_freeze if withdrawn - fix infinite loop in gfs2_ail1_flush on io error - introduce function gfs2_withdrawn - fix glock reference problem in gfs2_trans_remove_revoke Filesystems with a block size smaller than the page size: - fix end-of-file handling in gfs2_page_mkwrite - improve mmap write vs. punch_hole consistency Other: - remove active journal side effect from gfs2_write_log_header - multi-block allocations in gfs2_page_mkwrite Minor cleanups and coding style fixes: - remove duplicate call from gfs2_create_inode - make gfs2_log_shutdown static - make gfs2_fs_parameters static - some whitespace cleanups - removed unnecessary semicolon" * tag 'gfs2-for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: Don't write log headers after file system withdraw gfs2: Remove duplicate call from gfs2_create_inode gfs2: clean up iopen glock mess in gfs2_create_inode gfs2: Close timing window with GLF_INVALIDATE_IN_PROGRESS gfs2: Abort gfs2_freeze if io error is seen gfs2: Don't loop forever in gfs2_freeze if withdrawn gfs2: fix infinite loop in gfs2_ail1_flush on io error gfs2: Introduce function gfs2_withdrawn gfs2: fix glock reference problem in gfs2_trans_remove_revoke gfs2: make gfs2_log_shutdown static gfs2: Remove active journal side effect from gfs2_write_log_header gfs2: Fix end-of-file handling in gfs2_page_mkwrite gfs2: Multi-block allocations in gfs2_page_mkwrite gfs2: Improve mmap write vs. punch_hole consistency gfs2: make gfs2_fs_parameters static gfs2: Some whitespace cleanups gfs2: removed unnecessary semicolon
2019-12-05Merge tag 'ceph-for-5.5-rc1' of git://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph updates from Ilya Dryomov: "The two highlights are a set of improvements to how rbd read-only mappings are handled and a conversion to the new mount API (slightly complicated by the fact that we had a common option parsing framework that called out into rbd and the filesystem instead of them calling into it). Also included a few scattered fixes and a MAINTAINERS update for rbd, adding Dongsheng as a reviewer" * tag 'ceph-for-5.5-rc1' of git://github.com/ceph/ceph-client: libceph, rbd, ceph: convert to use the new mount API rbd: ask for a weaker incompat mask for read-only mappings rbd: don't query snapshot features rbd: remove snapshot existence validation code rbd: don't establish watch for read-only mappings rbd: don't acquire exclusive lock for read-only mappings rbd: disallow read-write partitions on images mapped read-only rbd: treat images mapped read-only seriously rbd: introduce RBD_DEV_FLAG_READONLY rbd: introduce rbd_is_snap() ceph: don't leave ino field in ceph_mds_request_head uninitialized ceph: tone down loglevel on ceph_mdsc_build_path warning rbd: update MAINTAINERS info ceph: fix geting random mds from mdsmap rbd: fix spelling mistake "requeueing" -> "requeuing" ceph: make several helper accessors take const pointers libceph: drop unnecessary check from dispatch() in mon_client.c
2019-12-05Merge tag 'fuse-update-5.5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse update from Miklos Szeredi: - Fix a regression introduced in the last release - Fix a number of issues with validating data coming from userspace - Some cleanups in virtiofs * tag 'fuse-update-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: fix Kconfig indentation fuse: fix leak of fuse_io_priv virtiofs: Use completions while waiting for queue to be drained virtiofs: Do not send forget request "struct list_head" element virtiofs: Use a common function to send forget virtiofs: Fix old-style declaration fuse: verify nlink fuse: verify write return fuse: verify attributes
2019-12-05iomap: stop using ioend after it's been freed in iomap_finish_ioend()Zorro Lang
This patch fixes the following KASAN report. The @ioend has been freed by dio_put(), but the iomap_finish_ioend() still trys to access its data. [20563.631624] BUG: KASAN: use-after-free in iomap_finish_ioend+0x58c/0x5c0 [20563.638319] Read of size 8 at addr fffffc0c54a36928 by task kworker/123:2/22184 [20563.647107] CPU: 123 PID: 22184 Comm: kworker/123:2 Not tainted 5.4.0+ #1 [20563.653887] Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.11 06/18/2019 [20563.664499] Workqueue: xfs-conv/sda5 xfs_end_io [xfs] [20563.669547] Call trace: [20563.671993] dump_backtrace+0x0/0x370 [20563.675648] show_stack+0x1c/0x28 [20563.678958] dump_stack+0x138/0x1b0 [20563.682455] print_address_description.isra.9+0x60/0x378 [20563.687759] __kasan_report+0x1a4/0x2a8 [20563.691587] kasan_report+0xc/0x18 [20563.694985] __asan_report_load8_noabort+0x18/0x20 [20563.699769] iomap_finish_ioend+0x58c/0x5c0 [20563.703944] iomap_finish_ioends+0x110/0x270 [20563.708396] xfs_end_ioend+0x168/0x598 [xfs] [20563.712823] xfs_end_io+0x1e0/0x2d0 [xfs] [20563.716834] process_one_work+0x7f0/0x1ac8 [20563.720922] worker_thread+0x334/0xae0 [20563.724664] kthread+0x2c4/0x348 [20563.727889] ret_from_fork+0x10/0x18 [20563.732941] Allocated by task 83403: [20563.736512] save_stack+0x24/0xb0 [20563.739820] __kasan_kmalloc.isra.9+0xc4/0xe0 [20563.744169] kasan_slab_alloc+0x14/0x20 [20563.747998] slab_post_alloc_hook+0x50/0xa8 [20563.752173] kmem_cache_alloc+0x154/0x330 [20563.756185] mempool_alloc_slab+0x20/0x28 [20563.760186] mempool_alloc+0xf4/0x2a8 [20563.763845] bio_alloc_bioset+0x2d0/0x448 [20563.767849] iomap_writepage_map+0x4b8/0x1740 [20563.772198] iomap_do_writepage+0x200/0x8d0 [20563.776380] write_cache_pages+0x8a4/0xed8 [20563.780469] iomap_writepages+0x4c/0xb0 [20563.784463] xfs_vm_writepages+0xf8/0x148 [xfs] [20563.788989] do_writepages+0xc8/0x218 [20563.792658] __writeback_single_inode+0x168/0x18f8 [20563.797441] writeback_sb_inodes+0x370/0xd30 [20563.801703] wb_writeback+0x2d4/0x1270 [20563.805446] wb_workfn+0x344/0x1178 [20563.808928] process_one_work+0x7f0/0x1ac8 [20563.813016] worker_thread+0x334/0xae0 [20563.816757] kthread+0x2c4/0x348 [20563.819979] ret_from_fork+0x10/0x18 [20563.825028] Freed by task 22184: [20563.828251] save_stack+0x24/0xb0 [20563.831559] __kasan_slab_free+0x10c/0x180 [20563.835648] kasan_slab_free+0x10/0x18 [20563.839389] slab_free_freelist_hook+0xb4/0x1c0 [20563.843912] kmem_cache_free+0x8c/0x3e8 [20563.847745] mempool_free_slab+0x20/0x28 [20563.851660] mempool_free+0xd4/0x2f8 [20563.855231] bio_free+0x33c/0x518 [20563.858537] bio_put+0xb8/0x100 [20563.861672] iomap_finish_ioend+0x168/0x5c0 [20563.865847] iomap_finish_ioends+0x110/0x270 [20563.870328] xfs_end_ioend+0x168/0x598 [xfs] [20563.874751] xfs_end_io+0x1e0/0x2d0 [xfs] [20563.878755] process_one_work+0x7f0/0x1ac8 [20563.882844] worker_thread+0x334/0xae0 [20563.886584] kthread+0x2c4/0x348 [20563.889804] ret_from_fork+0x10/0x18 [20563.894855] The buggy address belongs to the object at fffffc0c54a36900 which belongs to the cache bio-1 of size 248 [20563.906844] The buggy address is located 40 bytes inside of 248-byte region [fffffc0c54a36900, fffffc0c54a369f8) [20563.918485] The buggy address belongs to the page: [20563.923269] page:ffffffff82f528c0 refcount:1 mapcount:0 mapping:fffffc8e4ba31900 index:0xfffffc0c54a33300 [20563.932832] raw: 17ffff8000000200 ffffffffa3060100 0000000700000007 fffffc8e4ba31900 [20563.940567] raw: fffffc0c54a33300 0000000080aa0042 00000001ffffffff 0000000000000000 [20563.948300] page dumped because: kasan: bad access detected [20563.955345] Memory state around the buggy address: [20563.960129] fffffc0c54a36800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc [20563.967342] fffffc0c54a36880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [20563.974554] >fffffc0c54a36900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [20563.981766] ^ [20563.986288] fffffc0c54a36980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc [20563.993501] fffffc0c54a36a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [20564.000713] ================================================================== Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=205703 Signed-off-by: Zorro Lang <zlang@redhat.com> Fixes: 9cd0ed63ca514 ("iomap: enhance writeback error message") Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-12-05io_uring: fix a typo in a commentLimingWu
thatn -> than. Signed-off-by: Liming Wu <19092205@suning.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-05io_uring: hook all linked requests via link_listPavel Begunkov
Links are created by chaining requests through req->list with an exception that head uses req->link_list. (e.g. link_list->list->list) Because of that, io_req_link_next() needs complex splicing to advance. Link them all through list_list. Also, it seems to be simpler and more consistent IMHO. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-05io_uring: fix error handling in io_queue_link_headPavel Begunkov
In case of an error io_submit_sqe() drops a request and continues without it, even if the request was a part of a link. Not only it doesn't cancel links, but also may execute wrong sequence of actions. Stop consuming sqes, and let the user handle errors. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-04fs/binfmt_elf.c: extract elf_read() functionAlexey Dobriyan
ELF reads done by the kernel have very complicated error detection code which better live in one place. Link: http://lkml.kernel.org/r/20191005165215.GB26927@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04fs/binfmt_elf.c: delete unused "interp_map_addr" argumentAlexey Dobriyan
Link: http://lkml.kernel.org/r/20191005165049.GA26927@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04fs/epoll: remove unnecessary wakeups of nested epollHeiher
Take the case where we have: t0 | (ew) e0 | (et) e1 | (lt) s0 t0: thread 0 e0: epoll fd 0 e1: epoll fd 1 s0: socket fd 0 ew: epoll_wait et: edge-trigger lt: level-trigger We remove unnecessary wakeups to prevent the nested epoll that working in edge- triggered mode to waking up continuously. Test code: #include <unistd.h> #include <sys/epoll.h> #include <sys/socket.h> int main(int argc, char *argv[]) { int sfd[2]; int efd[2]; struct epoll_event e; if (socketpair(AF_UNIX, SOCK_STREAM, 0, sfd) < 0) goto out; efd[0] = epoll_create(1); if (efd[0] < 0) goto out; efd[1] = epoll_create(1); if (efd[1] < 0) goto out; e.events = EPOLLIN; if (epoll_ctl(efd[1], EPOLL_CTL_ADD, sfd[0], &e) < 0) goto out; e.events = EPOLLIN | EPOLLET; if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &e) < 0) goto out; if (write(sfd[1], "w", 1) != 1) goto out; if (epoll_wait(efd[0], &e, 1, 0) != 1) goto out; if (epoll_wait(efd[0], &e, 1, 0) != 0) goto out; close(efd[0]); close(efd[1]); close(sfd[0]); close(sfd[1]); return 0; out: return -1; } More tests: https://github.com/heiher/epoll-wakeup Link: http://lkml.kernel.org/r/20191009060516.3577-1-r@hev.cc Signed-off-by: hev <r@hev.cc> Reviewed-by: Roman Penyaev <rpenyaev@suse.de> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: Eric Wong <e@80x24.org> Cc: Jason Baron <jbaron@akamai.com> Cc: Sridhar Samudrala <sridhar.samudrala@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04epoll: simplify ep_poll_safewake() for CONFIG_DEBUG_LOCK_ALLOCJason Baron
Currently, ep_poll_safewake() in the CONFIG_DEBUG_LOCK_ALLOC case uses ep_call_nested() in order to pass the correct subclass argument to spin_lock_irqsave_nested(). However, ep_call_nested() adds unnecessary checks for epoll depth and loops that are already verified when doing EPOLL_CTL_ADD. This mirrors a conversion that was done for !CONFIG_DEBUG_LOCK_ALLOC in: commit 37b5e5212a44 ("epoll: remove ep_call_nested() from ep_eventpoll_poll()") Link: http://lkml.kernel.org/r/1567628549-11501-1-git-send-email-jbaron@akamai.com Signed-off-by: Jason Baron <jbaron@akamai.com> Reviewed-by: Roman Penyaev <rpenyaev@suse.de> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric Wong <normalperson@yhbt.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04fs/proc/Kconfig: fix indentationKrzysztof Kozlowski
Adjust indentation from spaces to tab (+optional two spaces) as in coding style with command like: $ sed -e 's/^ / /' -i */Kconfig [adobriyan@gmail.com: add two spaces where necessary] Link: http://lkml.kernel.org/r/20191124133936.GA5655@avx2 Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04fs/proc/internal.h: shuffle "struct pde_opener"Alexey Dobriyan
List iteration takes more code than anything else which means embedded list_head should be the first element of the structure. Space savings: add/remove: 0/0 grow/shrink: 0/4 up/down: 0/-18 (-18) Function old new delta close_pdeo 228 227 -1 proc_reg_release 86 82 -4 proc_entry_rundown 143 139 -4 proc_reg_open 298 289 -9 Link: http://lkml.kernel.org/r/20191004234753.GB30246@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04fs/proc/generic.c: delete useless "len" variableAlexey Dobriyan
Pointer to next '/' encodes length of path element and next start position. Subtraction and increment are redundant. Link: http://lkml.kernel.org/r/20191004234521.GA30246@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04proc: change ->nlink under proc_subdir_lockAlexey Dobriyan
Currently gluing PDE into global /proc tree is done under lock, but changing ->nlink is not. Additionally struct proc_dir_entry::nlink is not atomic so updates can be lost. Link: http://lkml.kernel.org/r/20190925202436.GA17388@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04io_uring: use hash table for poll command lookupsJens Axboe
We recently changed this from a single list to an rbtree, but for some real life workloads, the rbtree slows down the submission/insertion case enough so that it's the top cycle consumer on the io_uring side. In testing, using a hash table is a more well rounded compromise. It is fast for insertion, and as long as it's sized appropriately, it works well for the cancellation case as well. Running TAO with a lot of network sockets, this removes io_poll_req_insert() from spending 2% of the CPU cycles. Reported-by: Dan Melnic <dmm@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-04io-wq: clear node->next on list deletionJens Axboe
If someone removes a node from a list, and then later adds it back to a list, we can have invalid data in ->next. This can cause all sorts of issues. One such use case is the IORING_OP_POLL_ADD command, which will do just that if we race and get woken twice without any pending events. This is a pretty rare case, but can happen under extreme loads. Dan reports that he saw the following crash: BUG: kernel NULL pointer dereference, address: 0000000000000000 PGD d283ce067 P4D d283ce067 PUD e5ca04067 PMD 0 Oops: 0002 [#1] SMP CPU: 17 PID: 10726 Comm: tao:fast-fiber Kdump: loaded Not tainted 5.2.9-02851-gac7bc042d2d1 #116 Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A17 05/03/2019 RIP: 0010:io_wqe_enqueue+0x3e/0xd0 Code: 34 24 74 55 8b 47 58 48 8d 6f 50 85 c0 74 50 48 89 df e8 35 7c 75 00 48 83 7b 08 00 48 8b 14 24 0f 84 84 00 00 00 48 8b 4b 10 <48> 89 11 48 89 53 10 83 63 20 fe 48 89 c6 48 89 df e8 0c 7a 75 00 RSP: 0000:ffffc90006858a08 EFLAGS: 00010082 RAX: 0000000000000002 RBX: ffff889037492fc0 RCX: 0000000000000000 RDX: ffff888e40cc11a8 RSI: ffff888e40cc11a8 RDI: ffff889037492fc0 RBP: ffff889037493010 R08: 00000000000000c3 R09: ffffc90006858ab8 R10: 0000000000000000 R11: 0000000000000000 R12: ffff888e40cc11a8 R13: 0000000000000000 R14: 00000000000000c3 R15: ffff888e40cc1100 FS: 00007fcddc9db700(0000) GS:ffff88903fa40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000000e479f5003 CR4: 00000000007606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <IRQ> io_poll_wake+0x12f/0x2a0 __wake_up_common+0x86/0x120 __wake_up_common_lock+0x7a/0xc0 sock_def_readable+0x3c/0x70 tcp_rcv_established+0x557/0x630 tcp_v6_do_rcv+0x118/0x3c0 tcp_v6_rcv+0x97e/0x9d0 ip6_protocol_deliver_rcu+0xe3/0x440 ip6_input+0x3d/0xc0 ? ip6_protocol_deliver_rcu+0x440/0x440 ipv6_rcv+0x56/0xd0 ? ip6_rcv_finish_core.isra.18+0x80/0x80 __netif_receive_skb_one_core+0x50/0x70 netif_receive_skb_internal+0x2f/0xa0 napi_gro_receive+0x125/0x150 mlx5e_handle_rx_cqe+0x1d9/0x5a0 ? mlx5e_poll_tx_cq+0x305/0x560 mlx5e_poll_rx_cq+0x49f/0x9c5 mlx5e_napi_poll+0xee/0x640 ? smp_reschedule_interrupt+0x16/0xd0 ? reschedule_interrupt+0xf/0x20 net_rx_action+0x286/0x3d0 __do_softirq+0xca/0x297 irq_exit+0x96/0xa0 do_IRQ+0x54/0xe0 common_interrupt+0xf/0xf </IRQ> RIP: 0033:0x7fdc627a2e3a Code: 31 c0 85 d2 0f 88 f6 00 00 00 55 48 89 e5 41 57 41 56 4c 63 f2 41 55 41 54 53 48 83 ec 18 48 85 ff 0f 84 c7 00 00 00 48 8b 07 <41> 89 d4 49 89 f5 48 89 fb 48 85 c0 0f 84 64 01 00 00 48 83 78 10 when running a networked workload with about 5000 sockets being polled for. Fix this by clearing node->next when the node is being removed from the list. Fixes: 6206f0e180d4 ("io-wq: shrink io_wq_work a bit") Reported-by: Dan Melnic <dmm@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-04io_uring: ensure deferred timeouts copy necessary dataJens Axboe
If we defer a timeout, we should ensure that we copy the timespec when we have consumed the sqe. This is similar to commit f67676d160c6 for read/write requests. We already did this correctly for timeouts deferred as links, but do it generally and use the infrastructure added by commit 1a6b74fc8702 instead of having the timeout deferral use its own. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-04cifs: fix possible uninitialized access and race on iface_listAurelien Aptel
iface[0] was accessed regardless of the count value and without locking. * check count before accessing any ifaces * make copy of iface list (it's a simple POD array) and use it without locking. Signed-off-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
2019-12-04cifs: Fix lookup of SMB connections on multichannelPaulo Alcantara (SUSE)
With the addition of SMB session channels, we introduced new TCP server pointers that have no sessions or tcons associated with them. In this case, when we started looking for TCP connections, we might end up picking session channel rather than the master connection, hence failing to get either a session or a tcon. In order to fix that, this patch introduces a new "is_channel" field to TCP_Server_Info structure so we can skip session channels during lookup of connections. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>