aboutsummaryrefslogtreecommitdiff
path: root/net/sctp/socket.c
AgeCommit message (Collapse)Author
2017-11-18sctp: set frag_point in sctp_setsockopt_maxseg correctlyXin Long
Now in sctp_setsockopt_maxseg user_frag or frag_point can be set with val >= 8 and val <= SCTP_MAX_CHUNK_LEN. But both checks are incorrect. val >= 8 means frag_point can even be less than SCTP_DEFAULT_MINSEGMENT. Then in sctp_datamsg_from_user(), when it's value is greater than cookie echo len and trying to bundle with cookie echo chunk, the first_len will overflow. The worse case is when it's value is equal as cookie echo len, first_len becomes 0, it will go into a dead loop for fragment later on. In Hangbin syzkaller testing env, oom was even triggered due to consecutive memory allocation in that loop. Besides, SCTP_MAX_CHUNK_LEN is the max size of the whole chunk, it should deduct the data header for frag_point or user_frag check. This patch does a proper check with SCTP_DEFAULT_MINSEGMENT subtracting the sctphdr and datahdr, SCTP_MAX_CHUNK_LEN subtracting datahdr when setting frag_point via sockopt. It also improves sctp_setsockopt_maxseg codes. Suggested-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Reported-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-16sctp: use the right sk after waking up from wait_buf sleepXin Long
Commit dfcb9f4f99f1 ("sctp: deny peeloff operation on asocs with threads sleeping on it") fixed the race between peeloff and wait sndbuf by checking waitqueue_active(&asoc->wait) in sctp_do_peeloff(). But it actually doesn't work, as even if waitqueue_active returns false the waiting sndbuf thread may still not yet hold sk lock. After asoc is peeled off, sk is not asoc->base.sk any more, then to hold the old sk lock couldn't make assoc safe to access. This patch is to fix this by changing to hold the new sk lock if sk is not asoc->base.sk, meanwhile, also set the sk in sctp_sendmsg with the new sk. With this fix, there is no more race between peeloff and waitbuf, the check 'waitqueue_active' in sctp_do_peeloff can be removed. Thanks Marcelo and Neil for making this clear. v1->v2: fix it by changing to lock the new sock instead of adding a flag in asoc. Suggested-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-16sctp: do not free asoc when it is already dead in sctp_sendmsgXin Long
Now in sctp_sendmsg sctp_wait_for_sndbuf could schedule out without holding sock sk. It means the current asoc can be freed elsewhere, like when receiving an abort packet. If the asoc is just created in sctp_sendmsg and sctp_wait_for_sndbuf returns err, the asoc will be freed again due to new_asoc is not nil. An use-after-free issue would be triggered by this. This patch is to fix it by setting new_asoc with nil if the asoc is already dead when cpu schedules back, so that it will not be freed again in sctp_sendmsg. v1->v2: set new_asoc as nil in sctp_sendmsg instead of sctp_wait_for_sndbuf. Suggested-by: Neil Horman <nhorman@tuxdriver.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Several conflicts here. NFP driver bug fix adding nfp_netdev_is_nfp_repr() check to nfp_fl_output() needed some adjustments because the code block is in an else block now. Parallel additions to net/pkt_cls.h and net/sch_generic.h A bug fix in __tcp_retransmit_skb() conflicted with some of the rbtree changes in net-next. The tc action RCU callback fixes in 'net' had some overlap with some of the recent tcf_block reworking. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-29sctp: reset owner sk for data chunks on out queues when migrating a sockXin Long
Now when migrating sock to another one in sctp_sock_migrate(), it only resets owner sk for the data in receive queues, not the chunks on out queues. It would cause that data chunks length on the sock is not consistent with sk sk_wmem_alloc. When closing the sock or freeing these chunks, the old sk would never be freed, and the new sock may crash due to the overflow sk_wmem_alloc. syzbot found this issue with this series: r0 = socket$inet_sctp() sendto$inet(r0) listen(r0) accept4(r0) close(r0) Although listen() should have returned error when one TCP-style socket is in connecting (I may fix this one in another patch), it could also be reproduced by peeling off an assoc. This issue is there since very beginning. This patch is to reset owner sk for the chunks on out queues so that sk sk_wmem_alloc has correct value after accept one sock or peeloff an assoc to one sock. Note that when resetting owner sk for chunks on outqueue, it has to sctp_clear_owner_w/skb_orphan chunks before changing assoc->base.sk first and then sctp_set_owner_w them after changing assoc->base.sk, due to that sctp_wfree and it's callees are using assoc->base.sk. Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
There were quite a few overlapping sets of changes here. Daniel's bug fix for off-by-ones in the new BPF branch instructions, along with the added allowances for "data_end > ptr + x" forms collided with the metadata additions. Along with those three changes came veritifer test cases, which in their final form I tried to group together properly. If I had just trimmed GIT's conflict tags as-is, this would have split up the meta tests unnecessarily. In the socketmap code, a set of preemption disabling changes overlapped with the rename of bpf_compute_data_end() to bpf_compute_data_pointers(). Changes were made to the mv88e6060.c driver set addr method which got removed in net-next. The hyperv transport socket layer had a locking change in 'net' which overlapped with a change of socket state macro usage in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-19sctp: do not peel off an assoc from one netns to another oneXin Long
Now when peeling off an association to the sock in another netns, all transports in this assoc are not to be rehashed and keep use the old key in hashtable. As a transport uses sk->net as the hash key to insert into hashtable, it would miss removing these transports from hashtable due to the new netns when closing the sock and all transports are being freeed, then later an use-after-free issue could be caused when looking up an asoc and dereferencing those transports. This is a very old issue since very beginning, ChunYu found it with syzkaller fuzz testing with this series: socket$inet6_sctp() bind$inet6() sendto$inet6() unshare(0x40000000) getsockopt$inet_sctp6_SCTP_GET_ASSOC_ID_LIST() getsockopt$inet_sctp6_SCTP_SOCKOPT_PEELOFF() This patch is to block this call when peeling one assoc off from one netns to another one, so that the netns of all transport would not go out-sync with the key in hashtable. Note that this patch didn't fix it by rehashing transports, as it's difficult to handle the situation when the tuple is already in use in the new netns. Besides, no one would like to peel off one assoc to another netns, considering ipaddrs, ifaces, etc. are usually different. Reported-by: ChunYu Wang <chunwang@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-03sctp: add sockopt to get/set stream scheduler parametersMarcelo Ricardo Leitner
As defined per RFC Draft ndata Section 4.3.3, named as SCTP_STREAM_SCHEDULER_VALUE. See-also: https://tools.ietf.org/html/draft-ietf-tsvwg-sctp-ndata-13 Tested-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-03sctp: add sockopt to get/set stream schedulerMarcelo Ricardo Leitner
As defined per RFC Draft ndata Section 4.3.2, named as SCTP_STREAM_SCHEDULER. See-also: https://tools.ietf.org/html/draft-ietf-tsvwg-sctp-ndata-13 Tested-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-03sctp: introduce struct sctp_stream_out_extMarcelo Ricardo Leitner
With the stream schedulers, sctp_stream_out will become too big to be allocated by kmalloc and as we need to allocate with BH disabled, we cannot use __vmalloc in sctp_stream_init(). This patch moves out the stats from sctp_stream_out to sctp_stream_out_ext, which will be allocated only when the application tries to sendmsg something on it. Just the introduction of sctp_stream_out_ext would already fix the issue described above by splitting the allocation in two. Moving the stats to it also reduces the pressure on the allocator as we will ask for less memory atomically when creating the socket and we will use GFP_KERNEL later. Then, for stream schedulers, we will just use sctp_stream_out_ext. Tested-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-15sctp: fix an use-after-free issue in sctp_sock_dumpXin Long
Commit 86fdb3448cc1 ("sctp: ensure ep is not destroyed before doing the dump") tried to fix an use-after-free issue by checking !sctp_sk(sk)->ep with holding sock and sock lock. But Paolo noticed that endpoint could be destroyed in sctp_rcv without sock lock protection. It means the use-after-free issue still could be triggered when sctp_rcv put and destroy ep after sctp_sock_dump checks !ep, although it's pretty hard to reproduce. I could reproduce it by mdelay in sctp_rcv while msleep in sctp_close and sctp_sock_dump long time. This patch is to add another param cb_done to sctp_for_each_transport and dump ep->assocs with holding tsp after jumping out of transport's traversal in it to avoid this issue. It can also improve sctp diag dump to make it run faster, as no need to save sk into cb->args[5] and keep calling sctp_for_each_transport any more. This patch is also to use int * instead of int for the pos argument in sctp_for_each_transport, which could make postion increment only in sctp_for_each_transport and no need to keep changing cb->args[2] in sctp_sock_filter and sctp_sock_dump any more. Fixes: 86fdb3448cc1 ("sctp: ensure ep is not destroyed before doing the dump") Reported-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Three cases of simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-23sctp: Avoid out-of-bounds reads from address storageStefano Brivio
inet_diag_msg_sctp{,l}addr_fill() and sctp_get_sctp_info() copy sizeof(sockaddr_storage) bytes to fill in sockaddr structs used to export diagnostic information to userspace. However, the memory allocated to store sockaddr information is smaller than that and depends on the address family, so we leak up to 100 uninitialized bytes to userspace. Just use the size of the source structs instead, in all the three cases this is what userspace expects. Zero out the remaining memory. Unused bytes (i.e. when IPv4 addresses are used) in source structs sctp_sockaddr_entry and sctp_transport are already cleared by sctp_add_bind_addr() and sctp_transport_new(), respectively. Noticed while testing KASAN-enabled kernel with 'ss': [ 2326.885243] BUG: KASAN: slab-out-of-bounds in inet_sctp_diag_fill+0x42c/0x6c0 [sctp_diag] at addr ffff881be8779800 [ 2326.896800] Read of size 128 by task ss/9527 [ 2326.901564] CPU: 0 PID: 9527 Comm: ss Not tainted 4.11.0-22.el7a.x86_64 #1 [ 2326.909236] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.4.3 01/17/2017 [ 2326.917585] Call Trace: [ 2326.920312] dump_stack+0x63/0x8d [ 2326.924014] kasan_object_err+0x21/0x70 [ 2326.928295] kasan_report+0x288/0x540 [ 2326.932380] ? inet_sctp_diag_fill+0x42c/0x6c0 [sctp_diag] [ 2326.938500] ? skb_put+0x8b/0xd0 [ 2326.942098] ? memset+0x31/0x40 [ 2326.945599] check_memory_region+0x13c/0x1a0 [ 2326.950362] memcpy+0x23/0x50 [ 2326.953669] inet_sctp_diag_fill+0x42c/0x6c0 [sctp_diag] [ 2326.959596] ? inet_diag_msg_sctpasoc_fill+0x460/0x460 [sctp_diag] [ 2326.966495] ? __lock_sock+0x102/0x150 [ 2326.970671] ? sock_def_wakeup+0x60/0x60 [ 2326.975048] ? remove_wait_queue+0xc0/0xc0 [ 2326.979619] sctp_diag_dump+0x44a/0x760 [sctp_diag] [ 2326.985063] ? sctp_ep_dump+0x280/0x280 [sctp_diag] [ 2326.990504] ? memset+0x31/0x40 [ 2326.994007] ? mutex_lock+0x12/0x40 [ 2326.997900] __inet_diag_dump+0x57/0xb0 [inet_diag] [ 2327.003340] ? __sys_sendmsg+0x150/0x150 [ 2327.007715] inet_diag_dump+0x4d/0x80 [inet_diag] [ 2327.012979] netlink_dump+0x1e6/0x490 [ 2327.017064] __netlink_dump_start+0x28e/0x2c0 [ 2327.021924] inet_diag_handler_cmd+0x189/0x1a0 [inet_diag] [ 2327.028045] ? inet_diag_rcv_msg_compat+0x1b0/0x1b0 [inet_diag] [ 2327.034651] ? inet_diag_dump_compat+0x190/0x190 [inet_diag] [ 2327.040965] ? __netlink_lookup+0x1b9/0x260 [ 2327.045631] sock_diag_rcv_msg+0x18b/0x1e0 [ 2327.050199] netlink_rcv_skb+0x14b/0x180 [ 2327.054574] ? sock_diag_bind+0x60/0x60 [ 2327.058850] sock_diag_rcv+0x28/0x40 [ 2327.062837] netlink_unicast+0x2e7/0x3b0 [ 2327.067212] ? netlink_attachskb+0x330/0x330 [ 2327.071975] ? kasan_check_write+0x14/0x20 [ 2327.076544] netlink_sendmsg+0x5be/0x730 [ 2327.080918] ? netlink_unicast+0x3b0/0x3b0 [ 2327.085486] ? kasan_check_write+0x14/0x20 [ 2327.090057] ? selinux_socket_sendmsg+0x24/0x30 [ 2327.095109] ? netlink_unicast+0x3b0/0x3b0 [ 2327.099678] sock_sendmsg+0x74/0x80 [ 2327.103567] ___sys_sendmsg+0x520/0x530 [ 2327.107844] ? __get_locked_pte+0x178/0x200 [ 2327.112510] ? copy_msghdr_from_user+0x270/0x270 [ 2327.117660] ? vm_insert_page+0x360/0x360 [ 2327.122133] ? vm_insert_pfn_prot+0xb4/0x150 [ 2327.126895] ? vm_insert_pfn+0x32/0x40 [ 2327.131077] ? vvar_fault+0x71/0xd0 [ 2327.134968] ? special_mapping_fault+0x69/0x110 [ 2327.140022] ? __do_fault+0x42/0x120 [ 2327.144008] ? __handle_mm_fault+0x1062/0x17a0 [ 2327.148965] ? __fget_light+0xa7/0xc0 [ 2327.153049] __sys_sendmsg+0xcb/0x150 [ 2327.157133] ? __sys_sendmsg+0xcb/0x150 [ 2327.161409] ? SyS_shutdown+0x140/0x140 [ 2327.165688] ? exit_to_usermode_loop+0xd0/0xd0 [ 2327.170646] ? __do_page_fault+0x55d/0x620 [ 2327.175216] ? __sys_sendmsg+0x150/0x150 [ 2327.179591] SyS_sendmsg+0x12/0x20 [ 2327.183384] do_syscall_64+0xe3/0x230 [ 2327.187471] entry_SYSCALL64_slow_path+0x25/0x25 [ 2327.192622] RIP: 0033:0x7f41d18fa3b0 [ 2327.196608] RSP: 002b:00007ffc3b731218 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 2327.205055] RAX: ffffffffffffffda RBX: 00007ffc3b731380 RCX: 00007f41d18fa3b0 [ 2327.213017] RDX: 0000000000000000 RSI: 00007ffc3b731340 RDI: 0000000000000003 [ 2327.220978] RBP: 0000000000000002 R08: 0000000000000004 R09: 0000000000000040 [ 2327.228939] R10: 00007ffc3b730f30 R11: 0000000000000246 R12: 0000000000000003 [ 2327.236901] R13: 00007ffc3b731340 R14: 00007ffc3b7313d0 R15: 0000000000000084 [ 2327.244865] Object at ffff881be87797e0, in cache kmalloc-64 size: 64 [ 2327.251953] Allocated: [ 2327.254581] PID = 9484 [ 2327.257215] save_stack_trace+0x1b/0x20 [ 2327.261485] save_stack+0x46/0xd0 [ 2327.265179] kasan_kmalloc+0xad/0xe0 [ 2327.269165] kmem_cache_alloc_trace+0xe6/0x1d0 [ 2327.274138] sctp_add_bind_addr+0x58/0x180 [sctp] [ 2327.279400] sctp_do_bind+0x208/0x310 [sctp] [ 2327.284176] sctp_bind+0x61/0xa0 [sctp] [ 2327.288455] inet_bind+0x5f/0x3a0 [ 2327.292151] SYSC_bind+0x1a4/0x1e0 [ 2327.295944] SyS_bind+0xe/0x10 [ 2327.299349] do_syscall_64+0xe3/0x230 [ 2327.303433] return_from_SYSCALL_64+0x0/0x6a [ 2327.308194] Freed: [ 2327.310434] PID = 4131 [ 2327.313065] save_stack_trace+0x1b/0x20 [ 2327.317344] save_stack+0x46/0xd0 [ 2327.321040] kasan_slab_free+0x73/0xc0 [ 2327.325220] kfree+0x96/0x1a0 [ 2327.328530] dynamic_kobj_release+0x15/0x40 [ 2327.333195] kobject_release+0x99/0x1e0 [ 2327.337472] kobject_put+0x38/0x70 [ 2327.341266] free_notes_attrs+0x66/0x80 [ 2327.345545] mod_sysfs_teardown+0x1a5/0x270 [ 2327.350211] free_module+0x20/0x2a0 [ 2327.354099] SyS_delete_module+0x2cb/0x2f0 [ 2327.358667] do_syscall_64+0xe3/0x230 [ 2327.362750] return_from_SYSCALL_64+0x0/0x6a [ 2327.367510] Memory state around the buggy address: [ 2327.372855] ffff881be8779700: fc fc fc fc 00 00 00 00 00 00 00 00 fc fc fc fc [ 2327.380914] ffff881be8779780: fb fb fb fb fb fb fb fb fc fc fc fc 00 00 00 00 [ 2327.388972] >ffff881be8779800: 00 00 00 00 fc fc fc fc fb fb fb fb fb fb fb fb [ 2327.397031] ^ [ 2327.401792] ffff881be8779880: fc fc fc fc fb fb fb fb fb fb fb fb fc fc fc fc [ 2327.409850] ffff881be8779900: 00 00 00 00 00 04 fc fc fc fc fc fc 00 00 00 00 [ 2327.417907] ================================================================== This fixes CVE-2017-7558. References: https://bugzilla.redhat.com/show_bug.cgi?id=1480266 Fixes: 8f840e47f190 ("sctp: add the sctp_diag.c file") Cc: Xin Long <lucien.xin@gmail.com> Cc: Vlad Yasevich <vyasevich@gmail.com> Cc: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-11sctp: remove the typedef sctp_socket_type_tXin Long
This patch is to remove the typedef sctp_socket_type_t, and replace with enum sctp_socket_type in the places where it's using this typedef. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-11sctp: remove the typedef sctp_cmsgs_tXin Long
This patch is to remove the typedef sctp_cmsgs_t, and replace with struct sctp_cmsgs in the places where it's using this typedef. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-06sctp: remove the typedef sctp_scope_tXin Long
This patch is to remove the typedef sctp_scope_t, and replace with enum sctp_scope in the places where it's using this typedef. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01sctp: Add peeloff-flags socket optionNeil Horman
Based on a request raised on the sctp devel list, there is a need to augment the sctp_peeloff operation while specifying the O_CLOEXEC and O_NONBLOCK flags (simmilar to the socket syscall). Since modifying the SCTP_SOCKOPT_PEELOFF socket option would break user space ABI for existing programs, this patch creates a new socket option SCTP_SOCKOPT_PEELOFF_FLAGS, which accepts a third flags parameter to allow atomic assignment of the socket descriptor flags. Tested successfully by myself and the requestor Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: Vlad Yasevich <vyasevich@gmail.com> CC: "David S. Miller" <davem@davemloft.net> CC: Andreas Steinmetz <ast@domdv.de> CC: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01sctp: remove the typedef sctp_paramhdr_tXin Long
This patch is to remove the typedef sctp_paramhdr_t, and replace with struct sctp_paramhdr in the places where it's using this typedef. It is also to fix some indents and use sizeof(variable) instead of sizeof(type). Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01net: convert sock.sk_wmem_alloc from atomic_t to refcount_tReshetova, Elena
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01net: convert sk_buff.users from atomic_t to refcount_tReshetova, Elena
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Two entries being added at the same time to the IFLA policy table, whilst parallel bug fixes to decnet routing dst handling overlapping with the dst gc removal in net-next. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15sctp: return next obj by passing pos + 1 into sctp_transport_get_idxXin Long
In sctp_for_each_transport, pos is used to save how many objs it has dumped. Now it gets the last obj by sctp_transport_get_idx, then gets the next obj by sctp_transport_get_next. The issue is that in the meanwhile if some objs in transport hashtable are removed and the objs nums are less than pos, sctp_transport_get_idx would return NULL and hti.walker.tbl is NULL as well. At this moment it should stop hti, instead of continue getting the next obj. Or it would cause a NULL pointer dereference in sctp_transport_get_next. This patch is to pass pos + 1 into sctp_transport_get_idx to get the next obj directly, even if pos > objs nums, it would return NULL and stop hti. Fixes: 626d16f50f39 ("sctp: export some apis or variables for sctp_diag and reuse some for proc") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
The conflicts were two cases of overlapping changes in batman-adv and the qed driver. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-10sctp: fix recursive locking warning in sctp_do_peeloffXin Long
Dmitry got the following recursive locking report while running syzkaller fuzzer, the Call Trace: __dump_stack lib/dump_stack.c:16 [inline] dump_stack+0x2ee/0x3ef lib/dump_stack.c:52 print_deadlock_bug kernel/locking/lockdep.c:1729 [inline] check_deadlock kernel/locking/lockdep.c:1773 [inline] validate_chain kernel/locking/lockdep.c:2251 [inline] __lock_acquire+0xef2/0x3430 kernel/locking/lockdep.c:3340 lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3755 lock_sock_nested+0xcb/0x120 net/core/sock.c:2536 lock_sock include/net/sock.h:1460 [inline] sctp_close+0xcd/0x9d0 net/sctp/socket.c:1497 inet_release+0xed/0x1c0 net/ipv4/af_inet.c:425 inet6_release+0x50/0x70 net/ipv6/af_inet6.c:432 sock_release+0x8d/0x1e0 net/socket.c:597 __sock_create+0x38b/0x870 net/socket.c:1226 sock_create+0x7f/0xa0 net/socket.c:1237 sctp_do_peeloff+0x1a2/0x440 net/sctp/socket.c:4879 sctp_getsockopt_peeloff net/sctp/socket.c:4914 [inline] sctp_getsockopt+0x111a/0x67e0 net/sctp/socket.c:6628 sock_common_getsockopt+0x95/0xd0 net/core/sock.c:2690 SYSC_getsockopt net/socket.c:1817 [inline] SyS_getsockopt+0x240/0x380 net/socket.c:1799 entry_SYSCALL_64_fastpath+0x1f/0xc2 This warning is caused by the lock held by sctp_getsockopt() is on one socket, while the other lock that sctp_close() is getting later is on the newly created (which failed) socket during peeloff operation. This patch is to avoid this warning by use lock_sock with subclass SINGLE_DEPTH_NESTING as Wang Cong and Marcelo's suggestion. Reported-by: Dmitry Vyukov <dvyukov@google.com> Suggested-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-10sctp: disable BH in sctp_for_each_endpointXin Long
Now sctp holds read_lock when foreach sctp_ep_hashtable without disabling BH. If CPU schedules to another thread A at this moment, the thread A may be trying to hold the write_lock with disabling BH. As BH is disabled and CPU cannot schedule back to the thread holding the read_lock, while the thread A keeps waiting for the read_lock. A dead lock would be triggered by this. This patch is to fix this dead lock by calling read_lock_bh instead to disable BH when holding the read_lock in sctp_for_each_endpoint. Fixes: 626d16f50f39 ("sctp: export some apis or variables for sctp_diag and reuse some for proc") Reported-by: Xiumei Mu <xmu@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-08tcp: add TCPMemoryPressuresChrono counterEric Dumazet
DRAM supply shortage and poor memory pressure tracking in TCP stack makes any change in SO_SNDBUF/SO_RCVBUF (or equivalent autotuning limits) and tcp_mem[] quite hazardous. TCPMemoryPressures SNMP counter is an indication of tcp_mem sysctl limits being hit, but only tracking number of transitions. If TCP stack behavior under stress was perfect : 1) It would maintain memory usage close to the limit. 2) Memory pressure state would be entered for short times. We certainly prefer 100 events lasting 10ms compared to one event lasting 200 seconds. This patch adds a new SNMP counter tracking cumulative duration of memory pressure events, given in ms units. $ cat /proc/sys/net/ipv4/tcp_mem 3088 4117 6176 $ grep TCP /proc/net/sockstat TCP: inuse 180 orphan 0 tw 2 alloc 234 mem 4140 $ nstat -n ; sleep 10 ; nstat |grep Pressure TcpExtTCPMemoryPressures 1700 TcpExtTCPMemoryPressuresChrono 5209 v2: Used EXPORT_SYMBOL_GPL() instead of EXPORT_SYMBOL() as David instructed. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-02sctp: define the member stream as an object instead of pointer in asocXin Long
As Marcelo's suggestion, stream is a fixed size member of asoc and would not grow with more streams. To avoid an allocation for it, this patch is to define it as an object instead of pointer and update the places using it, also create sctp_stream_update() called in sctp_assoc_update() to migrate the stream info from one stream to another. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts were simply overlapping changes. In the net/ipv4/route.c case the code had simply moved around a little bit and the same fix was made in both 'net' and 'net-next'. In the net/sched/sch_generic.c case a fix in 'net' happened at the same time that a new argument was added to qdisc_hash_add(). Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-06sctp: listen on the sock only when it's state is listening or closedXin Long
Now sctp doesn't check sock's state before listening on it. It could even cause changing a sock with any state to become a listening sock when doing sctp_listen. This patch is to fix it by checking sock's state in sctp_listen, so that it will listen on the sock with right state. Reported-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Mostly simple cases of overlapping changes (adding code nearby, a function whose name changes, for example). Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-05sctp: get sock from transport in sctp_transport_update_pmtuXin Long
This patch is almost to revert commit 02f3d4ce9e81 ("sctp: Adjust PMTU updates to accomodate route invalidation."). As t->asoc can't be NULL in sctp_transport_update_pmtu, it could get sk from asoc, and no need to pass sk into that function. It is also to remove some duplicated codes from that function. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-03sctp: add SCTP_PR_STREAM_STATUS sockopt for prsctpXin Long
Before when implementing sctp prsctp, SCTP_PR_STREAM_STATUS wasn't added, as it needs to save abandoned_(un)sent for every stream. After sctp stream reconf is added in sctp, assoc has structure sctp_stream_out to save per stream info. This patch is to add SCTP_PR_STREAM_STATUS by putting the prsctp per stream statistics into sctp_stream_out. v1->v2: fix an indent issue. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-01sctp: use right in and out stream cntXin Long
Since sctp reconf was added in sctp, the real cnt of in/out stream have not been c.sinit_max_instreams and c.sinit_num_ostreams any more. This patch is to replace them with stream->in/outcnt. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-28sctp: change to save MSG_MORE flag into assocXin Long
David Laight noticed the support for MSG_MORE with datamsg->force_delay didn't really work as we expected, as the first msg with MSG_MORE set would always block the following chunks' dequeuing. This Patch is to rewrite it by saving the MSG_MORE flag into assoc as David Laight suggested. asoc->force_delay is used to save MSG_MORE flag before a msg is sent. All chunks in queue would not be sent out if asoc->force_delay is set by the msg with MSG_MORE flag, until a new msg without MSG_MORE flag clears asoc->force_delay. Note that this change would not affect the flush is generated by other triggers, like asoc->state != ESTABLISHED, queue size > pmtu etc. v1->v2: Not clear asoc->force_delay after sending the msg with MSG_MORE flag. Fixes: 4ea0c32f5f42 ("sctp: add support for MSG_MORE") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: David Laight <david.laight@aculab.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24net: Change return type of sk_busy_loop from bool to voidAlexander Duyck
checking the return value of sk_busy_loop. As there are only a few consumers of that data, and the data being checked for can be replaced with a check for !skb_queue_empty() we might as well just pull the code out of sk_busy_loop and place it in the spots that actually need it. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/broadcom/genet/bcmgenet.c net/core/sock.c Conflicts were overlapping changes in bcmgenet and the lockdep handling of sockets. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-12sctp: add get and set sockopt for reconf_enableXin Long
This patchset is to add SCTP_RECONFIG_SUPPORTED sockopt, it would set and get asoc reconf_enable value when asoc_id is set, or it would set and get ep reconf_enalbe value if asoc_id is 0. It is also to add sysctl interface for users to set the default value for reconf_enable. After this patch, stream reconf will work. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09net: Work around lockdep limitation in sockets that use socketsDavid Howells
Lockdep issues a circular dependency warning when AFS issues an operation through AF_RXRPC from a context in which the VFS/VM holds the mmap_sem. The theory lockdep comes up with is as follows: (1) If the pagefault handler decides it needs to read pages from AFS, it calls AFS with mmap_sem held and AFS begins an AF_RXRPC call, but creating a call requires the socket lock: mmap_sem must be taken before sk_lock-AF_RXRPC (2) afs_open_socket() opens an AF_RXRPC socket and binds it. rxrpc_bind() binds the underlying UDP socket whilst holding its socket lock. inet_bind() takes its own socket lock: sk_lock-AF_RXRPC must be taken before sk_lock-AF_INET (3) Reading from a TCP socket into a userspace buffer might cause a fault and thus cause the kernel to take the mmap_sem, but the TCP socket is locked whilst doing this: sk_lock-AF_INET must be taken before mmap_sem However, lockdep's theory is wrong in this instance because it deals only with lock classes and not individual locks. The AF_INET lock in (2) isn't really equivalent to the AF_INET lock in (3) as the former deals with a socket entirely internal to the kernel that never sees userspace. This is a limitation in the design of lockdep. Fix the general case by: (1) Double up all the locking keys used in sockets so that one set are used if the socket is created by userspace and the other set is used if the socket is created by the kernel. (2) Store the kern parameter passed to sk_alloc() in a variable in the sock struct (sk_kern_sock). This informs sock_lock_init(), sock_init_data() and sk_clone_lock() as to the lock keys to be used. Note that the child created by sk_clone_lock() inherits the parent's kern setting. (3) Add a 'kern' parameter to ->accept() that is analogous to the one passed in to ->create() that distinguishes whether kernel_accept() or sys_accept4() was the caller and can be passed to sk_alloc(). Note that a lot of accept functions merely dequeue an already allocated socket. I haven't touched these as the new socket already exists before we get the parameter. Note also that there are a couple of places where I've made the accepted socket unconditionally kernel-based: irda_accept() rds_rcp_accept_one() tcp_accept_from_sock() because they follow a sock_create_kern() and accept off of that. Whilst creating this, I noticed that lustre and ocfs don't create sockets through sock_create_kern() and thus they aren't marked as for-kernel, though they appear to be internal. I wonder if these should do that so that they use the new set of lock keys. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-02sched/headers: Prepare for new header dependencies before moving code to ↵Ingo Molnar
<linux/sched/signal.h> We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/signal.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-24sctp: deny peeloff operation on asocs with threads sleeping on itMarcelo Ricardo Leitner
commit 2dcab5984841 ("sctp: avoid BUG_ON on sctp_wait_for_sndbuf") attempted to avoid a BUG_ON call when the association being used for a sendmsg() is blocked waiting for more sndbuf and another thread did a peeloff operation on such asoc, moving it to another socket. As Ben Hutchings noticed, then in such case it would return without locking back the socket and would cause two unlocks in a row. Further analysis also revealed that it could allow a double free if the application managed to peeloff the asoc that is created during the sendmsg call, because then sctp_sendmsg() would try to free the asoc that was created only for that call. This patch takes another approach. It will deny the peeloff operation if there is a thread sleeping on the asoc, so this situation doesn't exist anymore. This avoids the issues described above and also honors the syscalls that are already being handled (it can be multiple sendmsg calls). Joint work with Xin Long. Fixes: 2dcab5984841 ("sctp: avoid BUG_ON on sctp_wait_for_sndbuf") Cc: Alexander Popov <alex.popov@linux.com> Cc: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-20sctp: add support for MSG_MOREXin Long
This patch is to add support for MSG_MORE on sctp. It adds force_delay in sctp_datamsg to save MSG_MORE, and sets it after creating datamsg according to the send flag. sctp_packet_can_append_data then uses it to decide if the chunks of this msg will be sent at once or delay it. Note that unlike [1], this patch saves MSG_MORE in datamsg, instead of in assoc. As sctp enqueues the chunks first, then dequeue them one by one. If it's saved in assoc,the current msg's send flag (MSG_MORE) may affect other chunks' bundling. Since last patch, sctp flush out queue once assoc state falls into SHUTDOWN_PENDING, the close block problem mentioned in [1] has been solved as well. [1] https://patchwork.ozlabs.org/patch/372404/ Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-09sctp: implement sender-side procedures for Add Incoming/Outgoing Streams ↵Xin Long
Request Parameter This patch is to implement Sender-Side Procedures for the Add Outgoing and Incoming Streams Request Parameter described in rfc6525 section 5.1.5-5.1.6. It is also to add sockopt SCTP_ADD_STREAMS in rfc6525 section 6.3.4 for users. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-09sctp: implement sender-side procedures for SSN/TSN Reset Request ParameterXin Long
This patch is to implement Sender-Side Procedures for the SSN/TSN Reset Request Parameter descibed in rfc6525 section 5.1.4. It is also to add sockopt SCTP_RESET_ASSOC in rfc6525 section 6.3.3 for users. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
The conflict was an interaction between a bug fix in the netvsc driver in 'net' and an optimization of the RX path in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-07sctp: check af before verify address in sctp_addr_id2transportXin Long
Commit 6f29a1306131 ("sctp: sctp_addr_id2transport should verify the addr before looking up assoc") invoked sctp_verify_addr to verify the addr. But it didn't check af variable beforehand, once users pass an address with family = 0 through sockopt, sctp_get_af_specific will return NULL and NULL pointer dereference will be caused by af->sockaddr_len. This patch is to fix it by returning NULL if af variable is NULL. Fixes: 6f29a1306131 ("sctp: sctp_addr_id2transport should verify the addr before looking up assoc") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-07sctp: add dst_pending_confirm flagJulian Anastasov
Add new transport flag to allow sockets to confirm neighbour. When same struct dst_entry can be used for many different neighbours we can not use it for pending confirmations. The flag is propagated from transport to every packet. It is reset when cached dst is reset. Reported-by: YueHaibing <yuehaibing@huawei.com> Fixes: 5110effee8fd ("net: Do delayed neigh confirmation.") Fixes: f2bb4bedf35d ("ipv4: Cache output routes in fib_info nexthops.") Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-07sctp: avoid BUG_ON on sctp_wait_for_sndbufMarcelo Ricardo Leitner
Alexander Popov reported that an application may trigger a BUG_ON in sctp_wait_for_sndbuf if the socket tx buffer is full, a thread is waiting on it to queue more data and meanwhile another thread peels off the association being used by the first thread. This patch replaces the BUG_ON call with a proper error handling. It will return -EPIPE to the original sendmsg call, similarly to what would have been done if the association wasn't found in the first place. Acked-by: Alexander Popov <alex.popov@linux.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Two trivial overlapping changes conflicts in MPLS and mlx5. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25sctp: sctp_addr_id2transport should verify the addr before looking up assocXin Long
sctp_addr_id2transport is a function for sockopt to look up assoc by address. As the address is from userspace, it can be a v4-mapped v6 address. But in sctp protocol stack, it always handles a v4-mapped v6 address as a v4 address. So it's necessary to convert it to a v4 address before looking up assoc by address. This patch is to fix it by calling sctp_verify_addr in which it can do this conversion before calling sctp_endpoint_lookup_assoc, just like what sctp_sendmsg and __sctp_connect do for the address from users. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24Introduce a sysctl that modifies the value of PROT_SOCK.Krister Johansen
Add net.ipv4.ip_unprivileged_port_start, which is a per namespace sysctl that denotes the first unprivileged inet port in the namespace. To disable all privileged ports set this to zero. It also checks for overlap with the local port range. The privileged and local range may not overlap. The use case for this change is to allow containerized processes to bind to priviliged ports, but prevent them from ever being allowed to modify their container's network configuration. The latter is accomplished by ensuring that the network namespace is not a child of the user namespace. This modification was needed to allow the container manager to disable a namespace's priviliged port restrictions without exposing control of the network namespace to processes in the user namespace. Signed-off-by: Krister Johansen <kjlx@templeofstupid.com> Signed-off-by: David S. Miller <davem@davemloft.net>