linux.git - Linux kernel

Age	Commit message (Collapse)	Author
2016-03-10	net/sched: Macro instead of CONFIG_NET_CLS_ACT ifdef	Amir Vadai
	Introduce the macros tc_no_actions and tc_for_each_action to make code clearer. Extracted struct tc_action out of the ifdef to make calls to is_tcf_gact_shot() and similar functions valid, even when it is a nop. Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: John Fastabend <john.r.fastabend@intel.com> Suggested-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Amir Vadai <amir@vadai.me> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-10	net/flow_dissector: Make dissector_uses_key() and ↵	Amir Vadai
	skb_flow_dissector_target() public Will be used in a following patch to query if a key is being used, and what it's value in the target object. Acked-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Amir Vadai <amir@vadai.me> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-10	net/flower: Introduce hardware offload support	Amir Vadai
	This patch is based on a patch made by John Fastabend. It adds support for offloading cls_flower. when NETIF_F_HW_TC is on: flags = 0 => Rule will be processed twice - by hardware, and if still relevant, by software. flags = SKIP_HW => Rull will be processed by software only If hardware fail/not capabale to apply the rule, operation will NOT fail. Filter will be processed by SW only. Acked-by: Jiri Pirko <jiri@mellanox.com> Suggested-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: Amir Vadai <amir@vadai.me> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-10	Merge branch 'mediatek-eth'	David S. Miller
	John Crispin says: ==================== net-next: mediatek: add ethernet driver This series adds support for the Mediatek ethernet core found on current ARM based SoCs. The driver works on MT2701 and MT7623 SoCs Instead of trying to upstream everything at once I decided to concentrate on the important parts required to make current generation silicon work. The V3 series only includes the code required to make dual MAC setups work and only supports the newer QDMA engine. Changes in V5 * reduce the mdio timeut to HZ * add a call to usleep_range() which schedules in the background. Changes in V4 * remove ugly _FE macro, use offsetof() instead Changes in V3 * only include code for MT2701/7623 support * drop support for PDMA and older MIPS based SoCs * drop switch support Changes in V2 * change the namespace of the functions from fe_* to mtk_* * add support for the latest generation of ARM SoCs * add dual MAC support * remove the swconfig specific bits * remove most of the magic values and replace them with defines * add verbose descriptions to the patches ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-10	net-next: mediatek: add an entry to MAINTAINERS	John Crispin
	Add myself and Felix as the Maintainers for the MediaTek ethernet driver. Signed-off-by: Felix Fietkau <nbd@openwrt.org> Signed-off-by: John Crispin <blogic@openwrt.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-10	net-next: mediatek: add Kconfig and Makefile	John Crispin
	This patch adds the Makefile and Kconfig required to make the driver build. Signed-off-by: John Crispin <blogic@openwrt.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-10	net-next: mediatek: add support for MT7623 ethernet	John Crispin
	Add ethernet support for MediaTek SoCs from the MT7623 family. These have dual GMAC. Depending on the exact version, there might be a built-in Gigabit switch (MT7530). The core does not have the typical DMA ring setup. Instead there is a linked list that we add descriptors to. There is only one linked list that both MACs use together. There is a special field inside the TX descriptors called the VQID. This allows us to assign packets to different internal queues. By using a separate id for each MAC we are able to get deterministic results for BQL. Additionally we need to provide the core with a block of scratch memory that is the same size as the RX ring and data buffer. This is really needed to make the HW datapath work. Although the driver does not support this yet, we still need to assign the memory and tell the core about it for RX to work. Signed-off-by: Felix Fietkau <nbd@openwrt.org> Signed-off-by: Michael Lee <igvtee@gmail.com> Signed-off-by: John Crispin <blogic@openwrt.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-10	net-next: mediatek: document MediaTek SoC ethernet binding	John Crispin
	This adds the binding documentation for the MediaTek Ethernet controller. Signed-off-by: John Crispin <blogic@openwrt.org> Acked-by: Rob Herring <robh@kernel.org> Cc: devicetree@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-10	net: dsa: Fix cleanup resources upon module removal	Neil Armstrong
	The initial commit badly merged into the dsa_resume method instead of the dsa_remove_dst method. As consequence, the dst->master_netdev->dsa_ptr is not set to NULL on removal and re-bind of the dsa device fails with error -17. Fixes: b0dc635d923c ("net: dsa: cleanup resources upon module removal ") Signed-off-by: Neil Armstrong <narmstrong@baylibre.com> Acked-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-10	qede: Fix net-next "make ARCH=x86_64"	Manish Chopra
	'commit 55482edc25f0606851de42e73618f813f310d009 ("qede: Add slowpath/fastpath support and enable hardware GRO")' introduces below error when compiling net-next with "make ARCH=x86_64" drivers/built-in.o: In function `qede_rx_int': qede_main.c:(.text+0x6101a0): undefined reference to `tcp_gro_complete' Signed-off-by: Manish Chopra <manish.chopra@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-10	Merge branch 'qlcnic-next'	David S. Miller
	Rajesh Borundia says: ==================== qlcnic fixes This series adds following fixes. o While processing mailbox if driver gets a spurious mailbox interrupt it leads into premature completion of a next mailbox request. Added a guard against this by checking current state of mailbox and ignored spurious interrupt. Added a stats counter to record this condition. v2: o Added patch that removes usage of atomic_t as we are not implemeting atomicity by using atomic_t value. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-10	qlcnic: Fix mailbox completion handling during spurious interrupt	Rajesh Borundia
	o While the driver is in the middle of a MB completion processing and it receives a spurious MB interrupt, it is mistaken as a good MB completion interrupt leading to premature completion of the next MB request. Fix the driver to guard against this by checking the current state of MB processing and ignore the spurious interrupt. Also added a stats counter to record this condition. Signed-off-by: Rajesh Borundia <rajesh.borundia@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-10	qlcnic: Remove unnecessary usage of atomic_t	Rajesh Borundia
	o atomic_t usage is incorrect as we are not implementing any atomicity. Signed-off-by: Rajesh Borundia <rajesh.borundia@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-10	Merge branch 'cxgb4-next'	David S. Miller
	Hariprasad Shenai says: ==================== cxgb4vf: Interrupt and queue configuration changes This series fixes some issues and some changes in the queue and interrupt configuration for cxgb4vf driver. We need to enable interrupts before we register our network device, so that we don't loose link up interrupts. Allocate rx queues based on interrupt type. Set number of tx/rx queues in probe function only. Also adds check for some invalid configurations. This patch series has been created against net-next tree and includes patches on cxgb4vf driver. We have included all the maintainers of respective drivers. Kindly review the change and let us know in case of any review comments. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-10	cxgb4vf: Set number of queues in pci probe only	Hariprasad Shenai
	Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-10	cxgb4vf: Add a couple more checks for invalid provisioning configurations	Hariprasad Shenai
	Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-10	cxgb4vf: Configure queue based on resource and interrupt type	Hariprasad Shenai
	The Queue Set Configuration code was always reserving room for a Forwarded interrupt Queue even in the cases where we weren't using it. Figure out how many Ports and Queue Sets we can support. This depends on knowing our Virtual Function Resources and may be called a second time if we fall back from MSI-X to MSI Interrupt Mode. This change fixes that problem. Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-10	cxgb4vf: Enable interrupts before we register our network devices	Hariprasad Shenai
	This avoids a race condition where a system that has network devices set up to be automatically configured and we get the first Port Link Status message from the firmware on the Asynchronous Firmware Event Queue before we've enabled interrupts. If that happens, we end up losing the interrupt and never realizing that the links has actually come up. Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-10	net: dsa: mv88e6xxx: avoid writing the same mode	Vivien Didelot
	There is no need to change the 802.1Q port mode for the same value. Thus avoid such message: [ 401.954836] dsa dsa@0 lan0: 802.1Q Mode: Disabled (was Disabled) Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Tested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-10	net: dsa: mv88e6xxx: read then write PVID	Vivien Didelot
	The port register 0x07 contains more options than just the default VID, even though they are not used yet. So prefer a read then write operation over a direct write. This also allows to keep track of the change through dynamic debug. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Tested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-10	net: dsa: mv88e6xxx: rework port state setter	Vivien Didelot
	Apply a few non-functional changes on the port state setter: * add a dynamic debug message with state names to track changes * explicit states checking instead of assuming their numeric values * lock mutex only once when changing several port states * use bitmap macros to declare and access port_state_update_mask Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Tested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-10	Merge branch 'sh_eth-fixes'	David S. Miller
	Sergei Shtylyov says: ==================== sh_eth: fix couple of bugs in sh_eth_ring_format() Here's a set of 2 patches against DaveM's 'net.git' repo fixing two bugs in sh_eth_.ring_format()... [1/2] sh_eth: fix NULL pointer dereference in sh_eth_ring_format() [2/2] sh_eth: advance 'rxdesc' later in sh_eth_ring_format() ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-10	sh_eth: advance 'rxdesc' later in sh_eth_ring_format()	Sergei Shtylyov
	Iff dma_map_single() fails, 'rxdesc' should point to the last filled RX descriptor, so that it can be marked as the last one, however the driver would have already advanced it by that time. In order to fix that, only fill an RX descriptor once all the data for it is ready. Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-10	sh_eth: fix NULL pointer dereference in sh_eth_ring_format()	Sergei Shtylyov
	In a low memory situation, if netdev_alloc_skb() fails on a first RX ring loop iteration in sh_eth_ring_format(), 'rxdesc' is still NULL. Avoid kernel oops by adding the 'rxdesc' check after the loop. Reported-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-10	kcm: mark helper functions inline	Arnd Bergmann
	The stub helper functions for the newly added kcm_proc_init/exit interfaces are defined as 'static' in a header file, which leads to build warnings for each file that includes them without calling them: include/net/kcm.h:183:12: error: 'kcm_proc_init' defined but not used [-Werror=unused-function] include/net/kcm.h:184:13: error: 'kcm_proc_exit' defined but not used [-Werror=unused-function] This marks the two functions as 'static inline' instead, which avoids the warnings and is obviously what was meant here. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: cd6e111bf5be ("kcm: Add statistics and proc interfaces") Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-10	Merge tag 'linux-can-next-for-4.6-20160310' of ↵	David S. Miller
	git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next Marc Kleine-Budde says: ==================== pull-request: can-next 2016-03-10 this is a pull request of 5 patch for net-next/master. Marek Vasut contributes 4 patches for the ifi CAN driver, which makes it work on real hardware. There is one patch by Ramesh Shanmugasundaram for the rcar_can driver that adds support for the 3rd generation IP core. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-10	can: rcar_can: Add r8a7795 support	Ramesh Shanmugasundaram
	Added r8a7795 SoC support. Signed-off-by: Ramesh Shanmugasundaram <ramesh.shanmugasundaram@bp.renesas.com> Acked-by: Rob Herring <robh@kernel.org> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2016-03-10	can: ifi: Add obscure bit swap for EFF frame IDs	Marek Vasut
	In case of CAN2.0 EFF frame, the controller handles frame IDs in a rather bizzare way. The ID is split into an extended part, IDX[28:11] and standard part, ID[10:0]. In the TX path, the core first sends the top 11 bits of the IDX, followed by ID and finally the rest of IDX. In the RX path, the core stores the ID the LSbit part of IDX field, followed by the LSbit parts of real IDX. The MSbit parts of IDX are stored in ID field of the register. This patch implements the necessary bit shuffling to mitigate this obscure behavior. In case two of these controllers are connected together, the RX and TX bit swapping nullifies itself and the issue does not manifest. The issue only manifests when talking to another different CAN controller. Signed-off-by: Marek Vasut <marex@denx.de> Cc: Marc Kleine-Budde <mkl@pengutronix.de> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Oliver Hartkopp <socketcan@hartkopp.net> Cc: Wolfgang Grandegger <wg@grandegger.com> Reviewed-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2016-03-10	can: ifi: Fix RX and TX ID mask	Marek Vasut
	The RX and TX ID mask for CAN2.0 is 11 bits wide. This patch fixes the incorrect mask, which caused the CAN IDs to miss the MSBit both on receive and transmit. Signed-off-by: Marek Vasut <marex@denx.de> Cc: Marc Kleine-Budde <mkl@pengutronix.de> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Oliver Hartkopp <socketcan@hartkopp.net> Cc: Wolfgang Grandegger <wg@grandegger.com> Reviewed-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2016-03-10	can: ifi: Fix TX DLC configuration	Marek Vasut
	The TX DLC, the transmission length information, was not written into the transmit configuration register. When using the CAN core with different CAN controller, the receiving CAN controller will receive only the ID part of the CAN frame, but no data at all. This patch adds the TX DLC into the register to fix this issue. Signed-off-by: Marek Vasut <marex@denx.de> Cc: Marc Kleine-Budde <mkl@pengutronix.de> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Oliver Hartkopp <socketcan@hartkopp.net> Cc: Wolfgang Grandegger <wg@grandegger.com> Reviewed-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2016-03-10	can: ifi: Fix clock generator configuration	Marek Vasut
	The clock generation does not match reality when using the CAN IP core outside of the FPGA design. This patch fixes the computation of values which are programmed into the clock generator registers. First, there are some off-by-one errors which manifest themselves only when communicating with different controller, so those are fixed. Second, the bits in the clock generator registers have different meaning depending on whether the core is in ISO CANFD mode or any of the other modes (BOSCH CANFD or CAN2.0). Detect the ISO CANFD mode and fix handling of this special case of clock configuration. Finally, the CAN clock speed is in CANCLOCK register, not SYSCLOCK register, so fix this as well. Signed-off-by: Marek Vasut <marex@denx.de> Cc: Marc Kleine-Budde <mkl@pengutronix.de> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Oliver Hartkopp <socketcan@hartkopp.net> Cc: Wolfgang Grandegger <wg@grandegger.com> Reviewed-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2016-03-09	bpf: avoid copying junk bytes in bpf_get_current_comm()	Alexei Starovoitov
	Lots of places in the kernel use memcpy(buf, comm, TASK_COMM_LEN); but the result is typically passed to print("%s", buf) and extra bytes after zero don't cause any harm. In bpf the result of bpf_get_current_comm() is used as the part of map key and was causing spurious hash map mismatches. Use strlcpy() to guarantee zero-terminated string. bpf verifier checks that output buffer is zero-initialized, so even for short task names the output buffer don't have junk bytes. Note it's not a security concern, since kprobe+bpf is root only. Fixes: ffeedafbf023 ("bpf: introduce current->pid, tgid, uid, gid, comm accessors") Reported-by: Tobias Waldekranz <tobias@waldekranz.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-09	bpf: bpf_stackmap_copy depends on CONFIG_PERF_EVENTS	Alexei Starovoitov
	0-day bot reported build error: kernel/built-in.o: In function `map_lookup_elem': >> kernel/bpf/.tmp_syscall.o:(.text+0x329b3c): undefined reference to `bpf_stackmap_copy' when CONFIG_BPF_SYSCALL is set and CONFIG_PERF_EVENTS is not. Add weak definition to resolve it. This code path in map_lookup_elem() is never taken when CONFIG_PERF_EVENTS is not set. Fixes: 557c0c6e7df8 ("bpf: convert stackmap to pre-allocation") Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-09	Merge branch 'variable-length-ll-headers'	David S. Miller
	Willem de Bruijn says: ==================== net: validate variable length ll headers Allow device-specific validation of link layer headers. Existing checks drop all packets shorter than hard_header_len. For variable length protocols, such packets can be valid. patch 1 adds header_ops.validate and dev_validate_header patch 2 implements the protocol specific callback for AX25 patch 3 replaces ll_header_truncated with dev_validate_header ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-09	packet: validate variable length ll headers	Willem de Bruijn
	Replace link layer header validation check ll_header_truncate with more generic dev_validate_header. Validation based on hard_header_len incorrectly drops valid packets in variable length protocols, such as AX25. dev_validate_header calls header_ops.validate for such protocols to ensure correctness below hard_header_len. See also http://comments.gmane.org/gmane.linux.network/401064 Fixes 9c7077622dd9 ("packet: make packet_snd fail on len smaller than l2 header") Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-09	ax25: add link layer header validation function	Willem de Bruijn
	As variable length protocol, AX25 fails link layer header validation tests based on a minimum length. header_ops.validate allows protocols to validate headers that are shorter than hard_header_len. Implement this callback for AX25. See also http://comments.gmane.org/gmane.linux.network/401064 Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-09	net: validate variable length ll headers	Willem de Bruijn
	Netdevice parameter hard_header_len is variously interpreted both as an upper and lower bound on link layer header length. The field is used as upper bound when reserving room at allocation, as lower bound when validating user input in PF_PACKET. Clarify the definition to be maximum header length. For validation of untrusted headers, add an optional validate member to header_ops. Allow bypassing of validation by passing CAP_SYS_RAWIO, for instance for deliberate testing of corrupt input. In this case, pad trailing bytes, as some device drivers expect completely initialized headers. See also http://comments.gmane.org/gmane.linux.network/401064 Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-09	Merge branch 'kcm'	David S. Miller
	Tom Herbert says: ==================== kcm: Kernel Connection Multiplexor (KCM) Kernel Connection Multiplexor (KCM) is a facility that provides a message based interface over TCP for generic application protocols. The motivation for this is based on the observation that although TCP is byte stream transport protocol with no concept of message boundaries, a common use case is to implement a framed application layer protocol running over TCP. To date, most TCP stacks offer byte stream API for applications, which places the burden of message delineation, message I/O operation atomicity, and load balancing in the application. With KCM an application can efficiently send and receive application protocol messages over TCP using a datagram interface. In order to delineate message in a TCP stream for receive in KCM, the kernel implements a message parser. For this we chose to employ BPF which is applied to the TCP stream. BPF code parses application layer messages and returns a message length. Nearly all binary application protocols are parsable in this manner, so KCM should be applicable across a wide range of applications. Other than message length determination in receive, KCM does not require any other application specific awareness. KCM does not implement any other application protocol semantics-- these are are provided in userspace or could be implemented in a kernel module layered above KCM. KCM implements an NxM multiplexor in the kernel as diagrammed below: +------------+ +------------+ +------------+ +------------+ \| KCM socket \| \| KCM socket \| \| KCM socket \| \| KCM socket \| +------------+ +------------+ +------------+ +------------+ \| \| \| \| +-----------+ \| \| +----------+ \| \| \| \| +----------------------------------+ \| Multiplexor \| +----------------------------------+ \| \| \| \| \| +---------+ \| \| \| ------------+ \| \| \| \| \| +----------+ +----------+ +----------+ +----------+ +----------+ \| Psock \| \| Psock \| \| Psock \| \| Psock \| \| Psock \| +----------+ +----------+ +----------+ +----------+ +----------+ \| \| \| \| \| +----------+ +----------+ +----------+ +----------+ +----------+ \| TCP sock \| \| TCP sock \| \| TCP sock \| \| TCP sock \| \| TCP sock \| +----------+ +----------+ +----------+ +----------+ +----------+ The KCM sockets provide the datagram interface to applications, Psocks are the state for each attached TCP connection (i.e. where message delineation is performed on receive). A description of the APIs and design can be found in the included Documentation/networking/kcm.txt. In this patch set: - Add MSG_BATCH flag. This is used in sendmsg msg_hdr flags to indicate that more messages will be sent on the socket. The stack may batch messages up if it is beneficial for transmission. - In sendmmsg, set MSG_BATCH in all sub messages except for the last one. - In order to allow sendmmsg to contain multiple messages with SOCK_SEQPAKET we allow each msg_hdr in the sendmmsg to set MSG_EOR. - Add KCM module - This supports SOCK_DGRAM and SOCK_SEQPACKET. - KCM documentation v2: - Added splice and page operations. - Assemble receive messages in place on TCP socket (don't have a separate assembly queue. - Based on above, enforce maxmimum receive message to be the size of the recceive socket buffer. - Support message assembly timeout. Use the timeout value in sk_rcvtimeo on the TCP socket. - Tested some with a couple of other production applications, see ~5% improvement in application latency. Testing: Dave Watson has integrated KCM into Thrift and we intend to put these changes into open source. Example of this is in: https://github.com/djwatson/fbthrift/commit/ dd7e0f9cf4e80912fdb90f6cd394db24e61a14cc Some initial KCM Thrift benchmark numbers (comment from Dave) Thrift by default ties a single connection to a single thread. KCM is instead able to load balance multiple connections across multiple epoll loops easily. A test sending ~5k bytes of data to a kcm thrift server, dropping the bytes on recv: QPS Latency / std dev Latency without KCM 70336 209/123 with KCM 70353 191/124 A test sending a small request, then doing work in the epoll thread, before serving more requests: QPS Latency / std dev Latency without KCM 14282 559/602 with KCM 23192 344/234 At the high end, there's definitely some additional kernel overhead: Cranking the pipelining way up, with lots of small requests QPS Latency / std dev Latency without KCM 1863429 127/119 with KCM 1337713 192/241 --- So for a "realistic" workload, KCM performs pretty well (second case). Under extreme conditions of highest tps we still have some work to do. In its nature a multiplexor will spread work between CPUs which is logically good for load balancing but coan conflict with the goal promoting affinity. Batching messages on both send and receive are the means to recoup performance. Future support: - Integration with TLS (TLS-in-kernel is a separate initiative). - Page operations/splice support - Unconnected KCM sockets. Will be able to attach sockets to different destinations, AF_KCM addresses with be used in sendmsg and recvmsg to indicate destination - Explore more utility in performing BPF inline with a TCP data stream (setting SO_MARK, rxhash for messages being sent received on KCM sockets). - Performance work - Diagnose performance issues under high message load FAQ (Questions posted on LWN) Q: Why do this in the kernel? A: Because the kernel is good at scheduling threads and steering packets to threads. KCM fits well into this model since it allows the unit of work for scheduling and steering to be the application layer messages themselves. KCM should be thought of as generic application protocol acceleration. It to the philosophy that the kernel provides generic and extensible interfaces. Q: How can adding code in the path yield better performance? A: It is true that for just sending receiving a single message there would be some performance loss since the code path is longer (for instance comparing netperf to KCM). But for real production applications performance takes on many dynamics. Parallelism, context switching, affinity, granularity of locking, and load balancing are all relevant. The theory of KCM is that by an application-centric interface, the kernel can provide better support for these performance characteristics. Q: Why not use an existing message-oriented protocol such as RUDP, DCCP, SCTP, RDS, and others? A: Because that would entail using a completely new transport protocol. Deploying a new protocol at scale is either a huge undertaking or fundamentally infeasible. This is true in either the Internet and in the data center due in a large part to protocol ossification. Besides, KCM we want KCM to work existing, well deployed application protocols that we couldn't change even if we wanted to (e.g. http/2). KCM simply defines a new interface method, it does not redefine any aspect of the transport protocol nor application protocol, nor set any new requirements on these. Neither does KCM attempt to implement any application protocol logic other than message deliniation in the stream. These are fundamental requirement of KCM. Q: How does this affect TCP? A: It doesn't, not in the slightest. The use of KCM can be one-sided, KCM has no effect on the wire. Q: Why force TCP into doing something it's not designed for? A: TCP is defined as transport protocol and there is no standard that says the API into TCP must be stream based sockets, or for that matter sockets at all (or even that TCP needs to be implemented in a kernel). KCM is not inconsistent with the design of TCP just because to makes an message based interface over TCP, if it were then every application protocol sending messages over TCP would also be! :-) Q: What about the problem of a connections with very slow rate of incoming data? As a result your application can get storms of very short reads. And it actually happens a lot with connection from mobile devices and it is a problem for servers handling a lot of connections. A: The storm of short reads will occur regardless of whether KCM is used or not. KCM does have one advantage in this scenario though, it will only wake up the application when a full message has been received, not for each packet that makes up part of a bigger messages. If a bunch of small messages are received, the application can receive messages in batches using recvmmsg. Q: Why not just use DPDK, or at least provide KCM like functionality in DPDK? A: DPDK, or more generally OS bypass presumably with a TCP stack in userland, presents a different model of load balancing than that of KCM (and the kernel). KCM implements load balancing of messages across the threads of an application, whereas DPDK load balances based on queues which are more static and coarse-grained since multiple connections are bound to queues. DPDK works best when processing of packets is silo'ed in a thread on the CPU processing a queue, and packet processing (for both the stack and application) is fairly uniform. KCM works well for applications where the amount of work to process messages varies an application work is commonly delegated to worker threads often on different CPUs. The message based interface over TCP is something that could be provide by a DPDK or OS bypass library. Q: I'm not quite seeing this for HTTP. Maybe for HTTP/2, I guess, or web sockets? A: Yes. KCM is most appropriate for message based protocols over TCP where is easy to deduce the message length (e.g. a length field) and the protocol implements its own message ordering semantics. Fortunately this encompasses many modern protocols. Q: How is memory limited and controlled? A: In v2 all data for messages is now kept in socket buffers, either those for TCP or KCM, so socket buffer limits are applicable. This includes receive messages assembly which is now done ont teh TCP socket buffer instead of a separate queue-- this has the consequence that the TCP socket buffer limit provides an enforceable maxmimum message size. Additionally, a timeout may be set for messages assembly. The value used for this is taken from sk_rcvtimeo of the TCP socket. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-09	kcm: Add description in Documentation	Tom Herbert
	Add kcm.txt to desribe KCM and interfaces. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-09	kcm: Add receive message timeout	Tom Herbert
	This patch adds receive timeout for message assembly on the attached TCP sockets. The timeout is set when a new messages is started and the whole message has not been received by TCP (not in the receive queue). If the completely message is subsequently received the timer is cancelled, if the timer expires the RX side is aborted. The timeout value is taken from the socket timeout (SO_RCVTIMEO) that is set on a TCP socket (i.e. set by get sockopt before attaching a TCP socket to KCM. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-09	kcm: Add memory limit for receive message construction	Tom Herbert
	Message assembly is performed on the TCP socket. This is logically equivalent of an application that performs a peek on the socket to find out how much memory is needed for a receive buffer. The receive socket buffer also provides the maximum message size which is checked. The receive algorithm is something like: 1) Receive the first skbuf for a message (or skbufs if multiple are needed to determine message length). 2) Check the message length against the number of bytes in the TCP receive queue (tcp_inq()). - If all the bytes of the message are in the queue (incluing the skbuf received), then proceed with message assembly (it should complete with the tcp_read_sock) - Else, mark the psock with the number of bytes needed to complete the message. 3) In TCP data ready function, if the psock indicates that we are waiting for the rest of the bytes of a messages, check the number of queued bytes against that. - If there are still not enough bytes for the message, just return - Else, clear the waiting bytes and proceed to receive the skbufs. The message should now be received in one tcp_read_sock Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-09	kcm: Sendpage support	Tom Herbert
	Implement kcm_sendpage. Set in sendpage to kcm_sendpage in both dgram and seqpacket ops. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-09	kcm: Splice support	Tom Herbert
	Implement kcm_splice_read. This is supported only for seqpacket. Add kcm_seqpacket_ops and set splice read to kcm_splice_read. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-09	kcm: Add statistics and proc interfaces	Tom Herbert
	This patch adds various counters for KCM. These include counters for messages and bytes received or sent, as well as counters for number of attached/unattached TCP sockets and other error or edge events. The statistics are exposed via a proc interface. /proc/net/kcm provides statistics per KCM socket and per psock (attached TCP sockets). /proc/net/kcm_stats provides aggregate statistics. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-09	kcm: Kernel Connection Multiplexor module	Tom Herbert
	This module implements the Kernel Connection Multiplexor. Kernel Connection Multiplexor (KCM) is a facility that provides a message based interface over TCP for generic application protocols. With KCM an application can efficiently send and receive application protocol messages over TCP using datagram sockets. For more information see the included Documentation/networking/kcm.txt Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-09	tcp: Add tcp_inq to get available receive bytes on socket	Tom Herbert
	Create a common kernel function to get the number of bytes available on a TCP socket. This is based on code in INQ getsockopt and we now call the function for that getsockopt. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-09	net: Walk fragments in __skb_splice_bits	Tom Herbert
	Add walking of fragments in __skb_splice_bits. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-09	net: Add MSG_BATCH flag	Tom Herbert
	Add a new msg flag called MSG_BATCH. This flag is used in sendmsg to indicate that more messages will follow (i.e. a batch of messages is being sent). This is similar to MSG_MORE except that the following messages are not merged into one packet, they are sent individually. sendmmsg is updated so that each contained message except for the last one is marked as MSG_BATCH. MSG_BATCH is a performance optimization in cases where a socket implementation can benefit by transmitting packets in a batch. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-09	net: Allow MSG_EOR in each msghdr of sendmmsg	Tom Herbert
	This patch allows setting MSG_EOR in each individual msghdr passed in sendmmsg. This allows a sendmmsg to send multiple messages when using SOCK_SEQPACKET. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-09	net: Make sock_alloc exportable	Tom Herbert
	Export it for cases where we want to create sockets by hand. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>