aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2019-12-27ethtool: provide link state with LINKSTATE_GET requestMichal Kubecek
Implement LINKSTATE_GET netlink request to get link state information. At the moment, only link up flag as provided by ETHTOOL_GLINK ioctl command is returned. LINKSTATE_GET request can be used with NLM_F_DUMP (without device identification) to request the information for all devices in current network namespace providing the data. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27ethtool: add LINKMODES_NTF notificationMichal Kubecek
Send ETHTOOL_MSG_LINKMODES_NTF notification message whenever device link settings or advertised modes are modified using ETHTOOL_MSG_LINKMODES_SET netlink message or ETHTOOL_SLINKSETTINGS or ETHTOOL_SSET ioctl commands. The notification message has the same format as reply to LINKMODES_GET request. ETHTOOL_MSG_LINKMODES_SET netlink request only triggers the notification if there is a change but the ioctl command handlers do not check if there is an actual change and trigger the notification whenever the commands are executed. As all work is done by ethnl_default_notify() handler and callback functions introduced to handle LINKMODES_GET requests, all that remains is adding entries for ETHTOOL_MSG_LINKMODES_NTF into ethnl_notify_handlers and ethnl_default_notify_ops lookup tables and calls to ethtool_notify() where needed. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27ethtool: set link modes related data with LINKMODES_SET requestMichal Kubecek
Implement LINKMODES_SET netlink request to set advertised linkmodes and related attributes as ETHTOOL_SLINKSETTINGS and ETHTOOL_SSET commands do. The request allows setting autonegotiation flag, speed, duplex and advertised link modes. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27ethtool: provide link mode information with LINKMODES_GET requestMichal Kubecek
Implement LINKMODES_GET netlink request to get link modes related information provided by ETHTOOL_GLINKSETTINGS and ETHTOOL_GSET ioctl commands. This request provides supported, advertised and peer advertised link modes, autonegotiation flag, speed and duplex. LINKMODES_GET request can be used with NLM_F_DUMP (without device identification) to request the information for all devices in current network namespace providing the data. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27ethtool: add LINKINFO_NTF notificationMichal Kubecek
Send ETHTOOL_MSG_LINKINFO_NTF notification message whenever device link settings are modified using ETHTOOL_MSG_LINKINFO_SET netlink message or ETHTOOL_SLINKSETTINGS or ETHTOOL_SSET ioctl commands. The notification message has the same format as reply to LINKINFO_GET request. ETHTOOL_MSG_LINKINFO_SET netlink request only triggers the notification if there is a change but the ioctl command handlers do not check if there is an actual change and trigger the notification whenever the commands are executed. As all work is done by ethnl_default_notify() handler and callback functions introduced to handle LINKINFO_GET requests, all that remains is adding entries for ETHTOOL_MSG_LINKINFO_NTF into ethnl_notify_handlers and ethnl_default_notify_ops lookup tables and calls to ethtool_notify() where needed. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27ethtool: add default notification handlerMichal Kubecek
The ethtool netlink notifications have the same format as related GET replies so that if generic GET handling framework is used to process GET requests, its callbacks and instance of struct get_request_ops can be also used to compose corresponding notification message. Provide function ethnl_std_notify() to be used as notification handler in ethnl_notify_handlers table. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27ethtool: set link settings with LINKINFO_SET requestMichal Kubecek
Implement LINKINFO_SET netlink request to set link settings queried by LINKINFO_GET message. Only physical port, phy MDIO address and MDI(-X) control can be set, attempt to modify MDI(-X) status and transceiver is rejected. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27ethtool: provide link settings with LINKINFO_GET requestMichal Kubecek
Implement LINKINFO_GET netlink request to get basic link settings provided by ETHTOOL_GLINKSETTINGS and ETHTOOL_GSET ioctl commands. This request provides settings not directly related to autonegotiation and link mode selection: physical port, phy MDIO address, MDI(-X) status, MDI(-X) control and transceiver. LINKINFO_GET request can be used with NLM_F_DUMP (without device identification) to request the information for all devices in current network namespace providing the data. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27ethtool: provide string sets with STRSET_GET requestMichal Kubecek
Requests a contents of one or more string sets, i.e. indexed arrays of strings; this information is provided by ETHTOOL_GSSET_INFO and ETHTOOL_GSTRINGS commands of ioctl interface. Unlike ioctl interface, all information can be retrieved with one request and mulitple string sets can be requested at once. There are three types of requests: - no NLM_F_DUMP, no device: get "global" stringsets - no NLM_F_DUMP, with device: get string sets related to the device - NLM_F_DUMP, no device: get device related string sets for all devices Client can request either all string sets of given type (global or device related) or only specific sets. With ETHTOOL_A_STRSET_COUNTS flag set, only set sizes (numbers of strings) are returned. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27ethtool: default handlers for GET requestsMichal Kubecek
Significant part of GET request processing is common for most request types but unfortunately it cannot be easily separated from type specific code as we need to alternate between common actions (parsing common request header, allocating message and filling netlink/genetlink headers etc.) and specific actions (querying the device, composing the reply). The processing also happens in three different situations: "do" request, "dump" request and notification, each doing things in slightly different way. The request specific code is implemented in four or five callbacks defined in an instance of struct get_request_ops: parse_request() - parse incoming message prepare_data() - retrieve data from driver or NIC reply_size() - estimate reply message size fill_reply() - compose reply message cleanup_data() - (optional) clean up additional data Other members of struct get_request_ops describe the data structure holding information from client request and data used to compose the message. The default handlers ethnl_default_doit(), ethnl_default_dumpit(), ethnl_default_start() and ethnl_default_done() can be then used in genl_ops handler. Notification handler will be introduced in a later patch. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27ethtool: support for netlink notificationsMichal Kubecek
Add infrastructure for ethtool netlink notifications. There is only one multicast group "monitor" which is used to notify userspace about changes and actions performed. Notification messages (types using suffix _NTF) share the format with replies to GET requests. Notifications are supposed to be broadcasted on every configuration change, whether it is done using the netlink interface or ioctl one. Netlink SET requests only trigger a notification if some data is actually changed. To trigger an ethtool notification, both ethtool netlink and external code use ethtool_notify() helper. This helper requires RTNL to be held and may sleep. Handlers sending messages for specific notification message types are registered in ethnl_notify_handlers array. As notifications can be triggered from other code, ethnl_ok flag is used to prevent an attempt to send notification before genetlink family is registered. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27ethtool: netlink bitset handlingMichal Kubecek
The ethtool netlink code uses common framework for passing arbitrary length bit sets to allow future extensions. A bitset can be a list (only one bitmap) or can consist of value and mask pair (used e.g. when client want to modify only some bits). A bitset can use one of two formats: verbose (bit by bit) or compact. Verbose format consists of bitset size (number of bits), list flag and an array of bit nests, telling which bits are part of the list or which bits are in the mask and which of them are to be set. In requests, bits can be identified by index (position) or by name. In replies, kernel provides both index and name. Verbose format is suitable for "one shot" applications like standard ethtool command as it avoids the need to either keep bit names (e.g. link modes) in sync with kernel or having to add an extra roundtrip for string set request (e.g. for private flags). Compact format uses one (list) or two (value/mask) arrays of 32-bit words to store the bitmap(s). It is more suitable for long running applications (ethtool in monitor mode or network management daemons) which can retrieve the names once and then pass only compact bitmaps to save space. Userspace requests can use either format; ETHTOOL_FLAG_COMPACT_BITSETS flag in request header tells kernel which format to use in reply. Notifications always use compact format. As some code uses arrays of unsigned long for internal representation and some arrays of u32 (or even a single u32), two sets of parse/compose helpers are introduced. To avoid code duplication, helpers for unsigned long arrays are implemented as wrappers around helpers for u32 arrays. There are two reasons for this choice: (1) u32 arrays are more frequent in ethtool code and (2) unsigned long array can be always interpreted as an u32 array on little endian 64-bit and all 32-bit architectures while we would need special handling for odd number of u32 words in the opposite direction. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27ethtool: helper functions for netlink interfaceMichal Kubecek
Add common request/reply header definition and helpers to parse request header and fill reply header. Provide ethnl_update_* helpers to update structure members from request attributes (to be used for *_SET requests). Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27ethtool: introduce ethtool netlink interfaceMichal Kubecek
Basic genetlink and init infrastructure for the netlink interface, register genetlink family "ethtool". Add CONFIG_ETHTOOL_NETLINK Kconfig option to make the build optional. Add initial overall interface description into Documentation/networking/ethtool-netlink.rst, further patches will add more detailed information. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27sctp: do trace_sctp_probe after SACK validation and checkKevin Kou
The function sctp_sf_eat_sack_6_2 now performs the Verification Tag validation, Chunk length validation, Bogu check, and also the detection of out-of-order SACK based on the RFC2960 Section 6.2 at the beginning, and finally performs the further processing of SACK. The trace_sctp_probe now triggered before the above necessary validation and check. this patch is to do the trace_sctp_probe after the chunk sanity tests, but keep doing trace if the SACK received is out of order, for the out-of-order SACK is valuable to congestion control debugging. v1->v2: - keep doing SCTP trace if the SACK is out of order as Marcelo's suggestion. v2->v3: - regenerate the patch as v2 generated on top of v1, and add 'net-next' tag to the new one as Marcelo's comments. Signed-off-by: Kevin Kou <qdkevin.kou@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27mv88e6xxx: Add serdes Rx statisticsNikita Yushchenko
If packet checker is enabled in the serdes, then Rx counter registers start working, and no side effects have been detected. This patch enables packet checker automatically when powering serdes on, and exposes Rx counter registers via ethtool statistics interface. Code partially basded by older attempt by Andrew Lunn. Signed-off-by: Nikita Yushchenko <nikita.yoush@cogentembedded.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27net: ena: remove set but not used variable 'rx_ring'YueHaibing
drivers/net/ethernet/amazon/ena/ena_netdev.c: In function ena_xdp_xmit_buff: drivers/net/ethernet/amazon/ena/ena_netdev.c:316:19: warning: variable rx_ring set but not used [-Wunused-but-set-variable] commit 548c4940b9f1 ("net: ena: Implement XDP_TX action") left behind this unused variable. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27net: dsa: qca: ar9331: drop pointless static qualifier in ar9331_sw_mbus_initMao Wenan
There is no need to set variable 'mbus' static since new value always be assigned before use it. Signed-off-by: Mao Wenan <maowenan@huawei.com> Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27ppp: Remove redundant BUG_ON() check in ppp_pernetXu Wang
Passing NULL to ppp_pernet causes a crash via BUG_ON. Dereferencing net in net_generic() also has the same effect. This patch removes the redundant BUG_ON check on the same parameter. Signed-off-by: Xu Wang <vulab@iscas.ac.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27Merge branch 'tcp_cubic-various-fixes'David S. Miller
Eric Dumazet says: ==================== tcp_cubic: various fixes This patch series converts tcp_cubic to usec clock resolution for Hystart logic. This makes Hystart more relevant for data-center flows. Prior to this series, Hystart was not kicking, or was kicking without good reason, since the 1ms clock was too coarse. Last patch also fixes an issue with Hystart vs TCP pacing. v2: removed a last-minute debug chunk from last patch ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27tcp_cubic: make Hystart aware of pacingEric Dumazet
For years we disabled Hystart ACK train detection at Google because it was fooled by TCP pacing. ACK train detection uses a simple heuristic, detecting if we receive ACK past half the RTT, to exit slow start before hitting the bottleneck and experience massive drops. But pacing by design might delay packets up to RTT/2, so we need to tweak the Hystart logic to be aware of this extra delay. Tested: Added a 100 usec delay at receiver. Before: nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart" 9117 7057 9553 8300 7030 6849 9533 10126 6876 8473 TcpExtTCPHystartTrainDetect 10 0.0 TcpExtTCPHystartTrainCwnd 1230 0.0 After : nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart" 9845 10103 10866 11096 11936 11487 11773 12188 11066 11894 TcpExtTCPHystartTrainDetect 10 0.0 TcpExtTCPHystartTrainCwnd 6462 0.0 Disabling Hystart ACK Train detection gives similar numbers echo 2 >/sys/module/tcp_cubic/parameters/hystart_detect nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart" 11173 10954 12455 10627 11578 11583 11222 10880 10665 11366 Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27tcp_cubic: tweak Hystart detection for short RTT flowsEric Dumazet
After switching ca->delay_min to usec resolution, we exit slow start prematurely for very low RTT flows, setting snd_ssthresh to 20. The reason is that delay_min is fed with RTT of small packet trains. Then as cwnd is increased, TCP sends bigger TSO packets. LRO/GRO aggregation and/or interrupt mitigation strategies on receiver tend to inflate RTT samples. Fix this by adding to delay_min the expected delay of two TSO packets, given current pacing rate. Tested: Sender uses pfifo_fast qdisc Before : $ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart" 11348 11707 11562 11428 11773 11534 9878 11693 10597 10968 TcpExtTCPHystartTrainDetect 10 0.0 TcpExtTCPHystartTrainCwnd 200 0.0 After : $ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart" 14877 14517 15797 18466 17376 14833 17558 17933 16039 18059 TcpExtTCPHystartTrainDetect 10 0.0 TcpExtTCPHystartTrainCwnd 1670 0.0 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27tcp_cubic: switch bictcp_clock() to usec resolutionEric Dumazet
Current 1ms clock feeds ca->round_start, ca->delay_min, ca->last_ack. This is quite problematic for data-center flows, where delay_min is way below 1 ms. This means Hystart Train detection triggers every time jiffies value is updated, since "((s32)(now - ca->round_start) > ca->delay_min >> 4)" expression becomes true. This kind of random behavior can be solved by reusing the existing usec timestamp that TCP keeps in tp->tcp_mstamp Note that a followup patch will tweak things a bit, because during slow start, GRO aggregation on receivers naturally increases the RTT as TSO packets gradually come to ~64KB size. To recap, right after this patch CUBIC Hystart train detection is more aggressive, since short RTT flows might exit slow start at cwnd = 20, instead of being possibly unbounded. Following patch will address this problem. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27tcp_cubic: remove one conditional from hystart_update()Eric Dumazet
If we initialize ca->curr_rtt to ~0U, we do not need to test for zero value in hystart_update() We only read ca->curr_rtt if at least HYSTART_MIN_SAMPLES have been processed, and thus ca->curr_rtt will have a sane value. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27tcp_cubic: optimize hystart_update()Eric Dumazet
We do not care which bit in ca->found is set. We avoid accessing hystart and hystart_detect unless really needed, possibly avoiding one cache line miss. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf-next 2019-12-27 The following pull-request contains BPF updates for your *net-next* tree. We've added 127 non-merge commits during the last 17 day(s) which contain a total of 110 files changed, 6901 insertions(+), 2721 deletions(-). There are three merge conflicts. Conflicts and resolution looks as follows: 1) Merge conflict in net/bpf/test_run.c: There was a tree-wide cleanup c593642c8be0 ("treewide: Use sizeof_field() macro") which gets in the way with b590cb5f802d ("bpf: Switch to offsetofend in BPF_PROG_TEST_RUN"): <<<<<<< HEAD if (!range_is_zero(__skb, offsetof(struct __sk_buff, priority) + sizeof_field(struct __sk_buff, priority), ======= if (!range_is_zero(__skb, offsetofend(struct __sk_buff, priority), >>>>>>> 7c8dce4b166113743adad131b5a24c4acc12f92c There are a few occasions that look similar to this. Always take the chunk with offsetofend(). Note that there is one where the fields differ in here: <<<<<<< HEAD if (!range_is_zero(__skb, offsetof(struct __sk_buff, tstamp) + sizeof_field(struct __sk_buff, tstamp), ======= if (!range_is_zero(__skb, offsetofend(struct __sk_buff, gso_segs), >>>>>>> 7c8dce4b166113743adad131b5a24c4acc12f92c Just take the one with offsetofend() /and/ gso_segs. Latter is correct due to 850a88cc4096 ("bpf: Expose __sk_buff wire_len/gso_segs to BPF_PROG_TEST_RUN"). 2) Merge conflict in arch/riscv/net/bpf_jit_comp.c: (I'm keeping Bjorn in Cc here for a double-check in case I got it wrong.) <<<<<<< HEAD if (is_13b_check(off, insn)) return -1; emit(rv_blt(tcc, RV_REG_ZERO, off >> 1), ctx); ======= emit_branch(BPF_JSLT, RV_REG_T1, RV_REG_ZERO, off, ctx); >>>>>>> 7c8dce4b166113743adad131b5a24c4acc12f92c Result should look like: emit_branch(BPF_JSLT, tcc, RV_REG_ZERO, off, ctx); 3) Merge conflict in arch/riscv/include/asm/pgtable.h: <<<<<<< HEAD ======= #define VMALLOC_SIZE (KERN_VIRT_SIZE >> 1) #define VMALLOC_END (PAGE_OFFSET - 1) #define VMALLOC_START (PAGE_OFFSET - VMALLOC_SIZE) #define BPF_JIT_REGION_SIZE (SZ_128M) #define BPF_JIT_REGION_START (PAGE_OFFSET - BPF_JIT_REGION_SIZE) #define BPF_JIT_REGION_END (VMALLOC_END) /* * Roughly size the vmemmap space to be large enough to fit enough * struct pages to map half the virtual address space. Then * position vmemmap directly below the VMALLOC region. */ #define VMEMMAP_SHIFT \ (CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT) #define VMEMMAP_SIZE BIT(VMEMMAP_SHIFT) #define VMEMMAP_END (VMALLOC_START - 1) #define VMEMMAP_START (VMALLOC_START - VMEMMAP_SIZE) #define vmemmap ((struct page *)VMEMMAP_START) >>>>>>> 7c8dce4b166113743adad131b5a24c4acc12f92c Only take the BPF_* defines from there and move them higher up in the same file. Remove the rest from the chunk. The VMALLOC_* etc defines got moved via 01f52e16b868 ("riscv: define vmemmap before pfn_to_page calls"). Result: [...] #define __S101 PAGE_READ_EXEC #define __S110 PAGE_SHARED_EXEC #define __S111 PAGE_SHARED_EXEC #define VMALLOC_SIZE (KERN_VIRT_SIZE >> 1) #define VMALLOC_END (PAGE_OFFSET - 1) #define VMALLOC_START (PAGE_OFFSET - VMALLOC_SIZE) #define BPF_JIT_REGION_SIZE (SZ_128M) #define BPF_JIT_REGION_START (PAGE_OFFSET - BPF_JIT_REGION_SIZE) #define BPF_JIT_REGION_END (VMALLOC_END) /* * Roughly size the vmemmap space to be large enough to fit enough * struct pages to map half the virtual address space. Then * position vmemmap directly below the VMALLOC region. */ #define VMEMMAP_SHIFT \ (CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT) #define VMEMMAP_SIZE BIT(VMEMMAP_SHIFT) #define VMEMMAP_END (VMALLOC_START - 1) #define VMEMMAP_START (VMALLOC_START - VMEMMAP_SIZE) [...] Let me know if there are any other issues. Anyway, the main changes are: 1) Extend bpftool to produce a struct (aka "skeleton") tailored and specific to a provided BPF object file. This provides an alternative, simplified API compared to standard libbpf interaction. Also, add libbpf extern variable resolution for .kconfig section to import Kconfig data, from Andrii Nakryiko. 2) Add BPF dispatcher for XDP which is a mechanism to avoid indirect calls by generating a branch funnel as discussed back in bpfconf'19 at LSF/MM. Also, add various BPF riscv JIT improvements, from Björn Töpel. 3) Extend bpftool to allow matching BPF programs and maps by name, from Paul Chaignon. 4) Support for replacing cgroup BPF programs attached with BPF_F_ALLOW_MULTI flag for allowing updates without service interruption, from Andrey Ignatov. 5) Cleanup and simplification of ring access functions for AF_XDP with a bonus of 0-5% performance improvement, from Magnus Karlsson. 6) Enable BPF JITs for x86-64 and arm64 by default. Also, final version of audit support for BPF, from Daniel Borkmann and latter with Jiri Olsa. 7) Move and extend test_select_reuseport into BPF program tests under BPF selftests, from Jakub Sitnicki. 8) Various BPF sample improvements for xdpsock for customizing parameters to set up and benchmark AF_XDP, from Jay Jayatheerthan. 9) Improve libbpf to provide a ulimit hint on permission denied errors. Also change XDP sample programs to attach in driver mode by default, from Toke Høiland-Jørgensen. 10) Extend BPF test infrastructure to allow changing skb mark from tc BPF programs, from Nikita V. Shirokov. 11) Optimize prologue code sequence in BPF arm32 JIT, from Russell King. 12) Fix xdp_redirect_cpu BPF sample to manually attach to tracepoints after libbpf conversion, from Jesper Dangaard Brouer. 13) Minor misc improvements from various others. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27bpftool: Make skeleton C code compilable with C++ compilerAndrii Nakryiko
When auto-generated BPF skeleton C code is included from C++ application, it triggers compilation error due to void * being implicitly casted to whatever target pointer type. This is supported by C, but not C++. To solve this problem, add explicit casts, where necessary. To ensure issues like this are captured going forward, add skeleton usage in test_cpp test. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20191226210253.3132060-1-andriin@fb.com
2019-12-26Merge branch 's390-qeth-next'David S. Miller
Julian Wiedmann says: ==================== s390/qeth: updates 2019-12-23 please apply the following patch series for qeth to your net-next tree. This reworks the RX code to use napi_gro_frags() when building non-linear skbs, along with some consolidation and cleanups. Happy holidays - and many thanks for all the effort & support over the past year, to both Jakub and you. It's much appreciated. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-26s390/qeth: remove QETH_RX_PULL_LENJulian Wiedmann
Since commit f677fcb9aeb6 ("s390/qeth: ensure linear access to packet headers"), the CQ-specific skbs are allocated with a slightly bigger linear part than necessary. Shrink it down to the maximum that's needed by qeth_extract_skb(). Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-26s390/qeth: use napi_gro_frags() for SG skbsJulian Wiedmann
For non-linear packets, get the skb for attaching the page fragments from napi_get_frags() so that it can be recycled during GRO. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-26s390/qeth: consolidate RX codeJulian Wiedmann
To reduce the path length and levels of indirection, move the RX processing from the sub-drivers into the core. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-26af_packet: refactoring code for prb_calc_retire_blk_tmoMao Wenan
If __ethtool_get_link_ksettings() is failed and with non-zero value, prb_calc_retire_blk_tmo() should return DEFAULT_PRB_RETIRE_TOV firstly. This patch is to refactory code and make it more readable. Signed-off-by: Mao Wenan <maowenan@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-26xen-netback: support dynamic unbind/bindPaul Durrant
By re-attaching RX, TX, and CTL rings during connect() rather than assuming they are freshly allocated (i.e. assuming the counters are zero), and avoiding forcing state to Closed in netback_remove() it is possible for vif instances to be unbound and re-bound from and to (respectively) a running guest. Dynamic unbind/bind is a highly useful feature for a backend module as it allows it to be unloaded and re-loaded (i.e. updated) without requiring domUs to be halted. This has been tested by running iperf as a server in the test VM and then running a client against it in a continuous loop, whilst also running: while true; do echo vif-$DOMID-$VIF >unbind; echo down; rmmod xen-netback; echo unloaded; modprobe xen-netback; cd $(pwd); brctl addif xenbr0 vif$DOMID.$VIF; ip link set vif$DOMID.$VIF up; echo up; sleep 5; done in dom0 from /sys/bus/xen-backend/drivers/vif to continuously unbind, unload, re-load, re-bind and re-plumb the backend. Clearly a performance drop was seen but no TCP connection resets were observed during this test and moreover a parallel SSH connection into the guest remained perfectly usable throughout. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: Wei Liu <wei.liu@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-26Merge branch 'RTL8211F-RGMII-RX-TX-delay-configuration-improvements'David S. Miller
Martin Blumenstingl says: ==================== RTL8211F: RGMII RX/TX delay configuration improvements In discussion with Andrew [0] we figured out that it would be best to make the RX delay of the RTL8211F PHY configurable (just like the TX delay is already configurable). While here I took the opportunity to add some logging to the TX delay configuration as well. There is no public documentation for the RX and TX delay registers. I received this information a while ago (and created this RfC patch back then: [1]). Realtek gave me permission to take the information from the datasheet extracts and phase them in my own words and publish that (I am not allowed to publish the datasheet extracts). I have tested these patches on two boards: - Amlogic Meson8b Odroid-C1 - Amlogic GXM Khadas VIM2 Both still behave as before these changes (iperf3 speeds are the same in both directions: RX and TX), which is expected because they are currently using phy-mode = "rgmii" with the RX delay not being generated by the PHY. [0] https://patchwork.ozlabs.org/patch/1215313/ [1] https://patchwork.ozlabs.org/patch/843946/ ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-26net: phy: realtek: add support for configuring the RX delay on RTL8211FMartin Blumenstingl
On RTL8211F the RX and TX delays (2ns) can be configured in two ways: - pin strapping (RXD1 for the TX delay and RXD0 for the RX delay, LOW means "off" and HIGH means "on") which is read during PHY reset - using software to configure the TX and RX delay registers So far only the configuration using pin strapping has been supported. Add support for enabling or disabling the RGMII RX delay based on the phy-mode to be able to get the RX delay into a known state. This is important because the RX delay has to be coordinated between the PHY, MAC and the PCB design (trace length). With an invalid RX delay applied (for example if both PHY and MAC add a 2ns RX delay) Ethernet may not work at all. Also add debug logging when configuring the RX delay (just like the TX delay) because this is a common source of problems. Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-26net: phy: realtek: add logging for the RGMII TX delay configurationMartin Blumenstingl
RGMII requires a delay of 2ns between the data and the clock signal. There are at least three ways this can happen. One possibility is by having the PHY generate this delay. This is a common source for problems (for example with slow TX speeds or packet loss when sending data). The TX delay configuration of the RTL8211F PHY can be set either by pin-strappping the RXD1 pin (HIGH means enabled, LOW means disabled) or through configuring a paged register. The setting from the RXD1 pin is also reflected in the register. Add debug logging to the TX delay configuration on RTL8211F so it's easier to spot these issues (for example if the TX delay is enabled for both, the RTL8211F PHY and the MAC). This is especially helpful because there is no public datasheet for the RTL8211F PHY available with all the RX/TX delay specifics. Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-26Merge branch 'mlxsw-spectrum_router-Cleanups'David S. Miller
Ido Schimmel says: ==================== mlxsw: spectrum_router: Cleanups This patch set removes from mlxsw code that is no longer necessary after the simplification of the IPv4 and IPv6 route offload API. The patches eliminate unnecessary code by taking advantage of the fact that mlxsw no longer needs to maintain a list of identical routes, following recent changes in route offload API. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-26mlxsw: spectrum_router: Remove FIB entry list from FIB nodeIdo Schimmel
As explained in previous patches, the driver no longer needs to maintain a list of identical FIB entries (i.e, same {tb_id, prefix, prefix length}) and therefore each FIB node can only store one FIB entry. Remove the FIB entry list and simplify the code. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-26mlxsw: spectrum_router: Consolidate identical functionsIdo Schimmel
After the last patch mlxsw_sp_fib{4,6}_node_entry_link() and mlxsw_sp_fib{4,6}_node_entry_unlink() are identical and can therefore be consolidated into the same common function. Perform the consolidation. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-26mlxsw: spectrum_router: Make route creation and destruction symmetricIdo Schimmel
Host routes that perform decapsulation of IP in IP tunnels have a special adjacency entry linked to them. This entry stores information such as the expected underlay source IP. When the route is deleted this entry needs to be freed. The allocation of the adjacency entry happens in mlxsw_sp_fib4_entry_type_set(), but it is freed in mlxsw_sp_fib4_node_entry_unlink(). Create a new function - mlxsw_sp_fib4_entry_type_unset() - and free the adjacency entry there. This will allow us to consolidate mlxsw_sp_fib{4,6}_node_entry_unlink() in the next patch. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-26mlxsw: spectrum_router: Eliminate dead codeIdo Schimmel
Since the driver no longer maintains a list of identical routes there is no route to promote when a route is deleted. Remove that code that took care of it. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-26mlxsw: spectrum_router: Remove unnecessary checksIdo Schimmel
Now that the networking stack takes care of only notifying the routes of interest, we do not need to maintain a list of identical routes. Remove the check that tests if the route is the first route in the FIB node. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-26bonding: rename AD_STATE_* to LACP_STATE_*Andy Roulin
As the LACP actor/partner state is now part of the uapi, rename the 3ad state defines with LACP prefix. The LACP prefix is preferred over BOND_3AD as the LACP standard moved to 802.1AX. Fixes: 826f66b30c2e3 ("bonding: move 802.3ad port state flags to uapi") Signed-off-by: Andy Roulin <aroulin@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-26sctp: move trace_sctp_probe_path into sctp_outq_sackKevin Kou
The original patch bringed in the "SCTP ACK tracking trace event" feature was committed at Dec.20, 2017, it replaced jprobe usage with trace events, and bringed in two trace events, one is TRACE_EVENT(sctp_probe), another one is TRACE_EVENT(sctp_probe_path). The original patch intended to trigger the trace_sctp_probe_path in TRACE_EVENT(sctp_probe) as below code, +TRACE_EVENT(sctp_probe, + + TP_PROTO(const struct sctp_endpoint *ep, + const struct sctp_association *asoc, + struct sctp_chunk *chunk), + + TP_ARGS(ep, asoc, chunk), + + TP_STRUCT__entry( + __field(__u64, asoc) + __field(__u32, mark) + __field(__u16, bind_port) + __field(__u16, peer_port) + __field(__u32, pathmtu) + __field(__u32, rwnd) + __field(__u16, unack_data) + ), + + TP_fast_assign( + struct sk_buff *skb = chunk->skb; + + __entry->asoc = (unsigned long)asoc; + __entry->mark = skb->mark; + __entry->bind_port = ep->base.bind_addr.port; + __entry->peer_port = asoc->peer.port; + __entry->pathmtu = asoc->pathmtu; + __entry->rwnd = asoc->peer.rwnd; + __entry->unack_data = asoc->unack_data; + + if (trace_sctp_probe_path_enabled()) { + struct sctp_transport *sp; + + list_for_each_entry(sp, &asoc->peer.transport_addr_list, + transports) { + trace_sctp_probe_path(sp, asoc); + } + } + ), But I found it did not work when I did testing, and trace_sctp_probe_path had no output, I finally found that there is trace buffer lock operation(trace_event_buffer_reserve) in include/trace/trace_events.h: static notrace void \ trace_event_raw_event_##call(void *__data, proto) \ { \ struct trace_event_file *trace_file = __data; \ struct trace_event_data_offsets_##call __maybe_unused __data_offsets;\ struct trace_event_buffer fbuffer; \ struct trace_event_raw_##call *entry; \ int __data_size; \ \ if (trace_trigger_soft_disabled(trace_file)) \ return; \ \ __data_size = trace_event_get_offsets_##call(&__data_offsets, args); \ \ entry = trace_event_buffer_reserve(&fbuffer, trace_file, \ sizeof(*entry) + __data_size); \ \ if (!entry) \ return; \ \ tstruct \ \ { assign; } \ \ trace_event_buffer_commit(&fbuffer); \ } The reason caused no output of trace_sctp_probe_path is that trace_sctp_probe_path written in TP_fast_assign part of TRACE_EVENT(sctp_probe), and it will be placed( { assign; } ) after the trace_event_buffer_reserve() when compiler expands Macro, entry = trace_event_buffer_reserve(&fbuffer, trace_file, \ sizeof(*entry) + __data_size); \ \ if (!entry) \ return; \ \ tstruct \ \ { assign; } \ so trace_sctp_probe_path finally can not acquire trace_event_buffer and return no output, that is to say the nest of tracepoint entry function is not allowed. The function call flow is: trace_sctp_probe() -> trace_event_raw_event_sctp_probe() -> lock buffer -> trace_sctp_probe_path() -> trace_event_raw_event_sctp_probe_path() --nested -> buffer has been locked and return no output. This patch is to remove trace_sctp_probe_path from the TP_fast_assign part of TRACE_EVENT(sctp_probe) to avoid the nest of entry function, and trigger sctp_probe_path_trace in sctp_outq_sack. After this patch, you can enable both events individually, # cd /sys/kernel/debug/tracing # echo 1 > events/sctp/sctp_probe/enable # echo 1 > events/sctp/sctp_probe_path/enable Or, you can enable all the events under sctp. # echo 1 > events/sctp/enable Signed-off-by: Kevin Kou <qdkevin.kou@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-26bpf: Print error message for bpftool cgroup showHechao Li
Currently, when bpftool cgroup show <path> has an error, no error message is printed. This is confusing because the user may think the result is empty. Before the change: $ bpftool cgroup show /sys/fs/cgroup ID AttachType AttachFlags Name $ echo $? 255 After the change: $ ./bpftool cgroup show /sys/fs/cgroup Error: can't query bpf programs attached to /sys/fs/cgroup: Operation not permitted v2: Rename check_query_cgroup_progs to cgroup_has_attached_progs Signed-off-by: Hechao Li <hechaol@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20191224011742.3714301-1-hechaol@fb.com
2019-12-26libbpf: Support CO-RE relocations for LDX/ST/STX instructionsAndrii Nakryiko
Clang patch [0] enables emitting relocatable generic ALU/ALU64 instructions (i.e, shifts and arithmetic operations), as well as generic load/store instructions. The former ones are already supported by libbpf as is. This patch adds further support for load/store instructions. Relocatable field offset is encoded in BPF instruction's 16-bit offset section and are adjusted by libbpf based on target kernel BTF. These Clang changes and corresponding libbpf changes allow for more succinct generated BPF code by encoding relocatable field reads as a single ST/LDX/STX instruction. It also enables relocatable access to BPF context. Previously, if context struct (e.g., __sk_buff) was accessed with CO-RE relocations (e.g., due to preserve_access_index attribute), it would be rejected by BPF verifier due to modified context pointer dereference. With Clang patch, such context accesses are both relocatable and have a fixed offset from the point of view of BPF verifier. [0] https://reviews.llvm.org/D71790 Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20191223180305.86417-1-andriin@fb.com
2019-12-25Merge branch 'Peer-to-Peer-One-Step-time-stamping'David S. Miller
Richard Cochran says: ==================== Peer to Peer One-Step time stamping This series adds support for PTP (IEEE 1588) P2P one-step time stamping along with a driver for a hardware device that supports this. If the hardware supports p2p one-step, it subtracts the ingress time stamp value from the Pdelay_Request correction field. The user space software stack then simply copies the correction field into the Pdelay_Response, and on transmission the hardware adds the egress time stamp into the correction field. This new functionality extends CONFIG_NETWORK_PHY_TIMESTAMPING to cover MII snooping devices, but it still depends on phylib, just as that option does. Expanding beyond phylib is not within the scope of the this series. User space support is available in the current linuxptp master branch. - Patch 1 adds phy_device methods for existing time stamping fields. - Patches 2-5 convert the stack and drivers to the new methods. - Patch 6 moves code around the dp83640 driver. - Patches 7-10 add support for MII time stamping in non-PHY devices. - Patch 11 adds the new P2P 1-step option. - Patch 12 adds a driver implementing the new option. Thanks, Richard Changed in v9: ~~~~~~~~~~~~~~ - Fix two more drivers' switch/case blocks WRT the new HWTSTAMP ioctl. - Picked up two more review tags from Andrew. Changed in v8: ~~~~~~~~~~~~~~ - Avoided adding forward functional declarations in the dp83640 driver. - Picked up Florian's new review tags and another one from Andrew. Changed in v7: ~~~~~~~~~~~~~~ - Converted pr_debug|err to dev_ variants in new driver. - Fixed device tree documentation per Rob's v6 review. - Picked up Andrew's and Rob's review tags. - Silenced sparse warnings in new driver. Changed in v6: ~~~~~~~~~~~~~~ - Added methods for accessing the phy_device time stamping fields. - Adjust the device tree documentation per Rob's v5 review. - Fixed the build failures due to missing exports. Changed in v5: ~~~~~~~~~~~~~~ - Fixed build failure in macvlan. - Fixed latent bug with its gcc warning in the driver. Changed in v4: ~~~~~~~~~~~~~~ - Correct error paths and PTR_ERR return values in the framework. - Expanded KernelDoc comments WRT PHY locking. - Pick up Andrew's review tag. Changed in v3: ~~~~~~~~~~~~~~ - Simplify the device tree binding and document the time stamping phandle by itself. Changed in v2: ~~~~~~~~~~~~~~ - Per the v1 review, changed the modeling of MII time stamping devices. They are no longer a kind of mdio device. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-25ptp: Add a driver for InES time stamping IP core.Richard Cochran
The InES at the ZHAW offers a PTP time stamping IP core. The FPGA logic recognizes and time stamps PTP frames on the MII bus. This patch adds a driver for the core along with a device tree binding to allow hooking the driver to MII buses. Signed-off-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-25net: Introduce peer to peer one step PTP time stamping.Richard Cochran
The 1588 standard defines one step operation for both Sync and PDelay_Resp messages. Up until now, hardware with P2P one step has been rare, and kernel support was lacking. This patch adds support of the mode in anticipation of new hardware developments. Signed-off-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-25net: mdio: of: Register discovered MII time stampers.Richard Cochran
When parsing a PHY node, register its time stamper, if any, and attach the instance to the PHY device. Signed-off-by: Richard Cochran <richardcochran@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Rob Herring <robh@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>