From 9a69fb9c21c4bf4107becb877729544759bdd059 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:30 +0200 Subject: docs: networking: convert decnet.txt to ReST - add SPDX header; - adjust titles and chapters, adding proper markups; - mark lists as such; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- MAINTAINERS | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'MAINTAINERS') diff --git a/MAINTAINERS b/MAINTAINERS index 453fe0713e68..7323bfc1720f 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -4728,7 +4728,7 @@ DECnet NETWORK LAYER L: linux-decnet-user@lists.sourceforge.net S: Orphan W: http://linux-decnet.sourceforge.net -F: Documentation/networking/decnet.txt +F: Documentation/networking/decnet.rst F: net/decnet/ DECSTATION PLATFORM SUPPORT -- cgit v1.2.3 From cb3f0d56e153398a035eb22769d2cb2837f29747 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:36 +0200 Subject: docs: networking: convert filter.txt to ReST - add SPDX header; - adjust title markup; - mark code blocks and literals as such; - use footnote markup; - mark tables as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. 
Miller --- Documentation/bpf/index.rst | 4 +- Documentation/networking/filter.rst | 1651 ++++++++++++++++++++++++++++++ Documentation/networking/filter.txt | 1545 ---------------------------- Documentation/networking/index.rst | 1 + Documentation/networking/packet_mmap.txt | 2 +- MAINTAINERS | 2 +- tools/bpf/bpf_asm.c | 2 +- tools/bpf/bpf_dbg.c | 2 +- 8 files changed, 1658 insertions(+), 1551 deletions(-) create mode 100644 Documentation/networking/filter.rst delete mode 100644 Documentation/networking/filter.txt (limited to 'MAINTAINERS') diff --git a/Documentation/bpf/index.rst b/Documentation/bpf/index.rst index f99677f3572f..38b4db8be7a2 100644 --- a/Documentation/bpf/index.rst +++ b/Documentation/bpf/index.rst @@ -7,7 +7,7 @@ Filter) facility, with a focus on the extended BPF version (eBPF). This kernel side documentation is still work in progress. The main textual documentation is (for historical reasons) described in -`Documentation/networking/filter.txt`_, which describe both classical +`Documentation/networking/filter.rst`_, which describe both classical and extended BPF instruction-set. The Cilium project also maintains a `BPF and XDP Reference Guide`_ that goes into great technical depth about the BPF Architecture. @@ -59,7 +59,7 @@ Testing and debugging BPF .. Links: -.. _Documentation/networking/filter.txt: ../networking/filter.txt +.. _Documentation/networking/filter.rst: ../networking/filter.txt .. _man-pages: https://www.kernel.org/doc/man-pages/ .. _bpf(2): http://man7.org/linux/man-pages/man2/bpf.2.html .. _BPF and XDP Reference Guide: http://cilium.readthedocs.io/en/latest/bpf/ diff --git a/Documentation/networking/filter.rst b/Documentation/networking/filter.rst new file mode 100644 index 000000000000..a1d3e192b9fa --- /dev/null +++ b/Documentation/networking/filter.rst @@ -0,0 +1,1651 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + +======================================================= +Linux Socket Filtering aka Berkeley Packet Filter (BPF) +======================================================= + +Introduction +------------ + +Linux Socket Filtering (LSF) is derived from the Berkeley Packet Filter. +Though there are some distinct differences between the BSD and Linux +Kernel filtering, when we speak of BPF or LSF in the Linux context, we +mean the very same mechanism of filtering in the Linux kernel. + +BPF allows a user-space program to attach a filter onto any socket and +allow or disallow certain types of data to come through the socket. LSF +follows exactly the same filter code structure as BSD's BPF, so referring +to the BSD bpf.4 manpage is very helpful in creating filters. + +On Linux, BPF is much simpler than on BSD. One does not have to worry +about devices or anything like that. You simply create your filter code, +send it to the kernel via the SO_ATTACH_FILTER option and if your filter +code passes the kernel check on it, you then immediately begin filtering +data on that socket. + +You can also detach filters from your socket via the SO_DETACH_FILTER +option. This will probably not be used much since when you close a socket +that has a filter on it the filter is automagically removed. The other +less common case may be adding a different filter on the same socket where +you had another filter that is still running: the kernel takes care of +removing the old one and placing your new one in its place, assuming your +filter has passed the checks, otherwise if it fails the old filter will +remain on that socket. + +The SO_LOCK_FILTER option allows locking the filter attached to a socket. Once +set, a filter cannot be removed or changed. This allows one process to +set up a socket, attach a filter, lock it, then drop privileges and be +assured that the filter will be kept until the socket is closed. 
+ +The biggest user of this construct might be libpcap. Issuing a high-level +filter command like `tcpdump -i em1 port 22` passes through the libpcap +internal compiler that generates a structure that can eventually be loaded +via SO_ATTACH_FILTER to the kernel. `tcpdump -i em1 port 22 -ddd` +displays what is being placed into this structure. + +Although we were only speaking about sockets here, BPF in Linux is used +in many more places. There's xt_bpf for netfilter, cls_bpf in the kernel +qdisc layer, SECCOMP-BPF (SECure COMPuting [1]_), and lots of other places +such as team driver, PTP code, etc. where BPF is being used. + +.. [1] Documentation/userspace-api/seccomp_filter.rst + +Original BPF paper: + +Steven McCanne and Van Jacobson. 1993. The BSD packet filter: a new +architecture for user-level packet capture. In Proceedings of the +USENIX Winter 1993 Conference Proceedings on USENIX Winter 1993 +Conference Proceedings (USENIX'93). USENIX Association, Berkeley, +CA, USA, 2-2. [http://www.tcpdump.org/papers/bpf-usenix93.pdf] + +Structure +--------- + +User space applications include <linux/filter.h> which contains the +following relevant structures:: + + struct sock_filter { /* Filter block */ + __u16 code; /* Actual filter code */ + __u8 jt; /* Jump true */ + __u8 jf; /* Jump false */ + __u32 k; /* Generic multiuse field */ + }; + +Such a structure is assembled as an array of 4-tuples, each containing +a code, jt, jf and k value. jt and jf are jump offsets and k a generic +value to be used for a provided code:: + + struct sock_fprog { /* Required for SO_ATTACH_FILTER. */ + unsigned short len; /* Number of filter blocks */ + struct sock_filter __user *filter; + }; + +For socket filtering, a pointer to this structure (as shown in the +follow-up example) is passed to the kernel through setsockopt(2). + +Example +------- + +:: + + #include <sys/socket.h> + #include <sys/types.h> + #include <arpa/inet.h> + #include <linux/if_ether.h> + /* ... 
*/ + + /* From the example above: tcpdump -i em1 port 22 -dd */ + struct sock_filter code[] = { + { 0x28, 0, 0, 0x0000000c }, + { 0x15, 0, 8, 0x000086dd }, + { 0x30, 0, 0, 0x00000014 }, + { 0x15, 2, 0, 0x00000084 }, + { 0x15, 1, 0, 0x00000006 }, + { 0x15, 0, 17, 0x00000011 }, + { 0x28, 0, 0, 0x00000036 }, + { 0x15, 14, 0, 0x00000016 }, + { 0x28, 0, 0, 0x00000038 }, + { 0x15, 12, 13, 0x00000016 }, + { 0x15, 0, 12, 0x00000800 }, + { 0x30, 0, 0, 0x00000017 }, + { 0x15, 2, 0, 0x00000084 }, + { 0x15, 1, 0, 0x00000006 }, + { 0x15, 0, 8, 0x00000011 }, + { 0x28, 0, 0, 0x00000014 }, + { 0x45, 6, 0, 0x00001fff }, + { 0xb1, 0, 0, 0x0000000e }, + { 0x48, 0, 0, 0x0000000e }, + { 0x15, 2, 0, 0x00000016 }, + { 0x48, 0, 0, 0x00000010 }, + { 0x15, 0, 1, 0x00000016 }, + { 0x06, 0, 0, 0x0000ffff }, + { 0x06, 0, 0, 0x00000000 }, + }; + + struct sock_fprog bpf = { + .len = ARRAY_SIZE(code), + .filter = code, + }; + + sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL)); + if (sock < 0) + /* ... bail out ... */ + + ret = setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf)); + if (ret < 0) + /* ... bail out ... */ + + /* ... */ + close(sock); + +The above example code attaches a socket filter for a PF_PACKET socket +in order to let all IPv4/IPv6 packets with port 22 pass. The rest will +be dropped for this socket. + +The setsockopt(2) call to SO_DETACH_FILTER doesn't need any arguments +and SO_LOCK_FILTER for preventing the filter to be detached, takes an +integer value with 0 or 1. + +Note that socket filters are not restricted to PF_PACKET sockets only, +but can also be used on other socket families. 
+ +Summary of system calls: + + * setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_FILTER, &val, sizeof(val)); + * setsockopt(sockfd, SOL_SOCKET, SO_DETACH_FILTER, &val, sizeof(val)); + * setsockopt(sockfd, SOL_SOCKET, SO_LOCK_FILTER, &val, sizeof(val)); + +Normally, most use cases for socket filtering on packet sockets will be +covered by libpcap in high-level syntax, so as an application developer +you should stick to that. libpcap wraps its own layer around all that. + +Writing a filter "by hand" can be an alternative when i) using/linking to +libpcap is not an option, ii) the required BPF filters use Linux extensions +that are not supported by libpcap's compiler, iii) a filter might be more +complex and not cleanly implementable with libpcap's compiler, or iv) +particular filter codes should be optimized differently than libpcap's +internal compiler does. For example, +xt_bpf and cls_bpf users might have requirements that could result in +more complex filter code, or one that cannot be expressed with libpcap +(e.g. different return codes for various code paths). Moreover, BPF JIT +implementors may wish to manually write test cases and thus need low-level +access to BPF code as well. + +BPF engine and instruction set +------------------------------ + +Under tools/bpf/ there's a small helper tool called bpf_asm which can +be used to write low-level filters for example scenarios mentioned in the +previous section. Asm-like syntax mentioned here has been implemented in +bpf_asm and will be used for further explanations (instead of dealing with +less readable opcodes directly, principles are the same). The syntax is +closely modelled after Steven McCanne's and Van Jacobson's BPF paper. 
+ +The BPF architecture consists of the following basic elements: + + ======= ==================================================== + Element Description + ======= ==================================================== + A 32 bit wide accumulator + X 32 bit wide X register + M[] 16 x 32 bit wide misc registers aka "scratch memory + store", addressable from 0 to 15 + ======= ==================================================== + +A program, that is translated by bpf_asm into "opcodes" is an array that +consists of the following elements (as already mentioned):: + + op:16, jt:8, jf:8, k:32 + +The element op is a 16 bit wide opcode that has a particular instruction +encoded. jt and jf are two 8 bit wide jump targets, one for condition +"jump if true", the other one "jump if false". Eventually, element k +contains a miscellaneous argument that can be interpreted in different +ways depending on the given instruction in op. + +The instruction set consists of load, store, branch, alu, miscellaneous +and return instructions that are also represented in bpf_asm syntax. This +table lists all bpf_asm instructions available resp. 
what their underlying +opcodes as defined in linux/filter.h stand for: + + =========== =================== ===================== + Instruction Addressing mode Description + =========== =================== ===================== + ld 1, 2, 3, 4, 12 Load word into A + ldi 4 Load word into A + ldh 1, 2 Load half-word into A + ldb 1, 2 Load byte into A + ldx 3, 4, 5, 12 Load word into X + ldxi 4 Load word into X + ldxb 5 Load byte into X + + st 3 Store A into M[] + stx 3 Store X into M[] + + jmp 6 Jump to label + ja 6 Jump to label + jeq 7, 8, 9, 10 Jump on A == <x> + jneq 9, 10 Jump on A != <x> + jne 9, 10 Jump on A != <x> + jlt 9, 10 Jump on A < <x> + jle 9, 10 Jump on A <= <x> + jgt 7, 8, 9, 10 Jump on A > <x> + jge 7, 8, 9, 10 Jump on A >= <x> + jset 7, 8, 9, 10 Jump on A & <x> + + add 0, 4 A + <x> + sub 0, 4 A - <x> + mul 0, 4 A * <x> + div 0, 4 A / <x> + mod 0, 4 A % <x> + neg !A + and 0, 4 A & <x> + or 0, 4 A | <x> + xor 0, 4 A ^ <x> + lsh 0, 4 A << <x> + rsh 0, 4 A >> <x> + + tax Copy A into X + txa Copy X into A + + ret 4, 11 Return + =========== =================== ===================== + +The next table shows addressing formats from the 2nd column: + + =============== =================== =============================================== + Addressing mode Syntax Description + =============== =================== =============================================== + 0 x/%x Register X + 1 [k] BHW at byte offset k in the packet + 2 [x + k] BHW at the offset X + k in the packet + 3 M[k] Word at offset k in M[] + 4 #k Literal value stored in k + 5 4*([k]&0xf) Lower nibble * 4 at byte offset k in the packet + 6 L Jump label L + 7 #k,Lt,Lf Jump to Lt if true, otherwise jump to Lf + 8 x/%x,Lt,Lf Jump to Lt if true, otherwise jump to Lf + 9 #k,Lt Jump to Lt if predicate is true + 10 x/%x,Lt Jump to Lt if predicate is true + 11 a/%a Accumulator A + 12 extension BPF extension + =============== =================== =============================================== + +The Linux kernel also has a couple of BPF extensions that are used along +with the 
class of load instructions by "overloading" the k argument with +a negative offset + a particular extension offset. The result of such BPF +extensions are loaded into A. + +Possible BPF extensions are shown in the following table: + + =================================== ================================================= + Extension Description + =================================== ================================================= + len skb->len + proto skb->protocol + type skb->pkt_type + poff Payload start offset + ifidx skb->dev->ifindex + nla Netlink attribute of type X with offset A + nlan Nested Netlink attribute of type X with offset A + mark skb->mark + queue skb->queue_mapping + hatype skb->dev->type + rxhash skb->hash + cpu raw_smp_processor_id() + vlan_tci skb_vlan_tag_get(skb) + vlan_avail skb_vlan_tag_present(skb) + vlan_tpid skb->vlan_proto + rand prandom_u32() + =================================== ================================================= + +These extensions can also be prefixed with '#'. 
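To make the "overloading" of k concrete, here is a minimal sketch of how an extension load such as ``ld vlan_tci`` could be encoded by hand. It assumes the SKF_AD_OFF and SKF_AD_* constants exported by <linux/filter.h> and that ``vlan_tci`` corresponds to SKF_AD_VLAN_TAG; the helper name ``ld_extension`` is hypothetical and not part of any kernel API.

```c
#include <stdint.h>
#include <linux/filter.h>

/*
 * Build a classic BPF extension load: an absolute word load into A whose
 * k field is the negative base SKF_AD_OFF plus the extension's own offset.
 */
static struct sock_filter ld_extension(int skf_ad)
{
	struct sock_filter insn = BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
					   (uint32_t)(SKF_AD_OFF + skf_ad));

	return insn;
}
```

For instance, ``ld_extension(SKF_AD_VLAN_TAG)`` yields an instruction with a k value far outside any real packet offset, which is how the kernel can tell an extension load apart from an ordinary packet load.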
+Examples for low-level BPF: + +**ARP packets**:: + + ldh [12] + jne #0x806, drop + ret #-1 + drop: ret #0 + +**IPv4 TCP packets**:: + + ldh [12] + jne #0x800, drop + ldb [23] + jneq #6, drop + ret #-1 + drop: ret #0 + +**(Accelerated) VLAN w/ id 10**:: + + ld vlan_tci + jneq #10, drop + ret #-1 + drop: ret #0 + +**icmp random packet sampling, 1 in 4**:: + + ldh [12] + jne #0x800, drop + ldb [23] + jneq #1, drop + # get a random uint32 number + ld rand + mod #4 + jneq #1, drop + ret #-1 + drop: ret #0 + +**SECCOMP filter example**:: + + ld [4] /* offsetof(struct seccomp_data, arch) */ + jne #0xc000003e, bad /* AUDIT_ARCH_X86_64 */ + ld [0] /* offsetof(struct seccomp_data, nr) */ + jeq #15, good /* __NR_rt_sigreturn */ + jeq #231, good /* __NR_exit_group */ + jeq #60, good /* __NR_exit */ + jeq #0, good /* __NR_read */ + jeq #1, good /* __NR_write */ + jeq #5, good /* __NR_fstat */ + jeq #9, good /* __NR_mmap */ + jeq #14, good /* __NR_rt_sigprocmask */ + jeq #13, good /* __NR_rt_sigaction */ + jeq #35, good /* __NR_nanosleep */ + bad: ret #0 /* SECCOMP_RET_KILL_THREAD */ + good: ret #0x7fff0000 /* SECCOMP_RET_ALLOW */ + +The above example code can be placed into a file (here called "foo"), and +then be passed to the bpf_asm tool for generating opcodes, output that xt_bpf +and cls_bpf understand and that can be loaded directly. Example with above +ARP code:: + + $ ./bpf_asm foo + 4,40 0 0 12,21 0 1 2054,6 0 0 4294967295,6 0 0 0, + +In copy and paste C-like output:: + + $ ./bpf_asm -c foo + { 0x28, 0, 0, 0x0000000c }, + { 0x15, 0, 1, 0x00000806 }, + { 0x06, 0, 0, 0xffffffff }, + { 0x06, 0, 0, 0000000000 }, + +In particular, as usage with xt_bpf or cls_bpf can result in more complex BPF +filters that might not be obvious at first, it's good to test filters before +attaching to a live system. For that purpose, there's a small tool called +bpf_dbg under tools/bpf/ in the kernel source directory. 
This debugger allows +for testing BPF filters against given pcap files, single stepping through the +BPF code on the pcap's packets and dumping the BPF machine registers. + +Starting bpf_dbg is trivial and just requires issuing:: + + # ./bpf_dbg + +In case input and output do not equal stdin/stdout, bpf_dbg takes an +alternative stdin source as a first argument, and an alternative stdout +sink as a second one, e.g. `./bpf_dbg test_in.txt test_out.txt`. + +Other than that, a particular libreadline configuration can be set via +file "~/.bpf_dbg_init" and the command history is stored in the file +"~/.bpf_dbg_history". + +Interaction in bpf_dbg happens through a shell that also has auto-completion +support (follow-up example commands starting with '>' denote bpf_dbg shell). +The usual workflow would be to ... + +* load bpf 6,40 0 0 12,21 0 3 2048,48 0 0 23,21 0 1 1,6 0 0 65535,6 0 0 0 + Loads a BPF filter from standard output of bpf_asm, or transformed via + e.g. ``tcpdump -iem1 -ddd port 22 | tr '\n' ','``. Note that for JIT + debugging (next section), this command creates a temporary socket and + loads the BPF code into the kernel. Thus, this will also be useful for + JIT developers. + +* load pcap foo.pcap + + Loads standard tcpdump pcap file. + +* run [<n>] + +bpf passes:1 fails:9 + Runs through all packets from a pcap to account how many passes and fails + the filter will generate. A limit of packets to traverse can be given. + +* disassemble:: + + l0: ldh [12] + l1: jeq #0x800, l2, l5 + l2: ldb [23] + l3: jeq #0x1, l4, l5 + l4: ret #0xffff + l5: ret #0 + + Prints out BPF code disassembly. + +* dump:: + + /* { op, jt, jf, k }, */ + { 0x28, 0, 0, 0x0000000c }, + { 0x15, 0, 3, 0x00000800 }, + { 0x30, 0, 0, 0x00000017 }, + { 0x15, 0, 1, 0x00000001 }, + { 0x06, 0, 0, 0x0000ffff }, + { 0x06, 0, 0, 0000000000 }, + + Prints out C-style BPF code dump. + +* breakpoint 0:: + + breakpoint at: l0: ldh [12] + +* breakpoint 1:: + + breakpoint at: l1: jeq #0x800, l2, l5 + + ... 
+ + Sets breakpoints at particular BPF instructions. Issuing a `run` command + will walk through the pcap file continuing from the current packet and + break when a breakpoint is being hit (another `run` will continue from + the currently active breakpoint executing next instructions): + + * run:: + + -- register dump -- + pc: [0] <-- program counter + code: [40] jt[0] jf[0] k[12] <-- plain BPF code of current instruction + curr: l0: ldh [12] <-- disassembly of current instruction + A: [00000000][0] <-- content of A (hex, decimal) + X: [00000000][0] <-- content of X (hex, decimal) + M[0,15]: [00000000][0] <-- folded content of M (hex, decimal) + -- packet dump -- <-- Current packet from pcap (hex) + len: 42 + 0: 00 19 cb 55 55 a4 00 14 a4 43 78 69 08 06 00 01 + 16: 08 00 06 04 00 01 00 14 a4 43 78 69 0a 3b 01 26 + 32: 00 00 00 00 00 00 0a 3b 01 01 + (breakpoint) + > + + * breakpoint:: + + breakpoints: 0 1 + + Prints currently set breakpoints. + +* step [-<n>, +<n>] + + Performs single stepping through the BPF program from the current pc + offset. Thus, on each step invocation, above register dump is issued. + This can go forwards and backwards in time; a plain `step` will break + on the next BPF instruction, thus +1. (No `run` needs to be issued here.) + +* select <n> + + Selects a given packet from the pcap file to continue from. Thus, on + the next `run` or `step`, the BPF program is being evaluated against + the user pre-selected packet. Numbering starts just as in Wireshark + with index 1. + +* quit + + Exits bpf_dbg. + +JIT compiler +------------ + +The Linux kernel has a built-in BPF JIT compiler for x86_64, SPARC, +PowerPC, ARM, ARM64, MIPS, RISC-V and s390 and can be enabled through +CONFIG_BPF_JIT. 
The JIT compiler is transparently invoked for each +attached filter from user space or for internal kernel users if it has +been previously enabled by root:: + + echo 1 > /proc/sys/net/core/bpf_jit_enable + +For JIT developers, doing audits etc, each compile run can output the generated +opcode image into the kernel log via:: + + echo 2 > /proc/sys/net/core/bpf_jit_enable + +Example output from dmesg:: + + [ 3389.935842] flen=6 proglen=70 pass=3 image=ffffffffa0069c8f + [ 3389.935847] JIT code: 00000000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 68 + [ 3389.935849] JIT code: 00000010: 44 2b 4f 6c 4c 8b 87 d8 00 00 00 be 0c 00 00 00 + [ 3389.935850] JIT code: 00000020: e8 1d 94 ff e0 3d 00 08 00 00 75 16 be 17 00 00 + [ 3389.935851] JIT code: 00000030: 00 e8 28 94 ff e0 83 f8 01 75 07 b8 ff ff 00 00 + [ 3389.935852] JIT code: 00000040: eb 02 31 c0 c9 c3 + +When CONFIG_BPF_JIT_ALWAYS_ON is enabled, bpf_jit_enable is permanently set to 1 and +setting any other value than that will result in failure. This is even the case for +setting bpf_jit_enable to 2, since dumping the final JIT image into the kernel log +is discouraged and introspection through bpftool (under tools/bpf/bpftool/) is the +generally recommended approach instead. 
+ +In the kernel source tree under tools/bpf/, there's bpf_jit_disasm for +generating disassembly out of the kernel log's hexdump:: + + # ./bpf_jit_disasm + 70 bytes emitted from JIT compiler (pass:3, flen:6) + ffffffffa0069c8f + <x>: + 0: push %rbp + 1: mov %rsp,%rbp + 4: sub $0x60,%rsp + 8: mov %rbx,-0x8(%rbp) + c: mov 0x68(%rdi),%r9d + 10: sub 0x6c(%rdi),%r9d + 14: mov 0xd8(%rdi),%r8 + 1b: mov $0xc,%esi + 20: callq 0xffffffffe0ff9442 + 25: cmp $0x800,%eax + 2a: jne 0x0000000000000042 + 2c: mov $0x17,%esi + 31: callq 0xffffffffe0ff945e + 36: cmp $0x1,%eax + 39: jne 0x0000000000000042 + 3b: mov $0xffff,%eax + 40: jmp 0x0000000000000044 + 42: xor %eax,%eax + 44: leaveq + 45: retq + + Issuing option `-o` will "annotate" opcodes to resulting assembler + instructions, which can be very useful for JIT developers: + + # ./bpf_jit_disasm -o + 70 bytes emitted from JIT compiler (pass:3, flen:6) + ffffffffa0069c8f + <x>: + 0: push %rbp + 55 + 1: mov %rsp,%rbp + 48 89 e5 + 4: sub $0x60,%rsp + 48 83 ec 60 + 8: mov %rbx,-0x8(%rbp) + 48 89 5d f8 + c: mov 0x68(%rdi),%r9d + 44 8b 4f 68 + 10: sub 0x6c(%rdi),%r9d + 44 2b 4f 6c + 14: mov 0xd8(%rdi),%r8 + 4c 8b 87 d8 00 00 00 + 1b: mov $0xc,%esi + be 0c 00 00 00 + 20: callq 0xffffffffe0ff9442 + e8 1d 94 ff e0 + 25: cmp $0x800,%eax + 3d 00 08 00 00 + 2a: jne 0x0000000000000042 + 75 16 + 2c: mov $0x17,%esi + be 17 00 00 00 + 31: callq 0xffffffffe0ff945e + e8 28 94 ff e0 + 36: cmp $0x1,%eax + 83 f8 01 + 39: jne 0x0000000000000042 + 75 07 + 3b: mov $0xffff,%eax + b8 ff ff 00 00 + 40: jmp 0x0000000000000044 + eb 02 + 42: xor %eax,%eax + 31 c0 + 44: leaveq + c9 + 45: retq + c3 + +For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provide a useful +toolchain for developing and testing the kernel's JIT compiler. + +BPF kernel internals +-------------------- +Internally, for the kernel interpreter, a different instruction set +format with similar underlying principles from BPF described in previous +paragraphs is being used. 
However, the instruction set format is modelled +closer to the underlying architecture to mimic native instruction sets, so +that a better performance can be achieved (more details later). This new +ISA is called 'eBPF' or 'internal BPF' interchangeably. (Note: eBPF which +originates from [e]xtended BPF is not the same as BPF extensions! While +eBPF is an ISA, BPF extensions date back to classic BPF's 'overloading' +of BPF_LD | BPF_{B,H,W} | BPF_ABS instruction.) + +It is designed to be JITed with one to one mapping, which can also open up +the possibility for GCC/LLVM compilers to generate optimized eBPF code through +an eBPF backend that performs almost as fast as natively compiled code. + +The new instruction set was originally designed with the possible goal in +mind to write programs in "restricted C" and compile into eBPF with a optional +GCC/LLVM backend, so that it can just-in-time map to modern 64-bit CPUs with +minimal performance overhead over two steps, that is, C -> eBPF -> native code. + +Currently, the new format is being used for running user BPF programs, which +includes seccomp BPF, classic socket filters, cls_bpf traffic classifier, +team driver's classifier for its load-balancing mode, netfilter's xt_bpf +extension, PTP dissector/classifier, and much more. They are all internally +converted by the kernel into the new instruction set representation and run +in the eBPF interpreter. For in-kernel handlers, this all works transparently +by using bpf_prog_create() for setting up the filter, resp. +bpf_prog_destroy() for destroying it. The macro +BPF_PROG_RUN(filter, ctx) transparently invokes eBPF interpreter or JITed +code to run the filter. 'filter' is a pointer to struct bpf_prog that we +got from bpf_prog_create(), and 'ctx' the given context (e.g. +skb pointer). All constraints and restrictions from bpf_check_classic() apply +before a conversion to the new layout is being done behind the scenes! 
+ +Currently, the classic BPF format is being used for JITing on most +32-bit architectures, whereas x86-64, aarch64, s390x, powerpc64, +sparc64, arm32, riscv64, riscv32 perform JIT compilation from eBPF +instruction set. + +Some core changes of the new internal format: + +- Number of registers increase from 2 to 10: + + The old format had two registers A and X, and a hidden frame pointer. The + new layout extends this to be 10 internal registers and a read-only frame + pointer. Since 64-bit CPUs are passing arguments to functions via registers + the number of args from eBPF program to in-kernel function is restricted + to 5 and one register is used to accept return value from an in-kernel + function. Natively, x86_64 passes first 6 arguments in registers, aarch64/ + sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6 callee saved + registers, and aarch64/sparcv9/mips64 have 11 or more callee saved registers. + + Therefore, eBPF calling convention is defined as: + + * R0 - return value from in-kernel function, and exit value for eBPF program + * R1 - R5 - arguments from eBPF program to in-kernel function + * R6 - R9 - callee saved registers that in-kernel function will preserve + * R10 - read-only frame pointer to access stack + + Thus, all eBPF registers map one to one to HW registers on x86_64, aarch64, + etc, and eBPF calling convention maps directly to ABIs used by the kernel on + 64-bit architectures. + + On 32-bit architectures JIT may map programs that use only 32-bit arithmetic + and may let more complex programs to be interpreted. + + R0 - R5 are scratch registers and eBPF program needs spill/fill them if + necessary across calls. Note that there is only one eBPF program (== one + eBPF main routine) and it cannot call other eBPF functions, it can only + call predefined in-kernel functions, though. 
+ +- Register width increases from 32-bit to 64-bit: + + Still, the semantics of the original 32-bit ALU operations are preserved + via 32-bit subregisters. All eBPF registers are 64-bit with 32-bit lower + subregisters that zero-extend into 64-bit if they are being written to. + That behavior maps directly to x86_64 and arm64 subregister definition, but + makes other JITs more difficult. + + 32-bit architectures run 64-bit internal BPF programs via interpreter. + Their JITs may convert BPF programs that only use 32-bit subregisters into + native instruction set and let the rest being interpreted. + + Operation is 64-bit, because on 64-bit architectures, pointers are also + 64-bit wide, and we want to pass 64-bit values in/out of kernel functions, + so 32-bit eBPF registers would otherwise require to define register-pair + ABI, thus, there won't be able to use a direct eBPF register to HW register + mapping and JIT would need to do combine/split/move operations for every + register in and out of the function, which is complex, bug prone and slow. + Another reason is the use of atomic 64-bit counters. + +- Conditional jt/jf targets replaced with jt/fall-through: + + While the original design has constructs such as ``if (cond) jump_true; + else jump_false;``, they are being replaced into alternative constructs like + ``if (cond) jump_true; /* else fall-through */``. + +- Introduces bpf_call insn and register passing convention for zero overhead + calls from/to other kernel functions: + + Before an in-kernel function call, the internal BPF program needs to + place function arguments into R1 to R5 registers to satisfy calling + convention, then the interpreter will take them from registers and pass + to in-kernel function. If R1 - R5 registers are mapped to CPU registers + that are used for argument passing on given architecture, the JIT compiler + doesn't need to emit extra moves. 
Function arguments will be in the correct + registers and BPF_CALL instruction will be JITed as single 'call' HW + instruction. This calling convention was picked to cover common call + situations without performance penalty. + + After an in-kernel function call, R1 - R5 are reset to unreadable and R0 has + a return value of the function. Since R6 - R9 are callee saved, their state + is preserved across the call. + + For example, consider three C functions:: + + u64 f1() { return (*_f2)(1); } + u64 f2(u64 a) { return f3(a + 1, a); } + u64 f3(u64 a, u64 b) { return a - b; } + + GCC can compile f1, f3 into x86_64:: + + f1: + movl $1, %edi + movq _f2(%rip), %rax + jmp *%rax + f3: + movq %rdi, %rax + subq %rsi, %rax + ret + + Function f2 in eBPF may look like:: + + f2: + bpf_mov R2, R1 + bpf_add R1, 1 + bpf_call f3 + bpf_exit + + If f2 is JITed and the pointer is stored to ``_f2``, the calls f1 -> f2 -> f3 and + returns will be seamless. Without JIT, __bpf_prog_run() interpreter needs to + be used to call into f2. + + For practical reasons all eBPF programs have only one argument 'ctx' which is + already placed into R1 (e.g. on __bpf_prog_run() startup) and the programs + can call kernel functions with up to 5 arguments. Calls with 6 or more arguments + are currently not supported, but these restrictions can be lifted if necessary + in the future. + + On 64-bit architectures all registers map to HW registers one to one. For + example, x86_64 JIT compiler can map them as ... + + :: + + R0 - rax + R1 - rdi + R2 - rsi + R3 - rdx + R4 - rcx + R5 - r8 + R6 - rbx + R7 - r13 + R8 - r14 + R9 - r15 + R10 - rbp + + ... since x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument passing + and rbx, r12 - r15 are callee saved. 
+ + Then the following internal BPF pseudo-program:: + + bpf_mov R6, R1 /* save ctx */ + bpf_mov R2, 2 + bpf_mov R3, 3 + bpf_mov R4, 4 + bpf_mov R5, 5 + bpf_call foo + bpf_mov R7, R0 /* save foo() return value */ + bpf_mov R1, R6 /* restore ctx for next call */ + bpf_mov R2, 6 + bpf_mov R3, 7 + bpf_mov R4, 8 + bpf_mov R5, 9 + bpf_call bar + bpf_add R0, R7 + bpf_exit + + After JIT to x86_64 may look like:: + + push %rbp + mov %rsp,%rbp + sub $0x228,%rsp + mov %rbx,-0x228(%rbp) + mov %r13,-0x220(%rbp) + mov %rdi,%rbx + mov $0x2,%esi + mov $0x3,%edx + mov $0x4,%ecx + mov $0x5,%r8d + callq foo + mov %rax,%r13 + mov %rbx,%rdi + mov $0x6,%esi + mov $0x7,%edx + mov $0x8,%ecx + mov $0x9,%r8d + callq bar + add %r13,%rax + mov -0x228(%rbp),%rbx + mov -0x220(%rbp),%r13 + leaveq + retq + + Which is in this example equivalent in C to:: + + u64 bpf_filter(u64 ctx) + { + return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9); + } + + In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64 + arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in proper + registers and place their return value into ``%rax`` which is R0 in eBPF. + Prologue and epilogue are emitted by JIT and are implicit in the + interpreter. R0-R5 are scratch registers, so eBPF program needs to preserve + them across the calls as defined by calling convention. + + For example the following program is invalid:: + + bpf_mov R1, 1 + bpf_call foo + bpf_mov R0, R1 + bpf_exit + + After the call the registers R1-R5 contain junk values and cannot be read. + An in-kernel eBPF verifier is used to validate internal BPF programs. + +Also in the new design, eBPF is limited to 4096 insns, which means that any +program will terminate quickly and will only call a fixed number of kernel +functions. Original BPF and the new format are two operand instructions, +which helps to do one-to-one mapping between eBPF insn and x86 insn during JIT. 
+ +The input context pointer for invoking the interpreter function is generic; +its content is defined by a specific use case. For seccomp register R1 points +to seccomp_data, for converted BPF filters R1 points to a skb. + +A program that is translated internally consists of the following elements:: + + op:16, jt:8, jf:8, k:32 ==> op:8, dst_reg:4, src_reg:4, off:16, imm:32 + +So far 87 internal BPF instructions were implemented. The 8-bit 'op' opcode field +has room for new instructions. Some of them may use 16/24/32 byte encoding. New +instructions must be a multiple of 8 bytes to preserve backward compatibility. + +Internal BPF is a general purpose RISC instruction set. Not every register and +every instruction are used during translation from original BPF to new format. +For example, socket filters are not using ``exclusive add`` instruction, but +tracing filters may use it to maintain counters of events. Register R9 +is not used by socket filters either, but more complex filters may be running +out of registers and would have to resort to spill/fill to stack. + +Internal BPF can be used as a generic assembler for last step performance +optimizations; socket filters and seccomp are using it as an assembler. Tracing +filters may use it as an assembler to generate code from kernel. In-kernel usage +may not be bounded by security considerations, since generated internal BPF code +may be optimizing internal code path and not being exposed to the user space. +Safety of internal BPF can come from a verifier (TBD). In such use cases as +described, it may be used as a safe instruction set. + +Just like the original BPF, the new format runs within a controlled environment, +is deterministic and the kernel can easily prove that. The safety of the program +can be determined in two steps: first step does depth-first-search to disallow +loops and other CFG validation; second step starts from the first insn and +descends all possible paths.
It simulates execution of every insn and observes +the state change of registers and stack. + +eBPF opcode encoding +-------------------- + +eBPF is reusing most of the opcode encoding from classic to simplify conversion +of classic BPF to eBPF. For arithmetic and jump instructions the 8-bit 'code' +field is divided into three parts:: + + +----------------+--------+--------------------+ + | 4 bits | 1 bit | 3 bits | + | operation code | source | instruction class | + +----------------+--------+--------------------+ + (MSB) (LSB) + +Three LSB bits store instruction class which is one of: + + =================== =============== + Classic BPF classes eBPF classes + =================== =============== + BPF_LD 0x00 BPF_LD 0x00 + BPF_LDX 0x01 BPF_LDX 0x01 + BPF_ST 0x02 BPF_ST 0x02 + BPF_STX 0x03 BPF_STX 0x03 + BPF_ALU 0x04 BPF_ALU 0x04 + BPF_JMP 0x05 BPF_JMP 0x05 + BPF_RET 0x06 BPF_JMP32 0x06 + BPF_MISC 0x07 BPF_ALU64 0x07 + =================== =============== + +When BPF_CLASS(code) == BPF_ALU or BPF_JMP, 4th bit encodes source operand ... + + :: + + BPF_K 0x00 + BPF_X 0x08 + + * in classic BPF, this means:: + + BPF_SRC(code) == BPF_X - use register X as source operand + BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand + + * in eBPF, this means:: + + BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand + BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand + +... and four MSB bits store operation code. 
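The three-way split of the 8-bit 'code' field for arithmetic and jump instructions can be sketched in C as follows. This is only a minimal illustration: the mask values are derived from the bit layout in the diagram above, the constants are copied from the tables in this document, and ``struct decoded_code`` and ``decode_code()`` are hypothetical names introduced for the example rather than kernel interfaces.

```c
#include <assert.h>
#include <stdint.h>

/* Constants copied from the tables in this document. */
#define BPF_ALU   0x04
#define BPF_ALU64 0x07
#define BPF_K     0x00
#define BPF_X     0x08
#define BPF_ADD   0x00
#define BPF_XOR   0xa0

/* Field extractors following the 4 + 1 + 3 bit layout shown above. */
#define BPF_CLASS(code) ((code) & 0x07)	/* three LSB bits */
#define BPF_SRC(code)   ((code) & 0x08)	/* 4th bit        */
#define BPF_OP(code)    ((code) & 0xf0)	/* four MSB bits  */

/* Hypothetical container for the three parts; not a kernel type. */
struct decoded_code {
	uint8_t op;
	uint8_t src;
	uint8_t cls;
};

/* Hypothetical helper: split an ALU/JMP 'code' byte into its parts. */
static struct decoded_code decode_code(uint8_t code)
{
	struct decoded_code d = {
		.op  = BPF_OP(code),
		.src = BPF_SRC(code),
		.cls = BPF_CLASS(code),
	};
	return d;
}

/* Compile-time check that the class really sits in the three LSB bits. */
static_assert(BPF_CLASS(BPF_ADD | BPF_X | BPF_ALU64) == BPF_ALU64,
	      "instruction class is stored in the three LSB bits");
```

With these definitions, ``decode_code(BPF_XOR | BPF_K | BPF_ALU)`` yields op BPF_XOR, source BPF_K and class BPF_ALU. Because the masks do not overlap, the parts can be OR-ed back together to rebuild the original byte.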
+ +If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of:: + + BPF_ADD 0x00 + BPF_SUB 0x10 + BPF_MUL 0x20 + BPF_DIV 0x30 + BPF_OR 0x40 + BPF_AND 0x50 + BPF_LSH 0x60 + BPF_RSH 0x70 + BPF_NEG 0x80 + BPF_MOD 0x90 + BPF_XOR 0xa0 + BPF_MOV 0xb0 /* eBPF only: mov reg to reg */ + BPF_ARSH 0xc0 /* eBPF only: sign extending shift right */ + BPF_END 0xd0 /* eBPF only: endianness conversion */ + +If BPF_CLASS(code) == BPF_JMP or BPF_JMP32 [ in eBPF ], BPF_OP(code) is one of:: + + BPF_JA 0x00 /* BPF_JMP only */ + BPF_JEQ 0x10 + BPF_JGT 0x20 + BPF_JGE 0x30 + BPF_JSET 0x40 + BPF_JNE 0x50 /* eBPF only: jump != */ + BPF_JSGT 0x60 /* eBPF only: signed '>' */ + BPF_JSGE 0x70 /* eBPF only: signed '>=' */ + BPF_CALL 0x80 /* eBPF BPF_JMP only: function call */ + BPF_EXIT 0x90 /* eBPF BPF_JMP only: function return */ + BPF_JLT 0xa0 /* eBPF only: unsigned '<' */ + BPF_JLE 0xb0 /* eBPF only: unsigned '<=' */ + BPF_JSLT 0xc0 /* eBPF only: signed '<' */ + BPF_JSLE 0xd0 /* eBPF only: signed '<=' */ + +So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF +and eBPF. There are only two registers in classic BPF, so it means A += X. +In eBPF it means dst_reg = (u32) dst_reg + (u32) src_reg; similarly, +BPF_XOR | BPF_K | BPF_ALU means A ^= imm32 in classic BPF and analogous +dst_reg = (u32) dst_reg ^ (u32) imm32 in eBPF. + +Classic BPF is using BPF_MISC class to represent A = X and X = A moves. +eBPF is using BPF_MOV | BPF_X | BPF_ALU code instead. Since there are no +BPF_MISC operations in eBPF, the class 7 is used as BPF_ALU64 to mean +exactly the same operations as BPF_ALU, but with 64-bit wide operands +instead. So BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition, i.e.: +dst_reg = dst_reg + src_reg + +Classic BPF wastes the whole BPF_RET class to represent a single ``ret`` +operation. Classic BPF_RET | BPF_K means copy imm32 into return register +and perform function exit.
eBPF is modeled to match CPU, so BPF_JMP | BPF_EXIT +in eBPF means function exit only. The eBPF program needs to store return +value into register R0 before doing a BPF_EXIT. Class 6 in eBPF is used as +BPF_JMP32 to mean exactly the same operations as BPF_JMP, but with 32-bit wide +operands for the comparisons instead. + +For load and store instructions the 8-bit 'code' field is divided as:: + + +--------+--------+-------------------+ + | 3 bits | 2 bits | 3 bits | + | mode | size | instruction class | + +--------+--------+-------------------+ + (MSB) (LSB) + +Size modifier is one of ... + +:: + + BPF_W 0x00 /* word */ + BPF_H 0x08 /* half word */ + BPF_B 0x10 /* byte */ + BPF_DW 0x18 /* eBPF only, double word */ + +... which encodes size of load/store operation:: + + B - 1 byte + H - 2 byte + W - 4 byte + DW - 8 byte (eBPF only) + +Mode modifier is one of:: + + BPF_IMM 0x00 /* used for 32-bit mov in classic BPF and 64-bit in eBPF */ + BPF_ABS 0x20 + BPF_IND 0x40 + BPF_MEM 0x60 + BPF_LEN 0x80 /* classic BPF only, reserved in eBPF */ + BPF_MSH 0xa0 /* classic BPF only, reserved in eBPF */ + BPF_XADD 0xc0 /* eBPF only, exclusive add */ + +eBPF has two non-generic instructions: (BPF_ABS | <size> | BPF_LD) and +(BPF_IND | <size> | BPF_LD) which are used to access packet data. + +They had to be carried over from classic to have strong performance of +socket filters running in eBPF interpreter. These instructions can only +be used when interpreter context is a pointer to ``struct sk_buff`` and +have seven implicit operands. Register R6 is an implicit input that must +contain pointer to sk_buff. Register R0 is an implicit output which contains +the data fetched from the packet. Registers R1-R5 are scratch registers +and must not be used to store the data across BPF_ABS | BPF_LD or +BPF_IND | BPF_LD instructions. + +These instructions have implicit program exit condition as well.
When +eBPF program is trying to access the data beyond the packet boundary, +the interpreter will abort the execution of the program. JIT compilers +therefore must preserve this property. src_reg and imm32 fields are +explicit inputs to these instructions. + +For example:: + + BPF_IND | BPF_W | BPF_LD means: + + R0 = ntohl(*(u32 *) (((struct sk_buff *) R6)->data + src_reg + imm32)) + and R1 - R5 were scratched. + +Unlike classic BPF instruction set, eBPF has generic load/store operations:: + + BPF_MEM | <size> | BPF_STX: *(size *) (dst_reg + off) = src_reg + BPF_MEM | <size> | BPF_ST: *(size *) (dst_reg + off) = imm32 + BPF_MEM | <size> | BPF_LDX: dst_reg = *(size *) (src_reg + off) + BPF_XADD | BPF_W | BPF_STX: lock xadd *(u32 *)(dst_reg + off16) += src_reg + BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg + +Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and +2 byte atomic increments are not supported. + +eBPF has one 16-byte instruction: BPF_LD | BPF_DW | BPF_IMM which consists +of two consecutive ``struct bpf_insn`` 8-byte blocks and is interpreted as a single +instruction that loads 64-bit immediate value into a dst_reg. +Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM which loads +32-bit immediate value into a register. + +eBPF verifier +------------- +The safety of the eBPF program is determined in two steps. + +First step does DAG check to disallow loops and other CFG validation. +In particular it will detect programs that have unreachable instructions. +(though classic BPF checker allows them) + +Second step starts from the first insn and descends all possible paths. +It simulates execution of every insn and observes the state change of +registers and stack. + +At the start of the program the register R1 contains a pointer to context +and has type PTR_TO_CTX. +If verifier sees an insn that does R2=R1, then R2 has now type +PTR_TO_CTX as well and can be used on the right hand side of expression.
+If R1=PTR_TO_CTX and insn is R2=R1+R1, then R2=SCALAR_VALUE, +since addition of two valid pointers makes invalid pointer. +(In 'secure' mode verifier will reject any type of pointer arithmetic to make +sure that kernel addresses don't leak to unprivileged users) + +If register was never written to, it's not readable:: + + bpf_mov R0 = R2 + bpf_exit + +will be rejected, since R2 is unreadable at the start of the program. + +After kernel function call, R1-R5 are reset to unreadable and +R0 has a return type of the function. + +Since R6-R9 are callee saved, their state is preserved across the call. + +:: + + bpf_mov R6 = 1 + bpf_call foo + bpf_mov R0 = R6 + bpf_exit + +is a correct program. If there was R1 instead of R6, it would have +been rejected. + +load/store instructions are allowed only with registers of valid types, which +are PTR_TO_CTX, PTR_TO_MAP, PTR_TO_STACK. They are bounds and alignment checked. +For example:: + + bpf_mov R1 = 1 + bpf_mov R2 = 2 + bpf_xadd *(u32 *)(R1 + 3) += R2 + bpf_exit + +will be rejected, since R1 doesn't have a valid pointer type at the time of +execution of instruction bpf_xadd. + +At the start R1 type is PTR_TO_CTX (a pointer to generic ``struct bpf_context``) +A callback is used to customize verifier to restrict eBPF program access to only +certain fields within ctx structure with specified size and alignment. + +For example, the following insn:: + + bpf_ld R0 = *(u32 *)(R6 + 8) + +intends to load a word from address R6 + 8 and store it into R0 +If R6=PTR_TO_CTX, via is_valid_access() callback the verifier will know +that offset 8 of size 4 bytes can be accessed for reading, otherwise +the verifier will reject the program. +If R6=PTR_TO_STACK, then access should be aligned and be within +stack bounds, which are [-MAX_BPF_STACK, 0). In this example offset is 8, +so it will fail verification, since it's out of bounds. + +The verifier will allow eBPF program to read data from stack only after +it wrote into it. 
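The stack rules described above (aligned access within [-MAX_BPF_STACK, 0), and reads only after writes) can be modelled with a small C sketch. This is a toy model written for illustration, not the verifier's implementation: ``struct toy_stack`` and the ``toy_*`` helpers are hypothetical names, the value 512 for MAX_BPF_STACK is assumed here, and offsets are taken relative to the frame pointer as in the examples.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed stack size for this sketch. */
#define MAX_BPF_STACK 512

static_assert(MAX_BPF_STACK % 8 == 0, "stack size should hold whole 8-byte slots");

/* Toy model: written[i] tracks the byte at fp - MAX_BPF_STACK + i. */
struct toy_stack {
	bool written[MAX_BPF_STACK];
};

/* Access must be naturally aligned and lie within [-MAX_BPF_STACK, 0). */
static bool toy_in_bounds(int off, int size)
{
	if (size != 1 && size != 2 && size != 4 && size != 8)
		return false;
	return off >= -MAX_BPF_STACK && off + size <= 0 && off % size == 0;
}

/* A store is allowed if it is in bounds; it marks the bytes as initialized. */
static bool toy_stack_write(struct toy_stack *s, int off, int size)
{
	if (!toy_in_bounds(off, size))
		return false;
	for (int i = 0; i < size; i++)
		s->written[off + MAX_BPF_STACK + i] = true;
	return true;
}

/* A load is allowed only if it is in bounds and every byte was written. */
static bool toy_stack_read_ok(const struct toy_stack *s, int off, int size)
{
	if (!toy_in_bounds(off, size))
		return false;
	for (int i = 0; i < size; i++)
		if (!s->written[off + MAX_BPF_STACK + i])
			return false;
	return true;
}
```

In this model a 4-byte read at offset 8 is refused as out of bounds, and a 4-byte read at offset -4 is refused until a store has covered those bytes.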
+ +Classic BPF verifier does similar check with M[0-15] memory slots. +For example:: + + bpf_ld R0 = *(u32 *)(R10 - 4) + bpf_exit + +is invalid program. +Though R10 is correct read-only register and has type PTR_TO_STACK +and R10 - 4 is within stack bounds, there were no stores into that location. + +Pointer register spill/fill is tracked as well, since four (R6-R9) +callee saved registers may not be enough for some programs. + +Allowed function calls are customized with bpf_verifier_ops->get_func_proto() +The eBPF verifier will check that registers match argument constraints. +After the call register R0 will be set to return type of the function. + +Function calls is a main mechanism to extend functionality of eBPF programs. +Socket filters may let programs to call one set of functions, whereas tracing +filters may allow completely different set. + +If a function made accessible to eBPF program, it needs to be thought through +from safety point of view. The verifier will guarantee that the function is +called with valid arguments. + +seccomp vs socket filters have different security restrictions for classic BPF. +Seccomp solves this by two stage verifier: classic BPF verifier is followed +by seccomp verifier. In case of eBPF one configurable verifier is shared for +all use cases. + +See details of eBPF verifier in kernel/bpf/verifier.c + +Register value tracking +----------------------- +In order to determine the safety of an eBPF program, the verifier must track +the range of possible values in each register and also in each stack slot. +This is done with ``struct bpf_reg_state``, defined in include/linux/ +bpf_verifier.h, which unifies tracking of scalar and pointer values. Each +register state has a type, which is either NOT_INIT (the register has not been +written to), SCALAR_VALUE (some value which is not usable as a pointer), or a +pointer type. The types of pointers describe their base, as follows: + + + PTR_TO_CTX + Pointer to bpf_context. 
+ CONST_PTR_TO_MAP + Pointer to struct bpf_map. "Const" because arithmetic + on these pointers is forbidden. + PTR_TO_MAP_VALUE + Pointer to the value stored in a map element. + PTR_TO_MAP_VALUE_OR_NULL + Either a pointer to a map value, or NULL; map accesses + (see section 'eBPF maps', below) return this type, + which becomes a PTR_TO_MAP_VALUE when checked != NULL. + Arithmetic on these pointers is forbidden. + PTR_TO_STACK + Frame pointer. + PTR_TO_PACKET + skb->data. + PTR_TO_PACKET_END + skb->data + headlen; arithmetic forbidden. + PTR_TO_SOCKET + Pointer to struct bpf_sock_ops, implicitly refcounted. + PTR_TO_SOCKET_OR_NULL + Either a pointer to a socket, or NULL; socket lookup + returns this type, which becomes a PTR_TO_SOCKET when + checked != NULL. PTR_TO_SOCKET is reference-counted, + so programs must release the reference through the + socket release function before the end of the program. + Arithmetic on these pointers is forbidden. + +However, a pointer may be offset from this base (as a result of pointer +arithmetic), and this is tracked in two parts: the 'fixed offset' and 'variable +offset'. The former is used when an exactly-known value (e.g. an immediate +operand) is added to a pointer, while the latter is used for values which are +not exactly known. The variable offset is also used in SCALAR_VALUEs, to track +the range of possible values in the register. + +The verifier's knowledge about the variable offset consists of: + +* minimum and maximum values as unsigned +* minimum and maximum values as signed + +* knowledge of the values of individual bits, in the form of a 'tnum': a u64 + 'mask' and a u64 'value'. 1s in the mask represent bits whose value is unknown; + 1s in the value represent bits known to be 1. Bits known to be 0 have 0 in both + mask and value; no bit should ever be 1 in both. 
For example, if a byte is read + into a register from memory, the register's top 56 bits are known zero, while + the low 8 are unknown - which is represented as the tnum (0x0; 0xff). If we + then OR this with 0x40, we get (0x40; 0xbf), then if we add 1 we get (0x0; + 0x1ff), because of potential carries. + +Besides arithmetic, the register state can also be updated by conditional +branches. For instance, if a SCALAR_VALUE is compared > 8, in the 'true' branch +it will have a umin_value (unsigned minimum value) of 9, whereas in the 'false' +branch it will have a umax_value of 8. A signed compare (with BPF_JSGT or +BPF_JSGE) would instead update the signed minimum/maximum values. Information +from the signed and unsigned bounds can be combined; for instance if a value is +first tested < 8 and then tested s> 4, the verifier will conclude that the value +is also > 4 and s< 8, since the bounds prevent crossing the sign boundary. + +PTR_TO_PACKETs with a variable offset part have an 'id', which is common to all +pointers sharing that same variable offset. This is important for packet range +checks: after adding a variable to a packet pointer register A, if you then copy +it to another register B and then add a constant 4 to A, both registers will +share the same 'id' but the A will have a fixed offset of +4. Then if A is +bounds-checked and found to be less than a PTR_TO_PACKET_END, the register B is +now known to have a safe range of at least 4 bytes. See 'Direct packet access', +below, for more on PTR_TO_PACKET ranges. + +The 'id' field is also used on PTR_TO_MAP_VALUE_OR_NULL, common to all copies of +the pointer returned from a map lookup. This means that when one copy is +checked and found to be non-NULL, all copies can become PTR_TO_MAP_VALUEs. +As well as range-checking, the tracked information is also used for enforcing +alignment of pointer accesses. For instance, on most systems the packet pointer +is 2 bytes after a 4-byte alignment. 
If a program adds 14 bytes to that to jump +over the Ethernet header, then reads IHL and adds (IHL * 4), the resulting +pointer will have a variable offset known to be 4n+2 for some n, so adding the 2 +bytes (NET_IP_ALIGN) gives a 4-byte alignment and so word-sized accesses through +that pointer are safe. +The 'id' field is also used on PTR_TO_SOCKET and PTR_TO_SOCKET_OR_NULL, common +to all copies of the pointer returned from a socket lookup. This has similar +behaviour to the handling for PTR_TO_MAP_VALUE_OR_NULL->PTR_TO_MAP_VALUE, but +it also handles reference tracking for the pointer. PTR_TO_SOCKET implicitly +represents a reference to the corresponding ``struct sock``. To ensure that the +reference is not leaked, it is imperative to NULL-check the reference and, in +the non-NULL case, pass the valid reference to the socket release function. + +Direct packet access +-------------------- +In cls_bpf and act_bpf programs the verifier allows direct access to the packet +data via skb->data and skb->data_end pointers. +Ex:: + + 1: r4 = *(u32 *)(r1 +80) /* load skb->data_end */ + 2: r3 = *(u32 *)(r1 +76) /* load skb->data */ + 3: r5 = r3 + 4: r5 += 14 + 5: if r5 > r4 goto pc+16 + R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp + 6: r0 = *(u16 *)(r3 +12) /* access bytes 12 and 13 of the packet */ + +This 2-byte load from the packet is safe to do, since the program author +did check ``if (skb->data + 14 > skb->data_end) goto err`` at insn #5 which +means that in the fall-through case the register R3 (which points to skb->data) +has at least 14 directly accessible bytes. The verifier marks it +as R3=pkt(id=0,off=0,r=14). +id=0 means that no additional variables were added to the register. +off=0 means that no additional constants were added. +r=14 is the range of safe access which means that bytes [R3, R3 + 14) are ok. +Note that R5 is marked as R5=pkt(id=0,off=14,r=14).
It also points +to the packet data, but constant 14 was added to the register, so +it now points to ``skb->data + 14`` and accessible range is [R5, R5 + 14 - 14) +which is zero bytes. + +More complex packet access may look like:: + + + R0=inv1 R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp + 6: r0 = *(u8 *)(r3 +7) /* load 7th byte from the packet */ + 7: r4 = *(u8 *)(r3 +12) + 8: r4 *= 14 + 9: r3 = *(u32 *)(r1 +76) /* load skb->data */ + 10: r3 += r4 + 11: r2 = r1 + 12: r2 <<= 48 + 13: r2 >>= 48 + 14: r3 += r2 + 15: r2 = r3 + 16: r2 += 8 + 17: r1 = *(u32 *)(r1 +80) /* load skb->data_end */ + 18: if r2 > r1 goto pc+2 + R0=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R1=pkt_end R2=pkt(id=2,off=8,r=8) R3=pkt(id=2,off=0,r=8) R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)) R5=pkt(id=0,off=14,r=14) R10=fp + 19: r1 = *(u8 *)(r3 +4) + +The state of the register R3 is R3=pkt(id=2,off=0,r=8) +id=2 means that two ``r3 += rX`` instructions were seen, so r3 points to some +offset within a packet and since the program author did +``if (r3 + 8 > r1) goto err`` at insn #18, the safe range is [R3, R3 + 8). +The verifier only allows 'add'/'sub' operations on packet registers. Any other +operation will set the register state to 'SCALAR_VALUE' and it won't be +available for direct packet access. + +Operation ``r3 += rX`` may overflow and become less than original skb->data, +therefore the verifier has to prevent that. So when it sees ``r3 += rX`` +instruction and rX is more than 16-bit value, any subsequent bounds-check of r3 +against skb->data_end will not give us 'range' information, so attempts to read +through the pointer will give "invalid access to packet" error. + +Ex. after insn ``r4 = *(u8 *)(r3 +12)`` (insn #7 above) the state of r4 is +R4=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) which means that upper 56 bits +of the register are guaranteed to be zero, and nothing is known about the lower +8 bits. 
After insn ``r4 *= 14`` the state becomes +R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)), since multiplying an 8-bit +value by constant 14 will keep upper 52 bits as zero, also the least significant +bit will be zero as 14 is even. Similarly ``r2 >>= 48`` will make +R2=inv(id=0,umax_value=65535,var_off=(0x0; 0xffff)), since the shift is not sign +extending. This logic is implemented in adjust_reg_min_max_vals() function, +which calls adjust_ptr_min_max_vals() for adding pointer to scalar (or vice +versa) and adjust_scalar_min_max_vals() for operations on two scalars. + +The end result is that bpf program author can access packet directly +using normal C code as:: + + void *data = (void *)(long)skb->data; + void *data_end = (void *)(long)skb->data_end; + struct eth_hdr *eth = data; + struct iphdr *iph = data + sizeof(*eth); + struct udphdr *udp = data + sizeof(*eth) + sizeof(*iph); + + if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*udp) > data_end) + return 0; + if (eth->h_proto != htons(ETH_P_IP)) + return 0; + if (iph->protocol != IPPROTO_UDP || iph->ihl != 5) + return 0; + if (udp->dest == 53 || udp->source == 9) + ...; + +which makes such programs easier to write comparing to LD_ABS insn +and significantly faster. + +eBPF maps +--------- +'maps' is a generic storage of different types for sharing data between kernel +and userspace. 
+ +The maps are accessed from user space via BPF syscall, which has commands: + +- create a map with given type and attributes + ``map_fd = bpf(BPF_MAP_CREATE, union bpf_attr *attr, u32 size)`` + using attr->map_type, attr->key_size, attr->value_size, attr->max_entries + returns process-local file descriptor or negative error + +- lookup key in a given map + ``err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size)`` + using attr->map_fd, attr->key, attr->value + returns zero and stores found elem into value or negative error + +- create or update key/value pair in a given map + ``err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size)`` + using attr->map_fd, attr->key, attr->value + returns zero or negative error + +- find and delete element by key in a given map + ``err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size)`` + using attr->map_fd, attr->key + +- to delete map: close(fd) + Exiting process will delete maps automatically + +userspace programs use this syscall to create/access maps that eBPF programs +are concurrently updating. + +maps can have different types: hash, array, bloom filter, radix-tree, etc. + +The map is defined by: + + - type + - max number of elements + - key size in bytes + - value size in bytes + +Pruning +------- +The verifier does not actually walk all possible paths through the program. For +each new branch to analyse, the verifier looks at all the states it's previously +been in when at this instruction. If any of them contain the current state as a +subset, the branch is 'pruned' - that is, the fact that the previous state was +accepted implies the current state would be as well. For instance, if in the +previous state, r1 held a packet-pointer, and in the current state, r1 holds a +packet-pointer with a range as long or longer and at least as strict an +alignment, then r1 is safe. 
Similarly, if r2 was NOT_INIT before then it can't +have been used by any path from that point, so any value in r2 (including +another NOT_INIT) is safe. The implementation is in the function regsafe(). +Pruning considers not only the registers but also the stack (and any spilled +registers it may hold). They must all be safe for the branch to be pruned. +This is implemented in states_equal(). + +Understanding eBPF verifier messages +------------------------------------ + +The following are a few examples of invalid eBPF programs and verifier error +messages as seen in the log: + +Program with unreachable instructions:: + + static struct bpf_insn prog[] = { + BPF_EXIT_INSN(), + BPF_EXIT_INSN(), + }; + +Error:: + + unreachable insn 1 + +Program that reads uninitialized register:: + + BPF_MOV64_REG(BPF_REG_0, BPF_REG_2), + BPF_EXIT_INSN(), + +Error:: + + 0: (bf) r0 = r2 + R2 !read_ok + +Program that doesn't initialize R0 before exiting:: + + BPF_MOV64_REG(BPF_REG_2, BPF_REG_1), + BPF_EXIT_INSN(), + +Error:: + + 0: (bf) r2 = r1 + 1: (95) exit + R0 !read_ok + +Program that accesses stack out of bounds:: + + BPF_ST_MEM(BPF_DW, BPF_REG_10, 8, 0), + BPF_EXIT_INSN(), + +Error:: + + 0: (7a) *(u64 *)(r10 +8) = 0 + invalid stack off=8 size=8 + +Program that doesn't initialize stack before passing its address into function:: + + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_LD_MAP_FD(BPF_REG_1, 0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), + BPF_EXIT_INSN(), + +Error:: + + 0: (bf) r2 = r10 + 1: (07) r2 += -8 + 2: (b7) r1 = 0x0 + 3: (85) call 1 + invalid indirect read from stack off -8+0 size 8 + +Program that uses invalid map_fd=0 while calling to map_lookup_elem() function:: + + BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_LD_MAP_FD(BPF_REG_1, 0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), +
BPF_EXIT_INSN(), + +Error:: + + 0: (7a) *(u64 *)(r10 -8) = 0 + 1: (bf) r2 = r10 + 2: (07) r2 += -8 + 3: (b7) r1 = 0x0 + 4: (85) call 1 + fd 0 is not pointing to valid bpf_map + +Program that doesn't check return value of map_lookup_elem() before accessing +map element:: + + BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_LD_MAP_FD(BPF_REG_1, 0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), + BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0), + BPF_EXIT_INSN(), + +Error:: + + 0: (7a) *(u64 *)(r10 -8) = 0 + 1: (bf) r2 = r10 + 2: (07) r2 += -8 + 3: (b7) r1 = 0x0 + 4: (85) call 1 + 5: (7a) *(u64 *)(r0 +0) = 0 + R0 invalid mem access 'map_value_or_null' + +Program that correctly checks map_lookup_elem() returned value for NULL, but +accesses the memory with incorrect alignment:: + + BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_LD_MAP_FD(BPF_REG_1, 0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), + BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1), + BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0), + BPF_EXIT_INSN(), + +Error:: + + 0: (7a) *(u64 *)(r10 -8) = 0 + 1: (bf) r2 = r10 + 2: (07) r2 += -8 + 3: (b7) r1 = 1 + 4: (85) call 1 + 5: (15) if r0 == 0x0 goto pc+1 + R0=map_ptr R10=fp + 6: (7a) *(u64 *)(r0 +4) = 0 + misaligned access off 4 size 8 + +Program that correctly checks map_lookup_elem() returned value for NULL and +accesses memory with correct alignment in one side of 'if' branch, but fails +to do so in the other side of 'if' branch:: + + BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_LD_MAP_FD(BPF_REG_1, 0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), + BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2), + BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0), + BPF_EXIT_INSN(), + BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 1), + 
BPF_EXIT_INSN(), + +Error:: + + 0: (7a) *(u64 *)(r10 -8) = 0 + 1: (bf) r2 = r10 + 2: (07) r2 += -8 + 3: (b7) r1 = 1 + 4: (85) call 1 + 5: (15) if r0 == 0x0 goto pc+2 + R0=map_ptr R10=fp + 6: (7a) *(u64 *)(r0 +0) = 0 + 7: (95) exit + + from 5 to 8: R0=imm0 R10=fp + 8: (7a) *(u64 *)(r0 +0) = 1 + R0 invalid mem access 'imm' + +Program that performs a socket lookup then sets the pointer to NULL without +checking it:: + + BPF_MOV64_IMM(BPF_REG_2, 0), + BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_MOV64_IMM(BPF_REG_3, 4), + BPF_MOV64_IMM(BPF_REG_4, 0), + BPF_MOV64_IMM(BPF_REG_5, 0), + BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + +Error:: + + 0: (b7) r2 = 0 + 1: (63) *(u32 *)(r10 -8) = r2 + 2: (bf) r2 = r10 + 3: (07) r2 += -8 + 4: (b7) r3 = 4 + 5: (b7) r4 = 0 + 6: (b7) r5 = 0 + 7: (85) call bpf_sk_lookup_tcp#65 + 8: (b7) r0 = 0 + 9: (95) exit + Unreleased reference id=1, alloc_insn=7 + +Program that performs a socket lookup but does not NULL-check the returned +value:: + + BPF_MOV64_IMM(BPF_REG_2, 0), + BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_MOV64_IMM(BPF_REG_3, 4), + BPF_MOV64_IMM(BPF_REG_4, 0), + BPF_MOV64_IMM(BPF_REG_5, 0), + BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp), + BPF_EXIT_INSN(), + +Error:: + + 0: (b7) r2 = 0 + 1: (63) *(u32 *)(r10 -8) = r2 + 2: (bf) r2 = r10 + 3: (07) r2 += -8 + 4: (b7) r3 = 4 + 5: (b7) r4 = 0 + 6: (b7) r5 = 0 + 7: (85) call bpf_sk_lookup_tcp#65 + 8: (95) exit + Unreleased reference id=1, alloc_insn=7 + +Testing +------- + +Next to the BPF toolchain, the kernel also ships a test module that contains +various test cases for classic and internal BPF that can be executed against +the BPF interpreter and JIT compiler. 
It can be found in lib/test_bpf.c and +enabled via Kconfig:: + + CONFIG_TEST_BPF=m + +After the module has been built and installed, the test suite can be executed +via insmod or modprobe against 'test_bpf' module. Results of the test cases +including timings in nsec can be found in the kernel log (dmesg). + +Misc +---- + +Also trinity, the Linux syscall fuzzer, has built-in support for BPF and +SECCOMP-BPF kernel fuzzing. + +Written by +---------- + +The document was written in the hope that it is found useful and in order +to give potential BPF hackers or security auditors a better overview of +the underlying architecture. + +- Jay Schulist +- Daniel Borkmann +- Alexei Starovoitov diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt deleted file mode 100644 index 2f0f8b17dade..000000000000 --- a/Documentation/networking/filter.txt +++ /dev/null @@ -1,1545 +0,0 @@ -Linux Socket Filtering aka Berkeley Packet Filter (BPF) -======================================================= - -Introduction ------------- - -Linux Socket Filtering (LSF) is derived from the Berkeley Packet Filter. -Though there are some distinct differences between the BSD and Linux -Kernel filtering, but when we speak of BPF or LSF in Linux context, we -mean the very same mechanism of filtering in the Linux kernel. - -BPF allows a user-space program to attach a filter onto any socket and -allow or disallow certain types of data to come through the socket. LSF -follows exactly the same filter code structure as BSD's BPF, so referring -to the BSD bpf.4 manpage is very helpful in creating filters. - -On Linux, BPF is much simpler than on BSD. One does not have to worry -about devices or anything like that. You simply create your filter code, -send it to the kernel via the SO_ATTACH_FILTER option and if your filter -code passes the kernel check on it, you then immediately begin filtering -data on that socket. 
- -You can also detach filters from your socket via the SO_DETACH_FILTER -option. This will probably not be used much since when you close a socket -that has a filter on it the filter is automagically removed. The other -less common case may be adding a different filter on the same socket where -you had another filter that is still running: the kernel takes care of -removing the old one and placing your new one in its place, assuming your -filter has passed the checks, otherwise if it fails the old filter will -remain on that socket. - -SO_LOCK_FILTER option allows to lock the filter attached to a socket. Once -set, a filter cannot be removed or changed. This allows one process to -setup a socket, attach a filter, lock it then drop privileges and be -assured that the filter will be kept until the socket is closed. - -The biggest user of this construct might be libpcap. Issuing a high-level -filter command like `tcpdump -i em1 port 22` passes through the libpcap -internal compiler that generates a structure that can eventually be loaded -via SO_ATTACH_FILTER to the kernel. `tcpdump -i em1 port 22 -ddd` -displays what is being placed into this structure. - -Although we were only speaking about sockets here, BPF in Linux is used -in many more places. There's xt_bpf for netfilter, cls_bpf in the kernel -qdisc layer, SECCOMP-BPF (SECure COMPuting [1]), and lots of other places -such as team driver, PTP code, etc where BPF is being used. - - [1] Documentation/userspace-api/seccomp_filter.rst - -Original BPF paper: - -Steven McCanne and Van Jacobson. 1993. The BSD packet filter: a new -architecture for user-level packet capture. In Proceedings of the -USENIX Winter 1993 Conference Proceedings on USENIX Winter 1993 -Conference Proceedings (USENIX'93). USENIX Association, Berkeley, -CA, USA, 2-2. 
[http://www.tcpdump.org/papers/bpf-usenix93.pdf] - -Structure ---------- - -User space applications include <linux/filter.h> which contains the -following relevant structures: - -struct sock_filter { /* Filter block */ - __u16 code; /* Actual filter code */ - __u8 jt; /* Jump true */ - __u8 jf; /* Jump false */ - __u32 k; /* Generic multiuse field */ -}; - -Such a structure is assembled as an array of 4-tuples, that contains -a code, jt, jf and k value. jt and jf are jump offsets and k a generic -value to be used for a provided code. - -struct sock_fprog { /* Required for SO_ATTACH_FILTER. */ - unsigned short len; /* Number of filter blocks */ - struct sock_filter __user *filter; -}; - -For socket filtering, a pointer to this structure (as shown in -follow-up example) is being passed to the kernel through setsockopt(2). - -Example -------- - -#include <sys/socket.h> -#include <sys/types.h> -#include <arpa/inet.h> -#include <linux/if_ether.h> -/* ... */ - -/* From the example above: tcpdump -i em1 port 22 -dd */ -struct sock_filter code[] = { - { 0x28, 0, 0, 0x0000000c }, - { 0x15, 0, 8, 0x000086dd }, - { 0x30, 0, 0, 0x00000014 }, - { 0x15, 2, 0, 0x00000084 }, - { 0x15, 1, 0, 0x00000006 }, - { 0x15, 0, 17, 0x00000011 }, - { 0x28, 0, 0, 0x00000036 }, - { 0x15, 14, 0, 0x00000016 }, - { 0x28, 0, 0, 0x00000038 }, - { 0x15, 12, 13, 0x00000016 }, - { 0x15, 0, 12, 0x00000800 }, - { 0x30, 0, 0, 0x00000017 }, - { 0x15, 2, 0, 0x00000084 }, - { 0x15, 1, 0, 0x00000006 }, - { 0x15, 0, 8, 0x00000011 }, - { 0x28, 0, 0, 0x00000014 }, - { 0x45, 6, 0, 0x00001fff }, - { 0xb1, 0, 0, 0x0000000e }, - { 0x48, 0, 0, 0x0000000e }, - { 0x15, 2, 0, 0x00000016 }, - { 0x48, 0, 0, 0x00000010 }, - { 0x15, 0, 1, 0x00000016 }, - { 0x06, 0, 0, 0x0000ffff }, - { 0x06, 0, 0, 0x00000000 }, -}; - -struct sock_fprog bpf = { - .len = ARRAY_SIZE(code), - .filter = code, -}; - -sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL)); -if (sock < 0) - /* ... bail out ... */ - -ret = setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf)); -if (ret < 0) - /* ... bail out ...
*/ - -/* ... */ -close(sock); - -The above example code attaches a socket filter for a PF_PACKET socket -in order to let all IPv4/IPv6 packets with port 22 pass. The rest will -be dropped for this socket. - -The setsockopt(2) call to SO_DETACH_FILTER doesn't need any arguments -and SO_LOCK_FILTER for preventing the filter to be detached, takes an -integer value with 0 or 1. - -Note that socket filters are not restricted to PF_PACKET sockets only, -but can also be used on other socket families. - -Summary of system calls: - - * setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_FILTER, &val, sizeof(val)); - * setsockopt(sockfd, SOL_SOCKET, SO_DETACH_FILTER, &val, sizeof(val)); - * setsockopt(sockfd, SOL_SOCKET, SO_LOCK_FILTER, &val, sizeof(val)); - -Normally, most use cases for socket filtering on packet sockets will be -covered by libpcap in high-level syntax, so as an application developer -you should stick to that. libpcap wraps its own layer around all that. - -Unless i) using/linking to libpcap is not an option, ii) the required BPF -filters use Linux extensions that are not supported by libpcap's compiler, -iii) a filter might be more complex and not cleanly implementable with -libpcap's compiler, or iv) particular filter codes should be optimized -differently than libpcap's internal compiler does; then in such cases -writing such a filter "by hand" can be of an alternative. For example, -xt_bpf and cls_bpf users might have requirements that could result in -more complex filter code, or one that cannot be expressed with libpcap -(e.g. different return codes for various code paths). Moreover, BPF JIT -implementors may wish to manually write test cases and thus need low-level -access to BPF code as well. - -BPF engine and instruction set ------------------------------- - -Under tools/bpf/ there's a small helper tool called bpf_asm which can -be used to write low-level filters for example scenarios mentioned in the -previous section. 
Asm-like syntax mentioned here has been implemented in -bpf_asm and will be used for further explanations (instead of dealing with -less readable opcodes directly, principles are the same). The syntax is -closely modelled after Steven McCanne's and Van Jacobson's BPF paper. - -The BPF architecture consists of the following basic elements: - - Element Description - - A 32 bit wide accumulator - X 32 bit wide X register - M[] 16 x 32 bit wide misc registers aka "scratch memory - store", addressable from 0 to 15 - -A program, that is translated by bpf_asm into "opcodes" is an array that -consists of the following elements (as already mentioned): - - op:16, jt:8, jf:8, k:32 - -The element op is a 16 bit wide opcode that has a particular instruction -encoded. jt and jf are two 8 bit wide jump targets, one for condition -"jump if true", the other one "jump if false". Eventually, element k -contains a miscellaneous argument that can be interpreted in different -ways depending on the given instruction in op. - -The instruction set consists of load, store, branch, alu, miscellaneous -and return instructions that are also represented in bpf_asm syntax. This -table lists all bpf_asm instructions available resp. 
what their underlying -opcodes as defined in linux/filter.h stand for: - - Instruction Addressing mode Description - - ld 1, 2, 3, 4, 12 Load word into A - ldi 4 Load word into A - ldh 1, 2 Load half-word into A - ldb 1, 2 Load byte into A - ldx 3, 4, 5, 12 Load word into X - ldxi 4 Load word into X - ldxb 5 Load byte into X - - st 3 Store A into M[] - stx 3 Store X into M[] - - jmp 6 Jump to label - ja 6 Jump to label - jeq 7, 8, 9, 10 Jump on A == <x> - jneq 9, 10 Jump on A != <x> - jne 9, 10 Jump on A != <x> - jlt 9, 10 Jump on A < <x> - jle 9, 10 Jump on A <= <x> - jgt 7, 8, 9, 10 Jump on A > <x> - jge 7, 8, 9, 10 Jump on A >= <x> - jset 7, 8, 9, 10 Jump on A & <x> - - add 0, 4 A + <x> - sub 0, 4 A - <x> - mul 0, 4 A * <x> - div 0, 4 A / <x> - mod 0, 4 A % <x> - neg !A - and 0, 4 A & <x> - or 0, 4 A | <x> - xor 0, 4 A ^ <x> - lsh 0, 4 A << <x> - rsh 0, 4 A >> <x> - - tax Copy A into X - txa Copy X into A - - ret 4, 11 Return - -The next table shows addressing formats from the 2nd column: - - Addressing mode Syntax Description - - 0 x/%x Register X - 1 [k] BHW at byte offset k in the packet - 2 [x + k] BHW at the offset X + k in the packet - 3 M[k] Word at offset k in M[] - 4 #k Literal value stored in k - 5 4*([k]&0xf) Lower nibble * 4 at byte offset k in the packet - 6 L Jump label L - 7 #k,Lt,Lf Jump to Lt if true, otherwise jump to Lf - 8 x/%x,Lt,Lf Jump to Lt if true, otherwise jump to Lf - 9 #k,Lt Jump to Lt if predicate is true - 10 x/%x,Lt Jump to Lt if predicate is true - 11 a/%a Accumulator A - 12 extension BPF extension - -The Linux kernel also has a couple of BPF extensions that are used along -with the class of load instructions by "overloading" the k argument with -a negative offset + a particular extension offset. The result of such BPF -extensions are loaded into A.
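The "overloading" of k described above can be sketched in C. This is a minimal illustration, not kernel code: the EX_-prefixed constants are local stand-ins for what SKF_AD_OFF and a few SKF_AD_* offsets in linux/filter.h are assumed to be, and ext_k() is a hypothetical helper name. Check the values against the header before relying on them.

```c
#include <stdint.h>

/* Local mirrors of SKF_AD_OFF and some SKF_AD_* offsets. The values are
 * assumptions taken from linux/filter.h; verify them against your headers. */
#define EX_SKF_AD_OFF       (-0x1000)
#define EX_SKF_AD_PROTOCOL  0
#define EX_SKF_AD_PKTTYPE   4
#define EX_SKF_AD_IFINDEX   8

/* Hypothetical helper: the 32-bit k value a load instruction would carry
 * for an extension, i.e. the negative base plus the extension's offset. */
uint32_t ext_k(int ad_offset)
{
	return (uint32_t)(EX_SKF_AD_OFF + ad_offset);
}
```

Under these assumptions, ext_k(EX_SKF_AD_PKTTYPE) yields 0xfffff004, a value that no ordinary packet offset would use, which is how the kernel can tell an extension load from a packet load.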
- -Possible BPF extensions are shown in the following table: - - Extension Description - - len skb->len - proto skb->protocol - type skb->pkt_type - poff Payload start offset - ifidx skb->dev->ifindex - nla Netlink attribute of type X with offset A - nlan Nested Netlink attribute of type X with offset A - mark skb->mark - queue skb->queue_mapping - hatype skb->dev->type - rxhash skb->hash - cpu raw_smp_processor_id() - vlan_tci skb_vlan_tag_get(skb) - vlan_avail skb_vlan_tag_present(skb) - vlan_tpid skb->vlan_proto - rand prandom_u32() - -These extensions can also be prefixed with '#'. -Examples for low-level BPF: - -** ARP packets: - - ldh [12] - jne #0x806, drop - ret #-1 - drop: ret #0 - -** IPv4 TCP packets: - - ldh [12] - jne #0x800, drop - ldb [23] - jneq #6, drop - ret #-1 - drop: ret #0 - -** (Accelerated) VLAN w/ id 10: - - ld vlan_tci - jneq #10, drop - ret #-1 - drop: ret #0 - -** icmp random packet sampling, 1 in 4 - ldh [12] - jne #0x800, drop - ldb [23] - jneq #1, drop - # get a random uint32 number - ld rand - mod #4 - jneq #1, drop - ret #-1 - drop: ret #0 - -** SECCOMP filter example: - - ld [4] /* offsetof(struct seccomp_data, arch) */ - jne #0xc000003e, bad /* AUDIT_ARCH_X86_64 */ - ld [0] /* offsetof(struct seccomp_data, nr) */ - jeq #15, good /* __NR_rt_sigreturn */ - jeq #231, good /* __NR_exit_group */ - jeq #60, good /* __NR_exit */ - jeq #0, good /* __NR_read */ - jeq #1, good /* __NR_write */ - jeq #5, good /* __NR_fstat */ - jeq #9, good /* __NR_mmap */ - jeq #14, good /* __NR_rt_sigprocmask */ - jeq #13, good /* __NR_rt_sigaction */ - jeq #35, good /* __NR_nanosleep */ - bad: ret #0 /* SECCOMP_RET_KILL_THREAD */ - good: ret #0x7fff0000 /* SECCOMP_RET_ALLOW */ - -The above example code can be placed into a file (here called "foo"), and -then be passed to the bpf_asm tool for generating opcodes, output that xt_bpf -and cls_bpf understands and can directly be loaded with. 
Example with above -ARP code: - -$ ./bpf_asm foo -4,40 0 0 12,21 0 1 2054,6 0 0 4294967295,6 0 0 0, - -In copy and paste C-like output: - -$ ./bpf_asm -c foo -{ 0x28, 0, 0, 0x0000000c }, -{ 0x15, 0, 1, 0x00000806 }, -{ 0x06, 0, 0, 0xffffffff }, -{ 0x06, 0, 0, 0000000000 }, - -In particular, as usage with xt_bpf or cls_bpf can result in more complex BPF -filters that might not be obvious at first, it's good to test filters before -attaching to a live system. For that purpose, there's a small tool called -bpf_dbg under tools/bpf/ in the kernel source directory. This debugger allows -for testing BPF filters against given pcap files, single stepping through the -BPF code on the pcap's packets and to do BPF machine register dumps. - -Starting bpf_dbg is trivial and just requires issuing: - -# ./bpf_dbg - -In case input and output do not equal stdin/stdout, bpf_dbg takes an -alternative stdin source as a first argument, and an alternative stdout -sink as a second one, e.g. `./bpf_dbg test_in.txt test_out.txt`. - -Other than that, a particular libreadline configuration can be set via -file "~/.bpf_dbg_init" and the command history is stored in the file -"~/.bpf_dbg_history". - -Interaction in bpf_dbg happens through a shell that also has auto-completion -support (follow-up example commands starting with '>' denote bpf_dbg shell). -The usual workflow would be to ... - -> load bpf 6,40 0 0 12,21 0 3 2048,48 0 0 23,21 0 1 1,6 0 0 65535,6 0 0 0 - Loads a BPF filter from standard output of bpf_asm, or transformed via - e.g. `tcpdump -iem1 -ddd port 22 | tr '\n' ','`. Note that for JIT - debugging (next section), this command creates a temporary socket and - loads the BPF code into the kernel. Thus, this will also be useful for - JIT developers. - -> load pcap foo.pcap - Loads standard tcpdump pcap file. - -> run [<n>] -bpf passes:1 fails:9 - Runs through all packets from a pcap to account how many passes and fails - the filter will generate.
A limit of packets to traverse can be given. - -> disassemble -l0: ldh [12] -l1: jeq #0x800, l2, l5 -l2: ldb [23] -l3: jeq #0x1, l4, l5 -l4: ret #0xffff -l5: ret #0 - Prints out BPF code disassembly. - -> dump -/* { op, jt, jf, k }, */ -{ 0x28, 0, 0, 0x0000000c }, -{ 0x15, 0, 3, 0x00000800 }, -{ 0x30, 0, 0, 0x00000017 }, -{ 0x15, 0, 1, 0x00000001 }, -{ 0x06, 0, 0, 0x0000ffff }, -{ 0x06, 0, 0, 0000000000 }, - Prints out C-style BPF code dump. - -> breakpoint 0 -breakpoint at: l0: ldh [12] -> breakpoint 1 -breakpoint at: l1: jeq #0x800, l2, l5 - ... - Sets breakpoints at particular BPF instructions. Issuing a `run` command - will walk through the pcap file continuing from the current packet and - break when a breakpoint is being hit (another `run` will continue from - the currently active breakpoint executing next instructions): - - > run - -- register dump -- - pc: [0] <-- program counter - code: [40] jt[0] jf[0] k[12] <-- plain BPF code of current instruction - curr: l0: ldh [12] <-- disassembly of current instruction - A: [00000000][0] <-- content of A (hex, decimal) - X: [00000000][0] <-- content of X (hex, decimal) - M[0,15]: [00000000][0] <-- folded content of M (hex, decimal) - -- packet dump -- <-- Current packet from pcap (hex) - len: 42 - 0: 00 19 cb 55 55 a4 00 14 a4 43 78 69 08 06 00 01 - 16: 08 00 06 04 00 01 00 14 a4 43 78 69 0a 3b 01 26 - 32: 00 00 00 00 00 00 0a 3b 01 01 - (breakpoint) - > - -> breakpoint -breakpoints: 0 1 - Prints currently set breakpoints. - -> step [-<n>, +<n>] - Performs single stepping through the BPF program from the current pc - offset. Thus, on each step invocation, above register dump is issued. - This can go forwards and backwards in time, a plain `step` will break - on the next BPF instruction, thus +1. (No `run` needs to be issued here.) - -> select <n> - Selects a given packet from the pcap file to continue from. Thus, on - the next `run` or `step`, the BPF program is being evaluated against - the user pre-selected packet.
Numbering starts just as in Wireshark - with index 1. - -> quit -# - Exits bpf_dbg. - -JIT compiler ------------- - -The Linux kernel has a built-in BPF JIT compiler for x86_64, SPARC, -PowerPC, ARM, ARM64, MIPS, RISC-V and s390 and can be enabled through -CONFIG_BPF_JIT. The JIT compiler is transparently invoked for each -attached filter from user space or for internal kernel users if it has -been previously enabled by root: - - echo 1 > /proc/sys/net/core/bpf_jit_enable - -For JIT developers, doing audits etc, each compile run can output the generated -opcode image into the kernel log via: - - echo 2 > /proc/sys/net/core/bpf_jit_enable - -Example output from dmesg: - -[ 3389.935842] flen=6 proglen=70 pass=3 image=ffffffffa0069c8f -[ 3389.935847] JIT code: 00000000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 68 -[ 3389.935849] JIT code: 00000010: 44 2b 4f 6c 4c 8b 87 d8 00 00 00 be 0c 00 00 00 -[ 3389.935850] JIT code: 00000020: e8 1d 94 ff e0 3d 00 08 00 00 75 16 be 17 00 00 -[ 3389.935851] JIT code: 00000030: 00 e8 28 94 ff e0 83 f8 01 75 07 b8 ff ff 00 00 -[ 3389.935852] JIT code: 00000040: eb 02 31 c0 c9 c3 - -When CONFIG_BPF_JIT_ALWAYS_ON is enabled, bpf_jit_enable is permanently set to 1 and -setting any other value than that will return in failure. This is even the case for -setting bpf_jit_enable to 2, since dumping the final JIT image into the kernel log -is discouraged and introspection through bpftool (under tools/bpf/bpftool/) is the -generally recommended approach instead. 
- -In the kernel source tree under tools/bpf/, there's bpf_jit_disasm for -generating disassembly out of the kernel log's hexdump: - -# ./bpf_jit_disasm -70 bytes emitted from JIT compiler (pass:3, flen:6) -ffffffffa0069c8f + : - 0: push %rbp - 1: mov %rsp,%rbp - 4: sub $0x60,%rsp - 8: mov %rbx,-0x8(%rbp) - c: mov 0x68(%rdi),%r9d - 10: sub 0x6c(%rdi),%r9d - 14: mov 0xd8(%rdi),%r8 - 1b: mov $0xc,%esi - 20: callq 0xffffffffe0ff9442 - 25: cmp $0x800,%eax - 2a: jne 0x0000000000000042 - 2c: mov $0x17,%esi - 31: callq 0xffffffffe0ff945e - 36: cmp $0x1,%eax - 39: jne 0x0000000000000042 - 3b: mov $0xffff,%eax - 40: jmp 0x0000000000000044 - 42: xor %eax,%eax - 44: leaveq - 45: retq - -Issuing option `-o` will "annotate" opcodes to resulting assembler -instructions, which can be very useful for JIT developers: - -# ./bpf_jit_disasm -o -70 bytes emitted from JIT compiler (pass:3, flen:6) -ffffffffa0069c8f + : - 0: push %rbp - 55 - 1: mov %rsp,%rbp - 48 89 e5 - 4: sub $0x60,%rsp - 48 83 ec 60 - 8: mov %rbx,-0x8(%rbp) - 48 89 5d f8 - c: mov 0x68(%rdi),%r9d - 44 8b 4f 68 - 10: sub 0x6c(%rdi),%r9d - 44 2b 4f 6c - 14: mov 0xd8(%rdi),%r8 - 4c 8b 87 d8 00 00 00 - 1b: mov $0xc,%esi - be 0c 00 00 00 - 20: callq 0xffffffffe0ff9442 - e8 1d 94 ff e0 - 25: cmp $0x800,%eax - 3d 00 08 00 00 - 2a: jne 0x0000000000000042 - 75 16 - 2c: mov $0x17,%esi - be 17 00 00 00 - 31: callq 0xffffffffe0ff945e - e8 28 94 ff e0 - 36: cmp $0x1,%eax - 83 f8 01 - 39: jne 0x0000000000000042 - 75 07 - 3b: mov $0xffff,%eax - b8 ff ff 00 00 - 40: jmp 0x0000000000000044 - eb 02 - 42: xor %eax,%eax - 31 c0 - 44: leaveq - c9 - 45: retq - c3 - -For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provides a useful -toolchain for developing and testing the kernel's JIT compiler. - -BPF kernel internals --------------------- -Internally, for the kernel interpreter, a different instruction set -format with similar underlying principles from BPF described in previous -paragraphs is being used. 
However, the instruction set format is modelled -closer to the underlying architecture to mimic native instruction sets, so -that a better performance can be achieved (more details later). This new -ISA is called 'eBPF' or 'internal BPF' interchangeably. (Note: eBPF which -originates from [e]xtended BPF is not the same as BPF extensions! While -eBPF is an ISA, BPF extensions date back to classic BPF's 'overloading' -of BPF_LD | BPF_{B,H,W} | BPF_ABS instruction.) - -It is designed to be JITed with one to one mapping, which can also open up -the possibility for GCC/LLVM compilers to generate optimized eBPF code through -an eBPF backend that performs almost as fast as natively compiled code. - -The new instruction set was originally designed with the possible goal in -mind to write programs in "restricted C" and compile into eBPF with a optional -GCC/LLVM backend, so that it can just-in-time map to modern 64-bit CPUs with -minimal performance overhead over two steps, that is, C -> eBPF -> native code. - -Currently, the new format is being used for running user BPF programs, which -includes seccomp BPF, classic socket filters, cls_bpf traffic classifier, -team driver's classifier for its load-balancing mode, netfilter's xt_bpf -extension, PTP dissector/classifier, and much more. They are all internally -converted by the kernel into the new instruction set representation and run -in the eBPF interpreter. For in-kernel handlers, this all works transparently -by using bpf_prog_create() for setting up the filter, resp. -bpf_prog_destroy() for destroying it. The macro -BPF_PROG_RUN(filter, ctx) transparently invokes eBPF interpreter or JITed -code to run the filter. 'filter' is a pointer to struct bpf_prog that we -got from bpf_prog_create(), and 'ctx' the given context (e.g. -skb pointer). All constraints and restrictions from bpf_check_classic() apply -before a conversion to the new layout is being done behind the scenes! 
- -Currently, the classic BPF format is being used for JITing on most -32-bit architectures, whereas x86-64, aarch64, s390x, powerpc64, -sparc64, arm32, riscv64, riscv32 perform JIT compilation from eBPF -instruction set. - -Some core changes of the new internal format: - -- Number of registers increase from 2 to 10: - - The old format had two registers A and X, and a hidden frame pointer. The - new layout extends this to be 10 internal registers and a read-only frame - pointer. Since 64-bit CPUs are passing arguments to functions via registers - the number of args from eBPF program to in-kernel function is restricted - to 5 and one register is used to accept return value from an in-kernel - function. Natively, x86_64 passes first 6 arguments in registers, aarch64/ - sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6 callee saved - registers, and aarch64/sparcv9/mips64 have 11 or more callee saved registers. - - Therefore, eBPF calling convention is defined as: - - * R0 - return value from in-kernel function, and exit value for eBPF program - * R1 - R5 - arguments from eBPF program to in-kernel function - * R6 - R9 - callee saved registers that in-kernel function will preserve - * R10 - read-only frame pointer to access stack - - Thus, all eBPF registers map one to one to HW registers on x86_64, aarch64, - etc, and eBPF calling convention maps directly to ABIs used by the kernel on - 64-bit architectures. - - On 32-bit architectures JIT may map programs that use only 32-bit arithmetic - and may let more complex programs to be interpreted. - - R0 - R5 are scratch registers and eBPF program needs spill/fill them if - necessary across calls. Note that there is only one eBPF program (== one - eBPF main routine) and it cannot call other eBPF functions, it can only - call predefined in-kernel functions, though. 
- -- Register width increases from 32-bit to 64-bit: - - Still, the semantics of the original 32-bit ALU operations are preserved - via 32-bit subregisters. All eBPF registers are 64-bit with 32-bit lower - subregisters that zero-extend into 64-bit if they are being written to. - That behavior maps directly to x86_64 and arm64 subregister definition, but - makes other JITs more difficult. - - 32-bit architectures run 64-bit internal BPF programs via interpreter. - Their JITs may convert BPF programs that only use 32-bit subregisters into - native instruction set and let the rest being interpreted. - - Operation is 64-bit, because on 64-bit architectures, pointers are also - 64-bit wide, and we want to pass 64-bit values in/out of kernel functions, - so 32-bit eBPF registers would otherwise require to define register-pair - ABI, thus, there won't be able to use a direct eBPF register to HW register - mapping and JIT would need to do combine/split/move operations for every - register in and out of the function, which is complex, bug prone and slow. - Another reason is the use of atomic 64-bit counters. - -- Conditional jt/jf targets replaced with jt/fall-through: - - While the original design has constructs such as "if (cond) jump_true; - else jump_false;", they are being replaced into alternative constructs like - "if (cond) jump_true; /* else fall-through */". - -- Introduces bpf_call insn and register passing convention for zero overhead - calls from/to other kernel functions: - - Before an in-kernel function call, the internal BPF program needs to - place function arguments into R1 to R5 registers to satisfy calling - convention, then the interpreter will take them from registers and pass - to in-kernel function. If R1 - R5 registers are mapped to CPU registers - that are used for argument passing on given architecture, the JIT compiler - doesn't need to emit extra moves. 
Function arguments will be in the correct - registers and BPF_CALL instruction will be JITed as single 'call' HW - instruction. This calling convention was picked to cover common call - situations without performance penalty. - - After an in-kernel function call, R1 - R5 are reset to unreadable and R0 has - a return value of the function. Since R6 - R9 are callee saved, their state - is preserved across the call. - - For example, consider three C functions: - - u64 f1() { return (*_f2)(1); } - u64 f2(u64 a) { return f3(a + 1, a); } - u64 f3(u64 a, u64 b) { return a - b; } - - GCC can compile f1, f3 into x86_64: - - f1: - movl $1, %edi - movq _f2(%rip), %rax - jmp *%rax - f3: - movq %rdi, %rax - subq %rsi, %rax - ret - - Function f2 in eBPF may look like: - - f2: - bpf_mov R2, R1 - bpf_add R1, 1 - bpf_call f3 - bpf_exit - - If f2 is JITed and the pointer stored to '_f2'. The calls f1 -> f2 -> f3 and - returns will be seamless. Without JIT, __bpf_prog_run() interpreter needs to - be used to call into f2. - - For practical reasons all eBPF programs have only one argument 'ctx' which is - already placed into R1 (e.g. on __bpf_prog_run() startup) and the programs - can call kernel functions with up to 5 arguments. Calls with 6 or more arguments - are currently not supported, but these restrictions can be lifted if necessary - in the future. - - On 64-bit architectures all register map to HW registers one to one. For - example, x86_64 JIT compiler can map them as ... - - R0 - rax - R1 - rdi - R2 - rsi - R3 - rdx - R4 - rcx - R5 - r8 - R6 - rbx - R7 - r13 - R8 - r14 - R9 - r15 - R10 - rbp - - ... since x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument passing - and rbx, r12 - r15 are callee saved. 
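The register roles and the x86_64 mapping listed above can be written down as a small lookup table. This is only a sketch for illustration: the names ebpf_to_x86_64, ebpf_reg_name(), ebpf_reg_is_arg() and ebpf_reg_is_callee_saved() are hypothetical and are not taken from the kernel's JIT sources; only the table contents come from the mapping above.

```c
#include <stddef.h>

/* eBPF register index (R0..R10) -> x86_64 register, copied from the
 * mapping listed above. */
static const char *const ebpf_to_x86_64[] = {
	"rax", "rdi", "rsi", "rdx", "rcx", "r8",  /* R0, R1-R5 */
	"rbx", "r13", "r14", "r15",               /* R6-R9     */
	"rbp",                                    /* R10       */
};

/* Returns the x86_64 register name, or NULL for an invalid eBPF register. */
const char *ebpf_reg_name(unsigned int reg)
{
	if (reg >= sizeof(ebpf_to_x86_64) / sizeof(ebpf_to_x86_64[0]))
		return NULL;
	return ebpf_to_x86_64[reg];
}

/* R1-R5 carry arguments to in-kernel functions. */
int ebpf_reg_is_arg(unsigned int reg)
{
	return reg >= 1 && reg <= 5;
}

/* R6-R9 are callee saved, so their contents survive a call. */
int ebpf_reg_is_callee_saved(unsigned int reg)
{
	return reg >= 6 && reg <= 9;
}
```

With this table, ebpf_reg_name(1) returns "rdi", the first argument register of the x86_64 ABI, which is why a JITed BPF_CALL needs no extra moves for its arguments.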
- - Then the following internal BPF pseudo-program: - - bpf_mov R6, R1 /* save ctx */ - bpf_mov R2, 2 - bpf_mov R3, 3 - bpf_mov R4, 4 - bpf_mov R5, 5 - bpf_call foo - bpf_mov R7, R0 /* save foo() return value */ - bpf_mov R1, R6 /* restore ctx for next call */ - bpf_mov R2, 6 - bpf_mov R3, 7 - bpf_mov R4, 8 - bpf_mov R5, 9 - bpf_call bar - bpf_add R0, R7 - bpf_exit - - After JIT to x86_64 may look like: - - push %rbp - mov %rsp,%rbp - sub $0x228,%rsp - mov %rbx,-0x228(%rbp) - mov %r13,-0x220(%rbp) - mov %rdi,%rbx - mov $0x2,%esi - mov $0x3,%edx - mov $0x4,%ecx - mov $0x5,%r8d - callq foo - mov %rax,%r13 - mov %rbx,%rdi - mov $0x6,%esi - mov $0x7,%edx - mov $0x8,%ecx - mov $0x9,%r8d - callq bar - add %r13,%rax - mov -0x228(%rbp),%rbx - mov -0x220(%rbp),%r13 - leaveq - retq - - Which is in this example equivalent in C to: - - u64 bpf_filter(u64 ctx) - { - return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9); - } - - In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64 - arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in proper - registers and place their return value into '%rax' which is R0 in eBPF. - Prologue and epilogue are emitted by JIT and are implicit in the - interpreter. R0-R5 are scratch registers, so eBPF program needs to preserve - them across the calls as defined by calling convention. - - For example the following program is invalid: - - bpf_mov R1, 1 - bpf_call foo - bpf_mov R0, R1 - bpf_exit - - After the call the registers R1-R5 contain junk values and cannot be read. - An in-kernel eBPF verifier is used to validate internal BPF programs. - -Also in the new design, eBPF is limited to 4096 insns, which means that any -program will terminate quickly and will only call a fixed number of kernel -functions. Original BPF and the new format are two operand instructions, -which helps to do one-to-one mapping between eBPF insn and x86 insn during JIT. 
- -The input context pointer for invoking the interpreter function is generic, -its content is defined by a specific use case. For seccomp register R1 points -to seccomp_data, for converted BPF filters R1 points to a skb. - -A program, that is translated internally consists of the following elements: - - op:16, jt:8, jf:8, k:32 ==> op:8, dst_reg:4, src_reg:4, off:16, imm:32 - -So far 87 internal BPF instructions were implemented. 8-bit 'op' opcode field -has room for new instructions. Some of them may use 16/24/32 byte encoding. New -instructions must be multiple of 8 bytes to preserve backward compatibility. - -Internal BPF is a general purpose RISC instruction set. Not every register and -every instruction are used during translation from original BPF to new format. -For example, socket filters are not using 'exclusive add' instruction, but -tracing filters may do to maintain counters of events, for example. Register R9 -is not used by socket filters either, but more complex filters may be running -out of registers and would have to resort to spill/fill to stack. - -Internal BPF can be used as a generic assembler for last step performance -optimizations, socket filters and seccomp are using it as assembler. Tracing -filters may use it as assembler to generate code from kernel. In kernel usage -may not be bounded by security considerations, since generated internal BPF code -may be optimizing internal code path and not being exposed to the user space. -Safety of internal BPF can come from a verifier (TBD). In such use cases as -described, it may be used as safe instruction set. - -Just like the original BPF, the new format runs within a controlled environment, -is deterministic and the kernel can easily prove that. The safety of the program -can be determined in two steps: first step does depth-first-search to disallow -loops and other CFG validation; second step starts from the first insn and -descends all possible paths. 
It simulates execution of every insn and observes -the state change of registers and stack. - -eBPF opcode encoding --------------------- - -eBPF is reusing most of the opcode encoding from classic to simplify conversion -of classic BPF to eBPF. For arithmetic and jump instructions the 8-bit 'code' -field is divided into three parts: - - +----------------+--------+--------------------+ - | 4 bits | 1 bit | 3 bits | - | operation code | source | instruction class | - +----------------+--------+--------------------+ - (MSB) (LSB) - -Three LSB bits store instruction class which is one of: - - Classic BPF classes: eBPF classes: - - BPF_LD 0x00 BPF_LD 0x00 - BPF_LDX 0x01 BPF_LDX 0x01 - BPF_ST 0x02 BPF_ST 0x02 - BPF_STX 0x03 BPF_STX 0x03 - BPF_ALU 0x04 BPF_ALU 0x04 - BPF_JMP 0x05 BPF_JMP 0x05 - BPF_RET 0x06 BPF_JMP32 0x06 - BPF_MISC 0x07 BPF_ALU64 0x07 - -When BPF_CLASS(code) == BPF_ALU or BPF_JMP, 4th bit encodes source operand ... - - BPF_K 0x00 - BPF_X 0x08 - - * in classic BPF, this means: - - BPF_SRC(code) == BPF_X - use register X as source operand - BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand - - * in eBPF, this means: - - BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand - BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand - -... and four MSB bits store operation code. 
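The three-part split of the 8-bit 'code' field for arithmetic and jump instructions can be sketched as below. The EX_-prefixed macros are local copies written for this example so that it builds without kernel headers; they are meant to behave like BPF_CLASS(), BPF_SRC() and BPF_OP(), and struct ex_decoded and ex_decode() are hypothetical names.

```c
#include <stdint.h>

/* Illustrative local copies of the field masks described above. */
#define EX_BPF_CLASS(code) ((code) & 0x07)  /* 3 LSB bits: instruction class */
#define EX_BPF_SRC(code)   ((code) & 0x08)  /* 4th bit: source operand       */
#define EX_BPF_OP(code)    ((code) & 0xf0)  /* 4 MSB bits: operation code    */

#define EX_BPF_ALU 0x04
#define EX_BPF_JMP 0x05
#define EX_BPF_K   0x00
#define EX_BPF_X   0x08

struct ex_decoded {
	uint8_t cls;  /* instruction class            */
	uint8_t src;  /* EX_BPF_K or EX_BPF_X         */
	uint8_t op;   /* operation code, still << 4   */
};

/* Split an arithmetic or jump opcode byte into its three fields. */
struct ex_decoded ex_decode(uint8_t code)
{
	struct ex_decoded d = {
		.cls = EX_BPF_CLASS(code),
		.src = EX_BPF_SRC(code),
		.op  = EX_BPF_OP(code),
	};
	return d;
}
```

For instance, 0x15 from the tcpdump listing earlier decodes to class 0x05 (BPF_JMP), source BPF_K and operation 0x10, that is, a jump-if-equal against the immediate in k.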
- -If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of: - - BPF_ADD 0x00 - BPF_SUB 0x10 - BPF_MUL 0x20 - BPF_DIV 0x30 - BPF_OR 0x40 - BPF_AND 0x50 - BPF_LSH 0x60 - BPF_RSH 0x70 - BPF_NEG 0x80 - BPF_MOD 0x90 - BPF_XOR 0xa0 - BPF_MOV 0xb0 /* eBPF only: mov reg to reg */ - BPF_ARSH 0xc0 /* eBPF only: sign extending shift right */ - BPF_END 0xd0 /* eBPF only: endianness conversion */ - -If BPF_CLASS(code) == BPF_JMP or BPF_JMP32 [ in eBPF ], BPF_OP(code) is one of: - - BPF_JA 0x00 /* BPF_JMP only */ - BPF_JEQ 0x10 - BPF_JGT 0x20 - BPF_JGE 0x30 - BPF_JSET 0x40 - BPF_JNE 0x50 /* eBPF only: jump != */ - BPF_JSGT 0x60 /* eBPF only: signed '>' */ - BPF_JSGE 0x70 /* eBPF only: signed '>=' */ - BPF_CALL 0x80 /* eBPF BPF_JMP only: function call */ - BPF_EXIT 0x90 /* eBPF BPF_JMP only: function return */ - BPF_JLT 0xa0 /* eBPF only: unsigned '<' */ - BPF_JLE 0xb0 /* eBPF only: unsigned '<=' */ - BPF_JSLT 0xc0 /* eBPF only: signed '<' */ - BPF_JSLE 0xd0 /* eBPF only: signed '<=' */ - -So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF -and eBPF. There are only two registers in classic BPF, so it means A += X. -In eBPF it means dst_reg = (u32) dst_reg + (u32) src_reg; similarly, -BPF_XOR | BPF_K | BPF_ALU means A ^= imm32 in classic BPF and analogous -src_reg = (u32) src_reg ^ (u32) imm32 in eBPF. - -Classic BPF is using BPF_MISC class to represent A = X and X = A moves. -eBPF is using BPF_MOV | BPF_X | BPF_ALU code instead. Since there are no -BPF_MISC operations in eBPF, the class 7 is used as BPF_ALU64 to mean -exactly the same operations as BPF_ALU, but with 64-bit wide operands -instead. So BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition, i.e.: -dst_reg = dst_reg + src_reg - -Classic BPF wastes the whole BPF_RET class to represent a single 'ret' -operation. Classic BPF_RET | BPF_K means copy imm32 into return register -and perform function exit. 
eBPF is modeled to match CPU, so BPF_JMP | BPF_EXIT -in eBPF means function exit only. The eBPF program needs to store return -value into register R0 before doing a BPF_EXIT. Class 6 in eBPF is used as -BPF_JMP32 to mean exactly the same operations as BPF_JMP, but with 32-bit wide -operands for the comparisons instead. - -For load and store instructions the 8-bit 'code' field is divided as: - - +--------+--------+-------------------+ - | 3 bits | 2 bits | 3 bits | - | mode | size | instruction class | - +--------+--------+-------------------+ - (MSB) (LSB) - -Size modifier is one of ... - - BPF_W 0x00 /* word */ - BPF_H 0x08 /* half word */ - BPF_B 0x10 /* byte */ - BPF_DW 0x18 /* eBPF only, double word */ - -... which encodes size of load/store operation: - - B - 1 byte - H - 2 byte - W - 4 byte - DW - 8 byte (eBPF only) - -Mode modifier is one of: - - BPF_IMM 0x00 /* used for 32-bit mov in classic BPF and 64-bit in eBPF */ - BPF_ABS 0x20 - BPF_IND 0x40 - BPF_MEM 0x60 - BPF_LEN 0x80 /* classic BPF only, reserved in eBPF */ - BPF_MSH 0xa0 /* classic BPF only, reserved in eBPF */ - BPF_XADD 0xc0 /* eBPF only, exclusive add */ - -eBPF has two non-generic instructions: (BPF_ABS | <size> | BPF_LD) and -(BPF_IND | <size> | BPF_LD) which are used to access packet data. - -They had to be carried over from classic to have strong performance of -socket filters running in eBPF interpreter. These instructions can only -be used when interpreter context is a pointer to 'struct sk_buff' and -have seven implicit operands. Register R6 is an implicit input that must -contain pointer to sk_buff. Register R0 is an implicit output which contains -the data fetched from the packet. Registers R1-R5 are scratch registers -and must not be used to store the data across BPF_ABS | BPF_LD or -BPF_IND | BPF_LD instructions. - -These instructions have implicit program exit condition as well.
When -eBPF program is trying to access the data beyond the packet boundary, -the interpreter will abort the execution of the program. JIT compilers -therefore must preserve this property. src_reg and imm32 fields are -explicit inputs to these instructions. - -For example: - - BPF_IND | BPF_W | BPF_LD means: - - R0 = ntohl(*(u32 *) (((struct sk_buff *) R6)->data + src_reg + imm32)) - and R1 - R5 were scratched. - -Unlike classic BPF instruction set, eBPF has generic load/store operations: - -BPF_MEM | | BPF_STX: *(size *) (dst_reg + off) = src_reg -BPF_MEM | | BPF_ST: *(size *) (dst_reg + off) = imm32 -BPF_MEM | | BPF_LDX: dst_reg = *(size *) (src_reg + off) -BPF_XADD | BPF_W | BPF_STX: lock xadd *(u32 *)(dst_reg + off16) += src_reg -BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg - -Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and -2 byte atomic increments are not supported. - -eBPF has one 16-byte instruction: BPF_LD | BPF_DW | BPF_IMM which consists -of two consecutive 'struct bpf_insn' 8-byte blocks and interpreted as single -instruction that loads 64-bit immediate value into a dst_reg. -Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM which loads -32-bit immediate value into a register. - -eBPF verifier -------------- -The safety of the eBPF program is determined in two steps. - -First step does DAG check to disallow loops and other CFG validation. -In particular it will detect programs that have unreachable instructions. -(though classic BPF checker allows them) - -Second step starts from the first insn and descends all possible paths. -It simulates execution of every insn and observes the state change of -registers and stack. - -At the start of the program the register R1 contains a pointer to context -and has type PTR_TO_CTX. -If verifier sees an insn that does R2=R1, then R2 has now type -PTR_TO_CTX as well and can be used on the right hand side of expression. 
-If R1=PTR_TO_CTX and insn is R2=R1+R1, then R2=SCALAR_VALUE, -since addition of two valid pointers makes invalid pointer. -(In 'secure' mode verifier will reject any type of pointer arithmetic to make -sure that kernel addresses don't leak to unprivileged users) - -If register was never written to, it's not readable: - bpf_mov R0 = R2 - bpf_exit -will be rejected, since R2 is unreadable at the start of the program. - -After kernel function call, R1-R5 are reset to unreadable and -R0 has a return type of the function. - -Since R6-R9 are callee saved, their state is preserved across the call. - bpf_mov R6 = 1 - bpf_call foo - bpf_mov R0 = R6 - bpf_exit -is a correct program. If there was R1 instead of R6, it would have -been rejected. - -load/store instructions are allowed only with registers of valid types, which -are PTR_TO_CTX, PTR_TO_MAP, PTR_TO_STACK. They are bounds and alignment checked. -For example: - bpf_mov R1 = 1 - bpf_mov R2 = 2 - bpf_xadd *(u32 *)(R1 + 3) += R2 - bpf_exit -will be rejected, since R1 doesn't have a valid pointer type at the time of -execution of instruction bpf_xadd. - -At the start R1 type is PTR_TO_CTX (a pointer to generic 'struct bpf_context') -A callback is used to customize verifier to restrict eBPF program access to only -certain fields within ctx structure with specified size and alignment. - -For example, the following insn: - bpf_ld R0 = *(u32 *)(R6 + 8) -intends to load a word from address R6 + 8 and store it into R0 -If R6=PTR_TO_CTX, via is_valid_access() callback the verifier will know -that offset 8 of size 4 bytes can be accessed for reading, otherwise -the verifier will reject the program. -If R6=PTR_TO_STACK, then access should be aligned and be within -stack bounds, which are [-MAX_BPF_STACK, 0). In this example offset is 8, -so it will fail verification, since it's out of bounds. - -The verifier will allow eBPF program to read data from stack only after -it wrote into it. 
-Classic BPF verifier does similar check with M[0-15] memory slots. -For example: - bpf_ld R0 = *(u32 *)(R10 - 4) - bpf_exit -is invalid program. -Though R10 is correct read-only register and has type PTR_TO_STACK -and R10 - 4 is within stack bounds, there were no stores into that location. - -Pointer register spill/fill is tracked as well, since four (R6-R9) -callee saved registers may not be enough for some programs. - -Allowed function calls are customized with bpf_verifier_ops->get_func_proto() -The eBPF verifier will check that registers match argument constraints. -After the call register R0 will be set to return type of the function. - -Function calls are the main mechanism to extend functionality of eBPF programs. -Socket filters may let programs call one set of functions, whereas tracing -filters may allow a completely different set. - -If a function is made accessible to an eBPF program, it needs to be thought through -from safety point of view. The verifier will guarantee that the function is -called with valid arguments. - -seccomp vs socket filters have different security restrictions for classic BPF. -Seccomp solves this by two stage verifier: classic BPF verifier is followed -by seccomp verifier. In case of eBPF one configurable verifier is shared for -all use cases. - -See details of eBPF verifier in kernel/bpf/verifier.c - -Register value tracking ------------------------ -In order to determine the safety of an eBPF program, the verifier must track -the range of possible values in each register and also in each stack slot. -This is done with 'struct bpf_reg_state', defined in include/linux/ -bpf_verifier.h, which unifies tracking of scalar and pointer values. Each -register state has a type, which is either NOT_INIT (the register has not been -written to), SCALAR_VALUE (some value which is not usable as a pointer), or a -pointer type. The types of pointers describe their base, as follows: - PTR_TO_CTX Pointer to bpf_context.
- CONST_PTR_TO_MAP Pointer to struct bpf_map. "Const" because arithmetic - on these pointers is forbidden. - PTR_TO_MAP_VALUE Pointer to the value stored in a map element. - PTR_TO_MAP_VALUE_OR_NULL - Either a pointer to a map value, or NULL; map accesses - (see section 'eBPF maps', below) return this type, - which becomes a PTR_TO_MAP_VALUE when checked != NULL. - Arithmetic on these pointers is forbidden. - PTR_TO_STACK Frame pointer. - PTR_TO_PACKET skb->data. - PTR_TO_PACKET_END skb->data + headlen; arithmetic forbidden. - PTR_TO_SOCKET Pointer to struct bpf_sock_ops, implicitly refcounted. - PTR_TO_SOCKET_OR_NULL - Either a pointer to a socket, or NULL; socket lookup - returns this type, which becomes a PTR_TO_SOCKET when - checked != NULL. PTR_TO_SOCKET is reference-counted, - so programs must release the reference through the - socket release function before the end of the program. - Arithmetic on these pointers is forbidden. -However, a pointer may be offset from this base (as a result of pointer -arithmetic), and this is tracked in two parts: the 'fixed offset' and 'variable -offset'. The former is used when an exactly-known value (e.g. an immediate -operand) is added to a pointer, while the latter is used for values which are -not exactly known. The variable offset is also used in SCALAR_VALUEs, to track -the range of possible values in the register. -The verifier's knowledge about the variable offset consists of: -* minimum and maximum values as unsigned -* minimum and maximum values as signed -* knowledge of the values of individual bits, in the form of a 'tnum': a u64 -'mask' and a u64 'value'. 1s in the mask represent bits whose value is unknown; -1s in the value represent bits known to be 1. Bits known to be 0 have 0 in both -mask and value; no bit should ever be 1 in both. 
For example, if a byte is read -into a register from memory, the register's top 56 bits are known zero, while -the low 8 are unknown - which is represented as the tnum (0x0; 0xff). If we -then OR this with 0x40, we get (0x40; 0xbf), then if we add 1 we get (0x0; -0x1ff), because of potential carries. - -Besides arithmetic, the register state can also be updated by conditional -branches. For instance, if a SCALAR_VALUE is compared > 8, in the 'true' branch -it will have a umin_value (unsigned minimum value) of 9, whereas in the 'false' -branch it will have a umax_value of 8. A signed compare (with BPF_JSGT or -BPF_JSGE) would instead update the signed minimum/maximum values. Information -from the signed and unsigned bounds can be combined; for instance if a value is -first tested < 8 and then tested s> 4, the verifier will conclude that the value -is also > 4 and s< 8, since the bounds prevent crossing the sign boundary. - -PTR_TO_PACKETs with a variable offset part have an 'id', which is common to all -pointers sharing that same variable offset. This is important for packet range -checks: after adding a variable to a packet pointer register A, if you then copy -it to another register B and then add a constant 4 to A, both registers will -share the same 'id' but the A will have a fixed offset of +4. Then if A is -bounds-checked and found to be less than a PTR_TO_PACKET_END, the register B is -now known to have a safe range of at least 4 bytes. See 'Direct packet access', -below, for more on PTR_TO_PACKET ranges. - -The 'id' field is also used on PTR_TO_MAP_VALUE_OR_NULL, common to all copies of -the pointer returned from a map lookup. This means that when one copy is -checked and found to be non-NULL, all copies can become PTR_TO_MAP_VALUEs. -As well as range-checking, the tracked information is also used for enforcing -alignment of pointer accesses. For instance, on most systems the packet pointer -is 2 bytes after a 4-byte alignment. 
If a program adds 14 bytes to that to jump -over the Ethernet header, then reads IHL and adds (IHL * 4), the resulting -pointer will have a variable offset known to be 4n+2 for some n, so adding the 2 -bytes (NET_IP_ALIGN) gives a 4-byte alignment and so word-sized accesses through -that pointer are safe. -The 'id' field is also used on PTR_TO_SOCKET and PTR_TO_SOCKET_OR_NULL, common -to all copies of the pointer returned from a socket lookup. This has similar -behaviour to the handling for PTR_TO_MAP_VALUE_OR_NULL->PTR_TO_MAP_VALUE, but -it also handles reference tracking for the pointer. PTR_TO_SOCKET implicitly -represents a reference to the corresponding 'struct sock'. To ensure that the -reference is not leaked, it is imperative to NULL-check the reference and, in -the non-NULL case, pass the valid reference to the socket release function. - -Direct packet access --------------------- -In cls_bpf and act_bpf programs the verifier allows direct access to the packet -data via skb->data and skb->data_end pointers. -Ex: -1: r4 = *(u32 *)(r1 +80) /* load skb->data_end */ -2: r3 = *(u32 *)(r1 +76) /* load skb->data */ -3: r5 = r3 -4: r5 += 14 -5: if r5 > r4 goto pc+16 -R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp -6: r0 = *(u16 *)(r3 +12) /* access 12 and 13 bytes of the packet */ - -This 2-byte load from the packet is safe to do, since the program author -did check 'if (skb->data + 14 > skb->data_end) goto err' at insn #5 which -means that in the fall-through case the register R3 (which points to skb->data) -has at least 14 directly accessible bytes. The verifier marks it -as R3=pkt(id=0,off=0,r=14). -id=0 means that no additional variables were added to the register. -off=0 means that no additional constants were added. -r=14 is the range of safe access which means that bytes [R3, R3 + 14) are ok. -Note that R5 is marked as R5=pkt(id=0,off=14,r=14).
It also points -to the packet data, but constant 14 was added to the register, so -it now points to 'skb->data + 14' and accessible range is [R5, R5 + 14 - 14) -which is zero bytes. - -More complex packet access may look like: - R0=inv1 R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp - 6: r0 = *(u8 *)(r3 +7) /* load 7th byte from the packet */ - 7: r4 = *(u8 *)(r3 +12) - 8: r4 *= 14 - 9: r3 = *(u32 *)(r1 +76) /* load skb->data */ -10: r3 += r4 -11: r2 = r1 -12: r2 <<= 48 -13: r2 >>= 48 -14: r3 += r2 -15: r2 = r3 -16: r2 += 8 -17: r1 = *(u32 *)(r1 +80) /* load skb->data_end */ -18: if r2 > r1 goto pc+2 - R0=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R1=pkt_end R2=pkt(id=2,off=8,r=8) R3=pkt(id=2,off=0,r=8) R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)) R5=pkt(id=0,off=14,r=14) R10=fp -19: r1 = *(u8 *)(r3 +4) -The state of the register R3 is R3=pkt(id=2,off=0,r=8) -id=2 means that two 'r3 += rX' instructions were seen, so r3 points to some -offset within a packet and since the program author did -'if (r3 + 8 > r1) goto err' at insn #18, the safe range is [R3, R3 + 8). -The verifier only allows 'add'/'sub' operations on packet registers. Any other -operation will set the register state to 'SCALAR_VALUE' and it won't be -available for direct packet access. -Operation 'r3 += rX' may overflow and become less than original skb->data, -therefore the verifier has to prevent that. So when it sees 'r3 += rX' -instruction and rX is more than 16-bit value, any subsequent bounds-check of r3 -against skb->data_end will not give us 'range' information, so attempts to read -through the pointer will give "invalid access to packet" error. -Ex. after insn 'r4 = *(u8 *)(r3 +12)' (insn #7 above) the state of r4 is -R4=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) which means that upper 56 bits -of the register are guaranteed to be zero, and nothing is known about the lower -8 bits. 
After insn 'r4 *= 14' the state becomes -R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)), since multiplying an 8-bit -value by constant 14 will keep upper 52 bits as zero, also the least significant -bit will be zero as 14 is even. Similarly 'r2 >>= 48' will make -R2=inv(id=0,umax_value=65535,var_off=(0x0; 0xffff)), since the shift is not sign -extending. This logic is implemented in adjust_reg_min_max_vals() function, -which calls adjust_ptr_min_max_vals() for adding pointer to scalar (or vice -versa) and adjust_scalar_min_max_vals() for operations on two scalars. - -The end result is that a BPF program author can access the packet directly -using normal C code as: - void *data = (void *)(long)skb->data; - void *data_end = (void *)(long)skb->data_end; - struct eth_hdr *eth = data; - struct iphdr *iph = data + sizeof(*eth); - struct udphdr *udp = data + sizeof(*eth) + sizeof(*iph); - - if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*udp) > data_end) - return 0; - if (eth->h_proto != htons(ETH_P_IP)) - return 0; - if (iph->protocol != IPPROTO_UDP || iph->ihl != 5) - return 0; - if (udp->dest == 53 || udp->source == 9) - ...; -which makes such programs easier to write compared to LD_ABS insn -and significantly faster. - -eBPF maps ---------- -'maps' is a generic storage of different types for sharing data between kernel -and userspace.
- -The maps are accessed from user space via BPF syscall, which has commands: -- create a map with given type and attributes - map_fd = bpf(BPF_MAP_CREATE, union bpf_attr *attr, u32 size) - using attr->map_type, attr->key_size, attr->value_size, attr->max_entries - returns process-local file descriptor or negative error - -- lookup key in a given map - err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size) - using attr->map_fd, attr->key, attr->value - returns zero and stores found elem into value or negative error - -- create or update key/value pair in a given map - err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size) - using attr->map_fd, attr->key, attr->value - returns zero or negative error - -- find and delete element by key in a given map - err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size) - using attr->map_fd, attr->key - -- to delete map: close(fd) - Exiting process will delete maps automatically - -userspace programs use this syscall to create/access maps that eBPF programs -are concurrently updating. - -maps can have different types: hash, array, bloom filter, radix-tree, etc. - -The map is defined by: - . type - . max number of elements - . key size in bytes - . value size in bytes - -Pruning -------- -The verifier does not actually walk all possible paths through the program. For -each new branch to analyse, the verifier looks at all the states it's previously -been in when at this instruction. If any of them contain the current state as a -subset, the branch is 'pruned' - that is, the fact that the previous state was -accepted implies the current state would be as well. For instance, if in the -previous state, r1 held a packet-pointer, and in the current state, r1 holds a -packet-pointer with a range as long or longer and at least as strict an -alignment, then r1 is safe. 
Similarly, if r2 was NOT_INIT before then it can't -have been used by any path from that point, so any value in r2 (including -another NOT_INIT) is safe. The implementation is in the function regsafe(). -Pruning considers not only the registers but also the stack (and any spilled -registers it may hold). They must all be safe for the branch to be pruned. -This is implemented in states_equal(). - -Understanding eBPF verifier messages ------------------------------------- - -The following are few examples of invalid eBPF programs and verifier error -messages as seen in the log: - -Program with unreachable instructions: -static struct bpf_insn prog[] = { - BPF_EXIT_INSN(), - BPF_EXIT_INSN(), -}; -Error: - unreachable insn 1 - -Program that reads uninitialized register: - BPF_MOV64_REG(BPF_REG_0, BPF_REG_2), - BPF_EXIT_INSN(), -Error: - 0: (bf) r0 = r2 - R2 !read_ok - -Program that doesn't initialize R0 before exiting: - BPF_MOV64_REG(BPF_REG_2, BPF_REG_1), - BPF_EXIT_INSN(), -Error: - 0: (bf) r2 = r1 - 1: (95) exit - R0 !read_ok - -Program that accesses stack out of bounds: - BPF_ST_MEM(BPF_DW, BPF_REG_10, 8, 0), - BPF_EXIT_INSN(), -Error: - 0: (7a) *(u64 *)(r10 +8) = 0 - invalid stack off=8 size=8 - -Program that doesn't initialize stack before passing its address into function: - BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), - BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), - BPF_LD_MAP_FD(BPF_REG_1, 0), - BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), - BPF_EXIT_INSN(), -Error: - 0: (bf) r2 = r10 - 1: (07) r2 += -8 - 2: (b7) r1 = 0x0 - 3: (85) call 1 - invalid indirect read from stack off -8+0 size 8 - -Program that uses invalid map_fd=0 while calling to map_lookup_elem() function: - BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), - BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), - BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), - BPF_LD_MAP_FD(BPF_REG_1, 0), - BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), - BPF_EXIT_INSN(), -Error: - 0: (7a) *(u64 *)(r10 -8) = 0 - 
1: (bf) r2 = r10 - 2: (07) r2 += -8 - 3: (b7) r1 = 0x0 - 4: (85) call 1 - fd 0 is not pointing to valid bpf_map - -Program that doesn't check return value of map_lookup_elem() before accessing -map element: - BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), - BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), - BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), - BPF_LD_MAP_FD(BPF_REG_1, 0), - BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), - BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0), - BPF_EXIT_INSN(), -Error: - 0: (7a) *(u64 *)(r10 -8) = 0 - 1: (bf) r2 = r10 - 2: (07) r2 += -8 - 3: (b7) r1 = 0x0 - 4: (85) call 1 - 5: (7a) *(u64 *)(r0 +0) = 0 - R0 invalid mem access 'map_value_or_null' - -Program that correctly checks map_lookup_elem() returned value for NULL, but -accesses the memory with incorrect alignment: - BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), - BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), - BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), - BPF_LD_MAP_FD(BPF_REG_1, 0), - BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), - BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1), - BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0), - BPF_EXIT_INSN(), -Error: - 0: (7a) *(u64 *)(r10 -8) = 0 - 1: (bf) r2 = r10 - 2: (07) r2 += -8 - 3: (b7) r1 = 1 - 4: (85) call 1 - 5: (15) if r0 == 0x0 goto pc+1 - R0=map_ptr R10=fp - 6: (7a) *(u64 *)(r0 +4) = 0 - misaligned access off 4 size 8 - -Program that correctly checks map_lookup_elem() returned value for NULL and -accesses memory with correct alignment in one side of 'if' branch, but fails -to do so in the other side of 'if' branch: - BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), - BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), - BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), - BPF_LD_MAP_FD(BPF_REG_1, 0), - BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), - BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2), - BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0), - BPF_EXIT_INSN(), - BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 1), - BPF_EXIT_INSN(), -Error: - 0: (7a) *(u64 *)(r10 -8) = 0 - 1: (bf) r2 = r10 - 2: 
(07) r2 += -8 - 3: (b7) r1 = 1 - 4: (85) call 1 - 5: (15) if r0 == 0x0 goto pc+2 - R0=map_ptr R10=fp - 6: (7a) *(u64 *)(r0 +0) = 0 - 7: (95) exit - - from 5 to 8: R0=imm0 R10=fp - 8: (7a) *(u64 *)(r0 +0) = 1 - R0 invalid mem access 'imm' - -Program that performs a socket lookup then sets the pointer to NULL without -checking it: -value: - BPF_MOV64_IMM(BPF_REG_2, 0), - BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8), - BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), - BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), - BPF_MOV64_IMM(BPF_REG_3, 4), - BPF_MOV64_IMM(BPF_REG_4, 0), - BPF_MOV64_IMM(BPF_REG_5, 0), - BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp), - BPF_MOV64_IMM(BPF_REG_0, 0), - BPF_EXIT_INSN(), -Error: - 0: (b7) r2 = 0 - 1: (63) *(u32 *)(r10 -8) = r2 - 2: (bf) r2 = r10 - 3: (07) r2 += -8 - 4: (b7) r3 = 4 - 5: (b7) r4 = 0 - 6: (b7) r5 = 0 - 7: (85) call bpf_sk_lookup_tcp#65 - 8: (b7) r0 = 0 - 9: (95) exit - Unreleased reference id=1, alloc_insn=7 - -Program that performs a socket lookup but does not NULL-check the returned -value: - BPF_MOV64_IMM(BPF_REG_2, 0), - BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8), - BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), - BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), - BPF_MOV64_IMM(BPF_REG_3, 4), - BPF_MOV64_IMM(BPF_REG_4, 0), - BPF_MOV64_IMM(BPF_REG_5, 0), - BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp), - BPF_EXIT_INSN(), -Error: - 0: (b7) r2 = 0 - 1: (63) *(u32 *)(r10 -8) = r2 - 2: (bf) r2 = r10 - 3: (07) r2 += -8 - 4: (b7) r3 = 4 - 5: (b7) r4 = 0 - 6: (b7) r5 = 0 - 7: (85) call bpf_sk_lookup_tcp#65 - 8: (95) exit - Unreleased reference id=1, alloc_insn=7 - -Testing -------- - -Next to the BPF toolchain, the kernel also ships a test module that contains -various test cases for classic and internal BPF that can be executed against -the BPF interpreter and JIT compiler. 
It can be found in lib/test_bpf.c and -enabled via Kconfig: - - CONFIG_TEST_BPF=m - -After the module has been built and installed, the test suite can be executed -via insmod or modprobe against 'test_bpf' module. Results of the test cases -including timings in nsec can be found in the kernel log (dmesg). - -Misc ----- - -Also trinity, the Linux syscall fuzzer, has built-in support for BPF and -SECCOMP-BPF kernel fuzzing. - -Written by ----------- - -The document was written in the hope that it is found useful and in order -to give potential BPF hackers or security auditors a better overview of -the underlying architecture. - -Jay Schulist -Daniel Borkmann -Alexei Starovoitov diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 807abe25ae4b..144ed838c1a9 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -56,6 +56,7 @@ Contents: driver eql fib_trie + filter .. only:: subproject and html diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt index 999eb41da81d..494614573c67 100644 --- a/Documentation/networking/packet_mmap.txt +++ b/Documentation/networking/packet_mmap.txt @@ -1051,7 +1051,7 @@ for more information on hardware timestamps. 
------------------------------------------------------------------------------- - Packet sockets work well together with Linux socket filters, thus you also - might want to have a look at Documentation/networking/filter.txt + might want to have a look at Documentation/networking/filter.rst -------------------------------------------------------------------------------- + THANKS diff --git a/MAINTAINERS b/MAINTAINERS index 7323bfc1720f..4ec6d2741d36 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -3192,7 +3192,7 @@ Q: https://patchwork.ozlabs.org/project/netdev/list/?delegate=77147 T: git git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git T: git git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git F: Documentation/bpf/ -F: Documentation/networking/filter.txt +F: Documentation/networking/filter.rst F: arch/*/net/* F: include/linux/bpf* F: include/linux/filter.h diff --git a/tools/bpf/bpf_asm.c b/tools/bpf/bpf_asm.c index e5f95e3eede3..0063c3c029e7 100644 --- a/tools/bpf/bpf_asm.c +++ b/tools/bpf/bpf_asm.c @@ -11,7 +11,7 @@ * * How to get into it: * - * 1) read Documentation/networking/filter.txt + * 1) read Documentation/networking/filter.rst * 2) Run `bpf_asm [-c] ` to translate into binary * blob that is loadable with xt_bpf, cls_bpf et al. Note: -c will * pretty print a C-like construct. diff --git a/tools/bpf/bpf_dbg.c b/tools/bpf/bpf_dbg.c index 9d3766e653a9..a0ebcdf59c31 100644 --- a/tools/bpf/bpf_dbg.c +++ b/tools/bpf/bpf_dbg.c @@ -13,7 +13,7 @@ * for making a verdict when multiple simple BPF programs are combined * into one in order to prevent parsing same headers multiple times. * - * More on how to debug BPF opcodes see Documentation/networking/filter.txt + * More on how to debug BPF opcodes see Documentation/networking/filter.rst * which is the main document on BPF. 
Mini howto for getting started: * * 1) `./bpf_dbg` to enter the shell (shell cmds denoted with '>'): -- cgit v1.2.3 From 3c3a2fde4d88bb3d6c0592b4b7754f26dab9f697 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:43 +0200 Subject: docs: networking: convert hinic.txt to ReST Not much to be done here: - add SPDX header; - adjust titles and chapters, adding proper markups; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/hinic.rst | 128 +++++++++++++++++++++++++++++++++++++ Documentation/networking/hinic.txt | 125 ------------------------------------ Documentation/networking/index.rst | 1 + MAINTAINERS | 2 +- 4 files changed, 130 insertions(+), 126 deletions(-) create mode 100644 Documentation/networking/hinic.rst delete mode 100644 Documentation/networking/hinic.txt (limited to 'MAINTAINERS') diff --git a/Documentation/networking/hinic.rst b/Documentation/networking/hinic.rst new file mode 100644 index 000000000000..867ac8f4e04a --- /dev/null +++ b/Documentation/networking/hinic.rst @@ -0,0 +1,128 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================================================ +Linux Kernel Driver for Huawei Intelligent NIC(HiNIC) family +============================================================ + +Overview: +========= +HiNIC is a network interface card for the Data Center Area. + +The driver supports a range of link-speed devices (10GbE, 25GbE, 40GbE, etc.). +The driver supports also a negotiated and extendable feature set. + +Some HiNIC devices support SR-IOV. This driver is used for Physical Function +(PF). + +HiNIC devices support MSI-X interrupt vector for each Tx/Rx queue and +adaptive interrupt moderation. 
+ +HiNIC devices support also various offload features such as checksum offload, +TCP Transmit Segmentation Offload(TSO), Receive-Side Scaling(RSS) and +LRO(Large Receive Offload). + + +Supported PCI vendor ID/device IDs: +=================================== + +19e5:1822 - HiNIC PF + + +Driver Architecture and Source Code: +==================================== + +hinic_dev - Implement a Logical Network device that is independent from +specific HW details about HW data structure formats. + +hinic_hwdev - Implement the HW details of the device and include the components +for accessing the PCI NIC. + +hinic_hwdev contains the following components: +=============================================== + +HW Interface: +============= + +The interface for accessing the pci device (DMA memory and PCI BARs). +(hinic_hw_if.c, hinic_hw_if.h) + +Configuration Status Registers Area that describes the HW Registers on the +configuration and status BAR0. (hinic_hw_csr.h) + +MGMT components: +================ + +Asynchronous Event Queues(AEQs) - The event queues for receiving messages from +the MGMT modules on the cards. (hinic_hw_eqs.c, hinic_hw_eqs.h) + +Application Programmable Interface commands(API CMD) - Interface for sending +MGMT commands to the card. (hinic_hw_api_cmd.c, hinic_hw_api_cmd.h) + +Management (MGMT) - the PF to MGMT channel that uses API CMD for sending MGMT +commands to the card and receives notifications from the MGMT modules on the +card by AEQs. Also set the addresses of the IO CMDQs in HW. +(hinic_hw_mgmt.c, hinic_hw_mgmt.h) + +IO components: +============== + +Completion Event Queues(CEQs) - The completion Event Queues that describe IO +tasks that are finished. (hinic_hw_eqs.c, hinic_hw_eqs.h) + +Work Queues(WQ) - Contain the memory and operations for use by CMD queues and +the Queue Pairs. The WQ is a Memory Block in a Page. The Block contains +pointers to Memory Areas that are the Memory for the Work Queue Elements(WQEs). 
+(hinic_hw_wq.c, hinic_hw_wq.h) + +Command Queues(CMDQ) - The queues for sending commands for IO management and is +used to set the QPs addresses in HW. The commands completion events are +accumulated on the CEQ that is configured to receive the CMDQ completion events. +(hinic_hw_cmdq.c, hinic_hw_cmdq.h) + +Queue Pairs(QPs) - The HW Receive and Send queues for Receiving and Transmitting +Data. (hinic_hw_qp.c, hinic_hw_qp.h, hinic_hw_qp_ctxt.h) + +IO - de/constructs all the IO components. (hinic_hw_io.c, hinic_hw_io.h) + +HW device: +========== + +HW device - de/constructs the HW Interface, the MGMT components on the +initialization of the driver and the IO components on the case of Interface +UP/DOWN Events. (hinic_hw_dev.c, hinic_hw_dev.h) + + +hinic_dev contains the following components: +=============================================== + +PCI ID table - Contains the supported PCI Vendor/Device IDs. +(hinic_pci_tbl.h) + +Port Commands - Send commands to the HW device for port management +(MAC, Vlan, MTU, ...). (hinic_port.c, hinic_port.h) + +Tx Queues - Logical Tx Queues that use the HW Send Queues for transmit. +The Logical Tx queue is not dependent on the format of the HW Send Queue. +(hinic_tx.c, hinic_tx.h) + +Rx Queues - Logical Rx Queues that use the HW Receive Queues for receive. +The Logical Rx queue is not dependent on the format of the HW Receive Queue. +(hinic_rx.c, hinic_rx.h) + +hinic_dev - de/constructs the Logical Tx and Rx Queues. +(hinic_main.c, hinic_dev.h) + + +Miscellaneous +============= + +Common functions that are used by HW and Logical Device. +(hinic_common.c, hinic_common.h) + + +Support +======= + +If an issue is identified with the released source code on the supported kernel +with a supported adapter, email the specific information related to the issue to +aviad.krawczyk@huawei.com. 
diff --git a/Documentation/networking/hinic.txt b/Documentation/networking/hinic.txt deleted file mode 100644 index 989366a4039c..000000000000 --- a/Documentation/networking/hinic.txt +++ /dev/null @@ -1,125 +0,0 @@ -Linux Kernel Driver for Huawei Intelligent NIC(HiNIC) family -============================================================ - -Overview: -========= -HiNIC is a network interface card for the Data Center Area. - -The driver supports a range of link-speed devices (10GbE, 25GbE, 40GbE, etc.). -The driver supports also a negotiated and extendable feature set. - -Some HiNIC devices support SR-IOV. This driver is used for Physical Function -(PF). - -HiNIC devices support MSI-X interrupt vector for each Tx/Rx queue and -adaptive interrupt moderation. - -HiNIC devices support also various offload features such as checksum offload, -TCP Transmit Segmentation Offload(TSO), Receive-Side Scaling(RSS) and -LRO(Large Receive Offload). - - -Supported PCI vendor ID/device IDs: -=================================== - -19e5:1822 - HiNIC PF - - -Driver Architecture and Source Code: -==================================== - -hinic_dev - Implement a Logical Network device that is independent from -specific HW details about HW data structure formats. - -hinic_hwdev - Implement the HW details of the device and include the components -for accessing the PCI NIC. - -hinic_hwdev contains the following components: -=============================================== - -HW Interface: -============= - -The interface for accessing the pci device (DMA memory and PCI BARs). -(hinic_hw_if.c, hinic_hw_if.h) - -Configuration Status Registers Area that describes the HW Registers on the -configuration and status BAR0. (hinic_hw_csr.h) - -MGMT components: -================ - -Asynchronous Event Queues(AEQs) - The event queues for receiving messages from -the MGMT modules on the cards. 
(hinic_hw_eqs.c, hinic_hw_eqs.h) - -Application Programmable Interface commands(API CMD) - Interface for sending -MGMT commands to the card. (hinic_hw_api_cmd.c, hinic_hw_api_cmd.h) - -Management (MGMT) - the PF to MGMT channel that uses API CMD for sending MGMT -commands to the card and receives notifications from the MGMT modules on the -card by AEQs. Also set the addresses of the IO CMDQs in HW. -(hinic_hw_mgmt.c, hinic_hw_mgmt.h) - -IO components: -============== - -Completion Event Queues(CEQs) - The completion Event Queues that describe IO -tasks that are finished. (hinic_hw_eqs.c, hinic_hw_eqs.h) - -Work Queues(WQ) - Contain the memory and operations for use by CMD queues and -the Queue Pairs. The WQ is a Memory Block in a Page. The Block contains -pointers to Memory Areas that are the Memory for the Work Queue Elements(WQEs). -(hinic_hw_wq.c, hinic_hw_wq.h) - -Command Queues(CMDQ) - The queues for sending commands for IO management and is -used to set the QPs addresses in HW. The commands completion events are -accumulated on the CEQ that is configured to receive the CMDQ completion events. -(hinic_hw_cmdq.c, hinic_hw_cmdq.h) - -Queue Pairs(QPs) - The HW Receive and Send queues for Receiving and Transmitting -Data. (hinic_hw_qp.c, hinic_hw_qp.h, hinic_hw_qp_ctxt.h) - -IO - de/constructs all the IO components. (hinic_hw_io.c, hinic_hw_io.h) - -HW device: -========== - -HW device - de/constructs the HW Interface, the MGMT components on the -initialization of the driver and the IO components on the case of Interface -UP/DOWN Events. (hinic_hw_dev.c, hinic_hw_dev.h) - - -hinic_dev contains the following components: -=============================================== - -PCI ID table - Contains the supported PCI Vendor/Device IDs. -(hinic_pci_tbl.h) - -Port Commands - Send commands to the HW device for port management -(MAC, Vlan, MTU, ...). (hinic_port.c, hinic_port.h) - -Tx Queues - Logical Tx Queues that use the HW Send Queues for transmit. 
-The Logical Tx queue is not dependent on the format of the HW Send Queue. -(hinic_tx.c, hinic_tx.h) - -Rx Queues - Logical Rx Queues that use the HW Receive Queues for receive. -The Logical Rx queue is not dependent on the format of the HW Receive Queue. -(hinic_rx.c, hinic_rx.h) - -hinic_dev - de/constructs the Logical Tx and Rx Queues. -(hinic_main.c, hinic_dev.h) - - -Miscellaneous: -============= - -Common functions that are used by HW and Logical Device. -(hinic_common.c, hinic_common.h) - - -Support -======= - -If an issue is identified with the released source code on the supported kernel -with a supported adapter, email the specific information related to the issue to -aviad.krawczyk@huawei.com. diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index b29a08d1f941..5a7889df1375 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -63,6 +63,7 @@ Contents: generic_netlink gen_stats gtp + hinic .. only:: subproject and html diff --git a/MAINTAINERS b/MAINTAINERS index 4ec6d2741d36..df5e4ccc1ccb 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -7815,7 +7815,7 @@ HUAWEI ETHERNET DRIVER M: Aviad Krawczyk L: netdev@vger.kernel.org S: Supported -F: Documentation/networking/hinic.txt +F: Documentation/networking/hinic.rst F: drivers/net/ethernet/huawei/hinic/ HUGETLB FILESYSTEM -- cgit v1.2.3 From 82a07bf33d7d0c3a194f62178e0fea2d68227b89 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:52 +0200 Subject: docs: networking: convert ipvs-sysctl.txt to ReST - add SPDX header; - add a document title; - mark lists as such; - mark code blocks and literals as such; - adjust indentation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Acked-by: Simon Horman Signed-off-by: David S.
Miller --- Documentation/admin-guide/sysctl/net.rst | 4 +- Documentation/networking/index.rst | 1 + Documentation/networking/ipvs-sysctl.rst | 302 +++++++++++++++++++++++++++++++ Documentation/networking/ipvs-sysctl.txt | 294 ------------------------------ MAINTAINERS | 2 +- 5 files changed, 306 insertions(+), 297 deletions(-) create mode 100644 Documentation/networking/ipvs-sysctl.rst delete mode 100644 Documentation/networking/ipvs-sysctl.txt (limited to 'MAINTAINERS') diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst index 84e3348a9543..2ad1b77a7182 100644 --- a/Documentation/admin-guide/sysctl/net.rst +++ b/Documentation/admin-guide/sysctl/net.rst @@ -353,8 +353,8 @@ socket's buffer. It will not take effect unless PF_UNIX flag is specified. 3. /proc/sys/net/ipv4 - IPV4 settings ------------------------------------- -Please see: Documentation/networking/ip-sysctl.rst and ipvs-sysctl.txt for -descriptions of these entries. +Please see: Documentation/networking/ip-sysctl.rst and +Documentation/admin-guide/sysctl/net.rst for descriptions of these entries. 4. Appletalk diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 54dee1575b54..bbd4e0041457 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -72,6 +72,7 @@ Contents: ip-sysctl ipv6 ipvlan + ipvs-sysctl .. only:: subproject and html diff --git a/Documentation/networking/ipvs-sysctl.rst b/Documentation/networking/ipvs-sysctl.rst new file mode 100644 index 000000000000..be36c4600e8f --- /dev/null +++ b/Documentation/networking/ipvs-sysctl.rst @@ -0,0 +1,302 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========== +IPvs-sysctl +=========== + +/proc/sys/net/ipv4/vs/* Variables: +================================== + +am_droprate - INTEGER + default 10 + + It sets the always mode drop rate, which is used in the mode 3 + of the drop_rate defense. 
+ +amemthresh - INTEGER + default 1024 + + It sets the available memory threshold (in pages), which is + used in the automatic modes of defense. When there is no + enough available memory, the respective strategy will be + enabled and the variable is automatically set to 2, otherwise + the strategy is disabled and the variable is set to 1. + +backup_only - BOOLEAN + - 0 - disabled (default) + - not 0 - enabled + + If set, disable the director function while the server is + in backup mode to avoid packet loops for DR/TUN methods. + +conn_reuse_mode - INTEGER + 1 - default + + Controls how ipvs will deal with connections that are detected + port reuse. It is a bitmap, with the values being: + + 0: disable any special handling on port reuse. The new + connection will be delivered to the same real server that was + servicing the previous connection. This will effectively + disable expire_nodest_conn. + + bit 1: enable rescheduling of new connections when it is safe. + That is, whenever expire_nodest_conn and for TCP sockets, when + the connection is in TIME_WAIT state (which is only possible if + you use NAT mode). + + bit 2: it is bit 1 plus, for TCP connections, when connections + are in FIN_WAIT state, as this is the last state seen by load + balancer in Direct Routing mode. This bit helps on adding new + real servers to a very busy cluster. + +conntrack - BOOLEAN + - 0 - disabled (default) + - not 0 - enabled + + If set, maintain connection tracking entries for + connections handled by IPVS. + + This should be enabled if connections handled by IPVS are to be + also handled by stateful firewall rules. That is, iptables rules + that make use of connection tracking. It is a performance + optimisation to disable this setting otherwise. + + Connections handled by the IPVS FTP application module + will have connection tracking entries regardless of this setting. + + Only available when IPVS is compiled with CONFIG_IP_VS_NFCT enabled. 
+ +cache_bypass - BOOLEAN + - 0 - disabled (default) + - not 0 - enabled + + If it is enabled, forward packets to the original destination + directly when no cache server is available and destination + address is not local (iph->daddr is RTN_UNICAST). It is mostly + used in transparent web cache cluster. + +debug_level - INTEGER + - 0 - transmission error messages (default) + - 1 - non-fatal error messages + - 2 - configuration + - 3 - destination trash + - 4 - drop entry + - 5 - service lookup + - 6 - scheduling + - 7 - connection new/expire, lookup and synchronization + - 8 - state transition + - 9 - binding destination, template checks and applications + - 10 - IPVS packet transmission + - 11 - IPVS packet handling (ip_vs_in/ip_vs_out) + - 12 or more - packet traversal + + Only available when IPVS is compiled with CONFIG_IP_VS_DEBUG enabled. + + Higher debugging levels include the messages for lower debugging + levels, so setting debug level 2, includes level 0, 1 and 2 + messages. Thus, logging becomes more and more verbose the higher + the level. + +drop_entry - INTEGER + - 0 - disabled (default) + + The drop_entry defense is to randomly drop entries in the + connection hash table, just in order to collect back some + memory for new connections. In the current code, the + drop_entry procedure can be activated every second, then it + randomly scans 1/32 of the whole and drops entries that are in + the SYN-RECV/SYNACK state, which should be effective against + syn-flooding attack. + + The valid values of drop_entry are from 0 to 3, where 0 means + that this strategy is always disabled, 1 and 2 mean automatic + modes (when there is no enough available memory, the strategy + is enabled and the variable is automatically set to 2, + otherwise the strategy is disabled and the variable is set to + 1), and 3 means that that the strategy is always enabled. 
+ +drop_packet - INTEGER + - 0 - disabled (default) + + The drop_packet defense is designed to drop 1/rate packets + before forwarding them to real servers. If the rate is 1, then + drop all the incoming packets. + + The value definition is the same as that of the drop_entry. In + the automatic mode, the rate is determined by the follow + formula: rate = amemthresh / (amemthresh - available_memory) + when available memory is less than the available memory + threshold. When the mode 3 is set, the always mode drop rate + is controlled by the /proc/sys/net/ipv4/vs/am_droprate. + +expire_nodest_conn - BOOLEAN + - 0 - disabled (default) + - not 0 - enabled + + The default value is 0, the load balancer will silently drop + packets when its destination server is not available. It may + be useful, when user-space monitoring program deletes the + destination server (because of server overload or wrong + detection) and add back the server later, and the connections + to the server can continue. + + If this feature is enabled, the load balancer will expire the + connection immediately when a packet arrives and its + destination server is not available, then the client program + will be notified that the connection is closed. This is + equivalent to the feature some people requires to flush + connections when its destination is not available. + +expire_quiescent_template - BOOLEAN + - 0 - disabled (default) + - not 0 - enabled + + When set to a non-zero value, the load balancer will expire + persistent templates when the destination server is quiescent. + This may be useful, when a user makes a destination server + quiescent by setting its weight to 0 and it is desired that + subsequent otherwise persistent connections are sent to a + different destination server. By default new persistent + connections are allowed to quiescent destination servers. 
+ + If this feature is enabled, the load balancer will expire the + persistence template if it is to be used to schedule a new + connection and the destination server is quiescent. + +ignore_tunneled - BOOLEAN + - 0 - disabled (default) + - not 0 - enabled + + If set, ipvs will set the ipvs_property on all packets which are of + unrecognized protocols. This prevents us from routing tunneled + protocols like ipip, which is useful to prevent rescheduling + packets that have been tunneled to the ipvs host (i.e. to prevent + ipvs routing loops when ipvs is also acting as a real server). + +nat_icmp_send - BOOLEAN + - 0 - disabled (default) + - not 0 - enabled + + It controls sending icmp error messages (ICMP_DEST_UNREACH) + for VS/NAT when the load balancer receives packets from real + servers but the connection entries don't exist. + +pmtu_disc - BOOLEAN + - 0 - disabled + - not 0 - enabled (default) + + By default, reject with FRAG_NEEDED all DF packets that exceed + the PMTU, irrespective of the forwarding method. For TUN method + the flag can be disabled to fragment such packets. + +secure_tcp - INTEGER + - 0 - disabled (default) + + The secure_tcp defense is to use a more complicated TCP state + transition table. For VS/NAT, it also delays entering the + TCP ESTABLISHED state until the three way handshake is completed. + + The value definition is the same as that of drop_entry and + drop_packet. + +sync_threshold - vector of 2 INTEGERs: sync_threshold, sync_period + default 3 50 + + It sets synchronization threshold, which is the minimum number + of incoming packets that a connection needs to receive before + the connection will be synchronized. A connection will be + synchronized, every time the number of its incoming packets + modulus sync_period equals the threshold. The range of the + threshold is from 0 to sync_period. 
+ + When sync_period and sync_refresh_period are 0, send sync only + for state changes or only once when pkts matches sync_threshold + +sync_refresh_period - UNSIGNED INTEGER + default 0 + + In seconds, difference in reported connection timer that triggers + new sync message. It can be used to avoid sync messages for the + specified period (or half of the connection timeout if it is lower) + if connection state is not changed since last sync. + + This is useful for normal connections with high traffic to reduce + sync rate. Additionally, retry sync_retries times with period of + sync_refresh_period/8. + +sync_retries - INTEGER + default 0 + + Defines sync retries with period of sync_refresh_period/8. Useful + to protect against loss of sync messages. The range of the + sync_retries is from 0 to 3. + +sync_qlen_max - UNSIGNED LONG + + Hard limit for queued sync messages that are not sent yet. It + defaults to 1/32 of the memory pages but actually represents + number of messages. It will protect us from allocating large + parts of memory when the sending rate is lower than the queuing + rate. + +sync_sock_size - INTEGER + default 0 + + Configuration of SNDBUF (master) or RCVBUF (slave) socket limit. + Default value is 0 (preserve system defaults). + +sync_ports - INTEGER + default 1 + + The number of threads that master and backup servers can use for + sync traffic. Every thread will use single UDP port, thread 0 will + use the default port 8848 while last thread will use port + 8848+sync_ports-1. + +snat_reroute - BOOLEAN + - 0 - disabled + - not 0 - enabled (default) + + If enabled, recalculate the route of SNATed packets from + realservers so that they are routed as if they originate from the + director. Otherwise they are routed as if they are forwarded by the + director. + + If policy routing is in effect then it is possible that the route + of a packet originating from a director is routed differently to a + packet being forwarded by the director. 
+ + If policy routing is not in effect then the recalculated route will + always be the same as the original route so it is an optimisation + to disable snat_reroute and avoid the recalculation. + +sync_persist_mode - INTEGER + default 0 + + Controls the synchronisation of connections when using persistence + + 0: All types of connections are synchronised + + 1: Attempt to reduce the synchronisation traffic depending on + the connection type. For persistent services avoid synchronisation + for normal connections, do it only for persistence templates. + In such case, for TCP and SCTP it may need enabling sloppy_tcp and + sloppy_sctp flags on backup servers. For non-persistent services + such optimization is not applied, mode 0 is assumed. + +sync_version - INTEGER + default 1 + + The version of the synchronisation protocol used when sending + synchronisation messages. + + 0 selects the original synchronisation protocol (version 0). This + should be used when sending synchronisation messages to a legacy + system that only understands the original synchronisation protocol. + + 1 selects the current synchronisation protocol (version 1). This + should be used where possible. + + Kernels with this sync_version entry are able to receive messages + of both version 1 and version 2 of the synchronisation protocol. diff --git a/Documentation/networking/ipvs-sysctl.txt b/Documentation/networking/ipvs-sysctl.txt deleted file mode 100644 index 056898685d40..000000000000 --- a/Documentation/networking/ipvs-sysctl.txt +++ /dev/null @@ -1,294 +0,0 @@ -/proc/sys/net/ipv4/vs/* Variables: - -am_droprate - INTEGER - default 10 - - It sets the always mode drop rate, which is used in the mode 3 - of the drop_rate defense. - -amemthresh - INTEGER - default 1024 - - It sets the available memory threshold (in pages), which is - used in the automatic modes of defense. 
When there is no - enough available memory, the respective strategy will be - enabled and the variable is automatically set to 2, otherwise - the strategy is disabled and the variable is set to 1. - -backup_only - BOOLEAN - 0 - disabled (default) - not 0 - enabled - - If set, disable the director function while the server is - in backup mode to avoid packet loops for DR/TUN methods. - -conn_reuse_mode - INTEGER - 1 - default - - Controls how ipvs will deal with connections that are detected - port reuse. It is a bitmap, with the values being: - - 0: disable any special handling on port reuse. The new - connection will be delivered to the same real server that was - servicing the previous connection. This will effectively - disable expire_nodest_conn. - - bit 1: enable rescheduling of new connections when it is safe. - That is, whenever expire_nodest_conn and for TCP sockets, when - the connection is in TIME_WAIT state (which is only possible if - you use NAT mode). - - bit 2: it is bit 1 plus, for TCP connections, when connections - are in FIN_WAIT state, as this is the last state seen by load - balancer in Direct Routing mode. This bit helps on adding new - real servers to a very busy cluster. - -conntrack - BOOLEAN - 0 - disabled (default) - not 0 - enabled - - If set, maintain connection tracking entries for - connections handled by IPVS. - - This should be enabled if connections handled by IPVS are to be - also handled by stateful firewall rules. That is, iptables rules - that make use of connection tracking. It is a performance - optimisation to disable this setting otherwise. - - Connections handled by the IPVS FTP application module - will have connection tracking entries regardless of this setting. - - Only available when IPVS is compiled with CONFIG_IP_VS_NFCT enabled. 
- -cache_bypass - BOOLEAN - 0 - disabled (default) - not 0 - enabled - - If it is enabled, forward packets to the original destination - directly when no cache server is available and destination - address is not local (iph->daddr is RTN_UNICAST). It is mostly - used in transparent web cache cluster. - -debug_level - INTEGER - 0 - transmission error messages (default) - 1 - non-fatal error messages - 2 - configuration - 3 - destination trash - 4 - drop entry - 5 - service lookup - 6 - scheduling - 7 - connection new/expire, lookup and synchronization - 8 - state transition - 9 - binding destination, template checks and applications - 10 - IPVS packet transmission - 11 - IPVS packet handling (ip_vs_in/ip_vs_out) - 12 or more - packet traversal - - Only available when IPVS is compiled with CONFIG_IP_VS_DEBUG enabled. - - Higher debugging levels include the messages for lower debugging - levels, so setting debug level 2, includes level 0, 1 and 2 - messages. Thus, logging becomes more and more verbose the higher - the level. - -drop_entry - INTEGER - 0 - disabled (default) - - The drop_entry defense is to randomly drop entries in the - connection hash table, just in order to collect back some - memory for new connections. In the current code, the - drop_entry procedure can be activated every second, then it - randomly scans 1/32 of the whole and drops entries that are in - the SYN-RECV/SYNACK state, which should be effective against - syn-flooding attack. - - The valid values of drop_entry are from 0 to 3, where 0 means - that this strategy is always disabled, 1 and 2 mean automatic - modes (when there is no enough available memory, the strategy - is enabled and the variable is automatically set to 2, - otherwise the strategy is disabled and the variable is set to - 1), and 3 means that that the strategy is always enabled. 
- -drop_packet - INTEGER - 0 - disabled (default) - - The drop_packet defense is designed to drop 1/rate packets - before forwarding them to real servers. If the rate is 1, then - drop all the incoming packets. - - The value definition is the same as that of the drop_entry. In - the automatic mode, the rate is determined by the follow - formula: rate = amemthresh / (amemthresh - available_memory) - when available memory is less than the available memory - threshold. When the mode 3 is set, the always mode drop rate - is controlled by the /proc/sys/net/ipv4/vs/am_droprate. - -expire_nodest_conn - BOOLEAN - 0 - disabled (default) - not 0 - enabled - - The default value is 0, the load balancer will silently drop - packets when its destination server is not available. It may - be useful, when user-space monitoring program deletes the - destination server (because of server overload or wrong - detection) and add back the server later, and the connections - to the server can continue. - - If this feature is enabled, the load balancer will expire the - connection immediately when a packet arrives and its - destination server is not available, then the client program - will be notified that the connection is closed. This is - equivalent to the feature some people requires to flush - connections when its destination is not available. - -expire_quiescent_template - BOOLEAN - 0 - disabled (default) - not 0 - enabled - - When set to a non-zero value, the load balancer will expire - persistent templates when the destination server is quiescent. - This may be useful, when a user makes a destination server - quiescent by setting its weight to 0 and it is desired that - subsequent otherwise persistent connections are sent to a - different destination server. By default new persistent - connections are allowed to quiescent destination servers. 
- - If this feature is enabled, the load balancer will expire the - persistence template if it is to be used to schedule a new - connection and the destination server is quiescent. - -ignore_tunneled - BOOLEAN - 0 - disabled (default) - not 0 - enabled - - If set, ipvs will set the ipvs_property on all packets which are of - unrecognized protocols. This prevents us from routing tunneled - protocols like ipip, which is useful to prevent rescheduling - packets that have been tunneled to the ipvs host (i.e. to prevent - ipvs routing loops when ipvs is also acting as a real server). - -nat_icmp_send - BOOLEAN - 0 - disabled (default) - not 0 - enabled - - It controls sending icmp error messages (ICMP_DEST_UNREACH) - for VS/NAT when the load balancer receives packets from real - servers but the connection entries don't exist. - -pmtu_disc - BOOLEAN - 0 - disabled - not 0 - enabled (default) - - By default, reject with FRAG_NEEDED all DF packets that exceed - the PMTU, irrespective of the forwarding method. For TUN method - the flag can be disabled to fragment such packets. - -secure_tcp - INTEGER - 0 - disabled (default) - - The secure_tcp defense is to use a more complicated TCP state - transition table. For VS/NAT, it also delays entering the - TCP ESTABLISHED state until the three way handshake is completed. - - The value definition is the same as that of drop_entry and - drop_packet. - -sync_threshold - vector of 2 INTEGERs: sync_threshold, sync_period - default 3 50 - - It sets synchronization threshold, which is the minimum number - of incoming packets that a connection needs to receive before - the connection will be synchronized. A connection will be - synchronized, every time the number of its incoming packets - modulus sync_period equals the threshold. The range of the - threshold is from 0 to sync_period. 
- - When sync_period and sync_refresh_period are 0, send sync only - for state changes or only once when pkts matches sync_threshold - -sync_refresh_period - UNSIGNED INTEGER - default 0 - - In seconds, difference in reported connection timer that triggers - new sync message. It can be used to avoid sync messages for the - specified period (or half of the connection timeout if it is lower) - if connection state is not changed since last sync. - - This is useful for normal connections with high traffic to reduce - sync rate. Additionally, retry sync_retries times with period of - sync_refresh_period/8. - -sync_retries - INTEGER - default 0 - - Defines sync retries with period of sync_refresh_period/8. Useful - to protect against loss of sync messages. The range of the - sync_retries is from 0 to 3. - -sync_qlen_max - UNSIGNED LONG - - Hard limit for queued sync messages that are not sent yet. It - defaults to 1/32 of the memory pages but actually represents - number of messages. It will protect us from allocating large - parts of memory when the sending rate is lower than the queuing - rate. - -sync_sock_size - INTEGER - default 0 - - Configuration of SNDBUF (master) or RCVBUF (slave) socket limit. - Default value is 0 (preserve system defaults). - -sync_ports - INTEGER - default 1 - - The number of threads that master and backup servers can use for - sync traffic. Every thread will use single UDP port, thread 0 will - use the default port 8848 while last thread will use port - 8848+sync_ports-1. - -snat_reroute - BOOLEAN - 0 - disabled - not 0 - enabled (default) - - If enabled, recalculate the route of SNATed packets from - realservers so that they are routed as if they originate from the - director. Otherwise they are routed as if they are forwarded by the - director. - - If policy routing is in effect then it is possible that the route - of a packet originating from a director is routed differently to a - packet being forwarded by the director. 
- - If policy routing is not in effect then the recalculated route will - always be the same as the original route so it is an optimisation - to disable snat_reroute and avoid the recalculation. - -sync_persist_mode - INTEGER - default 0 - - Controls the synchronisation of connections when using persistence - - 0: All types of connections are synchronised - 1: Attempt to reduce the synchronisation traffic depending on - the connection type. For persistent services avoid synchronisation - for normal connections, do it only for persistence templates. - In such case, for TCP and SCTP it may need enabling sloppy_tcp and - sloppy_sctp flags on backup servers. For non-persistent services - such optimization is not applied, mode 0 is assumed. - -sync_version - INTEGER - default 1 - - The version of the synchronisation protocol used when sending - synchronisation messages. - - 0 selects the original synchronisation protocol (version 0). This - should be used when sending synchronisation messages to a legacy - system that only understands the original synchronisation protocol. - - 1 selects the current synchronisation protocol (version 1). This - should be used where possible. - - Kernels with this sync_version entry are able to receive messages - of both version 1 and version 2 of the synchronisation protocol. 
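The drop_packet rate formula and the sync_ports port layout described in the ipvs-sysctl text above can be illustrated with a small C sketch. This is only an illustration under stated assumptions: the function names `drop_packet_rate` and `sync_thread_port` are hypothetical and not taken from the IPVS sources, integer division is assumed for the rate, and a return value of 0 is used here to mean "no dropping" when available memory is not below the threshold.

```c
/* Default base UDP port for sync traffic, as stated in the sync_ports entry. */
#define IPVS_SYNC_BASE_PORT 8848u

/*
 * Hypothetical helper: rate used by the automatic drop_packet modes,
 *     rate = amemthresh / (amemthresh - available_memory)
 * applied only when available memory is below the threshold.
 * Returns 0 (this sketch's convention for "no dropping") otherwise.
 * A rate of N means that 1 in N packets is dropped; a rate of 1 drops all.
 */
unsigned int drop_packet_rate(unsigned int amemthresh, unsigned int available)
{
	if (available >= amemthresh)
		return 0;
	return amemthresh / (amemthresh - available);
}

/*
 * Hypothetical helper: UDP port used by sync thread `id` (0-based).
 * Thread 0 uses the default port; the last thread (id == sync_ports - 1)
 * uses 8848 + sync_ports - 1.
 */
unsigned int sync_thread_port(unsigned int id)
{
	return IPVS_SYNC_BASE_PORT + id;
}
```

With the default amemthresh of 1024 pages, 512 available pages give a rate of 2 (every second packet is dropped) and 0 available pages give a rate of 1 (all packets are dropped). With sync_ports set to 4, the threads would use ports 8848 through 8851.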
diff --git a/MAINTAINERS b/MAINTAINERS index df5e4ccc1ccb..3a5f52a3c055 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -8934,7 +8934,7 @@ L: lvs-devel@vger.kernel.org S: Maintained T: git git://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next.git T: git git://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs.git -F: Documentation/networking/ipvs-sysctl.txt +F: Documentation/networking/ipvs-sysctl.rst F: include/net/ip_vs.h F: include/uapi/linux/ip_vs.h F: net/netfilter/ipvs/ -- cgit v1.2.3 From 40e79150c1686263e6a031d7702aec63aff31332 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:03:57 +0200 Subject: docs: networking: convert lapb-module.txt to ReST - add SPDX header; - adjust title markup; - mark code blocks and literals as such; - mark tables as such; - adjust indentation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/lapb-module.rst | 305 +++++++++++++++++++++++++++++++ Documentation/networking/lapb-module.txt | 263 -------------------------- MAINTAINERS | 2 +- net/lapb/Kconfig | 2 +- 5 files changed, 308 insertions(+), 265 deletions(-) create mode 100644 Documentation/networking/lapb-module.rst delete mode 100644 Documentation/networking/lapb-module.txt (limited to 'MAINTAINERS') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 0c5d7a037983..acd2567cf0d4 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -75,6 +75,7 @@ Contents: ipvs-sysctl kcm l2tp + lapb-module .. only:: subproject and html diff --git a/Documentation/networking/lapb-module.rst b/Documentation/networking/lapb-module.rst new file mode 100644 index 000000000000..ff586bc9f005 --- /dev/null +++ b/Documentation/networking/lapb-module.rst @@ -0,0 +1,305 @@ +..
SPDX-License-Identifier: GPL-2.0 + +=============================== +The Linux LAPB Module Interface +=============================== + +Version 1.3 + +Jonathan Naylor 29.12.96 + +Changed (Henner Eisen, 2000-10-29): int return value for data_indication() + +The LAPB module will be a separately compiled module for use by any parts of +the Linux operating system that require a LAPB service. This document +defines the interfaces to, and the services provided by this module. The +term module in this context does not imply that the LAPB module is a +separately loadable module, although it may be. The term module is used in +its more standard meaning. + +The interface to the LAPB module consists of functions to the module, +callbacks from the module to indicate important state changes, and +structures for getting and setting information about the module. + +Structures +---------- + +Probably the most important structure is the skbuff structure for holding +received and transmitted data, however it is beyond the scope of this +document. + +The two LAPB specific structures are the LAPB initialisation structure and +the LAPB parameter structure. These will be defined in a standard header +file, . The header file is internal to the LAPB +module and is not for use. + +LAPB Initialisation Structure +----------------------------- + +This structure is used only once, in the call to lapb_register (see below). 
+It contains information about the device driver that requires the services +of the LAPB module:: + + struct lapb_register_struct { + void (*connect_confirmation)(int token, int reason); + void (*connect_indication)(int token, int reason); + void (*disconnect_confirmation)(int token, int reason); + void (*disconnect_indication)(int token, int reason); + int (*data_indication)(int token, struct sk_buff *skb); + void (*data_transmit)(int token, struct sk_buff *skb); + }; + +Each member of this structure corresponds to a function in the device driver +that is called when a particular event in the LAPB module occurs. These will +be described in detail below. If a callback is not required (!!) then a NULL +may be substituted. + + +LAPB Parameter Structure +------------------------ + +This structure is used with the lapb_getparms and lapb_setparms functions +(see below). They are used to allow the device driver to get and set the +operational parameters of the LAPB implementation for a given connection:: + + struct lapb_parms_struct { + unsigned int t1; + unsigned int t1timer; + unsigned int t2; + unsigned int t2timer; + unsigned int n2; + unsigned int n2count; + unsigned int window; + unsigned int state; + unsigned int mode; + }; + +T1 and T2 are protocol timing parameters and are given in units of 100ms. N2 +is the maximum number of tries on the link before it is declared a failure. +The window size is the maximum number of outstanding data packets allowed to +be unacknowledged by the remote end, the value of the window is between 1 +and 7 for a standard LAPB link, and between 1 and 127 for an extended LAPB +link. + +The mode variable is a bit field used for setting (at present) three values. +The bit fields have the following meanings: + +====== ================================================= +Bit Meaning +====== ================================================= +0 LAPB operation (0=LAPB_STANDARD 1=LAPB_EXTENDED). +1 [SM]LP operation (0=LAPB_SLP 1=LAPB=MLP). 
+2 DTE/DCE operation (0=LAPB_DTE 1=LAPB_DCE) +3-31 Reserved, must be 0. +====== ================================================= + +Extended LAPB operation indicates the use of extended sequence numbers and +consequently larger window sizes, the default is standard LAPB operation. +MLP operation is the same as SLP operation except that the addresses used by +LAPB are different to indicate the mode of operation, the default is Single +Link Procedure. The difference between DCE and DTE operation is (i) the +addresses used for commands and responses, and (ii) when the DCE is not +connected, it sends DM without polls set, every T1. The upper case constant +names will be defined in the public LAPB header file. + + +Functions +--------- + +The LAPB module provides a number of function entry points. + +:: + + int lapb_register(void *token, struct lapb_register_struct); + +This must be called before the LAPB module may be used. If the call is +successful then LAPB_OK is returned. The token must be a unique identifier +generated by the device driver to allow for the unique identification of the +instance of the LAPB link. It is returned by the LAPB module in all of the +callbacks, and is used by the device driver in all calls to the LAPB module. +For multiple LAPB links in a single device driver, multiple calls to +lapb_register must be made. The format of the lapb_register_struct is given +above. The return values are: + +============= ============================= +LAPB_OK LAPB registered successfully. +LAPB_BADTOKEN Token is already registered. +LAPB_NOMEM Out of memory +============= ============================= + +:: + + int lapb_unregister(void *token); + +This releases all the resources associated with a LAPB link. Any current +LAPB link will be abandoned without further messages being passed. After +this call, the value of token is no longer valid for any calls to the LAPB +function. 
The valid return values are:
+
+============= ===============================
+LAPB_OK       LAPB unregistered successfully.
+LAPB_BADTOKEN Invalid/unknown LAPB token.
+============= ===============================
+
+::
+
+    int lapb_getparms(void *token, struct lapb_parms_struct *parms);
+
+This allows the device driver to get the values of the current LAPB
+variables; the lapb_parms_struct is described above. The valid return values
+are:
+
+============= =============================
+LAPB_OK       LAPB getparms was successful.
+LAPB_BADTOKEN Invalid/unknown LAPB token.
+============= =============================
+
+::
+
+    int lapb_setparms(void *token, struct lapb_parms_struct *parms);
+
+This allows the device driver to set the values of the current LAPB
+variables; the lapb_parms_struct is described above. The values of t1timer,
+t2timer and n2count are ignored; likewise, changing the mode bits when
+connected will be ignored. An error implies that none of the values have
+been changed. The valid return values are:
+
+============= =================================================
+LAPB_OK       LAPB setparms was successful.
+LAPB_BADTOKEN Invalid/unknown LAPB token.
+LAPB_INVALUE  One of the values was out of its allowable range.
+============= =================================================
+
+::
+
+    int lapb_connect_request(void *token);
+
+Initiate a connect using the current parameter settings. The valid return
+values are:
+
+============== =================================
+LAPB_OK        LAPB is starting to connect.
+LAPB_BADTOKEN  Invalid/unknown LAPB token.
+LAPB_CONNECTED LAPB module is already connected.
+============== =================================
+
+::
+
+    int lapb_disconnect_request(void *token);
+
+Initiate a disconnect. The valid return values are:
+
+================= ===============================
+LAPB_OK           LAPB is starting to disconnect.
+LAPB_BADTOKEN     Invalid/unknown LAPB token.
+LAPB_NOTCONNECTED LAPB module is not connected.
+================= =============================== + +:: + + int lapb_data_request(void *token, struct sk_buff *skb); + +Queue data with the LAPB module for transmitting over the link. If the call +is successful then the skbuff is owned by the LAPB module and may not be +used by the device driver again. The valid return values are: + +================= ============================= +LAPB_OK LAPB has accepted the data. +LAPB_BADTOKEN Invalid/unknown LAPB token. +LAPB_NOTCONNECTED LAPB module is not connected. +================= ============================= + +:: + + int lapb_data_received(void *token, struct sk_buff *skb); + +Queue data with the LAPB module which has been received from the device. It +is expected that the data passed to the LAPB module has skb->data pointing +to the beginning of the LAPB data. If the call is successful then the skbuff +is owned by the LAPB module and may not be used by the device driver again. +The valid return values are: + +============= =========================== +LAPB_OK LAPB has accepted the data. +LAPB_BADTOKEN Invalid/unknown LAPB token. +============= =========================== + +Callbacks +--------- + +These callbacks are functions provided by the device driver for the LAPB +module to call when an event occurs. They are registered with the LAPB +module with lapb_register (see above) in the structure lapb_register_struct +(see above). + +:: + + void (*connect_confirmation)(void *token, int reason); + +This is called by the LAPB module when a connection is established after +being requested by a call to lapb_connect_request (see above). The reason is +always LAPB_OK. + +:: + + void (*connect_indication)(void *token, int reason); + +This is called by the LAPB module when the link is established by the remote +system. The value of reason is always LAPB_OK. 
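
As an illustration of the two connect callbacks, the following user-space
sketch shows one way a driver might record link state when they fire. It is
not kernel code and not part of the patch: ``LAPB_OK`` is defined locally
with an assumed value, and ``example_link``, ``example_lapb_callbacks`` and
the ``example_*`` functions are hypothetical names used only here. The
``void *token`` argument follows the callback prototypes in this section.

```c
#include <assert.h>
#include <stddef.h>

#define LAPB_OK 0	/* assumed value, defined locally for this sketch */

/* Hypothetical per-link driver state; its address is used as the token. */
struct example_link {
	int connected;		/* non-zero once the link is up */
	int initiated_locally;	/* non-zero if this side asked to connect */
};

/* Cut-down stand-in for lapb_register_struct: connect callbacks only. */
struct example_lapb_callbacks {
	void (*connect_confirmation)(void *token, int reason);
	void (*connect_indication)(void *token, int reason);
};

/* The link came up after our own connect request. */
static void example_connect_confirmation(void *token, int reason)
{
	struct example_link *link = token;

	assert(link != NULL);
	if (reason == LAPB_OK) {
		link->connected = 1;
		link->initiated_locally = 1;
	}
}

/* The link was brought up by the remote system. */
static void example_connect_indication(void *token, int reason)
{
	struct example_link *link = token;

	assert(link != NULL);
	if (reason == LAPB_OK) {
		link->connected = 1;
		link->initiated_locally = 0;
	}
}

/* Table a driver would hand over at registration time. */
static const struct example_lapb_callbacks example_callbacks = {
	.connect_confirmation = example_connect_confirmation,
	.connect_indication = example_connect_indication,
};
```

Because the token is handed back unchanged in every callback, using the
address of the per-link state as the token lets each callback find its link
without a lookup table.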
+ +:: + + void (*disconnect_confirmation)(void *token, int reason); + +This is called by the LAPB module when an event occurs after the device +driver has called lapb_disconnect_request (see above). The reason indicates +what has happened. In all cases the LAPB link can be regarded as being +terminated. The values for reason are: + +================= ==================================================== +LAPB_OK The LAPB link was terminated normally. +LAPB_NOTCONNECTED The remote system was not connected. +LAPB_TIMEDOUT No response was received in N2 tries from the remote + system. +================= ==================================================== + +:: + + void (*disconnect_indication)(void *token, int reason); + +This is called by the LAPB module when the link is terminated by the remote +system or another event has occurred to terminate the link. This may be +returned in response to a lapb_connect_request (see above) if the remote +system refused the request. The values for reason are: + +================= ==================================================== +LAPB_OK The LAPB link was terminated normally by the remote + system. +LAPB_REFUSED The remote system refused the connect request. +LAPB_NOTCONNECTED The remote system was not connected. +LAPB_TIMEDOUT No response was received in N2 tries from the remote + system. +================= ==================================================== + +:: + + int (*data_indication)(void *token, struct sk_buff *skb); + +This is called by the LAPB module when data has been received from the +remote system that should be passed onto the next layer in the protocol +stack. The skbuff becomes the property of the device driver and the LAPB +module will not perform any more actions on it. The skb->data pointer will +be pointing to the first byte of data after the LAPB header. 
+ +This method should return NET_RX_DROP (as defined in the header +file include/linux/netdevice.h) if and only if the frame was dropped +before it could be delivered to the upper layer. + +:: + + void (*data_transmit)(void *token, struct sk_buff *skb); + +This is called by the LAPB module when data is to be transmitted to the +remote system by the device driver. The skbuff becomes the property of the +device driver and the LAPB module will not perform any more actions on it. +The skb->data pointer will be pointing to the first byte of the LAPB header. diff --git a/Documentation/networking/lapb-module.txt b/Documentation/networking/lapb-module.txt deleted file mode 100644 index d4fc8f221559..000000000000 --- a/Documentation/networking/lapb-module.txt +++ /dev/null @@ -1,263 +0,0 @@ - The Linux LAPB Module Interface 1.3 - - Jonathan Naylor 29.12.96 - -Changed (Henner Eisen, 2000-10-29): int return value for data_indication() - -The LAPB module will be a separately compiled module for use by any parts of -the Linux operating system that require a LAPB service. This document -defines the interfaces to, and the services provided by this module. The -term module in this context does not imply that the LAPB module is a -separately loadable module, although it may be. The term module is used in -its more standard meaning. - -The interface to the LAPB module consists of functions to the module, -callbacks from the module to indicate important state changes, and -structures for getting and setting information about the module. - -Structures ----------- - -Probably the most important structure is the skbuff structure for holding -received and transmitted data, however it is beyond the scope of this -document. - -The two LAPB specific structures are the LAPB initialisation structure and -the LAPB parameter structure. These will be defined in a standard header -file, . The header file is internal to the LAPB -module and is not for use. 
- -LAPB Initialisation Structure ------------------------------ - -This structure is used only once, in the call to lapb_register (see below). -It contains information about the device driver that requires the services -of the LAPB module. - -struct lapb_register_struct { - void (*connect_confirmation)(int token, int reason); - void (*connect_indication)(int token, int reason); - void (*disconnect_confirmation)(int token, int reason); - void (*disconnect_indication)(int token, int reason); - int (*data_indication)(int token, struct sk_buff *skb); - void (*data_transmit)(int token, struct sk_buff *skb); -}; - -Each member of this structure corresponds to a function in the device driver -that is called when a particular event in the LAPB module occurs. These will -be described in detail below. If a callback is not required (!!) then a NULL -may be substituted. - - -LAPB Parameter Structure ------------------------- - -This structure is used with the lapb_getparms and lapb_setparms functions -(see below). They are used to allow the device driver to get and set the -operational parameters of the LAPB implementation for a given connection. - -struct lapb_parms_struct { - unsigned int t1; - unsigned int t1timer; - unsigned int t2; - unsigned int t2timer; - unsigned int n2; - unsigned int n2count; - unsigned int window; - unsigned int state; - unsigned int mode; -}; - -T1 and T2 are protocol timing parameters and are given in units of 100ms. N2 -is the maximum number of tries on the link before it is declared a failure. -The window size is the maximum number of outstanding data packets allowed to -be unacknowledged by the remote end, the value of the window is between 1 -and 7 for a standard LAPB link, and between 1 and 127 for an extended LAPB -link. - -The mode variable is a bit field used for setting (at present) three values. -The bit fields have the following meanings: - -Bit Meaning -0 LAPB operation (0=LAPB_STANDARD 1=LAPB_EXTENDED). 
-1 [SM]LP operation (0=LAPB_SLP 1=LAPB=MLP). -2 DTE/DCE operation (0=LAPB_DTE 1=LAPB_DCE) -3-31 Reserved, must be 0. - -Extended LAPB operation indicates the use of extended sequence numbers and -consequently larger window sizes, the default is standard LAPB operation. -MLP operation is the same as SLP operation except that the addresses used by -LAPB are different to indicate the mode of operation, the default is Single -Link Procedure. The difference between DCE and DTE operation is (i) the -addresses used for commands and responses, and (ii) when the DCE is not -connected, it sends DM without polls set, every T1. The upper case constant -names will be defined in the public LAPB header file. - - -Functions ---------- - -The LAPB module provides a number of function entry points. - - -int lapb_register(void *token, struct lapb_register_struct); - -This must be called before the LAPB module may be used. If the call is -successful then LAPB_OK is returned. The token must be a unique identifier -generated by the device driver to allow for the unique identification of the -instance of the LAPB link. It is returned by the LAPB module in all of the -callbacks, and is used by the device driver in all calls to the LAPB module. -For multiple LAPB links in a single device driver, multiple calls to -lapb_register must be made. The format of the lapb_register_struct is given -above. The return values are: - -LAPB_OK LAPB registered successfully. -LAPB_BADTOKEN Token is already registered. -LAPB_NOMEM Out of memory - - -int lapb_unregister(void *token); - -This releases all the resources associated with a LAPB link. Any current -LAPB link will be abandoned without further messages being passed. After -this call, the value of token is no longer valid for any calls to the LAPB -function. The valid return values are: - -LAPB_OK LAPB unregistered successfully. -LAPB_BADTOKEN Invalid/unknown LAPB token. 
- - -int lapb_getparms(void *token, struct lapb_parms_struct *parms); - -This allows the device driver to get the values of the current LAPB -variables, the lapb_parms_struct is described above. The valid return values -are: - -LAPB_OK LAPB getparms was successful. -LAPB_BADTOKEN Invalid/unknown LAPB token. - - -int lapb_setparms(void *token, struct lapb_parms_struct *parms); - -This allows the device driver to set the values of the current LAPB -variables, the lapb_parms_struct is described above. The values of t1timer, -t2timer and n2count are ignored, likewise changing the mode bits when -connected will be ignored. An error implies that none of the values have -been changed. The valid return values are: - -LAPB_OK LAPB getparms was successful. -LAPB_BADTOKEN Invalid/unknown LAPB token. -LAPB_INVALUE One of the values was out of its allowable range. - - -int lapb_connect_request(void *token); - -Initiate a connect using the current parameter settings. The valid return -values are: - -LAPB_OK LAPB is starting to connect. -LAPB_BADTOKEN Invalid/unknown LAPB token. -LAPB_CONNECTED LAPB module is already connected. - - -int lapb_disconnect_request(void *token); - -Initiate a disconnect. The valid return values are: - -LAPB_OK LAPB is starting to disconnect. -LAPB_BADTOKEN Invalid/unknown LAPB token. -LAPB_NOTCONNECTED LAPB module is not connected. - - -int lapb_data_request(void *token, struct sk_buff *skb); - -Queue data with the LAPB module for transmitting over the link. If the call -is successful then the skbuff is owned by the LAPB module and may not be -used by the device driver again. The valid return values are: - -LAPB_OK LAPB has accepted the data. -LAPB_BADTOKEN Invalid/unknown LAPB token. -LAPB_NOTCONNECTED LAPB module is not connected. - - -int lapb_data_received(void *token, struct sk_buff *skb); - -Queue data with the LAPB module which has been received from the device. 
It -is expected that the data passed to the LAPB module has skb->data pointing -to the beginning of the LAPB data. If the call is successful then the skbuff -is owned by the LAPB module and may not be used by the device driver again. -The valid return values are: - -LAPB_OK LAPB has accepted the data. -LAPB_BADTOKEN Invalid/unknown LAPB token. - - -Callbacks ---------- - -These callbacks are functions provided by the device driver for the LAPB -module to call when an event occurs. They are registered with the LAPB -module with lapb_register (see above) in the structure lapb_register_struct -(see above). - - -void (*connect_confirmation)(void *token, int reason); - -This is called by the LAPB module when a connection is established after -being requested by a call to lapb_connect_request (see above). The reason is -always LAPB_OK. - - -void (*connect_indication)(void *token, int reason); - -This is called by the LAPB module when the link is established by the remote -system. The value of reason is always LAPB_OK. - - -void (*disconnect_confirmation)(void *token, int reason); - -This is called by the LAPB module when an event occurs after the device -driver has called lapb_disconnect_request (see above). The reason indicates -what has happened. In all cases the LAPB link can be regarded as being -terminated. The values for reason are: - -LAPB_OK The LAPB link was terminated normally. -LAPB_NOTCONNECTED The remote system was not connected. -LAPB_TIMEDOUT No response was received in N2 tries from the remote - system. - - -void (*disconnect_indication)(void *token, int reason); - -This is called by the LAPB module when the link is terminated by the remote -system or another event has occurred to terminate the link. This may be -returned in response to a lapb_connect_request (see above) if the remote -system refused the request. The values for reason are: - -LAPB_OK The LAPB link was terminated normally by the remote - system. 
-LAPB_REFUSED The remote system refused the connect request. -LAPB_NOTCONNECTED The remote system was not connected. -LAPB_TIMEDOUT No response was received in N2 tries from the remote - system. - - -int (*data_indication)(void *token, struct sk_buff *skb); - -This is called by the LAPB module when data has been received from the -remote system that should be passed onto the next layer in the protocol -stack. The skbuff becomes the property of the device driver and the LAPB -module will not perform any more actions on it. The skb->data pointer will -be pointing to the first byte of data after the LAPB header. - -This method should return NET_RX_DROP (as defined in the header -file include/linux/netdevice.h) if and only if the frame was dropped -before it could be delivered to the upper layer. - - -void (*data_transmit)(void *token, struct sk_buff *skb); - -This is called by the LAPB module when data is to be transmitted to the -remote system by the device driver. The skbuff becomes the property of the -device driver and the LAPB module will not perform any more actions on it. -The skb->data pointer will be pointing to the first byte of the LAPB header. diff --git a/MAINTAINERS b/MAINTAINERS index 3a5f52a3c055..956999d2d979 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9515,7 +9515,7 @@ F: drivers/soc/lantiq LAPB module L: linux-x25@vger.kernel.org S: Orphan -F: Documentation/networking/lapb-module.txt +F: Documentation/networking/lapb-module.rst F: include/*/lapb.h F: net/lapb/ diff --git a/net/lapb/Kconfig b/net/lapb/Kconfig index 6acfc999c952..5b50e8d64f26 100644 --- a/net/lapb/Kconfig +++ b/net/lapb/Kconfig @@ -15,7 +15,7 @@ config LAPB currently supports LAPB only over Ethernet connections. If you want to use LAPB connections over Ethernet, say Y here and to "LAPB over Ethernet driver" below. Read - for technical + for technical details. 
To compile this driver as a module, choose M here: the -- cgit v1.2.3 From 429ff87bcac75b929d9ffec8d4d24be2616f8052 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:03:59 +0200 Subject: docs: networking: convert mac80211-injection.txt to ReST - add SPDX header; - adjust title markup; - mark code blocks and literals as such; - mark tables as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/mac80211-injection.rst | 106 ++++++++++++++++++++++++ Documentation/networking/mac80211-injection.txt | 97 ---------------------- MAINTAINERS | 2 +- net/mac80211/tx.c | 2 +- 5 files changed, 109 insertions(+), 99 deletions(-) create mode 100644 Documentation/networking/mac80211-injection.rst delete mode 100644 Documentation/networking/mac80211-injection.txt (limited to 'MAINTAINERS') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index b3608b177a8b..81c1834bfb57 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -77,6 +77,7 @@ Contents: l2tp lapb-module ltpc + mac80211-injection .. only:: subproject and html diff --git a/Documentation/networking/mac80211-injection.rst b/Documentation/networking/mac80211-injection.rst new file mode 100644 index 000000000000..75d4edcae852 --- /dev/null +++ b/Documentation/networking/mac80211-injection.rst @@ -0,0 +1,106 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================================= +How to use packet injection with mac80211 +========================================= + +mac80211 now allows arbitrary packets to be injected down any Monitor Mode +interface from userland. 
The packet you inject needs to be composed in the +following format:: + + [ radiotap header ] + [ ieee80211 header ] + [ payload ] + +The radiotap format is discussed in +./Documentation/networking/radiotap-headers.txt. + +Despite many radiotap parameters being currently defined, most only make sense +to appear on received packets. The following information is parsed from the +radiotap headers and used to control injection: + + * IEEE80211_RADIOTAP_FLAGS + + ========================= =========================================== + IEEE80211_RADIOTAP_F_FCS FCS will be removed and recalculated + IEEE80211_RADIOTAP_F_WEP frame will be encrypted if key available + IEEE80211_RADIOTAP_F_FRAG frame will be fragmented if longer than the + current fragmentation threshold. + ========================= =========================================== + + * IEEE80211_RADIOTAP_TX_FLAGS + + ============================= ======================================== + IEEE80211_RADIOTAP_F_TX_NOACK frame should be sent without waiting for + an ACK even if it is a unicast frame + ============================= ======================================== + + * IEEE80211_RADIOTAP_RATE + + legacy rate for the transmission (only for devices without own rate control) + + * IEEE80211_RADIOTAP_MCS + + HT rate for the transmission (only for devices without own rate control). + Also some flags are parsed + + ============================ ======================== + IEEE80211_RADIOTAP_MCS_SGI use short guard interval + IEEE80211_RADIOTAP_MCS_BW_40 send in HT40 mode + ============================ ======================== + + * IEEE80211_RADIOTAP_DATA_RETRIES + + number of retries when either IEEE80211_RADIOTAP_RATE or + IEEE80211_RADIOTAP_MCS was used + + * IEEE80211_RADIOTAP_VHT + + VHT mcs and number of streams used in the transmission (only for devices + without own rate control). 
Also other fields are parsed + + flags field + IEEE80211_RADIOTAP_VHT_FLAG_SGI: use short guard interval + + bandwidth field + * 1: send using 40MHz channel width + * 4: send using 80MHz channel width + * 11: send using 160MHz channel width + +The injection code can also skip all other currently defined radiotap fields +facilitating replay of captured radiotap headers directly. + +Here is an example valid radiotap header defining some parameters:: + + 0x00, 0x00, // <-- radiotap version + 0x0b, 0x00, // <- radiotap header length + 0x04, 0x0c, 0x00, 0x00, // <-- bitmap + 0x6c, // <-- rate + 0x0c, //<-- tx power + 0x01 //<-- antenna + +The ieee80211 header follows immediately afterwards, looking for example like +this:: + + 0x08, 0x01, 0x00, 0x00, + 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, + 0x13, 0x22, 0x33, 0x44, 0x55, 0x66, + 0x13, 0x22, 0x33, 0x44, 0x55, 0x66, + 0x10, 0x86 + +Then lastly there is the payload. + +After composing the packet contents, it is sent by send()-ing it to a logical +mac80211 interface that is in Monitor mode. Libpcap can also be used, +(which is easier than doing the work to bind the socket to the right +interface), along the following lines::: + + ppcap = pcap_open_live(szInterfaceName, 800, 1, 20, szErrbuf); + ... + r = pcap_inject(ppcap, u8aSendBuffer, nLength); + +You can also find a link to a complete inject application here: + +http://wireless.kernel.org/en/users/Documentation/packetspammer + +Andy Green diff --git a/Documentation/networking/mac80211-injection.txt b/Documentation/networking/mac80211-injection.txt deleted file mode 100644 index d58d78df9ca2..000000000000 --- a/Documentation/networking/mac80211-injection.txt +++ /dev/null @@ -1,97 +0,0 @@ -How to use packet injection with mac80211 -========================================= - -mac80211 now allows arbitrary packets to be injected down any Monitor Mode -interface from userland. 
The packet you inject needs to be composed in the -following format: - - [ radiotap header ] - [ ieee80211 header ] - [ payload ] - -The radiotap format is discussed in -./Documentation/networking/radiotap-headers.txt. - -Despite many radiotap parameters being currently defined, most only make sense -to appear on received packets. The following information is parsed from the -radiotap headers and used to control injection: - - * IEEE80211_RADIOTAP_FLAGS - - IEEE80211_RADIOTAP_F_FCS: FCS will be removed and recalculated - IEEE80211_RADIOTAP_F_WEP: frame will be encrypted if key available - IEEE80211_RADIOTAP_F_FRAG: frame will be fragmented if longer than the - current fragmentation threshold. - - * IEEE80211_RADIOTAP_TX_FLAGS - - IEEE80211_RADIOTAP_F_TX_NOACK: frame should be sent without waiting for - an ACK even if it is a unicast frame - - * IEEE80211_RADIOTAP_RATE - - legacy rate for the transmission (only for devices without own rate control) - - * IEEE80211_RADIOTAP_MCS - - HT rate for the transmission (only for devices without own rate control). - Also some flags are parsed - - IEEE80211_RADIOTAP_MCS_SGI: use short guard interval - IEEE80211_RADIOTAP_MCS_BW_40: send in HT40 mode - - * IEEE80211_RADIOTAP_DATA_RETRIES - - number of retries when either IEEE80211_RADIOTAP_RATE or - IEEE80211_RADIOTAP_MCS was used - - * IEEE80211_RADIOTAP_VHT - - VHT mcs and number of streams used in the transmission (only for devices - without own rate control). Also other fields are parsed - - flags field - IEEE80211_RADIOTAP_VHT_FLAG_SGI: use short guard interval - - bandwidth field - 1: send using 40MHz channel width - 4: send using 80MHz channel width - 11: send using 160MHz channel width - -The injection code can also skip all other currently defined radiotap fields -facilitating replay of captured radiotap headers directly. 
- -Here is an example valid radiotap header defining some parameters - - 0x00, 0x00, // <-- radiotap version - 0x0b, 0x00, // <- radiotap header length - 0x04, 0x0c, 0x00, 0x00, // <-- bitmap - 0x6c, // <-- rate - 0x0c, //<-- tx power - 0x01 //<-- antenna - -The ieee80211 header follows immediately afterwards, looking for example like -this: - - 0x08, 0x01, 0x00, 0x00, - 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, - 0x13, 0x22, 0x33, 0x44, 0x55, 0x66, - 0x13, 0x22, 0x33, 0x44, 0x55, 0x66, - 0x10, 0x86 - -Then lastly there is the payload. - -After composing the packet contents, it is sent by send()-ing it to a logical -mac80211 interface that is in Monitor mode. Libpcap can also be used, -(which is easier than doing the work to bind the socket to the right -interface), along the following lines: - - ppcap = pcap_open_live(szInterfaceName, 800, 1, 20, szErrbuf); -... - r = pcap_inject(ppcap, u8aSendBuffer, nLength); - -You can also find a link to a complete inject application here: - -http://wireless.kernel.org/en/users/Documentation/packetspammer - -Andy Green diff --git a/MAINTAINERS b/MAINTAINERS index 956999d2d979..33bfc9e4aead 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -10079,7 +10079,7 @@ S: Maintained W: https://wireless.wiki.kernel.org/ T: git git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211.git T: git git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next.git -F: Documentation/networking/mac80211-injection.txt +F: Documentation/networking/mac80211-injection.rst F: Documentation/networking/mac80211_hwsim/mac80211_hwsim.rst F: drivers/net/wireless/mac80211_hwsim.[ch] F: include/net/mac80211.h diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c index 82846aca86d9..9849c14694db 100644 --- a/net/mac80211/tx.c +++ b/net/mac80211/tx.c @@ -2144,7 +2144,7 @@ static bool ieee80211_parse_tx_radiotap(struct ieee80211_local *local, /* * Please update the file - * Documentation/networking/mac80211-injection.txt + * 
Documentation/networking/mac80211-injection.rst * when parsing new fields here. */ -- cgit v1.2.3 From 6e94eaaa400d66f13e25e071926047ef2e3d21e3 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:04:12 +0200 Subject: docs: networking: convert phonet.txt to ReST - add SPDX header; - adjust title markup; - use copyright symbol; - add notes markups; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Acked-by: Rémi Denis-Courmont Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/packet_mmap.rst | 2 +- Documentation/networking/phonet.rst | 230 +++++++++++++++++++++++++++++++ Documentation/networking/phonet.txt | 214 ---------------------------- MAINTAINERS | 2 +- 5 files changed, 233 insertions(+), 216 deletions(-) create mode 100644 Documentation/networking/phonet.rst delete mode 100644 Documentation/networking/phonet.txt (limited to 'MAINTAINERS') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 8262b535a83e..e460026331c6 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -90,6 +90,7 @@ Contents: openvswitch operstates packet_mmap + phonet .. 
only:: subproject and html diff --git a/Documentation/networking/packet_mmap.rst b/Documentation/networking/packet_mmap.rst index 5f213d17652f..884c7222b9e9 100644 --- a/Documentation/networking/packet_mmap.rst +++ b/Documentation/networking/packet_mmap.rst @@ -1076,7 +1076,7 @@ Miscellaneous bits ================== - Packet sockets work well together with Linux socket filters, thus you also - might want to have a look at Documentation/networking/filter.txt + might want to have a look at Documentation/networking/filter.rst THANKS ====== diff --git a/Documentation/networking/phonet.rst b/Documentation/networking/phonet.rst new file mode 100644 index 000000000000..8668dcbc5e6a --- /dev/null +++ b/Documentation/networking/phonet.rst @@ -0,0 +1,230 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: + +============================ +Linux Phonet protocol family +============================ + +Introduction +------------ + +Phonet is a packet protocol used by Nokia cellular modems for both IPC +and RPC. With the Linux Phonet socket family, Linux host processes can +receive and send messages from/to the modem, or any other external +device attached to the modem. The modem takes care of routing. + +Phonet packets can be exchanged through various hardware connections +depending on the device, such as: + + - USB with the CDC Phonet interface, + - infrared, + - Bluetooth, + - an RS232 serial port (with a dedicated "FBUS" line discipline), + - the SSI bus with some TI OMAP processors. 
+
+
+Packets format
+--------------
+
+Phonet packets have a common header as follows::
+
+  struct phonethdr {
+    uint8_t  pn_media;  /* Media type (link-layer identifier) */
+    uint8_t  pn_rdev;   /* Receiver device ID */
+    uint8_t  pn_sdev;   /* Sender device ID */
+    uint8_t  pn_res;    /* Resource ID or function */
+    uint16_t pn_length; /* Big-endian message byte length (minus 6) */
+    uint8_t  pn_robj;   /* Receiver object ID */
+    uint8_t  pn_sobj;   /* Sender object ID */
+  };
+
+On Linux, the link-layer header includes the pn_media byte (see below).
+The next 7 bytes are part of the network-layer header.
+
+The device ID is split: the 6 higher-order bits constitute the device
+address, while the 2 lower-order bits are used for multiplexing, as are
+the 8-bit object identifiers. As such, Phonet can be considered as a
+network layer with 6 bits of address space and 10 bits for transport
+protocol (much like port numbers in the IP world).
+
+The modem always has address number zero. All other devices have their
+own 6-bit address.
+
+
+Link layer
+----------
+
+Phonet links are always point-to-point links. The link layer header
+consists of a single Phonet media type byte. It uniquely identifies the
+link through which the packet is transmitted, from the modem's
+perspective. Each Phonet network device shall prepend and set the media
+type byte as appropriate. For convenience, a common phonet_header_ops
+link-layer header operations structure is provided. It sets the
+media type according to the network device hardware address.
+
+Linux Phonet network interfaces support a dedicated link-layer packet
+type (ETH_P_PHONET) which is out of the Ethernet type range. They can
+only send and receive Phonet packets.
+
+The virtual TUN tunnel device driver can also be used for Phonet. This
+requires IFF_TUN mode, _without_ the IFF_NO_PI flag. In this case,
+there is no link-layer header, so there is no Phonet media type byte.
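
The device ID split described under "Packets format" can be made concrete
with a little bit arithmetic. The sketch below is illustrative only and not
part of the patch: the ``pn_example_*`` helpers are hypothetical names
invented here, not the kernel's API, and it assumes the device address is
kept in place in the upper 6 bits of the device ID byte.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers, named for this sketch only. */

/* Device address: the 6 higher-order bits of the device ID byte. */
static inline uint8_t pn_example_addr(uint8_t dev)
{
	return dev & 0xFC;
}

/* 10-bit "port": the 2 low-order device bits above the 8-bit object ID. */
static inline uint16_t pn_example_port(uint8_t dev, uint8_t obj)
{
	return (uint16_t)(((dev & 0x03) << 8) | obj);
}

/* Rebuild the device ID and object ID bytes from an address and a port. */
static inline void pn_example_split(uint8_t addr, uint16_t port,
				    uint8_t *dev, uint8_t *obj)
{
	assert((addr & 0x03) == 0);	/* address occupies the top 6 bits */
	assert(port < (1u << 10));	/* port is a 10-bit quantity */
	*dev = (uint8_t)(addr | (port >> 8));
	*obj = (uint8_t)(port & 0xFF);
}
```

For instance, device ID 0x6D with object ID 0x45 yields address 0x6C and
10-bit port 0x145; splitting that pair again gives back 0x6D and 0x45.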
+ +Note that Phonet interfaces are not allowed to re-order packets, so +only the (default) Linux FIFO qdisc should be used with them. + + +Network layer +------------- + +The Phonet socket address family maps the Phonet packet header:: + + struct sockaddr_pn { + sa_family_t spn_family; /* AF_PHONET */ + uint8_t spn_obj; /* Object ID */ + uint8_t spn_dev; /* Device ID */ + uint8_t spn_resource; /* Resource or function */ + uint8_t spn_zero[...]; /* Padding */ + }; + +The resource field is only used when sending and receiving; +it is ignored by bind() and getsockname(). + + +Low-level datagram protocol +--------------------------- + +Applications can send Phonet messages using the Phonet datagram socket +protocol from the PF_PHONET family. Each socket is bound to one of the +2^10 object IDs available, and can send and receive packets with any +other peer. + +:: + + struct sockaddr_pn addr = { .spn_family = AF_PHONET, }; + ssize_t len; + socklen_t addrlen = sizeof(addr); + int fd; + + fd = socket(PF_PHONET, SOCK_DGRAM, 0); + bind(fd, (struct sockaddr *)&addr, sizeof(addr)); + /* ... */ + + sendto(fd, msg, msglen, 0, (struct sockaddr *)&addr, sizeof(addr)); + len = recvfrom(fd, buf, sizeof(buf), 0, + (struct sockaddr *)&addr, &addrlen); + +This protocol follows the SOCK_DGRAM connection-less semantics. +However, connect() and getpeername() are not supported, as they did +not seem useful with Phonet usages (could be added easily). + + +Resource subscription +--------------------- + +A Phonet datagram socket can be subscribed to any number of 8-bit +Phonet resources, as follows:: + + uint32_t res = 0xXX; + ioctl(fd, SIOCPNADDRESOURCE, &res); + +Subscription is similarly cancelled using the SIOCPNDELRESOURCE I/O +control request, or when the socket is closed. + +Note that no more than one socket can be subscribed to any given +resource at a time. Otherwise, ioctl() will fail with EBUSY.
+ + +Phonet Pipe protocol +-------------------- + +The Phonet Pipe protocol is a simple sequenced packets protocol +with end-to-end congestion control. It uses the passive listening +socket paradigm. The listening socket is bound to a unique free object +ID. Each listening socket can handle up to 255 simultaneous +connections, one per accept()'d socket. + +:: + + int lfd, cfd; + + lfd = socket(PF_PHONET, SOCK_SEQPACKET, PN_PROTO_PIPE); + listen(lfd, INT_MAX); + + /* ... */ + cfd = accept(lfd, NULL, NULL); + for (;;) + { + char buf[...]; + ssize_t len = read(cfd, buf, sizeof(buf)); + + /* ... */ + + write(cfd, msg, msglen); + } + +Connections are traditionally established between two endpoints by a +"third party" application. This means that both endpoints are passive. + + +As of Linux kernel version 2.6.39, it is also possible to connect +two endpoints directly, using connect() on the active side. This is +intended to support the newer Nokia Wireless Modem API, as found in +e.g. the Nokia Slim Modem in the ST-Ericsson U8500 platform:: + + struct sockaddr_pn spn; + int fd; + + fd = socket(PF_PHONET, SOCK_SEQPACKET, PN_PROTO_PIPE); + memset(&spn, 0, sizeof(spn)); + spn.spn_family = AF_PHONET; + spn.spn_obj = ...; + spn.spn_dev = ...; + spn.spn_resource = 0xD9; + connect(fd, (struct sockaddr *)&spn, sizeof(spn)); + /* normal I/O here ... */ + close(fd); + + +.. warning:: + + When polling a connected pipe socket for writability, there is an + intrinsic race condition whereby writability might be lost between the + polling and the writing system calls. In this case, the socket will + block until write becomes possible again, unless non-blocking mode + is enabled. + + +The pipe protocol provides the following socket options at the SOL_PNPIPE level: + + PNPIPE_ENCAP accepts one integer value (int) of: + + PNPIPE_ENCAP_NONE: + The socket operates normally (default). + + PNPIPE_ENCAP_IP: + The socket is used as a backend for a virtual IP + interface.
This requires CAP_NET_ADMIN capability. GPRS data + support on Nokia modems can use this. Note that the socket cannot + be reliably poll()'d or read() from while in this mode. + + PNPIPE_IFINDEX + is a read-only integer value. It contains the + interface index of the network interface created by PNPIPE_ENCAP, + or zero if encapsulation is off. + + PNPIPE_HANDLE + is a read-only integer value. It contains the underlying + identifier ("pipe handle") of the pipe. This is only defined for + socket descriptors that are already connected or being connected. + + +Authors +------- + +Linux Phonet was initially written by Sakari Ailus. + +Other contributors include Mikä Liljeberg, Andras Domokos, +Carlos Chinea and Rémi Denis-Courmont. + +Copyright |copy| 2008 Nokia Corporation. diff --git a/Documentation/networking/phonet.txt b/Documentation/networking/phonet.txt deleted file mode 100644 index 81003581f47a..000000000000 --- a/Documentation/networking/phonet.txt +++ /dev/null @@ -1,214 +0,0 @@ -Linux Phonet protocol family -============================ - -Introduction ------------- - -Phonet is a packet protocol used by Nokia cellular modems for both IPC -and RPC. With the Linux Phonet socket family, Linux host processes can -receive and send messages from/to the modem, or any other external -device attached to the modem. The modem takes care of routing. - -Phonet packets can be exchanged through various hardware connections -depending on the device, such as: - - USB with the CDC Phonet interface, - - infrared, - - Bluetooth, - - an RS232 serial port (with a dedicated "FBUS" line discipline), - - the SSI bus with some TI OMAP processors. 
- - -Packets format --------------- - -Phonet packets have a common header as follows: - - struct phonethdr { - uint8_t pn_media; /* Media type (link-layer identifier) */ - uint8_t pn_rdev; /* Receiver device ID */ - uint8_t pn_sdev; /* Sender device ID */ - uint8_t pn_res; /* Resource ID or function */ - uint16_t pn_length; /* Big-endian message byte length (minus 6) */ - uint8_t pn_robj; /* Receiver object ID */ - uint8_t pn_sobj; /* Sender object ID */ - }; - -On Linux, the link-layer header includes the pn_media byte (see below). -The next 7 bytes are part of the network-layer header. - -The device ID is split: the 6 higher-order bits constitute the device -address, while the 2 lower-order bits are used for multiplexing, as are -the 8-bit object identifiers. As such, Phonet can be considered as a -network layer with 6 bits of address space and 10 bits for transport -protocol (much like port numbers in IP world). - -The modem always has address number zero. All other device have a their -own 6-bit address. - - -Link layer ----------- - -Phonet links are always point-to-point links. The link layer header -consists of a single Phonet media type byte. It uniquely identifies the -link through which the packet is transmitted, from the modem's -perspective. Each Phonet network device shall prepend and set the media -type byte as appropriate. For convenience, a common phonet_header_ops -link-layer header operations structure is provided. It sets the -media type according to the network device hardware address. - -Linux Phonet network interfaces support a dedicated link layer packets -type (ETH_P_PHONET) which is out of the Ethernet type range. They can -only send and receive Phonet packets. - -The virtual TUN tunnel device driver can also be used for Phonet. This -requires IFF_TUN mode, _without_ the IFF_NO_PI flag. In this case, -there is no link-layer header, so there is no Phonet media type byte. 
- -Note that Phonet interfaces are not allowed to re-order packets, so -only the (default) Linux FIFO qdisc should be used with them. - - -Network layer -------------- - -The Phonet socket address family maps the Phonet packet header: - - struct sockaddr_pn { - sa_family_t spn_family; /* AF_PHONET */ - uint8_t spn_obj; /* Object ID */ - uint8_t spn_dev; /* Device ID */ - uint8_t spn_resource; /* Resource or function */ - uint8_t spn_zero[...]; /* Padding */ - }; - -The resource field is only used when sending and receiving; -It is ignored by bind() and getsockname(). - - -Low-level datagram protocol ---------------------------- - -Applications can send Phonet messages using the Phonet datagram socket -protocol from the PF_PHONET family. Each socket is bound to one of the -2^10 object IDs available, and can send and receive packets with any -other peer. - - struct sockaddr_pn addr = { .spn_family = AF_PHONET, }; - ssize_t len; - socklen_t addrlen = sizeof(addr); - int fd; - - fd = socket(PF_PHONET, SOCK_DGRAM, 0); - bind(fd, (struct sockaddr *)&addr, sizeof(addr)); - /* ... */ - - sendto(fd, msg, msglen, 0, (struct sockaddr *)&addr, sizeof(addr)); - len = recvfrom(fd, buf, sizeof(buf), 0, - (struct sockaddr *)&addr, &addrlen); - -This protocol follows the SOCK_DGRAM connection-less semantics. -However, connect() and getpeername() are not supported, as they did -not seem useful with Phonet usages (could be added easily). - - -Resource subscription ---------------------- - -A Phonet datagram socket can be subscribed to any number of 8-bits -Phonet resources, as follow: - - uint32_t res = 0xXX; - ioctl(fd, SIOCPNADDRESOURCE, &res); - -Subscription is similarly cancelled using the SIOCPNDELRESOURCE I/O -control request, or when the socket is closed. - -Note that no more than one socket can be subcribed to any given -resource at a time. If not, ioctl() will return EBUSY. 
- - -Phonet Pipe protocol --------------------- - -The Phonet Pipe protocol is a simple sequenced packets protocol -with end-to-end congestion control. It uses the passive listening -socket paradigm. The listening socket is bound to an unique free object -ID. Each listening socket can handle up to 255 simultaneous -connections, one per accept()'d socket. - - int lfd, cfd; - - lfd = socket(PF_PHONET, SOCK_SEQPACKET, PN_PROTO_PIPE); - listen (lfd, INT_MAX); - - /* ... */ - cfd = accept(lfd, NULL, NULL); - for (;;) - { - char buf[...]; - ssize_t len = read(cfd, buf, sizeof(buf)); - - /* ... */ - - write(cfd, msg, msglen); - } - -Connections are traditionally established between two endpoints by a -"third party" application. This means that both endpoints are passive. - - -As of Linux kernel version 2.6.39, it is also possible to connect -two endpoints directly, using connect() on the active side. This is -intended to support the newer Nokia Wireless Modem API, as found in -e.g. the Nokia Slim Modem in the ST-Ericsson U8500 platform: - - struct sockaddr_spn spn; - int fd; - - fd = socket(PF_PHONET, SOCK_SEQPACKET, PN_PROTO_PIPE); - memset(&spn, 0, sizeof(spn)); - spn.spn_family = AF_PHONET; - spn.spn_obj = ...; - spn.spn_dev = ...; - spn.spn_resource = 0xD9; - connect(fd, (struct sockaddr *)&spn, sizeof(spn)); - /* normal I/O here ... */ - close(fd); - - -WARNING: -When polling a connected pipe socket for writability, there is an -intrinsic race condition whereby writability might be lost between the -polling and the writing system calls. In this case, the socket will -block until write becomes possible again, unless non-blocking mode -is enabled. - - -The pipe protocol provides two socket options at the SOL_PNPIPE level: - - PNPIPE_ENCAP accepts one integer value (int) of: - - PNPIPE_ENCAP_NONE: The socket operates normally (default). - - PNPIPE_ENCAP_IP: The socket is used as a backend for a virtual IP - interface. This requires CAP_NET_ADMIN capability. 
GPRS data - support on Nokia modems can use this. Note that the socket cannot - be reliably poll()'d or read() from while in this mode. - - PNPIPE_IFINDEX is a read-only integer value. It contains the - interface index of the network interface created by PNPIPE_ENCAP, - or zero if encapsulation is off. - - PNPIPE_HANDLE is a read-only integer value. It contains the underlying - identifier ("pipe handle") of the pipe. This is only defined for - socket descriptors that are already connected or being connected. - - -Authors -------- - -Linux Phonet was initially written by Sakari Ailus. -Other contributors include Mikä Liljeberg, Andras Domokos, -Carlos Chinea and Rémi Denis-Courmont. -Copyright (C) 2008 Nokia Corporation. diff --git a/MAINTAINERS b/MAINTAINERS index 33bfc9e4aead..785f56e5f210 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -13262,7 +13262,7 @@ F: drivers/input/joystick/pxrc.c PHONET PROTOCOL M: Remi Denis-Courmont S: Supported -F: Documentation/networking/phonet.txt +F: Documentation/networking/phonet.rst F: include/linux/phonet.h F: include/net/phonet/ F: include/uapi/linux/phonet.h -- cgit v1.2.3 From bad5b6e223e8409c860c0574d5239ee4348f06b3 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:04:19 +0200 Subject: docs: networking: convert rds.txt to ReST - add SPDX header; - add a document title; - mark code blocks and literals as such; - mark tables as such; - mark lists as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Acked-by: Santosh Shilimkar Signed-off-by: David S. 
Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/rds.rst | 448 +++++++++++++++++++++++++++++++++++++ Documentation/networking/rds.txt | 423 ---------------------------------- MAINTAINERS | 2 +- 4 files changed, 450 insertions(+), 424 deletions(-) create mode 100644 Documentation/networking/rds.rst delete mode 100644 Documentation/networking/rds.txt (limited to 'MAINTAINERS') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index b7e35b0d905c..e63a2cb2e4cb 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -97,6 +97,7 @@ Contents: proc_net_tcp radiotap-headers ray_cs + rds .. only:: subproject and html diff --git a/Documentation/networking/rds.rst b/Documentation/networking/rds.rst new file mode 100644 index 000000000000..44936c27ab3a --- /dev/null +++ b/Documentation/networking/rds.rst @@ -0,0 +1,448 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=== +RDS +=== + +Overview +======== + +This readme tries to provide some background on the hows and whys of RDS, +and will hopefully help you find your way around the code. + +In addition, please see this email about RDS origins: +http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html + +RDS Architecture +================ + +RDS provides reliable, ordered datagram delivery by using a single +reliable connection between any two nodes in the cluster. This allows +applications to use a single socket to talk to any other process in the +cluster - so in a cluster with N processes you need N sockets, in contrast +to N*N if you use a connection-oriented socket transport like TCP. + +RDS is not Infiniband-specific; it was designed to support different +transports. The current implementation used to support RDS over TCP as well +as IB. + +The high-level semantics of RDS from the application's point of view are + + * Addressing + + RDS uses IPv4 addresses and 16bit port numbers to identify + the end point of a connection.
All socket operations that involve + passing addresses between kernel and user space generally + use a struct sockaddr_in. + + The fact that IPv4 addresses are used does not mean the underlying + transport has to be IP-based. In fact, RDS over IB uses a + reliable IB connection; the IP address is used exclusively to + locate the remote node's GID (by ARPing for the given IP). + + The port space is entirely independent of UDP, TCP or any other + protocol. + + * Socket interface + + RDS sockets work *mostly* as you would expect from a BSD + socket. The next section will cover the details. At any rate, + all I/O is performed through the standard BSD socket API. + Some additions like zerocopy support are implemented through + control messages, while other extensions use the getsockopt/ + setsockopt calls. + + Sockets must be bound before you can send or receive data. + This is needed because binding also selects a transport and + attaches it to the socket. Once bound, the transport assignment + does not change. RDS will tolerate IPs moving around (eg in + a active-active HA scenario), but only as long as the address + doesn't move to a different transport. + + * sysctls + + RDS supports a number of sysctls in /proc/sys/net/rds + + +Socket Interface +================ + + AF_RDS, PF_RDS, SOL_RDS + AF_RDS and PF_RDS are the domain type to be used with socket(2) + to create RDS sockets. SOL_RDS is the socket-level to be used + with setsockopt(2) and getsockopt(2) for RDS specific socket + options. + + fd = socket(PF_RDS, SOCK_SEQPACKET, 0); + This creates a new, unbound RDS socket. + + setsockopt(SOL_SOCKET): send and receive buffer size + RDS honors the send and receive buffer size socket options. + You are not allowed to queue more than SO_SNDSIZE bytes to + a socket. A message is queued when sendmsg is called, and + it leaves the queue when the remote system acknowledges + its arrival. + + The SO_RCVSIZE option controls the maximum receive queue length. 
+ This is a soft limit rather than a hard limit - RDS will + continue to accept and queue incoming messages, even if that + takes the queue length over the limit. However, it will also + mark the port as "congested" and send a congestion update to + the source node. The source node is supposed to throttle any + processes sending to this congested port. + + bind(fd, &sockaddr_in, ...) + This binds the socket to a local IP address and port, and a + transport, if one has not already been selected via the + SO_RDS_TRANSPORT socket option + + sendmsg(fd, ...) + Sends a message to the indicated recipient. The kernel will + transparently establish the underlying reliable connection + if it isn't up yet. + + An attempt to send a message that exceeds SO_SNDSIZE will + return with -EMSGSIZE + + An attempt to send a message that would take the total number + of queued bytes over the SO_SNDSIZE threshold will return + EAGAIN. + + An attempt to send a message to a destination that is marked + as "congested" will return ENOBUFS. + + recvmsg(fd, ...) + Receives a message that was queued to this socket. The sockets + recv queue accounting is adjusted, and if the queue length + drops below SO_SNDSIZE, the port is marked uncongested, and + a congestion update is sent to all peers. + + Applications can ask the RDS kernel module to receive + notifications via control messages (for instance, there is a + notification when a congestion update arrived, or when a RDMA + operation completes). These notifications are received through + the msg.msg_control buffer of struct msghdr. The format of the + messages is described in manpages. + + poll(fd) + RDS supports the poll interface to allow the application + to implement async I/O. + + POLLIN handling is pretty straightforward. When there's an + incoming message queued to the socket, or a pending notification, + we signal POLLIN. + + POLLOUT is a little harder. 
Since you can essentially send + to any destination, RDS will always signal POLLOUT as long as + there's room on the send queue (ie the number of bytes queued + is less than the sendbuf size). + + However, the kernel will refuse to accept messages to + a destination marked congested - in this case you will loop + forever if you rely on poll to tell you what to do. + This isn't a trivial problem, but applications can deal with + this - by using congestion notifications, and by checking for + ENOBUFS errors returned by sendmsg. + + setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in) + This allows the application to discard all messages queued to a + specific destination on this particular socket. + + This allows the application to cancel outstanding messages if + it detects a timeout. For instance, if it tried to send a message, + and the remote host is unreachable, RDS will keep trying forever. + The application may decide it's not worth it, and cancel the + operation. In this case, it would use RDS_CANCEL_SENT_TO to + nuke any pending messages. + + ``setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..), getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)`` + Set or read an integer defining the underlying + encapsulating transport to be used for RDS packets on the + socket. When setting the option, integer argument may be + one of RDS_TRANS_TCP or RDS_TRANS_IB. When retrieving the + value, RDS_TRANS_NONE will be returned on an unbound socket. + This socket option may only be set exactly once on the socket, + prior to binding it via the bind(2) system call. Attempts to + set SO_RDS_TRANSPORT on a socket for which the transport has + been previously attached explicitly (by SO_RDS_TRANSPORT) or + implicitly (via bind(2)) will return an error of EOPNOTSUPP. + An attempt to set SO_RDS_TRANSPORT to RDS_TRANS_NONE will + always return EINVAL. 
+ +RDMA for RDS +============ + + see rds-rdma(7) manpage (available in rds-tools) + + +Congestion Notifications +======================== + + see rds(7) manpage + + +RDS Protocol +============ + + Message header + + The message header is a 'struct rds_header' (see rds.h): + + Fields: + + h_sequence: + per-packet sequence number + h_ack: + piggybacked acknowledgment of last packet received + h_len: + length of data, not including header + h_sport: + source port + h_dport: + destination port + h_flags: + Can be: + + ============= ================================== + CONG_BITMAP this is a congestion update bitmap + ACK_REQUIRED receiver must ack this packet + RETRANSMITTED packet has previously been sent + ============= ================================== + + h_credit: + indicate to other end of connection that + it has more credits available (i.e. there is + more send room) + h_padding[4]: + unused, for future use + h_csum: + header checksum + h_exthdr: + optional data can be passed here. This is currently used for + passing RDMA-related information. + + ACK and retransmit handling + + One might think that with reliable IB connections you wouldn't need + to ack messages that have been received. The problem is that IB + hardware generates an ack message before it has DMAed the message + into memory. This creates a potential message loss if the HCA is + disabled for any reason between when it sends the ack and before + the message is DMAed and processed. This is only a potential issue + if another HCA is available for fail-over. + + Sending an ack immediately would allow the sender to free the sent + message from their send queue quickly, but could cause excessive + traffic to be used for acks. RDS piggybacks acks on sent data + packets. Ack-only packets are reduced by only allowing one to be + in flight at a time, and by the sender only asking for acks when + its send buffers start to fill up. All retransmissions are also + acked. 
+ + Flow Control + + RDS's IB transport uses a credit-based mechanism to verify that + there is space in the peer's receive buffers for more data. This + eliminates the need for hardware retries on the connection. + + Congestion + + Messages waiting in the receive queue on the receiving socket + are accounted against the socket's SO_RCVBUF option value. Only + the payload bytes in the message are accounted for. If the + number of bytes queued equals or exceeds rcvbuf then the socket + is congested. All sends attempted to this socket's address + should block or return -EWOULDBLOCK. + + Applications are expected to be reasonably tuned such that this + situation very rarely occurs. An application encountering this + "back-pressure" is considered a bug. + + This is implemented by having each node maintain bitmaps which + indicate which ports on bound addresses are congested. As the + bitmap changes it is sent through all the connections which + terminate in the local address of the bitmap which changed. + + The bitmaps are allocated as connections are brought up. This + avoids allocation in the interrupt handling path which queues + messages on sockets. The dense bitmaps let transports send the + entire bitmap on any bitmap change reasonably efficiently. This + is much easier to implement than some finer-grained + communication of per-port congestion. The sender does a very + inexpensive bit test to test if the port it's about to send to + is congested or not. + + +RDS Transport Layer +=================== + + As mentioned above, RDS is not IB-specific. Its code is divided + into a general RDS layer and a transport layer. + + The general layer handles the socket API, congestion handling, + loopback, stats, usermem pinning, and the connection state machine. + + The transport layer handles the details of the transport. The IB + transport, for example, handles all the queue pairs, work requests, + CM event handlers, and other Infiniband details.
+ + +RDS Kernel Structures +===================== + + struct rds_message + aka possibly "rds_outgoing", the generic RDS layer copies data to + be sent and sets header fields as needed, based on the socket API. + This is then queued for the individual connection and sent by the + connection's transport. + + struct rds_incoming + a generic struct referring to incoming data that can be handed from + the transport to the general code and queued by the general code + while the socket is awoken. It is then passed back to the transport + code to handle the actual copy-to-user. + + struct rds_socket + per-socket information + + struct rds_connection + per-connection information + + struct rds_transport + pointers to transport-specific functions + + struct rds_statistics + non-transport-specific statistics + + struct rds_cong_map + wraps the raw congestion bitmap, contains rbnode, waitq, etc. + +Connection management +===================== + + Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and + ERROR states. + + The first time an attempt is made by an RDS socket to send data to + a node, a connection is allocated and connected. That connection is + then maintained forever -- if there are transport errors, the + connection will be dropped and re-established. + + Dropping a connection while packets are queued will cause queued or + partially-sent datagrams to be retransmitted when the connection is + re-established. + + +The send path +============= + + rds_sendmsg() + - struct rds_message built from incoming data + - CMSGs parsed (e.g. 
RDMA ops) + - transport connection alloced and connected if not already + - rds_message placed on send queue + - send worker awoken + + rds_send_worker() + - calls rds_send_xmit() until queue is empty + + rds_send_xmit() + - transmits congestion map if one is pending + - may set ACK_REQUIRED + - calls transport to send either non-RDMA or RDMA message + (RDMA ops never retransmitted) + + rds_ib_xmit() + - allocs work requests from send ring + - adds any new send credits available to peer (h_credits) + - maps the rds_message's sg list + - piggybacks ack + - populates work requests + - post send to connection's queue pair + +The recv path +============= + + rds_ib_recv_cq_comp_handler() + - looks at write completions + - unmaps recv buffer from device + - no errors, call rds_ib_process_recv() + - refill recv ring + + rds_ib_process_recv() + - validate header checksum + - copy header to rds_ib_incoming struct if start of a new datagram + - add to ibinc's fraglist + - if completed datagram: + - update cong map if datagram was cong update + - call rds_recv_incoming() otherwise + - note if ack is required + + rds_recv_incoming() + - drop duplicate packets + - respond to pings + - find the sock associated with this datagram + - add to sock queue + - wake up sock + - do some congestion calculations + rds_recvmsg + - copy data into user iovec + - handle CMSGs + - return to application + +Multipath RDS (mprds) +===================== + Mprds is multipathed-RDS, primarily intended for RDS-over-TCP + (though the concept can be extended to other transports). The classical + implementation of RDS-over-TCP is implemented by demultiplexing multiple + PF_RDS sockets between any 2 endpoints (where endpoint == [IP address, + port]) over a single TCP socket between the 2 IP addresses involved.
This + has the limitation that it ends up funneling multiple RDS flows over a + single TCP flow, thus it + (a) is upper-bounded to the single-flow bandwidth, + (b) suffers from head-of-line blocking for all the RDS sockets. + + Better throughput (for a fixed small packet size, MTU) can be achieved + by having multiple TCP/IP flows per rds/tcp connection, i.e., multipathed + RDS (mprds). Each such TCP/IP flow constitutes a path for the rds/tcp + connection. RDS sockets will be attached to a path based on some hash + (e.g., of local address and RDS port number) and packets for that RDS + socket will be sent over the attached path using TCP to segment/reassemble + RDS datagrams on that path. + + Multipathed RDS is implemented by splitting the struct rds_connection into + a common (to all paths) part, and a per-path struct rds_conn_path. All + I/O workqs and reconnect threads are driven from the rds_conn_path. + Transports such as TCP that are multipath capable may then set up a + TCP socket per rds_conn_path, and this is managed by the transport via + the transport private cp_transport_data pointer. + + Transports announce themselves as multipath capable by setting the + t_mp_capable bit during registration with the rds core module. When the + transport is multipath-capable, rds_sendmsg() hashes outgoing traffic + across multiple paths. The outgoing hash is computed based on the + local address and port that the PF_RDS socket is bound to. + + Additionally, even if the transport is MP capable, we may be + peering with some node that does not support mprds, or supports + a different number of paths. As a result, the peering nodes need + to agree on the number of paths to be used for the connection. + This is done by sending out a control packet exchange before the + first data packet. The control packet exchange must have completed + prior to outgoing hash completion in rds_sendmsg() when the transport + is multipath capable.
+ + The control packet is an RDS ping packet (i.e., packet to rds dest + port 0) with the ping packet having a rds extension header option of + type RDS_EXTHDR_NPATHS, length 2 bytes, and the value is the + number of paths supported by the sender. The "probe" ping packet will + get sent from some reserved port, RDS_FLAG_PROBE_PORT (in <linux/rds.h>). + The receiver of a ping from RDS_FLAG_PROBE_PORT will thus immediately + be able to compute the min(sender_paths, rcvr_paths). The pong + sent in response to a probe-ping should contain the rcvr's npaths + when the rcvr is mprds-capable. + + If the rcvr is not mprds-capable, the exthdr in the ping will be + ignored. In this case the pong will not have any exthdrs, so the sender + of the probe-ping can default to single-path mprds. + diff --git a/Documentation/networking/rds.txt b/Documentation/networking/rds.txt deleted file mode 100644 index eec61694e894..000000000000 --- a/Documentation/networking/rds.txt +++ /dev/null @@ -1,423 +0,0 @@ - -Overview -======== - -This readme tries to provide some background on the hows and whys of RDS, -and will hopefully help you find your way around the code. - -In addition, please see this email about RDS origins: -http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html - -RDS Architecture -================ - -RDS provides reliable, ordered datagram delivery by using a single -reliable connection between any two nodes in the cluster. This allows -applications to use a single socket to talk to any other process in the -cluster - so in a cluster with N processes you need N sockets, in contrast -to N*N if you use a connection-oriented socket transport like TCP. - -RDS is not Infiniband-specific; it was designed to support different -transports. The current implementation used to support RDS over TCP as well -as IB.
- -The high-level semantics of RDS from the application's point of view are - - * Addressing - RDS uses IPv4 addresses and 16bit port numbers to identify - the end point of a connection. All socket operations that involve - passing addresses between kernel and user space generally - use a struct sockaddr_in. - - The fact that IPv4 addresses are used does not mean the underlying - transport has to be IP-based. In fact, RDS over IB uses a - reliable IB connection; the IP address is used exclusively to - locate the remote node's GID (by ARPing for the given IP). - - The port space is entirely independent of UDP, TCP or any other - protocol. - - * Socket interface - RDS sockets work *mostly* as you would expect from a BSD - socket. The next section will cover the details. At any rate, - all I/O is performed through the standard BSD socket API. - Some additions like zerocopy support are implemented through - control messages, while other extensions use the getsockopt/ - setsockopt calls. - - Sockets must be bound before you can send or receive data. - This is needed because binding also selects a transport and - attaches it to the socket. Once bound, the transport assignment - does not change. RDS will tolerate IPs moving around (eg in - a active-active HA scenario), but only as long as the address - doesn't move to a different transport. - - * sysctls - RDS supports a number of sysctls in /proc/sys/net/rds - - -Socket Interface -================ - - AF_RDS, PF_RDS, SOL_RDS - AF_RDS and PF_RDS are the domain type to be used with socket(2) - to create RDS sockets. SOL_RDS is the socket-level to be used - with setsockopt(2) and getsockopt(2) for RDS specific socket - options. - - fd = socket(PF_RDS, SOCK_SEQPACKET, 0); - This creates a new, unbound RDS socket. - - setsockopt(SOL_SOCKET): send and receive buffer size - RDS honors the send and receive buffer size socket options. - You are not allowed to queue more than SO_SNDSIZE bytes to - a socket. 
A message is queued when sendmsg is called, and - it leaves the queue when the remote system acknowledges - its arrival. - - The SO_RCVSIZE option controls the maximum receive queue length. - This is a soft limit rather than a hard limit - RDS will - continue to accept and queue incoming messages, even if that - takes the queue length over the limit. However, it will also - mark the port as "congested" and send a congestion update to - the source node. The source node is supposed to throttle any - processes sending to this congested port. - - bind(fd, &sockaddr_in, ...) - This binds the socket to a local IP address and port, and a - transport, if one has not already been selected via the - SO_RDS_TRANSPORT socket option - - sendmsg(fd, ...) - Sends a message to the indicated recipient. The kernel will - transparently establish the underlying reliable connection - if it isn't up yet. - - An attempt to send a message that exceeds SO_SNDSIZE will - return with -EMSGSIZE - - An attempt to send a message that would take the total number - of queued bytes over the SO_SNDSIZE threshold will return - EAGAIN. - - An attempt to send a message to a destination that is marked - as "congested" will return ENOBUFS. - - recvmsg(fd, ...) - Receives a message that was queued to this socket. The sockets - recv queue accounting is adjusted, and if the queue length - drops below SO_SNDSIZE, the port is marked uncongested, and - a congestion update is sent to all peers. - - Applications can ask the RDS kernel module to receive - notifications via control messages (for instance, there is a - notification when a congestion update arrived, or when a RDMA - operation completes). These notifications are received through - the msg.msg_control buffer of struct msghdr. The format of the - messages is described in manpages. - - poll(fd) - RDS supports the poll interface to allow the application - to implement async I/O. - - POLLIN handling is pretty straightforward. 
When there's an - incoming message queued to the socket, or a pending notification, - we signal POLLIN. - - POLLOUT is a little harder. Since you can essentially send - to any destination, RDS will always signal POLLOUT as long as - there's room on the send queue (ie the number of bytes queued - is less than the sendbuf size). - - However, the kernel will refuse to accept messages to - a destination marked congested - in this case you will loop - forever if you rely on poll to tell you what to do. - This isn't a trivial problem, but applications can deal with - this - by using congestion notifications, and by checking for - ENOBUFS errors returned by sendmsg. - - setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in) - This allows the application to discard all messages queued to a - specific destination on this particular socket. - - This allows the application to cancel outstanding messages if - it detects a timeout. For instance, if it tried to send a message, - and the remote host is unreachable, RDS will keep trying forever. - The application may decide it's not worth it, and cancel the - operation. In this case, it would use RDS_CANCEL_SENT_TO to - nuke any pending messages. - - setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..) - getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..) - Set or read an integer defining the underlying - encapsulating transport to be used for RDS packets on the - socket. When setting the option, integer argument may be - one of RDS_TRANS_TCP or RDS_TRANS_IB. When retrieving the - value, RDS_TRANS_NONE will be returned on an unbound socket. - This socket option may only be set exactly once on the socket, - prior to binding it via the bind(2) system call. Attempts to - set SO_RDS_TRANSPORT on a socket for which the transport has - been previously attached explicitly (by SO_RDS_TRANSPORT) or - implicitly (via bind(2)) will return an error of EOPNOTSUPP. 
- An attempt to set SO_RDS_TRANSPORT to RDS_TRANS_NONE will - always return EINVAL. - -RDMA for RDS -============ - - see rds-rdma(7) manpage (available in rds-tools) - - -Congestion Notifications -======================== - - see rds(7) manpage - - -RDS Protocol -============ - - Message header - - The message header is a 'struct rds_header' (see rds.h): - Fields: - h_sequence: - per-packet sequence number - h_ack: - piggybacked acknowledgment of last packet received - h_len: - length of data, not including header - h_sport: - source port - h_dport: - destination port - h_flags: - CONG_BITMAP - this is a congestion update bitmap - ACK_REQUIRED - receiver must ack this packet - RETRANSMITTED - packet has previously been sent - h_credit: - indicate to other end of connection that - it has more credits available (i.e. there is - more send room) - h_padding[4]: - unused, for future use - h_csum: - header checksum - h_exthdr: - optional data can be passed here. This is currently used for - passing RDMA-related information. - - ACK and retransmit handling - - One might think that with reliable IB connections you wouldn't need - to ack messages that have been received. The problem is that IB - hardware generates an ack message before it has DMAed the message - into memory. This creates a potential message loss if the HCA is - disabled for any reason between when it sends the ack and before - the message is DMAed and processed. This is only a potential issue - if another HCA is available for fail-over. - - Sending an ack immediately would allow the sender to free the sent - message from their send queue quickly, but could cause excessive - traffic to be used for acks. RDS piggybacks acks on sent data - packets. Ack-only packets are reduced by only allowing one to be - in flight at a time, and by the sender only asking for acks when - its send buffers start to fill up. All retransmissions are also - acked. 
- - Flow Control - - RDS's IB transport uses a credit-based mechanism to verify that - there is space in the peer's receive buffers for more data. This - eliminates the need for hardware retries on the connection. - - Congestion - - Messages waiting in the receive queue on the receiving socket - are accounted against the sockets SO_RCVBUF option value. Only - the payload bytes in the message are accounted for. If the - number of bytes queued equals or exceeds rcvbuf then the socket - is congested. All sends attempted to this socket's address - should return block or return -EWOULDBLOCK. - - Applications are expected to be reasonably tuned such that this - situation very rarely occurs. An application encountering this - "back-pressure" is considered a bug. - - This is implemented by having each node maintain bitmaps which - indicate which ports on bound addresses are congested. As the - bitmap changes it is sent through all the connections which - terminate in the local address of the bitmap which changed. - - The bitmaps are allocated as connections are brought up. This - avoids allocation in the interrupt handling path which queues - sages on sockets. The dense bitmaps let transports send the - entire bitmap on any bitmap change reasonably efficiently. This - is much easier to implement than some finer-grained - communication of per-port congestion. The sender does a very - inexpensive bit test to test if the port it's about to send to - is congested or not. - - -RDS Transport Layer -================== - - As mentioned above, RDS is not IB-specific. Its code is divided - into a general RDS layer and a transport layer. - - The general layer handles the socket API, congestion handling, - loopback, stats, usermem pinning, and the connection state machine. - - The transport layer handles the details of the transport. The IB - transport, for example, handles all the queue pairs, work requests, - CM event handlers, and other Infiniband details. 
- - -RDS Kernel Structures -===================== - - struct rds_message - aka possibly "rds_outgoing", the generic RDS layer copies data to - be sent and sets header fields as needed, based on the socket API. - This is then queued for the individual connection and sent by the - connection's transport. - struct rds_incoming - a generic struct referring to incoming data that can be handed from - the transport to the general code and queued by the general code - while the socket is awoken. It is then passed back to the transport - code to handle the actual copy-to-user. - struct rds_socket - per-socket information - struct rds_connection - per-connection information - struct rds_transport - pointers to transport-specific functions - struct rds_statistics - non-transport-specific statistics - struct rds_cong_map - wraps the raw congestion bitmap, contains rbnode, waitq, etc. - -Connection management -===================== - - Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and - ERROR states. - - The first time an attempt is made by an RDS socket to send data to - a node, a connection is allocated and connected. That connection is - then maintained forever -- if there are transport errors, the - connection will be dropped and re-established. - - Dropping a connection while packets are queued will cause queued or - partially-sent datagrams to be retransmitted when the connection is - re-established. - - -The send path -============= - - rds_sendmsg() - struct rds_message built from incoming data - CMSGs parsed (e.g. 
RDMA ops) - transport connection alloced and connected if not already - rds_message placed on send queue - send worker awoken - rds_send_worker() - calls rds_send_xmit() until queue is empty - rds_send_xmit() - transmits congestion map if one is pending - may set ACK_REQUIRED - calls transport to send either non-RDMA or RDMA message - (RDMA ops never retransmitted) - rds_ib_xmit() - allocs work requests from send ring - adds any new send credits available to peer (h_credits) - maps the rds_message's sg list - piggybacks ack - populates work requests - post send to connection's queue pair - -The recv path -============= - - rds_ib_recv_cq_comp_handler() - looks at write completions - unmaps recv buffer from device - no errors, call rds_ib_process_recv() - refill recv ring - rds_ib_process_recv() - validate header checksum - copy header to rds_ib_incoming struct if start of a new datagram - add to ibinc's fraglist - if competed datagram: - update cong map if datagram was cong update - call rds_recv_incoming() otherwise - note if ack is required - rds_recv_incoming() - drop duplicate packets - respond to pings - find the sock associated with this datagram - add to sock queue - wake up sock - do some congestion calculations - rds_recvmsg - copy data into user iovec - handle CMSGs - return to application - -Multipath RDS (mprds) -===================== - Mprds is multipathed-RDS, primarily intended for RDS-over-TCP - (though the concept can be extended to other transports). The classical - implementation of RDS-over-TCP is implemented by demultiplexing multiple - PF_RDS sockets between any 2 endpoints (where endpoint == [IP address, - port]) over a single TCP socket between the 2 IP addresses involved. This - has the limitation that it ends up funneling multiple RDS flows over a - single TCP flow, thus it is - (a) upper-bounded to the single-flow bandwidth, - (b) suffers from head-of-line blocking for all the RDS sockets. 
- - Better throughput (for a fixed small packet size, MTU) can be achieved - by having multiple TCP/IP flows per rds/tcp connection, i.e., multipathed - RDS (mprds). Each such TCP/IP flow constitutes a path for the rds/tcp - connection. RDS sockets will be attached to a path based on some hash - (e.g., of local address and RDS port number) and packets for that RDS - socket will be sent over the attached path using TCP to segment/reassemble - RDS datagrams on that path. - - Multipathed RDS is implemented by splitting the struct rds_connection into - a common (to all paths) part, and a per-path struct rds_conn_path. All - I/O workqs and reconnect threads are driven from the rds_conn_path. - Transports such as TCP that are multipath capable may then set up a - TCP socket per rds_conn_path, and this is managed by the transport via - the transport privatee cp_transport_data pointer. - - Transports announce themselves as multipath capable by setting the - t_mp_capable bit during registration with the rds core module. When the - transport is multipath-capable, rds_sendmsg() hashes outgoing traffic - across multiple paths. The outgoing hash is computed based on the - local address and port that the PF_RDS socket is bound to. - - Additionally, even if the transport is MP capable, we may be - peering with some node that does not support mprds, or supports - a different number of paths. As a result, the peering nodes need - to agree on the number of paths to be used for the connection. - This is done by sending out a control packet exchange before the - first data packet. The control packet exchange must have completed - prior to outgoing hash completion in rds_sendmsg() when the transport - is mutlipath capable. - - The control packet is an RDS ping packet (i.e., packet to rds dest - port 0) with the ping packet having a rds extension header option of - type RDS_EXTHDR_NPATHS, length 2 bytes, and the value is the - number of paths supported by the sender. 
The "probe" ping packet will - get sent from some reserved port, RDS_FLAG_PROBE_PORT (in ) - The receiver of a ping from RDS_FLAG_PROBE_PORT will thus immediately - be able to compute the min(sender_paths, rcvr_paths). The pong - sent in response to a probe-ping should contain the rcvr's npaths - when the rcvr is mprds-capable. - - If the rcvr is not mprds-capable, the exthdr in the ping will be - ignored. In this case the pong will not have any exthdrs, so the sender - of the probe-ping can default to single-path mprds. - diff --git a/MAINTAINERS b/MAINTAINERS index 785f56e5f210..ea5dd3d1df9d 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -14219,7 +14219,7 @@ L: linux-rdma@vger.kernel.org L: rds-devel@oss.oracle.com (moderated for non-subscribers) S: Supported W: https://oss.oracle.com/projects/rds/ -F: Documentation/networking/rds.txt +F: Documentation/networking/rds.rst F: net/rds/ RDT - RESOURCE ALLOCATION -- cgit v1.2.3 From 98661e0c579dbda0e0910185f752fddd95e2d29c Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:04:20 +0200 Subject: docs: networking: convert regulatory.txt to ReST - add SPDX header; - adjust title markup; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. 
Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/regulatory.rst | 209 ++++++++++++++++++++++++++++++++ Documentation/networking/regulatory.txt | 204 ------------------------------- MAINTAINERS | 2 +- 4 files changed, 211 insertions(+), 205 deletions(-) create mode 100644 Documentation/networking/regulatory.rst delete mode 100644 Documentation/networking/regulatory.txt (limited to 'MAINTAINERS') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index e63a2cb2e4cb..bc3b04a2edde 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -98,6 +98,7 @@ Contents: radiotap-headers ray_cs rds + regulatory .. only:: subproject and html diff --git a/Documentation/networking/regulatory.rst b/Documentation/networking/regulatory.rst new file mode 100644 index 000000000000..8701b91e81ee --- /dev/null +++ b/Documentation/networking/regulatory.rst @@ -0,0 +1,209 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================================= +Linux wireless regulatory documentation +======================================= + +This document gives a brief review over how the Linux wireless +regulatory infrastructure works. + +More up to date information can be obtained at the project's web page: + +http://wireless.kernel.org/en/developers/Regulatory + +Keeping regulatory domains in userspace +--------------------------------------- + +Due to the dynamic nature of regulatory domains we keep them +in userspace and provide a framework for userspace to upload +to the kernel one regulatory domain to be used as the central +core regulatory domain all wireless devices should adhere to. + +How to get regulatory domains to the kernel +------------------------------------------- + +When the regulatory domain is first set up, the kernel will request a +database file (regulatory.db) containing all the regulatory rules. 
It +will then use that database when it needs to look up the rules for a +given country. + +How to get regulatory domains to the kernel (old CRDA solution) +--------------------------------------------------------------- + +Userspace gets a regulatory domain in the kernel by having +a userspace agent build it and send it via nl80211. Only +expected regulatory domains will be respected by the kernel. + +A currently available userspace agent which can accomplish this +is CRDA - central regulatory domain agent. Its documented here: + +http://wireless.kernel.org/en/developers/Regulatory/CRDA + +Essentially the kernel will send a udev event when it knows +it needs a new regulatory domain. A udev rule can be put in place +to trigger crda to send the respective regulatory domain for a +specific ISO/IEC 3166 alpha2. + +Below is an example udev rule which can be used: + +# Example file, should be put in /etc/udev/rules.d/regulatory.rules +KERNEL=="regulatory*", ACTION=="change", SUBSYSTEM=="platform", RUN+="/sbin/crda" + +The alpha2 is passed as an environment variable under the variable COUNTRY. + +Who asks for regulatory domains? +-------------------------------- + +* Users + +Users can use iw: + +http://wireless.kernel.org/en/users/Documentation/iw + +An example:: + + # set regulatory domain to "Costa Rica" + iw reg set CR + +This will request the kernel to set the regulatory domain to +the specificied alpha2. The kernel in turn will then ask userspace +to provide a regulatory domain for the alpha2 specified by the user +by sending a uevent. + +* Wireless subsystems for Country Information elements + +The kernel will send a uevent to inform userspace a new +regulatory domain is required. More on this to be added +as its integration is added. + +* Drivers + +If drivers determine they need a specific regulatory domain +set they can inform the wireless core using regulatory_hint(). 
+They have two options -- they either provide an alpha2 so that +crda can provide back a regulatory domain for that country or +they can build their own regulatory domain based on internal +custom knowledge so the wireless core can respect it. + +*Most* drivers will rely on the first mechanism of providing a +regulatory hint with an alpha2. For these drivers there is an additional +check that can be used to ensure compliance based on custom EEPROM +regulatory data. This additional check can be used by drivers by +registering on its struct wiphy a reg_notifier() callback. This notifier +is called when the core's regulatory domain has been changed. The driver +can use this to review the changes made and also review who made them +(driver, user, country IE) and determine what to allow based on its +internal EEPROM data. Devices drivers wishing to be capable of world +roaming should use this callback. More on world roaming will be +added to this document when its support is enabled. + +Device drivers who provide their own built regulatory domain +do not need a callback as the channels registered by them are +the only ones that will be allowed and therefore *additional* +channels cannot be enabled. + +Example code - drivers hinting an alpha2: +------------------------------------------ + +This example comes from the zd1211rw device driver. 
You can start +by having a mapping of your device's EEPROM country/regulatory +domain value to a specific alpha2 as follows:: + + static struct zd_reg_alpha2_map reg_alpha2_map[] = { + { ZD_REGDOMAIN_FCC, "US" }, + { ZD_REGDOMAIN_IC, "CA" }, + { ZD_REGDOMAIN_ETSI, "DE" }, /* Generic ETSI, use most restrictive */ + { ZD_REGDOMAIN_JAPAN, "JP" }, + { ZD_REGDOMAIN_JAPAN_ADD, "JP" }, + { ZD_REGDOMAIN_SPAIN, "ES" }, + { ZD_REGDOMAIN_FRANCE, "FR" }, + +Then you can define a routine to map your read EEPROM value to an alpha2, +as follows:: + + static int zd_reg2alpha2(u8 regdomain, char *alpha2) + { + unsigned int i; + struct zd_reg_alpha2_map *reg_map; + for (i = 0; i < ARRAY_SIZE(reg_alpha2_map); i++) { + reg_map = &reg_alpha2_map[i]; + if (regdomain == reg_map->reg) { + alpha2[0] = reg_map->alpha2[0]; + alpha2[1] = reg_map->alpha2[1]; + return 0; + } + } + return 1; + } + +Lastly, you can then hint to the core of your discovered alpha2, if a match +was found. You need to do this after you have registered your wiphy. You +are expected to do this during initialization. + +:: + + r = zd_reg2alpha2(mac->regdomain, alpha2); + if (!r) + regulatory_hint(hw->wiphy, alpha2); + +Example code - drivers providing a built in regulatory domain: +-------------------------------------------------------------- + +[NOTE: This API is not currently available, it can be added when required] + +If you have regulatory information you can obtain from your +driver and you *need* to use this we let you build a regulatory domain +structure and pass it to the wireless core. To do this you should +kmalloc() a structure big enough to hold your regulatory domain +structure and you should then fill it with your data. Finally you simply +call regulatory_hint() with the regulatory domain structure in it. + +Bellow is a simple example, with a regulatory domain cached using the stack. +Your implementation may vary (read EEPROM cache instead, for example).
+ +Example cache of some regulatory domain:: + + struct ieee80211_regdomain mydriver_jp_regdom = { + .n_reg_rules = 3, + .alpha2 = "JP", + //.alpha2 = "99", /* If I have no alpha2 to map it to */ + .reg_rules = { + /* IEEE 802.11b/g, channels 1..14 */ + REG_RULE(2412-10, 2484+10, 40, 6, 20, 0), + /* IEEE 802.11a, channels 34..48 */ + REG_RULE(5170-10, 5240+10, 40, 6, 20, + NL80211_RRF_NO_IR), + /* IEEE 802.11a, channels 52..64 */ + REG_RULE(5260-10, 5320+10, 40, 6, 20, + NL80211_RRF_NO_IR| + NL80211_RRF_DFS), + } + }; + +Then in some part of your code after your wiphy has been registered:: + + struct ieee80211_regdomain *rd; + int size_of_regd; + int num_rules = mydriver_jp_regdom.n_reg_rules; + unsigned int i; + + size_of_regd = sizeof(struct ieee80211_regdomain) + + (num_rules * sizeof(struct ieee80211_reg_rule)); + + rd = kzalloc(size_of_regd, GFP_KERNEL); + if (!rd) + return -ENOMEM; + + memcpy(rd, &mydriver_jp_regdom, sizeof(struct ieee80211_regdomain)); + + for (i=0; i < num_rules; i++) + memcpy(&rd->reg_rules[i], + &mydriver_jp_regdom.reg_rules[i], + sizeof(struct ieee80211_reg_rule)); + regulatory_struct_hint(rd); + +Statically compiled regulatory database +--------------------------------------- + +When a database should be fixed into the kernel, it can be provided as a +firmware file at build time that is then linked into the kernel. diff --git a/Documentation/networking/regulatory.txt b/Documentation/networking/regulatory.txt deleted file mode 100644 index 381e5b23d61d..000000000000 --- a/Documentation/networking/regulatory.txt +++ /dev/null @@ -1,204 +0,0 @@ -Linux wireless regulatory documentation ---------------------------------------- - -This document gives a brief review over how the Linux wireless -regulatory infrastructure works. 
- -More up to date information can be obtained at the project's web page: - -http://wireless.kernel.org/en/developers/Regulatory - -Keeping regulatory domains in userspace ---------------------------------------- - -Due to the dynamic nature of regulatory domains we keep them -in userspace and provide a framework for userspace to upload -to the kernel one regulatory domain to be used as the central -core regulatory domain all wireless devices should adhere to. - -How to get regulatory domains to the kernel -------------------------------------------- - -When the regulatory domain is first set up, the kernel will request a -database file (regulatory.db) containing all the regulatory rules. It -will then use that database when it needs to look up the rules for a -given country. - -How to get regulatory domains to the kernel (old CRDA solution) ---------------------------------------------------------------- - -Userspace gets a regulatory domain in the kernel by having -a userspace agent build it and send it via nl80211. Only -expected regulatory domains will be respected by the kernel. - -A currently available userspace agent which can accomplish this -is CRDA - central regulatory domain agent. Its documented here: - -http://wireless.kernel.org/en/developers/Regulatory/CRDA - -Essentially the kernel will send a udev event when it knows -it needs a new regulatory domain. A udev rule can be put in place -to trigger crda to send the respective regulatory domain for a -specific ISO/IEC 3166 alpha2. - -Below is an example udev rule which can be used: - -# Example file, should be put in /etc/udev/rules.d/regulatory.rules -KERNEL=="regulatory*", ACTION=="change", SUBSYSTEM=="platform", RUN+="/sbin/crda" - -The alpha2 is passed as an environment variable under the variable COUNTRY. - -Who asks for regulatory domains? 
--------------------------------- - -* Users - -Users can use iw: - -http://wireless.kernel.org/en/users/Documentation/iw - -An example: - - # set regulatory domain to "Costa Rica" - iw reg set CR - -This will request the kernel to set the regulatory domain to -the specificied alpha2. The kernel in turn will then ask userspace -to provide a regulatory domain for the alpha2 specified by the user -by sending a uevent. - -* Wireless subsystems for Country Information elements - -The kernel will send a uevent to inform userspace a new -regulatory domain is required. More on this to be added -as its integration is added. - -* Drivers - -If drivers determine they need a specific regulatory domain -set they can inform the wireless core using regulatory_hint(). -They have two options -- they either provide an alpha2 so that -crda can provide back a regulatory domain for that country or -they can build their own regulatory domain based on internal -custom knowledge so the wireless core can respect it. - -*Most* drivers will rely on the first mechanism of providing a -regulatory hint with an alpha2. For these drivers there is an additional -check that can be used to ensure compliance based on custom EEPROM -regulatory data. This additional check can be used by drivers by -registering on its struct wiphy a reg_notifier() callback. This notifier -is called when the core's regulatory domain has been changed. The driver -can use this to review the changes made and also review who made them -(driver, user, country IE) and determine what to allow based on its -internal EEPROM data. Devices drivers wishing to be capable of world -roaming should use this callback. More on world roaming will be -added to this document when its support is enabled. - -Device drivers who provide their own built regulatory domain -do not need a callback as the channels registered by them are -the only ones that will be allowed and therefore *additional* -channels cannot be enabled. 
- -Example code - drivers hinting an alpha2: ------------------------------------------- - -This example comes from the zd1211rw device driver. You can start -by having a mapping of your device's EEPROM country/regulatory -domain value to a specific alpha2 as follows: - -static struct zd_reg_alpha2_map reg_alpha2_map[] = { - { ZD_REGDOMAIN_FCC, "US" }, - { ZD_REGDOMAIN_IC, "CA" }, - { ZD_REGDOMAIN_ETSI, "DE" }, /* Generic ETSI, use most restrictive */ - { ZD_REGDOMAIN_JAPAN, "JP" }, - { ZD_REGDOMAIN_JAPAN_ADD, "JP" }, - { ZD_REGDOMAIN_SPAIN, "ES" }, - { ZD_REGDOMAIN_FRANCE, "FR" }, - -Then you can define a routine to map your read EEPROM value to an alpha2, -as follows: - -static int zd_reg2alpha2(u8 regdomain, char *alpha2) -{ - unsigned int i; - struct zd_reg_alpha2_map *reg_map; - for (i = 0; i < ARRAY_SIZE(reg_alpha2_map); i++) { - reg_map = &reg_alpha2_map[i]; - if (regdomain == reg_map->reg) { - alpha2[0] = reg_map->alpha2[0]; - alpha2[1] = reg_map->alpha2[1]; - return 0; - } - } - return 1; -} - -Lastly, you can then hint to the core of your discovered alpha2, if a match -was found. You need to do this after you have registered your wiphy. You -are expected to do this during initialization. - - r = zd_reg2alpha2(mac->regdomain, alpha2); - if (!r) - regulatory_hint(hw->wiphy, alpha2); - -Example code - drivers providing a built in regulatory domain: --------------------------------------------------------------- - -[NOTE: This API is not currently available, it can be added when required] - -If you have regulatory information you can obtain from your -driver and you *need* to use this we let you build a regulatory domain -structure and pass it to the wireless core. To do this you should -kmalloc() a structure big enough to hold your regulatory domain -structure and you should then fill it with your data. Finally you simply -call regulatory_hint() with the regulatory domain structure in it.
- -Bellow is a simple example, with a regulatory domain cached using the stack. -Your implementation may vary (read EEPROM cache instead, for example). - -Example cache of some regulatory domain - -struct ieee80211_regdomain mydriver_jp_regdom = { - .n_reg_rules = 3, - .alpha2 = "JP", - //.alpha2 = "99", /* If I have no alpha2 to map it to */ - .reg_rules = { - /* IEEE 802.11b/g, channels 1..14 */ - REG_RULE(2412-10, 2484+10, 40, 6, 20, 0), - /* IEEE 802.11a, channels 34..48 */ - REG_RULE(5170-10, 5240+10, 40, 6, 20, - NL80211_RRF_NO_IR), - /* IEEE 802.11a, channels 52..64 */ - REG_RULE(5260-10, 5320+10, 40, 6, 20, - NL80211_RRF_NO_IR| - NL80211_RRF_DFS), - } -}; - -Then in some part of your code after your wiphy has been registered: - - struct ieee80211_regdomain *rd; - int size_of_regd; - int num_rules = mydriver_jp_regdom.n_reg_rules; - unsigned int i; - - size_of_regd = sizeof(struct ieee80211_regdomain) + - (num_rules * sizeof(struct ieee80211_reg_rule)); - - rd = kzalloc(size_of_regd, GFP_KERNEL); - if (!rd) - return -ENOMEM; - - memcpy(rd, &mydriver_jp_regdom, sizeof(struct ieee80211_regdomain)); - - for (i=0; i < num_rules; i++) - memcpy(&rd->reg_rules[i], - &mydriver_jp_regdom.reg_rules[i], - sizeof(struct ieee80211_reg_rule)); - regulatory_struct_hint(rd); - -Statically compiled regulatory database ---------------------------------------- - -When a database should be fixed into the kernel, it can be provided as a -firmware file at build time that is then linked into the kernel. 
diff --git a/MAINTAINERS b/MAINTAINERS index ea5dd3d1df9d..b28823ab48c5 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -193,7 +193,7 @@ W: https://wireless.wiki.kernel.org/ T: git git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211.git T: git git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next.git F: Documentation/driver-api/80211/cfg80211.rst -F: Documentation/networking/regulatory.txt +F: Documentation/networking/regulatory.rst F: include/linux/ieee80211.h F: include/net/cfg80211.h F: include/net/ieee80211_radiotap.h -- cgit v1.2.3 From 9f72374cb5959556870be8078b128158edde5d3e Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:04:21 +0200 Subject: docs: networking: convert rxrpc.txt to ReST - add SPDX header; - adjust title markup; - use autonumbered list markups; - mark code blocks and literals as such; - mark tables as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/filesystems/afs.rst | 2 +- Documentation/networking/index.rst | 1 + Documentation/networking/rxrpc.rst | 1169 ++++++++++++++++++++++++++++++++++++ Documentation/networking/rxrpc.txt | 1155 ----------------------------------- MAINTAINERS | 2 +- net/rxrpc/Kconfig | 6 +- net/rxrpc/sysctl.c | 2 +- 7 files changed, 1176 insertions(+), 1161 deletions(-) create mode 100644 Documentation/networking/rxrpc.rst delete mode 100644 Documentation/networking/rxrpc.txt (limited to 'MAINTAINERS') diff --git a/Documentation/filesystems/afs.rst b/Documentation/filesystems/afs.rst index c4ec39a5966e..cada9464d6bd 100644 --- a/Documentation/filesystems/afs.rst +++ b/Documentation/filesystems/afs.rst @@ -70,7 +70,7 @@ list of volume location server IP addresses:: The first module is the AF_RXRPC network protocol driver. This provides the RxRPC remote operation protocol and may also be accessed from userspace. 
See: - Documentation/networking/rxrpc.txt + Documentation/networking/rxrpc.rst The second module is the kerberos RxRPC security driver, and the third module is the actual filesystem driver for the AFS filesystem. diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index bc3b04a2edde..cd307b9601fa 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -99,6 +99,7 @@ Contents: ray_cs rds regulatory + rxrpc .. only:: subproject and html diff --git a/Documentation/networking/rxrpc.rst b/Documentation/networking/rxrpc.rst new file mode 100644 index 000000000000..5ad35113d0f4 --- /dev/null +++ b/Documentation/networking/rxrpc.rst @@ -0,0 +1,1169 @@ +.. SPDX-License-Identifier: GPL-2.0 + +====================== +RxRPC Network Protocol +====================== + +The RxRPC protocol driver provides a reliable two-phase transport on top of UDP +that can be used to perform RxRPC remote operations. This is done over sockets +of AF_RXRPC family, using sendmsg() and recvmsg() with control data to send and +receive data, aborts and errors. + +Contents of this document: + + (#) Overview. + + (#) RxRPC protocol summary. + + (#) AF_RXRPC driver model. + + (#) Control messages. + + (#) Socket options. + + (#) Security. + + (#) Example client usage. + + (#) Example server usage. + + (#) AF_RXRPC kernel interface. + + (#) Configurable parameters. + + +Overview +======== + +RxRPC is a two-layer protocol. 
There is a session layer which provides +reliable virtual connections using UDP over IPv4 (or IPv6) as the transport +layer, but implements a real network protocol; and there's the presentation +layer which renders structured data to binary blobs and back again using XDR +(as does SunRPC):: + + +-------------+ + | Application | + +-------------+ + | XDR | Presentation + +-------------+ + | RxRPC | Session + +-------------+ + | UDP | Transport + +-------------+ + + +AF_RXRPC provides: + + (1) Part of an RxRPC facility for both kernel and userspace applications by + making the session part of it a Linux network protocol (AF_RXRPC). + + (2) A two-phase protocol. The client transmits a blob (the request) and then + receives a blob (the reply), and the server receives the request and then + transmits the reply. + + (3) Retention of the reusable bits of the transport system set up for one call + to speed up subsequent calls. + + (4) A secure protocol, using the Linux kernel's key retention facility to + manage security on the client end. The server end must of necessity be + more active in security negotiations. + +AF_RXRPC does not provide XDR marshalling/presentation facilities. That is +left to the application. AF_RXRPC only deals in blobs. Even the operation ID +is just the first four bytes of the request blob, and as such is beyond the +kernel's interest. + + +Sockets of AF_RXRPC family are: + + (1) created as type SOCK_DGRAM; + + (2) provided with a protocol of the type of underlying transport they're going + to use - currently only PF_INET is supported. + + +The Andrew File System (AFS) is an example of an application that uses this and +that has both kernel (filesystem) and userspace (utility) components. + + +RxRPC Protocol Summary +====================== + +An overview of the RxRPC protocol: + + (#) RxRPC sits on top of another networking protocol (UDP is the only option + currently), and uses this to provide network transport. 
UDP ports, for + example, provide transport endpoints. + + (#) RxRPC supports multiple virtual "connections" from any given transport + endpoint, thus allowing the endpoints to be shared, even to the same + remote endpoint. + + (#) Each connection goes to a particular "service". A connection may not go + to multiple services. A service may be considered the RxRPC equivalent of + a port number. AF_RXRPC permits multiple services to share an endpoint. + + (#) Client-originating packets are marked, thus a transport endpoint can be + shared between client and server connections (connections have a + direction). + + (#) Up to a billion connections may be supported concurrently between one + local transport endpoint and one service on one remote endpoint. An RxRPC + connection is described by seven numbers:: + + Local address } + Local port } Transport (UDP) address + Remote address } + Remote port } + Direction + Connection ID + Service ID + + (#) Each RxRPC operation is a "call". A connection may make up to four + billion calls, but only up to four calls may be in progress on a + connection at any one time. + + (#) Calls are two-phase and asymmetric: the client sends its request data, + which the service receives; then the service transmits the reply data + which the client receives. + + (#) The data blobs are of indefinite size, the end of a phase is marked with a + flag in the packet. The number of packets of data making up one blob may + not exceed 4 billion, however, as this would cause the sequence number to + wrap. + + (#) The first four bytes of the request data are the service operation ID. + + (#) Security is negotiated on a per-connection basis. The connection is + initiated by the first data packet on it arriving. If security is + requested, the server then issues a "challenge" and then the client + replies with a "response". 
If the response is successful, the security is + set for the lifetime of that connection, and all subsequent calls made + upon it use that same security. In the event that the server lets a + connection lapse before the client, the security will be renegotiated if + the client uses the connection again. + + (#) Calls use ACK packets to handle reliability. Data packets are also + explicitly sequenced per call. + + (#) There are two types of positive acknowledgment: hard-ACKs and soft-ACKs. + A hard-ACK indicates to the far side that all the data received to a point + has been received and processed; a soft-ACK indicates that the data has + been received but may yet be discarded and re-requested. The sender may + not discard any transmittable packets until they've been hard-ACK'd. + + (#) Reception of a reply data packet implicitly hard-ACK's all the data + packets that make up the request. + + (#) A call is complete when the request has been sent, the reply has been + received and the final hard-ACK on the last packet of the reply has + reached the server. + + (#) A call may be aborted by either end at any time up to its completion. + + +AF_RXRPC Driver Model +===================== + +About the AF_RXRPC driver: + + (#) The AF_RXRPC protocol transparently uses internal sockets of the transport + protocol to represent transport endpoints. + + (#) AF_RXRPC sockets map onto RxRPC connection bundles. Actual RxRPC + connections are handled transparently. One client socket may be used to + make multiple simultaneous calls to the same service. One server socket + may handle calls from many clients. + + (#) Additional parallel client connections will be initiated to support extra + concurrent calls, up to a tunable limit. + + (#) Each connection is retained for a certain amount of time [tunable] after + the last call currently using it has completed in case a new call is made + that could reuse it.
+ + (#) Each internal UDP socket is retained [tunable] for a certain amount of + time [tunable] after the last connection using it is discarded, in case a new + connection is made that could use it. + + (#) A client-side connection is only shared between calls if they have + the same key struct describing their security (and assuming the calls + would otherwise share the connection). Non-secured calls would also be + able to share connections with each other. + + (#) A server-side connection is shared if the client says it is. + + (#) ACK'ing is handled by the protocol driver automatically, including ping + replying. + + (#) SO_KEEPALIVE automatically pings the other side to keep the connection + alive [TODO]. + + (#) If an ICMP error is received, all calls affected by that error will be + aborted with an appropriate network error passed through recvmsg(). + + +Interaction with the user of the RxRPC socket: + + (#) A socket is made into a server socket by binding an address with a + non-zero service ID. + + (#) In the client, sending a request is achieved with one or more sendmsgs, + followed by the reply being received with one or more recvmsgs. + + (#) The first sendmsg for a request to be sent from a client contains a tag to + be used in all other sendmsgs or recvmsgs associated with that call. The + tag is carried in the control data. + + (#) connect() is used to supply a default destination address for a client + socket. This may be overridden by supplying an alternate address to the + first sendmsg() of a call (struct msghdr::msg_name). + + (#) If connect() is called on an unbound client, a random local port will + be bound before the operation takes place. + + (#) A server socket may also be used to make client calls. To do this, the + first sendmsg() of the call must specify the target address. The server's + transport endpoint is used to send the packets.
+ + (#) Once the application has received the last message associated with a call, + the tag is guaranteed not to be seen again, and so it can be used to pin + client resources. A new call can then be initiated with the same tag + without fear of interference. + + (#) In the server, a request is received with one or more recvmsgs, then the + reply is transmitted with one or more sendmsgs, and then the final ACK + is received with a last recvmsg. + + (#) When sending data for a call, sendmsg is given MSG_MORE if there's more + data to come on that call. + + (#) When receiving data for a call, recvmsg flags MSG_MORE if there's more + data to come for that call. + + (#) When receiving data or messages for a call, MSG_EOR is flagged by recvmsg + to indicate the terminal message for that call. + + (#) A call may be aborted by adding an abort control message to the control + data. Issuing an abort terminates the kernel's use of that call's tag. + Any messages waiting in the receive queue for that call will be discarded. + + (#) Aborts, busy notifications and challenge packets are delivered by recvmsg, + and control data messages will be set to indicate the context. Receiving + an abort or a busy message terminates the kernel's use of that call's tag. + + (#) The control data part of the msghdr struct is used for a number of things: + + (#) The tag of the intended or affected call. + + (#) Sending or receiving errors, aborts and busy notifications. + + (#) Notifications of incoming calls. + + (#) Sending debug requests and receiving debug replies [TODO]. + + (#) When the kernel has received and set up an incoming call, it sends a + message to the server application to let it know there's a new call awaiting + its acceptance [recvmsg reports a special control message]. The server + application then uses sendmsg to assign a tag to the new call. Once that + is done, the first part of the request data will be delivered by recvmsg.
+ + (#) The server application has to provide the server socket with a keyring of + secret keys corresponding to the security types it permits. When a secure + connection is being set up, the kernel looks up the appropriate secret key + in the keyring and then sends a challenge packet to the client and + receives a response packet. The kernel then checks the authorisation of + the packet and either aborts the connection or sets up the security. + + (#) The name of the key a client will use to secure its communications is + nominated by a socket option. + + +Notes on sendmsg: + + (#) MSG_WAITALL can be set to tell sendmsg to ignore signals if the peer is + making progress at accepting packets within a reasonable time such that we + manage to queue up all the data for transmission. This requires the + client to accept at least one packet per 2*RTT time period. + + If this isn't set, sendmsg() will return immediately, either returning + EINTR/ERESTARTSYS if nothing was consumed or returning the amount of data + consumed. + + +Notes on recvmsg: + + (#) If there's a sequence of data messages belonging to a particular call on + the receive queue, then recvmsg will keep working through them until: + + (a) it meets the end of that call's received data, + + (b) it meets a non-data message, + + (c) it meets a message belonging to a different call, or + + (d) it fills the user buffer. + + If recvmsg is called in blocking mode, it will keep sleeping, awaiting the + reception of further data, until one of the above four conditions is met. + + (2) MSG_PEEK operates similarly, but will return immediately if it has put any + data in the buffer rather than sleeping until it can fill the buffer. + + (3) If a data message is only partially consumed in filling a user buffer, + then the remainder of that message will be left on the front of the queue + for the next taker. MSG_TRUNC will never be flagged. 
+ + (4) If there is more data to be had on a call (it hasn't copied the last byte + of the last data message in that phase yet), then MSG_MORE will be + flagged. + + +Control Messages +================ + +AF_RXRPC makes use of control messages in sendmsg() and recvmsg() to multiplex +calls, to invoke certain actions and to report certain conditions. These are: + + ======================= === =========== =============================== + MESSAGE ID SRT DATA MEANING + ======================= === =========== =============================== + RXRPC_USER_CALL_ID sr- User ID App's call specifier + RXRPC_ABORT srt Abort code Abort code to issue/received + RXRPC_ACK -rt n/a Final ACK received + RXRPC_NET_ERROR -rt error num Network error on call + RXRPC_BUSY -rt n/a Call rejected (server busy) + RXRPC_LOCAL_ERROR -rt error num Local error encountered + RXRPC_NEW_CALL -r- n/a New call received + RXRPC_ACCEPT s-- n/a Accept new call + RXRPC_EXCLUSIVE_CALL s-- n/a Make an exclusive client call + RXRPC_UPGRADE_SERVICE s-- n/a Client call can be upgraded + RXRPC_TX_LENGTH s-- data len Total length of Tx data + ======================= === =========== =============================== + + (SRT = usable in Sendmsg / delivered by Recvmsg / Terminal message) + + (#) RXRPC_USER_CALL_ID + + This is used to indicate the application's call ID. It's an unsigned long + that the app specifies in the client by attaching it to the first data + message or in the server by passing it in association with an RXRPC_ACCEPT + message. recvmsg() passes it in conjunction with all messages except + those of the RXRPC_NEW_CALL message. + + (#) RXRPC_ABORT + + This can be used by an application to abort a call by passing it to + sendmsg, or it can be delivered by recvmsg to indicate a remote abort was + received. Either way, it must be associated with an RXRPC_USER_CALL_ID to + specify the call affected.
If an abort is being sent, then error EBADSLT + will be returned if there is no call with that user ID. + + (#) RXRPC_ACK + + This is delivered to a server application to indicate that the final ACK + of a call was received from the client. It will be associated with an + RXRPC_USER_CALL_ID to indicate the call that's now complete. + + (#) RXRPC_NET_ERROR + + This is delivered to an application to indicate that an ICMP error message + was encountered in the process of trying to talk to the peer. An + errno-class integer value will be included in the control message data + indicating the problem, and an RXRPC_USER_CALL_ID will indicate the call + affected. + + (#) RXRPC_BUSY + + This is delivered to a client application to indicate that a call was + rejected by the server due to the server being busy. It will be + associated with an RXRPC_USER_CALL_ID to indicate the rejected call. + + (#) RXRPC_LOCAL_ERROR + + This is delivered to an application to indicate that a local error was + encountered and that a call has been aborted because of it. An + errno-class integer value will be included in the control message data + indicating the problem, and an RXRPC_USER_CALL_ID will indicate the call + affected. + + (#) RXRPC_NEW_CALL + + This is delivered to indicate to a server application that a new call has + arrived and is awaiting acceptance. No user ID is associated with this, + as a user ID must subsequently be assigned by doing an RXRPC_ACCEPT. + + (#) RXRPC_ACCEPT + + This is used by a server application to attempt to accept a call and + assign it a user ID. It should be associated with an RXRPC_USER_CALL_ID + to indicate the user ID to be assigned. If there is no call to be + accepted (it may have timed out, been aborted, etc.), then sendmsg will + return error ENODATA. If the user ID is already in use by another call, + then error EBADSLT will be returned. 
+ + (#) RXRPC_EXCLUSIVE_CALL + + This is used to indicate that a client call should be made on a one-off + connection. The connection is discarded once the call has terminated. + + (#) RXRPC_UPGRADE_SERVICE + + This is used to make a client call to probe if the specified service ID + may be upgraded by the server. The caller must check msg_name returned to + recvmsg() for the service ID actually in use. The operation probed must + be one that takes the same arguments in both services. + + Once this has been used to establish the upgrade capability (or lack + thereof) of the server, the service ID returned should be used for all + future communication to that server and RXRPC_UPGRADE_SERVICE should no + longer be set. + + (#) RXRPC_TX_LENGTH + + This is used to inform the kernel of the total amount of data that is + going to be transmitted by a call (whether in a client request or a + service response). If given, it allows the kernel to encrypt from the + userspace buffer directly to the packet buffers, rather than copying into + the buffer and then encrypting in place. This may only be given with the + first sendmsg() providing data for a call. EMSGSIZE will be generated if + the amount of data actually given is different. + + This takes a parameter of __s64 type that indicates how much will be + transmitted. This may not be less than zero. + +The symbol RXRPC__SUPPORTED is defined as one more than the highest control +message type supported. At run time this can be queried by means of the +RXRPC_SUPPORTED_CMSG socket option (see below). + + +============== +SOCKET OPTIONS +============== + +AF_RXRPC sockets support a few socket options at the SOL_RXRPC level: + + (#) RXRPC_SECURITY_KEY + + This is used to specify the description of the key to be used. The key is + extracted from the calling process's keyrings with request_key() and + should be of "rxrpc" type. 
+ + The optval pointer points to the description string, and optlen indicates + how long the string is, without the NUL terminator. + + (#) RXRPC_SECURITY_KEYRING + + Similar to above but specifies a keyring of server secret keys to use (key + type "keyring"). See the "Security" section. + + (#) RXRPC_EXCLUSIVE_CONNECTION + + This is used to request that new connections should be used for each call + made subsequently on this socket. optval should be NULL and optlen 0. + + (#) RXRPC_MIN_SECURITY_LEVEL + + This is used to specify the minimum security level required for calls on + this socket. optval must point to an int containing one of the following + values: + + (a) RXRPC_SECURITY_PLAIN + + Encrypted checksum only. + + (b) RXRPC_SECURITY_AUTH + + Encrypted checksum plus packet padded and first eight bytes of packet + encrypted - which includes the actual packet length. + + (c) RXRPC_SECURITY_ENCRYPTED + + Encrypted checksum plus entire packet padded and encrypted, including + actual packet length. + + (#) RXRPC_UPGRADEABLE_SERVICE + + This is used to indicate that a service socket with two bindings may + upgrade one bound service to the other if requested by the client. optval + must point to an array of two unsigned short ints. The first is the + service ID to upgrade from and the second the service ID to upgrade to. + + (#) RXRPC_SUPPORTED_CMSG + + This is a read-only option that writes an int into the buffer indicating + the highest control message type supported. + + +======== +SECURITY +======== + +Currently, only the kerberos 4 equivalent protocol has been implemented +(security index 2 - rxkad). This requires the rxkad module to be loaded and, +on the client, tickets of the appropriate type to be obtained from the AFS +kaserver or the kerberos server and installed as "rxrpc" type keys. This is +normally done using the klog program. 
An example simple klog program can be +found at: + + http://people.redhat.com/~dhowells/rxrpc/klog.c + +The payload provided to add_key() on the client should be of the following +form:: + + struct rxrpc_key_sec2_v1 { + uint16_t security_index; /* 2 */ + uint16_t ticket_length; /* length of ticket[] */ + uint32_t expiry; /* time at which expires */ + uint8_t kvno; /* key version number */ + uint8_t __pad[3]; + uint8_t session_key[8]; /* DES session key */ + uint8_t ticket[0]; /* the encrypted ticket */ + }; + +Where the ticket blob is just appended to the above structure. + + +For the server, keys of type "rxrpc_s" must be made available to the server. +They have a description of ":" (eg: "52:2" for an +rxkad key for the AFS VL service). When such a key is created, it should be +given the server's secret key as the instantiation data (see the example +below). + + add_key("rxrpc_s", "52:2", secret_key, 8, keyring); + +A keyring is passed to the server socket by naming it in a sockopt. The server +socket then looks the server secret keys up in this keyring when secure +incoming connections are made. This can be seen in an example program that can +be found at: + + http://people.redhat.com/~dhowells/rxrpc/listen.c + + +==================== +EXAMPLE CLIENT USAGE +==================== + +A client would issue an operation by: + + (1) An RxRPC socket is set up by:: + + client = socket(AF_RXRPC, SOCK_DGRAM, PF_INET); + + Where the third parameter indicates the protocol family of the transport + socket used - usually IPv4 but it can also be IPv6 [TODO]. 
+ + (2) A local address can optionally be bound:: + + struct sockaddr_rxrpc srx = { + .srx_family = AF_RXRPC, + .srx_service = 0, /* we're a client */ + .transport_type = SOCK_DGRAM, /* type of transport socket */ + .transport.sin_family = AF_INET, + .transport.sin_port = htons(7000), /* AFS callback */ + .transport.sin_address = 0, /* all local interfaces */ + }; + bind(client, &srx, sizeof(srx)); + + This specifies the local UDP port to be used. If not given, a random + non-privileged port will be used. A UDP port may be shared between + several unrelated RxRPC sockets. Security is handled on a basis of + per-RxRPC virtual connection. + + (3) The security is set:: + + const char *key = "AFS:cambridge.redhat.com"; + setsockopt(client, SOL_RXRPC, RXRPC_SECURITY_KEY, key, strlen(key)); + + This issues a request_key() to get the key representing the security + context. The minimum security level can be set:: + + unsigned int sec = RXRPC_SECURITY_ENCRYPTED; + setsockopt(client, SOL_RXRPC, RXRPC_MIN_SECURITY_LEVEL, + &sec, sizeof(sec)); + + (4) The server to be contacted can then be specified (alternatively this can + be done through sendmsg):: + + struct sockaddr_rxrpc srx = { + .srx_family = AF_RXRPC, + .srx_service = VL_SERVICE_ID, + .transport_type = SOCK_DGRAM, /* type of transport socket */ + .transport.sin_family = AF_INET, + .transport.sin_port = htons(7005), /* AFS volume manager */ + .transport.sin_address = ..., + }; + connect(client, &srx, sizeof(srx)); + + (5) The request data should then be posted to the server socket using a series + of sendmsg() calls, each with the following control message attached: + + ================== =================================== + RXRPC_USER_CALL_ID specifies the user ID for this call + ================== =================================== + + MSG_MORE should be set in msghdr::msg_flags on all but the last part of + the request. Multiple requests may be made simultaneously. 
+ + An RXRPC_TX_LENGTH control message can also be specified on the first + sendmsg() call. + + If a call is intended to go to a destination other than the default + specified through connect(), then msghdr::msg_name should be set on the + first request message of that call. + + (6) The reply data will then be posted to the server socket for recvmsg() to + pick up. MSG_MORE will be flagged by recvmsg() if there's more reply data + for a particular call to be read. MSG_EOR will be set on the terminal + read for a call. + + All data will be delivered with the following control message attached: + + RXRPC_USER_CALL_ID - specifies the user ID for this call + + If an abort or error occurred, this will be returned in the control data + buffer instead, and MSG_EOR will be flagged to indicate the end of that + call. + +A client may ask for a service ID it knows and ask that this be upgraded to a +better service if one is available by supplying RXRPC_UPGRADE_SERVICE on the +first sendmsg() of a call. The client should then check srx_service in the +msg_name filled in by recvmsg() when collecting the result. srx_service will +hold the same value as given to sendmsg() if the upgrade request was ignored by +the service - otherwise it will be altered to indicate the service ID the +server upgraded to. Note that the upgraded service ID is chosen by the server. +The caller has to wait until it sees the service ID in the reply before sending +any more calls (further calls to the same destination will be blocked until the +probe is concluded). + + +Example Server Usage +==================== + +A server would be set up to accept operations in the following manner: + + (1) An RxRPC socket is created by:: + + server = socket(AF_RXRPC, SOCK_DGRAM, PF_INET); + + Where the third parameter indicates the address type of the transport + socket used - usually IPv4. 
+ + (2) Security is set up if desired by giving the socket a keyring with server + secret keys in it:: + + keyring = add_key("keyring", "AFSkeys", NULL, 0, + KEY_SPEC_PROCESS_KEYRING); + + const char secret_key[8] = { + 0xa7, 0x83, 0x8a, 0xcb, 0xc7, 0x83, 0xec, 0x94 }; + add_key("rxrpc_s", "52:2", secret_key, 8, keyring); + + setsockopt(server, SOL_RXRPC, RXRPC_SECURITY_KEYRING, "AFSkeys", 7); + + The keyring can be manipulated after it has been given to the socket. This + permits the server to add more keys, replace keys, etc. while it is live. + + (3) A local address must then be bound:: + + struct sockaddr_rxrpc srx = { + .srx_family = AF_RXRPC, + .srx_service = VL_SERVICE_ID, /* RxRPC service ID */ + .transport_type = SOCK_DGRAM, /* type of transport socket */ + .transport.sin_family = AF_INET, + .transport.sin_port = htons(7000), /* AFS callback */ + .transport.sin_address = 0, /* all local interfaces */ + }; + bind(server, &srx, sizeof(srx)); + + More than one service ID may be bound to a socket, provided the transport + parameters are the same. The limit is currently two. To do this, bind() + should be called twice. + + (4) If service upgrading is required, first two service IDs must have been + bound and then the following option must be set:: + + unsigned short service_ids[2] = { from_ID, to_ID }; + setsockopt(server, SOL_RXRPC, RXRPC_UPGRADEABLE_SERVICE, + service_ids, sizeof(service_ids)); + + This will automatically upgrade connections on service from_ID to service + to_ID if they request it. This will be reflected in msg_name obtained + through recvmsg() when the request data is delivered to userspace. + + (5) The server is then set to listen out for incoming calls:: + + listen(server, 100); + + (6) The kernel notifies the server of pending incoming connections by sending + it a message for each. This is received with recvmsg() on the server + socket. 
It has no data, and has a single dataless control message + attached:: + + RXRPC_NEW_CALL + + The address that can be passed back by recvmsg() at this point should be + ignored since the call for which the message was posted may have gone by + the time it is accepted - in which case the first call still on the queue + will be accepted. + + (7) The server then accepts the new call by issuing a sendmsg() with two + pieces of control data and no actual data: + + ================== ============================== + RXRPC_ACCEPT indicate connection acceptance + RXRPC_USER_CALL_ID specify user ID for this call + ================== ============================== + + (8) The first request data packet will then be posted to the server socket for + recvmsg() to pick up. At that point, the RxRPC address for the call can + be read from the address fields in the msghdr struct. + + Subsequent request data will be posted to the server socket for recvmsg() + to collect as it arrives. All but the last piece of the request data will + be delivered with MSG_MORE flagged. + + All data will be delivered with the following control message attached: + + + ================== =================================== + RXRPC_USER_CALL_ID specifies the user ID for this call + ================== =================================== + + (9) The reply data should then be posted to the server socket using a series + of sendmsg() calls, each with the following control messages attached: + + ================== =================================== + RXRPC_USER_CALL_ID specifies the user ID for this call + ================== =================================== + + MSG_MORE should be set in msghdr::msg_flags on all but the last message + for a particular call. + +(10) The final ACK from the client will be posted for retrieval by recvmsg() + when it is received. 
It will take the form of a dataless message with two + control messages attached: + + ================== =================================== + RXRPC_USER_CALL_ID specifies the user ID for this call + RXRPC_ACK indicates final ACK (no data) + ================== =================================== + + MSG_EOR will be flagged to indicate that this is the final message for + this call. + +(11) Up to the point the final packet of reply data is sent, the call can be + aborted by calling sendmsg() with a dataless message with the following + control messages attached: + + ================== =================================== + RXRPC_USER_CALL_ID specifies the user ID for this call + RXRPC_ABORT indicates abort code (4 byte data) + ================== =================================== + + Any packets waiting in the socket's receive queue will be discarded if + this is issued. + +Note that all the communications for a particular service take place through +the one server socket, using control messages on sendmsg() and recvmsg() to +determine the call affected. + + +AF_RXRPC Kernel Interface +========================= + +The AF_RXRPC module also provides an interface for use by in-kernel utilities +such as the AFS filesystem. This permits such a utility to: + + (1) Use different keys directly on individual client calls on one socket + rather than having to open a whole slew of sockets, one for each key it + might want to use. + + (2) Avoid having RxRPC call request_key() at the point of issue of a call or + opening of a socket. Instead the utility is responsible for requesting a + key at the appropriate point. AFS, for instance, would do this during VFS + operations such as open() or unlink(). The key is then handed through + when the call is initiated. + + (3) Request the use of something other than GFP_KERNEL to allocate memory. + + (4) Avoid the overhead of using the recvmsg() call. 
RxRPC messages can be + intercepted before they get put into the socket Rx queue and the socket + buffers manipulated directly. + +To use the RxRPC facility, a kernel utility must still open an AF_RXRPC socket, +bind an address as appropriate and listen if it's to be a server socket, but +then it passes this to the kernel interface functions. + +The kernel interface functions are as follows: + + (#) Begin a new client call:: + + struct rxrpc_call * + rxrpc_kernel_begin_call(struct socket *sock, + struct sockaddr_rxrpc *srx, + struct key *key, + unsigned long user_call_ID, + s64 tx_total_len, + gfp_t gfp, + rxrpc_notify_rx_t notify_rx, + bool upgrade, + bool intr, + unsigned int debug_id); + + This allocates the infrastructure to make a new RxRPC call and assigns + call and connection numbers. The call will be made on the UDP port that + the socket is bound to. The call will go to the destination address of a + connected client socket unless an alternative is supplied (srx is + non-NULL). + + If a key is supplied then this will be used to secure the call instead of + the key bound to the socket with the RXRPC_SECURITY_KEY sockopt. Calls + secured in this way will still share connections if at all possible. + + The user_call_ID is equivalent to that supplied to sendmsg() in the + control data buffer. It is entirely feasible to use this to point to a + kernel data structure. + + tx_total_len is the amount of data the caller is intending to transmit + with this call (or -1 if unknown at this point). Setting the data size + allows the kernel to encrypt directly to the packet buffers, thereby + saving a copy. The value may not be less than -1. + + notify_rx is a pointer to a function to be called when events such as + incoming data packets or remote aborts happen. + + upgrade should be set to true if a client operation should request that + the server upgrade the service to a better one. The resultant service ID + is returned by rxrpc_kernel_recv_data(). 
+ + intr should be set to true if the call should be interruptible. If this + is not set, this function may not return until a channel has been + allocated; if it is set, the function may return -ERESTARTSYS. + + debug_id is the call debugging ID to be used for tracing. This can be + obtained by atomically incrementing rxrpc_debug_id. + + If this function is successful, an opaque reference to the RxRPC call is + returned. The caller now holds a reference on this and it must be + properly ended. + + (#) End a client call:: + + void rxrpc_kernel_end_call(struct socket *sock, + struct rxrpc_call *call); + + This is used to end a previously begun call. The user_call_ID is expunged + from AF_RXRPC's knowledge and will not be seen again in association with + the specified call. + + (#) Send data through a call:: + + typedef void (*rxrpc_notify_end_tx_t)(struct sock *sk, + unsigned long user_call_ID, + struct sk_buff *skb); + + int rxrpc_kernel_send_data(struct socket *sock, + struct rxrpc_call *call, + struct msghdr *msg, + size_t len, + rxrpc_notify_end_tx_t notify_end_rx); + + This is used to supply either the request part of a client call or the + reply part of a server call. msg.msg_iovlen and msg.msg_iov specify the + data buffers to be used. msg_iov may not be NULL and must point + exclusively to in-kernel virtual addresses. msg.msg_flags may be given + MSG_MORE if there will be subsequent data sends for this call. + + The msg must not specify a destination address, control data or any flags + other than MSG_MORE. len is the total amount of data to transmit. + + notify_end_rx can be NULL or it can be used to specify a function to be + called when the call changes state to end the Tx phase. This function is + called with the call-state spinlock held to prevent any reply or final ACK + from being delivered first. 
+ + (#) Receive data from a call:: + + int rxrpc_kernel_recv_data(struct socket *sock, + struct rxrpc_call *call, + void *buf, + size_t size, + size_t *_offset, + bool want_more, + u32 *_abort, + u16 *_service) + + This is used to receive data from either the reply part of a client call + or the request part of a service call. buf and size specify how much + data is desired and where to store it. *_offset is added on to buf and + subtracted from size internally; the amount copied into the buffer is + added to *_offset before returning. + + want_more should be true if further data will be required after this is + satisfied and false if this is the last item of the receive phase. + + There are three normal returns: 0 if the buffer was filled and want_more + was true; 1 if the buffer was filled, the last DATA packet has been + emptied and want_more was false; and -EAGAIN if the function needs to be + called again. + + If the last DATA packet is processed but the buffer contains less than + the amount requested, EBADMSG is returned. If want_more wasn't set, but + more data was available, EMSGSIZE is returned. + + If a remote ABORT is detected, the abort code received will be stored in + ``*_abort`` and ECONNABORTED will be returned. + + The service ID that the call ended up with is returned into *_service. + This can be used to see if a call got a service upgrade. + + (#) Abort a call?? + + :: + + void rxrpc_kernel_abort_call(struct socket *sock, + struct rxrpc_call *call, + u32 abort_code); + + This is used to abort a call if it's still in an abortable state. The + abort code specified will be placed in the ABORT message sent. + + (#) Intercept received RxRPC messages:: + + typedef void (*rxrpc_interceptor_t)(struct sock *sk, + unsigned long user_call_ID, + struct sk_buff *skb); + + void + rxrpc_kernel_intercept_rx_messages(struct socket *sock, + rxrpc_interceptor_t interceptor); + + This installs an interceptor function on the specified AF_RXRPC socket. 
+ All messages that would otherwise wind up in the socket's Rx queue are + then diverted to this function. Note that care must be taken to process + the messages in the right order to maintain DATA message sequentiality. + + The interceptor function itself is provided with the address of the socket + handling the incoming message, the ID assigned by the kernel utility + to the call and the socket buffer containing the message. + + The skb->mark field indicates the type of message: + + =============================== ======================================= + Mark Meaning + =============================== ======================================= + RXRPC_SKB_MARK_DATA Data message + RXRPC_SKB_MARK_FINAL_ACK Final ACK received for an incoming call + RXRPC_SKB_MARK_BUSY Client call rejected as server busy + RXRPC_SKB_MARK_REMOTE_ABORT Call aborted by peer + RXRPC_SKB_MARK_NET_ERROR Network error detected + RXRPC_SKB_MARK_LOCAL_ERROR Local error encountered + RXRPC_SKB_MARK_NEW_CALL New incoming call awaiting acceptance + =============================== ======================================= + + The remote abort message can be probed with rxrpc_kernel_get_abort_code(). + The two error messages can be probed with rxrpc_kernel_get_error_number(). + A new call can be accepted with rxrpc_kernel_accept_call(). + + Data messages can have their contents extracted with the usual bunch of + socket buffer manipulation functions. A data message can be determined to + be the last one in a sequence with rxrpc_kernel_is_data_last(). When a + data message has been used up, rxrpc_kernel_data_consumed() should be + called on it. + + Messages should be handed to rxrpc_kernel_free_skb() to dispose of them. It + is possible to get extra refs on all types of message for later freeing, + but this may pin the state of a call until the message is finally freed.
+ + (#) Accept an incoming call:: + + struct rxrpc_call * + rxrpc_kernel_accept_call(struct socket *sock, + unsigned long user_call_ID); + + This is used to accept an incoming call and to assign it a call ID. This + function is similar to rxrpc_kernel_begin_call() and calls accepted must + be ended in the same way. + + If this function is successful, an opaque reference to the RxRPC call is + returned. The caller now holds a reference on this and it must be + properly ended. + + (#) Reject an incoming call:: + + int rxrpc_kernel_reject_call(struct socket *sock); + + This is used to reject the first incoming call on the socket's queue with + a BUSY message. -ENODATA is returned if there were no incoming calls. + Other errors may be returned if the call had been aborted (-ECONNABORTED) + or had timed out (-ETIME). + + (#) Allocate a null key for doing anonymous security:: + + struct key *rxrpc_get_null_key(const char *keyname); + + This is used to allocate a null RxRPC key that can be used to indicate + anonymous security for a particular domain. + + (#) Get the peer address of a call:: + + void rxrpc_kernel_get_peer(struct socket *sock, struct rxrpc_call *call, + struct sockaddr_rxrpc *_srx); + + This is used to find the remote peer address of a call. + + (#) Set the total transmit data size on a call:: + + void rxrpc_kernel_set_tx_length(struct socket *sock, + struct rxrpc_call *call, + s64 tx_total_len); + + This sets the amount of data that the caller is intending to transmit on a + call. It's intended to be used for setting the reply size as the request + size should be set when the call is begun. tx_total_len may not be less + than zero. + + (#) Get call RTT:: + + u64 rxrpc_kernel_get_rtt(struct socket *sock, struct rxrpc_call *call); + + Get the RTT time to the peer in use by a call. The value returned is in + nanoseconds. 
+ + (#) Check call still alive:: + + bool rxrpc_kernel_check_life(struct socket *sock, + struct rxrpc_call *call, + u32 *_life); + void rxrpc_kernel_probe_life(struct socket *sock, + struct rxrpc_call *call); + + The first function passes back in ``*_life`` a number that is updated when + ACKs are received from the peer (notably including PING RESPONSE ACKs + which we can elicit by sending PING ACKs to see if the call still exists + on the server). The caller should compare the numbers of two calls to see + if the call is still alive after waiting for a suitable interval. It also + returns true as long as the call hasn't yet reached the completed state. + + This allows the caller to work out if the server is still contactable and + if the call is still alive on the server while waiting for the server to + process a client operation. + + The second function causes a ping ACK to be transmitted to try to provoke + the peer into responding, which would then cause the value returned by the + first function to change. Note that this must be called in TASK_RUNNING + state. + + (#) Get reply timestamp:: + + bool rxrpc_kernel_get_reply_time(struct socket *sock, + struct rxrpc_call *call, + ktime_t *_ts) + + This allows the timestamp on the first DATA packet of the reply of a + client call to be queried, provided that it is still in the Rx ring. If + successful, the timestamp will be stored into ``*_ts`` and true will be + returned; false will be returned otherwise. + + (#) Get remote client epoch:: + + u32 rxrpc_kernel_get_epoch(struct socket *sock, + struct rxrpc_call *call) + + This allows the epoch that's contained in packets of an incoming client + call to be queried. This value is returned. The function is always + successful if the call is still in progress. It shouldn't be called once + the call has expired. Note that calling this on a local client call only + returns the local epoch.
+ + This value can be used to determine if the remote client has been + restarted as it shouldn't change otherwise. + + (#) Set the maximum lifespan on a call:: + + void rxrpc_kernel_set_max_life(struct socket *sock, + struct rxrpc_call *call, + unsigned long hard_timeout) + + This sets the maximum lifespan on a call to hard_timeout (which is in + jiffies). In the event of the timeout occurring, the call will be + aborted and -ETIME or -ETIMEDOUT will be returned. + + +Configurable Parameters +======================= + +The RxRPC protocol driver has a number of configurable parameters that can be +adjusted through sysctls in /proc/net/rxrpc/: + + (#) req_ack_delay + + The amount of time in milliseconds after receiving a packet with the + request-ack flag set before we honour the flag and actually send the + requested ack. + + Usually the other side won't stop sending packets until the advertised + reception window is full (to a maximum of 255 packets), so delaying the + ACK permits several packets to be ACK'd in one go. + + (#) soft_ack_delay + + The amount of time in milliseconds after receiving a new packet before we + generate a soft-ACK to tell the sender that it doesn't need to resend. + + (#) idle_ack_delay + + The amount of time in milliseconds after all the packets currently in the + received queue have been consumed before we generate a hard-ACK to tell + the sender it can free its buffers, assuming no other reason occurs that + we would send an ACK. + + (#) resend_timeout + + The amount of time in milliseconds after transmitting a packet before we + transmit it again, assuming no ACK is received from the receiver telling + us they got it. + + (#) max_call_lifetime + + The maximum amount of time in seconds that a call may be in progress + before we preemptively kill it. + + (#) dead_call_expiry + + The amount of time in seconds before we remove a dead call from the call + list.
Dead calls are kept around for a little while for the purpose of + repeating ACK and ABORT packets. + + (#) connection_expiry + + The amount of time in seconds after a connection was last used before we + remove it from the connection list. While a connection is in existence, + it serves as a placeholder for negotiated security; when it is deleted, + the security must be renegotiated. + + (#) transport_expiry + + The amount of time in seconds after a transport was last used before we + remove it from the transport list. While a transport is in existence, it + serves to anchor the peer data and keeps the connection ID counter. + + (#) rxrpc_rx_window_size + + The size of the receive window in packets. This is the maximum number of + unconsumed received packets we're willing to hold in memory for any + particular call. + + (#) rxrpc_rx_mtu + + The maximum packet MTU size that we're willing to receive in bytes. This + indicates to the peer whether we're willing to accept jumbo packets. + + (#) rxrpc_rx_jumbo_max + + The maximum number of packets that we're willing to accept in a jumbo + packet. Non-terminal packets in a jumbo packet must contain a four byte + header plus exactly 1412 bytes of data. The terminal packet must contain + a four byte header plus any amount of data. In any event, a jumbo packet + may not exceed rxrpc_rx_mtu in size. diff --git a/Documentation/networking/rxrpc.txt b/Documentation/networking/rxrpc.txt deleted file mode 100644 index 180e07d956a7..000000000000 --- a/Documentation/networking/rxrpc.txt +++ /dev/null @@ -1,1155 +0,0 @@ - ====================== - RxRPC NETWORK PROTOCOL - ====================== - -The RxRPC protocol driver provides a reliable two-phase transport on top of UDP -that can be used to perform RxRPC remote operations. This is done over sockets -of AF_RXRPC family, using sendmsg() and recvmsg() with control data to send and -receive data, aborts and errors. - -Contents of this document: - - (*) Overview. 
- - (*) RxRPC protocol summary. - - (*) AF_RXRPC driver model. - - (*) Control messages. - - (*) Socket options. - - (*) Security. - - (*) Example client usage. - - (*) Example server usage. - - (*) AF_RXRPC kernel interface. - - (*) Configurable parameters. - - -======== -OVERVIEW -======== - -RxRPC is a two-layer protocol. There is a session layer which provides -reliable virtual connections using UDP over IPv4 (or IPv6) as the transport -layer, but implements a real network protocol; and there's the presentation -layer which renders structured data to binary blobs and back again using XDR -(as does SunRPC): - - +-------------+ - | Application | - +-------------+ - | XDR | Presentation - +-------------+ - | RxRPC | Session - +-------------+ - | UDP | Transport - +-------------+ - - -AF_RXRPC provides: - - (1) Part of an RxRPC facility for both kernel and userspace applications by - making the session part of it a Linux network protocol (AF_RXRPC). - - (2) A two-phase protocol. The client transmits a blob (the request) and then - receives a blob (the reply), and the server receives the request and then - transmits the reply. - - (3) Retention of the reusable bits of the transport system set up for one call - to speed up subsequent calls. - - (4) A secure protocol, using the Linux kernel's key retention facility to - manage security on the client end. The server end must of necessity be - more active in security negotiations. - -AF_RXRPC does not provide XDR marshalling/presentation facilities. That is -left to the application. AF_RXRPC only deals in blobs. Even the operation ID -is just the first four bytes of the request blob, and as such is beyond the -kernel's interest. - - -Sockets of AF_RXRPC family are: - - (1) created as type SOCK_DGRAM; - - (2) provided with a protocol of the type of underlying transport they're going - to use - currently only PF_INET is supported. 
- - -The Andrew File System (AFS) is an example of an application that uses this and -that has both kernel (filesystem) and userspace (utility) components. - - -====================== -RXRPC PROTOCOL SUMMARY -====================== - -An overview of the RxRPC protocol: - - (*) RxRPC sits on top of another networking protocol (UDP is the only option - currently), and uses this to provide network transport. UDP ports, for - example, provide transport endpoints. - - (*) RxRPC supports multiple virtual "connections" from any given transport - endpoint, thus allowing the endpoints to be shared, even to the same - remote endpoint. - - (*) Each connection goes to a particular "service". A connection may not go - to multiple services. A service may be considered the RxRPC equivalent of - a port number. AF_RXRPC permits multiple services to share an endpoint. - - (*) Client-originating packets are marked, thus a transport endpoint can be - shared between client and server connections (connections have a - direction). - - (*) Up to a billion connections may be supported concurrently between one - local transport endpoint and one service on one remote endpoint. An RxRPC - connection is described by seven numbers: - - Local address } - Local port } Transport (UDP) address - Remote address } - Remote port } - Direction - Connection ID - Service ID - - (*) Each RxRPC operation is a "call". A connection may make up to four - billion calls, but only up to four calls may be in progress on a - connection at any one time. - - (*) Calls are two-phase and asymmetric: the client sends its request data, - which the service receives; then the service transmits the reply data - which the client receives. - - (*) The data blobs are of indefinite size, the end of a phase is marked with a - flag in the packet. The number of packets of data making up one blob may - not exceed 4 billion, however, as this would cause the sequence number to - wrap. 
- - (*) The first four bytes of the request data are the service operation ID. - - (*) Security is negotiated on a per-connection basis. The connection is - initiated by the first data packet on it arriving. If security is - requested, the server then issues a "challenge" and then the client - replies with a "response". If the response is successful, the security is - set for the lifetime of that connection, and all subsequent calls made - upon it use that same security. In the event that the server lets a - connection lapse before the client, the security will be renegotiated if - the client uses the connection again. - - (*) Calls use ACK packets to handle reliability. Data packets are also - explicitly sequenced per call. - - (*) There are two types of positive acknowledgment: hard-ACKs and soft-ACKs. - A hard-ACK indicates to the far side that all the data received to a point - has been received and processed; a soft-ACK indicates that the data has - been received but may yet be discarded and re-requested. The sender may - not discard any transmittable packets until they've been hard-ACK'd. - - (*) Reception of a reply data packet implicitly hard-ACK's all the data - packets that make up the request. - - (*) A call is complete when the request has been sent, the reply has been - received and the final hard-ACK on the last packet of the reply has - reached the server. - - (*) A call may be aborted by either end at any time up to its completion. - - -===================== -AF_RXRPC DRIVER MODEL -===================== - -About the AF_RXRPC driver: - - (*) The AF_RXRPC protocol transparently uses internal sockets of the transport - protocol to represent transport endpoints. - - (*) AF_RXRPC sockets map onto RxRPC connection bundles. Actual RxRPC - connections are handled transparently. One client socket may be used to - make multiple simultaneous calls to the same service. One server socket - may handle calls from many clients.
- - (*) Additional parallel client connections will be initiated to support extra - concurrent calls, up to a tunable limit. - - (*) Each connection is retained for a certain amount of time [tunable] after - the last call currently using it has completed in case a new call is made - that could reuse it. - - (*) Each internal UDP socket is retained [tunable] for a certain amount of - time [tunable] after the last connection using it is discarded, in case a new - connection is made that could use it. - - (*) A client-side connection is only shared between calls if they have - the same key struct describing their security (and assuming the calls - would otherwise share the connection). Non-secured calls would also be - able to share connections with each other. - - (*) A server-side connection is shared if the client says it is. - - (*) ACK'ing is handled by the protocol driver automatically, including ping - replying. - - (*) SO_KEEPALIVE automatically pings the other side to keep the connection - alive [TODO]. - - (*) If an ICMP error is received, all calls affected by that error will be - aborted with an appropriate network error passed through recvmsg(). - - -Interaction with the user of the RxRPC socket: - - (*) A socket is made into a server socket by binding an address with a - non-zero service ID. - - (*) In the client, sending a request is achieved with one or more sendmsgs, - followed by the reply being received with one or more recvmsgs. - - (*) The first sendmsg for a request to be sent from a client contains a tag to - be used in all other sendmsgs or recvmsgs associated with that call. The - tag is carried in the control data. - - (*) connect() is used to supply a default destination address for a client - socket. This may be overridden by supplying an alternate address to the - first sendmsg() of a call (struct msghdr::msg_name). - - (*) If connect() is called on an unbound client, a random local port will - be bound before the operation takes place.
- - (*) A server socket may also be used to make client calls. To do this, the - first sendmsg() of the call must specify the target address. The server's - transport endpoint is used to send the packets. - - (*) Once the application has received the last message associated with a call, - the tag is guaranteed not to be seen again, and so it can be used to pin - client resources. A new call can then be initiated with the same tag - without fear of interference. - - (*) In the server, a request is received with one or more recvmsgs, then the - reply is transmitted with one or more sendmsgs, and then the final ACK - is received with a last recvmsg. - - (*) When sending data for a call, sendmsg is given MSG_MORE if there's more - data to come on that call. - - (*) When receiving data for a call, recvmsg flags MSG_MORE if there's more - data to come for that call. - - (*) When receiving data or messages for a call, MSG_EOR is flagged by recvmsg - to indicate the terminal message for that call. - - (*) A call may be aborted by adding an abort control message to the control - data. Issuing an abort terminates the kernel's use of that call's tag. - Any messages waiting in the receive queue for that call will be discarded. - - (*) Aborts, busy notifications and challenge packets are delivered by recvmsg, - and control data messages will be set to indicate the context. Receiving - an abort or a busy message terminates the kernel's use of that call's tag. - - (*) The control data part of the msghdr struct is used for a number of things: - - (*) The tag of the intended or affected call. - - (*) Sending or receiving errors, aborts and busy notifications. - - (*) Notifications of incoming calls. - - (*) Sending debug requests and receiving debug replies [TODO]. - - (*) When the kernel has received and set up an incoming call, it sends a - message to the server application to let it know there's a new call awaiting - its acceptance [recvmsg reports a special control message].
The server - application then uses sendmsg to assign a tag to the new call. Once that - is done, the first part of the request data will be delivered by recvmsg. - - (*) The server application has to provide the server socket with a keyring of - secret keys corresponding to the security types it permits. When a secure - connection is being set up, the kernel looks up the appropriate secret key - in the keyring and then sends a challenge packet to the client and - receives a response packet. The kernel then checks the authorisation of - the packet and either aborts the connection or sets up the security. - - (*) The name of the key a client will use to secure its communications is - nominated by a socket option. - - -Notes on sendmsg: - - (*) MSG_WAITALL can be set to tell sendmsg to ignore signals if the peer is - making progress at accepting packets within a reasonable time such that we - manage to queue up all the data for transmission. This requires the - client to accept at least one packet per 2*RTT time period. - - If this isn't set, sendmsg() will return immediately, either returning - EINTR/ERESTARTSYS if nothing was consumed or returning the amount of data - consumed. - - -Notes on recvmsg: - - (*) If there's a sequence of data messages belonging to a particular call on - the receive queue, then recvmsg will keep working through them until: - - (a) it meets the end of that call's received data, - - (b) it meets a non-data message, - - (c) it meets a message belonging to a different call, or - - (d) it fills the user buffer. - - If recvmsg is called in blocking mode, it will keep sleeping, awaiting the - reception of further data, until one of the above four conditions is met. - - (2) MSG_PEEK operates similarly, but will return immediately if it has put any - data in the buffer rather than sleeping until it can fill the buffer. 
- - (3) If a data message is only partially consumed in filling a user buffer, - then the remainder of that message will be left on the front of the queue - for the next taker. MSG_TRUNC will never be flagged. - - (4) If there is more data to be had on a call (it hasn't copied the last byte - of the last data message in that phase yet), then MSG_MORE will be - flagged. - - -================ -CONTROL MESSAGES -================ - -AF_RXRPC makes use of control messages in sendmsg() and recvmsg() to multiplex -calls, to invoke certain actions and to report certain conditions. These are: - - MESSAGE ID SRT DATA MEANING - ======================= === =========== =============================== - RXRPC_USER_CALL_ID sr- User ID App's call specifier - RXRPC_ABORT srt Abort code Abort code to issue/received - RXRPC_ACK -rt n/a Final ACK received - RXRPC_NET_ERROR -rt error num Network error on call - RXRPC_BUSY -rt n/a Call rejected (server busy) - RXRPC_LOCAL_ERROR -rt error num Local error encountered - RXRPC_NEW_CALL -r- n/a New call received - RXRPC_ACCEPT s-- n/a Accept new call - RXRPC_EXCLUSIVE_CALL s-- n/a Make an exclusive client call - RXRPC_UPGRADE_SERVICE s-- n/a Client call can be upgraded - RXRPC_TX_LENGTH s-- data len Total length of Tx data - - (SRT = usable in Sendmsg / delivered by Recvmsg / Terminal message) - - (*) RXRPC_USER_CALL_ID - - This is used to indicate the application's call ID. It's an unsigned long - that the app specifies in the client by attaching it to the first data - message or in the server by passing it in association with an RXRPC_ACCEPT - message. recvmsg() passes it in conjunction with all messages except - those of the RXRPC_NEW_CALL message. - - (*) RXRPC_ABORT - - This can be used by an application to abort a call by passing it to - sendmsg, or it can be delivered by recvmsg to indicate a remote abort was - received. Either way, it must be associated with an RXRPC_USER_CALL_ID to - specify the call affected.
If an abort is being sent, then error EBADSLT - will be returned if there is no call with that user ID. - - (*) RXRPC_ACK - - This is delivered to a server application to indicate that the final ACK - of a call was received from the client. It will be associated with an - RXRPC_USER_CALL_ID to indicate the call that's now complete. - - (*) RXRPC_NET_ERROR - - This is delivered to an application to indicate that an ICMP error message - was encountered in the process of trying to talk to the peer. An - errno-class integer value will be included in the control message data - indicating the problem, and an RXRPC_USER_CALL_ID will indicate the call - affected. - - (*) RXRPC_BUSY - - This is delivered to a client application to indicate that a call was - rejected by the server due to the server being busy. It will be - associated with an RXRPC_USER_CALL_ID to indicate the rejected call. - - (*) RXRPC_LOCAL_ERROR - - This is delivered to an application to indicate that a local error was - encountered and that a call has been aborted because of it. An - errno-class integer value will be included in the control message data - indicating the problem, and an RXRPC_USER_CALL_ID will indicate the call - affected. - - (*) RXRPC_NEW_CALL - - This is delivered to indicate to a server application that a new call has - arrived and is awaiting acceptance. No user ID is associated with this, - as a user ID must subsequently be assigned by doing an RXRPC_ACCEPT. - - (*) RXRPC_ACCEPT - - This is used by a server application to attempt to accept a call and - assign it a user ID. It should be associated with an RXRPC_USER_CALL_ID - to indicate the user ID to be assigned. If there is no call to be - accepted (it may have timed out, been aborted, etc.), then sendmsg will - return error ENODATA. If the user ID is already in use by another call, - then error EBADSLT will be returned. 
- - (*) RXRPC_EXCLUSIVE_CALL - - This is used to indicate that a client call should be made on a one-off - connection. The connection is discarded once the call has terminated. - - (*) RXRPC_UPGRADE_SERVICE - - This is used to make a client call to probe if the specified service ID - may be upgraded by the server. The caller must check msg_name returned to - recvmsg() for the service ID actually in use. The operation probed must - be one that takes the same arguments in both services. - - Once this has been used to establish the upgrade capability (or lack - thereof) of the server, the service ID returned should be used for all - future communication to that server and RXRPC_UPGRADE_SERVICE should no - longer be set. - - (*) RXRPC_TX_LENGTH - - This is used to inform the kernel of the total amount of data that is - going to be transmitted by a call (whether in a client request or a - service response). If given, it allows the kernel to encrypt from the - userspace buffer directly to the packet buffers, rather than copying into - the buffer and then encrypting in place. This may only be given with the - first sendmsg() providing data for a call. EMSGSIZE will be generated if - the amount of data actually given is different. - - This takes a parameter of __s64 type that indicates how much will be - transmitted. This may not be less than zero. - -The symbol RXRPC__SUPPORTED is defined as one more than the highest control -message type supported. At run time this can be queried by means of the -RXRPC_SUPPORTED_CMSG socket option (see below). - - -============== -SOCKET OPTIONS -============== - -AF_RXRPC sockets support a few socket options at the SOL_RXRPC level: - - (*) RXRPC_SECURITY_KEY - - This is used to specify the description of the key to be used. The key is - extracted from the calling process's keyrings with request_key() and - should be of "rxrpc" type. 
- - The optval pointer points to the description string, and optlen indicates - how long the string is, without the NUL terminator. - - (*) RXRPC_SECURITY_KEYRING - - Similar to above but specifies a keyring of server secret keys to use (key - type "keyring"). See the "Security" section. - - (*) RXRPC_EXCLUSIVE_CONNECTION - - This is used to request that new connections should be used for each call - made subsequently on this socket. optval should be NULL and optlen 0. - - (*) RXRPC_MIN_SECURITY_LEVEL - - This is used to specify the minimum security level required for calls on - this socket. optval must point to an int containing one of the following - values: - - (a) RXRPC_SECURITY_PLAIN - - Encrypted checksum only. - - (b) RXRPC_SECURITY_AUTH - - Encrypted checksum plus packet padded and first eight bytes of packet - encrypted - which includes the actual packet length. - - (c) RXRPC_SECURITY_ENCRYPTED - - Encrypted checksum plus entire packet padded and encrypted, including - actual packet length. - - (*) RXRPC_UPGRADEABLE_SERVICE - - This is used to indicate that a service socket with two bindings may - upgrade one bound service to the other if requested by the client. optval - must point to an array of two unsigned short ints. The first is the - service ID to upgrade from and the second the service ID to upgrade to. - - (*) RXRPC_SUPPORTED_CMSG - - This is a read-only option that writes an int into the buffer indicating - the highest control message type supported. - - -======== -SECURITY -======== - -Currently, only the kerberos 4 equivalent protocol has been implemented -(security index 2 - rxkad). This requires the rxkad module to be loaded and, -on the client, tickets of the appropriate type to be obtained from the AFS -kaserver or the kerberos server and installed as "rxrpc" type keys. This is -normally done using the klog program. 
An example simple klog program can be -found at: - - http://people.redhat.com/~dhowells/rxrpc/klog.c - -The payload provided to add_key() on the client should be of the following -form: - - struct rxrpc_key_sec2_v1 { - uint16_t security_index; /* 2 */ - uint16_t ticket_length; /* length of ticket[] */ - uint32_t expiry; /* time at which expires */ - uint8_t kvno; /* key version number */ - uint8_t __pad[3]; - uint8_t session_key[8]; /* DES session key */ - uint8_t ticket[0]; /* the encrypted ticket */ - }; - -Where the ticket blob is just appended to the above structure. - - -For the server, keys of type "rxrpc_s" must be made available to the server. -They have a description of "<serviceID>:<securityIndex>" (eg: "52:2" for an -rxkad key for the AFS VL service). When such a key is created, it should be -given the server's secret key as the instantiation data (see the example -below). - - add_key("rxrpc_s", "52:2", secret_key, 8, keyring); - -A keyring is passed to the server socket by naming it in a sockopt. The server -socket then looks the server secret keys up in this keyring when secure -incoming connections are made. This can be seen in an example program that can -be found at: - - http://people.redhat.com/~dhowells/rxrpc/listen.c - - -==================== -EXAMPLE CLIENT USAGE -==================== - -A client would issue an operation by: - - (1) An RxRPC socket is set up by: - - client = socket(AF_RXRPC, SOCK_DGRAM, PF_INET); - - Where the third parameter indicates the protocol family of the transport - socket used - usually IPv4 but it can also be IPv6 [TODO].
- - (2) A local address can optionally be bound: - - struct sockaddr_rxrpc srx = { - .srx_family = AF_RXRPC, - .srx_service = 0, /* we're a client */ - .transport_type = SOCK_DGRAM, /* type of transport socket */ - .transport.sin_family = AF_INET, - .transport.sin_port = htons(7000), /* AFS callback */ - .transport.sin_address = 0, /* all local interfaces */ - }; - bind(client, &srx, sizeof(srx)); - - This specifies the local UDP port to be used. If not given, a random - non-privileged port will be used. A UDP port may be shared between - several unrelated RxRPC sockets. Security is handled on a basis of - per-RxRPC virtual connection. - - (3) The security is set: - - const char *key = "AFS:cambridge.redhat.com"; - setsockopt(client, SOL_RXRPC, RXRPC_SECURITY_KEY, key, strlen(key)); - - This issues a request_key() to get the key representing the security - context. The minimum security level can be set: - - unsigned int sec = RXRPC_SECURITY_ENCRYPTED; - setsockopt(client, SOL_RXRPC, RXRPC_MIN_SECURITY_LEVEL, - &sec, sizeof(sec)); - - (4) The server to be contacted can then be specified (alternatively this can - be done through sendmsg): - - struct sockaddr_rxrpc srx = { - .srx_family = AF_RXRPC, - .srx_service = VL_SERVICE_ID, - .transport_type = SOCK_DGRAM, /* type of transport socket */ - .transport.sin_family = AF_INET, - .transport.sin_port = htons(7005), /* AFS volume manager */ - .transport.sin_address = ..., - }; - connect(client, &srx, sizeof(srx)); - - (5) The request data should then be posted to the server socket using a series - of sendmsg() calls, each with the following control message attached: - - RXRPC_USER_CALL_ID - specifies the user ID for this call - - MSG_MORE should be set in msghdr::msg_flags on all but the last part of - the request. Multiple requests may be made simultaneously. - - An RXRPC_TX_LENGTH control message can also be specified on the first - sendmsg() call. 
- - If a call is intended to go to a destination other than the default - specified through connect(), then msghdr::msg_name should be set on the - first request message of that call. - - (6) The reply data will then be posted to the server socket for recvmsg() to - pick up. MSG_MORE will be flagged by recvmsg() if there's more reply data - for a particular call to be read. MSG_EOR will be set on the terminal - read for a call. - - All data will be delivered with the following control message attached: - - RXRPC_USER_CALL_ID - specifies the user ID for this call - - If an abort or error occurred, this will be returned in the control data - buffer instead, and MSG_EOR will be flagged to indicate the end of that - call. - -A client may ask for a service ID it knows and ask that this be upgraded to a -better service if one is available by supplying RXRPC_UPGRADE_SERVICE on the -first sendmsg() of a call. The client should then check srx_service in the -msg_name filled in by recvmsg() when collecting the result. srx_service will -hold the same value as given to sendmsg() if the upgrade request was ignored by -the service - otherwise it will be altered to indicate the service ID the -server upgraded to. Note that the upgraded service ID is chosen by the server. -The caller has to wait until it sees the service ID in the reply before sending -any more calls (further calls to the same destination will be blocked until the -probe is concluded). - - -==================== -EXAMPLE SERVER USAGE -==================== - -A server would be set up to accept operations in the following manner: - - (1) An RxRPC socket is created by: - - server = socket(AF_RXRPC, SOCK_DGRAM, PF_INET); - - Where the third parameter indicates the address type of the transport - socket used - usually IPv4. 
- - (2) Security is set up if desired by giving the socket a keyring with server - secret keys in it: - - keyring = add_key("keyring", "AFSkeys", NULL, 0, - KEY_SPEC_PROCESS_KEYRING); - - const char secret_key[8] = { - 0xa7, 0x83, 0x8a, 0xcb, 0xc7, 0x83, 0xec, 0x94 }; - add_key("rxrpc_s", "52:2", secret_key, 8, keyring); - - setsockopt(server, SOL_RXRPC, RXRPC_SECURITY_KEYRING, "AFSkeys", 7); - - The keyring can be manipulated after it has been given to the socket. This - permits the server to add more keys, replace keys, etc. while it is live. - - (3) A local address must then be bound: - - struct sockaddr_rxrpc srx = { - .srx_family = AF_RXRPC, - .srx_service = VL_SERVICE_ID, /* RxRPC service ID */ - .transport_type = SOCK_DGRAM, /* type of transport socket */ - .transport.sin_family = AF_INET, - .transport.sin_port = htons(7000), /* AFS callback */ - .transport.sin_address = 0, /* all local interfaces */ - }; - bind(server, &srx, sizeof(srx)); - - More than one service ID may be bound to a socket, provided the transport - parameters are the same. The limit is currently two. To do this, bind() - should be called twice. - - (4) If service upgrading is required, first two service IDs must have been - bound and then the following option must be set: - - unsigned short service_ids[2] = { from_ID, to_ID }; - setsockopt(server, SOL_RXRPC, RXRPC_UPGRADEABLE_SERVICE, - service_ids, sizeof(service_ids)); - - This will automatically upgrade connections on service from_ID to service - to_ID if they request it. This will be reflected in msg_name obtained - through recvmsg() when the request data is delivered to userspace. - - (5) The server is then set to listen out for incoming calls: - - listen(server, 100); - - (6) The kernel notifies the server of pending incoming connections by sending - it a message for each. This is received with recvmsg() on the server - socket. 
It has no data, and has a single dataless control message - attached: - - RXRPC_NEW_CALL - - The address that can be passed back by recvmsg() at this point should be - ignored since the call for which the message was posted may have gone by - the time it is accepted - in which case the first call still on the queue - will be accepted. - - (7) The server then accepts the new call by issuing a sendmsg() with two - pieces of control data and no actual data: - - RXRPC_ACCEPT - indicate connection acceptance - RXRPC_USER_CALL_ID - specify user ID for this call - - (8) The first request data packet will then be posted to the server socket for - recvmsg() to pick up. At that point, the RxRPC address for the call can - be read from the address fields in the msghdr struct. - - Subsequent request data will be posted to the server socket for recvmsg() - to collect as it arrives. All but the last piece of the request data will - be delivered with MSG_MORE flagged. - - All data will be delivered with the following control message attached: - - RXRPC_USER_CALL_ID - specifies the user ID for this call - - (9) The reply data should then be posted to the server socket using a series - of sendmsg() calls, each with the following control messages attached: - - RXRPC_USER_CALL_ID - specifies the user ID for this call - - MSG_MORE should be set in msghdr::msg_flags on all but the last message - for a particular call. - -(10) The final ACK from the client will be posted for retrieval by recvmsg() - when it is received. It will take the form of a dataless message with two - control messages attached: - - RXRPC_USER_CALL_ID - specifies the user ID for this call - RXRPC_ACK - indicates final ACK (no data) - - MSG_EOR will be flagged to indicate that this is the final message for - this call. 
- -(11) Up to the point the final packet of reply data is sent, the call can be - aborted by calling sendmsg() with a dataless message with the following - control messages attached: - - RXRPC_USER_CALL_ID - specifies the user ID for this call - RXRPC_ABORT - indicates abort code (4 byte data) - - Any packets waiting in the socket's receive queue will be discarded if - this is issued. - -Note that all the communications for a particular service take place through -the one server socket, using control messages on sendmsg() and recvmsg() to -determine the call affected. - - -========================= -AF_RXRPC KERNEL INTERFACE -========================= - -The AF_RXRPC module also provides an interface for use by in-kernel utilities -such as the AFS filesystem. This permits such a utility to: - - (1) Use different keys directly on individual client calls on one socket - rather than having to open a whole slew of sockets, one for each key it - might want to use. - - (2) Avoid having RxRPC call request_key() at the point of issue of a call or - opening of a socket. Instead the utility is responsible for requesting a - key at the appropriate point. AFS, for instance, would do this during VFS - operations such as open() or unlink(). The key is then handed through - when the call is initiated. - - (3) Request the use of something other than GFP_KERNEL to allocate memory. - - (4) Avoid the overhead of using the recvmsg() call. RxRPC messages can be - intercepted before they get put into the socket Rx queue and the socket - buffers manipulated directly. - -To use the RxRPC facility, a kernel utility must still open an AF_RXRPC socket, -bind an address as appropriate and listen if it's to be a server socket, but -then it passes this to the kernel interface functions. - -The kernel interface functions are as follows: - - (*) Begin a new client call. 
- - struct rxrpc_call * - rxrpc_kernel_begin_call(struct socket *sock, - struct sockaddr_rxrpc *srx, - struct key *key, - unsigned long user_call_ID, - s64 tx_total_len, - gfp_t gfp, - rxrpc_notify_rx_t notify_rx, - bool upgrade, - bool intr, - unsigned int debug_id); - - This allocates the infrastructure to make a new RxRPC call and assigns - call and connection numbers. The call will be made on the UDP port that - the socket is bound to. The call will go to the destination address of a - connected client socket unless an alternative is supplied (srx is - non-NULL). - - If a key is supplied then this will be used to secure the call instead of - the key bound to the socket with the RXRPC_SECURITY_KEY sockopt. Calls - secured in this way will still share connections if at all possible. - - The user_call_ID is equivalent to that supplied to sendmsg() in the - control data buffer. It is entirely feasible to use this to point to a - kernel data structure. - - tx_total_len is the amount of data the caller is intending to transmit - with this call (or -1 if unknown at this point). Setting the data size - allows the kernel to encrypt directly to the packet buffers, thereby - saving a copy. The value may not be less than -1. - - notify_rx is a pointer to a function to be called when events such as - incoming data packets or remote aborts happen. - - upgrade should be set to true if a client operation should request that - the server upgrade the service to a better one. The resultant service ID - is returned by rxrpc_kernel_recv_data(). - - intr should be set to true if the call should be interruptible. If this - is not set, this function may not return until a channel has been - allocated; if it is set, the function may return -ERESTARTSYS. - - debug_id is the call debugging ID to be used for tracing. This can be - obtained by atomically incrementing rxrpc_debug_id. - - If this function is successful, an opaque reference to the RxRPC call is - returned. 
The caller now holds a reference on this and it must be - properly ended. - - (*) End a client call. - - void rxrpc_kernel_end_call(struct socket *sock, - struct rxrpc_call *call); - - This is used to end a previously begun call. The user_call_ID is expunged - from AF_RXRPC's knowledge and will not be seen again in association with - the specified call. - - (*) Send data through a call. - - typedef void (*rxrpc_notify_end_tx_t)(struct sock *sk, - unsigned long user_call_ID, - struct sk_buff *skb); - - int rxrpc_kernel_send_data(struct socket *sock, - struct rxrpc_call *call, - struct msghdr *msg, - size_t len, - rxrpc_notify_end_tx_t notify_end_rx); - - This is used to supply either the request part of a client call or the - reply part of a server call. msg.msg_iovlen and msg.msg_iov specify the - data buffers to be used. msg_iov may not be NULL and must point - exclusively to in-kernel virtual addresses. msg.msg_flags may be given - MSG_MORE if there will be subsequent data sends for this call. - - The msg must not specify a destination address, control data or any flags - other than MSG_MORE. len is the total amount of data to transmit. - - notify_end_rx can be NULL or it can be used to specify a function to be - called when the call changes state to end the Tx phase. This function is - called with the call-state spinlock held to prevent any reply or final ACK - from being delivered first. - - (*) Receive data from a call. - - int rxrpc_kernel_recv_data(struct socket *sock, - struct rxrpc_call *call, - void *buf, - size_t size, - size_t *_offset, - bool want_more, - u32 *_abort, - u16 *_service) - - This is used to receive data from either the reply part of a client call - or the request part of a service call. buf and size specify how much - data is desired and where to store it. *_offset is added on to buf and - subtracted from size internally; the amount copied into the buffer is - added to *_offset before returning. 
- - want_more should be true if further data will be required after this is - satisfied and false if this is the last item of the receive phase. - - There are three normal returns: 0 if the buffer was filled and want_more - was true; 1 if the buffer was filled, the last DATA packet has been - emptied and want_more was false; and -EAGAIN if the function needs to be - called again. - - If the last DATA packet is processed but the buffer contains less than - the amount requested, EBADMSG is returned. If want_more wasn't set, but - more data was available, EMSGSIZE is returned. - - If a remote ABORT is detected, the abort code received will be stored in - *_abort and ECONNABORTED will be returned. - - The service ID that the call ended up with is returned into *_service. - This can be used to see if a call got a service upgrade. - - (*) Abort a call. - - void rxrpc_kernel_abort_call(struct socket *sock, - struct rxrpc_call *call, - u32 abort_code); - - This is used to abort a call if it's still in an abortable state. The - abort code specified will be placed in the ABORT message sent. - - (*) Intercept received RxRPC messages. - - typedef void (*rxrpc_interceptor_t)(struct sock *sk, - unsigned long user_call_ID, - struct sk_buff *skb); - - void - rxrpc_kernel_intercept_rx_messages(struct socket *sock, - rxrpc_interceptor_t interceptor); - - This installs an interceptor function on the specified AF_RXRPC socket. - All messages that would otherwise wind up in the socket's Rx queue are - then diverted to this function. Note that care must be taken to process - the messages in the right order to maintain DATA message sequentiality. - - The interceptor function itself is provided with the address of the socket - handling the incoming message, the ID assigned by the kernel utility - to the call and the socket buffer containing the message.
- - The skb->mark field indicates the type of message: - - MARK MEANING - =============================== ======================================= - RXRPC_SKB_MARK_DATA Data message - RXRPC_SKB_MARK_FINAL_ACK Final ACK received for an incoming call - RXRPC_SKB_MARK_BUSY Client call rejected as server busy - RXRPC_SKB_MARK_REMOTE_ABORT Call aborted by peer - RXRPC_SKB_MARK_NET_ERROR Network error detected - RXRPC_SKB_MARK_LOCAL_ERROR Local error encountered - RXRPC_SKB_MARK_NEW_CALL New incoming call awaiting acceptance - - The remote abort message can be probed with rxrpc_kernel_get_abort_code(). - The two error messages can be probed with rxrpc_kernel_get_error_number(). - A new call can be accepted with rxrpc_kernel_accept_call(). - - Data messages can have their contents extracted with the usual bunch of - socket buffer manipulation functions. A data message can be determined to - be the last one in a sequence with rxrpc_kernel_is_data_last(). When a - data message has been used up, rxrpc_kernel_data_consumed() should be - called on it. - - Messages should be handed to rxrpc_kernel_free_skb() for disposal. It - is possible to get extra refs on all types of message for later freeing, - but this may pin the state of a call until the message is finally freed. - - (*) Accept an incoming call. - - struct rxrpc_call * - rxrpc_kernel_accept_call(struct socket *sock, - unsigned long user_call_ID); - - This is used to accept an incoming call and to assign it a call ID. This - function is similar to rxrpc_kernel_begin_call() and calls accepted must - be ended in the same way. - - If this function is successful, an opaque reference to the RxRPC call is - returned. The caller now holds a reference on this and it must be - properly ended. - - (*) Reject an incoming call. - - int rxrpc_kernel_reject_call(struct socket *sock); - - This is used to reject the first incoming call on the socket's queue with - a BUSY message. -ENODATA is returned if there were no incoming calls.
- Other errors may be returned if the call had been aborted (-ECONNABORTED) - or had timed out (-ETIME). - - (*) Allocate a null key for doing anonymous security. - - struct key *rxrpc_get_null_key(const char *keyname); - - This is used to allocate a null RxRPC key that can be used to indicate - anonymous security for a particular domain. - - (*) Get the peer address of a call. - - void rxrpc_kernel_get_peer(struct socket *sock, struct rxrpc_call *call, - struct sockaddr_rxrpc *_srx); - - This is used to find the remote peer address of a call. - - (*) Set the total transmit data size on a call. - - void rxrpc_kernel_set_tx_length(struct socket *sock, - struct rxrpc_call *call, - s64 tx_total_len); - - This sets the amount of data that the caller is intending to transmit on a - call. It's intended to be used for setting the reply size as the request - size should be set when the call is begun. tx_total_len may not be less - than zero. - - (*) Get call RTT. - - u64 rxrpc_kernel_get_rtt(struct socket *sock, struct rxrpc_call *call); - - Get the RTT time to the peer in use by a call. The value returned is in - nanoseconds. - - (*) Check call still alive. - - bool rxrpc_kernel_check_life(struct socket *sock, - struct rxrpc_call *call, - u32 *_life); - void rxrpc_kernel_probe_life(struct socket *sock, - struct rxrpc_call *call); - - The first function passes back in *_life a number that is updated when - ACKs are received from the peer (notably including PING RESPONSE ACKs - which we can elicit by sending PING ACKs to see if the call still exists - on the server). The caller should compare the numbers of two calls to see - if the call is still alive after waiting for a suitable interval. It also - returns true as long as the call hasn't yet reached the completed state. - - This allows the caller to work out if the server is still contactable and - if the call is still alive on the server while waiting for the server to - process a client operation. 
- - The second function causes a ping ACK to be transmitted to try to provoke - the peer into responding, which would then cause the value returned by the - first function to change. Note that this must be called in TASK_RUNNING - state. - - (*) Get reply timestamp. - - bool rxrpc_kernel_get_reply_time(struct socket *sock, - struct rxrpc_call *call, - ktime_t *_ts) - - This allows the timestamp on the first DATA packet of the reply of a - client call to be queried, provided that it is still in the Rx ring. If - successful, the timestamp will be stored into *_ts and true will be - returned; false will be returned otherwise. - - (*) Get remote client epoch. - - u32 rxrpc_kernel_get_epoch(struct socket *sock, - struct rxrpc_call *call) - - This allows the epoch that's contained in packets of an incoming client - call to be queried. This value is returned. The function is always - successful if the call is still in progress. It shouldn't be called once - the call has expired. Note that calling this on a local client call only - returns the local epoch. - - This value can be used to determine if the remote client has been - restarted as it shouldn't change otherwise. - - (*) Set the maximum lifespan on a call. - - void rxrpc_kernel_set_max_life(struct socket *sock, - struct rxrpc_call *call, - unsigned long hard_timeout) - - This sets the maximum lifespan on a call to hard_timeout (which is in - jiffies). In the event of the timeout occurring, the call will be - aborted and -ETIME or -ETIMEDOUT will be returned. - - -======================= -CONFIGURABLE PARAMETERS -======================= - -The RxRPC protocol driver has a number of configurable parameters that can be -adjusted through sysctls in /proc/net/rxrpc/: - - (*) req_ack_delay - - The amount of time in milliseconds after receiving a packet with the - request-ack flag set before we honour the flag and actually send the - requested ack.
- - Usually the other side won't stop sending packets until the advertised - reception window is full (to a maximum of 255 packets), so delaying the - ACK permits several packets to be ACK'd in one go. - - (*) soft_ack_delay - - The amount of time in milliseconds after receiving a new packet before we - generate a soft-ACK to tell the sender that it doesn't need to resend. - - (*) idle_ack_delay - - The amount of time in milliseconds after all the packets currently in the - received queue have been consumed before we generate a hard-ACK to tell - the sender it can free its buffers, assuming no other reason occurs that - we would send an ACK. - - (*) resend_timeout - - The amount of time in milliseconds after transmitting a packet before we - transmit it again, assuming no ACK is received from the receiver telling - us they got it. - - (*) max_call_lifetime - - The maximum amount of time in seconds that a call may be in progress - before we preemptively kill it. - - (*) dead_call_expiry - - The amount of time in seconds before we remove a dead call from the call - list. Dead calls are kept around for a little while for the purpose of - repeating ACK and ABORT packets. - - (*) connection_expiry - - The amount of time in seconds after a connection was last used before we - remove it from the connection list. While a connection is in existence, - it serves as a placeholder for negotiated security; when it is deleted, - the security must be renegotiated. - - (*) transport_expiry - - The amount of time in seconds after a transport was last used before we - remove it from the transport list. While a transport is in existence, it - serves to anchor the peer data and keeps the connection ID counter. - - (*) rxrpc_rx_window_size - - The size of the receive window in packets. This is the maximum number of - unconsumed received packets we're willing to hold in memory for any - particular call. 
- - (*) rxrpc_rx_mtu - - The maximum packet MTU size that we're willing to receive in bytes. This - indicates to the peer whether we're willing to accept jumbo packets. - - (*) rxrpc_rx_jumbo_max - - The maximum number of packets that we're willing to accept in a jumbo - packet. Non-terminal packets in a jumbo packet must contain a four byte - header plus exactly 1412 bytes of data. The terminal packet must contain - a four byte header plus any amount of data. In any event, a jumbo packet - may not exceed rxrpc_rx_mtu in size. diff --git a/MAINTAINERS b/MAINTAINERS index b28823ab48c5..866a0dcd66ef 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -14593,7 +14593,7 @@ M: David Howells L: linux-afs@lists.infradead.org S: Supported W: https://www.infradead.org/~dhowells/kafs/ -F: Documentation/networking/rxrpc.txt +F: Documentation/networking/rxrpc.rst F: include/keys/rxrpc-type.h F: include/net/af_rxrpc.h F: include/trace/events/rxrpc.h diff --git a/net/rxrpc/Kconfig b/net/rxrpc/Kconfig index 57ebb29c26ad..d706bb408365 100644 --- a/net/rxrpc/Kconfig +++ b/net/rxrpc/Kconfig @@ -18,7 +18,7 @@ config AF_RXRPC This module at the moment only supports client operations and is currently incomplete. - See Documentation/networking/rxrpc.txt. + See Documentation/networking/rxrpc.rst. config AF_RXRPC_IPV6 bool "IPv6 support for RxRPC" @@ -41,7 +41,7 @@ config AF_RXRPC_DEBUG help Say Y here to make runtime controllable debugging messages appear. - See Documentation/networking/rxrpc.txt. + See Documentation/networking/rxrpc.rst. config RXKAD @@ -56,4 +56,4 @@ config RXKAD Provide kerberos 4 and AFS kaserver security handling for AF_RXRPC through the use of the key retention service. - See Documentation/networking/rxrpc.txt. + See Documentation/networking/rxrpc.rst. 
diff --git a/net/rxrpc/sysctl.c b/net/rxrpc/sysctl.c index 2bbb38161851..174e903e18de 100644 --- a/net/rxrpc/sysctl.c +++ b/net/rxrpc/sysctl.c @@ -21,7 +21,7 @@ static const unsigned long max_jiffies = MAX_JIFFY_OFFSET; /* * RxRPC operating parameters. * - * See Documentation/networking/rxrpc.txt and the variable definitions for more + * See Documentation/networking/rxrpc.rst and the variable definitions for more * information on the individual parameters. */ static struct ctl_table rxrpc_sysctl_table[] = { -- cgit v1.2.3 From 671d114d8cde3ba4390714b850c86d8b39d31009 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:04:22 +0200 Subject: docs: networking: convert sctp.txt to ReST - add SPDX header; - add a document title; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Acked-by: Marcelo Ricardo Leitner Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/sctp.rst | 42 ++++++++++++++++++++++++++++++++++++++ Documentation/networking/sctp.txt | 35 ------------------------------- MAINTAINERS | 2 +- 4 files changed, 44 insertions(+), 36 deletions(-) create mode 100644 Documentation/networking/sctp.rst delete mode 100644 Documentation/networking/sctp.txt (limited to 'MAINTAINERS') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index cd307b9601fa..1761eb715061 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -100,6 +100,7 @@ Contents: rds regulatory rxrpc + sctp .. only:: subproject and html diff --git a/Documentation/networking/sctp.rst b/Documentation/networking/sctp.rst new file mode 100644 index 000000000000..9f4d9c8a925b --- /dev/null +++ b/Documentation/networking/sctp.rst @@ -0,0 +1,42 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + +================= +Linux Kernel SCTP +================= + +This is the current BETA release of the Linux Kernel SCTP reference +implementation. + +SCTP (Stream Control Transmission Protocol) is an IP-based, message-oriented, +reliable transport protocol, with congestion control, support for +transparent multi-homing, and multiple ordered streams of messages. +RFC 2960 defines the core protocol. The IETF SIGTRAN working group originally +developed the SCTP protocol and later handed the protocol over to the +Transport Area (TSVWG) working group for the continued evolvement of SCTP as a +general purpose transport. + +See the IETF website (http://www.ietf.org) for further documents on SCTP. +See http://www.ietf.org/rfc/rfc2960.txt + +The initial project goal is to create a Linux kernel reference implementation +of SCTP that is RFC 2960 compliant and provides a programming interface +referred to as the UDP-style API of the Sockets Extensions for SCTP, as +proposed in IETF Internet-Drafts. + +Caveats +======= + +- lksctp can be built statically or as a module. However, be aware that + module removal of lksctp is not yet a safe activity. + +- There is tentative support for IPv6, but most work has gone towards + implementing and testing lksctp on IPv4. + + +For more information, please visit the lksctp project website: + + http://www.sf.net/projects/lksctp + +Or contact the lksctp developers through the mailing list: + + diff --git a/Documentation/networking/sctp.txt b/Documentation/networking/sctp.txt deleted file mode 100644 index 97b810ca9082..000000000000 --- a/Documentation/networking/sctp.txt +++ /dev/null @@ -1,35 +0,0 @@ -Linux Kernel SCTP - -This is the current BETA release of the Linux Kernel SCTP reference -implementation.
- -SCTP (Stream Control Transmission Protocol) is an IP-based, message-oriented, -reliable transport protocol, with congestion control, support for -transparent multi-homing, and multiple ordered streams of messages. -RFC 2960 defines the core protocol. The IETF SIGTRAN working group originally -developed the SCTP protocol and later handed the protocol over to the -Transport Area (TSVWG) working group for the continued evolvement of SCTP as a -general purpose transport. - -See the IETF website (http://www.ietf.org) for further documents on SCTP. -See http://www.ietf.org/rfc/rfc2960.txt - -The initial project goal is to create a Linux kernel reference implementation -of SCTP that is RFC 2960 compliant and provides a programming interface -referred to as the UDP-style API of the Sockets Extensions for SCTP, as -proposed in IETF Internet-Drafts. - -Caveats: - --lksctp can be built statically or as a module. However, be aware that -module removal of lksctp is not yet a safe activity. - --There is tentative support for IPv6, but most work has gone towards -implementing and testing lksctp on IPv4.
- - -For more information, please visit the lksctp project website: - http://www.sf.net/projects/lksctp - -Or contact the lksctp developers through the mailing list: - diff --git a/MAINTAINERS b/MAINTAINERS index 866a0dcd66ef..0ac9cec0bce6 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -14999,7 +14999,7 @@ M: Marcelo Ricardo Leitner L: linux-sctp@vger.kernel.org S: Maintained W: http://lksctp.sourceforge.net -F: Documentation/networking/sctp.txt +F: Documentation/networking/sctp.rst F: include/linux/sctp.h F: include/net/sctp/ F: include/uapi/linux/sctp.h -- cgit v1.2.3 From 973d55e590beeca13fece60596ee3b511d36d9da Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Fri, 1 May 2020 16:44:23 +0200 Subject: docs: networking: convert tuntap.txt to ReST - add SPDX header; - use copyright symbol; - adjust titles and chapters, adding proper markups; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/tuntap.rst | 259 ++++++++++++++++++++++++++++++++++++ Documentation/networking/tuntap.txt | 227 ------------------------------- MAINTAINERS | 2 +- drivers/net/Kconfig | 2 +- 5 files changed, 262 insertions(+), 229 deletions(-) create mode 100644 Documentation/networking/tuntap.rst delete mode 100644 Documentation/networking/tuntap.txt (limited to 'MAINTAINERS') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index b423b2db5f96..e7a683f0528d 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -111,6 +111,7 @@ Contents: team timestamping tproxy + tuntap .. 
only:: subproject and html
diff --git a/Documentation/networking/tuntap.rst b/Documentation/networking/tuntap.rst
new file mode 100644
index 000000000000..a59d1dd6fdcc
--- /dev/null
+++ b/Documentation/networking/tuntap.rst
@@ -0,0 +1,259 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+===============================
+Universal TUN/TAP device driver
+===============================
+
+Copyright |copy| 1999-2000 Maxim Krasnyansky
+
+  Linux, Solaris drivers
+  Copyright |copy| 1999-2000 Maxim Krasnyansky
+
+  FreeBSD TAP driver
+  Copyright |copy| 1999-2000 Maksim Yevmenkin
+
+  Revision of this document 2002 by Florian Thiel
+
+1. Description
+==============
+
+  TUN/TAP provides packet reception and transmission for user space programs.
+  It can be seen as a simple Point-to-Point or Ethernet device, which,
+  instead of receiving packets from physical media, receives them from
+  user space program and instead of sending packets via physical media
+  writes them to the user space program.
+
+  In order to use the driver a program has to open /dev/net/tun and issue a
+  corresponding ioctl() to register a network device with the kernel. A network
+  device will appear as tunXX or tapXX, depending on the options chosen. When
+  the program closes the file descriptor, the network device and all
+  corresponding routes will disappear.
+
+  Depending on the type of device chosen the userspace program has to read/write
+  IP packets (with tun) or ethernet frames (with tap). Which one is being used
+  depends on the flags given with the ioctl().
+
+  The package from http://vtun.sourceforge.net/tun contains two simple examples
+  for how to use tun and tap devices. Both programs work like a bridge between
+  two network interfaces.
+  br_select.c - bridge based on select system call.
+  br_sigio.c - bridge based on async io and SIGIO signal.
+  However, the best example is VTun http://vtun.sourceforge.net :))
+
+2. Configuration
+================
+
+  Create device node::
+
+     mkdir /dev/net (if it doesn't exist already)
+     mknod /dev/net/tun c 10 200
+
+  Set permissions::
+
+     e.g. chmod 0666 /dev/net/tun
+
+  There's no harm in allowing the device to be accessible by non-root users,
+  since CAP_NET_ADMIN is required for creating network devices or for
+  connecting to network devices which aren't owned by the user in question.
+  If you want to create persistent devices and give ownership of them to
+  unprivileged users, then you need the /dev/net/tun device to be usable by
+  those users.
+
+  Driver module autoloading
+
+     Make sure that "Kernel module loader" - module auto-loading
+     support is enabled in your kernel. The kernel should load it on
+     first access.
+
+  Manual loading
+
+     insert the module by hand::
+
+        modprobe tun
+
+  If you do it the latter way, you have to load the module every time you
+  need it, if you do it the other way it will be automatically loaded when
+  /dev/net/tun is being opened.
+
+3. Program interface
+====================
+
+3.1 Network device allocation
+-----------------------------
+
+``char *dev`` should be the name of the device with a format string (e.g.
+"tun%d"), but (as far as I can see) this can be any valid network device name.
+Note that the character pointer becomes overwritten with the real device name
+(e.g. "tun0")::
+
+  #include <linux/if.h>
+  #include <linux/if_tun.h>
+
+  int tun_alloc(char *dev)
+  {
+      struct ifreq ifr;
+      int fd, err;
+
+      if( (fd = open("/dev/net/tun", O_RDWR)) < 0 )
+         return tun_alloc_old(dev);
+
+      memset(&ifr, 0, sizeof(ifr));
+
+      /* Flags: IFF_TUN   - TUN device (no Ethernet headers)
+       *        IFF_TAP   - TAP device
+       *
+       *        IFF_NO_PI - Do not provide packet information
+       */
+      ifr.ifr_flags = IFF_TUN;
+      if( *dev )
+         strncpy(ifr.ifr_name, dev, IFNAMSIZ);
+
+      if( (err = ioctl(fd, TUNSETIFF, (void *) &ifr)) < 0 ){
+         close(fd);
+         return err;
+      }
+      strcpy(dev, ifr.ifr_name);
+      return fd;
+  }
+
+3.2 Frame format
+----------------
+
+If flag IFF_NO_PI is not set each frame format is::
+
+     Flags [2 bytes]
+     Proto [2 bytes]
+     Raw protocol(IP, IPv6, etc) frame.
+
+3.3 Multiqueue tuntap interface
+-------------------------------
+
+From version 3.8, Linux supports multiqueue tuntap which can uses multiple
+file descriptors (queues) to parallelize packets sending or receiving. The
+device allocation is the same as before, and if user wants to create multiple
+queues, TUNSETIFF with the same device name must be called many times with
+IFF_MULTI_QUEUE flag.
+
+``char *dev`` should be the name of the device, queues is the number of queues
+to be created, fds is used to store and return the file descriptors (queues)
+created to the caller. Each file descriptor were served as the interface of a
+queue which could be accessed by userspace.
+
+::
+
+  #include <linux/if.h>
+  #include <linux/if_tun.h>
+
+  int tun_alloc_mq(char *dev, int queues, int *fds)
+  {
+      struct ifreq ifr;
+      int fd, err, i;
+
+      if (!dev)
+          return -1;
+
+      memset(&ifr, 0, sizeof(ifr));
+      /* Flags: IFF_TUN   - TUN device (no Ethernet headers)
+       *        IFF_TAP   - TAP device
+       *
+       *        IFF_NO_PI - Do not provide packet information
+       *        IFF_MULTI_QUEUE - Create a queue of multiqueue device
+       */
+      ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_MULTI_QUEUE;
+      strcpy(ifr.ifr_name, dev);
+
+      for (i = 0; i < queues; i++) {
+          if ((fd = open("/dev/net/tun", O_RDWR)) < 0)
+             goto err;
+          err = ioctl(fd, TUNSETIFF, (void *)&ifr);
+          if (err) {
+             close(fd);
+             goto err;
+          }
+          fds[i] = fd;
+      }
+
+      return 0;
+  err:
+      for (--i; i >= 0; i--)
+          close(fds[i]);
+      return err;
+  }
+
+A new ioctl(TUNSETQUEUE) were introduced to enable or disable a queue. When
+calling it with IFF_DETACH_QUEUE flag, the queue were disabled. And when
+calling it with IFF_ATTACH_QUEUE flag, the queue were enabled. The queue were
+enabled by default after it was created through TUNSETIFF.
+
+fd is the file descriptor (queue) that we want to enable or disable, when
+enable is true we enable it, otherwise we disable it::
+
+  #include <linux/if.h>
+  #include <linux/if_tun.h>
+
+  int tun_set_queue(int fd, int enable)
+  {
+      struct ifreq ifr;
+
+      memset(&ifr, 0, sizeof(ifr));
+
+      if (enable)
+         ifr.ifr_flags = IFF_ATTACH_QUEUE;
+      else
+         ifr.ifr_flags = IFF_DETACH_QUEUE;
+
+      return ioctl(fd, TUNSETQUEUE, (void *)&ifr);
+  }
+
+Universal TUN/TAP device driver Frequently Asked Question
+=========================================================
+
+1. What platforms are supported by TUN/TAP driver ?
+
+Currently driver has been written for 3 Unices:
+
+  - Linux kernels 2.2.x, 2.4.x
+  - FreeBSD 3.x, 4.x, 5.x
+  - Solaris 2.6, 7.0, 8.0
+
+2. What is TUN/TAP driver used for?
+
+As mentioned above, main purpose of TUN/TAP driver is tunneling.
+It is used by VTun (http://vtun.sourceforge.net).
+ +Another interesting application using TUN/TAP is pipsecd +(http://perso.enst.fr/~beyssac/pipsec/), a userspace IPSec +implementation that can use complete kernel routing (unlike FreeS/WAN). + +3. How does Virtual network device actually work ? + +Virtual network device can be viewed as a simple Point-to-Point or +Ethernet device, which instead of receiving packets from a physical +media, receives them from user space program and instead of sending +packets via physical media sends them to the user space program. + +Let's say that you configured IPv6 on the tap0, then whenever +the kernel sends an IPv6 packet to tap0, it is passed to the application +(VTun for example). The application encrypts, compresses and sends it to +the other side over TCP or UDP. The application on the other side decompresses +and decrypts the data received and writes the packet to the TAP device, +the kernel handles the packet like it came from real physical device. + +4. What is the difference between TUN driver and TAP driver? + +TUN works with IP frames. TAP works with Ethernet frames. + +This means that you have to read/write IP packets when you are using tun and +ethernet frames when using tap. + +5. What is the difference between BPF and TUN/TAP driver? + +BPF is an advanced packet filter. It can be attached to existing +network interface. It does not provide a virtual network interface. +A TUN/TAP driver does provide a virtual network interface and it is possible +to attach BPF to this interface. + +6. Does TAP driver support kernel Ethernet bridging? + +Yes. Linux and FreeBSD drivers support Ethernet bridging. diff --git a/Documentation/networking/tuntap.txt b/Documentation/networking/tuntap.txt deleted file mode 100644 index 0104830d5075..000000000000 --- a/Documentation/networking/tuntap.txt +++ /dev/null @@ -1,227 +0,0 @@ -Universal TUN/TAP device driver. 
-Copyright (C) 1999-2000 Maxim Krasnyansky - - Linux, Solaris drivers - Copyright (C) 1999-2000 Maxim Krasnyansky - - FreeBSD TAP driver - Copyright (c) 1999-2000 Maksim Yevmenkin - - Revision of this document 2002 by Florian Thiel - -1. Description - TUN/TAP provides packet reception and transmission for user space programs. - It can be seen as a simple Point-to-Point or Ethernet device, which, - instead of receiving packets from physical media, receives them from - user space program and instead of sending packets via physical media - writes them to the user space program. - - In order to use the driver a program has to open /dev/net/tun and issue a - corresponding ioctl() to register a network device with the kernel. A network - device will appear as tunXX or tapXX, depending on the options chosen. When - the program closes the file descriptor, the network device and all - corresponding routes will disappear. - - Depending on the type of device chosen the userspace program has to read/write - IP packets (with tun) or ethernet frames (with tap). Which one is being used - depends on the flags given with the ioctl(). - - The package from http://vtun.sourceforge.net/tun contains two simple examples - for how to use tun and tap devices. Both programs work like a bridge between - two network interfaces. - br_select.c - bridge based on select system call. - br_sigio.c - bridge based on async io and SIGIO signal. - However, the best example is VTun http://vtun.sourceforge.net :)) - -2. Configuration - Create device node: - mkdir /dev/net (if it doesn't exist already) - mknod /dev/net/tun c 10 200 - - Set permissions: - e.g. chmod 0666 /dev/net/tun - There's no harm in allowing the device to be accessible by non-root users, - since CAP_NET_ADMIN is required for creating network devices or for - connecting to network devices which aren't owned by the user in question. 
-  If you want to create persistent devices and give ownership of them to
-  unprivileged users, then you need the /dev/net/tun device to be usable by
-  those users.
-
-  Driver module autoloading
-
-     Make sure that "Kernel module loader" - module auto-loading
-     support is enabled in your kernel. The kernel should load it on
-     first access.
-
-  Manual loading
-     insert the module by hand:
-        modprobe tun
-
-  If you do it the latter way, you have to load the module every time you
-  need it, if you do it the other way it will be automatically loaded when
-  /dev/net/tun is being opened.
-
-3. Program interface
-  3.1 Network device allocation:
-
-  char *dev should be the name of the device with a format string (e.g.
-  "tun%d"), but (as far as I can see) this can be any valid network device name.
-  Note that the character pointer becomes overwritten with the real device name
-  (e.g. "tun0")
-
-  #include <linux/if.h>
-  #include <linux/if_tun.h>
-
-  int tun_alloc(char *dev)
-  {
-      struct ifreq ifr;
-      int fd, err;
-
-      if( (fd = open("/dev/net/tun", O_RDWR)) < 0 )
-         return tun_alloc_old(dev);
-
-      memset(&ifr, 0, sizeof(ifr));
-
-      /* Flags: IFF_TUN   - TUN device (no Ethernet headers)
-       *        IFF_TAP   - TAP device
-       *
-       *        IFF_NO_PI - Do not provide packet information
-       */
-      ifr.ifr_flags = IFF_TUN;
-      if( *dev )
-         strncpy(ifr.ifr_name, dev, IFNAMSIZ);
-
-      if( (err = ioctl(fd, TUNSETIFF, (void *) &ifr)) < 0 ){
-         close(fd);
-         return err;
-      }
-      strcpy(dev, ifr.ifr_name);
-      return fd;
-  }
-
-  3.2 Frame format:
-  If flag IFF_NO_PI is not set each frame format is:
-     Flags [2 bytes]
-     Proto [2 bytes]
-     Raw protocol(IP, IPv6, etc) frame.
-
-  3.3 Multiqueue tuntap interface:
-
-  From version 3.8, Linux supports multiqueue tuntap which can uses multiple
-  file descriptors (queues) to parallelize packets sending or receiving. The
-  device allocation is the same as before, and if user wants to create multiple
-  queues, TUNSETIFF with the same device name must be called many times with
-  IFF_MULTI_QUEUE flag.
-
-  char *dev should be the name of the device, queues is the number of queues to
-  be created, fds is used to store and return the file descriptors (queues)
-  created to the caller. Each file descriptor were served as the interface of a
-  queue which could be accessed by userspace.
-
-  #include <linux/if.h>
-  #include <linux/if_tun.h>
-
-  int tun_alloc_mq(char *dev, int queues, int *fds)
-  {
-      struct ifreq ifr;
-      int fd, err, i;
-
-      if (!dev)
-          return -1;
-
-      memset(&ifr, 0, sizeof(ifr));
-      /* Flags: IFF_TUN   - TUN device (no Ethernet headers)
-       *        IFF_TAP   - TAP device
-       *
-       *        IFF_NO_PI - Do not provide packet information
-       *        IFF_MULTI_QUEUE - Create a queue of multiqueue device
-       */
-      ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_MULTI_QUEUE;
-      strcpy(ifr.ifr_name, dev);
-
-      for (i = 0; i < queues; i++) {
-          if ((fd = open("/dev/net/tun", O_RDWR)) < 0)
-             goto err;
-          err = ioctl(fd, TUNSETIFF, (void *)&ifr);
-          if (err) {
-             close(fd);
-             goto err;
-          }
-          fds[i] = fd;
-      }
-
-      return 0;
-  err:
-      for (--i; i >= 0; i--)
-          close(fds[i]);
-      return err;
-  }
-
-  A new ioctl(TUNSETQUEUE) were introduced to enable or disable a queue. When
-  calling it with IFF_DETACH_QUEUE flag, the queue were disabled. And when
-  calling it with IFF_ATTACH_QUEUE flag, the queue were enabled. The queue were
-  enabled by default after it was created through TUNSETIFF.
-
-  fd is the file descriptor (queue) that we want to enable or disable, when
-  enable is true we enable it, otherwise we disable it
-
-  #include <linux/if.h>
-  #include <linux/if_tun.h>
-
-  int tun_set_queue(int fd, int enable)
-  {
-      struct ifreq ifr;
-
-      memset(&ifr, 0, sizeof(ifr));
-
-      if (enable)
-         ifr.ifr_flags = IFF_ATTACH_QUEUE;
-      else
-         ifr.ifr_flags = IFF_DETACH_QUEUE;
-
-      return ioctl(fd, TUNSETQUEUE, (void *)&ifr);
-  }
-
-Universal TUN/TAP device driver Frequently Asked Question.
-
-1. What platforms are supported by TUN/TAP driver ?
-Currently driver has been written for 3 Unices:
-   Linux kernels 2.2.x, 2.4.x
-   FreeBSD 3.x, 4.x, 5.x
-   Solaris 2.6, 7.0, 8.0
-
-2. What is TUN/TAP driver used for?
-As mentioned above, main purpose of TUN/TAP driver is tunneling.
-It is used by VTun (http://vtun.sourceforge.net).
-
-Another interesting application using TUN/TAP is pipsecd
-(http://perso.enst.fr/~beyssac/pipsec/), a userspace IPSec
-implementation that can use complete kernel routing (unlike FreeS/WAN).
-
-3. How does Virtual network device actually work ?
-Virtual network device can be viewed as a simple Point-to-Point or
-Ethernet device, which instead of receiving packets from a physical
-media, receives them from user space program and instead of sending
-packets via physical media sends them to the user space program.
-
-Let's say that you configured IPv6 on the tap0, then whenever
-the kernel sends an IPv6 packet to tap0, it is passed to the application
-(VTun for example). The application encrypts, compresses and sends it to
-the other side over TCP or UDP. The application on the other side decompresses
-and decrypts the data received and writes the packet to the TAP device,
-the kernel handles the packet like it came from real physical device.
-
-4. What is the difference between TUN driver and TAP driver?
-TUN works with IP frames. TAP works with Ethernet frames.
-
-This means that you have to read/write IP packets when you are using tun and
-ethernet frames when using tap.
-
-5. What is the difference between BPF and TUN/TAP driver?
-BPF is an advanced packet filter. It can be attached to existing
-network interface. It does not provide a virtual network interface.
-A TUN/TAP driver does provide a virtual network interface and it is possible
-to attach BPF to this interface.
-
-6. Does TAP driver support kernel Ethernet bridging?
-Yes. Linux and FreeBSD drivers support Ethernet bridging.
diff --git a/MAINTAINERS b/MAINTAINERS index 0ac9cec0bce6..6456c5bb02f1 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -17161,7 +17161,7 @@ TUN/TAP driver M: Maxim Krasnyansky S: Maintained W: http://vtun.sourceforge.net/tun -F: Documentation/networking/tuntap.txt +F: Documentation/networking/tuntap.rst F: arch/um/os-Linux/drivers/ TURBOCHANNEL SUBSYSTEM diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig index ad64be98330f..3f2c98a7906c 100644 --- a/drivers/net/Kconfig +++ b/drivers/net/Kconfig @@ -355,7 +355,7 @@ config TUN devices, driver will automatically delete tunXX or tapXX device and all routes corresponding to it. - Please read for more + Please read for more information. To compile this driver as a module, choose M here: the module -- cgit v1.2.3 From 58ccb2b2e87d52ec0b4cbd40b94e0b63e90af873 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Fri, 1 May 2020 16:44:25 +0200 Subject: docs: networking: convert vrf.txt to ReST - add SPDX header; - adjust title markup; - Add a subtitle for the first section; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Acked-by: David Ahern Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/vrf.rst | 451 +++++++++++++++++++++++++++++++++++++ Documentation/networking/vrf.txt | 418 ---------------------------------- MAINTAINERS | 2 +- 4 files changed, 453 insertions(+), 419 deletions(-) create mode 100644 Documentation/networking/vrf.rst delete mode 100644 Documentation/networking/vrf.txt (limited to 'MAINTAINERS') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index ca0b0dbfd9ad..2227b9f4509d 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -113,6 +113,7 @@ Contents: tproxy tuntap udplite + vrf .. 
only:: subproject and html diff --git a/Documentation/networking/vrf.rst b/Documentation/networking/vrf.rst new file mode 100644 index 000000000000..0dde145043bc --- /dev/null +++ b/Documentation/networking/vrf.rst @@ -0,0 +1,451 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==================================== +Virtual Routing and Forwarding (VRF) +==================================== + +The VRF Device +============== + +The VRF device combined with ip rules provides the ability to create virtual +routing and forwarding domains (aka VRFs, VRF-lite to be specific) in the +Linux network stack. One use case is the multi-tenancy problem where each +tenant has their own unique routing tables and in the very least need +different default gateways. + +Processes can be "VRF aware" by binding a socket to the VRF device. Packets +through the socket then use the routing table associated with the VRF +device. An important feature of the VRF device implementation is that it +impacts only Layer 3 and above so L2 tools (e.g., LLDP) are not affected +(ie., they do not need to be run in each VRF). The design also allows +the use of higher priority ip rules (Policy Based Routing, PBR) to take +precedence over the VRF device rules directing specific traffic as desired. + +In addition, VRF devices allow VRFs to be nested within namespaces. For +example network namespaces provide separation of network interfaces at the +device layer, VLANs on the interfaces within a namespace provide L2 separation +and then VRF devices provide L3 separation. + +Design +------ +A VRF device is created with an associated route table. Network interfaces +are then enslaved to a VRF device:: + + +-----------------------------+ + | vrf-blue | ===> route table 10 + +-----------------------------+ + | | | + +------+ +------+ +-------------+ + | eth1 | | eth2 | ... 
| bond1 | + +------+ +------+ +-------------+ + | | + +------+ +------+ + | eth8 | | eth9 | + +------+ +------+ + +Packets received on an enslaved device and are switched to the VRF device +in the IPv4 and IPv6 processing stacks giving the impression that packets +flow through the VRF device. Similarly on egress routing rules are used to +send packets to the VRF device driver before getting sent out the actual +interface. This allows tcpdump on a VRF device to capture all packets into +and out of the VRF as a whole\ [1]_. Similarly, netfilter\ [2]_ and tc rules +can be applied using the VRF device to specify rules that apply to the VRF +domain as a whole. + +.. [1] Packets in the forwarded state do not flow through the device, so those + packets are not seen by tcpdump. Will revisit this limitation in a + future release. + +.. [2] Iptables on ingress supports PREROUTING with skb->dev set to the real + ingress device and both INPUT and PREROUTING rules with skb->dev set to + the VRF device. For egress POSTROUTING and OUTPUT rules can be written + using either the VRF device or real egress device. + +Setup +----- +1. VRF device is created with an association to a FIB table. + e.g,:: + + ip link add vrf-blue type vrf table 10 + ip link set dev vrf-blue up + +2. An l3mdev FIB rule directs lookups to the table associated with the device. + A single l3mdev rule is sufficient for all VRFs. The VRF device adds the + l3mdev rule for IPv4 and IPv6 when the first device is created with a + default preference of 1000. Users may delete the rule if desired and add + with a different priority or install per-VRF rules. + + Prior to the v4.8 kernel iif and oif rules are needed for each VRF device:: + + ip ru add oif vrf-blue table 10 + ip ru add iif vrf-blue table 10 + +3. 
Set the default route for the table (and hence default route for the VRF):: + + ip route add table 10 unreachable default metric 4278198272 + + This high metric value ensures that the default unreachable route can + be overridden by a routing protocol suite. FRRouting interprets + kernel metrics as a combined admin distance (upper byte) and priority + (lower 3 bytes). Thus the above metric translates to [255/8192]. + +4. Enslave L3 interfaces to a VRF device:: + + ip link set dev eth1 master vrf-blue + + Local and connected routes for enslaved devices are automatically moved to + the table associated with VRF device. Any additional routes depending on + the enslaved device are dropped and will need to be reinserted to the VRF + FIB table following the enslavement. + + The IPv6 sysctl option keep_addr_on_down can be enabled to keep IPv6 global + addresses as VRF enslavement changes:: + + sysctl -w net.ipv6.conf.all.keep_addr_on_down=1 + +5. Additional VRF routes are added to associated table:: + + ip route add table 10 ... + + +Applications +------------ +Applications that are to work within a VRF need to bind their socket to the +VRF device:: + + setsockopt(sd, SOL_SOCKET, SO_BINDTODEVICE, dev, strlen(dev)+1); + +or to specify the output device using cmsg and IP_PKTINFO. + +By default the scope of the port bindings for unbound sockets is +limited to the default VRF. That is, it will not be matched by packets +arriving on interfaces enslaved to an l3mdev and processes may bind to +the same port if they bind to an l3mdev. + +TCP & UDP services running in the default VRF context (ie., not bound +to any VRF device) can work across all VRF domains by enabling the +tcp_l3mdev_accept and udp_l3mdev_accept sysctl options:: + + sysctl -w net.ipv4.tcp_l3mdev_accept=1 + sysctl -w net.ipv4.udp_l3mdev_accept=1 + +These options are disabled by default so that a socket in a VRF is only +selected for packets in that VRF. 
There is a similar option for RAW +sockets, which is enabled by default for reasons of backwards compatibility. +This is so as to specify the output device with cmsg and IP_PKTINFO, but +using a socket not bound to the corresponding VRF. This allows e.g. older ping +implementations to be run with specifying the device but without executing it +in the VRF. This option can be disabled so that packets received in a VRF +context are only handled by a raw socket bound to the VRF, and packets in the +default VRF are only handled by a socket not bound to any VRF:: + + sysctl -w net.ipv4.raw_l3mdev_accept=0 + +netfilter rules on the VRF device can be used to limit access to services +running in the default VRF context as well. + +-------------------------------------------------------------------------------- + +Using iproute2 for VRFs +======================= +iproute2 supports the vrf keyword as of v4.7. For backwards compatibility this +section lists both commands where appropriate -- with the vrf keyword and the +older form without it. + +1. Create a VRF + + To instantiate a VRF device and associate it with a table:: + + $ ip link add dev NAME type vrf table ID + + As of v4.8 the kernel supports the l3mdev FIB rule where a single rule + covers all VRFs. The l3mdev rule is created for IPv4 and IPv6 on first + device create. + +2. 
List VRFs + + To list VRFs that have been created:: + + $ ip [-d] link show type vrf + NOTE: The -d option is needed to show the table id + + For example:: + + $ ip -d link show type vrf + 11: mgmt: mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000 + link/ether 72:b3:ba:91:e2:24 brd ff:ff:ff:ff:ff:ff promiscuity 0 + vrf table 1 addrgenmode eui64 + 12: red: mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000 + link/ether b6:6f:6e:f6:da:73 brd ff:ff:ff:ff:ff:ff promiscuity 0 + vrf table 10 addrgenmode eui64 + 13: blue: mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000 + link/ether 36:62:e8:7d:bb:8c brd ff:ff:ff:ff:ff:ff promiscuity 0 + vrf table 66 addrgenmode eui64 + 14: green: mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000 + link/ether e6:28:b8:63:70:bb brd ff:ff:ff:ff:ff:ff promiscuity 0 + vrf table 81 addrgenmode eui64 + + + Or in brief output:: + + $ ip -br link show type vrf + mgmt UP 72:b3:ba:91:e2:24 + red UP b6:6f:6e:f6:da:73 + blue UP 36:62:e8:7d:bb:8c + green UP e6:28:b8:63:70:bb + + +3. Assign a Network Interface to a VRF + + Network interfaces are assigned to a VRF by enslaving the netdevice to a + VRF device:: + + $ ip link set dev NAME master NAME + + On enslavement connected and local routes are automatically moved to the + table associated with the VRF device. + + For example:: + + $ ip link set dev eth0 master mgmt + + +4. 
Show Devices Assigned to a VRF + + To show devices that have been assigned to a specific VRF add the master + option to the ip command:: + + $ ip link show vrf NAME + $ ip link show master NAME + + For example:: + + $ ip link show vrf red + 3: eth1: mtu 1500 qdisc pfifo_fast master red state UP mode DEFAULT group default qlen 1000 + link/ether 02:00:00:00:02:02 brd ff:ff:ff:ff:ff:ff + 4: eth2: mtu 1500 qdisc pfifo_fast master red state UP mode DEFAULT group default qlen 1000 + link/ether 02:00:00:00:02:03 brd ff:ff:ff:ff:ff:ff + 7: eth5: mtu 1500 qdisc noop master red state DOWN mode DEFAULT group default qlen 1000 + link/ether 02:00:00:00:02:06 brd ff:ff:ff:ff:ff:ff + + + Or using the brief output:: + + $ ip -br link show vrf red + eth1 UP 02:00:00:00:02:02 + eth2 UP 02:00:00:00:02:03 + eth5 DOWN 02:00:00:00:02:06 + + +5. Show Neighbor Entries for a VRF + + To list neighbor entries associated with devices enslaved to a VRF device + add the master option to the ip command:: + + $ ip [-6] neigh show vrf NAME + $ ip [-6] neigh show master NAME + + For example:: + + $ ip neigh show vrf red + 10.2.1.254 dev eth1 lladdr a6:d9:c7:4f:06:23 REACHABLE + 10.2.2.254 dev eth2 lladdr 5e:54:01:6a:ee:80 REACHABLE + + $ ip -6 neigh show vrf red + 2002:1::64 dev eth1 lladdr a6:d9:c7:4f:06:23 REACHABLE + + +6. 
Show Addresses for a VRF + + To show addresses for interfaces associated with a VRF add the master + option to the ip command:: + + $ ip addr show vrf NAME + $ ip addr show master NAME + + For example:: + + $ ip addr show vrf red + 3: eth1: mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000 + link/ether 02:00:00:00:02:02 brd ff:ff:ff:ff:ff:ff + inet 10.2.1.2/24 brd 10.2.1.255 scope global eth1 + valid_lft forever preferred_lft forever + inet6 2002:1::2/120 scope global + valid_lft forever preferred_lft forever + inet6 fe80::ff:fe00:202/64 scope link + valid_lft forever preferred_lft forever + 4: eth2: mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000 + link/ether 02:00:00:00:02:03 brd ff:ff:ff:ff:ff:ff + inet 10.2.2.2/24 brd 10.2.2.255 scope global eth2 + valid_lft forever preferred_lft forever + inet6 2002:2::2/120 scope global + valid_lft forever preferred_lft forever + inet6 fe80::ff:fe00:203/64 scope link + valid_lft forever preferred_lft forever + 7: eth5: mtu 1500 qdisc noop master red state DOWN group default qlen 1000 + link/ether 02:00:00:00:02:06 brd ff:ff:ff:ff:ff:ff + + Or in brief format:: + + $ ip -br addr show vrf red + eth1 UP 10.2.1.2/24 2002:1::2/120 fe80::ff:fe00:202/64 + eth2 UP 10.2.2.2/24 2002:2::2/120 fe80::ff:fe00:203/64 + eth5 DOWN + + +7. 
Show Routes for a VRF + + To show routes for a VRF use the ip command to display the table associated + with the VRF device:: + + $ ip [-6] route show vrf NAME + $ ip [-6] route show table ID + + For example:: + + $ ip route show vrf red + unreachable default metric 4278198272 + broadcast 10.2.1.0 dev eth1 proto kernel scope link src 10.2.1.2 + 10.2.1.0/24 dev eth1 proto kernel scope link src 10.2.1.2 + local 10.2.1.2 dev eth1 proto kernel scope host src 10.2.1.2 + broadcast 10.2.1.255 dev eth1 proto kernel scope link src 10.2.1.2 + broadcast 10.2.2.0 dev eth2 proto kernel scope link src 10.2.2.2 + 10.2.2.0/24 dev eth2 proto kernel scope link src 10.2.2.2 + local 10.2.2.2 dev eth2 proto kernel scope host src 10.2.2.2 + broadcast 10.2.2.255 dev eth2 proto kernel scope link src 10.2.2.2 + + $ ip -6 route show vrf red + local 2002:1:: dev lo proto none metric 0 pref medium + local 2002:1::2 dev lo proto none metric 0 pref medium + 2002:1::/120 dev eth1 proto kernel metric 256 pref medium + local 2002:2:: dev lo proto none metric 0 pref medium + local 2002:2::2 dev lo proto none metric 0 pref medium + 2002:2::/120 dev eth2 proto kernel metric 256 pref medium + local fe80:: dev lo proto none metric 0 pref medium + local fe80:: dev lo proto none metric 0 pref medium + local fe80::ff:fe00:202 dev lo proto none metric 0 pref medium + local fe80::ff:fe00:203 dev lo proto none metric 0 pref medium + fe80::/64 dev eth1 proto kernel metric 256 pref medium + fe80::/64 dev eth2 proto kernel metric 256 pref medium + ff00::/8 dev red metric 256 pref medium + ff00::/8 dev eth1 metric 256 pref medium + ff00::/8 dev eth2 metric 256 pref medium + unreachable default dev lo metric 4278198272 error -101 pref medium + +8. 
Route Lookup for a VRF + + A test route lookup can be done for a VRF:: + + $ ip [-6] route get vrf NAME ADDRESS + $ ip [-6] route get oif NAME ADDRESS + + For example:: + + $ ip route get 10.2.1.40 vrf red + 10.2.1.40 dev eth1 table red src 10.2.1.2 + cache + + $ ip -6 route get 2002:1::32 vrf red + 2002:1::32 from :: dev eth1 table red proto kernel src 2002:1::2 metric 256 pref medium + + +9. Removing Network Interface from a VRF + + Network interfaces are removed from a VRF by breaking the enslavement to + the VRF device:: + + $ ip link set dev NAME nomaster + + Connected routes are moved back to the default table and local entries are + moved to the local table. + + For example:: + + $ ip link set dev eth0 nomaster + +-------------------------------------------------------------------------------- + +Commands used in this example:: + + cat >> /etc/iproute2/rt_tables.d/vrf.conf < route table 10 - +-----------------------------+ - | | | - +------+ +------+ +-------------+ - | eth1 | | eth2 | ... | bond1 | - +------+ +------+ +-------------+ - | | - +------+ +------+ - | eth8 | | eth9 | - +------+ +------+ - -Packets received on an enslaved device and are switched to the VRF device -in the IPv4 and IPv6 processing stacks giving the impression that packets -flow through the VRF device. Similarly on egress routing rules are used to -send packets to the VRF device driver before getting sent out the actual -interface. This allows tcpdump on a VRF device to capture all packets into -and out of the VRF as a whole.[1] Similarly, netfilter[2] and tc rules can be -applied using the VRF device to specify rules that apply to the VRF domain -as a whole. - -[1] Packets in the forwarded state do not flow through the device, so those - packets are not seen by tcpdump. Will revisit this limitation in a - future release. 
- -[2] Iptables on ingress supports PREROUTING with skb->dev set to the real - ingress device and both INPUT and PREROUTING rules with skb->dev set to - the VRF device. For egress POSTROUTING and OUTPUT rules can be written - using either the VRF device or real egress device. - -Setup ------ -1. VRF device is created with an association to a FIB table. - e.g, ip link add vrf-blue type vrf table 10 - ip link set dev vrf-blue up - -2. An l3mdev FIB rule directs lookups to the table associated with the device. - A single l3mdev rule is sufficient for all VRFs. The VRF device adds the - l3mdev rule for IPv4 and IPv6 when the first device is created with a - default preference of 1000. Users may delete the rule if desired and add - with a different priority or install per-VRF rules. - - Prior to the v4.8 kernel iif and oif rules are needed for each VRF device: - ip ru add oif vrf-blue table 10 - ip ru add iif vrf-blue table 10 - -3. Set the default route for the table (and hence default route for the VRF). - ip route add table 10 unreachable default metric 4278198272 - - This high metric value ensures that the default unreachable route can - be overridden by a routing protocol suite. FRRouting interprets - kernel metrics as a combined admin distance (upper byte) and priority - (lower 3 bytes). Thus the above metric translates to [255/8192]. - -4. Enslave L3 interfaces to a VRF device. - ip link set dev eth1 master vrf-blue - - Local and connected routes for enslaved devices are automatically moved to - the table associated with VRF device. Any additional routes depending on - the enslaved device are dropped and will need to be reinserted to the VRF - FIB table following the enslavement. - - The IPv6 sysctl option keep_addr_on_down can be enabled to keep IPv6 global - addresses as VRF enslavement changes. - sysctl -w net.ipv6.conf.all.keep_addr_on_down=1 - -5. Additional VRF routes are added to associated table. - ip route add table 10 ... 
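The metric arithmetic in step 3 above can be checked with a short sketch. It assumes only the split described in the text (upper byte as admin distance, lower 3 bytes as priority); the function name `split_frr_metric` is hypothetical and not part of FRRouting or iproute2.

```python
def split_frr_metric(metric):
    """Split a 32-bit kernel route metric the way the text describes.

    Assumption: upper byte = admin distance, lower 3 bytes = priority.
    """
    distance = (metric >> 24) & 0xFF   # upper byte
    priority = metric & 0xFFFFFF       # lower 3 bytes
    return distance, priority


distance, priority = split_frr_metric(4278198272)
print("[%d/%d]" % (distance, priority))  # prints [255/8192]
```

An admin distance of 255 is the largest value one byte can hold, which fits the text's statement that this default unreachable route can be overridden by a routing protocol suite.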
- - -Applications ------------- -Applications that are to work within a VRF need to bind their socket to the -VRF device: - - setsockopt(sd, SOL_SOCKET, SO_BINDTODEVICE, dev, strlen(dev)+1); - -or to specify the output device using cmsg and IP_PKTINFO. - -By default the scope of the port bindings for unbound sockets is -limited to the default VRF. That is, it will not be matched by packets -arriving on interfaces enslaved to an l3mdev and processes may bind to -the same port if they bind to an l3mdev. - -TCP & UDP services running in the default VRF context (ie., not bound -to any VRF device) can work across all VRF domains by enabling the -tcp_l3mdev_accept and udp_l3mdev_accept sysctl options: - - sysctl -w net.ipv4.tcp_l3mdev_accept=1 - sysctl -w net.ipv4.udp_l3mdev_accept=1 - -These options are disabled by default so that a socket in a VRF is only -selected for packets in that VRF. There is a similar option for RAW -sockets, which is enabled by default for reasons of backwards compatibility. -This is so as to specify the output device with cmsg and IP_PKTINFO, but -using a socket not bound to the corresponding VRF. This allows e.g. older ping -implementations to be run with specifying the device but without executing it -in the VRF. This option can be disabled so that packets received in a VRF -context are only handled by a raw socket bound to the VRF, and packets in the -default VRF are only handled by a socket not bound to any VRF: - - sysctl -w net.ipv4.raw_l3mdev_accept=0 - -netfilter rules on the VRF device can be used to limit access to services -running in the default VRF context as well. - -################################################################################ - -Using iproute2 for VRFs -======================= -iproute2 supports the vrf keyword as of v4.7. For backwards compatibility this -section lists both commands where appropriate -- with the vrf keyword and the -older form without it. - -1. 
Create a VRF - - To instantiate a VRF device and associate it with a table: - $ ip link add dev NAME type vrf table ID - - As of v4.8 the kernel supports the l3mdev FIB rule where a single rule - covers all VRFs. The l3mdev rule is created for IPv4 and IPv6 on first - device create. - -2. List VRFs - - To list VRFs that have been created: - $ ip [-d] link show type vrf - NOTE: The -d option is needed to show the table id - - For example: - $ ip -d link show type vrf - 11: mgmt: mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000 - link/ether 72:b3:ba:91:e2:24 brd ff:ff:ff:ff:ff:ff promiscuity 0 - vrf table 1 addrgenmode eui64 - 12: red: mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000 - link/ether b6:6f:6e:f6:da:73 brd ff:ff:ff:ff:ff:ff promiscuity 0 - vrf table 10 addrgenmode eui64 - 13: blue: mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000 - link/ether 36:62:e8:7d:bb:8c brd ff:ff:ff:ff:ff:ff promiscuity 0 - vrf table 66 addrgenmode eui64 - 14: green: mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000 - link/ether e6:28:b8:63:70:bb brd ff:ff:ff:ff:ff:ff promiscuity 0 - vrf table 81 addrgenmode eui64 - - - Or in brief output: - - $ ip -br link show type vrf - mgmt UP 72:b3:ba:91:e2:24 - red UP b6:6f:6e:f6:da:73 - blue UP 36:62:e8:7d:bb:8c - green UP e6:28:b8:63:70:bb - - -3. Assign a Network Interface to a VRF - - Network interfaces are assigned to a VRF by enslaving the netdevice to a - VRF device: - $ ip link set dev NAME master NAME - - On enslavement connected and local routes are automatically moved to the - table associated with the VRF device. - - For example: - $ ip link set dev eth0 master mgmt - - -4. 
Show Devices Assigned to a VRF - - To show devices that have been assigned to a specific VRF add the master - option to the ip command: - $ ip link show vrf NAME - $ ip link show master NAME - - For example: - $ ip link show vrf red - 3: eth1: mtu 1500 qdisc pfifo_fast master red state UP mode DEFAULT group default qlen 1000 - link/ether 02:00:00:00:02:02 brd ff:ff:ff:ff:ff:ff - 4: eth2: mtu 1500 qdisc pfifo_fast master red state UP mode DEFAULT group default qlen 1000 - link/ether 02:00:00:00:02:03 brd ff:ff:ff:ff:ff:ff - 7: eth5: mtu 1500 qdisc noop master red state DOWN mode DEFAULT group default qlen 1000 - link/ether 02:00:00:00:02:06 brd ff:ff:ff:ff:ff:ff - - - Or using the brief output: - $ ip -br link show vrf red - eth1 UP 02:00:00:00:02:02 - eth2 UP 02:00:00:00:02:03 - eth5 DOWN 02:00:00:00:02:06 - - -5. Show Neighbor Entries for a VRF - - To list neighbor entries associated with devices enslaved to a VRF device - add the master option to the ip command: - $ ip [-6] neigh show vrf NAME - $ ip [-6] neigh show master NAME - - For example: - $ ip neigh show vrf red - 10.2.1.254 dev eth1 lladdr a6:d9:c7:4f:06:23 REACHABLE - 10.2.2.254 dev eth2 lladdr 5e:54:01:6a:ee:80 REACHABLE - - $ ip -6 neigh show vrf red - 2002:1::64 dev eth1 lladdr a6:d9:c7:4f:06:23 REACHABLE - - -6. 
Show Addresses for a VRF - - To show addresses for interfaces associated with a VRF add the master - option to the ip command: - $ ip addr show vrf NAME - $ ip addr show master NAME - - For example: - $ ip addr show vrf red - 3: eth1: mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000 - link/ether 02:00:00:00:02:02 brd ff:ff:ff:ff:ff:ff - inet 10.2.1.2/24 brd 10.2.1.255 scope global eth1 - valid_lft forever preferred_lft forever - inet6 2002:1::2/120 scope global - valid_lft forever preferred_lft forever - inet6 fe80::ff:fe00:202/64 scope link - valid_lft forever preferred_lft forever - 4: eth2: mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000 - link/ether 02:00:00:00:02:03 brd ff:ff:ff:ff:ff:ff - inet 10.2.2.2/24 brd 10.2.2.255 scope global eth2 - valid_lft forever preferred_lft forever - inet6 2002:2::2/120 scope global - valid_lft forever preferred_lft forever - inet6 fe80::ff:fe00:203/64 scope link - valid_lft forever preferred_lft forever - 7: eth5: mtu 1500 qdisc noop master red state DOWN group default qlen 1000 - link/ether 02:00:00:00:02:06 brd ff:ff:ff:ff:ff:ff - - Or in brief format: - $ ip -br addr show vrf red - eth1 UP 10.2.1.2/24 2002:1::2/120 fe80::ff:fe00:202/64 - eth2 UP 10.2.2.2/24 2002:2::2/120 fe80::ff:fe00:203/64 - eth5 DOWN - - -7. 
Show Routes for a VRF - - To show routes for a VRF use the ip command to display the table associated - with the VRF device: - $ ip [-6] route show vrf NAME - $ ip [-6] route show table ID - - For example: - $ ip route show vrf red - unreachable default metric 4278198272 - broadcast 10.2.1.0 dev eth1 proto kernel scope link src 10.2.1.2 - 10.2.1.0/24 dev eth1 proto kernel scope link src 10.2.1.2 - local 10.2.1.2 dev eth1 proto kernel scope host src 10.2.1.2 - broadcast 10.2.1.255 dev eth1 proto kernel scope link src 10.2.1.2 - broadcast 10.2.2.0 dev eth2 proto kernel scope link src 10.2.2.2 - 10.2.2.0/24 dev eth2 proto kernel scope link src 10.2.2.2 - local 10.2.2.2 dev eth2 proto kernel scope host src 10.2.2.2 - broadcast 10.2.2.255 dev eth2 proto kernel scope link src 10.2.2.2 - - $ ip -6 route show vrf red - local 2002:1:: dev lo proto none metric 0 pref medium - local 2002:1::2 dev lo proto none metric 0 pref medium - 2002:1::/120 dev eth1 proto kernel metric 256 pref medium - local 2002:2:: dev lo proto none metric 0 pref medium - local 2002:2::2 dev lo proto none metric 0 pref medium - 2002:2::/120 dev eth2 proto kernel metric 256 pref medium - local fe80:: dev lo proto none metric 0 pref medium - local fe80:: dev lo proto none metric 0 pref medium - local fe80::ff:fe00:202 dev lo proto none metric 0 pref medium - local fe80::ff:fe00:203 dev lo proto none metric 0 pref medium - fe80::/64 dev eth1 proto kernel metric 256 pref medium - fe80::/64 dev eth2 proto kernel metric 256 pref medium - ff00::/8 dev red metric 256 pref medium - ff00::/8 dev eth1 metric 256 pref medium - ff00::/8 dev eth2 metric 256 pref medium - unreachable default dev lo metric 4278198272 error -101 pref medium - -8. 
Route Lookup for a VRF - - A test route lookup can be done for a VRF: - $ ip [-6] route get vrf NAME ADDRESS - $ ip [-6] route get oif NAME ADDRESS - - For example: - $ ip route get 10.2.1.40 vrf red - 10.2.1.40 dev eth1 table red src 10.2.1.2 - cache - - $ ip -6 route get 2002:1::32 vrf red - 2002:1::32 from :: dev eth1 table red proto kernel src 2002:1::2 metric 256 pref medium - - -9. Removing Network Interface from a VRF - - Network interfaces are removed from a VRF by breaking the enslavement to - the VRF device: - $ ip link set dev NAME nomaster - - Connected routes are moved back to the default table and local entries are - moved to the local table. - - For example: - $ ip link set dev eth0 nomaster - --------------------------------------------------------------------------------- - -Commands used in this example: - -cat >> /etc/iproute2/rt_tables.d/vrf.conf < M: Shrijeet Mukherjee L: netdev@vger.kernel.org S: Maintained -F: Documentation/networking/vrf.txt +F: Documentation/networking/vrf.rst F: drivers/net/vrf.c VSPRINTF -- cgit v1.2.3 From 0046db09d539523ef1470bcad2f2614cc3ef7ddf Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Fri, 1 May 2020 16:44:33 +0200 Subject: docs: networking: convert z8530drv.txt to ReST - add SPDX header; - use copyright symbol; - adjust titles and chapters, adding proper markups; - mark tables as such; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. 
Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/z8530drv.rst | 686 ++++++++++++++++++++++++++++++++++ Documentation/networking/z8530drv.txt | 657 -------------------------------- MAINTAINERS | 2 +- drivers/net/hamradio/Kconfig | 4 +- drivers/net/hamradio/scc.c | 2 +- 6 files changed, 691 insertions(+), 661 deletions(-) create mode 100644 Documentation/networking/z8530drv.rst delete mode 100644 Documentation/networking/z8530drv.txt (limited to 'MAINTAINERS') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 1630801cec19..f5733ca4fbcb 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -121,6 +121,7 @@ Contents: xfrm_proc xfrm_sync xfrm_sysctl + z8530drv .. only:: subproject and html diff --git a/Documentation/networking/z8530drv.rst b/Documentation/networking/z8530drv.rst new file mode 100644 index 000000000000..d2942760f167 --- /dev/null +++ b/Documentation/networking/z8530drv.rst @@ -0,0 +1,686 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: + +========================================================= +SCC.C - Linux driver for Z8530 based HDLC cards for AX.25 +========================================================= + + +This is a subset of the documentation. To use this driver you MUST have the +full package from: + +Internet: + + 1. ftp://ftp.ccac.rwth-aachen.de/pub/jr/z8530drv-utils_3.0-3.tar.gz + + 2. ftp://ftp.pspt.fi/pub/ham/linux/ax25/z8530drv-utils_3.0-3.tar.gz + +Please note that the information in this document may be hopelessly outdated. +A new version of the documentation, along with links to other important +Linux Kernel AX.25 documentation and programs, is available on +http://yaina.de/jreuter + +Copyright |copy| 1993,2000 by Joerg Reuter DL1BKE + +portions Copyright |copy| 1993 Guido ten Dolle PE1NNZ + +for the complete copyright notice see >> Copying.Z8530DRV << + +1. 
Initialization of the driver +=============================== + +To use the driver, 3 steps must be performed: + + 1. if compiled as module: loading the module + 2. Setup of hardware, MODEM and KISS parameters with sccinit + 3. Attach each channel to the Linux kernel AX.25 with "ifconfig" + +Unlike the versions below 2.4 this driver is a real network device +driver. If you want to run xNOS instead of our fine kernel AX.25 +use a 2.x version (available from above sites) or read the +AX.25-HOWTO on how to emulate a KISS TNC on network device drivers. + + +1.1 Loading the module +====================== + +(If you're going to compile the driver as a part of the kernel image, + skip this chapter and continue with 1.2) + +Before you can use a module, you'll have to load it with:: + + insmod scc.o + +please read 'man insmod' that comes with module-init-tools. + +You should include the insmod in one of the /etc/rc.d/rc.* files, +and don't forget to insert a call of sccinit after that. It +will read your /etc/z8530drv.conf. + +1.2. /etc/z8530drv.conf +======================= + +To setup all parameters you must run /sbin/sccinit from one +of your rc.*-files. This has to be done BEFORE you can +"ifconfig" an interface. Sccinit reads the file /etc/z8530drv.conf +and sets the hardware, MODEM and KISS parameters. A sample file is +delivered with this package. Change it to your needs. + +The file itself consists of two main sections. + +1.2.1 configuration of hardware parameters +========================================== + +The hardware setup section defines the following parameters for each +Z8530:: + + chip 1 + data_a 0x300 # data port A + ctrl_a 0x304 # control port A + data_b 0x301 # data port B + ctrl_b 0x305 # control port B + irq 5 # IRQ No. 5 + pclock 4915200 # clock + board BAYCOM # hardware type + escc no # enhanced SCC chip? 
(8580/85180/85280) + vector 0 # latch for interrupt vector + special no # address of special function register + option 0 # option to set via sfr + + +chip + - this is just a delimiter to make sccinit a bit simpler to + program. A parameter has no effect. + +data_a + - the address of the data port A of this Z8530 (needed) +ctrl_a + - the address of the control port A (needed) +data_b + - the address of the data port B (needed) +ctrl_b + - the address of the control port B (needed) + +irq + - the used IRQ for this chip. Different chips can use different + IRQs or the same. If they share an interrupt, it needs to be + specified within one chip-definition only. + +pclock - the clock at the PCLK pin of the Z8530 (option, 4915200 is + default), measured in Hertz + +board + - the "type" of the board: + + ======================= ======== + SCC type value + ======================= ======== + PA0HZP SCC card PA0HZP + EAGLE card EAGLE + PC100 card PC100 + PRIMUS-PC (DG9BL) card PRIMUS + BayCom (U)SCC card BAYCOM + ======================= ======== + +escc + - if you want support for ESCC chips (8580, 85180, 85280), set + this to "yes" (option, defaults to "no") + +vector + - address of the vector latch (aka "intack port") for PA0HZP + cards. There can be only one vector latch for all chips! + (option, defaults to 0) + +special + - address of the special function register on several cards. + (option, defaults to 0) + +option - The value you write into that register (option, default is 0) + +You can specify up to four chips (8 channels). If this is not enough, +just change:: + + #define MAXSCC 4 + +to a higher value. + +Example for the BAYCOM USCC: +---------------------------- + +:: + + chip 1 + data_a 0x300 # data port A + ctrl_a 0x304 # control port A + data_b 0x301 # data port B + ctrl_b 0x305 # control port B + irq 5 # IRQ No. 
5 (#) + board BAYCOM # hardware type (*) + # + # SCC chip 2 + # + chip 2 + data_a 0x302 + ctrl_a 0x306 + data_b 0x303 + ctrl_b 0x307 + board BAYCOM + +An example for a PA0HZP card: +----------------------------- + +:: + + chip 1 + data_a 0x153 + data_b 0x151 + ctrl_a 0x152 + ctrl_b 0x150 + irq 9 + pclock 4915200 + board PA0HZP + vector 0x168 + escc no + # + # + # + chip 2 + data_a 0x157 + data_b 0x155 + ctrl_a 0x156 + ctrl_b 0x154 + irq 9 + pclock 4915200 + board PA0HZP + vector 0x168 + escc no + +A DRSI should probably work with this: +-------------------------------------------- +(actually: two DRSI cards...) + +:: + + chip 1 + data_a 0x303 + data_b 0x301 + ctrl_a 0x302 + ctrl_b 0x300 + irq 7 + pclock 4915200 + board DRSI + escc no + # + # + # + chip 2 + data_a 0x313 + data_b 0x311 + ctrl_a 0x312 + ctrl_b 0x310 + irq 7 + pclock 4915200 + board DRSI + escc no + +Note that you cannot use the on-board baudrate generator of DRSI +cards. Use "clock dpll" for clock source (see below). + +This is based on information provided by Mike Bilow (and verified +by Paul Helay) + +The utility "gencfg" +-------------------- + +If you only know the parameters for the PE1CHL driver for DOS, +run gencfg. It will generate the correct port addresses (I hope). +Its parameters are exactly the same as the ones you use with +the "attach scc" command in net, except that the string "init" must +not appear. Example:: + + gencfg 2 0x150 4 2 0 1 0x168 9 4915200 + +will print a skeleton z8530drv.conf for the OptoSCC to stdout. + +:: + + gencfg 2 0x300 2 4 5 -4 0 7 4915200 0x10 + +does the same for the BAYCOM USCC card. In my opinion it is much easier +to edit scc_config.h...
+ + +1.2.2 channel configuration +=========================== + +The channel definition is divided into three sub sections for each +channel: + +An example for scc0:: + + # DEVICE + + device scc0 # the device for the following params + + # MODEM / BUFFERS + + speed 1200 # the default baudrate + clock dpll # clock source: + # dpll = normal half duplex operation + # external = MODEM provides own Rx/Tx clock + # divider = use full duplex divider if + # installed (1) + mode nrzi # HDLC encoding mode + # nrzi = 1k2 MODEM, G3RUH 9k6 MODEM + # nrz = DF9IC 9k6 MODEM + # + bufsize 384 # size of buffers. Note that this must include + # the AX.25 header, not only the data field! + # (optional, defaults to 384) + + # KISS (Layer 1) + + txdelay 36 # (see chapter 1.4) + persist 64 + slot 8 + tail 8 + fulldup 0 + wait 12 + min 3 + maxkey 7 + idle 3 + maxdef 120 + group 0 + txoff off + softdcd on + slip off + +The order WITHIN these sections is unimportant. The order OF these +sections IS important. The MODEM parameters are set with the first +recognized KISS parameter... + +Please note that you can initialize the board only once after boot +(or insmod). You can change all parameters but "mode" and "clock" +later with the Sccparam program or through KISS. Just to avoid +security holes... + +(1) this divider is usually mounted on the SCC-PBC (PA0HZP) or not + present at all (BayCom). It feeds back the output of the DPLL + (digital pll) as transmit clock. Using this mode without a divider + installed will normally result in keying the transceiver until + maxkey expires --- of course without sending anything (useful). + +2. Attachment of a channel by your AX.25 software +================================================= + +2.1 Kernel AX.25 +================ + +To set up an AX.25 device you can simply type:: + + ifconfig scc0 44.128.1.1 hw ax25 dl0tha-7 + +This will create a network interface with the IP number 44.128.1.1 +and the callsign "dl0tha-7".
If you do not have any IP number (yet) you +can use any of the 44.128.0.0 network. Note that you do not need +axattach. The purpose of axattach (like slattach) is to create a KISS +network device linked to a TTY. Please read the documentation of the +ax25-utils and the AX.25-HOWTO to learn how to set the parameters of +the kernel AX.25. + +2.2 NOS, NET and TFKISS +======================= + +Since the TTY driver (aka KISS TNC emulation) is gone you need +to emulate the old behaviour. The cost of using these programs is +that you probably need to compile the kernel AX.25, regardless of whether +you actually use it or not. First setup your /etc/ax25/axports, +for example:: + + 9k6 dl0tha-9 9600 255 4 9600 baud port (scc3) + axlink dl0tha-15 38400 255 4 Link to NOS + +Now "ifconfig" the scc device:: + + ifconfig scc3 44.128.1.1 hw ax25 dl0tha-9 + +You can now axattach a pseudo-TTY:: + + axattach /dev/ptys0 axlink + +and start your NOS and attach /dev/ptys0 there. The problem is that +NOS is reachable only via digipeating through the kernel AX.25 +(disastrous on a DAMA controlled channel). To solve this problem, +configure "rxecho" to echo the incoming frames from "9k6" to "axlink" +and outgoing frames from "axlink" to "9k6" and start:: + + rxecho + +Or simply use "kissbridge" coming with z8530drv-utils:: + + ifconfig scc3 hw ax25 dl0tha-9 + kissbridge scc3 /dev/ptys0 + + +3. 
Adjustment and Display of parameters +======================================= + +3.1 Displaying SCC Parameters: +============================== + +Once an SCC channel has been attached, the parameter settings and +some statistical information can be shown using the sccstat program:: + + dl1bke-u:~$ sccstat scc0 + + Parameters: + + speed : 1200 baud + txdelay : 36 + persist : 255 + slottime : 0 + txtail : 8 + fulldup : 1 + waittime : 12 + mintime : 3 sec + maxkeyup : 7 sec + idletime : 3 sec + maxdefer : 120 sec + group : 0x00 + txoff : off + softdcd : on + SLIP : off + + Status: + + HDLC Z8530 Interrupts Buffers + ----------------------------------------------------------------------- + Sent : 273 RxOver : 0 RxInts : 125074 Size : 384 + Received : 1095 TxUnder: 0 TxInts : 4684 NoSpace : 0 + RxErrors : 1591 ExInts : 11776 + TxErrors : 0 SpInts : 1503 + Tx State : idle + + +The status info shown is: + +============== ============================================================== +Sent number of frames transmitted +Received number of frames received +RxErrors number of receive errors (CRC, ABORT) +TxErrors number of discarded Tx frames (due to various reasons) +Tx State status of the Tx interrupt handler: idle/busy/active/tail (2) +RxOver number of receiver overruns +TxUnder number of transmitter underruns +RxInts number of receiver interrupts +TxInts number of transmitter interrupts +ExInts number of external/status interrupts +SpInts number of receiver special condition interrupts +Size maximum size of an AX.25 frame (*with* AX.25 headers!) +NoSpace number of times a buffer could not get allocated +============== ============================================================== + +An overrun is abnormal. If lots of these occur, the product of +baudrate and number of interfaces is too high for the processing +power of your computer. NoSpace errors are unlikely to be caused by the +driver or the kernel AX.25.
+ + +3.2 Setting Parameters +====================== + + +The setting of parameters of the emulated KISS TNC is done in the +same way in the SCC driver. You can change parameters by using +the kissparms program from the ax25-utils package or use the program +"sccparam":: + + sccparam + +You can change the following parameters: + +=========== ===== +param value +=========== ===== +speed 1200 +txdelay 36 +persist 255 +slottime 0 +txtail 8 +fulldup 1 +waittime 12 +mintime 3 +maxkeyup 7 +idletime 3 +maxdefer 120 +group 0x00 +txoff off +softdcd on +SLIP off +=========== ===== + + +The parameters have the following meaning: + +speed: + The baudrate on this channel in bits/sec + + Example: sccparam /dev/scc3 speed 9600 + +txdelay: + The delay (in units of 10 ms) after keying of the + transmitter, until the first byte is sent. This is usually + called "TXDELAY" in a TNC. When 0 is specified, the driver + will just wait until the CTS signal is asserted. This + assumes the presence of a timer or other circuitry in the + MODEM and/or transmitter, that asserts CTS when the + transmitter is ready for data. + A normal value of this parameter is 30-36. + + Example: sccparam /dev/scc0 txd 20 + +persist: + This is the probability that the transmitter will be keyed + when the channel is found to be free. It is a value from 0 + to 255, and the probability is (value+1)/256. The value + should be somewhere near 50-60, and should be lowered when + the channel is used more heavily. + + Example: sccparam /dev/scc2 persist 20 + +slottime: + This is the time between samples of the channel. It is + expressed in units of 10 ms. About 200-300 ms (value 20-30) + seems to be a good value. + + Example: sccparam /dev/scc0 slot 20 + +tail: + The time the transmitter will remain keyed after the last + byte of a packet has been transferred to the SCC. This is + necessary because the CRC and a flag still have to leave the + SCC before the transmitter is keyed down. 
The value depends + on the baudrate selected. A few character times should be + sufficient, e.g. 40 ms at 1200 baud. (value 4) + The value of this parameter is in 10 ms units. + + Example: sccparam /dev/scc2 tail 4 + +full: + The full-duplex mode switch. This can be one of the following + values: + + 0: The interface will operate in CSMA mode (the normal + half-duplex packet radio operation) + 1: Fullduplex mode, i.e. the transmitter will be keyed at + any time, without checking the received carrier. It + will be unkeyed when there are no packets to be sent. + 2: Like 1, but the transmitter will remain keyed, also + when there are no packets to be sent. Flags will be + sent in that case, until a timeout (parameter 10) + occurs. + + Example: sccparam /dev/scc0 fulldup off + +wait: + The initial waittime before any transmit attempt, after the + frame has been queued for transmit. This is the length of + the first slot in CSMA mode. In full duplex modes it is + set to 0 for maximum performance. + The value of this parameter is in 10 ms units. + + Example: sccparam /dev/scc1 wait 4 + +maxkey: + The maximal time the transmitter will be keyed to send + packets, in seconds. This can be useful on busy CSMA + channels, to avoid "getting a bad reputation" when you are + generating a lot of traffic. After the specified time has + elapsed, no new frame will be started. Instead, the + transmitter will be switched off for a specified time (parameter + min), and then the selected algorithm for keyup will be + started again. + The value 0 as well as "off" will disable this feature, + and allow infinite transmission time. + + Example: sccparam /dev/scc0 maxk 20 + +min: + This is the time the transmitter will be switched off when + the maximum transmission time is exceeded. + + Example: sccparam /dev/scc3 min 10 + +idle: + This parameter specifies the maximum idle time in full duplex + 2 mode, in seconds. When no frames have been sent for this + time, the transmitter will be keyed down.
A value of 0 + has the same result as the fullduplex mode 1. This parameter + can be disabled. + + Example: sccparam /dev/scc2 idle off # transmit forever + +maxdefer: + This is the maximum time (in seconds) to wait for a free channel + to send. When this timer expires the transmitter will be keyed + IMMEDIATELY. If you love to get into trouble with other users you + should set this to a very low value ;-) + + Example: sccparam /dev/scc0 maxdefer 240 # 4 minutes + + +txoff: + When this parameter has the value 0, the transmission of packets + is enabled. Otherwise it is disabled. + + Example: sccparam /dev/scc2 txoff on + +group: + It is possible to build special radio equipment to use more than + one frequency on the same band, e.g. using several receivers and + only one transmitter that can be switched between frequencies. + Also, you can connect several radios that are active on the same + band. In these cases, it is not possible, or not a good idea, to + transmit on more than one frequency. The SCC driver provides a + method to lock transmitters on different interfaces, using the + "param group " command. This will only work when + you are using CSMA mode (parameter full = 0). + + The number must be 0 if you want no group restrictions, and + can be computed as follows to create restricted groups: + is the sum of some OCTAL numbers: + + + === ======================================================= + 200 This transmitter will only be keyed when all other + transmitters in the group are off. + 100 This transmitter will only be keyed when the carrier + detect of all other interfaces in the group is off. + 0xx A byte that can be used to define different groups. + Interfaces are in the same group, when the logical AND + between their xx values is nonzero. + === ======================================================= + + Examples: + + When 2 interfaces use group 201, their transmitters will never be + keyed at the same time.
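The group arithmetic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the constant and function names are hypothetical, and the mask `0o077` for the "0xx" part is an assumption (two octal digits), not something taken from the driver source.

```python
TX_EXCLUSIVE = 0o200   # key only when all other transmitters in the group are off
DCD_EXCLUSIVE = 0o100  # key only when carrier detect of the other members is off
GROUP_MASK = 0o077     # assumed width of the "0xx" group-selection part


def group_value(octal_text):
    """Convert a group number written in octal into the decimal value."""
    return int(octal_text, 8)


def same_group(a, b):
    """Interfaces share a group when the AND of their xx parts is nonzero."""
    return ((a & GROUP_MASK) & (b & GROUP_MASK)) != 0


for text in ("201", "101", "301"):
    print(text, "octal =", group_value(text), "decimal")
```

With these definitions, octal 201, 101 and 301 come out as 129, 65 and 193 in decimal.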
+ + When 2 interfaces use group 101, the transmitters will only key + when both channels are clear at the same time. With group 301, + the transmitters will not be keyed at the same time. + + Don't forget to convert the octal numbers into decimal before + you set the parameter. + + Example: (to be written) + +softdcd: + use a software dcd instead of the real one... Useful for a very + slow squelch. + + Example: sccparam /dev/scc0 soft on + + +4. Problems +=========== + +If you have tx-problems with your BayCom USCC card please check +the manufacturer of the 8530. SGS chips have a slightly +different timing. Try Zilog... A solution is to write to register 8 +instead of to the data port, but this won't work with the ESCC chips. +*SIGH!* + +A very common problem is that the PTT locks until the maxkeyup timer +expires, although interrupts and clock source are correct. In most +cases compiling the driver with CONFIG_SCC_DELAY (set with +make config) solves the problems. For more hints read the (pseudo) FAQ +and the documentation coming with z8530drv-utils. + +I got reports that the driver has problems on some 386-based systems. +(e.g. Amstrad) Those systems have a bogus AT bus timing which will +lead to delayed answers on interrupts. You can recognize these +problems by looking at the output of Sccstat for the suspected +port. If it shows under- and overruns you own such a system. + +Delayed processing of received data: This depends on + +- the kernel version + +- kernel profiling compiled or not + +- a high interrupt load + +- a high load of the machine --- running X, Xmorph, XV and Povray, + while compiling the kernel... hmm ... even with 32 MB RAM ... ;-) + Or running a named for the whole .ampr.org domain on an 8 MB + box... + +- using information from rxecho or kissbridge. + +Kernel panics: please read /linux/README and find out if it +really occurred within the scc driver.
+ +If you cannot solve a problem, send me + +- a description of the problem, +- information on your hardware (computer system, scc board, modem) +- your kernel version +- the output of cat /proc/net/z8530 + +4. Thor RLC100 +============== + +Mysteriously this board seems not to work with the driver. Anyone +got it up-and-running? + + +Many thanks to Linus Torvalds and Alan Cox for including the driver +in the Linux standard distribution and their support. + +:: + + Joerg Reuter ampr-net: dl1bke@db0pra.ampr.org + AX-25 : DL1BKE @ DB0ABH.#BAY.DEU.EU + Internet: jreuter@yaina.de + WWW : http://yaina.de/jreuter diff --git a/Documentation/networking/z8530drv.txt b/Documentation/networking/z8530drv.txt deleted file mode 100644 index 2206abbc3e1b..000000000000 --- a/Documentation/networking/z8530drv.txt +++ /dev/null @@ -1,657 +0,0 @@ -This is a subset of the documentation. To use this driver you MUST have the -full package from: - -Internet: -========= - -1. ftp://ftp.ccac.rwth-aachen.de/pub/jr/z8530drv-utils_3.0-3.tar.gz - -2. ftp://ftp.pspt.fi/pub/ham/linux/ax25/z8530drv-utils_3.0-3.tar.gz - -Please note that the information in this document may be hopelessly outdated. -A new version of the documentation, along with links to other important -Linux Kernel AX.25 documentation and programs, is available on -http://yaina.de/jreuter - ------------------------------------------------------------------------------ - - - SCC.C - Linux driver for Z8530 based HDLC cards for AX.25 - - ******************************************************************** - - (c) 1993,2000 by Joerg Reuter DL1BKE - - portions (c) 1993 Guido ten Dolle PE1NNZ - - for the complete copyright notice see >> Copying.Z8530DRV << - - ******************************************************************** - - -1. Initialization of the driver -=============================== - -To use the driver, 3 steps must be performed: - - 1. if compiled as module: loading the module - 2. 
Setup of hardware, MODEM and KISS parameters with sccinit - 3. Attach each channel to the Linux kernel AX.25 with "ifconfig" - -Unlike the versions below 2.4 this driver is a real network device -driver. If you want to run xNOS instead of our fine kernel AX.25 -use a 2.x version (available from above sites) or read the -AX.25-HOWTO on how to emulate a KISS TNC on network device drivers. - - -1.1 Loading the module -====================== - -(If you're going to compile the driver as a part of the kernel image, - skip this chapter and continue with 1.2) - -Before you can use a module, you'll have to load it with - - insmod scc.o - -please read 'man insmod' that comes with module-init-tools. - -You should include the insmod in one of the /etc/rc.d/rc.* files, -and don't forget to insert a call of sccinit after that. It -will read your /etc/z8530drv.conf. - -1.2. /etc/z8530drv.conf -======================= - -To setup all parameters you must run /sbin/sccinit from one -of your rc.*-files. This has to be done BEFORE you can -"ifconfig" an interface. Sccinit reads the file /etc/z8530drv.conf -and sets the hardware, MODEM and KISS parameters. A sample file is -delivered with this package. Change it to your needs. - -The file itself consists of two main sections. - -1.2.1 configuration of hardware parameters -========================================== - -The hardware setup section defines the following parameters for each -Z8530: - -chip 1 -data_a 0x300 # data port A -ctrl_a 0x304 # control port A -data_b 0x301 # data port B -ctrl_b 0x305 # control port B -irq 5 # IRQ No. 5 -pclock 4915200 # clock -board BAYCOM # hardware type -escc no # enhanced SCC chip? (8580/85180/85280) -vector 0 # latch for interrupt vector -special no # address of special function register -option 0 # option to set via sfr - - -chip - this is just a delimiter to make sccinit a bit simpler to - program. A parameter has no effect. 
- -data_a - the address of the data port A of this Z8530 (needed) -ctrl_a - the address of the control port A (needed) -data_b - the address of the data port B (needed) -ctrl_b - the address of the control port B (needed) - -irq - the used IRQ for this chip. Different chips can use different - IRQs or the same. If they share an interrupt, it needs to be - specified within one chip-definition only. - -pclock - the clock at the PCLK pin of the Z8530 (option, 4915200 is - default), measured in Hertz - -board - the "type" of the board: - - SCC type value - --------------------------------- - PA0HZP SCC card PA0HZP - EAGLE card EAGLE - PC100 card PC100 - PRIMUS-PC (DG9BL) card PRIMUS - BayCom (U)SCC card BAYCOM - -escc - if you want support for ESCC chips (8580, 85180, 85280), set - this to "yes" (option, defaults to "no") - -vector - address of the vector latch (aka "intack port") for PA0HZP - cards. There can be only one vector latch for all chips! - (option, defaults to 0) - -special - address of the special function register on several cards. - (option, defaults to 0) - -option - The value you write into that register (option, default is 0) - -You can specify up to four chips (8 channels). If this is not enough, -just change - - #define MAXSCC 4 - -to a higher value. - -Example for the BAYCOM USCC: ----------------------------- - -chip 1 -data_a 0x300 # data port A -ctrl_a 0x304 # control port A -data_b 0x301 # data port B -ctrl_b 0x305 # control port B -irq 5 # IRQ No. 
5 (#) -board BAYCOM # hardware type (*) -# -# SCC chip 2 -# -chip 2 -data_a 0x302 -ctrl_a 0x306 -data_b 0x303 -ctrl_b 0x307 -board BAYCOM - -An example for a PA0HZP card: ------------------------------ - -chip 1 -data_a 0x153 -data_b 0x151 -ctrl_a 0x152 -ctrl_b 0x150 -irq 9 -pclock 4915200 -board PA0HZP -vector 0x168 -escc no -# -# -# -chip 2 -data_a 0x157 -data_b 0x155 -ctrl_a 0x156 -ctrl_b 0x154 -irq 9 -pclock 4915200 -board PA0HZP -vector 0x168 -escc no - -A DRSI would should probably work with this: --------------------------------------------- -(actually: two DRSI cards...) - -chip 1 -data_a 0x303 -data_b 0x301 -ctrl_a 0x302 -ctrl_b 0x300 -irq 7 -pclock 4915200 -board DRSI -escc no -# -# -# -chip 2 -data_a 0x313 -data_b 0x311 -ctrl_a 0x312 -ctrl_b 0x310 -irq 7 -pclock 4915200 -board DRSI -escc no - -Note that you cannot use the on-board baudrate generator off DRSI -cards. Use "mode dpll" for clock source (see below). - -This is based on information provided by Mike Bilow (and verified -by Paul Helay) - -The utility "gencfg" --------------------- - -If you only know the parameters for the PE1CHL driver for DOS, -run gencfg. It will generate the correct port addresses (I hope). -Its parameters are exactly the same as the ones you use with -the "attach scc" command in net, except that the string "init" must -not appear. Example: - -gencfg 2 0x150 4 2 0 1 0x168 9 4915200 - -will print a skeleton z8530drv.conf for the OptoSCC to stdout. - -gencfg 2 0x300 2 4 5 -4 0 7 4915200 0x10 - -does the same for the BAYCOM USCC card. In my opinion it is much easier -to edit scc_config.h... 
- - -1.2.2 channel configuration -=========================== - -The channel definition is divided into three sub sections for each -channel: - -An example for scc0: - -# DEVICE - -device scc0 # the device for the following params - -# MODEM / BUFFERS - -speed 1200 # the default baudrate -clock dpll # clock source: - # dpll = normal half duplex operation - # external = MODEM provides own Rx/Tx clock - # divider = use full duplex divider if - # installed (1) -mode nrzi # HDLC encoding mode - # nrzi = 1k2 MODEM, G3RUH 9k6 MODEM - # nrz = DF9IC 9k6 MODEM - # -bufsize 384 # size of buffers. Note that this must include - # the AX.25 header, not only the data field! - # (optional, defaults to 384) - -# KISS (Layer 1) - -txdelay 36 # (see chapter 1.4) -persist 64 -slot 8 -tail 8 -fulldup 0 -wait 12 -min 3 -maxkey 7 -idle 3 -maxdef 120 -group 0 -txoff off -softdcd on -slip off - -The order WITHIN these sections is unimportant. The order OF these -sections IS important. The MODEM parameters are set with the first -recognized KISS parameter... - -Please note that you can initialize the board only once after boot -(or insmod). You can change all parameters but "mode" and "clock" -later with the Sccparam program or through KISS. Just to avoid -security holes... - -(1) this divider is usually mounted on the SCC-PBC (PA0HZP) or not - present at all (BayCom). It feeds back the output of the DPLL - (digital pll) as transmit clock. Using this mode without a divider - installed will normally result in keying the transceiver until - maxkey expires --- of course without sending anything (useful). - -2. Attachment of a channel by your AX.25 software -================================================= - -2.1 Kernel AX.25 -================ - -To set up an AX.25 device you can simply type: - - ifconfig scc0 44.128.1.1 hw ax25 dl0tha-7 - -This will create a network interface with the IP number 44.128.20.107 -and the callsign "dl0tha". 
If you do not have any IP number (yet) you -can use any of the 44.128.0.0 network. Note that you do not need -axattach. The purpose of axattach (like slattach) is to create a KISS -network device linked to a TTY. Please read the documentation of the -ax25-utils and the AX.25-HOWTO to learn how to set the parameters of -the kernel AX.25. - -2.2 NOS, NET and TFKISS -======================= - -Since the TTY driver (aka KISS TNC emulation) is gone you need -to emulate the old behaviour. The cost of using these programs is -that you probably need to compile the kernel AX.25, regardless of whether -you actually use it or not. First setup your /etc/ax25/axports, -for example: - - 9k6 dl0tha-9 9600 255 4 9600 baud port (scc3) - axlink dl0tha-15 38400 255 4 Link to NOS - -Now "ifconfig" the scc device: - - ifconfig scc3 44.128.1.1 hw ax25 dl0tha-9 - -You can now axattach a pseudo-TTY: - - axattach /dev/ptys0 axlink - -and start your NOS and attach /dev/ptys0 there. The problem is that -NOS is reachable only via digipeating through the kernel AX.25 -(disastrous on a DAMA controlled channel). To solve this problem, -configure "rxecho" to echo the incoming frames from "9k6" to "axlink" -and outgoing frames from "axlink" to "9k6" and start: - - rxecho - -Or simply use "kissbridge" coming with z8530drv-utils: - - ifconfig scc3 hw ax25 dl0tha-9 - kissbridge scc3 /dev/ptys0 - - -3. 
Adjustment and Display of parameters -======================================= - -3.1 Displaying SCC Parameters: -============================== - -Once a SCC channel has been attached, the parameter settings and -some statistic information can be shown using the param program: - -dl1bke-u:~$ sccstat scc0 - -Parameters: - -speed : 1200 baud -txdelay : 36 -persist : 255 -slottime : 0 -txtail : 8 -fulldup : 1 -waittime : 12 -mintime : 3 sec -maxkeyup : 7 sec -idletime : 3 sec -maxdefer : 120 sec -group : 0x00 -txoff : off -softdcd : on -SLIP : off - -Status: - -HDLC Z8530 Interrupts Buffers ------------------------------------------------------------------------ -Sent : 273 RxOver : 0 RxInts : 125074 Size : 384 -Received : 1095 TxUnder: 0 TxInts : 4684 NoSpace : 0 -RxErrors : 1591 ExInts : 11776 -TxErrors : 0 SpInts : 1503 -Tx State : idle - - -The status info shown is: - -Sent - number of frames transmitted -Received - number of frames received -RxErrors - number of receive errors (CRC, ABORT) -TxErrors - number of discarded Tx frames (due to various reasons) -Tx State - status of the Tx interrupt handler: idle/busy/active/tail (2) -RxOver - number of receiver overruns -TxUnder - number of transmitter underruns -RxInts - number of receiver interrupts -TxInts - number of transmitter interrupts -EpInts - number of receiver special condition interrupts -SpInts - number of external/status interrupts -Size - maximum size of an AX.25 frame (*with* AX.25 headers!) -NoSpace - number of times a buffer could not get allocated - -An overrun is abnormal. If lots of these occur, the product of -baudrate and number of interfaces is too high for the processing -power of your computer. NoSpace errors are unlikely to be caused by the -driver or the kernel AX.25. - - -3.2 Setting Parameters -====================== - - -The setting of parameters of the emulated KISS TNC is done in the -same way in the SCC driver. 
You can change parameters by using -the kissparms program from the ax25-utils package or use the program -"sccparam": - - sccparam - -You can change the following parameters: - -param : value ------------------------- -speed : 1200 -txdelay : 36 -persist : 255 -slottime : 0 -txtail : 8 -fulldup : 1 -waittime : 12 -mintime : 3 -maxkeyup : 7 -idletime : 3 -maxdefer : 120 -group : 0x00 -txoff : off -softdcd : on -SLIP : off - - -The parameters have the following meaning: - -speed: - The baudrate on this channel in bits/sec - - Example: sccparam /dev/scc3 speed 9600 - -txdelay: - The delay (in units of 10 ms) after keying of the - transmitter, until the first byte is sent. This is usually - called "TXDELAY" in a TNC. When 0 is specified, the driver - will just wait until the CTS signal is asserted. This - assumes the presence of a timer or other circuitry in the - MODEM and/or transmitter, that asserts CTS when the - transmitter is ready for data. - A normal value of this parameter is 30-36. - - Example: sccparam /dev/scc0 txd 20 - -persist: - This is the probability that the transmitter will be keyed - when the channel is found to be free. It is a value from 0 - to 255, and the probability is (value+1)/256. The value - should be somewhere near 50-60, and should be lowered when - the channel is used more heavily. - - Example: sccparam /dev/scc2 persist 20 - -slottime: - This is the time between samples of the channel. It is - expressed in units of 10 ms. About 200-300 ms (value 20-30) - seems to be a good value. - - Example: sccparam /dev/scc0 slot 20 - -tail: - The time the transmitter will remain keyed after the last - byte of a packet has been transferred to the SCC. This is - necessary because the CRC and a flag still have to leave the - SCC before the transmitter is keyed down. The value depends - on the baudrate selected. A few character times should be - sufficient, e.g. 40ms at 1200 baud. (value 4) - The value of this parameter is in 10 ms units. 
- - Example: sccparam /dev/scc2 4 - -full: - The full-duplex mode switch. This can be one of the following - values: - - 0: The interface will operate in CSMA mode (the normal - half-duplex packet radio operation) - 1: Fullduplex mode, i.e. the transmitter will be keyed at - any time, without checking the received carrier. It - will be unkeyed when there are no packets to be sent. - 2: Like 1, but the transmitter will remain keyed, also - when there are no packets to be sent. Flags will be - sent in that case, until a timeout (parameter 10) - occurs. - - Example: sccparam /dev/scc0 fulldup off - -wait: - The initial waittime before any transmit attempt, after the - frame has been queue for transmit. This is the length of - the first slot in CSMA mode. In full duplex modes it is - set to 0 for maximum performance. - The value of this parameter is in 10 ms units. - - Example: sccparam /dev/scc1 wait 4 - -maxkey: - The maximal time the transmitter will be keyed to send - packets, in seconds. This can be useful on busy CSMA - channels, to avoid "getting a bad reputation" when you are - generating a lot of traffic. After the specified time has - elapsed, no new frame will be started. Instead, the trans- - mitter will be switched off for a specified time (parameter - min), and then the selected algorithm for keyup will be - started again. - The value 0 as well as "off" will disable this feature, - and allow infinite transmission time. - - Example: sccparam /dev/scc0 maxk 20 - -min: - This is the time the transmitter will be switched off when - the maximum transmission time is exceeded. - - Example: sccparam /dev/scc3 min 10 - -idle - This parameter specifies the maximum idle time in full duplex - 2 mode, in seconds. When no frames have been sent for this - time, the transmitter will be keyed down. A value of 0 is - has same result as the fullduplex mode 1. This parameter - can be disabled. 
- - Example: sccparam /dev/scc2 idle off # transmit forever - -maxdefer - This is the maximum time (in seconds) to wait for a free channel - to send. When this timer expires the transmitter will be keyed - IMMEDIATELY. If you love to get trouble with other users you - should set this to a very low value ;-) - - Example: sccparam /dev/scc0 maxdefer 240 # 2 minutes - - -txoff: - When this parameter has the value 0, the transmission of packets - is enable. Otherwise it is disabled. - - Example: sccparam /dev/scc2 txoff on - -group: - It is possible to build special radio equipment to use more than - one frequency on the same band, e.g. using several receivers and - only one transmitter that can be switched between frequencies. - Also, you can connect several radios that are active on the same - band. In these cases, it is not possible, or not a good idea, to - transmit on more than one frequency. The SCC driver provides a - method to lock transmitters on different interfaces, using the - "param group " command. This will only work when - you are using CSMA mode (parameter full = 0). - The number must be 0 if you want no group restrictions, and - can be computed as follows to create restricted groups: - is the sum of some OCTAL numbers: - - 200 This transmitter will only be keyed when all other - transmitters in the group are off. - 100 This transmitter will only be keyed when the carrier - detect of all other interfaces in the group is off. - 0xx A byte that can be used to define different groups. - Interfaces are in the same group, when the logical AND - between their xx values is nonzero. - - Examples: - When 2 interfaces use group 201, their transmitters will never be - keyed at the same time. - When 2 interfaces use group 101, the transmitters will only key - when both channels are clear at the same time. When group 301, - the transmitters will not be keyed at the same time. - - Don't forget to convert the octal numbers into decimal before - you set the parameter. 
- - Example: (to be written) - -softdcd: - use a software dcd instead of the real one... Useful for a very - slow squelch. - - Example: sccparam /dev/scc0 soft on - - -4. Problems -=========== - -If you have tx-problems with your BayCom USCC card please check -the manufacturer of the 8530. SGS chips have a slightly -different timing. Try Zilog... A solution is to write to register 8 -instead to the data port, but this won't work with the ESCC chips. -*SIGH!* - -A very common problem is that the PTT locks until the maxkeyup timer -expires, although interrupts and clock source are correct. In most -cases compiling the driver with CONFIG_SCC_DELAY (set with -make config) solves the problems. For more hints read the (pseudo) FAQ -and the documentation coming with z8530drv-utils. - -I got reports that the driver has problems on some 386-based systems. -(i.e. Amstrad) Those systems have a bogus AT bus timing which will -lead to delayed answers on interrupts. You can recognize these -problems by looking at the output of Sccstat for the suspected -port. If it shows under- and overruns you own such a system. - -Delayed processing of received data: This depends on - -- the kernel version - -- kernel profiling compiled or not - -- a high interrupt load - -- a high load of the machine --- running X, Xmorph, XV and Povray, - while compiling the kernel... hmm ... even with 32 MB RAM ... ;-) - Or running a named for the whole .ampr.org domain on an 8 MB - box... - -- using information from rxecho or kissbridge. - -Kernel panics: please read /linux/README and find out if it -really occurred within the scc driver. - -If you cannot solve a problem, send me - -- a description of the problem, -- information on your hardware (computer system, scc board, modem) -- your kernel version -- the output of cat /proc/net/z8530 - -4. Thor RLC100 -============== - -Mysteriously this board seems not to work with the driver. Anyone -got it up-and-running? 
- - -Many thanks to Linus Torvalds and Alan Cox for including the driver -in the Linux standard distribution and their support. - -Joerg Reuter ampr-net: dl1bke@db0pra.ampr.org - AX-25 : DL1BKE @ DB0ABH.#BAY.DEU.EU - Internet: jreuter@yaina.de - WWW : http://yaina.de/jreuter diff --git a/MAINTAINERS b/MAINTAINERS index d59455c27c42..bee65ebdc67e 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -18644,7 +18644,7 @@ L: linux-hams@vger.kernel.org S: Maintained W: http://yaina.de/jreuter/ W: http://www.qsl.net/dl1bke/ -F: Documentation/networking/z8530drv.txt +F: Documentation/networking/z8530drv.rst F: drivers/net/hamradio/*scc.c F: drivers/net/hamradio/z8530.h diff --git a/drivers/net/hamradio/Kconfig b/drivers/net/hamradio/Kconfig index fe409819b56d..f4500f04147d 100644 --- a/drivers/net/hamradio/Kconfig +++ b/drivers/net/hamradio/Kconfig @@ -84,7 +84,7 @@ config SCC ---help--- These cards are used to connect your Linux box to an amateur radio in order to communicate with other computers. If you want to use - this, read and the + this, read and the AX25-HOWTO, available from . Also make sure to say Y to "Amateur Radio AX.25 Level 2" support. @@ -98,7 +98,7 @@ config SCC_DELAY help Say Y here if you experience problems with the SCC driver not working properly; please read - for details. + for details. If unsure, say N. diff --git a/drivers/net/hamradio/scc.c b/drivers/net/hamradio/scc.c index 6c03932d8a6b..33fdd55c6122 100644 --- a/drivers/net/hamradio/scc.c +++ b/drivers/net/hamradio/scc.c @@ -7,7 +7,7 @@ * ------------------ * * You can find a subset of the documentation in - * Documentation/networking/z8530drv.txt. + * Documentation/networking/z8530drv.rst. 
*/ /* -- cgit v1.2.3 From 9ea2af8d16f5612168ed52cb0ec6752bac0877a9 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Fri, 1 May 2020 16:44:35 +0200 Subject: docs: networking: device drivers: convert 3com/vortex.txt to ReST - add SPDX header; - add a document title; - mark code blocks and literals as such; - mark tables as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- .../networking/device_drivers/3com/vortex.rst | 461 +++++++++++++++++++++ .../networking/device_drivers/3com/vortex.txt | 448 -------------------- Documentation/networking/device_drivers/index.rst | 1 + MAINTAINERS | 2 +- drivers/net/ethernet/3com/3c59x.c | 4 +- drivers/net/ethernet/3com/Kconfig | 2 +- 6 files changed, 466 insertions(+), 452 deletions(-) create mode 100644 Documentation/networking/device_drivers/3com/vortex.rst delete mode 100644 Documentation/networking/device_drivers/3com/vortex.txt (limited to 'MAINTAINERS') diff --git a/Documentation/networking/device_drivers/3com/vortex.rst b/Documentation/networking/device_drivers/3com/vortex.rst new file mode 100644 index 000000000000..800add5be338 --- /dev/null +++ b/Documentation/networking/device_drivers/3com/vortex.rst @@ -0,0 +1,461 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================= +3Com Vortex device driver +========================= + +Documentation/networking/device_drivers/3com/vortex.rst + +Andrew Morton + +30 April 2000 + + +This document describes the usage and errata of the 3Com "Vortex" device +driver for Linux, 3c59x.c. + +The driver was written by Donald Becker + +Don is no longer the prime maintainer of this version of the driver. +Please report problems to one or more of: + +- Andrew Morton +- Netdev mailing list +- Linux kernel mailing list + +Please note the 'Reporting and Diagnosing Problems' section at the end +of this file. 
+ + +Since kernel 2.3.99-pre6, this driver incorporates the support for the +3c575-series Cardbus cards which used to be handled by 3c575_cb.c. + +This driver supports the following hardware: + + - 3c590 Vortex 10Mbps + - 3c592 EISA 10Mbps Demon/Vortex + - 3c597 EISA Fast Demon/Vortex + - 3c595 Vortex 100baseTx + - 3c595 Vortex 100baseT4 + - 3c595 Vortex 100base-MII + - 3c900 Boomerang 10baseT + - 3c900 Boomerang 10Mbps Combo + - 3c900 Cyclone 10Mbps TPO + - 3c900 Cyclone 10Mbps Combo + - 3c900 Cyclone 10Mbps TPC + - 3c900B-FL Cyclone 10base-FL + - 3c905 Boomerang 100baseTx + - 3c905 Boomerang 100baseT4 + - 3c905B Cyclone 100baseTx + - 3c905B Cyclone 10/100/BNC + - 3c905B-FX Cyclone 100baseFx + - 3c905C Tornado + - 3c920B-EMB-WNM (ATI Radeon 9100 IGP) + - 3c980 Cyclone + - 3c980C Python-T + - 3cSOHO100-TX Hurricane + - 3c555 Laptop Hurricane + - 3c556 Laptop Tornado + - 3c556B Laptop Hurricane + - 3c575 [Megahertz] 10/100 LAN CardBus + - 3c575 Boomerang CardBus + - 3CCFE575BT Cyclone CardBus + - 3CCFE575CT Tornado CardBus + - 3CCFE656 Cyclone CardBus + - 3CCFEM656B Cyclone+Winmodem CardBus + - 3CXFEM656C Tornado+Winmodem CardBus + - 3c450 HomePNA Tornado + - 3c920 Tornado + - 3c982 Hydra Dual Port A + - 3c982 Hydra Dual Port B + - 3c905B-T4 + - 3c920B-EMB-WNM Tornado + +Module parameters +================= + +There are several parameters which may be provided to the driver when +its module is loaded. These are usually placed in ``/etc/modprobe.d/*.conf`` +configuration files. Example:: + + options 3c59x debug=3 rx_copybreak=300 + +If you are using the PCMCIA tools (cardmgr) then the options may be +placed in /etc/pcmcia/config.opts:: + + module "3c59x" opts "debug=3 rx_copybreak=300" + + +The supported parameters are: + +debug=N + + Where N is a number from 0 to 7. Anything above 3 produces a lot + of output in your system logs. debug=1 is default. + +options=N1,N2,N3,... + + Each number in the list provides an option to the corresponding + network card. 
So if you have two 3c905's and you wish to provide + them with option 0x204 you would use:: + + options=0x204,0x204 + + The individual options are composed of a number of bitfields which + have the following meanings: + + Possible media type settings + + == ================================= + 0 10baseT + 1 10Mbs AUI + 2 undefined + 3 10base2 (BNC) + 4 100base-TX + 5 100base-FX + 6 MII (Media Independent Interface) + 7 Use default setting from EEPROM + 8 Autonegotiate + 9 External MII + 10 Use default setting from EEPROM + == ================================= + + When generating a value for the 'options' setting, the above media + selection values may be OR'ed (or added to) the following: + + ====== ============================================= + 0x8000 Set driver debugging level to 7 + 0x4000 Set driver debugging level to 2 + 0x0400 Enable Wake-on-LAN + 0x0200 Force full duplex mode. + 0x0010 Bus-master enable bit (Old Vortex cards only) + ====== ============================================= + + For example:: + + insmod 3c59x options=0x204 + + will force full-duplex 100base-TX, rather than allowing the usual + autonegotiation. + +global_options=N + + Sets the ``options`` parameter for all 3c59x NICs in the machine. + Entries in the ``options`` array above will override any setting of + this. + +full_duplex=N1,N2,N3... + + Similar to bit 9 of 'options'. Forces the corresponding card into + full-duplex mode. Please use this in preference to the ``options`` + parameter. + + In fact, please don't use this at all! You're better off getting + autonegotiation working properly. + +global_full_duplex=N1 + + Sets full duplex mode for all 3c59x NICs in the machine. Entries + in the ``full_duplex`` array above will override any setting of this. + +flow_ctrl=N1,N2,N3... + + Use 802.3x MAC-layer flow control. The 3com cards only support the + PAUSE command, which means that they will stop sending packets for a + short period if they receive a PAUSE frame from the link partner. 
+ + The driver only allows flow control on a link which is operating in + full duplex mode. + + This feature does not appear to work on the 3c905 - only 3c905B and + 3c905C have been tested. + + The 3com cards appear to only respond to PAUSE frames which are + sent to the reserved destination address of 01:80:c2:00:00:01. They + do not honour PAUSE frames which are sent to the station MAC address. + +rx_copybreak=M + + The driver preallocates 32 full-sized (1536 byte) network buffers + for receiving. When a packet arrives, the driver has to decide + whether to leave the packet in its full-sized buffer, or to allocate + a smaller buffer and copy the packet across into it. + + This is a speed/space tradeoff. + + The value of rx_copybreak is used to decide when to make the copy. + If the packet size is less than rx_copybreak, the packet is copied. + The default value for rx_copybreak is 200 bytes. + +max_interrupt_work=N + + The driver's interrupt service routine can handle many receive and + transmit packets in a single invocation. It does this in a loop. + The value of max_interrupt_work governs how many times the interrupt + service routine will loop. The default value is 32 loops. If this + is exceeded the interrupt service routine gives up and generates a + warning message "eth0: Too much work in interrupt". + +hw_checksums=N1,N2,N3,... + + Recent 3com NICs are able to generate IPv4, TCP and UDP checksums + in hardware. Linux has used the Rx checksumming for a long time. + The "zero copy" patch which is planned for the 2.4 kernel series + allows you to make use of the NIC's DMA scatter/gather and transmit + checksumming as well. + + The driver is set up so that, when the zerocopy patch is applied, + all Tornado and Cyclone devices will use S/G and Tx checksums. + + This module parameter has been provided so you can override this + decision. If you think that Tx checksums are causing a problem, you + may disable the feature with ``hw_checksums=0``. 
+ + If you think your NIC should be performing Tx checksumming and the + driver isn't enabling it, you can force the use of hardware Tx + checksumming with ``hw_checksums=1``. + + The driver drops a message in the logfiles to indicate whether or + not it is using hardware scatter/gather and hardware Tx checksums. + + Scatter/gather and hardware checksums provide a considerable + performance improvement for the sendfile() system call, but a small + decrease in throughput for send(). There is no effect upon receive + efficiency. + +compaq_ioaddr=N, +compaq_irq=N, +compaq_device_id=N + + "Variables to work-around the Compaq PCI BIOS32 problem".... + +watchdog=N + + Sets the time duration (in milliseconds) after which the kernel + decides that the transmitter has become stuck and needs to be reset. + This is mainly for debugging purposes, although it may be advantageous + to increase this value on LANs which have very high collision rates. + The default value is 5000 (5.0 seconds). + +enable_wol=N1,N2,N3,... + + Enable Wake-on-LAN support for the relevant interface. Donald + Becker's ``ether-wake`` application may be used to wake suspended + machines. + + Also enables the NIC's power management support. + +global_enable_wol=N + + Sets enable_wol mode for all 3c59x NICs in the machine. Entries in + the ``enable_wol`` array above will override any setting of this. + +Media selection +--------------- + +A number of the older NICs such as the 3c590 and 3c900 series have +10base2 and AUI interfaces. + +Prior to January 2001, this driver would autoselect the 10base2 or AUI +port if it didn't detect activity on the 10baseT port. It would then +get stuck on the 10base2 port and a driver reload was necessary to +switch back to 10baseT. This behaviour could not be prevented with a +module option override. + +Later (current) versions of the driver _do_ support locking of the +media type. 
So if you load the driver module with + + modprobe 3c59x options=0 + +it will permanently select the 10baseT port. Automatic selection of +other media types does not occur. + + +Transmit error, Tx status register 82 +------------------------------------- + +This is a common error which is almost always caused by another host on +the same network being in full-duplex mode, while this host is in +half-duplex mode. You need to find that other host and make it run in +half-duplex mode or fix this host to run in full-duplex mode. + +As a last resort, you can force the 3c59x driver into full-duplex mode +with + + options 3c59x full_duplex=1 + +but this has to be viewed as a workaround for broken network gear and +should only really be used for equipment which cannot autonegotiate. + + +Additional resources +-------------------- + +Details of the device driver implementation are at the top of the source file. + +Additional documentation is available at Don Becker's Linux Drivers site: + + http://www.scyld.com/vortex.html + +Donald Becker's driver development site: + + http://www.scyld.com/network.html + +Donald's vortex-diag program is useful for inspecting the NIC's state: + + http://www.scyld.com/ethercard_diag.html + +Donald's mii-diag program may be used for inspecting and manipulating +the NIC's Media Independent Interface subsystem: + + http://www.scyld.com/ethercard_diag.html#mii-diag + +Donald's wake-on-LAN page: + + http://www.scyld.com/wakeonlan.html + +3Com's DOS-based application for setting up the NICs' EEPROMs: + + ftp://ftp.3com.com/pub/nic/3c90x/3c90xx2.exe + + +Autonegotiation notes +--------------------- + + The driver uses a one-minute heartbeat for adapting to changes in + the external LAN environment if the link is up and 5 seconds if the link is down. + This means that when, for example, a machine is unplugged from a hubbed + 10baseT LAN and plugged into a switched 100baseT LAN, the throughput + will be quite dreadful for up to sixty seconds. Be patient. 
+ + Cisco interoperability note from Walter Wong : + + On a side note, adding HAS_NWAY seems to share a problem with the + Cisco 6509 switch. Specifically, you need to change the spanning + tree parameter for the port the machine is plugged into to 'portfast' + mode. Otherwise, the negotiation fails. This has been an issue + we've noticed for a while but haven't had the time to track down. + + Cisco switches (Jeff Busch ) + + My "standard config" for ports to which PC's/servers connect directly:: + + interface FastEthernet0/N + description machinename + load-interval 30 + spanning-tree portfast + + If autonegotiation is a problem, you may need to specify "speed + 100" and "duplex full" as well (or "speed 10" and "duplex half"). + + WARNING: DO NOT hook up hubs/switches/bridges to these + specially-configured ports! The switch will become very confused. + + +Reporting and diagnosing problems +--------------------------------- + +Maintainers find that accurate and complete problem reports are +invaluable in resolving driver problems. We are frequently not able to +reproduce problems and must rely on your patience and efforts to get to +the bottom of the problem. + +If you believe you have a driver problem here are some of the +steps you should take: + +- Is it really a driver problem? + + Eliminate some variables: try different cards, different + computers, different cables, different ports on the switch/hub, + different versions of the kernel or of the driver, etc. + +- OK, it's a driver problem. + + You need to generate a report. Typically this is an email to the + maintainer and/or netdev@vger.kernel.org. The maintainer's + email address will be in the driver source or in the MAINTAINERS file. + +- The contents of your report will vary a lot depending upon the + problem. If it's a kernel crash then you should refer to the + admin-guide/reporting-bugs.rst file. 
+ + But for most problems it is useful to provide the following: + + - Kernel version, driver version + + - A copy of the banner message which the driver generates when + it is initialised. For example: + + eth0: 3Com PCI 3c905C Tornado at 0xa400, 00:50:da:6a:88:f0, IRQ 19 + 8K byte-wide RAM 5:3 Rx:Tx split, autoselect/Autonegotiate interface. + MII transceiver found at address 24, status 782d. + Enabling bus-master transmits and whole-frame receives. + + NOTE: You must provide the ``debug=2`` modprobe option to generate + a full detection message. Please do this:: + + modprobe 3c59x debug=2 + + - If it is a PCI device, the relevant output from 'lspci -vx', eg:: + + 00:09.0 Ethernet controller: 3Com Corporation 3c905C-TX [Fast Etherlink] (rev 74) + Subsystem: 3Com Corporation: Unknown device 9200 + Flags: bus master, medium devsel, latency 32, IRQ 19 + I/O ports at a400 [size=128] + Memory at db000000 (32-bit, non-prefetchable) [size=128] + Expansion ROM at [disabled] [size=128K] + Capabilities: [dc] Power Management version 2 + 00: b7 10 00 92 07 00 10 02 74 00 00 02 08 20 00 00 + 10: 01 a4 00 00 00 00 00 db 00 00 00 00 00 00 00 00 + 20: 00 00 00 00 00 00 00 00 00 00 00 00 b7 10 00 10 + 30: 00 00 00 00 dc 00 00 00 00 00 00 00 05 01 0a 0a + + - A description of the environment: 10baseT? 100baseT? + full/half duplex? switched or hubbed? + + - Any additional module parameters which you may be providing to the driver. + + - Any kernel logs which are produced. The more the merrier. + If this is a large file and you are sending your report to a + mailing list, mention that you have the logfile, but don't send + it. If you're reporting direct to the maintainer then just send + it. + + To ensure that all kernel logs are available, add the + following line to /etc/syslog.conf:: + + kern.* /var/log/messages + + Then restart syslogd with:: + + /etc/rc.d/init.d/syslog restart + + (The above may vary, depending upon which Linux distribution you use). 
+ + - If your problem is reproducible then that's great. Try the + following: + + 1) Increase the debug level. Usually this is done via: + + a) modprobe driver debug=7 + b) In /etc/modprobe.d/driver.conf: + options driver debug=7 + + 2) Recreate the problem with the higher debug level and + send all logs to the maintainer. + + 3) Download your card's diagnostic tool from Donald + Becker's website . + Download mii-diag.c as well. Build these. + + a) Run 'vortex-diag -aaee' and 'mii-diag -v' when the card is + working correctly. Save the output. + + b) Run the above commands when the card is malfunctioning. Send + both sets of output. + +Finally, please be patient and be prepared to do some work. You may +end up working on this problem for a week or more as the maintainer +asks more questions, asks for more tests, asks for patches to be +applied, etc. At the end of it all, the problem may even remain +unresolved. diff --git a/Documentation/networking/device_drivers/3com/vortex.txt b/Documentation/networking/device_drivers/3com/vortex.txt deleted file mode 100644 index 587f3fcfbcae..000000000000 --- a/Documentation/networking/device_drivers/3com/vortex.txt +++ /dev/null @@ -1,448 +0,0 @@ -Documentation/networking/device_drivers/3com/vortex.txt -Andrew Morton -30 April 2000 - - -This document describes the usage and errata of the 3Com "Vortex" device -driver for Linux, 3c59x.c. - -The driver was written by Donald Becker - -Don is no longer the prime maintainer of this version of the driver. -Please report problems to one or more of: - - Andrew Morton - Netdev mailing list - Linux kernel mailing list - -Please note the 'Reporting and Diagnosing Problems' section at the end -of this file. - - -Since kernel 2.3.99-pre6, this driver incorporates the support for the -3c575-series Cardbus cards which used to be handled by 3c575_cb.c.
- -This driver supports the following hardware: - - 3c590 Vortex 10Mbps - 3c592 EISA 10Mbps Demon/Vortex - 3c597 EISA Fast Demon/Vortex - 3c595 Vortex 100baseTx - 3c595 Vortex 100baseT4 - 3c595 Vortex 100base-MII - 3c900 Boomerang 10baseT - 3c900 Boomerang 10Mbps Combo - 3c900 Cyclone 10Mbps TPO - 3c900 Cyclone 10Mbps Combo - 3c900 Cyclone 10Mbps TPC - 3c900B-FL Cyclone 10base-FL - 3c905 Boomerang 100baseTx - 3c905 Boomerang 100baseT4 - 3c905B Cyclone 100baseTx - 3c905B Cyclone 10/100/BNC - 3c905B-FX Cyclone 100baseFx - 3c905C Tornado - 3c920B-EMB-WNM (ATI Radeon 9100 IGP) - 3c980 Cyclone - 3c980C Python-T - 3cSOHO100-TX Hurricane - 3c555 Laptop Hurricane - 3c556 Laptop Tornado - 3c556B Laptop Hurricane - 3c575 [Megahertz] 10/100 LAN CardBus - 3c575 Boomerang CardBus - 3CCFE575BT Cyclone CardBus - 3CCFE575CT Tornado CardBus - 3CCFE656 Cyclone CardBus - 3CCFEM656B Cyclone+Winmodem CardBus - 3CXFEM656C Tornado+Winmodem CardBus - 3c450 HomePNA Tornado - 3c920 Tornado - 3c982 Hydra Dual Port A - 3c982 Hydra Dual Port B - 3c905B-T4 - 3c920B-EMB-WNM Tornado - -Module parameters -================= - -There are several parameters which may be provided to the driver when -its module is loaded. These are usually placed in /etc/modprobe.d/*.conf -configuration files. Example: - -options 3c59x debug=3 rx_copybreak=300 - -If you are using the PCMCIA tools (cardmgr) then the options may be -placed in /etc/pcmcia/config.opts: - -module "3c59x" opts "debug=3 rx_copybreak=300" - - -The supported parameters are: - -debug=N - - Where N is a number from 0 to 7. Anything above 3 produces a lot - of output in your system logs. debug=1 is default. - -options=N1,N2,N3,... - - Each number in the list provides an option to the corresponding - network card. 
So if you have two 3c905's and you wish to provide - them with option 0x204 you would use: - - options=0x204,0x204 - - The individual options are composed of a number of bitfields which - have the following meanings: - - Possible media type settings - 0 10baseT - 1 10Mbs AUI - 2 undefined - 3 10base2 (BNC) - 4 100base-TX - 5 100base-FX - 6 MII (Media Independent Interface) - 7 Use default setting from EEPROM - 8 Autonegotiate - 9 External MII - 10 Use default setting from EEPROM - - When generating a value for the 'options' setting, the above media - selection values may be OR'ed (or added to) the following: - - 0x8000 Set driver debugging level to 7 - 0x4000 Set driver debugging level to 2 - 0x0400 Enable Wake-on-LAN - 0x0200 Force full duplex mode. - 0x0010 Bus-master enable bit (Old Vortex cards only) - - For example: - - insmod 3c59x options=0x204 - - will force full-duplex 100base-TX, rather than allowing the usual - autonegotiation. - -global_options=N - - Sets the `options' parameter for all 3c59x NICs in the machine. - Entries in the `options' array above will override any setting of - this. - -full_duplex=N1,N2,N3... - - Similar to bit 9 of 'options'. Forces the corresponding card into - full-duplex mode. Please use this in preference to the `options' - parameter. - - In fact, please don't use this at all! You're better off getting - autonegotiation working properly. - -global_full_duplex=N1 - - Sets full duplex mode for all 3c59x NICs in the machine. Entries - in the `full_duplex' array above will override any setting of this. - -flow_ctrl=N1,N2,N3... - - Use 802.3x MAC-layer flow control. The 3com cards only support the - PAUSE command, which means that they will stop sending packets for a - short period if they receive a PAUSE frame from the link partner. - - The driver only allows flow control on a link which is operating in - full duplex mode. - - This feature does not appear to work on the 3c905 - only 3c905B and - 3c905C have been tested. 
- - The 3com cards appear to only respond to PAUSE frames which are - sent to the reserved destination address of 01:80:c2:00:00:01. They - do not honour PAUSE frames which are sent to the station MAC address. - -rx_copybreak=M - - The driver preallocates 32 full-sized (1536 byte) network buffers - for receiving. When a packet arrives, the driver has to decide - whether to leave the packet in its full-sized buffer, or to allocate - a smaller buffer and copy the packet across into it. - - This is a speed/space tradeoff. - - The value of rx_copybreak is used to decide when to make the copy. - If the packet size is less than rx_copybreak, the packet is copied. - The default value for rx_copybreak is 200 bytes. - -max_interrupt_work=N - - The driver's interrupt service routine can handle many receive and - transmit packets in a single invocation. It does this in a loop. - The value of max_interrupt_work governs how many times the interrupt - service routine will loop. The default value is 32 loops. If this - is exceeded the interrupt service routine gives up and generates a - warning message "eth0: Too much work in interrupt". - -hw_checksums=N1,N2,N3,... - - Recent 3com NICs are able to generate IPv4, TCP and UDP checksums - in hardware. Linux has used the Rx checksumming for a long time. - The "zero copy" patch which is planned for the 2.4 kernel series - allows you to make use of the NIC's DMA scatter/gather and transmit - checksumming as well. - - The driver is set up so that, when the zerocopy patch is applied, - all Tornado and Cyclone devices will use S/G and Tx checksums. - - This module parameter has been provided so you can override this - decision. If you think that Tx checksums are causing a problem, you - may disable the feature with `hw_checksums=0'. - - If you think your NIC should be performing Tx checksumming and the - driver isn't enabling it, you can force the use of hardware Tx - checksumming with `hw_checksums=1'. 
- - The driver drops a message in the logfiles to indicate whether or - not it is using hardware scatter/gather and hardware Tx checksums. - - Scatter/gather and hardware checksums provide considerable - performance improvement for the sendfile() system call, but a small - decrease in throughput for send(). There is no effect upon receive - efficiency. - -compaq_ioaddr=N -compaq_irq=N -compaq_device_id=N - - "Variables to work-around the Compaq PCI BIOS32 problem".... - -watchdog=N - - Sets the time duration (in milliseconds) after which the kernel - decides that the transmitter has become stuck and needs to be reset. - This is mainly for debugging purposes, although it may be advantageous - to increase this value on LANs which have very high collision rates. - The default value is 5000 (5.0 seconds). - -enable_wol=N1,N2,N3,... - - Enable Wake-on-LAN support for the relevant interface. Donald - Becker's `ether-wake' application may be used to wake suspended - machines. - - Also enables the NIC's power management support. - -global_enable_wol=N - - Sets enable_wol mode for all 3c59x NICs in the machine. Entries in - the `enable_wol' array above will override any setting of this. - -Media selection ---------------- - -A number of the older NICs such as the 3c590 and 3c900 series have -10base2 and AUI interfaces. - -Prior to January, 2001 this driver would autoeselect the 10base2 or AUI -port if it didn't detect activity on the 10baseT port. It would then -get stuck on the 10base2 port and a driver reload was necessary to -switch back to 10baseT. This behaviour could not be prevented with a -module option override. - -Later (current) versions of the driver _do_ support locking of the -media type. So if you load the driver module with - - modprobe 3c59x options=0 - -it will permanently select the 10baseT port. Automatic selection of -other media types does not occur. 
- - -Transmit error, Tx status register 82 -------------------------------------- - -This is a common error which is almost always caused by another host on -the same network being in full-duplex mode, while this host is in -half-duplex mode. You need to find that other host and make it run in -half-duplex mode or fix this host to run in full-duplex mode. - -As a last resort, you can force the 3c59x driver into full-duplex mode -with - - options 3c59x full_duplex=1 - -but this has to be viewed as a workaround for broken network gear and -should only really be used for equipment which cannot autonegotiate. - - -Additional resources --------------------- - -Details of the device driver implementation are at the top of the source file. - -Additional documentation is available at Don Becker's Linux Drivers site: - - http://www.scyld.com/vortex.html - -Donald Becker's driver development site: - - http://www.scyld.com/network.html - -Donald's vortex-diag program is useful for inspecting the NIC's state: - - http://www.scyld.com/ethercard_diag.html - -Donald's mii-diag program may be used for inspecting and manipulating -the NIC's Media Independent Interface subsystem: - - http://www.scyld.com/ethercard_diag.html#mii-diag - -Donald's wake-on-LAN page: - - http://www.scyld.com/wakeonlan.html - -3Com's DOS-based application for setting up the NICs EEPROMs: - - ftp://ftp.3com.com/pub/nic/3c90x/3c90xx2.exe - - -Autonegotiation notes ---------------------- - - The driver uses a one-minute heartbeat for adapting to changes in - the external LAN environment if link is up and 5 seconds if link is down. - This means that when, for example, a machine is unplugged from a hubbed - 10baseT LAN plugged into a switched 100baseT LAN, the throughput - will be quite dreadful for up to sixty seconds. Be patient. - - Cisco interoperability note from Walter Wong : - - On a side note, adding HAS_NWAY seems to share a problem with the - Cisco 6509 switch. 
Specifically, you need to change the spanning - tree parameter for the port the machine is plugged into to 'portfast' - mode. Otherwise, the negotiation fails. This has been an issue - we've noticed for a while but haven't had the time to track down. - - Cisco switches (Jeff Busch ) - - My "standard config" for ports to which PC's/servers connect directly: - - interface FastEthernet0/N - description machinename - load-interval 30 - spanning-tree portfast - - If autonegotiation is a problem, you may need to specify "speed - 100" and "duplex full" as well (or "speed 10" and "duplex half"). - - WARNING: DO NOT hook up hubs/switches/bridges to these - specially-configured ports! The switch will become very confused. - - -Reporting and diagnosing problems ---------------------------------- - -Maintainers find that accurate and complete problem reports are -invaluable in resolving driver problems. We are frequently not able to -reproduce problems and must rely on your patience and efforts to get to -the bottom of the problem. - -If you believe you have a driver problem here are some of the -steps you should take: - -- Is it really a driver problem? - - Eliminate some variables: try different cards, different - computers, different cables, different ports on the switch/hub, - different versions of the kernel or of the driver, etc. - -- OK, it's a driver problem. - - You need to generate a report. Typically this is an email to the - maintainer and/or netdev@vger.kernel.org. The maintainer's - email address will be in the driver source or in the MAINTAINERS file. - -- The contents of your report will vary a lot depending upon the - problem. If it's a kernel crash then you should refer to the - admin-guide/reporting-bugs.rst file. - - But for most problems it is useful to provide the following: - - o Kernel version, driver version - - o A copy of the banner message which the driver generates when - it is initialised. 
For example: - - eth0: 3Com PCI 3c905C Tornado at 0xa400, 00:50:da:6a:88:f0, IRQ 19 - 8K byte-wide RAM 5:3 Rx:Tx split, autoselect/Autonegotiate interface. - MII transceiver found at address 24, status 782d. - Enabling bus-master transmits and whole-frame receives. - - NOTE: You must provide the `debug=2' modprobe option to generate - a full detection message. Please do this: - - modprobe 3c59x debug=2 - - o If it is a PCI device, the relevant output from 'lspci -vx', eg: - - 00:09.0 Ethernet controller: 3Com Corporation 3c905C-TX [Fast Etherlink] (rev 74) - Subsystem: 3Com Corporation: Unknown device 9200 - Flags: bus master, medium devsel, latency 32, IRQ 19 - I/O ports at a400 [size=128] - Memory at db000000 (32-bit, non-prefetchable) [size=128] - Expansion ROM at [disabled] [size=128K] - Capabilities: [dc] Power Management version 2 - 00: b7 10 00 92 07 00 10 02 74 00 00 02 08 20 00 00 - 10: 01 a4 00 00 00 00 00 db 00 00 00 00 00 00 00 00 - 20: 00 00 00 00 00 00 00 00 00 00 00 00 b7 10 00 10 - 30: 00 00 00 00 dc 00 00 00 00 00 00 00 05 01 0a 0a - - o A description of the environment: 10baseT? 100baseT? - full/half duplex? switched or hubbed? - - o Any additional module parameters which you may be providing to the driver. - - o Any kernel logs which are produced. The more the merrier. - If this is a large file and you are sending your report to a - mailing list, mention that you have the logfile, but don't send - it. If you're reporting direct to the maintainer then just send - it. - - To ensure that all kernel logs are available, add the - following line to /etc/syslog.conf: - - kern.* /var/log/messages - - Then restart syslogd with: - - /etc/rc.d/init.d/syslog restart - - (The above may vary, depending upon which Linux distribution you use). - - o If your problem is reproducible then that's great. Try the - following: - - 1) Increase the debug level. 
Usually this is done via: - - a) modprobe driver debug=7 - b) In /etc/modprobe.d/driver.conf: - options driver debug=7 - - 2) Recreate the problem with the higher debug level, - send all logs to the maintainer. - - 3) Download you card's diagnostic tool from Donald - Becker's website . - Download mii-diag.c as well. Build these. - - a) Run 'vortex-diag -aaee' and 'mii-diag -v' when the card is - working correctly. Save the output. - - b) Run the above commands when the card is malfunctioning. Send - both sets of output. - -Finally, please be patient and be prepared to do some work. You may -end up working on this problem for a week or more as the maintainer -asks more questions, asks for more tests, asks for patches to be -applied, etc. At the end of it all, the problem may even remain -unresolved. diff --git a/Documentation/networking/device_drivers/index.rst b/Documentation/networking/device_drivers/index.rst index 402a9188f446..aaac502b81ea 100644 --- a/Documentation/networking/device_drivers/index.rst +++ b/Documentation/networking/device_drivers/index.rst @@ -28,6 +28,7 @@ Contents: pensando/ionic stmicro/stmmac 3com/3c509 + 3com/vortex .. 
only:: subproject and html diff --git a/MAINTAINERS b/MAINTAINERS index bee65ebdc67e..eaea5f1994c9 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -147,7 +147,7 @@ Maintainers List M: Steffen Klassert L: netdev@vger.kernel.org S: Odd Fixes -F: Documentation/networking/device_drivers/3com/vortex.txt +F: Documentation/networking/device_drivers/3com/vortex.rst F: drivers/net/ethernet/3com/3c59x.c 3CR990 NETWORK DRIVER diff --git a/drivers/net/ethernet/3com/3c59x.c b/drivers/net/ethernet/3com/3c59x.c index a2b7f7ab8170..5984b7033999 100644 --- a/drivers/net/ethernet/3com/3c59x.c +++ b/drivers/net/ethernet/3com/3c59x.c @@ -1149,7 +1149,7 @@ static int vortex_probe1(struct device *gendev, void __iomem *ioaddr, int irq, print_info = (vortex_debug > 1); if (print_info) - pr_info("See Documentation/networking/device_drivers/3com/vortex.txt\n"); + pr_info("See Documentation/networking/device_drivers/3com/vortex.rst\n"); pr_info("%s: 3Com %s %s at %p.\n", print_name, @@ -1954,7 +1954,7 @@ vortex_error(struct net_device *dev, int status) dev->name, tx_status); if (tx_status == 0x82) { pr_err("Probably a duplex mismatch. See " - "Documentation/networking/device_drivers/3com/vortex.txt\n"); + "Documentation/networking/device_drivers/3com/vortex.rst\n"); } dump_tx_ring(dev); } diff --git a/drivers/net/ethernet/3com/Kconfig b/drivers/net/ethernet/3com/Kconfig index 3a6fc99c6f32..7cc259893cb9 100644 --- a/drivers/net/ethernet/3com/Kconfig +++ b/drivers/net/ethernet/3com/Kconfig @@ -76,7 +76,7 @@ config VORTEX "Hurricane" (3c555/3cSOHO) PCI If you have such a card, say Y here. More specific information is in - and + and in the comments at the beginning of . 
-- cgit v1.2.3 From 8d299c7e912bd8ebb88b9ac2b8e336c9878783aa Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Fri, 1 May 2020 16:44:36 +0200 Subject: docs: networking: device drivers: convert amazon/ena.txt to ReST - add SPDX header; - adjust titles and chapters, adding proper markups; - mark code blocks and literals as such; - mark tables as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- .../networking/device_drivers/amazon/ena.rst | 344 +++++++++++++++++++++ .../networking/device_drivers/amazon/ena.txt | 308 ------------------ Documentation/networking/device_drivers/index.rst | 1 + MAINTAINERS | 2 +- 4 files changed, 346 insertions(+), 309 deletions(-) create mode 100644 Documentation/networking/device_drivers/amazon/ena.rst delete mode 100644 Documentation/networking/device_drivers/amazon/ena.txt (limited to 'MAINTAINERS') diff --git a/Documentation/networking/device_drivers/amazon/ena.rst b/Documentation/networking/device_drivers/amazon/ena.rst new file mode 100644 index 000000000000..11af6388ea87 --- /dev/null +++ b/Documentation/networking/device_drivers/amazon/ena.rst @@ -0,0 +1,344 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================================================ +Linux kernel driver for Elastic Network Adapter (ENA) family +============================================================ + +Overview +======== + +ENA is a networking interface designed to make good use of modern CPU +features and system architectures. + +The ENA device exposes a lightweight management interface with a +minimal set of memory mapped registers and extendable command set +through an Admin Queue. + +The driver supports a range of ENA devices, is link-speed independent +(i.e., the same driver is used for 10GbE, 25GbE, 40GbE, etc.), and has +a negotiated and extendable feature set. + +Some ENA devices support SR-IOV. 
This driver is used for both the +SR-IOV Physical Function (PF) and Virtual Function (VF) devices. + +ENA devices enable high speed and low overhead network traffic +processing by providing multiple Tx/Rx queue pairs (the maximum number +is advertised by the device via the Admin Queue), a dedicated MSI-X +interrupt vector per Tx/Rx queue pair, adaptive interrupt moderation, +and CPU cacheline optimized data placement. + +The ENA driver supports industry standard TCP/IP offload features such +as checksum offload and TCP transmit segmentation offload (TSO). +Receive-side scaling (RSS) is supported for multi-core scaling. + +The ENA driver and its corresponding devices implement health +monitoring mechanisms such as watchdog, enabling the device and driver +to recover in a manner transparent to the application, as well as +debug logs. + +Some of the ENA devices support a working mode called Low-latency +Queue (LLQ), which saves several more microseconds. + +Supported PCI vendor ID/device IDs +================================== + +========= ======================= +1d0f:0ec2 ENA PF +1d0f:1ec2 ENA PF with LLQ support +1d0f:ec20 ENA VF +1d0f:ec21 ENA VF with LLQ support +========= ======================= + +ENA Source Code Directory Structure +=================================== + +================= ====================================================== +ena_com.[ch] Management communication layer. This layer is + responsible for handling all the management + (admin) communication between the device and the + driver. +ena_eth_com.[ch] Tx/Rx data path. +ena_admin_defs.h Definition of ENA management interface. +ena_eth_io_defs.h Definition of ENA data path interface. +ena_common_defs.h Common definitions for ena_com layer. +ena_regs_defs.h Definition of ENA PCI memory-mapped (MMIO) registers. +ena_netdev.[ch] Main Linux kernel driver. +ena_sysfs.[ch] Sysfs files. +ena_ethtool.c ethtool callbacks. +ena_pci_id_tbl.h Supported device IDs.
+================= ====================================================== + +Management Interface: +===================== + +ENA management interface is exposed by means of: + +- PCIe Configuration Space +- Device Registers +- Admin Queue (AQ) and Admin Completion Queue (ACQ) +- Asynchronous Event Notification Queue (AENQ) + +ENA device MMIO Registers are accessed only during driver +initialization and are not involved in further normal device +operation. + +AQ is used for submitting management commands, and the +results/responses are reported asynchronously through ACQ. + +ENA introduces a small set of management commands with room for +vendor-specific extensions. Most of the management operations are +framed in a generic Get/Set feature command. + +The following admin queue commands are supported: + +- Create I/O submission queue +- Create I/O completion queue +- Destroy I/O submission queue +- Destroy I/O completion queue +- Get feature +- Set feature +- Configure AENQ +- Get statistics + +Refer to ena_admin_defs.h for the list of supported Get/Set Feature +properties. + +The Asynchronous Event Notification Queue (AENQ) is a uni-directional +queue used by the ENA device to send to the driver events that cannot +be reported using ACQ. AENQ events are subdivided into groups. Each +group may have multiple syndromes, as shown below + +The events are: + + ==================== =============== + Group Syndrome + ==================== =============== + Link state change **X** + Fatal error **X** + Notification Suspend traffic + Notification Resume traffic + Keep-Alive **X** + ==================== =============== + +ACQ and AENQ share the same MSI-X vector. + +Keep-Alive is a special mechanism that allows monitoring of the +device's health. The driver maintains a watchdog (WD) handler which, +if fired, logs the current state and statistics then resets and +restarts the ENA device and driver. A Keep-Alive event is delivered by +the device every second. 
The driver re-arms the WD upon reception of a +Keep-Alive event. A missed Keep-Alive event causes the WD handler to +fire. + +Data Path Interface +=================== +I/O operations are based on Tx and Rx Submission Queues (Tx SQ and Rx +SQ correspondingly). Each SQ has a completion queue (CQ) associated +with it. + +The SQs and CQs are implemented as descriptor rings in contiguous +physical memory. + +The ENA driver supports two Queue Operation modes for Tx SQs: + +- Regular mode + + * In this mode the Tx SQs reside in the host's memory. The ENA + device fetches the ENA Tx descriptors and packet data from host + memory. + +- Low Latency Queue (LLQ) mode or "push-mode". + + * In this mode the driver pushes the transmit descriptors and the + first 128 bytes of the packet directly to the ENA device memory + space. The rest of the packet payload is fetched by the + device. For this operation mode, the driver uses a dedicated PCI + device memory BAR, which is mapped with write-combine capability. + +The Rx SQs support only the regular mode. + +Note: Not all ENA devices support LLQ, and this feature is negotiated + with the device upon initialization. If the ENA device does not + support LLQ mode, the driver falls back to the regular mode. + +The driver supports multi-queue for both Tx and Rx. This has various +benefits: + +- Reduced CPU/thread/process contention on a given Ethernet interface. +- Cache miss rate on completion is reduced, particularly for data + cache lines that hold the sk_buff structures. +- Increased process-level parallelism when handling received packets. +- Increased data cache hit rate, by steering kernel processing of + packets to the CPU, where the application thread consuming the + packet is running. +- In hardware interrupt re-direction. + +Interrupt Modes +=============== +The driver assigns a single MSI-X vector per queue pair (for both Tx +and Rx directions). 
The driver assigns an additional dedicated MSI-X vector +for management (for ACQ and AENQ). + +Management interrupt registration is performed when the Linux kernel +probes the adapter, and it is de-registered when the adapter is +removed. I/O queue interrupt registration is performed when the Linux +interface of the adapter is opened, and it is de-registered when the +interface is closed. + +The management interrupt is named:: + + ena-mgmnt@pci: + +and for each queue pair, an interrupt is named:: + + -Tx-Rx- + +The ENA device operates in auto-mask and auto-clear interrupt +modes. That is, once MSI-X is delivered to the host, its Cause bit is +automatically cleared and the interrupt is masked. The interrupt is +unmasked by the driver after NAPI processing is complete. + +Interrupt Moderation +==================== +ENA driver and device can operate in conventional or adaptive interrupt +moderation mode. + +In conventional mode the driver instructs device to postpone interrupt +posting according to static interrupt delay value. The interrupt delay +value can be configured through ethtool(8). The following ethtool +parameters are supported by the driver: tx-usecs, rx-usecs + +In adaptive interrupt moderation mode the interrupt delay value is +updated by the driver dynamically and adjusted every NAPI cycle +according to the traffic nature. + +By default ENA driver applies adaptive coalescing on Rx traffic and +conventional coalescing on Tx traffic. + +Adaptive coalescing can be switched on/off through ethtool(8) +adaptive_rx on|off parameter. + +The driver chooses interrupt delay value according to the number of +bytes and packets received between interrupt unmasking and interrupt +posting. The driver uses interrupt delay table that subdivides the +range of received bytes/packets into 5 levels and assigns interrupt +delay value to each level. 
+ +The user can enable/disable adaptive moderation, modify the interrupt +delay table and restore its default values through sysfs. + +RX copybreak +============ +The rx_copybreak is initialized by default to ENA_DEFAULT_RX_COPYBREAK +and can be configured by the ETHTOOL_STUNABLE command of the +SIOCETHTOOL ioctl. + +SKB +=== +The driver allocates an SKB for each frame received on the Rx path in +NAPI context. The allocation method depends on the size of the packet. +If the frame length is larger than rx_copybreak, napi_get_frags() +is used; otherwise netdev_alloc_skb_ip_align() is used, the buffer +content is copied (by CPU) to the SKB, and the buffer is recycled. + +Statistics +========== +The user can obtain ENA device and driver statistics using ethtool. +The driver can collect regular or extended statistics (including +per-queue stats) from the device. + +In addition the driver logs the stats to syslog upon device reset. + +MTU +=== +The driver supports an arbitrarily large MTU with a maximum that is +negotiated with the device. The driver configures MTU using the +SetFeature command (ENA_ADMIN_MTU property). The user can change MTU +via ip(8) and similar legacy tools. + +Stateless Offloads +================== +The ENA driver supports: + +- TSO over IPv4/IPv6 +- TSO with ECN +- IPv4 header checksum offload +- TCP/UDP over IPv4/IPv6 checksum offloads + +RSS +=== +- The ENA device supports RSS that allows flexible Rx traffic + steering. +- Toeplitz and CRC32 hash functions are supported. +- Different combinations of L2/L3/L4 fields can be configured as + inputs for hash functions. +- The driver configures RSS settings using the AQ SetFeature command + (ENA_ADMIN_RSS_HASH_FUNCTION, ENA_ADMIN_RSS_HASH_INPUT and + ENA_ADMIN_RSS_REDIRECTION_TABLE_CONFIG properties). +- If the NETIF_F_RXHASH flag is set, the 32-bit result of the hash + function delivered in the Rx CQ descriptor is set in the received + SKB.
+- The user can provide a hash key, hash function, and configure the + indirection table through ethtool(8). + +DATA PATH +========= +Tx +-- + +ena_start_xmit() is called by the stack. This function does the following: + +- Maps data buffers (skb->data and frags). +- Populates ena_buf for the push buffer (if the driver and device are + in push mode). +- Prepares ENA bufs for the remaining frags. +- Allocates a new request ID from the empty req_id ring. The request + ID is the index of the packet in the Tx info. This is used for + out-of-order TX completions. +- Adds the packet to the proper place in the Tx ring. +- Calls ena_com_prepare_tx(), an ENA communication layer function that + converts the ena_bufs to ENA descriptors (and adds meta ENA + descriptors as needed). + + * This function also copies the ENA descriptors and the push buffer + to the Device memory space (if in push mode). + +- Writes doorbell to the ENA device. +- When the ENA device finishes sending the packet, a completion + interrupt is raised. +- The interrupt handler schedules NAPI. +- The ena_clean_tx_irq() function is called. This function handles the + completion descriptors generated by the ENA, with a single + completion descriptor per completed packet. + + * req_id is retrieved from the completion descriptor. The tx_info of + the packet is retrieved via the req_id. The data buffers are + unmapped and req_id is returned to the empty req_id ring. + * The function stops when the completion descriptors are completed or + the budget is reached. + +Rx +-- + +- A packet is received from the ENA device. +- The interrupt handler schedules NAPI. +- The ena_clean_rx_irq() function is called. This function calls + ena_com_rx_pkt(), an ENA communication layer function, which returns + the number of descriptors used for a new unhandled packet, and zero + if no new packet is found. +- Then it calls the ena_rx_skb() function.
+- ena_eth_rx_skb() checks packet length: + + * If the packet is small (len < rx_copybreak), the driver allocates + a SKB for the new packet, and copies the packet payload into the + SKB data buffer. + + - In this way the original data buffer is not passed to the stack + and is reused for future Rx packets. + + * Otherwise the function unmaps the Rx buffer, then allocates the + new SKB structure and hooks the Rx buffer to the SKB frags. + +- The new SKB is updated with the necessary information (protocol, + checksum hw verify result, etc.), and then passed to the network + stack, using the NAPI interface function napi_gro_receive(). diff --git a/Documentation/networking/device_drivers/amazon/ena.txt b/Documentation/networking/device_drivers/amazon/ena.txt deleted file mode 100644 index 1bb55c7b604c..000000000000 --- a/Documentation/networking/device_drivers/amazon/ena.txt +++ /dev/null @@ -1,308 +0,0 @@ -Linux kernel driver for Elastic Network Adapter (ENA) family: -============================================================= - -Overview: -========= -ENA is a networking interface designed to make good use of modern CPU -features and system architectures. - -The ENA device exposes a lightweight management interface with a -minimal set of memory mapped registers and extendable command set -through an Admin Queue. - -The driver supports a range of ENA devices, is link-speed independent -(i.e., the same driver is used for 10GbE, 25GbE, 40GbE, etc.), and has -a negotiated and extendable feature set. - -Some ENA devices support SR-IOV. This driver is used for both the -SR-IOV Physical Function (PF) and Virtual Function (VF) devices. - -ENA devices enable high speed and low overhead network traffic -processing by providing multiple Tx/Rx queue pairs (the maximum number -is advertised by the device via the Admin Queue), a dedicated MSI-X -interrupt vector per Tx/Rx queue pair, adaptive interrupt moderation, -and CPU cacheline optimized data placement. 
- -The ENA driver supports industry standard TCP/IP offload features such -as checksum offload and TCP transmit segmentation offload (TSO). -Receive-side scaling (RSS) is supported for multi-core scaling. - -The ENA driver and its corresponding devices implement health -monitoring mechanisms such as watchdog, enabling the device and driver -to recover in a manner transparent to the application, as well as -debug logs. - -Some of the ENA devices support a working mode called Low-latency -Queue (LLQ), which saves several more microseconds. - -Supported PCI vendor ID/device IDs: -=================================== -1d0f:0ec2 - ENA PF -1d0f:1ec2 - ENA PF with LLQ support -1d0f:ec20 - ENA VF -1d0f:ec21 - ENA VF with LLQ support - -ENA Source Code Directory Structure: -==================================== -ena_com.[ch] - Management communication layer. This layer is - responsible for the handling all the management - (admin) communication between the device and the - driver. -ena_eth_com.[ch] - Tx/Rx data path. -ena_admin_defs.h - Definition of ENA management interface. -ena_eth_io_defs.h - Definition of ENA data path interface. -ena_common_defs.h - Common definitions for ena_com layer. -ena_regs_defs.h - Definition of ENA PCI memory-mapped (MMIO) registers. -ena_netdev.[ch] - Main Linux kernel driver. -ena_syfsfs.[ch] - Sysfs files. -ena_ethtool.c - ethtool callbacks. -ena_pci_id_tbl.h - Supported device IDs. - -Management Interface: -===================== -ENA management interface is exposed by means of: -- PCIe Configuration Space -- Device Registers -- Admin Queue (AQ) and Admin Completion Queue (ACQ) -- Asynchronous Event Notification Queue (AENQ) - -ENA device MMIO Registers are accessed only during driver -initialization and are not involved in further normal device -operation. - -AQ is used for submitting management commands, and the -results/responses are reported asynchronously through ACQ. 
- -ENA introduces a small set of management commands with room for -vendor-specific extensions. Most of the management operations are -framed in a generic Get/Set feature command. - -The following admin queue commands are supported: -- Create I/O submission queue -- Create I/O completion queue -- Destroy I/O submission queue -- Destroy I/O completion queue -- Get feature -- Set feature -- Configure AENQ -- Get statistics - -Refer to ena_admin_defs.h for the list of supported Get/Set Feature -properties. - -The Asynchronous Event Notification Queue (AENQ) is a uni-directional -queue used by the ENA device to send to the driver events that cannot -be reported using ACQ. AENQ events are subdivided into groups. Each -group may have multiple syndromes, as shown below - -The events are: - Group Syndrome - Link state change - X - - Fatal error - X - - Notification Suspend traffic - Notification Resume traffic - Keep-Alive - X - - -ACQ and AENQ share the same MSI-X vector. - -Keep-Alive is a special mechanism that allows monitoring of the -device's health. The driver maintains a watchdog (WD) handler which, -if fired, logs the current state and statistics then resets and -restarts the ENA device and driver. A Keep-Alive event is delivered by -the device every second. The driver re-arms the WD upon reception of a -Keep-Alive event. A missed Keep-Alive event causes the WD handler to -fire. - -Data Path Interface: -==================== -I/O operations are based on Tx and Rx Submission Queues (Tx SQ and Rx -SQ correspondingly). Each SQ has a completion queue (CQ) associated -with it. - -The SQs and CQs are implemented as descriptor rings in contiguous -physical memory. - -The ENA driver supports two Queue Operation modes for Tx SQs: -- Regular mode - * In this mode the Tx SQs reside in the host's memory. The ENA - device fetches the ENA Tx descriptors and packet data from host - memory. -- Low Latency Queue (LLQ) mode or "push-mode". 
- * In this mode the driver pushes the transmit descriptors and the - first 128 bytes of the packet directly to the ENA device memory - space. The rest of the packet payload is fetched by the - device. For this operation mode, the driver uses a dedicated PCI - device memory BAR, which is mapped with write-combine capability. - -The Rx SQs support only the regular mode. - -Note: Not all ENA devices support LLQ, and this feature is negotiated - with the device upon initialization. If the ENA device does not - support LLQ mode, the driver falls back to the regular mode. - -The driver supports multi-queue for both Tx and Rx. This has various -benefits: -- Reduced CPU/thread/process contention on a given Ethernet interface. -- Cache miss rate on completion is reduced, particularly for data - cache lines that hold the sk_buff structures. -- Increased process-level parallelism when handling received packets. -- Increased data cache hit rate, by steering kernel processing of - packets to the CPU, where the application thread consuming the - packet is running. -- In hardware interrupt re-direction. - -Interrupt Modes: -================ -The driver assigns a single MSI-X vector per queue pair (for both Tx -and Rx directions). The driver assigns an additional dedicated MSI-X vector -for management (for ACQ and AENQ). - -Management interrupt registration is performed when the Linux kernel -probes the adapter, and it is de-registered when the adapter is -removed. I/O queue interrupt registration is performed when the Linux -interface of the adapter is opened, and it is de-registered when the -interface is closed. - -The management interrupt is named: - ena-mgmnt@pci: -and for each queue pair, an interrupt is named: - -Tx-Rx- - -The ENA device operates in auto-mask and auto-clear interrupt -modes. That is, once MSI-X is delivered to the host, its Cause bit is -automatically cleared and the interrupt is masked. 
The interrupt is -unmasked by the driver after NAPI processing is complete. - -Interrupt Moderation: -===================== -ENA driver and device can operate in conventional or adaptive interrupt -moderation mode. - -In conventional mode the driver instructs device to postpone interrupt -posting according to static interrupt delay value. The interrupt delay -value can be configured through ethtool(8). The following ethtool -parameters are supported by the driver: tx-usecs, rx-usecs - -In adaptive interrupt moderation mode the interrupt delay value is -updated by the driver dynamically and adjusted every NAPI cycle -according to the traffic nature. - -By default ENA driver applies adaptive coalescing on Rx traffic and -conventional coalescing on Tx traffic. - -Adaptive coalescing can be switched on/off through ethtool(8) -adaptive_rx on|off parameter. - -The driver chooses interrupt delay value according to the number of -bytes and packets received between interrupt unmasking and interrupt -posting. The driver uses interrupt delay table that subdivides the -range of received bytes/packets into 5 levels and assigns interrupt -delay value to each level. - -The user can enable/disable adaptive moderation, modify the interrupt -delay table and restore its default values through sysfs. - -RX copybreak: -============= -The rx_copybreak is initialized by default to ENA_DEFAULT_RX_COPYBREAK -and can be configured by the ETHTOOL_STUNABLE command of the -SIOCETHTOOL ioctl. - -SKB: -==== -The driver-allocated SKB for frames received from Rx handling using -NAPI context. The allocation method depends on the size of the packet. -If the frame length is larger than rx_copybreak, napi_get_frags() -is used, otherwise netdev_alloc_skb_ip_align() is used, the buffer -content is copied (by CPU) to the SKB, and the buffer is recycled. - -Statistics: -=========== -The user can obtain ENA device and driver statistics using ethtool. 
-The driver can collect regular or extended statistics (including -per-queue stats) from the device. - -In addition the driver logs the stats to syslog upon device reset. - -MTU: -==== -The driver supports an arbitrarily large MTU with a maximum that is -negotiated with the device. The driver configures MTU using the -SetFeature command (ENA_ADMIN_MTU property). The user can change MTU -via ip(8) and similar legacy tools. - -Stateless Offloads: -=================== -The ENA driver supports: -- TSO over IPv4/IPv6 -- TSO with ECN -- IPv4 header checksum offload -- TCP/UDP over IPv4/IPv6 checksum offloads - -RSS: -==== -- The ENA device supports RSS that allows flexible Rx traffic - steering. -- Toeplitz and CRC32 hash functions are supported. -- Different combinations of L2/L3/L4 fields can be configured as - inputs for hash functions. -- The driver configures RSS settings using the AQ SetFeature command - (ENA_ADMIN_RSS_HASH_FUNCTION, ENA_ADMIN_RSS_HASH_INPUT and - ENA_ADMIN_RSS_REDIRECTION_TABLE_CONFIG properties). -- If the NETIF_F_RXHASH flag is set, the 32-bit result of the hash - function delivered in the Rx CQ descriptor is set in the received - SKB. -- The user can provide a hash key, hash function, and configure the - indirection table through ethtool(8). - -DATA PATH: -========== -Tx: ---- -end_start_xmit() is called by the stack. This function does the following: -- Maps data buffers (skb->data and frags). -- Populates ena_buf for the push buffer (if the driver and device are - in push mode.) -- Prepares ENA bufs for the remaining frags. -- Allocates a new request ID from the empty req_id ring. The request - ID is the index of the packet in the Tx info. This is used for - out-of-order TX completions. -- Adds the packet to the proper place in the Tx ring. -- Calls ena_com_prepare_tx(), an ENA communication layer that converts - the ena_bufs to ENA descriptors (and adds meta ENA descriptors as - needed.) 
- * This function also copies the ENA descriptors and the push buffer - to the Device memory space (if in push mode.) -- Writes doorbell to the ENA device. -- When the ENA device finishes sending the packet, a completion - interrupt is raised. -- The interrupt handler schedules NAPI. -- The ena_clean_tx_irq() function is called. This function handles the - completion descriptors generated by the ENA, with a single - completion descriptor per completed packet. - * req_id is retrieved from the completion descriptor. The tx_info of - the packet is retrieved via the req_id. The data buffers are - unmapped and req_id is returned to the empty req_id ring. - * The function stops when the completion descriptors are completed or - the budget is reached. - -Rx: ---- -- When a packet is received from the ENA device. -- The interrupt handler schedules NAPI. -- The ena_clean_rx_irq() function is called. This function calls - ena_rx_pkt(), an ENA communication layer function, which returns the - number of descriptors used for a new unhandled packet, and zero if - no new packet is found. -- Then it calls the ena_clean_rx_irq() function. -- ena_eth_rx_skb() checks packet length: - * If the packet is small (len < rx_copybreak), the driver allocates - a SKB for the new packet, and copies the packet payload into the - SKB data buffer. - - In this way the original data buffer is not passed to the stack - and is reused for future Rx packets. - * Otherwise the function unmaps the Rx buffer, then allocates the - new SKB structure and hooks the Rx buffer to the SKB frags. -- The new SKB is updated with the necessary information (protocol, - checksum hw verify result, etc.), and then passed to the network - stack, using the NAPI interface function napi_gro_receive(). 
diff --git a/Documentation/networking/device_drivers/index.rst b/Documentation/networking/device_drivers/index.rst index aaac502b81ea..019a0d2efe67 100644 --- a/Documentation/networking/device_drivers/index.rst +++ b/Documentation/networking/device_drivers/index.rst @@ -29,6 +29,7 @@ Contents: stmicro/stmmac 3com/3c509 3com/vortex + amazon/ena .. only:: subproject and html diff --git a/MAINTAINERS b/MAINTAINERS index eaea5f1994c9..7b6c13cc832f 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -815,7 +815,7 @@ R: Saeed Bishara R: Zorik Machulsky L: netdev@vger.kernel.org S: Supported -F: Documentation/networking/device_drivers/amazon/ena.txt +F: Documentation/networking/device_drivers/amazon/ena.rst F: drivers/net/ethernet/amazon/ AMAZON RDMA EFA DRIVER -- cgit v1.2.3 From c958119a487ec4578f50b352f45e965a30daa020 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Fri, 1 May 2020 16:44:37 +0200 Subject: docs: networking: device drivers: convert aquantia/atlantic.txt to ReST - add SPDX header; - use copyright symbol; - adjust title and its markup; - comment out text-only TOC from html/pdf output; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. 
Miller --- .../device_drivers/aquantia/atlantic.rst | 556 +++++++++++++++++++++ .../device_drivers/aquantia/atlantic.txt | 479 ------------------ Documentation/networking/device_drivers/index.rst | 1 + MAINTAINERS | 2 +- 4 files changed, 558 insertions(+), 480 deletions(-) create mode 100644 Documentation/networking/device_drivers/aquantia/atlantic.rst delete mode 100644 Documentation/networking/device_drivers/aquantia/atlantic.txt (limited to 'MAINTAINERS') diff --git a/Documentation/networking/device_drivers/aquantia/atlantic.rst b/Documentation/networking/device_drivers/aquantia/atlantic.rst new file mode 100644 index 000000000000..595ddef1c8b3 --- /dev/null +++ b/Documentation/networking/device_drivers/aquantia/atlantic.rst @@ -0,0 +1,556 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: + +=============================== +Marvell(Aquantia) AQtion Driver +=============================== + +For the aQuantia Multi-Gigabit PCI Express Family of Ethernet Adapters + +.. Contents + + - Identifying Your Adapter + - Configuration + - Supported ethtool options + - Command Line Parameters + - Config file parameters + - Support + - License + +Identifying Your Adapter +======================== + +The driver in this release is compatible with AQC-100, AQC-107, AQC-108 +based ethernet adapters. + + +SFP+ Devices (for AQC-100 based adapters) +----------------------------------------- + +This release tested with passive Direct Attach Cables (DAC) and SFP+/LC +Optical Transceiver. + +Configuration +============= + +Viewing Link Messages +--------------------- + Link messages will not be displayed to the console if the distribution is + restricting system messages. In order to see network driver link messages on + your console, set dmesg to eight by entering the following:: + + dmesg -n 8 + + .. note:: + + This setting is not saved across reboots. + +Jumbo Frames +------------ + The driver supports Jumbo Frames for all adapters. 
Jumbo Frames support is + enabled by changing the MTU to a value larger than the default of 1500. + The maximum value for the MTU is 16000. Use the `ip` command to + increase the MTU size. For example:: + + ip link set mtu 16000 dev enp1s0 + +ethtool +------- + The driver utilizes the ethtool interface for driver configuration and + diagnostics, as well as displaying statistical information. The latest + ethtool version is required for this functionality. + +NAPI +---- + NAPI (Rx polling mode) is supported in the atlantic driver. + +Supported ethtool options +========================= + +Viewing adapter settings +------------------------ + + :: + + ethtool + + Output example:: + + Settings for enp1s0: + Supported ports: [ TP ] + Supported link modes: 100baseT/Full + 1000baseT/Full + 10000baseT/Full + 2500baseT/Full + 5000baseT/Full + Supported pause frame use: Symmetric + Supports auto-negotiation: Yes + Supported FEC modes: Not reported + Advertised link modes: 100baseT/Full + 1000baseT/Full + 10000baseT/Full + 2500baseT/Full + 5000baseT/Full + Advertised pause frame use: Symmetric + Advertised auto-negotiation: Yes + Advertised FEC modes: Not reported + Speed: 10000Mb/s + Duplex: Full + Port: Twisted Pair + PHYAD: 0 + Transceiver: internal + Auto-negotiation: on + MDI-X: Unknown + Supports Wake-on: g + Wake-on: d + Link detected: yes + + + .. note:: + + AQrate speeds (2.5/5 Gb/s) will be displayed only with linux kernels > 4.10. 
+ But you can still use these speeds:: + + ethtool -s eth0 autoneg off speed 2500 + +Viewing adapter information +--------------------------- + + :: + + ethtool -i + + Output example:: + + driver: atlantic + version: 5.2.0-050200rc5-generic-kern + firmware-version: 3.1.78 + expansion-rom-version: + bus-info: 0000:01:00.0 + supports-statistics: yes + supports-test: no + supports-eeprom-access: no + supports-register-dump: yes + supports-priv-flags: no + + +Viewing Ethernet adapter statistics +----------------------------------- + + :: + + ethtool -S + + Output example:: + + NIC statistics: + InPackets: 13238607 + InUCast: 13293852 + InMCast: 52 + InBCast: 3 + InErrors: 0 + OutPackets: 23703019 + OutUCast: 23704941 + OutMCast: 67 + OutBCast: 11 + InUCastOctects: 213182760 + OutUCastOctects: 22698443 + InMCastOctects: 6600 + OutMCastOctects: 8776 + InBCastOctects: 192 + OutBCastOctects: 704 + InOctects: 2131839552 + OutOctects: 226938073 + InPacketsDma: 95532300 + OutPacketsDma: 59503397 + InOctetsDma: 1137102462 + OutOctetsDma: 2394339518 + InDroppedDma: 0 + Queue[0] InPackets: 23567131 + Queue[0] OutPackets: 20070028 + Queue[0] InJumboPackets: 0 + Queue[0] InLroPackets: 0 + Queue[0] InErrors: 0 + Queue[1] InPackets: 45428967 + Queue[1] OutPackets: 11306178 + Queue[1] InJumboPackets: 0 + Queue[1] InLroPackets: 0 + Queue[1] InErrors: 0 + Queue[2] InPackets: 3187011 + Queue[2] OutPackets: 13080381 + Queue[2] InJumboPackets: 0 + Queue[2] InLroPackets: 0 + Queue[2] InErrors: 0 + Queue[3] InPackets: 23349136 + Queue[3] OutPackets: 15046810 + Queue[3] InJumboPackets: 0 + Queue[3] InLroPackets: 0 + Queue[3] InErrors: 0 + +Interrupt coalescing support +---------------------------- + + ITR mode, TX/RX coalescing timings could be viewed with:: + + ethtool -c + + and changed with:: + + ethtool -C tx-usecs rx-usecs + + To disable coalescing:: + + ethtool -C tx-usecs 0 rx-usecs 0 tx-max-frames 1 tx-max-frames 1 + +Wake on LAN support +------------------- + + WOL support by magic 
packet:: + + ethtool -s wol g + + To disable WOL:: + + ethtool -s wol d + +Set and check the driver message level +-------------------------------------- + + Set message level + + :: + + ethtool -s msglvl + + Level values: + + ====== ============================= + 0x0001 general driver status. + 0x0002 hardware probing. + 0x0004 link state. + 0x0008 periodic status check. + 0x0010 interface being brought down. + 0x0020 interface being brought up. + 0x0040 receive error. + 0x0080 transmit error. + 0x0200 interrupt handling. + 0x0400 transmit completion. + 0x0800 receive completion. + 0x1000 packet contents. + 0x2000 hardware status. + 0x4000 Wake-on-LAN status. + ====== ============================= + + By default, the level of debugging messages is set to 0x0001 (general driver status). + + Check message level + + :: + + ethtool | grep "Current message level" + + If you want to disable the output of messages:: + + ethtool -s msglvl 0 + +RX flow rules (ntuple filters) +------------------------------ + + Three separate rule types are supported; they are applied in this order: + + 1. 16 VLAN ID rules + 2. 16 L2 EtherType rules + 3. 8 L3/L4 5-Tuple rules + + + The driver utilizes the ethtool interface for configuring ntuple filters, + via ``ethtool -N ``. + + To enable or disable the RX flow rules:: + + ethtool -K ethX ntuple + + When disabling ntuple filters, all the user-programmed filters are + flushed from the driver cache and hardware. All needed filters must + be re-added when ntuple is re-enabled. + + Because of the fixed order of the rules, the location of filters is also fixed: + + - Locations 0 - 15 for VLAN ID filters + - Locations 16 - 31 for L2 EtherType filters + - Locations 32 - 39 for L3/L4 5-tuple filters (locations 32, 36 for IPv6) + + The L3/L4 5-tuple (protocol, source and destination IP address, source and + destination TCP/UDP/SCTP port) is compared against 8 filters. For IPv4, up to + 8 source and destination addresses can be matched.
For IPv6, up to 2 pairs of + addresses can be supported. Source and destination ports are only compared for + TCP/UDP/SCTP packets. + + To add a filter that directs packets to queue 5, use the + ``<-N|-U|--config-nfc|--config-ntuple>`` switch:: + + ethtool -N flow-type udp4 src-ip 10.0.0.1 dst-ip 10.0.0.2 src-port 2000 dst-port 2001 action 5 + + - action is the queue number. + - loc is the rule number. + + For ``flow-type ip4|udp4|tcp4|sctp4|ip6|udp6|tcp6|sctp6`` you must set the loc + number within 32 - 39. + For ``flow-type ip4|udp4|tcp4|sctp4|ip6|udp6|tcp6|sctp6`` you can set 8 rules + for IPv4 traffic or 2 rules for IPv6 traffic. The loc numbers for IPv6 + traffic are 32 and 36. + At the moment you cannot use IPv4 and IPv6 filters at the same time. + + Example filters for IPv6 traffic:: + + sudo ethtool -N flow-type tcp6 src-ip 2001:db8:0:f101::1 dst-ip 2001:db8:0:f101::2 action 1 loc 32 + sudo ethtool -N flow-type ip6 src-ip 2001:db8:0:f101::2 dst-ip 2001:db8:0:f101::5 action -1 loc 36 + + Example filters for IPv4 traffic:: + + sudo ethtool -N flow-type udp4 src-ip 10.0.0.4 dst-ip 10.0.0.7 src-port 2000 dst-port 2001 loc 32 + sudo ethtool -N flow-type tcp4 src-ip 10.0.0.3 dst-ip 10.0.0.9 src-port 2000 dst-port 2001 loc 33 + sudo ethtool -N flow-type ip4 src-ip 10.0.0.6 dst-ip 10.0.0.4 loc 34 + + If you set action -1, then all traffic corresponding to the filter will be discarded. + + The maximum action value is 31. + + + The VLAN filter (VLAN id) is compared against 16 filters. + VLAN id must be accompanied by mask 0xF000. This distinguishes a VLAN filter + from an L2 EtherType filter with UserPriority, since both User Priority and VLAN ID + are passed in the same 'vlan' parameter. + + To add a filter that directs packets from VLAN 2001 to queue 1:: + + ethtool -N flow-type ip4 vlan 2001 m 0xF000 action 1 loc 0 + + + L2 EtherType filters allow filtering packets by the EtherType field, or by both the + EtherType and User Priority (PCP) fields of 802.1Q.
+ UserPriority (vlan) parameter must be accompanied by mask 0x1FFF. This + distinguishes a VLAN filter from an L2 EtherType filter with UserPriority, since both + User Priority and VLAN ID are passed in the same 'vlan' parameter. + + To add a filter that directs IPv4 packets of priority 3 to queue 3:: + + ethtool -N flow-type ether proto 0x800 vlan 0x600 m 0x1FFF action 3 loc 16 + + To see the list of filters currently present:: + + ethtool <-u|-n|--show-nfc|--show-ntuple> + + Rules may be deleted from the table itself. This is done using:: + + sudo ethtool <-N|-U|--config-nfc|--config-ntuple> delete + + - loc is the rule number to be deleted. + + RX filters are an interface to load the filter table that funnels all flows + into queue 0 unless an alternative queue is specified using "action". In that + case, any flow that matches the filter criteria will be directed to the + appropriate queue. RX filters are supported on kernel 2.6.30 and later. + +RSS for UDP +----------- + + Currently, the NIC does not support RSS for fragmented IP packets, which leads to + incorrect RSS behavior for fragmented UDP traffic. To disable RSS for UDP, the + RX Flow L3/L4 rule may be used. + + Example:: + + ethtool -N eth0 flow-type udp4 action 0 loc 32 + +UDP GSO hardware offload +------------------------ + + UDP GSO boosts UDP Tx rates by offloading UDP header allocation + into hardware. This requires a special userspace socket option and + can be validated with /kernel/tools/testing/selftests/net/:: + + udpgso_bench_tx -u -4 -D 10.0.1.1 -s 6300 -S 100 + + This sends 100-byte UDP packets formed from a single + 6300-byte user buffer.
+ + UDP GSO is configured by:: + + ethtool -K eth0 tx-udp-segmentation on + +Private flags (testing) +----------------------- + + The atlantic driver supports private flags for custom hardware features:: + + $ ethtool --show-priv-flags ethX + + Private flags for ethX: + DMASystemLoopback : off + PKTSystemLoopback : off + DMANetworkLoopback : off + PHYInternalLoopback: off + PHYExternalLoopback: off + + Example:: + + $ ethtool --set-priv-flags ethX DMASystemLoopback on + + DMASystemLoopback: DMA Host loopback. + PKTSystemLoopback: Packet buffer host loopback. + DMANetworkLoopback: Network side loopback on DMA block. + PHYInternalLoopback: Internal loopback on Phy. + PHYExternalLoopback: External loopback on Phy (with loopback ethernet cable). + + +Command Line Parameters +======================= +The following command line parameters are available for the atlantic driver: + +aq_itr - Interrupt throttling mode +---------------------------------- +Accepted values: 0, 1, 0xFFFF + +Default value: 0xFFFF + +====== ============================================================== +0 Disable interrupt throttling. +1 Enable interrupt throttling and use specified tx and rx rates. +0xFFFF Auto throttling mode. Driver will choose the best RX and TX + interrupt throttling settings based on link speed. +====== ============================================================== + +aq_itr_tx - TX interrupt throttle rate +-------------------------------------- + +Accepted values: 0 - 0x1FF + +Default value: 0 + +TX side throttling in microseconds. The adapter sets the maximum interrupt delay +to this value. The minimum interrupt delay is half of this value. + +aq_itr_rx - RX interrupt throttle rate +-------------------------------------- + +Accepted values: 0 - 0x1FF + +Default value: 0 + +RX side throttling in microseconds. The adapter sets the maximum interrupt delay +to this value. The minimum interrupt delay is half of this value. + +..
note:: + + ITR settings can be changed at runtime with ethtool -C (see "Interrupt coalescing support" above). + +Config file parameters +====================== + +For some fine tuning and performance optimizations, +some parameters can be changed in the {source_dir}/aq_cfg.h file. + +AQ_CFG_RX_PAGEORDER +------------------- + +Default value: 0 + +RX page order override. The number of RX pages allocated for each descriptor +is 2 to the power of this value. Received descriptor size is still limited by +AQ_CFG_RX_FRAME_MAX. + +Increasing pageorder improves page reuse (relevant on IOMMU-enabled systems). + +AQ_CFG_RX_REFILL_THRES +---------------------- + +Default value: 32 + +RX refill threshold. RX path will not refill freed descriptors until the +specified number of free descriptors is observed. Larger values may +improve page reuse but may lead to packet drops as well. + +AQ_CFG_VECS_DEF +--------------- + +Number of queues + +Valid Range: 0 - 8 (up to AQ_CFG_VECS_MAX) + +Default value: 8 + +Note that this value will be capped by the number of cores available on the system. + +AQ_CFG_IS_RSS_DEF +----------------- + +Enable/disable Receive Side Scaling + +This feature allows the adapter to distribute receive processing +across multiple CPU cores, preventing a single CPU core from being overloaded. + +Valid values + +== ======== +0 disabled +1 enabled +== ======== + +Default value: 1 + +AQ_CFG_NUM_RSS_QUEUES_DEF +------------------------- + +Number of queues for Receive Side Scaling + +Valid Range: 0 - 8 (up to AQ_CFG_VECS_DEF) + +Default value: AQ_CFG_VECS_DEF + +AQ_CFG_IS_LRO_DEF +----------------- + +Enable/disable Large Receive Offload + +This offload enables the adapter to coalesce multiple TCP segments and indicate +them as a single coalesced unit to the OS networking subsystem. + +The system consumes less energy, but this also introduces more latency in packet +processing.
+ +Valid values + +== ======== +0 disabled +1 enabled +== ======== + +Default value: 1 + +AQ_CFG_TX_CLEAN_BUDGET +---------------------- + +Maximum number of descriptors to clean up on TX at once. + +Default value: 256 + +After the aq_cfg.h file is changed, the driver must be rebuilt for the changes to take effect. + +Support +======= + +If an issue is identified with the released source code on the supported +kernel with a supported adapter, email the specific information related +to the issue to aqn_support@marvell.com + +License +======= + +aQuantia Corporation Network Driver + +Copyright |copy| 2014 - 2019 aQuantia Corporation. + +This program is free software; you can redistribute it and/or modify it +under the terms and conditions of the GNU General Public License, +version 2, as published by the Free Software Foundation. diff --git a/Documentation/networking/device_drivers/aquantia/atlantic.txt b/Documentation/networking/device_drivers/aquantia/atlantic.txt deleted file mode 100644 index 2013fcedc2da..000000000000 --- a/Documentation/networking/device_drivers/aquantia/atlantic.txt +++ /dev/null @@ -1,479 +0,0 @@ -Marvell(Aquantia) AQtion Driver for the aQuantia Multi-Gigabit PCI Express -Family of Ethernet Adapters -============================================================================= - -Contents -======== - -- Identifying Your Adapter -- Configuration -- Supported ethtool options -- Command Line Parameters -- Config file parameters -- Support -- License - -Identifying Your Adapter -======================== - -The driver in this release is compatible with AQC-100, AQC-107, AQC-108 based ethernet adapters. - - -SFP+ Devices (for AQC-100 based adapters) ----------------------------------- - -This release tested with passive Direct Attach Cables (DAC) and SFP+/LC Optical Transceiver. - -Configuration -========================= - Viewing Link Messages - --------------------- - Link messages will not be displayed to the console if the distribution is - restricting system messages.
In order to see network driver link messages on - your console, set dmesg to eight by entering the following: - - dmesg -n 8 - - NOTE: This setting is not saved across reboots. - - Jumbo Frames - ------------ - The driver supports Jumbo Frames for all adapters. Jumbo Frames support is - enabled by changing the MTU to a value larger than the default of 1500. - The maximum value for the MTU is 16000. Use the `ip` command to - increase the MTU size. For example: - - ip link set mtu 16000 dev enp1s0 - - ethtool - ------- - The driver utilizes the ethtool interface for driver configuration and - diagnostics, as well as displaying statistical information. The latest - ethtool version is required for this functionality. - - NAPI - ---- - NAPI (Rx polling mode) is supported in the atlantic driver. - -Supported ethtool options -============================ - Viewing adapter settings - --------------------- - ethtool - - Output example: - - Settings for enp1s0: - Supported ports: [ TP ] - Supported link modes: 100baseT/Full - 1000baseT/Full - 10000baseT/Full - 2500baseT/Full - 5000baseT/Full - Supported pause frame use: Symmetric - Supports auto-negotiation: Yes - Supported FEC modes: Not reported - Advertised link modes: 100baseT/Full - 1000baseT/Full - 10000baseT/Full - 2500baseT/Full - 5000baseT/Full - Advertised pause frame use: Symmetric - Advertised auto-negotiation: Yes - Advertised FEC modes: Not reported - Speed: 10000Mb/s - Duplex: Full - Port: Twisted Pair - PHYAD: 0 - Transceiver: internal - Auto-negotiation: on - MDI-X: Unknown - Supports Wake-on: g - Wake-on: d - Link detected: yes - - --- - Note: AQrate speeds (2.5/5 Gb/s) will be displayed only with linux kernels > 4.10. 
- But you can still use these speeds: - ethtool -s eth0 autoneg off speed 2500 - - Viewing adapter information - --------------------- - ethtool -i - - Output example: - - driver: atlantic - version: 5.2.0-050200rc5-generic-kern - firmware-version: 3.1.78 - expansion-rom-version: - bus-info: 0000:01:00.0 - supports-statistics: yes - supports-test: no - supports-eeprom-access: no - supports-register-dump: yes - supports-priv-flags: no - - - Viewing Ethernet adapter statistics: - --------------------- - ethtool -S - - Output example: - NIC statistics: - InPackets: 13238607 - InUCast: 13293852 - InMCast: 52 - InBCast: 3 - InErrors: 0 - OutPackets: 23703019 - OutUCast: 23704941 - OutMCast: 67 - OutBCast: 11 - InUCastOctects: 213182760 - OutUCastOctects: 22698443 - InMCastOctects: 6600 - OutMCastOctects: 8776 - InBCastOctects: 192 - OutBCastOctects: 704 - InOctects: 2131839552 - OutOctects: 226938073 - InPacketsDma: 95532300 - OutPacketsDma: 59503397 - InOctetsDma: 1137102462 - OutOctetsDma: 2394339518 - InDroppedDma: 0 - Queue[0] InPackets: 23567131 - Queue[0] OutPackets: 20070028 - Queue[0] InJumboPackets: 0 - Queue[0] InLroPackets: 0 - Queue[0] InErrors: 0 - Queue[1] InPackets: 45428967 - Queue[1] OutPackets: 11306178 - Queue[1] InJumboPackets: 0 - Queue[1] InLroPackets: 0 - Queue[1] InErrors: 0 - Queue[2] InPackets: 3187011 - Queue[2] OutPackets: 13080381 - Queue[2] InJumboPackets: 0 - Queue[2] InLroPackets: 0 - Queue[2] InErrors: 0 - Queue[3] InPackets: 23349136 - Queue[3] OutPackets: 15046810 - Queue[3] InJumboPackets: 0 - Queue[3] InLroPackets: 0 - Queue[3] InErrors: 0 - - Interrupt coalescing support - --------------------------------- - ITR mode, TX/RX coalescing timings could be viewed with: - - ethtool -c - - and changed with: - - ethtool -C tx-usecs rx-usecs - - To disable coalescing: - - ethtool -C tx-usecs 0 rx-usecs 0 tx-max-frames 1 tx-max-frames 1 - - Wake on LAN support - --------------------------------- - - WOL support by magic packet: - - ethtool -s 
wol g - - To disable WOL: - - ethtool -s wol d - - Set and check the driver message level - --------------------------------- - - Set message level - - ethtool -s msglvl - - Level values: - - 0x0001 - general driver status. - 0x0002 - hardware probing. - 0x0004 - link state. - 0x0008 - periodic status check. - 0x0010 - interface being brought down. - 0x0020 - interface being brought up. - 0x0040 - receive error. - 0x0080 - transmit error. - 0x0200 - interrupt handling. - 0x0400 - transmit completion. - 0x0800 - receive completion. - 0x1000 - packet contents. - 0x2000 - hardware status. - 0x4000 - Wake-on-LAN status. - - By default, the level of debugging messages is set 0x0001(general driver status). - - Check message level - - ethtool | grep "Current message level" - - If you want to disable the output of messages - - ethtool -s msglvl 0 - - RX flow rules (ntuple filters) - --------------------------------- - There are separate rules supported, that applies in that order: - 1. 16 VLAN ID rules - 2. 16 L2 EtherType rules - 3. 8 L3/L4 5-Tuple rules - - - The driver utilizes the ethtool interface for configuring ntuple filters, - via "ethtool -N ". - - To enable or disable the RX flow rules: - - ethtool -K ethX ntuple - - When disabling ntuple filters, all the user programed filters are - flushed from the driver cache and hardware. All needed filters must - be re-added when ntuple is re-enabled. - - Because of the fixed order of the rules, the location of filters is also fixed: - - Locations 0 - 15 for VLAN ID filters - - Locations 16 - 31 for L2 EtherType filters - - Locations 32 - 39 for L3/L4 5-tuple filters (locations 32, 36 for IPv6) - - The L3/L4 5-tuple (protocol, source and destination IP address, source and - destination TCP/UDP/SCTP port) is compared against 8 filters. For IPv4, up to - 8 source and destination addresses can be matched. For IPv6, up to 2 pairs of - addresses can be supported. 
Source and destination ports are only compared for - TCP/UDP/SCTP packets. - - To add a filter that directs packet to queue 5, use <-N|-U|--config-nfc|--config-ntuple> switch: - - ethtool -N flow-type udp4 src-ip 10.0.0.1 dst-ip 10.0.0.2 src-port 2000 dst-port 2001 action 5 - - - action is the queue number. - - loc is the rule number. - - For "flow-type ip4|udp4|tcp4|sctp4|ip6|udp6|tcp6|sctp6" you must set the loc - number within 32 - 39. - For "flow-type ip4|udp4|tcp4|sctp4|ip6|udp6|tcp6|sctp6" you can set 8 rules - for traffic IPv4 or you can set 2 rules for traffic IPv6. Loc number traffic - IPv6 is 32 and 36. - At the moment you can not use IPv4 and IPv6 filters at the same time. - - Example filter for IPv6 filter traffic: - - sudo ethtool -N flow-type tcp6 src-ip 2001:db8:0:f101::1 dst-ip 2001:db8:0:f101::2 action 1 loc 32 - sudo ethtool -N flow-type ip6 src-ip 2001:db8:0:f101::2 dst-ip 2001:db8:0:f101::5 action -1 loc 36 - - Example filter for IPv4 filter traffic: - - sudo ethtool -N flow-type udp4 src-ip 10.0.0.4 dst-ip 10.0.0.7 src-port 2000 dst-port 2001 loc 32 - sudo ethtool -N flow-type tcp4 src-ip 10.0.0.3 dst-ip 10.0.0.9 src-port 2000 dst-port 2001 loc 33 - sudo ethtool -N flow-type ip4 src-ip 10.0.0.6 dst-ip 10.0.0.4 loc 34 - - If you set action -1, then all traffic corresponding to the filter will be discarded. - The maximum value action is 31. - - - The VLAN filter (VLAN id) is compared against 16 filters. - VLAN id must be accompanied by mask 0xF000. That is to distinguish VLAN filter - from L2 Ethertype filter with UserPriority since both User Priority and VLAN ID - are passed in the same 'vlan' parameter. - - To add a filter that directs packets from VLAN 2001 to queue 5: - ethtool -N flow-type ip4 vlan 2001 m 0xF000 action 1 loc 0 - - - L2 EtherType filters allows filter packet by EtherType field or both EtherType - and User Priority (PCP) field of 802.1Q. - UserPriority (vlan) parameter must be accompanied by mask 0x1FFF. 
That is to - distinguish VLAN filter from L2 Ethertype filter with UserPriority since both - User Priority and VLAN ID are passed in the same 'vlan' parameter. - - To add a filter that directs IP4 packess of priority 3 to queue 3: - ethtool -N flow-type ether proto 0x800 vlan 0x600 m 0x1FFF action 3 loc 16 - - - To see the list of filters currently present: - - ethtool <-u|-n|--show-nfc|--show-ntuple> - - Rules may be deleted from the table itself. This is done using: - - sudo ethtool <-N|-U|--config-nfc|--config-ntuple> delete - - - loc is the rule number to be deleted. - - Rx filters is an interface to load the filter table that funnels all flow - into queue 0 unless an alternative queue is specified using "action". In that - case, any flow that matches the filter criteria will be directed to the - appropriate queue. RX filters is supported on all kernels 2.6.30 and later. - - RSS for UDP - --------------------------------- - Currently, NIC does not support RSS for fragmented IP packets, which leads to - incorrect working of RSS for fragmented UDP traffic. To disable RSS for UDP the - RX Flow L3/L4 rule may be used. - - Example: - ethtool -N eth0 flow-type udp4 action 0 loc 32 - - UDP GSO hardware offload - --------------------------------- - UDP GSO allows to boost UDP tx rates by offloading UDP headers allocation - into hardware. A special userspace socket option is required for this, - could be validated with /kernel/tools/testing/selftests/net/ - - udpgso_bench_tx -u -4 -D 10.0.1.1 -s 6300 -S 100 - - Will cause sending out of 100 byte sized UDP packets formed from single - 6300 bytes user buffer. 
- - UDP GSO is configured by: - - ethtool -K eth0 tx-udp-segmentation on - - Private flags (testing) - --------------------------------- - - Atlantic driver supports private flags for hardware custom features: - - $ ethtool --show-priv-flags ethX - - Private flags for ethX: - DMASystemLoopback : off - PKTSystemLoopback : off - DMANetworkLoopback : off - PHYInternalLoopback: off - PHYExternalLoopback: off - - Example: - - $ ethtool --set-priv-flags ethX DMASystemLoopback on - - DMASystemLoopback: DMA Host loopback. - PKTSystemLoopback: Packet buffer host loopback. - DMANetworkLoopback: Network side loopback on DMA block. - PHYInternalLoopback: Internal loopback on Phy. - PHYExternalLoopback: External loopback on Phy (with loopback ethernet cable). - - -Command Line Parameters -======================= -The following command line parameters are available on atlantic driver: - -aq_itr -Interrupt throttling mode ----------------------------------------- -Accepted values: 0, 1, 0xFFFF -Default value: 0xFFFF -0 - Disable interrupt throttling. -1 - Enable interrupt throttling and use specified tx and rx rates. -0xFFFF - Auto throttling mode. Driver will choose the best RX and TX - interrupt throtting settings based on link speed. - -aq_itr_tx - TX interrupt throttle rate ----------------------------------------- -Accepted values: 0 - 0x1FF -Default value: 0 -TX side throttling in microseconds. Adapter will setup maximum interrupt delay -to this value. Minimum interrupt delay will be a half of this value - -aq_itr_rx - RX interrupt throttle rate ----------------------------------------- -Accepted values: 0 - 0x1FF -Default value: 0 -RX side throttling in microseconds. Adapter will setup maximum interrupt delay -to this value. 
Minimum interrupt delay will be a half of this value - -Note: ITR settings could be changed in runtime by ethtool -c means (see below) - -Config file parameters -======================= -For some fine tuning and performance optimizations, -some parameters can be changed in the {source_dir}/aq_cfg.h file. - -AQ_CFG_RX_PAGEORDER ----------------------------------------- -Default value: 0 -RX page order override. Thats a power of 2 number of RX pages allocated for -each descriptor. Received descriptor size is still limited by AQ_CFG_RX_FRAME_MAX. -Increasing pageorder makes page reuse better (actual on iommu enabled systems). - -AQ_CFG_RX_REFILL_THRES ----------------------------------------- -Default value: 32 -RX refill threshold. RX path will not refill freed descriptors until the -specified number of free descriptors is observed. Larger values may help -better page reuse but may lead to packet drops as well. - -AQ_CFG_VECS_DEF ------------------------------------------------------------- -Number of queues -Valid Range: 0 - 8 (up to AQ_CFG_VECS_MAX) -Default value: 8 -Notice this value will be capped by the number of cores available on the system. - -AQ_CFG_IS_RSS_DEF ------------------------------------------------------------- -Enable/disable Receive Side Scaling - -This feature allows the adapter to distribute receive processing -across multiple CPU-cores and to prevent from overloading a single CPU core. - -Valid values -0 - disabled -1 - enabled - -Default value: 1 - -AQ_CFG_NUM_RSS_QUEUES_DEF ------------------------------------------------------------- -Number of queues for Receive Side Scaling -Valid Range: 0 - 8 (up to AQ_CFG_VECS_DEF) - -Default value: AQ_CFG_VECS_DEF - -AQ_CFG_IS_LRO_DEF ------------------------------------------------------------- -Enable/disable Large Receive Offload - -This offload enables the adapter to coalesce multiple TCP segments and indicate -them as a single coalesced unit to the OS networking subsystem. 
-The system consumes less energy but it also introduces more latency in packets processing. - -Valid values -0 - disabled -1 - enabled - -Default value: 1 - -AQ_CFG_TX_CLEAN_BUDGET ----------------------------------------- -Maximum descriptors to cleanup on TX at once. -Default value: 256 - -After the aq_cfg.h file changed the driver must be rebuilt to take effect. - -Support -======= - -If an issue is identified with the released source code on the supported -kernel with a supported adapter, email the specific information related -to the issue to aqn_support@marvell.com - -License -======= - -aQuantia Corporation Network Driver -Copyright(c) 2014 - 2019 aQuantia Corporation. - -This program is free software; you can redistribute it and/or modify it -under the terms and conditions of the GNU General Public License, -version 2, as published by the Free Software Foundation. diff --git a/Documentation/networking/device_drivers/index.rst b/Documentation/networking/device_drivers/index.rst index 019a0d2efe67..7dde314fc957 100644 --- a/Documentation/networking/device_drivers/index.rst +++ b/Documentation/networking/device_drivers/index.rst @@ -30,6 +30,7 @@ Contents: 3com/3c509 3com/vortex amazon/ena + aquantia/atlantic .. 
only:: subproject and html diff --git a/MAINTAINERS b/MAINTAINERS index 7b6c13cc832f..b5cfee17635e 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -1275,7 +1275,7 @@ L: netdev@vger.kernel.org S: Supported W: https://www.marvell.com/ Q: http://patchwork.ozlabs.org/project/netdev/list/ -F: Documentation/networking/device_drivers/aquantia/atlantic.txt +F: Documentation/networking/device_drivers/aquantia/atlantic.rst F: drivers/net/ethernet/aquantia/atlantic/ AQUANTIA ETHERNET DRIVER PTP SUBSYSTEM -- cgit v1.2.3 From c981977d3a5ce55c96b1b77f42d0a9df0a79244e Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Fri, 1 May 2020 16:44:42 +0200 Subject: docs: networking: device drivers: convert dec/dmfe.txt to ReST - add SPDX header; - adjust titles and chapters, adding proper markups; - comment out text-only TOC from html/pdf output; - mark code blocks and literals as such; - mark tables as such; - adjust indentation, whitespace and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- .../networking/device_drivers/dec/dmfe.rst | 71 ++++++++++++++++++++++ .../networking/device_drivers/dec/dmfe.txt | 66 -------------------- Documentation/networking/device_drivers/index.rst | 1 + MAINTAINERS | 2 +- drivers/net/ethernet/dec/tulip/Kconfig | 2 +- 5 files changed, 74 insertions(+), 68 deletions(-) create mode 100644 Documentation/networking/device_drivers/dec/dmfe.rst delete mode 100644 Documentation/networking/device_drivers/dec/dmfe.txt (limited to 'MAINTAINERS') diff --git a/Documentation/networking/device_drivers/dec/dmfe.rst b/Documentation/networking/device_drivers/dec/dmfe.rst new file mode 100644 index 000000000000..c4cf809cad84 --- /dev/null +++ b/Documentation/networking/device_drivers/dec/dmfe.rst @@ -0,0 +1,71 @@ +..
SPDX-License-Identifier: GPL-2.0 + +============================================================== +Davicom DM9102(A)/DM9132/DM9801 fast ethernet driver for Linux +============================================================== + +Note: This driver doesn't have a maintainer. + + +This program is free software; you can redistribute it and/or +modify it under the terms of the GNU General Public License +as published by the Free Software Foundation; either version 2 +of the License, or (at your option) any later version. + +This program is distributed in the hope that it will be useful, +but WITHOUT ANY WARRANTY; without even the implied warranty of +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +GNU General Public License for more details. + + +This driver provides kernel support for Davicom DM9102(A)/DM9132/DM9801 ethernet cards (CNET +10/100 ethernet cards use the Davicom chipset too, so this driver supports CNET cards as well). If you +didn't compile this driver as a module, it will automatically load itself on boot and print a +line similar to:: + + dmfe: Davicom DM9xxx net driver, version 1.36.4 (2002-01-17) + +If you compiled this driver as a module, you have to load it on boot. You can load it with the command:: + + insmod dmfe + +This way it will autodetect the device mode. This is the suggested way to load the module. Or you can pass +a mode= setting to the module while loading, like:: + + insmod dmfe mode=0 # Force 10M Half Duplex + insmod dmfe mode=1 # Force 100M Half Duplex + insmod dmfe mode=4 # Force 10M Full Duplex + insmod dmfe mode=5 # Force 100M Full Duplex + +Next you should configure your network interface with a command similar to:: + + ifconfig eth0 172.22.3.18 + ^^^^^^^^^^^ + Your IP Address + +Then you may have to modify the default routing table with the command:: + + route add default eth0 + + +Now your ethernet card should be up and running. + + +TODO: + +- Implement pci_driver::suspend() and pci_driver::resume() power management methods.
+- Check on 64 bit boxes. +- Check and fix on big endian boxes. +- Test and make sure PCI latency is now correct for all cases. + + +Authors: + +Sten Wang : Original Author + +Contributors: + +- Marcelo Tosatti +- Alan Cox +- Jeff Garzik +- Vojtech Pavlik diff --git a/Documentation/networking/device_drivers/dec/dmfe.txt b/Documentation/networking/device_drivers/dec/dmfe.txt deleted file mode 100644 index 25320bf19c86..000000000000 --- a/Documentation/networking/device_drivers/dec/dmfe.txt +++ /dev/null @@ -1,66 +0,0 @@ -Note: This driver doesn't have a maintainer. - -Davicom DM9102(A)/DM9132/DM9801 fast ethernet driver for Linux. - -This program is free software; you can redistribute it and/or -modify it under the terms of the GNU General Public License -as published by the Free Software Foundation; either version 2 -of the License, or (at your option) any later version. - -This program is distributed in the hope that it will be useful, -but WITHOUT ANY WARRANTY; without even the implied warranty of -MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the -GNU General Public License for more details. 
- - -This driver provides kernel support for Davicom DM9102(A)/DM9132/DM9801 ethernet cards ( CNET -10/100 ethernet cards uses Davicom chipset too, so this driver supports CNET cards too ).If you -didn't compile this driver as a module, it will automatically load itself on boot and print a -line similar to : - - dmfe: Davicom DM9xxx net driver, version 1.36.4 (2002-01-17) - -If you compiled this driver as a module, you have to load it on boot.You can load it with command : - - insmod dmfe - -This way it will autodetect the device mode.This is the suggested way to load the module.Or you can pass -a mode= setting to module while loading, like : - - insmod dmfe mode=0 # Force 10M Half Duplex - insmod dmfe mode=1 # Force 100M Half Duplex - insmod dmfe mode=4 # Force 10M Full Duplex - insmod dmfe mode=5 # Force 100M Full Duplex - -Next you should configure your network interface with a command similar to : - - ifconfig eth0 172.22.3.18 - ^^^^^^^^^^^ - Your IP Address - -Then you may have to modify the default routing table with command : - - route add default eth0 - - -Now your ethernet card should be up and running. - - -TODO: - -Implement pci_driver::suspend() and pci_driver::resume() power management methods. -Check on 64 bit boxes. -Check and fix on big endian boxes. -Test and make sure PCI latency is now correct for all cases. - - -Authors: - -Sten Wang : Original Author - -Contributors: - -Marcelo Tosatti -Alan Cox -Jeff Garzik -Vojtech Pavlik diff --git a/Documentation/networking/device_drivers/index.rst b/Documentation/networking/device_drivers/index.rst index 4ad13ffb5800..09728e964ce1 100644 --- a/Documentation/networking/device_drivers/index.rst +++ b/Documentation/networking/device_drivers/index.rst @@ -35,6 +35,7 @@ Contents: cirrus/cs89x0 davicom/dm9000 dec/de4x5 + dec/dmfe .. 
only:: subproject and html diff --git a/MAINTAINERS b/MAINTAINERS index b5cfee17635e..f0b18c156176 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -4694,7 +4694,7 @@ F: net/ax25/sysctl_net_ax25.c DAVICOM FAST ETHERNET (DMFE) NETWORK DRIVER L: netdev@vger.kernel.org S: Orphan -F: Documentation/networking/device_drivers/dec/dmfe.txt +F: Documentation/networking/device_drivers/dec/dmfe.rst F: drivers/net/ethernet/dec/tulip/dmfe.c DC390/AM53C974 SCSI driver diff --git a/drivers/net/ethernet/dec/tulip/Kconfig b/drivers/net/ethernet/dec/tulip/Kconfig index 8c4245d94bb2..177f36f4b89d 100644 --- a/drivers/net/ethernet/dec/tulip/Kconfig +++ b/drivers/net/ethernet/dec/tulip/Kconfig @@ -138,7 +138,7 @@ config DM9102 This driver is for DM9102(A)/DM9132/DM9801 compatible PCI cards from Davicom (). If you have such a network (Ethernet) card, say Y. Some information is contained in the file - . + . To compile this driver as a module, choose M here. The module will be called dmfe. -- cgit v1.2.3 From cf7eba49b2b160f98106b33ca12039b05d812140 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Fri, 1 May 2020 16:44:46 +0200 Subject: docs: networking: device drivers: convert intel/ipw2100.txt to ReST - add SPDX header; - adjust titles and chapters, adding proper markups; - comment out text-only TOC from html/pdf output; - use copyright symbol; - use :field: markup; - mark code blocks and literals as such; - mark tables as such; - adjust indentation, whitespace and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S.
Miller --- Documentation/networking/device_drivers/index.rst | 1 + .../networking/device_drivers/intel/ipw2100.rst | 323 +++++++++++++++++++++ .../networking/device_drivers/intel/ipw2100.txt | 293 ------------------- MAINTAINERS | 2 +- drivers/net/wireless/intel/ipw2x00/Kconfig | 2 +- drivers/net/wireless/intel/ipw2x00/ipw2100.c | 2 +- 6 files changed, 327 insertions(+), 296 deletions(-) create mode 100644 Documentation/networking/device_drivers/intel/ipw2100.rst delete mode 100644 Documentation/networking/device_drivers/intel/ipw2100.txt (limited to 'MAINTAINERS') diff --git a/Documentation/networking/device_drivers/index.rst b/Documentation/networking/device_drivers/index.rst index cec3415ee459..54ed10f3d1a7 100644 --- a/Documentation/networking/device_drivers/index.rst +++ b/Documentation/networking/device_drivers/index.rst @@ -39,6 +39,7 @@ Contents: dlink/dl2k freescale/dpaa freescale/gianfar + intel/ipw2100 .. only:: subproject and html diff --git a/Documentation/networking/device_drivers/intel/ipw2100.rst b/Documentation/networking/device_drivers/intel/ipw2100.rst new file mode 100644 index 000000000000..d54ad522f937 --- /dev/null +++ b/Documentation/networking/device_drivers/intel/ipw2100.rst @@ -0,0 +1,323 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: + +=========================================== +Intel(R) PRO/Wireless 2100 Driver for Linux +=========================================== + +Support for: + +- Intel(R) PRO/Wireless 2100 Network Connection + +Copyright |copy| 2003-2006, Intel Corporation + +README.ipw2100 + +:Version: git-1.1.5 +:Date: January 25, 2006 + +.. Index + + 0. IMPORTANT INFORMATION BEFORE USING THIS DRIVER + 1. Introduction + 2. Release git-1.1.5 Current Features + 3. Command Line Parameters + 4. Sysfs Helper Files + 5. Radio Kill Switch + 6. Dynamic Firmware + 7. Power Management + 8. Support + 9. License + + +0. 
IMPORTANT INFORMATION BEFORE USING THIS DRIVER +================================================= + +Important Notice FOR ALL USERS OR DISTRIBUTORS!!!! + +Intel wireless LAN adapters are engineered, manufactured, tested, and +quality checked to ensure that they meet all necessary local and +governmental regulatory agency requirements for the regions that they +are designated and/or marked to ship into. Since wireless LANs are +generally unlicensed devices that share spectrum with radars, +satellites, and other licensed and unlicensed devices, it is sometimes +necessary to dynamically detect, avoid, and limit usage to avoid +interference with these devices. In many instances Intel is required to +provide test data to prove regional and local compliance to regional and +governmental regulations before certification or approval to use the +product is granted. Intel's wireless LAN's EEPROM, firmware, and +software driver are designed to carefully control parameters that affect +radio operation and to ensure electromagnetic compliance (EMC). These +parameters include, without limitation, RF power, spectrum usage, +channel scanning, and human exposure. + +For these reasons Intel cannot permit any manipulation by third parties +of the software provided in binary format with the wireless WLAN +adapters (e.g., the EEPROM and firmware). 
Furthermore, if you use any +patches, utilities, or code with the Intel wireless LAN adapters that +have been manipulated by an unauthorized party (i.e., patches, +utilities, or code (including open source code modifications) which have +not been validated by Intel), (i) you will be solely responsible for +ensuring the regulatory compliance of the products, (ii) Intel will bear +no liability, under any theory of liability for any issues associated +with the modified products, including without limitation, claims under +the warranty and/or issues arising from regulatory non-compliance, and +(iii) Intel will not provide or be required to assist in providing +support to any third parties for such modified products. + +Note: Many regulatory agencies consider Wireless LAN adapters to be +modules, and accordingly, condition system-level regulatory approval +upon receipt and review of test data documenting that the antennas and +system configuration do not cause the EMC and radio operation to be +non-compliant. + +The drivers available for download from SourceForge are provided as a +part of a development project. Conformance to local regulatory +requirements is the responsibility of the individual developer. As +such, if you are interested in deploying or shipping a driver as part of a +solution intended to be used for purposes other than development, please +obtain a tested driver from Intel Customer Support at: + +http://www.intel.com/support/wireless/sb/CS-006408.htm + +1. Introduction +=============== + +This document provides a brief overview of the features supported by the +IPW2100 driver project. The main project website, where the latest +development version of the driver can be found, is: + + http://ipw2100.sourceforge.net + +There you can find not only the latest releases, but also information about +potential fixes and patches, as well as links to the development mailing list +for the driver project. + + +2.
Release git-1.1.5 Current Supported Features +=============================================== + +- Managed (BSS) and Ad-Hoc (IBSS) +- WEP (shared key and open) +- Wireless Tools support +- 802.1x (tested with XSupplicant 1.0.1) + +Enabled (but not supported) features: +- Monitor/RFMon mode +- WPA/WPA2 + +The distinction between officially supported and enabled is a reflection +on the amount of validation and interoperability testing that has been +performed on a given feature. + + +3. Command Line Parameters +========================== + +If the driver is built as a module, the following optional parameters are used +by entering them on the command line with the modprobe command using this +syntax:: + + modprobe ipw2100 [