Portal:DeveloperDocs/nftables internals
This page contains information for Netfilter developers on how nftables internals work.
The kernel subsystem
The nf_tables kernel subsystem contains 2 key components:
- the netlink API (i.e., the control plane API)
- the nf_tables core (i.e., the data plane engine)
Other components, such as external modules, are also in place and are intermixed with both the API and the core.
Generally speaking, the nf_tables subsystem implements a virtual machine of low-level expressions that operate on network packets.
TODO: add info.
nf_tables netlink API
The source code is mostly in net/netfilter/nf_tables_api.c [elixir src] [git src]
TODO: add info.
nf_tables core
The source code is mostly in net/netfilter/nf_tables_core.c [elixir src] [git src]
There you can see one of the most important functions in the core: nft_do_chain(). In a nutshell, this is the function that evaluates network packets against the ruleset.
The logic in this function is rather simple:
- for each rule in the chain
  - for each low level expression in the rule
    - evaluate the packet against the expression
    - evaluate expression return code (break, continue, drop, accept, jump, goto, etc)
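The loop above can be sketched in plain C. This is a toy model under stated assumptions, not the kernel's implementation: the verdict codes, the `struct expr` layout and the single data register are hypothetical, and jump/goto handling is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical verdict codes; the kernel's real values differ. */
enum verdict { V_CONTINUE, V_BREAK, V_ACCEPT, V_DROP };

enum expr_type { EXPR_PAYLOAD, EXPR_CMP, EXPR_IMMEDIATE };

struct expr {
    enum expr_type type;
    uint32_t arg;   /* payload: byte offset; cmp: value; immediate: verdict */
};

struct rule {
    const struct expr *exprs;
    size_t n_exprs;
};

struct regs {
    enum verdict verdict;
    uint32_t data;  /* a single data register is enough for this sketch */
};

/* Evaluate one expression against the packet, updating the registers. */
static void expr_eval(const struct expr *e, struct regs *regs,
                      const uint8_t *pkt, size_t len)
{
    switch (e->type) {
    case EXPR_PAYLOAD:
        if (e->arg < len)
            regs->data = pkt[e->arg];
        else
            regs->verdict = V_BREAK;  /* not enough data: rule cannot match */
        break;
    case EXPR_CMP:
        if (regs->data != e->arg)
            regs->verdict = V_BREAK;  /* mismatch: stop evaluating this rule */
        break;
    case EXPR_IMMEDIATE:
        regs->verdict = (enum verdict)e->arg;
        break;
    }
}

/* Walk all rules of a chain; return the first final verdict, else the policy. */
enum verdict do_chain(const struct rule *rules, size_t n_rules,
                      const uint8_t *pkt, size_t len, enum verdict policy)
{
    for (size_t r = 0; r < n_rules; r++) {
        struct regs regs = { .verdict = V_CONTINUE, .data = 0 };

        for (size_t i = 0; i < rules[r].n_exprs; i++) {
            expr_eval(&rules[r].exprs[i], &regs, pkt, len);
            if (regs.verdict != V_CONTINUE)
                break;
        }
        if (regs.verdict == V_ACCEPT || regs.verdict == V_DROP)
            return regs.verdict;
        /* V_BREAK or V_CONTINUE: move on to the next rule */
    }
    return policy;
}
```

A rule made of a payload load, a compare against 6 and an immediate drop verdict returns `V_DROP` for a packet whose first byte is 6. Any other packet falls through to the policy.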
TODO: add info.
expressions
There are many low level expressions that allow us to operate on network packets in different ways. You can think of these low level expressions as assembly-like instructions.
- nft_immediate: loads an immediate value into a register.
- nft_cmp: compares given data with data from a given register.
- nft_payload: set/get arbitrary data from packet headers.
- nft_bitwise: perform bit-wise math operations over data in a given register.
- nft_byteorder: perform byte order operations over data in a given register.
- nft_counter: a basic packet/byte counter that is incremented every time it is evaluated for a packet.
- nft_meta: set/get packet meta information, such as related interfaces, timestamps, etc.
- nft_lookup: searches a set for the data (key) held in a given register. If the set is a map/vmap, returns the value for that key.
Relation to User Space
In user space terminology, there's a distinction between statements like counter, jump or log and expressions like ip saddr, tcp dport or meta iifname. What distinguishes them is the fact that statements are valid ruleset elements on their own while expressions are usually arguments to a statement. For instance, take the following payload statement:
ip dscp set 42
Here, ip dscp is an expression identifying what part of the packet payload to mangle, and 42 is a constant expression holding the value to assign. There are certain limits as to what may appear after the set keyword; nft does some type checking to make sure it is compatible. But to illustrate the power this concept has, take the following example:
tcp dport set tcp sport
This will mangle a TCP packet's destination port to match whatever its source port value may be. Although somewhat contrived, this is an example of a statement accepting data from two expressions. Because kernel space does not make this distinction, some kind of translation must happen. To analyse this, nft's --debug=netlink flag is handy:
nft --debug=netlink add rule t c tcp dport set tcp sport
ip t c
  [ meta load l4proto => reg 1 ]
  [ cmp eq reg 1 0x00000006 ]
  [ payload load 2b @ transport header + 0 => reg 1 ]
  [ payload write reg 1 => 2b @ transport header + 2 csum_type 1 csum_off 16 csum_flags 0x0 ]
It prints the (libnftnl/kernel) expressions a rule consists of. This reveals some interesting details:
- There is an invisible match on meta l4proto at the start of the rule. This is a dependency imposed by the tcp sport/dport expressions, which only work correctly with TCP packets. To avoid unexpected results, nft makes sure non-TCP packets won't match the rule right from the start.
- Registers enable data to pass between expressions: The first expression loads the packet's layer4 protocol value into reg 1, the second expression compares reg 1's value against the value 0x6.
- There are two "variants" of a payload expression: one that merely loads data from the packet and one that writes it. One could say the first one is the payload expression and the latter the payload statement in nft's nomenclature.
- The second payload expression which writes 2B from reg 1 into the packet at the offset of the transport header plus 2B (the position of the TCP header's Destination Port field), also holds extra information for a partial checksum update.
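The two payload variants in the debug output can be modelled with a small C sketch. It assumes a hypothetical 16-byte register and operates on a raw transport header buffer. The checksum update that the real payload write carries (csum_type, csum_off) is deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REG_SIZE 16   /* hypothetical register width in bytes */

struct regs {
    uint8_t reg1[REG_SIZE];
};

/* "payload load": copy len bytes at base + offset from the packet into reg 1. */
int payload_load(struct regs *regs, const uint8_t *base, size_t base_len,
                 size_t offset, size_t len)
{
    if (len > REG_SIZE || offset + len > base_len)
        return -1;
    memcpy(regs->reg1, base + offset, len);
    return 0;
}

/* "payload write": copy len bytes from reg 1 into the packet at base + offset. */
int payload_write(const struct regs *regs, uint8_t *base, size_t base_len,
                  size_t offset, size_t len)
{
    if (len > REG_SIZE || offset + len > base_len)
        return -1;
    memcpy(base + offset, regs->reg1, len);
    return 0;
}

/* Rough equivalent of "tcp dport set tcp sport" on a raw transport header:
 * load 2 bytes at offset 0 (source port), write them at offset 2. */
int set_dport_from_sport(uint8_t *th, size_t th_len)
{
    struct regs regs;

    if (payload_load(&regs, th, th_len, 0, 2) < 0)
        return -1;
    return payload_write(&regs, th, th_len, 2, 2);
}
```

The register is the only channel between the two operations, which mirrors how reg 1 links the load and the write in the listing.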
The userspace components
There are several important components in the userspace part of nftables:
- libmnl: generic low level library used to communicate with the kernel using netlink sockets.
- libnftnl: low level library that is capable of interacting with the nf_tables subsystem netlink API in the kernel. It is responsible for creating/parsing the nf_tables netlink messages. Uses libmnl under the hood.
- libnftables: high level library that implements the logic to translate from high level statements to netlink objects and the other way around. Uses libnftnl under the hood.
- nft: the command line interface binary. This is what most end users actually use in their systems. It reads user input and calls libnftables under the hood.
Generally speaking, userspace compiles high level statements (rules, etc.) into the netlink bytecode that the kernel API understands. When inspecting the ruleset (i.e., listing it), it does the opposite: it reconstructs high level statements from the low level netlink bytecode.
libnftnl
This library provides data structures for entities existing in nf_tables nomenclature, such as tables, chains and rules. It serves as an intermediate layer between nftables and iptables-nft user space applications and nfnetlink messages the kernel sends and receives.
In general, each data structure comes with a set of handling routines:
- allocators: allocate and free an object of a given type
- setters/getters: access data structure fields via an attribute number (a value of a type-specific enum)
- serializers: populate a netlink message from the object or vice versa
- printers: provide a textual representation, mostly for debugging purposes
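The following sketch illustrates the pattern of the handling routines listed above with a made-up `toy_table` type. None of these names are part of libnftnl; they only mimic the shape of its API (opaque object, attribute-numbered access, printer). Serializers are omitted.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical attribute numbers, standing in for a per-type enum. */
enum toy_table_attr { TOY_TABLE_NAME, TOY_TABLE_FAMILY, TOY_TABLE_ATTR_MAX };

struct toy_table {
    char name[32];
    uint32_t family;
    uint32_t flags;   /* bit n is set once attribute n holds a value */
};

/* Allocators: create and destroy an object. */
struct toy_table *toy_table_alloc(void)
{
    return calloc(1, sizeof(struct toy_table));
}

void toy_table_free(struct toy_table *t)
{
    free(t);
}

/* Setter: the caller names the field by attribute number, not by member. */
int toy_table_set(struct toy_table *t, enum toy_table_attr attr,
                  const void *data, size_t len)
{
    switch (attr) {
    case TOY_TABLE_NAME:
        if (len == 0 || len >= sizeof(t->name))
            return -1;
        snprintf(t->name, sizeof(t->name), "%.*s", (int)len,
                 (const char *)data);
        break;
    case TOY_TABLE_FAMILY:
        if (len != sizeof(t->family))
            return -1;
        memcpy(&t->family, data, len);
        break;
    default:
        return -1;
    }
    t->flags |= 1u << attr;
    return 0;
}

/* Getter: returns NULL for attributes that were never set. */
const void *toy_table_get(const struct toy_table *t, enum toy_table_attr attr)
{
    if (attr >= TOY_TABLE_ATTR_MAX || !(t->flags & (1u << attr)))
        return NULL;
    return attr == TOY_TABLE_NAME ? (const void *)t->name
                                  : (const void *)&t->family;
}

/* Printer: textual form, mostly useful for debugging. */
int toy_table_snprintf(char *buf, size_t size, const struct toy_table *t)
{
    return snprintf(buf, size, "table %s family %u", t->name,
                    (unsigned int)t->family);
}
```

Keeping the struct layout behind attribute numbers is what lets a library add fields without breaking callers, at the cost of less type checking at the call site.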
Where sensible, there is a list-variant, too. If so, it comes with handling routines as well:
- allocators: allocate and free the list object (and its members)
- populators: add to and remove from the list
Where useful, there might be a lookup routine as well. For example, the nftnl_chain_list object also contains a hash table of chain names, so a lookup by chain name is faster than a linear search.
A typical extra for list objects is an iterator: a data structure that holds state while browsing through the list. Usually the only routines used are the allocators and a next routine.
These are the entities defined by libnftnl:
- table: a rather boring "namespace" for chains
- chain: a container for rules, may attach to a netfilter hook in the kernel
- rule: a container for expressions
- expr: an nftables VM code instruction
- flowtable: similar to a chain, but holds flows between interfaces
- obj: a generic object, typically holding stateful information
- ruleset: a container for lists of tables, chains, sets and rules (not used by the nftables application anymore)
- set: a container for elements
- set_elem: a set element
- trace: a trace event sent by the kernel
nftnl_expr
While nftables distinguishes between expressions and statements, this difference does not really exist in the libnftnl layer. For instance, a statement like:
ip saddr 192.168.0.1
is actually two expressions:
- payload: loading the IPv4 header's source address into a register
- cmp: comparing data from a register against a stored value
Since expressions have access to the packet, its meta data, all nftables registers (including the verdict register) and may store multiple values internally, they are mighty and versatile.
nftnl_obj
This is a common API for various object types. An object's type is defined post allocation by setting the NFTNL_OBJ_TYPE attribute. Currently existing object types are:
- counter
- quota
- ct helper
- limit
- tunnel
- ct timeout
- secmark
- ct expect
- synproxy
nftnl_batch
This is a wrapper interface around the same functionality in libmnl (which is used internally). In general, nftnl batches aid in collecting multiple netlink messages for kernel submission.
libnftables
One goal in nftables development was to provide users with a library for easier integration into applications than "shelling out" using system() and trying to parse nft command output.
At first, libnftnl was supposed to achieve this, but it exposes internal implementation details and is pretty low-level in general, which made it rather unsuitable from a user's perspective.
To overcome this, nft backend code was separated into a library which should fill the gap between libnftnl on one side and nft application itself on the other.
Usage of libnftables is supposed to be simple and straightforward, almost like calling nft itself but with a bit more convenience. First step is to create a new context:
struct nft_ctx *ctx = nft_ctx_new(0);
The context allows configuring library behaviour on a "per session" basis. With this in place, nftables commands may be executed:
int rc = nft_run_cmd_from_buffer(ctx, "add table inet t");
or whole dump files loaded:
int rc = nft_run_cmd_from_filename(ctx, "/etc/nftables/all-in-one.nft");
To control output, there are a number of functions:
FILE *nft_ctx_set_output(struct nft_ctx *ctx, FILE *fp);
int nft_ctx_buffer_output(struct nft_ctx *ctx);
int nft_ctx_unbuffer_output(struct nft_ctx *ctx);
const char *nft_ctx_get_output_buffer(struct nft_ctx *ctx);
Equivalent functions exist for stderr. See the libnftables(3) man page for further details.
nft: from user space to the kernel
The following describes the steps and entities involved after a call to nft in user space until the actual communication with the kernel.
Since the creation of libnftables, nft is merely a lightweight front-end: it creates a libnftables handle, configures it according to the command-line options and feeds nftables syntax into it. The actual work takes place within the library. It may be divided into several phases:
- Input parsing into internal data structures
- Evaluation and expansion
- Serialization into netlink messages
- nfnetlink message session with kernel
- Error handling
Input parsing into internal data structures
Depending on whether input comes from command line or a file (which may be stdin), main() calls either nft_run_cmd_from_buffer() or nft_run_cmd_from_filename() library functions.
If JSON output was selected (nft -j), the JSON parser (in src/parser_json.c) is tried first. If this did not succeed, the standard ("human-readable") syntax parser is called.
Eventually both parsers populate a list of commands (struct cmd) and a list of error messages (struct error_record) in case errors were detected.
Standard syntax
The parser for standard syntax is implemented in lex and yacc, see src/scanner.l and src/parser_bison.y for reference. It is entered via the generated function nft_parse().
As a basic rule, in lex/yacc the scanner recognizes the words and the parser interprets them in their context. There is also (limited) scanner control from the parser by definition of a scope in which some words are valid or not. The parser defines recursive patterns to match input against. Here is the top-most one, input:
input : /* empty */
      | input line
        {
            if ($2 != NULL) {
                $2->location = @2;
                list_add_tail(&$2->list, state->cmds);
            }
        }
      ;
So it may be empty or (by recursion) consist of a number of line patterns. Each of those lines parses into a command and is appended to the list. The snippet above also shows how parser-provided location data is stored in the command object. This is used for error reporting.
JSON syntax
The JSON parser lives in src/parser_json.c and is entered via the nft_parse_json_buffer() or nft_parse_json_filename() function, respectively. It uses the jansson library for (de-)serialization and value (un-)packing. To learn about the code and to understand the program flow, the json_parse_cmd() function is a good starting point.
Evaluation and expansion
Input evaluation is a crucial step and combines several tasks. It extends input validation from the mere syntax checks done by the parser to semantic ones that take context into account.
Input may be changed, too. Sometimes it is necessary to insert extra statements as dependencies; sometimes the type of the right hand side of a comparison must be adjusted to the type of the left hand side.
Before all the above, the list of commands is scanned for cache requirements - see nft_cache_evaluate() for details. Since caching may be an expensive operation if in-kernel ruleset is huge, this step attempts to reduce the data fetched from kernel to the bare minimum needed for correct operation. A final call to nft_cache_update() then does the actual fetch.
If evaluation passed, expansion takes place. This is mostly to cover for input in "dump" notation, i.e. rules nested in chains nested in tables, etc. Such input is converted into individual "add" commands as required by the netlink message format. The code is pretty straightforward, see nft_cmd_expand() for reference.
Serialization into netlink messages
In this step, nftables-internal data types are converted into libnftnl ones (e.g., struct table into struct nftnl_table). The latter abstract their internal layout as attributes and are therefore opaque to the caller.
libnftnl provides helpers to convert its own data structures into netlink message format: A generic nftnl_nlmsg_build_hdr() for the header and type-specific ones for the payload (e.g., nftnl_table_nlmsg_build_payload()).
The netlink messages are stored in a struct nftnl_batch which provides the backing storage. This surrounding data structure serializes into an introductory NFNL_MSG_BATCH_BEGIN message and a finalizing one with type NFNL_MSG_BATCH_END.
In kernel space, the batch constitutes a transaction: If one of the messages is rejected, none of them take effect. Likewise, if the final batch end message is missing, the whole batch is undone. This is how nft's --check option is implemented.
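The all-or-nothing behaviour of a batch can be illustrated with a self-contained C model. It is a simplification with hypothetical names: the message types, `struct ruleset` and `process_batch()` are invented for this sketch, and changes are staged in a private copy of the ruleset, which is only one possible way to get the same effect.

```c
#include <stddef.h>
#include <string.h>

#define MAX_TABLES 8

enum msg_type { MSG_BATCH_BEGIN, MSG_ADD_TABLE, MSG_DEL_TABLE, MSG_BATCH_END };

struct msg {
    enum msg_type type;
    const char *name;   /* table name; unused for begin/end markers */
};

struct ruleset {
    const char *tables[MAX_TABLES];
    size_t n_tables;
};

static int find_table(const struct ruleset *rs, const char *name)
{
    for (size_t i = 0; i < rs->n_tables; i++)
        if (strcmp(rs->tables[i], name) == 0)
            return (int)i;
    return -1;
}

/* Apply one add/delete message to the working copy; -1 means rejected. */
static int apply_msg(struct ruleset *rs, const struct msg *m)
{
    int idx = find_table(rs, m->name);

    if (m->type == MSG_ADD_TABLE) {
        if (idx >= 0 || rs->n_tables == MAX_TABLES)
            return -1;
        rs->tables[rs->n_tables++] = m->name;
        return 0;
    }
    if (idx < 0)        /* MSG_DEL_TABLE on a table that does not exist */
        return -1;
    rs->tables[idx] = rs->tables[--rs->n_tables];
    return 0;
}

/* The live ruleset changes only if every message succeeds and the batch
 * is properly framed. Returns 0 on commit, -1 on abort. */
int process_batch(struct ruleset *live, const struct msg *msgs, size_t n)
{
    struct ruleset work = *live;   /* changes go to a private copy first */

    if (n < 2 || msgs[0].type != MSG_BATCH_BEGIN ||
        msgs[n - 1].type != MSG_BATCH_END)
        return -1;                 /* missing begin or end marker: abort */

    for (size_t i = 1; i + 1 < n; i++) {
        if (msgs[i].type != MSG_ADD_TABLE && msgs[i].type != MSG_DEL_TABLE)
            return -1;
        if (apply_msg(&work, &msgs[i]) < 0)
            return -1;             /* one rejection aborts the whole batch */
    }
    *live = work;                  /* commit */
    return 0;
}
```

Sending a batch without the end marker exercises every check but leaves the live ruleset untouched, which is the idea behind a dry run.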
nfnetlink message session with kernel
In nft, communication with the kernel takes place in the function mnl_batch_talk(): It converts the nftnl_batch into a message suitable for sendmsg(), adjusts buffer sizes (if needed), transmits the data and listens for a reply. Any error messages are handled by mnl_batch_extack_cb() function which records them for later reporting. Other messages are relevant for --echo mode, in which the kernel "echoes" the requests back after updating them (with handle values, for instance). These are handled by netlink_echo_callback(), more or less a wrapper around nft's event monitoring code.
Error handling
Each struct cmd object is identified by its own sequence number (monotonic within the batch). Netlink error messages contain this number and also an offset value, which together identify not only the problematic message but also the specific attribute of that message which was rejected.
Mapping from message attribute back to line or word(s) of input works via a mapping from attribute offset to the struct location object stored while parsing. That bison parser-provided data holds line and column numbers, allowing nft to underline problematic parts of input when reporting back to the caller.
To follow the above in the source code, see nft_cmd_error() function being called for each command and error it caused. The mapping is established earlier while creating netlink messages, i.e. in code called from do_command() - watch out for the various calls to cmd_add_loc() populating the field attr in struct cmd.
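The offset-to-location mapping described above can be sketched as follows. The structures are simplified, hypothetical stand-ins and do not match nft's real `struct cmd` or `struct location` layouts. `cmd_record_loc()` and `error_location()` are illustrative names, not functions from the nft source.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for nft's location tracking. */
struct location {
    unsigned int line;
    unsigned int first_column;
    unsigned int last_column;
};

struct attr_loc {
    uint16_t offset;              /* attribute offset within the message */
    const struct location *loc;   /* input span that produced it */
};

#define MAX_ATTR_LOCS 8

struct cmd {
    uint32_t seqnum;              /* sequence number of the netlink message */
    struct location location;     /* span of the whole command */
    struct attr_loc attr[MAX_ATTR_LOCS];
    size_t num_attrs;
};

/* While building the message: remember which input span produced the
 * attribute at the given offset. */
int cmd_record_loc(struct cmd *cmd, uint16_t offset, const struct location *loc)
{
    if (cmd->num_attrs == MAX_ATTR_LOCS)
        return -1;
    cmd->attr[cmd->num_attrs].offset = offset;
    cmd->attr[cmd->num_attrs].loc = loc;
    cmd->num_attrs++;
    return 0;
}

/* On error: resolve (seqnum, offset) to the input span to underline.
 * Falls back to the whole command when the offset is unknown. */
const struct location *error_location(const struct cmd *cmds, size_t n,
                                      uint32_t seqnum, uint16_t offset)
{
    for (size_t i = 0; i < n; i++) {
        if (cmds[i].seqnum != seqnum)
            continue;
        for (size_t j = 0; j < cmds[i].num_attrs; j++)
            if (cmds[i].attr[j].offset == offset)
                return cmds[i].attr[j].loc;
        return &cmds[i].location;
    }
    return NULL;                  /* no command with that sequence number */
}
```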
nft: from the kernel to user space
Communication between nft in user space and nftables in the kernel happens via netlink, a packet-based IPC mechanism designed for that purpose. Its kernel source code lives in the net/netlink directory and can be extended by calling netlink_kernel_create(), passing a unique unit number and a struct netlink_kernel_cfg object.
nfnetlink is such an extension, attempting to serve all netfilter-related user space applications. It is implemented in net/netfilter/nfnetlink.c and can itself be extended by means of groups (see nfnetlink_groups in include/uapi/linux/netfilter/nfnetlink.h). These in turn map to nfnetlink subsystems, see the constant array nfnl_group2type in the nfnetlink source file. NFNL_SUBSYS_NFTABLES is the relevant one here, implemented in net/netfilter/nf_tables_api.c (see nf_tables_subsys and the call to nfnetlink_subsys_register() in there).
For insight, it is worthwhile to remain in generic nfnetlink code for a little longer: nfnetlink_net_ops are registered as a "pernet" subsystem, i.e. each network namespace gets its own instance. Upon netns creation, nfnetlink_net_init() is called which actually creates the NETLINK_NETFILTER subsystem. Its receive callback (nfnetlink_rcv()) checks whether the first message header starts a batch and diverts the code flow accordingly.
For batch handling, subsystems need to define commit and abort callbacks. Also, for each contained message, there must be a responsible callback entry with type NFNL_CB_BATCH. nf_tables_subsys fulfills these requirements.
Each callback in nf_tables_cb (and therefore each supported message type) decides whether it must be part of a batch or not, because nfnetlink code does not allow multiple handlers for the same message. In nftables, only the getters for the different ruleset elements are non-batched; anything that modifies the ruleset is batched.
Non-batched handlers
These are getters for:
- table
- chain
- rule
- set
- set element
- generation ID
- stateful objects
- flowtable
They all behave similarly: Unless the NLM_F_DUMP flag is set in the message, they perform a lookup based on the required identifiers and return an nfnetlink message to user space. There are type-specific helpers populating a netlink message named nf_tables_fill_<SOMETHING>_info; packet sending is done by a call to nfnetlink_unicast().
If NLM_F_DUMP was given, the getter iterates over all ruleset elements of given type and fills a netlink message for each. In some cases, filtering the output by identifiers given in the request is supported - useful to dump e.g. all rules of a specific chain only.
The iterator code is a bit complicated due to the fact that socket buffer size may be exceeded. In that case, partial data is submitted to user space and the dump continued afterwards. The iterators keep a "cursor" (actually a counter) for where to pick up again.
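The resumable dump can be modelled with a counter-based cursor, as in this minimal C sketch. The element type (plain integers), `struct dump_state` and the function names are assumptions made for illustration; real dumps fill netlink messages into a socket buffer instead of copying integers.

```c
#include <stddef.h>

/* Hypothetical dump state kept between calls. */
struct dump_state {
    size_t cursor;   /* index of the first element not yet sent */
};

/* Copy as many pending elements as fit into buf (capacity cap).
 * Returns the number written; 0 means the dump is complete. */
size_t dump_chunk(struct dump_state *st, const int *elems, size_t n_elems,
                  int *buf, size_t cap)
{
    size_t written = 0;

    while (st->cursor < n_elems && written < cap)
        buf[written++] = elems[st->cursor++];
    return written;
}

/* Receiver side: keep requesting chunks until one comes back empty.
 * out must have room for n_elems entries. */
size_t dump_all(const int *elems, size_t n_elems, int *out, size_t chunk_cap)
{
    struct dump_state st = { .cursor = 0 };
    size_t total = 0, got;

    while ((got = dump_chunk(&st, elems, n_elems, out + total, chunk_cap)) > 0)
        total += got;
    return total;
}
```

With five elements and a chunk capacity of two, the dump completes in three non-empty chunks (2, 2, 1), and the cursor records where each call resumed.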
Batched handlers
To allow for rolling back a transaction which has failed or was aborted, message handlers of type NFNL_CB_BATCH allocate a struct nft_trans object and add it to the per-net commit list. This "log" of what was done is also useful to defer actions till the very end of the transaction. See nf_tables_commit() for reference of what it is used for in the success-case. Similar code is found in nf_tables_abort(), reverting the previous changes.
To make the ruleset update atomic, nftables uses an internal generation ID. Its value alternates between zero and one upon each commit. Ruleset elements have a two-bit "generation mask", indicating whether that element is active in the generation at its bit index. This way, elements may die, get born or stay alive when the generation ID toggles again.
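The generation scheme can be made concrete with a small C model. It follows the description above and assumes that a set bit means "active in that generation"; the kernel's actual bit polarity and helper names may differ. All identifiers here are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

struct net_state {
    unsigned int gencursor;   /* current generation: 0 or 1 */
};

struct element {
    uint8_t genmask;          /* bit g set: active in generation g (assumed) */
};

static unsigned int gen_next(const struct net_state *net)
{
    return net->gencursor ^ 1u;
}

/* Packet path: only elements active in the current generation are visible. */
bool elem_is_active(const struct net_state *net, const struct element *e)
{
    return (e->genmask & (1u << net->gencursor)) != 0;
}

/* A new element is born in the next generation only. */
void elem_add(const struct net_state *net, struct element *e)
{
    e->genmask = (uint8_t)(1u << gen_next(net));
}

/* A deleted element stays visible now but dies in the next generation. */
void elem_del(const struct net_state *net, struct element *e)
{
    e->genmask = (uint8_t)(e->genmask & ~(1u << gen_next(net)));
}

/* Commit: flip the generation; all pending changes become visible at once. */
void commit(struct net_state *net)
{
    net->gencursor = gen_next(net);
}

/* After a commit, surviving elements are marked active in both generations
 * so that they stay alive across future toggles. */
void elem_settle(const struct net_state *net, struct element *e)
{
    if (elem_is_active(net, e))
        e->genmask = 0x3;
}
```

Because readers consult only the bit for the current generation, flipping the cursor once switches the whole ruleset from the old view to the new one in a single step.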