Netdev Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH bpf-next 0/6] Support defragmenting IPv4 packets in BPF
@ 2022-12-14 23:25 Daniel Xu
  2022-12-14 23:25 ` [PATCH bpf-next 1/6] ip: frags: Return actual error codes from ip_check_defrag() Daniel Xu
  2022-12-14 23:25 ` [PATCH bpf-next 3/6] bpf, net, frags: Add bpf_ip_check_defrag() kfunc Daniel Xu
  0 siblings, 2 replies; 5+ messages in thread
From: Daniel Xu @ 2022-12-14 23:25 UTC (permalink / raw)
  To: davem, yoshfuji, dsahern, martin.lau, daniel, ast, andrii
  Cc: ppenkov, dbird, netdev, bpf

=== Context ===

In the context of a middlebox, fragmented packets are tricky to handle.
The full 5-tuple of a packet is often only available in the first
fragment which makes enforcing consistent policy difficult. There are
really only two stateless options, neither of which are very nice:

1. Enforce policy on first fragment and accept all subsequent fragments.
   This works but may let in certain attacks or allow data exfiltration.

2. Enforce policy on first fragment and drop all subsequent fragments.
   This does not really work b/c some protocols may rely on
   fragmentation. For example, DNS may rely on oversized UDP packets for
   large responses.

So stateful tracking is the only sane option. RFC 8900 [0] calls this
out as well in section 6.3:

    Middleboxes [...] should process IP fragments in a manner that is
    consistent with [RFC0791] and [RFC8200]. In many cases, middleboxes
    must maintain state in order to achieve this goal.

=== BPF related bits ===

However, when policy is enforced through BPF, the prog is run before the
kernel reassembles fragmented packets. This leaves BPF developers in a
awkward place: implement reassembly (possibly poorly) or use a stateless
method as described above.

Fortunately, the kernel has robust support for fragmented ipv4 packets.
This patchset wraps the existing defragmentation facilities in a kfunc so
that BPF progs running on middleboxes can reassemble fragmented packets
before applying policy.

=== Patchset details ===

This patchset is (hopefully) relatively straightforward from BPF perspective.
One thing I'd like to call out is the skb_copy()ing of the prog skb. I
did this to maintain the invariant that the ctx remains valid after prog
has run. This is relevant b/c ip_defrag() and ip_check_defrag() may
consume the skb if the skb is a fragment.

Originally I did play around with teaching the verifier about kfuncs
that may consume the ctx and disallowing ctx accesses in ret != 0
branches. It worked ok, but it seemed too complex to modify the
surrounding assumptions about ctx validity.

[0]: https://datatracker.ietf.org/doc/html/rfc8900

Daniel Xu (6):
  ip: frags: Return actual error codes from ip_check_defrag()
  bpf: verifier: Support KF_CHANGES_PKT flag
  bpf, net, frags: Add bpf_ip_check_defrag() kfunc
  bpf: selftests: Support not connecting client socket
  bpf: selftests: Support custom type and proto for client sockets
  bpf: selftests: Add bpf_ip_check_defrag() selftest

 Documentation/bpf/kfuncs.rst                  |   7 +
 drivers/net/macvlan.c                         |   2 +-
 include/linux/btf.h                           |   1 +
 include/net/ip.h                              |  11 +
 kernel/bpf/verifier.c                         |   8 +
 net/ipv4/Makefile                             |   1 +
 net/ipv4/ip_fragment.c                        |  13 +-
 net/ipv4/ip_fragment_bpf.c                    |  98 ++++++
 net/packet/af_packet.c                        |   2 +-
 .../selftests/bpf/generate_udp_fragments.py   |  52 +++
 tools/testing/selftests/bpf/network_helpers.c |  26 +-
 tools/testing/selftests/bpf/network_helpers.h |   3 +
 .../bpf/prog_tests/ip_check_defrag.c          | 296 ++++++++++++++++++
 .../selftests/bpf/progs/bpf_tracing_net.h     |   1 +
 .../selftests/bpf/progs/ip_check_defrag.c     |  83 +++++
 15 files changed, 589 insertions(+), 15 deletions(-)
 create mode 100644 net/ipv4/ip_fragment_bpf.c
 create mode 100755 tools/testing/selftests/bpf/generate_udp_fragments.py
 create mode 100644 tools/testing/selftests/bpf/prog_tests/ip_check_defrag.c
 create mode 100644 tools/testing/selftests/bpf/progs/ip_check_defrag.c

-- 
2.39.0


^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH bpf-next 1/6] ip: frags: Return actual error codes from ip_check_defrag()
  2022-12-14 23:25 [PATCH bpf-next 0/6] Support defragmenting IPv4 packets in BPF Daniel Xu
@ 2022-12-14 23:25 ` Daniel Xu
  2022-12-14 23:25 ` [PATCH bpf-next 3/6] bpf, net, frags: Add bpf_ip_check_defrag() kfunc Daniel Xu
  1 sibling, 0 replies; 5+ messages in thread
From: Daniel Xu @ 2022-12-14 23:25 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Hideaki YOSHIFUJI, David Ahern
  Cc: ppenkov, dbird, netdev, linux-kernel, bpf

Once we wrap ip_check_defrag() in a kfunc, it may be useful for progs to
know the exact error condition ip_check_defrag() encountered.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
 drivers/net/macvlan.c  |  2 +-
 net/ipv4/ip_fragment.c | 11 ++++++-----
 net/packet/af_packet.c |  2 +-
 3 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 99a971929c8e..b8310e13d7e1 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -456,7 +456,7 @@ static rx_handler_result_t macvlan_handle_frame(struct sk_buff **pskb)
 		unsigned int hash;
 
 		skb = ip_check_defrag(dev_net(skb->dev), skb, IP_DEFRAG_MACVLAN);
-		if (!skb)
+		if (IS_ERR(skb))
 			return RX_HANDLER_CONSUMED;
 		*pskb = skb;
 		eth = eth_hdr(skb);
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 69c00ffdcf3e..7406c6b6376d 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -514,6 +514,7 @@ struct sk_buff *ip_check_defrag(struct net *net, struct sk_buff *skb, u32 user)
 	struct iphdr iph;
 	int netoff;
 	u32 len;
+	int err;
 
 	if (skb->protocol != htons(ETH_P_IP))
 		return skb;
@@ -535,15 +536,15 @@ struct sk_buff *ip_check_defrag(struct net *net, struct sk_buff *skb, u32 user)
 		if (skb) {
 			if (!pskb_may_pull(skb, netoff + iph.ihl * 4)) {
 				kfree_skb(skb);
-				return NULL;
+				return ERR_PTR(-ENOMEM);
 			}
-			if (pskb_trim_rcsum(skb, netoff + len)) {
+			if ((err = pskb_trim_rcsum(skb, netoff + len))) {
 				kfree_skb(skb);
-				return NULL;
+				return ERR_PTR(err);
 			}
 			memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
-			if (ip_defrag(net, skb, user))
-				return NULL;
+			if ((err = ip_defrag(net, skb, user)))
+				return ERR_PTR(err);
 			skb_clear_hash(skb);
 		}
 	}
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 41c4ccc3a5d6..cf706f98448e 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1472,7 +1472,7 @@ static int packet_rcv_fanout(struct sk_buff *skb, struct net_device *dev,
 
 	if (fanout_has_flag(f, PACKET_FANOUT_FLAG_DEFRAG)) {
 		skb = ip_check_defrag(net, skb, IP_DEFRAG_AF_PACKET);
-		if (!skb)
+		if (IS_ERR(skb))
 			return 0;
 	}
 	switch (f->type) {
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH bpf-next 3/6] bpf, net, frags: Add bpf_ip_check_defrag() kfunc
  2022-12-14 23:25 [PATCH bpf-next 0/6] Support defragmenting IPv4 packets in BPF Daniel Xu
  2022-12-14 23:25 ` [PATCH bpf-next 1/6] ip: frags: Return actual error codes from ip_check_defrag() Daniel Xu
@ 2022-12-14 23:25 ` Daniel Xu
  2022-12-15 22:31   ` Daniel Borkmann
  1 sibling, 1 reply; 5+ messages in thread
From: Daniel Xu @ 2022-12-14 23:25 UTC (permalink / raw)
  To: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni
  Cc: ppenkov, dbird, linux-kernel, netdev, bpf

This kfunc is used to defragment IPv4 packets. The idea is that if you
see a fragmented packet, you call this kfunc. If the kfunc returns 0,
then the skb has been updated to contain the entire reassembled packet.

If the kfunc returns an error (most likely -EINPROGRESS), then it means
the skb is part of a yet-incomplete original packet. A reasonable
response to -EINPROGRESS is to drop the packet, as the ip defrag
infrastructure is already hanging onto the frag for future reassembly.

Care has been taken to ensure the prog skb remains valid no matter what
the underlying ip_check_defrag() call does. This is in contrast to
ip_defrag(), which may consume the skb if the skb is part of a
yet-incomplete original packet.

So far this kfunc is only callable from TC clsact progs.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
 include/net/ip.h           | 11 +++++
 net/ipv4/Makefile          |  1 +
 net/ipv4/ip_fragment.c     |  2 +
 net/ipv4/ip_fragment_bpf.c | 98 ++++++++++++++++++++++++++++++++++++++
 4 files changed, 112 insertions(+)
 create mode 100644 net/ipv4/ip_fragment_bpf.c

diff --git a/include/net/ip.h b/include/net/ip.h
index 144bdfbb25af..14f1e69a6523 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -679,6 +679,7 @@ enum ip_defrag_users {
 	IP_DEFRAG_VS_FWD,
 	IP_DEFRAG_AF_PACKET,
 	IP_DEFRAG_MACVLAN,
+	IP_DEFRAG_BPF,
 };
 
 /* Return true if the value of 'user' is between 'lower_bond'
@@ -692,6 +693,16 @@ static inline bool ip_defrag_user_in_between(u32 user,
 }
 
 int ip_defrag(struct net *net, struct sk_buff *skb, u32 user);
+
+#ifdef CONFIG_DEBUG_INFO_BTF
+int register_ip_frag_bpf(void);
+#else
+static inline int register_ip_frag_bpf(void)
+{
+	return 0;
+}
+#endif
+
 #ifdef CONFIG_INET
 struct sk_buff *ip_check_defrag(struct net *net, struct sk_buff *skb, u32 user);
 #else
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index af7d2cf490fb..749da1599933 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -64,6 +64,7 @@ obj-$(CONFIG_TCP_CONG_ILLINOIS) += tcp_illinois.o
 obj-$(CONFIG_NET_SOCK_MSG) += tcp_bpf.o
 obj-$(CONFIG_BPF_SYSCALL) += udp_bpf.o
 obj-$(CONFIG_NETLABEL) += cipso_ipv4.o
+obj-$(CONFIG_DEBUG_INFO_BTF) += ip_fragment_bpf.o
 
 obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \
 		      xfrm4_output.o xfrm4_protocol.o
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 7406c6b6376d..467aa8ace9fb 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -757,5 +757,7 @@ void __init ipfrag_init(void)
 	if (inet_frags_init(&ip4_frags))
 		panic("IP: failed to allocate ip4_frags cache\n");
 	ip4_frags_ctl_register();
+	if (register_ip_frag_bpf())
+		panic("IP: bpf: failed to register ip_frag_bpf\n");
 	register_pernet_subsys(&ip4_frags_ops);
 }
diff --git a/net/ipv4/ip_fragment_bpf.c b/net/ipv4/ip_fragment_bpf.c
new file mode 100644
index 000000000000..a9e5908ed216
--- /dev/null
+++ b/net/ipv4/ip_fragment_bpf.c
@@ -0,0 +1,98 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Unstable ipv4 fragmentation helpers for TC-BPF hook
+ *
+ * These are called from SCHED_CLS BPF programs. Note that it is allowed to
+ * break compatibility for these functions since the interface they are exposed
+ * through to BPF programs is explicitly unstable.
+ */
+
+#include <linux/bpf.h>
+#include <linux/btf_ids.h>
+#include <linux/ip.h>
+#include <linux/filter.h>
+#include <linux/netdevice.h>
+#include <net/ip.h>
+#include <net/sock.h>
+
+__diag_push();
+__diag_ignore_all("-Wmissing-prototypes",
+		  "Global functions as their definitions will be in ip_fragment BTF");
+
+/* bpf_ip_check_defrag - Defragment an ipv4 packet
+ *
+ * This helper takes an skb as input. If this skb successfully reassembles
+ * the original packet, the skb is updated to contain the original, reassembled
+ * packet.
+ *
+ * Otherwise (on error or incomplete reassembly), the input skb remains
+ * unmodified.
+ *
+ * Parameters:
+ * @ctx		- Pointer to program context (skb)
+ * @netns	- Child network namespace id. If value is a negative signed
+ *		  32-bit integer, the netns of the device in the skb is used.
+ *
+ * Return:
+ * 0 on successfully reassembly or non-fragmented packet. Negative value on
+ * error or incomplete reassembly.
+ */
+int bpf_ip_check_defrag(struct __sk_buff *ctx, u64 netns)
+{
+	struct sk_buff *skb = (struct sk_buff *)ctx;
+	struct sk_buff *skb_cpy, *skb_out;
+	struct net *caller_net;
+	struct net *net;
+	int mac_len;
+	void *mac;
+
+	if (unlikely(!((s32)netns < 0 || netns <= S32_MAX)))
+		return -EINVAL;
+
+	caller_net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
+	if ((s32)netns < 0) {
+		net = caller_net;
+	} else {
+		net = get_net_ns_by_id(caller_net, netns);
+		if (unlikely(!net))
+			return -EINVAL;
+	}
+
+	mac_len = skb->mac_len;
+	skb_cpy = skb_copy(skb, GFP_ATOMIC);
+	if (!skb_cpy)
+		return -ENOMEM;
+
+	skb_out = ip_check_defrag(net, skb_cpy, IP_DEFRAG_BPF);
+	if (IS_ERR(skb_out))
+		return PTR_ERR(skb_out);
+
+	skb_morph(skb, skb_out);
+	kfree_skb(skb_out);
+
+	/* ip_check_defrag() does not maintain mac header, so push empty header
+	 * in so prog sees the correct layout. The empty mac header will be
+	 * later pulled from cls_bpf.
+	 */
+	mac = skb_push(skb, mac_len);
+	memset(mac, 0, mac_len);
+	bpf_compute_data_pointers(skb);
+
+	return 0;
+}
+
+__diag_pop()
+
+BTF_SET8_START(ip_frag_kfunc_set)
+BTF_ID_FLAGS(func, bpf_ip_check_defrag, KF_CHANGES_PKT)
+BTF_SET8_END(ip_frag_kfunc_set)
+
+static const struct btf_kfunc_id_set ip_frag_bpf_kfunc_set = {
+	.owner = THIS_MODULE,
+	.set   = &ip_frag_kfunc_set,
+};
+
+int register_ip_frag_bpf(void)
+{
+	return register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS,
+					 &ip_frag_bpf_kfunc_set);
+}
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH bpf-next 3/6] bpf, net, frags: Add bpf_ip_check_defrag() kfunc
  2022-12-14 23:25 ` [PATCH bpf-next 3/6] bpf, net, frags: Add bpf_ip_check_defrag() kfunc Daniel Xu
@ 2022-12-15 22:31   ` Daniel Borkmann
  2022-12-15 23:58     ` Daniel Xu
  0 siblings, 1 reply; 5+ messages in thread
From: Daniel Borkmann @ 2022-12-15 22:31 UTC (permalink / raw)
  To: Daniel Xu, David S. Miller, Hideaki YOSHIFUJI, David Ahern,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: ppenkov, dbird, linux-kernel, netdev, bpf

Hi Daniel,

Thanks for working on this!

On 12/15/22 12:25 AM, Daniel Xu wrote:
[...]
> +#include <linux/bpf.h>
> +#include <linux/btf_ids.h>
> +#include <linux/ip.h>
> +#include <linux/filter.h>
> +#include <linux/netdevice.h>
> +#include <net/ip.h>
> +#include <net/sock.h>
> +
> +__diag_push();
> +__diag_ignore_all("-Wmissing-prototypes",
> +		  "Global functions as their definitions will be in ip_fragment BTF");
> +
> +/* bpf_ip_check_defrag - Defragment an ipv4 packet
> + *
> + * This helper takes an skb as input. If this skb successfully reassembles
> + * the original packet, the skb is updated to contain the original, reassembled
> + * packet.
> + *
> + * Otherwise (on error or incomplete reassembly), the input skb remains
> + * unmodified.
> + *
> + * Parameters:
> + * @ctx		- Pointer to program context (skb)
> + * @netns	- Child network namespace id. If value is a negative signed
> + *		  32-bit integer, the netns of the device in the skb is used.
> + *
> + * Return:
> + * 0 on successfully reassembly or non-fragmented packet. Negative value on
> + * error or incomplete reassembly.
> + */
> +int bpf_ip_check_defrag(struct __sk_buff *ctx, u64 netns)

small nit: for sk lookup helper we've used u32 netns_id, would be nice to have
this consistent here as well.

> +{
> +	struct sk_buff *skb = (struct sk_buff *)ctx;
> +	struct sk_buff *skb_cpy, *skb_out;
> +	struct net *caller_net;
> +	struct net *net;
> +	int mac_len;
> +	void *mac;
> +
> +	if (unlikely(!((s32)netns < 0 || netns <= S32_MAX)))
> +		return -EINVAL;
> +
> +	caller_net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
> +	if ((s32)netns < 0) {
> +		net = caller_net;
> +	} else {
> +		net = get_net_ns_by_id(caller_net, netns);
> +		if (unlikely(!net))
> +			return -EINVAL;
> +	}
> +
> +	mac_len = skb->mac_len;
> +	skb_cpy = skb_copy(skb, GFP_ATOMIC);
> +	if (!skb_cpy)
> +		return -ENOMEM;

Given slow path, this idea is expensive but okay. Maybe in future it could be lifted
which might be a bigger lift to teach verifier that input ctx cannot be accessed
anymore.. but then frags are very much discouraged either way and bpf_ip_check_defrag()
might only apply in corner case situations (like DNS, etc).

> +	skb_out = ip_check_defrag(net, skb_cpy, IP_DEFRAG_BPF);
> +	if (IS_ERR(skb_out))
> +		return PTR_ERR(skb_out);

Looks like ip_check_defrag() can gracefully handle IPv6 packet. It will just return back
skb_cpy pointer in that case. However, this brings me to my main complaint.. I don't
think we should merge anything IPv4-related without also having IPv6 equivalent support,
otherwise we're building up tech debt, so pls also add support for the latter.

> +	skb_morph(skb, skb_out);
> +	kfree_skb(skb_out);
> +
> +	/* ip_check_defrag() does not maintain mac header, so push empty header
> +	 * in so prog sees the correct layout. The empty mac header will be
> +	 * later pulled from cls_bpf.
> +	 */
> +	mac = skb_push(skb, mac_len);
> +	memset(mac, 0, mac_len);
> +	bpf_compute_data_pointers(skb);
> +
> +	return 0;
> +}
> +

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH bpf-next 3/6] bpf, net, frags: Add bpf_ip_check_defrag() kfunc
  2022-12-15 22:31   ` Daniel Borkmann
@ 2022-12-15 23:58     ` Daniel Xu
  0 siblings, 0 replies; 5+ messages in thread
From: Daniel Xu @ 2022-12-15 23:58 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, ppenkov, dbird, linux-kernel,
	netdev, bpf

Hi Daniel,

Thanks for taking a look!

On Thu, Dec 15, 2022 at 11:31:52PM +0100, Daniel Borkmann wrote:
> Hi Daniel,
> 
> Thanks for working on this!
> 
> On 12/15/22 12:25 AM, Daniel Xu wrote:
> [...]
> > +#include <linux/bpf.h>
> > +#include <linux/btf_ids.h>
> > +#include <linux/ip.h>
> > +#include <linux/filter.h>
> > +#include <linux/netdevice.h>
> > +#include <net/ip.h>
> > +#include <net/sock.h>
> > +
> > +__diag_push();
> > +__diag_ignore_all("-Wmissing-prototypes",
> > +		  "Global functions as their definitions will be in ip_fragment BTF");
> > +
> > +/* bpf_ip_check_defrag - Defragment an ipv4 packet
> > + *
> > + * This helper takes an skb as input. If this skb successfully reassembles
> > + * the original packet, the skb is updated to contain the original, reassembled
> > + * packet.
> > + *
> > + * Otherwise (on error or incomplete reassembly), the input skb remains
> > + * unmodified.
> > + *
> > + * Parameters:
> > + * @ctx		- Pointer to program context (skb)
> > + * @netns	- Child network namespace id. If value is a negative signed
> > + *		  32-bit integer, the netns of the device in the skb is used.
> > + *
> > + * Return:
> > + * 0 on successfully reassembly or non-fragmented packet. Negative value on
> > + * error or incomplete reassembly.
> > + */
> > +int bpf_ip_check_defrag(struct __sk_buff *ctx, u64 netns)
> 
> small nit: for sk lookup helper we've used u32 netns_id, would be nice to have
> this consistent here as well.

Hmm, when I grep I see the sk lookup helpers using a u64 as well:

        $ rg "u64 netns" ./include/uapi/linux/bpf.h
        3394: * struct bpf_sock *bpf_sk_lookup_tcp(void *ctx, struct bpf_sock_tuple *tuple, u32 tuple_size, u64 netns, u64 flags)
        3431: * struct bpf_sock *bpf_sk_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, u32 tuple_size, u64 netns, u64 flags)
        3629: * struct bpf_sock *bpf_skc_lookup_tcp(void *ctx, struct bpf_sock_tuple *tuple, u32 tuple_size, u64 netns, u64 flags)
        6238:   __u64 netns_dev;
        6239:   __u64 netns_ino;
        6274:   __u64 netns_dev;
        6275:   __u64 netns_ino;
        $ rg "u32 netns" ./include/uapi/linux/bpf.h
        6335:                   __u32 netns_ino;

Am I looking at the right place?

> 
> > +{
> > +	struct sk_buff *skb = (struct sk_buff *)ctx;
> > +	struct sk_buff *skb_cpy, *skb_out;
> > +	struct net *caller_net;
> > +	struct net *net;
> > +	int mac_len;
> > +	void *mac;
> > +
> > +	if (unlikely(!((s32)netns < 0 || netns <= S32_MAX)))
> > +		return -EINVAL;
> > +
> > +	caller_net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
> > +	if ((s32)netns < 0) {
> > +		net = caller_net;
> > +	} else {
> > +		net = get_net_ns_by_id(caller_net, netns);
> > +		if (unlikely(!net))
> > +			return -EINVAL;
> > +	}
> > +
> > +	mac_len = skb->mac_len;
> > +	skb_cpy = skb_copy(skb, GFP_ATOMIC);
> > +	if (!skb_cpy)
> > +		return -ENOMEM;
> 
> Given slow path, this idea is expensive but okay. Maybe in future it could be lifted
> which might be a bigger lift to teach verifier that input ctx cannot be accessed
> anymore.. but then frags are very much discouraged either way and bpf_ip_check_defrag()
> might only apply in corner case situations (like DNS, etc).

Originally I did go the route of teaching the verifier:

* https://github.com/danobi/linux/commit/35a66af8d54cca647b0adfc7c1da7105d2603dde
* https://github.com/danobi/linux/commit/e8c86ea75e2ca8f0631632d54ef763381308711e
* https://github.com/danobi/linux/commit/972bcf769f41fbfa7f84ce00faf06b5b57bc6f7a

It didn't seem too bad on the verifier front (or maybe I got it wrong),
but it seemed a bit unwieldy to wire ctx validity information back out
of the program given ip_check_defrag() may kfree() the skb (so stuffing
data inside skb->cb wouldn't work).

And yeah, given frags are highly discouraged, didn't seem like too bad
of a tradeoff.

> 
> > +	skb_out = ip_check_defrag(net, skb_cpy, IP_DEFRAG_BPF);
> > +	if (IS_ERR(skb_out))
> > +		return PTR_ERR(skb_out);
> 
> Looks like ip_check_defrag() can gracefully handle IPv6 packet. It will just return back
> skb_cpy pointer in that case. However, this brings me to my main complaint.. I don't
> think we should merge anything IPv4-related without also having IPv6 equivalent support,
> otherwise we're building up tech debt, so pls also add support for the latter.

Ok, I'll take a crack at ipv6 support too. Most likely in the form of
another kfunc, depending on how the ipv6 frag infra is set up.

I'll be in and out during the holidays so v2 will most likely come some
time in the new year.

[...]

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2022-12-15 23:58 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-14 23:25 [PATCH bpf-next 0/6] Support defragmenting IPv4 packets in BPF Daniel Xu
2022-12-14 23:25 ` [PATCH bpf-next 1/6] ip: frags: Return actual error codes from ip_check_defrag() Daniel Xu
2022-12-14 23:25 ` [PATCH bpf-next 3/6] bpf, net, frags: Add bpf_ip_check_defrag() kfunc Daniel Xu
2022-12-15 22:31   ` Daniel Borkmann
2022-12-15 23:58     ` Daniel Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).