* [PATCH bpf-next 1/4] bpf: tcp: Allow bpf-tcp-cc to call bpf_(get|set)sockopt
2021-08-05 5:01 [PATCH bpf-next 0/4] bpf: tcp: Allow bpf-tcp-cc to call bpf_(get|set)sockopt Martin KaFai Lau
@ 2021-08-05 5:01 ` Martin KaFai Lau
2021-08-05 5:01 ` [PATCH bpf-next 2/4] bpf: selftests: Add sk_state to bpf_tcp_helpers.h Martin KaFai Lau
` (2 subsequent siblings)
3 siblings, 0 replies; 7+ messages in thread
From: Martin KaFai Lau @ 2021-08-05 5:01 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
kernel-team, netdev
This patch allows the bpf-tcp-cc to call bpf_setsockopt. One use
case is to allow a bpf-tcp-cc switching to another cc during init().
For example, when the tcp flow is not ecn ready, the bpf_dctcp
can switch to another cc by calling setsockopt(TCP_CONGESTION).
During setsockopt(TCP_CONGESTION), the new tcp-cc's init() will be
called and this could cause a recursion but it is stopped by the
current trampoline's logic (in the prog->active counter).
While retiring a bpf-tcp-cc (e.g. in tcp_v[46]_destroy_sock()),
the tcp stack calls bpf-tcp-cc's release(). To avoid the retiring
bpf-tcp-cc making further changes to the sk, bpf_setsockopt is not
available to the bpf-tcp-cc's release(). This will avoid release()
making setsockopt() call that will potentially allocate new resources.
bpf_getsockopt() is also added to have a symmetrical API, so
less usage surprise.
When the old bpf-tcp-cc is calling setsockopt(TCP_CONGESTION)
to switch to a new cc, the old bpf-tcp-cc will be released by
bpf_struct_ops_put(). Thus, this patch also puts the bpf_struct_ops_map
after a rcu grace period because the trampoline's image cannot be freed
while the old bpf-tcp-cc is still running.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
kernel/bpf/bpf_struct_ops.c | 22 +++++++++++++++++++++-
net/ipv4/bpf_tcp_ca.c | 26 +++++++++++++++++++++++---
2 files changed, 44 insertions(+), 4 deletions(-)
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 70f6fd4fa305..d6731c32864e 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -28,6 +28,7 @@ struct bpf_struct_ops_value {
struct bpf_struct_ops_map {
struct bpf_map map;
+ struct rcu_head rcu;
const struct bpf_struct_ops *st_ops;
/* protect map_update */
struct mutex lock;
@@ -622,6 +623,14 @@ bool bpf_struct_ops_get(const void *kdata)
return refcount_inc_not_zero(&kvalue->refcnt);
}
+static void bpf_struct_ops_put_rcu(struct rcu_head *head)
+{
+ struct bpf_struct_ops_map *st_map;
+
+ st_map = container_of(head, struct bpf_struct_ops_map, rcu);
+ bpf_map_put(&st_map->map);
+}
+
void bpf_struct_ops_put(const void *kdata)
{
struct bpf_struct_ops_value *kvalue;
@@ -632,6 +641,17 @@ void bpf_struct_ops_put(const void *kdata)
st_map = container_of(kvalue, struct bpf_struct_ops_map,
kvalue);
- bpf_map_put(&st_map->map);
+ /* The struct_ops's function may switch to another struct_ops.
+ *
+ * For example, bpf_tcp_cc_x->init() may switch to
+ * another tcp_cc_y by calling
+ * setsockopt(TCP_CONGESTION, "tcp_cc_y").
+ * During the switch, bpf_struct_ops_put(tcp_cc_x) is called
+ * and its map->refcnt may reach 0 which then free its
+ * trampoline image while tcp_cc_x is still running.
+ *
+ * Thus, a rcu grace period is needed here.
+ */
+ call_rcu(&st_map->rcu, bpf_struct_ops_put_rcu);
}
}
diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
index 9e41eff4a685..00db0ce3af43 100644
--- a/net/ipv4/bpf_tcp_ca.c
+++ b/net/ipv4/bpf_tcp_ca.c
@@ -10,6 +10,9 @@
#include <net/tcp.h>
#include <net/bpf_sk_storage.h>
+/* "extern" is to avoid sparse warning. It is only used in bpf_struct_ops.c. */
+extern struct bpf_struct_ops bpf_tcp_congestion_ops;
+
static u32 optional_ops[] = {
offsetof(struct tcp_congestion_ops, init),
offsetof(struct tcp_congestion_ops, release),
@@ -167,6 +170,10 @@ static const struct bpf_func_proto *
bpf_tcp_ca_get_func_proto(enum bpf_func_id func_id,
const struct bpf_prog *prog)
{
+ const struct btf_member *m;
+ const struct btf_type *t;
+ u32 moff, midx;
+
switch (func_id) {
case BPF_FUNC_tcp_send_ack:
return &bpf_tcp_send_ack_proto;
@@ -174,6 +181,22 @@ bpf_tcp_ca_get_func_proto(enum bpf_func_id func_id,
return &bpf_sk_storage_get_proto;
case BPF_FUNC_sk_storage_delete:
return &bpf_sk_storage_delete_proto;
+ case BPF_FUNC_setsockopt:
+ /* Does not allow release() to call setsockopt.
+ * release() is called when the current bpf-tcp-cc
+ * is retiring. It is not allowed to call
+ * setsockopt() to make further changes which
+ * may potentially allocate new resources.
+ */
+ midx = prog->expected_attach_type;
+ t = bpf_tcp_congestion_ops.type;
+ m = &btf_type_member(t)[midx];
+ moff = btf_member_bit_offset(t, m) / 8;
+ if (moff == offsetof(struct tcp_congestion_ops, release))
+ return NULL;
+ return &bpf_sk_setsockopt_proto;
+ case BPF_FUNC_getsockopt:
+ return &bpf_sk_getsockopt_proto;
default:
return bpf_base_func_proto(func_id);
}
@@ -286,9 +309,6 @@ static void bpf_tcp_ca_unreg(void *kdata)
tcp_unregister_congestion_control(kdata);
}
-/* Avoid sparse warning. It is only used in bpf_struct_ops.c. */
-extern struct bpf_struct_ops bpf_tcp_congestion_ops;
-
struct bpf_struct_ops bpf_tcp_congestion_ops = {
.verifier_ops = &bpf_tcp_ca_verifier_ops,
.reg = bpf_tcp_ca_reg,
--
2.30.2
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH bpf-next 2/4] bpf: selftests: Add sk_state to bpf_tcp_helpers.h
2021-08-05 5:01 [PATCH bpf-next 0/4] bpf: tcp: Allow bpf-tcp-cc to call bpf_(get|set)sockopt Martin KaFai Lau
2021-08-05 5:01 ` [PATCH bpf-next 1/4] " Martin KaFai Lau
@ 2021-08-05 5:01 ` Martin KaFai Lau
2021-08-05 5:01 ` [PATCH bpf-next 3/4] bpf: selftests: Add connect_to_fd_opts to network_helpers Martin KaFai Lau
2021-08-05 5:01 ` [PATCH bpf-next 4/4] bpf: selftests: Add dctcp fallback test Martin KaFai Lau
3 siblings, 0 replies; 7+ messages in thread
From: Martin KaFai Lau @ 2021-08-05 5:01 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
kernel-team, netdev
Add sk_state define to bpf_tcp_helpers.h. Rename the existing
global variable "sk_state" in the kfunc_call test to "sk_state_res".
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
tools/testing/selftests/bpf/bpf_tcp_helpers.h | 1 +
tools/testing/selftests/bpf/prog_tests/kfunc_call.c | 2 +-
tools/testing/selftests/bpf/progs/kfunc_call_test_subprog.c | 4 ++--
3 files changed, 4 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/bpf/bpf_tcp_helpers.h b/tools/testing/selftests/bpf/bpf_tcp_helpers.h
index 029589c008c9..e49b7c450b42 100644
--- a/tools/testing/selftests/bpf/bpf_tcp_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_tcp_helpers.h
@@ -27,6 +27,7 @@ enum sk_pacing {
struct sock {
struct sock_common __sk_common;
+#define sk_state __sk_common.skc_state
unsigned long sk_pacing_rate;
__u32 sk_pacing_status; /* see enum sk_pacing */
} __attribute__((preserve_access_index));
diff --git a/tools/testing/selftests/bpf/prog_tests/kfunc_call.c b/tools/testing/selftests/bpf/prog_tests/kfunc_call.c
index 30a7b9b837bf..9611f2bc50df 100644
--- a/tools/testing/selftests/bpf/prog_tests/kfunc_call.c
+++ b/tools/testing/selftests/bpf/prog_tests/kfunc_call.c
@@ -44,7 +44,7 @@ static void test_subprog(void)
ASSERT_OK(err, "bpf_prog_test_run(test1)");
ASSERT_EQ(retval, 10, "test1-retval");
ASSERT_NEQ(skel->data->active_res, -1, "active_res");
- ASSERT_EQ(skel->data->sk_state, BPF_TCP_CLOSE, "sk_state");
+ ASSERT_EQ(skel->data->sk_state_res, BPF_TCP_CLOSE, "sk_state_res");
kfunc_call_test_subprog__destroy(skel);
}
diff --git a/tools/testing/selftests/bpf/progs/kfunc_call_test_subprog.c b/tools/testing/selftests/bpf/progs/kfunc_call_test_subprog.c
index b2dcb7d9cb03..5fbd9e232d44 100644
--- a/tools/testing/selftests/bpf/progs/kfunc_call_test_subprog.c
+++ b/tools/testing/selftests/bpf/progs/kfunc_call_test_subprog.c
@@ -9,7 +9,7 @@ extern __u64 bpf_kfunc_call_test1(struct sock *sk, __u32 a, __u64 b,
__u32 c, __u64 d) __ksym;
extern struct sock *bpf_kfunc_call_test3(struct sock *sk) __ksym;
int active_res = -1;
-int sk_state = -1;
+int sk_state_res = -1;
int __noinline f1(struct __sk_buff *skb)
{
@@ -28,7 +28,7 @@ int __noinline f1(struct __sk_buff *skb)
if (active)
active_res = *active;
- sk_state = bpf_kfunc_call_test3((struct sock *)sk)->__sk_common.skc_state;
+ sk_state_res = bpf_kfunc_call_test3((struct sock *)sk)->sk_state;
return (__u32)bpf_kfunc_call_test1((struct sock *)sk, 1, 2, 3, 4);
}
--
2.30.2
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH bpf-next 3/4] bpf: selftests: Add connect_to_fd_opts to network_helpers
2021-08-05 5:01 [PATCH bpf-next 0/4] bpf: tcp: Allow bpf-tcp-cc to call bpf_(get|set)sockopt Martin KaFai Lau
2021-08-05 5:01 ` [PATCH bpf-next 1/4] " Martin KaFai Lau
2021-08-05 5:01 ` [PATCH bpf-next 2/4] bpf: selftests: Add sk_state to bpf_tcp_helpers.h Martin KaFai Lau
@ 2021-08-05 5:01 ` Martin KaFai Lau
2021-08-05 5:01 ` [PATCH bpf-next 4/4] bpf: selftests: Add dctcp fallback test Martin KaFai Lau
3 siblings, 0 replies; 7+ messages in thread
From: Martin KaFai Lau @ 2021-08-05 5:01 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
kernel-team, netdev
The next test requires to setsockopt(TCP_CONGESTION) before
connect(), so a new arg is needed for the connect_to_fd() to specify
the cc's name.
This patch adds a new "struct network_helper_opts" for the future
option needs. It starts with the "cc" and "timeout_ms" option.
A new helper connect_to_fd_opts() is added to take the new
"const struct network_helper_opts *opts" as an arg.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
tools/testing/selftests/bpf/network_helpers.c | 23 +++++++++++++++++--
tools/testing/selftests/bpf/network_helpers.h | 6 +++++
2 files changed, 27 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c
index 26468a8f44f3..a8329e0a4af9 100644
--- a/tools/testing/selftests/bpf/network_helpers.c
+++ b/tools/testing/selftests/bpf/network_helpers.c
@@ -218,13 +218,18 @@ static int connect_fd_to_addr(int fd,
return 0;
}
-int connect_to_fd(int server_fd, int timeout_ms)
+static const struct network_helper_opts default_opts;
+
+int connect_to_fd_opts(int server_fd, const struct network_helper_opts *opts)
{
struct sockaddr_storage addr;
struct sockaddr_in *addr_in;
socklen_t addrlen, optlen;
int fd, type;
+ if (!opts)
+ opts = &default_opts;
+
optlen = sizeof(type);
if (getsockopt(server_fd, SOL_SOCKET, SO_TYPE, &type, &optlen)) {
log_err("getsockopt(SOL_TYPE)");
@@ -244,7 +249,12 @@ int connect_to_fd(int server_fd, int timeout_ms)
return -1;
}
- if (settimeo(fd, timeout_ms))
+ if (settimeo(fd, opts->timeout_ms))
+ goto error_close;
+
+ if (opts->cc && opts->cc[0] &&
+ setsockopt(fd, SOL_TCP, TCP_CONGESTION, opts->cc,
+ strlen(opts->cc) + 1))
goto error_close;
if (connect_fd_to_addr(fd, &addr, addrlen))
@@ -257,6 +267,15 @@ int connect_to_fd(int server_fd, int timeout_ms)
return -1;
}
+int connect_to_fd(int server_fd, int timeout_ms)
+{
+ struct network_helper_opts opts = {
+ .timeout_ms = timeout_ms,
+ };
+
+ return connect_to_fd_opts(server_fd, &opts);
+}
+
int connect_fd_to_fd(int client_fd, int server_fd, int timeout_ms)
{
struct sockaddr_storage addr;
diff --git a/tools/testing/selftests/bpf/network_helpers.h b/tools/testing/selftests/bpf/network_helpers.h
index d60bc2897770..3021fe432d2d 100644
--- a/tools/testing/selftests/bpf/network_helpers.h
+++ b/tools/testing/selftests/bpf/network_helpers.h
@@ -17,6 +17,11 @@ typedef __u16 __sum16;
#define VIP_NUM 5
#define MAGIC_BYTES 123
+struct network_helper_opts {
+ const char *cc;
+ int timeout_ms;
+};
+
/* ipv4 test vector */
struct ipv4_packet {
struct ethhdr eth;
@@ -41,6 +46,7 @@ int *start_reuseport_server(int family, int type, const char *addr_str,
unsigned int nr_listens);
void free_fds(int *fds, unsigned int nr_close_fds);
int connect_to_fd(int server_fd, int timeout_ms);
+int connect_to_fd_opts(int server_fd, const struct network_helper_opts *opts);
int connect_fd_to_fd(int client_fd, int server_fd, int timeout_ms);
int fastopen_connect(int server_fd, const char *data, unsigned int data_len,
int timeout_ms);
--
2.30.2
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH bpf-next 4/4] bpf: selftests: Add dctcp fallback test
2021-08-05 5:01 [PATCH bpf-next 0/4] bpf: tcp: Allow bpf-tcp-cc to call bpf_(get|set)sockopt Martin KaFai Lau
` (2 preceding siblings ...)
2021-08-05 5:01 ` [PATCH bpf-next 3/4] bpf: selftests: Add connect_to_fd_opts to network_helpers Martin KaFai Lau
@ 2021-08-05 5:01 ` Martin KaFai Lau
2021-08-06 16:07 ` Daniel Borkmann
3 siblings, 1 reply; 7+ messages in thread
From: Martin KaFai Lau @ 2021-08-05 5:01 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
kernel-team, netdev
This patch makes the bpf_dctcp test to fallback to cubic by
using setsockopt(TCP_CONGESTION) when the tcp flow is not
ecn ready.
It also checks setsockopt() is not available to release().
The settimeo() from the network_helpers.h is used, so the local
one is removed.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
tools/testing/selftests/bpf/bpf_tcp_helpers.h | 4 +
.../selftests/bpf/prog_tests/bpf_tcp_ca.c | 101 ++++++++++++++----
tools/testing/selftests/bpf/progs/bpf_dctcp.c | 20 ++++
.../selftests/bpf/progs/bpf_dctcp_release.c | 26 +++++
4 files changed, 128 insertions(+), 23 deletions(-)
create mode 100644 tools/testing/selftests/bpf/progs/bpf_dctcp_release.c
diff --git a/tools/testing/selftests/bpf/bpf_tcp_helpers.h b/tools/testing/selftests/bpf/bpf_tcp_helpers.h
index e49b7c450b42..5a024646918b 100644
--- a/tools/testing/selftests/bpf/bpf_tcp_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_tcp_helpers.h
@@ -12,6 +12,10 @@
SEC("struct_ops/"#name) \
BPF_PROG(name, args)
+#ifndef SOL_TCP
+#define SOL_TCP 6
+#endif
+
#define tcp_jiffies32 ((__u32)bpf_jiffies64())
struct sock_common {
diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c b/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
index efe1e979affb..b0ba8fa9d0ec 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
@@ -4,37 +4,18 @@
#include <linux/err.h>
#include <netinet/tcp.h>
#include <test_progs.h>
+#include "network_helpers.h"
#include "bpf_dctcp.skel.h"
#include "bpf_cubic.skel.h"
#include "bpf_tcp_nogpl.skel.h"
+#include "bpf_dctcp_release.skel.h"
#define min(a, b) ((a) < (b) ? (a) : (b))
static const unsigned int total_bytes = 10 * 1024 * 1024;
-static const struct timeval timeo_sec = { .tv_sec = 10 };
-static const size_t timeo_optlen = sizeof(timeo_sec);
static int expected_stg = 0xeB9F;
static int stop, duration;
-static int settimeo(int fd)
-{
- int err;
-
- err = setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &timeo_sec,
- timeo_optlen);
- if (CHECK(err == -1, "setsockopt(fd, SO_RCVTIMEO)", "errno:%d\n",
- errno))
- return -1;
-
- err = setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &timeo_sec,
- timeo_optlen);
- if (CHECK(err == -1, "setsockopt(fd, SO_SNDTIMEO)", "errno:%d\n",
- errno))
- return -1;
-
- return 0;
-}
-
static int settcpca(int fd, const char *tcp_ca)
{
int err;
@@ -61,7 +42,7 @@ static void *server(void *arg)
goto done;
}
- if (settimeo(fd)) {
+ if (settimeo(fd, 0)) {
err = -errno;
goto done;
}
@@ -114,7 +95,7 @@ static void do_test(const char *tcp_ca, const struct bpf_map *sk_stg_map)
}
if (settcpca(lfd, tcp_ca) || settcpca(fd, tcp_ca) ||
- settimeo(lfd) || settimeo(fd))
+ settimeo(lfd, 0) || settimeo(fd, 0))
goto done;
/* bind, listen and start server thread to accept */
@@ -267,6 +248,76 @@ static void test_invalid_license(void)
libbpf_set_print(old_print_fn);
}
+static void test_dctcp_fallback(void)
+{
+ int err, lfd = -1, cli_fd = -1, srv_fd = -1;
+ struct network_helper_opts opts = {
+ .cc = "cubic",
+ };
+ struct bpf_dctcp *dctcp_skel;
+ struct bpf_link *link = NULL;
+ char srv_cc[16];
+ socklen_t cc_len = sizeof(srv_cc);
+
+ dctcp_skel = bpf_dctcp__open();
+ if (!ASSERT_OK_PTR(dctcp_skel, "dctcp_skel"))
+ return;
+ strcpy(dctcp_skel->rodata->fallback, "cubic");
+ if (!ASSERT_OK(bpf_dctcp__load(dctcp_skel), "bpf_dctcp__load"))
+ goto done;
+
+ link = bpf_map__attach_struct_ops(dctcp_skel->maps.dctcp);
+ if (!ASSERT_OK_PTR(link, "dctcp link"))
+ goto done;
+
+ lfd = start_server(AF_INET6, SOCK_STREAM, "::1", 0, 0);
+ if (!ASSERT_GE(lfd, 0, "lfd") ||
+ !ASSERT_OK(settcpca(lfd, "bpf_dctcp"), "lfd=>bpf_dctcp"))
+ goto done;
+
+ cli_fd = connect_to_fd_opts(lfd, &opts);
+ if (!ASSERT_GE(cli_fd, 0, "cli_fd"))
+ goto done;
+
+ srv_fd = accept(lfd, NULL, 0);
+ if (!ASSERT_GE(srv_fd, 0, "srv_fd"))
+ goto done;
+ ASSERT_STREQ(dctcp_skel->bss->cc_res, "cubic", "cc_res");
+
+ err = getsockopt(srv_fd, SOL_TCP, TCP_CONGESTION, srv_cc, &cc_len);
+ if (!ASSERT_OK(err, "getsockopt(srv_fd, TCP_CONGESTION)"))
+ goto done;
+ ASSERT_STREQ(srv_cc, "cubic", "srv_fd cc");
+
+done:
+ bpf_link__destroy(link);
+ bpf_dctcp__destroy(dctcp_skel);
+ if (lfd != -1)
+ close(lfd);
+ if (srv_fd != -1)
+ close(srv_fd);
+ if (cli_fd != -1)
+ close(cli_fd);
+}
+
+static void test_rel_setsockopt(void)
+{
+ struct bpf_dctcp_release *rel_skel;
+ libbpf_print_fn_t old_print_fn;
+
+ err_str = "unknown func bpf_setsockopt";
+ found = false;
+
+ old_print_fn = libbpf_set_print(libbpf_debug_print);
+ rel_skel = bpf_dctcp_release__open_and_load();
+ libbpf_set_print(old_print_fn);
+
+ ASSERT_ERR_PTR(rel_skel, "rel_skel");
+ ASSERT_TRUE(found, "expected_err_msg");
+
+ bpf_dctcp_release__destroy(rel_skel);
+}
+
void test_bpf_tcp_ca(void)
{
if (test__start_subtest("dctcp"))
@@ -275,4 +326,8 @@ void test_bpf_tcp_ca(void)
test_cubic();
if (test__start_subtest("invalid_license"))
test_invalid_license();
+ if (test__start_subtest("dctcp_fallback"))
+ test_dctcp_fallback();
+ if (test__start_subtest("rel_setsockopt"))
+ test_rel_setsockopt();
}
diff --git a/tools/testing/selftests/bpf/progs/bpf_dctcp.c b/tools/testing/selftests/bpf/progs/bpf_dctcp.c
index fd42247da8b4..48df7ffbefdb 100644
--- a/tools/testing/selftests/bpf/progs/bpf_dctcp.c
+++ b/tools/testing/selftests/bpf/progs/bpf_dctcp.c
@@ -17,6 +17,9 @@
char _license[] SEC("license") = "GPL";
+volatile const char fallback[TCP_CA_NAME_MAX];
+const char bpf_dctcp[] = "bpf_dctcp";
+char cc_res[TCP_CA_NAME_MAX];
int stg_result = 0;
struct {
@@ -57,6 +60,23 @@ void BPF_PROG(dctcp_init, struct sock *sk)
struct dctcp *ca = inet_csk_ca(sk);
int *stg;
+ if (!(tp->ecn_flags & TCP_ECN_OK) && fallback[0]) {
+ /* Switch to fallback */
+ bpf_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
+ (void *)fallback, sizeof(fallback));
+ /* Switch back to myself which the bpf trampoline
+ * stopped calling dctcp_init recursively.
+ */
+ bpf_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
+ (void *)bpf_dctcp, sizeof(bpf_dctcp));
+ /* Switch back to fallback */
+ bpf_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
+ (void *)fallback, sizeof(fallback));
+ bpf_getsockopt(sk, SOL_TCP, TCP_CONGESTION,
+ (void *)cc_res, sizeof(cc_res));
+ return;
+ }
+
ca->prior_rcv_nxt = tp->rcv_nxt;
ca->dctcp_alpha = min(dctcp_alpha_on_init, DCTCP_MAX_ALPHA);
ca->loss_cwnd = 0;
diff --git a/tools/testing/selftests/bpf/progs/bpf_dctcp_release.c b/tools/testing/selftests/bpf/progs/bpf_dctcp_release.c
new file mode 100644
index 000000000000..d836f7c372f0
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_dctcp_release.c
@@ -0,0 +1,26 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Facebook */
+
+#include <stddef.h>
+#include <linux/bpf.h>
+#include <linux/types.h>
+#include <linux/stddef.h>
+#include <linux/tcp.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include "bpf_tcp_helpers.h"
+
+char _license[] SEC("license") = "GPL";
+const char cubic[] = "cubic";
+
+void BPF_STRUCT_OPS(dctcp_nouse_release, struct sock *sk)
+{
+ bpf_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
+ (void *)cubic, sizeof(cubic));
+}
+
+SEC(".struct_ops")
+struct tcp_congestion_ops dctcp_rel = {
+ .release = (void *)dctcp_nouse_release,
+ .name = "bpf_dctcp_rel",
+};
--
2.30.2
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH bpf-next 4/4] bpf: selftests: Add dctcp fallback test
2021-08-05 5:01 ` [PATCH bpf-next 4/4] bpf: selftests: Add dctcp fallback test Martin KaFai Lau
@ 2021-08-06 16:07 ` Daniel Borkmann
2021-08-06 17:42 ` Martin KaFai Lau
0 siblings, 1 reply; 7+ messages in thread
From: Daniel Borkmann @ 2021-08-06 16:07 UTC (permalink / raw)
To: Martin KaFai Lau, bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, kernel-team, netdev
On 8/5/21 7:01 AM, Martin KaFai Lau wrote:
> This patch makes the bpf_dctcp test to fallback to cubic by
> using setsockopt(TCP_CONGESTION) when the tcp flow is not
> ecn ready.
>
> It also checks setsockopt() is not available to release().
>
> The settimeo() from the network_helpers.h is used, so the local
> one is removed.
>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
[...]
> diff --git a/tools/testing/selftests/bpf/progs/bpf_dctcp.c b/tools/testing/selftests/bpf/progs/bpf_dctcp.c
> index fd42247da8b4..48df7ffbefdb 100644
> --- a/tools/testing/selftests/bpf/progs/bpf_dctcp.c
> +++ b/tools/testing/selftests/bpf/progs/bpf_dctcp.c
> @@ -17,6 +17,9 @@
>
> char _license[] SEC("license") = "GPL";
>
> +volatile const char fallback[TCP_CA_NAME_MAX];
> +const char bpf_dctcp[] = "bpf_dctcp";
> +char cc_res[TCP_CA_NAME_MAX];
> int stg_result = 0;
>
> struct {
> @@ -57,6 +60,23 @@ void BPF_PROG(dctcp_init, struct sock *sk)
> struct dctcp *ca = inet_csk_ca(sk);
> int *stg;
>
> + if (!(tp->ecn_flags & TCP_ECN_OK) && fallback[0]) {
> + /* Switch to fallback */
> + bpf_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
> + (void *)fallback, sizeof(fallback));
> + /* Switch back to myself which the bpf trampoline
> + * stopped calling dctcp_init recursively.
> + */
> + bpf_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
> + (void *)bpf_dctcp, sizeof(bpf_dctcp));
> + /* Switch back to fallback */
> + bpf_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
> + (void *)fallback, sizeof(fallback));
> + bpf_getsockopt(sk, SOL_TCP, TCP_CONGESTION,
> + (void *)cc_res, sizeof(cc_res));
> + return;
Is there a possibility where we later on instead of return refetch ca ptr via
ca = inet_csk_ca(sk) and mangle its struct dctcp fields whereas we're actually
messing with the new ca's internal fields (potentially crashing the kernel e.g.
if there was a pointer in the private struct of the new ca that we'd be corrupting)?
> + }
> +
> ca->prior_rcv_nxt = tp->rcv_nxt;
> ca->dctcp_alpha = min(dctcp_alpha_on_init, DCTCP_MAX_ALPHA);
> ca->loss_cwnd = 0;
> diff --git a/tools/testing/selftests/bpf/progs/bpf_dctcp_release.c b/tools/testing/selftests/bpf/progs/bpf_dctcp_release.c
> new file mode 100644
> index 000000000000..d836f7c372f0
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/bpf_dctcp_release.c
> @@ -0,0 +1,26 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2021 Facebook */
> +
> +#include <stddef.h>
> +#include <linux/bpf.h>
> +#include <linux/types.h>
> +#include <linux/stddef.h>
> +#include <linux/tcp.h>
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +#include "bpf_tcp_helpers.h"
> +
> +char _license[] SEC("license") = "GPL";
> +const char cubic[] = "cubic";
> +
> +void BPF_STRUCT_OPS(dctcp_nouse_release, struct sock *sk)
> +{
> + bpf_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
> + (void *)cubic, sizeof(cubic));
> +}
> +
> +SEC(".struct_ops")
> +struct tcp_congestion_ops dctcp_rel = {
> + .release = (void *)dctcp_nouse_release,
> + .name = "bpf_dctcp_rel",
> +};
>
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH bpf-next 4/4] bpf: selftests: Add dctcp fallback test
2021-08-06 16:07 ` Daniel Borkmann
@ 2021-08-06 17:42 ` Martin KaFai Lau
0 siblings, 0 replies; 7+ messages in thread
From: Martin KaFai Lau @ 2021-08-06 17:42 UTC (permalink / raw)
To: Daniel Borkmann
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Eric Dumazet,
Yonghong Song, kernel-team, netdev
On Fri, Aug 06, 2021 at 06:07:01PM +0200, Daniel Borkmann wrote:
> On 8/5/21 7:01 AM, Martin KaFai Lau wrote:
> > This patch makes the bpf_dctcp test to fallback to cubic by
> > using setsockopt(TCP_CONGESTION) when the tcp flow is not
> > ecn ready.
> >
> > It also checks setsockopt() is not available to release().
> >
> > The settimeo() from the network_helpers.h is used, so the local
> > one is removed.
> >
> > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> [...]
> > diff --git a/tools/testing/selftests/bpf/progs/bpf_dctcp.c b/tools/testing/selftests/bpf/progs/bpf_dctcp.c
> > index fd42247da8b4..48df7ffbefdb 100644
> > --- a/tools/testing/selftests/bpf/progs/bpf_dctcp.c
> > +++ b/tools/testing/selftests/bpf/progs/bpf_dctcp.c
> > @@ -17,6 +17,9 @@
> > char _license[] SEC("license") = "GPL";
> > +volatile const char fallback[TCP_CA_NAME_MAX];
> > +const char bpf_dctcp[] = "bpf_dctcp";
> > +char cc_res[TCP_CA_NAME_MAX];
> > int stg_result = 0;
> > struct {
> > @@ -57,6 +60,23 @@ void BPF_PROG(dctcp_init, struct sock *sk)
> > struct dctcp *ca = inet_csk_ca(sk);
> > int *stg;
> > + if (!(tp->ecn_flags & TCP_ECN_OK) && fallback[0]) {
> > + /* Switch to fallback */
> > + bpf_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
> > + (void *)fallback, sizeof(fallback));
> > + /* Switch back to myself which the bpf trampoline
> > + * stopped calling dctcp_init recursively.
> > + */
> > + bpf_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
> > + (void *)bpf_dctcp, sizeof(bpf_dctcp));
> > + /* Switch back to fallback */
> > + bpf_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
> > + (void *)fallback, sizeof(fallback));
> > + bpf_getsockopt(sk, SOL_TCP, TCP_CONGESTION,
> > + (void *)cc_res, sizeof(cc_res));
> > + return;
>
> Is there a possibility where we later on instead of return refetch ca ptr via
> ca = inet_csk_ca(sk) and mangle its struct dctcp fields whereas we're actually
> messing with the new ca's internal fields (potentially crashing the kernel e.g.
> if there was a pointer in the private struct of the new ca that we'd be corrupting)?
Without switching to another tcp-cc,
if the bpf-tcp-cc was buggy (e.g. setting incorrect cwnd), it could also
slow down (or stall) the flow a lot by putting wrong values in its own
icsk_ca_priv.
About the potential pointer value in icsk_ca_priv,
the bpf-tcp-cc can only use the icsk_ca_priv as SCALAR, so switching
to another bpf-tcp-cc should be fine.
If a bpf-tcp-cc is switching to a kernel-tcp-cc, that kernel-tcp-cc
could potentially store a pointer in icsk_ca_priv. The only case I
know is the tcp_cdg.c when icsk_ca_priv is not large enough and it
has to resort to kcalloc and store this pointer in icsk_ca_priv.
Other kernel-tcp-cc stores its data inline in icsk_ca_priv.
The ICSK_CA_PRIV_SIZE has been increased a few times to
store new data inline instead of doing another kmalloc, so
this should be the common case. [cc: Eric]
It could disallow switching to kernel-tcp-cc but I think
it will just end up too limiting and forcing people
to create a bpf-tcp-cc shell to mimic the kernel-tcp-cc
during fallback. Considering only very limited kernel-tcp-cc
stores pointer in icsk_ca_priv, how about imposing a white/black
list for bpf_setsockopt(TCP_CONGESTION), e.g. disallow switching
to tcp_cdg? In the near future, the tagging feature that
Yonghong is working can be used to tag some specific kernel-tcp-cc's
struct that is switchable from bpf side (which most of them should
be switchable). [cc: Yonghong]
WDYT?
Thanks for the review!
^ permalink raw reply [flat|nested] 7+ messages in thread