LKML Archive on lore.kernel.org
* [PATCH] RDMA/mlx4: Spread completion vectors for proxy CQs
@ 2019-02-18 18:33 Håkon Bugge
  2019-02-19 14:58 ` Chuck Lever
  2019-06-10 17:53 ` Jason Gunthorpe
  0 siblings, 2 replies; 8+ messages in thread
From: Håkon Bugge @ 2019-02-18 18:33 UTC
  To: Yishai Hadas, Doug Ledford, Jason Gunthorpe, jackm, majd
  Cc: linux-rdma, linux-kernel

MAD packet sending/receiving is not properly virtualized in
CX-3. Hence, these are proxied through the PF driver. The proxying
uses UD QPs. The associated CQs are created with completion vector
zero.

This leads to great imbalance in CPU processing, in particular during
heavy RDMA CM traffic.

Solved by selecting the completion vector on a round-robin basis.

The imbalance can be demonstrated in a bare-metal environment, where
two nodes have instantiated 8 VFs each. The nodes use dual-ported HCAs,
so we have 16 vPorts per physical server.

64 processes are associated with each vPort and create and destroy
one QP for each of the remote 64 processes. That is, 1024 QPs per
vPort, all in all 16K QPs. The QPs are created/destroyed using the
CM.

Before this commit, we have (excluding all completion IRQs with zero
interrupts):

396: mlx4-1@0000:94:00.0 199126
397: mlx4-2@0000:94:00.0 1

With this commit:

396: mlx4-1@0000:94:00.0 12568
397: mlx4-2@0000:94:00.0 50772
398: mlx4-3@0000:94:00.0 10063
399: mlx4-4@0000:94:00.0 50753
400: mlx4-5@0000:94:00.0 6127
401: mlx4-6@0000:94:00.0 6114
[]
414: mlx4-19@0000:94:00.0 6122
415: mlx4-20@0000:94:00.0 6117

The added pr_info shows:

create_pv_resources: slave:0 port:1, vector:0, num_comp_vectors:62
create_pv_resources: slave:0 port:1, vector:1, num_comp_vectors:62
create_pv_resources: slave:0 port:2, vector:2, num_comp_vectors:62
create_pv_resources: slave:0 port:2, vector:3, num_comp_vectors:62
create_pv_resources: slave:1 port:1, vector:4, num_comp_vectors:62
create_pv_resources: slave:1 port:2, vector:5, num_comp_vectors:62
[]
create_pv_resources: slave:8 port:2, vector:18, num_comp_vectors:62
create_pv_resources: slave:8 port:1, vector:19, num_comp_vectors:62

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
---
 drivers/infiniband/hw/mlx4/mad.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/infiniband/hw/mlx4/mad.c b/drivers/infiniband/hw/mlx4/mad.c
index 936ee1314bcd..300839e7f519 100644
--- a/drivers/infiniband/hw/mlx4/mad.c
+++ b/drivers/infiniband/hw/mlx4/mad.c
@@ -1973,6 +1973,7 @@ static int create_pv_resources(struct ib_device *ibdev, int slave, int port,
 {
 	int ret, cq_size;
 	struct ib_cq_init_attr cq_attr = {};
+	static atomic_t comp_vect = ATOMIC_INIT(-1);
 
 	if (ctx->state != DEMUX_PV_STATE_DOWN)
 		return -EEXIST;
@@ -2002,6 +2003,9 @@ static int create_pv_resources(struct ib_device *ibdev, int slave, int port,
 		cq_size *= 2;
 
 	cq_attr.cqe = cq_size;
+	cq_attr.comp_vector = atomic_inc_return(&comp_vect) % ibdev->num_comp_vectors;
+	pr_info("slave:%d port:%d, vector:%d, num_comp_vectors:%d\n",
+		slave, port, cq_attr.comp_vector, ibdev->num_comp_vectors);
 	ctx->cq = ib_create_cq(ctx->ib_dev, mlx4_ib_tunnel_comp_handler,
 			       NULL, ctx, &cq_attr);
 	if (IS_ERR(ctx->cq)) {
-- 
2.20.1
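
A minimal userspace sketch of the round-robin selection described in
the commit message, for illustration only. Assumptions: C11 atomic_uint
stands in for the kernel's atomic_t, and next_comp_vector() is a
hypothetical helper, not an existing kernel function. The counter is
unsigned so the remainder stays within 0..num_comp_vectors-1 even after
the counter wraps; with a signed counter the remainder could in
principle turn negative after an overflow.

```c
#include <assert.h>
#include <stdatomic.h>

/* Shared ticket counter; static storage, so it starts at zero. */
static atomic_uint comp_vect_counter;

/* Hypothetical helper: hand out completion vectors round-robin. */
static unsigned int next_comp_vector(unsigned int num_comp_vectors)
{
	unsigned int ticket;

	assert(num_comp_vectors > 0);
	/* atomic_fetch_add() returns the value held before the increment */
	ticket = atomic_fetch_add(&comp_vect_counter, 1u);
	return ticket % num_comp_vectors;
}
```

With num_comp_vectors = 4, successive calls return 0, 1, 2, 3 and then
start over at 0.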



* Re: [PATCH] RDMA/mlx4: Spread completion vectors for proxy CQs
  2019-02-18 18:33 [PATCH] RDMA/mlx4: Spread completion vectors for proxy CQs Håkon Bugge
@ 2019-02-19 14:58 ` Chuck Lever
       [not found]   ` <66C92ED1-EE5E-4136-A7D7-DBF8A0816800@oracle.com>
  2019-06-10 17:53 ` Jason Gunthorpe
  1 sibling, 1 reply; 8+ messages in thread
From: Chuck Lever @ 2019-02-19 14:58 UTC
  To: Håkon Bugge
  Cc: Yishai Hadas, Doug Ledford, Jason Gunthorpe, jackm, majd,
	linux-rdma, linux-kernel

Hey Håkon-

> On Feb 18, 2019, at 1:33 PM, Håkon Bugge <haakon.bugge@oracle.com> wrote:
> 
> MAD packet sending/receiving is not properly virtualized in
> CX-3. Hence, these are proxied through the PF driver. The proxying
> uses UD QPs. The associated CQs are created with completion vector
> zero.
> 
> This leads to great imbalance in CPU processing, in particular during
> heavy RDMA CM traffic.
> 
> Solved by selecting the completion vector on a round-robin basis.

I've got a similar patch for NFS and NFSD. I'm wondering if this
should be turned into a core helper, simple as it is. Perhaps
it would be beneficial if all participating ULPs used the same
global counter?



--
Chuck Lever





* Re: [PATCH] RDMA/mlx4: Spread completion vectors for proxy CQs
       [not found]   ` <66C92ED1-EE5E-4136-A7D7-DBF8A0816800@oracle.com>
@ 2019-02-19 17:39     ` Chuck Lever
  2019-02-20 17:14     ` Jason Gunthorpe
  1 sibling, 0 replies; 8+ messages in thread
From: Chuck Lever @ 2019-02-19 17:39 UTC
  To: Håkon Bugge
  Cc: Yishai Hadas, Doug Ledford, Jason Gunthorpe, jackm, majd,
	OFED mailing list, linux-kernel



> On Feb 19, 2019, at 12:32 PM, Håkon Bugge <haakon.bugge@oracle.com> wrote:
> 
> 
> 
>> On 19 Feb 2019, at 15:58, Chuck Lever <chuck.lever@oracle.com> wrote:
>> 
>> Hey Håkon-
>> 
>>> On Feb 18, 2019, at 1:33 PM, Håkon Bugge <haakon.bugge@oracle.com> wrote:
>>> 
>>> MAD packet sending/receiving is not properly virtualized in
>>> CX-3. Hence, these are proxied through the PF driver. The proxying
>>> uses UD QPs. The associated CQs are created with completion vector
>>> zero.
>>> 
>>> This leads to great imbalance in CPU processing, in particular during
>>> heavy RDMA CM traffic.
>>> 
>>> Solved by selecting the completion vector on a round-robin basis.
>> 
>> I've got a similar patch for NFS and NFSD. I'm wondering if this
>> should be turned into a core helper, simple as it is. Perhaps
>> it would be beneficial if all participating ULPs used the same
>> global counter?
> 
> 
> A global counter works for this commit, because the QPs and associated CQs are (pretty) persistent. That is, VMs don't come and go that often.
> 
> In the more general ULP case, the usage model is probably a lot more intermittent. Hence, a least-load approach is probably better. That can be implemented in ib core. I've seen in the past an enum IB_CQ_USE_LEAST_LOAD_VECTOR for signalling this behaviour, defined to e.g. -1, that is, outside of 0..(num_comp_vectors-1).

Indeed, passing such a value to either ib_create_cq or ib_alloc_cq
could allow the compvec to be selected automatically. Using a
round-robin would be the first step towards something smarter, and
the ULPs need be none the wiser when more smart-i-tude eventually
comes along.
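
A userspace sketch of how such a sentinel could be resolved, under
stated assumptions: COMP_VECTOR_AUTO, struct vector_load,
resolve_comp_vector() and release_comp_vector() are hypothetical names
for illustration, not existing ib core symbols. "Least loaded" here
means only "fewest CQs currently bound to the vector"; the sketch does
not measure interrupt rates.

```c
#include <assert.h>

/* Hypothetical sentinel, outside the valid range 0..num_comp_vectors-1 */
#define COMP_VECTOR_AUTO (-1)

/* Hypothetical per-device bookkeeping: number of CQs bound to each vector */
struct vector_load {
	int num_comp_vectors;
	unsigned int *cqs_per_vector; /* num_comp_vectors counters */
};

/* Return the requested vector, or the least-populated one for the sentinel. */
static int resolve_comp_vector(struct vector_load *vl, int requested)
{
	int v, best = 0;

	assert(vl->num_comp_vectors > 0);
	if (requested != COMP_VECTOR_AUTO) {
		assert(requested >= 0 && requested < vl->num_comp_vectors);
		best = requested;
	} else {
		/* ties go to the lowest-numbered vector */
		for (v = 1; v < vl->num_comp_vectors; v++)
			if (vl->cqs_per_vector[v] < vl->cqs_per_vector[best])
				best = v;
	}
	vl->cqs_per_vector[best]++; /* account for the new CQ */
	return best;
}

/* To be called when a CQ bound to 'vector' is destroyed. */
static void release_comp_vector(struct vector_load *vl, int vector)
{
	assert(vector >= 0 && vector < vl->num_comp_vectors);
	assert(vl->cqs_per_vector[vector] > 0);
	vl->cqs_per_vector[vector]--;
}
```

A caller that passes an explicit vector keeps today's behaviour, while
a caller that passes the sentinel lets the bookkeeping choose.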


> But this mechanism doesn't know which CQs deliver the most interrupts. We lack an ib_modify_cq() that can change the CQ to EQ association, which is what it would take to _really_ spread the interrupts rather than just the CQ to EQ associations.
> 
> Anyway, Jason mentioned in a private email that maybe we could use the new completion API or something? I am not familiar with that one (yet).
> 
> Well, I can volunteer to do the least load approach in ib core and change all (plain stupid) zero comp_vectors in ULPs and core, if that seems like an interim approach.

Please update net/sunrpc/xprtrdma/{svc_rdma_,}transport.c as well.
It should be straightforward, and I'm happy to review and test as
needed.


> Thxs, Håkon

--
Chuck Lever





* Re: [PATCH] RDMA/mlx4: Spread completion vectors for proxy CQs
       [not found]   ` <66C92ED1-EE5E-4136-A7D7-DBF8A0816800@oracle.com>
  2019-02-19 17:39     ` Chuck Lever
@ 2019-02-20 17:14     ` Jason Gunthorpe
  2019-02-20 17:46       ` Håkon Bugge
  1 sibling, 1 reply; 8+ messages in thread
From: Jason Gunthorpe @ 2019-02-20 17:14 UTC
  To: Håkon Bugge
  Cc: Chuck Lever, Yishai Hadas, Doug Ledford, jackm, majd,
	OFED mailing list, linux-kernel

On Tue, Feb 19, 2019 at 06:32:50PM +0100, Håkon Bugge wrote:
>    Anyway, Jason mentioned in a private email that maybe we could use the
>    new completion API or something? I am not familiar with that one
>    (yet).

I was thinking of the stuff in core/cq.c - but it also doesn't have
automatic comp_vector balancing. It is the logical place to put
something like that though..

An API to manage a bundle of CPU affine CQ's is probably what most
ULPs really need.. (it makes little sense to create a unique CQ for
every QP)

alloc_bundle()
get_cqn_for_flow(bundle)
alloc_qp()
destroy_qp()
put_cqn_for_flow(bundle)
destroy_bundle();

Let the core code balance the cqn's and allocate (shared) CQ
resources.
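
The call sequence above could be modelled in userspace roughly as
follows. This is a sketch under assumptions, not a proposed kernel
interface: struct cq_bundle and the function bodies are invented for
illustration, a "CQ" is reduced to an index with a user count, and
put_cqn_for_flow() takes the CQ number as an extra argument so the
reference can be dropped on the right entry.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical bundle: a fixed set of shared CQs, e.g. one per vector */
struct cq_bundle {
	int nr_cqs;
	unsigned int *users; /* flows currently attached to each CQ */
};

static struct cq_bundle *alloc_bundle(int nr_cqs)
{
	struct cq_bundle *b;

	if (nr_cqs <= 0)
		return NULL;
	b = malloc(sizeof(*b));
	if (!b)
		return NULL;
	b->users = calloc((size_t)nr_cqs, sizeof(*b->users));
	if (!b->users) {
		free(b);
		return NULL;
	}
	b->nr_cqs = nr_cqs;
	return b;
}

/* Pick the CQ with the fewest attached flows and take a reference on it. */
static int get_cqn_for_flow(struct cq_bundle *b)
{
	int i, best = 0;

	for (i = 1; i < b->nr_cqs; i++)
		if (b->users[i] < b->users[best])
			best = i;
	b->users[best]++;
	return best;
}

/* Drop the reference taken by get_cqn_for_flow(). */
static void put_cqn_for_flow(struct cq_bundle *b, int cqn)
{
	assert(cqn >= 0 && cqn < b->nr_cqs && b->users[cqn] > 0);
	b->users[cqn]--;
}

static void destroy_bundle(struct cq_bundle *b)
{
	if (!b)
		return;
	free(b->users);
	free(b);
}
```

The QP creation and destruction steps would sit between the get and
put calls; they are left out because the sketch only tracks which
shared CQ each flow lands on.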

Jason


* Re: [PATCH] RDMA/mlx4: Spread completion vectors for proxy CQs
  2019-02-20 17:14     ` Jason Gunthorpe
@ 2019-02-20 17:46       ` Håkon Bugge
  2019-02-25 21:46         ` Sagi Grimberg
  0 siblings, 1 reply; 8+ messages in thread
From: Håkon Bugge @ 2019-02-20 17:46 UTC
  To: Jason Gunthorpe
  Cc: Chuck Lever, Yishai Hadas, Doug Ledford, jackm, majd,
	OFED mailing list, linux-kernel



> On 20 Feb 2019, at 18:14, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> On Tue, Feb 19, 2019 at 06:32:50PM +0100, Håkon Bugge wrote:
>>   Anyway, Jason mentioned in a private email that maybe we could use the
>>   new completion API or something? I am not familiar with that one
>>   (yet).
> 
> I was thinking of the stuff in core/cq.c - but it also doesn't have
> automatic comp_vector balancing. It is the logical place to put
> something like that though..
> 
> An API to manage a bundle of CPU affine CQ's is probably what most
> ULPs really need.. (it makes little sense to create a unique CQ for
> every QP)

ULPs behave way differently. E.g. RDS creates one tx and one rx CQ per QP.

As I wrote earlier, we do not have any modify_cq() that changes the comp_vector (EQ association). We can balance #CQ associated with the EQs, but we do not know their behaviour.

So, assume 2 completion EQs, and four CQs. CQa and CQb are associated with the first EQ, the two others with the second EQ. That's the "best" we can do. But, if CQa and CQb are the only ones generating events, we will have all interrupt processing on a single CPU. But if we now could modify CQa.comp_vector to be that of the second EQ, we could achieve balance. But not sure if the drivers are able to do this at all.
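
The two-EQ, four-CQ example can be worked through with a small
userspace sketch. eq_load() is a hypothetical helper written for this
illustration, and any event counts fed to it are made-up numbers, not
measurements.

```c
#include <assert.h>

/* Sum the per-CQ event counts onto the EQ each CQ is bound to. */
static void eq_load(const int *cq_to_eq, const unsigned long *cq_events,
		    int nr_cqs, unsigned long *eq_events, int nr_eqs)
{
	int i;

	for (i = 0; i < nr_eqs; i++)
		eq_events[i] = 0;
	for (i = 0; i < nr_cqs; i++) {
		assert(cq_to_eq[i] >= 0 && cq_to_eq[i] < nr_eqs);
		eq_events[cq_to_eq[i]] += cq_events[i];
	}
}
```

With the mapping {0, 0, 1, 1} and only CQa and CQb generating events,
the whole load lands on the first EQ even though the CQ counts are
balanced. Changing the first entry of the mapping to 1, which is what a
modify operation would allow, splits the load evenly.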

> alloc_bundle()

You mean alloc a bunch of CQs? How do you know their #cqes and cq_context?


Håkon


> get_cqn_for_flow(bundle)
> alloc_qp()
> destroy_qp()
> put_cqn_for_flow(bundle)
> destroy_bundle();
> 
> Let the core code balance the cqn's and allocate (shared) CQ
> resources.
> 
> Jason



* Re: [PATCH] RDMA/mlx4: Spread completion vectors for proxy CQs
  2019-02-20 17:46       ` Håkon Bugge
@ 2019-02-25 21:46         ` Sagi Grimberg
  0 siblings, 0 replies; 8+ messages in thread
From: Sagi Grimberg @ 2019-02-25 21:46 UTC
  To: Håkon Bugge, Jason Gunthorpe
  Cc: Chuck Lever, Yishai Hadas, Doug Ledford, jackm, majd,
	OFED mailing list, linux-kernel


>> I was thinking of the stuff in core/cq.c - but it also doesn't have
>> automatic comp_vector balancing. It is the logical place to put
>> something like that though..
>>
>> An API to manage a bundle of CPU affine CQ's is probably what most
>> ULPs really need.. (it makes little sense to create a unique CQ for
>> every QP)
> 
> ULPs behave way differently. E.g. RDS creates one tx and one rx CQ per QP.
> 
> As I wrote earlier, we do not have any modify_cq() that changes the comp_vector (EQ association). We can balance #CQ associated with the EQs, but we do not know their behaviour.
> 
> So, assume 2 completion EQs, and four CQs. CQa and CQb are associated with the first EQ, the two others with the second EQ. That's the "best" we can do. But, if CQa and CQb are the only ones generating events, we will have all interrupt processing on a single CPU. But if we now could modify CQa.comp_vector to be that of the second EQ, we could achieve balance. But not sure if the drivers are able to do this at all.
> 
>> alloc_bundle()
> 
> You mean alloc a bunch of CQs? How do you know their #cqes and cq_context?
> 
> 
> Håkon
> 
> 
>> get_cqn_for_flow(bundle)
>> alloc_qp()
>> destroy_qp()
>> put_cqn_for_flow(bundle)
>> destroy_bundle();
>>
>> Let the core code balance the cqn's and allocate (shared) CQ
>> resources.
>>
>> Jason
> 

I sent a simple patchset back in the day for it [1]. IIRC there was
some resistance to having multiple ULPs implicitly share the same
completion queues:

[1]:
--
RDMA/core: Add implicit per-device completion queue
  pools

Allow a ULP to ask the core to implicitly assign a completion
queue to a queue-pair based on a least-used search over per-device
CQ pools. The device CQ pools grow in a lazy fashion with every
QP creation.

In addition, expose an affinity hint for queue pair creation.
If passed, the core will attempt to attach a CQ with a completion
vector that is directed to the CPU core given by the affinity
hint.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
--

That one added implicit QP create flags:
--
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index bdb1279a415b..56d42e753eb4 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1098,11 +1098,22 @@ enum ib_qp_create_flags {
         IB_QP_CREATE_SCATTER_FCS                = 1 << 8,
         IB_QP_CREATE_CVLAN_STRIPPING            = 1 << 9,
         IB_QP_CREATE_SOURCE_QPN                 = 1 << 10,
+
+       /* only used by the core, not passed to low-level drivers */
+       IB_QP_CREATE_ASSIGN_CQS                 = 1 << 24,
+       IB_QP_CREATE_AFFINITY_HINT              = 1 << 25,
+
--

Then I modified it to add an ib_cq_pool that a ULP can allocate
privately and then get/put CQs from/to.

[2]:
--
IB/core: Add a simple CQ pool API

Using CQ pools is useful especially for target/server modes.
The server/target implementation will usually serve multiple clients
and will usually have an array of completion queues allocated for that.

In addition, usually the server/target implementation will use a least-used
scheme to select a completion vector for each completion queue in order
to achieve better parallelism.

Having the server/target rdma queue-pairs share completion queues as
much as possible is desirable as it allows for better completion
aggregation.
One downside of this approach is that some entries of the completion queues
might never be used in case the queue-pair sizes are not fixed.

This simple CQ pool API allows for both optimizations and exposes a simple
API to alloc/free a completion queue pool and get/put from the pool.

The pool starts by allocating a caller-defined batch of CQs, and grows
in batches in a lazy fashion.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
--

That one had the CQ pool API:
--
+struct ib_cq_pool *ib_alloc_cq_pool(struct ib_device *device, int nr_cqe,
+                       int nr_cqs, enum ib_poll_context poll_ctx);
+void ib_free_cq_pool(struct ib_cq_pool *pool);
+void ib_cq_pool_put(struct ib_cq *cq, unsigned int nents);
+struct ib_cq *ib_cq_pool_get(struct ib_cq_pool *pool, unsigned int nents);
--
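
The get/put behaviour described in that commit message can be modelled
in userspace along these lines. This is a sketch under assumptions and
not the code from the patchset: struct sketch_cq, struct
sketch_cq_pool and the pool_*() helpers are hypothetical, a CQ is
reduced to a count of unreserved entries, and "least used" is taken to
mean "most free entries".

```c
#include <assert.h>
#include <stdlib.h>

struct sketch_cq {
	unsigned int free_cqe; /* entries not yet reserved by any QP */
};

struct sketch_cq_pool {
	struct sketch_cq *cqs;
	int nr_cqs;          /* CQs currently in the pool */
	int batch;           /* CQs added each time the pool grows */
	unsigned int nr_cqe; /* entries per CQ */
};

/* Add one batch of empty CQs; returns 0 on success, -1 on failure. */
static int pool_grow(struct sketch_cq_pool *p)
{
	struct sketch_cq *n;
	int i;

	n = realloc(p->cqs, (size_t)(p->nr_cqs + p->batch) * sizeof(*n));
	if (!n)
		return -1;
	for (i = p->nr_cqs; i < p->nr_cqs + p->batch; i++)
		n[i].free_cqe = p->nr_cqe;
	p->cqs = n;
	p->nr_cqs += p->batch;
	return 0;
}

static int pool_init(struct sketch_cq_pool *p, unsigned int nr_cqe, int batch)
{
	assert(batch > 0 && nr_cqe > 0);
	p->cqs = NULL;
	p->nr_cqs = 0;
	p->batch = batch;
	p->nr_cqe = nr_cqe;
	return pool_grow(p);
}

/* Reserve nents entries on the least-used CQ; grow once if none has room. */
static int pool_get(struct sketch_cq_pool *p, unsigned int nents)
{
	int i, best, attempt;

	if (nents > p->nr_cqe)
		return -1; /* can never fit in a single CQ */
	for (attempt = 0; attempt < 2; attempt++) {
		best = -1;
		for (i = 0; i < p->nr_cqs; i++)
			if (p->cqs[i].free_cqe >= nents &&
			    (best < 0 ||
			     p->cqs[i].free_cqe > p->cqs[best].free_cqe))
				best = i;
		if (best >= 0) {
			p->cqs[best].free_cqe -= nents;
			return best;
		}
		if (pool_grow(p))
			break;
	}
	return -1;
}

/* Return nents entries previously reserved on CQ index cq. */
static void pool_put(struct sketch_cq_pool *p, int cq, unsigned int nents)
{
	assert(cq >= 0 && cq < p->nr_cqs);
	assert(p->cqs[cq].free_cqe + nents <= p->nr_cqe);
	p->cqs[cq].free_cqe += nents;
}

static void pool_free(struct sketch_cq_pool *p)
{
	free(p->cqs);
	p->cqs = NULL;
	p->nr_cqs = 0;
}
```

The sketch also shows the downside mentioned in the commit message:
once a CQ has fewer free entries than a request needs, those entries
stay unused until something is put back.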

I can try to revive this if this becomes interesting again to anyone..

Thoughts?


* Re: [PATCH] RDMA/mlx4: Spread completion vectors for proxy CQs
  2019-02-18 18:33 [PATCH] RDMA/mlx4: Spread completion vectors for proxy CQs Håkon Bugge
  2019-02-19 14:58 ` Chuck Lever
@ 2019-06-10 17:53 ` Jason Gunthorpe
  2019-06-11 14:55   ` Håkon Bugge
  1 sibling, 1 reply; 8+ messages in thread
From: Jason Gunthorpe @ 2019-06-10 17:53 UTC
  To: Håkon Bugge
  Cc: Yishai Hadas, Doug Ledford, jackm, majd, linux-rdma, linux-kernel

On Mon, Feb 18, 2019 at 07:33:02PM +0100, Håkon Bugge wrote:
> MAD packet sending/receiving is not properly virtualized in
> CX-3. Hence, these are proxied through the PF driver. The proxying
> uses UD QPs. The associated CQs are created with completion vector
> zero.
> 
> This leads to great imbalance in CPU processing, in particular during
> heavy RDMA CM traffic.
> 
> Solved by selecting the completion vector on a round-robin basis.
> 
> Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
> ---
>  drivers/infiniband/hw/mlx4/mad.c | 4 ++++
>  1 file changed, 4 insertions(+)

This has been on patchworks for too long. Is it still relevant, or
were you going to respin this with Chuck's 'least loaded' idea?

Thanks,
Jason


* Re: [PATCH] RDMA/mlx4: Spread completion vectors for proxy CQs
  2019-06-10 17:53 ` Jason Gunthorpe
@ 2019-06-11 14:55   ` Håkon Bugge
  0 siblings, 0 replies; 8+ messages in thread
From: Håkon Bugge @ 2019-06-11 14:55 UTC
  To: Jason Gunthorpe
  Cc: Yishai Hadas, Doug Ledford, jackm, majd, OFED mailing list, linux-kernel



> On 10 Jun 2019, at 19:53, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> On Mon, Feb 18, 2019 at 07:33:02PM +0100, Håkon Bugge wrote:
>> MAD packet sending/receiving is not properly virtualized in
>> CX-3. Hence, these are proxied through the PF driver. The proxying
>> uses UD QPs. The associated CQs are created with completion vector
>> zero.
>> 
>> This leads to great imbalance in CPU processing, in particular during
>> heavy RDMA CM traffic.
>> 
>> Solved by selecting the completion vector on a round-robin basis.
>> 
>> Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
>> ---
>> drivers/infiniband/hw/mlx4/mad.c | 4 ++++
>> 1 file changed, 4 insertions(+)
> 
> This has been on patchworks for too long. Is it still relevant, or
> were you going to respin this with Chuck's 'least loaded' idea?

Let me send a commit based on the least loaded idea this week.


Thxs, Håkon

> 
> Thanks,
> Jason



end of thread, other threads:[~2019-06-11 14:56 UTC | newest]

Thread overview: 8+ messages
2019-02-18 18:33 [PATCH] RDMA/mlx4: Spread completion vectors for proxy CQs Håkon Bugge
2019-02-19 14:58 ` Chuck Lever
     [not found]   ` <66C92ED1-EE5E-4136-A7D7-DBF8A0816800@oracle.com>
2019-02-19 17:39     ` Chuck Lever
2019-02-20 17:14     ` Jason Gunthorpe
2019-02-20 17:46       ` Håkon Bugge
2019-02-25 21:46         ` Sagi Grimberg
2019-06-10 17:53 ` Jason Gunthorpe
2019-06-11 14:55   ` Håkon Bugge
