LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH 0/2 REVIEW] Multiple transmit/receive queue kernel
@ 2007-02-09  0:09 Kok, Auke
  2007-02-09  0:09 ` [PATCH 1/2] NET: Multiple queue network device support Kok, Auke
  2007-02-09  0:09 ` [PATCH 2/2] e1000: Implement the new kernel API for multiqueue TX support Kok, Auke
  0 siblings, 2 replies; 17+ messages in thread
From: Kok, Auke @ 2007-02-09  0:09 UTC (permalink / raw)
  To: David Miller, Garzik, Jeff, netdev, linux-kernel
  Cc: Kok, Auke, Peter Waskiewicz Jr, Brandeburg, Jesse, Kok, Auke,
	Ronciak, John


Hi,

The following two patches are submitted for review. Please provide feedback
and comments as usual.

This set of patches implements a generic API for multiqueue-capable
devices to have network stack support to assign multiple flows into each
queue on the NIC.  It provides an interface for non-multiqueue devices
to coexist in a machine running with both types of devices, as well as
providing a modified queuing discipline to assign multiple flows in the
stack to individual queues on the NIC.  This is necessary to alleviate
more head-of-line-blocking issues with different priority flows, where
today QoS only prevents HOLB within the stack, but not in the device.

Some Intel PRO/1000 adapters support this hardware feature, so this
series includes a patch to the e1000 driver to enable this hardware
feature for those devices. You will need to setup the queue's using tc.
Limited queue statistics are available through ethtool for devices that
support multiple queues.

Regards,

Auke Kok
Peter Waskiewicz Jr.

---
These patches apply against commit b892afd1e60132a981b963929e352eabf3306ba2
of branch 'master' from torvalds/linux-2.6.

You can pull these patches through git:

git-pull git://lost.foo-projects.org/~ahkok/git/netdev-2.6 multiqueue

---
Summary:
Peter Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>:
     NET: Multiple queue network device support
     e1000: Implement the new kernel API for multiqueue TX support.

---
 drivers/net/Kconfig               |   13 +
 drivers/net/e1000/e1000.h         |   23 +++
 drivers/net/e1000/e1000_ethtool.c |   43 ++++++
 drivers/net/e1000/e1000_main.c    |  269 +++++++++++++++++++++++++++++++-------
 include/linux/netdevice.h         |   73 ++++++++++
 include/net/sch_generic.h         |    3 
 net/Kconfig                       |   20 ++
 net/core/dev.c                    |   51 +++++++
 net/sched/sch_generic.c           |  117 ++++++++++++++++
 net/sched/sch_prio.c              |  106 ++++++++++++++
 10 files changed, 668 insertions(+), 50 deletions(-)


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/2] NET: Multiple queue network device support
  2007-02-09  0:09 [PATCH 0/2 REVIEW] Multiple transmit/receive queue kernel Kok, Auke
@ 2007-02-09  0:09 ` Kok, Auke
  2007-02-27  1:03   ` David Miller
  2007-03-09 13:40   ` Thomas Graf
  2007-02-09  0:09 ` [PATCH 2/2] e1000: Implement the new kernel API for multiqueue TX support Kok, Auke
  1 sibling, 2 replies; 17+ messages in thread
From: Kok, Auke @ 2007-02-09  0:09 UTC (permalink / raw)
  To: David Miller, Garzik, Jeff, netdev, linux-kernel
  Cc: Kok, Auke, Peter Waskiewicz Jr, Brandeburg, Jesse, Kok, Auke,
	Ronciak, John


From: Peter Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>

Added an API and associated supporting routines for multiqueue network
devices.  This allows network devices supporting multiple TX queues to
configure each queue within the netdevice and manage each queue
independantly.  Changes to the PRIO Qdisc also allow a user to map
multiple flows to individual TX queues, taking advantage of each queue
on the device.

Signed-off-by: Peter Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>
---

 include/linux/netdevice.h |   73 ++++++++++++++++++++++++++++
 include/net/sch_generic.h |    3 +
 net/Kconfig               |   20 ++++++++
 net/core/dev.c            |   51 ++++++++++++++++++++
 net/sched/sch_generic.c   |  117 +++++++++++++++++++++++++++++++++++++++++++++
 net/sched/sch_prio.c      |  106 ++++++++++++++++++++++++++++++++++++++---
 6 files changed, 364 insertions(+), 6 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2e37f50..c7f94a8 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -106,6 +106,16 @@ struct netpoll_info;
 #define MAX_HEADER (LL_MAX_HEADER + 48)
 #endif
 
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+
+struct net_device_subqueue
+{
+	/* Give a lock and a flow control state for each queue */
+	unsigned long	state;
+	spinlock_t	queue_lock ____cacheline_aligned_in_smp;
+};
+#endif
+
 /*
  *	Network device statistics. Akin to the 2.0 ether stats but
  *	with byte counters.
@@ -339,6 +349,10 @@ struct net_device
 #define NETIF_F_GEN_CSUM	(NETIF_F_NO_CSUM | NETIF_F_HW_CSUM)
 #define NETIF_F_ALL_CSUM	(NETIF_F_IP_CSUM | NETIF_F_GEN_CSUM)
 
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+#define NETIF_F_MULTI_QUEUE	8192	/* Has multiple TX/RX queues */
+#endif
+
 	struct net_device	*next_sched;
 
 	/* Interface index. Unique device identifier	*/
@@ -532,6 +546,19 @@ struct net_device
 	struct device		dev;
 	/* space for optional statistics and wireless sysfs groups */
 	struct attribute_group  *sysfs_groups[3];
+
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+	/* To retrieve statistics per subqueue - FOR FUTURE USE */
+	struct net_device_stats* (*get_subqueue_stats)(struct net_device *dev,
+							int queue_index);
+
+	/* The TX queue control structures */
+	struct net_device_subqueue	*egress_subqueue;
+	int				egress_subqueue_count;
+	int			(*hard_start_subqueue_xmit)(struct sk_buff *skb,
+							struct net_device *dev,
+							int queue_index);
+#endif /* CONFIG_NET_MULTI_QUEUE_DEVICE */
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)
 
@@ -673,6 +700,52 @@ static inline int netif_running(const struct net_device *dev)
 	return test_bit(__LINK_STATE_START, &dev->state);
 }
 
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+
+/*
+ * Routines to manage the subqueues on a device.  We only need start
+ * stop, and a check if it's stopped.  All other device management is
+ * done at the overall netdevice level.
+ * Also test the netif state if we're multi-queue at all.
+ */
+static inline void netif_start_subqueue(struct net_device *dev, int queue_index)
+{
+	clear_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state);
+}
+
+static inline void netif_stop_subqueue(struct net_device *dev, int queue_index)
+{
+#ifdef CONFIG_NETPOLL_TRAP
+	if (netpoll_trap())
+		return;
+#endif
+	set_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state);
+}
+
+static inline int netif_subqueue_stopped(const struct net_device *dev,
+                                         int queue_index)
+{
+	return test_bit(__LINK_STATE_XOFF,
+	                &dev->egress_subqueue[queue_index].state);
+}
+
+static inline void netif_wake_subqueue(struct net_device *dev, int queue_index)
+{
+#ifdef CONFIG_NETPOLL_TRAP
+	if (netpoll_trap())
+		return;
+#endif
+	if (test_and_clear_bit(__LINK_STATE_XOFF,
+	                       &dev->egress_subqueue[queue_index].state))
+		__netif_schedule(dev);
+}
+
+static inline int netif_is_multiqueue(const struct net_device *dev)
+{
+	return (!!(NETIF_F_MULTI_QUEUE & dev->features));
+}
+#endif /* CONFIG_NET_MULTI_QUEUE_DEVICE */
+
 
 /* Use this variant when it is known for sure that it
  * is executing from interrupt context.
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 8208639..8751ba6 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -102,6 +102,9 @@ struct Qdisc_ops
 
 	int			(*dump)(struct Qdisc *, struct sk_buff *);
 	int			(*dump_stats)(struct Qdisc *, struct gnet_dump *);
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+	int			(*map_queue)(struct sk_buff *, struct Qdisc *);
+#endif
 
 	struct module		*owner;
 };
diff --git a/net/Kconfig b/net/Kconfig
index 7dfc949..796edb3 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -38,6 +38,26 @@ source "net/packet/Kconfig"
 source "net/unix/Kconfig"
 source "net/xfrm/Kconfig"
 
+config NET_MULTI_QUEUE_DEVICE
+	bool "Multiple queue network device support (EXPERIMENTAL)"
+	depends on NET_SCHED && EXPERIMENTAL
+	help
+	  Saying Y here will add support for network devices that have more than
+	  one physical TX queue and/or RX queue.
+
+	  Multiple queue devices will require qdiscs that understand how to
+	  queue to multiple targets.  The default packet scheduler will default
+	  to the first queue in a device.  In other words, if you need the
+	  ability to spread traffic across queues, your queueing discipline
+	  needs to know how to do that.
+
+	  Note that saying Y here will give preferential treatment to multiple
+	  queue devices in the network stack.  A slight drop in single-queue
+	  device performance may be seen.
+
+	  Say N here unless you know your network device supports multiple
+	  TX and/or RX queues.
+
 config INET
 	bool "TCP/IP networking"
 	---help---
diff --git a/net/core/dev.c b/net/core/dev.c
index 455d589..42b635c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1477,6 +1477,49 @@ gso:
 	skb->tc_verd = SET_TC_AT(skb->tc_verd,AT_EGRESS);
 #endif
 	if (q->enqueue) {
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+		int queue_index;
+		/* If we're a multi-queue device, get a queue index to lock */
+		if (netif_is_multiqueue(dev))
+		{
+			/* Get the queue index and lock it. */
+			if (likely(q->ops->map_queue)) {
+				queue_index = q->ops->map_queue(skb, q);
+				spin_lock(&dev->egress_subqueue[queue_index].queue_lock);
+				rc = q->enqueue(skb, q);
+				/*
+				 * Unlock because the underlying qdisc
+				 * may queue and send a packet from a
+				 * different queue.
+				 */
+				spin_unlock(&dev->egress_subqueue[queue_index].queue_lock);
+				qdisc_run(dev);
+				rc = rc == NET_XMIT_BYPASS
+				           ? NET_XMIT_SUCCESS : rc;
+				goto out;
+			} else {
+				printk(KERN_CRIT "Device %s is "
+					"multiqueue, but map_queue is "
+					"undefined in the qdisc!!\n",
+					dev->name);
+				kfree_skb(skb);
+			}
+		} else {
+			/* We're not a multi-queue device. */
+			spin_lock(&dev->queue_lock);
+			q = dev->qdisc;
+			if (q->enqueue) {
+				rc = q->enqueue(skb, q);
+				qdisc_run(dev);
+				spin_unlock(&dev->queue_lock);
+				rc = rc == NET_XMIT_BYPASS
+				           ? NET_XMIT_SUCCESS : rc;
+				goto out;
+			}
+			spin_unlock(&dev->queue_lock);
+		}
+#else /* No multi-queue support compiled in */
+
 		/* Grab device queue */
 		spin_lock(&dev->queue_lock);
 		q = dev->qdisc;
@@ -1489,6 +1532,7 @@ gso:
 			goto out;
 		}
 		spin_unlock(&dev->queue_lock);
+#endif /* CONFIG_NET_MULTI_QUEUE_DEVICE */
 	}
 
 	/* The device has no queue. Common case for software devices:
@@ -2879,6 +2923,9 @@ int register_netdevice(struct net_device *dev)
 	struct hlist_head *head;
 	struct hlist_node *p;
 	int ret;
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+	int i;
+#endif
 
 	BUG_ON(dev_boot_phase);
 	ASSERT_RTNL();
@@ -2894,6 +2941,10 @@ int register_netdevice(struct net_device *dev)
 #ifdef CONFIG_NET_CLS_ACT
 	spin_lock_init(&dev->ingress_lock);
 #endif
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+	for (i = 0; i < dev->egress_subqueue_count; i++)
+		spin_lock_init(&dev->egress_subqueue[i].queue_lock);
+#endif
 
 	dev->iflink = -1;
 
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index bc116bd..62ca23e 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -42,6 +42,7 @@
    to data, participating in scheduling must be additionally
    protected with dev->queue_lock spinlock.
 
+   For single-queue devices:
    The idea is the following:
    - enqueue, dequeue are serialized via top level device
      spinlock dev->queue_lock.
@@ -51,6 +52,12 @@
      hence this lock may be made without local bh disabling.
 
    qdisc_tree_lock must be grabbed BEFORE dev->queue_lock!
+
+   For multi-queue devices:
+   - enqueue, dequeue are serialized with individual queue locks,
+     dev->egress_subqueue[queue_index].queue_lock.
+   - Everything else with tree protection with single queue devices apply
+     to multiqueue devices.
  */
 DEFINE_RWLOCK(qdisc_tree_lock);
 
@@ -92,6 +99,9 @@ static inline int qdisc_restart(struct net_device *dev)
 {
 	struct Qdisc *q = dev->qdisc;
 	struct sk_buff *skb;
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+	int queue_index = 0;
+#endif
 
 	/* Dequeue packet */
 	if (((skb = dev->gso_skb)) || ((skb = q->dequeue(q)))) {
@@ -130,6 +140,72 @@ static inline int qdisc_restart(struct net_device *dev)
 		}
 		
 		{
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+			if (netif_is_multiqueue(dev)) {
+				if (likely(q->ops->map_queue)) {
+					queue_index = q->ops->map_queue(skb, q);
+				} else {
+					printk(KERN_CRIT "Device %s is "
+						"multiqueue, but map_queue is "
+						"undefined in the qdisc!!\n",
+						dev->name);
+					goto requeue;
+				}
+				spin_unlock(&dev->egress_subqueue[queue_index].queue_lock);
+				/* Check top level device, and any sub-device */
+				if ((!netif_queue_stopped(dev)) &&
+				  (!netif_subqueue_stopped(dev, queue_index))) {
+					int ret;
+					ret = dev->hard_start_subqueue_xmit(skb, dev, queue_index);
+					if (ret == NETDEV_TX_OK) {
+						if (!nolock) {
+							netif_tx_unlock(dev);
+						}
+						return -1;
+					}
+					if (ret == NETDEV_TX_LOCKED && nolock) {
+						spin_lock(&dev->egress_subqueue[queue_index].queue_lock);
+						goto collision;
+					}
+				}
+				/* NETDEV_TX_BUSY - we need to requeue */
+				/* Release the driver */
+				if (!nolock) {
+					netif_tx_unlock(dev);
+				}
+				spin_lock(&dev->egress_subqueue[queue_index].queue_lock);
+				q = dev->qdisc;
+			}
+			else
+			{
+				/* We're a single-queue device */
+				/* And release queue */
+				spin_unlock(&dev->queue_lock);
+				if (!netif_queue_stopped(dev)) {
+					int ret;
+
+					ret = dev->hard_start_xmit(skb, dev);
+					if (ret == NETDEV_TX_OK) {
+						if (!nolock) {
+							netif_tx_unlock(dev);
+						}
+						spin_lock(&dev->queue_lock);
+						return -1;
+					}
+					if (ret == NETDEV_TX_LOCKED && nolock) {
+						spin_lock(&dev->queue_lock);
+						goto collision;
+					}
+				}
+				/* NETDEV_TX_BUSY - we need to requeue */
+				/* Release the driver */
+				if (!nolock) {
+					netif_tx_unlock(dev);
+				}
+				spin_lock(&dev->queue_lock);
+				q = dev->qdisc;
+			}
+#else /* CONFIG_NET_MULTI_QUEUE_DEVICE */
 			/* And release queue */
 			spin_unlock(&dev->queue_lock);
 
@@ -157,6 +233,7 @@ static inline int qdisc_restart(struct net_device *dev)
 			} 
 			spin_lock(&dev->queue_lock);
 			q = dev->qdisc;
+#endif /* CONFIG_NET_MULTI_QUEUE_DEVICE */
 		}
 
 		/* Device kicked us out :(
@@ -175,9 +252,17 @@ requeue:
 		else
 			q->ops->requeue(skb, q);
 		netif_schedule(dev);
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+		if (netif_is_multiqueue(dev))
+			spin_unlock(&dev->egress_subqueue[queue_index].queue_lock);
+#endif
 		return 1;
 	}
 	BUG_ON((int) q->q.qlen < 0);
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+	if (netif_is_multiqueue(dev))
+		spin_unlock(&dev->egress_subqueue[queue_index].queue_lock);
+#endif
 	return q->q.qlen;
 }
 
@@ -287,12 +372,22 @@ static int noop_requeue(struct sk_buff *skb, struct Qdisc* qdisc)
 	return NET_XMIT_CN;
 }
 
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+static int noop_map_queue(struct sk_buff *skb, struct Qdisc *qdisc)
+{
+	return 0;
+}
+#endif
+
 struct Qdisc_ops noop_qdisc_ops = {
 	.id		=	"noop",
 	.priv_size	=	0,
 	.enqueue	=	noop_enqueue,
 	.dequeue	=	noop_dequeue,
 	.requeue	=	noop_requeue,
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+	.map_queue	=	noop_map_queue,
+#endif
 	.owner		=	THIS_MODULE,
 };
 
@@ -310,6 +405,9 @@ static struct Qdisc_ops noqueue_qdisc_ops = {
 	.enqueue	=	noop_enqueue,
 	.dequeue	=	noop_dequeue,
 	.requeue	=	noop_requeue,
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+	.map_queue	=	noop_map_queue,
+#endif
 	.owner		=	THIS_MODULE,
 };
 
@@ -356,10 +454,18 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc* qdisc)
 	struct sk_buff_head *list = qdisc_priv(qdisc);
 
 	for (prio = 0; prio < PFIFO_FAST_BANDS; prio++) {
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+		if (netif_is_multiqueue(qdisc->dev))
+			spin_lock(&qdisc->dev->egress_subqueue[0].queue_lock);
+#endif
 		if (!skb_queue_empty(list + prio)) {
 			qdisc->q.qlen--;
 			return __qdisc_dequeue_head(qdisc, list + prio);
 		}
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+		if (netif_is_multiqueue(qdisc->dev))
+			spin_unlock(&qdisc->dev->egress_subqueue[0].queue_lock);
+#endif
 	}
 
 	return NULL;
@@ -406,6 +512,14 @@ static int pfifo_fast_init(struct Qdisc *qdisc, struct rtattr *opt)
 	return 0;
 }
 
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+static int pfifo_fast_map_queue(struct sk_buff *skb, struct Qdisc *qdisc)
+{
+	/* Generic qdisc, so always return queue index 0. */
+	return 0;
+}
+#endif
+
 static struct Qdisc_ops pfifo_fast_ops = {
 	.id		=	"pfifo_fast",
 	.priv_size	=	PFIFO_FAST_BANDS * sizeof(struct sk_buff_head),
@@ -415,6 +529,9 @@ static struct Qdisc_ops pfifo_fast_ops = {
 	.init		=	pfifo_fast_init,
 	.reset		=	pfifo_fast_reset,
 	.dump		=	pfifo_fast_dump,
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+	.map_queue	=	pfifo_fast_map_queue,
+#endif
 	.owner		=	THIS_MODULE,
 };
 
diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c
index 2567b4c..0e24071 100644
--- a/net/sched/sch_prio.c
+++ b/net/sched/sch_prio.c
@@ -43,16 +43,19 @@ struct prio_sched_data
 	struct tcf_proto *filter_list;
 	u8  prio2band[TC_PRIO_MAX+1];
 	struct Qdisc *queues[TCQ_PRIO_BANDS];
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+	int band2queue[TC_PRIO_MAX + 1];
+#endif
 };
 
-
 static struct Qdisc *
-prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
+prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr, int *retband)
 {
 	struct prio_sched_data *q = qdisc_priv(sch);
 	u32 band = skb->priority;
 	struct tcf_result res;
 
+	*retband = band;
 	*qerr = NET_XMIT_BYPASS;
 	if (TC_H_MAJ(skb->priority) != sch->handle) {
 #ifdef CONFIG_NET_CLS_ACT
@@ -68,13 +71,16 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
 #else
 		if (!q->filter_list || tc_classify(skb, q->filter_list, &res)) {
 #endif
-			if (TC_H_MAJ(band))
+			if (TC_H_MAJ(band)) {
 				band = 0;
+				*retband = 0;
+			}
 			return q->queues[q->prio2band[band&TC_PRIO_MAX]];
 		}
 		band = res.classid;
 	}
 	band = TC_H_MIN(band) - 1;
+	*retband = band;
 	if (band > q->bands)
 		return q->queues[q->prio2band[0]];
 
@@ -86,8 +92,15 @@ prio_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 {
 	struct Qdisc *qdisc;
 	int ret;
+	int band; /* Unused in this function, just here for multi-queue */
+
+	/*
+	 * NOTE: if a device is multiqueue, then it is imperative the
+	 * underlying qdisc NOT be another PRIO qdisc.  Otherwise we're going
+	 * to deadlock.  Fixme please.
+	 */
+	qdisc = prio_classify(skb, sch, &ret, &band);
 
-	qdisc = prio_classify(skb, sch, &ret);
 #ifdef CONFIG_NET_CLS_ACT
 	if (qdisc == NULL) {
 
@@ -108,14 +121,34 @@ prio_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	return ret; 
 }
 
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+static int
+prio_map_queue(struct sk_buff *skb, struct Qdisc *sch)
+{
+	/*
+	 * We'll classify the skb here, so return the Qdisc back through
+	 * the function call.  The return value is the queue we need to
+	 * lock, which will be the value returned through band.  This will help
+	 * speed up enqueue, since it won't have to do the classify again.
+	 */
+	int ret;
+	int band;
+	struct prio_sched_data *q = qdisc_priv(sch);
+
+	sch = prio_classify(skb, sch, &ret, &band);
+	return q->band2queue[band];
+}
+#endif
 
 static int
 prio_requeue(struct sk_buff *skb, struct Qdisc* sch)
 {
 	struct Qdisc *qdisc;
 	int ret;
+	int band;
+
+	qdisc = prio_classify(skb, sch, &ret, &band);
 
-	qdisc = prio_classify(skb, sch, &ret);
 #ifdef CONFIG_NET_CLS_ACT
 	if (qdisc == NULL) {
 		if (ret == NET_XMIT_BYPASS)
@@ -141,18 +174,53 @@ prio_dequeue(struct Qdisc* sch)
 	struct sk_buff *skb;
 	struct prio_sched_data *q = qdisc_priv(sch);
 	int prio;
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+	int queue;
+#endif
 	struct Qdisc *qdisc;
 
+	/*
+	 * If we're multiqueue, the basic approach is try the lock on each
+	 * queue.  If it's locked, either we're already dequeuing, or enqueue
+	 * is doing something.  Go to the next band if we're locked.  Once we
+	 * have a packet, unlock the queue.  NOTE: the underlying qdisc CANNOT
+	 * be a PRIO qdisc, otherwise we will deadlock.  FIXME
+	 */
 	for (prio = 0; prio < q->bands; prio++) {
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+		if (netif_is_multiqueue(sch->dev)) {
+			queue = q->band2queue[prio];
+			if (spin_trylock(&sch->dev->egress_subqueue[queue].queue_lock)) {
+				qdisc = q->queues[prio];
+				skb = qdisc->dequeue(qdisc);
+				if (skb) {
+					sch->q.qlen--;
+					skb->priority = prio;
+					spin_unlock(&sch->dev->egress_subqueue[queue].queue_lock);
+					return skb;
+				}
+				spin_unlock(&sch->dev->egress_subqueue[queue].queue_lock);
+			}
+		} else {
+			qdisc = q->queues[prio];
+			skb = qdisc->dequeue(qdisc);
+			if (skb) {
+				sch->q.qlen--;
+				skb->priority = prio;
+				return skb;
+			}
+		}
+#else
 		qdisc = q->queues[prio];
 		skb = qdisc->dequeue(qdisc);
 		if (skb) {
 			sch->q.qlen--;
+			skb->priority = prio;
 			return skb;
 		}
+#endif
 	}
 	return NULL;
-
 }
 
 static unsigned int prio_drop(struct Qdisc* sch)
@@ -205,6 +273,10 @@ static int prio_tune(struct Qdisc *sch, struct rtattr *opt)
 	struct prio_sched_data *q = qdisc_priv(sch);
 	struct tc_prio_qopt *qopt = RTA_DATA(opt);
 	int i;
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+	int queue;
+	int qmapoffset;
+#endif
 
 	if (opt->rta_len < RTA_LENGTH(sizeof(*qopt)))
 		return -EINVAL;
@@ -247,6 +319,25 @@ static int prio_tune(struct Qdisc *sch, struct rtattr *opt)
 			}
 		}
 	}
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+	/* setup queue to band mapping */
+	if (netif_is_multiqueue(sch->dev)) {
+		if (q->bands == sch->dev->egress_subqueue_count)
+			qmapoffset = 0;
+		else
+			qmapoffset = q->bands / sch->dev->egress_subqueue_count;
+
+		queue = 0;
+		for (i = 0; i < q->bands; i++) {
+			q->band2queue[i] = queue;
+			if ( (((i + 1) - queue) == qmapoffset) &&
+			     (queue < (sch->dev->egress_subqueue_count - 1))) {
+				queue++;
+			}
+		}
+	}
+
+#endif
 	return 0;
 }
 
@@ -430,6 +521,9 @@ static struct Qdisc_ops prio_qdisc_ops = {
 	.destroy	=	prio_destroy,
 	.change		=	prio_tune,
 	.dump		=	prio_dump,
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+	.map_queue	=	prio_map_queue,
+#endif
 	.owner		=	THIS_MODULE,
 };
 



---
Auke Kok <auke-jan.h.kok@intel.com>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 2/2] e1000: Implement the new kernel API for multiqueue TX support.
  2007-02-09  0:09 [PATCH 0/2 REVIEW] Multiple transmit/receive queue kernel Kok, Auke
  2007-02-09  0:09 ` [PATCH 1/2] NET: Multiple queue network device support Kok, Auke
@ 2007-02-09  0:09 ` Kok, Auke
  2007-03-09 13:11   ` Thomas Graf
  1 sibling, 1 reply; 17+ messages in thread
From: Kok, Auke @ 2007-02-09  0:09 UTC (permalink / raw)
  To: David Miller, Garzik, Jeff, netdev, linux-kernel
  Cc: Kok, Auke, Peter Waskiewicz Jr, Brandeburg, Jesse, Kok, Auke,
	Ronciak, John


From: Peter Waskiewicz Jr. <peter.p.waskiewicz.jr@intel.com>

Several newer e1000 chipsets support multiple RX and TX queues. Most
commonly, 82571's and ESB2LAN support 2 rx and 2 rx queues.

Signed-off-by: Peter Waskiewicz Jr. <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>
---

 drivers/net/Kconfig               |   13 ++
 drivers/net/e1000/e1000.h         |   23 +++
 drivers/net/e1000/e1000_ethtool.c |   43 ++++++
 drivers/net/e1000/e1000_main.c    |  269 +++++++++++++++++++++++++++++++------
 4 files changed, 304 insertions(+), 44 deletions(-)

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index ad92b6a..2d758ab 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -1988,6 +1988,19 @@ config E1000_DISABLE_PACKET_SPLIT
 
 	  If in doubt, say N.
 
+config E1000_MQ
+	bool "Enable Tx/Rx Multiqueue Support (EXPERIMENTAL)"
+	depends on E1000 && NET_MULTI_QUEUE_DEVICE && EXPERIMENTAL
+	help
+	  Say Y here if you want to enable multiqueue support for supported
+	  e1000 devices.  This will enable both transmit and receive queues
+	  on devices that support them.
+
+	  In order to fully utilize the Tx queue support, use the SCH_PRIO
+	  queueing discipline.
+
+	  If in doubt, say N.
+
 source "drivers/net/ixp2000/Kconfig"
 
 config MYRI_SBUS
diff --git a/drivers/net/e1000/e1000.h b/drivers/net/e1000/e1000.h
index 689f158..cfcdc9d 100644
--- a/drivers/net/e1000/e1000.h
+++ b/drivers/net/e1000/e1000.h
@@ -108,6 +108,10 @@ struct e1000_adapter;
 #define E1000_MIN_RXD                       80
 #define E1000_MAX_82544_RXD               4096
 
+#ifdef CONFIG_E1000_MQ
+#define E1000_MAX_TX_QUEUES                  4
+#endif
+
 /* this is the size past which hardware will drop packets when setting LPE=0 */
 #define MAXIMUM_ETHERNET_VLAN_SIZE 1522
 
@@ -168,6 +172,12 @@ struct e1000_buffer {
 	uint16_t next_to_watch;
 };
 
+#ifdef CONFIG_E1000_MQ
+struct e1000_queue_stats {
+	uint64_t packets;
+	uint64_t bytes;
+};
+#endif
 
 struct e1000_ps_page { struct page *ps_page[PS_PAGE_BUFFERS]; };
 struct e1000_ps_page_dma { uint64_t ps_page_dma[PS_PAGE_BUFFERS]; };
@@ -188,9 +198,16 @@ struct e1000_tx_ring {
 	/* array of buffer information structs */
 	struct e1000_buffer *buffer_info;
 
+#ifdef CONFIG_E1000_MQ
+	/* for tx ring cleanup - needed for multiqueue */
+	spinlock_t tx_queue_lock;
+#endif
 	spinlock_t tx_lock;
 	uint16_t tdh;
 	uint16_t tdt;
+#ifdef CONFIG_E1000_MQ
+	struct e1000_queue_stats tx_stats;
+#endif
 	boolean_t last_tx_tso;
 };
 
@@ -218,6 +235,9 @@ struct e1000_rx_ring {
 
 	uint16_t rdh;
 	uint16_t rdt;
+#ifdef CONFIG_E1000_MQ
+	struct e1000_queue_stats rx_stats;
+#endif
 };
 
 #define E1000_DESC_UNUSED(R) \
@@ -271,6 +291,9 @@ struct e1000_adapter {
 
 	/* TX */
 	struct e1000_tx_ring *tx_ring;      /* One per active queue */
+#ifdef CONFIG_E1000_MQ
+	struct e1000_tx_ring **cpu_tx_ring; /* per-cpu */
+#endif
 	unsigned int restart_queue;
 	unsigned long tx_queue_len;
 	uint32_t txd_cmd;
diff --git a/drivers/net/e1000/e1000_ethtool.c b/drivers/net/e1000/e1000_ethtool.c
index 44ebc72..c8c1500 100644
--- a/drivers/net/e1000/e1000_ethtool.c
+++ b/drivers/net/e1000/e1000_ethtool.c
@@ -105,7 +105,14 @@ static const struct e1000_stats e1000_gstrings_stats[] = {
 	{ "dropped_smbus", E1000_STAT(stats.mgpdc) },
 };
 
+#ifdef CONFIG_E1000_MQ
+#define E1000_QUEUE_STATS_LEN \
+	(((struct e1000_adapter *)netdev->priv)->num_tx_queues + \
+	 ((struct e1000_adapter *)netdev->priv)->num_rx_queues) \
+	* (sizeof(struct e1000_queue_stats) / sizeof(uint64_t))
+#else
 #define E1000_QUEUE_STATS_LEN 0
+#endif
 #define E1000_GLOBAL_STATS_LEN	\
 	sizeof(e1000_gstrings_stats) / sizeof(struct e1000_stats)
 #define E1000_STATS_LEN (E1000_GLOBAL_STATS_LEN + E1000_QUEUE_STATS_LEN)
@@ -1909,6 +1916,11 @@ e1000_get_ethtool_stats(struct net_device *netdev,
 		struct ethtool_stats *stats, uint64_t *data)
 {
 	struct e1000_adapter *adapter = netdev_priv(netdev);
+#ifdef CONFIG_E1000_MQ
+	uint64_t *queue_stat;
+	int stat_count = sizeof(struct e1000_queue_stats) / sizeof(uint64_t);
+	int j, k;
+#endif
 	int i;
 
 	e1000_update_stats(adapter);
@@ -1917,12 +1929,29 @@ e1000_get_ethtool_stats(struct net_device *netdev,
 		data[i] = (e1000_gstrings_stats[i].sizeof_stat ==
 			sizeof(uint64_t)) ? *(uint64_t *)p : *(uint32_t *)p;
 	}
+#ifdef CONFIG_E1000_MQ
+	for (j = 0; j < adapter->num_tx_queues; j++) {
+		queue_stat = (uint64_t *)&adapter->tx_ring[j].tx_stats;
+		for (k = 0; k < stat_count; k++)
+			data[i + k] = queue_stat[k];
+		i += k;
+	}
+	for (j = 0; j < adapter->num_rx_queues; j++) {
+		queue_stat = (uint64_t *)&adapter->rx_ring[j].rx_stats;
+		for (k = 0; k < stat_count; k++)
+			data[i + k] = queue_stat[k];
+		i += k;
+	}
+#endif
 /*	BUG_ON(i != E1000_STATS_LEN); */
 }
 
 static void
 e1000_get_strings(struct net_device *netdev, uint32_t stringset, uint8_t *data)
 {
+#ifdef CONFIG_E1000_MQ
+	struct e1000_adapter *adapter = netdev_priv(netdev);
+#endif
 	uint8_t *p = data;
 	int i;
 
@@ -1937,6 +1966,20 @@ e1000_get_strings(struct net_device *netdev, uint32_t stringset, uint8_t *data)
 			       ETH_GSTRING_LEN);
 			p += ETH_GSTRING_LEN;
 		}
+#ifdef CONFIG_E1000_MQ
+		for (i = 0; i < adapter->num_tx_queues; i++) {
+			sprintf(p, "tx_queue_%u_packets", i);
+			p += ETH_GSTRING_LEN;
+			sprintf(p, "tx_queue_%u_bytes", i);
+			p += ETH_GSTRING_LEN;
+		}
+		for (i = 0; i < adapter->num_rx_queues; i++) {
+			sprintf(p, "rx_queue_%u_packets", i);
+			p+= ETH_GSTRING_LEN;
+			sprintf(p, "rx_queue_%u_bytes", i);
+			p += ETH_GSTRING_LEN;
+		}
+#endif
 /*		BUG_ON(p - data != E1000_STATS_LEN * ETH_GSTRING_LEN); */
 		break;
 	}
diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 619c892..a53f065 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -28,6 +28,10 @@
 
 #include "e1000.h"
 #include <net/ip6_checksum.h>
+#ifdef CONFIG_E1000_MQ
+#include <linux/cpu.h>
+#include <linux/smp.h>
+#endif
 
 char e1000_driver_name[] = "e1000";
 static char e1000_driver_string[] = "Intel(R) PRO/1000 Network Driver";
@@ -137,6 +141,9 @@ static void e1000_exit_module(void);
 static int e1000_probe(struct pci_dev *pdev, const struct pci_device_id *ent);
 static void __devexit e1000_remove(struct pci_dev *pdev);
 static int e1000_alloc_queues(struct e1000_adapter *adapter);
+#ifdef CONFIG_E1000_MQ
+static void e1000_setup_queue_mapping(struct e1000_adapter *adapter);
+#endif
 static int e1000_sw_init(struct e1000_adapter *adapter);
 static int e1000_open(struct net_device *netdev);
 static int e1000_close(struct net_device *netdev);
@@ -153,7 +160,13 @@ static void e1000_set_multi(struct net_device *netdev);
 static void e1000_update_phy_info(unsigned long data);
 static void e1000_watchdog(unsigned long data);
 static void e1000_82547_tx_fifo_stall(unsigned long data);
+static int e1000_xmit_frame_ring(struct sk_buff *skb, struct net_device *netdev,
+                                 struct e1000_tx_ring *tx_ring);
 static int e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev);
+#ifdef CONFIG_E1000_MQ
+static int e1000_subqueue_xmit_frame(struct sk_buff *skb,
+                                     struct net_device *netdev, int queue);
+#endif
 static struct net_device_stats * e1000_get_stats(struct net_device *netdev);
 static int e1000_change_mtu(struct net_device *netdev, int new_mtu);
 static int e1000_set_mac(struct net_device *netdev, void *p);
@@ -547,6 +560,10 @@ e1000_up(struct e1000_adapter *adapter)
 		                      E1000_DESC_UNUSED(ring));
 	}
 
+#ifdef CONFIG_E1000_MQ
+	e1000_setup_queue_mapping(adapter);
+#endif
+
 	adapter->tx_queue_len = netdev->tx_queue_len;
 
 #ifdef CONFIG_E1000_NAPI
@@ -900,7 +917,12 @@ e1000_probe(struct pci_dev *pdev,
 	pci_set_master(pdev);
 
 	err = -ENOMEM;
+#ifdef CONFIG_E1000_MQ
+	netdev = alloc_etherdev(sizeof(struct e1000_adapter) +
+		(sizeof(struct net_device_subqueue) * E1000_MAX_TX_QUEUES));
+#else
 	netdev = alloc_etherdev(sizeof(struct e1000_adapter));
+#endif
 	if (!netdev)
 		goto err_alloc_etherdev;
 
@@ -934,6 +956,9 @@ e1000_probe(struct pci_dev *pdev,
 	netdev->open = &e1000_open;
 	netdev->stop = &e1000_close;
 	netdev->hard_start_xmit = &e1000_xmit_frame;
+#ifdef CONFIG_E1000_MQ
+	netdev->hard_start_subqueue_xmit = &e1000_subqueue_xmit_frame;
+#endif
 	netdev->get_stats = &e1000_get_stats;
 	netdev->set_multicast_list = &e1000_set_multi;
 	netdev->set_mac_address = &e1000_set_mac;
@@ -1317,8 +1342,44 @@ e1000_sw_init(struct e1000_adapter *adapter)
 		hw->master_slave = E1000_MASTER_SLAVE;
 	}
 
+#ifdef CONFIG_E1000_MQ
+	/* Number of supported queues.
+	 * TODO: It's assumed num_rx_queues >= num_tx_queues, since multi-rx
+	 * queues are much more interesting.  Is it worth coding for the
+	 * possibility (however improbable) of num_tx_queues > num_rx_queues?
+	 */
+	switch (hw->mac_type) {
+	case e1000_82571:
+	case e1000_82572:
+	case e1000_80003es2lan:
+		adapter->num_tx_queues = 2;
+		adapter->num_rx_queues = 2;
+		break;
+
+	default:
+		/* All hardware before 82571 only have 1 queue each for Rx/Tx.
+		 * However, the 82571 family does not have MSI-X, so multi-
+		 * queue isn't enabled.
+		 * It'd be wise not to mess with this default case. :) */
+		adapter->num_tx_queues = 1;
+		adapter->num_rx_queues = 1;
+		netdev->egress_subqueue_count = 0;
+		break;
+	}
+	adapter->num_rx_queues = min(adapter->num_rx_queues, num_online_cpus());
+	adapter->num_tx_queues = min(adapter->num_tx_queues, num_online_cpus());
+
+	if ((adapter->num_tx_queues > 1) || (adapter->num_rx_queues > 1)) {
+		netdev->egress_subqueue = (struct net_device_subqueue *)((void *)adapter + sizeof(struct e1000_adapter));
+		netdev->egress_subqueue_count = adapter->num_tx_queues;
+		DPRINTK(DRV, INFO, "Multiqueue Enabled: RX queues = %u, "
+		        "TX queues = %u\n", adapter->num_rx_queues,
+		        adapter->num_tx_queues);
+	}
+#else
 	adapter->num_tx_queues = 1;
 	adapter->num_rx_queues = 1;
+#endif
 
 	if (e1000_alloc_queues(adapter)) {
 		DPRINTK(PROBE, ERR, "Unable to allocate memory for queues\n");
@@ -1334,13 +1395,16 @@ e1000_sw_init(struct e1000_adapter *adapter)
 		set_bit(__LINK_STATE_START, &adapter->polling_netdev[i].state);
 	}
 	spin_lock_init(&adapter->tx_queue_lock);
+#ifdef CONFIG_E1000_MQ
+	for (i = 0; i < adapter->num_tx_queues; i++)
+		spin_lock_init(&adapter->tx_ring[i].tx_queue_lock);
+#endif
 #endif
 
 	atomic_set(&adapter->irq_sem, 1);
 	spin_lock_init(&adapter->stats_lock);
 
 	set_bit(__E1000_DOWN, &adapter->flags);
-
 	return 0;
 }
 
@@ -1382,9 +1446,28 @@ e1000_alloc_queues(struct e1000_adapter *adapter)
 	}
 	memset(adapter->polling_netdev, 0, size);
 #endif
+#ifdef CONFIG_E1000_MQ
+	adapter->cpu_tx_ring = alloc_percpu(struct e1000_tx_ring *);
+#endif
 
 	return E1000_SUCCESS;
 }
+#ifdef CONFIG_E1000_MQ
+static void
+e1000_setup_queue_mapping(struct e1000_adapter *adapter)
+{
+	int i, cpu;
+
+	lock_cpu_hotplug();
+	i = 0;
+	for_each_online_cpu(cpu) {
+		*per_cpu_ptr(adapter->cpu_tx_ring, cpu) =
+		             &adapter->tx_ring[i % adapter->num_tx_queues];
+		i++;
+	}
+	unlock_cpu_hotplug();
+}
+#endif
 
 /**
  * e1000_open - Called when a network interface is made active
@@ -1636,23 +1719,20 @@ e1000_configure_tx(struct e1000_adapter *adapter)
 	struct e1000_hw *hw = &adapter->hw;
 	uint32_t tdlen, tctl, tipg, tarc;
 	uint32_t ipgr1, ipgr2;
+	int i;
 
 	/* Setup the HW Tx Head and Tail descriptor pointers */
-
-	switch (adapter->num_tx_queues) {
-	case 1:
-	default:
-		tdba = adapter->tx_ring[0].dma;
-		tdlen = adapter->tx_ring[0].count *
+	for (i = 0; i < adapter->num_tx_queues; i++) {
+		tdba = adapter->tx_ring[i].dma;
+		tdlen = adapter->tx_ring[i].count *
 			sizeof(struct e1000_tx_desc);
-		E1000_WRITE_REG(hw, TDLEN, tdlen);
-		E1000_WRITE_REG(hw, TDBAH, (tdba >> 32));
-		E1000_WRITE_REG(hw, TDBAL, (tdba & 0x00000000ffffffffULL));
-		E1000_WRITE_REG(hw, TDT, 0);
-		E1000_WRITE_REG(hw, TDH, 0);
-		adapter->tx_ring[0].tdh = ((hw->mac_type >= e1000_82543) ? E1000_TDH : E1000_82542_TDH);
-		adapter->tx_ring[0].tdt = ((hw->mac_type >= e1000_82543) ? E1000_TDT : E1000_82542_TDT);
-		break;
+		E1000_WRITE_REG(hw, TDLEN + (i << 8), tdlen);
+		E1000_WRITE_REG(hw, TDBAH + (i << 8), (tdba >> 32));
+		E1000_WRITE_REG(hw, TDBAL + (i << 8), (tdba & 0x00000000ffffffffULL));
+		E1000_WRITE_REG(hw, TDT + (i << 8), 0);
+		E1000_WRITE_REG(hw, TDH + (i << 8), 0);
+		adapter->tx_ring[i].tdh = ((hw->mac_type >= e1000_82543) ? E1000_TDH + (i << 8) : E1000_82542_TDH);
+		adapter->tx_ring[i].tdt = ((hw->mac_type >= e1000_82543) ? E1000_TDT + (i << 8) : E1000_82542_TDT);
 	}
 
 	/* Set the default values for the Tx Inter Packet Gap timer */
@@ -1999,6 +2079,7 @@ e1000_configure_rx(struct e1000_adapter *adapter)
 	uint64_t rdba;
 	struct e1000_hw *hw = &adapter->hw;
 	uint32_t rdlen, rctl, rxcsum, ctrl_ext;
+	int i;
 
 	if (adapter->rx_ps_pages) {
 		/* this is a 32 byte descriptor */
@@ -2042,20 +2123,67 @@ e1000_configure_rx(struct e1000_adapter *adapter)
 
 	/* Setup the HW Rx Head and Tail Descriptor Pointers and
 	 * the Base and Length of the Rx Descriptor Ring */
-	switch (adapter->num_rx_queues) {
-	case 1:
-	default:
-		rdba = adapter->rx_ring[0].dma;
-		E1000_WRITE_REG(hw, RDLEN, rdlen);
-		E1000_WRITE_REG(hw, RDBAH, (rdba >> 32));
-		E1000_WRITE_REG(hw, RDBAL, (rdba & 0x00000000ffffffffULL));
-		E1000_WRITE_REG(hw, RDT, 0);
-		E1000_WRITE_REG(hw, RDH, 0);
-		adapter->rx_ring[0].rdh = ((hw->mac_type >= e1000_82543) ? E1000_RDH : E1000_82542_RDH);
-		adapter->rx_ring[0].rdt = ((hw->mac_type >= e1000_82543) ? E1000_RDT : E1000_82542_RDT);
-		break;
+	for (i = 0; i < adapter->num_rx_queues; i++) {
+		rdba = adapter->rx_ring[i].dma;
+		E1000_WRITE_REG(hw, RDLEN + (i << 8), rdlen);
+		E1000_WRITE_REG(hw, RDBAH + (i << 8), (rdba >> 32));
+		E1000_WRITE_REG(hw, RDBAL + (i << 8), (rdba & 0x00000000ffffffffULL));
+		E1000_WRITE_REG(hw, RDT + (i << 8), 0);
+		E1000_WRITE_REG(hw, RDH + (i << 8), 0);
+		adapter->rx_ring[i].rdh = ((hw->mac_type >= e1000_82543) ? E1000_RDH + (i << 8) : E1000_82542_RDH);
+		adapter->rx_ring[i].rdt = ((hw->mac_type >= e1000_82543) ? E1000_RDT + (i << 8) : E1000_82542_RDT);
 	}
 
+#ifdef CONFIG_E1000_MQ
+	if (adapter->num_rx_queues > 1) {
+		u32 random[10];
+		u32 reta, mrqc;
+		int i;
+
+		get_random_bytes(&random[0], 40);
+
+		switch (adapter->num_rx_queues) {
+		default:
+			reta = 0x00800080;
+			mrqc = E1000_MRQC_ENABLE_RSS_2Q;
+			break;
+		}
+
+		/* Fill out redirection table */
+		for (i = 0; i < 32; i++)
+			E1000_WRITE_REG_ARRAY(hw, RETA, i, reta);
+		/* Fill out hash function seeds */
+		for (i = 0; i < 10; i++)
+			E1000_WRITE_REG_ARRAY(hw, RSSRK, i, random[i]);
+
+		mrqc |= (E1000_MRQC_RSS_FIELD_IPV4 |
+			 E1000_MRQC_RSS_FIELD_IPV4_TCP);
+
+		E1000_WRITE_REG(hw, MRQC, mrqc);
+
+		/* Multiqueue and packet checksumming are mutually exclusive. */
+		rxcsum = E1000_READ_REG(hw, RXCSUM);
+		rxcsum |= E1000_RXCSUM_PCSD;
+		E1000_WRITE_REG(hw, RXCSUM, rxcsum);
+	} else if (hw->mac_type >= e1000_82543) {
+		/* Enable 82543 Receive Checksum Offload for TCP and UDP */
+		rxcsum = E1000_READ_REG(hw, RXCSUM);
+		if (adapter->rx_csum == TRUE) {
+			rxcsum |= E1000_RXCSUM_TUOFL;
+
+			/* Enable 82571 IPv4 payload checksum for UDP fragments
+			 * Must be used in conjunction with packet-split. */
+			if ((hw->mac_type >= e1000_82571) &&
+			    (adapter->rx_ps_pages)) {
+				rxcsum |= E1000_RXCSUM_IPPCSE;
+			}
+		} else {
+			rxcsum &= ~E1000_RXCSUM_TUOFL;
+			/* don't need to clear IPPCSE as it defaults to 0 */
+		}
+		E1000_WRITE_REG(hw, RXCSUM, rxcsum);
+	}
+#else
 	/* Enable 82543 Receive Checksum Offload for TCP and UDP */
 	if (hw->mac_type >= e1000_82543) {
 		rxcsum = E1000_READ_REG(hw, RXCSUM);
@@ -2074,6 +2202,7 @@ e1000_configure_rx(struct e1000_adapter *adapter)
 		}
 		E1000_WRITE_REG(hw, RXCSUM, rxcsum);
 	}
+#endif
 
 	/* enable early receives on 82573, only takes effect if using > 2048
 	 * byte total frame size.  for example only for jumbo frames */
@@ -2555,6 +2684,9 @@ e1000_watchdog(unsigned long data)
 	struct e1000_tx_ring *txdr = adapter->tx_ring;
 	uint32_t link, tctl;
 	int32_t ret_val;
+#ifdef CONFIG_E1000_MQ
+	int i;
+#endif
 
 	ret_val = e1000_check_for_link(&adapter->hw);
 	if ((ret_val == E1000_ERR_PHY) &&
@@ -2652,6 +2784,12 @@ e1000_watchdog(unsigned long data)
 
 			netif_carrier_on(netdev);
 			netif_wake_queue(netdev);
+#ifdef CONFIG_E1000_MQ
+			if (netif_is_multiqueue(netdev))
+				for (i = 0; i < adapter->num_tx_queues; i++)
+					netif_wake_subqueue(netdev, i);
+#endif
+
 			mod_timer(&adapter->phy_info_timer, jiffies + 2 * HZ);
 			adapter->smartspeed = 0;
 		} else {
@@ -3249,10 +3387,10 @@ static int e1000_maybe_stop_tx(struct net_device *netdev,
 
 #define TXD_USE_COUNT(S, X) (((S) >> (X)) + 1 )
 static int
-e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
+e1000_xmit_frame_ring(struct sk_buff *skb, struct net_device *netdev,
+                      struct e1000_tx_ring *tx_ring)
 {
 	struct e1000_adapter *adapter = netdev_priv(netdev);
-	struct e1000_tx_ring *tx_ring;
 	unsigned int first, max_per_txd = E1000_MAX_DATA_PER_TXD;
 	unsigned int max_txd_pwr = E1000_MAX_TXD_PWR;
 	unsigned int tx_flags = 0;
@@ -3265,12 +3403,6 @@ e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
 	unsigned int f;
 	len -= skb->data_len;
 
-	/* This goes back to the question of how to logically map a tx queue
-	 * to a flow.  Right now, performance is impacted slightly negatively
-	 * if using multiple tx queues.  If the stack breaks away from a
-	 * single qdisc implementation, we can look at this again. */
-	tx_ring = adapter->tx_ring;
-
 	if (unlikely(skb->len <= 0)) {
 		dev_kfree_skb_any(skb);
 		return NETDEV_TX_OK;
@@ -3425,6 +3557,31 @@ e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
 	return NETDEV_TX_OK;
 }
 
+static int
+e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
+{
+	struct e1000_adapter *adapter = netdev_priv(netdev);
+	struct e1000_tx_ring *tx_ring = adapter->tx_ring;
+
+	/* This goes back to the question of how to logically map a tx queue
+	 * to a flow.  Right now, performance is impacted slightly negatively
+	 * if using multiple tx queues.  If the stack breaks away from a
+	 * single qdisc implementation, we can look at this again. */
+	return (e1000_xmit_frame_ring(skb, netdev, tx_ring));
+}
+
+#ifdef CONFIG_E1000_MQ
+static int
+e1000_subqueue_xmit_frame(struct sk_buff *skb, struct net_device *netdev,
+                          int queue)
+{
+	struct e1000_adapter *adapter = netdev_priv(netdev);
+	struct e1000_tx_ring *tx_ring = &adapter->tx_ring[queue];
+
+	return (e1000_xmit_frame_ring(skb, netdev, tx_ring));
+}
+#endif
+
 /**
  * e1000_tx_timeout - Respond to a Tx Hang
  * @netdev: network interface device structure
@@ -3924,6 +4081,7 @@ e1000_clean(struct net_device *poll_dev, int *budget)
 	struct e1000_adapter *adapter;
 	int work_to_do = min(*budget, poll_dev->quota);
 	int tx_cleaned = 0, work_done = 0;
+	int i;
 
 	/* Must NOT use netdev_priv macro here. */
 	adapter = poll_dev->priv;
@@ -3933,18 +4091,27 @@ e1000_clean(struct net_device *poll_dev, int *budget)
 		goto quit_polling;
 
 	/* e1000_clean is called per-cpu.  This lock protects
-	 * tx_ring[0] from being cleaned by multiple cpus
+	 * tx_ring[i] from being cleaned by multiple cpus
 	 * simultaneously.  A failure obtaining the lock means
-	 * tx_ring[0] is currently being cleaned anyway. */
-	if (spin_trylock(&adapter->tx_queue_lock)) {
-		tx_cleaned = e1000_clean_tx_irq(adapter,
-		                                &adapter->tx_ring[0]);
-		spin_unlock(&adapter->tx_queue_lock);
+	 * tx_ring[i] is currently being cleaned anyway. */
+	for (i = 0; i < adapter->num_tx_queues; i++) {
+#ifdef CONFIG_E1000_MQ
+		if (spin_trylock(&adapter->tx_ring[i].tx_queue_lock)) {
+			tx_cleaned = e1000_clean_tx_irq(adapter,
+			                                &adapter->tx_ring[i]);
+			spin_unlock(&adapter->tx_ring[i].tx_queue_lock);
+		}
+#else
+		if (spin_trylock(&adapter->tx_queue_lock)) {
+			tx_cleaned = e1000_clean_tx_irq(adapter,
+			                                &adapter->tx_ring[i]);
+			spin_unlock(&adapter->tx_queue_lock);
+		}
+#endif
+		adapter->clean_rx(adapter, &adapter->rx_ring[i],
+		                  &work_done, work_to_do);
 	}
 
-	adapter->clean_rx(adapter, &adapter->rx_ring[0],
-	                  &work_done, work_to_do);
-
 	*budget -= work_done;
 	poll_dev->quota -= work_done;
 
@@ -3992,6 +4159,9 @@ e1000_clean_tx_irq(struct e1000_adapter *adapter,
 			buffer_info = &tx_ring->buffer_info[i];
 			cleaned = (i == eop);
 
+#ifdef CONFIG_E1000_MQ
+			tx_ring->tx_stats.bytes += buffer_info->length;
+#endif
 			if (cleaned) {
 				struct sk_buff *skb = buffer_info->skb;
 				unsigned int segs, bytecount;
@@ -4008,6 +4178,9 @@ e1000_clean_tx_irq(struct e1000_adapter *adapter,
 			if (unlikely(++i == tx_ring->count)) i = 0;
 		}
 
+#ifdef CONFIG_E1000_MQ
+		tx_ring->tx_stats.packets++;
+#endif
 		eop = tx_ring->buffer_info[i].next_to_watch;
 		eop_desc = E1000_TX_DESC(*tx_ring, eop);
 #ifdef CONFIG_E1000_NAPI
@@ -4269,6 +4442,10 @@ e1000_clean_rx_irq(struct e1000_adapter *adapter,
 		}
 #endif /* CONFIG_E1000_NAPI */
 		netdev->last_rx = jiffies;
+#ifdef CONFIG_E1000_MQ
+		rx_ring->rx_stats.packets++;
+		rx_ring->rx_stats.bytes += length;
+#endif
 
 next_desc:
 		rx_desc->status = 0;
@@ -4375,6 +4552,10 @@ e1000_clean_rx_irq_ps(struct e1000_adapter *adapter,
 
 		/* Good Receive */
 		skb_put(skb, length);
+#ifdef CONFIG_E1000_MQ
+		rx_ring->rx_stats.packets++;
+		rx_ring->rx_stats.bytes += skb->len;
+#endif
 
 		{
 		/* this looks ugly, but it seems compiler issues make it



---
Auke Kok <auke-jan.h.kok@intel.com>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] NET: Multiple queue network device support
  2007-02-09  0:09 ` [PATCH 1/2] NET: Multiple queue network device support Kok, Auke
@ 2007-02-27  1:03   ` David Miller
  2007-02-27 19:38     ` Waskiewicz Jr, Peter P
  2007-03-07 22:18     ` Waskiewicz Jr, Peter P
  2007-03-09 13:40   ` Thomas Graf
  1 sibling, 2 replies; 17+ messages in thread
From: David Miller @ 2007-02-27  1:03 UTC (permalink / raw)
  To: auke-jan.h.kok
  Cc: jgarzik, netdev, linux-kernel, peter.p.waskiewicz.jr,
	jesse.brandeburg, auke, john.ronciak

From: "Kok, Auke" <auke-jan.h.kok@intel.com>
Date: Thu, 08 Feb 2007 16:09:50 -0800

> From: Peter Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
> 
> Added an API and associated supporting routines for multiqueue network
> devices.  This allows network devices supporting multiple TX queues to
> configure each queue within the netdevice and manage each queue
> independantly.  Changes to the PRIO Qdisc also allow a user to map
> multiple flows to individual TX queues, taking advantage of each queue
> on the device.
> 
> Signed-off-by: Peter Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
> Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>

Thanks for posting this.

I was wonding if it would be better to have the ->enqueue() perform
the mapping operation, store the queue index selected inside of the
skb, and just use the normal ->hard_start_xmit() to send the packet
out?

The only thing to take care of is the per-queue locking, but that
should be easily doable.

We might be able to do this even without making sk_buff any larger.
For example, I suppose skb->priority might be deducable down to
a u16.  After a quick audit I really cannot see any legitimate use
of anything larger than 16-bit values, but a more thorough audit
would be necessary to validate this.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [PATCH 1/2] NET: Multiple queue network device support
  2007-02-27  1:03   ` David Miller
@ 2007-02-27 19:38     ` Waskiewicz Jr, Peter P
  2007-03-07 22:18     ` Waskiewicz Jr, Peter P
  1 sibling, 0 replies; 17+ messages in thread
From: Waskiewicz Jr, Peter P @ 2007-02-27 19:38 UTC (permalink / raw)
  To: David Miller
  Cc: jgarzik, netdev, linux-kernel, Brandeburg, Jesse, auke, Ronciak,
	John, Kok, Auke-jan H

Dave,
	Thanks for the feedback.  Please see below for my comments /
questions.

PJ Waskiewicz
Intel Corporation

> > From: Peter Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
> > 
> > Added an API and associated supporting routines for 
> multiqueue network 
> > devices.  This allows network devices supporting multiple 
> TX queues to 
> > configure each queue within the netdevice and manage each queue 
> > independantly.  Changes to the PRIO Qdisc also allow a user to map 
> > multiple flows to individual TX queues, taking advantage of 
> each queue 
> > on the device.
> > 
> > Signed-off-by: Peter Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
> > Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>
> 
> Thanks for posting this.
> 
> I was wonding if it would be better to have the ->enqueue() 
> perform the mapping operation, store the queue index selected 
> inside of the skb, and just use the normal 
> ->hard_start_xmit() to send the packet out?

One of the reasons I chose to add more functions is for the queue
management itself.  Specifically, I needed the ability to lock queues,
stop queues (netif_stop_queue), and wake them up, independent of the
global queue_lock in the netdevice.  At first I did think about storing
the queue index in the skb, but that won't help with the queue locking,
since only netdevice is passed to them outside of the context of the
skb.  As for adding the additional multiqueue version of
hard_start_subqueue_xmit(), I did that so non-MQ devices would have a
clean, untouched path in dev_queue_xmit(), and only MQ devices would
touch a new codepath.  This also removes the need for the device driver
to inspect the queue index in the skb for which Tx ring to place the
packet in, which is good for performance.

For having ->enqueue() take care of the mapping, I thought for awhile
about this.  Originally, I had ->enqueue() doing just that, until I
realized I was not locking the individual subqueues correctly.  The
thought process was once I receive an skb for Tx, I need to lock a
queue, enqueue the skb, call qdisc_run(), then unlock.  That grew into
lock a queue, enqueue, unlock the queue, have dequeue decide which queue
to pull an skb from (since the queue an skb comes from can be different
than the one just enqueued to), lock it, dequeue, unlock, and return the
skb.  The whole issue came down to needing to know what queue to lock
before enqueuing.  If I called enqueue, I would be going in unlocked.  I
suppose this would be ok if the lock is done inside the ->enqueue(), and
unlock is done before exiting; let me look at that possibility.

> 
> The only thing to take care of is the per-queue locking, but 
> that should be easily doable.
> 
> We might be able to do this even without making sk_buff any larger.
> For example, I suppose skb->priority might be deducable down 
> to a u16.  After a quick audit I really cannot see any 
> legitimate use of anything larger than 16-bit values, but a 
> more thorough audit would be necessary to validate this.

The skb->priority field right now is used to determine the PRIO /
pfifo_fast band to place the skb in on enqueue.  I still need to think
about if storing the queue index in the skb would help, due to the queue
management functions outside of the skb context.

Again, thanks for the feedback.  I'll respond once I've looked at moving
locks into ->enqueue() and removing ->map_queue().  Please note that the
PRIO band to queue mapping in this patch has a problem when the number
of bands is <= the number of Tx queues on the NIC.  I have a fix, and
will include that in a repost.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [PATCH 1/2] NET: Multiple queue network device support
  2007-02-27  1:03   ` David Miller
  2007-02-27 19:38     ` Waskiewicz Jr, Peter P
@ 2007-03-07 22:18     ` Waskiewicz Jr, Peter P
  2007-03-07 22:42       ` David Miller
  1 sibling, 1 reply; 17+ messages in thread
From: Waskiewicz Jr, Peter P @ 2007-03-07 22:18 UTC (permalink / raw)
  To: David Miller
  Cc: jgarzik, netdev, linux-kernel, Brandeburg, Jesse, auke, Ronciak, John

Dave,
	I did some research based on your feedback.  Specifically, I
looked at removing ->map_queue() and allowing ->enqueue() to take care
of mapping and locking of the queues (and ->dequeue()).  I found that it
can be done either way, but I'm not sure I like taking the locking out
of dev_queue_xmit(), and relying on the qdisc ->enqueue() to make sure
locking is completed correctly.  Having a map routine also allows
various parts of the stack to query the skb where it belongs.
	I also looked into storing the mapped queue value in the skb,
namely in skb->priority.  If I were to use this value, I'd lose
filtering capabilities in sch_prio.c, where if no tc filter exists for
the skb in question, it relies on the value of skb->priority to enqueue
to a band.  The value of skb->priority could also be used in a base
driver for further filtering (some of the MAC-based QoS on wireless
devices comes to mind), so I'm not sure I feel comfortable using that
field to store the queue.  I also envision some qdiscs in the future
that could change the mapping of bands to queues without reloading the
qdisc, so storing the queue information in the skb could lead to
out-of-order packets into queues (stale information in the skb).
	What do you think?

Thanks,

PJ Waskiewicz
Software Engineer
LAD - LAN Access Division, New Technologies
Intel Corporation
peter.p.waskiewicz.jr@intel.com
 

> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net] 
> Sent: Monday, February 26, 2007 5:03 PM
> To: Kok, Auke-jan H
> Cc: jgarzik@pobox.com; netdev@vger.kernel.org; 
> linux-kernel@vger.kernel.org; Waskiewicz Jr, Peter P; 
> Brandeburg, Jesse; auke@foo-projects.org; Ronciak, John
> Subject: Re: [PATCH 1/2] NET: Multiple queue network device support
> 
> From: "Kok, Auke" <auke-jan.h.kok@intel.com>
> Date: Thu, 08 Feb 2007 16:09:50 -0800
> 
> > From: Peter Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
> > 
> > Added an API and associated supporting routines for 
> multiqueue network 
> > devices.  This allows network devices supporting multiple 
> TX queues to 
> > configure each queue within the netdevice and manage each queue 
> > independantly.  Changes to the PRIO Qdisc also allow a user to map 
> > multiple flows to individual TX queues, taking advantage of 
> each queue 
> > on the device.
> > 
> > Signed-off-by: Peter Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
> > Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>
> 
> Thanks for posting this.
> 
> I was wonding if it would be better to have the ->enqueue() 
> perform the mapping operation, store the queue index selected 
> inside of the skb, and just use the normal 
> ->hard_start_xmit() to send the packet out?
> 
> The only thing to take care of is the per-queue locking, but 
> that should be easily doable.
> 
> We might be able to do this even without making sk_buff any larger.
> For example, I suppose skb->priority might be deducable down 
> to a u16.  After a quick audit I really cannot see any 
> legitimate use of anything larger than 16-bit values, but a 
> more thorough audit would be necessary to validate this.
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] NET: Multiple queue network device support
  2007-03-07 22:18     ` Waskiewicz Jr, Peter P
@ 2007-03-07 22:42       ` David Miller
  2007-03-09  7:26         ` Jarek Poplawski
  0 siblings, 1 reply; 17+ messages in thread
From: David Miller @ 2007-03-07 22:42 UTC (permalink / raw)
  To: peter.p.waskiewicz.jr
  Cc: jgarzik, netdev, linux-kernel, jesse.brandeburg, auke, john.ronciak


I didn't say to use skb->priority, I said to shrink skb->priority down
to a u16 and then make another u16 which will store your queue mapping
value.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] NET: Multiple queue network device support
  2007-03-07 22:42       ` David Miller
@ 2007-03-09  7:26         ` Jarek Poplawski
  0 siblings, 0 replies; 17+ messages in thread
From: Jarek Poplawski @ 2007-03-09  7:26 UTC (permalink / raw)
  To: David Miller
  Cc: peter.p.waskiewicz.jr, jgarzik, netdev, linux-kernel,
	jesse.brandeburg, auke, john.ronciak

On 07-03-2007 23:42, David Miller wrote:
> I didn't say to use skb->priority, I said to shrink skb->priority down
> to a u16 and then make another u16 which will store your queue mapping
> value.

Peter is right: this is fully used by schedulers (prio,
CBQ, HTB, HFSC...) and would break users' scripts, so I
wouldn't recommend, too.

Regards,
Jarek P.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/2] e1000: Implement the new kernel API for multiqueue TX support.
  2007-02-09  0:09 ` [PATCH 2/2] e1000: Implement the new kernel API for multiqueue TX support Kok, Auke
@ 2007-03-09 13:11   ` Thomas Graf
  0 siblings, 0 replies; 17+ messages in thread
From: Thomas Graf @ 2007-03-09 13:11 UTC (permalink / raw)
  To: Kok, Auke
  Cc: David Miller, Garzik, Jeff, netdev, linux-kernel,
	Peter Waskiewicz Jr, Brandeburg, Jesse, Kok, Auke, Ronciak, John

* Kok, Auke <auke-jan.h.kok@intel.com> 2007-02-08 16:09
> 
> From: Peter Waskiewicz Jr. <peter.p.waskiewicz.jr@intel.com>
> 
> Several newer e1000 chipsets support multiple RX and TX queues. Most
> commonly, 82571's and ESB2LAN support 2 rx and 2 rx queues.
> 
> Signed-off-by: Peter Waskiewicz Jr. <peter.p.waskiewicz.jr@intel.com>
> Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>
> ---
> 
>  drivers/net/Kconfig               |   13 ++
>  drivers/net/e1000/e1000.h         |   23 +++
>  drivers/net/e1000/e1000_ethtool.c |   43 ++++++
>  drivers/net/e1000/e1000_main.c    |  269 +++++++++++++++++++++++++++++++------
>  4 files changed, 304 insertions(+), 44 deletions(-)
> 
> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> index ad92b6a..2d758ab 100644
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -1988,6 +1988,19 @@ config E1000_DISABLE_PACKET_SPLIT
>  
>  	  If in doubt, say N.
>  
> +config E1000_MQ
> +	bool "Enable Tx/Rx Multiqueue Support (EXPERIMENTAL)"
> +	depends on E1000 && NET_MULTI_QUEUE_DEVICE && EXPERIMENTAL

Would be better to just select NET_MULTI_QUEUE_DEVICE instead of
depending on it.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] NET: Multiple queue network device support
  2007-02-09  0:09 ` [PATCH 1/2] NET: Multiple queue network device support Kok, Auke
  2007-02-27  1:03   ` David Miller
@ 2007-03-09 13:40   ` Thomas Graf
  2007-03-09 19:25     ` Waskiewicz Jr, Peter P
  2007-03-12  8:58     ` Jarek Poplawski
  1 sibling, 2 replies; 17+ messages in thread
From: Thomas Graf @ 2007-03-09 13:40 UTC (permalink / raw)
  To: Kok, Auke
  Cc: David Miller, Garzik, Jeff, netdev, linux-kernel,
	Peter Waskiewicz Jr, Brandeburg, Jesse, Kok, Auke, Ronciak, John

* Kok, Auke <auke-jan.h.kok@intel.com> 2007-02-08 16:09
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 455d589..42b635c 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1477,6 +1477,49 @@ gso:
>  	skb->tc_verd = SET_TC_AT(skb->tc_verd,AT_EGRESS);
>  #endif
>  	if (q->enqueue) {
> +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
> +		int queue_index;
> +		/* If we're a multi-queue device, get a queue index to lock */
> +		if (netif_is_multiqueue(dev))
> +		{
> +			/* Get the queue index and lock it. */
> +			if (likely(q->ops->map_queue)) {
> +				queue_index = q->ops->map_queue(skb, q);
> +				spin_lock(&dev->egress_subqueue[queue_index].queue_lock);
> +				rc = q->enqueue(skb, q);
> +				/*
> +				 * Unlock because the underlying qdisc
> +				 * may queue and send a packet from a
> +				 * different queue.
> +				 */
> +				spin_unlock(&dev->egress_subqueue[queue_index].queue_lock);
> +				qdisc_run(dev);

I really dislike this asymmetric locking logic, while enqueue() is called
with queue_lock held dequeue() is repsonsible to acquire the lock for
qdisc_restart().

> +				rc = rc == NET_XMIT_BYPASS
> +				           ? NET_XMIT_SUCCESS : rc;
> +				goto out;
> +			} else {
> +				printk(KERN_CRIT "Device %s is "
> +					"multiqueue, but map_queue is "
> +					"undefined in the qdisc!!\n",
> +					dev->name);
> +				kfree_skb(skb);

Move this check to tc_modify_qdisc(), it's useless to check this for
every packet being delivered.

> +			}
> +		} else {
> +			/* We're not a multi-queue device. */
> +			spin_lock(&dev->queue_lock);
> +			q = dev->qdisc;
> +			if (q->enqueue) {
> +				rc = q->enqueue(skb, q);
> +				qdisc_run(dev);
> +				spin_unlock(&dev->queue_lock);
> +				rc = rc == NET_XMIT_BYPASS
> +				           ? NET_XMIT_SUCCESS : rc;
> +				goto out;
> +			}
> +			spin_unlock(&dev->queue_lock);

Please don't duplicate already existing code.

> @@ -130,6 +140,72 @@ static inline int qdisc_restart(struct net_device *dev)
>  		}
>  		
>  		{
> +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
> +			if (netif_is_multiqueue(dev)) {
> +				if (likely(q->ops->map_queue)) {
> +					queue_index = q->ops->map_queue(skb, q);
> +				} else {
> +					printk(KERN_CRIT "Device %s is "
> +						"multiqueue, but map_queue is "
> +						"undefined in the qdisc!!\n",
> +						dev->name);
> +					goto requeue;
> +				}

Yet another condition completely useless for every transmission.

> +				spin_unlock(&dev->egress_subqueue[queue_index].queue_lock);
> +				/* Check top level device, and any sub-device */
> +				if ((!netif_queue_stopped(dev)) &&
> +				  (!netif_subqueue_stopped(dev, queue_index))) {
> +					int ret;
> +					ret = dev->hard_start_subqueue_xmit(skb, dev, queue_index);
> +					if (ret == NETDEV_TX_OK) {
> +						if (!nolock) {
> +							netif_tx_unlock(dev);
> +						}
> +						return -1;
> +					}
> +					if (ret == NETDEV_TX_LOCKED && nolock) {
> +						spin_lock(&dev->egress_subqueue[queue_index].queue_lock);
> +						goto collision;
> +					}
> +				}
> +				/* NETDEV_TX_BUSY - we need to requeue */
> +				/* Release the driver */
> +				if (!nolock) {
> +					netif_tx_unlock(dev);
> +				}
> +				spin_lock(&dev->egress_subqueue[queue_index].queue_lock);
> +				q = dev->qdisc;

This is identical to the existing logic except for the different lock,
the additional to check subqueue state and a different hard_start_xmit
call. Share the logic.

> +			}
> +			else
> +			{
> +				/* We're a single-queue device */
> +				/* And release queue */
> +				spin_unlock(&dev->queue_lock);
> +				if (!netif_queue_stopped(dev)) {
> +					int ret;
> +
> +					ret = dev->hard_start_xmit(skb, dev);
> +					if (ret == NETDEV_TX_OK) {
> +						if (!nolock) {
> +							netif_tx_unlock(dev);
> +						}
> +						spin_lock(&dev->queue_lock);
> +						return -1;
> +					}
> +					if (ret == NETDEV_TX_LOCKED && nolock) {
> +						spin_lock(&dev->queue_lock);
> +						goto collision;
> +					}
> +				}
> +				/* NETDEV_TX_BUSY - we need to requeue */
> +				/* Release the driver */
> +				if (!nolock) {
> +					netif_tx_unlock(dev);
> +				}
> +				spin_lock(&dev->queue_lock);
> +				q = dev->qdisc;

Again, you just copied the existing code into a separate branch, fix the
branching logic!

> @@ -356,10 +454,18 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc* qdisc)
>  	struct sk_buff_head *list = qdisc_priv(qdisc);
>  
>  	for (prio = 0; prio < PFIFO_FAST_BANDS; prio++) {
> +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
> +		if (netif_is_multiqueue(qdisc->dev))
> +			spin_lock(&qdisc->dev->egress_subqueue[0].queue_lock);
> +#endif
>  		if (!skb_queue_empty(list + prio)) {
>  			qdisc->q.qlen--;
>  			return __qdisc_dequeue_head(qdisc, list + prio);
>  		}
> +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
> +		if (netif_is_multiqueue(qdisc->dev))
> +			spin_unlock(&qdisc->dev->egress_subqueue[0].queue_lock);
> +#endif

This lock/unlock for every band definitely screws performance.

>  	}
>  
>  	return NULL;
> @@ -141,18 +174,53 @@ prio_dequeue(struct Qdisc* sch)
>  	struct sk_buff *skb;
>  	struct prio_sched_data *q = qdisc_priv(sch);
>  	int prio;
> +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
> +	int queue;
> +#endif
>  	struct Qdisc *qdisc;
>  
> +	/*
> +	 * If we're multiqueue, the basic approach is try the lock on each
> +	 * queue.  If it's locked, either we're already dequeuing, or enqueue
> +	 * is doing something.  Go to the next band if we're locked.  Once we
> +	 * have a packet, unlock the queue.  NOTE: the underlying qdisc CANNOT
> +	 * be a PRIO qdisc, otherwise we will deadlock.  FIXME
> +	 */
>  	for (prio = 0; prio < q->bands; prio++) {
> +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
> +		if (netif_is_multiqueue(sch->dev)) {
> +			queue = q->band2queue[prio];
> +			if (spin_trylock(&sch->dev->egress_subqueue[queue].queue_lock)) {
> +				qdisc = q->queues[prio];
> +				skb = qdisc->dequeue(qdisc);
> +				if (skb) {
> +					sch->q.qlen--;
> +					skb->priority = prio;
> +					spin_unlock(&sch->dev->egress_subqueue[queue].queue_lock);
> +					return skb;
> +				}
> +				spin_unlock(&sch->dev->egress_subqueue[queue].queue_lock);
> +			}

Your modified qdisc_restart() expects the queue_lock to be locked, how
can this work?

> +		} else {
> +			qdisc = q->queues[prio];
> +			skb = qdisc->dequeue(qdisc);
> +			if (skb) {
> +				sch->q.qlen--;
> +				skb->priority = prio;
> +				return skb;
> +			}
> +		}
> +#else
>  		qdisc = q->queues[prio];
>  		skb = qdisc->dequeue(qdisc);
>  		if (skb) {
>  			sch->q.qlen--;
> +			skb->priority = prio;
>  			return skb;
>  		}
> +#endif
>  	}
>  	return NULL;
> -
>  }
>  
>  static unsigned int prio_drop(struct Qdisc* sch)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [PATCH 1/2] NET: Multiple queue network device support
  2007-03-09 13:40   ` Thomas Graf
@ 2007-03-09 19:25     ` Waskiewicz Jr, Peter P
  2007-03-09 23:01       ` Thomas Graf
  2007-03-12  8:58     ` Jarek Poplawski
  1 sibling, 1 reply; 17+ messages in thread
From: Waskiewicz Jr, Peter P @ 2007-03-09 19:25 UTC (permalink / raw)
  To: Thomas Graf, Kok, Auke-jan H
  Cc: David Miller, Garzik, Jeff, netdev, linux-kernel, Brandeburg,
	Jesse, Kok, Auke, Ronciak, John

> -----Original Message-----
> From: Thomas Graf [mailto:tgraf@suug.ch] 
> Sent: Friday, March 09, 2007 5:41 AM
> To: Kok, Auke-jan H
> Cc: David Miller; Garzik, Jeff; netdev@vger.kernel.org; 
> linux-kernel@vger.kernel.org; Waskiewicz Jr, Peter P; 
> Brandeburg, Jesse; Kok, Auke; Ronciak, John
> Subject: Re: [PATCH 1/2] NET: Multiple queue network device support
> 
> * Kok, Auke <auke-jan.h.kok@intel.com> 2007-02-08 16:09
> > diff --git a/net/core/dev.c b/net/core/dev.c index 455d589..42b635c 
> > 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -1477,6 +1477,49 @@ gso:
> >  	skb->tc_verd = SET_TC_AT(skb->tc_verd,AT_EGRESS);
> >  #endif
> >  	if (q->enqueue) {
> > +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
> > +		int queue_index;
> > +		/* If we're a multi-queue device, get a queue 
> index to lock */
> > +		if (netif_is_multiqueue(dev))
> > +		{
> > +			/* Get the queue index and lock it. */
> > +			if (likely(q->ops->map_queue)) {
> > +				queue_index = q->ops->map_queue(skb, q);
> > +				
> spin_lock(&dev->egress_subqueue[queue_index].queue_lock);
> > +				rc = q->enqueue(skb, q);
> > +				/*
> > +				 * Unlock because the underlying qdisc
> > +				 * may queue and send a packet from a
> > +				 * different queue.
> > +				 */
> > +				
> spin_unlock(&dev->egress_subqueue[queue_index].queue_lock);
> > +				qdisc_run(dev);
> 
> I really dislike this asymmetric locking logic, while 
> enqueue() is called with queue_lock held dequeue() is 
> repsonsible to acquire the lock for qdisc_restart().

I don't see how I can make this symmetrical when dealing with more than
one queue.  If an skb is classified to band 2 which maps to queue 1, we
lock queue 1, and enqueue the packet.  That queue will not be the first
queue checked in dequeue for an skb, so I cannot rely on that lock
protecting queue 0 in dequeue.  Therefore I need to map the skb, lock,
enqueue, unlock, then go into dequeue and lock the correct queue there.
If I were to make this symmetrical, then the process of enqueuing and
dequeuing would be serialized completely, and we'd be back to where we
started without multiqueue.

Also, in the multiqueue code path, dev->queue_lock isn't held anywhere,
it's the subqueue lock.

> 
> > +				rc = rc == NET_XMIT_BYPASS
> > +				           ? NET_XMIT_SUCCESS : rc;
> > +				goto out;
> > +			} else {
> > +				printk(KERN_CRIT "Device %s is "
> > +					"multiqueue, but map_queue is "
> > +					"undefined in the qdisc!!\n",
> > +					dev->name);
> > +				kfree_skb(skb);
> 
> Move this check to tc_modify_qdisc(), it's useless to check 
> this for every packet being delivered.

The patches I sent yesterday modified the non-multiqueue qdiscs to not
load if the device is multiqueue, so this check can be removed
altogether.

> 
> > +			}
> > +		} else {
> > +			/* We're not a multi-queue device. */
> > +			spin_lock(&dev->queue_lock);
> > +			q = dev->qdisc;
> > +			if (q->enqueue) {
> > +				rc = q->enqueue(skb, q);
> > +				qdisc_run(dev);
> > +				spin_unlock(&dev->queue_lock);
> > +				rc = rc == NET_XMIT_BYPASS
> > +				           ? NET_XMIT_SUCCESS : rc;
> > +				goto out;
> > +			}
> > +			spin_unlock(&dev->queue_lock);
> 
> Please don't duplicate already existing code.

I don't want to mix multiqueue and non-multiqueue code in the transmit
path.  This was an attempt to allow MQ and non-MQ devices to coexist in
a machine having separate code paths.  Are you suggesting to combine
them?  That would get very messy trying to determine what type of lock
to grab (subqueue lock or dev->queue_lock), not to mention grabbing the
dev->queue_lock would block multiqueue devices in that same codepath.

> 
> > @@ -130,6 +140,72 @@ static inline int qdisc_restart(struct 
> net_device *dev)
> >  		}
> >  		
> >  		{
> > +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
> > +			if (netif_is_multiqueue(dev)) {
> > +				if (likely(q->ops->map_queue)) {
> > +					queue_index = 
> q->ops->map_queue(skb, q);
> > +				} else {
> > +					printk(KERN_CRIT "Device %s is "
> > +						"multiqueue, 
> but map_queue is "
> > +						"undefined in 
> the qdisc!!\n",
> > +						dev->name);
> > +					goto requeue;
> > +				}
> 
> Yet another condition completely useless for every transmission.

Yep, this will be removed.  Thanks.

> 
> > +				
> spin_unlock(&dev->egress_subqueue[queue_index].queue_lock);
> > +				/* Check top level device, and 
> any sub-device */
> > +				if ((!netif_queue_stopped(dev)) &&
> > +				  (!netif_subqueue_stopped(dev, 
> queue_index))) {
> > +					int ret;
> > +					ret = 
> dev->hard_start_subqueue_xmit(skb, dev, queue_index);
> > +					if (ret == NETDEV_TX_OK) {
> > +						if (!nolock) {
> > +							
> netif_tx_unlock(dev);
> > +						}
> > +						return -1;
> > +					}
> > +					if (ret == 
> NETDEV_TX_LOCKED && nolock) {
> > +						
> spin_lock(&dev->egress_subqueue[queue_index].queue_lock);
> > +						goto collision;
> > +					}
> > +				}
> > +				/* NETDEV_TX_BUSY - we need to 
> requeue */
> > +				/* Release the driver */
> > +				if (!nolock) {
> > +					netif_tx_unlock(dev);
> > +				}
> > +				
> spin_lock(&dev->egress_subqueue[queue_index].queue_lock);
> > +				q = dev->qdisc;
> 
> This is identical to the existing logic except for the 
> different lock, the additional to check subqueue state and a 
> different hard_start_xmit call. Share the logic.

Not necessarily.  How we got here and how the locks are working are
logically different due to the different queues.  I'm working on storing
the queue index in the skb now from Dave's suggestion, so there will be
one hard_start_xmit, but the logic is different since the queue lock
we're grabbing will be different than non-multiqueue.  Perhaps I'm
missing something here.

> 
> > +			}
> > +			else
> > +			{
> > +				/* We're a single-queue device */
> > +				/* And release queue */
> > +				spin_unlock(&dev->queue_lock);
> > +				if (!netif_queue_stopped(dev)) {
> > +					int ret;
> > +
> > +					ret = 
> dev->hard_start_xmit(skb, dev);
> > +					if (ret == NETDEV_TX_OK) {
> > +						if (!nolock) {
> > +							
> netif_tx_unlock(dev);
> > +						}
> > +						
> spin_lock(&dev->queue_lock);
> > +						return -1;
> > +					}
> > +					if (ret == 
> NETDEV_TX_LOCKED && nolock) {
> > +						
> spin_lock(&dev->queue_lock);
> > +						goto collision;
> > +					}
> > +				}
> > +				/* NETDEV_TX_BUSY - we need to 
> requeue */
> > +				/* Release the driver */
> > +				if (!nolock) {
> > +					netif_tx_unlock(dev);
> > +				}
> > +				spin_lock(&dev->queue_lock);
> > +				q = dev->qdisc;
> 
> Again, you just copied the existing code into a separate 
> branch, fix the branching logic!

This is another attempt to keep the two codepaths separate.  The only
way I see of combining them is to check netif_is_multiqueue() everytime
I need to grab a lock, which I think would be excessive.

> 
> > @@ -356,10 +454,18 @@ static struct sk_buff 
> *pfifo_fast_dequeue(struct Qdisc* qdisc)
> >  	struct sk_buff_head *list = qdisc_priv(qdisc);
> >  
> >  	for (prio = 0; prio < PFIFO_FAST_BANDS; prio++) {
> > +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
> > +		if (netif_is_multiqueue(qdisc->dev))
> > +			
> spin_lock(&qdisc->dev->egress_subqueue[0].queue_lock);
> > +#endif
> >  		if (!skb_queue_empty(list + prio)) {
> >  			qdisc->q.qlen--;
> >  			return __qdisc_dequeue_head(qdisc, list + prio);
> >  		}
> > +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
> > +		if (netif_is_multiqueue(qdisc->dev))
> > +			
> spin_unlock(&qdisc->dev->egress_subqueue[0].queue_lock);
> > +#endif
> 
> This lock/unlock for every band definitely screws performance.
> 

I will move the lock out of the for() loop since every band in
pfifo_fast maps to queue 0.  Thanks.

> >  	}
> >  
> >  	return NULL;
> > @@ -141,18 +174,53 @@ prio_dequeue(struct Qdisc* sch)
> >  	struct sk_buff *skb;
> >  	struct prio_sched_data *q = qdisc_priv(sch);
> >  	int prio;
> > +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
> > +	int queue;
> > +#endif
> >  	struct Qdisc *qdisc;
> >  
> > +	/*
> > +	 * If we're multiqueue, the basic approach is try the 
> lock on each
> > +	 * queue.  If it's locked, either we're already 
> dequeuing, or enqueue
> > +	 * is doing something.  Go to the next band if we're 
> locked.  Once we
> > +	 * have a packet, unlock the queue.  NOTE: the 
> underlying qdisc CANNOT
> > +	 * be a PRIO qdisc, otherwise we will deadlock.  FIXME
> > +	 */
> >  	for (prio = 0; prio < q->bands; prio++) {
> > +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
> > +		if (netif_is_multiqueue(sch->dev)) {
> > +			queue = q->band2queue[prio];
> > +			if 
> (spin_trylock(&sch->dev->egress_subqueue[queue].queue_lock)) {
> > +				qdisc = q->queues[prio];
> > +				skb = qdisc->dequeue(qdisc);
> > +				if (skb) {
> > +					sch->q.qlen--;
> > +					skb->priority = prio;
> > +					
> spin_unlock(&sch->dev->egress_subqueue[queue].queue_lock);
> > +					return skb;
> > +				}
> > +				
> spin_unlock(&sch->dev->egress_subqueue[queue].queue_lock);
> > +			}
> 
> Your modified qdisc_restart() expects the queue_lock to be 
> locked, how can this work?

No, it doesn't expect the lock to be held.  Because of the multiple
queues, enqueueing and dequeueing are now asynchronous, since I can
enqueue to queue 0 while dequeuing from queue 1.  dev->queue_lock isn't
held, so this can happen.  Therefore the spin_trylock() is used in this
dequeue because I don't want to wait for someone to finish with that
queue in question (e.g. enqueue working), since that will block all
other bands/queues after the band in question.  So if the lock isn't
available to grab, we move to the next band.  If I were to wait for the
lock, I'd serialize the enqueue/dequeue completely, and block other
traffic flows in other queues waiting for the lock.

> 
> > +		} else {
> > +			qdisc = q->queues[prio];
> > +			skb = qdisc->dequeue(qdisc);
> > +			if (skb) {
> > +				sch->q.qlen--;
> > +				skb->priority = prio;
> > +				return skb;
> > +			}
> > +		}
> > +#else
> >  		qdisc = q->queues[prio];
> >  		skb = qdisc->dequeue(qdisc);
> >  		if (skb) {
> >  			sch->q.qlen--;
> > +			skb->priority = prio;
> >  			return skb;
> >  		}
> > +#endif
> >  	}
> >  	return NULL;
> > -
> >  }
> >  
> >  static unsigned int prio_drop(struct Qdisc* sch)
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] NET: Multiple queue network device support
  2007-03-09 19:25     ` Waskiewicz Jr, Peter P
@ 2007-03-09 23:01       ` Thomas Graf
  2007-03-09 23:27         ` Waskiewicz Jr, Peter P
  0 siblings, 1 reply; 17+ messages in thread
From: Thomas Graf @ 2007-03-09 23:01 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter P
  Cc: Kok, Auke-jan H, David Miller, Garzik, Jeff, netdev,
	linux-kernel, Brandeburg, Jesse, Kok, Auke, Ronciak, John

* Waskiewicz Jr, Peter P <peter.p.waskiewicz.jr@intel.com> 2007-03-09 11:25
> > > +			}
> > > +		} else {
> > > +			/* We're not a multi-queue device. */
> > > +			spin_lock(&dev->queue_lock);
> > > +			q = dev->qdisc;
> > > +			if (q->enqueue) {
> > > +				rc = q->enqueue(skb, q);
> > > +				qdisc_run(dev);
> > > +				spin_unlock(&dev->queue_lock);
> > > +				rc = rc == NET_XMIT_BYPASS
> > > +				           ? NET_XMIT_SUCCESS : rc;
> > > +				goto out;
> > > +			}
> > > +			spin_unlock(&dev->queue_lock);
> > 
> > Please don't duplicate already existing code.
> 
> I don't want to mix multiqueue and non-multiqueue code in the transmit
> path.  This was an attempt to allow MQ and non-MQ devices to coexist in
> a machine having separate code paths.  Are you suggesting to combine
> them?  That would get very messy trying to determine what type of lock
> to grab (subqueue lock or dev->queue_lock), not to mention grabbing the
> dev->queue_lock would block multiqueue devices in that same codepath.

The piece of code I quoted above is the branch executed if multi queue
is not enabled. The code you added is 100% identical to the already
existing enqueue logic. Just execute the existing branch if multi queue
is not enabled.

> This is another attempt to keep the two codepaths separate.  The only
> way I see of combining them is to check netif_is_multiqueue() everytime
> I need to grab a lock, which I think would be excessive.

The code added is 100% identical to the existing code, just be a little
smarter on how you do the ifdefs.

> > >  	}
> > >  
> > >  	return NULL;
> > > @@ -141,18 +174,53 @@ prio_dequeue(struct Qdisc* sch)
> > >  	struct sk_buff *skb;
> > >  	struct prio_sched_data *q = qdisc_priv(sch);
> > >  	int prio;
> > > +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
> > > +	int queue;
> > > +#endif
> > >  	struct Qdisc *qdisc;
> > >  
> > > +	/*
> > > +	 * If we're multiqueue, the basic approach is try the 
> > lock on each
> > > +	 * queue.  If it's locked, either we're already 
> > dequeuing, or enqueue
> > > +	 * is doing something.  Go to the next band if we're 
> > locked.  Once we
> > > +	 * have a packet, unlock the queue.  NOTE: the 
> > underlying qdisc CANNOT
> > > +	 * be a PRIO qdisc, otherwise we will deadlock.  FIXME
> > > +	 */
> > >  	for (prio = 0; prio < q->bands; prio++) {
> > > +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
> > > +		if (netif_is_multiqueue(sch->dev)) {
> > > +			queue = q->band2queue[prio];
> > > +			if 
> > (spin_trylock(&sch->dev->egress_subqueue[queue].queue_lock)) {
> > > +				qdisc = q->queues[prio];
> > > +				skb = qdisc->dequeue(qdisc);
> > > +				if (skb) {
> > > +					sch->q.qlen--;
> > > +					skb->priority = prio;
> > > +					
> > spin_unlock(&sch->dev->egress_subqueue[queue].queue_lock);
> > > +					return skb;
> > > +				}
> > > +				
> > spin_unlock(&sch->dev->egress_subqueue[queue].queue_lock);
> > > +			}
> > 
> > Your modified qdisc_restart() expects the queue_lock to be 
> > locked, how can this work?
> 
> No, it doesn't expect the lock to be held.  Because of the multiple
> queues, enqueueing and dequeueing are now asynchronous, since I can
> enqueue to queue 0 while dequeuing from queue 1.  dev->queue_lock isn't
> held, so this can happen.  Therefore the spin_trylock() is used in this
> dequeue because I don't want to wait for someone to finish with that
> queue in question (e.g. enqueue working), since that will block all
> other bands/queues after the band in question.  So if the lock isn't
> available to grab, we move to the next band.  If I were to wait for the
> lock, I'd serialize the enqueue/dequeue completely, and block other
> traffic flows in other queues waiting for the lock.

The first thing you do in qdisc_restart() after dequeue()ing is unlock
the sub queue lock. You explicitely unlock it before calling qdisc_run()
so I assume dequeue() is expected to keep it locked. Something doesn't
add up.

BTW, which lock serializes your write access to qdisc->q.qlen? It used
to be dev->queue_lock but that is apparently not true for multi queue.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [PATCH 1/2] NET: Multiple queue network device support
  2007-03-09 23:01       ` Thomas Graf
@ 2007-03-09 23:27         ` Waskiewicz Jr, Peter P
  2007-03-10  2:34           ` Thomas Graf
  0 siblings, 1 reply; 17+ messages in thread
From: Waskiewicz Jr, Peter P @ 2007-03-09 23:27 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Kok, Auke-jan H, David Miller, Garzik, Jeff, netdev,
	linux-kernel, Brandeburg, Jesse, Kok, Auke, Ronciak, John

> * Waskiewicz Jr, Peter P <peter.p.waskiewicz.jr@intel.com> 
> 2007-03-09 11:25
> > > > +			}
> > > > +		} else {
> > > > +			/* We're not a multi-queue device. */
> > > > +			spin_lock(&dev->queue_lock);
> > > > +			q = dev->qdisc;
> > > > +			if (q->enqueue) {
> > > > +				rc = q->enqueue(skb, q);
> > > > +				qdisc_run(dev);
> > > > +				spin_unlock(&dev->queue_lock);
> > > > +				rc = rc == NET_XMIT_BYPASS
> > > > +				           ? 
> NET_XMIT_SUCCESS : rc;
> > > > +				goto out;
> > > > +			}
> > > > +			spin_unlock(&dev->queue_lock);
> > > 
> > > Please don't duplicate already existing code.
> > 
> > I don't want to mix multiqueue and non-multiqueue code in 
> the transmit 
> > path.  This was an attempt to allow MQ and non-MQ devices 
> to coexist 
> > in a machine having separate code paths.  Are you suggesting to 
> > combine them?  That would get very messy trying to 
> determine what type 
> > of lock to grab (subqueue lock or dev->queue_lock), not to mention 
> > grabbing the
> > dev->queue_lock would block multiqueue devices in that same 
> codepath.
> 
> The piece of code I quoted above is the branch executed if 
> multi queue is not enabled. The code you added is 100% 
> identical to the already existing enqueue logic. Just execute 
> the existing branch if multi queue is not enabled.
> 

I totally agree this is identical code if multiqueue isn't enabled.
However, when multiqueue is enabled, I don't want to make all network
drivers implement the subqueue API just to be able to lock the queues.
So the first thing I check is netif_is_multiqueue(dev), and if it
*isn't* multiqueue, it will run the existing code.  This way both
non-multiqueue devices don't have to have any knowledge of the MQ API.

> > This is another attempt to keep the two codepaths separate. 
>  The only 
> > way I see of combining them is to check netif_is_multiqueue() 
> > everytime I need to grab a lock, which I think would be excessive.
> 
> The code added is 100% identical to the existing code, just 
> be a little smarter on how you do the ifdefs.

An example could be an on-board adapter isn't multiqueue, and an
expansion card you have in your system is.  If I handle multiqueue being
on with just ifdef's, then I could use the expansion card, but not the
on-board adapter as-is.  The on-board driver would need to be updated to
have 1 queue in the multiqueue context, which is what I tried to avoid.

> > > Your modified qdisc_restart() expects the queue_lock to 
> be locked, 
> > > how can this work?
> > 
> > No, it doesn't expect the lock to be held.  Because of the multiple 
> > queues, enqueueing and dequeueing are now asynchronous, since I can 
> > enqueue to queue 0 while dequeuing from queue 1.  dev->queue_lock 
> > isn't held, so this can happen.  Therefore the 
> spin_trylock() is used 
> > in this dequeue because I don't want to wait for someone to finish 
> > with that queue in question (e.g. enqueue working), since that will 
> > block all other bands/queues after the band in question.  So if the 
> > lock isn't available to grab, we move to the next band.  If 
> I were to 
> > wait for the lock, I'd serialize the enqueue/dequeue 
> completely, and 
> > block other traffic flows in other queues waiting for the lock.
> 
> The first thing you do in qdisc_restart() after dequeue()ing 
> is unlock the sub queue lock. You explicitely unlock it 
> before calling qdisc_run() so I assume dequeue() is expected 
> to keep it locked. Something doesn't add up.

That's the entire point of this extra locking.  enqueue() is going to
put an skb into a band somewhere that maps to some queue, and there is
no way to guarantee the skb I retrieve from dequeue() is headed for the
same queue.  Therefore, I need to unlock the queue after I finish
enqueuing, since having that lock makes little sense to dequeue().
dequeue() will then grab *a* lock on a queue; it may be the same one we
had during enqueue(), but it may not be.  And the placement of the
unlock of that queue is exactly where it happens in non-multiqueue,
which is right before the hard_start_xmit().

I concede that the locking model is complex and seems really nasty, but
to truly separate queue functionality from one another, I see no other
feasible option than to run locking like this.  I am very open to
suggestions.

> 
> BTW, which lock serializes your write access to 
> qdisc->q.qlen? It used to be dev->queue_lock but that is 
> apparently not true for multi queue.
> 

This is a very good catch, because it isn't being protected on the
entire qdisc right now for PRIO.  However, Chris Leech pointed out the
LINK_STATE_QDISC_RUNNING bit is serializing things at the qdisc_run()
level, so that's probably protecting this counter right now.  I'm
looking at pushing that down into the qdisc, but at the same time I can
think of something to serialize these stats containers for the qdisc.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] NET: Multiple queue network device support
  2007-03-09 23:27         ` Waskiewicz Jr, Peter P
@ 2007-03-10  2:34           ` Thomas Graf
  2007-03-10 20:37             ` Waskiewicz Jr, Peter P
  0 siblings, 1 reply; 17+ messages in thread
From: Thomas Graf @ 2007-03-10  2:34 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter P
  Cc: Kok, Auke-jan H, David Miller, Garzik, Jeff, netdev,
	linux-kernel, Brandeburg, Jesse, Kok, Auke, Ronciak, John

* Waskiewicz Jr, Peter P <peter.p.waskiewicz.jr@intel.com> 2007-03-09 15:27
> That's the entire point of this extra locking.  enqueue() is going to
> put an skb into a band somewhere that maps to some queue, and there is
> no way to guarantee the skb I retrieve from dequeue() is headed for the
> same queue.  Therefore, I need to unlock the queue after I finish
> enqueuing, since having that lock makes little sense to dequeue().
> dequeue() will then grab *a* lock on a queue; it may be the same one we
> had during enqueue(), but it may not be.  And the placement of the
> unlock of that queue is exactly where it happens in non-multiqueue,
> which is right before the hard_start_xmit().

The lock is already unlocked after dequeue, from your prio_dequeue():

       if (netif_is_multiqueue(sch->dev)) {
               queue = q->band2queue[prio];
               if (spin_trylock(&sch->dev->egress_subqueue[queue].queue_lock)) {
                       qdisc = q->queues[prio];
                       skb = qdisc->dequeue(qdisc);
                       if (skb) {
                               sch->q.qlen--;
                               skb->priority = prio;
                               spin_unlock(&sch->dev->egress_subqueue[queue].queue_lock);
                               return skb;
                       }
                       spin_unlock(&sch->dev->egress_subqueue[queue].queue_lock);
       }

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [PATCH 1/2] NET: Multiple queue network device support
  2007-03-10  2:34           ` Thomas Graf
@ 2007-03-10 20:37             ` Waskiewicz Jr, Peter P
  0 siblings, 0 replies; 17+ messages in thread
From: Waskiewicz Jr, Peter P @ 2007-03-10 20:37 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Kok, Auke-jan H, David Miller, Garzik, Jeff, netdev,
	linux-kernel, Brandeburg, Jesse, Kok, Auke, Ronciak, John

> -----Original Message-----
> From: Thomas Graf [mailto:tgraf@suug.ch] 
> Sent: Friday, March 09, 2007 6:35 PM
> To: Waskiewicz Jr, Peter P
> Cc: Kok, Auke-jan H; David Miller; Garzik, Jeff; 
> netdev@vger.kernel.org; linux-kernel@vger.kernel.org; 
> Brandeburg, Jesse; Kok, Auke; Ronciak, John
> Subject: Re: [PATCH 1/2] NET: Multiple queue network device support
> 
> * Waskiewicz Jr, Peter P <peter.p.waskiewicz.jr@intel.com> 
> 2007-03-09 15:27
> > That's the entire point of this extra locking.  enqueue() 
> is going to 
> > put an skb into a band somewhere that maps to some queue, 
> and there is 
> > no way to guarantee the skb I retrieve from dequeue() is headed for 
> > the same queue.  Therefore, I need to unlock the queue 
> after I finish 
> > enqueuing, since having that lock makes little sense to dequeue().
> > dequeue() will then grab *a* lock on a queue; it may be the 
> same one 
> > we had during enqueue(), but it may not be.  And the 
> placement of the 
> > unlock of that queue is exactly where it happens in non-multiqueue, 
> > which is right before the hard_start_xmit().
> 
> The lock is already unlocked after dequeue, from your prio_dequeue():
> 
>        if (netif_is_multiqueue(sch->dev)) {
>                queue = q->band2queue[prio];
>                if 
> (spin_trylock(&sch->dev->egress_subqueue[queue].queue_lock)) {
>                        qdisc = q->queues[prio];
>                        skb = qdisc->dequeue(qdisc);
>                        if (skb) {
>                                sch->q.qlen--;
>                                skb->priority = prio;
>                                
> spin_unlock(&sch->dev->egress_subqueue[queue].queue_lock);
>                                return skb;
>                        }
>                        
> spin_unlock(&sch->dev->egress_subqueue[queue].queue_lock);
>        }

Ok, now I see what's wrong.  Taking Dave M.'s recommendation to store
the queue mapping in the skb will let me unlock the queue when dequeue()
returns.  I'll fix this locking issue; thanks for the feedback and
persistent drilling into my thick head.

-PJ Waskiewicz
peter.p.waskiewicz.jr@intel.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] NET: Multiple queue network device support
  2007-03-09 13:40   ` Thomas Graf
  2007-03-09 19:25     ` Waskiewicz Jr, Peter P
@ 2007-03-12  8:58     ` Jarek Poplawski
  2007-03-12 20:21       ` Waskiewicz Jr, Peter P
  1 sibling, 1 reply; 17+ messages in thread
From: Jarek Poplawski @ 2007-03-12  8:58 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Kok, Auke, David Miller, Garzik, Jeff, netdev, linux-kernel,
	Peter Waskiewicz Jr, Brandeburg, Jesse, Kok, Auke, Ronciak, John

On 09-03-2007 14:40, Thomas Graf wrote:
> * Kok, Auke <auke-jan.h.kok@intel.com> 2007-02-08 16:09
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 455d589..42b635c 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -1477,6 +1477,49 @@ gso:
>>  	skb->tc_verd = SET_TC_AT(skb->tc_verd,AT_EGRESS);
>>  #endif
>>  	if (q->enqueue) {
>> +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
>> +		int queue_index;
>> +		/* If we're a multi-queue device, get a queue index to lock */
>> +		if (netif_is_multiqueue(dev))
>> +		{
>> +			/* Get the queue index and lock it. */
>> +			if (likely(q->ops->map_queue)) {
>> +				queue_index = q->ops->map_queue(skb, q);
>> +				spin_lock(&dev->egress_subqueue[queue_index].queue_lock);
>> +				rc = q->enqueue(skb, q);

I'm not sure Dave Miller thought about this place, when he proposed
to save the mapping, but I think this could be not enough. This place
is racy: ->map_queue() is called 2 times and with some filters (and
policies/actions) results could differ. And of course the subqueue
lock doesn't prevent any filter from a config change in the meantime.

After second reading of this patch I have doubts it's the proper way
to solve the problem: there are many subqueues but we need a top queue
(prio here) to mange them, anyway. So, why not to build this
functionality directly into the queue? There is no difference for a
device if skbs are going from the subqueue or a class, it is only
interested in the mapping result and a possibility to stop and start
a subqueue and to query its status. All this could be done by adding
the callbacks directly to any classful scheduler or, if not enough,
to write some specialized qdisc based on prio. The possibility to lock
only a subqueue instead of a queue could be analized independently -
current proposal doesn't solve this anyway.

Regards,
Jarek P.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [PATCH 1/2] NET: Multiple queue network device support
  2007-03-12  8:58     ` Jarek Poplawski
@ 2007-03-12 20:21       ` Waskiewicz Jr, Peter P
  0 siblings, 0 replies; 17+ messages in thread
From: Waskiewicz Jr, Peter P @ 2007-03-12 20:21 UTC (permalink / raw)
  To: Jarek Poplawski, Thomas Graf
  Cc: Kok, Auke-jan H, David Miller, Garzik, Jeff, netdev,
	linux-kernel, Brandeburg, Jesse, Kok, Auke, Ronciak, John


> -----Original Message-----
> From: Jarek Poplawski [mailto:jarkao2@o2.pl] 
> Sent: Monday, March 12, 2007 1:58 AM
> To: Thomas Graf
> Cc: Kok, Auke-jan H; David Miller; Garzik, Jeff; 
> netdev@vger.kernel.org; linux-kernel@vger.kernel.org; 
> Waskiewicz Jr, Peter P; Brandeburg, Jesse; Kok, Auke; Ronciak, John
> Subject: Re: [PATCH 1/2] NET: Multiple queue network device support
> 
> On 09-03-2007 14:40, Thomas Graf wrote:
> > * Kok, Auke <auke-jan.h.kok@intel.com> 2007-02-08 16:09
> >> diff --git a/net/core/dev.c b/net/core/dev.c index 
> 455d589..42b635c 
> >> 100644
> >> --- a/net/core/dev.c
> >> +++ b/net/core/dev.c
> >> @@ -1477,6 +1477,49 @@ gso:
> >>  	skb->tc_verd = SET_TC_AT(skb->tc_verd,AT_EGRESS);
> >>  #endif
> >>  	if (q->enqueue) {
> >> +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
> >> +		int queue_index;
> >> +		/* If we're a multi-queue device, get a queue 
> index to lock */
> >> +		if (netif_is_multiqueue(dev))
> >> +		{
> >> +			/* Get the queue index and lock it. */
> >> +			if (likely(q->ops->map_queue)) {
> >> +				queue_index = q->ops->map_queue(skb, q);
> >> +				
> spin_lock(&dev->egress_subqueue[queue_index].queue_lock);
> >> +				rc = q->enqueue(skb, q);
> 
> I'm not sure Dave Miller thought about this place, when he 
> proposed to save the mapping, but I think this could be not 
> enough. This place is racy: ->map_queue() is called 2 times 
> and with some filters (and
> policies/actions) results could differ. And of course the 
> subqueue lock doesn't prevent any filter from a config change 
> in the meantime.
> 
> After second reading of this patch I have doubts it's the 
> proper way to solve the problem: there are many subqueues but 
> we need a top queue (prio here) to mange them, anyway. So, 
> why not to build this functionality directly into the queue? 
> There is no difference for a device if skbs are going from 
> the subqueue or a class, it is only interested in the mapping 
> result and a possibility to stop and start a subqueue and to 
> query its status. All this could be done by adding the 
> callbacks directly to any classful scheduler or, if not 
> enough, to write some specialized qdisc based on prio. The 
> possibility to lock only a subqueue instead of a queue could 
> be analized independently - current proposal doesn't solve 
> this anyway.
> 
> Regards,
> Jarek P.
> 

Thanks again for the feedback.  Given some discussions I had last week
in the office and the general feedback here, I'm going to remove the new
per-queue locking and leave the start/stop functions for each queue and
combine entry points for hard_start_xmit().  I'll get this out asap for
review once it's been tested here.  If we see issues in the future with
lock contention on the queues, we can revisit the per-queue locking.

Cheers,
-PJ Waskiewicz

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2007-03-12 20:21 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-02-09  0:09 [PATCH 0/2 REVIEW] Multiple transmit/receive queue kernel Kok, Auke
2007-02-09  0:09 ` [PATCH 1/2] NET: Multiple queue network device support Kok, Auke
2007-02-27  1:03   ` David Miller
2007-02-27 19:38     ` Waskiewicz Jr, Peter P
2007-03-07 22:18     ` Waskiewicz Jr, Peter P
2007-03-07 22:42       ` David Miller
2007-03-09  7:26         ` Jarek Poplawski
2007-03-09 13:40   ` Thomas Graf
2007-03-09 19:25     ` Waskiewicz Jr, Peter P
2007-03-09 23:01       ` Thomas Graf
2007-03-09 23:27         ` Waskiewicz Jr, Peter P
2007-03-10  2:34           ` Thomas Graf
2007-03-10 20:37             ` Waskiewicz Jr, Peter P
2007-03-12  8:58     ` Jarek Poplawski
2007-03-12 20:21       ` Waskiewicz Jr, Peter P
2007-02-09  0:09 ` [PATCH 2/2] e1000: Implement the new kernel API for multiqueue TX support Kok, Auke
2007-03-09 13:11   ` Thomas Graf

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).