LKML Archive on lore.kernel.org
* [PATCH 0/7] sg_ring: a ring of scatterlist arrays
@ 2007-12-19  6:31 Rusty Russell
  2007-12-19  6:33 ` [PATCH 1/7] sg_ring: introduce " Rusty Russell
  2008-01-05 15:31 ` [PATCH 0/7] sg_ring: a ring of scatterlist arrays James Bottomley
  0 siblings, 2 replies; 25+ messages in thread
From: Rusty Russell @ 2007-12-19  6:31 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, linux-scsi, linux-ide

This patch series is the start of my attempt to simplify and make explicit
the chained scatterlist logic.

It's not complete: my SATA box boots and seems happy, but all the other
users of SCSI need to be updated and checked.  But I've gotten far enough
to believe it's worth pursuing.

Cheers,
Rusty.

* [PATCH 1/7] sg_ring: introduce sg_ring: a ring of scatterlist arrays.
  2007-12-19  6:31 [PATCH 0/7] sg_ring: a ring of scatterlist arrays Rusty Russell
@ 2007-12-19  6:33 ` Rusty Russell
  2007-12-19  7:31   ` [PATCH 2/7] sg_ring: use in virtio Rusty Russell
  2008-01-05 15:31 ` [PATCH 0/7] sg_ring: a ring of scatterlist arrays James Bottomley
  1 sibling, 1 reply; 25+ messages in thread
From: Rusty Russell @ 2007-12-19  6:33 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, linux-scsi, linux-ide

This patch introduces 'struct sg_ring', a layer on top of scatterlist
arrays.  It meshes nicely with routines which expect a simple array of
'struct scatterlist' because it is easy to break down the ring into
its constituent arrays.

The sg_ring header also encodes the maximum number of entries, which is
useful for routines that populate the ring.  We need never hand around a
separate element count again.
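
For illustration, a driver might declare and walk a small ring roughly
like this (an untested sketch; only the sg_ring API is from this patch,
the surrounding names are made up):

	char buf[64];
	DECLARE_SG_RING(ring, 4);	/* header plus four scatterlist slots */
	struct sg_ring *r;
	unsigned int i;

	sg_ring_init(&ring.ring, 4);
	sg_set_buf(&ring.sg[0], buf, sizeof(buf));
	ring.ring.num = 1;		/* one valid entry, max stays 4 */

	/* Visits every valid element, following the ring if it is chained. */
	sg_ring_for_each(&ring.ring, r, i)
		pr_debug("element %u is %u bytes\n", i, r->sg[i].length);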

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
---
 include/linux/sg_ring.h |  107 +++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 107 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/sg_ring.h

diff --git a/include/linux/sg_ring.h b/include/linux/sg_ring.h
new file mode 100644
--- /dev/null
+++ b/include/linux/sg_ring.h
@@ -0,0 +1,107 @@
+#ifndef _LINUX_SG_RING_H
+#define _LINUX_SG_RING_H
+#include <linux/scatterlist.h>
+
+/**
+ * struct sg_ring - a ring of scatterlists
+ * @list: the list_head chaining them together
+ * @num: the number of valid sg entries
+ * @max: the maximum number of sg entries (size of the sg array).
+ * @sg: the array of scatterlist entries.
+ *
+ * This provides a convenient encapsulation of one or more scatter gather
+ * arrays.
+ */
+struct sg_ring
+{
+	struct list_head list;
+	unsigned int num, max;
+	struct scatterlist sg[0];
+};
+
+/* This helper declares an sg ring on the stack or in a struct. */
+#define DECLARE_SG_RING(name, max)		\
+	struct {				\
+		struct sg_ring ring;		\
+		struct scatterlist sg[max];	\
+	} name
+
+/**
+ * sg_ring_init - initialize a scatterlist ring.
+ * @sg: the sg_ring.
+ * @max: the size of the trailing sg array.
+ *
+ * After initialization sg is alone in the ring.
+ */
+static inline void sg_ring_init(struct sg_ring *sg, unsigned int max)
+{
+#ifdef CONFIG_DEBUG_SG
+	unsigned int i;
+	for (i = 0; i < max; i++)
+		sg->sg[i].sg_magic = SG_MAGIC;
+#endif
+	INIT_LIST_HEAD(&sg->list);
+	sg->max = max;
+	/* FIXME: This is to clear the page bits. */
+	sg_init_table(sg->sg, sg->max);
+}
+
+/**
+ * sg_ring_single - initialize a one-element scatterlist ring.
+ * @sg: the sg_ring.
+ * @buf: the pointer to the buffer.
+ * @buflen: the length of the buffer.
+ *
+ * Does sg_ring_init and also sets up first (and only) sg element.
+ */
+static inline void sg_ring_single(struct sg_ring *sg,
+				  const void *buf,
+				  unsigned int buflen)
+{
+	sg_ring_init(sg, 1);
+	sg->num = 1;
+	sg_init_one(&sg->sg[0], buf, buflen);
+}
+
+/**
+ * sg_ring_next - next array in a scatterlist ring.
+ * @sg: the sg_ring.
+ * @head: the sg_ring head.
+ *
+ * This will return NULL once @sg has looped back around to @head.
+ */
+static inline struct sg_ring *sg_ring_next(struct sg_ring *sg,
+					   const struct sg_ring *head)
+{
+	sg = list_first_entry(&sg->list, struct sg_ring, list);
+	if (sg == head)
+		sg = NULL;
+	return sg;
+}
+
+/* Helper for writing for loops */
+static inline struct sg_ring *sg_ring_iter(const struct sg_ring *head,
+					   struct sg_ring *sg, unsigned int *i)
+{
+	(*i)++;
+	/* While loop lets us skip any zero-entry sg_ring arrays */
+	while (*i == sg->num) {
+		*i = 0;
+		sg = sg_ring_next(sg, head);
+		if (!sg)
+			break;
+	}
+	return sg;
+}
+
+/**
+ * sg_ring_for_each - iterate through an entire sg_ring ring
+ * @head: the head of the sg_ring.
+ * @sg: the sg_ring iterator.
+ * @i: an (unsigned) integer which refers to sg->sg[i].
+ *
+ * The current scatterlist element is sg->sg[i].
+ */
+#define sg_ring_for_each(head, sg, i) \
+	for (sg = head, i = 0; sg; sg = sg_ring_iter(head, sg, &i))
+#endif /* _LINUX_SG_RING_H */

* [PATCH 2/7] sg_ring: use in virtio.
  2007-12-19  6:33 ` [PATCH 1/7] sg_ring: introduce " Rusty Russell
@ 2007-12-19  7:31   ` Rusty Russell
  2007-12-19  7:33     ` [PATCH 3/7] sg_ring: blk_rq_map_sg_ring as a counterpart to blk_rq_map_sg Rusty Russell
  0 siblings, 1 reply; 25+ messages in thread
From: Rusty Russell @ 2007-12-19  7:31 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, linux-scsi, linux-ide

Using sg_rings, we can join together scatterlists returned by other
subsystems, without needing to allocate an extra element for chaining.
This helps the virtio_blk device which wants to prepend and append
metadata to the request's scatterlist.

As an added bonus, the old virtio layer used to pass a scatterlist
array plus two numbers giving the counts of readable and writable
elements; now we simply hand over two sg_rings, which is much clearer
(each sg_ring carries its own length).
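
The change in calling convention looks roughly like this (a sketch using
the virtio_blk names from the hunk below, for the write case; not a
drop-in):

	/* Before: one flat array plus readable/writable counts. */
	err = vq->vq_ops->add_buf(vq, sg, out_num, in_num, vbr);

	/* After: two self-describing rings (either may be NULL).  Joining
	 * the data scatterlist onto the header costs no extra scatterlist
	 * element; it is just a list link between the ring headers. */
	list_add_tail(&vblk->sg.ring.list, &vblk->out.ring.list);
	err = vq->vq_ops->add_buf(vq, &vblk->out.ring, &vblk->in.ring, vbr);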

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
---
 drivers/block/virtio_blk.c    |   33 ++++++++-------
 drivers/char/virtio_console.c |   13 +++---
 drivers/net/virtio_net.c      |   22 +++++-----
 drivers/virtio/virtio_ring.c  |   90 ++++++++++++++++++++++++++--------------
 include/linux/virtio.h        |   12 ++---
 5 files changed, 99 insertions(+), 71 deletions(-)

diff -r 6fe7cf582293 drivers/block/virtio_blk.c
--- a/drivers/block/virtio_blk.c	Wed Dec 19 16:06:43 2007 +1100
+++ b/drivers/block/virtio_blk.c	Wed Dec 19 18:29:58 2007 +1100
@@ -5,8 +5,6 @@
 #include <linux/virtio.h>
 #include <linux/virtio_blk.h>
 #include <linux/scatterlist.h>
-
-#define VIRTIO_MAX_SG	(3+MAX_PHYS_SEGMENTS)
 
 static unsigned char virtblk_index = 'a';
 struct virtio_blk
@@ -24,8 +22,10 @@ struct virtio_blk
 
 	mempool_t *pool;
 
-	/* Scatterlist: can be too big for stack. */
-	struct scatterlist sg[VIRTIO_MAX_SG];
+	/* Scatterlist ring: can be too big for stack. */
+	DECLARE_SG_RING(out, 1);
+	DECLARE_SG_RING(in, 1);
+	DECLARE_SG_RING(sg, MAX_PHYS_SEGMENTS);
 };
 
 struct virtblk_req
@@ -70,8 +70,8 @@ static bool do_req(struct request_queue 
 static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
 		   struct request *req)
 {
-	unsigned long num, out, in;
 	struct virtblk_req *vbr;
+	struct sg_ring *in;
 
 	vbr = mempool_alloc(vblk->pool, GFP_ATOMIC);
 	if (!vbr)
@@ -95,23 +95,24 @@ static bool do_req(struct request_queue 
 	if (blk_barrier_rq(vbr->req))
 		vbr->out_hdr.type |= VIRTIO_BLK_T_BARRIER;
 
-	/* This init could be done at vblk creation time */
-	sg_init_table(vblk->sg, VIRTIO_MAX_SG);
-	sg_set_buf(&vblk->sg[0], &vbr->out_hdr, sizeof(vbr->out_hdr));
-	num = blk_rq_map_sg(q, vbr->req, vblk->sg+1);
-	sg_set_buf(&vblk->sg[num+1], &vbr->in_hdr, sizeof(vbr->in_hdr));
+	sg_ring_single(&vblk->out.ring, &vbr->out_hdr, sizeof(vbr->out_hdr));
+	sg_ring_init(&vblk->sg.ring, ARRAY_SIZE(vblk->sg.sg));
+	vblk->sg.ring.num = blk_rq_map_sg(q, vbr->req, vblk->sg.sg);
+	sg_ring_single(&vblk->in.ring, &vbr->in_hdr, sizeof(vbr->in_hdr));
 
 	if (rq_data_dir(vbr->req) == WRITE) {
 		vbr->out_hdr.type |= VIRTIO_BLK_T_OUT;
-		out = 1 + num;
-		in = 1;
+		/* Chain write request onto output buffers. */
+		list_add_tail(&vblk->sg.ring.list, &vblk->out.ring.list);
+		in = &vblk->in.ring;
 	} else {
 		vbr->out_hdr.type |= VIRTIO_BLK_T_IN;
-		out = 1;
-		in = 1 + num;
+		/* Chain input (status) buffer at end of read buffers. */
+		list_add_tail(&vblk->in.ring.list, &vblk->sg.ring.list);
+		in = &vblk->sg.ring;
 	}
 
-	if (vblk->vq->vq_ops->add_buf(vblk->vq, vblk->sg, out, in, vbr)) {
+	if (vblk->vq->vq_ops->add_buf(vblk->vq, &vblk->out.ring, in, vbr)) {
 		mempool_free(vbr, vblk->pool);
 		return false;
 	}
@@ -128,7 +129,7 @@ static void do_virtblk_request(struct re
 
 	while ((req = elv_next_request(q)) != NULL) {
 		vblk = req->rq_disk->private_data;
-		BUG_ON(req->nr_phys_segments > ARRAY_SIZE(vblk->sg));
+		BUG_ON(req->nr_phys_segments > ARRAY_SIZE(vblk->sg.sg));
 
 		/* If this request fails, stop queue and wait for something to
 		   finish to restart it. */
diff -r 6fe7cf582293 drivers/char/virtio_console.c
--- a/drivers/char/virtio_console.c	Wed Dec 19 16:06:43 2007 +1100
+++ b/drivers/char/virtio_console.c	Wed Dec 19 18:29:58 2007 +1100
@@ -54,15 +54,15 @@ static struct hv_ops virtio_cons;
  * immediately (lguest's Launcher does). */
 static int put_chars(u32 vtermno, const char *buf, int count)
 {
-	struct scatterlist sg[1];
+	DECLARE_SG_RING(sg, 1);
 	unsigned int len;
 
 	/* This is a convenient routine to initialize a single-elem sg list */
-	sg_init_one(sg, buf, count);
+	sg_ring_single(&sg.ring, buf, count);
 
 	/* add_buf wants a token to identify this buffer: we hand it any
 	 * non-NULL pointer, since there's only ever one buffer. */
-	if (out_vq->vq_ops->add_buf(out_vq, sg, 1, 0, (void *)1) == 0) {
+	if (out_vq->vq_ops->add_buf(out_vq, &sg.ring, NULL, (void *)1) == 0) {
 		/* Tell Host to go! */
 		out_vq->vq_ops->kick(out_vq);
 		/* Chill out until it's done with the buffer. */
@@ -78,11 +78,12 @@ static int put_chars(u32 vtermno, const 
  * queue. */
 static void add_inbuf(void)
 {
-	struct scatterlist sg[1];
-	sg_init_one(sg, inbuf, PAGE_SIZE);
+	DECLARE_SG_RING(sg, 1);
+
+	sg_ring_single(&sg.ring, inbuf, PAGE_SIZE);
 
 	/* We should always be able to add one buffer to an empty queue. */
-	if (in_vq->vq_ops->add_buf(in_vq, sg, 0, 1, inbuf) != 0)
+	if (in_vq->vq_ops->add_buf(in_vq, NULL, &sg.ring, inbuf) != 0)
 		BUG();
 	in_vq->vq_ops->kick(in_vq);
 }
diff -r 6fe7cf582293 drivers/net/virtio_net.c
--- a/drivers/net/virtio_net.c	Wed Dec 19 16:06:43 2007 +1100
+++ b/drivers/net/virtio_net.c	Wed Dec 19 18:29:58 2007 +1100
@@ -211,21 +211,21 @@ static void try_fill_recv(struct virtnet
 static void try_fill_recv(struct virtnet_info *vi)
 {
 	struct sk_buff *skb;
-	struct scatterlist sg[1+MAX_SKB_FRAGS];
-	int num, err;
+	DECLARE_SG_RING(sg, 1+MAX_SKB_FRAGS);
+	int err;
 
-	sg_init_table(sg, 1+MAX_SKB_FRAGS);
+	sg_ring_init(&sg.ring, 1+MAX_SKB_FRAGS);
 	for (;;) {
 		skb = netdev_alloc_skb(vi->dev, MAX_PACKET_LEN);
 		if (unlikely(!skb))
 			break;
 
 		skb_put(skb, MAX_PACKET_LEN);
-		vnet_hdr_to_sg(sg, skb);
-		num = skb_to_sgvec(skb, sg+1, 0, skb->len) + 1;
+		vnet_hdr_to_sg(sg.sg, skb);
+		sg.ring.num = skb_to_sgvec(skb, sg.sg+1, 0, skb->len) + 1;
 		skb_queue_head(&vi->recv, skb);
 
-		err = vi->rvq->vq_ops->add_buf(vi->rvq, sg, 0, num, skb);
+		err = vi->rvq->vq_ops->add_buf(vi->rvq, NULL, &sg.ring, skb);
 		if (err) {
 			skb_unlink(skb, &vi->recv);
 			kfree_skb(skb);
@@ -304,13 +304,13 @@ static int start_xmit(struct sk_buff *sk
 static int start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
-	int num, err;
-	struct scatterlist sg[1+MAX_SKB_FRAGS];
+	int err;
+	DECLARE_SG_RING(sg, 1+MAX_SKB_FRAGS);
 	struct virtio_net_hdr *hdr;
 	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
 	DECLARE_MAC_BUF(mac);
 
-	sg_init_table(sg, 1+MAX_SKB_FRAGS);
+	sg_ring_init(&sg.ring, 1+MAX_SKB_FRAGS);
 
 	pr_debug("%s: xmit %p %s\n", dev->name, skb, print_mac(mac, dest));
 
@@ -345,13 +345,13 @@ static int start_xmit(struct sk_buff *sk
 		hdr->gso_size = 0;
 	}
 
-	vnet_hdr_to_sg(sg, skb);
-	num = skb_to_sgvec(skb, sg+1, 0, skb->len) + 1;
+	vnet_hdr_to_sg(sg.sg, skb);
+	sg.ring.num = skb_to_sgvec(skb, sg.sg+1, 0, skb->len) + 1;
 	__skb_queue_head(&vi->send, skb);
-	vi->stats.sendq_sglen += num;
+	vi->stats.sendq_sglen += sg.ring.num;
 
 again:
-	err = vi->svq->vq_ops->add_buf(vi->svq, sg, num, 0, skb);
+	err = vi->svq->vq_ops->add_buf(vi->svq, &sg.ring, NULL, skb);
 	if (err) {
 		vi->stats.sendq_full++;
 
diff -r 6fe7cf582293 drivers/virtio/virtio_ring.c
--- a/drivers/virtio/virtio_ring.c	Wed Dec 19 16:06:43 2007 +1100
+++ b/drivers/virtio/virtio_ring.c	Wed Dec 19 18:29:58 2007 +1100
@@ -69,48 +69,62 @@ struct vring_virtqueue
 
 #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
 
+static int add_desc(struct vring_virtqueue *vq, unsigned int i,
+		    struct scatterlist *sg, unsigned int flags)
+{
+	if (vq->num_free == 0)
+		return -ENOSPC;
+
+	vq->vring.desc[i].flags = VRING_DESC_F_NEXT | flags;
+	vq->vring.desc[i].addr = (page_to_pfn(sg_page(sg))<<PAGE_SHIFT)
+		+ sg->offset;
+	vq->vring.desc[i].len = sg->length;
+	vq->num_free--;
+	return vq->vring.desc[i].next;
+}
+
 static int vring_add_buf(struct virtqueue *_vq,
-			 struct scatterlist sg[],
-			 unsigned int out,
-			 unsigned int in,
+			 struct sg_ring *out,
+			 struct sg_ring *in,
 			 void *data)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
-	unsigned int i, avail, head, uninitialized_var(prev);
+	unsigned int j, avail, head, free, uninitialized_var(prev);
+	int i;
+	struct sg_ring empty_sg, *sg;
 
 	BUG_ON(data == NULL);
-	BUG_ON(out + in > vq->vring.num);
-	BUG_ON(out + in == 0);
+
+	sg_ring_init(&empty_sg, 0);
+	empty_sg.num = 0;
+	if (!out)
+		out = &empty_sg;
+	if (!in)
+		in = &empty_sg;
+
+	BUG_ON(in->num == 0 && out->num == 0);
 
 	START_USE(vq);
 
-	if (vq->num_free < out + in) {
-		pr_debug("Can't add buf len %i - avail = %i\n",
-			 out + in, vq->num_free);
-		END_USE(vq);
-		return -ENOSPC;
+	i = head = vq->free_head;
+	free = vq->num_free;
+
+	/* Lay out the output buffers first. */
+	sg_ring_for_each(out, sg, j) {
+		prev = i;
+		i = add_desc(vq, i, &sg->sg[j], 0);
+		if (unlikely(i < 0))
+			goto full;
 	}
 
-	/* We're about to use some buffers from the free list. */
-	vq->num_free -= out + in;
+	/* Lay out the input buffers next. */
+	sg_ring_for_each(in, sg, j) {
+		prev = i;
+		i = add_desc(vq, i, &sg->sg[j], VRING_DESC_F_WRITE);
+		if (unlikely(i < 0))
+			goto full;
+	}
 
-	head = vq->free_head;
-	for (i = vq->free_head; out; i = vq->vring.desc[i].next, out--) {
-		vq->vring.desc[i].flags = VRING_DESC_F_NEXT;
-		vq->vring.desc[i].addr = (page_to_pfn(sg_page(sg))<<PAGE_SHIFT)
-			+ sg->offset;
-		vq->vring.desc[i].len = sg->length;
-		prev = i;
-		sg++;
-	}
-	for (; in; i = vq->vring.desc[i].next, in--) {
-		vq->vring.desc[i].flags = VRING_DESC_F_NEXT|VRING_DESC_F_WRITE;
-		vq->vring.desc[i].addr = (page_to_pfn(sg_page(sg))<<PAGE_SHIFT)
-			+ sg->offset;
-		vq->vring.desc[i].len = sg->length;
-		prev = i;
-		sg++;
-	}
 	/* Last one doesn't continue. */
 	vq->vring.desc[prev].flags &= ~VRING_DESC_F_NEXT;
 
@@ -128,6 +142,12 @@ static int vring_add_buf(struct virtqueu
 	pr_debug("Added buffer head %i to %p\n", head, vq);
 	END_USE(vq);
 	return 0;
+
+full:
+	pr_debug("Buffer needed more than %u on %p\n", free, vq);
+	vq->num_free = free;
+	END_USE(vq);
+	return i;
 }
 
 static void vring_kick(struct virtqueue *_vq)
diff -r 6fe7cf582293 include/linux/virtio.h
--- a/include/linux/virtio.h	Wed Dec 19 16:06:43 2007 +1100
+++ b/include/linux/virtio.h	Wed Dec 19 18:29:58 2007 +1100
@@ -3,7 +3,7 @@
 /* Everything a virtio driver needs to work with any particular virtio
  * implementation. */
 #include <linux/types.h>
-#include <linux/scatterlist.h>
+#include <linux/sg_ring.h>
 #include <linux/spinlock.h>
 #include <linux/device.h>
 #include <linux/mod_devicetable.h>
@@ -27,9 +27,8 @@ struct virtqueue
  * virtqueue_ops - operations for virtqueue abstraction layer
  * @add_buf: expose buffer to other end
  *	vq: the struct virtqueue we're talking about.
- *	sg: the description of the buffer(s).
- *	out_num: the number of sg readable by other side
- *	in_num: the number of sg which are writable (after readable ones)
+ *	out: the scatter gather elements readable by other side (can be NULL)
+ *	in: the scatter gather elements which are writable (can be NULL)
  *	data: the token identifying the buffer.
  *      Returns 0 or an error.
  * @kick: update after add_buf
@@ -56,9 +55,8 @@ struct virtqueue
  */
 struct virtqueue_ops {
 	int (*add_buf)(struct virtqueue *vq,
-		       struct scatterlist sg[],
-		       unsigned int out_num,
-		       unsigned int in_num,
+		       struct sg_ring *out,
+		       struct sg_ring *in,
 		       void *data);
 
 	void (*kick)(struct virtqueue *vq);
diff -r 6fe7cf582293 net/9p/trans_virtio.c
--- a/net/9p/trans_virtio.c	Wed Dec 19 16:06:43 2007 +1100
+++ b/net/9p/trans_virtio.c	Wed Dec 19 18:29:59 2007 +1100
@@ -88,7 +88,7 @@ static int p9_virtio_write(struct p9_tra
 {
 	struct virtio_chan *chan = (struct virtio_chan *) trans->priv;
 	struct virtqueue *out_vq = chan->out_vq;
-	struct scatterlist sg[1];
+	DECLARE_SG_RING(sg, 1);
 	unsigned int len;
 
 	P9_DPRINTK(P9_DEBUG_TRANS, "9p debug: virtio write (%d)\n", count);
@@ -97,11 +97,11 @@ static int p9_virtio_write(struct p9_tra
 	if (rest_of_page(buf) < count)
 		count = rest_of_page(buf);
 
-	sg_init_one(sg, buf, count);
+	sg_ring_single(&sg.ring, buf, count);
 
 	/* add_buf wants a token to identify this buffer: we hand it any
 	 * non-NULL pointer, since there's only ever one buffer. */
-	if (out_vq->vq_ops->add_buf(out_vq, sg, 1, 0, (void *)1) == 0) {
+	if (out_vq->vq_ops->add_buf(out_vq, &sg.ring, NULL, (void *)1) == 0) {
 		/* Tell Host to go! */
 		out_vq->vq_ops->kick(out_vq);
 		/* Chill out until it's done with the buffer. */
@@ -119,12 +119,12 @@ static int p9_virtio_write(struct p9_tra
  * queue. */
 static void add_inbuf(struct virtio_chan *chan)
 {
-	struct scatterlist sg[1];
+	DECLARE_SG_RING(sg, 1);
 
-	sg_init_one(sg, chan->inbuf, PAGE_SIZE);
+	sg_ring_single(&sg.ring, chan->inbuf, PAGE_SIZE);
 
 	/* We should always be able to add one buffer to an empty queue. */
-	if (chan->in_vq->vq_ops->add_buf(chan->in_vq, sg, 0, 1, chan->inbuf))
+	if (chan->in_vq->vq_ops->add_buf(chan->in_vq,NULL,&sg.ring,chan->inbuf))
 		BUG();
 	chan->in_vq->vq_ops->kick(chan->in_vq);
 }

* [PATCH 3/7] sg_ring: blk_rq_map_sg_ring as a counterpart to blk_rq_map_sg.
  2007-12-19  7:31   ` [PATCH 2/7] sg_ring: use in virtio Rusty Russell
@ 2007-12-19  7:33     ` Rusty Russell
  2007-12-19  7:34       ` [PATCH 4/7] sg_ring: dma_map_sg_ring() helper Rusty Russell
  0 siblings, 1 reply; 25+ messages in thread
From: Rusty Russell @ 2007-12-19  7:33 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, linux-scsi, linux-ide

Add blk_rq_map_sg_ring(), the obvious counterpart to blk_rq_map_sg():
it maps a request into an sg_ring rather than a bare scatterlist array.
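
Caller-side the usage is roughly as follows (a sketch; 'ring' stands for
any sg_ring with enough total elements for the request, however it was
allocated):

	struct sg_ring *sg;
	unsigned int i, total = 0;

	blk_rq_map_sg_ring(req->q, req, ring);

	/* Every mapped segment is reachable through the normal iterator;
	 * the helper has set ->num in each constituent array. */
	sg_ring_for_each(ring, sg, i)
		total += sg->sg[i].length;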

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
---
 block/ll_rw_blk.c      |   55 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/blkdev.h |    1 +
 2 files changed, 56 insertions(+), 0 deletions(-)

diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -31,6 +31,7 @@
 #include <linux/blktrace_api.h>
 #include <linux/fault-inject.h>
 #include <linux/scatterlist.h>
+#include <linux/sg_ring.h>
 
 /*
  * for max sense size
@@ -1364,6 +1365,68 @@ new_segment:
 
 EXPORT_SYMBOL(blk_rq_map_sg);
 
+/**
+ * blk_rq_map_sg_ring - map a request to a scatterlist ring.
+ * @q: the request queue this request applies to.
+ * @rq: the request to map
+ * @sg: the sg_ring to populate.
+ *
+ * There must be enough elements in the sg_ring(s) to map the request.
+ */
+void blk_rq_map_sg_ring(struct request_queue *q, struct request *rq,
+			struct sg_ring *sg)
+{
+	struct bio_vec *bvec, *bvprv;
+	struct req_iterator iter;
+	int i, cluster;
+	struct sg_ring *head = sg;
+	struct scatterlist *sgprv;
+
+	i = 0;
+	cluster = q->queue_flags & (1 << QUEUE_FLAG_CLUSTER);
+
+	/*
+	 * for each bio in rq
+	 */
+	bvprv = NULL;
+	sgprv = NULL;
+	rq_for_each_segment(bvec, rq, iter) {
+		int nbytes = bvec->bv_len;
+
+		if (bvprv && cluster) {
+			if (sgprv->length + nbytes > q->max_segment_size)
+				goto new_segment;
+
+			if (!BIOVEC_PHYS_MERGEABLE(bvprv, bvec))
+				goto new_segment;
+			if (!BIOVEC_SEG_BOUNDARY(q, bvprv, bvec))
+				goto new_segment;
+
+			sgprv->length += nbytes;
+		} else {
+new_segment:
+			sg_set_page(sg->sg + i, bvec->bv_page, nbytes,
+				    bvec->bv_offset);
+			sgprv = sg->sg + i;
+			if (++i == sg->max) {
+				sg->num = i;
+				sg = sg_ring_next(sg, head);
+				i = 0;
+			}
+		}
+		bvprv = bvec;
+	} /* segments in rq */
+
+	/* If we were still working on an sg_ring, set the number and
+	 * clear any following sg_rings. */
+	if (sg) {
+		sg->num = i;
+		for (sg = sg_ring_next(sg,head); sg; sg = sg_ring_next(sg,head))
+			sg->num = 0;
+	}
+}
+EXPORT_SYMBOL(blk_rq_map_sg_ring);
+
 /*
  * the standard queue merge functions, can be overridden with device
  * specific ones if so desired
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -777,6 +777,8 @@ extern void blk_ordered_complete_seq(str
 extern void blk_ordered_complete_seq(struct request_queue *, unsigned, int);
 
 extern int blk_rq_map_sg(struct request_queue *, struct request *, struct scatterlist *);
+struct sg_ring;
+extern void blk_rq_map_sg_ring(struct request_queue *, struct request *, struct sg_ring *);
 extern void blk_dump_rq_flags(struct request *, char *);
 extern void generic_unplug_device(struct request_queue *);
 extern void __generic_unplug_device(struct request_queue *);

* [PATCH 4/7] sg_ring: dma_map_sg_ring() helper
  2007-12-19  7:33     ` [PATCH 3/7] sg_ring: blk_rq_map_sg_ring as a counterpart to blk_rq_map_sg Rusty Russell
@ 2007-12-19  7:34       ` Rusty Russell
  2007-12-19  7:36         ` [PATCH 5/7] sg_ring: Convert core scsi code to sg_ring Rusty Russell
  0 siblings, 1 reply; 25+ messages in thread
From: Rusty Russell @ 2007-12-19  7:34 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, linux-scsi, linux-ide

Obvious counterpart to dma_map_sg.  Note that this is arch-independent
code; sg_rings are backwards compatible with simple sg arrays.
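
The expected calling pattern is something like this (a sketch;
program_hw_segment() is a made-up stand-in for whatever the driver does
with each mapped element):

	struct sg_ring *sg;
	unsigned int i;

	if (dma_map_sg_ring(dev, ring, DMA_TO_DEVICE) != 0)
		return -ENOMEM;

	sg_ring_for_each(ring, sg, i)
		program_hw_segment(hw, sg_dma_address(&sg->sg[i]),
				   sg_dma_len(&sg->sg[i]));

	/* Only after dma_map_sg_ring() has succeeded. */
	dma_unmap_sg_ring(dev, ring, DMA_TO_DEVICE);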

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
---
 drivers/base/dma-mapping.c  |   13 +++++++++++++
 include/linux/dma-mapping.h |    4 ++++
 2 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/drivers/base/dma-mapping.c b/drivers/base/dma-mapping.c
--- a/drivers/base/dma-mapping.c
+++ b/drivers/base/dma-mapping.c
@@ -8,6 +8,7 @@
  */
 
 #include <linux/dma-mapping.h>
+#include <linux/sg_ring.h>
 
 /*
  * Managed DMA API
@@ -162,6 +163,59 @@ void dmam_free_noncoherent(struct device
 }
 EXPORT_SYMBOL(dmam_free_noncoherent);
 
+/**
+ * dma_map_sg_ring - Map an entire sg ring
+ * @dev: Device to map the sg_ring for
+ * @sg: The sg_ring
+ * @direction: DMA_TO_DEVICE, DMA_FROM_DEVICE or DMA_BIDIRECTIONAL.
+ *
+ * This returns -ENOMEM if mapping fails.  It's not clear that telling you
+ * it failed is useful though.
+ */
+int dma_map_sg_ring(struct device *dev, struct sg_ring *sg,
+		     enum dma_data_direction direction)
+{
+	struct sg_ring *i;
+	unsigned int num;
+
+	for (i = sg; i; i = sg_ring_next(i, sg)) {
+		BUG_ON(i->num > i->max);
+		num = dma_map_sg(dev, i->sg, i->num, direction);
+		if (num == 0 && i->num != 0)
+			goto unmap;
+	}
+	return 0;
+
+unmap:
+	while (sg) {
+		dma_unmap_sg(dev, sg->sg, sg->num, direction);
+		sg = sg_ring_next(sg, i);
+	}
+	return -ENOMEM;
+
+}
+EXPORT_SYMBOL(dma_map_sg_ring);
+
+/**
+ * dma_unmap_sg_ring - Unmap an entire sg ring
+ * @dev: Device to unmap the sg_ring for
+ * @sg: The sg_ring
+ * @direction: DMA_TO_DEVICE, DMA_FROM_DEVICE or DMA_BIDIRECTIONAL.
+ *
+ * Call after dma_map_sg_ring() succeeds.
+ */
+void dma_unmap_sg_ring(struct device *dev, struct sg_ring *sg,
+		       enum dma_data_direction direction)
+{
+	struct sg_ring *i;
+
+	for (i = sg; i; i = sg_ring_next(i, sg)) {
+		BUG_ON(i->num > i->max);
+		dma_unmap_sg(dev, i->sg, i->num, direction);
+	}
+}
+EXPORT_SYMBOL(dma_unmap_sg_ring);
+
 #ifdef ARCH_HAS_DMA_DECLARE_COHERENT_MEMORY
 
 static void dmam_coherent_decl_release(struct device *dev, void *res)
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -87,6 +87,12 @@ dma_mark_declared_memory_occupied(struct
 }
 #endif
 
+struct sg_ring;
+extern int dma_map_sg_ring(struct device *dev, struct sg_ring *sg,
+			   enum dma_data_direction direction);
+extern void dma_unmap_sg_ring(struct device *dev, struct sg_ring *sg,
+			      enum dma_data_direction direction);
+
 /*
  * Managed DMA API
  */

* [PATCH 5/7] sg_ring: Convert core scsi code to sg_ring.
  2007-12-19  7:34       ` [PATCH 4/7] sg_ring: dma_map_sg_ring() helper Rusty Russell
@ 2007-12-19  7:36         ` Rusty Russell
  2007-12-19  7:37           ` [PATCH 6/7] sg_ring: libata simplification Rusty Russell
  0 siblings, 1 reply; 25+ messages in thread
From: Rusty Russell @ 2007-12-19  7:36 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, linux-scsi, linux-ide

If nothing else, the simplification of this logic shows why I prefer
sg_ring over scatterlist chaining.
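
The new allocate/map/free cycle is roughly the following (a sketch
condensed from the scsi_init_io() and scsi_release_buffers() hunks
below):

	cmd->request_buffer = scsi_alloc_sgring(cmd, req->nr_phys_segments,
						GFP_ATOMIC);
	if (unlikely(!cmd->request_buffer))
		return BLKPREP_DEFER;

	blk_rq_map_sg_ring(req->q, req, cmd->request_buffer);
	/* ... issue the command; when it completes: */
	scsi_free_sgring(cmd);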

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

diff --git a/drivers/block/cciss_scsi.c b/drivers/block/cciss_scsi.c
--- a/drivers/block/cciss_scsi.c
+++ b/drivers/block/cciss_scsi.c
@@ -1191,28 +1191,28 @@ cciss_scatter_gather(struct pci_dev *pde
 		struct scsi_cmnd *cmd)
 {
 	unsigned int len;
-	struct scatterlist *sg;
+	struct sg_ring *sg;
 	__u64 addr64;
-	int use_sg, i;
+	int count = 0, i;
 
-	BUG_ON(scsi_sg_count(cmd) > MAXSGENTRIES);
-
-	use_sg = scsi_dma_map(cmd);
-	if (use_sg) {	/* not too many addrs? */
-		scsi_for_each_sg(cmd, sg, use_sg, i) {
-			addr64 = (__u64) sg_dma_address(sg);
-			len  = sg_dma_len(sg);
+	if (scsi_dma_map(cmd) >= 0) {	/* not too many addrs? */
+		sg_ring_for_each(scsi_sgring(cmd), sg, i) {
+			addr64 = (__u64) sg_dma_address(&sg->sg[i]);
+			len  = sg_dma_len(&sg->sg[i]);
 			cp->SG[i].Addr.lower =
 				(__u32) (addr64 & (__u64) 0x00000000FFFFFFFF);
 			cp->SG[i].Addr.upper =
 				(__u32) ((addr64 >> 32) & (__u64) 0x00000000FFFFFFFF);
 			cp->SG[i].Len = len;
 			cp->SG[i].Ext = 0;  // we are not chaining
+			count++;
 		}
 	}
 
-	cp->Header.SGList = (__u8) use_sg;   /* no. SGs contig in this cmd */
-	cp->Header.SGTotal = (__u16) use_sg; /* total sgs in this cmd list */
+	cp->Header.SGList = (__u8) count;   /* no. SGs contig in this cmd */
+	cp->Header.SGTotal = (__u16) count; /* total sgs in this cmd list */
+
+	BUG_ON(count > MAXSGENTRIES);
 	return;
 }
 
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -619,18 +619,17 @@ void scsi_eh_prep_cmnd(struct scsi_cmnd 
 	ses->data_direction = scmd->sc_data_direction;
 	ses->bufflen = scmd->request_bufflen;
 	ses->buffer = scmd->request_buffer;
-	ses->use_sg = scmd->use_sg;
 	ses->resid = scmd->resid;
 	ses->result = scmd->result;
 
 	if (sense_bytes) {
 		scmd->request_bufflen = min_t(unsigned,
 		                       sizeof(scmd->sense_buffer), sense_bytes);
-		sg_init_one(&ses->sense_sgl, scmd->sense_buffer,
-		                                       scmd->request_bufflen);
-		scmd->request_buffer = &ses->sense_sgl;
+
+		sg_ring_single(&ses->sense_sg.ring, scmd->sense_buffer,
+			       scmd->request_bufflen);
+		scmd->request_buffer = &ses->sense_sg;
 		scmd->sc_data_direction = DMA_FROM_DEVICE;
-		scmd->use_sg = 1;
 		memset(scmd->cmnd, 0, sizeof(scmd->cmnd));
 		scmd->cmnd[0] = REQUEST_SENSE;
 		scmd->cmnd[4] = scmd->request_bufflen;
@@ -639,7 +638,6 @@ void scsi_eh_prep_cmnd(struct scsi_cmnd 
 		scmd->request_buffer = NULL;
 		scmd->request_bufflen = 0;
 		scmd->sc_data_direction = DMA_NONE;
-		scmd->use_sg = 0;
 		if (cmnd) {
 			memset(scmd->cmnd, 0, sizeof(scmd->cmnd));
 			memcpy(scmd->cmnd, cmnd, cmnd_size);
@@ -678,7 +676,6 @@ void scsi_eh_restore_cmnd(struct scsi_cm
 	scmd->sc_data_direction = ses->data_direction;
 	scmd->request_bufflen = ses->bufflen;
 	scmd->request_buffer = ses->buffer;
-	scmd->use_sg = ses->use_sg;
 	scmd->resid = ses->resid;
 	scmd->result = ses->result;
 }
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -17,7 +17,7 @@
 #include <linux/pci.h>
 #include <linux/delay.h>
 #include <linux/hardirq.h>
-#include <linux/scatterlist.h>
+#include <linux/sg_ring.h>
 
 #include <scsi/scsi.h>
 #include <scsi/scsi_cmnd.h>
@@ -737,21 +737,31 @@ static inline unsigned int scsi_sgtable_
 	return index;
 }
 
-struct scatterlist *scsi_alloc_sgtable(struct scsi_cmnd *cmd, gfp_t gfp_mask)
+static void free_sgring(struct sg_ring *sg)
 {
 	struct scsi_host_sg_pool *sgp;
-	struct scatterlist *sgl, *prev, *ret;
+
+	sgp = scsi_sg_pools + scsi_sgtable_index(sg->max);
+
+	mempool_free(sg, sgp->pool);
+}
+
+struct sg_ring *scsi_alloc_sgring(struct scsi_cmnd *cmd, unsigned int num,
+				  gfp_t gfp_mask)
+{
+	struct scsi_host_sg_pool *sgp;
+	struct sg_ring *sg, *ret;
 	unsigned int index;
 	int this, left;
 
-	BUG_ON(!cmd->use_sg);
+	BUG_ON(!num);
 
-	left = cmd->use_sg;
-	ret = prev = NULL;
+	left = num;
+	ret = NULL;
 	do {
 		this = left;
 		if (this > SCSI_MAX_SG_SEGMENTS) {
-			this = SCSI_MAX_SG_SEGMENTS - 1;
+			this = SCSI_MAX_SG_SEGMENTS;
 			index = SG_MEMPOOL_NR - 1;
 		} else
 			index = scsi_sgtable_index(this);
@@ -760,32 +770,20 @@ struct scatterlist *scsi_alloc_sgtable(s
 
 		sgp = scsi_sg_pools + index;
 
-		sgl = mempool_alloc(sgp->pool, gfp_mask);
-		if (unlikely(!sgl))
+		sg = mempool_alloc(sgp->pool, gfp_mask);
+		if (unlikely(!sg))
 			goto enomem;
 
-		sg_init_table(sgl, sgp->size);
+		sg_ring_init(sg, sgp->size);
 
 		/*
-		 * first loop through, set initial index and return value
+		 * first loop through, set return value, otherwise
+		 * attach this to the tail.
 		 */
 		if (!ret)
-			ret = sgl;
-
-		/*
-		 * chain previous sglist, if any. we know the previous
-		 * sglist must be the biggest one, or we would not have
-		 * ended up doing another loop.
-		 */
-		if (prev)
-			sg_chain(prev, SCSI_MAX_SG_SEGMENTS, sgl);
-
-		/*
-		 * if we have nothing left, mark the last segment as
-		 * end-of-list
-		 */
-		if (!left)
-			sg_mark_end(&sgl[this - 1]);
+			ret = sg;
+		else
+			list_add_tail(&sg->list, &ret->list);
 
 		/*
 		 * don't allow subsequent mempool allocs to sleep, it would
@@ -793,85 +791,28 @@ struct scatterlist *scsi_alloc_sgtable(s
 		 */
 		gfp_mask &= ~__GFP_WAIT;
 		gfp_mask |= __GFP_HIGH;
-		prev = sgl;
 	} while (left);
 
-	/*
-	 * ->use_sg may get modified after dma mapping has potentially
-	 * shrunk the number of segments, so keep a copy of it for free.
-	 */
-	cmd->__use_sg = cmd->use_sg;
 	return ret;
 enomem:
 	if (ret) {
-		/*
-		 * Free entries chained off ret. Since we were trying to
-		 * allocate another sglist, we know that all entries are of
-		 * the max size.
-		 */
-		sgp = scsi_sg_pools + SG_MEMPOOL_NR - 1;
-		prev = ret;
-		ret = &ret[SCSI_MAX_SG_SEGMENTS - 1];
-
-		while ((sgl = sg_chain_ptr(ret)) != NULL) {
-			ret = &sgl[SCSI_MAX_SG_SEGMENTS - 1];
-			mempool_free(sgl, sgp->pool);
-		}
-
-		mempool_free(prev, sgp->pool);
+		while (!list_empty(&ret->list))
+			free_sgring(sg_ring_next(ret));
+		free_sgring(ret);
 	}
 	return NULL;
 }
+EXPORT_SYMBOL(scsi_alloc_sgring);
 
-EXPORT_SYMBOL(scsi_alloc_sgtable);
+void scsi_free_sgring(struct scsi_cmnd *cmd)
+{
+	struct sg_ring *sg = cmd->request_buffer;
 
-void scsi_free_sgtable(struct scsi_cmnd *cmd)
-{
-	struct scatterlist *sgl = cmd->request_buffer;
-	struct scsi_host_sg_pool *sgp;
-
-	/*
-	 * if this is the biggest size sglist, check if we have
-	 * chained parts we need to free
-	 */
-	if (cmd->__use_sg > SCSI_MAX_SG_SEGMENTS) {
-		unsigned short this, left;
-		struct scatterlist *next;
-		unsigned int index;
-
-		left = cmd->__use_sg - (SCSI_MAX_SG_SEGMENTS - 1);
-		next = sg_chain_ptr(&sgl[SCSI_MAX_SG_SEGMENTS - 1]);
-		while (left && next) {
-			sgl = next;
-			this = left;
-			if (this > SCSI_MAX_SG_SEGMENTS) {
-				this = SCSI_MAX_SG_SEGMENTS - 1;
-				index = SG_MEMPOOL_NR - 1;
-			} else
-				index = scsi_sgtable_index(this);
-
-			left -= this;
-
-			sgp = scsi_sg_pools + index;
-
-			if (left)
-				next = sg_chain_ptr(&sgl[sgp->size - 1]);
-
-			mempool_free(sgl, sgp->pool);
-		}
-
-		/*
-		 * Restore original, will be freed below
-		 */
-		sgl = cmd->request_buffer;
-		sgp = scsi_sg_pools + SG_MEMPOOL_NR - 1;
-	} else
-		sgp = scsi_sg_pools + scsi_sgtable_index(cmd->__use_sg);
-
-	mempool_free(sgl, sgp->pool);
+	while (!list_empty(&sg->list))
+		free_sgring(sg_ring_next(sg));
+	free_sgring(sg);
 }
-
-EXPORT_SYMBOL(scsi_free_sgtable);
+EXPORT_SYMBOL(scsi_free_sgring);
 
 /*
  * Function:    scsi_release_buffers()
@@ -892,8 +833,8 @@ EXPORT_SYMBOL(scsi_free_sgtable);
  */
 static void scsi_release_buffers(struct scsi_cmnd *cmd)
 {
-	if (cmd->use_sg)
-		scsi_free_sgtable(cmd);
+	if (cmd->request_buffer)
+		scsi_free_sgring(cmd);
 
 	/*
 	 * Zero these out.  They now point to freed memory, and it is
@@ -1106,20 +1047,20 @@ static int scsi_init_io(struct scsi_cmnd
 static int scsi_init_io(struct scsi_cmnd *cmd)
 {
 	struct request     *req = cmd->request;
-	int		   count;
 
 	/*
 	 * We used to not use scatter-gather for single segment request,
 	 * but now we do (it makes highmem I/O easier to support without
 	 * kmapping pages)
 	 */
-	cmd->use_sg = req->nr_phys_segments;
 
 	/*
 	 * If sg table allocation fails, requeue request later.
 	 */
-	cmd->request_buffer = scsi_alloc_sgtable(cmd, GFP_ATOMIC);
+	cmd->request_buffer = scsi_alloc_sgring(cmd, req->nr_phys_segments,
+						GFP_ATOMIC);
 	if (unlikely(!cmd->request_buffer)) {
+		BUG();
 		scsi_unprep_request(req);
 		return BLKPREP_DEFER;
 	}
@@ -1134,9 +1075,7 @@ static int scsi_init_io(struct scsi_cmnd
 	 * Next, walk the list, and fill in the addresses and sizes of
 	 * each segment.
 	 */
-	count = blk_rq_map_sg(req->q, req, cmd->request_buffer);
-	BUG_ON(count > cmd->use_sg);
-	cmd->use_sg = count;
+	blk_rq_map_sg_ring(req->q, req, cmd->request_buffer);
 	return BLKPREP_OK;
 }
 
@@ -1193,7 +1132,6 @@ int scsi_setup_blk_pc_cmnd(struct scsi_d
 
 		cmd->request_bufflen = 0;
 		cmd->request_buffer = NULL;
-		cmd->use_sg = 0;
 		req->buffer = NULL;
 	}
 
@@ -1751,7 +1689,7 @@ int __init scsi_init_queue(void)
 
 	for (i = 0; i < SG_MEMPOOL_NR; i++) {
 		struct scsi_host_sg_pool *sgp = scsi_sg_pools + i;
-		int size = sgp->size * sizeof(struct scatterlist);
+		int size = sizeof(struct sg_ring) + sgp->size * sizeof(struct scatterlist);
 
 		sgp->slab = kmem_cache_create(sgp->name, size, 0,
 				SLAB_HWCACHE_ALIGN, NULL);
diff --git a/drivers/scsi/scsi_lib_dma.c b/drivers/scsi/scsi_lib_dma.c
--- a/drivers/scsi/scsi_lib_dma.c
+++ b/drivers/scsi/scsi_lib_dma.c
@@ -15,22 +15,17 @@
  * scsi_dma_map - perform DMA mapping against command's sg lists
  * @cmd:	scsi command
  *
- * Returns the number of sg lists actually used, zero if the sg lists
- * is NULL, or -ENOMEM if the mapping failed.
+ * Returns -ENOMEM if the mapping failed.
  */
 int scsi_dma_map(struct scsi_cmnd *cmd)
 {
-	int nseg = 0;
-
-	if (scsi_sg_count(cmd)) {
+	if (scsi_sgring(cmd)) {
 		struct device *dev = cmd->device->host->shost_gendev.parent;
 
-		nseg = dma_map_sg(dev, scsi_sglist(cmd), scsi_sg_count(cmd),
-				  cmd->sc_data_direction);
-		if (unlikely(!nseg))
-			return -ENOMEM;
+		return dma_map_sg_ring(dev, scsi_sgring(cmd),
+				       cmd->sc_data_direction);
 	}
-	return nseg;
+	return 0;
 }
 EXPORT_SYMBOL(scsi_dma_map);
 
@@ -40,11 +35,10 @@ EXPORT_SYMBOL(scsi_dma_map);
  */
 void scsi_dma_unmap(struct scsi_cmnd *cmd)
 {
-	if (scsi_sg_count(cmd)) {
+	if (scsi_sgring(cmd)) {
 		struct device *dev = cmd->device->host->shost_gendev.parent;
 
-		dma_unmap_sg(dev, scsi_sglist(cmd), scsi_sg_count(cmd),
-			     cmd->sc_data_direction);
+		dma_unmap_sg_ring(dev, scsi_sgring(cmd),cmd->sc_data_direction);
 	}
 }
 EXPORT_SYMBOL(scsi_dma_unmap);
diff --git a/drivers/scsi/scsi_tgt_lib.c b/drivers/scsi/scsi_tgt_lib.c
--- a/drivers/scsi/scsi_tgt_lib.c
+++ b/drivers/scsi/scsi_tgt_lib.c
@@ -358,15 +358,15 @@ static int scsi_tgt_init_cmd(struct scsi
 	struct request *rq = cmd->request;
 	int count;
 
-	cmd->use_sg = rq->nr_phys_segments;
-	cmd->request_buffer = scsi_alloc_sgtable(cmd, gfp_mask);
+	cmd->request_buffer = scsi_alloc_sgring(cmd, rq->nr_phys_segments,
+						gfp_mask);
 	if (!cmd->request_buffer)
 		return -ENOMEM;
 
 	cmd->request_bufflen = rq->data_len;
 
 	dprintk("cmd %p cnt %d %lu\n", cmd, cmd->use_sg, rq_data_dir(rq));
-	count = blk_rq_map_sg(rq->q, rq, cmd->request_buffer);
+	count = blk_rq_map_sg_ring(rq->q, rq, cmd->request_buffer);
 	BUG_ON(count > cmd->use_sg);
 	cmd->use_sg = count;
 	return 0;
diff --git a/include/scsi/scsi_cmnd.h b/include/scsi/scsi_cmnd.h
--- a/include/scsi/scsi_cmnd.h
+++ b/include/scsi/scsi_cmnd.h
@@ -66,10 +66,6 @@ struct scsi_cmnd {
 
 	struct timer_list eh_timeout;	/* Used to time out the command. */
 	void *request_buffer;		/* Actual requested buffer */
-
-	/* These elements define the operation we ultimately want to perform */
-	unsigned short use_sg;	/* Number of pieces of scatter-gather */
-	unsigned short __use_sg;
 
 	unsigned underflow;	/* Return error if less than
 				   this amount is transferred */
@@ -128,14 +124,13 @@ extern void *scsi_kmap_atomic_sg(struct 
 				 size_t *offset, size_t *len);
 extern void scsi_kunmap_atomic_sg(void *virt);
 
-extern struct scatterlist *scsi_alloc_sgtable(struct scsi_cmnd *, gfp_t);
-extern void scsi_free_sgtable(struct scsi_cmnd *);
+extern struct sg_ring *scsi_alloc_sgring(struct scsi_cmnd *, unsigned, gfp_t);
+extern void scsi_free_sgring(struct scsi_cmnd *);
 
 extern int scsi_dma_map(struct scsi_cmnd *cmd);
 extern void scsi_dma_unmap(struct scsi_cmnd *cmd);
 
-#define scsi_sg_count(cmd) ((cmd)->use_sg)
-#define scsi_sglist(cmd) ((struct scatterlist *)(cmd)->request_buffer)
+#define scsi_sgring(cmd) ((struct sg_ring *)(cmd)->request_buffer)
 #define scsi_bufflen(cmd) ((cmd)->request_bufflen)
 
 static inline void scsi_set_resid(struct scsi_cmnd *cmd, int resid)
diff --git a/include/scsi/scsi_eh.h b/include/scsi/scsi_eh.h
--- a/include/scsi/scsi_eh.h
+++ b/include/scsi/scsi_eh.h
@@ -1,7 +1,7 @@
 #ifndef _SCSI_SCSI_EH_H
 #define _SCSI_SCSI_EH_H
 
-#include <linux/scatterlist.h>
+#include <linux/sg_ring.h>
 
 #include <scsi/scsi_cmnd.h>
 struct scsi_device;
@@ -75,10 +75,9 @@ struct scsi_eh_save {
 
 	void *buffer;
 	unsigned bufflen;
-	unsigned short use_sg;
 	int resid;
 
-	struct scatterlist sense_sgl;
+	DECLARE_SG_RING(sense_sg, 1);
 };
 
 extern void scsi_eh_prep_cmnd(struct scsi_cmnd *scmd,

* [PATCH 6/7] sg_ring: libata simplification
  2007-12-19  7:36         ` [PATCH 5/7] sg_ring: Convert core scsi code to sg_ring Rusty Russell
@ 2007-12-19  7:37           ` Rusty Russell
  2007-12-19  7:38             ` [PATCH 7/7] sg_ring: convert core ATA code to sg_ring Rusty Russell
  0 siblings, 1 reply; 25+ messages in thread
From: Rusty Russell @ 2007-12-19  7:37 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, linux-scsi, linux-ide

[This patch has been obsoleted: Tejun has a more complete one, but
it's not in mainline yet, so this serves to make the next patch
apply].

libata separates the single buffer case from the scatterlist case
internally.  It's not clear that this is necessary.

Remove the ATA_QCFLAG_SINGLE flag, and buf_virt pointer, and always
initialize qc->nbytes in ata_sg_init().

It's possible that the ATA_QCFLAG_SG and ATA_QCFLAG_DMAMAP flags could
be entirely removed, and we could use whether qc->__sg is NULL or not.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

diff -r 8b1075c7ad47 drivers/ata/libata-core.c
--- a/drivers/ata/libata-core.c	Tue Nov 13 21:00:47 2007 +1100
+++ b/drivers/ata/libata-core.c	Wed Nov 14 15:31:07 2007 +1100
@@ -1648,16 +1648,8 @@ unsigned ata_exec_internal_sg(struct ata
 		memcpy(qc->cdb, cdb, ATAPI_CDB_LEN);
 	qc->flags |= ATA_QCFLAG_RESULT_TF;
 	qc->dma_dir = dma_dir;
-	if (dma_dir != DMA_NONE) {
-		unsigned int i, buflen = 0;
-		struct scatterlist *sg;
-
-		for_each_sg(sgl, sg, n_elem, i)
-			buflen += sg->length;
-
+	if (dma_dir != DMA_NONE)
 		ata_sg_init(qc, sgl, n_elem);
-		qc->nbytes = buflen;
-	}
 
 	qc->private_data = &wait;
 	qc->complete_fn = ata_qc_complete_internal;
@@ -4543,9 +4535,6 @@ void ata_sg_clean(struct ata_queued_cmd 
 	WARN_ON(!(qc->flags & ATA_QCFLAG_DMAMAP));
 	WARN_ON(sg == NULL);
 
-	if (qc->flags & ATA_QCFLAG_SINGLE)
-		WARN_ON(qc->n_elem > 1);
-
 	VPRINTK("unmapping %u sg elements\n", qc->n_elem);
 
 	/* if we padded the buffer out to 32-bit bound, and data
@@ -4566,16 +4555,6 @@ void ata_sg_clean(struct ata_queued_cmd 
 			memcpy(addr + psg->offset, pad_buf, qc->pad_len);
 			kunmap_atomic(addr, KM_IRQ0);
 		}
-	} else {
-		if (qc->n_elem)
-			dma_unmap_single(ap->dev,
-				sg_dma_address(&sg[0]), sg_dma_len(&sg[0]),
-				dir);
-		/* restore sg */
-		sg->length += qc->pad_len;
-		if (pad_buf)
-			memcpy(qc->buf_virt + sg->length - qc->pad_len,
-			       pad_buf, qc->pad_len);
 	}
 
 	qc->flags &= ~ATA_QCFLAG_DMAMAP;
@@ -4807,16 +4786,8 @@ void ata_noop_qc_prep(struct ata_queued_
 
 void ata_sg_init_one(struct ata_queued_cmd *qc, void *buf, unsigned int buflen)
 {
-	qc->flags |= ATA_QCFLAG_SINGLE;
-
-	qc->__sg = &qc->sgent;
-	qc->n_elem = 1;
-	qc->orig_n_elem = 1;
-	qc->buf_virt = buf;
-	qc->nbytes = buflen;
-	qc->cursg = qc->__sg;
-
 	sg_init_one(&qc->sgent, buf, buflen);
+	ata_sg_init(qc, &qc->sgent, 1);
 }
 
 /**
@@ -4841,75 +4813,13 @@ void ata_sg_init(struct ata_queued_cmd *
 	qc->n_elem = n_elem;
 	qc->orig_n_elem = n_elem;
 	qc->cursg = qc->__sg;
-}
-
-/**
- *	ata_sg_setup_one - DMA-map the memory buffer associated with a command.
- *	@qc: Command with memory buffer to be mapped.
- *
- *	DMA-map the memory buffer associated with queued_cmd @qc.
- *
- *	LOCKING:
- *	spin_lock_irqsave(host lock)
- *
- *	RETURNS:
- *	Zero on success, negative on error.
- */
-
-static int ata_sg_setup_one(struct ata_queued_cmd *qc)
-{
-	struct ata_port *ap = qc->ap;
-	int dir = qc->dma_dir;
-	struct scatterlist *sg = qc->__sg;
-	dma_addr_t dma_address;
-	int trim_sg = 0;
-
-	/* we must lengthen transfers to end on a 32-bit boundary */
-	qc->pad_len = sg->length & 3;
-	if (qc->pad_len) {
-		void *pad_buf = ap->pad + (qc->tag * ATA_DMA_PAD_SZ);
-		struct scatterlist *psg = &qc->pad_sgent;
-
-		WARN_ON(qc->dev->class != ATA_DEV_ATAPI);
-
-		memset(pad_buf, 0, ATA_DMA_PAD_SZ);
-
-		if (qc->tf.flags & ATA_TFLAG_WRITE)
-			memcpy(pad_buf, qc->buf_virt + sg->length - qc->pad_len,
-			       qc->pad_len);
-
-		sg_dma_address(psg) = ap->pad_dma + (qc->tag * ATA_DMA_PAD_SZ);
-		sg_dma_len(psg) = ATA_DMA_PAD_SZ;
-		/* trim sg */
-		sg->length -= qc->pad_len;
-		if (sg->length == 0)
-			trim_sg = 1;
-
-		DPRINTK("padding done, sg->length=%u pad_len=%u\n",
-			sg->length, qc->pad_len);
-	}
-
-	if (trim_sg) {
-		qc->n_elem--;
-		goto skip_map;
-	}
-
-	dma_address = dma_map_single(ap->dev, qc->buf_virt,
-				     sg->length, dir);
-	if (dma_mapping_error(dma_address)) {
-		/* restore sg */
-		sg->length += qc->pad_len;
-		return -1;
-	}
-
-	sg_dma_address(sg) = dma_address;
-	sg_dma_len(sg) = sg->length;
-
-skip_map:
-	DPRINTK("mapped buffer of %d bytes for %s\n", sg_dma_len(sg),
-		qc->tf.flags & ATA_TFLAG_WRITE ? "write" : "read");
-
-	return 0;
+
+	qc->nbytes = 0;
+	while (n_elem) {
+		qc->nbytes += sg->length;
+		n_elem--;
+		sg = sg_next(sg);
+	}
 }
 
 /**
@@ -6028,9 +5938,6 @@ void ata_qc_issue(struct ata_queued_cmd 
 	if (ata_should_dma_map(qc)) {
 		if (qc->flags & ATA_QCFLAG_SG) {
 			if (ata_sg_setup(qc))
-				goto sg_err;
-		} else if (qc->flags & ATA_QCFLAG_SINGLE) {
-			if (ata_sg_setup_one(qc))
 				goto sg_err;
 		}
 	} else {
diff -r 8b1075c7ad47 include/linux/libata.h
--- a/include/linux/libata.h	Tue Nov 13 21:00:47 2007 +1100
+++ b/include/linux/libata.h	Wed Nov 14 15:31:07 2007 +1100
@@ -216,8 +216,7 @@ enum {
 	/* struct ata_queued_cmd flags */
 	ATA_QCFLAG_ACTIVE	= (1 << 0), /* cmd not yet ack'd to scsi lyer */
 	ATA_QCFLAG_SG		= (1 << 1), /* have s/g table? */
-	ATA_QCFLAG_SINGLE	= (1 << 2), /* no s/g, just a single buffer */
-	ATA_QCFLAG_DMAMAP	= ATA_QCFLAG_SG | ATA_QCFLAG_SINGLE,
+	ATA_QCFLAG_DMAMAP	= ATA_QCFLAG_SG,
 	ATA_QCFLAG_IO		= (1 << 3), /* standard IO command */
 	ATA_QCFLAG_RESULT_TF	= (1 << 4), /* result TF requested */
 	ATA_QCFLAG_CLEAR_EXCL	= (1 << 5), /* clear excl_link on completion */
@@ -459,7 +458,6 @@ struct ata_queued_cmd {
 
 	struct scatterlist	sgent;
 	struct scatterlist	pad_sgent;
-	void			*buf_virt;
 
 	/* DO NOT iterate over __sg manually, use ata_for_each_sg() */
 	struct scatterlist	*__sg;

* [PATCH 7/7] sg_ring: convert core ATA code to sg_ring.
  2007-12-19  7:37           ` [PATCH 6/7] sg_ring: libata simplification Rusty Russell
@ 2007-12-19  7:38             ` Rusty Russell
  2007-12-26  8:36               ` Tejun Heo
  0 siblings, 1 reply; 25+ messages in thread
From: Rusty Russell @ 2007-12-19  7:38 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, linux-scsi, linux-ide

ATA relies so heavily on scsi that it needs to be converted at the
same time.

ATA adds padding to scatterlists in scsi commands, but because there was
no good way of appending to those scatterlists, it had to use boutique
iterators to make sure the padding is included.  With sg_ring, ATA can
simply append an sg_ring entry with the padding, and normal iterators
can be used.

I renamed qc->cursg to qc->cur_sg to catch all the users: wherever they
were using 'qc->cursg', they should now refer to
'qc->cur_sg->sg[qc->cursg_i]'.
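
The padding trick then reduces to something like this (a sketch;
pad_page and pad_offset stand in for the page and offset derived from
the last data element; see ata_sg_setup() below for the real code,
which also sets up the DMA address and handles a data element that
becomes empty after trimming):

	/* The pad buffer lives in its own one-element ring... */
	sg_ring_init(&qc->pad_sgent.ring, 1);
	sg_set_page(&qc->pad_sgent.sg[0], pad_page, qc->pad_len, pad_offset);
	qc->pad_sgent.ring.num = 1;	/* one valid entry */

	/* ...which is simply linked onto the command's ring, so the normal
	 * iterators (ata_fill_sg, the PIO loops) pick it up for free. */
	list_add_tail(&qc->pad_sgent.ring.list, &qc->sg->list);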

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

diff -r a867138da3e0 drivers/ata/ahci.c
--- a/drivers/ata/ahci.c	Wed Dec 19 16:48:15 2007 +1100
+++ b/drivers/ata/ahci.c	Wed Dec 19 17:27:09 2007 +1100
@@ -1479,9 +1479,9 @@ static void ahci_tf_read(struct ata_port
 
 static unsigned int ahci_fill_sg(struct ata_queued_cmd *qc, void *cmd_tbl)
 {
-	struct scatterlist *sg;
 	struct ahci_sg *ahci_sg;
-	unsigned int n_sg = 0;
+	struct sg_ring *sg;
+	unsigned int i, n_sg = 0;
 
 	VPRINTK("ENTER\n");
 
@@ -1489,9 +1489,9 @@ static unsigned int ahci_fill_sg(struct 
 	 * Next, the S/G list.
 	 */
 	ahci_sg = cmd_tbl + AHCI_CMD_TBL_HDR_SZ;
-	ata_for_each_sg(sg, qc) {
-		dma_addr_t addr = sg_dma_address(sg);
-		u32 sg_len = sg_dma_len(sg);
+	sg_ring_for_each(qc->sg, sg, i) {
+		dma_addr_t addr = sg_dma_address(&sg->sg[i]);
+		u32 sg_len = sg_dma_len(&sg->sg[i]);
 
 		ahci_sg->addr = cpu_to_le32(addr & 0xffffffff);
 		ahci_sg->addr_hi = cpu_to_le32((addr >> 16) >> 16);
diff -r a867138da3e0 drivers/ata/libata-core.c
--- a/drivers/ata/libata-core.c	Wed Dec 19 16:48:15 2007 +1100
+++ b/drivers/ata/libata-core.c	Wed Dec 19 17:27:09 2007 +1100
@@ -1596,8 +1596,8 @@ static void ata_qc_complete_internal(str
  */
 unsigned ata_exec_internal_sg(struct ata_device *dev,
 			      struct ata_taskfile *tf, const u8 *cdb,
-			      int dma_dir, struct scatterlist *sgl,
-			      unsigned int n_elem, unsigned long timeout)
+			      int dma_dir, struct sg_ring *sg,
+			      unsigned long timeout)
 {
 	struct ata_link *link = dev->link;
 	struct ata_port *ap = link->ap;
@@ -1657,7 +1657,7 @@ unsigned ata_exec_internal_sg(struct ata
 	qc->flags |= ATA_QCFLAG_RESULT_TF;
 	qc->dma_dir = dma_dir;
 	if (dma_dir != DMA_NONE)
-		ata_sg_init(qc, sgl, n_elem);
+		ata_sg_init(qc, sg);
 
 	qc->private_data = &wait;
 	qc->complete_fn = ata_qc_complete_internal;
@@ -1770,18 +1770,16 @@ unsigned ata_exec_internal(struct ata_de
 			   int dma_dir, void *buf, unsigned int buflen,
 			   unsigned long timeout)
 {
-	struct scatterlist *psg = NULL, sg;
-	unsigned int n_elem = 0;
+	struct sg_ring *psg = NULL;
+	DECLARE_SG_RING(sg, 1);
 
 	if (dma_dir != DMA_NONE) {
 		WARN_ON(!buf);
-		sg_init_one(&sg, buf, buflen);
-		psg = &sg;
-		n_elem++;
-	}
-
-	return ata_exec_internal_sg(dev, tf, cdb, dma_dir, psg, n_elem,
-				    timeout);
+		sg_ring_single(&sg.ring, buf, buflen);
+		psg = &sg.ring;
+	}
+
+	return ata_exec_internal_sg(dev, tf, cdb, dma_dir, psg, timeout);
 }
 
 /**
@@ -4438,6 +4436,36 @@ static unsigned int ata_dev_init_params(
 	return err_mask;
 }
 
+static struct sg_ring *sg_ring_prev(struct sg_ring *sg)
+{
+	return list_entry(sg->list.prev, struct sg_ring, list);
+}
+
+/* To be paranoid, we handle empty scatterlist ring entries here. */
+static struct sg_ring *sg_find_end(struct sg_ring *sg, unsigned int *num)
+{
+	struct sg_ring *i;
+
+	for (i = sg_ring_prev(sg); i->num == 0; i = sg_ring_prev(i)) {
+		/* sg_ring must not be entirely empty. */
+		BUG_ON(i == sg);
+	}
+	*num = i->num - 1;
+	return i;
+}
+
+static void add_padding_to_tail(struct sg_ring *sg, unsigned int padding)
+{
+	struct sg_ring *tail = sg_ring_prev(sg);
+
+	/* It's possible that the entire sg element got moved to padding. */
+	if (tail->num == 0) {
+		BUG_ON(tail->max == 0);
+		tail->num++;
+	}
+	tail->sg[tail->num - 1].length += padding;
+}
+
 /**
  *	ata_sg_clean - Unmap DMA memory associated with command
  *	@qc: Command containing DMA memory to be released
@@ -4450,37 +4478,39 @@ void ata_sg_clean(struct ata_queued_cmd 
 void ata_sg_clean(struct ata_queued_cmd *qc)
 {
 	struct ata_port *ap = qc->ap;
-	struct scatterlist *sg = qc->__sg;
 	int dir = qc->dma_dir;
-	void *pad_buf = NULL;
-
-	WARN_ON(!(qc->flags & ATA_QCFLAG_DMAMAP));
-	WARN_ON(sg == NULL);
-
-	VPRINTK("unmapping %u sg elements\n", qc->n_elem);
-
-	/* if we padded the buffer out to 32-bit bound, and data
-	 * xfer direction is from-device, we must copy from the
-	 * pad buffer back into the supplied buffer
-	 */
-	if (qc->pad_len && !(qc->tf.flags & ATA_TFLAG_WRITE))
-		pad_buf = ap->pad + (qc->tag * ATA_DMA_PAD_SZ);
-
-	if (qc->flags & ATA_QCFLAG_SG) {
-		if (qc->n_elem)
-			dma_unmap_sg(ap->dev, sg, qc->n_elem, dir);
-		/* restore last sg */
-		sg_last(sg, qc->orig_n_elem)->length += qc->pad_len;
-		if (pad_buf) {
-			struct scatterlist *psg = &qc->pad_sgent;
+
+	BUG_ON(!(qc->flags & ATA_QCFLAG_DMAMAP));
+
+	/* If we padded this entry, don't try to unmap padding */
+	if (qc->pad_len) {
+		BUG_ON(list_empty(&qc->pad_sgent.ring.list));
+		list_del(&qc->pad_sgent.ring.list);
+	}
+
+	dma_unmap_sg_ring(ap->dev, qc->sg, dir);
+
+	/* Restore any padding */
+	if (qc->pad_len) {
+		add_padding_to_tail(qc->sg, qc->pad_len);
+
+		/* if we padded the buffer out to 32-bit bound, and data
+		 * xfer direction is from-device, we must copy from the
+		 * pad buffer back into the supplied buffer
+		 */
+		if (!(qc->tf.flags & ATA_TFLAG_WRITE)) {
+			struct scatterlist *psg = qc->pad_sgent.sg;
 			void *addr = kmap_atomic(sg_page(psg), KM_IRQ0);
-			memcpy(addr + psg->offset, pad_buf, qc->pad_len);
+
+			memcpy(addr + psg->offset,
+			       ap->pad + (qc->tag * ATA_DMA_PAD_SZ),
+			       qc->pad_len);
 			kunmap_atomic(addr, KM_IRQ0);
 		}
 	}
 
 	qc->flags &= ~ATA_QCFLAG_DMAMAP;
-	qc->__sg = NULL;
+	qc->sg = NULL;
 }
 
 /**
@@ -4497,14 +4527,11 @@ static void ata_fill_sg(struct ata_queue
 static void ata_fill_sg(struct ata_queued_cmd *qc)
 {
 	struct ata_port *ap = qc->ap;
-	struct scatterlist *sg;
-	unsigned int idx;
-
-	WARN_ON(qc->__sg == NULL);
-	WARN_ON(qc->n_elem == 0 && qc->pad_len == 0);
+	struct sg_ring *sg;
+	unsigned int i, idx;
 
 	idx = 0;
-	ata_for_each_sg(sg, qc) {
+	sg_ring_for_each(qc->sg, sg, i) {
 		u32 addr, offset;
 		u32 sg_len, len;
 
@@ -4512,8 +4539,8 @@ static void ata_fill_sg(struct ata_queue
 		 * Note h/w doesn't support 64-bit, so we unconditionally
 		 * truncate dma_addr_t to u32.
 		 */
-		addr = (u32) sg_dma_address(sg);
-		sg_len = sg_dma_len(sg);
+		addr = (u32) sg_dma_address(&sg->sg[i]);
+		sg_len = sg_dma_len(&sg->sg[i]);
 
 		while (sg_len) {
 			offset = addr & 0xffff;
@@ -4551,14 +4578,11 @@ static void ata_fill_sg_dumb(struct ata_
 static void ata_fill_sg_dumb(struct ata_queued_cmd *qc)
 {
 	struct ata_port *ap = qc->ap;
-	struct scatterlist *sg;
-	unsigned int idx;
-
-	WARN_ON(qc->__sg == NULL);
-	WARN_ON(qc->n_elem == 0 && qc->pad_len == 0);
+	struct sg_ring *sg;
+	unsigned int i, idx;
 
 	idx = 0;
-	ata_for_each_sg(sg, qc) {
+	sg_ring_for_each(qc->sg, sg, i) {
 		u32 addr, offset;
 		u32 sg_len, len, blen;
 
@@ -4566,8 +4590,8 @@ static void ata_fill_sg_dumb(struct ata_
 		 * Note h/w doesn't support 64-bit, so we unconditionally
 		 * truncate dma_addr_t to u32.
 		 */
-		addr = (u32) sg_dma_address(sg);
-		sg_len = sg_dma_len(sg);
+		addr = (u32) sg_dma_address(&sg->sg[i]);
+		sg_len = sg_dma_len(&sg->sg[i]);
 
 		while (sg_len) {
 			offset = addr & 0xffff;
@@ -4708,39 +4732,34 @@ void ata_noop_qc_prep(struct ata_queued_
 
 void ata_sg_init_one(struct ata_queued_cmd *qc, void *buf, unsigned int buflen)
 {
-	sg_init_one(&qc->sgent, buf, buflen);
-	ata_sg_init(qc, &qc->sgent, 1);
+	sg_ring_single(&qc->sgent.ring, buf, buflen);
+	ata_sg_init(qc, &qc->sgent.ring);
 }
 
 /**
  *	ata_sg_init - Associate command with scatter-gather table.
  *	@qc: Command to be associated
- *	@sg: Scatter-gather table.
- *	@n_elem: Number of elements in s/g table.
+ *	@sg: Scatter-gather ring.
  *
  *	Initialize the data-related elements of queued_cmd @qc
- *	to point to a scatter-gather table @sg, containing @n_elem
- *	elements.
- *
- *	LOCKING:
- *	spin_lock_irqsave(host lock)
- */
-
-void ata_sg_init(struct ata_queued_cmd *qc, struct scatterlist *sg,
-		 unsigned int n_elem)
-{
+ *	to point to a scatter-gather ring @sg.
+ *
+ *	LOCKING:
+ *	spin_lock_irqsave(host lock)
+ */
+
+void ata_sg_init(struct ata_queued_cmd *qc, struct sg_ring *sg)
+{
+	unsigned int i;
+
 	qc->flags |= ATA_QCFLAG_SG;
-	qc->__sg = sg;
-	qc->n_elem = n_elem;
-	qc->orig_n_elem = n_elem;
-	qc->cursg = qc->__sg;
+	qc->sg = sg;
+	qc->cur_sg = qc->sg;
+	qc->cursg_i = 0;
 
 	qc->nbytes = 0;
-	while (n_elem) {
-		qc->nbytes += sg->length;
-		n_elem--;
-		sg = sg_next(sg);
-	}
+	sg_ring_for_each(qc->sg, sg, i)
+		qc->nbytes += sg->sg[i].length;
 }
 
 /**
@@ -4760,18 +4779,20 @@ static int ata_sg_setup(struct ata_queue
 static int ata_sg_setup(struct ata_queued_cmd *qc)
 {
 	struct ata_port *ap = qc->ap;
-	struct scatterlist *sg = qc->__sg;
-	struct scatterlist *lsg = sg_last(qc->__sg, qc->n_elem);
-	int n_elem, pre_n_elem, dir, trim_sg = 0;
+	unsigned int lsg;
+	struct sg_ring *end = sg_find_end(qc->sg, &lsg);
 
 	VPRINTK("ENTER, ata%u\n", ap->print_id);
 	WARN_ON(!(qc->flags & ATA_QCFLAG_SG));
 
+	BUG_ON(qc->sg->num == 0);
+	BUG_ON(sg_virt(&qc->sg->sg[0]) == NULL);
+
 	/* we must lengthen transfers to end on a 32-bit boundary */
-	qc->pad_len = lsg->length & 3;
+	qc->pad_len = end->sg[lsg].length & 3;
 	if (qc->pad_len) {
 		void *pad_buf = ap->pad + (qc->tag * ATA_DMA_PAD_SZ);
-		struct scatterlist *psg = &qc->pad_sgent;
+		struct scatterlist *psg = &qc->pad_sgent.sg[0];
 		unsigned int offset;
 
 		WARN_ON(qc->dev->class != ATA_DEV_ATAPI);
@@ -4782,10 +4803,13 @@ static int ata_sg_setup(struct ata_queue
 		 * psg->page/offset are used to copy to-be-written
 		 * data in this function or read data in ata_sg_clean.
 		 */
-		offset = lsg->offset + lsg->length - qc->pad_len;
-		sg_init_table(psg, 1);
-		sg_set_page(psg, nth_page(sg_page(lsg), offset >> PAGE_SHIFT),
-				qc->pad_len, offset_in_page(offset));
+		offset = end->sg[lsg].offset + end->sg[lsg].length - qc->pad_len;
+
+		sg_ring_init(&qc->pad_sgent.ring, 1);
+
+		sg_set_page(psg, nth_page(sg_page(&end->sg[lsg]),
+					  offset >> PAGE_SHIFT),
+			    qc->pad_len, offset_in_page(offset));
 
 		if (qc->tf.flags & ATA_TFLAG_WRITE) {
 			void *addr = kmap_atomic(sg_page(psg), KM_IRQ0);
@@ -4795,36 +4819,25 @@ static int ata_sg_setup(struct ata_queue
 
 		sg_dma_address(psg) = ap->pad_dma + (qc->tag * ATA_DMA_PAD_SZ);
 		sg_dma_len(psg) = ATA_DMA_PAD_SZ;
+
 		/* trim last sg */
-		lsg->length -= qc->pad_len;
-		if (lsg->length == 0)
-			trim_sg = 1;
+		end->sg[lsg].length -= qc->pad_len;
+		if (end->sg[lsg].length == 0)
+			end->num--;
 
 		DPRINTK("padding done, sg[%d].length=%u pad_len=%u\n",
-			qc->n_elem - 1, lsg->length, qc->pad_len);
-	}
-
-	pre_n_elem = qc->n_elem;
-	if (trim_sg && pre_n_elem)
-		pre_n_elem--;
-
-	if (!pre_n_elem) {
-		n_elem = 0;
-		goto skip_map;
-	}
-
-	dir = qc->dma_dir;
-	n_elem = dma_map_sg(ap->dev, sg, pre_n_elem, dir);
-	if (n_elem < 1) {
+			qc->n_elem - 1, end->sg[lsg].length, qc->pad_len);
+	}
+
+	if (dma_map_sg_ring(ap->dev, qc->sg, qc->dma_dir) != 0) {
 		/* restore last sg */
-		lsg->length += qc->pad_len;
+		add_padding_to_tail(qc->sg, qc->pad_len);
 		return -1;
 	}
 
-	DPRINTK("%d sg elements mapped\n", n_elem);
-
-skip_map:
-	qc->n_elem = n_elem;
+	/* We attach pad sg to end for iteration purposes. */
+	if (qc->pad_len)
+		list_add_tail(&qc->pad_sgent.ring.list, &qc->sg->list);
 
 	return 0;
 }
@@ -4934,8 +4947,8 @@ static void ata_pio_sector(struct ata_qu
 	if (qc->curbytes == qc->nbytes - qc->sect_size)
 		ap->hsm_task_state = HSM_ST_LAST;
 
-	page = sg_page(qc->cursg);
-	offset = qc->cursg->offset + qc->cursg_ofs;
+	page = sg_page(&qc->cur_sg->sg[qc->cursg_i]);
+	offset = qc->cur_sg->sg[qc->cursg_i].offset + qc->cursg_ofs;
 
 	/* get the current page and offset */
 	page = nth_page(page, (offset >> PAGE_SHIFT));
@@ -4963,10 +4976,8 @@ static void ata_pio_sector(struct ata_qu
 	qc->curbytes += qc->sect_size;
 	qc->cursg_ofs += qc->sect_size;
 
-	if (qc->cursg_ofs == qc->cursg->length) {
-		qc->cursg = sg_next(qc->cursg);
-		qc->cursg_ofs = 0;
-	}
+	if (qc->cursg_ofs == qc->cur_sg->sg[qc->cursg_i].length)
+		qc->cur_sg = sg_ring_iter(qc->sg, qc->cur_sg, &qc->cursg_i);
 }
 
 /**
@@ -5049,19 +5060,16 @@ static void __atapi_pio_bytes(struct ata
 static void __atapi_pio_bytes(struct ata_queued_cmd *qc, unsigned int bytes)
 {
 	int do_write = (qc->tf.flags & ATA_TFLAG_WRITE);
-	struct scatterlist *sg = qc->__sg;
-	struct scatterlist *lsg = sg_last(qc->__sg, qc->n_elem);
 	struct ata_port *ap = qc->ap;
 	struct page *page;
 	unsigned char *buf;
 	unsigned int offset, count;
-	int no_more_sg = 0;
 
 	if (qc->curbytes + bytes >= qc->nbytes)
 		ap->hsm_task_state = HSM_ST_LAST;
 
 next_sg:
-	if (unlikely(no_more_sg)) {
+	if (unlikely(!qc->cur_sg)) {
 		/*
 		 * The end of qc->sg is reached and the device expects
 		 * more data to transfer. In order not to overrun qc->sg
@@ -5084,17 +5092,15 @@ next_sg:
 		return;
 	}
 
-	sg = qc->cursg;
-
-	page = sg_page(sg);
-	offset = sg->offset + qc->cursg_ofs;
+	page = sg_page(&qc->cur_sg->sg[qc->cursg_i]);
+	offset = qc->cur_sg->sg[qc->cursg_i].offset + qc->cursg_ofs;
 
 	/* get the current page and offset */
 	page = nth_page(page, (offset >> PAGE_SHIFT));
 	offset %= PAGE_SIZE;
 
 	/* don't overrun current sg */
-	count = min(sg->length - qc->cursg_ofs, bytes);
+	count = min(qc->cur_sg->sg[qc->cursg_i].length - qc->cursg_ofs, bytes);
 
 	/* don't cross page boundaries */
 	count = min(count, (unsigned int)PAGE_SIZE - offset);
@@ -5122,13 +5128,8 @@ next_sg:
 	qc->curbytes += count;
 	qc->cursg_ofs += count;
 
-	if (qc->cursg_ofs == sg->length) {
-		if (qc->cursg == lsg)
-			no_more_sg = 1;
-
-		qc->cursg = sg_next(qc->cursg);
-		qc->cursg_ofs = 0;
-	}
+	if (qc->cursg_ofs == qc->cur_sg->sg[qc->cursg_i].length)
+		qc->cur_sg = sg_ring_iter(qc->sg, qc->cur_sg, &qc->cursg_i);
 
 	if (bytes)
 		goto next_sg;
diff -r a867138da3e0 drivers/ata/libata-scsi.c
--- a/drivers/ata/libata-scsi.c	Wed Dec 19 16:48:15 2007 +1100
+++ b/drivers/ata/libata-scsi.c	Wed Dec 19 17:27:09 2007 +1100
@@ -517,8 +517,7 @@ static struct ata_queued_cmd *ata_scsi_q
 		qc->scsicmd = cmd;
 		qc->scsidone = done;
 
-		qc->__sg = scsi_sglist(cmd);
-		qc->n_elem = scsi_sg_count(cmd);
+		qc->sg = scsi_sgring(cmd);
 	} else {
 		cmd->result = (DID_OK << 16) | (QUEUE_FULL << 1);
 		done(cmd);
@@ -1479,6 +1478,17 @@ static void ata_scsi_qc_complete(struct 
 	ata_qc_free(qc);
 }
 
+static unsigned int sg_ring_len(struct sg_ring *sg_start)
+{
+	struct sg_ring *sg;
+	unsigned int i, len = 0;
+
+	sg_ring_for_each(sg_start, sg, i)
+		len += sg->sg[i].length;
+
+	return len;
+}
+
 /**
  *	ata_scsi_translate - Translate then issue SCSI command to ATA device
  *	@dev: ATA device to which the command is addressed
@@ -1529,7 +1539,16 @@ static int ata_scsi_translate(struct ata
 			goto err_did;
 		}
 
-		ata_sg_init(qc, scsi_sglist(cmd), scsi_sg_count(cmd));
+		ata_sg_init(qc, scsi_sgring(cmd));
+		if (scsi_bufflen(cmd) != sg_ring_len(scsi_sgring(cmd))) {
+			printk("scsi_bufflen = %u, sg_ring num %u/%u len = %u, qc->nbytes = %u\n",
+			       scsi_bufflen(cmd),
+			       scsi_sgring(cmd)->num, scsi_sgring(cmd)->max,
+			       sg_ring_len(scsi_sgring(cmd)),
+			       qc->nbytes);
+		}
+		BUG_ON(scsi_bufflen(cmd) != sg_ring_len(scsi_sgring(cmd)));
+		BUG_ON(scsi_bufflen(cmd) != qc->nbytes);
 
 		qc->dma_dir = cmd->sc_data_direction;
 	}
@@ -1592,9 +1611,9 @@ static unsigned int ata_scsi_rbuf_get(st
 	u8 *buf;
 	unsigned int buflen;
 
-	struct scatterlist *sg = scsi_sglist(cmd);
-
-	if (sg) {
+	if (scsi_sgring(cmd)) {
+		struct scatterlist *sg = &scsi_sgring(cmd)->sg[0];
+		BUG_ON(scsi_sgring(cmd)->num != 1);
 		buf = kmap_atomic(sg_page(sg), KM_IRQ0) + sg->offset;
 		buflen = sg->length;
 	} else {
@@ -1619,9 +1638,8 @@ static unsigned int ata_scsi_rbuf_get(st
 
 static inline void ata_scsi_rbuf_put(struct scsi_cmnd *cmd, u8 *buf)
 {
-	struct scatterlist *sg = scsi_sglist(cmd);
-	if (sg)
-		kunmap_atomic(buf - sg->offset, KM_IRQ0);
+	if (scsi_sgring(cmd))
+		kunmap_atomic(buf - scsi_sgring(cmd)->sg[0].offset, KM_IRQ0);
 }
 
 /**
diff -r a867138da3e0 drivers/ata/libata.h
--- a/drivers/ata/libata.h	Wed Dec 19 16:48:15 2007 +1100
+++ b/drivers/ata/libata.h	Wed Dec 19 17:27:09 2007 +1100
@@ -73,8 +73,8 @@ extern unsigned ata_exec_internal(struct
 				  unsigned long timeout);
 extern unsigned ata_exec_internal_sg(struct ata_device *dev,
 				     struct ata_taskfile *tf, const u8 *cdb,
-				     int dma_dir, struct scatterlist *sg,
-				     unsigned int n_elem, unsigned long timeout);
+				     int dma_dir, struct sg_ring *sg,
+				     unsigned long timeout);
 extern unsigned int ata_do_simple_cmd(struct ata_device *dev, u8 cmd);
 extern int ata_dev_read_id(struct ata_device *dev, unsigned int *p_class,
 			   unsigned int flags, u16 *id);
diff -r a867138da3e0 drivers/ata/pdc_adma.c
--- a/drivers/ata/pdc_adma.c	Wed Dec 19 16:48:15 2007 +1100
+++ b/drivers/ata/pdc_adma.c	Wed Dec 19 17:27:09 2007 +1100
@@ -315,22 +315,22 @@ static void adma_error_handler(struct at
 
 static int adma_fill_sg(struct ata_queued_cmd *qc)
 {
-	struct scatterlist *sg;
+	struct sg_ring *sg;
 	struct ata_port *ap = qc->ap;
 	struct adma_port_priv *pp = ap->private_data;
 	u8  *buf = pp->pkt, *last_buf = NULL;
-	int i = (2 + buf[3]) * 8;
+	int idx, i = (2 + buf[3]) * 8;
 	u8 pFLAGS = pORD | ((qc->tf.flags & ATA_TFLAG_WRITE) ? pDIRO : 0);
 
-	ata_for_each_sg(sg, qc) {
+	sg_ring_for_each(qc->sg, sg, idx) {
 		u32 addr;
 		u32 len;
 
-		addr = (u32)sg_dma_address(sg);
+		addr = (u32)sg_dma_address(&sg->sg[idx]);
 		*(__le32 *)(buf + i) = cpu_to_le32(addr);
 		i += 4;
 
-		len = sg_dma_len(sg) >> 3;
+		len = sg_dma_len(&sg->sg[idx]) >> 3;
 		*(__le32 *)(buf + i) = cpu_to_le32(len);
 		i += 4;
 
diff -r a867138da3e0 drivers/ata/sata_mv.c
--- a/drivers/ata/sata_mv.c	Wed Dec 19 16:48:15 2007 +1100
+++ b/drivers/ata/sata_mv.c	Wed Dec 19 17:27:09 2007 +1100
@@ -1134,13 +1134,14 @@ static void mv_fill_sg(struct ata_queued
 static void mv_fill_sg(struct ata_queued_cmd *qc)
 {
 	struct mv_port_priv *pp = qc->ap->private_data;
-	struct scatterlist *sg;
+	struct sg_ring *sg;
+	unsigned int i;
 	struct mv_sg *mv_sg, *last_sg = NULL;
 
 	mv_sg = pp->sg_tbl;
-	ata_for_each_sg(sg, qc) {
-		dma_addr_t addr = sg_dma_address(sg);
-		u32 sg_len = sg_dma_len(sg);
+	sg_ring_for_each(qc->sg, sg, i) {
+		dma_addr_t addr = sg_dma_address(&sg->sg[i]);
+		u32 sg_len = sg_dma_len(&sg->sg[i]);
 
 		while (sg_len) {
 			u32 offset = addr & 0xffff;
diff -r a867138da3e0 drivers/ata/sata_nv.c
--- a/drivers/ata/sata_nv.c	Wed Dec 19 16:48:15 2007 +1100
+++ b/drivers/ata/sata_nv.c	Wed Dec 19 17:27:09 2007 +1100
@@ -1315,20 +1315,20 @@ static int nv_adma_host_init(struct ata_
 }
 
 static void nv_adma_fill_aprd(struct ata_queued_cmd *qc,
-			      struct scatterlist *sg,
-			      int idx,
+			      struct sg_ring *sg,
+			      int i, int idx,
 			      struct nv_adma_prd *aprd)
 {
 	u8 flags = 0;
 	if (qc->tf.flags & ATA_TFLAG_WRITE)
 		flags |= NV_APRD_WRITE;
-	if (idx == qc->n_elem - 1)
+	if (!sg_ring_next(sg, qc->sg) && i == sg->num - 1)
 		flags |= NV_APRD_END;
 	else if (idx != 4)
 		flags |= NV_APRD_CONT;
 
-	aprd->addr  = cpu_to_le64(((u64)sg_dma_address(sg)));
-	aprd->len   = cpu_to_le32(((u32)sg_dma_len(sg))); /* len in bytes */
+	aprd->addr  = cpu_to_le64(((u64)sg_dma_address(&sg->sg[i])));
+	aprd->len   = cpu_to_le32(((u32)sg_dma_len(&sg->sg[i]))); /* len in bytes */
 	aprd->flags = flags;
 	aprd->packet_len = 0;
 }
@@ -1336,18 +1336,18 @@ static void nv_adma_fill_sg(struct ata_q
 static void nv_adma_fill_sg(struct ata_queued_cmd *qc, struct nv_adma_cpb *cpb)
 {
 	struct nv_adma_port_priv *pp = qc->ap->private_data;
-	unsigned int idx;
+	unsigned int i, idx;
 	struct nv_adma_prd *aprd;
-	struct scatterlist *sg;
+	struct sg_ring *sg;
 
 	VPRINTK("ENTER\n");
 
 	idx = 0;
 
-	ata_for_each_sg(sg, qc) {
+	sg_ring_for_each(qc->sg, sg, i) {
 		aprd = (idx < 5) ? &cpb->aprd[idx] :
 			       &pp->aprd[NV_ADMA_SGTBL_LEN * qc->tag + (idx-5)];
-		nv_adma_fill_aprd(qc, sg, idx, aprd);
+		nv_adma_fill_aprd(qc, sg, i, idx, aprd);
 		idx++;
 	}
 	if (idx > 5)
@@ -1994,23 +1994,22 @@ static void nv_swncq_fill_sg(struct ata_
 static void nv_swncq_fill_sg(struct ata_queued_cmd *qc)
 {
 	struct ata_port *ap = qc->ap;
-	struct scatterlist *sg;
-	unsigned int idx;
+	struct sg_ring *sg;
+	unsigned int i, idx;
 	struct nv_swncq_port_priv *pp = ap->private_data;
 	struct ata_prd *prd;
 
-	WARN_ON(qc->__sg == NULL);
-	WARN_ON(qc->n_elem == 0 && qc->pad_len == 0);
+	WARN_ON(qc->sg == NULL);
 
 	prd = pp->prd + ATA_MAX_PRD * qc->tag;
 
 	idx = 0;
-	ata_for_each_sg(sg, qc) {
+	sg_ring_for_each(qc->sg, sg, i) {
 		u32 addr, offset;
 		u32 sg_len, len;
 
-		addr = (u32)sg_dma_address(sg);
-		sg_len = sg_dma_len(sg);
+		addr = (u32)sg_dma_address(&sg->sg[i]);
+		sg_len = sg_dma_len(&sg->sg[i]);
 
 		while (sg_len) {
 			offset = addr & 0xffff;
diff -r a867138da3e0 drivers/ata/sata_promise.c
--- a/drivers/ata/sata_promise.c	Wed Dec 19 16:48:15 2007 +1100
+++ b/drivers/ata/sata_promise.c	Wed Dec 19 17:27:09 2007 +1100
@@ -540,18 +540,17 @@ static void pdc_fill_sg(struct ata_queue
 static void pdc_fill_sg(struct ata_queued_cmd *qc)
 {
 	struct ata_port *ap = qc->ap;
-	struct scatterlist *sg;
-	unsigned int idx;
+	struct sg_ring *sg;
+	unsigned int i, idx;
 	const u32 SG_COUNT_ASIC_BUG = 41*4;
 
 	if (!(qc->flags & ATA_QCFLAG_DMAMAP))
 		return;
 
-	WARN_ON(qc->__sg == NULL);
-	WARN_ON(qc->n_elem == 0 && qc->pad_len == 0);
+	WARN_ON(qc->sg == NULL);
 
 	idx = 0;
-	ata_for_each_sg(sg, qc) {
+	sg_ring_for_each(qc->sg, sg, i) {
 		u32 addr, offset;
 		u32 sg_len, len;
 
@@ -559,8 +558,8 @@ static void pdc_fill_sg(struct ata_queue
 		 * Note h/w doesn't support 64-bit, so we unconditionally
 		 * truncate dma_addr_t to u32.
 		 */
-		addr = (u32) sg_dma_address(sg);
-		sg_len = sg_dma_len(sg);
+		addr = (u32) sg_dma_address(&sg->sg[i]);
+		sg_len = sg_dma_len(&sg->sg[i]);
 
 		while (sg_len) {
 			offset = addr & 0xffff;
diff -r a867138da3e0 drivers/ata/sata_qstor.c
--- a/drivers/ata/sata_qstor.c	Wed Dec 19 16:48:15 2007 +1100
+++ b/drivers/ata/sata_qstor.c	Wed Dec 19 17:27:09 2007 +1100
@@ -284,25 +284,24 @@ static int qs_scr_write(struct ata_port 
 
 static unsigned int qs_fill_sg(struct ata_queued_cmd *qc)
 {
-	struct scatterlist *sg;
+	struct sg_ring *sg;
 	struct ata_port *ap = qc->ap;
 	struct qs_port_priv *pp = ap->private_data;
-	unsigned int nelem;
+	unsigned int i, nelem;
 	u8 *prd = pp->pkt + QS_CPB_BYTES;
 
-	WARN_ON(qc->__sg == NULL);
-	WARN_ON(qc->n_elem == 0 && qc->pad_len == 0);
+	WARN_ON(qc->sg == NULL);
 
 	nelem = 0;
-	ata_for_each_sg(sg, qc) {
+	sg_ring_for_each(qc->sg, sg, i) {
 		u64 addr;
 		u32 len;
 
-		addr = sg_dma_address(sg);
+		addr = sg_dma_address(&sg->sg[i]);
 		*(__le64 *)prd = cpu_to_le64(addr);
 		prd += sizeof(u64);
 
-		len = sg_dma_len(sg);
+		len = sg_dma_len(&sg->sg[i]);
 		*(__le32 *)prd = cpu_to_le32(len);
 		prd += sizeof(u64);
 
diff -r a867138da3e0 drivers/ata/sata_sil24.c
--- a/drivers/ata/sata_sil24.c	Wed Dec 19 16:48:15 2007 +1100
+++ b/drivers/ata/sata_sil24.c	Wed Dec 19 17:27:09 2007 +1100
@@ -811,12 +811,13 @@ static inline void sil24_fill_sg(struct 
 static inline void sil24_fill_sg(struct ata_queued_cmd *qc,
 				 struct sil24_sge *sge)
 {
-	struct scatterlist *sg;
+	struct sg_ring *sg;
+	unsigned int i;
 	struct sil24_sge *last_sge = NULL;
 
-	ata_for_each_sg(sg, qc) {
-		sge->addr = cpu_to_le64(sg_dma_address(sg));
-		sge->cnt = cpu_to_le32(sg_dma_len(sg));
+	sg_ring_for_each(qc->sg, sg, i) {
+		sge->addr = cpu_to_le64(sg_dma_address(&sg->sg[i]));
+		sge->cnt = cpu_to_le32(sg_dma_len(&sg->sg[i]));
 		sge->flags = 0;
 
 		last_sge = sge;
diff -r a867138da3e0 drivers/ata/sata_sx4.c
--- a/drivers/ata/sata_sx4.c	Wed Dec 19 16:48:15 2007 +1100
+++ b/drivers/ata/sata_sx4.c	Wed Dec 19 17:27:09 2007 +1100
@@ -467,7 +467,7 @@ static inline void pdc20621_host_pkt(str
 
 static void pdc20621_dma_prep(struct ata_queued_cmd *qc)
 {
-	struct scatterlist *sg;
+	struct sg_ring *sg;
 	struct ata_port *ap = qc->ap;
 	struct pdc_port_priv *pp = ap->private_data;
 	void __iomem *mmio = ap->host->iomap[PDC_MMIO_BAR];
@@ -487,10 +487,10 @@ static void pdc20621_dma_prep(struct ata
 	 * Build S/G table
 	 */
 	idx = 0;
-	ata_for_each_sg(sg, qc) {
-		buf[idx++] = cpu_to_le32(sg_dma_address(sg));
-		buf[idx++] = cpu_to_le32(sg_dma_len(sg));
-		total_len += sg_dma_len(sg);
+	sg_ring_for_each(qc->sg, sg, i) {
+		buf[idx++] = cpu_to_le32(sg_dma_address(&sg->sg[i]));
+		buf[idx++] = cpu_to_le32(sg_dma_len(&sg->sg[i]));
+		total_len += sg_dma_len(&sg->sg[i]);
 	}
 	buf[idx - 1] |= cpu_to_le32(ATA_PRD_EOT);
 	sgt_len = idx * 4;
diff -r a867138da3e0 include/linux/libata.h
--- a/include/linux/libata.h	Wed Dec 19 16:48:15 2007 +1100
+++ b/include/linux/libata.h	Wed Dec 19 17:27:09 2007 +1100
@@ -29,7 +29,7 @@
 #include <linux/delay.h>
 #include <linux/interrupt.h>
 #include <linux/dma-mapping.h>
-#include <linux/scatterlist.h>
+#include <linux/sg_ring.h>
 #include <linux/io.h>
 #include <linux/ata.h>
 #include <linux/workqueue.h>
@@ -442,9 +442,7 @@ struct ata_queued_cmd {
 
 	unsigned long		flags;		/* ATA_QCFLAG_xxx */
 	unsigned int		tag;
-	unsigned int		n_elem;
 	unsigned int		n_iter;
-	unsigned int		orig_n_elem;
 
 	int			dma_dir;
 
@@ -454,14 +452,14 @@ struct ata_queued_cmd {
 	unsigned int		nbytes;
 	unsigned int		curbytes;
 
-	struct scatterlist	*cursg;
+	struct sg_ring		*cur_sg;
+	unsigned int		cursg_i;
 	unsigned int		cursg_ofs;
 
-	struct scatterlist	sgent;
-	struct scatterlist	pad_sgent;
+	DECLARE_SG_RING(sgent, 1);
+	DECLARE_SG_RING(pad_sgent, 1);
 
-	/* DO NOT iterate over __sg manually, use ata_for_each_sg() */
-	struct scatterlist	*__sg;
+	struct sg_ring		*sg;
 
 	unsigned int		err_mask;
 	struct ata_taskfile	result_tf;
@@ -861,9 +859,8 @@ extern void ata_noop_qc_prep(struct ata_
 extern void ata_noop_qc_prep(struct ata_queued_cmd *qc);
 extern unsigned int ata_qc_issue_prot(struct ata_queued_cmd *qc);
 extern void ata_sg_init_one(struct ata_queued_cmd *qc, void *buf,
-		unsigned int buflen);
-extern void ata_sg_init(struct ata_queued_cmd *qc, struct scatterlist *sg,
-		 unsigned int n_elem);
+			    unsigned int buflen);
+extern void ata_sg_init(struct ata_queued_cmd *qc, struct sg_ring *sg);
 extern unsigned int ata_dev_classify(const struct ata_taskfile *tf);
 extern void ata_dev_disable(struct ata_device *adev);
 extern void ata_id_string(const u16 *id, unsigned char *s,
@@ -1059,35 +1056,6 @@ extern void ata_port_pbar_desc(struct at
 			       const char *name);
 #endif
 
-/*
- * qc helpers
- */
-static inline struct scatterlist *
-ata_qc_first_sg(struct ata_queued_cmd *qc)
-{
-	qc->n_iter = 0;
-	if (qc->n_elem)
-		return qc->__sg;
-	if (qc->pad_len)
-		return &qc->pad_sgent;
-	return NULL;
-}
-
-static inline struct scatterlist *
-ata_qc_next_sg(struct scatterlist *sg, struct ata_queued_cmd *qc)
-{
-	if (sg == &qc->pad_sgent)
-		return NULL;
-	if (++qc->n_iter < qc->n_elem)
-		return sg_next(sg);
-	if (qc->pad_len)
-		return &qc->pad_sgent;
-	return NULL;
-}
-
-#define ata_for_each_sg(sg, qc) \
-	for (sg = ata_qc_first_sg(qc); sg; sg = ata_qc_next_sg(sg, qc))
-
 static inline unsigned int ata_tag_valid(unsigned int tag)
 {
 	return (tag < ATA_MAX_QUEUE) ? 1 : 0;
@@ -1322,12 +1290,12 @@ static inline void ata_qc_reinit(struct 
 static inline void ata_qc_reinit(struct ata_queued_cmd *qc)
 {
 	qc->dma_dir = DMA_NONE;
-	qc->__sg = NULL;
+	qc->sg = NULL;
 	qc->flags = 0;
-	qc->cursg = NULL;
+	qc->cur_sg = NULL;
+	qc->cursg_i = 0;
 	qc->cursg_ofs = 0;
 	qc->nbytes = qc->curbytes = 0;
-	qc->n_elem = 0;
 	qc->n_iter = 0;
 	qc->err_mask = 0;
 	qc->pad_len = 0;

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 7/7] sg_ring: convert core ATA code to sg_ring.
  2007-12-19  7:38             ` [PATCH 7/7] sg_ring: convert core ATA code to sg_ring Rusty Russell
@ 2007-12-26  8:36               ` Tejun Heo
  2007-12-26 17:12                 ` James Bottomley
  2007-12-27  0:24                 ` Rusty Russell
  0 siblings, 2 replies; 25+ messages in thread
From: Tejun Heo @ 2007-12-26  8:36 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Jens Axboe, linux-kernel, linux-scsi, linux-ide

Hello, Rusty Russell.

Rusty Russell wrote:
> ATA relies so heavily on scsi that it needs to be converted at the
> same time.
> 
> ATA adds padding to scatterlists in scsi commands, but because there was
> no good way of appending to those scatterlists, it had to use boutique
> iterators to make sure the padding is included.  With sg_ring, ATA can
> simply append an sg_ring entry with the padding, and normal iterators
> can be used.
> 
> I renamed qc->cursg to qc->cur_sg to catch all the users: they should
> now be referring to 'qc->cur_sg[qc->cursg_i]' wherever they were using
> 'qc->cursg'.
> 
> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

There's a pending patchset to make libata use chained sg.  Please search
for "[PATCHSET] libata: improve ATAPI data transfer handling, take #3"
on your favorite mailing list archive (I'm writing this message offline so
I can't give you the URL at the moment).  The 11th patch in the patchset
converts libata to use chained sg, and the following ones extend it
to also chain drain sg entries.

I agree that the current chained sg isn't easy to use and it would be
nice to have some abstraction on top of it, but sg_ring seems to
implement a duplicate of a feature that is already available through sg
chaining, albeit cumbersome to use.

It would be better to build upon sg chaining as we already have it.  I
think it can be made much easier with a few more safeguards,
generalization and some helpers.

Thanks.

(PS, I haven't followed the sg chaining discussion.  Why is sg chaining
an optional feature?  Performance overhead on low end machines?)

-- 
tejun


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 7/7] sg_ring: convert core ATA code to sg_ring.
  2007-12-26  8:36               ` Tejun Heo
@ 2007-12-26 17:12                 ` James Bottomley
  2007-12-27  0:24                 ` Rusty Russell
  1 sibling, 0 replies; 25+ messages in thread
From: James Bottomley @ 2007-12-26 17:12 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Rusty Russell, Jens Axboe, linux-kernel, linux-scsi, linux-ide


On Wed, 2007-12-26 at 17:36 +0900, Tejun Heo wrote:
> (PS, I haven't followed the sg chaining discussion.  Why is sg chaining
> an optional feature?  Performance overhead on low end machines?)

The idea of SG chaining is to allow drivers that wish to take advantage
of it to increase their transfer lengths beyond
MAX_HW_SEGMENTS*PAGE_SIZE by using chaining.  However, drivers that stay
below MAX_HW_SEGMENTS for the scatterlist length don't need to be
altered.
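
Roughly, a driver that does want longer lists chains fixed-size arrays
like this (a sketch only, assuming the 2.6.24-era <linux/scatterlist.h>
helpers and an arch with ARCH_HAS_SG_CHAIN; demo_chain() and N are made
up, and the caller is assumed to have sg_init_table()'d both arrays and
put data only in first[0..N-2]):

#include <linux/scatterlist.h>

#define N 8	/* arbitrary array size for the sketch */

static unsigned int demo_chain(struct scatterlist first[N],
			       struct scatterlist second[N],
			       unsigned int nents_in_second)
{
	struct scatterlist *sg;
	unsigned int i, total = 0;

	/* first[N - 1] stops carrying data and becomes the link entry. */
	sg_chain(first, N, second);
	sg_mark_end(&second[nents_in_second - 1]);

	/* One loop now covers (N - 1) + nents_in_second data entries. */
	for_each_sg(first, sg, (N - 1) + nents_in_second, i)
		total += sg->length;
	return total;
}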

The ultimate goal (well, perhaps more wish) is to have all drivers
converted, so SCSI can use something small for the default scatterlist
sizing and dump all the sglist mempool stuff (although this may never be
reached).

James



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 7/7] sg_ring: convert core ATA code to sg_ring.
  2007-12-26  8:36               ` Tejun Heo
  2007-12-26 17:12                 ` James Bottomley
@ 2007-12-27  0:24                 ` Rusty Russell
  2007-12-27  4:21                   ` Tejun Heo
  1 sibling, 1 reply; 25+ messages in thread
From: Rusty Russell @ 2007-12-27  0:24 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jens Axboe, linux-kernel, linux-scsi, linux-ide

On Wednesday 26 December 2007 19:36:36 Tejun Heo wrote:
> It would be better to build upon sg chaining as we already have it.  I
> think it can be made much easier with a few more safeguards,
> generalization and some helpers.

Hi Tejun,

    I did this work to replace sg chaining.  The current chaining code gives 
us longer sgs, and (if used carefully) the ability to append in some 
circumstances.  But it's clumsy, and we can do better.  This is my attempt.

    I understand that people are experiencing whiplash from the idea of 
another change just after the sg chaining changes which have already gone in.  

    At the very least, I think I'll have to come up with a better transition 
plan for SCSI drivers, rather than trying to convert them all at once... In 
theory, sg chaining can be done through the sg ring arrays, but that's pretty 
horrible.

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 7/7] sg_ring: convert core ATA code to sg_ring.
  2007-12-27  0:24                 ` Rusty Russell
@ 2007-12-27  4:21                   ` Tejun Heo
  0 siblings, 0 replies; 25+ messages in thread
From: Tejun Heo @ 2007-12-27  4:21 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Jens Axboe, linux-kernel, linux-scsi, linux-ide

Hello, Rusty.

Rusty Russell wrote:
> On Wednesday 26 December 2007 19:36:36 Tejun Heo wrote:
>> It would be better to build upon sg chaining as we already have it.  I
>> think it can be made much easier with a few more safeguards,
>> generalization and some helpers.
> 
>     I did this work to replace sg chaining.  The current chaining code gives 
> us longer sgs, and (if used carefully) the ability to append in some 
> circumstances.  But it's clumsy, and we can do better.  This is my attempt.

Ah.. okay.  That makes sense then.  I fully agree that sg chaining as it
currently stands is pretty ugly to use.  Appending extra entries
requires reserving one extra slot in the sg list being appended, to
accommodate the last entry of the original list, whose old slot is then
reused for chaining.  Also, it's quite brittle in that information can get out
of sync easily (there's no one API for sg list manipulation, things have
to be done manually).
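
Spelled out as code, the trick looks something like this (a sketch, not
an existing helper; append_by_chaining() and its arguments are made up,
and it assumes the 2.6.24-era scatterlist helpers):

#include <linux/scatterlist.h>

/* 'extra' was allocated with one spare slot at index 0; its real
 * entries live in extra[1..nextra].  Note this rewrites orig's last
 * slot, which is exactly the mangling objected to later in the thread. */
static void append_by_chaining(struct scatterlist *orig, unsigned int norig,
			       struct scatterlist *extra, unsigned int nextra)
{
	/* Move orig's last data entry into the reserved slot... */
	sg_set_page(&extra[0], sg_page(&orig[norig - 1]),
		    orig[norig - 1].length, orig[norig - 1].offset);

	/* ...then reuse orig's last slot as the chain link. */
	sg_chain(orig, norig, extra);

	/* extra now carries nextra + 1 data entries. */
	sg_mark_end(&extra[nextra]);
}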

>     I understand that people are experiencing whiplash from the idea of 
> another change just after the sg chaining changes which have already gone in.  
> 
>     At the very least, I think I'll have to come up with a better transition 
> plan for SCSI drivers, rather than trying to convert them all at once... In 
> theory, sg chaining can be done through the sg ring arrays, but that's pretty 
> horrible.

I don't necessarily think large one-time conversions are bad.  They're
necessary at times and make sense when they can increase the long-term
maintainability of the code.  One thing I'm concerned about is the
mixture of sg chaining and sg ring.  It would be best if we can agree on
a conversion plan (or an sg chaining extension plan).

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/7] sg_ring: a ring of scatterlist arrays
  2007-12-19  6:31 [PATCH 0/7] sg_ring: a ring of scatterlist arrays Rusty Russell
  2007-12-19  6:33 ` [PATCH 1/7] sg_ring: introduce " Rusty Russell
@ 2008-01-05 15:31 ` James Bottomley
  2008-01-07  4:38   ` Rusty Russell
  1 sibling, 1 reply; 25+ messages in thread
From: James Bottomley @ 2008-01-05 15:31 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Jens Axboe, linux-kernel, linux-scsi, linux-ide


On Wed, 2007-12-19 at 17:31 +1100, Rusty Russell wrote:
> This patch series is the start of my attempt to simplify and make explicit
> the chained scatterlist logic.
> 
> It's not complete: my SATA box boots and seems happy, but all the other
> users of SCSI need to be updated and checked.  But I've gotten far enough
> to believe it's worth pursuing.

Sorry for the delay in looking at this, I was busy with Holidays and
things.

When I compare sg_ring with the current sg_chain (and later sg_table)
implementations, I'm actually struck by how similar they are.

The other thing I note is that the problem you're claiming to solve with
sg_ring (the ability to add extra scatterlists to the front or the back
of an existing one) is already solved with sg_chain, so the only real
advantage of sg_ring was that it contains explicit counts, which
sg_table (in -mm) also introduces.

The other differences are that sg_ring only allows adding at the front
or back of an existing sg_ring, it doesn't allow splicing at any point
like sg_chain does, so I'd say it's less functional (not that I actually
want anyone ever to do this, of course ...)

The final point is that sg_ring requires a two level traversal:  ring
list then scatterlist, whereas sg_chain only requires a single level
traversal.  I grant that we can abstract out the traversal into
something that would make users think they're only doing a single level,
but I don't see what the extra level really buys us.
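
To make the comparison concrete, the two walks look roughly like this
(a sketch; handle_seg() is a made-up stand-in for the per-segment work,
and sg_ring_for_each is the iterator from patch 1, which needs no
separate nents because the ring carries its own counts):

#include <linux/scatterlist.h>
#include <linux/sg_ring.h>	/* from patch 1 */

/* One level with sg chaining: the caller still passes nents. */
static void walk_chain(struct scatterlist *sglist, unsigned int nents,
		       void (*handle_seg)(dma_addr_t, unsigned int))
{
	struct scatterlist *sg;
	unsigned int i;

	for_each_sg(sglist, sg, nents, i)
		handle_seg(sg_dma_address(sg), sg_dma_len(sg));
}

/* Ring entry plus index with sg_ring, as in the libata hunks above. */
static void walk_ring(struct sg_ring *ring,
		      void (*handle_seg)(dma_addr_t, unsigned int))
{
	struct sg_ring *r;
	unsigned int i;

	sg_ring_for_each(ring, r, i)
		handle_seg(sg_dma_address(&r->sg[i]), sg_dma_len(&r->sg[i]));
}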

The above analysis seems to suggest that sg_chain is simpler and has
more functionality than sg_ring, unless I've missed anything?

The only thing missing from sg_chain perhaps is an accessor function
that does the splicing, which I can easily construct if you want to try
it out in virtio.

James



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/7] sg_ring: a ring of scatterlist arrays
  2008-01-05 15:31 ` [PATCH 0/7] sg_ring: a ring of scatterlist arrays James Bottomley
@ 2008-01-07  4:38   ` Rusty Russell
  2008-01-07  5:01     ` Tejun Heo
  2008-01-07 15:48     ` James Bottomley
  0 siblings, 2 replies; 25+ messages in thread
From: Rusty Russell @ 2008-01-07  4:38 UTC (permalink / raw)
  To: James Bottomley; +Cc: Jens Axboe, linux-kernel, linux-scsi, linux-ide

On Sunday 06 January 2008 02:31:12 James Bottomley wrote:
> On Wed, 2007-12-19 at 17:31 +1100, Rusty Russell wrote:
> > This patch series is the start of my attempt to simplify and make
> > explicit the chained scatterlist logic.
> >
> > It's not complete: my SATA box boots and seems happy, but all the other
> > users of SCSI need to be updated and checked.  But I've gotten far enough
> > to believe it's worth pursuing.
>
> Sorry for the delay in looking at this, I was busy with Holidays and
> things.

Thank you for your consideration.

> When I compare sg_ring with the current sg_chain (and later sg_table)
> implementations, I'm actually struck by how similar they are.

I agree, they're solving the same problem.  It is possible that the pain of 
change is no longer worthwhile, but I hate to see us give up on this.  We're 
adding complexity without making it harder to misuse.

> The other thing I note is that the problem you're claiming to solve with
> sg_ring (the ability to add extra scatterlists to the front or the back
> of an existing one) is already solved with sg_chain, so the only real
> advantage of sg_ring was that it contains explicit counts, which
> sg_table (in -mm) also introduces.

I just converted virtio using latest Linus for fair comparison, and it's still 
pretty ugly.  sg_ring needs more work to de-scsi it.  sg_table is almost 
sg_ring except it retains all the chaining warts.

But we hit the same problems:

1) sg_chain loses information.  The clever chain packaging makes reading easy, 
but manipulation is severely limited.  You can append to your own chains by 
padding, but not someone else's.  This works for SCSI, but what about the rest
of us?  And don't even think of joining mapped chains: it will almost work.

2) sg_chain's end marker is only for reading non-dma elements, not for mapped 
chains, nor for writing into chains.  Plus it's a duplicate of the num arg, 
which is still passed around for efficiency.  (virtio had to implement 
count_sg() because I didn't want those two redundant num args).

3) By overloading struct scatterlist, it's invisible what code has been 
converted to chains, and which hasn't.  Too clever by half!

Look at sg_chain(): it claims to join two scatterlists, but it doesn't.  It 
assumes that prv is an array, not a chain.  Because you've overloaded an 
existing type, this happens everywhere.  Try passing skb_to_sgvec a chained 
skb.

sg_ring would not have required any change to existing drivers, just those 
that want to use long sg lists.  And then it's damn obvious which are which.

4) sg_chain missed a chance to combine all the relevant information (max, num,
num_dma and the sg array). sg_table comes close, but you still can't join 
them (no max information, and even if there were, what if it's full?).  
Unlike sg_ring, it's unlikely to reduce bugs.

5) (A little moot now) sg_ring didn't require arch changes.

> The other differences are that sg_ring only allows adding at the front
> or back of an existing sg_ring, it doesn't allow splicing at any point
> like sg_chain does, so I'd say it's less functional (not that I actually
> want anyone ever to do this, of course ...)

Well it's just as possible, but you might have to copy more elements (with sg 
chaining you need only copy 1 sg element to make room for the chain elem).  
Agreed that it's a little out there...

> The final point is that sg_ring requires a two level traversal:  ring
> list then scatterlist, whereas sg_chain only requires a single level
> traversal.  I grant that we can abstract out the traversal into
> something that would make users think they're only doing a single level,
> but I don't see what the extra level really buys us.

We hide the real complexity from users and it makes it less safe for 
non-trivial cases.

Hence the introduction of YA datastructure: sg_table.  This is getting close: 
just hang the array off it and you'll have sg_ring and no requirement for 
dynamic alloc all the time.  And once you have a header, no need for chaining 
tricks...
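
Roughly, the suggestion amounts to something like this (a hypothetical
declaration, for illustration only; nothing below exists in -mm and the
field names are guesses):

#include <linux/list.h>
#include <linux/scatterlist.h>

/* sg_table with the array hung off it and a capacity field: at that
 * point it is essentially sg_ring by another name. */
struct sg_table_plus {
	struct list_head list;		/* so tables can be ringed together */
	unsigned int nents;		/* entries in use */
	unsigned int orig_nents;	/* count before dma_map_sg() merging */
	unsigned int max;		/* capacity of sgl[] */
	struct scatterlist sgl[0];	/* the array hangs off the header */
};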

> The only thing missing from sg_chain perhaps is an accessor function
> that does the splicing, which I can easily construct if you want to try
> it out in virtio.

I don't need that (I prepend and append), but it'd be nice if sg_next took a 
const struct scatterlist *.

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/7] sg_ring: a ring of scatterlist arrays
  2008-01-07  4:38   ` Rusty Russell
@ 2008-01-07  5:01     ` Tejun Heo
  2008-01-07  5:28       ` Rusty Russell
  2008-01-07 15:48     ` James Bottomley
  1 sibling, 1 reply; 25+ messages in thread
From: Tejun Heo @ 2008-01-07  5:01 UTC (permalink / raw)
  To: Rusty Russell
  Cc: James Bottomley, Jens Axboe, linux-kernel, linux-scsi, linux-ide

Hello,

Rusty Russell wrote:
>> The other thing I note is that the problem you're claiming to solve with
>> sg_ring (the ability to add extra scatterlists to the front or the back
>> of an existing one) is already solved with sg_chain, so the only real
>> advantage of sg_ring was that it contains explicit counts, which
>> sg_table (in -mm) also introduces.
> 
> I just converted virtio using latest Linus for fair comparison, and it's still 
> pretty ugly.  sg_ring needs more work to de-scsi it.  sg_table is almost 
> sg_ring except it retains all the chaining warts.

I agree it's ugly.

> But we hit the same problems:
> 
> 1) sg_chain loses information.  The clever chain packaging makes reading easy, 
> but manipulation is severely limited.  You can append to your own chains by 
> padding, but not someone else's.  This works for SCSI, but what about the rest
> of us?  And don't even think of joining mapped chains: it will almost work.

You can append by allocating one more element on the chain to be
appended and moving the last element of the first chain to it while
using the last element for chaining.  A join can be done by a similar
technique, using a three-element extra chain in the middle.  But, yeah,
it's ugly as hell.

-- 
tejun

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/7] sg_ring: a ring of scatterlist arrays
  2008-01-07  5:01     ` Tejun Heo
@ 2008-01-07  5:28       ` Rusty Russell
  2008-01-07  6:37         ` Tejun Heo
  0 siblings, 1 reply; 25+ messages in thread
From: Rusty Russell @ 2008-01-07  5:28 UTC (permalink / raw)
  To: Tejun Heo
  Cc: James Bottomley, Jens Axboe, linux-kernel, linux-scsi, linux-ide

On Monday 07 January 2008 16:01:40 Tejun Heo wrote:
> > But we hit the same problems:
> >
> > 1) sg_chain loses information.  The clever chain packaging makes reading
> > easy, but manipulation is severely limited.  You can append to your own
> > chains by padding, but not someone else's.  This works for SCSI, but what
> > about the rest of us?  And don't even think of joining mapped chains: it
> > will almost work.
>
> You can append by allocating one more element on the chain to be
> appended and moving the last element of the first chain to it while
> using the last element for chaining.

Hi Tejun,

   Nice try!  Even ignoring the ugliness of undoing such an operation if the 
caller doesn't expect you to mangle their chains, consider a one-element sg 
array. :(

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/7] sg_ring: a ring of scatterlist arrays
  2008-01-07  5:28       ` Rusty Russell
@ 2008-01-07  6:37         ` Tejun Heo
  2008-01-07  8:34           ` Rusty Russell
  0 siblings, 1 reply; 25+ messages in thread
From: Tejun Heo @ 2008-01-07  6:37 UTC (permalink / raw)
  To: Rusty Russell
  Cc: James Bottomley, Jens Axboe, linux-kernel, linux-scsi, linux-ide

Rusty Russell wrote:
> On Monday 07 January 2008 16:01:40 Tejun Heo wrote:
>>> But we hit the same problems:
>>>
>>> 1) sg_chain loses information.  The clever chain packaging makes reading
>>> easy, but manipulation is severely limited.  You can append to your own
>>> chains by padding, but not someone else's.  This works for SCSI, but what
>>> about the rest of us?  And don't even think of joining mapped chains: it
>>> will almost work.
>> You can append by allocating one more element on the chain to be
>> appended and moving the last element of the first chain to it while
>> using the last element for chaining.
> 
> Hi Tejun,
> 
>    Nice try!  Even ignoring the ugliness of undoing such an operation if the 
> caller doesn't expect you to mangle their chains, consider a one-element sg 
> array. :(

Heh heh, that can be dealt with by skipping the first chain if the first
chain is empty after chaining.  Please take a look at
ata_sg_setup_extra() in the following.

http://git.kernel.org/?p=linux/kernel/git/tj/libata-dev.git;a=blob;f=drivers/ata/libata-core.c;h=32dde5bbc75ed53e89ac17040da2cd0621a37161;hb=c8847e473a4a2844244784226eb362be10d52ce9

That said, yeah, it's seriously ugly.  Restoring the original sg is ugly
too.  I definitely agree that we need some improvements here.

-- 
tejun

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/7] sg_ring: a ring of scatterlist arrays
  2008-01-07  6:37         ` Tejun Heo
@ 2008-01-07  8:34           ` Rusty Russell
  2008-01-07  8:45             ` Tejun Heo
  0 siblings, 1 reply; 25+ messages in thread
From: Rusty Russell @ 2008-01-07  8:34 UTC (permalink / raw)
  To: Tejun Heo
  Cc: James Bottomley, Jens Axboe, linux-kernel, linux-scsi, linux-ide

On Monday 07 January 2008 17:37:41 Tejun Heo wrote:
> Rusty Russell wrote:
> > Hi Tejun,
> >
> >    Nice try!  Even ignoring the ugliness of undoing such an operation if
> > the caller doesn't expect you to mangle their chains, consider a
> > one-element sg array. :(
>
> Heh heh, that can be dealt with by skipping the first chain if the first
> chain is empty after chaining.  Please take a look at
> ata_sg_setup_extra() in the following.
>
> http://git.kernel.org/?p=linux/kernel/git/tj/libata-dev.git;a=blob;f=driver
>s/ata/libata-core.c;h=32dde5bbc75ed53e89ac17040da2cd0621a37161;hb=c8847e473a
>4a2844244784226eb362be10d52ce9
>
> That said, yeah, it's seriously ugly.  Restoring the original sg is ugly
> too.  I definitely agree that we need some improvements here.

Erk, that's beyond ugly, into actual evil.

To make this general you need to find the last N 1-element chains (but SCSI 
doesn't do this of course).  Oh the horror...

I'd be remiss if I didn't mention that the sg_ring ata patches were 
straightforward, and indescribably beautiful if compared to this!

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/7] sg_ring: a ring of scatterlist arrays
  2008-01-07  8:34           ` Rusty Russell
@ 2008-01-07  8:45             ` Tejun Heo
  2008-01-07 12:17               ` Herbert Xu
  0 siblings, 1 reply; 25+ messages in thread
From: Tejun Heo @ 2008-01-07  8:45 UTC (permalink / raw)
  To: Rusty Russell
  Cc: James Bottomley, Jens Axboe, linux-kernel, linux-scsi, linux-ide

Rusty Russell wrote:
> On Monday 07 January 2008 17:37:41 Tejun Heo wrote:
>> Rusty Russell wrote:
>>> Hi Tejun,
>>>
>>>    Nice try!  Even ignoring the ugliness of undoing such an operation if
>>> the caller doesn't expect you to mangle their chains, consider a
>>> one-element sg array. :(
>> Heh heh, that can be dealt with by skipping the first chain if the first
>> chain is empty after chaining.  Please take a look at
>> ata_sg_setup_extra() in the following.
>>
>> http://git.kernel.org/?p=linux/kernel/git/tj/libata-dev.git;a=blob;f=driver
>> s/ata/libata-core.c;h=32dde5bbc75ed53e89ac17040da2cd0621a37161;hb=c8847e473a
>> 4a2844244784226eb362be10d52ce9
>>
>> That said, yeah, it's seriously ugly.  Restoring the original sg is ugly
>> too.  I definitely agree that we need some improvements here.
> 
> Erk, that's beyond ugly, into actual evil.

/me agrees.

> To make this general you need to find the last N 1-element chains (but SCSI 
> doesn't do this of course).  Oh the horror...

/me agrees.

> I'd be remiss if I didn't mention that the sg_ring ata patches were 
> straightforward, and indescribably beautiful if compared to this!

/me agrees.  As long as this can be made sane using one unified
interface, I don't care whether it's sg_ring, table or dish.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/7] sg_ring: a ring of scatterlist arrays
  2008-01-07  8:45             ` Tejun Heo
@ 2008-01-07 12:17               ` Herbert Xu
  0 siblings, 0 replies; 25+ messages in thread
From: Herbert Xu @ 2008-01-07 12:17 UTC (permalink / raw)
  To: Tejun Heo
  Cc: rusty, James.Bottomley, jens.axboe, linux-kernel, linux-scsi, linux-ide

Tejun Heo <htejun@gmail.com> wrote:
>
> /me agrees.  As long as this can be made sane using one unified
> interface, I don't care whether it's sg_ring, table or dish.

Give us sg_annulus :)

On a more serious note, having played with SG chaining for a couple
of years in the crypto layer, I must say Rusty's proposal is a great
improvement for me.  The reason is that if the crypto layer gets an
SG list from its user, it is not at liberty to just modify it.  As
such, chaining at the end is cumbersome with the current interface
because we have to duplicate the whole list in order to add just a
single block at the end.

While with sg_ring we could just put the original list along with the
new one in a new list.
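
As a sketch (using the helpers from patch 1, the same pattern the libata
conversion uses for its pad entry; hash_with_trailer(), final_block and
final_len are made-up names):

#include <linux/list.h>
#include <linux/sg_ring.h>	/* from patch 1 */

/* Append one block behind a caller's ring without copying or rewriting
 * any of the caller's scatterlist entries; only the ring linkage
 * changes, and list_del() undoes it afterwards. */
static void hash_with_trailer(struct sg_ring *caller_ring,
			      void *final_block, unsigned int final_len)
{
	DECLARE_SG_RING(extra, 1);

	sg_ring_single(&extra.ring, final_block, final_len);
	list_add_tail(&extra.ring.list, &caller_ring->list);

	/* ... walk the ring and feed it to the hash here ... */

	list_del(&extra.ring.list);
}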

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/7] sg_ring: a ring of scatterlist arrays
  2008-01-07  4:38   ` Rusty Russell
  2008-01-07  5:01     ` Tejun Heo
@ 2008-01-07 15:48     ` James Bottomley
  2008-01-08  0:39       ` Rusty Russell
  1 sibling, 1 reply; 25+ messages in thread
From: James Bottomley @ 2008-01-07 15:48 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Jens Axboe, linux-kernel, linux-scsi, linux-ide

On Mon, 2008-01-07 at 15:38 +1100, Rusty Russell wrote:
> On Sunday 06 January 2008 02:31:12 James Bottomley wrote:
> > On Wed, 2007-12-19 at 17:31 +1100, Rusty Russell wrote:
> > > This patch series is the start of my attempt to simplify and make
> > > explicit the chained scatterlist logic.
> > >
> > > It's not complete: my SATA box boots and seems happy, but all the other
> > > users of SCSI need to be updated and checked.  But I've gotten far enough
> > > to believe it's worth pursuing.
> >
> > Sorry for the delay in looking at this, I was busy with Holidays and
> > things.
> 
> Thank you for your consideration.
> 
> > When I compare sg_ring with the current sg_chain (and later sg_table)
> > implementations, I'm actually struck by how similar they are.
> 
> I agree, they're solving the same problem.  It is possible that the pain of 
> change is no longer worthwhile, but I hate to see us give up on this.  We're 
> adding complexity without making it harder to misuse.

We're always open to new APIs (or more powerful and expanded old ones).
The way we've been doing the sg_chain conversion is to slide API layers
into the drivers so sg_chain becomes a simple API flip when we turn it
on.  Unfortunately sg_ring doesn't quite fit nicely into this.

> > The other thing I note is that the problem you're claiming to solve with
> > sg_ring (the ability to add extra scatterlists to the front or the back
> > of an existing one) is already solved with sg_chain, so the only real
> > advantage of sg_ring was that it contains explicit counts, which
> > sg_table (in -mm) also introduces.
> 
> I just converted virtio using latest Linus for fair comparison

Erm, but that's not really a fair comparison; you need the sg_table code
in

git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-2.6-block.git 

branch sg as well.

> , and it's still 
> pretty ugly.  sg_ring needs more work to de-scsi it.  sg_table is almost 
> sg_ring except it retains all the chaining warts.
> 
> But we hit the same problems:
> 
> 1) sg_chain loses information.  The clever chain packaging makes reading easy, 
> but manipulation is severely limited.  You can append to your own chains by 
> padding, but not someone else's.  This works for SCSI, but what about the rest
> of us?  And don't even think of joining mapped chains: it will almost work.

OK, but realistically some of your criticisms are way outside of the
design intent.  Scatterlists (and now sg_chains) are the way the block
subsystem hands pages to its underlying block devices.  We do this
through two APIs:  blk_rq_map_sg() to take a request and place all the
pages into a scatterlist respecting the merging parameters and
dma_map_sg() which takes the scatterlist and produces an arch specific
DMA mapped scatterlist reusing the old list.  The requirement is that
dma_unmap_sg return the chain to its former state (so a block device may
map and unmap between congestion waits to avoid running systems out of
scarce IOMMU mappings), but there's not even a parallel
blk_rq_unmap_sg() because the scatterlist is assumed disposable by the
block driver after the request completes.
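
Reduced to a sketch, the flow is roughly (issue_one_request() and its
arguments are made up, and error handling is trimmed):

#include <linux/blkdev.h>
#include <linux/dma-mapping.h>
#include <linux/errno.h>
#include <linux/scatterlist.h>

/* sglist must be big enough for the request's segments; the same nents
 * used for mapping is handed back to dma_unmap_sg(). */
static int issue_one_request(struct request_queue *q, struct request *rq,
			     struct device *dev, struct scatterlist *sglist)
{
	enum dma_data_direction dir = rq_data_dir(rq) == WRITE ?
				      DMA_TO_DEVICE : DMA_FROM_DEVICE;
	int nents, mapped;

	nents = blk_rq_map_sg(q, rq, sglist);		/* pages -> sg entries */
	mapped = dma_map_sg(dev, sglist, nents, dir);	/* may merge entries */
	if (!mapped)
		return -EIO;

	/* ... program the hardware from the 'mapped' entries ... */

	dma_unmap_sg(dev, sglist, nents, dir);	/* list back to former state */
	return 0;
}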

By "someone elses" I'm assuming you mean a chain in a stacked block
driver?  You are actually allowed to transform these ... depending on
who does the dma mapping, of course, if the lower driver (yours) does
DMA mapping, you can do anything you want.  If the upper driver does,
then you have to preserve idempotence.

There have never until now been any requirements to join already
dma_map_sg() converted scatterlists ... that would wreak havoc with the
way we reuse the list plus damage the idempotence of map/unmap.  What is
the use case you have for this?

> 2) sg_chain's end marker is only for reading non-dma elements, not for mapped 
> chains, nor for writing into chains.  Plus it's a duplicate of the num arg, 
> which is still passed around for efficiency.  (virtio had to implement 
> count_sg() because I didn't want those two redundant num args).
> 
> 3) By overloading struct scatterlist, it's invisible what code has been 
> converted to chains, and which hasn't.  Too clever by half!

No it's not ... that's the whole point.  Code not yet converted to use
the API accessors is by definition unable to use chaining.  Code
converted to use the accessors by design doesn't care (and so is
"converted to chains").

> Look at sg_chain(): it claims to join two scatterlists, but it doesn't.  It 
> assumes that prv is an array, not a chain.  Because you've overloaded an 
> existing type, this happens everywhere.  Try passing skb_to_sgvec a chained 
> skb.
> 
> sg_ring would not have required any change to existing drivers, just those 
> that want to use long sg lists.  And then it's damn obvious which are which.

Which, by definition, would have been pretty much all of them.

Another driving factor is that the mempool allocation of scatterlists is
suboptimal in terms of their bucket size, so we were looking to tune
that once the chaining code was in.  The guess is that we'd probably end
up with a uniform size for scatterlist chunks, but if all drivers aren't
converted we can't do this type of tuning.

> 4) sg_chain missed a chance to combine all the relevant information (max, num,
> num_dma and the sg array). sg_table comes close, but you still can't join 
> them (no max information, and even if there were, what if it's full?).  
> Unlike sg_ring, it's unlikely to reduce bugs.

sg_table is sg_chain ... they're incremental extensions of the same body
of work.

> 5) (A little moot now) sg_ring didn't require arch changes.
> 
> > The other differences are that sg_ring only allows adding at the front
> > or back of an existing sg_ring, it doesn't allow splicing at any point
> > like sg_chain does, so I'd say it's less functional (not that I actually
> > want anyone ever to do this, of course ...)
> 
> Well it's just as possible, but you might have to copy more elements (with sg 
> chaining you need only copy 1 sg element to make room for the chain elem).  
> Agreed that it's a little out there...
> 
> > The final point is that sg_ring requires a two level traversal:  ring
> > list then scatterlist, whereas sg_chain only requires a single level
> > traversal.  I grant that we can abstract out the traversal into
> > something that would make users think they're only doing a single level,
> > but I don't see what the extra level really buys us.
> 
> We hide the real complexity from users and it makes it less safe for 
> non-trivial cases.
> 
> Hence the introduction of YA datastructure: sg_table.  This is getting close: 
> just hang the array off it and you'll have sg_ring and no requirement for 
> dynamic alloc all the time.  And once you have a header, no need for chaining 
> tricks...
> 
> > The only thing missing from sg_chain perhaps is an accessor function
> > that does the splicing, which I can easily construct if you want to try
> > it out in virtio.
> 
> I don't need that (I prepend and append), but it'd be nice if sg_next took a 
> const struct scatterlist *.

What all this is saying to me is that we have an undesigned, ad hoc
interface for scatterlist manipulation in passthrough drivers.  Perhaps
we should start by defining that API and its associated requirements.
i.e. what is it that pass through drivers want to be able to do to
scatterlists?  The only such example of a driver like that I know is the
crypto API and now your virtio.  Once we have the actual requirements
and then the API, I think the natural data structures will probably drop
out.

James



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/7] sg_ring: a ring of scatterlist arrays
  2008-01-07 15:48     ` James Bottomley
@ 2008-01-08  0:39       ` Rusty Russell
  2008-01-09 22:10         ` James Bottomley
  0 siblings, 1 reply; 25+ messages in thread
From: Rusty Russell @ 2008-01-08  0:39 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jens Axboe, linux-kernel, linux-scsi, linux-ide, Tejun Heo

On Tuesday 08 January 2008 02:48:23 James Bottomley wrote:
> We're always open to new APIs (or more powerful and expanded old ones).
> The way we've been doing the sg_chain conversion is to slide API layers
> into the drivers so sg_chain becomes a simple API flip when we turn it
> on.  Unfortunately sg_ring doesn't quite fit nicely into this.

Hi James,

   Well, it didn't touch any drivers.  The only ones which needed altering 
were those which wanted to use large scatter-gather lists.  You think of the 
subtlety of sg-chain conversion as a feature; it's a bug :)

> > > The other thing I note is that the problem you're claiming to solve
> > > with sg_ring (the ability to add extra scatterlists to the front or the
> > > back of an existing one) is already solved with sg_chain, so the only
> > > real advantage of sg_ring was that it contains explicit counts, which
> > > sg_table (in -mm) also introduces.
> >
> > I just converted virtio using latest Linus for fair comparison
>
> Erm, but that's not really a fair comparison; you need the sg_table code
> in
>
> git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-2.6-block.git
>
> branch sg as well.

Actually, I think it's a wash.  Now callers need to set up an sg_table as 
well.  It will save the count_sg() calls though.

> > , and it's still
> > pretty ugly.  sg_ring needs more work to de-scsi it.  sg_table is almost
> > sg_ring except it retains all the chaining warts.
> >
> > But we hit the same problems:
> >
> > 1) sg_chain loses information.  The clever chain packaging makes reading
> > easy, but manipulation is severely limited.  You can append to your own
> > chains by padding, but not someone else's.  This works for SCSI, but what
> > about the rest of us?  And don't even think of joining mapped chains: it
> > will almost work.
>
> OK, but realistically some of your criticisms are way outside of the
> design intent.  Scatterlists (and now sg_chains) are the way the block
> subsystem hands pages to its underlying block devices.

James, scatterlists are used for more than the block subsystem.  I know you're 
designing for that, but we can do better.

    Because a single sg_ring is trivially convertible to and from a 
scatterlist *, the compatibility story is nice.  You can implement a 
subsystem (say, the block layer) with sg_ring, and still hand out struct 
scatterlist arrays for backwards compat: old code won't ask for v. long 
scatterlist arrays anyway.
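
As a sketch, that conversion is about this much code (ring_to_sglist()
is a made-up name):

#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/sg_ring.h>	/* from patch 1 */

/* A ring that was never chained is just its embedded array, so legacy
 * code wanting a (struct scatterlist *, nents) pair can use it as-is. */
static struct scatterlist *ring_to_sglist(struct sg_ring *ring,
					  unsigned int *nents)
{
	BUG_ON(!list_empty(&ring->list));	/* single array only */
	*nents = ring->num;
	return ring->sg;
}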

    Now we have sg_chaining, we can actually convert complex sg_rings into sg 
chains when someone asks, as my most recent patch does.  I think that's one 
abstraction too many, but it provides a transition path.

> There have never until now been any requirements to join already
> dma_map_sg() converted scatterlists ... that would wreak havoc with the
> way we reuse the list plus damage the idempotence of map/unmap.  What is
> the use case you have for this?

(This was, admittedly, a made-up example).

> > 2) sg_chain's end marker is only for reading non-dma elements, not for
> > mapped chains, nor for writing into chains.  Plus it's a duplicate of the
> > num arg, which is still passed around for efficiency.  (virtio had to
> > implement count_sg() because I didn't want those two redundant num args).
> >
> > 3) By overloading struct scatterlist, it's invisible what code has been
> > converted to chains, and which hasn't.  Too clever by half!
>
> No it's not ... that's the whole point.  Code not yet converted to use
> the API accessors is by definition unable to use chaining.  Code
> converted to use the accessors by design doesn't care (and so is
> "converted to chains").

But you've overloaded the type: what's the difference between:
	/* Before conversion */
	int foo(struct scatterlist *, int);
And:
	/* After conversion */
	int foo(struct scatterlist *, int);

You have to wade through the implementation to see the difference: does this 
function take a "chained scatterlist" or an "array scatterlist"?

Then you add insult to injury by implementing sg_chain() *which doesn't handle 
chained scatterlists!*.

> > sg_ring would not have required any change to existing drivers, just
> > those that want to use long sg lists.  And then it's damn obvious which
> > are which.
>
> Which, by definition, would have been pretty much all of them.

But it would have started with none of them, and it would have been done over 
time.  Eventually we might have had a flag day to remove raw scatterlist 
arrays.

> > 4) sg_chain missed a chance to combine all the relevant information (max,
> > num, num_dma and the sg array). sg_table comes close, but you still can't
> > join them (no max information, and even if there were, what if it's
> > full?). Unlike sg_ring, it's unlikely to reduce bugs.
>
> sg_table is sg_chain ... they're incremental extensions of the same body
> of work.

Yes.

> The only such example of a driver like that I know is the
> crypto API and now your virtio.  Once we have the actual requirements
> and then the API, I think the natural data structures will probably drop
> out.

ATA wants to chain sgs (Tejun cc'd).  It hasn't been practical to chain up 
until now, but I'd say it's going to be more widely used now that it is.

Basically, sg_chain solves one problem: larger sg lists for scsi.  That's 
fine, but I want to see better use of scatterlists everywhere in the kernel.

sg_chains suck for manipulation, and AFAICT that's inherent.  Here, take a 
look at the sg_ring conversion of scsi_alloc_sgtable and scsi_free_sgtable 
and you can see why I'm unhappy with the sg_chain code:

@@ -737,21 +745,41 @@ static inline unsigned int scsi_sgtable_
 	return index;
 }
 
-struct scatterlist *scsi_alloc_sgtable(struct scsi_cmnd *cmd, gfp_t gfp_mask)
+static void free_sgring(struct sg_ring *head)
 {
 	struct scsi_host_sg_pool *sgp;
-	struct scatterlist *sgl, *prev, *ret;
+	struct sg_ring *sg, *n;
+
+	/* Free any other entries in the ring. */
+	list_for_each_entry_safe(sg, n, &head->list, list) {
+		list_del(&sg->list);
+		sgp = scsi_sg_pools + scsi_sgtable_index(sg->max);
+		mempool_free(sg, sgp->pool);
+	}
+
+	/* Now free the head of the ring. */
+	BUG_ON(!list_empty(&head->list));
+
+	sgp = scsi_sg_pools + scsi_sgtable_index(head->max);
+	mempool_free(head, sgp->pool);
+}
+
+struct sg_ring *scsi_alloc_sgring(struct scsi_cmnd *cmd, unsigned int num,
+				  gfp_t gfp_mask)
+{
+	struct scsi_host_sg_pool *sgp;
+	struct sg_ring *sg, *ret;
 	unsigned int index;
 	int this, left;
 
-	BUG_ON(!cmd->use_sg);
+	BUG_ON(!num);
 
-	left = cmd->use_sg;
-	ret = prev = NULL;
+	left = num;
+	ret = NULL;
 	do {
 		this = left;
 		if (this > SCSI_MAX_SG_SEGMENTS) {
-			this = SCSI_MAX_SG_SEGMENTS - 1;
+			this = SCSI_MAX_SG_SEGMENTS;
 			index = SG_MEMPOOL_NR - 1;
 		} else
 			index = scsi_sgtable_index(this);
@@ -760,32 +788,20 @@ struct scatterlist *scsi_alloc_sgtable(s
 
 		sgp = scsi_sg_pools + index;
 
-		sgl = mempool_alloc(sgp->pool, gfp_mask);
-		if (unlikely(!sgl))
+		sg = mempool_alloc(sgp->pool, gfp_mask);
+		if (unlikely(!sg))
 			goto enomem;
 
-		sg_init_table(sgl, sgp->size);
+		sg_ring_init(sg, sgp->size - SCSI_SG_PAD);
 
 		/*
-		 * first loop through, set initial index and return value
+		 * first loop through, set return value, otherwise
+		 * attach this to the tail.
 		 */
 		if (!ret)
-			ret = sgl;
-
-		/*
-		 * chain previous sglist, if any. we know the previous
-		 * sglist must be the biggest one, or we would not have
-		 * ended up doing another loop.
-		 */
-		if (prev)
-			sg_chain(prev, SCSI_MAX_SG_SEGMENTS, sgl);
-
-		/*
-		 * if we have nothing left, mark the last segment as
-		 * end-of-list
-		 */
-		if (!left)
-			sg_mark_end(&sgl[this - 1]);
+			ret = sg;
+		else
+			list_add_tail(&sg->list, &ret->list);
 
 		/*
 		 * don't allow subsequent mempool allocs to sleep, it would
@@ -793,85 +809,21 @@ struct scatterlist *scsi_alloc_sgtable(s
 		 */
 		gfp_mask &= ~__GFP_WAIT;
 		gfp_mask |= __GFP_HIGH;
-		prev = sgl;
 	} while (left);
 
-	/*
-	 * ->use_sg may get modified after dma mapping has potentially
-	 * shrunk the number of segments, so keep a copy of it for free.
-	 */
-	cmd->__use_sg = cmd->use_sg;
 	return ret;
 enomem:
-	if (ret) {
-		/*
-		 * Free entries chained off ret. Since we were trying to
-		 * allocate another sglist, we know that all entries are of
-		 * the max size.
-		 */
-		sgp = scsi_sg_pools + SG_MEMPOOL_NR - 1;
-		prev = ret;
-		ret = &ret[SCSI_MAX_SG_SEGMENTS - 1];
-
-		while ((sgl = sg_chain_ptr(ret)) != NULL) {
-			ret = &sgl[SCSI_MAX_SG_SEGMENTS - 1];
-			mempool_free(sgl, sgp->pool);
-		}
-
-		mempool_free(prev, sgp->pool);
-	}
+	if (ret)
+		free_sgring(ret);
 	return NULL;
 }
+EXPORT_SYMBOL(scsi_alloc_sgring);
 
-EXPORT_SYMBOL(scsi_alloc_sgtable);
-
-void scsi_free_sgtable(struct scsi_cmnd *cmd)
+void scsi_free_sgring(struct scsi_cmnd *cmd)
 {
-	struct scatterlist *sgl = cmd->request_buffer;
-	struct scsi_host_sg_pool *sgp;
-
-	/*
-	 * if this is the biggest size sglist, check if we have
-	 * chained parts we need to free
-	 */
-	if (cmd->__use_sg > SCSI_MAX_SG_SEGMENTS) {
-		unsigned short this, left;
-		struct scatterlist *next;
-		unsigned int index;
-
-		left = cmd->__use_sg - (SCSI_MAX_SG_SEGMENTS - 1);
-		next = sg_chain_ptr(&sgl[SCSI_MAX_SG_SEGMENTS - 1]);
-		while (left && next) {
-			sgl = next;
-			this = left;
-			if (this > SCSI_MAX_SG_SEGMENTS) {
-				this = SCSI_MAX_SG_SEGMENTS - 1;
-				index = SG_MEMPOOL_NR - 1;
-			} else
-				index = scsi_sgtable_index(this);
-
-			left -= this;
-
-			sgp = scsi_sg_pools + index;
-
-			if (left)
-				next = sg_chain_ptr(&sgl[sgp->size - 1]);
-
-			mempool_free(sgl, sgp->pool);
-		}
-
-		/*
-		 * Restore original, will be freed below
-		 */
-		sgl = cmd->request_buffer;
-		sgp = scsi_sg_pools + SG_MEMPOOL_NR - 1;
-	} else
-		sgp = scsi_sg_pools + scsi_sgtable_index(cmd->__use_sg);
-
-	mempool_free(sgl, sgp->pool);
+	free_sgring(cmd->sg);
 }
-
-EXPORT_SYMBOL(scsi_free_sgtable);
+EXPORT_SYMBOL(scsi_free_sgring);
 
 /*
  * Function:    scsi_release_buffers()

Hope that clarifies,
Rusty.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/7] sg_ring: a ring of scatterlist arrays
  2008-01-08  0:39       ` Rusty Russell
@ 2008-01-09 22:10         ` James Bottomley
  2008-01-10  2:01           ` Rusty Russell
  0 siblings, 1 reply; 25+ messages in thread
From: James Bottomley @ 2008-01-09 22:10 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Jens Axboe, linux-kernel, linux-scsi, linux-ide, Tejun Heo


On Tue, 2008-01-08 at 11:39 +1100, Rusty Russell wrote:
> On Tuesday 08 January 2008 02:48:23 James Bottomley wrote:
> > We're always open to new APIs (or more powerful and expanded old ones).
> > The way we've been doing the sg_chain conversion is to slide API layers
> > into the drivers so sg_chain becomes a simple API flip when we turn it
> > on.  Unfortunately sg_ring doesn't quite fit nicely into this.
> 
> Hi James,
> 
>    Well, it didn't touch any drivers.  The only ones which needed altering 
> were those which wanted to use large scatter-gather lists.  You think of the 
> subtlety of sg-chain conversion as a feature; it's a bug :)

No, I think of the side effect of hiding scatterlist manipulations
inside accessors as a feature because it insulates the drivers from the
details of the implementation.

> > > > The other thing I note is that the problem you're claiming to solve
> > > > with sg_ring (the ability to add extra scatterlists to the front or the
> > > > back of an existing one) is already solved with sg_chain, so the only
> > > > real advantage of sg_ring was that it contains explicit counts, which
> > > > sg_table (in -mm) also introduces.
> > >
> > > I just converted virtio using latest Linus for fair comparison
> >
> > Erm, but that's not really a fair comparison; you need the sg_table code
> > in
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-2.6-block.git
> >
> > branch sg as well.
> 
> Actually, I think it's a wash.  Now callers need to set up an sg_table as 
> well.  It will save the count_sg() calls though.
> 
> > > , and it's still
> > > pretty ugly.  sg_ring needs more work to de-scsi it.  sg_table is almost
> > > sg_ring except it retains all the chaining warts.
> > >
> > > But we hit the same problems:
> > >
> > > 1) sg_chain loses information.  The clever chain packaging makes reading
> > > easy, but manipulation is severely limited.  You can append to your own
> > > chains by padding, but not someone else's.  This works for SCSI, but what
> > > about the rest of us?  And don't even think of joining mapped chains: it
> > > will almost work.
> >
> > OK, but realistically some of your criticisms are way outside of the
> > design intent.  Scatterlists (and now sg_chains) are the way the block
> > subsystem hands pages to its underlying block devices.
> 
> James, scatterlists are used for more than the block subsystem.  I know you're 
> designing for that, but we can do better.
> 
>     Because a single sg_ring is trivially convertible to and from a 
> scatterlist *, the compatibility story is nice.  You can implement a 
> subsystem (say, the block layer) with sg_ring, and still hand out struct 
> scatterlist arrays for backwards compat: old code won't ask for v. long 
> scatterlist arrays anyway.
> 
>     Now we have sg_chaining, we can actually convert complex sg_rings into sg 
> chains when someone asks, as my most recent patch does.  I think that's one 
> abstraction too many, but it provides a transition path.
> 
> > There have never until now been any requirements to join already
> > dma_map_sg() converted scatterlists ... that would wreak havoc with the
> > way we reuse the list plus damage the idempotence of map/unmap.  What is
> > the use case you have for this?
> 
> (This was, admittedly, a made-up example).
> 
> > > 2) sg_chain's end marker is only for reading non-dma elements, not for
> > > mapped chains, nor for writing into chains.  Plus it's a duplicate of the
> > > num arg, which is still passed around for efficiency.  (virtio had to
> > > implement count_sg() because I didn't want those two redundant num args).
> > >
> > > 3) By overloading struct scatterlist, it's invisible what code has been
> > > converted to chains, and which hasn't.  Too clever by half!
> >
> > No it's not ... that's the whole point.  Code not yet converted to use
> > the API accessors is by definition unable to use chaining.  Code
> > converted to use the accessors by design doesn't care (and so is
> > "converted to chains").
> 
> But you've overloaded the type: what's the difference between:
> 	/* Before conversion */
> 	int foo(struct scatterlist *, int);
> And:
> 	/* After conversion */
> 	int foo(struct scatterlist *, int);
> 
> You have to wade through the implementation to see the difference: does this 
> function take a "chained scatterlist" or an "array scatterlist"?
> 
> Then you add insult to injury by implementing sg_chain() *which doesn't handle 
> chained scatterlists!*.
> 
> > > sg_ring would not have required any change to existing drivers, just
> > > those that want to use long sg lists.  And then it's damn obvious which
> > > are which.
> >
> > Which, by definition, would have been pretty much all of them.
> 
> But it would have started with none of them, and it would have been done over 
> time.  Eventually we might have had a flag day to remove raw scatterlist 
> arrays.
> 
> > > 4) sg_chain missed a chance to combine all the relevant information (max,
> > > num, num_dma and the sg array). sg_table comes close, but you still can't
> > > join them (no max information, and even if there were, what if it's
> > > full?). Unlike sg_ring, it's unlikely to reduce bugs.
> >
> > sg_table is sg_chain ... they're incremental extensions of the same body
> > of work.
> 
> Yes.
> 
> > The only such example of a driver like that I know is the
> > crypto API and now your virtio.  Once we have the actual requirements
> > and then the API, I think the natural data structures will probably drop
> > out.
> 
> ATA wants to chain sgs (Tejun cc'd).  It hasn't been practical to chain up 
> until now, but I'd say it's going to be more widely used now it is.

Actually, I think we might be able to solve ATA's problem without ever
having it manipulate the sg lists, but that's really a different issue.

> Basically, sg_chain solves one problem: larger sg lists for scsi.  That's 
> fine, but I want to see better use of scatterlists everywhere in the kernel.
> 
> sg_chains suck for manipulation, and AFAICT that's inherent.  Here, take a 
> look at the sg_ring conversion of scsi_alloc_sgtable and scsi_free_sgtable 
> and you can see why I'm unhappy with the sg_chain code:
> 
[...]
> Hope that clarifies,

Actually, not really.  If I want to continue the competition, I can just
point out that your sg_ring code is bigger than those corresponding
segments are after the sg_table conversion of scsi_lib.c ...

However, this is pointless.  We seem to be talking past each other.
Your sg_ring is an implementation, like sg_table.  What I want to do is
actually insulate the device drivers (even the drivers that want to
manipulate sg lists) from the implementation.  The idea being that if
this is done correctly, we can alter the implementation from sg_table to
sg_ring to sg_annulus and not have to change any of the driver code.

Since your code introduces new ring-handling primitives, it runs against
that grain.  What I'm primarily interested in is what's broken about the
new accessor API ... why doesn't it work to hide the internals of
sg_ring?  It seems to me that, if I try to do the conversion, all that's
missing is this problematic two-level traversal of primitives that
sg_ring requires.

However, a lot of the churn seems to be because you want to change the
fundamental element from a scatterlist entry to an sg_ring.  Worse still,
the locations of individual scatterlist elements are exposed again.  Take
this conversion, for instance:


> -       for (k = 0; k < scp->use_sg; ++k, sg = sg_next(sg)) {
> -               kaddr = (unsigned char *)kmap_atomic(sg_page(sg), KM_USER0);
> +       sg_ring_for_each(scp->sg, sg, k) {
> +               kaddr = (unsigned char *)kmap_atomic(sg_page(&sg->sg[k]),
> +                                                    KM_USER0);
>                 if (NULL == kaddr)
>                         return -1;
> -               kaddr_off = (unsigned char *)kaddr + sg->offset;
> -               len = sg->length;
> +               kaddr_off = (unsigned char *)kaddr + sg->sg[k].offset;
> +               len = sg->sg[k].length;

Here, you're re-exposing the internals of how we traverse the lists.
This is exactly what sg_next() and friends were designed to get rid of.
All a driver wants to know is what the next scatterlist entry is, not
where it comes from.

>                 if ((req_len + len) > max_arr_len) {
>                         len = max_arr_len - req_len;
>                         fin = 1;
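
For comparison, here is roughly what that loop looks like when written
only against the accessors (a condensed sketch, not code from either
patch set; walk_cmd_pages is an illustrative name and the copy work
itself is reduced to a comment):

#include <linux/highmem.h>
#include <linux/scatterlist.h>
#include <scsi/scsi_cmnd.h>

/* Sketch: walk the command's pages without knowing how the
 * scatterlist is chained underneath. */
static int walk_cmd_pages(struct scsi_cmnd *scp)
{
        struct scatterlist *sg;
        unsigned char *kaddr, *kaddr_off;
        unsigned int len;
        int k;

        for_each_sg(scsi_sglist(scp), sg, scp->use_sg, k) {
                kaddr = kmap_atomic(sg_page(sg), KM_USER0);
                if (!kaddr)
                        return -1;
                kaddr_off = kaddr + sg->offset;
                len = sg->length;
                /* ... copy to or from kaddr_off/len as the original does ... */
                kunmap_atomic(kaddr, KM_USER0);
        }
        return 0;
}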

I suspect that if you re-roll sg_ring to conform to the current accessors,
you'll find it pretty much fits, but probably looks a lot like sg_table.
But if it doesn't, and there's something missing from the accessors,
that's what I want to know about.

James



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/7] sg_ring: a ring of scatterlist arrays
  2008-01-09 22:10         ` James Bottomley
@ 2008-01-10  2:01           ` Rusty Russell
  2008-01-10 15:27             ` James Bottomley
  0 siblings, 1 reply; 25+ messages in thread
From: Rusty Russell @ 2008-01-10  2:01 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jens Axboe, linux-kernel, linux-scsi, linux-ide, Tejun Heo

On Thursday 10 January 2008 09:10:37 James Bottomley wrote:
> On Tue, 2008-01-08 at 11:39 +1100, Rusty Russell wrote:
> > On Tuesday 08 January 2008 02:48:23 James Bottomley wrote:
> > > We're always open to new APIs (or more powerful and expanded old ones).
> > > The way we've been doing the sg_chain conversion is to slide API layers
> > > into the drivers so sg_chain becomes a simple API flip when we turn it
> > > on.  Unfortunately sg_ring doesn't quite fit nicely into this.
> >
> > Hi James,
> >
> >    Well, it didn't touch any drivers.  The only ones which needed
> > altering were those which wanted to use large scatter-gather lists.  You
> > think of the subtlety of sg-chain conversion as a feature; it's a bug :)
>
> No, I think of the side effect of hiding scatterlist manipulations
> inside accessors as a feature because it insulates the drivers from the
> details of the implementation.

I completely disagree.  Abstractions add complexity, and that costs.  They 
steal information from their users, and as they build up like sediment they 
make debugging and understanding harder.

For example, an array is simple and well understood.  An abstraction would
just buy confusion and yet another thing to learn.

When a single array was no longer sufficient, I figured a linked-list of 
arrays was the simplest replacement.  Easy to understand and accepted 
practice (though rings are a bit exotic).  Because the implementation is the
interface, authors can see what is legal.
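
To make that concrete, here is the sort of manipulation a list of arrays
allows.  The helper below is hypothetical (it is not in the posted
patches) and assumes only the sg_ring fields introduced in patch 1/7:

#include <linux/list.h>
#include <linux/sg_ring.h>

/* Hypothetical: append every scatterlist array of ring 'b' (its own
 * array plus anything already chained onto it) to ring 'a'. */
static void sg_ring_append(struct sg_ring *a, struct sg_ring *b)
{
        LIST_HEAD(tmp);

        /* Detach b's chained arrays so b->list can become a plain node. */
        list_splice_init(&b->list, &tmp);
        /* b's own array becomes the last member of a ... */
        list_add_tail(&b->list, &a->list);
        /* ... followed by the arrays that were chained onto b. */
        list_splice(&tmp, &b->list);
}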

More concretely, you're already regarding your abstractions as too expensive, 
which is why sg_chain() doesn't handle chained sgs.

Result: you've got a complex implementation and a complex interface with a 
complex abstraction.

> > sg_chains suck for manipulation, and AFAICT that's inherent.  Here, take
> > a look at the sg_ring conversion of scsi_alloc_sgtable and
> > scsi_free_sgtable and you can see why I'm unhappy with the sg_chain code:
> [...]
>
> > Hope that clarifies,
>
> Actually, not really.  If I want to continue the competition, I can just
> point out that your sg_ring code is bigger than those corresponding
> segments are after the sg_table conversion of scsi_lib.c ...
>
> However, this is pointless.

No, it's exactly the point.  These details *matter*.  The implementation 
*matters*.  sg_table moves this code out of scsi_lib (good!), but still 
demonstrates how much of a PITA they are to manipulate.

As for being able to make arbitrary changes in future without hitting drivers: 
this is the Kernel ABI pipe dream writ small.

OK, I give in with -ETIMEDOUT.  I'll go away now and do something 
productive :)

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/7] sg_ring: a ring of scatterlist arrays
  2008-01-10  2:01           ` Rusty Russell
@ 2008-01-10 15:27             ` James Bottomley
  0 siblings, 0 replies; 25+ messages in thread
From: James Bottomley @ 2008-01-10 15:27 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Jens Axboe, linux-kernel, linux-scsi, linux-ide, Tejun Heo

On Thu, 2008-01-10 at 13:01 +1100, Rusty Russell wrote: 
> On Thursday 10 January 2008 09:10:37 James Bottomley wrote:
> > On Tue, 2008-01-08 at 11:39 +1100, Rusty Russell wrote:
> > > On Tuesday 08 January 2008 02:48:23 James Bottomley wrote:
> > > > We're always open to new APIs (or more powerful and expanded old ones).
> > > > The way we've been doing the sg_chain conversion is to slide API layers
> > > > into the drivers so sg_chain becomes a simple API flip when we turn it
> > > > on.  Unfortunately sg_ring doesn't quite fit nicely into this.
> > >
> > > Hi James,
> > >
> > >    Well, it didn't touch any drivers.  The only ones which needed
> > > altering were those which wanted to use large scatter-gather lists.  You
> > > think of the subtlety of sg-chain conversion as a feature; it's a bug :)
> >
> > No, I think of the side effect of hiding scatterlist manipulations
> > inside accessors as a feature because it insulates the drivers from the
> > details of the implementation.
> 
> I completely disagree.  Abstractions add complexity, and that costs.  They 
> steal information from their users, and as they build up like sediment they 
> make debugging and understanding harder.

Don't be silly ... abstractions are the core of how we program something
as complex as platform hardware.  By design, abstractions hide
information.  Good abstractions hide information you don't need to know.
Take the DMA API for instance: dma_map_sg() hides all of the platform
specifics of whether or not you have an IOMMU and how it is programmed.
That's a good abstraction ... drivers would be a complete mess if they
had to do it themselves.
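
To be concrete, the consumer side of that abstraction looks roughly like
this (a generic sketch, not taken from any driver in this thread; the
function name map_for_device is illustrative):

#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/errno.h>
#include <linux/scatterlist.h>

/* The driver never sees whether an IOMMU merged or remapped anything;
 * it just maps the list and reads back bus addresses and lengths. */
static int map_for_device(struct device *dev, struct scatterlist *sgl,
                          int nents)
{
        struct scatterlist *sg;
        int i, count;

        count = dma_map_sg(dev, sgl, nents, DMA_TO_DEVICE);
        if (!count)
                return -ENOMEM;

        for_each_sg(sgl, sg, count, i)
                dev_dbg(dev, "segment %d: bus 0x%llx len %u\n", i,
                        (unsigned long long)sg_dma_address(sg),
                        sg_dma_len(sg));

        dma_unmap_sg(dev, sgl, nents, DMA_TO_DEVICE);
        return 0;
}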

> For example, an array is simple and well understood.  An abstraction would 
> just buy confusion and yet another thing to learn.

> When a single array was no longer sufficient, I figured a linked-list of 
> arrays was the simplest replacement.  Easy to understand and accepted 
> practice (though rings are a bit exotic).  Because the implementation is the 
> interface, authors can see what is legal.

Really, I don't buy this at all.  The first thing you complained about
when you came up with sg_ring was how painful it was to update the 30-odd
SCSI drivers.  That was also my reaction when the scatterlist chaining
idea came along.  My second thought was: suppose this isn't quite right
and we modify the way chaining is done ... do I have to update all the
drivers all over again?  Hence one of the input requirements to the
update was a consumer abstraction to insulate drivers from the
implementation.

Thinking about information that the drivers don't need to know, the
obvious thing is how the actual scatterlist is implemented, whether it's
a list-based ring, a pointer/mark-based chain or some as-yet-uninvented
annulus.  All I think most consumers need is:

     1. A way to get the first scatterlist element
     2. A way to get the next scatterlist element
     3. An iterator over the entire scatterlist
     4. A way to tell if they've reached the last scatterlist element.

Actually, 1, 2 and 4 are enough to build 3, but iteration is such a
common operation in most drivers that providing it directly was a
worthwhile simplification.  Other things consumers need are sizing and
the element count.
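
Concretely, those operations already map onto the existing helpers in
<linux/scatterlist.h>; a minimal illustration (the function total_bytes
is made up for this example, not code from either patch set):

#include <linux/scatterlist.h>

/* Illustration only: walk a terminated scatterlist using nothing but
 * the four consumer operations above. */
static unsigned int total_bytes(struct scatterlist *sgl)
{
        struct scatterlist *sg = sgl;           /* 1: first element */
        unsigned int total = 0;

        /* 3: for_each_sg(sgl, sg, nents, i) rolls 1, 2 and 4 into one loop. */
        for (;;) {
                total += sg->length;
                if (sg_is_last(sg))             /* 4: end-of-list test */
                        break;
                sg = sg_next(sg);               /* 2: next element */
        }
        return total;
}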

> More concretely, you're already regarding your abstractions as too expensive, 
> which is why sg_chain() doesn't handle chained sgs.

sg_chain isn't part of the consumer interface; it's part of the
producer/manipulator interface.  I care a lot less about getting that
one right first time because it only occurs in one place in the SCSI
mid-layer and is easy to update and fix.

My principal concern at the moment is the consumer interface ... if
there's a fault in that we need to know now.

> Result: you've got a complex implementation and a complex interface with a 
> complex abstraction.
> 
> > > sg_chains suck for manipulation, and AFAICT that's inherent.  Here, take
> > > a look at the sg_ring conversion of scsi_alloc_sgtable and
> > > scsi_free_sgtable and you can see why I'm unhappy with the sg_chain code:
> > [...]
> >
> > > Hope that clarifies,
> >
> > Actually, not really.  If I want to continue the competition, I can just
> > point out that your sg_ring code is bigger than those corresponding
> > segments are after the sg_table conversion of scsi_lib.c ...
> >
> > However, this is pointless.
> 
> No, it's exactly the point.  These details *matter*.  The implementation 
> *matters*.  sg_table moves this code out of scsi_lib (good!), but still 
> demonstrates how much of a PITA they are to manipulate.
> 
> As for being able to make arbitrary changes in future without hitting drivers: 
> this is the Kernel ABI pipe dream writ small.

No, the kernel ABI is about preserving binary semantics for proprietary
drivers.  We're talking about an API here, which is a totally different
thing.  I don't believe I said anywhere that it should be a fixed API
for all time ... that would also be silly.  However, that doesn't mean
you shouldn't take the best care you can to get the API as right as possible.
If it's wrong, it will be altered, but it is much nicer to get it right
first time so it doesn't have to be.

James

> OK, I give in with -ETIMEDOUT.  I'll go away now and do something 
> productive :)
> 
> Cheers,
> Rusty.


^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2008-01-10 15:27 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-12-19  6:31 [PATCH 0/7] sg_ring: a ring of scatterlist arrays Rusty Russell
2007-12-19  6:33 ` [PATCH 1/7] sg_ring: introduce " Rusty Russell
2007-12-19  7:31   ` [PATCH 2/7] sg_ring: use in virtio Rusty Russell
2007-12-19  7:33     ` [PATCH 3/7] sg_ring: blk_rq_map_sg_ring as a counterpart to blk_rq_map_sg Rusty Russell
2007-12-19  7:34       ` [PATCH 4/7] sg_ring: dma_map_sg_ring() helper Rusty Russell
2007-12-19  7:36         ` [PATCH 5/7] sg_ring: Convert core scsi code to sg_ring Rusty Russell
2007-12-19  7:37           ` [PATCH 6/7] sg_ring: libata simplification Rusty Russell
2007-12-19  7:38             ` [PATCH 7/7] sg_ring: convert core ATA code to sg_ring Rusty Russell
2007-12-26  8:36               ` Tejun Heo
2007-12-26 17:12                 ` James Bottomley
2007-12-27  0:24                 ` Rusty Russell
2007-12-27  4:21                   ` Tejun Heo
2008-01-05 15:31 ` [PATCH 0/7] sg_ring: a ring of scatterlist arrays James Bottomley
2008-01-07  4:38   ` Rusty Russell
2008-01-07  5:01     ` Tejun Heo
2008-01-07  5:28       ` Rusty Russell
2008-01-07  6:37         ` Tejun Heo
2008-01-07  8:34           ` Rusty Russell
2008-01-07  8:45             ` Tejun Heo
2008-01-07 12:17               ` Herbert Xu
2008-01-07 15:48     ` James Bottomley
2008-01-08  0:39       ` Rusty Russell
2008-01-09 22:10         ` James Bottomley
2008-01-10  2:01           ` Rusty Russell
2008-01-10 15:27             ` James Bottomley
