LKML Archive on lore.kernel.org
From: Neil Brown <neilb@suse.de>
To: Dan Williams <dan.j.williams@intel.com>
Cc: linux-raid@vger.kernel.org, akpm@osdl.org,
	linux-kernel@vger.kernel.org, christopher.leech@intel.com
Subject: Re: [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction
Date: Mon, 9 Oct 2006 08:18:33 +1000
Message-ID: <17705.31033.706571.150023@cse.unsw.edu.au>
In-Reply-To: message from Dan Williams on Monday September 11



On Monday September 11, dan.j.williams@intel.com wrote:
> Neil,
> 
> The following patches implement hardware accelerated raid5 for the Intel
> Xscale® series of I/O Processors.  The MD changes allow stripe
> operations to run outside the spin lock in a work queue.  Hardware
> acceleration is achieved by using a dma-engine-aware work queue routine
> instead of the default software only routine.

Hi Dan,
 Sorry for the delay in replying.
 I've looked through these patches at last (mostly the raid-specific
 bits) and while there is clearly a lot of good stuff here, it doesn't
 'feel' right - it just seems too complex.

 The particular issues that stand out to me are:
   - 33 new STRIPE_OP_* flags.  I'm sure there doesn't need to be that
      many new flags.
   - the "raid5 dma client" patch moves far too much internal
     knowledge about raid5 into drivers/dma.

 Clearly there are some complex issues being dealt with and some
 complexity is to be expected, but I feel there must be room for some
 serious simplification.

 Let me try to describe how I envisage it might work.

 As you know, the theory-of-operation of handle_stripe is that it
 assesses the state of a stripe deciding what actions to perform and
 then performs them.  Synchronous actions (e.g. current parity calcs)
 are performed 'in-line'.  Async actions (reads, writes) and actions
 that cannot be performed under a spinlock (->b_end_io) are recorded
 as being needed and then are initiated at the end of handle_stripe
 outside of the sh->lock.
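
 Schematically (a caricature only, not the real raid5.c code), that
 flow is:

	handle_stripe(struct stripe_head *sh)
	{
		spin_lock(&sh->lock);
		/* assess: look at each sh->dev[], do any parity
		 * calculations in-line, and record which devices want
		 * I/O (R5_Wantread/R5_Wantwrite) and which bios have
		 * completed */
		spin_unlock(&sh->lock);

		/* act: outside the lock, call the end_io routines on
		 * the completed bios and submit the recorded
		 * reads and writes */
	}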

 The proposal is to bring the parity and other bulk-memory operations
 out of the spinlock and make them optionally asynchronous.

 The set of tasks that might be needed to be performed on a stripe
 are:
	Clear a target cache block
	pre-xor various cache blocks into a target
	copy data out of bios into cache blocks (drain)
	post-xor various cache blocks into a target
	copy data into bios out of cache blocks (fill)
	test if a cache block is all zeros
	start a read on a cache block
	start a write on a cache block

 (There is also a memcpy when expanding raid5.  I think I would try to
  simply avoid that copy and move pointers around instead).
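
 Boiled down, that is about eight bits of per-stripe 'pending' state,
 something like this (names purely illustrative; the sketch further
 down folds the last two into a single START_IO):

	enum {	/* bit numbers for sh->ops.pending */
		CLEAR_TARGET,	/* zero the target cache block */
		PRE_XOR,	/* xor old blocks into the target */
		COPY_IN,	/* drain bio data into cache blocks */
		POST_XOR,	/* xor new blocks into the target */
		COPY_OUT,	/* fill bios from cache blocks */
		CHECK_ZERO,	/* test if a block is all zeros */
		START_READ,	/* issue reads */
		START_WRITE,	/* issue writes */
	};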

 Some of these steps require sequencing, e.g.
   clear, pre-xor, copy, post-xor, write
 for an rmw cycle.
 We could require handle_stripe to be called again for each step.
 i.e. first call just clears the target and flags it as clear.  Next
 call initiates the pre-xor and flags that as done.  Etc.  However I
 think that would make the non-offloaded case too slow, or at least
 too clumsy.

 So instead we set flags to say what needs to be done and have a
 workqueue system that does it.
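
 For the rmw example above, handle_stripe would then, roughly, just do
 something like the following (assuming a work item embedded in
 sh->ops and a per-conf workqueue, both only illustrative here):

	set_bit(CLEAR_TARGET, &sh->ops.pending);
	set_bit(PRE_XOR, &sh->ops.pending);
	set_bit(COPY_IN, &sh->ops.pending);
	set_bit(POST_XOR, &sh->ops.pending);
	set_bit(START_IO, &sh->ops.pending);
	queue_work(conf->do_stuff_wq, &sh->ops.work);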

 (so far this is all quite similar to what you have done.)

 So handle_stripe would set various flags and other things (like
 identifying which block is the 'target' block) and run the following
 in a workqueue:

static void raid5_do_stuff(struct stripe_head *sh)
{
	raid5_conf_t *conf = sh->raid_conf;
	int rv;

	/* zero the target block before any pre-xor */
	if (test_bit(CLEAR_TARGET, &sh->ops.pending)) {
		struct page *p = sh->dev[sh->ops.target].page;

		rv = async_memset(p, 0, 0, PAGE_SIZE, ops_done, sh);
		if (rv != BUSY)
			clear_bit(CLEAR_TARGET, &sh->ops.pending);
		if (rv != COMPLETE)
			goto out;
	}

	while (test_bit(PRE_XOR, &sh->ops.pending)) {
		struct page *plist[XOR_MAX];
		int offset[XOR_MAX];
		int pos = 0;
		int d;

		for (d = sh->ops.nextdev;
		     d < conf->raid_disks && pos < XOR_MAX ;
		     d++) {
			if (d == sh->ops.target)
				continue;
			if (!test_bit(R5_WantPreXor, &sh->dev[d].flags))
				continue;
			plist[pos] = sh->dev[d].page;
			offset[pos++] = 0;
		}
		if (pos) {
			struct page *p = sh->dev[sh->ops.target].page;
			rv = async_xor(p, 0, plist, offset, pos, PAGE_SIZE,
				       ops_done, sh);
			if (rv != BUSY)
				sh->ops.nextdev = d;
			if (rv != COMPLETE)
				goto out;
		} else {
			clear_bit(PRE_XOR, &sh->ops.pending);
			sh->ops.nextdev = 0;
		}
	}
		
	/* drain: copy data from the bios into the cache blocks */
	while (test_bit(COPY_IN, &sh->ops.pending)) {
		...
	}
	....

	/* finally: issue the reads/writes and complete any bios */
	if (test_bit(START_IO, &sh->ops.pending)) {
		int d;
		for (d = 0 ; d < conf->raid_disks ; d++) {
			/* all that code from the end of handle_stripe */
		}
	}

	release_stripe(conf, sh);
	return;

 out:
	if (rv == BUSY) {
		/* wait on something and try again ???*/
	}
	return;
}

static void ops_done(struct stripe_head *sh)
{
	queue_work(....whatever..);
}


Things to note:
 - we keep track of where we are up to in sh->ops (a rough struct is
   sketched below).
      .pending is flags saying what is left to be done
      .nextdev is the next device to process for operations that
        work on several devices
      .nextbio, .nextiov will be needed for copy operations that
        cross multiple bios and iovecs.

 - Each sh->dev has R5_Want flags reflecting which multi-device
   operations are wanted on each device.

 - async bulk-memory operations take pages, offsets, and lengths,
   and can return COMPLETE (if the operation was performed
   synchronously), IN_PROGRESS (if it has been started, or at least
   queued), or BUSY (if it couldn't even be queued).  Exactly what to
   do in that case I'm not sure.  Probably we need a waitqueue to
   wait on.

 - The interface between the client and the ADMA hardware is a
   collection of async_ functions.  async_memcpy, async_xor,
   async_memset etc.

   I gather there needs to be some understanding
   about whether the pages are already appropriately mapped for DMA or
   whether a mapping is needed.  Maybe an extra flag argument should
   be passed.

   I imagine that any piece of ADMA hardware would register with the
   'async_*' subsystem, and a call to async_X would be routed as
   appropriate, or be run in-line.
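
 To make that concrete, the per-stripe state and the async_ calls used
 in the sketch above might look roughly like this (all names and
 signatures are illustrative only, not an existing API):

	/* state carried in sh->ops between workqueue runs */
	struct stripe_operations {
		unsigned long		pending;	/* tasks still to do */
		int			target;		/* the 'target' block */
		int			nextdev;	/* resume point: devices */
		int			nextbio, nextiov; /* resume point: copies */
		struct work_struct	work;
	};

	enum async_status { COMPLETE, IN_PROGRESS, BUSY };

	/* callback invoked when an offloaded operation finishes;
	 * raid5 would pass ops_done with the stripe_head as data */
	typedef void (*async_done_t)(void *data);

	enum async_status async_memset(struct page *p, int value,
				unsigned int offset, size_t len,
				async_done_t done, void *data);

	enum async_status async_xor(struct page *dest, unsigned int dest_off,
				struct page **src, int *src_off, int src_cnt,
				size_t len, async_done_t done, void *data);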

This approach introduces 8 flags for sh->ops.pending and maybe two or
three new R5_Want* flags.  It also keeps the raid5 knowledge firmly in
the raid5 code base.  So it seems to keep the complexity under control.

Would this approach make sense to you?  Is there something really
important I have missed?

(I'll try and be more responsive next time).

Thanks,
NeilBrown
