LKML Archive on lore.kernel.org
* do_page_fault() mm->mmap_sem deadlocks
@ 2004-05-17 15:58 Andy Whitcroft
  2004-05-17 18:07 ` Andrew Morton
  0 siblings, 1 reply; 3+ messages in thread
From: Andy Whitcroft @ 2004-05-17 15:58 UTC (permalink / raw)
  To: linuxppc64-dev, linux-kernel
  Cc: Adam Litke, David Gibson, Andrew Morton, Martin J. Bligh

[-- Attachment #1: Type: text/plain, Size: 4095 bytes --]

Whilst trying to debug a bug in PPC64 unmap handling we noticed that it is relatively easy for a kernel bug to lead to a deadlock trying to down(mm->mmap_sem) when trying to categorise a fault for handling.  The fault handlers grab the mmap_sem lock to scan the VMA list to see whether the fault is against a valid segment of the task address space.  This deadlocks if the task is in a system call which also manipulates the address space, such as munmap().  This is very undesirable, as it prevents the kernel from detecting the fault and reporting it as an oops, severely hampering debugging.

Most kernel code should not access any area which leads to a fault.  Indeed, the only code which should trigger faults from kernel space is the set of routines dealing with copying data to and from user space.  Faults originating outside of these specific routines are not valid and could be 'failed' without the need to reference current->mm, thereby preventing the deadlock and ensuring the failure is reported.

Looking at ppc64 and i386 it seems that all faults from legitimate sources will correspond to accesses from IPs listed in the exception table.  By validating the faulting instruction pointer against this table we should be able to distinguish the 'good' faults from the 'bad'.
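
To make the shape of that check concrete, here is a self-contained
userspace sketch of the kind of lookup involved; the table layout and
helper names are illustrative only, not the kernel's actual __ex_table
definitions:

/* Validate a faulting instruction address against a sorted table of
 * (faulting insn, fixup) pairs, in the style of search_exception_tables().
 * Illustrative names throughout; compiles with any C compiler. */
#include <stdlib.h>

struct ex_entry {
	unsigned long insn;	/* address of the faulting instruction */
	unsigned long fixup;	/* address to resume at on a fault */
};

static int ex_cmp(const void *key, const void *elem)
{
	unsigned long ip = *(const unsigned long *)key;
	const struct ex_entry *e = elem;

	if (ip < e->insn)
		return -1;
	if (ip > e->insn)
		return 1;
	return 0;
}

/* Return nonzero if 'ip' is a known user-access instruction. */
static int ip_has_fixup(const struct ex_entry *table, size_t n,
			unsigned long ip)
{
	return bsearch(&ip, table, n, sizeof(*table), ex_cmp) != NULL;
}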

I have put together patches for both ppc64 and i386 to validate faults against this table.  Testing with these patches seems to validate the assumption.  I have assumed that faults triggered from kernel space will be rare relative to those from user space, as most pointers will have been recently referenced by the calling application, but the cost of this search is a performance concern.  Another option might be to employ a scheme similar to that proposed by wli for detecting scheduler routines: marking all routines using get_user()/put_user() for contiguous layout in the kernel image, allowing a simple range comparison to provide this check (a sketch of such a check follows).  A quick kernbench run on the patched kernels seems to indicate minimal impact (details below).
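
Such a range comparison might look roughly like the following, assuming a
(hypothetical) linker script change that gathers every user-access routine
into one contiguous region bounded by two marker symbols; the symbol names
are invented for illustration:

/* Marker symbols assumed to be provided by the linker script around the
 * region holding all user-access routines (hypothetical names). */
extern char __uaccess_start[], __uaccess_end[];

/* Nonzero if the faulting IP lies inside that region. */
static inline int ip_in_uaccess_region(unsigned long ip)
{
	return ip >= (unsigned long)__uaccess_start &&
	       ip < (unsigned long)__uaccess_end;
}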

Do people think this is something which is worth pursuing?  The patches attached are minimal and there are definitely optimisations to be made (such as sharing the result of the exception lookup).  I guess this might even be something worth having as a config option for debugging hangs, if there is deemed to be any degradation?  If people think this is worth looking at I'll come back with cleaner, better tested and benchmarked patches.

-apw


i386 266:
837.27user 74.38system 1:00.58elapsed 1504%CPU (0avgtext+0avgdata 0maxresident)k
828.96user 73.89system 1:00.76elapsed 1485%CPU (0avgtext+0avgdata 0maxresident)k
837.46user 74.39system 1:01.02elapsed 1494%CPU (0avgtext+0avgdata 0maxresident)k
834.82user 73.73system 1:00.74elapsed 1495%CPU (0avgtext+0avgdata 0maxresident)k
837.39user 75.05system 1:00.53elapsed 1507%CPU (0avgtext+0avgdata 0maxresident)k

i386 266+patch:
833.67user 73.43system 0:58.52elapsed 1550%CPU (0avgtext+0avgdata 0maxresident)k
839.90user 73.67system 0:58.48elapsed 1562%CPU (0avgtext+0avgdata 0maxresident)k
843.59user 75.17system 0:58.79elapsed 1562%CPU (0avgtext+0avgdata 0maxresident)k
842.79user 75.37system 1:00.90elapsed 1507%CPU (0avgtext+0avgdata 0maxresident)k
838.55user 74.27system 1:00.53elapsed 1507%CPU (0avgtext+0avgdata 0maxresident)k

ppc64 266:
293.06user 26.90system 0:41.45elapsed 771%CPU (0avgtext+0avgdata 0maxresident)k
293.07user 26.91system 0:41.53elapsed 770%CPU (0avgtext+0avgdata 0maxresident)k
293.15user 26.83system 0:41.10elapsed 778%CPU (0avgtext+0avgdata 0maxresident)k
293.27user 26.55system 0:41.18elapsed 776%CPU (0avgtext+0avgdata 0maxresident)k
293.03user 26.84system 0:41.09elapsed 778%CPU (0avgtext+0avgdata 0maxresident)k

ppc64 266+patch:
293.16user 26.84system 0:41.12elapsed 778%CPU (0avgtext+0avgdata 0maxresident)k
292.85user 27.03system 0:41.25elapsed 775%CPU (0avgtext+0avgdata 0maxresident)k
293.07user 26.89system 0:41.32elapsed 774%CPU (0avgtext+0avgdata 0maxresident)k
293.25user 26.80system 0:41.22elapsed 776%CPU (0avgtext+0avgdata 0maxresident)k
292.71user 27.24system 0:41.48elapsed 771%CPU (0avgtext+0avgdata 0maxresident)k


[-- Attachment #2: 010-kernel_fault_ppc64.txt --]
[-- Type: text/plain, Size: 1876 bytes --]

---
 fault.c |   34 +++++++++++++++++++++++++++++-----
 1 files changed, 29 insertions(+), 5 deletions(-)

diff -X /home/apw/lib/vdiff.excl -rupN reference/arch/ppc64/mm/fault.c current/arch/ppc64/mm/fault.c
--- reference/arch/ppc64/mm/fault.c	2004-05-10 14:55:08.000000000 +0100
+++ current/arch/ppc64/mm/fault.c	2004-05-17 17:50:41.000000000 +0100
@@ -75,6 +75,8 @@ static int store_updates_sp(struct pt_re
 	return 0;
 }
 
+int check_page_fault(struct pt_regs *regs, unsigned long address, int sig);
+
 /*
  * The error_code parameter is
  *  - DSISR for a non-SLB data access fault,
@@ -93,12 +95,18 @@ void do_page_fault(struct pt_regs *regs,
 	if (regs->trap == 0x300 || regs->trap == 0x380) {
 		if (debugger_fault_handler(regs))
 			return;
-	}
 
-	/* On a kernel SLB miss we can only check for a valid exception entry */
-	if (!user_mode(regs) && (regs->trap == 0x380)) {
-		bad_page_fault(regs, address, SIGSEGV);
-		return;
+		/* On a kernel SLB miss we can only check for a valid exception
+		 * entry; on a regular miss we had better actually be in one of
+		 * the routines which are permitted to access userland. */
+		if (!user_mode(regs)) {
+			if (regs->trap == 0x380) {
+				bad_page_fault(regs, address, SIGSEGV);
+				return;
+			}
+			check_page_fault(regs, address, SIGSEGV);
+		}
+
 	}
 
 	if (error_code & 0x00400000) {
@@ -259,3 +267,19 @@ void bad_page_fault(struct pt_regs *regs
 	/* kernel has accessed a bad area */
 	die("Kernel access of bad area", regs, sig);
 }
+
+int
+check_page_fault(struct pt_regs *regs, unsigned long address, int sig)
+{
+	const struct exception_table_entry *entry;
+
+	/* Are we prepared to handle this fault?  */
+	if ((entry = search_exception_tables(regs->nip)) != NULL) {
+		/* regs->nip = entry->fixup; */
+		return 1;
+	}
+
+	/* kernel has accessed a bad area */
+	die("Kernel access of bad area", regs, sig);
+	return 0;	/* not reached: die() should not return here */
+}

[-- Attachment #3: 015-kernel_fault_i386.txt --]
[-- Type: text/plain, Size: 1541 bytes --]

---
 extable.c |   12 ++++++++++++
 fault.c   |    7 +++++++
 2 files changed, 19 insertions(+)

diff -X /home/apw/lib/vdiff.excl -rupN reference/arch/i386/mm/extable.c current/arch/i386/mm/extable.c
--- reference/arch/i386/mm/extable.c	2004-03-11 20:47:12.000000000 +0000
+++ current/arch/i386/mm/extable.c	2004-05-17 15:22:54.000000000 +0100
@@ -34,3 +34,15 @@ int fixup_exception(struct pt_regs *regs
 
 	return 0;
 }
+
+int check_exception(struct pt_regs *regs)
+{
+	const struct exception_table_entry *fixup;
+
+	fixup = search_exception_tables(regs->eip);
+	if (fixup) {
+		return 1;
+	}
+
+	return 0;
+}
diff -X /home/apw/lib/vdiff.excl -rupN reference/arch/i386/mm/fault.c current/arch/i386/mm/fault.c
--- reference/arch/i386/mm/fault.c	2004-01-09 06:59:02.000000000 +0000
+++ current/arch/i386/mm/fault.c	2004-05-17 15:51:22.000000000 +0100
@@ -198,6 +198,7 @@ static inline int is_prefetch(struct pt_
 } 
 
 asmlinkage void do_invalid_op(struct pt_regs *, unsigned long);
+int check_exception(struct pt_regs *regs);
 
 /*
  * This routine handles page faults.  It determines the address,
@@ -262,6 +263,12 @@ asmlinkage void do_page_fault(struct pt_
 	if (in_atomic() || !mm)
 		goto bad_area_nosemaphore;
 
+	/* If we are in the kernel we should expect faults to user space to
+	 * occur only from instructions designated as user access.  These
+	 * all have exception handlers. */
+	if ((error_code & 4) == 0 && !check_exception(regs))
+		goto bad_area_nosemaphore;
+
 	down_read(&mm->mmap_sem);
 
 	vma = find_vma(mm, address);


* Re: do_page_fault() mm->mmap_sem deadlocks
  2004-05-17 15:58 do_page_fault() mm->mmap_sem deadlocks Andy Whitcroft
@ 2004-05-17 18:07 ` Andrew Morton
  0 siblings, 0 replies; 3+ messages in thread
From: Andrew Morton @ 2004-05-17 18:07 UTC (permalink / raw)
  To: Andy Whitcroft; +Cc: linuxppc64-dev, linux-kernel, agl, david, mbligh

Andy Whitcroft <apw@shadowen.org> wrote:
>
> Whilst trying to debug a bug in PPC64 unmap handling we noticed that it
> is relatively easy for a kernel bug to lead to a deadlock trying to
> down(mm->mmap_sem) when trying to categorise a fault for handling.  The
> fault handlers grab the mmap_sem lock to scan the VMA list to see whether
> the fault is against a valid segment of the task address space.  This
> deadlocks if the task is in a system call which also manipulates the
> address space, such as munmap().  This is very undesirable, as it
> prevents the kernel from detecting the fault and reporting it as an
> oops, severely hampering debugging.

Yes.  The "oops deadlocks when mmap_sem is held" problem is rather
irritating, and would be nice to fix.

> Most kernel code should not access any area which leads to a fault.
> Indeed, the only code which should trigger faults from kernel space is
> the set of routines dealing with copying data to and from user space.

And vmalloc faults on x86: we fault in those 4k PTEs on-demand.  Your
patch seems to cover that OK.

>  Faults originating
> outside of these specific routines are not valid and could be 'failed'
> without the need to reference current->mm, thereby preventing the
> deadlock and ensuring the failure is reported.

Think so.


> ...
> I have assumed that faults triggered from kernel space will be rare
> relative to those from user space, as most pointers will have been
> recently referenced by the calling application, but the cost of this
> search is a performance concern.

These kernel-mode faults are by no means uncommon - probably the most
frequent scenario is:

	p = malloc(...);
	read(fd, p, n);

Here, generic_file_read() will take a copy_to_user() fault for each page to
COW it.

Which makes one wonder why we aren't caching the result of the previous
exception table search.  I bet that would get a nice hit rate, and would
fix up any overhead which your change introduces.
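
Roughly, a one-entry wrapper in front of the table search would do.  This
is only a sketch, with an invented name, no SMP safety, and the cache
keyed on the exact faulting IP (which is what repeats in the read() loop
above):

/* Cache the last successful lookup; repeated faults at the same
 * user-access instruction then hit without searching the table. */
static unsigned long cached_ip;
static const struct exception_table_entry *cached_entry;

static const struct exception_table_entry *
cached_search_exception_tables(unsigned long ip)
{
	if (cached_entry && ip == cached_ip)
		return cached_entry;

	cached_entry = search_exception_tables(ip);
	if (cached_entry)
		cached_ip = ip;
	return cached_entry;
}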

> 
> Do people think this is something which is worth pursuing?

I do.

>  The patches
> attached are minimal and there are definitely optimisations to be made
> (such as sharing the result of the exception lookup).  I guess this might
> even be something worth having as a config option for debugging hangs, if
> there is deemed to be any degradation?  If people think this is worth
> looking at I'll come back with cleaner, better tested and benchmarked
> patches.

Please.  Also please consider (and instrument) a last-hit cache in
search_exception_tables().  Maybe it should be per-task.


* Re: do_page_fault() mm->mmap_sem deadlocks
@ 2004-05-19 15:29 Andy Whitcroft
  0 siblings, 0 replies; 3+ messages in thread
From: Andy Whitcroft @ 2004-05-19 15:29 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linuxppc64-dev, linux-kernel, agl, david, mbligh

[-- Attachment #1: Type: text/plain, Size: 2358 bytes --]

--On Monday, May 17, 2004 11:07:45 -0700 Andrew Morton <akpm@osdl.org> wrote:

> Yes.  The "oops deadlocks when mmap_sem is held" problem is rather
> irritating, and would be nice to fix.

Ok.

> These kernel-mode faults are by no means uncommon - probably the most
> frequent scenario is:
>
> 	p = malloc(...);
> 	read(fd, p, n);
>
> Here, generic_file_read() will take a copy_to_user() fault for each page
> to COW it.
>
> Which makes one wonder why we aren't caching the result of the previous
> exception table search.  I bet that would get a nice hit rate, and would
> fix up any overhead which your change introduces.
[...]
> Please.  Also please consider (and instrument) a last-hit cache in
> search_exception_tables().  Maybe it should be per-task.

Ok.  As you suggested I tried out a cache for this, in per-system and
per-CPU versions.  I found that if you use it on all kernel faults you get
about a 75% hit ratio.  However, with the instrumentation in, it is clear
that without my additional usage the exception table is rarely searched at
all: once on a large i386 machine and about 11 times on my smaller i386
machine, even through a kernbench run.  Indeed, it's pretty obvious when
you look at it: good faults, such as the read scenario above, match against
the address space and are handled.  We don't need to look them up in the
exception table unless we cannot fix the fault.
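
The instrumentation itself need be nothing more than a pair of counters on
the two paths of such a cache; the sketch below uses invented names and is
not the code actually measured:

/* One-entry exception-table cache with hit/miss counters, so the hit
 * ratio quoted above can be read back out.  Illustrative only. */
static unsigned long excache_ip;
static const struct exception_table_entry *excache_entry;
static atomic_t excache_hits = ATOMIC_INIT(0);
static atomic_t excache_misses = ATOMIC_INIT(0);

static const struct exception_table_entry *
counted_search_exception_tables(unsigned long ip)
{
	if (excache_entry && ip == excache_ip) {
		atomic_inc(&excache_hits);
		return excache_entry;
	}
	atomic_inc(&excache_misses);

	excache_entry = search_exception_tables(ip);
	if (excache_entry)
		excache_ip = ip;
	return excache_entry;
}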

So, attempt #2.  With the current code, as long as we are able to take
mmap_sem we will detect the bad pointer and oops as required.  Only if we
find mmap_sem already locked might we be heading for a deadlock, and only
then is the extra 'source' check of any value.  So I now down_read_trylock()
mmap_sem, and only if that fails do I apply the 'source' exception checks.

From the i386 version:

+       if (!down_read_trylock(&mm->mmap_sem)) {
+               if ((error_code & 4) == 0 && !check_exception(regs))
+                       goto bad_area_nosemaphore;
+               down_read(&mm->mmap_sem);
+       }

This catches the deadlock case and nearly all of the failures that full
source validation would catch.  Invalid accesses to valid user space
addresses will succeed, as they do currently.  It should add _no_
additional overhead for the vast majority of faults.

i386 and ppc64 versions attached.  Testing with kernbench shows no 
performance impact.

-apw

[-- Attachment #2: 010-fault_deadlock_ppc64.txt --]
[-- Type: text/plain, Size: 2757 bytes --]

If a fault in the kernel leads to an unexpected protection fault whilst in
a code path which holds mmap_sem we will deadlock in do_page_fault() while
trying to classify the fault.  By carefully testing the source of the fault
we can detect and OOPS on the vast majority of these, greatly enhancing
diagnosis of such bugs.

---
 fault.c |   39 ++++++++++++++++++++++++++++++++++++++-
 1 files changed, 38 insertions(+), 1 deletion(-)

diff -X /home/apw/lib/vdiff.excl -rupN reference/arch/ppc64/mm/fault.c current/arch/ppc64/mm/fault.c
--- reference/arch/ppc64/mm/fault.c	2004-05-10 14:55:08.000000000 +0100
+++ current/arch/ppc64/mm/fault.c	2004-05-19 14:37:15.000000000 +0100
@@ -75,6 +75,8 @@ static int store_updates_sp(struct pt_re
 	return 0;
 }
 
+int check_exception(struct pt_regs *regs);
+
 /*
  * The error_code parameter is
  *  - DSISR for a non-SLB data access fault,
@@ -110,7 +112,29 @@ void do_page_fault(struct pt_regs *regs,
 		bad_page_fault(regs, address, SIGSEGV);
 		return;
 	}
-	down_read(&mm->mmap_sem);
+
+	/* When running in the kernel we expect faults to occur only to
+	 * addresses in user space.  All other faults represent errors in the
+	 * kernel and should generate an OOPS.  Unfortunately, in the case of
+	 * an erroneous fault occurring in a code path which already holds
+	 * mmap_sem we will deadlock attempting to validate the fault against
+	 * the address space.  Luckily the kernel only validly references user
+	 * space from well-defined areas of code, which are listed in the
+	 * exception table.
+	 *
+	 * As the vast majority of faults will be valid we will only perform
+	 * the source reference check when there is a possibility of a
+	 * deadlock.  Attempt to lock the address space; if we cannot, we then
+	 * validate the source.  If this is invalid we can skip the address
+	 * space check, thus avoiding the deadlock.
+	 */
+	if (!down_read_trylock(&mm->mmap_sem)) {
+		if (!user_mode(regs) && !check_exception(regs))
+			goto bad_area_nosemaphore;
+
+		down_read(&mm->mmap_sem);
+	}
+
 	vma = find_vma(mm, address);
 	if (!vma)
 		goto bad_area;
@@ -200,6 +224,7 @@ good_area:
 bad_area:
 	up_read(&mm->mmap_sem);
 
+bad_area_nosemaphore:
 	/* User mode accesses cause a SIGSEGV */
 	if (user_mode(regs)) {
 		info.si_signo = SIGSEGV;
@@ -259,3 +284,15 @@ void bad_page_fault(struct pt_regs *regs
 	/* kernel has accessed a bad area */
 	die("Kernel access of bad area", regs, sig);
 }
+
+int check_exception(struct pt_regs *regs)
+{
+	const struct exception_table_entry *entry;
+
+	/* Are we prepared to handle this fault?  */
+	if ((entry = search_exception_tables(regs->nip)) != NULL) {
+		return 1;
+	}
+
+	return 0;
+}

[-- Attachment #3: 015-fault_deadlock_i386.txt --]
[-- Type: text/plain, Size: 2597 bytes --]

If a fault in the kernel leads to an unexpected protection fault whilst in
a code path which holds mmap_sem we will deadlock in do_page_fault() while
trying to classify the fault.  By carefully testing the source of the fault
we can detect and OOPS on the vast majority of these, greatly enhancing
diagnosis of such bugs.

---
 extable.c |   12 ++++++++++++
 fault.c   |   22 +++++++++++++++++++++-
 2 files changed, 33 insertions(+), 1 deletion(-)

diff -upN reference/arch/i386/mm/extable.c current/arch/i386/mm/extable.c
--- reference/arch/i386/mm/extable.c	2004-03-11 20:47:12.000000000 +0000
+++ current/arch/i386/mm/extable.c	2004-05-19 12:58:14.000000000 +0100
@@ -34,3 +34,15 @@ int fixup_exception(struct pt_regs *regs
 
 	return 0;
 }
+
+int check_exception(struct pt_regs *regs)
+{
+	const struct exception_table_entry *fixup;
+
+	fixup = search_exception_tables(regs->eip);
+	if (fixup) {
+		return 1;
+	}
+
+	return 0;
+}
diff -upN reference/arch/i386/mm/fault.c current/arch/i386/mm/fault.c
--- reference/arch/i386/mm/fault.c	2004-01-09 06:59:02.000000000 +0000
+++ current/arch/i386/mm/fault.c	2004-05-19 12:58:14.000000000 +0100
@@ -198,6 +198,7 @@ static inline int is_prefetch(struct pt_
 } 
 
 asmlinkage void do_invalid_op(struct pt_regs *, unsigned long);
+int check_exception(struct pt_regs *regs);
 
 /*
  * This routine handles page faults.  It determines the address,
@@ -262,7 +263,26 @@ asmlinkage void do_page_fault(struct pt_
 	if (in_atomic() || !mm)
 		goto bad_area_nosemaphore;
 
-	down_read(&mm->mmap_sem);
+	/* When running in the kernel we expect faults to occur only to
+	 * addresses in user space.  All other faults represent errors in the
+	 * kernel and should generate an OOPS.  Unfortunately, in the case of
+	 * an erroneous fault occurring in a code path which already holds
+	 * mmap_sem we will deadlock attempting to validate the fault against
+	 * the address space.  Luckily the kernel only validly references user
+	 * space from well-defined areas of code, which are listed in the
+	 * exception table.
+	 *
+	 * As the vast majority of faults will be valid we will only perform
+	 * the source reference check when there is a possibility of a
+	 * deadlock.  Attempt to lock the address space; if we cannot, we then
+	 * validate the source.  If this is invalid we can skip the address
+	 * space check, thus avoiding the deadlock.
+	 */
+	if (!down_read_trylock(&mm->mmap_sem)) {
+		if ((error_code & 4) == 0 && !check_exception(regs))
+			goto bad_area_nosemaphore;
+		down_read(&mm->mmap_sem);
+	}
 
 	vma = find_vma(mm, address);
 	if (!vma)
